Fast and memory-efficient exact attention

Python 14,974 1,411 Updated Jan 8, 2025

Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.

C++ 32 3 Updated Sep 7, 2024

ROCm BLAS marshalling library

C++ 124 79 Updated Jan 7, 2025

How to optimize some algorithms in CUDA.

Cuda 1,809 150 Updated Jan 8, 2025

flash attention tutorial written in python, triton, cuda, cutlass

Cuda 241 21 Updated Jan 3, 2025

FlashAttention2 implementation with TensorCore WMMA API

Cuda 3 Updated Apr 8, 2024

Implement FlashAttention v2 with minimal code to learn.

Cuda 9 1 Updated Jun 12, 2024

A flash attention2 extension for stable diffusion webui in Linux pytorch-rocm environments.

Cuda 7 Updated Jul 9, 2024

A streamlined flash-attention implementation using cutlass, intended for teaching purposes

Cuda 34 4 Updated Aug 12, 2024

An unofficial cuda assembler, for all generations of SASS, hopefully :)

Python 78 10 Updated Mar 20, 2023

Flash Attention in raw Cuda C beating PyTorch

Cuda 16 1 Updated May 14, 2024

A framework that supports executing unmodified CUDA source code on non-NVIDIA devices.

C++ 111 14 Updated Jan 3, 2025

Codes & examples for "CUDA - From Correctness to Performance"

C++ 77 19 Updated Oct 24, 2024

LLVM/MLIR based compiler instrumentation of AMD GPU kernels

C++ 15 4 Updated Dec 10, 2024

CUDA on non-NVIDIA GPUs

Rust 10,307 673 Updated Jan 3, 2025

AMD ROCm™ Software - GitHub Home

Shell 4,795 395 Updated Jan 8, 2025

CuPBoP-AMD is a CUDA translator that translates CUDA programs at NVVM IR level to HIP-compatible IR that can run on AMD GPUs. Currently, CuPBoP-AMD translates a broader range of applications in the…

LLVM 3 Updated Nov 10, 2023

HIP: C++ Heterogeneous-Compute Interface for Portability

C++ 3,821 544 Updated Jan 8, 2025

Implementation of a simple CNN using CUDA

Cuda 66 20 Updated May 2, 2017

CuPBoP-AMD is a CUDA translator that translates CUDA programs at NVVM IR level to HIP-compatible IR that can run on AMD GPUs.

LLVM 35 4 Updated Nov 19, 2023

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilizatio…

Python 2,064 341 Updated Jan 8, 2025

A collection of pre-trained, state-of-the-art models in the ONNX format

Jupyter Notebook 8,116 1,420 Updated Apr 30, 2024

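AutoKernel is a simple, easy-to-use, low-barrier automatic operator optimization tool that improves the deployment efficiency of deep learning algorithms.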

C++ 781 94 Updated Sep 23, 2022
Python 11 2 Updated Dec 31, 2019

Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.

C++ 622 133 Updated Oct 18, 2023

Protocol Buffers - Google's data interchange format

C++ 66,196 15,561 Updated Jan 8, 2025