103 results for source starred repositories

AMD's Machine Intelligence Library

Assembly 1,109 236 Updated Jan 24, 2025

Fast and memory-efficient exact attention

Python 15,164 1,433 Updated Jan 18, 2025

Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.

C++ 34 3 Updated Sep 7, 2024

ROCm BLAS marshalling library

C++ 126 80 Updated Jan 23, 2025

How to optimize some algorithms in CUDA.

Cuda 1,842 153 Updated Jan 21, 2025

flash attention tutorial written in python, triton, cuda, cutlass

Cuda 252 22 Updated Jan 3, 2025

FlashAttention2 implementation with TensorCore WMMA API

Cuda 3 Updated Apr 8, 2024

Implement FlashAttention v2 with minimal code to learn.

Cuda 9 1 Updated Jun 12, 2024

A flash attention2 extension for stable diffusion webui in Linux pytorch-rocm environments.

Cuda 7 Updated Jul 9, 2024

A simplified flash-attention implementation using cutlass, intended for teaching purposes

Cuda 34 4 Updated Aug 12, 2024

Flash Attention in raw Cuda C beating PyTorch

Cuda 16 1 Updated May 14, 2024

A framework that supports executing unmodified CUDA source code on non-NVIDIA devices.

C++ 112 14 Updated Jan 3, 2025

Codes & examples for "CUDA - From Correctness to Performance"

C++ 77 19 Updated Oct 24, 2024

LLVM/MLIR based compiler instrumentation of AMD GPU kernels

C++ 16 4 Updated Jan 13, 2025

CUDA on non-NVIDIA GPUs

Rust 10,433 676 Updated Jan 3, 2025

AMD ROCm™ Software - GitHub Home

Shell 4,859 398 Updated Jan 23, 2025

CuPBoP-AMD is a CUDA translator that translates CUDA programs at NVVM IR level to HIP-compatible IR that can run on AMD GPUs. Currently, CuPBoP-AMD translates a broader range of applications in the…

LLVM 3 Updated Nov 10, 2023

HIP: C++ Heterogeneous-Compute Interface for Portability

C++ 3,845 546 Updated Jan 24, 2025

Implementation of a simple CNN using CUDA

Cuda 66 20 Updated May 2, 2017

CuPBoP-AMD is a CUDA translator that translates CUDA programs at NVVM IR level to HIP-compatible IR that can run on AMD GPUs.

LLVM 36 4 Updated Nov 19, 2023

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilizatio…

Python 2,108 352 Updated Jan 24, 2025

A collection of pre-trained, state-of-the-art models in the ONNX format

Jupyter Notebook 8,178 1,421 Updated Apr 30, 2024

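AutoKernel is a simple, easy-to-use, low-barrier automatic operator optimization tool that improves the deployment efficiency of deep learning algorithms.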

C++ 736 94 Updated Sep 23, 2022
Python 11 2 Updated Dec 31, 2019

Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.

C++ 631 133 Updated Oct 18, 2023

Protocol Buffers - Google's data interchange format

C++ 66,395 15,593 Updated Jan 24, 2025