- University of Science and Technology of China
- Shanghai & Hefei
- lihan-byte.github.io
Stars
The official GitHub page for the survey paper "A Survey on Mixture of Experts in Large Language Models".
Lab for Parallel computing (USTC COMP6201P)
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
My learning notes and code for ML systems (ML SYS).
Homework solutions for CSAPP (a.k.a. Computer Systems: A Programmer's Perspective), Third Edition.
Simple example of how to write an Implicit GEMM Convolution in CUDA using the tensor core WMMA API and bindings for PyTorch.
My personal vim/neovim configuration files, dotfiles, docs and other scripts.
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
A tutorial on building an LSM-Tree storage engine (database) in a week.
A large collection of CUDA/TensorRT examples for learning CUDA/TensorRT.
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
Notes I made while reading papers, covering databases, distributed systems, and HPC.
Learning materials for Stanford CS149: Parallel Computing
A library of GPU kernels for sparse matrix operations.
A GPU-driven system framework for scalable AI applications
An easy-to-understand TensorOp Matmul tutorial