Stars
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
A dual-clock asynchronous FIFO written in Verilog, tested with Icarus Verilog
Fast inference from large language models via speculative decoding
Low Precision Arithmetic Simulation in PyTorch
A PyTorch implementation of the Transformer model in "Attention is All You Need".
📰 Must-read papers and blogs on Speculative Decoding ⚡️
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
⚡️HivisionIDPhotos: a lightweight and efficient AI ID photo tool.
High-speed downloads from mirror sites using HuggingFace's official download tool.
This repository contains demos I made with the Transformers library by HuggingFace.
A machine learning compiler for GPUs, CPUs, and ML accelerators
A high-throughput and memory-efficient inference and serving engine for LLMs
PyTorch Tutorial for Deep Learning Researchers
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA'24)
The official GitHub page for the survey paper "A Survey of Large Language Models".
20+ high-performance LLMs with recipes to pretrain, fine-tune, and deploy at scale.
Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, INT8 and GPTQ 4-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. Apache 2.0-licensed.
Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
A collection of pre-trained, state-of-the-art models in the ONNX format
Intermediate Language (IL) for Hardware Accelerator Generators