Stars
Minimalistic 4D-parallelism distributed training framework for educational purposes
fanshiqing / grouped_gemm
Forked from tgale96/grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
When it comes to optimizers, it's always better to be safe than sorry
Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, and other large language models.
Survey of Small Language Models from Penn State, ...
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
The Official Implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
On-device AI across mobile, embedded and edge for PyTorch
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Qwen2.5-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Python Intelligence Config Manager. A superset of hydra+pydantic+lsp
FlagGems is an operator library for large language models implemented in Triton Language.
Odysseus: Playground of LLM Sequence Parallelism
A framework for serving and evaluating LLM routers - save LLM costs without compromising quality
A family of compressed models obtained via pruning and knowledge distillation
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
The official evaluation suite and dynamic data release for MixEval.
Source code of the paper "On the Hallucination in Simultaneous Machine Translation"
[NeurIPS 2024] Source code for xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token