- Together AI
- LA & SF
- ericauld.github.io
- @aulderic
- in/eric-auld
Stars
FlexAttention w/ FlashAttention3 Support
Infinity is a high-throughput, low-latency serving engine for text embeddings, reranking models, CLIP, CLAP, and ColPali
Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently…
ericauld / flash-attention
Forked from Dao-AILab/flash-attention. Fast and memory-efficient exact attention
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
Building blocks for foundation models.
Fast and memory-efficient exact attention
Links to GPU programming news and materials