Linux Device Drivers 3 examples updated to work in recent kernels

C 2,369 910 Updated Sep 28, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda 1,305 117 Updated Oct 17, 2024

materials available to the public

HTML 17 2 Updated Jun 10, 2024

📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.

2,664 182 Updated Oct 15, 2024

Script that organizes the Google Takeout archive into one big chronological folder

Dart 3,916 194 Updated Aug 12, 2024

A large-scale simulation framework for LLM inference

Python 253 33 Updated Oct 10, 2024

backup tools for OG pixel & pixel XL

Shell 44 6 Updated Sep 18, 2024

Documentation of NVIDIA chip/hardware interfaces

C 1,247 91 Updated Sep 10, 2024

Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)

Cuda 581 222 Updated Aug 19, 2024

Awesome resources for GPUs

480 48 Updated Jul 1, 2023

MLIR For Beginners tutorial

C++ 781 63 Updated Sep 30, 2024

Learning about the Ampere microarchitecture through microbenchmarks

C++ 4 Updated Jan 11, 2024

[CVPR 2023 Best Paper Award] Planning-oriented Autonomous Driving

Python 3,429 382 Updated Aug 28, 2024

Tengine is a lite, high performance, modular inference engine for embedded devices

C++ 4,630 997 Updated Sep 15, 2024

Typescript based SystemVerilog meta-language and HDL design framework

Verilog 3 5 Updated Sep 6, 2024

Efficient operation implementation based on the Cambricon Machine Learning Unit (MLU).

C++ 103 103 Updated Oct 17, 2024

Compare different hardware platforms via the Roofline Model for LLM inference tasks.

Jupyter Notebook 73 3 Updated Mar 13, 2024

CUDA on non-NVIDIA GPUs

Rust 9,467 623 Updated Oct 16, 2024
C++ 93 47 Updated Oct 17, 2024

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime

Python 2,195 253 Updated Oct 17, 2024

Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.

Python 1,167 71 Updated Jul 16, 2024

🤘 TT-NN operator library, and TT-Metalium low level kernel programming model.

C++ 434 62 Updated Oct 17, 2024

Assembler and Decompiler for NVIDIA (Maxwell Pascal Volta Turing Ampere) GPUs.

Python 66 8 Updated Feb 23, 2023

Unified compiler/runtime for interfacing with PyTorch Dynamo.

Python 93 48 Updated Oct 17, 2024

Shared Middle-Layer for Triton Compilation

MLIR 173 37 Updated Oct 15, 2024

Assembler for NVIDIA Maxwell architecture

Sass 945 161 Updated Jan 3, 2023

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Python 83,021 22,385 Updated Oct 17, 2024

Stable Diffusion and Flux in pure C/C++

C++ 3,372 285 Updated Sep 2, 2024