Insights: sgl-project/sglang
Overview
117 Pull requests merged by 45 people
-
Split test_mla.py into two files (deepseek v2 and deepseek v3)
#4216 merged
Mar 8, 2025 -
refine quant kernel code style
#4211 merged
Mar 8, 2025 -
Test no vllm custom allreduce
#4210 merged
Mar 8, 2025 -
Refactor Dockerfile: unify CUDA logic and reduce image size by ~2.6 GB
#3749 merged
Mar 8, 2025 -
Use clang format 18 in pr-test-sgl-kernel.yml
#4203 merged
Mar 8, 2025 -
Fix bench_serving flush cache not recognizing OPENAI_API_KEY
#4181 merged
Mar 8, 2025 -
lazy import attn backends
#4200 merged
Mar 8, 2025 -
Revert "Minor improvement to per_tensor_quant_fp8 (#4197)"
#4198 merged
Mar 8, 2025 -
Minor improvement to per_tensor_quant_fp8
#4197 merged
Mar 8, 2025 -
Remove the vllm dependency from the moe_align function
#4164 merged
Mar 8, 2025 -
[EAGLE] many fixes for eagle
#4195 merged
Mar 8, 2025 -
New clang format for sgl kernel
#4194 merged
Mar 8, 2025 -
Update amd ci docker image to v0.4.3.post4-rocm630.
#4189 merged
Mar 7, 2025 -
Fix eagle hang issue for max_new_tokens=1
#4185 merged
Mar 7, 2025 -
use same version for ci and pyproject
#4187 merged
Mar 7, 2025 -
Revert "ROCm: Flex Attention Enablement with custom backends (#4178)"
#4186 merged
Mar 7, 2025 -
ROCm: Flex Attention Enablement with custom backends
#4178 merged
Mar 7, 2025 -
[Docs] Improve bullets appearance and grammar
#4174 merged
Mar 7, 2025 -
fix int8 doc link
#4179 merged
Mar 7, 2025 -
Add an example of using deepseekv3 int8 sglang.
#4177 merged
Mar 7, 2025 -
chore: bump v0.0.3.post7 for sgl-kernel
#4176 merged
Mar 7, 2025 -
Memory pool fix for upstream change about eagle
#4170 merged
Mar 7, 2025 -
Put utils in ifndef USE_ROCM to fix CI (#4167)
#4168 merged
Mar 7, 2025 -
Put utils in ifndef USE_ROCM to fix CI
#4167 merged
Mar 7, 2025 -
Remove non-existent AMD header include
#4166 merged
Mar 7, 2025 -
[Docs] Fix links and grammar issues
#4162 merged
Mar 7, 2025 -
[Refactor] Reducing code duplication across FP8 CUDA quantization kernels
#4163 merged
Mar 7, 2025 -
[Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise)
#3888 merged
Mar 7, 2025 -
Add sgl_per_token_quant_fp8
#4089 merged
Mar 7, 2025 -
[quant kernel] sgl-kernel support per_tensor_quant fp8
#3786 merged
Mar 7, 2025 -
Add Support for Qwen2-VL Multi-modal Embedding Models
#3694 merged
Mar 7, 2025 -
ROCm: enable trillion-parameter MoE models with INT4-FP8 single node
#4152 merged
Mar 6, 2025 -
Hot fix small vocal eagle in docs
#4154 merged
Mar 6, 2025 -
Docs: add torch compile cache
#4151 merged
Mar 6, 2025 -
[docs] fix HF reference script command
#4148 merged
Mar 6, 2025 -
Release v0.4.3.post4
#4140 merged
Mar 6, 2025 -
Fix constrained generation errors by adding datasets dependency
#4142 merged
Mar 6, 2025 -
Fix nightly ci Gsm8k & Fix flashinfer backend kvcache quant
#4147 merged
Mar 6, 2025 -
Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle
#4134 merged
Mar 6, 2025 -
fix bench serving bug
#4135 merged
Mar 6, 2025 -
Update CODEOWNER
#4138 merged
Mar 6, 2025 -
AMD/ROCm: update base image string
#4137 merged
Mar 6, 2025 -
[Minor] make the `__init__` function of model_runner.py shorter
#4132 merged
Mar 6, 2025 -
Split the __init__ of scheduler as smaller functions. Improve the eagle tests
#4128 merged
Mar 6, 2025 -
remove unused max_jobs in setup_rocm.py
#4126 merged
Mar 6, 2025 -
Add tag suffix to nightly docker builds.
#4129 merged
Mar 6, 2025 -
Add codeowners for eagle implementations
#4131 merged
Mar 6, 2025 -
EAGLE docs
#4038 merged
Mar 6, 2025 -
feat: support docs auto live-reload with sphinx-autobuild
#4111 merged
Mar 6, 2025 -
Add a pointer to the real KV cache pool
#4113 merged
Mar 6, 2025 -
Remove prefill-only-one-req
#4117 merged
Mar 6, 2025 -
[Hoxfix] Fix incomplete token_to_kv_pool refactor
#4121 merged
Mar 6, 2025 -
chore: bump v0.4.3.post3
#4114 merged
Mar 6, 2025 -
fix Non-consecutive header level increase in docs/router/router.md
#4099 merged
Mar 6, 2025 -
fix cross-reference error and spelling mistakes
#4101 merged
Mar 6, 2025 -
Online serving benchmarks of real datasets for hierarchical KV caching
#3211 merged
Mar 6, 2025 -
Debug radixcache: refactor recursive helper methods
#3029 merged
Mar 6, 2025 -
remove testing on PR workflow change
#4110 merged
Mar 6, 2025 -
Create release-docker-amd-nightly.yml
#4105 merged
Mar 5, 2025 -
revert deepseek docs
#4109 merged
Mar 5, 2025 -
Add examples for server token-in-token-out
#4103 merged
Mar 5, 2025 -
reorganize dpsk docs
#4108 merged
Mar 5, 2025 -
Add DeepSeek optimization ablations documentation
#4107 merged
Mar 5, 2025 -
Add update_weights_from_disk endpoint to Engine
#4102 merged
Mar 5, 2025 -
Fix triton kernel illegal memory issue for eagle
#4100 merged
Mar 5, 2025 -
[Revision] Add fast decode plan for flashinfer mla
#4012 merged
Mar 5, 2025 -
Fix the moe padding conditional logic
#4081 merged
Mar 5, 2025 -
[Eagle] Refactor eagle speculative decoding
#3986 merged
Mar 5, 2025 -
ROCM: AITER BLOCK GEMM
#4075 merged
Mar 5, 2025 -
bench: add dataset param for bench_multiturn
#3990 merged
Mar 5, 2025 -
[QUANT] Add GPTQModel Dynamic Quantization + `lm_head` Quantization
#3790 merged
Mar 5, 2025 -
test: add vlm to token in & out example
#3941 merged
Mar 5, 2025 -
[Minor] more code cleanup
#4077 merged
Mar 5, 2025 -
Add examples for returning hidden states when using the server
#4074 merged
Mar 5, 2025 -
Simplify eagle tests and TP sync in grammar backend
#4066 merged
Mar 4, 2025 -
Update nextn ci test
#4071 merged
Mar 4, 2025 -
Revert "Fix nightly-test CI"
#4065 merged
Mar 4, 2025 -
[Feature] Add test for speculative_token_map
#4016 merged
Mar 4, 2025 -
remove unused max_jobs
#3607 merged
Mar 4, 2025 -
fix: support gelu_new activation function in gpt2
#3712 merged
Mar 4, 2025 -
sgl-router - issues on routing and project build. (#3870)
#3948 merged
Mar 4, 2025 -
[XCCL] Use xccl for xpu backend since xccl is ready in latest PyTorch.
#3954 merged
Mar 4, 2025 -
[Fix & Style] Refactor the grammar backend to reduce human errors and improve readability
#4030 merged
Mar 4, 2025 -
Remove grafana dashboard's datasource uid
#4051 merged
Mar 4, 2025 -
Fix `debug_tensor_dump_output_folder` optional key missing
#4046 merged
Mar 4, 2025 -
ROCm: update aiter and its usage to fused moe (bloat16, fp8, fp8 block-quant)
#4053 merged
Mar 4, 2025 -
Fix breakage problem when using custom_ar
#4052 merged
Mar 4, 2025 -
HotFix for #3988 using blockwise_int8
#4023 merged
Mar 4, 2025 -
Reasoning parser
#4000 merged
Mar 4, 2025 -
Fix assert options.num_stages != 0 error in the latest ROCm build image
#4049 merged
Mar 4, 2025 -
docs: update README
#4044 merged
Mar 4, 2025 -
Add a link to the roadmap in README.md
#4043 merged
Mar 4, 2025 -
Share target model embed and head weights for nextn
#4033 merged
Mar 3, 2025 -
Add examples in sampling parameters
#4039 merged
Mar 3, 2025 -
Remove outdated test utils and fix links for the doc of sampling params
#3999 merged
Mar 3, 2025 -
Docs: Fix sampling parameter
#4034 merged
Mar 3, 2025 -
Misc clean up; Remove the support of jump forward
#4032 merged
Mar 3, 2025 -
Reorganize python source files in sgl-kernel with multiple files
#4027 merged
Mar 3, 2025 -
Reorganize c++ source files in sgl-kernel with multiple folders
#4025 merged
Mar 3, 2025 -
Update metrics documentation
#3264 merged
Mar 3, 2025 -
remove cache configs in model definitions
#4031 merged
Mar 3, 2025 -
Clean up custom allreduce
#4029 merged
Mar 3, 2025 -
Improve code styles
#4021 merged
Mar 3, 2025 -
Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts
#3988 merged
Mar 3, 2025 -
Optimize Triton Kernel of Group GEMM in DeepGEMM Benchmark
#4014 merged
Mar 3, 2025 -
Fix nightly-test CI
#3826 merged
Mar 3, 2025 -
Add examples to token-in-token-out for LLM
#4010 merged
Mar 3, 2025 -
Revert "Add fast decode plan for flashinfer mla"
#4008 merged
Mar 3, 2025 -
Add fast decode plan for flashinfer mla
#3987 merged
Mar 3, 2025 -
[feat] add small vocab table for eagle's draft model[1].
#3822 merged
Mar 3, 2025 -
Add Benchmark for DeepGEMM Group GEMM
#3993 merged
Mar 3, 2025 -
Enable custom AR for AMD GPUs and maintain it in sgl-kernel
#3406 merged
Mar 2, 2025 -
Add accuracy test for TP torch compile
#3994 merged
Mar 2, 2025 -
Fix all gather torch compile
#3992 merged
Mar 2, 2025 -
fix typo
#3991 merged
Mar 2, 2025 -
add deepgemm and sglang fp8 block-wise gemm benchmark
#3893 merged
Mar 2, 2025 -
Update CODEOWNERS
#3989 merged
Mar 2, 2025
35 Pull requests opened by 29 people
-
Reasoning parser
#4001 opened
Mar 2, 2025 -
Hierarchical Caching supports MLA
#4009 opened
Mar 3, 2025 -
ROCM support tree_speculative_sampling_target_only
#4015 opened
Mar 3, 2025 -
Avoid duplicated request ids in batch APIs
#4026 opened
Mar 3, 2025 -
Added fix for TP>1 grammar sync for slow networks
#4040 opened
Mar 3, 2025 -
chore: bump v0.4.4
#4041 opened
Mar 3, 2025 -
Let `bench_one_batch` support `enable_dp_attention`
#4058 opened
Mar 4, 2025 -
codegen for blackwell
#4063 opened
Mar 4, 2025 -
Fix RotaryEmbedding when using Triton backend for EXAONE-3.5-2.4B
#4064 opened
Mar 4, 2025 -
Tool call with text
#4067 opened
Mar 4, 2025 -
[WIP] Support overlapping two batches
#4068 opened
Mar 4, 2025 -
add INT8 example into dsv3 README
#4079 opened
Mar 5, 2025 -
Hierarchical Caching Refactoring and Fixing TP issue
#4082 opened
Mar 5, 2025 -
Enable the native path of DeepSeek
#4086 opened
Mar 5, 2025 -
Add awq dequantize kernel to sgl with 1x to 3x speedup
#4104 opened
Mar 5, 2025 -
Support cuda graph for LoRA
#4115 opened
Mar 6, 2025 -
docs(reasoning content): :memo: deepseek-r1 parser support qwq
#4124 opened
Mar 6, 2025 -
Add test for Radix cache variants
#4125 opened
Mar 6, 2025 -
Add A800 tuning configs support DeepSeek V3/R1 BF16 and INT8(block-wise)
#4136 opened
Mar 6, 2025 -
[QUANT] Support DeepSeek-V3 gptq
#4139 opened
Mar 6, 2025 -
add --served-model-name arg for bench_serving
#4141 opened
Mar 6, 2025 -
fix the input_ids is None error
#4144 opened
Mar 6, 2025 -
[ROCm] Enable silu_and_mul, gelu_and_mul, gelu_tanh_and_mul in amd platform
#4150 opened
Mar 6, 2025 -
DeepGemm integrate to sgl-kernel
#4165 opened
Mar 7, 2025 -
[ROCm/Draft/No-Merge]: Flex Attention Enablement
#4172 opened
Mar 7, 2025 -
Fix MoE quant args
#4190 opened
Mar 8, 2025 -
[docs] Unhide production metrics page
#4193 opened
Mar 8, 2025 -
linear support deepgemm
#4199 opened
Mar 8, 2025 -
Statistical Analysis of the Output Stability of the Deepseek Model
#4202 opened
Mar 8, 2025 -
Added example for multimodal embedding
#4206 opened
Mar 8, 2025 -
Try sgl kernel 0.0.3.post7
#4207 opened
Mar 8, 2025 -
[Fix] Check the device backend before calling empty_cache function
#4212 opened
Mar 8, 2025 -
Rename files in sgl kernel to avoid nested folder structure
#4213 opened
Mar 8, 2025 -
Remove vllm ops scaled fp8 quant
#4215 opened
Mar 8, 2025 -
Check eagle server args
#4217 opened
Mar 8, 2025
47 Issues closed by 20 people
-
[Bug] launch dsv2 service failed
#2763 closed
Mar 9, 2025 -
Deploy QwQ-32B with 0.4.3.post4 cannot find curand_kernel.h
#4169 closed
Mar 8, 2025 -
[Bug] cannot find this project in hugging face :lmsys/sglang-ci-dsv3-test.
#4192 closed
Mar 8, 2025 -
[Feature] Implement Dynamic Depth Decoding (DDD) for EAGLE-2
#2754 closed
Mar 8, 2025 -
[Feature] several features for veRL integration
#2736 closed
Mar 8, 2025 -
[Bug] SGLang stops working after a few requests when serving DeepSeek V3
#2741 closed
Mar 8, 2025 -
Decode out of memory happened when run deepseek-r1 inference
#4184 closed
Mar 7, 2025 -
[Bug] The radix cache affects the accuracy of the output results.
#4175 closed
Mar 7, 2025 -
I wonder if the offline engine API supports OpenAI input format.
#2734 closed
Mar 7, 2025 -
permission issue in newly updated docker lmsysorg/sglang:v0.4.3.post2-rocm630
#4122 closed
Mar 6, 2025 -
[Bug] Inaccurate or Inconsistent Output in Qwen2.5-VL Multi-Image Testing with sglang
#4123 closed
Mar 6, 2025 -
[Bug] sgl.Engine(**dataclasses.asdict(server_args)) return_logprob=True error
#4085 closed
Mar 6, 2025 -
[Bug] pydantic validation errors for ChatCompletion
#3637 closed
Mar 6, 2025 -
[Feature] Reasoning model API support
#3043 closed
Mar 6, 2025 -
Request help: VRAM usage issues
#4118 closed
Mar 6, 2025 -
[Bug] sglang crashes with multi node
#3932 closed
Mar 6, 2025 -
How can I close the server?
#4080 closed
Mar 6, 2025 -
[Kernel] Launch two kernels for mixed chunked prefill
#2273 closed
Mar 6, 2025 -
[Feature] Add examples for server token-in-token-out
#4078 closed
Mar 5, 2025 -
[Bug] HiCacheController gets stuck when testing with multiple long text documents
#3998 closed
Mar 5, 2025 -
offline throughput benchmark numbers mismatch with custom benchmark by 3x
#4076 closed
Mar 5, 2025 -
[Bug] NCCL Crash with SIGSEGV Frequently when deploying deepseek v3
#2803 closed
Mar 5, 2025 -
[Feature] When will pipeline model parallelism be supported?
#4059 closed
Mar 5, 2025 -
[Bug] Config file not found when use NVIDIA_H20-3e
#4028 closed
Mar 5, 2025 -
[willing to PR] optimzation of eagle2
#2720 closed
Mar 5, 2025 -
[Bug] cannot build sgl-kernel without GPU
#4060 closed
Mar 4, 2025 -
[Bug] Directly importing Grafana JSON does not work
#4050 closed
Mar 4, 2025 -
[Bug] LoRA Model Not Appearing or Working with sglang Server
#4057 closed
Mar 4, 2025 -
logger "Receive: obj=GenerateReqInput()" part with text rather than input_ids.
#4045 closed
Mar 4, 2025 -
[Feature] Enable SGLang on more AMD GPUs
#2320 closed
Mar 4, 2025 -
[Feature] Possible optimization in actor rollout parameter sync
#2708 closed
Mar 4, 2025 -
[Feature] Make vLLM optional in model code
#1673 closed
Mar 3, 2025 -
Development Roadmap (2025 H1)
#4035 closed
Mar 3, 2025 -
Development Roadmap (2024 Q4)
#1487 closed
Mar 3, 2025 -
[Feature] Split Docs CI
#3158 closed
Mar 3, 2025 -
[Bug] Min new tokens error
#4011 closed
Mar 3, 2025 -
[Bug] After using H200*8 to deploy DeepSeekR1, the large stress test model crashes
#4019 closed
Mar 3, 2025 -
[Feature] Do we have any plan to integrate FLashMLA
#4006 closed
Mar 3, 2025 -
[Bug] AWQ scalar type error
#3780 closed
Mar 3, 2025 -
[Feature] Reward EOS close to max_tokens
#2694 closed
Mar 3, 2025 -
[Bug] Continuous batching (OpenAI Server) with greedy search return different results
#2687 closed
Mar 3, 2025 -
[Bug] loading phi4-mini-instruct with sglang
#3935 closed
Mar 2, 2025 -
[Bug] After using --enable-torch-compile, garbled output
#3944 closed
Mar 2, 2025 -
[Bug] H20 deepseek infer enable flashinfer mla hang
#3578 closed
Mar 2, 2025
52 Issues opened by 46 people
-
[Bug] The sgl-kernel 0.0.3.post7 can't pass the CIs.
#4214 opened
Mar 8, 2025 -
[Bug] I encountered a 'Capture CUDA graph failed' error.
#4205 opened
Mar 8, 2025 -
[Bug] text generation hangs after serving some requests
#4191 opened
Mar 8, 2025 -
[Bug] Server crashes with CUDA errors during EAGLE Speculative Decoding under high concurrency
#4188 opened
Mar 7, 2025 -
Our performance test of DeepSeek-R1-Block-INT8 is inconsistent with #3730
#4180 opened
Mar 7, 2025 -
[Feature] Support correctly exit using ctrl+c
#4173 opened
Mar 7, 2025 -
[Bug] Qwen2.5-VL-7B-Instruct Inference Server crashes
#4171 opened
Mar 7, 2025 -
[Bug] DeepSeek server crushed while using sglang.bench_serving
#4161 opened
Mar 7, 2025 -
[Bug] sglang-router failure when first load model, try again successed
#4160 opened
Mar 7, 2025 -
[Bug] Key conflict of `AutoImageProcessor.register`
#4159 opened
Mar 7, 2025 -
[Bug] Accuracy issue with SGLang using DeepSeek-R1-AWQ
#4158 opened
Mar 7, 2025 -
[Bug] --enable-metrics raise Error
#4157 opened
Mar 7, 2025 -
[Feature] Support multiple Languages for Docs
#4155 opened
Mar 7, 2025 -
[Bug] Eagle error with small vocal
#4153 opened
Mar 6, 2025 -
[Bug] Incorrect DP rank and DP worker messaging when using TP > 1 and DP > 1
#4149 opened
Mar 6, 2025 -
[Bug] Is YaRN supported in SGLang? How to enable it?
#4145 opened
Mar 6, 2025 -
[Bug] HiRadixCache.__init__() got an unexpected keyword argument 'token_to_kv_pool_allocator'
#4143 opened
Mar 6, 2025 -
[Feature] nvcc fatal : Unknown option '-generate-dependencies-with-compile'
#4120 opened
Mar 6, 2025 -
[Bug] NEXTN CUDA GRAPH master node use more memory (20GB) than worker node
#4119 opened
Mar 6, 2025 -
[Bug] launch_server.py: error: unrecognized arguments: --enable-reasoning
#4116 opened
Mar 6, 2025 -
[Bug] How to use SGLang's cpu code for inference
#4098 opened
Mar 5, 2025 -
[Bug] -inf in top_logprobs breaks FastAPI endpoint
#4097 opened
Mar 5, 2025 -
[Bug] Processes hang after weight update process group initialized
#4096 opened
Mar 5, 2025 -
[Bug] AssertionError: res=<Response [503]> Process was always killed automactically
#4094 opened
Mar 5, 2025 -
[Bug] NEXTN performance becomes low after some hours
#4093 opened
Mar 5, 2025 -
[Bug] 'DeepseekV3ForCausalLM' object has no attribute 'get_embed_and_head'
#4092 opened
Mar 5, 2025 -
[BUG] The #3836 problem still exists
#4091 opened
Mar 5, 2025 -
Questions about the calculation of `max_req_num`
#4090 opened
Mar 5, 2025 -
[Bug] RecursionError: maximum recursion depth exceeded while calling a Python object
#4088 opened
Mar 5, 2025 -
[Bug] Tool call with Llama3 models has inconsistent behavior with OpenAI
#4072 opened
Mar 4, 2025 -
[Bug] granite-vision-3.2-2b failing on sglang with "LlavaNextForConditionalGeneration not supported"
#4062 opened
Mar 4, 2025 -
[Feature] Add e4m3fnuz support to MoE-EP in FP8
#4056 opened
Mar 4, 2025 -
[Feature] Apply structured output sampling after reasoning steps in Reasoning models
#4055 opened
Mar 4, 2025 -
NVIDIA L40*8 docker NCCL Hanging During Initialization on Single Node with Multiple GPUs
#4054 opened
Mar 4, 2025 -
[Error]Input length (160062 tokens) exceeds the maximum allowed length (59862 tokens).
#4048 opened
Mar 4, 2025 -
[Bug] ImportError: cannot import name 'BaseImageProcessor' from 'transformers'
#4047 opened
Mar 4, 2025 -
Development Roadmap (2025 H1)
#4042 opened
Mar 4, 2025 -
[Feature] Refactor all parser features
#4036 opened
Mar 3, 2025 -
[Bug] How to add chat_template on Offline Batch Inference
#4024 opened
Mar 3, 2025 -
[Bug] running requests low
#4022 opened
Mar 3, 2025 -
[Bug] After using H200*8 to deploy DeepSeekR1, the large stress test model crashes
#4020 opened
Mar 3, 2025 -
[Bug] deepseek-r1 occasionally crash
#4017 opened
Mar 3, 2025 -
[Bug] Stuck at CUDA graph capture when serving with two A100*8 nodes
#4007 opened
Mar 3, 2025 -
[Bug] The ncu command blocks when collecting data.
#4005 opened
Mar 3, 2025 -
[Feature] i can not use function call of deepseek-v3、R1 with sglang==0.4.3.post2.
#4004 opened
Mar 3, 2025 -
Performance issue comparing on MI210x4
#3996 opened
Mar 2, 2025
150 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
model: Support Janus-pro
#3203 commented on
Mar 8, 2025 • 22 new comments -
Minicpmo
#3023 commented on
Mar 8, 2025 • 4 new comments -
Add device detection and count functions to utils.
#3962 commented on
Mar 8, 2025 • 4 new comments -
[ROCm] add amd ocp fp8 E3M4FUZ constant
#3959 commented on
Mar 8, 2025 • 4 new comments -
Apply sgl w8a8 fp8 kernel
#3148 commented on
Mar 8, 2025 • 3 new comments -
[ROCm] Enable per token group quant fp8 in amd
#3702 commented on
Mar 8, 2025 • 2 new comments -
IPv6 support
#3949 commented on
Mar 8, 2025 • 2 new comments -
[Bug fixed] fixed the crash when enable the dp-attention on the single card
#3958 commented on
Mar 8, 2025 • 2 new comments -
Feat/support code completion
#3612 commented on
Mar 8, 2025 • 1 new comment -
fix: second_per_grid_ts should be used to get mrop position
#3682 commented on
Mar 8, 2025 • 1 new comment -
fix: Fix deprecated max_tokens param in openai ChatCompletionRequest
#3122 commented on
Mar 8, 2025 • 1 new comment -
model: Intern vl 2.5
#3351 commented on
Mar 8, 2025 • 1 new comment -
[Bug Fix] Add partial rotary factor support for Phi-4 and upgrade to transformers v4.49.0
#3984 commented on
Mar 8, 2025 • 1 new comment -
[Frontend] Support optional auto truncation
#3721 commented on
Mar 8, 2025 • 0 new comments -
Support building Dockerfile and Dockerfile.rocm using local repo.
#3734 commented on
Mar 8, 2025 • 0 new comments -
add bench for OpenAI Chat Completions
#3738 commented on
Mar 8, 2025 • 0 new comments -
feat: log input text for OpenAI format API
#3699 commented on
Mar 8, 2025 • 0 new comments -
[enh] add '\n' for non-streaming case to keep connection alive
#3689 commented on
Mar 8, 2025 • 0 new comments -
split ReplicatedLinear used in MLA prefill computing along hidden_states[0] to save duplicated computing on all devices
#3688 commented on
Mar 8, 2025 • 0 new comments -
add switch to disable open api doc
#3744 commented on
Mar 8, 2025 • 0 new comments -
Make Marlin incompatible AWQ models work
#3759 commented on
Mar 8, 2025 • 0 new comments -
[BugFix] Illegal memory access for MoE On H20
#3779 commented on
Mar 8, 2025 • 0 new comments -
[moe] fix: correct the cache size in the last chunk
#3679 commented on
Mar 8, 2025 • 0 new comments -
[WIP][Feautre, Hardware] add initial suport for ascend npu
#3782 commented on
Mar 8, 2025 • 0 new comments -
[Fix]: DeepSeek support cpu_offload_gb option
#3675 commented on
Mar 8, 2025 • 0 new comments -
[ROCm] Enable MTP (NextN) on AMD GPU
#3670 commented on
Mar 8, 2025 • 0 new comments -
RPD integration into SGLANG with --profile-req support
#3653 commented on
Mar 8, 2025 • 0 new comments -
[docker] Distributed Serving with k8s Statefulset ( good example for DeepSeek-R1)
#3631 commented on
Mar 8, 2025 • 0 new comments -
Update decode kernel benchmark with new Triton backend interface
#3618 commented on
Mar 8, 2025 • 0 new comments -
solve the mrope position bugs for qwen2-vl
#3605 commented on
Mar 8, 2025 • 0 new comments -
[DRAFT] change block size&wave number to increase wave inflight
#3565 commented on
Mar 8, 2025 • 0 new comments -
Add README for benchmarking SGLang using GenAI-Perf.
#3552 commented on
Mar 8, 2025 • 0 new comments -
chore: use sglang as alias of sglang.launch_server
#3546 commented on
Mar 8, 2025 • 0 new comments -
Add integration with lws for multi-node inference
#3540 commented on
Mar 8, 2025 • 0 new comments -
Make decord package optional.
#3528 commented on
Mar 8, 2025 • 0 new comments -
[optimize][scheduler] implement shortest-queue method
#3526 commented on
Mar 8, 2025 • 0 new comments -
[ROCm] accelerate rocm image build
#3481 commented on
Mar 8, 2025 • 0 new comments -
fix adjust_max_prefix_ids bug
#3478 commented on
Mar 8, 2025 • 0 new comments -
Support MLA in Torch Native Attention Backend
#3475 commented on
Mar 8, 2025 • 0 new comments -
Check for Already Terminated Process before kill
#3474 commented on
Mar 8, 2025 • 0 new comments -
Support `n` in OpenAI API completions
#3446 commented on
Mar 8, 2025 • 0 new comments -
run eagle speculative decodeing error!
#3362 commented on
Mar 2, 2025 • 0 new comments -
HotFix: json serialization error when using OAI v1/batches endpoint with logprobs
#3896 commented on
Mar 8, 2025 • 0 new comments -
Support FP4 gemm (1/2)
#3899 commented on
Mar 8, 2025 • 0 new comments -
Add block_wise INT8 support MTP NextN function
#3911 commented on
Mar 8, 2025 • 0 new comments -
Docs; Fix link in offline enine
#3919 commented on
Mar 8, 2025 • 0 new comments -
[Feature] Support for qwen2vl radix cache
#3924 commented on
Mar 8, 2025 • 0 new comments -
[bugfix]fix finish reason in response
#3926 commented on
Mar 8, 2025 • 0 new comments -
[Feature] Support Efficient Sparse HiP Attention (InfiniteHiP) with Long-Context Generalization and KV Offloading Capabilties
#3930 commented on
Mar 8, 2025 • 0 new comments -
Update vLLM version for ROCm
#3937 commented on
Mar 8, 2025 • 0 new comments -
triton python backend
#3939 commented on
Mar 8, 2025 • 0 new comments -
[Feature] Devin/1740716491 Add a hash for each new release
#3947 commented on
Mar 8, 2025 • 0 new comments -
add deepgemm into sgl-kernel
#3957 commented on
Mar 8, 2025 • 0 new comments -
Update function_call_parser.py
#3960 commented on
Mar 8, 2025 • 0 new comments -
example: add async offline inference demo
#3961 commented on
Mar 8, 2025 • 0 new comments -
feat(remote_model): support variable remote backend for model loader
#3964 commented on
Mar 8, 2025 • 0 new comments -
Blackwell fp8 blockscale
#3968 commented on
Mar 8, 2025 • 0 new comments -
[CI] Remove unused imports with Ruff to pre-commit config
#3969 commented on
Mar 8, 2025 • 0 new comments -
FP4 weight loading and inference (2/2)
#3972 commented on
Mar 8, 2025 • 0 new comments -
Move `aiohttp` into public dependencies
#3980 commented on
Mar 8, 2025 • 0 new comments -
[ROCM MOE] Enable ROCM AITER Block MOE For DeepSeek R1/V3
#3788 commented on
Mar 8, 2025 • 0 new comments -
Workaround for transformers 4.49
#3792 commented on
Mar 8, 2025 • 0 new comments -
[Bug] Fix chat completion empty input
#3793 commented on
Mar 8, 2025 • 0 new comments -
Add model name in EntryClass for tracking doc update to date
#3800 commented on
Mar 8, 2025 • 0 new comments -
use torch api to get distributed backend
#3804 commented on
Mar 8, 2025 • 0 new comments -
Ensure Usage Data in Streaming Responses Aligns with vLLM’s Implementation
#3814 commented on
Mar 8, 2025 • 0 new comments -
[BugFix]: Add type check before compare for max_new_tokens
#3832 commented on
Mar 8, 2025 • 0 new comments -
Resolve LinearBase circular import
#3833 commented on
Mar 8, 2025 • 0 new comments -
[WIP] [Feature] Support DeepSeek-v3 gptq
#3834 commented on
Mar 8, 2025 • 0 new comments -
[Doc] Fix typo in backend/sampling_params
#3835 commented on
Mar 8, 2025 • 0 new comments -
[docs] Update outdated description about `torch.compile`
#3844 commented on
Mar 8, 2025 • 0 new comments -
Update FP8 kernel configuration for 4xGPU support on AMD
#3850 commented on
Mar 8, 2025 • 0 new comments -
[Feature] Support for Ascend NPU backend
#3853 commented on
Mar 8, 2025 • 0 new comments -
Fix block_shape in benchmark_vllm_vs_sglang_fused_moe_triton.py
#3858 commented on
Mar 8, 2025 • 0 new comments -
Feat: Supports returning expected error responses to wrong requests
#3875 commented on
Mar 8, 2025 • 0 new comments -
Update router.rs
#3877 commented on
Mar 8, 2025 • 0 new comments -
[Quantization] support VPTQ
#3879 commented on
Mar 8, 2025 • 0 new comments -
Add examples for token in token out for LLM
#3886 commented on
Mar 8, 2025 • 0 new comments -
[Feature] Prefill assistant response
#3971 commented on
Mar 5, 2025 • 0 new comments -
[Bug] RecursionError: maximum recursion depth exceeded while calling a Python object
#3953 commented on
Mar 5, 2025 • 0 new comments -
[Bug] [AMD] ncclAllReduce Error under tp > 1 + multi nodes + enable cuda graph when running DeepSeek R1 on MI300X
#3622 commented on
Mar 5, 2025 • 0 new comments -
[Bug] ImportError: cannot import name 'is_valid_list_of_images' from transformers.models.mllama.image_processing_mllama
#3878 commented on
Mar 5, 2025 • 0 new comments -
[Bug] Qwen 2.5 VL
#3321 commented on
Mar 5, 2025 • 0 new comments -
[Track] long context performance sglang vs vllm
#3471 commented on
Mar 6, 2025 • 0 new comments -
[Feature] Lora Development Roadmap
#2929 commented on
Mar 6, 2025 • 0 new comments -
[Bug] 8xMI300X DeepSeek v3/R1 takes 45 mins to download and 1 hour to load shards
#3825 commented on
Mar 6, 2025 • 0 new comments -
[Feature] Proposal for adding PD-Disaggregation Feature to SGLang
#3554 commented on
Mar 6, 2025 • 0 new comments -
[Bug] SGLang Tool Calling for Qwen2.5 models returns empty ChatCompletionMessage content
#3797 commented on
Mar 6, 2025 • 0 new comments -
[Bug] Parameter/message body error directly returns 500
#3805 commented on
Mar 7, 2025 • 0 new comments -
[Feature] remove vllm _custom_ops
#2965 commented on
Mar 7, 2025 • 0 new comments -
[Bug] fused_moe OOM when run deepseek-r1 with --speculative-algo NEXTN
#3633 commented on
Mar 7, 2025 • 0 new comments -
[Feature] Instructions for running Sglang on AMD RX 7900 XTX (gfx1100) ROCm 6.2.4
#3243 commented on
Mar 7, 2025 • 0 new comments -
[DeepseekR1]How ragged prefill manage kv_cache?
#3849 commented on
Mar 8, 2025 • 0 new comments -
[Bug] --dp-size issue with AMD 8xMI300X and Llama 3.1 70B
#3890 commented on
Mar 8, 2025 • 0 new comments -
Extensive benchmarking of reasoning models including variance
#3725 commented on
Mar 8, 2025 • 0 new comments -
[Feature] A800 kernel config file to support deepseek r1 bf16
#3748 commented on
Mar 9, 2025 • 0 new comments -
[Bug] GGUF tokenizations issues
#3427 commented on
Mar 2, 2025 • 0 new comments -
[Bug] HuggingFace and SGLang inference don't match
#2671 commented on
Mar 3, 2025 • 0 new comments -
[Bug] torch.distributed.all_reduce raised Segmentation fault on 2 * 8 * H800
#3745 commented on
Mar 3, 2025 • 0 new comments -
[Feature] Support Deepseek's DeepGemm MoE
#3881 commented on
Mar 3, 2025 • 0 new comments -
[Bug] Dimension mismatched error when capturing cuda graph while enabling NEXTN
#3891 commented on
Mar 3, 2025 • 0 new comments -
[Bug] tensor_model_parallel_all_reduce' is not defined
#2931 commented on
Mar 3, 2025 • 0 new comments -
[Bug] ERROR: No matching distribution found for vllm==0.6.3.post2.dev1; extra == "srt-hip"
#3189 commented on
Mar 3, 2025 • 0 new comments -
[Bug] AttributeError: module 'vllm._custom_ops' has no attribute 'silu_and_mul'
#3392 commented on
Mar 3, 2025 • 0 new comments -
[Bug] Decode Throughput Inconsistency Between bench_serving and Engine Logs
#3050 commented on
Mar 3, 2025 • 0 new comments -
[Feature] attention dp + attention tp for deepseek v3
#3750 commented on
Mar 3, 2025 • 0 new comments -
[Bug] KeyError: 'model.layers.0.self_attn.k_scale'
#3936 commented on
Mar 4, 2025 • 0 new comments -
[Track] progress in removing vLLM dependencies
#2245 commented on
Mar 4, 2025 • 0 new comments -
[Question] Why is the performance worse after --enable-flashinfer-mla on H20
#3917 commented on
Mar 4, 2025 • 0 new comments -
[Feature] Support mistralai/Pixtral
#2351 commented on
Mar 4, 2025 • 0 new comments -
[Bug] deploy the DeepSeek-R1-awq get jumbled or nonsensical answers
#3580 commented on
Mar 4, 2025 • 0 new comments -
The number of image token (3) should be the same as in the number of provided images (1)
#3819 commented on
Mar 4, 2025 • 0 new comments -
Question sgl_kernel on amd paltforms
#3965 commented on
Mar 4, 2025 • 0 new comments -
[Feature] Use xgrammar as default grammar backend to aviod I/O errors while using Outlines in a multi-node setting
#3383 commented on
Mar 5, 2025 • 0 new comments -
[WIP] Integration of TurboMind AWQ
#2900 commented on
Mar 8, 2025 • 0 new comments -
[Core] Optimize the delay scheduling of in batch prefix caching
#2962 commented on
Mar 8, 2025 • 0 new comments -
Integrate turbomind into sgl-kernel
#2999 commented on
Mar 8, 2025 • 0 new comments -
support telechat2 model
#3000 commented on
Mar 8, 2025 • 0 new comments -
Support int8 kvcahe
#3034 commented on
Mar 8, 2025 • 0 new comments -
Modify the kernel test path & add it to the CI process.
#3044 commented on
Mar 8, 2025 • 0 new comments -
[Feature] Beam Search
#3066 commented on
Mar 8, 2025 • 0 new comments -
[Feature] Rewrite Sampling Parameter #3165
#3185 commented on
Mar 8, 2025 • 0 new comments -
Add logit bias into the SGLang interface.
#3187 commented on
Mar 8, 2025 • 0 new comments -
Add deepseek_v3 fused gate
#3191 commented on
Mar 8, 2025 • 0 new comments -
add date_string to the chat template
#3297 commented on
Mar 8, 2025 • 0 new comments -
FEA compat with ipv6
#3301 commented on
Mar 8, 2025 • 0 new comments -
fix(mig): fallback gpu_memory_total value
#3353 commented on
Mar 8, 2025 • 0 new comments -
Better support of tp checkpoint loading
#3367 commented on
Mar 8, 2025 • 0 new comments -
Update Intel XPU install instruction
#3390 commented on
Mar 8, 2025 • 0 new comments -
fix: return null instead of "" for finish_reason when unfinished
#3391 commented on
Mar 8, 2025 • 0 new comments -
Modify metrics service endpoint
#3443 commented on
Mar 8, 2025 • 0 new comments -
Add torch profiler activity for HPU
#3445 commented on
Mar 8, 2025 • 0 new comments -
How to return reasoning_content from sglang server response?
#3428 commented on
Mar 9, 2025 • 0 new comments -
DeepSeek-R1 Optimization Option Ablations
#3956 commented on
Mar 9, 2025 • 0 new comments -
[Feature] Add initial support for sequence parallelism
#1436 commented on
Mar 8, 2025 • 0 new comments -
Surpport kv cache int8/int4 for triton backend
#1644 commented on
Mar 8, 2025 • 0 new comments -
feat: use cascade attention kernel (single level)
#2101 commented on
Mar 8, 2025 • 0 new comments -
Add support for Phi3V
#2383 commented on
Mar 8, 2025 • 0 new comments -
Add InfiniteBench for long context benchmarking
#2421 commented on
Mar 8, 2025 • 0 new comments -
[Experimental] Add a gRPC server for completion request
#2478 commented on
Mar 8, 2025 • 0 new comments -
Refactor Scheduler to improve code organization
#2593 commented on
Mar 8, 2025 • 0 new comments -
[Feature] support compute-communication overlap with TransformerEngine
#2627 commented on
Mar 8, 2025 • 0 new comments -
Support InternVL2 Series
#2629 commented on
Mar 8, 2025 • 0 new comments -
[Feature] Support regex as a stopping condition
#2699 commented on
Mar 8, 2025 • 0 new comments -
Speculative decoding with lookahead
#2790 commented on
Mar 8, 2025 • 0 new comments -
Add endpoint for file support, purely to speed up processing of input_embeds.
#2797 commented on
Mar 8, 2025 • 0 new comments -
[Feature] Support Deepseek-VL2
#2798 commented on
Mar 8, 2025 • 0 new comments -
Improve the mixed chunk prefill by lanuch two kernels
#2811 commented on
Mar 8, 2025 • 0 new comments -
Use CUDA_VISIBLE_DEVICES instead of gpu_id variables everywhere.
#2824 commented on
Mar 8, 2025 • 0 new comments -
[Feature] Support dynamic loading and unloading of Lora adapters
#2891 commented on
Mar 8, 2025 • 0 new comments