*Figure: CUDA: Triton cache improves startup performance by ~20%*

*Figure: ROCm: Triton cache improves startup performance by ~20%*
This benchmark compares GPU memory usage and startup performance of a custom vllm
configuration using Triton flash attention in two scenarios:
- With Triton cache pre-loaded: cache exists from a previous run (warm start)
- Without Triton cache: clean cache state (cold start)
Key findings:
- Triton cache reduces startup time by approximately 20% (see the timing sketch after this list)
- More consistent memory usage patterns with cached kernels
- Improved resource utilization during initial model loading
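The timing gap is straightforward to reproduce by hand. A minimal sketch, where `inference.py` is a hypothetical stand-in for the workload script passed via `--script`:

```bash
# Hand-rolled cold/warm comparison (illustrative; benchmark.sh automates this)
export VLLM_ATTENTION_BACKEND=TRITON_FLASH
export TRITON_CACHE_DIR="$HOME/.triton/cache"

rm -rf "$TRITON_CACHE_DIR"    # cold start: force kernel recompilation
time python inference.py      # kernels compiled and written to the cache

time python inference.py      # warm start: cached kernels are reused
```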
Prerequisites:
- Triton installed
- Custom vllm fork with Triton support:

  ```bash
  git clone -b triton https://github.com/cmagina/vllm.git
  cd vllm && pip install -e .
  ```

- NVIDIA GPU (CUDA) or AMD GPU (ROCm)
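Before running the benchmark, the setup can be sanity-checked with a few one-liners (illustrative, not part of benchmark.sh):

```bash
python -c "import triton; print(triton.__version__)"   # Triton importable?
python -c "import vllm; print(vllm.__version__)"       # custom fork installed?
nvidia-smi || rocm-smi                                 # GPU visible to the driver?
```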
Usage:

```bash
# Basic run
./benchmark.sh --arch [cuda|rocm]

# Custom cache location and script
./benchmark.sh \
    --arch cuda \
    --triton-cache-dir ~/alternate_cache \
    --script ./custom_script.py
```
Output files:
- `gpu_usage_log.csv`: time-series GPU memory data
- `gpu_memory_usage_comparison.png`: visualization plot
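For a quick look at the CSV, something like the snippet below works; the column layout used here (timestamp, memory in MiB) is an assumption, so adjust the field index to match the actual header:

```bash
head -n 5 gpu_usage_log.csv   # inspect the header and first few samples

# Peak memory over the run, assuming memory usage is the second column
awk -F, 'NR > 1 && $2 + 0 > max { max = $2 + 0 } END { print "peak:", max, "MiB" }' gpu_usage_log.csv
```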
Methodology:
- Cold Start (no cache):
  - Purge the existing Triton cache
  - Run the inference script
  - Log GPU memory at 1 Hz (see the sampler sketch after this list)
- Warm Start (with cache):
  - Reuse the kernels generated during the cold-start run
  - Run the identical inference script
  - Compare memory and time metrics
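A minimal sketch of the 1 Hz sampler referenced above, assuming a single NVIDIA GPU (on ROCm, `rocm-smi --showmeminfo vram` reports the equivalent figure); the CSV schema here is illustrative, not necessarily the benchmark's exact format:

```bash
LOG=gpu_usage_log.csv
echo "timestamp,memory_used_mib" > "$LOG"
while true; do
    # memory.used in MiB, without the header line or unit suffix
    mem=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits)
    echo "$(date +%s),${mem}" >> "$LOG"
    sleep 1
done
```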
Environment variables:

```bash
export VLLM_ATTENTION_BACKEND=TRITON_FLASH    # Required for Triton flash attention support
export TRITON_CACHE_DIR="$HOME/.triton/cache" # Default cache location ($HOME, since ~ does not expand inside quotes)
```
License: Apache 2.0 (see the LICENSE file)