Update TensorRT-LLM Release branch (NVIDIA#745)
* Update TensorRT-LLM

---------

Co-authored-by: Shixiaowei02 <[email protected]>
kaiyux and Shixiaowei02 authored Dec 26, 2023
1 parent a8018c1 commit 80bc075
Showing 19 changed files with 450 additions and 169 deletions.
99 changes: 99 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,99 @@
# Change Log

## Versions 0.6.0 / 0.6.1

* Models
* ChatGLM3
* InternLM (contributed by @wangruohui)
* Mistral 7B (developed in collaboration with Mistral.AI)
* MQA/GQA support for MPT (and GPT) models (contributed by @bheilbrun)
* Qwen (contributed by @Tlntin and @zhaohb)
* Replit Code V-1.5 3B (external contribution)
* T5, mT5, Flan-T5 (Python runtime only)

* Features
* Add runtime statistics related to active requests and KV cache
utilization from the batch manager (see
the [batch manager](docs/source/batch_manager.md) documentation)
* Add `sequence_length` tensor to support proper lengths in beam-search
(when beam-width > 1 - see
[tensorrt_llm/batch_manager/GptManager.h](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
* BF16 support for encoder-decoder models (Python runtime - see
[examples/enc_dec](examples/enc_dec/README.md))
* Improvements to CPU and GPU memory utilization, including fixes for
memory leaks
* Improved error reporting and reduced memory consumption
* Improved support for stop and bad words
* INT8 SmoothQuant and INT8 KV Cache support for the Baichuan models (see
[examples/baichuan](examples/baichuan/README.md))
* INT4 AWQ Tensor Parallelism support and INT8 KV cache + AWQ/weight-only
support for the GPT-J model (see [examples/gptj](examples/gptj/README.md))
* INT4 AWQ support for the Falcon models
(see [examples/falcon](examples/falcon/README.md))
* LoRA support for the GPT model (functional preview only: limited to the
Python runtime, QKV-only, and not yet optimized for runtime performance -
see the
[Run LoRA with the Nemo checkpoint](examples/gpt/README.md#Run-LoRA-with-the-Nemo-checkpoint)
section in the GPT example)
* Multi-GPU support for encoder-decoder models (Python runtime - see
[examples/enc_dec](examples/enc_dec/README.md))
* New heuristic for launching the Multi-block Masked MHA kernel (similar
to FlashDecoding - see
[decoderMaskedMultiheadAttentionLaunch.h](cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h))
* Prompt-Tuning support for GPT and LLaMA models (see the
[Prompt-tuning](examples/gpt/README.md#Prompt-tuning) section in the GPT example)
* Performance optimizations in various CUDA kernels
* Option to exclude input tokens from the output (see `excludeInputInOutput` in
[`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
* Python binding for the C++ runtime (GptSession - see [`pybind`](cpp/tensorrt_llm/pybind))
* Support for different micro batch sizes for context and generation
phases with pipeline parallelism (see `GptSession::Config::ctxMicroBatchSize` and
`GptSession::Config::genMicroBatchSize` in
[tensorrt_llm/runtime/gptSession.h](cpp/include/tensorrt_llm/runtime/gptSession.h))
* Support for "remove input padding" for encoder-decoder models (see
[examples/enc_dec](examples/enc_dec/README.md))
* Support for context and generation logits (see `mComputeContextLogits` and
`mComputeGenerationLogits` in
[tensorrt_llm/runtime/gptModelConfig.h](cpp/include/tensorrt_llm/runtime/gptModelConfig.h))
* Support for `logProbs` and `cumLogProbs` (see `"output_log_probs"` and
`"cum_log_probs"` in [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h)) -
see the sketch at the end of this section for how the two outputs relate
* Update to CUTLASS 3.x

* Bug fixes
* Fix for ChatGLM2 #93 and #138
* Fix tensor names error "RuntimeError: Tensor names
(`host_max_kv_cache_length`) in engine are not the same as expected in
the main branch" #369
* Fix weights split issue in BLOOM when `world_size = 2` ("array split
does not result in an equal division") #374
* Fix SmoothQuant multi-GPU failure when tensor parallelism is 2 #267
* Fix a crash in GenerationSession if the `stream` keyword argument is not
None #202
* Fix a typo when calling the PyNVML API ([BUG] code bug #410)
* Fix bugs related to the improper management of the `end_id` for various
models [C++ and Python]
* Fix memory leaks [C++ code and Python models]
* Fix the `std::bad_alloc` error when running gptManagerBenchmark (issue
gptManagerBenchmark std::bad_alloc error #66)
* Fix a bug in pipeline parallelism when beam-width > 1
* Fix a bug with Llama GPTQ due to improper support of GQA
* Fix issue #88
* Fix an issue with the Huggingface Transformers version #16
* Fix link jump in windows readme.md #30 - by @yuanlehome
* Fix typo in batchScheduler.h #56 - by @eltociear
* Fix typo #58 - by @RichardScottOZ
* Fix Multi-block MMHA: Difference between `max_batch_size` in the engine
builder and `max_num_sequences` in TrtGptModelOptionalParams? #65
* Fix the log message to be more accurate on KV cache #224
* Fix Windows release wheel installation: Failed to install the release
wheel for Windows using pip #261
* Fix missing torch dependencies: [BUG] The batch_manage.a choice error
in --cpp-only when torch's cxx_abi version is different with gcc #151
* Fix linking error during compiling google-test & benchmarks #277
* Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by
the lack of bfloat16 #335
* Minor bug fixes
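
As a point of reference for the `logProbs` / `cumLogProbs` support listed above:
`cum_log_probs` is simply the running sum of the per-token `output_log_probs`. A
minimal Python sketch of that relationship (illustrative only; the values and
variable names below are made up, only the output names come from `GptManager`):

```python
import math

# Hypothetical per-token log-probabilities for one generated sequence,
# i.e. what the "output_log_probs" output reports token by token.
output_log_probs = [-0.11, -2.30, -0.69, -1.20]

# "cum_log_probs" is the cumulative sum of those per-token values.
cum_log_probs = []
running = 0.0
for log_prob in output_log_probs:
    running += log_prob
    cum_log_probs.append(running)

# The final entry is the log-probability of the whole sequence; beam search
# (beam-width > 1) compares this quantity across candidate beams.
print(cum_log_probs)                # approximately [-0.11, -2.41, -3.10, -4.30]
print(math.exp(cum_log_probs[-1]))  # probability of the full generated sequence
```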

## Version 0.5.0

* TensorRT-LLM v0.5.0 is the first public release.
142 changes: 41 additions & 101 deletions README.md
@@ -8,7 +8,7 @@ TensorRT-LLM
[![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
[![cuda](https://img.shields.io/badge/cuda-12.2-green)](https://developer.nvidia.com/cuda-downloads)
[![trt](https://img.shields.io/badge/TRT-9.2-green)](https://developer.nvidia.com/tensorrt)
[![version](https://img.shields.io/badge/release-0.7.0-green)](./setup.py)
[![version](https://img.shields.io/badge/release-0.7.1-green)](./setup.py)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

[Architecture](./docs/source/architecture.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/performance.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](./docs/source/)
@@ -108,16 +108,16 @@ concepts used in TensorRT-LLM, we recommend that you read the following

## Installation

*For Windows installation, see [`Windows`](windows/README.md).*

TensorRT-LLM must be built from source, instructions can be found
The documentation for installing TensorRT-LLM can be found
[here](./docs/source/installation.md). An image of a Docker container with
TensorRT-LLM and its Triton Inference Server Backend will be made available
soon.

The remaining commands in that document must be executed from the TensorRT-LLM
container.

*For Windows installation, see [`Windows`](windows/README.md).*

## Quick Start

To create a TensorRT engine for an existing model, there are 3 steps:
@@ -379,103 +379,43 @@ For example: `mpirun -n 1 python3 examples/gpt/build.py ...`

### Change Log

#### Versions 0.7.0 / 0.7.1

* Models
- BART and mBART support in encoder-decoder models
- FairSeq Neural Machine Translation (NMT) family
- Mixtral-8x7B model
- Support weight loading for HuggingFace Mixtral model
- OpenAI Whisper
- Mixture of Experts support
- MPT - Int4 AWQ / SmoothQuant support
- Baichuan FP8 quantization support
* Features
- [Preview] Speculative decoding
- Add Python binding for `GptManager`
- Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession` (see the sketch below)
- System prompt caching
- Enable split-k for weight-only cutlass kernels
- FP8 KV cache support for XQA kernel
- New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
- Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
- fMHA support for chunked attention and paged kv cache
* Bug fixes
- Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
- Fix LLaMa with LoRA error #637
- Fix LLaMA GPTQ failure #580
- Fix Python binding for InferenceRequest issue #528
- Fix CodeLlama SQ accuracy issue #453
* Performance
- MMHA optimization for MQA and GQA
- LoRA optimization: cutlass grouped gemm
- Optimize Hopper warp specialized kernels
- Optimize AllReduce for parallel attention on Falcon and GPT-J
- Enable split-k for weight-only cutlass kernel when SM>=75
* Documentation
- Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
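
A minimal sketch of how the new `ModelRunnerCpp` class mentioned above might be
used from Python. This is hedged: it assumes the wrapper mirrors the existing
`ModelRunner` interface, and the engine directory, token IDs, and sampling
arguments are placeholders rather than the exact 0.7 API.

```python
# Sketch only: assumes ModelRunnerCpp mirrors ModelRunner; argument names may
# differ between releases, and the paths/IDs below are placeholders.
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(engine_dir="./trt_engines/gpt")  # placeholder path

# Token IDs would normally come from the model's tokenizer.
batch_input_ids = [torch.tensor([1, 15043, 3186], dtype=torch.int32)]

outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=32,  # placeholder sampling settings
    end_id=2,
    pad_id=2,
)
print(outputs)
```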

#### For history change log, please see [CHANGELOG.md](./CHANGELOG.md).

### Known Issues

7 changes: 7 additions & 0 deletions benchmarks/python/allowed_configs.py
@@ -232,6 +232,7 @@ class ModelConfig:
builder_opt=None,
pre_norm=False,
do_layer_norm_before=False,
use_custom_all_reduce=False,
)),
"opt_2.7b":
ModelConfig(name="opt_2.7b",
@@ -250,6 +251,7 @@ class ModelConfig:
builder_opt=None,
pre_norm=False,
do_layer_norm_before=True,
use_custom_all_reduce=False,
)),
"opt_6.7b":
ModelConfig(name="opt_6.7b",
@@ -268,6 +270,7 @@ class ModelConfig:
builder_opt=None,
pre_norm=False,
do_layer_norm_before=True,
use_custom_all_reduce=False,
)),
"opt_66b":
ModelConfig(name="opt_66b",
@@ -286,6 +289,7 @@ class ModelConfig:
builder_opt=None,
pre_norm=True,
do_layer_norm_before=True,
use_custom_all_reduce=False,
)),
"llama_7b":
ModelConfig(name="llama_7b",
@@ -512,6 +516,7 @@ class ModelConfig:
max_output_len=200,
builder_opt=None,
remove_input_padding=False,
use_custom_all_reduce=False,
)),
"bloom_560m":
ModelConfig(name="bloom_560m",
@@ -528,6 +533,7 @@ class ModelConfig:
max_input_len=1024,
max_output_len=1024,
builder_opt=None,
use_custom_all_reduce=False,
)),
"bloom_176b":
ModelConfig(name="bloom_176b",
@@ -544,6 +550,7 @@ class ModelConfig:
max_input_len=1024,
max_output_len=1024,
builder_opt=None,
use_custom_all_reduce=False,
)),
"bert_base":
ModelConfig(name="bert_base",
Git LFS file not shown
Git LFS file not shown
6 changes: 3 additions & 3 deletions cpp/tensorrt_llm/batch_manager/aarch64-linux-gnu/version.txt
@@ -1,3 +1,3 @@
516ff2db1e17536e92150b0c05200589 libtensorrt_llm_batch_manager_static.a
428a500536705184a1aad8aaf5c9c0ca libtensorrt_llm_batch_manager_static.pre_cxx11.a
33b6139e3bb108df093aab3a6de38a87f1f1e2dd commit
ffe001b0bf9ee66b3e3696423d6d09a2 libtensorrt_llm_batch_manager_static.a
3657ea3400959a64be77c12d8598dd72 libtensorrt_llm_batch_manager_static.pre_cxx11.a
9a775b3dbb20444f130f13f90e675cc971fe7e15 commit
Git LFS file not shown
Git LFS file not shown
4 changes: 2 additions & 2 deletions cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/version.txt
@@ -1,2 +1,2 @@
0403e89a23fd77aed43cac0ecd8136cf libtensorrt_llm_batch_manager_static.a
9fa2a1c18860eaf226a6ce61a8e3ed5d libtensorrt_llm_batch_manager_static.pre_cxx11.a
bb69bf376c5f955c327e867049639d78 libtensorrt_llm_batch_manager_static.a
14b107676c74ce17bfc8ce950b36a984 libtensorrt_llm_batch_manager_static.pre_cxx11.a
@@ -121,7 +121,8 @@ class FusedMHARunnerV2::mhaImpl
if (mLaunchParams.useKernelWithoutAlibi)
{
// The kernel adopts the log2f optimization.
set_alpha(params.scale_bmm1, scale_bmm1 * float(M_LOG2E), DATA_TYPE_FP32);
constexpr float kLog2e = 1.4426950408889634074; // log_2(e) = M_LOG2E
set_alpha(params.scale_bmm1, scale_bmm1 * float(kLog2e), DATA_TYPE_FP32);
}
else
{
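
The pre-scaling by `kLog2e` in the hunk above is what the "log2f optimization"
refers to: since e^x = 2^(x * log2(e)), folding log2(e) into the BMM1 scale lets
the kernel evaluate the softmax exponential with the cheaper base-2 exponential.
A quick numerical check of that identity (illustrative Python, not TensorRT-LLM
code):

```python
import math

x = 1.7                     # any attention logit
log2e = 1.4426950408889634  # same constant as kLog2e in the diff above

# exp(x) and 2**(x * log2(e)) are mathematically identical, so a kernel that
# pre-multiplies its scale by log2(e) can use exp2 instead of exp.
assert math.isclose(math.exp(x), 2.0 ** (x * log2e), rel_tol=1e-12)
print(math.exp(x), 2.0 ** (x * log2e))
```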
