[CI/Build] A perplexity-computing test for the FP8 KV cache system. Originally used in the context of PR #3290 #3730
Conversation
Insert the minimal perplexity computation benchmark and the engine functionality to support it.
Fixed comments in the measurement script.
Hi @Alexei-V-Ivanov-AMD, this is a nice script to have at hand. Other packages like …
cc @simon-mo, can we review and get this PR in? It'll help unblock the AMD team on adding more tests. Thanks.
Adding functionality to ingest scaling factors upon merge of the PR vllm-project#3290
Sounds good. I agree with @casper-hansen that this is very valuable and a good start for #3780
At a high level, I would imagine that running more end-to-end tests like https://github.com/EleutherAI/lm-evaluation-harness, which can directly support vLLM with a simpler command, would be better. For actual testing I would prefer using lm-eval. For this script, I think it has value if put into the examples folder.
Agreed. Moving the script into the 'examples' folder. Thank you!
LGTM, with a few comments.
Maybe we can call this P-PPL, with P standing for Preloaded, Prefilled, or Prefix, and PPL for perplexity.
def _forced_sample(
    selected_seq_groups: List[SequenceGroupToSample],
    samples: torch.Tensor,
) -> List[Tuple[List[int], List[int]]]:
Can we have a function header (comments) below this, like the others?
+1
Added the proper comment.
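For context, a header of the requested kind, shown on a toy version of the helper, might look like the sketch below. This is illustrative only: the `seq_ids` attribute and the loop body are assumptions, not the PR's actual code.

```python
from typing import List, Tuple

import torch


def _forced_sample(
    selected_seq_groups,  # vLLM-internal List[SequenceGroupToSample]
    samples: torch.Tensor,
) -> List[Tuple[List[int], List[int]]]:
    """Force the next token of each sequence group.

    Instead of drawing from the model's distribution, the next token ids are
    taken directly from the reference text, so perplexity can be measured
    along a fixed token path.

    Args:
        selected_seq_groups: The sequence groups selected for sampling.
        samples: Token ids dictated by the reference text, one per sequence.

    Returns:
        A list of (next_token_ids, parent_ids) tuples, one per group.
    """
    results = []
    sample_idx = 0
    for seq_group in selected_seq_groups:
        num_seqs = len(seq_group.seq_ids)  # attribute name is an assumption
        parent_ids = list(range(num_seqs))
        next_token_ids = samples[sample_idx:sample_idx + num_seqs].tolist()
        sample_idx += num_seqs
        results.append((next_token_ids, parent_ids))
    return results
```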
import datetime
import math

from transformers import LlamaTokenizer
We need to consider the different tokenizers used by models other than Llama.
We're going to extend support to other models, but we will do that separately.
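One common way to generalize the tokenizer setup (an assumption about the follow-up work, not part of this PR) is to let `transformers` resolve the tokenizer class automatically:

```python
from transformers import AutoTokenizer

# AutoTokenizer picks the right tokenizer class from the model's config,
# so no Llama-specific import is needed. The path mirrors the script's
# --model argument.
model_path = "/data/models/llama-2-7b-chat-hf"
my_tokenizer = AutoTokenizer.from_pretrained(model_path)
```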
dtype=args.dtype,
kv_cache_dtype=args.kv_cache_dtype,
#scales_path=args.kv_cache_scales_path
# if args.kv_cache_scales_path!='' else None,
You may remove these commented-out lines.
Done.
if args.kv_cache_scales_path != '' else None,
enforce_eager=args.enforce_eager)

sampling_params = SamplingParams(n=1,
Do we need to consider n > 1?
No, that is not needed as we're in the "forced" sampling mode.
print(MESSAGE)
my_ppl = 0.0

my_tokenizer = LlamaTokenizer.from_pretrained(args.model)
This won't work for other models.
We'll cover other models with separate, purpose-built scripts.
args.context_size:upper_boundary])
my_sampl_par.max_tokens = len(my_sampl_par.future_context[0])
my_sampl_par.cntr = c
LOGPROBS = vllm_predict(CONTEXT, my_llm, my_sampl_par)
Watch for CONTEXT > max_context_length_of_model?
Done.
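For illustration, a guard of the kind the comment asks for could look like the following. The function name, the truncation policy, and the way the limit is obtained are assumptions, not necessarily the fix that was made.

```python
def clamp_context(context_tokens, max_model_len, sample_size):
    """Keep the preloaded context short enough that context + sample fits."""
    budget = max_model_len - sample_size
    if len(context_tokens) > budget:
        # Keep the most recent tokens; they matter most for the next patch.
        context_tokens = context_tokens[-budget:]
    return context_tokens


# Example: a 4096-token model, 512-token patches, 4000 preloaded tokens.
print(len(clamp_context(list(range(4000)), 4096, 512)))  # -> 3584
```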
@@ -112,6 +115,8 @@ def __init__(
    top_p: float = 1.0,
    top_k: int = -1,
    min_p: float = 0.0,
    ppl_measurement: bool = False,
    future_context: Optional[List[int]] = None,
Missing docstring
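For reference, the missing documentation could read roughly as follows. The wording is illustrative and is shown on a stand-alone dataclass rather than the real SamplingParams.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PplSamplingExtras:
    """Illustrative docs for the two new sampling parameters.

    Attributes:
        ppl_measurement: Whether generation is forced to follow a
            pre-supplied reference text so that perplexity can be measured.
        future_context: Token ids of the upcoming reference text to emit,
            used only when ppl_measurement is True.
    """

    ppl_measurement: bool = False
    future_context: Optional[List[int]] = None
```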
long_sample_indices = sample_indices.long()
if sampling_type == SamplingType.GREEDY:
if sampling_type == SamplingType.FORCED:
    #pdb.set_trace()
remove
sampling_metadata.seq_groups[0].seq_data[
    sampling_params.cntr].output_token_ids)]
],
device='cuda:0')
This breaks in the CPU runtime.
Fixed.
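A common device-agnostic pattern for this kind of fix is to inherit the device from a tensor that is already in scope instead of hard-coding 'cuda:0'. This is a sketch; the PR's actual change may differ.

```python
import torch

logprobs = torch.zeros(4, 8)  # stand-in for a tensor the sampler already has

# Building the forced-token tensor on logprobs.device keeps the same code
# path working on both CPU and GPU runtimes.
forced_tokens = torch.tensor([[101, 102, 103]], device=logprobs.device)
```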
def _forced_sample(
    selected_seq_groups: List[SequenceGroupToSample],
    samples: torch.Tensor,
) -> List[Tuple[List[int], List[int]]]:
+1
tests/prompts/wiki.test.raw
Is this file downloadable from an external URL? I would not recommend adding it, given the size of the repo.
This file is our measuring stick; any slight variation to it invalidates all previous measurements.
It is absolutely essential to have it properly recorded.
Done. Renamed to "PPPL".
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
The script benchmarks/measure_pplv2_MC.py produces a realistic perplexity measurement for the quantized KV cache system by processing a sequence of non-overlapping patches of the reference text. Generation of the consecutive tokens in each patch is governed (forced) by the reference text.
The initial context size for the system is set by the parameter "--context-size".
The number of output tokens to generate starting from a given context is set by the parameter "--sample-size". This variable also defines the size of the individual patch: the size of the patch in tokens is equal to the sample size.
For an N-token reference text split into M patches, with the system's initial context size C, the method takes approximately M * (preload time) + (N - C) * (per-token generation time) to complete.
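A rough sketch of the patching scheme described above is given below. Function and variable names are hypothetical, and the context-window handling is one plausible reading of the description; the authoritative logic is in the benchmark script itself.

```python
import math


def windowed_perplexity(token_ids, context_size, sample_size, score_patch):
    """Perplexity over non-overlapping patches of a tokenized reference text.

    score_patch(context, patch) is assumed to return the log-probabilities
    the model assigns to each token of `patch` when generation is forced
    to follow the reference text.
    """
    total_logprob = 0.0
    total_tokens = 0
    for start in range(context_size, len(token_ids), sample_size):
        patch = token_ids[start:start + sample_size]
        context = token_ids[start - context_size:start]  # preceding window
        logprobs = score_patch(context, patch)
        total_logprob += sum(logprobs)
        total_tokens += len(patch)
    return math.exp(-total_logprob / total_tokens)
```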
Quick correctness validation tips:
Running llama-2-7b-chat-hf model
(
./vllm/benchmarks/measure_ppl2_MC.py
--model=/data/models/llama-2-7b-chat-hf
--data=./vllm/tests/prompts/wiki.test.raw
--context-size=1024
--sample-size=512
)
should result in PPL ~ 6.524227946419175
Running llama-2-7b-chat-hf model
(
./vllm/benchmarks/measure_ppl2_MC.py
--model=/data/models/llama-2-7b-chat-hf
--data=./vllm/tests/prompts/wiki.test.raw
--context-size=1024
--sample-size=512
--patch-size=1
)
should result in PPL ~ 3.8968611189957523
This testing method is sensitive to the representation precision of the KV cache.
The table below presents the perplexities achieved with different quantization and scaling methods.