[CI/Build] A perplexity-computing test for the FP8 KV cache system. Originally used in the context of PR #3290 #3730
Conversation
Insert the minimal perplexity computation benchmark and the engine functionality to support it.
Fixed comments in the measurement script.
Hi @Alexei-V-Ivanov-AMD, this is a nice script to have at hand. Other packages like …
cc @simon-mo, can we review and get this PR in? It'll help unblock the AMD team on adding more tests. Thanks.
Adding functionality to ingest scaling factors upon merge of the PR vllm-project#3290
Sounds good. I agree with @casper-hansen that this is very valuable and a good start for #3780
At a high level, I would imagine that running more end-to-end tests like https://github.com/EleutherAI/lm-evaluation-harness, which can directly support vLLM with a simpler command, would be better. For actual testing I would prefer using lm-eval. For this script, I think it has value if put into the examples folder.
Agreed. Moving the script into the 'examples' folder. Thank you!
LGTM, with a few comments.
Maybe we can call this P-PPL, with P standing for Preloaded, Prefilled, or Prefix, and PPL for perplexity.
def _forced_sample(
    selected_seq_groups: List[SequenceGroupToSample],
    samples: torch.Tensor,
) -> List[Tuple[List[int], List[int]]]:
Can we have a function header (comments) below this, like the others?
+1
Added the proper comment.
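For context, a header of the requested kind, shown on a toy version of the helper, might look like the sketch below. This is illustrative only: the `seq_ids` attribute and the loop body are assumptions, not the PR's actual code.

```python
from typing import List, Tuple

import torch


def _forced_sample(
    selected_seq_groups,  # vLLM-internal List[SequenceGroupToSample]
    samples: torch.Tensor,
) -> List[Tuple[List[int], List[int]]]:
    """Force the next token of each sequence group.

    Instead of drawing from the model's distribution, the next token ids are
    taken directly from the reference text, so perplexity can be measured
    along a fixed token path.

    Args:
        selected_seq_groups: The sequence groups selected for sampling.
        samples: Token ids dictated by the reference text, one per sequence.

    Returns:
        A list of (next_token_ids, parent_ids) tuples, one per group.
    """
    results = []
    sample_idx = 0
    for seq_group in selected_seq_groups:
        num_seqs = len(seq_group.seq_ids)  # attribute name is an assumption
        parent_ids = list(range(num_seqs))
        next_token_ids = samples[sample_idx:sample_idx + num_seqs].tolist()
        sample_idx += num_seqs
        results.append((next_token_ids, parent_ids))
    return results
```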
import datetime
import math

from transformers import LlamaTokenizer
We need to consider the different tokenizers used by models other than Llama.
We're going to extend support to other models, but we will do that separately.
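One common way to generalize the tokenizer setup (an assumption about the follow-up work, not part of this PR) is to let `transformers` resolve the tokenizer class automatically:

```python
from transformers import AutoTokenizer

# AutoTokenizer picks the right tokenizer class from the model's config,
# so no Llama-specific import is needed. The path mirrors the script's
# --model argument.
model_path = "/data/models/llama-2-7b-chat-hf"
my_tokenizer = AutoTokenizer.from_pretrained(model_path)
```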
dtype=args.dtype,
kv_cache_dtype=args.kv_cache_dtype,
#scales_path=args.kv_cache_scales_path
# if args.kv_cache_scales_path!='' else None,
You may remove these commented-out lines.
Done.
if args.kv_cache_scales_path != '' else None,
enforce_eager=args.enforce_eager)

sampling_params = SamplingParams(n=1,
Do we need to consider n > 1?
No, that is not needed as we're in the "forced" sampling mode.
print(MESSAGE)
my_ppl = 0.0

my_tokenizer = LlamaTokenizer.from_pretrained(args.model)
This won't work for other models.
We'll cover other models with separate, purpose-built scripts.
args.context_size:upper_boundary])
my_sampl_par.max_tokens = len(my_sampl_par.future_context[0])
my_sampl_par.cntr = c
LOGPROBS = vllm_predict(CONTEXT, my_llm, my_sampl_par)
Watch for CONTEXT > max_context_length_of_model?
Done.
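For illustration, a guard of the kind the comment asks for could look like the following. The function name, the truncation policy, and the way the limit is obtained are assumptions, not necessarily the fix that was made.

```python
def clamp_context(context_tokens, max_model_len, sample_size):
    """Keep the preloaded context short enough that context + sample fits."""
    budget = max_model_len - sample_size
    if len(context_tokens) > budget:
        # Keep the most recent tokens; they matter most for the next patch.
        context_tokens = context_tokens[-budget:]
    return context_tokens


# Example: a 4096-token model, 512-token patches, 4000 preloaded tokens.
print(len(clamp_context(list(range(4000)), 4096, 512)))  # -> 3584
```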
@@ -112,6 +115,8 @@ def __init__(
    top_p: float = 1.0,
    top_k: int = -1,
    min_p: float = 0.0,
    ppl_measurement: bool = False,
    future_context: Optional[List[int]] = None,
Missing docstring
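For reference, the missing documentation could read roughly as follows. The wording is illustrative and is shown on a stand-alone dataclass rather than the real SamplingParams.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PplSamplingExtras:
    """Illustrative docs for the two new sampling parameters.

    Attributes:
        ppl_measurement: Whether generation is forced to follow a
            pre-supplied reference text so that perplexity can be measured.
        future_context: Token ids of the upcoming reference text to emit,
            used only when ppl_measurement is True.
    """

    ppl_measurement: bool = False
    future_context: Optional[List[int]] = None
```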
long_sample_indices = sample_indices.long()
if sampling_type == SamplingType.GREEDY:
if sampling_type == SamplingType.FORCED:
    #pdb.set_trace()
remove
sampling_metadata.seq_groups[0].seq_data[
    sampling_params.cntr].output_token_ids)]
],
device='cuda:0')
This breaks in the CPU runtime.
Fixed.
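A common device-agnostic pattern for this kind of fix is to inherit the device from a tensor that is already in scope instead of hard-coding 'cuda:0'. This is a sketch; the PR's actual change may differ.

```python
import torch

logprobs = torch.zeros(4, 8)  # stand-in for a tensor the sampler already has

# Building the forced-token tensor on logprobs.device keeps the same code
# path working on both CPU and GPU runtimes.
forced_tokens = torch.tensor([[101, 102, 103]], device=logprobs.device)
```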
def _forced_sample(
    selected_seq_groups: List[SequenceGroupToSample],
    samples: torch.Tensor,
) -> List[Tuple[List[int], List[int]]]:
+1
tests/prompts/wiki.test.raw
Is this file downloadable from an external URL? I would not recommend adding it, given the size of the repo.
This file is our measuring stick; any slight variation to it invalidates all previous measurements.
It is absolutely essential to have it properly recorded.
Done. Renamed to "PPPL".
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
The script benchmarks/measure_pplv2_MC.py produces a realistic perplexity measurement for the quantized KV cache system by processing a sequence of non-overlapping patches of the reference text. Generation of the consecutive tokens in each patch is governed (forced) by the reference text.
The initial context size for the system is set by the parameter "--context-size".
The number of output tokens to generate starting from a given context is set by the parameter "--sample-size". This variable also defines the size of the individual patch: the size of the patch in tokens is equal to the sample size.
For an N-token reference text split into M patches, with the system's initial context size C, the method takes approximately M * (preload time) + (N - C) * (per-token generation time) to complete.
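A rough sketch of the patching scheme described above is given below. Function and variable names are hypothetical, and the context-window handling is one plausible reading of the description; the authoritative logic is in the benchmark script itself.

```python
import math


def windowed_perplexity(token_ids, context_size, sample_size, score_patch):
    """Perplexity over non-overlapping patches of a tokenized reference text.

    score_patch(context, patch) is assumed to return the log-probabilities
    the model assigns to each token of `patch` when generation is forced
    to follow the reference text.
    """
    total_logprob = 0.0
    total_tokens = 0
    for start in range(context_size, len(token_ids), sample_size):
        patch = token_ids[start:start + sample_size]
        context = token_ids[start - context_size:start]  # preceding window
        logprobs = score_patch(context, patch)
        total_logprob += sum(logprobs)
        total_tokens += len(patch)
    return math.exp(-total_logprob / total_tokens)
```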
Quick correctness validation tips:
Running llama-2-7b-chat-hf model
(
./vllm/benchmarks/measure_ppl2_MC.py
--model=/data/models/llama-2-7b-chat-hf
--data=./vllm/tests/prompts/wiki.test.raw
--context-size=1024
--sample-size=512
)
should result in PPL ~ 6.524227946419175
Running llama-2-7b-chat-hf model
(
./vllm/benchmarks/measure_ppl2_MC.py
--model=/data/models/llama-2-7b-chat-hf
--data=./vllm/tests/prompts/wiki.test.raw
--context-size=1024
--sample-size=512
--patch-size=1
)
should result in PPL ~ 3.8968611189957523
This testing method is sensitive to the representation precision of the KV cache.
The table below presents the perplexities achieved with different quantization and scaling methods.