[CI/Build] A perplexity-computing test for the FP8 KV cache system. Originally used in the context of PR #3290 (#3730)

Open · wants to merge 70 commits into base: main
Conversation

@Alexei-V-Ivanov-AMD (Contributor) commented Mar 29, 2024

The script benchmarks/measure_pplv2_MC.py produces a realistic perplexity measurement for the quantized KV cache system by processing a sequence of non-overlapping patches of the reference text. Generation of consecutive symbols within each patch is governed (forced) by the reference text.

The initial context size for the system is set by the parameter "--context-size".

The number of output symbols to generate from a given context is set by the parameter "--sample-size". This parameter also defines the size of each individual patch: the patch size in tokens equals the sample size.

For an N-token reference text split into M patches, with the system's initial context size C, the method takes roughly M * (preload time) + (N - C) * (per-token generation time) to complete.
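For illustration only, here is a minimal sketch of the patch-wise, teacher-forced perplexity computation. It uses plain Hugging Face transformers rather than the vLLM forced-sampling path added by this PR, and the model/data paths and constants are assumptions taken from the example commands below; treat it as a reference for the arithmetic, not as the script's implementation.

```python
# Sketch of patch-wise forced-decoding perplexity (assumptions: HF transformers
# instead of vLLM, paths as in the example commands below).
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "/data/models/llama-2-7b-chat-hf"    # assumed, as in --model below
DATA = "./vllm/tests/prompts/wiki.test.raw"  # assumed, as in --data below
CONTEXT_SIZE = 1024                          # --context-size
SAMPLE_SIZE = 512                            # --sample-size (= patch size)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda().eval()

tokens = tokenizer(open(DATA).read(), return_tensors="pt").input_ids[0]

nll_sum, counted = 0.0, 0
# Walk the reference text in non-overlapping SAMPLE_SIZE-token patches, each
# preceded by CONTEXT_SIZE tokens of preloaded context.
for start in range(CONTEXT_SIZE, len(tokens) - SAMPLE_SIZE + 1, SAMPLE_SIZE):
    window = tokens[start - CONTEXT_SIZE : start + SAMPLE_SIZE].unsqueeze(0).cuda()
    with torch.no_grad():
        logits = model(window).logits
    # Log-probability of each reference token given everything before it.
    logprobs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    token_nll = -logprobs.gather(1, window[0, 1:].unsqueeze(1)).squeeze(1)
    # Score only the SAMPLE_SIZE forced "output" tokens of this patch.
    nll_sum += token_nll[-SAMPLE_SIZE:].sum().item()
    counted += SAMPLE_SIZE

print("PPL =", math.exp(nll_sum / counted))
```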

Quick correctness validation tips:

Running the llama-2-7b-chat-hf model with

    ./vllm/benchmarks/measure_ppl2_MC.py \
        --model=/data/models/llama-2-7b-chat-hf \
        --data=./vllm/tests/prompts/wiki.test.raw \
        --context-size=1024 \
        --sample-size=512

should result in PPL ~ 6.524227946419175.

Running the llama-2-7b-chat-hf model with

    ./vllm/benchmarks/measure_ppl2_MC.py \
        --model=/data/models/llama-2-7b-chat-hf \
        --data=./vllm/tests/prompts/wiki.test.raw \
        --context-size=1024 \
        --sample-size=512 \
        --patch-size=1

should result in PPL ~ 3.8968611189957523.

This testing method is sensitive to the representation precision of the KV cache.
The table below presents the perplexities achieved with different quantization and scaling methods.

llama-2-7b-chat-hf: 647 patches, 330849 symbols (max), 512 generated tokens per 1024-token initial context.

| KV cache configuration | PPLv2 |
| --- | --- |
| FP8, scaling 1e3 | 2016.919661 |
| FP8, scaling 1e2 | 7.110797102 |
| FP8, scaling 1e1 | 6.550152394 |
| FP16 | 6.524227946 |
| FP8 | 6.541197624 |
| FP8, scaling 1e-1 | 6.545720813 |
| FP8, scaling 1e-2 | 57.70005660 |
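As rough intuition for the scaling rows, here is a hedged sketch (not the PR's KV-cache kernels: the E4M3 format, the direction of scaling, and the clamping policy are assumptions, and it needs PyTorch >= 2.1 for the float8 dtype). One plausible reason for the blow-ups at the extreme scales is that a very large scale saturates the narrow FP8 range while a very small one pushes values into its coarse low end:

```python
# Hypothetical round-trip through FP8 (E4M3) with a scale factor; for intuition only.
import torch

def fp8_roundtrip(x: torch.Tensor, scale: float) -> torch.Tensor:
    fp8_max = torch.finfo(torch.float8_e4m3fn).max         # ~448 for E4M3
    scaled = (x * scale).clamp(-fp8_max, fp8_max)          # avoid overflow before the cast
    return scaled.to(torch.float8_e4m3fn).to(x.dtype) / scale

x = torch.randn(4096)                                      # stand-in for KV-cache values
for s in (1e3, 1e1, 1.0, 1e-1, 1e-2):
    err = (fp8_roundtrip(x, s) - x).abs().mean().item()
    print(f"scale={s:g}  mean abs error={err:.5f}")
```

Which scale (if any) actually helps depends on the distribution of the real KV-cache values, which is exactly what the table above measures end to end.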

@casper-hansen (Contributor)

Hi @Alexei-V-Ivanov-AMD, this is a nice script to have at hand. Other packages like llama.cpp run perplexity tests in their CI, which I think the vLLM maintainers should consider adopting to avoid regressions.

@sunway513

cc @simon-mo, can we review and get this PR in? It will help unblock the AMD team on adding more tests. Thanks!

@simon-mo (Collaborator) commented Apr 4, 2024

Sounds good. I agree with @casper-hansen that this is very valuable and a good start for #3780

@simon-mo (Collaborator) commented Apr 4, 2024

At a high level, I would imagine that running a more end-to-end test like https://github.com/EleutherAI/lm-evaluation-harness, which can directly support vLLM with a simpler command, would be better.

For actual testing I would prefer using lm-eval. For this script, I think there is value in putting it into the examples folder.

@Alexei-V-Ivanov-AMD (Contributor, Author)

> For actual testing I would prefer using lm-eval. For this script, I think there is value in putting it into the examples folder.

Agreed. Moving the script into the 'examples' folder. Thank you!

@HaiShaw (Contributor) left a comment

LGTM, with a few comments.
Maybe we can call this P-PPL, which would stand for Preloaded-, Prefilled-, or Prefix-PPL.

def _forced_sample(
selected_seq_groups: List[SequenceGroupToSample],
samples: torch.Tensor,
) -> List[Tuple[List[int], List[int]]]:
Contributor

Can we have a function header (comments) below this, as with the others?

Collaborator
+1

Contributor (Author)
Added the proper comment.
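For readers skimming the thread, a hedged sketch of what such a header could look like; the convention of returning one (next_token_ids, parent_ids) pair per sequence group mirrors the other sampling helpers, but the body and the import path shown here are illustrative assumptions, not the PR's code.

```python
from typing import List, Tuple

import torch
# Import path assumed; adjust to wherever SequenceGroupToSample lives in your tree.
from vllm.model_executor.sampling_metadata import SequenceGroupToSample


def _forced_sample(
    selected_seq_groups: List[SequenceGroupToSample],
    samples: torch.Tensor,
) -> List[Tuple[List[int], List[int]]]:
    """Take the next token of each sequence group from a predetermined
    (forced) sequence instead of sampling from the model's distribution.

    Args:
        selected_seq_groups: sequence groups scheduled for forced sampling.
        samples: forced token ids, one per selected sequence group.

    Returns:
        One (next_token_ids, parent_ids) tuple per sequence group, following
        the convention of the other *_sample helpers.
    """
    results: List[Tuple[List[int], List[int]]] = []
    for i, _seq_group in enumerate(selected_seq_groups):
        # Forced decoding: a single parent sequence and a single forced token.
        results.append(([int(samples[i].item())], [0]))
    return results
```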

import datetime
import math

from transformers import LlamaTokenizer
Contributor
We need to consider different tokenizers to be used for models other than Llama.

Contributor (Author)
We're going to extend support of other models, but we will do it separately.

dtype=args.dtype,
kv_cache_dtype=args.kv_cache_dtype,
#scales_path=args.kv_cache_scales_path
# if args.kv_cache_scales_path!='' else None,
Contributor
May remove these lines

Contributor (Author)
Done.

if args.kv_cache_scales_path != '' else None,
enforce_eager=args.enforce_eager)

sampling_params = SamplingParams(n=1,
Contributor
Do we need to consider n>1?

Contributor (Author)
No, that is not needed as we're in the "forced" sampling mode.

print(MESSAGE)
my_ppl = 0.0

my_tokenizer = LlamaTokenizer.from_pretrained(args.model)
Contributor
This won't work for other models

Contributor (Author)
We'll cover other models with differently purposed scripts.
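As a pointer for that follow-up (a suggestion sketched here, not part of this PR): transformers' AutoTokenizer resolves the tokenizer class from the checkpoint's config, so the hard dependency on LlamaTokenizer could eventually be dropped.

```python
from transformers import AutoTokenizer, PreTrainedTokenizerBase


def load_tokenizer(model_path: str) -> PreTrainedTokenizerBase:
    # AutoTokenizer picks the correct tokenizer class for the given checkpoint,
    # so the same script could serve non-Llama models.
    return AutoTokenizer.from_pretrained(model_path)


# In the script this would replace LlamaTokenizer.from_pretrained(args.model):
# my_tokenizer = load_tokenizer(args.model)
```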

args.context_size:upper_boundary])
my_sampl_par.max_tokens = len(my_sampl_par.future_context[0])
my_sampl_par.cntr = c
LOGPROBS = vllm_predict(CONTEXT, my_llm, my_sampl_par)
Contributor
Watch for CONTEXT > max_context_length_of_model?

Contributor (Author)
Done.
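A hedged sketch of the kind of guard being asked for, assuming CONTEXT holds the token ids of the current patch's preloaded context and that the engine exposes its limit as model_config.max_model_len (both assumptions on my part; the PR's actual fix may look different):

```python
# Hypothetical guard: keep (context + forced output) within the model's window
# by dropping the oldest context tokens. The names CONTEXT, my_llm, and
# my_sampl_par come from the snippet above; max_model_len is assumed to be
# reachable via the engine's model config.
max_model_len = my_llm.llm_engine.model_config.max_model_len
budget = max_model_len - my_sampl_par.max_tokens
if len(CONTEXT) > budget:
    CONTEXT = CONTEXT[-budget:]
```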

@@ -112,6 +115,8 @@ def __init__(
top_p: float = 1.0,
top_k: int = -1,
min_p: float = 0.0,
ppl_measurement: bool = False,
future_context: Optional[List[int]] = None,
Collaborator
Missing docstring
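A possible docstring addition for the two new arguments, worded from the PR description (a suggestion, not the text that was merged):

```python
ppl_measurement: Whether to run in forced (teacher-forced) decoding mode for
    perplexity measurement; output tokens are taken from ``future_context``
    instead of being sampled from the model's distribution.
future_context: Reference token ids to force as the generated output, patch
    by patch, when ``ppl_measurement`` is enabled.
```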

long_sample_indices = sample_indices.long()
if sampling_type == SamplingType.GREEDY:
if sampling_type == SamplingType.FORCED:
#pdb.set_trace()
Collaborator
remove

sampling_metadata.seq_groups[0].seq_data[
sampling_params.cntr].output_token_ids)]
],
device='cuda:0') #
Collaborator
This breaks in the CPU runtime.

Contributor (Author)
Fixed.

def _forced_sample(
selected_seq_groups: List[SequenceGroupToSample],
samples: torch.Tensor,
) -> List[Tuple[List[int], List[int]]]:
Collaborator
+1

Collaborator
Is this file downloadable from an external URL? I would not recommend adding it, given the size of the repo.

@Alexei-V-Ivanov-AMD (Contributor, Author) commented Aug 24, 2024

> Is this file downloadable from an external URL? I would not recommend adding it, given the size of the repo.

This file is our measuring stick. Any slight variation to it invalidates all previous measurements.
It is absolutely essential to have it properly recorded.

@Alexei-V-Ivanov-AMD (Contributor, Author)

> Maybe we can call this P-PPL, which would stand for Preloaded-, Prefilled-, or Prefix-PPL.

Done. Renamed to "PPPL".


This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

The github-actions bot added the stale label on Nov 23, 2024.

mergify bot commented Nov 23, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Alexei-V-Ivanov-AMD.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
