[Prefill with Prefix Cache] Improve the efficiency of prefilling with prefix cache by allowing a larger batch size #3402
base: main
Conversation
Can you add relevant tests to core/test_scheduler.py and core/test_block_manager.py?
@@ -98,6 +98,41 @@ def get_num_free_blocks(self) -> int:
        return (self.num_blocks - self.current_num_blocks +
                self.evictor.num_blocks)

    def get_num_cached_blocks(self, seq: Sequence) -> int:
Can you add a docstring to indicate the difference between cached vs computed blocks?
Hi, thank you for your review! Since it is very late in my area, I plan to add unit tests and more docstrings tomorrow. I have reviewed my logic and think it should be right, but it is kind of tricky. I will add more comments to make it easier to understand. Thank you!
vllm/core/scheduler.py
Outdated
            num_computed_tokens = self.block_manager \
                .get_num_computed_tokens(waiting_seq)
            new_seq_lens = seq_lens + [
                num_prompt_tokens - num_computed_tokens
Should this be max(num_prompt_tokens - num_computed_tokens, 0)?
E.g., block size == 16, 4 blocks are cached (and 60 tokens are prefix); if the token size is 63, it can have a negative value?
vllm/core/block_manager.py
Outdated
        return self.gpu_allocator.get_num_computed_blocks(seq)

    def get_num_computed_tokens(self, seq: Sequence) -> int:
        return self.block_size * self.get_num_computed_blocks(seq)
Q: Does this overestimate the number of computed tokens (since it is not guaranteed that every slot in the block has a token allocated)? Can you write a docstring that explains the behavior here? (Either it can be overestimated, or if we rely on a certain assumption it should be documented.)
            if self.cached_blocks[block_hash].computed:
                num_computed_blocks += 1
            else:
                break
Comment on why it breaks? (I assume it relies on the assumption that blocks are computed in order?)
vllm/sequence.py
Outdated
@@ -289,7 +296,7 @@ class SequenceGroupState:
     """Mutable state tied to a specific sequence group"""

     # torch.Generator used in seeded sampling
-    generator: Optional = None
+    generator: Optional[torch.Generator] = None
I'd intentionally avoided the type here so that torch doesn't need to be imported, since this layer deals with higher level sequence abstractions.
Hi, I have removed this modification. I did it before because Pylance in VS Code kept flagging this, which is kind of annoying. BTW, it's fine according to your explanation, thank you!
@MeloYang05 this looks great, I think it would be good to split into two PRs though since the improvements are independent?
OK thanks, I will split the PR into two independent ones.
@MeloYang05 Thanks for pointing out the incremental detokenization issue. Do you mind if I submit a PR fixing it in a more comprehensive way? You will be added as a co-author.
OK it's fine, I'm very glad to see it can be fixed in a more comprehensive way, thank you! I will only handle the batch size of prefix prefill in this PR.
Force-pushed from 107fd9a to 02a8a2e
vllm/core/block_manager.py
Outdated
        return self.gpu_allocator.get_num_computed_blocks(seq)

    def get_num_computed_tokens(self, seq: Sequence) -> int:
        # The last block has been excluded in method `get_num_computed_blocks`
Nit:
    def get_num_computed_tokens(self, seq: Sequence) -> int:
        """Return the number of tokens that are already computed.

        NOTE: This excludes tokens from the last blocks.
        """
…Allocator Abstraction
Hi, I have made some edits to ensure my modifications are compatible with the latest updates regarding the block manager abstraction.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
Hello everyone, I am testing the efficiency of prefilling with the prefix cache enabled to explore opportunities to apply techniques such as chunked prefill or even SplitFuse in vLLM. However, the current performance of prefilling with the prefix cache is unsatisfactory, mainly for the following two reasons:

1. In `scheduler.py`, when determining whether a `SequenceGroup` can be added to the running queue, the `can_allocate()` method of `block_manager` is not aware of the `SequenceGroup`'s cached blocks. For example, assume there are 10 free blocks and the `SequenceGroup` requires 12 blocks, but 8 of them are cached. In this scenario, the `SequenceGroup` should be allowed into the running queue because it actually needs only 4 extra blocks. However, the current implementation does not check the cached blocks, so there is no difference between a prefix prefill and a normal prefill. In theory, prefix prefill should allow a much larger batch size than normal prefill and thus increase overall throughput. Moreover, the later check on `num_batched_tokens` also does not subtract the number of already-computed tokens, which further shrinks the batch size.
2. `incremental_detokenize`. I also find that even with the batch size fixed to 1, the latency of a prefix prefill with 1 extra token (e.g., prompt length 257 with 256 tokens already cached) is still far behind decoding 1 token. The gap mainly results from the different execution time of `incremental_detokenize` for decoding versus prefix prefilling. For decoding, since `seq.tokens` has already been computed, it only needs to detokenize the new token id. In prefix prefill, however, `seq.tokens` is `None`, so it has to detokenize all the token ids, which is a huge gap. Additionally, `incremental_detokenize` actually detokenizes all the token ids more than once during prefilling: in the `_decode_sequence` method in `llm_engine.py`, the step `self._decode_logprobs(...)` calls `incremental_detokenize` many times, and each of these calls detokenizes all the token ids when `seq.tokens` is `None`, which I think is totally unnecessary. This leads to performance degradation, especially when the prompt is long.

In this PR, I primarily address the problem of enabling a larger batch size for prefix prefilling. The issue of repeated `incremental_detokenize` calls will be resolved in a separate PR by @Yard1, who will approach the fix in a more systematic way.
How do I fix the problems
Enable Larger Batch Size for Prefilling with Prefix Cache
To enable a larger batch size, we have to make the following two kinds of modifications.
Aware of the Cached Blocks
Cached blocks are the blocks stored in the block manager's `cached_blocks` table, where the hash value of the block is the key. In `block_manager.py`, I add a method to get the number of cached blocks, and in the `can_allocate()` method I add a few lines so that the cached blocks are excluded from the number of blocks the sequence still requires.
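(The code blocks from the original description are not preserved on this page. The following is only a rough sketch of the idea: `cached_blocks`, `get_num_free_blocks()`, and `can_allocate()` come from the discussion above, while the helper name `num_required_blocks` and the surrounding allocation logic are assumptions, not the exact diff.)

```python
# Rough sketch only -- not the actual diff from this PR.
# Inside the cached block allocator:
def get_num_cached_blocks(self, seq: Sequence) -> int:
    """Count the leading full logical blocks of `seq` whose hashes already
    have entries in `cached_blocks` (i.e. blocks that can be reused)."""
    num_cached_blocks = 0
    for logical_idx in range(seq.get_len() // self.block_size):
        if seq.hash_of_block(logical_idx) in self.cached_blocks:
            num_cached_blocks += 1
        else:
            break
    return num_cached_blocks

# Inside can_allocate(), subtract the cached blocks from the number of
# blocks the sequence still needs before comparing against the free blocks:
num_required_blocks = len(seq.logical_token_blocks)
num_required_blocks -= self.gpu_allocator.get_num_cached_blocks(seq)
num_free_gpu_blocks = self.gpu_allocator.get_num_free_blocks()
# ... the existing free-block / watermark comparison then uses the
# reduced num_required_blocks ...
```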
Aware of the Computed Blocks
The computed blocks of a sequence should satisfy the following conditions:
1. The block is in the cached blocks or the evictor.
2. `block.computed` is true.
3. The computed blocks are continuous from the start of the sequence. For example, if blocks 0, 1, 2, 4 are all in `cached_blocks` and marked as computed, block 4 is not continuous with (0, 1, 2), so the computed blocks are only (0, 1, 2).

Moreover, to align with #3239, the last block of a sequence is never regarded as a computed block.
Therefore, I write the following method in `block_manager.py` to get the number of computed blocks of a sequence before it is allocated. I believe the logic of this method aligns with that of the `get_all_computed_blocks` method, which is invoked after the sequence has been allocated.
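(Again, the original code block is missing from this page. The sketch below simply mirrors the three conditions above and the `cached_blocks` fragment quoted in the review; the real method also has to consider blocks held by the `evictor`, which is omitted here.)

```python
# Rough sketch only -- not the actual diff from this PR.
def get_num_computed_blocks(self, seq: Sequence) -> int:
    """Number of leading blocks of `seq` that are already computed.

    NOTE: the last logical block is always excluded (to align with #3239),
    and counting stops at the first block that is not cached or not marked
    as computed, so only a contiguous computed prefix is reported.
    """
    num_computed_blocks = 0
    for logical_idx in range(len(seq.logical_token_blocks) - 1):
        block_hash = seq.hash_of_block(logical_idx)
        if block_hash not in self.cached_blocks:
            break
        if self.cached_blocks[block_hash].computed:
            num_computed_blocks += 1
        else:
            break
    return num_computed_blocks
```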
Additionally, in the `_scheduler()` method in `scheduler.py`, I have modified the check on the number of batched tokens so that it excludes the already-computed tokens.
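(For context, the review snippet quoted earlier shows the shape of this change; a rough sketch of the surrounding accounting, including the `max(..., 0)` clamp suggested in the review, is given below. The variable names and the padded-token estimate are assumptions based on that snippet, not the exact diff.)

```python
# Rough sketch of the batched-token check in the scheduler -- not the exact diff.
num_computed_tokens = self.block_manager \
    .get_num_computed_tokens(waiting_seq)
# Only the tokens that still have to be prefilled should count against the
# per-step token budget; clamp at zero as suggested in the review.
num_new_tokens = max(num_prompt_tokens - num_computed_tokens, 0)
new_seq_lens = seq_lens + [num_new_tokens]
num_batched_tokens = len(new_seq_lens) * max(new_seq_lens)
if num_batched_tokens > self.scheduler_config.max_num_batched_tokens:
    break
```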
Other Optimizations
Moreover, since the hash values of logical blocks (`seq.hash_of_block(logical_idx)`) are now computed much more frequently, I store the hash value of a logical block the first time it is computed to improve efficiency (note that only a full logical block can store its hash value).

Also, there is a small modification in the `compute_full_blocks_in_seq` method in `block_manager.py`. In the previous implementation, we had `max_full_block = seq.get_len() // self.block_size - 1`. However, I believe there is no need to subtract 1 from `max_full_block` if the last block is full. This is because the last block will be excluded by `block_table[:-1]` in the expression `for b in takewhile(lambda b: b.computed, block_table[:-1])` within the `get_all_computed_blocks` method. Given that the last block is full, it can also be marked as computed.
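(The corresponding code block is also not preserved here. A sketch of the described change, assuming the existing `block_tables` / `computed` bookkeeping in `block_manager.py`, might look like the following; it is not the exact diff.)

```python
# Rough sketch only -- not the actual diff from this PR.
def compute_full_blocks_in_seq(self, seq: Sequence) -> None:
    if seq.seq_id not in self.block_tables:
        return
    # Previously: max_full_block = seq.get_len() // self.block_size - 1
    # Subtract 1 only when the last block is partial; a full last block can
    # safely be marked as computed because get_all_computed_blocks already
    # skips the last block via block_table[:-1].
    max_full_block = seq.get_len() // self.block_size
    if seq.get_len() % self.block_size != 0:
        max_full_block -= 1
    if max_full_block <= 0:
        return
    block_table = self.block_tables[seq.seq_id]
    for i in reversed(range(max_full_block)):
        if block_table[i].computed:
            break
        block_table[i].computed = True
```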
Performance Improvement
I wrote a script to test the performance gain from my modifications.
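(The original benchmark script is not preserved on this page. The snippet below is a hedged reconstruction of the kind of test described: many prompts sharing a long prefix, with one warm-up pass so that later prefills hit the prefix cache. The model name, prompt sizes, and the `enable_prefix_caching` flag are assumptions, not the PR's exact script.)

```python
# Hedged reconstruction of the benchmark, not the original script.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=1)

# Long shared prefix plus a short unique suffix per request.
prefix = "You are a helpful assistant. " * 100
prompts = [prefix + f"Answer question #{i}." for i in range(48)]

# Warm-up pass: populates the prefix cache with the shared prefix blocks.
llm.generate(prompts[:1], sampling_params)

# Timed pass: prefills now mostly reuse cached prefix blocks, so the
# scheduler can batch many more sequences per step.
start = time.perf_counter()
llm.generate(prompts, sampling_params)
print(f"Prefix prefill took {time.perf_counter() - start:.2f}s")
```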
I use the `Llama-2-7B-FP16` model, and my GPU is an A800. Before the modifications, the maximum batch size for prefix prefilling is just 3, and the entire prefix prefilling process takes about 0.93 s. After the changes, the batch size reaches 48, and the prefix prefilling time is shortened to about 0.21 s (with the repeated detokenization also fixed in a simple way), which is more than a 4x speedup. I hope these changes help prepare vLLM for techniques that rely on prefix prefilling.