[V1] [2/n] Logging and Metrics - OutputProcessor Abstraction #11973
base: main
Conversation
Signed-off-by: [email protected] <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
```
@@ -189,85 +178,3 @@ def _get_next_output_text(self, finished: bool, delta: bool) -> str:
            self._last_output_text_offset = length
            return self.output_text[last_offset:length]
        return ""


class Detokenizer:
```
This is now called OutputProcessor, since we will do more than "just" detokenization in the step() function.
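For context, a minimal sketch of that shape (the IterationStats stub, field names, and the req_state.detokenize helper are assumptions for illustration, not the PR's actual code):

```python
from typing import Any, Dict, List


class IterationStats:
    """Stub standing in for the PR's per-step stats class."""

    def update_from_output(self, output: Any, req_state: Any) -> None:
        pass  # the real class aggregates per-step token counts, etc.


class OutputProcessor:
    """Previously Detokenizer: the per-step loop now detokenizes AND
    updates logging/metrics state, hence the more general name."""

    def __init__(self, tokenizer: Any) -> None:
        self.tokenizer = tokenizer
        self.request_states: Dict[str, Any] = {}  # request_id -> state

    def process_outputs(self, engine_core_outputs: List[Any]) -> IterationStats:
        iteration_stats = IterationStats()
        for output in engine_core_outputs:
            req_state = self.request_states[output.request_id]
            req_state.detokenize(output)  # the old "just detokenization" part
            iteration_stats.update_from_output(output, req_state)  # the new part
        return iteration_stats
```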
```
@@ -233,57 +225,46 @@ async def generate(
            await self.abort(request_id)
            raise

    def _process_request_outputs(self, request_outputs: List[RequestOutput]):
```
This logic is moved into the OutputProcessor.process_outputs() loop.
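A simplified sketch of the resulting fused handler, with the old two-pass shape noted in the docstring (method names on the collaborating objects are assumptions):

```python
from typing import Any


async def output_handler(engine_core: Any, output_processor: Any) -> None:
    """Sketch of the fused loop.

    Previously each batch took two passes:
        request_outputs = detokenizer.step(outputs)
        self._process_request_outputs(request_outputs)  # second loop
    Both passes are now folded into one call.
    """
    while True:
        outputs = await engine_core.get_output_async()
        # Detokenization, per-request queue routing, and stats updates
        # all happen inside one loop over the batch:
        output_processor.process_outputs(outputs)
```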
```
            queue=queue,
        )
```
NOTE: this was previously called Detokenizer
```
@@ -59,9 +59,6 @@ def __init__(
            lora_config=vllm_config.lora_config)
        self.tokenizer.ping()

        # Request streams (map of request_id -> queue).
```
NOTE: these queues are held in OutputProcessor
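A sketch of what holding the streams in the OutputProcessor might look like (the class, field, and method names here are assumptions):

```python
import asyncio
from typing import Dict


class OutputProcessorStreams:
    """Illustrative only: per-request output streams live alongside the
    per-request state in the OutputProcessor, not in the async engine."""

    def __init__(self) -> None:
        # request_id -> queue of RequestOutput objects for that request
        self.request_streams: Dict[str, asyncio.Queue] = {}

    def add_request(self, request_id: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self.request_streams[request_id] = queue
        return queue

    def abort_request(self, request_id: str) -> None:
        # Dropping the queue ends the stream for that request.
        self.request_streams.pop(request_id, None)
```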
Nice structure and comments, this LGTM. It would be nice to have a test that IterationStats gets updated within the OutputProcessor.
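Something like the following could cover that (a hedged sketch; the helpers and the process_outputs return value are assumptions about the test API):

```python
def test_iteration_stats_updated_in_process_outputs():
    # make_output_processor / make_engine_core_outputs are hypothetical
    # test helpers; the real fixtures will differ.
    processor = make_output_processor()
    outputs = make_engine_core_outputs(num_requests=4)

    iteration_stats = processor.process_outputs(outputs)

    # Every request in the batch produced output this step, so the
    # aggregate counters should reflect the whole batch.
    assert iteration_stats.num_generation_tokens >= 4
```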
```
If you need to touch every element of the batch, implement a
method called XXXClass.update_from_output() to be called
within the loop below. For examples, see:
    * IterationStats.update_from_output()
```
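As a concrete illustration of the pattern, a future per-request stats class could hook in the same way (RequestStats is mentioned later in this thread, but its shape here is purely an assumption):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RequestStats:
    """Hypothetical per-request metrics following the same hook pattern."""

    num_output_tokens: int = 0
    finish_reason: Optional[str] = None

    def update_from_output(self, engine_core_output: Any) -> None:
        # Called once per engine core output from the process_outputs loop.
        self.num_output_tokens += len(engine_core_output.new_token_ids)
        if engine_core_output.finished:
            self.finish_reason = engine_core_output.finish_reason
```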
This is a great abstraction IMO.
nit: I wonder if we also want to make it more explicit by having something like an OutputHandler protocol that takes in the engine core output + maybe a current request state?
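Sketched with typing.Protocol, the suggestion might look like this (the name OutputHandler and the exact signature are the open question here, so this is purely illustrative):

```python
from typing import Any, Protocol


class OutputHandler(Protocol):
    """Illustrative protocol for per-output hooks invoked inside
    OutputProcessor.process_outputs(); the signature is a guess."""

    def update_from_output(
        self,
        engine_core_output: Any,  # one element of the batch
        request_state: Any,       # current state of the owning request
    ) -> None:
        ...
```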
I will add a comment suggesting that we do this once we add RequestStats. I want to keep flexibility while we are in the development stage.
SUMMARY:
* Rename Detokenizer >> OutputProcessor
* Add abstraction: XXXClass.update_from_output, to be called in the OutputProcessor.process_outputs loop.
* Fold self._process_request_outputs into this loop (previously this was a separate loop in output_handler)
* Add IterationStats.update_from_output() to this loop

NOTES:
* Running the Detokenizer in a separate process hurts performance ([V1] [7/N] API Server: Multiprocessing Detokenizer [ DO NOT MERGE ] #11636). So feel confident that adding all of this to a single loop is the right approach.

TODO: