[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types #10907
base: main
Conversation
Signed-off-by: rickyx <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
@robertgshaw2-neuralmagic Do you have bandwidth to review this PR by any chance?
Yeah
Generally looks really good. My thoughts are:
- Preemption: I do not think that the way we are handling preemption is consistent with what we should be exposing to users (see comments in the code for why).
- Detokenization/Finished: I think that tracking these data is a bit complicated, since they are not in the `EngineCore` and the `Detokenizer` runs concurrently with the `EngineCore`. I am curious to get your thoughts on from which object's POV we should consider a request "finished", and how/if we should include detokenization time in the TPOT metrics given the V1 design.
Thanks for the great review @robertgshaw2-neuralmagic!
For the detokenization stage and finished case:
Updates:
As we are adding more support for caching (MM preprocessor cache, prefix caching, and potentially embedding caching), it would be useful to track cache status in metrics, so I would like to bring the metrics system to v1 ASAP if possible.
This PR adds the basic data structures to capture request stats changes and their lifecycle transitions.
Intended usage in follow-up PRs:
- `RequestStatsUpdate` would be recorded on both the engine core process and the frontend engine process (async multiprocess architecture) as a request goes through various stages (i.e. input processing, prefill, decode, finished, etc.). In the single-process architecture, they would be on the same process.
- The updates would be aggregated into a `RequestStats` for each request, used to build a loggable `Stats` object each iteration.
See the full prototype in #10651.
Addresses #10582