[Feature] reasoning_content in API for reasoning models like DeepSeek R1 #12468

Closed
1 task done
gaocegege opened this issue Jan 27, 2025 · 13 comments · Fixed by #12473

Comments

@gaocegege
Contributor

🚀 The feature, motivation and pitch

To better support reasoning models like DeepSeek-R1, it would be great to add a reasoning_content field to the API response so that users can see the steps of the reasoning process.

Ref sgl-project/sglang#3043

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@gaocegege
Contributor Author

gaocegege commented Jan 27, 2025

For DeepSeek-series models, this means we could move the content inside the <think></think> tags into reasoning_content.

Current output format:

```
content='<think>\nAlright, I just received a query asking, "Wher....\n</think>\n\nThe 2020 World Series was played in **Texas** at the residence of the Los Angeles Dodgers team, the **Rays**. The series lasted from July 19 to July 31 and was won by the **Los Angeles Dodgers**.'
```

@simon-mo
Collaborator

So this does break OpenAI compatibility, but I think it is the right time to break it. Do you have a suggestion on how to automatically figure out the <think> token so it's a bit more general for future models?

@gaocegege
Contributor Author

The OpenAI Python library still works fine, even though we are bending compatibility a bit.

I think we should keep it general, since the token may differ across models. I'm looking into the OpenAI server wrappers to put together a basic design proposal here.

```python
from openai import OpenAI

client = OpenAI(api_key="", base_url="https://api.deepseek.com")

messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=messages
)

reasoning_content = response.choices[0].message.reasoning_content  # proposed new field
content = response.choices[0].message.content
```
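For streaming requests, the same field would presumably appear on each chunk's delta. A rough sketch of what that usage could look like, assuming a local vLLM endpoint and the proposed (not yet finalized) reasoning_content field:

```python
from openai import OpenAI

# Assumed local vLLM endpoint; adjust to your deployment.
client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # reasoning_content is the proposed field carrying the <think> text.
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end="", flush=True)
    elif getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
```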

@arunpatala

This would also be great to have when we are trying to get structured outputs from the content without having to parse out the reasoning output.

Seconded, +1

@gaocegege
Contributor Author

@arunpatala Hi, could you please explain more about the use case?

@arunpatala

It's just to make sure that when we provide a JSON schema for the output to follow, specified through the OpenAI API like this:


```python
from pydantic import BaseModel
from openai import OpenAI


class Info(BaseModel):
    name: str
    age: int


client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
    ],
    response_format=Info,
)
```


Just want to make sure the thinking tokens are separated from the main content and the main content follows the schema. Maybe this is already how it is being implemented.
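In other words, the desired behaviour would be roughly the following (a hypothetical continuation of the snippet above, assuming the proposed reasoning_content field exists):

```python
message = completion.choices[0].message

# Hypothetical: free-form thinking text, kept separate from the structured answer.
# message.reasoning_content -> "Okay, the user's name is Cameron and they are 28..."

# The schema-constrained answer, parsed into the Pydantic model by the SDK.
info = message.parsed  # Info(name='Cameron', age=28)
```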

@gaocegege
Contributor Author

gaocegege commented Jan 27, 2025

Hey @simon-mo,

I was thinking it might be cool to create a new abstraction called Reasoning Parser, kind of like what we have in abstract_tool_parser.py.

We could have a specific implementation like DeepSeekR1ReasoningParser that parses the <think> and </think> tokens to generate reasoning_content for delta_message in streaming requests and message in sync requests.
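For illustration only, a minimal sketch of what the non-streaming part of such a parser could look like (class and method names here are hypothetical, not the final vLLM interface):

```python
# Hypothetical sketch of a reasoning parser for DeepSeek-R1-style output.
class DeepSeekR1ReasoningParser:
    THINK_START = "<think>"
    THINK_END = "</think>"

    def extract_reasoning(self, model_output: str) -> tuple[str | None, str]:
        """Split model output into (reasoning_content, content)."""
        start = model_output.find(self.THINK_START)
        end = model_output.find(self.THINK_END)
        if start == -1 or end == -1:
            # No complete <think> block: treat everything as normal content.
            return None, model_output
        reasoning = model_output[start + len(self.THINK_START):end].strip()
        content = model_output[end + len(self.THINK_END):].strip()
        return reasoning, content
```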

And we need to add two CLI arguments to vllm serve:

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --enable-reasoning \
    --reasoning-parser deepseek_r1
```
One key insight is that the <think> token usually shows up as the first token in the response. But I'm not sure if that’s consistent across different models in the future, so I’d rather not rely on it for optimization.

What do you think?

@gaocegege
Contributor Author

gaocegege commented Jan 27, 2025

@arunpatala

I don’t think it will work because the structured output engine, like xgrammar, sets the logits for the reasoning tokens to −∞. As a result, the output from the LLMEngine doesn't include the reasoning content.

> Tokens that would violate the required structure are identified as invalid. Their logits are set to −∞, effectively assigning them zero probability after the softmax operation and preserving the relative probabilities of other valid tokens. This ensures that only valid tokens are sampled.
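To illustrate the masking described above, here is a small standalone sketch (not xgrammar's actual code) of why tokens whose logits are set to −∞ can never be sampled:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5, -0.3])        # scores for four candidate tokens
invalid = np.array([False, True, False, True])  # tokens that would violate the schema

masked = np.where(invalid, -np.inf, logits)     # grammar engine masks invalid tokens
probs = np.exp(masked - masked.max())
probs /= probs.sum()                            # softmax: masked tokens get probability 0
print(probs)                                    # ~[0.82, 0.0, 0.18, 0.0]
```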

Do you have some suggestions about it?

@gaocegege
Contributor Author

gaocegege commented Jan 28, 2025

Here are the proposed changes: #12473

@lucasalvarezlacasa

This is not available yet, is it?
I'm getting the following error when trying to use it:

```
api_server.py: error: unrecognized arguments: --enable-reasoning --reasoning-parser deepseek_r1
```

This is the full command I'm launching:

```bash
docker run --runtime nvidia --gpus all \
    -v /path/to/weights:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:v0.7.0 \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --trust-remote-code \
    --enable-chunked-prefill \
    --uvicorn-log-level error \
    --gpu-memory-utilization 0.95 \
    --dtype bfloat16 \
    --enable-reasoning --reasoning-parser deepseek_r1 \
    --max-model-len 8192
```

@gaocegege
Contributor Author

Hi, it does not work with v0.7.0. Perhaps you could try using vllm/vllm-openai:latest instead.

@lucasalvarezlacasa

I tried using latest before "v0.7.0" and it didn't work either. I think this is still not released.

@DarkLight1337
Member

DarkLight1337 commented Jan 31, 2025

Yeah, it's not released yet. You need to use latest code, not latest release. (i.e. you need to use the docker image after 0.7.0)
