# Reasoning Outputs
Reasoning models return an additional `reasoning_content` field in their output, which contains the reasoning steps that led to the final conclusion.
## Supported Models
| Model Name | Parser Name | Enable_thinking by Default |
| --- | --- | --- |
| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | ernie-45-vl | ✓ |
| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | ernie-45-vl | ✓ |
Reasoning models require a parser to extract the reasoning content from the model output. Reasoning mode can be disabled by setting the `enable_thinking=False` parameter.
Interfaces that support toggling reasoning mode:

1. `/v1/chat/completions` requests in OpenAI services.
2. `/v1/chat/completions` requests in the OpenAI Python client.
3. `llm.chat` requests in offline interfaces (see the sketch after this list).
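For the offline interface, the toggle is applied per request. The following is a minimal sketch only: the `fastdeploy.LLM` import path, the `chat_template_kwargs` keyword, and the shape of the returned outputs are assumptions about the offline API and may differ across FastDeploy versions.

```python
# Hypothetical offline sketch: import path, chat_template_kwargs, and output
# structure are assumptions, not confirmed by this document.
from fastdeploy import LLM, SamplingParams

llm = LLM(model="/path/to/your/model", tensor_parallel_size=8)
sampling_params = SamplingParams(max_tokens=1024)

messages = [{"role": "user", "content": "Which era does the cultural relic belong to?"}]

# Disable reasoning for this request only (assumed keyword).
outputs = llm.chat(
    messages,
    sampling_params,
    chat_template_kwargs={"enable_thinking": False},
)
for output in outputs:
    print(output)
```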
For reasoning models, the length of the reasoning content can be controlled via `reasoning_max_tokens`. Add `metadata={"reasoning_max_tokens": 1024}` to the request.
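For example, with the OpenAI Python client (mirroring the streaming example below), the cap can be combined with the thinking switch in the same `metadata` dict; the value 1024 is purely illustrative:

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8192/v1")

response = client.chat.completions.create(
    model="vl",
    messages=[{"role": "user", "content": "Which era does the cultural relic belong to?"}],
    # Keep thinking enabled, but cap the reasoning steps at 1024 tokens.
    metadata={"enable_thinking": True, "reasoning_max_tokens": 1024},
)
```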
## Quick Start
When launching the model service, specify the parser name with the `--reasoning-parser` argument. This parser processes the model's output and extracts the `reasoning_content` field.
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/your/model \
    --enable-mm \
    --tensor-parallel-size 8 \
    --port 8192 \
    --quantization wint4 \
    --reasoning-parser ernie-45-vl
```
Next, send a request to the model; the response should include the reasoning content.
```bash
curl -X POST "http://0.0.0.0:8192/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
        {"type": "text", "text": "Which era does the cultural relic in the picture belong to"}
      ]}
    ],
    "metadata": {"enable_thinking": true}
  }'
```
In the response, the `reasoning_content` field contains the reasoning steps that led to the final conclusion, while the `content` field holds the conclusion itself.
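The same request can be made without streaming through the OpenAI Python client. A minimal sketch, assuming the server returns `reasoning_content` alongside `content` on the message object (the OpenAI SDK preserves extra response fields as attributes, so it is read defensively with `getattr`):

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8192/v1")

response = client.chat.completions.create(
    model="vl",
    messages=[{"role": "user", "content": "Which era does the cultural relic belong to?"}],
    metadata={"enable_thinking": True},
)

message = response.choices[0].message
# reasoning_content is a server-side extension field, so it may be absent.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```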
## Streaming chat completions
Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field of chat completion response chunks:
```python
from openai import OpenAI

# Point the OpenAI client at the FastDeploy API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8192/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "Which era does the cultural relic in the picture belong to"},
        ]}
    ],
    model="vl",
    stream=True,
    metadata={"enable_thinking": True},
)

# Each chunk carries an incremental delta; reasoning steps and the final
# answer arrive in separate fields of the delta.
for chunk in chat_response:
    if chunk.choices[0].delta is not None:
        print(chunk.choices[0].delta, end='')
        print("\n")
```