# Reasoning Outputs
Reasoning models return an additional `reasoning_content` field in their output, which contains the reasoning steps that led to the final conclusion.
## Supported Models
| Model Name | Parser Name | Enable_thinking by Default |
| --- | --- | --- |
| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | ernie-45-vl | ✓ |
| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | ernie-45-vl | ✓ |
Reasoning models require a parser to extract the reasoning content from the model output. Reasoning mode can be disabled by setting the `enable_thinking=False` parameter.
Interfaces that support toggling reasoning mode:

1. `/v1/chat/completions` requests in OpenAI services.
2. `/v1/chat/completions` requests in the OpenAI Python client.
3. `llm.chat` requests in offline interfaces (see the sketch after this list).
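For the offline interface, the toggle is applied per request. The following is a minimal sketch only: the `fastdeploy.LLM` import path, the `chat_template_kwargs` keyword, and the shape of the returned outputs are assumptions about the offline API and may differ across FastDeploy versions.

```python
# Hypothetical offline sketch: import path, chat_template_kwargs, and output
# structure are assumptions, not confirmed by this document.
from fastdeploy import LLM, SamplingParams

llm = LLM(model="/path/to/your/model", tensor_parallel_size=8)
sampling_params = SamplingParams(max_tokens=1024)

messages = [{"role": "user", "content": "Which era does the cultural relic belong to?"}]

# Disable reasoning for this request only (assumed keyword).
outputs = llm.chat(
    messages,
    sampling_params,
    chat_template_kwargs={"enable_thinking": False},
)
for output in outputs:
    print(output)
```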
For reasoning models, the length of the reasoning content can be controlled via `reasoning_max_tokens`. Add `metadata={"reasoning_max_tokens": 1024}` to the request.
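For example, with the OpenAI Python client (mirroring the streaming example below), the cap can be combined with the thinking switch in the same `metadata` dict; the value 1024 is purely illustrative:

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8192/v1")

response = client.chat.completions.create(
    model="vl",
    messages=[{"role": "user", "content": "Which era does the cultural relic belong to?"}],
    # Keep thinking enabled, but cap the reasoning steps at 1024 tokens.
    metadata={"enable_thinking": True, "reasoning_max_tokens": 1024},
)
```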
## Quick Start
When launching the model service, specify the parser name with the `--reasoning-parser` argument. This parser processes the model's output and extracts the `reasoning_content` field.
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/your/model \
    --enable-mm \
    --tensor-parallel-size 8 \
    --port 8192 \
    --quantization wint4 \
    --reasoning-parser ernie-45-vl
```
Next, send a request to the model; the response should include the reasoning content.
```bash
curl -X POST "http://0.0.0.0:8192/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
        {"type": "text", "text": "Which era does the cultural relic in the picture belong to"}
      ]}
    ],
    "metadata": {"enable_thinking": true}
  }'
```
In the response, the `reasoning_content` field contains the reasoning steps that led to the final conclusion, while the `content` field holds the conclusion itself.
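The same request can be made without streaming through the OpenAI Python client. A minimal sketch, assuming the server returns `reasoning_content` alongside `content` on the message object (the OpenAI SDK preserves extra response fields as attributes, so it is read defensively with `getattr`):

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8192/v1")

response = client.chat.completions.create(
    model="vl",
    messages=[{"role": "user", "content": "Which era does the cultural relic belong to?"}],
    metadata={"enable_thinking": True},
)

message = response.choices[0].message
# reasoning_content is a server-side extension field, so it may be absent.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```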
## Streaming chat completions
Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field of chat completion response chunks:
```python
from openai import OpenAI

# Point the OpenAI client at the FastDeploy API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8192/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "Which era does the cultural relic in the picture belong to"},
        ]}
    ],
    model="vl",
    stream=True,
    metadata={"enable_thinking": True},
)

# Each chunk carries an incremental delta; reasoning steps and the final
# answer arrive in separate fields of the delta.
for chunk in chat_response:
    if chunk.choices[0].delta is not None:
        print(chunk.choices[0].delta, end='')
        print("\n")
```