
Reasoning Outputs

Reasoning models return an additional reasoning_content field in their output, which contains the reasoning steps that led to the final conclusion.

Supported Models

Model Name                          | Parser Name | Enable_thinking by Default
baidu/ERNIE-4.5-VL-424B-A47B-Paddle | ernie-45-vl |
baidu/ERNIE-4.5-VL-28B-A3B-Paddle   | ernie-45-vl |

Reasoning models require a specified parser to extract the reasoning content. The reasoning mode can be disabled per request by setting enable_thinking=False (a sketch of this follows the list below).

Interfaces that support toggling the reasoning mode:

1. /v1/chat/completions requests in OpenAI services.
2. /v1/chat/completions requests in the OpenAI Python client.
3. llm.chat requests in offline interfaces.
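
A minimal sketch of disabling the reasoning mode per request through the OpenAI Python client. It assumes the server from the Quick Start below is running on port 8192 and serves the model under the name "vl", as in the streaming example further down:

from openai import OpenAI

# Point the client at the FastDeploy OpenAI-compatible API server.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8192/v1")

response = client.chat.completions.create(
    model="vl",  # served model name; adjust to your deployment
    messages=[
        {"role": "user", "content": "Which era does the cultural relic in the picture belong to"}
    ],
    metadata={"enable_thinking": False},  # turn off the reasoning mode for this request
)
# With thinking disabled, only the final answer is returned in `content`.
print(response.choices[0].message.content)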

For reasoning models, the length of the reasoning content can be controlled via reasoning_max_tokens. Add metadata={"reasoning_max_tokens": 1024} to the request.
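
For example, a hedged sketch combining both metadata keys; the client setup and the served model name "vl" follow the streaming example below:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8192/v1")

response = client.chat.completions.create(
    model="vl",  # served model name; adjust to your deployment
    messages=[
        {"role": "user", "content": "Which era does the cultural relic in the picture belong to"}
    ],
    # Enable reasoning, but cap the reasoning content at 1024 tokens.
    metadata={"enable_thinking": True, "reasoning_max_tokens": 1024},
)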

Quick Start

When launching the model service, specify the parser name using the --reasoning-parser argument. This parser will process the model's output and extract the reasoning_content field.

python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/your/model \
    --enable-mm \
    --tensor-parallel-size 8 \
    --port 8192 \
    --quantization wint4 \
    --reasoning-parser ernie-45-vl

Next, send a request to the model; the response will include the extracted reasoning content.

curl -X POST "http://0.0.0.0:8192/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
      {"type": "text", "text": "Which era does the cultural relic in the picture belong to"}
    ]}
  ],
  "metadata": {"enable_thinking": true}
}'

The reasoning_content field contains the reasoning steps to reach the final conclusion, while the content field holds the conclusion itself.
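
A minimal non-streaming sketch of reading both fields with the OpenAI Python client; getattr is used because reasoning_content is an extension field that the SDK's typed message object does not declare:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8192/v1")

response = client.chat.completions.create(
    model="vl",  # served model name; adjust to your deployment
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "Which era does the cultural relic in the picture belong to"},
        ]},
    ],
    metadata={"enable_thinking": True},
)
message = response.choices[0].message
print(getattr(message, "reasoning_content", None))  # the reasoning steps
print(message.content)                              # the final conclusion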

Streaming Chat Completions

Streaming chat completions are also supported for reasoning models. The reasoning_content field is available in the delta field of chat completion response chunks.

from openai import OpenAI

# Point the client at the FastDeploy OpenAI-compatible API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8192/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "Which era does the cultural relic in the picture belong to"},
        ]},
    ],
    model="vl",
    stream=True,
    metadata={"enable_thinking": True},
)
# Each streamed chunk carries incremental text in `delta`: reasoning tokens
# in delta.reasoning_content and answer tokens in delta.content.
for chunk in chat_response:
    if chunk.choices[0].delta is not None:
        print(chunk.choices[0].delta, end='')
        print("\n")