Deploy ERNIE-4.5-VL-28B-A3B-Thinking Multimodal Thinking Model

This document explains how to deploy the ERNIE-4.5-VL multimodal model, which supports user interaction via multimodal data and tool calls (including tool calls that involve multimodal data). Ensure your hardware meets the following requirements before deployment:

  • GPU Driver >= 535
  • CUDA >= 12.3
  • CUDNN >= 9.5
  • Linux X86_64
  • Python >= 3.10
  • 1× 80 GB A- or H-series GPU (e.g., A100/H100)

Refer to the Installation Guide for FastDeploy setup.

Prepare the Model

Specify --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking during deployment to automatically download the model from AIStudio with resumable downloads. You can also manually download the model from other sources. Note that FastDeploy requires Paddle-format models. For more details, see Supported Models.
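
If you prefer to fetch the weights yourself, a download utility such as huggingface_hub can pre-fetch them to a local directory. The sketch below assumes the model is mirrored on Hugging Face under the same id; Paddle-format variants may use a different repo id, so check Supported Models first:

# Hedged sketch: pre-download the weights, then point --model at the
# local directory. The repo id is an assumption; Paddle-format mirrors
# may carry a different suffix.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
    local_dir="./ERNIE-4.5-VL-28B-A3B-Thinking",
)
print(local_dir)  # pass this path to --model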

Launch the Service

Execute the following command to start the service. For parameter configurations, refer to Parameter Guide.

python -m fastdeploy.entrypoints.openai.api_server \
        --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
        --max-model-len 131072 \
        --max-num-seqs 32 \
        --port 8180 \
        --quantization wint8 \
        --reasoning-parser ernie-45-vl-thinking \
        --tool-call-parser ernie-45-vl-thinking \
        --mm-processor-kwargs '{"image_max_pixels": 12845056 }'
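
Note on mm-processor-kwargs: image_max_pixels caps the image preprocessing resolution; 12845056 equals 3584 × 3584, so larger inputs are presumably downscaled before encoding. Lowering the value should reduce the per-image token count at the cost of visual detail.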

Request the Service

After launching, the service is ready when the following logs appear:

api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO:     Started server process [13909]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)

Health Check

Verify service status (HTTP 200 indicates success):

curl -i http://0.0.0.0:8180/health
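
In scripts, you can poll the same endpoint until it responds; a minimal readiness-wait sketch using the requests library:

# Poll /health until the server answers with HTTP 200 (host and port
# match the launch command above).
import time
import requests

while True:
    try:
        if requests.get("http://0.0.0.0:8180/health", timeout=2).status_code == 200:
            print("service is ready")
            break
    except requests.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(5)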

cURL Request

Send requests as follows:

curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d $'{
  "messages": [
    {"role": "user", "content": "Rewrite Li Bai\'s \'Quiet Night Thoughts\' as a modern poem"}
  ]
}'

For image inputs:

curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": [
      {"type":"image_url", "image_url": {"url":"https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
      {"type":"text", "text":"From which era does the artifact in the image originate?"}
    ]}
  ]
}'
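
Local images can usually be sent inline as base64 data URLs in the same image_url field; the sketch below assumes FastDeploy accepts this common OpenAI-compatible convention (the file name is hypothetical):

# Sketch: embed a local file as a data URL for the image_url field.
import base64

with open("example.jpg", "rb") as f:  # hypothetical local file
    b64 = base64.b64encode(f.read()).decode()

image_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
}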

For video inputs:

curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": [
      {"type":"video_url", "video_url": {"url":"https://bj.bcebos.com/v1/paddlenlp/datasets/paddlemix/demo_video/example_video.mp4"}},
      {"type":"text", "text":"How many apples are in the scene?"}
    ]}
  ]
}'

If the request includes tool definitions, send it with the command below:

curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d $'{
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "image_zoom_in_tool",
                "description": "Zoom in on a specific region of an image by cropping it based on a bounding box (bbox) and an optional object label.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "bbox_2d": {
                            "type": "array",
                            "items": {
                                "type": "number"
                            },
                            "minItems": 4,
                            "maxItems": 4,
                            "description": "The bounding box of the region to zoom in, as [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner, and the values of x1, y1, x2, y2 are all normalized to the range 0–1000 based on the original image dimensions."
                        },
                        "label": {
                            "type": "string",
                            "description": "The name or label of the object in the specified bounding box (optional)."
                        }
                    },
                    "required": [
                        "bbox_2d"
                    ]
                },
                "strict": false
            }
        }
    ],
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Is the old lady on the left side of the empty table behind older couple?"
                }
            ]
        }
    ],
    "stream": false
}'
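
On the client side, the returned tool call arrives with a JSON-encoded arguments string (see the output example at the end of this page); a small sketch for decoding it from a parsed response body:

# Sketch: pull the first tool call out of a non-streaming response body
# (body is the parsed JSON returned by the request above).
import json

def first_tool_call(body: dict):
    message = body["choices"][0]["message"]
    if not message.get("tool_calls"):
        return None
    call = message["tool_calls"][0]
    # "arguments" is a JSON string, e.g. '{"bbox_2d": [285, 235, 999, 652]}'
    return call["function"]["name"], json.loads(call["function"]["arguments"])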

For multi-turn requests that carry tool results in the conversation history, use the command below:

curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d $'{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Get the current weather in Beijing"
                }
            ]
        },
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_1",
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "arguments": {
                            "location": "Beijing",
                            "unit": "c"
                        }
                    }
                }
            ],
            "content": ""
        },
        {
            "role": "tool",
            "content": [
                {
                    "type": "text",
                    "text": "location: Beijing,temperature: 23,weather: sunny,unit: c"
                }
            ]
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Determine weather in my location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": [
                                "c",
                                "f"
                            ]
                        }
                    },
                    "additionalProperties": false,
                    "required": [
                        "location",
                        "unit"
                    ]
                },
                "strict": true
            }
        }
    ],
    "stream": false
}'
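
The same round trip can be driven from Python with the OpenAI-compatible client; in the sketch below the weather result is hard-coded in place of a real get_weather implementation:

# Sketch of the multi-turn tool flow: request a tool call, execute it
# yourself, append the assistant turn and the tool result, then ask for
# the final answer.
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="null")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Determine weather in my location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["c", "f"]},
            },
            "required": ["location", "unit"],
        },
    },
}]

messages = [{"role": "user", "content": "Get the current weather in Beijing"}]
first = client.chat.completions.create(model="null", messages=messages, tools=tools)
assistant = first.choices[0].message

if assistant.tool_calls:
    messages.append(assistant.model_dump(exclude_none=True))  # assistant turn with tool_calls
    messages.append({
        "role": "tool",
        "content": "location: Beijing, temperature: 23, weather: sunny, unit: c",
    })
    final = client.chat.completions.create(model="null", messages=messages, tools=tools)
    print(final.choices[0].message.content)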

Python Client (OpenAI-compatible API)

FastDeploy's API is OpenAI-compatible. You can also use Python for streaming requests:

import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "From which era does the artifact in the image originate?"},
        ]},
    ],
    stream=True,
)
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')
print('\n')
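
With a reasoning parser enabled, streaming deltas may also carry a reasoning_content field alongside content; this is an extension of the standard OpenAI schema, so the sketch below reads it defensively with getattr:

# Hedged variant of the loop above that also prints streamed reasoning
# (reasoning_content on the delta is an assumption about the parser).
for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end='')
    if delta.content:
        print(delta.content, end='')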

Model Output

In the output, reasoning content appears in the reasoning_content field, the final response in content, and any tool calls in tool_calls.

Example of a non-streaming result without a tool call:

{
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The artifact in the image ...",
                "multimodal_content": null,
                "reasoning_content": "The user asks about ...",
                "tool_calls": null
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 1290,
        "total_tokens": 1681,
        "completion_tokens": 391,
        "prompt_tokens_details": {
            "cached_tokens": 0,
            "image_tokens": 1240,
            "video_tokens": 0
        },
        "completion_tokens_details": {
            "reasoning_tokens": 217,
            "image_tokens": 0
        }
    }
}

Example of a non-streaming result with a tool call, where the content field is empty and finish_reason is tool_calls:

{
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "",
                "multimodal_content": null,
                "reasoning_content": "What immediately stands out is that I need to determine the spatial relationship between the old lady, the empty table, and the older couple. The original image might not provide enough detail to make this determination clearly, so I should use the image_zoom_in_tool to focus on the relevant area where these elements are located.\n",
                "tool_calls": [
                    {
                        "id": "chatcmpl-tool-dd0ef62027cf409c8f013af65f88adc3",
                        "type": "function",
                        "function": {
                            "name": "image_zoom_in_tool",
                            "arguments": "{\"bbox_2d\": [285, 235, 999, 652]}"
                        }
                    }
                ]
            },
            "finish_reason": "tool_calls"
        }
    ],
    "usage": {
        "prompt_tokens": 280,
        "total_tokens": 397,
        "completion_tokens": 117,
        "prompt_tokens_details": {
            "cached_tokens": 0,
            "image_tokens": 0,
            "video_tokens": 0
        },
        "completion_tokens_details": {
            "reasoning_tokens": 66,
            "image_tokens": 0
        }
    }
}