Deploy ERNIE-4.5-VL-28B-A3B-Paddle Multimodal Model in 10 Minutes
Before deployment, please ensure your environment meets the following requirements:
- GPU Driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
- Linux X86_64
- Python >= 3.10
- Hardware configuration meets minimum requirements (refer to Supported Models)
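Before installing, you can sanity-check part of this list from Python. The sketch below is nothing FastDeploy-specific, just `nvidia-smi` plus the interpreter version, and only covers the driver and Python requirements:

```python
# Hedged sanity check for the requirements above: queries the NVIDIA driver
# version via nvidia-smi and verifies the Python interpreter version.
import subprocess
import sys

assert sys.version_info >= (3, 10), f"Python >= 3.10 required, found {sys.version}"

# nvidia-smi ships with the NVIDIA driver; a FileNotFoundError here usually
# means the driver is missing or nvidia-smi is not on PATH.
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip().splitlines()[0]
assert int(driver.split(".")[0]) >= 535, f"GPU driver >= 535 required, found {driver}"
print(f"OK: Python {sys.version_info[:3]}, driver {driver}")
```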
For quick deployment across different hardware, this guide uses the ERNIE-4.5-VL-28B-A3B-Paddle multimodal model as an example; it runs on most hardware configurations.
For more information about how to install FastDeploy, refer to the installation document.
💡 Note: All ERNIE multimodal models support reasoning. Enable or disable it per request by setting `enable_thinking` (see the examples below).
1. Launch Service
After installing FastDeploy, run the following command in a terminal to start the service. For details on the startup options, refer to the Parameter Description.
```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --reasoning-parser ernie-45-vl \
    --enable-mm
```
💡 Note: If the path specified by `--model` does not exist as a subdirectory of the current directory, FastDeploy queries AIStudio for a preset model matching the specified name (such as `baidu/ERNIE-4.5-0.3B-Base-Paddle`). If one exists, it is downloaded automatically; the default download path is `~/xx`. For instructions and configuration of automatic model download, see Model Download.
- `--max-model-len`: the maximum number of tokens supported by the deployed service.
- `--max-num-seqs`: the maximum number of sequences the service processes concurrently.
- `--reasoning-parser`: the parser used to extract the model's thinking content.
- `--enable-mm`: whether to enable multimodal support.
Related documents:
- Service Deployment
- Service Monitoring
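Because the launch command above exposes metrics on `--metrics-port` (8181 here), you can spot-check them programmatically. A minimal sketch, assuming the `requests` package is installed:

```python
import requests

# The metrics endpoint comes from --metrics-port in the launch command above.
resp = requests.get("http://0.0.0.0:8181/metrics", timeout=5)
resp.raise_for_status()

# Print the first few lines of the Prometheus-format text exposition.
for line in resp.text.splitlines()[:10]:
    print(line)
```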
2. Request the Service
After starting the service, the following output indicates successful initialization:
```
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```
Health Check
Verify service status (HTTP 200 indicates success):
```shell
curl -i http://0.0.0.0:8180/health
```
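If you script the deployment, you can poll `/health` until the service is ready rather than checking by hand. A minimal sketch, assuming the default host and port from the launch command and that `requests` is installed (the timeout and poll interval are arbitrary choices):

```python
import time
import requests

def wait_until_healthy(url="http://0.0.0.0:8180/health", timeout_s=300):
    """Poll the /health endpoint until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(2)
    return False

assert wait_until_healthy(), "service did not become healthy in time"
```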
cURL Request
Send requests to the service with the following command:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
        {"type": "text", "text": "What era does this artifact belong to?"}
      ]}
    ],
    "metadata": {"enable_thinking": false}
  }'
```
Python Client (OpenAI-compatible API)
FastDeploy's API is OpenAI-compatible, so you can also send requests with the `openai` Python client:
```python
import openai

host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "What era does this artifact belong to?"},
        ]},
    ],
    metadata={"enable_thinking": False},  # Python bool, not JSON-style false
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
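To exercise the reasoning capability noted at the top, set `enable_thinking` to `True`. The sketch below reuses `client` from the previous snippet, this time without streaming; note that exposing the parsed thinking text as a `reasoning_content` field is an assumption inferred from the `--reasoning-parser` flag, so adjust to whatever your server actually returns:

```python
# Reuses `client` from the previous snippet; non-streaming this time.
response = client.chat.completions.create(
    model="null",
    messages=[{"role": "user", "content": "Briefly: what is the Tang dynasty known for?"}],
    metadata={"enable_thinking": True},  # enable reasoning for this request
)
message = response.choices[0].message
# ASSUMPTION: the ernie-45-vl reasoning parser may surface thinking text as
# `reasoning_content`; fall back to None if the field is absent.
print(getattr(message, "reasoning_content", None))
print(message.content)
```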