Deploy ERNIE-4.5-300B-A47B Model
This document explains how to deploy the ERNIE-4.5 model. Before starting the deployment, please ensure that your hardware environment meets the following requirements:
- GPU Driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
- Linux X86_64
- Python >= 3.10
- 4x 80G A/H GPUs (8 GPUs for wint8; see the note under "Start the Service")
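If you want to sanity-check the environment first, the short Python sketch below prints the Python version, driver version, and per-GPU memory via nvidia-smi (the CUDA and cuDNN versions still need to be checked separately, e.g. with nvcc --version):
import subprocess
import sys

# Python >= 3.10 is required.
assert sys.version_info >= (3, 10), f"Python 3.10+ required, got {sys.version}"

# Query driver version and total memory for every visible GPU.
rows = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version,memory.total", "--format=csv,noheader"],
    text=True,
).strip().splitlines()
print(f"Visible GPUs: {len(rows)}")
for row in rows:
    driver, memory = (field.strip() for field in row.split(","))
    print(f"driver {driver}, memory {memory}")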
For FastDeploy installation, refer to the Installation Guide.
Prepare the Model
Specify --model baidu/ERNIE-4.5-300B-A47B-Paddle during deployment to download the model automatically from AIStudio, with support for resumable transfers. Alternatively, you can download the model manually from other sources. Note that FastDeploy requires the model in Paddle format. For more details, see the Supported Models List.
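As one example of a manual download, the sketch below uses huggingface_hub's snapshot_download; the repo id is assumed to mirror the AIStudio model name, and the resulting local path can then be passed to --model in place of the model name:
from huggingface_hub import snapshot_download

# Assumption: the Hugging Face repo id mirrors the AIStudio model name.
snapshot_download(
    repo_id="baidu/ERNIE-4.5-300B-A47B-Paddle",
    local_dir="./ERNIE-4.5-300B-A47B-Paddle",  # pass this path to --model
)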
Start the Service
💡 Note: Since the model parameter size is 300B-A47B, on an 80G * 8-GPU machine, specify --quantization wint4 (wint8 is also supported; wint4 requires 4 GPUs and wint8 requires 8 GPUs).
Execute the following command to start the service. For configuration details, refer to the Parameter Guide:
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \
--cache-queue-port 8183 --metrics-port 8182 \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 32
Request the Service
After starting the service, the following output indicates successful initialization:
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8182/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
Health Check
Verify service status (HTTP 200 indicates success):
curl -i http://0.0.0.0:8180/health
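Model loading can take a while at this scale, so you may want to poll the endpoint instead of checking once. A minimal wait-until-ready loop in Python (using the requests package; the port is assumed to match the --port 8180 setting above):
import time

import requests

URL = "http://0.0.0.0:8180/health"
while True:
    try:
        # HTTP 200 indicates the service is up and ready for requests.
        if requests.get(URL, timeout=5).status_code == 200:
            print("service is ready")
            break
    except requests.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(5)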
cURL Request
Send requests to the service with the following command:
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Write me a poem about large language model."}
]
}'
Python Client (OpenAI-compatible API)
FastDeploy's API is OpenAI-compatible, so you can also send requests with the openai Python client:
import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
model="null",
messages=[
{"role": "system", "content": "I'm a helpful AI assistant."},
{"role": "user", "content": "Write me a poem about large language model."},
],
stream=True,
)
# Print tokens as they stream in; the final chunk may carry no content.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')
print('\n')
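If you don't need token-by-token output, the same call also works without streaming; a short variant reusing the client created above:
# Non-streaming variant: the full completion arrives in one response object.
response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "user", "content": "Write me a poem about large language model."},
    ],
    stream=False,
)
print(response.choices[0].message.content)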