Skip to content

Kunlunxin XPU

Requirements

  • OS: Linux
  • Python: 3.10
  • XPU Model: P800
  • XPU Driver Version: ≥ 5.0.21.10
  • XPU Firmware Version: ≥ 1.31

Verified platform: - CPU: INTEL(R) XEON(R) PLATINUM 8563C / Hygon C86-4G 7490 64-core Processor - Memory: 2T - Disk: 4T - OS: CentOS release 7.6 (Final) - Python: 3.10 - XPU Model: P800 (OAM Edition) - XPU Driver Version: 5.0.21.10 - XPU Firmware Version: 1.31

Note: Currently, only INTEL or Hygon CPU-based P800 (OAM Edition) servers have been verified. Other CPU types and P800 (PCIe Edition) servers have not been tested yet.

mkdir Work
cd Work
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0
docker run --name fastdeploy-xpu --net=host -itd --privileged -v $PWD:/Work -w /Work \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0 \
    /bin/bash
docker exec -it fastdeploy-xpu /bin/bash

2. Set up using pre-built wheels

Install PaddlePaddle

python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/

Alternatively, you can install the latest version of PaddlePaddle (Not recommended)

python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/

Install FastDeploy (Do NOT install via PyPI source)

python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

Alternatively, you can install the latest version of FastDeploy (Not recommended)

python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

3. Build wheel from source

Install PaddlePaddle

python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/

Alternatively, you can install the latest version of PaddlePaddle (Not recommended)

python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/

Download Kunlunxin Toolkit (XTDK) and XVLLM library, then set their paths.

# XTDK
wget https://klx-sdk-release-public.su.bcebos.com/xtdk_15fusion/dev/3.2.40.1/xtdk-llvm15-ubuntu2004_x86_64.tar.gz
tar -xvf xtdk-llvm15-ubuntu2004_x86_64.tar.gz && mv xtdk-llvm15-ubuntu2004_x86_64 xtdk
export CLANG_PATH=$(pwd)/xtdk

# XVLLM
wget https://klx-sdk-release-public.su.bcebos.com/xinfer/daily/eb/20250624/output.tar.gz
tar -xvf output.tar.gz && mv output xvllm
export XVLLM_PATH=$(pwd)/xvllm

Alternatively, you can download the latest versions of XTDK and XVLLM (Not recommended)

XTDK: https://klx-sdk-release-public.su.bcebos.com/xtdk_15fusion/dev/latest/xtdk-llvm15-ubuntu2004_x86_64.tar.gz
XVLLM: https://klx-sdk-release-public.su.bcebos.com/xinfer/daily/eb/latest/output.tar.gz

Download FastDeploy source code, checkout the stable branch/TAG, then compile and install.

git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
bash build.sh

The compiled outputs will be located in the FastDeploy/dist directory.

Installation verification

python -c "import paddle; paddle.version.show()"
python -c "import paddle; paddle.utils.run_check()"
python -c "from paddle.jit.marker import unified"
python -c "from fastdeploy.model_executor.ops.xpu import block_attn"

If all the above steps execute successfully, FastDeploy is installed correctly.

Quick start

The P800 supports the deployment of the ERNIE-4.5-300B-A47B-Paddle model using the following configurations (Note: Different configurations may result in variations in performance). - 32K WINT4 with 8 XPUs (Recommended) - 128K WINT4 with 8 XPUs - 32K WINT4 with 4 XPUs

Online serving (OpenAI API-Compatible server)

Deploy an OpenAI API-compatible server using FastDeploy with the following commands:

Start service

Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and 32K context length on 8 XPUs(Recommended)

python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
    --port 8188 \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --max-num-seqs 64 \
    --quantization "wint4" \
    --gpu-memory-utilization 0.9

Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and 128K context length on 8 XPUs

python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
    --port 8188 \
    --tensor-parallel-size 8 \
    --max-model-len 131072 \
    --max-num-seqs 64 \
    --quantization "wint4" \
    --gpu-memory-utilization 0.9

Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and 32K context length on 4 XPUs

export XPU_VISIBLE_DEVICES="0,1,2,3"
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
    --port 8188 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --max-num-seqs 64 \
    --quantization "wint4" \
    --gpu-memory-utilization 0.9

Refer to Parameters for more options.

Send requests

Send requests using either curl or Python

curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "Where is the capital of China?"}
  ]
}'
import openai
host = "0.0.0.0"
port = "8188"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.completions.create(
    model="null",
    prompt="Where is the capital of China?",
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].text, end='')
print('\n')

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "user", "content": "Where is the capital of China?"},
    ],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')

For detailed OpenAI protocol specifications, see OpenAI Chat Compeltion API. Differences from the standard OpenAI protocol are documented in OpenAI Protocol-Compatible API Server.