PD Disaggregated Deployment Best Practices
This document provides a comprehensive guide to FastDeploy's PD (Prefill-Decode) disaggregated deployment solution, covering both single-machine and cross-machine deployment modes with support for Tensor Parallelism (TP), Data Parallelism (DP), and Expert Parallelism (EP).
1. Deployment Overview and Environment Preparation
This guide demonstrates deployment practices using the ERNIE-4.5-300B-A47B-Paddle model on H100 80GB GPUs. Below are the minimum GPU requirements for different deployment configurations:
Single-Machine Deployment (8 GPUs, Single Node)
| Configuration | TP | DP | EP | GPUs Required |
|---|---|---|---|---|
| P:TP4DP1 D:TP4DP1 | 4 | 1 | - | 8 |
| P:TP1DP4EP4 D:TP1DP4EP4 | 1 | 4 | ✓ | 8 |
Multi-Machine Deployment (16 GPUs, Cross-Node)
| Configuration | TP | DP | EP | GPUs Required |
|---|---|---|---|---|
| P:TP8DP1 D:TP8DP1 | 8 | 1 | - | 16 |
| P:TP4DP2 D:TP4DP2 | 4 | 2 | - | 16 |
| P:TP1DP8EP8 D:TP1DP8EP8 | 1 | 8 | ✓ | 16 |
Important Notes:
1. Quantization: All configurations above use WINT4 quantization, specified via --quantization wint4
2. EP Limitations: When Expert Parallelism (EP) is enabled, only TP=1 is currently supported; multi-TP scenarios are not yet available
3. Cross-Machine Network: Cross-machine deployment requires RDMA network support for high-speed KV Cache transmission
4. GPU Calculation: Total GPUs = TP × DP × 2, with identical configurations for both Prefill and Decode instances
5. CUDA Graph Capture: Decode instances enable CUDA Graph capture by default for inference acceleration, while Prefill instances do not
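The GPU calculation in note 4 can be sketched as a small shell helper (the function name required_gpus is illustrative, not part of FastDeploy):

```shell
# Total GPUs = TP x DP x 2 (one set each for the Prefill and Decode instances)
required_gpus() {
  tp=$1; dp=$2
  echo $(( tp * dp * 2 ))
}

required_gpus 4 1   # single-machine P:TP4DP1 D:TP4DP1 -> 8
required_gpus 1 8   # cross-machine P:TP1DP8EP8 D:TP1DP8EP8 -> 16
```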
1.1 Installing FastDeploy
Please refer to the FastDeploy Installation Guide to set up your environment.
For model downloads, please check the Supported Models List.
1.2 Deployment Topology
Single-Machine Deployment Topology
┌──────────────────────────────┐
│ Single Machine 8×H100 80GB │
│ ┌──────────────┐ │
│ │ Router │ │
│ │ 0.0.0.0:8109│ │
│ └──────────────┘ │
│ │ │
│ ┌────┴────┐ │
│ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ │
│ │Prefill │ │Decode │ │
│ │GPU 0-3 │ │GPU 4-7 │ │
│ └─────────┘ └─────────┘ │
└──────────────────────────────┘
Cross-Machine Deployment Topology
┌─────────────────────┐ ┌─────────────────────┐
│ Prefill Machine │ RDMA Network │ Decode Machine │
│ 8×H100 80GB │◄────────────────────►│ 8×H100 80GB │
│ │ │ │
│ ┌──────────────┐ │ │ │
│ │ Router │ │ │ │
│ │ 0.0.0.0:8109 │───┼──────────────────────┼────────── │
│ └──────────────┘ │ │ │ │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ ┌──────────────┐ │ │ ┌──────────────┐ │
│ │Prefill Nodes │ │ │ │Decode Nodes │ │
│ │GPU 0-7 │ │ │ │GPU 0-7 │ │
│ └──────────────┘ │ │ └──────────────┘ │
└─────────────────────┘ └─────────────────────┘
2. Single-Machine PD Disaggregated Deployment
2.1 Test Scenarios and Parallelism Configuration
This chapter demonstrates the P:TP4DP1 D:TP4DP1 configuration test scenario:
- Tensor Parallelism (TP): 4 — each instance shards the model weights across 4 GPUs
- Data Parallelism (DP): 1 — a single data parallelism group per instance
- Expert Parallelism (EP): Not enabled
To test other parallelism configurations, adjust parameters as follows:
1. TP Adjustment: Modify --tensor-parallel-size
2. DP Adjustment: Modify --data-parallel-size, ensuring --ports and --num-servers remain consistent with DP
3. EP Toggle: Add or remove --enable-expert-parallel
4. GPU Allocation: Control GPUs used by Prefill and Decode instances via CUDA_VISIBLE_DEVICES
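As an example of applying these rules, a P:TP1DP4EP4 Prefill launch on the same machine might look like the sketch below. The port numbers are illustrative assumptions; the remaining flags follow the startup scripts shown in this guide.

```shell
# Illustrative P:TP1DP4EP4 prefill launch; port values are assumptions
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.multi_api_server \
--ports 8188,8189,8190,8191 \
--num-servers 4 \
--args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
--splitwise-role "prefill" \
--cache-transfer-protocol "rdma,ipc" \
--router "0.0.0.0:8109" \
--quantization wint4 \
--tensor-parallel-size 1 \
--data-parallel-size 4 \
--enable-expert-parallel \
--max-model-len 8192 \
--max-num-seqs 64
```

Note how --ports lists 4 entries and --num-servers is 4, matching --data-parallel-size 4.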
2.2 Startup Scripts
Start Router
python -m fastdeploy.router.launch \
--port 8109 \
--splitwise
Note: This uses the Python version of the router. If needed, you can also use the high-performance Go version of the router.
Start Prefill Nodes
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model /path/to/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--splitwise-role "prefill" \
--cache-transfer-protocol "rdma,ipc" \
--router "0.0.0.0:8109" \
--quantization wint4 \
--tensor-parallel-size 4 \
--data-parallel-size 1 \
--max-model-len 8192 \
--max-num-seqs 64
Start Decode Nodes
export CUDA_VISIBLE_DEVICES=4,5,6,7
python -m fastdeploy.entrypoints.openai.api_server \
--model /path/to/ERNIE-4.5-300B-A47B-Paddle \
--port 8200 \
--splitwise-role "decode" \
--cache-transfer-protocol "rdma,ipc" \
--router "0.0.0.0:8109" \
--quantization wint4 \
--tensor-parallel-size 4 \
--data-parallel-size 1 \
--max-model-len 8192 \
--max-num-seqs 64
2.3 Key Parameter Descriptions
| Parameter | Description |
|---|---|
| --splitwise | Enable PD disaggregated mode |
| --splitwise-role | Node role: prefill or decode |
| --cache-transfer-protocol | KV Cache transfer protocol: rdma or ipc |
| --router | Router service address |
| --quantization | Quantization strategy (wint4/wint8/fp8, etc.) |
| --tensor-parallel-size | Tensor parallelism degree (TP) |
| --data-parallel-size | Data parallelism degree (DP) |
| --max-model-len | Maximum sequence length |
| --max-num-seqs | Maximum concurrent sequences |
| --num-gpu-blocks-override | GPU KV Cache block count override |
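For --num-gpu-blocks-override, a worst-case block count can be estimated from the concurrency and sequence-length settings. The block size of 64 tokens below is an assumption (FastDeploy's actual block size may differ), so treat this as back-of-envelope sizing only:

```shell
# kv_blocks is an illustrative helper: worst-case KV cache blocks needed if
# every concurrent sequence grows to the maximum model length
kv_blocks() {
  max_num_seqs=$1; max_model_len=$2; block_size=$3
  echo $(( (max_num_seqs * max_model_len + block_size - 1) / block_size ))
}

kv_blocks 64 8192 64   # settings used in this guide -> 8192 blocks
```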
3. Cross-Machine PD Disaggregated Deployment
3.1 Deployment Principles
Cross-machine PD disaggregation deploys Prefill and Decode instances on different physical machines:
- Prefill Machine: Runs the Router and Prefill nodes, responsible for prefill computation over input sequences
- Decode Machine: Runs Decode nodes, communicates with the Prefill machine over the RDMA network, responsible for autoregressive decoding generation
3.2 Test Scenarios and Parallelism Configuration
This chapter demonstrates the P:TP1DP8EP8 D:TP1DP8EP8 cross-machine configuration (16 GPUs total):
- Tensor Parallelism (TP): 1
- Data Parallelism (DP): 8 — 8 GPUs per machine, totaling 8 Prefill instances and 8 Decode instances
- Expert Parallelism (EP): Enabled — MoE experts are distributed across 8 GPUs for parallel computation
To test other cross-machine parallelism configurations, adjust parameters as follows:
1. Inter-Machine Communication: Ensure RDMA network connectivity between machines; Prefill machine needs KVCACHE_RDMA_NICS environment variable configured
2. Router Address: The --router parameter on the Decode machine must point to the actual IP address of the Prefill machine
3. Port Configuration: The number of ports in the --ports list must match --num-servers and --data-parallel-size
4. GPU Visibility: Each machine specifies its local GPUs via CUDA_VISIBLE_DEVICES
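The environment setup and the port/DP consistency rule above can be sanity-checked before launch. The NIC names in the KVCACHE_RDMA_NICS example are assumptions; substitute the RDMA devices actually present on your machine (e.g. as reported by ibv_devices from rdma-core):

```shell
# Example only: RDMA NIC names depend on your hardware (see `ibv_devices`)
export KVCACHE_RDMA_NICS="mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7"

# Verify that the number of ports matches --num-servers and --data-parallel-size
PORTS="8198,8199,8200,8201,8202,8203,8204,8205"
NUM_SERVERS=8
count=$(echo "$PORTS" | tr ',' '\n' | wc -l)
[ "$count" -eq "$NUM_SERVERS" ] && echo "port count OK" || echo "port count mismatch"
```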
3.3 Prefill Machine Startup Scripts
Start Router
unset http_proxy && unset https_proxy
python -m fastdeploy.router.launch \
--port 8109 \
--splitwise
Start Prefill Nodes
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m fastdeploy.entrypoints.openai.multi_api_server \
--ports 8198,8199,8200,8201,8202,8203,8204,8205 \
--num-servers 8 \
--args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
--splitwise-role "prefill" \
--cache-transfer-protocol "rdma,ipc" \
--router "<ROUTER_MACHINE_IP>:8109" \
--quantization wint4 \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--enable-expert-parallel \
--max-model-len 8192 \
--max-num-seqs 64
3.4 Decode Machine Startup Scripts
Start Decode Nodes
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m fastdeploy.entrypoints.openai.multi_api_server \
--ports 8198,8199,8200,8201,8202,8203,8204,8205 \
--num-servers 8 \
--args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
--splitwise-role "decode" \
--cache-transfer-protocol "rdma,ipc" \
--router "<PREFILL_MACHINE_IP>:8109" \
--quantization wint4 \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--enable-expert-parallel \
--max-model-len 8192 \
--max-num-seqs 64
Note: Please replace <PREFILL_MACHINE_IP> with the actual IP address of the Prefill machine.
4. Sending Test Requests
curl -X POST "http://localhost:8109/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "你好,请介绍一下自己。"}
],
"max_tokens": 100,
"stream": false
}'
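To exercise the streaming path, the same request can be sent with stream set to true. This assumes the deployment from the previous sections is running and the Router is listening on port 8109 as configured above:

```shell
# Streaming variant of the test request; responses arrive as incremental chunks
curl -X POST "http://localhost:8109/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Hello, please introduce yourself."}
],
"max_tokens": 100,
"stream": true
}'
```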
5. Frequently Asked Questions (FAQ)
If you encounter issues during use, please refer to FAQ for solutions.