Disaggregated Deployment
Large model inference consists of two phases, Prefill and Decode, which are compute-intensive and memory-access-intensive respectively. In certain scenarios, deploying Prefill and Decode separately can improve hardware utilization, effectively increase throughput, and reduce overall request latency.
- Prefill phase: Processes all input tokens (e.g., the user prompt) in a single forward pass and generates the first token.
- Decode phase: Starting from the first generated token, generates one token at a time autoregressively until a stop token is reached. For N output tokens, the Decode phase requires N-1 forward passes, which must be executed serially. As generation proceeds, the number of tokens to attend to grows, so the computation per step gradually increases.
The core of disaggregated deployment is to deploy Prefill and Decode on different computing resources to improve their respective utilization. To achieve disaggregated deployment, communication between Prefill and Decode must be considered. During actual inference, Prefill needs to transmit the computed KV Cache to the Decode instance, which then reads the KV Cache for continuation.
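The split can be pictured with a small, self-contained sketch in plain Python (toy data only; the function and variable names here are illustrative assumptions, not FastDeploy APIs): Prefill consumes the whole prompt once and produces the first token plus the KV Cache, which is then handed to a Decode loop that extends it one token at a time.
# Toy illustration of the Prefill / Decode split (not FastDeploy internals).
# "kv_cache" stands in for the per-layer key/value tensors a real model keeps.
def prefill(prompt_tokens):
    # One forward pass over the whole prompt: compute-intensive.
    kv_cache = [("kv", t) for t in prompt_tokens]   # placeholder KV entries
    first_token = len(prompt_tokens)                # placeholder "prediction"
    return first_token, kv_cache

def decode(first_token, kv_cache, max_new_tokens=8, stop_token=None):
    # Autoregressive loop: one token per step, executed serially.
    output, token = [first_token], first_token
    for _ in range(max_new_tokens - 1):
        kv_cache.append(("kv", token))              # attention span keeps growing
        token += 1                                  # placeholder "prediction"
        if token == stop_token:
            break
        output.append(token)
    return output

first, cache = prefill([101, 102, 103])             # runs on the Prefill instance
print(decode(first, cache))                         # runs on the Decode instance after the KV Cache is transferred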
KV Cache Transmission Methods
We provide two transmission methods for KV Cache, targeting intra-machine and inter-machine scenarios respectively.
Intra-machine Transmission
Uses cudaMemcpyPeer for KV Cache transmission between two GPUs within a single machine, offering low latency and high throughput.
Inter-machine Transmission
For transmission between multiple machines, a high-speed RDMA network is used for KV Cache transmission. We provide the rdma_comm high-speed transmission network library for cross-machine KV Cache transfer.
PD Disaggregated Scheduling
Building on the global scheduler, FastDeploy supports a PD disaggregated scheduling strategy designed specifically for large language model inference, decoupling the two phases of the inference process:
* Prefill phase: Builds the KV Cache; compute-intensive, with high memory usage but low latency.
* Decode phase: Performs autoregressive decoding; a serial process that is time-consuming but has low memory usage.
In multi-instance scenarios, each incoming request must be assigned to a Prefill instance and a Decode instance according to the configured strategy. Through role separation (Prefill nodes handle request reception and processing, Decode nodes complete the subsequent generation), resource allocation can be controlled more finely to improve throughput and GPU utilization.
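A toy sketch of role-separated routing (this is not FastDeploy's scheduler; the least-loaded policy and all names below are illustrative assumptions): each request is paired with one Prefill and one Decode instance so the KV Cache handover target is known up front.
# Toy request routing across Prefill / Decode instances (illustrative only).
prefill_instances = {"prefill-0": 0, "prefill-1": 0}   # instance -> queued requests
decode_instances = {"decode-0": 0, "decode-1": 0}

def assign(request_id):
    # Simple least-loaded policy; a real scheduler may also weigh prompt length,
    # cache locality, instance health, etc.
    p = min(prefill_instances, key=prefill_instances.get)
    d = min(decode_instances, key=decode_instances.get)
    prefill_instances[p] += 1
    decode_instances[d] += 1
    return {"request": request_id, "prefill": p, "decode": d}

for i in range(4):
    print(assign(i))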
Usage Instructions
Single-machine Disaggregated Deployment
Online Inference Service
Use the following commands for service deployment:
Prefill Instance
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "prefill"
Decode Instance
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
# Note: innode-prefill-ports should specify the engine-worker-queue-port of the Prefill service
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port 8186 \
--cache-queue-port 8187 \
--tensor-parallel-size 4 \
--quantization wint4 \
--innode-prefill-ports 8182 \
--splitwise-role "decode"
Note: When requesting a single-machine PD disaggregated service, send requests to the Decode service's port.
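For example, a request can be sent to the Decode service's port (8184 above) with any OpenAI-compatible client; the base URL, API key placeholder, and model name below are assumptions and should be adjusted to your deployment.
# pip install openai
from openai import OpenAI

# Point the client at the Decode instance started above (port 8184 in this example).
client = OpenAI(base_url="http://127.0.0.1:8184/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ERNIE-4.5-300B-A47B-BF16",  # assumed to match the --model value used above
    messages=[{"role": "user", "content": "Introduce PD disaggregated deployment in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)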
Offline Inference Service
Refer to the example code offline_disaggregated_demo.py in the fastdeploy/demo directory for offline inference service deployment.
Multi-machine Disaggregated Deployment
Prerequisite: Redis
- Installation via conda
# Install
conda install redis
# Start
nohup redis-server > redis.log 2>&1 &
- Installation via apt
# Install
sudo apt install redis-server -y
# Start
sudo systemctl start redis-server
- Installation via yum
# Install
sudo yum install redis -y
# Start
sudo systemctl start redis
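Before starting the services, it can help to verify that Redis is reachable on the host and port that will be passed to --scheduler-host / --scheduler-port. A minimal check using the redis Python package (an extra dependency, not required by FastDeploy itself):
# pip install redis
import redis

# 127.0.0.1:6379 matches the --scheduler-host / --scheduler-port values used below.
r = redis.Redis(host="127.0.0.1", port=6379)
print(r.ping())  # prints True if the Redis server is up and reachable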
Online Inference Service
For multi-machine deployment, confirm that the NIC supports RDMA and that all nodes in the cluster have network connectivity.
Note:
* KVCACHE_RDMA_NICS specifies the RDMA NICs of the current machine, with multiple NICs separated by commas.
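One way to see which RDMA device names exist on a Linux machine is to list the InfiniBand sysfs directory, as in the sketch below (the sysfs path is standard on Linux; mapping NICs to the GPUs used by each instance is deployment-specific and not covered here):
import os

# RDMA-capable devices (e.g. mlx5_0, mlx5_1, ...) appear under this sysfs path on Linux.
rdma_path = "/sys/class/infiniband"
nics = sorted(os.listdir(rdma_path)) if os.path.isdir(rdma_path) else []
print(",".join(nics))  # comma-separated list, suitable for KVCACHE_RDMA_NICS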
Prefill Instance
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
export KVCACHE_RDMA_NICS="mlx5_2,mlx5_3,mlx5_4,mlx5_5"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--tensor-parallel-size 4 \
--quantization wint4 \
--cache-transfer-protocol "rdma,ipc" \
--rdma-comm-ports "7671,7672,7673,7674" \
--pd-comm-port "2334" \
--splitwise-role "prefill" \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000
Decode Instance
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
export KVCACHE_RDMA_NICS="mlx5_2,mlx5_3,mlx5_4,mlx5_5"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port 8186 \
--cache-queue-port 8187 \
--tensor-parallel-size 4 \
--quantization wint4 \
--scheduler-name "splitwise" \
--cache-transfer-protocol "rdma,ipc" \
--rdma-comm-ports "7671,7672,7673,7674" \
--pd-comm-port "2334" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
--splitwise-role "decode"
Parameter Description
- --splitwise-role: Specifies whether the current service is prefill or decode
- --cache-queue-port: Specifies the cache service port for communication between prefill and decode services
Single-machine Parameters
- --innode-prefill-ports: Required only for the Decode instance; specifies the engine-worker-queue ports of the Prefill instances to connect to
Multi-machine Parameters
- --cache-transfer-protocol: Specifies KV Cache transmission protocol, supports ipc and rdma, default is ipc
- --scheduler-name: For PD disaggregation, set to "splitwise"
- --scheduler-host: Redis address to connect to
- --scheduler-port: Redis port to connect to
- --scheduler-ttl: Specifies Redis TTL time in seconds
- --pd-comm-port: Specifies PD communication port
- --rdma-comm-ports: Specifies the RDMA communication ports, multiple ports separated by commas; the number of ports should match the number of GPUs used by the instance