Data Parallelism (DP)
Data Parallelism (DP) is a distributed inference approach in which incoming requests are routed across multiple identical model replicas, with each replica independently handling inference for its assigned requests.
In practice, especially when deploying Mixture-of-Experts (MoE) models, Data Parallelism (DP) is often combined with Expert Parallelism (EP). Each DP service independently performs the Attention computation, while all DP services collaboratively participate in the MoE computation, thereby improving overall inference performance.
FastDeploy supports DP-based inference and provides the multi_api_server interface to launch multiple inference services simultaneously.

Launching FastDeploy Services
Taking the ERNIE-4.5-300B model as an example, the following command launches a service with DP=8, TP=1, EP=8:
export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
--num-servers 8 \
--ports "1811,1822,1833,1844,1855,1866,1877,1888" \
--metrics-ports "3101,3201,3301,3401,3501,3601,3701,3801" \
--args --model ERNIE-4_5-300B-A47B-FP8-Paddle \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--max-model-len 12288 \
--max-num-seqs 64 \
--num-gpu-blocks-override 256 \
--enable-expert-parallel
Parameter Description
- num-servers: Number of DP service instances to launch.
- ports: API server ports for the DP service instances. The number of ports must match num-servers.
- metrics-ports: Metrics server ports for the DP service instances. The number must match num-servers. If not specified, available ports will be allocated automatically.
- args: Arguments passed to each DP service instance. Refer to the parameter documentation for details.
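After startup, each DP instance exposes its own OpenAI-compatible API port and can be queried directly. The snippet below is a minimal sketch, assuming the default /v1/chat/completions route and standard OpenAI-style request fields:
# Minimal sketch: send a chat completion request to the first DP instance (port 1811).
# The route and request fields are assumed to follow the OpenAI-compatible API.
curl -s http://127.0.0.1:1811/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ERNIE-4_5-300B-A47B-FP8-Paddle",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'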
Request Scheduling
After launching multiple DP services using the data parallel strategy, incoming user requests must be distributed across the services by a scheduler to achieve load balancing.
Web Server–Based Scheduling
Once the IP addresses and ports of the DP service instances are known, common web servers (such as Nginx) can be used to distribute requests among them; a minimal configuration sketch is shown below.
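For illustration, the sketch below writes a simple round-robin Nginx configuration that proxies to the eight API ports launched above; the file path, listen port, and load-balancing policy are assumptions to adapt to your deployment:
# Sketch only: round-robin proxy across the eight DP API ports (run as root).
cat > /etc/nginx/conf.d/fastdeploy_dp.conf <<'EOF'
upstream fastdeploy_dp {
    server 127.0.0.1:1811;
    server 127.0.0.1:1822;
    server 127.0.0.1:1833;
    server 127.0.0.1:1844;
    server 127.0.0.1:1855;
    server 127.0.0.1:1866;
    server 127.0.0.1:1877;
    server 127.0.0.1:1888;
}
server {
    listen 8080;
    location / {
        proxy_pass http://fastdeploy_dp;
    }
}
EOF
# Reload Nginx to apply the new configuration.
nginx -s reload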
FastDeploy Router
FastDeploy provides a Python-based Router to handle request reception and scheduling. A high-performance Router implementation is currently under development.
The usage and request scheduling workflow is as follows:
- Start the Router
- Start FastDeploy service instances (either single-DP or multi-DP), which register themselves with the Router
- User requests are sent to the Router
- The Router selects an appropriate service instance based on the global load status
- The Router forwards the request to the selected instance for inference
- The Router receives the generated result from the instance and returns it to the user
Quick Start Example
Launching the Router
Start the Router service. Logs are written to log_router/router.log.
export FD_LOG_DIR="log_router"
python -m fastdeploy.router.launch \
--host 0.0.0.0 \
--port 30000
Launching DP Services with the Router
Again using the ERNIE-4.5-300B model as an example, the following command launches DP=8, TP=1, EP=8 services and registers them with the Router via the --router argument:
export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
--num-servers 8 \
--ports "1811,1822,1833,1844,1855,1866,1877,1888" \
--metrics-ports "3101,3201,3301,3401,3501,3601,3701,3801" \
--args --model ERNIE-4_5-300B-A47B-FP8-Paddle \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--max-model-len 12288 \
--max-num-seqs 64 \
--num-gpu-blocks-override 256 \
--enable-expert-parallel \
--router "0.0.0.0:30000"