# FastDeploy Code Structure Overview
This document provides a detailed overview of the FastDeploy codebase structure, helping developers quickly understand each module's functionality for development and feature extension.
## Directory Overview

```text
FastDeploy/
├── fastdeploy/   # Core code directory
├── custom_ops/   # C++/CUDA custom operators
├── tests/        # Unit tests
├── scripts/      # Utility scripts
├── tools/        # Development tools
├── docs/         # Documentation
├── examples/     # Example code
├── benchmarks/   # Performance benchmarks
├── dockerfiles/  # Docker image build files
└── setup.py      # Python package installation script
```
## I. Core Code Directory (fastdeploy/)

The main entry file `fastdeploy/__init__.py` exports the core classes:

- `LLM` - main entry class, offline inference interface
- `SamplingParams` - sampling parameter configuration
- `ModelRegistry` - model registry
- `version` - version information
### 1. engine/ - Core Engine Module

**Function:** Manages the LLM inference lifecycle and coordinates components.

| File | Function | Development Guide |
|---|---|---|
| `engine.py` | `LLMEngine` core engine class; manages the scheduler, preprocessor, and resource manager | Entry point for modifying engine behavior or adding new components |
| `async_llm.py` | Async LLM interface; `AsyncRequestQueue` request-queue management | Async inference, streaming-output development |
| `request.py` | Core request data structures: `Request`, `RequestOutput`, `RequestStatus` | Adding request fields, modifying request-processing logic |
| `sampling_params.py` | `SamplingParams` sampling parameter configuration | Adding new sampling-strategy parameters |
| `args_utils.py` | `EngineArgs` engine argument parsing | Adding new engine configuration parameters |
| `resource_manager.py` | GPU/CPU resource management | Resource allocation optimization |
**Subdirectory:**

- `sched/` - core scheduling implementation; contains `resource_manager_v1.py` (core scheduling logic)
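Sampling-parameter containers like the one in `sampling_params.py` typically pair generation controls (temperature, nucleus sampling, length caps) with eager validation so bad configs fail before reaching the engine. The sketch below illustrates that general shape; the field names and defaults are assumptions for illustration, not FastDeploy's actual definition.

```python
from dataclasses import dataclass


@dataclass
class SamplingParams:
    """Illustrative sketch of a sampling-parameter container.

    Field names and defaults are assumptions, not FastDeploy's real API.
    """

    temperature: float = 1.0  # softmax temperature; lower is more deterministic
    top_p: float = 1.0        # nucleus-sampling cumulative-probability cutoff
    top_k: int = -1           # -1 disables top-k filtering
    max_tokens: int = 16      # cap on generated tokens

    def __post_init__(self):
        # Validate eagerly so misconfigured requests fail at construction time.
        if not 0.0 <= self.top_p <= 1.0:
            raise ValueError(f"top_p must be in [0, 1], got {self.top_p}")
        if self.temperature < 0.0:
            raise ValueError(f"temperature must be >= 0, got {self.temperature}")


params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
print(params.max_tokens)
```

Validating in `__post_init__` keeps the object plain data while still rejecting out-of-range values at the API boundary.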
### 2. model_executor/ - Model Executor

**Function:** Core execution module for model inference, containing model definitions, layers, and operators.

#### 2.1 models/ - Model Implementations

| File/Directory | Function | Development Guide |
|---|---|---|
| `model_base.py` | `ModelRegistry` model registration base class | Must-read for adding new models |
| `deepseek_v3.py` | DeepSeek V3 model | MoE large-model reference |
| `ernie4_5_moe.py` | ERNIE 4.5 MoE model | Baidu's flagship model |
| `ernie4_5_mtp.py` | ERNIE 4.5 MTP multi-token prediction | Speculative-decoding model |
| `qwen2.py` | Qwen2 model | General model reference |
| `qwen3.py` | Qwen3 model | Latest model reference |
| `ernie4_5_vl/` | ERNIE 4.5 vision-language model | Multimodal model development reference |
| `qwen2_5_vl/` | Qwen2.5 VL multimodal model | VL model reference |
| `paddleocr_vl/` | PaddleOCR VL model | OCR multimodal reference |
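A registry such as `ModelRegistry` lets new model files plug in without touching the engine: each model maps an architecture string to its implementing class. The decorator-based sketch below shows the general pattern; all names here (`register_model`, `resolve_model`, the architecture string) are hypothetical, not FastDeploy's actual interface.

```python
# Hypothetical sketch of a decorator-based model registry.
_MODEL_REGISTRY: dict[str, type] = {}


def register_model(architecture: str):
    """Map an architecture string (as found in a model config) to a class."""
    def decorator(cls):
        _MODEL_REGISTRY[architecture] = cls
        return cls
    return decorator


@register_model("Qwen2ForCausalLM")
class Qwen2Model:
    def __init__(self, config: dict):
        self.config = config


def resolve_model(architecture: str) -> type:
    try:
        return _MODEL_REGISTRY[architecture]
    except KeyError:
        raise ValueError(f"unsupported architecture: {architecture}") from None


model_cls = resolve_model("Qwen2ForCausalLM")
print(model_cls.__name__)
```

With this pattern, adding a model is purely additive: define the class, apply the decorator, and the loader can resolve it by the architecture name in the checkpoint's config.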
#### 2.2 layers/ - Network Layer Implementations

| Subdirectory/File | Function | Development Guide |
|---|---|---|
| `attention/` | Attention implementations (flash_attn, append_attn, mla_attn) | First stop for attention performance optimization |
| `moe/` | MoE layer implementations (Cutlass, Triton, DeepGEMM backends) | MoE performance optimization |
| `quantization/` | Quantization layers (FP8, W4A8, WINT2, weight-only) | Quantization-scheme development |
| `linear.py` | Linear layer implementation | Matrix-multiplication optimization |
| `embeddings.py` | Embedding layer implementation | Word-embedding modification |
| `normalization.py` | Normalization layers (RMSNorm, LayerNorm) | Normalization optimization |
| `rotary_embedding.py` | Rotary position embedding (RoPE) | Position-encoding modification |
| `sample/` | Sampler implementation | Sampling-strategy development |
| `backends/` | Hardware backend implementations (cuda, xpu, dcu, hpu, metax, gcu, npu) | Entry point for new-hardware adaptation |
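As a reference point for `normalization.py`: RMSNorm, unlike LayerNorm, skips mean-centering and only rescales by the root mean square. The pure-Python version below is a readability sketch of the math, not the fused kernel an inference engine actually runs.

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.

    No mean subtraction and no bias, which is what distinguishes it from
    LayerNorm. Pure-Python reference for clarity only.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


out = rms_norm([1.0, 2.0, 2.0], [1.0, 1.0, 1.0])
print(out)
```

Dropping the mean-centering step saves a reduction pass, which is one reason RMSNorm is the default in most recent decoder-only models.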
#### 2.3 Other Submodules

| Directory | Function | Development Guide |
|---|---|---|
| `model_loader/` | Model weight loader | New model-format support |
| `guided_decoding/` | Guided decoding (JSON/regex-constrained output) | Structured-output development |
| `graph_optimization/` | Graph optimization (CUDA Graph) | Inference performance optimization |
| `logits_processor/` | Logits processors | Output-control logic |
| `ops/` | Python-callable operators (organized by hardware platform) | Operator call entry point |

**Key files:**

- `model_base.py` - model base class and registry definition
- `pre_and_post_process.py` - pre/post-processing utilities
### 3. scheduler/ - Scheduler Module

**Function:** Request scheduling, supporting single-node, distributed, and PD-disaggregation scenarios.

**Note:**

- Core scheduling logic is now mainly implemented in `engine/sched/resource_manager_v1.py`
- The schedulers in this directory are being gradually deprecated. For PD-disaggregation scheduling, use `router/` or `golang_router/`
| File | Function | Development Guide |
|---|---|---|
| `global_scheduler.py` | `GlobalScheduler` distributed scheduler (Redis) | (Being deprecated) |
| `local_scheduler.py` | `LocalScheduler` local scheduler | (Being deprecated) |
| `splitwise_scheduler.py` | `SplitwiseScheduler` PD-disaggregation scheduling | (Being deprecated; use the router) |
| `dp_scheduler.py` | Data-parallel scheduler | (Being deprecated) |
| `config.py` | `SchedulerConfig` scheduling configuration | Scheduling parameter tuning |
| `storage.py` | Storage adapter; wraps the Redis connection | Storage-layer modification |
**Core scheduling implementation (`engine/sched/`):**

| File | Function | Development Guide |
|---|---|---|
| `resource_manager_v1.py` | Core scheduling logic; contains the `ScheduledDecodeTask` and `ScheduledPreemptTask` task classes | First stop for scheduling-strategy changes |
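Continuous-batching schedulers of the kind implemented in `resource_manager_v1.py` typically admit waiting requests in arrival order until a token budget is exhausted. The miniature sketch below shows just that admission loop; the class, method, and field names are invented for illustration and omit preemption, KV-block accounting, and priorities.

```python
from collections import deque


class SimpleScheduler:
    """Toy FCFS admission loop with a per-step token budget (names invented)."""

    def __init__(self, max_batch_tokens: int):
        self.max_batch_tokens = max_batch_tokens
        self.waiting: deque = deque()  # FCFS queue of (request_id, prompt_len)

    def add_request(self, request_id: str, prompt_len: int) -> None:
        self.waiting.append((request_id, prompt_len))

    def schedule(self) -> list[str]:
        """Admit requests in arrival order until the token budget is spent."""
        budget = self.max_batch_tokens
        batch = []
        while self.waiting and self.waiting[0][1] <= budget:
            request_id, prompt_len = self.waiting.popleft()
            budget -= prompt_len
            batch.append(request_id)
        return batch


sched = SimpleScheduler(max_batch_tokens=100)
sched.add_request("r1", 40)
sched.add_request("r2", 50)
sched.add_request("r3", 30)
batch = sched.schedule()
print(batch)  # "r3" stays queued: 40 + 50 fit, but 30 more would exceed 100
```

Real schedulers add preemption (hence task classes like `ScheduledPreemptTask`) and track KV-cache blocks rather than raw token counts, but the budget-driven admission loop is the common core.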
### 4. entrypoints/ - API Entry Points

**Function:** External service interfaces, including offline inference and the online API service.

| File | Function | Development Guide |
|---|---|---|
| `llm.py` | `LLM` main entry class, offline inference interface | Starting point for using FastDeploy |
| `engine_client.py` | Engine client | Request-forwarding logic modification |
#### 4.1 openai/ - OpenAI-Compatible API

| File | Function | Development Guide |
|---|---|---|
| `api_server.py` | FastAPI server | Service deployment entry point |
| `protocol.py` | OpenAI protocol definitions | API-format modification |
| `serving_chat.py` | Chat Completions API | Chat-interface development |
| `serving_completion.py` | Completions API | Completion-interface development |
| `serving_embedding.py` | Embeddings API | Vectorization interface |
| `tool_parsers/` | Tool-call parsers | Function-calling development |
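A request to an OpenAI-compatible server is a standard Chat Completions JSON body. The snippet below builds one with the stdlib only; the model name is a placeholder, and exactly which optional fields are honored is defined by `protocol.py`.

```python
import json

# Minimal Chat Completions request body; "my-model" is a placeholder name.
payload = {
    "model": "my-model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
    "stream": False,  # set True for server-sent-event streaming
}
body = json.dumps(payload)
# POST this body to the server's /v1/chat/completions endpoint with any
# HTTP client (urllib.request, httpx, curl, ...).
parsed = json.loads(body)
print(parsed["messages"][1]["content"])
```

Because the wire format matches OpenAI's, existing OpenAI SDK clients can usually be pointed at the server simply by changing the base URL.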
### 5. worker/ - Worker Process Module

**Function:** The processes that actually execute model inference.

| File | Function | Development Guide |
|---|---|---|
| `gpu_model_runner.py` | GPU model runner (core inference loop) | First stop for inference-flow changes |
| `gpu_worker.py` | GPU worker process management | Worker lifecycle management |
| `xpu_model_runner.py` | XPU model runner | Kunlun chip adaptation |
| `hpu_model_runner.py` | HPU model runner | Intel HPU (Gaudi) adaptation |
| `worker_process.py` | Worker process base class | Process-management logic |
### 6. input/ - Input Processing Module

**Function:** Input preprocessing, including tokenization and multimodal input handling.

| File | Function | Development Guide |
|---|---|---|
| `text_processor.py` | `BaseDataProcessor` text-processor base class | Input-processing extension |
| `ernie4_5_processor.py` | ERNIE 4.5 input processor | Baidu model input processing |
| `ernie4_5_tokenizer.py` | ERNIE 4.5 tokenizer | Tokenization logic modification |
| `preprocess.py` | Input preprocessing utilities | Preprocessing flow |
**Multimodal processing subdirectories:**

| Directory | Function |
|---|---|
| `ernie4_5_vl_processor/` | ERNIE 4.5 VL image/video processing |
| `qwen_vl_processor/` | Qwen VL multimodal processing |
| `paddleocr_vl_processor/` | PaddleOCR VL processing |
### 7. output/ - Output Processing Module

**Function:** Inference-result post-processing and streaming-output management.

| File | Function | Development Guide |
|---|---|---|
| `token_processor.py` | `TokenProcessor` token-output processing | Streaming output, speculative decoding |
| `pooler.py` | Pooled-output processing | Embedding output |
| `stream_transfer_data.py` | Streaming-transfer data structures | Data-transfer format |
### 8. cache_manager/ - Cache Management Module

**Function:** KV-cache management, supporting prefix caching and cross-device transfer.

| File | Function | Development Guide |
|---|---|---|
| `prefix_cache_manager.py` | `PrefixCacheManager` prefix-tree cache | First stop for KV-cache optimization |
| `cache_transfer_manager.py` | KV-cache cross-device transfer | PD-disaggregation cache transfer |
| `cache_data.py` | `BlockNode` and `CacheStatus` data structures | Cache data definitions |
| `multimodal_cache_manager.py` | Multimodal cache management | Multimodal caching |
**Subdirectory:**

- `transfer_factory/` - cache transfer factory (IPC, RDMA)
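Prefix caching is commonly implemented as a trie keyed by fixed-size token blocks: a new request walks the trie to find its longest already-cached prefix and reuses those KV blocks instead of recomputing them. The sketch below shows only the matching logic (no eviction, no reference counting); the class and method names are illustrative, not `PrefixCacheManager`'s actual API.

```python
BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative size)


class PrefixTrieNode:
    def __init__(self):
        self.children: dict[tuple, "PrefixTrieNode"] = {}


class PrefixCache:
    """Toy block-granular prefix trie; names are invented for illustration."""

    def __init__(self):
        self.root = PrefixTrieNode()

    def insert(self, tokens: list[int]) -> None:
        # Only full blocks are cacheable; the trailing partial block is dropped.
        node = self.root
        for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            block = tuple(tokens[i:i + BLOCK_SIZE])
            node = node.children.setdefault(block, PrefixTrieNode())

    def match_prefix_len(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV blocks are already cached."""
        node, matched = self.root, 0
        for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            block = tuple(tokens[i:i + BLOCK_SIZE])
            if block not in node.children:
                break
            node = node.children[block]
            matched += BLOCK_SIZE
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8])
print(cache.match_prefix_len([1, 2, 3, 4, 9, 9, 9, 9]))  # only the first block matches
```

Block granularity is what makes this practical: matching whole blocks keeps trie nodes aligned with the KV-cache allocation unit, so a prefix hit translates directly into reusable memory blocks.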
### 9. platforms/ - Hardware Platform Support

**Function:** Multi-hardware platform adaptation, defining the operators and features available on each platform.

| File | Function | Development Guide |
|---|---|---|
| `base.py` | Platform base class, `_Backend` enum | Entry point for new-hardware adaptation |
| `cuda.py` | NVIDIA CUDA platform | GPU optimization |
| `xpu.py` | Baidu Kunlun XPU platform | Kunlun chip adaptation |
| `dcu.py` | AMD DCU (ROCm) platform | AMD GPU adaptation |
| `maca.py` | MetaX GPU (MACA) platform | MetaX GPU adaptation |
| `intel_hpu.py` | Intel HPU platform | Intel Gaudi adaptation |
| `iluvatar.py` | Iluvatar GPU platform | Iluvatar adaptation |
### 10. metrics/ - Monitoring Metrics Module

**Function:** Prometheus metric collection and performance monitoring.

| File | Function | Development Guide |
|---|---|---|
| `metrics.py` | Prometheus metric definitions | Adding new monitoring metrics |
| `stats.py` | ZMQ metric statistics | Distributed monitoring |
| `trace_util.py` | OpenTelemetry distributed tracing | Cross-service request tracing |
### 11. Other Important Modules

| Directory | Function | Development Guide |
|---|---|---|
| `inter_communicator/` | Inter-process communication (ZMQ) | Engine-worker communication changes |
| `spec_decode/` | Speculative decoding (MTP, N-gram) | Speculative-decoding strategy development |
| `distributed/` | Distributed communication (AllReduce) | Distributed-inference development |
| `multimodal/` | Multimodal data processing | Multimodal feature extension |
| `reasoning/` | Reasoning-mode parsing (DeepSeek R1 style) | Chain-of-thought parsing |
| `router/` | Request router, recommended for PD disaggregation | First choice for PD-disaggregation deployment |
| `golang_router/` | Go implementation of the router, with better PD inter-scheduling performance | High-performance PD-disaggregation scenarios |
| `eplb/` | Expert-parallel load balancing | MoE load balancing |
| `rl/` | Reinforcement-learning rollout | RLHF scenarios |
| `plugins/` | Plugin system | Custom extensions |
| `logger/` | Logging module | Log-format modification |
| `trace/` | Tracing module | Performance analysis |
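One of the draft strategies listed for `spec_decode/` is N-gram lookup: propose the tokens that followed the previous occurrence of the context's trailing n-gram, then let the target model verify them in a single forward pass. A minimal sketch of the lookup step (the function name and defaults are assumptions):

```python
def ngram_draft(context: list[int], n: int = 2, k: int = 3) -> list[int]:
    """Draft up to k tokens by matching the trailing n-gram against history.

    If the last n tokens occurred earlier in the context, propose the tokens
    that followed them then. Returns [] when no earlier match exists.
    """
    if len(context) < n:
        return []
    tail = context[-n:]
    # Scan backwards, skipping the trailing occurrence itself.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == tail:
            return context[start + n:start + n + k]
    return []


history = [5, 6, 7, 8, 9, 5, 6]
print(ngram_draft(history, n=2, k=3))  # [5, 6] last led to [7, 8, 9]
```

This pays off on repetitive text (code, structured output): drafting is a pure lookup with no extra model, and the target model accepts or rejects the proposed tokens in one verification step.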
### 12. Configuration Files

| File | Function | Development Guide |
|---|---|---|
| `config.py` | `FDConfig` main configuration class | Entry point for configuration-parameter changes |
| `envs.py` | Environment-variable configuration | Adding new environment variables |
| `utils.py` | General utility functions | Utility-function reuse |
## II. Custom Operators Directory (custom_ops/)

**Function:** High-performance C++/CUDA operator implementations, organized by hardware platform.

```text
custom_ops/
├── gpu_ops/       # NVIDIA GPU operators (primary)
├── cpu_ops/       # CPU operators
├── xpu_ops/       # Baidu Kunlun XPU operators
├── iluvatar_ops/  # Iluvatar GPU operators
├── metax_ops/     # MetaX GPU operators
├── utils/         # Common utilities
└── third_party/   # Third-party libraries (cutlass, DeepGEMM)
```
### gpu_ops/ - GPU Operator Details

| Directory/File | Function | Development Guide |
|---|---|---|
| `append_attn/` | Append Attention implementation | First stop for attention optimization |
| `moe/` | MoE operators (fused_moe, expert_dispatch) | MoE performance optimization |
| `flash_mask_attn/` | Flash Mask Attention | Attention-mask optimization |
| `mla_attn/` | Multi-head Latent Attention | MLA model support |
| `machete/` | Machete GEMM | Matrix-multiplication optimization |
| `quantization/` | Quantization operators | Quantization performance optimization |
| `sample_kernels/` | Sampling operators | Sampling performance optimization |
| `speculate_decoding/` | Speculative-decoding operators | Speculative-decoding optimization |
| `cutlass_kernels/` | CUTLASS kernels | High-performance GEMM |
| `cpp_extensions.cc` | C++ extension entry point | Entry point for registering new operators |
| `append_attention.cu` | Append Attention core | Core attention implementation |
**Key operator files:**

- `fused_rotary_position_encoding.cu` - fused rotary position encoding
- `multi_head_latent_attention.cu` - MLA attention
- `per_token_quant_fp8.cu` - FP8 quantization
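The idea behind per-token FP8 quantization (as in `per_token_quant_fp8.cu`) is symmetric scaling with one scale per token row, so an outlier-heavy token does not crush the precision of the others. Below is a pure-Python sketch of the math only; the real kernel also rounds and casts to FP8 E4M3 (whose maximum finite value is 448) and does this per row in parallel on the GPU.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def per_token_quant(row: list[float]) -> tuple[list[float], float]:
    """Per-token symmetric scaling: scale = max|x| / FP8_MAX, q = x / scale.

    Sketch of the math only; a real kernel would also round/cast to FP8.
    """
    scale = max(abs(v) for v in row) / FP8_E4M3_MAX
    if scale == 0.0:
        return [0.0] * len(row), 1.0
    return [v / scale for v in row], scale


quantized, scale = per_token_quant([0.5, -2.0, 1.0])
restored = [q * scale for q in quantized]
print(restored)
```

Per-token (rather than per-tensor) scales matter for activations, whose dynamic range varies sharply from token to token; weights, which are static, often get away with coarser per-channel or per-tensor scales.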
## III. Test Directory (tests/)

**Function:** Unit tests and end-to-end tests, organized by module.

```text
tests/
├── e2e/                # End-to-end service tests
├── operators/          # Operator unit tests
├── model_executor/     # Model executor tests
├── model_loader/       # Model loading tests
├── layers/             # Network layer tests
├── scheduler/          # Scheduler tests
├── cache_manager/      # Cache management tests
├── entrypoints/        # API entry tests
├── input/              # Input processing tests
├── output/             # Output processing tests
├── metrics/            # Metric tests
├── distributed/        # Distributed tests
├── graph_optimization/ # Graph optimization tests
├── quantization/       # Quantization tests
├── multimodal/         # Multimodal tests
├── xpu_ci/             # XPU CI tests
├── ce/                 # CE environment tests
├── ci_use/             # CI utility tests
└── conftest.py         # pytest configuration
```
### Test Directory Details

| Directory | Content | Development Guide |
|---|---|---|
| `e2e/` | Full-service tests for each model (ERNIE, Qwen, DeepSeek, etc.) | Service integration testing |
| `operators/` | Operator unit tests (`test_fused_moe.py`, `test_flash_mask_attn.py`, etc.) | Required tests for operator development |
| `layers/` | Network-layer tests (attention, moe, quantization) | Network-layer testing |
| `model_executor/` | Model execution-flow tests | Model execution testing |
| `scheduler/` | Scheduler function tests | Scheduling-logic verification |
| `cache_manager/` | Cache-management tests | Cache-logic verification |
## IV. Scripts Directory (scripts/)

**Function:** CI/CD, performance tuning, and utility scripts.

| File | Function | Usage Scenario |
|---|---|---|
| `run_unittest.sh` | Unit-test runner | Local testing |
| `run_ci_xpu.sh` | XPU CI runner | Kunlun CI |
| `run_ci_hpu.sh` | HPU CI runner | Intel HPU CI |
| `run_ci_dcu.sh` | DCU CI runner | AMD DCU CI |
| `coverage_run.sh` | Code-coverage statistics | Code quality |
| `tune_cublaslt_int8_gemm.py` | cuBLASLt INT8 GEMM tuning | Performance tuning |
| `tune_cutlass_fp8_gemm.py` | CUTLASS FP8 GEMM tuning | Performance tuning |
| `offline_w4a8.py` | Offline W4A8 quantization tool | Model quantization |
| `extract_mtp_weight_from_safetensor.py` | MTP weight extraction | Model processing |
## V. Other Directories

### docs/ - Documentation

- Usage documentation, API documentation, architecture design documents

### examples/ - Example Code

- Model usage examples, deployment examples

### benchmarks/ - Performance Benchmarks

- Performance test scripts, benchmark data

### tools/ - Development Tools

- `codestyle/` - code style checking tools
- `dockerfile/` - Docker build tools

### dockerfiles/ - Docker Images

- Dockerfiles for each platform's runtime environment
## VI. Quick Development Guide

### Adding a New Model

1. Read `models/model_base.py` to understand the model registration mechanism
2. Create the new model file under `models/`
3. Add a corresponding input processor under `input/`
4. Add tests under `tests/model_executor/`
### Adding a New Operator

1. Implement the CUDA operator under `custom_ops/gpu_ops/`
2. Register the operator in `cpp_extensions.cc`
3. Add a Python wrapper under `model_executor/ops/gpu/`
4. Add tests under `tests/operators/`
### Adapting a New Hardware Platform

1. Following `platforms/base.py`, create the new platform class
2. Create a hardware operator directory under `custom_ops/`
3. Create the backend implementation under `model_executor/layers/backends/`
4. Create a model runner under `worker/`
### Optimizing Inference Performance

- Attention optimization: `custom_ops/gpu_ops/append_attn/`
- MoE optimization: `custom_ops/gpu_ops/moe/`
- Graph optimization: `fastdeploy/model_executor/graph_optimization/`
### PD Disaggregation Deployment

- Router: `router/router.py` (Python implementation, recommended)
- High-performance router: `golang_router/` (Go implementation, better PD inter-scheduling performance)
- Cache transfer: `cache_manager/cache_transfer_manager.py`
## VII. Configuration System

```text
FDConfig (config.py)
├── ModelConfig      # Model configuration
├── CacheConfig      # Cache configuration
├── ParallelConfig   # Parallel configuration
├── SchedulerConfig  # Scheduler configuration
├── LoRAConfig       # LoRA configuration
└── ...
```

### Environment Variable Configuration (envs.py)

- `FD_*` series environment variables
- Runtime behavior control
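The `FDConfig` nesting shown above is a composed-configuration layout: one top-level object aggregating per-concern sub-configs. The dataclass sketch below mirrors the diagram's structure; the fields and defaults inside each sub-config are invented for illustration and are not FastDeploy's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ModelConfig:
    # Illustrative fields only; not FastDeploy's real schema.
    model_path: str = ""
    max_model_len: int = 4096


@dataclass
class CacheConfig:
    block_size: int = 64
    gpu_memory_utilization: float = 0.9


@dataclass
class FDConfig:
    """Top-level config composing per-concern sub-configs, as in the diagram."""
    model_config: ModelConfig = field(default_factory=ModelConfig)
    cache_config: CacheConfig = field(default_factory=CacheConfig)


cfg = FDConfig(
    model_config=ModelConfig(model_path="/models/qwen2", max_model_len=8192),
)
print(cfg.model_config.max_model_len, cfg.cache_config.block_size)
```

Composition keeps each concern independently constructible and testable, and lets a CLI layer (like `EngineArgs`) populate only the sub-configs its flags touch while the rest keep their defaults.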
This document covers the main modules and key files of the FastDeploy codebase and can serve as a reference for code navigation and development. For further questions, refer to each module's detailed documentation or to the source code comments.