🔮 Speculative Decoding
This project implements an efficient Speculative Decoding inference framework based on PaddlePaddle. It supports Multi-Token Proposing (MTP) to accelerate large language model (LLM) generation, significantly reducing latency and improving throughput.
✅ Supported Speculative Decoding Methods
Supported
- Naive: Normal decoding mode that uses the speculative decoding code path without generating draft tokens; useful for testing the speculative decoding framework
- Ngram: N-gram matching based speculative decoding
- Suffix Decoding
- MTP (Multi-Token Prediction)
  - ✅ Supported: TP Sharding
  - ✅ Supported: Shared Prefix
  - ✅ Supported: TP Sharding + PD Separation
  - ⏳ Coming Soon: EP + DP + PD Separation
  - ⏳ Coming Soon: Chunked Prefill support
  - ⏳ Coming Soon: Multi-layer MTP
- Decoding with Hybrid MTP and Ngram Methods (Hybrid-MTP-with-Ngram)
  - Overview: A hybrid method combining MTP and Ngram. MTP first generates N draft tokens, then Ngram matching supplements additional draft tokens.
  - Use Cases: Suitable when higher draft-token coverage is required, leveraging both MTP's generation capability and the efficiency of Ngram matching.
Coming Soon
- Draft Model
- Eagle
- Hydra
- Medusa
- ...
⚙️ Efficient Speculative Decoding Architecture
- Attention Mechanism: We employ Cascade Append Attention, which allows unified processing of queries with varying token lengths, enabling efficient verification: all draft tokens can be verified in a single forward pass. The underlying kernels are deeply customized to fully leverage Tensor Cores and maintain high throughput even under heavy concurrency.
- Virtual Padding Mechanism: A virtual padding strategy is used to locate output token batch IDs, eliminating the overhead of data copying and slicing operations.
- Parallel Sampling and Verification: Multiple fused CUDA kernels perform sampling and verification concurrently. These kernels process every sample in a batch in parallel, avoiding explicit loops on the host side (see the illustrative sketch after this list).
- Efficient Draft Model/MTP Framework: Multiple fused CUDA kernels handle pre- and post-processing within the model class, replacing traditional loop-based and slicing-based methods with a more performant and maintainable structure.
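To make the verification step concrete, here is a minimal NumPy sketch of greedy batch verification. It is an illustration only; the actual implementation uses fused CUDA kernels, and the function and variable names below are hypothetical.

```python
import numpy as np

def greedy_verify_batch(draft_tokens, target_logits):
    """Illustrative greedy verification (hypothetical helper, not FastDeploy's kernel):
    for each sample, accept draft tokens from left to right while they match the
    target model's argmax from the single verification forward pass.

    draft_tokens:  [batch, num_draft] proposed token IDs
    target_logits: [batch, num_draft, vocab] target-model logits
    """
    target_tokens = target_logits.argmax(axis=-1)       # [batch, num_draft]
    matches = draft_tokens == target_tokens              # [batch, num_draft]
    # Count leading matches per sample without a host-side loop.
    num_accepted = np.cumprod(matches, axis=1).sum(axis=1)
    return target_tokens, num_accepted

# Toy example: batch of 2 samples, 3 draft tokens each, vocab size 8.
rng = np.random.default_rng(0)
draft = rng.integers(0, 8, size=(2, 3))
logits = rng.normal(size=(2, 3, 8))
tokens, accepted = greedy_verify_batch(draft, logits)
print(accepted)  # number of draft tokens accepted per sample
```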
🔧 Configuration Parameters
Basic Parameters
- method: The speculative decoding strategy; supports ["mtp", "ngram", "naive", "suffix"].
  - naive: Normal decoding mode using the speculative decoding code path without generating draft tokens
  - ngram: N-gram matching based speculative decoding
  - mtp: Multi-Token Prediction
  - suffix: Suffix decoding based speculative decoding
- num_speculative_tokens: Number of speculative tokens to generate; the maximum is 5, and MTP currently supports only 1.
- num_model_steps: Number of MTP model steps; must satisfy num_speculative_tokens >= num_model_steps.
- model: Path to the MTP draft model when using the "mtp" method.
- quantization: Quantization method of the MTP model (e.g., WINT4).
- Max batch_size: 256
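For illustration, the value passed to --speculative-config is simply a JSON object assembled from these fields. A minimal sketch (the model path is a placeholder):

```python
import json

# Hypothetical example of assembling a --speculative-config value.
spec_config = {
    "method": "mtp",                 # one of "mtp", "ngram", "naive", "suffix"
    "num_speculative_tokens": 1,     # max 5; MTP currently supports only 1
    "num_model_steps": 1,            # must be <= num_speculative_tokens
    "model": "/path/to/mtp_model",   # placeholder path to the MTP draft model
    "quantization": "wint4",
}
assert spec_config["num_speculative_tokens"] >= spec_config["num_model_steps"]
print(json.dumps(spec_config))
```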
Verification Strategy (verify_strategy)
Controls how draft tokens are verified:
- topp (default): Top-P sampling verification; the draft token must fall within the target model's top-p candidate set
- greedy: Greedy verification; the draft token must equal the target model's argmax output
- target_match: Target-match verification; the draft token must equal the target model's sampled output
--speculative-config '{"method": "mtp", "verify_strategy": "greedy", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
Accept Policy (accept_policy)
Controls draft token acceptance behavior:
- normal (default): Normal verification flow
- accept_all: Accept all draft tokens (for debugging)
- reject_all: Reject all draft tokens (for debugging)
--speculative-config '{"method": "mtp", "accept_policy": "accept_all", "num_speculative_tokens": 1}'
🚀 Using Multi-Token Prediction (MTP)
For detailed theory, refer to: 📄 DeepSeek-V3 Paper
TP Sharding Mode
Launch service on 4 × H100 GPUs using WINT4 quantization (Dense: WINT8, MoE: WINT4):
Config file:
benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
    --config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
PD-Separated Deployment (1P1D Mode)
Deploy 1P1D on H100 with both Prefill (P) and Decode (D) nodes using TP4 + WINT4 quantization. This deployment only requires changing the config and adding speculative_config. For details, refer to the PD Separation documentation.
- P Node (Prefill)
Config file:
benchmarks/yaml/eb45t-32k-wint4-mtp-tp4-prefill.yaml
export FD_LOG_DIR="log_prefill"
rm -rf ${FD_LOG_DIR}
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--workers 2 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "prefill" \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
--scheduler-topic mtp \
--config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-tp4-prefill.yaml \
--scheduler-password "scheduler_mtp" \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' &
- D Node (Decode)
Config file:
benchmarks/yaml/eb45t-32k-wint4-mtp-tp4-decode.yaml
export FD_LOG_DIR="log_decode"
rm -rf ${FD_LOG_DIR}
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--port 8190 \
--metrics-port 8191 \
--engine-worker-queue-port 8192 \
--cache-queue-port 8193 \
--workers 2 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "decode" \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
--scheduler-topic mtp \
--config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-tp4-decode.yaml \
--scheduler-password "scheduler_mtp" \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' &
Decoding with Hybrid MTP and Ngram Methods
When starting the service, you only need to modify the --speculative-config option. For example, use MTP to generate two draft tokens, and then append three additional draft tokens from Ngram matching:
--speculative-config '{"method": "mtp", "num_model_steps": 2, "mtp_strategy": "with_ngram", "num_speculative_tokens": 5, "model": "'$model_path'/mtp"}'
🧠 Using Ngram-Based Decoding
This method uses an n-gram sliding window to match the prompt and generated tokens to predict draft tokens. It is particularly effective in scenarios with high input-output overlap (e.g., code completion, document search).
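The core idea can be sketched in a few lines of Python. This is a generic n-gram lookup for illustration only, not FastDeploy's actual implementation, and all names are hypothetical:

```python
def ngram_propose(context, ngram_size=3, num_draft=4):
    """Hypothetical helper: propose draft tokens by matching the last
    `ngram_size` tokens of the context against earlier positions and
    copying the tokens that followed the match."""
    if len(context) < ngram_size:
        return []
    key = context[-ngram_size:]
    # Search from the most recent occurrence backwards.
    for start in range(len(context) - ngram_size - 1, -1, -1):
        if context[start:start + ngram_size] == key:
            follow = context[start + ngram_size:start + ngram_size + num_draft]
            if follow:
                return follow
    return []

# Toy example over token IDs: the suffix [4, 5, 6] appeared earlier,
# so the tokens that followed it are proposed as drafts.
print(ngram_propose([1, 2, 3, 4, 5, 6, 7, 8, 4, 5, 6]))  # -> [7, 8, 4, 5]
```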
Run on 4 × H100 GPUs with WINT4 quantization:
Config file:
benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
    --config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 1}'
🌲 Using Suffix Decoding
Suffix Decoding is a model-free speculative decoding method that accelerates repetitive inference tasks (e.g., agent workflows, coding) using efficient CPU-based suffix trees for rapid draft token prediction, eliminating GPU overhead.
Run on 4 × H100 GPUs with WINT4 quantization:
Config file: benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
    --config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 4, "suffix_decoding_max_tree_depth": 64, "suffix_decoding_max_cached_requests": 10000, "suffix_decoding_max_spec_factor": 1.0, "suffix_decoding_min_token_prob": 0.1}'
Parameter Descriptions
# Maximum depth of the token sequences cached in the suffix trees.
self.suffix_decoding_max_tree_depth: int = 64
# Maximum number of requests that can be stored in the cache.
self.suffix_decoding_max_cached_requests: int = -1
# Scaling factor on the matched length: num_draft_tokens = suffix_decoding_max_spec_factor * matched_length
self.suffix_decoding_max_spec_factor: float = 1.0
# Probability threshold for speculated tokens.
self.suffix_decoding_min_token_prob: float = 0.1
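As a rough illustration of how these parameters could bound the number of proposed draft tokens, based only on the comments above (the actual implementation may apply them differently):

```python
# Illustrative only: how the suffix-decoding knobs could cap the draft budget.
suffix_decoding_max_tree_depth = 64
suffix_decoding_max_spec_factor = 1.0
suffix_decoding_min_token_prob = 0.1

def draft_budget(matched_length, candidate_probs):
    """Hypothetical helper: cap drafts by the matched length times the spec
    factor and by the tree depth, and drop low-probability candidates."""
    budget = int(suffix_decoding_max_spec_factor * matched_length)
    budget = min(budget, suffix_decoding_max_tree_depth)
    kept = [p for p in candidate_probs if p >= suffix_decoding_min_token_prob]
    return min(budget, len(kept))

print(draft_budget(matched_length=8, candidate_probs=[0.9, 0.5, 0.2, 0.05]))  # 3
```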
📝 Using Naive Mode (Normal Decoding)
Naive mode uses the speculative decoding code path without generating draft tokens, useful for testing the correctness of the speculative decoding framework or establishing performance baselines.
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--speculative-config '{"method": "naive", "num_speculative_tokens": 1}'
Note: In Naive mode, num_speculative_tokens will be forced to 0.