Benchmark
FastDeploy extends the vLLM benchmark script with additional metrics to support more detailed performance benchmarking.
Benchmark Dataset
The following dataset is derived from open-source data; the original data comes from HuggingFace Datasets:
| Dataset | Description |
|---|---|
| https://fastdeploy.bj.bcebos.com/eb_query/filtered_sharedgpt_2000_input_1136_output_200_fd.json | Open-source dataset |
How to Run
cd FastDeploy/benchmarks
python -m pip install -r requirements.txt
# Start service
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-0.3B-Base-Paddle \
  --port 8188 \
  --tensor-parallel-size 1 \
  --max-model-len 8192
# Run benchmark
python benchmark_serving.py \
  --backend openai-chat \
  --model baidu/ERNIE-4.5-0.3B-Base-Paddle \
  --endpoint /v1/chat/completions \
  --host 0.0.0.0 \
  --port 8188 \
  --dataset-name EBChat \
  --dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json \
  --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
  --metric-percentiles 80,95,99,99.9,99.95,99.99 \
  --num-prompts 1 \
  --max-concurrency 1 \
  --save-result
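The benchmark command reads the dataset from a local path, so the file listed under Benchmark Dataset presumably has to be present in the working directory first. The following is a minimal sketch of fetching it with the Python standard library. The helper names `dataset_filename` and `download_dataset` are hypothetical and not part of FastDeploy.

```python
import os
import urllib.request
from urllib.parse import urlparse

# URL taken from the Benchmark Dataset table.
DATASET_URL = (
    "https://fastdeploy.bj.bcebos.com/eb_query/"
    "filtered_sharedgpt_2000_input_1136_output_200_fd.json"
)


def dataset_filename(url: str) -> str:
    """Return the file name component of the dataset URL."""
    return os.path.basename(urlparse(url).path)


def download_dataset(url: str = DATASET_URL, dest_dir: str = ".") -> str:
    """Download the dataset into dest_dir unless it already exists there.

    Hypothetical helper: returns the local path to pass to --dataset-path.
    """
    dest = os.path.join(dest_dir, dataset_filename(url))
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest
```

Calling `download_dataset()` from `FastDeploy/benchmarks` should leave the file at the path used by `--dataset-path` in the command above.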
In-Process Benchmark Metrics Logger
FastDeploy provides a built-in performance monitoring module that runs inside the inference process. It collects per-request timing data and computes rolling statistics aligned with benchmark_serving.py, writing results to a JSONL file for real-time monitoring and post-hoc analysis.
Enable
Add --benchmark-metrics-config with a JSON string to the service startup command:
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-0.3B-Base-Paddle \
  --benchmark-metrics-config '{"enable": true}'
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable | bool | false | Whether to enable the benchmark metrics logger. Must be set to true to activate. |
| window_size | int | 0 | Number of recent requests to aggregate. 0 = cumulative (all requests since start). |
| window_mode | str | "sliding" | Window aggregation mode. "sliding" = sliding window (keeps last N records, oldest automatically dropped). "tumbling" = tumbling window (clears and restarts after every N records). |
| percentiles | str | "50,90,95,99" | Comma-separated percentile values to compute. |
| metrics | str | "all" | Comma-separated metric names to report, or "all" for all metrics. |
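Hand-writing JSON inside shell quotes is easy to get wrong, so it can help to generate the flag value programmatically. The sketch below builds the `--benchmark-metrics-config` argument from the parameters listed above using only the standard library. The function name `build_metrics_flag` and its local sanity check are illustrative assumptions, not part of FastDeploy.

```python
import json
import shlex


def build_metrics_flag(window_size=0, window_mode="sliding",
                       percentiles="50,90,95,99", metrics="all"):
    """Return a shell-safe --benchmark-metrics-config flag (hypothetical helper)."""
    # Local sanity check only; the server's own validation rules are not assumed here.
    if window_mode not in ("sliding", "tumbling"):
        raise ValueError(f"unknown window_mode: {window_mode!r}")
    config = {
        "enable": True,
        "window_size": window_size,
        "window_mode": window_mode,
        "percentiles": percentiles,
        "metrics": metrics,
    }
    # json.dumps emits lowercase true/false; shlex.quote wraps the JSON for the shell.
    return "--benchmark-metrics-config " + shlex.quote(json.dumps(config))
```

The returned string can be appended to the service startup command shown under Enable.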
Available Metrics
Metrics are aligned with benchmark_serving.py --percentile-metrics:
| Metric Name | Description | Unit |
|---|---|---|
| ttft | Time to First Token (client arrival → first token) | ms |
| s_ttft | Server TTFT (inference start → first token) | ms |
| tpot | Time per Output Token (excluding first token) | ms |
| s_itl | Inference Inter-token Latency | ms |
| e2el | End-to-end Latency (client arrival → last token) | ms |
| s_e2el | Server E2EL (inference start → last token) | ms |
| s_decode | Decode speed (excluding first token) | tok/s |
| input_len | Prefix cache hit token count ("Cached Tokens") | tokens |
| s_input_len | Inference input length (total prompt tokens) | tokens |
| output_len | Output token length per request | tokens |
In addition, the following throughput metrics are always computed (they are not user-selectable) once at least two records are available:
| Metric | Description | Unit |
|---|---|---|
| request_throughput | Request throughput | req/s |
| output_throughput | Output token throughput | tok/s |
| total_throughput | Total token throughput (input + output) | tok/s |
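One plausible reading of the two-record requirement is that throughput needs a non-zero time span between completions. The sketch below illustrates that reading only. The `Record` fields and the choice of "first to last completion" as the duration are assumptions, and the actual FastDeploy computation may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    finish_time: float   # completion time in seconds (hypothetical field)
    input_tokens: int
    output_tokens: int


def throughput(records: List[Record]) -> Optional[dict]:
    """Compute throughput over the span between first and last completion."""
    if len(records) < 2:
        return None  # a single record has no time span to divide by
    times = [r.finish_time for r in records]
    duration = max(times) - min(times)
    if duration <= 0:
        return None
    total_in = sum(r.input_tokens for r in records)
    total_out = sum(r.output_tokens for r in records)
    return {
        "request_throughput": len(records) / duration,       # req/s
        "output_throughput": total_out / duration,            # tok/s
        "total_throughput": (total_in + total_out) / duration,  # tok/s
    }
```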
Window Modes
Sliding Window ("sliding", default):
The window keeps the most recent N records. When a new record arrives and the window is full, the oldest record is automatically dropped. Each output line reflects the statistics of the latest N requests.
--benchmark-metrics-config '{"enable": true, "window_size": 64, "window_mode": "sliding"}'
Tumbling Window ("tumbling"):
The window accumulates records up to N, then clears and starts fresh. Each output line still reflects the current window's accumulated statistics, but the window resets at every boundary. This is useful for reinforcement learning (RL) training scenarios where each step has a fixed batch size and you want independent per-step analysis.
--benchmark-metrics-config '{"enable": true, "window_size": 64, "window_mode": "tumbling"}'
No Window (window_size: 0):
All completed requests are accumulated. Statistics reflect the entire lifetime of the service.
--benchmark-metrics-config '{"enable": true, "window_size": 0}'
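The three behaviours above can be contrasted with a small sketch. It is a simplified model rather than FastDeploy's implementation: the class name `MetricsWindow` is hypothetical, and the tumbling mode assumes a full window is reported once before it is cleared by the next arriving record.

```python
from collections import deque


class MetricsWindow:
    """Toy model of the sliding, tumbling, and cumulative window behaviours."""

    def __init__(self, window_size=0, window_mode="sliding"):
        if window_mode not in ("sliding", "tumbling"):
            raise ValueError(f"unknown window_mode: {window_mode!r}")
        self.window_size = window_size
        self.window_mode = window_mode
        # A bounded deque drops the oldest record automatically (sliding mode).
        bounded = window_size > 0 and window_mode == "sliding"
        self.records = deque(maxlen=window_size if bounded else None)

    def add(self, record):
        """Add one completed request and return the records now in the window."""
        is_tumbling = self.window_mode == "tumbling" and self.window_size > 0
        if is_tumbling and len(self.records) >= self.window_size:
            # Previous window was full: start a fresh one with this record.
            self.records.clear()
        self.records.append(record)
        return list(self.records)
```

With `window_size=3`, feeding records 1 through 5 ends with `[3, 4, 5]` in sliding mode and `[4, 5]` in tumbling mode, while `window_size=0` keeps all five.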
Output
Results are written to {FD_LOG_DIR}/benchmark_metrics.jsonl (default: ./log/benchmark_metrics.jsonl). Each line is a JSON object representing the window statistics at the time of a request completion.
Example output line:
{
  "timestamp": "2026-05-14T10:30:05.123",
  "window_size": 64,
  "window_mode": "sliding",
  "completed": 64,
  "total_input_tokens": 8192,
  "total_output_tokens": 16384,
  "request_throughput": 5.2,
  "output_throughput": 1250.0,
  "total_throughput": 2500.0,
  "ttft_ms": {"mean": 45.0, "median": 42.1, "p50": 42.1, "p90": 68.5, "p95": 82.3, "p99": 120.5},
  "s_decode": {"mean": 67.3, "median": 67.5, "p50": 67.5, "p90": 70.1, "p95": 71.2, "p99": 73.0}
}
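For post-hoc analysis, the JSONL file can be read line by line with the standard library. The sketch below pulls the most recent value of one statistic. The function name `latest_stat` is hypothetical, and the key names (`ttft_ms`, `p99`) are taken from the example line above, so other metrics may use different keys.

```python
import json


def latest_stat(lines, metric="ttft_ms", stat="p99"):
    """Return the given statistic from the last line that reports the metric."""
    latest = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        if metric in entry:
            latest = entry[metric].get(stat)
    return latest


# Usage sketch, assuming the default output path:
# with open("./log/benchmark_metrics.jsonl") as f:
#     print(latest_stat(f))
```

Because each line describes the window at the time of a request completion, the last line generally gives the most current view.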