Benchmark

FastDeploy extends the vLLM benchmark script with additional metrics for more detailed performance analysis.

Benchmark Dataset

The benchmark uses the following dataset, derived from open-source data originally published on HuggingFace Datasets:

Dataset	Description
https://fastdeploy.bj.bcebos.com/eb_query/filtered_sharedgpt_2000_input_1136_output_200_fd.json	Open-source dataset

How to Run

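cd FastDeploy/benchmarks
python -m pip install -r requirements.txt

# Download the benchmark dataset referenced by --dataset-path below
wget https://fastdeploy.bj.bcebos.com/eb_query/filtered_sharedgpt_2000_input_1136_output_200_fd.json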

# Start service
python -m fastdeploy.entrypoints.openai.api_server \
       --model baidu/ERNIE-4.5-0.3B-Base-Paddle \
       --port 8188 \
       --tensor-parallel-size 1 \
       --max-model-len 8192

# Run benchmark
python benchmark_serving.py \
  --backend openai-chat \
  --model baidu/ERNIE-4.5-0.3B-Base-Paddle \
  --endpoint /v1/chat/completions \
  --host 0.0.0.0 \
  --port 8188 \
  --dataset-name EBChat \
  --dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json \
  --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
  --metric-percentiles 80,95,99,99.9,99.95,99.99 \
  --num-prompts 1 \
  --max-concurrency 1 \
  --save-result

In-Process Benchmark Metrics Logger

FastDeploy provides a built-in performance monitoring module that runs inside the inference process. It collects per-request timing data and computes rolling statistics aligned with benchmark_serving.py, writing results to a JSONL file for real-time monitoring and post-hoc analysis.

Enable

Add --benchmark-metrics-config with a JSON string to the service startup command:

python -m fastdeploy.entrypoints.openai.api_server \
       --model baidu/ERNIE-4.5-0.3B-Base-Paddle \
       --benchmark-metrics-config '{"enable": true}'

Configuration Parameters

Parameter	Type	Default	Description
`enable`	bool	`false`	Whether to enable the benchmark metrics logger. Must be set to `true` to activate.
`window_size`	int	`0`	Number of recent requests to aggregate. `0` = cumulative (all requests since start).
`window_mode`	str	`"sliding"`	Window aggregation mode. `"sliding"` = sliding window (keeps last N records, oldest automatically dropped). `"tumbling"` = tumbling window (clears and restarts after every N records).
`percentiles`	str	`"50,90,95,99"`	Comma-separated percentile values to compute.
`metrics`	str	`"all"`	Comma-separated metric names to report, or `"all"` for all metrics.
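
Because the option takes a JSON string, quoting mistakes on the command line are easy to make. The sketch below builds the string with `json.dumps` and shell-quotes it with `shlex.join`. The helper name `build_server_command` and the specific values chosen are illustrative assumptions; only the keys come from the table above.

```python
import json
import shlex

# Keys follow the Configuration Parameters table; the values are examples only.
config = {
    "enable": True,
    "window_size": 64,
    "window_mode": "sliding",
    "percentiles": "50,90,99",
    "metrics": "ttft,tpot,e2el",
}


def build_server_command(model: str, metrics_config: dict) -> str:
    """Return a shell-safe startup command (hypothetical helper, not a FastDeploy API)."""
    args = [
        "python", "-m", "fastdeploy.entrypoints.openai.api_server",
        "--model", model,
        # json.dumps renders Python True as JSON true and escapes quotes correctly.
        "--benchmark-metrics-config", json.dumps(metrics_config),
    ]
    # shlex.join quotes each argument so the JSON survives shell parsing.
    return shlex.join(args)


command = build_server_command("baidu/ERNIE-4.5-0.3B-Base-Paddle", config)
print(command)
```

Pasting the printed command into a POSIX shell should pass the JSON through as a single argument.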

Available Metrics

Metric names follow the --percentile-metrics option of benchmark_serving.py:

Metric Name	Description	Unit
`ttft`	Time to First Token (client arrival → first token)	ms
`s_ttft`	Server TTFT (inference start → first token)	ms
`tpot`	Time per Output Token (excluding first token)	ms
`s_itl`	Inference-side inter-token latency	ms
`e2el`	End-to-end Latency (client arrival → last token)	ms
`s_e2el`	Server E2EL (inference start → last token)	ms
`s_decode`	Decode speed (excluding first token)	tok/s
`input_len`	Prefix cache hit token count ("Cached Tokens")	tokens
`s_input_len`	Inference-side input length (total prompt tokens)	tokens
`output_len`	Output token length per request	tokens
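
To make the two "excluding first token" entries concrete, here is a minimal sketch. It assumes the convention used by the vLLM benchmark script, where the first token is attributed to TTFT and the remaining `output_len - 1` tokens share the rest of the latency. The function names are hypothetical, and FastDeploy's internal computation may differ in detail.

```python
def tpot_ms(e2el_ms: float, ttft_ms: float, output_len: int) -> float:
    """Time per output token in ms, excluding the first token (assumed formula)."""
    if output_len <= 1:
        # With a single output token there is no decode phase to average over.
        return 0.0
    return (e2el_ms - ttft_ms) / (output_len - 1)


def decode_speed_tok_s(s_e2el_ms: float, s_ttft_ms: float, output_len: int) -> float:
    """Decode speed in tokens per second, excluding the first token (assumed formula)."""
    decode_ms = s_e2el_ms - s_ttft_ms
    if output_len <= 1 or decode_ms <= 0:
        return 0.0
    return (output_len - 1) / (decode_ms / 1000.0)


# A request with 201 output tokens, 45 ms TTFT and 2045 ms end-to-end latency.
print(tpot_ms(2045.0, 45.0, 201))             # 10.0
print(decode_speed_tok_s(2030.0, 30.0, 201))  # 100.0
```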

In addition, the following throughput metrics are not user-selectable. They are always computed whenever at least two records are available:

Metric	Description	Unit
`request_throughput`	Request throughput	req/s
`output_throughput`	Output token throughput	tok/s
`total_throughput`	Total token throughput (input + output)	tok/s
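
One plausible reading of these three metrics is sketched below. It assumes the elapsed time is the span from the earliest request start to the latest request completion among the aggregated records. That choice, the `Record` type, and the `throughput` function are assumptions for illustration, not FastDeploy's confirmed implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    """One completed request (hypothetical structure)."""
    start_s: float
    end_s: float
    input_tokens: int
    output_tokens: int


def throughput(records: List[Record]) -> Optional[Dict[str, float]]:
    """Return throughput metrics, or None when fewer than two records exist."""
    if len(records) < 2:
        return None
    # Assumed duration: earliest start to latest completion.
    duration = max(r.end_s for r in records) - min(r.start_s for r in records)
    if duration <= 0:
        return None
    total_in = sum(r.input_tokens for r in records)
    total_out = sum(r.output_tokens for r in records)
    return {
        "request_throughput": len(records) / duration,
        "output_throughput": total_out / duration,
        "total_throughput": (total_in + total_out) / duration,
    }


records = [Record(0.0, 2.0, 100, 200), Record(1.0, 4.0, 100, 200)]
print(throughput(records))
```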

Window Modes

Sliding Window ("sliding", default):

The window keeps the most recent N records. When a new record arrives and the window is full, the oldest record is automatically dropped. Each output line reflects the statistics of the latest N requests.

--benchmark-metrics-config '{"enable": true, "window_size": 64, "window_mode": "sliding"}'

Tumbling Window ("tumbling"):

The window accumulates records up to N, then clears and starts fresh. Each output line still reflects the statistics accumulated in the current window, but the window resets at every boundary. This suits RL training scenarios where each step has a fixed batch size and you want an independent analysis per step.

--benchmark-metrics-config '{"enable": true, "window_size": 64, "window_mode": "tumbling"}'

No Window (window_size: 0):

All completed requests are accumulated. Statistics reflect the entire lifetime of the service.

--benchmark-metrics-config '{"enable": true, "window_size": 0}'
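
The three behaviours described above can be sketched with a small container. This is an illustration, not FastDeploy's code: the class name `MetricWindow` is hypothetical, and the tumbling variant assumes the window is cleared when the first record after a full window arrives.

```python
from collections import deque
from typing import Deque, List


class MetricWindow:
    """Holds the records the next output line would aggregate (hypothetical sketch)."""

    def __init__(self, window_size: int = 0, window_mode: str = "sliding") -> None:
        if window_mode not in ("sliding", "tumbling"):
            raise ValueError(f"unknown window_mode: {window_mode!r}")
        self.window_size = window_size
        self.window_mode = window_mode
        # A bounded deque drops the oldest record automatically (sliding mode).
        bounded = window_size > 0 and window_mode == "sliding"
        self._records: Deque[float] = deque(maxlen=window_size if bounded else None)

    def add(self, value: float) -> List[float]:
        """Add one record and return the records currently in the window."""
        is_tumbling = self.window_mode == "tumbling" and self.window_size > 0
        if is_tumbling and len(self._records) >= self.window_size:
            # Assumed boundary rule: a full window is cleared before the next record.
            self._records.clear()
        self._records.append(value)
        return list(self._records)


for mode, size in (("sliding", 3), ("tumbling", 3), ("sliding", 0)):
    window = MetricWindow(window_size=size, window_mode=mode)
    snapshots = [window.add(v) for v in (1, 2, 3, 4, 5)]
    print(mode, size, snapshots[-1])
```

With five records and N = 3, the sliding window ends on the last three records, the tumbling window holds only the two records since the reset, and `window_size: 0` keeps all five.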

Output

Results are written to {FD_LOG_DIR}/benchmark_metrics.jsonl (default: ./log/benchmark_metrics.jsonl). Each line is a JSON object holding the window statistics at the moment a request completes.

Example output line (pretty-printed here for readability; in the file each object occupies a single line):

{
  "timestamp": "2026-05-14T10:30:05.123",
  "window_size": 64,
  "window_mode": "sliding",
  "completed": 64,
  "total_input_tokens": 8192,
  "total_output_tokens": 16384,
  "request_throughput": 5.2,
  "output_throughput": 1250.0,
  "total_throughput": 2500.0,
  "ttft_ms": {"mean": 45.0, "median": 42.1, "p50": 42.1, "p90": 68.5, "p95": 82.3, "p99": 120.5},
  "s_decode": {"mean": 67.3, "median": 67.5, "p50": 67.5, "p90": 70.1, "p95": 71.2, "p99": 73.0}
}
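
For post-hoc analysis, the file can be read line by line with the standard library. The sketch below returns the most recent statistics object. It assumes one JSON object per line and that FD_LOG_DIR is an environment variable. The helper `read_latest_stats` is a hypothetical name, and the keys it prints are taken from the example above.

```python
import json
import os
from pathlib import Path
from typing import Optional


def read_latest_stats(path) -> Optional[dict]:
    """Return the last JSON object in a JSONL file, or None if the file is empty."""
    latest = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                latest = json.loads(line)
    return latest


# Assumption: FD_LOG_DIR is read from the environment and defaults to ./log.
log_path = Path(os.environ.get("FD_LOG_DIR", "./log")) / "benchmark_metrics.jsonl"
if log_path.exists():
    stats = read_latest_stats(log_path)
    if stats is not None:
        # Not every metric key is necessarily present, so use .get throughout.
        print("completed:", stats.get("completed"))
        print("ttft p99 (ms):", stats.get("ttft_ms", {}).get("p99"))
```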