# bench: Benchmark Testing
## 1. bench latency: Offline Latency Test

### Parameters
| Parameter | Description | Default |
|-----------|-------------|---------|
| --input-len | Input sequence length (tokens) | 32 |
| --output-len | Output sequence length (tokens) | 128 |
| --batch-size | Batch size | 8 |
| --n | Number of sequences generated per prompt | 1 |
| --use-beam-search | Whether to use beam search | False |
| --num-iters-warmup | Number of warmup iterations | 10 |
| --num-iters | Number of measured test iterations | 30 |
| --profile | Whether to enable performance profiling | False |
| --output-json | Path to save latency results as a JSON file | None |
| --disable-detokenize | Whether to disable detokenization | False |
### Example

```shell
# Run latency benchmark on the inference engine
fastdeploy bench latency --model baidu/ERNIE-4.5-0.3B-Paddle
```
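The measurement scheme behind the flags above (a number of warmup iterations whose timings are discarded, followed by `--num-iters` timed iterations) can be sketched as follows. This is an illustrative sketch, not FastDeploy's actual code; `generate` is a hypothetical stand-in for one batched engine call:

```python
import statistics
import time


def benchmark_latency(generate, num_iters_warmup=10, num_iters=30):
    """Warm up first, then time each iteration and summarize the distribution."""
    for _ in range(num_iters_warmup):
        generate()  # warmup runs are executed but not recorded
    latencies = []
    for _ in range(num_iters):
        start = time.perf_counter()
        generate()
        latencies.append(time.perf_counter() - start)
    # quantiles(n=100) yields the 1st..99th percentiles
    percentiles = statistics.quantiles(latencies, n=100)
    return {
        "avg_latency": statistics.mean(latencies),
        "p50_latency": percentiles[49],
        "p99_latency": percentiles[98],
    }
```

Warming up matters because the first iterations typically pay one-time costs (graph capture, memory allocation, cache population) that would skew the reported percentiles.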
## 2. bench serve: Online Latency and Throughput Test

### Parameters
| Parameter | Description | Default |
|-----------|-------------|---------|
| --backend | Backend type | "openai-chat" |
| --base-url | Base URL of the server or API | None |
| --host | Host address | "127.0.0.1" |
| --port | Port | 8000 |
| --endpoint | API endpoint path | "/v1/chat/completions" |
| --model | Model name | Required |
| --dataset-name | Dataset name | "sharegpt" |
| --dataset-path | Path to the dataset | None |
| --num-prompts | Number of prompts to process | 1000 |
| --request-rate | Request rate (requests per second) | inf |
| --max-concurrency | Maximum concurrency | None |
| --top-p | Sampling top-p (OpenAI backend) | None |
| --top-k | Sampling top-k (OpenAI backend) | None |
| --temperature | Sampling temperature (OpenAI backend) | None |
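With the default `--request-rate inf`, requests are issued with no pacing between them, so `--max-concurrency` is what bounds the load. Serving load generators of this kind commonly draw inter-arrival gaps from an exponential distribution to simulate Poisson traffic at a finite rate; the sketch below illustrates that idea and is an assumption about the mechanism, not FastDeploy's actual implementation:

```python
import random


def request_intervals(num_prompts, request_rate, seed=0):
    """Gaps (seconds) to wait before sending each request.

    A rate of float('inf') means no pacing: every request fires immediately.
    Otherwise, exponential gaps with mean 1/request_rate give Poisson arrivals.
    """
    if request_rate == float("inf"):
        return [0.0] * num_prompts
    rng = random.Random(seed)  # seeded for reproducible load patterns
    return [rng.expovariate(request_rate) for _ in range(num_prompts)]
```

For example, a rate of 10 requests/s produces gaps averaging 0.1 s, with the bursty spacing typical of real traffic rather than a fixed 0.1 s tick.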
### Example

```shell
# Run online performance test
fastdeploy bench serve --backend openai-chat \
  --model baidu/ERNIE-4.5-0.3B-Paddle \
  --endpoint /v1/chat/completions \
  --host 0.0.0.0 \
  --port 8891 \
  --dataset-name EBChat \
  --dataset-path /datasets/filtered_sharedgpt_2000_input_1136_output_200.json \
  --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
  --metric-percentiles 80,95,99,99.9,99.95,99.99 \
  --num-prompts 1 \
  --max-concurrency 1 \
  --save-result
```
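The core metrics requested via `--percentile-metrics` are standard streaming-serving measures: ttft (time to first token), tpot (time per output token after the first), itl (inter-token latency), and e2el (end-to-end latency). Assuming per-token arrival timestamps are available, they relate as in this sketch (illustrative only; the `s_`-prefixed variants from the example are not modeled here):

```python
def streaming_metrics(token_times, request_start):
    """Derive streaming serving metrics from per-token arrival timestamps."""
    ttft = token_times[0] - request_start          # time to first token
    e2el = token_times[-1] - request_start         # end-to-end latency
    # inter-token latency: gap between each pair of consecutive tokens
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    # time per output token: decode time averaged over tokens after the first
    n = len(token_times)
    tpot = (e2el - ttft) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2el": e2el}
```

Note that tpot is the mean of the itl values, so percentiles of itl expose decode-speed jitter that the tpot average smooths over.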
## 3. bench throughput: Throughput Test

### Parameters
| Parameter | Description | Default |
|-----------|-------------|---------|
| --backend | Inference backend | "fastdeploy" |
| --dataset-name | Dataset name | "random" |
| --model | Model name | Required |
| --input-len | Input sequence length | None |
| --output-len | Output sequence length | None |
| --prefix-len | Prefix length | 0 |
| --n | Number of sequences generated per prompt | 1 |
| --num-prompts | Number of prompts | 50 |
| --output-json | Path to save results as a JSON file | None |
| --disable-detokenize | Whether to disable detokenization | False |
| --lora-path | Path to the LoRA adapter | None |
Example
# Run throughput benchmark on the inference engine
fastdeploy bench throughput --model baidu/ERNIE-4.5-0.3B-Paddle \
--backend fastdeploy-chat \
--dataset-name EBChat \
--dataset-path /datasets/filtered_sharedgpt_2000_input_1136_output_200.json \
--max-model-len 32768
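Throughput results boil down to requests and tokens completed per unit of wall-clock time. A minimal sketch of how such a summary could be computed from the run's totals (a hypothetical helper with fixed-length requests, not FastDeploy's internal code):

```python
def throughput_summary(num_prompts, input_len, output_len, elapsed_s):
    """Summarize a throughput run where every request has the same lengths."""
    total_tokens = num_prompts * (input_len + output_len)
    return {
        "requests_per_s": num_prompts / elapsed_s,
        # total throughput counts both prompt and generated tokens
        "total_tokens_per_s": total_tokens / elapsed_s,
        # output throughput counts only generated tokens
        "output_tokens_per_s": num_prompts * output_len / elapsed_s,
    }
```

For instance, 50 prompts of 32 input and 128 output tokens finishing in 10 s would report 5 requests/s, 800 total tokens/s, and 640 output tokens/s.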
## 4. bench eval: Online Task Evaluation

### Parameters
| Parameter | Description | Default |
|-----------|-------------|---------|
| --model, -m | Model name | "hf" |
| --tasks, -t | List of evaluation tasks | None |
| --model_args, -a | Model arguments | "" |
| --num_fewshot, -f | Number of few-shot examples | None |
| --samples, -E | Number of samples | None |
| --batch_size, -b | Batch size | 1 |
| --device | Device | None |
| --output_path, -o | Output file path | None |
| --write_out, -w | Whether to write output results | False |
### Example

```shell
# Run task evaluation on an online service
fastdeploy bench eval --model local-completions \
  --model_args pretrained=./baidu/ERNIE-4.5-0.3B-Paddle,base_url=http://0.0.0.0:8490/v1/completions \
  --write_out \
  --tasks ceval-valid_accountant
```