Skip to content

简体中文

Monitoring Metrics

After FastDeploy is launched, it supports continuous monitoring of the FastDeploy service status through Metrics. When starting FastDeploy, you can specify the port for the Metrics service by configuring the metrics-port parameter.

Category Metric Name Type Description Unit
Request fastdeploy:requests_number Counter Total number of received requests count
Request fastdeploy:request_success_total Counter Number of successfully processed requests count
Request fastdeploy:num_requests_running Gauge Number of requests currently running count
Request fastdeploy:num_requests_waiting Gauge Number of requests currently waiting count
Latency fastdeploy:time_to_first_token_seconds Histogram Time to generate the first token (TTFT) s
Latency fastdeploy:time_per_output_token_seconds Histogram Time interval between generated tokens (TPOT) s
Latency fastdeploy:e2e_request_latency_seconds Histogram End-to-end request latency distribution s
Latency fastdeploy:request_inference_time_seconds Histogram Time spent in the RUNNING phase s
Latency fastdeploy:request_queue_time_seconds Histogram Time spent in the WAITING phase s
Latency fastdeploy:request_prefill_time_seconds Histogram Time spent in the Prefill phase s
Latency fastdeploy:request_decode_time_seconds Histogram Time spent in the Decode phase s
Token fastdeploy:prompt_tokens_total Counter Total number of processed prompt tokens count
Token fastdeploy:generation_tokens_total Counter Total number of generated tokens count
Token fastdeploy:request_prompt_tokens Histogram Prompt token count per request count
Token fastdeploy:request_generation_tokens Histogram Generation token count per request count
Token fastdeploy:request_params_max_tokens Histogram Distribution of max_tokens per request count
Batch fastdeploy:available_batch_size Gauge Number of additional requests that can be inserted during Decode count
Batch fastdeploy:batch_size Gauge Actual batch size during inference count
Batch fastdeploy:max_batch_size Gauge Maximum batch size configured at service startup count
KV Cache fastdeploy:cache_config_info Gauge Cache configuration info of the inference engine count
KV Cache fastdeploy:hit_req_rate Gauge Prefix cache hit rate at the request level %
KV Cache fastdeploy:hit_token_rate Gauge Prefix cache hit rate at the token level %
KV Cache fastdeploy:cpu_hit_token_rate Gauge CPU-side token-level prefix cache hit rate %
KV Cache fastdeploy:gpu_hit_token_rate Gauge GPU-side token-level prefix cache hit rate %
KV Cache fastdeploy:prefix_cache_token_num Counter Total number of tokens in prefix cache count
KV Cache fastdeploy:prefix_gpu_cache_token_num Counter Total number of prefix cache tokens on GPU count
KV Cache fastdeploy:prefix_cpu_cache_token_num Counter Total number of prefix cache tokens on CPU count
KV Cache fastdeploy:available_gpu_block_num Gauge Available GPU blocks in cache (including unreleased prefix blocks) count
KV Cache fastdeploy:free_gpu_block_num Gauge Number of free GPU blocks in cache count
KV Cache fastdeploy:max_gpu_block_num Gauge Total number of GPU blocks initialized at startup count
KV Cache fastdeploy:available_gpu_resource Gauge Ratio of available GPU blocks to total GPU blocks %
KV Cache fastdeploy:gpu_cache_usage_perc Gauge GPU KV cache utilization %
KV Cache fastdeploy:send_cache_failed_num Counter Total number of cache send failures count

Accessing Metrics

  • Access URL: http://localhost:8000/metrics
  • Metric Type: Prometheus format