FAQ
1. CUDA out of memory
- When starting the service:
  - Check the minimum number of GPUs required for the model and quantization method. If the deployment has fewer, increase the number of GPUs.
  - If CUDAGraph is enabled, either reserve more GPU memory for CUDAGraph by lowering gpu_memory_utilization, or reduce CUDAGraph's GPU memory usage by reducing max_num_seqs and setting cudagraph_capture_sizes.
- During service operation:
  - Check whether the log contains entries similar to the following. If so, the cause is usually insufficient output blocks, and you need to reduce kv-cache-ratio:
need_block_len: 1, free_list_len: 0
step max_id: 2, max_num: 133, encoder block len: 24
recover seq_id: 2, free_list_len: 144, used_list_len: 134
need_block_len: 1, free_list_len: 0
step max_id: 2, max_num: 144, encoder_block_len: 24
It is recommended to let the service manage blocks globally. To enable this, set the following environment variable before starting the service:
export ENABLE_V1_KVCACHE_SCHEDULER=1
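The log check described above can be scripted. The following is a minimal sketch, assuming the log entries keep the `need_block_len: ..., free_list_len: ...` format shown in this FAQ; the function name `count_block_exhaustion` is illustrative and not part of the service.

```shell
# Sketch: count log lines where a block was requested while the free list
# was empty. The function name is hypothetical; pass your own log file path.
count_block_exhaustion() {
    # Match "need_block_len: <n>, free_list_len: 0" anywhere on a line, so
    # timestamp or logger prefixes do not matter. The trailing group makes
    # sure the free-list value is exactly 0.
    # grep -c exits non-zero when nothing matches, hence the "|| true".
    grep -c -E 'need_block_len: [0-9]+, free_list_len: 0([^0-9]|$)' "$1" || true
}
```

Call it as `count_block_exhaustion <path-to-log-file>`. A non-zero count suggests that the service ran out of free blocks during decoding, which is the situation this entry describes.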
2. Poor model performance
- First, check whether the output length is as expected; the problem may simply be caused by an excessively long decode. If the output is long, check whether the log contains entries similar to the following. If so, the cause is usually insufficient output blocks, and you need to reduce kv-cache-ratio:
need_block_len: 1, free_list_len: 0
step max_id: 2, max_num: 133, encoder block len: 24
recover seq_id: 2, free_list_len: 144, used_list_len: 134
need_block_len: 1, free_list_len: 0
step max_id: 2, max_num: 144, encoder_block_len: 24
It is also recommended to let the service manage blocks globally. To enable this, set the following environment variable before starting the service:
export ENABLE_V1_KVCACHE_SCHEDULER=1
- Check whether the number of KVCache blocks allocated by automatic profiling is as expected. Automatic profiling can be affected by GPU memory fluctuations and may allocate fewer blocks than expected. In that case, manually set the num_gpu_blocks_override parameter to increase the number of KVCache blocks.
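As a rough illustration of where these two parameters would be set, the sketch below wraps a launch command in a shell function. It rests on several assumptions: the entry point module, the dashed flag spellings (derived from the parameter names in this FAQ), the `MODEL_PATH` variable, and both numeric values are placeholders to check against the parameter reference for your installed version.

```shell
# Sketch only: the command is wrapped in a function so nothing runs when
# this snippet is sourced. MODEL_PATH and both values are placeholders.
launch_with_kv_cache_tuning() {
    # Assumed entry point and flag spellings; verify them for your version.
    # A lower kv-cache-ratio is the adjustment suggested in this FAQ, and
    # num-gpu-blocks-override replaces the automatically profiled count.
    python -m fastdeploy.entrypoints.openai.api_server \
        --model "$MODEL_PATH" \
        --kv-cache-ratio 0.7 \
        --num-gpu-blocks-override 2048
}
```

Pick the override value from the block count you expect for your GPU memory budget rather than copying the number shown here.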