Prefix Caching
Prefix Caching is a technique for improving the inference efficiency of generative models. Its core idea is to cache the intermediate computation results (the KV Cache) of input sequences, so that multiple requests sharing the same prefix avoid redundant computation and get faster responses.
How It Works
Prefix Identification: When multiple requests share identical input prefixes (e.g., prompts or initial context), the system caches the intermediate states (KV Cache) corresponding to that prefix.
Incremental Computation: For subsequent requests, only the newly added portions (e.g., user-appended input) need computation while reusing cached intermediate results, significantly reducing computational overhead.
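The sketch below is a conceptual illustration only, not FastDeploy's internal implementation; the PrefixCache class and the BLOCK_SIZE value are invented for the example. It shows the idea at block granularity: each cached KV block is keyed by the token prefix it covers, so a new request reuses every leading block whose prefix matches and only computes the remainder.

# Conceptual sketch of prefix caching (illustrative, not FastDeploy internals).
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative value)

class PrefixCache:
    def __init__(self) -> None:
        # Maps a token-prefix key to the id of a cached KV block.
        self.blocks: Dict[Tuple[int, ...], int] = {}
        self.next_block_id = 0

    def match(self, tokens: List[int]) -> Tuple[List[int], int]:
        """Return (cached block ids, number of prefix tokens whose KV is reused)."""
        reused, hit_blocks = 0, []
        for start in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
            key = tuple(tokens[: start + BLOCK_SIZE])
            if key not in self.blocks:
                break
            hit_blocks.append(self.blocks[key])
            reused = start + BLOCK_SIZE
        return hit_blocks, reused

    def insert(self, tokens: List[int]) -> None:
        """Register full blocks of this sequence so later requests can reuse them."""
        for start in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
            key = tuple(tokens[: start + BLOCK_SIZE])
            if key not in self.blocks:
                self.blocks[key] = self.next_block_id
                self.next_block_id += 1

cache = PrefixCache()
system_prompt = list(range(16))           # 16 shared "prompt" token ids
cache.insert(system_prompt + [101, 102])  # first request fills the cache
_, reused = cache.match(system_prompt + [201, 202, 203])
print(f"second request reuses KV for {reused} prefix tokens")  # -> 16

Running the sketch prints that the second request reuses the KV Cache of the 16 shared prompt tokens and only needs to compute its own appended input.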
Enabling Prefix Caching for Service Deployment
To enable prefix caching when launching the service, add the enable-prefix-caching parameter. By default, only first-level caching (the GPU cache) is enabled.
To additionally enable CPU caching, specify the swap-space parameter to allocate CPU cache space (in GB). Size it according to the memory still available on the machine after the model is loaded.
Note: The ERNIE-4.5-VL multimodal model currently does not support prefix caching.
For detailed parameter descriptions, refer to the Parameters Documentation.
Example launch command:
python -m fastdeploy.entrypoints.openai.api_server \
--model "baidu/ERNIE-4.5-21B-A3B-Paddle" \
--port 8180 --engine-worker-queue-port 8181 \
--metrics-port 8182 \
--cache-queue-port 8183 \
--enable-prefix-caching \
--swap-space 50 \
--max-model-len 8192 \
--max-num-seqs 32
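To see the cache pay off, a client can send several requests that share a long common prefix, such as the same system prompt. The sketch below is only an illustration and makes assumptions beyond this document: it presumes the openai Python client is installed and that the server launched above is reachable on localhost:8180. With enable-prefix-caching on, the KV Cache built for the shared system prompt during the first request can be reused by the second.

# Illustrative client sketch: two requests sharing the same system prompt.
import openai

# Placeholder API key for a local deployment; adjust if your server requires one.
client = openai.OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")

shared_system_prompt = "You are a helpful assistant. " * 50  # long shared prefix

for question in ["What is prefix caching?", "How large should swap-space be?"]:
    response = client.chat.completions.create(
        model="baidu/ERNIE-4.5-21B-A3B-Paddle",
        messages=[
            {"role": "system", "content": shared_system_prompt},
            {"role": "user", "content": question},
        ],
        max_tokens=128,
    )
    print(response.choices[0].message.content)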
Enabling Prefix Caching for Offline Inference
Set enable_prefix_caching=True when launching FastDeploy. To also enable CPU caching, set swap_space according to the memory available on the machine.
A test example is provided: demo/offline_prefix_caching_demo.py
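For orientation, a minimal offline sketch is given below. It assumes FastDeploy's LLM / SamplingParams offline entry point and the result fields shown; treat the demo script above as the authoritative reference.

# Offline inference sketch with prefix caching enabled (assumed API shape).
from fastdeploy import LLM, SamplingParams

prompts = [
    "Introduce prefix caching.",
    "Introduce prefix caching in one sentence.",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# enable_prefix_caching turns on GPU prefix caching; swap_space (in GB)
# additionally enables the CPU cache, sized to the memory left after model loading.
llm = LLM(
    model="baidu/ERNIE-4.5-21B-A3B-Paddle",
    max_model_len=8192,
    enable_prefix_caching=True,
    swap_space=50,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs.text)  # result field assumed; see the demo script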