Early Stopping
Early stopping is used to terminate the model's token generation ahead of time. Specifically, it applies different strategies to decide whether the currently generated token sequence meets the early stopping criteria; if so, token generation is terminated prematurely. FastDeploy currently supports the repetition strategy and stop sequences.
1. Repetition Strategy
- The repetition strategy decides whether to trigger early stopping by counting how many consecutive high-probability tokens have been generated.
- Specifically, if a batch generates tokens whose probability exceeds a user-defined threshold for a user-defined number of consecutive steps, token generation for that batch is terminated early (see the sketch after this list).
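The following is a minimal sketch of this check, not FastDeploy's actual implementation; the `RepetitionEarlyStopper` class and its method are hypothetical, and `window_size`/`threshold` mirror the parameters described under Parameter Description below.

```python
# Illustrative sketch of the repetition strategy (hypothetical helper, not FastDeploy internals).
class RepetitionEarlyStopper:
    def __init__(self, window_size: int = 3000, threshold: float = 0.99):
        self.window_size = window_size  # max allowed consecutive high-probability tokens
        self.threshold = threshold      # probability above which a token counts as "high probability"
        self.consecutive = 0            # running count for the current batch element

    def should_stop(self, token_prob: float) -> bool:
        """Update the counter with the newest token's probability and report whether to stop."""
        if token_prob > self.threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0        # any low-probability token resets the window
        return self.consecutive >= self.window_size


# Example: a long run of near-certain tokens triggers early stopping.
stopper = RepetitionEarlyStopper(window_size=5, threshold=0.9)
for p in [0.95, 0.97, 0.99, 0.96, 0.98]:
    if stopper.should_stop(p):
        print("early stop triggered")
```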
Usage Instructions
When starting the service, add the early stopping startup option.
- Online inference startup example:
- Using default hyperparameters: `--enable-early-stop`

```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-0.3B-Paddle \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --enable-early-stop
```
- Using custom hyperparameters: `--early-stop-config`

```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-0.3B-Paddle \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --early-stop-config '{"enable_early_stop": true, "window_size": 1000, "threshold": 0.9}'
```
- Offline inference example:
- Using default hyperparameters: `enable_early_stop`

```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "baidu/ERNIE-4.5-0.3B-Paddle"

sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=1, enable_early_stop=True)
output = llm.generate(prompts="who are you?", use_tqdm=True, sampling_params=sampling_params)

print(output)
```
- Using custom hyperparameters: `early_stop_config`

```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "baidu/ERNIE-4.5-0.3B-Paddle"

early_stop_config = {"enable_early_stop": True, "window_size": 1000, "threshold": 0.9}
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=1, early_stop_config=early_stop_config)
output = llm.generate(prompts="who are you?", use_tqdm=True, sampling_params=sampling_params)

print(output)
```
Parameter Description
- `enable_early_stop`: (bool) Whether to enable early stopping. Default: `False`.
- `strategy`: (str) The strategy used for early stopping. Currently only the repetition strategy is supported. Default: `"repetition"`.
- `window_size`: (int) The upper limit on the number of consecutive high-probability tokens in the repetition strategy; exceeding this limit triggers early stopping. Default: `3000`.
- `threshold`: (float) The high-probability threshold in the repetition strategy. Default: `0.99`.
2. Stop Sequence
- The stop sequence strategy decides whether to trigger early stopping by checking whether the generated token sequence contains a user-specified stop sequence.
- Specifically, if the token sequence generated for a batch contains a user-specified stop sequence, token generation for that batch is terminated early (see the sketch after this list).
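Below is a minimal, text-level sketch of this check; the `hits_stop_sequence` helper is hypothetical and only illustrates the matching idea, not FastDeploy's internal implementation.

```python
# Illustrative, text-level sketch of stop-sequence matching (not FastDeploy internals).
from typing import List

def hits_stop_sequence(generated_text: str, stop_sequences: List[str]) -> bool:
    """Return True if any user-specified stop sequence appears in the generated text."""
    return any(stop in generated_text for stop in stop_sequences)

# Example: generation would be terminated once "出去走走" appears in the output.
print(hits_stop_sequence("今天天气真好，我们出去走走吧", ["明天", "出去走走"]))  # True
```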
Usage Instructions
Before starting the service, set the following environment variables:

- `FD_STOP_SEQS_MAX_LEN`: maximum length of a single stop sequence (default: 8)
- `FD_MAX_STOP_SEQS_NUM`: maximum number of stop sequences (default: 5)
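For example, the limits can be set in the shell before launching the service; the values below are simply the documented defaults.

```shell
export FD_STOP_SEQS_MAX_LEN=8    # maximum length of a single stop sequence
export FD_MAX_STOP_SEQS_NUM=5    # maximum number of stop sequences per request
```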
When sending a request, pass the `stop` parameter; it can be a `str` or a `List[str]`.

- Online serving: set the `stop` parameter in the request.
```python
# create a chat request with the "stop" parameter
import openai

ip = "0.0.0.0"
service_http_port = "8233"
client = openai.Client(base_url=f"http://{ip}:{service_http_port}/v1", api_key="EMPTY_API_KEY")

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "今天天气真好"},
    ],
    temperature=1.0,
    top_p=0,
    stream=False,
    stop=["明天", "出去走走"],
)
```
- Offline LLM: set the `stop` parameter in `SamplingParams`.
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "ERNIE-4.5-21B-A3B-Paddle"

sampling_params = SamplingParams(temperature=1, top_p=0, stop=["出去走走"])
llm = LLM(model=model_name_or_path, tensor_parallel_size=1)
output = llm.chat(messages=[{"role": "user", "content": "今天天气真好"}], use_tqdm=True, sampling_params=sampling_params)

print(output)
```