# Thinking Budget Logits Processor

## Overview

`ThinkingBudgetLogitsProcessor` limits the number of tokens generated inside the `<think> ... </think>` segment. When the budget is reached, it forces a line break token followed by the `</think>` token to terminate the thinking section.
## When to Use

- Models that emit `<think>`/`</think>` tokens for reasoning.
- You need a hard cap on thinking length without changing sampling logic.
## How It Works

- CPU precompute (`DataProcessor`): when a request includes `thinking_budget`, the prompt token ids are scanned to determine whether thinking has started, whether it has already ended, and how many tokens are already inside the thinking section.
- Per-step update: during decoding, the processor tracks `last_token_id` and `tokens_after_start`.
- Budget enforcement: once the budget is reached, the processor forces a line break and then the thinking end token.
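The steps above can be sketched as a toy processor. This is an illustrative simplification, not the real FastDeploy class: the token ids and the `ToyThinkingBudget` name are made up, and the forcing strategy (mask every logit except the forced token) is one common way to implement such enforcement.

```python
class ToyThinkingBudget:
    """Toy sketch of the budget-enforcement flow described above."""

    def __init__(self, budget, think_start_id, think_end_id, line_break_id):
        self.budget = budget
        self.think_start_id = think_start_id
        self.think_end_id = think_end_id
        self.line_break_id = line_break_id
        self.in_think = False
        self.tokens_after_start = 0
        self.forced = []  # queue of tokens to force on upcoming steps

    def update_state(self, last_token_id):
        # Called once per decode step with the previously sampled token.
        if last_token_id == self.think_start_id:
            self.in_think = True
            self.tokens_after_start = 0
        elif last_token_id == self.think_end_id:
            self.in_think = False
        elif self.in_think:
            self.tokens_after_start += 1
            if self.tokens_after_start >= self.budget and not self.forced:
                # Budget hit: force a line break, then the thinking end token.
                self.forced = [self.line_break_id, self.think_end_id]

    def apply(self, logits):
        # logits: one float per vocab token. While tokens are queued,
        # mask everything except the next forced token.
        if self.forced:
            forced_id = self.forced.pop(0)
            return [0.0 if i == forced_id else float("-inf")
                    for i in range(len(logits))]
        return logits


p = ToyThinkingBudget(budget=2, think_start_id=1, think_end_id=2, line_break_id=3)
for tok in (1, 4, 4):           # "<think>", then two thinking tokens
    p.update_state(tok)
masked = p.apply([0.0] * 5)
print(masked.index(0.0))        # -> 3: the line-break token is forced first
```

On the following step the queue still holds the end token, so `apply` forces `</think>` and the thinking section terminates regardless of what the model would have sampled.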
## Requirements

- The model must provide valid token ids for `think_start_id`, `think_end_id`, and `line_break_id` (via `ModelConfig`).
- If any of these ids are invalid, the processor is disabled and `thinking_budget` will not take effect.
## Request Parameters

- `thinking_budget` (int, required to enable): maximum number of tokens after `<think>` before forced termination.
- `think_stop_sentence` (string, optional): a stop sentence that is tokenized on the CPU side and enforced near the budget boundary.
## Operator-Level Limit vs. LogitsProcessor

FastDeploy has two ways to limit thinking length:

- Operator-level limit (`enable_thinking=true` + `reasoning_max_tokens`):
  - Implemented in built-in post-processing kernels.
  - Lower overhead and better throughput under high concurrency.
  - Best for simple "cap the thinking length" use cases.
- `ThinkingBudgetLogitsProcessor` (`logits_processors_args.thinking_budget`):
  - Implemented in per-step Python logits processing.
  - Supports flexible controls, such as `think_stop_sentence` (a custom sentence inserted before thinking ends).
  - Higher runtime overhead under high concurrency than the operator-level limit.

In short:

- If you only need a hard cap on thinking length, prefer `reasoning_max_tokens`.
- If you need custom behavior (for example, injecting custom sentence tokens), use `ThinkingBudgetLogitsProcessor`.
Practical guidance
reasoning_max_tokens and thinking_budget are not mutually exclusive in current implementation.
If both are configured for the same request, both constraints can take effect, and whichever triggers first will end the thinking phase.
- To use operator-level-only behavior: this is request-level config only. Set
enable_thinking=trueandreasoning_max_tokensin request, and do not setthinking_budget. - To use logits-processor-only behavior (especially with
think_stop_sentence): this requires service-level + request-level config. Start service with--logits-processors ThinkingBudgetLogitsProcessor, and setthinking_budget(and optionalthink_stop_sentence) inlogits_processors_args; leavereasoning_max_tokensunset. - Avoid enabling both for strict custom sentence insertion requirements, because operator-level termination may cut the custom sentence path earlier.
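One hedged way to read the "whichever triggers first" rule: with both limits set, the thinking phase ends after at most the smaller of the two caps. The helper below is a simplification for illustration, not FastDeploy code.

```python
def effective_thinking_cap(reasoning_max_tokens=None, thinking_budget=None):
    """Simplified model: thinking ends no later than the smaller configured limit."""
    caps = [c for c in (reasoning_max_tokens, thinking_budget) if c is not None]
    return min(caps) if caps else None  # None: no thinking cap configured


print(effective_thinking_cap(reasoning_max_tokens=200, thinking_budget=20))  # -> 20
```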
## Online Usage

1. Start the service:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --model Qwen/Qwen3-0.6B \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --max-model-len 32768 \
  --max-num-seqs 32 \
  --logits-processors ThinkingBudgetLogitsProcessor
```
2. Send a request:

```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_completion_tokens": 30,
    "logits_processors_args": {
      "thinking_budget": 20,
      "think_stop_sentence": "Thinking limit reached, now replying."
    }
  }'
```

If you do not need thinking control for a request, simply omit `thinking_budget`.
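The same request can be issued from Python. The payload below mirrors the curl body; the commented-out `requests` call assumes the service from step 1 is running locally.

```python
import json

# Request body mirroring the curl example above.
payload = {
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_completion_tokens": 30,
    "logits_processors_args": {
        "thinking_budget": 20,
        "think_stop_sentence": "Thinking limit reached, now replying.",
    },
}

# With the service from step 1 running:
# import requests
# resp = requests.post("http://0.0.0.0:8180/v1/chat/completions", json=payload)
# print(resp.json()["choices"][0]["message"]["content"])

print(json.dumps(payload, indent=2))
```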
3. Operator-level thinking cap only (no logits processor):

```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_completion_tokens": 512,
    "enable_thinking": true,
    "reasoning_max_tokens": 200
  }'
```
## Offline Usage

```python
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-0.6B",
    engine_worker_queue_port=8282,
    cache_queue_port=8383,
    logits_processors=["ThinkingBudgetLogitsProcessor"],
)

sampling_params = SamplingParams(
    max_tokens=512,
    logits_processors_args={"thinking_budget": 20, "think_stop_sentence": "Thinking limit reached, now replying."},
)

outputs = llm.chat([{"role": "user", "content": "Hello, who are you?"}], sampling_params)
print(outputs[0].outputs.text)
```
## Performance Note

This processor runs `update_state` and `apply` on every decode step. If you only need a hard thinking-length cap and throughput matters most, consider the operator-level reasoning-length controls instead of per-step logits processing.