Offline Inference
1. Usage
FastDeploy supports offline inference by loading models locally and processing user data. Usage examples:
Chat Interface (LLM.chat)
```python
from fastdeploy import LLM, SamplingParams

msg1 = [
    {"role": "system", "content": "I'm a helpful AI assistant."},
    {"role": "user", "content": "把李白的静夜思改写为现代诗"},
]
msg2 = [
    {"role": "system", "content": "I'm a helpful AI assistant."},
    {"role": "user", "content": "Write me a poem about large language model."},
]
messages = [msg1, msg2]

# Sampling parameters
sampling_params = SamplingParams(top_p=0.95, max_tokens=6400)

# Load model
llm = LLM(model="ERNIE-4.5-0.3B", tensor_parallel_size=1, max_model_len=8192)

# Batch inference (internal request queuing and dynamic batching)
outputs = llm.chat(messages, sampling_params)

# Output results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
```
Documentation for `SamplingParams`, `LLM.generate`, `LLM.chat`, and the output structure `RequestOutput` is provided below.
Note: For reasoning models, you need to specify the `reasoning_parser` parameter when loading the model. Additionally, for each request you can toggle the reasoning feature on or off via the `enable_thinking` parameter within `chat_template_kwargs`.
```python
from fastdeploy.entrypoints.llm import LLM

# Load the model
llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")

outputs = llm.chat(
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "图中的文物属于哪个年代"},
        ]}
    ],
    chat_template_kwargs={"enable_thinking": False})

# Output results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    reasoning_text = output.outputs.reasoning_content
```
Text Completion Interface (LLM.generate)
```python
from fastdeploy import LLM, SamplingParams

prompts = [
    "User: 帮我写一篇关于深圳文心公园的500字游记和赏析。\nAssistant: 好的。"
]

# Sampling parameters
sampling_params = SamplingParams(top_p=0.95, max_tokens=6400)

# Load the model
llm = LLM(model="baidu/ERNIE-4.5-21B-A3B-Base-Paddle", tensor_parallel_size=1, max_model_len=8192)

# Batch inference (the LLM internally queues requests and performs dynamic batching based on available resources)
outputs = llm.generate(prompts, sampling_params)

# Output results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
```
Note: The text completion interface is suitable for scenarios where the user has already prepared the full context input and expects the model to return only the continuation; no additional `prompt` concatenation is applied during inference. For chat models, the Chat Interface (`LLM.chat`) is recommended.
For multimodal models such as `baidu/ERNIE-4.5-VL-28B-A3B-Paddle`, when calling the `generate` interface you need to provide a prompt that includes the images. Usage is as follows:
```python
import io

import requests
from PIL import Image

from fastdeploy.entrypoints.llm import LLM
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.input.ernie_tokenizer import ErnieBotTokenizer

PATH = "baidu/ERNIE-4.5-VL-28B-A3B-Paddle"
tokenizer = ErnieBotTokenizer.from_pretrained(PATH)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "图中的文物属于哪个年代"}
        ]
    }
]

# Render the chat template into a plain-text prompt
prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Download the images/videos referenced in the messages
images, videos = [], []
for message in messages:
    content = message["content"]
    if not isinstance(content, list):
        continue
    for part in content:
        if part["type"] == "image_url":
            url = part["image_url"]["url"]
            image_bytes = requests.get(url).content
            img = Image.open(io.BytesIO(image_bytes))
            images.append(img)
        elif part["type"] == "video_url":
            url = part["video_url"]["url"]
            video_bytes = requests.get(url).content
            videos.append({
                "video": video_bytes,
                "max_frames": 30
            })

sampling_params = SamplingParams(temperature=0.1, max_tokens=6400)
llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")

outputs = llm.generate(prompts={
    "prompt": prompt,
    "multimodal_data": {
        "image": images,
        "video": videos
    }
}, sampling_params=sampling_params)

# Output results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    reasoning_text = output.outputs.reasoning_content
```
Note: The `generate` interface does not currently support passing parameters to toggle the thinking feature on or off; it always uses the model's default settings.
2. API Documentation
2.1 fastdeploy.LLM
For `LLM` configuration, refer to the Parameter Documentation.
Configuration Notes:
1. `port` and `metrics_port` are only used for online inference.
2. After startup, the service logs the KV Cache block count (e.g. `total_block_num:640`). Multiply this by `block_size` (default 64) to get the total number of cacheable tokens.
3. Calculate `max_num_seqs` from the cacheable tokens. Example: with an average input of 800 tokens, an average output of 500 tokens, and 640 blocks, `kv_cache_ratio = 800 / (800 + 500) = 0.6` and `max_num_seqs = 640 * 64 / (800 + 500) = 31`; the arithmetic is sketched after this list.
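As a rough illustration, the sizing rule of thumb in note 3 can be written out as follows (the block count comes from your own startup log; the token averages are placeholders):

```python
# Sketch of the KV Cache sizing arithmetic above (illustrative values only).
total_block_num = 640    # reported in the startup log, e.g. "total_block_num:640"
block_size = 64          # FastDeploy default
avg_input_tokens = 800   # expected average prompt length
avg_output_tokens = 500  # expected average generation length

cacheable_tokens = total_block_num * block_size                             # 640 * 64 = 40960
kv_cache_ratio = avg_input_tokens / (avg_input_tokens + avg_output_tokens)  # 800 / 1300 ≈ 0.6
max_num_seqs = cacheable_tokens // (avg_input_tokens + avg_output_tokens)   # 40960 // 1300 = 31

print(f"kv_cache_ratio={kv_cache_ratio:.2f}, max_num_seqs={max_num_seqs}")
```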
2.2 fastdeploy.LLM.chat
- messages(list[dict], list[list[dict]]): Input messages (batch supported)
- sampling_params: See 2.4 for parameter details
- use_tqdm: Enable progress visualization
- chat_template_kwargs(dict): Extra template parameters (currently supports `enable_thinking(bool)`); usage example: `chat_template_kwargs={"enable_thinking": False}`
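Putting these parameters together, a typical batched call looks like the sketch below (model name and sampling values are placeholders, not recommendations):

```python
from fastdeploy import LLM, SamplingParams

llm = LLM(model="ERNIE-4.5-0.3B", tensor_parallel_size=1, max_model_len=8192)
sampling_params = SamplingParams(top_p=0.95, max_tokens=1024)

# Two conversations submitted as one batch; use_tqdm enables the progress bar.
outputs = llm.chat(
    messages=[
        [{"role": "user", "content": "Introduce FastDeploy in one sentence."}],
        [{"role": "user", "content": "Write a haiku about large language models."}],
    ],
    sampling_params=sampling_params,
    use_tqdm=True,
)
```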
2.3 fastdeploy.LLM.generate
- prompts(str, list[str], list[int], list[list[int]], dict[str, Any], list[dict[str, Any]]): Input prompts (batch supported); prompt token ids can be passed directly. Example of a dict-type prompt: `prompts={"prompt": prompt, "multimodal_data": {"image": images}}`
- sampling_params: See 2.4 for parameter details
- use_tqdm: Enable progress visualization
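For illustration, prompts can be passed either as plain strings or as pre-tokenized id lists; the sketch below reuses the base model from the earlier example and uses placeholder token ids:

```python
from fastdeploy import LLM, SamplingParams

llm = LLM(model="baidu/ERNIE-4.5-21B-A3B-Base-Paddle", tensor_parallel_size=1, max_model_len=8192)
sampling_params = SamplingParams(top_p=0.95, max_tokens=128)

# A batch of plain-text prompts ...
outputs = llm.generate(["User: Hello.\nAssistant: ", "User: Tell me a joke.\nAssistant: "], sampling_params)

# ... or prompts passed directly as token id lists
# (the ids below are placeholders; produce real ids with your tokenizer).
outputs = llm.generate([[5151, 2397, 831]], sampling_params)
```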
2.4 fastdeploy.SamplingParams
- presence_penalty(float): Penalizes repeated topics (positive values reduce repetition)
- frequency_penalty(float): Strict penalty for repeated tokens
- repetition_penalty(float): Direct penalty for repeated tokens (>1 penalizes, <1 encourages)
- temperature(float): Controls randomness (higher = more random)
- top_p(float): Probability threshold for token selection
- top_k(int): Number of tokens considered for sampling
- min_p(float): Minimum probability relative to the maximum probability for a token to be considered (>0 filters low-probability tokens to improve quality)
- max_tokens(int): Maximum generated tokens (input + output)
- min_tokens(int): Minimum forced generation length
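A sketch combining several of these fields (values are illustrative, not tuned recommendations):

```python
from fastdeploy import SamplingParams

sampling_params = SamplingParams(
    temperature=0.8,          # higher values increase randomness
    top_p=0.95,               # nucleus sampling probability threshold
    top_k=50,                 # sample only from the 50 most likely tokens
    repetition_penalty=1.05,  # >1 discourages repeated tokens
    max_tokens=512,           # cap on generated length
    min_tokens=1,             # force at least one generated token
)
```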
2.5 fastdeploy.engine.request.RequestOutput
- request_id(str): Request identifier
- prompt(str): Input content
- prompt_token_ids(list[int]): Tokenized input
- outputs(fastdeploy.engine.request.CompletionOutput): Results
- finished(bool): Completion status
- metrics(fastdeploy.engine.request.RequestMetrics): Performance metrics
- num_cached_tokens(int): Cached token count (only valid when `enable_prefix_caching` is enabled)
- error_code(int): Error code
- error_msg(str): Error message
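For reference, these fields can be read from each element returned by `llm.chat`/`llm.generate`, roughly as sketched below (assuming `outputs` comes from one of the earlier examples):

```python
for output in outputs:
    print("request id:", output.request_id)
    print("finished:", output.finished)
    print("prompt token count:", len(output.prompt_token_ids))
    print("cached tokens:", output.num_cached_tokens)  # meaningful only with enable_prefix_caching
    print("generated text:", output.outputs.text)
```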
2.6 fastdeploy.engine.request.CompletionOutput
- index(int): Batch index
- send_idx(int): Request token index
- token_ids(list[int]): Output tokens
- text(str): Decoded text
- reasoning_content(str): Chain-of-thought output (returned only by reasoning models)
2.7 fastdeploy.engine.request.RequestMetrics
- arrival_time(float): Request receipt time
- inference_start_time(float): Inference start time
- first_token_time(float): First token latency
- time_in_queue(float): Queuing time
- model_forward_time(float): Forward pass duration
- model_execute_time(float): Total execution time (including preprocessing)
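For example, per-request timings can be inspected through the `metrics` field of each `RequestOutput`; the sketch below assumes the timing fields are reported in seconds:

```python
for output in outputs:
    metrics = output.metrics
    print("time in queue:", metrics.time_in_queue)
    print("first token latency:", metrics.first_token_time)
    print("model forward time:", metrics.model_forward_time)
    print("total execute time:", metrics.model_execute_time)
```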