OpenAI Protocol-Compatible API Server
FastDeploy provides a service-oriented deployment solution that is compatible with the OpenAI protocol. Users can quickly deploy it using the following command:
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8188 --tensor-parallel-size 8 \
--max-model-len 32768
To enable log probability output, add the --enable-logprob flag when deploying:
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8188 --tensor-parallel-size 8 \
--max-model-len 32768 \
--enable-logprob
For more command-line options available during service deployment, refer to the Parameter Descriptions.
Chat Completion API
FastDeploy provides a Chat Completion API that is compatible with the OpenAI protocol, allowing user requests to be sent directly using OpenAI's request method.
Sending User Requests
Here is an example of sending a user request using the curl command:
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
Here's an example curl command demonstrating how to include the logprobs parameter in a user request:
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Hello!"}
],
"logprobs": true, "top_logprobs": 5
}'
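The same logprobs request can also be issued through the OpenAI Python client. The following is a minimal sketch that assumes the server deployed above is listening on port 8188; the logprobs and top_logprobs arguments mirror the curl payload.
import openai
client = openai.Client(base_url="http://0.0.0.0:8188/v1", api_key="null")
response = client.chat.completions.create(
    model="null",
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,    # ask the server to return per-token log probabilities
    top_logprobs=5,   # also return the 5 most likely alternatives per position
)
# Each entry in logprobs.content corresponds to one generated token.
for token_info in response.choices[0].logprobs.content:
    print(token_info.token, token_info.logprob)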
Here is an example of sending a user request using a Python script:
import openai
host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "Rewrite Li Bai's 'Quiet Night Thought' as a modern poem"},
    ],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
For a description of the OpenAI protocol, refer to the document OpenAI Chat Completion API.
Compatible OpenAI Parameters
messages: Union[List[Any], List[int]]
# List of input messages, which can be text messages (`List[Any]`, typically `List[dict]`) or token ID lists (`List[int]`).
tools: Optional[List[ChatCompletionToolsParam]] = None
# List of tool call configurations, used to enable function calling or tool usage (e.g., the ReAct framework).
model: Optional[str] = "default"
# Specifies the model name or version to use, defaulting to `"default"` (which may point to the base model).
frequency_penalty: Optional[float] = None
# Frequency penalty coefficient, penalizing tokens in proportion to how often they have already appeared (`>0` discourages repetition, `<0` encourages it, default `None` disables).
logprobs: Optional[bool] = False
# Whether to return the log probabilities of each generated token, used for debugging or analysis.
top_logprobs: Optional[int] = 0
# Returns the top `top_logprobs` tokens and their log probabilities for each generated position (default `0` means no return).
max_tokens: Optional[int] = Field(
default=None,
deprecated="max_tokens is deprecated in favor of the max_completion_tokens field",
)
# Deprecated: Maximum number of tokens to generate (recommended to use `max_completion_tokens` instead).
max_completion_tokens: Optional[int] = None
# Maximum number of tokens to generate (recommended alternative to `max_tokens`), no default limit (restricted by the model's context window).
presence_penalty: Optional[float] = None
# Presence penalty coefficient, penalizing tokens that have already appeared so the model is more likely to move on to new topics (`>0` encourages new topics, `<0` discourages them, default `None` disables).
stream: Optional[bool] = False
# Whether to enable streaming output (return results token by token), default `False` (returns complete results at once).
stream_options: Optional[StreamOptions] = None
# Additional options for streaming output; see the `StreamOptions` definition for the available fields.
temperature: Optional[float] = None
# Temperature coefficient, controlling generation randomness (`0.0` for deterministic generation, `>1.0` for more randomness, default `None` uses model default).
top_p: Optional[float] = None
# Nucleus sampling threshold; sampling is limited to the smallest set of tokens whose cumulative probability reaches `top_p` (default `None` disables).
response_format: Optional[AnyResponseFormat] = None
# Specifies the output format (such as JSON, XML, etc.), requires passing a predefined format configuration object.
user: Optional[str] = None
# User identifier, used for tracking or distinguishing requests from different users (default `None` does not pass).
metadata: Optional[dict] = None
# Additional metadata, used for passing custom information (such as request ID, debug markers, etc.).
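As an illustration of how these compatible parameters are passed in practice, here is a hedged sketch combining several of them in one chat request; the host, port, and model name simply follow the earlier examples and are assumptions.
import openai
client = openai.Client(base_url="http://0.0.0.0:8188/v1", api_key="null")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Summarize nucleus sampling in one sentence."}],
    temperature=0.7,            # moderate randomness
    top_p=0.9,                  # nucleus sampling threshold
    max_completion_tokens=128,  # preferred over the deprecated max_tokens
    frequency_penalty=0.5,      # discourage verbatim repetition
    user="doc-example",         # request attribution
)
print(response.choices[0].message.content)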
Additional Parameters Added by FastDeploy
Note: When sending requests with curl, the following parameters can be used directly in the request body. When sending requests with openai.Client, they must be placed in the `extra_body` parameter, e.g. `extra_body={"chat_template_kwargs": {"enable_thinking": True}, "include_stop_str_in_output": True}`.
The following sampling parameters are supported.
top_k: Optional[int] = None
# Limits the consideration to the top K tokens with the highest probability at each generation step, used to control randomness (default None means no limit).
min_p: Optional[float] = None
# Minimum probability threshold relative to the most likely token; tokens whose probability falls below min_p times that of the top token are filtered out (default None means disabled).
min_tokens: Optional[int] = None
# Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).
include_stop_str_in_output: Optional[bool] = False
# Whether to include the stop string content in the output (default False, meaning output is truncated when a stop string is encountered).
bad_words: Optional[List[str]] = None
# List of forbidden words (e.g., sensitive words) that the model should avoid generating (default None means no restriction).
repetition_penalty: Optional[float] = None
# Repetition penalty coefficient, reducing the probability of repeating already generated tokens (`>1.0` suppresses repetition, `<1.0` encourages repetition, default None means disabled).
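These sampling parameters are not part of the standard OpenAI client signature, so with openai.Client they go through extra_body, as in this minimal sketch (the concrete values are illustrative only):
import openai
client = openai.Client(base_url="http://0.0.0.0:8188/v1", api_key="null")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a two-line slogan for a tea shop."}],
    extra_body={
        "top_k": 50,                # sample only from the 50 most likely tokens
        "min_p": 0.05,              # drop tokens far less likely than the top token
        "repetition_penalty": 1.1,  # mildly discourage repeating earlier tokens
        "bad_words": ["coffee"],    # illustrative forbidden word
    },
)
print(response.choices[0].message.content)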
The following extra parameters are supported:
chat_template_kwargs: Optional[dict] = None
# Additional parameters passed to the chat template, used for customizing dialogue formats (default None).
reasoning_max_tokens: Optional[int] = None
# Maximum number of tokens to generate during reasoning (e.g., CoT, chain of thought) (default None means using global max_tokens).
structural_tag: Optional[str] = None
# Structural tag, used to mark specific structures of generated content (such as JSON, XML, etc., default None).
guided_json: Optional[Union[str, dict, BaseModel]] = None
# Guides the generation of content conforming to JSON structure, can be a JSON string, dictionary, or Pydantic model (default None).
guided_regex: Optional[str] = None
# Guides the generation of content conforming to regular expression rules (default None means no restriction).
guided_choice: Optional[List[str]] = None
# Guides the generation of content selected from a specified candidate list (default None means no restriction).
guided_grammar: Optional[str] = None
# Guides the generation of content conforming to grammar rules (such as BNF) (default None means no restriction).
return_token_ids: Optional[bool] = None
# Whether to return the token IDs of the generation results instead of text (default None means return text).
prompt_token_ids: Optional[List[int]] = None
# Directly passes the token ID list of the prompt, skipping the text encoding step (default None means using text input).
max_streaming_response_tokens: Optional[int] = None
# Maximum number of tokens returned at a time during streaming output (default None means no limit).
disable_chat_template: Optional[bool] = False
# Whether to disable chat template rendering, using raw input directly (default False means template is enabled).
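The guided decoding parameters are passed the same way through extra_body; the sketch below constrains the reply to a fixed candidate list and is only an assumed usage pattern, not the sole supported one.
import openai
client = openai.Client(base_url="http://0.0.0.0:8188/v1", api_key="null")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Is the sentence 'The service was great' positive or negative?"}],
    extra_body={
        "guided_choice": ["positive", "negative", "neutral"],  # restrict output to these candidates
        "return_token_ids": True,                              # also request token ID fields in the response
    },
)
print(response.choices[0].message.content)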
Differences in Return Fields
Additional return fields added by FastDeploy:
arrival_time: Cumulative time consumed for all tokens
reasoning_content: Output of the chain of thought
prompt_token_ids: List of token IDs for the input sequence
completion_token_ids: List of token IDs for the output sequence
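The standard OpenAI client does not necessarily expose these extra fields as attributes, so one way to inspect them is to read the raw JSON response. The sketch below uses the requests library and assumes the same server address as the earlier examples; the field names follow the list above.
import requests
payload = {
    "messages": [{"role": "user", "content": "Hello!"}],
    "return_token_ids": True,  # ask FastDeploy to include the token ID fields
}
resp = requests.post("http://0.0.0.0:8188/v1/chat/completions", json=payload).json()
message = resp["choices"][0]["message"]
print(message.get("reasoning_content"))     # chain-of-thought output, if any
print(message.get("prompt_token_ids"))      # token IDs of the input sequence
print(message.get("completion_token_ids"))  # token IDs of the output sequence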
Overview of return parameters:
ChatCompletionResponse:
id: str
object: str = "chat.completion"
created: int = Field(default_factory=lambda: int(time.time()))
model: str
choices: List[ChatCompletionResponseChoice]
usage: UsageInfo
ChatCompletionResponseChoice:
index: int
message: ChatMessage
logprobs: Optional[LogProbs] = None
finish_reason: Optional[Literal["stop", "length", "tool_calls", "recover_stop"]]
ChatMessage:
role: str
content: str
reasoning_content: Optional[str] = None
prompt_token_ids: Optional[List[int]] = None
completion_token_ids: Optional[List[int]] = None
# Fields returned for streaming responses
ChatCompletionStreamResponse:
id: str
object: str = "chat.completion.chunk"
created: int = Field(default_factory=lambda: int(time.time()))
model: str
choices: List[ChatCompletionResponseStreamChoice]
usage: Optional[UsageInfo] = None
ChatCompletionResponseStreamChoice:
index: int
delta: DeltaMessage
logprobs: Optional[LogProbs] = None
finish_reason: Optional[Literal["stop", "length", "tool_calls"]] = None
arrival_time: Optional[float] = None
DeltaMessage:
role: Optional[str] = None
content: Optional[str] = None
prompt_token_ids: Optional[List[int]] = None
completion_token_ids: Optional[List[int]] = None
reasoning_content: Optional[str] = None
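In streaming mode the arrival_time field is attached per chunk. Below is a minimal sketch for reading it from the server-sent event stream with the requests library, assuming the standard "data: ..." / "[DONE]" framing of the OpenAI protocol.
import json
import requests
payload = {"messages": [{"role": "user", "content": "Hello!"}], "stream": True}
with requests.post("http://0.0.0.0:8188/v1/chat/completions", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        choice = chunk["choices"][0]
        # delta carries the incremental content; arrival_time is the FastDeploy-specific field
        print(choice["delta"].get("content", ""), choice.get("arrival_time"))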
Completion API
The Completion API is mainly intended for continuation scenarios. It suits users who provide fully customized context and expect the model to output only the continuation; no additional prompt (such as a chat template) is concatenated during inference.
Sending User Requests
Here is an example of sending a user request using the curl command:
curl -X POST "http://0.0.0.0:8188/v1/completions" \
-H "Content-Type: application/json" \
-d '{
"prompt": "以下是一篇关于深圳文心公园的500字游记和赏析:"
}'
Here is an example of sending a user request using a Python script:
import openai
host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.completions.create(
    model="default",
    prompt="Below is a 500-word travel essay and appreciation of Shenzhen Wenxin Park:",
    stream=False,
)
print(response.choices[0].text)
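Streaming output works for the Completion API as well; a minimal sketch, assuming the same host and port as above:
import openai
host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.completions.create(
    model="default",
    prompt="Below is a 500-word travel essay and appreciation of Shenzhen Wenxin Park:",
    stream=True,
)
for chunk in response:
    if chunk.choices and chunk.choices[0].text:
        print(chunk.choices[0].text, end="")
print()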
For an explanation of the OpenAI protocol, refer to the OpenAI Completion API.
Compatible OpenAI Parameters
model: Optional[str] = "default"
# Specifies the model name or version to use, defaulting to `"default"` (which may point to the base model).
prompt: Union[List[int], List[List[int]], str, List[str]]
# Input prompt, supporting multiple formats:
# - `str`: Plain text prompt (e.g., `"Hello, how are you?"`).
# - `List[str]`: Multiple text segments (e.g., `["User:", "Hello!", "Assistant:", "Hi!"]`).
# - `List[int]`: Directly passes a list of token IDs (e.g., `[123, 456]`).
# - `List[List[int]]`: List of multiple token ID lists (e.g., `[[123], [456, 789]]`).
best_of: Optional[int] = None
# Generates `best_of` candidate results and returns the highest-scoring one (requires `n=1`).
frequency_penalty: Optional[float] = None
# Frequency penalty coefficient, penalizing tokens in proportion to how often they have already appeared (`>0` discourages repetition, `<0` encourages it).
logprobs: Optional[int] = None
# Number of most likely tokens to return log probabilities for at each generated position (default `None` disables).
max_tokens: Optional[int] = None
# Maximum number of tokens to generate in the completion; no default limit, but the prompt plus `max_tokens` must fit within the model's context window.
presence_penalty: Optional[float] = None
# Presence penalty coefficient, penalizing tokens that have already appeared so the model is more likely to move on to new topics (`>0` encourages new topics, `<0` discourages them).
Additional Parameters Added by FastDeploy
Note: When sending requests with curl, the following parameters can be used directly in the request body. When sending requests with openai.Client, they must be placed in the `extra_body` parameter, e.g. `extra_body={"chat_template_kwargs": {"enable_thinking": True}, "include_stop_str_in_output": True}`.
The following sampling parameters are supported.
top_k: Optional[int] = None
# Limits the consideration to the top K tokens with the highest probability at each generation step, used to control randomness (default None means no limit).
min_p: Optional[float] = None
# Minimum probability threshold relative to the most likely token; tokens whose probability falls below min_p times that of the top token are filtered out (default None means disabled).
min_tokens: Optional[int] = None
# Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).
include_stop_str_in_output: Optional[bool] = False
# Whether to include the stop string content in the output (default False, meaning output is truncated when a stop string is encountered).
bad_words: Optional[List[str]] = None
# List of forbidden words (e.g., sensitive words) that the model should avoid generating (default None means no restriction).
repetition_penalty: Optional[float] = None
# Repetition penalty coefficient, reducing the probability of repeating already generated tokens (`>1.0` suppresses repetition, `<1.0` encourages repetition, default None means disabled).
The following extra parameters are supported:
guided_json: Optional[Union[str, dict, BaseModel]] = None
# Guides the generation of content conforming to JSON structure, can be a JSON string, dictionary, or Pydantic model (default None).
guided_regex: Optional[str] = None
# Guides the generation of content conforming to regular expression rules (default None means no restriction).
guided_choice: Optional[List[str]] = None
# Guides the generation of content selected from a specified candidate list (default None means no restriction).
guided_grammar: Optional[str] = None
# Guides the generation of content conforming to grammar rules (such as BNF) (default None means no restriction).
return_token_ids: Optional[bool] = None
# Whether to return the token IDs of the generation results instead of text (default None means return text).
prompt_token_ids: Optional[List[int]] = None
# Directly passes the token ID list of the prompt, skipping the text encoding step (default None means using text input).
max_streaming_response_tokens: Optional[int] = None
# Maximum number of tokens returned at a time during streaming output (default None means no limit).
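As with the Chat Completion API, these parameters go through extra_body when using openai.Client. The sketch below pairs guided_regex with min_tokens; the regular expression and prompt are only illustrative assumptions.
import openai
client = openai.Client(base_url="http://0.0.0.0:8170/v1", api_key="null")
response = client.completions.create(
    model="default",
    prompt="Three comma-separated prime numbers: ",
    extra_body={
        "guided_regex": "\\d+, \\d+, \\d+",  # constrain the continuation to three numbers
        "min_tokens": 1,                     # generate at least one token
    },
)
print(response.choices[0].text)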
Overview of Return Parameters
CompletionResponse:
id: str
object: str = "text_completion"
created: int = Field(default_factory=lambda: int(time.time()))
model: str
choices: List[CompletionResponseChoice]
usage: UsageInfo
CompletionResponseChoice:
index: int
text: str
prompt_token_ids: Optional[List[int]] = None
completion_token_ids: Optional[List[int]] = None
arrival_time: Optional[float] = None
logprobs: Optional[int] = None
reasoning_content: Optional[str] = None
finish_reason: Optional[Literal["stop", "length", "tool_calls"]]
# Fields returned for streaming responses
CompletionStreamResponse:
id: str
object: str = "text_completion"
created: int = Field(default_factory=lambda: int(time.time()))
model: str
choices: List[CompletionResponseStreamChoice]
usage: Optional[UsageInfo] = None
CompletionResponseStreamChoice:
index: int
text: str
arrival_time: float = None
prompt_token_ids: Optional[List[int]] = None
completion_token_ids: Optional[List[int]] = None
logprobs: Optional[float] = None
reasoning_content: Optional[str] = None
finish_reason: Optional[Literal["stop", "length", "tool_calls"]] = None