# OpenAI Protocol-Compatible API Server
FastDeploy provides a service-oriented deployment solution that is compatible with the OpenAI protocol. Users can quickly deploy it using the following command:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-0.3B-Paddle \
    --port 8188 \
    --tensor-parallel-size 8 \
    --max-model-len 32768
```
For more command-line options available during service deployment, refer to Parameter Descriptions.
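Model loading can take some time after the launch command returns, so it can be convenient to poll the service before sending traffic. Below is a minimal readiness probe; it assumes the server exposes a `/health` endpoint and uses the port from the launch example above (adjust both if your deployment differs):

```python
import time

import requests

BASE_URL = "http://0.0.0.0:8188"


def wait_until_ready(timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll the server until it responds, or give up after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A 200 response indicates the engine has finished loading the model.
            if requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200:
                return True
        except requests.exceptions.RequestException:
            pass  # server not accepting connections yet; keep polling
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "server did not become ready")
```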
## Sending User Requests
The FastDeploy interface is compatible with the OpenAI protocol, so requests can be constructed and sent exactly as they would be to an OpenAI endpoint.
Here is an example of sending a user request using the curl command:
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
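The same request can also be issued from Python over plain HTTP, without the OpenAI SDK. This is a sketch assuming the `requests` package and the server launched above on port 8188:

```python
import requests

resp = requests.post(
    "http://0.0.0.0:8188/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    json={"messages": [{"role": "user", "content": "Hello!"}]},
    timeout=60,
)
resp.raise_for_status()
# The reply follows the standard OpenAI chat-completions layout.
print(resp.json()["choices"][0]["message"]["content"])
```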
Here is an example of sending a streaming request using the OpenAI Python SDK:
```python
import openai

host = "0.0.0.0"
port = "8188"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "Rewrite Li Bai's 'Quiet Night Thought' as a modern poem"},
    ],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
For a description of the OpenAI protocol, refer to the document OpenAI Chat Completion API.
## Parameter Differences

### Request Parameter Differences

FastDeploy's request parameters differ from the OpenAI protocol as follows; any request parameters not listed here are ignored:
- `prompt` (supported only in the `v1/completions` interface)
- `messages` (supported only in the `v1/chat/completions` interface)
- `frequency_penalty`: `Optional[float] = 0.0`
- `max_tokens`: `Optional[int] = 16`
- `presence_penalty`: `Optional[float] = 0.0`
- `stream`: `Optional[bool] = False`
- `stream_options`: `Optional[StreamOptions] = None`
- `temperature`: `Optional[float] = None`
- `top_p`: `Optional[float] = None`
- `metadata`: `Optional[dict] = None` (supported only in `v1/chat/completions`; used to pass additional parameters, e.g. `metadata={"enable_thinking": True}`)
- `min_tokens`: `Optional[int] = 1` (minimum number of tokens to generate)
- `reasoning_max_tokens`: `Optional[int] = None` (maximum number of tokens for reasoning content; defaults to the value of `max_tokens`)
- `enable_thinking`: `Optional[bool] = True` (whether to enable reasoning for models that support deep thinking)
- `repetition_penalty`: `Optional[float] = None` (coefficient that directly penalizes repeated tokens; >1 penalizes repetition, <1 encourages it)
Note: For multimodal models, the reasoning chain is enabled by default, which can make outputs very long, so `max_tokens` can be set to the model's maximum output length, or the default value can be used.
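Parameters such as `metadata`, `min_tokens`, and `repetition_penalty` are FastDeploy extensions, so the OpenAI Python SDK does not accept them as named arguments. A minimal sketch of passing them through the SDK's `extra_body` escape hatch (the values here are purely illustrative):

```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8188/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024,
    # Fields unknown to the SDK are forwarded verbatim in the request body.
    extra_body={
        "metadata": {"enable_thinking": True},  # v1/chat/completions only
        "min_tokens": 8,               # generate at least 8 tokens
        "repetition_penalty": 1.05,    # >1 discourages repeated tokens
    },
)
print(response.choices[0].message.content)
```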
### Return Field Differences

FastDeploy adds the following return fields:

- `arrival_time`: the cumulative time taken for all tokens returned so far
- `reasoning_content`: the returned result of the reasoning chain
Overview of return parameters:
```python
ChatCompletionStreamResponse:
    id: str
    object: str = "chat.completion.chunk"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[ChatCompletionResponseStreamChoice]

ChatCompletionResponseStreamChoice:
    index: int
    delta: DeltaMessage
    finish_reason: Optional[Literal["stop", "length"]] = None
    arrival_time: Optional[float] = None

DeltaMessage:
    role: Optional[str] = None
    content: Optional[str] = None
    token_ids: Optional[List[int]] = None
    reasoning_content: Optional[str] = None
```
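Here is a sketch of reading the two extra fields from a streamed response. The OpenAI SDK does not declare `arrival_time` or `reasoning_content` in its types, but it preserves unknown fields on the parsed objects, so they are accessed defensively with `getattr`:

```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8188/v1", api_key="null")

stream = client.chat.completions.create(
    model="null",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    choice = chunk.choices[0]
    # Reasoning-chain tokens arrive separately from the final answer.
    reasoning = getattr(choice.delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end="")
    elif choice.delta.content:
        print(choice.delta.content, end="")
    if choice.finish_reason is not None:
        # arrival_time is cumulative, so the final chunk carries the total.
        print(f"\n[total time: {getattr(choice, 'arrival_time', None)}s]")
```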