PaddleX Serving Guide¶
Serving is a common deployment strategy in real-world production environments. By encapsulating inference functions into services, clients can access these services via network requests to obtain inference results. PaddleX supports various solutions for serving pipelines.
Demonstration of PaddleX pipeline serving:
To address different user needs, PaddleX offers multiple pipeline serving solutions:
- Basic serving: A simple and easy-to-use serving solution with low development costs.
- High-stability serving: Built on NVIDIA Triton Inference Server. Compared to basic serving, this solution offers higher stability and allows users to adjust configurations to optimize performance.
It is recommended to first use the basic serving solution for quick verification, and then evaluate whether to try more complex solutions based on actual needs.
Note
- PaddleX serves pipelines rather than modules.
1. Basic Serving¶
1.1 Install the Serving Plugin¶
Execute the following command to install the serving plugin:
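A typical invocation looks like the following (assuming the plugin is named `serving` in the PaddleX CLI's plugin-installation syntax):

```bash
paddlex --install serving
```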
1.2 Run the Server¶
Run the server via PaddleX CLI:
paddlex --serve --pipeline {pipeline name or path to pipeline config file} [{other commandline options}]
Take the general image classification pipeline as an example:
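For instance, the following command starts a server for that pipeline on the default host and port (the pipeline name `image_classification` is the official name used in the PaddleX documentation; adjust it if your version differs):

```bash
paddlex --serve --pipeline image_classification
```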
You can see the following information:
INFO: Started server process [63108]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
`--pipeline` can be specified as the official pipeline name or the path to a local pipeline configuration file. PaddleX builds the pipeline and deploys it as a service. If you need to adjust configurations (such as model paths, batch size, or the device used for deployment), please refer to the "Model Application" section in the General Image Classification Pipeline Tutorial.
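One possible workflow is to export the pipeline configuration, edit it, and then serve from the local file. The `--get_pipeline_config` option and the file name below are illustrative; see the pipeline tutorial for the exact steps:

```bash
# Export the default configuration of the pipeline for editing (illustrative).
paddlex --get_pipeline_config image_classification --save_path ./

# After adjusting model paths, batch size, device, etc. in the YAML file,
# serve from the local configuration file instead of the official pipeline name.
paddlex --serve --pipeline ./image_classification.yaml
```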
The command-line options related to serving are as follows:
| Name | Description |
|---|---|
| `--pipeline` | Pipeline name or path to the pipeline configuration file. |
| `--device` | Device for pipeline deployment. Defaults to `cpu` (if GPU is unavailable) or `gpu` (if GPU is available). |
| `--host` | Hostname or IP address the server binds to. Defaults to `0.0.0.0`. |
| `--port` | Port number the server listens on. Defaults to `8080`. |
| `--use_hpip` | If specified, enables the high-performance inference plugin. |
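For example, combining these options (the pipeline name and device value are illustrative):

```bash
# Serve the pipeline on GPU 0, listening on all interfaces at port 8080.
paddlex --serve --pipeline image_classification --device gpu:0 --host 0.0.0.0 --port 8080
```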
In application scenarios where strict requirements are placed on service response time, the PaddleX high-performance inference plugin can be used to accelerate model inference and pre/post-processing, thereby reducing response time and increasing throughput.
To use the PaddleX high-performance inference plugin, please refer to the PaddleX High-Performance Inference Guide for instructions on installing the high-performance inference plugin, obtaining a serial number, and completing the activation process. Additionally, note that not all pipelines, models, and environments support the use of the high-performance inference plugin. For detailed information on supported pipelines and models, please refer to the section on supported pipelines and models for high-performance inference plugins.
You can use the `--use_hpip` flag to enable the high-performance inference plugin. An example is as follows:
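```bash
# Pipeline name is illustrative; --use_hpip enables the high-performance inference plugin.
paddlex --serve --pipeline image_classification --use_hpip
```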
1.3 Invoke the Service¶
The "Development Integration/Deployment" section in each pipeline’s tutorial provides API references and multi-language invocation examples for the service. You can find the tutorials for each pipeline here.
2. High-Stability Serving¶
Please note that the current high-stability serving solution only supports Linux systems.
2.1 Download the High-Stability Serving SDK¶
Find the high-stability serving SDK corresponding to the pipeline in the table below and download it:
2.2 Obtain the Serial Number¶
Using the high-stability serving solution requires obtaining a serial number and completing activation on the machine used for deployment. The method to obtain the serial number is as follows:
On Baidu AI Studio Community - AI Learning and Training Platform, under the "开源模型产线部署序列号咨询与获取" (open-source pipeline deployment serial number inquiry and acquisition) section, select "立即获取" (acquire now) as shown in the following image:
Select the pipeline you wish to deploy and click "获取" (acquire). Afterwards, you can find the acquired serial number in the "开源产线部署SDK序列号管理" (open-source pipeline deployment SDK serial number management) section at the bottom of the page:
Please note: each serial number can only be bound to a unique device fingerprint, and can only be bound once. This means that if a user deploys a pipeline on different machines, a separate serial number must be prepared for each machine. The high-stability serving solution is completely free. PaddleX's authentication mechanism is intended to count deployments across pipelines and, through data modeling, provide our team with pipeline efficiency analysis, so as to optimize resource allocation and improve the efficiency of key pipelines. Note that the authentication process only uses non-sensitive information such as disk partition UUIDs; PaddleX does not collect sensitive data such as device telemetry, so in theory the authentication server cannot obtain any sensitive information.
2.3 Adjust Configurations¶
The PaddleX high-stability serving solution is built on NVIDIA Triton Inference Server, allowing users to modify the configuration files of Triton Inference Server.
In the `model_repo/{endpoint name}` directory of the high-stability serving SDK, you can find one or more `config*.pbtxt` files. If a `config_{device type}.pbtxt` file exists in the directory, modify the configuration file corresponding to the desired device type; otherwise, modify `config.pbtxt`.
A common requirement is to adjust the number of execution instances for horizontal scaling. To achieve this, modify the `instance_group` setting in the configuration file: `count` specifies the number of instances placed on each device, `kind` specifies the device type, and `gpus` specifies the GPU IDs. For example:

- Place 4 instances on GPU 0 (first sketch below).
- Place 2 instances on GPU 1, and 1 instance each on GPUs 2 and 3 (second sketch below).
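Sketches of the two placements above, written in the standard Triton Inference Server `instance_group` syntax (the values are illustrative; adapt them to the actual configuration file in your SDK):

```
# 4 instances on GPU 0
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

```
# 2 instances on GPU 1, and 1 instance each on GPUs 2 and 3
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 1 ]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 2, 3 ]
  }
]
```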
For more configuration details, please refer to the Triton Inference Server documentation.
2.4 Run the Server¶
The machine used for deployment needs to have Docker Engine version 19.03 or higher installed.
First, pull the Docker image as needed:
- Image supporting deployment with NVIDIA GPU (the machine must have NVIDIA drivers that support CUDA 11.8 installed)
- CPU-only image
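The exact image names are distributed with the high-stability serving SDK; using the same `{image name}` placeholder as the `docker run` command below, the pull step is simply:

```bash
docker pull {image name}
```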
With the image prepared, execute the following command to run the server:
docker run \
-it \
-e PADDLEX_HPS_DEVICE_TYPE={deployment device type} \
-e PADDLEX_HPS_SERIAL_NUMBER={serial number} \
-e PADDLEX_HPS_UPDATE_LICENSE=1 \
-v "$(pwd)":/workspace \
-v "${HOME}/.baidu/paddlex/licenses":/root/.baidu/paddlex/licenses \
-v /dev/disk/by-uuid:/dev/disk/by-uuid \
-w /workspace \
--rm \
--gpus all \
--network host \
--shm-size 8g \
{image name} \
/bin/bash server.sh
- The deployment device type can be `cpu` or `gpu`; the CPU-only image supports only `cpu`.
- If CPU deployment is required, there is no need to specify `--gpus`.
- The above commands can only be executed properly after successful activation. PaddleX offers two activation methods, online activation and offline activation, detailed as follows:
    - Online activation: Set `PADDLEX_HPS_UPDATE_LICENSE` to `1` for the first execution so that the program automatically updates the license and completes activation. When executing the command again, you may set `PADDLEX_HPS_UPDATE_LICENSE` to `0` to avoid online license updates.
    - Offline activation: Follow the instructions in the serial number management section to obtain the machine's device fingerprint, then bind the serial number with the device fingerprint to obtain the certificate and complete the activation. For this activation method, you need to manually place the certificate in the `${HOME}/.baidu/paddlex/licenses` directory on the machine (create the directory if it does not exist). When using this method, set `PADDLEX_HPS_UPDATE_LICENSE` to `0` to avoid online license updates.
- To perform activation properly, the `/dev/disk/by-uuid` directory on the host machine must exist and be non-empty, and it must be correctly mounted into the container.
- If you need to enter the container for debugging, you can replace `/bin/bash server.sh` in the command with `/bin/bash`, then execute `/bin/bash server.sh` inside the container.
- If you want the server to run in the background, you can replace `-it` in the command with `-d`. After the container starts, you can view the container logs with `docker logs -f {container ID}`.
- Add `-e PADDLEX_USE_HPIP=1` to use the PaddleX high-performance inference plugin to accelerate pipeline inference. However, please note that not all pipelines support using the high-performance inference plugin; please refer to the PaddleX High-Performance Inference Guide for more information.
You may observe output similar to the following:
I1216 11:37:21.601943 35 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001
I1216 11:37:21.602333 35 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000
I1216 11:37:21.643494 35 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
2.5 Invoke the Service¶
Currently, only the Python client is supported for calling the service. Supported Python versions are 3.8, 3.9, and 3.10.
Navigate to the `client` directory of the high-stability serving SDK, and run the following commands to install dependencies:
# It is recommended to install in a virtual environment
python -m pip install -r requirements.txt
python -m pip install paddlex_hps_client-*.whl
The `client.py` script in the `client` directory contains examples of how to call the service and provides a command-line interface.
Services deployed with the high-stability serving solution provide the same primary operations as the basic serving solution; for each primary operation, the endpoint names and the request and response data fields are consistent with those of the basic serving solution. Please refer to the "Development Integration/Deployment" section in the tutorial for each pipeline; the tutorials can be found here.