PaddleX High-Performance Inference Guide¶

In real production environments, many applications impose strict performance metrics—especially in response time—on deployment strategies to ensure system efficiency and a smooth user experience. To address this, PaddleX offers a high-performance inference plugin that, through automatic configuration and multi-backend inference capabilities, enables users to significantly accelerate model inference without concerning themselves with complex configurations and low-level details. In addition to supporting inference acceleration on pipelines, the PaddleX high-performance inference plugin can also be used to accelerate inference when modules are used standalone.

Table of Contents¶

1. Installation and Basic Usage
1.1 Installing the High-Performance Inference Plugin
1.2 Enabling the High-Performance Inference Plugin
2. Advanced Usage
2.1 Working Modes of High-Performance Inference
2.2 High-Performance Inference Configuration
2.3 Modifying the High-Performance Inference Configuration
2.4 Enabling/Disabling the High‑Performance Inference Plugin in Configuration Files
2.5 Model Cache Description
2.6 Customizing the Model Inference Library
3. Frequently Asked Questions

1. Installation and Basic Usage¶

Before using the high-performance inference plugin, please ensure that you have completed the PaddleX installation according to the PaddleX Local Installation Tutorial and have run the quick inference using the PaddleX pipeline command line or the PaddleX pipeline Python script as described in the usage instructions.

The high-performance inference plugin supports handling multiple model formats, including PaddlePaddle static graph (.pdmodel, .json), ONNX (.onnx) and Huawei OM (.om), among others. For ONNX models, you can convert them using the Paddle2ONNX Plugin. If multiple model formats are present in the model directory, PaddleX will automatically choose the appropriate one as needed, and automatic model conversion may be performed.

1.1 Installing the High-Performance Inference Plugin¶

Currently, the supported processor architectures, operating systems, device types, and Python versions for high-performance inference are as follows:

Operating System	Processor Architecture	Device Type	Python Version
Linux	x86-64
		CPU	3.8–3.12
		GPU (CUDA 11.8 + cuDNN 8.9)	3.8–3.12
		NPU	3.10
	AArch64	NPU	3.10

1.1.1 Installing the High-Performance Inference Plugin in a Docker Container (Highly Recommended)¶

Refer to Get PaddleX based on Docker to start a PaddleX container using Docker. After starting the container, execute the following commands according to your device type to install the high-performance inference plugin:

Device Type	Installation Command	Description
CPU	`paddlex --install hpi-cpu`	Installs the CPU version of the high-performance inference feature.
GPU	`paddlex --install hpi-gpu`	Installs the GPU version of the high-performance inference feature. Includes all functionalities of the CPU version.

The official PaddleX Docker images come with the Paddle2ONNX plugin pre-installed, allowing PaddleX to convert model formats on demand. In addition, the GPU version of the image includes TensorRT, so the high-performance inference plugin can leverage the Paddle Inference TensorRT subgraph engine for accelerated inference.

Please note that the aforementioned Docker image refers to the official PaddleX image described in Get PaddleX via Docker, rather than the PaddlePaddle official image described in PaddlePaddle Local Installation Tutorial. For the latter, please refer to the local installation instructions for the high-performance inference plugin.

1.1.2 Installing the High-Performance Inference Plugin Locally¶

It is recommended to install the Paddle2ONNX plugin first before installing the high-performance inference plugin, so that PaddleX can convert model formats when needed.

To install the CPU version of the high-performance inference plugin:

Run:

paddlex --install hpi-cpu

To install the GPU version of the high-performance inference plugin:

Before installation, please ensure that CUDA and cuDNN are installed in your environment. The official PaddleX currently only provides precompiled packages for CUDA 11.8 + cuDNN 8.9, so please ensure that the installed versions of CUDA and cuDNN are compatible with the compiled versions. Below are the installation documentation links for CUDA 11.8 and cuDNN 8.9:

If you are using the official PaddlePaddle image, the CUDA and cuDNN versions in the image already meet the requirements, so there is no need for a separate installation.

If PaddlePaddle is installed via pip, the relevant CUDA and cuDNN Python packages will usually be installed automatically. In this case, you still need to install the non-Python-specific CUDA and cuDNN. It is also advisable to install the CUDA and cuDNN versions that match the versions of the Python packages in your environment to avoid potential issues arising from coexisting libraries of different versions. You can check the versions of the CUDA and cuDNN related Python packages as follows:

# For CUDA related Python packages
pip list | grep nvidia-cuda
# For cuDNN related Python packages
pip list | grep nvidia-cudnn

If you wish to use the Paddle Inference TensorRT subgraph engine, you will need to install TensorRT additionally. Please refer to the related instructions in the PaddlePaddle Local Installation Tutorial. Note that because the underlying inference library of the high-performance inference plugin also integrates TensorRT, it is recommended to install the same version of TensorRT to avoid version conflicts. Currently, the TensorRT version integrated into the high-performance inference plugin's underlying inference library is 8.6.1.6. If you are using the official PaddlePaddle image, you do not need to worry about version conflicts.

After confirming that the correct versions of CUDA, cuDNN, and TensorRT (optional) are installed, run:

paddlex --install hpi-gpu

To install the NPU version of the high-performance inference plugin:

Please refer to the Ascend NPU High-Performance Inference Tutorial.

Note:

Currently, the official PaddleX only provides precompiled packages for CUDA 11.8 + cuDNN 8.9; support for CUDA 12 is in progress.
Only one version of the high-performance inference plugin should exist in the same environment.
For Windows systems, it is currently recommended to install and use the high-performance inference plugin within a Docker container or in WSL environments.

1.2 Enabling the High-Performance Inference Plugin¶

Below are examples of enabling the high-performance inference plugin in both the PaddleX CLI and Python API for the general image classification pipeline and the image classification module.

For the PaddleX CLI, specify --use_hpip to enable the high-performance inference plugin.

General Image Classification Pipeline:

paddlex \
    --pipeline image_classification \
    --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg \
    --use_hpip

Image Classification Module:

python main.py \
    -c paddlex/configs/modules/image_classification/ResNet18.yaml \
    -o Global.mode=predict \
    -o Predict.model_dir=None \
    -o Predict.input=https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg \
    -o Predict.use_hpip=True

For the PaddleX Python API, enabling the high-performance inference plugin is similar. For example:

General Image Classification Pipeline:

from paddlex import create_pipeline

pipeline = create_pipeline(
    pipeline="image_classification",
    use_hpip=True
)

output = pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg")

Image Classification Module:

from paddlex import create_model

model = create_model(
    model_name="ResNet18",
    use_hpip=True
)

output = model.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg")

The inference results obtained with the high-performance inference plugin enabled are identical to those without the plugin. For some models, the first time the high-performance inference plugin is enabled, it may take a longer time to complete the construction of the inference engine. PaddleX caches the related information in the model directory after the inference engine is built for the first time, and subsequently reuses the cached content to improve the initialization speed.

Enabling the high‑performance inference plugin via the PaddleX CLI and Python API applies by default to the entire pipeline/module. If you need finer‑grained control—e.g. to enable the plugin only on a specific sub‑pipeline or sub‑module within your pipeline—you can set use_hpip in the configuration file at the appropriate level. Please refer to 2.4 Enabling/Disabling the High‑Performance Inference Plugin in Configuration Files. If use_hpip is not specified in the CLI options, API calls, or any configuration file, the high‑performance inference plugin will remain disabled by default.

2. Advanced Usage¶

This section introduces the advanced usage of the high-performance inference plugin, which is suitable for users who have a good understanding of model deployment or wish to manually adjust configurations. Users can customize the use of the high-performance inference plugin according to their requirements by referring to the configuration instructions and examples. The following sections describe advanced usage in detail.

2.1 Working Modes of High-Performance Inference¶

The high-performance inference plugin supports two working modes. The operating mode can be switched by modifying the high-performance inference configuration.

2.1.1 Safe Auto-Configuration Mode¶

In safe auto-configuration mode, a protective mechanism is enabled. By default, the configuration with the best performance for the current environment is automatically selected. In this mode, while the user can override the default configuration, the provided configuration will be subject to checks, and PaddleX will reject configurations that are not available based on prior knowledge. This is the default operating mode.

2.1.2 Unrestricted Manual Configuration Mode¶

In unrestricted manual configuration mode, full freedom is provided to configure—users can choose the inference backend freely and modify its configuration, etc.—but there is no guarantee that inference will always succeed. This mode is recommended for experienced users who have clear requirements for the inference backend and its configuration; it is advised to use this mode only when familiar with high-performance inference.

2.2 High-Performance Inference Configuration¶

Common configuration items for high-performance inference include:

Name	Description	Type	Default Value
`auto_config`	Whether to enable the safe auto-configuration mode. `True` enables safe auto-configuration mode, `False` enables the unrestricted manual configuration mode.	`bool`	`True`
`backend`	Specifies the inference backend to use. In unrestricted manual configuration mode, it cannot be `None`.	`str \| None`	`None`
`backend_config`	The configuration for the inference backend. If not `None`, it can override the default backend configuration options.	`dict \| None`	`None`
`auto_paddle2onnx`	Whether to automatically convert the PaddlePaddle static graph model to an ONNX model. When the Paddle2ONNX plugin is unavailable, no conversion will be performed.	`bool`	`True`

The optional values for backend are as follows:

Option	Description	Supported Devices
`paddle`	Paddle Inference engine; supports enhancing GPU inference performance using the Paddle Inference TensorRT subgraph engine.	CPU, GPU, NPU
`openvino`	OpenVINO, a deep learning inference tool provided by Intel, optimized for inference performance on various Intel hardware.	CPU
`onnxruntime`	ONNX Runtime, a cross-platform, high-performance inference engine.	CPU, GPU
`tensorrt`	TensorRT, a high-performance deep learning inference library provided by NVIDIA, optimized for NVIDIA GPUs to enhance speed.	GPU
`om`	The inference engine corresponding to the offline model format customized for Huawei Ascend NPU, deeply optimized for hardware to reduce operator computation and scheduling time, effectively enhancing inference performance.	NPU

The available configuration items for backend_config vary for different backends, as shown in the following table:

Backend	Configuration Items
`paddle`	Refer to PaddleX Single Model Python Usage Instructions. The attributes of the `PaddlePredictorOption` object can be configured via key-value pairs.
`openvino`	`cpu_num_threads` (`int`): The number of logical processors used for CPU inference. The default is `10`.
`onnxruntime`	`cpu_num_threads` (`int`): The number of parallel computation threads within the operator during CPU inference. The default is `10`.
`tensorrt`	`precision` (`str`): The precision used, either `"fp16"` or `"fp32"`. The default is `"fp32"`. `dynamic_shapes` (`dict`): Dynamic shape configuration that specifies, for each input, its minimum shape, optimization shape, and maximum shape. The format is: `{input tensor name}: [{minimum shape}, {optimization shape}, {maximum shape}]`. Dynamic shapes is TensorRT’s ability to defer specifying some or all tensor dimensions until runtime. For more information, see the TensorRT official documentation.
`om`	None at the moment

2.3 Modifying the High-Performance Inference Configuration¶

When the model is initialized, the log will, by default, record the high-performance inference configuration that is about to be used. Due to the diversity of actual deployment environments and requirements, the default configuration might not meet all needs. In such cases, manual adjustment of the high-performance inference configuration may be necessary. Users can modify the configuration by editing the pipeline/module configuration file or by passing the hpi_config field in the parameters via CLI or Python API. Parameters passed via CLI or Python API will override the settings in the pipeline/module configuration file. Different levels of configurations in the config file are automatically merged, and the deepest-level settings take the highest priority. The following examples illustrate how to modify the configuration.

For the general OCR pipeline, use the onnxruntime backend for all models:

👉 Modify via Pipeline Configuration File (click to expand)

...
hpi_config:
  backend: onnxruntime

👉 CLI Parameter Method (click to expand)

paddlex \
    --pipeline image_classification \
    --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg \
    --use_hpip \
    --hpi_config '{"backend": "onnxruntime"}'

👉 Python API Parameter Method (click to expand)

from paddlex import create_pipeline

pipeline = create_pipeline(
    pipeline="OCR",
    use_hpip=True,
    hpi_config={"backend": "onnxruntime"}
)

For the image classification module, use the onnxruntime backend:

👉 Modify via Pipeline Configuration File (click to expand)

Predict:
  ...
  hpi_config:
    backend: onnxruntime

👉 CLI Parameter Method (click to expand)

python main.py \
    -c paddlex/configs/modules/image_classification/ResNet18.yaml \
    -o Global.mode=predict \
    -o Predict.model_dir=None \
    -o Predict.input=https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg \
    -o Predict.use_hpip=True \
    -o Predict.hpi_config='{"backend": "onnxruntime"}'

👉 Python API Parameter Method (click to expand)

from paddlex import create_model

model = create_model(
    model_name="ResNet18",
    use_hpip=True,
    hpi_config={"backend": "onnxruntime"}
)

For the general OCR pipeline, use the onnxruntime backend for the text_detection module and the tensorrt backend for the text_recognition module:

👉 Modify via Pipeline Configuration File (click to expand)

SubModules:
  TextDetection:
    ...
    hpi_config:
      backend: onnxruntime
  TextRecognition:
    ...
    hpi_config:
      backend: tensorrt

For the general image classification pipeline, modify TensorRT dynamic shape configuration:

👉 Modify via Pipeline Configuration File (click to expand)

  SubModules:
    ImageClassification:
      hpi_config:
        ...
        backend: tensorrt
        backend_config:
          dynamic_shapes:
            x:
              - [1, 3, 300, 300]
              - [4, 3, 300, 300]
              - [32, 3, 1200, 1200]

For the image classification module, modify TensorRT dynamic shape configuration:

👉 Modify via Pipeline Configuration File (click to expand)

Predict:
  hpi_config:
    ...
    backend: tensorrt
    backend_config:
      dynamic_shapes:
        x:
          - [1, 3, 300, 300]
          - [4, 3, 300, 300]
          - [32, 3, 1200, 1200]

2.4 Enabling/Disabling the High‑Performance Inference Plugin in Configuration Files¶

In the configuration file, you can use use_hpip to control whether the high-performance inference plugin is enabled or disabled. Unlike configuring via the CLI or API, this approach allows you to specify use_hpip at the sub-pipeline or sub-module level, enabling high-performance inference only for a specific sub-pipeline or sub-module within the entire pipeline. For example:

In the general OCR pipeline, enable high-performance inference for the text_detection module, but not for the text_recognition module:

👉 Click to expand

SubModules:
  TextDetection:
    ...
    use_hpip: True # This sub-module uses high-performance inference
  TextLineOrientation:
    ...
    # This sub-module does not have a specific configuration; it defaults to the global configuration
    # (if neither the configuration file nor CLI/API parameters set it, high-performance inference will not be used)
  TextRecognition:
    ...
    use_hpip: False # This sub-module does not use high-performance inference

Note:

When use_hpip is set at multiple levels in the configuration file, the setting at the deepest level takes precedence.
When enabling or disabling the high-performance inference plugin by modifying the pipeline configuration file, it is not recommended to also configure it using the CLI or Python API. Setting use_hpip through the CLI or Python API is equivalent to modifying the top-level use_hpip in the configuration file.

2.5 Model Cache Description¶

The model caches are stored in the .cache directory under the model directory.

After modifying configurations related to Paddle Inference TensorRT subgraph engine or TensorRT, it is recommended to clear the caches to avoid the new configuration being overridden by the cache.

When the auto_paddle2onnx option is enabled, an inference.onnx file may be automatically generated in the model directory.

2.6 Customizing the Model Inference Library¶

ultra-infer is the model inference library that the high-performance inference plugin depends on. It is maintained as a sub-project under the PaddleX/libs/ultra-infer directory. PaddleX provides a build script for ultra-infer, located at PaddleX/libs/ultra-infer/scripts/linux/set_up_docker_and_build_py.sh. The build script, by default, builds the GPU version of ultra-infer and integrates three inference backends: OpenVINO, TensorRT, and ONNX Runtime.

If you need to customize the build of ultra-infer, you can modify the following options in the build script according to your requirements:

Option	Description
http_proxy	The HTTP proxy used when downloading third-party libraries; default is empty.
PYTHON_VERSION	Python version, default is `3.10.0`.
WITH_GPU	Whether to enable GPU support, default is `ON`.
ENABLE_ORT_BACKEND	Whether to integrate the ONNX Runtime backend, default is `ON`.
ENABLE_TRT_BACKEND	Whether to integrate the TensorRT backend (GPU-only), default is `ON`.
ENABLE_OPENVINO_BACKEND	Whether to integrate the OpenVINO backend (CPU-only), default is `ON`.

Example:

# Build
cd PaddleX/libs/ultra-infer/scripts/linux
# export PYTHON_VERSION=...
# export WITH_GPU=...
# export ENABLE_ORT_BACKEND=...
# export ...
bash set_up_docker_and_build_py.sh

# Install
python -m pip install ../../python/dist/ultra_infer*.whl

3. Frequently Asked Questions¶

1. Why does the inference speed not appear to improve noticeably before and after enabling the high-performance inference plugin?

The high-performance inference plugin achieves inference acceleration by intelligently selecting and configuring the backend. However, due to the complex structure of some models or the presence of unsupported operators, not all models may be able to be accelerated. In these cases, PaddleX will provide corresponding prompts in the log. You can use the PaddleX benchmark feature to measure the inference duration of each module component, thereby facilitating a more accurate performance evaluation. Moreover, for pipelines, the performance bottleneck of inference may not lie in the model inference, but rather in the surrounding logic, which could also result in limited acceleration gains.

2. Do all pipelines and modules support high-performance inference?

All pipelines and modules that use static graph models support enabling the high-performance inference plugin; however, in certain scenarios, some models might not be able to achieve accelerated inference. For detailed reasons, please refer to Question 1.

3. Why does the installation of the high-performance inference plugin fail with a log message stating: “You are not using PaddlePaddle compiled with CUDA 11. Currently, CUDA versions other than 11.x are not supported by the high-performance inference plugin.”?

For the GPU version of the high-performance inference plugin, the official PaddleX currently only provides precompiled packages for CUDA 11.8 + cuDNN 8.9. The support for CUDA 12 is in progress.

4. Why does the program freeze during runtime or display some "WARNING" and "ERROR" messages after using the high-performance inference feature? What should be done in such cases?

When initializing the model, operations such as subgraph optimization may take longer and may generate some "WARNING" and "ERROR" messages. However, as long as the program does not exit automatically, it is recommended to wait patiently, as the program usually continues to run to completion.

5. When using GPU for inference, enabling the high-performance inference plugin increases memory usage and causes OOM. How can this be resolved?

Some acceleration methods trade off memory usage to support a broader range of inference scenarios. If memory becomes a bottleneck, consider the following optimization strategies:

Adjust pipeline configurations: Disable unnecessary features to avoid loading redundant models. Appropriately reduce the batch size based on business requirements to balance throughput and memory usage.
Switch inference backends: Different inference backends have varying memory management strategies. Try benchmarking various backends to compare memory usage and performance.
Optimize dynamic shape configurations: For modules using TensorRT or Paddle Inference TensorRT subgraph engine, narrow the dynamic shape range based on the actual distribution of input data.

PaddleX High-Performance Inference Guide¶

Table of Contents¶

1. Installation and Basic Usage¶

1.1 Installing the High-Performance Inference Plugin¶

1.1.1 Installing the High-Performance Inference Plugin in a Docker Container (Highly Recommended)¶

1.1.2 Installing the High-Performance Inference Plugin Locally¶

1.2 Enabling the High-Performance Inference Plugin¶

2. Advanced Usage¶

2.1 Working Modes of High-Performance Inference¶

2.1.1 Safe Auto-Configuration Mode¶

2.1.2 Unrestricted Manual Configuration Mode¶

2.2 High-Performance Inference Configuration¶

2.3 Modifying the High-Performance Inference Configuration¶

2.4 Enabling/Disabling the High‑Performance Inference Plugin in Configuration Files¶

2.5 Model Cache Description¶

2.6 Customizing the Model Inference Library¶

3. Frequently Asked Questions¶

Comments