Document Image Preprocessing Pipeline Tutorial

1. Introduction to the Document Image Preprocessing Pipeline

The document image preprocessing pipeline integrates two functions: document orientation classification and geometric distortion correction. Document orientation classification automatically identifies which of four orientations (0°, 90°, 180°, 270°) a document is in, so that subsequent tasks process it in the correct direction. Geometric distortion correction removes the distortions introduced when a document is photographed or scanned, restoring its original shape and proportions. The pipeline is suitable for digital document management, as a preprocessing step for recognition tasks such as OCR, and for any scenario where document image quality needs to be improved. By automating orientation and distortion correction, it improves the accuracy and efficiency of document processing and gives downstream image analysis a more reliable foundation.

The pipeline also offers flexible service deployment options and can be invoked from various programming languages on multiple hardware platforms. It supports further development as well: you can train and fine-tune models on your own dataset, and the trained models integrate seamlessly into the pipeline.

The general document image preprocessing pipeline consists of an optional document image orientation classification module and an optional text image unwarping module, which include the following models.

Document Image Orientation Classification Module (Optional):

| Model | Model download link | Top-1 Acc (%) | GPU inference time (ms) [Normal Mode / High-Performance Mode] | CPU inference time (ms) | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-LCNet_x1_0_doc_ori | Inference Model/Train Model | 99.06 | 3.84845 | 9.23735 | 7 | A document image classification model based on PP-LCNet_x1_0, containing four categories: 0 degrees, 90 degrees, 180 degrees, and 270 degrees. |

Text Image Unwarping Module (Optional):

| Model | Model download link | CER | Model storage size (M) | Introduction |
|---|---|---|---|---|
| UVDoc | Inference Model/Train Model | 0.179 | 30.3 | High-precision text image correction model |
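
The CER column reports a character error rate. As a rough illustration only, and assuming the common definition (Levenshtein edit distance between recognized text and reference text, divided by the reference length), the metric could be computed as in the sketch below. The function names are hypothetical, and the exact evaluation protocol behind the 0.179 figure is not described in this tutorial.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition a lower value is better: 0.0 means the recognized text matches the reference exactly.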

Test Environment Description:

  • Performance Test Environment
  • Test Dataset:
    • Document Image Orientation Classification Module: A self-built dataset using PaddleX, covering multiple scenarios such as ID cards and documents, containing 1000 images.
    • Text Image Unwarping Module: DocUNet.
  • Hardware Configuration:

    • GPU: NVIDIA Tesla T4
    • CPU: Intel Xeon Gold 6271C @ 2.60GHz
    • Other Environments: Ubuntu 20.04 / cuDNN 8.6 / TensorRT 8.5.2.2
  • Inference Mode Description

| Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination |
|---|---|---|---|
| Normal Mode | FP32 Precision / No TRT Acceleration | FP32 Precision / 8 Threads | PaddleInference |
| High-Performance Mode | Optimal combination of pre-selected precision types and acceleration strategies | FP32 Precision / 8 Threads | Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.) |

2. Quick Start

PaddleX lets you try the document image preprocessing pipeline locally, either from the command line or from Python.

Before using the pipeline locally, make sure you have installed the PaddleX wheel package by following the PaddleX Local Installation Guide.

2.1 Local Experience

2.1.1 Command Line Experience

You can try the document image preprocessing pipeline with a single command. Use the test file, and set --input to its local path to run prediction.

paddlex --pipeline doc_preprocessor \
        --input doc_test_rotated.jpg \
        --use_doc_orientation_classify True \
        --use_doc_unwarping True \
        --save_path ./output \
        --device gpu:0
You can refer to the parameter descriptions in 2.1.2 Python Script Integration for related parameter details.

After running, the results will be printed to the terminal as follows:

{'res': {'input_path': 'doc_test_rotated.jpg', 'model_settings': {'use_doc_orientation_classify': True, 'use_doc_unwarping': True}, 'angle': 180}}

You can refer to the results explanation in 2.1.2 Python Script Integration for a description of the output parameters.
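
If you capture the terminal output in a script, the printed result can be turned back into a Python dict. The sketch below parses the sample line shown above with ast.literal_eval. The helper name describe_result is hypothetical and not part of PaddleX, and reading results through the Python API is usually more robust than parsing printed text.

```python
import ast

# Sample line copied from the terminal output shown above.
printed = ("{'res': {'input_path': 'doc_test_rotated.jpg', "
           "'model_settings': {'use_doc_orientation_classify': True, "
           "'use_doc_unwarping': True}, 'angle': 180}}")


def describe_result(text: str) -> str:
    """Summarize one printed result (hypothetical helper, not a PaddleX API)."""
    res = ast.literal_eval(text)["res"]
    angle = res["angle"]
    if angle == -1:
        # -1 is reported when orientation classification is not enabled.
        return f"{res['input_path']}: orientation classification disabled"
    return f"{res['input_path']}: predicted orientation {angle} degrees"


print(describe_result(printed))
# prints: doc_test_rotated.jpg: predicted orientation 180 degrees
```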

The visualized results are saved under save_path.

2.1.2 Python Script Integration

The command line above is intended for quickly trying the pipeline and viewing its results. In a real project, you will usually need to integrate the pipeline through code. A few lines of Python are enough to run inference:

from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="doc_preprocessor")
output = pipeline.predict(
    input="doc_test_rotated.jpg",
    use_doc_orientation_classify=True,
    use_doc_unwarping=True,
)
for res in output:
    res.print()
    res.save_to_img(save_path="./output/")
    res.save_to_json(save_path="./output/")

In the above Python script, the following steps were executed:

(1) Instantiate the doc_preprocessor pipeline object using create_pipeline(). The specific parameter descriptions are as follows:

| Parameter | Description | Type | Default |
|---|---|---|---|
| pipeline | The pipeline name or the path to the pipeline configuration file. If it is a pipeline name, it must be a pipeline supported by PaddleX. | str | None |
| device | Inference device for the pipeline. Supports specifying the GPU card number, such as "gpu:0", other hardware card numbers, such as "npu:0", and CPU as "cpu". | str | gpu:0 |
| use_hpip | Whether to enable high-performance inference, available only when the pipeline supports high-performance inference. | bool | False |

(2) Call the predict() method of the doc_preprocessor pipeline object for inference prediction. This method will return a generator. Below are the parameters of the predict() method and their descriptions:

Parameter: input
Description: Data to be predicted, supporting various input types. Required.
Type: Python Var|str|list
Options:
  • Python Var: Such as image data represented by numpy.ndarray
  • str: Such as the local path of an image file or PDF file: /root/data/img.jpg; a URL link, such as the network URL of an image file or PDF file: example; or a local directory containing the images to be predicted, such as: /root/data/ (directory prediction is currently not supported for PDFs; each PDF file must be specified by its full file path)
  • List: List elements must be of the above types, such as [numpy.ndarray, numpy.ndarray], ["/root/data/img1.jpg", "/root/data/img2.jpg"], ["/root/data1", "/root/data2"]
Default: None

Parameter: device
Description: Inference device for the pipeline
Type: str|None
Options:
  • CPU: Like cpu, indicating inference using CPU;
  • GPU: Like gpu:0, indicating inference using the first GPU;
  • NPU: Like npu:0, indicating inference using the first NPU;
  • XPU: Like xpu:0, indicating inference using the first XPU;
  • MLU: Like mlu:0, indicating inference using the first MLU;
  • DCU: Like dcu:0, indicating inference using the first DCU;
  • None: If set to None, the default value initialized by the pipeline will be used. During initialization, the local GPU device 0 is preferred; if none is available, the CPU device is used;
Default: None

Parameter: use_doc_orientation_classify
Description: Whether to use the document orientation classification module
Type: bool|None
Options:
  • bool: True or False;
  • None: If set to None, the default value initialized by the pipeline will be used, initialized to True;
Default: None

Parameter: use_doc_unwarping
Description: Whether to use the document unwarping correction module
Type: bool|None
Options:
  • bool: True or False;
  • None: If set to None, the default value initialized by the pipeline will be used, initialized to True;
Default: None
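
The device values listed above share a `type` or `type:index` shape. As a minimal sketch, a caller might validate such a string before handing it to the pipeline. parse_device and SUPPORTED_DEVICE_TYPES are hypothetical names used for illustration and are not part of PaddleX; how PaddleX itself parses the string is not shown in this tutorial.

```python
from typing import Optional, Tuple

# Device types named in the parameter description above.
SUPPORTED_DEVICE_TYPES = {"cpu", "gpu", "npu", "xpu", "mlu", "dcu"}


def parse_device(device: Optional[str]) -> Optional[Tuple[str, Optional[int]]]:
    """Split 'gpu:0' into ('gpu', 0); None means 'use the pipeline default'."""
    if device is None:
        return None
    kind, sep, index = device.partition(":")
    if kind not in SUPPORTED_DEVICE_TYPES:
        raise ValueError(f"unsupported device type: {kind!r}")
    if not sep:
        return kind, None  # e.g. 'cpu', which carries no card number
    if not index.isdigit():
        raise ValueError(f"invalid device index: {index!r}")
    return kind, int(index)
```

For example, parse_device("npu:0") yields ("npu", 0), while a typo such as "gpu:zero" fails early with a ValueError instead of surfacing later during inference.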

(3) Process the prediction results, where the prediction result for each sample is of dict type. Additionally, these results support operations such as printing, saving as an image, and saving as a json file.

| Method | Description | Parameter | Type | Parameter Description | Default |
|---|---|---|---|---|---|
| print() | Prints the results to the terminal | format_json | bool | Whether to format the output using JSON indentation | True |
| | | indent | int | Specifies the indentation level to beautify the output JSON data for better readability, effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Controls whether to escape non-ASCII characters as Unicode. When set to True, all non-ASCII characters will be escaped; False retains the original characters, effective only when format_json is True | False |
| save_to_json() | Saves the results as a JSON file | save_path | str | The path to save to. When it is a directory, the saved file is named after the input file | None |
| | | indent | int | Specifies the indentation level to beautify the output JSON data for better readability, effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Controls whether to escape non-ASCII characters as Unicode. When set to True, all non-ASCII characters will be escaped; False retains the original characters, effective only when format_json is True | False |
| save_to_img() | Saves the results as an image file | save_path | str | The path to save to; either a directory or a file path | None |

  • Calling the print() method will output the results to the terminal. The content printed to the terminal is explained as follows:

    • input_path: (str) The input path of the image to be predicted.

    • model_settings: (Dict[str, bool]) Model parameters required for configuring the pipeline.

      • use_doc_orientation_classify: (bool) Controls whether to enable the document orientation classification module.
      • use_doc_unwarping: (bool) Controls whether to enable the document unwarping module.
    • angle: (int) The prediction result of the document orientation classification. When enabled, the values are [0, 90, 180, 270]; when not enabled, it is -1.

  • Calling the save_to_json() method will save the above content to the specified save_path. If a directory is specified, the path will be save_path/{your_img_basename}.json; if a file is specified, it will be saved directly to that file. Since JSON files do not support saving NumPy arrays, any numpy.array types will be converted to lists.

  • Calling the save_to_img() method will save the visualized results to the specified save_path. If a directory is specified, the path will be save_path/{your_img_basename}_doc_preprocessor_res_img.{your_img_extension}; if a file is specified, it will be saved directly to that file. (Since the pipeline typically includes multiple result images, it is not recommended to specify a specific file path directly, as multiple images may be overwritten, leaving only the last image.)

  • Additionally, the prediction results and the visualized images can be obtained through attributes, as detailed below:

| Attribute | Description |
|---|---|
| json | Retrieves the prediction results in JSON format |
| img | Retrieves the visualized images as a dict |

  • The json attribute returns the prediction results as a dict, consistent with the content saved by calling the save_to_json() method.
  • The img attribute returns the visualized results as a dict. Its key is preprocessed_img, and the corresponding value is an Image.Image object: a visualized image that displays the results of the doc_preprocessor pipeline.
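
The naming rules described above for save_to_json() and save_to_img() can be sketched with pathlib. This is an illustration of the documented naming pattern, not PaddleX code: the function names are hypothetical, and deciding "file or directory" by checking for a file suffix is an assumption of this sketch.

```python
from pathlib import Path


def expected_json_path(save_path: str, input_path: str) -> Path:
    """Where save_to_json() output should land, per the rules above."""
    target = Path(save_path)
    if target.suffix:  # assumption: a suffix means a file path was given
        return target
    return target / f"{Path(input_path).stem}.json"


def expected_img_path(save_path: str, input_path: str) -> Path:
    """Where save_to_img() output should land, per the rules above."""
    target = Path(save_path)
    if target.suffix:  # assumption: a suffix means a file path was given
        return target
    src = Path(input_path)
    # src.suffix already includes the leading dot, e.g. ".jpg".
    return target / f"{src.stem}_doc_preprocessor_res_img{src.suffix}"
```

With save_path="./output/" and input doc_test_rotated.jpg, these helpers give output/doc_test_rotated.json and output/doc_test_rotated_doc_preprocessor_res_img.jpg.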

Additionally, you can obtain the doc_preprocessor pipeline configuration file and load it for prediction. Run the following command to save the configuration file to my_path:

paddlex --get_pipeline_config doc_preprocessor --save_path ./my_path

Once you have the configuration file, you can customize the doc_preprocessor pipeline by changing the pipeline parameter of the create_pipeline method to the path of the configuration file.

For example, if your configuration file is saved at ./my_path/doc_preprocessor.yaml, you only need to run:

from paddlex import create_pipeline

# Load the pipeline from the saved configuration file instead of by name.
pipeline = create_pipeline(pipeline="./my_path/doc_preprocessor.yaml")
output = pipeline.predict(
    input="doc_test_rotated.jpg",
    use_doc_orientation_classify=True,
    use_doc_unwarping=True,
)
for res in output:
    res.print()
    res.save_to_img("./output/")
    res.save_to_json("./output/")

Note: The parameters in the configuration file are for pipeline initialization. If you wish to modify the initialization parameters for the doc_preprocessor pipeline, you can directly edit the parameters in the configuration file and load the file for prediction. Additionally, CLI prediction also supports passing in a configuration file; simply specify the path to the configuration file using --pipeline.

3. Development Integration/Deployment

If the document image preprocessing pipeline meets your requirements for inference speed and accuracy, you can proceed directly with development integration/deployment.

If you need to apply the document image preprocessing pipeline directly to your Python project, you can refer to the sample code in 2.1.2 Python Script Integration.

Additionally, PaddleX offers other deployment methods, detailed as follows:

🚀 High-Performance Inference: In real production environments, many applications have stringent performance standards for deployment strategies, especially regarding response speed, to ensure efficient system operation and a smooth user experience. To address this, PaddleX provides a high-performance inference plugin designed to deeply optimize model inference and pre/post-processing, resulting in significant end-to-end process acceleration. For detailed high-performance inference procedures, please refer to the PaddleX High-Performance Inference Guide.

API Reference

For the main operations provided by the service:

  • The HTTP request method is POST.
  • Both the request body and response body are JSON data (JSON objects).
  • When the request is processed successfully, the response status code is 200, and the attributes of the response body are as follows:
| Name | Type | Meaning |
|---|---|---|
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Fixed as 0. |
| errorMsg | string | Error message. Fixed as "Success". |
| result | object | The result of the operation. |
  • When the request is not processed successfully, the attributes of the response body are as follows:
| Name | Type | Meaning |
|---|---|---|
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Same as the response status code. |
| errorMsg | string | Error message. |
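
The two response shapes above suggest a small client-side check before reading the result. The following is a minimal sketch that operates on the already-decoded JSON body; ServiceError and unwrap_result are hypothetical names introduced for illustration, not part of PaddleX.

```python
from typing import Any, Dict


class ServiceError(RuntimeError):
    """Raised when the response body reports a failure (hypothetical type)."""


def unwrap_result(body: Dict[str, Any]) -> Dict[str, Any]:
    """Return body['result'] on success, otherwise raise ServiceError."""
    if body.get("errorCode") == 0:
        return body["result"]
    raise ServiceError(
        f"request {body.get('logId')} failed with code "
        f"{body.get('errorCode')}: {body.get('errorMsg')}"
    )
```

Keeping logId in the error message makes it easier to match a failed call against the server-side logs.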

The main operations provided by the service are as follows:

  • infer

Obtain the document image preprocessing results.

POST /document-preprocessing

  • The attributes of the request body are as follows:
| Name | Type | Meaning | Required |
|---|---|---|---|
| file | string | The URL of an image or PDF file accessible by the server, or the Base64-encoded content of the file. For PDF files exceeding 10 pages, only the first 10 pages will be used. | Yes |
| fileType | integer or null | The type of the file. 0 for PDF files, 1 for image files. If this attribute is missing, the file type will be inferred from the URL. | No |
| useDocOrientationClassify | boolean or null | Refer to the use_doc_orientation_classify parameter description in the pipeline predict method. | No |
| useDocUnwarping | boolean or null | Refer to the use_doc_unwarping parameter description in the pipeline predict method. | No |
  • When the request is processed successfully, the result in the response body has the following attributes:
| Name | Type | Meaning |
|---|---|---|
| docPreprocessingResults | array | Document image preprocessing results. The array length is 1 (for image input) or the smaller of the number of document pages and 10 (for PDF input). For PDF input, each element in the array represents the processing result of one page of the PDF file. |
| dataInfo | object | Information about the input data. |

Each element in docPreprocessingResults is an object with the following attributes:

| Name | Type | Meaning |
|---|---|---|
| outputImage | string | The preprocessed image. The image is in PNG format and is Base64-encoded. |
| prunedResult | object | A simplified version of the res field in the JSON representation of the result generated by the pipeline object's predict method, excluding the input_path field. |
| docPreprocessingImage | string or null | The visualization result image. The image is in JPEG format and is Base64-encoded. |
| inputImage | string or null | The input image. The image is in JPEG format and is Base64-encoded. |

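The request attributes described above can be assembled with a small helper. The sketch below is one possible way to build the JSON body from either raw file bytes or a URL; build_payload is a hypothetical name, not a PaddleX function, and omitting unset optional attributes is a design choice of this sketch.

```python
import base64
from typing import Any, Dict, Optional


def build_payload(
    file_bytes: Optional[bytes] = None,
    file_url: Optional[str] = None,
    file_type: Optional[int] = None,  # 0 = PDF, 1 = image
    use_doc_orientation_classify: Optional[bool] = None,
    use_doc_unwarping: Optional[bool] = None,
) -> Dict[str, Any]:
    """Assemble the JSON body for the infer operation (hypothetical helper)."""
    if (file_bytes is None) == (file_url is None):
        raise ValueError("provide exactly one of file_bytes or file_url")
    payload: Dict[str, Any] = {
        "file": file_url if file_url is not None
        else base64.b64encode(file_bytes).decode("ascii")
    }
    optional = {
        "fileType": file_type,
        "useDocOrientationClassify": use_doc_orientation_classify,
        "useDocUnwarping": use_doc_unwarping,
    }
    # Leave out unset attributes so the server applies its own defaults.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

Because the file type is inferred from the URL when fileType is missing, it is safest to pass file_type explicitly whenever the file is sent as Base64 content.
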
Multi-language Service Call Example
Python
import base64
import requests

API_URL = "http://localhost:8080/document-preprocessing"
file_path = "./demo.jpg"

with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

payload = {"file": file_data, "fileType": 1}

response = requests.post(API_URL, json=payload)

assert response.status_code == 200
result = response.json()["result"]
for i, res in enumerate(result["docPreprocessingResults"]):
    print(res["prunedResult"])
    output_img_path = f"out_{i}.png"
    with open(output_img_path, "wb") as f:
        f.write(base64.b64decode(res["outputImage"]))
    print(f"Output image saved at {output_img_path}")


☁️ Service Deployment: Service deployment is a common form of deployment in real production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. PaddleX supports multiple pipeline service deployment solutions. For detailed pipeline service deployment procedures, please refer to the PaddleX Service Deployment Guide.

4. Custom Development

If the default model weights provided by the document image preprocessing pipeline do not meet your accuracy or speed requirements in your specific scenario, you can try to further fine-tune the existing model using data from your specific domain or application scenario to enhance the recognition performance of the document image preprocessing pipeline in your context.

4.1 Model Fine-Tuning

Since the document image preprocessing pipeline consists of several modules, if the pipeline's performance does not meet expectations, it may be due to any one of these modules. You can analyze the images with poor recognition results to identify which module has issues, and then refer to the corresponding fine-tuning tutorial link in the table below to fine-tune the model.

| Situation | Fine-tuning model | Fine-tuning reference link |
|---|---|---|
| The overall image rotation correction is inaccurate. | Image orientation classification module | Link |
| The image distortion correction is inaccurate. | Image Unwarping | Fine-tuning is not supported at the moment. |

4.2 Model Application

After completing fine-tuning training with a private dataset, you can obtain a local model weights file.

If you need to use the fine-tuned model weights, simply modify the pipeline configuration file by entering the local path of the fine-tuned model weights into the model_dir field in the pipeline configuration file.

......
  DocOrientationClassify:
    module_name: doc_text_orientation
    model_name: PP-LCNet_x1_0_doc_ori
    model_dir: ./output/best_model/inference  # Replace it with the path of the fine-tuned document image orientation classification model weights.
......

Then, refer to the command line method or Python script method in 2. Quick Start to load the modified pipeline configuration file.

5. Multi-Hardware Support

PaddleX supports a variety of mainstream hardware devices such as NVIDIA GPU, Kunlunxin XPU, Ascend NPU, and Cambricon MLU. You can achieve seamless switching between different hardware by simply modifying the --device parameter.

For example, to run inference with the document image preprocessing pipeline on an Ascend NPU, use the following command:

paddlex --pipeline doc_preprocessor \
        --input doc_test_rotated.jpg \
        --use_doc_orientation_classify True \
        --use_doc_unwarping True \
        --save_path ./output \
        --device npu:0

If you want to use the document image preprocessing pipeline on more types of hardware, please refer to the PaddleX Multi-Hardware Usage Guide.
