
Document Image Preprocessing Pipeline Tutorial

1. Introduction to Document Image Preprocessing Pipeline

The Document Image Preprocessing Pipeline integrates two key functions: document orientation classification and geometric distortion correction. The document orientation classification module automatically identifies the four possible orientations of a document (0°, 90°, 180°, 270°), ensuring that the document is processed in the correct direction. The text image unwarping model is designed to correct geometric distortions that occur during document photography or scanning, restoring the document's original shape and proportions. This pipeline is suitable for digital document management, preprocessing tasks for OCR, and any scenario requiring improved document image quality. By automating orientation correction and geometric distortion correction, this module significantly enhances the accuracy and efficiency of document processing, providing a more reliable foundation for image analysis. The pipeline also offers flexible service-oriented deployment options, supporting calls from various programming languages on multiple hardware platforms. Additionally, the pipeline supports secondary development, allowing you to fine-tune the models on your own datasets and seamlessly integrate the trained models.

The General Document Image Preprocessing Pipeline includes the following two modules. Each module supports independent training and inference and contains multiple models. For detailed information, click on the corresponding module to view its documentation.

In this pipeline, you can select the models to use based on the benchmark data provided below.

Document Image Orientation Classification Module (Optional):
Model | Model Download Links | Top-1 Acc (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (MB) | Description
PP-LCNet_x1_0_doc_ori | Inference Model / Training Model | 99.06 | 2.31 / 0.43 | 3.37 / 1.27 | 7 | A document image classification model based on PP-LCNet_x1_0, which includes four categories: 0°, 90°, 180°, and 270°.
Text Image Unwarping Module (Optional):
Model | Model Download Links | CER | Model Storage Size (MB) | Description
UVDoc | Inference Model / Training Model | 0.179 | 30.3 | A high-precision text image unwarping model.
Test Environment Description:
  • Performance Test Environment
    • Test Datasets:
      • Document Image Orientation Classification Model: A self-built dataset by PaddleX, covering various scenarios including ID cards and documents, containing 1000 images.
      • Text Image Unwarping Model: DocUNet.
    • Hardware Configuration:
      • GPU: NVIDIA Tesla T4
      • CPU: Intel Xeon Gold 6271C @ 2.60GHz
      • Other Environment: Ubuntu 20.04 / cuDNN 8.6 / TensorRT 8.5.2.2
  • Inference Mode Description
Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination
Normal Mode | FP32 precision / no TRT acceleration | FP32 precision / 8 threads | PaddleInference
High-Performance Mode | Optimal combination of precision type and acceleration strategy selected in advance | FP32 precision / 8 threads | Optimal backend (Paddle/OpenVINO/TRT, etc.) selected in advance

2. Quick Start

Before using the General Document Image Preprocessing Pipeline locally, ensure that you have completed the wheel package installation according to the Installation Guide. After installation, you can experience it via the command line or integrate it into Python locally.

2.1 Command Line Experience

You can quickly experience the doc_preprocessor pipeline with a single command:

paddleocr doc_preprocessor -i https://paddle-model-ecology.bj.bcebos.com/paddlex/demo_image/doc_test_rotated.jpg

# Specify whether to use the document orientation classification model via --use_doc_orientation_classify
paddleocr doc_preprocessor -i ./doc_test_rotated.jpg --use_doc_orientation_classify True

# Specify whether to use the text image unwarping module via --use_doc_unwarping
paddleocr doc_preprocessor -i ./doc_test_rotated.jpg --use_doc_unwarping True

# Specify the use of GPU for model inference via --device
paddleocr doc_preprocessor -i ./doc_test_rotated.jpg --device gpu
The command line supports more parameter settings. Detailed explanations of the command-line parameters are as follows:
Parameter Description Parameter Type Default Value
input The data to be predicted, supporting multiple input types. This parameter is required.
  • Python Var: For example, image data represented as numpy.ndarray.
  • str: For example, the local path of an image or PDF file, such as /root/data/img.jpg; a URL link to an image or PDF file; or a local directory containing the images to be predicted, such as /root/data/ (prediction of PDF files inside a directory is currently not supported; a PDF file must be specified by its exact file path).
  • List: The list elements should be of the above types, such as [numpy.ndarray, numpy.ndarray], ["/root/data/img1.jpg", "/root/data/img2.jpg"], ["/root/data1", "/root/data2"].
Python Var|str|list
save_path Specify the path to save the inference result file. If set to None, the inference result will not be saved locally. str None
doc_orientation_classify_model_name The name of the document orientation classification model. If set to None, the pipeline's default model will be used. str None
doc_orientation_classify_model_dir The directory path of the document orientation classification model. If set to None, the official model will be downloaded. str None
doc_unwarping_model_name The name of the text image unwarping model. If set to None, the pipeline's default model will be used. str None
doc_unwarping_model_dir The directory path of the text image unwarping model. If set to None, the official model will be downloaded. str None
use_doc_orientation_classify Whether to load the document orientation classification module. If set to None, the parameter value initialized by the pipeline will be used by default, initialized as True. bool None
use_doc_unwarping Whether to load the text image unwarping module. If set to None, the parameter value initialized by the pipeline will be used by default, initialized as True. bool None
device The device used for inference. A specific card number can be specified.
  • CPU: For example, cpu indicates using the CPU for inference.
  • GPU: For example, gpu:0 indicates using the first GPU for inference.
  • NPU: For example, npu:0 indicates using the first NPU for inference.
  • XPU: For example, xpu:0 indicates using the first XPU for inference.
  • MLU: For example, mlu:0 indicates using the first MLU for inference.
  • DCU: For example, dcu:0 indicates using the first DCU for inference.
  • None: If set to None, the parameter value initialized by the pipeline will be used by default. During initialization, the local GPU 0 device will be prioritized; if not available, the CPU device will be used.
str None
enable_hpi Whether to enable high-performance inference. bool False
use_tensorrt Whether to use TensorRT for inference acceleration. bool False
min_subgraph_size The minimum subgraph size, used to optimize the computation of model subgraphs. int 3
precision The computational precision, such as fp32, fp16. str fp32
enable_mkldnn Whether to enable the MKL-DNN acceleration library. If set to None, it will be enabled by default. bool None
cpu_threads The number of threads used for inference on the CPU. int 8
paddlex_config Path to PaddleX pipeline configuration file. str None


The running results will be printed to the terminal. The running results of the doc_preprocessor pipeline with default configuration are as follows:

{'res': {'input_path': '/root/.paddlex/predict_input/doc_test_rotated.jpg', 'page_index': None, 'model_settings': {'use_doc_orientation_classify': True, 'use_doc_unwarping': True}, 'angle': 180}}

The visualization results are saved under the specified save_path.

2.2 Integration via Python Script

The command-line approach is useful for a quick experience and for viewing results. In most projects, however, you will integrate the pipeline through code. Rapid inference can be achieved with just a few lines of code. The inference code is as follows:

from paddleocr import DocPreprocessor

pipeline = DocPreprocessor()
# pipeline = DocPreprocessor(use_doc_orientation_classify=True)  # Specify whether to use the document orientation classification model via use_doc_orientation_classify
# pipeline = DocPreprocessor(use_doc_unwarping=True)  # Specify whether to use the text image unwarping module via use_doc_unwarping
# pipeline = DocPreprocessor(device="gpu")  # Specify the use of GPU for model inference via device
output = pipeline.predict("./doc_test_rotated.jpg")
for res in output:
    res.print()  # Print the structured output of the prediction
    res.save_to_img("./output/")
    res.save_to_json("./output/")

In the above Python script, the following steps are executed:

(1) Instantiate the doc_preprocessor pipeline object via DocPreprocessor(). The specific parameter descriptions are as follows:

Parameter Description Parameter Type Default Value
doc_orientation_classify_model_name The name of the document orientation classification model. If set to None, the pipeline's default model will be used. str None
doc_orientation_classify_model_dir The directory path of the document orientation classification model. If set to None, the official model will be downloaded. str None
doc_unwarping_model_name The name of the text image unwarping model. If set to None, the pipeline's default model will be used. str None
doc_unwarping_model_dir The directory path of the text image unwarping model. If set to None, the official model will be downloaded. str None
use_doc_orientation_classify Whether to load the document orientation classification module. If set to None, the parameter value initialized by the pipeline will be used by default, initialized as True. bool None
use_doc_unwarping Whether to load the text image unwarping module. If set to None, the parameter value initialized by the pipeline will be used by default, initialized as True. bool None
device The device used for inference. A specific card number can be specified.
  • CPU: For example, cpu indicates using the CPU for inference.
  • GPU: For example, gpu:0 indicates using the first GPU for inference.
  • NPU: For example, npu:0 indicates using the first NPU for inference.
  • XPU: For example, xpu:0 indicates using the first XPU for inference.
  • MLU: For example, mlu:0 indicates using the first MLU for inference.
  • DCU: For example, dcu:0 indicates using the first DCU for inference.
  • None: If set to None, the parameter value initialized by the pipeline will be used by default. During initialization, the local GPU 0 device will be prioritized; if not available, the CPU device will be used.
str None
enable_hpi Whether to enable high-performance inference. bool False
use_tensorrt Whether to use TensorRT for inference acceleration. bool False
min_subgraph_size The minimum subgraph size, used to optimize the computation of model subgraphs. int 3
precision The computational precision, such as fp32, fp16. str fp32
enable_mkldnn Whether to enable the MKL-DNN acceleration library. If set to None, it will be enabled by default. bool None
cpu_threads The number of threads used for inference on the CPU. int 8
paddlex_config Path to PaddleX pipeline configuration file. str None
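
For reference, here is a minimal sketch that combines several of the constructor parameters above; the model name, device, and input path are illustrative placeholders, not requirements:

from paddleocr import DocPreprocessor

# Illustrative configuration only; adjust the values to your environment.
pipeline = DocPreprocessor(
    doc_orientation_classify_model_name="PP-LCNet_x1_0_doc_ori",  # orientation model listed in Section 1
    use_doc_unwarping=False,   # skip unwarping, e.g. for flat scans
    device="gpu:0",            # use "cpu" if no GPU is available
    cpu_threads=8,             # CPU threads used when inferring on CPU
)
output = pipeline.predict("./doc_test_rotated.jpg")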

(2) Call the predict() method of the doc_preprocessor pipeline object for inference prediction. This method will return a list of results.

In addition, the pipeline also provides the predict_iter() method. The two methods are completely consistent in terms of parameter acceptance and result return. The difference is that predict_iter() returns a generator, which can process and obtain prediction results step by step, suitable for scenarios with large datasets or where memory savings are desired. You can choose either of the two methods according to your actual needs.
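
As a minimal sketch of predict_iter() usage (assuming the pipeline object created in the script above; the second image path is a hypothetical placeholder):

# predict_iter() yields results one at a time instead of returning a full list,
# which keeps memory usage low when processing many images.
for res in pipeline.predict_iter(["./doc_test_rotated.jpg", "./another_doc.jpg"]):
    res.print()
    res.save_to_img("./output/")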

The following are the parameters and their descriptions of the predict() method:

Parameter Description Parameter Type Default Value
input The data to be predicted, supporting multiple input types. This parameter is required.
  • Python Var: For example, image data represented as numpy.ndarray.
  • str: For example, the local path of an image or PDF file, such as /root/data/img.jpg; a URL link to an image or PDF file; or a local directory containing the images to be predicted, such as /root/data/ (prediction of PDF files inside a directory is currently not supported; a PDF file must be specified by its exact file path).
  • List: The list elements should be of the above types, such as [numpy.ndarray, numpy.ndarray], ["/root/data/img1.jpg", "/root/data/img2.jpg"], ["/root/data1", "/root/data2"].
Python Var|str|list
device Same as the parameter during instantiation. str None
use_doc_orientation_classify Whether to use the document orientation classification module during inference. bool None
use_doc_unwarping Whether to use the text image unwarping module during inference. bool None
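
For example, a minimal sketch of per-call overrides, reusing the pipeline object and demo image from above:

# Disable unwarping for this call only while keeping orientation classification enabled.
output = pipeline.predict(
    "./doc_test_rotated.jpg",
    use_doc_orientation_classify=True,
    use_doc_unwarping=False,
)
for res in output:
    res.print()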

(3) Process the prediction results. The prediction result for each sample is a corresponding Result object, which supports operations such as printing, saving as an image, and saving as a json file:

Method | Method Description | Parameter | Parameter Type | Parameter Description | Default Value
print() | Print the result to the terminal | format_json | bool | Whether to format the output content using JSON indentation | True
 | | indent | int | Specify the indentation level to beautify the output JSON data for better readability. Only valid when format_json is True. | 4
 | | ensure_ascii | bool | Control whether to escape non-ASCII characters to Unicode. When set to True, all non-ASCII characters will be escaped; False retains the original characters. Only valid when format_json is True. | False
save_to_json() | Save the result as a JSON file | save_path | str | The file path for saving. When it is a directory, the saved file name will be consistent with the input file name. | None
 | | indent | int | Specify the indentation level to beautify the output JSON data for better readability. Only valid when format_json is True. | 4
 | | ensure_ascii | bool | Control whether to escape non-ASCII characters to Unicode. When set to True, all non-ASCII characters will be escaped; False retains the original characters. Only valid when format_json is True. | False
save_to_img() | Save the result as an image file | save_path | str | The file path for saving. Supports directory or file paths. | None


  • Calling the print() method will output the results to the terminal. The content printed to the terminal is explained as follows:

    • input_path: (str) The input path of the image to be predicted

    • page_index: (Union[int, None]) If the input is a PDF file, it indicates the current page number of the PDF; otherwise, it is None

    • model_settings: (Dict[str, bool]) Model parameters configured for the pipeline

      • use_doc_orientation_classify: (bool) Controls whether to enable the document orientation classification module
      • use_doc_unwarping: (bool) Controls whether to enable the text image unwarping module
    • angle: (int) The prediction result of the document orientation classification. When enabled, the value is one of [0, 90, 180, 270]; when disabled, it is -1

  • Calling the save_to_json() method will save the above content to the specified save_path. If a directory is specified, the saved path will be save_path/{your_img_basename}.json. If a file is specified, it will be saved directly to that file. Since JSON files do not support saving numpy arrays, numpy.array types will be converted to list form.

  • Calling the save_to_img() method will save the visualization results to the specified save_path. If a directory is specified, the saved path will be save_path/{your_img_basename}_doc_preprocessor_res_img.{your_img_extension}. If a file is specified, it will be saved directly to that file. (Pipelines usually produce many result images, so it is not recommended to specify a specific file path directly; otherwise multiple images would overwrite one another and only the last one would be retained.)

  • In addition, the result object supports obtaining the visualization image and the prediction result through attributes, as follows:

Attribute | Description
json | Obtain the prediction result in JSON format
img | Obtain visualization images in dictionary format
  • The prediction result obtained by the json attribute is data of type dict, and the content is consistent with that saved by calling the save_to_json() method.
  • The prediction result returned by the img attribute is a dictionary. The key is preprocessed_img, and the corresponding value is an Image.Image object: the visualization image of the doc_preprocessor result.
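
A minimal sketch of attribute-based access, assuming the pipeline object from Section 2.2 (the output file name is illustrative and ./output/ is assumed to exist):

for res in pipeline.predict("./doc_test_rotated.jpg"):
    data = res.json                    # dict with the same content as save_to_json()
    print(data)
    vis = res.img["preprocessed_img"]  # PIL Image.Image visualization of the doc_preprocessor result
    vis.save("./output/preprocessed.png")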

3. Development Integration/Deployment

If the pipeline meets your requirements for inference speed and accuracy, you can proceed directly to development integration/deployment.

If you need to apply the pipeline directly in your Python project, you can refer to the example code in 2.2 Integration via Python Script.

In addition, PaddleOCR also provides two other deployment methods, which are detailed as follows:

🚀 High-performance inference: In actual production environments, many applications have strict performance requirements (especially response speed) to ensure efficient system operation and smooth user experience. To this end, PaddleOCR provides high-performance inference functionality, aiming to deeply optimize model inference and pre/post-processing to achieve significant end-to-end process acceleration. For detailed high-performance inference procedures, please refer to the High-Performance Inference Guide.

☁️ Service-oriented deployment: Service-oriented deployment is a common form of deployment in actual production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. For detailed pipeline service-oriented deployment procedures, please refer to the Service-Oriented Deployment Guide.

Below are the API references for basic service-oriented deployment and examples of multi-language service calls:

API Reference

The following conventions apply to the operations provided by the service:

  • The HTTP request method is POST.
  • The request body and response body are both JSON data (JSON objects).
  • When the request is processed successfully, the response status code is 200, and the properties of the response body are as follows:
Name Type Description
logId string The UUID of the request.
errorCode integer Error code. Fixed to 0.
errorMsg string Error description. Fixed to "Success".
result object Operation result.
  • When the request is not processed successfully, the properties of the response body are as follows:
Name Type Description
logId string The UUID of the request.
errorCode integer Error code. Same as the response status code.
errorMsg string Error description.

Main operations provided by the service:

  • infer

Obtain the preprocessing result for a document image.

POST /document-preprocessing

  • Properties of the request body:
Name Type Description Required
file string The URL of an image file or PDF file accessible to the server, or the Base64-encoded content of such a file. By default, for PDF files with more than 10 pages, only the first 10 pages will be processed.
To remove the page limit, add the following configuration to the pipeline configuration file:
Serving:
  extra:
    max_num_input_imgs: null
Yes
fileType integer | null File type. 0 indicates a PDF file, and 1 indicates an image file. If this property is not present in the request body, the file type will be inferred based on the URL. No
useDocOrientationClassify boolean | null Please refer to the description of the use_doc_orientation_classify parameter in the predict method of the pipeline object. No
useDocUnwarping boolean | null Please refer to the description of the use_doc_unwarping parameter in the predict method of the pipeline object. No
  • When the request is processed successfully, the result in the response body has the following properties:
Name Type Description
docPreprocessingResults array Document image preprocessing results. The array length is 1 (for image input) or equal to the number of processed document pages (for PDF input). For PDF input, each element in the array represents the result of one processed page.
dataInfo object Input data information.

Each element in docPreprocessingResults is an object with the following properties:

Name Type Description
outputImage string The preprocessed image. The image is in PNG format and uses Base64 encoding.
prunedResult object A simplified version of the res field in the JSON representation of the result generated by the predict method of the pipeline object, with the input_path and page_index fields removed.
docPreprocessingImage string | null Visualization result image. The image is in JPEG format and uses Base64 encoding.
inputImage string | null Input image. The image is in JPEG format and uses Base64 encoding.
Multi-language Service Call Examples
Python
import base64
import requests

API_URL = "http://localhost:8080/document-preprocessing"
file_path = "./demo.jpg"

with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

payload = {"file": file_data, "fileType": 1}

response = requests.post(API_URL, json=payload)

assert response.status_code == 200
result = response.json()["result"]
for i, res in enumerate(result["docPreprocessingResults"]):
    print(res["prunedResult"])
    output_img_path = f"out_{i}.png"
    with open(output_img_path, "wb") as f:
        f.write(base64.b64decode(res["outputImage"]))
    print(f"Output image saved at {output_img_path}")


4. Secondary Development

If the default model weights provided by the document image preprocessing pipeline do not meet your accuracy or speed requirements in your specific scenario, you can attempt to further fine-tune the existing model using your own domain-specific or application-specific data to enhance the recognition performance of the document image preprocessing pipeline in your context.

4.1 Model Fine-Tuning

Since the document image preprocessing pipeline comprises multiple modules, any module could potentially contribute to suboptimal performance if the overall pipeline does not meet expectations. You can analyze images with poor recognition results to identify which module is causing the issue and then refer to the corresponding fine-tuning tutorial links in the table below to perform model fine-tuning.

Scenario | Module to Fine-Tune | Fine-Tuning Reference Link
Inaccurate rotation correction of the entire image | Document Image Orientation Classification Module | Link
Inaccurate distortion correction of the image | Text Image Unwarping Module | Fine-tuning is currently not supported
