Document Image Preprocessing Pipeline Tutorial
1. Introduction to Document Image Preprocessing Pipeline
The Document Image Preprocessing Pipeline integrates two key functions: document orientation classification and geometric distortion correction. The document orientation classification module automatically identifies the four possible orientations of a document (0°, 90°, 180°, 270°), ensuring that the document is processed in the correct direction. The text image unwarping model corrects geometric distortions introduced when a document is photographed or scanned, restoring the document's original shape and proportions. This pipeline is suitable for digital document management, OCR preprocessing, and any scenario that requires improved document image quality. By automating orientation correction and geometric distortion correction, the pipeline significantly enhances the accuracy and efficiency of document processing and provides a more reliable foundation for downstream image analysis. The pipeline also offers flexible service-oriented deployment options, supporting calls from various programming languages on multiple hardware platforms. Additionally, the pipeline supports secondary development, allowing you to fine-tune the models on your own datasets and seamlessly integrate the trained models.
The General Document Image Preprocessing Pipeline includes the following two modules. Each module can be trained and used for inference independently and contains multiple models. For detailed information, please click on the corresponding module to view its documentation.
- Document Image Orientation Classification Module (Optional)
- Text Image Unwarping Module (Optional)
In this pipeline, you can select the models to use based on the benchmark data provided below.
Document Image Orientation Classification Module (Optional):
| Model | Model Download Links | Top-1 Acc (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (MB) | Description |
|---|---|---|---|---|---|---|
| PP-LCNet_x1_0_doc_ori | Inference Model / Training Model | 99.06 | 2.31 / 0.43 | 3.37 / 1.27 | 7 | A document image classification model based on PP-LCNet_x1_0, covering four categories: 0°, 90°, 180°, and 270°. |
Text Image Unwarping Module (Optional):
| Model | Model Download Links | CER | Model Storage Size (MB) | Description |
|---|---|---|---|---|
| UVDoc | Inference Model / Training Model | 0.179 | 30.3 | A high-precision text image unwarping model. |
Test Environment Description:
- Performance Test Environment
  - Test Datasets:
    - Document Image Orientation Classification Model: a self-built PaddleX dataset covering various scenarios, including ID cards and documents, containing 1,000 images.
    - Text Image Unwarping Model: DocUNet.
  - Hardware Configuration:
    - GPU: NVIDIA Tesla T4
    - CPU: Intel Xeon Gold 6271C @ 2.60 GHz
    - Other Environment: Ubuntu 20.04 / cuDNN 8.6 / TensorRT 8.5.2.2
- Inference Mode Description
| Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination |
|---|---|---|---|
| Normal Mode | FP32 precision / no TRT acceleration | FP32 precision / 8 threads | PaddleInference |
| High-Performance Mode | Optimal combination of precision type and acceleration strategy selected in advance | FP32 precision / 8 threads | Optimal backend (Paddle/OpenVINO/TRT, etc.) selected in advance |
2. Quick Start
Before using the General Document Image Preprocessing Pipeline locally, ensure that you have completed the wheel package installation according to the Installation Guide. After installation, you can experience the pipeline via the command line or integrate it into a Python project.
2.1 Command Line Experience
You can quickly experience the doc_preprocessor pipeline with a single command:
paddleocr doc_preprocessor -i https://paddle-model-ecology.bj.bcebos.com/paddlex/demo_image/doc_test_rotated.jpg
# Specify whether to use the document orientation classification model via --use_doc_orientation_classify
paddleocr doc_preprocessor -i ./doc_test_rotated.jpg --use_doc_orientation_classify True
# Specify whether to use the text image unwarping module via --use_doc_unwarping
paddleocr doc_preprocessor -i ./doc_test_rotated.jpg --use_doc_unwarping True
# Specify the use of GPU for model inference via --device
paddleocr doc_preprocessor -i ./doc_test_rotated.jpg --device gpu
The command line supports more parameter settings; the command line parameters are described in detail below.
| Parameter | Description | Parameter Type | Default Value |
|---|---|---|---|
| `input` | The data to be predicted, supporting multiple input types. This parameter is required. | `Python Var` \| `str` \| `list` | |
| `save_path` | The path for saving inference result files. If set to `None`, inference results are not saved locally. | `str` | `None` |
| `doc_orientation_classify_model_name` | The name of the document orientation classification model. If set to `None`, the pipeline's default model is used. | `str` | `None` |
| `doc_orientation_classify_model_dir` | The directory path of the document orientation classification model. If set to `None`, the official model is downloaded. | `str` | `None` |
| `doc_unwarping_model_name` | The name of the text image unwarping model. If set to `None`, the pipeline's default model is used. | `str` | `None` |
| `doc_unwarping_model_dir` | The directory path of the text image unwarping model. If set to `None`, the official model is downloaded. | `str` | `None` |
| `use_doc_orientation_classify` | Whether to load the document orientation classification module. If set to `None`, the value set during pipeline initialization is used (initialized to `True`). | `bool` | `None` |
| `use_doc_unwarping` | Whether to load the text image unwarping module. If set to `None`, the value set during pipeline initialization is used (initialized to `True`). | `bool` | `None` |
| `device` | The device used for inference. Supports specifying a specific card number, e.g., `gpu:0`. | `str` | `None` |
| `enable_hpi` | Whether to enable high-performance inference. | `bool` | `False` |
| `use_tensorrt` | Whether to use TensorRT for inference acceleration. | `bool` | `False` |
| `min_subgraph_size` | The minimum subgraph size, used to optimize the computation of model subgraphs. | `int` | `3` |
| `precision` | The computational precision, such as `fp32` or `fp16`. | `str` | `fp32` |
| `enable_mkldnn` | Whether to enable the MKL-DNN acceleration library. If set to `None`, it is enabled by default. | `bool` | `None` |
| `cpu_threads` | The number of threads used for inference on the CPU. | `int` | `8` |
| `paddlex_config` | The path to the PaddleX pipeline configuration file. | `str` | `None` |
The running results will be printed to the terminal. The result of running the doc_preprocessor pipeline with the default configuration is as follows:
{'res': {'input_path': '/root/.paddlex/predict_input/doc_test_rotated.jpg', 'page_index': None, 'model_settings': {'use_doc_orientation_classify': True, 'use_doc_unwarping': True}, 'angle': 180}}
The visualization results are saved under the save_path.
2.2 Integration via Python Script
The command-line approach allows for a quick experience and inspection of results. In most projects, however, you will integrate the pipeline through code. You can perform pipeline inference with just a few lines of code, as shown below:
from paddleocr import DocPreprocessor

pipeline = DocPreprocessor()
# pipeline = DocPreprocessor(use_doc_orientation_classify=True)  # Specify whether to use the document orientation classification model via use_doc_orientation_classify
# pipeline = DocPreprocessor(use_doc_unwarping=True)  # Specify whether to use the text image unwarping module via use_doc_unwarping
# pipeline = DocPreprocessor(device="gpu")  # Specify whether to use GPU for model inference via device
output = pipeline.predict("./doc_test_rotated.jpg")
for res in output:
    res.print()  # Print the structured output of the prediction
    res.save_to_img("./output/")  # Save the visualization image
    res.save_to_json("./output/")  # Save the structured result as JSON
In the above Python script, the following steps are executed:
(1) Instantiate the doc_preprocessor pipeline object via `DocPreprocessor()`. The specific parameter descriptions are as follows:
| Parameter | Description | Parameter Type | Default Value |
|---|---|---|---|
| `doc_orientation_classify_model_name` | The name of the document orientation classification model. If set to `None`, the pipeline's default model is used. | `str` | `None` |
| `doc_orientation_classify_model_dir` | The directory path of the document orientation classification model. If set to `None`, the official model is downloaded. | `str` | `None` |
| `doc_unwarping_model_name` | The name of the text image unwarping model. If set to `None`, the pipeline's default model is used. | `str` | `None` |
| `doc_unwarping_model_dir` | The directory path of the text image unwarping model. If set to `None`, the official model is downloaded. | `str` | `None` |
| `use_doc_orientation_classify` | Whether to load the document orientation classification module. If set to `None`, the value set during pipeline initialization is used (initialized to `True`). | `bool` | `None` |
| `use_doc_unwarping` | Whether to load the text image unwarping module. If set to `None`, the value set during pipeline initialization is used (initialized to `True`). | `bool` | `None` |
| `device` | The device used for inference. Supports specifying a specific card number, e.g., `gpu:0`. | `str` | `None` |
| `enable_hpi` | Whether to enable high-performance inference. | `bool` | `False` |
| `use_tensorrt` | Whether to use TensorRT for inference acceleration. | `bool` | `False` |
| `min_subgraph_size` | The minimum subgraph size, used to optimize the computation of model subgraphs. | `int` | `3` |
| `precision` | The computational precision, such as `fp32` or `fp16`. | `str` | `fp32` |
| `enable_mkldnn` | Whether to enable the MKL-DNN acceleration library. If set to `None`, it is enabled by default. | `bool` | `None` |
| `cpu_threads` | The number of threads used for inference on the CPU. | `int` | `8` |
| `paddlex_config` | The path to the PaddleX pipeline configuration file. | `str` | `None` |
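As an illustration of the constructor parameters above, the pipeline can be pinned to the specific models listed in Section 1. This is a minimal sketch, not the only valid configuration; the commented model directory path is a hypothetical example for locally stored weights:

```python
from paddleocr import DocPreprocessor

# Explicitly select the models listed in Section 1 and run on GPU.
pipeline = DocPreprocessor(
    doc_orientation_classify_model_name="PP-LCNet_x1_0_doc_ori",
    doc_unwarping_model_name="UVDoc",
    # doc_unwarping_model_dir="./models/UVDoc",  # hypothetical local weights directory
    device="gpu",
)
output = pipeline.predict("./doc_test_rotated.jpg")
```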
(2) Call the `predict()` method of the doc_preprocessor pipeline object for inference. This method returns a list of results.
In addition, the pipeline provides the `predict_iter()` method. The two methods accept exactly the same parameters and return the same results; the difference is that `predict_iter()` returns a generator, which processes and yields prediction results one by one, making it suitable for large datasets or scenarios where memory savings are desired. You can choose either method according to your actual needs; a minimal sketch of `predict_iter()` follows.
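A minimal sketch of `predict_iter()`, assuming a list of local image files (the file names are placeholders):

```python
from paddleocr import DocPreprocessor

pipeline = DocPreprocessor()
# predict_iter() returns a generator, so each result is handled as soon as it is
# produced instead of being accumulated in a list.
inputs = ["./doc_test_rotated.jpg", "./another_scan.jpg"]  # placeholder file list
for res in pipeline.predict_iter(inputs):
    res.print()
    res.save_to_img("./output/")
```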
The parameters of the `predict()` method and their descriptions are as follows:
| Parameter | Description | Parameter Type | Default Value |
|---|---|---|---|
| `input` | The data to be predicted, supporting multiple input types. This parameter is required. | `Python Var` \| `str` \| `list` | |
| `device` | Same as the parameter used during instantiation. | `str` | `None` |
| `use_doc_orientation_classify` | Whether to use the document orientation classification module during inference. | `bool` | `None` |
| `use_doc_unwarping` | Whether to use the text image unwarping module during inference. | `bool` | `None` |
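For example, the module switches above can be overridden for a single call without rebuilding the pipeline; a minimal sketch:

```python
from paddleocr import DocPreprocessor

pipeline = DocPreprocessor()
# Skip the unwarping module for this particular prediction only.
output = pipeline.predict(
    "./doc_test_rotated.jpg",
    use_doc_orientation_classify=True,
    use_doc_unwarping=False,
)
for res in output:
    res.print()
```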
(3) Process the prediction results. The prediction result for each sample is a corresponding Result object, which supports operations such as printing, saving as an image, and saving as a JSON file:
| Method | Method Description | Parameter | Parameter Type | Parameter Description | Default Value |
|---|---|---|---|---|---|
| `print()` | Print the result to the terminal | `format_json` | `bool` | Whether to format the output content with JSON indentation | `True` |
| | | `indent` | `int` | The indentation level, used to beautify the output JSON data for better readability. Only valid when `format_json` is `True`. | `4` |
| | | `ensure_ascii` | `bool` | Whether to escape non-ASCII characters to Unicode. When set to `True`, all non-ASCII characters are escaped; `False` retains the original characters. Only valid when `format_json` is `True`. | `False` |
| `save_to_json()` | Save the result as a JSON file | `save_path` | `str` | The file path for saving. When it is a directory, the saved file is named consistently with the input file. | `None` |
| | | `indent` | `int` | The indentation level, used to beautify the output JSON data for better readability. Only valid when `format_json` is `True`. | `4` |
| | | `ensure_ascii` | `bool` | Whether to escape non-ASCII characters to Unicode. When set to `True`, all non-ASCII characters are escaped; `False` retains the original characters. Only valid when `format_json` is `True`. | `False` |
| `save_to_img()` | Save the result as an image file | `save_path` | `str` | The file path for saving. Supports directory or file paths. | `None` |
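A short sketch combining the methods listed above; the output directory is arbitrary, and the saved file names follow the naming rules described below:

```python
import os
from paddleocr import DocPreprocessor

os.makedirs("./output", exist_ok=True)
pipeline = DocPreprocessor()
for res in pipeline.predict("./doc_test_rotated.jpg"):
    res.print(format_json=True, indent=2, ensure_ascii=False)  # pretty-printed terminal output
    res.save_to_json("./output/")  # e.g. ./output/doc_test_rotated.json
    res.save_to_img("./output/")   # e.g. ./output/doc_test_rotated_doc_preprocessor_res_img.jpg
```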
- Calling the `print()` method will output the results to the terminal. The content printed to the terminal is explained as follows:
  - `input_path`: `(str)` The input path of the image to be predicted
  - `page_index`: `(Union[int, None])` If the input is a PDF file, this indicates the current page number of the PDF; otherwise, it is `None`
  - `model_settings`: `(Dict[str, bool])` Model parameters configured for the pipeline
    - `use_doc_orientation_classify`: `(bool)` Controls whether to enable the document orientation classification module
    - `use_doc_unwarping`: `(bool)` Controls whether to enable the text image unwarping module
  - `angle`: `(int)` The prediction result of the document orientation classification. When the module is enabled, the value is one of [0, 90, 180, 270]; when disabled, it is -1
- Calling the `save_to_json()` method will save the above content to the specified `save_path`. If a directory is specified, the saved path will be `save_path/{your_img_basename}.json`; if a file is specified, the result is saved directly to that file. Since JSON files do not support saving numpy arrays, `numpy.array` types are converted to lists.
- Calling the `save_to_img()` method will save the visualization results to the specified `save_path`. If a directory is specified, the saved path will be `save_path/{your_img_basename}_doc_preprocessor_res_img.{your_img_extension}`; if a file is specified, the result is saved directly to that file. (A pipeline usually produces multiple result images, so it is not recommended to specify a single file path directly; otherwise later images would overwrite earlier ones and only the last one would be retained.)
- In addition, visualization images and prediction results can also be obtained through the following attributes:
| Attribute | Attribute Description |
|---|---|
| `json` | Obtain the prediction result in JSON format |
| `img` | Obtain the visualization images in dictionary format |
- The prediction result obtained through the `json` attribute is a dict, and its content is consistent with what is saved by calling the `save_to_json()` method.
- The prediction result returned by the `img` attribute is a dict whose key is `preprocessed_img` and whose value is an `Image.Image` object: a visualization image of the doc_preprocessor result.
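A short sketch of reading these attributes directly; the output file name is arbitrary:

```python
import os
from paddleocr import DocPreprocessor

os.makedirs("./output", exist_ok=True)
pipeline = DocPreprocessor()
for res in pipeline.predict("./doc_test_rotated.jpg"):
    data = res.json  # dict with the same content as save_to_json()
    print(data)
    vis = res.img["preprocessed_img"]  # PIL Image.Image with the preprocessed result
    vis.save("./output/preprocessed.png")
```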
3. Development Integration/Deployment
If the pipeline meets your requirements for inference speed and accuracy, you can proceed directly to development integration/deployment.
If you need to apply the pipeline directly in your Python project, you can refer to the example code in 2.2 Python Script Integration.
In addition, PaddleOCR also provides two other deployment methods, which are detailed as follows:
🚀 High-performance inference: In actual production environments, many applications have strict performance requirements (especially response speed) to ensure efficient system operation and smooth user experience. To this end, PaddleOCR provides high-performance inference functionality, aiming to deeply optimize model inference and pre/post-processing to achieve significant end-to-end process acceleration. For detailed high-performance inference procedures, please refer to the High-Performance Inference Guide.
☁️ Service-oriented deployment: Service-oriented deployment is a common form of deployment in actual production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. For detailed pipeline service-oriented deployment procedures, please refer to the Service-Oriented Deployment Guide.
Below are the API references for basic service-oriented deployment and examples of multi-language service calls:
API Reference
Main operations provided by the service:
- The HTTP request method is POST.
- The request body and response body are both JSON data (JSON objects).
- When the request is processed successfully, the response status code is `200`, and the properties of the response body are as follows:
| Name | Type | Description |
|---|---|---|
| `logId` | `string` | The UUID of the request. |
| `errorCode` | `integer` | Error code. Fixed to `0`. |
| `errorMsg` | `string` | Error description. Fixed to `"Success"`. |
| `result` | `object` | Operation result. |
- When the request is not processed successfully, the properties of the response body are as follows:
| Name | Type | Description |
|---|---|---|
| `logId` | `string` | The UUID of the request. |
| `errorCode` | `integer` | Error code. Same as the response status code. |
| `errorMsg` | `string` | Error description. |
Main operations provided by the service:
infer
Obtain the preprocessing result of a document image.
POST /document-preprocessing
- Properties of the request body:
| Name | Type | Description | Required |
|---|---|---|---|
| `file` | `string` | The URL of an image or PDF file accessible to the server, or the Base64-encoded content of such a file. By default, for PDF files with more than 10 pages, only the first 10 pages are processed. To remove the page limit, add the corresponding configuration to the pipeline configuration file. | Yes |
| `fileType` | `integer` \| `null` | File type. `0` indicates a PDF file, and `1` indicates an image file. If this property is absent from the request body, the file type is inferred from the URL. | No |
| `useDocOrientationClassify` | `boolean` \| `null` | Please refer to the description of the `use_doc_orientation_classify` parameter of the pipeline object's `predict` method. | No |
| `useDocUnwarping` | `boolean` \| `null` | Please refer to the description of the `use_doc_unwarping` parameter of the pipeline object's `predict` method. | No |
- When the request is processed successfully, the `result` in the response body has the following properties:
| Name | Type | Description |
|---|---|---|
| `docPreprocessingResults` | `array` | Document image preprocessing results. The array length is 1 (for image input) or equal to the number of processed document pages (for PDF input). For PDF input, each element in the array represents the result of one processed page. |
| `dataInfo` | `object` | Input data information. |
Each element in `docPreprocessingResults` is an `object` with the following properties:
| Name | Type | Description |
|---|---|---|
| `outputImage` | `string` | The preprocessed image. The image is in PNG format and Base64-encoded. |
| `prunedResult` | `object` | A simplified version of the `res` field in the JSON representation of the result generated by the pipeline object's `predict` method, with the `input_path` and `page_index` fields removed. |
| `docPreprocessingImage` | `string` \| `null` | Visualization result image. The image is in JPEG format and Base64-encoded. |
| `inputImage` | `string` \| `null` | Input image. The image is in JPEG format and Base64-encoded. |
Multi-language Service Call Examples
Python
import base64
import requests

API_URL = "http://localhost:8080/document-preprocessing"
file_path = "./demo.jpg"

# Encode the local file as Base64 for the request body
with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

payload = {"file": file_data, "fileType": 1}

response = requests.post(API_URL, json=payload)
assert response.status_code == 200
result = response.json()["result"]

for i, res in enumerate(result["docPreprocessingResults"]):
    print(res["prunedResult"])
    output_img_path = f"out_{i}.png"
    with open(output_img_path, "wb") as f:
        f.write(base64.b64decode(res["outputImage"]))
    print(f"Output image saved at {output_img_path}")
4. Secondary Development
If the default model weights provided by the document image preprocessing pipeline do not meet your accuracy or speed requirements in your specific scenario, you can attempt to further fine-tune the existing model using your own domain-specific or application-specific data to enhance the recognition performance of the document image preprocessing pipeline in your context.
4.1 Model Fine-Tuning
Since the document image preprocessing pipeline comprises multiple modules, any module could potentially contribute to suboptimal performance if the overall pipeline does not meet expectations. You can analyze images with poor recognition results to identify which module is causing the issue and then refer to the corresponding fine-tuning tutorial links in the table below to perform model fine-tuning.
| Scenario | Module to Fine-Tune | Fine-Tuning Reference Link |
|---|---|---|
| Inaccurate rotation correction of the whole image | Document Image Orientation Classification Module | Link |
| Inaccurate distortion correction of the image | Text Image Unwarping Module | Fine-tuning is currently not supported |
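After fine-tuning, the resulting weights can be plugged back into the pipeline through the `*_model_dir` parameters described in Section 2.2. A minimal sketch; the directory path below is a hypothetical location for your fine-tuned orientation classification weights:

```python
from paddleocr import DocPreprocessor

# Load locally fine-tuned orientation classification weights instead of the
# official ones; the directory path is a hypothetical example.
pipeline = DocPreprocessor(
    doc_orientation_classify_model_name="PP-LCNet_x1_0_doc_ori",
    doc_orientation_classify_model_dir="./finetuned_models/PP-LCNet_x1_0_doc_ori",
)
for res in pipeline.predict("./doc_test_rotated.jpg"):
    res.print()
```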