Document Image Preprocessing Pipeline Tutorial
1. Introduction to the Document Image Preprocessing Pipeline
The document image preprocessing pipeline integrates two functions: document orientation classification and geometric distortion correction. Orientation classification automatically identifies which of four orientations (0°, 90°, 180°, 270°) a document is in, so that subsequent tasks process it in the correct direction. The geometric distortion correction model corrects distortions introduced when a document is photographed or scanned, restoring its original shape and proportions. The pipeline is suitable for digital document management, preprocessing ahead of recognition tasks such as OCR, and any scenario where document image quality needs to be improved. By automating orientation and distortion correction, it improves the accuracy and efficiency of document processing and gives downstream image analysis a more reliable foundation. The pipeline also offers flexible service deployment options and can be invoked from various programming languages on multiple hardware platforms. It supports further development as well: you can train and fine-tune models on your own dataset and integrate the trained models into the pipeline.
The general document image preprocessing pipeline consists of an optional document image orientation classification module and an optional document image correction module, which use the following models.
Document Image Orientation Classification Module (Optional):
Model | Model download link | Top-1 Acc (%) | GPU inference time (ms) [Normal Mode / High-Performance Mode] | CPU inference time (ms) | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
PP-LCNet_x1_0_doc_ori | Inference Model/Train Model | 99.06 | 3.84845 | 9.23735 | 7 | A document image classification model based on PP-LCNet_x1_0, containing four categories: 0 degrees, 90 degrees, 180 degrees, and 270 degrees. |
Text Image Unwarping Module (Optional):
Model | Model download link | CER | Model storage size(M) | Introduction |
---|---|---|---|---|
UVDoc | Inference Model/Train Model | 0.179 | 30.3 | High-Precision Text Image Correction Model |
Test Environment Description:
- Performance Test Environment
  - Test Dataset:
    - Document Image Orientation Classification Module: A self-built dataset using PaddleX, covering multiple scenarios such as ID cards and documents, containing 1000 images.
    - Text Image Unwarping Module: DocUNet.
  - Hardware Configuration:
    - GPU: NVIDIA Tesla T4
    - CPU: Intel Xeon Gold 6271C @ 2.60GHz
    - Other Environments: Ubuntu 20.04 / cuDNN 8.6 / TensorRT 8.5.2.2
- Inference Mode Description
Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination |
---|---|---|---|
Normal Mode | FP32 Precision / No TRT Acceleration | FP32 Precision / 8 Threads | PaddleInference |
High-Performance Mode | Optimal combination of pre-selected precision types and acceleration strategies | FP32 Precision / 8 Threads | Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.) |
2. Quick Start
You can try the document image preprocessing pipeline locally from the command line or from Python.
Before using the pipeline locally, make sure you have installed the PaddleX wheel package by following the PaddleX Local Installation Guide.
2.1 Local Experience
2.1.1 Command Line Experience
You can try the pipeline with a single command. Use the test file, setting --input to its local path, to run prediction.
paddlex --pipeline doc_preprocessor \
--input doc_test_rotated.jpg \
--use_doc_orientation_classify True \
--use_doc_unwarping True \
--save_path ./output \
--device gpu:0
After running, the results will be printed to the terminal as follows:
{'res': {'input_path': 'doc_test_rotated.jpg', 'model_settings': {'use_doc_orientation_classify': True, 'use_doc_unwarping': True}, 'angle': 180}}
You can refer to the results explanation in 2.1.2 Python Script Integration for a description of the output parameters.
The visualized results are saved under save_path.
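When the command is driven from a script, the printed result can be read back as data. The following is a minimal sketch, assuming the terminal output is a plain Python dict literal exactly as in the sample shown for this command (real output may be surrounded by log text); `parse_cli_result` is a hypothetical helper, not part of PaddleX.

```python
import ast


def parse_cli_result(line: str) -> dict:
    """Parse one printed result line into a dict (hypothetical helper)."""
    # literal_eval only evaluates Python literals, never arbitrary code.
    parsed = ast.literal_eval(line.strip())
    return parsed["res"]


# Sample line copied from the documented terminal output.
printed = (
    "{'res': {'input_path': 'doc_test_rotated.jpg', "
    "'model_settings': {'use_doc_orientation_classify': True, "
    "'use_doc_unwarping': True}, 'angle': 180}}"
)
res = parse_cli_result(printed)
print(res["input_path"], res["angle"])
```

Using `ast.literal_eval` rather than `json.loads` is a deliberate choice here, because the sample output uses single quotes and `True`, which are Python literals rather than valid JSON.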
2.1.2 Python Script Integration
The command line above is a quick way to try the pipeline and view its output. In a project, you will usually integrate it through code, and a few lines are enough to run inference:
from paddlex import create_pipeline

# Instantiate the pipeline by name.
pipeline = create_pipeline(pipeline="doc_preprocessor")

# predict() returns a generator of per-image results.
output = pipeline.predict(
    input="doc_test_rotated.jpg",
    use_doc_orientation_classify=True,
    use_doc_unwarping=True,
)
for res in output:
    res.print()  # print the result to the terminal
    res.save_to_img(save_path="./output/")  # save the visualized image
    res.save_to_json(save_path="./output/")  # save the result as JSON
In the above Python script, the following steps were executed:
(1) Instantiate the doc_preprocessor pipeline object using create_pipeline(). The specific parameter descriptions are as follows:
Parameter | Description | Type | Default |
---|---|---|---|
pipeline | The pipeline name or the path to the pipeline configuration file. If it is a pipeline name, it must be a pipeline supported by PaddleX. | str | None |
device | Inference device for the pipeline. Supports specifying the GPU card number, such as "gpu:0", other hardware card numbers, such as "npu:0", and CPU as "cpu". | str | gpu:0 |
use_hpip | Whether to enable high-performance inference, available only when the pipeline supports high-performance inference. | bool | False |
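The three parameters above can be gathered in one place before the pipeline is created. This is a minimal sketch, not PaddleX's own code: `pipeline_kwargs` and `build_pipeline` are hypothetical helpers, and the device pattern check is an assumption that covers only the forms named in the table ("cpu", "gpu:0", "npu:0"); PaddleX itself may accept other device strings.

```python
import re

# Device strings named in the parameter table: "cpu", "gpu:0", "npu:0".
_DEVICE_PATTERN = re.compile(r"^(cpu|[a-z]+:\d+)$")


def pipeline_kwargs(device: str = "gpu:0", use_hpip: bool = False) -> dict:
    """Collect create_pipeline() arguments after a basic sanity check."""
    if not _DEVICE_PATTERN.match(device):
        raise ValueError(f"unexpected device string: {device!r}")
    return {"pipeline": "doc_preprocessor", "device": device, "use_hpip": use_hpip}


def build_pipeline(**overrides):
    """Instantiate the pipeline; paddlex is imported lazily so the helper
    above can be used on machines where it is not installed."""
    from paddlex import create_pipeline

    return create_pipeline(**pipeline_kwargs(**overrides))


print(pipeline_kwargs(device="cpu"))
```

With PaddleX installed, `build_pipeline(device="gpu:0", use_hpip=True)` would request high-performance inference, which according to the table only takes effect when the pipeline supports it.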
(2) Call the predict() method of the doc_preprocessor pipeline object for inference prediction. This method returns a generator. Below are the parameters of the predict() method and their descriptions:
Parameter | Description | Type | Options | Default |
---|---|---|---|---|
input | Data to be predicted, supporting various input types, required | Python Var / str / list | | None |
device | Inference device for the pipeline | str / None | | None |
use_doc_orientation_classify | Whether to use the document orientation classification module | bool / None | | None |
use_doc_unwarping | Whether to use the document unwarping correction module | bool / None | | None |
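Because predict() returns a generator and input may be a list, several images can be processed in one call. The sketch below is written under two assumptions that this page does not state outright: that a list input yields one result per image, and that each result's json attribute has the same {'res': {...}} shape as the printed output. `collect_angles` is a hypothetical helper.

```python
from typing import Iterable, List, Optional


def collect_angles(
    pipeline,
    image_paths: Iterable[str],
    use_doc_unwarping: Optional[bool] = None,
) -> List[int]:
    """Run predict() over several images and return the predicted angles."""
    output = pipeline.predict(
        input=list(image_paths),
        use_doc_orientation_classify=True,
        use_doc_unwarping=use_doc_unwarping,
    )
    # predict() returns a generator, so results are produced as the loop advances.
    return [res.json["res"]["angle"] for res in output]
```

A call would look like `collect_angles(create_pipeline(pipeline="doc_preprocessor"), ["a.jpg", "b.jpg"])`, where the file names are placeholders.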
(3) Process the prediction results. The prediction result for each sample is of dict type and supports printing, saving as an image, and saving as a JSON file.
Method | Description | Parameter | Type | Parameter Description | Default |
---|---|---|---|---|---|
print() | Prints the results to the terminal | format_json | bool | Whether to format the output using JSON indentation | True |
print() | | indent | int | Specifies the indentation level to beautify the output JSON data for better readability, effective only when format_json is True | 4 |
print() | | ensure_ascii | bool | Controls whether to escape non-ASCII characters as Unicode. When set to True, all non-ASCII characters will be escaped; False retains the original characters, effective only when format_json is True | False |
save_to_json() | Saves the results as a JSON format file | save_path | str | The path to save to. When a directory is given, the saved file is named after the input file | None |
save_to_json() | | indent | int | Specifies the indentation level to beautify the output JSON data for better readability, effective only when format_json is True | 4 |
save_to_json() | | ensure_ascii | bool | Controls whether to escape non-ASCII characters as Unicode. When set to True, all non-ASCII characters will be escaped; False retains the original characters, effective only when format_json is True | False |
save_to_img() | Saves the results as an image format file | save_path | str | The path to save to, either a directory or a file path | None |
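The indent and ensure_ascii parameters can be illustrated with Python's standard json module. This is only a sketch under the assumption that the PaddleX parameters behave like the same-named json.dumps arguments, which this page does not confirm; the result dict and its file name are made up for the example.

```python
import json

# Hypothetical result whose path contains non-ASCII characters.
result = {"res": {"input_path": "文档.jpg", "angle": 180}}

# ensure_ascii=True escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(result, indent=4, ensure_ascii=True)
# ensure_ascii=False keeps the original characters.
readable = json.dumps(result, indent=4, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; the setting only affects how readable the stored text is.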
- Calling the print() method will output the results to the terminal. The content printed to the terminal is explained as follows:
  - input_path: (str) The input path of the image to be predicted.
  - model_settings: (Dict[str, bool]) Model parameters required for configuring the pipeline.
    - use_doc_orientation_classify: (bool) Controls whether to enable the document orientation classification module.
    - use_doc_unwarping: (bool) Controls whether to enable the document unwarping module.
  - angle: (int) The prediction result of the document orientation classification. When the module is enabled, the value is one of 0, 90, 180, or 270; when it is not enabled, the value is -1.
- Calling the save_to_json() method will save the above content to the specified save_path. If a directory is specified, the path will be save_path/{your_img_basename}.json; if a file is specified, it will be saved directly to that file. Since JSON files do not support saving NumPy arrays, any numpy.array types will be converted to lists.
- Calling the save_to_img() method will save the visualized results to the specified save_path. If a directory is specified, the path will be save_path/{your_img_basename}_doc_preprocessor_res_img.{your_img_extension}; if a file is specified, it will be saved directly to that file. (Since the pipeline typically includes multiple result images, it is not recommended to specify a specific file path directly, as multiple images may be overwritten, leaving only the last image.)
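The naming rules for directory targets can be sketched as a small path helper. It assumes that {your_img_basename} means the input file name without its extension, which the text implies but does not spell out; `expected_outputs` is a hypothetical function, not part of PaddleX.

```python
from pathlib import Path


def expected_outputs(input_path: str, save_dir: str) -> dict:
    """Predict where results land when save_path is a directory."""
    src = Path(input_path)
    out = Path(save_dir)
    return {
        # save_to_json(): save_path/{your_img_basename}.json
        "json": out / f"{src.stem}.json",
        # save_to_img(): save_path/{your_img_basename}_doc_preprocessor_res_img.{ext}
        "img": out / f"{src.stem}_doc_preprocessor_res_img{src.suffix}",
    }


paths = expected_outputs("doc_test_rotated.jpg", "./output")
print(paths["json"])
print(paths["img"])
```

A helper like this can be handy for checking that the expected files exist after a batch run.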
Additionally, the visualized images and prediction results can be obtained through attributes, as detailed below:
Attribute | Description |
---|---|
json | Retrieves the prediction results in json format |
img | Retrieves visualized images in dict format |
- The json attribute retrieves the prediction results as a dict, consistent with the content saved by calling the save_to_json() method.
- The img attribute returns the visualized results as a dict. The key is preprocessed_img, and the corresponding value is an Image.Image object: a visualized image showing the results of the doc_preprocessor pipeline.
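The two attributes can be combined to write the visualized image under a name of your choosing. This is a hedged sketch: it assumes res.json carries an input_path under the res key, as in the printed output, and that the Image.Image value offers the usual save() method; `export_result` and the `_preview.png` suffix are hypothetical.

```python
from pathlib import Path


def export_result(res, out_dir: str) -> Path:
    """Save the visualized image of one result and return its path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # res.json mirrors the content written by save_to_json().
    stem = Path(res.json["res"]["input_path"]).stem
    target = out / f"{stem}_preview.png"
    # res.img is a dict; "preprocessed_img" maps to an Image.Image object.
    res.img["preprocessed_img"].save(target)
    return target
```

Inside the prediction loop, this would be called as `export_result(res, "./output")` for each result.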
Additionally, you can obtain the doc_preprocessor pipeline configuration file and load it for prediction. Run the following command to save the configuration file in my_path:
paddlex --get_pipeline_config doc_preprocessor --save_path ./my_path
Once you have the configuration file, you can customize the settings of the doc_preprocessor pipeline. To load it, set the pipeline parameter of the create_pipeline method to the path of the configuration file.
For example, if your configuration file is saved at ./my_path/doc_preprocessor.yaml, simply run:
from paddlex import create_pipeline

# Load the pipeline from the customized configuration file.
pipeline = create_pipeline(pipeline="./my_path/doc_preprocessor.yaml")

output = pipeline.predict(
    input="doc_test_rotated.jpg",
    use_doc_orientation_classify=True,
    use_doc_unwarping=True,
)
for res in output:
    res.print()
    res.save_to_img("./output/")
    res.save_to_json("./output/")
Note: The parameters in the configuration file are the pipeline initialization parameters. To change them for the doc_preprocessor pipeline, edit the configuration file and load it for prediction. CLI prediction also accepts a configuration file: pass its path with --pipeline.
3. Development Integration/Deployment
If the document image preprocessing pipeline meets your requirements for inference speed and accuracy, you can proceed directly with development integration/deployment.
If you need to apply the document image preprocessing pipeline directly to your Python project, you can refer to the sample code in 2.1.2 Python Script Integration.
Additionally, PaddleX offers other deployment methods, detailed as follows:
🚀 High-Performance Inference: In real production environments, many applications have stringent performance standards for deployment strategies, especially regarding response speed, to ensure efficient system operation and a smooth user experience. To address this, PaddleX provides a high-performance inference plugin designed to deeply optimize model inference and pre/post-processing, resulting in significant end-to-end process acceleration. For detailed high-performance inference procedures, please refer to the PaddleX High-Performance Inference Guide.
API Reference
For the main operations provided by the service:
- The HTTP request method is POST.
- Both the request body and response body are JSON data (JSON objects).
- When the request is processed successfully, the response status code is 200, and the attributes of the response body are as follows:
Name | Type | Meaning |
---|---|---|
logId | string | The UUID of the request. |
errorCode | integer | Error code. Fixed as 0. |
errorMsg | string | Error message. Fixed as "Success". |
result | object | The result of the operation. |
- When the request is not processed successfully, the attributes of the response body are as follows:
Name | Type | Meaning |
---|---|---|
logId | string | The UUID of the request. |
errorCode | integer | Error code. Same as the response status code. |
errorMsg | string | Error message. |
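On the client side, the two response shapes can be handled by one small function. This is a minimal sketch based only on the attributes listed in the two tables; `ServiceError` and `unwrap` are hypothetical names, not part of any PaddleX client library.

```python
class ServiceError(RuntimeError):
    """Raised when the service reports a failure (hypothetical name)."""


def unwrap(status_code: int, body: dict) -> dict:
    """Return body["result"] on success, raise ServiceError otherwise."""
    # Success is signalled by status 200 together with errorCode 0.
    if status_code == 200 and body.get("errorCode") == 0:
        return body["result"]
    raise ServiceError(
        f"request {body.get('logId')} failed with "
        f"{body.get('errorCode', status_code)}: {body.get('errorMsg')}"
    )
```

With the requests library, this would be called as `unwrap(response.status_code, response.json())`; keeping logId in the error message makes it easier to match a failure against server logs.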
The main operations provided by the service are as follows:
infer
Obtain the document image preprocessing results.
POST /document-preprocessing
- The attributes of the request body are as follows:
Name | Type | Meaning | Required |
---|---|---|---|
file | string | The URL of an image or PDF file accessible by the server, or the Base64-encoded content of the file. For PDF files exceeding 10 pages, only the first 10 pages will be used. | Yes |
fileType | integer / null | The type of the file. 0 for PDF files, 1 for image files. If this attribute is missing, the file type will be inferred from the URL. | No |
useDocOrientationClassify | boolean / null | Refer to the use_doc_orientation_classify parameter description in the pipeline predict method. | No |
useDocUnwarping | boolean / null | Refer to the use_doc_unwarping parameter description in the pipeline predict method. | No |
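A request body with these attributes can be assembled as follows. This is a sketch under stated assumptions: `build_payload` is a hypothetical helper, it covers only the Base64 form of file, and it omits unset optional attributes on the assumption that the service then applies its own defaults.

```python
import base64
from typing import Optional


def build_payload(
    file_bytes: bytes,
    file_type: Optional[int] = None,
    use_doc_orientation_classify: Optional[bool] = None,
    use_doc_unwarping: Optional[bool] = None,
) -> dict:
    """Build the JSON body for the infer operation."""
    if file_type not in (None, 0, 1):
        raise ValueError("fileType must be 0 (PDF), 1 (image), or omitted")
    payload = {"file": base64.b64encode(file_bytes).decode("ascii")}
    optional = {
        "fileType": file_type,
        "useDocOrientationClassify": use_doc_orientation_classify,
        "useDocUnwarping": use_doc_unwarping,
    }
    # Leave out attributes that were not set.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

Since a missing fileType is inferred from the URL, setting it explicitly is probably the safer choice when sending Base64 content, for example `build_payload(data, file_type=1)` for an image.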
- When the request is processed successfully, the result in the response body has the following attributes:
Name | Type | Meaning |
---|---|---|
docPreprocessingResults | array | Document image preprocessing results. The array length is 1 (for image input) or the smaller of the number of document pages and 10 (for PDF input). For PDF input, each element in the array represents the processing result of one page of the PDF file. |
dataInfo | object | Information about the input data. |
Each element in docPreprocessingResults is an object with the following attributes:
Name | Type | Meaning |
---|---|---|
outputImage | string | The preprocessed image. The image is in PNG format and is Base64-encoded. |
prunedResult | object | A simplified version of the res field in the JSON representation of the result generated by the pipeline object's predict method, excluding the input_path field. |
docPreprocessingImage | string / null | The visualization result image. The image is in JPEG format and is Base64-encoded. |
inputImage | string / null | The input image. The image is in JPEG format and is Base64-encoded. |
Multi-language Service Call Example
Python
import base64
import requests

API_URL = "http://localhost:8080/document-preprocessing"
file_path = "./demo.jpg"

# Encode the local file as Base64 text for the JSON request body.
with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

# fileType 1 marks the payload as an image (0 would be a PDF).
payload = {"file": file_data, "fileType": 1}

response = requests.post(API_URL, json=payload)
assert response.status_code == 200
result = response.json()["result"]

# Print each page's result and write its preprocessed image to disk.
for i, res in enumerate(result["docPreprocessingResults"]):
    print(res["prunedResult"])
    output_img_path = f"out_{i}.png"
    with open(output_img_path, "wb") as f:
        f.write(base64.b64decode(res["outputImage"]))
    print(f"Output image saved at {output_img_path}")
☁️ Service Deployment: Service deployment is a common form of deployment in real production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. PaddleX supports multiple pipeline service deployment solutions. For detailed pipeline service deployment procedures, please refer to the PaddleX Service Deployment Guide.
4. Custom Development
If the default model weights provided by the document image preprocessing pipeline do not meet your accuracy or speed requirements, you can fine-tune the existing models on data from your own domain or application scenario to improve the pipeline's performance there.
4.1 Model Fine-Tuning
Since the pipeline consists of several modules, unsatisfactory results may come from any one of them. Analyze the images with poor results to identify which module is at fault, then follow the corresponding fine-tuning reference in the table below.
Situation | Fine-tuning model | Fine-tuning reference link |
---|---|---|
The overall image rotation correction is inaccurate. | Image orientation classification module | Link |
The image distortion correction is inaccurate. | Image Unwarping | Fine-tuning is not supported at the moment. |
4.2 Model Application
After completing fine-tuning training with a private dataset, you can obtain a local model weights file.
To use the fine-tuned model weights, edit the pipeline configuration file and enter the local path of the fine-tuned model weights in the model_dir field.
......
DocOrientationClassify:
  module_name: doc_text_orientation
  model_name: PP-LCNet_x1_0_doc_ori
  model_dir: ./output/best_model/inference # Replace it with the path of the fine-tuned document image orientation classification model weights.
......
Then, refer to the command line method or Python script method in 2. Quick Start to load the modified pipeline configuration file.
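If the edit needs to be scripted, the model_dir line can be rewritten as plain text. The sketch below uses only the standard library and assumes that model_dir follows model_name inside the same block, as in the fragment above; `set_model_dir` is a hypothetical helper, it drops any trailing comment on the replaced line, and a real YAML library would be more robust for anything beyond this simple case.

```python
def set_model_dir(config_text: str, model_name: str, new_dir: str) -> str:
    """Point the model_dir of the block naming `model_name` at `new_dir`."""
    lines = config_text.splitlines()
    in_target_block = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("model_name:"):
            # Remember whether the block we are in names the wanted model.
            in_target_block = stripped.split(":", 1)[1].strip() == model_name
        elif stripped.startswith("model_dir:") and in_target_block:
            # Keep the original indentation so the YAML structure is unchanged.
            indent = line[: len(line) - len(line.lstrip())]
            lines[i] = f"{indent}model_dir: {new_dir}"
            in_target_block = False
    return "\n".join(lines)
```

Read the configuration file into a string, pass it through `set_model_dir(text, "PP-LCNet_x1_0_doc_ori", "./output/best_model/inference")`, and write the result back before loading the pipeline.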
5. Multi-Hardware Support
PaddleX supports a variety of mainstream hardware devices such as NVIDIA GPU, Kunlunxin XPU, Ascend NPU, and Cambricon MLU. You can switch between different hardware simply by changing the --device parameter.
For example, to run the document image preprocessing pipeline on an Ascend NPU, the command is:
paddlex --pipeline doc_preprocessor \
--input doc_test_rotated.jpg \
--use_doc_orientation_classify True \
--use_doc_unwarping True \
--save_path ./output \
--device npu:0
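When the same job has to run on several kinds of hardware, the command can be generated rather than retyped. This is a small sketch that only reuses the flags shown in the command above; `doc_preprocessor_command` is a hypothetical helper, and the device strings are examples of the forms this tutorial mentions.

```python
import shlex
from typing import List


def doc_preprocessor_command(
    input_path: str, device: str, save_path: str = "./output"
) -> List[str]:
    """Assemble the CLI invocation shown above for a given device."""
    return [
        "paddlex",
        "--pipeline", "doc_preprocessor",
        "--input", input_path,
        "--use_doc_orientation_classify", "True",
        "--use_doc_unwarping", "True",
        "--save_path", save_path,
        "--device", device,
    ]


# Only the --device value changes between hardware targets.
for device in ("gpu:0", "npu:0", "cpu"):
    print(shlex.join(doc_preprocessor_command("doc_test_rotated.jpg", device)))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` on a machine where PaddleX is installed.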
If you want to use the document image preprocessing pipeline on more types of hardware, please refer to the PaddleX Multi-Hardware Usage Guide.