PP-DocTranslation Pipeline Usage Tutorial
1. Introduction to PP-DocTranslation Pipeline
PP-DocTranslation is an intelligent document translation solution provided by PaddlePaddle. It combines general layout analysis with large language model (LLM) capabilities. The pipeline identifies and extracts document elements, including text blocks, headings, paragraphs, images, tables, and other complex layout structures, and then translates the extracted content into the target language. It supports translation among multiple mainstream languages and is designed for documents with complex layouts and strong contextual dependencies, aiming for accurate, natural, and fluent results. The pipeline also provides flexible serving options that can be called from multiple programming languages on a range of hardware. It supports secondary development as well: you can train and fine-tune models on your own datasets, and the trained models can be integrated back into the pipeline.
The PP-DocTranslation pipeline uses PP-StructureV3 as a sub-pipeline and therefore provides all of its functionality. For details on the features and usage of PP-StructureV3, see the PP-StructureV3 Pipeline Documentation.
You can choose which models to use in this pipeline based on the benchmark data below.
👉Model List Details
Document Image Orientation Classification Module:
Model | Download Link | Top-1 Acc (%) | GPU Inference Time (ms) [Standard / High Performance] | CPU Inference Time (ms) [Standard / High Performance] | Model Size (M) | Description |
---|---|---|---|---|---|---|
PP-LCNet_x1_0_doc_ori | Inference Model/Pretrained Model | 99.06 | 2.62 / 0.59 | 3.24 / 1.19 | 7 | A document image classification model based on PP-LCNet_x1_0 with four classes: 0°, 90°, 180°, and 270° |
Text Image Unwarping Module:
Model | Download Link | CER | Model Size (M) | Description |
---|---|---|---|---|
UVDoc | Inference Model/Pretrained Model | 0.179 | 30.3 | High-accuracy text image unwarping model |
Layout Detection Module Models:
Model | Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Standard / High Performance] | CPU Inference Time (ms) [Standard / High Performance] | Model Size (M) | Description |
---|---|---|---|---|---|---|
PP-DocLayout_plus-L | Inference Model/Pretrained Model | 83.2 | 53.03 / 17.23 | 634.62 / 378.32 | 126.01 | High-accuracy layout detection model based on RT-DETR-L, trained on a custom dataset covering scenarios like Chinese/English papers, multi-column magazines, newspapers, PPTs, contracts, books, exams, research reports, ancient books, Japanese documents, and vertical text documents |
PP-DocLayout-L | Inference Model/Pretrained Model | 90.4 | 33.59 / 33.59 | 503.01 / 251.08 | 123.76 | High-accuracy layout detection model based on RT-DETR-L, trained on a custom dataset covering papers, magazines, contracts, books, exams, and research reports |
PP-DocLayout-M | Inference Model/Pretrained Model | 75.2 | 13.03 / 4.72 | 43.39 / 24.44 | 22.578 | Balanced accuracy-efficiency layout detection model based on PicoDet-L, trained on a custom dataset covering papers, magazines, contracts, books, exams, and research reports |
PP-DocLayout-S | Inference Model/Pretrained Model | 70.9 | 11.54 / 3.86 | 18.53 / 6.29 | 4.834 | High-efficiency layout detection model based on PicoDet-S, trained on a custom dataset for papers, magazines, contracts, books, exams, and research reports |
Table Structure Recognition Module:
Model | Download Link | Accuracy (%) | GPU Inference Time (ms) [Standard / High Performance] | CPU Inference Time (ms) [Standard / High Performance] | Model Size (M) | Description |
---|---|---|---|---|---|---|
SLANeXt_wired | Inference Model/Pretrained Model | 69.65 | 85.92 / 85.92 | - / 501.66 | 351 | SLANeXt series is a next-generation table structure recognition model developed by Baidu PaddlePaddle Vision Team. Compared with SLANet and SLANet_plus, SLANeXt focuses on recognizing table structures, with dedicated weights for wired and wireless tables, significantly improving performance especially for wired tables. |
SLANeXt_wireless | Inference Model/Pretrained Model |
Table Classification Module Models:
Model | Download Link | Top-1 Acc (%) | GPU Inference Time (ms) [Standard / High Performance] | CPU Inference Time (ms) [Standard / High Performance] | Model Size (M) |
---|---|---|---|---|---|
PP-LCNet_x1_0_table_cls | Inference Model/Pretrained Model | 94.2 | 2.62 / 0.60 | 3.17 / 1.14 | 6.6 |
Table Cell Detection Module Models:
Model | Download Link | mAP (%) | GPU Inference Time (ms) [Standard / High Performance] | CPU Inference Time (ms) [Standard / High Performance] | Model Size (M) | Description |
---|---|---|---|---|---|---|
RT-DETR-L_wired_table_cell_det | Inference Model/Pretrained Model | 82.7 | 33.47 / 27.02 | 402.55 / 256.56 | 124 | RT-DETR is the first real-time end-to-end object detection model. Baidu PaddlePaddle Vision Team used RT-DETR-L as the base and pre-trained on a custom table cell detection dataset, achieving strong performance on both wired and wireless tables. |
RT-DETR-L_wireless_table_cell_det | Inference Model/Pretrained Model |
Text Detection Module:
Model | Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Standard / High Performance] | CPU Inference Time (ms) [Standard / High Performance] | Model Size (M) | Description |
---|---|---|---|---|---|---|
PP-OCRv5_server_det | Inference Model/Pretrained Model | 83.8 | 89.55 / 70.19 | 383.15 / 383.15 | 84.3 | PP-OCRv5 server-side text detection model, higher accuracy, suitable for deployment on high-performance servers |
PP-OCRv5_mobile_det | Inference Model/Pretrained Model | 79.0 | 10.67 / 6.36 | 57.77 / 28.15 | 4.7 | PP-OCRv5 mobile-side text detection model, more efficient, suitable for edge device deployment |
PP-OCRv4_server_det | Inference Model/Pretrained Model | 69.2 | 127.82 / 98.87 | 585.95 / 489.77 | 109 | PP-OCRv4 server-side text detection model, higher accuracy, suitable for deployment on high-performance servers |
PP-OCRv4_mobile_det | Inference Model/Pretrained Model | 63.8 | 9.87 / 4.17 | 56.60 / 20.79 | 4.7 | PP-OCRv4 mobile-side text detection model, more efficient, suitable for edge device deployment |
PP-OCRv3_mobile_det | Inference Model/Pretrained Model | Accuracy similar to PP-OCRv4_mobile_det | 9.90 / 3.60 | 41.93 / 20.76 | 2.1 | PP-OCRv3 mobile-side text detection model, more efficient, suitable for edge device deployment |
PP-OCRv3_server_det | Inference Model/Pretrained Model | Accuracy similar to PP-OCRv4_server_det | 119.50 / 75.00 | 379.35 / 318.35 | 102.1 | PP-OCRv3 server-side text detection model, higher accuracy, suitable for deployment on high-performance servers |
Text Recognition Module Models:
* Chinese Recognition Models
Model | Download Link | Recognition Avg Accuracy(%) | GPU Inference Time (ms) [Regular Mode / High-Performance Mode] | CPU Inference Time (ms) [Regular Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
PP-OCRv5_server_rec | Inference Model/Training Model | 86.38 | 8.46 / 2.36 | 31.21 / 31.21 | 81 | PP-OCRv5_rec is a next-generation text recognition model. It aims to efficiently and accurately support four major languages—Simplified Chinese, Traditional Chinese, English, and Japanese—as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters. While maintaining recognition performance, it balances inference speed and model robustness, providing efficient and precise technical support for document understanding in various scenarios. |
PP-OCRv5_mobile_rec | Inference Model/Training Model | 81.29 | 5.43 / 1.46 | 21.20 / 5.32 | 16 | |
PP-OCRv4_server_rec_doc | Inference Model/Training Model | 86.58 | 8.69 / 2.78 | 37.93 / 37.93 | 74.7 | PP-OCRv4_server_rec_doc is trained on a mix of more Chinese document data and PP-OCR training data, based on PP-OCRv4_server_rec. It enhances recognition capabilities for Traditional Chinese, Japanese, and special characters, supporting 15,000+ characters. In addition to improving document-related text recognition, it also enhances general text recognition. |
PP-OCRv4_mobile_rec | Inference Model/Training Model | 78.74 | 5.26 / 1.12 | 17.48 / 3.61 | 10.6 | The lightweight recognition model of PP-OCRv4, with high inference efficiency, deployable on various hardware devices including edge devices. |
PP-OCRv4_server_rec | Inference Model/Training Model | 80.61 | 8.75 / 2.49 | 36.93 / 36.93 | 71.2 | The server-side model of PP-OCRv4, with high inference accuracy, deployable on various servers. |
PP-OCRv3_mobile_rec | Inference Model/Training Model | 72.96 | 3.89 / 1.16 | 8.72 / 3.56 | 9.2 | The lightweight recognition model of PP-OCRv3, with high inference efficiency, deployable on various hardware devices including edge devices. |
Model | Download Link | Recognition Avg Accuracy(%) | GPU Inference Time (ms) [Regular Mode / High-Performance Mode] | CPU Inference Time (ms) [Regular Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
ch_SVTRv2_rec | Inference Model/Training Model | 68.81 | 10.38 / 8.31 | 66.52 / 30.83 | 73.9 | SVTRv2 is a server-side text recognition model developed by the OpenOCR team from Fudan University's Vision and Learning Lab (FVL). It won first prize in the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition, achieving a 6% improvement in end-to-end recognition accuracy over PP-OCRv4 on Leaderboard A. |
Model | Download Link | Recognition Avg Accuracy(%) | GPU Inference Time (ms) [Regular Mode / High-Performance Mode] | CPU Inference Time (ms) [Regular Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
ch_RepSVTR_rec | Inference Model/Training Model | 65.07 | 6.29 / 1.57 | 20.64 / 5.40 | 22.1 | RepSVTR is a mobile text recognition model based on SVTRv2. It won first prize in the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition, achieving a 2.5% improvement in end-to-end recognition accuracy over PP-OCRv4 on Leaderboard B, with comparable inference speed. |
* English Recognition Models
Model | Download Link | Recognition Avg Accuracy(%) | GPU Inference Time (ms) [Regular Mode / High-Performance Mode] | CPU Inference Time (ms) [Regular Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
en_PP-OCRv4_mobile_rec | Inference Model/Training Model | 70.39 | 4.81 / 1.23 | 17.20 / 4.18 | 6.8 | An ultra-lightweight English recognition model trained based on the PP-OCRv4 recognition model, supporting English and numeric recognition. |
en_PP-OCRv3_mobile_rec | Inference Model/Training Model | 70.69 | 3.56 / 0.78 | 8.44 / 5.78 | 7.8 | An ultra-lightweight English recognition model trained based on the PP-OCRv3 recognition model, supporting English and numeric recognition. |
* Multilingual Recognition Models
Model | Download Link | Recognition Avg Accuracy(%) | GPU Inference Time (ms) [Regular Mode / High-Performance Mode] | CPU Inference Time (ms) [Regular Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
korean_PP-OCRv3_mobile_rec | Inference Model/Training Model | 60.21 | 3.73 / 0.98 | 8.76 / 2.91 | 8.6 | An ultra-lightweight Korean recognition model trained based on the PP-OCRv3 recognition model, supporting Korean and numeric recognition. |
japan_PP-OCRv3_mobile_rec | Inference Model/Training Model | 45.69 | 3.86 / 1.01 | 8.62 / 2.92 | 8.8 | An ultra-lightweight Japanese recognition model trained based on the PP-OCRv3 recognition model, supporting Japanese and numeric recognition. |
chinese_cht_PP-OCRv3_mobile_rec | Inference Model/Training Model | 82.06 | 3.90 / 1.16 | 9.24 / 3.18 | 9.7 | An ultra-lightweight Traditional Chinese recognition model trained based on the PP-OCRv3 recognition model, supporting Traditional Chinese and numeric recognition. |
te_PP-OCRv3_mobile_rec | Inference Model/Training Model | 95.88 | 3.59 / 0.81 | 8.28 / 6.21 | 7.8 | An ultra-lightweight Telugu recognition model trained based on the PP-OCRv3 recognition model, supporting Telugu and numeric recognition. |
ka_PP-OCRv3_mobile_rec | Inference Model/Training Model | 96.96 | 3.49 / 0.89 | 8.63 / 2.77 | 8.0 | An ultra-lightweight Kannada recognition model trained based on the PP-OCRv3 recognition model, supporting Kannada and numeric recognition. |
ta_PP-OCRv3_mobile_rec | Inference Model/Training Model | 76.83 | 3.49 / 0.86 | 8.35 / 3.41 | 8.0 | An ultra-lightweight Tamil recognition model trained based on the PP-OCRv3 recognition model, supporting Tamil and numeric recognition. |
latin_PP-OCRv3_mobile_rec | Inference Model/Training Model | 76.93 | 3.53 / 0.78 | 8.50 / 6.83 | 7.8 | An ultra-lightweight Latin recognition model trained based on the PP-OCRv3 recognition model, supporting Latin and numeric recognition. |
arabic_PP-OCRv3_mobile_rec | Inference Model/Training Model | 73.55 | 3.60 / 0.83 | 8.44 / 4.69 | 7.8 | An ultra-lightweight Arabic script recognition model trained based on the PP-OCRv3 recognition model, supporting Arabic script and numeric recognition. |
cyrillic_PP-OCRv3_mobile_rec | Inference Model/Training Model | 94.28 | 3.56 / 0.79 | 8.22 / 2.76 | 7.9 | An ultra-lightweight Cyrillic script recognition model trained based on the PP-OCRv3 recognition model, supporting Cyrillic script and numeric recognition. |
devanagari_PP-OCRv3_mobile_rec | Inference Model/Training Model | 96.44 | 3.60 / 0.78 | 6.95 / 2.87 | 7.9 | An ultra-lightweight Devanagari script recognition model trained based on the PP-OCRv3 recognition model, supporting Devanagari script and numeric recognition. |
Text Line Orientation Classification Module (Optional):
Model | Download Link | Top-1 Acc(%) | GPU Inference Time (ms) [Regular Mode / High-Performance Mode] | CPU Inference Time (ms) [Regular Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
PP-LCNet_x0_25_textline_ori | Inference Model/Training Model | 95.54 | 2.16 / 0.41 | 2.37 / 0.73 | 0.32 | A text line classification model based on PP-LCNet_x0_25, with two classes: 0 degrees and 180 degrees. |
Formula Recognition Module:
Model | Download Link | Avg-BLEU(%) | GPU Inference Time (ms) [Regular Mode / High-Performance Mode] | CPU Inference Time (ms) [Regular Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
UniMERNet | Inference Model/Training Model | 86.13 | 2266.96/- | -/- | 1.4 G | UniMERNet is a formula recognition model developed by Shanghai AI Lab. It uses Donut Swin as the encoder and MBartDecoder as the decoder. Trained on a dataset of one million samples, including simple formulas, complex formulas, scanned formulas, and handwritten formulas, it significantly improves recognition accuracy for real-world scenarios. |
PP-FormulaNet-S | Inference Model/Training Model | 87.12 | 1311.84 / 1311.84 | - / 8288.07 | 167.9 | PP-FormulaNet is an advanced formula recognition model developed by Baidu's PaddlePaddle Vision team, supporting 50,000 common LaTeX vocabulary items. The PP-FormulaNet-S version uses PP-HGNetV2-B4 as its backbone and employs techniques like parallel masking and model distillation to significantly improve inference speed while maintaining high recognition accuracy, suitable for simple printed formulas, cross-line simple printed formulas, etc. The PP-FormulaNet-L version is based on Vary_VIT_B as its backbone and is trained on a large-scale formula dataset, showing significant improvement in complex formula recognition compared to PP-FormulaNet-S, suitable for simple printed formulas, complex printed formulas, handwritten formulas, etc. |
PP-FormulaNet-L | Inference Model/Training Model | 92.13 | 1976.52/- | -/- | 535.2 | |
LaTeX_OCR_rec | Inference Model/Training Model | 71.63 | 1088.89 / 1088.89 | - / - | 89.7 | LaTeX-OCR is a formula recognition algorithm based on an autoregressive large model. By using Hybrid ViT as the backbone and transformer as the decoder, it significantly improves the accuracy of formula recognition. |
Seal Text Recognition Module:
Model | Download Link | Detection Hmean(%) | GPU Inference Time (ms) [Regular Mode / High-Performance Mode] | CPU Inference Time (ms) [Regular Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
PP-OCRv4_server_seal_det | Inference Model/Training Model | 98.21 | 124.64 / 91.57 | 545.68 / 439.86 | 109 | The server-side seal text detection model of PP-OCRv4, with higher accuracy, suitable for deployment on high-performance servers. |
PP-OCRv4_mobile_seal_det | Inference Model/Training Model | 96.47 | 9.70 / 3.56 | 50.38 / 19.64 | 4.6 | The mobile-side seal text detection model of PP-OCRv4, with higher efficiency, suitable for deployment on edge devices. |
- Performance Testing Environment
  - Test Datasets:
    - Document Image Orientation Classification Model: A dataset built by PaddleX, covering multiple scenarios such as IDs and documents, containing 1,000 images.
    - Text Image Unwarping Model: DocUNet.
    - Layout Detection Model: A layout analysis dataset built by PaddleOCR, containing 10,000 common document-type images such as Chinese and English papers, magazines, and reports.
    - PP-DocLayout_plus-L: A layout detection dataset built by PaddleOCR, containing 1,300 document-type images such as Chinese and English papers, magazines, newspapers, reports, PPTs, exams, and textbooks.
    - Table Structure Recognition Model: An internal English table recognition dataset built by PaddleX.
    - Text Detection Model: A Chinese dataset built by PaddleOCR, covering street views, web images, documents, and handwriting, with 500 images for detection.
    - Chinese Recognition Model: A Chinese dataset built by PaddleOCR, covering street views, web images, documents, and handwriting, with 11,000 images for text recognition.
    - ch_SVTRv2_rec: PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition, Leaderboard A evaluation set.
    - ch_RepSVTR_rec: PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition, Leaderboard B evaluation set.
    - English Recognition Model: An English dataset built by PaddleX.
    - Multilingual Recognition Model: A multilingual dataset built by PaddleX.
    - Text Line Orientation Classification Model: A dataset built by PaddleX, covering multiple scenarios such as IDs and documents, containing 1,000 images.
    - Seal Text Recognition Model: A dataset built by PaddleX, containing 500 circular seal images.
  - Hardware Configuration:
    - GPU: NVIDIA Tesla T4
    - CPU: Intel Xeon Gold 6271C @ 2.60GHz
  - Software Environment:
    - Ubuntu 20.04 / CUDA 11.8 / cuDNN 8.9 / TensorRT 8.6.1.6
    - paddlepaddle 3.0.0 / paddleocr 3.0.3
- Inference Mode Description
Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination |
---|---|---|---|
Regular Mode | FP32 Precision / No TRT Acceleration | FP32 Precision / 8 Threads | PaddleInference |
High-Performance Mode | Optimal combination of precision types and acceleration strategies | FP32 Precision / 8 Threads | Optimal backend selection (Paddle/OpenVINO/TRT, etc.) |
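One way to act on the benchmark tables above is to record the figures you care about and filter them by a latency budget. The following is a minimal sketch and not part of PaddleOCR: `LAYOUT_MODELS` and `pick_layout_model` are hypothetical names, and the numbers are copied from the layout detection table (mAP(0.5) and the first GPU inference time of each pair). Note that PP-DocLayout_plus-L was evaluated on a different dataset from the other three models, so its score is not directly comparable.

```python
# (model name, mAP(0.5) in %, GPU inference time in ms, first value of each pair)
# Values copied from the layout detection table; the structure is illustrative only.
LAYOUT_MODELS = [
    ("PP-DocLayout_plus-L", 83.2, 53.03),
    ("PP-DocLayout-L", 90.4, 33.59),
    ("PP-DocLayout-M", 75.2, 13.03),
    ("PP-DocLayout-S", 70.9, 11.54),
]


def pick_layout_model(max_gpu_ms):
    """Return the highest-mAP model whose GPU latency fits the budget, or None."""
    candidates = [m for m in LAYOUT_MODELS if m[2] <= max_gpu_ms]
    if not candidates:
        return None
    # Among the models that fit, prefer the one with the best reported accuracy.
    return max(candidates, key=lambda m: m[1])[0]


print(pick_layout_model(20))
```

The returned name could then be supplied as the `layout_detection_model_name` parameter described in section 2.1.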
2. Quick Start
Before using the PP-DocTranslation pipeline locally, make sure you have installed the wheel package by following the Installation Tutorial.
Note: If the program becomes unresponsive, terminates unexpectedly, runs out of memory, or runs inference extremely slowly, try adjusting the configuration as described in the documentation, for example by disabling unneeded features or switching to lighter-weight models.
Before use, prepare an API key for a large language model service. Supported services are the Baidu Cloud Qianfan Platform and local large model services that comply with the OpenAI interface standard.
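To avoid hard-coding the API key, it can be read from the environment when building the large model configuration. This is a sketch under assumptions: `build_chat_bot_config` is a hypothetical helper and `QIANFAN_API_KEY` is an environment variable name chosen here for illustration; the dictionary keys and default values mirror the `chat_bot_config` shown in section 2.2.

```python
import os


def build_chat_bot_config(api_key=None,
                          base_url="https://qianfan.baidubce.com/v2",
                          model_name="ernie-3.5-8k"):
    """Build a chat_bot_config dict, taking the key from the environment if not given."""
    # QIANFAN_API_KEY is an assumed variable name, not one defined by PaddleOCR.
    key = api_key or os.environ.get("QIANFAN_API_KEY")
    if not key:
        raise ValueError("No API key given and QIANFAN_API_KEY is not set")
    return {
        "module_name": "chat_bot",
        "model_name": model_name,
        "base_url": base_url,  # point this at a local OpenAI-compatible service if needed
        "api_type": "openai",
        "api_key": key,
    }
```

For a local OpenAI-compatible service, you would pass that service's `base_url` and model name instead of the Qianfan defaults.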
2.1 Experience via Command Line
You can download the test file and quickly try the pipeline with a single command:
paddleocr pp_doctranslation -i vehicle_certificate-1.png --target_language en --qianfan_api_key your_api_key
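If you prefer to launch the same command from a script, the argument list can be assembled in Python. This is a sketch under assumptions: `build_command` and `run_translation` are hypothetical helpers, and the sketch assumes that the `save_path` parameter is exposed as a `--save_path` flag in the same way that `target_language` and `qianfan_api_key` are in the command above.

```python
import subprocess


def build_command(input_path, target_language, api_key, save_path=None):
    """Assemble the argument list for the pp_doctranslation command."""
    cmd = [
        "paddleocr", "pp_doctranslation",
        "-i", input_path,
        "--target_language", target_language,
        "--qianfan_api_key", api_key,
    ]
    if save_path is not None:
        # Assumed flag spelling; see the lead-in.
        cmd += ["--save_path", save_path]
    return cmd


def run_translation(input_path, target_language, api_key, save_path=None):
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit status."""
    cmd = build_command(input_path, target_language, api_key, save_path)
    return subprocess.run(cmd, check=True)

# Example call (requires the paddleocr CLI to be installed):
# run_translation("vehicle_certificate-1.png", "en", "your_api_key")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.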
The command line supports additional parameters, described in the table below.
Parameter | Description | Type | Default Value |
---|---|---|---|
input | Data to be predicted, required. For example, the local path of an image or PDF file: /root/data/img.jpg; a URL link, such as the network URL of an image or PDF file: example; or a local directory containing the images to be predicted, such as /root/data/ (PDF files inside a directory are currently not supported; specify the exact file path for a PDF). | str | |
save_path | Specifies the path to save the inference result files. If not set, inference results will not be saved locally. | str | |
target_language | Target language (ISO 639-1 language code). | str | zh |
layout_detection_model_name | Model name for layout detection. If not set, the pipeline default model will be used. | str | |
layout_detection_model_dir | Directory path of the layout detection model. If not set, the official model will be downloaded. | str | |
layout_threshold | Score threshold for layout model. Any float between 0-1. If not set, the pipeline initialized value will be used, default initialized as 0.5. | float | |
layout_nms | Whether to use post-processing NMS in layout detection. If not set, the pipeline initialized value will be used, default initialized as True. | bool | |
layout_unclip_ratio | Expansion coefficient for detection boxes in layout detection model. Any float greater than 0. If not set, the pipeline initialized value will be used, default initialized as 1.0. | float | |
layout_merge_bboxes_mode | Mode for merging detection boxes output by the layout detection model. large . | str | |
chart_recognition_model_name | Model name for chart parsing. If not set, the pipeline default model will be used. | str | |
chart_recognition_model_dir | Directory path for chart parsing model. If not set, the official model will be downloaded. | str | |
chart_recognition_batch_size | Batch size for chart parsing model. If not set, batch size defaults to 1. | int | |
region_detection_model_name | Model name for region detection. If not set, the pipeline default model will be used. | str | |
region_detection_model_dir | Directory path for region detection model. If not set, the official model will be downloaded. | str | |
doc_orientation_classify_model_name | Model name for document orientation classification. If not set, the pipeline default model will be used. | str | |
doc_orientation_classify_model_dir | Directory path for document orientation classification model. If not set, the official model will be downloaded. | str | |
doc_unwarping_model_name | Model name for text image unwarping. If not set, the pipeline default model will be used. | str | |
doc_unwarping_model_dir | Directory path for text image unwarping model. If not set, the official model will be downloaded. | str | |
text_detection_model_name | Model name for text detection. If not set, the pipeline default model will be used. | str | |
text_detection_model_dir | Directory path for text detection model. If not set, the official model will be downloaded. | str | |
text_det_limit_side_len | Image side length limit for text detection. Any integer greater than 0. If not set, the pipeline initialized value will be used, default initialized as 960. | int | |
text_det_limit_type | Type of image side length limit for text detection. Supports min and max. min means ensuring the shortest side of the image is not less than text_det_limit_side_len; max means ensuring the longest side of the image is not greater than text_det_limit_side_len. If not set, the pipeline initialized value will be used, default initialized as max. | str | |
text_det_thresh | Detection pixel threshold. In the output probability map, pixels with score greater than this threshold are considered text pixels. Any float greater than 0. If not set, the pipeline initialized value 0.3 will be used by default. | float | |
text_det_box_thresh | Detection box threshold. If the average score of all pixels within the detected bounding box is greater than this threshold, the result is considered a text region. Any float greater than 0. If not set, the pipeline initialized value 0.6 will be used by default. | float | |
text_det_unclip_ratio | Text detection expansion coefficient, used to expand text regions. The larger the value, the larger the expansion area. Any float greater than 0. If not set, the pipeline initialized value 2.0 will be used by default. | float | |
textline_orientation_model_name | Model name for textline orientation. If not set, the pipeline default model will be used. | str | |
textline_orientation_model_dir | Directory path for textline orientation model. If not set, the official model will be downloaded. | str | |
textline_orientation_batch_size | Batch size for textline orientation model. If not set, batch size defaults to 1. | int | |
text_recognition_model_name | Model name for text recognition. If not set, the pipeline default model will be used. | str | |
text_recognition_model_dir | Directory path for text recognition model. If not set, the official model will be downloaded. | str | |
text_recognition_batch_size | Batch size for text recognition model. If not set, batch size defaults to 1. | int | |
text_rec_score_thresh | Text recognition threshold. Text results with scores greater than this threshold will be kept. Any float greater than 0. If not set, the pipeline initialized value 0.0 will be used, meaning no threshold. | float | |
table_classification_model_name | Model name for table classification. If not set, the pipeline default model will be used. | str | |
table_classification_model_dir | Directory path for table classification model. If not set, the official model will be downloaded. | str | |
wired_table_structure_recognition_model_name | Model name for wired table structure recognition. If not set, the pipeline default model will be used. | str | |
wired_table_structure_recognition_model_dir | Directory path for wired table structure recognition model. If not set, the official model will be downloaded. | str | |
wireless_table_structure_recognition_model_name | Model name for wireless table structure recognition. If not set, the pipeline default model will be used. | str | |
wireless_table_structure_recognition_model_dir | Directory path for wireless table structure recognition model. If not set, the official model will be downloaded. | str | |
wired_table_cells_detection_model_name | Model name for wired table cells detection. If not set, the pipeline default model will be used. | str | |
wired_table_cells_detection_model_dir | Directory path for wired table cells detection model. If not set, the official model will be downloaded. | str | |
wireless_table_cells_detection_model_name | Model name for wireless table cells detection. If not set, the pipeline default model will be used. | str | |
wireless_table_cells_detection_model_dir | Directory path for wireless table cells detection model. If not set, the official model will be downloaded. | str | |
table_orientation_classify_model_name | Model name for table orientation classification. If not set, the pipeline default model will be used. | str | |
table_orientation_classify_model_dir | Directory path for table orientation classification model. If not set, the official model will be downloaded. | str | |
seal_text_detection_model_name | Model name for seal text detection. If not set, the pipeline default model will be used. | str | |
seal_text_detection_model_dir | Directory path for seal text detection model. If not set, the official model will be downloaded. | str | |
seal_det_limit_side_len | Image side length limit for seal text detection. Any integer greater than 0. If not set, the pipeline initialized value will be used, default initialized as 736. | int | |
seal_det_limit_type | Type of image side length limit for seal text detection. Supports min and max. min means ensuring the shortest side of the image is not less than seal_det_limit_side_len; max means ensuring the longest side is not greater than seal_det_limit_side_len. If not set, the pipeline initialized value will be used, default initialized as min. | str | |
seal_det_thresh | Detection pixel threshold. In the output probability map, pixels with score greater than this threshold are considered text pixels. Any float greater than 0. If not set, the pipeline initialized value 0.2 will be used by default. | float | |
seal_det_box_thresh | Detection box threshold. If the average score of all pixels within the detected bounding box is greater than this threshold, the result is considered a text region. Any float greater than 0. If not set, the pipeline initialized value 0.6 will be used by default. | float | |
seal_det_unclip_ratio | Expansion coefficient for seal text detection. This method is used to expand the text region; the larger the value, the larger the expansion area. Any float greater than 0. If not set, the pipeline initialized value 0.5 will be used by default. | float | |
seal_text_recognition_model_name | Model name for seal text recognition. If not set, the pipeline default model will be used. | str | |
seal_text_recognition_model_dir | Directory path for seal text recognition model. If not set, the official model will be downloaded. | str | |
seal_text_recognition_batch_size | Batch size for seal text recognition model. If not set, batch size defaults to 1. | int | |
seal_rec_score_thresh | Text recognition threshold. Text results with scores greater than this threshold will be kept. Any float greater than 0. If not set, the pipeline initialized value 0.0 will be used, meaning no threshold. | float | |
formula_recognition_model_name | Model name for formula recognition. If not set, the pipeline default model will be used. | str | |
formula_recognition_model_dir | Directory path for formula recognition model. If not set, the official model will be downloaded. | str | |
formula_recognition_batch_size | Batch size of the formula recognition model. If not set, the batch size defaults to 1. | int | |
use_doc_orientation_classify | Whether to load and use the document orientation classification module. If not set, the pipeline initialized value will be used, default is False. | bool | |
use_doc_unwarping | Whether to load and use the text image unwarping module. If not set, the pipeline initialized value will be used, default is False. | bool | |
use_textline_orientation | Whether to load and use the text line orientation classification module. If not set, the pipeline initialized value will be used, default is True. | bool | |
use_seal_recognition | Whether to load and use the seal text recognition sub-pipeline. If not set, the pipeline initialized value will be used, default is True. | bool | |
use_table_recognition | Whether to load and use the table recognition sub-pipeline. If not set, the pipeline initialized value will be used, default is True. | bool | |
use_formula_recognition | Whether to load and use the formula recognition sub-pipeline. If not set, the pipeline initialized value will be used, default is True. | bool | |
use_chart_recognition | Whether to load and use the chart parsing module. If not set, the pipeline initialized value will be used, default is False. | bool | |
use_region_detection | Whether to load and use the region detection module. If not set, the pipeline initialized value will be used, default is True. | bool | |
qianfan_api_key | API key for the Qianfan platform. | str | |
device | Device used for inference. Supports specifying exact card number: | str | |
enable_hpi | Whether to enable high-performance inference. | bool | False |
use_tensorrt | Whether to enable the TensorRT subgraph engine of Paddle Inference. If the model does not support acceleration by TensorRT, enabling this flag will not enable acceleration. For PaddlePaddle with CUDA 11.8, compatible TensorRT version is 8.x (x≥6), recommended TensorRT version is 8.6.1.6. | bool | False |
precision | Computation precision, e.g. fp32, fp16. | str | fp32 |
enable_mkldnn | Whether to enable MKL-DNN accelerated inference. If MKL-DNN is unavailable or the model does not support acceleration via MKL-DNN, enabling this flag will not enable acceleration. | bool | True |
mkldnn_cache_capacity | MKL-DNN cache capacity. | int | 10 |
cpu_threads | Number of threads used for inference on CPU. | int | 8 |
paddlex_config | Path to PaddleX pipeline configuration file. | str | |
The execution results will be printed to the terminal.
2.2 Integration via Python Script¶
The command line is a convenient way to try the pipeline quickly and view its results. In real projects, you will usually need to integrate it through code. You can download the test file and use the following sample code for inference:
from paddleocr import PPDocTranslation

# Create a translation pipeline
pipeline = PPDocTranslation()

# Document path
input_path = "document_sample.pdf"

# Output directory
output_path = "./output"

# Large model configuration
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

if input_path.lower().endswith(".md"):
    # Read markdown documents, supporting passing in directories and url links with the .md suffix
    ori_md_info_list = pipeline.load_from_markdown(input_path)
else:
    # Use PP-StructureV3 to perform layout parsing on PDF/image documents to obtain markdown information
    visual_predict_res = pipeline.visual_predict(
        input_path,
        use_doc_orientation_classify=False,
        use_doc_unwarping=False,
        use_common_ocr=True,
        use_seal_recognition=True,
        use_table_recognition=True,
    )

    ori_md_info_list = []
    for res in visual_predict_res:
        layout_parsing_result = res["layout_parsing_result"]
        ori_md_info_list.append(layout_parsing_result.markdown)
        layout_parsing_result.save_to_img(output_path)
        layout_parsing_result.save_to_markdown(output_path)

    # Concatenate the markdown information of multi-page documents into a single markdown file, and save the merged original markdown text
    if input_path.lower().endswith(".pdf"):
        ori_md_info = pipeline.concatenate_markdown_pages(ori_md_info_list)
        ori_md_info.save_to_markdown(output_path)

# Perform document translation (target language: English)
tgt_md_info_list = pipeline.translate(
    ori_md_info_list=ori_md_info_list,
    target_language="en",
    chunk_size=5000,
    chat_bot_config=chat_bot_config,
)

# Save the translation results
for tgt_md_info in tgt_md_info_list:
    tgt_md_info.save_to_markdown(output_path)
After executing the above code, you will obtain the parsed results of the original document to be translated, the Markdown file of the original text to be translated, and the Markdown file of the translated document, all saved in the output directory.
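The sample above places the API key directly in the script. As a minimal sketch, the same chat_bot_config dict can be built with the key read from an environment variable instead. The helper name build_chat_bot_config and the variable name QIANFAN_API_KEY are assumptions made for this illustration and are not part of PaddleOCR; the dict keys and values are the ones used in the sample code.

```python
import os


def build_chat_bot_config(env_var: str = "QIANFAN_API_KEY") -> dict:
    """Build the chat_bot_config dict with the API key taken from the environment.

    The helper and the environment variable name are hypothetical; only the
    dict layout comes from the sample code above.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {
        "module_name": "chat_bot",
        "model_name": "ernie-3.5-8k",
        "base_url": "https://qianfan.baidubce.com/v2",
        "api_type": "openai",
        "api_key": api_key,
    }
```

The returned dict can be passed as chat_bot_config in place of the hard-coded one, which keeps the key out of version control.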
The process, API description, and output description of PP-DocTranslation prediction are as follows:
(1) Instantiate the PP-DocTranslation pipeline object by calling PPDocTranslation. Relevant parameter descriptions are as follows:
Parameter | Description | Type | Default Value |
---|---|---|---|
layout_detection_model_name |
The model name for layout detection. If set to None , the pipeline's default model will be used. |
str|None |
None |
layout_detection_model_dir |
The directory path of the layout detection model. If set to None , the official model will be downloaded. |
str|None |
None |
layout_threshold |
Score threshold for the layout model.
|
float|dict|None |
None |
layout_nms |
Whether to use post-processing NMS for layout detection. If set to None , the pipeline's initialized value will be used, defaulting to True . |
bool|None |
None |
layout_unclip_ratio |
Expansion coefficient for detection boxes in the layout detection model.
|
float|Tuple[float,float]|dict|None |
None |
layout_merge_bboxes_mode |
Overlap box filtering method for layout detection.
|
str|dict|None |
None |
chart_recognition_model_name |
The model name for chart parsing. If set to None , the pipeline's default model will be used. |
str|None |
None |
chart_recognition_model_dir |
The directory path of the chart parsing model. If set to None , the official model will be downloaded. |
str|None |
None |
chart_recognition_batch_size |
Batch size for the chart parsing model. If set to None , batch size defaults to 1 . |
int|None |
None |
region_detection_model_name |
The model name for region detection. If set to None , the pipeline's default model will be used. |
str|None |
None |
region_detection_model_dir |
The directory path of the region detection model. If set to None , the official model will be downloaded. |
str|None |
None |
doc_orientation_classify_model_name |
The model name for document orientation classification. If set to None , the pipeline's default model will be used. |
str|None |
None |
doc_orientation_classify_model_dir |
The directory path of the document orientation classification model. If set to None , the official model will be downloaded. |
str|None |
None |
doc_unwarping_model_name |
The model name for text image unwarping. If set to None , the pipeline's default model will be used. |
str|None |
None |
doc_unwarping_model_dir |
The directory path of the text image unwarping model. If set to None , the official model will be downloaded. |
str|None |
None |
text_detection_model_name |
The model name for text detection. If set to None , the pipeline's default model will be used. |
str|None |
None |
text_detection_model_dir |
The directory path of the text detection model. If set to None , the official model will be downloaded. |
str|None |
None |
text_det_limit_side_len |
Image side length limit for text detection.
|
int|None |
None |
text_det_limit_type |
Type of image side length limit for text detection.
|
str|None |
None |
text_det_thresh |
Pixel threshold for detection; pixels in the output probability map with scores above this threshold are considered text pixels.
|
float|None |
None |
text_det_box_thresh |
Detection box threshold; when the average score of all pixels inside a detected box exceeds this threshold, it is considered a text region.
|
float|None |
None |
text_det_unclip_ratio |
Expansion coefficient for text detection; this method expands the text region, and the larger the value, the larger the expansion area.
|
float|None |
None |
textline_orientation_model_name |
The model name for text line orientation classification. If set to None , the pipeline's default model will be used. |
str|None |
None |
textline_orientation_model_dir |
The directory path of the text line orientation model. If set to None , the official model will be downloaded. |
str|None |
None |
textline_orientation_batch_size |
Batch size for the text line orientation model. If set to None , batch size defaults to 1 . |
int|None |
None |
text_recognition_model_name |
The model name for text recognition. If set to None , the pipeline's default model will be used. |
str|None |
None |
text_recognition_model_dir |
The directory path of the text recognition model. If set to None , the official model will be downloaded. |
str|None |
None |
text_recognition_batch_size |
Batch size for the text recognition model. If set to None , batch size defaults to 1 . |
int|None |
None |
text_rec_score_thresh |
Text recognition threshold; text results with scores greater than this threshold will be retained.
|
float|None |
None |
table_classification_model_name |
The model name for table classification. If set to None , the pipeline's default model will be used. |
str|None |
None |
table_classification_model_dir |
The directory path of the table classification model. If set to None , the official model will be downloaded. |
str|None |
None |
wired_table_structure_recognition_model_name |
The model name for wired table structure recognition. If set to None , the pipeline's default model will be used. |
str|None |
None |
wired_table_structure_recognition_model_dir |
The directory path of the wired table structure recognition model. If set to None , the official model will be downloaded. |
str|None |
None |
wireless_table_structure_recognition_model_name |
The model name for wireless table structure recognition. If set to None , the pipeline's default model will be used. |
str|None |
None |
wireless_table_structure_recognition_model_dir |
The directory path of the wireless table structure recognition model. If set to None , the official model will be downloaded. |
str|None |
None |
wired_table_cells_detection_model_name |
The model name for wired table cell detection. If set to None , the pipeline's default model will be used. |
str|None |
None |
wired_table_cells_detection_model_dir |
The directory path of the wired table cell detection model. If set to None , the official model will be downloaded. |
str|None |
None |
wireless_table_cells_detection_model_name |
The model name for wireless table cell detection. If set to None , the pipeline's default model will be used. |
str|None |
None |
wireless_table_cells_detection_model_dir |
The directory path of the wireless table cell detection model. If set to None , the official model will be downloaded. |
str|None |
None |
table_orientation_classify_model_name |
The model name for table orientation classification. If set to None , the pipeline's default model will be used. |
str|None |
None |
table_orientation_classify_model_dir |
The directory path of the table orientation classification model. If set to None , the official model will be downloaded. |
str|None |
None |
seal_text_detection_model_name |
The model name for seal text detection. If set to None , the pipeline's default model will be used. |
str|None |
None |
seal_text_detection_model_dir |
The directory path of the seal text detection model. If set to None , the official model will be downloaded. |
str|None |
None |
seal_det_limit_side_len |
Image side length limit for seal text detection.
|
int|None |
None |
seal_det_limit_type |
Type of image side length limit for seal text detection.
|
str|None |
None |
seal_det_thresh |
Detection pixel threshold. In the output probability map, pixels with scores above this threshold are considered text pixels.
|
float|None |
None |
seal_det_box_thresh |
Detection box threshold. When the average score of all pixels within the detected bounding box is greater than this threshold, the result is considered a text region.
|
float|None |
None |
seal_det_unclip_ratio |
Expansion coefficient for seal text detection. This method expands the text region; the larger the value, the larger the expansion area.
|
float|None |
None |
seal_text_recognition_model_name |
Name of the seal text recognition model. If set to None , the pipeline default model will be used. |
str|None |
None |
seal_text_recognition_model_dir |
Directory path for the seal text recognition model. If set to None , the official model will be downloaded. |
str|None |
None |
seal_text_recognition_batch_size |
Batch size for the seal text recognition model. If set to None , the batch size defaults to 1 . |
int|None |
None |
seal_rec_score_thresh |
Seal text recognition threshold. Text results with scores above this threshold will be retained.
|
float|None |
None |
formula_recognition_model_name |
Name of the formula recognition model. If set to None , the pipeline default model will be used. |
str|None |
None |
formula_recognition_model_dir |
Directory path for the formula recognition model. If set to None , the official model will be downloaded. |
str|None |
None |
formula_recognition_batch_size |
Batch size for the formula recognition model. If set to None , the batch size defaults to 1 . |
int|None |
None |
use_doc_orientation_classify |
Whether to load and use the document orientation classification module. If set to None , the pipeline initialized parameter value will be used, defaulting to False . |
bool|None |
None |
use_doc_unwarping |
Whether to load and use the text image unwarping module. If set to None , the pipeline initialized parameter value will be used, defaulting to False . |
bool|None |
None |
use_textline_orientation |
Whether to load and use the text line orientation classification module. If set to None , the pipeline initialized parameter value will be used, defaulting to True . |
bool|None |
None |
use_seal_recognition |
Whether to load and use the seal text recognition sub-pipeline. If set to None , the pipeline initialized parameter value will be used, defaulting to True . |
bool|None |
None |
use_table_recognition |
Whether to load and use the table recognition sub-pipeline. If set to None , the pipeline initialized parameter value will be used, defaulting to True . |
bool|None |
None |
use_formula_recognition |
Whether to load and use the formula recognition sub-pipeline. If set to None , the pipeline initialized parameter value will be used, defaulting to True . |
bool|None |
None |
use_chart_recognition |
Whether to load and use the chart parsing module. If set to None , the pipeline initialized parameter value will be used, defaulting to False . |
bool|None |
None |
use_region_detection |
Whether to load and use the document region detection module. If set to None , the pipeline initialized parameter value will be used, defaulting to True . |
bool|None |
None |
chat_bot_config |
Large language model configuration information. The configuration content is the following dict:
|
dict|None |
None |
device |
Device used for inference. Supports specifying a specific card number:
|
str|None |
None |
enable_hpi |
Whether to enable high-performance inference. | bool |
False |
use_tensorrt |
Whether to enable Paddle Inference’s TensorRT subgraph engine. If the model does not support acceleration via TensorRT, enabling this flag will have no effect. For Paddle with CUDA 11.8, the compatible TensorRT version is 8.x (x≥6), recommended installation is TensorRT 8.6.1.6. |
bool |
False |
precision |
Computation precision, such as fp32, fp16. | str |
"fp32" |
enable_mkldnn |
Whether to enable MKL-DNN accelerated inference. If MKL-DNN is unavailable or the model does not support acceleration via MKL-DNN, enabling this flag will have no effect. | bool |
True |
mkldnn_cache_capacity |
MKL-DNN cache capacity. | int |
10 |
cpu_threads |
Number of threads used during inference on CPU. | int |
8 |
paddlex_config |
Path to the PaddleX pipeline configuration file. | str|None |
None |
(2) Call the visual_predict() method of the PP-DocTranslation pipeline object to obtain visual prediction results. This method returns a list of results. Additionally, the pipeline provides a visual_predict_iter() method. Both methods accept the same parameters and return the same results, but visual_predict_iter() returns a generator, which can process and retrieve prediction results step-by-step, suitable for large datasets or memory-saving scenarios. You can choose either method according to your actual needs. Below are the parameters of the visual_predict() method and their descriptions:
Parameter | Description | Type | Default |
---|---|---|---|
input |
Data to be predicted, supports multiple input types, required.
|
Python Var|str|list |
|
use_doc_orientation_classify |
Whether to use the document orientation classification module during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
bool|None |
None |
use_doc_unwarping |
Whether to use the text image unwarping module during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
bool|None |
None |
use_textline_orientation |
Whether to use the text line orientation classification module during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
bool|None |
None |
use_seal_recognition |
Whether to use the seal text recognition sub-pipeline during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
bool|None |
None |
use_table_recognition |
Whether to use the table recognition sub-pipeline during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
bool|None |
None |
use_formula_recognition |
Whether to use the formula recognition sub-pipeline during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
bool|None |
None |
use_chart_recognition |
Whether to use the chart parsing module during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
bool|None |
None |
use_region_detection |
Whether to use the document region detection module during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
bool|None |
None |
layout_threshold |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
float|dict|None |
None |
layout_nms |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
bool|None |
None |
layout_unclip_ratio |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
float|Tuple[float,float]|dict|None |
None |
layout_merge_bboxes_mode |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
str|dict|None |
None |
text_det_limit_side_len |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
int|None |
None |
text_det_limit_type |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
str|None |
None |
text_det_thresh |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
float|None |
None |
text_det_box_thresh |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
float|None |
None |
text_det_unclip_ratio |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
float|None |
None |
text_rec_score_thresh |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
float|None |
None |
seal_det_limit_side_len |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
int|None |
None |
seal_det_limit_type |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
str|None |
None |
seal_det_thresh |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
float|None |
None |
seal_det_box_thresh |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
float|None |
None |
seal_det_unclip_ratio |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
float|None |
None |
seal_rec_score_thresh |
Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. |
float|None |
None |
use_wired_table_cells_trans_to_html |
Whether to enable direct conversion of wired table cell detection results to HTML. When enabled, HTML is constructed directly based on the geometric relations of wired table cell detection results. | bool |
False |
use_wireless_table_cells_trans_to_html |
Whether to enable direct conversion of wireless table cell detection results to HTML. When enabled, HTML is constructed directly based on the geometric relations of wireless table cell detection results. | bool |
False |
use_table_orientation_classify |
Whether to enable table orientation classification. When enabled, tables with 90/180/270 degree rotations in images can be corrected in orientation and correctly recognized. | bool |
True |
use_ocr_results_with_table_cells |
Whether to enable OCR segmentation by table cells. When enabled, OCR detection results are segmented and re-recognized based on cell prediction results to avoid missing text. | bool |
True |
use_e2e_wired_table_rec_model |
Whether to enable end-to-end wired table recognition mode. When enabled, the cell detection model is not used, only the table structure recognition model is used. | bool |
False |
use_e2e_wireless_table_rec_model |
Whether to enable end-to-end wireless table recognition mode. When enabled, the cell detection model is not used, only the table structure recognition model is used. | bool |
True |
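The tables above describe a precedence rule: a value passed to visual_predict() takes priority, None falls back to the value given at instantiation, and None there falls back to the pipeline default. The following is a minimal sketch of that rule, not the pipeline's actual implementation; resolve_setting is a hypothetical helper.

```python
from typing import Any, Optional


def resolve_setting(call_value: Optional[Any], init_value: Optional[Any], default: Any) -> Any:
    """Return the first value that is not None: call-time, then init-time, then default."""
    if call_value is not None:
        return call_value
    if init_value is not None:
        return init_value
    return default


# use_doc_unwarping has a documented pipeline default of False.
neither_set = resolve_setting(None, None, False)       # falls back to the default
init_only = resolve_setting(None, True, False)         # instantiation value is used
call_overrides = resolve_setting(False, True, False)   # call-time value wins
```

The comparison is against None rather than truthiness, so an explicit False at call time still overrides a True given at instantiation.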
(3) Processing visual prediction results: Each sample's prediction result is a corresponding Result object, supporting operations such as printing, saving as images, and saving as json files:
Method | Description | Parameter | Parameter Type | Parameter Description | Default |
---|---|---|---|---|---|
print() | Print results to terminal | format_json | bool | Whether to format the output content using JSON indentation | True |
 | | indent | int | Specify indentation level to beautify output JSON data for better readability, effective only when format_json is True | 4 |
 | | ensure_ascii | bool | Control whether non-ASCII characters are escaped as Unicode. When set to True, all non-ASCII characters will be escaped; if False, original characters are preserved. Effective only when format_json is True | False |
save_to_json() | Save results as a JSON file | save_path | str | File path for saving. If a directory is specified, the saved file name matches the input file type name | None |
 | | indent | int | Specify indentation level to beautify output JSON data for better readability, effective only when format_json is True | 4 |
 | | ensure_ascii | bool | Control whether non-ASCII characters are escaped as Unicode. When set to True, all non-ASCII characters will be escaped; if False, original characters are preserved. Effective only when format_json is True | False |
save_to_img() | Save visualized images from intermediate modules as PNG format images | save_path | str | File path for saving, supports directory or file path | None |
save_to_markdown() | Save each page of image or PDF files as separate markdown files | save_path | str | File path for saving, supports directory or file path | None |
save_to_html() | Save tables in the file as HTML format files | save_path | str | File path for saving, supports directory or file path | None |
save_to_xlsx() | Save tables in the file as XLSX format files | save_path | str | File path for saving, supports directory or file path | None |
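The indent and ensure_ascii parameters described above have the same names and meanings as the arguments of Python's standard json.dumps. The sketch below uses json.dumps directly to show their effect; it assumes, without confirmation from the source, that the Result methods behave analogously. The sample dict and its keys are made up for the illustration.

```python
import json

# Made-up sample data containing non-ASCII text.
sample = {"rec_text": "文档翻译", "score": 0.98}

# ensure_ascii=True escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(sample, indent=4, ensure_ascii=True)

# ensure_ascii=False keeps the original characters.
preserved = json.dumps(sample, indent=4, ensure_ascii=False)

# Without indent, the output is a single compact line.
compact = json.dumps(sample, ensure_ascii=False)

print(escaped)
print(preserved)
print(compact)
```

For documents in Chinese or other non-Latin scripts, ensure_ascii=False usually gives the more readable saved file.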
(4) Call the translate() method to perform document translation. This method returns the original and translated markdown content as a markdown object, which can be saved locally by executing the save_to_markdown() method for the desired parts. Below are the relevant parameters of the translate() method:
Parameter | Description | Type | Default |
---|---|---|---|
ori_md_info_list |
List of original Markdown data containing content to be translated. Must be a list of dictionaries, each representing a document block | List[Dict] |
|
target_language |
Target language (ISO 639-1 language code, e.g. "en" /"ja" /"fr" ) |
str |
"zh" |
chunk_size |
Character count threshold for chunked translation processing | int |
5000 |
task_description |
Custom task description prompt | str|None |
None |
output_format |
Specified output format requirements, e.g. "preserve original Markdown structure" | str|None |
None |
rules_str |
Custom translation rule description | str|None |
None |
few_shot_demo_text_content |
Few-shot learning example text content | str|None |
None |
few_shot_demo_key_value_list |
Structured few-shot example data in key-value pairs, can include professional terminology glossary | str|None |
None |
glossary |
Professional terminology glossary for translation | dict|None |
None |
llm_request_interval |
Interval in seconds between requests to the large language model. This parameter helps prevent too frequent calls to the LLM. | float |
0.0 |
chat_bot_config |
Large language model configuration. Setting to None uses instantiation parameters; otherwise, this parameter takes priority. |
dict|None |
None |
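To make chunk_size and llm_request_interval more concrete, the sketch below packs paragraphs into chunks that stay under a character threshold and pauses between requests. It is an illustration under stated assumptions, not the splitting logic PP-DocTranslation actually uses: split_into_chunks and translate_chunks are hypothetical helpers, splitting on blank lines is an assumed strategy, and translate_fn stands in for the LLM call.

```python
import time
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int = 5000) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > chunk_size:
            # Adding this paragraph would exceed the threshold: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_chunks(
    chunks: List[str],
    translate_fn: Callable[[str], str],
    request_interval: float = 0.0,
) -> List[str]:
    """Call translate_fn on each chunk, sleeping request_interval seconds between calls."""
    results: List[str] = []
    for index, chunk in enumerate(chunks):
        if index > 0 and request_interval > 0:
            time.sleep(request_interval)
        results.append(translate_fn(chunk))
    return results
```

A smaller chunk_size means more requests with less context each, while a larger one risks exceeding the model's context window. A non-zero interval trades throughput for a lower chance of hitting rate limits.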
3. Development Integration/Deployment¶
If the pipeline can meet your requirements for inference speed and accuracy, you can proceed directly with development integration/deployment.
If you need to directly apply the pipeline in your Python project, you can refer to the sample code in 2.2 Integration via Python Script.
In addition, PaddleOCR also offers two other deployment methods, detailed as follows:
🚀 High-Performance Inference: In real-world production environments, many applications have stringent performance criteria (especially response speed) for deployment strategies to ensure efficient system operation and a smooth user experience. To this end, PaddleOCR provides high-performance inference capabilities, aiming to deeply optimize model inference and pre/post-processing, achieving significant acceleration in the end-to-end process. For detailed information on the high-performance inference process, please refer to High-Performance Inference.
☁️ Serving: Serving is a common deployment form in real-world production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. For detailed information on the pipeline serving process, please refer to Serving.
Below are the API references for basic serving and examples of multi-language service invocation:
API Reference
For the main operations provided by the service:
- The HTTP request method is POST.
- Both the request body and the response body are JSON data (JSON objects).
- When the request is processed successfully, the response status code is 200, and the response body has the following properties:
Name | Type | Meaning |
---|---|---|
logId |
string |
Request UUID. |
errorCode |
integer |
Error code. Fixed as 0 . |
errorMsg |
string |
Error message. Fixed as "Success" . |
result |
object |
Operation result. |
- When the request is not successful, the response body has the following properties:
Name | Type | Meaning |
---|---|---|
logId |
string |
Request UUID. |
errorCode |
integer |
Error code. Same as response status code. |
errorMsg |
string |
Error message. |
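The two response shapes above can be handled with one small client-side helper. The sketch below assumes the response body is available as a JSON string. ServingError and unwrap_response are hypothetical names for this illustration; only the field names logId, errorCode, errorMsg, and result come from the tables above.

```python
import json
from typing import Any, Dict


class ServingError(RuntimeError):
    """Raised when the response envelope reports a failure (hypothetical helper)."""

    def __init__(self, log_id: Any, error_code: Any, error_msg: Any) -> None:
        super().__init__(f"[{log_id}] error {error_code}: {error_msg}")
        self.log_id = log_id
        self.error_code = error_code
        self.error_msg = error_msg


def unwrap_response(body_text: str) -> Dict[str, Any]:
    """Return the result object on success, or raise ServingError otherwise."""
    body = json.loads(body_text)
    if body.get("errorCode") != 0:
        raise ServingError(body.get("logId"), body.get("errorCode"), body.get("errorMsg"))
    return body["result"]
```

Keeping logId on the exception makes it easier to correlate a failed request with server-side logs.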
The main operations provided by the service are as follows:
analyzeImages
Use computer vision models to analyze images, obtaining OCR, table recognition results, etc.
POST /doctrans-visual
- Request body properties are as follows:
Name | Type | Meaning | Required |
---|---|---|---|
file |
string |
URL of image or PDF file accessible by the server, or Base64 encoding of such file contents. By default, for PDF files over 10 pages, only the first 10 pages are processed. To remove the page limit, add the following configuration in the pipeline config file:
|
Yes |
fileType |
integer | null |
File type. 0 means PDF, 1 means image file. If not present in the request, the file type will be inferred from the URL. |
No |
useDocOrientationClassify |
boolean | null |
See the use_doc_orientation_classify parameter description in the pipeline object's visual_predict method. |
No |
useDocUnwarping |
boolean | null |
See the use_doc_unwarping parameter description in the pipeline object's visual_predict method. |
No |
useTextlineOrientation |
boolean | null |
See the use_textline_orientation parameter description in the pipeline object's visual_predict method. |
No |
useSealRecognition |
boolean | null |
See the use_seal_recognition parameter description in the pipeline object's visual_predict method. |
No |
useTableRecognition |
boolean | null |
See the use_table_recognition parameter description in the pipeline object's visual_predict method. |
No |
useFormulaRecognition |
boolean | null |
See the use_formula_recognition parameter description in the pipeline object's visual_predict method. |
No |
useChartRecognition |
boolean | null |
See the use_chart_recognition parameter description in the pipeline object's visual_predict method. |
No |
useRegionDetection |
boolean | null |
See the use_region_detection parameter description in the pipeline object's visual_predict method. |
No |
layoutThreshold |
number | object | null |
See the layout_threshold parameter description in the pipeline object's visual_predict method. |
No |
layoutNms |
boolean | null |
See the layout_nms parameter description in the pipeline object's visual_predict method. |
No |
layoutUnclipRatio |
number | array | object | null |
See the layout_unclip_ratio parameter description in the pipeline object's visual_predict method. |
No |
layoutMergeBboxesMode |
string | object | null |
See the layout_merge_bboxes_mode parameter description in the pipeline object's visual_predict method. |
No |
textDetLimitSideLen |
integer | null |
See the text_det_limit_side_len parameter description in the pipeline object's visual_predict method. |
No |
textDetLimitType |
string | null |
See the text_det_limit_type parameter description in the pipeline object's visual_predict method. |
No |
textDetThresh |
number | null |
See the text_det_thresh parameter description in the pipeline object's visual_predict method. |
No |
textDetBoxThresh |
number | null |
See the text_det_box_thresh parameter description in the pipeline object's visual_predict method. |
No |
textDetUnclipRatio |
number | null |
See the text_det_unclip_ratio parameter description in the pipeline object's visual_predict method. |
No |
textRecScoreThresh |
number | null |
See the text_rec_score_thresh parameter description in the pipeline object's visual_predict method. |
No |
sealDetLimitSideLen |
integer | null |
See the seal_det_limit_side_len parameter description in the pipeline object's visual_predict method. |
No |
sealDetLimitType |
string | null |
See the seal_det_limit_type parameter description in the pipeline object's visual_predict method. |
No |
sealDetThresh |
number | null |
See the seal_det_thresh parameter description in the pipeline object's visual_predict method. |
No |
sealDetBoxThresh |
number | null |
See the seal_det_box_thresh parameter description in the pipeline object's visual_predict method. |
No |
sealDetUnclipRatio |
number | null |
See the seal_det_unclip_ratio parameter description in the pipeline object's visual_predict method. |
No |
sealRecScoreThresh |
number | null |
See the seal_rec_score_thresh parameter description in the pipeline object's visual_predict method. |
No |
useWiredTableCellsTransToHtml |
boolean |
See the use_wired_table_cells_trans_to_html parameter description in the pipeline object's visual_predict method. |
No |
useWirelessTableCellsTransToHtml |
boolean |
See the use_wireless_table_cells_trans_to_html parameter description in the pipeline object's visual_predict method. |
No |
useTableOrientationClassify |
boolean |
See the use_table_orientation_classify parameter description in the pipeline object's visual_predict method. |
No |
useOcrResultsWithTableCells |
boolean |
See the use_ocr_results_with_table_cells parameter description in the pipeline object's visual_predict method. |
No |
useE2eWiredTableRecModel |
boolean |
See the use_e2e_wired_table_rec_model parameter description in the pipeline object's visual_predict method. |
No |
useE2eWirelessTableRecModel |
boolean |
See the use_e2e_wireless_table_rec_model parameter description in the pipeline object's visual_predict method. |
No |
visualize |
boolean | null |
Whether to return visualization result images and intermediate images produced during processing.
For example, a setting in the pipeline config file can disable image return by default, and the visualize parameter in the request body overrides that default.
If neither the request body nor the config file sets it (or the request body passes null and the config file does not set it), images are returned by default.
|
No |
- When the request is processed successfully, the response body's result has the following properties:
Name | Type | Meaning |
---|---|---|
layoutParsingResults |
array |
Layout parsing results. The array length is 1 (for image input) or equals the actual number of processed pages (for PDF input). For PDF input, each element corresponds to the result of each processed page in order. |
dataInfo |
object |
Input data information. |
Each element in layoutParsingResults is an object with the following properties:
Name | Type | Meaning |
---|---|---|
prunedResult |
object |
Simplified version of the res field in the JSON representation of the layout_parsing_result generated by the pipeline object's visual_predict method, with input_path and page_index fields removed. |
markdown |
object |
Markdown result. |
outputImages |
object | null |
See the img property description in the pipeline prediction results. Images are in JPEG format and Base64 encoded. |
inputImage |
string | null |
Input image. JPEG format, Base64 encoded. |
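Because outputImages and inputImage may be null, client code should not assume they are present. The sketch below restates the visualize precedence described in the request parameters (request body first, then the pipeline config file, then the default of returning images) and decodes outputImages defensively. The helper names and the sample image key are hypothetical, chosen only for illustration; this is not the service's own implementation.

```python
import base64


def resolve_visualize(request_value, config_value=None):
    """Illustrative precedence: request body, then config file, then default (return images)."""
    if request_value is not None:
        return request_value
    if config_value is not None:
        return config_value
    return True


def decode_output_images(page_result):
    """Decode Base64 JPEG images from one layoutParsingResults element.

    Returns an empty dict when outputImages is null or missing.
    """
    images = page_result.get("outputImages") or {}
    return {name: base64.b64decode(data) for name, data in images.items()}


# Hypothetical page result with images disabled
print(resolve_visualize(None, config_value=False))
print(decode_output_images({"outputImages": None}))
```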
markdown is an object with the following properties:
Name | Type | Meaning |
---|---|---|
text |
string |
Markdown text. |
images |
object |
Key-value pairs of Markdown image relative paths and Base64 encoded images. |
isStart |
boolean |
Whether the first element on the current page is the start of a paragraph. |
isEnd |
boolean |
Whether the last element on the current page is the end of a paragraph. |
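One way a client might use isStart and isEnd is to decide whether a paragraph continues across a page break when stitching per-page Markdown into a single document. The following is a minimal sketch under that assumption; join_markdown_pages is a hypothetical helper, and joining with a single space is a simplification that suits space-delimited languages.

```python
def join_markdown_pages(pages):
    """Concatenate per-page Markdown text.

    A page is appended to the previous paragraph only when the previous page
    did not end a paragraph and the current page does not start one.
    """
    merged = ""
    prev_ended = True  # treat the beginning of the document as a paragraph boundary
    for page in pages:
        text = page["text"]
        if not merged:
            merged = text
        elif not prev_ended and not page["isStart"]:
            # Paragraph continues across the page break
            merged = merged.rstrip() + " " + text.lstrip()
        else:
            # New paragraph: separate with a blank line
            merged = merged.rstrip() + "\n\n" + text.lstrip()
        prev_ended = page["isEnd"]
    return merged


pages = [
    {"text": "A paragraph that runs", "isStart": True, "isEnd": False},
    {"text": "onto the next page.", "isStart": False, "isEnd": True},
    {"text": "# New section", "isStart": True, "isEnd": True},
]
print(join_markdown_pages(pages))
```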
translate
Use a large model to translate documents.
POST /doctrans-translate
- Request body properties are as follows:
Name | Type | Meaning | Required |
---|---|---|---|
markdownList |
array |
List of Markdown objects to be translated. Can be obtained from the results of the analyzeImages operation. The images property of each object is not used. |
Yes |
targetLanguage |
string |
Please refer to the target_language parameter description in the translate method of the pipeline object. |
No |
chunkSize |
integer |
Please refer to the chunk_size parameter description in the translate method of the pipeline object. |
No |
taskDescription |
string | null |
Please refer to the task_description parameter description in the translate method of the pipeline object. |
No |
outputFormat |
string | null |
Please refer to the output_format parameter description in the translate method of the pipeline object. |
No |
rulesStr |
string | null |
Please refer to the rules_str parameter description in the translate method of the pipeline object. |
No |
fewShotDemoTextContent |
string | null |
Please refer to the few_shot_demo_text_content parameter description in the translate method of the pipeline object. |
No |
fewShotDemoKeyValueList |
string | null |
Please refer to the few_shot_demo_key_value_list parameter description in the translate method of the pipeline object. |
No |
glossary |
object | null |
Please refer to the glossary parameter description in the translate method of the pipeline object. |
No |
llmRequestInterval |
number | null |
Please refer to the llm_request_interval parameter description in the translate method of the pipeline object. |
No |
chatBotConfig |
object | null |
Please refer to the chat_bot_config parameter description in the translate method of the pipeline object. |
No |
- When the request is processed successfully, the response body's result has the following properties:
Name | Type | Meaning |
---|---|---|
translationResults |
array |
Translation results. |
Each element in translationResults is an object with the following properties:
Name | Type | Meaning |
---|---|---|
language |
string |
Target language. |
markdown |
object |
Markdown result. Object definition is consistent with the markdown returned by the analyzeImages operation. |
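The request properties above are the camelCase counterparts of the snake_case parameters of the translate method. As a rough illustration, the sketch below builds a request body from snake_case keyword arguments, omits options left as None, and drops the unused images property. Both helper names are hypothetical and are not part of PaddleOCR.

```python
def snake_to_camel(name):
    """Convert a snake_case parameter name to its camelCase request property name."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def build_translate_payload(markdown_list, **options):
    """Build a request body for POST /doctrans-translate.

    options use the snake_case names of the translate method parameters;
    options whose value is None are omitted from the body.
    """
    pages = []
    for markdown in markdown_list:
        # The images property is not used by the translate operation, so drop it
        pages.append({key: value for key, value in markdown.items() if key != "images"})
    payload = {"markdownList": pages}
    for name, value in options.items():
        if value is not None:
            payload[snake_to_camel(name)] = value
    return payload


payload = build_translate_payload(
    [{"text": "# Title", "images": {}, "isStart": True, "isEnd": True}],
    target_language="en",
    chunk_size=None,
    llm_request_interval=0.5,
)
print(payload)
```

Omitting unset options rather than sending null keeps the body valid for properties such as chunkSize, whose listed type does not include null.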
Examples of multi-language service invocation
Python
import base64
import pathlib
import pprint
import sys

import requests

API_BASE_URL = "http://127.0.0.1:8080"

file_path = "./demo.jpg"
target_language = "en"

# Read the input file and Base64-encode it for the JSON request body
with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

payload = {
    "file": file_data,
    "fileType": 1,
}

# Step 1: layout parsing
resp_visual = requests.post(url=f"{API_BASE_URL}/doctrans-visual", json=payload)
if resp_visual.status_code != 200:
    print(
        f"Request to doctrans-visual failed with status code {resp_visual.status_code}."
    )
    pprint.pprint(resp_visual.json())
    sys.exit(1)
result_visual = resp_visual.json()["result"]

markdown_list = []
for i, res in enumerate(result_visual["layoutParsingResults"]):
    # Save the Markdown text and its images for each page
    md_dir = pathlib.Path(f"markdown_{i}")
    md_dir.mkdir(exist_ok=True)
    (md_dir / "doc.md").write_text(res["markdown"]["text"])
    for img_path, img in res["markdown"]["images"].items():
        img_path = md_dir / img_path
        img_path.parent.mkdir(parents=True, exist_ok=True)
        img_path.write_bytes(base64.b64decode(img))
    print(f"The Markdown document to be translated is saved at {md_dir / 'doc.md'}")
    # The translate operation does not use the images property
    del res["markdown"]["images"]
    markdown_list.append(res["markdown"])
    # outputImages may be null when image return is disabled
    for img_name, img in (res["outputImages"] or {}).items():
        img_path = f"{img_name}_{i}.jpg"
        with open(img_path, "wb") as f:
            f.write(base64.b64decode(img))
        print(f"Output image saved at {img_path}")

# Step 2: translation
payload = {
    "markdownList": markdown_list,
    "targetLanguage": target_language,
}
resp_translate = requests.post(url=f"{API_BASE_URL}/doctrans-translate", json=payload)
if resp_translate.status_code != 200:
    print(
        f"Request to doctrans-translate failed with status code {resp_translate.status_code}."
    )
    pprint.pprint(resp_translate.json())
    sys.exit(1)
result_translate = resp_translate.json()["result"]

for i, res in enumerate(result_translate["translationResults"]):
    md_dir = pathlib.Path(f"markdown_{i}")
    (md_dir / "doc_translated.md").write_text(res["markdown"]["text"])
    print(f"Translated markdown document saved at {md_dir / 'doc_translated.md'}")
4. Secondary Development¶
If the default model weights provided by the PP-DocTranslation pipeline do not meet your accuracy or speed requirements in your scenario, you can use your own data from specific domains or application scenarios to further fine-tune the existing models and improve recognition results in your scenario.
4.1 Model Fine-tuning¶
Since the PP-DocTranslation pipeline contains several modules, if the performance of the model pipeline does not meet expectations, the issue may originate from any one of these modules. You can analyze cases with poor extraction results, use visualized images to determine which module has the problem, and refer to the corresponding fine-tuning tutorial links in the following table to fine-tune the model.
Scenario | Fine-tuning module | Fine-tuning reference link |
---|---|---|
Inaccurate detection of layout areas, such as failure to detect seals and tables | Layout detection module | Link |
Inaccurate recognition of table structures | Table structure recognition module | Link |
Inaccurate recognition of formulas | Formula recognition module | Link |
Omission in detecting seal texts | Seal text detection module | Link |
Omission in detecting texts | Text detection module | Link |
Inaccurate text content | Text recognition module | Link |
Inaccurate correction of vertical or rotated text lines | Text line orientation classification module | Link |
Inaccurate correction of whole image rotation | Document image orientation classification module | Link |
Inaccurate correction of image distortion | Text image unwarping module | Fine-tuning is temporarily not supported |
4.2 Model Application¶
After completing fine-tuning training with your private dataset, you can obtain a local model weight file. Then, you can use the fine-tuned model weights by customizing the pipeline configuration file.
- Obtain the pipeline configuration file
You can call the export_paddlex_config_to_yaml method of the PP-DocTranslation pipeline object in PaddleOCR to export the current pipeline configuration to a YAML file:
from paddleocr import PPDocTranslation
pipeline = PPDocTranslation()
pipeline.export_paddlex_config_to_yaml("PP-DocTranslation.yaml")
- Modify the configuration file
After obtaining the default pipeline configuration file, fill in the local path of the fine-tuned model weights at the corresponding location in the pipeline configuration file. For example:
......
SubModules:
  TextDetection:
    module_name: text_detection
    model_name: PP-OCRv5_server_det
    model_dir: null # Replace with the path to the weights of the fine-tuned text detection model
    limit_side_len: 960
    limit_type: max
    thresh: 0.3
    box_thresh: 0.6
    unclip_ratio: 1.5
  TextRecognition:
    module_name: text_recognition
    model_name: PP-OCRv5_server_rec
    model_dir: null # Replace with the path to the weights of the fine-tuned text recognition model
    batch_size: 1
    score_thresh: 0
......
The pipeline configuration file not only includes parameters supported by PaddleOCR CLI and Python API but also allows for more advanced configurations. Detailed information can be found in the corresponding pipeline usage tutorial in the Overview of PaddleX Model Pipeline Usage. Refer to the detailed instructions therein and adjust the configurations according to your needs.
- Load the pipeline configuration file in CLI
After modifying the configuration file, specify the path to the modified pipeline configuration file using the --paddlex_config parameter in the command line. PaddleOCR will then read its contents as the pipeline configuration. Here is an example:
- Load the pipeline configuration file in the Python API
When initializing the pipeline object, you can pass the path of the PaddleX pipeline configuration file or a configuration dict through the paddlex_config parameter, and PaddleOCR will read its content as the pipeline configuration. The example is as follows:
from paddleocr import PPDocTranslation
pipeline = PPDocTranslation(paddlex_config="PP-DocTranslation.yaml")
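Since paddlex_config also accepts a configuration dict, the model_dir override can be applied in code instead of editing the YAML by hand. The sketch below is one possible approach, not part of PaddleOCR: set_model_dir is a hypothetical helper, the sample dict only mirrors the fragment shown earlier, and the weights path is a placeholder.

```python
import copy


def set_model_dir(config, module_key, model_dir):
    """Return a copy of config with model_dir set on every sub-module named module_key."""
    updated = copy.deepcopy(config)
    hits = 0
    stack = [updated]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            for key, value in node.items():
                if key == module_key and isinstance(value, dict) and "model_dir" in value:
                    value["model_dir"] = model_dir
                    hits += 1
                stack.append(value)  # keep searching nested sections
        elif isinstance(node, list):
            stack.extend(node)
    if hits == 0:
        raise KeyError(f"No sub-module named {module_key!r} with a model_dir entry")
    return updated


# Sample dict mirroring the configuration fragment above (not a full config)
config = {
    "SubModules": {
        "TextDetection": {"model_name": "PP-OCRv5_server_det", "model_dir": None},
        "TextRecognition": {"model_name": "PP-OCRv5_server_rec", "model_dir": None},
    }
}
# "./finetuned/text_det" is a placeholder path
custom = set_model_dir(config, "TextDetection", "./finetuned/text_det")
print(custom["SubModules"]["TextDetection"])
```

In practice the dict would come from loading the exported YAML file (for example with a YAML parser such as PyYAML), and the returned dict could then be passed as paddlex_config when constructing the pipeline.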