PP-DocTranslation Pipeline Usage Tutorial
1. Introduction to PP-DocTranslation Pipeline
PP-DocTranslation is an intelligent document translation solution provided by PaddlePaddle. It combines general layout analysis with large language model (LLM) capabilities to translate documents efficiently. The solution identifies and extracts the elements of a document, including text blocks, headings, paragraphs, images, tables, and other complex layout structures, and translates them into multiple languages on that basis. PP-DocTranslation supports translation between multiple mainstream languages and is particularly suited to documents with complex layouts and strong contextual dependencies, aiming for precise, natural, fluent, and professional results. The pipeline also provides flexible serving options and can be called from multiple programming languages on a variety of hardware. It supports secondary development as well: you can train and fine-tune models on your own datasets, and the trained models can be integrated back into the pipeline seamlessly.
The PP-DocTranslation pipeline uses PP-StructureV3 as a sub-pipeline and therefore provides all the functions of the PP-StructureV3 pipeline. For more information on those functions and their usage, see the PP-StructureV3 Pipeline Documentation.
In this pipeline, you can select the model to use based on the benchmark data below.
Details of model list
Document image orientation classification module:
Model | Model download link | Top-1 Acc (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
PP-LCNet_x1_0_doc_ori | Inference model/Training model | 99.06 | 2.62 / 0.59 | 3.24 / 1.19 | 7 | A document image classification model based on PP-LCNet_x1_0, with four categories: 0 degrees, 90 degrees, 180 degrees, and 270 degrees |
Text image unwarping module:
Model | Model download link | CER | Model storage size (M) | Introduction |
---|---|---|---|---|
UVDoc | Inference model/Training model | 0.179 | 30.3 M | A high-precision text image unwarping model |
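The CER column above is not defined in this document. As a minimal sketch, assuming it refers to the standard character error rate (edit distance between recognized and reference text, divided by the reference length), the metric can be computed as follows. The function names are illustrative and are not part of PaddleOCR.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against a non-empty reference."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("kitten", "sitting"))  # 3 edits / 6 reference characters = 0.5
```

A lower value is better: 0 means the recognized text matches the reference exactly.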
Layout region detection module model:
Model | Model download link | mAP(0.5) (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
PP-DocLayout_plus-L | Inference model/Training model | 83.2 | 53.03 / 17.23 | 634.62 / 378.32 | 126.01 M | A higher-precision layout region localization model trained on a self-built dataset based on RT-DETR-L, covering scenarios such as Chinese and English papers, multi-column magazines, newspapers, PPTs, contracts, books, examination papers, research reports, ancient books, Japanese documents, and documents with vertical text. |
PP-DocLayout-L | Inference model/Training model | 90.4 | 33.59 / 33.59 | 503.01 / 251.08 | 123.76 M | A high-precision layout region localization model trained on a self-built dataset based on RT-DETR-L, covering scenarios such as Chinese and English papers, magazines, contracts, books, examination papers, and research reports. |
PP-DocLayout-M | Inference model/Training model | 75.2 | 13.03 / 4.72 | 43.39 / 24.44 | 22.578 | A layout region localization model with balanced precision and efficiency trained on a self-built dataset based on PicoDet-L, covering scenarios such as Chinese and English papers, magazines, contracts, books, examination papers, and research reports. |
PP-DocLayout-S | Inference model/Training model | 70.9 | 11.54 / 3.86 | 18.53 / 6.29 | 4.834 | A highly efficient layout region localization model trained on a self-built dataset based on PicoDet-S, covering scenarios such as Chinese and English papers, magazines, contracts, books, examination papers, and research reports. |
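The mAP(0.5) column is read here as mean average precision at an IoU threshold of 0.5, which is an assumption since the document does not spell it out. Under that reading, a predicted layout box counts as a hit only when its intersection over union with a ground-truth box reaches 0.5. The sketch below shows that matching test for axis-aligned boxes given as (x1, y1, x2, y2); the helper names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """True when the prediction overlaps the ground truth enough to count."""
    return iou(pred_box, gt_box) >= threshold


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))       # overlap 1, union 7
print(is_match((0, 0, 2, 2), (0, 0, 2, 1)))  # IoU is exactly 0.5
```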
Table structure recognition module:
Model | Model download link | Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
SLANeXt_wired | Inference model/Training model | 69.65 | 85.92 / 85.92 | - / 501.66 | 351M | The SLANeXt series is a new generation of table structure recognition models independently developed by Baidu PaddlePaddle's vision team. Compared to SLANet and SLANet_plus, SLANeXt focuses on recognizing table structures and has trained dedicated weights for wired and wireless tables separately. This has significantly improved its ability to recognize various types of tables, especially wired tables. |
SLANeXt_wireless | Inference model/Training model |
Table classification module model:
Model | Model download link | Top-1 Acc (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) |
---|---|---|---|---|---|
PP-LCNet_x1_0_table_cls | Inference model/Training model | 94.2 | 2.62 / 0.60 | 3.17 / 1.14 | 6.6M |
Table cell detection module model:
Model | Model download link | mAP (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
RT-DETR-L_wired_table_cell_det | Inference model/Training model | 82.7 | 33.47 / 27.02 | 402.55 / 256.56 | 124M | RT-DETR is the first real-time end-to-end object detection model. Based on RT-DETR-L as the base model, Baidu PaddlePaddle's vision team completed pre-training on a self-built table cell detection dataset, achieving table cell detection with good performance for both wired and wireless tables. |
RT-DETR-L_wireless_table_cell_det | Inference model/Training model |
Text detection module:
Model | Model download link | Detection Hmean (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
PP-OCRv5_server_det | Inference model/Training model | 83.8 | 89.55 / 70.19 | 383.15 / 383.15 | 84.3 | The server-side text detection model of PP-OCRv5, with higher accuracy, suitable for deployment on high-performance servers |
PP-OCRv5_mobile_det | Inference model/Training model | 79.0 | 10.67 / 6.36 | 57.77 / 28.15 | 4.7 | The mobile text detection model of PP-OCRv5, with higher efficiency, suitable for deployment on edge devices |
PP-OCRv4_server_det | Inference model/Training model | 69.2 | 127.82 / 98.87 | 585.95 / 489.77 | 109 | The server-side text detection model of PP-OCRv4, with higher accuracy, suitable for deployment on high-performance servers |
PP-OCRv4_mobile_det | Inference model/Training model | 63.8 | 9.87 / 4.17 | 56.60 / 20.79 | 4.7 | The mobile text detection model of PP-OCRv4, with higher efficiency, suitable for deployment on edge devices |
PP-OCRv3_mobile_det | Inference model/Training model | Accuracy is close to PP-OCRv4_mobile_det | 9.90 / 3.60 | 41.93 / 20.76 | 2.1 | The mobile text detection model of PP-OCRv3, with higher efficiency, suitable for deployment on edge devices |
PP-OCRv3_server_det | Inference model/Training model | Accuracy is close to PP-OCRv4_server_det | 119.50 / 75.00 | 379.35 / 318.35 | 102.1 | The server-side text detection model of PP-OCRv3, with higher accuracy, suitable for deployment on high-performance servers |
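One way to use the benchmark data when choosing a model is to fix a latency budget and take the most accurate model that fits. The following is only a sketch of that idea: the figures are copied from the text detection table above (Hmean and GPU normal-mode time), the two PP-OCRv3 rows are left out because they list no numeric Hmean, and `pick_model` is a hypothetical helper, not a PaddleOCR function.

```python
# (model name, detection Hmean in %, GPU normal-mode inference time in ms),
# copied from the text detection table above.
TEXT_DET_MODELS = [
    ("PP-OCRv5_server_det", 83.8, 89.55),
    ("PP-OCRv5_mobile_det", 79.0, 10.67),
    ("PP-OCRv4_server_det", 69.2, 127.82),
    ("PP-OCRv4_mobile_det", 63.8, 9.87),
]


def pick_model(budget_ms):
    """Return the highest-Hmean model whose GPU time fits the budget,
    or None when no model is fast enough."""
    candidates = [m for m in TEXT_DET_MODELS if m[2] <= budget_ms]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m[1])[0]


print(pick_model(20))   # PP-OCRv5_mobile_det
print(pick_model(100))  # PP-OCRv5_server_det
print(pick_model(5))    # None
```

The chosen name would then be passed through the corresponding model name parameter, for example text_detection_model_name. The timings were measured on the hardware listed in the performance test environment, so the budget comparison only transfers to similar hardware.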
Text recognition module model:
* Chinese recognition model
Model | Model download link | Recognition Avg Accuracy(%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
PP-OCRv5_server_rec | Inference model/Training model | 86.38 | 8.46 / 2.36 | 31.21 / 31.21 | 81 M | PP-OCRv5_rec is a new generation of text recognition model. This model is committed to efficiently and accurately supporting four major languages, namely Simplified Chinese, Traditional Chinese, English, and Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition effectiveness, it also takes into account inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios. |
PP-OCRv5_mobile_rec | Inference model/Training model | 81.29 | 5.43 / 1.46 | 21.20 / 5.32 | 16 M | |
PP-OCRv4_server_rec_doc | Inference model/Training model | 86.58 | 8.69 / 2.78 | 37.93 / 37.93 | 74.7 M | PP-OCRv4_server_rec_doc is trained on a mixed dataset of more Chinese document data and PP-OCR training data based on PP-OCRv4_server_rec. It has enhanced the ability to recognize some traditional Chinese characters, Japanese characters, and special characters, and can support the recognition of over 15,000 characters. In addition to improving the document-related text recognition ability, it has also enhanced the general text recognition ability. |
PP-OCRv4_mobile_rec | Inference model/Training model | 78.74 | 5.26 / 1.12 | 17.48 / 3.61 | 10.6 M | A lightweight recognition model of PP-OCRv4 with high inference efficiency, which can be deployed on various hardware devices including edge devices. |
PP-OCRv4_server_rec | Inference model/Training model | 80.61 | 8.75 / 2.49 | 36.93 / 36.93 | 71.2 M | A server-side model of PP-OCRv4 with high inference accuracy, which can be deployed on various servers. |
PP-OCRv3_mobile_rec | Inference model/Training model | 72.96 | 3.89 / 1.16 | 8.72 / 3.56 | 9.2 M | A lightweight recognition model of PP-OCRv3 with high inference efficiency, which can be deployed on various hardware devices including edge devices. |
Model | Model download link | Recognition Avg Accuracy(%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
ch_SVTRv2_rec | Inference model/Training model | 68.81 | 10.38 / 8.31 | 66.52 / 30.83 | 73.9 M | SVTRv2 is a server-side text recognition model developed by the OpenOCR team of the Vision and Learning Lab (FVL) at Fudan University. It won the first prize in the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task, with a 6% improvement in end-to-end recognition accuracy on Leaderboard A compared to PP-OCRv4. |
Model | Model download link | Recognition Avg Accuracy(%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
ch_RepSVTR_rec | Inference model/Training model | 65.07 | 6.29 / 1.57 | 20.64 / 5.40 | 22.1 M | RepSVTR is a mobile-side text recognition model based on SVTRv2. It won the first prize in the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task, with a 2.5% improvement in end-to-end recognition accuracy on Leaderboard B compared to PP-OCRv4, while maintaining the same inference speed. |
* English recognition model
Model | Model download link | Recognition Avg Accuracy(%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
en_PP-OCRv4_mobile_rec | Inference model/Training model | 70.39 | 4.81 / 1.23 | 17.20 / 4.18 | 6.8 M | An ultra-lightweight English recognition model trained based on the PP-OCRv4 recognition model, supporting English and number recognition |
en_PP-OCRv3_mobile_rec | Inference model/Training model | 70.69 | 3.56 / 0.78 | 8.44 / 5.78 | 7.8 M | An ultra-lightweight English recognition model trained based on the PP-OCRv3 recognition model, supporting English and number recognition |
* Multilingual recognition model
Model | Model download link | Recognition Avg Accuracy(%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
korean_PP-OCRv3_mobile_rec | Inference model/Training model | 60.21 | 3.73 / 0.98 | 8.76 / 2.91 | 8.6 M | An ultra-lightweight Korean recognition model trained based on the PP-OCRv3 recognition model, supporting Korean and digit recognition |
japan_PP-OCRv3_mobile_rec | Inference model/Training model | 45.69 | 3.86 / 1.01 | 8.62 / 2.92 | 8.8 M | An ultra-lightweight Japanese recognition model trained based on the PP-OCRv3 recognition model, supporting Japanese and digit recognition |
chinese_cht_PP-OCRv3_mobile_rec | Inference model/Training model | 82.06 | 3.90 / 1.16 | 9.24 / 3.18 | 9.7 M | An ultra-lightweight traditional Chinese recognition model trained based on the PP-OCRv3 recognition model, supporting traditional Chinese and digit recognition |
te_PP-OCRv3_mobile_rec | Inference model/Training model | 95.88 | 3.59 / 0.81 | 8.28 / 6.21 | 7.8 M | An ultra-lightweight Telugu recognition model trained based on the PP-OCRv3 recognition model, supporting Telugu and digit recognition |
ka_PP-OCRv3_mobile_rec | Inference model/Training model | 96.96 | 3.49 / 0.89 | 8.63 / 2.77 | 8.0 M | An ultra-lightweight Kannada recognition model trained based on the PP-OCRv3 recognition model, supporting Kannada and digit recognition |
ta_PP-OCRv3_mobile_rec | Inference model/Training model | 76.83 | 3.49 / 0.86 | 8.35 / 3.41 | 8.0 M | An ultra-lightweight Tamil recognition model trained based on the PP-OCRv3 recognition model, supporting Tamil and digit recognition |
latin_PP-OCRv3_mobile_rec | Inference model/Training model | 76.93 | 3.53 / 0.78 | 8.50 / 6.83 | 7.8 M | An ultra-lightweight Latin recognition model trained based on the PP-OCRv3 recognition model, supporting Latin and digit recognition |
arabic_PP-OCRv3_mobile_rec | Inference model/Training model | 73.55 | 3.60 / 0.83 | 8.44 / 4.69 | 7.8 M | An ultra-lightweight Arabic alphabet recognition model trained based on the PP-OCRv3 recognition model, supporting Arabic alphabet and digit recognition |
cyrillic_PP-OCRv3_mobile_rec | Inference model/Training model | 94.28 | 3.56 / 0.79 | 8.22 / 2.76 | 7.9 M | An ultra-lightweight Cyrillic alphabet recognition model trained based on the PP-OCRv3 recognition model, supporting Cyrillic alphabet and digit recognition |
devanagari_PP-OCRv3_mobile_rec | Inference model/Training model | 96.44 | 3.60 / 0.78 | 6.95 / 2.87 | 7.9 M | An ultra-lightweight Devanagari script recognition model trained based on the PP-OCRv3 recognition model, supporting Devanagari script and digit recognition |
Text line direction classification module (optional):
Model | Model download link | Top-1 Acc (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
PP-LCNet_x0_25_textline_ori | Inference model/Training model | 95.54 | 2.16 / 0.41 | 2.37 / 0.73 | 0.32 | A text line classification model based on PP-LCNet_x0_25, with two categories, namely 0 degrees and 180 degrees |
Formula recognition module:
Model | Model download link | Avg-BLEU(%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
UniMERNet | Inference model/Training model | 86.13 | 2266.96 / - | - / - | 1.4 G | UniMERNet is a formula recognition model developed by Shanghai AI Lab. It uses Donut Swin as the encoder and MBartDecoder as the decoder. By training on a dataset of one million entries that includes simple formulas, complex formulas, scanned formulas, and handwritten formulas, the model significantly improves its recognition accuracy for formulas in real-world scenarios. |
PP-FormulaNet-S | Inference model/Training model | 87.12 | 1311.84 / 1311.84 | - / 8288.07 | 167.9 M | PP-FormulaNet is an advanced formula recognition model developed by Baidu PaddlePaddle's vision team, supporting the recognition of 50,000 common LaTeX source code vocabulary. The PP-FormulaNet-S version employs PP-HGNetV2-B4 as its backbone network. Through techniques such as parallel masking and model distillation, it significantly enhances the model's inference speed while maintaining high recognition accuracy, suitable for scenarios like simple printed formulas and simple multi-line printed formulas. The PP-FormulaNet-L version, on the other hand, is based on Vary_VIT_B as its backbone network and has undergone in-depth training on a large-scale formula dataset. It shows significant improvement in recognizing complex formulas compared to PP-FormulaNet-S and is suitable for scenarios like simple printed formulas, complex printed formulas, and handwritten formulas. |
PP-FormulaNet-L | Inference model/Training model | 92.13 | 1976.52 / - | - / - | 535.2 M |
LaTeX_OCR_rec | Inference model/Training model | 71.63 | 1088.89 / 1088.89 | - / - | 89.7 M | LaTeX-OCR is a formula recognition algorithm based on an autoregressive large model. By adopting Hybrid ViT as the backbone network and transformer as the decoder, it significantly improves the accuracy of formula recognition. |
Seal text detection module:
Model | Model download link | Detection Hmean (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
---|---|---|---|---|---|---|
PP-OCRv4_server_seal_det | Inference model/Training model | 98.21 | 124.64 / 91.57 | 545.68 / 439.86 | 109 | The server-side seal text detection model of PP-OCRv4, with higher accuracy, suitable for deployment on high-performance servers |
PP-OCRv4_mobile_seal_det | Inference model/Training model | 96.47 | 9.70 / 3.56 | 50.38 / 19.64 | 4.6 | The mobile seal text detection model of PP-OCRv4, with higher efficiency, suitable for deployment on edge devices |
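The Detection Hmean column used in the detection tables is assumed here to be the harmonic mean of detection precision and recall (the usual F1-style score for text detection); the document itself does not define it. A minimal sketch of the formula, with illustrative inputs that are not taken from any benchmark:

```python
def hmean(precision, recall):
    """Harmonic mean of precision and recall, both given in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing is detected correctly
    return 2 * precision * recall / (precision + recall)


# Illustrative values only.
print(round(hmean(0.9, 0.8), 4))  # 0.8471
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot score well by trading away recall for precision or the reverse.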
- Performance test environment
  - Test dataset:
    - Document image orientation classification model: A dataset built by PaddleX, covering multiple scenarios such as certificates and documents, containing 1,000 images.
    - Text image unwarping model: DocUNet.
    - Layout area detection model: PaddleOCR's self-built layout area analysis dataset, which includes 10,000 common document images such as Chinese and English papers, magazines, and research reports.
    - PP-DocLayout_plus-L: PaddleOCR's self-built layout area detection dataset, which includes 1,300 document images such as Chinese and English papers, magazines, newspapers, research reports, PPTs, examination papers, and textbooks.
    - Table structure recognition model: PaddleX's self-built English table recognition dataset.
    - Text detection model: PaddleOCR's self-built Chinese dataset, covering multiple scenarios such as street views, web images, documents, and handwriting, with 500 images for detection.
    - Chinese recognition model: PaddleOCR's self-built Chinese dataset, covering multiple scenarios such as street views, web images, documents, and handwriting, with 11,000 images for text recognition.
    - ch_SVTRv2_rec: Evaluation set for Leaderboard A of the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task.
    - ch_RepSVTR_rec: Evaluation set for Leaderboard B of the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task.
    - English recognition model: PaddleX's self-built English dataset.
    - Multilingual recognition model: PaddleX's self-built multilingual dataset.
    - Text line direction classification model: PaddleX's self-built dataset, covering multiple scenarios such as certificates and documents, with 1,000 images.
    - Seal text detection model: PaddleX's self-built dataset, which includes 500 images of round seals.
  - Hardware configuration:
    - GPU: NVIDIA Tesla T4
    - CPU: Intel Xeon Gold 6271C @ 2.60GHz
    - Other environments: Ubuntu 20.04 / CUDA 11.8 / cuDNN 8.9 / TensorRT 8.6.1.6
- Description of inference modes
Modes | GPU configuration | CPU configuration | Combination of acceleration technologies |
---|---|---|---|
Normal mode | FP32 precision / no TRT acceleration | FP32 precision / 8 threads | PaddleInference |
High-performance mode | Optimal combination of precision type and acceleration strategy, selected in advance | FP32 precision / 8 threads | Optimal backend selected in advance (Paddle/OpenVINO/TRT, etc.) |
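The timing cells in the model tables pair the two modes as "normal / high-performance". As a rough sketch of how to read them, the helper below parses one cell and reports how many times faster high-performance mode is. It treats "-" as a missing measurement, which is an assumption about what the dash means, and both function names are hypothetical.

```python
def parse_timing(cell):
    """Split a 'normal / high-performance' cell into two values in ms.
    A '-' entry is returned as None (assumed to mean no measurement)."""
    parts = [p.strip() for p in cell.split("/")]
    if len(parts) != 2:
        raise ValueError(f"expected 'normal / high-performance', got {cell!r}")
    return tuple(None if p == "-" else float(p) for p in parts)


def speedup(cell):
    """Normal-mode time divided by high-performance time, or None if unknown."""
    normal, fast = parse_timing(cell)
    if normal is None or fast is None or fast == 0:
        return None
    return normal / fast


print(speedup("2.62 / 0.59"))    # PP-LCNet_x1_0_doc_ori on GPU, about 4.4x
print(speedup("33.59 / 33.59"))  # PP-DocLayout-L on GPU, 1.0 (no gain)
print(speedup("- / 501.66"))     # None, normal-mode figure is missing
```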
2. Quick Start
Before using the PP-DocTranslation pipeline locally, please ensure that you have completed the installation of the wheel package according to the Installation Tutorial.
Please note: If you encounter issues such as the program becoming unresponsive, unexpected program termination, running out of memory resources, or extremely slow inference during execution, please try adjusting the configuration according to the documentation, such as disabling unnecessary features or using lighter-weight models.
Before use, prepare an API key for a large language model service. The pipeline supports the Baidu Cloud Qianfan Platform as well as local large model services that comply with the OpenAI interface standard.
2.1 Experience via Command Line
Download the test file, then try the pipeline with a single command:
paddleocr pp_doctranslation -i vehicle_certificate-1.png --target_language en --qianfan_api_key your_api_key
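To avoid typing the API key on the command line, the same invocation can be assembled from a script. This is a sketch only: the flags are the ones shown in the command above, while the `build_command` helper and the `QIANFAN_API_KEY` environment variable name are assumptions made for illustration, not something the pipeline reads by itself.

```python
import os
import shlex


def build_command(input_path, target_language, api_key):
    """Assemble the argument list for the command shown above."""
    return [
        "paddleocr", "pp_doctranslation",
        "-i", input_path,
        "--target_language", target_language,  # ISO 639-1 code, e.g. "en"
        "--qianfan_api_key", api_key,
    ]


# QIANFAN_API_KEY is a hypothetical variable name chosen for this example.
api_key = os.environ.get("QIANFAN_API_KEY", "your_api_key")
cmd = build_command("vehicle_certificate-1.png", "en", api_key)

# Print the shell-quoted command for inspection. To actually run it, pass
# the list to subprocess.run(cmd, check=True) once paddleocr is installed.
print(shlex.join(cmd))
```

Passing the arguments as a list, rather than one formatted string, keeps file names with spaces or special characters from being split by the shell.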
The command line supports more parameter settings. The table below describes the command line parameters in detail.
Parameter | Description | Parameter Type | Default Value |
---|---|---|---|
input | Data to be predicted, required. For example, the local path of an image or PDF file: /root/data/img.jpg; or a URL, such as the network URL of an image or PDF file: Example; or a local directory containing the images to be predicted, such as /root/data/ (prediction for PDF files within a directory is currently not supported; a PDF must be specified by its full file path). | str | |
save_path | Path where the inference result file will be saved. If not set, the inference result will not be saved locally. | str | |
target_language | Target language (ISO 639-1 language code). | str | zh |
layout_detection_model_name | The model name for layout area detection. If not set, the default model of the pipeline will be used. | str | |
layout_detection_model_dir | The directory path of the layout area detection model. If not set, the official model will be downloaded. | str | |
layout_threshold | The score threshold for the layout model. Any floating-point number between 0 and 1. If not set, the parameter value initialized by the pipeline will be used, which is 0.5 by default. | float | |
layout_nms | Whether to use post-processing NMS for layout detection. If not set, the parameter value initialized by the pipeline will be used, which is True by default. | bool | |
layout_unclip_ratio | The expansion coefficient of the detection box for the layout area detection model. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline will be used, which is 1.0 by default. | float | |
layout_merge_bboxes_mode | If not set, the parameter value initialized by the pipeline will be used, which is large by default. | str | |
chart_recognition_model_name | The model name for chart parsing. If not set, the default model of the pipeline will be used. | str | |
chart_recognition_model_dir | The directory path for the chart parsing model. If not set, the official model will be downloaded. | str | |
chart_recognition_batch_size | The batch size for the chart parsing model. If not set, the pipeline's default batch size will be used. | int | |
region_detection_model_name | Name of the model for detecting submodules of document image layout. If not set, the default model in the pipeline will be used. | str | |
region_detection_model_dir | Directory path of the model for detecting submodules of document image layout. If not set, the official model will be downloaded. | str | |
doc_orientation_classify_model_name | Name of the model for document orientation classification. If not set, the default model in the pipeline will be used. | str | |
doc_orientation_classify_model_dir | Directory path of the model for document orientation classification. If not set, the official model will be downloaded. | str | |
doc_unwarping_model_name | Name of the model for text image unwarping. If not set, the default model in the pipeline will be used. | str | |
doc_unwarping_model_dir | Directory path of the model for text image unwarping. If not set, the official model will be downloaded. | str | |
text_detection_model_name | Name of the model for text detection. If not set, the default model in the pipeline will be used. | str | |
text_detection_model_dir | Directory path of the model for text detection. If not set, the official model will be downloaded. | str | |
text_det_limit_side_len | Limit on the side length of the image for text detection. Any integer greater than 0. If not set, the parameter value initialized by the pipeline will be used, which is 960 by default. | int | |
text_det_limit_type | Type of image side length limit for text detection. Supports min and max: min ensures that the shortest side of the image is not less than text_det_limit_side_len, and max ensures that the longest side of the image is not greater than text_det_limit_side_len. If not set, the parameter value initialized by the pipeline will be used, which is max by default. | str | |
text_det_thresh | Detection pixel threshold. Only pixels with scores greater than this threshold in the output probability map will be considered text pixels. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline will be used, which is 0.3 by default. | float | |
text_det_box_thresh | Detection box threshold. When the average score of all pixels within the border of a detection result is greater than this threshold, the result will be considered a text area. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline will be used, which is 0.6 by default. | float | |
text_det_unclip_ratio | Text detection expansion coefficient, used to expand the text area. The larger the value, the larger the expanded area. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline will be used, which is 2.0 by default. | float | |
textline_orientation_model_name | Name of the text line orientation model. If not set, the default model in the pipeline will be used. | str | |
textline_orientation_model_dir | Directory path of the text line orientation model. If not set, the official model will be downloaded. | str | |
textline_orientation_batch_size | Batch size of the text line orientation model. If not set, the batch size will be set to 1 by default. | int | |
text_recognition_model_name | Name of the text recognition model. If not set, the default model in the pipeline will be used. | str | |
text_recognition_model_dir | Directory path of the text recognition model. If not set, the official model will be downloaded. | str | |
text_recognition_batch_size | Batch size of the text recognition model. If not set, the batch size will be set to 1 by default. | int | |
text_rec_score_thresh | Text recognition threshold. Text results with scores greater than this threshold will be retained. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline will be used, which is 0.0 by default; that is, no threshold is applied. | float | |
table_classification_model_name | Name of the table classification model. If not set, the default model in the pipeline will be used. | str | |
table_classification_model_dir | Directory path of the table classification model. If not set, the official model will be downloaded. | str | |
wired_table_structure_recognition_model_name | Name of the wired table structure recognition model. If not set, the default model in the pipeline will be used. | str | |
wired_table_structure_recognition_model_dir | Directory path of the wired table structure recognition model. If not set, the official model will be downloaded. | str | |
wireless_table_structure_recognition_model_name | Name of the wireless table structure recognition model. If not set, the default model in the pipeline will be used. | str | |
wireless_table_structure_recognition_model_dir | Directory path of the wireless table structure recognition model. If not set, the official model will be downloaded. | str | |
wired_table_cells_detection_model_name | Name of the wired table cell detection model. If not set, the default model in the pipeline will be used. | str | |
wired_table_cells_detection_model_dir | Directory path of the wired table cell detection model. If not set, the official model will be downloaded. | str | |
wireless_table_cells_detection_model_name | Name of the wireless table cell detection model. If not set, the default model in the pipeline will be used. | str | |
wireless_table_cells_detection_model_dir | Directory path of the wireless table cell detection model. If not set, the official model will be downloaded. | str | |
table_orientation_classify_model_name | Name of the table orientation classification model. If not set, the default model in the pipeline will be used. | str | |
table_orientation_classify_model_dir | Directory path of the table orientation classification model. If not set, the official model will be downloaded. | str | |
seal_text_detection_model_name |
Name of the seal text detection model. If not set, the default model in the pipeline will be used. | str |
|
seal_text_detection_model_dir |
Directory path of the seal text detection model. If not set, the official model will be downloaded. | str |
|
seal_det_limit_side_len |
Limit on the side length of the image for seal text detection. Any integer greater than 0. If not set, the parameter value initialized in the pipeline will be used, which is initialized to 736 |
by default. |
|
int |
seal_det_limit_typeType of the side length limit for seal text detection image. Supports min and max, where min means ensuring that the shortest side of the image is not less than det_limit_side_len, and maxlimit_side_len . If not set, the parameter value initialized by the pipeline will be used, and the default initialization is min . |
str |
|
seal_det_thresh |
Detection pixel threshold. Only pixels with scores greater than this threshold in the output probability map will be considered as text pixels.
Any floating-point number greater than 0 . If not set, the parameter value initialized by the pipeline will be used by default, which is 0.2 . | float |
|
seal_det_box_thresh |
Detection box threshold. When the average score of all pixels within the bounding box of the detection result is greater than this threshold, the result will be considered as a text region.
Any floating-point number greater than 0 . If not set, the parameter value initialized by the pipeline will be used by default, which is 0.6 . |
float |
|
seal_det_unclip_ratio |
Expansion coefficient for seal text detection. This method is used to expand the text region. The larger the value, the larger the expanded area.
Any floating-point number greater than 0 . If not set, the parameter value initialized by the pipeline will be used by default, which is 0.5 . |
float |
|
seal_text_recognition_model_name |
Name of the seal text recognition model. If not set, the default model of the pipeline will be used. | str |
|
seal_text_recognition_model_dir |
Directory path of the seal text recognition model. If not set, the official model will be downloaded. | str |
|
seal_text_recognition_batch_size |
The batch size of the seal text recognition model. If not set, the batch size will be set to 1 by default. |
int |
|
seal_rec_score_thresh |
Text recognition threshold. Text results with scores greater than this threshold will be retained. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline will be used, which is 0.0 by default, meaning no threshold is set. | float |
|
formula_recognition_model_name |
The name of the formula recognition model. If not set, the default model of the pipeline will be used. | str |
|
formula_recognition_model_dir |
The directory path of the formula recognition model. If not set, the official model will be downloaded. | str |
|
formula_recognition_batch_size |
The batch size of the formula recognition model. If not set, the batch size will be set to 1 by default. | int |
|
use_doc_orientation_classify | Whether to use the document orientation classification module. |
bool |
False |
use_doc_unwarping | Whether to use the text image unwarping module. |
bool |
False |
use_textline_orientation |
Whether to load and use the text line orientation classification module. If not set, the parameter value initialized by the pipeline will be used, which is True by default. |
bool |
|
use_seal_recognition |
Whether to load and use the seal text recognition sub-pipeline. If not set, the parameter value initialized by the pipeline will be used, and the default initialization is True . |
bool |
|
use_table_recognition |
Whether to load and use the table recognition sub-pipeline. If not set, the parameter value initialized by the pipeline will be used, and the default initialization is True . |
bool |
|
use_formula_recognition |
Whether to load and use the formula recognition sub-pipeline. If not set, the parameter value initialized by the pipeline will be used, and the default initialization is True . |
bool |
|
use_chart_recognition |
Whether to use the chart parsing module. | bool |
False |
use_region_detection |
Whether to load and use the document region detection sub-pipeline. If not set, the parameter value initialized by the pipeline will be used, and the default initialization is True . |
bool |
|
device |
The device used for inference. It supports specifying a specific card number:
|
str |
|
enable_hpi |
Whether to enable high-performance inference. | bool |
False |
use_tensorrt |
Whether to enable the TensorRT subgraph engine of Paddle Inference. If the model does not support acceleration via TensorRT, acceleration will not be used even if this flag is set. For PaddlePaddle with CUDA 11.8, the compatible TensorRT version is 8.x (x>=6), and it is recommended to install TensorRT 8.6.1.6. For PaddlePaddle with CUDA 12.6, the compatible TensorRT version is 10.x (x>=5), and it is recommended to install TensorRT 10.5.0.18. |
bool |
False |
precision |
Computational precision, such as fp32, fp16. | str |
fp32 |
enable_mkldnn |
Whether to enable MKL-DNN accelerated inference. If MKL-DNN is not available or the model does not support acceleration via MKL-DNN, acceleration will not be used even if this flag is set. | bool |
True |
mkldnn_cache_capacity |
MKL-DNN cache capacity. | int |
10 |
cpu_threads |
Number of threads used for inference on CPU. | int |
8 |
paddlex_config |
Path to the PaddleX pipeline configuration file. | str |
The execution results will be printed to the terminal.
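To make the min and max side-length limits from the table above concrete, here is a minimal sketch of the scaling rule as the parameter descriptions state it. It is not the pipeline's actual preprocessing code: the function name `scale_for_limit` is hypothetical, and any extra rounding of sizes that a real detector may apply is left out.

```python
def scale_for_limit(height, width, limit_side_len=736, limit_type="min"):
    """Return (height, width) after applying a side-length limit.

    Hypothetical helper that only illustrates the documented semantics:
    - "min": the shortest side must not be less than limit_side_len
    - "max": the longest side must not be greater than limit_side_len
    """
    if limit_side_len <= 0:
        raise ValueError("limit_side_len must be an integer greater than 0")
    if limit_type == "min":
        shortest = min(height, width)
        # Upscale only when the shortest side is below the limit.
        ratio = limit_side_len / shortest if shortest < limit_side_len else 1.0
    elif limit_type == "max":
        longest = max(height, width)
        # Downscale only when the longest side exceeds the limit.
        ratio = limit_side_len / longest if longest > limit_side_len else 1.0
    else:
        raise ValueError("limit_type must be 'min' or 'max'")
    return round(height * ratio), round(width * ratio)


print(scale_for_limit(368, 1000, 736, "min"))
print(scale_for_limit(2000, 1000, 736, "max"))
```

In this sketch an image that already satisfies the limit is returned unchanged, so `min` only ever enlarges and `max` only ever shrinks.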
2.2 Integration via Python Script¶
The command-line method is for quickly experiencing and viewing the results. Generally, in projects, integration via code is often required. You can download the test file and use the following sample code for inference:
from paddlex import create_pipeline

# Create a translation pipeline
pipeline = create_pipeline(pipeline="PP-DocTranslation")

# Document path
input_path = "document_sample.pdf"
# Output directory
output_path = "./output"

# Large model configuration
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

if input_path.lower().endswith(".md"):
    # Read markdown documents, supporting passing in directories and url links with the .md suffix
    ori_md_info_list = pipeline.load_from_markdown(input_path)
else:
    # Use PP-StructureV3 to perform layout parsing on PDF/image documents to obtain markdown information
    visual_predict_res = pipeline.visual_predict(
        input_path,
        use_doc_orientation_classify=False,
        use_doc_unwarping=False,
        use_common_ocr=True,
        use_seal_recognition=True,
        use_table_recognition=True,
    )

    ori_md_info_list = []
    for res in visual_predict_res:
        layout_parsing_result = res["layout_parsing_result"]
        ori_md_info_list.append(layout_parsing_result.markdown)
        layout_parsing_result.save_to_img(output_path)
        layout_parsing_result.save_to_markdown(output_path)

    # Concatenate the markdown information of multi-page documents into a single markdown file, and save the merged original markdown text
    if input_path.lower().endswith(".pdf"):
        ori_md_info = pipeline.concatenate_markdown_pages(ori_md_info_list)
        ori_md_info.save_to_markdown(output_path)

# Perform document translation (target language: English)
tgt_md_info_list = pipeline.translate(
    ori_md_info_list=ori_md_info_list,
    target_language="en",
    chunk_size=5000,
    chat_bot_config=chat_bot_config,
)

# Save the translation results
for tgt_md_info in tgt_md_info_list:
    tgt_md_info.save_to_markdown(output_path)
After executing the above code, you will obtain the parsed results of the original document to be translated, the Markdown file of the original text to be translated, and the Markdown file of the translated document, all saved in the output directory.
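The sample passes a literal placeholder for `api_key`. One way to avoid keeping a real key in source code is to read it from the environment. The sketch below assumes a hypothetical environment variable named `DOCTRANS_API_KEY` and a hypothetical helper `build_chat_bot_config`; the remaining keys and values are copied from the sample configuration.

```python
import os


def build_chat_bot_config(env_var="DOCTRANS_API_KEY"):
    """Build the chat_bot_config dict, reading the API key from the environment.

    The variable name DOCTRANS_API_KEY is an assumption made for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "module_name": "chat_bot",
        "model_name": "ernie-3.5-8k",
        "base_url": "https://qianfan.baidubce.com/v2",
        "api_type": "openai",
        "api_key": api_key,
    }
```

The returned dictionary can be passed as `chat_bot_config` to `pipeline.translate(...)` in place of the hard-coded one.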
The process, API description, and output description of PP-DocTranslation prediction are as follows:
(1) Call PPDocTranslation to instantiate a PP-DocTranslation pipeline object. The relevant parameters are described below:
Parameter | Description | Parameter Type | Default Value |
---|---|---|---|
layout_detection_model_name |
The model name for layout area detection. If set to None , the default model of the pipeline will be used. |
str|None |
None |
layout_detection_model_dir |
The directory path of the layout area detection model. If set to None , the official model will be downloaded. |
str|None |
None |
layout_threshold |
The score threshold for the layout model.
|
float|dict|None |
None |
layout_nms |
Whether to use post-processing NMS for layout detection. If set to None, the parameter value initialized by the pipeline will be used, which is True by default. |
bool|None |
None |
layout_unclip_ratio |
Expansion coefficient of the detection box for the layout area detection model.
|
float|Tuple[float,float]|dict|None |
None |
layout_merge_bboxes_mode |
Filtering method for overlapping boxes in layout area detection.
|
str|dict|None |
None |
chart_recognition_model_name |
The model name for chart parsing. If set to None , the default model of the pipeline will be used. |
str|None |
None |
chart_recognition_model_dir |
The directory path of the model for chart parsing. If set to None , the official model will be downloaded. |
str|None |
None |
chart_recognition_batch_size |
The batch size of the model for chart parsing. If set to None , the batch size will be set to 1 by default. |
int|None |
None |
region_detection_model_name |
The model name for detecting submodules of document image layout. If set to None , the default model of the pipeline will be used. |
str|None |
None |
region_detection_model_dir |
The directory path of the model for detecting submodules of document image layout. If set to None , the official model will be downloaded. |
str|None |
None |
doc_orientation_classify_model_name |
Name of the document orientation classification model. If set to None , the default model in the pipeline will be used. |
str|None |
None |
doc_orientation_classify_model_dir |
Directory path of the document orientation classification model. If set to None , the official model will be downloaded. |
str|None |
None |
doc_unwarping_model_name |
Name of the text image unwarping model. If set to None , the default model in the pipeline will be used. |
str|None |
None |
doc_unwarping_model_dir |
Directory path of the text image unwarping model. If set to None , the official model will be downloaded. |
str|None |
None |
text_detection_model_name |
Name of the text detection model. If set to None , the default model in the pipeline will be used. |
str|None |
None |
text_detection_model_dir |
Directory path of the text detection model. If set to None , the official model will be downloaded. |
str|None |
None |
text_det_limit_side_len |
Limit on the side length of the image for text detection.
|
int|None |
None |
text_det_limit_type |
The type of image side length limit for text detection.
|
str|None |
None |
text_det_thresh |
Detection pixel threshold. Only pixels with scores greater than this threshold in the output probability map will be considered as text pixels.
|
float|None |
None |
text_det_box_thresh |
Detection box threshold: When the average score of all pixels within the detected bounding box is greater than this threshold, the result is considered a text region.
|
float|None |
None |
text_det_unclip_ratio |
Text detection expansion coefficient. This method is used to expand the text region. The larger the value, the larger the expanded area.
|
float|None |
None |
textline_orientation_model_name |
Name of the text line orientation model. If set to None, the default model of the pipeline will be used. |
str|None |
None |
textline_orientation_model_dir |
Directory path of the text line orientation model. If set to None, the official model will be downloaded. |
str|None |
None |
textline_orientation_batch_size |
Batch size of the text line orientation model. If set to None, the batch size will be set to 1 by default. |
int|None |
None |
text_recognition_model_name |
The name of the text recognition model. If set to None , the default model in the pipeline will be used. |
str|None |
None |
text_recognition_model_dir |
The directory path of the text recognition model. If set to None , the official model will be downloaded. |
str|None |
None |
text_recognition_batch_size |
The batch size of the text recognition model. If set to None , the default batch size will be set to 1 . |
int|None |
None |
text_rec_score_thresh |
The threshold for text recognition. Text results with scores higher than this threshold will be retained.
|
float|None |
None |
table_classification_model_name |
The name of the table classification model. If set to None , the default model in the pipeline will be used. |
str|None |
None |
table_classification_model_dir |
The directory path of the table classification model. If set to None , the official model will be downloaded. |
str|None |
None |
wired_table_structure_recognition_model_name |
The name of the wired table structure recognition model. If set to None , the default model in the pipeline will be used. |
str|None |
None |
wired_table_structure_recognition_model_dir |
The directory path of the wired table structure recognition model. If set to None , the official model will be downloaded. |
str|None |
None |
wireless_table_structure_recognition_model_name |
The name of the wireless table structure recognition model. If set to None , the default model in the pipeline will be used. |
str|None |
None |
wireless_table_structure_recognition_model_dir |
The directory path of the wireless table structure recognition model. If set to None , the official model will be downloaded. |
str|None |
None |
wired_table_cells_detection_model_name |
The name of the wired table cell detection model. If set to None , the default model in the pipeline will be used. |
str|None |
None |
wired_table_cells_detection_model_dir |
The directory path of the wired table cell detection model. If set to None , the official model will be downloaded. |
str|None |
None |
wireless_table_cells_detection_model_name |
The name of the wireless table cell detection model. If set to None , the default model in the pipeline will be used. |
str|None |
None |
wireless_table_cells_detection_model_dir |
The directory path of the wireless table cell detection model. If set to None , the official model will be downloaded. |
str|None |
None |
table_orientation_classify_model_name |
The name of the table orientation classification model. If set to None , the default model in the pipeline will be used. |
str|None |
None |
table_orientation_classify_model_dir |
The directory path of the table orientation classification model. If set to None , the official model will be downloaded. |
str|None |
None |
seal_text_detection_model_name |
The name of the seal text detection model. If set to None , the default model in the pipeline will be used. |
str|None |
None |
seal_text_detection_model_dir |
The directory path of the seal text detection model. If set to None , the official model will be downloaded. |
str|None |
None |
seal_det_limit_side_len |
The image side length limit for seal text detection.
|
int|None |
None |
seal_det_limit_type |
The image side length limit type for seal text detection.
|
str|None |
None |
seal_det_thresh |
The detection pixel threshold. Only pixels with scores greater than this threshold in the output probability map will be considered as text pixels.
|
float|None |
None |
seal_det_box_thresh |
Detection box threshold. When the average score of all pixels within the detected bounding box is greater than this threshold, the result is considered a text region.
|
float|None |
None |
seal_det_unclip_ratio |
Expansion coefficient for seal text detection. This method is used to expand the text region. The larger the value, the larger the expanded area.
|
float|None |
None |
seal_text_recognition_model_name |
Name of the seal text recognition model. If set to None , the default model of the pipeline will be used. |
str|None |
None |
seal_text_recognition_model_dir |
Directory path of the seal text recognition model. If set to None , the official model will be downloaded. |
str|None |
None |
seal_text_recognition_batch_size |
Batch size of the seal text recognition model. If set to None , the batch size will be set to 1 by default. |
int|None |
None |
seal_rec_score_thresh |
Threshold for seal text recognition. Text results with scores higher than this threshold will be retained.
|
float|None |
None |
formula_recognition_model_name |
Name of the formula recognition model. If set to None , the default model of the pipeline will be used. |
str|None |
None |
formula_recognition_model_dir |
Directory path of the formula recognition model. If set to None , the official model will be downloaded. |
str|None |
None |
formula_recognition_batch_size |
The batch size of the formula recognition model. If set to None , the batch size will be set to 1 by default. |
int|None |
None |
use_doc_orientation_classify |
Whether to load and use the document orientation classification module. If set to None , the parameter value initialized by the pipeline will be used, and the default initialization is True . |
bool|None |
None |
use_doc_unwarping |
Whether to load and use the text image unwarping module. If set to None , the parameter value initialized by the pipeline will be used, and the default initialization is True . |
bool|None |
None |
use_textline_orientation |
Whether to load and use the text line orientation classification module. If set to None , the parameter value initialized by the pipeline will be used, and the default initialization is True . |
bool|None |
None |
use_seal_recognition |
Whether to load and use the sub-pipeline for seal text recognition. If set to None , the parameter value initialized by the pipeline will be used, and the default initialization is True . |
bool|None |
None |
use_table_recognition |
Whether to load and use the sub-pipeline for table recognition. If set to None, the parameter value initialized by the pipeline will be used, and the default initialization is True. |
bool|None |
None |
use_formula_recognition |
Whether to load and use the sub-pipeline for formula recognition. If set to None , the parameter value initialized by the pipeline will be used, and the default initialization is True . |
bool|None |
None |
use_chart_recognition |
Whether to load and use the chart parsing module. If set to None , the parameter value initialized by the pipeline will be used, and the default initialization is True . |
bool|None |
None |
use_region_detection |
Whether to load and use the sub-pipeline for document region detection. If set to None , the parameter value initialized by the pipeline will be used, and the default initialization is True . |
bool|None |
None |
chat_bot_config |
Configuration information for the large language model. The configuration content is the following dict:
|
dict|None |
None |
device |
Device for inference. Support specifying a specific card number:
|
str|None |
None |
enable_hpi |
Whether to enable high-performance inference. | bool |
False |
use_tensorrt |
Whether to enable the TensorRT subgraph engine of Paddle Inference. If the model does not support acceleration via TensorRT, acceleration will not be used even if this flag is set. For PaddlePaddle with CUDA 11.8, the compatible TensorRT version is 8.x (x>=6), and it is recommended to install TensorRT 8.6.1.6. For PaddlePaddle with CUDA 12.6, the compatible TensorRT version is 10.x (x>=5), and it is recommended to install TensorRT 10.5.0.18. |
bool |
False |
precision |
Computational precision, such as fp32, fp16. | str |
"fp32" |
enable_mkldnn |
Whether to enable MKL-DNN for accelerated inference. If MKL-DNN is not available or the model does not support acceleration via MKL-DNN, acceleration will not be used even if this flag is set. | bool |
True |
mkldnn_cache_capacity |
MKL-DNN cache capacity. | int |
10 |
cpu_threads |
The number of threads used for inference on the CPU. | int |
8 |
paddlex_config |
Path to the PaddleX pipeline configuration file. | str|None |
None |
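The two detection thresholds in the table (text_det_thresh and text_det_box_thresh, with their seal_det_* counterparts) act at different stages: one on individual pixels, the other on whole candidate boxes. The toy sketch below uses plain Python lists in place of a real probability map, hypothetical helper names, and made-up threshold values. It only illustrates the descriptions above and is not the detector's post-processing code.

```python
def text_pixel_mask(prob_map, thresh):
    """Mark pixels whose score is greater than thresh as text pixels."""
    return [[score > thresh for score in row] for row in prob_map]


def box_is_text_region(prob_map, box, box_thresh):
    """Average the scores inside an axis-aligned box and compare to box_thresh.

    box is (top, left, bottom, right) with exclusive bottom/right; real
    detectors work with polygons, so this is a simplification.
    """
    top, left, bottom, right = box
    scores = [prob_map[r][c] for r in range(top, bottom) for c in range(left, right)]
    return sum(scores) / len(scores) > box_thresh


# Illustrative 2x3 probability map (values are invented for the example).
prob_map = [
    [0.1, 0.8, 0.9],
    [0.2, 0.7, 0.9],
]
print(text_pixel_mask(prob_map, thresh=0.3))
print(box_is_text_region(prob_map, (0, 1, 2, 3), box_thresh=0.6))
```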
(2) Call the visual_predict() method of the PP-DocTranslation pipeline object to obtain visual prediction results. This method returns a list of results. The pipeline also provides the visual_predict_iter() method, which accepts the same parameters and returns the same results, except that it returns a generator that yields prediction results one at a time. This suits large datasets or situations where memory must be conserved. Choose either method according to your needs. The parameters of the visual_predict() method are described below:
Parameter | Description | Parameter Type | Default Value |
---|---|---|---|
input |
Data to be predicted, supporting multiple input types, required.
|
Python Var|str|list |
|
use_doc_orientation_classify |
Whether to use the document orientation classification module during inference. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
bool|None |
False |
use_doc_unwarping |
Whether to use the text image unwarping module during inference. Set to None to use the instantiated parameter; otherwise, this parameter takes precedence. |
bool|None |
False |
use_textline_orientation |
Whether to use the text line orientation classification module during inference. Set to None to use the instantiated parameter; otherwise, this parameter takes precedence. |
bool|None |
None |
use_seal_recognition |
Whether to use the seal text recognition sub-pipeline during inference. Set to None to use the instantiated parameter; otherwise, this parameter takes precedence. |
bool|None |
None |
use_table_recognition |
Whether to use the table recognition sub-pipeline during inference. Set to None to use the instantiated parameter; otherwise, this parameter takes precedence. |
bool|None |
None |
use_formula_recognition |
Whether to use the formula recognition sub-pipeline during inference. Set to None to use the instantiated parameter; otherwise, this parameter takes precedence. |
bool|None |
None |
use_chart_recognition |
Whether to use the chart parsing module. Set to None to use the instantiated parameter; otherwise, this parameter takes precedence. |
bool|None |
False |
use_region_detection |
Whether to use the sub-pipeline for document region detection. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
bool|None |
None |
layout_threshold |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
float|dict|None |
None |
layout_nms |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
bool|None |
None |
layout_unclip_ratio |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
float|Tuple[float,float]|dict|None |
None |
layout_merge_bboxes_mode |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
str|dict|None |
None |
text_det_limit_side_len |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
int|None |
None |
text_det_limit_type |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
str|None |
None |
text_det_thresh |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
float|None |
None |
text_det_box_thresh |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
float|None |
None |
text_det_unclip_ratio |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
float|None |
None |
text_rec_score_thresh |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
float|None |
None |
seal_det_limit_side_len |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
int|None |
None |
seal_det_limit_type |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
str|None |
None |
seal_det_thresh |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
float|None |
None |
seal_det_box_thresh |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
float|None |
None |
seal_det_unclip_ratio |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
float|None |
None |
seal_rec_score_thresh |
The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. |
float|None |
None |
use_wired_table_cells_trans_to_html |
Whether to enable direct conversion of wired table cell detection results to HTML. If enabled, HTML is constructed directly based on the geometric relationships of wired table cell detection results. | bool |
False |
use_wireless_table_cells_trans_to_html |
Whether to enable direct conversion of wireless table cell detection results to HTML. If enabled, HTML is constructed directly based on the geometric relationships of wireless table cell detection results. | bool |
False |
use_table_orientation_classify |
Whether to enable table orientation classification. When enabled, if the table in the image is rotated by 90/180/270 degrees, the orientation can be corrected and table recognition can be completed correctly. | bool |
True |
use_ocr_results_with_table_cells |
Whether to enable cell-segmented OCR. When enabled, OCR detection results will be segmented and re-recognized based on cell prediction results to avoid missing text. | bool |
True |
use_e2e_wired_table_rec_model |
Whether to enable the end-to-end wired table recognition mode. If enabled, the cell detection model will not be used, and only the table structure recognition model will be used. | bool |
False |
use_e2e_wireless_table_rec_model |
Whether to enable the end-to-end wireless table recognition mode. If enabled, the cell detection model will not be used, and only the table structure recognition model will be used. | bool |
True |
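Many of the visual_predict() parameters above follow one rule: None defers to the value given at instantiation, and any other value takes precedence for that call. A minimal sketch of that rule, using a hypothetical helper that is not part of the pipeline API:

```python
def resolve_param(call_value, init_value):
    """Return the effective setting for one call.

    None at call time means "use the instantiation value"; anything else,
    including False or 0, overrides it.
    """
    return init_value if call_value is None else call_value


# Instantiated with seal recognition on; the call does not mention it.
print(resolve_param(None, True))
# The call explicitly turns it off, so the call-time value wins.
print(resolve_param(False, True))
```

The check is written as `is None` rather than a truthiness test so that an explicit `False` or `0` is treated as a real override.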
(3) Processing visual prediction results: The prediction result for each sample is a corresponding Result object, and it supports operations such as printing, saving as an image, and saving as a json file:
Method | Method Description | Parameter | Parameter Type | Parameter Description | Default Value |
---|---|---|---|---|---|
print() | Print the result to the terminal | format_json | bool | Whether to format the output as indented JSON | True |
print() | Print the result to the terminal | indent | int | Indentation level used to make the output JSON data more readable; valid only when format_json is True | 4 |
print() | Print the result to the terminal | ensure_ascii | bool | Controls whether non-ASCII characters are escaped to Unicode. When set to True, all non-ASCII characters are escaped; False retains the original characters. Valid only when format_json is True | False |
save_to_json() | Saves the result as a file in json format | save_path | str | The path where the file is saved. When it is a directory, the saved file name is consistent with the input file type name. | None |
save_to_json() | Saves the result as a file in json format | indent | int | Indentation level used to make the output JSON data more readable; valid only when format_json is True | 4 |
save_to_json() | Saves the result as a file in json format | ensure_ascii | bool | Controls whether non-ASCII characters are escaped to Unicode. When set to True, all non-ASCII characters are escaped; False retains the original characters. Valid only when format_json is True | False |
save_to_img() | Saves the visualized images of each intermediate module in PNG format | save_path | str | The file path for saving, which supports directory or file path | None |
save_to_markdown() | Saves each page of an image or PDF file as a separate file in markdown format | save_path | str | The file path for saving, which supports directory or file path | None |
save_to_html() | Saves tables in a file as a file in html format | save_path | str | The file path for saving, which supports directory or file path | None |
save_to_xlsx() | Saves tables in a file as a file in xlsx format | save_path | str | The file path for saving, which supports directory or file path | None |
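The indent and ensure_ascii parameters have the same names and described effects as the options of Python's standard `json.dumps`. Whether the Result object calls `json.dumps` internally is an assumption. The sketch below only shows what the two options do to a made-up sample dictionary.

```python
import json

# Made-up sample data standing in for a prediction result.
sample = {"page_index": 0, "text": "印章"}

# ensure_ascii=True escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(sample, indent=4, ensure_ascii=True)
# ensure_ascii=False keeps the original characters.
readable = json.dumps(sample, indent=4, ensure_ascii=False)

print(escaped)
print(readable)
```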
(4) Call the translate() method to perform document translation. This method returns the original markdown text and the translated text as a markdown object. You can save the required parts locally by calling the save_to_markdown() method. The parameters of the translate() method are described below:
Parameter | Description | Parameter Type | Default Value |
---|---|---|---|
ori_md_info_list |
A data list in the original Markdown format, containing the content to be translated. It must be a list composed of dictionaries, with each dictionary representing a document block. | List[Dict] |
No default value (required) |
target_language |
Target language (ISO 639-1 language code, such as "en", "ja", or "fr"). |
str |
"zh" |
chunk_size |
The character count threshold for chunking the text to be translated. | int |
5000 |
task_description |
Custom task description prompt. | str|None |
None |
output_format |
Specify the output format requirements, such as "maintain the original Markdown structure". | str|None |
None |
rules_str |
Custom translation rule description. | str|None |
None |
few_shot_demo_text_content |
Example text content for few-shot learning. | str|None |
None |
few_shot_demo_key_value_list |
Structured few-shot example data. Example data in key-value pair format, which can include a glossary of technical terms. | str|None |
None |
chat_bot_config |
Large language model configuration. Set to None to use instantiation parameters; otherwise, this parameter takes precedence. |
dict|None |
None |
llm_request_interval |
The time interval, in seconds, for sending requests to the large language model. This parameter can be used to prevent overly frequent calls to the large language model. | float |
0 |
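chunk_size is described above only as a character count threshold for chunking. The splitting strategy itself is not documented here, so the following is one plausible sketch rather than the pipeline's algorithm. It groups blank-line-separated paragraphs so that each chunk stays at or under the threshold where possible. The function name and the paragraph-based strategy are both assumptions.

```python
def chunk_markdown(text, chunk_size=5000):
    """Group blank-line-separated paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size is kept whole as its own chunk,
    so that no paragraph is cut in the middle.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > chunk_size:
            # Adding this paragraph would exceed the threshold: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 10
for chunk in chunk_markdown(text, chunk_size=25):
    print(len(chunk))
```

Splitting on paragraph boundaries keeps each request to the large language model self-contained, which matters for documents with strong contextual dependencies.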
3. Development Integration/Deployment¶
If the pipeline can meet your requirements for inference speed and accuracy, you can proceed directly with development integration/deployment.
If you need to apply the pipeline directly in your Python project, refer to the sample code in 2.2 Integration via Python Script.
In addition, PaddleOCR also offers two other deployment methods, detailed as follows:
🚀 High-Performance Inference: In real-world production environments, many applications have stringent performance criteria (especially response speed) for deployment strategies to ensure efficient system operation and a smooth user experience. To this end, PaddleOCR provides high-performance inference capabilities, aiming to deeply optimize model inference and pre/post-processing, achieving significant acceleration in the end-to-end process. For detailed information on the high-performance inference process, please refer to High-Performance Inference.
☁️ Serving: Serving is a common deployment form in real-world production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. For detailed information on the pipeline serving process, please refer to Serving.
Below are the API references for basic serving and examples of multilingual service invocation:
API reference
For the main operations provided by the service:
- The HTTP request method is POST.
- Both the request body and response body are JSON data (JSON objects).
- When the request is processed successfully, the response status code is 200, and the properties of the response body are as follows:
Name | Type | Meaning |
---|---|---|
logId |
string |
The UUID of the request. |
errorCode |
integer |
Error code. Fixed as 0. |
errorMsg |
string |
Error description. Fixed as "Success". |
result |
object |
Operation result. |
- When the request is not processed successfully, the properties of the response body are as follows:
Name | Type | Meaning |
---|---|---|
logId |
string |
The UUID of the request. |
errorCode |
integer |
Error code. Same as the response status code. |
errorMsg |
string |
Error description. |
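A client can tell the two response shapes above apart by checking errorCode. The sketch below operates on an already-decoded JSON body. The helper name and the choice of exception type are illustrative and not part of the service.

```python
def unwrap_result(body):
    """Return body["result"] on success, otherwise raise with the service's error info."""
    if body.get("errorCode") == 0:
        return body["result"]
    raise RuntimeError(
        f"request {body.get('logId')} failed: "
        f"{body.get('errorCode')} {body.get('errorMsg')}"
    )


# Example bodies with invented values, shaped like the two tables above.
ok_body = {"logId": "demo-id", "errorCode": 0, "errorMsg": "Success", "result": {"pages": 1}}
print(unwrap_result(ok_body))
```

Including logId in the error message makes it easier to match a failed request against the server logs.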
The main operations provided by the service are as follows:
analyzeImages
Analyze images using computer vision models to obtain OCR, table recognition results, etc.
POST /doctrans-visual
- The properties of the request body are as follows:
Name | Type | Meaning | Required |
---|---|---|---|
file | string | The URL of an image file or PDF file accessible to the server, or the Base64-encoded content of such a file. By default, for PDF files with more than 10 pages, only the first 10 pages are processed. The page limit can be removed through the pipeline configuration file. | Yes |
fileType | integer or null | File type. 0 indicates a PDF file, 1 indicates an image file. If this property is not present in the request body, the file type is inferred from the URL. | No |
useDocOrientationClassify | boolean or null | Refer to the description of the use_doc_orientation_classify parameter in the predict method of the pipeline object. | No |
useDocUnwarping | boolean or null | Refer to the description of the use_doc_unwarping parameter in the predict method of the pipeline object. | No |
useTextlineOrientation | boolean or null | Refer to the description of the use_textline_orientation parameter in the predict method of the pipeline object. | No |
useSealRecognition | boolean or null | Refer to the description of the use_seal_recognition parameter in the predict method of the pipeline object. | No |
useTableRecognition | boolean or null | Refer to the description of the use_table_recognition parameter in the predict method of the pipeline object. | No |
useFormulaRecognition | boolean or null | Refer to the description of the use_formula_recognition parameter in the predict method of the pipeline object. | No |
useChartRecognition | boolean or null | Refer to the description of the use_chart_recognition parameter in the predict method of the pipeline object. | No |
useRegionDetection | boolean or null | Refer to the description of the use_region_detection parameter in the predict method of the pipeline object. | No |
layoutThreshold | number, object, or null | Refer to the description of the layout_threshold parameter in the predict method of the pipeline object. | No |
layoutNms | boolean or null | Refer to the description of the layout_nms parameter in the predict method of the pipeline object. | No |
layoutUnclipRatio | number, array, object, or null | Refer to the description of the layout_unclip_ratio parameter in the predict method of the pipeline object. | No |
layoutMergeBboxesMode | string, object, or null | Refer to the description of the layout_merge_bboxes_mode parameter in the predict method of the pipeline object. | No |
textDetLimitSideLen | integer or null | Refer to the description of the text_det_limit_side_len parameter in the predict method of the pipeline object. | No |
textDetLimitType | string or null | Refer to the description of the text_det_limit_type parameter in the predict method of the pipeline object. | No |
textDetThresh | number or null | Refer to the description of the text_det_thresh parameter in the predict method of the pipeline object. | No |
textDetBoxThresh | number or null | Refer to the description of the text_det_box_thresh parameter in the predict method of the pipeline object. | No |
textDetUnclipRatio | number or null | Refer to the description of the text_det_unclip_ratio parameter in the predict method of the pipeline object. | No |
textRecScoreThresh | number or null | Refer to the description of the text_rec_score_thresh parameter in the predict method of the pipeline object. | No |
sealDetLimitSideLen | integer or null | Refer to the description of the seal_det_limit_side_len parameter in the predict method of the pipeline object. | No |
sealDetLimitType | string or null | Refer to the description of the seal_det_limit_type parameter in the predict method of the pipeline object. | No |
sealDetThresh | number or null | Refer to the description of the seal_det_thresh parameter in the predict method of the pipeline object. | No |
sealDetBoxThresh | number or null | Refer to the description of the seal_det_box_thresh parameter in the predict method of the pipeline object. | No |
sealDetUnclipRatio | number or null | Refer to the description of the seal_det_unclip_ratio parameter in the predict method of the pipeline object. | No |
sealRecScoreThresh | number or null | Refer to the description of the seal_rec_score_thresh parameter in the predict method of the pipeline object. | No |
useWiredTableCellsTransToHtml | boolean | Refer to the description of the use_wired_table_cells_trans_to_html parameter in the predict method of the pipeline object. | No |
useWirelessTableCellsTransToHtml | boolean | Refer to the description of the use_wireless_table_cells_trans_to_html parameter in the predict method of the pipeline object. | No |
useTableOrientationClassify | boolean | Refer to the description of the use_table_orientation_classify parameter in the predict method of the pipeline object. | No |
useOcrResultsWithTableCells | boolean | Refer to the description of the use_ocr_results_with_table_cells parameter in the predict method of the pipeline object. | No |
useE2eWiredTableRecModel | boolean | Refer to the description of the use_e2e_wired_table_rec_model parameter in the predict method of the pipeline object. | No |
useE2eWirelessTableRecModel | boolean | Refer to the description of the use_e2e_wireless_table_rec_model parameter in the predict method of the pipeline object. | No |
visualize | boolean or null | Whether to return visualization result images and intermediate images produced during processing. A default can be set in the pipeline configuration file; for example, when the configuration file disables image return by default, the visualize parameter in the request body can override that behavior. If neither the request body nor the configuration file sets a value (or null is passed in the request body and the configuration file sets nothing), images are returned by default. | No |
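The request properties above mirror the snake_case parameters of the predict method in camelCase, so a request body can be assembled mechanically. The sketch below illustrates this under stated assumptions: `snake_to_camel`, `infer_file_type`, and `build_visual_payload` are hypothetical helper names, the file type is guessed from the file extension on the client side, and options set to `None` are simply omitted so that the server-side defaults apply.

```python
import base64
from pathlib import Path
from typing import Any, Dict


def snake_to_camel(name: str) -> str:
    """Convert e.g. 'use_doc_unwarping' to 'useDocUnwarping'."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def infer_file_type(path: str) -> int:
    """Return 0 for a PDF file and 1 otherwise (treated as an image file)."""
    return 0 if Path(path).suffix.lower() == ".pdf" else 1


def build_visual_payload(path: str, file_bytes: bytes, **options: Any) -> Dict[str, Any]:
    """Build a request body for POST /doctrans-visual from snake_case options."""
    payload: Dict[str, Any] = {
        # The file content is sent Base64-encoded, as described in the table.
        "file": base64.b64encode(file_bytes).decode("ascii"),
        "fileType": infer_file_type(path),
    }
    for key, value in options.items():
        if value is not None:  # leave unset options to the server defaults
            payload[snake_to_camel(key)] = value
    return payload
```

For instance, `build_visual_payload("report.pdf", data, use_doc_unwarping=False, visualize=False)` would produce a body with `fileType` 0 and the keys `useDocUnwarping` and `visualize`.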
- When the request is processed successfully, the result in the response body has the following properties:
Name | Type | Meaning |
---|---|---|
layoutParsingResults | array | Layout parsing results. The array length is 1 (for image input) or the number of document pages actually processed (for PDF input). For PDF input, the elements of the array correspond, in order, to the pages of the PDF file that were actually processed. |
dataInfo | object | Input data information. |
Each element in layoutParsingResults is an object with the following properties:
Name | Type | Meaning |
---|---|---|
prunedResult | object | A simplified version of the res field in the JSON representation of the layout_parsing_result generated by the visual_predict method of the pipeline object, with the input_path and page_index fields removed. |
markdown | object | Markdown results. |
outputImages | object | See the description of the img property. The images are in JPEG format and Base64-encoded. |
inputImage | string or null | Input image. The image is in JPEG format and Base64-encoded. |
markdown is an object with the following properties:
Name | Type | Meaning |
---|---|---|
text | string | Markdown text. |
images | object | Key-value pairs of relative paths of Markdown images and Base64-encoded images. |
isStart | boolean | Whether the first element on the current page is the start of a paragraph. |
isEnd | boolean | Whether the last element on the current page is the end of a paragraph. |
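One possible use of isStart and isEnd is stitching the per-page Markdown of a multi-page PDF back into a single document. The following sketch shows a simple heuristic, not the pipeline's own merging logic: the function name `join_pages` is hypothetical, and it assumes that a paragraph continues across a page break exactly when the previous page did not end a paragraph and the current page does not start one. Joining with a single space suits space-delimited languages; other scripts may need a different separator.

```python
from typing import Any, Dict, List


def join_pages(markdown_list: List[Dict[str, Any]]) -> str:
    """Concatenate per-page markdown objects into one Markdown string."""
    merged = ""
    prev_is_end = True  # nothing precedes the first page
    for page in markdown_list:
        text = page["text"].strip()
        if not text:
            continue  # skip empty pages without changing the paragraph state
        if not merged:
            merged = text
        elif not prev_is_end and not page["isStart"]:
            # The paragraph runs across the page break: continue the same line.
            merged += " " + text
        else:
            # Otherwise start a new block, separated by a blank line.
            merged += "\n\n" + text
        prev_is_end = page["isEnd"]
    return merged
```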
translate
Translate documents using a large model.
POST /doctrans-translate
- The properties of the request body are as follows:
Name | Type | Meaning | Required |
---|---|---|---|
markdownList | array | List of Markdown documents to be translated. Can be obtained from the results of the analyzeImages operation. The images property will not be used. | Yes |
targetLanguage | string | Refer to the description of the target_language parameter in the translate method of the pipeline object. | No |
chunkSize | integer | Refer to the description of the chunk_size parameter in the translate method of the pipeline object. | No |
taskDescription | string or null | Refer to the description of the task_description parameter in the translate method of the pipeline object. | No |
outputFormat | string or null | Refer to the description of the output_format parameter in the translate method of the pipeline object. | No |
rulesStr | string or null | Refer to the description of the rules_str parameter in the translate method of the pipeline object. | No |
fewShotDemoTextContent | string or null | Refer to the description of the few_shot_demo_text_content parameter in the translate method of the pipeline object. | No |
fewShotDemoKeyValueList | string or null | Refer to the description of the few_shot_demo_key_value_list parameter in the translate method of the pipeline object. | No |
chatBotConfig | object or null | Refer to the description of the chat_bot_config parameter in the translate method of the pipeline object. | No |
llmRequestInterval | number or null | Refer to the description of the llm_request_interval parameter in the translate method of the pipeline object. | No |
- When the request is processed successfully, the result in the response body has the following properties:
Name | Type | Meaning |
---|---|---|
translationResults | array | Translation results. |
Each element in translationResults is an object with the following properties:
Name | Type | Meaning |
---|---|---|
language | string | Target language. |
markdown | object | Markdown results. The object definition is consistent with the markdown returned by the analyzeImages operation. |
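Because markdownList is built from the output of analyzeImages and the images property is not used, the request body for translate can be derived directly from layoutParsingResults. The sketch below is one way to do this without modifying the original results; `build_translate_payload` is a hypothetical helper, extra options are passed under their camelCase request names, and the value 2000 in the usage note is an arbitrary illustration rather than a recommended setting.

```python
from typing import Any, Dict, List, Optional


def build_translate_payload(
    layout_parsing_results: List[Dict[str, Any]],
    target_language: Optional[str] = None,
    **options: Any,
) -> Dict[str, Any]:
    """Build a request body for POST /doctrans-translate."""
    # Copy each markdown object without its "images" entry; the originals
    # stay intact so the images can still be written to disk afterwards.
    markdown_list = [
        {k: v for k, v in page["markdown"].items() if k != "images"}
        for page in layout_parsing_results
    ]
    payload: Dict[str, Any] = {"markdownList": markdown_list}
    if target_language is not None:
        payload["targetLanguage"] = target_language
    # Optional properties such as chunkSize or llmRequestInterval;
    # None values are dropped so the server defaults apply.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload
```

A call such as `build_translate_payload(result["layoutParsingResults"], "en", chunkSize=2000)` yields a body that can be passed as the `json` argument of an HTTP POST request.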
Example of multilingual service invocation
Python
import base64
import pathlib
import pprint
import sys

import requests

API_BASE_URL = "http://127.0.0.1:8080"
file_path = "./demo.jpg"
target_language = "en"

# Read the input file and Base64-encode it for the request body
with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

payload = {
    "file": file_data,
    "fileType": 1,  # 1 indicates an image file
}

# Step 1: layout analysis (analyzeImages)
resp_visual = requests.post(url=f"{API_BASE_URL}/doctrans-visual", json=payload)
if resp_visual.status_code != 200:
    print(
        f"Request to doctrans-visual failed with status code {resp_visual.status_code}."
    )
    pprint.pprint(resp_visual.json())
    sys.exit(1)
result_visual = resp_visual.json()["result"]

markdown_list = []
for i, res in enumerate(result_visual["layoutParsingResults"]):
    # Save the Markdown text and its images for each page
    md_dir = pathlib.Path(f"markdown_{i}")
    md_dir.mkdir(exist_ok=True)
    (md_dir / "doc.md").write_text(res["markdown"]["text"])
    for img_path, img in res["markdown"]["images"].items():
        img_path = md_dir / img_path
        img_path.parent.mkdir(parents=True, exist_ok=True)
        img_path.write_bytes(base64.b64decode(img))
    print(f"The Markdown document to be translated is saved at {md_dir / 'doc.md'}")
    # The images property is not used by the translate operation
    del res["markdown"]["images"]
    markdown_list.append(res["markdown"])
    for img_name, img in res["outputImages"].items():
        img_path = f"{img_name}_{i}.jpg"
        with open(img_path, "wb") as f:
            f.write(base64.b64decode(img))
        print(f"Output image saved at {img_path}")

payload = {
    "markdownList": markdown_list,
    "targetLanguage": target_language,
}

# Step 2: translation (translate)
resp_translate = requests.post(url=f"{API_BASE_URL}/doctrans-translate", json=payload)
if resp_translate.status_code != 200:
    print(
        f"Request to doctrans-translate failed with status code {resp_translate.status_code}."
    )
    pprint.pprint(resp_translate.json())
    sys.exit(1)
result_translate = resp_translate.json()["result"]

for i, res in enumerate(result_translate["translationResults"]):
    md_dir = pathlib.Path(f"markdown_{i}")
    (md_dir / "doc_translated.md").write_text(res["markdown"]["text"])
    print(f"Translated markdown document saved at {md_dir / 'doc_translated.md'}")
4. Secondary Development
If the default model weights provided by the PP-DocTranslation pipeline do not meet your accuracy or speed requirements in your scenario, you can use your own data from specific domains or application scenarios to further fine-tune the existing models and improve recognition in your scenario.
4.1 Model Fine-tuning
Since the PP-DocTranslation pipeline contains several modules, if the performance of the model pipeline does not meet expectations, the issue may originate from any one of these modules. You can analyze cases with poor extraction results, use visualized images to determine which module has the problem, and refer to the corresponding fine-tuning tutorial links in the following table to fine-tune the model.
Scenario | Fine-tuning module | Fine-tuning reference link |
---|---|---|
Inaccurate detection of layout areas, such as failure to detect seals and tables | Layout area detection module | Link |
Inaccurate recognition of table structures | Table structure recognition module | Link |
Inaccurate recognition of formulas | Formula recognition module | Link |
Omission in detecting seal texts | Seal text detection module | Link |
Omission in detecting texts | Text detection module | Link |
Inaccurate text content | Text recognition module | Link |
Inaccurate correction of vertical or rotated text lines | Text line orientation classification module | Link |
Inaccurate correction of whole image rotation | Document image orientation classification module | Link |
Inaccurate correction of image distortion | Text image unwarping module | Fine-tuning is temporarily not supported |
4.2 Model Application
After completing fine-tuning training with your private dataset, you can obtain a local model weight file. Then, you can use the fine-tuned model weights by customizing the pipeline configuration file.
- Obtain the pipeline configuration file
You can call the export_paddlex_config_to_yaml method of the PP-DocTranslation pipeline object in PaddleOCR to export the current pipeline configuration to a YAML file:
from paddleocr import PPDocTranslation
pipeline = PPDocTranslation()
pipeline.export_paddlex_config_to_yaml("PP-DocTranslation.yaml")
- Modify the configuration file
After obtaining the default pipeline configuration file, fill in the local path of the fine-tuned model weights at the corresponding location in the pipeline configuration file. For example,
......
SubModules:
  TextDetection:
    module_name: text_detection
    model_name: PP-OCRv5_server_det
    model_dir: null # Replace with the path to the weights of the fine-tuned text detection model
    limit_side_len: 960
    limit_type: max
    thresh: 0.3
    box_thresh: 0.6
    unclip_ratio: 1.5
  TextRecognition:
    module_name: text_recognition
    model_name: PP-OCRv5_server_rec
    model_dir: null # Replace with the path to the weights of the fine-tuned text recognition model
    batch_size: 1
    score_thresh: 0
......
The pipeline configuration file not only includes parameters supported by PaddleOCR CLI and Python API but also allows for more advanced configurations. Detailed information can be found in the corresponding pipeline usage tutorial in the Overview of PaddleX Model Pipeline Usage. Refer to the detailed instructions therein and adjust the configurations according to your needs.
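The same change can also be made programmatically once the configuration has been loaded into a Python dict (for example with a YAML library such as PyYAML's `yaml.safe_load`). The sketch below is an illustration only: `set_model_dir` is a hypothetical helper, and the exact nesting of the exported file is not assumed, so the function searches the whole dict for the module key. If the same key (for example `TextDetection`) appears under more than one sub-pipeline, every occurrence is updated; pass a narrower sub-dict to limit the change.

```python
from typing import Any, Dict


def set_model_dir(config: Dict[str, Any], module_key: str, model_dir: str) -> int:
    """Set model_dir on every module stored under module_key; return the count."""
    updated = 0
    for key, value in config.items():
        if not isinstance(value, dict):
            continue
        # A module entry is recognised by its key and an existing model_dir field.
        if key == module_key and "model_dir" in value:
            value["model_dir"] = model_dir
            updated += 1
        # Recurse, since modules are nested below SubModules entries.
        updated += set_model_dir(value, module_key, model_dir)
    return updated
```

Checking the returned count is a cheap guard against a misspelled module key: a result of 0 means nothing in the configuration was changed.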
- Load the pipeline configuration file in CLI
After modifying the configuration file, specify the path to the modified pipeline configuration file using the --paddlex_config parameter in the command line. PaddleOCR will then read its contents as the pipeline configuration. Here is an example:
- Load the pipeline configuration file in the Python API
When initializing the pipeline object, you can pass the path of the PaddleX pipeline configuration file or a configuration dict through the paddlex_config parameter, and PaddleOCR will read its content as the pipeline configuration. The example is as follows:
from paddleocr import PPDocTranslation
pipeline = PPDocTranslation(paddlex_config="PP-DocTranslation.yaml")