PP-DocTranslation Pipeline Usage Tutorial

1. Introduction to PP-DocTranslation Pipeline

PP-DocTranslation is an intelligent document translation solution provided by PaddlePaddle. It integrates advanced general layout analysis technology with large language model (LLM) capabilities to deliver efficient, intelligent document translation services. The solution can accurately identify and extract the elements of a document, including text blocks, headings, paragraphs, images, tables, and other complex layout structures, and on this basis produce high-quality multilingual translations. PP-DocTranslation supports mutual translation among mainstream languages and excels particularly at documents with complex layouts and strong contextual dependencies, aiming to deliver precise, natural, fluent, and professional results. The pipeline also provides flexible serving options, supporting multiple programming languages on various hardware. Moreover, it supports secondary development: you can train and fine-tune models on your own datasets based on this pipeline, and the trained models can be seamlessly integrated.

The PP-DocTranslation pipeline is built on the PP-StructureV3 sub-pipeline and therefore offers all of PP-StructureV3's functionality. For more information on its features and usage, see the PP-StructureV3 Pipeline Documentation.

In this pipeline, you can select the model to use based on the benchmark data below.

Model List Details

Document Image Orientation Classification Module:

| Model | Download Link | Top-1 Acc (%) | GPU Inference Time (ms) [Regular / High-Performance] | CPU Inference Time (ms) [Regular / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-LCNet_x1_0_doc_ori | Inference Model / Pretrained Model | 99.06 | 2.62 / 0.59 | 3.24 / 1.19 | 7 | A document image classification model based on PP-LCNet_x1_0, with four classes: 0°, 90°, 180°, and 270° |

Text Image Unwarping Module:

| Model | Download Link | CER | Model Size (M) | Description |
|---|---|---|---|---|
| UVDoc | Inference Model / Pretrained Model | 0.179 | 30.3 | High-accuracy text image unwarping model |

Layout Detection Module Models:

| Model | Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Regular / High-Performance] | CPU Inference Time (ms) [Regular / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-DocLayout_plus-L | Inference Model / Pretrained Model | 83.2 | 53.03 / 17.23 | 634.62 / 378.32 | 126.01 | High-accuracy layout detection model based on RT-DETR-L, trained on a custom dataset covering scenarios such as Chinese/English papers, multi-column magazines, newspapers, PPTs, contracts, books, exams, research reports, ancient books, Japanese documents, and vertical-text documents |
| PP-DocLayout-L | Inference Model / Pretrained Model | 90.4 | 33.59 / 33.59 | 503.01 / 251.08 | 123.76 | High-accuracy layout detection model based on RT-DETR-L, trained on a custom dataset covering papers, magazines, contracts, books, exams, and research reports |
| PP-DocLayout-M | Inference Model / Pretrained Model | 75.2 | 13.03 / 4.72 | 43.39 / 24.44 | 22.578 | Layout detection model with balanced accuracy and efficiency, based on PicoDet-L, trained on a custom dataset covering papers, magazines, contracts, books, exams, and research reports |
| PP-DocLayout-S | Inference Model / Pretrained Model | 70.9 | 11.54 / 3.86 | 18.53 / 6.29 | 4.834 | High-efficiency layout detection model based on PicoDet-S, trained on a custom dataset covering papers, magazines, contracts, books, exams, and research reports |

Table Structure Recognition Module:

| Model | Download Link | Accuracy (%) | GPU Inference Time (ms) [Regular / High-Performance] | CPU Inference Time (ms) [Regular / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| SLANeXt_wired / SLANeXt_wireless | Inference Model / Pretrained Model | 69.65 | 85.92 / 85.92 | - / 501.66 | 351 | The SLANeXt series is a next-generation table structure recognition model developed by the Baidu PaddlePaddle Vision Team. Compared with SLANet and SLANet_plus, SLANeXt focuses on recognizing table structure, with dedicated weights for wired and wireless tables, significantly improving performance, especially on wired tables. (The published metrics cover both variants.) |

Table Classification Module Models:

| Model | Download Link | Top-1 Acc (%) | GPU Inference Time (ms) [Regular / High-Performance] | CPU Inference Time (ms) [Regular / High-Performance] | Model Size (M) |
|---|---|---|---|---|---|
| PP-LCNet_x1_0_table_cls | Inference Model / Pretrained Model | 94.2 | 2.62 / 0.60 | 3.17 / 1.14 | 6.6 |

Table Cell Detection Module Models:

| Model | Download Link | mAP (%) | GPU Inference Time (ms) [Regular / High-Performance] | CPU Inference Time (ms) [Regular / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| RT-DETR-L_wired_table_cell_det / RT-DETR-L_wireless_table_cell_det | Inference Model / Pretrained Model | 82.7 | 33.47 / 27.02 | 402.55 / 256.56 | 124 | RT-DETR is the first real-time end-to-end object detection model. The Baidu PaddlePaddle Vision Team used RT-DETR-L as the base model and pre-trained it on a custom table cell detection dataset, achieving strong performance on both wired and wireless tables. (The published metrics cover both variants.) |

Text Detection Module:

| Model | Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Regular / High-Performance] | CPU Inference Time (ms) [Regular / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv5_server_det | Inference Model / Pretrained Model | 83.8 | 89.55 / 70.19 | 383.15 / 383.15 | 84.3 | PP-OCRv5 server-side text detection model with higher accuracy, suitable for deployment on high-performance servers |
| PP-OCRv5_mobile_det | Inference Model / Pretrained Model | 79.0 | 10.67 / 6.36 | 57.77 / 28.15 | 4.7 | PP-OCRv5 mobile-side text detection model with higher efficiency, suitable for edge device deployment |
| PP-OCRv4_server_det | Inference Model / Pretrained Model | 69.2 | 127.82 / 98.87 | 585.95 / 489.77 | 109 | PP-OCRv4 server-side text detection model with higher accuracy, suitable for deployment on high-performance servers |
| PP-OCRv4_mobile_det | Inference Model / Pretrained Model | 63.8 | 9.87 / 4.17 | 56.60 / 20.79 | 4.7 | PP-OCRv4 mobile-side text detection model with higher efficiency, suitable for edge device deployment |
| PP-OCRv3_mobile_det | Inference Model / Pretrained Model | Accuracy similar to PP-OCRv4_mobile_det | 9.90 / 3.60 | 41.93 / 20.76 | 2.1 | PP-OCRv3 mobile-side text detection model with higher efficiency, suitable for edge device deployment |
| PP-OCRv3_server_det | Inference Model/ Pretrained Model | Accuracy similar to PP-OCRv4_server_det | 119.50 / 75.00 | 379.35 / 318.35 | 102.1 | PP-OCRv3 server-side text detection model with higher accuracy, suitable for deployment on high-performance servers |

Text Recognition Module Models:

* Chinese Recognition Models
| Model | Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Regular / High-Performance] | CPU Inference Time (ms) [Regular / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv5_server_rec | Inference Model / Training Model | 86.38 | 8.46 / 2.36 | 31.21 / 31.21 | 81 | PP-OCRv5_rec is a next-generation text recognition model. It aims to efficiently and accurately support four major languages (Simplified Chinese, Traditional Chinese, English, and Japanese) as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters. While maintaining recognition performance, it balances inference speed and model robustness, providing efficient and precise technical support for document understanding in various scenarios. |
| PP-OCRv5_mobile_rec | Inference Model / Training Model | 81.29 | 5.43 / 1.46 | 21.20 / 5.32 | 16 | (Shares the PP-OCRv5_rec description above) |
| PP-OCRv4_server_rec_doc | Inference Model / Training Model | 86.58 | 8.69 / 2.78 | 37.93 / 37.93 | 74.7 | PP-OCRv4_server_rec_doc is trained on a mix of additional Chinese document data and PP-OCR training data, based on PP-OCRv4_server_rec. It enhances recognition of Traditional Chinese, Japanese, and special characters, supporting 15,000+ characters. Besides document-related text, it also improves general text recognition. |
| PP-OCRv4_mobile_rec | Inference Model / Training Model | 78.74 | 5.26 / 1.12 | 17.48 / 3.61 | 10.6 | The lightweight recognition model of PP-OCRv4, with high inference efficiency, deployable on various hardware devices including edge devices. |
| PP-OCRv4_server_rec | Inference Model / Training Model | 80.61 | 8.75 / 2.49 | 36.93 / 36.93 | 71.2 | The server-side model of PP-OCRv4, with high inference accuracy, deployable on various servers. |
| PP-OCRv3_mobile_rec | Inference Model / Training Model | 72.96 | 3.89 / 1.16 | 8.72 / 3.56 | 9.2 | The lightweight recognition model of PP-OCRv3, with high inference efficiency, deployable on various hardware devices including edge devices. |
| ch_SVTRv2_rec | Inference Model / Training Model | 68.81 | 10.38 / 8.31 | 66.52 / 30.83 | 73.9 | SVTRv2 is a server-side text recognition model developed by the OpenOCR team of Fudan University's Vision and Learning Lab (FVL). It won first prize in the PaddleOCR Algorithm Model Challenge, Task 1: OCR End-to-End Recognition, improving end-to-end recognition accuracy on Leaderboard A by 6% over PP-OCRv4. |
| ch_RepSVTR_rec | Inference Model / Training Model | 65.07 | 6.29 / 1.57 | 20.64 / 5.40 | 22.1 | RepSVTR is a mobile text recognition model based on SVTRv2. It won first prize in the PaddleOCR Algorithm Model Challenge, Task 1: OCR End-to-End Recognition, improving end-to-end recognition accuracy on Leaderboard B by 2.5% over PP-OCRv4, with comparable inference speed. |
* English Recognition Models
| Model | Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Regular / High-Performance] | CPU Inference Time (ms) [Regular / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| en_PP-OCRv4_mobile_rec | Inference Model / Training Model | 70.39 | 4.81 / 1.23 | 17.20 / 4.18 | 6.8 | An ultra-lightweight English recognition model trained based on the PP-OCRv4 recognition model, supporting English and numeric recognition |
| en_PP-OCRv3_mobile_rec | Inference Model / Training Model | 70.69 | 3.56 / 0.78 | 8.44 / 5.78 | 7.8 | An ultra-lightweight English recognition model trained based on the PP-OCRv3 recognition model, supporting English and numeric recognition |
* Multilingual Recognition Models
| Model | Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Regular / High-Performance] | CPU Inference Time (ms) [Regular / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| korean_PP-OCRv3_mobile_rec | Inference Model / Training Model | 60.21 | 3.73 / 0.98 | 8.76 / 2.91 | 8.6 | An ultra-lightweight Korean recognition model trained based on the PP-OCRv3 recognition model, supporting Korean and numeric recognition |
| japan_PP-OCRv3_mobile_rec | Inference Model / Training Model | 45.69 | 3.86 / 1.01 | 8.62 / 2.92 | 8.8 | An ultra-lightweight Japanese recognition model trained based on the PP-OCRv3 recognition model, supporting Japanese and numeric recognition |
| chinese_cht_PP-OCRv3_mobile_rec | Inference Model / Training Model | 82.06 | 3.90 / 1.16 | 9.24 / 3.18 | 9.7 | An ultra-lightweight Traditional Chinese recognition model trained based on the PP-OCRv3 recognition model, supporting Traditional Chinese and numeric recognition |
| te_PP-OCRv3_mobile_rec | Inference Model / Training Model | 95.88 | 3.59 / 0.81 | 8.28 / 6.21 | 7.8 | An ultra-lightweight Telugu recognition model trained based on the PP-OCRv3 recognition model, supporting Telugu and numeric recognition |
| ka_PP-OCRv3_mobile_rec | Inference Model / Training Model | 96.96 | 3.49 / 0.89 | 8.63 / 2.77 | 8.0 | An ultra-lightweight Kannada recognition model trained based on the PP-OCRv3 recognition model, supporting Kannada and numeric recognition |
| ta_PP-OCRv3_mobile_rec | Inference Model / Training Model | 76.83 | 3.49 / 0.86 | 8.35 / 3.41 | 8.0 | An ultra-lightweight Tamil recognition model trained based on the PP-OCRv3 recognition model, supporting Tamil and numeric recognition |
| latin_PP-OCRv3_mobile_rec | Inference Model / Training Model | 76.93 | 3.53 / 0.78 | 8.50 / 6.83 | 7.8 | An ultra-lightweight Latin recognition model trained based on the PP-OCRv3 recognition model, supporting Latin and numeric recognition |
| arabic_PP-OCRv3_mobile_rec | Inference Model / Training Model | 73.55 | 3.60 / 0.83 | 8.44 / 4.69 | 7.8 | An ultra-lightweight Arabic script recognition model trained based on the PP-OCRv3 recognition model, supporting Arabic script and numeric recognition |
| cyrillic_PP-OCRv3_mobile_rec | Inference Model / Training Model | 94.28 | 3.56 / 0.79 | 8.22 / 2.76 | 7.9 | An ultra-lightweight Cyrillic script recognition model trained based on the PP-OCRv3 recognition model, supporting Cyrillic script and numeric recognition |
| devanagari_PP-OCRv3_mobile_rec | Inference Model / Training Model | 96.44 | 3.60 / 0.78 | 6.95 / 2.87 | 7.9 | An ultra-lightweight Devanagari script recognition model trained based on the PP-OCRv3 recognition model, supporting Devanagari script and numeric recognition |

Text Line Orientation Classification Module (Optional):

| Model | Download Link | Top-1 Acc (%) | GPU Inference Time (ms) [Regular / High-Performance] | CPU Inference Time (ms) [Regular / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-LCNet_x0_25_textline_ori | Inference Model / Training Model | 95.54 | 2.16 / 0.41 | 2.37 / 0.73 | 0.32 | A text line classification model based on PP-LCNet_x0_25, with two classes: 0 degrees and 180 degrees |

Formula Recognition Module:

| Model | Download Link | Avg-BLEU (%) | GPU Inference Time (ms) [Regular / High-Performance] | CPU Inference Time (ms) [Regular / High-Performance] | Model Size | Description |
|---|---|---|---|---|---|---|
| UniMERNet | Inference Model / Training Model | 86.13 | 2266.96 / - | - / - | 1.4 G | UniMERNet is a formula recognition model developed by Shanghai AI Lab. It uses Donut Swin as the encoder and MBartDecoder as the decoder. Trained on a dataset of one million samples, including simple, complex, scanned, and handwritten formulas, it significantly improves recognition accuracy for real-world scenarios. |
| PP-FormulaNet-S | Inference Model / Training Model | 87.12 | 1311.84 / 1311.84 | - / 8288.07 | 167.9 M | PP-FormulaNet is an advanced formula recognition model developed by Baidu's PaddlePaddle Vision team, supporting 50,000 common LaTeX vocabulary items. The S version uses PP-HGNetV2-B4 as its backbone and employs techniques such as parallel masking and model distillation to significantly improve inference speed while maintaining high recognition accuracy; it is suited to simple printed formulas and cross-line simple printed formulas. |
| PP-FormulaNet-L | Inference Model / Training Model | 92.13 | 1976.52 / - | - / - | 535.2 M | The L version of PP-FormulaNet uses Vary_VIT_B as its backbone and is trained on a large-scale formula dataset. Compared with PP-FormulaNet-S, it markedly improves recognition of complex formulas and is suited to simple printed, complex printed, and handwritten formulas. |
| LaTeX_OCR_rec | Inference Model / Training Model | 71.63 | 1088.89 / 1088.89 | - / - | 89.7 M | LaTeX-OCR is a formula recognition algorithm based on an autoregressive large model. Using Hybrid ViT as the backbone and a transformer as the decoder, it significantly improves formula recognition accuracy. |

Seal Text Detection Module:

| Model | Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Regular / High-Performance] | CPU Inference Time (ms) [Regular / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv4_server_seal_det | Inference Model / Training Model | 98.21 | 124.64 / 91.57 | 545.68 / 439.86 | 109 | The server-side seal text detection model of PP-OCRv4, with higher accuracy, suitable for deployment on high-performance servers |
| PP-OCRv4_mobile_seal_det | Inference Model / Training Model | 96.47 | 9.70 / 3.56 | 50.38 / 19.64 | 4.6 | The mobile-side seal text detection model of PP-OCRv4, with higher efficiency, suitable for deployment on edge devices |
Testing Environment Description:
  • Performance Testing Environment
    • Test Datasets:
      • Document Image Orientation Classification Model: A dataset built by PaddleX, covering multiple scenarios such as IDs and documents, containing 1,000 images.
      • Text Image Unwarping Model: DocUNet.
      • Layout Detection Model: A layout analysis dataset built by PaddleOCR, containing 10,000 common document-type images such as Chinese and English papers, magazines, and reports.
      • PP-DocLayout_plus-L: A layout detection dataset built by PaddleOCR, containing 1,300 document-type images such as Chinese and English papers, magazines, newspapers, reports, PPTs, exams, and textbooks.
      • Table Structure Recognition Model: An internal English table recognition dataset built by PaddleX.
      • Text Detection Model: A Chinese dataset built by PaddleOCR, covering street views, web images, documents, and handwriting, with 500 images for detection.
      • Chinese Recognition Model: A Chinese dataset built by PaddleOCR, covering street views, web images, documents, and handwriting, with 11,000 images for text recognition.
      • ch_SVTRv2_rec: Evaluation set of the PaddleOCR Algorithm Model Challenge, Task 1: OCR End-to-End Recognition, Leaderboard A.
      • ch_RepSVTR_rec: Evaluation set of the PaddleOCR Algorithm Model Challenge, Task 1: OCR End-to-End Recognition, Leaderboard B.
      • English Recognition Model: An English dataset built by PaddleX.
      • Multilingual Recognition Model: A multilingual dataset built by PaddleX.
      • Text Line Orientation Classification Model: A dataset built by PaddleX, covering multiple scenarios such as IDs and documents, containing 1,000 images.
      • Seal Text Recognition Model: A dataset built by PaddleX, containing 500 circular seal images.
    • Hardware Configuration:
      • GPU: NVIDIA Tesla T4
      • CPU: Intel Xeon Gold 6271C @ 2.60GHz
    • Software Environment:
      • Ubuntu 20.04 / CUDA 11.8 / cuDNN 8.9 / TensorRT 8.6.1.6
      • paddlepaddle 3.0.0 / paddleocr 3.0.3
  • Inference Mode Description
| Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination |
|---|---|---|---|
| Regular Mode | FP32 precision / no TRT acceleration | FP32 precision / 8 threads | PaddleInference |
| High-Performance Mode | Optimal combination of precision types and acceleration strategies | FP32 precision / 8 threads | Optimal backend selection (Paddle/OpenVINO/TRT, etc.) |

2. Quick Start

Before using the PP-DocTranslation pipeline locally, please ensure that you have completed the installation of the wheel package according to the Installation Tutorial.

Please note: if the program becomes unresponsive, terminates unexpectedly, runs out of memory, or infers extremely slowly, try adjusting the configuration as described in the documentation, for example by disabling unneeded features or switching to lighter-weight models.

Before use, you also need to prepare the API key for a large language model. Both the Baidu Cloud Qianfan Platform and self-hosted large model services that comply with the OpenAI interface standard are supported.
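For reference, the LLM connection is described by a plain Python dict (passed later in this tutorial as `chat_bot_config`). A minimal sketch of both options follows; the Qianfan values mirror the sample code in Section 2.2, while the local-endpoint values are hypothetical placeholders you must replace with your own service details:

```python
# Qianfan platform (values mirror the sample code in Section 2.2):
qianfan_chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "your_api_key",  # your Qianfan API key
}

# Local service following the OpenAI interface standard
# (base_url, model_name, and api_key below are hypothetical placeholders):
local_chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "your-local-model-name",
    "base_url": "http://127.0.0.1:8080/v1",
    "api_type": "openai",
    "api_key": "your_api_key_or_any_placeholder",
}
```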

2.1 Experience via Command Line

You can download the test file and quickly experience the pipeline effect with a single command:

```bash
paddleocr pp_doctranslation -i vehicle_certificate-1.png --target_language en --qianfan_api_key your_api_key
```
The command line supports additional parameter settings; they are described in detail below.
Parameter Description Type Default Value
input Data to be predicted, required. For example: the local path of an image or PDF file, e.g. /root/data/img.jpg; the URL of an image or PDF file; or a local directory containing the images to be predicted, e.g. /root/data/ (PDF files inside a directory are not currently supported; a PDF must be specified by its exact file path). str
save_path Specifies the path to save the inference result files. If not set, inference results will not be saved locally. str
target_language Target language (ISO 639-1 language code). str zh
layout_detection_model_name Model name for layout detection. If not set, the pipeline default model will be used. str
layout_detection_model_dir Directory path of the layout detection model. If not set, the official model will be downloaded. str
layout_threshold Score threshold for layout model. Any float between 0-1. If not set, the pipeline initialized value will be used, default initialized as 0.5. float
layout_nms Whether to use post-processing NMS in layout detection. If not set, the pipeline initialized value will be used, default initialized as True. bool
layout_unclip_ratio Expansion coefficient for detection boxes in layout detection model. Any float greater than 0. If not set, the pipeline initialized value will be used, default initialized as 1.0. float
layout_merge_bboxes_mode Mode for merging detection boxes output by the layout detection model.
  • large: when set to large, among overlapping boxes, only the largest outer box is kept and the overlapping inner boxes are deleted;
  • small: when set to small, among overlapping boxes, only the smaller inner boxes are kept and the overlapping outer boxes are deleted;
  • union: no box filtering, both inner and outer boxes are kept;
If not set, the pipeline initialized value will be used, default initialized as large.
str
chart_recognition_model_name Model name for chart parsing. If not set, the pipeline default model will be used. str
chart_recognition_model_dir Directory path for chart parsing model. If not set, the official model will be downloaded. str
chart_recognition_batch_size Batch size for chart parsing model. If not set, batch size defaults to 1. int
region_detection_model_name Model name for region detection. If not set, the pipeline default model will be used. str
region_detection_model_dir Directory path for region detection model. If not set, the official model will be downloaded. str
doc_orientation_classify_model_name Model name for document orientation classification. If not set, the pipeline default model will be used. str
doc_orientation_classify_model_dir Directory path for document orientation classification model. If not set, the official model will be downloaded. str
doc_unwarping_model_name Model name for text image unwarping. If not set, the pipeline default model will be used. str
doc_unwarping_model_dir Directory path for text image unwarping model. If not set, the official model will be downloaded. str
text_detection_model_name Model name for text detection. If not set, the pipeline default model will be used. str
text_detection_model_dir Directory path for text detection model. If not set, the official model will be downloaded. str
text_det_limit_side_len Image side length limit for text detection. Any integer greater than 0. If not set, the pipeline initialized value will be used, default initialized as 960. int
text_det_limit_type Type of image side length limit for text detection. Supports min and max. min means ensuring the shortest side of the image is not less than det_limit_side_len, max means ensuring the longest side of the image is not greater than limit_side_len. If not set, the pipeline initialized value will be used, default initialized as max. str
text_det_thresh Detection pixel threshold. In the output probability map, pixels with score greater than this threshold are considered text pixels. Any float greater than 0. If not set, the pipeline initialized value 0.3 will be used by default. float
text_det_box_thresh Detection box threshold. If the average score of all pixels within the detected bounding box is greater than this threshold, the result is considered a text region. Any float greater than 0. If not set, the pipeline initialized value 0.6 will be used by default. float
text_det_unclip_ratio Text detection expansion coefficient, used to expand text regions. The larger the value, the larger the expansion area. Any float greater than 0. If not set, the pipeline initialized value 2.0 will be used by default. float
textline_orientation_model_name Model name for textline orientation. If not set, the pipeline default model will be used. str
textline_orientation_model_dir Directory path for textline orientation model. If not set, the official model will be downloaded. str
textline_orientation_batch_size Batch size for textline orientation model. If not set, batch size defaults to 1. int
text_recognition_model_name Model name for text recognition. If not set, the pipeline default model will be used. str
text_recognition_model_dir Directory path for text recognition model. If not set, the official model will be downloaded. str
text_recognition_batch_size Batch size for text recognition model. If not set, batch size defaults to 1. int
text_rec_score_thresh Text recognition threshold. Text results with scores greater than this threshold will be kept. Any float greater than 0. If not set, the pipeline initialized value 0.0 will be used, meaning no threshold. float
table_classification_model_name Model name for table classification. If not set, the pipeline default model will be used. str
table_classification_model_dir Directory path for table classification model. If not set, the official model will be downloaded. str
wired_table_structure_recognition_model_name Model name for wired table structure recognition. If not set, the pipeline default model will be used. str
wired_table_structure_recognition_model_dir Directory path for wired table structure recognition model. If not set, the official model will be downloaded. str
wireless_table_structure_recognition_model_name Model name for wireless table structure recognition. If not set, the pipeline default model will be used. str
wireless_table_structure_recognition_model_dir Directory path for wireless table structure recognition model. If not set, the official model will be downloaded. str
wired_table_cells_detection_model_name Model name for wired table cells detection. If not set, the pipeline default model will be used. str
wired_table_cells_detection_model_dir Directory path for wired table cells detection model. If not set, the official model will be downloaded. str
wireless_table_cells_detection_model_name Model name for wireless table cells detection. If not set, the pipeline default model will be used. str
wireless_table_cells_detection_model_dir Directory path for wireless table cells detection model. If not set, the official model will be downloaded. str
table_orientation_classify_model_name Model name for table orientation classification. If not set, the pipeline default model will be used. str
table_orientation_classify_model_dir Directory path for table orientation classification model. If not set, the official model will be downloaded. str
seal_text_detection_model_name Model name for seal text detection. If not set, the pipeline default model will be used. str
seal_text_detection_model_dir Directory path for seal text detection model. If not set, the official model will be downloaded. str
seal_det_limit_side_len Image side length limit for seal text detection. Any integer greater than 0. If not set, the pipeline initialized value will be used, default initialized as 736. int
seal_det_limit_type Type of image side length limit for seal text detection. Supports min and max. min means ensuring the shortest side of the image is not less than det_limit_side_len, max means ensuring the longest side is not greater than limit_side_len. If not set, the pipeline initialized value will be used, default initialized as min. str
seal_det_thresh Detection pixel threshold. In the output probability map, pixels with score greater than this threshold are considered text pixels. Any float greater than 0. If not set, the pipeline initialized value 0.2 will be used by default. float
seal_det_box_thresh Detection box threshold. If the average score of all pixels within the detected bounding box is greater than this threshold, the result is considered a text region. Any float greater than 0. If not set, the pipeline initialized value 0.6 will be used by default. float
seal_det_unclip_ratio Expansion coefficient for seal text detection. This method is used to expand the text region; the larger the value, the larger the expansion area. Any float greater than 0. If not set, the pipeline initialized value 0.5 will be used by default. float
seal_text_recognition_model_name Model name for seal text recognition. If not set, the pipeline default model will be used. str
seal_text_recognition_model_dir Directory path for seal text recognition model. If not set, the official model will be downloaded. str
seal_text_recognition_batch_size Batch size for seal text recognition model. If not set, batch size defaults to 1. int
seal_rec_score_thresh Text recognition threshold. Text results with scores greater than this threshold will be kept. Any float greater than 0. If not set, the pipeline initialized value 0.0 will be used, meaning no threshold. float
formula_recognition_model_name Model name for formula recognition. If not set, the pipeline default model will be used. str
formula_recognition_model_dir Directory path for formula recognition model. If not set, the official model will be downloaded. str
formula_recognition_batch_size Batch size of the formula recognition model. If not set, the batch size defaults to 1. int
use_doc_orientation_classify Whether to load and use the document orientation classification module. If not set, the pipeline initialized value will be used, default is False. bool
use_doc_unwarping Whether to load and use the text image unwarping module. If not set, the pipeline initialized value will be used, default is False. bool
use_textline_orientation Whether to load and use the text line orientation classification module. If not set, the pipeline initialized value will be used, default is True. bool
use_seal_recognition Whether to load and use the seal text recognition sub-pipeline. If not set, the pipeline initialized value will be used, default is True. bool
use_table_recognition Whether to load and use the table recognition sub-pipeline. If not set, the pipeline initialized value will be used, default is True. bool
use_formula_recognition Whether to load and use the formula recognition sub-pipeline. If not set, the pipeline initialized value will be used, default is True. bool
use_chart_recognition Whether to load and use the chart parsing module. If not set, the pipeline initialized value will be used, default is False. bool
use_region_detection Whether to load and use the region detection module. If not set, the pipeline initialized value will be used, default is True. bool
qianfan_api_key API key for the Qianfan platform. str
device Device used for inference. A specific card number can be specified:
  • CPU: e.g. cpu means using the CPU for inference;
  • GPU: e.g. gpu:0 means using the first GPU for inference;
  • NPU: e.g. npu:0 means using the first NPU for inference;
  • XPU: e.g. xpu:0 means using the first XPU for inference;
  • MLU: e.g. mlu:0 means using the first MLU for inference;
  • DCU: e.g. dcu:0 means using the first DCU for inference;
If not set, the pipeline's initialized value is used: at initialization, local GPU device 0 is preferred; if unavailable, the CPU is used.
str
enable_hpi Whether to enable high-performance inference. bool False
use_tensorrt Whether to enable the TensorRT subgraph engine of Paddle Inference. If the model does not support acceleration by TensorRT, enabling this flag will not enable acceleration.
For PaddlePaddle with CUDA 11.8, compatible TensorRT version is 8.x (x≥6), recommended TensorRT version is 8.6.1.6.
bool False
precision Computation precision, e.g. fp32, fp16. str fp32
enable_mkldnn Whether to enable MKL-DNN accelerated inference. If MKL-DNN is unavailable or the model does not support acceleration via MKL-DNN, enabling this flag will not enable acceleration. bool True
mkldnn_cache_capacity MKL-DNN cache capacity. int 10
cpu_threads Number of threads used for inference on CPU. int 8
paddlex_config Path to PaddleX pipeline configuration file. str


The execution results will be printed to the terminal.

2.2 Integration via Python Script

The command-line method is for quickly experiencing and viewing the results; in projects, integration via code is usually required. You can download the test file and use the following sample code for inference:

```python
from paddleocr import PPDocTranslation

# Create a translation pipeline
pipeline = PPDocTranslation()

# Document path
input_path = "document_sample.pdf"

# Output directory
output_path = "./output"

# Large model configuration
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

if input_path.lower().endswith(".md"):
    # Read markdown documents, supporting passing in directories and url links with the .md suffix
    ori_md_info_list = pipeline.load_from_markdown(input_path)
else:
    # Use PP-StructureV3 to perform layout parsing on PDF/image documents to obtain markdown information
    visual_predict_res = pipeline.visual_predict(
        input_path,
        use_doc_orientation_classify=False,
        use_doc_unwarping=False,
        use_common_ocr=True,
        use_seal_recognition=True,
        use_table_recognition=True,
    )

    ori_md_info_list = []
    for res in visual_predict_res:
        layout_parsing_result = res["layout_parsing_result"]
        ori_md_info_list.append(layout_parsing_result.markdown)
        layout_parsing_result.save_to_img(output_path)
        layout_parsing_result.save_to_markdown(output_path)

    # Concatenate the markdown information of multi-page documents into a single markdown file, and save the merged original markdown text
    if input_path.lower().endswith(".pdf"):
        ori_md_info = pipeline.concatenate_markdown_pages(ori_md_info_list)
        ori_md_info.save_to_markdown(output_path)

# Perform document translation (target language: English)
tgt_md_info_list = pipeline.translate(
    ori_md_info_list=ori_md_info_list,
    target_language="en",
    chunk_size=5000,
    chat_bot_config=chat_bot_config,
)
# Save the translation results
for tgt_md_info in tgt_md_info_list:
    tgt_md_info.save_to_markdown(output_path)
```

After executing the above code, the output directory will contain the layout parsing results of the original document, the Markdown file of the original text to be translated, and the Markdown file of the translated document.
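If your input is already a Markdown document, the `.md` branch of the sample above condenses to the following sketch. The file name is a hypothetical example; `pipeline` and `chat_bot_config` are the objects defined in the sample:

```python
# Translate an existing Markdown file directly (sketch).
ori_md_info_list = pipeline.load_from_markdown("document_sample.md")
tgt_md_info_list = pipeline.translate(
    ori_md_info_list=ori_md_info_list,
    target_language="en",
    chunk_size=5000,
    chat_bot_config=chat_bot_config,
)
for tgt_md_info in tgt_md_info_list:
    tgt_md_info.save_to_markdown("./output")
```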

The process, API description, and output description of PP-DocTranslation prediction are as follows:

(1) Instantiate the PP-DocTranslation pipeline object by calling PPDocTranslation. Relevant parameter descriptions are as follows:
Parameter Description Type Default Value
layout_detection_model_name The model name for layout detection. If set to None, the pipeline's default model will be used. str|None None
layout_detection_model_dir The directory path of the layout detection model. If set to None, the official model will be downloaded. str|None None
layout_threshold Score threshold for the layout model.
  • float: Any float between 0-1;
  • dict: {0:0.1}, where the key is the class ID and the value is the threshold for that class;
  • None: If set to None, the pipeline's initialized value will be used, defaulting to 0.5.
float|dict|None None
layout_nms Whether to use post-processing NMS for layout detection. If set to None, the pipeline's initialized value will be used, defaulting to True. bool|None None
layout_unclip_ratio Expansion coefficient for detection boxes in the layout detection model.
  • float: Any float greater than 0;
  • Tuple[float,float]: Expansion coefficients in horizontal and vertical directions respectively;
  • dict: Keys are int representing cls_id, values are tuple, e.g. {0: (1.1, 2.0)}, meaning for class 0 detection boxes, center remains unchanged, width expanded by 1.1 times, height expanded by 2.0 times;
  • None: If set to None, the pipeline's initialized value will be used, defaulting to 1.0.
float|Tuple[float,float]|dict|None None
layout_merge_bboxes_mode Overlap box filtering method for layout detection.
  • str: large, small, union, indicating whether to keep the larger box, smaller box, or both during overlap filtering;
  • dict: Keys are int cls_id, values are str, e.g. {0: "large", 2: "small"}, meaning use "large" mode for class 0 boxes and "small" mode for class 2 boxes;
  • None: If set to None, the pipeline's initialized value will be used, defaulting to large.
str|dict|None None
chart_recognition_model_name The model name for chart parsing. If set to None, the pipeline's default model will be used. str|None None
chart_recognition_model_dir The directory path of the chart parsing model. If set to None, the official model will be downloaded. str|None None
chart_recognition_batch_size Batch size for the chart parsing model. If set to None, batch size defaults to 1. int|None None
region_detection_model_name The model name for region detection. If set to None, the pipeline's default model will be used. str|None None
region_detection_model_dir The directory path of the region detection model. If set to None, the official model will be downloaded. str|None None
doc_orientation_classify_model_name The model name for document orientation classification. If set to None, the pipeline's default model will be used. str|None None
doc_orientation_classify_model_dir The directory path of the document orientation classification model. If set to None, the official model will be downloaded. str|None None
doc_unwarping_model_name The model name for text image unwarping. If set to None, the pipeline's default model will be used. str|None None
doc_unwarping_model_dir The directory path of the text image unwarping model. If set to None, the official model will be downloaded. str|None None
text_detection_model_name The model name for text detection. If set to None, the pipeline's default model will be used. str|None None
text_detection_model_dir The directory path of the text detection model. If set to None, the official model will be downloaded. str|None None
text_det_limit_side_len Image side length limit for text detection.
  • int: Any integer greater than 0;
  • None: If set to None, the pipeline's initialized value will be used, defaulting to 960.
int|None None
text_det_limit_type Type of image side length limit for text detection.
  • str: Supports min and max, where min means ensuring the shortest side of the image is not less than det_limit_side_len, and max means ensuring the longest side is not greater than limit_side_len;
  • None: If set to None, the pipeline's initialized value will be used, defaulting to max.
str|None None
text_det_thresh Pixel threshold for detection; pixels in the output probability map with scores above this threshold are considered text pixels.
  • float: Any float greater than 0;
  • None: If set to None, the pipeline's initialized value of 0.3 will be used.
float|None None
text_det_box_thresh Detection box threshold; when the average score of all pixels inside a detected box exceeds this threshold, it is considered a text region.
  • float: Any float greater than 0;
  • None: If set to None, the pipeline's initialized value of 0.6 will be used.
float|None None
text_det_unclip_ratio Expansion coefficient for text detection; this method expands the text region, and the larger the value, the larger the expansion area.
  • float: Any float greater than 0;
  • None: If set to None, the pipeline's initialized value of 2.0 will be used.
float|None None
textline_orientation_model_name The model name for text line orientation classification. If set to None, the pipeline's default model will be used. str|None None
textline_orientation_model_dir The directory path of the text line orientation model. If set to None, the official model will be downloaded. str|None None
textline_orientation_batch_size Batch size for the text line orientation model. If set to None, batch size defaults to 1. int|None None
text_recognition_model_name The model name for text recognition. If set to None, the pipeline's default model will be used. str|None None
text_recognition_model_dir The directory path of the text recognition model. If set to None, the official model will be downloaded. str|None None
text_recognition_batch_size Batch size for the text recognition model. If set to None, batch size defaults to 1. int|None None
text_rec_score_thresh Text recognition threshold; text results with scores greater than this threshold will be retained.
  • float: Any float greater than 0;
  • None: If set to None, the pipeline's initialized value of 0.0 (no threshold) will be used.
float|None None
table_classification_model_name The model name for table classification. If set to None, the pipeline's default model will be used. str|None None
table_classification_model_dir The directory path of the table classification model. If set to None, the official model will be downloaded. str|None None
wired_table_structure_recognition_model_name The model name for wired table structure recognition. If set to None, the pipeline's default model will be used. str|None None
wired_table_structure_recognition_model_dir The directory path of the wired table structure recognition model. If set to None, the official model will be downloaded. str|None None
wireless_table_structure_recognition_model_name The model name for wireless table structure recognition. If set to None, the pipeline's default model will be used. str|None None
wireless_table_structure_recognition_model_dir The directory path of the wireless table structure recognition model. If set to None, the official model will be downloaded. str|None None
wired_table_cells_detection_model_name The model name for wired table cell detection. If set to None, the pipeline's default model will be used. str|None None
wired_table_cells_detection_model_dir The directory path of the wired table cell detection model. If set to None, the official model will be downloaded. str|None None
wireless_table_cells_detection_model_name The model name for wireless table cell detection. If set to None, the pipeline's default model will be used. str|None None
wireless_table_cells_detection_model_dir The directory path of the wireless table cell detection model. If set to None, the official model will be downloaded. str|None None
table_orientation_classify_model_name The model name for table orientation classification. If set to None, the pipeline's default model will be used. str|None None
table_orientation_classify_model_dir The directory path of the table orientation classification model. If set to None, the official model will be downloaded. str|None None
seal_text_detection_model_name The model name for seal text detection. If set to None, the pipeline's default model will be used. str|None None
seal_text_detection_model_dir The directory path of the seal text detection model. If set to None, the official model will be downloaded. str|None None
seal_det_limit_side_len Image side length limit for seal text detection.
  • int: any integer greater than 0;
  • None: if set to None, the parameter value initialized by the pipeline will be used, with a default initialization of 736.
int|None None
seal_det_limit_type Type of image side length limit for seal text detection.
  • str: supports min and max, where min ensures the shortest image side is not less than det_limit_side_len, and max ensures the longest image side is not greater than limit_side_len;
  • None: if set to None, the parameter value initialized by the pipeline will be used, with a default initialization of min.
str|None None
seal_det_thresh Detection pixel threshold. In the output probability map, pixels with scores above this threshold are considered text pixels.
  • float: any floating-point number greater than 0;
  • None: if set to None, the pipeline default parameter value 0.2 will be used.
float|None None
seal_det_box_thresh Detection box threshold. When the average score of all pixels within the detected bounding box is greater than this threshold, the result is considered a text region.
  • float: any floating-point number greater than 0;
  • None: if set to None, the pipeline default parameter value 0.6 will be used.
float|None None
seal_det_unclip_ratio Expansion coefficient for seal text detection. This method expands the text region; the larger the value, the larger the expansion area.
  • float: any floating-point number greater than 0;
  • None: if set to None, the pipeline default parameter value 0.5 will be used.
float|None None
seal_text_recognition_model_name Name of the seal text recognition model. If set to None, the pipeline default model will be used. str|None None
seal_text_recognition_model_dir Directory path for the seal text recognition model. If set to None, the official model will be downloaded. str|None None
seal_text_recognition_batch_size Batch size for the seal text recognition model. If set to None, the batch size defaults to 1. int|None None
seal_rec_score_thresh Seal text recognition threshold. Text results with scores above this threshold will be retained.
  • float: any floating-point number greater than 0;
  • None: if set to None, the pipeline default parameter value 0.0 will be used, meaning no threshold is set.
float|None None
formula_recognition_model_name Name of the formula recognition model. If set to None, the pipeline default model will be used. str|None None
formula_recognition_model_dir Directory path for the formula recognition model. If set to None, the official model will be downloaded. str|None None
formula_recognition_batch_size Batch size for the formula recognition model. If set to None, the batch size defaults to 1. int|None None
use_doc_orientation_classify Whether to load and use the document orientation classification module. If set to None, the pipeline initialized parameter value will be used, defaulting to False. bool|None None
use_doc_unwarping Whether to load and use the text image unwarping module. If set to None, the pipeline initialized parameter value will be used, defaulting to False. bool|None None
use_textline_orientation Whether to load and use the text line orientation classification module. If set to None, the pipeline initialized parameter value will be used, defaulting to True. bool|None None
use_seal_recognition Whether to load and use the seal text recognition sub-pipeline. If set to None, the pipeline initialized parameter value will be used, defaulting to True. bool|None None
use_table_recognition Whether to load and use the table recognition sub-pipeline. If set to None, the pipeline initialized parameter value will be used, defaulting to True. bool|None None
use_formula_recognition Whether to load and use the formula recognition sub-pipeline. If set to None, the pipeline initialized parameter value will be used, defaulting to True. bool|None None
use_chart_recognition Whether to load and use the chart parsing module. If set to None, the pipeline initialized parameter value will be used, defaulting to False. bool|None None
use_region_detection Whether to load and use the document region detection module. If set to None, the pipeline initialized parameter value will be used, defaulting to True. bool|None None
chat_bot_config Large language model configuration information. The configuration content is the following dict:
{
"module_name": "chat_bot",
"model_name": "ernie-3.5-8k",
"base_url": "https://qianfan.baidubce.com/v2",
"api_type": "openai",
"api_key": "api_key"  # Please set this to the actual API key
}
dict|None None
device Device used for inference. Supports specifying a specific card number:
  • CPU: e.g. cpu means using CPU for inference;
  • GPU: e.g. gpu:0 means using the first GPU for inference;
  • NPU: e.g. npu:0 means using the first NPU for inference;
  • XPU: e.g. xpu:0 means using the first XPU for inference;
  • MLU: e.g. mlu:0 means using the first MLU for inference;
  • DCU: e.g. dcu:0 means using the first DCU for inference;
  • None: if set to None, initialization will prioritize using the local GPU device 0; if unavailable, CPU will be used.
str|None None
enable_hpi Whether to enable high-performance inference. bool False
use_tensorrt Whether to enable Paddle Inference’s TensorRT subgraph engine. If the model does not support acceleration via TensorRT, enabling this flag will have no effect.
For Paddle with CUDA 11.8, the compatible TensorRT version is 8.x (x≥6), recommended installation is TensorRT 8.6.1.6.
bool False
precision Computation precision, such as fp32, fp16. str "fp32"
enable_mkldnn Whether to enable MKL-DNN accelerated inference. If MKL-DNN is unavailable or the model does not support acceleration via MKL-DNN, enabling this flag will have no effect. bool True
mkldnn_cache_capacity MKL-DNN cache capacity. int 10
cpu_threads Number of threads used during inference on CPU. int 8
paddlex_config Path to the PaddleX pipeline configuration file. str|None None
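As a concrete illustration of how these instantiation parameters fit together, the sketch below builds a lighter-weight pipeline. Every keyword argument is documented in the table above, and the model names come from the benchmark tables in Section 1; the specific combination is just one plausible configuration, not a recommendation:

```python
from paddleocr import PPDocTranslation

# A lighter-weight configuration sketch: smaller models, optional modules off.
pipeline = PPDocTranslation(
    layout_detection_model_name="PP-DocLayout-S",       # high-efficiency layout model
    text_detection_model_name="PP-OCRv5_mobile_det",    # mobile-side text detection
    text_recognition_model_name="PP-OCRv5_mobile_rec",  # mobile-side text recognition
    use_doc_orientation_classify=False,  # skip document orientation classification
    use_doc_unwarping=False,             # skip text image unwarping
    use_formula_recognition=False,       # disable the formula recognition sub-pipeline
    use_seal_recognition=False,          # disable the seal text recognition sub-pipeline
    device="gpu:0",                      # or "cpu" if no GPU is available
)
```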
(2) Call the visual_predict() method of the PP-DocTranslation pipeline object to obtain visual prediction results. The method returns a list of results. The pipeline also provides a visual_predict_iter() method: both accept the same parameters and return the same results, but visual_predict_iter() returns a generator that yields prediction results one at a time, which suits large datasets and memory-constrained scenarios. Choose either method according to your needs. The parameters of the visual_predict() method are described below:
Parameter Description Type Default
input Data to be predicted, supports multiple input types, required.
  • Python Var: image data such as numpy.ndarray;
  • str: the local path of an image or PDF file, e.g. /root/data/img.jpg; the URL of an image or PDF file; or a local directory containing the images to be predicted, e.g. /root/data/ (PDF files inside a directory are not currently supported; a PDF must be specified by its exact file path);
  • list: list elements must be one of the above types, e.g. [numpy.ndarray, numpy.ndarray], ["/root/data/img1.jpg", "/root/data/img2.jpg"], ["/root/data1", "/root/data2"].
Python Var|str|list
use_doc_orientation_classify Whether to use the document orientation classification module during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. bool|None None
use_doc_unwarping Whether to use the text image unwarping module during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. bool|None None
use_textline_orientation Whether to use the text line orientation classification module during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. bool|None None
use_seal_recognition Whether to use the seal text recognition sub-pipeline during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. bool|None None
use_table_recognition Whether to use the table recognition sub-pipeline during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. bool|None None
use_formula_recognition Whether to use the formula recognition sub-pipeline during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. bool|None None
use_chart_recognition Whether to use the chart parsing module during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. bool|None None
use_region_detection Whether to use the document layout detection module during inference. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. bool|None None
layout_threshold Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. float|dict|None None
layout_nms Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. bool|None None
layout_unclip_ratio Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. float|Tuple[float,float]|dict|None None
layout_merge_bboxes_mode Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. str|dict|None None
text_det_limit_side_len Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. int|None None
text_det_limit_type Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. str|None None
text_det_thresh Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. float|None None
text_det_box_thresh Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. float|None None
text_det_unclip_ratio Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. float|None None
text_rec_score_thresh Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. float|None None
seal_det_limit_side_len Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. int|None None
seal_det_limit_type Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. str|None None
seal_det_thresh Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. float|None None
seal_det_box_thresh Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. float|None None
seal_det_unclip_ratio Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. float|None None
seal_rec_score_thresh Parameter meaning is basically the same as the instantiated parameter. Setting to None means using the instantiated parameter, otherwise this parameter has higher priority. float|None None
use_wired_table_cells_trans_to_html Whether to enable direct conversion of wired table cell detection results to HTML. When enabled, HTML is constructed directly based on the geometric relations of wired table cell detection results. bool False
use_wireless_table_cells_trans_to_html Whether to enable direct conversion of wireless table cell detection results to HTML. When enabled, HTML is constructed directly based on the geometric relations of wireless table cell detection results. bool False
use_table_orientation_classify Whether to enable table orientation classification. When enabled, tables with 90/180/270 degree rotations in images can be corrected in orientation and correctly recognized. bool True
use_ocr_results_with_table_cells Whether to enable OCR segmentation by table cells. When enabled, OCR detection results are segmented and re-recognized based on cell prediction results to avoid missing text. bool True
use_e2e_wired_table_rec_model Whether to enable end-to-end wired table recognition mode. When enabled, the cell detection model is not used, only the table structure recognition model is used. bool False
use_e2e_wireless_table_rec_model Whether to enable end-to-end wireless table recognition mode. When enabled, the cell detection model is not used, only the table structure recognition model is used. bool True
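For long PDFs, a minimal sketch of the generator-based variant mentioned above might look like this; visual_predict_iter() accepts the same arguments as visual_predict(), and the file name here is just an example:

```python
# Stream visual prediction results page by page to keep memory usage low.
ori_md_info_list = []
for res in pipeline.visual_predict_iter("document_sample.pdf"):
    layout_parsing_result = res["layout_parsing_result"]
    ori_md_info_list.append(layout_parsing_result.markdown)
    layout_parsing_result.save_to_markdown("./output")
```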
(3) Processing visual prediction results: Each sample's prediction result is a corresponding Result object, supporting operations such as printing, saving as images, and saving as json files:
Method Description Parameter Parameter Type Parameter Description Default
print() Print results to terminal format_json bool Whether to format the output content using JSON indentation True
indent int Specify indentation level to beautify output JSON data for better readability, effective only when format_json is True 4
ensure_ascii bool Control whether non-ASCII characters are escaped as Unicode. When set to True, all non-ASCII characters will be escaped; if False, original characters are preserved. Effective only when format_json is True False
save_to_json() Save results as a JSON file save_path str File path for saving. If a directory is specified, the saved file is named consistently with the input file None
indent int Specify indentation level to beautify output JSON data for better readability, effective only when format_json is True 4
ensure_ascii bool Control whether non-ASCII characters are escaped as Unicode. When set to True, all non-ASCII characters will be escaped; if False, original characters are preserved. Effective only when format_json is True False
save_to_img() Save visualized images from intermediate modules as PNG format images save_path str File path for saving, supports directory or file path None
save_to_markdown() Save each page of image or PDF files as separate markdown files save_path str File path for saving, supports directory or file path None
save_to_html() Save tables in the file as HTML format files save_path str File path for saving, supports directory or file path None
save_to_xlsx() Save tables in the file as XLSX format files save_path str File path for saving, supports directory or file path None
- Calling the `print()` method will print the results to the terminal. The printed content is explained as follows:
  - `input_path`: `(str)` Input path of the image or PDF to be predicted
  - `page_index`: `(Union[int, None])` If the input is a PDF, this indicates the current page number; otherwise `None`
  - `model_settings`: `(Dict[str, bool])` Model parameters configured for the pipeline
    - `use_doc_preprocessor`: `(bool)` Controls whether to enable the document preprocessing sub-pipeline
    - `use_general_ocr`: `(bool)` Controls whether to enable the OCR sub-pipeline
    - `use_seal_recognition`: `(bool)` Controls whether to enable the seal text recognition sub-pipeline
    - `use_table_recognition`: `(bool)` Controls whether to enable the table recognition sub-pipeline
    - `use_formula_recognition`: `(bool)` Controls whether to enable the formula recognition sub-pipeline
  - `doc_preprocessor_res`: `(Dict[str, Union[List[float], str]])` Document preprocessing result dictionary, present only when `use_doc_preprocessor=True`
    - `input_path`: `(str)` Image path accepted by the document preprocessing sub-pipeline; saved as `None` when the input is a `numpy.ndarray`
    - `page_index`: `None` when the input is a `numpy.ndarray`
    - `model_settings`: `(Dict[str, bool])` Model configuration parameters of the document preprocessing sub-pipeline
      - `use_doc_orientation_classify`: `(bool)` Controls whether to enable the document image orientation classification sub-module
      - `use_doc_unwarping`: `(bool)` Controls whether to enable the text image unwarping sub-module
    - `angle`: `(int)` Prediction result of the document image orientation classification sub-module; the actual angle value when the sub-module is enabled
  - `parsing_res_list`: `(List[Dict])` List of parsing results, each element a dictionary; the list order is the reading order after parsing
    - `block_bbox`: `(np.ndarray)` Bounding box of the layout region
    - `block_label`: `(str)` Label of the layout region, e.g. `text`, `table`
    - `block_content`: `(str)` Content within the layout region
    - `seg_start_flag`: `(bool)` Whether this layout region is the start of a paragraph
    - `seg_end_flag`: `(bool)` Whether this layout region is the end of a paragraph
    - `sub_label`: `(str)` Sub-label of the layout region; e.g. a sub-label of `text` could be `title_text`
    - `sub_index`: `(int)` Sub-index of the layout region, used for restoring Markdown
    - `index`: `(int)` Index of the layout region, used to display layout sorting results
  - `overall_ocr_res`: `(Dict[str, Union[List[str], List[float], numpy.ndarray]])` Global OCR result dictionary
    - `input_path`: `(Union[str, None])` Image path accepted by the OCR sub-pipeline; saved as `None` when the input is a `numpy.ndarray`
    - `page_index`: `None` when the input is a `numpy.ndarray`
    - `model_settings`: `(Dict)` Model configuration parameters of the OCR sub-pipeline
    - `dt_polys`: `(List[numpy.ndarray])` List of text detection polygons; each detection box is a numpy array of 4 vertex coordinates with shape (4, 2) and dtype int16
    - `dt_scores`: `(List[float])` Confidence scores of text detection boxes
    - `text_det_params`: `(Dict[str, Dict[str, int, float]])` Configuration parameters of the text detection module
      - `limit_side_len`: `(int)` Side-length limit for image preprocessing
      - `limit_type`: `(str)` How the side-length limit is applied
      - `thresh`: `(float)` Confidence threshold for text pixel classification
      - `box_thresh`: `(float)` Confidence threshold for text detection boxes
      - `unclip_ratio`: `(float)` Expansion factor for text detection boxes
    - `text_type`: `(str)` Type of text detection, currently fixed as "general"
    - `textline_orientation_angles`: `(List[int])` Prediction results of text line orientation classification; actual angle values when enabled (e.g. [0,0,1])
    - `text_rec_score_thresh`: `(float)` Filtering threshold for text recognition results
    - `rec_texts`: `(List[str])` List of text recognition results, including only texts whose scores exceed `text_rec_score_thresh`
    - `rec_scores`: `(List[float])` Confidence scores of text recognition, filtered by `text_rec_score_thresh`
    - `rec_polys`: `(List[numpy.ndarray])` List of text detection boxes filtered by confidence, in the same format as `dt_polys`
  - `formula_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` List of formula recognition results, each element a dictionary
    - `rec_formula`: `(str)` Formula recognition result
    - `rec_polys`: `(numpy.ndarray)` Formula detection boxes, shape (4, 2), dtype int16
    - `formula_region_id`: `(int)` Region ID where the formula is located
  - `seal_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` List of seal recognition results, each element a dictionary
    - `input_path`: `(str)` Input path of the seal image
    - `page_index`: `None` when the input is a `numpy.ndarray`
    - `model_settings`: `(Dict)` Model configuration parameters of the seal text recognition sub-pipeline
    - `dt_polys`: `(List[numpy.ndarray])` List of seal detection boxes, in the same format as `dt_polys` above
    - `text_det_params`: `(Dict[str, Dict[str, int, float]])` Configuration parameters of the seal detection module, with the same meanings as above
    - `text_type`: `(str)` Type of seal detection, currently fixed as "seal"
    - `text_rec_score_thresh`: `(float)` Filtering threshold for seal recognition results
    - `rec_texts`: `(List[str])` List of seal recognition results, including only texts whose scores exceed `text_rec_score_thresh`
    - `rec_scores`: `(List[float])` Confidence scores of seal recognition, filtered by `text_rec_score_thresh`
    - `rec_polys`: `(List[numpy.ndarray])` List of seal detection boxes filtered by confidence, in the same format as `dt_polys`
    - `rec_boxes`: `(numpy.ndarray)` Array of rectangular bounding boxes, shape (n, 4), dtype int16; each row represents one rectangle
  - `table_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` List of table recognition results, each element a dictionary
    - `cell_box_list`: `(List[numpy.ndarray])` List of table cell bounding boxes
    - `pred_html`: `(str)` Table as an HTML-format string
    - `table_ocr_pred`: `(dict)` OCR recognition results of the table
      - `rec_polys`: `(List[numpy.ndarray])` List of cell detection boxes
      - `rec_texts`: `(List[str])` Recognition results of cells
      - `rec_scores`: `(List[float])` Recognition confidence scores of cells
      - `rec_boxes`: `(numpy.ndarray)` Array of rectangular bounding boxes, shape (n, 4), dtype int16; each row represents one rectangle
- Calling the `save_to_json()` method will save the above content to the specified `save_path`. If a directory is specified, the saved path will be `save_path/{your_img_basename}_res.json`; if a file is specified, the results are saved directly to that file. Since JSON files do not support saving numpy arrays, all `numpy.array` values are converted to lists.
- Calling the `save_to_img()` method will save visualization results to the specified `save_path`. If a directory is specified, it saves the layout detection visualization, the global OCR visualization, the layout reading-order visualization, and so on; if a file is specified, the results are saved directly to that file. (The pipeline usually produces many result images, so specifying a single file path is not recommended: successive images would overwrite one another, leaving only the last one.)
- Calling the `save_to_markdown()` method will save the converted Markdown files to the specified `save_path`; the saved path will be `save_path/{your_img_basename}.md`. If the input is a PDF file, it is recommended to specify a directory, otherwise the per-page Markdown files will overwrite one another.
- Calling the `concatenate_markdown_pages()` method merges the multi-page Markdown contents `markdown_list` output by the PP-DocTranslation pipeline into a single complete document and returns the merged Markdown content.
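A minimal sketch of processing the visual prediction results (assuming `output` comes from the visual_predict call above; the "layout_parsing_result" key and the `markdown` attribute are assumptions based on how the serving API below refers to the pipeline's result structure):

    ori_md_info_list = []
    for item in output:
        res = item["layout_parsing_result"]       # assumed result key
        res.print(format_json=True, indent=4)     # print to the terminal
        res.save_to_json(save_path="output")      # output/{basename}_res.json
        res.save_to_markdown(save_path="output")  # one Markdown file per page
        ori_md_info_list.append(res.markdown)     # assumed: Markdown data for translation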
(4) Call the translate() method to perform document translation. This method returns the original and translated Markdown content as Markdown objects, and the desired parts can be saved locally by calling the save_to_markdown() method described above. A minimal usage sketch follows the parameter table below. The relevant parameters of the translate() method are:
Parameter Description Type Default
ori_md_info_list List of original Markdown data containing content to be translated. Must be a list of dictionaries, each representing a document block List[Dict]
target_language Target language (ISO 639-1 language code, e.g. "en"/"ja"/"fr") str "zh"
chunk_size Character count threshold for chunked translation processing int 5000
task_description Custom task description prompt str|None None
output_format Specified output format requirements, e.g. "preserve original Markdown structure" str|None None
rules_str Custom translation rule description str|None None
few_shot_demo_text_content Few-shot learning example text content str|None None
few_shot_demo_key_value_list Structured few-shot example data in key-value pairs, can include professional terminology glossary str|None None
glossary Professional terminology glossary for translation dict|None None
llm_request_interval Interval in seconds between requests to the large language model; helps avoid overly frequent LLM calls. float 0.0
chat_bot_config Large language model configuration. Setting to None uses instantiation parameters; otherwise, this parameter takes priority. dict|None None

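As a minimal sketch (assuming ori_md_info_list was collected from the visual prediction results as above, and that the LLM was configured at instantiation, so chat_bot_config is omitted and falls back to those settings; the glossary entry and request interval are illustrative values):

    tgt_md_info_list = pipeline.translate(
        ori_md_info_list,
        target_language="en",
        chunk_size=5000,
        glossary={"飞桨": "PaddlePaddle"},  # pin terminology during translation
        llm_request_interval=0.5,          # wait 0.5 s between LLM requests
    )
    for tgt_md_info in tgt_md_info_list:
        tgt_md_info.save_to_markdown("output")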
3. Development Integration/Deployment

If the pipeline can meet your requirements for inference speed and accuracy, you can proceed directly with development integration/deployment.

If you need to directly apply the pipeline in your Python project, you can refer to the sample code in 2.2 Python Script Approach.

In addition, PaddleOCR also offers two other deployment methods, detailed as follows:

🚀 High-Performance Inference: In real-world production environments, many applications have stringent performance criteria (especially response speed) for deployment strategies to ensure efficient system operation and a smooth user experience. To this end, PaddleOCR provides high-performance inference capabilities, aiming to deeply optimize model inference and pre/post-processing, achieving significant acceleration in the end-to-end process. For detailed information on the high-performance inference process, please refer to High-Performance Inference.

☁️ Serving: Serving is a common deployment form in real-world production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. For detailed information on the pipeline serving process, please refer to Serving.

Below are the API references for basic serving and examples of multi-language service invocation:

API Reference

Main operations provided by the serving:

  • HTTP request method is POST.
  • Both request body and response body are JSON data (JSON objects).
  • When the request is processed successfully, the response status code is 200, and the response body has the following properties:
Name Type Meaning
logId string Request UUID.
errorCode integer Error code. Fixed as 0.
errorMsg string Error message. Fixed as "Success".
result object Operation result.
  • When the request is not successful, the response body has the following properties:
Name Type Meaning
logId string Request UUID.
errorCode integer Error code. Same as response status code.
errorMsg string Error message.

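For illustration, a minimal sketch of checking this response envelope in Python (assuming a service listening at http://127.0.0.1:8080; the "..." file value stands for a real URL or Base64 content):

    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/doctrans-visual",
        json={"file": "...", "fileType": 1},
    )
    body = resp.json()
    if resp.status_code == 200 and body["errorCode"] == 0:
        result = body["result"]  # operation result
    else:
        print(f"Request {body['logId']} failed: {body['errorMsg']}")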
Main operations provided by the serving are as follows:

  • analyzeImages

Use computer vision models to analyze images, obtaining OCR, table recognition results, etc.

POST /doctrans-visual

  • Request body properties are as follows:
Name Type Meaning Required
file string URL of image or PDF file accessible by the server, or Base64 encoding of such file contents. By default, for PDF files over 10 pages, only the first 10 pages are processed.
To remove the page limit, add the following configuration in the pipeline config file:
Serving:
  extra:
    max_num_input_imgs: null
Yes
fileType integer | null File type. 0 means PDF, 1 means image file. If not present in the request, the file type will be inferred from the URL. No
useDocOrientationClassify boolean | null See the use_doc_orientation_classify parameter description in the pipeline object's visual_predict method. No
useDocUnwarping boolean | null See the use_doc_unwarping parameter description in the pipeline object's visual_predict method. No
useTextlineOrientation boolean | null See the use_textline_orientation parameter description in the pipeline object's visual_predict method. No
useSealRecognition boolean | null See the use_seal_recognition parameter description in the pipeline object's visual_predict method. No
useTableRecognition boolean | null See the use_table_recognition parameter description in the pipeline object's visual_predict method. No
useFormulaRecognition boolean | null See the use_formula_recognition parameter description in the pipeline object's visual_predict method. No
useChartRecognition boolean | null See the use_chart_recognition parameter description in the pipeline object's visual_predict method. No
useRegionDetection boolean | null See the use_region_detection parameter description in the pipeline object's visual_predict method. No
layoutThreshold number | object | null See the layout_threshold parameter description in the pipeline object's visual_predict method. No
layoutNms boolean | null See the layout_nms parameter description in the pipeline object's visual_predict method. No
layoutUnclipRatio number | array | object | null See the layout_unclip_ratio parameter description in the pipeline object's visual_predict method. No
layoutMergeBboxesMode string | object | null See the layout_merge_bboxes_mode parameter description in the pipeline object's visual_predict method. No
textDetLimitSideLen integer | null See the text_det_limit_side_len parameter description in the pipeline object's visual_predict method. No
textDetLimitType string | null See the text_det_limit_type parameter description in the pipeline object's visual_predict method. No
textDetThresh number | null See the text_det_thresh parameter description in the pipeline object's visual_predict method. No
textDetBoxThresh number | null See the text_det_box_thresh parameter description in the pipeline object's visual_predict method. No
textDetUnclipRatio number | null See the text_det_unclip_ratio parameter description in the pipeline object's visual_predict method. No
textRecScoreThresh number | null See the text_rec_score_thresh parameter description in the pipeline object's visual_predict method. No
sealDetLimitSideLen integer | null See the seal_det_limit_side_len parameter description in the pipeline object's visual_predict method. No
sealDetLimitType string | null See the seal_det_limit_type parameter description in the pipeline object's visual_predict method. No
sealDetThresh number | null See the seal_det_thresh parameter description in the pipeline object's visual_predict method. No
sealDetBoxThresh number | null See the seal_det_box_thresh parameter description in the pipeline object's visual_predict method. No
sealDetUnclipRatio number | null See the seal_det_unclip_ratio parameter description in the pipeline object's visual_predict method. No
sealRecScoreThresh number | null See the seal_rec_score_thresh parameter description in the pipeline object's visual_predict method. No
useWiredTableCellsTransToHtml boolean See the use_wired_table_cells_trans_to_html parameter description in the pipeline object's visual_predict method. No
useWirelessTableCellsTransToHtml boolean See the use_wireless_table_cells_trans_to_html parameter description in the pipeline object's visual_predict method. No
useTableOrientationClassify boolean See the use_table_orientation_classify parameter description in the pipeline object's visual_predict method. No
useOcrResultsWithTableCells boolean See the use_ocr_results_with_table_cells parameter description in the pipeline object's visual_predict method. No
useE2eWiredTableRecModel boolean See the use_e2e_wired_table_rec_model parameter description in the pipeline object's visual_predict method. No
useE2eWirelessTableRecModel boolean See the use_e2e_wireless_table_rec_model parameter description in the pipeline object's visual_predict method. No
visualize boolean | null Whether to return visualization result images and intermediate images during processing.
  • If true is passed: return images.
  • If false is passed: do not return images.
  • If this parameter is not provided in the request body or null is passed: follow the pipeline config file setting Serving.visualize.

For example, add the following field in the pipeline config file:
Serving:
  visualize: False
With this configuration, images will not be returned by default; the visualize parameter in the request body can override this behavior. If neither the request body nor the config file sets it (i.e. the request body passes null or omits the parameter, and the config file does not set it), images are returned by default.
No
  • When the request is processed successfully, the response body's result has the following properties:
Name Type Meaning
layoutParsingResults array Layout parsing results. The array length is 1 (for image input) or equals the actual number of processed pages (for PDF input). For PDF input, each element corresponds to the result of each processed page in order.
dataInfo object Input data information.

Each element in layoutParsingResults is an object with the following properties:

Name Type Meaning
prunedResult object Simplified version of the res field in the JSON representation of the layout_parsing_result generated by the pipeline object's visual_predict method, with input_path and page_index fields removed.
markdown object Markdown result.
outputImages object | null See the img property description in the pipeline prediction results. Images are in JPEG format and Base64 encoded.
inputImage string | null Input image. JPEG format, Base64 encoded.

markdown is an object with the following properties:

Name Type Meaning
text string Markdown text.
images object Key-value pairs of Markdown image relative paths and Base64 encoded images.
isStart boolean Whether the first element on the current page is the start of a paragraph.
isEnd boolean Whether the last element on the current page is the end of a paragraph.
  • translate

Use a large model to translate documents.

POST /doctrans-translate

  • Request body properties are as follows:
Name Type Meaning Required
markdownList array List of Markdown to be translated. Can be obtained from the results of the analyzeImages operation. The images attribute will not be used. Yes
targetLanguage string Please refer to the target_language parameter description in the translate method of the pipeline object. No
chunkSize integer Please refer to the chunk_size parameter description in the translate method of the pipeline object. No
taskDescription string | null Please refer to the task_description parameter description in the translate method of the pipeline object. No
outputFormat string | null Please refer to the output_format parameter description in the translate method of the pipeline object. No
rulesStr string | null Please refer to the rules_str parameter description in the translate method of the pipeline object. No
fewShotDemoTextContent string | null Please refer to the few_shot_demo_text_content parameter description in the translate method of the pipeline object. No
fewShotDemoKeyValueList string | null Please refer to the few_shot_demo_key_value_list parameter description in the translate method of the pipeline object. No
glossary object | null Please refer to the glossary parameter description in the translate method of the pipeline object. No
llmRequestInterval number | null Please refer to the llm_request_interval parameter description in the translate method of the pipeline object. No
chatBotConfig object | null Please refer to the chat_bot_config parameter description in the translate method of the pipeline object. No
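As a sketch, a translate request body might look like the following (using the same markdown_list and API_BASE_URL names as in the full Python example below; the glossary entry is illustrative):

    payload = {
        "markdownList": markdown_list,  # collected from analyzeImages results
        "targetLanguage": "en",
        "glossary": {"飞桨": "PaddlePaddle"},
    }
    resp = requests.post(f"{API_BASE_URL}/doctrans-translate", json=payload)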
  • When the request is successfully processed, the result in the response body has the following attributes:
Name Type Meaning
translationResults array Translation results.

Each element in translationResults is an object with the following attributes:

Name Type Meaning
language string Target language.
markdown object Markdown result. Object definition is consistent with the markdown returned by the analyzeImages operation.
  • Note:
  • Including sensitive parameters such as the API key for large model calls in the request body may pose security risks. If not necessary, set these parameters in the configuration file and do not pass them during the request.

    Examples of multi-language service invocation
    Python
    import base64
    import pathlib
    import pprint
    import sys
    
    import requests
    
    
    API_BASE_URL = "http://127.0.0.1:8080"
    
    file_path = "./demo.jpg"
    target_language = "en"
    
    with open(file_path, "rb") as file:
        file_bytes = file.read()
        file_data = base64.b64encode(file_bytes).decode("ascii")
    
    payload = {
        "file": file_data,
        "fileType": 1,
    }
    resp_visual = requests.post(url=f"{API_BASE_URL}/doctrans-visual", json=payload)
    if resp_visual.status_code != 200:
        print(
            f"Request to doctrans-visual failed with status code {resp_visual.status_code}."
        )
        pprint.pp(resp_visual.json())
        sys.exit(1)
    result_visual = resp_visual.json()["result"]
    
    markdown_list = []
    for i, res in enumerate(result_visual["layoutParsingResults"]):
        md_dir = pathlib.Path(f"markdown_{i}")
        md_dir.mkdir(exist_ok=True)
        (md_dir / "doc.md").write_text(res["markdown"]["text"])
        for img_path, img in res["markdown"]["images"].items():
            img_path = md_dir / img_path
            img_path.parent.mkdir(parents=True, exist_ok=True)
            img_path.write_bytes(base64.b64decode(img))
        print(f"The Markdown document to be translated is saved at {md_dir / 'doc.md'}")
        del res["markdown"]["images"]
        markdown_list.append(res["markdown"])
        for img_name, img in res["outputImages"].items():
            img_path = f"{img_name}_{i}.jpg"
            with open(img_path, "wb") as f:
                f.write(base64.b64decode(img))
            print(f"Output image saved at {img_path}")
    
    payload = {
        "markdownList": markdown_list,
    "targetLanguage": target_language,
    }
    resp_translate = requests.post(url=f"{API_BASE_URL}/doctrans-translate", json=payload)
    if resp_translate.status_code != 200:
        print(
            f"Request to doctrans-translate failed with status code {resp_translate.status_code}."
        )
        pprint.pp(resp_translate.json())
        sys.exit(1)
    result_translate = resp_translate.json()["result"]
    
    for i, res in enumerate(result_translate["translationResults"]):
        md_dir = pathlib.Path(f"markdown_{i}")
        (md_dir / "doc_translated.md").write_text(res["markdown"]["text"])
        print(f"Translated markdown document saved at {md_dir / 'doc_translated.md'}")


    4. Secondary Development

    If the default model weights provided by the PP-DocTranslation pipeline do not meet your accuracy or speed requirements in your scenario, you can try to use your own data from specific domains or application scenarios to further fine-tune the existing models and improve recognition performance in your scenario.

    4.1 Model Fine-tuning

    Since the PP-DocTranslation pipeline contains several modules, if the performance of the model pipeline does not meet expectations, the issue may originate from any one of these modules. You can analyze cases with poor extraction results, use visualized images to determine which module has the problem, and refer to the corresponding fine-tuning tutorial links in the following table to fine-tune the model.

    Scenario Fine-tuning module Fine-tuning reference link
    Inaccurate detection of layout areas, such as failure to detect seals and tables Layout detection module Link
    Inaccurate recognition of table structures Table structure recognition module Link
    Inaccurate recognition of formulas Formula recognition module Link
    Omission in detecting seal texts Seal text detection module Link
    Omission in detecting texts Text detection module Link
    Inaccurate text content Text recognition module Link
    Inaccurate correction of vertical or rotated text lines Text line orientation classification module Link
    Inaccurate correction of whole image rotation Document image orientation classification module Link
    Inaccurate correction of image distortion Text image unwarping module Fine-tuning is temporarily not supported

    4.2 Model Application

    After completing fine-tuning training with your private dataset, you can obtain a local model weight file. Then, you can use the fine-tuned model weights by customizing the pipeline configuration file.

    1. Obtain the pipeline configuration file

    You can call the export_paddlex_config_to_yaml method of the PP-DocTranslation pipeline object in PaddleOCR to export the current pipeline configuration to a YAML file:

    from paddleocr import PPDocTranslation
    
    pipeline = PPDocTranslation()
    pipeline.export_paddlex_config_to_yaml("PP-DocTranslation.yaml")
    
    2. Modify the configuration file

    After obtaining the default pipeline configuration file, replace the corresponding paths in the pipeline configuration file with the local paths of your fine-tuned model weights. For example:

    ......
    SubModules:
      TextDetection:
        module_name: text_detection
        model_name: PP-OCRv5_server_det
        model_dir: null # Replace with the path to the weights of the fine-tuned text detection model
        limit_side_len: 960
        limit_type: max
        thresh: 0.3
        box_thresh: 0.6
        unclip_ratio: 1.5

      TextRecognition:
        module_name: text_recognition
        model_name: PP-OCRv5_server_rec
        model_dir: null # Replace with the path to the weights of the fine-tuned text recognition model
        batch_size: 1
        score_thresh: 0
    ......
    

    The pipeline configuration file not only includes parameters supported by PaddleOCR CLI and Python API but also allows for more advanced configurations. Detailed information can be found in the corresponding pipeline usage tutorial in the Overview of PaddleX Model Pipeline Usage. Refer to the detailed instructions therein and adjust the configurations according to your needs.

    3. Load the pipeline configuration file in CLI

    After modifying the configuration file, specify the path to the modified pipeline configuration file using the --paddlex_config parameter in the command line. PaddleOCR will then read its contents as the pipeline configuration. Here is an example:

    paddleocr pp_doctranslation --paddlex_config PP-DocTranslation.yaml ...
    
    4. Load the pipeline configuration file in the Python API

    When initializing the pipeline object, you can pass the path of the PaddleX pipeline configuration file or a configuration dict through the paddlex_config parameter, and PaddleOCR will read its content as the pipeline configuration. The example is as follows:

    from paddleocr import PPDocTranslation
    
    pipeline = PPDocTranslation(paddlex_config="PP-DocTranslation.yaml")
    
