
PP-DocTranslation Pipeline Usage Tutorial

1. Introduction to PP-DocTranslation Pipeline

PP-DocTranslation is an intelligent document translation solution provided by PaddlePaddle. It integrates advanced general layout analysis technology with large language model (LLM) capabilities to offer efficient document translation services. The solution accurately identifies and extracts elements within documents, including text blocks, headings, paragraphs, images, tables, and other complex layout structures, and on this basis delivers high-quality multilingual translation. PP-DocTranslation supports mutual translation among multiple mainstream languages and excels particularly at documents with complex layouts and strong contextual dependencies, aiming for precise, natural, fluent, and professional translation results. The pipeline also provides flexible serving options, supporting multiple programming languages on various hardware, and offers secondary development capabilities: you can train and fine-tune models on your own datasets based on this pipeline, and the trained models can be seamlessly integrated.

The PP-DocTranslation pipeline uses the PP-StructureV3 sub-pipeline and therefore provides all of PP-StructureV3's functionality. For more information on the functions and usage of PP-StructureV3, see the PP-StructureV3 Pipeline Documentation.

In this pipeline, you can select the model to use based on the benchmark data below.

Details of the model list:

Document image orientation classification module:

Model | Model download link | Top-1 Acc (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
PP-LCNet_x1_0_doc_ori Inference model/Training model 99.06 2.62 / 0.59 3.24 / 1.19 7 A document image classification model based on PP-LCNet_x1_0, with four categories: 0 degrees, 90 degrees, 180 degrees, and 270 degrees

Text image unwarping module:

Model | Model download link | CER | Model storage size (M) | Introduction
UVDoc Inference model/Training model 0.179 30.3 M A high-precision text image unwarping model

Layout region detection module:

Model | Model download link | mAP(0.5) (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
PP-DocLayout_plus-L Inference model/Training model 83.2 53.03 / 17.23 634.62 / 378.32 126.01 M A higher-precision layout region localization model trained on a self-built dataset based on RT-DETR-L, covering scenarios such as Chinese and English papers, multi-column magazines, newspapers, PPTs, contracts, books, examination papers, research reports, ancient books, Japanese documents, and documents with vertical text.
PP-DocLayout-L Inference model/Training model 90.4 33.59 / 33.59 503.01 / 251.08 123.76 M A high-precision layout region localization model trained on a self-built dataset based on RT-DETR-L, covering scenarios such as Chinese and English papers, magazines, contracts, books, examination papers, and research reports.
PP-DocLayout-M Inference model/Training model 75.2 13.03 / 4.72 43.39 / 24.44 22.578 A layout region localization model with balanced precision and efficiency trained on a self-built dataset based on PicoDet-L, covering scenarios such as Chinese and English papers, magazines, contracts, books, examination papers, and research reports.
PP-DocLayout-S Inference model/Training model 70.9 11.54 / 3.86 18.53 / 6.29 4.834 A highly efficient layout region localization model trained on a self-built dataset based on PicoDet-S, covering scenarios such as Chinese and English papers, magazines, contracts, books, examination papers, and research reports.

Table structure recognition module:

Model | Model download link | Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
SLANeXt_wired Inference model/Training model 69.65 85.92 / 85.92 - / 501.66 351M The SLANeXt series is a new generation of table structure recognition models independently developed by Baidu PaddlePaddle's vision team. Compared to SLANet and SLANet_plus, SLANeXt focuses on recognizing table structures and has trained dedicated weights for wired and wireless tables separately. This has significantly improved its ability to recognize various types of tables, especially wired tables.
SLANeXt_wireless Inference model/Training model

Table classification module:

Model | Model download link | Top-1 Acc (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M)
PP-LCNet_x1_0_table_cls Inference model/Training model 94.2 2.62 / 0.60 3.17 / 1.14 6.6M

Table cell detection module:

Model | Model download link | mAP (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
RT-DETR-L_wired_table_cell_det Inference model/Training model 82.7 33.47 / 27.02 402.55 / 256.56 124M RT-DETR is the first real-time end-to-end object detection model. Based on RT-DETR-L as the base model, Baidu PaddlePaddle's vision team completed pre-training on a self-built table cell detection dataset, achieving table cell detection with good performance for both wired and wireless tables.
RT-DETR-L_wireless_table_cell_det Inference model/Training model

Text detection module:

Model | Model download link | Detection Hmean (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
PP-OCRv5_server_det Inference model/Training model 83.8 89.55 / 70.19 383.15 / 383.15 84.3 The server-side text detection model of PP-OCRv5, with higher accuracy, suitable for deployment on servers with better performance
PP-OCRv5_mobile_det Inference model/Training model 79.0 10.67 / 6.36 57.77 / 28.15 4.7 PP-OCRv5's mobile-end text detection model, with higher efficiency, suitable for deployment on edge devices
PP-OCRv4_server_det Inference model/Training model 69.2 127.82 / 98.87 585.95 / 489.77 109 PP-OCRv4's server-end text detection model, with higher accuracy, suitable for deployment on servers with better performance
PP-OCRv4_mobile_det Inference model/Training model 63.8 9.87 / 4.17 56.60 / 20.79 4.7 PP-OCRv4's mobile-end text detection model, with higher efficiency, suitable for deployment on edge devices
PP-OCRv3_mobile_det Inference model/Training model Accuracy is close to PP-OCRv4_mobile_det 9.90 / 3.60 41.93 / 20.76 2.1 PP-OCRv3's mobile-end text detection model, with higher efficiency, suitable for deployment on edge devices
PP-OCRv3_server_det Inference model/Training model Accuracy is close to PP-OCRv4_server_det 119.50 / 75.00 379.35 / 318.35 102.1 Server-side text detection model of PP-OCRv3, with higher accuracy, suitable for deployment on servers with better performance

Text recognition module:

* Chinese recognition model
Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
PP-OCRv5_server_rec Inference model/Training model 86.38 8.46 / 2.36 31.21 / 31.21 81 M PP-OCRv5_rec is a new generation of text recognition model. This model is committed to efficiently and accurately supporting four major languages, namely Simplified Chinese, Traditional Chinese, English, and Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition effectiveness, it also takes into account inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.
PP-OCRv5_mobile_rec Inference model/Training model 81.29 5.43 / 1.46 21.20 / 5.32 16 M
PP-OCRv4_server_rec_doc Inference model/Training model 86.58 8.69 / 2.78 37.93 / 37.93 74.7 M PP-OCRv4_server_rec_doc is trained on a mixed dataset of more Chinese document data and PP-OCR training data based on PP-OCRv4_server_rec. It has enhanced the ability to recognize some traditional Chinese characters, Japanese characters, and special characters, and can support the recognition of over 15,000 characters. In addition to improving the document-related text recognition ability, it has also enhanced the general text recognition ability.
PP-OCRv4_mobile_rec Inference model/Training model 78.74 5.26 / 1.12 17.48 / 3.61 10.6 M A lightweight recognition model of PP-OCRv4 with high inference efficiency, which can be deployed on various hardware devices including edge devices.
PP-OCRv4_server_rec Inference model/Training model 80.61 8.75 / 2.49 36.93 / 36.93 71.2 M A server-side model of PP-OCRv4 with high inference accuracy, which can be deployed on various servers.
PP-OCRv3_mobile_rec Inference model/Training model 72.96 3.89 / 1.16 8.72 / 3.56 9.2 M A lightweight recognition model of PP-OCRv3 with high inference efficiency, which can be deployed on various hardware devices including edge devices.
Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
ch_SVTRv2_rec Inference model/Training model 68.81 10.38 / 8.31 66.52 / 30.83 73.9 M SVTRv2 is a server-side text recognition model developed by the OpenOCR team of the Vision and Learning Lab (FVL) at Fudan University. It won the first prize in the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task, with a 6% improvement in end-to-end recognition accuracy on Leaderboard A compared to PP-OCRv4.
Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
ch_RepSVTR_rec Inference model/Training model 65.07 6.29 / 1.57 20.64 / 5.40 22.1 M RepSVTR is a mobile-side text recognition model based on SVTRv2. It won the first prize in the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task, with a 2.5% improvement in end-to-end recognition accuracy on Leaderboard B compared to PP-OCRv4, while maintaining the same inference speed.
* English recognition model
Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
en_PP-OCRv4_mobile_rec Inference model/Training model 70.39 4.81 / 1.23 17.20 / 4.18 6.8 M An ultra-lightweight English recognition model trained based on the PP-OCRv4 recognition model, supporting English and number recognition
en_PP-OCRv3_mobile_rec Inference model/Training model 70.69 3.56 / 0.78 8.44 / 5.78 7.8 M An ultra-lightweight English recognition model trained based on the PP-OCRv3 recognition model, supporting English and number recognition
* Multilingual recognition model
Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
korean_PP-OCRv3_mobile_rec Inference model/Training model 60.21 3.73 / 0.98 8.76 / 2.91 8.6 M An ultra-lightweight Korean recognition model trained based on the PP-OCRv3 recognition model, supporting Korean and digit recognition
japan_PP-OCRv3_mobile_rec Inference model/Training model 45.69 3.86 / 1.01 8.62 / 2.92 8.8 M An ultra-lightweight Japanese recognition model trained based on the PP-OCRv3 recognition model, supporting Japanese and digit recognition
chinese_cht_PP-OCRv3_mobile_rec Inference model/Training model 82.06 3.90 / 1.16 9.24 / 3.18 9.7 M An ultra-lightweight traditional Chinese recognition model trained based on the PP-OCRv3 recognition model, supporting traditional Chinese and digit recognition
te_PP-OCRv3_mobile_rec Inference model/Training model 95.88 3.59 / 0.81 8.28 / 6.21 7.8 M An ultra-lightweight Telugu recognition model trained based on the PP-OCRv3 recognition model, supporting Telugu and digit recognition
ka_PP-OCRv3_mobile_rec Inference model/Training model 96.96 3.49 / 0.89 8.63 / 2.77 8.0 M An ultra-lightweight Kannada recognition model trained based on the PP-OCRv3 recognition model, supporting Kannada and digit recognition
ta_PP-OCRv3_mobile_rec Inference model/Training model 76.83 3.49 / 0.86 8.35 / 3.41 8.0 M An ultra-lightweight Tamil recognition model trained based on the PP-OCRv3 recognition model, supporting Tamil and digit recognition
latin_PP-OCRv3_mobile_rec Inference model/Training model 76.93 3.53 / 0.78 8.50 / 6.83 7.8 M An ultra-lightweight Latin recognition model trained based on the PP-OCRv3 recognition model, supporting Latin and digit recognition
arabic_PP-OCRv3_mobile_rec Inference model/Training model 73.55 3.60 / 0.83 8.44 / 4.69 7.8 M An ultra-lightweight Arabic alphabet recognition model trained based on the PP-OCRv3 recognition model, supporting Arabic alphabet and digit recognition
cyrillic_PP-OCRv3_mobile_rec Inference model/Training model 94.28 3.56 / 0.79 8.22 / 2.76 7.9 M An ultra-lightweight Slavic alphabet recognition model trained based on the PP-OCRv3 recognition model, supporting Slavic alphabet and digit recognition
devanagari_PP-OCRv3_mobile_rec Inference model/Training model 96.44 3.60 / 0.78 6.95 / 2.87 7.9 M An ultra-lightweight Sanskrit alphabet recognition model trained based on the PP-OCRv3 recognition model, supporting Sanskrit alphabet and digit recognition

Text line direction classification module (optional):

Model | Model download link | Top-1 Acc (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
PP-LCNet_x0_25_textline_ori Inference model/Training model 95.54 2.16 / 0.41 2.37 / 0.73 0.32 A text line classification model based on PP-LCNet_x0_25, with two categories, namely 0 degrees and 180 degrees

Formula recognition module:

Model | Model download link | Avg-BLEU (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
UniMERNet Inference model/Training model 86.13 2266.96 / - - / - 1.4 G UniMERNet is a formula recognition model developed by Shanghai AI Lab. It uses Donut Swin as the encoder and MBartDecoder as the decoder. By training on a dataset of one million entries that includes simple, complex, scanned, and handwritten formulas, the model significantly improves its recognition accuracy for formulas in real-world scenarios.
PP-FormulaNet-S Inference model/Training model 87.12 1311.84 / 1311.84 - / 8288.07 167.9 M PP-FormulaNet is an advanced formula recognition model developed by Baidu PaddlePaddle's vision team, supporting the recognition of 50,000 common LaTeX source-code tokens. The PP-FormulaNet-S version employs PP-HGNetV2-B4 as its backbone network. Through techniques such as parallel masking and model distillation, it significantly improves inference speed while maintaining high recognition accuracy, and is suitable for simple printed formulas and simple multi-line printed formulas.
PP-FormulaNet-L Inference model/Training model 92.13 1976.52 / - - / - 535.2 M The PP-FormulaNet-L version is based on Vary_VIT_B as its backbone network and has undergone in-depth training on a large-scale formula dataset. It shows significant improvement in recognizing complex formulas compared to PP-FormulaNet-S and is suitable for simple printed formulas, complex printed formulas, and handwritten formulas.
LaTeX_OCR_rec Inference model/Training model 71.63 1088.89 / 1088.89 - / - 89.7 M LaTeX-OCR is a formula recognition algorithm based on an autoregressive large model. By adopting Hybrid ViT as the backbone network and a transformer as the decoder, it significantly improves the accuracy of formula recognition.

Seal text detection module:

Model | Model download link | Detection Hmean (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction
PP-OCRv4_server_seal_det Inference model/Training model 98.21 124.64 / 91.57 545.68 / 439.86 109 PP-OCRv4's server-side seal text detection model with higher accuracy, suitable for deployment on better servers
PP-OCRv4_mobile_seal_det Inference model/Training model 96.47 9.70 / 3.56 50.38 / 19.64 4.6 PP-OCRv4's mobile-side seal text detection model with higher efficiency, suitable for deployment on the end side
Test environment description:
  • Performance test environment
    • Test dataset:
      • Document image orientation classification model: A self-built dataset by PaddleX, covering multiple scenarios such as certificates and documents, containing 1000 images.
      • Text image unwarping model: DocUNet.
      • Layout area detection model: The self-built layout area analysis dataset of PaddleOCR, which includes 10,000 common document images such as Chinese and English papers, magazines, and research reports.
      • PP-DocLayout_plus-L: The self-built layout area detection dataset of PaddleOCR, which includes 1,300 document images such as Chinese and English papers, magazines, newspapers, research reports, PPTs, examination papers, and textbooks.
      • Table structure recognition model: The self-built English table recognition dataset within PaddleX.
      • Text detection model: The self-built Chinese dataset of PaddleOCR, covering multiple scenarios such as street views, web images, documents, and handwriting, with 500 images for detection.
      • Chinese recognition model: The self-built Chinese dataset of PaddleOCR, covering multiple scenarios such as street views, web images, documents, and handwriting, with 11,000 images for text recognition.
      • ch_SVTRv2_rec: the evaluation set for Leaderboard A of the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task.
      • ch_RepSVTR_rec: the evaluation set for Leaderboard B of the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task.
      • English recognition model: The self-built English dataset of PaddleX.
      • Multilingual recognition model: The self-built multilingual dataset of PaddleX.
      • Text line direction classification model: The self-built dataset of PaddleX, covering multiple scenarios such as certificates and documents, with 1,000 images.
      • Seal text detection model: The self-built dataset of PaddleX, which includes 500 images of round seals.
    • Hardware configuration:
      • GPU: NVIDIA Tesla T4
      • CPU: Intel Xeon Gold 6271C @ 2.60GHz
      • Other environments: Ubuntu 20.04 / CUDA 11.8 / cuDNN 8.9 / TensorRT 8.6.1.6
  • Description of inference modes
Mode | GPU configuration | CPU configuration | Combination of acceleration technologies
Normal mode | FP32 precision / no TRT acceleration | FP32 precision / 8 threads | PaddleInference
High-performance mode | Optimal combination of precision type and acceleration strategy selected from prior knowledge | FP32 precision / 8 threads | Optimal backend selected from prior knowledge (Paddle/OpenVINO/TRT, etc.)

2. Quick Start

Before using the PP-DocTranslation pipeline locally, please ensure that you have completed the installation of the wheel package according to the Installation Tutorial.

Please note: If you encounter issues such as the program becoming unresponsive, unexpected program termination, running out of memory resources, or extremely slow inference during execution, please try adjusting the configuration according to the documentation, such as disabling unnecessary features or using lighter-weight models.

Before use, you need to prepare the API key for a large language model, which supports the Baidu Cloud Qianfan Platform or local large model services that comply with the OpenAI interface standards.
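For reference, both options use the same chat_bot_config structure shown later in this tutorial; only model_name, base_url, and api_key change. Below is a minimal sketch, in which the local service URL and model name are illustrative placeholders rather than fixed values:

# Option 1: Baidu Cloud Qianfan Platform
qianfan_chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "your_qianfan_api_key",  # your actual Qianfan API key
}

# Option 2: a local LLM service that implements the OpenAI interface standard
local_chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "your_local_model_name",   # placeholder: the model served locally
    "base_url": "http://127.0.0.1:8000/v1",  # placeholder: your service address
    "api_type": "openai",
    "api_key": "your_local_api_key",         # set as required by your local service
}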

2.1 Experience via Command Line

You can download the test file and quickly experience the pipeline effect with a single command:

paddleocr pp_doctranslation -i vehicle_certificate-1.png --target_language en --qianfan_api_key your_api_key
The command line supports more parameter settings; detailed descriptions of the command-line parameters are given below.
Parameter | Description | Parameter Type | Default Value
input Data to be predicted, required. For example, the local path of an image or PDF file: /root/data/img.jpg; or a URL link, such as the network URL of an image or PDF file: Example; or a local directory, which should contain the images to be predicted, such as /root/data/ (currently, prediction of PDF files within a directory is not supported; PDF files must be specified by their exact file path). str
save_path Specify the path where the inference result file will be saved. If not set, the inference result will not be saved locally. str
target_language Target language (ISO 639-1 language code). str zh
layout_detection_model_name The model name for layout area detection. If not set, the default model of the pipeline will be used. str
layout_detection_model_dir The directory path of the layout area detection model. If not set, the official model will be downloaded. str
layout_threshold The score threshold for the layout model. Any floating-point number between 0 and 1. If not set, the parameter value initialized by the pipeline will be used, which is initialized to 0.5 by default. float
layout_nms Whether to use post-processing NMS for layout detection. If not set, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool
layout_unclip_ratio The expansion coefficient of the detection boxes for the layout area detection model. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline will be used, and the default initialization is 1.0. float
layout_merge_bboxes_mode The merging mode for the detection boxes output by the model in layout detection:
  • large: among detection boxes that overlap or contain each other, only the largest outer box is retained and the overlapping inner boxes are deleted;
  • small: among detection boxes that overlap or contain each other, only the inner box that is contained is retained and the overlapping outer boxes are deleted;
  • union: no filtering is performed, and both inner and outer boxes are retained.
If not set, the parameter value initialized by the pipeline will be used, and the default initialization is large. str
chart_recognition_model_name The model name for chart parsing. If not set, the default model of the pipeline will be used. str
chart_recognition_model_dir The directory path of the chart parsing model. If not set, the official model will be downloaded. str
chart_recognition_batch_size The batch size of the chart parsing model. If not set, the batch size will be set to 1 by default. int
region_detection_model_name Name of the model for detecting submodules of document image layout. If not set, the default model in the pipeline will be used. str
region_detection_model_dir Directory path of the model for detecting submodules of document image layout. If not set, the official model will be downloaded. str
doc_orientation_classify_model_name Name of the model for document orientation classification. If not set, the default model in the pipeline will be used. str
doc_orientation_classify_model_dir Directory path of the model for document orientation classification. If not set, the official model will be downloaded. str
doc_unwarping_model_name Name of the model for text image unwarping. If not set, the default model in the pipeline will be used. str
doc_unwarping_model_dir Directory path of the model for text image unwarping. If not set, the official model will be downloaded. str
text_detection_model_name Name of the model for text detection. If not set, the default model in the pipeline will be used. str
text_detection_model_dir Directory path of the model for text detection. If not set, the official model will be downloaded. str
text_det_limit_side_len Limit on the side length of the image for text detection. Any integer greater than 0. If not set, the parameter value initialized in the pipeline will be used, and the default initialization value is 960. int
text_det_limit_type Type of image side length limit for text detection. It supports min and max: min means ensuring that the shortest side of the image is not less than det_limit_side_len, and max means ensuring that the longest side of the image is not greater than limit_side_len. If not set, the parameter value initialized by the pipeline will be used, and the default initialization is max. str
text_det_thresh Detection pixel threshold. Only pixels with scores greater than this threshold in the output probability map will be considered text pixels. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline, 0.3, will be used by default. float
text_det_box_thresh Detection box threshold. When the average score of all pixels within the detection box is greater than this threshold, the result is considered a text area. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline, 0.6, will be used by default. float
text_det_unclip_ratio Text detection expansion coefficient, used to expand the text area; the larger the value, the larger the expanded area. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline, 2.0, will be used by default. float
textline_orientation_model_name Name of the text line orientation model. If not set, the default model in the pipeline will be used. str
textline_orientation_model_dir Directory path of the text line orientation model. If not set, the official model will be downloaded. str
textline_orientation_batch_size Batch size of the text line orientation model. If not set, the batch size will be set to 1 by default. int
text_recognition_model_name Name of the text recognition model. If not set, the default model in the pipeline will be used. str
text_recognition_model_dir Directory path of the text recognition model. If not set, the official model will be downloaded. str
text_recognition_batch_size Batch size of the text recognition model. If not set, the batch size will be set to 1 by default. int
text_rec_score_thresh Text recognition threshold. Text results with scores greater than this threshold will be retained. Any floating-point number greater than 0. If not set, the parameter value initialized in the pipeline, 0.0, will be used by default; that is, no threshold is set. float
table_classification_model_name Name of the table classification model. If not set, the default model in the pipeline will be used. str
table_classification_model_dir The directory path of the table classification model. If not set, the official model will be downloaded. str
wired_table_structure_recognition_model_name The name of the wired table structure recognition model. If not set, the default model in the pipeline will be used. str
wired_table_structure_recognition_model_dir The directory path of the wired table structure recognition model. If not set, the official model will be downloaded. str
wireless_table_structure_recognition_model_name The name of the wireless table structure recognition model. If not set, the default model in the pipeline will be used. str
wireless_table_structure_recognition_model_dir The directory path of the wireless table structure recognition model. If not set, the official model will be downloaded. str
wired_table_cells_detection_model_name The name of the wired table cells detection model. If not set, the default model in the pipeline will be used. str
wired_table_cells_detection_model_dir The directory path of the wired table cells detection model. If not set, the official model will be downloaded. str
wireless_table_cells_detection_model_name The name of the wireless table cells detection model. If not set, the default model in the pipeline will be used. str
wireless_table_cells_detection_model_dir Directory path of the wireless table cell detection model. If not set, the official model will be downloaded. str
table_orientation_classify_model_name Name of the table orientation classification model. If not set, the default model in the pipeline will be used. str
table_orientation_classify_model_dir Directory path of the table orientation classification model. If not set, the official model will be downloaded. str
seal_text_detection_model_name Name of the seal text detection model. If not set, the default model in the pipeline will be used. str
seal_text_detection_model_dir Directory path of the seal text detection model. If not set, the official model will be downloaded. str
seal_det_limit_side_len Limit on the side length of the image for seal text detection. Any integer greater than 0. If not set, the parameter value initialized in the pipeline will be used, which is initialized to 736 by default. int
seal_det_limit_type Type of the side length limit for the seal text detection image. Supports min and max: min means ensuring that the shortest side of the image is not less than det_limit_side_len, and max means ensuring that the longest side is not greater than limit_side_len. If not set, the parameter value initialized by the pipeline will be used, and the default initialization is min. str
seal_det_thresh Detection pixel threshold. Only pixels with scores greater than this threshold in the output probability map will be considered text pixels. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline will be used by default, which is 0.2. float
seal_det_box_thresh Detection box threshold. When the average score of all pixels within the bounding box of the detection result is greater than this threshold, the result will be considered as a text region. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline will be used by default, which is 0.6. float
seal_det_unclip_ratio Expansion coefficient for seal text detection. This method is used to expand the text region. The larger the value, the larger the expanded area. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline will be used by default, which is 0.5. float
seal_text_recognition_model_name Name of the seal text recognition model. If not set, the default model of the pipeline will be used. str
seal_text_recognition_model_dir Directory path of the seal text recognition model. If not set, the official model will be downloaded. str
seal_text_recognition_batch_size The batch size of the seal text recognition model. If not set, the batch size will be set to 1 by default. int
seal_rec_score_thresh Seal text recognition threshold. Text results with scores greater than this threshold will be retained. Any floating-point number greater than 0. If not set, the parameter value initialized by the pipeline, 0.0, will be used by default; that is, no threshold is set. float
formula_recognition_model_name The name of the formula recognition model. If not set, the default model of the pipeline will be used. str
formula_recognition_model_dir The directory path of the formula recognition model. If not set, the official model will be downloaded. str
formula_recognition_batch_size The batch size of the formula recognition model. If not set, the batch size will be set to 1 by default. int
use_doc_orientation_classify Whether to use the document orientation classification module. bool False
use_doc_unwarping Whether to use the text image unwarping module. bool False
use_textline_orientation Whether to load and use the text line orientation classification module. If not set, the parameter value initialized by the pipeline will be used, which is initialized to True by default. bool
use_seal_recognition Whether to load and use the seal text recognition sub-pipeline. If not set, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool
use_table_recognition Whether to load and use the table recognition sub-pipeline. If not set, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool
use_formula_recognition Whether to load and use the formula recognition sub-pipeline. If not set, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool
use_chart_recognition Whether to use the chart parsing module. bool False
use_region_detection Whether to load and use the document region detection sub-pipeline. If not set, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool
device The device used for inference. It supports specifying a specific card number:
  • CPU: For example, cpu means using CPU for inference;
  • GPU: For example, gpu:0 means using the first GPU for inference;
  • NPU: For example, npu:0 means using the first NPU for inference;
  • XPU: For example, xpu:0 means using the first XPU for inference;
  • MLU: For example, mlu:0 means using the first MLU for inference;
  • DCU: For example, dcu:0 means using the first DCU for inference;
If not set, the parameter value initialized by the pipeline will be used by default. During initialization, the local GPU device 0 will be used preferentially. If not available, the CPU device will be used.
str
enable_hpi Whether to enable high-performance inference. bool False
use_tensorrt Whether to enable the TensorRT subgraph engine of Paddle Inference. If the model does not support acceleration via TensorRT, acceleration will not be used even if this flag is set.
For PaddlePaddle with CUDA 11.8, the compatible TensorRT version is 8.x (x>=6), and it is recommended to install TensorRT 8.6.1.6.
For PaddlePaddle with CUDA 12.6, the compatible TensorRT version is 10.x (x>=5), and it is recommended to install TensorRT 10.5.0.18.
bool False
precision Computational precision, such as fp32, fp16. str fp32
enable_mkldnn Whether to enable MKL-DNN accelerated inference. If MKL-DNN is not available or the model does not support acceleration via MKL-DNN, acceleration will not be used even if this flag is set. bool True
mkldnn_cache_capacity MKL-DNN cache capacity. int 10
cpu_threads Number of threads used for inference on CPU. int 8
paddlex_config Path to the PaddleX pipeline configuration file. str


The execution results will be printed to the terminal.
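For example, a fuller invocation that translates a PDF, saves the results, and disables the optional preprocessing modules could look like the following; the input path is a placeholder, and all flags are those listed in the parameter table above:

paddleocr pp_doctranslation \
    -i ./document_sample.pdf \
    --target_language en \
    --save_path ./output \
    --use_doc_orientation_classify False \
    --use_doc_unwarping False \
    --qianfan_api_key your_api_key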

2.2 Integration via Python Script

The command-line method is for quickly experiencing and viewing results; in projects, integration via code is usually required. You can download the test file and use the following sample code for inference:

from paddlex import create_pipeline
# Create a translation pipeline
pipeline = create_pipeline(pipeline="PP-DocTranslation")

# Document path
input_path = "document_sample.pdf"

# Output directory
output_path = "./output"

# Large model configuration
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

if input_path.lower().endswith(".md"):
    # Read Markdown documents; directories and URL links with the .md suffix can also be passed in
    ori_md_info_list = pipeline.load_from_markdown(input_path)
else:
    # Use PP-StructureV3 to perform layout parsing on PDF/image documents to obtain markdown information
    visual_predict_res = pipeline.visual_predict(
        input_path,
        use_doc_orientation_classify=False,
        use_doc_unwarping=False,
        use_common_ocr=True,
        use_seal_recognition=True,
        use_table_recognition=True,
    )

    ori_md_info_list = []
    for res in visual_predict_res:
        layout_parsing_result = res["layout_parsing_result"]
        ori_md_info_list.append(layout_parsing_result.markdown)
        layout_parsing_result.save_to_img(output_path)
        layout_parsing_result.save_to_markdown(output_path)

    # Concatenate the markdown information of multi-page documents into a single markdown file, and save the merged original markdown text
    if input_path.lower().endswith(".pdf"):
        ori_md_info = pipeline.concatenate_markdown_pages(ori_md_info_list)
        ori_md_info.save_to_markdown(output_path)

# Perform document translation (target language: English)
tgt_md_info_list = pipeline.translate(
    ori_md_info_list=ori_md_info_list,
    target_language="en",
    chunk_size=5000,
    chat_bot_config=chat_bot_config,
)
# Save the translation results
for tgt_md_info in tgt_md_info_list:
    tgt_md_info.save_to_markdown(output_path)

After executing the above code, you will obtain the parsed results of the original document to be translated, the Markdown file of the original text to be translated, and the Markdown file of the translated document, all saved in the output directory.

The process, API description, and output description of PP-DocTranslation prediction are as follows:

(1) Call PPDocTranslation to instantiate a PP-DocTranslation pipeline object. The relevant parameters are described as follows:
Parameter | Description | Parameter Type | Default Value
layout_detection_model_name The model name for layout area detection. If set to None, the default model of the pipeline will be used. str|None None
layout_detection_model_dir The directory path of the layout area detection model. If set to None, the official model will be downloaded. str|None None
layout_threshold The score threshold for the layout model.
  • float: any floating-point number between 0 and 1;
  • dict: e.g. {0: 0.1}, where the key is the class ID and the value is the threshold for that class;
  • None: if set to None, the parameter value initialized by the pipeline will be used, which is initialized to 0.5 by default.
float|dict|None None
layout_nms Whether to use post-processing NMS for layout detection. If set to None, the parameter value initialized by the pipeline will be used, which is initialized to True by default. bool|None None
layout_unclip_ratio Expansion coefficient of the detection box for the layout area detection model.
  • float: any floating-point number greater than 0;
  • Tuple[float,float]: expansion coefficients in the horizontal and vertical directions respectively;
  • dict, where the key of the dict is of int type, representing cls_id, and the value is of tuple type, such as {0: (1.1, 2.0)}, indicating that the center of the detection box for category 0 output by the model remains unchanged, with the width expanded by 1.1 times and the height expanded by 2.0 times;
  • None: if set to None, the parameter value initialized by the pipeline will be used, which is initialized to 1.0 by default.
float|Tuple[float,float]|dict|None None
layout_merge_bboxes_mode Filtering method for overlapping boxes in layout area detection.
  • str: large, small, union, indicating whether to retain the large box, small box, or both during overlapping box filtering, respectively;
  • dict: the key of the dict is of int type, representing cls_id, and the value is of str type, such as {0: "large", 2: "small"}, which means using the large mode for detection boxes of category 0 and the small mode for detection boxes of category 2;
  • None: If set to None, the parameter value initialized by the pipeline will be used, and the default initialization is large.
str|dict|None None
chart_recognition_model_name The model name for chart parsing. If set to None, the default model of the pipeline will be used. str|None None
chart_recognition_model_dir The directory path of the model for chart parsing. If set to None, the official model will be downloaded. str|None None
chart_recognition_batch_size The batch size of the model for chart parsing. If set to None, the batch size will be set to 1 by default. int|None None
region_detection_model_name The model name for detecting submodules of document image layout. If set to None, the default model of the pipeline will be used. str|None None
region_detection_model_dir The directory path of the model for detecting submodules of document image layout. If set to None, the official model will be downloaded. str|None None
doc_orientation_classify_model_name Name of the document orientation classification model. If set to None, the default model in the pipeline will be used. str|None None
doc_orientation_classify_model_dir Directory path of the document orientation classification model. If set to None, the official model will be downloaded. str|None None
doc_unwarping_model_name Name of the text image unwarping model. If set to None, the default model in the pipeline will be used. str|None None
doc_unwarping_model_dir Directory path of the text image unwarping model. If set to None, the official model will be downloaded. str|None None
text_detection_model_name Name of the text detection model. If set to None, the default model in the pipeline will be used. str|None None
text_detection_model_dir Directory path of the text detection model. If set to None, the official model will be downloaded. str|None None
text_det_limit_side_len Limit on the side length of the image for text detection.
  • int: any integer greater than 0;
  • None: if set to None, the parameter value initialized by the pipeline will be used, and the default initialization value is 960.
int|None None
text_det_limit_type The type of image side length limit for text detection.
  • str: supports min and max, where min means ensuring that the shortest side of the image is not less than det_limit_side_len, and max means ensuring that the longest side of the image is not greater than limit_side_len;
  • None: if set to None, the parameter value initialized by the pipeline will be used, and the default initialization value is max.
str|None None
text_det_thresh Detection pixel threshold. Only pixels with scores greater than this threshold in the output probability map will be considered as text pixels.
  • float: any floating-point number greater than 0;
  • None: if set to None, the parameter value initialized by the pipeline, 0.3, will be used by default.
float|None None
text_det_box_thresh Detection box threshold: When the average score of all pixels within the detected bounding box is greater than this threshold, the result is considered a text region.
  • float: any floating-point number greater than 0;
  • None: if set to None, the parameter value initialized by the pipeline, 0.6, will be used by default.
float|None None
text_det_unclip_ratio Text detection expansion coefficient. This method is used to expand the text region. The larger the value, the larger the expanded area.
  • float: any floating-point number greater than 0;
  • None: if set to None, the parameter value initialized by the pipeline, 2.0, will be used by default.
float|None None
textline_orientation_model_name Name of the text line orientation model. If set to None, the default model of the pipeline will be used. str|None None
textline_orientation_model_dir Directory path of the text line orientation model. If set to None, the official model will be downloaded. str|None None
textline_orientation_batch_size Batch size of the text line orientation model. If set to None, the batch size will be set to 1 by default. int|None None
text_recognition_model_name The name of the text recognition model. If set to None, the default model in the pipeline will be used. str|None None
text_recognition_model_dir The directory path of the text recognition model. If set to None, the official model will be downloaded. str|None None
text_recognition_batch_size The batch size of the text recognition model. If set to None, the default batch size will be set to 1. int|None None
text_rec_score_thresh The threshold for text recognition. Text results with scores higher than this threshold will be retained.
  • float: Any floating-point number greater than 0;
  • None: If set to None, the parameter value initialized by the pipeline, 0.0, will be used by default, meaning no threshold will be set.
float|None None
table_classification_model_name The name of the table classification model. If set to None, the default model in the pipeline will be used. str|None None
table_classification_model_dir The directory path of the table classification model. If set to None, the official model will be downloaded. str|None None
wired_table_structure_recognition_model_name The name of the wired table structure recognition model. If set to None, the default model in the pipeline will be used. str|None None
wired_table_structure_recognition_model_dir The directory path of the wired table structure recognition model. If set to None, the official model will be downloaded. str|None None
wireless_table_structure_recognition_model_name The name of the wireless table structure recognition model. If set to None, the default model in the pipeline will be used. str|None None
wireless_table_structure_recognition_model_dir The directory path of the wireless table structure recognition model. If set to None, the official model will be downloaded. str|None None
wired_table_cells_detection_model_name The name of the wired table cell detection model. If set to None, the default model in the pipeline will be used. str|None None
wired_table_cells_detection_model_dir The directory path of the wired table cell detection model. If set to None, the official model will be downloaded. str|None None
wireless_table_cells_detection_model_name The name of the wireless table cell detection model. If set to None, the default model in the pipeline will be used. str|None None
wireless_table_cells_detection_model_dir The directory path of the wireless table cell detection model. If set to None, the official model will be downloaded. str|None None
table_orientation_classify_model_name The name of the table orientation classification model. If set to None, the default model in the pipeline will be used. str|None None
table_orientation_classify_model_dir The directory path of the table orientation classification model. If set to None, the official model will be downloaded. str|None None
seal_text_detection_model_name The name of the seal text detection model. If set to None, the default model in the pipeline will be used. str|None None
seal_text_detection_model_dir The directory path of the seal text detection model. If set to None, the official model will be downloaded. str|None None
seal_det_limit_side_len The image side length limit for seal text detection.
  • int: any integer greater than 0;
  • None: If set to None, the parameter value initialized by the pipeline will be used, and the default initialization value is 736.
int|None None
seal_det_limit_type The image side length limit type for seal text detection.
  • str: supports min and max, where min indicates that the shortest side of the image is guaranteed to be no less than det_limit_side_len, and max indicates that the longest side of the image is guaranteed to be no greater than limit_side_len;
  • None: If set to None, the parameter value initialized by the pipeline will be used, and the default initialization value is min.
str|None None
seal_det_thresh The detection pixel threshold. Only pixels with scores greater than this threshold in the output probability map will be considered as text pixels.
  • float: any floating-point number greater than 0;
  • None: if set to None, the parameter value initialized by the pipeline will be used by default, which is 0.2.
float|None None
seal_det_box_thresh Detection box threshold. When the average score of all pixels within the detected bounding box is greater than this threshold, the result is considered a text region.
  • float: any floating-point number greater than 0;
  • None: if set to None, the parameter value initialized by the pipeline will be used by default, which is 0.6.
float|None None
seal_det_unclip_ratio Expansion coefficient for seal text detection. This method is used to expand the text region. The larger the value, the larger the expanded area.
  • float: any floating-point number greater than 0;
  • None: if set to None, the parameter value initialized by the pipeline will be used by default, which is 0.5.
float|None None
seal_text_recognition_model_name Name of the seal text recognition model. If set to None, the default model of the pipeline will be used. str|None None
seal_text_recognition_model_dir Directory path of the seal text recognition model. If set to None, the official model will be downloaded. str|None None
seal_text_recognition_batch_size Batch size of the seal text recognition model. If set to None, the batch size will be set to 1 by default. int|None None
seal_rec_score_thresh Threshold for seal text recognition. Text results with scores higher than this threshold will be retained.
  • float: any floating-point number greater than 0;
  • None: if set to None, the parameter value initialized by the pipeline, 0.0, will be used by default, meaning no threshold is set.
float|None None
formula_recognition_model_name Name of the formula recognition model. If set to None, the default model of the pipeline will be used. str|None None
formula_recognition_model_dir Directory path of the formula recognition model. If set to None, the official model will be downloaded. str|None None
formula_recognition_batch_size The batch size of the formula recognition model. If set to None, the batch size will be set to 1 by default. int|None None
use_doc_orientation_classify Whether to load and use the document orientation classification module. If set to None, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool|None None
use_doc_unwarping Whether to load and use the text image unwarping module. If set to None, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool|None None
use_textline_orientation Whether to load and use the text line orientation classification module. If set to None, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool|None None
use_seal_recognition Whether to load and use the sub-pipeline for seal text recognition. If set to None, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool|None None
use_table_recognition Whether to load and use the sub-pipeline for table recognition. If set to None, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool|None None
use_formula_recognition Whether to load and use the sub-pipeline for formula recognition. If set to None, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool|None None
use_chart_recognition Whether to load and use the chart parsing module. If set to None, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool|None None
use_region_detection Whether to load and use the sub-pipeline for document region detection. If set to None, the parameter value initialized by the pipeline will be used, and the default initialization is True. bool|None None
chat_bot_config Configuration information for the large language model. The configuration content is the following dict:
{
"module_name": "chat_bot",
"model_name": "ernie-3.5-8k",
"base_url": "https://qianfan.baidubce.com/v2",
"api_type": "openai",
"api_key": "api_key"  # Please set this to the actual API key
}
dict|None None
device Device for inference. Supports specifying a specific card number:
  • CPU: e.g., cpu means using CPU for inference;
  • GPU: e.g., gpu:0 means using the 1st GPU for inference;
  • NPU: e.g., npu:0 means using the 1st NPU for inference;
  • XPU: e.g., xpu:0 means using the 1st XPU for inference;
  • MLU: e.g., mlu:0 means using the 1st MLU for inference;
  • DCU: e.g., dcu:0 means using the 1st DCU for inference;
  • None: if set to None, the local GPU device 0 will be used preferentially during initialization; if unavailable, the CPU device will be used.
str|None None
enable_hpi Whether to enable high-performance inference. bool False
use_tensorrt Whether to enable the TensorRT subgraph engine of Paddle Inference. If the model does not support acceleration via TensorRT, acceleration will not be used even if this flag is set.
For PaddlePaddle with CUDA 11.8, the compatible TensorRT version is 8.x (x>=6), and it is recommended to install TensorRT 8.6.1.6.
For PaddlePaddle with CUDA 12.6, the compatible TensorRT version is 10.x (x>=5), and it is recommended to install TensorRT 10.5.0.18.
bool False
precision Computational precision, such as fp32, fp16. str "fp32"
enable_mkldnn Whether to enable MKL-DNN for accelerated inference. If MKL-DNN is not available or the model does not support acceleration via MKL-DNN, acceleration will not be used even if this flag is set. bool True
mkldnn_cache_capacity MKL-DNN cache capacity. int 10
cpu_threads The number of threads used for inference on the CPU. int 8
paddlex_config Path to the PaddleX pipeline configuration file. str|None None
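Putting a few of these initialization parameters together, instantiation might look like the following minimal sketch. It assumes the PPDocTranslation class is imported from paddleocr; the model choice and device are examples, and the API key is a placeholder:

from paddleocr import PPDocTranslation

pipeline = PPDocTranslation(
    layout_detection_model_name="PP-DocLayout_plus-L",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    device="gpu:0",  # set to None to prefer GPU 0 and fall back to CPU
    chat_bot_config={
        "module_name": "chat_bot",
        "model_name": "ernie-3.5-8k",
        "base_url": "https://qianfan.baidubce.com/v2",
        "api_type": "openai",
        "api_key": "your_api_key",  # your actual API key
    },
)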
(2) Call the visual_predict() method of the PP-DocTranslation pipeline object to obtain visual prediction results. This method returns a list of results. The pipeline also provides a visual_predict_iter() method; the two accept the same parameters and return the same results, except that visual_predict_iter() returns a generator, which processes and yields prediction results step by step and is suitable for large datasets or scenarios where memory conservation matters. Choose either method based on actual needs; a minimal sketch of the generator variant follows.
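The sketch below reuses the pipeline object and the result handling from the script above:

for res in pipeline.visual_predict_iter("document_sample.pdf"):
    # Results are yielded one page at a time, keeping peak memory low
    layout_parsing_result = res["layout_parsing_result"]
    layout_parsing_result.save_to_markdown("./output")

The parameters of the visual_predict() method and their descriptions are as follows: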
Parameter | Description | Parameter Type | Default Value
input Data to be predicted, supporting multiple input types, required.
  • Python Var: such as numpy.ndarray representing image data;
  • str: such as the local path of an image or PDF file: /root/data/img.jpg; or a URL link, such as the network URL of an image or PDF file: Example; or a local directory, which should contain the images to be predicted, such as /root/data/ (currently, prediction of PDF files within directories is not supported; PDF files must be specified by their exact file paths);
  • list: list elements should be of the aforementioned data types, such as [numpy.ndarray, numpy.ndarray], ["/root/data/img1.jpg", "/root/data/img2.jpg"], or ["/root/data1", "/root/data2"].
Python Var|str|list
use_doc_orientation_classify Whether to use the document orientation classification module during inference. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. bool|None False
use_doc_unwarping Whether to use the text image unwarping module during inference. Set to None to use the instantiated parameter; otherwise, this parameter takes precedence. bool|None False
use_textline_orientation Whether to use the text line orientation classification module during inference. Set to None to use the instantiated parameter; otherwise, this parameter takes precedence. bool|None None
use_seal_recognition Whether to use the seal text recognition sub-pipeline during inference. Set to None to use the instantiated parameter; otherwise, this parameter takes precedence. bool|None None
use_table_recognition Whether to use the table recognition sub-pipeline during inference. Set to None to use the instantiated parameter; otherwise, this parameter takes precedence. bool|None None
use_formula_recognition Whether to use the formula recognition sub-pipeline during inference. Set to None to use the instantiated parameter; otherwise, this parameter takes precedence. bool|None None
use_chart_recognition Whether to use the chart parsing module. Set to None to use the instantiated parameter; otherwise, this parameter takes precedence. bool|None False
use_region_detection Whether to use the sub-pipeline for document region detection. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. bool|None None
layout_threshold The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. float|dict|None None
layout_nms The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. bool|None None
layout_unclip_ratio The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. float|Tuple[float,float]|dict|None None
layout_merge_bboxes_mode The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. str|dict|None None
text_det_limit_side_len The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. int|None None
text_det_limit_type The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. str|None None
text_det_thresh The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. float|None None
text_det_box_thresh The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. float|None None
text_det_unclip_ratio The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. float|None None
text_rec_score_thresh The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. float|None None
seal_det_limit_side_len The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. int|None None
seal_det_limit_type The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. str|None None
seal_det_thresh The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. float|None None
seal_det_box_thresh The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. float|None None
seal_det_unclip_ratio The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. float|None None
seal_rec_score_thresh The parameter meaning is basically the same as the instantiation parameter. Set to None to use the instantiation parameter; otherwise, this parameter takes precedence. float|None None
use_wired_table_cells_trans_to_html Whether to enable direct conversion of wired table cell detection results to HTML. If enabled, HTML is constructed directly based on the geometric relationships of wired table cell detection results. bool False
use_wireless_table_cells_trans_to_html Whether to enable direct conversion of wireless table cell detection results to HTML. If enabled, HTML is constructed directly based on the geometric relationships of wireless table cell detection results. bool False
use_table_orientation_classify Whether to enable table orientation classification. When enabled, if the table in the image is rotated by 90/180/270 degrees, the orientation can be corrected and table recognition can be completed correctly. bool True
use_ocr_results_with_table_cells Whether to enable cell-segmented OCR. When enabled, OCR detection results will be segmented and re-recognized based on cell prediction results to avoid missing text. bool True
use_e2e_wired_table_rec_model Whether to enable the end-to-end wired table recognition mode. If enabled, the cell detection model will not be used, and only the table structure recognition model will be used. bool False
use_e2e_wireless_table_rec_model Whether to enable the end-to-end wireless table recognition mode. If enabled, the cell detection model will not be used, and only the table structure recognition model will be used. bool True
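As mentioned above, here is a minimal sketch of a `visual_predict()` call (the input path and the overridden parameters are illustrative):

    # visual_predict() returns a list of results; visual_predict_iter() accepts
    # the same parameters but returns a generator, which is friendlier to memory.
    visual_results = pipeline.visual_predict(
        "/root/data/img.jpg",
        use_doc_orientation_classify=False,
        use_doc_unwarping=False,
    )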
(3) Processing visual prediction results: The prediction result for each sample is a corresponding Result object, and it supports operations such as printing, saving as an image, and saving as a json file:
Method Method Description Parameter Parameter Type Parameter Description Default Value
print() Print the result to the terminal format_json bool Whether to use indentation formatting for the output content in JSON format True
indent int Specifies the indentation level to beautify the output JSON data and make it more readable; valid only when format_json is True. 4
ensure_ascii bool Controls whether non-ASCII characters are escaped to Unicode. When set to True, all non-ASCII characters will be escaped; False retains the original characters. Valid only when format_json is True. False
save_to_json() Saves the result as a file in JSON format save_path str The path where the file is saved. When it is a directory, the saved file name is consistent with the input file name. None
indent int Specifies the indentation level to beautify the output JSON data and make it more readable; valid only when format_json is True. 4
ensure_ascii bool Controls whether non-ASCII characters are escaped to Unicode. When set to True, all non-ASCII characters will be escaped; False retains the original characters. Valid only when format_json is True. False
save_to_img() Saves the visualized images of each intermediate module in PNG format save_path str The file path for saving, which supports directory or file path None
save_to_markdown() Saves each page of an image or PDF file as a separate file in markdown format save_path str The file path for saving, which supports directory or file path None
save_to_html() Saves the tables in the file in HTML format save_path str The file path for saving, which supports a directory or file path None
save_to_xlsx() Saves the tables in the file in XLSX format save_path str The file path for saving, which supports a directory or file path None
- Calling the `print()` method will print the results to the terminal. The content printed to the terminal is explained as follows:
    - `input_path`: `(str)` The input path of the image or PDF to be predicted
    - `page_index`: `(Union[int, None])` If the input is a PDF file, indicates which page of the PDF it is; otherwise it is `None`
    - `model_settings`: `(Dict[str, bool])` The model parameters configured for the pipeline
        - `use_doc_preprocessor`: `(bool)` Controls whether to enable the document preprocessing sub-pipeline
        - `use_general_ocr`: `(bool)` Controls whether to enable the OCR sub-pipeline
        - `use_seal_recognition`: `(bool)` Controls whether to enable the seal recognition sub-pipeline
        - `use_table_recognition`: `(bool)` Controls whether to enable the table recognition sub-pipeline
        - `use_formula_recognition`: `(bool)` Controls whether to enable the formula recognition sub-pipeline
    - `doc_preprocessor_res`: `(Dict[str, Union[List[float], str]])` A dictionary of document preprocessing results, which only exists when `use_doc_preprocessor=True`
        - `input_path`: `(str)` The image path accepted by the document preprocessing sub-pipeline; saved as `None` when the input is `numpy.ndarray`
        - `page_index`: `None` when the input is `numpy.ndarray`
        - `model_settings`: `(Dict[str, bool])` Model configuration parameters for the document preprocessing sub-pipeline
            - `use_doc_orientation_classify`: `(bool)` Controls whether to enable the document image orientation classification submodule
            - `use_doc_unwarping`: `(bool)` Controls whether to enable the text image unwarping submodule
        - `angle`: `(int)` The prediction result of the document image orientation classification submodule; returns the actual angle value when the submodule is enabled
    - `parsing_res_list`: `(List[Dict])` A list of parsing results in reading order, where each element is a dictionary
        - `block_bbox`: `(np.ndarray)` The bounding box of the layout region
        - `block_label`: `(str)` The label of the layout region, such as `text`, `table`, etc.
        - `block_content`: `(str)` The content within the layout region
        - `seg_start_flag`: `(bool)` Whether this layout region is the start of a paragraph
        - `seg_end_flag`: `(bool)` Whether this layout region is the end of a paragraph
        - `sub_label`: `(str)` The sub-label of the layout region; for example, the sub-label of `text` might be `title_text`
        - `sub_index`: `(int)` The sub-index of the layout region, used for restoring Markdown
        - `index`: `(int)` The index of the layout region, used for displaying the layout sorting results
    - `overall_ocr_res`: `(Dict[str, Union[List[str], List[float], numpy.ndarray]])` A dictionary of global OCR results
        - `input_path`: `(Union[str, None])` The image path accepted by the OCR sub-pipeline; saved as `None` when the input is `numpy.ndarray`
        - `page_index`: `None` when the input is `numpy.ndarray`
        - `model_settings`: `(Dict)` Model configuration parameters for the OCR sub-pipeline
        - `dt_polys`: `(List[numpy.ndarray])` A list of polygon bounding boxes for text detection; each box is represented by a numpy array of 4 vertex coordinates, with shape (4, 2) and dtype int16
        - `dt_scores`: `(List[float])` A list of confidence scores for text detection bounding boxes
        - `text_det_params`: `(Dict[str, Dict[str, int, float]])` Configuration parameters for the text detection module
            - `limit_side_len`: `(int)` The side-length limit applied during image preprocessing
            - `limit_type`: `(str)` The handling method for the side-length limit
            - `thresh`: `(float)` The confidence threshold for text pixel classification
            - `box_thresh`: `(float)` The confidence threshold for text detection bounding boxes
            - `unclip_ratio`: `(float)` The dilation coefficient for text detection bounding boxes
            - `text_type`: `(str)` The type of text detection, currently fixed as "general"
        - `textline_orientation_angles`: `(List[int])` The prediction results for text line orientation classification; returns actual angle values when enabled (e.g., [0,0,1])
        - `text_rec_score_thresh`: `(float)` The filtering threshold for text recognition results
        - `rec_texts`: `(List[str])` A list of text recognition results, containing only texts with confidence scores exceeding `text_rec_score_thresh`
        - `rec_scores`: `(List[float])` A list of text recognition confidence scores, filtered by `text_rec_score_thresh`
        - `rec_polys`: `(List[numpy.ndarray])` A list of text detection bounding boxes after confidence filtering, in the same format as `dt_polys`
    - `formula_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` A list of formula recognition results, where each element is a dictionary
        - `rec_formula`: `(str)` The recognized formula
        - `rec_polys`: `(numpy.ndarray)` The bounding box of the recognized formula, with shape (4, 2) and dtype int16
        - `formula_region_id`: `(int)` The region number where the formula is located
    - `seal_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` A list of seal recognition results, where each element is a dictionary
        - `input_path`: `(str)` The input path of the seal image
        - `page_index`: `None` when the input is `numpy.ndarray`
        - `model_settings`: `(Dict)` Model configuration parameters for the seal recognition sub-pipeline
        - `dt_polys`: `(List[numpy.ndarray])` A list of detected seal bounding boxes, in the same format as `dt_polys` above
        - `text_det_params`: `(Dict[str, Dict[str, int, float]])` Configuration parameters for the seal detection module, with the same parameter meanings as above
        - `text_type`: `(str)` The type of seal detection, currently fixed as "seal"
        - `text_rec_score_thresh`: `(float)` The filtering threshold for seal recognition results
        - `rec_texts`: `(List[str])` A list of seal recognition results, containing only texts with confidence scores exceeding `text_rec_score_thresh`
        - `rec_scores`: `(List[float])` A list of seal recognition confidence scores, filtered by `text_rec_score_thresh`
        - `rec_polys`: `(List[numpy.ndarray])` A list of detected seal bounding boxes after confidence filtering, in the same format as `dt_polys`
        - `rec_boxes`: `(numpy.ndarray)` An array of rectangular bounding boxes, with shape (n, 4) and dtype int16; each row represents a rectangle
    - `table_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` A list of table recognition results, where each element is a dictionary
        - `cell_box_list`: `(List[numpy.ndarray])` A list of bounding boxes for table cells
        - `pred_html`: `(str)` An HTML-formatted string for the table
        - `table_ocr_pred`: `(dict)` OCR recognition results for the table
            - `rec_polys`: `(List[numpy.ndarray])` A list of detection bounding boxes for cells
            - `rec_texts`: `(List[str])` Recognition results for cells
            - `rec_scores`: `(List[float])` Recognition confidence scores for cells
            - `rec_boxes`: `(numpy.ndarray)` An array of rectangular bounding boxes for detection boxes, with shape (n, 4) and dtype int16; each row represents a rectangle
- Calling the `save_to_json()` method will save the above content to the specified `save_path`. If a directory is specified, the saved path will be `save_path/{your_img_basename}_res.json`. If a file is specified, the result will be saved directly to that file. Since JSON files do not support saving numpy arrays, `numpy.array` values are converted to lists.
- Calling the `save_to_img()` method will save the visualization results to the specified `save_path`. If a directory is specified, it saves the visualization images for layout region detection, global OCR, layout reading order, and so on. If a file is specified, the images are saved directly to that file. (The pipeline usually produces many result images, so specifying a single file path is not recommended; otherwise the images will overwrite one another and only the last one will be retained.)
- Calling the `save_to_markdown()` method will save the converted Markdown file to the specified `save_path`, with the saved file path being `save_path/{your_img_basename}.md`. If the input is a PDF file, it is recommended to specify a directory; otherwise, multiple Markdown files will overwrite one another.
- Calling the `concatenate_markdown_pages()` method combines the multi-page Markdown content `markdown_list` output by the PP-DocTranslation pipeline into a single complete document and returns the combined Markdown content.
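A minimal sketch of processing these results follows; it assumes, purely as an illustration, that each item exposes its layout parsing result under a "layout_parsing_result" key and its per-page Markdown via a markdown attribute (adapt these names to your actual results):

    markdown_list = []
    for res in visual_results:
        layout_res = res["layout_parsing_result"]      # assumed key, see lead-in
        layout_res.print()                             # print the result to the terminal
        layout_res.save_to_json(save_path="output")    # save the structured result
        layout_res.save_to_markdown(save_path="output")
        markdown_list.append(layout_res.markdown)      # assumed attribute, see lead-in

    # Combine the per-page Markdown content into one complete document.
    full_md = pipeline.concatenate_markdown_pages(markdown_list)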
(4) Call the `translate()` method to perform document translation. This method returns the original Markdown text and the translated text as Markdown objects, and you can save the required parts locally with the `save_to_markdown()` method. A sketch of a call follows the parameter table below. Below are the parameter descriptions for the `translate()` method:
Parameter Description Parameter Type Default Value
ori_md_info_list A data list in the original Markdown format, containing the content to be translated. It must be a list composed of dictionaries, with each dictionary representing a document block. List[Dict] No default value (required)
target_language Target language (ISO 639-1 language code, such as "en"/"ja"/"fr"). str "zh"
chunk_size The character count threshold for chunking the text to be translated. int 5000
task_description Custom task description prompt. str|None None
output_format Specify the output format requirements, such as "maintain the original Markdown structure". str|None None
rules_str Custom translation rule description. str|None None
few_shot_demo_text_content Example text content for few-shot learning. str|None None
few_shot_demo_key_value_list Structured few-shot example data in key-value pair format, which can include a glossary of technical terms. str|None None
chat_bot_config Large language model configuration. Set to None to use instantiation parameters; otherwise, this parameter takes precedence. dict|None None
llm_request_interval The time interval, in seconds, for sending requests to the large language model. This parameter can be used to prevent overly frequent calls to the large language model. float 0
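To tie these parameters together, here is a minimal sketch of a `translate()` call; the chat_bot_config keys and values are illustrative placeholders rather than a definitive configuration, and the result handling assumes the `save_to_markdown()` behavior described above:

    # Illustrative LLM configuration (placeholder values; replace with your own).
    chat_bot_config = {
        "module_name": "chat_bot",
        "model_name": "ernie-3.5-8k",
        "base_url": "https://qianfan.baidubce.com/v2",
        "api_type": "openai",
        "api_key": "YOUR_API_KEY",
    }

    translate_results = pipeline.translate(
        ori_md_info_list=markdown_list,  # Markdown blocks from the visual step
        target_language="en",            # ISO 639-1 language code
        chunk_size=5000,                 # character threshold for chunking
        chat_bot_config=chat_bot_config,
    )
    for res in translate_results:
        res.save_to_markdown(save_path="output")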

3. Development Integration/Deployment

If the pipeline can meet your requirements for inference speed and accuracy, you can proceed directly with development integration/deployment.

If you need to directly apply the pipeline in your Python project, you can refer to the sample code in 2.2 Python Script Approach.

In addition, PaddleOCR also offers two other deployment methods, detailed as follows:

🚀 High-Performance Inference: In real-world production environments, many applications have stringent performance criteria (especially response speed) for deployment strategies to ensure efficient system operation and a smooth user experience. To this end, PaddleOCR provides high-performance inference capabilities, aiming to deeply optimize model inference and pre/post-processing, achieving significant acceleration in the end-to-end process. For detailed information on the high-performance inference process, please refer to High-Performance Inference.

☁️ Serving: Serving is a common deployment form in real-world production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. For detailed information on the pipeline serving process, please refer to Serving.

Below are the API references for basic serving and examples of multilingual service invocation:

API reference

Main operations provided by the service:

  • The HTTP request method is POST.
  • Both the request body and response body are JSON data (JSON objects).
  • When the request is processed successfully, the response status code is 200, and the properties of the response body are as follows:
Name Type Meaning
logId string The UUID of the request.
errorCode integer Error code. Fixed as 0.
errorMsg string Error description. Fixed as "Success".
result object Operation result.
  • When the request is not processed successfully, the properties of the response body are as follows:
Name Type Meaning
logId string The UUID of the request.
errorCode integer Error code. Same as the response status code.
errorMsg string Error description.

The main operations provided by the service are as follows:

  • analyzeImages

Analyze images using computer vision models to obtain OCR, table recognition results, etc.

POST /doctrans-visual

  • The properties of the request body are as follows:
Name Type Meaning Required
file string The URL of an image file or PDF file accessible to the server, or the Base64-encoded result of the content of the aforementioned file types. By default, for PDF files with more than 10 pages, only the first 10 pages will be processed.
To remove the page limit, add the following configuration to the pipeline configuration file:
Serving:
  extra:
    max_num_input_imgs: null
Yes
fileType integer|null File type. 0 indicates a PDF file, 1 indicates an image file. If this property is absent from the request body, the file type will be inferred from the URL. No
useDocOrientationClassify boolean|null Refer to the description of the use_doc_orientation_classify parameter in the predict method of the pipeline object. No
useDocUnwarping boolean|null Refer to the description of the use_doc_unwarping parameter in the predict method of the pipeline object. No
useTextlineOrientation boolean|null Refer to the description of the use_textline_orientation parameter in the predict method of the pipeline object. No
useSealRecognition boolean|null Refer to the parameter description of use_seal_recognition in the predict method of the pipeline object. No
useTableRecognition boolean|null Refer to the parameter description of use_table_recognition in the predict method of the pipeline object. No
useFormulaRecognition boolean|null Refer to the parameter description of use_formula_recognition in the predict method of the pipeline object. No
useChartRecognition boolean|null Refer to the parameter description of use_chart_recognition in the predict method of the pipeline object. No
useRegionDetection boolean|null Refer to the description of the use_region_detection parameter in the predict method of the pipeline object. No
layoutThreshold number|object|null Refer to the parameter description of layout_threshold in the predict method of the pipeline object. No
layoutNms boolean|null Refer to the parameter description of layout_nms in the predict method of the pipeline object. No
layoutUnclipRatio number|array|object|null Refer to the parameter description of layout_unclip_ratio in the predict method of the pipeline object. No
layoutMergeBboxesMode string|object|null Refer to the parameter description of layout_merge_bboxes_mode in the predict method of the pipeline object. No
textDetLimitSideLen integer|null Refer to the description of the predict method's text_det_limit_side_len parameter in the pipeline object. No
textDetLimitType string|null Refer to the description of the predict method's text_det_limit_type parameter in the pipeline object. No
textDetThresh number|null Refer to the description of the predict method's text_det_thresh parameter in the pipeline object. No
textDetBoxThresh number|null Refer to the description of the predict method's text_det_box_thresh parameter in the pipeline object. No
textDetUnclipRatio number|null Refer to the description of the predict method's text_det_unclip_ratio parameter in the pipeline object. No
textRecScoreThresh number|null Refer to the description of the predict method's text_rec_score_thresh parameter in the pipeline object. No
sealDetLimitSideLen integer|null Refer to the description of the predict method's seal_det_limit_side_len parameter in the pipeline object. No
sealDetLimitType string|null Refer to the description of the predict method's seal_det_limit_type parameter in the pipeline object. No
sealDetThresh number|null Refer to the description of the predict method's seal_det_thresh parameter in the pipeline object. No
sealDetBoxThresh number|null Refer to the description of the predict method's seal_det_box_thresh parameter in the pipeline object. No
sealDetUnclipRatio number|null Refer to the description of the predict method's seal_det_unclip_ratio parameter in the pipeline object. No
sealRecScoreThresh number|null Refer to the description of the predict method's seal_rec_score_thresh parameter in the pipeline object. No
useWiredTableCellsTransToHtml boolean Refer to the description of the predict method's use_wired_table_cells_trans_to_html parameter in the pipeline object. No
useWirelessTableCellsTransToHtml boolean Refer to the description of the predict method's use_wireless_table_cells_trans_to_html parameter in the pipeline object. No
useTableOrientationClassify boolean Refer to the description of the predict method's use_table_orientation_classify parameter in the pipeline object. No
useOcrResultsWithTableCells boolean Refer to the description of the use_ocr_results_with_table_cells parameter in the predict method of the pipeline object. No
useE2eWiredTableRecModel boolean Refer to the description of the use_e2e_wired_table_rec_model parameter in the predict method of the pipeline object. No
useE2eWirelessTableRecModel boolean Refer to the description of the use_e2e_wireless_table_rec_model parameter in the predict method of the pipeline object. No
visualize boolean|null Whether to return visualization result images and intermediate images produced during processing.
  • Pass in true: return images.
  • Pass in false: do not return images.
  • If this parameter is not provided in the request body or null is passed in: follow the Serving.visualize setting in the pipeline configuration file.

For example, add the following field in the pipeline configuration file:
Serving:
  visualize: False
With this configuration, images will not be returned by default; the visualize parameter in the request body can still override this behavior. If neither the request body nor the configuration file sets it (or null is passed in the request body and the configuration file does not set it), images are returned by default.
No
  • When the request is processed successfully, the result in the response body has the following properties:
Name Type Meaning
layoutParsingResults array Layout parsing results. The array length is 1 (for image input) or the actual number of processed document pages (for PDF input). For PDF input, each element in the array represents the result of each actual processed page in the PDF file in sequence.
dataInfo object Input data information.

Each element in layoutParsingResults is an object with the following properties:

Name Type Meaning
prunedResult object A simplified version of the res field in the JSON representation of the layout_parsing_result generated by the visual_predict method of the pipeline object, with the input_path and page_index fields removed.
markdown object Markdown results.
outputImages object Refer to the img property description in the pipeline prediction results. The images are in JPEG format and encoded with Base64.
inputImage string|null Input image. The image is in JPEG format and encoded with Base64.

markdown is an object with the following properties:

Name Type Meaning
text string Markdown text.
images object Key-value pairs of relative paths of Markdown images and Base64-encoded images.
isStart boolean Whether the first element on the current page is the start of a paragraph.
isEnd boolean Whether the last element on the current page is the end of a paragraph.
  • translate

Translate documents using a large model.

POST /doctrans-translate

  • The properties of the request body are as follows:
Name Type Meaning Required
markdownList array List of Markdown documents to be translated. Can be obtained from the results of the analyzeImages operation. The images property will not be used. Yes
targetLanguage string Refer to the description of the target_language parameter in the translate method of the pipeline object. No
chunkSize integer Refer to the description of the chunk_size parameter in the translate method of the pipeline object. No
taskDescription string|null Refer to the description of the task_description parameter in the translate method of the pipeline object. No
outputFormat string|null Refer to the description of the output_format parameter in the translate method of the pipeline object. No
rulesStr string|null Refer to the description of the rules_str parameter in the translate method of the pipeline object. No
fewShotDemoTextContent string|null Refer to the description of the few_shot_demo_text_content parameter in the translate method of the pipeline object. No
fewShotDemoKeyValueList string|null Refer to the description of the few_shot_demo_key_value_list parameter in the translate method of the pipeline object. No
chatBotConfig object|null Refer to the description of the chat_bot_config parameter in the translate method of the pipeline object. No
llmRequestInterval number|null Refer to the description of the llm_request_interval parameter in the translate method of the pipeline object. No
  • When the request is processed successfully, the result in the response body has the following properties:
Name Type Meaning
translationResults array Translation results.

Each element in translationResults is an object with the following properties:

Name Type Meaning
language string Target language.
markdown object Markdown results. The object definition is consistent with the markdown returned by the analyzeImages operation.
  • Note: Including sensitive parameters such as the API key for large model calls in the request body may pose security risks. If not necessary, set these parameters in the configuration file and do not pass them in the request.

    Example of multilingual service invocation
    Python
    import base64
    import pathlib
    import pprint
    import sys
    
    import requests
    
    
    API_BASE_URL = "http://127.0.0.1:8080"
    
    file_path = "./demo.jpg"
    target_language = "en"
    
    with open(file_path, "rb") as file:
        file_bytes = file.read()
        file_data = base64.b64encode(file_bytes).decode("ascii")
    
    payload = {
        "file": file_data,
        "fileType": 1,
    }
    resp_visual = requests.post(url=f"{API_BASE_URL}/doctrans-visual", json=payload)
    if resp_visual.status_code != 200:
        print(
            f"Request to doctrans-visual failed with status code {resp_visual.status_code}."
        )
        pprint.pp(resp_visual.json())
        sys.exit(1)
    result_visual = resp_visual.json()["result"]
    
    markdown_list = []
    for i, res in enumerate(result_visual["layoutParsingResults"]):
        md_dir = pathlib.Path(f"markdown_{i}")
        md_dir.mkdir(exist_ok=True)
        (md_dir / "doc.md")
    write_text(res["markdown"]["text"])
        for img_path, img in res["markdown"]["images"].items():
            img_path = md_dir / img_path
            img_path.parent.mkdir(parents=True, exist_ok=True)
            img_path.write_bytes(base64.b64decode(img))
        print(f"The Markdown document to be translated is saved at {md_dir / 'doc.md'}")
        del res["markdown"]["images"]
        markdown_list.append(res["markdown"])
        for img_name, img in res["outputImages"].items():
            img_path = f"{img_name}_{i}.jpg"
            with open(img_path, "wb") as f:
                f.write(base64.b64decode(img))
            print(f"Output image saved at {img_path}")
    
    payload = {
        "markdownList": markdown_list,
    "targetLanguage": target_language,
    }
    resp_translate = requests.post(url=f"{API_BASE_URL}/doctrans-translate", json=payload)
    if resp_translate.status_code != 200:
        print(
            f"Request to doctrans-translate failed with status code {resp_translate.status_code}."
        )
        pprint.pp(resp_translate.json())
        sys.exit(1)
    result_translate = resp_translate.json()["result"]
    
    for i, res in enumerate(result_translate["translationResults"]):
        md_dir = pathlib.Path(f"markdown_{i}")
        (md_dir / "doc_translated.md").write_text(res["markdown"]["text"])
        print(f"Translated markdown document saved at {md_dir / 'doc_translated.md'}")


    4. Secondary Development

    If the default model weights provided by the PP-DocTranslation pipeline do not meet your accuracy or speed requirements in your scenario, you can try to use your own data from specific domains or application scenarios to further fine-tune the existing model and improve recognition performance in your scenario.

    4.1 Model Fine-tuning

    Since the PP-DocTranslation pipeline contains several modules, if the performance of the model pipeline does not meet expectations, the issue may originate from any one of these modules. You can analyze cases with poor extraction results, use visualized images to determine which module has the problem, and refer to the corresponding fine-tuning tutorial links in the following table to fine-tune the model.

    Scenario Fine-tuning module Fine-tuning reference link
    Inaccurate detection of layout areas, such as failure to detect seals and tables Layout area detection module Link
    Inaccurate recognition of table structures Table structure recognition module Link
    Inaccurate recognition of formulas Formula recognition module Link
    Omission in detecting seal texts Seal text detection module Link
    Omission in detecting texts Text detection module Link
    Inaccurate text content Text recognition module Link
    Inaccurate correction of vertical or rotated text lines Text line orientation classification module Link
    Inaccurate correction of whole image rotation Document image orientation classification module Link
    Inaccurate correction of image distortion Text image unwarping module Fine-tuning is temporarily not supported

    4.2 Model Application

    After completing fine-tuning training with your private dataset, you can obtain a local model weight file. Then, you can use the fine-tuned model weights by customizing the pipeline configuration file.

    1. Obtain the pipeline configuration file

    You can call the export_paddlex_config_to_yaml method of the PP-DocTranslation pipeline object in PaddleOCR to export the current pipeline configuration to a YAML file:

    from paddleocr import PPDocTranslation
    
    pipeline = PPDocTranslation()
    pipeline.export_paddlex_config_to_yaml("PP-DocTranslation.yaml")
    
    2. Modify the configuration file

    After obtaining the default pipeline configuration file, replace the local path of the fine-tuned model weights with the corresponding location in the pipeline configuration file. For example,

    ......
    SubModules:
      TextDetection:
        module_name: text_detection
        model_name: PP-OCRv5_server_det
        model_dir: null # Replace with the path to the weights of the fine-tuned text detection model
        limit_side_len: 960
        limit_type: max
        thresh: 0.3
        box_thresh: 0.6
        unclip_ratio: 1.5

      TextRecognition:
        module_name: text_recognition
        model_name: PP-OCRv5_server_rec
        model_dir: null # Replace with the path to the weights of the fine-tuned text recognition model
        batch_size: 1
        score_thresh: 0
    ......
    

    The pipeline configuration file not only includes parameters supported by PaddleOCR CLI and Python API but also allows for more advanced configurations. Detailed information can be found in the corresponding pipeline usage tutorial in the Overview of PaddleX Model Pipeline Usage. Refer to the detailed instructions therein and adjust the configurations according to your needs.

    3. Load the pipeline configuration file in the CLI

    After modifying the configuration file, specify the path to the modified pipeline configuration file using the --paddlex_config parameter in the command line. PaddleOCR will then read its contents as the pipeline configuration. Here is an example:

    paddleocr pp_doctranslation --paddlex_config PP-DocTranslation.yaml ...
    
    4. Load the pipeline configuration file in the Python API

    When initializing the pipeline object, you can pass the path of the PaddleX pipeline configuration file or a configuration dict through the paddlex_config parameter, and PaddleOCR will read its content as the pipeline configuration. The example is as follows:

    from paddleocr import PPDocTranslation
    
    pipeline = PPDocTranslation(paddlex_config="PP-DocTranslation.yaml")
    
