PP-ChatOCRv3-doc Pipeline Tutorial

1. Introduction to PP-ChatOCRv3-doc Pipeline

PP-ChatOCRv3-doc is a unique intelligent analysis solution for documents and images developed by PaddlePaddle. It combines Large Language Models (LLM) and OCR technology to provide a one-stop solution for complex document information extraction challenges such as layout analysis, rare characters, multi-page PDFs, tables, and seal recognition. By integrating with ERNIE Bot, it fuses massive data and knowledge to achieve high accuracy and wide applicability.

The PP-ChatOCRv3-doc pipeline includes modules for Table Structure Recognition, Layout Region Detection, Text Detection, Text Recognition, Seal Text Detection, Text Image Rectification, and Document Image Orientation Classification.

If you prioritize model accuracy, choose a model with higher accuracy. If you prioritize inference speed, choose a model with faster inference speed. If you prioritize model storage size, choose a model with a smaller storage size. Some benchmarks for these models are as follows:

👉Model List Details

Table Structure Recognition Module Models:

| Model | Model Download Link | Accuracy (%) | GPU Inference Time (ms) [Normal / High-Performance] | CPU Inference Time (ms) [Normal / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| SLANet | Inference Model / Training Model | 59.52 | 103.08 / 103.08 | 197.99 / 197.99 | 6.9 | SLANet is a table structure recognition model developed by the Baidu PaddleX team. It significantly improves the accuracy and inference speed of table structure recognition by adopting the CPU-friendly lightweight backbone PP-LCNet, the high-low-level feature fusion module CSP-PAN, and the SLA Head feature decoding module, which aligns structural and positional information. |
| SLANet_plus | Inference Model / Training Model | 63.69 | 140.29 / 140.29 | 195.39 / 195.39 | 6.9 | SLANet_plus is an enhanced version of SLANet. Compared to SLANet, it significantly improves the recognition of borderless and complex tables and reduces the model's sensitivity to table-positioning accuracy, enabling accurate recognition even when the table is offset. |

Layout Detection Module Models:

| Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal / High-Performance] | CPU Inference Time (ms) [Normal / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PicoDet_layout_1x | Inference Model / Training Model | 86.8 | 9.03 / 3.10 | 25.82 / 20.70 | 7.4 | An efficient layout area localization model trained on the PubLayNet dataset based on PicoDet-1x; it can locate five types of areas: text, titles, tables, images, and lists. |
| PicoDet_layout_1x_table | Inference Model / Training Model | 95.7 | 8.02 / 3.09 | 23.70 / 20.41 | 7.4 | An efficient layout area localization model trained on the PubLayNet dataset based on PicoDet-1x; it can locate a single category: tables. |
| PicoDet-S_layout_3cls | Inference Model / Training Model | 87.1 | 8.99 / 2.22 | 16.11 / 8.73 | 4.8 | A high-efficiency layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports, based on PicoDet-S; it covers three categories: tables, images, and seals. |
| PicoDet-S_layout_17cls | Inference Model / Training Model | 70.3 | 9.11 / 2.12 | 15.42 / 9.12 | 4.8 | A high-efficiency layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports, based on PicoDet-S; it covers 17 common layout categories: paragraph titles, images, text, numbers, abstracts, contents, chart titles, formulas, tables, table titles, references, document titles, footnotes, headers, algorithms, footers, and seals. |
| PicoDet-L_layout_3cls | Inference Model / Training Model | 89.3 | 13.05 / 4.50 | 41.30 / 41.30 | 22.6 | An efficient layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports, based on PicoDet-L; it covers three categories: tables, images, and seals. |
| PicoDet-L_layout_17cls | Inference Model / Training Model | 79.9 | 13.50 / 4.69 | 43.32 / 43.32 | 22.6 | An efficient layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports, based on PicoDet-L; it covers the same 17 common layout categories listed above. |
| RT-DETR-H_layout_3cls | Inference Model / Training Model | 95.9 | 114.93 / 27.71 | 947.56 / 947.56 | 470.1 | A high-precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports, based on RT-DETR-H; it covers three categories: tables, images, and seals. |
| RT-DETR-H_layout_17cls | Inference Model / Training Model | 92.6 | 115.29 / 104.09 | 995.27 / 995.27 | 470.2 | A high-precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports, based on RT-DETR-H; it covers the same 17 common layout categories listed above. |

Text Detection Module Models:

| Model | Model Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Normal / High-Performance] | CPU Inference Time (ms) [Normal / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv4_server_det | Inference Model / Training Model | 82.69 | 83.34 / 80.91 | 442.58 / 442.58 | 109 | PP-OCRv4's server-side text detection model, featuring higher accuracy, suitable for deployment on high-performance servers |
| PP-OCRv4_mobile_det | Inference Model / Training Model | 77.79 | 8.79 / 3.13 | 51.00 / 28.58 | 4.7 | PP-OCRv4's mobile text detection model, optimized for efficiency, suitable for deployment on edge devices |

Text Recognition Module Models:

| Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal / High-Performance] | CPU Inference Time (ms) [Normal / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv4_mobile_rec | Inference Model / Training Model | 78.20 | 4.82 / 4.82 | 16.74 / 4.64 | 10.6 | PP-OCRv4 is the next version of Baidu PaddlePaddle's self-developed text recognition model PP-OCRv3. By introducing data augmentation schemes and GTC-NRTR guidance branches, it further improves text recognition accuracy without compromising inference speed. The model is offered in both server and mobile versions to meet industrial needs in different scenarios. |
| PP-OCRv4_server_rec | Inference Model / Training Model | 79.20 | 6.58 / 6.58 | 33.17 / 33.17 | 71.2 | Server-side version of PP-OCRv4's text recognition model; see the description above. |

| Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal / High-Performance] | CPU Inference Time (ms) [Normal / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| ch_SVTRv2_rec | Inference Model / Training Model | 68.81 | 8.08 / 8.08 | 50.17 / 42.50 | 73.9 | SVTRv2 is a server-side text recognition model developed by the OpenOCR team at the Vision and Learning Lab (FVL) of Fudan University. It won first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, improving end-to-end recognition accuracy by 6% over PP-OCRv4 on the A-list. |

| Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal / High-Performance] | CPU Inference Time (ms) [Normal / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| ch_RepSVTR_rec | Inference Model / Training Model | 65.07 | 5.93 / 5.93 | 20.73 / 7.32 | 22.1 | RepSVTR is a mobile-oriented text recognition model based on SVTRv2. It won first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, improving end-to-end recognition accuracy by 2.5% over PP-OCRv4 on the B-list, while maintaining similar inference speed. |

Note: the three tables above are evaluated on different test sets (see the Test Environment Description below), so their accuracy figures are not directly comparable.

Seal Text Detection Module Models:

| Model | Model Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Normal / High-Performance] | CPU Inference Time (ms) [Normal / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv4_server_seal_det | Inference Model / Training Model | 98.21 | 74.75 / 67.72 | 382.55 / 382.55 | 109 | PP-OCRv4's server-side seal text detection model, featuring higher accuracy, suitable for deployment on better-equipped servers |
| PP-OCRv4_mobile_seal_det | Inference Model / Training Model | 96.47 | 7.82 / 3.09 | 48.28 / 23.97 | 4.6 | PP-OCRv4's mobile seal text detection model, offering higher efficiency, suitable for deployment on edge devices |

Text Image Rectification Module Models:

| Model | Model Download Link | MS-SSIM (%) | Model Size (M) | Description |
|---|---|---|---|---|
| UVDoc | Inference Model / Training Model | 54.40 | 30.3 | High-precision text image rectification model |

The accuracy metrics of the models are measured from the DocUNet benchmark.

Document Image Orientation Classification Module Models:

| Model | Model Download Link | Top-1 Acc (%) | GPU Inference Time (ms) [Normal / High-Performance] | CPU Inference Time (ms) [Normal / High-Performance] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-LCNet_x1_0_doc_ori | Inference Model / Training Model | 99.06 | 2.31 / 0.43 | 3.37 / 1.27 | 7 | A document image classification model based on PP-LCNet_x1_0, with four categories: 0°, 90°, 180°, 270° |
Test Environment Description:
  • Performance Test Environment
    • Test Dataset:
      • Table Structure Recognition Model: PaddleX internally built English table recognition dataset.
      • Layout Detection Model: PaddleOCR's self-built layout analysis dataset, containing 10,000 images of common document types such as Chinese and English papers, magazines, and research reports.
      • Text Detection Model: PaddleOCR's self-built Chinese dataset, covering multiple scenarios including street scenes, web images, documents, and handwriting, with 500 images for detection.
      • Text Recognition Model: PaddleOCR's self-built Chinese dataset, covering multiple scenarios including street scenes, web images, documents, and handwriting, with 11,000 images for text recognition.
      • ch_SVTRv2_rec: PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task A-rank evaluation set.
      • ch_RepSVTR_rec: PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task B-rank evaluation set.
      • English Recognition Model: PaddleX self-built English dataset.
      • Multilingual Recognition Model: PaddleX self-built multilingual dataset.
      • Text Line Direction Classification Model: PaddleX self-built dataset, covering multiple scenarios such as certificates and documents, containing 1,000 images.
      • Text Image Rectification Model: DocUNet
    • Hardware Configuration:
      • GPU: NVIDIA Tesla T4
      • CPU: Intel Xeon Gold 6271C @ 2.60GHz
      • Other Environments: Ubuntu 20.04 / cuDNN 8.6 / TensorRT 8.5.2.2
  • Inference Mode Description
| Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination |
|---|---|---|---|
| Normal Mode | FP32 precision / No TRT acceleration | FP32 precision / 8 threads | PaddleInference |
| High-Performance Mode | Optimal combination of pre-selected precision types and acceleration strategies | FP32 precision / 8 threads | Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.) |

2. Quick Start

The pre-trained model pipelines provided by PaddleX let you quickly experience their effects. You can experience the effect of the Document Scene Information Extraction v3 pipeline online, or use Python to experience it locally.

2.1 Online Experience

You can experience the Document Scene Information Extraction v3 pipeline online using the official demo images.

If you are satisfied with the pipeline's performance, you can directly integrate and deploy it. If not, you can also use private data to fine-tune the models in the pipeline online.

2.2 Local Experience

Before using the Document Scene Information Extraction v3 pipeline locally, ensure that you have completed the installation of the PaddleX wheel package according to the PaddleX Local Installation Guide. If you wish to selectively install dependencies, please refer to the relevant instructions in the installation guide. The dependency group corresponding to this pipeline is ie.

Before performing model inference, you need to prepare the API key for the large language model. PP-ChatOCRv3 supports calling the large model inference service provided by the Baidu Cloud Qianfan Platform. You can refer to Authentication and Authorization to obtain the API key from the Qianfan Platform.

After updating the configuration file, you can use a few lines of Python code to complete the quick inference. You can use the test file for testing:

from paddlex import create_pipeline

chat_bot_config={
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key" # your api_key
}

retriever_config={
    "module_name": "retriever",
    "model_name": "embedding-v1",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "qianfan",
    "api_key": "api_key" # your api_key
}

pipeline = create_pipeline(pipeline="PP-ChatOCRv3-doc", initial_predictor=False)

visual_predict_res = pipeline.visual_predict(
    input="vehicle_certificate-1.png",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_common_ocr=True,
    use_seal_recognition=True,
    use_table_recognition=True,
)

visual_info_list = []
for res in visual_predict_res:
    visual_info_list.append(res["visual_info"])
    layout_parsing_result = res["layout_parsing_result"]

vector_info = pipeline.build_vector(
    visual_info_list,
    flag_save_bytes_vector=True,
    retriever_config=retriever_config,
)
chat_result = pipeline.chat(
    key_list=["驾驶室准乘人数"],
    visual_info=visual_info_list,
    vector_info=vector_info,
    chat_bot_config=chat_bot_config,
    retriever_config=retriever_config,
)
print(chat_result)

After running, the output will be as follows:

{'chat_res': {'驾驶室准乘人数': '2'}}

The prediction process, API descriptions, and output descriptions of PP-ChatOCRv3-doc are as follows:

(1) Call the create_pipeline method to instantiate the PP-ChatOCRv3 pipeline object. The relevant parameter descriptions are as follows:
| Parameter | Parameter Description | Parameter Type | Default Value |
|---|---|---|---|
| `pipeline` | The name of the pipeline or the path to a pipeline configuration file. If it is a pipeline name, it must be a pipeline supported by PaddleX. | `str` | `None` |
| `config` | Specific configuration information for the pipeline (if set together with `pipeline`, it takes priority over `pipeline`, and the pipeline name must be consistent). | `dict[str, Any]` | `None` |
| `device` | The device for pipeline inference. Supports specific GPU card numbers such as "gpu:0", specific card numbers for other hardware such as "npu:0", and "cpu" for CPU. | `str` | `gpu` |
| `use_hpip` | Whether to enable the high-performance inference plugin. If set to `None`, the setting from the configuration file or `config` is used. | `bool` / `None` | `None` |
| `hpi_config` | High-performance inference configuration. | `dict` / `None` | `None` |
| `initial_predictor` | Whether to initialize the inference modules (if `False`, each module is initialized when first used). | `bool` | `True` |
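To make these parameters concrete, here is a minimal instantiation sketch (the argument values are illustrative, not required):

from paddlex import create_pipeline

pipeline = create_pipeline(
    pipeline="PP-ChatOCRv3-doc",  # pipeline name, or a path to a pipeline config file
    device="gpu:0",               # e.g. "cpu", "gpu:0", "npu:0"
    use_hpip=False,               # high-performance inference plugin disabled
    initial_predictor=False,      # defer module initialization until first use
)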
(2) Call the visual_predict() method of the PP-ChatOCRv3-doc pipeline object to obtain visual prediction results. This method will return a generator. The following are the parameters and their descriptions for the `visual_predict()` method:
- `input` (`Python Var|str|list`, required): The data to be predicted, supporting multiple input types.
  - Python Var: e.g., `numpy.ndarray` representing image data.
  - str: e.g., the local path of an image or PDF file (`/root/data/img.jpg`); a network URL of an image or PDF file; or a local directory containing images to be predicted (`/root/data/`). Prediction of PDF files inside a directory is currently not supported; a PDF file must be specified by its full file path.
  - list: elements must be of the above types, e.g., `[numpy.ndarray, numpy.ndarray]`, `["/root/data/img1.jpg", "/root/data/img2.jpg"]`, `["/root/data1", "/root/data2"]`.
- `device` (`str|None`, default `None`): The device for pipeline inference.
  - CPU: e.g., `cpu` to use the CPU for inference;
  - GPU: e.g., `gpu:0` to use the first GPU;
  - NPU: e.g., `npu:0`; XPU: e.g., `xpu:0`; MLU: e.g., `mlu:0`; DCU: e.g., `dcu:0`;
  - None: falls back to the value used at pipeline initialization, which prefers the local GPU 0 and falls back to the CPU if no GPU is available.
- `use_doc_orientation_classify` (`bool|None`, default `None`): Whether to use the document orientation classification module. `None` falls back to the pipeline-initialized value (`True`).
- `use_doc_unwarping` (`bool|None`, default `None`): Whether to use the document distortion correction module. `None` falls back to the pipeline-initialized value (`True`).
- `use_textline_orientation` (`bool|None`, default `None`): Whether to use the text line orientation classification module. `None` falls back to the pipeline-initialized value (`True`).
- `use_general_ocr` (`bool|None`, default `None`): Whether to use the OCR sub-pipeline. `None` falls back to the pipeline-initialized value (`True`).
- `use_seal_recognition` (`bool|None`, default `None`): Whether to use the seal recognition sub-pipeline. `None` falls back to the pipeline-initialized value (`True`).
- `use_table_recognition` (`bool|None`, default `None`): Whether to use the table recognition sub-pipeline. `None` falls back to the pipeline-initialized value (`True`).
- `layout_threshold` (`float|dict|None`, default `None`): The score threshold for the layout model.
  - float: any floating-point number between 0 and 1;
  - dict: e.g., `{0: 0.1}`, where the key is the category ID and the value is the threshold for that category;
  - None: falls back to the pipeline-initialized value (`0.5`).
- `layout_nms` (`bool|None`, default `None`): Whether to use NMS. `None` falls back to the pipeline-initialized value (`True`).
- `layout_unclip_ratio` (`float|Tuple[float,float]|dict|None`, default `None`): The expansion coefficient for layout detection boxes.
  - float: any floating-point number greater than 0;
  - Tuple[float,float]: the expansion coefficients in the horizontal and vertical directions, respectively;
  - dict: keys are `int` class IDs (`cls_id`) and values are scaling factors, e.g., `{0: (1.1, 2.0)}` expands the width of class-0 boxes by 1.1x and the height by 2.0x while keeping the center unchanged;
  - None: falls back to the pipeline-initialized value (`1.0`).
- `layout_merge_bboxes_mode` (`str|dict|None`, default `None`): The filtering method for overlapping boxes.
  - str: `large`, `small`, or `union`, meaning keep the larger box, the smaller box, or both when filtering overlapping boxes;
  - dict: keys are `int` class IDs and values are merging modes, e.g., `{0: "large", 2: "small"}`;
  - None: falls back to the pipeline-initialized value (`large`).
- `text_det_limit_side_len` (`int|None`, default `None`): The side-length limit for text detection images. Any integer greater than 0; `None` falls back to the pipeline-initialized value (`960`).
- `text_det_limit_type` (`str|None`, default `None`): The side-length limit type for text detection images. `min` ensures the shortest side of the image is no less than `det_limit_side_len`; `max` ensures the longest side is no greater than `limit_side_len`. `None` falls back to the pipeline-initialized value (`max`).
- `text_det_thresh` (`float|None`, default `None`): The detection pixel threshold; pixels in the output probability map with scores above this threshold are considered text pixels. Any floating-point number greater than 0; `None` falls back to `0.3`.
- `text_det_box_thresh` (`float|None`, default `None`): The detection box threshold; a detection result is considered a text region if the average score of all pixels within its border exceeds this threshold. Any floating-point number greater than 0; `None` falls back to `0.6`.
- `text_det_unclip_ratio` (`float|None`, default `None`): The text detection expansion coefficient; the larger the value, the larger the expanded region. Any floating-point number greater than 0; `None` falls back to `2.0`.
- `text_rec_score_thresh` (`float|None`, default `None`): The text recognition threshold; text results with scores above this threshold are retained. Any floating-point number greater than 0; `None` falls back to `0.0` (no threshold).
- `seal_det_limit_side_len` (`int|None`, default `None`): The side-length limit for seal detection images. Any integer greater than 0; `None` falls back to the pipeline-initialized value (`960`).
- `seal_det_limit_type` (`str|None`, default `None`): The side-length limit type for seal detection images. `min` and `max` behave as for `text_det_limit_type`. `None` falls back to the pipeline-initialized value (`max`).
- `seal_det_thresh` (`float|None`, default `None`): The detection pixel threshold; pixels in the output probability map with scores above this threshold are considered seal pixels. Any floating-point number greater than 0; `None` falls back to `0.3`.
- `seal_det_box_thresh` (`float|None`, default `None`): The detection box threshold; a detection result is considered a seal region if the average score of all pixels within its border exceeds this threshold. Any floating-point number greater than 0; `None` falls back to `0.6`.
- `seal_det_unclip_ratio` (`float|None`, default `None`): The seal detection expansion coefficient; the larger the value, the larger the expanded region. Any floating-point number greater than 0; `None` falls back to `2.0`.
- `seal_rec_score_thresh` (`float|None`, default `None`): The seal recognition threshold; text results with scores above this threshold are retained. Any floating-point number greater than 0; `None` falls back to `0.0` (no threshold).
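For illustration, here is a sketch of a call that sets several of the optional parameters above explicitly to their documented defaults (the input file is the one from Section 2.2):

visual_predict_res = pipeline.visual_predict(
    input="vehicle_certificate-1.png",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_seal_recognition=True,
    use_table_recognition=True,
    layout_threshold=0.5,         # documented default
    text_det_limit_side_len=960,  # documented default
    text_det_thresh=0.3,          # documented default
    text_rec_score_thresh=0.0,    # documented default (no filtering)
)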
(3) Process the visual prediction results. The prediction result for each sample is of type `dict`, containing two fields: `visual_info` and `layout_parsing_result`. Obtain visual information (including `normal_text_dict`, `table_text_list`, `table_html_list`, etc.) through `visual_info`, and place the information for each sample into the `visual_info_list` list, which will be sent to the large language model later. Of course, you can also obtain the layout parsing results through `layout_parsing_result`, which contains tables, text, images, etc., contained in the file or image, and supports printing, saving as an image, and saving as a `json` file:
......
for res in visual_predict_res:
    visual_info_list.append(res["visual_info"])
    layout_parsing_result = res["layout_parsing_result"]
    layout_parsing_result.print()
    layout_parsing_result.save_to_img("./output")
    layout_parsing_result.save_to_json("./output")
    layout_parsing_result.save_to_xlsx("./output")
    layout_parsing_result.save_to_html("./output")
......
| Method | Method Description | Parameter | Parameter Type | Parameter Description | Default Value |
|---|---|---|---|---|---|
| `print()` | Prints the result to the terminal | `format_json` | `bool` | Whether to format the output content with JSON indentation | `True` |
| | | `indent` | `int` | Specifies the indentation level to beautify the output JSON data for better readability; only valid when `format_json` is `True` | `4` |
| | | `ensure_ascii` | `bool` | Controls whether to escape non-ASCII characters to Unicode; `True` escapes all non-ASCII characters, `False` retains the original characters; only valid when `format_json` is `True` | `False` |
| `save_to_json()` | Saves the result as a JSON file | `save_path` | `str` | The file path for saving; when it is a directory, the saved file is named consistently with the input file type | N/A |
| | | `indent` | `int` | Specifies the indentation level to beautify the output JSON data for better readability; only valid when `format_json` is `True` | `4` |
| | | `ensure_ascii` | `bool` | Controls whether to escape non-ASCII characters to Unicode; `True` escapes all non-ASCII characters, `False` retains the original characters; only valid when `format_json` is `True` | `False` |
| `save_to_img()` | Saves the visualized images of each module in PNG format | `save_path` | `str` | The file path for saving; supports a directory or file path | N/A |
| `save_to_html()` | Saves the tables in the file as an HTML file | `save_path` | `str` | The file path for saving; supports a directory or file path | N/A |
| `save_to_xlsx()` | Saves the tables in the file as an XLSX file | `save_path` | `str` | The file path for saving; supports a directory or file path | N/A |
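For example, the formatting parameters from the table above can be passed explicitly (paths and values are illustrative):

layout_parsing_result.print(format_json=True, indent=4, ensure_ascii=False)
layout_parsing_result.save_to_json("./output", indent=4, ensure_ascii=False)
layout_parsing_result.save_to_img("./output")
layout_parsing_result.save_to_html("./output")
layout_parsing_result.save_to_xlsx("./output")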
- Calling the `print()` method will print the result to the terminal. The printed content is explained as follows:
  - `input_path`: `(str)` The input path of the image to be predicted
  - `page_index`: `(Union[int, None])` If the input is a PDF file, this indicates the current page number of the PDF; otherwise it is `None`
  - `model_settings`: `(Dict[str, bool])` Model parameters configured for the pipeline
    - `use_doc_preprocessor`: `(bool)` Controls whether the document preprocessing pipeline is enabled
    - `use_general_ocr`: `(bool)` Controls whether the OCR pipeline is enabled
    - `use_seal_recognition`: `(bool)` Controls whether the seal recognition pipeline is enabled
    - `use_table_recognition`: `(bool)` Controls whether the table recognition pipeline is enabled
    - `use_formula_recognition`: `(bool)` Controls whether the formula recognition pipeline is enabled
  - `parsing_res_list`: `(List[Dict])` A list of parsing results; each element is a dictionary, and the list order is the reading order after parsing
    - `block_bbox`: `(np.ndarray)` The bounding box of the layout area
    - `block_label`: `(str)` The label of the layout area, such as `text`, `table`, etc.
    - `block_content`: `(str)` The content within the layout area
  - `overall_ocr_res`: `(Dict[str, Union[List[str], List[float], numpy.ndarray]])` A dictionary of global OCR results
    - `input_path`: `(Union[str, None])` The image path received by the image OCR pipeline; `None` when the input is a `numpy.ndarray`
    - `model_settings`: `(Dict)` Model configuration parameters for the OCR pipeline
    - `dt_polys`: `(List[numpy.ndarray])` A list of polygon boxes for text detection; each detection box is a numpy array of 4 vertex coordinates with shape (4, 2) and dtype int16
    - `dt_scores`: `(List[float])` A list of confidence scores for text detection boxes
    - `text_det_params`: `(Dict[str, Dict[str, int, float]])` Configuration parameters for the text detection module
      - `limit_side_len`: `(int)` The side-length limit during image preprocessing
      - `limit_type`: `(str)` The handling method for the side-length limit
      - `thresh`: `(float)` The confidence threshold for text pixel classification
      - `box_thresh`: `(float)` The confidence threshold for text detection boxes
      - `unclip_ratio`: `(float)` The expansion coefficient for text detection boxes
      - `text_type`: `(str)` The type of text detection, currently fixed as "general"
    - `textline_orientation_angles`: `(List[int])` The prediction results of text line orientation classification; actual angle values are returned when the module is enabled (e.g., [0,0,1])
    - `text_rec_score_thresh`: `(float)` The filtering threshold for text recognition results
    - `rec_texts`: `(List[str])` A list of text recognition results, including only texts with confidence above `text_rec_score_thresh`
    - `rec_scores`: `(List[float])` A list of text recognition confidence scores, already filtered by `text_rec_score_thresh`
    - `rec_polys`: `(List[numpy.ndarray])` A list of text detection boxes filtered by confidence, in the same format as `dt_polys`
  - `formula_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` A list of formula recognition results; each element is a dictionary
    - `rec_formula`: `(str)` The formula recognition result
    - `rec_polys`: `(numpy.ndarray)` The formula detection box, with shape (4, 2) and dtype int16
    - `formula_region_id`: `(int)` The region number where the formula is located
- Calling the `save_to_json()` method will save the above content to the specified `save_path`. If a directory is specified, the save path will be `save_path/{your_img_basename}_res.json`; if a file is specified, the result is saved directly to that file. Since JSON files do not support saving numpy arrays, `numpy.array` values are converted to lists.
- Calling the `save_to_img()` method will save the visualization results to the specified `save_path`. If a directory is specified, the layout detection visualization, global OCR visualization, reading-order visualization, and other images will be saved there. If a file is specified, the result is saved directly to that file. (Pipelines often produce multiple result images, so specifying a single file path is not recommended: the images would overwrite one another, leaving only the last.)

In addition, attributes are also supported for obtaining visualized images with results and prediction results, as detailed below:
| Attribute | Attribute Description |
|---|---|
| `json` | Obtain the prediction results in `json` format |
| `img` | Obtain the visualized images in `dict` format |
- The prediction result obtained via the `json` attribute is a `dict`, with content identical to what the `save_to_json()` method saves.
- The prediction result returned by the `img` attribute is a `dict` whose keys are `layout_det_res`, `overall_ocr_res`, `text_paragraphs_ocr_res`, `formula_res_region1`, `table_cell_img`, and `seal_res_region1`, each mapping to an `Image.Image` object used to visualize the layout detection, OCR, OCR text paragraph, formula, table, and seal results, respectively. If optional modules are not used, the dictionary contains only `layout_det_res`.
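A minimal sketch of reading these attributes, assuming the `layout_parsing_result` object from the snippet above:

import os

res_dict = layout_parsing_result.json  # dict; same content as save_to_json()
res_imgs = layout_parsing_result.img   # dict of Image.Image objects

os.makedirs("./output", exist_ok=True)
for name, image in res_imgs.items():
    # Keys such as layout_det_res or overall_ocr_res, depending on the enabled modules.
    image.save(os.path.join("./output", f"{name}.png"))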
(4) Call the build_vector() method of the PP-ChatOCRv3-doc Pipeline object to construct vectors for text content. Below are the parameters and their descriptions for the `build_vector()` method:
| Parameter | Parameter Description | Parameter Type | Options | Default Value |
|---|---|---|---|---|
| `visual_info` | Visual information; a dictionary containing visual information, or a list of such dictionaries | `list` / `dict` | None | `None` |
| `min_characters` | Minimum number of characters | `int` | A positive integer, chosen according to the token length supported by the large language model | `3500` |
| `block_size` | Block size used when building the vector library for long texts | `int` | A positive integer, chosen according to the token length supported by the large language model | `300` |
| `flag_save_bytes_vector` | Whether to save the text as a binary file | `bool` | `True` / `False` | `False` |
| `retriever_config` | Configuration parameters for the vector-retrieval large model; refer to the "LLM_Retriever" field in the configuration file | `dict` | None | `None` |
This method returns a dictionary containing visual text information, with the following content:
- `flag_save_bytes_vector`: `(bool)` Whether the result is saved as a binary file
- `flag_too_short_text`: `(bool)` Whether the text length is less than the minimum number of characters
- `vector`: `(str|list)` The binary or text content of the text, depending on `flag_save_bytes_vector` and `min_characters`: if `flag_save_bytes_vector=True` and the text length is at least the minimum number of characters, binary content is returned; otherwise, the original text is returned.
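For instance, a sketch that passes the documented defaults explicitly and inspects the returned fields:

vector_info = pipeline.build_vector(
    visual_info_list,
    min_characters=3500,   # documented default
    block_size=300,        # documented default
    flag_save_bytes_vector=True,
    retriever_config=retriever_config,
)
print(vector_info["flag_save_bytes_vector"])  # whether the vector content is binary
print(vector_info["flag_too_short_text"])     # True if the text is shorter than min_characters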
(5) Call the chat() method of the PP-ChatOCRv3-doc Pipeline object to extract key information. Below are the parameters and their descriptions for the `chat()` method:
| Parameter | Parameter Description | Parameter Type | Options | Default Value |
|---|---|---|---|---|
| `key_list` | A single key or a list of keys used to extract information | `Union[str, List[str]]` | None | `None` |
| `visual_info` | Visual information results | `List[dict]` | None | `None` |
| `use_vector_retrieval` | Whether to use vector retrieval | `bool` | `True` / `False` | `True` |
| `vector_info` | Vector information used for retrieval | `dict` | None | `None` |
| `min_characters` | Required minimum number of characters | `int` | A positive integer | `3500` |
| `text_task_description` | Description of the text task | `str` | None | `None` |
| `text_output_format` | Output format of text results | `str` | None | `None` |
| `text_rules_str` | Rules for generating text results | `str` | None | `None` |
| `text_few_shot_demo_text_content` | Text content for text few-shot demonstrations | `str` | None | `None` |
| `text_few_shot_demo_key_value_list` | Key-value list for text few-shot demonstrations | `str` | None | `None` |
| `table_task_description` | Description of the table task | `str` | None | `None` |
| `table_output_format` | Output format of table results | `str` | None | `None` |
| `table_rules_str` | Rules for generating table results | `str` | None | `None` |
| `table_few_shot_demo_text_content` | Text content for table few-shot demonstrations | `str` | None | `None` |
| `table_few_shot_demo_key_value_list` | Key-value list for table few-shot demonstrations | `str` | None | `None` |
| `chat_bot_config` | Configuration information for the large language model; refer to the "LLM_Chat" field in the pipeline configuration file | `dict` | None | `None` |
| `retriever_config` | Configuration parameters for the vector-retrieval large model; refer to the "LLM_Retriever" field in the configuration file | `dict` | None | `None` |
This method prints the result to the terminal. The printed content is explained as follows:
- `chat_res`: `(dict)` The result of information extraction: a dictionary containing the keys to be extracted and their corresponding values.
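As an illustration, here is a sketch of a `chat()` call that also sets some of the optional prompt-related parameters; the task description and rules strings are placeholders, not prescribed prompts:

chat_result = pipeline.chat(
    key_list=["驾驶室准乘人数"],
    visual_info=visual_info_list,
    use_vector_retrieval=True,
    vector_info=vector_info,
    min_characters=3500,  # documented default
    text_task_description="Extract key information from a vehicle certificate.",  # placeholder
    text_rules_str="Return each value exactly as it appears in the document.",    # placeholder
    chat_bot_config=chat_bot_config,
    retriever_config=retriever_config,
)
print(chat_result)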

3. Development Integration/Deployment

If the pipeline meets your requirements for inference speed and accuracy in production, you can proceed directly with development integration/deployment.

If you need to apply the pipeline directly in your Python project, you can refer to the sample code in 2.2 Local Experience.

Additionally, PaddleX provides three other deployment methods, detailed as follows:

🚀 High-Performance Inference: In actual production environments, many applications have stringent standards for the performance metrics of deployment strategies (especially response speed) to ensure efficient system operation and smooth user experience. To this end, PaddleX provides a high-performance inference plugin designed to deeply optimize model inference and pre/post-processing, achieving significant speedups in the end-to-end process. For detailed instructions on high-performance inference, please refer to the PaddleX High-Performance Inference Guide.

☁️ Serving: Serving is a common deployment form in actual production environments. By encapsulating the inference functionality as a service, clients can access these services through network requests to obtain inference results. PaddleX supports multiple serving solutions for pipelines. For detailed instructions on serving, please refer to the PaddleX Serving Guide.

Below are the API references for basic serving and multi-language service invocation examples:

API Reference

For the main operations provided by the service:

  • The HTTP request method is POST.
  • Both the request body and response body are JSON data (JSON objects).
  • When the request is successfully processed, the response status code is 200, and the response body has the following attributes:
| Name | Type | Meaning |
|---|---|---|
| `logId` | `string` | UUID of the request. |
| `errorCode` | `integer` | Error code. Fixed at `0`. |
| `errorMsg` | `string` | Error description. Fixed at `"Success"`. |
| `result` | `object` | Operation result. |
  • When the request is not successfully processed, the response body has the following attributes:
| Name | Type | Meaning |
|---|---|---|
| `logId` | `string` | UUID of the request. |
| `errorCode` | `integer` | Error code. Same as the response status code. |
| `errorMsg` | `string` | Error description. |

The main operations provided by the service are as follows:

  • analyzeImages

Analyzes images using computer vision models, obtains OCR and table recognition results, and extracts key information from the images.

POST /chatocr-visual

  • Attributes of the request body:
| Name | Type | Meaning | Required |
|---|---|---|---|
| `file` | `string` | URL of an image file or PDF file accessible to the server, or the Base64-encoded content of such a file. By default, for PDF files exceeding 10 pages, only the first 10 pages are processed; see the note below the table for removing this limit. | Yes |
| `fileType` | `integer \| null` | File type. `0` represents a PDF file, `1` represents an image file. If absent, the file type is inferred from the URL. | No |
| `useDocOrientationClassify` | `boolean \| null` | See the `use_doc_orientation_classify` parameter of the pipeline object's `visual_predict` method. | No |
| `useDocUnwarping` | `boolean \| null` | See the `use_doc_unwarping` parameter of the pipeline object's `visual_predict` method. | No |
| `useSealRecognition` | `boolean \| null` | See the `use_seal_recognition` parameter of the pipeline object's `visual_predict` method. | No |
| `useTableRecognition` | `boolean \| null` | See the `use_table_recognition` parameter of the pipeline object's `visual_predict` method. | No |
| `layoutThreshold` | `number \| null` | See the `layout_threshold` parameter of the pipeline object's `visual_predict` method. | No |
| `layoutNms` | `boolean \| null` | See the `layout_nms` parameter of the pipeline object's `visual_predict` method. | No |
| `layoutUnclipRatio` | `number \| array \| object \| null` | See the `layout_unclip_ratio` parameter of the pipeline object's `visual_predict` method. | No |
| `layoutMergeBboxesMode` | `string \| object \| null` | See the `layout_merge_bboxes_mode` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetLimitSideLen` | `integer \| null` | See the `text_det_limit_side_len` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetLimitType` | `string \| null` | See the `text_det_limit_type` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetThresh` | `number \| null` | See the `text_det_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetBoxThresh` | `number \| null` | See the `text_det_box_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetUnclipRatio` | `number \| null` | See the `text_det_unclip_ratio` parameter of the pipeline object's `visual_predict` method. | No |
| `textRecScoreThresh` | `number \| null` | See the `text_rec_score_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetLimitSideLen` | `integer \| null` | See the `seal_det_limit_side_len` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetLimitType` | `string \| null` | See the `seal_det_limit_type` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetThresh` | `number \| null` | See the `seal_det_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetBoxThresh` | `number \| null` | See the `seal_det_box_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetUnclipRatio` | `number \| null` | See the `seal_det_unclip_ratio` parameter of the pipeline object's `visual_predict` method. | No |
| `sealRecScoreThresh` | `number \| null` | See the `seal_rec_score_thresh` parameter of the pipeline object's `visual_predict` method. | No |

Note: to remove the PDF page limit, add the following configuration to the pipeline configuration file:

Serving:
  extra:
    max_num_input_imgs: null
  • When the request is successfully processed, the result of the response body has the following attributes:
| Name | Type | Meaning |
|---|---|---|
| `layoutParsingResults` | `array` | Analysis results obtained using computer vision models. The array length is 1 (for image input) or equal to the number of document pages actually processed (for PDF input). For PDF input, each element corresponds to one processed page of the PDF file. |
| `visualInfo` | `array` | Key information in the image, which can be used as input for other operations. |
| `dataInfo` | `object` | Input data information. |

Each element in layoutParsingResults is an object with the following attributes:

| Name | Type | Meaning |
|---|---|---|
| `prunedResult` | `object` | A simplified version of the `res` field in the JSON representation of the results generated by the pipeline's `visual_predict` method, with the `input_path` and `page_index` fields removed. |
| `outputImages` | `object \| null` | See the description of the `img` attribute of the pipeline's visual prediction result. |
| `inputImage` | `string \| null` | The input image, in JPEG format and Base64-encoded. |
  • buildVectorStore

Builds a vector database.

POST /chatocr-vector

  • Attributes of the request body:
| Name | Type | Meaning | Required |
|---|---|---|---|
| `visualInfo` | `array` | Key information in the image. Provided by the `analyzeImages` operation. | Yes |
| `minCharacters` | `integer \| null` | Minimum data length required to enable the vector database. | No |
| `blockSize` | `integer \| null` | See the `block_size` parameter of the pipeline object's `build_vector` method. | No |
| `retrieverConfig` | `object \| null` | See the `retriever_config` parameter of the pipeline object's `build_vector` method. | No |
  • When the request is successfully processed, the result of the response body has the following attributes:
| Name | Type | Meaning |
|---|---|---|
| `vectorInfo` | `object` | Serialized result of the vector database, which can be used as input for other operations. |
  • chat

Interacts with large language models to extract key information.

POST /chatocr-chat

  • Attributes of the request body:
| Name | Type | Meaning | Required |
|---|---|---|---|
| `keyList` | `array` | List of keys. | Yes |
| `visualInfo` | `object` | Key information in the image. Provided by the `analyzeImages` operation. | Yes |
| `useVectorRetrieval` | `boolean \| null` | See the `use_vector_retrieval` parameter of the pipeline object's `chat` method. | No |
| `vectorInfo` | `object \| null` | Serialized result of the vector database. Provided by the `buildVectorStore` operation. Note that deserialization involves an unpickle operation; to prevent malicious attacks, only use data from trusted sources. | No |
| `minCharacters` | `integer` | Minimum data length required to enable the vector database. | No |
| `textTaskDescription` | `string \| null` | See the `text_task_description` parameter of the pipeline object's `chat` method. | No |
| `textOutputFormat` | `string \| null` | See the `text_output_format` parameter of the pipeline object's `chat` method. | No |
| `textRulesStr` | `string \| null` | See the `text_rules_str` parameter of the pipeline object's `chat` method. | No |
| `textFewShotDemoTextContent` | `string \| null` | See the `text_few_shot_demo_text_content` parameter of the pipeline object's `chat` method. | No |
| `textFewShotDemoKeyValueList` | `string \| null` | See the `text_few_shot_demo_key_value_list` parameter of the pipeline object's `chat` method. | No |
| `tableTaskDescription` | `string \| null` | See the `table_task_description` parameter of the pipeline object's `chat` method. | No |
| `tableOutputFormat` | `string \| null` | See the `table_output_format` parameter of the pipeline object's `chat` method. | No |
| `tableRulesStr` | `string \| null` | See the `table_rules_str` parameter of the pipeline object's `chat` method. | No |
| `tableFewShotDemoTextContent` | `string \| null` | See the `table_few_shot_demo_text_content` parameter of the pipeline object's `chat` method. | No |
| `tableFewShotDemoKeyValueList` | `string \| null` | See the `table_few_shot_demo_key_value_list` parameter of the pipeline object's `chat` method. | No |
| `chatBotConfig` | `object \| null` | See the `chat_bot_config` parameter of the pipeline object's `chat` method. | No |
| `retrieverConfig` | `object \| null` | See the `retriever_config` parameter of the pipeline object's `chat` method. | No |
  • When the request is successfully processed, the result of the response body has the following attributes:
| Name | Type | Meaning |
|---|---|---|
| `chatResult` | `object` | Key information extraction result. |
  • Note:
  • Including sensitive parameters such as the API key for large model calls in the request body can pose a security risk. If not necessary, set these parameters in the configuration file and do not pass them in the request.
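For reference, here is a hedged sketch of keeping the credentials server-side in the pipeline configuration file; the field contents mirror the Python configs in Section 2.2, but the exact nesting around the `LLM_Chat` and `LLM_Retriever` fields may differ in your generated configuration file:

# Hypothetical layout; consult your exported pipeline configuration for the exact schema.
LLM_Chat:
  module_name: chat_bot
  model_name: ernie-3.5-8k
  base_url: https://qianfan.baidubce.com/v2
  api_type: openai
  api_key: "your_api_key"   # stays on the server; clients omit it from requests
LLM_Retriever:
  module_name: retriever
  model_name: embedding-v1
  base_url: https://qianfan.baidubce.com/v2
  api_type: qianfan
  api_key: "your_api_key"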

    Multi-language Service Invocation Examples
    Python
    import base64
    import pprint
    import sys
    
    import requests
    
    
    API_BASE_URL = "http://0.0.0.0:8080"
    
    file_path = "./demo.jpg"
    keys = ["Name"]
    
    with open(file_path, "rb") as file:
        file_bytes = file.read()
        file_data = base64.b64encode(file_bytes).decode("ascii")
    
    payload = {
        "file": file_data,
        "fileType": 1,
    }
    resp_visual = requests.post(url=f"{API_BASE_URL}/chatocr-visual", json=payload)
    if resp_visual.status_code != 200:
        print(
            f"Request to chatocr-visual failed with status code {resp_visual.status_code}.",
            file=sys.stderr,
        )
        pprint.pp(resp_visual.json())
        sys.exit(1)
    result_visual = resp_visual.json()["result"]
    
    for i, res in enumerate(result_visual["layoutParsingResults"]):
        print(res["prunedResult"])
        for img_name, img in res["outputImages"].items():
            img_path = f"{img_name}_{i}.jpg"
            with open(img_path, "wb") as f:
                f.write(base64.b64decode(img))
            print(f"Output image saved at {img_path}")
    
    payload = {
        "visualInfo": result_visual["visualInfo"],
    }
    resp_vector = requests.post(url=f"{API_BASE_URL}/chatocr-vector", json=payload)
    if resp_vector.status_code != 200:
        print(
            f"Request to chatocr-vector failed with status code {resp_vector.status_code}.",
            file=sys.stderr,
        )
        pprint.pp(resp_vector.json())
        sys.exit(1)
    result_vector = resp_vector.json()["result"]
    
    payload = {
        "keyList": keys,
        "visualInfo": result_visual["visualInfo"],
        "useVectorRetrieval": True,
        "vectorInfo": result_vector["vectorInfo"],
    }
    
    resp_chat = requests.post(url=f"{API_BASE_URL}/chatocr-chat", json=payload)
    if resp_chat.status_code != 200:
        print(
            f"Request to chatocr-chat failed with status code {resp_chat.status_code}.",
            file=sys.stderr,
        )
        pprint.pp(resp_chat.json())
        sys.exit(1)
    result_chat = resp_chat.json()["result"]
    print("Final result:")
    print(result_chat["chatResult"])
    


📱 Edge Deployment: Edge deployment is a method where computing and data processing functions are placed on the user's device itself, allowing the device to process data directly without relying on remote servers. PaddleX supports deploying models on edge devices such as Android. For detailed instructions on edge deployment, please refer to the PaddleX Edge Deployment Guide. You can choose the appropriate deployment method for your pipeline based on your needs, and proceed with subsequent AI application integration.

4. Custom Development

If the default model weights provided by the PP-ChatOCRv3-doc pipeline do not meet your requirements for accuracy or speed in your scenario, you can try to further fine-tune the existing models using your own domain-specific or application-specific data to improve the recognition performance of this pipeline in your scenario.

4.1 Model Fine-tuning

The Document Scene Information Extraction v3 pipeline consists of several modules, and if the pipeline's performance does not meet expectations, the issue may originate from any of them. You can analyze cases with poor extraction results by visualizing the images to determine which module has the problem, then refer to the corresponding fine-tuning tutorial link in the table below to fine-tune the model:

| Scenario | Module to Fine-tune | Fine-tuning Reference Link |
|---|---|---|
| Inaccurate layout detection, such as undetected seals or tables | Layout Detection Module | Link |
| Inaccurate table structure recognition | Table Structure Recognition Module | Link |
| Missed seal text | Seal Text Detection Module | Link |
| Missed text | Text Detection Module | Link |
| Inaccurate text content | Text Recognition Module | Link |
| Inaccurate correction of vertical or rotated text lines | Text Line Orientation Classification Module | Link |
| Inaccurate correction of whole-image rotation | Document Image Orientation Classification Module | Link |
| Inaccurate image distortion correction | Text Image Rectification Module | Fine-tuning not supported |

4.2 Model Deployment

After fine-tuning your models with your private dataset, you will obtain local model weight files.

To use the fine-tuned model weights, simply modify the pipeline configuration file by replacing the default model entries with the local paths of your fine-tuned models:

......
Pipeline:
  layout_model: RT-DETR-H_layout_3cls  # Replace with the local path of your fine-tuned model
  table_model: SLANet_plus  # Replace with the local path of your fine-tuned model
  text_det_model: PP-OCRv4_server_det  # Replace with the local path of your fine-tuned model
  text_rec_model: PP-OCRv4_server_rec  # Replace with the local path of your fine-tuned model
  seal_text_det_model: PP-OCRv4_server_seal_det  # Replace with the local path of your fine-tuned model
  doc_image_ori_cls_model: null   # Replace with the local path of your fine-tuned model if applicable
  doc_image_unwarp_model: null   # Replace with the local path of your fine-tuned model if applicable
......
    

Subsequently, load the modified pipeline configuration file using the command-line interface or a Python script as described in the local experience section.
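For example, a minimal sketch of loading a modified configuration file in Python (the file path is illustrative):

from paddlex import create_pipeline

# Pass the path of the edited configuration file instead of the pipeline name.
pipeline = create_pipeline(pipeline="./PP-ChatOCRv3-doc.yaml")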

5. Multi-hardware Support

PaddleX supports various mainstream hardware devices such as NVIDIA GPUs, Kunlun XPU, Ascend NPU, and Cambricon MLU. Seamless switching between different hardware can be achieved by simply setting the --device parameter.

For example, if you are running the PP-ChatOCRv3-doc pipeline on an NVIDIA GPU and wish to switch the hardware to an Ascend NPU, simply change the device in the script to npu:

from paddlex import create_pipeline

pipeline = create_pipeline(
    pipeline="PP-ChatOCRv3-doc",
    device="npu:0",  # gpu:0 --> npu:0
)

If you want to use the PP-ChatOCRv3-doc Pipeline on more types of hardware, please refer to the PaddleX Multi-Device Usage Guide.
