PP-ChatOCRv4-doc Pipeline Tutorial

1. Introduction to PP-ChatOCRv4-doc Pipeline

PP-ChatOCRv4-doc is a unique intelligent document and image analysis solution from PaddlePaddle that combines LLM, MLLM, and OCR technologies to address complex document information extraction challenges such as layout analysis, rare characters, multi-page PDFs, tables, and seal recognition. Integrated with ERNIE Bot, it fuses massive data and knowledge, achieving high accuracy and broad applicability. The pipeline also provides flexible service deployment options, supporting deployment on a variety of hardware. Furthermore, it offers custom development capabilities: you can train and fine-tune models on your own datasets, and trained models can be integrated seamlessly.

The Document Scene Information Extraction v4 pipeline includes modules for Layout Region Detection, Table Structure Recognition, Table Classification, Table Cell Localization, Text Detection, Text Recognition, Seal Text Detection, Text Image Rectification, and Document Image Orientation Classification.

Choose a model based on your priorities: a higher-accuracy model if accuracy matters most, a faster model if inference speed matters most, or a smaller model if storage size matters most. Benchmarks for some models are as follows:

Model List Details

Table Structure Recognition Module Models:

| Model | Model Download Link | Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| SLANet | Inference Model / Training Model | 59.52 | 103.08 / 103.08 | 197.99 / 197.99 | 6.9 | SLANet is a table structure recognition model developed by the Baidu PaddleX team. It significantly improves the accuracy and inference speed of table structure recognition by adopting the CPU-friendly lightweight backbone network PP-LCNet, the high-low-level feature fusion module CSP-PAN, and the SLA Head feature decoding module that aligns structural and positional information. |
| SLANet_plus | Inference Model / Training Model | 63.69 | 140.29 / 140.29 | 195.39 / 195.39 | 6.9 | SLANet_plus is an enhanced version of SLANet. Compared to SLANet, it significantly improves recognition of wireless and complex tables and reduces the model's sensitivity to table positioning accuracy, enabling accurate recognition even when tables are offset. |

Layout Detection Module Models:

| Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-DocLayout-L | Inference Model / Training Model | 90.4 | 34.6244 / 10.3945 | 510.57 / - | 123.76 | A high-precision layout area localization model trained with RT-DETR-L on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports. |
| PP-DocLayout-M | Inference Model / Training Model | 75.2 | 13.3259 / 4.8685 | 44.0680 / 44.0680 | 22.578 | A layout area localization model with balanced precision and efficiency, trained with PicoDet-L on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports. |
| PP-DocLayout-S | Inference Model / Training Model | 70.9 | 8.3008 / 2.3794 | 10.0623 / 9.9296 | 4.834 | A high-efficiency layout area localization model trained with PicoDet-S on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports. |
Note: The evaluation dataset for the above precision metrics is a self-built layout area detection dataset from PaddleOCR, containing 500 common document images (Chinese and English papers, magazines, contracts, books, exams, and research reports). GPU inference time is measured on an NVIDIA Tesla T4 with FP32 precision. CPU inference speed is measured on an Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision.

❗ The list above covers the 3 core models most fully supported by the layout detection module. The module supports 11 models in total, including several predefined models for different category sets. The complete model list is as follows:

Table Layout Detection Model:
| Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
|---|---|---|---|---|---|---|
| PicoDet_layout_1x_table | Inference Model / Training Model | 97.5 | 8.02 / 3.09 | 23.70 / 20.41 | 7.4 | A high-efficiency layout area localization model trained with PicoDet-1x on a self-built dataset, capable of detecting table regions. |
Note: The evaluation dataset for the above precision metrics is a self-built layout table area detection dataset from PaddleOCR, containing 7,835 Chinese and English document images with tables. GPU inference time is measured on an NVIDIA Tesla T4 with FP32 precision. CPU inference speed is measured on an Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision.

3-Class Layout Detection Model (Table, Image, and Stamp):
| Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
|---|---|---|---|---|---|---|
| PicoDet-S_layout_3cls | Inference Model / Training Model | 88.2 | 8.99 / 2.22 | 16.11 / 8.73 | 4.8 | A high-efficiency layout area localization model trained with PicoDet-S on a self-built dataset of Chinese and English papers, magazines, and research reports. |
| PicoDet-L_layout_3cls | Inference Model / Training Model | 89.0 | 13.05 / 4.50 | 41.30 / 41.30 | 22.6 | A layout area localization model with balanced efficiency and precision, trained with PicoDet-L on a self-built dataset of Chinese and English papers, magazines, and research reports. |
| RT-DETR-H_layout_3cls | Inference Model / Training Model | 95.8 | 114.93 / 27.71 | 947.56 / 947.56 | 470.1 | A high-precision layout area localization model trained with RT-DETR-H on a self-built dataset of Chinese and English papers, magazines, and research reports. |
Note: The evaluation dataset for the above precision metrics is a self-built layout area detection dataset from PaddleOCR, containing 1,154 common document images of Chinese and English papers, magazines, and research reports. GPU inference time is measured on an NVIDIA Tesla T4 with FP32 precision. CPU inference speed is measured on an Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision.

5-Class English Document Area Detection Model (Text, Title, Table, Image, and List):
| Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
|---|---|---|---|---|---|---|
| PicoDet_layout_1x | Inference Model / Training Model | 97.8 | 9.03 / 3.10 | 25.82 / 20.70 | 7.4 | A high-efficiency English document layout area localization model trained with PicoDet-1x on the PubLayNet dataset. |
Note: The evaluation dataset for the above precision metrics is the [PubLayNet](https://developer.ibm.com/exchanges/data/all/publaynet/) dataset, containing 11,245 English document images. GPU inference time is measured on an NVIDIA Tesla T4 with FP32 precision. CPU inference speed is measured on an Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision.

17-Class Area Detection Model, covering 17 common layout categories (Paragraph Title, Image, Text, Number, Abstract, Content, Figure Caption, Formula, Table, Table Caption, References, Document Title, Footnote, Header, Algorithm, Footer, and Stamp):
| Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
|---|---|---|---|---|---|---|
| PicoDet-S_layout_17cls | Inference Model / Training Model | 87.4 | 9.11 / 2.12 | 15.42 / 9.12 | 4.8 | A high-efficiency layout area localization model trained with PicoDet-S on a self-built dataset of Chinese and English papers, magazines, and research reports. |
| PicoDet-L_layout_17cls | Inference Model / Training Model | 89.0 | 13.50 / 4.69 | 43.32 / 43.32 | 22.6 | A layout area localization model with balanced efficiency and precision, trained with PicoDet-L on a self-built dataset of Chinese and English papers, magazines, and research reports. |
| RT-DETR-H_layout_17cls | Inference Model / Training Model | 98.3 | 115.29 / 104.09 | 995.27 / 995.27 | 470.2 | A high-precision layout area localization model trained with RT-DETR-H on a self-built dataset of Chinese and English papers, magazines, and research reports. |

Text Detection Module Models:

| Model | Model Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv5_server_det | Inference Model / Training Model | 83.8 | 89.55 / 70.19 | 371.65 / 371.65 | 84.3 | PP-OCRv5 server-side text detection model with higher accuracy, suitable for deployment on high-performance servers. |
| PP-OCRv5_mobile_det | Inference Model / Training Model | 79.0 | 8.79 / 3.13 | 51.00 / 28.58 | 4.7 | PP-OCRv5 mobile-side text detection model with higher efficiency, suitable for deployment on edge devices. |
| PP-OCRv4_server_det | Inference Model / Training Model | 69.2 | 83.34 / 80.91 | 442.58 / 442.58 | 109 | PP-OCRv4 server-side text detection model with higher accuracy, suitable for deployment on high-performance servers. |
| PP-OCRv4_mobile_det | Inference Model / Training Model | 63.8 | 8.79 / 3.13 | 51.00 / 28.58 | 4.7 | PP-OCRv4 mobile-side text detection model with higher efficiency, suitable for deployment on edge devices. |

Text Recognition Module Models:

| Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-OCRv5_server_rec | Inference Model / Pretrained Model | 86.38 | 8.45 / 2.36 | 122.69 / 122.69 | 81 | PP-OCRv5_rec is a next-generation text recognition model. It aims to efficiently and accurately support four major languages (Simplified Chinese, Traditional Chinese, English, and Japanese), as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters, with a single model. While maintaining recognition performance, it balances inference speed and model robustness, providing efficient and accurate support for document understanding in various scenarios. |
| PP-OCRv5_mobile_rec | Inference Model / Pretrained Model | 81.29 | 1.46 / 5.43 | 5.32 / 91.79 | 16 | |
| PP-OCRv4_server_rec_doc | Inference Model / Pretrained Model | 86.58 | 6.65 / 2.38 | 32.92 / 32.92 | 91 | PP-OCRv4_server_rec_doc is trained on a mixed dataset of additional Chinese document data and PP-OCR training data, building upon PP-OCRv4_server_rec. It enhances recognition of some Traditional Chinese characters, Japanese characters, and special symbols, supporting over 15,000 characters. In addition to improving document-related text recognition, it also enhances general text recognition. |
| PP-OCRv4_mobile_rec | Inference Model / Pretrained Model | 83.28 | 4.82 / 1.20 | 16.74 / 4.64 | 11 | A lightweight PP-OCRv4 recognition model with high inference efficiency, suitable for deployment on various hardware, including edge devices. |
| PP-OCRv4_server_rec | Inference Model / Pretrained Model | 85.19 | 6.58 / 2.43 | 33.17 / 33.17 | 87 | The server-side PP-OCRv4 model, offering high inference accuracy and deployable on various servers. |
| en_PP-OCRv4_mobile_rec | Inference Model / Pretrained Model | 70.39 | 4.81 / 0.75 | 16.10 / 5.31 | 7.3 | An ultra-lightweight English recognition model trained on the PP-OCRv4 recognition model, supporting English and numeric character recognition. |

| Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| ch_SVTRv2_rec | Inference Model / Training Model | 68.81 | 8.08 / 8.08 | 50.17 / 42.50 | 73.9 | SVTRv2 is a server-side text recognition model developed by the OpenOCR team at the Vision and Learning Lab (FVL) of Fudan University. It won first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, with a 6% improvement in end-to-end recognition accuracy over PP-OCRv4 on the A-list. |

| Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| ch_RepSVTR_rec | Inference Model / Training Model | 65.07 | 5.93 / 5.93 | 20.73 / 7.32 | 22.1 | RepSVTR is a mobile-oriented text recognition model based on SVTRv2. It won first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, with a 2.5% improvement in end-to-end recognition accuracy over PP-OCRv4 on the B-list, while maintaining similar inference speed. |

Formula Recognition Module Models:

| Model Name | Model Download Link | BLEU Score | Normed Edit Distance | ExpRate (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size |
|---|---|---|---|---|---|---|---|
| LaTeX_OCR_rec | Inference Model / Training Model | 0.8821 | 0.0823 | 40.01 | 2047.13 / 2047.13 | 10582.73 / 10582.73 | 89.7 M |

Seal Text Detection Module Models:

| Model | Model Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv4_server_seal_det | Inference Model / Training Model | 98.21 | 74.75 / 67.72 | 382.55 / 382.55 | 109 | PP-OCRv4's server-side seal text detection model, featuring higher accuracy, suitable for deployment on better-equipped servers. |
| PP-OCRv4_mobile_seal_det | Inference Model / Training Model | 96.47 | 7.82 / 3.09 | 48.28 / 23.97 | 4.6 | PP-OCRv4's mobile seal text detection model, offering higher efficiency, suitable for deployment on edge devices. |
Test Environment Description:
  • Performance Test Environment
    • Test Dataset:
      • Text Image Rectification Model: DocUNet
      • Layout Region Detection Model: A self-built layout analysis dataset using PaddleOCR, containing 10,000 images of common document types such as Chinese and English papers, magazines, and research reports.
      • Table Structure Recognition Model: A self-built English table recognition dataset using PaddleX.
      • Text Detection Model: A self-built Chinese dataset using PaddleOCR, covering multiple scenarios such as street scenes, web images, documents, and handwriting, with 500 images for detection.
      • Chinese Recognition Model: A self-built Chinese dataset using PaddleOCR, covering multiple scenarios such as street scenes, web images, documents, and handwriting, with 11,000 images for text recognition.
      • ch_SVTRv2_rec: Evaluation set A for "OCR End-to-End Recognition Task" in the PaddleOCR Algorithm Model Challenge
      • ch_RepSVTR_rec: Evaluation set B for "OCR End-to-End Recognition Task" in the PaddleOCR Algorithm Model Challenge
      • English Recognition Model: A self-built English dataset using PaddleX.
      • Multilingual Recognition Model: A self-built multilingual dataset using PaddleX.
      • Text Line Orientation Classification Model: A self-built dataset using PaddleOCR, covering various scenarios such as ID cards and documents, containing 1000 images.
      • Seal Text Detection Model: A self-built dataset using PaddleOCR, containing 500 images of circular seals.
    • Hardware Configuration:
      • GPU: NVIDIA Tesla T4
      • CPU: Intel Xeon Gold 6271C @ 2.60GHz
      • Other Environments: Ubuntu 20.04 / cuDNN 8.6 / TensorRT 8.5.2.2
  • Inference Mode Description
| Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination |
|---|---|---|---|
| Normal Mode | FP32 precision / no TRT acceleration | FP32 precision / 8 threads | PaddleInference |
| High-Performance Mode | Optimal combination of pre-selected precision types and acceleration strategies | FP32 precision / 8 threads | Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.) |

2. Quick Start

The pre-trained pipelines provided by PaddleOCR let you experience their effects quickly. You can try out the PP-ChatOCRv4-doc pipeline locally using Python.

Before using the PP-ChatOCRv4-doc pipeline locally, ensure you have installed the PaddleOCR wheel package according to the PaddleOCR Local Installation Tutorial. If you wish to install dependencies selectively, refer to the relevant instructions in the installation guide. The dependency group for this pipeline is `ie`.

Before running model inference, you first need to prepare the API key for the large language model. PP-ChatOCRv4 supports large model services on the Baidu Cloud Qianfan Platform, as well as locally deployed services exposing the standard OpenAI interface. If using the Baidu Cloud Qianfan Platform, refer to Authentication and Authorization to obtain the API key. If using a locally deployed large model service, refer to the PaddleNLP Large Model Deployment Documentation to deploy the dialogue and vectorization interfaces, and fill in the corresponding base_url and api_key. If you also need a multimodal large model for data fusion, refer to the OpenAI service deployment section of the PaddleMIX Model Documentation to deploy the multimodal model, and fill in the corresponding base_url and api_key.
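For example, when pointing the pipeline at locally deployed services, the configuration dictionaries might look like the following sketch; all base_url values, model names, and api_key values are placeholders that depend on your own deployment:

# A minimal sketch, assuming locally deployed OpenAI-compatible services.
# All base_url values, model names, and api_key values below are placeholders.
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "your-llm-model",          # placeholder model name
    "base_url": "http://127.0.0.1:8000/v1",  # placeholder local LLM service URL
    "api_type": "openai",
    "api_key": "api_key",                    # your api_key
}

mllm_chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "PP-DocBee",               # multimodal model served via PaddleMIX
    "base_url": "http://127.0.0.1:8080/",    # placeholder local MLLM service URL
    "api_type": "openai",
    "api_key": "api_key",                    # your api_key
}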

Note: If local deployment of a multimodal large model is restricted due to the local environment, you can comment out the lines containing the mllm variable in the code and only use the large language model for information extraction.

2.1 Command Line Experience

Once the API key is configured, you can complete quick inference with a single command. You can use the test file for testing:

paddleocr pp_chatocrv4_doc -i vehicle_certificate-1.png -k 驾驶室准乘人数 --qianfan_api_key your_api_key

# Use a multimodal large model via --invoke_mllm and --pp_docbee_base_url
paddleocr pp_chatocrv4_doc -i vehicle_certificate-1.png -k 驾驶室准乘人数 --qianfan_api_key your_api_key --invoke_mllm True --pp_docbee_base_url http://127.0.0.1:8080/
The command line supports more parameter configurations; they are explained in detail below.
Each entry below gives the parameter name, its description, its type, the available options, and the default value (listed last).
input The data to be predicted, supporting multiple input types, required. Python Var|str|list
  • Python Var: Such as numpy.ndarray representing image data.
  • str: Such as the local path of an image file or PDF file: /root/data/img.jpg; URL link, such as the network URL of an image file or PDF file: Example; Local directory, which should contain images to be predicted, such as the local path: /root/data/ (currently does not support prediction of PDF files in directories, PDF files need to be specified to the specific file path).
  • List: List elements need to be of the above types, such as [numpy.ndarray, numpy.ndarray], ["/root/data/img1.jpg", "/root/data/img2.jpg"], ["/root/data1", "/root/data2"].
None
device The device for pipeline inference. str|None
  • CPU: Such as cpu to use CPU for inference;
  • GPU: Such as gpu:0 to use the first GPU for inference;
  • NPU: Such as npu:0 to use the first NPU for inference;
  • XPU: Such as xpu:0 to use the first XPU for inference;
  • MLU: Such as mlu:0 to use the first MLU for inference;
  • DCU: Such as dcu:0 to use the first DCU for inference;
  • None: If set to None, it will default to the value initialized by the pipeline. During initialization, it will prioritize using the local GPU 0 device, and if not available, it will use the CPU device;
None
use_doc_orientation_classify Whether to use the document orientation classification module. bool|None
  • bool: True or False;
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to True;
None
use_doc_unwarping Whether to use the document distortion correction module. bool|None
  • bool: True or False;
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to True;
None
use_textline_orientation Whether to use the text line orientation classification module. bool|None
  • bool: True or False;
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to True;
None
use_general_ocr Whether to use the OCR sub-pipeline. bool|None
  • bool: True or False;
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to True;
None
use_seal_recognition Whether to use the seal recognition sub-pipeline. bool|None
  • bool: True or False;
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to True;
None
use_table_recognition Whether to use the table recognition sub-pipeline. bool|None
  • bool: True or False;
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to True;
None
layout_threshold The score threshold for the layout model. float|dict|None
  • float: Any floating-point number between 0-1;
  • dict: {0:0.1} where the key is the category ID and the value is the threshold for that category;
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to 0.5;
None
layout_nms Whether to use NMS. bool|None
  • bool: True or False;
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to True;
None
layout_unclip_ratio The expansion coefficient for layout detection. float|Tuple[float,float]|dict|None
  • float: Any floating-point number greater than 0;
  • Tuple[float,float]: The expansion coefficients in the horizontal and vertical directions, respectively;
  • dict, keys as int representing cls_id, values as float scaling factors for each category.
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to 1.0;
None
layout_merge_bboxes_mode The method for filtering overlapping bounding boxes. str|dict|None
  • str: large, small, union. Respectively representing retaining the larger box, smaller box, or both when overlapping boxes are filtered.
  • dict, keys as int representing cls_id and values as merging modes for each category.
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to large;
None
text_det_limit_side_len The side length limit for text detection images. int|None
  • int: Any integer greater than 0;
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to 960;
None
text_det_limit_type The type of side length limit for text detection images. str|None
  • str: Supports min and max, where min ensures that the shortest side of the image is not less than det_limit_side_len, and max ensures that the longest side of the image is not greater than limit_side_len.
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to max;
None
text_det_thresh The pixel threshold for detection. In the output probability map, pixel points with scores greater than this threshold will be considered as text pixels. float|None
  • float: Any floating-point number greater than 0.
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to 0.3.
None
text_det_box_thresh The bounding box threshold for detection. When the average score of all pixel points within the detection result bounding box is greater than this threshold, the result will be considered as a text region. float|None
  • float: Any floating-point number greater than 0.
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to 0.6.
None
text_det_unclip_ratio The expansion coefficient for text detection. This method is used to expand the text region, and the larger the value, the larger the expansion area. float|None
  • float: Any floating-point number greater than 0.
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to 2.0.
None
text_rec_score_thresh The text recognition threshold. Text results with scores greater than this threshold will be retained. float|None
  • float: Any floating-point number greater than 0.
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to 0.0. I.e., no threshold is set.
None
seal_det_limit_side_len The side length limit for seal detection images. int|None
  • int: Any integer greater than 0;
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to 960;
None
seal_det_limit_type The type of side length limit for seal detection images. str|None
  • str: Supports min and max, where min ensures that the shortest side of the image is not less than det_limit_side_len, and max ensures that the longest side of the image is not greater than limit_side_len.
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to max;
None
seal_det_thresh The pixel threshold for detection. In the output probability map, pixel points with scores greater than this threshold will be considered as seal pixels. float|None
  • float: Any floating-point number greater than 0.
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to 0.3.
None
seal_det_box_thresh The bounding box threshold for detection. When the average score of all pixel points within the detection result bounding box is greater than this threshold, the result will be considered as a seal region. float|None
  • float: Any floating-point number greater than 0.
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to 0.6.
None
seal_det_unclip_ratio The expansion coefficient for seal detection. This method is used to expand the seal region, and the larger the value, the larger the expansion area. float|None
  • float: Any floating-point number greater than 0.
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to 2.0.
None
seal_rec_score_thresh The seal recognition threshold. Text results with scores greater than this threshold will be retained. float|None
  • float: Any floating-point number greater than 0.
  • None: If set to None, it will default to the value initialized by the pipeline, initialized to 0.0. I.e., no threshold is set.
None
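For example, several of the parameters above can be combined into one command; this is a sketch, assuming each listed parameter maps to a command-line flag of the same name, and the values shown are illustrative:

paddleocr pp_chatocrv4_doc -i vehicle_certificate-1.png -k 驾驶室准乘人数 --qianfan_api_key your_api_key --device gpu:0 --use_doc_orientation_classify False --text_rec_score_thresh 0.5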

The extraction results are printed to the terminal, as follows:

驾驶室准乘人数 2

2.2 Python Script Experience

The command-line method provides a quick way to try the pipeline and view results; in projects, however, code-level integration is usually required. You can download the Test File and use the following example code for inference:

from paddleocr import PPChatOCRv4Doc

chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

retriever_config = {
    "module_name": "retriever",
    "model_name": "embedding-v1",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "qianfan",
    "api_key": "api_key",  # your api_key
}

mllm_chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "PP-DocBee2",
    "base_url": "http://127.0.0.1:8080/",  # your local mllm service url
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

pipeline = PPChatOCRv4Doc()

visual_predict_res = pipeline.visual_predict(
    input="vehicle_certificate-1.png",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_common_ocr=True,
    use_seal_recognition=True,
    use_table_recognition=True,
)

visual_info_list = []
for res in visual_predict_res:
    visual_info_list.append(res["visual_info"])
    layout_parsing_result = res["layout_parsing_result"]

vector_info = pipeline.build_vector(
    visual_info_list, flag_save_bytes_vector=True, retriever_config=retriever_config
)
mllm_predict_res = pipeline.mllm_pred(
    input="vehicle_certificate-1.png",
    key_list=["Cab Seating Capacity"], # Translated: 驾驶室准乘人数
    mllm_chat_bot_config=mllm_chat_bot_config,
)
mllm_predict_info = mllm_predict_res["mllm_res"]
chat_result = pipeline.chat(
    key_list=["Cab Seating Capacity"], # Translated: 驾驶室准乘人数
    visual_info=visual_info_list,
    vector_info=vector_info,
    mllm_predict_info=mllm_predict_info,
    chat_bot_config=chat_bot_config,
    retriever_config=retriever_config,
)
print(chat_result)

After running, the output is as follows:

{'chat_res': {'驾驶室准乘人数': '2'}}

The prediction process, API description, and output description for PP-ChatOCRv4 are as follows:

(1) Instantiate the PP-ChatOCRv4 pipeline object by calling PPChatOCRv4Doc. The relevant parameters are described as follows:
Each entry below gives the parameter name, its description, its type, accepted values where applicable, and the default value (listed last).
layout_detection_model_name The name of the model used for layout region detection. If set to None, the pipeline's default model will be used. str None
layout_detection_model_dir The directory path of the layout region detection model. If set to None, the official model will be downloaded. str None
doc_orientation_classify_model_name The name of the document orientation classification model. If set to None, the pipeline's default model will be used. str None
doc_orientation_classify_model_dir The directory path of the document orientation classification model. If set to None, the official model will be downloaded. str None
doc_unwarping_model_name The name of the document unwarping model. If set to None, the pipeline's default model will be used. str None
doc_unwarping_model_dir The directory path of the document unwarping model. If set to None, the official model will be downloaded. str None
text_detection_model_name The name of the text detection model. If set to None, the pipeline's default model will be used. str None
text_detection_model_dir The directory path of the text detection model. If set to None, the official model will be downloaded. str None
text_recognition_model_name The name of the text recognition model. If set to None, the pipeline's default model will be used. str None
text_recognition_model_dir The directory path of the text recognition model. If set to None, the official model will be downloaded. str None
text_recognition_batch_size The batch size for the text recognition model. If set to None, the batch size will default to 1. int None
table_structure_recognition_model_name The name of the table structure recognition model. If set to None, the pipeline's default model will be used. str None
table_structure_recognition_model_dir The directory path of the table structure recognition model. If set to None, the official model will be downloaded. str None
seal_text_detection_model_name The name of the seal text detection model. If set to None, the pipeline's default model will be used. str None
seal_text_detection_model_dir The directory path of the seal text detection model. If set to None, the official model will be downloaded. str None
seal_text_recognition_model_name The name of the seal text recognition model. If set to None, the pipeline's default model will be used. str None
seal_text_recognition_model_dir The directory path of the seal text recognition model. If set to None, the official model will be downloaded. str None
seal_text_recognition_batch_size The batch size for the seal text recognition model. If set to None, the batch size will default to 1. int None
use_doc_orientation_classify Whether to load the document orientation classification function. If set to None, the pipeline-initialized value for this parameter will be used by default (initialized to True). bool None
use_doc_unwarping Whether to load the document unwarping function. If set to None, the pipeline-initialized value for this parameter will be used by default (initialized to True). bool None
use_seal_recognition Whether to load the seal recognition sub-pipeline. If set to None, the pipeline-initialized value for this parameter will be used by default (initialized to True). bool None
use_table_recognition Whether to load the table recognition sub-pipeline. If set to None, the pipeline-initialized value for this parameter will be used by default (initialized to True). bool None
layout_threshold Layout model score threshold.
  • float: Any float between 0-1;
  • dict: {0:0.1}, where the key is the class ID and the value is the threshold for that class;
  • None: If set to None, the pipeline-initialized value for this parameter will be used by default (initialized to 0.5);
float|dict None
layout_nms Whether the layout region detection model uses NMS post-processing. bool None
layout_unclip_ratio Expansion factor for the detection boxes of the layout region detection model.
  • float: Any float greater than 0;
  • Tuple[float,float]: Expansion factors in the horizontal and vertical directions, respectively;
  • dict, where the key is of int type, representing cls_id, and the value is of tuple type, e.g., {0: (1.1, 2.0)}, meaning the center of the detection box for class 0 remains unchanged, with the width expanded by 1.1 times and the height by 2.0 times;
  • None: If set to None, the pipeline-initialized value for this parameter will be used by default (initialized to 1.0);
float|Tuple[float,float]|dict None
layout_merge_bboxes_mode Method for filtering overlapping boxes in layout region detection.
  • str: large, small, or union, representing whether to keep the large box, the small box, or both when filtering overlapping boxes;
  • dict, where the key is of int type, representing cls_id, and the value is of str type, e.g., {0: "large", 2: "small"}, meaning "large" mode is used for class 0 detection boxes and "small" mode for class 2 detection boxes;
  • None: If set to None, the pipeline-initialized value for this parameter will be used by default (initialized to large);
str|dict None
text_det_limit_side_len Maximum side length limit for text detection.
  • int: Any integer greater than 0;
  • None: If set to None, the pipeline-initialized value for this parameter will be used by default (initialized to 960);
int None
text_det_limit_type Type of side length limit for text detection.
  • str: Supports min and max; min ensures the shortest side of the image is not less than det_limit_side_len, and max ensures the longest side is not greater than limit_side_len.
  • None: If set to None, the pipeline-initialized value for this parameter will be used by default (initialized to max).
str None
text_det_thresh Detection pixel threshold. In the output probability map, pixels with scores greater than this threshold are considered text pixels.
  • float: Any float greater than 0.
  • None: If set to None, the pipeline-initialized value for this parameter (0.3) will be used by default.
float None
text_det_box_thresh Detection box threshold. If the average score of all pixels within a detection result's bounding box is greater than this threshold, the result is considered a text region.
  • float: Any float greater than 0.
  • None: If set to None, the pipeline-initialized value for this parameter (0.6) will be used by default.
float None
text_det_unclip_ratio Text detection expansion factor. This method is used to expand text regions; the larger the value, the larger the expanded area.
  • float: Any float greater than 0.
  • None: If set to None, the pipeline-initialized value for this parameter (2.0) will be used by default.
float None
text_rec_score_thresh Text recognition threshold. Text results with scores greater than this threshold will be kept.
  • float: Any float greater than 0.
  • None: If set to None, the pipeline-initialized value for this parameter (0.0, i.e., no threshold) will be used by default.
float None
seal_det_limit_side_len Image side length limit for seal text detection.
  • int: Any integer greater than 0;
  • None: If set to None, the pipeline-initialized value for this parameter will be used by default (initialized to 736);
int None
seal_det_limit_type Type of image side length limit for seal text detection.
  • str: Supports min and max; min ensures the shortest side of the image is not less than det_limit_side_len, and max ensures the longest side is not greater than limit_side_len.
  • None: If set to None, the pipeline-initialized value for this parameter will be used by default (initialized to min);
str None
seal_det_thresh Detection pixel threshold. In the output probability map, pixels with scores greater than this threshold are considered text pixels.
  • float: Any float greater than 0.
  • None: If set to None, the pipeline-initialized value for this parameter (0.2) will be used by default.
float None
seal_det_box_thresh Detection box threshold. If the average score of all pixels within a detection result's bounding box is greater than this threshold, the result is considered a text region.
  • float: Any float greater than 0.
  • None: If set to None, the pipeline-initialized value for this parameter (0.6) will be used by default.
float None
seal_det_unclip_ratio Seal text detection expansion factor. This method is used to expand text regions; the larger the value, the larger the expanded area.
  • float: Any float greater than 0.
  • None: If set to None, the pipeline-initialized value for this parameter (0.5) will be used by default.
float None
seal_rec_score_thresh Seal text recognition threshold. Text results with scores greater than this threshold will be kept.
  • float: Any float greater than 0.
  • None: If set to None, the pipeline-initialized value for this parameter (0.0, i.e., no threshold) will be used by default.
float None
retriever_config Configuration parameters for the vector retrieval large model. The configuration content is the following dictionary:
{
"module_name": "retriever",
"model_name": "embedding-v1",
"base_url": "https://qianfan.baidubce.com/v2",
"api_type": "qianfan",
"api_key": "api_key"  # Please set this to your actual API key
}
dict None
mllm_chat_bot_config Configuration parameters for the multimodal large model. The configuration content is the following dictionary:
{
"module_name": "chat_bot",
"model_name": "PP-DocBee",
"base_url": "http://127.0.0.1:8080/", # Please set this to the actual URL of your multimodal large model service
"api_type": "openai",
"api_key": "api_key"  # Please set this to your actual API key
}
dict None
chat_bot_config Configuration information for the large language model. The configuration content is the following dictionary:
{
"module_name": "chat_bot",
"model_name": "ernie-3.5-8k",
"base_url": "https://qianfan.baidubce.com/v2",
"api_type": "openai",
"api_key": "api_key"  # Please set this to your actual API key
}
dict None
input Data to be predicted, supports multiple input types, required.
  • Python Var: e.g., image data represented by numpy.ndarray;
  • str: e.g., the local path of an image file or PDF file: /root/data/img.jpg; a URL link, e.g., the network URL of an image file or PDF file: Example; a local directory, which must contain images to be predicted, e.g., the local path /root/data/ (currently, prediction from directories containing PDF files is not supported; PDF files must be specified by their full path);
  • List: List elements must be of the above types, e.g., [numpy.ndarray, numpy.ndarray], ["/root/data/img1.jpg", "/root/data/img2.jpg"], ["/root/data1", "/root/data2"].
Python Var|str|list None
save_path Specifies the path to save the inference result file. If set to None, inference results will not be saved locally. str None
device Device used for inference. Supports specifying a specific card number.
  • CPU: e.g., cpu indicates using the CPU for inference;
  • GPU: e.g., gpu:0 indicates using the 1st GPU for inference;
  • NPU: e.g., npu:0 indicates using the 1st NPU for inference;
  • XPU: e.g., xpu:0 indicates using the 1st XPU for inference;
  • MLU: e.g., mlu:0 indicates using the 1st MLU for inference;
  • DCU: e.g., dcu:0 indicates using the 1st DCU for inference;
  • None: If set to None, the pipeline-initialized value for this parameter will be used by default. During initialization, the local GPU 0 device is preferred; if unavailable, the CPU device is used;
str None
enable_hpi Whether to enable high-performance inference. bool False
use_tensorrt Whether to use TensorRT for inference acceleration. bool False
min_subgraph_size Minimum subgraph size, used to optimize model subgraph computation. int 3
precision Computation precision, e.g., fp32, fp16. str fp32
enable_mkldnn Whether to enable the MKL-DNN acceleration library. If set to None, it will be enabled by default. bool None
cpu_threads Number of threads used when performing inference on CPU. int 8
paddlex_config PaddleX pipeline configuration file path. str None
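As a sketch of the constructor parameters above, the following instantiation overrides a few defaults; the model name is taken from the tables in Section 1, and all values are illustrative:

from paddleocr import PPChatOCRv4Doc

# A minimal sketch: override a few constructor parameters described above.
# Any parameter left out falls back to its default (None).
pipeline = PPChatOCRv4Doc(
    layout_detection_model_name="PP-DocLayout-L",  # from the layout detection model table
    device="gpu:0",                                # run inference on the first GPU
    use_doc_orientation_classify=False,            # skip document orientation classification
    use_doc_unwarping=False,                       # skip document unwarping
)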
(2) Call the visual_predict() method of the PP-ChatOCRv4 pipeline object to obtain visual prediction results. This method returns a list of results. The pipeline also provides a visual_predict_iter() method; the two accept identical parameters and return the same results, except that visual_predict_iter() returns a generator, allowing prediction results to be processed and retrieved step by step, which suits large datasets or memory-sensitive scenarios (a short sketch follows the table below). Choose whichever method fits your needs. The parameters of the visual_predict() method are described as follows:
| Parameter | Description | Type | Default |
|---|---|---|---|
| input | Data to be predicted; supports the same input types as described for the constructor (numpy.ndarray, file or directory path, URL, or a list of these); required. | Python Var \| str \| list | None |
| device | Same as the parameter during instantiation. | str | None |
| use_doc_orientation_classify | Whether to use the document orientation classification module during inference. | bool | None |
| use_doc_unwarping | Whether to use the text image correction module during inference. | bool | None |
| use_textline_orientation | Whether to use the text line orientation classification module during inference. | bool | None |
| use_seal_recognition | Whether to use the seal recognition sub-pipeline during inference. | bool | None |
| use_table_recognition | Whether to use the table recognition sub-pipeline during inference. | bool | None |
| layout_threshold | Same as the parameter during instantiation. | float \| dict | None |
| layout_nms | Same as the parameter during instantiation. | bool | None |
| layout_unclip_ratio | Same as the parameter during instantiation. | float \| Tuple[float,float] \| dict | None |
| layout_merge_bboxes_mode | Same as the parameter during instantiation. | str \| dict | None |
| text_det_limit_side_len | Same as the parameter during instantiation. | int | None |
| text_det_limit_type | Same as the parameter during instantiation. | str | None |
| text_det_thresh | Same as the parameter during instantiation. | float | None |
| text_det_box_thresh | Same as the parameter during instantiation. | float | None |
| text_det_unclip_ratio | Same as the parameter during instantiation. | float | None |
| text_rec_score_thresh | Same as the parameter during instantiation. | float | None |
| seal_det_limit_side_len | Same as the parameter during instantiation. | int | None |
| seal_det_limit_type | Same as the parameter during instantiation. | str | None |
| seal_det_thresh | Same as the parameter during instantiation. | float | None |
| seal_det_box_thresh | Same as the parameter during instantiation. | float | None |
| seal_det_unclip_ratio | Same as the parameter during instantiation. | float | None |
| seal_rec_score_thresh | Same as the parameter during instantiation. | float | None |
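As mentioned above, visual_predict_iter() accepts the same parameters but returns a generator. A minimal sketch of the memory-friendly iteration style, continuing from the pipeline object above:

# A minimal sketch: visual_predict_iter() yields results one by one,
# so all predictions need not be held in memory at once.
visual_info_list = []
for res in pipeline.visual_predict_iter(input="vehicle_certificate-1.png"):
    visual_info_list.append(res["visual_info"])            # collect info for the LLM stage
    res["layout_parsing_result"].save_to_json("./output")  # persist each parsing result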
(3) Process the visual prediction results. The prediction result for each sample is of `dict` type, containing two fields: `visual_info` and `layout_parsing_result`. Visual information (including `normal_text_dict`, `table_text_list`, `table_html_list`, etc.) is obtained through `visual_info`, and the information for each sample is placed in the `visual_info_list` list. The content of this list will later be fed into the large language model. Of course, you can also obtain the layout parsing results through `layout_parsing_result`. This result contains content such as tables, text, and images found in the file or image, and supports operations like printing, saving as an image, and saving as a `json` file:
......
for res in visual_predict_res:
    visual_info_list.append(res["visual_info"])
    layout_parsing_result = res["layout_parsing_result"]
    layout_parsing_result.print()
    layout_parsing_result.save_to_img("./output")
    layout_parsing_result.save_to_json("./output")
    layout_parsing_result.save_to_xlsx("./output")
    layout_parsing_result.save_to_html("./output")
......
| Method | Description | Parameter | Type | Parameter Description | Default |
|---|---|---|---|---|---|
| print() | Prints the result to the terminal | format_json | bool | Whether to format the output content using JSON indentation | True |
| | | indent | int | Indentation level to beautify the JSON output for readability; effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Whether to escape non-ASCII characters to Unicode; True escapes all non-ASCII characters, False preserves the original characters; effective only when format_json is True | False |
| save_to_json() | Saves the result as a JSON file | save_path | str | Save path; when a directory is given, the saved file name matches the input file name | None |
| | | indent | int | Same as above | 4 |
| | | ensure_ascii | bool | Same as above | False |
| save_to_img() | Saves the visualization images of the intermediate modules as PNG images | save_path | str | Save path; supports a directory or file path | None |
| save_to_html() | Saves the tables in the file as HTML files | save_path | str | Save path; supports a directory or file path | None |
| save_to_xlsx() | Saves the tables in the file as XLSX files | save_path | str | Save path; supports a directory or file path | None |
- Calling the `print()` method will print the results to the terminal. The printed content is explained as follows:
  - `input_path`: `(str)` Input path of the image to be predicted.
  - `page_index`: `(Union[int, None])` If the input is a PDF file, this indicates the current page number of the PDF; otherwise, it is `None`.
  - `model_settings`: `(Dict[str, bool])` Model parameters configured for the pipeline.
    - `use_doc_preprocessor`: `(bool)` Controls whether the document preprocessor sub-pipeline is enabled.
    - `use_seal_recognition`: `(bool)` Controls whether the seal recognition sub-pipeline is enabled.
    - `use_table_recognition`: `(bool)` Controls whether the table recognition sub-pipeline is enabled.
    - `use_formula_recognition`: `(bool)` Controls whether the formula recognition sub-pipeline is enabled.
  - `parsing_res_list`: `(List[Dict])` List of parsing results, where each element is a dictionary. The list order is the reading order after parsing.
    - `block_bbox`: `(np.ndarray)` Bounding box of the layout region.
    - `block_label`: `(str)` Label of the layout region, e.g., `text`, `table`.
    - `block_content`: `(str)` Content within the layout region.
  - `overall_ocr_res`: `(Dict[str, Union[List[str], List[float], numpy.ndarray]])` Dictionary of global OCR results.
    - `input_path`: `(Union[str, None])` Image path accepted by the image OCR sub-pipeline; `None` when the input is `numpy.ndarray`.
    - `model_settings`: `(Dict)` Model configuration parameters for the OCR sub-pipeline.
    - `dt_polys`: `(List[numpy.ndarray])` List of polygon boxes for text detection. Each box is represented by a numpy array of 4 vertex coordinates with shape (4, 2) and dtype int16.
    - `dt_scores`: `(List[float])` Confidence scores of the text detection boxes.
    - `text_det_params`: `(Dict[str, Dict[str, int, float]])` Configuration parameters for the text detection module.
      - `limit_side_len`: `(int)` Side length limit for image preprocessing.
      - `limit_type`: `(str)` Handling method for the side length limit.
      - `thresh`: `(float)` Confidence threshold for text pixel classification.
      - `box_thresh`: `(float)` Confidence threshold for text detection boxes.
      - `unclip_ratio`: `(float)` Expansion factor for text detection boxes.
      - `text_type`: `(str)` Type of text detection, currently fixed to "general".
    - `textline_orientation_angles`: `(List[int])` Prediction results of text line orientation classification. When enabled, actual angle values are returned (e.g., [0,0,1]).
    - `text_rec_score_thresh`: `(float)` Filtering threshold for text recognition results.
    - `rec_texts`: `(List[str])` Text recognition results, containing only text with confidence above `text_rec_score_thresh`.
    - `rec_scores`: `(List[float])` Text recognition confidence scores, filtered by `text_rec_score_thresh`.
    - `rec_polys`: `(List[numpy.ndarray])` Text detection boxes filtered by confidence, same format as `dt_polys`.
  - `formula_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` List of formula recognition results, where each element is a dictionary.
    - `rec_formula`: `(str)` Formula recognition result.
    - `rec_polys`: `(numpy.ndarray)` Formula detection box with shape (4, 2) and dtype int16.
    - `formula_region_id`: `(int)` Region number where the formula is located.
  - `seal_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` List of seal recognition results, where each element is a dictionary.
    - `input_path`: `(str)` Input path of the seal image.
    - `model_settings`: `(Dict)` Model configuration parameters for the seal recognition sub-pipeline.
    - `dt_polys`: `(List[numpy.ndarray])` Seal detection boxes, same format as `dt_polys` above.
    - `text_det_params`: `(Dict[str, Dict[str, int, float]])` Configuration parameters for the seal detection module; the parameter meanings are the same as above.
    - `text_type`: `(str)` Type of seal detection, currently fixed to "seal".
    - `text_rec_score_thresh`: `(float)` Filtering threshold for seal recognition results.
    - `rec_texts`: `(List[str])` Seal recognition results, containing only text with confidence above `text_rec_score_thresh`.
    - `rec_scores`: `(List[float])` Seal recognition confidence scores, filtered by `text_rec_score_thresh`.
    - `rec_polys`: `(List[numpy.ndarray])` Seal detection boxes filtered by confidence, same format as `dt_polys`.
    - `rec_boxes`: `(numpy.ndarray)` Rectangular bounding boxes of the detections, shape (n, 4), dtype int16; each row represents one rectangle.
  - `table_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` List of table recognition results, where each element is a dictionary.
    - `cell_box_list`: `(List[numpy.ndarray])` Bounding boxes of the table cells.
    - `pred_html`: `(str)` HTML-format string of the table.
    - `table_ocr_pred`: `(dict)` OCR recognition result for the table.
      - `rec_polys`: `(List[numpy.ndarray])` Detection boxes of the cells.
      - `rec_texts`: `(List[str])` Recognition results of the cells.
      - `rec_scores`: `(List[float])` Recognition confidence scores of the cells.
      - `rec_boxes`: `(numpy.ndarray)` Rectangular bounding boxes of the detections, shape (n, 4), dtype int16; each row represents one rectangle.
- Calling the `save_to_json()` method will save the above content to the specified `save_path`. If a directory is specified, the save path will be `save_path/{your_img_basename}.json`; if a file is specified, the result is saved directly to that file. Since JSON files do not support saving numpy arrays, `numpy.array` types are converted to lists.
- Calling the `save_to_img()` method will save the visualization results to the specified `save_path`. If a directory is specified, the save path will be `save_path/{your_img_basename}_ocr_res_img.{your_img_extension}`; if a file is specified, the result is saved directly to that file. (Since the pipeline usually produces many result images, it is not recommended to specify a specific file path; otherwise multiple images will overwrite each other and only the last one will be kept.)

Additionally, visualization images and prediction results can be obtained through properties, as follows:
| Property | Description |
|---|---|
| json | Gets the prediction results in json format. |
| img | Gets the visualization images in dict format. |
- The prediction result obtained via the `json` property is dict-type data, consistent with the content saved by calling the `save_to_json()` method.
- The prediction result returned by the `img` property is a dictionary. The keys are `layout_det_res`, `overall_ocr_res`, `text_paragraphs_ocr_res`, `formula_res_region1`, `table_cell_img`, and `seal_res_region1`, each mapping to an `Image.Image` object used to display the visualization of layout region detection, OCR, OCR text paragraphs, formula, table, and seal results, respectively. If optional modules are not used, the dictionary only contains `layout_det_res`.
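A short sketch of accessing these two properties, continuing from the result object above (the output directory is illustrative):

# A minimal sketch: read prediction data and visualization images via properties.
res_json = layout_parsing_result.json  # dict; same content as save_to_json()
for name, image in layout_parsing_result.img.items():
    # Keys such as "layout_det_res" map to PIL Image.Image objects.
    image.save(f"./output/{name}.png")  # illustrative output path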
(4) Call the build_vector() method of the PP-ChatOCRv4 pipeline object to build vectors for the text content. The following are the parameters and their descriptions for the `build_vector()` method:
| Parameter | Description | Type | Default |
|---|---|---|---|
| visual_info | Visual information; a dictionary containing visual information, or a list of such dictionaries. | list \| dict | None |
| min_characters | Minimum number of characters; a positive integer, which can be set according to the token length supported by the large language model. | int | 3500 |
| block_size | Block size used when building a vector library for long text; a positive integer, which can be set according to the token length supported by the large language model. | int | 300 |
| flag_save_bytes_vector | Whether to save the text as a binary file. | bool | False |
| retriever_config | Configuration parameters for the vector retrieval large model; same as the parameter during instantiation. | dict | None |
This method returns a dictionary containing visual text information, with the following content:
- `flag_save_bytes_vector`: `(bool)` Whether the result was saved as a binary file.
- `flag_too_short_text`: `(bool)` Whether the text length is less than the minimum number of characters.
- `vector`: `(str|list)` The binary content of the text or the text content itself, depending on the values of `flag_save_bytes_vector` and `min_characters`. If `flag_save_bytes_vector=True` and the text length is at least the minimum number of characters, binary content is returned; otherwise, the original text is returned.
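For example, building the vector store and inspecting the returned fields (a sketch based on the descriptions above, continuing from the earlier objects):

# A minimal sketch: build the vector store and inspect the returned fields.
vector_info = pipeline.build_vector(
    visual_info_list,
    flag_save_bytes_vector=True,
    retriever_config=retriever_config,
)
if vector_info["flag_too_short_text"]:
    # The text was shorter than min_characters, so "vector" holds the original text.
    print("Text too short for retrieval:", vector_info["vector"])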
(5) Call the mllm_pred() method of the PP-ChatOCRv4 pipeline object to get the extraction results from the multimodal large model. The following are the parameters and their descriptions for the `mllm_pred()` method:
| Parameter | Description | Type | Default |
|---|---|---|---|
| input | Data to be predicted; supports numpy.ndarray image data, or the local path or URL of an image file or single-page PDF file; required. | Python Var \| str | None |
| key_list | A single key or a list of keys used for extracting information. | Union[str, List[str]] | None |
| mllm_chat_bot_config | Configuration parameters for the multimodal large model; same as the parameter during instantiation. | dict | None |
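Besides a file path, input also accepts in-memory image data; a sketch (assuming OpenCV is installed for image loading, and continuing from the earlier objects):

import cv2

# A minimal sketch: mllm_pred() also accepts numpy.ndarray image data.
image = cv2.imread("vehicle_certificate-1.png")  # BGR image as numpy.ndarray
mllm_predict_res = pipeline.mllm_pred(
    input=image,
    key_list=["驾驶室准乘人数"],  # key to extract, i.e., "Cab Seating Capacity"
    mllm_chat_bot_config=mllm_chat_bot_config,
)
print(mllm_predict_res["mllm_res"])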
(6) Call the chat() method of the PP-ChatOCRv4 pipeline object to extract key information. The following are the parameters and their descriptions for the `chat()` method:
| Parameter | Description | Type | Default |
|---|---|---|---|
| key_list | A single key or a list of keys used for extracting information. | Union[str, List[str]] | None |
| visual_info | Visual information result. | List[dict] | None |
| use_vector_retrieval | Whether to use vector retrieval. | bool | True |
| vector_info | Vector information used for retrieval. | dict | None |
| min_characters | Required minimum number of characters; a positive integer. | int | 3500 |
| text_task_description | Description of the text task. | str | None |
| text_output_format | Output format for text results. | str | None |
| text_rules_str | Rules for generating text results. | str | None |
| text_few_shot_demo_text_content | Text content for few-shot demonstration. | str | None |
| text_few_shot_demo_key_value_list | Key-value list for few-shot demonstration. | str | None |
| table_task_description | Description of the table task. | str | None |
| table_output_format | Output format for table results. | str | None |
| table_rules_str | Rules for generating table results. | str | None |
| table_few_shot_demo_text_content | Text content for table few-shot demonstration. | str | None |
| table_few_shot_demo_key_value_list | Key-value list for table few-shot demonstration. | str | None |
| mllm_predict_info | Multimodal large model result. | dict | None |
| mllm_integration_strategy | Data fusion strategy for the multimodal large model and the large language model; either can be used on its own, or their results can be fused. Options: "integration", "llm_only", "mllm_only". | str | "integration" |
| chat_bot_config | Configuration information for the large language model; same as the parameter during instantiation. | dict | None |
| retriever_config | Configuration parameters for the vector retrieval large model; same as the parameter during instantiation. | dict | None |
This method prints the result to the terminal. The printed content is explained as follows:

- `chat_res`: `(dict)` The result of information extraction: a dictionary containing the keys to be extracted and their corresponding values.
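Putting the pieces together, here is a minimal sketch of the final extraction call, assuming the objects built in the previous steps (`visual_info_list`, `vector_info`, `mllm_predict_info`), a placeholder key, and that the method also returns the `chat_res` dictionary as in the script example:

```python
# Minimal sketch: fuse the visual, vector, and MLLM results to extract the keys.
chat_result = pipeline.chat(
    key_list=["invoice_no"],                   # hypothetical key
    visual_info=visual_info_list,
    use_vector_retrieval=True,
    vector_info=vector_info,
    mllm_predict_info=mllm_predict_info,
    mllm_integration_strategy="integration",   # or "llm_only" / "mllm_only"
)
print(chat_result)
```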

3. Development Integration/Deployment

If the pipeline meets your requirements for inference speed and accuracy in production, you can proceed directly with development integration/deployment.

If you need to apply the pipeline directly in your Python project, you can refer to the sample code in 2.2 Python Script Experience.

Additionally, PaddleX provides two other deployment methods, detailed as follows:

🚀 High-Performance Inference: In actual production environments, many applications have stringent standards for the performance metrics of deployment strategies (especially response speed) to ensure efficient system operation and smooth user experience. To this end, PaddleX provides a high-performance inference plugin aimed at deeply optimizing model inference and pre/post-processing to significantly speed up the end-to-end process. For detailed instructions on high-performance inference, please refer to the High-Performance Inference Guide.

☁️ Serving: Serving is a common deployment form in actual production environments. By encapsulating the inference functionality as a service, clients can access these services through network requests to obtain inference results. PaddleX supports multiple serving solutions for pipelines. For detailed instructions on serving, please refer to the Service Deployment Guide.

Below are the API references for basic serving and multi-language service invocation examples:

API Reference

For the main operations provided by the service:

  • The HTTP request method is POST.
  • Both the request body and response body are JSON data (JSON objects).
  • When the request is successfully processed, the response status code is 200, and the response body has the following properties:
| Name | Type | Meaning |
|---|---|---|
| `logId` | `string` | UUID of the request. |
| `errorCode` | `integer` | Error code. Fixed at 0. |
| `errorMsg` | `string` | Error description. Fixed at "Success". |
| `result` | `object` | Operation result. |
  • When the request is not successfully processed, the response body has the following properties:
| Name | Type | Meaning |
|---|---|---|
| `logId` | `string` | UUID of the request. |
| `errorCode` | `integer` | Error code. Same as the response status code. |
| `errorMsg` | `string` | Error description. |
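As an illustration of this envelope, here is a minimal helper sketch (not part of the official examples) that posts a payload and unwraps the `result` field, raising on errors; the base URL and function name are placeholders:

```python
import requests

API_BASE_URL = "http://localhost:8080"  # placeholder base URL

def call_operation(path: str, payload: dict) -> dict:
    """POST a JSON payload and unwrap the `result` field of the response envelope."""
    resp = requests.post(f"{API_BASE_URL}{path}", json=payload)
    body = resp.json()
    # On success the status code is 200, errorCode is 0, and errorMsg is "Success".
    if resp.status_code != 200 or body.get("errorCode", -1) != 0:
        raise RuntimeError(
            f"Operation failed: errorCode={body.get('errorCode')}, "
            f"errorMsg={body.get('errorMsg')}, logId={body.get('logId')}"
        )
    return body["result"]
```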

The main operations provided by the service are as follows:

  • analyzeImages

Analyzes images using computer vision models to obtain OCR, table recognition, and other results, and extracts key information from the images.

POST /chatocr-visual

  • Properties of the request body:
| Name | Type | Meaning | Required |
|---|---|---|---|
| `file` | `string` | URL of an image file or PDF file accessible to the server, or the Base64-encoded content of such a file. By default, for PDF files exceeding 10 pages, only the first 10 pages are processed. To remove the page limit, add the configuration shown below the table to the pipeline configuration file. | Yes |
| `fileType` | `integer \| null` | File type. 0 represents a PDF file; 1 represents an image file. If this property is absent from the request body, the file type is inferred from the URL. | No |
| `useDocOrientationClassify` | `boolean \| null` | Please refer to the description of the `use_doc_orientation_classify` parameter of the pipeline object's `visual_predict` method. | No |
| `useDocUnwarping` | `boolean \| null` | Please refer to the description of the `use_doc_unwarping` parameter of the pipeline object's `visual_predict` method. | No |
| `useSealRecognition` | `boolean \| null` | Please refer to the description of the `use_seal_recognition` parameter of the pipeline object's `visual_predict` method. | No |
| `useTableRecognition` | `boolean \| null` | Please refer to the description of the `use_table_recognition` parameter of the pipeline object's `visual_predict` method. | No |
| `layoutThreshold` | `number \| null` | Please refer to the description of the `layout_threshold` parameter of the pipeline object's `visual_predict` method. | No |
| `layoutNms` | `boolean \| null` | Please refer to the description of the `layout_nms` parameter of the pipeline object's `visual_predict` method. | No |
| `layoutUnclipRatio` | `number \| array \| object \| null` | Please refer to the description of the `layout_unclip_ratio` parameter of the pipeline object's `visual_predict` method. | No |
| `layoutMergeBboxesMode` | `string \| object \| null` | Please refer to the description of the `layout_merge_bboxes_mode` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetLimitSideLen` | `integer \| null` | Please refer to the description of the `text_det_limit_side_len` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetLimitType` | `string \| null` | Please refer to the description of the `text_det_limit_type` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetThresh` | `number \| null` | Please refer to the description of the `text_det_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetBoxThresh` | `number \| null` | Please refer to the description of the `text_det_box_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetUnclipRatio` | `number \| null` | Please refer to the description of the `text_det_unclip_ratio` parameter of the pipeline object's `visual_predict` method. | No |
| `textRecScoreThresh` | `number \| null` | Please refer to the description of the `text_rec_score_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetLimitSideLen` | `integer \| null` | Please refer to the description of the `seal_det_limit_side_len` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetLimitType` | `string \| null` | Please refer to the description of the `seal_det_limit_type` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetThresh` | `number \| null` | Please refer to the description of the `seal_det_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetBoxThresh` | `number \| null` | Please refer to the description of the `seal_det_box_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetUnclipRatio` | `number \| null` | Please refer to the description of the `seal_det_unclip_ratio` parameter of the pipeline object's `visual_predict` method. | No |
| `sealRecScoreThresh` | `number \| null` | Please refer to the description of the `seal_rec_score_thresh` parameter of the pipeline object's `visual_predict` method. | No |

Configuration for removing the PDF page limit:

```yaml
Serving:
  extra:
    max_num_input_imgs: null
```
  • When the request is successfully processed, the result of the response body has the following properties:
| Name | Type | Meaning |
|---|---|---|
| `layoutParsingResults` | `array` | Analysis results obtained using computer vision models. The array length is 1 (for image input) or the number of document pages actually processed (for PDF input). For PDF input, each element represents the result of one processed page. |
| `visualInfo` | `array` | Key information in the image, which can be used as input for other operations. |
| `dataInfo` | `object` | Input data information. |

Each element in `layoutParsingResults` is an object with the following properties:

| Name | Type | Meaning |
|---|---|---|
| `prunedResult` | `object` | A simplified version of the `res` field in the JSON representation of the results generated by the pipeline's `visual_predict` method, with the `input_path` and `page_index` fields removed. |
| `outputImages` | `object \| null` | Refer to the description of the `img` property of the pipeline's visual prediction result. |
| `inputImage` | `string \| null` | Input image, in JPEG format and encoded using Base64. |
  • buildVectorStore

Builds a vector database.

POST /chatocr-vector

  • Properties of the request body:
| Name | Type | Meaning | Required |
|---|---|---|---|
| `visualInfo` | `array` | Key information in the image, provided by the `analyzeImages` operation. | Yes |
| `minCharacters` | `integer \| null` | Minimum data length required to enable the vector database. | No |
| `blockSize` | `integer \| null` | Please refer to the description of the `block_size` parameter of the pipeline object's `build_vector` method. | No |
| `retrieverConfig` | `object \| null` | Please refer to the description of the `retriever_config` parameter of the pipeline object's `build_vector` method. | No |
  • When the request is successfully processed, the result of the response body has the following properties:
| Name | Type | Meaning |
|---|---|---|
| `vectorInfo` | `object` | Serialized result of the vector database, which can be used as input for other operations. |
  • invokeMLLM

Invokes the MLLM.

POST /chatocr-mllm

  • Properties of the request body:

| Name | Type | Meaning | Required |
|---|---|---|---|
| `image` | `string` | URL of an image file accessible to the server, or the Base64-encoded content of the image file. | Yes |
| `keyList` | `array` | List of keys. | Yes |
| `mllmChatBotConfig` | `object \| null` | Please refer to the description of the `mllm_chat_bot_config` parameter of the pipeline object's `mllm_pred` method. | No |

  • When the request is successfully processed, the result of the response body has the following property:

| Name | Type | Meaning |
|---|---|---|
| `mllmPredictInfo` | `object` | MLLM invocation result. |
  • chat

Interacts with large language models to extract key information.

POST /chatocr-chat

  • Properties of the request body:

| Name | Type | Meaning | Required |
|---|---|---|---|
| `keyList` | `array` | List of keys. | Yes |
| `visualInfo` | `object` | Key information in the image, provided by the `analyzeImages` operation. | Yes |
| `useVectorRetrieval` | `boolean \| null` | Please refer to the description of the `use_vector_retrieval` parameter of the pipeline object's `chat` method. | No |
| `vectorInfo` | `object \| null` | Serialized result of the vector database, provided by the `buildVectorStore` operation. Note that deserialization involves an unpickle operation; to prevent malicious attacks, be sure to use data from trusted sources. | No |
| `minCharacters` | `integer` | Minimum data length required to enable the vector database. | No |
| `textTaskDescription` | `string \| null` | Please refer to the description of the `text_task_description` parameter of the pipeline object's `chat` method. | No |
| `textOutputFormat` | `string \| null` | Please refer to the description of the `text_output_format` parameter of the pipeline object's `chat` method. | No |
| `textRulesStr` | `string \| null` | Please refer to the description of the `text_rules_str` parameter of the pipeline object's `chat` method. | No |
| `textFewShotDemoTextContent` | `string \| null` | Please refer to the description of the `text_few_shot_demo_text_content` parameter of the pipeline object's `chat` method. | No |
| `textFewShotDemoKeyValueList` | `string \| null` | Please refer to the description of the `text_few_shot_demo_key_value_list` parameter of the pipeline object's `chat` method. | No |
| `tableTaskDescription` | `string \| null` | Please refer to the description of the `table_task_description` parameter of the pipeline object's `chat` method. | No |
| `tableOutputFormat` | `string \| null` | Please refer to the description of the `table_output_format` parameter of the pipeline object's `chat` method. | No |
| `tableRulesStr` | `string \| null` | Please refer to the description of the `table_rules_str` parameter of the pipeline object's `chat` method. | No |
| `tableFewShotDemoTextContent` | `string \| null` | Please refer to the description of the `table_few_shot_demo_text_content` parameter of the pipeline object's `chat` method. | No |
| `tableFewShotDemoKeyValueList` | `string \| null` | Please refer to the description of the `table_few_shot_demo_key_value_list` parameter of the pipeline object's `chat` method. | No |
| `mllmPredictInfo` | `object \| null` | MLLM invocation result, provided by the `invokeMLLM` operation. | No |
| `mllmIntegrationStrategy` | `string \| null` | Please refer to the description of the `mllm_integration_strategy` parameter of the pipeline object's `chat` method. | No |
| `chatBotConfig` | `object \| null` | Please refer to the description of the `chat_bot_config` parameter of the pipeline object's `chat` method. | No |
| `retrieverConfig` | `object \| null` | Please refer to the description of the `retriever_config` parameter of the pipeline object's `chat` method. | No |

  • When the request is successfully processed, the result of the response body has the following properties:

| Name | Type | Meaning |
|---|---|---|
| `chatResult` | `object` | Key information extraction result. |
  • Note: Including sensitive parameters, such as the API key for large model calls, in the request body can pose a security risk. Unless necessary, set these parameters in the configuration file instead of passing them with the request.

Multi-language Service Invocation Examples

Python
    
```python
# This script only shows the use case for images. For calling with other file types, please read the API reference and make adjustments.

import base64
import pprint
import sys

import requests

API_BASE_URL = "http://0.0.0.0:8080"

image_path = "./demo.jpg"
keys = ["name"]

with open(image_path, "rb") as file:
    image_bytes = file.read()
    image_data = base64.b64encode(image_bytes).decode("ascii")

payload = {
    "file": image_data,
    "fileType": 1,
}

resp_visual = requests.post(url=f"{API_BASE_URL}/chatocr-visual", json=payload)
if resp_visual.status_code != 200:
    print(
        f"Request to chatocr-visual failed with status code {resp_visual.status_code}."
    )
    pprint.pp(resp_visual.json())
    sys.exit(1)
result_visual = resp_visual.json()["result"]

for i, res in enumerate(result_visual["layoutParsingResults"]):
    print(res["prunedResult"])
    for img_name, img in res["outputImages"].items():
        img_path = f"{img_name}_{i}.jpg"
        with open(img_path, "wb") as f:
            f.write(base64.b64decode(img))
        print(f"Output image saved at {img_path}")

payload = {
    "visualInfo": result_visual["visualInfo"],
}
resp_vector = requests.post(url=f"{API_BASE_URL}/chatocr-vector", json=payload)
if resp_vector.status_code != 200:
    print(
        f"Request to chatocr-vector failed with status code {resp_vector.status_code}."
    )
    pprint.pp(resp_vector.json())
    sys.exit(1)
result_vector = resp_vector.json()["result"]

payload = {
    "image": image_data,
    "keyList": keys,
}
resp_mllm = requests.post(url=f"{API_BASE_URL}/chatocr-mllm", json=payload)
if resp_mllm.status_code != 200:
    print(
        f"Request to chatocr-mllm failed with status code {resp_mllm.status_code}."
    )
    pprint.pp(resp_mllm.json())
    sys.exit(1)
result_mllm = resp_mllm.json()["result"]

payload = {
    "keyList": keys,
    "visualInfo": result_visual["visualInfo"],
    "useVectorRetrieval": True,
    "vectorInfo": result_vector["vectorInfo"],
    "mllmPredictInfo": result_mllm["mllmPredictInfo"],
}
resp_chat = requests.post(url=f"{API_BASE_URL}/chatocr-chat", json=payload)
if resp_chat.status_code != 200:
    print(
        f"Request to chatocr-chat failed with status code {resp_chat.status_code}."
    )
    pprint.pp(resp_chat.json())
    sys.exit(1)
result_chat = resp_chat.json()["result"]
print("Final result:")
print(result_chat["chatResult"])
```
    


4. Custom Development

If the default model weights provided by the PP-ChatOCRv4 pipeline do not meet your requirements for accuracy or speed, you can try fine-tuning the existing models with your own domain-specific or application-specific data to improve recognition performance in your scenario.

4.1 Model Fine-Tuning

Since the PP-ChatOCRv4 pipeline comprises several modules, unsatisfactory pipeline performance may originate from any one of them. You can analyze cases with poor extraction results, identify the problematic module by visualizing the result images, and refer to the corresponding fine-tuning tutorial links in the table below.

| Scenario | Module to Fine-Tune | Fine-Tuning Reference |
|---|---|---|
| Inaccurate layout region detection, such as missed detection of seals or tables | Layout Region Detection Module | Link |
| Inaccurate table structure recognition | Table Structure Recognition Module | Link |
| Missed detection of seal text | Seal Text Detection Module | Link |
| Missed detection of text | Text Detection Module | Link |
| Inaccurate text content | Text Recognition Module | Link |
| Inaccurate correction of vertical or rotated text lines | Text Line Orientation Classification Module | Link |
| Inaccurate correction of whole-image rotation | Document Image Orientation Classification Module | Link |
| Inaccurate correction of image distortion | Text Image Correction Module | Fine-tuning not supported |

4.2 Model Application

After you complete fine-tuning with your private dataset, you will obtain a local model weights file.

To use the fine-tuned model weights, simply edit the pipeline configuration file, filling in the local directory of the fine-tuned model weights at the corresponding position:

1. Exporting Pipeline Configuration Files

You can call the export_paddlex_config_to_yaml method of the pipeline object to export the current pipeline configuration to a YAML file. Here is an example:

```python
from paddleocr import PPChatOCRv4

pipeline = PPChatOCRv4()
pipeline.export_paddlex_config_to_yaml("PP-ChatOCRv4.yaml")
```
    
2. Editing Pipeline Configuration Files

Fill in the local directory of the fine-tuned model weights at the corresponding position in the pipeline configuration file. For example:

```yaml
......
SubModules:
  TextDetection:
    module_name: text_detection
    model_name: PP-OCRv5_server_det
    model_dir: null # Replace with the fine-tuned text detection model weights directory
    limit_side_len: 960
    limit_type: max
    thresh: 0.3
    box_thresh: 0.6
    unclip_ratio: 1.5

  TextRecognition:
    module_name: text_recognition
    model_name: PP-OCRv5_server_rec
    model_dir: null # Replace with the fine-tuned text recognition model weights directory
    batch_size: 1
    score_thresh: 0
......
```
    

The exported PaddleX pipeline configuration file not only includes the parameters supported by PaddleOCR's CLI and Python API but also allows more advanced settings. Please refer to the corresponding pipeline usage tutorials in the PaddleX Pipeline Usage Overview for detailed instructions on adjusting the configuration to your needs.

3. Loading Pipeline Configuration Files in the CLI

By specifying the path to the PaddleX pipeline configuration file with the --paddlex_config parameter, PaddleOCR reads its contents as the configuration for inference. Here is an example:

```bash
paddleocr pp_chatocrv4_doc --paddlex_config PP-ChatOCRv4.yaml ...
```
    
4. Loading Pipeline Configuration Files in the Python API

When initializing the pipeline object, you can pass the path to a PaddleX pipeline configuration file, or a configuration dictionary, through the paddlex_config parameter, and PaddleOCR will use it as the configuration for inference. Here is an example:

```python
from paddleocr import PPChatOCRv4

pipeline = PPChatOCRv4(paddlex_config="PP-ChatOCRv4.yaml")
```
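Since a configuration dictionary is also accepted, one option is to load the exported YAML, adjust it in Python, and pass the dict directly. A minimal sketch, assuming PyYAML is installed and the exported file above exists:

```python
import yaml  # assumes PyYAML is installed

from paddleocr import PPChatOCRv4

# Load the exported configuration, optionally tweak it, then pass the dict directly.
with open("PP-ChatOCRv4.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

pipeline = PPChatOCRv4(paddlex_config=config)
```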