PP-ChatOCRv4-doc Pipeline Tutorial¶
1. Introduction to PP-ChatOCRv4-doc Pipeline¶
PP-ChatOCRv4-doc is a unique document and image intelligent analysis solution from PaddlePaddle, combining LLM, MLLM, and OCR technologies to address complex document information extraction challenges such as layout analysis, rare characters, multi-page PDFs, tables, and seal recognition. Integrated with ERNIE Bot, it fuses massive data and knowledge, achieving high accuracy and wide applicability. This pipeline also provides flexible service deployment options, supporting deployment on various hardware. Furthermore, it offers custom development capabilities, allowing you to train and fine-tune models on your own datasets, with seamless integration of trained models.
The Document Scene Information Extraction v4 pipeline includes modules for Layout Region Detection, Table Structure Recognition, Table Classification, Table Cell Localization, Text Detection, Text Recognition, Seal Text Detection, Text Image Rectification, and Document Image Orientation Classification.
If you prioritize model accuracy, choose a model with higher accuracy. If you prioritize inference speed, select a model with faster inference. If you prioritize model storage size, choose a model with a smaller storage size. Benchmarks for some models are as follows:
👉Model List Details
Table Structure Recognition Module Models:
Model | Model Download Link | Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
SLANet | Inference Model/Training Model | 59.52 | 103.08 / 103.08 | 197.99 / 197.99 | 6.9 M | SLANet is a table structure recognition model developed by Baidu PaddleX Team. The model significantly improves the accuracy and inference speed of table structure recognition by adopting a CPU-friendly lightweight backbone network PP-LCNet, a high-low-level feature fusion module CSP-PAN, and a feature decoding module SLA Head that aligns structural and positional information. |
SLANet_plus | Inference Model/Training Model | 63.69 | 140.29 / 140.29 | 195.39 / 195.39 | 6.9 M | SLANet_plus is an enhanced version of SLANet, the table structure recognition model developed by Baidu PaddleX Team. Compared to SLANet, SLANet_plus significantly improves the recognition ability for wireless and complex tables and reduces the model's sensitivity to the accuracy of table positioning, enabling more accurate recognition even with offset table positioning. |
Layout Detection Module Models:
Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
---|---|---|---|---|---|---|
PP-DocLayout-L | Inference Model/Training Model | 90.4 | 34.6244 / 10.3945 | 510.57 / - | 123.76 M | A high-precision layout area localization model trained on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports using RT-DETR-L. |
PP-DocLayout-M | Inference Model/Training Model | 75.2 | 13.3259 / 4.8685 | 44.0680 / 44.0680 | 22.578 | A layout area localization model with balanced precision and efficiency, trained on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports using PicoDet-L. |
PP-DocLayout-S | Inference Model/Training Model | 70.9 | 8.3008 / 2.3794 | 10.0623 / 9.9296 | 4.834 | A high-efficiency layout area localization model trained on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports using PicoDet-S. |
Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
---|---|---|---|---|---|---|
PicoDet_layout_1x_table | Inference Model/Training Model | 97.5 | 8.02 / 3.09 | 23.70 / 20.41 | 7.4 M | A high-efficiency layout area localization model trained on a self-built dataset using PicoDet-1x, capable of detecting table regions. |
Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
---|---|---|---|---|---|---|
PicoDet-S_layout_3cls | Inference Model/Training Model | 88.2 | 8.99 / 2.22 | 16.11 / 8.73 | 4.8 | A high-efficiency layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using PicoDet-S. |
PicoDet-L_layout_3cls | Inference Model/Training Model | 89.0 | 13.05 / 4.50 | 41.30 / 41.30 | 22.6 | A balanced efficiency and precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using PicoDet-L. |
RT-DETR-H_layout_3cls | Inference Model/Training Model | 95.8 | 114.93 / 27.71 | 947.56 / 947.56 | 470.1 | A high-precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using RT-DETR-H. |
Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
---|---|---|---|---|---|---|
PicoDet_layout_1x | Inference Model/Training Model | 97.8 | 9.03 / 3.10 | 25.82 / 20.70 | 7.4 | A high-efficiency English document layout area localization model trained on the PubLayNet dataset using PicoDet-1x. |
Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
---|---|---|---|---|---|---|
PicoDet-S_layout_17cls | Inference Model/Training Model | 87.4 | 9.11 / 2.12 | 15.42 / 9.12 | 4.8 | A high-efficiency layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using PicoDet-S. |
PicoDet-L_layout_17cls | Inference Model/Training Model | 89.0 | 13.50 / 4.69 | 43.32 / 43.32 | 22.6 | A balanced efficiency and precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using PicoDet-L. |
RT-DETR-H_layout_17cls | Inference Model/Training Model | 98.3 | 115.29 / 104.09 | 995.27 / 995.27 | 470.2 | A high-precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using RT-DETR-H. |
Text Detection Module Models:
Model | Model Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
PP-OCRv5_server_det | Inference Model/Training Model | 83.8 | 89.55 / 70.19 | 371.65 / 371.65 | 84.3 | PP-OCRv5 server-side text detection model with higher accuracy, suitable for deployment on high-performance servers |
PP-OCRv5_mobile_det | Inference Model/Training Model | 79.0 | 8.79 / 3.13 | 51.00 / 28.58 | 4.7 | PP-OCRv5 mobile-side text detection model with higher efficiency, suitable for deployment on edge devices |
PP-OCRv4_server_det | Inference Model/Training Model | 69.2 | 83.34 / 80.91 | 442.58 / 442.58 | 109 | PP-OCRv4 server-side text detection model with higher accuracy, suitable for deployment on high-performance servers |
PP-OCRv4_mobile_det | Inference Model/Training Model | 63.8 | 8.79 / 3.13 | 51.00 / 28.58 | 4.7 | PP-OCRv4 mobile-side text detection model with higher efficiency, suitable for deployment on edge devices |
Text Recognition Module Models:
Model | Model Download Links | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
---|---|---|---|---|---|---|
PP-OCRv5_server_rec | Inference Model/Pretrained Model | 86.38 | 8.45/2.36 | 122.69/122.69 | 81 M | PP-OCRv5_rec is a next-generation text recognition model. It aims to efficiently and accurately support the recognition of four major languages—Simplified Chinese, Traditional Chinese, English, and Japanese—as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters using a single model. While maintaining recognition performance, it balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios. |
PP-OCRv5_mobile_rec | Inference Model/Pretrained Model | 81.29 | 1.46/5.43 | 5.32/91.79 | 16 M | |
PP-OCRv4_server_rec_doc | Inference Model/Pretrained Model | 86.58 | 6.65 / 2.38 | 32.92 / 32.92 | 91 M | PP-OCRv4_server_rec_doc is trained on a mixed dataset of more Chinese document data and PP-OCR training data, building upon PP-OCRv4_server_rec. It enhances the recognition capabilities for some Traditional Chinese characters, Japanese characters, and special symbols, supporting over 15,000 characters. In addition to improving document-related text recognition, it also enhances general text recognition capabilities. |
PP-OCRv4_mobile_rec | Inference Model/Pretrained Model | 83.28 | 4.82 / 1.20 | 16.74 / 4.64 | 11 M | A lightweight recognition model of PP-OCRv4 with high inference efficiency, suitable for deployment on various hardware devices, including edge devices. |
PP-OCRv4_server_rec | Inference Model/Pretrained Model | 85.19 | 6.58 / 2.43 | 33.17 / 33.17 | 87 M | The server-side model of PP-OCRv4, offering high inference accuracy and deployable on various servers. |
en_PP-OCRv4_mobile_rec | Inference Model/Pretrained Model | 70.39 | 4.81 / 0.75 | 16.10 / 5.31 | 7.3 M | An ultra-lightweight English recognition model trained based on the PP-OCRv4 recognition model, supporting English and numeric character recognition. |
Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
ch_SVTRv2_rec | Inference Model/Training Model | 68.81 | 8.08 / 8.08 | 50.17 / 42.50 | 73.9 M | SVTRv2 is a server-side text recognition model developed by the OpenOCR team at the Vision and Learning Lab (FVL) of Fudan University. It won the first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, with a 6% improvement in end-to-end recognition accuracy compared to PP-OCRv4 on the A-list. |
Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
ch_RepSVTR_rec | Inference Model/Training Model | 65.07 | 5.93 / 5.93 | 20.73 / 7.32 | 22.1 M | The RepSVTR text recognition model is a mobile-oriented text recognition model based on SVTRv2. It won the first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, with a 2.5% improvement in end-to-end recognition accuracy compared to PP-OCRv4 on the B-list, while maintaining similar inference speed. |
Formula Recognition Module Models:
Model Name | Model Download Link | BLEU Score | Normed Edit Distance | ExpRate (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size |
---|---|---|---|---|---|---|---|
LaTeX_OCR_rec | Inference Model/Training Model | 0.8821 | 0.0823 | 40.01 | 2047.13 / 2047.13 | 10582.73 / 10582.73 | 89.7 M |
Seal Text Detection Module Models:
Model | Model Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
---|---|---|---|---|---|---|
PP-OCRv4_server_seal_det | Inference Model/Training Model | 98.21 | 74.75 / 67.72 | 382.55 / 382.55 | 109 | PP-OCRv4's server-side seal text detection model, featuring higher accuracy, suitable for deployment on better-equipped servers |
PP-OCRv4_mobile_seal_det | Inference Model/Training Model | 96.47 | 7.82 / 3.09 | 48.28 / 23.97 | 4.6 | PP-OCRv4's mobile seal text detection model, offering higher efficiency, suitable for deployment on edge devices |
- Performance Test Environment
- Test Dataset:
- Text Image Rectification Model: DocUNet
- Layout Region Detection Model: A self-built layout analysis dataset using PaddleOCR, containing 10,000 images of common document types such as Chinese and English papers, magazines, and research reports.
- Table Structure Recognition Model: A self-built English table recognition dataset using PaddleX.
- Text Detection Model: A self-built Chinese dataset using PaddleOCR, covering multiple scenarios such as street scenes, web images, documents, and handwriting, with 500 images for detection.
- Chinese Recognition Model: A self-built Chinese dataset using PaddleOCR, covering multiple scenarios such as street scenes, web images, documents, and handwriting, with 11,000 images for text recognition.
- ch_SVTRv2_rec: Evaluation set A for "OCR End-to-End Recognition Task" in the PaddleOCR Algorithm Model Challenge
- ch_RepSVTR_rec: Evaluation set B for "OCR End-to-End Recognition Task" in the PaddleOCR Algorithm Model Challenge
- English Recognition Model: A self-built English dataset using PaddleX.
- Multilingual Recognition Model: A self-built multilingual dataset using PaddleX.
- Text Line Orientation Classification Model: A self-built dataset using PaddleOCR, covering various scenarios such as ID cards and documents, containing 1000 images.
- Seal Text Detection Model: A self-built dataset using PaddleOCR, containing 500 images of circular seal textures.
- Hardware Configuration:
- GPU: NVIDIA Tesla T4
- CPU: Intel Xeon Gold 6271C @ 2.60GHz
- Other Environments: Ubuntu 20.04 / cuDNN 8.6 / TensorRT 8.5.2.2
- Inference Mode Description
Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination |
---|---|---|---|
Normal Mode | FP32 Precision / No TRT Acceleration | FP32 Precision / 8 Threads | PaddleInference |
High-Performance Mode | Optimal combination of pre-selected precision types and acceleration strategies | FP32 Precision / 8 Threads | Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.) |
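As a hedged sketch of opting into the high-performance path from Python (the `enable_hpi` flag appears in the parameter table in Section 2.2, and the plugin itself is described in Section 3; treat this as illustrative):

from paddleocr import PPChatOCRv4Doc

# Assumes the high-performance inference plugin is installed (see Section 3)
pipeline = PPChatOCRv4Doc(enable_hpi=True)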
2. Quick Start¶
The pre-trained pipelines provided by PaddleOCR allow for quick experience of their effects. You can locally use Python to experience the effects of the PP-ChatOCRv4-doc pipeline.
Before using the PP-ChatOCRv4-doc pipeline locally, ensure you have completed the installation of the PaddleOCR wheel package according to the PaddleOCR Local Installation Tutorial. If you wish to selectively install dependencies, please refer to the relevant instructions in the installation guide. The dependency group corresponding to this pipeline is `ie`.
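If you use pip extras for dependency groups (an assumption; confirm the exact syntax in the installation guide), the install command would look like this:

python -m pip install "paddleocr[ie]"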
Before performing model inference, you first need to prepare the API key for the large language model. PP-ChatOCRv4 supports large model services on the Baidu Cloud Qianfan Platform or a locally deployed service exposing the standard OpenAI interface. If you use the Baidu Cloud Qianfan Platform, refer to Authentication and Authorization to obtain the API key. If you use a locally deployed large model service, refer to the PaddleNLP Large Model Deployment Documentation to deploy the chat and vectorization interfaces, and fill in the corresponding `base_url` and `api_key`. If you also need a multimodal large model for data fusion, refer to the OpenAI service deployment section in the PaddleMIX Model Documentation to deploy the multimodal large model, and fill in the corresponding `base_url` and `api_key`.
Note: If local deployment of a multimodal large model is not possible in your environment, you can comment out the lines containing the `mllm` variable in the code and use only the large language model for information extraction, as sketched below.
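A minimal sketch of that LLM-only path, based on the full example in Section 2.2 (skip the `mllm_pred()` call and leave out `mllm_predict_info`):

# LLM-only extraction: the multimodal model is never invoked
chat_result = pipeline.chat(
    key_list=["Cab Seating Capacity"],
    visual_info=visual_info_list,
    vector_info=vector_info,
    chat_bot_config=chat_bot_config,
    retriever_config=retriever_config,
)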
2.1 Command Line Experience¶
After preparing the API key, you can complete quick inference with a single command. You can use the test file for testing:
paddleocr pp_chatocrv4_doc -i vehicle_certificate-1.png -k 驾驶室准乘人数 --qianfan_api_key your_api_key
# Use the multimodal large model via --invoke_mllm and --pp_docbee_base_url
paddleocr pp_chatocrv4_doc -i vehicle_certificate-1.png -k 驾驶室准乘人数 --qianfan_api_key your_api_key --invoke_mllm True --pp_docbee_base_url http://127.0.0.1:8080/
The command line supports more parameter configurations. Click to expand for a detailed explanation of the command line parameters.
Parameter | Parameter Description | Parameter Type | Options | Default Value |
---|---|---|---|---|
input | The data to be predicted, supporting multiple input types, required. | Python Var|str|list | | None |
device | The device for pipeline inference. | str|None | | None |
use_doc_orientation_classify | Whether to use the document orientation classification module. | bool|None | | None |
use_doc_unwarping | Whether to use the document distortion correction module. | bool|None | | None |
use_textline_orientation | Whether to use the text line orientation classification module. | bool|None | | None |
use_general_ocr | Whether to use the OCR sub-pipeline. | bool|None | | None |
use_seal_recognition | Whether to use the seal recognition sub-pipeline. | bool|None | | None |
use_table_recognition | Whether to use the table recognition sub-pipeline. | bool|None | | None |
layout_threshold | The score threshold for the layout model. | float|dict|None | | None |
layout_nms | Whether to use NMS. | bool|None | | None |
layout_unclip_ratio | The expansion coefficient for layout detection. | float|Tuple[float,float]|dict|None | | None |
layout_merge_bboxes_mode | The method for filtering overlapping bounding boxes. | str|dict|None | | None |
text_det_limit_side_len | The side length limit for text detection images. | int|None | | None |
text_det_limit_type | The type of side length limit for text detection images. | str|None | | None |
text_det_thresh | The pixel threshold for detection. In the output probability map, pixels with scores greater than this threshold are considered text pixels. | float|None | | None |
text_det_box_thresh | The bounding box threshold for detection. When the average score of all pixels within the detected bounding box is greater than this threshold, the result is considered a text region. | float|None | | None |
text_det_unclip_ratio | The expansion coefficient for text detection, used to expand the text region; the larger the value, the larger the expanded area. | float|None | | None |
text_rec_score_thresh | The text recognition threshold. Text results with scores greater than this threshold are retained. | float|None | | None |
seal_det_limit_side_len | The side length limit for seal detection images. | int|None | | None |
seal_det_limit_type | The type of side length limit for seal detection images. | str|None | | None |
seal_det_thresh | The pixel threshold for detection. In the output probability map, pixels with scores greater than this threshold are considered seal pixels. | float|None | | None |
seal_det_box_thresh | The bounding box threshold for detection. When the average score of all pixels within the detected bounding box is greater than this threshold, the result is considered a seal region. | float|None | | None |
seal_det_unclip_ratio | The expansion coefficient for seal detection, used to expand the seal region; the larger the value, the larger the expanded area. | float|None | | None |
seal_rec_score_thresh | The seal recognition threshold. Text results with scores greater than this threshold are retained. | float|None | | None |
This command prints the extraction results to the terminal.
2.2 Python Script Experience¶
The command-line method is for a quick experience and viewing results. For projects, integration via code is usually required. You can download the Test File and use the following example code for inference:
from paddleocr import PPChatOCRv4Doc
chat_bot_config = {
"module_name": "chat_bot",
"model_name": "ernie-3.5-8k",
"base_url": "https://qianfan.baidubce.com/v2",
"api_type": "openai",
"api_key": "api_key", # your api_key
}
retriever_config = {
"module_name": "retriever",
"model_name": "embedding-v1",
"base_url": "https://qianfan.baidubce.com/v2",
"api_type": "qianfan",
"api_key": "api_key", # your api_key
}
mllm_chat_bot_config = {
"module_name": "chat_bot",
"model_name": "PP-DocBee2",
"base_url": "http://127.0.0.1:8080/", # your local mllm service url
"api_type": "openai",
"api_key": "api_key", # your api_key
}
pipeline = PPChatOCRv4Doc()
visual_predict_res = pipeline.visual_predict(
input="vehicle_certificate-1.png",
use_doc_orientation_classify=False,
use_doc_unwarping=False,
use_common_ocr=True,
use_seal_recognition=True,
use_table_recognition=True,
)
visual_info_list = []
for res in visual_predict_res:
visual_info_list.append(res["visual_info"])
layout_parsing_result = res["layout_parsing_result"]
vector_info = pipeline.build_vector(
visual_info_list, flag_save_bytes_vector=True, retriever_config=retriever_config
)
mllm_predict_res = pipeline.mllm_pred(
input="vehicle_certificate-1.png",
key_list=["Cab Seating Capacity"], # Translated: 驾驶室准乘人数
mllm_chat_bot_config=mllm_chat_bot_config,
)
mllm_predict_info = mllm_predict_res["mllm_res"]
chat_result = pipeline.chat(
key_list=["Cab Seating Capacity"], # Translated: 驾驶室准乘人数
visual_info=visual_info_list,
vector_info=vector_info,
mllm_predict_info=mllm_predict_info,
chat_bot_config=chat_bot_config,
retriever_config=retriever_config,
)
print(chat_result)
After running, the key information extraction result is printed to the terminal.
The prediction process, API description, and output description for PP-ChatOCRv4 are as follows:
(1) Instantiate the PP-ChatOCRv4 pipeline object by calling `PPChatOCRv4Doc`.
The relevant parameter descriptions are as follows:
Parameter | Parameter Description | Parameter Type | Default Value |
---|---|---|---|
layout_detection_model_name | The name of the model used for layout region detection. If set to None, the pipeline's default model will be used. | str | None |
layout_detection_model_dir | The directory path of the layout region detection model. If set to None, the official model will be downloaded. | str | None |
doc_orientation_classify_model_name | The name of the document orientation classification model. If set to None, the pipeline's default model will be used. | str | None |
doc_orientation_classify_model_dir | The directory path of the document orientation classification model. If set to None, the official model will be downloaded. | str | None |
doc_unwarping_model_name | The name of the document unwarping model. If set to None, the pipeline's default model will be used. | str | None |
doc_unwarping_model_dir | The directory path of the document unwarping model. If set to None, the official model will be downloaded. | str | None |
text_detection_model_name | The name of the text detection model. If set to None, the pipeline's default model will be used. | str | None |
text_detection_model_dir | The directory path of the text detection model. If set to None, the official model will be downloaded. | str | None |
text_recognition_model_name | The name of the text recognition model. If set to None, the pipeline's default model will be used. | str | None |
text_recognition_model_dir | The directory path of the text recognition model. If set to None, the official model will be downloaded. | str | None |
text_recognition_batch_size | The batch size for the text recognition model. If set to None, the batch size will default to 1. | int | None |
table_structure_recognition_model_name | The name of the table structure recognition model. If set to None, the pipeline's default model will be used. | str | None |
table_structure_recognition_model_dir | The directory path of the table structure recognition model. If set to None, the official model will be downloaded. | str | None |
seal_text_detection_model_name | The name of the seal text detection model. If set to None, the pipeline's default model will be used. | str | None |
seal_text_detection_model_dir | The directory path of the seal text detection model. If set to None, the official model will be downloaded. | str | None |
seal_text_recognition_model_name | The name of the seal text recognition model. If set to None, the pipeline's default model will be used. | str | None |
seal_text_recognition_model_dir | The directory path of the seal text recognition model. If set to None, the official model will be downloaded. | str | None |
seal_text_recognition_batch_size | The batch size for the seal text recognition model. If set to None, the batch size will default to 1. | int | None |
use_doc_orientation_classify | Whether to load the document orientation classification function. If set to None, the value initialized by the pipeline for this parameter will be used by default (initialized to True). | bool | None |
use_doc_unwarping | Whether to load the document unwarping function. If set to None, the value initialized by the pipeline for this parameter will be used by default (initialized to True). | bool | None |
use_seal_recognition | Whether to load the seal recognition sub-pipeline. If set to None, the value initialized by the pipeline for this parameter will be used by default (initialized to True). | bool | None |
use_table_recognition | Whether to load the table recognition sub-pipeline. If set to None, the value initialized by the pipeline for this parameter will be used by default (initialized to True). | bool | None |
layout_threshold | Layout model score threshold. | float|dict | None |
layout_nms | Whether the layout region detection model uses NMS post-processing. | bool | None |
layout_unclip_ratio | Expansion factor for the detection boxes of the layout region detection model. | float|Tuple[float,float]|dict | None |
layout_merge_bboxes_mode | Method for filtering overlapping boxes in layout region detection. | str|dict | None |
text_det_limit_side_len | Maximum side length limit for text detection. | int | None |
text_det_limit_type | Type of side length limit for text detection. | str | None |
text_det_thresh | Detection pixel threshold. In the output probability map, pixels with scores greater than this threshold are considered text pixels. | float | None |
text_det_box_thresh | Detection box threshold. If the average score of all pixels within a detection result's bounding box is greater than this threshold, the result is considered a text region. | float | None |
text_det_unclip_ratio | Text detection expansion factor, used to expand text regions; the larger the value, the larger the expanded area. | float | None |
text_rec_score_thresh | Text recognition threshold. Text results with scores greater than this threshold will be kept. | float | None |
seal_det_limit_side_len | Image side length limit for seal text detection. | int | None |
seal_det_limit_type | Type of image side length limit for seal text detection. | str | None |
seal_det_thresh | Detection pixel threshold. In the output probability map, pixels with scores greater than this threshold are considered text pixels. | float | None |
seal_det_box_thresh | Detection box threshold. If the average score of all pixels within a detection result's bounding box is greater than this threshold, the result is considered a text region. | float | None |
seal_det_unclip_ratio | Seal text detection expansion factor, used to expand text regions; the larger the value, the larger the expanded area. | float | None |
seal_rec_score_thresh | Seal text recognition threshold. Text results with scores greater than this threshold will be kept. | float | None |
retriever_config | Configuration parameters for the vector retrieval large model, passed as a dictionary (see the example in Section 2.2). | dict | None |
mllm_chat_bot_config | Configuration parameters for the multimodal large model, passed as a dictionary (see the example in Section 2.2). | dict | None |
chat_bot_config | Configuration information for the large language model, passed as a dictionary (see the example in Section 2.2). | dict | None |
input | Data to be predicted, supports multiple input types, required. | Python Var|str|list | None |
save_path | Specifies the path to save the inference result file. If set to None, inference results will not be saved locally. | str | None |
device | Device used for inference. Supports specifying a specific card number. | str | None |
enable_hpi | Whether to enable high-performance inference. | bool | False |
use_tensorrt | Whether to use TensorRT for inference acceleration. | bool | False |
min_subgraph_size | Minimum subgraph size, used to optimize model subgraph computation. | int | 3 |
precision | Computation precision, e.g., fp32, fp16. | str | fp32 |
enable_mkldnn | Whether to enable the MKL-DNN acceleration library. If set to None, it will be enabled by default. | bool | None |
cpu_threads | Number of threads used when performing inference on CPU. | int | 8 |
paddlex_config | PaddleX pipeline configuration file path. | str | None |
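For instance, a hedged instantiation sketch that overrides a few of the options above (model names are taken from the benchmark tables in Section 1; the exact combination is illustrative, not a recommendation):

from paddleocr import PPChatOCRv4Doc

# All parameters are optional; anything left unset falls back to the pipeline defaults
pipeline = PPChatOCRv4Doc(
    layout_detection_model_name="PP-DocLayout-L",
    text_recognition_model_name="PP-OCRv5_server_rec",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
)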
(2) Call the `visual_predict()` method of the PP-ChatOCRv4 pipeline object to obtain visual prediction results. This method returns a list of results. The pipeline also provides a `visual_predict_iter()` method. Both accept the same parameters and return the same results, except that `visual_predict_iter()` returns a generator, allowing prediction results to be processed and retrieved step by step, which suits large datasets or memory-sensitive scenarios. Choose either method based on your actual needs. The following are the parameters and their descriptions for the `visual_predict()` method:
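For example, a minimal sketch of the generator variant ("document.pdf" is a hypothetical multi-page input; the arguments mirror `visual_predict()`):

visual_info_list = []
# Results are yielded one page at a time instead of materializing the whole list
for res in pipeline.visual_predict_iter(input="document.pdf"):
    visual_info_list.append(res["visual_info"])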
Parameter | Parameter Description | Parameter Type | Default Value |
---|---|---|---|
input | Data to be predicted, supports multiple input types, required. | Python Var|str|list | |
device | Same as the parameter during instantiation. | str | None |
use_doc_orientation_classify | Whether to use the document orientation classification module during inference. | bool | None |
use_doc_unwarping | Whether to use the text image correction module during inference. | bool | None |
use_textline_orientation | Whether to use the text line orientation classification module during inference. | bool | None |
use_seal_recognition | Whether to use the seal recognition sub-pipeline during inference. | bool | None |
use_table_recognition | Whether to use the table recognition sub-pipeline during inference. | bool | None |
layout_threshold | Same as the parameter during instantiation. | float|dict | None |
layout_nms | Same as the parameter during instantiation. | bool | None |
layout_unclip_ratio | Same as the parameter during instantiation. | float|Tuple[float,float]|dict | None |
layout_merge_bboxes_mode | Same as the parameter during instantiation. | str|dict | None |
text_det_limit_side_len | Same as the parameter during instantiation. | int | None |
text_det_limit_type | Same as the parameter during instantiation. | str | None |
text_det_thresh | Same as the parameter during instantiation. | float | None |
text_det_box_thresh | Same as the parameter during instantiation. | float | None |
text_det_unclip_ratio | Same as the parameter during instantiation. | float | None |
text_rec_score_thresh | Same as the parameter during instantiation. | float | None |
seal_det_limit_side_len | Same as the parameter during instantiation. | int | None |
seal_det_limit_type | Same as the parameter during instantiation. | str | None |
seal_det_thresh | Same as the parameter during instantiation. | float | None |
seal_det_box_thresh | Same as the parameter during instantiation. | float | None |
seal_det_unclip_ratio | Same as the parameter during instantiation. | float | None |
seal_rec_score_thresh | Same as the parameter during instantiation. | float | None |
(3) Process the visual prediction results.
The prediction result for each sample is of `dict` type, containing two fields: `visual_info` and `layout_parsing_result`. Visual information (including `normal_text_dict`, `table_text_list`, `table_html_list`, etc.) is obtained through `visual_info`, and the information for each sample is placed in the `visual_info_list` list; the content of this list will later be fed into the large language model. You can also obtain the layout parsing results through `layout_parsing_result`. This result contains content such as tables, text, and images found in the file or image, and supports operations like printing, saving as an image, and saving as a `json` file:
......
for res in visual_predict_res:
    visual_info_list.append(res["visual_info"])
    layout_parsing_result = res["layout_parsing_result"]
    layout_parsing_result.print()
    layout_parsing_result.save_to_img("./output")
    layout_parsing_result.save_to_json("./output")
    layout_parsing_result.save_to_xlsx("./output")
    layout_parsing_result.save_to_html("./output")
......
Method | Method Description | Parameter | Parameter Type | Parameter Description | Default Value |
---|---|---|---|---|---|
print() | Prints the result to the terminal | format_json | bool | Whether to format the output content using JSON indentation | True |
 | | indent | int | Specifies the indentation level to beautify the output JSON data for better readability, effective only when format_json is True | 4 |
 | | ensure_ascii | bool | Controls whether to escape non-ASCII characters to Unicode. Set to True to escape all non-ASCII characters; False to preserve original characters, effective only when format_json is True | False |
save_to_json() | Saves the result as a JSON format file | save_path | str | Save file path. When it is a directory, the saved file name will be consistent with the input file name. | None |
 | | indent | int | Specifies the indentation level to beautify the output JSON data for better readability, effective only when format_json is True | 4 |
 | | ensure_ascii | bool | Controls whether to escape non-ASCII characters to Unicode. Set to True to escape all non-ASCII characters; False to preserve original characters, effective only when format_json is True | False |
save_to_img() | Saves the visualization images of various intermediate modules as PNG format images | save_path | str | Save file path, supports directory or file path | None |
save_to_html() | Saves the tables in the file as HTML format files | save_path | str | Save file path, supports directory or file path | None |
save_to_xlsx() | Saves the tables in the file as XLSX format files | save_path | str | Save file path, supports directory or file path | None |
Property | Property Description |
---|---|
json | Gets the prediction results in json format. |
img | Gets the visualization images in dict format. |
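A minimal sketch of reading these properties (attribute names as listed above):

data = layout_parsing_result.json   # prediction results as JSON-style data
images = layout_parsing_result.img  # visualization images keyed by name
print(list(images.keys()))          # inspect which visualizations are available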
(4) Call the `build_vector()` method of the PP-ChatOCRv4 pipeline object to build vectors for the text content.
The following are the parameters and their descriptions for the `build_vector()` method:
Parameter | Parameter Description | Parameter Type | Default Value |
---|---|---|---|
visual_info | Visual information; can be a dictionary containing visual information, or a list of such dictionaries. | list|dict | None |
min_characters | Minimum number of characters. A positive integer greater than 0; can be set based on the token length supported by the large language model. | int | 3500 |
block_size | Block size when building a vector library for long text. A positive integer greater than 0; can be set based on the token length supported by the large language model. | int | 300 |
flag_save_bytes_vector | Whether to save text as a binary file. | bool | False |
retriever_config | Configuration parameters for the vector retrieval large model, same as the parameter during instantiation. | dict | None |
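As a hedged illustration, the chunking parameters from the table above can be set explicitly (the values shown are just the documented defaults written out):

vector_info = pipeline.build_vector(
    visual_info_list,
    min_characters=3500,             # documented default
    block_size=300,                  # documented default
    flag_save_bytes_vector=True,
    retriever_config=retriever_config,
)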
(5) Call the `mllm_pred()` method of the PP-ChatOCRv4 pipeline object to get the extraction results from the multimodal large model.
The following are the parameters and their descriptions for the `mllm_pred()` method:
Parameter | Parameter Description | Parameter Type | Default Value |
---|---|---|---|
input | Data to be predicted, supports multiple input types, required. | Python Var|str | |
key_list | A single key or a list of keys used for extracting information. | Union[str, List[str]] | None |
mllm_chat_bot_config | Configuration parameters for the multimodal large model, same as the parameter during instantiation. | dict | None |
(6) Call the `chat()` method of the PP-ChatOCRv4 pipeline object to extract key information.
The following are the parameters and their descriptions for the `chat()` method:
Parameter | Parameter Description | Parameter Type | Default Value |
---|---|---|---|
key_list | A single key or a list of keys used for extracting information. | Union[str, List[str]] | None |
visual_info | Visual information result. | List[dict] | None |
use_vector_retrieval | Whether to use vector retrieval. | bool | True |
vector_info | Vector information used for retrieval. | dict | None |
min_characters | Required minimum number of characters. A positive integer greater than 0. | int | 3500 |
text_task_description | Description of the text task. | str | None |
text_output_format | Output format for text results. | str | None |
text_rules_str | Rules for generating text results. | str | None |
text_few_shot_demo_text_content | Text content for few-shot demonstration. | str | None |
text_few_shot_demo_key_value_list | Key-value list for few-shot demonstration. | str | None |
table_task_description | Description of the table task. | str | None |
table_output_format | Output format for table results. | str | None |
table_rules_str | Rules for generating table results. | str | None |
table_few_shot_demo_text_content | Text content for table few-shot demonstration. | str | None |
table_few_shot_demo_key_value_list | Key-value list for table few-shot demonstration. | str | None |
mllm_predict_info | Multimodal large model result. | dict | None |
mllm_integration_strategy | Data fusion strategy for the multimodal large model and the large language model; supports using either one alone or fusing the results of both. Options: "integration", "llm_only", and "mllm_only". | str | "integration" |
chat_bot_config | Configuration information for the large language model, same as the parameter during instantiation. | dict | None |
retriever_config | Configuration parameters for the vector retrieval large model, same as the parameter during instantiation. | dict | None |
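As a hedged sketch, the prompt-shaping parameters above can be combined with the usual inputs (the task description and rule strings here are hypothetical examples, not built-in values):

chat_result = pipeline.chat(
    key_list=["Cab Seating Capacity"],
    visual_info=visual_info_list,
    use_vector_retrieval=True,
    vector_info=vector_info,
    text_task_description="Extract fields from a vehicle certificate.",  # hypothetical
    text_rules_str="If a key cannot be found, answer 'Unknown'.",         # hypothetical
    mllm_predict_info=mllm_predict_info,
    chat_bot_config=chat_bot_config,
)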
3. Development Integration/Deployment¶
If the pipeline meets your requirements for inference speed and accuracy in production, you can proceed directly with development integration/deployment.
If you need to apply the pipeline directly in your Python project, you can refer to the sample code in 2.2 Python Script Experience.
Additionally, PaddleX provides two other deployment methods, detailed as follows:
🚀 High-Performance Inference: In actual production environments, many applications have stringent standards for the performance metrics of deployment strategies (especially response speed) to ensure efficient system operation and smooth user experience. To this end, PaddleX provides a high-performance inference plugin aimed at deeply optimizing model inference and pre/post-processing to significantly speed up the end-to-end process. For detailed instructions on high-performance inference, please refer to the High-Performance Inference Guide.
☁️ Serving: Serving is a common deployment form in actual production environments. By encapsulating the inference functionality as a service, clients can access these services through network requests to obtain inference results. PaddleX supports multiple serving solutions for pipelines. For detailed instructions on serving, please refer to the Service Deployment Guide.
Below are the API references for basic serving and multi-language service invocation examples:
API Reference
For the main operations provided by the service:
- The HTTP request method is POST.
- Both the request body and response body are JSON data (JSON objects).
- When the request is successfully processed, the response status code is 200, and the response body has the following properties:
Name | Type | Meaning |
---|---|---|
logId | string | UUID of the request. |
errorCode | integer | Error code. Fixed at 0. |
errorMsg | string | Error description. Fixed at "Success". |
result | object | Operation result. |
- When the request is not successfully processed, the response body has the following properties:
Name | Type | Meaning |
---|---|---|
logId | string | UUID of the request. |
errorCode | integer | Error code. Same as the response status code. |
errorMsg | string | Error description. |
The main operations provided by the service are as follows:
analyzeImages
Uses computer vision models to analyze images, obtain OCR, table recognition results, etc., and extract key information from the images.
POST /chatocr-visual
- Properties of the request body:
Name | Type | Meaning | Required |
---|---|---|---|
file | string | URL of an image file or PDF file accessible to the server, or the Base64-encoded content of such a file. By default, for PDF files exceeding 10 pages, only the content of the first 10 pages will be processed. To remove the page limit, add the corresponding setting to the pipeline configuration file. | Yes |
fileType | integer | null | File type. 0 represents a PDF file, 1 represents an image file. If this property is absent from the request body, the file type will be inferred from the URL. | No |
useDocOrientationClassify | boolean | null | Please refer to the description of the use_doc_orientation_classify parameter of the pipeline object's visual_predict method. | No |
useDocUnwarping | boolean | null | Please refer to the description of the use_doc_unwarping parameter of the pipeline object's visual_predict method. | No |
useSealRecognition | boolean | null | Please refer to the description of the use_seal_recognition parameter of the pipeline object's visual_predict method. | No |
useTableRecognition | boolean | null | Please refer to the description of the use_table_recognition parameter of the pipeline object's visual_predict method. | No |
layoutThreshold | number | null | Please refer to the description of the layout_threshold parameter of the pipeline object's visual_predict method. | No |
layoutNms | boolean | null | Please refer to the description of the layout_nms parameter of the pipeline object's visual_predict method. | No |
layoutUnclipRatio | number | array | object | null | Please refer to the description of the layout_unclip_ratio parameter of the pipeline object's visual_predict method. | No |
layoutMergeBboxesMode | string | object | null | Please refer to the description of the layout_merge_bboxes_mode parameter of the pipeline object's visual_predict method. | No |
textDetLimitSideLen | integer | null | Please refer to the description of the text_det_limit_side_len parameter of the pipeline object's visual_predict method. | No |
textDetLimitType | string | null | Please refer to the description of the text_det_limit_type parameter of the pipeline object's visual_predict method. | No |
textDetThresh | number | null | Please refer to the description of the text_det_thresh parameter of the pipeline object's visual_predict method. | No |
textDetBoxThresh | number | null | Please refer to the description of the text_det_box_thresh parameter of the pipeline object's visual_predict method. | No |
textDetUnclipRatio | number | null | Please refer to the description of the text_det_unclip_ratio parameter of the pipeline object's visual_predict method. | No |
textRecScoreThresh | number | null | Please refer to the description of the text_rec_score_thresh parameter of the pipeline object's visual_predict method. | No |
sealDetLimitSideLen | integer | null | Please refer to the description of the seal_det_limit_side_len parameter of the pipeline object's visual_predict method. | No |
sealDetLimitType | string | null | Please refer to the description of the seal_det_limit_type parameter of the pipeline object's visual_predict method. | No |
sealDetThresh | number | null | Please refer to the description of the seal_det_thresh parameter of the pipeline object's visual_predict method. | No |
sealDetBoxThresh | number | null | Please refer to the description of the seal_det_box_thresh parameter of the pipeline object's visual_predict method. | No |
sealDetUnclipRatio | number | null | Please refer to the description of the seal_det_unclip_ratio parameter of the pipeline object's visual_predict method. | No |
sealRecScoreThresh | number | null | Please refer to the description of the seal_rec_score_thresh parameter of the pipeline object's visual_predict method. | No |
- When the request is successfully processed, the result of the response body has the following properties:
Name | Type | Meaning |
---|---|---|
layoutParsingResults | array | Analysis results obtained using computer vision models. The array length is 1 (for image input) or the actual number of document pages processed (for PDF input). For PDF input, each element in the array represents the result of each page actually processed in the PDF file. |
visualInfo | array | Key information in the image, which can be used as input for other operations. |
dataInfo | object | Input data information. |
Each element in layoutParsingResults is an object with the following properties:
Name | Type | Meaning |
---|---|---|
prunedResult | object | A simplified version of the res field in the JSON representation of the results generated by the pipeline's visual_predict method, with the input_path and page_index fields removed. |
outputImages | object | null | Refer to the description of the img attribute of the pipeline's visual prediction result. |
inputImage | string | null | Input image. The image is in JPEG format and encoded using Base64. |
buildVectorStore
Builds a vector database.
POST /chatocr-vector
- Properties of the request body:
Name | Type | Meaning | Required |
---|---|---|---|
visualInfo | array | Key information in the image. Provided by the analyzeImages operation. | Yes |
minCharacters | integer | null | Minimum data length to enable the vector database. | No |
blockSize | integer | null | Please refer to the description of the block_size parameter of the pipeline object's build_vector method. | No |
retrieverConfig | object | null | Please refer to the description of the retriever_config parameter of the pipeline object's build_vector method. | No |
No |
- When the request is successfully processed, the result of the response body has the following properties:
Name | Type | Meaning |
---|---|---|
vectorInfo | object | Serialized result of the vector database, which can be used as input for other operations. |
invokeMLLM
Invoke the MLLM.
POST /chatocr-mllm
- Properties of the request body:
Name | Type | Meaning | Required |
---|---|---|---|
image | string | URL of an image file accessible to the server, or the Base64-encoded content of the image file. | Yes |
keyList | array | List of keys. | Yes |
mllmChatBotConfig | object | null | Please refer to the description of the mllm_chat_bot_config parameter of the pipeline object's mllm_pred method. | No |
- When the request is successfully processed, the result of the response body has the following property:
Name | Type | Meaning |
---|---|---|
mllmPredictInfo | object | MLLM invocation result. |
chat
Interacts with large language models to extract key information using them.
POST /chatocr-chat
- Properties of the request body:
Name | Type | Meaning | Required |
---|---|---|---|
keyList | array | List of keys. | Yes |
visualInfo | object | Key information in the image. Provided by the analyzeImages operation. | Yes |
useVectorRetrieval | boolean | null | Please refer to the description of the use_vector_retrieval parameter of the pipeline object's chat method. | No |
vectorInfo | object | null | Serialized result of the vector database. Provided by the buildVectorStore operation. Please note that the deserialization process involves performing an unpickle operation. To prevent malicious attacks, be sure to use data from trusted sources. | No |
minCharacters | integer | Minimum data length to enable the vector database. | No |
textTaskDescription | string | null | Please refer to the description of the text_task_description parameter of the pipeline object's chat method. | No |
textOutputFormat | string | null | Please refer to the description of the text_output_format parameter of the pipeline object's chat method. | No |
textRulesStr | string | null | Please refer to the description of the text_rules_str parameter of the pipeline object's chat method. | No |
textFewShotDemoTextContent | string | null | Please refer to the description of the text_few_shot_demo_text_content parameter of the pipeline object's chat method. | No |
textFewShotDemoKeyValueList | string | null | Please refer to the description of the text_few_shot_demo_key_value_list parameter of the pipeline object's chat method. | No |
tableTaskDescription | string | null | Please refer to the description of the table_task_description parameter of the pipeline object's chat method. | No |
tableOutputFormat | string | null | Please refer to the description of the table_output_format parameter of the pipeline object's chat method. | No |
tableRulesStr | string | null | Please refer to the description of the table_rules_str parameter of the pipeline object's chat method. | No |
tableFewShotDemoTextContent | string | null | Please refer to the description of the table_few_shot_demo_text_content parameter of the pipeline object's chat method. | No |
tableFewShotDemoKeyValueList | string | null | Please refer to the description of the table_few_shot_demo_key_value_list parameter of the pipeline object's chat method. | No |
mllmPredictInfo | object | null | MLLM invocation result. Provided by the invokeMLLM operation. | No |
mllmIntegrationStrategy | string | null | Please refer to the description of the mllm_integration_strategy parameter of the pipeline object's chat method. | No |
chatBotConfig | object | null | Please refer to the description of the chat_bot_config parameter of the pipeline object's chat method. | No |
retrieverConfig | object | null | Please refer to the description of the retriever_config parameter of the pipeline object's chat method. | No |
- When the request is successfully processed, the result of the response body has the following properties:
Name | Type | Meaning |
---|---|---|
chatResult | object | Key information extraction result. |
Multi-language Service Invocation Examples
Python
# This script only shows the use case for images. For calling with other file types, please read the API reference and make adjustments.
import base64
import pprint
import sys
import requests
API_BASE_URL = "http://0.0.0.0:8080"
image_path = "./demo.jpg"
keys = ["name"]
with open(image_path, "rb") as file:
    image_bytes = file.read()
    image_data = base64.b64encode(image_bytes).decode("ascii")

payload = {
    "file": image_data,
    "fileType": 1,
}
resp_visual = requests.post(url=f"{API_BASE_URL}/chatocr-visual", json=payload)
if resp_visual.status_code != 200:
    print(
        f"Request to chatocr-visual failed with status code {resp_visual.status_code}."
    )
    pprint.pp(resp_visual.json())
    sys.exit(1)
result_visual = resp_visual.json()["result"]

for i, res in enumerate(result_visual["layoutParsingResults"]):
    print(res["prunedResult"])
    for img_name, img in res["outputImages"].items():
        img_path = f"{img_name}_{i}.jpg"
        with open(img_path, "wb") as f:
            f.write(base64.b64decode(img))
        print(f"Output image saved at {img_path}")

payload = {
    "visualInfo": result_visual["visualInfo"],
}
resp_vector = requests.post(url=f"{API_BASE_URL}/chatocr-vector", json=payload)
if resp_vector.status_code != 200:
    print(
        f"Request to chatocr-vector failed with status code {resp_vector.status_code}."
    )
    pprint.pp(resp_vector.json())
    sys.exit(1)
result_vector = resp_vector.json()["result"]

payload = {
    "image": image_data,
    "keyList": keys,
}
resp_mllm = requests.post(url=f"{API_BASE_URL}/chatocr-mllm", json=payload)
if resp_mllm.status_code != 200:
    print(
        f"Request to chatocr-mllm failed with status code {resp_mllm.status_code}."
    )
    pprint.pp(resp_mllm.json())
    sys.exit(1)
result_mllm = resp_mllm.json()["result"]

payload = {
    "keyList": keys,
    "visualInfo": result_visual["visualInfo"],
    "useVectorRetrieval": True,
    "vectorInfo": result_vector["vectorInfo"],
    "mllmPredictInfo": result_mllm["mllmPredictInfo"],
}
resp_chat = requests.post(url=f"{API_BASE_URL}/chatocr-chat", json=payload)
if resp_chat.status_code != 200:
    print(
        f"Request to chatocr-chat failed with status code {resp_chat.status_code}."
    )
    pprint.pp(resp_chat.json())
    sys.exit(1)
result_chat = resp_chat.json()["result"]
print("Final result:")
print(result_chat["chatResult"])
4. Custom Development¶
If the default model weights provided by the PP-ChatOCRv4 pipeline do not meet your requirements in terms of accuracy or speed, you can try to fine-tune the existing model using your own domain-specific or application-specific data to improve the recognition performance of the PP-ChatOCRv4 pipeline in your scenario.
4.1 Model Fine-Tuning¶
Since the PP-ChatOCRv4 pipeline includes several modules, unsatisfactory pipeline performance may originate from any one of them. You can analyze the cases with poor extraction results, use the visualized images to identify which module is at fault, and then follow the corresponding fine-tuning tutorial link in the table below to fine-tune that model.
Scenario | Fine-tuning Module | Fine-tuning Reference Link |
---|---|---|
Inaccurate layout region detection, such as missed detection of seals, tables, etc. | Layout Region Detection Module | Link |
Inaccurate table structure recognition | Table Structure Recognition Module | Link |
Missed detection of seal text | Seal Text Detection Module | Link |
Missed detection of text | Text Detection Module | Link |
Inaccurate text content | Text Recognition Module | Link |
Inaccurate correction of vertical or rotated text lines | Text Line Orientation Classification Module | Link |
Inaccurate correction of whole-image rotation | Document Image Orientation Classification Module | Link |
Inaccurate correction of image distortion | Text Image Correction Module | Fine-tuning not supported |
4.2 Model Application¶
After you complete fine-tuning with your private dataset, you will obtain a local model weight file.
If you need to use the fine-tuned model weights, simply modify the pipeline configuration file by filling in the local path of the fine-tuned model weights at the corresponding position:
- Exporting Pipeline Configuration Files
You can call the `export_paddlex_config_to_yaml` method of the pipeline object to export the current pipeline configuration to a YAML file. Here is an example:
from paddleocr import PPChatOCRv4Doc
pipeline = PPChatOCRv4Doc()
pipeline.export_paddlex_config_to_yaml("PP-ChatOCRv4.yaml")
- Editing Pipeline Configuration Files
Fill in the local directory of the fine-tuned model weights at the corresponding position in the pipeline configuration file. For example:
......
SubModules:
  TextDetection:
    module_name: text_detection
    model_name: PP-OCRv5_server_det
    model_dir: null # Replace with the fine-tuned text detection model weights directory
    limit_side_len: 960
    limit_type: max
    thresh: 0.3
    box_thresh: 0.6
    unclip_ratio: 1.5
  TextRecognition:
    module_name: text_recognition
    model_name: PP-OCRv5_server_rec
    model_dir: null # Replace with the fine-tuned text recognition model weights directory
    batch_size: 1
    score_thresh: 0
......
The exported PaddleX pipeline configuration file not only includes parameters supported by PaddleOCR's CLI and Python API but also allows for more advanced settings. Please refer to the corresponding pipeline usage tutorials in PaddleX Pipeline Usage Overview for detailed instructions on adjusting various configurations according to your needs.
- Loading Pipeline Configuration Files in CLI
By specifying the path to the PaddleX pipeline configuration file using the `--paddlex_config` parameter, PaddleOCR will read its contents as the configuration for inference. Here is an example:
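A minimal sketch, reusing the test image and key from Section 2.1 together with the exported configuration file:

paddleocr pp_chatocrv4_doc -i vehicle_certificate-1.png -k 驾驶室准乘人数 --qianfan_api_key your_api_key --paddlex_config PP-ChatOCRv4.yaml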
- Loading Pipeline Configuration Files in Python API
When initializing the pipeline object, you can pass the path to the PaddleX pipeline configuration file or a configuration dictionary through the `paddlex_config` parameter, and PaddleOCR will use it as the configuration for inference. Here is an example:
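A minimal sketch, assuming the YAML exported above sits in the working directory:

from paddleocr import PPChatOCRv4Doc

pipeline = PPChatOCRv4Doc(paddlex_config="PP-ChatOCRv4.yaml")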