# PP-ChatOCRv3-doc Pipeline Tutorial
## 1. Introduction to PP-ChatOCRv3-doc Pipeline
PP-ChatOCRv3-doc is a unique intelligent analysis solution for documents and images developed by PaddlePaddle. It combines Large Language Models (LLM) and OCR technology to provide a one-stop solution for complex document information extraction challenges such as layout analysis, rare characters, multi-page PDFs, tables, and seal recognition. By integrating with ERNIE Bot, it fuses massive data and knowledge to achieve high accuracy and wide applicability.
The PP-ChatOCRv3-doc pipeline includes modules for Table Structure Recognition, Layout Region Detection, Text Detection, Text Recognition, Seal Text Detection, Text Image Rectification, and Document Image Orientation Classification.
Choose a model according to your priorities: higher accuracy, faster inference speed, or smaller storage size. Benchmarks for these models are as follows:
👉Model List Details
Table Structure Recognition Module Models:
| Model | Model Download Link | Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| SLANet | Inference Model / Training Model | 59.52 | 103.08 / 103.08 | 197.99 / 197.99 | 6.9 | SLANet is a table structure recognition model developed by the Baidu PaddleX team. It significantly improves the accuracy and inference speed of table structure recognition by adopting the CPU-friendly lightweight backbone PP-LCNet, the high-/low-level feature fusion module CSP-PAN, and the SLA Head feature decoding module, which aligns structural and positional information. |
| SLANet_plus | Inference Model / Training Model | 63.69 | 140.29 / 140.29 | 195.39 / 195.39 | 6.9 | SLANet_plus is an enhanced version of SLANet. Compared with SLANet, it significantly improves the recognition of borderless and complex tables and reduces sensitivity to table localization accuracy, so tables are recognized correctly even when their detected positions are offset. |
Layout Detection Module Models:
| Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PicoDet_layout_1x | Inference Model / Training Model | 86.8 | 9.03 / 3.10 | 25.82 / 20.70 | 7.4 | An efficient layout-region localization model trained on the PubLayNet dataset based on PicoDet-1x; it locates five region types: text, titles, tables, images, and lists. |
| PicoDet_layout_1x_table | Inference Model / Training Model | 95.7 | 8.02 / 3.09 | 23.70 / 20.41 | 7.4 | An efficient layout-region localization model trained on the PubLayNet dataset based on PicoDet-1x; it locates tables only. |
| PicoDet-S_layout_3cls | Inference Model / Training Model | 87.1 | 8.99 / 2.22 | 16.11 / 8.73 | 4.8 | A highly efficient layout-region localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports, based on PicoDet-S; it covers three categories: tables, images, and seals. |
| PicoDet-S_layout_17cls | Inference Model / Training Model | 70.3 | 9.11 / 2.12 | 15.42 / 9.12 | 4.8 | A highly efficient layout-region localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports, based on PicoDet-S; it covers 17 common layout categories: paragraph titles, images, text, numbers, abstracts, content, chart titles, formulas, tables, table titles, references, document titles, footnotes, headers, algorithms, footers, and seals. |
| PicoDet-L_layout_3cls | Inference Model / Training Model | 89.3 | 13.05 / 4.50 | 41.30 / 41.30 | 22.6 | An efficient layout-region localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports, based on PicoDet-L; it covers three categories: tables, images, and seals. |
| PicoDet-L_layout_17cls | Inference Model / Training Model | 79.9 | 13.50 / 4.69 | 43.32 / 43.32 | 22.6 | An efficient layout-region localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports, based on PicoDet-L; it covers the same 17 common layout categories listed above. |
| RT-DETR-H_layout_3cls | Inference Model / Training Model | 95.9 | 114.93 / 27.71 | 947.56 / 947.56 | 470.1 | A high-precision layout-region localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports, based on RT-DETR-H; it covers three categories: tables, images, and seals. |
| RT-DETR-H_layout_17cls | Inference Model / Training Model | 92.6 | 115.29 / 104.09 | 995.27 / 995.27 | 470.2 | A high-precision layout-region localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports, based on RT-DETR-H; it covers the same 17 common layout categories listed above. |
Text Detection Module Models:
| Model | Model Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv4_server_det | Inference Model / Training Model | 82.69 | 83.34 / 80.91 | 442.58 / 442.58 | 109 | PP-OCRv4's server-side text detection model, featuring higher accuracy, suitable for deployment on high-performance servers |
| PP-OCRv4_mobile_det | Inference Model / Training Model | 77.79 | 8.79 / 3.13 | 51.00 / 28.58 | 4.7 | PP-OCRv4's mobile text detection model, optimized for efficiency, suitable for deployment on edge devices |
Text Recognition Module Models:
| Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv4_mobile_rec | Inference Model / Training Model | 78.20 | 4.82 / 4.82 | 16.74 / 4.64 | 10.6 | PP-OCRv4 is the next-generation version of PaddlePaddle's self-developed text recognition model PP-OCRv3. By introducing data augmentation schemes and a GTC-NRTR guidance branch, it further improves recognition accuracy without compromising inference speed. Server and mobile versions are offered to meet industrial needs in different scenarios. |
| PP-OCRv4_server_rec | Inference Model / Training Model | 79.20 | 6.58 / 6.58 | 33.17 / 33.17 | 71.2 | The server-side version of PP-OCRv4, offering higher accuracy and suited to deployment on high-performance servers. |
| Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| ch_SVTRv2_rec | Inference Model / Training Model | 68.81 | 8.08 / 8.08 | 50.17 / 42.50 | 73.9 | SVTRv2 is a server-side text recognition model developed by the OpenOCR team at Fudan University's Vision and Learning Lab (FVL). It won first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, improving end-to-end recognition accuracy on the A-list by 6% over PP-OCRv4. |
| Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| ch_RepSVTR_rec | Inference Model / Training Model | 65.07 | 5.93 / 5.93 | 20.73 / 7.32 | 22.1 | RepSVTR is a mobile-oriented text recognition model based on SVTRv2. It won first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, improving end-to-end recognition accuracy on the B-list by 2.5% over PP-OCRv4 while maintaining a similar inference speed. |
Seal Text Detection Module Models:
| Model | Model Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv4_server_seal_det | Inference Model / Training Model | 98.21 | 74.75 / 67.72 | 382.55 / 382.55 | 109 | PP-OCRv4's server-side seal text detection model, featuring higher accuracy, suitable for deployment on better-equipped servers |
| PP-OCRv4_mobile_seal_det | Inference Model / Training Model | 96.47 | 7.82 / 3.09 | 48.28 / 23.97 | 4.6 | PP-OCRv4's mobile seal text detection model, offering higher efficiency, suitable for deployment on edge devices |
Text Image Rectification Module Models:
| Model | Model Download Link | MS-SSIM (%) | Model Size (M) | Description |
|---|---|---|---|---|
| UVDoc | Inference Model / Training Model | 54.40 | 30.3 | High-precision text image rectification model |

The accuracy metrics of the models are measured on the DocUNet benchmark.
Document Image Orientation Classification Module Models:
| Model | Model Download Link | Top-1 Acc (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-LCNet_x1_0_doc_ori | Inference Model / Training Model | 99.06 | 2.31 / 0.43 | 3.37 / 1.27 | 7 | A document image orientation classification model based on PP-LCNet_x1_0, covering four categories: 0°, 90°, 180°, and 270° |
- Performance Test Environment
    - Test Dataset:
        - Table Structure Recognition Model: PaddleX's internally built English table recognition dataset.
        - Layout Detection Model: PaddleOCR's self-built layout analysis dataset, containing 10,000 images of common document types such as Chinese and English papers, magazines, and research reports.
        - Text Detection Model: PaddleOCR's self-built Chinese dataset, covering multiple scenarios including street scenes, web images, documents, and handwriting, with 500 images for detection.
        - Text Recognition Model: PaddleOCR's self-built Chinese dataset, covering multiple scenarios including street scenes, web images, documents, and handwriting, with 11,000 images for text recognition.
        - ch_SVTRv2_rec: A-list evaluation set of Task 1 (OCR End-to-End Recognition) of the PaddleOCR Algorithm Model Challenge.
        - ch_RepSVTR_rec: B-list evaluation set of Task 1 (OCR End-to-End Recognition) of the PaddleOCR Algorithm Model Challenge.
        - English Recognition Model: PaddleX's self-built English dataset.
        - Multilingual Recognition Model: PaddleX's self-built multilingual dataset.
        - Text Line Orientation Classification Model: PaddleX's self-built dataset, covering scenarios such as certificates and documents, containing 1,000 images.
        - Text Image Rectification Model: DocUNet.
    - Hardware Configuration:
        - GPU: NVIDIA Tesla T4
        - CPU: Intel Xeon Gold 6271C @ 2.60 GHz
        - Other Environments: Ubuntu 20.04 / cuDNN 8.6 / TensorRT 8.5.2.2
- Inference Mode Description
| Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination |
|---|---|---|---|
| Normal Mode | FP32 precision / no TRT acceleration | FP32 precision / 8 threads | PaddleInference |
| High-Performance Mode | Optimal combination of pre-selected precision types and acceleration strategies | FP32 precision / 8 threads | Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.) |
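The high-performance mode in the table corresponds to PaddleX's high-performance inference plugin. As a minimal sketch (assuming the plugin is installed; `use_hpip` is documented in the `create_pipeline` parameter table in Section 2.2), it can be enabled when creating the pipeline:

```python
from paddlex import create_pipeline

# A minimal sketch: enable the high-performance inference plugin.
# Requires the plugin to be installed separately; when use_hpip is
# None, the setting from the pipeline configuration file is used.
pipeline = create_pipeline(
    pipeline="PP-ChatOCRv3-doc",
    device="gpu:0",
    use_hpip=True,
)
```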
## 2. Quick Start
The pre-trained model pipelines provided by PaddleX let you try out their effects quickly. You can experience the Document Scene Information Extraction v3 pipeline online, or use Python to experience it locally.
### 2.1 Online Experience
You can try the Document Scene Information Extraction v3 pipeline online using the official demo images.
If you are satisfied with the pipeline's performance, you can directly integrate and deploy it. If not, you can also use private data to fine-tune the models in the pipeline online.
### 2.2 Local Experience
Before using the Document Scene Information Extraction v3 pipeline locally, ensure that you have installed the PaddleX wheel package according to the PaddleX Local Installation Guide. If you wish to install dependencies selectively, refer to the relevant instructions in the installation guide; the dependency group corresponding to this pipeline is `ie`.
Before performing model inference, you need to prepare the API key for the large language model. PP-ChatOCRv3 supports calling the large model inference service provided by the Baidu Cloud Qianfan Platform. You can refer to Authentication and Authorization to obtain the API key from the Qianfan Platform.
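To avoid hard-coding the key in scripts such as the one below, you might read it from an environment variable first; the variable name `QIANFAN_API_KEY` here is purely illustrative, not something PaddleX reads itself:

```python
import os

# Illustrative convention: keep the Qianfan API key out of source code.
api_key = os.environ.get("QIANFAN_API_KEY")
if not api_key:
    raise RuntimeError("Set QIANFAN_API_KEY before running this example.")
```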
After obtaining the API key, you can complete quick inference with a few lines of Python code. You can use the test file for testing:
```python
from paddlex import create_pipeline

# LLM used for key information extraction.
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

# Embedding model used for vector retrieval.
retriever_config = {
    "module_name": "retriever",
    "model_name": "embedding-v1",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "qianfan",
    "api_key": "api_key",  # your api_key
}

pipeline = create_pipeline(pipeline="PP-ChatOCRv3-doc", initial_predictor=False)

# Step 1: visual analysis (layout, OCR, tables, seals).
visual_predict_res = pipeline.visual_predict(
    input="vehicle_certificate-1.png",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_common_ocr=True,
    use_seal_recognition=True,
    use_table_recognition=True,
)

visual_info_list = []
for res in visual_predict_res:
    visual_info_list.append(res["visual_info"])
    layout_parsing_result = res["layout_parsing_result"]

# Step 2: build a vector index over the extracted text.
vector_info = pipeline.build_vector(
    visual_info_list,
    flag_save_bytes_vector=True,
    retriever_config=retriever_config,
)

# Step 3: extract key information with the LLM.
chat_result = pipeline.chat(
    key_list=["驾驶室准乘人数"],  # "approved cab passenger capacity"
    visual_info=visual_info_list,
    vector_info=vector_info,
    chat_bot_config=chat_bot_config,
    retriever_config=retriever_config,
)
print(chat_result)
```
After running, the output will be as follows:
The prediction process, API descriptions, and output descriptions of PP-ChatOCRv3-doc are as follows:
(1) Call the `create_pipeline` method to instantiate the PP-ChatOCRv3-doc pipeline object. The relevant parameter descriptions are as follows:
| Parameter | Parameter Description | Parameter Type | Default Value |
|---|---|---|---|
| `pipeline` | The name of the pipeline or the path to the pipeline configuration file. If it is a pipeline name, it must be a pipeline supported by PaddleX. | `str` | `None` |
| `config` | Specific configuration information for the pipeline (if set together with `pipeline`, it takes priority over `pipeline`, and the pipeline name must be consistent). | `dict[str, Any]` | `None` |
| `device` | The device for pipeline inference. Supports specific GPU card numbers such as "gpu:0", card numbers for other hardware such as "npu:0", and "cpu". | `str` | `gpu` |
| `use_hpip` | Whether to enable the high-performance inference plugin. If set to `None`, the setting from the configuration file or `config` is used. | `bool \| None` | `None` |
| `hpi_config` | High-performance inference configuration. | `dict \| None` | `None` |
| `initial_predictor` | Whether to initialize the inference modules immediately (if `False`, a module is initialized when it is first used). | `bool` | `True` |
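For instance, a minimal sketch of instantiating the pipeline on a specific device with deferred module initialization (the configuration-file path is illustrative):

```python
from paddlex import create_pipeline

# "./PP-ChatOCRv3-doc.yaml" is an illustrative path to a pipeline
# configuration file; a supported pipeline name works equally well.
pipeline = create_pipeline(
    pipeline="./PP-ChatOCRv3-doc.yaml",
    device="gpu:0",           # or "cpu", "npu:0", ...
    initial_predictor=False,  # initialize each module on first use
)
```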
(2) Call the `visual_predict()` method of the PP-ChatOCRv3-doc pipeline object to obtain visual prediction results. This method returns a generator. The parameters of the `visual_predict()` method and their descriptions are as follows:
| Parameter | Parameter Description | Parameter Type | Default Value |
|---|---|---|---|
| `input` | The data to be predicted; supports multiple input types. Required. | `Python Var \| str \| list` | `None` |
| `device` | The device for pipeline inference. | `str \| None` | `None` |
| `use_doc_orientation_classify` | Whether to use the document orientation classification module. | `bool \| None` | `None` |
| `use_doc_unwarping` | Whether to use the document distortion correction module. | `bool \| None` | `None` |
| `use_textline_orientation` | Whether to use the text line orientation classification module. | `bool \| None` | `None` |
| `use_general_ocr` | Whether to use the OCR sub-pipeline. | `bool \| None` | `None` |
| `use_seal_recognition` | Whether to use the seal recognition sub-pipeline. | `bool \| None` | `None` |
| `use_table_recognition` | Whether to use the table recognition sub-pipeline. | `bool \| None` | `None` |
| `layout_threshold` | The score threshold for the layout model. | `float \| dict \| None` | `None` |
| `layout_nms` | Whether to use NMS. | `bool \| None` | `None` |
| `layout_unclip_ratio` | The expansion coefficient for layout detection boxes. | `float \| Tuple[float,float] \| dict \| None` | `None` |
| `layout_merge_bboxes_mode` | The filtering method for overlapping boxes. | `str \| dict \| None` | `None` |
| `text_det_limit_side_len` | The side-length limit for text detection images. | `int \| None` | `None` |
| `text_det_limit_type` | The type of the side-length limit for text detection images. | `str \| None` | `None` |
| `text_det_thresh` | The detection pixel threshold: pixels with scores above this threshold in the output probability map are considered text pixels. | `float \| None` | `None` |
| `text_det_box_thresh` | The detection box threshold: a detection result is considered a text region if the average score of all pixels within its border exceeds this threshold. | `float \| None` | `None` |
| `text_det_unclip_ratio` | The text detection expansion coefficient: the text region is expanded by this ratio, and larger values produce larger expanded areas. | `float \| None` | `None` |
| `text_rec_score_thresh` | The text recognition threshold: text results with scores above this threshold are retained. | `float \| None` | `None` |
| `seal_det_limit_side_len` | The side-length limit for seal detection images. | `int \| None` | `None` |
| `seal_det_limit_type` | The type of the side-length limit for seal detection images. | `str \| None` | `None` |
| `seal_det_thresh` | The detection pixel threshold: pixels with scores above this threshold in the output probability map are considered seal pixels. | `float \| None` | `None` |
| `seal_det_box_thresh` | The detection box threshold: a detection result is considered a seal region if the average score of all pixels within its border exceeds this threshold. | `float \| None` | `None` |
| `seal_det_unclip_ratio` | The seal detection expansion coefficient: the seal region is expanded by this ratio, and larger values produce larger expanded areas. | `float \| None` | `None` |
| `seal_rec_score_thresh` | The seal recognition threshold: text results with scores above this threshold are retained. | `float \| None` | `None` |
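As a sketch of how these parameters compose (the threshold values are illustrative, not tuned recommendations; unset parameters fall back to the pipeline configuration):

```python
# Illustrative threshold values only.
visual_predict_res = pipeline.visual_predict(
    input="vehicle_certificate-1.png",
    use_seal_recognition=True,
    use_table_recognition=True,
    layout_threshold=0.5,       # drop low-score layout boxes
    text_det_thresh=0.3,        # pixel-level text probability threshold
    text_det_box_thresh=0.6,    # box-level average-score threshold
    text_rec_score_thresh=0.0,  # keep all recognized text lines
)
```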
(3) Process the visual prediction results.

The prediction result for each sample is a `dict` containing two fields: `visual_info` and `layout_parsing_result`. `visual_info` holds the visual information (including `normal_text_dict`, `table_text_list`, `table_html_list`, etc.); append it for each sample to the `visual_info_list` list, which is later sent to the large language model. `layout_parsing_result` holds the layout parsing results, i.e. the tables, text, images, etc. contained in the file or image, and supports printing, saving as an image, and saving as a `json` file:
```python
......
for res in visual_predict_res:
    visual_info_list.append(res["visual_info"])
    layout_parsing_result = res["layout_parsing_result"]
    layout_parsing_result.print()
    layout_parsing_result.save_to_img("./output")
    layout_parsing_result.save_to_json("./output")
    layout_parsing_result.save_to_xlsx("./output")
    layout_parsing_result.save_to_html("./output")
......
```
| Method | Method Description | Parameter | Parameter Type | Parameter Description | Default Value |
|---|---|---|---|---|---|
| `print()` | Prints the result to the terminal | `format_json` | `bool` | Whether to format the output content with JSON indentation | `True` |
| | | `indent` | `int` | The indentation level used to beautify the JSON output for better readability; only valid when `format_json` is `True` | 4 |
| | | `ensure_ascii` | `bool` | Whether to escape non-ASCII characters to Unicode. `True` escapes all non-ASCII characters; `False` keeps the original characters. Only valid when `format_json` is `True` | `False` |
| `save_to_json()` | Saves the result as a JSON file | `save_path` | `str` | The save path; when it is a directory, the saved file is named consistently with the input file | N/A |
| | | `indent` | `int` | The indentation level used to beautify the JSON output for better readability; only valid when `format_json` is `True` | 4 |
| | | `ensure_ascii` | `bool` | Whether to escape non-ASCII characters to Unicode. `True` escapes all non-ASCII characters; `False` keeps the original characters. Only valid when `format_json` is `True` | `False` |
| `save_to_img()` | Saves the visualized images of each module as PNG files | `save_path` | `str` | The save path; supports a directory or a file path | N/A |
| `save_to_html()` | Saves the tables in the file as an HTML file | `save_path` | `str` | The save path; supports a directory or a file path | N/A |
| `save_to_xlsx()` | Saves the tables in the file as an XLSX file | `save_path` | `str` | The save path; supports a directory or a file path | N/A |
| Attribute | Attribute Description |
|---|---|
| `json` | Gets the prediction result in JSON format |
| `img` | Gets the visualized images as a `dict` |
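For example, a sketch of reading these attributes directly (this assumes the values in the `img` dict are image objects exposing a PIL-style `save()` method):

```python
# `json` returns the prediction result as JSON-serializable data.
data = layout_parsing_result.json

# `img` returns visualized images keyed by name; saving here assumes
# PIL-style image objects.
for name, image in layout_parsing_result.img.items():
    image.save(f"./output/{name}.png")
```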
(4) Call the `build_vector()` method of the PP-ChatOCRv3-doc pipeline object to construct vectors for the text content. The parameters of the `build_vector()` method and their descriptions are as follows:
| Parameter | Parameter Description | Parameter Type | Options | Default Value |
|---|---|---|---|---|
| `visual_info` | Visual information; a dictionary containing visual information, or a list of such dictionaries | `list \| dict` | `None` | `None` |
| `min_characters` | Minimum number of characters | `int` | A positive integer, set according to the token length the large language model supports | `3500` |
| `block_size` | Block size used when building the vector index for long texts | `int` | A positive integer, set according to the token length the large language model supports | `300` |
| `flag_save_bytes_vector` | Whether to save the text as a binary file | `bool` | `True \| False` | `False` |
| `retriever_config` | Configuration parameters for the vector retrieval model; refer to the `LLM_Retriever` field in the pipeline configuration file | `dict` | `None` | `None` |
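A sketch of an explicit call (the values shown are the documented defaults, spelled out for clarity):

```python
# block_size trades retrieval granularity against the embedding model's
# token budget; min_characters gates vector retrieval for short documents.
vector_info = pipeline.build_vector(
    visual_info_list,
    min_characters=3500,
    block_size=300,
    flag_save_bytes_vector=True,
    retriever_config=retriever_config,
)
```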
(5) Call the `chat()` method of the PP-ChatOCRv3-doc pipeline object to extract key information. The parameters of the `chat()` method and their descriptions are as follows:
| Parameter | Parameter Description | Parameter Type | Options | Default Value |
|---|---|---|---|---|
| `key_list` | A single key or a list of keys used to extract information | `Union[str, List[str]]` | `None` | `None` |
| `visual_info` | Visual information results | `List[dict]` | `None` | `None` |
| `use_vector_retrieval` | Whether to use vector retrieval | `bool` | `True \| False` | `True` |
| `vector_info` | Vector information used for retrieval | `dict` | `None` | `None` |
| `min_characters` | Required minimum number of characters | `int` | A positive integer | `3500` |
| `text_task_description` | Description of the text task | `str` | `None` | `None` |
| `text_output_format` | Output format of text results | `str` | `None` | `None` |
| `text_rules_str` | Rules for generating text results | `str` | `None` | `None` |
| `text_few_shot_demo_text_content` | Text content for few-shot demonstrations | `str` | `None` | `None` |
| `text_few_shot_demo_key_value_list` | Key-value list for few-shot demonstrations | `str` | `None` | `None` |
| `table_task_description` | Description of the table task | `str` | `None` | `None` |
| `table_output_format` | Output format of table results | `str` | `None` | `None` |
| `table_rules_str` | Rules for generating table results | `str` | `None` | `None` |
| `table_few_shot_demo_text_content` | Text content for table few-shot demonstrations | `str` | `None` | `None` |
| `table_few_shot_demo_key_value_list` | Key-value list for table few-shot demonstrations | `str` | `None` | `None` |
| `chat_bot_config` | Configuration information for the large language model; refer to the `LLM_Chat` field in the pipeline configuration file | `dict` | `None` | `None` |
| `retriever_config` | Configuration parameters for the vector retrieval model; refer to the `LLM_Retriever` field in the pipeline configuration file | `dict` | `None` | `None` |
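A sketch of steering the extraction with the prompt-related parameters (the rule and few-shot strings are illustrative placeholders, not prescribed formats):

```python
# Illustrative prompt-steering values.
chat_result = pipeline.chat(
    key_list=["Name", "Date of issue"],
    visual_info=visual_info_list,
    use_vector_retrieval=True,
    vector_info=vector_info,
    text_rules_str="Answer with the exact text from the document; return null if absent.",
    text_few_shot_demo_text_content="Holder: Alice Chen\nIssued: 2023-05-01",
    text_few_shot_demo_key_value_list='{"Name": "Alice Chen", "Date of issue": "2023-05-01"}',
    chat_bot_config=chat_bot_config,
    retriever_config=retriever_config,
)
print(chat_result)
```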
## 3. Development Integration/Deployment
If the pipeline meets your requirements for inference speed and accuracy in production, you can proceed directly with development integration/deployment.
If you need to apply the pipeline directly in your Python project, you can refer to the sample code in 2.2 Local Experience.
Additionally, PaddleX provides three other deployment methods, detailed as follows:
🚀 High-Performance Inference: In actual production environments, many applications have stringent standards for the performance metrics of deployment strategies (especially response speed) to ensure efficient system operation and smooth user experience. To this end, PaddleX provides a high-performance inference plugin designed to deeply optimize model inference and pre/post-processing, achieving significant speedups in the end-to-end process. For detailed instructions on high-performance inference, please refer to the PaddleX High-Performance Inference Guide.
☁️ Serving: Serving is a common deployment form in actual production environments. By encapsulating the inference functionality as a service, clients can access these services through network requests to obtain inference results. PaddleX supports multiple serving solutions for pipelines. For detailed instructions on serving, please refer to the PaddleX Serving Guide.
Below are the API references for basic serving and multi-language service invocation examples:
API Reference
For the main operations provided by the service:

- The HTTP request method is POST.
- Both the request body and the response body are JSON data (JSON objects).
- When the request is processed successfully, the response status code is `200`, and the response body has the following attributes:
| Name | Type | Meaning |
|---|---|---|
| `logId` | `string` | UUID of the request. |
| `errorCode` | `integer` | Error code. Fixed as `0`. |
| `errorMsg` | `string` | Error description. Fixed as `"Success"`. |
| `result` | `object` | Operation result. |
- When the request is not successfully processed, the response body has the following attributes:
| Name | Type | Meaning |
|---|---|---|
| `logId` | `string` | UUID of the request. |
| `errorCode` | `integer` | Error code. Same as the response status code. |
| `errorMsg` | `string` | Error description. |
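A client can rely on this convention with a check like the following sketch (the endpoint and payload are illustrative; see the full example further below):

```python
import requests

# Illustrative request against the documented response convention.
resp = requests.post(
    "http://0.0.0.0:8080/chatocr-visual",
    json={"file": "<base64-encoded file>", "fileType": 1},
)
body = resp.json()
if resp.status_code == 200 and body["errorCode"] == 0:
    result = body["result"]  # operation-specific payload
else:
    print(f"{body['errorCode']}: {body['errorMsg']}")  # error details
```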
The main operations provided by the service are as follows:
analyzeImages
Uses computer vision models to analyze images, obtain OCR, table recognition results, etc., and extract key information from the images.
POST /chatocr-visual
- Attributes of the request body:
| Name | Type | Meaning | Required |
|---|---|---|---|
| `file` | `string` | URL of an image or PDF file accessible to the server, or the Base64-encoded content of such a file. By default, for PDF files exceeding 10 pages, only the first 10 pages are processed. To remove the page limit, add the corresponding setting to the pipeline configuration file. | Yes |
| `fileType` | `integer \| null` | File type. `0` represents a PDF file, `1` represents an image file. If this attribute is absent from the request body, the file type is inferred from the URL. | No |
| `useDocOrientationClassify` | `boolean \| null` | See the description of the `use_doc_orientation_classify` parameter of the pipeline object's `visual_predict` method. | No |
| `useDocUnwarping` | `boolean \| null` | See the description of the `use_doc_unwarping` parameter of the pipeline object's `visual_predict` method. | No |
| `useSealRecognition` | `boolean \| null` | See the description of the `use_seal_recognition` parameter of the pipeline object's `visual_predict` method. | No |
| `useTableRecognition` | `boolean \| null` | See the description of the `use_table_recognition` parameter of the pipeline object's `visual_predict` method. | No |
| `layoutThreshold` | `number \| null` | See the description of the `layout_threshold` parameter of the pipeline object's `visual_predict` method. | No |
| `layoutNms` | `boolean \| null` | See the description of the `layout_nms` parameter of the pipeline object's `visual_predict` method. | No |
| `layoutUnclipRatio` | `number \| array \| object \| null` | See the description of the `layout_unclip_ratio` parameter of the pipeline object's `visual_predict` method. | No |
| `layoutMergeBboxesMode` | `string \| object \| null` | See the description of the `layout_merge_bboxes_mode` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetLimitSideLen` | `integer \| null` | See the description of the `text_det_limit_side_len` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetLimitType` | `string \| null` | See the description of the `text_det_limit_type` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetThresh` | `number \| null` | See the description of the `text_det_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetBoxThresh` | `number \| null` | See the description of the `text_det_box_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `textDetUnclipRatio` | `number \| null` | See the description of the `text_det_unclip_ratio` parameter of the pipeline object's `visual_predict` method. | No |
| `textRecScoreThresh` | `number \| null` | See the description of the `text_rec_score_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetLimitSideLen` | `integer \| null` | See the description of the `seal_det_limit_side_len` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetLimitType` | `string \| null` | See the description of the `seal_det_limit_type` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetThresh` | `number \| null` | See the description of the `seal_det_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetBoxThresh` | `number \| null` | See the description of the `seal_det_box_thresh` parameter of the pipeline object's `visual_predict` method. | No |
| `sealDetUnclipRatio` | `number \| null` | See the description of the `seal_det_unclip_ratio` parameter of the pipeline object's `visual_predict` method. | No |
| `sealRecScoreThresh` | `number \| null` | See the description of the `seal_rec_score_thresh` parameter of the pipeline object's `visual_predict` method. | No |
- When the request is processed successfully, the `result` of the response body has the following attributes:

| Name | Type | Meaning |
|---|---|---|
| `layoutParsingResults` | `array` | Analysis results obtained with the computer vision models. The array length is 1 for image input, or the number of document pages actually processed for PDF input; in the latter case each element corresponds to one processed page. |
| `visualInfo` | `array` | Key information from the images, which can be used as input to other operations. |
| `dataInfo` | `object` | Input data information. |
Each element in `layoutParsingResults` is an `object` with the following attributes:

| Name | Type | Meaning |
|---|---|---|
| `prunedResult` | `object` | A simplified version of the `res` field in the JSON representation of the result generated by the pipeline's `visual_predict` method, with the `input_path` and `page_index` fields removed. |
| `outputImages` | `object \| null` | See the description of the `img` attribute of the pipeline's visual prediction result. |
| `inputImage` | `string \| null` | The input image, in JPEG format and Base64-encoded. |
buildVectorStore
Builds a vector database.
POST /chatocr-vector
- Attributes of the request body:
| Name | Type | Meaning | Required |
|---|---|---|---|
| `visualInfo` | `array` | Key information from the images. Provided by the `analyzeImages` operation. | Yes |
| `minCharacters` | `integer \| null` | Minimum data length required to enable the vector database. | No |
| `blockSize` | `integer \| null` | See the description of the `block_size` parameter of the pipeline object's `build_vector` method. | No |
| `retrieverConfig` | `object \| null` | See the description of the `retriever_config` parameter of the pipeline object's `build_vector` method. | No |
- When the request is processed successfully, the `result` of the response body has the following attributes:

| Name | Type | Meaning |
|---|---|---|
| `vectorInfo` | `object` | Serialized result of the vector database, which can be used as input to other operations. |
chat
Interacts with large language models to extract key information using them.
POST /chatocr-chat
- Attributes of the request body:
| Name | Type | Meaning | Required |
|---|---|---|---|
| `keyList` | `array` | List of keys. | Yes |
| `visualInfo` | `object` | Key information from the images. Provided by the `analyzeImages` operation. | Yes |
| `useVectorRetrieval` | `boolean \| null` | See the description of the `use_vector_retrieval` parameter of the pipeline object's `chat` method. | No |
| `vectorInfo` | `object \| null` | Serialized result of the vector database. Provided by the `buildVectorStore` operation. Note that deserialization involves an unpickle operation; to prevent malicious attacks, only use data from trusted sources. | No |
| `minCharacters` | `integer` | Minimum data length required to enable the vector database. | No |
| `textTaskDescription` | `string \| null` | See the description of the `text_task_description` parameter of the pipeline object's `chat` method. | No |
| `textOutputFormat` | `string \| null` | See the description of the `text_output_format` parameter of the pipeline object's `chat` method. | No |
| `textRulesStr` | `string \| null` | See the description of the `text_rules_str` parameter of the pipeline object's `chat` method. | No |
| `textFewShotDemoTextContent` | `string \| null` | See the description of the `text_few_shot_demo_text_content` parameter of the pipeline object's `chat` method. | No |
| `textFewShotDemoKeyValueList` | `string \| null` | See the description of the `text_few_shot_demo_key_value_list` parameter of the pipeline object's `chat` method. | No |
| `tableTaskDescription` | `string \| null` | See the description of the `table_task_description` parameter of the pipeline object's `chat` method. | No |
| `tableOutputFormat` | `string \| null` | See the description of the `table_output_format` parameter of the pipeline object's `chat` method. | No |
| `tableRulesStr` | `string \| null` | See the description of the `table_rules_str` parameter of the pipeline object's `chat` method. | No |
| `tableFewShotDemoTextContent` | `string \| null` | See the description of the `table_few_shot_demo_text_content` parameter of the pipeline object's `chat` method. | No |
| `tableFewShotDemoKeyValueList` | `string \| null` | See the description of the `table_few_shot_demo_key_value_list` parameter of the pipeline object's `chat` method. | No |
| `chatBotConfig` | `object \| null` | See the description of the `chat_bot_config` parameter of the pipeline object's `chat` method. | No |
| `retrieverConfig` | `object \| null` | See the description of the `retriever_config` parameter of the pipeline object's `chat` method. | No |
- When the request is processed successfully, the `result` of the response body has the following attributes:

| Name | Type | Meaning |
|---|---|---|
| `chatResult` | `object` | Key information extraction result. |
Multi-language Service Invocation Examples
Python
```python
import base64
import pprint
import sys

import requests

API_BASE_URL = "http://0.0.0.0:8080"

file_path = "./demo.jpg"
keys = ["Name"]

# Base64-encode the input file.
with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

# Step 1: visual analysis.
payload = {
    "file": file_data,
    "fileType": 1,
}
resp_visual = requests.post(url=f"{API_BASE_URL}/chatocr-visual", json=payload)
if resp_visual.status_code != 200:
    print(
        f"Request to chatocr-visual failed with status code {resp_visual.status_code}.",
        file=sys.stderr,
    )
    pprint.pp(resp_visual.json())
    sys.exit(1)
result_visual = resp_visual.json()["result"]

for i, res in enumerate(result_visual["layoutParsingResults"]):
    print(res["prunedResult"])
    for img_name, img in res["outputImages"].items():
        img_path = f"{img_name}_{i}.jpg"
        with open(img_path, "wb") as f:
            f.write(base64.b64decode(img))
        print(f"Output image saved at {img_path}")

# Step 2: build the vector database.
payload = {
    "visualInfo": result_visual["visualInfo"],
}
resp_vector = requests.post(url=f"{API_BASE_URL}/chatocr-vector", json=payload)
if resp_vector.status_code != 200:
    print(
        f"Request to chatocr-vector failed with status code {resp_vector.status_code}.",
        file=sys.stderr,
    )
    pprint.pp(resp_vector.json())
    sys.exit(1)
result_vector = resp_vector.json()["result"]

# Step 3: key information extraction.
payload = {
    "keyList": keys,
    "visualInfo": result_visual["visualInfo"],
    "useVectorRetrieval": True,
    "vectorInfo": result_vector["vectorInfo"],
}
resp_chat = requests.post(url=f"{API_BASE_URL}/chatocr-chat", json=payload)
if resp_chat.status_code != 200:
    print(
        f"Request to chatocr-chat failed with status code {resp_chat.status_code}.",
        file=sys.stderr,
    )
    pprint.pp(resp_chat.json())
    sys.exit(1)
result_chat = resp_chat.json()["result"]
print("Final result:")
print(result_chat["chatResult"])
```
📱 Edge Deployment: Edge deployment is a method where computing and data processing functions are placed on the user's device itself, allowing the device to process data directly without relying on remote servers. PaddleX supports deploying models on edge devices such as Android. For detailed instructions on edge deployment, please refer to the PaddleX Edge Deployment Guide. You can choose the appropriate deployment method for your pipeline based on your needs, and proceed with subsequent AI application integration.
## 4. Custom Development
If the default model weights provided by the PP-ChatOCRv3-doc pipeline do not meet your accuracy or speed requirements in your scenario, you can fine-tune the existing models with your own domain- or application-specific data to improve the pipeline's recognition performance in that scenario.
### 4.1 Model Fine-tuning
The Document Scene Information Extraction v3 pipeline consists of several modules, and unsatisfactory extraction results may originate from any of them. Analyze poorly extracted cases with the visualized images to determine which module is at fault, then refer to the corresponding fine-tuning tutorial linked in the table below:
| Scenario | Module to Fine-tune | Fine-tuning Reference Link |
|---|---|---|
| Inaccurate layout detection, such as undetected seals or tables | Layout Detection Module | Link |
| Inaccurate table structure recognition | Table Structure Recognition Module | Link |
| Missed seal text | Seal Text Detection Module | Link |
| Missed text | Text Detection Module | Link |
| Inaccurate text content | Text Recognition Module | Link |
| Inaccurate correction of vertical or rotated text lines | Text Line Orientation Classification Module | Link |
| Inaccurate whole-image rotation correction | Document Image Orientation Classification Module | Link |
| Inaccurate image distortion correction | Text Image Rectification Module | Fine-tuning not supported |
### 4.2 Model Deployment
After fine-tuning your models using your private dataset, you will obtain local model weights files.
To use the fine-tuned model weights, simply modify the pipeline configuration file, replacing the default model entries with the local paths of your fine-tuned models:
```yaml
......
Pipeline:
  layout_model: RT-DETR-H_layout_3cls  # Replace with the local path of your fine-tuned model
  table_model: SLANet_plus  # Replace with the local path of your fine-tuned model
  text_det_model: PP-OCRv4_server_det  # Replace with the local path of your fine-tuned model
  text_rec_model: PP-OCRv4_server_rec  # Replace with the local path of your fine-tuned model
  seal_text_det_model: PP-OCRv4_server_seal_det  # Replace with the local path of your fine-tuned model
  doc_image_ori_cls_model: null  # Replace with the local path of your fine-tuned model if applicable
  doc_image_unwarp_model: null  # Replace with the local path of your fine-tuned model if applicable
......
```
Subsequently, load the modified pipeline configuration file using the command-line interface or a Python script, as described in the local experience section; for example:
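A minimal sketch in Python (the configuration-file path is illustrative):

```python
from paddlex import create_pipeline

# "./PP-ChatOCRv3-doc.yaml" stands for your modified pipeline
# configuration file containing the fine-tuned model paths.
pipeline = create_pipeline(pipeline="./PP-ChatOCRv3-doc.yaml")
```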
## 5. Multi-hardware Support
PaddleX supports various mainstream hardware devices such as NVIDIA GPUs, Kunlun XPU, Ascend NPU, and Cambricon MLU. Seamless switching between them only requires setting the `device` parameter (`--device` on the command line).

For example, suppose you run the PP-ChatOCRv3-doc pipeline on an NVIDIA GPU with `device="gpu:0"`. To switch the hardware to an Ascend NPU, simply change the `device` argument in the script to `npu:0`:
```python
from paddlex import create_pipeline

pipeline = create_pipeline(
    pipeline="PP-ChatOCRv3-doc",
    device="npu:0",  # gpu:0 --> npu:0
)
```
If you want to use the PP-ChatOCRv3-doc Pipeline on more types of hardware, please refer to the PaddleX Multi-Device Usage Guide.