Instructions for Using the Document Visual-Language Model Module
1. Overview
Document visual-language models are a multimodal technology designed to overcome the limitations of traditional document processing. Traditional methods are often restricted to documents of specific formats or predefined categories, whereas a document visual-language model combines computer vision with natural language processing to interpret diverse document content. It can identify the images and text in a document and how they relate to each other, and it can understand semantic information within complex layouts. This makes document processing more flexible and improves generalization, with applications in fields such as office automation and information extraction.
2. Supported Model List
Model | Download Link | Storage Size (GB) | Description |
---|---|---|---|
PP-DocBee-2B | Inference Model | 4.2 | PP-DocBee is a multimodal large model developed by the PaddlePaddle team. It focuses on document understanding and performs especially well on Chinese document understanding tasks. The model is fine-tuned and optimized on nearly 5 million multimodal data samples for document understanding, covering general VQA, OCR, table, text-rich, math and complex reasoning, synthetic, and pure text data, mixed at different training ratios. On several authoritative English document understanding benchmarks in academia, PP-DocBee generally achieves SOTA results among models of the same parameter scale. In internal Chinese business scenarios, PP-DocBee also outperforms currently popular open-source and closed-source models. |
PP-DocBee-7B | Inference Model | 15.8 | Same as PP-DocBee-2B; the description applies to the PP-DocBee series. |
3. Quick Integration
❗ Before quick integration, please install the PaddleX wheel package. For details, refer to PaddleX Local Installation Guide.
After installing the wheel package, you can run inference with the document visual-language model module in a few lines of code and switch freely between the models in this module. You can also integrate the module's model inference into your own project. Before running the following code, please download the example image to your local machine.
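from paddlex import create_model

# Instantiate the model by name (the default PaddleX model parameters are used).
model = create_model('PP-DocBee-2B')

# PP-DocBee takes a dict holding the image path and a text query.
results = model.predict(
    input={"image": "medal_table.png", "query": "Identify the content of this table"},
    batch_size=1
)

# Each item is a Result object that can be printed or saved as JSON.
for res in results:
    res.print()
    res.save_to_json("./output/res.json")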
The results obtained will be:
{'res': {'image': 'medal_table.png', 'query': 'Identify the content of this table', 'result': '| Rank | Country/Region | Gold | Silver | Bronze | Total Medals |\n| --- | --- | --- | --- | --- | --- |\n| 1 | China (CHN) | 48 | 22 | 30 | 100 |\n| 2 | USA | 36 | 39 | 37 | 112 |\n| 3 | Russia (RUS) | 24 | 13 | 23 | 60 |\n| 4 | UK (GBR) | 19 | 13 | 19 | 51 |\n| 5 | Germany (GER) | 16 | 11 | 14 | 41 |\n| 6 | Australia (AUS) | 14 | 15 | 17 | 46 |\n| 7 | Korea (KOR) | 13 | 11 | 8 | 32 |\n| 8 | Japan (JPN) | 9 | 8 | 8 | 25 |\n| 9 | Italy (ITA) | 8 | 9 | 10 | 27 |\n| 10 | France (FRA) | 7 | 16 | 20 | 43 |\n| 11 | Netherlands (NED) | 7 | 5 | 4 | 16 |\n| 12 | Ukraine (UKR) | 7 | 4 | 11 | 22 |\n| 13 | Kenya (KEN) | 6 | 4 | 6 | 16 |\n| 14 | Spain (ESP) | 5 | 11 | 3 | 19 |\n| 15 | Jamaica (JAM) | 5 | 4 | 2 | 11 |\n'}}
- image: the path of the input image to be predicted
- query: the input text query for the prediction
- result: the prediction result returned by the model
The visualized prediction results are as follows:
| Rank | Country/Region | Gold | Silver | Bronze | Total Medals |
| --- | --- | --- | --- | --- | --- |
| 1 | China (CHN) | 48 | 22 | 30 | 100 |
| 2 | USA | 36 | 39 | 37 | 112 |
| 3 | Russia (RUS) | 24 | 13 | 23 | 60 |
| 4 | UK (GBR) | 19 | 13 | 19 | 51 |
| 5 | Germany (GER) | 16 | 11 | 14 | 41 |
| 6 | Australia (AUS) | 14 | 15 | 17 | 46 |
| 7 | Korea (KOR) | 13 | 11 | 8 | 32 |
| 8 | Japan (JPN) | 9 | 8 | 8 | 25 |
| 9 | Italy (ITA) | 8 | 9 | 10 | 27 |
| 10 | France (FRA) | 7 | 16 | 20 | 43 |
| 11 | Netherlands (NED) | 7 | 5 | 4 | 16 |
| 12 | Ukraine (UKR) | 7 | 4 | 11 | 22 |
| 13 | Kenya (KEN) | 6 | 4 | 6 | 16 |
| 14 | Spain (ESP) | 5 | 11 | 3 | 19 |
| 15 | Jamaica (JAM) | 5 | 4 | 2 | 11 |
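For a table query like the one above, the `result` field is a Markdown table held in a single string. The following is a minimal sketch of turning such a string into Python rows using only the standard library. The helper `parse_markdown_table` is hypothetical (it is not part of PaddleX), and it assumes the model output is a well-formed pipe-delimited table, which is not guaranteed for every query.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Convert a pipe-delimited Markdown table into a list of row dicts."""
    lines = [line.strip() for line in markdown.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        return []

    def split_row(line: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the | --- | --- | separator row
        rows.append(dict(zip(header, split_row(line))))
    return rows


# Shortened copy of the 'result' string from the example output.
sample_result = (
    "| Rank | Country/Region | Gold | Silver | Bronze | Total Medals |\n"
    "| --- | --- | --- | --- | --- | --- |\n"
    "| 1 | China (CHN) | 48 | 22 | 30 | 100 |\n"
    "| 2 | USA | 36 | 39 | 37 | 112 |\n"
)

rows = parse_markdown_table(sample_result)
print(rows[0]["Country/Region"], rows[0]["Gold"])  # China (CHN) 48
```

All cell values stay as strings; convert numeric columns with `int()` only after checking that the model actually produced digits.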
The related methods and parameters are explained below:
- create_model instantiates the document visual-language model (using PP-DocBee-2B as an example). Its parameters are as follows:
Parameter | Description | Type | Options | Default |
---|---|---|---|---|
model_name | Model name | str | None | None |
model_dir | Model storage path | str | None | None |
device | Model inference device | str | Supports specifying specific GPU card numbers, such as "gpu:0", specific hardware card numbers like "npu:0", or CPU as "cpu". | gpu:0 |
use_hpip | Whether to enable the high-performance inference plugin. Not supported for now. | bool | None | False |
hpi_config | High-performance inference configuration. Not supported for now. | dict or None | None | None |
- model_name must be specified. With only model_name given, the default model parameters built into PaddleX are used; if model_dir is also specified, the user-defined model stored at that path is used instead.
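The parameters above can be combined as in the following sketch, which selects a device and optionally points at user-provided weights. The helpers `build_model_kwargs` and `load_docbee` and the path `./my_docbee` are hypothetical names used only for illustration; the keyword names `model_name`, `model_dir`, and `device` are taken from the table above.

```python
def build_model_kwargs(model_name, model_dir=None, device="gpu:0"):
    """Collect keyword arguments for create_model (hypothetical helper)."""
    if not model_name:
        raise ValueError("model_name must be specified")
    kwargs = {"model_name": model_name, "device": device}
    if model_dir is not None:
        # Only pass model_dir when pointing at a user-defined model.
        kwargs["model_dir"] = model_dir
    return kwargs


def load_docbee(model_name="PP-DocBee-2B", model_dir=None, device="gpu:0"):
    # Imported lazily so build_model_kwargs works without PaddleX installed.
    from paddlex import create_model
    return create_model(**build_model_kwargs(model_name, model_dir, device))


default_kwargs = build_model_kwargs("PP-DocBee-2B")
# "./my_docbee" is a made-up directory standing in for your own model files.
cpu_kwargs = build_model_kwargs("PP-DocBee-2B", model_dir="./my_docbee", device="cpu")
print(default_kwargs)
print(cpu_kwargs)
```

With PaddleX installed, `model = load_docbee(device="cpu")` would then create the model on the CPU.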
- The predict() method of the document visual-language model is called to run inference. Its parameters are input and batch_size, detailed as follows:
Parameter | Description | Type | Options | Default |
---|---|---|---|---|
input | Data to be predicted | dict | Dict; the required keys depend on the specific model. For the PP-DocBee series, the input is {'image': image_path, 'query': query_text} | None |
batch_size | Batch size | int | Integer (currently only supports 1) | 1 |
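Because batch_size currently only supports 1, asking several questions about the same image means calling `predict()` once per query. The sketch below shows one way to do that. `build_inputs` and `ask_all` are hypothetical helpers, and the second query string is an invented example; only the input dict shape and the `predict(input=..., batch_size=1)` call come from the table above.

```python
from pathlib import Path


def build_inputs(image_path, queries):
    """Yield one PP-DocBee input dict per query for the same image."""
    for query in queries:
        if not query.strip():
            raise ValueError("query must not be empty")
        yield {"image": str(image_path), "query": query}


def ask_all(model, image_path, queries):
    """Run predict() once per query and collect the Result objects."""
    if not Path(image_path).is_file():
        raise FileNotFoundError(image_path)
    results = []
    for item in build_inputs(image_path, queries):
        # batch_size is fixed at 1, so each call handles a single sample.
        results.extend(model.predict(input=item, batch_size=1))
    return results


inputs = list(build_inputs("medal_table.png", [
    "Identify the content of this table",
    "Which country won the most gold medals?",  # invented example query
]))
print(inputs)
```

Given a model created with `create_model`, `ask_all(model, "medal_table.png", queries)` would return one Result object per query.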
- The prediction result for each sample is a corresponding Result object, which supports operations such as printing and saving as a json file:
Method | Method Description | Parameter | Parameter Type | Parameter Description | Default |
---|---|---|---|---|---|
print() | Print results to terminal | format_json | bool | Whether to use JSON indentation to format output content | True |
print() | Print results to terminal | indent | int | Specify indentation levels to enhance the readability of output JSON data. This is effective only when format_json is True | 4 |
print() | Print results to terminal | ensure_ascii | bool | Control whether non-ASCII characters are escaped to Unicode. When set to True, all non-ASCII characters are escaped; False retains the original characters. This is effective only when format_json is True | False |
save_to_json() | Save results as json file | save_path | str | Path for saving the file. When a directory is specified, the saved file is named after the input file | None |
save_to_json() | Save results as json file | indent | int | Specify indentation levels to enhance the readability of output JSON data. This is effective only when format_json is True | 4 |
save_to_json() | Save results as json file | ensure_ascii | bool | Control whether non-ASCII characters are escaped to Unicode. When set to True, all non-ASCII characters are escaped; False retains the original characters. This is effective only when format_json is True | False |
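The effect of indent and ensure_ascii can be previewed without PaddleX. The sketch below assumes these parameters behave like the same-named arguments of Python's `json.dumps`, which matches the descriptions in the table but is not confirmed here as PaddleX's implementation. The payload is a made-up dict shaped like the printed result, with a Chinese query so that the escaping is visible.

```python
import json

# Made-up payload shaped like the printed result; the query means
# "Identify the content of the table".
payload = {
    "res": {
        "image": "medal_table.png",
        "query": "识别表格内容",
        "result": "| Rank | Gold |",
    }
}

# No indent: everything on a single line.
compact = json.dumps(payload, ensure_ascii=False)

# indent=4 with ensure_ascii=False: readable, original characters kept.
pretty = json.dumps(payload, indent=4, ensure_ascii=False)

# ensure_ascii=True: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(payload, indent=4, ensure_ascii=True)

print(pretty)
print(escaped)
```

Both forms decode to the same data, so the choice only affects how readable the saved file is to a person.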
- In addition, the prediction results can also be accessed through attributes, as follows:
Attribute | Description |
---|---|
json | Get the predicted results in json format |
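As a sketch of working with the json attribute, the helper below pulls the query and the answer out of a result dict. It assumes the attribute returns a dict nested under a top-level "res" key, as in the printed example output; this shape is an assumption, so the helper falls back to a flat dict. `extract_answer` is a hypothetical name, not a PaddleX function.

```python
def extract_answer(result_json):
    """Return (query, result) from a dict shaped like the printed output."""
    # Assumption: fields sit under a top-level "res" key, as in the printed
    # example; if that key is missing, treat the dict itself as the body.
    body = result_json.get("res", result_json)
    return body.get("query"), body.get("result")


# Stand-in for res.json, shortened from the example output.
sample = {
    "res": {
        "image": "medal_table.png",
        "query": "Identify the content of this table",
        "result": "| Rank | Gold |\n| --- | --- |\n| 1 | 48 |\n",
    }
}

query, answer = extract_answer(sample)
print(query)
print(answer)
```

Inside the prediction loop this would be called as `extract_answer(res.json)`.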
For more usage instructions on the API for single model inference in PaddleX, you can refer to Instructions for Using PaddleX Single Model Python API.
4. Secondary Development
This module currently supports only inference integration. Fine-tuning is not yet supported, but it is planned for a future release.