Instructions for Using the Document Visual-Language Model Module

1. Overview

The document visual-language model is a multimodal technology designed to overcome the limitations of traditional document processing. Traditional methods are usually restricted to documents of specific formats or predefined categories, whereas a document visual-language model integrates visual and linguistic information to understand and process diverse document content. By combining computer vision with natural language processing, the model can identify the images and text in a document and the relationships between them, and can interpret semantic information within complex layouts. This makes document processing more flexible and improves generalization, with broad application prospects in areas such as office automation and information extraction.

2. Supported Model List

Model	Download Link	Storage Size (GB)	Description
PP-DocBee-2B	Inference Model	4.2	PP-DocBee is a multimodal large model developed by the PaddlePaddle team that focuses on document understanding and performs especially well on Chinese document understanding tasks. The model is fine-tuned on nearly 5 million multimodal document-understanding samples, covering general VQA, OCR, table, text-rich, math and complex reasoning, synthetic, and pure text data, mixed at different training ratios. On several authoritative academic leaderboards for English document understanding, PP-DocBee generally achieves SOTA results among models of the same parameter scale. In internal Chinese business scenarios, PP-DocBee also outperforms currently popular open-source and closed-source models.
PP-DocBee-7B	Inference Model	15.8

3. Quick Integration

❗ Before quick integration, please install the PaddleX wheel package. For details, refer to the PaddleX Local Installation Guide.

After installing the wheel package, you can run inference with the document visual-language model module in a few lines of code, and you can switch freely between the models in this module. You can also integrate model inference from this module into your own project. Before running the following code, please download the example image to your local machine.

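from paddlex import create_model

# Instantiate the model by name, using the default parameters built into PaddleX.
model = create_model('PP-DocBee-2B')
# The input pairs an image path with a text query.
results = model.predict(
    input={"image": "medal_table.png", "query": "Identify the content of this table"},
    batch_size=1
)
for res in results:
    res.print()  # print the result to the terminal
    res.save_to_json("./output/res.json")  # save the result as a JSON file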

The results obtained will be:

{'res': {'image': 'medal_table.png', 'query': 'Identify the content of this table', 'result': '| Rank | Country/Region | Gold | Silver | Bronze | Total Medals |\n| --- | --- | --- | --- | --- | --- |\n| 1 | China (CHN) | 48 | 22 | 30 | 100 |\n| 2 | USA | 36 | 39 | 37 | 112 |\n| 3 | Russia (RUS) | 24 | 13 | 23 | 60 |\n| 4 | UK (GBR) | 19 | 13 | 19 | 51 |\n| 5 | Germany (GER) | 16 | 11 | 14 | 41 |\n| 6 | Australia (AUS) | 14 | 15 | 17 | 46 |\n| 7 | Korea (KOR) | 13 | 11 | 8 | 32 |\n| 8 | Japan (JPN) | 9 | 8 | 8 | 25 |\n| 9 | Italy (ITA) | 8 | 9 | 10 | 27 |\n| 10 | France (FRA) | 7 | 16 | 20 | 43 |\n| 11 | Netherlands (NED) | 7 | 5 | 4 | 16 |\n| 12 | Ukraine (UKR) | 7 | 4 | 11 | 22 |\n| 13 | Kenya (KEN) | 6 | 4 | 6 | 16 |\n| 14 | Spain (ESP) | 5 | 11 | 3 | 19 |\n| 15 | Jamaica (JAM) | 5 | 4 | 2 | 11 |\n'}}

The fields in the result have the following meanings:

image: The path of the input image
query: The input text query
result: The text predicted by the model
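
Because the result field in the sample output above is a Markdown table serialized as a single string, downstream code usually needs to split it back into rows. The following is a minimal sketch, not part of PaddleX: parse_markdown_table is a hypothetical helper, and the sample dictionary is an abbreviated copy of the printed output above rather than a real Result object.

```python
from __future__ import annotations


def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Convert a pipe-delimited Markdown table into a list of row dicts."""
    lines = [line.strip() for line in markdown.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        return []

    def split_row(line: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the "| --- |" separator row
        rows.append(dict(zip(header, split_row(line))))
    return rows


# Abbreviated copy of the sample output (first two data rows only).
sample = {
    "res": {
        "image": "medal_table.png",
        "query": "Identify the content of this table",
        "result": (
            "| Rank | Country/Region | Gold | Silver | Bronze | Total Medals |\n"
            "| --- | --- | --- | --- | --- | --- |\n"
            "| 1 | China (CHN) | 48 | 22 | 30 | 100 |\n"
            "| 2 | USA | 36 | 39 | 37 | 112 |\n"
        ),
    }
}

rows = parse_markdown_table(sample["res"]["result"])
print(rows[0]["Country/Region"], rows[0]["Gold"])
```

All cell values stay as strings here; convert them with int() where numeric comparisons are needed. The model's output format depends on the query, so this parser only applies when the answer really is a Markdown table.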

The visualized prediction results are as follows:

| Rank | Country/Region | Gold | Silver | Bronze | Total Medals |
| --- | --- | --- | --- | --- | --- |
| 1 | China (CHN) | 48 | 22 | 30 | 100 |
| 2 | USA | 36 | 39 | 37 | 112 |
| 3 | Russia (RUS) | 24 | 13 | 23 | 60 |
| 4 | UK (GBR) | 19 | 13 | 19 | 51 |
| 5 | Germany (GER) | 16 | 11 | 14 | 41 |
| 6 | Australia (AUS) | 14 | 15 | 17 | 46 |
| 7 | Korea (KOR) | 13 | 11 | 8 | 32 |
| 8 | Japan (JPN) | 9 | 8 | 8 | 25 |
| 9 | Italy (ITA) | 8 | 9 | 10 | 27 |
| 10 | France (FRA) | 7 | 16 | 20 | 43 |
| 11 | Netherlands (NED) | 7 | 5 | 4 | 16 |
| 12 | Ukraine (UKR) | 7 | 4 | 11 | 22 |
| 13 | Kenya (KEN) | 6 | 4 | 6 | 16 |
| 14 | Spain (ESP) | 5 | 11 | 3 | 19 |
| 15 | Jamaica (JAM) | 5 | 4 | 2 | 11 |

The related methods and parameters are explained as follows:

create_model instantiates the document visual-language model (using PP-DocBee-2B as an example). Its parameters are as follows:

Parameter	Description	Type	Options	Default
`model_name`	Model name	`str`	None	`None`
`model_dir`	Model storage path	`str`	None	None
`device`	Model inference device	`str`	A specific GPU card such as "gpu:0", a card on other hardware such as "npu:0", or the CPU as "cpu".	`gpu:0`
`use_hpip`	Whether to enable the high-performance inference plugin. Not supported for now.	`bool`	None	`False`
`hpi_config`	High-performance inference configuration. Not supported for now.	`dict` \| `None`	None	`None`

model_name must be specified. When only model_name is given, the default model parameters built into PaddleX are used; if model_dir is also specified, the user-defined model in that directory is used instead.
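
As a rough illustration of how the create_model parameters in the table above fit together, the sketch below assembles the keyword arguments before handing them to PaddleX. build_create_model_kwargs and load_model are hypothetical helpers introduced only for this example; the parameter names model_name, model_dir, and device come from the table above.

```python
from typing import Optional


def build_create_model_kwargs(
    model_name: str,
    model_dir: Optional[str] = None,
    device: str = "gpu:0",
) -> dict:
    """Assemble keyword arguments for create_model (hypothetical helper)."""
    if not model_name:
        raise ValueError("model_name must be specified")
    kwargs = {"model_name": model_name, "device": device}
    if model_dir is not None:
        # Per the table above, model_dir is the storage path of a user-provided model.
        kwargs["model_dir"] = model_dir
    return kwargs


def load_model(**kwargs):
    """Import PaddleX lazily so the helper above can be used without it installed."""
    from paddlex import create_model

    return create_model(**kwargs)


kwargs = build_create_model_kwargs("PP-DocBee-2B", device="cpu")
print(kwargs)
# model = load_model(**kwargs)  # requires PaddleX to be installed
```

Keeping the argument assembly separate from the PaddleX import makes the device and model_dir choices easy to check before any model is loaded.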
Inference is run by calling the predict() method of the document visual-language model. The predict() method takes the parameters input and batch_size, detailed as follows:

Parameter	Description	Type	Options	Default
`input`	Data to be predicted	`dict`	A `dict` whose keys depend on the specific model. For the PP-DocBee series, the input is {'image': image_path, 'query': query_text}	None
`batch_size`	Batch size	`int`	Integer (currently only supports 1)	1
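
Since batch_size currently supports only 1, one way to handle several images is to build one input dictionary per image and call predict() for each in turn. The sketch below shows only the input construction; make_docbee_input and make_inputs are hypothetical helpers, and invoice.png is a made-up file name used for illustration.

```python
def make_docbee_input(image_path: str, query: str) -> dict:
    """Build one PP-DocBee input dict in the {'image': ..., 'query': ...} form."""
    if not image_path:
        raise ValueError("image_path must be a non-empty string")
    if not query.strip():
        raise ValueError("query must be a non-empty string")
    return {"image": image_path, "query": query}


def make_inputs(image_paths, query):
    """Pair every image path with the same query."""
    return [make_docbee_input(path, query) for path in image_paths]


inputs = make_inputs(
    ["medal_table.png", "invoice.png"],  # invoice.png is a hypothetical second image
    "Identify the content of this table",
)
for item in inputs:
    print(item)
    # With a model created via create_model, each item would be passed as:
    # results = model.predict(input=item, batch_size=1)
```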

The prediction for each sample is returned as a Result object, which supports operations such as printing and saving as a JSON file:

Method	Method Description	Parameter	Parameter Type	Parameter Description	Default
`print()`	Print results to terminal	`format_json`	`bool`	Whether to use `JSON` indentation to format output content	`True`
		`indent`	`int`	Specify indentation levels to enhance the readability of output `JSON` data. This is effective only when `format_json` is `True`	4
		`ensure_ascii`	`bool`	Control whether non-`ASCII` characters are escaped to `Unicode`. When set to `True`, all non-`ASCII` characters are escaped; `False` retains the original characters. This is effective only when `format_json` is `True`	`False`
`save_to_json()`	Save results as json file	`save_path`	`str`	Path for saving the file. When specified as a directory, saved file names match the input file types	None
		`indent`	`int`	Specify indentation levels to enhance the readability of output `JSON` data. This is effective only when `format_json` is `True`	4
		`ensure_ascii`	`bool`	Control whether non-`ASCII` characters are escaped to `Unicode`. When set to `True`, all non-`ASCII` characters are escaped; `False` retains the original characters. This is effective only when `format_json` is `True`	`False`
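
The indent and ensure_ascii options described above match the parameters of the same names in Python's standard json.dumps. The sketch below uses json.dumps directly to show their effect; whether PaddleX forwards these arguments to json.dumps unchanged is an assumption, and the record dictionary is made-up sample data.

```python
import json

# Made-up record containing non-ASCII text in the query.
record = {"res": {"image": "medal_table.png", "query": "识别表格内容"}}

# ensure_ascii=True escapes non-ASCII characters as \uXXXX sequences.
escaped = json.dumps(record, indent=4, ensure_ascii=True)

# ensure_ascii=False keeps the original characters; indent=4 pretty-prints.
readable = json.dumps(record, indent=4, ensure_ascii=False)

# Without indent, the output is a single line.
compact = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)
print(compact)
```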

In addition, the prediction results can also be accessed through attributes, as follows:

Attribute	Description
`json`	Get the predicted results in `json` format

For more information on the single-model inference API in PaddleX, refer to the Instructions for Using PaddleX Single Model Python API.

4. Secondary Development

This module does not currently support fine-tuning and supports only inference integration. Support for fine-tuning this module is planned for the future.