Instructions for Using the Document Visual-Language Model Module

1. Overview

The document visual-language model is a multimodal technique designed to overcome the limitations of traditional document processing. Traditional methods are usually restricted to specific formats or predefined categories, whereas a document visual-language model integrates visual and linguistic information to understand and process diverse document content. By combining computer vision with natural language processing, the model can identify images, text, and the relationships between them within a document, and can interpret semantic information in complex layouts. This makes document processing more flexible and improves generalization, with applications in areas such as office automation and information extraction.

2. Supported Model List

| Model | Download Link | Storage Size (GB) |
| --- | --- | --- |
| PP-DocBee-2B | Inference Model | 4.2 |
| PP-DocBee-7B | Inference Model | 15.8 |

Description: PP-DocBee is a multimodal large model developed by the PaddlePaddle team. It focuses on document understanding and performs well on Chinese document understanding tasks. The model is fine-tuned on nearly 5 million multimodal document-understanding samples, covering general VQA, OCR, table, text-rich, math and complex reasoning, synthetic, and pure text data, mixed at different training data ratios. On several authoritative academic leaderboards for English document understanding, PP-DocBee generally achieves SOTA results among models of the same parameter scale. In internal Chinese business scenarios, it also outperforms currently popular open-source and closed-source models.

3. Quick Integration

❗ Before quick integration, please install the PaddleX wheel package. For details, refer to PaddleX Local Installation Guide.

After installing the wheel package, you can run inference with the document visual-language model module in a few lines of code, and you can switch freely between the models in this module. You can also integrate the module's model inference into your own project. Before running the following code, download the example image to your local machine.

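from paddlex import create_model
# Load the model by name.
model = create_model('PP-DocBee-2B')
# PP-DocBee takes a dict holding the image path and the text query.
results = model.predict(
    input={"image": "medal_table.png", "query": "Identify the content of this table"},
    batch_size=1
)
for res in results:
    res.print()  # print the result to the terminal
    res.save_to_json("./output/res.json")  # save the result as a JSON file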

The results obtained will be:

{'res': {'image': 'medal_table.png', 'query': 'Identify the content of this table', 'result': '| Rank | Country/Region | Gold | Silver | Bronze | Total Medals |\n| --- | --- | --- | --- | --- | --- |\n| 1 | China (CHN) | 48 | 22 | 30 | 100 |\n| 2 | USA | 36 | 39 | 37 | 112 |\n| 3 | Russia (RUS) | 24 | 13 | 23 | 60 |\n| 4 | UK (GBR) | 19 | 13 | 19 | 51 |\n| 5 | Germany (GER) | 16 | 11 | 14 | 41 |\n| 6 | Australia (AUS) | 14 | 15 | 17 | 46 |\n| 7 | Korea (KOR) | 13 | 11 | 8 | 32 |\n| 8 | Japan (JPN) | 9 | 8 | 8 | 25 |\n| 9 | Italy (ITA) | 8 | 9 | 10 | 27 |\n| 10 | France (FRA) | 7 | 16 | 20 | 43 |\n| 11 | Netherlands (NED) | 7 | 5 | 4 | 16 |\n| 12 | Ukraine (UKR) | 7 | 4 | 11 | 22 |\n| 13 | Kenya (KEN) | 6 | 4 | 6 | 16 |\n| 14 | Spain (ESP) | 5 | 11 | 3 | 19 |\n| 15 | Jamaica (JAM) | 5 | 4 | 2 | 11 |\n'}}

The fields in the result have the following meanings:

  • image: The path of the input image
  • query: The input text query
  • result: The text generated by the model in response to the query

Rendered as a table, the result field looks like this:

| Rank | Country/Region | Gold | Silver | Bronze | Total Medals |
| --- | --- | --- | --- | --- | --- |
| 1 | China (CHN) | 48 | 22 | 30 | 100 |
| 2 | USA | 36 | 39 | 37 | 112 |
| 3 | Russia (RUS) | 24 | 13 | 23 | 60 |
| 4 | UK (GBR) | 19 | 13 | 19 | 51 |
| 5 | Germany (GER) | 16 | 11 | 14 | 41 |
| 6 | Australia (AUS) | 14 | 15 | 17 | 46 |
| 7 | Korea (KOR) | 13 | 11 | 8 | 32 |
| 8 | Japan (JPN) | 9 | 8 | 8 | 25 |
| 9 | Italy (ITA) | 8 | 9 | 10 | 27 |
| 10 | France (FRA) | 7 | 16 | 20 | 43 |
| 11 | Netherlands (NED) | 7 | 5 | 4 | 16 |
| 12 | Ukraine (UKR) | 7 | 4 | 11 | 22 |
| 13 | Kenya (KEN) | 6 | 4 | 6 | 16 |
| 14 | Spain (ESP) | 5 | 11 | 3 | 19 |
| 15 | Jamaica (JAM) | 5 | 4 | 2 | 11 |
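
Because the result field is a Markdown table held in a plain string, downstream code usually needs to split it into rows before using the values. The following is a minimal sketch of one way to do that with the standard library only. The helper name parse_markdown_table is hypothetical and is not part of PaddleX, and the sketch assumes a simple pipe table with no escaped pipes inside cells.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Convert a simple pipe-delimited Markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in markdown.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        return []

    def split_row(line: str) -> list[str]:
        # Drop the outer pipes, then split the remaining cells.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the "| --- |" separator row
        rows.append(dict(zip(header, split_row(line))))
    return rows


# First two rows of the example output shown above.
sample = (
    "| Rank | Country/Region | Gold | Silver | Bronze | Total Medals |\n"
    "| --- | --- | --- | --- | --- | --- |\n"
    "| 1 | China (CHN) | 48 | 22 | 30 | 100 |\n"
    "| 2 | USA | 36 | 39 | 37 | 112 |\n"
)
rows = parse_markdown_table(sample)
print(rows[0]["Country/Region"], rows[0]["Gold"])
```

All values stay as strings; convert them with int() where numeric comparisons are needed. The model's output format depends on the query, so a parser like this should only be applied when the result is known to be a table.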

The related methods and parameters are explained below:

  • create_model instantiates the document visual-language model (using PP-DocBee-2B as an example). Its parameters are as follows:

| Parameter | Description | Type | Options | Default |
| --- | --- | --- | --- | --- |
| model_name | Model name | str | None | None |
| model_dir | Model storage path | str | None | None |
| device | Model inference device | str | A specific GPU card such as "gpu:0", another hardware card such as "npu:0", or "cpu" | gpu:0 |
| use_hpip | Whether to enable the high-performance inference plugin. Not supported for now. | bool | None | False |
| hpi_config | High-performance inference configuration. Not supported for now. | dict \| None | None | None |

  • model_name must be specified. When only model_name is given, the default model parameters built into PaddleX are used; if model_dir is also specified, the user-defined model is used.
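
The parameter rules above can be sketched as a small helper that assembles the keyword arguments before the model is created. This is only an illustration: build_model_kwargs and the path ./my_docbee_model are hypothetical, and the keyword names are taken from the parameter table rather than verified against a particular PaddleX release.

```python
from typing import Any, Optional


def build_model_kwargs(
    model_name: str,
    model_dir: Optional[str] = None,
    device: str = "gpu:0",
) -> dict[str, Any]:
    """Collect keyword arguments for create_model (hypothetical helper)."""
    if not model_name:
        raise ValueError("model_name must be specified")
    kwargs: dict[str, Any] = {"model_name": model_name, "device": device}
    if model_dir is not None:
        # Only pass model_dir when a user-defined model should be loaded.
        kwargs["model_dir"] = model_dir
    return kwargs


# Run the built-in model on CPU instead of the default "gpu:0".
cpu_kwargs = build_model_kwargs("PP-DocBee-2B", device="cpu")

# Load user-defined weights from a local directory (hypothetical path).
local_kwargs = build_model_kwargs("PP-DocBee-2B", model_dir="./my_docbee_model")

# With PaddleX installed, the model would then be created as:
#   from paddlex import create_model
#   model = create_model(**cpu_kwargs)
print(cpu_kwargs)
```

use_hpip and hpi_config are left out because the table marks them as not supported for now.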

  • The predict() method of the document visual-language model performs inference. Its parameters are input and batch_size:

| Parameter | Description | Type | Options | Default |
| --- | --- | --- | --- | --- |
| input | Data to be predicted | dict | Depends on the specific model. For the PP-DocBee series, the input is {'image': image_path, 'query': query_text} | None |
| batch_size | Batch size | int | Integer (currently only 1 is supported) | 1 |
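
Since batch_size currently supports only 1, asking several questions about the same image means calling predict() once per query. The sketch below shows that call pattern under stated assumptions: ask_queries is a hypothetical helper, and EchoModel is a stand-in that returns plain dicts so the example runs without PaddleX. A real model would yield Result objects instead.

```python
from typing import Any, Iterable, Iterator


def ask_queries(model: Any, image_path: str, queries: Iterable[str]) -> Iterator[Any]:
    """Yield one result per query, calling predict() once per input dict."""
    for query in queries:
        # PP-DocBee expects a dict with "image" and "query" keys.
        sample = {"image": image_path, "query": query}
        # batch_size stays at 1, the only value currently supported.
        yield from model.predict(input=sample, batch_size=1)


class EchoModel:
    """Stand-in used only to show the call pattern without PaddleX installed."""

    def predict(self, input: dict, batch_size: int = 1):
        yield {"image": input["image"], "query": input["query"], "result": "<model output>"}


queries = [
    "Identify the content of this table",
    "Which country has the most gold medals?",
]
answers = list(ask_queries(EchoModel(), "medal_table.png", queries))
print(len(answers))
```

To use it with the real module, pass the object returned by create_model('PP-DocBee-2B') in place of EchoModel().
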
  • The prediction result for each sample is a Result object, which supports operations such as printing and saving as a JSON file:

| Method | Description | Parameter | Type | Parameter Description | Default |
| --- | --- | --- | --- | --- | --- |
| print() | Print results to terminal | format_json | bool | Whether to use JSON indentation to format output content | True |
| | | indent | int | Specify indentation levels to enhance the readability of output JSON data. This is effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Control whether non-ASCII characters are escaped to Unicode. When set to True, all non-ASCII characters are escaped; False retains the original characters. This is effective only when format_json is True | False |
| save_to_json() | Save results as json file | save_path | str | Path for saving the file. When specified as a directory, saved file names match the input file types | None |
| | | indent | int | Specify indentation levels to enhance the readability of output JSON data. This is effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Control whether non-ASCII characters are escaped to Unicode. When set to True, all non-ASCII characters are escaped; False retains the original characters. This is effective only when format_json is True | False |
  • The prediction results can also be accessed through attributes:

| Attribute | Description |
| --- | --- |
| json | Get the prediction result in JSON format |
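
The effect of indent and ensure_ascii is easiest to see on a small payload. The sketch below uses the standard library's json.dumps, on the assumption that the Result methods treat these parameters the same way as the identically named json.dumps arguments; this page does not confirm that. The payload is a shortened, hypothetical dict shaped like the printed result above, with a non-ASCII query added for illustration.

```python
import json

# Hypothetical payload shaped like the printed result, shortened for illustration.
payload = {
    "res": {
        "image": "medal_table.png",
        "query": "表格内容",  # non-ASCII text, to show the effect of ensure_ascii
        "result": "| Rank | Gold |",
    }
}

# ensure_ascii=True escapes every non-ASCII character as a \uXXXX sequence.
escaped = json.dumps(payload, ensure_ascii=True)

# ensure_ascii=False keeps the original characters; indent=4 pretty-prints.
readable = json.dumps(payload, indent=4, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same dict with json.loads; the options change only how the text looks on screen or on disk.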

For more on the single-model inference API in PaddleX, refer to Instructions for Using PaddleX Single Model Python API.

4. Secondary Development

This module does not currently support fine-tuning; only inference integration is supported. Fine-tuning support is planned for a future release.