
Instruction for Using Document Visual-Language Model Module

1. Overview

The document visual-language model is a cutting-edge multimodal processing technology aimed at overcoming the limitations of traditional document processing methods. Traditional methods are often restricted to handling documents of specific formats or predefined categories, whereas document visual-language models can integrate visual and linguistic information to comprehend and process diverse document content. By combining computer vision with natural language processing, the model can identify images, texts, and their interrelationships within documents, even understanding semantic information within complex layout structures. This makes document processing more intelligent and flexible, with enhanced generalization capabilities, showcasing broad application prospects in fields such as automated office and information extraction.

2. Supported Model List

| Model | Download Link | Storage Size (GB) | Model Score | Description |
| --- | --- | --- | --- | --- |
| PP-DocBee-2B | Inference Model | 4.2 | 765 | PP-DocBee is a multimodal large model developed by the PaddlePaddle team, focused on document understanding, with excellent performance on Chinese document understanding tasks. The model is fine-tuned and optimized using nearly 5 million multimodal document-understanding samples, including general VQA, OCR, table, text-rich, math and complex reasoning, synthetic, and pure text data, with different training data ratios. On several authoritative English document understanding evaluation leaderboards in academia, PP-DocBee has generally achieved SOTA at the same parameter level. In internal business Chinese scenarios, PP-DocBee also exceeds current popular open-source and closed-source models. |
| PP-DocBee-7B | Inference Model | 15.8 | - | - |
| PP-DocBee2-3B | Inference Model | 7.6 | 852 | PP-DocBee2 is a multimodal large model independently developed by the PaddlePaddle team, specifically tailored for document understanding. Building upon PP-DocBee, the team has further optimized the foundational model and introduced a new data optimization scheme to enhance data quality. With just a relatively small dataset of 470,000 samples generated using the team's proprietary data synthesis strategy, PP-DocBee2 demonstrates superior performance in Chinese document understanding tasks. In terms of internal business metrics for Chinese-language scenarios, PP-DocBee2 achieves an approximately 11.4% improvement over PP-DocBee, outperforming both current popular open-source and closed-source models of a similar scale. |

Note: The total scores of the above models are based on test results from an internal evaluation set. All images in this set have a resolution (height, width) of (1680, 1204), and the set contains 1,196 entries covering scenarios such as financial reports, laws and regulations, science and engineering papers, instruction manuals, liberal arts papers, contracts, and research reports. There are currently no plans to release this dataset publicly.

3. Quick Integration

❗ Before quick integration, please install the PaddleX wheel package. For details, refer to PaddleX Local Installation Guide.

After installing the wheel package, you can run inference with the document visual-language model module in just a few lines of code, and switch between the models in this module at will. You can also integrate the module's model inference into your own project. Before running the following code, please download the example image to your local machine.

```python
from paddlex import create_model

model = create_model("PP-DocBee2-3B")
results = model.predict(
    input={"image": "medal_table.png", "query": "识别这份表格的内容, 以markdown格式输出"},
    batch_size=1,
)
for res in results:
    res.print()
    res.save_to_json("./output/res.json")
```

The results obtained will be:

```
{'res': {'image': 'medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出', 'result': '| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国(CHN) | 48 | 22 | 30 | 100 |\n| 2 | 美国(USA) | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 |\n| 4 | 英国(GBR) | 19 | 13 | 19 | 51 |\n| 5 | 德国(GER) | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 |\n| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 |\n| 8 | 日本(JPN) | 9 | 8 | 8 | 25 |\n| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 |\n| 10 | 法国(FRA) | 7 | 16 | 20 | 43 |\n| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 |\n| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 |\n| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 |\n'}}
```
The parameters in the results have the following meaning:

  • image: the path of the input image
  • query: the input text query
  • result: the prediction result returned by the model

The visualized prediction results are as follows:

| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |
| --- | --- | --- | --- | --- | --- |
| 1 | 中国(CHN) | 48 | 22 | 30 | 100 |
| 2 | 美国(USA) | 36 | 39 | 37 | 112 |
| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 |
| 4 | 英国(GBR) | 19 | 13 | 19 | 51 |
| 5 | 德国(GER) | 16 | 11 | 14 | 41 |
| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 |
| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 |
| 8 | 日本(JPN) | 9 | 8 | 8 | 25 |
| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 |
| 10 | 法国(FRA) | 7 | 16 | 20 | 43 |
| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 |
| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 |
| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 |
| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 |
| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 |
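Because the model returns the table as GitHub-style Markdown text inside the `result` field, downstream code often needs structured rows. Below is a minimal stdlib-only post-processing sketch; the `parse_markdown_table` helper is hypothetical, for illustration, and not part of the PaddleX API:

```python
def parse_markdown_table(md: str) -> list[list[str]]:
    """Parse a GitHub-style Markdown table into a list of cell rows.

    The header separator line (| --- | --- | ...) is skipped, so
    rows[0] is the header and rows[1:] are the data rows.
    """
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # A separator row contains only '-' and ':' characters
        if all(c and set(c) <= {"-", ":"} for c in cells):
            continue
        rows.append(cells)
    return rows


table_md = "| 名次 | 金牌 |\n| --- | --- |\n| 1 | 48 |\n| 2 | 36 |\n"
rows = parse_markdown_table(table_md)
# rows[0] is the header; rows[1:] are data rows
```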

The related methods and parameters are explained as follows:

  • create_model instantiates the document visual-language model (using PP-DocBee-2B as an example), detailed as follows:
| Parameter | Description | Type | Options | Default |
| --- | --- | --- | --- | --- |
| model_name | Model name | str | None | None |
| model_dir | Model storage path | str | None | None |
| device | Model inference device | str | Supports specifying a specific GPU card, e.g. "gpu:0", other hardware cards, e.g. "npu:0", or "cpu" for CPU | gpu:0 |
| use_hpip | Whether to enable the high-performance inference plugin (not yet supported) | bool | None | False |
| hpi_config | High-performance inference configuration (not yet supported) | dict \| None | None | None |
  • model_name must be specified. When only model_name is given, the default model parameters built into PaddleX are used; if model_dir is also specified, the user-defined model is used instead.

  • The predict() method of the document visual-language model is called for inference prediction. Its parameters are input and batch_size, detailed as follows:

| Parameter | Description | Type | Options | Default |
| --- | --- | --- | --- | --- |
| input | Data to be predicted | dict | Depends on the specific model, since multimodal models differ in their input requirements. For the PP-DocBee series, the format is {'image': image_path, 'query': query_text} | None |
| batch_size | Batch size | int | Integer | 1 |
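The PP-DocBee input dict described above can be assembled with a small helper that fails early on malformed requests. The `make_docbee_input` function below is hypothetical, shown only to illustrate the expected shape of the input:

```python
from pathlib import Path


def make_docbee_input(image_path: str, query: str) -> dict:
    """Build the {'image': ..., 'query': ...} dict expected by PP-DocBee models.

    Raises ValueError on an empty query so a malformed request is
    rejected before it reaches predict().
    """
    if not query.strip():
        raise ValueError("query must be a non-empty string")
    return {"image": str(Path(image_path)), "query": query}


sample = make_docbee_input("medal_table.png", "识别这份表格的内容, 以markdown格式输出")
# sample can be passed directly as the `input` argument of predict()
```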
  • The prediction results are processed, with the result for each sample being a corresponding Result object that supports operations such as printing and saving as a JSON file:

| Method | Description | Parameter | Type | Parameter Description | Default |
| --- | --- | --- | --- | --- | --- |
| print() | Print results to the terminal | format_json | bool | Whether to format the output with JSON indentation | True |
| | | indent | int | Indentation level, for more readable JSON output; effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Whether to escape non-ASCII characters to Unicode. When True, all non-ASCII characters are escaped; False retains the original characters. Effective only when format_json is True | False |
| save_to_json() | Save results as a JSON file | save_path | str | Path for saving the file. When a directory is specified, the saved file name matches the input file type | None |
| | | indent | int | Indentation level, for more readable JSON output; effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Whether to escape non-ASCII characters to Unicode. When True, all non-ASCII characters are escaped; False retains the original characters. Effective only when format_json is True | False |
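The indent and ensure_ascii options behave like the arguments of the same names in Python's json.dumps. A quick stdlib-only illustration of their effect (not calling PaddleX itself; the sample dict is made up):

```python
import json

# A made-up result dict standing in for a Result object's contents
res = {"query": "识别表格", "result": "| 名次 |"}

# ensure_ascii=True escapes non-ASCII characters to \uXXXX sequences
escaped = json.dumps(res, indent=4, ensure_ascii=True)

# ensure_ascii=False keeps the original characters, matching the
# module's default and producing readable Chinese output
readable = json.dumps(res, indent=4, ensure_ascii=False)

assert "\\u" in escaped
assert "识别表格" in readable
```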
  • In addition, the prediction results can be accessed through attributes, as follows:

| Attribute | Description |
| --- | --- |
| json | Get the prediction results in JSON format |

For more usage instructions on the API for single model inference in PaddleX, refer to Instructions for Using PaddleX Single Model Python API.

4. Secondary Development

The current module does not yet support fine-tuning and only supports inference integration. Support for fine-tuning this module is planned for the future.