
Document Visual Language Model Module Tutorial

I. Overview

Document visual language models are a cutting-edge multimodal technology that addresses the limitations of traditional document processing. Traditional methods are often limited to documents in specific formats or predefined categories, whereas document visual language models integrate visual and linguistic information to understand and handle diverse document content. By combining computer vision and natural language processing, these models can recognize images, text, and the relationships between them, and even understand semantic information within complex layout structures. This makes document processing more intelligent and flexible, with stronger generalization, and shows broad application prospects in automated office work, information extraction, and other fields.

II. Supported Model List

| Model | Model Download Link | Model Storage Size (GB) | Total Score | Description |
| --- | --- | --- | --- | --- |
| PP-DocBee-2B | Inference Model | 4.2 | 765 | PP-DocBee is a multimodal large model developed by the PaddlePaddle team that focuses on document understanding and performs excellently on Chinese document understanding tasks. The model is fine-tuned and optimized on nearly 5 million multimodal document-understanding samples, covering general VQA, OCR, charts, text-rich documents, mathematics and complex reasoning, synthetic data, and pure-text data, with different training data ratios. On several authoritative English document understanding benchmarks in academia, PP-DocBee has essentially achieved SOTA among models of the same parameter scale. On internal business metrics for Chinese scenarios, PP-DocBee also outperforms the currently popular open-source and closed-source models. |
| PP-DocBee-7B | Inference Model | 15.8 | - | |
| PP-DocBee2-3B | Inference Model | 7.6 | 852 | PP-DocBee2 is a multimodal large model developed by the PaddlePaddle team that further optimizes the base model on top of PP-DocBee and introduces a new data optimization scheme to improve data quality. Using only 470,000 samples generated by a self-developed data synthesis strategy, PP-DocBee2 performs better on Chinese document understanding tasks. On internal business metrics for Chinese scenarios, PP-DocBee2 improves by about 11.4% over PP-DocBee and also outperforms the currently popular open-source and closed-source models of comparable scale. |

Note: The total scores above are test results on an internal evaluation set, in which all images have a resolution (height, width) of (1680, 1204), with 1196 entries in total, covering scenarios such as financial reports, laws and regulations, scientific and technical papers, manuals, humanities papers, contracts, and research reports. This evaluation set is not planned for public release at the moment.

III. Quick Start

❗ Before starting quickly, please install the PaddleOCR wheel package. For details, please refer to the Installation Guide.

You can quickly experience it with a single command (the Chinese query asks the model to recognize the table's content and output it in Markdown):

```bash
paddleocr doc_vlm -i "{'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出'}"
```

You can also integrate the document visual language model module's inference into your own project. Before running the code below, download the sample image to your local machine.
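For example, the sample image can be fetched with the Python standard library (the URL is taken from the CLI example above):

```python
import urllib.request

# Download the demo image used throughout this tutorial
urllib.request.urlretrieve(
    "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png",
    "medal_table.png",
)
```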

```python
from paddleocr import DocVLM

model = DocVLM(model_name="PP-DocBee2-3B")
results = model.predict(
    # query: "Recognize the content of this table and output it in markdown format"
    input={"image": "medal_table.png", "query": "识别这份表格的内容, 以markdown格式输出"},
    batch_size=1,
)
for res in results:
    res.print()                             # print the result to the terminal
    res.save_to_json("./output/res.json")   # save the result as a JSON file
```

After running, the result is:

```
{'res': {'image': 'medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出', 'result': '| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国(CHN) | 48 | 22 | 30 | 100 |\n| 2 | 美国(USA) | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 |\n| 4 | 英国(GBR) | 19 | 13 | 19 | 51 |\n| 5 | 德国(GER) | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 |\n| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 |\n| 8 | 日本(JPN) | 9 | 8 | 8 | 25 |\n| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 |\n| 10 | 法国(FRA) | 7 | 16 | 20 | 43 |\n| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 |\n| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 |\n| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 |\n'}}
```

The meanings of the result fields are as follows:

- image: the path of the input image to be predicted
- query: the input text to be predicted (the question posed to the model)
- result: the model's prediction result

The visualization of the prediction result is as follows:

| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |
| --- | --- | --- | --- | --- | --- |
| 1 | 中国(CHN) | 48 | 22 | 30 | 100 |
| 2 | 美国(USA) | 36 | 39 | 37 | 112 |
| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 |
| 4 | 英国(GBR) | 19 | 13 | 19 | 51 |
| 5 | 德国(GER) | 16 | 11 | 14 | 41 |
| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 |
| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 |
| 8 | 日本(JPN) | 9 | 8 | 8 | 25 |
| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 |
| 10 | 法国(FRA) | 7 | 16 | 20 | 43 |
| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 |
| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 |
| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 |
| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 |
| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 |

Explanations of related methods, parameters, etc., are as follows:

  • DocVLM instantiates the document visual language model (taking PP-DocBee-2B as an example), with the following parameters:

| Parameter | Description | Type | Options | Default |
| --- | --- | --- | --- | --- |
| model_name | Model name | str | None | None |
| model_dir | Model storage path | str | None | None |
| device | Model inference device | str | Supports specifying a specific GPU card number, such as "gpu:0", specific card numbers for other hardware, such as "npu:0", or the CPU, such as "cpu" | gpu:0 |
| use_hpip | Whether to enable the high-performance inference plugin. Currently not supported. | bool | None | False |
| hpi_config | High-performance inference configuration. Currently not supported. | dict \| None | None | None |
  • Among them, model_name must be specified. Once model_name is specified, the built-in PaddleX model parameters are used by default; if model_dir is also specified, the user-defined model at that path is used instead, as in the sketch below.
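A minimal sketch of loading a model from a local directory on a specific device (the local path below is a hypothetical placeholder):

```python
from paddleocr import DocVLM

# model_dir points to a user-provided model directory (hypothetical path);
# device pins inference to the first GPU card.
model = DocVLM(
    model_name="PP-DocBee2-3B",
    model_dir="./my_models/PP-DocBee2-3B",
    device="gpu:0",
)
```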

  • Call the predict() method of the document visual language model for inference. This method returns a list of results. The module also provides a predict_iter() method; the two accept identical parameters and return identical results, except that predict_iter() returns a generator that processes and yields predictions incrementally, which is useful for large datasets or for saving memory (see the sketch after the table below). Choose whichever fits your needs. The predict() method takes the parameters input and batch_size, described below:

| Parameter | Description | Type | Options | Default |
| --- | --- | --- | --- | --- |
| input | Data to be predicted | dict | A dict whose format depends on the specific model, since multimodal models have different input requirements. For the PP-DocBee series, the format is {'image': image_path, 'query': query_text} | None |
| batch_size | Batch size | int | Any integer | 1 |
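A minimal sketch of the generator-based variant, reusing the sample image from the quick start:

```python
from paddleocr import DocVLM

model = DocVLM(model_name="PP-DocBee2-3B")

# predict_iter() yields results one at a time instead of building the full
# list up front, keeping memory usage flat across many inputs.
for res in model.predict_iter(
    input={"image": "medal_table.png", "query": "识别这份表格的内容, 以markdown格式输出"},
    batch_size=1,
):
    res.print()
```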
  • Process the prediction results. The prediction result for each sample is a corresponding Result object, which supports printing and saving as a JSON file:

| Method | Method Description | Parameter | Parameter Type | Parameter Description | Default |
| --- | --- | --- | --- | --- | --- |
| print() | Print results to the terminal | format_json | bool | Whether to format the output with JSON indentation | True |
| | | indent | int | Indentation level used to beautify the JSON output for readability; effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Whether to escape non-ASCII characters to Unicode. When True, all non-ASCII characters are escaped; when False, the original characters are kept. Effective only when format_json is True | False |
| save_to_json() | Save the result as a JSON file | save_path | str | Path of the file to save. When it is a directory, the saved file is named consistently with the input file | None |
| | | indent | int | Indentation level used to beautify the JSON output for readability; effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Whether to escape non-ASCII characters to Unicode. When True, all non-ASCII characters are escaped; when False, the original characters are kept. Effective only when format_json is True | False |
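For example, the loop from the quick start could pass these parameters explicitly (a sketch, assuming the same `results` object from the earlier example):

```python
for res in results:
    # Pretty-print to the terminal, keeping non-ASCII (e.g. Chinese) characters as-is
    res.print(format_json=True, indent=4, ensure_ascii=False)
    # save_path is a directory here, so the file is named after the input image
    res.save_to_json(save_path="./output/")
```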
  • Additionally, prediction results can also be obtained through attributes, as follows:

| Attribute | Description |
| --- | --- |
| json | Get the prediction result in json format |
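A minimal sketch of attribute access, assuming the result structure shown in the printed output above:

```python
for res in results:
    data = res.json               # prediction result, structured as in the output above
    print(data["res"]["result"])  # the model's Markdown answer
```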

IV. Secondary Development

This module does not currently support fine-tuning; only inference integration is available. Support for fine-tuning this module is planned for the future.

V. FAQ
