Document Understanding Pipeline User Guide¶

1. Introduction to Document Understanding Pipeline¶

The Document Understanding Pipeline is an advanced document processing technology based on Vision-Language Models (VLM), designed to overcome the limitations of traditional document processing. Traditional methods rely on fixed templates or predefined rules to parse documents, but this pipeline uses the multimodal capabilities of VLM to accurately answer user questions by integrating visual and linguistic information with just the input of document images and user queries. This technology does not require pre-training for specific document formats, allowing it to flexibly handle diverse document content and significantly enhance the generalization and practicality of document processing. It has broad application prospects in scenarios such as intelligent Q&A and information extraction. Currently, this pipeline does not support secondary development of VLM models, but future support is planned.

The Document Understanding Pipeline includes document-based vision-language model modules. You can choose the model to use based on the benchmark test data below.

If you prioritize model accuracy, choose a model with higher accuracy; if you care more about inference speed, choose a faster model; if you are concerned about storage size, choose a model with a smaller storage footprint.

Document-based Vision-Language Model Modules (Optional):

Model	Model Download Link	Model Storage Size (GB)	Description
PP-DocBee-2B	Inference Model	4.2	PP-DocBee is a self-developed multimodal large model by the PaddlePaddle team, focusing on document understanding with excellent performance on Chinese document understanding tasks. The model is fine-tuned with nearly 5 million multimodal datasets for document understanding, including general VQA, OCR, chart, text-rich documents, mathematics and complex reasoning, synthetic data, and pure text data, with different training data ratios. On several authoritative English document understanding evaluation benchmarks in academia, PP-DocBee has achieved SOTA for models of the same parameter scale. In internal business Chinese scenarios, PP-DocBee also outperforms current popular open and closed-source models.
PP-DocBee-7B	Inference Model	15.8

2. Quick Start¶

2.1 Local Experience¶

❗ Before using the Document Understanding Pipeline locally, ensure you have installed the PaddleX wheel package according to the PaddleX Local Installation Guide. If you wish to selectively install dependencies, refer to the relevant instructions in the installation guide. The dependency group for this pipeline is multimodal.

2.1.1 Integration via Python Script¶

The Document Understanding Pipeline can be quickly inferred with just a few lines of code as shown below:

from paddlex import create_pipeline
pipeline = create_pipeline(pipeline="doc_understanding")
output = pipeline.predict(
    {
        "image": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png",
        "query": "Identify the contents of this table"
    }
)
for res in output:
    res.print()
    res.save_to_json("./output/")

In the above Python script, the following steps are performed:

Instantiate the Document Understanding Pipeline object through create_pipeline(). The parameter details are as follows:

Parameter	Description	Type	Default
`pipeline`	Pipeline name or configuration file path. If it's a pipeline name, it must be a pipeline supported by PaddleX.	`str`	`None`
`config`	Specific configuration information for the pipeline (if set simultaneously with `pipeline`, it has a higher priority and requires the pipeline name to be consistent with `pipeline`).	`dict[str, Any]`	`None`
`device`	Inference device for the pipeline. Supports specifying a specific GPU card number, such as "gpu:0", or other hardware card numbers, like "npu:0", or CPU as "cpu".	`str`	`None`
`use_hpip`	Whether to enable the high-performance inference plugin. If set to `None`, the setting from the configuration file will be used. Not supported for now.	`bool`	None	`None`
`hpi_config`	High-performance inference configuration. Not supported for now.	`dict` \| `None`	None	`None`

Call the predict() method of the Document Understanding Pipeline object for inference prediction. This method returns a generator. Below are the parameters of the predict() method and their descriptions:

Parameter	Description	Type	Options	Default
`input`	Data to be predicted, currently only supports dictionary-type input	`Python Dict`	Python Dict: For PP-DocBee, the input format is: `{"image":/path/to/image, "query": user question}`, representing the input image and the corresponding user question.	`None`
`device`	Inference device for the pipeline	`str\|None`	CPU: e.g., `cpu` for CPU inference; GPU: e.g., `gpu:0` for inference on the first GPU; NPU: e.g., `npu:0` for inference on the first NPU; XPU: e.g., `xpu:0` for inference on the first XPU; MLU: e.g., `mlu:0` for inference on the first MLU; DCU: e.g., `dcu:0` for inference on the first DCU; None: If set to `None`, the default value of this parameter initialized by the pipeline will be used. During initialization, it will preferentially use the local GPU 0 device if available, otherwise the CPU device will be used;	`None`

Process the prediction results. The prediction result for each sample is a corresponding Result object, and supports operations such as printing and saving as a json file:

Method	Description	Parameter	Type	Description	Default
`print()`	Print results to the terminal	`format_json`	`bool`	Whether to format the output content with `JSON` indentation	`True`
		`indent`	`int`	Specify indentation level to beautify the `JSON` output, making it more readable, effective only when `format_json` is `True`	4
		`ensure_ascii`	`bool`	Control whether to escape non-`ASCII` characters to `Unicode`. When set to `True`, all non-`ASCII` characters will be escaped; `False` retains the original characters, effective only when `format_json` is `True`	`False`
`save_to_json()`	Save results as a json format file	`save_path`	`str`	Path to save the file. If it is a directory, the saved file name is consistent with the input file type name	None
		`indent`	`int`	Specify indentation level to beautify the `JSON` output, making it more readable, effective only when `format_json` is `True`	4
		`ensure_ascii`	`bool`	Control whether to escape non-`ASCII` characters to `Unicode`. When set to `True`, all non-`ASCII` characters will be escaped; `False` retains the original characters, effective only when `format_json` is `True`	`False`

Calling the print() method will print the results to the terminal. The printed content includes:
image: (str) The input path of the image
query: (str) The question related to the input image
result: (str) The output result from the model
Calling the save_to_json() method will save the above content to the specified save_path. If specified as a directory, the saved path will be save_path/{your_img_basename}_res.json. If specified as a file, it will be directly saved to that file.
Additionally, you can also access the visualized image with results and prediction results through attributes, as follows:

Attribute	Description
`json`	Get the prediction result in `json` format
`img`	Get the visualized image in `dict` format

The prediction result obtained from the json attribute is data of type dict, and the related content is consistent with the content saved by calling the save_to_json() method.

Additionally, you can obtain the configuration file for the Document Understanding Pipeline and load the configuration file for prediction. You can execute the following command to save the result in my_path:

paddlex --get_pipeline_config doc_understanding --save_path ./my_path

If you have obtained the configuration file, you can customize various configurations of the Document Understanding Pipeline, just modify the pipeline parameter value in the create_pipeline method to the path of the pipeline configuration file. For example:

from paddlex import create_pipeline
pipeline = create_pipeline(pipeline="./my_path/doc_understanding.yaml")
output = pipeline.predict(
    {
        "image": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png",
        "query": "Identify the contents of this table"
    }
)
for res in output:
    res.print()
    res.save_to_json("./output/")

Note: The parameters in the configuration file are the initialization parameters of the pipeline. If you want to change the initialization parameters of the Document Understanding Pipeline, you can directly modify the parameters in the configuration file and load the configuration file for prediction. Meanwhile, CLI prediction also supports passing in the configuration file, --pipeline specifies the path of the configuration file.

3. Development Integration/Deployment¶

If the pipeline meets your requirements for inference speed and accuracy, you can directly proceed to development integration/deployment.

If you need to directly apply the pipeline in your Python project, you can refer to the example code in 2.1.2 Python Script Method.

In addition, PaddleX also provides three other deployment methods, as detailed below:

🚀 High-Performance Inference (This pipeline does not support it currently): In real-world production environments, many applications have strict standards for performance indicators of deployment strategies (especially response speed) to ensure efficient system operation and smooth user experience. To this end, PaddleX provides a high-performance inference plugin aimed at deeply optimizing model inference and pre-post processing to achieve significant speed improvements in end-to-end processes. For detailed high-performance inference processes, please refer to the PaddleX High-Performance Inference Guide.

☁️ Service Deployment: Service deployment is a common form of deployment in real-world production environments. By encapsulating inference functions into services, clients can access these services via network requests to obtain inference results. PaddleX supports various pipeline service deployment solutions. For detailed pipeline service deployment processes, please refer to the PaddleX Service Deployment Guide.

Below is a basic service deployment API reference and multilingual service call example:

API Reference

Main operations provided by the service:

The HTTP request method is POST.
Both the request and response bodies are in JSON format (JSON object).
When the request is processed successfully, the response status code is 200, and the response body contains the following attributes:

Name	Type	Meaning
`logId`	`string`	The UUID of the request.
`errorCode`	`integer`	Error code. Fixed at `0`.
`errorMsg`	`string`	Error message. Fixed at `"Success"`.
`result`	`object`	Operation result.

When the request is not processed successfully, the response body contains the following attributes:

Name	Type	Meaning
`logId`	`string`	The UUID of the request.
`errorCode`	`integer`	Error code. Same as the response status code.
`errorMsg`	`string`	Error message.

The main operations provided by the service are as follows:

infer

Generate a response based on the input message.

POST /document-understanding

Note: The above interface is an alias for /chat/completion and is compatible with OpenAI's interface.

The request body contains the following attributes:

Name	Type	Meaning	Required	Default
`model`	`string`	Name of the model to use	Required	-
`messages`	`array`	List of dialogue messages	Required	-
`max_tokens`	`integer`	Maximum number of tokens to generate	Optional	1024
`temperature`	`float`	Sampling temperature	Optional	0.1
`top_p`	`float`	Top probability for nucleus sampling	Optional	0.95
`stream`	`boolean`	Whether to output in a streaming manner	Optional	false
`max_image_tokens`	`int`	Maximum number of tokens of input image	Optional	None

Each element in messages is an object with the following attributes:

Name	Type	Meaning	Required
`role`	`string`	Role of the message (user/assistant/system)	Required
`content`	`string` or `array`	Message content (text or mixed content)	Required

When content is an array, each element is an object with the following attributes:

Name	Type	Meaning	Required	Default
`type`	`string`	Content type (text/image_url)	Required	-
`text`	`string`	Text content (when type is text)	Conditionally Required	-
`image_url`	`string` or `object`	Image URL or object (when type is image_url)	Conditionally Required	-

When image_url is an object, it contains the following attributes:

Name	Type	Meaning	Required	Default
`url`	`string`	Image URL	Required	-
`detail`	`string`	Image detail processing method (low/high/auto)	Optional	auto

When the request is processed successfully, the result in the response body contains the following attributes:

Name	Type	Meaning
`id`	`string`	Request ID
`object`	`string`	Object type (chat.completion)
`created`	`integer`	Creation timestamp
`choices`	`array`	Generated result options
`usage`	`object`	Token usage details

Each element in choices is a Choice object with the following attributes:

Name	Type	Meaning	Optional Values
`finish_reason`	`string`	Reason the model stopped generating tokens	`stop` (naturally stopped) `length` (reached max token count) `tool_calls` (called a tool) `content_filter` (content filtered) `function_call` (called a function, deprecated)
`index`	`integer`	Index of the option in the list	-
`logprobs`	`object` \| `null`	Log probability information of the option	-
`message`	`ChatCompletionMessage`	Chat message generated by the model	-

The message object contains the following attributes:

Name	Type	Meaning	Note
`content`	`string` \| `null`	Message content	May be null
`refusal`	`string` \| `null`	Refusal message generated by the model	Provided when content is refused
`role`	`string`	Role of the message author	Fixed at `"assistant"`
`audio`	`object` \| `null`	Audio output data	Provided when audio output is requested Learn more
`function_call`	`object` \| `null`	Name and parameters of the function to be called	Deprecated, recommended to use `tool_calls`
`tool_calls`	`array` \| `null`	Tool calls generated by the model	e.g., function calls, etc.

The usage object contains the following attributes:

Name	Type	Meaning
`prompt_tokens`	`integer`	Number of prompt tokens
`completion_tokens`	`integer`	Number of generated tokens
`total_tokens`	`integer`	Total number of tokens

An example of result is shown below:

{
    "id": "ed960013-eb19-43fa-b826-3c1b59657e35",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国（CHN） | 48 | 22 | 30 | 100 |\n| 2 | 美国（USA） | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯（RUS） | 24 | 13 | 23 | 60 |\n| 4 | 英国（GBR） | 19 | 13 | 19 | 51 |\n| 5 | 德国（GER） | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚（AUS） | 14 | 15 | 17 | 46 |\n| 7 | 韩国（KOR） | 13 | 11 | 8 | 32 |\n| 8 | 日本（JPN） | 9 | 8 | 8 | 25 |\n| 9 | 意大利（ITA） | 8 | 9 | 10 | 27 |\n| 10 | 法国（FRA） | 7 | 16 | 20 | 43 |\n| 11 | 荷兰（NED） | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰（UKR） | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚（KEN） | 6 | 4 | 6 | 16 |\n| 14 | 西班牙（ESP） | 5 | 11 | 3 | 19 |\n| 15 | 牙买加（JAM） | 5 | 4 | 2 | 11 |\n",
                "role": "assistant"
            }
        }
    ],
    "created": 1745218041,
    "model": "pp-docbee",
    "object": "chat.completion"
}

Examples of Multilingual Service Calls

Python

import base64
from openai import OpenAI

API_BASE_URL = "http://0.0.0.0:8080"

# Initialize OpenAI client
client = OpenAI(
    api_key='xxxxxxxxx',
    base_url=f'{API_BASE_URL}'
)

# Function to convert image to base64
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Input image path
image_path = "medal_table.png"

# Convert the original image to base64
base64_image = encode_image(image_path)

# Submit information to the PP-DocBee model
response = client.chat.completions.create(
    model="pp-docbee", # Select model
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content":[
                {
                    "type": "text",
                    "text": "Recognize the content of this table and output the content in HTML format."
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                },
            ]
        },
    ],
)
content = response.choices[0].message.content
print('Reply:', content)

📱 Edge Deployment: Edge deployment is a way of placing computing and data processing functions on the user device itself, allowing the device to process data directly without relying on remote servers. PaddleX supports deploying models on edge devices such as Android. For detailed edge deployment processes, please refer to the PaddleX Edge Deployment Guide. You can choose the appropriate deployment method for your needs and proceed with subsequent AI application integration.

4. Secondary Development¶

Currently, this pipeline does not support fine-tuning training and only supports inference integration. Future support for fine-tuning training is planned.

5. Multi-Hardware Support¶

Currently, this pipeline only supports GPU and CPU inference. Future support for more hardware adaptations is planned.