Tutorial for PaddleOCR-VL Series Pipelines

PaddleOCR-VL is a SOTA and resource-efficient model tailored for document parsing. Taking the first version as an example, its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. The PaddleOCR-VL series efficiently supports 109 languages and excels in recognizing complex elements such as text, tables, formulas, and charts, while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing pipeline-based and multimodal document parsing solutions, is competitive with advanced general-purpose multimodal large models, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.

On January 29, 2026, we released PaddleOCR-VL-1.5. It raises accuracy on the OmniDocBench v1.5 evaluation set to 94.5% and adds support for irregular-shaped bounding box localization. As a result, PaddleOCR-VL-1.5 performs well in real-world scenarios such as skew, warping, screen photography, illumination changes, and scanning. The model also adds seal (stamp) recognition and text detection and recognition, with key metrics continuing to lead the industry.

On May 28, 2026, we released PaddleOCR-VL-1.6. With an accuracy of 96.3%, PaddleOCR-VL-1.6 once again set a new benchmark on OmniDocBench v1.6, while also achieving new state-of-the-art (SOTA) results on OmniDocBench v1.5 and Real5-OmniDocBench. It delivers industry-leading performance in text, formula, and table recognition across both open-source and proprietary solutions. In addition, the model shows substantial improvements in ancient document and rare character recognition, as well as significantly enhanced capabilities in multiple scenarios such as seal recognition, spotting, and chart understanding. The model architecture remains fully consistent with PaddleOCR-VL-1.5, enabling seamless migration at zero cost.

This document applies to the PaddleOCR-VL series pipelines in PaddleX. PaddleX registers the PaddleOCR-VL series as independent top-level pipelines. They are used in basically the same way, but their default configurations and models are different.

Pipeline name	Layout analysis model	VLM model
`PaddleOCR-VL`	`PP-DocLayoutV2`	`PaddleOCR-VL-0.9B`
`PaddleOCR-VL-1.5`	`PP-DocLayoutV3`	`PaddleOCR-VL-1.5-0.9B`
`PaddleOCR-VL-1.6`	`PP-DocLayoutV3`	`PaddleOCR-VL-1.6-0.9B`

A PaddleOCR-VL pipeline consists of layout analysis, region cropping, reading-order handling, VLM recognition, and result assembly. PaddleOCR-VL-0.9B, PaddleOCR-VL-1.5-0.9B, and PaddleOCR-VL-1.6-0.9B are VLM submodels inside the pipelines; they are not complete PaddleX pipelines. If you only start or call a VLM inference service, you are only running the VLM recognition stage, which does not provide the full pipeline capability.

1. Environment Preparation

To use the PaddleOCR-VL series pipelines, install PaddleX and the inference engine you want to use, for example:

python -m pip install paddlepaddle-gpu==3.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install "paddlex[ocr]"
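After installation, it can help to confirm that both packages are importable before running a pipeline. The following is a minimal sketch that uses only the Python standard library; `missing_packages` is a hypothetical helper name, and the check assumes the import names `paddle` and `paddlex`. It only tests that the packages can be found, not that the installed build matches your GPU or CUDA version.

```python
import importlib.util


def missing_packages(names=("paddle", "paddlex")):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print(f"Missing packages: {missing}")
    else:
        print("All required packages were found.")
```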

2. Quick Start

The PaddleOCR-VL series pipelines can be used in two ways: through the command line interface (CLI) and through the Python API. The CLI is simpler and suited to quickly verifying functionality, while the Python API is more flexible and suited to integration into existing projects. The following examples use PaddleOCR-VL-1.6 as the primary pipeline.

2.1 Command Line Usage

paddlex --pipeline PaddleOCR-VL-1.6 --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png

# Use --use_doc_orientation_classify to enable document orientation classification
paddlex --pipeline PaddleOCR-VL-1.6 --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png --use_doc_orientation_classify True

# Use --use_doc_unwarping to enable document unwarping module
paddlex --pipeline PaddleOCR-VL-1.6 --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png --use_doc_unwarping True

# Set --use_layout_detection to False to disable layout detection
paddlex --pipeline PaddleOCR-VL-1.6 --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png --use_layout_detection False
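When these commands are generated from a script, it is easy to get the quoting or the boolean flags wrong. The sketch below assembles an equivalent command string with the standard library. `build_paddlex_command` is a hypothetical helper, not part of PaddleX, and `./demo.png` is a placeholder path; the flag names are the ones shown in the commands above plus `--save_path`.

```python
import shlex


def build_paddlex_command(pipeline, input_path, **flags):
    """Build a `paddlex` CLI invocation as a single shell-quoted string."""
    argv = ["paddlex", "--pipeline", pipeline, "--input", input_path]
    for name, value in flags.items():
        if value is None:
            continue  # leave the flag out so the pipeline default applies
        argv += [f"--{name}", str(value)]
    return shlex.join(argv)


cmd = build_paddlex_command(
    "PaddleOCR-VL-1.6",
    "./demo.png",  # placeholder input path
    use_doc_orientation_classify=True,
    use_layout_detection=None,  # omitted from the command
    save_path="./output",
)
print(cmd)
```

The resulting string can be pasted into a shell or split again with `shlex.split()` and passed to `subprocess.run()`.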

The command line supports more parameters.

The following table lists the PaddleOCR-VL series pipeline prediction parameters currently supported by the PaddleX CLI. Common parameters also include `--input`, `--save_path`, `--engine`, `--device`, `--use_hpip`, and `--hpi_config`. For complex `engine_config`, write it into a pipeline YAML file and pass the file through `--pipeline`.

Parameter	Description	Type
`input`	Data to be predicted, required. It can be an image or PDF file path, a URL, or a local directory containing images. Directories containing PDF files are not currently supported; PDF files must be specified by file path.	`str`
`save_path`	Path for saving inference results. If not set, inference results will not be saved locally.	`str`
`use_doc_orientation_classify`	Whether to use the document orientation classification module.	`bool`
`use_doc_unwarping`	Whether to use the document unwarping module.	`bool`
`use_layout_detection`	Whether to use the layout detection and ordering module.	`bool`
`use_chart_recognition`	Whether to use chart parsing.	`bool`
`layout_threshold`	Score threshold for the layout model. It can be a float or a dictionary keyed by class ID.	`float\|dict`
`layout_nms`	Whether to use NMS post-processing for layout analysis.	`bool`
`layout_unclip_ratio`	Expansion ratio for layout detection boxes. It can be a float, a tuple, or a dictionary keyed by class ID.	`float\|tuple\|dict`
`layout_merge_bboxes_mode`	Merge mode for layout detection boxes. Supported values include `large`, `small`, and `union`; it can also be configured by class ID.	`str\|dict`
`use_queues`	Whether to enable internal queues. When enabled, PDF rendering, layout analysis, and VLM inference can run asynchronously in separate threads.	`bool`
`prompt_label`	Prompt type for the VLM. It only takes effect when `use_layout_detection=False`.	`str`
`format_block_content`	Whether to format `block_content` as Markdown. When set to `True`, image-type blocks may include image path information in `block_content`.	`bool`
`repetition_penalty`	Repetition penalty used in VLM sampling.	`float`
`temperature`	Temperature used in VLM sampling.	`float`
`top_p`	Top-p parameter used in VLM sampling.	`float`
`min_pixels`	Minimum number of pixels allowed during VLM image preprocessing.	`int`
`max_pixels`	Maximum number of pixels allowed during VLM image preprocessing.	`int`

The inference result will be printed in the terminal. The default output of the PaddleOCR-VL pipeline is as follows:

{'res': {'input_path': 'paddleocr_vl_demo.png', 'page_index': None, 'model_settings': {'use_doc_preprocessor': False, 'use_layout_detection': True, 'use_chart_recognition': False, 'format_block_content': False}, 'layout_det_res': {'input_path': None, 'page_index': None, 'boxes': [{'cls_id': 6, 'label': 'doc_title', 'score': 0.9636914134025574, 'coordinate': [np.float32(131.31366), np.float32(36.450516), np.float32(1384.522), np.float32(127.984665)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9281806349754333, 'coordinate': [np.float32(585.39465), np.float32(158.438), np.float32(930.2184), np.float32(182.57469)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9840355515480042, 'coordinate': [np.float32(9.023666), np.float32(200.86115), np.float32(361.41583), np.float32(343.8828)]}, {'cls_id': 14, 'label': 'image', 'score': 0.9871416091918945, 'coordinate': [np.float32(775.50574), np.float32(200.66502), np.float32(1503.3807), np.float32(684.9304)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9801855087280273, 'coordinate': [np.float32(9.532196), np.float32(344.90594), np.float32(361.4413), np.float32(440.8244)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9708921313285828, 'coordinate': [np.float32(28.040405), np.float32(455.87976), np.float32(341.7215), np.float32(520.7117)]}, {'cls_id': 24, 'label': 'vision_footnote', 'score': 0.9002962708473206, 'coordinate': [np.float32(809.0692), np.float32(703.70044), np.float32(1488.3016), np.float32(750.5238)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9825374484062195, 'coordinate': [np.float32(8.896561), np.float32(536.54895), np.float32(361.05237), np.float32(655.8058)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9822263717651367, 'coordinate': [np.float32(8.971573), np.float32(657.4949), np.float32(362.01715), np.float32(774.625)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9767460823059082, 'coordinate': [np.float32(9.407074), np.float32(776.5216), np.float32(361.31067), np.float32(846.82874)]}, {'cls_id': 
22, 'label': 'text', 'score': 0.9868153929710388, 'coordinate': [np.float32(8.669495), np.float32(848.2543), np.float32(361.64703), np.float32(1062.8568)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9826608300209045, 'coordinate': [np.float32(8.8025055), np.float32(1063.8615), np.float32(361.46588), np.float32(1182.8524)]}, {'cls_id': 22, 'label': 'text', 'score': 0.982555627822876, 'coordinate': [np.float32(8.820602), np.float32(1184.4663), np.float32(361.66394), np.float32(1302.4507)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9584776759147644, 'coordinate': [np.float32(9.170288), np.float32(1304.2161), np.float32(361.48898), np.float32(1351.7483)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9782056212425232, 'coordinate': [np.float32(389.1618), np.float32(200.38202), np.float32(742.7591), np.float32(295.65146)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9844875931739807, 'coordinate': [np.float32(388.73303), np.float32(297.18463), np.float32(744.00024), np.float32(441.3034)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9680547714233398, 'coordinate': [np.float32(409.39468), np.float32(455.89386), np.float32(721.7174), np.float32(520.9387)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9741666913032532, 'coordinate': [np.float32(389.71606), np.float32(536.8138), np.float32(742.7112), np.float32(608.00165)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9840384721755981, 'coordinate': [np.float32(389.30988), np.float32(609.39636), np.float32(743.09247), np.float32(750.3231)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9845995306968689, 'coordinate': [np.float32(389.13272), np.float32(751.7772), np.float32(743.058), np.float32(894.8815)]}, {'cls_id': 22, 'label': 'text', 'score': 0.984852135181427, 'coordinate': [np.float32(388.83267), np.float32(896.0371), np.float32(743.58215), np.float32(1038.7345)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9804865717887878, 'coordinate': [np.float32(389.08478), np.float32(1039.9119), np.float32(742.7585), 
np.float32(1134.4897)]}, {'cls_id': 22, 'label': 'text', 'score': 0.986461341381073, 'coordinate': [np.float32(388.52643), np.float32(1135.8137), np.float32(743.451), np.float32(1352.0085)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9869391918182373, 'coordinate': [np.float32(769.8341), np.float32(775.66235), np.float32(1124.9813), np.float32(1063.207)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9822869896888733, 'coordinate': [np.float32(770.30383), np.float32(1063.938), np.float32(1124.8295), np.float32(1184.2192)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9689218997955322, 'coordinate': [np.float32(791.3042), np.float32(1199.3169), np.float32(1104.4521), np.float32(1264.6985)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9713128209114075, 'coordinate': [np.float32(770.4253), np.float32(1279.6072), np.float32(1124.6917), np.float32(1351.8672)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9236552119255066, 'coordinate': [np.float32(1153.9058), np.float32(775.5814), np.float32(1334.0654), np.float32(798.1581)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9857938885688782, 'coordinate': [np.float32(1151.5197), np.float32(799.28015), np.float32(1506.3619), np.float32(991.1156)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9820687174797058, 'coordinate': [np.float32(1151.5686), np.float32(991.91095), np.float32(1506.6023), np.float32(1110.8875)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9866049885749817, 'coordinate': [np.float32(1151.6919), np.float32(1112.1301), np.float32(1507.1611), np.float32(1351.9504)]}]}}}
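The printed result is a nested dictionary, so the layout boxes can be inspected with ordinary Python. The sketch below uses a small hand-written sample in the same shape as the output above (three boxes, plain floats instead of `np.float32`). `filter_boxes` is a hypothetical helper that illustrates what a float threshold versus a per-class dictionary threshold means; it is not how PaddleX applies `layout_threshold` internally, and the fallback of keeping classes missing from the dictionary is an assumption made for this example.

```python
from collections import Counter

# Hand-written sample in the shape of the printed result (values rounded).
sample = {
    "res": {
        "layout_det_res": {
            "boxes": [
                {"cls_id": 6, "label": "doc_title", "score": 0.96,
                 "coordinate": [131.3, 36.5, 1384.5, 128.0]},
                {"cls_id": 22, "label": "text", "score": 0.93,
                 "coordinate": [585.4, 158.4, 930.2, 182.6]},
                {"cls_id": 14, "label": "image", "score": 0.99,
                 "coordinate": [775.5, 200.7, 1503.4, 684.9]},
            ]
        }
    }
}


def filter_boxes(boxes, threshold):
    """Keep boxes whose score meets a float threshold or a per-class dict."""
    kept = []
    for box in boxes:
        if isinstance(threshold, dict):
            # Assumption: classes missing from the dict are always kept.
            limit = threshold.get(box["cls_id"], 0.0)
        else:
            limit = threshold
        if box["score"] >= limit:
            kept.append(box)
    return kept


boxes = sample["res"]["layout_det_res"]["boxes"]
label_counts = Counter(box["label"] for box in boxes)
print(label_counts)
print([box["label"] for box in filter_boxes(boxes, 0.95)])
print([box["label"] for box in filter_boxes(boxes, {22: 0.95})])
```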

For explanation of the result parameters, refer to 2.2 Python Script Integration.

Note: The default model for the pipeline is relatively large, which may result in slower inference. For faster inference, see Section 3, Using VLM Inference Services.

2.2 Python Script Integration

The command line method is for quick testing and visualization. In actual projects, you usually need to integrate the model via code. You can perform pipeline inference with just a few lines of code as shown below:

from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="PaddleOCR-VL-1.6")

output = pipeline.predict(input="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png")

for res in output:
    res.print() # Print the structured prediction output
    res.save_to_json(save_path="output") # Save the current image's structured result in JSON format
    res.save_to_markdown(save_path="output") # Save the current image's result in Markdown format
    res.save_to_word(save_path="output") # Save the current image's result in Word format

For PDF files, each page is processed individually, and a separate Markdown file is generated for each page. To merge tables across pages, reconstruct multi-level titles, or merge multi-page results, use the following method:

from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="PaddleOCR-VL-1.6")

output = pipeline.predict(input="./your_pdf_file.pdf")

pages_res = list(output)

output = pipeline.restructure_pages(pages_res)

# output = pipeline.restructure_pages(pages_res, merge_tables=True) # Merge tables across pages
# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True) # Merge tables across pages and reconstruct multi-level titles
# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True, concatenate_pages=True) # Merge tables across pages, reconstruct multi-level titles, and merge multiple pages
for res in output:
    res.print() # Print the structured prediction output
    res.save_to_json(save_path="output") # Save the current image's structured result in JSON format
    res.save_to_markdown(save_path="output") # Save the current image's result in Markdown format

The above Python script performs the following steps:

(1) Instantiate the PaddleX pipeline object. Specific parameter descriptions are as follows:

In PaddleX, use `create_pipeline()` to create a PaddleOCR-VL series pipeline object. Fine-grained settings such as model names, model directories, and VLM server backends should usually be configured in a pipeline YAML file and then passed through `pipeline` or `config`.

Parameter	Description	Type	Default
`pipeline`	PaddleX pipeline name or path to a pipeline config file. Supported names include `PaddleOCR-VL`, `PaddleOCR-VL-1.5`, and `PaddleOCR-VL-1.6`. A custom YAML file path can also be used.	`str\|None`	`None`
`config`	Pipeline config dictionary. If both `pipeline` and `config` are provided, the `pipeline_name` in `config` takes precedence.	`dict\|None`	`None`
`device`	Device used for inference, such as `cpu`, `gpu:0`, `xpu:0`, `npu:0`, `dcu:0`, or `mlu:0`. Actual availability depends on the local environment and inference engine.	`str\|None`	`None`
`engine`	Inference engine used by the pipeline or model. Different engines support different fields. See Inference Engine And Configuration.	`str\|None`	`None`
`engine_config`	Inference engine configuration. Different engines support different fields. See Inference Engine And Configuration.	`dict\|None`	`None`
`use_hpip`	Whether to enable the high-performance inference plugin. If set to `None`, the setting from the configuration file or `config` will be used.	`bool\|None`	`None`
`hpi_config`	High-performance inference configuration.	`dict\|None`	`None`

(2) Call the PaddleOCR-VL pipeline's `predict()` method to run inference. This method returns a list of results. The pipeline also provides the `predict_iter()` method, which accepts the same parameters and returns the same results, but as a generator that yields prediction results one at a time. This suits large datasets and memory-constrained scenarios. Choose either method based on your needs. Below are the parameters of the `predict()` method and their descriptions:

Parameter	Parameter Description	Parameter Type	Default Value
`input`	Data to be predicted; required. Supported types: (1) Python variable, such as a `numpy.ndarray` holding image data; (2) `str`: the local path of an image or PDF file (e.g., `/root/data/img.jpg`), the URL of an image or PDF file, or a local directory containing the images to be predicted (e.g., `/root/data/`; directories containing PDF files are not currently supported, so PDF files must be specified by file path); (3) `list`: elements of the types above, such as `[numpy.ndarray, numpy.ndarray]`, `["/root/data/img1.jpg", "/root/data/img2.jpg"]`, or `["/root/data1", "/root/data2"]`.	`Python Var\|str\|list`
`use_doc_orientation_classify`	Whether to use the document orientation classification module during inference. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence.	`bool\|None`	`None`
`use_doc_unwarping`	Whether to use the text image rectification module during inference. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence.	`bool\|None`	`None`
`use_layout_detection`	Whether to use the layout region detection and sorting module during inference. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence.	`bool\|None`	`None`
`use_chart_recognition`	Whether to use the chart parsing module during inference. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence.	`bool\|None`	`None`
`use_seal_recognition`	Whether to use the seal recognition function. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence.	`bool\|None`	`None`
`use_ocr_for_image_block`	Whether to perform OCR on text within image blocks. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence.	`bool\|None`	`None`
`layout_threshold`	Score threshold for the layout model; a float or a dictionary keyed by class ID. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence.	`float\|dict\|None`	`None`
`layout_nms`	Whether to use NMS post-processing for layout analysis. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence.	`bool\|None`	`None`
`layout_unclip_ratio`	Expansion ratio for layout detection boxes; a float, a tuple, or a dictionary keyed by class ID. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence.	`float\|Tuple[float,float]\|dict\|None`	`None`
`layout_merge_bboxes_mode`	Merge mode for layout detection boxes. Supported values include `large`, `small`, and `union`; it can also be configured by class ID. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence.	`str\|dict\|None`	`None`
`merge_layout_blocks`	Whether to merge layout detection boxes for cross-column layouts or layouts with staggered top and bottom columns. If not set, the value from initialization is used, which defaults to `True`.	`bool\|None`
`markdown_ignore_labels`	Layout labels to ignore in Markdown output. If not set, the value from initialization is used.	`list\|None`
`layout_shape_mode`	Geometric representation used for layout analysis results, that is, how the boundaries of detected regions (e.g., text blocks, images, tables) are calculated and displayed. Supported values: `rect` (rectangle) outputs an axis-aligned bounding box (x1, y1, x2, y2), suitable for standard horizontally aligned layouts; `quad` (quadrilateral) outputs an arbitrary quadrilateral composed of four vertices, suitable for regions with skew or perspective distortion; `poly` (polygon) outputs a closed contour composed of multiple coordinate points, suitable for irregularly shaped or curved layout elements and offering the highest precision; `auto` (automatic) lets the system select the most appropriate shape representation based on the complexity and confidence of the detected targets.	`str`	`"auto"`
`use_queues`	Used to control whether to enable internal queues. When set to `True`, data loading (such as rendering PDF pages as images), layout analysis model processing, and VLM inference will be executed asynchronously in separate threads, with data passed through queues, thereby improving efficiency. This approach is particularly efficient for PDF documents with many pages or directories containing a large number of images or PDF files.	`bool\|None`	`None`
`prompt_label`	The prompt type for the VL model, which takes effect only when `use_layout_detection=False`. Accepted values are `ocr`, `formula`, `table`, and `chart`.	`str\|None`	`None`
`format_block_content`	The parameter meaning is basically the same as the instantiation parameter. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence. When set to `True`, the `block_content` of image-type blocks will contain image path information (e.g., `<img src="..." />`). When set to `False` (default), the `block_content` of image-type blocks will only contain OCR-recognized text content without image paths. To include image paths in JSON output, set this parameter to `True`.	`bool\|None`	`None`
`repetition_penalty`	The repetition penalty parameter used for VL model sampling.	`float\|None`	`None`
`temperature`	Temperature parameter used for VL model sampling.	`float\|None`	`None`
`top_p`	Top-p parameter used for VL model sampling.	`float\|None`	`None`
`min_pixels`	The minimum number of pixels allowed when the VL model preprocesses images.	`int\|None`	`None`
`max_pixels`	The maximum number of pixels allowed when the VL model preprocesses images.	`int\|None`	`None`
`max_new_tokens`	The maximum number of tokens generated by the VL model.	`int\|None`	`None`
`vlm_extra_args`	Additional configuration parameters for the VLM. The currently supported custom parameters are: `ocr_min_pixels` (minimum resolution for OCR), `ocr_max_pixels` (maximum resolution for OCR), `table_min_pixels` (minimum resolution for tables), `table_max_pixels` (maximum resolution for tables), `chart_min_pixels` (minimum resolution for charts), `chart_max_pixels` (maximum resolution for charts), `formula_min_pixels` (minimum resolution for formulas), `formula_max_pixels` (maximum resolution for formulas), `seal_min_pixels` (minimum resolution for seals), and `seal_max_pixels` (maximum resolution for seals).	`dict\|None`	`None`
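Many of the parameters in the table share one convention: `None` means "use the value set at instantiation", and any other value, including `False`, overrides it for that call. The sketch below illustrates that rule with plain dictionaries. `effective_settings` is a hypothetical helper written for this explanation, not PaddleX code, and the initial values are made up for the example.

```python
def effective_settings(init_settings, call_overrides):
    """Merge call-time overrides into instantiation settings.

    A value of None means "not set", so the instantiation value is kept.
    """
    merged = dict(init_settings)
    for name, value in call_overrides.items():
        if value is not None:
            merged[name] = value
    return merged


# Illustrative instantiation-time values (not the real pipeline defaults).
init = {
    "use_layout_detection": True,
    "use_chart_recognition": False,
    "use_doc_unwarping": False,
}

settings = effective_settings(
    init,
    {
        "use_layout_detection": False,  # explicit False overrides True
        "use_chart_recognition": True,
        "use_doc_unwarping": None,  # None keeps the instantiation value
    },
)
print(settings)
```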

(3) Call the PaddleOCR-VL pipeline's `restructure_pages()` method to restructure the list of per-page results produced by inference. This method returns either the restructured multi-page results or a merged single-page result. Below are the parameters of the `restructure_pages()` method and their descriptions:

Parameter	Description	Type	Default Value
`res_list`	The list of results predicted from a multi-page PDF.	`list\|None`	`None`
`merge_tables`	Whether to merge tables across pages.	`bool`	`True`
`relevel_titles`	Whether to reconstruct multi-level titles.	`bool`	`True`
`concatenate_pages`	Whether to concatenate multi-page results into one page.	`bool`	`False`
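The predict-then-restructure flow can be wrapped in a small function so that the three options are passed explicitly. This is a sketch under stated assumptions: `parse_pdf` is a hypothetical helper, it takes an already created pipeline object as an argument, and it relies only on the `predict()`, `restructure_pages()`, `save_to_json()`, and `save_to_markdown()` calls shown in this tutorial. The keyword defaults mirror the table above.

```python
def parse_pdf(pipeline, pdf_path, out_dir, merge_tables=True,
              relevel_titles=True, concatenate_pages=False):
    """Run a pipeline on one PDF, restructure the pages, and save results.

    Returns the number of result objects that were saved.
    """
    # Collect all per-page results before restructuring.
    pages = list(pipeline.predict(input=pdf_path))
    restructured = pipeline.restructure_pages(
        pages,
        merge_tables=merge_tables,
        relevel_titles=relevel_titles,
        concatenate_pages=concatenate_pages,
    )
    saved = 0
    for res in restructured:
        res.save_to_json(save_path=out_dir)
        res.save_to_markdown(save_path=out_dir)
        saved += 1
    return saved


# Example call (requires PaddleX to be installed):
# from paddlex import create_pipeline
# pipeline = create_pipeline(pipeline="PaddleOCR-VL-1.6")
# parse_pdf(pipeline, "./your_pdf_file.pdf", "output", concatenate_pages=True)
```

Passing the pipeline in as an argument means it is created once and reused across files.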

(4) Process the prediction results: The prediction result for each sample is a corresponding Result object, supporting operations such as printing, saving as an image, and saving as a json file:

Method	Method Description	Parameter	Parameter Type	Parameter Description	Default Value
`print()`	Print results to the terminal	`format_json`	`bool`	Whether to format the output content using `JSON` indentation.	`True`
		`indent`	`int`	Specify the indentation level to beautify the output `JSON` data, making it more readable. Only valid when `format_json` is `True`.	`4`
		`ensure_ascii`	`bool`	Controls whether non-`ASCII` characters are escaped as `Unicode`. When set to `True`, all non-`ASCII` characters are escaped; `False` retains the original characters. Only valid when `format_json` is `True`.	`False`
`save_to_json()`	Save the results as a JSON file	`save_path`	`str`	The file path for saving. When it is a directory, the saved file is named after the input file.	`None`
		`indent`	`int`	Specify the indentation level to beautify the output `JSON` data, making it more readable. Only valid when `format_json` is `True`.	`4`
		`ensure_ascii`	`bool`	Controls whether non-`ASCII` characters are escaped as `Unicode`. When set to `True`, all non-`ASCII` characters are escaped; `False` retains the original characters. Only valid when `format_json` is `True`.	`False`
`save_to_img()`	Save the visualized images of each intermediate module in png format	`save_path`	`str`	The file path for saving, supporting directory or file paths.	`None`
`save_to_markdown()`	Save each page in an image or PDF file as a markdown format file separately	`save_path`	`str`	The file path for saving. When it is a directory, the saved file name will be consistent with the input file type naming	`None`
		`pretty`	`bool`	Whether to beautify the `markdown` output results, centering charts, etc., to make the `markdown` rendering more aesthetically pleasing.	`True`
		`show_formula_number`	`bool`	Control whether to retain formula numbers in `markdown`. When set to `True`, all formula numbers are retained; `False` retains only the formulas	`False`
`save_to_html()`	Save the tables in the file as html format files	`save_path`	`str`	The file path for saving, supporting directory or file paths.	`None`
`save_to_xlsx()`	Save the tables in the file as xlsx format files	`save_path`	`str`	The file path for saving, supporting directory or file paths.	`None`
`save_to_word()`	Save the layout parsing result as a Word (.docx) format file	`save_path`	`str`	The file path for saving, supporting directory or file paths.	`None`
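When several output formats are needed, the `save_to_*()` methods in the table can be selected by name. The sketch below is a hypothetical helper, not part of PaddleX: `SAVERS` and `save_result` are illustrative names, and the helper assumes each method accepts a `save_path` keyword argument as listed in the table.

```python
# Map short format names to the Result methods listed in the table above.
SAVERS = {
    "json": "save_to_json",
    "markdown": "save_to_markdown",
    "img": "save_to_img",
    "html": "save_to_html",
    "xlsx": "save_to_xlsx",
    "word": "save_to_word",
}


def save_result(res, out_dir, formats=("json", "markdown")):
    """Call the matching save_to_*() method on `res` for each format.

    Returns the names of the methods that were called, in order.
    """
    unknown = [fmt for fmt in formats if fmt not in SAVERS]
    if unknown:
        raise ValueError(f"Unsupported formats: {unknown}")
    called = []
    for fmt in formats:
        method_name = SAVERS[fmt]
        getattr(res, method_name)(save_path=out_dir)
        called.append(method_name)
    return called
```

Inside the prediction loop this would be used as `save_result(res, "output", formats=("json", "word"))`. A directory is the safer choice for `out_dir`, since several of these methods write more than one file.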

- Calling the `print()` method prints the results to the terminal. The printed content is explained as follows:
    - `input_path`: `(str)` The input path of the image or PDF to be predicted.
    - `page_index`: `(Union[int, None])` If the input is a PDF file, the current page number of the PDF; otherwise, `None`.
    - `page_count`: `(Union[int, None])` If the input is a PDF file, the total number of pages in the PDF; otherwise, `None`.
    - `width`: `(int)` The width of the original input image.
    - `height`: `(int)` The height of the original input image.
    - `model_settings`: `(Dict[str, bool])` Model parameters required for configuring PaddleOCR-VL.
        - `use_doc_preprocessor`: `(bool)` Controls whether to enable the document preprocessing sub-pipeline.
        - `use_layout_detection`: `(bool)` Controls whether to enable the layout analysis module.
        - `use_chart_recognition`: `(bool)` Controls whether to enable the chart recognition function.
        - `format_block_content`: `(bool)` Controls whether to save the formatted Markdown content in `JSON`. When set to `True`, the `block_content` of image-type blocks will contain image path information (e.g., `<img src="..." />`). When set to `False` (default), the `block_content` of image-type blocks will only contain OCR-recognized text content without image paths. To include image paths in JSON output, set this parameter to `True`.
        - `merge_layout_blocks`: `(bool)` Controls whether to merge the layout boxes of multi-column layouts or layouts with alternating top and bottom columns.
        - `markdown_ignore_labels`: `(List[str])` Labels of layout regions that need to be ignored in Markdown, defaulting to `['number','footnote','header','header_image','footer','footer_image','aside_text']`.
    - `doc_preprocessor_res`: `(Dict[str, Union[List[float], str]])` A dictionary of document preprocessing results, which exists only when `use_doc_preprocessor=True`.
        - `input_path`: `(str)` The image path accepted by the document preprocessing sub-pipeline. When the input is a `numpy.ndarray`, it is saved as `None`; here, it is `None`.
        - `page_index`: `None`. Since the input here is a `numpy.ndarray`, the value is `None`.
        - `model_settings`: `(Dict[str, bool])` Model configuration parameters for the document preprocessing sub-pipeline.
            - `use_doc_orientation_classify`: `(bool)` Controls whether to enable the document image orientation classification sub-module.
            - `use_doc_unwarping`: `(bool)` Controls whether to enable the text image distortion correction sub-module.
        - `angle`: `(int)` The prediction result of the document image orientation classification sub-module. When enabled, it returns the actual angle value.
    - `parsing_res_list`: `(List[Dict])` A list of parsing results, where each element is a dictionary. The list order is the reading order after parsing.
        - `block_bbox`: `(np.ndarray)` The bounding box of the layout region.
        - `block_label`: `(str)` The label of the layout region, such as `text`, `table`, etc.
        - `block_content`: `(str)` The content within the layout region.
        - `block_id`: `(int)` The index of the layout region, used to display the layout sorting results.
        - `block_order`: `(int)` The order of the layout region, used to display the layout reading order. For non-sorted parts, the default value is `None`.
- Calling the `save_to_json()` method saves the above content to the specified `save_path`. If a directory is specified, the saved path will be `save_path/{your_img_basename}_res.json`. If a file is specified, the content is saved directly to that file. Since JSON files do not support numpy arrays, `numpy.array` values are converted to lists. The saved fields are:
    - `input_path`: `(str)` The input path of the image or PDF to be predicted.
    - `page_index`: `(Union[int, None])` If the input is a PDF file, the current page number of the PDF; otherwise, `None`.
    - `model_settings`: `(Dict[str, bool])` Model parameters required for configuring PaddleOCR-VL.
        - `use_doc_preprocessor`: `(bool)` Controls whether to enable the document preprocessing sub-pipeline.
        - `use_layout_detection`: `(bool)` Controls whether to enable the layout analysis module.
        - `use_chart_recognition`: `(bool)` Controls whether to enable the chart recognition function.
        - `format_block_content`: `(bool)` Controls whether to save the formatted Markdown content in `JSON`. When set to `True`, the `block_content` of image-type blocks will contain image path information (e.g., `<img src="..." />`). When set to `False` (default), the `block_content` of image-type blocks will only contain OCR-recognized text content without image paths. To include image paths in JSON output, set this parameter to `True`.
    - `doc_preprocessor_res`: `(Dict[str, Union[List[float], str]])` A dictionary of document preprocessing results, which exists only when `use_doc_preprocessor=True`.
        - `input_path`: `(str)` The image path accepted by the document preprocessing sub-pipeline. When the input is a `numpy.ndarray`, it is saved as `None`; here, it is `None`.
        - `page_index`: `None`. Since the input here is a `numpy.ndarray`, the value is `None`.
        - `model_settings`: `(Dict[str, bool])` Model configuration parameters for the document preprocessing sub-pipeline.
            - `use_doc_orientation_classify`: `(bool)` Controls whether to enable the document image orientation classification sub-module.
            - `use_doc_unwarping`: `(bool)` Controls whether to enable the text image distortion correction sub-module.
        - `angle`: `(int)` The prediction result of the document image orientation classification sub-module. When enabled, it returns the actual angle value.
    - `parsing_res_list`: `(List[Dict])` A list of parsing results, where each element is a dictionary. The list order is the reading order after parsing.
        - `block_bbox`: `(np.ndarray)` The bounding box of the layout region.
        - `block_label`: `(str)` The label of the layout region, such as `text`, `table`, etc.
        - `block_content`: `(str)` The content within the layout region.
        - `block_id`: `(int)` The index of the layout region, used to display the layout sorting results.
        - `block_order`: `(int)` The order of the layout region, used to display the layout reading order. For non-sorted parts, the default value is `None`.
- Calling the `save_to_img()` method saves the visualization results to the specified `save_path`. If a directory is specified, visualized images for layout region detection, global OCR, layout reading order, etc., are saved. If a file is specified, the result is saved directly to that file. (Pipelines typically produce many result images, so specifying a single file path is not recommended: the images overwrite one another and only the last one is retained.)
- Calling the `save_to_markdown()` method saves the converted Markdown file to the specified `save_path`. The saved file path will be `save_path/{your_img_basename}.md`. If the input is a PDF file, it is recommended to specify a directory; otherwise, the Markdown files for different pages will overwrite one another.

Additionally, visualized images and prediction results can be obtained through attributes, as follows:

Attribute	Attribute Description
`json`	Obtain the prediction result in `json` format
`img`	Obtain visualized images in `dict` format
`markdown`	Obtain Markdown results in `dict` format

- The prediction result obtained through the `json` attribute is data of dict type, with relevant content consistent with that saved by calling the `save_to_json()` method.
- The prediction result returned by the `img` attribute is data of dict type. The keys are `layout_det_res`, `overall_ocr_res`, `text_paragraphs_ocr_res`, `formula_res_region1`, `table_cell_img`, and `seal_res_region1`, with corresponding values being `Image.Image` objects: used to display visualized images of layout region detection, OCR, OCR text paragraphs, formulas, tables, and seal results, respectively. If optional modules are not used, the dict only contains `layout_det_res`.
- The prediction result returned by the `markdown` attribute is data of dict type. The keys are `markdown_texts`, `markdown_images`, and `page_continuation_flags`, with corresponding values being markdown text, images displayed in Markdown (`Image.Image` objects), and a bool tuple used to identify whether the first element on the current page is the start of a paragraph and whether the last element is the end of a paragraph, respectively.
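To illustrate how `page_continuation_flags` might be consumed, the sketch below joins per-page Markdown so that a paragraph split by a page break stays in one piece. The helper name `join_pages`, the `joiner` parameter, and the joining rule itself are assumptions made for illustration; this is not PaddleX's own concatenation logic. The input dicts only mimic the shape of the `markdown` attribute described above.

```python
from typing import Dict, List, Sequence


def join_pages(pages: Sequence[Dict], joiner: str = " ") -> str:
    """Join per-page Markdown texts (hypothetical helper).

    Each item needs `markdown_texts` (str) and `page_continuation_flags`,
    a pair (first_element_starts_paragraph, last_element_ends_paragraph).
    """
    parts: List[str] = []
    prev_ended = True  # nothing precedes the first page
    for page in pages:
        text = page["markdown_texts"]
        starts_new, ends_here = page["page_continuation_flags"]
        if not parts:
            parts.append(text)
        elif not prev_ended and not starts_new:
            # The previous page stopped mid-paragraph and this page continues it.
            parts.append(joiner + text.lstrip())
        else:
            # Ordinary page boundary: separate with a blank line.
            parts.append("\n\n" + text)
        prev_ended = ends_here
    return "".join(parts)


if __name__ == "__main__":
    demo = [
        {"markdown_texts": "A paragraph that", "page_continuation_flags": (True, False)},
        {"markdown_texts": "continues here.", "page_continuation_flags": (False, True)},
    ]
    print(join_pages(demo))
```

For scripts without whitespace between words, passing `joiner=""` would be the more natural choice.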

3. Using VLM Inference Services¶

The inference performance under the default configuration is not fully optimized and may not meet actual production requirements. PaddleX supports connecting the VLM recognition stage in the complete pipeline to a dedicated VLM inference service. This improves VLM module inference performance and helps isolate server-side dependencies and compute resources in production environments. Server backends can include vLLM, SGLang, and FastDeploy. The workflow mainly consists of two steps:

1. Start the VLM inference service;
2. Configure the PaddleX pipeline to call the VLM inference service as a client.

The VLM inference service only handles the VLM recognition stage of the complete pipeline. Layout analysis, cropping, reading-order handling, and result assembly are still performed by the PaddleX pipeline, so local inference with the layout parsing model is still required. When starting the service, use the VLM submodel name that corresponds to the selected pipeline. See the pipeline table at the beginning of this document for the mapping.

3.1 Starting the VLM Inference Service¶

3.1.1 Using Docker Images¶

PaddleX provides a vLLM Docker image to quickly start a VLM inference service. For NVIDIA GPUs other than the Blackwell architecture, use ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server:latest. For Blackwell GPUs, use ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server:latest-sm120.

Using an NVIDIA GPU other than the Blackwell architecture and PaddleOCR-VL-1.6-0.9B as an example:

docker run \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server:latest \
    paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --host 0.0.0.0 --port 8118 --backend vllm

For Blackwell GPUs, replace the image above with the dedicated image:

docker run \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server:latest-sm120 \
    paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --host 0.0.0.0 --port 8118 --backend vllm

3.1.2 Starting Through PaddleX CLI¶

VLM server dependencies may differ from the local pipeline client environment, so it is recommended to create a separate virtual environment for the VLM inference service:

# Create a virtual environment
python -m venv .venv
# Activate the environment
source .venv/bin/activate
# Install PaddleX
python -m pip install "paddlex[ocr]"
# Install the vLLM server plugin
paddlex --install genai-vllm-server
# Or install the SGLang server plugin
# paddlex --install genai-sglang-server
# Or install the FastDeploy server plugin
# paddlex --install genai-fastdeploy-server

After installation, start the service with paddlex_genai_server:

paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --backend vllm --port 8118

# For the SGLang backend
# paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --backend sglang --port 8118

# For the FastDeploy backend
# paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --backend fastdeploy --port 8118

The command supports the following parameters:

Parameter	Description
`--model_name`	Model name. It should match the PaddleX pipeline version being used.
`--model_dir`	Model directory.
`--host`	Server hostname.
`--port`	Server port number.
`--backend`	Backend name. Supported values are `vllm`, `sglang`, and `fastdeploy`.
`--backend_config`	YAML file containing backend configuration.
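When the service is launched from a script, the parameters in the table can be assembled into an argument list. The sketch below is a minimal illustration that uses only the flags listed above; the helper name `build_server_command` is hypothetical, and the function only builds the command without running it.

```python
import shlex
from typing import List, Optional

# Backend names taken from the `--backend` row of the table above.
SUPPORTED_BACKENDS = ("vllm", "sglang", "fastdeploy")


def build_server_command(
    model_name: str,
    backend: str = "vllm",
    host: Optional[str] = None,
    port: Optional[int] = None,
    model_dir: Optional[str] = None,
    backend_config: Optional[str] = None,
) -> List[str]:
    """Return an argv list for `paddlex_genai_server` (hypothetical helper)."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported backend: {backend!r}")
    cmd = ["paddlex_genai_server", "--model_name", model_name, "--backend", backend]
    optional = (
        ("--model_dir", model_dir),
        ("--host", host),
        ("--port", port),
        ("--backend_config", backend_config),
    )
    for flag, value in optional:
        if value is not None:  # omit flags the caller did not set
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    argv = build_server_command("PaddleOCR-VL-1.6-0.9B", host="0.0.0.0", port=8118)
    print(shlex.join(argv))  # pass `argv` to subprocess.run(...) to start the service
```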

3.2 Client Usage¶

After starting the VLM inference service, the client can call it through PaddleX. Install the client plugin first:

paddlex --install genai-client

Next, obtain the pipeline configuration file:

paddlex --get_pipeline_config PaddleOCR-VL-1.6

The default save path is PaddleOCR-VL-1.6.yaml. Modify SubModules.VLRecognition.genai_config.backend and SubModules.VLRecognition.genai_config.server_url in the config file to match the service, for example:

SubModules:
  VLRecognition:
    genai_config:
      backend: vllm-server
      server_url: http://127.0.0.1:8118/v1
      max_concurrency: 200

You can also use the unified engine + engine_config style to configure this submodule explicitly:

SubModules:
  VLRecognition:
    engine: genai_client
    engine_config:
      backend: vllm-server
      server_url: http://127.0.0.1:8118/v1
      max_concurrency: 200

Then use the modified config file to run the pipeline. For CLI:

paddlex --pipeline PaddleOCR-VL-1.6.yaml --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png

Or through the Python API:

from paddlex import create_pipeline

pipeline = create_pipeline("PaddleOCR-VL-1.6.yaml")

for res in pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png"):
    res.print()
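Because the client depends on the service being up, it can help to confirm that the address in `server_url` accepts connections before creating the pipeline. The sketch below uses only the Python standard library; `wait_for_server` is a hypothetical helper, not part of PaddleX, and a successful TCP connection only shows that the port is open, not that the model has finished loading.

```python
import socket
import time
from urllib.parse import urlparse


def wait_for_server(server_url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to the host and port in `server_url` succeeds."""
    parsed = urlparse(server_url)
    if parsed.hostname is None:
        raise ValueError(f"No host found in URL: {server_url!r}")
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((parsed.hostname, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Same address as in the example configuration above.
    if wait_for_server("http://127.0.0.1:8118/v1", timeout=5.0):
        print("VLM inference service is reachable")
    else:
        print("VLM inference service did not respond in time")
```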

4. Serving¶

If you need to directly apply PaddleOCR-VL in your Python project, you can refer to the example code in 2.2 Python Script Integration.

PaddleX also provides a service deployment method, detailed below:

4.1 Install Dependencies¶

Run the following command to install the PaddleX serving plugin via PaddleX CLI:

paddlex --install serving

4.2 Run the Server¶

Run the server via PaddleX CLI:

paddlex --serve --pipeline PaddleOCR-VL-1.6

You should see information similar to the following:

INFO:     Started server process [63108]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

If you need to adjust the configuration, such as model paths, batch size, deployment device, or VLM server backend, specify a custom configuration file through --pipeline. PaddleOCR-VL-1.5 and PaddleOCR-VL-1.6 share the same serving application implementation internally, but their pipeline configurations remain distinct.

The command-line options related to serving are as follows:

Name	Description
`--pipeline`	PaddleX pipeline registration name or pipeline configuration file path.
`--device`	Deployment device for the pipeline. By default, a GPU will be used if available; otherwise, a CPU will be used.
`--host`	Hostname or IP address to which the server is bound. Defaults to `0.0.0.0`.
`--port`	Port number on which the server listens. Defaults to `8080`.
`--use_hpip`	If specified, uses high-performance inference. Refer to the High-Performance Inference documentation for more information.
`--hpi_config`	High-performance inference configuration. Refer to the High-Performance Inference documentation for more information.

4.3 Client-Side Invocation¶

Below are the API references for basic service-based deployment and examples of multilingual service invocation:

API Reference

For the main operations provided by the service:

- The HTTP request method is POST.
- Both the request body and the response body are JSON data (JSON objects).
- When the request is processed successfully, the response status code is `200`, and the response body has the following properties:

Name	Type	Meaning
`logId`	`string`	The UUID of the request.
`errorCode`	`integer`	Error code. Fixed as `0`.
`errorMsg`	`string`	Error description. Fixed as `"Success"`.
`result`	`object`	Operation result.

When the request is not processed successfully, the properties of the response body are as follows:

Name	Type	Meaning
`logId`	`string`	The UUID of the request.
`errorCode`	`integer`	Error code. Same as the response status code.
`errorMsg`	`string`	Error description.
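A client can treat the two response shapes above uniformly by unwrapping `result` on success and raising on failure. The sketch below is one possible way to do that; `ServiceError` and `unwrap_result` are hypothetical names introduced for illustration, and only the fields listed in the tables above are assumed to exist.

```python
from typing import Any, Dict, Optional


class ServiceError(RuntimeError):
    """Raised when the service reports that a request was not processed successfully."""

    def __init__(self, log_id: Optional[str], error_code: int, error_msg: str):
        super().__init__(f"[{log_id}] errorCode={error_code}: {error_msg}")
        self.log_id = log_id
        self.error_code = error_code
        self.error_msg = error_msg


def unwrap_result(status_code: int, body: Dict[str, Any]) -> Dict[str, Any]:
    """Return `result` for a successful response, otherwise raise ServiceError."""
    if status_code == 200 and body.get("errorCode") == 0:
        return body["result"]
    raise ServiceError(
        body.get("logId"),
        body.get("errorCode", status_code),  # documented to equal the status code
        body.get("errorMsg", ""),
    )
```

With `requests`, this could be called as `unwrap_result(response.status_code, response.json())`.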

The main operations provided by the service are as follows:

infer

Perform layout parsing.

POST /layout-parsing

The properties of the request body are as follows:

Name	Type	Meaning	Required
`file`	`string`	The URL of an image file (including TIFF; multi-page TIFF is processed page by page) or PDF file accessible to the server, or the Base64-encoded content of such a file. By default, there is no page limit. To set a page limit on the server, set `Serving.extra.max_num_input_imgs` to a positive integer in the pipeline configuration file, for example: `Serving: extra: max_num_input_imgs: 10`	Yes
`fileType`	`integer`\|`null`	File type. `0` represents a PDF file, `1` represents an image file (including TIFF). If this property is not present in the request body, the file type will be inferred from the URL.	No
`useDocOrientationClassify`	`boolean` \| `null`	Please refer to the description of the `use_doc_orientation_classify` parameter in the pipeline `predict` method.	No
`useDocUnwarping`	`boolean`\|`null`	Please refer to the description of the `use_doc_unwarping` parameter in the pipeline `predict` method.	No
`useLayoutDetection`	`boolean`\|`null`	Please refer to the description of the `use_layout_detection` parameter in the pipeline `predict` method.	No
`useChartRecognition`	`boolean`\|`null`	Please refer to the description of the `use_chart_recognition` parameter in the pipeline `predict` method.	No
`useSealRecognition`	`boolean`\|`null`	Please refer to the description of the `use_seal_recognition` parameter in the pipeline `predict` method.	No
`useOcrForImageBlock`	`boolean`\|`null`	Please refer to the description of the `use_ocr_for_image_block` parameter in the pipeline `predict` method.	No
`layoutThreshold`	`number`\|`object`\|`null`	Please refer to the description of the `layout_threshold` parameter in the pipeline `predict` method.	No
`layoutNms`	`boolean`\|`null`	Please refer to the description of the `layout_nms` parameter in the pipeline `predict` method.	No
`layoutUnclipRatio`	`number`\|`array`\|`object`\|`null`	Please refer to the description of the `layout_unclip_ratio` parameter in the pipeline `predict` method.	No
`layoutMergeBboxesMode`	`string`\|`object`\|`null`	Please refer to the description of the `layout_merge_bboxes_mode` parameter in the pipeline `predict` method.	No
`layoutShapeMode`	`string`	Please refer to the description of the `layout_shape_mode` parameter in the pipeline `predict` method.	No
`promptLabel`	`string`\|`null`	Please refer to the description of the `prompt_label` parameter in the pipeline `predict` method.	No
`formatBlockContent`	`boolean`\|`null`	Please refer to the description of the `format_block_content` parameter in the pipeline `predict` method.	No
`repetitionPenalty`	`number`\|`null`	Please refer to the description of the `repetition_penalty` parameter in the pipeline `predict` method.	No
`temperature`	`number`\|`null`	Please refer to the description of the `temperature` parameter in the pipeline `predict` method.	No
`topP`	`number`\|`null`	Please refer to the description of the `top_p` parameter in the pipeline `predict` method.	No
`minPixels`	`number`\|`null`	Please refer to the description of the `min_pixels` parameter in the pipeline `predict` method.	No
`maxPixels`	`number`\|`null`	Please refer to the description of the `max_pixels` parameter in the pipeline `predict` method.	No
`maxNewTokens`	`number`\|`null`	Please refer to the description of the `max_new_tokens` parameter in the pipeline `predict` method.	No
`mergeLayoutBlocks`	`boolean`\|`null`	Please refer to the description of the `merge_layout_blocks` parameter in the pipeline `predict` method.	No
`markdownIgnoreLabels`	`array`\|`null`	Please refer to the description of the `markdown_ignore_labels` parameter in the pipeline `predict` method.	No
`vlmExtraArgs`	`object`\|`null`	Please refer to the description of the `vlm_extra_args` parameter in the pipeline `predict` method.	No
`prettifyMarkdown`	`boolean`	Whether to output beautified Markdown text. The default is `true`.	No
`showFormulaNumber`	`boolean`	Whether to include formula numbers in the output Markdown text. The default is `false`.	No
`returnMarkdownImages`	`boolean`	Whether to return the images referenced in the Markdown. Default `true`; when set to `false`, `markdown.images` is `null` or omitted and the server skips image encoding / URL upload.	No
`restructurePages`	`boolean`	Whether to restructure results across multiple pages. The default is `false`.	No
`mergeTables`	`boolean`	Please refer to the description of the `merge_tables` parameter in the pipeline `restructure_pages` method. Valid only when `restructurePages` is `true`.	No
`relevelTitles`	`boolean`	Please refer to the description of the `relevel_titles` parameter in the pipeline `restructure_pages` method. Valid only when `restructurePages` is `true`.	No
`outputFormats`	`array`\|`null`	Optional list of extra document formats to return. By default, no extra formats are returned. Currently only `"docx"` is supported.	No
`visualize`	`boolean`\|`null`	Whether to return visualization result images and intermediate images produced during processing. Pass `true` to return images, or `false` to omit them. If this parameter is absent from the request body or is `null`, the `Serving.visualize` setting in the configuration file applies. For example, with `Serving: visualize: False` in the configuration file, images are not returned by default, and the `visualize` parameter in the request body can override that default. If the value is set in neither the request body nor the configuration file, images are returned by default.	No
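Most request properties in the table are camelCase counterparts of the snake_case `predict` parameters, so a request body can be built from familiar keyword names. The sketch below illustrates that mapping; `snake_to_camel` and `build_infer_payload` are hypothetical helpers, and dropping `None` values (so that server-side defaults apply) is a design choice of this sketch, not a requirement of the API.

```python
import base64
from typing import Any, Dict, Optional, Union


def snake_to_camel(name: str) -> str:
    """Convert e.g. `use_doc_unwarping` to `useDocUnwarping`."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def build_infer_payload(
    file: Union[bytes, str],
    file_type: Optional[int] = None,
    **predict_kwargs: Any,
) -> Dict[str, Any]:
    """Build a JSON-serializable body for POST /layout-parsing.

    `file` is either raw file bytes (Base64-encoded here) or a URL string.
    """
    if isinstance(file, bytes):
        file_value = base64.b64encode(file).decode("ascii")
    else:
        file_value = file
    payload: Dict[str, Any] = {"file": file_value}
    if file_type is not None:
        payload["fileType"] = file_type  # 0 = PDF, 1 = image
    for key, value in predict_kwargs.items():
        if value is not None:  # leave unset options to the server defaults
            payload[snake_to_camel(key)] = value
    return payload


if __name__ == "__main__":
    body = build_infer_payload(b"abc", file_type=1, use_doc_unwarping=False, top_p=0.9)
    print(body)
```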

When the request is processed successfully, the result in the response body has the following attributes:

Name	Type	Meaning
`layoutParsingResults`	`array`	Layout parsing results. The array length is 1 (for image input) or the actual number of document pages processed (for PDF input). For PDF input, each element in the array represents the result of each actual page processed in the PDF file.
`dataInfo`	`object`	Input data information.

Image and other binary file fields in the element schema below (e.g. outputImages, inputImage, markdown.images, exports) are returned inline as Base64 strings by default; when the server is configured to return URLs, those values become pre-signed URLs while the field types remain unchanged. See the "Returning Binary Content as URLs" section of the Serving Deployment Guide for configuration.

Each element in `layoutParsingResults` is an object with the following attributes:

Name	Type	Meaning
`prunedResult`	`object`	A simplified version of the `res` field in the JSON representation of the results generated by the `predict` method of the pipeline object, with the `input_path` and `page_index` fields removed.
`markdown`	`object`	Markdown results.
`outputImages`	`object`\|`null`	Refer to the `img` property description of the prediction results. The image is in JPEG format, encoded as Base64 by default; returned as a pre-signed URL when URL-return mode is enabled.
`inputImage`	`string`\|`null`	Input image. The image is in JPEG format, encoded as Base64 by default; returned as a pre-signed URL when URL-return mode is enabled.
`exports`	`object`\|`null`	Optional additional exports. Present only when `outputFormats` is set. Example: `{"docx": {"content": "..."}}`, where `content` is the Base64-encoded file content by default, or a pre-signed URL when URL-return mode is enabled.

`markdown` is an `object` with the following properties:

Name	Type	Meaning
`text`	`string`	Markdown text.
`images`	`object` \| `null`	Key-value pairs of relative Markdown image paths and their image data. Values are Base64-encoded by default; returned as pre-signed URLs when URL-return mode is enabled. The field is `null` or omitted when `returnMarkdownImages` is `false` in the request.
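Since `images` can be `null`, Base64 data, or pre-signed URLs, client code that saves the Markdown result benefits from handling all three cases. The sketch below is one way to do so under stated assumptions: `save_markdown` is a hypothetical helper, the output file name `doc.md` is arbitrary, and values are recognised as URLs by an `http://` or `https://` prefix (which cannot occur in Base64 data). URL entries are returned to the caller rather than downloaded.

```python
import base64
from pathlib import Path
from typing import Any, Dict, List


def save_markdown(markdown: Dict[str, Any], out_dir: str) -> List[str]:
    """Write `markdown["text"]` and any inline images under `out_dir`.

    Returns the relative paths whose values were URLs and were therefore
    not written; the caller can download those separately.
    """
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    (root / "doc.md").write_text(markdown["text"], encoding="utf-8")

    skipped: List[str] = []
    # `images` is null or omitted when `returnMarkdownImages` is false.
    for rel_path, data in (markdown.get("images") or {}).items():
        if data.startswith(("http://", "https://")):
            skipped.append(rel_path)  # URL-return mode
            continue
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(base64.b64decode(data))
    return skipped
```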

restructurePages

Restructure results across multiple pages.

POST /restructure-pages

The request body has the following properties:

Name	Type	Description	Required
`pages`	`array`	An array of pages.	Yes
`mergeTables`	`boolean`	Please refer to the description of the `merge_tables` parameter in the pipeline `restructure_pages` method.	No
`relevelTitles`	`boolean`	Please refer to the description of the `relevel_titles` parameter in the pipeline `restructure_pages` method.	No
`concatenatePages`	`boolean`	Please refer to the description of the `concatenate_pages` parameter in the pipeline `restructure_pages` method.	No
`prettifyMarkdown`	`boolean`	Whether to output beautified Markdown text. The default is `true`.	No
`showFormulaNumber`	`boolean`	Whether to include formula numbers in the output Markdown text. The default is `false`.	No
`returnMarkdownImages`	`boolean`	Whether to return the images referenced in the Markdown (from `pages[].markdownImages` in the request). Default `true`; when set to `false`, `markdown.images` is `null` or omitted and the server does not back-fill it.	No
`outputFormats`	`array`\|`null`	Optional extra export formats; same meaning as `outputFormats` on `infer`. Only `"docx"` is supported.	No

Each element in pages is an object with the following properties:

Name	Type	Description
`prunedResult`	`object`	The `prunedResult` object returned by the `infer` operation.
`markdownImages`	`object`\|`null`	The `images` property of the `markdown` object returned by the `infer` operation.

When the request is processed successfully, the result field in the response body has the following properties:

Name	Type	Description
`layoutParsingResults`	`array`	The restructured layout parsing results. For the fields that every element contains, please refer to the description of the result returned by the `infer` operation (excluding visualization result images and intermediate images).

Multilingual Service Invocation Example

Python


import base64
import requests
import pathlib

BASE_URL = "http://localhost:8080"

image_path = "./demo.jpg"

# Encode the local image in Base64
with open(image_path, "rb") as file:
    image_bytes = file.read()
    image_data = base64.b64encode(image_bytes).decode("ascii")

payload = {
    "file": image_data, # Base64-encoded file content or file URL
    "fileType": 1, # File type, 1 indicates an image file
}

response = requests.post(BASE_URL + "/layout-parsing", json=payload)
assert response.status_code == 200, (response.status_code, response.text)

result = response.json()["result"]
pages = []
for i, res in enumerate(result["layoutParsingResults"]):
    pages.append({"prunedResult": res["prunedResult"], "markdownImages": res["markdown"].get("images")})
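    # `outputImages` is null when visualization is disabled, so fall back to an empty dict
    for img_name, img in (res.get("outputImages") or {}).items():
        img_path = f"{img_name}_{i}.jpg"
        pathlib.Path(img_path).parent.mkdir(parents=True, exist_ok=True)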
        with open(img_path, "wb") as f:
            f.write(base64.b64decode(img))
        print(f"Output image saved at {img_path}")

payload = {
    "pages": pages,
    "concatenatePages": True,
}

response = requests.post(BASE_URL + "/restructure-pages", json=payload)
assert response.status_code == 200, (response.status_code, response.text)

result = response.json()["result"]
res = result["layoutParsingResults"][0]
print(res["prunedResult"])
md_dir = pathlib.Path("markdown")
md_dir.mkdir(exist_ok=True)
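(md_dir / "doc.md").write_text(res["markdown"]["text"], encoding="utf-8")
# `markdown.images` may be null, for example when `returnMarkdownImages` is false
for img_path, img in (res["markdown"].get("images") or {}).items():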
    img_path = md_dir / img_path
    img_path.parent.mkdir(parents=True, exist_ok=True)
    img_path.write_bytes(base64.b64decode(img))
print(f"Markdown document saved at {md_dir / 'doc.md'}")

C++

#include <iostream>
#include <filesystem>
#include <fstream>
#include <vector>
#include <string>
#include "cpp-httplib/httplib.h" // https://github.com/Huiyicc/cpp-httplib
#include "nlohmann/json.hpp" // https://github.com/nlohmann/json
#include "base64.hpp" // https://github.com/tobiaslocker/base64

namespace fs = std::filesystem;

int main() {
    httplib::Client client("localhost", 8080);

    const std::string filePath = "./demo.jpg";

    std::ifstream file(filePath, std::ios::binary | std::ios::ate);
    if (!file) {
        std::cerr << "Error opening file: " << filePath << std::endl;
        return 1;
    }

    std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> buffer(size);
    if (!file.read(buffer.data(), size)) {
        std::cerr << "Error reading file." << std::endl;
        return 1;
    }

    std::string bufferStr(buffer.data(), static_cast<size_t>(size));
    std::string encodedFile = base64::to_base64(bufferStr);

    nlohmann::json jsonObj;
    jsonObj["file"] = encodedFile;
    jsonObj["fileType"] = 1;

    auto response = client.Post("/layout-parsing", jsonObj.dump(), "application/json");

    if (response && response->status == 200) {
        nlohmann::json jsonResponse = nlohmann::json::parse(response->body);
        auto result = jsonResponse["result"];

        if (!result.is_object() || !result.contains("layoutParsingResults")) {
            std::cerr << "Unexpected response format." << std::endl;
            return 1;
        }

        const auto& results = result["layoutParsingResults"];
        for (size_t i = 0; i < results.size(); ++i) {
            const auto& res = results[i];

            if (res.contains("prunedResult")) {
                std::cout << "Layout result [" << i << "]: " << res["prunedResult"].dump() << std::endl;
            }

            if (res.contains("outputImages") && res["outputImages"].is_object()) {
                for (auto& [imgName, imgBase64] : res["outputImages"].items()) {
                    std::string outputPath = imgName + "_" + std::to_string(i) + ".jpg";
                    fs::path pathObj(outputPath);
                    fs::path parentDir = pathObj.parent_path();
                    if (!parentDir.empty() && !fs::exists(parentDir)) {
                        fs::create_directories(parentDir);
                    }

                    std::string decodedImage = base64::from_base64(imgBase64.get<std::string>());

                    std::ofstream outFile(outputPath, std::ios::binary);
                    if (outFile.is_open()) {
                        outFile.write(decodedImage.c_str(), decodedImage.size());
                        outFile.close();
                        std::cout << "Saved image: " << outputPath << std::endl;
                    } else {
                        std::cerr << "Failed to save image: " << outputPath << std::endl;
                    }
                }
            }
        }
    } else {
        std::cerr << "Request failed." << std::endl;
        if (response) {
            std::cerr << "HTTP status: " << response->status << std::endl;
            std::cerr << "Response body: " << response->body << std::endl;
        }
        return 1;
    }

    return 0;
}

Java

import okhttp3.*;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.nio.file.Paths;
import java.nio.file.Files;

public class Main {
    public static void main(String[] args) throws IOException {
        String API_URL = "http://localhost:8080/layout-parsing";
        String imagePath = "./demo.jpg";

        File file = new File(imagePath);
        byte[] fileContent = java.nio.file.Files.readAllBytes(file.toPath());
        String base64Image = Base64.getEncoder().encodeToString(fileContent);

        ObjectMapper objectMapper = new ObjectMapper();
        ObjectNode payload = objectMapper.createObjectNode();
        payload.put("file", base64Image);
        payload.put("fileType", 1);

        OkHttpClient client = new OkHttpClient();
        MediaType JSON = MediaType.get("application/json; charset=utf-8");

        RequestBody body = RequestBody.create(JSON, payload.toString());

        Request request = new Request.Builder()
                .url(API_URL)
                .post(body)
                .build();

        try (Response response = client.newCall(request).execute()) {
            if (response.isSuccessful()) {
                String responseBody = response.body().string();
                JsonNode root = objectMapper.readTree(responseBody);
                JsonNode result = root.get("result");

                JsonNode layoutParsingResults = result.get("layoutParsingResults");
                for (int i = 0; i < layoutParsingResults.size(); i++) {
                    JsonNode item = layoutParsingResults.get(i);
                    int finalI = i;
                    JsonNode prunedResult = item.get("prunedResult");
                    System.out.println("Pruned Result [" + i + "]: " + prunedResult.toString());

                    JsonNode outputImages = item.get("outputImages");
                    outputImages.fieldNames().forEachRemaining(imgName -> {
                        try {
                            String imgBase64 = outputImages.get(imgName).asText();
                            byte[] imgBytes = Base64.getDecoder().decode(imgBase64);
                            String imgPath = imgName + "_" + finalI + ".jpg";

                            File outputFile = new File(imgPath);
                            File parentDir = outputFile.getParentFile();
                            if (parentDir != null && !parentDir.exists()) {
                                parentDir.mkdirs();
                                System.out.println("Created directory: " + parentDir.getAbsolutePath());
                            }

                            try (FileOutputStream fos = new FileOutputStream(outputFile)) {
                                fos.write(imgBytes);
                                System.out.println("Saved image: " + imgPath);
                            }
                        } catch (IOException e) {
                            System.err.println("Failed to save image: " + e.getMessage());
                        }
                    });
                }
            } else {
                System.err.println("Request failed with HTTP code: " + response.code());
            }
        }
    }
}

Go

package main

import (
    "bytes"
    "encoding/base64"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "net/http"
    "os"
    "path/filepath"
)

func main() {
    API_URL := "http://localhost:8080/layout-parsing"
    filePath := "./demo.jpg"

    fileBytes, err := ioutil.ReadFile(filePath)
    if err != nil {
        fmt.Printf("Error reading file: %v\n", err)
        return
    }
    fileData := base64.StdEncoding.EncodeToString(fileBytes)

    payload := map[string]interface{}{
        "file":     fileData,
        "fileType": 1,
    }
    payloadBytes, err := json.Marshal(payload)
    if err != nil {
        fmt.Printf("Error marshaling payload: %v\n", err)
        return
    }

    client := &http.Client{}
    req, err := http.NewRequest("POST", API_URL, bytes.NewBuffer(payloadBytes))
    if err != nil {
        fmt.Printf("Error creating request: %v\n", err)
        return
    }
    req.Header.Set("Content-Type", "application/json")

    res, err := client.Do(req)
    if err != nil {
        fmt.Printf("Error sending request: %v\n", err)
        return
    }
    defer res.Body.Close()

    if res.StatusCode != http.StatusOK {
        fmt.Printf("Unexpected status code: %d\n", res.StatusCode)
        return
    }

    body, err := ioutil.ReadAll(res.Body)
    if err != nil {
        fmt.Printf("Error reading response: %v\n", err)
        return
    }

    type Markdown struct {
        Text   string            `json:"text"`
        Images map[string]string `json:"images"`
    }

    type LayoutResult struct {
        PrunedResult map[string]interface{} `json:"prunedResult"`
        Markdown     Markdown               `json:"markdown"`
        OutputImages map[string]string      `json:"outputImages"`
        InputImage   *string                `json:"inputImage"`
    }

    type Response struct {
        Result struct {
            LayoutParsingResults []LayoutResult `json:"layoutParsingResults"`
            DataInfo             interface{}    `json:"dataInfo"`
        } `json:"result"`
    }

    var respData Response
    if err := json.Unmarshal(body, &respData); err != nil {
        fmt.Printf("Error parsing response: %v\n", err)
        return
    }

    for i, res := range respData.Result.LayoutParsingResults {
        fmt.Printf("Result %d - prunedResult: %+v\n", i, res.PrunedResult)

        mdDir := fmt.Sprintf("markdown_%d", i)
        if err := os.MkdirAll(mdDir, 0755); err != nil {
            fmt.Printf("Error creating markdown directory: %v\n", err)
            continue
        }
        mdFile := filepath.Join(mdDir, "doc.md")
        if err := os.WriteFile(mdFile, []byte(res.Markdown.Text), 0644); err != nil {
            fmt.Printf("Error writing markdown file: %v\n", err)
        } else {
            fmt.Printf("Markdown document saved at %s\n", mdFile)
        }

        for path, imgBase64 := range res.Markdown.Images {
            fullPath := filepath.Join(mdDir, path)
            if err := os.MkdirAll(filepath.Dir(fullPath), 0755); err != nil {
                fmt.Printf("Error creating directory for markdown image: %v\n", err)
                continue
            }
            imgBytes, err := base64.StdEncoding.DecodeString(imgBase64)
            if err != nil {
                fmt.Printf("Error decoding markdown image: %v\n", err)
                continue
            }
            if err := os.WriteFile(fullPath, imgBytes, 0644); err != nil {
                fmt.Printf("Error saving markdown image: %v\n", err)
            }
        }

        for name, imgBase64 := range res.OutputImages {
            imgBytes, err := base64.StdEncoding.DecodeString(imgBase64)
            if err != nil {
                fmt.Printf("Error decoding output image %s: %v\n", name, err)
                continue
            }
            filename := fmt.Sprintf("%s_%d.jpg", name, i)

            if err := os.MkdirAll(filepath.Dir(filename), 0755); err != nil {
                fmt.Printf("Error creating directory for output image: %v\n", err)
                continue
            }

            if err := os.WriteFile(filename, imgBytes, 0644); err != nil {
                fmt.Printf("Error saving output image %s: %v\n", filename, err)
            } else {
                fmt.Printf("Output image saved at %s\n", filename)
            }
        }
    }
}

C#

using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class Program
{
    static readonly string API_URL = "http://localhost:8080/layout-parsing";
    static readonly string inputFilePath = "./demo.jpg";

    static async Task Main(string[] args)
    {
        var httpClient = new HttpClient();

        byte[] fileBytes = File.ReadAllBytes(inputFilePath);
        string fileData = Convert.ToBase64String(fileBytes);

        var payload = new JObject
        {
            { "file", fileData },
            { "fileType", 1 }
        };
        var content = new StringContent(payload.ToString(), Encoding.UTF8, "application/json");

        HttpResponseMessage response = await httpClient.PostAsync(API_URL, content);
        response.EnsureSuccessStatusCode();

        string responseBody = await response.Content.ReadAsStringAsync();
        JObject jsonResponse = JObject.Parse(responseBody);

        JArray layoutParsingResults = (JArray)jsonResponse["result"]["layoutParsingResults"];
        for (int i = 0; i < layoutParsingResults.Count; i++)
        {
            var res = layoutParsingResults[i];
            Console.WriteLine($"[{i}] prunedResult:\n{res["prunedResult"]}");

            JObject outputImages = res["outputImages"] as JObject;
            if (outputImages != null)
            {
                foreach (var img in outputImages)
                {
                    string imgName = img.Key;
                    string base64Img = img.Value?.ToString();
                    if (!string.IsNullOrEmpty(base64Img))
                    {
                        string imgPath = $"{imgName}_{i}.jpg";
                        byte[] imageBytes = Convert.FromBase64String(base64Img);

                        string directory = Path.GetDirectoryName(imgPath);
                        if (!string.IsNullOrEmpty(directory) && !Directory.Exists(directory))
                        {
                            Directory.CreateDirectory(directory);
                            Console.WriteLine($"Created directory: {directory}");
                        }

                        File.WriteAllBytes(imgPath, imageBytes);
                        Console.WriteLine($"Output image saved at {imgPath}");
                    }
                }
            }
        }
    }
}

Node.js

const axios = require('axios');
const fs = require('fs');
const path = require('path');

const API_URL = 'http://localhost:8080/layout-parsing';
const imagePath = './demo.jpg';
const fileType = 1;

function encodeImageToBase64(filePath) {
  // readFileSync already returns a Buffer, so it can be encoded directly.
  return fs.readFileSync(filePath).toString('base64');
}

const payload = {
  file: encodeImageToBase64(imagePath),
  fileType: fileType
};

axios.post(API_URL, payload)
  .then(response => {
    const results = response.data.result.layoutParsingResults;
    results.forEach((res, index) => {
      console.log(`\n[${index}] prunedResult:`);
      console.log(res.prunedResult);

      const outputImages = res.outputImages;
      if (outputImages) {
        Object.entries(outputImages).forEach(([imgName, base64Img]) => {
          const imgPath = `${imgName}_${index}.jpg`;

          const directory = path.dirname(imgPath);
          if (!fs.existsSync(directory)) {
            fs.mkdirSync(directory, { recursive: true });
            console.log(`Created directory: ${directory}`);
          }

          fs.writeFileSync(imgPath, Buffer.from(base64Img, 'base64'));
          console.log(`Output image saved at ${imgPath}`);
        });
      } else {
        console.log(`[${index}] No outputImages.`);
      }
    });
  })
  .catch(error => {
    console.error('Error during API request:', error.message || error);
  });

PHP

<?php

$API_URL = "http://localhost:8080/layout-parsing";
$image_path = "./demo.jpg";

$image_data = base64_encode(file_get_contents($image_path));
$payload = array("file" => $image_data, "fileType" => 1);

$ch = curl_init($API_URL);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($payload));
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
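$response = curl_exec($ch);
if ($response === false) {
    // Transport-level failure (connection refused, timeout, etc.).
    echo "Error sending request: " . curl_error($ch) . "\n";
    curl_close($ch);
    exit(1);
}
curl_close($ch);

$decoded = json_decode($response, true);
if (!isset($decoded["result"]["layoutParsingResults"])) {
    echo "Unexpected response: $response\n";
    exit(1);
}
$result = $decoded["result"]["layoutParsingResults"];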

foreach ($result as $i => $item) {
    echo "[$i] prunedResult:\n";
    print_r($item["prunedResult"]);

    if (!empty($item["outputImages"])) {
        foreach ($item["outputImages"] as $img_name => $img_base64) {
            $output_image_path = "{$img_name}_{$i}.jpg";

            $directory = dirname($output_image_path);
            if (!is_dir($directory)) {
                mkdir($directory, 0777, true);
                echo "Created directory: $directory\n";
            }

            file_put_contents($output_image_path, base64_decode($img_base64));
            echo "Output image saved at $output_image_path\n";
        }
    } else {
        echo "No outputImages found for item $i\n";
    }
}
?>
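
Of the clients above, only the Go example writes the Markdown output (`markdown.text` and `markdown.images`); the C#, Node.js, and PHP examples save `outputImages` only. As a rough sketch of that Markdown-saving step, the Python helper below mirrors what the Go client does. It assumes the response has the shape described by the Go structs (`result.layoutParsingResults[i].markdown` with `text` and `images`, where each image key is a relative path and each value is a Base64 string). The function name `save_markdown` and the `out_root` parameter are hypothetical, chosen for illustration.

```python
import base64
from pathlib import Path


def save_markdown(response, out_root="."):
    """Write each result's Markdown text and embedded images.

    `response` is the already-decoded JSON body (a dict). Returns the list
    of written doc.md paths. Hypothetical helper, not part of PaddleOCR.
    """
    saved = []
    results = response["result"]["layoutParsingResults"]
    for i, res in enumerate(results):
        # One directory per result, matching the Go example's markdown_<i> layout.
        md_dir = Path(out_root) / f"markdown_{i}"
        md_dir.mkdir(parents=True, exist_ok=True)

        md_path = md_dir / "doc.md"
        md_path.write_text(res["markdown"]["text"], encoding="utf-8")

        # Image keys are treated as paths relative to the Markdown file
        # (assumption based on how the Go example joins them with mdDir).
        images = res["markdown"].get("images") or {}
        for rel_path, img_b64 in images.items():
            img_path = md_dir / rel_path
            img_path.parent.mkdir(parents=True, exist_ok=True)
            img_path.write_bytes(base64.b64decode(img_b64))

        saved.append(md_path)
    return saved
```

The same loop can be ported to the other clients by decoding each `markdown.images` entry next to the saved `doc.md`, so that relative image links in the Markdown text continue to resolve.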