
Tutorial for PaddleOCR-VL Series Pipelines

PaddleOCR-VL is a SOTA and resource-efficient model tailored for document parsing. Taking the first version as an example, its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. The PaddleOCR-VL series efficiently supports 109 languages and excels in recognizing complex elements such as text, tables, formulas, and charts, while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing pipeline-based and multimodal document parsing solutions, is competitive with advanced general-purpose multimodal large models, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.

On January 29, 2026, we released PaddleOCR-VL-1.5. PaddleOCR-VL-1.5 not only significantly improved the accuracy on the OmniDocBench v1.5 evaluation set to 94.5%, but also innovatively supports irregular-shaped bounding box localization. As a result, PaddleOCR-VL-1.5 demonstrates outstanding performance in real-world scenarios such as Skew, Warping, Screen Photography, Illumination, and Scanning. In addition, the model has added new capabilities for seal (stamp) recognition and text detection and recognition, with key metrics continuing to lead the industry.

On May 28, 2026, we released PaddleOCR-VL-1.6. With an accuracy of 96.3%, PaddleOCR-VL-1.6 once again set a new benchmark on OmniDocBench v1.6, while also achieving new state-of-the-art (SOTA) results on OmniDocBench v1.5 and Real5-OmniDocBench. It delivers industry-leading performance in text, formula, and table recognition across both open-source and proprietary solutions. In addition, the model shows substantial improvements in ancient document and rare character recognition, as well as significantly enhanced capabilities in multiple scenarios such as seal recognition, spotting, and chart understanding. The model architecture remains fully consistent with PaddleOCR-VL-1.5, enabling seamless migration at zero cost.

This document applies to the PaddleOCR-VL series pipelines in PaddleX. PaddleX registers the PaddleOCR-VL series as independent top-level pipelines. They are used in basically the same way, but their default configurations and models are different.

| Pipeline name | Layout analysis model | VLM model |
| --- | --- | --- |
| PaddleOCR-VL | PP-DocLayoutV2 | PaddleOCR-VL-0.9B |
| PaddleOCR-VL-1.5 | PP-DocLayoutV3 | PaddleOCR-VL-1.5-0.9B |
| PaddleOCR-VL-1.6 | PP-DocLayoutV3 | PaddleOCR-VL-1.6-0.9B |

A PaddleOCR-VL pipeline consists of layout analysis, region cropping, reading-order handling, VLM recognition, and result assembly. PaddleOCR-VL-0.9B, PaddleOCR-VL-1.5-0.9B, and PaddleOCR-VL-1.6-0.9B are VLM submodels inside the pipelines; they are not complete PaddleX pipelines. If you only start or call a VLM inference service, you are only running the VLM recognition stage, which does not provide the full pipeline capability.

1. Environment Preparation

To use the PaddleOCR-VL series pipelines, install PaddleX and the inference engine you want to use, for example:

python -m pip install paddlepaddle-gpu==3.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install "paddlex[ocr]"
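Before running a pipeline, it can help to confirm that both packages are importable. The sketch below uses only the Python standard library. The helper name `check_packages` is hypothetical, and it assumes the import names are `paddle` and `paddlex`.

```python
import importlib.util


def check_packages(names=("paddle", "paddlex")):
    """Return a mapping of package name -> whether it can be imported."""
    status = {}
    for name in names:
        # find_spec returns None when a top-level package is not installed
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

This only checks that the packages can be located; it does not verify that the GPU build matches the local CUDA version.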

2. Quick Start

The PaddleOCR-VL series pipelines support two usage methods: CLI command line and Python API. The CLI method is simpler and suitable for quickly verifying functionality, while the Python API method is more flexible and suitable for integration into existing projects. The following examples use PaddleOCR-VL-1.6 as the primary pipeline.

2.1 Command Line Usage

paddlex --pipeline PaddleOCR-VL-1.6 --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png

# Use --use_doc_orientation_classify to enable document orientation classification
paddlex --pipeline PaddleOCR-VL-1.6 --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png --use_doc_orientation_classify True

# Use --use_doc_unwarping to enable document unwarping module
paddlex --pipeline PaddleOCR-VL-1.6 --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png --use_doc_unwarping True

# Use --use_layout_detection to enable or disable layout detection (disabled in this example)
paddlex --pipeline PaddleOCR-VL-1.6 --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png --use_layout_detection False
The command line supports more parameters. The following table lists the PaddleOCR-VL series pipeline prediction parameters currently supported by the PaddleX CLI. Common parameters also include `--input`, `--save_path`, `--engine`, `--device`, `--use_hpip`, and `--hpi_config`. For a complex `engine_config`, write it into a pipeline YAML file and pass the file through `--pipeline`.
| Parameter | Description | Type |
| --- | --- | --- |
| `input` | Data to be predicted, required. It can be an image/PDF file path, a URL, or a local directory containing images. Directory input currently does not support mixed PDF files; PDF files must be specified by file path. | `str` |
| `save_path` | Path for saving inference results. If not set, inference results will not be saved locally. | `str` |
| `use_doc_orientation_classify` | Whether to use the document orientation classification module. | `bool` |
| `use_doc_unwarping` | Whether to use the document unwarping module. | `bool` |
| `use_layout_detection` | Whether to use the layout detection and ordering module. | `bool` |
| `use_chart_recognition` | Whether to use chart parsing. | `bool` |
| `layout_threshold` | Score threshold for the layout model. It can be a float or a dictionary keyed by class ID. | `float\|dict` |
| `layout_nms` | Whether to use NMS post-processing for layout analysis. | `bool` |
| `layout_unclip_ratio` | Expansion ratio for layout detection boxes. It can be a float, a tuple, or a dictionary keyed by class ID. | `float\|tuple\|dict` |
| `layout_merge_bboxes_mode` | Merge mode for layout detection boxes. Supported values include `large`, `small`, and `union`; it can also be configured by class ID. | `str\|dict` |
| `use_queues` | Whether to enable internal queues. When enabled, PDF rendering, layout analysis, and VLM inference can run asynchronously in separate threads. | `bool` |
| `prompt_label` | Prompt type for the VLM. It only takes effect when `use_layout_detection=False`. | `str` |
| `format_block_content` | Whether to format `block_content` as Markdown. When set to `True`, image-type blocks may include image path information in `block_content`. | `bool` |
| `repetition_penalty` | Repetition penalty used in VLM sampling. | `float` |
| `temperature` | Temperature used in VLM sampling. | `float` |
| `top_p` | Top-p parameter used in VLM sampling. | `float` |
| `min_pixels` | Minimum number of pixels allowed during VLM image preprocessing. | `int` |
| `max_pixels` | Maximum number of pixels allowed during VLM image preprocessing. | `int` |
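When the CLI is driven from a script, the flags above can be assembled programmatically. The following is a minimal sketch, not part of PaddleX: `build_paddlex_command` is a hypothetical helper, `./demo.png` is a placeholder path, and only scalar options are handled (how dictionary-valued options such as a per-class `layout_threshold` are passed on the command line is not covered here).

```python
import shlex


def build_paddlex_command(input_path, pipeline="PaddleOCR-VL-1.6", **options):
    """Build an argument list for the paddlex CLI from keyword options."""
    cmd = ["paddlex", "--pipeline", pipeline, "--input", input_path]
    for key, value in options.items():
        if value is None:
            continue  # leave unset options to the pipeline defaults
        # Booleans are rendered as "True"/"False", matching the examples above
        cmd += [f"--{key}", str(value)]
    return cmd


cmd = build_paddlex_command(
    "./demo.png",  # placeholder input path
    use_doc_orientation_classify=True,
    use_layout_detection=False,
    prompt_label="table",
    save_path="./output",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once PaddleX is installed; building a list rather than a single string avoids shell-quoting problems with paths that contain spaces.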


The inference result will be printed in the terminal. The default output of the PaddleOCR-VL pipeline is as follows:

{'res': {'input_path': 'paddleocr_vl_demo.png', 'page_index': None, 'model_settings': {'use_doc_preprocessor': False, 'use_layout_detection': True, 'use_chart_recognition': False, 'format_block_content': False}, 'layout_det_res': {'input_path': None, 'page_index': None, 'boxes': [{'cls_id': 6, 'label': 'doc_title', 'score': 0.9636914134025574, 'coordinate': [np.float32(131.31366), np.float32(36.450516), np.float32(1384.522), np.float32(127.984665)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9281806349754333, 'coordinate': [np.float32(585.39465), np.float32(158.438), np.float32(930.2184), np.float32(182.57469)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9840355515480042, 'coordinate': [np.float32(9.023666), np.float32(200.86115), np.float32(361.41583), np.float32(343.8828)]}, {'cls_id': 14, 'label': 'image', 'score': 0.9871416091918945, 'coordinate': [np.float32(775.50574), np.float32(200.66502), np.float32(1503.3807), np.float32(684.9304)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9801855087280273, 'coordinate': [np.float32(9.532196), np.float32(344.90594), np.float32(361.4413), np.float32(440.8244)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9708921313285828, 'coordinate': [np.float32(28.040405), np.float32(455.87976), np.float32(341.7215), np.float32(520.7117)]}, {'cls_id': 24, 'label': 'vision_footnote', 'score': 0.9002962708473206, 'coordinate': [np.float32(809.0692), np.float32(703.70044), np.float32(1488.3016), np.float32(750.5238)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9825374484062195, 'coordinate': [np.float32(8.896561), np.float32(536.54895), np.float32(361.05237), np.float32(655.8058)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9822263717651367, 'coordinate': [np.float32(8.971573), np.float32(657.4949), np.float32(362.01715), np.float32(774.625)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9767460823059082, 'coordinate': [np.float32(9.407074), np.float32(776.5216), np.float32(361.31067), np.float32(846.82874)]}, {'cls_id': 
22, 'label': 'text', 'score': 0.9868153929710388, 'coordinate': [np.float32(8.669495), np.float32(848.2543), np.float32(361.64703), np.float32(1062.8568)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9826608300209045, 'coordinate': [np.float32(8.8025055), np.float32(1063.8615), np.float32(361.46588), np.float32(1182.8524)]}, {'cls_id': 22, 'label': 'text', 'score': 0.982555627822876, 'coordinate': [np.float32(8.820602), np.float32(1184.4663), np.float32(361.66394), np.float32(1302.4507)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9584776759147644, 'coordinate': [np.float32(9.170288), np.float32(1304.2161), np.float32(361.48898), np.float32(1351.7483)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9782056212425232, 'coordinate': [np.float32(389.1618), np.float32(200.38202), np.float32(742.7591), np.float32(295.65146)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9844875931739807, 'coordinate': [np.float32(388.73303), np.float32(297.18463), np.float32(744.00024), np.float32(441.3034)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9680547714233398, 'coordinate': [np.float32(409.39468), np.float32(455.89386), np.float32(721.7174), np.float32(520.9387)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9741666913032532, 'coordinate': [np.float32(389.71606), np.float32(536.8138), np.float32(742.7112), np.float32(608.00165)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9840384721755981, 'coordinate': [np.float32(389.30988), np.float32(609.39636), np.float32(743.09247), np.float32(750.3231)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9845995306968689, 'coordinate': [np.float32(389.13272), np.float32(751.7772), np.float32(743.058), np.float32(894.8815)]}, {'cls_id': 22, 'label': 'text', 'score': 0.984852135181427, 'coordinate': [np.float32(388.83267), np.float32(896.0371), np.float32(743.58215), np.float32(1038.7345)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9804865717887878, 'coordinate': [np.float32(389.08478), np.float32(1039.9119), np.float32(742.7585), 
np.float32(1134.4897)]}, {'cls_id': 22, 'label': 'text', 'score': 0.986461341381073, 'coordinate': [np.float32(388.52643), np.float32(1135.8137), np.float32(743.451), np.float32(1352.0085)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9869391918182373, 'coordinate': [np.float32(769.8341), np.float32(775.66235), np.float32(1124.9813), np.float32(1063.207)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9822869896888733, 'coordinate': [np.float32(770.30383), np.float32(1063.938), np.float32(1124.8295), np.float32(1184.2192)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9689218997955322, 'coordinate': [np.float32(791.3042), np.float32(1199.3169), np.float32(1104.4521), np.float32(1264.6985)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9713128209114075, 'coordinate': [np.float32(770.4253), np.float32(1279.6072), np.float32(1124.6917), np.float32(1351.8672)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9236552119255066, 'coordinate': [np.float32(1153.9058), np.float32(775.5814), np.float32(1334.0654), np.float32(798.1581)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9857938885688782, 'coordinate': [np.float32(1151.5197), np.float32(799.28015), np.float32(1506.3619), np.float32(991.1156)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9820687174797058, 'coordinate': [np.float32(1151.5686), np.float32(991.91095), np.float32(1506.6023), np.float32(1110.8875)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9866049885749817, 'coordinate': [np.float32(1151.6919), np.float32(1112.1301), np.float32(1507.1611), np.float32(1351.9504)]}]}}}
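The printed structure can be post-processed like any nested dictionary. The sketch below filters layout boxes by score and label. The sample data is hand-written to mirror the shape of the output above (values rounded), `select_boxes` is a hypothetical helper, and the coordinate order `[x1, y1, x2, y2]` is an assumption based on the axis-aligned box format.

```python
# Hand-written sample shaped like the printed result above; scores and
# coordinates are rounded for brevity and are illustrative only.
result = {
    "res": {
        "input_path": "paddleocr_vl_demo.png",
        "layout_det_res": {
            "boxes": [
                {"cls_id": 6, "label": "doc_title", "score": 0.96,
                 "coordinate": [131.3, 36.5, 1384.5, 128.0]},
                {"cls_id": 22, "label": "text", "score": 0.93,
                 "coordinate": [585.4, 158.4, 930.2, 182.6]},
                {"cls_id": 14, "label": "image", "score": 0.99,
                 "coordinate": [775.5, 200.7, 1503.4, 684.9]},
                {"cls_id": 24, "label": "vision_footnote", "score": 0.90,
                 "coordinate": [809.1, 703.7, 1488.3, 750.5]},
            ]
        },
    }
}


def select_boxes(result, min_score=0.0, labels=None):
    """Return layout boxes at or above min_score, optionally limited to labels."""
    selected = []
    for box in result["res"]["layout_det_res"]["boxes"]:
        if box["score"] < min_score:
            continue
        if labels is not None and box["label"] not in labels:
            continue
        # float() also accepts np.float32 values such as those in the real output
        x1, y1, x2, y2 = (float(v) for v in box["coordinate"])  # assumed order
        selected.append({"label": box["label"], "width": x2 - x1, "height": y2 - y1})
    return selected


for item in select_boxes(result, min_score=0.95):
    print(item["label"], round(item["width"], 1), round(item["height"], 1))
```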

For explanation of the result parameters, refer to 2.2 Python Script Integration.

Note: The default model for the pipeline is relatively large, so inference may be slow. For faster inference, see Section 3, Using VLM Inference Services.

2.2 Python Script Integration

The command line method is for quick testing and visualization. In actual projects, you usually need to integrate the model via code. You can perform pipeline inference with just a few lines of code as shown below:

from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="PaddleOCR-VL-1.6")

output = pipeline.predict(input="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png")

for res in output:
    res.print() # Print the structured prediction output
    res.save_to_json(save_path="output") # Save the current image's structured result in JSON format
    res.save_to_markdown(save_path="output") # Save the current image's result in Markdown format
    res.save_to_word(save_path="output") # Save the current image's result in Word format

For PDF files, each page will be processed individually, and a separate Markdown file will be generated for each page. If you wish to perform cross-page table merging, reconstruct multi-level labels, or merge multi-page results, you can achieve this using the following method:

from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="PaddleOCR-VL-1.6")

output = pipeline.predict(input="./your_pdf_file.pdf")

pages_res = list(output)

output = pipeline.restructure_pages(pages_res)

# output = pipeline.restructure_pages(pages_res, merge_tables=True) # Merge tables across pages
# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True) # Merge tables across pages and reconstruct multi-level titles
# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True, concatenate_pages=True) # Merge tables across pages, reconstruct multi-level titles, and merge multiple pages
for res in output:
    res.print() # Print the structured prediction output
    res.save_to_json(save_path="output") # Save the current image's structured result in JSON format
    res.save_to_markdown(save_path="output") # Save the current image's result in Markdown format

The above Python script performs the following steps:

(1) Instantiate the PaddleX pipeline object. In PaddleX, use `create_pipeline()` to create a PaddleOCR-VL series pipeline object. Fine-grained settings such as model names, model directories, and VLM server backends should usually be configured in a pipeline YAML file and then passed through `pipeline` or `config`. The parameters are described below:
| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| `pipeline` | PaddleX pipeline name or path to a pipeline config file. Supported names include `PaddleOCR-VL`, `PaddleOCR-VL-1.5`, and `PaddleOCR-VL-1.6`. A custom YAML file path can also be used. | `str\|None` | `None` |
| `config` | Pipeline config dictionary. If both `pipeline` and `config` are provided, the `pipeline_name` in `config` takes precedence. | `dict\|None` | `None` |
| `device` | Device used for inference, such as `cpu`, `gpu:0`, `xpu:0`, `npu:0`, `dcu:0`, or `mlu:0`. Actual availability depends on the local environment and inference engine. | `str\|None` | `None` |
| `engine` | Inference engine used by the pipeline or model. Different engines support different fields. See Inference Engine And Configuration. | `str\|None` | `None` |
| `engine_config` | Inference engine configuration. Different engines support different fields. See Inference Engine And Configuration. | `dict\|None` | `None` |
| `use_hpip` | Whether to enable the high-performance inference plugin. If set to `None`, the setting from the configuration file or `config` will be used. | `bool\|None` | `None` |
| `hpi_config` | High-performance inference configuration. | `dict\|None` | `None` |
(2) Call the PaddleOCR-VL pipeline's predict() method to run inference. This method returns a list of results. The pipeline also provides a predict_iter() method. The two methods accept the same parameters and return the same results; the difference is that predict_iter() returns a generator, which yields prediction results one at a time. It is suitable for large datasets or when memory use should be kept low. Choose either method based on your needs. Below are the parameters of the predict() method and their descriptions:
| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| `input` | Data to be predicted; required. Supported input types: a Python variable, such as a `numpy.ndarray` holding image data; a str, such as the local path of an image or PDF file (`/root/data/img.jpg`), the URL of an image or PDF file, or a local directory containing the images to be predicted (`/root/data/`; directories containing PDF files are currently not supported, so PDF files must be specified by file path); a list whose elements are of the types above, such as `[numpy.ndarray, numpy.ndarray]`, `["/root/data/img1.jpg", "/root/data/img2.jpg"]`, or `["/root/data1", "/root/data2"]`. | `Python Var\|str\|list` | |
| `use_doc_orientation_classify` | Whether to use the document orientation classification module during inference. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence. | `bool\|None` | `None` |
| `use_doc_unwarping` | Whether to use the text image rectification module during inference. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence. | `bool\|None` | `None` |
| `use_layout_detection` | Whether to use the layout region detection and sorting module during inference. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence. | `bool\|None` | `None` |
| `use_chart_recognition` | Whether to use the chart parsing module during inference. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence. | `bool\|None` | `None` |
| `use_seal_recognition` | Whether to use the seal recognition function. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence. | `bool\|None` | `None` |
| `use_ocr_for_image_block` | Whether to perform OCR on text within image blocks. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence. | `bool\|None` | `None` |
| `layout_threshold` | Same meaning as the instantiation parameter. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence. | `float\|dict\|None` | `None` |
| `layout_nms` | Same meaning as the instantiation parameter. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence. | `bool\|None` | `None` |
| `layout_unclip_ratio` | Same meaning as the instantiation parameter. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence. | `float\|Tuple[float,float]\|dict\|None` | `None` |
| `layout_merge_bboxes_mode` | Same meaning as the instantiation parameter. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence. | `str\|dict\|None` | `None` |
| `merge_layout_blocks` | Whether to merge layout detection boxes for cross-column layouts or columns staggered top and bottom. If not set, the value from initialization is used, which defaults to `True`. | `bool\|None` | |
| `markdown_ignore_labels` | Layout labels to be ignored in Markdown. If not set, the value from initialization is used. | `list\|None` | |
| `layout_shape_mode` | Geometric representation mode for layout analysis results. It defines how the boundaries of detected regions (e.g., text blocks, images, tables) are calculated and displayed. Supported values: `rect` (rectangle) outputs an axis-aligned bounding box (x1, y1, x2, y2), suitable for standard horizontally aligned layouts; `quad` (quadrilateral) outputs an arbitrary quadrilateral composed of four vertices, suitable for regions with skew or perspective distortion; `poly` (polygon) outputs a closed contour composed of multiple coordinate points, suitable for irregularly shaped or curved layout elements and offering the highest precision; `auto` (automatic) lets the system select the most appropriate shape representation based on the complexity and confidence of the detected targets. | `str` | `"auto"` |
| `use_queues` | Whether to enable internal queues. When set to `True`, data loading (such as rendering PDF pages as images), layout analysis model processing, and VLM inference are executed asynchronously in separate threads, with data passed through queues, thereby improving efficiency. This is particularly efficient for PDF documents with many pages or directories containing a large number of images or PDF files. | `bool\|None` | `None` |
| `prompt_label` | Prompt type for the VL model. It takes effect only when `use_layout_detection=False`. Supported values are `ocr`, `formula`, `table`, and `chart`. | `str\|None` | `None` |
| `format_block_content` | Same meaning as the instantiation parameter. Setting it to `None` means using the instantiation parameter; otherwise, this parameter takes precedence. When set to `True`, the `block_content` of image-type blocks will contain image path information (e.g., `<img src="..." />`). When set to `False` (default), the `block_content` of image-type blocks will only contain OCR-recognized text content without image paths. To include image paths in JSON output, set this parameter to `True`. | `bool\|None` | `None` |
| `repetition_penalty` | Repetition penalty used for VL model sampling. | `float\|None` | `None` |
| `temperature` | Temperature used for VL model sampling. | `float\|None` | `None` |
| `top_p` | Top-p parameter used for VL model sampling. | `float\|None` | `None` |
| `min_pixels` | Minimum number of pixels allowed when the VL model preprocesses images. | `int\|None` | `None` |
| `max_pixels` | Maximum number of pixels allowed when the VL model preprocesses images. | `int\|None` | `None` |
| `max_new_tokens` | Maximum number of tokens generated by the VL model. | `int\|None` | `None` |
| `vlm_extra_args` | Additional configuration parameters for the VLM. The currently supported custom parameters are: `ocr_min_pixels` and `ocr_max_pixels` (minimum and maximum resolution for OCR); `table_min_pixels` and `table_max_pixels` (tables); `chart_min_pixels` and `chart_max_pixels` (charts); `formula_min_pixels` and `formula_max_pixels` (formulas); `seal_min_pixels` and `seal_max_pixels` (seals). | `dict\|None` | `None` |
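To illustrate how several of these parameters might be combined, the sketch below assembles keyword arguments for `predict()`, including a validated `vlm_extra_args` dictionary. `make_vlm_extra_args` is a hypothetical helper, and all numeric values (the threshold for class ID 22 and the pixel limits) are illustrative assumptions rather than recommended settings.

```python
ELEMENT_TYPES = ("ocr", "table", "chart", "formula", "seal")


def make_vlm_extra_args(**limits):
    """Validate per-element pixel limits and return them as a dict."""
    allowed = {f"{t}_{bound}_pixels" for t in ELEMENT_TYPES for bound in ("min", "max")}
    unknown = set(limits) - allowed
    if unknown:
        raise ValueError(f"unsupported keys: {sorted(unknown)}")
    for t in ELEMENT_TYPES:
        low = limits.get(f"{t}_min_pixels")
        high = limits.get(f"{t}_max_pixels")
        if low is not None and high is not None and low > high:
            raise ValueError(f"{t}_min_pixels must not exceed {t}_max_pixels")
    return dict(limits)


predict_kwargs = {
    "use_layout_detection": True,
    "use_chart_recognition": True,
    # Per-class threshold keyed by class ID; the value is illustrative only
    "layout_threshold": {22: 0.5},
    "vlm_extra_args": make_vlm_extra_args(
        table_min_pixels=256 * 256,    # illustrative value
        table_max_pixels=2048 * 2048,  # illustrative value
    ),
}

# With PaddleX installed, the arguments could then be passed as:
# output = pipeline.predict(input="./your_pdf_file.pdf", **predict_kwargs)
print(predict_kwargs["vlm_extra_args"])
```

Validating the keys up front surfaces typos such as `tables_min_pixels` before a long inference run starts.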
(3) Call the PaddleOCR-VL pipeline's restructure_pages() method to reconstruct pages from the multi-page results list of inference predictions. This method will return a reconstructed multi-page result or a merged single-page result. Below are the parameters of the restructure_pages() method and their descriptions:
| Parameter | Description | Type | Default Value |
| --- | --- | --- | --- |
| `res_list` | The list of results predicted from a multi-page PDF inference. | `list\|None` | `None` |
| `merge_tables` | Controls whether to merge tables across pages. | `bool` | `True` |
| `relevel_titles` | Controls whether to reconstruct multi-level titles. | `bool` | `True` |
| `concatenate_pages` | Controls whether to concatenate multi-page results into one page. | `bool` | `False` |
(4) Process the prediction results: The prediction result for each sample is a corresponding Result object, supporting operations such as printing, saving as an image, and saving as a json file:
| Method | Method Description | Parameter | Parameter Type | Parameter Description | Default Value |
| --- | --- | --- | --- | --- | --- |
| `print()` | Print results to the terminal | `format_json` | `bool` | Whether to format the output content using JSON indentation. | `True` |
| | | `indent` | `int` | Indentation level used to make the output JSON data more readable. Only valid when `format_json` is `True`. | 4 |
| | | `ensure_ascii` | `bool` | Controls whether non-ASCII characters are escaped as Unicode. When set to `True`, all non-ASCII characters are escaped; `False` retains the original characters. Only valid when `format_json` is `True`. | `False` |
| `save_to_json()` | Save the results as a JSON file | `save_path` | `str` | The file path for saving. When it is a directory, the saved file name follows the input file name. | `None` |
| | | `indent` | `int` | Indentation level used to make the output JSON data more readable. Only valid when `format_json` is `True`. | 4 |
| | | `ensure_ascii` | `bool` | Controls whether non-ASCII characters are escaped as Unicode. When set to `True`, all non-ASCII characters are escaped; `False` retains the original characters. Only valid when `format_json` is `True`. | `False` |
| `save_to_img()` | Save the visualized images of each intermediate module in PNG format | `save_path` | `str` | The file path for saving, supporting directory or file paths. | `None` |
| `save_to_markdown()` | Save each page of an image or PDF file as a separate Markdown file | `save_path` | `str` | The file path for saving. When it is a directory, the saved file name follows the input file name. | `None` |
| | | `pretty` | `bool` | Whether to beautify the Markdown output, for example by centering charts, so that it renders more cleanly. | `True` |
| | | `show_formula_number` | `bool` | Controls whether to retain formula numbers in Markdown. When set to `True`, all formula numbers are retained; `False` retains only the formulas. | `False` |
| `save_to_html()` | Save the tables in the file as HTML files | `save_path` | `str` | The file path for saving, supporting directory or file paths. | `None` |
| `save_to_xlsx()` | Save the tables in the file as XLSX files | `save_path` | `str` | The file path for saving, supporting directory or file paths. | `None` |
| `save_to_word()` | Save the layout parsing result as a Word (.docx) file | `save_path` | `str` | The file path for saving, supporting directory or file paths. | `None` |
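The `indent` and `ensure_ascii` parameters share their names with arguments of Python's standard `json.dumps`. Assuming they behave the same way (an assumption; how PaddleX implements them is not stated here), their effect can be previewed on a small hand-written block:

```python
import json

# Hand-written sample block; the field names follow the result description,
# the values are illustrative only.
sample = {
    "block_label": "text",
    "block_content": "文档解析",
    "block_bbox": [9, 200, 361, 343],
}

# Default json.dumps behaviour: single line, non-ASCII escaped as \uXXXX
compact = json.dumps(sample)

# indent=4 pretty-prints; ensure_ascii=False keeps the original characters
pretty = json.dumps(sample, indent=4, ensure_ascii=False)

print(compact)
print(pretty)
```

Keeping `ensure_ascii=False` is usually preferable for multilingual documents, since the saved text stays readable without decoding escape sequences.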
- Calling the `print()` method will print the results to the terminal. The content printed to the terminal is explained as follows:
    - `input_path`: `(str)` The input path of the image or PDF to be predicted.
    - `page_index`: `(Union[int, None])` If the input is a PDF file, it indicates the current page number of the PDF; otherwise, it is `None`.
    - `page_count`: `(Union[int, None])` If the input is a PDF file, it indicates the total number of pages in the PDF; otherwise, it is `None`.
    - `width`: `(int)` The width of the original input image.
    - `height`: `(int)` The height of the original input image.
    - `model_settings`: `(Dict[str, bool])` Model parameters required for configuring PaddleOCR-VL.
        - `use_doc_preprocessor`: `(bool)` Controls whether to enable the document preprocessing sub-pipeline.
        - `use_layout_detection`: `(bool)` Controls whether to enable the layout analysis module.
        - `use_chart_recognition`: `(bool)` Controls whether to enable the chart recognition function.
        - `format_block_content`: `(bool)` Controls whether to save the formatted Markdown content in `JSON`. When set to `True`, the `block_content` of image-type blocks will contain image path information (e.g., `<img src="..." />`). When set to `False` (default), the `block_content` of image-type blocks will only contain OCR-recognized text content without image paths. To include image paths in JSON output, set this parameter to `True`.
        - `merge_layout_blocks`: `(bool)` Controls whether to merge the layout boxes of multi-column layouts or top-and-bottom alternating column layouts.
        - `markdown_ignore_labels`: `(List[str])` Labels of layout regions that need to be ignored in Markdown, defaulting to `['number','footnote','header','header_image','footer','footer_image','aside_text']`.
    - `doc_preprocessor_res`: `(Dict[str, Union[List[float], str]])` A dictionary of document preprocessing results, which exists only when `use_doc_preprocessor=True`.
        - `input_path`: `(str)` The image path accepted by the document preprocessing sub-pipeline. When the input is a `numpy.ndarray`, it is saved as `None`; here, it is `None`.
        - `page_index`: `None`. Since the input here is a `numpy.ndarray`, the value is `None`.
        - `model_settings`: `(Dict[str, bool])` Model configuration parameters for the document preprocessing sub-pipeline.
            - `use_doc_orientation_classify`: `(bool)` Controls whether to enable the document image orientation classification sub-module.
            - `use_doc_unwarping`: `(bool)` Controls whether to enable the text image distortion correction sub-module.
        - `angle`: `(int)` The prediction result of the document image orientation classification sub-module. When enabled, it returns the actual angle value.
    - `parsing_res_list`: `(List[Dict])` A list of parsing results, where each element is a dictionary. The list order is the reading order after parsing.
        - `block_bbox`: `(np.ndarray)` The bounding box of the layout region.
        - `block_label`: `(str)` The label of the layout region, such as `text`, `table`, etc.
        - `block_content`: `(str)` The content within the layout region.
        - `block_id`: `(int)` The index of the layout region, used to display the layout sorting results.
        - `block_order`: `(int)` The order of the layout region, used to display the layout reading order. For non-sorted parts, the default value is `None`.
- Calling the `save_to_json()` method will save the above content to the specified `save_path`. If a directory is specified, the saved path will be `save_path/{your_img_basename}_res.json`. If a file is specified, it will be saved directly to that file. Since JSON files do not support saving numpy arrays, the `numpy.array` types within will be converted to lists.
    - `input_path`: `(str)` The input path of the image or PDF to be predicted.
    - `page_index`: `(Union[int, None])` If the input is a PDF file, it indicates the current page number of the PDF; otherwise, it is `None`.
    - `model_settings`: `(Dict[str, bool])` Model parameters required for configuring PaddleOCR-VL.
        - `use_doc_preprocessor`: `(bool)` Controls whether to enable the document preprocessing sub-pipeline.
        - `use_layout_detection`: `(bool)` Controls whether to enable the layout analysis module.
        - `use_chart_recognition`: `(bool)` Controls whether to enable the chart recognition function.
        - `format_block_content`: `(bool)` Controls whether to save the formatted Markdown content in `JSON`. When set to `True`, the `block_content` of image-type blocks will contain image path information (e.g., `<img src="..." />`). When set to `False` (default), the `block_content` of image-type blocks will only contain OCR-recognized text content without image paths. To include image paths in JSON output, set this parameter to `True`.
    - `doc_preprocessor_res`: `(Dict[str, Union[List[float], str]])` A dictionary of document preprocessing results, which exists only when `use_doc_preprocessor=True`.
        - `input_path`: `(str)` The image path accepted by the document preprocessing sub-pipeline. When the input is a `numpy.ndarray`, it is saved as `None`; here, it is `None`.
        - `page_index`: `None`. Since the input here is a `numpy.ndarray`, the value is `None`.
        - `model_settings`: `(Dict[str, bool])` Model configuration parameters for the document preprocessing sub-pipeline.
            - `use_doc_orientation_classify`: `(bool)` Controls whether to enable the document image orientation classification sub-module.
            - `use_doc_unwarping`: `(bool)` Controls whether to enable the text image distortion correction sub-module.
        - `angle`: `(int)` The prediction result of the document image orientation classification sub-module. When enabled, it returns the actual angle value.
    - `parsing_res_list`: `(List[Dict])` A list of parsing results, where each element is a dictionary. The list order represents the reading order after parsing.
        - `block_bbox`: `(np.ndarray)` The bounding box of the layout region.
        - `block_label`: `(str)` The label of the layout region, such as `text`, `table`, etc.
        - `block_content`: `(str)` The content within the layout region.
        - `block_id`: `(int)` The index of the layout region, used to display the layout sorting results.
        - `block_order`: `(int)` The order of the layout region, used to display the layout reading order. For non-sorted parts, the default value is `None`.
- Calling the `save_to_img()` method will save the visualization results to the specified `save_path`. If a directory is specified, visualized images for layout region detection, global OCR, layout reading order, etc., will be saved. If a file is specified, it will be saved directly to that file. (Pipelines typically produce many result images, so specifying a single file path is not recommended: the images overwrite one another and only the last one is retained.)
- Calling the `save_to_markdown()` method will save the converted Markdown file to the specified `save_path`. The saved file path will be `save_path/{your_img_basename}.md`. If the input is a PDF file, it is recommended to specify a directory; otherwise, multiple Markdown files will overwrite one another.

Visualized images and prediction results can also be obtained through attributes, as follows:
| Attribute | Description |
| --- | --- |
| `json` | Obtain the prediction result in `json` format |
| `img` | Obtain visualized images in `dict` format |
| `markdown` | Obtain Markdown results in `dict` format |
- The prediction result obtained through the `json` attribute is data of dict type, with relevant content consistent with that saved by calling the `save_to_json()` method.
- The prediction result returned by the `img` attribute is data of dict type. The keys are `layout_det_res`, `overall_ocr_res`, `text_paragraphs_ocr_res`, `formula_res_region1`, `table_cell_img`, and `seal_res_region1`, with corresponding values being `Image.Image` objects: used to display visualized images of layout region detection, OCR, OCR text paragraphs, formulas, tables, and seal results, respectively. If optional modules are not used, the dict only contains `layout_det_res`.
- The prediction result returned by the `markdown` attribute is data of dict type. The keys are `markdown_texts`, `markdown_images`, and `page_continuation_flags`, with corresponding values being markdown text, images displayed in Markdown (`Image.Image` objects), and a bool tuple used to identify whether the first element on the current page is the start of a paragraph and whether the last element is the end of a paragraph, respectively.
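To illustrate how `page_continuation_flags` might be used, here is a minimal sketch that joins the per-page `markdown` dicts of a multi-page document. The helper name `join_markdown_pages` and its `continuation_sep` parameter are hypothetical, and the rule for deciding that a paragraph continues across a page break is an assumption derived from the description above, not the pipeline's own implementation.

```python
from typing import Dict, Iterable, List


def join_markdown_pages(pages: Iterable[Dict], continuation_sep: str = "") -> str:
    """Concatenate per-page Markdown text using page_continuation_flags.

    Each item is expected to look like the dict returned by the `markdown`
    attribute: it must contain `markdown_texts` (str) and
    `page_continuation_flags` (starts_paragraph, ends_paragraph).
    """
    pieces: List[str] = []
    prev_ends_paragraph = True  # nothing precedes the first page
    for page in pages:
        starts_paragraph, ends_paragraph = page["page_continuation_flags"]
        if pieces:
            # Assumption: a paragraph spans the page break only when the
            # previous page did not end one and this page does not start one.
            continues = (not prev_ends_paragraph) and (not starts_paragraph)
            pieces.append(continuation_sep if continues else "\n\n")
        pieces.append(page["markdown_texts"])
        prev_ends_paragraph = ends_paragraph
    return "".join(pieces)
```

For space-delimited languages, passing `continuation_sep=" "` avoids gluing two words together at the page boundary.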

3. Using VLM Inference Services

The inference performance under the default configuration is not fully optimized and may not meet actual production requirements. PaddleX supports connecting the VLM recognition stage in the complete pipeline to a dedicated VLM inference service. This improves VLM module inference performance and helps isolate server-side dependencies and compute resources in production environments. Server backends can include vLLM, SGLang, and FastDeploy. The workflow mainly consists of two steps:

  1. Start the VLM inference service;
  2. Configure the PaddleX pipeline to call the VLM inference service as a client.

The VLM inference service only handles the VLM recognition stage of the complete pipeline. Layout analysis, cropping, reading-order handling, and result assembly are still performed by the PaddleX pipeline, so local inference with the layout parsing model is still required. When starting the service, use the VLM submodel name that corresponds to the selected pipeline. See the pipeline table at the beginning of this document for the mapping.

3.1 Starting the VLM Inference Service

3.1.1 Using Docker Images

PaddleX provides a vLLM Docker image to quickly start a VLM inference service. For NVIDIA GPUs other than the Blackwell architecture, use ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server:latest. For Blackwell GPUs, use ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server:latest-sm120.

Using an NVIDIA GPU other than the Blackwell architecture and PaddleOCR-VL-1.6-0.9B as an example:

docker run \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server:latest \
    paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --host 0.0.0.0 --port 8118 --backend vllm

For Blackwell GPUs, replace the image above with the dedicated image:

docker run \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server:latest-sm120 \
    paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --host 0.0.0.0 --port 8118 --backend vllm
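A client may want to wait until the service port accepts connections before sending requests. The following is a minimal sketch using only the Python standard library, assuming the service listens on port 8118 as in the commands above. `wait_for_port` is a hypothetical helper, and a successful TCP connection only shows that the port is open, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # `interval` doubles as the per-attempt connection timeout.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("127.0.0.1", 8118, timeout=300)` could be called before configuring the client described in section 3.2.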

3.1.2 Starting Through PaddleX CLI

VLM server dependencies may differ from the local pipeline client environment, so it is recommended to create a separate virtual environment for the VLM inference service:

# Create a virtual environment
python -m venv .venv
# Activate the environment
source .venv/bin/activate
# Install PaddleX
python -m pip install "paddlex[ocr]"
# Install the vLLM server plugin
paddlex --install genai-vllm-server
# Or install the SGLang server plugin
# paddlex --install genai-sglang-server
# Or install the FastDeploy server plugin
# paddlex --install genai-fastdeploy-server

After installation, start the service with paddlex_genai_server:

paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --backend vllm --port 8118

# For the SGLang backend
# paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --backend sglang --port 8118

# For the FastDeploy backend
# paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --backend fastdeploy --port 8118

The command supports the following parameters:

| Parameter | Description |
| --- | --- |
| --model_name | Model name. It should match the PaddleX pipeline version being used. |
| --model_dir | Model directory. |
| --host | Server hostname. |
| --port | Server port number. |
| --backend | Backend name. Supported values are vllm, sglang, and fastdeploy. |
| --backend_config | YAML file containing backend configuration. |
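When the service is launched from a script, the argument list can be assembled programmatically. The sketch below uses only the flags listed in the table above; `build_server_command` is a hypothetical helper, not part of PaddleX.

```python
from typing import List, Optional

SUPPORTED_BACKENDS = ("vllm", "sglang", "fastdeploy")


def build_server_command(
    model_name: str,
    backend: str = "vllm",
    host: Optional[str] = None,
    port: Optional[int] = None,
    model_dir: Optional[str] = None,
    backend_config: Optional[str] = None,
) -> List[str]:
    """Build the argv list for paddlex_genai_server from the documented flags."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"backend must be one of {SUPPORTED_BACKENDS}, got {backend!r}")
    cmd = ["paddlex_genai_server", "--model_name", model_name, "--backend", backend]
    optional = {
        "--model_dir": model_dir,
        "--host": host,
        "--port": port,
        "--backend_config": backend_config,
    }
    # Flags left as None are omitted so the server applies its own defaults.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd
```

The resulting list can be passed to `subprocess.Popen` inside the virtual environment where the server plugin is installed.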

3.2 Client Usage

After starting the VLM inference service, the client can call it through PaddleX. Install the client plugin first:

paddlex --install genai-client

Next, obtain the pipeline configuration file:

paddlex --get_pipeline_config PaddleOCR-VL-1.6

The default save path is PaddleOCR-VL-1.6.yaml. Modify SubModules.VLRecognition.genai_config.backend and SubModules.VLRecognition.genai_config.server_url in the config file to match the service, for example:

SubModules:
  VLRecognition:
    genai_config:
      backend: vllm-server
      server_url: http://127.0.0.1:8118/v1
      max_concurrency: 200

You can also use the unified engine + engine_config style to configure this submodule explicitly:

SubModules:
  VLRecognition:
    engine: genai_client
    engine_config:
      backend: vllm-server
      server_url: http://127.0.0.1:8118/v1
      max_concurrency: 200
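If the configuration is edited from code rather than by hand, the same change can be applied to the parsed config. This is a minimal sketch that assumes the YAML file has already been loaded into a plain dict (for example with a YAML library of your choice). `with_genai_client` is a hypothetical helper, and it only covers the `genai_config` style shown first.

```python
from copy import deepcopy
from typing import Any, Dict
from urllib.parse import urlparse


def with_genai_client(
    config: Dict[str, Any],
    server_url: str,
    backend: str = "vllm-server",
    max_concurrency: int = 200,
) -> Dict[str, Any]:
    """Return a copy of `config` whose VLRecognition submodule calls a VLM service."""
    parsed = urlparse(server_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"server_url must be an http(s) URL, got {server_url!r}")
    updated = deepcopy(config)  # leave the caller's dict untouched
    vl_recognition = updated.setdefault("SubModules", {}).setdefault("VLRecognition", {})
    vl_recognition.setdefault("genai_config", {}).update(
        {"backend": backend, "server_url": server_url, "max_concurrency": max_concurrency}
    )
    return updated
```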

Then use the modified config file to run the pipeline. For CLI:

paddlex --pipeline PaddleOCR-VL-1.6.yaml --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png

Or through the Python API:

from paddlex import create_pipeline

pipeline = create_pipeline("PaddleOCR-VL-1.6.yaml")

for res in pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png"):
    res.print()

4. Serving

If you need to directly apply PaddleOCR-VL in your Python project, you can refer to the example code in 2.2 Python Script Integration.

PaddleX also provides a service deployment method, described below:

4.1 Install Dependencies

Run the following command to install the PaddleX serving plugin via PaddleX CLI:

paddlex --install serving

4.2 Run the Server

Run the server via PaddleX CLI:

paddlex --serve --pipeline PaddleOCR-VL-1.6

You should see information similar to the following:

INFO:     Started server process [63108]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

If you need to adjust the configuration, such as model paths, batch size, deployment device, or VLM server backend, specify a custom configuration file through --pipeline. PaddleOCR-VL-1.5 and PaddleOCR-VL-1.6 share the same serving application implementation internally, but their pipeline configs remain distinct.

The command-line options related to serving are as follows:

| Name | Description |
| --- | --- |
| --pipeline | PaddleX pipeline registration name or pipeline configuration file path. |
| --device | Deployment device for the pipeline. By default, a GPU will be used if available; otherwise, a CPU will be used. |
| --host | Hostname or IP address to which the server is bound. Defaults to 0.0.0.0. |
| --port | Port number on which the server listens. Defaults to 8080. |
| --use_hpip | If specified, uses high-performance inference. Refer to the High-Performance Inference documentation for more information. |
| --hpi_config | High-performance inference configuration. Refer to the High-Performance Inference documentation for more information. |

4.3 Client-Side Invocation

Below are the API reference for basic service-based deployment and examples of invoking the service from several programming languages:

API Reference

For the main operations provided by the service:

  • The HTTP request method is POST.
  • Both the request body and response body are JSON data (JSON objects).
  • When the request is processed successfully, the response status code is 200, and the properties of the response body are as follows:

| Name | Type | Meaning |
| --- | --- | --- |
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Fixed as 0. |
| errorMsg | string | Error description. Fixed as "Success". |
| result | object | Operation result. |

  • When the request is not processed successfully, the properties of the response body are as follows:

| Name | Type | Meaning |
| --- | --- | --- |
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Same as the response status code. |
| errorMsg | string | Error description. |
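The response envelope described above can be handled in one place on the client side. The following is a minimal sketch; `ServiceError` and `unwrap_result` are hypothetical names, and the function expects the HTTP status code and the already-parsed JSON body.

```python
from typing import Any, Dict


class ServiceError(RuntimeError):
    """Raised when the service reports a failed request."""

    def __init__(self, log_id: Any, error_code: Any, error_msg: str) -> None:
        super().__init__(f"[{error_code}] {error_msg} (logId={log_id})")
        self.log_id = log_id
        self.error_code = error_code
        self.error_msg = error_msg


def unwrap_result(status_code: int, body: Dict[str, Any]) -> Dict[str, Any]:
    """Return body["result"] on success, otherwise raise ServiceError."""
    if status_code == 200 and body.get("errorCode") == 0:
        return body["result"]
    # On failure, errorCode is documented to equal the HTTP status code.
    raise ServiceError(
        body.get("logId"),
        body.get("errorCode", status_code),
        body.get("errorMsg", ""),
    )
```

Keeping `logId` on the exception makes it easier to correlate a client-side failure with the server logs.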

The main operations provided by the service are as follows:

  • infer

Perform layout parsing.

POST /layout-parsing

  • The properties of the request body are as follows:
Name Type Meaning Required
file string The URL of image files (including TIFF; multi-page TIFF is processed page by page) or PDF file accessible to the server, or the Base64-encoded result of the content of the aforementioned file types. By default, there is no page limit. To set a page limit on the server, set Serving.extra.max_num_input_imgs to a positive integer in the pipeline configuration file, for example:
Serving:
  extra:
    max_num_input_imgs: 10
Yes
fileType integer|null File type. 0 represents a PDF file, 1 represents an image file (including TIFF). If this property is not present in the request body, the file type will be inferred from the URL. No
useDocOrientationClassify boolean | null Please refer to the description of the use_doc_orientation_classify parameter in the pipeline predict method. No
useDocUnwarping boolean|null Please refer to the description of the use_doc_unwarping parameter in the pipeline predict method. No
useLayoutDetection boolean|null Please refer to the description of the use_layout_detection parameter in the pipeline predict method. No
useChartRecognition boolean|null Please refer to the description of the use_chart_recognition parameter in the pipeline predict method. No
useSealRecognition boolean|null Please refer to the description of the use_seal_recognition parameter in the pipeline predict method. No
useOcrForImageBlock boolean|null Please refer to the description of the use_ocr_for_image_block parameter in the pipeline predict method. No
layoutThreshold number|object|null Please refer to the description of the layout_threshold parameter in the pipeline predict method. No
layoutNms boolean|null Please refer to the description of the layout_nms parameter in the pipeline predict method. No
layoutUnclipRatio number|array|object|null Please refer to the description of the layout_unclip_ratio parameter in the pipeline predict method. No
layoutMergeBboxesMode string|object|null Please refer to the description of the layout_merge_bboxes_mode parameter in the pipeline predict method. No
layoutShapeMode string Please refer to the description of the layout_shape_mode parameter in the pipeline predict method. No
promptLabel string|null Please refer to the description of the prompt_label parameter in the pipeline predict method. No
formatBlockContent boolean|null Please refer to the description of the format_block_content parameter in the pipeline predict method. No
repetitionPenalty number|null Please refer to the description of the repetition_penalty parameter in the pipeline predict method. No
temperature number|null Please refer to the description of the temperature parameter in the pipeline predict method. No
topP number|null Please refer to the description of the top_p parameter in the pipeline predict method. No
minPixels number|null Please refer to the description of the min_pixels parameter in the pipeline predict method. No
maxPixels number|null Please refer to the description of the max_pixels parameter in the pipeline predict method. No
maxNewTokens number|null Please refer to the description of the max_new_tokens parameter in the pipeline predict method. No
mergeLayoutBlocks boolean|null Please refer to the description of the merge_layout_blocks parameter in the pipeline predict method. No
markdownIgnoreLabels array|null Please refer to the description of the markdown_ignore_labels parameter in the pipeline predict method. No
vlmExtraArgs object|null Please refer to the description of the vlm_extra_args parameter in the pipeline predict method. No
prettifyMarkdown boolean Whether to output beautified Markdown text. The default is true. No
showFormulaNumber boolean Whether to include formula numbers in the output Markdown text. The default is false. No
returnMarkdownImages boolean Whether to return the images referenced in the Markdown. Default true; when set to false, markdown.images is null or omitted and the server skips image encoding / URL upload. No
restructurePages boolean Whether to restructure results across multiple pages. The default is false. No
mergeTables boolean Please refer to the description of the merge_tables parameter in the pipeline restructure_pages method. Valid only when restructurePages is true. No
relevelTitles boolean Please refer to the description of the relevel_titles parameter in the pipeline restructure_pages method. Valid only when restructurePages is true. No
outputFormats array|null Optional list of extra document formats to return. By default, no extra formats are returned. Currently only "docx" is supported. No
visualize boolean|null Whether to return visualization result images and intermediate images during the processing.
  • Pass true: Return images.
  • Pass false: Do not return images.
  • If this parameter is not provided in the request body or null is passed: Follow the setting in the configuration file Serving.visualize.

For example, add the following field in the configuration file:
Serving:
  visualize: False
Images will not be returned by default, and the default behavior can be overridden by the visualize parameter in the request body. If this parameter is not set in either the request body or the configuration file (or null is passed in the request body and the configuration file is not set), images will be returned by default.
No
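As a sketch of how a client might assemble the request body, the helper below encodes local file content as Base64 or passes a URL through, and drops options left as `None` so the server falls back to its own configuration. A second function restates the precedence described for `visualize`. Both function names are hypothetical; the option keys are the camelCase names from the table above.

```python
import base64
from typing import Any, Dict, Optional


def build_infer_payload(
    file_bytes: Optional[bytes] = None,
    file_url: Optional[str] = None,
    file_type: Optional[int] = None,
    **options: Any,
) -> Dict[str, Any]:
    """Build the JSON body for POST /layout-parsing."""
    if (file_bytes is None) == (file_url is None):
        raise ValueError("Provide exactly one of file_bytes or file_url")
    if file_url is not None:
        file_field = file_url
    else:
        file_field = base64.b64encode(file_bytes).decode("ascii")
    payload: Dict[str, Any] = {"file": file_field}
    if file_type is not None:
        payload["fileType"] = file_type  # 0 = PDF, 1 = image
    # Options set to None are omitted rather than sent as null.
    payload.update({key: value for key, value in options.items() if value is not None})
    return payload


def resolve_visualize(request_value: Optional[bool], config_value: Optional[bool]) -> bool:
    """Mirror the documented precedence: request body, then Serving.visualize, then True."""
    if request_value is not None:
        return request_value
    if config_value is not None:
        return config_value
    return True
```

For example, `build_infer_payload(file_bytes=data, file_type=1, useDocUnwarping=False)` yields a body that can be passed as `json=` to an HTTP client.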
  • When the request is processed successfully, the result in the response body has the following attributes:
| Name | Type | Meaning |
| --- | --- | --- |
| layoutParsingResults | array | Layout parsing results. The array length is 1 (for image input) or the actual number of document pages processed (for PDF input). For PDF input, each element in the array represents the result of each actual page processed in the PDF file. |
| dataInfo | object | Input data information. |

Image and other binary file fields in the element schema below (e.g. outputImages, inputImage, markdown.images, exports) are returned inline as Base64 strings by default; when the server is configured to return URLs, those values become pre-signed URLs while the field types remain unchanged. See the "Returning Binary Content as URLs" section of the Serving Deployment Guide for configuration.

Each element in layoutParsingResults is an object with the following attributes:

| Name | Type | Meaning |
| --- | --- | --- |
| prunedResult | object | A simplified version of the res field in the JSON representation of the results generated by the predict method of the object, with the input_path and page_index fields removed. |
| markdown | object | Markdown results. |
| outputImages | object \| null | Refer to the img property description of the prediction results. The image is in JPEG format, encoded as Base64 by default; returned as a pre-signed URL when URL-return mode is enabled. |
| inputImage | string \| null | Input image. The image is in JPEG format, encoded as Base64 by default; returned as a pre-signed URL when URL-return mode is enabled. |
| exports | object \| null | Optional additional exports. Present only when outputFormats is set. Example: {"docx": {"content": "..."}}, where content is the Base64-encoded file content by default, or a pre-signed URL when URL-return mode is enabled. |

markdown is an object with the following properties:

| Name | Type | Meaning |
| --- | --- | --- |
| text | string | Markdown text. |
| images | object \| null | Key-value pairs of relative Markdown image paths and their image data. Values are Base64-encoded by default; returned as pre-signed URLs when URL-return mode is enabled. The field is null or omitted when returnMarkdownImages is false in the request. |
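Because `images` may be null and its values may be either Base64 strings or pre-signed URLs, a client that writes the Markdown to disk has to handle each case. The sketch below is one way to do so, assuming URL values start with `http://` or `https://` (a Base64 string cannot, since `:` is not in the Base64 alphabet). `save_markdown` is a hypothetical helper, and it reports URL-valued images instead of downloading them.

```python
import base64
import pathlib
from typing import Any, Dict, List, Union


def save_markdown(
    markdown: Dict[str, Any],
    out_dir: Union[str, pathlib.Path],
    filename: str = "doc.md",
) -> List[str]:
    """Write markdown["text"] and its Base64 images under out_dir.

    Returns the relative paths of images whose values are URLs; those are
    left for the caller to download.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / filename).write_text(markdown["text"], encoding="utf-8")

    url_images: List[str] = []
    # `images` is null or omitted when returnMarkdownImages is false.
    for rel_path, value in (markdown.get("images") or {}).items():
        if value.startswith(("http://", "https://")):
            url_images.append(rel_path)
            continue
        target = out / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(base64.b64decode(value))
    return url_images
```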
  • restructurePages

Restructure results across multiple pages.

POST /restructure-pages

  • The request body has the following properties:
| Name | Type | Description | Required |
| --- | --- | --- | --- |
| pages | array | An array of pages. | Yes |
| mergeTables | boolean | Please refer to the description of the merge_tables parameter in the pipeline restructure_pages method. | No |
| relevelTitles | boolean | Please refer to the description of the relevel_titles parameter in the pipeline restructure_pages method. | No |
| concatenatePages | boolean | Please refer to the description of the concatenate_pages parameter in the pipeline restructure_pages method. | No |
| prettifyMarkdown | boolean | Whether to output beautified Markdown text. The default is true. | No |
| showFormulaNumber | boolean | Whether to include formula numbers in the output Markdown text. The default is false. | No |
| returnMarkdownImages | boolean | Whether to return the images referenced in the Markdown (from pages[].markdownImages in the request). Default true; when set to false, markdown.images is null or omitted and the server does not back-fill it. | No |
| outputFormats | array \| null | Optional extra export formats; same meaning as outputFormats on infer. Only "docx" is supported. | No |

Each element in pages is an object with the following properties:

| Name | Type | Description |
| --- | --- | --- |
| prunedResult | object | The prunedResult object returned by the infer operation. |
| markdownImages | object \| null | The images property of the markdown object returned by the infer operation. |
  • When the request is processed successfully, the result field in the response body has the following properties:
| Name | Type | Description |
| --- | --- | --- |
| layoutParsingResults | array | The restructured layout parsing results. For the fields that every element contains, please refer to the description of the result returned by the infer operation (excluding visualization result images and intermediate images). |
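The request body for restructurePages is built from the output of infer. The sketch below shows that mapping; `build_restructure_payload` is a hypothetical helper, and the optional flags are only included when they are explicitly set.

```python
from typing import Any, Dict, List, Optional


def build_restructure_payload(
    layout_parsing_results: List[Dict[str, Any]],
    merge_tables: Optional[bool] = None,
    relevel_titles: Optional[bool] = None,
    concatenate_pages: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build the JSON body for POST /restructure-pages from infer results."""
    pages = [
        {
            "prunedResult": item["prunedResult"],
            # markdown.images may be null or omitted; pass None through in that case.
            "markdownImages": (item.get("markdown") or {}).get("images"),
        }
        for item in layout_parsing_results
    ]
    payload: Dict[str, Any] = {"pages": pages}
    optional = {
        "mergeTables": merge_tables,
        "relevelTitles": relevel_titles,
        "concatenatePages": concatenate_pages,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```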
Multi-Language Service Invocation Examples
Python

import base64
import requests
import pathlib

BASE_URL = "http://localhost:8080"

image_path = "./demo.jpg"

# Encode the local image in Base64
with open(image_path, "rb") as file:
    image_bytes = file.read()
    image_data = base64.b64encode(image_bytes).decode("ascii")

payload = {
    "file": image_data, # Base64-encoded file content or file URL
    "fileType": 1, # File type, 1 indicates an image file
}

response = requests.post(BASE_URL + "/layout-parsing", json=payload)
assert response.status_code == 200, (response.status_code, response.text)

result = response.json()["result"]
pages = []
for i, res in enumerate(result["layoutParsingResults"]):
    pages.append({"prunedResult": res["prunedResult"], "markdownImages": res["markdown"].get("images")})
    # outputImages is null when visualization is disabled
    for img_name, img in (res.get("outputImages") or {}).items():
        img_path = f"{img_name}_{i}.jpg"
        pathlib.Path(img_path).parent.mkdir(parents=True, exist_ok=True)
        with open(img_path, "wb") as f:
            f.write(base64.b64decode(img))
        print(f"Output image saved at {img_path}")

payload = {
    "pages": pages,
    "concatenatePages": True,
}

response = requests.post(BASE_URL + "/restructure-pages", json=payload)
assert response.status_code == 200, (response.status_code, response.text)

result = response.json()["result"]
res = result["layoutParsingResults"][0]
print(res["prunedResult"])
md_dir = pathlib.Path("markdown")
md_dir.mkdir(exist_ok=True)
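(md_dir / "doc.md").write_text(res["markdown"]["text"], encoding="utf-8")
# markdown.images is null or omitted when returnMarkdownImages is false
for img_path, img in (res["markdown"].get("images") or {}).items():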
    img_path = md_dir / img_path
    img_path.parent.mkdir(parents=True, exist_ok=True)
    img_path.write_bytes(base64.b64decode(img))
print(f"Markdown document saved at {md_dir / 'doc.md'}")
C++
#include <iostream>
#include <filesystem>
#include <fstream>
#include <vector>
#include <string>
#include "cpp-httplib/httplib.h" // https://github.com/Huiyicc/cpp-httplib
#include "nlohmann/json.hpp" // https://github.com/nlohmann/json
#include "base64.hpp" // https://github.com/tobiaslocker/base64

namespace fs = std::filesystem;

int main() {
    httplib::Client client("localhost", 8080);

    const std::string filePath = "./demo.jpg";

    std::ifstream file(filePath, std::ios::binary | std::ios::ate);
    if (!file) {
        std::cerr << "Error opening file: " << filePath << std::endl;
        return 1;
    }

    std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> buffer(size);
    if (!file.read(buffer.data(), size)) {
        std::cerr << "Error reading file." << std::endl;
        return 1;
    }

    std::string bufferStr(buffer.data(), static_cast<size_t>(size));
    std::string encodedFile = base64::to_base64(bufferStr);

    nlohmann::json jsonObj;
    jsonObj["file"] = encodedFile;
    jsonObj["fileType"] = 1;

    auto response = client.Post("/layout-parsing", jsonObj.dump(), "application/json");

    if (response && response->status == 200) {
        nlohmann::json jsonResponse = nlohmann::json::parse(response->body);
        auto result = jsonResponse["result"];

        if (!result.is_object() || !result.contains("layoutParsingResults")) {
            std::cerr << "Unexpected response format." << std::endl;
            return 1;
        }

        const auto& results = result["layoutParsingResults"];
        for (size_t i = 0; i < results.size(); ++i) {
            const auto& res = results[i];

            if (res.contains("prunedResult")) {
                std::cout << "Layout result [" << i << "]: " << res["prunedResult"].dump() << std::endl;
            }

            if (res.contains("outputImages") && res["outputImages"].is_object()) {
                for (auto& [imgName, imgBase64] : res["outputImages"].items()) {
                    std::string outputPath = imgName + "_" + std::to_string(i) + ".jpg";
                    fs::path pathObj(outputPath);
                    fs::path parentDir = pathObj.parent_path();
                    if (!parentDir.empty() && !fs::exists(parentDir)) {
                        fs::create_directories(parentDir);
                    }

                    std::string decodedImage = base64::from_base64(imgBase64.get<std::string>());

                    std::ofstream outFile(outputPath, std::ios::binary);
                    if (outFile.is_open()) {
                        outFile.write(decodedImage.c_str(), decodedImage.size());
                        outFile.close();
                        std::cout << "Saved image: " << outputPath << std::endl;
                    } else {
                        std::cerr << "Failed to save image: " << outputPath << std::endl;
                    }
                }
            }
        }
    } else {
        std::cerr << "Request failed." << std::endl;
        if (response) {
            std::cerr << "HTTP status: " << response->status << std::endl;
            std::cerr << "Response body: " << response->body << std::endl;
        }
        return 1;
    }

    return 0;
}
Java
import okhttp3.*;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.nio.file.Paths;
import java.nio.file.Files;

public class Main {
    public static void main(String[] args) throws IOException {
        String API_URL = "http://localhost:8080/layout-parsing";
        String imagePath = "./demo.jpg";

        File file = new File(imagePath);
        byte[] fileContent = java.nio.file.Files.readAllBytes(file.toPath());
        String base64Image = Base64.getEncoder().encodeToString(fileContent);

        ObjectMapper objectMapper = new ObjectMapper();
        ObjectNode payload = objectMapper.createObjectNode();
        payload.put("file", base64Image);
        payload.put("fileType", 1);

        OkHttpClient client = new OkHttpClient();
        MediaType JSON = MediaType.get("application/json; charset=utf-8");

        RequestBody body = RequestBody.create(JSON, payload.toString());

        Request request = new Request.Builder()
                .url(API_URL)
                .post(body)
                .build();

        try (Response response = client.newCall(request).execute()) {
            if (response.isSuccessful()) {
                String responseBody = response.body().string();
                JsonNode root = objectMapper.readTree(responseBody);
                JsonNode result = root.get("result");

                JsonNode layoutParsingResults = result.get("layoutParsingResults");
                for (int i = 0; i < layoutParsingResults.size(); i++) {
                    JsonNode item = layoutParsingResults.get(i);
                    int finalI = i;
                    JsonNode prunedResult = item.get("prunedResult");
                    System.out.println("Pruned Result [" + i + "]: " + prunedResult.toString());

                    JsonNode outputImages = item.get("outputImages");
                    outputImages.fieldNames().forEachRemaining(imgName -> {
                        try {
                            String imgBase64 = outputImages.get(imgName).asText();
                            byte[] imgBytes = Base64.getDecoder().decode(imgBase64);
                            String imgPath = imgName + "_" + finalI + ".jpg";

                            File outputFile = new File(imgPath);
                            File parentDir = outputFile.getParentFile();
                            if (parentDir != null && !parentDir.exists()) {
                                parentDir.mkdirs();
                                System.out.println("Created directory: " + parentDir.getAbsolutePath());
                            }

                            try (FileOutputStream fos = new FileOutputStream(outputFile)) {
                                fos.write(imgBytes);
                                System.out.println("Saved image: " + imgPath);
                            }
                        } catch (IOException e) {
                            System.err.println("Failed to save image: " + e.getMessage());
                        }
                    });
                }
            } else {
                System.err.println("Request failed with HTTP code: " + response.code());
            }
        }
    }
}
Go
package main

import (
    "bytes"
    "encoding/base64"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "net/http"
    "os"
    "path/filepath"
)

func main() {
    API_URL := "http://localhost:8080/layout-parsing"
    filePath := "./demo.jpg"

    fileBytes, err := ioutil.ReadFile(filePath)
    if err != nil {
        fmt.Printf("Error reading file: %v\n", err)
        return
    }
    fileData := base64.StdEncoding.EncodeToString(fileBytes)

    payload := map[string]interface{}{
        "file":     fileData,
        "fileType": 1,
    }
    payloadBytes, err := json.Marshal(payload)
    if err != nil {
        fmt.Printf("Error marshaling payload: %v\n", err)
        return
    }

    client := &http.Client{}
    req, err := http.NewRequest("POST", API_URL, bytes.NewBuffer(payloadBytes))
    if err != nil {
        fmt.Printf("Error creating request: %v\n", err)
        return
    }
    req.Header.Set("Content-Type", "application/json")

    res, err := client.Do(req)
    if err != nil {
        fmt.Printf("Error sending request: %v\n", err)
        return
    }
    defer res.Body.Close()

    if res.StatusCode != http.StatusOK {
        fmt.Printf("Unexpected status code: %d\n", res.StatusCode)
        return
    }

    body, err := ioutil.ReadAll(res.Body)
    if err != nil {
        fmt.Printf("Error reading response: %v\n", err)
        return
    }

    type Markdown struct {
        Text   string            `json:"text"`
        Images map[string]string `json:"images"`
    }

    type LayoutResult struct {
        PrunedResult map[string]interface{} `json:"prunedResult"`
        Markdown     Markdown               `json:"markdown"`
        OutputImages map[string]string      `json:"outputImages"`
        InputImage   *string                `json:"inputImage"`
    }

    type Response struct {
        Result struct {
            LayoutParsingResults []LayoutResult `json:"layoutParsingResults"`
            DataInfo             interface{}    `json:"dataInfo"`
        } `json:"result"`
    }

    var respData Response
    if err := json.Unmarshal(body, &respData); err != nil {
        fmt.Printf("Error parsing response: %v\n", err)
        return
    }

    for i, res := range respData.Result.LayoutParsingResults {
        fmt.Printf("Result %d - prunedResult: %+v\n", i, res.PrunedResult)

        mdDir := fmt.Sprintf("markdown_%d", i)
        os.MkdirAll(mdDir, 0755)
        mdFile := filepath.Join(mdDir, "doc.md")
        if err := os.WriteFile(mdFile, []byte(res.Markdown.Text), 0644); err != nil {
            fmt.Printf("Error writing markdown file: %v\n", err)
        } else {
            fmt.Printf("Markdown document saved at %s\n", mdFile)
        }

        for path, imgBase64 := range res.Markdown.Images {
            fullPath := filepath.Join(mdDir, path)
            if err := os.MkdirAll(filepath.Dir(fullPath), 0755); err != nil {
                fmt.Printf("Error creating directory for markdown image: %v\n", err)
                continue
            }
            imgBytes, err := base64.StdEncoding.DecodeString(imgBase64)
            if err != nil {
                fmt.Printf("Error decoding markdown image: %v\n", err)
                continue
            }
            if err := os.WriteFile(fullPath, imgBytes, 0644); err != nil {
                fmt.Printf("Error saving markdown image: %v\n", err)
            }
        }

        for name, imgBase64 := range res.OutputImages {
            imgBytes, err := base64.StdEncoding.DecodeString(imgBase64)
            if err != nil {
                fmt.Printf("Error decoding output image %s: %v\n", name, err)
                continue
            }
            filename := fmt.Sprintf("%s_%d.jpg", name, i)

            if err := os.MkdirAll(filepath.Dir(filename), 0755); err != nil {
                fmt.Printf("Error creating directory for output image: %v\n", err)
                continue
            }

            if err := os.WriteFile(filename, imgBytes, 0644); err != nil {
                fmt.Printf("Error saving output image %s: %v\n", filename, err)
            } else {
                fmt.Printf("Output image saved at %s\n", filename)
            }
        }
    }
}
C#
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class Program
{
    static readonly string API_URL = "http://localhost:8080/layout-parsing";
    static readonly string inputFilePath = "./demo.jpg";

    static async Task Main(string[] args)
    {
        // Dispose the client when Main exits (using declaration, C# 8.0+).
        using var httpClient = new HttpClient();

        byte[] fileBytes = File.ReadAllBytes(inputFilePath);
        string fileData = Convert.ToBase64String(fileBytes);

        var payload = new JObject
        {
            { "file", fileData },
            { "fileType", 1 }
        };
        var content = new StringContent(payload.ToString(), Encoding.UTF8, "application/json");

        HttpResponseMessage response = await httpClient.PostAsync(API_URL, content);
        response.EnsureSuccessStatusCode();

        string responseBody = await response.Content.ReadAsStringAsync();
        JObject jsonResponse = JObject.Parse(responseBody);

        JArray layoutParsingResults = (JArray)jsonResponse["result"]["layoutParsingResults"];
        for (int i = 0; i < layoutParsingResults.Count; i++)
        {
            var res = layoutParsingResults[i];
            Console.WriteLine($"[{i}] prunedResult:\n{res["prunedResult"]}");

            JObject outputImages = res["outputImages"] as JObject;
            if (outputImages != null)
            {
                foreach (var img in outputImages)
                {
                    string imgName = img.Key;
                    string base64Img = img.Value?.ToString();
                    if (!string.IsNullOrEmpty(base64Img))
                    {
                        string imgPath = $"{imgName}_{i}.jpg";
                        byte[] imageBytes = Convert.FromBase64String(base64Img);

                        string directory = Path.GetDirectoryName(imgPath);
                        if (!string.IsNullOrEmpty(directory) && !Directory.Exists(directory))
                        {
                            Directory.CreateDirectory(directory);
                            Console.WriteLine($"Created directory: {directory}");
                        }

                        File.WriteAllBytes(imgPath, imageBytes);
                        Console.WriteLine($"Output image saved at {imgPath}");
                    }
                }
            }
        }
    }
}
Node.js
const axios = require('axios');
const fs = require('fs');
const path = require('path');

const API_URL = 'http://localhost:8080/layout-parsing';
const imagePath = './demo.jpg';
const fileType = 1;

function encodeImageToBase64(filePath) {
  // readFileSync already returns a Buffer, so it can be encoded directly.
  return fs.readFileSync(filePath).toString('base64');
}

const payload = {
  file: encodeImageToBase64(imagePath),
  fileType: fileType
};

axios.post(API_URL, payload)
  .then(response => {
    const results = response.data.result.layoutParsingResults;
    results.forEach((res, index) => {
      console.log(`\n[${index}] prunedResult:`);
      console.log(res.prunedResult);

      const outputImages = res.outputImages;
      if (outputImages) {
        Object.entries(outputImages).forEach(([imgName, base64Img]) => {
          const imgPath = `${imgName}_${index}.jpg`;

          const directory = path.dirname(imgPath);
          if (!fs.existsSync(directory)) {
            fs.mkdirSync(directory, { recursive: true });
            console.log(`Created directory: ${directory}`);
          }

          fs.writeFileSync(imgPath, Buffer.from(base64Img, 'base64'));
          console.log(`Output image saved at ${imgPath}`);
        });
      } else {
        console.log(`[${index}] No outputImages.`);
      }
    });
  })
  .catch(error => {
    console.error('Error during API request:', error.message || error);
  });
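The Node.js example above saves only `outputImages`, whereas the Go example also writes the Markdown document and its embedded images. A minimal sketch of the same step in Node.js follows. It assumes each result carries a `markdown` object with a `text` string and an `images` map of relative path to Base64 data, mirroring the fields used in the Go example; `saveMarkdownResult` and its `baseDir` parameter are hypothetical names introduced here for illustration, not part of the service API.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: writes the Markdown text and its embedded images
// for one entry of layoutParsingResults. The field names (markdown.text,
// markdown.images) are assumed to match those used in the Go example.
function saveMarkdownResult(res, index, baseDir = '.') {
  const markdown = res.markdown;
  if (!markdown || typeof markdown.text !== 'string') {
    return null; // nothing to save for this result
  }

  const mdDir = path.join(baseDir, `markdown_${index}`);
  fs.mkdirSync(mdDir, { recursive: true });

  const mdFile = path.join(mdDir, 'doc.md');
  fs.writeFileSync(mdFile, markdown.text, 'utf8');

  // Image keys are treated as paths relative to the Markdown directory.
  for (const [relPath, base64Img] of Object.entries(markdown.images || {})) {
    const fullPath = path.join(mdDir, relPath);
    fs.mkdirSync(path.dirname(fullPath), { recursive: true });
    fs.writeFileSync(fullPath, Buffer.from(base64Img, 'base64'));
  }

  return mdFile;
}
```

Inside the `results.forEach` callback, this could be called as `saveMarkdownResult(res, index)`; it returns the path of the written `doc.md`, or `null` when the result has no Markdown text.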
PHP
<?php

$API_URL = "http://localhost:8080/layout-parsing";
$image_path = "./demo.jpg";

$image_data = base64_encode(file_get_contents($image_path));
$payload = array("file" => $image_data, "fileType" => 1);

$ch = curl_init($API_URL);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($payload));
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
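$response = curl_exec($ch);
if ($response === false) {
    // curl_exec returns false on transport errors (e.g. connection refused).
    echo "Request failed: " . curl_error($ch) . "\n";
    curl_close($ch);
    exit(1);
}
curl_close($ch);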

$result = json_decode($response, true)["result"]["layoutParsingResults"];

foreach ($result as $i => $item) {
    echo "[$i] prunedResult:\n";
    print_r($item["prunedResult"]);

    if (!empty($item["outputImages"])) {
        foreach ($item["outputImages"] as $img_name => $img_base64) {
            $output_image_path = "{$img_name}_{$i}.jpg";

            $directory = dirname($output_image_path);
            if (!is_dir($directory)) {
                mkdir($directory, 0777, true);
                echo "Created directory: $directory\n";
            }

            file_put_contents($output_image_path, base64_decode($img_base64));
            echo "Output image saved at $output_image_path\n";
        }
    } else {
        echo "No outputImages found for item $i\n";
    }
}
?>

