Tutorial for PaddleOCR-VL Series Pipelines¶
PaddleOCR-VL is a SOTA and resource-efficient model tailored for document parsing. Taking the first version as an example, its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. The PaddleOCR-VL series efficiently supports 109 languages and excels in recognizing complex elements such as text, tables, formulas, and charts, while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing pipeline-based and multimodal document parsing solutions, is competitive with advanced general-purpose multimodal large models, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.
On January 29, 2026, we released PaddleOCR-VL-1.5. PaddleOCR-VL-1.5 not only significantly improved the accuracy on the OmniDocBench v1.5 evaluation set to 94.5%, but also innovatively supports irregular-shaped bounding box localization. As a result, PaddleOCR-VL-1.5 demonstrates outstanding performance in real-world scenarios such as Skew, Warping, Screen Photography, Illumination, and Scanning. In addition, the model has added new capabilities for seal (stamp) recognition and text detection and recognition, with key metrics continuing to lead the industry.
On May 28, 2026, we released PaddleOCR-VL-1.6. With an accuracy of 96.3%, PaddleOCR-VL-1.6 once again set a new benchmark on OmniDocBench v1.6, while also achieving new state-of-the-art (SOTA) results on OmniDocBench v1.5 and Real5-OmniDocBench. It delivers industry-leading performance in text, formula, and table recognition across both open-source and proprietary solutions. In addition, the model shows substantial improvements in ancient document and rare character recognition, as well as significantly enhanced capabilities in multiple scenarios such as seal recognition, spotting, and chart understanding. The model architecture remains fully consistent with PaddleOCR-VL-1.5, enabling seamless migration at zero cost.
This document applies to the PaddleOCR-VL series pipelines in PaddleX. PaddleX registers the PaddleOCR-VL series as independent top-level pipelines. They are used in basically the same way, but their default configurations and models are different.
| Pipeline name | Layout analysis model | VLM model |
|---|---|---|
| PaddleOCR-VL | PP-DocLayoutV2 | PaddleOCR-VL-0.9B |
| PaddleOCR-VL-1.5 | PP-DocLayoutV3 | PaddleOCR-VL-1.5-0.9B |
| PaddleOCR-VL-1.6 | PP-DocLayoutV3 | PaddleOCR-VL-1.6-0.9B |
A PaddleOCR-VL pipeline consists of layout analysis, region cropping, reading-order handling, VLM recognition, and result assembly. PaddleOCR-VL-0.9B, PaddleOCR-VL-1.5-0.9B, and PaddleOCR-VL-1.6-0.9B are VLM submodels inside the pipelines; they are not complete PaddleX pipelines. If you only start or call a VLM inference service, you are only running the VLM recognition stage, which does not provide the full pipeline capability.
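The pipeline-to-submodel mapping from the table above can be captured in a small lookup, which is handy when a script has to pick the VLM submodel that matches a pipeline. This is a minimal sketch: `PIPELINE_MODELS` and `vlm_submodel_for` are hypothetical helper names for illustration and are not part of PaddleX.

```python
# Hypothetical helper, not part of PaddleX: it only mirrors the pipeline table above.
PIPELINE_MODELS = {
    "PaddleOCR-VL": {"layout": "PP-DocLayoutV2", "vlm": "PaddleOCR-VL-0.9B"},
    "PaddleOCR-VL-1.5": {"layout": "PP-DocLayoutV3", "vlm": "PaddleOCR-VL-1.5-0.9B"},
    "PaddleOCR-VL-1.6": {"layout": "PP-DocLayoutV3", "vlm": "PaddleOCR-VL-1.6-0.9B"},
}


def vlm_submodel_for(pipeline_name: str) -> str:
    """Return the VLM submodel name that belongs to a registered pipeline."""
    try:
        return PIPELINE_MODELS[pipeline_name]["vlm"]
    except KeyError:
        raise ValueError(f"Unknown pipeline: {pipeline_name!r}") from None


print(vlm_submodel_for("PaddleOCR-VL-1.6"))
```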
1. Environment Preparation¶
To use the PaddleOCR-VL series pipelines, install PaddleX and the inference engine you want to use, for example:
python -m pip install paddlepaddle-gpu==3.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install "paddlex[ocr]"
2. Quick Start¶
The PaddleOCR-VL series pipelines support two usage methods: CLI command line and Python API. The CLI method is simpler and suitable for quickly verifying functionality, while the Python API method is more flexible and suitable for integration into existing projects. The following examples use PaddleOCR-VL-1.6 as the primary pipeline.
2.1 Command Line Usage¶
paddlex --pipeline PaddleOCR-VL-1.6 --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png
# Use --use_doc_orientation_classify to enable document orientation classification
paddlex --pipeline PaddleOCR-VL-1.6 --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png --use_doc_orientation_classify True
# Use --use_doc_unwarping to enable document unwarping module
paddlex --pipeline PaddleOCR-VL-1.6 --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png --use_doc_unwarping True
# Use --use_layout_detection to enable or disable layout detection (disabled here)
paddlex --pipeline PaddleOCR-VL-1.6 --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png --use_layout_detection False
The command line supports additional parameters, described below.
The following table lists the PaddleOCR-VL series pipeline prediction parameters currently supported by the PaddleX CLI. Common parameters also include `--input`, `--save_path`, `--engine`, `--device`, `--use_hpip`, and `--hpi_config`. For complex `engine_config`, write it into a pipeline YAML file and pass the file through `--pipeline`.
| Parameter | Description | Type |
|---|---|---|
| input | Data to be predicted, required. It can be an image/PDF file path, a URL, or a local directory containing images. Directory input currently does not support mixed PDF files; PDF files must be specified by file path. | str |
| save_path | Path for saving inference results. If not set, inference results will not be saved locally. | str |
| use_doc_orientation_classify | Whether to use the document orientation classification module. | bool |
| use_doc_unwarping | Whether to use the document unwarping module. | bool |
| use_layout_detection | Whether to use the layout detection and ordering module. | bool |
| use_chart_recognition | Whether to use chart parsing. | bool |
| layout_threshold | Score threshold for the layout model. It can be a float or a dictionary keyed by class ID. | float\|dict |
| layout_nms | Whether to use NMS post-processing for layout analysis. | bool |
| layout_unclip_ratio | Expansion ratio for layout detection boxes. It can be a float, a tuple, or a dictionary keyed by class ID. | float\|tuple\|dict |
| layout_merge_bboxes_mode | Merge mode for layout detection boxes. Supported values include large, small, and union; it can also be configured by class ID. | str\|dict |
| use_queues | Whether to enable internal queues. When enabled, PDF rendering, layout analysis, and VLM inference can run asynchronously in separate threads. | bool |
| prompt_label | Prompt type for the VLM. It only takes effect when use_layout_detection=False. | str |
| format_block_content | Whether to format block_content as Markdown. When set to True, image-type blocks may include image path information in block_content. | bool |
| repetition_penalty | Repetition penalty used in VLM sampling. | float |
| temperature | Temperature used in VLM sampling. | float |
| top_p | Top-p parameter used in VLM sampling. | float |
| min_pixels | Minimum number of pixels allowed during VLM image preprocessing. | int |
| max_pixels | Maximum number of pixels allowed during VLM image preprocessing. | int |
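When the CLI is driven from a script, the flags in the table can be assembled programmatically. The sketch below is illustrative only: `build_paddlex_command` is a hypothetical helper, it handles only scalar options (how dict or tuple values are written on the command line is not covered here), and booleans are rendered as `True`/`False` as in the commands shown earlier.

```python
import shlex


def build_paddlex_command(pipeline, input_path, **options):
    """Build a `paddlex` CLI argument list (hypothetical helper, scalar options only)."""
    args = ["paddlex", "--pipeline", pipeline, "--input", input_path]
    for name, value in options.items():
        if value is None:
            continue  # leave the flag out so the pipeline default applies
        args += [f"--{name}", str(value)]
    return args


cmd = build_paddlex_command(
    "PaddleOCR-VL-1.6",
    "paddleocr_vl_demo.png",
    use_doc_orientation_classify=True,
    use_layout_detection=None,  # omitted from the command
    save_path="./output",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where PaddleX is installed.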
The inference result will be printed in the terminal. The default output of the PaddleOCR-VL pipeline is as follows:
{'res': {'input_path': 'paddleocr_vl_demo.png', 'page_index': None, 'model_settings': {'use_doc_preprocessor': False, 'use_layout_detection': True, 'use_chart_recognition': False, 'format_block_content': False}, 'layout_det_res': {'input_path': None, 'page_index': None, 'boxes': [{'cls_id': 6, 'label': 'doc_title', 'score': 0.9636914134025574, 'coordinate': [np.float32(131.31366), np.float32(36.450516), np.float32(1384.522), np.float32(127.984665)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9281806349754333, 'coordinate': [np.float32(585.39465), np.float32(158.438), np.float32(930.2184), np.float32(182.57469)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9840355515480042, 'coordinate': [np.float32(9.023666), np.float32(200.86115), np.float32(361.41583), np.float32(343.8828)]}, {'cls_id': 14, 'label': 'image', 'score': 0.9871416091918945, 'coordinate': [np.float32(775.50574), np.float32(200.66502), np.float32(1503.3807), np.float32(684.9304)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9801855087280273, 'coordinate': [np.float32(9.532196), np.float32(344.90594), np.float32(361.4413), np.float32(440.8244)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9708921313285828, 'coordinate': [np.float32(28.040405), np.float32(455.87976), np.float32(341.7215), np.float32(520.7117)]}, {'cls_id': 24, 'label': 'vision_footnote', 'score': 0.9002962708473206, 'coordinate': [np.float32(809.0692), np.float32(703.70044), np.float32(1488.3016), np.float32(750.5238)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9825374484062195, 'coordinate': [np.float32(8.896561), np.float32(536.54895), np.float32(361.05237), np.float32(655.8058)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9822263717651367, 'coordinate': [np.float32(8.971573), np.float32(657.4949), np.float32(362.01715), np.float32(774.625)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9767460823059082, 'coordinate': [np.float32(9.407074), np.float32(776.5216), np.float32(361.31067), np.float32(846.82874)]}, {'cls_id': 
22, 'label': 'text', 'score': 0.9868153929710388, 'coordinate': [np.float32(8.669495), np.float32(848.2543), np.float32(361.64703), np.float32(1062.8568)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9826608300209045, 'coordinate': [np.float32(8.8025055), np.float32(1063.8615), np.float32(361.46588), np.float32(1182.8524)]}, {'cls_id': 22, 'label': 'text', 'score': 0.982555627822876, 'coordinate': [np.float32(8.820602), np.float32(1184.4663), np.float32(361.66394), np.float32(1302.4507)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9584776759147644, 'coordinate': [np.float32(9.170288), np.float32(1304.2161), np.float32(361.48898), np.float32(1351.7483)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9782056212425232, 'coordinate': [np.float32(389.1618), np.float32(200.38202), np.float32(742.7591), np.float32(295.65146)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9844875931739807, 'coordinate': [np.float32(388.73303), np.float32(297.18463), np.float32(744.00024), np.float32(441.3034)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9680547714233398, 'coordinate': [np.float32(409.39468), np.float32(455.89386), np.float32(721.7174), np.float32(520.9387)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9741666913032532, 'coordinate': [np.float32(389.71606), np.float32(536.8138), np.float32(742.7112), np.float32(608.00165)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9840384721755981, 'coordinate': [np.float32(389.30988), np.float32(609.39636), np.float32(743.09247), np.float32(750.3231)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9845995306968689, 'coordinate': [np.float32(389.13272), np.float32(751.7772), np.float32(743.058), np.float32(894.8815)]}, {'cls_id': 22, 'label': 'text', 'score': 0.984852135181427, 'coordinate': [np.float32(388.83267), np.float32(896.0371), np.float32(743.58215), np.float32(1038.7345)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9804865717887878, 'coordinate': [np.float32(389.08478), np.float32(1039.9119), np.float32(742.7585), 
np.float32(1134.4897)]}, {'cls_id': 22, 'label': 'text', 'score': 0.986461341381073, 'coordinate': [np.float32(388.52643), np.float32(1135.8137), np.float32(743.451), np.float32(1352.0085)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9869391918182373, 'coordinate': [np.float32(769.8341), np.float32(775.66235), np.float32(1124.9813), np.float32(1063.207)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9822869896888733, 'coordinate': [np.float32(770.30383), np.float32(1063.938), np.float32(1124.8295), np.float32(1184.2192)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9689218997955322, 'coordinate': [np.float32(791.3042), np.float32(1199.3169), np.float32(1104.4521), np.float32(1264.6985)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9713128209114075, 'coordinate': [np.float32(770.4253), np.float32(1279.6072), np.float32(1124.6917), np.float32(1351.8672)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9236552119255066, 'coordinate': [np.float32(1153.9058), np.float32(775.5814), np.float32(1334.0654), np.float32(798.1581)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9857938885688782, 'coordinate': [np.float32(1151.5197), np.float32(799.28015), np.float32(1506.3619), np.float32(991.1156)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9820687174797058, 'coordinate': [np.float32(1151.5686), np.float32(991.91095), np.float32(1506.6023), np.float32(1110.8875)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9866049885749817, 'coordinate': [np.float32(1151.6919), np.float32(1112.1301), np.float32(1507.1611), np.float32(1351.9504)]}]}}}
For explanation of the result parameters, refer to 2.2 Python Script Integration.
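To make the structure of the printed result more concrete, the sketch below walks a trimmed, hand-written sample with the same nesting (`res` → `layout_det_res` → `boxes`). The sample values are rounded stand-ins rather than real output, `summarize_layout` and `box_area` are hypothetical helpers, and reading `coordinate` as `[x_min, y_min, x_max, y_max]` is an assumption based on the values shown above.

```python
from collections import Counter

# Trimmed, hand-written sample that follows the nesting of the printed output.
sample = {
    "res": {
        "input_path": "paddleocr_vl_demo.png",
        "layout_det_res": {
            "boxes": [
                {"cls_id": 6, "label": "doc_title", "score": 0.96, "coordinate": [131.3, 36.5, 1384.5, 128.0]},
                {"cls_id": 22, "label": "text", "score": 0.93, "coordinate": [585.4, 158.4, 930.2, 182.6]},
                {"cls_id": 14, "label": "image", "score": 0.99, "coordinate": [775.5, 200.7, 1503.4, 684.9]},
                {"cls_id": 22, "label": "text", "score": 0.98, "coordinate": [9.0, 200.9, 361.4, 343.9]},
            ]
        },
    }
}


def summarize_layout(result, min_score=0.0):
    """Count detected layout boxes per label, ignoring low-score boxes."""
    boxes = result["res"]["layout_det_res"]["boxes"]
    return Counter(b["label"] for b in boxes if b["score"] >= min_score)


def box_area(box):
    """Area of one box, assuming [x_min, y_min, x_max, y_max] coordinates."""
    # float() also accepts np.float32 values such as those in the real output.
    x1, y1, x2, y2 = (float(v) for v in box["coordinate"])
    return (x2 - x1) * (y2 - y1)


print(summarize_layout(sample, min_score=0.95))
```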
Note: The default model for the pipeline is relatively large, which may result in slower inference speed. For faster inference, it is recommended to use a dedicated VLM inference service as described in section 3, Using VLM Inference Services.
2.2 Python Script Integration¶
The command line method is for quick testing and visualization. In actual projects, you usually need to integrate the model via code. You can perform pipeline inference with just a few lines of code as shown below:
from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="PaddleOCR-VL-1.6")
output = pipeline.predict(input="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png")
for res in output:
    res.print()  # Print the structured prediction output
    res.save_to_json(save_path="output")  # Save the current image's structured result in JSON format
    res.save_to_markdown(save_path="output")  # Save the current image's result in Markdown format
    res.save_to_word(save_path="output")  # Save the current image's result in Word format
For PDF files, each page will be processed individually, and a separate Markdown file will be generated for each page. If you wish to perform cross-page table merging, reconstruct multi-level labels, or merge multi-page results, you can achieve this using the following method:
from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="PaddleOCR-VL-1.6")
output = pipeline.predict(input="./your_pdf_file.pdf")
pages_res = list(output)
output = pipeline.restructure_pages(pages_res)
# output = pipeline.restructure_pages(pages_res, merge_tables=True)  # Merge tables across pages
# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True)  # Merge tables across pages and reconstruct multi-level titles
# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True, concatenate_pages=True)  # Merge tables across pages, reconstruct multi-level titles, and merge multiple pages
for res in output:
    res.print()  # Print the structured prediction output
    res.save_to_json(save_path="output")  # Save the current page's structured result in JSON format
    res.save_to_markdown(save_path="output")  # Save the current page's result in Markdown format
The above Python script performs the following steps:
(1) Instantiate the PaddleX pipeline object. Specific parameter descriptions are as follows:
In PaddleX, use `create_pipeline()` to create a PaddleOCR-VL series pipeline object. Fine-grained settings such as model names, model directories, and VLM server backends should usually be configured in a pipeline YAML file and then passed through `pipeline` or `config`.
| Parameter | Description | Type | Default |
|---|---|---|---|
| pipeline | PaddleX pipeline name or path to a pipeline config file. Supported names include PaddleOCR-VL, PaddleOCR-VL-1.5, and PaddleOCR-VL-1.6. A custom YAML file path can also be used. | str\|None | None |
| config | Pipeline config dictionary. If both pipeline and config are provided, the pipeline_name in config takes precedence. | dict\|None | None |
| device | Device used for inference, such as cpu, gpu:0, xpu:0, npu:0, dcu:0, or mlu:0. Actual availability depends on the local environment and inference engine. | str\|None | None |
| engine | Inference engine used by the pipeline or model. Different engines support different fields. See Inference Engine And Configuration. | str\|None | None |
| engine_config | Inference engine configuration. Different engines support different fields. See Inference Engine And Configuration. | dict\|None | None |
| use_hpip | Whether to enable the high-performance inference plugin. If set to None, the setting from the configuration file or config will be used. | bool\|None | None |
| hpi_config | High-performance inference configuration. | dict\|None | None |
(2) Call the PaddleOCR-VL pipeline's predict() method for inference prediction. This method will return a list of results. Additionally, the pipeline also provides the predict_iter() method. The two are completely consistent in terms of parameter acceptance and result return. The difference lies in that predict_iter() returns a generator, which can process and obtain prediction results step by step. It is suitable for scenarios involving large datasets or where memory conservation is desired. You can choose either of these two methods based on actual needs. Below are the parameters of the predict() method and their descriptions:
| Parameter | Parameter Description | Parameter Type | Default Value |
|---|---|---|---|
| input | Data to be predicted, supporting multiple input types. Required. | Python Var\|str\|list | |
| use_doc_orientation_classify | Whether to use the document orientation classification module during inference. Setting it to None means using the instantiation parameter; otherwise, this parameter takes precedence. | bool\|None | None |
| use_doc_unwarping | Whether to use the text image rectification module during inference. Setting it to None means using the instantiation parameter; otherwise, this parameter takes precedence. | bool\|None | None |
| use_layout_detection | Whether to use the layout region detection and sorting module during inference. Setting it to None means using the instantiation parameter; otherwise, this parameter takes precedence. | bool\|None | None |
| use_chart_recognition | Whether to use the chart parsing module during inference. Setting it to None means using the instantiation parameter; otherwise, this parameter takes precedence. | bool\|None | None |
| use_seal_recognition | Whether to use the seal recognition function. Setting it to None means using the instantiation parameter; otherwise, this parameter takes precedence. | bool\|None | None |
| use_ocr_for_image_block | Whether to perform OCR on text within image blocks. Setting it to None means using the instantiation parameter; otherwise, this parameter takes precedence. | bool\|None | None |
| layout_threshold | The parameter meaning is basically the same as the instantiation parameter. Setting it to None means using the instantiation parameter; otherwise, this parameter takes precedence. | float\|dict\|None | None |
| layout_nms | The parameter meaning is basically the same as the instantiation parameter. Setting it to None means using the instantiation parameter; otherwise, this parameter takes precedence. | bool\|None | None |
| layout_unclip_ratio | The parameter meaning is basically the same as the instantiation parameter. Setting it to None means using the instantiation parameter; otherwise, this parameter takes precedence. | float\|Tuple[float,float]\|dict\|None | None |
| layout_merge_bboxes_mode | The parameter meaning is basically the same as the instantiation parameter. Setting it to None means using the instantiation parameter; otherwise, this parameter takes precedence. | str\|dict\|None | None |
| merge_layout_blocks | Controls whether to merge the layout detection boxes for cross-column or staggered top and bottom columns. If not set, the initialized default value will be used, which is True. | bool | |
| markdown_ignore_labels | Layout labels that need to be ignored in Markdown. If not set, the initialized default value will be used. | str | |
| layout_shape_mode | Specifies the geometric representation mode for layout analysis results. It defines how the boundaries of detected regions (e.g., text blocks, images, tables) are calculated and displayed. | str | "auto" |
| use_queues | Controls whether to enable internal queues. When set to True, data loading (such as rendering PDF pages as images), layout analysis model processing, and VLM inference are executed asynchronously in separate threads, with data passed through queues, thereby improving efficiency. This approach is particularly efficient for PDF documents with many pages or directories containing a large number of images or PDF files. | bool\|None | None |
| prompt_label | The prompt type setting for the VL model, which takes effect only when use_layout_detection=False. The supported values are ocr, formula, table, and chart. | str\|None | None |
| format_block_content | The parameter meaning is basically the same as the instantiation parameter. Setting it to None means using the instantiation parameter; otherwise, this parameter takes precedence. When set to True, the block_content of image-type blocks will contain image path information (e.g., `<img src="..." />`). When set to False (default), the block_content of image-type blocks will only contain OCR-recognized text content without image paths. To include image paths in JSON output, set this parameter to True. | bool\|None | None |
| repetition_penalty | The repetition penalty parameter used for VL model sampling. | float\|None | None |
| temperature | Temperature parameter used for VL model sampling. | float\|None | None |
| top_p | Top-p parameter used for VL model sampling. | float\|None | None |
| min_pixels | The minimum number of pixels allowed when the VL model preprocesses images. | int\|None | None |
| max_pixels | The maximum number of pixels allowed when the VL model preprocesses images. | int\|None | None |
| max_new_tokens | The maximum number of tokens generated by the VL model. | int\|None | None |
| merge_layout_blocks | Controls whether to merge the layout detection boxes for cross-column or staggered top and bottom columns. | bool\|None | |
| markdown_ignore_labels | Layout labels that need to be ignored in Markdown. | list\|None | |
| vlm_extra_args | Additional configuration parameters for the VLM. | dict\|None | None |
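Many rows above share the same rule: a value of None falls back to the instantiation parameter, and any other value takes precedence for that call. The sketch below illustrates that rule in isolation; `resolve_option` is a hypothetical helper and does not reflect how PaddleX implements it internally.

```python
def resolve_option(call_value, init_value):
    """Return the per-call value unless it is None, in which case use the instantiation value."""
    return init_value if call_value is None else call_value


# Settings fixed when the pipeline was created (illustrative values).
init_settings = {"use_layout_detection": True, "use_chart_recognition": False}
# Arguments passed to one predict() call; None means "do not override".
call_kwargs = {"use_layout_detection": None, "use_chart_recognition": True}

effective = {
    name: resolve_option(call_kwargs.get(name), init_value)
    for name, init_value in init_settings.items()
}
print(effective)
```

Note that an explicit `False` is a real override: only `None` triggers the fallback.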
(3) Call the PaddleOCR-VL pipeline's restructure_pages() method to reconstruct pages from the multi-page results list of inference predictions. This method will return a reconstructed multi-page result or a merged single-page result. Below are the parameters of the restructure_pages() method and their descriptions:
| Parameter | Description | Type | Default Value |
|---|---|---|---|
| res_list | The list of results predicted from a multi-page PDF inference. | list\|None | None |
| merge_tables | Controls whether to merge tables across pages. | bool | True |
| relevel_titles | Controls whether to reconstruct multi-level titles. | bool | True |
| concatenate_pages | Controls whether to concatenate multi-page results into one page. | bool | False |
(4) Process the prediction results: The prediction result for each sample is a corresponding Result object, supporting operations such as printing, saving as an image, and saving as a json file:
| Method | Method Description | Parameter | Parameter Type | Parameter Description | Default Value |
|---|---|---|---|---|---|
| print() | Print results to the terminal | format_json | bool | Whether to format the output content using JSON indentation. | True |
| | | indent | int | Specify the indentation level to beautify the output JSON data, making it more readable. Only valid when format_json is True. | 4 |
| | | ensure_ascii | bool | Control whether non-ASCII characters are escaped as Unicode. When set to True, all non-ASCII characters will be escaped; False retains the original characters. Only valid when format_json is True. | False |
| save_to_json() | Save the results as a json format file | save_path | str | The file path for saving. When it is a directory, the saved file name will be consistent with the input file type naming. | None |
| | | indent | int | Specify the indentation level to beautify the output JSON data, making it more readable. Only valid when format_json is True. | 4 |
| | | ensure_ascii | bool | Control whether non-ASCII characters are escaped as Unicode. When set to True, all non-ASCII characters will be escaped; False retains the original characters. Only valid when format_json is True. | False |
| save_to_img() | Save the visualized images of each intermediate module in png format | save_path | str | The file path for saving, supporting directory or file paths. | None |
| save_to_markdown() | Save each page in an image or PDF file as a markdown format file separately | save_path | str | The file path for saving. When it is a directory, the saved file name will be consistent with the input file type naming. | None |
| | | pretty | bool | Whether to beautify the markdown output results, centering charts, etc., to make the markdown rendering more aesthetically pleasing. | True |
| | | show_formula_number | bool | Control whether to retain formula numbers in markdown. When set to True, all formula numbers are retained; False retains only the formulas. | False |
| save_to_html() | Save the tables in the file as html format files | save_path | str | The file path for saving, supporting directory or file paths. | None |
| save_to_xlsx() | Save the tables in the file as xlsx format files | save_path | str | The file path for saving, supporting directory or file paths. | None |
| save_to_word() | Save the layout parsing result as a Word (.docx) format file | save_path | str | The file path for saving, supporting directory or file paths. | None |
| Attribute | Attribute Description |
|---|---|
| json | Obtain the prediction result in json format |
| img | Obtain visualized images in dict format |
| markdown | Obtain Markdown results in dict format |
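The `indent` and `ensure_ascii` options described above have the same names as options of Python's standard `json.dumps`. The sketch below shows their effect there on a made-up record; whether PaddleX forwards them to `json.dumps` unchanged is an assumption, so treat this only as an illustration of what the two options mean.

```python
import json

# Made-up record; a real result contains many more fields.
record = {"block_content": "发票"}

# ensure_ascii=True escapes every non-ASCII character as a \uXXXX sequence.
escaped = json.dumps(record, ensure_ascii=True)
# ensure_ascii=False keeps the original characters; indent=4 pretty-prints.
readable = json.dumps(record, ensure_ascii=False, indent=4)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-disk or on-screen representation differs.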
3. Using VLM Inference Services¶
The inference performance under the default configuration is not fully optimized and may not meet actual production requirements. PaddleX supports connecting the VLM recognition stage in the complete pipeline to a dedicated VLM inference service. This improves VLM module inference performance and helps isolate server-side dependencies and compute resources in production environments. Server backends can include vLLM, SGLang, and FastDeploy. The workflow mainly consists of two steps:
- Start the VLM inference service;
- Configure the PaddleX pipeline to call the VLM inference service as a client.
The VLM inference service only handles the VLM recognition stage of the complete pipeline. Layout analysis, cropping, reading-order handling, and result assembly are still performed by the PaddleX pipeline, so local inference with the layout parsing model is still required. When starting the service, use the VLM submodel name that corresponds to the selected pipeline. See the pipeline table at the beginning of this document for the mapping.
3.1 Starting the VLM Inference Service¶
3.1.1 Using Docker Images¶
PaddleX provides a vLLM Docker image to quickly start a VLM inference service. For NVIDIA GPUs other than the Blackwell architecture, use ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server:latest. For Blackwell GPUs, use ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server:latest-sm120.
Using an NVIDIA GPU other than the Blackwell architecture and PaddleOCR-VL-1.6-0.9B as an example:
docker run \
--rm \
--gpus all \
--network host \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server:latest \
paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --host 0.0.0.0 --port 8118 --backend vllm
For Blackwell GPUs, replace the image above with the dedicated image:
docker run \
--rm \
--gpus all \
--network host \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server:latest-sm120 \
paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --host 0.0.0.0 --port 8118 --backend vllm
3.1.2 Starting Through PaddleX CLI¶
VLM server dependencies may differ from the local pipeline client environment, so it is recommended to create a separate virtual environment for the VLM inference service:
# Create a virtual environment
python -m venv .venv
# Activate the environment
source .venv/bin/activate
# Install PaddleX
python -m pip install "paddlex[ocr]"
# Install the vLLM server plugin
paddlex --install genai-vllm-server
# Or install the SGLang server plugin
# paddlex --install genai-sglang-server
# Or install the FastDeploy server plugin
# paddlex --install genai-fastdeploy-server
After installation, start the service with paddlex_genai_server:
paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --backend vllm --port 8118
# For the SGLang backend
# paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --backend sglang --port 8118
# For the FastDeploy backend
# paddlex_genai_server --model_name PaddleOCR-VL-1.6-0.9B --backend fastdeploy --port 8118
The command supports the following parameters:
| Parameter | Description |
|---|---|
| --model_name | Model name. It should match the PaddleX pipeline version being used. |
| --model_dir | Model directory. |
| --host | Server hostname. |
| --port | Server port number. |
| --backend | Backend name. Supported values are vllm, sglang, and fastdeploy. |
| --backend_config | YAML file containing backend configuration. |
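If the service is launched from a supervisor script, the command can be assembled from the parameters in the table. This is a sketch under stated assumptions: `build_server_command` and `SUPPORTED_BACKENDS` are hypothetical names, and only the flags and backend values listed above are used.

```python
SUPPORTED_BACKENDS = ("vllm", "sglang", "fastdeploy")


def build_server_command(model_name, backend="vllm", host=None, port=None,
                         model_dir=None, backend_config=None):
    """Build a `paddlex_genai_server` argument list (hypothetical helper)."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported backend {backend!r}; expected one of {SUPPORTED_BACKENDS}")
    args = ["paddlex_genai_server", "--model_name", model_name, "--backend", backend]
    optional = {
        "--host": host,
        "--port": port,
        "--model_dir": model_dir,
        "--backend_config": backend_config,
    }
    for flag, value in optional.items():
        if value is not None:  # only emit flags that were explicitly set
            args += [flag, str(value)]
    return args


print(build_server_command("PaddleOCR-VL-1.6-0.9B", port=8118))
```

The list can be handed to `subprocess.Popen` in the environment where the server plugin is installed.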
3.2 Client Usage¶
After starting the VLM inference service, the client can call it through PaddleX. Install the client plugin first:
Next, obtain the pipeline configuration file:
The default save path is PaddleOCR-VL-1.6.yaml. Modify SubModules.VLRecognition.genai_config.backend and SubModules.VLRecognition.genai_config.server_url in the config file to match the service, for example:
SubModules:
  VLRecognition:
    genai_config:
      backend: vllm-server
      server_url: http://127.0.0.1:8118/v1
      max_concurrency: 200
You can also use the unified engine + engine_config style to configure this submodule explicitly:
SubModules:
  VLRecognition:
    engine: genai_client
    engine_config:
      backend: vllm-server
      server_url: http://127.0.0.1:8118/v1
      max_concurrency: 200
Then use the modified config file to run the pipeline. For CLI:
paddlex --pipeline PaddleOCR-VL-1.6.yaml --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png
Or through the Python API:
from paddlex import create_pipeline

pipeline = create_pipeline("PaddleOCR-VL-1.6.yaml")
for res in pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png"):
    res.print()
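The same `genai_config` change can also be applied to a config dictionary in code instead of editing the YAML by hand. In this sketch, `with_vlm_server` is a hypothetical helper and `base` is a deliberately tiny stand-in for a loaded config (a real config file contains many more keys); the key names follow the YAML fragment shown above.

```python
import copy


def with_vlm_server(config, server_url, backend="vllm-server", max_concurrency=None):
    """Return a copy of a pipeline config whose VLRecognition submodule targets a VLM service."""
    new_config = copy.deepcopy(config)  # leave the caller's config untouched
    vl = new_config.setdefault("SubModules", {}).setdefault("VLRecognition", {})
    genai = vl.setdefault("genai_config", {})
    genai["backend"] = backend
    genai["server_url"] = server_url
    if max_concurrency is not None:
        genai["max_concurrency"] = max_concurrency
    return new_config


# Tiny stand-in for a config loaded from PaddleOCR-VL-1.6.yaml.
base = {"pipeline_name": "PaddleOCR-VL-1.6", "SubModules": {"VLRecognition": {}}}
updated = with_vlm_server(base, "http://127.0.0.1:8118/v1", max_concurrency=200)
print(updated["SubModules"]["VLRecognition"]["genai_config"])
```

According to the parameter table in section 2.2, a config dictionary can be passed through the `config` argument of `create_pipeline()`; the dictionary must then be a complete pipeline config, not the reduced stand-in used here.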
4. Serving¶
If you need to directly apply PaddleOCR-VL in your Python project, you can refer to the example code in 2.2 Python Script Integration.
Additionally, PaddleX also provides a service deployment method, detailed as follows:
4.1 Install Dependencies¶
Run the following command to install the PaddleX serving plugin via PaddleX CLI:
4.2 Run the Server¶
Run the server via PaddleX CLI:
You should see information similar to the following:
INFO: Started server process [63108]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
If you need to adjust the configuration, such as model paths, batch size, deployment device, or VLM server backend, specify a custom configuration file through --pipeline. PaddleOCR-VL-1.5 and PaddleOCR-VL-1.6 share the same serving application implementation route internally, but their pipeline configs remain distinct.
The command-line options related to serving are as follows:
| Name | Description |
|---|---|
| --pipeline | PaddleX pipeline registration name or pipeline configuration file path. |
| --device | Deployment device for the pipeline. By default, a GPU will be used if available; otherwise, a CPU will be used. |
| --host | Hostname or IP address to which the server is bound. Defaults to 0.0.0.0. |
| --port | Port number on which the server listens. Defaults to 8080. |
| --use_hpip | If specified, uses high-performance inference. Refer to the High-Performance Inference documentation for more information. |
| --hpi_config | High-performance inference configuration. Refer to the High-Performance Inference documentation for more information. |
4.3 Client-Side Invocation¶
Below are the API reference for basic service deployment and examples of invoking the service from several programming languages:
API Reference
The following conventions apply to the main operations provided by the service:
- The HTTP request method is POST.
- Both the request body and response body are JSON data (JSON objects).
- When the request is processed successfully, the response status code is 200, and the properties of the response body are as follows:
| Name | Type | Meaning |
|---|---|---|
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Fixed as 0. |
| errorMsg | string | Error description. Fixed as "Success". |
| result | object | Operation result. |
- When the request is not processed successfully, the properties of the response body are as follows:
| Name | Type | Meaning |
|---|---|---|
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Same as the response status code. |
| errorMsg | string | Error description. |
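The two response shapes above can be handled by one small helper on the client side. The following is a minimal sketch using only the fields documented in the tables; the names ServiceError and unwrap_result are hypothetical and not part of any PaddleX client library.

```python
import json


class ServiceError(RuntimeError):
    """Raised when the service reports that a request was not processed successfully."""

    def __init__(self, log_id, error_code, error_msg):
        super().__init__(f"[{error_code}] {error_msg} (logId={log_id})")
        self.log_id = log_id
        self.error_code = error_code
        self.error_msg = error_msg


def unwrap_result(status_code, body_text):
    """Return the result object on success, otherwise raise ServiceError."""
    body = json.loads(body_text)
    # Success: HTTP 200 and errorCode fixed as 0.
    if status_code == 200 and body.get("errorCode") == 0:
        return body["result"]
    # Failure: errorCode mirrors the HTTP status code, so fall back to it if missing.
    raise ServiceError(
        body.get("logId"),
        body.get("errorCode", status_code),
        body.get("errorMsg", ""),
    )
```

With the requests library, this could be called as unwrap_result(response.status_code, response.text). Keeping logId on the exception makes it easier to correlate a failed call with the server logs.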
The main operations provided by the service are as follows:
infer
Perform layout parsing.
POST /layout-parsing
- The properties of the request body are as follows:
| Name | Type | Meaning | Required |
|---|---|---|---|
| file | string | The URL of an image file (including TIFF; multi-page TIFF is processed page by page) or PDF file accessible to the server, or the Base64-encoded content of a file of one of those types. By default, there is no page limit. To set a page limit on the server, set Serving.extra.max_num_input_imgs to a positive integer in the pipeline configuration file. | Yes |
| fileType | integer\|null | File type. 0 represents a PDF file, 1 represents an image file (including TIFF). If this property is not present in the request body, the file type will be inferred from the URL. | No |
| useDocOrientationClassify | boolean\|null | Please refer to the description of the use_doc_orientation_classify parameter in the pipeline predict method. | No |
| useDocUnwarping | boolean\|null | Please refer to the description of the use_doc_unwarping parameter in the pipeline predict method. | No |
| useLayoutDetection | boolean\|null | Please refer to the description of the use_layout_detection parameter in the pipeline predict method. | No |
| useChartRecognition | boolean\|null | Please refer to the description of the use_chart_recognition parameter in the pipeline predict method. | No |
| useSealRecognition | boolean\|null | Please refer to the description of the use_seal_recognition parameter in the pipeline predict method. | No |
| useOcrForImageBlock | boolean\|null | Please refer to the description of the use_ocr_for_image_block parameter in the pipeline predict method. | No |
| layoutThreshold | number\|object\|null | Please refer to the description of the layout_threshold parameter in the pipeline predict method. | No |
| layoutNms | boolean\|null | Please refer to the description of the layout_nms parameter in the pipeline predict method. | No |
| layoutUnclipRatio | number\|array\|object\|null | Please refer to the description of the layout_unclip_ratio parameter in the pipeline predict method. | No |
| layoutMergeBboxesMode | string\|object\|null | Please refer to the description of the layout_merge_bboxes_mode parameter in the pipeline predict method. | No |
| layoutShapeMode | string | Please refer to the description of the layout_shape_mode parameter in the pipeline predict method. | No |
| promptLabel | string\|null | Please refer to the description of the prompt_label parameter in the pipeline predict method. | No |
| formatBlockContent | boolean\|null | Please refer to the description of the format_block_content parameter in the pipeline predict method. | No |
| repetitionPenalty | number\|null | Please refer to the description of the repetition_penalty parameter in the pipeline predict method. | No |
| temperature | number\|null | Please refer to the description of the temperature parameter in the pipeline predict method. | No |
| topP | number\|null | Please refer to the description of the top_p parameter in the pipeline predict method. | No |
| minPixels | number\|null | Please refer to the description of the min_pixels parameter in the pipeline predict method. | No |
| maxPixels | number\|null | Please refer to the description of the max_pixels parameter in the pipeline predict method. | No |
| maxNewTokens | number\|null | Please refer to the description of the max_new_tokens parameter in the pipeline predict method. | No |
| mergeLayoutBlocks | boolean\|null | Please refer to the description of the merge_layout_blocks parameter in the pipeline predict method. | No |
| markdownIgnoreLabels | array\|null | Please refer to the description of the markdown_ignore_labels parameter in the pipeline predict method. | No |
| vlmExtraArgs | object\|null | Please refer to the description of the vlm_extra_args parameter in the pipeline predict method. | No |
| prettifyMarkdown | boolean | Whether to output beautified Markdown text. The default is true. | No |
| showFormulaNumber | boolean | Whether to include formula numbers in the output Markdown text. The default is false. | No |
| returnMarkdownImages | boolean | Whether to return the images referenced in the Markdown. The default is true. When set to false, markdown.images is null or omitted and the server skips image encoding and URL upload. | No |
| restructurePages | boolean | Whether to restructure results across multiple pages. The default is false. | No |
| mergeTables | boolean | Please refer to the description of the merge_tables parameter in the pipeline restructure_pages method. Valid only when restructurePages is true. | No |
| relevelTitles | boolean | Please refer to the description of the relevel_titles parameter in the pipeline restructure_pages method. Valid only when restructurePages is true. | No |
| outputFormats | array\|null | Optional list of extra document formats to return. By default, no extra formats are returned. Currently only "docx" is supported. | No |
| visualize | boolean\|null | Whether to return visualization result images and intermediate images produced during processing. A default can be set in the configuration file, for example so that images are not returned by default; the visualize parameter in the request body overrides that default. If this parameter is not set in either the request body or the configuration file (or null is passed in the request body and the configuration file is not set), images are returned by default. | No |
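Since most optional request properties are the camelCase form of a predict parameter, a request body can be assembled from Python-style keyword arguments. The sketch below is illustrative only: build_infer_payload and snake_to_camel are hypothetical helper names, and the mapping assumes the naming pattern visible in the table above (for example, use_doc_unwarping becomes useDocUnwarping).

```python
import base64


def snake_to_camel(name):
    """Convert a snake_case parameter name to camelCase, e.g. top_p -> topP."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def build_infer_payload(file_bytes=None, file_url=None, file_type=None, **options):
    """Build a JSON-serializable request body for POST /layout-parsing.

    Exactly one of file_bytes (raw content, sent as Base64) or file_url must be given.
    Options left as None are omitted, so the server applies its own defaults.
    """
    if (file_bytes is None) == (file_url is None):
        raise ValueError("provide exactly one of file_bytes or file_url")
    if file_url is not None:
        payload = {"file": file_url}
    else:
        payload = {"file": base64.b64encode(file_bytes).decode("ascii")}
    if file_type is not None:
        payload["fileType"] = file_type  # 0: PDF, 1: image
    for key, value in options.items():
        if value is not None:
            payload[snake_to_camel(key)] = value
    return payload


payload = build_infer_payload(
    file_bytes=b"abc",  # placeholder bytes; a real call would read an image or PDF
    file_type=1,
    use_doc_unwarping=False,
    top_p=None,
)
print(payload)
```

The resulting dict can be passed as the json argument of requests.post. Omitting unset options instead of sending null keeps the request small and leaves defaults to the server.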
- When the request is processed successfully, the result in the response body has the following attributes:
| Name | Type | Meaning |
|---|---|---|
| layoutParsingResults | array | Layout parsing results. The array length is 1 (for image input) or the actual number of document pages processed (for PDF input). For PDF input, each element in the array represents the result of each actual page processed in the PDF file. |
| dataInfo | object | Input data information. |
Image and other binary file fields in the element schema below (e.g. outputImages, inputImage, markdown.images, exports) are returned inline as Base64 strings by default; when the server is configured to return URLs, those values become pre-signed URLs while the field types remain unchanged. See the "Returning Binary Content as URLs" section of the Serving Deployment Guide for configuration.
Each element in layoutParsingResults is an object with the following attributes:
| Name | Type | Meaning |
|---|---|---|
| prunedResult | object | A simplified version of the res field in the JSON representation of the results generated by the predict method of the pipeline object, with the input_path and page_index fields removed. |
| markdown | object | Markdown results. |
| outputImages | object\|null | Refer to the img property description of the prediction results. The image is in JPEG format, encoded as Base64 by default; returned as a pre-signed URL when URL-return mode is enabled. |
| inputImage | string\|null | Input image. The image is in JPEG format, encoded as Base64 by default; returned as a pre-signed URL when URL-return mode is enabled. |
| exports | object\|null | Optional additional exports. Present only when outputFormats is set. Example: {"docx": {"content": "..."}}, where content is the Base64-encoded file content by default, or a pre-signed URL when URL-return mode is enabled. |
markdown is an object with the following properties:
| Name | Type | Meaning |
|---|---|---|
| text | string | Markdown text. |
| images | object\|null | Key-value pairs of relative Markdown image paths and their image data. Values are Base64-encoded by default; returned as pre-signed URLs when URL-return mode is enabled. The field is null or omitted when returnMarkdownImages is false in the request. |
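Because binary fields may arrive either as Base64 strings or as pre-signed URLs, and markdown.images may be null or omitted, client code benefits from handling all of these cases in one place. The following is a minimal sketch using only the Python standard library; read_binary_field and save_markdown are hypothetical helper names, and the sketch assumes that pre-signed URLs begin with http:// or https://.

```python
import base64
import pathlib
import tempfile
import urllib.request


def read_binary_field(value, timeout=30):
    """Return the bytes behind a binary field, or None if the field is null.

    Assumption: URL-mode values start with http:// or https://. A Base64 string
    cannot contain ':', so the two forms cannot be confused.
    """
    if value is None:
        return None
    if value.startswith(("http://", "https://")):
        with urllib.request.urlopen(value, timeout=timeout) as resp:
            return resp.read()
    return base64.b64decode(value)


def save_markdown(markdown, out_dir):
    """Write markdown["text"] to out_dir/doc.md and save any referenced images."""
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    doc_path = out_dir / "doc.md"
    doc_path.write_text(markdown["text"], encoding="utf-8")
    # images is null or omitted when returnMarkdownImages is false.
    for rel_path, value in (markdown.get("images") or {}).items():
        target = out_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(read_binary_field(value))
    return doc_path


# Offline demonstration with made-up data in place of a real server response.
with tempfile.TemporaryDirectory() as tmp:
    demo = {
        "text": "![](imgs/a.jpg)\n",
        "images": {"imgs/a.jpg": base64.b64encode(b"fake-bytes").decode("ascii")},
    }
    doc = save_markdown(demo, tmp)
    demo_text = doc.read_text(encoding="utf-8")
    demo_image = (pathlib.Path(tmp) / "imgs" / "a.jpg").read_bytes()
```

The same read_binary_field helper can be applied to outputImages values, inputImage, and the content of each entry in exports.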
restructurePages
Restructure results across multiple pages.
POST /restructure-pages
- The request body has the following properties:
| Name | Type | Description | Required |
|---|---|---|---|
| pages | array | An array of pages. | Yes |
| mergeTables | boolean | Please refer to the description of the merge_tables parameter in the pipeline restructure_pages method. | No |
| relevelTitles | boolean | Please refer to the description of the relevel_titles parameter in the pipeline restructure_pages method. | No |
| concatenatePages | boolean | Please refer to the description of the concatenate_pages parameter in the pipeline restructure_pages method. | No |
| prettifyMarkdown | boolean | Whether to output beautified Markdown text. The default is true. | No |
| showFormulaNumber | boolean | Whether to include formula numbers in the output Markdown text. The default is false. | No |
| returnMarkdownImages | boolean | Whether to return the images referenced in the Markdown (from pages[].markdownImages in the request). The default is true. When set to false, markdown.images is null or omitted and the server does not back-fill it. | No |
| outputFormats | array\|null | Optional extra export formats; same meaning as outputFormats on infer. Only "docx" is supported. | No |
Each element in pages is an object with the following properties:
| Name | Type | Description |
|---|---|---|
| prunedResult | object | The prunedResult object returned by the infer operation. |
| markdownImages | object\|null | The images property of the markdown object returned by the infer operation. |
- When the request is processed successfully, the result field in the response body has the following properties:
| Name | Type | Description |
|---|---|---|
| layoutParsingResults | array | The restructured layout parsing results. For the fields that every element contains, please refer to the description of the result returned by the infer operation (excluding visualization result images and intermediate images). |
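The mapping from infer results to a restructurePages request can be sketched as follows. This is an illustration based only on the tables above; build_restructure_payload is a hypothetical helper name, and any extra options are passed through unchanged using the camelCase names from the request table.

```python
def build_restructure_payload(layout_parsing_results, **options):
    """Build the request body for POST /restructure-pages from infer results.

    Each page carries the prunedResult and the markdown.images value of the
    corresponding infer result; markdownImages may be None.
    """
    pages = []
    for item in layout_parsing_results:
        markdown = item.get("markdown") or {}
        pages.append(
            {
                "prunedResult": item["prunedResult"],
                "markdownImages": markdown.get("images"),
            }
        )
    payload = {"pages": pages}
    payload.update(options)  # e.g. concatenatePages=True, mergeTables=True
    return payload


# Made-up infer results standing in for a real server response.
fake_results = [
    {"prunedResult": {"page": 0}, "markdown": {"text": "a", "images": None}},
    {"prunedResult": {"page": 1}, "markdown": {"text": "b", "images": {"imgs/x.jpg": "AAAA"}}},
]
restructure_payload = build_restructure_payload(fake_results, concatenatePages=True)
print(restructure_payload)
```

Building the pages list from the infer response in this way avoids re-sending the visualization images, which the restructurePages operation does not use.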
Multi-Language Service Invocation Examples
Python
import base64
import pathlib

import requests

BASE_URL = "http://localhost:8080"
image_path = "./demo.jpg"

# Encode the local image in Base64
with open(image_path, "rb") as file:
    image_bytes = file.read()
    image_data = base64.b64encode(image_bytes).decode("ascii")

payload = {
    "file": image_data,  # Base64-encoded file content or file URL
    "fileType": 1,  # File type, 1 indicates an image file
}

# Call the infer operation
response = requests.post(BASE_URL + "/layout-parsing", json=payload)
assert response.status_code == 200, (response.status_code, response.text)
result = response.json()["result"]

pages = []
for i, res in enumerate(result["layoutParsingResults"]):
    pages.append({"prunedResult": res["prunedResult"], "markdownImages": res["markdown"].get("images")})
    # outputImages may be null when visualization images are not returned
    for img_name, img in (res.get("outputImages") or {}).items():
        img_path = f"{img_name}_{i}.jpg"
        pathlib.Path(img_path).parent.mkdir(parents=True, exist_ok=True)
        with open(img_path, "wb") as f:
            f.write(base64.b64decode(img))
        print(f"Output image saved at {img_path}")

# Call the restructurePages operation with the pages collected above
payload = {
    "pages": pages,
    "concatenatePages": True,
}
response = requests.post(BASE_URL + "/restructure-pages", json=payload)
assert response.status_code == 200, (response.status_code, response.text)
result = response.json()["result"]

res = result["layoutParsingResults"][0]
print(res["prunedResult"])
md_dir = pathlib.Path("markdown")
md_dir.mkdir(exist_ok=True)
(md_dir / "doc.md").write_text(res["markdown"]["text"], encoding="utf-8")
# markdown.images may be null or omitted
for img_path, img in (res["markdown"].get("images") or {}).items():
    img_path = md_dir / img_path
    img_path.parent.mkdir(parents=True, exist_ok=True)
    img_path.write_bytes(base64.b64decode(img))
print(f"Markdown document saved at {md_dir / 'doc.md'}")
C++
#include <iostream>
#include <filesystem>
#include <fstream>
#include <vector>
#include <string>
#include "cpp-httplib/httplib.h" // https://github.com/Huiyicc/cpp-httplib
#include "nlohmann/json.hpp" // https://github.com/nlohmann/json
#include "base64.hpp" // https://github.com/tobiaslocker/base64

namespace fs = std::filesystem;

int main() {
    httplib::Client client("localhost", 8080);
    const std::string filePath = "./demo.jpg";

    // Read the whole input file into memory
    std::ifstream file(filePath, std::ios::binary | std::ios::ate);
    if (!file) {
        std::cerr << "Error opening file: " << filePath << std::endl;
        return 1;
    }
    std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> buffer(size);
    if (!file.read(buffer.data(), size)) {
        std::cerr << "Error reading file." << std::endl;
        return 1;
    }

    // Encode the file content in Base64 and build the request body
    std::string bufferStr(buffer.data(), static_cast<size_t>(size));
    std::string encodedFile = base64::to_base64(bufferStr);
    nlohmann::json jsonObj;
    jsonObj["file"] = encodedFile;
    jsonObj["fileType"] = 1;

    auto response = client.Post("/layout-parsing", jsonObj.dump(), "application/json");
    if (response && response->status == 200) {
        nlohmann::json jsonResponse = nlohmann::json::parse(response->body);
        auto result = jsonResponse["result"];
        if (!result.is_object() || !result.contains("layoutParsingResults")) {
            std::cerr << "Unexpected response format." << std::endl;
            return 1;
        }
        const auto& results = result["layoutParsingResults"];
        for (size_t i = 0; i < results.size(); ++i) {
            const auto& res = results[i];
            if (res.contains("prunedResult")) {
                std::cout << "Layout result [" << i << "]: " << res["prunedResult"].dump() << std::endl;
            }
            if (res.contains("outputImages") && res["outputImages"].is_object()) {
                for (auto& [imgName, imgBase64] : res["outputImages"].items()) {
                    std::string outputPath = imgName + "_" + std::to_string(i) + ".jpg";
                    fs::path pathObj(outputPath);
                    fs::path parentDir = pathObj.parent_path();
                    if (!parentDir.empty() && !fs::exists(parentDir)) {
                        fs::create_directories(parentDir);
                    }
                    std::string decodedImage = base64::from_base64(imgBase64.get<std::string>());
                    std::ofstream outFile(outputPath, std::ios::binary);
                    if (outFile.is_open()) {
                        outFile.write(decodedImage.c_str(), decodedImage.size());
                        outFile.close();
                        std::cout << "Saved image: " << outputPath << std::endl;
                    } else {
                        std::cerr << "Failed to save image: " << outputPath << std::endl;
                    }
                }
            }
        }
    } else {
        std::cerr << "Request failed." << std::endl;
        if (response) {
            std::cerr << "HTTP status: " << response->status << std::endl;
            std::cerr << "Response body: " << response->body << std::endl;
        }
        return 1;
    }
    return 0;
}
Java
import okhttp3.*;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.nio.file.Paths;
import java.nio.file.Files;

public class Main {
    public static void main(String[] args) throws IOException {
        String API_URL = "http://localhost:8080/layout-parsing";
        String imagePath = "./demo.jpg";

        File file = new File(imagePath);
        byte[] fileContent = java.nio.file.Files.readAllBytes(file.toPath());
        String base64Image = Base64.getEncoder().encodeToString(fileContent);

        ObjectMapper objectMapper = new ObjectMapper();
        ObjectNode payload = objectMapper.createObjectNode();
        payload.put("file", base64Image);
        payload.put("fileType", 1);

        OkHttpClient client = new OkHttpClient();
        MediaType JSON = MediaType.get("application/json; charset=utf-8");
        RequestBody body = RequestBody.create(JSON, payload.toString());
        Request request = new Request.Builder()
                .url(API_URL)
                .post(body)
                .build();

        try (Response response = client.newCall(request).execute()) {
            if (response.isSuccessful()) {
                String responseBody = response.body().string();
                JsonNode root = objectMapper.readTree(responseBody);
                JsonNode result = root.get("result");
                JsonNode layoutParsingResults = result.get("layoutParsingResults");
                for (int i = 0; i < layoutParsingResults.size(); i++) {
                    JsonNode item = layoutParsingResults.get(i);
                    int finalI = i;
                    JsonNode prunedResult = item.get("prunedResult");
                    System.out.println("Pruned Result [" + i + "]: " + prunedResult.toString());

                    JsonNode outputImages = item.get("outputImages");
                    outputImages.fieldNames().forEachRemaining(imgName -> {
                        try {
                            String imgBase64 = outputImages.get(imgName).asText();
                            byte[] imgBytes = Base64.getDecoder().decode(imgBase64);
                            String imgPath = imgName + "_" + finalI + ".jpg";
                            File outputFile = new File(imgPath);
                            File parentDir = outputFile.getParentFile();
                            if (parentDir != null && !parentDir.exists()) {
                                parentDir.mkdirs();
                                System.out.println("Created directory: " + parentDir.getAbsolutePath());
                            }
                            try (FileOutputStream fos = new FileOutputStream(outputFile)) {
                                fos.write(imgBytes);
                                System.out.println("Saved image: " + imgPath);
                            }
                        } catch (IOException e) {
                            System.err.println("Failed to save image: " + e.getMessage());
                        }
                    });
                }
            } else {
                System.err.println("Request failed with HTTP code: " + response.code());
            }
        }
    }
}
Go
package main

import (
    "bytes"
    "encoding/base64"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "net/http"
    "os"
    "path/filepath"
)

func main() {
    API_URL := "http://localhost:8080/layout-parsing"
    filePath := "./demo.jpg"

    fileBytes, err := ioutil.ReadFile(filePath)
    if err != nil {
        fmt.Printf("Error reading file: %v\n", err)
        return
    }
    fileData := base64.StdEncoding.EncodeToString(fileBytes)

    payload := map[string]interface{}{
        "file":     fileData,
        "fileType": 1,
    }
    payloadBytes, err := json.Marshal(payload)
    if err != nil {
        fmt.Printf("Error marshaling payload: %v\n", err)
        return
    }

    client := &http.Client{}
    req, err := http.NewRequest("POST", API_URL, bytes.NewBuffer(payloadBytes))
    if err != nil {
        fmt.Printf("Error creating request: %v\n", err)
        return
    }
    req.Header.Set("Content-Type", "application/json")

    res, err := client.Do(req)
    if err != nil {
        fmt.Printf("Error sending request: %v\n", err)
        return
    }
    defer res.Body.Close()

    if res.StatusCode != http.StatusOK {
        fmt.Printf("Unexpected status code: %d\n", res.StatusCode)
        return
    }

    body, err := ioutil.ReadAll(res.Body)
    if err != nil {
        fmt.Printf("Error reading response: %v\n", err)
        return
    }

    type Markdown struct {
        Text   string            `json:"text"`
        Images map[string]string `json:"images"`
    }
    type LayoutResult struct {
        PrunedResult map[string]interface{} `json:"prunedResult"`
        Markdown     Markdown               `json:"markdown"`
        OutputImages map[string]string      `json:"outputImages"`
        InputImage   *string                `json:"inputImage"`
    }
    type Response struct {
        Result struct {
            LayoutParsingResults []LayoutResult `json:"layoutParsingResults"`
            DataInfo             interface{}    `json:"dataInfo"`
        } `json:"result"`
    }

    var respData Response
    if err := json.Unmarshal(body, &respData); err != nil {
        fmt.Printf("Error parsing response: %v\n", err)
        return
    }

    for i, res := range respData.Result.LayoutParsingResults {
        fmt.Printf("Result %d - prunedResult: %+v\n", i, res.PrunedResult)

        mdDir := fmt.Sprintf("markdown_%d", i)
        os.MkdirAll(mdDir, 0755)
        mdFile := filepath.Join(mdDir, "doc.md")
        if err := os.WriteFile(mdFile, []byte(res.Markdown.Text), 0644); err != nil {
            fmt.Printf("Error writing markdown file: %v\n", err)
        } else {
            fmt.Printf("Markdown document saved at %s\n", mdFile)
        }

        for path, imgBase64 := range res.Markdown.Images {
            fullPath := filepath.Join(mdDir, path)
            if err := os.MkdirAll(filepath.Dir(fullPath), 0755); err != nil {
                fmt.Printf("Error creating directory for markdown image: %v\n", err)
                continue
            }
            imgBytes, err := base64.StdEncoding.DecodeString(imgBase64)
            if err != nil {
                fmt.Printf("Error decoding markdown image: %v\n", err)
                continue
            }
            if err := os.WriteFile(fullPath, imgBytes, 0644); err != nil {
                fmt.Printf("Error saving markdown image: %v\n", err)
            }
        }

        for name, imgBase64 := range res.OutputImages {
            imgBytes, err := base64.StdEncoding.DecodeString(imgBase64)
            if err != nil {
                fmt.Printf("Error decoding output image %s: %v\n", name, err)
                continue
            }
            filename := fmt.Sprintf("%s_%d.jpg", name, i)
            if err := os.MkdirAll(filepath.Dir(filename), 0755); err != nil {
                fmt.Printf("Error creating directory for output image: %v\n", err)
                continue
            }
            if err := os.WriteFile(filename, imgBytes, 0644); err != nil {
                fmt.Printf("Error saving output image %s: %v\n", filename, err)
            } else {
                fmt.Printf("Output image saved at %s\n", filename)
            }
        }
    }
}
C#
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class Program
{
    static readonly string API_URL = "http://localhost:8080/layout-parsing";
    static readonly string inputFilePath = "./demo.jpg";

    static async Task Main(string[] args)
    {
        var httpClient = new HttpClient();

        byte[] fileBytes = File.ReadAllBytes(inputFilePath);
        string fileData = Convert.ToBase64String(fileBytes);

        var payload = new JObject
        {
            { "file", fileData },
            { "fileType", 1 }
        };
        var content = new StringContent(payload.ToString(), Encoding.UTF8, "application/json");

        HttpResponseMessage response = await httpClient.PostAsync(API_URL, content);
        response.EnsureSuccessStatusCode();

        string responseBody = await response.Content.ReadAsStringAsync();
        JObject jsonResponse = JObject.Parse(responseBody);
        JArray layoutParsingResults = (JArray)jsonResponse["result"]["layoutParsingResults"];

        for (int i = 0; i < layoutParsingResults.Count; i++)
        {
            var res = layoutParsingResults[i];
            Console.WriteLine($"[{i}] prunedResult:\n{res["prunedResult"]}");

            JObject outputImages = res["outputImages"] as JObject;
            if (outputImages != null)
            {
                foreach (var img in outputImages)
                {
                    string imgName = img.Key;
                    string base64Img = img.Value?.ToString();
                    if (!string.IsNullOrEmpty(base64Img))
                    {
                        string imgPath = $"{imgName}_{i}.jpg";
                        byte[] imageBytes = Convert.FromBase64String(base64Img);
                        string directory = Path.GetDirectoryName(imgPath);
                        if (!string.IsNullOrEmpty(directory) && !Directory.Exists(directory))
                        {
                            Directory.CreateDirectory(directory);
                            Console.WriteLine($"Created directory: {directory}");
                        }
                        File.WriteAllBytes(imgPath, imageBytes);
                        Console.WriteLine($"Output image saved at {imgPath}");
                    }
                }
            }
        }
    }
}
Node.js
const axios = require('axios');
const fs = require('fs');
const path = require('path');

const API_URL = 'http://localhost:8080/layout-parsing';
const imagePath = './demo.jpg';
const fileType = 1;

function encodeImageToBase64(filePath) {
    const bitmap = fs.readFileSync(filePath);
    return Buffer.from(bitmap).toString('base64');
}

const payload = {
    file: encodeImageToBase64(imagePath),
    fileType: fileType
};

axios.post(API_URL, payload)
    .then(response => {
        const results = response.data.result.layoutParsingResults;
        results.forEach((res, index) => {
            console.log(`\n[${index}] prunedResult:`);
            console.log(res.prunedResult);

            const outputImages = res.outputImages;
            if (outputImages) {
                Object.entries(outputImages).forEach(([imgName, base64Img]) => {
                    const imgPath = `${imgName}_${index}.jpg`;
                    const directory = path.dirname(imgPath);
                    if (!fs.existsSync(directory)) {
                        fs.mkdirSync(directory, { recursive: true });
                        console.log(`Created directory: ${directory}`);
                    }
                    fs.writeFileSync(imgPath, Buffer.from(base64Img, 'base64'));
                    console.log(`Output image saved at ${imgPath}`);
                });
            } else {
                console.log(`[${index}] No outputImages.`);
            }
        });
    })
    .catch(error => {
        console.error('Error during API request:', error.message || error);
    });
PHP
<?php
$API_URL = "http://localhost:8080/layout-parsing";
$image_path = "./demo.jpg";

$image_data = base64_encode(file_get_contents($image_path));
$payload = array("file" => $image_data, "fileType" => 1);

$ch = curl_init($API_URL);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($payload));
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

$result = json_decode($response, true)["result"]["layoutParsingResults"];

foreach ($result as $i => $item) {
    echo "[$i] prunedResult:\n";
    print_r($item["prunedResult"]);

    if (!empty($item["outputImages"])) {
        foreach ($item["outputImages"] as $img_name => $img_base64) {
            $output_image_path = "{$img_name}_{$i}.jpg";
            $directory = dirname($output_image_path);
            if (!is_dir($directory)) {
                mkdir($directory, 0777, true);
                echo "Created directory: $directory\n";
            }
            file_put_contents($output_image_path, base64_decode($img_base64));
            echo "Output image saved at $output_image_path\n";
        }
    } else {
        echo "No outputImages found for item $i\n";
    }
}
?>