
Tutorial on Using the Open-Vocabulary Segmentation Module

I. Overview

Open-vocabulary segmentation is an image segmentation task that segments objects in an image based on additional prompt information such as text descriptions, bounding boxes, or keypoints, rather than the image alone. It allows a model to handle a wide range of object categories without a predefined category list. By combining visual and multimodal techniques, it significantly improves the flexibility and accuracy of image processing. Open-vocabulary segmentation has important applications in computer vision, especially for object segmentation in complex scenes.

II. Supported Model List

| Model | Model Download Link | GPU Inference Time (ms)<br>[Normal Mode / High-Performance Mode] | CPU Inference Time (ms) | Model Size (M) | Description |
| --- | --- | --- | --- | --- | --- |
| SAM-H_box | Inference Model | 144.9 | 33920.7 | 2433.7 | SAM (Segment Anything Model) is an advanced image segmentation model that can segment any object in an image based on simple user-provided prompts (such as points, boxes, or text). Trained on the SA-1B dataset with ten million images and 1.1 billion mask annotations, it performs well in most scenarios. |
| SAM-H_point | Inference Model | 144.9 | 33920.7 | 2433.7 | Same as above; this variant is driven by point prompts rather than box prompts. |

Test Environment Description:

- Performance Test Environment
  - Hardware Configuration:
    - GPU: NVIDIA Tesla T4
    - CPU: Intel Xeon Gold 6271C @ 2.60GHz
  - Other Environments: Ubuntu 20.04 / cuDNN 8.6 / TensorRT 8.5.2.2
- Inference Mode Description

| Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination |
| --- | --- | --- | --- |
| Normal Mode | FP32 precision / no TRT acceleration | FP32 precision / 8 threads | PaddleInference |
| High-Performance Mode | Optimal combination of pre-selected precision types and acceleration strategies | FP32 precision / 8 threads | Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.) |

III. Quick Integration

❗ Before quick integration, please install the PaddleX wheel package. For details, refer to the PaddleX Local Installation Guide.

After installing the wheel package, you can complete inference for the open-vocabulary segmentation module in just a few lines of code. You can switch freely between the models under this module, and you can also integrate the module's model inference into your own project. Before running the following code, please download the example image to your local machine.

```python
from paddlex import create_model

model = create_model('SAM-H_box')
results = model.predict(
    "open_vocabulary_segmentation.jpg",
    prompts={
        # Each box prompt is [x_min, y_min, x_max, y_max] in pixel coordinates.
        "box_prompt": [
            [112.9239273071289, 118.38755798339844, 513.7587890625, 382.0570068359375],
            [4.597158432006836, 263.5540771484375, 92.20092010498047, 336.5640869140625],
            [592.3548583984375, 260.8838806152344, 607.1813354492188, 294.2261962890625]
        ],
    }
)
for res in results:
    res.print()
    res.save_to_img("./output/")
    res.save_to_json("./output/res.json")
```

After running, the result obtained is:

```
{'res': "{'input_path': '000000004505.jpg', 'prompts': {'box_prompt': [[112.9239273071289, 118.38755798339844, 513.7587890625, 382.0570068359375], [4.597158432006836, 263.5540771484375, 92.20092010498047, 336.5640869140625], [592.3548583984375, 260.8838806152344, 607.1813354492188, 294.2261962890625]]}, 'masks': '...', 'mask_infos': [{'label': 'box_prompt', 'prompt': [112.9239273071289, 118.38755798339844, 513.7587890625, 382.0570068359375]}, {'label': 'box_prompt', 'prompt': [4.597158432006836, 263.5540771484375, 92.20092010498047, 336.5640869140625]}, {'label': 'box_prompt', 'prompt': [592.3548583984375, 260.8838806152344, 607.1813354492188, 294.2261962890625]}]}"}
```

The meanings of the parameters in the running results are as follows:

- `input_path`: The path of the input image to be predicted.
- `prompts`: The original prompt information used for prediction.
- `masks`: The actual predicted masks. Since the data is too large to print directly, it is replaced with `...` here. You can save the prediction results as an image using `res.save_to_img()` or as a JSON file using `res.save_to_json()`.
- `mask_infos`: The prompt information corresponding to each predicted mask.
  - `label`: The prompt type corresponding to the predicted mask.
  - `prompt`: The original prompt input corresponding to the predicted mask.

The visualization image, with the predicted masks overlaid on the input, is saved to the `./output/` directory by `res.save_to_img()`.

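This module also provides SAM-H_point, which is driven by point prompts instead of boxes. The following is a minimal sketch of point-based prompting; it assumes the `prompts` dict takes a `"point_prompt"` key whose entries are `[x, y]` pixel coordinates, mirroring the `"box_prompt"` structure above, and the coordinates themselves are illustrative.

```python
from paddlex import create_model

# Minimal sketch of point-prompted segmentation with SAM-H_point.
# Assumption: the prompts dict accepts a "point_prompt" key with
# [x, y] pixel coordinates; the points below are illustrative.
model = create_model('SAM-H_point')
results = model.predict(
    "open_vocabulary_segmentation.jpg",
    prompts={
        "point_prompt": [
            [300.0, 250.0],  # a point inside the first object of interest
            [48.0, 300.0],   # a point inside the second object of interest
        ],
    }
)
for res in results:
    res.print()
    res.save_to_img("./output/")
```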

Related methods and parameter explanations are as follows:

- `create_model` instantiates an open-vocabulary segmentation model (using `SAM-H_box` as an example). The specific explanations are as follows:

| Parameter | Parameter Description | Parameter Type | Options | Default Value |
| --- | --- | --- | --- | --- |
| `model_name` | The name of the model | `str` | None | None |
| `model_dir` | The storage path of the model | `str` | None | None |

- `model_name` must be specified. After specifying `model_name`, the model parameters built into PaddleX are used by default. If `model_dir` is also specified, the user-defined model is used, as in the sketch below.
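For example, a locally stored copy of the model could be loaded as follows; the directory path here is hypothetical:

```python
from paddlex import create_model

# Load user-provided weights instead of the built-in ones.
# The model directory path below is hypothetical.
model = create_model(model_name='SAM-H_box', model_dir='/path/to/SAM-H_box')
```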

- The `predict()` method of the open-vocabulary segmentation model is called for inference prediction. Its parameters are `input`, `batch_size`, and `prompts`, with specific explanations as follows (see the sketch after this table for the different input types):

| Parameter | Parameter Description | Parameter Type | Options | Default Value |
| --- | --- | --- | --- | --- |
| `input` | Data to be predicted, supporting multiple input types | `Python Var`/`str`/`list` | Python variable, such as image data represented by `numpy.ndarray`; file path, such as the local path of an image file: `/root/data/img.jpg`; URL link, such as the network URL of an image file; local directory containing the data files to be predicted, such as `/root/data/`; list whose elements are any of the above types, such as `[numpy.ndarray, numpy.ndarray]`, `["/root/data/img1.jpg", "/root/data/img2.jpg"]`, `["/root/data1", "/root/data2"]` | None |
| `batch_size` | Batch size | `int` | Any positive integer | `1` |
| `prompts` | Prompts used by the model | `dict` | A dict such as `{"box_prompt": [[float, float, float, float], ...]}`, representing multiple bounding boxes used as prompts during inference | None |
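As a sketch of the supported input types, the calls below feed the same prompts to `predict()` in different ways; the file paths and the commented-out variants are illustrative, not part of the example download:

```python
from paddlex import create_model

model = create_model('SAM-H_box')
prompts = {"box_prompt": [[112.9, 118.4, 513.8, 382.1]]}

# 1. A single local file path.
results = model.predict("open_vocabulary_segmentation.jpg", prompts=prompts)

# 2. Image data already loaded as a numpy.ndarray (e.g. decoded with OpenCV).
# image = cv2.imread("open_vocabulary_segmentation.jpg")
# results = model.predict(image, prompts=prompts)

# 3. A list of file paths; batch_size controls how many are processed at once.
# results = model.predict(
#     ["/root/data/img1.jpg", "/root/data/img2.jpg"],
#     batch_size=2,
#     prompts=prompts,
# )
```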
- The prediction results are processed, and the prediction result of each sample is of type `dict`, supporting operations such as printing, saving as an image, and saving as a JSON file:

| Method | Method Description | Parameter | Parameter Type | Parameter Description | Default Value |
| --- | --- | --- | --- | --- | --- |
| `print()` | Print the results to the terminal | `format_json` | `bool` | Whether to format the output content using JSON indentation | `True` |
| | | `indent` | `int` | Indentation level to beautify the JSON output and make it more readable; only effective when `format_json` is `True` | `4` |
| | | `ensure_ascii` | `bool` | Whether non-ASCII characters are escaped to Unicode; `True` escapes all non-ASCII characters, `False` retains the original characters; only effective when `format_json` is `True` | `False` |
| `save_to_json()` | Save the results as a file in JSON format | `save_path` | `str` | The file path for saving; when it is a directory, the saved file name is consistent with the input file name | None |
| | | `indent` | `int` | Indentation level to beautify the JSON output; only effective when `format_json` is `True` | `4` |
| | | `ensure_ascii` | `bool` | Whether non-ASCII characters are escaped to Unicode; only effective when `format_json` is `True` | `False` |
| `save_to_img()` | Save the results as a file in image format | `save_path` | `str` | The file path for saving; when it is a directory, the saved file name is consistent with the input file name | None |
- In addition, the visualization image and the prediction results can also be obtained through attributes, as follows (a usage sketch follows the table):

| Attribute | Attribute Description |
| --- | --- |
| `json` | Get the prediction results in JSON format |
| `img` | Get the visualization image in `dict` format |
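For instance, instead of writing files, the same information can be pulled into Python directly through these attributes. A minimal sketch follows; the exact keys and image objects inside the `img` dict depend on the model, so the loop below only inspects what it finds:

```python
for res in results:
    data = res.json   # prediction results as JSON-formatted data
    vis = res.img     # visualization image(s), returned as a dict

    # Assumption: the dict maps names to image objects; we only inspect them here.
    for name, image in vis.items():
        print(name, type(image))
```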

For more information on the usage of PaddleX single-model inference APIs, please refer to PaddleX Single-Model Python Script Usage Guide.

IV. Secondary Development

The current module does not yet support fine-tuning and only supports inference integration. Support for fine-tuning this module is planned for the future.
