Open-Vocabulary Object Detection Module Tutorial¶

I. Overview¶

Open-vocabulary object detection is an advanced object detection technology aimed at overcoming the limitations of traditional object detection. Traditional methods can only recognize objects within predefined categories, while open-vocabulary object detection allows models to identify objects not seen during training. By integrating natural language processing techniques and using text descriptions to define new categories, the model can recognize and locate these new objects. This makes object detection more flexible and generalizable, with significant application potential.

II. List of Supported Models¶

Model	Model Download Link	mAP(0.5:0.95)	mAP(0.5)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Storage Size (MB)	Introduction
GroundingDINO-T	Inference Model	49.4	64.4	- / -	- / -	658.3	This is an open-vocabulary object detection model trained on the O365, GoldG, and Cap4M datasets. It uses Bert for text encoding and DINO for the visual model, with additional cross-modal fusion modules, achieving good performance in open-vocabulary object detection.
YOLO-Worldv2-L	Inference Model	44.4	59.8	- / -	292.14 / 292.14	421.4	This is an open-vocabulary object detection model trained on the O365 and GoldG datasets. It uses CLIP for text encoding and YOLOv8 for the visual model, with additional light-weight cross-modal fusion modules, achieving a good balance between speed and performance.

Test Environment Description:

Performance Test Environment
- Test Dataset： Based on the open vocabulary object detection model trained on the three datasets: O365, GoldG, and Cap4M.
- Hardware Configuration:
  - GPU: NVIDIA Tesla T4
  - CPU: Intel Xeon Gold 6271C @ 2.60GHz
- Software Environment:
  - Ubuntu 20.04 / CUDA 11.8 / cuDNN 8.9 / TensorRT 8.6.1.6
  - paddlepaddle 3.0.0 / paddlex 3.0.3
```
  </li>
  <li><b>Inference Mode Description</b></li>
```

Mode	GPU Configuration	CPU Configuration	Acceleration Technology Combination
Normal Mode	FP32 Precision / No TRT Acceleration	FP32 Precision / 8 Threads	PaddleInference
High-Performance Mode	Optimal combination of pre-selected precision types and acceleration strategies	FP32 Precision / 8 Threads	Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.)

III. Quick Integration¶

❗ Before quick integration, please install the PaddleX wheel package first. For details, please refer to the PaddleX Local Installation Guide.

After installing the wheel package, you can perform inference for the open-vocabulary object detection module with just a few lines of code. You can switch models under this module at will, and you can also integrate the model inference of the open-vocabulary object detection module into your project. Before running the following code, please download the example image to your local machine.

Note: Due to network issues, the above URLs may not be accessible. If you need to access these links, please check the validity of the URLs and try again. If the problem persists, it may be related to the links themselves or the network connection.

from paddlex import create_model
model = create_model('GroundingDINO-T')
results = model.predict('open_vocabulary_detection.jpg', prompt='bus . walking man . rearview mirror .')
for res in results:
    res.print()
    res.save_to_img(f"./output/")
    res.save_to_json(f"./output/res.json")

After running, the result obtained is:

{'res': "{'input_path': 'open_vocabulary_detection.jpg', 'boxes': [{'coordinate': [112.10542297363281, 117.93667602539062, 514.35693359375, 382.10150146484375], 'label': 'bus', 'score': 0.9348853230476379}, {'coordinate': [264.1828918457031, 162.6674346923828, 286.8844909667969, 201.86187744140625], 'label': 'rearview mirror', 'score': 0.6022508144378662}, {'coordinate': [606.1133422851562, 254.4973907470703, 622.56982421875, 293.7867126464844], 'label': 'walking man', 'score': 0.4384709894657135}, {'coordinate': [591.8192138671875, 260.2451171875, 607.3953247070312, 294.2210388183594], 'label': 'man', 'score': 0.3573091924190521}]}"}

The meanings of the parameters in the running results are as follows: - input_path: The path of the input image to be predicted. - boxes: Information about each predicted object. - label: The category name. - score: The prediction score. - coordinate: The coordinates of the prediction box, in the format [xmin, ymin, xmax, ymax].

The visualization image is as follows:

Note: Due to network issues, the parsing of the above URL may not have been successful. If you need the content of this webpage, please check the validity of the URL and try again.

Related methods, parameters, and explanations are as follows:

create_model instantiates an open-vocabulary object detection model (using GroundingDINO-T as an example). The specific explanations are as follows:

Parameter	Parameter Description	Parameter Type	Options	Default Value
`model_name`	The name of the model	`str`	None	`None`
`model_dir`	The storage path of the model	`str`	None	None
`device`	The device used for model inference	`str`	It supports specifying specific GPU card numbers, such as "gpu:0", other hardware card numbers, such as "npu:0", or CPU, such as "cpu".	`gpu:0`
`thresholds`	The filtering thresholds used by the model	`dict/None`	None	None
`use_hpip`	Whether to enable the high-performance inference plugin	`bool`	None	`False`
`hpi_config`	High-performance inference configuration	`dict` \| `None`	None	`None`

The model_name must be specified. After specifying model_name, the model parameters built into PaddleX will be used by default. If model_dir is specified, the user-defined model will be used.
thresholds is the filtering threshold used by the model. The default is None, which means using the settings from the lower priority. The priority of parameter settings from high to low is: predict parameter input > create_model initialization input > yaml configuration file setting.
The GroundingDINO series of models require two thresholds during inference: box_threshold (default 0.3) and text_threshold (default 0.25). The parameter input format is {"box_threshold": 0.3, "text_threshold": 0.25}.
The predict() method of the open-vocabulary object detection model is called for inference prediction. The parameters of the predict() method are input, batch_size, thresholds, and prompt, with specific explanations as follows:

Parameter	Parameter Description	Parameter Type	Options	Default Value
`input`	Data to be predicted, supporting multiple input types	`Python Var`/`str`/`list`	Python variable, such as image data represented by `numpy.ndarray` File path, such as the local path of an image file: `/root/data/img.jpg` URL link, such as the network URL of an image file: Example Local directory, the directory should contain data files to be predicted, such as the local path: `/root/data/` List, the elements of the list must be the above types of data, such as `[numpy.ndarray, numpy.ndarray]`, `[\"/root/data/img1.jpg\", \"/root/data/img2.jpg\"]`, `[\"/root/data1\", \"/root/data2\"]`	None
`batch_size`	Batch size	`int`	Any integer	1
`thresholds`	The filtering thresholds used by the model	`dict`/`None`	None, indicating the use of the settings from the lower priority. The priority of parameter settings from high to low is: `predict parameter input > create_model initialization input > yaml configuration file setting` dict, such as `{"box_threshold": 0.3, "text_threshold": 0.25}`, indicating that the box_threshold is set to 0.3 and the text_threshold is set to 0.25 during inference	None
`prompt`	The prompt used by the model for prediction	`str`	Any string	1

The prediction results are processed, and the prediction result of each sample is of type dict, supporting operations such as printing, saving as an image, and saving as a json file:

Method	Method Description	Parameter	Parameter Type	Parameter Description	Default Value
`print()`	Print the results to the terminal	`format_json`	`bool`	Whether to format the output content using `JSON` indentation	`True`
		`indent`	`int`	Specify the indentation level to beautify the output `JSON` data and make it more readable. This is only effective when `format_json` is `True`	4
		`ensure_ascii`	`bool`	Control whether non-`ASCII` characters are escaped to `Unicode`. When set to `True`, all non-`ASCII` characters will be escaped; `False` retains the original characters. This is only effective when `format_json` is `True`	`False`
`save_to_json()`	Save the results as a file in JSON format	`save_path`	`str`	The file path for saving. When it is a directory, the saved file name will be consistent with the input file name	None
		`indent`	`int`	Specify the indentation level to beautify the output `JSON` data and make it more readable. This is only effective when `format_json` is `True`	4
		`ensure_ascii`	`bool`	Control whether non-`ASCII` characters are escaped to `Unicode`. When set to `True`, all non-`ASCII` characters will be escaped; `False` retains the original characters. This is only effective when `format_json` is `True`	`False`
`save_to_img()`	Save the results as a file in image format	`save_path`	`str`	The file path for saving. When it is a directory, the saved file name will be consistent with the input file name	None

In addition, it also supports obtaining the visualization image with results and the prediction results through attributes, as follows:

Attribute	Attribute Description
`json`	Get the prediction results in `json` format
`img`	Get the visualization image in `dict` format

For more information on the usage of PaddleX single-model inference APIs, please refer to PaddleX Single-Model Python Script Usage Guide.

IV. Secondary Development¶

The current module temporarily does not support fine-tuning training and only supports inference integration. Fine-tuning training for this module is planned to be supported in the future.

Open-Vocabulary Object Detection Module Tutorial¶

I. Overview¶

II. List of Supported Models¶

III. Quick Integration¶

IV. Secondary Development¶

Comments