
PaddleX Documentation


PaddleX Pipelines (CPU/GPU)

1. Basic Pipelines

Pipeline Name	Pipeline Modules	Baidu AI Studio Community Experience URL	Pipeline Introduction	Applicable Scenarios
Image Classification	Image Classification	Online Experience	Image classification is a technique that assigns images to predefined categories. It is widely used in object recognition, scene understanding, and automatic annotation. Image classification can identify various objects such as animals, plants, and traffic signs, and categorize them based on their features. By leveraging deep learning models, image classification can automatically extract image features and perform accurate classification. The General Image Classification Pipeline is designed to solve image classification tasks for given images.	Automatic classification and recognition of product images; Real-time monitoring of defective products on production lines; Personnel recognition in security surveillance
Object Detection	Object Detection	Online Experience	Object detection aims to identify the categories and locations of multiple objects in images or videos by generating bounding boxes to mark these objects. Unlike simple image classification, object detection not only recognizes what objects are in the image, such as people, cars, and animals, but also accurately determines the specific location of each object, usually represented by a rectangular box. This technology is widely used in autonomous driving, surveillance systems, and smart photo albums, relying on deep learning models (e.g., YOLO, Faster R-CNN) that efficiently extract features and perform real-time detection, significantly enhancing the computer's ability to understand image content.	Tracking moving objects in video surveillance Vehicle detection in autonomous driving Defect detection in industrial manufacturing Shelf product detection in retail
Semantic Segmentation	Semantic Segmentation	Online Experience	Semantic segmentation is a computer vision technique that assigns each pixel in an image to a specific category, enabling detailed understanding of image content. Semantic segmentation not only identifies the types of objects in an image but also classifies each pixel, allowing entire regions of the same category to be marked. For example, in a street scene image, semantic segmentation can distinguish pedestrians, cars, sky, and roads at the pixel level, forming a detailed label map. This technology is widely used in autonomous driving, medical image analysis, and human-computer interaction, often relying on deep learning models (e.g., FCN, U-Net) that use Convolutional Neural Networks (CNNs) to extract features and achieve high-precision pixel-level classification, providing a foundation for further intelligent analysis.	Analysis of satellite images in Geographic Information Systems Segmentation of obstacles and passable areas in robot vision Separation of foreground and background in film production
Instance Segmentation	Instance Segmentation	Online Experience	Instance segmentation is a computer vision task that identifies object categories in images and distinguishes the pixels of different instances within the same category, enabling precise segmentation of each object. Instance segmentation can separately mark each car, person, or animal in an image, ensuring they are processed independently at the pixel level. For example, in a street scene image with multiple cars and pedestrians, instance segmentation can clearly separate the contours of each car and person, forming multiple independent region labels. This technology is widely used in autonomous driving, video surveillance, and robot vision, often relying on deep learning models (e.g., Mask R-CNN) that use CNNs for efficient pixel classification and instance differentiation, providing powerful support for understanding complex scenes.	Crowd counting in malls Counting crops or fruits in agricultural intelligence Selecting and segmenting specific objects in image editing
PP-ChatOCRv3	Table Structure Recognition	Online Experience	Document Image Scene Information Extraction v3 (PP-ChatOCRv3-doc) is a PaddlePaddle-specific intelligent document and image analysis solution that integrates LLM and OCR technologies to solve common complex document information extraction challenges such as layout analysis, rare characters, multi-page PDFs, tables, and seal recognition. By integrating the Wenxin large model, it combines vast data and knowledge, providing high accuracy and wide applicability. The open-source version supports local experience and deployment, and fine-tuning training for each module.	Construction of knowledge graphs Detection of information related to specific events in online news and social media Extraction and analysis of key information in academic literature (especially in scenarios requiring recognition of seals, distorted images, and more complex tables)
	Layout Detection
	Text Detection
	Text Recognition
	Seal Text Detection
	Text Image Unwarping
	Document Image Orientation Classification
PP-ChatOCRv4	Table Structure Recognition	Coming Soon	Document Scene Information Extraction v4 (PP-ChatOCRv4) is a PaddlePaddle-featured intelligent analysis solution for documents and images, combining LLM, MLLM, and OCR technologies. Based on PP-ChatOCRv3, it optimizes common complex document information extraction challenges such as layout analysis, rare characters, multi-page PDFs, tables, and seal recognition. It integrates massive data and knowledge with the Ernie model, achieving high accuracy and wide applicability. This pipeline also provides flexible service deployment methods, supporting deployment on various hardware. Furthermore, it offers secondary development capabilities, allowing you to train and optimize on your own dataset, and the trained model can be seamlessly integrated.	Knowledge Graph Construction Detection of Information Related to Specific Events in Online News and Social Media Extraction and Analysis of Key Information in Academic Literature (especially scenarios requiring recognition of seals, distorted images, and more complex tables)
	Layout Detection
	Text Detection
	Text Recognition
	Seal Text Detection
	Text Image Unwarping
	Document Image Orientation Classification
	Document-based Vision-Language Model
General OCR	Text Detection	Online Experience	OCR (Optical Character Recognition) is a technology that converts text in images into editable text. It is widely used in document digitization, information extraction, and data processing. OCR can recognize printed text, handwritten text, and even certain types of fonts and symbols. General OCR is used to solve text recognition tasks, extracting text information from images and outputting it in text form. PP-OCRv4 is an end-to-end OCR system that can achieve millisecond-level accurate text prediction on CPUs, reaching open-source SOTA in general scenarios. Based on this project, many developers from academia, industry, and research have quickly implemented multiple OCR applications, covering various fields such as general, manufacturing, finance, and transportation.	License plate recognition in intelligent security Recognition of house numbers and other information Digitization of paper documents Recognition of ancient characters in cultural heritage
	Text Recognition
	Document Image Orientation Classification
	Text Image Unwarping
	Text Line Orientation Classification
General Table Recognition	Table Structure Recognition	Online Experience	Table recognition is a technology that automatically identifies and extracts table content and structure from documents or images. It is widely used in data entry, information retrieval, and document analysis. By using computer vision and machine learning algorithms, table recognition can convert complex table information into an editable format, facilitating further processing and analysis by users.	Processing of bank statements Recognition and extraction of indicators in medical reports Extraction of table information in contracts
	Text Detection
	Text Recognition
	Layout Detection
	Document Image Orientation Classification
	Text Image Unwarping
Time Series Forecasting	Time Series Forecasting Module	Online Experience	Time series forecasting is a technique that uses historical data to predict future trends by analyzing patterns in time series data. It is widely applied in financial markets, weather forecasting, and sales prediction. Time series forecasting typically employs statistical methods (such as ARIMA) or deep learning models (such as LSTM), which can handle time dependencies in data to provide accurate predictions, assisting decision-makers in better planning and response. This technology plays a crucial role in many industries, including energy management, supply chain optimization, and market analysis.	Stock prediction; Climate forecasting; Disease spread prediction; Energy demand forecasting; Traffic flow prediction; Product lifecycle prediction; Electric load forecasting
Time Series Anomaly Detection	Time Series Anomaly Detection Module	Online Experience	Time series anomaly detection is a technique that identifies abnormal patterns or behaviors in time series data. It is widely used in network security, device monitoring, and financial fraud detection. By analyzing normal trends and patterns in historical data, it discovers events that differ significantly from expected behavior, such as sudden increases in network traffic or unusual transaction activity. Time series anomaly detection often employs statistical methods or machine learning algorithms (such as Isolation Forest and LSTM), which can automatically identify anomalies in data and provide real-time alerts that help enterprises and organizations address potential risks and issues promptly. This technology plays a vital role in ensuring system stability and security.	Financial fraud detection; Network intrusion detection; Equipment failure detection; Industrial production anomaly detection; Stock market anomaly detection; Power system anomaly detection
Time Series Classification	Time Series Classification Module	Online Experience	Time series classification is a technique that categorizes time series data into predefined classes. It is widely applied in behavior recognition, speech recognition, and financial trend analysis. By analyzing features that vary over time, it identifies different patterns or events, such as classifying a speech signal as "greeting" or "request" or dividing stock price movements into "rising" or "falling." Time series classification typically uses machine learning and deep learning models, which capture time dependencies and variation patterns to provide accurate classification labels for data. This technology plays a key role in intelligent monitoring, voice assistants, and market forecasting applications.	Electrocardiogram Classification; Stock Market Behavior Classification; Electroencephalogram Classification; Emotion Classification; Traffic Condition Classification; Network Traffic Classification; Equipment Operating Condition Classification
Multi-label Image Classification	Multi-label Image Classification	Online Experience	Multi-label image classification is a technology that assigns an image to multiple related categories simultaneously. It is widely used in image tagging, content recommendation, and social media analysis. It can identify multiple objects or features present in an image, such as assigning both "dog" and "outdoor" labels to a single picture. By using deep learning models, multi-label image classification can automatically extract image features and perform accurate classification to provide more comprehensive information for users. This technology is significant in applications like intelligent search engines and automatic content generation.	Medical image diagnosis; Complex scene recognition; Multi-target monitoring; Product attribute recognition; Ecological environment monitoring; Security monitoring; Disaster warning
Small Object Detection	Small Object Detection	Online Experience	Small object detection is a technology specifically for identifying small objects in images. It is widely used in surveillance, autonomous driving, and satellite image analysis. It can accurately find and classify small-sized objects like pedestrians, traffic signs, or small animals in complex scenes. By using deep learning algorithms and optimized convolutional neural networks, small object detection can effectively enhance the recognition ability of small objects, ensuring that important information is not missed in practical applications. This technology plays an important role in improving safety and automation levels.	Pedestrian detection in autonomous vehicles Identification of small buildings in satellite images Detection of small traffic signs in intelligent transportation systems Identification of small intruding objects in security surveillance Detection of small defects in industrial inspection Monitoring of small animals in drone images
Image Anomaly Detection	Image Anomaly Detection	None	Image anomaly detection is a technology that identifies images that deviate from or do not conform to normal patterns by analyzing their content. It is widely used in industrial quality inspection, medical image analysis, and security surveillance. By using machine learning and deep learning algorithms, image anomaly detection can automatically identify potential defects, anomalies, or abnormal behavior in images, helping us detect problems and take appropriate measures promptly. Image anomaly detection systems are designed to automatically detect and label abnormal situations in images to improve work efficiency and accuracy.	Industrial quality control Medical image analysis Anomaly detection in surveillance videos Identification of violations in traffic monitoring Obstacle detection in autonomous driving Agricultural pest and disease monitoring Pollutant identification in environmental monitoring
General Layout Parsing	Layout Detection	None	Layout parsing is a technology that extracts structured information from document images, primarily used to convert complex document layouts into machine-readable data formats. This technology is widely applied in document management, information extraction, and data digitization. By combining Optical Character Recognition (OCR), image processing, and machine learning algorithms, layout parsing can identify and extract text blocks, headings, paragraphs, images, tables, and other layout elements from documents. The process typically includes three main steps: layout analysis, element analysis, and data formatting, ultimately generating structured document data to enhance the efficiency and accuracy of data processing.	Analysis of financial and legal documents Digitization of historical documents and archives Automated form filling Page structure parsing
	Layout Detection Module
	Text Detection Module
	Text Recognition Module
	Document Image Orientation Classification
	Text Image Unwarping
	Table Structure Recognition
	Text Line Orientation Classification
	Formula Recognition
	Seal Text Detection
General Layout Parsing v3	Layout Detection Module	Coming Soon	Based on the General Layout Parsing v1 pipeline, the General Layout Parsing v3 pipeline enhances the capabilities of layout detection, table recognition, and formula recognition. It adds the ability to restore multi-column reading order and convert results into Markdown files. It performs exceptionally well in various document data and can handle more complex document data. This pipeline also provides flexible service deployment methods, supporting multiple programming languages on various hardware. Furthermore, it offers secondary development capabilities, allowing you to train and optimize on your own dataset, and the trained model can be seamlessly integrated.	Intelligent Document Analysis Document Digitization Page Structure Parsing Complex Table Recognition Large Model Data Construction RAG
	Text Detection Module
	Text Recognition Module
	Document Image Orientation Classification
	Text Image Unwarping Module
	Wired Table Structure Recognition Module
	Wireless Table Structure Recognition Module
	Table Classification Module
	Wired Table Cell Detection Module
	Wireless Table Cell Detection Module
	Text Line Orientation Classification Module
	Formula Recognition Module
	Seal Text Detection Module
Formula Recognition	Formula Recognition	Online Experience	Formula recognition is a technology that automatically identifies and extracts LaTeX formula content and structure from documents or images. It is widely used in document editing and data analysis in fields such as mathematics, physics, and computer science. By using computer vision and machine learning algorithms, formula recognition can convert complex mathematical formula information into editable LaTeX format, facilitating further processing and analysis by users.	Document digitization and retrieval Formula search engine Formula editor Automated typesetting
	Layout Detection Module
	Document Image Orientation Classification
	Text Image Unwarping
Seal Text Recognition	Seal Text Detection	Online Experience	Seal text recognition is a technology that automatically extracts and identifies seal content from documents or images. Seal text recognition is a part of document processing and is useful in many scenarios, such as contract comparison, inventory audit, and invoice reimbursement review.	Contract and agreement verification Check processing Loan approval Legal document management
	Text Recognition
	Layout Detection
	Document Image Orientation Classification
	Text Image Unwarping
General Image Recognition	Mainbody Detection	None	The general image recognition pipeline is designed to address open-domain target localization and recognition issues. It can effectively identify and differentiate various target objects in different environments and conditions, making it widely applicable in autonomous driving, intelligent security, medical image analysis, and industrial automation, among other fields.	Automated Identity Verification Unmanned Retail Autonomous Driving
General Image Recognition	Image Features	None
Pedestrian Attribute Recognition	Pedestrian Detection	None	Pedestrian attribute recognition is a key function in computer vision systems used to locate and tag specific features of pedestrians in images or videos, such as gender, age, clothing color, and style.	Smart City Security Monitoring
Pedestrian Attribute Recognition	Pedestrian Attribute Recognition	None		Smart City Security Monitoring
Vehicle Attribute Recognition	Vehicle Detection	None	Vehicle attribute recognition is an important component of computer vision systems. Its main task is to locate and tag specific attributes of vehicles in images or videos, such as vehicle type, color, and license plate number. This task not only requires accurate detection of vehicles but also the recognition of detailed attribute information for each vehicle.	Intelligent Parking Traffic Management Autonomous Driving
Vehicle Attribute Recognition	Vehicle Attribute Recognition	None		Intelligent Parking Traffic Management Autonomous Driving
Face Recognition	Face Detection	None	Face recognition is an important computer vision task that automatically identifies individuals by analyzing and comparing facial features.	Security Authentication Monitoring Systems Social Media
Face Recognition	Face Features	None		Security Authentication Monitoring Systems Social Media
3D Multimodal Fusion Detection	3D Multimodal Fusion Detection	Not Available	3D multimodal fusion detection is a technology that combines multiple data modalities (such as LiDAR, cameras, and millimeter-wave radar) to detect targets in three-dimensional space. It leverages the strengths of different modalities to achieve more accurate target localization, classification, and tracking. Through deep learning algorithms, this technology can process complex 3D scenes, identify vehicles, pedestrians, obstacles, and other targets, and provide key support for fields such as autonomous driving, intelligent transportation, and robot navigation.	Obstacle detection and avoidance in autonomous vehicles Traffic flow monitoring in intelligent transportation systems Object recognition and grasping in industrial robots
Human Keypoint Detection	Pedestrian Detection	Not Available	Human keypoint detection is an important task in computer vision, aiming to locate specific parts of the human body (such as the head, shoulders, elbows, knees, etc.) through image or video data. By analyzing the geometric structure and appearance features of the human body, this technology can capture human posture and movements in real-time and is widely used in human-computer interaction, motion analysis, and virtual reality.	Movement guidance in smart fitness applications Character movement capture in virtual reality Abnormal behavior analysis in security surveillance
Human Keypoint Detection	Keypoint Detection	Not Available
Open-Vocabulary Detection	Open-Vocabulary Detection	Not Available	Open-vocabulary detection is an emerging computer vision technology aimed at enabling models to recognize and understand new categories or vocabulary not seen during training. Unlike traditional object detection, open-vocabulary detection does not rely on large amounts of labeled data but instead combines pre-trained language models and visual features to quickly recognize and understand unknown categories. This technology has broad application prospects in dynamic environment object detection, image classification, and intelligent robots.	Recognition of unknown obstacles in autonomous driving Abnormal behavior detection in intelligent security Target exploration by intelligent robots in complex environments
Open-Vocabulary Segmentation	Open-Vocabulary Segmentation	Not Available	Open-vocabulary segmentation is a cutting-edge computer vision technology aimed at performing pixel-level semantic segmentation of unknown categories in images. Unlike traditional segmentation methods limited to labeled categories, open-vocabulary segmentation combines pre-trained language models and visual features to dynamically recognize and segment new categories not seen during training. This technology excels in open-world scenarios and brings new possibilities to fields such as autonomous driving, intelligent robots, and dynamic environment perception.	Segmentation and path planning of unknown objects in autonomous driving Scene understanding by intelligent robots in unknown environments Real-time semantic segmentation and analysis in dynamic scenes
Rotated Object Detection	Rotated Object Detection	Not Available	Rotated object detection is an important technology in the field of computer vision, focusing on detecting and locating objects with arbitrary orientations in images. Unlike traditional object detection methods (which usually assume objects are horizontal or vertical), rotated object detection can handle objects at any rotation angle, thus more accurately identifying and locating targets. By introducing oriented bounding boxes (OBB) and improved deep learning algorithms, this technology performs well in complex scenes such as aerial images, satellite images, and traffic sign detection in autonomous driving.	Target recognition and localization in aerial images Rotated traffic sign detection in autonomous driving Infrastructure detection in satellite images
Document Image Preprocessing	Document Image Orientation Classification	Not Available	Document image preprocessing is a key step in document analysis and recognition. It optimizes document images through a series of operations to improve the accuracy and efficiency of subsequent processing. These operations include orientation classification, text rectification, noise removal, and binarization, which improve image quality, correct document orientation, and remove interference. This technology is widely used in document scanning, OCR text recognition, and electronic document generation.	Automatic orientation correction in document scanners; Text image optimization in OCR systems; Image restoration in historical document digitization
Document Image Preprocessing	Text Image Unwarping	Not Available
Multilingual Speech Recognition	Multilingual Speech Recognition	Not Available	Multilingual speech recognition is an advanced speech processing technology that aims to automatically identify and transcribe speech signals in multiple languages to achieve efficient information extraction and communication. Compared to single-language speech recognition, multilingual speech recognition needs to handle differences in pronunciation, grammar, and vocabulary across languages, thus requiring more powerful models and richer language resources. Through deep learning and large-scale multilingual data training, this technology can recognize speech content in multiple languages in real-time and is widely used in intelligent translation, voice assistants, and multilingual customer service.	Multilingual interaction in intelligent voice assistants Real-time speech translation in international conferences Multilingual voice customer service systems
General Video Classification	Video Classification	Not Available	Video classification is an important task in the field of computer vision, aiming to automatically analyze and identify the semantic categories of video content. Through deep learning models, video classification technology can extract spatiotemporal features from video frame sequences to accurately classify the themes, scenes, or activities in the video. This technology is widely used in video content recommendation, video surveillance analysis, intelligent media management, and video retrieval.	Content recommendation and classification in video platforms Abnormal behavior recognition in security surveillance Automatic classification and management of intelligent media libraries
General Video Detection	Video Detection	Not Available	Video detection is a key technology in the field of computer vision, focusing on real-time or offline analysis of video content to identify and locate target objects and events in the video. By combining deep learning and object detection algorithms, video detection technology can handle complex dynamic scenes, detecting objects, people, behaviors, and abnormal events in the video. This technology has broad application prospects in intelligent security, traffic monitoring, sports analysis, and video content review.	Intrusion detection and alarm in intelligent security systems Vehicle detection and violation recognition in traffic monitoring Athlete behavior analysis in sports events
Document Understanding	Document-related Visual Language Model	Not Available	The document understanding pipeline is an advanced document processing technology based on Vision-Language Models (VLMs), aiming to overcome the limitations of traditional document processing. Traditional methods rely on fixed templates or predefined rules to parse documents. In contrast, this pipeline leverages the multimodal capabilities of VLMs to answer user queries by integrating visual and linguistic information, with only the document image and the user question as input. It does not require pre-training for specific document formats, so it can flexibly handle diverse document content, which improves the generalization and practicality of document processing. It has broad application prospects in scenarios such as intelligent Q&A and information extraction.	Intelligent Q&A; Information Extraction; Contract Review and Risk Management
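
Several modules in the table above are shared by more than one pipeline. As a minimal sketch in plain Python (not PaddleX code; the names `PIPELINE_MODULES`, `build_module_index`, and `pipelines_using` are hypothetical and introduced only for illustration), a few rows of the table can be held as a pipeline-to-modules mapping and inverted to see which pipelines reuse a given module:

```python
from collections import defaultdict

# A few rows taken from the table above: pipeline name -> module names.
# Module names are written with their full spelling.
PIPELINE_MODULES = {
    "General OCR": [
        "Text Detection",
        "Text Recognition",
        "Document Image Orientation Classification",
        "Text Image Unwarping",
        "Text Line Orientation Classification",
    ],
    "Seal Text Recognition": [
        "Seal Text Detection",
        "Text Recognition",
        "Layout Detection",
        "Document Image Orientation Classification",
        "Text Image Unwarping",
    ],
    "Document Image Preprocessing": [
        "Document Image Orientation Classification",
        "Text Image Unwarping",
    ],
    "Human Keypoint Detection": ["Pedestrian Detection", "Keypoint Detection"],
    "Pedestrian Attribute Recognition": [
        "Pedestrian Detection",
        "Pedestrian Attribute Recognition",
    ],
}


def build_module_index(pipeline_modules):
    """Invert the mapping: module name -> list of pipelines that include it."""
    index = defaultdict(list)
    for pipeline, modules in pipeline_modules.items():
        for module in modules:
            index[module].append(pipeline)
    return dict(index)


MODULE_INDEX = build_module_index(PIPELINE_MODULES)


def pipelines_using(module):
    """Return the pipelines (sorted by name) that list the given module."""
    return sorted(MODULE_INDEX.get(module, []))


print(pipelines_using("Pedestrian Detection"))
print(pipelines_using("Text Image Unwarping"))
```

A lookup like this makes it easier to see, for example, that a model trained for one shared module may be relevant to several of the pipelines listed here.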

2. Featured Pipelines

Pipeline Name	Pipeline Modules	Baidu AI Studio Community Experience URL	Pipeline Introduction	Applicable Scenarios
Semi-supervised Learning for Large Models - Image Classification	Semi-supervised Learning for Large Models - Image Classification	Online Experience	Image classification is a technique that assigns images to predefined categories. It is widely used in object recognition, scene understanding, and automatic annotation. Image classification can identify various objects such as animals, plants, traffic signs, etc., and categorize them based on their features. By leveraging deep learning models, image classification can automatically extract image features and perform accurate classification. The general image classification pipeline is designed to solve image classification tasks for given images.	Commodity image classification Artwork style classification Crop disease and pest identification Animal species recognition Classification of land, water bodies, and buildings in satellite remote sensing images
Semi-supervised Learning for Large Models - Object Detection	Semi-supervised Learning for Large Models - Object Detection	Online Experience	The semi-supervised learning for large models - object detection pipeline is a unique offering from PaddlePaddle. It jointly trains a large model and a small model, using a small amount of labeled data and a large amount of unlabeled data to improve model accuracy, which significantly reduces the cost of manual model iteration and data annotation. The figure below demonstrates the performance of this pipeline on the COCO dataset with 10% labeled data. After training with this pipeline on COCO with 10% labeled data and 90% unlabeled data, the accuracy of the large model (RT-DETR-H) rises by 8.4 percentage points (47.7% -> 56.1%), setting a new state of the art (SOTA) for this dataset. The accuracy of the small model (PicoDet-S) also rises by more than 10 percentage points (18.3% -> 28.8%) compared to direct training.	Pedestrian, vehicle, and traffic sign detection in autonomous driving; Enemy facility and equipment detection in military reconnaissance; Seabed organism detection in deep-sea exploration
Semi-supervised Learning for Large Models - OCR	Text Detection	Online Experience	The semi-supervised learning for large models - OCR pipeline is a unique OCR training pipeline from PaddlePaddle. It consists of a text detection model and a text recognition model working in series. The input image is first processed by the text detection model to obtain and rectify all text line bounding boxes, which are then fed into the text recognition model to generate OCR text results. In the text recognition part, a joint training approach with large and small models is adopted, utilizing a small amount of labeled data and a large amount of unlabeled data to enhance model accuracy, significantly reducing the costs of manual model iteration and data annotation. The figure below shows the effects of this pipeline in two OCR application scenarios, demonstrating significant improvements for both large and small models in different contexts.	Digitizing paper documents Reading and verifying personal information on IDs, passports, and driver's licenses Recognizing product information in retail
Semi-supervised Learning for Large Models - OCR	Large Model Semi-supervised Learning - Text Recognition	Online Experience
General Scene Information Extraction v2	Text Detection	Online Experience	The General Scene Information Extraction Pipeline (PP-ChatOCRv2-common) is a unique intelligent analysis solution for complex documents from PaddlePaddle. It combines Large Language Models (LLMs) and OCR technology, leveraging the Wenxin Large Model to integrate massive data and knowledge, achieving high accuracy and wide applicability. The system flow of PP-ChatOCRv2-common is as follows: the input image is sent to the general OCR system, where the text detection and text recognition models predict the text; vector retrieval between the predicted text and the user query then selects the relevant text; finally, this text is passed to the prompt generator, which recombines it into a prompt for the Wenxin Large Model to generate the prediction result.	Key information extraction from various scenarios such as ID cards, bank cards, household registration books, train tickets, and paper invoices
General Scene Information Extraction v2	Text Recognition	Online Experience
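
The PP-ChatOCRv2-common flow described in the table above (OCR text, vector retrieval against the user query, prompt assembly) can be sketched in plain Python. This is an illustration under stated assumptions, not PaddleX code: the OCR output is a hard-coded list, retrieval uses a bag-of-words cosine similarity as a stand-in for whatever embedding model the pipeline actually uses, the function names (`embed`, `cosine`, `retrieve`, `build_prompt`) and the prompt wording are hypothetical, and the final call to the large model is omitted.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(ocr_lines, query, top_k=2):
    """Rank OCR text lines by similarity to the query and keep the top_k."""
    query_vec = embed(query)
    ranked = sorted(ocr_lines, key=lambda line: cosine(embed(line), query_vec), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_lines):
    """Recombine the retrieved text and the query into a single prompt string."""
    context = "\n".join(f"- {line}" for line in context_lines)
    return f"Context extracted by OCR:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical OCR output for a train ticket image (assumed, not real pipeline output).
ocr_lines = [
    "Train number G1234",
    "Departure station Beijing South",
    "Arrival station Shanghai Hongqiao",
    "Seat 05 car 12A",
]
query = "What is the departure station?"

relevant = retrieve(ocr_lines, query, top_k=2)
prompt = build_prompt(query, relevant)
print(prompt)  # this string would be sent to the large model in the real pipeline
```

Retrieving only the most relevant lines keeps the prompt short, which matters when a document yields far more OCR text than the example above.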
