Key Information Extraction (KIE)¶
1. Introduction¶
Key information extraction (KIE) refers to extracting key information from text or images. As downstream task of OCR, the key information extraction task of document image has many practical application scenarios, such as form recognition, ticket information extraction, ID card information extraction, etc.
PP-Structure conducts research based on the LayoutXLM multi-modal, and proposes the VI-LayoutXLM, which gets rid of visual features when finetuning the downstream tasks. An textline sorting method is also utilized to fit in reading order. What's more, UDML knowledge distillation is used for higher accuracy. Finally, the accuracy and inference speed of VI-LayoutXLM surpass those of LayoutXLM.
The main features of the key information extraction module in PP-Structure are as follows.
- Integrate multi-modal methods such as LayoutXLM, VI-LayoutXLM, and PP-OCR inference engine.
- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multimodal methods. Based on the SER task, the text recognition and classification in the image can be completed; based on the RE task, the relationship extraction of the text content in the image can be completed, such as judging the problem pair (pair).
- Supports custom training for SER tasks and RE tasks.
- Supports end-to-end system prediction and evaluation of OCR+SER.
- Supports end-to-end system prediction of OCR+SER+RE.
- Support SER model export and inference using PaddleInference.
2. Performance¶
We evaluate the methods on the Chinese dataset of XFUND, and the performance is as follows
Model | Backbone | Task | Config file | Hmean | Inference time (ms) | Download link |
---|---|---|---|---|---|---|
VI-LayoutXLM | VI-LayoutXLM-base | SER | ser_vi_layoutxlm_xfund_zh_udml.yml | 93.19% | 15.49 | trained model |
LayoutXLM | LayoutXLM-base | SER | ser_layoutxlm_xfund_zh.yml | 90.38% | 19.49 | trained model |
VI-LayoutXLM | VI-LayoutXLM-base | RE | re_vi_layoutxlm_xfund_zh_udml.yml | 83.92% | 15.49 | trained model |
LayoutXLM | LayoutXLM-base | RE | re_layoutxlm_xfund_zh.yml | 74.83% | 19.49 | trained model |
- Note:Inference environment:V100 GPU + cuda10.2 + cudnn8.1.1 + TensorRT 7.2.3.4,tested using fp16.
For more KIE models in PaddleOCR, please refer to KIE model zoo.
3. Visualization¶
There are two main solutions to the key information extraction task based on VI-LayoutXLM series model.
(1) Text detection + text recognition + semantic entity recognition (SER)
(2) Text detection + text recognition + semantic entity recognition (SER) + relationship extraction (RE)
The following images are demo results of the SER and RE models. For more detailed introduction to the above solutions, please refer to KIE Guide.
3.1 SER¶
Demo results for SER task are as follows.
Note: test pictures are from xfund dataset, invoice dataset and a composite ID card dataset.
Boxes of different colors in the image represent different categories.
The invoice and application form images have three categories: request
, answer
and header
. The question
and answer
can be used to extract the relationship.
For the ID card image, the model can directly identify the key information such as name
, gender
, nationality
, so that the subsequent relationship extraction process is not required, and the key information extraction task can be completed using only one model.
3.2 RE¶
Demo results for RE task are as follows.
Red boxes are questions, blue boxes are answers. The green lines means the two connected objects are a pair.
4. Usage¶
4.1 Prepare for the environment¶
Use the following command to install KIE dependencies.
NOTE: For KIE tasks, it is necessary to downgrade the Paddle framework version (Paddle<2.6) and the PaddleNLP version (PaddleNLP<2.6).
The visualized results of SER are saved in the ./output
folder by default. Examples of results are as follows.
4.2 Quick start¶
Here we use XFUND dataset to quickly experience the SER model and RE model.
4.2.1 Prepare for the dataset¶
4.2.2 Predict images using the trained model¶
Use the following command to download the models.
If you want to use OCR engine to obtain end-to-end prediction results, you can use the following command to predict.
The visual result images and the predicted text file will be saved in the Global.save_res_path
directory.
If you want to use a custom ocr model, you can set it through the following fields
Global.kie_det_model_dir
: the detection inference model pathGlobal.kie_rec_model_dir
: the recognition inference model path
If you want to load the text detection and recognition results collected before, you can use the following command to predict.
4.2.3 Inference using PaddleInference¶
Firstly, download the inference SER inference model.
- SER
Use the following command for inference.
The visual results and text file will be saved in directory output
.
- RE
Use the following command for inference.
The visual results and text file will be saved in directory output
.
If you want to use a custom ocr model, you can set it through the following fields
--det_model_dir
: the detection inference model path--rec_model_dir
: the recognition inference model path
4.3 More¶
For training, evaluation and inference tutorial for KIE models, please refer to KIE doc.
For training, evaluation and inference tutorial for text detection models, please refer to text detection doc.
For training, evaluation and inference tutorial for text recognition models, please refer to text recognition doc.
To complete the key information extraction task in your own scenario from data preparation to model selection, please refer to: Guide to End-to-end KIE。
5. Reference¶
- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm
- XFUND dataset, https://github.com/doc-analysis/XFUND
6. License¶
The content of this project itself is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)