SVTR

1. Introduction

Paper:

SVTR: Scene Text Recognition with a Single Visual Model. Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Tianlun Zheng, Chenxia Li, Yuning Du, and Yu-Gang Jiang. IJCAI, 2022.

The accuracy (%) of SVTR on public scene text recognition datasets and the corresponding model files are listed below.

The Chinese dataset is from the Chinese Benchmark, and SVTR's Chinese training and evaluation strategy follows that paper.

Model	IC13 (857)	SVT	IIIT5k (3000)	IC15 (1811)	SVTP	CUTE80	Avg_6	IC15 (2077)	IC13 (1015)	IC03 (867)	IC03 (860)	Avg_10	Chinese scene_test	Download link
SVTR Tiny	96.85	91.34	94.53	83.99	85.43	89.24	90.87	80.55	95.37	95.27	95.70	90.13	67.90	English / Chinese
SVTR Small	95.92	93.04	95.03	84.70	87.91	92.01	91.63	82.72	94.88	96.08	96.28	91.02	69.00	English / Chinese
SVTR Base	97.08	91.50	96.03	85.20	89.92	91.67	92.33	83.73	95.66	95.62	95.81	91.61	71.40	English / -
SVTR Large	97.20	91.65	96.30	86.58	88.37	95.14	92.82	84.54	96.35	96.54	96.74	92.24	72.10	English / Chinese
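The Avg_6 and Avg_10 columns do not match plain means of the per-benchmark accuracies. As a quick consistency check, the minimal sketch below reproduces the SVTR Tiny averages by weighting each benchmark by its number of test images. The sizes for SVT (647), SVTP (645) and CUTE80 (288) are assumptions based on the commonly used test splits, since the table does not state them; the other sizes come from the column headers.

```python
# Sketch: reproduce the Avg_6 / Avg_10 columns for SVTR Tiny as
# sample-weighted means. Sizes for SVT, SVTP and CUTE80 are assumed
# (common test-split sizes); the others come from the table headers.
TINY = {
    # benchmark: (accuracy %, number of test images)
    "IC13_857": (96.85, 857),
    "SVT": (91.34, 647),        # assumed size
    "IIIT5k_3000": (94.53, 3000),
    "IC15_1811": (83.99, 1811),
    "SVTP": (85.43, 645),       # assumed size
    "CUTE80": (89.24, 288),     # assumed size
    "IC15_2077": (80.55, 2077),
    "IC13_1015": (95.37, 1015),
    "IC03_867": (95.27, 867),
    "IC03_860": (95.70, 860),
}

AVG_6 = ["IC13_857", "SVT", "IIIT5k_3000", "IC15_1811", "SVTP", "CUTE80"]
AVG_10 = list(TINY)  # all ten benchmarks, in insertion order


def weighted_avg(results, names):
    """Accuracy averaged over benchmarks, weighted by test-set size."""
    total = sum(results[n][1] for n in names)
    correct = sum(results[n][0] * results[n][1] for n in names)
    return correct / total


if __name__ == "__main__":
    print(f"Avg_6:  {weighted_avg(TINY, AVG_6):.2f}")   # table: 90.87
    print(f"Avg_10: {weighted_avg(TINY, AVG_10):.2f}")  # table: 90.13
```

For comparison, an unweighted mean of the Tiny row's six Avg_6 benchmarks gives about 90.23 rather than 90.87.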

2. Environment

Please refer to "Environment Preparation" to configure the PaddleOCR environment, and to "Project Clone" to clone the project code.

Dataset Preparation

English dataset download / Chinese dataset download

3. Model Training / Evaluation / Prediction

Please refer to the Text Recognition Tutorial. PaddleOCR modularizes the code, so training a different recognition model only requires changing the configuration file.

Training

Once the data is prepared, training can be started with the following commands:

# Single-GPU training (long training time, not recommended)
python3 tools/train.py -c configs/rec/rec_svtrnet.yml

# Multi-GPU training; specify the GPU IDs with the --gpus parameter
python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/rec/rec_svtrnet.yml

Evaluation

You can download the model and configuration files provided for SVTR (download link). Taking SVTR-T as an example, evaluate it with the following commands:

# Download the tar archive containing the model files and configuration files of SVTR-T and extract it
wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_en_train.tar && tar xf rec_svtr_tiny_none_ctc_en_train.tar
# GPU evaluation
python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c ./rec_svtr_tiny_none_ctc_en_train/rec_svtr_tiny_6local_6global_stn_en.yml -o Global.pretrained_model=./rec_svtr_tiny_none_ctc_en_train/best_accuracy

Prediction

python3 tools/infer_rec.py -c ./rec_svtr_tiny_none_ctc_en_train/rec_svtr_tiny_6local_6global_stn_en.yml -o Global.infer_img='./doc/imgs_words_en/word_10.png' Global.pretrained_model=./rec_svtr_tiny_none_ctc_en_train/best_accuracy

4. Inference and Deployment

4.1 Python Inference

First, convert the model saved during SVTR text recognition training into an inference model (model download link) with the following command:

python3 tools/export_model.py -c configs/rec/rec_svtrnet.yml -o Global.pretrained_model=./rec_svtr_tiny_none_ctc_en_train/best_accuracy  Global.save_inference_dir=./inference/rec_svtr_tiny_stn_en

Note: If you train the model on your own dataset and have modified the dictionary file, set character_dict_path in the configuration file to the modified dictionary file.

After a successful conversion, the directory contains three files:

./inference/rec_svtr_tiny_stn_en/
    ├── inference.pdiparams
    ├── inference.pdiparams.info
    └── inference.pdmodel
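If you script the export, a small check that the three files above exist can catch a failed conversion early. This is a minimal sketch: the helper name missing_export_files is hypothetical, and the directory is assumed to be the one passed as Global.save_inference_dir.

```python
# Sketch: check that an export directory contains the three files listed
# above. The helper is hypothetical; the file names come from the listing.
from pathlib import Path

EXPECTED_FILES = (
    "inference.pdiparams",
    "inference.pdiparams.info",
    "inference.pdmodel",
)


def missing_export_files(export_dir):
    """Return the expected inference files that are absent from export_dir."""
    root = Path(export_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Path assumed to match Global.save_inference_dir in the export command.
    missing = missing_export_files("./inference/rec_svtr_tiny_stn_en")
    print("export OK" if not missing else f"missing: {missing}")
```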

To run inference with the SVTR text recognition model, execute the following command:

python3 tools/infer/predict_rec.py --image_dir='./doc/imgs_words_en/word_10.png' --rec_model_dir='./inference/rec_svtr_tiny_stn_en/' --rec_algorithm='SVTR' --rec_image_shape='3,64,256' --rec_char_dict_path='./ppocr/utils/ic15_dict.txt'

After the command runs, the prediction result (recognized text and score) for the image above is printed to the screen, for example:

Predicts of ./doc/imgs_words_en/word_10.png:('pain', 0.9999998807907104)
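When running predictions in bulk, it can help to parse this printed line programmatically. The sketch below assumes the line format shown above, possibly preceded by a logger prefix; the helper parse_prediction is hypothetical and not part of PaddleOCR, and other PaddleOCR versions may print results differently.

```python
# Sketch: extract image path, recognized text and score from a line like
# "Predicts of <path>:(<text>, <score>)". Format assumed from the example.
import ast


def parse_prediction(line):
    """Split a 'Predicts of <path>:(<text>, <score>)' line into its parts."""
    prefix = "Predicts of "
    body = line[line.index(prefix) + len(prefix):]
    # Split at the first ":(" so the recognized text may itself contain ":(";
    # this assumes the image path does not contain ":(".
    path, _, result = body.partition(":(")
    # Re-add the opening parenthesis and safely evaluate the tuple literal.
    text, score = ast.literal_eval("(" + result)
    return path, text, float(score)


if __name__ == "__main__":
    line = "Predicts of ./doc/imgs_words_en/word_10.png:('pain', 0.9999998807907104)"
    print(parse_prediction(line))
```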

4.2 C++ Inference

Not supported

4.3 Serving

Not supported

4.4 More

Not supported

5. FAQ

1. Speed on CPU and GPU
Since most of the operators used by SVTR are matrix multiplications, SVTR has a speed advantage on GPU. On CPU with mkldnn enabled, however, it has no advantage over optimized convolutional networks.
2. Converting the SVTR model to ONNX fails
Make sure paddle2onnx and onnxruntime are up to date, and refer to the SVTR model to ONNX step-by-step example for the conversion command.
3. The SVTR model converts to ONNX successfully, but the inference result is incorrect
The likely cause is an incorrectly set model parameter out_char_num. It should be set to W//4, W//8, or W//12, where W is the input image width; see Section 3.3.3 of SVTR, a high-precision Chinese scene text recognition model.
4. Optimizing long text recognition
Refer to Section 3.3 of SVTR, a high-precision Chinese scene text recognition model.
5. Notes on reproducing the paper's results
Use the dataset provided by ABINet.
By default, training uses 4 GPUs with a per-GPU batch size of 512, giving a total batch size of 2048 and a learning rate of 0.0005. When changing the batch size or the number of GPUs, scale the learning rate proportionally.
6. Directions for further optimization
Learning rate adjustment: double the default learning rate while keeping the batch size unchanged, or halve the batch size while keeping the learning rate unchanged.
Data augmentation: optionally use RecConAug and RecAug.
If STN is not used, the Local entries in mixer can be replaced with Conv, and all local_mixer values can be set to [5, 5].
Grid-search embed_dim, depth, and num_heads for the best configuration.
Use the Post-Normalization strategy, i.e., set prenorm to True in the model configuration.
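FAQ item 3 ties out_char_num to the input width W. The minimal sketch below lists the three candidate values for a given width, using W = 256 from the --rec_image_shape='3,64,256' value in the Python inference command as an example. Both helper names are hypothetical, and which divisor is correct depends on the model configuration, which this sketch does not inspect.

```python
# Sketch: candidate out_char_num values for input width W, following the
# FAQ rule (W//4, W//8 or W//12). Helpers are hypothetical; the right
# divisor depends on the model configuration.
def parse_rec_image_shape(shape):
    """Parse a 'C,H,W' string such as the --rec_image_shape value."""
    c, h, w = (int(v) for v in shape.split(","))
    return c, h, w


def out_char_num_candidates(width):
    """Return the W//4, W//8 and W//12 candidates for a given input width."""
    return {f"W//{d}": width // d for d in (4, 8, 12)}


if __name__ == "__main__":
    _, _, w = parse_rec_image_shape("3,64,256")
    print(out_char_num_candidates(w))
```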
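FAQ item 5 describes linear learning-rate scaling: the learning rate stays proportional to the total batch size, anchored at 0.0005 for 4 GPUs with 512 images each. A minimal sketch of that arithmetic follows; the function scaled_lr is hypothetical.

```python
# Sketch: linear learning-rate scaling from the FAQ defaults
# (4 GPUs x 512 images per GPU = 2048 total batch size -> lr 0.0005).
BASE_LR = 0.0005
BASE_TOTAL_BATCH = 4 * 512  # 2048


def scaled_lr(num_gpus, batch_per_gpu, base_lr=BASE_LR, base_total=BASE_TOTAL_BATCH):
    """Scale the learning rate in proportion to the total batch size."""
    total = num_gpus * batch_per_gpu
    return base_lr * total / base_total


if __name__ == "__main__":
    for gpus, bs in [(4, 512), (2, 512), (8, 256), (1, 128)]:
        print(f"{gpus} GPU(s) x {bs} -> total {gpus * bs}, lr {scaled_lr(gpus, bs):.6g}")
```

The exploration suggested in item 6 (doubling the learning rate at an unchanged batch size) deliberately departs from this proportional rule.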
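FAQ item 6 enables Post-Normalization by setting prenorm to True, which can look backwards given the flag's name. The NumPy sketch below contrasts the two residual orderings for a single sub-layer, assuming, as the FAQ's wording implies, that prenorm=True normalizes after the residual addition. It is a generic illustration, not PaddleOCR's SVTRNet code, and the linear map standing in for the mixer or MLP is hypothetical.

```python
# Sketch: pre- vs post-normalization for one residual sub-layer (NumPy).
# Assumes prenorm=True means layer_norm(x + f(x)), as the FAQ's
# "Post-Normalization" wording implies; not PaddleOCR's own code.
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def residual_block(x, sublayer, prenorm):
    """Apply one residual sub-layer in either ordering.

    prenorm=True  -> layer_norm(x + sublayer(x))   (post-normalization)
    prenorm=False -> x + sublayer(layer_norm(x))   (pre-normalization)
    """
    if prenorm:
        return layer_norm(x + sublayer(x))
    return x + sublayer(layer_norm(x))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weight = rng.standard_normal((8, 8)) * 0.1

    def sublayer(t):
        # Hypothetical stand-in for the mixer/MLP: a token-wise linear map.
        return t @ weight

    x = rng.standard_normal((4, 8)) * 3.0 + 1.0  # 4 tokens, 8 channels
    post = residual_block(x, sublayer, prenorm=True)
    pre = residual_block(x, sublayer, prenorm=False)
    # Post-norm outputs are normalized per token; pre-norm outputs are not.
    print("post-norm token means:", post.mean(axis=-1).round(4))
    print("pre-norm token means: ", pre.mean(axis=-1).round(4))
```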

Citation

@inproceedings{Du2022SVTR,
  title     = {SVTR: Scene Text Recognition with a Single Visual Model},
  author    = {Du, Yongkun and Chen, Zhineng and Jia, Caiyan and Yin, Xiaoting and Zheng, Tianlun and Li, Chenxia and Du, Yuning and Jiang, Yu-Gang},
  booktitle = {IJCAI},
  year      = {2022},
  url       = {https://arxiv.org/abs/2205.00159}
}
