tokenizer

Description

The Tokenizer subcommand provides encoding and decoding functionality between text and token sequences. It also allows viewing or exporting model vocabulary information. Both text and multimodal models are supported.
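Conceptually, encoding maps text to a sequence of integer token IDs and decoding inverts that mapping. The toy sketch below uses a hypothetical word-level vocabulary for illustration only; real model tokenizers use learned subword vocabularies loaded from the model files.

```python
# Toy word-level vocabulary; illustrative only -- real tokenizers
# use learned subword vocabularies loaded from the model directory.
vocab = {"Hello": 0, ",": 1, "world": 2, "!": 3}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    """Map a list of tokens to their integer IDs."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Map a list of integer IDs back to tokens."""
    return [id_to_token[i] for i in ids]

ids = encode(["Hello", ",", "world", "!"])
assert ids == [0, 1, 2, 3]
assert decode(ids) == ["Hello", ",", "world", "!"]
```

The real subcommand performs the same round trip, but against the vocabulary of the model given by `--model`.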

Usage

fastdeploy tokenizer --model MODEL [--encode TEXT] [--decode TOKENS] [--vocab-size] [--info] [--vocab-export FILE]

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| --model, -m | Model path or name | None |
| --encode, -e | Encode text into a list of token IDs | None |
| --decode, -d | Decode a list of token IDs back into text | None |
| --vocab-size, -vs | Display the vocabulary size | None |
| --info, -i | Display detailed tokenizer information (special tokens, IDs, max length, etc.) | None |
| --vocab-export FILE, -ve FILE | Export the vocabulary to a file | None |
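As the examples below show, the `--decode` argument is passed as a JSON-style list literal such as `"[1, 2, 3]"`. When building such an argument programmatically, Python's standard `json` module produces exactly this string form (a sketch, not part of the FastDeploy API):

```python
import json

token_ids = [5300, 96382]
arg = json.dumps(token_ids)        # the string form used with --decode
assert arg == "[5300, 96382]"
assert json.loads(arg) == token_ids  # round-trips back to the list
```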

Examples

# 1. Encode text into tokens
# Convert input text into a token sequence recognizable by the model
fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --encode "Hello, world!"

# 2. Decode tokens into text
# Convert a token sequence back into readable text
fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --decode "[1, 2, 3]"

# 3. View vocabulary size
# Output the total number of tokens in the model’s vocabulary
fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --vocab-size

# 4. View tokenizer details
# Includes special symbols, ID mappings, max token length, etc.
fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --info

# 5. Export vocabulary to a file
# Save the tokenizer’s vocabulary to a local file
fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --vocab-export ./vocab.txt

# 6. Support for multimodal models
# Decode tokens for a multimodal model
fastdeploy tokenizer --model baidu/EB-VL-Lite-d --decode "[5300, 96382]"

# 7. Combine multiple functions
# Encode, decode, view vocabulary, and export vocabulary in a single command
fastdeploy tokenizer \
    -m baidu/ERNIE-4.5-0.3B-PT \
    -e "你好哇" \
    -d "[5300, 96382]" \
    -i \
    -vs \
    -ve vocab.json
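An exported vocabulary can be inspected with a few lines of Python. The sketch below assumes a plain-text export with one token per line; the actual file format written by --vocab-export is not specified on this page, so treat this as a hypothetical reader, not the documented format.

```python
# Hypothetical reader for an exported vocabulary file.
# Assumption: the .txt export contains one token per line
# (the real format is not documented on this page).
def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Usage, after running the export example above:
# vocab = load_vocab("./vocab.txt")
# print(len(vocab))  # should match the --vocab-size output
```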