🤗 Tokenizers.js: A pure JS/TS implementation of today's most used tokenizers
A lightweight tokenizer for the Web
Run today's most used tokenizers directly in your browser or Node.js application. No heavy dependencies, no server required. Just fast, client-side tokenization compatible with thousands of models on the Hugging Face Hub. These tokenizers are also used in 🤗 Transformers.js.
- Lightweight (~ 8.3kB gzip)
- Zero dependencies
- Works in browsers and Node.js
```bash
npm install @huggingface/tokenizers
```
Alternatively, you can load it from a CDN instead of installing it locally.

Once the package is available, load a model's tokenizer files from the Hugging Face Hub and use them as follows:
```javascript
import { Tokenizer } from "@huggingface/tokenizers";

// Load files from the Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`).then((res) => res.json());
const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`).then((res) => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize("Hello World"); // ['Hello', 'ĠWorld']
const encoded = tokenizer.encode("Hello World"); // { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], attention_mask: [1, 1] }
const decoded = tokenizer.decode(encoded.ids); // 'Hello World'
```
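A common use of client-side tokenization is checking whether a prompt fits within a model's context window before sending it anywhere. The following is a minimal sketch of that idea. `fitsTokenBudget` is a hypothetical helper, not part of the library; it assumes only that `tokenizer.encode(text)` returns an object with an `ids` array, as in the example above.

```javascript
// Hypothetical helper (not part of @huggingface/tokenizers).
// `tokenizer` is any object whose encode(text) returns { ids: number[] },
// such as the Tokenizer instance created in the usage example.
function fitsTokenBudget(tokenizer, text, maxTokens) {
  const { ids } = tokenizer.encode(text);
  return { count: ids.length, fits: ids.length <= maxTokens };
}

// Usage, assuming `tokenizer` was created as shown earlier:
// const { count, fits } = fitsTokenBudget(tokenizer, "Hello World", 512);
```

Because the helper depends only on the shape of the `encode` result, it can be exercised with a stub object in unit tests without downloading any tokenizer files.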
This library expects two files from Hugging Face models:
- `tokenizer.json` - Contains the tokenizer configuration
- `tokenizer_config.json` - Contains additional metadata
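Since both files are fetched the same way, the loading step can be wrapped in a small helper that also surfaces HTTP errors (for example, a missing file or a gated model). This is a sketch under stated assumptions: `hubFileUrl` and `loadTokenizerFiles` are hypothetical names, the URL pattern is the one used in the usage example, and the optional `revision` argument assumes a branch name or commit hash can stand in for `main` in that URL.

```javascript
// Hypothetical helpers (not part of @huggingface/tokenizers).
const REQUIRED_FILES = ["tokenizer.json", "tokenizer_config.json"];

// Build the Hub URL for one file, following the pattern from the usage example.
// `revision` is an assumption: a branch name or commit hash in place of "main".
function hubFileUrl(modelId, filename, revision = "main") {
  return `https://huggingface.co/${modelId}/resolve/${revision}/${filename}`;
}

// Fetch and parse both files in parallel.
// `fetchFn` is injectable so the function can be tested without network access.
async function loadTokenizerFiles(modelId, fetchFn = fetch) {
  const [tokenizerJson, tokenizerConfig] = await Promise.all(
    REQUIRED_FILES.map(async (filename) => {
      const res = await fetchFn(hubFileUrl(modelId, filename));
      if (!res.ok) {
        throw new Error(`Failed to fetch ${filename}: HTTP ${res.status}`);
      }
      return res.json();
    }),
  );
  return { tokenizerJson, tokenizerConfig };
}

// Usage:
// const { tokenizerJson, tokenizerConfig } = await loadTokenizerFiles("HuggingFaceTB/SmolLM3-3B");
// const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
```

Checking `res.ok` before parsing avoids a confusing JSON parse error when the Hub returns an error page instead of the requested file.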
Tokenizers.js supports the following Hugging Face tokenizer components:

Normalizers:
- NFD
- NFKC
- NFC
- NFKD
- Lowercase
- Strip
- StripAccents
- Replace
- BERT Normalizer
- Precompiled
- Sequence

Pre-tokenizers:
- BERT
- ByteLevel
- Whitespace
- WhitespaceSplit
- Metaspace
- CharDelimiterSplit
- Split
- Punctuation
- Digits

Models:
- BPE (Byte-Pair Encoding)
- WordPiece
- Unigram
- Legacy

Post-processors:
- ByteLevel
- TemplateProcessing
- RobertaProcessing
- BertProcessing
- Sequence

Decoders:
- ByteLevel
- WordPiece
- Metaspace
- BPE
- CTC
- Replace
- Fuse
- Strip
- ByteFallback
- Sequence
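
The components above correspond to the stages of a Hugging Face tokenization pipeline (normalizer, pre-tokenizer, model, post-processor, decoder), each of which is declared under its own top-level key in `tokenizer.json`. To see which components a given model uses, one option is to read the `type` field of each stage. The sketch below assumes that layout; `describePipeline` is a hypothetical helper, not a library function, and it reports `null` for any stage that is absent or has no `type`.

```javascript
// Hypothetical helper (not part of @huggingface/tokenizers).
// Assumes the top-level stage keys of a Hugging Face tokenizer.json.
const STAGES = ["normalizer", "pre_tokenizer", "model", "post_processor", "decoder"];

function describePipeline(tokenizerJson) {
  const summary = {};
  for (const stage of STAGES) {
    // A stage may be null or missing when the tokenizer does not configure it.
    summary[stage] = tokenizerJson[stage]?.type ?? null;
  }
  return summary;
}

// Illustrative input only; real files contain many more fields per component.
const exampleTokenizerJson = {
  normalizer: null,
  pre_tokenizer: { type: "ByteLevel" },
  model: { type: "BPE" },
  post_processor: { type: "ByteLevel" },
  decoder: { type: "ByteLevel" },
};
console.log(describePipeline(exampleTokenizerJson));
```

A `Sequence` component wraps several others, so this summary reports only the outer type; inspecting the nested entries would need a recursive variant.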