A powerful and lightweight multilingual tokenizer library that provides natural language processing capabilities for multiple languages including English, Chinese, Japanese, and Korean.
## Installation

```bash
npm install gs-tokenizer
```

or with yarn:

```bash
yarn add gs-tokenizer
```
## Usage
### Quick Module
The quick module provides convenient static methods for easy integration:
```javascript
import { tokenize, tokenizeText, addCustomDictionary } from 'gs-tokenizer';

// Direct tokenization without creating an instance
const text = 'Hello world! 我爱北京天安门。';
const tokens = tokenize(text);
const words = tokenizeText(text);
console.log(words);

// Add a custom dictionary
addCustomDictionary(['人工智能', '技术'], 'tech', 10, 'zh');
```
### Custom Dictionaries
#### Load Custom Dictionary with Quick Module
```javascript
import { tokenize, addCustomDictionary } from 'gs-tokenizer';

// Load multiple custom dictionaries for different languages
addCustomDictionary(['人工智能', '机器学习'], 'tech', 10, 'zh');
addCustomDictionary(['Web3', 'Blockchain'], 'crypto', 10, 'en');
addCustomDictionary(['アーティフィシャル・インテリジェンス'], 'tech-ja', 10, 'ja');

// Tokenize with the custom dictionaries applied
const text = '人工智能和Web3是未来的重要技术。アーティフィシャル・インテリジェンスも重要です。';
const tokens = tokenize(text);
console.log(tokens.filter(token => token.src === 'tech'));
```
#### Without Built-in Lexicon
```javascript
import { MultilingualTokenizer } from 'gs-tokenizer';

// Create a tokenizer without using the built-in lexicon
const tokenizer = new MultilingualTokenizer({
  customDictionaries: {
    zh: [{ priority: 10, data: new Set(['自定义词']), name: 'custom', lang: 'zh' }]
  }
});

// Tokenize using only the custom dictionary
const text = '这是一个自定义词的示例。';
const tokens = tokenizer.tokenize(text, 'zh');
console.log(tokens);
```
### Adding and Removing Custom Words
```javascript
import { OldMultilingualTokenizer } from 'gs-tokenizer/old';

const tokenizer = new OldMultilingualTokenizer();

// Add custom words with a name, priority, and language
tokenizer.addCustomDictionary(['人工智能', '技术'], 'tech', 10, 'zh');
tokenizer.addCustomDictionary(['Python', 'JavaScript'], 'programming', 5, 'en');

const text = '我爱人工智能技术和Python编程';
const tokens = tokenizer.tokenize(text);
const words = tokenizer.tokenizeText(text);
console.log(words); // Should include '人工智能', 'Python'

// Remove a custom word
tokenizer.removeCustomWord('Python', 'en', 'programming');
```
### Core Module
```javascript
import { MultilingualTokenizer } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer();

// Tokenize text
const text = '我爱北京天安门';
const tokens = tokenizer.tokenize(text);

// Get all possible tokens (core module only)
const allTokens = tokenizer.tokenizeAll(text);
```
### Old Module
```javascript
import { OldMultilingualTokenizer } from 'gs-tokenizer/old';

const tokenizer = new OldMultilingualTokenizer();

// Tokenize text (the old module is more stable but slower)
const text = '我爱北京天安门';
const tokens = tokenizer.tokenize(text);
```
## API Reference
### MultilingualTokenizer
Main tokenizer class that handles multilingual text processing.
#### Constructor
```typescript
import { MultilingualTokenizer, TokenizerOptions } from 'gs-tokenizer';

const options: TokenizerOptions = {}; // see the options below
const tokenizer = new MultilingualTokenizer(options);
```
Options:
- `customDictionaries: Record<string, LexiconEntry[]>` - Custom dictionaries for each language, keyed by language code
- `defaultLanguage: string` - Default language code (default: `'en'`)
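For example, a tokenizer configured with both options (a minimal sketch reusing the dictionary entry shape from the usage section above):

```javascript
import { MultilingualTokenizer } from 'gs-tokenizer';

// Both constructor options: a custom Chinese dictionary plus a default language
const tokenizer = new MultilingualTokenizer({
  defaultLanguage: 'zh',
  customDictionaries: {
    zh: [{ priority: 10, data: new Set(['北京天安门']), name: 'places', lang: 'zh' }]
  }
});
```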
#### Methods
| Method | Description |
|--------|-------------|
| `tokenize(text: string): Token[]` | Tokenizes the input text and returns detailed token information |
| `tokenizeAll(text: string): Token[]` | Returns all possible tokens at each position (core module only) |
| `tokenizeText(text: string): string[]` | Tokenizes the input text and returns only the word tokens |
| `tokenizeTextAll(text: string): string[]` | Returns all possible word tokens at each position (core module only) |
| `addCustomDictionary(words: string[], name: string, priority?: number, language?: string): void` | Adds custom words to the tokenizer |
| `removeCustomWord(word: string, language?: string, lexiconName?: string): void` | Removes a custom word from the tokenizer |
| `addStage(stage: ITokenizerStage): void` | Adds a custom tokenization stage (core module only) |
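To illustrate the difference between the token-returning and string-returning variants (a sketch; the commented output is indicative, not exact):

```javascript
import { MultilingualTokenizer } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer();

// tokenize() returns full Token objects; tokenizeText() returns only the strings
const tokens = tokenizer.tokenize('Hello 北京');    // e.g. [{ txt: 'Hello', type: 'word', ... }, ...]
const words = tokenizer.tokenizeText('Hello 北京'); // e.g. ['Hello', '北京']
```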
### Factory Function
Factory function to create a new MultilingualTokenizer instance with optional configuration.
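Assuming the factory is exported as `createTokenizer` (the export name is not shown in this section, so treat it as an assumption), usage would look roughly like:

```javascript
// NOTE: `createTokenizer` is an assumed name for the factory function described above
import { createTokenizer } from 'gs-tokenizer';

const tokenizer = createTokenizer({ defaultLanguage: 'zh' });
const tokens = tokenizer.tokenize('我爱北京天安门');
```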
### Quick Use API
The quick module provides convenient static methods:
```typescript
import { Token } from 'gs-tokenizer';

// Quick Use API type definition
type QuickUseAPI = {
  // Tokenize text
  tokenize: (text: string, language?: string) => Token[];
  // Tokenize to text only
  tokenizeText: (text: string, language?: string) => string[];
  // Add a custom dictionary
  addCustomDictionary: (words: string[], name: string, priority?: number, language?: string) => void;
  // Remove a custom word
  removeCustomWord: (word: string, language?: string, lexiconName?: string) => void;
  // Set default languages for lexicon loading
  setDefaultLanguages: (languages: string[]) => void;
  // Set default types for lexicon loading
  setDefaultTypes: (types: string[]) => void;
};

// Import the quick use API
import { tokenize, tokenizeText, addCustomDictionary, removeCustomWord, setDefaultLanguages, setDefaultTypes } from 'gs-tokenizer';
```
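The two `setDefault*` helpers are not shown in the usage section. Based on the signatures above, configuring lexicon loading might look like this sketch (the `'tech'` type name is an assumed value, used only for illustration):

```javascript
import { setDefaultLanguages, setDefaultTypes, tokenize } from 'gs-tokenizer';

// Limit built-in lexicon loading to Chinese and English
setDefaultLanguages(['zh', 'en']);

// 'tech' is an assumed lexicon type name, for illustration only
setDefaultTypes(['tech']);

const tokens = tokenize('我爱人工智能');
```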
### Types
#### Token Interface
```typescript
interface Token {
  txt: string;   // Token text content
  type: 'word' | 'punctuation' | 'space' | 'other' | 'emoji' | 'date' | 'host' | 'ip' | 'number' | 'hashtag' | 'mention';
  lang?: string; // Language code
  src?: string;  // Source (e.g., custom dictionary name)
}
```
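The `type` field makes it easy to post-filter tokens, for example keeping only word tokens (a small sketch based on the interface above):

```javascript
import { tokenize } from 'gs-tokenizer';

// Drop punctuation, spaces, and other non-word tokens
const tokens = tokenize('Hello, 世界! #nlp');
const words = tokens.filter(token => token.type === 'word');
console.log(words.map(token => token.txt));
```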
#### ITokenizerStage Interface
```typescript
interface ITokenizerStage {
  order: number;
  priority: number;
  tokenize(text: string, start: number): IStageBestResult;
  all(text: string): IToken[];
}
```
#### TokenizerOptions Interface
```typescript
import { LexiconEntry } from 'gs-tokenizer';

interface TokenizerOptions {
  customDictionaries?: Record<string, LexiconEntry[]>;
  granularity?: 'word' | 'grapheme' | 'sentence';
  defaultLanguage?: string;
}
```
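The `granularity` option is not covered in the usage section. A sketch of switching from word to sentence segmentation (behavior assumed from the option name):

```javascript
import { MultilingualTokenizer } from 'gs-tokenizer';

// Assumed: 'sentence' granularity yields one token per sentence
const tokenizer = new MultilingualTokenizer({ granularity: 'sentence' });
const tokens = tokenizer.tokenize('Hello world! 我爱北京天安门。');
```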
## Browser Compatibility
- Chrome/Edge: 87+
- Firefox: 86+
- Safari: 14.1+
Note: Uses `Intl.Segmenter` for CJK languages, which requires modern browser support.
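If you target older browsers, you can feature-detect `Intl.Segmenter` before tokenizing CJK text (plain JavaScript, independent of gs-tokenizer):

```javascript
// Check for Intl.Segmenter support before relying on CJK tokenization
if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
  // Safe to tokenize CJK text here
} else {
  // Fall back to a polyfill or server-side tokenization
}
```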
## Development
### Build
```bash
npm run build
```
### Testing
```bash
npm run test          # Run all tests
npm run test:base     # Run base tests
npm run test:english  # Run English-specific tests
npm run test:cjk      # Run CJK-specific tests
npm run test:mixed    # Run mixed language tests
```