A powerful and lightweight multilingual tokenizer library that provides natural language processing capabilities for multiple languages including English, Chinese, Japanese, and Korean.
## Installation

```bash
npm install gs-tokenizer
```

or with yarn:

```bash
yarn add gs-tokenizer
```
## Usage
### Quick Module
The quick module provides convenient static methods for easy integration:
```javascript
import { tokenize, tokenizeText, addCustomDictionary } from 'gs-tokenizer';

// Direct tokenization without creating an instance
const text = 'Hello world! 我爱北京天安门。';
const tokens = tokenize(text);
const words = tokenizeText(text);
console.log(words);

// Add a custom dictionary
addCustomDictionary(['人工智能', '技术'], 'tech', 10, 'zh');
```
### Custom Dictionaries
#### Load Custom Dictionary with Quick Module
```javascript
import { tokenize, addCustomDictionary } from 'gs-tokenizer';

// Load multiple custom dictionaries for different languages
addCustomDictionary(['人工智能', '机器学习'], 'tech', 10, 'zh');
addCustomDictionary(['Web3', 'Blockchain'], 'crypto', 10, 'en');
addCustomDictionary(['アーティフィシャル・インテリジェンス'], 'tech-ja', 10, 'ja');

// Tokenize with the custom dictionaries applied
const text = '人工智能和Web3是未来的重要技术。アーティフィシャル・インテリジェンスも重要です。';
const tokens = tokenize(text);
console.log(tokens.filter(token => token.src === 'tech'));
```
#### Without Built-in Lexicon
```javascript
import { MultilingualTokenizer } from 'gs-tokenizer';

// Create a tokenizer without using the built-in lexicon
const tokenizer = new MultilingualTokenizer({
  customDictionaries: {
    zh: [{ priority: 10, data: new Set(['自定义词']), name: 'custom', lang: 'zh' }]
  }
});

// Tokenize using only the custom dictionary
const text = '这是一个自定义词的示例。';
const tokens = tokenizer.tokenize(text, 'zh');
console.log(tokens);
```
### Adding and Removing Custom Words
```javascript
import { OldMultilingualTokenizer } from 'gs-tokenizer/old';

const tokenizer = new OldMultilingualTokenizer();

// Add custom words with a name, priority, and language
tokenizer.addCustomDictionary(['人工智能', '技术'], 'tech', 10, 'zh');
tokenizer.addCustomDictionary(['Python', 'JavaScript'], 'programming', 5, 'en');

const text = '我爱人工智能技术和Python编程';
const tokens = tokenizer.tokenize(text);
const words = tokenizer.tokenizeText(text);
console.log(words); // Should include '人工智能', 'Python'

// Remove a custom word
tokenizer.removeCustomWord('Python', 'en', 'programming');
```
### Core Module
```javascript
import { MultilingualTokenizer } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer();

// Tokenize text
const text = '我爱北京天安门';
const tokens = tokenizer.tokenize(text);

// Get all possible tokens (core module only)
const allTokens = tokenizer.tokenizeAll(text);
```
### Old Module
```javascript
import { OldMultilingualTokenizer } from 'gs-tokenizer/old';

const tokenizer = new OldMultilingualTokenizer();

// Tokenize text (the old module is more stable but slower)
const text = '我爱北京天安门';
const tokens = tokenizer.tokenize(text);
```
## API Reference
### MultilingualTokenizer
Main tokenizer class that handles multilingual text processing.
#### Constructor
```typescript
import { MultilingualTokenizer, TokenizerOptions } from 'gs-tokenizer';

const options: TokenizerOptions = {}; // see the options below
const tokenizer = new MultilingualTokenizer(options);
```
Options:
- `customDictionaries: Record<string, LexiconEntry[]>` - Custom dictionaries for each language, keyed by language code
- `defaultLanguage: string` - Default language code (default: `'en'`)
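For example, a tokenizer configured with both options (a minimal sketch reusing the dictionary entry shape from the usage section above):

```javascript
import { MultilingualTokenizer } from 'gs-tokenizer';

// Both constructor options: a custom Chinese dictionary plus a default language
const tokenizer = new MultilingualTokenizer({
  defaultLanguage: 'zh',
  customDictionaries: {
    zh: [{ priority: 10, data: new Set(['北京天安门']), name: 'places', lang: 'zh' }]
  }
});
```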
#### Methods
| Method | Description |
|--------|-------------|
| `tokenize(text: string): Token[]` | Tokenizes the input text and returns detailed token information |
| `tokenizeAll(text: string): Token[]` | Returns all possible tokens at each position (core module only) |
| `tokenizeText(text: string): string[]` | Tokenizes the input text and returns only the word tokens |
| `tokenizeTextAll(text: string): string[]` | Returns all possible word tokens at each position (core module only) |
| `addCustomDictionary(words: string[], name: string, priority?: number, language?: string): void` | Adds custom words to the tokenizer |
| `removeCustomWord(word: string, language?: string, lexiconName?: string): void` | Removes a custom word from the tokenizer |
| `addStage(stage: ITokenizerStage): void` | Adds a custom tokenization stage (core module only) |
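To illustrate the difference between the token-returning and string-returning variants (a sketch; the commented output is indicative, not exact):

```javascript
import { MultilingualTokenizer } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer();

// tokenize() returns full Token objects; tokenizeText() returns only the strings
const tokens = tokenizer.tokenize('Hello 北京');    // e.g. [{ txt: 'Hello', type: 'word', ... }, ...]
const words = tokenizer.tokenizeText('Hello 北京'); // e.g. ['Hello', '北京']
```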
### Factory Function
Factory function to create a new MultilingualTokenizer instance with optional configuration.
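Assuming the factory is exported as `createTokenizer` (the export name is not shown in this section, so treat it as an assumption), usage would look roughly like:

```javascript
// NOTE: `createTokenizer` is an assumed name for the factory function described above
import { createTokenizer } from 'gs-tokenizer';

const tokenizer = createTokenizer({ defaultLanguage: 'zh' });
const tokens = tokenizer.tokenize('我爱北京天安门');
```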
### Quick Use API
The quick module provides convenient static methods:
```typescript
import { Token } from 'gs-tokenizer';

// Quick Use API type definition
type QuickUseAPI = {
  // Tokenize text
  tokenize: (text: string, language?: string) => Token[];
  // Tokenize to text only
  tokenizeText: (text: string, language?: string) => string[];
  // Add a custom dictionary
  addCustomDictionary: (words: string[], name: string, priority?: number, language?: string) => void;
  // Remove a custom word
  removeCustomWord: (word: string, language?: string, lexiconName?: string) => void;
  // Set default languages for lexicon loading
  setDefaultLanguages: (languages: string[]) => void;
  // Set default types for lexicon loading
  setDefaultTypes: (types: string[]) => void;
};

// Import the quick use API
import { tokenize, tokenizeText, addCustomDictionary, removeCustomWord, setDefaultLanguages, setDefaultTypes } from 'gs-tokenizer';
```
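The two `setDefault*` helpers are not shown in the usage section. Based on the signatures above, configuring lexicon loading might look like this sketch (the `'tech'` type name is an assumed value, used only for illustration):

```javascript
import { setDefaultLanguages, setDefaultTypes, tokenize } from 'gs-tokenizer';

// Limit built-in lexicon loading to Chinese and English
setDefaultLanguages(['zh', 'en']);

// 'tech' is an assumed lexicon type name, for illustration only
setDefaultTypes(['tech']);

const tokens = tokenize('我爱人工智能');
```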
### Types
#### Token Interface
```typescript
interface Token {
  txt: string;   // Token text content
  type: 'word' | 'punctuation' | 'space' | 'other' | 'emoji' | 'date' | 'host' | 'ip' | 'number' | 'hashtag' | 'mention';
  lang?: string; // Language code
  src?: string;  // Source (e.g., custom dictionary name)
}
```
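The `type` field makes it easy to post-filter tokens, for example keeping only word tokens (a small sketch based on the interface above):

```javascript
import { tokenize } from 'gs-tokenizer';

// Drop punctuation, spaces, and other non-word tokens
const tokens = tokenize('Hello, 世界! #nlp');
const words = tokens.filter(token => token.type === 'word');
console.log(words.map(token => token.txt));
```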
#### ITokenizerStage Interface
```typescript
interface ITokenizerStage {
  order: number;
  priority: number;
  tokenize(text: string, start: number): IStageBestResult;
  all(text: string): IToken[];
}
```
#### TokenizerOptions Interface
```typescript
import { LexiconEntry } from 'gs-tokenizer';

interface TokenizerOptions {
  customDictionaries?: Record<string, LexiconEntry[]>;
  granularity?: 'word' | 'grapheme' | 'sentence';
  defaultLanguage?: string;
}
```
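The `granularity` option is not covered in the usage section. A sketch of switching from word to sentence segmentation (behavior assumed from the option name):

```javascript
import { MultilingualTokenizer } from 'gs-tokenizer';

// Assumed: 'sentence' granularity yields one token per sentence
const tokenizer = new MultilingualTokenizer({ granularity: 'sentence' });
const tokens = tokenizer.tokenize('Hello world! 我爱北京天安门。');
```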
## Browser Compatibility
- Chrome/Edge: 87+
- Firefox: 86+
- Safari: 14.1+
Note: Uses `Intl.Segmenter` for CJK languages, which requires modern browser support.
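If you target older browsers, you can feature-detect `Intl.Segmenter` before tokenizing CJK text (plain JavaScript, independent of gs-tokenizer):

```javascript
// Check for Intl.Segmenter support before relying on CJK tokenization
if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
  // Safe to tokenize CJK text here
} else {
  // Fall back to a polyfill or server-side tokenization
}
```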
## Development
### Build
```bash
npm run build
```
### Testing
```bash
npm run test          # Run all tests
npm run test:base     # Run base tests
npm run test:english  # Run English-specific tests
npm run test:cjk      # Run CJK-specific tests
npm run test:mixed    # Run mixed language tests
```