Token Estimator

A fast, Unicode-aware token segmentation and counting library for JavaScript and TypeScript, tuned so that its results align closely with the tokenization of Gemini and GPT models.

Features

- Unicode-aware: Handles international text, emoji sequences, and various scripts
- Robust tokenization: Uses `Intl.Segmenter` when available for best results, with a regex fallback
- Multiple token categories: word, number, whitespace, punctuation, emoji, and other
- Token counting: Count tokens with configurable options
- Text truncation: Truncate text by token count without splitting grapheme clusters
- TypeScript support: Full TypeScript definitions included
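The "Intl.Segmenter when available, with regex fallback" strategy listed above can be sketched roughly as follows. This is an illustration only, not the library's actual code: `splitIntoSegments` and the fallback pattern are hypothetical names and choices made for this example.

```typescript
// Hypothetical helper (not part of token-estimator): split text into raw
// segments, preferring Intl.Segmenter and falling back to a regex.
function splitIntoSegments(text: string, locale: string = "en"): string[] {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    // Collect the text of every segment the segmenter reports, in order.
    return Array.from(segmenter.segment(text), (part) => part.segment);
  }
  // Assumed fallback: runs of letters/digits/marks, runs of whitespace,
  // or any other single code point.
  const fallbackPattern = /[\p{L}\p{N}\p{M}]+|\s+|[\s\S]/gu;
  return text.match(fallbackPattern) ?? [];
}

const segments = splitIntoSegments("Hello world! 😊");
console.log(segments);
```

Both paths are lossless: joining the returned segments reproduces the input string, which makes it straightforward to derive start and end positions afterwards.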

Installation

```bash
npm install token-estimator
```

Usage

```typescript
import { segmentIntoTokens, countTokens, truncateByTokenCount } from 'token-estimator'

// Segment text into tokens
const tokens = segmentIntoTokens("Hello world! 😊")
// Returns tokens with text, positions, and categories

// Count tokens
const tokenCount = countTokens("Hello world! 😊")
// Returns: 4

// Truncate by token count
const { truncatedText, truncatedTokenCount } = truncateByTokenCount(
  "Hello world! 😊 How are you?",
  3
)
// Returns: { truncatedText: "Hello world! 😊", truncatedTokenCount: 3 }
```

API

`segmentIntoTokens(sourceText, options)`

Segments text into tokens with detailed information.

Parameters:
- `sourceText`: The text to segment
- `options.keepWhitespace`: Include whitespace tokens (default: `true`)
- `options.requestedLocale`: Locale hint for segmentation (default: `'en'`)
- `options.maxTokensLimit`: Safety limit (default: `100000`)

Returns: Array of `Token` objects with properties:
- `text`: The token text
- `startIndex`: Start position in original string
- `endIndex`: End position in original string
- `category`: Token category (`'word'`, `'number'`, `'whitespace'`, `'punctuation'`, `'emoji'`, `'other'`)
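To make the token shape concrete, here is a minimal sketch of a type with the four documented properties, plus a toy classifier for the six categories. The names `TokenCategory`, `classifySegment`, and `toTokens` are hypothetical, the classification rules are guesses for illustration, and the sketch assumes `endIndex` is exclusive and measured in UTF-16 code units, which the description above does not specify.

```typescript
type TokenCategory =
  | "word" | "number" | "whitespace" | "punctuation" | "emoji" | "other";

interface Token {
  text: string;
  startIndex: number; // assumed: UTF-16 offset of the first code unit
  endIndex: number;   // assumed: exclusive end offset
  category: TokenCategory;
}

// Hypothetical rules; the real library may classify differently.
function classifySegment(segment: string): TokenCategory {
  if (/^\s+$/u.test(segment)) return "whitespace";
  if (/\p{Extended_Pictographic}/u.test(segment)) return "emoji";
  if (/^\p{N}+$/u.test(segment)) return "number";
  // Letters, marks, and digits, optionally joined by apostrophes ("don't").
  if (/^[\p{L}\p{N}\p{M}]+(?:['’][\p{L}\p{M}]+)*$/u.test(segment)) return "word";
  if (/^\p{P}+$/u.test(segment)) return "punctuation";
  return "other";
}

// Attach positions by walking the segments in order.
function toTokens(segments: string[]): Token[] {
  let offset = 0;
  return segments.map((text) => {
    const token: Token = {
      text,
      startIndex: offset,
      endIndex: offset + text.length,
      category: classifySegment(text),
    };
    offset += text.length;
    return token;
  });
}

console.log(toTokens(["Hello", " ", "world", "!", " ", "😊"]));
```

Under these assumptions the emoji token spans two code units, so positions can be used directly with `String.prototype.slice`.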

`countTokens(sourceText, options)`

Counts tokens in the source text.

Parameters: Same as segmentIntoTokens

Returns: Number of tokens
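The effect of the counting options can be illustrated with a simplified, self-contained counter. This is a sketch under assumptions, not the library's algorithm: `countSegments` and its regex are hypothetical, and treating `maxTokensLimit` as a point at which counting simply stops is a guess, since the behavior at the limit is not documented here.

```typescript
interface CountOptions {
  keepWhitespace?: boolean;
  maxTokensLimit?: number;
}

// Hypothetical counter: one segment per run of letters/digits, per run of
// whitespace, or per remaining single code point.
function countSegments(text: string, options: CountOptions = {}): number {
  const { keepWhitespace = true, maxTokensLimit = 100000 } = options;
  const pattern = /[\p{L}\p{N}\p{M}]+|\s+|[\s\S]/gu;
  let count = 0;
  for (const match of text.matchAll(pattern)) {
    if (count >= maxTokensLimit) break; // assumed: stop once the limit is hit
    if (!keepWhitespace && /^\s+$/.test(match[0])) continue;
    count += 1;
  }
  return count;
}

console.log(countSegments("Hello world! 😊"));                            // 6
console.log(countSegments("Hello world! 😊", { keepWhitespace: false })); // 4
```

With this sketch's rules the sample string has six segments when whitespace is counted and four when it is skipped, which shows how much the whitespace option can change a count.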

`truncateByTokenCount(sourceText, tokenLimit, options)`

Truncates text to at most the specified number of tokens, without splitting grapheme clusters.

Parameters:
- `sourceText`: Text to truncate
- `tokenLimit`: Maximum number of tokens
- `options`: Same as `segmentIntoTokens`

Returns: Object with `truncatedText` and `truncatedTokenCount`
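One way to truncate on token boundaries is to append whole segments until the limit is reached, so the cut never lands inside a segment. The sketch below is an illustration with hypothetical names (`truncateBySegmentCount`, `TruncationResult`). It counts every segment, whitespace included, which may differ from the library's counting, and it assumes that word-granularity boundaries from `Intl.Segmenter` do not fall inside a grapheme cluster, which holds in practice for common emoji sequences.

```typescript
interface TruncationResult {
  truncatedText: string;
  truncatedTokenCount: number;
}

// Hypothetical truncation: keep whole segments only, never part of one.
function truncateBySegmentCount(
  text: string,
  tokenLimit: number,
  locale: string = "en"
): TruncationResult {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let truncatedText = "";
  let truncatedTokenCount = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (truncatedTokenCount >= tokenLimit) break;
    truncatedText += segment;
    truncatedTokenCount += 1;
  }
  return { truncatedText, truncatedTokenCount };
}

console.log(truncateBySegmentCount("Hello world! 😊 How are you?", 3));
```

Because the result is built from complete segments, it is always a prefix of the input, and a text with fewer segments than the limit is returned unchanged.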

License

MIT
