A Unicode-aware token segmentation and counting library
A very fast Unicode-aware token segmentation and counting library for JavaScript/TypeScript, optimized to align tokenization results with Gemini and GPT models.
- Unicode-aware: Handles international text, emoji sequences, and various scripts
- Robust tokenization: Uses Intl.Segmenter when available for best results, with regex fallback
- Multiple token categories: word, number, whitespace, punctuation, emoji, and other
- Token counting: Count tokens with configurable options
- Text truncation: Truncate text by token count without splitting grapheme clusters
- TypeScript support: Full TypeScript definitions included
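
The "Intl.Segmenter when available, with regex fallback" strategy listed above can be illustrated with a minimal sketch. This is not the library's actual implementation: `splitIntoSegments` is a hypothetical helper, and the fallback pattern is an assumption chosen only to show the idea.

```typescript
// Hypothetical helper for illustration; not part of token-estimator's API.
function splitIntoSegments(text: string, locale: string = "en"): string[] {
  // Intl.Segmenter is missing on older runtimes, so probe for it at call time.
  const SegmenterCtor = (Intl as any).Segmenter;
  if (typeof SegmenterCtor === "function") {
    const segmenter = new SegmenterCtor(locale, { granularity: "word" });
    // Each item yielded by segment() carries the matched substring in `.segment`.
    return Array.from(segmenter.segment(text), (part: any) => part.segment as string);
  }
  // Assumed fallback: runs of letters/digits/marks, runs of whitespace,
  // or any other single code point (the `u` flag keeps surrogate pairs intact).
  return text.match(/[\p{L}\p{N}\p{M}]+|\s+|./gu) ?? [];
}

const parts = splitIntoSegments("Hello world!");
console.log(parts);
```

Either path returns segments that concatenate back to the original string. The fallback works on code points, so it cannot keep multi-code-point emoji sequences together the way a real segmenter can.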
```bash
npm install token-estimator
```

```typescript
import { segmentIntoTokens, countTokens, truncateByTokenCount } from 'token-estimator'

// Segment text into tokens
const tokens = segmentIntoTokens("Hello world! 😊")
// Returns tokens with text, positions, and categories

// Count tokens
const tokenCount = countTokens("Hello world! 😊")
// Returns: 4

// Truncate by token count
const { truncatedText, truncatedTokenCount } = truncateByTokenCount(
  "Hello world! 😊 How are you?",
  3
)
// Returns: { truncatedText: "Hello world! 😊", truncatedTokenCount: 3 }
```
segmentIntoTokens(sourceText, options): Segments text into tokens with detailed information.
Parameters:
- sourceText: The text to segment
- options.keepWhitespace: Include whitespace tokens (default: true)
- options.requestedLocale: Locale hint for segmentation (default: 'en')
- options.maxTokensLimit: Safety limit (default: 100000)
Returns: Array of Token objects with properties:
- text: The token text
- startIndex: Start position in original string
- endIndex: End position in original string
- category: Token category ('word', 'number', 'whitespace', 'punctuation', 'emoji', 'other')
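
To make the Token shape concrete, here is a minimal sketch of how segments could be turned into objects with the four documented properties. The names `TokenSketch`, `classifySegment`, and `toTokens` are hypothetical, the classification rules are assumptions, and `endIndex` is assumed to be exclusive and measured in UTF-16 code units; the library may define any of these differently.

```typescript
type TokenCategory = "word" | "number" | "whitespace" | "punctuation" | "emoji" | "other";

// Mirrors the documented Token properties; the exact library type may differ.
interface TokenSketch {
  text: string;
  startIndex: number;
  endIndex: number; // assumption: exclusive end offset in UTF-16 code units
  category: TokenCategory;
}

// Hypothetical classifier; these rules are illustrative assumptions.
function classifySegment(segment: string): TokenCategory {
  if (/^\s+$/u.test(segment)) return "whitespace";
  if (/\p{Extended_Pictographic}/u.test(segment)) return "emoji";
  if (/^\p{N}+$/u.test(segment)) return "number";
  if (/^[\p{L}\p{M}\p{N}_]+$/u.test(segment)) return "word";
  if (/^[\p{P}\p{S}]+$/u.test(segment)) return "punctuation";
  return "other";
}

// Attaches running offsets and a category to each already-split segment.
function toTokens(segments: string[]): TokenSketch[] {
  let offset = 0;
  return segments.map((text) => {
    const token: TokenSketch = {
      text,
      startIndex: offset,
      endIndex: offset + text.length,
      category: classifySegment(text),
    };
    offset += text.length;
    return token;
  });
}

console.log(toTokens(["Hello", " ", "😊"]));
```

Because offsets here count UTF-16 code units, the emoji in the example spans two index positions even though it is a single visible character.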
countTokens(sourceText, options): Counts tokens in the source text.
Parameters: Same as segmentIntoTokens
Returns: Number of tokens
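
The effect of the keepWhitespace and maxTokensLimit options on a count can be sketched as follows. `countSegmentsSketch` is a hypothetical stand-in built on a simple regex, not the library's `countTokens`; in particular, treating the safety limit as "stop counting once reached" is an assumption, since the behaviour at the limit is not documented here.

```typescript
interface CountOptionsSketch {
  keepWhitespace?: boolean;
  maxTokensLimit?: number;
}

// Hypothetical counter for illustration; the real tokenizer's rules may differ.
function countSegmentsSketch(sourceText: string, options: CountOptionsSketch = {}): number {
  const { keepWhitespace = true, maxTokensLimit = 100000 } = options;
  // Word-like runs, whitespace runs, or any other single code point.
  const segments = sourceText.match(/[\p{L}\p{N}\p{M}]+|\s+|./gu) ?? [];
  let count = 0;
  for (const segment of segments) {
    if (!keepWhitespace && /^\s+$/u.test(segment)) continue; // skip whitespace tokens
    if (count >= maxTokensLimit) break; // assumption: the limit caps the count
    count += 1;
  }
  return count;
}

console.log(countSegmentsSketch("Hello world! 😊"));
console.log(countSegmentsSketch("Hello world! 😊", { keepWhitespace: false }));
```

Under this sketch's rules the sample string yields 6 segments with whitespace and 4 without. These numbers describe the sketch only; the library's own counts depend on its actual segmentation rules.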
truncateByTokenCount(sourceText, tokenLimit, options): Truncates text to the specified token count.
Parameters:
- sourceText: Text to truncate
- tokenLimit: Maximum number of tokens
- options: Same as segmentIntoTokens
Returns: Object with truncatedText and truncatedTokenCount
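
One way to truncate by token count without cutting through a grapheme cluster is to build tokens out of whole clusters and stop at a token boundary. The sketch below illustrates that idea under assumed grouping rules. `graphemeClusters`, `clusterKind`, and `truncateBySegmentCount` are hypothetical names, not the library's code, and the result can differ from what `truncateByTokenCount` returns.

```typescript
interface TruncationSketch {
  truncatedText: string;
  truncatedTokenCount: number;
}

// Splits text into grapheme clusters, falling back to code points when
// Intl.Segmenter is unavailable (the fallback may split multi-code-point clusters).
function graphemeClusters(text: string): string[] {
  const SegmenterCtor = (Intl as any).Segmenter;
  if (typeof SegmenterCtor === "function") {
    const segmenter = new SegmenterCtor(undefined, { granularity: "grapheme" });
    return Array.from(segmenter.segment(text), (part: any) => part.segment as string);
  }
  return Array.from(text);
}

// Assumed grouping: word-like clusters merge, whitespace merges, anything else stands alone.
function clusterKind(cluster: string): "word" | "space" | "single" {
  if (/^\s+$/u.test(cluster)) return "space";
  if (/^[\p{L}\p{N}\p{M}]+$/u.test(cluster)) return "word";
  return "single";
}

function truncateBySegmentCount(sourceText: string, tokenLimit: number): TruncationSketch {
  let truncatedText = "";
  let truncatedTokenCount = 0;
  let previousKind: string | null = null;
  for (const cluster of graphemeClusters(sourceText)) {
    const kind = clusterKind(cluster);
    const startsNewToken = kind === "single" || kind !== previousKind;
    if (startsNewToken) {
      if (truncatedTokenCount >= tokenLimit) break; // stop only at a token boundary
      truncatedTokenCount += 1;
    }
    truncatedText += cluster; // clusters are appended whole, never sliced
    previousKind = kind;
  }
  return { truncatedText, truncatedTokenCount };
}

console.log(truncateBySegmentCount("Hello world! 😊", 3));
```

Because text is only ever appended one whole cluster at a time, the cut can never land inside an emoji sequence or between a base letter and its combining mark, as long as the runtime provides Intl.Segmenter.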
License: MIT