Node.js bindings for BlingFire tokenization library
Node.js bindings for BlingFire, a tokenization library from Microsoft.
BlingFire provides text tokenization and sentence splitting with high performance and no external dependencies.
- Native C++ implementation with minimal overhead
- Production-tested tokenization and sentence splitting
- Full TypeScript support with type definitions
- No runtime dependencies beyond Node.js
- Includes precompiled binaries for Linux (arm64, amd64) and macOS (arm64, amd64)
``bash`
npm install blingfire
This package includes precompiled binaries for:
- Linux: arm64, amd64
- macOS (darwin): arm64, amd64
For these platforms, no build tools are required. For other platforms, you will need to build from source.
If precompiled binaries are not available for your platform, you will need:
- Node.js 16.x or higher
- Python (for node-gyp)
- C++ compiler:
- macOS: Xcode Command Line Tools (xcode-select --install)apt-get install build-essential
- Linux: GCC or Clang ()
- Windows: Visual Studio Build Tools
During source builds, the package will:
1. Copy BlingFire source files to the vendor directory
2. Compile the native C++ addon
3. Build TypeScript files
`typescript
import { textToSentences, textToWords } from 'blingfire';
// Sentence splitting
const text = 'Hello world. This is a test. How are you?';
const sentences = textToSentences(text);
console.log(sentences);
// Output: "Hello world.\nThis is a test.\nHow are you?"
// Get sentences as array
const sentenceArray = sentences.split('\n');
console.log(sentenceArray);
// Output: ["Hello world.", "This is a test.", "How are you?"]
// Word tokenization
const words = textToWords(text);
console.log(words);
// Output: "Hello world . This is a test . How are you ?"
// Get words as array
const wordArray = words.split(' ');
console.log(wordArray);
// Output: ["Hello", "world", ".", "This", "is", "a", "test", ".", "How", "are", "you", "?"]
`
`javascript
const { textToSentences, textToWords } = require('blingfire');
const text = 'Dr. Smith went to the U.S. He was happy.';
const sentences = textToSentences(text);
console.log(sentences.split('\n'));
`
#### Processing Documents
`typescript
import { textToSentences, textToWords } from 'blingfire';
// Process a document
const document =
Natural language processing (NLP) is a subfield of AI.
It focuses on the interaction between computers and humans.
Dr. Jones believes it's revolutionary!;
// Split into sentences
const sentences = textToSentences(document).split('\n');
// Tokenize each sentence
sentences.forEach((sentence, i) => {
const words = textToWords(sentence).split(' ');
console.log(Sentence ${i + 1} has ${words.length} tokens);`
});
#### Unicode and Multilingual Text
`typescript
import { textToSentences, textToWords } from 'blingfire';
// Works with Unicode
const multiLang = 'Hello world. 你好世界。Привет мир.';
const sentences = textToSentences(multiLang);
console.log(sentences.split('\n'));
// Handles emojis and special characters
const withEmoji = 'I love coding! 🚀 It makes me happy. 😊';
const words = textToWords(withEmoji);
console.log(words);
`
Splits text into sentences using BlingFire's sentence boundary detection.
Parameters:
- text (string): The input text to split into sentences
Returns:
- (string): Sentences separated by newline characters (\n)
Throws:
- TypeError: If input is not a stringError
- : If BlingFire processing fails
Example:
`typescript`
const result = textToSentences('First. Second. Third.');
// Returns: "First.\nSecond.\nThird."
Tokenizes text into words using BlingFire's word boundary detection.
Parameters:
- text (string): The input text to tokenize
Returns:
- (string): Words separated by space characters
Throws:
- TypeError: If input is not a stringError
- : If BlingFire processing fails
Example:
`typescript`
const result = textToWords('Hello, world!');
// Returns: "Hello , world !"
This package wraps the BlingFire C++ library using Node.js N-API. During installation:
1. Vendor Script: Copies BlingFire source files from a git submodule to a vendor/ directory
2. Native Compilation: Compiles both BlingFire and the N-API binding using node-gyp
3. TypeScript Build: Compiles TypeScript wrapper to JavaScript
The package includes all BlingFire source code, so no external dependencies are required at runtime.
`bashClone with submodules
git clone --recursive
cd blingfire
$3
`bash
Run tests once
npm testWatch mode
npm run test:watch
`$3
`bash
Clean build artifacts
npm run cleanRebuild everything
npm install
npm run build:typescript
npm run build:native
`License
The contents of this repository are licensed under the MIT License.
See the
LICENSE` file for more information.- BlingFire by Microsoft
- This package wraps the C++ library with Node.js N-API bindings
Contributions are welcome! Please feel free to submit a Pull Request.