Implements the Unicode default word boundary specification (UAX #29 §4.1)
npm install unicode-default-word-boundaryUnicode Default Word Boundary
=============================


Implements the [Unicode UAX #29 §4.1 default word boundary
specification][defaultwb], for finding word breaks in **multilingual
text**.
Use this to split words in text! Using UAX #29 is a lot smarter than the\b word boundary in JavaScript's regular expressions! Note that
character classes like \b, \w, \d [only work on ASCII
characters][mdnregexp].
Usage
-----
Import the module and use the split() function:
``js
const split = require('unicode-default-word-boundary').split;
console.log(split(The quick (“brown”) fox can’t jump 32.3 feet, right?));`
Output:
[ 'The', 'quick', '(', '“', 'brown', '”', ')', 'fox', 'can’t', 'jump', '32.3', 'feet', ',', 'right', '?' ]
But that's not all! Try it with non-English text, like Russian:
`javascriptВ чащах юга жил бы цитрус? Да, но фальшивый экземпляр!
split()`
[ 'В', 'чащах', 'юга', 'жил', 'бы', 'цитрус', '?', 'Да', ',', 'но', 'фальшивый', 'экземпляр', '!' ]
...Hebrew:
`javascriptאיך בלש תפס גמד רוצח עז קטנה?
split();`
[ 'איך', 'בלש', 'תפס', 'גמד', 'רוצח', 'עז', 'קטנה', '?' ]
...[nêhiyawêwin][]:
`javascriptᑕᐻ ᒥᔪ ᑭᓯᑲᐤ ᐊᓄᐦᐨ᙮
split();`
[ 'ᑕᐻ', 'ᒥᔪ ᑭᓯᑲᐤ', 'ᐊᓄᐦᐨ', '᙮' ]
...and many more!
More advanced use cases will want to use the findSpans() or thefindBoundaries() function.
What doesn't work
-----------------
Languages that do not have obvious word breaks, such as Chinese,
Japanese, Thai, Lao, and Khmer. You'll need to use statistical or
dictionary-based approaches to split words in these languages.
API Documentation
-----------------
The following functions make up the primary API:
split() splits the text at word boundaries, returning an array of all
"words" from the text that contain characters other than whitespace.
See above for examples.
findSpans() is a generator that yields successive _basic spans_ from
the text. A basic span is a chunk of text that is guaranteed to
start at a word boundary and end at the next word boundary. In other
words, basic spans are _indivisible_ in that there are no word
boundaries contained within a basic span.
A basic span has the following properties:
`typescript`
interface BasicSpan {
/* Where the span starts, relative to the input text. /
start: number;
/ At what index does the next* span begin. /
end: number;
/* How many characters are in this span. /
length: number;
/* The text contained within this span. /
text: string;
}
Note that unlike, split(), findSpans() does yield spans that
contain whitespace.
#### Example
Array.from(findSpans("Hello, world🌎!"))
Will yield spans with the following properties:
`javascript`
[ { start: 0, end: 5, length: 5, text: 'Hello' },
{ start: 5, end: 6, length: 1, text: ',' },
{ start: 6, end: 7, length: 1, text: ' ' },
{ start: 7, end: 12, length: 5, text: 'world' },
{ start: 12, end: 14, length: 2, text: '🌎' },
{ start: 14, end: 15, length: 1, text: '!' } ]
N.B.: findSpans() may _not_ yield plain JavaScript objects, asfindSpans()
shown above. The objects that yield will adhere to theBasicSpan interface, however what findSpans() actually yields may
differ from simple objects.
findBoundaries() is like findSpans() except it yields the _index_ offindSpans()
each successive word boundary. Anecdotally, using this function directly
may be faster than generating spans objects with .
Contributing and Maintaining
----------------------------
When maintaining this package, you might notice something strange.
index.ts depends on ./src/gen/WordBreakProperty.ts, but this filelibexec/
does not exist! It is a generated file, created by reading Unicode
property data files, [downloaded from Unicode's website][unicodefiles].
These data files have been compressed and committed to this repository
in :
libexec/
libexec/
├── WordBreakProperty-15.1.0.txt.gz
├── compile-word-break.js
└── emoji-data-15.1.0.txt.gz
**Note that compile-word-break.js actually creates./src/gen/WordBreakProperty.ts!**
When you have _just_ cloned the repository, this file will be generated
when you run npm install:
npm install
If you want to regenerate it afterwards, you can run the build script:
npm run build
To run the benchmarks, you can run the following:
npm run benchmarks
If you want to compare the current implementation with a new
implementation, what I do is create a new working tree called opt/:
git worktree add -b «NEW-BRANCH-NAME» opt
Then, I make changes in the working tree inside opt/`, **compile
and run the tests**, then, in the main working tree, I run the
benchmarks:
cd opt/
npm install
vim # do whatever you need to do here
npm test # this also compiles the TypeScript
cd ..
npm run benchmarks
License
-------
TypeScript implementation © 2019 National Research Council Canada,
© 2024 Eddie Antonio Santos. MIT Licensed.
The algorithm comes from [UAX #29: Unicode Text Segmentation, an
integral part of the Unicode Standard, version 15.1][uax29].
[defaultwb]: https://unicode.org/reports/tr29/#Default_Word_Boundaries
[mdnregexp]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes#Types
[nêhiyawêwin]: https://en.wikipedia.org/wiki/Plains_Cree
[uax29]: https://unicode.org/reports/tr29/
[unicodefiles]: https://unicode.org/reports/tr41/tr41-24.html