cldr-segmentation
===
Text segmentation library for JavaScript.
This library provides CLDR-based text segmentation capabilities in JavaScript. Text segmentation is the process of identifying word, sentence, and other boundaries in a text. The segmentation rules are published by the Unicode consortium as part of the Common Locale Data Repository, or CLDR, and made freely available to the public.
You could just split on spaces or periods, and most of the time that'll probably work fine. However, it's not always obvious where words or sentences should start or end. Consider this sentence:
```text
I like Mrs. Murphy. She's nice.
```
Splitting only on periods will give you `["I like Mrs. ", "Murphy. ", "She's nice."]`, which probably isn't what you wanted - the period after "Mrs" doesn't indicate the end of the sentence.
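To see the problem concretely, here is a minimal sketch of the naive approach in plain JavaScript. It does not use this library at all; the variable names and the regular expression are illustrative assumptions, chosen only to reproduce the period-based split described above.

```javascript
// Naive sentence splitting: treat every period as the end of a sentence.
// This is NOT how cldr-segmentation works; it only illustrates the pitfall.
const text = "I like Mrs. Murphy. She's nice.";

// Match a run of non-period characters, the period itself,
// and any whitespace that follows it.
const naive = text.match(/[^.]+\.\s*/g);

console.log(naive);
// The abbreviation "Mrs." is wrongly treated as a sentence boundary,
// so the first sentence ends up split in two.
```

A rule-based segmenter avoids this by knowing that certain abbreviations should not end a sentence, which a simple pattern like the one above can't express without a growing list of special cases.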
In addition, other languages use different segmentation rules than English. For example, a period-based approach breaks down for Japanese, where sentences tend to end with `\u3002` (the ideographic full stop) rather than a period. The CLDR contains support for hundreds of languages, meaning you don't have to work out the rules for each language yourself when dealing with international text.
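The Japanese case can be sketched the same way. The sample sentence and variable names below are made up for illustration, and again this is plain JavaScript rather than this library's API.

```javascript
// Two short Japanese sentences, each ending with U+3002 (ideographic full stop).
const japanese = "今日は晴れです\u3002明日は雨です\u3002";

// Splitting on an ASCII period finds no boundaries at all.
const byPeriod = japanese.split(".");

// Splitting after U+3002 finds both sentences, but only because
// this language-specific rule was hard-coded by hand.
const byFullStop = japanese.match(/[^\u3002]+\u3002/g);

console.log(byPeriod.length);   // 1
console.log(byFullStop.length); // 2
```

Hard-coding one terminator per language gets unwieldy quickly, which is the motivation for relying on the CLDR's published rules instead.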
Cldr-segmentation is published as both a UMD module and an ES6 module, meaning it should work in node via require or import and the browser via a