English word and sentence tokenizer, for natural language processing.
Installation
```
npm i --save lexed
```
Usage
This lexer can be used for both:
* Splitting a string into an array of sentences.
* Splitting a string into an array of sentences, and each sentence further into an array of tokens.
Sentence level
```javascript
const Lexed = require("lexed").Lexed;
// or, with ES6 imports (use one style or the other, not both):
// import Lexed from "lexed";
const result = new Lexed('Sentence one. Sentence two! sentence 3? sentence "four." Sentence Five. Microsoft Co. released windows 10').sentenceLevel();
console.log(result);
// would give the following array:
[
    'Sentence one.',
    'Sentence two!',
    'sentence 3?',
    'sentence "four."',
    'Sentence Five.',
    'Microsoft Co. released windows 10'
];
```
Token level
```javascript
const Lexed = require("lexed").Lexed;
// or, with ES6 imports (use one style or the other, not both):
// import Lexed from "lexed";
const result = new Lexed('Microsoft Co. released windows 10').lexer().tokens;
console.log(result);
// would give the following nested array (one inner array per sentence):
[
    [
        'Microsoft',
        'Co.',
        'released',
        'windows',
        '10'
    ],
];
```
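As a rough illustration of how the token-level result can be consumed, the sketch below works on a hard-coded copy of the nested array shown above rather than calling the library. The variable names `tokenCounts` and `allTokens` are made up for this example.

```javascript
// Sample data copied from the token-level output shown above:
// one inner array per sentence, one string per token.
const tokens = [
    ["Microsoft", "Co.", "released", "windows", "10"],
];

// Number of tokens in each sentence.
const tokenCounts = tokens.map((sentence) => sentence.length);

// All tokens across all sentences as one flat list.
const allTokens = tokens.flat();

console.log(tokenCounts); // [ 5 ]
console.log(allTokens.length); // 5
```

Keeping the sentence boundaries in the outer array means per-sentence processing stays a simple `map`, while `flat()` is enough when the boundaries are not needed.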
Extensibility
Currently there is not much to extend in the lexer except the abbreviations list.
The abbreviations list is used to detect dots (.) that do not mark the end of a sentence.
Take the following sentence as an example: Mr. Andrews went to the office. If Mr weren't registered as an abbreviation, the string would be treated as two sentences:
- Mr.
- Andrews went to the office
This is obviously inaccurate. However, since Mr. _is_ registered as an abbreviation, we get a single sentence: Mr. Andrews went to the office.
To extend the abbreviations list, import the abbreviations from the _Lexed_ library and add or remove values as you wish.
```javascript
const abbreviations = require("lexed").abbreviations;
// or, with ES6 imports (use one style or the other, not both):
// import {abbreviations} from "lexed";
// push a new abbreviation
abbreviations.push("Mme"); // French abbreviation for madame
```
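The effect of an abbreviations list on sentence splitting can be illustrated with a simplified, self-contained sketch. This is not the library's implementation: `splitSentences` is a hypothetical helper written for this example, and it assumes abbreviations are stored without their trailing dot.

```javascript
// Hypothetical, simplified splitter; NOT how lexed works internally.
// A word ending in ".", "!" or "?" closes the current sentence, unless it
// ends in "." and the part before the dot is a known abbreviation.
function splitSentences(text, abbreviations) {
    const sentences = [];
    let current = [];
    for (const word of text.split(/\s+/).filter(Boolean)) {
        current.push(word);
        const endsSentence = /[.!?]$/.test(word);
        const isAbbreviation =
            word.endsWith(".") && abbreviations.includes(word.slice(0, -1));
        if (endsSentence && !isAbbreviation) {
            sentences.push(current.join(" "));
            current = [];
        }
    }
    // Keep any trailing words that had no closing punctuation.
    if (current.length > 0) sentences.push(current.join(" "));
    return sentences;
}

const text = "Mr. Andrews went to the office";
const withoutMr = splitSentences(text, []);
const withMr = splitSentences(text, ["Mr"]);

console.log(withoutMr); // [ 'Mr.', 'Andrews went to the office' ]
console.log(withMr); // [ 'Mr. Andrews went to the office' ]
```

With an empty list the dot after Mr is taken as a full stop; once Mr is in the list, the sentence stays in one piece.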
Contributing
Prerequisites
- Mocha (testing framework) installed globally
- TypeScript (language compiler) installed globally
- ts-node (TypeScript runtime) installed globally
Steps
* Clone the repository: git clone https://github.com/alexcorvi/lexed.git
* Install dependencies: cd lexed && npm install
* ...
* Test penn-treebank compliance: npm run penn
* Test the library: npm run test
* Build the library: npm run build