Tokenizing strings of text. Extracts arrays of words and, optionally, numbers, emojis, tags, usernames and email addresses from strings. For Node.js and the browser. For when you need more than just `[a-z]` regular expressions.

Install with `npm install words-n-numbers`.

Inspired by extractwords.

[![NPM version][npm-version-image]][npm-url]
[![NPM downloads][npm-downloads-image]][npm-url]

[![Build Status][build-image]][build-url]
[![JavaScript Style Guide][standardjs-image]][standardjs-url]
[![MIT License][license-image]][license-url]
From v8.0.0, the `emojis` regular expression extracts single emojis, so there are no more "words" formed by several emojis in a row. The reasoning is that each emoji is, in a sense, a word of its own. You can still make a custom regular expression that grabs several consecutive emojis as one item, for example `const customEmojis = '\\p{Emoji_Presentation}+'` (note the `+` quantifier), and then use it as your custom regex.

This means that instead of:

```javaScript
extract('A ticket to 大阪 costs ¥2000 👌😄 😢', { regex: emojis })
// ['👌😄', '😢']
```

...you will get:

```javaScript
extract('A ticket to 大阪 costs ¥2000 👌😄 😢', { regex: emojis })
// ['👌', '😄', '😢']
```

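As a rough illustration of the difference, the sketch below uses plain JavaScript `RegExp` rather than the library itself. It assumes the `gu` flags, and the variable names are made up for this example; the `+` quantifier is what groups consecutive emojis into one item.

```javascript
// Plain RegExp sketch, not the library's own code.
const text = 'A ticket to 大阪 costs ¥2000 👌😄 😢'

// One emoji per match (the behaviour described for v8.0.0 and later)
const singleEmojis = text.match(new RegExp('\\p{Emoji_Presentation}', 'gu'))

// Runs of consecutive emojis as one match (hypothetical custom pattern)
const customEmojis = '\\p{Emoji_Presentation}+'
const groupedEmojis = text.match(new RegExp(customEmojis, 'gu'))

console.log(singleEmojis)  // [ '👌', '😄', '😢' ]
console.log(groupedEmojis) // [ '👌😄', '😢' ]
```
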
```javascript
// CommonJS
const { extract, words, numbers, emojis, emojisCustom, tags, usernames, email } = require('words-n-numbers')
// extract, words, numbers, emojis, emojisCustom, tags, usernames, email available
```

```javascript
// ESM
import { extract, words, numbers, emojis, emojisCustom, tags, usernames, email } from 'words-n-numbers'
// extract, words, numbers, emojis, emojisCustom, tags, usernames, email available
```

The default regex should catch every Unicode character for every language. The default regex flags are `giu`. The `emojisCustom` regex won't work with the `u` flag (Unicode).

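To show what the `u` flag contributes, here is a small sketch in plain JavaScript. The pattern `\p{L}+` is only a stand-in chosen for this example and is not claimed to be the library's default regex.

```javascript
// '\\p{L}+' is a stand-in pattern for this sketch, not the library's default regex.
const pattern = '\\p{L}+'
const text = 'A ticket to 大阪'

// With the u flag, \p{L} means "any Unicode letter"
const withUnicode = text.match(new RegExp(pattern, 'giu'))
console.log(withUnicode) // [ 'A', 'ticket', 'to', '大阪' ]

// Without the u flag, \p{L} is not a property escape, so nothing matches here
const withoutUnicode = text.match(new RegExp(pattern, 'gi'))
console.log(withoutUnicode) // null
```
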
### Only words

```javaScript
const stringOfWords = 'A 1000000 dollars baby!'
extract(stringOfWords)
// returns ['A', 'dollars', 'baby']
```

### Only words, converted to lowercase

```javaScript
const stringOfWords = 'A 1000000 dollars baby!'
extract(stringOfWords, { toLowercase: true })
// returns ['a', 'dollars', 'baby']
```

### Words and numbers, converted to lowercase

```javaScript
const stringOfWords = 'A 1000000 dollars baby!'
extract(stringOfWords, { regex: [words, numbers], toLowercase: true })
// returns ['a', '1000000', 'dollars', 'baby']
```

### Words and emojis, converted to lowercase

```javaScript
const stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
extract(stringOfWords, { regex: [words, emojis], toLowercase: true })
// returns [ 'a', 'ticket', 'to', '大阪', 'costs', '👌', '😄', '😢' ]
```

### Numbers and emojis

```javaScript
const stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
extract(stringOfWords, { regex: [numbers, emojis], toLowercase: true })
// returns [ '2000', '👌', '😄', '😢' ]
```

### Words, numbers and emojis, converted to lowercase

```javaScript
const stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
extract(stringOfWords, { regex: [words, numbers, emojis], toLowercase: true })
// returns [ 'a', 'ticket', 'to', '大阪', 'costs', '2000', '👌', '😄', '😢' ]
```

### Predefined regex for #tags

```javaScript
const stringOfWords = 'A #49ticket to #大阪 or two#tickets costs ¥2000 👌😄😄 😢'
extract(stringOfWords, { regex: tags, toLowercase: true })
// returns [ '#49ticket', '#大阪' ]
```

### Predefined regex for @usernames

```javaScript
const stringOfWords = 'A #ticket to #大阪 costs bob@bob.com, @alice and @美林 ¥2000 👌😄😄 😢'
extract(stringOfWords, { regex: usernames, toLowercase: true })
// returns [ '@alice', '@美林' ]
```

### Predefined regex for email addresses

```javaScript
const stringOfWords = 'A #ticket to #大阪 costs bob@bob.com, alice.allison@alice123.com, some-name.nameson.nameson@domain.org and @美林 ¥2000 👌😄😄 😢'
extract(stringOfWords, { regex: email, toLowercase: true })
// returns [ 'bob@bob.com', 'alice.allison@alice123.com', 'some-name.nameson.nameson@domain.org' ]
```

### Predefined custom regex for emojis

```javaScript
const stringOfWords = 'A #ticket to #大阪 costs bob@bob.com, alice.allison@alice123.com, some-name.nameson.nameson@domain.org and @美林 ¥2000 👌😄😄 😢👩🏽‍🤝‍👨🏻 👩🏽‍🤝‍👨🏻'
extract(stringOfWords, { regex: emojisCustom, flags: 'g' })
// returns [ '👌', '😄', '😄', '😢', '👩🏽‍🤝‍👨🏻', '👩🏽‍🤝‍👨🏻' ]
```

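To see why a separate emoji pattern can be useful, the sketch below checks two cases against a plain `\p{Emoji_Presentation}` pattern using JavaScript's built-in `RegExp`. It does not use the library, and the variable names are hypothetical.

```javascript
// Plain RegExp sketch, not the library's own code.
const simpleEmojis = new RegExp('\\p{Emoji_Presentation}', 'gu')

// A ZWJ sequence: two people holding hands, with skin tone modifiers
const couple = '\u{1F469}\u{1F3FD}\u200D\u{1F91D}\u200D\u{1F468}\u{1F3FB}'
const pieces = couple.match(simpleEmojis)
console.log(pieces.length > 1) // true: the sequence is split into several matches

// A heart written as U+2764 followed by the variation selector U+FE0F
const heart = '\u2764\uFE0F'
const heartMatches = heart.match(simpleEmojis)
console.log(heartMatches) // null: U+2764 does not have the Emoji_Presentation property
```
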
### Custom regex

Some characters need to be escaped, like `\` and `'`. You escape them with a backslash (`\`).

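The escaping rule is ordinary JavaScript string escaping, which the following sketch shows with the built-in `RegExp` only. The `giu` flags are used here because they are the documented defaults; the variable names are made up for the example.

```javascript
// Plain RegExp sketch of the escaping rule, independent of the library.
// Inside a single-quoted string the apostrophe is written as \'
const pattern = '[a-z\'0-9]+'

// Equivalent: with double quotes the apostrophe needs no escape
const samePattern = "[a-z'0-9]+"

// A literal backslash in a string must be doubled: '\\d+' is the regex \d+
const digits = '\\d+'

const text = 'This happens at 5 o\'clock !!!'
const tokens = text.match(new RegExp(pattern, 'giu'))
const numbersOnly = text.match(new RegExp(digits, 'g'))

console.log(pattern === samePattern) // true
console.log(tokens)                  // [ 'This', 'happens', 'at', '5', "o'clock" ]
console.log(numbersOnly)             // [ '5' ]
```

The same kind of pattern string can be passed to `extract` as a custom regex:
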
```javaScript
const stringOfWords = 'This happens at 5 o\'clock !!!'
extract(stringOfWords, { regex: '[a-z\'0-9]+' })
// returns ['This', 'happens', 'at', '5', 'o\'clock']
```

## API

### Extract function

Returns an array of words and, optionally, numbers.

```javascript
extract(stringOfText, <options-object>)
```

### Options object

```javascript
{
  regex: 'custom or predefined regex', // defaults to words
  toLowercase: [true / false],         // defaults to false
  flags: 'gmixsuUAJD'                  // regex flags, defaults to giu - /[regexPattern]/[regexFlags]
}
```

### Order of combined regexes

You can add an array of different regexes or just a string. If you add an array, the regexes will be joined with a `|` separator, making it an OR-regex. Put `email`, `usernames` and `tags` before `words` to get the extraction right.

```javaScript
// email addresses before usernames before words can give another outcome...
extract(oldString, { regex: [email, usernames, words] })

// ...than words before usernames before email addresses
extract(oldString, { regex: [words, usernames, email] })
```

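The effect of the order can be sketched with plain `RegExp` and a few deliberately simplified patterns. The three patterns and the `joinPatterns` helper are hypothetical stand-ins written for this example; the library's predefined regexes are more thorough.

```javascript
// Simplified, hypothetical patterns; not the library's predefined regexes.
const simpleWords = '[a-z]+'
const simpleUsernames = '@[a-z0-9]+'
const simpleEmail = '[a-z0-9.-]+@[a-z0-9-]+\\.[a-z]+'

// Join an array of pattern strings into one OR-regex
const joinPatterns = (patterns) => new RegExp(patterns.join('|'), 'giu')

const text = 'Mail bob@bob.com or ping @alice'

// Email first: the whole address is consumed before words can claim parts of it
const emailFirst = text.match(joinPatterns([simpleEmail, simpleUsernames, simpleWords]))
console.log(emailFirst)
// [ 'Mail', 'bob@bob.com', 'or', 'ping', '@alice' ]

// Words first: the address is broken up into a word, a "username" and another word
const wordsFirst = text.match(joinPatterns([simpleWords, simpleUsernames, simpleEmail]))
console.log(wordsFirst)
// [ 'Mail', 'bob', '@bob', 'com', 'or', 'ping', '@alice' ]
```

In an OR-regex the alternatives are tried left to right at each position, so the most specific pattern should come first.
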
### Predefined regexes

```javaScript
words         // only words, any language <-- default
numbers       // only numbers
emojis        // only emojis
emojisCustom  // only emojis. Works with the g-flag, not giu. Based on custom emoji extractor from https://github.com/mathiasbynens/rgi-emoji-regex-pattern
tags          // #tags (any language)
usernames     // @usernames (any language)
email         // email addresses. Most valid addresses,
              // but not to be used as a validator
```

### Flags for regexes

All but one regex use the `giu` flags. The exception is `emojisCustom`, which needs only the `g` flag. `emojisCustom` was added because the standard `emojis` regex, based on `\\p{Emoji_Presentation}`, isn't able to grab all emojis. When browsers support `\p{RGI_Emoji}` under the `giu` flags, the library will be changed.

#### PRs welcome

PRs and issues are more than welcome =)

[license-image]: http://img.shields.io/badge/license-MIT-blue.svg?style=flat
[license-url]: LICENSE
[npm-url]: https://npmjs.org/package/words-n-numbers
[npm-version-image]: http://img.shields.io/npm/v/words-n-numbers.svg?style=flat
[npm-downloads-image]: http://img.shields.io/npm/dm/words-n-numbers.svg?style=flat
[build-url]: https://github.com/eklem/words-n-numbers/actions/workflows/tests.yml
[build-image]: https://github.com/eklem/words-n-numbers/actions/workflows/tests.yml/badge.svg
[standardjs-url]: https://standardjs.com
[standardjs-image]: https://img.shields.io/badge/code_style-standard-brightgreen.svg?style=flat-square