Generate a word list from various sources, including system dictionaries and SCOWL.
Includes a default list of ~86,000 English words.
Additional dictionary/wordlist paths can be configured via the options. System dictionaries exist at locations such as /usr/share/dict/words, /usr/share/dict/british-english, etc.
- Installation
- Usage
- Options
- Specify Alternate Word Lists
- Combine Lists
- The Default List
- Generate a List From SCOWL Sources
- SCOWL Aliases
- Create a Custom Word List File
- Exclusions
- Using the Custom List
- SCOWL License
```bash
npm install @neopass/wordlist
```
There are three functions available for creating word lists: wordList, wordListSync, and listBuilder. The default list is used unless other lists are configured, so no options are required.
wordList builds and returns the list asynchronously:
```javascript
const { wordList } = require('@neopass/wordlist')
wordList().then(list => console.log(list.length)) // 86748
```
wordListSync builds and returns the list synchronously:
```javascript
const { wordListSync } = require('@neopass/wordlist')
const list = wordListSync()
console.log(list.length) // 86748
```
listBuilder calls back each word asynchronously:
```javascript
const { listBuilder } = require('@neopass/wordlist')
const builder = listBuilder()
const list = []
builder(word => list.push(word))
  .then(() => console.log(list.length)) // 86748
```
```typescript
export interface IListOptions {
  /**
   * Word list paths to search for in order. Only the first
   * one found is used. This option is ignored if 'combine'
   * is a non-empty array.
   *
   * default: [
   *  '$default',
   * ]
   */
  paths?: string[]
  /**
   * Word list paths to combine. All found files are used.
   */
  combine?: string[]
  /**
   * Mutate the list by filtering on lower-case words, converting to
   * lower case, or applying a custom mutator function.
   */
  mutator?: 'only-lower'|'to-lower'|Mutator
}
```
- paths: Allows alternate, fallback lists to be used.
- combine: Allows multiple lists to be combined into one.
- mutator: Mutates the list depending on the value provided.
  - 'only-lower': Filter out words that are not strictly composed of characters [a-z].
  - 'to-lower': Convert words to lower case.
  - Mutator: (word: string) => string|string[]|void: a custom function that receives a word and returns one or more words, or undefined. Used for custom transformation/exclusion of words in the list.
    Return values:
    - string: the returned string is added to the list.
    - string[]: all returned strings are added to the list.
    - For any other return value the word is not added.
```javascript
const assert = require('assert')
const { wordList } = require('@neopass/wordlist')

/**
 * Create a custom mutator for splitting hyphenated words
 * and converting them to lower case.
 */
function customMutator(word) {
  // Will return ['west', 'ender'] for an input of 'West-ender'.
  return word.split('-').map(part => part.toLowerCase())
}

const options = {
  paths: ['/some/list/path/words.txt'],
  mutator: customMutator,
}

// The assertions assume the list at the path above contains 'West-ender'.
wordList(options).then((list) => {
  assert(list.includes('west'))
  assert(list.includes('ender'))
})
```
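The return-value rules for a custom mutator can be illustrated with a small standalone sketch. The applyMutator and exampleMutator functions below are hypothetical helpers written for this illustration; they are not part of @neopass/wordlist and only mimic the documented behavior: a string is added, an array of strings is added element by element, and anything else drops the word.

```javascript
// Hypothetical helper, not part of @neopass/wordlist: applies a mutator
// to each word and collects the results per the documented rules.
function applyMutator(words, mutator) {
  const result = []
  for (const word of words) {
    const value = mutator(word)
    if (typeof value === 'string') {
      result.push(value) // a returned string is added as-is
    } else if (Array.isArray(value)) {
      result.push(...value) // every string in a returned array is added
    }
    // any other return value (e.g. undefined) drops the word
  }
  return result
}

// Example mutator: drop words containing digits, split hyphenated
// words, and convert everything to lower case.
function exampleMutator(word) {
  if (/\d/.test(word)) return undefined
  if (word.includes('-')) return word.split('-').map(part => part.toLowerCase())
  return word.toLowerCase()
}

const mutated = applyMutator(['West-ender', 'Apple', 'b2b'], exampleMutator)
console.log(mutated) // [ 'west', 'ender', 'apple' ]
```

Returning undefined is what makes a mutator usable for exclusion as well as transformation.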
The paths specified in options are searched in order and the first list found is used. This allows for the use of system word lists with different names and/or locations on various platforms. A common location for the system word list is /usr/share/dict/words.
```javascript
const { wordList } = require('@neopass/wordlist')
// Prefer british-english list.
const options = {
  paths: [
    '/usr/share/dict/british-english',  // if found, use this one
    '/usr/share/dict/american-english', // else if found, use this one
    '/usr/share/dict/words',            // else if found, use this one
    '$default',                         // else use this one
  ]
}
wordList(options)
  .then(list => console.log(list.length)) // 101825
```
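The "first list found wins" behavior of paths can be pictured with the following sketch. It is an illustration only: firstExisting is a hypothetical helper, the library's real lookup may differ, and alias resolution (such as $default) is not modeled. A temporary file stands in for a system dictionary so the example runs anywhere.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical helper: return the first path that exists, or undefined.
function firstExisting(paths) {
  return paths.find(candidate => fs.existsSync(candidate))
}

// A temporary file stands in for a system dictionary.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'wordlist-'))
const present = path.join(tmpDir, 'words')
fs.writeFileSync(present, 'apple\nbanana\n')

const chosen = firstExisting([
  path.join(tmpDir, 'british-english'),  // missing, so skipped
  present,                               // found, so used
  path.join(tmpDir, 'american-english'), // not reached
])
console.log(chosen === present) // true
```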
Lists can be combined into one with the combine option:
```javascript
const { wordList } = require('@neopass/wordlist')
// Combine multiple dictionaries.
const options = {
  combine: [
    // System dictionary.
    '/usr/share/dict/words', // use this one
    '$default',              // and use this one
  ]
}
wordList(options)
  .then(list => console.log(list.length)) // 335427
```
Important: Using combine with wordList/wordListSync will result in duplicates if the lists overlap. It is recommended to use combine with listBuilder to control how words are added. For example, a Set can be used to eliminate duplicates from combined lists:
```javascript
const { listBuilder } = require('@neopass/wordlist')
// Combine multiple lists.
const options = {
  combine: [
    // System dictionary.
    '/usr/share/dict/words',
    // Default list.
    '$default',
  ]
}
// Create a list builder.
const builder = listBuilder(options)
// Create a set to avoid duplicate words.
const set = new Set()
// Run the builder.
builder(word => set.add(word))
  .then(() => console.log(set.size)) // 299569
```
The default list is a ~86,000-word, PG-13, lower-case list taken from English SCOWL sources, with some other additions including slang.
To suggest additions to the default list, submit an issue. Whole lists are definitely preferred to single-word suggestions, e.g., "notable extraterrestrials in history", "insects of upper polish honduras", or "names of horses in modern literature". _Suggestions for inappropriate word removal are also welcome (curse words, coarse words/slang, racial slurs, etc.)_.
By default the list alias, $default, is included in the options. This allows wordlist to create a largish list without any additional configuration.
```typescript
export const defaultOptions: IListOptions = {
  paths: [
    '$default'
  ]
}
```
```javascript
const { wordListSync } = require('@neopass/wordlist')
/**
 * We don't need to specify a config because the $default alias
 * is part of the default configuration.
 */
const list = wordListSync()
```
The $default alias (along with other aliases) resolves to a path at run time.
SCOWL word lists are included as aliases, and can be used to generate custom lists:
```javascript
const { listBuilder } = require('@neopass/wordlist')
// Combine multiple lists from scowl.
const options = {
  combine: [
    '$english-words.10',
    '$english-words.20',
    '$english-words.35',
    '$special-hacker.50',
  ]
}
// Create a list builder.
const builder = listBuilder(options)
// We'll add the words to a set.
const set = new Set()
// Run the builder.
builder(word => set.add(word))
  .then(() => console.log(set.size)) // 49130
```
Warning: Some SCOWL sources contain words not appropriate for all audiences, including swear words, racial slurs, and words of a sexual nature. You'll most likely want to scrutinize these sources depending on your use case and intended audience.
SCOWL is primarily intended as a source for spell checkers. From the SCOWL website:
> SCOWL (Spell Checker Oriented Word Lists) and Friends is a database of information on English words useful for creating high-quality word lists suitable for use in spell checkers of most dialects of English. The database primary contains information on how common a word is, differences in spelling between the dialects if English, spelling variant information, and (basic) part-of-speech and inflection information.
Note: SCOWL sources contain some words with apostrophes 's and also unicode characters. Care should be taken to deal with these depending on your needs. For example, we can transform words to remove any trailing 's characters and then only accept words that contain the letters a-z:
```javascript
const { listBuilder } = require('@neopass/wordlist')
/**
 * Remove trailing 's from words.
 */
function transform(word) {
  if (word.endsWith("'s")) {
    return word.slice(0, -2)
  }
  return word
}
/**
 * Determine if a word should be added.
 */
function accept(word) {
  // Only accept words with characters a-z (case insensitive).
  return (/^[a-z]+$/i).test(word)
}
// Combine multiple lists from scowl.
const options = {
  combine: [
    '$english-words.10',
    '$english-words.20',
    '$english-words.35',
    '$special-hacker.50',
  ]
}
// Create a list builder.
const builder = listBuilder(options)
// Create a set to avoid duplicate words.
const set = new Set()
// Run the builder.
const _builder = builder((word) => {
  word = transform(word)
  if (accept(word)) {
    set.add(word)
  }
})
_builder.then(() => console.log(set.size)) // 38714
```
A path alias is defined for every SCOWL source list. SCOWL aliases consist of the $ character followed by the source file name. Below is a _representative sample_ of the available source aliases.
```
$american-abbreviations.70
$american-abbreviations.95
$american-proper-names.80
$american-proper-names.95
$american-upper.50
$american-upper.80
$american-upper.95
$american-words.35
$american-words.80
$australian-abbreviations.35
$australian-abbreviations.80
$australian-contractions.35
$australian-proper-names.35
$australian-proper-names.80
$australian-proper-names.95
$australian-upper.60
$australian-upper.95
$australian-words.35
$australian-words.80
$australian_variant_1-abbreviations.95
$australian_variant_1-contractions.60
$australian_variant_1-proper-names.80
$australian_variant_1-proper-names.95
$australian_variant_1-upper.80
$australian_variant_1-upper.95
$australian_variant_1-words.80
$australian_variant_1-words.95
$australian_variant_2-abbreviations.80
$australian_variant_2-abbreviations.95
$australian_variant_2-contractions.50
$australian_variant_2-contractions.70
$australian_variant_2-proper-names.95
$australian_variant_2-upper.80
$australian_variant_2-words.55
$australian_variant_2-words.95
$british-abbreviations.35
$british-abbreviations.80
$british-proper-names.80
$british-proper-names.95
$british-upper.50
$british-upper.95
$british-words.10
$british-words.20
$british-words.35
$british-words.95
$british_variant_1-abbreviations.55
$british_variant_1-contractions.35
$british_variant_1-contractions.60
$british_variant_1-upper.95
$british_variant_1-words.10
$british_variant_1-words.95
$british_variant_2-abbreviations.70
$british_variant_2-contractions.50
$british_variant_2-upper.35
$british_variant_2-upper.95
$british_variant_2-words.80
$british_variant_2-words.95
$british_z-abbreviations.80
$british_z-abbreviations.95
$british_z-proper-names.80
$british_z-proper-names.95
$british_z-upper.50
$british_z-upper.95
$british_z-words.10
$british_z-words.95
$canadian-abbreviations.55
$canadian-proper-names.80
$canadian-proper-names.95
$canadian-upper.50
$canadian-upper.95
$canadian-words.10
$canadian-words.95
$canadian_variant_1-abbreviations.55
$canadian_variant_1-contractions.35
$canadian_variant_1-proper-names.95
$canadian_variant_1-upper.35
$canadian_variant_1-upper.80
$canadian_variant_1-words.35
$canadian_variant_1-words.95
$canadian_variant_2-abbreviations.70
$canadian_variant_2-contractions.50
$canadian_variant_2-upper.35
$canadian_variant_2-upper.80
$canadian_variant_2-words.35
$canadian_variant_2-words.80
$english-abbreviations.20
$english-abbreviations.80
$english-contractions.35
$english-contractions.80
$english-contractions.95
$english-proper-names.35
$english-proper-names.80
$english-upper.35
$english-upper.80
$english-words.80
$english-words.95
$special-hacker.50
$special-roman-numerals.35
$variant_1-abbreviations.55
$variant_1-abbreviations.95
$variant_1-contractions.35
$variant_1-proper-names.80
$variant_1-proper-names.95
$variant_1-upper.35
$variant_1-upper.80
$variant_1-words.20
$variant_1-words.80
$variant_2-abbreviations.70
$variant_2-abbreviations.95
$variant_2-contractions.50
$variant_2-contractions.70
$variant_2-upper.35
$variant_2-upper.95
$variant_2-words.35
$variant_2-words.95
$variant_3-abbreviations.40
$variant_3-abbreviations.95
$variant_3-words.35
$variant_3-words.95
```
See the SCOWL Readme for a description of SCOWL sources.
A custom word list file from miscellaneous sources can be assembled with the wordlist-gen binary, or the word-gen utility in the wordlist repo.
From the @neopass/wordlist package:
```bash
npx wordlist-gen --sources
```
From the wordlist repo:
```bash
git clone git@github.com:neopass/wordlist.git
cd wordlist
```
```bash
node bin/word-gen --sources
```
First, set up a directory of book and/or word list files, for example:
```
root
+-- data
    +-- books
    |   -- modern steam engine design.txt
    |   -- how to skin a rabbit.txt
    +-- lists
    |   -- names.txt
    |   -- animals.txt
    |   -- slang.txt
    +-- scowl
    |   -- english-words.10
    |   -- english-words.20
    |   -- english-words.35
    |   -- special-hacker.50
    +-- exclusions
    |   -- patterns.txt
```
The structure doesn't really matter. The format should be utf-8 text, and can consist of one or more words per line. exclusions is optional.
```bash
npx wordlist-gen --sources data/books data/lists data/scowl --out my-words.txt
```
The --sources option can specify multiple files and/or directories.
Note: only words consisting of letters a-z are added, and they're all lower-cased.
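As a rough picture of what the note above implies, the sketch below keeps only tokens made up entirely of the letters a-z and lower-cases them. The extractWords function is hypothetical, and splitting on whitespace is an assumption made for illustration; the generator's actual tokenization is not documented here and may differ.

```javascript
// Hypothetical helper, not the generator's actual code: split a line on
// whitespace (an assumption), keep tokens consisting only of letters
// a-z (case insensitive), and convert them to lower case.
function extractWords(line) {
  return line
    .split(/\s+/)
    .filter(token => /^[a-z]+$/i.test(token))
    .map(token => token.toLowerCase())
}

console.log(extractWords('Steam Engine design 2nd ed.'))
// [ 'steam', 'engine', 'design' ]
```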
Words can be _scrubbed_ by specifying exclusions:
```bash
node bin/word-gen <...> --exclude data/exclusions
```
Much like the sources, exclusions can consist of multiple files and/or directories in the following format:
```bash
# Exclude whole words (case insensitive):
spoon
fork
Tongs
```
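Case-insensitive whole-word exclusion, as described in the format above, can be sketched in a few lines. The scrub function is a hypothetical helper for illustration and is not the generator's implementation.

```javascript
// Hypothetical helper: drop words that appear in an exclusion list,
// comparing case-insensitively.
function scrub(words, exclusions) {
  const excluded = new Set(exclusions.map(word => word.toLowerCase()))
  return words.filter(word => !excluded.has(word.toLowerCase()))
}

const kept = scrub(['spoon', 'knife', 'tongs', 'Fork'], ['spoon', 'fork', 'Tongs'])
console.log(kept) // [ 'knife' ]
```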
Use path.resolve or path.join to create an absolute path to your custom word list file:
```javascript
const path = require('path')
const { wordList } = require('@neopass/wordlist')
const options = {
  paths: [
    // Use a path relative to the location of this module.
    path.resolve(__dirname, '../my-words.txt')
  ]
}
wordList(options)
  .then(list => console.log(list.length)) // 124030
```
SCOWL License
```
Copyright 2000-2016 by Kevin Atkinson

Permission to use, copy, modify, distribute and sell these word
lists, the associated scripts, the output created from the scripts,
and its documentation for any purpose is hereby granted without fee,
provided that the above copyright notice appears in all copies and
that both that copyright notice and this permission notice appear in
supporting documentation. Kevin Atkinson makes no representations
about the suitability of this array for any purpose. It is provided
"as is" without express or implied warranty.
```