Spelling correction & Fuzzy search based on symmetric delete spelling correction algorithm
npm install symspell-exSpelling correction & Fuzzy search based on Symmetric Delete Spelling Correction algorithm (SymSpell)
> Work in progress, need more optimizations





> Changes v1.1.1
> - Tokenization support
> - Term frequency should be provided for training and terms should be unique
> - Correct function return different object (Correction object)
> - Hash table implemented in redis store instead of normal list structure
> - Enhanced testing code and coverage
> - Fixed bugs in lookup
For single term training you can use add function:
``typescript
import {SymSpellEx, MemoryStore} from 'symspell-ex';
const LANGUAGE = 'en';
// Create SymSpellEx instnce and inject store new store instance
symSpellEx = new SymSpellEx(new MemoryStore());
await symSpellEx.initialize();
// Train data
await symSpellEx.add("argument", LANGUAGE);
await symSpellEx.add("computer", LANGUAGE);
`
For multiple terms (Array) you can use train function:`typescript`
const terms = ['argument', 'computer'];
await symSpellEx.train(terms, 1, LANGUAGE);
function can be used to get multiple suggestions if available up to the maxSuggestions valueArguments:
-
input _String_ (Wrong/Invalid word we need to correct)
- language _String_ (Language to be used in search)
- maxDistance _Number_, _optional_, default = 2 (Maximum distance for suggestions)
- maxSuggestions _Number_, _optional_, default = 5 (Maximum suggestions number to return)Return:
Array Array of suggestionsExample
`typescript
await symSpellEx.search('argoments', 'en');
`
Example Suggestion Object:
`json
{
"term": "argoments",
"suggestion": "arguments",
"distance": 2,
"frequency": 155
}
`$3
correct function can be used to get the best suggestion for input word or sentence in terms of edit distance and frequencyArguments:
-
input _String_ (Wrong/Invalid word we need to correct)
- language _String_ (Language to be used in search)
- maxDistance _Number_, _optional_, default = 2 (Maximum distance for suggestions)Return:
Correction object which contains original input and corrected output string, with array of suggestions Example
`typescript
await symSpellEx.correct('Special relatvity was orignally proposed by Albert Einstein', 'en');
`Returns this
Correction object:
> This output is totally depending on the quality of the training data that was push into the store
`json
{
"suggestions": [],
"input": "Special relatvity was orignally proposed by Albert Einstein",
"output": "Special relativity was originally proposed by Albert Einstein"
}
`Computational Complexity
The algorithm has constant time O(1) time, independent of the dictionary size, but depend on the average term length and maximum edit distance,
_Hash Table_ is used to store all search entries which has an average search time complexity of O(1).Why the algorithm is fast?
$3
in training phase all possible spelling error variants as generated (deletes only) and stored in hash table> This makes the algorithm very fast, but it also required a large memory footprint, and the training phase takes a considerable amount of time to build the dictionary first time.
> _(Using RedisStore makes it easy to train and build once, then search and correct from any external source)_
$3
It allows a tremendous reduction of the number of spelling error candidates to be pre-calculated (generated and added to hash table),
which then allows O(1) search while getting spelling suggestions.Library Design
!Lib Diagram$3
This interface can be implemented to provide a different tokenizer for the libraryInterface type
`typescript
export interface Tokenizer {
tokenize(input: string): Array;
}
`$3
This interface can be implemented to provide more algorithms to use to calculate edit distance between two words
> Edit Distance
> is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the otherInterface type
`typescript
interface EditDistance {
name: String;
calculateDistance(source: string, target: string): number;
}
`$3
This interface can be implemented to provide additional method to store data other than built-in stores (Memory, Redis)Data store should handle storage for these 2 data types:
- Terms: List data structure to store terms and retrieve it by index
- Entries: Hash Table data structure to store dictionary entries and retrieve data by term (_Key_)
> Data store should also handle storage for multiple languages and switch between them
Interface type
`typescript
export interface DataStore {
name: string;
initialize(): Promise;
isInitialized(): boolean;
setLanguage(language: string): Promise;
pushTerm(value: string): Promise;
getTermAt(index: number): Promise;
getTermsAt(indexes: Array): Promise>;
getEntry(key: string): Promise>;
getEntries(keys: Array): Promise>>;
setEntry(key: string, value: Array): Promise;
hasEntry(key: string): Promise;
maxEntryLength(): Promise;
clear(): Promise;
}
``#### Built-in data stores
- Memory: Stores data in memory, using array structure for terms and high speed hash table (megahash)
to manage dictionary entries
> May be limited by node process memory limits, which can be overridden
- Redis: Stores data into Redis database using list structure to store terms and hash to store dictionary data
> Very efficient way to train and store data, it will allow accessing by multiple processes and/or machines,
> also dumping and migrating data will be easy
TODO
- [x] Tokenization
- [x] Sentence correction
- [x] Bulk data training
- [ ] Word Segmentation
- [ ] Domain specific correction
#### References
* Symmetric Delete Spelling Correction
* SymSpell
#### License
MIT