An all-purpose tokenizer that transforms text into a JavaScript dictionary object of unique words and/or subwords, for text classification and/or text generation.
npm install dictionary_output_tokenizer

The purpose of this repository is to demonstrate and compare different ways to tokenize text for text classification (e.g., a word count matrix) and/or text generation. This JavaScript tokenizer uses regular expressions (regex) to find the unique words in a text and assign a number to each one; regex is a powerful pattern-matching tool that can be used to preprocess text strings.
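To make the regex idea concrete, here is a minimal sketch of mapping unique words to numbers and using that mapping for word counts. It is not the package's actual API: the function names `buildWordDictionary` and `countWords`, the lowercasing step, and the word pattern are all assumptions chosen for illustration.

```javascript
// Hypothetical helpers for illustration only; not part of dictionary_output_tokenizer.

// Assumed word pattern: runs of lowercase letters, digits, and apostrophes.
const WORD_PATTERN = /[a-z0-9']+/g;

// Build a dictionary object that maps each unique word to a number,
// numbering words in order of first appearance.
function buildWordDictionary(text) {
  const words = text.toLowerCase().match(WORD_PATTERN) || [];
  const dictionary = {};
  let nextId = 0;
  for (const word of words) {
    if (!Object.prototype.hasOwnProperty.call(dictionary, word)) {
      dictionary[word] = nextId;
      nextId += 1;
    }
  }
  return dictionary;
}

// Count how often each dictionary word occurs in a text.
// Words that are not in the dictionary are ignored.
function countWords(text, dictionary) {
  const counts = new Array(Object.keys(dictionary).length).fill(0);
  const words = text.toLowerCase().match(WORD_PATTERN) || [];
  for (const word of words) {
    if (Object.prototype.hasOwnProperty.call(dictionary, word)) {
      counts[dictionary[word]] += 1;
    }
  }
  return counts;
}

const dictionary = buildWordDictionary("The cat sat. The cat ran!");
console.log(dictionary); // { the: 0, cat: 1, sat: 2, ran: 3 }
console.log(countWords("the cat and the dog", dictionary)); // [ 2, 1, 0, 0 ]
```

Stacking one `countWords` row per document would give a word count matrix of the kind used for text classification.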
Demonstration of how to use the library (word tokenization for word count text classification): https://codesolutions2.github.io/dictionary_output_tokenizer/index7.html
The example web app shows the output of dictionary_output_tokenizer alongside the encode output of gpt-tokenizer, a popular tokenizer for GPT models. gpt-tokenizer was chosen for the comparison because it is fast.