# Methodius (an NGram utility)

A utility for analyzing frequency of text chunks on the web.

Supply a bit o' text to the Methodius class, and let it determine your bigrams, trigrams, ngrams, letter frequencies, word frequencies, and bigram relationships, and build ngram trees.

[Hippocratic License HL3-LAW-MEDIA-MIL-SOC-SV](https://firstdonoharm.dev/version/3/0/law-media-mil-soc-sv.html)


## Example

```JavaScript
const { Methodius } = require('methodius');
// or import { Methodius } from 'methodius';

const udhr1 = `
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
`;
const nGrams = new Methodius(udhr1);

const topLetters = nGrams.getTopLetters(10);
const topWords = nGrams.getTopWords(10);
```

## API

### Methodius

Global Class

#### new Methodius(text)

Parameters

| name | type | Description |
| --- | --- | --- |
| text | string | raw text to be analyzed |

### Static Properties


#### Punctuations

Characters to ignore when analyzing text: period, comma, semicolon, colon, bang, question mark, interrobang, Spanish bang+, parens, bracket, brace, single quote, some spaces.

`\\.,;:!?‽¡¿⸘()\\[\\]{}<>’'…\"\n\t\r`

#### wordSeparators

Characters to ignore AND CONSUME when trying to find words: em-dash, period, comma, semicolon, colon, bang, question mark, interrobang, Spanish bang+, parens, bracket, brace, single quote, space.

`—\\.,;:!?‽¡¿⸘()\\[\\]{}<>…"\\s`

### Static Methods


#### hasPunctuation(string)

Determines if a string contains punctuation.

Parameters

| name | type | Description |
| --- | --- | --- |
| string | string | |

Returns `boolean`

#### hasSymbols(string)

Determines if a string contains symbols.

Parameters

| name | type | Description |
| --- | --- | --- |
| string | string | |

Returns `boolean`

#### hasSpace(string)

Determines if a string has a space.

Parameters

| name | type | Description |
| --- | --- | --- |
| string | string | |

Returns `boolean`

#### sanitizeText(string)

Lowercases text and removes diacritics and other characters that would throw off n-gram analysis.

Parameters

| name | type | Description |
| --- | --- | --- |
| string | string | |

Returns `string`
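To give a feel for what this kind of sanitizing involves, here is a minimal sketch that lowercases a string and strips diacritics with Unicode normalization. `sketchSanitize` is a hypothetical helper written for this README, not the library's implementation, and it does not attempt to reproduce which "other characters" Methodius removes.

```JavaScript
// Hypothetical helper (an assumption, not Methodius's own code).
function sketchSanitize(text) {
  return text
    .toLowerCase()
    // NFD splits an accented letter into a base letter plus combining marks
    .normalize('NFD')
    // drop the combining marks (U+0300 through U+036F)
    .replace(/[\u0300-\u036f]/g, '');
}

console.log(sketchSanitize('Café Über')); // "cafe uber"
```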

#### getWords(text)

Extracts an array of words from a string.

Parameters

| name | type | Description |
| --- | --- | --- |
| text | string | |

Returns `Array`
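One plausible way to extract words is to split on the characters listed under `wordSeparators`. The sketch below does that with a regular expression transcribed from that list; `sketchWords` is a hypothetical name, and the library's real splitting logic may differ.

```JavaScript
// Hypothetical helper: splits on runs of separator characters.
// The character class is transcribed from the wordSeparators list above.
function sketchWords(text) {
  return text
    .split(/[—.,;:!?‽¡¿⸘()\[\]{}<>…"\s]+/)
    // splitting can leave empty strings at the edges, so drop them
    .filter((word) => word.length > 0);
}

console.log(sketchWords('All human beings—born free, and equal.'));
// [ 'All', 'human', 'beings', 'born', 'free', 'and', 'equal' ]
```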

#### getNGrams(text, gramSize)

Gets ngrams from text.

Parameters

| name | type | Description |
| --- | --- | --- |
| text | string | |
| gramSize | Number | Default = 2 |

Returns `Array`
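An n-gram extractor is essentially a sliding window over characters. The sketch below assumes n-grams are taken within each word and never span a word boundary; that assumption, and the name `sketchLetterNGrams`, are illustrative only and not confirmed by the library.

```JavaScript
// Hypothetical helper: slides a window of `gramSize` characters across each word.
function sketchLetterNGrams(words, gramSize = 2) {
  const grams = [];
  for (const word of words) {
    // stop once the window would run past the end of the word
    for (let i = 0; i + gramSize <= word.length; i += 1) {
      grams.push(word.slice(i, i + gramSize));
    }
  }
  return grams;
}

console.log(sketchLetterNGrams(['nation'], 2));
// [ 'na', 'at', 'ti', 'io', 'on' ]
```

Words shorter than `gramSize` contribute nothing under this approach.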

#### getMeanWordSize(wordArray)

Gets the average size of a word.

Parameters

| name | type | Description |
| --- | --- | --- |
| wordArray | string[] | |

Returns `number`

#### getMedianWordSize(wordArray)

Gets the median (middle) size of a word.

Parameters

| name | type | Description |
| --- | --- | --- |
| wordArray | string[] | |

Returns `number`
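For reference, the two statistics can be computed as in the sketch below. The helper names are hypothetical, and averaging the two middle values for an even-length list is an assumption; the library may handle that case differently.

```JavaScript
// Hypothetical helpers, not the library's code.
function sketchMeanWordSize(words) {
  const total = words.reduce((sum, word) => sum + word.length, 0);
  return total / words.length;
}

function sketchMedianWordSize(words) {
  // sort numerically; the default sort would compare as strings
  const sizes = words.map((word) => word.length).sort((a, b) => a - b);
  const mid = Math.floor(sizes.length / 2);
  // assumption: an even count averages the two middle sizes
  return sizes.length % 2 === 0 ? (sizes[mid - 1] + sizes[mid]) / 2 : sizes[mid];
}

const sample = ['all', 'human', 'beings', 'are', 'born']; // lengths 3, 5, 6, 3, 4
console.log(sketchMeanWordSize(sample)); // 4.2
console.log(sketchMedianWordSize(sample)); // 4
```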

#### getWordNGrams(text, gramSize)

Gets word n-grams from text (2-word pairs by default).

Note: This doesn't use sentence punctuation as a boundary. Should it?

Parameters

| name | type | Description |
| --- | --- | --- |
| text | string | |
| gramSize | number | default = 2 |

Returns `Array`

#### getFrequencyMap(ngramArray)

Converts an array of strings into a map of those strings and their number of occurrences.

Parameters

| name | type | Description |
| --- | --- | --- |
| ngramArray | Array | |

Returns `Map`
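Counting occurrences into a `Map` looks roughly like the sketch below. `sketchFrequencyMap` is a hypothetical stand-in to show the shape of the result, not the library's implementation.

```JavaScript
// Hypothetical helper: counts how often each string occurs.
function sketchFrequencyMap(items) {
  const counts = new Map();
  for (const item of items) {
    // start at 0 the first time an item is seen
    counts.set(item, (counts.get(item) || 0) + 1);
  }
  return counts;
}

const counts = sketchFrequencyMap(['ti', 'io', 'on', 'ti', 'io', 'on', 'na']);
console.log(counts.get('ti')); // 2
console.log(counts.get('na')); // 1
```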

#### getPercentMap(frequencyMap)

Converts a frequency map into a map of percentages.

Parameters

| name | type | Description |
| --- | --- | --- |
| frequencyMap | Map | |

Returns `Map`
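Converting counts to shares of the total can be sketched as follows. This version expresses each share as a fraction between 0 and 1, which is an assumption; the library may scale or round its percentages differently. `sketchPercentMap` is a hypothetical name.

```JavaScript
// Hypothetical helper: divides each count by the sum of all counts.
function sketchPercentMap(frequencyMap) {
  let total = 0;
  for (const count of frequencyMap.values()) total += count;

  const percents = new Map();
  for (const [key, count] of frequencyMap) {
    percents.set(key, count / total);
  }
  return percents;
}

const percents = sketchPercentMap(new Map([['a', 3], ['b', 1]]));
console.log(percents.get('a')); // 0.75
console.log(percents.get('b')); // 0.25
```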

#### getTopGrams(frequencyMap, limit)

Filters a frequency map down to a small subset of the most frequent entries.

Parameters

| name | type | Description |
| --- | --- | --- |
| frequencyMap | Map | |
| limit | number | default = 20 |

Returns `Map`
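Picking the most frequent entries amounts to sorting by count and keeping the first `limit` items. The sketch below is one way to do that; `sketchTopGrams` is a hypothetical name, and how the library breaks ties is not something this README states.

```JavaScript
// Hypothetical helper: keeps the `limit` entries with the highest counts.
function sketchTopGrams(frequencyMap, limit = 20) {
  const sorted = [...frequencyMap.entries()].sort((a, b) => b[1] - a[1]);
  // a Map remembers insertion order, so the result stays sorted
  return new Map(sorted.slice(0, limit));
}

const top = sketchTopGrams(new Map([['th', 4], ['he', 9], ['er', 2]]), 2);
console.log([...top.keys()]); // [ 'he', 'th' ]
```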

#### getIntersection(iterable1, iterable2)

Returns an array of items that occur in both iterables.

Parameters

| name | type | Description |
| --- | --- | --- |
| iterable1 | Map \| Array | |
| iterable2 | Map \| Array | |

Returns `Array`: an array of items that occur in both iterables. If given a map, it compares the keys.

#### getUnion(iterable1, iterable2)

Returns an array that is the union of two iterables.

Parameters

| name | type | Description |
| --- | --- | --- |
| iterable1 | Map \| Array | |
| iterable2 | Map \| Array | |

Returns `Array`: a union of the items that occur in both iterables.

#### getDisjunctiveUnion(iterable1, iterable2)

Returns an array of arrays of the unique items in either iterable.

Parameters

| name | type | Description |
| --- | --- | --- |
| iterable1 | Map \| Array | |
| iterable2 | Map \| Array | |

Returns `Array`: an array of arrays of the unique items. The first item holds the items unique to the first parameter, the second item those unique to the second parameter.
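The three set operations above can be sketched with plain `Set` logic. The helper names are hypothetical, and treating a `Map` as its list of keys follows the note under `getIntersection`; applying that same rule to the other two operations is an assumption.

```JavaScript
// Hypothetical helpers. Maps are reduced to their keys before comparing.
const toKeys = (iterable) => (iterable instanceof Map ? [...iterable.keys()] : [...iterable]);

function sketchIntersection(a, b) {
  const bKeys = new Set(toKeys(b));
  return [...new Set(toKeys(a))].filter((item) => bKeys.has(item));
}

function sketchUnion(a, b) {
  return [...new Set([...toKeys(a), ...toKeys(b)])];
}

function sketchDisjunctiveUnion(a, b) {
  const aKeys = new Set(toKeys(a));
  const bKeys = new Set(toKeys(b));
  return [
    [...aKeys].filter((item) => !bKeys.has(item)), // only in the first
    [...bKeys].filter((item) => !aKeys.has(item)), // only in the second
  ];
}

const first = ['th', 'he', 'er'];
const second = new Map([['he', 5], ['an', 3]]);
console.log(sketchIntersection(first, second)); // [ 'he' ]
console.log(sketchUnion(first, second)); // [ 'th', 'he', 'er', 'an' ]
console.log(sketchDisjunctiveUnion(first, second)); // [ [ 'th', 'er' ], [ 'an' ] ]
```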

#### getComparison(iterable1, iterable2)

Returns a map containing various comparisons between two iterables.

Parameters

| name | type | Description |
| --- | --- | --- |
| iterable1 | Map \| Array | |
| iterable2 | Map \| Array | |

Returns `Map`: a map containing various comparisons between two iterables. Each comparison is some kind of array (see intersection or disjunctiveUnion).

#### getWordPlacementForNGram(ngram, wordsArray)

Determines the placement of a single ngram in an array of words.

Parameters

| name | type | Description |
| --- | --- | --- |
| ngram | string | |
| wordsArray | Array | |

Returns `Map`: a map with the keys 'start', 'middle', and 'end' whose values correspond to how often the provided ngram occurs in that position.
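The start/middle/end tally can be pictured with the sketch below. `sketchPlacement` is a hypothetical helper, and two of its choices are assumptions rather than documented behavior: a word that equals the ngram counts as 'start', and every occurrence within a word is counted.

```JavaScript
// Hypothetical helper: classifies each occurrence of `ngram` by where it sits in a word.
function sketchPlacement(ngram, words) {
  const placements = new Map([['start', 0], ['middle', 0], ['end', 0]]);
  for (const word of words) {
    let index = word.indexOf(ngram);
    while (index !== -1) {
      let position = 'middle';
      if (index === 0) position = 'start'; // assumption: start wins over end
      else if (index + ngram.length === word.length) position = 'end';
      placements.set(position, placements.get(position) + 1);
      index = word.indexOf(ngram, index + 1);
    }
  }
  return placements;
}

const placement = sketchPlacement('on', ['nation', 'on', 'only', 'bond']);
console.log(placement.get('start')); // 2
console.log(placement.get('middle')); // 1
console.log(placement.get('end')); // 1
```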

#### getWordPlacementForNGrams(ngrams, wordsArray)

Determines the placement of ngrams in an array of words.

Parameters

| name | type | Description |
| --- | --- | --- |
| ngrams | Array | |
| wordsArray | Array | |

Returns `Map`: a map keyed by ngram, where each value is a map containing start, middle, and end.

#### getNgramCollections(wordArray, ngramSize)

Gets ngrams from an array of words.

Parameters

| name | type | Description |
| --- | --- | --- |
| wordArray | Array | an array of words |
| ngramSize | number | default = 2. The size of the ngrams to return |

Returns `Array`: an array containing arrays of ngrams. Each array corresponds to a word.

#### getNgramSiblings(searchText, ngramCollections, siblingSize)

Using a collection returned from getNgramCollections, searches for a string and returns what comes before and after it.

Parameters

| name | type | Description |
| --- | --- | --- |
| searchText | string | the string to search for |
| ngramCollections | Array \| Array<Array> | an array of ngrams, or an nGramCollection |
| siblingSize | number | default = 1. How many siblings to find in front or behind |

Returns `Map<'before'|'after', Map>`: a Map with the keys 'before' and 'after', which contain maps of what comes before and after.

Example

```JavaScript
const words = ['revolution', 'nation'];
const ngramCollections = Methodius.getNgramCollections(words, 2);
const onSiblings = Methodius.getNgramSiblings('io', ngramCollections);
/*
new Map([
  ['before', new Map([['ti', 2]])],
  ['after', new Map([['on', 2]])]
])
*/
```

#### getRelatedNgrams(words, ngrams, ngramSize)

Gets the ngrams that occur before or after other ngrams. Useful for finding patterns of ngrams.

Parameters

| name | type | Description |
| --- | --- | --- |
| words | Array | an array of words to evaluate |
| ngrams | Map | a frequency map of ngrams |
| ngramSize | number | default = 2. The size of the ngram |

Returns `Map`: a frequency map of how often ngrams occurred before or after other ngrams.

Example

This requires several steps. You'll need an array of words and a frequency map of ngrams.

```JavaScript
const ngrams = Methodius.getNGrams('the revolution of the nation was on television. It was about pollution and the terrible situation ', 2);
const frequencyMap = Methodius.getFrequencyMap(ngrams);
const topNgrams = Methodius.getTopGrams(frequencyMap, 5);
const words = [
  'the', 'revolution', 'of', 'the', 'nation', 'was', 'on', 'television',
  'it', 'was', 'about', 'pollution', 'and', 'the', 'terrible', 'situation'
];
const relatedNgrams = Methodius.getRelatedNgrams(words, topNgrams, 2, 5);
```

#### getNgramTreeCollection(words)

Gets a nested map of maps that breaks down unique words into their smallest ngrams.

Parameters

| name | type | Description |
| --- | --- | --- |
| words | Array | an array of words to evaluate |

Returns `Map`: a nested map of maps that breaks down unique words into their smallest ngrams.

### Instance Properties


#### sanitizedText

Lowercased text with diacritics removed.

`string`

#### letters

An array of letters in the text.

`Array`

#### words

An array of words in the text.

`Array`

#### bigrams

An array of letter bigrams in the text.

`Array`

#### trigrams

An array of letter trigrams in the text.

`Array`

#### uniqueLetters

An array of unique letters in the text.

`Array`

#### uniqueBigrams

An array of unique bigrams in the text.

`Array`

#### uniqueTrigrams

An array of unique trigrams in the text.

`Array`

#### letterPositions

A map of placements of letters within words.

`Map`

#### bigramPositions

A map of placements of bigrams within words.

`Map`

#### trigramPositions

A map of placements of trigrams within words.

`Map`

#### uniqueWords

An array of unique words in the text.

`Array`

#### letterFrequencies

A map of letter frequencies in the sanitized text.

`Map`

#### bigramFrequencies

A map of bigram frequencies in the sanitized text.

`Map`

#### trigramFrequencies

A map of trigram frequencies in the sanitized text.

`Map`

#### wordFrequencies

A map of word frequencies in the sanitized text.

`Map`

#### letterPercentages

A map of letter percentages in the sanitized text.

`Map`

#### bigramPercentages

A map of bigram percentages in the sanitized text.

`Map`

#### trigramPercentages

A map of trigram percentages in the sanitized text.

`Map`

#### wordPercentages

A map of word percentages in the sanitized text.

`Map`

#### meanWordSize

The average size of a word.

`number`

#### medianWordSize

The middle size of a word.

`number`

#### ngramTreeCollection

A nested map of maps that breaks down unique words into their smallest ngrams.

### Instance Methods

#### getLetterNGrams(size)

Gets an array of customizable ngrams in the text.

Parameters

| name | type | Description |
| --- | --- | --- |
| size | number | default = 2. Size of the n-gram to return |

Returns `Array`

#### getTopLetters(limit)

A map of the most used letters in the text.

Parameters

| name | type | Description |
| --- | --- | --- |
| limit | number | default = 20. Number of top letters to return |

Returns `Map`

#### getTopBigrams(limit)

A map of the most used bigrams in the text.

Parameters

| name | type | Description |
| --- | --- | --- |
| limit | number | default = 20. Number of top bigrams to return |

Returns `Map`

#### getTopTrigrams(limit)

A map of the most used trigrams in the text.

Parameters

| name | type | Description |
| --- | --- | --- |
| limit | number | default = 20. Number of top trigrams to return |

Returns `Map`

#### getTopWords(limit)

A map of the most used words in the text.

Parameters

| name | type | Description |
| --- | --- | --- |
| limit | number | default = 20. Number of top words to return |

Returns `Map`

#### compareTo(methodius)

Compares this Methodius instance to another.

Parameters

| name | type | Description |
| --- | --- | --- |
| methodius | Methodius | another Methodius instance |

Returns `Map`: a map of property names and their comparisons (intersection, disjunctiveUnions, etc.) for a set of properties.

#### getRelatedTopNgrams(ngramSize, limit)

Gets the ngrams that occur before or after other ngrams, based on what the most frequent ngrams are. Useful for finding patterns of ngrams.

Parameters

| name | type | Description |
| --- | --- | --- |
| ngramSize | number | default = 2. The size of the ngram |
| limit | number | default = 20. The number of top ngrams to use |

Returns `Map`: a frequency map of how often the most common ngrams occurred before or after other common ngrams.
