String Algorithms

![Build Status](https://travis-ci.org/burgaard/string-algorithms) ![Coverage Status](https://coveralls.io/r/burgaard/string-algorithms)

Features

This package implements a collection of linear-time and linear-space string
algorithms that are often used when implementing more advanced searching,
sorting and indexing algorithms. The algorithms all accept and properly handle
16-bit Unicode strings.

The algorithms implemented are:

- longestCommonPrefix calculates the
longest common prefixes given a
suffix array. This implementation is based on
Kasei et al's algorithm.
- longestCommonSubstring calculates the longest common substring of
two or more strings in O(n + k) or O(n + (k log(k))) depending on
the chosen index map implementation. The former version requires an additional
O(n) space, whereas the latter version only requires an additional O(k)
space.
- radixSort sorts an array with sub-arrays that are all the same length.
- search finds all instances of the given term in string. This implementation
is based on
Knuth, Morris and Pratt's algorithm.
- suffixArray calculates the
suffix array of a given string.
This implementation is based on the
Difference Cover modulo 3 (DC3)/skew algorithm by Kärkkäinen et al.

Note: While the algorithms provided here are linear-time implementations,
they are still outperformed by readily available C/C++ implementations.

Also note that although these implementations are O(n), linear time does not
automatically beat O(n log(n)) all the time. More efficient implementations
that are O(n log(n)) may in fact be faster in practice in many situations.
To see that, consider that log₂(n) grows very slowly. For example
log₂(100,000) is approximately 16.6. The linear-time longest common
substring implementation makes many linear passes through the input string,
quite possibly more than 16 in total. So if there exists an O(n log(n))
implementation that can do everything it needs to do in just one pass through
the input, it would already come out ahead of the linear time implementation
for n less than or equal to 100,000.

Due to limitations of Node.js, the maximum string size is currently limited too
by the maximum heap size which is currently just shy of 2GB--and the actual
longest string that can be handled by the multiple longest common substring
algorithm will be sevaral factors shorter than the maximum heap size.

Examples

$3

Find the longest common substring:

``javascript import { longestCommonSubstring } from 'string-algorithms';

const strings = [ '12apple', '3apple4', 'apple56' ];

console.log(longestCommonSubstring(strings));`

produces the output apple.

Run the example.

`$3`

Find the suffix array of mississippi:

`javascript import { suffixArray } from 'string-algorithms';

console.log(suffixArray('mississippi'));`

produces the output

`javascript [ 11, // $ 10, // i 7, // ippi 4, // issippi 1, // ississippi 0, // mississippi 9, // pi 8, // ppi 6, // sippi 3, // sissippi 5, // ssippi 2 // ssissippi ]`

Run the example.

`$3`

Given an array with arrays of integers:

`javascript import { radixSort } from 'string-algorithms';

const integers = [ [-9, 4, 0], [ 4, -2, 3], [ 4, 2, -1], [ 1, 0, 6], [-4, -2, -5], [ 4, 6, 8], ];

const result = radixSort(integers);

/* [ [-9, 4, 0], [-4, -2, -5], [ 1, 0, 6], [ 4, -2, 3], [ 4, 2, -1], [ 4, 6, 8], ]; */`

Run the example.

Given an array of strings that are all the same length, and a function that converts each string to an array of char codes:

`javascript const strings = [ 'image', 'mania', 'genom', 'mango' ];

const result = radixSort(strings, s => s.split('').map(c => c.charCodeAt(0)));

/* [ 'genom', 'image', 'mango', 'mania' ] */`

Run the example.

`Install`

`npm install --save string-algorithms`

`API`

`$3`

Finds all instances of term in the given text.

text is the string to be searched.

term is the substring to search for.

Returns an array with the start index of all occurrences of term in text.

Note:

`$3`

Radix sorts an array of entries. If getEntry is not given, then entriesis assumed to contain an array of arrays where each sub-array is of the same length. IfgetEntryis given, then the entries may be of any type, butgetEntrymust return an array of the same length corresponding to each given entry.

entries is an array with entries to be radix sorted.

getEntryis an optional function for retrieving each entry as an array. For example, entries may contain arrays that are all 4 elements long, but only the last three elements should be considered for sorting. In that case,getEntry could be entry => entry.slice(1).

Returns a new array with the sorted entries.

Note: Although this is a linear-time sort algorithm, it requires input to be of a uniform length (arrays with k entries, strings with at most k characters, digits with at most k digits and so on). The constant overhead is also pretty big, so for something as simple as sorting integers, a fast O(n * log(n)) implementation will probably beat radix sorting even for pretty big n.

`$3`

Calculates the suffix array for the given string and an optional terminator code which must be negative.

s is the string or array of character codes to compute the suffix array for.

terminatoris an optional negative terminator code. The terminator code must not be present anywhere ins.

Returns an array with the sorted suffixes of s.

`$3`

Calculates the longest common prefix from a suffix array in linear time.

sequence is a sequence of character codes.

suffixArray is the suffix array corresponding to sequence.

Returns an array indicating the height of the shared prefix between adjecent suffix array entries.

`$3`

Finds the longest common substring(s) in the set of given strings. If there are multiple substrings that all share the longest length, then all such substrings are returned. O(n + k) or O(n + k log(k)) depending on the selected string indexing strategy.strings is an array of strings.

indexMapis the optional string indexing map strategy. If given a string, it must be one of 'log' or 'linear'. Otherwise it must be an object that derives fromStringIndexMap. The default value is 'log'.

Returns an array with the longest common substrings(s).

`$3`

Maps the position of strings s₁ ... s_K when concatenated into one string. Concrete implementations provide different compromises between O(1) and O(log(k)) lookup times versus O(n) and O(k) space requirements. Extend this class to implement custom mappings from string positions to substring indices with different runtime/space tradeoffs than the two pre-defined implementations.

#### function add(length)

Adds a substring with the given length.

length is the length of the substring.

Returns the current total length of all substrings.

#### lookup(position)

Looks up the substring corresponding to the given position in the concatenated string.

position is the position in the concatenated string

Returns the index of the substring that contains the given position.

#### toString()`

Returns a string representation of the string index map.

Contributing

Contributions welcome; Please submit all pull requests against the master branch. If your
pull request contains JavaScript patches or features, you should include relevant unit
tests. Please check the Contributing Guidelines for more details.
Thanks!

Author

Kim Burgaard <kim@burgaard.us>.

Made in Sunny California with love and sustainably harvested coffee.

License

MIT

String Algorithms

![Build Status](https://travis-ci.org/burgaard/string-algorithms) ![Coverage Status](https://coveralls.io/r/burgaard/string-algorithms)

Features

The algorithms implemented are:

Note: While the algorithms provided here are linear-time implementations,
they are still outperformed by readily available C/C++ implementations.

Examples

$3

Find the longest common substring:

``javascript import { longestCommonSubstring } from 'string-algorithms';

const strings = [ '12apple', '3apple4', 'apple56' ];

console.log(longestCommonSubstring(strings));`

produces the output apple.

Run the example.

`$3`

Find the suffix array of mississippi:

`javascript import { suffixArray } from 'string-algorithms';

console.log(suffixArray('mississippi'));`

produces the output

`javascript [ 11, // $ 10, // i 7, // ippi 4, // issippi 1, // ississippi 0, // mississippi 9, // pi 8, // ppi 6, // sippi 3, // sissippi 5, // ssippi 2 // ssissippi ]`

Run the example.

`$3`

Given an array with arrays of integers:

`javascript import { radixSort } from 'string-algorithms';

const integers = [ [-9, 4, 0], [ 4, -2, 3], [ 4, 2, -1], [ 1, 0, 6], [-4, -2, -5], [ 4, 6, 8], ];

const result = radixSort(integers);

/* [ [-9, 4, 0], [-4, -2, -5], [ 1, 0, 6], [ 4, -2, 3], [ 4, 2, -1], [ 4, 6, 8], ]; */`

Run the example.

Given an array of strings that are all the same length, and a function that converts each string to an array of char codes:

`javascript const strings = [ 'image', 'mania', 'genom', 'mango' ];

const result = radixSort(strings, s => s.split('').map(c => c.charCodeAt(0)));

/* [ 'genom', 'image', 'mango', 'mania' ] */`

Run the example.

`Install`

`npm install --save string-algorithms`

`API`

`$3`

Finds all instances of term in the given text.

text is the string to be searched.

term is the substring to search for.

Returns an array with the start index of all occurrences of term in text.

Note:

`$3`

entries is an array with entries to be radix sorted.

Returns a new array with the sorted entries.

`$3`

Calculates the suffix array for the given string and an optional terminator code which must be negative.

s is the string or array of character codes to compute the suffix array for.

terminatoris an optional negative terminator code. The terminator code must not be present anywhere ins.

Returns an array with the sorted suffixes of s.

`$3`

Calculates the longest common prefix from a suffix array in linear time.

sequence is a sequence of character codes.

suffixArray is the suffix array corresponding to sequence.

Returns an array indicating the height of the shared prefix between adjecent suffix array entries.

`$3`

Returns an array with the longest common substrings(s).

`$3`

#### function add(length)

Adds a substring with the given length.

length is the length of the substring.

Returns the current total length of all substrings.

#### lookup(position)

Looks up the substring corresponding to the given position in the concatenated string.

position is the position in the concatenated string

Returns the index of the substring that contains the given position.

#### toString()`

Returns a string representation of the string index map.

Contributing

Author

Kim Burgaard <kim@burgaard.us>.

Made in Sunny California with love and sustainably harvested coffee.

License

MIT