docproc

A document processing pipeline mostly used with
search-index

``javascript docProc = require('docproc') readableStream.pipe(docProc.pipeline(ops))`

DocProc is a pumpify chain of transform streams that turns Plain Old JSON Objects into a format that can be indexed bysearch-index.

Each processed document must have the following fields:

* id- document id *vector- vector, used for ranking *stored- the document that will be cached *raw- the unadulterated document *normalised- the "cleaned up" document. *tokenised - the tokenised document.

So

`javascript { id: 'one', text: 'the first doc' }`

becomes

`javascript { id: 'one', normalised: { id: 'one', text: 'the first doc' }, raw: { id: 'one', text: 'the first doc' }, stored: { id: 'one', text: 'the first doc' }, tokenised: { id: [ 'one' ], text: [ 'the', 'first', 'doc' ] }, vector: { id: { one: 1, '*': 1 }, text: { doc: 1, first: 1, the: 1, '*': 1 }, '': { one: 1, '': 1, doc: 1, first: 1, the: 1 } } },`

...after being passeds through docProc.

You can also compose document processing pipelines by reusing the stages provided, or by creating new ones using the node.js transform stream specification:

`javascript docProc.customPipeline([ new docProc.IngestDoc(), new docProc.CreateStoredDocument(), new docProc.NormaliseFields(), new docProc.Tokeniser({separator: ' '}), new docProc.RemoveStopWords({stopwords: []}), new docProc.CalculateTermFrequency(), new docProc.CreateCompositeVector(), new docProc.CreateSortVectors(), new docProc.FieldedSearch({fieldedSearch: false}) ])`

`API`

`$3`

A function that returns a writable stream that contains a sensible default document processing pipeline

`$3`

A function that takes in an Array of pipeline stages where every stage is a transform stream and returns a writable stream.

`$3`

A transform stream that calculates term frequency.

`$3`

A transform stream that calculates the composite vector- used for searching accross all fields.

`$3`

A transform stream that creates sort vectors.

`$3`

A transform stream that defines the parts of each document that are to be cached in the index itself.

`$3`

A transform stream that determines which fields can be searched on individually. In order to make indexes smaller, you can index fields that can be searched on.

`$3`

A transform stream that takes an unprocessed document and converts it into a structure that can be processed bysearch-index.

`$3`

A transform stream that converts text to lower case.

`$3`

A transform stream that converts non-string fields into Strings.

`$3`

A transform stream that removes stopwords

`$3`

A transform stream that will do nothing other than print out the state of the document toconsole.log`. Use this when developing and
debugging.

$3

A transform stream that splits fields down into their individual
linguistic tokens

Options

See: https://github.com/fergiemcdowall/search-index/blob/master/doc/API.md#options-and-settings

docproc

A document processing pipeline mostly used with
search-index

``javascript docProc = require('docproc') readableStream.pipe(docProc.pipeline(ops))`

DocProc is a pumpify chain of transform streams that turns Plain Old JSON Objects into a format that can be indexed bysearch-index.

Each processed document must have the following fields:

So

`javascript { id: 'one', text: 'the first doc' }`

becomes

...after being passeds through docProc.

You can also compose document processing pipelines by reusing the stages provided, or by creating new ones using the node.js transform stream specification:

`API`

`$3`

A function that returns a writable stream that contains a sensible default document processing pipeline

`$3`

A function that takes in an Array of pipeline stages where every stage is a transform stream and returns a writable stream.

`$3`

A transform stream that calculates term frequency.

`$3`

A transform stream that calculates the composite vector- used for searching accross all fields.

`$3`

A transform stream that creates sort vectors.

`$3`

A transform stream that defines the parts of each document that are to be cached in the index itself.

`$3`

A transform stream that determines which fields can be searched on individually. In order to make indexes smaller, you can index fields that can be searched on.

`$3`

A transform stream that takes an unprocessed document and converts it into a structure that can be processed bysearch-index.

`$3`

A transform stream that converts text to lower case.

`$3`

A transform stream that converts non-string fields into Strings.

`$3`

A transform stream that removes stopwords

`$3`

A transform stream that will do nothing other than print out the state of the document toconsole.log`. Use this when developing and
debugging.

$3

A transform stream that splits fields down into their individual
linguistic tokens

Options

See: https://github.com/fergiemcdowall/search-index/blob/master/doc/API.md#options-and-settings