```
npm install docproc
```

```javascript
const docProc = require('docproc')
readableStream.pipe(docProc.pipeline(ops))
```
DocProc is a pumpify chain of
transform streams that turns Plain Old JSON Objects into a format that
can be indexed by search-index.
Each processed document must have the following fields:
* id - document id
* vector - vector, used for ranking
* stored - the document that will be cached
* raw - the unadulterated document
* normalised - the "cleaned up" document
* tokenised - the tokenised document
So
```javascript
{
  id: 'one',
  text: 'the first doc'
}
```
becomes
```javascript
{ id: 'one',
  normalised: { id: 'one', text: 'the first doc' },
  raw: { id: 'one', text: 'the first doc' },
  stored: { id: 'one', text: 'the first doc' },
  tokenised: { id: [ 'one' ], text: [ 'the', 'first', 'doc' ] },
  vector:
   { id: { one: 1, '*': 1 },
     text: { doc: 1, first: 1, the: 1, '*': 1 },
     '*': { one: 1, '*': 1, doc: 1, first: 1, the: 1 } } }
```
...after being passed through docProc.
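The shape of the `tokenised` and `vector` fields above can be approximated in a few lines. This is a rough sketch, not docProc's implementation: `sketchVector` is a hypothetical helper, it assumes space-separated tokens and plain occurrence counts, and it simply copies the `'*': 1` entries seen in the example output without claiming what they mean inside search-index.

```javascript
// Hypothetical helper, not part of docProc: approximates the `tokenised`
// and `vector` fields shown above for a flat document.
function sketchVector (doc) {
  const tokenised = {}
  // Assumption: the '*' entries are copied from the example output as-is
  const composite = Object.assign(Object.create(null), { '*': 1 })
  const vector = { '*': composite }
  for (const [field, value] of Object.entries(doc)) {
    // Assumption: tokens are separated by single spaces
    const tokens = String(value).split(' ')
    tokenised[field] = tokens
    // Prototype-free objects so tokens like 'constructor' count correctly
    const counts = Object.assign(Object.create(null), { '*': 1 })
    for (const token of tokens) {
      // Assumption: plain occurrence counts, no weighting
      counts[token] = (counts[token] || 0) + 1
      composite[token] = (composite[token] || 0) + 1
    }
    vector[field] = counts
  }
  return { tokenised, vector }
}

const { tokenised, vector } = sketchVector({ id: 'one', text: 'the first doc' })
console.log(tokenised) // { id: [ 'one' ], text: [ 'the', 'first', 'doc' ] }
console.log(JSON.stringify(vector.text))
```

The real stages may weight or normalise the counts differently; the sketch only aims to make the relationship between the input document and the resulting structure easier to follow.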
You can also compose document processing pipelines by reusing the
stages provided, or by creating new ones using the Node.js transform
stream specification:
```javascript
docProc.customPipeline([
  new docProc.IngestDoc(),
  new docProc.CreateStoredDocument(),
  new docProc.NormaliseFields(),
  new docProc.Tokeniser({separator: ' '}),
  new docProc.RemoveStopWords({stopwords: []}),
  new docProc.CalculateTermFrequency(),
  new docProc.CreateCompositeVector(),
  new docProc.CreateSortVectors(),
  new docProc.FieldedSearch({fieldedSearch: false})
])
```
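A new stage can be written against Node's `stream.Transform` class. The sketch below is illustrative only: `TrimFields` is a hypothetical stage that does not ship with docProc, and it assumes that stages run in object mode and receive the partially processed document with a `normalised` field, as in the example output above.

```javascript
const { Transform } = require('stream')

// Hypothetical custom stage (not shipped with docProc): trims surrounding
// whitespace from every string field of `doc.normalised`.
// Assumption: stages run in object mode and pass the whole document along.
class TrimFields extends Transform {
  constructor (options) {
    super({ ...options, objectMode: true })
  }

  _transform (doc, encoding, done) {
    const normalised = doc.normalised || {}
    for (const field of Object.keys(normalised)) {
      if (typeof normalised[field] === 'string') {
        normalised[field] = normalised[field].trim()
      }
    }
    // Hand the (mutated) document to the next stage
    done(null, doc)
  }
}
```

Under those assumptions, `new TrimFields()` could be placed after `new docProc.NormaliseFields()` in the array passed to `docProc.customPipeline`.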
`docProc.pipeline` is a function that returns a writable stream containing a
sensible default document processing pipeline.
`docProc.customPipeline` is a function that takes an Array of pipeline stages,
each of which is a transform stream, and returns a writable stream.
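The idea of turning an array of stages into one pipeline can be illustrated with Node core streams alone. This sketch is not how `customPipeline` is implemented (docProc is described above as a pumpify chain); `chainStages` is a hypothetical helper that pipes each stage into the next and returns both ends of the chain.

```javascript
const { PassThrough } = require('stream')

// Hypothetical helper, not part of docProc: pipes each stage into the next.
// Returns the first stage (write documents here) and the last (read results).
function chainStages (stages) {
  if (stages.length === 0) throw new Error('at least one stage is required')
  // pipe() returns its destination, so reduce walks down the chain
  const tail = stages.reduce((upstream, stage) => upstream.pipe(stage))
  return { head: stages[0], tail }
}

// Two do-nothing object-mode stages stand in for real pipeline stages
const chain = chainStages([
  new PassThrough({ objectMode: true }),
  new PassThrough({ objectMode: true })
])
chain.tail.on('data', doc => console.log(doc.id))
chain.head.end({ id: 'one', text: 'the first doc' })
```

Unlike this helper, a pumpify-style wrapper exposes the whole chain as a single stream, which is why the result of `customPipeline` can be used directly as a `pipe` destination.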
`CalculateTermFrequency` is a transform stream that calculates term frequency.
`CreateCompositeVector` is a transform stream that calculates the composite
vector, which is used for searching across all fields.
`CreateSortVectors` is a transform stream that creates sort vectors.
`CreateStoredDocument` is a transform stream that defines the parts of each
document that are to be cached in the index itself.
`FieldedSearch` is a transform stream that determines which fields can be
searched on individually. To make indexes smaller, you can restrict indexing
to the fields that need to be searched on individually.
`IngestDoc` is a transform stream that takes an unprocessed document and
converts it into a structure that can be processed by search-index.
Another transform stream converts text to lower case.
`NormaliseFields` is a transform stream that converts non-string fields into
strings.
`RemoveStopWords` is a transform stream that removes stopwords.
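Stopword removal can be pictured as a filter over the tokenised fields. The following is a minimal sketch under assumptions, not the `RemoveStopWords` implementation: `removeStopWords` is a hypothetical function, and it assumes the `tokenised` layout shown in the example output (field name mapped to an array of tokens).

```javascript
// Hypothetical helper, not part of docProc: drops stopwords from every
// field of a `tokenised` object (field name -> array of tokens).
function removeStopWords (tokenised, stopwords) {
  const blocked = new Set(stopwords)
  const result = {}
  for (const [field, tokens] of Object.entries(tokenised)) {
    result[field] = tokens.filter(token => !blocked.has(token))
  }
  return result
}

const tokenised = { id: ['one'], text: ['the', 'first', 'doc'] }
console.log(removeStopWords(tokenised, ['the']))
// With an empty stopword list, as in the customPipeline example,
// every token is kept.
console.log(removeStopWords(tokenised, []))
```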
A further transform stream does nothing other than print the state of the
document to `console.log`. Use it when developing and debugging.
`Tokeniser` is a transform stream that splits fields into their individual
linguistic tokens.
See: https://github.com/fergiemcdowall/search-index/blob/master/doc/API.md#options-and-settings