## CLI usage

```bash
npm install @soustack/ingest
soustack-ingest ingest
```

The CLI reads the input file, runs it through the ingest pipeline, and writes JSON outputs under the output directory (see `src/cli.ts` and `src/pipeline/emit.ts`).

## Requirements

- Node.js 18+ (or compatible)
- Optional: pandoc for improved RTF/RTFD conversion (the adapter will fall back to a built-in parser when it is unavailable).

## Supported inputs

Adapters are selected by file extension (`src/cli.ts`, `src/adapters`).

- `.rtfd.zip`: handled by `readRtfdZip` (`src/adapters/rtfdZip.ts`). The adapter extracts the archive, locates the primary `.rtf` payload (preferring `TXT.rtf` or the largest `.rtf` file), and converts it to text. It tries a Node-based parser first, then falls back to pandoc and textutil when available.
- `.txt`: handled by `readTxt` (`src/adapters/txt.ts`). Reads the file as UTF-8 text and passes it to the pipeline.
- `.docx`: handled by `readDocx` (`src/adapters/docx.ts`). Extracts plain text from Microsoft Word documents using mammoth.
- `.pdf`: handled by `readPdf` (`src/adapters/pdf.ts`). Extracts plain text from PDF files using pdf-parse.

Unsupported extensions throw an error.
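
As a rough illustration of extension-based selection, the sketch below maps file suffixes to the adapter names listed above. The lookup table, the `selectAdapter` helper, and the case-insensitive matching are assumptions made for this example; the actual dispatch in `src/cli.ts` may be structured differently.

```typescript
// Adapter names come from the list above; the table and helper are
// hypothetical and only illustrate suffix-based dispatch.
type AdapterName = "readRtfdZip" | "readTxt" | "readDocx" | "readPdf";

const ADAPTERS_BY_SUFFIX: ReadonlyArray<readonly [string, AdapterName]> = [
  [".rtfd.zip", "readRtfdZip"], // compound suffix, so match on the path ending
  [".txt", "readTxt"],
  [".docx", "readDocx"],
  [".pdf", "readPdf"],
];

function selectAdapter(inputPath: string): AdapterName {
  // Assumption: matching ignores case (".TXT" is treated like ".txt").
  const lower = inputPath.toLowerCase();
  for (const [suffix, adapter] of ADAPTERS_BY_SUFFIX) {
    if (lower.endsWith(suffix)) return adapter;
  }
  throw new Error(`Unsupported input extension: ${inputPath}`);
}
```

Matching on the path ending rather than on a single trailing extension is what lets a compound suffix such as `.rtfd.zip` be recognized.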

## Pipeline stages

The ingest pipeline runs the following stages in order (`src/cli.ts`, `src/pipeline`).

1. normalize (`src/pipeline/normalize.ts`)
   - Input: raw adapter text (`string`).
   - Output: `NormalizedText` with `fullText` and line metadata (`Line[]`).
   - Contract: normalize newlines to `\n` and assign 1-based line numbers.
2. segment (`src/pipeline/segment.ts`)
   - Input: `Line[]`.
   - Output: `SegmentedText` with `Chunk[]`.
   - Contract: scores potential recipe boundaries and returns one chunk per inferred recipe, each with a best-effort title guess and a confidence score.
3. extract (`src/pipeline/extract.ts`)
   - Input: a `Chunk` plus the full `Line[]`.
   - Output: `IntermediateRecipe` containing title, ingredients, instructions, and source-line evidence.
   - Contract: splits lines into `ingredients` and `instructions` sections by headers; lines before any header fall into `instructions`.
4. toSoustack (`src/pipeline/toSoustack.ts`)
   - Input: `IntermediateRecipe`.
   - Output: `SoustackRecipe` (Soustack JSON shape) with `$schema` (canonical URL), `profile: "lite"`, `stacks` as an object map, normalized `ingredients`/`instructions` string arrays, and ingest metadata.
   - Contract: embeds source path and line range into `metadata.ingest`.
5. validate (`src/pipeline/validate.ts`)
   - Input: `SoustackRecipe`.
   - Output: `ValidationResult` (`ok`, `errors`).
   - Contract: see validator notes below.
6. emit (`src/pipeline/emit.ts`)
   - Input: list of validated `SoustackRecipe` values and an output directory.
   - Output:
     - a JSON index with `name`/`slug`/`path` entries.
     - one JSON file for each recipe.
   - Contract: recipe filenames are slugified from `recipe.name` and truncated to 80 characters.
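
To make a few of these contracts concrete, here is a minimal sketch of the normalize contract, the section split from extract, and the filename rule from emit. The field names on `Line`, the header patterns, the skipping of blank lines, and the slug character rules are all assumptions chosen for illustration; only the newline normalization, the 1-based numbering, the "lines before any header go to instructions" rule, and the 80-character limit come from the descriptions above.

```typescript
// Hypothetical shapes; the real types live in src/pipeline.
interface Line { n: number; text: string }
interface NormalizedText { fullText: string; lines: Line[] }
interface Sections { ingredients: string[]; instructions: string[] }

// normalize: convert CRLF and bare CR to "\n", then number lines from 1.
function normalize(raw: string): NormalizedText {
  const fullText = raw.replace(/\r\n?/g, "\n");
  const lines = fullText.split("\n").map((text, i) => ({ n: i + 1, text }));
  return { fullText, lines };
}

// extract (section split only): lines before any header fall into instructions.
// Assumption: these header patterns, and that blank lines are skipped.
function splitSections(lines: Line[]): Sections {
  const sections: Sections = { ingredients: [], instructions: [] };
  let current: keyof Sections = "instructions";
  for (const line of lines) {
    const text = line.text.trim();
    if (text === "") continue;
    if (/^ingredients:?$/i.test(text)) { current = "ingredients"; continue; }
    if (/^(instructions|directions|method):?$/i.test(text)) { current = "instructions"; continue; }
    sections[current].push(text);
  }
  return sections;
}

// emit (filename rule only): slugify the recipe name, then cap at 80 characters.
// Assumption: lowercase ASCII slug with "-" separators.
function slugifyName(name: string): string {
  const slug = name.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
  return slug.slice(0, 80);
}
```

For instance, under these assumptions a title line that precedes the `Ingredients` header ends up in `instructions`, which matches the stated extract contract.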

## Validation

Validation is intentionally lightweight today. The pipeline starts with a stub validator built from a fallback schema (`src/pipeline/validate.ts`). It then attempts to load `soustack` at runtime:

- If `soustack` exports `validator`, that object is used.
- If it exports `validateRecipe`, it is wrapped into a validator.
- If neither exists or the import fails, the stub validator stays active.
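
The resolution order above could be sketched roughly as follows. This is not the code in `src/pipeline/validate.ts`: the `Validator` interface, the `resolveValidator` helper, the stub's acceptance rule, and the injected `load` callback are assumptions made so the example is self-contained and does not require `soustack` to be installed.

```typescript
interface ValidationResult { ok: boolean; errors: string[] }
interface Validator { validate(recipe: unknown): ValidationResult }

// Hypothetical stub: accepts any non-null object. The real fallback schema may be stricter.
const stubValidator: Validator = {
  validate: (recipe) =>
    typeof recipe === "object" && recipe !== null
      ? { ok: true, errors: [] }
      : { ok: false, errors: ["recipe must be an object"] },
};

// Pick a validator from a loaded module, following the three rules above.
function resolveValidator(mod: unknown): Validator {
  if (typeof mod === "object" && mod !== null) {
    const m = mod as { validator?: unknown; validateRecipe?: unknown };
    const candidate = m.validator as Partial<Validator> | undefined;
    if (candidate && typeof candidate.validate === "function") {
      return candidate as Validator; // rule 1: use the exported validator object
    }
    if (typeof m.validateRecipe === "function") {
      const fn = m.validateRecipe as (recipe: unknown) => ValidationResult;
      return { validate: (recipe) => fn(recipe) }; // rule 2: wrap the function
    }
  }
  return stubValidator; // rule 3: neither export exists
}

let activeValidator: Validator = stubValidator;

// The loader is injected so a failed import simply leaves the stub active.
async function initValidator(load: () => Promise<unknown>): Promise<void> {
  try {
    activeValidator = resolveValidator(await load());
  } catch {
    activeValidator = stubValidator;
  }
}

function validate(recipe: unknown): ValidationResult {
  return activeValidator.validate(recipe);
}
```

In this sketch a caller would run something like `await initValidator(() => import("soustack"))` once at startup, before any `validate()` call.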
To wire soustack validation:
1. Ensure `soustack` is installed (already in `package.json`).
2. Export either a `validator` object with a `validate(recipe)` function, or a `validateRecipe(recipe)` function, from the `soustack` package entry point.
3. Call `initValidator()` once at startup (the CLI does this before any `validate()` calls) so the active validator is set deterministically.

## Development

```bash
npm run build
npm test
npm run ingest --
```

For example:

```bash
npm run ingest -- "/mnt/data/bowman cookbook.rtfd.zip" --out ./output
```