TypeScript (Node.js)


Extract text, tables, images, and metadata from 62+ file formats including PDF, Office documents, and images. Native NAPI-RS bindings for Node.js with superior performance, async/await support, and TypeScript type definitions.

Installation


Install via one of the supported package managers:

npm:

```bash
npm install @kreuzberg/node
```

pnpm:

```bash
pnpm add @kreuzberg/node
```

yarn:

```bash
yarn add @kreuzberg/node
```

- Node.js 22+ required (NAPI-RS native bindings)
- Optional: ONNX Runtime version 1.22.x for embeddings support
- Optional: Tesseract OCR for OCR functionality
- Optional: LibreOffice for legacy Office formats (DOC, XLS, PPT, RTF, ODT, ODS, ODP)

Format Support Notes:

- Modern Office formats (DOCX, XLSX, PPTX) work without LibreOffice
- Legacy formats (DOC, XLS, PPT) require a LibreOffice installation
- The WASM binding supports DOCX, XLSX, PPTX, and ODT (no LibreOffice required)
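The format notes above can be turned into a small pre-flight check before a file is handed to the extractor. The sketch below is not part of `@kreuzberg/node`: `needsLibreOffice` and `fileExtension` are hypothetical helpers, and the extension list is an assumption taken from the legacy formats named in the notes (DOC, XLS, PPT).

```typescript
// Hypothetical helpers, not part of @kreuzberg/node.
// Assumption: only the legacy formats named in the notes above need LibreOffice.
const LEGACY_OFFICE_EXTENSIONS = new Set(['doc', 'xls', 'ppt']);

/** Returns the lowercase extension of a path, or '' if the file name has none. */
function fileExtension(path: string): string {
  // Keep only the file name so dots in directory names are ignored.
  const lastSeparator = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
  const name = path.slice(lastSeparator + 1);
  const dot = name.lastIndexOf('.');
  // dot === 0 means a dotfile such as ".doc", which has no extension.
  return dot > 0 ? name.slice(dot + 1).toLowerCase() : '';
}

/** True when the file extension is one of the legacy Office formats. */
function needsLibreOffice(path: string): boolean {
  return LEGACY_OFFICE_EXTENSIONS.has(fileExtension(path));
}
```

A caller could use this to warn early when a legacy file is queued on a machine without LibreOffice, instead of discovering the problem during extraction.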

Pre-built binaries are available for:

- macOS (arm64, x64)
- Linux (x64)
- Windows (x64)

Quick Start

Extract text, metadata, and structure from any supported document format:

```typescript
import { extractFileSync } from '@kreuzberg/node';

const config = {
  useCache: true,
  enableQualityProcessing: true,
};

const result = extractFileSync('document.pdf', null, config);

console.log(result.content);
console.log(`MIME Type: ${result.mimeType}`);
```


#### Extract with Custom Configuration

Most use cases benefit from configuration to control extraction behavior:

With OCR (for scanned documents):

```typescript
import { extractFile } from '@kreuzberg/node';

const config = {
  ocr: {
    backend: 'tesseract',
    language: 'eng+fra',
    tesseractConfig: {
      psm: 3,
    },
  },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);
```

#### Table Extraction

```typescript
import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf');

for (const table of result.tables) {
  console.log(`Table with ${table.cells.length} rows`);
  console.log(`Page: ${table.pageNumber}`);
  console.log(table.markdown);
}
```
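Extracted tables often need to be passed on to other tools. The following is a minimal sketch of converting a table's cells to CSV. It assumes that `cells` is a row-major array of cell strings, which the use of `table.cells.length` as a row count suggests but does not confirm. `tableToCsv` and `csvEscape` are hypothetical helpers, not part of the library.

```typescript
// Assumption: `cells` is an array of rows, each row an array of cell strings.

/** Quotes a value when it contains a comma, quote, or line break. */
function csvEscape(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

/** Serializes rows of cells into CSV text, one line per row. */
function tableToCsv(cells: string[][]): string {
  return cells.map((row) => row.map(csvEscape).join(',')).join('\n');
}
```

Under that assumption, `tableToCsv(table.cells)` could be called inside the loop above for each extracted table.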

#### Processing Multiple Files

```typescript
import { batchExtractFilesSync } from '@kreuzberg/node';

const files = ['doc1.pdf', 'doc2.docx', 'doc3.pptx'];
const results = batchExtractFilesSync(files);

results.forEach((result, i) => {
  console.log(`File ${i + 1}: ${result.content.length} characters`);
});
```

#### Async Processing

For non-blocking document processing:

```typescript
import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);
```

#### Configuration Discovery

```typescript
import { ExtractionConfig, extractFile } from '@kreuzberg/node';

const config = ExtractionConfig.discover();
if (config) {
  console.log('Found configuration file');
  const result = await extractFile('document.pdf', null, config);
  console.log(result.content);
} else {
  console.log('No configuration file found, using defaults');
  const result = await extractFile('document.pdf');
  console.log(result.content);
}
```

#### Worker Thread Pool

```typescript
import {
  createWorkerPool,
  extractFileInWorker,
  batchExtractFilesInWorker,
  closeWorkerPool,
} from '@kreuzberg/node';

// Create a pool with 4 worker threads
const pool = createWorkerPool(4);

try {
  // Extract single file in worker
  const result = await extractFileInWorker(pool, 'document.pdf', null, { useCache: true });
  console.log(result.content);

  // Extract multiple files concurrently
  const files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
  const results = await batchExtractFilesInWorker(pool, files, { useCache: true });

  results.forEach((result, i) => {
    console.log(`File ${i + 1}: ${result.content.length} characters`);
  });
} finally {
  // Always close the pool when done
  await closeWorkerPool(pool);
}
```

Performance Benefits:

- Parallel Processing: Multiple documents extracted simultaneously
- CPU Utilization: Maximizes multi-core CPU usage for large batches
- Queue Management: Automatically distributes work across available workers
- Resource Control: Prevents thread exhaustion with configurable pool size

Best Practices:

- Use worker pools for batches of 10+ documents
- Set pool size to number of CPU cores (default behavior)
- Always close pools with `closeWorkerPool()` to prevent resource leaks
- Reuse pools across multiple batch operations for efficiency
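The first two recommendations can be expressed as a pair of small decision functions. This is a sketch under stated assumptions: `shouldUseWorkerPool` and `choosePoolSize` are hypothetical helpers, the threshold of 10 comes from the guidance above, and capping the pool at the number of files is an extra refinement that the guidance does not mention.

```typescript
// Hypothetical helpers, not part of @kreuzberg/node.

/** Batch size from which the guidance above recommends a worker pool. */
const WORKER_POOL_THRESHOLD = 10;

/** True when the batch is large enough for a worker pool to be recommended. */
function shouldUseWorkerPool(fileCount: number): boolean {
  return fileCount >= WORKER_POOL_THRESHOLD;
}

/**
 * Picks a pool size: one worker per CPU core, but never more workers
 * than files (an assumption added here) and never fewer than one.
 */
function choosePoolSize(fileCount: number, cpuCores: number): number {
  return Math.max(1, Math.min(cpuCores, fileCount));
}
```

In Node.js the core count can be read from `os.availableParallelism()` and the result passed to `createWorkerPool`.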

- Installation Guide - Platform-specific setup
- API Documentation - Complete API reference
- Examples & Guides - Full code examples and usage guides
- Configuration Guide - Advanced configuration options

NAPI-RS Implementation Details

This binding uses NAPI-RS to provide native Node.js bindings with:

- Zero-copy data transfer between JavaScript and Rust layers
- Native thread pool for concurrent document processing
- Direct memory management for efficient large document handling
- Binary-compatible pre-built native modules across platforms

Threading:

- Single documents are processed synchronously or asynchronously in a dedicated thread
- Batch operations distribute work across available CPU cores
- Thread count is configurable but defaults to system CPU count
- Long-running extractions block the event loop unless you use the async APIs
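When many files are extracted through the async API, it can help to bound how many calls are in flight at once. The sketch below is a generic concurrency limiter, not a feature of the library: `mapWithConcurrency` is a hypothetical helper, and the extraction call is passed in as the `task` argument so the block stays self-contained.

```typescript
/**
 * Runs `task` over `items` with at most `limit` calls in flight,
 * returning results in the same order as the input.
 */
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

For example, `mapWithConcurrency(files, 4, (file) => extractFile(file))` would keep at most four extractions pending at a time, assuming `extractFile` is imported from `@kreuzberg/node`.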

Memory:

- Large documents (> 100 MB) are streamed to avoid loading them entirely into memory
- Temporary files are created in the system temp directory for extraction
- Memory is automatically released after extraction completes
- ONNX models are cached in memory for repeated embeddings operations

Features

62+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

#### Office Documents

| Category | Formats | Capabilities |
|----------|---------|--------------|
| Word Processing | .docx, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .ppt, .ppsx | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |

#### Images (OCR-Enabled)

| Category | Formats | Features |
|----------|---------|----------|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |

#### Web & Data

| Category | Formats | Features |
|----------|---------|----------|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .djot, .rst, .org, .rtf | CommonMark, GFM, Djot, reStructuredText, Org Mode |

#### Email & Archives

| Category | Formats | Features |
|----------|---------|----------|
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |

#### Academic & Scientific

| Category | Formats | Features |
|----------|---------|----------|
| Citations | .bib, .biblatex, .ris, .nbib, .enw, .csl | Structured parsing: RIS (structured), PubMed/MEDLINE, EndNote XML (structured), BibTeX, CSL JSON |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |

Complete Format Reference

- Text Extraction - Extract all text content with position and formatting information
- Metadata Extraction - Retrieve document properties, creation date, author, etc.
- Table Extraction - Parse tables with structure and cell content preservation
- Image Extraction - Extract embedded images and render page previews
- OCR Support - Integrate multiple OCR backends for scanned documents
- Async/Await - Non-blocking document processing with concurrent operations
- Plugin System - Extensible post-processing for custom text transformation
- Embeddings - Generate vector embeddings using ONNX Runtime models
- Batch Processing - Efficiently process multiple documents in parallel
- Memory Efficient - Stream large files without loading entirely into memory
- Language Detection - Detect and support multiple languages in documents
- Configuration - Fine-grained control over extraction behavior

| Format | Speed | Memory | Notes |
|--------|-------|--------|-------|
| PDF (text) | 10-100 MB/s | ~50 MB per doc | Fastest extraction |
| Office docs | 20-200 MB/s | ~100 MB per doc | DOCX, XLSX, PPTX |
| Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend |
| Archives | 5-50 MB/s | ~200 MB per doc | ZIP, TAR, etc. |
| Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
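The throughput ranges above allow a rough estimate of processing time. The sketch below only does the arithmetic: the figures are copied from the table, while `estimateSeconds` and the category keys are hypothetical names chosen for illustration. Real timings depend on hardware and document content.

```typescript
// Throughput ranges in MB/s, copied from the table above.
// The keys are illustrative names, not library identifiers.
const THROUGHPUT_MBPS = {
  'pdf-text': { min: 10, max: 100 },
  office: { min: 20, max: 200 },
  'image-ocr': { min: 1, max: 5 },
  archive: { min: 5, max: 50 },
  web: { min: 50, max: 200 },
} as const;

type FormatCategory = keyof typeof THROUGHPUT_MBPS;

/** Returns best-case and worst-case processing time in seconds. */
function estimateSeconds(
  sizeMB: number,
  category: FormatCategory,
): { bestCase: number; worstCase: number } {
  const range = THROUGHPUT_MBPS[category];
  return { bestCase: sizeMB / range.max, worstCase: sizeMB / range.min };
}
```

By these figures, 500 MB of text-based PDFs would take between 5 and 50 seconds, and the same volume of scanned images between 100 and 500 seconds.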

OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:

- Tesseract

- Guten

```typescript
import { extractFile } from '@kreuzberg/node';

const config = {
  ocr: {
    backend: 'tesseract',
    language: 'eng+fra',
    tesseractConfig: {
      psm: 3,
    },
  },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);
```
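The `language` value above joins Tesseract language codes with `+`, and `psm` is Tesseract's page segmentation mode, which ranges from 0 to 13. The sketch below builds the same config shape from a list of languages. `buildTesseractConfig` and the `OcrExtractionConfig` interface are hypothetical and mirror only the fields shown in the example; they are not the library's exported types.

```typescript
// Local type mirroring the config shape used in the example above.
// Assumption: illustrative only, not the library's exported type.
interface OcrExtractionConfig {
  ocr: {
    backend: 'tesseract';
    language: string;
    tesseractConfig: { psm: number };
  };
}

/** Builds a Tesseract OCR config from language codes and a page segmentation mode. */
function buildTesseractConfig(languages: readonly string[], psm = 3): OcrExtractionConfig {
  if (languages.length === 0) {
    throw new Error('At least one language code is required');
  }
  // Tesseract defines page segmentation modes 0 through 13.
  if (!Number.isInteger(psm) || psm < 0 || psm > 13) {
    throw new RangeError(`psm must be an integer from 0 to 13, got ${psm}`);
  }
  return {
    ocr: {
      backend: 'tesseract',
      // Tesseract expects multiple languages joined with '+', e.g. 'eng+fra'.
      language: languages.join('+'),
      tesseractConfig: { psm },
    },
  };
}
```

Calling `buildTesseractConfig(['eng', 'fra'])` produces the same object as the hand-written config above.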

Async Support

This binding provides full async/await support for non-blocking document processing:

```typescript
import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);
```

Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, visit Plugin System Guide.

Embeddings Support

Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.

Embeddings Guide

Batch Processing

Process multiple documents efficiently:

```typescript
import { batchExtractFilesSync } from '@kreuzberg/node';

const files = ['doc1.pdf', 'doc2.docx', 'doc3.pptx'];
const results = batchExtractFilesSync(files);

results.forEach((result, i) => {
  console.log(`File ${i + 1}: ${result.content.length} characters`);
});
```
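Batch results are easier to report when paired with their input paths. The sketch below assumes that results come back in the same order as the input files, which the index-based logging above implies but does not guarantee. `summarizeBatch` and both interfaces are hypothetical and cover only the `content` and `mimeType` fields used elsewhere in this README.

```typescript
// Minimal result shape, limited to fields shown in this README.
interface ExtractedDocument {
  content: string;
  mimeType: string;
}

interface BatchSummaryRow {
  file: string;
  mimeType: string;
  characters: number;
}

/**
 * Pairs each input path with its result.
 * Assumption: results[i] corresponds to files[i].
 */
function summarizeBatch(
  files: readonly string[],
  results: readonly ExtractedDocument[],
): BatchSummaryRow[] {
  if (files.length !== results.length) {
    throw new Error(`Expected ${files.length} results, got ${results.length}`);
  }
  return files.map((file, i) => ({
    file,
    mimeType: results[i].mimeType,
    characters: results[i].content.length,
  }));
}
```

The returned rows can be printed with `console.table(summarizeBatch(files, results))`.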

Configuration

For advanced configuration options including language detection, table extraction, OCR settings, and more:

Configuration Guide

Documentation

- Official Documentation
- API Reference
- Examples & Guides

Contributing

Contributions are welcome! See Contributing Guide.

License

MIT License - see LICENSE file for details.

Support

- Discord Community: Join our Discord
- GitHub Issues: Report bugs
- Discussions: Ask questions

