WebAssembly


Extract text, tables, images, and metadata from 62+ file formats including PDF, Office documents, and images. WebAssembly bindings for browsers, Deno, and Cloudflare Workers with portable deployment and multi-threading support.

Installation

### Package Installation

Install via one of the supported package managers:

npm:

```bash
npm install @kreuzberg/wasm
```

pnpm:

```bash
pnpm add @kreuzberg/wasm
```

yarn:

```bash
yarn add @kreuzberg/wasm
```

### System Requirements

- Modern browser with WebAssembly support, or Deno 1.0+, or Cloudflare Workers
- Optional: Tesseract WASM for OCR functionality
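If it helps to verify the first requirement before loading the package, a runtime check can be sketched with the standard `WebAssembly` global alone. The helper name `supportsWebAssembly` is hypothetical and not part of `@kreuzberg/wasm`.

```typescript
// Hypothetical helper (not part of @kreuzberg/wasm): reports whether the
// current runtime exposes a usable WebAssembly implementation.
function supportsWebAssembly(): boolean {
  try {
    if (typeof WebAssembly !== "object" || typeof WebAssembly.validate !== "function") {
      return false;
    }
    // Smallest valid module: the "\0asm" magic bytes followed by version 1.
    const emptyModule = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);
    return WebAssembly.validate(emptyModule);
  } catch {
    // Some locked-down environments throw when WebAssembly is touched.
    return false;
  }
}
```

A page could call this once at startup and show a fallback message when it returns `false`.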

Quick Start

### Basic Usage

Extract text, metadata, and structure from any supported document format:

```ts
import { extractBytes, initWasm } from "@kreuzberg/wasm";

async function main() {
  // Load the WASM module before calling any extraction function.
  await initWasm();

  const buffer = await fetch("document.pdf").then((r) => r.arrayBuffer());
  const bytes = new Uint8Array(buffer);

  const result = await extractBytes(bytes, "application/pdf");

  console.log("Extracted content:");
  console.log(result.content);
  console.log("MIME type:", result.mimeType);
  console.log("Metadata:", result.metadata);
}

main().catch(console.error);
```
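Because `extractBytes` takes a MIME type alongside the bytes, callers that only have a file name need some way to derive one. The sketch below is one possible approach: `guessMimeType` and its lookup table are hypothetical and not part of `@kreuzberg/wasm`. The MIME strings are the standard registered ones, but apart from `application/pdf` and `image/png` it is an assumption that the library accepts them in exactly this form.

```typescript
// Hypothetical lookup table (not part of @kreuzberg/wasm): a few common
// extensions mapped to their standard MIME types.
const MIME_BY_EXTENSION = new Map<string, string>([
  ["pdf", "application/pdf"],
  ["docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
  ["xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
  ["pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation"],
  ["png", "image/png"],
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["html", "text/html"],
  ["txt", "text/plain"],
  ["csv", "text/csv"],
  ["json", "application/json"],
]);

// Returns the MIME type for a file name, or undefined when the extension
// is missing or not in the table.
function guessMimeType(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0 || dot === fileName.length - 1) {
    return undefined;
  }
  const extension = fileName.slice(dot + 1).toLowerCase();
  return MIME_BY_EXTENSION.get(extension);
}
```

Returning `undefined` for unknown extensions leaves the caller to decide whether to skip the file or ask the user.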

### Common Use Cases

#### Extract with Custom Configuration

Most use cases benefit from configuration to control extraction behavior:

With OCR (for scanned documents):

```ts
import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";

async function extractWithOcr() {
  await initWasm();

  // OCR is opt-in; stop early if the backend cannot be loaded.
  try {
    await enableOcr();
    console.log("OCR enabled successfully");
  } catch (error) {
    console.error("Failed to enable OCR:", error);
    return;
  }

  const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));

  const result = await extractBytes(bytes, "image/png", {
    ocr: {
      backend: "tesseract-wasm",
      language: "eng",
    },
  });

  console.log("Extracted text:");
  console.log(result.content);
}

extractWithOcr().catch(console.error);
```

#### Table Extraction

See Table Extraction Guide for detailed examples.

#### Processing Multiple Files

```ts
import { extractBytes, initWasm } from "@kreuzberg/wasm";

interface DocumentJob {
  name: string;
  bytes: Uint8Array;
  mimeType: string;
}

async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
  await initWasm();

  // Extracted text keyed by document name.
  const results: Record<string, string> = {};
  const queue = [...documents];

  // Each worker pulls jobs from the shared queue until it is empty.
  const workers = Array(concurrency)
    .fill(null)
    .map(async () => {
      while (queue.length > 0) {
        const doc = queue.shift();
        if (!doc) break;

        try {
          const result = await extractBytes(doc.bytes, doc.mimeType);
          results[doc.name] = result.content;
        } catch (error) {
          console.error(`Failed to process ${doc.name}:`, error);
        }
      }
    });

  await Promise.all(workers);
  return results;
}
```

#### Async Processing

For non-blocking document processing:

```ts
import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";

async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
  const caps = getWasmCapabilities();
  if (!caps.hasWasm) {
    throw new Error("WebAssembly not supported");
  }

  await initWasm();

  // Start all extractions at once and wait for every one to finish.
  const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));

  return results.map((r) => ({
    content: r.content,
    pageCount: r.metadata?.pageCount,
  }));
}

// Placeholder bytes; replace with the contents of a real document.
const fileBytes = [new Uint8Array([1, 2, 3])];
const mimes = ["application/pdf"];

extractDocuments(fileBytes, mimes)
  .then((results) => console.log(results))
  .catch(console.error);
```
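`Promise.all` rejects as soon as any single extraction fails, which discards the results of the documents that succeeded. Where partial results are preferable, `Promise.allSettled` is one alternative. The sketch below is an illustration, not the library's method: `extractAllSettled`, `Extractor`, and `Outcome` are hypothetical names, and the extractor is passed in as a parameter (for example `extractBytes`) so the helper does not depend on the package.

```typescript
// Hypothetical types: any function with the (bytes, mimeType) shape fits.
type Extractor<T> = (bytes: Uint8Array, mimeType: string) => Promise<T>;

type Outcome<T> =
  | { index: number; ok: true; value: T }
  | { index: number; ok: false; error: unknown };

// Runs every extraction and reports success or failure per document,
// in input order, instead of rejecting on the first failure.
async function extractAllSettled<T>(
  files: Uint8Array[],
  mimeTypes: string[],
  extract: Extractor<T>,
): Promise<Outcome<T>[]> {
  if (files.length !== mimeTypes.length) {
    throw new Error("files and mimeTypes must have the same length");
  }

  // The async arrow also converts synchronous throws into rejections.
  const settled = await Promise.allSettled(
    files.map(async (bytes, i) => extract(bytes, mimeTypes[i] as string)),
  );

  return settled.map((s, index): Outcome<T> =>
    s.status === "fulfilled"
      ? { index, ok: true, value: s.value }
      : { index, ok: false, error: s.reason },
  );
}
```

Callers can then filter on `ok` to separate extracted documents from the ones that need a retry or a report.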

### Next Steps

- Installation Guide - Platform-specific setup
- API Documentation - Complete API reference
- Examples & Guides - Full code examples and usage guides
- Configuration Guide - Advanced configuration options

Features

### Supported File Formats

62+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

#### Office Documents

| Category | Formats | Capabilities |
|----------|---------|--------------|
| Word Processing | `.docx, .odt` | Full text, tables, images, metadata, styles |
| Spreadsheets | `.xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods` | Sheet data, formulas, cell metadata, charts |
| Presentations | `.pptx, .ppt, .ppsx` | Slides, speaker notes, images, metadata |
| PDF | `.pdf` | Text, tables, images, metadata, OCR support |
| eBooks | `.epub, .fb2` | Chapters, metadata, embedded resources |

#### Images (OCR-Enabled)

| Category | Formats | Features |
|----------|---------|----------|
| Raster | `.png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif` | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | `.jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm` | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| Vector | `.svg` | DOM parsing, embedded text, graphics metadata |

#### Web & Data

| Category | Formats | Features |
|----------|---------|----------|
| Markup | `.html, .htm, .xhtml, .xml, .svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | `.json, .yaml, .yml, .toml, .csv, .tsv` | Schema detection, nested structures, validation |
| Text & Markdown | `.txt, .md, .markdown, .djot, .rst, .org, .rtf` | CommonMark, GFM, Djot, reStructuredText, Org Mode |

#### Email & Archives

| Category | Formats | Features |
|----------|---------|----------|
| Email | `.eml, .msg` | Headers, body (HTML/plain), attachments, threading |
| Archives | `.zip, .tar, .tgz, .gz, .7z` | File listing, nested archives, metadata |

#### Academic & Scientific

| Category | Formats | Features |
|----------|---------|----------|
| Citations | `.bib, .biblatex, .ris, .nbib, .enw, .csl` | Structured parsing: RIS (structured), PubMed/MEDLINE, EndNote XML (structured), BibTeX, CSL JSON |
| Scientific | `.tex, .latex, .typst, .jats, .ipynb, .docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | `.opml, .pod, .mdoc, .troff` | Technical documentation formats |

Complete Format Reference

### Key Capabilities

- Text Extraction - Extract all text content with position and formatting information
- Metadata Extraction - Retrieve document properties, creation date, author, etc.
- Table Extraction - Parse tables with structure and cell content preservation
- Image Extraction - Extract embedded images and render page previews
- OCR Support - Integrate multiple OCR backends for scanned documents
- Async/Await - Non-blocking document processing with concurrent operations
- Plugin System - Extensible post-processing for custom text transformation
- Batch Processing - Efficiently process multiple documents in parallel
- Memory Efficient - Stream large files without loading entirely into memory
- Language Detection - Detect and support multiple languages in documents
- Configuration - Fine-grained control over extraction behavior

### Performance Characteristics

| Format | Speed | Memory | Notes |
|--------|-------|--------|-------|
| PDF (text) | 10-100 MB/s | ~50MB per doc | Fastest extraction |
| Office docs | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
| Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend |
| Archives | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
| Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
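To turn the throughput ranges in the table into a rough time budget, divide the file size by each end of the range. The sketch below does only that arithmetic. `estimateSeconds` and `Throughput` are hypothetical names, and the result is a back-of-the-envelope bound rather than a measured figure, since actual speed depends on the runtime and the document.

```typescript
// Hypothetical helper: throughput range in MB/s, as listed in the table.
interface Throughput {
  minMBps: number;
  maxMBps: number;
}

// Returns the best-case and worst-case extraction time in seconds
// for a file of the given size.
function estimateSeconds(sizeMB: number, rate: Throughput): { best: number; worst: number } {
  if (sizeMB < 0 || rate.minMBps <= 0 || rate.maxMBps < rate.minMBps) {
    throw new RangeError("size must be non-negative and 0 < minMBps <= maxMBps");
  }
  return {
    best: sizeMB / rate.maxMBps, // fastest rate gives the shortest time
    worst: sizeMB / rate.minMBps, // slowest rate gives the longest time
  };
}

// A 20 MB text PDF at 10-100 MB/s: between 0.2 s and 2 s.
const pdfEstimate = estimateSeconds(20, { minMBps: 10, maxMBps: 100 });

// A 10 MB scanned image at 1-5 MB/s with OCR: between 2 s and 10 s.
const ocrEstimate = estimateSeconds(10, { minMBps: 1, maxMBps: 5 });
```

The same numbers show why OCR dominates a mixed batch: at the table's rates, a scanned image takes roughly ten to twenty times longer per megabyte than a text PDF.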

OCR Support

Kreuzberg uses OCR backends to extract text from scanned documents and images. The following backend is available in this binding:

- Tesseract-Wasm

### OCR Configuration Example

```ts
import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";

async function extractWithOcr() {
  await initWasm();

  // OCR is opt-in; stop early if the backend cannot be loaded.
  try {
    await enableOcr();
    console.log("OCR enabled successfully");
  } catch (error) {
    console.error("Failed to enable OCR:", error);
    return;
  }

  const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));

  const result = await extractBytes(bytes, "image/png", {
    ocr: {
      backend: "tesseract-wasm",
      language: "eng",
    },
  });

  console.log("Extracted text:");
  console.log(result.content);
}

extractWithOcr().catch(console.error);
```
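When a pipeline handles both images and text documents, the OCR options above only need to be passed for inputs that actually require OCR. One possible way to decide is by MIME type, sketched below. `ocrOptionsFor` and `OcrOptions` are hypothetical names; the options shape copies the example above, and treating every raster `image/*` type as an OCR candidate is an assumption (scanned PDFs, for instance, are not covered by this rule).

```typescript
// Hypothetical type mirroring the options object used in the example above.
interface OcrOptions {
  ocr: {
    backend: "tesseract-wasm";
    language: string;
  };
}

// Returns OCR options for raster image MIME types and undefined otherwise.
// SVG is excluded because it is a vector format with embedded text.
function ocrOptionsFor(mimeType: string, language: string = "eng"): OcrOptions | undefined {
  const normalized = mimeType.trim().toLowerCase();
  if (!normalized.startsWith("image/") || normalized.startsWith("image/svg")) {
    return undefined;
  }
  return { ocr: { backend: "tesseract-wasm", language } };
}
```

The returned value could be passed as the third argument in the same position as the options object in the example above, for instance `ocrOptionsFor("image/png")`.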

Async Support

This binding provides full async/await support for non-blocking document processing:

```ts
import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";

async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
  const caps = getWasmCapabilities();
  if (!caps.hasWasm) {
    throw new Error("WebAssembly not supported");
  }

  await initWasm();

  // Start all extractions at once and wait for every one to finish.
  const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));

  return results.map((r) => ({
    content: r.content,
    pageCount: r.metadata?.pageCount,
  }));
}

// Placeholder bytes; replace with the contents of a real document.
const fileBytes = [new Uint8Array([1, 2, 3])];
const mimes = ["application/pdf"];

extractDocuments(fileBytes, mimes)
  .then((results) => console.log(results))
  .catch(console.error);
```

Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, visit Plugin System Guide.
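As an illustration of the kind of text transformation meant here, the sketch below chains plain functions over extracted text. It is not Kreuzberg's plugin interface, which the Plugin System Guide documents; `TextTransform`, `composeTransforms`, and the two example transforms are hypothetical names that could be applied to `result.content` after extraction.

```typescript
// Hypothetical type: a post-processing step is a function from text to text.
type TextTransform = (text: string) => string;

// Removes soft hyphens (U+00AD) that PDFs often leave inside words.
const stripSoftHyphens: TextTransform = (text) => text.replace(/\u00AD/g, "");

// Collapses runs of spaces and tabs, limits blank lines to one, and trims.
const collapseWhitespace: TextTransform = (text) =>
  text
    .replace(/[ \t]+/g, " ")
    .replace(/\n{3,}/g, "\n\n")
    .trim();

// Builds one transform that applies the given steps from left to right.
function composeTransforms(...steps: TextTransform[]): TextTransform {
  return (text) => steps.reduce((current, step) => step(current), text);
}

const cleanText = composeTransforms(stripSoftHyphens, collapseWhitespace);
```

Keeping each step a pure function makes the order explicit and lets individual steps be tested on their own.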

Batch Processing

Process multiple documents efficiently:

```ts
import { extractBytes, initWasm } from "@kreuzberg/wasm";

interface DocumentJob {
  name: string;
  bytes: Uint8Array;
  mimeType: string;
}

async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
  await initWasm();

  // Extracted text keyed by document name.
  const results: Record<string, string> = {};
  const queue = [...documents];

  // Each worker pulls jobs from the shared queue until it is empty.
  const workers = Array(concurrency)
    .fill(null)
    .map(async () => {
      while (queue.length > 0) {
        const doc = queue.shift();
        if (!doc) break;

        try {
          const result = await extractBytes(doc.bytes, doc.mimeType);
          results[doc.name] = result.content;
        } catch (error) {
          console.error(`Failed to process ${doc.name}:`, error);
        }
      }
    });

  await Promise.all(workers);
  return results;
}
```
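The batch example keys results by name and only logs failures, so a caller cannot tell afterwards which documents failed. A variation that keeps results in input order and returns errors in place is sketched below. `mapWithConcurrency` is a hypothetical, library-independent helper: the task is passed in, so it could wrap `extractBytes` or any other async function.

```typescript
// Hypothetical helper: runs `task` over `items` with at most `limit` tasks
// in flight, returning one entry per item in input order. A failed task
// yields an Error at its position instead of rejecting the whole batch.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<Array<R | Error>> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }

  const results: Array<R | Error> = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++; // claim synchronously, before any await
      try {
        results[index] = await task(items[index] as T, index);
      } catch (error) {
        results[index] = error instanceof Error ? error : new Error(String(error));
      }
    }
  };

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With the `DocumentJob` shape above, a call might look like `mapWithConcurrency(documents, 3, (doc) => extractBytes(doc.bytes, doc.mimeType))`, followed by an `instanceof Error` check per entry.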

Configuration

For advanced configuration options including language detection, table extraction, OCR settings, and more:

Configuration Guide

Documentation

- Official Documentation
- API Reference
- Examples & Guides

Contributing

Contributions are welcome! See Contributing Guide.

License

MIT License - see LICENSE file for details.

Support

- Discord Community: Join our Discord
- GitHub Issues: Report bugs
- Discussions: Ask questions


