Advanced client-side PDF to HTML converter with WASM parsing, OCR support, and intelligent text layout reconstruction. Perfect for document management systems and web applications.


This library was built primarily to support high-fidelity PDF/DOCX imports in Venmail Drive. Most PDF-to-HTML pipelines pick one tradeoff: either pixel-perfect output that is hard to edit, or “flow” output that drifts and overlaps. The goal is a simple, one-stop workflow for document imports that works in offline-first applications.
pdf2html-client is built around a multi-mode text layout engine:
- High fidelity when you need it (absolute/smart positioned text)
- Editability when you want it (flow/outline-flow)
- Semantic structure with layout awareness (semantic regions + flexbox)
- Overlap-aware fallbacks for sensitive areas where reflow would break readability
All of this runs in the browser (via PDFium or unpdf).
- WASM PDF parsing
  - Primary: PDFium (WebAssembly)
  - Fallback/alternative: unpdf
  - Select via `parserStrategy: 'auto' | 'pdfium' | 'unpdf'`
- Multiple text layout modes (see below)
- Optional OCR for scanned PDFs
  - Uses onnxruntime-web + OpenCV.js
  - Automatically detects scanned PDFs and only runs OCR when it makes sense
- Font detection + mapping
  - Detects fonts from extracted text
  - Maps fonts using an internal font catalog
- Output formats
  - `html`, `css`, plus metadata (processing time, page count, OCR used, font mappings, image stats)
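
The "only runs OCR when it makes sense" behavior can be pictured with a small heuristic. The sketch below is an illustration only, not the library's detection logic: the `PageStats` shape, the function names, and the thresholds are all assumptions made for this example.

```typescript
// Hypothetical per-page statistics; these names are not part of pdf2html-client.
interface PageStats {
  textChars: number;     // characters found in the PDF text layer
  imageCoverage: number; // fraction of the page area covered by images, 0..1
}

// A page "looks scanned" when it has almost no text layer but is mostly image.
// The default thresholds are illustrative guesses, not tuned values.
function looksScanned(page: PageStats, minChars = 20, minCoverage = 0.8): boolean {
  return page.textChars < minChars && page.imageCoverage >= minCoverage;
}

// Run OCR only when more than half of the pages look scanned.
function shouldRunOcr(pages: PageStats[]): boolean {
  if (pages.length === 0) return false;
  const scanned = pages.filter((p) => looksScanned(p)).length;
  return scanned / pages.length > 0.5;
}
```

A text-based PDF with a healthy text layer would skip OCR under this rule, while a document made of full-page images would trigger it.
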
```bash
pnpm add pdf2html-client
```
- Bundled by default: PDFium (primary) and unpdf (fallback) so core parsing works out-of-the-box.
- External (you must provide):
  - `pdfjs-dist` (used as an additional parser path)
  - `onnxruntime-web` and `@techstark/opencv-js` (used only when OCR is enabled)
For bundlers, mark these as externals/peer-like. For UMD/CDN usage, ensure these scripts are available globally before loading the library.
If you enable OCR, you should download the lightweight OCR models ahead of time:
```bash
pnpm run download-models
```
This downloads models into models/.
I want to edit documents and extract content
```ts
import { PDF2HTML } from 'pdf2html-client';
const result = await PDF2HTML.convertForEditing(pdfFile);
```
I want pixel-perfect document display
```ts
import { PDF2HTML } from 'pdf2html-client';
const result = await PDF2HTML.convertForFidelity(pdfFile);
```
I want responsive web documents
```ts
import { PDF2HTML } from 'pdf2html-client';
const result = await PDF2HTML.convertForWeb(pdfFile);
```
Let the library choose the best option
```ts
import { PDF2HTML } from 'pdf2html-client';
const { result, presetUsed, reason } = await PDF2HTML.convertAuto(pdfFile);
console.log(`Used ${presetUsed} preset: ${reason}`);
```
```ts
console.log(result.html); // HTML markup
console.log(result.css); // CSS styles
console.log(result.text); // Extracted text (if enabled)
console.log(result.metadata); // Processing info
```
That's it! The library handles everything else automatically.
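
If the output needs to be saved or previewed as a single file, the `html` and `css` fields can be combined by hand. The helper below is a minimal sketch under two assumptions: that `result.html` is a body fragment and that `result.css` is a plain stylesheet. `ConvertedOutput`, `escapeHtml`, and `toStandaloneDocument` are names invented for this example; if the library already emits a complete document, a wrapper like this is unnecessary.

```typescript
// Assumed shape: only the `html` and `css` fields documented for the result.
interface ConvertedOutput {
  html: string;
  css: string;
}

// Escape text so it is safe to place inside <title>.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Wrap the generated markup and styles in a minimal standalone page.
function toStandaloneDocument(output: ConvertedOutput, title: string): string {
  return [
    '<!DOCTYPE html>',
    '<html>',
    '<head>',
    '<meta charset="utf-8">',
    `<title>${escapeHtml(title)}</title>`,
    `<style>${output.css}</style>`,
    '</head>',
    `<body>${output.html}</body>`,
    '</html>',
  ].join('\n');
}
```

Calling `toStandaloneDocument(result, 'Imported document')` would then yield a string suitable for a `Blob` download or an `iframe` preview.
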
Install peer dependencies and mark them as externals in your bundler config:
```bash
pnpm add pdfjs-dist onnxruntime-web @techstark/opencv-js
```
```ts
// Vite example
export default {
  build: {
    rollupOptions: {
      external: ['pdfjs-dist', 'onnxruntime-web', '@techstark/opencv-js']
    }
  }
}
```
```ts
import { PDF2HTML } from 'pdf2html-client';
const converter = new PDF2HTML({
  parserStrategy: 'auto', // or 'pdfjs' to explicitly use pdfjs-dist
  enableOCR: true // requires onnxruntime-web and @techstark/opencv-js
});
```
Load the external scripts before the library so that `pdfjs-dist`, `onnxruntime-web`, and `@techstark/opencv-js` are available globally.
Notes:
- PDF.js is only required when using `parserStrategy: 'pdfjs'` or when `auto` chooses it. If missing, the library falls back to bundled PDFium/unpdf.
- OCR dependencies are only required when `enableOCR: true`. They are lazy-loaded on first OCR use.
- Ensure OCR models are available at `models/` or provide custom URLs via `ocrConfig`.
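
The fallback described in the first note can be thought of as an ordered list of parsers to try. The sketch below is a guess at one reasonable ordering, not the library's actual selection logic: `parserOrder`, the `ParserName` type, and in particular the position of PDF.js under `'auto'` are assumptions made for illustration.

```typescript
type ParserName = 'pdfium' | 'unpdf' | 'pdfjs';
type ParserStrategy = 'auto' | ParserName;

// Returns the parsers to try, in order. `pdfjsAvailable` stands in for
// "the external pdfjs-dist dependency was provided by the host app".
function parserOrder(strategy: ParserStrategy, pdfjsAvailable: boolean): ParserName[] {
  const bundled: ParserName[] = ['pdfium', 'unpdf'];
  if (strategy === 'pdfjs') {
    // Explicit request: use PDF.js when present, otherwise fall back to bundled parsers.
    return pdfjsAvailable ? ['pdfjs', 'pdfium', 'unpdf'] : bundled;
  }
  if (strategy === 'unpdf') {
    return ['unpdf', 'pdfium'];
  }
  if (strategy === 'pdfium') {
    return bundled;
  }
  // 'auto' (assumption): bundled parsers first, PDF.js last and only when provided.
  return pdfjsAvailable ? ['pdfium', 'unpdf', 'pdfjs'] : bundled;
}
```

The point of the sketch is that a missing optional dependency shortens the list instead of causing a failure.
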
```ts
import { PDF2HTML } from 'pdf2html-client';
// For document editing (most common use case)
const editable = await PDF2HTML.convertForEditing(pdfFile);
// For high-fidelity display
const faithful = await PDF2HTML.convertForFidelity(pdfFile);
// For web-optimized responsive output
const responsive = await PDF2HTML.convertForWeb(pdfFile);
```
```ts
import { PDF2HTML } from 'pdf2html-client';
// Create converter for specific use case
const converter = PDF2HTML.forEditing();
const result = await converter.convert(pdfFile);
converter.dispose();
```
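
When a converter instance is created explicitly, `dispose()` should run even if the conversion throws. One way to guarantee that is a small `try`/`finally` wrapper. The helper below is a generic sketch: `DisposableResource` and `withDisposable` are names made up for this example and are not exported by the library.

```typescript
// Minimal shape assumed here: anything with a `dispose()` method.
interface DisposableResource {
  dispose(): void;
}

// Runs `work` with the resource and always disposes it afterwards,
// whether `work` resolves or rejects.
async function withDisposable<T extends DisposableResource, R>(
  resource: T,
  work: (resource: T) => Promise<R>,
): Promise<R> {
  try {
    return await work(resource);
  } finally {
    resource.dispose();
  }
}
```

With a converter in hand, usage would look like `await withDisposable(PDF2HTML.forEditing(), (c) => c.convert(pdfFile))`.
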
```ts
import { PDF2HTML } from 'pdf2html-client';
// Build configuration step by step
const converter = new PDF2HTML()
  .enableOCR(true)
  .enableFontMapping(true)
  .setTextLayout('semantic')
  .setPreserveLayout(true)
  .setResponsive(true)
  .setDarkMode(false)
  .setImageFormat('base64')
  .includeExtractedText(true)
  .setMaxConcurrentPages(2);
const result = await converter.convert(pdfFile);
converter.dispose();
```
```ts
import { PDF2HTML } from 'pdf2html-client';
// Start with a preset and customize
const converter = new PDF2HTML()
  .applyPreset('editing')
  .setDarkMode(true) // Override preset setting
  .setImageFormat('url'); // Override preset setting
const result = await converter.convert(pdfFile);
converter.dispose();
```
```ts
import { PDF2HTML } from 'pdf2html-client';
const converter = new PDF2HTML({
  enableOCR: false,
  enableFontMapping: false,
  parserStrategy: 'auto',
  htmlOptions: {
    format: 'html+inline-css',
    preserveLayout: true,
    responsive: false,
    darkMode: false,
    imageFormat: 'base64',
    textLayout: 'semantic', // Default mode - flow semantic with layout awareness
    textLayoutPasses: 2,
    textPipeline: 'v2',
    includeExtractedText: true
  }
});
const out = await converter.convert(pdfFile, (p) => {
  console.log(`${p.stage}: ${p.progress}%`);
});
console.log(out.html);
console.log(out.css);
console.log(out.metadata);
converter.dispose();
```
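
The progress callback may fire often on large documents, so a UI may want to forward only meaningful changes. The sketch below assumes the event carries a `stage` string and a `progress` percentage from 0 to 100, as the logging line suggests; `ConversionProgress` and `throttleProgress` are hypothetical names, and the library's real event type may contain more fields.

```typescript
// Assumed event shape, based on the `p.stage` and `p.progress` fields used in the callback.
interface ConversionProgress {
  stage: string;
  progress: number; // percentage, assumed to be 0..100
}

// Returns a callback that forwards an event only when the stage changes,
// the percentage has moved by at least `minStep` points, or the value reaches 100.
function throttleProgress(
  report: (p: ConversionProgress) => void,
  minStep = 5,
): (p: ConversionProgress) => void {
  let last: ConversionProgress | undefined;
  return (p) => {
    const changedStage = last === undefined || last.stage !== p.stage;
    const movedEnough =
      last !== undefined && Math.abs(p.progress - last.progress) >= minStep;
    if (changedStage || movedEnough || p.progress >= 100) {
      last = p;
      report(p);
    }
  };
}
```

The wrapped function can be passed directly as the second argument to `convert()`.
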
Perfect for document management systems where users need to edit and extract content from PDFs:
```ts
import { PDF2HTML } from 'pdf2html-client';
const converter = new PDF2HTML({
  enableOCR: true, // Handle scanned documents
  enableFontMapping: true, // Better font fidelity
  htmlOptions: {
    textLayout: 'flow', // Maximum editability
    preserveLayout: false, // Semantic HTML structure
    format: 'html+inline-css',
    responsive: true,
    includeExtractedText: true, // Easy copy-paste
    imageFormat: 'base64'
  }
});
const result = await converter.convert(pdfFile);
// result.html contains clean, editable semantic HTML
// result.text contains extracted text for search/indexing
```
Ideal for document viewers and archival systems where visual accuracy is paramount:
```ts
import { PDF2HTML } from 'pdf2html-client';
const converter = new PDF2HTML({
  enableOCR: false, // Skip OCR for text PDFs
  enableFontMapping: true,
  htmlOptions: {
    textLayout: 'absolute', // Pixel-perfect positioning
    preserveLayout: true,
    format: 'html+inline-css',
    responsive: false,
    darkMode: false,
    imageFormat: 'base64',
    textLayoutPasses: 1, // Faster processing
    textPipeline: 'legacy' // Proven stability
  }
});
const result = await converter.convert(pdfFile);
// result.html maintains exact PDF visual layout
```
Best for web applications that need responsive, accessible documents:
```ts
import { PDF2HTML } from 'pdf2html-client';
const converter = new PDF2HTML({
  enableOCR: true,
  enableFontMapping: false, // Faster loading
  htmlOptions: {
    textLayout: 'semantic', // Best of both worlds
    preserveLayout: true,
    format: 'html+css', // Separate CSS for caching
    responsive: true,
    darkMode: true, // Support dark theme
    imageFormat: 'url', // Better performance
    useFlexboxLayout: true, // Modern layout
    semanticLayout: {
      blockGapFactor: 1.2,
      headingThreshold: 0.8
    }
  }
});
const result = await converter.convert(pdfFile);
// Responsive HTML that adapts to screen sizes
// Semantic structure for accessibility
```
For convenience, you can use pre-configured presets:
```ts
import { PDF2HTML, ConfigPresets } from 'pdf2html-client';
// Document editing preset
const editingConverter = new PDF2HTML(ConfigPresets.editing);
// High-fidelity display preset
const fidelityConverter = new PDF2HTML(ConfigPresets.fidelity);
// Web-optimized preset
const webConverter = new PDF2HTML(ConfigPresets.web);
// You can also customize presets
const customConverter = new PDF2HTML({
  ...ConfigPresets.editing,
  htmlOptions: {
    ...ConfigPresets.editing.htmlOptions,
    darkMode: true // Override preset setting
  }
});
```
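
Customizing a preset needs a nested spread because a plain top-level spread would replace the whole `htmlOptions` object. If this pattern comes up often, it can be wrapped in a helper. The sketch below uses simplified stand-in types (`ConfigLike`, `HtmlOptionsLike`) and a made-up `customizePreset` function; the real `PDF2HTMLConfig` has more fields than shown here.

```typescript
// Simplified stand-ins for the config shape; only a few fields from this README.
interface HtmlOptionsLike {
  textLayout?: string;
  darkMode?: boolean;
  imageFormat?: 'base64' | 'url';
}
interface ConfigLike {
  enableOCR?: boolean;
  enableFontMapping?: boolean;
  htmlOptions?: HtmlOptionsLike;
}

// Merge overrides into a preset without mutating it. Top-level fields are
// replaced; `htmlOptions` is merged one level deep so preset values survive.
function customizePreset(preset: ConfigLike, overrides: ConfigLike): ConfigLike {
  return {
    ...preset,
    ...overrides,
    htmlOptions: { ...preset.htmlOptions, ...overrides.htmlOptions },
  };
}
```

For example, `customizePreset(ConfigPresets.editing, { htmlOptions: { darkMode: true } })` would keep the preset's other HTML options intact, assuming the preset objects are plain config objects as the spread example suggests.
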
`convert()` returns an `HTMLOutput`:
- `html`: Generated markup
- `css`: Generated styles
- `metadata`: Page count, processing time, OCR usage, font mapping count, scan detection, and image stats
- `fonts`: Font families referenced by output
- `text` (optional): Extracted text (when `htmlOptions.includeExtractedText` is enabled)
Set `htmlOptions.textLayout` to one of the modes below.
Default: `semantic`, a flow-oriented semantic mode with layout awareness that balances editability and visual fidelity.
- `absolute`: Best for maximum positional fidelity. Produces positioned text elements for precise placement.
- `smart`: Positioned output with additional grouping/merging heuristics to reduce fragmentation while maintaining fidelity.
- `flow`: Two behaviors depending on `htmlOptions.preserveLayout`:
  - `preserveLayout: true` produces "outline-flow" HTML that aims to be editable while still matching layout constraints.
  - `preserveLayout: false` produces semantic HTML (paragraphs/headings/lists) for maximum reflow/editability.
- `semantic`: Produces semantic regions/lines designed for editing while still anchored to the original PDF layout. When `preserveLayout: true`, semantic mode renders positioned regions and then uses:
  - Flexbox line layout (when safe)
  - Automatic fallback to absolute positioning when overlap risk or sensitive geometry is detected

  This is the mode targeted at preventing "vertical overlaps" without losing fidelity.

For special cases, you can render text through an SVG text layer (`textRenderMode: 'svg'`) when `preserveLayout` is enabled.
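
The flexbox-or-absolute decision described for semantic mode can be pictured as a geometry check on the lines of a region. The sketch below is only an illustration of the idea, not the library's layout engine: `LineBox`, `hasVerticalOverlap`, `chooseLineLayout`, and the tolerance value are all assumptions.

```typescript
// Hypothetical line geometry in page units; not a pdf2html-client type.
interface LineBox {
  top: number;
  height: number;
}

// True when any line starts above the lowest bottom edge seen so far
// (lines sorted by `top`), allowing a small tolerance for rounding.
function hasVerticalOverlap(lines: LineBox[], tolerance = 0.5): boolean {
  const sorted = [...lines].sort((a, b) => a.top - b.top);
  let maxBottom = -Infinity;
  for (const line of sorted) {
    if (line.top < maxBottom - tolerance) return true;
    maxBottom = Math.max(maxBottom, line.top + line.height);
  }
  return false;
}

// Use flow-friendly flexbox when lines stack cleanly, otherwise pin them in place.
function chooseLineLayout(lines: LineBox[]): 'flex' | 'absolute' {
  return hasVerticalOverlap(lines) ? 'absolute' : 'flex';
}
```

Regions whose lines stack cleanly stay editable, while tightly packed or overlapping lines keep their exact positions.
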
The top-level constructor takes `PDF2HTMLConfig`.
- `enableOCR: boolean`
- `ocrConfig?: { confidenceThreshold: number; language?: string; preprocess?: boolean; autoRotate?: boolean }`
- `ocrProcessorOptions?: { batchSize?: number; maxConcurrent?: number; timeout?: number }`

OCR only runs when the document is detected as scanned.
- `enableFontMapping: boolean`
- `fontMappingOptions?: { strategy: 'exact' | 'similar' | 'fallback'; similarityThreshold: number; cacheEnabled: boolean }`
- `parserStrategy?: 'auto' | 'pdfium' | 'unpdf'`
- `parserOptions?: { extractText: boolean; extractImages: boolean; extractGraphics: boolean; extractForms: boolean; extractAnnotations: boolean }`

`htmlOptions?: HTMLGenerationOptions` (high-level knobs):
- `format: 'html' | 'html+css' | 'html+inline-css'`
- `preserveLayout: boolean`
- `responsive: boolean`
- `darkMode: boolean`
- `imageFormat: 'base64' | 'url'`
- `textLayout?: 'absolute' | 'smart' | 'flow' | 'semantic'`
- `textLayoutPasses?: 1 | 2`
- `textRenderMode?: 'html' | 'svg'`
- `textPipeline?: 'legacy' | 'v2'`
- `includeExtractedText?: boolean`
- `textClassifierProfile?: string`
- `semanticLayout?: { blockGapFactor?: number; headingThreshold?: number; maxHeadingLength?: number }`
- `useFlexboxLayout?: boolean`

- `maxConcurrentPages?: number` (default: 4)
- `cacheEnabled?: boolean`
- `wasmMemoryLimit?: number`
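
Because `textLayout` accepts a closed set of string values, input coming from a settings screen or stored preferences may need validating before it is placed into `htmlOptions`. The sketch below shows one way to do that with a type guard. `TEXT_LAYOUTS`, `isTextLayout`, and `resolveTextLayout` are names invented for this example; the fallback to `'semantic'` mirrors the documented default.

```typescript
// The four layout modes listed in this README.
const TEXT_LAYOUTS = ['absolute', 'smart', 'flow', 'semantic'] as const;
type TextLayout = (typeof TEXT_LAYOUTS)[number];

// Type guard: narrows arbitrary input to a valid layout mode.
function isTextLayout(value: unknown): value is TextLayout {
  return (
    typeof value === 'string' &&
    (TEXT_LAYOUTS as readonly string[]).indexOf(value) !== -1
  );
}

// Fall back to the documented default ('semantic') for unknown input.
function resolveTextLayout(value: unknown): TextLayout {
  return isTextLayout(value) ? value : 'semantic';
}
```

A caller could then write `htmlOptions: { textLayout: resolveTextLayout(userChoice) }` without casting.
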
```bash
pnpm run demo
```
```bash
pnpm test
pnpm run test:browser
```
Browser tests are designed to catch layout regressions, especially text overlaps in semantic layouts.
```
src/
  core/      PDF parsing + layout analysis
  html/      HTML/CSS generation + layout engines
  fonts/     Font detection + mapping
  ocr/       OCR engine + processing
  types/     Public types
demo/        React demo app
tests/       Unit + browser tests
```
- Finish PDFJS Fallback
- Add more font mappings
- Better tables (structure + export)
- Richer forms/annotations rendering
- Expanded vector graphics support
- More layout profiles and tuning presets
MIT