pdf2html

[NPM version](https://www.npmjs.com/package/pdf2html)
[npm module downloads](https://www.npmjs.org/package/pdf2html)
[Build Status](https://travis-ci.org/shebinleo/pdf2html)
[License](https://www.npmjs.org/package/pdf2html)
[Node.js Version](https://nodejs.org)

> Convert PDF files to HTML, extract text, generate thumbnails, extract images, and extract metadata using Apache Tika and PDFBox

🚀 Features

- PDF to HTML conversion - Maintains formatting and structure
- Text extraction - Extract plain text content from PDFs
- Page-by-page processing - Process PDFs page by page
- Metadata extraction - Extract author, title, creation date, and more
- Thumbnail generation - Generate preview images from PDF pages
- Image extraction - Extract all embedded images from PDFs
- Buffer support - Process PDFs from memory buffers or file paths
- TypeScript support - Full type definitions included
- Async/Promise based - Modern async API
- Configurable - Extensive options for customization

📋 Prerequisites

- Node.js >= 12.0.0
- Java Runtime Environment (JRE) >= 8
  - Required for Apache Tika and PDFBox
  - Download Java
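
Because the library depends on a Java runtime, it can be useful to confirm that `java` is reachable before calling any conversion method. The following is a minimal sketch and not part of pdf2html: the helper name `hasJava` is hypothetical, and it only assumes that the JRE exposes a `java` executable on `PATH`.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper (not part of pdf2html): returns true when a `java`
// executable can be started from PATH and exits successfully.
function hasJava() {
    const result = spawnSync('java', ['-version'], { stdio: 'ignore' });
    // `result.error` is set when the executable cannot be spawned (e.g. ENOENT).
    return !result.error && result.status === 0;
}

if (!hasJava()) {
    console.warn('Java was not found on PATH; pdf2html needs a JRE (>= 8).');
}
```

The check only proves that some `java` binary runs; it does not verify the version.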

📦 Installation

Using npm

```bash
npm install pdf2html
```

Using yarn

```bash
yarn add pdf2html
```

Using pnpm

```bash
pnpm add pdf2html
```

The installation process will automatically download the required Apache Tika and PDFBox JAR files. You'll see a progress indicator during the download.

🔧 Basic Usage

Convert PDF to HTML

```javascript
const pdf2html = require('pdf2html');
const fs = require('fs');

// The snippets in this README use `await`, so run them inside an async function.

// From file path
const html = await pdf2html.html('path/to/document.pdf');
console.log(html);

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const htmlFromBuffer = await pdf2html.html(pdfBuffer);
console.log(htmlFromBuffer);

// With options
const htmlWithOptions = await pdf2html.html(pdfBuffer, {
    maxBuffer: 1024 * 1024 * 10, // 10MB buffer
});
```

Extract Text

```javascript
// From file path
const text = await pdf2html.text('path/to/document.pdf');

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const textFromBuffer = await pdf2html.text(pdfBuffer);
console.log(textFromBuffer);
```

Process PDF Page by Page

```javascript
// From file path
const htmlPages = await pdf2html.pages('path/to/document.pdf');

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const htmlPagesFromBuffer = await pdf2html.pages(pdfBuffer);
htmlPagesFromBuffer.forEach((page, index) => {
    console.log(`Page ${index + 1}:`, page);
});

// Get text for each page
const textPages = await pdf2html.pages(pdfBuffer, {
    text: true,
});
```
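
Once the pages are available as an array of strings, ordinary array methods apply. As a sketch, the hypothetical helper below (not part of pdf2html) returns the 1-based page numbers whose text contains a search term. It assumes an array shaped like the result of `pdf2html.pages(input, { text: true })` and uses stand-in data rather than a real PDF.

```javascript
// Hypothetical helper: returns 1-based page numbers whose text contains `term`
// (case-insensitive). `textPages` is assumed to be an array of strings.
function findPagesContaining(textPages, term) {
    const needle = term.toLowerCase();
    return textPages
        .map((text, index) => ({ text, pageNumber: index + 1 }))
        .filter(({ text }) => text.toLowerCase().includes(needle))
        .map(({ pageNumber }) => pageNumber);
}

// Example with stand-in data instead of a real PDF:
const samplePages = ['Introduction to invoices', 'Terms and conditions', 'Invoice totals'];
console.log(findPagesContaining(samplePages, 'invoice')); // [ 1, 3 ]
```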

Extract Metadata

```javascript
// From file path or buffer
const metadata = await pdf2html.meta(pdfBuffer);
console.log(metadata);
// Output: {
//   title: 'Document Title',
//   author: 'John Doe',
//   subject: 'Document Subject',
//   keywords: 'pdf, conversion',
//   creator: 'Microsoft Word',
//   producer: 'Adobe PDF Library',
//   creationDate: '2023-01-01T00:00:00Z',
//   modificationDate: '2023-01-02T00:00:00Z',
//   pages: 10
// }
```

Generate Thumbnails

```javascript
// From file path
const thumbnailPath = await pdf2html.thumbnail('path/to/document.pdf');

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const thumbnailFromBuffer = await pdf2html.thumbnail(pdfBuffer);
console.log('Thumbnail saved to:', thumbnailFromBuffer);

// Custom thumbnail options
const customThumbnail = await pdf2html.thumbnail(pdfBuffer, {
    page: 1, // Page number (default: 1)
    imageType: 'png', // 'png' or 'jpg' (default: 'png')
    width: 300, // Width in pixels (default: 160)
    height: 400, // Height in pixels (default: 226)
});
```
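
The documented defaults (160 by 226 pixels) have roughly the proportions of a portrait page. If you want a different width with the same proportions, the height can be derived from it. This is only a sketch: `thumbnailSizeForWidth` is a hypothetical helper, and reusing the default ratio is an assumption that will not suit landscape or unusually sized pages.

```javascript
// Documented defaults for pdf2html.thumbnail(): 160 x 226 pixels.
const DEFAULT_WIDTH = 160;
const DEFAULT_HEIGHT = 226;

// Hypothetical helper: scales the default height to match a custom width.
function thumbnailSizeForWidth(width) {
    return {
        width,
        height: Math.round((width * DEFAULT_HEIGHT) / DEFAULT_WIDTH),
    };
}

console.log(thumbnailSizeForWidth(300)); // { width: 300, height: 424 }
```

The returned object can be spread into the thumbnail options alongside `page` and `imageType`.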

Extract Images

```javascript
// From file path
const imagePaths = await pdf2html.extractImages('path/to/document.pdf');
console.log('Extracted images:', imagePaths);
// Output: ['/absolute/path/to/files/image/document1.jpg', '/absolute/path/to/files/image/document2.png', ...]

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const imagesFromBuffer = await pdf2html.extractImages(pdfBuffer);

// With custom output directory
const imagesInCustomDir = await pdf2html.extractImages(pdfBuffer, {
    outputDirectory: './extracted-images', // Custom output directory
});

// With custom buffer size for large PDFs
const imagesFromLargePdf = await pdf2html.extractImages('large-document.pdf', {
    outputDirectory: './output',
    maxBuffer: 1024 * 1024 * 10, // 10MB buffer
});
```
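
Since `extractImages` resolves to a list of file paths, post-processing is plain path handling. The sketch below groups paths by file extension using Node's `path` module. `groupByExtension` is a hypothetical helper, not part of pdf2html, and the sample paths are made up for illustration.

```javascript
const path = require('path');

// Hypothetical helper: groups file paths by lowercase extension.
function groupByExtension(filePaths) {
    const groups = {};
    for (const filePath of filePaths) {
        const ext = path.extname(filePath).toLowerCase() || '(none)';
        if (!groups[ext]) {
            groups[ext] = [];
        }
        groups[ext].push(filePath);
    }
    return groups;
}

// Stand-in paths in place of a real extractImages() result:
const sampleImages = ['/tmp/document1.jpg', '/tmp/document2.png', '/tmp/document3.JPG'];
const grouped = groupByExtension(sampleImages);
console.log(Object.keys(grouped)); // [ '.jpg', '.png' ]
```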

💻 TypeScript Support

This package includes TypeScript type definitions out of the box. No need to install @types/pdf2html.

Basic TypeScript Usage

```typescript
import * as pdf2html from 'pdf2html';
// or
import { html, text, pages, meta, thumbnail, extractImages, PDFMetadata, PDFProcessingError } from 'pdf2html';

async function convertPDF() {
    try {
        // All methods accept string paths or Buffers
        const htmlContent: string = await pdf2html.html('document.pdf');
        const textContent: string = await pdf2html.text(Buffer.from(pdfData));

        // Full type safety for options
        const thumbnailPath = await pdf2html.thumbnail('document.pdf', {
            page: 1, // number
            imageType: 'png', // 'png' | 'jpg'
            width: 300, // number
            height: 400, // number
        });

        // TypeScript knows the shape of metadata
        const metadata: PDFMetadata = await pdf2html.meta('document.pdf');
        console.log(metadata['pdf:producer']); // string | undefined
        console.log(metadata.resourceName); // string | undefined
    } catch (error) {
        if (error instanceof pdf2html.PDFProcessingError) {
            console.error('PDF processing failed:', error.message);
            console.error('Exit code:', error.exitCode);
        }
    }
}
```

Type Definitions

```typescript
// Input types - all methods accept either file paths or Buffers
type PDFInput = string | Buffer;

// Options interfaces
interface ProcessingOptions {
    maxBuffer?: number; // Maximum buffer size in bytes
}

interface PageOptions extends ProcessingOptions {
    text?: boolean; // Extract text instead of HTML
}

interface ThumbnailOptions extends ProcessingOptions {
    page?: number; // Page number (default: 1)
    imageType?: 'png' | 'jpg'; // Image format (default: 'png')
    width?: number; // Width in pixels (default: 160)
    height?: number; // Height in pixels (default: 226)
}

// Metadata structure with common fields
interface PDFMetadata {
    'pdf:PDFVersion'?: string;
    'pdf:producer'?: string;
    'xmp:CreatorTool'?: string;
    'dc:title'?: string;
    'dc:creator'?: string;
    resourceName?: string;
    [key: string]: any; // Allows additional properties
}

// Error class
class PDFProcessingError extends Error {
    command?: string; // The command that failed
    exitCode?: number; // The process exit code
}
```

IntelliSense Support

Full IntelliSense support in VS Code and other TypeScript-aware editors:

- Auto-completion for all methods and options
- Inline documentation on hover
- Type checking at compile time
- Catch errors before runtime

Advanced TypeScript Usage

```typescript
import * as pdf2html from 'pdf2html';
import { PDFProcessor, utils } from 'pdf2html';

// Using the PDFProcessor class directly
const html = await PDFProcessor.toHTML('document.pdf');

// Using utility classes
const { FileManager, HTMLParser } = utils;
await FileManager.ensureDirectories();

// Type guards
function isPDFProcessingError(error: unknown): error is pdf2html.PDFProcessingError {
    return error instanceof pdf2html.PDFProcessingError;
}

// Generic helper with proper typing
async function processPDFSafely<T>(operation: () => Promise<T>, fallback: T): Promise<T> {
    try {
        return await operation();
    } catch (error) {
        if (isPDFProcessingError(error)) {
            console.error(`PDF operation failed: ${error.message}`);
        }
        return fallback;
    }
}

// Usage
const pages = await processPDFSafely(
    () => pdf2html.pages('document.pdf', { text: true }),
    [] // fallback to empty array
);
```

⚙️ Advanced Configuration

Buffer Size Configuration

By default, the maximum buffer size is 2MB. For large PDFs, you may need to increase this:

```javascript
const options = {
    maxBuffer: 1024 * 1024 * 50, // 50MB buffer
};

// Apply to any method
await pdf2html.html('large-file.pdf', options);
await pdf2html.text('large-file.pdf', options);
await pdf2html.pages('large-file.pdf', options);
await pdf2html.meta('large-file.pdf', options);
await pdf2html.thumbnail('large-file.pdf', options);
```
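
Because `maxBuffer` is expressed in bytes, a small conversion helper can make the intent easier to read than repeated multiplication. This is a sketch only; `megabytes` is a hypothetical helper and not something pdf2html exports.

```javascript
// Hypothetical helper: converts megabytes to bytes for the `maxBuffer` option.
function megabytes(count) {
    return count * 1024 * 1024;
}

const largeFileOptions = { maxBuffer: megabytes(50) };
console.log(largeFileOptions.maxBuffer); // 52428800
```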

Error Handling

Always wrap your calls in try-catch blocks for proper error handling:

```javascript
try {
    const html = await pdf2html.html('document.pdf');
    // Process HTML
} catch (error) {
    if (error.code === 'ENOENT') {
        console.error('PDF file not found');
    } else if (error.message.includes('Java')) {
        console.error('Java is not installed or not in PATH');
    } else {
        console.error('PDF processing failed:', error.message);
    }
}
```
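
If the same checks are needed in several places, they can be pulled into one function. The sketch below mirrors the branches shown above and can be exercised with plain `Error` objects. `describePdfError` is a hypothetical helper, and the error shapes in the example are stand-ins, not captured pdf2html output.

```javascript
// Hypothetical helper mirroring the checks above: maps an error to a message.
function describePdfError(error) {
    if (error && error.code === 'ENOENT') {
        return 'PDF file not found';
    }
    const message = error && typeof error.message === 'string' ? error.message : String(error);
    if (message.includes('Java')) {
        return 'Java is not installed or not in PATH';
    }
    return `PDF processing failed: ${message}`;
}

// Stand-in errors for illustration:
const missingFile = Object.assign(new Error('no such file'), { code: 'ENOENT' });
console.log(describePdfError(missingFile)); // PDF file not found
console.log(describePdfError(new Error('Java not found'))); // Java is not installed or not in PATH
```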

🏗️ API Reference

pdf2html.html(input, [options])

Converts PDF to HTML format.

- `input` (string | Buffer) - Path to the PDF file or PDF buffer
- `options` (object, optional)
  - `maxBuffer` (number) - Maximum buffer size in bytes (default: 2MB)
- Returns: `Promise<string>` - HTML content

pdf2html.text(input, [options])

Extracts text from PDF.

- `input` (string | Buffer) - Path to the PDF file or PDF buffer
- `options` (object, optional)
  - `maxBuffer` (number) - Maximum buffer size in bytes
- Returns: `Promise<string>` - Extracted text

pdf2html.pages(input, [options])

Processes PDF page by page.

- `input` (string | Buffer) - Path to the PDF file or PDF buffer
- `options` (object, optional)
  - `text` (boolean) - Extract text instead of HTML (default: false)
  - `maxBuffer` (number) - Maximum buffer size in bytes
- Returns: `Promise<string[]>` - Array of HTML or text strings

pdf2html.meta(input, [options])

Extracts PDF metadata.

- `input` (string | Buffer) - Path to the PDF file or PDF buffer
- `options` (object, optional)
  - `maxBuffer` (number) - Maximum buffer size in bytes
- Returns: `Promise<object>`
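
Every method above takes the same kind of input: a path string or a `Buffer`. A caller that receives input from elsewhere may want to validate it first. The following is a sketch under that assumption. Both helpers are hypothetical and not part of pdf2html, and the `%PDF-` check relies only on the fact that PDF files conventionally begin with that marker.

```javascript
// Hypothetical guard: true for a non-empty path string or a Buffer.
function isPdfInput(input) {
    return (typeof input === 'string' && input.length > 0) || Buffer.isBuffer(input);
}

// Hypothetical check: true when a Buffer starts with the "%PDF-" marker.
function startsWithPdfHeader(buffer) {
    return Buffer.isBuffer(buffer) && buffer.subarray(0, 5).toString('latin1') === '%PDF-';
}

console.log(isPdfInput('path/to/document.pdf')); // true
console.log(isPdfInput(42)); // false
console.log(startsWithPdfHeader(Buffer.from('%PDF-1.7\n'))); // true
```

A failed header check does not prove the file is unusable, so treat it as an early warning instead of a hard rejection.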
