pdf2html

[NPM version](https://www.npmjs.com/package/pdf2html)
[npm module downloads](https://www.npmjs.org/package/pdf2html)
[Build Status](https://travis-ci.org/shebinleo/pdf2html)
[License](https://www.npmjs.org/package/pdf2html)
[Node.js Version](https://nodejs.org)

> Convert PDF files to HTML, extract text, generate thumbnails, extract images, and extract metadata using Apache Tika and PDFBox

🚀 Features

- PDF to HTML conversion - Maintains formatting and structure
- Text extraction - Extract plain text content from PDFs
- Page-by-page processing - Process PDFs page by page
- Metadata extraction - Extract author, title, creation date, and more
- Thumbnail generation - Generate preview images from PDF pages
- Image extraction - Extract all embedded images from PDFs
- Buffer support - Process PDFs from memory buffers or file paths
- TypeScript support - Full type definitions included
- Async/Promise based - Modern async API
- Configurable - Extensive options for customization

📋 Prerequisites

- Node.js >= 12.0.0
- Java Runtime Environment (JRE) >= 8
    - Required for Apache Tika and PDFBox
    - Download Java
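
Because the conversions depend on a Java runtime, it can help to confirm that `java` is reachable before calling the library. The following is a minimal sketch using only Node's built-in `child_process` module; the helper name `isJavaAvailable` is hypothetical and not part of pdf2html, and the check does not verify that the installed version is 8 or later.

```javascript
const { spawnSync } = require('child_process');

// Returns true when a `java` executable can be launched from PATH.
// Hypothetical helper, not part of the pdf2html API.
function isJavaAvailable() {
    const result = spawnSync('java', ['-version'], { stdio: 'ignore' });
    // `error` is set when the process could not be spawned (for example ENOENT).
    return !result.error && result.status === 0;
}

if (!isJavaAvailable()) {
    console.warn('Java was not found on PATH; install a JRE (version 8 or later) first.');
}
```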

📦 Installation

Using npm

bash

npm install pdf2html

Using yarn

bash

yarn add pdf2html

Using pnpm

bash

pnpm add pdf2html





The installation process will automatically download the required Apache Tika and PDFBox JAR files. You'll see a progress indicator during the download.



🔧 Basic Usage



Convert PDF to HTML

javascript

const pdf2html = require('pdf2html');

const fs = require('fs');



// From file path

const html = await pdf2html.html('path/to/document.pdf');

console.log(html);



// From buffer

const pdfBuffer = fs.readFileSync('path/to/document.pdf');

const html = await pdf2html.html(pdfBuffer);

console.log(html);



// With options

const html = await pdf2html.html(pdfBuffer, {

    maxBuffer: 1024 * 1024 * 10, // 10MB buffer

});

Extract Text

javascript

// From file path

const text = await pdf2html.text('path/to/document.pdf');



// From buffer

const pdfBuffer = fs.readFileSync('path/to/document.pdf');

const text = await pdf2html.text(pdfBuffer);

console.log(text);

Process PDF Page by Page

javascript

// From file path

const htmlPages = await pdf2html.pages('path/to/document.pdf');



// From buffer

const pdfBuffer = fs.readFileSync('path/to/document.pdf');

const htmlPages = await pdf2html.pages(pdfBuffer);

htmlPages.forEach((page, index) => {

    console.log(`Page ${index + 1}:`, page);

});



// Get text for each page

const textPages = await pdf2html.pages(pdfBuffer, {

    text: true,

});
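
Once the per-page strings are available, a common follow-up is to stitch them back together with visible page markers. The sketch below assumes `pages` is an array of strings, as returned by the page-by-page call above; the helper name `joinPages` and the marker format are illustrative choices, not part of pdf2html.

```javascript
// Joins per-page strings into one document with a header before each page.
// Hypothetical helper; the "--- Page N ---" marker is an arbitrary choice.
function joinPages(pages) {
    return pages
        .map((page, index) => `--- Page ${index + 1} ---\n${page.trim()}`)
        .join('\n\n');
}

// Example with stand-in data; in practice `pages` would come from
// `await pdf2html.pages(input, { text: true })`.
console.log(joinPages(['First page text\n', 'Second page text\n']));
```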

Extract Metadata

javascript

// From file path or buffer

const metadata = await pdf2html.meta(pdfBuffer);

console.log(metadata);

// Output: {

//   title: 'Document Title',

//   author: 'John Doe',

//   subject: 'Document Subject',

//   keywords: 'pdf, conversion',

//   creator: 'Microsoft Word',

//   producer: 'Adobe PDF Library',

//   creationDate: '2023-01-01T00:00:00Z',

//   modificationDate: '2023-01-02T00:00:00Z',

//   pages: 10

// }
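
The sample output above uses plain keys such as `title`, while the TypeScript definitions later in this README list namespaced keys such as `'dc:title'`. A defensive lookup can cover both. This is a sketch only: the helper names are hypothetical, and which keys a given PDF actually yields is an assumption.

```javascript
// Returns the first defined, non-empty value among the candidate keys.
function pickField(metadata, keys) {
    for (const key of keys) {
        const value = metadata[key];
        if (value !== undefined && value !== null && value !== '') {
            return value;
        }
    }
    return undefined;
}

// Key names are taken from the examples in this README; whether a given
// document provides them is an assumption.
const getTitle = (metadata) => pickField(metadata, ['title', 'dc:title', 'resourceName']);
const getAuthor = (metadata) => pickField(metadata, ['author', 'dc:creator']);
```

For example, `getTitle(await pdf2html.meta(pdfBuffer))` would fall back to `resourceName` when no title key is present.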

Generate Thumbnails

javascript

// From file path

const thumbnailPath = await pdf2html.thumbnail('path/to/document.pdf');



// From buffer

const pdfBuffer = fs.readFileSync('path/to/document.pdf');

const thumbnailPath = await pdf2html.thumbnail(pdfBuffer);

console.log('Thumbnail saved to:', thumbnailPath);



// Custom thumbnail options

const thumbnailPath = await pdf2html.thumbnail(pdfBuffer, {

    page: 1, // Page number (default: 1)

    imageType: 'png', // 'png' or 'jpg' (default: 'png')

    width: 300, // Width in pixels (default: 160)

    height: 400, // Height in pixels (default: 226)

});
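
When only a target width is known, the height can be derived from the default 160x226 proportions quoted in the option comments above. The helper below is a sketch with a hypothetical name; whether the library itself preserves aspect ratio is not stated here, so the function simply computes a matching `height`.

```javascript
// Default thumbnail size quoted in the option comments above.
const DEFAULT_WIDTH = 160;
const DEFAULT_HEIGHT = 226;

// Builds thumbnail options for a given width, scaling the height so the
// default 160x226 proportions are kept. Hypothetical helper.
function thumbnailOptionsForWidth(width, imageType = 'png') {
    const height = Math.round((width * DEFAULT_HEIGHT) / DEFAULT_WIDTH);
    return { page: 1, imageType, width, height };
}

console.log(thumbnailOptionsForWidth(320)); // height is 452
```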

Extract Images

javascript

// From file path

const imagePaths = await pdf2html.extractImages('path/to/document.pdf');

console.log('Extracted images:', imagePaths);

// Output: ['/absolute/path/to/files/image/document1.jpg', '/absolute/path/to/files/image/document2.png', ...]



// From buffer

const pdfBuffer = fs.readFileSync('path/to/document.pdf');

const imagePaths = await pdf2html.extractImages(pdfBuffer);



// With custom output directory

const imagePaths = await pdf2html.extractImages(pdfBuffer, {

    outputDirectory: './extracted-images', // Custom output directory

});



// With custom buffer size for large PDFs

const imagePaths = await pdf2html.extractImages('large-document.pdf', {

    outputDirectory: './output',

    maxBuffer: 1024 * 1024 * 10, // 10MB buffer

});
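
Since the extraction call resolves to an array of file paths with mixed formats, grouping them by extension is a typical next step. The sketch below uses only Node's built-in `path` module; `groupByExtension` is a hypothetical helper name.

```javascript
const path = require('path');

// Groups file paths by lowercase extension, e.g. { '.jpg': [...], '.png': [...] }.
function groupByExtension(filePaths) {
    const groups = {};
    for (const filePath of filePaths) {
        const ext = path.extname(filePath).toLowerCase();
        (groups[ext] = groups[ext] || []).push(filePath);
    }
    return groups;
}
```

Calling `groupByExtension(imagePaths)` on the result of `extractImages` then gives one list per image format.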





💻 TypeScript Support



This package includes TypeScript type definitions out of the box, so there is no need to install `@types/pdf2html`.



Basic TypeScript Usage

typescript

import * as pdf2html from 'pdf2html';

// or

import { html, text, pages, meta, thumbnail, extractImages, PDFMetadata, PDFProcessingError } from 'pdf2html';



async function convertPDF() {

    try {

        // All methods accept string paths or Buffers

        const htmlContent: string = await pdf2html.html('document.pdf');

        const textContent: string = await pdf2html.text(Buffer.from(pdfData));



        // Full type safety for options

        const thumbnailPath = await pdf2html.thumbnail('document.pdf', {

            page: 1, // number

            imageType: 'png', // 'png' | 'jpg'

            width: 300, // number

            height: 400, // number

        });



        // TypeScript knows the shape of metadata

        const metadata: PDFMetadata = await pdf2html.meta('document.pdf');

        console.log(metadata['pdf:producer']); // string | undefined

        console.log(metadata.resourceName); // string | undefined

    } catch (error) {

        if (error instanceof pdf2html.PDFProcessingError) {

            console.error('PDF processing failed:', error.message);

            console.error('Exit code:', error.exitCode);

        }

    }

}

Type Definitions

typescript

// Input types - all methods accept either file paths or Buffers

type PDFInput = string | Buffer;



// Options interfaces

interface ProcessingOptions {

    maxBuffer?: number; // Maximum buffer size in bytes

}



interface PageOptions extends ProcessingOptions {

    text?: boolean; // Extract text instead of HTML

}



interface ThumbnailOptions extends ProcessingOptions {

    page?: number; // Page number (default: 1)

    imageType?: 'png' | 'jpg'; // Image format (default: 'png')

    width?: number; // Width in pixels (default: 160)

    height?: number; // Height in pixels (default: 226)

}



// Metadata structure with common fields

interface PDFMetadata {

    'pdf:PDFVersion'?: string;

    'pdf:producer'?: string;

    'xmp:CreatorTool'?: string;

    'dc:title'?: string;

    'dc:creator'?: string;

    resourceName?: string;

    [key: string]: any; // Allows additional properties

}



// Error class

class PDFProcessingError extends Error {

    command?: string; // The command that failed

    exitCode?: number; // The process exit code

}
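
In plain JavaScript projects the `PDFInput` type is not enforced, so a runtime check can stand in for it. This sketch mirrors the `string | Buffer` union above; both function names are hypothetical and not exported by pdf2html.

```javascript
// Mirrors the `PDFInput = string | Buffer` type with a runtime check.
function isPDFInput(value) {
    return typeof value === 'string' || Buffer.isBuffer(value);
}

// Throws early with a readable message instead of failing inside the converter.
function assertPDFInput(value) {
    if (!isPDFInput(value)) {
        throw new TypeError('Expected a file path (string) or a Buffer containing PDF data');
    }
    return value;
}
```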





IntelliSense Support



Full IntelliSense support in VS Code and other TypeScript-aware editors:



- Auto-completion for all methods and options

- Inline documentation on hover

- Type checking at compile time

- Catch errors before runtime



Advanced TypeScript Usage

typescript

import { PDFProcessor, utils } from 'pdf2html';



// Using the PDFProcessor class directly

const html = await PDFProcessor.toHTML('document.pdf');



// Using utility classes

const { FileManager, HTMLParser } = utils;

await FileManager.ensureDirectories();



// Type guards

function isPDFProcessingError(error: unknown): error is pdf2html.PDFProcessingError {

    return error instanceof pdf2html.PDFProcessingError;

}



// Generic helper with proper typing

async function processPDFSafely<T>(operation: () => Promise<T>, fallback: T): Promise<T> {

    try {

        return await operation();

    } catch (error) {

        if (isPDFProcessingError(error)) {

            console.error(`PDF operation failed: ${error.message}`);

        }

        return fallback;

    }

}



// Usage

const pages = await processPDFSafely(

    () => pdf2html.pages('document.pdf', { text: true }),

    [] // fallback to empty array

);





⚙️ Advanced Configuration



Buffer Size Configuration



By default, the maximum buffer size is 2MB. For large PDFs, you may need to increase this:

javascript

const options = {

    maxBuffer: 1024 * 1024 * 50, // 50MB buffer

};



// Apply to any method

await pdf2html.html('large-file.pdf', options);

await pdf2html.text('large-file.pdf', options);

await pdf2html.pages('large-file.pdf', options);

await pdf2html.meta('large-file.pdf', options);

await pdf2html.thumbnail('large-file.pdf', options);
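
Because `maxBuffer` is expressed in bytes, a small conversion helper keeps call sites readable and avoids arithmetic slips. This is a sketch under the assumption that only `maxBuffer` needs to be set; `bufferOptions` is a hypothetical name.

```javascript
// Converts a size in megabytes into an options object with `maxBuffer` in bytes.
function bufferOptions(megabytes) {
    if (!Number.isFinite(megabytes) || megabytes <= 0) {
        throw new RangeError('megabytes must be a positive number');
    }
    return { maxBuffer: Math.floor(megabytes * 1024 * 1024) };
}

console.log(bufferOptions(50)); // { maxBuffer: 52428800 }
```

The result can be passed directly as the second argument, for example `pdf2html.html('large-file.pdf', bufferOptions(50))`.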





Error Handling



Always wrap your calls in try-catch blocks for proper error handling:

javascript

try {

    const html = await pdf2html.html('document.pdf');

    // Process HTML

} catch (error) {

    if (error.code === 'ENOENT') {

        console.error('PDF file not found');

    } else if (error.message.includes('Java')) {

        console.error('Java is not installed or not in PATH');

    } else {

        console.error('PDF processing failed:', error.message);

    }

}
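
If the same checks are needed in several places, they can be pulled into one function that maps an error to a category. The sketch below follows the same three branches as the example above; the function name and the category strings are hypothetical.

```javascript
// Maps an error to a short category, following the same checks as the
// try/catch example: missing file, missing Java, or anything else.
function classifyError(error) {
    if (error && error.code === 'ENOENT') {
        return 'file-not-found';
    }
    if (error && typeof error.message === 'string' && error.message.includes('Java')) {
        return 'java-missing';
    }
    return 'processing-failed';
}
```

A caller can then switch on the returned category inside a single `catch` block instead of repeating the conditions.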





🏗️ API Reference



pdf2html.html(input, [options])

Converts PDF to HTML format.

- `input` (string | Buffer) - Path to the PDF file or PDF buffer
- `options` (object, optional)
    - `maxBuffer` (number) - Maximum buffer size in bytes (default: 2MB)
- Returns: `Promise<string>` - HTML content



pdf2html.text(input, [options])

Extracts text from PDF.

- `input` (string | Buffer) - Path to the PDF file or PDF buffer
- `options` (object, optional)
    - `maxBuffer` (number) - Maximum buffer size in bytes
- Returns: `Promise<string>` - Extracted text



pdf2html.pages(input, [options])

Processes PDF page by page.

- `input` (string | Buffer) - Path to the PDF file or PDF buffer
- `options` (object, optional)
    - `text` (boolean) - Extract text instead of HTML (default: false)
    - `maxBuffer` (number) - Maximum buffer size in bytes
- Returns: `Promise<string[]>` - Array of HTML or text strings



pdf2html.meta(input, [options])

Extracts PDF metadata.

- `input` (string | Buffer) - Path to the PDF file or PDF buffer
- `options` (object, optional)
    - `maxBuffer` (number) - Maximum buffer size in bytes
- Returns: `Promise<PDFMetadata>`
