# pdf-plus

A comprehensive PDF content extraction library with support for text, images, and structured data.

## Features

- 📝 Text Extraction - High-quality text extraction with positioning
- 🖼️ Image Detection - Detect and reference images in PDF content
- 💾 Image File Extraction - Extract actual image files from PDFs
- 🎨 Image Optimization - Optional Sharp/Imagemin optimization with quality control
- 🔄 JP2 Conversion - Automatic JPEG 2000 to JPG conversion for compatibility
- 🚀 Parallel Processing - 1.5-3x faster with configurable concurrency (Phase 1)
- ⚡ Async I/O - Non-blocking file operations for better performance (Phase 2)
- 🧵 Worker Threads - True multi-threading for CPU-intensive operations (Phase 3)
- 🌊 Streaming API - Process large PDFs with 10-100x lower memory usage (Phase 4)
- 📄 Page to Image - Convert PDF pages to images (PNG, JPG, WebP) (Phase 5 - NEW!)
- 🎯 Format Preservation - Preserves original image formats (JPG, PNG) and full quality
- 🔧 TypeScript Support - Full TypeScript definitions included
- 🛡️ Robust Validation - Comprehensive input validation and error handling

## Installation

```bash
# Using pnpm (recommended)
pnpm add pdf-plus

# Using npm
npm install pdf-plus

# Using yarn
yarn add pdf-plus
```


## Quick Start

```typescript
import { extractPdfContent } from "pdf-plus";

// Extract both text and images
const result = await extractPdfContent("document.pdf", {
  extractText: true,
  extractImages: true,
  verbose: true,
});

console.log(
  `Extracted ${result.images.length} images from ${result.document.pages} pages`
);
console.log(`Text content: ${result.cleanText.substring(0, 100)}...`);
```

### Streaming API

For large PDFs, use the streaming API for lower memory usage and real-time progress:

```typescript
import { extractPdfStream } from "pdf-plus";

const stream = extractPdfStream("large-document.pdf", {
  extractImageFiles: true,
  imageOutputDir: "./images",
  streamMode: true,
});

for await (const event of stream) {
  if (event.type === "page") {
    console.log(`Page ${event.pageNumber}/${event.totalPages} complete`);
  } else if (event.type === "progress") {
    console.log(`Progress: ${event.percentComplete.toFixed(1)}%`);
  } else if (event.type === "complete") {
    console.log(`Done! ${event.totalImages} images extracted`);
  }
}
```

Benefits:

- 📉 10-100x lower memory usage for large PDFs
- ⚡ 100x faster time to first result
- 📊 Real-time progress tracking
- 🛑 Cancellation support

See PHASE4-STREAMING.md for complete streaming API documentation.
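The event fields used in the loop above (`pageNumber`, `totalPages`, `percentComplete`, `totalImages`) suggest a discriminated union keyed on `type`. The sketch below models that idea; the `StreamEvent` type and the `describeEvent` helper are hypothetical names introduced for illustration, and the types actually exported by pdf-plus may differ.

```typescript
// Hypothetical event shapes, inferred from the streaming example above.
// The real types exported by pdf-plus may differ.
type StreamEvent =
  | { type: "page"; pageNumber: number; totalPages: number }
  | { type: "progress"; percentComplete: number }
  | { type: "complete"; totalImages: number };

// Turn one event into a log line; the switch narrows the union in each case.
function describeEvent(event: StreamEvent): string {
  switch (event.type) {
    case "page":
      return `Page ${event.pageNumber}/${event.totalPages} complete`;
    case "progress":
      return `Progress: ${event.percentComplete.toFixed(1)}%`;
    case "complete":
      return `Done! ${event.totalImages} images extracted`;
  }
}

console.log(describeEvent({ type: "progress", percentComplete: 42.123 }));
// prints "Progress: 42.1%"
```

Keeping the formatting in a pure function like this makes the handling logic easy to unit-test without opening a PDF.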

### Page to Image Conversion

Render PDF pages to high-quality images with a simple function call:

```typescript
import { generatePageImages } from "pdf-plus";

// Simple - render all pages to JPG images
const imagePaths = await generatePageImages(
  "document.pdf", // PDF file path
  "./page-images" // Output directory where images will be saved
);

console.log(`Generated ${imagePaths.length} page images`);
// Returns: ['/path/to/page-images/jpg/page-001.jpg', '/path/to/page-images/jpg/page-002.jpg', ...]
```

With Options:

```typescript
const imagePaths = await generatePageImages("document.pdf", "./page-images", {
  pageImageFormat: "jpg", // 'jpg', 'png', or 'webp'
  pageImageDpi: 150, // DPI quality (72, 150, 300, 600)
  pageRenderEngine: "poppler", // 'poppler' (recommended) or 'pdfjs'
  specificPages: [1, 2, 3], // Optional: only render specific pages
  parallelProcessing: true, // Parallel rendering (default: true)
  maxConcurrentPages: 10, // Max parallel pages (default: 10)
  verbose: true, // Show progress
});
```

Features:

- 🎨 Multiple formats - JPG, PNG, WebP
- 📐 Quality control - Adjustable DPI (72, 150, 300, 600)
- 📄 Page selection - Render specific pages or all pages
- 🚀 Parallel rendering - Fast multi-page processing
- 📁 Returns file paths - Array of absolute paths to generated images
- 🔧 Two engines - Poppler (best quality) or PDF.js

Output Structure:

```
page-images/
└── jpg/
    ├── page-001.jpg
    ├── page-002.jpg
    └── page-003.jpg
```

See PAGE-TO-IMAGE-FEATURE.md for complete page-to-image documentation.
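If you need to predict where a given page will land, the layout shown above (`<format>/page-NNN.<format>`) can be reproduced with a small helper. This is a sketch only: `pageImageRelativePath` is a hypothetical function, not part of pdf-plus, and it assumes the three-digit zero padding seen in the example output. How the library names pages beyond 999 is not documented here.

```typescript
// Hypothetical helper: predicts the relative path of a rendered page,
// assuming the "<format>/page-NNN.<format>" layout shown above.
type PageImageFormat = "jpg" | "png" | "webp";

function pageImageRelativePath(
  pageNumber: number,
  format: PageImageFormat = "jpg",
): string {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError(`pageNumber must be a positive integer, got ${pageNumber}`);
  }
  // padStart pads to at least three digits; longer numbers are left untouched.
  const padded = String(pageNumber).padStart(3, "0");
  return `${format}/page-${padded}.${format}`;
}

console.log(pageImageRelativePath(2)); // prints "jpg/page-002.jpg"
console.log(pageImageRelativePath(12, "png")); // prints "png/page-012.png"
```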

## Usage Examples

### Text Extraction Only

```typescript
import { extractText } from "pdf-plus";

const text = await extractText("document.pdf");
console.log(`Extracted ${text.length} characters`);
```

### Extract Embedded Image Files

```typescript
import { extractImageFiles } from "pdf-plus";

// Extract and save embedded images from PDF
const imagePaths = await extractImageFiles(
  "document.pdf",
  "./extracted-images" // Output directory for embedded images
);

console.log(`Extracted ${imagePaths.length} embedded images`);
```

### Render Pages to Images

```typescript
import { generatePageImages } from "pdf-plus";

// Render PDF pages to image files
const imagePaths = await generatePageImages(
  "document.pdf",
  "./page-images" // Output directory for page images
);

console.log(`Generated ${imagePaths.length} page images`);
// Each page becomes an image: page-001.jpg, page-002.jpg, etc.
```

### Image Optimization

```typescript
import { extractPdfContent } from "pdf-plus";

const result = await extractPdfContent("document.pdf", {
  extractImageFiles: true,
  imageOutputDir: "./images",

  // Enable optimization
  optimizeImages: true,
  imageOptimizer: "auto", // or 'sharp', 'imagemin'
  imageQuality: 80, // 1-100; default 80 (100 for JP2 conversion)
  imageProgressive: true,

  // Convert JP2 (JPEG 2000) to JPG for better compatibility (default: true)
  convertJp2ToJpg: true,

  verbose: true,
});

// Check optimization results
result.images.forEach((img) => {
  console.log(`${img.filename}: Optimized and saved`);
});
```

### Parallel Processing and Worker Threads

```typescript
import { extractPdfContent } from "pdf-plus";

// BASIC: Parallel processing (enabled by default)
const result = await extractPdfContent("document.pdf", {
  extractImageFiles: true,
  imageOutputDir: "./images",
  parallelProcessing: true, // 1.5-3x faster
});

// ADVANCED: With worker threads for CPU-intensive operations
const workerResult = await extractPdfContent("large-document.pdf", {
  extractImageFiles: true,
  imageOutputDir: "./images",

  // Enable parallel processing (default: true)
  parallelProcessing: true,

  // Enable worker threads for true multi-threading (default: false)
  useWorkerThreads: true, // 2.5-3.2x additional speedup!
  autoScaleWorkers: true, // Auto-adjust based on system resources
  maxWorkerThreads: 8, // Max worker threads (default: CPU cores - 1)

  // Fine-tune concurrency for your workload
  maxConcurrentPages: 20, // Process up to 20 pages simultaneously
  maxConcurrentImages: 50, // Extract up to 50 images per page in parallel
  maxConcurrentConversions: 5, // Convert up to 5 JP2 files simultaneously
  maxConcurrentOptimizations: 5, // Optimize up to 5 images simultaneously

  verbose: true,
});

// Performance gains (tested on Art Basel PDF, 54 images):
// - Baseline (sequential): 140ms
// - Parallel processing: 47ms (2.96x faster)
// - Parallel + Workers: 44ms (3.23x faster) 🚀
```

Performance Recommendations:

| PDF Size | Images | Recommended Settings |
| -------- | ------ | -------------------- |
| Small | <20 | `parallelProcessing: true` (default settings) |
| Medium | 20-50 | `parallelProcessing: true, maxConcurrentPages: 10, maxConcurrentImages: 20` |
| Large | 50+ | `parallelProcessing: true, useWorkerThreads: true, maxConcurrentPages: 20, maxConcurrentImages: 50` |
| Huge | 200+ | `parallelProcessing: true, useWorkerThreads: true, maxWorkerThreads: 8, maxConcurrentPages: 30, maxConcurrentImages: 100` |
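The recommendations above can be captured in a small lookup so that callers pick settings from an image count. This is a sketch under assumptions: `recommendSettings` and `PerformanceSettings` are hypothetical names, not part of pdf-plus, and because the table lists 50 under both "Medium" and "Large", the helper arbitrarily treats exactly 50 images as "Large".

```typescript
// Hypothetical helper that encodes the recommendations table above.
// Option names come from ExtractionOptions; the function itself is not part of pdf-plus.
interface PerformanceSettings {
  parallelProcessing: boolean;
  useWorkerThreads?: boolean;
  maxWorkerThreads?: number;
  maxConcurrentPages?: number;
  maxConcurrentImages?: number;
}

function recommendSettings(imageCount: number): PerformanceSettings {
  if (imageCount >= 200) {
    // "Huge" row
    return {
      parallelProcessing: true,
      useWorkerThreads: true,
      maxWorkerThreads: 8,
      maxConcurrentPages: 30,
      maxConcurrentImages: 100,
    };
  }
  if (imageCount >= 50) {
    // "Large" row (assumption: exactly 50 counts as Large)
    return {
      parallelProcessing: true,
      useWorkerThreads: true,
      maxConcurrentPages: 20,
      maxConcurrentImages: 50,
    };
  }
  if (imageCount >= 20) {
    // "Medium" row
    return { parallelProcessing: true, maxConcurrentPages: 10, maxConcurrentImages: 20 };
  }
  // "Small" row: defaults are enough
  return { parallelProcessing: true };
}

console.log(recommendSettings(54).useWorkerThreads); // prints true
```

The returned object can be spread into the options passed to `extractPdfContent`, for example `{ extractImageFiles: true, ...recommendSettings(54) }`.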

Worker Threads Benefits:

- ✅ True multi-threading (runs on separate CPU cores)
- ✅ 2.5-3.2x faster for CPU-intensive operations (JP2 conversion, optimization)
- ✅ Auto-scaling based on memory and CPU usage
- ✅ Opt-in (default: `false`) - no breaking changes

See PERFORMANCE.md and PHASE3-WORKERS.md for detailed benchmarks and optimization guide.

### Custom Image References

```typescript
import { extractPdfContent } from "pdf-plus";

const result = await extractPdfContent("document.pdf", {
  imageRefFormat: "📷 Image {index} on page {page}",
  extractImageFiles: true,
  useImagePaths: true,
});

// Text will contain: "📷 Image 1 on page 1" instead of "[IMAGE:img_1]"
```

### Advanced Usage with PDFExtractor

```typescript
import { PDFExtractor } from "pdf-plus";

const extractor = new PDFExtractor();

const result = await extractor.extract("large-document.pdf", {
  extractText: true,
  extractImages: true,
  extractImageFiles: true,
  imageOutputDir: "./extracted-images",
  memoryLimit: "1GB",
  batchSize: 10,
  progressCallback: (progress) => {
    console.log(`Processing page ${progress.currentPage}/${progress.totalPages}`);
  },
});
```

### Real-World Examples

#### Extract and Save Images from Academic Papers

```typescript
import { writeFileSync } from "fs";
import { extractPdfContent } from "pdf-plus";

async function extractAcademicPaper(pdfPath: string) {
  const result = await extractPdfContent(pdfPath, {
    extractText: true,
    extractImages: true,
    extractImageFiles: true,
    imageOutputDir: "./paper-images",
    imageRefFormat: "Figure {index}: {name}",
    verbose: true,
  });

  // Save text content
  writeFileSync("./paper-text.txt", result.cleanText);

  // Log extraction summary
  console.log(`📄 Extracted from ${result.document.filename}:`);
  console.log(`  📝 Text: ${result.document.textLength} characters`);
  console.log(`  🖼️ Images: ${result.images.length} found`);
  console.log(`  📊 Pages: ${result.document.pages}`);

  return result;
}
```

#### Batch Process Multiple PDFs

```typescript
import path from "path";
import { PDFExtractor } from "pdf-plus";
import { glob } from "glob";

async function batchProcessPDFs(pattern: string) {
  const extractor = new PDFExtractor("./cache"); // Enable caching
  const pdfFiles = await glob(pattern);

  const results = [];

  for (const pdfFile of pdfFiles) {
    console.log(`Processing: ${pdfFile}`);

    try {
      const result = await extractor.extract(pdfFile, {
        extractText: true,
        extractImages: true,
        imageOutputDir: `./output/${path.basename(pdfFile, ".pdf")}`,
        batchSize: 5, // Process 5 pages at a time
        verbose: false,
      });

      results.push({
        file: pdfFile,
        success: true,
        pages: result.document.pages,
        images: result.images.length,
        textLength: result.document.textLength,
      });
    } catch (error) {
      console.error(`Failed to process ${pdfFile}:`, error);
      results.push({
        file: pdfFile,
        success: false,
        // Catch variables may be `unknown`, so narrow before reading `.message`
        error: error instanceof Error ? error.message : String(error),
      });
    }
  }

  return results;
}
```

## API Reference

### Functions

#### extractPdfContent(pdfPath, options)

Extract complete content from a PDF file.

Parameters:

- `pdfPath` (string) - Path to the PDF file
- `options` (ExtractionOptions) - Extraction configuration

Returns: Promise

#### extractText(pdfPath, options)

Extract only text content (optimized for speed).

Returns: Promise

#### extractImages(pdfPath, options)

Extract only image references.

Returns: Promise

#### extractImageFiles(pdfPath, outputDir, options)

Extract and save embedded image files from PDF.

Parameters:

- `pdfPath` - Path to the PDF file
- `outputDir` - Output directory path where embedded images will be saved
- `options` - Optional extraction options

Returns: Promise - Array of saved file paths

#### generatePageImages(pdfPath, outputDir, options)

Render PDF pages to image files (page-to-image conversion).

Parameters:

- `pdfPath` - Path to the PDF file
- `outputDir` - Output directory path where page images will be saved
- `options` - Optional rendering options (`pageImageFormat`, `pageImageDpi`, `pageRenderEngine`, etc.)

Returns: Promise - Array of absolute paths to generated page images

Example:

```typescript
import { generatePageImages } from "pdf-plus";

const imagePaths = await generatePageImages("document.pdf", "./page-images", {
  pageImageFormat: "jpg",
  pageImageDpi: 150,
  pageRenderEngine: "poppler",
});

console.log(`Generated ${imagePaths.length} page images`);
// Returns: ['/absolute/path/to/page-images/jpg/page-001.jpg', ...]
```

### Extraction Options

```typescript
interface ExtractionOptions {
  // Basic extraction options
  extractText?: boolean; // Extract text content (default: true)
  extractImages?: boolean; // Extract image references (default: true)
  extractImageFiles?: boolean; // Save actual image files (default: false)
  useImagePaths?: boolean; // Use file paths in references (default: false)
  imageOutputDir?: string; // Directory for image files (default: './extracted-images')
  imageRefFormat?: string; // Custom reference format (default: '[IMAGE:{id}]')
  baseName?: string; // Base name for output files
  verbose?: boolean; // Show detailed progress (default: false)
  memoryLimit?: string; // Memory limit (e.g., '512MB', '1GB')
  batchSize?: number; // Pages per batch (1-100)
  progressCallback?: (progress: ProgressInfo) => void;

  // Image optimization options
  optimizeImages?: boolean; // Enable image optimization (default: false)
  imageOptimizer?: "auto" | "sharp" | "imagemin"; // Optimizer to use (default: 'auto')
  imageQuality?: number; // Image quality 1-100 (default: 80, JP2 conversion: 100)
  imageProgressive?: boolean; // Progressive JPEG (default: true)
  convertJp2ToJpg?: boolean; // Convert JP2 to JPG (default: true)

  // Performance options (NEW!)
  parallelProcessing?: boolean; // Enable parallel processing (default: true)
  maxConcurrentPages?: number; // Max pages in parallel (default: 10)
  maxConcurrentImages?: number; // Max images per page in parallel (default: 20)
  maxConcurrentConversions?: number; // Max JP2 conversions in parallel (default: 5)
  maxConcurrentOptimizations?: number; // Max optimizations in parallel (default: 5)

  // Worker thread options (NEW! 🚀)
  useWorkerThreads?: boolean; // Enable worker threads (default: false)
  autoScaleWorkers?: boolean; // Auto-scale workers (default: true)
  maxWorkerThreads?: number; // Max worker threads (default: CPU cores - 1)
  minWorkerThreads?: number; // Min worker threads (default: 1)
  memoryThreshold?: number; // Memory threshold 0-1 (default: 0.8)
  cpuThreshold?: number; // CPU threshold 0-1 (default: 0.9)
  workerTaskTimeout?: number; // Task timeout ms (default: 30000)
  workerIdleTimeout?: number; // Idle timeout ms (default: 60000)
  workerMemoryLimit?: number; // Memory per worker MB (default: 512)
  enableWorkerForConversion?: boolean; // Workers for JP2 (default: true)
  enableWorkerForOptimization?: boolean; // Workers for optimization (default: true)
  enableWorkerForDecoding?: boolean; // Workers for decoding (default: true)
}
```

Performance Options Explained:

Parallel Processing:

- `parallelProcessing`: Enable/disable parallel processing. Enabled by default for 1.5-3x speedup.
- `maxConcurrentPages`: How many pages to process simultaneously. Higher values = faster for multi-page PDFs, but more memory usage.
- `maxConcurrentImages`: How many images per page to extract in parallel. Increase for pages with many images.
- `maxConcurrentConversions`: How many JP2→JPG conversions to run simultaneously. Keep moderate (5-10) to avoid memory issues.
- `maxConcurrentOptimizations`: How many image optimizations to run simultaneously. Keep moderate (5-10) as optimization is CPU-intensive.
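To make the meaning of a `maxConcurrent*` limit concrete, here is a generic concurrency limiter. It is an illustration of the general technique, not the pdf-plus implementation; `mapWithConcurrency` is a hypothetical helper, and the "page" work is simulated with a resolved promise.

```typescript
// Illustrative only: a generic concurrency limiter, not pdf-plus internals.
// At most `limit` calls to `worker` are in flight at any time.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until the list is exhausted.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Example: "process" 5 pages with at most 2 in flight, tracking the peak.
let inFlight = 0;
let peak = 0;
const done = mapWithConcurrency([1, 2, 3, 4, 5], 2, async (page) => {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await Promise.resolve(); // stand-in for real asynchronous page work
  inFlight--;
  return page * 10;
});

done.then((out) => console.log(out, `peak concurrency: ${peak}`));
```

Results come back in input order regardless of completion order, and raising the limit trades memory for throughput, which is the same trade-off described for `maxConcurrentPages` above.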

Worker Threads (NEW! 🚀):

- `useWorkerThreads`: Enable true multi-threading using Node.js worker threads. Provides 2.5-3.2x additional speedup for CPU-intensive operations. Default: `false` (opt-in).
- `autoScaleWorkers`: Automatically adjust worker count based on system memory and CPU usage. Default: `true`.
- `maxWorkerThreads`: Maximum number of worker threads. Default: CPU cores - 1.
- `minWorkerThreads`: Minimum number of worker threads to keep alive. Default: 1.
- `memoryThreshold`: Memory usage threshold (0-1) before scaling down workers. Default: 0.8 (80%).
- `cpuThreshold`: CPU usage threshold (0-1) before scaling down workers. Default: 0.9 (90%).
- `workerTaskTimeout`: Maximum time (ms) for a worker task before timeout. Default: 30000 (30 seconds).
- `workerIdleTimeout`: Time (ms) before idle workers are terminated. Default: 60000 (60 seconds).
- `workerMemoryLimit`: Memory limit (MB) per worker thread. Default: 512MB.
- `enableWorkerForConversion`: Use workers for JP2 conversion. Default: `true`.
- `enableWorkerForOptimization`: Use workers for image optimization. Default: `true`.
- `enableWorkerForDecoding`: Use workers for image decoding. Default: `true`.
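One way to picture how the thresholds and the min/max bounds interact is the sketch below. It is an assumption-laden illustration, not the library's scaling algorithm: `nextWorkerCount` is a hypothetical function, the scale-down rule follows the threshold descriptions above, and the scale-up rule (add one worker when tasks are queued) is purely an assumption.

```typescript
// Illustrative sketch of threshold-based scaling; NOT the actual pdf-plus algorithm.
interface ScalingOptions {
  minWorkerThreads: number;
  maxWorkerThreads: number;
  memoryThreshold: number; // fraction of memory in use, 0-1
  cpuThreshold: number; // fraction of CPU in use, 0-1
}

function nextWorkerCount(
  current: number,
  memoryUsage: number,
  cpuUsage: number,
  queuedTasks: number,
  opts: ScalingOptions,
): number {
  // Keep the result inside [minWorkerThreads, maxWorkerThreads].
  const clamp = (n: number) =>
    Math.min(opts.maxWorkerThreads, Math.max(opts.minWorkerThreads, n));

  // Over either threshold: shed one worker.
  if (memoryUsage > opts.memoryThreshold || cpuUsage > opts.cpuThreshold) {
    return clamp(current - 1);
  }
  // Assumption: with headroom and pending work, add one worker.
  if (queuedTasks > 0) {
    return clamp(current + 1);
  }
  return clamp(current);
}

const defaults: ScalingOptions = {
  minWorkerThreads: 1,
  maxWorkerThreads: 8,
  memoryThreshold: 0.8,
  cpuThreshold: 0.9,
};

console.log(nextWorkerCount(4, 0.85, 0.5, 10, defaults)); // prints 3 (memory over 80%)
console.log(nextWorkerCount(4, 0.5, 0.5, 3, defaults)); // prints 5 (headroom, queued work)
```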

### Image Reference Placeholders

Use these placeholders in `imageRefFormat`:

- `{id}` - Unique image ID (e.g., `img_1`)
- `{name}` - Original image name from PDF
- `{page}` - Page number
- `{index}` - Global image index
- `{path}` - File path (when `extractImageFiles` is true)

Examples:

- `[IMAGE:{id}]` → `[IMAGE:img_1]`
- `📷 Image {index}` → `📷 Image 1`
- `{name} on page {page}` → `artwork_1 on page 5`
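To preview how a format string expands before running an extraction, placeholder substitution can be approximated as follows. This is a hypothetical re-implementation for illustration (`formatImageRef` and `ImageRefFields` are not part of pdf-plus), and the library may treat unknown or missing placeholders differently.

```typescript
// Hypothetical re-implementation of placeholder substitution, for illustration only.
interface ImageRefFields {
  id: string;
  name: string;
  page: number;
  index: number;
  path?: string;
}

function formatImageRef(template: string, image: ImageRefFields): string {
  // Replace each known {placeholder}; a placeholder with no value is left untouched.
  return template.replace(
    /\{(id|name|page|index|path)\}/g,
    (match, key: keyof ImageRefFields) => {
      const value = image[key];
      return value === undefined ? match : String(value);
    },
  );
}

const sample: ImageRefFields = { id: "img_1", name: "artwork_1", page: 5, index: 1 };

console.log(formatImageRef("[IMAGE:{id}]", sample)); // prints "[IMAGE:img_1]"
console.log(formatImageRef("{name} on page {page}", sample)); // prints "artwork_1 on page 5"
```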

## Image Optimization & Conversion

Extract and optimize images in one step using Sharp or Imagemin:

```typescript
import { extractPdfContent } from "pdf-plus";

const result = await extractPdfContent("document.pdf", {
  extractImageFiles: true,
  imageOutputDir: "./images",

  // Enable optimization
  optimizeImages: true,
  imageOptimizer: "auto", // Automatically selects best available
  imageQuality: 80,
  imageProgressive: true,

  // Convert JP2 (JPEG 2000) to JPG for better compatibility (default: true)
  convertJp2ToJpg: true,

  verbose: true,
});

// Output:
// 🖼️ Extracting images from: document.pdf
// 📊 Processing 50 pages with PDF-lib engine
// 💾 Extracted real image: img_p1_1.jpg (245KB)
// 🔄 Converting 16 JP2 images to JPG...
// 🔄 Converted JP2 → JPG: img_p2_2.jpg (24026 → 18500 bytes)
// 🎨 Optimizing 54 images...
// ✅ img_p1_1.jpg: 251904 → 184320 bytes (-26.8%) [sharp]
// ✅ img_p2_2.jpg: 18500 → 15200 bytes (-17.8%) [sharp]
```

### JP2 to JPG Conversion

JP2 (JPEG 2000) files are not widely supported by browsers and image tools. The library automatically converts them to standard JPG format:

```typescript
const result = await extractPdfContent("document.pdf", {
  extractImageFiles: true,
  convertJp2ToJpg: true, // Default: true
  imageQuality: 100, // Default: 100 (maximum quality preservation)
});

// All JP2 images are now JPG files with better compatibility
```

Quality Preservation:

- Default quality: 100 - Preserves maximum quality from JP2
- Use lower values (80-90) if you want additional compression
- Original JP2 files are deleted after successful conversion

Benefits:

- ✅ Better browser compatibility
- ✅ Can be optimized by Sharp/Imagemin
- ✅ Maximum quality preserved (quality=100)
- ✅ Works everywhere
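If you disable conversion and want to find JPEG 2000 files yourself, they can be recognized by their leading bytes: a JP2 container starts with a 12-byte signature box, and a raw JPEG 2000 codestream starts with the bytes `FF 4F FF 51`. The check below is independent of pdf-plus; `isJpeg2000` and `hasPrefix` are hypothetical helper names, and how the library itself detects JP2 data is not documented here.

```typescript
// Illustrative check, independent of pdf-plus: detect JPEG 2000 data by its leading bytes.
// JP2 container: signature box (length 12, type "jP  ", content 0D 0A 87 0A).
const JP2_SIGNATURE = [0x00, 0x00, 0x00, 0x0c, 0x6a, 0x50, 0x20, 0x20, 0x0d, 0x0a, 0x87, 0x0a];
// Raw codestream: SOC marker (FF 4F) followed by SIZ marker (FF 51).
const J2K_CODESTREAM = [0xff, 0x4f, 0xff, 0x51];

function hasPrefix(bytes: Uint8Array, prefix: number[]): boolean {
  return bytes.length >= prefix.length && prefix.every((b, i) => bytes[i] === b);
}

function isJpeg2000(bytes: Uint8Array): boolean {
  return hasPrefix(bytes, JP2_SIGNATURE) || hasPrefix(bytes, J2K_CODESTREAM);
}

// A baseline JPEG starts with FF D8 FF, so it is not reported as JPEG 2000.
console.log(isJpeg2000(new Uint8Array([0xff, 0xd8, 0xff, 0xe0]))); // prints false
console.log(isJpeg2000(new Uint8Array(JP2_SIGNATURE))); // prints true
```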

### Optimizer Comparison

| Optimizer | Speed | Quality | Formats | Platform |
| --------- | ----- | ------- | ------- | -------- |
| `sharp` | Fast | Excellent | JPG, PNG, WebP | Native (requires compilation) |
| `imagemin` | Medium | Excellent | JPG, PNG, GIF, SVG | Cross-platform |
| `auto` | Variable | Excellent | All supported | Tries sharp first, falls back to imagemin |

### Quality Settings

```typescript
// Maximum compression (slower, smaller files)
const smallest = await extractPdfContent("document.pdf", {
  optimizeImages: true,
  imageQuality: 70,
});

// Balanced (recommended)
const balanced = await extractPdfContent("document.pdf", {
  optimizeImages: true,
  imageQuality: 80, // Default
});

// Fast optimization with Sharp
const fast = await extractPdfContent("document.pdf", {
  optimizeImages: true,
  imageOptimizer: "sharp",
  imageQuality: 85,
});
```

## Performance Modes

### Text Only

```typescript
const text = await extractText("document.pdf");
// ~40% faster than combined mode
```

### Images Only

```typescript
const images = await extractImages("document.pdf");
// ~20% faster than combined mode
```

### Combined

```typescript
const result = await extractPdfContent("document.pdf");
// Full extraction with text and image references
```

## Error Handling

```typescript
import { extractPdfContent } from "pdf-plus";

try {
  const result = await extractPdfContent("document.pdf");
} catch (error: any) {
  // `any` keeps the example short; narrow the type in production code
  if (error.code === "VALIDATION_ERROR") {
    console.error("Configuration error:", error.validationErrors);
  } else if (error.code === "EXTRACTION_ERROR") {
    console.error("Extraction failed:", error.message);
  } else {
    console.error("Unexpected error:", error);
  }
}
```
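Under strict TypeScript settings a caught value is typed `unknown`, so it helps to narrow it before reading `code`. The sketch below shows one way to do that with a type guard. The `CodedError` shape is inferred from the fields used in the example above (`code`, `message`, `validationErrors`), and `isCodedError` and `describeFailure` are hypothetical helpers; pdf-plus may export its own error classes.

```typescript
// Hypothetical error shape, inferred from the fields used in the example above.
interface CodedError extends Error {
  code: string;
  validationErrors?: unknown;
}

// Type guard: true when the caught value is an Error carrying a string `code`.
function isCodedError(error: unknown): error is CodedError {
  return (
    error instanceof Error &&
    "code" in error &&
    typeof (error as Error & { code: unknown }).code === "string"
  );
}

// Map a caught value to a short, user-facing description.
function describeFailure(error: unknown): string {
  if (isCodedError(error)) {
    if (error.code === "VALIDATION_ERROR") return "Configuration error";
    if (error.code === "EXTRACTION_ERROR") return `Extraction failed: ${error.message}`;
  }
  return "Unexpected error";
}

const simulated = Object.assign(new Error("corrupt xref table"), {
  code: "EXTRACTION_ERROR",
});
console.log(describeFailure(simulated)); // prints "Extraction failed: corrupt xref table"
```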

## Development

```bash
# Install dependencies
pnpm install

# Build the library
pnpm run build

# Lint and format
pnpm run lint:fix
pnpm run format

# Type checking
pnpm run check
```

## Requirements

- Node.js >= 18.0.0
- TypeScript >= 5.0 (for development)

## License

MIT

## Contributing

Contributions are welcome! Please read our contributing guidelines and submit pull requests to our repository.

## Troubleshooting

### Common Issues

#### "Cannot find module" errors

Make sure you're using the correct import syntax for your environment:

```typescript
// ESM (recommended)
import { extractPdfContent } from "pdf-plus";

// CommonJS
const { extractPdfContent } = require("pdf-plus");
```

#### Memory issues with large PDFs

For large documents, cap memory usage and process pages in smaller batches:

```typescript
const result = await extractPdfContent("large-document.pdf", {
  memoryLimit: "512MB",
  batchSize: 5,
  useCache: true,
});
```
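The `memoryLimit` option takes a string such as `"512MB"` or `"1GB"`. If you want to validate such a value before passing it in, a parser along these lines can help. This is a hypothetical sketch: `parseMemoryLimit` is not part of pdf-plus, it assumes binary units (1 MB = 1024 × 1024 bytes), and the library may accept other suffixes or interpret units differently.

```typescript
// Hypothetical parser for strings like "512MB" or "1GB"; pdf-plus may parse these differently.
// Assumes binary units (1 MB = 1024 * 1024 bytes).
function parseMemoryLimit(limit: string): number {
  const match = /^(\d+(?:\.\d+)?)\s*(KB|MB|GB)$/i.exec(limit.trim());
  if (!match) {
    throw new Error(`Unrecognized memory limit: ${limit}`);
  }
  const units: Record<string, number> = {
    KB: 1024,
    MB: 1024 ** 2,
    GB: 1024 ** 3,
  };
  // match[1] is the numeric part, match[2] the unit suffix.
  return Math.round(Number(match[1]) * units[match[2].toUpperCase()]);
}

console.log(parseMemoryLimit("512MB")); // prints 536870912
console.log(parseMemoryLimit("1GB")); // prints 1073741824
```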

#### Image extraction not working

Try different engines:

```typescript
const result = await extractPdfContent("document.pdf", {
  imageEngine: "poppler", // or 'pdf-lib', 'auto'
  extractImageFiles: true,
});
```

#### Text extraction issues

Some PDFs may have encoding issues. Try:

```typescript
const result = await extractPdfContent("document.pdf", {
  extractText: true,
  textEngine: "pdfjs", // Alternative engine
  verbose: true, // See detailed logs
});
```

### Performance Tips

1. Use specific extraction modes for better performance:

   ```typescript
   // Text only (fastest)
   const text = await extractText("document.pdf");

   // Images only
   const images = await extractImages("document.pdf");
   ```

2. Enable caching for repeated operations:

   ```typescript
   const extractor = new PDFExtractor("./cache");
   ```

3. Process pages in batches for large documents:

   ```typescript
   const result = await extractPdfContent("large.pdf", {
     batchSize: 10,
     memoryLimit: "1GB",
   });
   ```

### Getting Help

- Check the Issues page
- Review examples for common use cases
- Enable verbose logging for debugging: `{ verbose: true }`

## Roadmap

### Planned Features

- OCR Support: Text extraction from image-based PDFs
- Advanced Text Analysis: Font detection, text classification
- Streaming API: Process large documents efficiently
- Cloud Integration: Direct integration with cloud storage
- CLI Tool: Command-line interface for batch processing
- Web Worker Support: Browser-based extraction
- Plugin System: Extensible architecture for custom extractors

### Planned Work

- [ ] OCR integration with Tesseract.js
- [ ] Advanced image processing options
- [ ] Streaming extraction API
- [ ] Performance optimizations
- [ ] Browser compatibility layer
- [ ] CLI tool development

See CHANGELOG.md for detailed version history.
