# pdf-parse-new

---

Pure JavaScript cross-platform module to extract text from PDFs with intelligent performance optimization.

[npm package](https://www.npmjs.com/package/pdf-parse-new) · [License](LICENSE) · [Downloads](https://www.npmjs.com/package/pdf-parse-new)

Version 2.0.0 - Release with SmartPDFParser, multi-core processing, and AI-powered method selection based on 15,000+ real-world benchmarks.

---

## Table of Contents

- Features
- Installation
- Quick Start
- Smart Parser
- API Reference
- Performance Optimization
- Benchmarking
- Troubleshooting
- Contributing
- License

---

## Features

✨ SmartPDFParser with AI-Powered Selection
- Automatically selects optimal parsing method based on PDF characteristics
- CPU-aware thresholds that adapt to available hardware (4 to 48+ cores)
- Fast-path optimization: 50x lower method-selection overhead for small PDFs (25 ms → 0.5 ms)
- LRU caching: 25x lower method-selection overhead on repeated, similar PDFs
- 90%+ optimization rate in production

⚡ Multi-Core Performance
- Child Processes: True multi-processing, 2-4x faster for huge PDFs
- Worker Threads: Alternative multi-threading with lower memory overhead
- Oversaturation: Use 1.5x-2x cores for maximum CPU utilization (I/O-bound optimization)
- Automatic memory safety limits

📊 Battle-Tested Intelligence
- Decision tree trained on 9,417 real-world PDF benchmarks
- Tested on documents from 1 to 10,000+ pages
- CPU normalization: adapts thresholds from 4-core laptops to 48-core servers
- Production-ready with comprehensive error handling

🚀 Multiple Parsing Strategies
- Batch Processing: Parallel page processing (optimal for 0-1000 pages)
- Child Processes: Multi-processing (default for 1000+ pages, most consistent)
- Worker Threads: Multi-threading (alternative, can be faster on some PDFs)
- Streaming: Memory-efficient chunking for constrained environments
- Aggressive: Combines streaming with large batches
- Sequential: Traditional fallback

🔧 Developer Experience
- Drop-in replacement for pdf-parse (backward compatible)
- 7 practical examples in test/examples/
- Full TypeScript definitions with autocomplete
- Comprehensive benchmarking tools included
- Zero configuration required (paths resolved automatically)

---

## Installation

```bash
npm install pdf-parse-new
```





---



## What's New in 2.0.0

SmartPDFParser: intelligent automatic method selection
- CPU-aware decision tree (adapts to 4-48+ cores)
- Fast-path optimization (0.5 ms overhead vs. 25 ms)
- LRU caching for repeated PDFs
- 90%+ optimization rate

Multi-Core Processing
- Child processes (default, most consistent)
- Worker threads (alternative, can be faster)
- Oversaturation factor (1.5x cores = better CPU utilization)
- Automatic memory safety

Performance Improvements
- 2-4x faster for huge PDFs (1000+ pages)
- 50x lower overhead for tiny PDFs (< 0.5 MB)
- 25x lower overhead on cache hits
- CPU normalization for any hardware

Better DX
- 7 practical examples with npm scripts
- Full TypeScript definitions
- Comprehensive benchmarking tools
- Clean repository structure



### Backward Compatibility

Version 2.0.0 is backward compatible. Your existing code will continue to work:

```javascript
// v1.x code still works
const pdf = require('pdf-parse-new');
pdf(buffer).then(data => console.log(data.text));
```

To take advantage of the new features:

```javascript
// Use SmartPDFParser for automatic optimization
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartParser();

// `await` must run inside an async function (or an ES module)
const result = await parser.parse(buffer);
console.log(`Used ${result._meta.method} in ${result._meta.duration}ms`);
```





---



## Quick Start

### Basic Usage

```javascript
const fs = require('fs');
const pdf = require('pdf-parse-new');

const dataBuffer = fs.readFileSync('path/to/file.pdf');

pdf(dataBuffer).then(function(data) {
    console.log(data.numpages);  // Number of pages
    console.log(data.text);      // Full text content
    console.log(data.info);      // PDF metadata
});
```





### Examples

See test/examples/ for practical examples:

```bash
# Try the examples
npm run example:basic      # Basic parsing
npm run example:smart      # SmartPDFParser (recommended)
npm run example:compare    # Compare all methods

# Or run directly
node test/examples/01-basic-parse.js
node test/examples/06-smart-parser.js
```

The 7 complete examples cover all parsing methods with real-world patterns.



### With SmartPDFParser

```javascript
const fs = require('fs');
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');

const parser = new SmartParser();
const dataBuffer = fs.readFileSync('large-document.pdf');

parser.parse(dataBuffer).then(function(result) {
    console.log(`Parsed ${result.numpages} pages in ${result._meta.duration}ms`);
    console.log(`Method used: ${result._meta.method}`);
    console.log(result.text);
});
```

### Error Handling

```javascript
pdf(dataBuffer)
    .then(data => {
        // Process data
    })
    .catch(error => {
        console.error('Error parsing PDF:', error);
    });
```
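The same pattern can be written with `async`/`await`. The sketch below is only an illustration: `tryParse` is a hypothetical helper (not part of pdf-parse-new) that accepts any Promise-returning parse function, such as `pdf`, and turns a rejection into a plain result object.

```javascript
// Hypothetical helper, not part of the library.
// `parse` is any function that returns a Promise, e.g. `pdf` from the examples above.
async function tryParse(parse, buffer) {
    try {
        const data = await parse(buffer);
        return { ok: true, data };
    } catch (error) {
        // The rejection is captured instead of propagating to the caller
        return { ok: false, error };
    }
}

// Usage, assuming `pdf` and `dataBuffer` from Basic Usage:
// const result = await tryParse(pdf, dataBuffer);
// if (!result.ok) console.error('Error parsing PDF:', result.error);
```

Returning a result object keeps the calling code free of nested `try` blocks when many files are parsed in a loop.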





---



## Smart Parser

The `SmartPDFParser` automatically selects the optimal parsing method based on PDF characteristics.

### Decision Tree

Based on 9,417 real-world benchmarks (trained 2025-11-23):

| Pages    | Method      | Avg Time     | Best For                      |
|----------|-------------|--------------|-------------------------------|
| 1-10     | batch-5     | ~10ms        | Tiny documents                |
| 11-50    | batch-10    | ~107ms       | Small documents               |
| 51-200   | batch-20    | ~332ms       | Medium documents              |
| 201-500  | batch-50    | ~1102ms      | Large documents               |
| 501-1000 | batch-50    | ~1988ms      | X-Large documents             |
| 1000+    | processes\* | ~2355-4468ms | Huge documents (2-4x faster!) |

\*Both workers and processes are excellent for huge PDFs. Processes is the default due to better consistency, but workers can be faster in some cases. Use `forceMethod: 'workers'` to try workers.
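To make the table concrete, here is a minimal lookup that mirrors its page ranges. It is a sketch, not the library's decision tree: `suggestMethod` and `PAGE_TIERS` are hypothetical names, and the real selection also accounts for CPU count and file size.

```javascript
// Hypothetical helper that mirrors the benchmark table; not the library's internals.
const PAGE_TIERS = [
    { maxPages: 10, method: 'batch-5' },
    { maxPages: 50, method: 'batch-10' },
    { maxPages: 200, method: 'batch-20' },
    { maxPages: 1000, method: 'batch-50' },
];

function suggestMethod(pageCount) {
    if (!Number.isInteger(pageCount) || pageCount < 1) {
        throw new RangeError('pageCount must be a positive integer');
    }
    // First tier whose upper bound covers the page count; beyond the last tier, go multi-process
    const tier = PAGE_TIERS.find(t => pageCount <= t.maxPages);
    return tier ? tier.method : 'processes';
}

console.log(suggestMethod(8));     // batch-5
console.log(suggestMethod(750));   // batch-50
console.log(suggestMethod(5000));  // processes
```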





### Usage Options

#### Automatic (Recommended)

```javascript
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartParser();

// Automatically selects the best method
const result = await parser.parse(pdfBuffer);
```





#### Force Specific Method

```javascript
const parser = new SmartParser({
    forceMethod: 'workers'  // 'batch', 'workers', 'processes', 'stream', 'sequential'
});

// Example: compare workers vs. processes for your specific PDFs
const testWorkers = new SmartParser({ forceMethod: 'workers' });
const testProcesses = new SmartParser({ forceMethod: 'processes' });

const result1 = await testWorkers.parse(hugePdfBuffer);
console.log(`Workers: ${result1._meta.duration}ms`);

const result2 = await testProcesses.parse(hugePdfBuffer);
console.log(`Processes: ${result2._meta.duration}ms`);
```





#### Memory Limit

```javascript
const parser = new SmartParser({
    maxMemoryUsage: 2e9  // 2 GB max
});
```





#### Oversaturation for Maximum Performance

PDF parsing is I/O-bound. During I/O waits, CPU cores sit idle. Oversaturation keeps them busy:

```javascript
const parser = new SmartParser({
    oversaturationFactor: 1.5  // Use 1.5x as many workers as cores
});

// Example on a 24-core system:
// - Default (1.5x): 36 workers (instead of 23!)
// - Aggressive (2x): 48 workers
// - Conservative (1x): 24 workers
```

Why this works:
- PDF parsing involves lots of I/O (reading data, decompressing)
- During I/O, CPU cores are idle
- More workers = cores stay busy = better throughput

Automatic memory limiting:
- The parser automatically limits workers if memory is constrained
- Each worker needs ~2x the PDF size in memory
- The safe default balances speed and memory
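The interaction between oversaturation and the memory limit can be sketched as a single calculation. This is an assumption about how the two constraints combine, not the library's actual code: `estimateWorkerCount` is a hypothetical function, and only the 1.5x factor and the "~2x PDF size per worker" figure come from the text above.

```javascript
const os = require('os');

// Hypothetical sketch: the function name and the exact formula are assumptions.
function estimateWorkerCount({
    cores = os.cpus().length,
    oversaturationFactor = 1.5,
    pdfBytes,
    maxMemoryUsage,
}) {
    // CPU side: run more workers than cores so idle cores stay busy
    const byCpu = Math.ceil(cores * oversaturationFactor);
    // Memory side: each worker is assumed to need about 2x the PDF size
    const byMemory = Math.floor(maxMemoryUsage / (2 * pdfBytes));
    // Take the stricter limit, but always keep at least one worker
    return Math.max(1, Math.min(byCpu, byMemory));
}

// 10 MB PDF, 2 GB budget: the CPU limit applies
console.log(estimateWorkerCount({ cores: 24, pdfBytes: 10e6, maxMemoryUsage: 2e9 }));   // 36
// 100 MB PDF, 2 GB budget: the memory limit applies
console.log(estimateWorkerCount({ cores: 24, pdfBytes: 100e6, maxMemoryUsage: 2e9 }));  // 10
```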



### Statistics

```javascript
const stats = parser.getStats();
console.log(stats);
// {
//   totalParses: 10,
//   methodUsage: { batch: 8, workers: 2 },
//   averageTimes: { batch: 150.5, workers: 2300.1 },
//   failedParses: 0
// }
```
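As an illustration of how this object might be consumed, the sketch below turns it into one summary line per method. `summarizeStats` is a hypothetical helper, and the sample object is copied from the output shown above rather than produced by a real parser.

```javascript
// Sample object copied from the getStats() output shown above
const sampleStats = {
    totalParses: 10,
    methodUsage: { batch: 8, workers: 2 },
    averageTimes: { batch: 150.5, workers: 2300.1 },
    failedParses: 0,
};

// Hypothetical helper: one line per method, most-used method first
function summarizeStats(stats) {
    return Object.entries(stats.methodUsage)
        .sort(([, a], [, b]) => b - a)
        .map(([method, count]) => {
            const share = ((count / stats.totalParses) * 100).toFixed(0);
            return `${method}: ${count} parses (${share}%), avg ${stats.averageTimes[method]}ms`;
        });
}

console.log(summarizeStats(sampleStats).join('\n'));
```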





### CPU-Aware Thresholds

SmartPDFParser automatically adapts to your CPU:

```javascript
// On a 4-core laptop
parser.parse(pdf500Pages);
// → Uses workers (threshold: ~167 pages)

// On a 48-core server
parser.parse(pdf500Pages);
// → Uses batch (threshold: ~2000 pages; worker overhead is not worth it yet)
```

This keeps performance close to optimal regardless of hardware. The decision tree was trained on multiple machines with different core counts.
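The two thresholds quoted in the example (about 167 pages on 4 cores, about 2000 pages on 48 cores) are consistent with a linear rule anchored at 1000 pages on 24 cores. The sketch below encodes that inference; the reference point and the function names are assumptions, not values confirmed by the library.

```javascript
const os = require('os');

// Assumed reference point, inferred from the two thresholds quoted above
const REFERENCE_CORES = 24;
const REFERENCE_THRESHOLD_PAGES = 1000;

// Page count at which parallel parsing is assumed to start paying off
function parallelThreshold(cores = os.cpus().length) {
    return Math.round((REFERENCE_THRESHOLD_PAGES * cores) / REFERENCE_CORES);
}

function pickStrategy(pageCount, cores) {
    return pageCount >= parallelThreshold(cores) ? 'parallel' : 'batch';
}

console.log(parallelThreshold(4));   // 167
console.log(parallelThreshold(48));  // 2000
console.log(pickStrategy(500, 4));   // parallel
console.log(pickStrategy(500, 48));  // batch
```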



### Fast-Path Optimization

SmartPDFParser uses intelligent fast-paths to minimize overhead:

```javascript
const parser = new SmartParser();

// Tiny PDF (< 0.5 MB)
await parser.parse(tiny_pdf);
// ⚡ Fast-path: ~0.5ms overhead (50x faster than tree navigation!)

// Small PDF (< 1 MB)
await parser.parse(small_pdf);
// ⚡ Fast-path: ~0.5ms overhead

// Medium PDF (already seen similar)
await parser.parse(medium_pdf);
// 💾 Cache hit: ~1ms overhead

// Common scenario (500 pages, 5MB)
await parser.parse(common_pdf);
// 📋 Common scenario: ~2ms overhead

// Rare case (unusual size/page ratio)
await parser.parse(unusual_pdf);
// 🌳 Full tree: ~25ms overhead (only for edge cases)
```

Overhead Comparison:

| PDF Type        | Before | After | Speedup        |
|-----------------|--------|-------|----------------|
| Tiny (< 0.5 MB) | 25ms   | 0.5ms | 50x faster ⚡  |
| Small (< 1 MB)  | 25ms   | 0.5ms | 50x faster ⚡  |
| Cached          | 25ms   | 1ms   | 25x faster 💾  |
| Common          | 25ms   | 2ms   | 12x faster 📋  |
| Rare            | 25ms   | 25ms  | Same 🌳        |

90%+ of PDFs hit a fast-path, which keeps selection overhead minimal even for tiny documents.
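The tiers above can be pictured as a chain of progressively more expensive checks. The following is a simplified sketch under stated assumptions, not the library's implementation: all names, the cache key, and the cache size are hypothetical, the "common scenario" tier is omitted for brevity, and returning `'batch'` for small files is an assumption based on the decision table.

```javascript
const MB = 1024 * 1024;
const CACHE_LIMIT = 100;   // assumed cache size
const cache = new Map();   // insertion-ordered Map used as a small LRU

// Bucket "similar" PDFs together: size rounded to whole MB, pages to the nearest 50
function cacheKey(sizeBytes, pages) {
    return `${Math.round(sizeBytes / MB)}:${Math.round(pages / 50)}`;
}

function selectMethod(sizeBytes, pages, fullDecisionTree) {
    // 1. Fast path: small files skip the tree entirely
    if (sizeBytes < 1 * MB) return { method: 'batch', path: 'fast-path' };

    // 2. Cache: reuse the decision made for a similar PDF
    const key = cacheKey(sizeBytes, pages);
    if (cache.has(key)) {
        const method = cache.get(key);
        cache.delete(key);   // re-insert to mark the entry as most recently used
        cache.set(key, method);
        return { method, path: 'cache' };
    }

    // 3. Slow path: walk the full decision tree, then remember the answer
    const method = fullDecisionTree(sizeBytes, pages);
    cache.set(key, method);
    if (cache.size > CACHE_LIMIT) cache.delete(cache.keys().next().value);  // evict the oldest entry
    return { method, path: 'tree' };
}
```

Ordering the checks from cheapest to most expensive is what keeps the average overhead low: the full tree only runs for inputs that no earlier tier recognizes.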



---



## API Reference

### `pdf(dataBuffer, options)`

Parse a PDF file and extract text content.

Parameters:
- `dataBuffer` (Buffer): PDF file buffer
- `options` (Object, optional):
  - `pagerender` (Function): Custom page rendering function
  - `max` (Number): Maximum number of pages to parse
  - `version` (String): PDF.js version to use

Returns: Promise
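As a sketch of how `max` and `pagerender` might be combined, the example below defines a custom renderer that joins a page's text items with spaces. It assumes the callback receives a PDF.js page object exposing `getTextContent()` and returns a Promise of the page text, as in the original pdf-parse; `renderPage` is a hypothetical name.

```javascript
// Custom page renderer: joins the text items of one page with single spaces.
// Assumption: `pageData` is a PDF.js page object exposing getTextContent().
function renderPage(pageData) {
    return pageData.getTextContent().then(textContent =>
        textContent.items.map(item => item.str).join(' ')
    );
}

const options = {
    max: 2,                  // parse only the first two pages
    pagerender: renderPage,  // use the custom renderer for each page
};

// Usage, assuming `pdf` and `dataBuffer` from Quick Start:
// pdf(dataBuffer, options).then(data => console.log(data.text));
```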
