Pure javascript cross-platform module to extract text from PDFs with AI-powered optimization and multi-core processing.
npm install pdf-parse-newtest/examples/
bash
npm install pdf-parse-new
`
---
What's New in 2.0.0
$3
SmartPDFParser - Intelligent automatic method selection
- CPU-aware decision tree (adapts to 4-48+ cores)
- Fast-path optimization (0.5ms overhead vs 25ms)
- LRU caching for repeated PDFs
- 90%+ optimization rate
Multi-Core Processing
- Child processes (default, most consistent)
- Worker threads (alternative, can be faster)
- Oversaturation factor (1.5x cores = better CPU utilization)
- Automatic memory safety
Performance Improvements
- 2-4x faster for huge PDFs (1000+ pages)
- 50x faster overhead for tiny PDFs (< 0.5 MB)
- 25x faster on cache hits
- CPU normalization for any hardware
Better DX
- 7 practical examples with npm scripts
- Full TypeScript definitions
- Comprehensive benchmarking tools
- Clean repository structure
$3
Version 2.0.0 is backward compatible. Your existing code will continue to work:
`javascript
// v1.x code still works
const pdf = require('pdf-parse-new');
pdf(buffer).then(data => console.log(data.text));
`
To take advantage of new features:
`javascript
// Use SmartPDFParser for automatic optimization
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartParser();
const result = await parser.parse(buffer);
console.log(Used ${result._meta.method} in ${result._meta.duration}ms);
`
---
Quick Start
$3
`javascript
const fs = require('fs');
const pdf = require('pdf-parse-new');
const dataBuffer = fs.readFileSync('path/to/file.pdf');
pdf(dataBuffer).then(function(data) {
console.log(data.numpages); // Number of pages
console.log(data.text); // Full text content
console.log(data.info); // PDF metadata
});
`
$3
See test/examples/ for practical examples:
`bash
Try the examples
npm run example:basic # Basic parsing
npm run example:smart # SmartPDFParser (recommended)
npm run example:compare # Compare all methods
Or run directly
node test/examples/01-basic-parse.js
node test/examples/06-smart-parser.js
`
7 complete examples covering all parsing methods with real-world patterns!
$3
`javascript
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartParser();
const dataBuffer = fs.readFileSync('large-document.pdf');
parser.parse(dataBuffer).then(function(result) {
console.log(Parsed ${result.numpages} pages in ${result._meta.duration}ms);
console.log(Method used: ${result._meta.method});
console.log(result.text);
});
`
$3
`javascript
pdf(dataBuffer)
.then(data => {
// Process data
})
.catch(error => {
console.error('Error parsing PDF:', error);
});
`
---
Smart Parser
The SmartPDFParser automatically selects the optimal parsing method based on PDF characteristics.
$3
Based on 9,417 real-world benchmarks (trained 2025-11-23):
| Pages | Method | Avg Time | Best For |
|-----------|-----------|----------|-----------------------------|
| 1-10 | batch-5 | ~10ms | Tiny documents |
| 11-50 | batch-10 | ~107ms | Small documents |
| 51-200 | batch-20 | ~332ms | Medium documents |
| 201-500 | batch-50 | ~1102ms | Large documents |
| 501-1000 | batch-50 | ~1988ms | X-Large documents |
| 1000+ | processes* | ~2355-4468ms | Huge documents (2-4x faster!) |
\*Both workers and processes are excellent for huge PDFs. Processes is the default due to better consistency, but workers can be faster in some cases. Use forceMethod: 'workers' to try workers.
$3
#### Automatic (Recommended)
`javascript
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartParser();
// Automatically selects best method
const result = await parser.parse(pdfBuffer);
`
#### Force Specific Method
`javascript
const parser = new SmartParser({
forceMethod: 'workers' // 'batch', 'workers', 'processes', 'stream', 'sequential'
});
// Example: Compare workers vs processes for your specific PDFs
const testWorkers = new SmartParser({ forceMethod: 'workers' });
const testProcesses = new SmartParser({ forceMethod: 'processes' });
const result1 = await testWorkers.parse(hugePdfBuffer);
console.log(Workers: ${result1._meta.duration}ms);
const result2 = await testProcesses.parse(hugePdfBuffer);
console.log(Processes: ${result2._meta.duration}ms);
`
#### Memory Limit
`javascript
const parser = new SmartParser({
maxMemoryUsage: 2e9 // 2GB max
});
`
#### Oversaturation for Maximum Performance
PDF parsing is I/O-bound. During I/O waits, CPU cores sit idle. Oversaturation keeps them busy:
`javascript
const parser = new SmartParser({
oversaturationFactor: 1.5 // Use 1.5x more workers than cores
});
// Example on 24-core system:
// - Default (1.5x): 36 workers (instead of 23!)
// - Aggressive (2x): 48 workers
// - Conservative (1x): 24 workers
`
Why this works:
- PDF parsing involves lots of I/O (reading data, decompressing)
- During I/O, CPU cores are idle
- More workers = cores stay busy = better throughput
Automatic memory limiting:
- Parser automatically limits workers if memory is constrained
- Each worker needs ~2x PDF size in memory
- Safe default balances speed and memory
$3
`javascript
const stats = parser.getStats();
console.log(stats);
// {
// totalParses: 10,
// methodUsage: { batch: 8, workers: 2 },
// averageTimes: { batch: 150.5, workers: 2300.1 },
// failedParses: 0
// }
`
$3
SmartPDFParser automatically adapts to your CPU:
`javascript
// On 4-core laptop
parser.parse(500_page_pdf);
// โ Uses workers (threshold: ~167 pages)
// On 48-core server
parser.parse(500_page_pdf);
// โ Uses batch (threshold: ~2000 pages, workers overhead not worth it yet)
`
This ensures optimal performance regardless of hardware! The decision tree was trained on multiple machines with different core counts.
$3
SmartPDFParser uses intelligent fast-paths to minimize overhead:
`javascript
const parser = new SmartParser();
// Tiny PDF (< 0.5 MB)
await parser.parse(tiny_pdf);
// โก Fast-path: ~0.5ms overhead (50x faster than tree navigation!)
// Small PDF (< 1 MB)
await parser.parse(small_pdf);
// โก Fast-path: ~0.5ms overhead
// Medium PDF (already seen similar)
await parser.parse(medium_pdf);
// ๐พ Cache hit: ~1ms overhead
// Common scenario (500 pages, 5MB)
await parser.parse(common_pdf);
// ๐ Common scenario: ~2ms overhead
// Rare case (unusual size/page ratio)
await parser.parse(unusual_pdf);
// ๐ณ Full tree: ~25ms overhead (only for edge cases)
`
Overhead Comparison:
| PDF Type | Before | After | Speedup |
|----------|--------|-------|---------|
| Tiny (< 0.5 MB) | 25ms | 0.5ms | 50x faster โก |
| Small (< 1 MB) | 25ms | 0.5ms | 50x faster โก |
| Cached | 25ms | 1ms | 25x faster ๐พ |
| Common | 25ms | 2ms | 12x faster ๐ |
| Rare | 25ms | 25ms | Same ๐ณ |
90%+ of PDFs hit a fast-path! This means minimal overhead even for tiny documents.
---
API Reference
$3
Parse a PDF file and extract text content.
Parameters:
- dataBuffer (Buffer): PDF file buffer
- options (Object, optional):
- pagerender (Function): Custom page rendering function
- max (Number): Maximum number of pages to parse
- version (String): PDF.js version to use
Returns: Promisenumpages (Number): Total number of pages
- numrender (Number): Number of rendered pages
- info (Object): PDF metadata
- metadata (Object): PDF metadata object
- text (String): Extracted text content
- version (String): PDF.js version used
$3
#### constructor(options)
Options:
- forceMethod (String): Force specific parsing method
- maxMemoryUsage (Number): Maximum memory usage in bytes
- availableCPUs (Number): Override CPU count detection
#### parse(dataBuffer, userOptions)
Parse PDF with automatic method selection.
Returns: Promise_meta field)
- _meta.method (String): Parsing method used
- _meta.duration (Number): Parse time in milliseconds
- _meta.analysis (Object): PDF analysis data
#### getStats()
Get parsing statistics for current session.
---
TypeScript and NestJS Support
This library includes full TypeScript definitions and works seamlessly with NestJS.
$3
`typescript
// โ
CORRECT: Use namespace import
import * as PdfParse from 'pdf-parse-new';
// Create parser instance
const parser = new PdfParse.SmartPDFParser({
oversaturationFactor: 2.0,
enableFastPath: true
});
// Parse PDF
const result = await parser.parse(pdfBuffer);
console.log(Parsed ${result.numpages} pages using ${result._meta.method});
`
`typescript
// โ WRONG: This will NOT work
import PdfParse from 'pdf-parse-new'; // Error: SmartPDFParser is not a constructor
import { SmartPDFParser } from 'pdf-parse-new'; // Error: No named export
`
$3
`typescript
import { Injectable } from '@nestjs/common';
import * as PdfParse from 'pdf-parse-new';
import * as fs from 'fs';
@Injectable()
export class PdfService {
private parser: PdfParse.SmartPDFParser;
constructor() {
// Initialize parser with custom options
this.parser = new PdfParse.SmartPDFParser({
oversaturationFactor: 2.0,
enableFastPath: true,
enableCache: true,
maxWorkerLimit: 50
});
}
async parsePdf(filePath: string): Promise {
const dataBuffer = fs.readFileSync(filePath);
const result = await this.parser.parse(dataBuffer);
console.log( Pages: ${result.numpages});
console.log(Method: ${result._meta?.method});
console.log(Duration: ${result._meta?.duration?.toFixed(2)}ms);
return result.text;
}
getParserStats() {
return this.parser.getStats();
}
}
`
$3
`typescript
import { Controller, Post, UploadedFile, UseInterceptors } from '@nestjs/common';
import { FileInterceptor } from '@nestjs/platform-express';
import * as PdfParse from 'pdf-parse-new';
@Controller('pdf')
export class PdfController {
private parser = new PdfParse.SmartPDFParser({ oversaturationFactor: 2.0 });
@Post('upload')
@UseInterceptors(FileInterceptor('file'))
async uploadPdf(@UploadedFile() file: Express.Multer.File) {
const result = await this.parser.parse(file.buffer);
return {
pages: result.numpages,
text: result.text,
metadata: result.info,
parsingInfo: {
method: result._meta?.method,
duration: result._meta?.duration,
fastPath: result._meta?.fastPath || false
}
};
}
}
`
$3
`typescript
// Method 1: Namespace import (recommended)
import * as PdfParse from 'pdf-parse-new';
const parser = new PdfParse.SmartPDFParser();
// Method 2: CommonJS require
const PdfParse = require('pdf-parse-new');
const parser = new PdfParse.SmartPDFParser();
// Method 3: Direct module import
import SmartPDFParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartPDFParser();
`
$3
All types are fully documented and available:
`typescript
import * as PdfParse from 'pdf-parse-new';
// Use types from the namespace
type Result = PdfParse.Result;
type Options = PdfParse.Options;
type SmartParserOptions = PdfParse.SmartParserOptions;
`
๐ For more detailed examples and troubleshooting, see NESTJS_USAGE.md
---
Performance Optimization
$3
For a 1500-page PDF:
| Method | Time (estimate) | Speed vs Batch | Notes |
|------------|-----------------|----------------|-------|
| Workers | ~2.4-7s | 2-7x faster โจ | Faster startup, can vary by PDF |
| Processes | ~4.2-4.5s | 3-4x faster | More consistent, better isolation |
| Batch | ~17.6s | baseline | Good up to 1000 pages |
| Sequential | ~17.8s | 0.99x | Fallback only |
Note: Performance varies by PDF complexity, size, and system. Both workers and processes provide significant speedup - test both on your specific PDFs to find the best option.
$3
1. Use SmartParser for large documents (100+ pages)
2. Batch processing is optimal for most use cases (0-1000 pages)
3. Both Processes and Workers excel at huge PDFs (1000+ pages)
- Processes (default): More consistent, better memory isolation, 2-4x faster than batch
- Workers: Can be faster on some PDFs, use forceMethod: 'workers' to test
4. Avoid sequential unless you have a specific reason
5. Monitor memory for PDFs over 500 pages
$3
Batch (default for most cases)
- PDFs up to 1000 pages
- Balanced speed and memory usage
- Best all-around performance
Workers (best for huge PDFs)
- PDFs over 1000 pages
- Multi-core systems
- When speed is critical
- Note: Memory usage = PDF size ร concurrent workers
- For very large PDFs, limit maxWorkers to 2-4 to avoid memory issues
Processes (alternative to workers)
- Similar to workers but uses child processes
- Better isolation but slightly slower
Stream (memory constrained)
- Very limited memory environments
- When you need to process PDFs larger than available RAM
Sequential (fallback)
- Single-core systems
- When parallel processing causes issues
- Debugging purposes
---
Benchmarking
The library includes comprehensive benchmarking tools for optimization.
$3
`
benchmark/
โโโ collect-benchmarks.js # Collect performance data
โโโ train-smart-parser.js # Train decision tree
โโโ test-pdfs.example.json # Example PDF list
โโโ test-pdfs.json # Your PDFs (gitignored)
`
$3
1. Setup test PDFs:
`bash
cp benchmark/test-pdfs.example.json benchmark/test-pdfs.json
Edit test-pdfs.json with your PDF URLs/paths
`
2. Collect benchmark data:
`bash
node benchmark/collect-benchmarks.js
`
Features:
- Tests all parsing methods on each PDF
- Supports local files and remote URLs
- Saves incrementally (no data loss on interruption)
- Generates detailed performance reports
3. Train decision tree (library developers only):
`bash
node benchmark/train-smart-parser.js
`
Analyzes collected benchmarks and generates optimized parsing rules.
$3
`json
{
"note": "Add your PDF URLs or file paths here",
"urls": [
"./test/data/sample.pdf",
"https://example.com/document.pdf",
"/absolute/path/to/file.pdf"
]
}
`
---
Troubleshooting
$3
Out of Memory
`javascript
// Limit memory usage
const parser = new SmartParser({ maxMemoryUsage: 2e9 });
// Or use streaming
const parser = new SmartParser({ forceMethod: 'stream' });
`
Slow Parsing
`javascript
// For large PDFs, force workers
const parser = new SmartParser({ forceMethod: 'workers' });
`
Corrupted/Invalid PDFs
`javascript
// More aggressive parsing
const pdf = require('pdf-parse-new/lib/pdf-parse-aggressive');
pdf(dataBuffer).then(data => console.log(data.text));
`
$3
`javascript
// Enable verbose logging
process.env.DEBUG = 'pdf-parse:*';
`
$3
- ๐ Open an issue
- ๐ฌ Check existing issues for solutions
- ๐ Include benchmark data when reporting performance issues
---
NPM Module Compatibility
This library is designed to work correctly when installed as an npm module.
$3
All internal paths use proper resolution:
- โ
Worker threads: path.join(__dirname, 'pdf-worker.js')
- โ
Child processes: path.join(__dirname, 'pdf-child.js')
- โ
PDF.js: require('./pdf.js/v4.5.136/build/pdf.js')
This ensures the library works correctly:
- When installed via npm install
- In node_modules/ directory
- Regardless of working directory
- With or without symlinks
$3
The library automatically resolves all internal paths - you don't need to configure anything!
---
Advanced Usage
$3
`javascript
function customPageRenderer(pageData) {
const renderOptions = {
normalizeWhitespace: true,
disableCombineTextItems: false
};
return pageData.getTextContent(renderOptions).then(textContent => {
let text = '';
for (let item of textContent.items) {
text += item.str + ' ';
}
return text;
});
}
const options = { pagerender: customPageRenderer };
pdf(dataBuffer, options).then(data => console.log(data.text));
`
$3
`javascript
// Parse only first 10 pages
pdf(dataBuffer, { max: 10 }).then(data => {
console.log(Parsed ${data.numrender} of ${data.numpages} pages);
});
`
$3
`javascript
const PDFProcess = require('pdf-parse-new/lib/pdf-parse-processes');
PDFProcess(dataBuffer, {
maxProcesses: 4, // Use 4 parallel processes
batchSize: 10 // Process 10 pages per batch
}).then(data => console.log(data.text));
`
---
Why pdf-parse-new?
$3
| Feature | pdf-parse | pdf-parse-new 2.0 |
|---------|-----------|-------------------|
| Speed (huge PDFs) | Baseline | 2-4x faster โก |
| Smart optimization | โ | โ
AI-powered |
| Multi-core support | โ | โ
Workers + Processes |
| CPU adaptation | โ | โ
4-48+ cores |
| Fast-path | โ | โ
50x faster overhead |
| Caching | โ | โ
LRU cache |
| TypeScript | Partial | โ
Complete |
| Examples | Basic | โ
7 production-ready |
| Benchmarking | โ | โ
Tools included |
| Maintenance | Slow | โ
Active |
$3
- โ
Pure JavaScript (no native dependencies, no compilation)
- โ
Cross-platform (Windows, Mac, Linux - same code)
- โ
Zero configuration (paths auto-resolved, npm-safe)
- โ
No memory leaks (proper cleanup, GC-friendly)
- โ
Production-ready (comprehensive error handling)
- โ
Well-tested (9,417 benchmark samples)
- โ
Modern (async/await, Promises, ES6+)
$3
9,924-page PDF (13.77 MB) on 24-core system:
`
Sequential: ~15,000ms
Batch-50: ~11,723ms
Processes: ~4,468ms โ
(2.6x faster than batch)
Workers: ~6,963ms โ
(1.7x faster than batch)
SmartParser: Automatically chooses Processes โก
`
100 KB PDF on any system:
`
Overhead:
- Without fast-path: 25ms
- With fast-path: 0.5ms โ
(50x faster)
`
---
Contributing
Contributions are welcome! Please read our contributing guidelines.
$3
`bash
git clone https://github.com/your-repo/pdf-parse-new.git
cd pdf-parse-new
npm install
npm test
`
$3
`bash
npm test # Run all tests
npm run test:smart # Test smart parser
npm run benchmark # Run benchmarks
`
---
License
MIT License - see LICENSE file for details.
---
Credits
- Based on pdf-parse by autokent
- Powered by PDF.js v4.5.136 by Mozilla
- Performance optimization and v2.0 development by Simone Gosetto
---
Changelog
$3
Major Features:
- โจ SmartPDFParser with AI-powered method selection
- โก Multi-core processing (workers + processes)
- ๐ Oversaturation for maximum CPU utilization
- โก Fast-path optimization (50x faster overhead)
- ๐พ LRU caching (25x faster on cache hits)
- ๐ฏ CPU-aware thresholds (4-48+ cores)
- ๐ Decision tree trained on 9,417 benchmarks
- ๐ง 7 production-ready examples
- ๐ Complete TypeScript definitions
- ๐งช Comprehensive benchmarking tools
Performance:
- 2-4x faster for huge PDFs (1000+ pages)
- 50x faster overhead for tiny PDFs
- 25x faster on repeated similar PDFs
- 90%+ optimization rate in production
Breaking Changes:
- None - fully backward compatible with 1.x
See CHANGELOG for complete version history.
---
Made with โค๏ธ for the JavaScript community
npm: pdf-parse-new`