Document parsing library for ParseFlow - Extract text and data from PDF, Word (docx), and Excel (xlsx) files
Core PDF parsing library for ParseFlow. Extract text, metadata, images, and TOC from PDF files.


---
- 📄 Text Extraction - Extract text from PDF with multiple strategies (raw, formatted, clean)
- 📊 Metadata Extraction - Get title, author, page count, creation date, etc.
- 🔍 Keyword Search - Search for specific content in PDFs with context
- 🖼️ Image Extraction - Extract images from PDFs (requires poppler-utils)
- 📑 Table of Contents - Extract PDF bookmarks and outline structure (requires pdftk/pdfinfo)
---
```bash
npm install parseflow-core
```

Or using pnpm:

```bash
pnpm add parseflow-core
```

Or using yarn:

```bash
yarn add parseflow-core
```
---
```typescript
import { PDFParser } from 'parseflow-core';

const parser = new PDFParser();

// Extract all text
const result = await parser.extractText('path/to/document.pdf');
console.log(result.text);

// Extract specific page
const page2 = await parser.extractText('path/to/document.pdf', { page: 2 });

// Extract page range
const pages = await parser.extractText('path/to/document.pdf', { range: '1-5' });
```
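The `range: '1-5'` option above takes a page range as a string. As a rough sketch of what such a string denotes, the hypothetical helper below (`expandPageRange`, not part of parseflow-core) expands a `start-end` string into 1-based page numbers. It assumes both ends are inclusive; the library's actual parsing rules may differ.

```typescript
// Hypothetical helper, NOT part of parseflow-core.
// Expands a range string such as '1-5' into the page numbers it covers,
// assuming 1-based pages and an inclusive end.
function expandPageRange(range: string): number[] {
  const match = /^\s*(\d+)\s*-\s*(\d+)\s*$/.exec(range);
  if (!match) {
    throw new Error(`Invalid page range: "${range}"`);
  }
  const start = Number(match[1]);
  const end = Number(match[2]);
  if (start < 1 || end < start) {
    throw new Error(`Invalid page range: "${range}"`);
  }
  // Build [start, start + 1, ..., end].
  return Array.from({ length: end - start + 1 }, (_, i) => start + i);
}
```

Validating the string up front like this can give a clearer error than passing a malformed range straight to the parser.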
```typescript
// Read document-level metadata
const metadata = await parser.getMetadata('path/to/document.pdf');
console.log(metadata);
// {
//   title: 'Document Title',
//   author: 'Author Name',
//   pageCount: 10,
//   creationDate: '2025-01-01',
//   ...
// }
```
```typescript
// Search for a keyword and print each match with its surrounding context
const results = await parser.searchPDF('path/to/document.pdf', 'keyword', {
  caseSensitive: false,
  maxResults: 10
});

results.forEach(result => {
  console.log(`Found on page ${result.page}: ${result.context}`);
});
```
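To make the `caseSensitive` and `maxResults` options more concrete, here is a minimal sketch of a keyword search over already-extracted page text. It is an illustration only and not parseflow-core's implementation: the function `searchPages`, the `contextChars` option, and the exact shape of the returned hits are assumptions made for this example.

```typescript
// Illustrative only: NOT parseflow-core's implementation.
interface SearchOptions {
  caseSensitive?: boolean;
  maxResults?: number;
  contextChars?: number; // hypothetical option: characters kept on each side of a match
}

interface SearchHit {
  page: number;    // 1-based page number
  context: string; // the match plus surrounding text
}

function searchPages(
  pages: string[],
  query: string,
  options: SearchOptions = {},
): SearchHit[] {
  const { caseSensitive = false, maxResults = Infinity, contextChars = 20 } = options;
  const hits: SearchHit[] = [];
  if (query.length === 0) return hits;

  // Case-insensitive search lowercases both sides. This assumes lowercasing
  // keeps the string length unchanged, which holds for ASCII text.
  const needle = caseSensitive ? query : query.toLowerCase();

  for (let i = 0; i < pages.length && hits.length < maxResults; i++) {
    const text = pages[i] ?? '';
    const haystack = caseSensitive ? text : text.toLowerCase();
    let from = 0;
    while (hits.length < maxResults) {
      const at = haystack.indexOf(needle, from);
      if (at === -1) break;
      const start = Math.max(0, at - contextChars);
      const end = Math.min(text.length, at + needle.length + contextChars);
      hits.push({ page: i + 1, context: text.slice(start, end) });
      from = at + needle.length; // continue after this match
    }
  }
  return hits;
}
```

Capping results inside the loop, rather than slicing afterwards, means the scan stops as soon as `maxResults` hits have been collected.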
```typescript
import { ImageExtractorExternal } from 'parseflow-core';

const extractor = new ImageExtractorExternal();

// Write the extracted images to ./output as PNG files
const images = await extractor.extract('path/to/document.pdf', './output', {
  format: 'png'
});
```
```typescript
import { TOCExtractorExternal } from 'parseflow-core';

const tocExtractor = new TOCExtractorExternal();

// Read the PDF's bookmarks / outline
const toc = await tocExtractor.extract('path/to/document.pdf');
console.log(toc);
```
---
### PDFParser

Main parser class for PDF operations.

#### Methods

- `extractText(path, options?)` - Extract text from PDF
- `getMetadata(path)` - Get PDF metadata
- `searchPDF(path, query, options?)` - Search for keywords

### ImageExtractorExternal

Extract images from PDF using external tools.

#### Methods

- `isAvailable()` - Check if `pdfimages` is available
- `extract(pdfPath, outputDir, options?)` - Extract images

### TOCExtractorExternal

Extract table of contents from PDF.

#### Methods

- `isAvailable()` - Check if pdftk/pdfinfo is available
- `extract(pdfPath, options?)` - Extract TOC
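Because both external extractors expose `isAvailable()`, a caller can check for the underlying tool before attempting an extraction. The helper below is a sketch of that pattern. `extractIfAvailable` is a hypothetical name, and the return type of `isAvailable()` (boolean or promise of boolean) is an assumption, so the helper awaits the result to cover both cases.

```typescript
// Hypothetical helper, NOT part of parseflow-core.
// Runs `run` only when the extractor reports that its external tool exists;
// otherwise resolves to null so the caller can degrade gracefully.
async function extractIfAvailable<T>(
  extractor: { isAvailable(): boolean | Promise<boolean> },
  run: () => Promise<T>,
): Promise<T | null> {
  // `await` handles both a plain boolean and a Promise<boolean>.
  if (!(await extractor.isAvailable())) {
    return null;
  }
  return run();
}

// Intended usage with the classes documented above (not executed here):
//
//   const extractor = new ImageExtractorExternal();
//   const images = await extractIfAvailable(extractor, () =>
//     extractor.extract('path/to/document.pdf', './output', { format: 'png' }),
//   );
//   if (images === null) console.warn('pdfimages not found; skipping images');
```

Returning `null` instead of throwing keeps optional features such as image or TOC extraction from failing the whole pipeline when a tool is missing.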
---
Some features require external tools.

Image extraction requires Poppler (`pdfimages`):

Windows:
- Download Poppler
- Add to system PATH

Linux:

```bash
sudo apt-get install poppler-utils
```

macOS:

```bash
brew install poppler
```

Table of contents extraction requires pdftk/pdfinfo:

Windows:
- Download Poppler (includes pdfinfo)

Linux:

```bash
sudo apt-get install poppler-utils pdftk
```

macOS:

```bash
brew install poppler pdftk-java
```
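After installing, it can help to confirm that the tools are actually reachable from Node.js. The sketch below uses Node's built-in `child_process.spawnSync` to test whether each command can be started. The helper names `isToolOnPath` and `findMissingTools` are hypothetical and are not part of parseflow-core; the version flags (`-v` for the Poppler tools, `--version` for pdftk) are the ones those CLIs commonly accept.

```typescript
import { spawnSync } from 'node:child_process';

// Hypothetical helper, NOT part of parseflow-core.
// Returns true if the command could be started. A missing binary is
// reported through `result.error` (ENOENT), not as a thrown exception.
function isToolOnPath(command: string, args: string[] = []): boolean {
  const result = spawnSync(command, args, { stdio: 'ignore' });
  return result.error === undefined;
}

// Lists which of the external tools mentioned above could not be started.
function findMissingTools(): string[] {
  const checks: Array<[string, string[]]> = [
    ['pdfimages', ['-v']],
    ['pdfinfo', ['-v']],
    ['pdftk', ['--version']],
  ];
  return checks
    .filter(([command, args]) => !isToolOnPath(command, args))
    .map(([command]) => command);
}
```

A missing entry does not necessarily mean a feature is broken. TOC extraction is documented as needing pdftk/pdfinfo, so one of the two may be enough.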
---
- Node.js: >= 18.0.0
- TypeScript: >= 5.0.0 (for development)
---
For complete documentation, visit:
- GitHub Repository
- API Documentation
- Examples
---
Contributions are welcome! Please see CONTRIBUTING.md for details.
---
MIT © Libres-coder
---
- pdf-parse - PDF text extraction
- pdf-lib - PDF manipulation
- Poppler - PDF rendering library
---
Made with ❤️ by ParseFlow Team