# parseflow-core

Core PDF parsing library for ParseFlow - Extract text, metadata, images, and TOC from PDF files.

[npm version](https://www.npmjs.com/package/parseflow-core)
[License: MIT](https://opensource.org/licenses/MIT)

---

## ✨ Features

- 📄 Text Extraction - Extract text from PDF with multiple strategies (raw, formatted, clean)
- 📊 Metadata Extraction - Get title, author, page count, creation date, etc.
- 🔍 Keyword Search - Search for specific content in PDFs with context
- 🖼️ Image Extraction - Extract images from PDFs (requires poppler-utils)
- 📑 Table of Contents - Extract PDF bookmarks and outline structure (requires pdftk/pdfinfo)

---

## 📦 Installation

```bash
npm install parseflow-core
```

Or using pnpm:

```bash
pnpm add parseflow-core
```

Or using yarn:

```bash
yarn add parseflow-core
```

---

## 🚀 Quick Start

### Extract Text

```typescript
import { PDFParser } from 'parseflow-core';

const parser = new PDFParser();

// Extract all text
const result = await parser.extractText('path/to/document.pdf');
console.log(result.text);

// Extract specific page
const page2 = await parser.extractText('path/to/document.pdf', { page: 2 });

// Extract page range
const pages = await parser.extractText('path/to/document.pdf', { range: '1-5' });
```
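The `range: '1-5'` option takes a page range as a string. As a minimal sketch of how such a string might be interpreted, assuming 1-based, inclusive page numbers (an assumption, not confirmed behavior of `parseflow-core`), the hypothetical helper `parsePageRange` expands it into a list of pages:

```typescript
// Hypothetical helper, not part of parseflow-core.
// Assumes "start-end" with 1-based, inclusive page numbers.
function parsePageRange(range: string): number[] {
  const match = /^\s*(\d+)\s*-\s*(\d+)\s*$/.exec(range);
  if (!match) {
    throw new Error(`Invalid page range: "${range}"`);
  }
  const start = Number(match[1]);
  const end = Number(match[2]);
  if (start < 1 || end < start) {
    throw new Error(`Invalid page range: "${range}"`);
  }
  // Build [start, start + 1, ..., end].
  return Array.from({ length: end - start + 1 }, (_, i) => start + i);
}

console.log(parsePageRange('1-5'));
```

Validating the string up front like this gives a clear error for inputs such as `'5-1'` before any PDF work starts.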

### Get Metadata

```typescript
const metadata = await parser.getMetadata('path/to/document.pdf');
console.log(metadata);
// {
//   title: 'Document Title',
//   author: 'Author Name',
//   pageCount: 10,
//   creationDate: '2025-01-01',
//   ...
// }
```

### Search in PDF

```typescript
const results = await parser.searchPDF('path/to/document.pdf', 'keyword', {
  caseSensitive: false,
  maxResults: 10
});

results.forEach(result => {
  console.log(`Found on page ${result.page}: ${result.context}`);
});
```
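To illustrate what a search result with context can look like, here is a minimal, self-contained sketch that searches plain text rather than a PDF. `searchWithContext`, `TextMatch`, and the `contextChars` option are hypothetical names for this illustration and are not part of `parseflow-core`; only `caseSensitive` and `maxResults` mirror the options used with `searchPDF`.

```typescript
// Illustrative only: a plain-text keyword search, not parseflow-core's implementation.
interface TextMatch {
  index: number;   // character offset of the match in the text
  context: string; // the match plus surrounding characters
}

function searchWithContext(
  text: string,
  query: string,
  options: { caseSensitive?: boolean; maxResults?: number; contextChars?: number } = {},
): TextMatch[] {
  const { caseSensitive = false, maxResults = 10, contextChars = 20 } = options;
  if (query.length === 0) return [];

  // Assumes lowercasing does not change string length (true for ASCII text).
  const haystack = caseSensitive ? text : text.toLowerCase();
  const needle = caseSensitive ? query : query.toLowerCase();

  const matches: TextMatch[] = [];
  let from = 0;
  while (matches.length < maxResults) {
    const index = haystack.indexOf(needle, from);
    if (index === -1) break;
    const start = Math.max(0, index - contextChars);
    const end = Math.min(text.length, index + query.length + contextChars);
    matches.push({ index, context: text.slice(start, end) });
    from = index + needle.length; // continue after this match
  }
  return matches;
}

const sample = 'The PDF format. A pdf file.';
console.log(searchWithContext(sample, 'pdf', { contextChars: 4 }));
```

Capping the loop with `maxResults` keeps the cost bounded on long documents, which is presumably why the library exposes a similar option.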

### Extract Images

```typescript
import { ImageExtractorExternal } from 'parseflow-core';

const extractor = new ImageExtractorExternal();
const images = await extractor.extract('path/to/document.pdf', './output', {
  format: 'png'
});
```

### Extract Table of Contents

```typescript
import { TOCExtractorExternal } from 'parseflow-core';

const tocExtractor = new TOCExtractorExternal();
const toc = await tocExtractor.extract('path/to/document.pdf');
console.log(toc);
```

---

## 📚 API Reference

### PDFParser

Main parser class for PDF operations.

#### Methods

- `extractText(path, options?)` - Extract text from PDF
- `getMetadata(path)` - Get PDF metadata
- `searchPDF(path, query, options?)` - Search for keywords

### ImageExtractorExternal

Extract images from PDF using external tools.

#### Methods

- `isAvailable()` - Check if pdfimages is available
- `extract(pdfPath, outputDir, options?)` - Extract images

### TOCExtractorExternal

Extract table of contents from PDF.

#### Methods

- `isAvailable()` - Check if pdftk/pdfinfo is available
- `extract(pdfPath, options?)` - Extract TOC
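Both external extractors expose `isAvailable()`, which suggests checking for the tool before calling `extract`. The sketch below shows one way to wrap that guard. `runIfAvailable` and `AvailabilityCheck` are hypothetical names, not library exports, and because this README does not state whether `isAvailable()` is synchronous or returns a promise, the helper accepts either.

```typescript
// Hypothetical guard helper; works with any object that has isAvailable().
interface AvailabilityCheck {
  isAvailable(): boolean | Promise<boolean>;
}

// Runs `run` only when the extractor reports its external tool as available.
// Resolves to null when the tool is missing, so callers can fall back gracefully.
async function runIfAvailable<E extends AvailabilityCheck, T>(
  extractor: E,
  run: (extractor: E) => T | Promise<T>,
): Promise<T | null> {
  const available = await extractor.isAvailable();
  return available ? await run(extractor) : null;
}

// Possible usage with the library (not executed here):
//   const toc = await runIfAvailable(new TOCExtractorExternal(),
//     (e) => e.extract('path/to/document.pdf'));
```

Returning `null` instead of throwing is a design choice for this sketch; an application that treats a missing tool as fatal could throw inside the helper instead.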

---

## 🔧 External Tools

Some features require external tools:

### Image Extraction (poppler-utils)

Windows:
- Download Poppler
- Add to system PATH

Linux:

```bash
sudo apt-get install poppler-utils
```

macOS:

```bash
brew install poppler
```

### Table of Contents (pdftk/pdfinfo)

Windows:
- Download Poppler (includes pdfinfo)

Linux:

```bash
sudo apt-get install poppler-utils pdftk
```

macOS:

```bash
brew install poppler pdftk-java
```
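After installing, it can help to confirm that the external binaries are reachable from Node.js. The following is a rough sketch using only Node.js built-ins. `isToolOnPath` is a hypothetical helper, not a `parseflow-core` API, and it only checks that the process can be started, not that the installed version is compatible.

```typescript
import { spawnSync } from 'node:child_process';

// Returns true if `tool` can be launched from the current PATH.
// Only a failure to start the process (for example ENOENT) counts as missing;
// the tool's own exit code is ignored, since tools differ in their version flags.
function isToolOnPath(tool: string, versionFlag: string = '-v'): boolean {
  const result = spawnSync(tool, [versionFlag], { stdio: 'ignore' });
  return result.error === undefined;
}

// Binaries named in this README.
const tools = ['pdfimages', 'pdfinfo', 'pdftk'];
for (const tool of tools) {
  console.log(`${tool}: ${isToolOnPath(tool) ? 'found' : 'not found'}`);
}
```

On Windows this assumes the tools are installed as `.exe` files on `PATH`; wrappers shipped as `.cmd` or `.bat` scripts would need to be launched through a shell.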

---

## 💻 Requirements

- Node.js: >= 18.0.0
- TypeScript: >= 5.0.0 (for development)

---

## 📖 Documentation

For complete documentation, visit:
- GitHub Repository
- API Documentation
- Examples

---

## 🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for details.

---

## 📄 License

MIT © Libres-coder

---

## 🙏 Acknowledgments

- pdf-parse - PDF text extraction
- pdf-lib - PDF manipulation
- Poppler - PDF rendering library

---

## 🔗 Links

- GitHub
- Issues
- npm

---

Made with ❤️ by ParseFlow Team
