Mursa PDF Parser

[npm version](https://www.npmjs.com/package/mursa-pdf-parser)
[License: MIT](https://opensource.org/licenses/MIT)
[Node.js Version](https://nodejs.org)

A comprehensive, zero-dependency* PDF parsing library for Node.js with support for text extraction, metadata extraction, image extraction, and full PDF object model access.

> *Only depends on pako for zlib decompression

Features

- Text Extraction - Extract text content with positioning information
- Metadata Extraction - Document Info Dictionary and XMP metadata
- Image Extraction - Extract embedded images (JPEG, JPEG2000, raw bitmap)
- Full PDF Object Access - Low-level access to all PDF objects
- Stream Decompression - FlateDecode, LZW, ASCII85, ASCIIHex, RunLength
- ToUnicode CMap Support - Proper character encoding for complex fonts
- No Native Dependencies - Pure JavaScript, works everywhere Node.js runs

Installation

```bash
npm install mursa-pdf-parser
```

Quick Start

```javascript
import { MursaPDF } from 'mursa-pdf-parser';

// Load a PDF
const pdf = await MursaPDF.load('document.pdf');

// Extract text
const text = pdf.getText();
console.log(text);

// Get metadata
const metadata = pdf.getMetadata();
console.log(metadata.info.Title);

// Get page count
console.log(`Pages: ${pdf.getPageCount()}`);
```

API Reference

Loading a PDF

```javascript
import { MursaPDF, parsePDF } from 'mursa-pdf-parser';

// The calls below are alternatives; use the one that matches your input.

// From file path
const pdf = await MursaPDF.load('path/to/file.pdf');

// From Buffer
const pdf = MursaPDF.fromBuffer(buffer);

// From base64 string
const pdf = MursaPDF.fromBase64(base64String);

// Using convenience function
const pdf = await parsePDF('document.pdf');
```

Text Extraction

```javascript
// Get all text as a single string
const text = pdf.getText();

// Get text with page information
const result = pdf.getTextWithPages();
// Returns:
// {
//   text: "full text...",
//   pages: [
//     { pageNumber: 1, text: "...", items: [...] }
//   ]
// }

// Get text from a specific page (1-indexed)
const page1Text = pdf.getTextFromPage(1);
```
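The per-page result can be used to locate which pages mention a term. The following is a minimal sketch that relies only on the `{ pages: [{ pageNumber, text }] }` shape shown above; `findPagesContaining` is a hypothetical helper, not part of the library, and `sampleResult` is stand-in data.

```javascript
// Illustrative only: findPagesContaining is a hypothetical helper.
// It reads only the pageNumber and text fields of each page entry.
function findPagesContaining(result, term) {
  const needle = term.toLowerCase();
  return result.pages
    .filter((page) => page.text.toLowerCase().includes(needle))
    .map((page) => page.pageNumber);
}

// Stand-in data with the same shape as pdf.getTextWithPages().
const sampleResult = {
  text: 'Invoice 42\nTerms and conditions',
  pages: [
    { pageNumber: 1, text: 'Invoice 42', items: [] },
    { pageNumber: 2, text: 'Terms and conditions', items: [] },
  ],
};

console.log(findPagesContaining(sampleResult, 'invoice'));
```

With a loaded document, the same helper would be called as `findPagesContaining(pdf.getTextWithPages(), 'invoice')`.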

Metadata Extraction

```javascript
// Get all metadata
const metadata = pdf.getMetadata();
// Returns:
// {
//   info: { Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate },
//   xmp: { raw, parsed },
//   structure: { version, pageCount, pageLayout, pageMode, hasOutlines, hasAcroForm }
// }

// Get document info only
const info = pdf.getInfo();

// Get XMP metadata
const xmp = pdf.getXMP();

// Get bookmarks/outlines
const outlines = pdf.getOutlines();
```
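PDF Info dictionaries store dates as strings such as `D:20240115103000+05'30'`. This README does not say whether `CreationDate` and `ModDate` are returned raw or already parsed, so the sketch below assumes raw strings. `parsePdfDate` is a hypothetical helper, not a library function.

```javascript
// Illustrative only: parsePdfDate is a hypothetical helper.
// Assumes CreationDate/ModDate arrive as raw PDF date strings
// (D:YYYYMMDDHHmmSS followed by an optional Z or +HH'mm' / -HH'mm' offset).
function parsePdfDate(value) {
  if (typeof value !== 'string') return null;
  const pattern = /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?(?:([Z+-])(\d{2})?'?(\d{2})?'?)?/;
  const match = pattern.exec(value);
  if (!match) return null;

  // Missing fields fall back to the defaults defined by the PDF date format.
  const [, year, month = '01', day = '01', hour = '00', minute = '00',
    second = '00', sign, offsetHours = '00', offsetMinutes = '00'] = match;

  const asUtc = Date.UTC(+year, +month - 1, +day, +hour, +minute, +second);

  // Local time equals UTC plus the offset, so subtract the offset to get UTC.
  // A missing offset is treated as UTC in this sketch.
  let offset = 0;
  if (sign === '+' || sign === '-') {
    offset = (sign === '+' ? 1 : -1) * (+offsetHours * 60 + +offsetMinutes);
  }
  return new Date(asUtc - offset * 60000);
}

console.log(parsePdfDate("D:20240115103000+05'30'").toISOString());
```

If the library already returns `Date` objects, this helper is unnecessary.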

Image Extraction

```javascript
// Get all images with raw data
const images = pdf.getImages();

// Get images as files with proper extensions
const files = pdf.getImageFiles();
// Returns:
// [
//   {
//     filename: "image_1.jpg",
//     mimeType: "image/jpeg",
//     data: Uint8Array,
//     width: 800,
//     height: 600
//   }
// ]

// Get image summary
const summary = pdf.getImageSummary();
// Returns:
// {
//   totalImages: 5,
//   totalSize: 102400,
//   byPage: { 1: 2, 2: 3 },
//   byFormat: { jpeg: 4, raw: 1 },
//   byColorSpace: { DeviceRGB: 5 },
//   images: [...]
// }
```

Document Properties

```javascript
const version = pdf.getVersion();     // "1.7"
const pageCount = pdf.getPageCount(); // 10
const pages = pdf.getPages();         // Array of page dictionaries
const catalog = pdf.getCatalog();     // Root catalog object
```

Low-Level Access

```javascript
// Get raw PDF object by object number
const obj = pdf.getObject(5, 0);

// Resolve an indirect reference
const resolved = pdf.resolveReference(ref);

// Get cross-reference table
const xref = pdf.getXRef();

// Get trailer dictionary
const trailer = pdf.getTrailer();
```

Direct Extractor Access

```javascript
// Access the text extractor directly
const textExtractor = pdf.textExtractor;

// Access the metadata extractor directly
const metadataExtractor = pdf.metadataExtractor;

// Access the image extractor directly
const imageExtractor = pdf.imageExtractor;
```

Complete Example

```javascript
import { MursaPDF } from 'mursa-pdf-parser';
import { writeFile } from 'fs/promises';

async function extractPDF(filePath) {
  // Load the PDF
  const pdf = await MursaPDF.load(filePath);

  // Get basic info
  console.log(`PDF Version: ${pdf.getVersion()}`);
  console.log(`Pages: ${pdf.getPageCount()}`);

  // Get metadata
  const metadata = pdf.getMetadata();
  console.log(`Title: ${metadata.info.Title || 'Untitled'}`);
  console.log(`Author: ${metadata.info.Author || 'Unknown'}`);

  // Extract all text
  const text = pdf.getText();
  await writeFile('output.txt', text);
  console.log(`Extracted ${text.length} characters`);

  // Extract images
  const images = pdf.getImageFiles();
  for (const img of images) {
    await writeFile(img.filename, img.data);
    console.log(`Saved ${img.filename} (${img.width}x${img.height})`);
  }

  console.log(`Extracted ${images.length} images`);
}

extractPDF('document.pdf');
```

Architecture

```
mursa-pdf-parser/
├── src/
│   ├── core/
│   │   ├── lexer.js        # Tokenizer - converts bytes to tokens
│   │   ├── parser.js       # Parser - builds PDF object model
│   │   └── objects.js      # PDF object types (Name, String, Array, etc.)
│   ├── filters/
│   │   └── index.js        # Stream decompression (FlateDecode, LZW, etc.)
│   ├── extraction/
│   │   ├── text.js         # Text extraction with font handling
│   │   ├── metadata.js     # Document info and XMP extraction
│   │   └── images.js       # Image extraction and conversion
│   └── index.js            # Main API (MursaPDF class)
├── examples/
│   └── basic-usage.js      # Usage examples
└── test/
    └── test.js             # Test suite
```

How PDF Parsing Works

PDF File Structure

A PDF file consists of four main parts:

1. Header - PDF version identifier (`%PDF-1.7`)
2. Body - Objects containing document content (text, images, fonts)
3. Cross-Reference Table - Maps object numbers to byte offsets
4. Trailer - Points to the document catalog and metadata
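As a rough illustration of this layout, the sketch below reads the version from the header and the `startxref` offset near the end of a file. It is written independently of mursa-pdf-parser's internals; the helper names `readHeaderVersion` and `readStartXref` are hypothetical, and the sample bytes are a toy file, not a complete valid PDF.

```javascript
// Illustrative only: not mursa-pdf-parser's internal code.
// Both helper names are hypothetical.

// Read the version from the "%PDF-x.y" header at the start of the file.
function readHeaderVersion(bytes) {
  const head = Buffer.from(bytes.subarray(0, 16)).toString('latin1');
  const match = /^%PDF-(\d\.\d)/.exec(head);
  return match ? match[1] : null;
}

// Read the byte offset that follows the last "startxref" keyword,
// which writers place shortly before the final "%%EOF" marker.
function readStartXref(bytes) {
  const from = Math.max(0, bytes.length - 1024);
  const tail = Buffer.from(bytes.subarray(from)).toString('latin1');
  const index = tail.lastIndexOf('startxref');
  if (index === -1) return null;
  const match = /^startxref\s+(\d+)/.exec(tail.slice(index));
  return match ? Number(match[1]) : null;
}

// Toy file: header, one object, xref table, trailer, startxref.
const sample = Buffer.from(
  '%PDF-1.7\n1 0 obj\n<< >>\nendobj\n' +
  'xref\n0 1\n0000000000 65535 f \n' +
  'trailer\n<< /Size 1 >>\nstartxref\n30\n%%EOF\n',
  'latin1'
);

console.log(readHeaderVersion(sample), readStartXref(sample));
```

The value after `startxref` is the byte offset of the cross-reference section, which is why parsers read the end of the file before the body.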

Parsing Process

1. Read Header - Extract PDF version
2. Find Trailer - Locate from end of file
3. Parse XRef - Build object location map
4. Parse Objects - On-demand parsing of referenced objects
5. Extract Content - Process page content streams
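The "Parse XRef" step can be sketched for a traditional cross-reference table as follows. This is a simplified illustration based on the PDF file format, not the library's parser: `parseXrefTable` is a hypothetical name, and XRef streams are not handled.

```javascript
// Illustrative only: a simplified parser for a traditional xref table.
// parseXrefTable is a hypothetical helper, not part of mursa-pdf-parser.
function parseXrefTable(text) {
  const lines = text
    .split(/\r?\n|\r/)
    .map((line) => line.trim())
    .filter(Boolean);
  if (lines[0] !== 'xref') throw new Error('Expected "xref" keyword');

  const entries = new Map(); // object number -> { offset, generation }
  let i = 1;
  while (i < lines.length && lines[i] !== 'trailer') {
    // Subsection header: "<first object number> <entry count>"
    const [first, count] = lines[i].split(/\s+/).map(Number);
    i += 1;
    for (let n = 0; n < count; n += 1, i += 1) {
      // Entry: 10-digit offset, 5-digit generation, "n" (in use) or "f" (free)
      const match = /^(\d{10}) (\d{5}) ([nf])$/.exec(lines[i]);
      if (!match) throw new Error(`Malformed xref entry: ${lines[i]}`);
      if (match[3] === 'n') {
        entries.set(first + n, {
          offset: Number(match[1]),
          generation: Number(match[2]),
        });
      }
    }
  }
  return entries;
}

const table = parseXrefTable(
  'xref\n0 3\n0000000000 65535 f \n0000000009 00000 n \n0000000074 00000 n \ntrailer\n'
);
console.log(table.get(1), table.get(2));
```

Keeping only in-use entries in a map is what makes on-demand object parsing possible: an object is read from its byte offset the first time it is referenced.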

Text Extraction Process

1. Get page content streams
2. Decompress stream data (FlateDecode, etc.)
3. Parse content stream operators (Tj, TJ, Tf, Td, etc.)
4. Map character codes to Unicode using font encodings and ToUnicode CMaps
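Step 3 can be illustrated with a small scanner over an already-decompressed content stream. This is a sketch under strong assumptions: it only handles literal strings shown with `Tj` or `TJ`, skips step 4 entirely (character codes are assumed to be ASCII-compatible), and the helper names `readLiteralString` and `extractShownText` are hypothetical rather than the library's API.

```javascript
// Illustrative only: a simplified scanner, not mursa-pdf-parser's extractor.
// It handles literal strings "(...)" shown with Tj or TJ and ignores hex
// strings, font encodings, and positioning. Helper names are hypothetical.

// Read a literal string starting at src[start] === "(".
// Only the escapes \( \) and \\ are decoded correctly in this sketch.
function readLiteralString(src, start) {
  let depth = 1;
  let text = '';
  let i = start + 1;
  while (i < src.length && depth > 0) {
    const ch = src[i];
    if (ch === '\\') {
      text += src[i + 1] ?? '';
      i += 2;
      continue;
    }
    if (ch === '(') depth += 1;
    if (ch === ')') depth -= 1;
    if (depth > 0) text += ch; // balanced inner parentheses are kept
    i += 1;
  }
  return { text, end: i };
}

// Collect the strings passed to each Tj / TJ operator, in stream order.
function extractShownText(content) {
  const chunks = [];
  let pending = []; // strings seen since the previous operator
  let i = 0;
  while (i < content.length) {
    const ch = content[i];
    if (ch === '(') {
      const { text, end } = readLiteralString(content, i);
      pending.push(text);
      i = end;
    } else if (/[A-Za-z'"]/.test(ch)) {
      // Read an operator (or name fragment) and reset the operand list.
      let j = i + 1;
      while (j < content.length && /[A-Za-z0-9*]/.test(content[j])) j += 1;
      const op = content.slice(i, j);
      if ((op === 'Tj' || op === 'TJ') && pending.length > 0) {
        chunks.push(pending.join(''));
      }
      pending = [];
      i = j;
    } else {
      i += 1; // numbers, whitespace, brackets, and slashes are skipped
    }
  }
  return chunks;
}

console.log(extractShownText('BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld)] TJ ET'));
```

In a `TJ` array the numbers between strings are kerning adjustments, so joining the string fragments recovers the word. A real extractor would also use those numbers and the text matrix to decide where spaces belong.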

Supported Features

Compression Filters

- FlateDecode (zlib/deflate)
- ASCIIHexDecode
- ASCII85Decode
- LZWDecode
- RunLengthDecode
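
To give a sense of what the simpler filters do, here is a minimal sketch of ASCIIHexDecode and RunLengthDecode written from the filter definitions in the PDF specification. It is not taken from the library's `filters` module, and the function names `asciiHexDecode` and `runLengthDecode` are hypothetical.

```javascript
// Illustrative only: simplified decoders, not mursa-pdf-parser's filters.

// ASCIIHexDecode: hex digit pairs, whitespace ignored, ">" marks end of data.
function asciiHexDecode(text) {
  const end = text.indexOf('>');
  let hex = (end === -1 ? text : text.slice(0, end)).replace(/\s+/g, '');
  if (hex.length % 2 === 1) hex += '0'; // an odd final digit is padded with 0
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i += 1) {
    out[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

// RunLengthDecode: each run starts with a length byte.
//   0..127   -> copy the next (length + 1) bytes literally
//   129..255 -> repeat the next byte (257 - length) times
//   128      -> end of data
function runLengthDecode(bytes) {
  const out = [];
  let i = 0;
  while (i < bytes.length) {
    const length = bytes[i];
    i += 1;
    if (length === 128) break;
    if (length < 128) {
      for (let n = 0; n <= length && i < bytes.length; n += 1, i += 1) {
        out.push(bytes[i]);
      }
    } else {
      if (i >= bytes.length) break; // truncated input
      for (let n = 0; n < 257 - length; n += 1) out.push(bytes[i]);
      i += 1;
    }
  }
  return Uint8Array.from(out);
}

const hexText = Buffer.from(asciiHexDecode('48 65 6C6C 6F>')).toString('latin1');
const runText = Buffer.from(
  runLengthDecode(Uint8Array.from([2, 65, 66, 67, 254, 68, 128]))
).toString('latin1');
console.log(hexText, runText);
```

FlateDecode is the filter most real files use, which is why the library delegates it to pako rather than reimplementing zlib.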

Image Formats

- JPEG / DCTDecode
- JPEG2000 / JPXDecode
- Raw bitmap data
- Indexed color images

Color Spaces

- DeviceGray
- DeviceRGB
- DeviceCMYK
- ICCBased
- Indexed
- Separation

PDF Versions and Structures

- PDF 1.0 - 2.0
- Traditional XRef tables
- XRef streams (PDF 1.5+)
- Object streams (PDF 1.5+)

Limitations

- Encrypted PDFs - Password-protected PDFs are not currently supported
- Complex Fonts - Some CID fonts with unusual encodings may not decode correctly
- Scanned PDFs - Documents containing only scanned images require OCR (not included)
- Form Data - Interactive form field values are not extracted
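
Because scanned documents yield little or no text, a caller may want to detect them before deciding to run OCR elsewhere. The sketch below uses only `getText()` and `getImageSummary()` as documented in the API reference; `looksScanned` is a hypothetical helper, and the 20-character threshold is an arbitrary assumption.

```javascript
// Illustrative only: looksScanned is a hypothetical helper.
// It treats a document as "probably scanned" when it has images
// but almost no extractable text. The threshold is an assumption.
function looksScanned(pdf, minChars = 20) {
  const textLength = pdf.getText().trim().length;
  const { totalImages } = pdf.getImageSummary();
  return textLength < minChars && totalImages > 0;
}

// Stand-in object exposing the two methods the helper needs.
const scannedLike = {
  getText: () => '  ',
  getImageSummary: () => ({ totalImages: 3 }),
};

console.log(looksScanned(scannedLike));
```

This is a heuristic: a short text-only cover page or an image-heavy brochure with captions can both be misclassified.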

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Murali - GitHub

Acknowledgments

- Uses pako for zlib decompression
