A modern JavaScript library for parsing and processing Microsoft Word DOCX documents with support for both buffer and stream operations. Features incremental parsing, checkbox detection, footnote support, and document validation.
npm install @thasmorato/docx-parserA modern JavaScript/TypeScript library for parsing DOCX documents using generator architecture and Clean Architecture.
- Streaming: Processes DOCX documents incrementally using async generators
- Memory Efficient: Doesn't load the entire document into memory
- TypeScript: Complete typing with well-defined interfaces
- Clean Architecture: Clear separation between layers (Domain, Application, Infrastructure, Interfaces)
- Flexible: Extracts text, images, tables, metadata, and formatting
- Configurable: Advanced options to control parsing
- Checkbox Detection: Automatically detects and parses checkbox states in lists
- Footnote Support: Identifies and extracts footnotes with proper references
- Header Levels: Detects document structure with header hierarchy (H1-H6)
- Document Validation: Built-in validation for DOCX file integrity
- Page Elements: Supports headers, footers, and page breaks
- List Processing: Handles numbered, bulleted, and checkbox lists
``bash`
npm install @thasmorato/docx-parseror
yarn add @thasmorato/docx-parseror
pnpm add @thasmorato/docx-parser
The library exports only essential items for a clean API:
`typescript
// Main parsing functions
import {
parseDocx, // Buffer ā AsyncGenerator
parseDocxStream, // ReadStream ā AsyncGenerator
parseDocxHttpStream, // HTTP Readable Stream ā AsyncGenerator
parseDocxReadable, // Node.js Readable Stream ā AsyncGenerator
parseDocxWebStream, // ReadableStream ā AsyncGenerator
parseDocxFile, // File path ā AsyncGenerator
parseDocxToArray, // Any source ā Promise
extractText, // Extract only text
extractImages, // Extract only images
getMetadata // Extract only metadata
} from 'docx-parser';
// Essential types
import type {
DocumentElement, // Union of all element types
ParseOptions, // Configuration options
ValidationResult, // Validation response
MetadataElement, // Document metadata
ParagraphElement, // Text paragraphs
ImageElement, // Images with metadata
TableElement, // Tables with rows/cells
HeaderElement, // Document headers
FooterElement, // Document footers
ListElement, // Lists (bullet/numbered)
PageBreakElement, // Page breaks
SectionElement // Document sections
} from 'docx-parser';
// Error handling
import { DocxParseError } from 'docx-parser';
// Advanced utilities
import { StreamAdapter } from 'docx-parser';
import { ValidateDocumentUseCaseImpl } from 'docx-parser';
`
`typescript
import { parseDocx } from 'docx-parser';
import { readFileSync } from 'fs';
const buffer = readFileSync('document.docx');
// Process document incrementally
for await (const element of parseDocx(buffer)) {
console.log(Type: ${element.type});
if (element.type === 'paragraph') {
console.log(Text: ${element.content});Image: ${element.metadata?.filename}
} else if (element.type === 'image') {
console.log();Table with ${element.content.length} rows
} else if (element.type === 'table') {
console.log();`
}
}
`typescript
import { parseDocxFile } from 'docx-parser';
// Read file directly from filesystem
for await (const element of parseDocxFile('./document.docx')) {
console.log(element.type, element.content);
}
`
`typescript
import { parseDocxStream, parseDocxHttpStream, parseDocxWebStream } from 'docx-parser';
import { createReadStream } from 'fs';
import axios from 'axios';
// Using Node.js ReadStream (recommended for local files)
const fileStream = createReadStream('./document.docx');
for await (const element of parseDocxStream(fileStream)) {
console.log(element.type);
}
// Using HTTP streams (axios, fetch, etc.) - NEW!
const response = await axios({
url: 'https://example.com/document.docx',
responseType: 'stream'
});
for await (const element of parseDocxHttpStream(response.data)) {
console.log(element.type);
}
// Using Web ReadableStream (for fetch, responses, etc.)
const fetchResponse = await fetch('https://example.com/document.docx');
const webStream = fetchResponse.body!;
for await (const element of parseDocxWebStream(webStream)) {
console.log(element.type);
}
`
`typescript
import { StreamAdapter } from 'docx-parser';
import { createReadStream } from 'fs';
// Convert ReadStream to Web ReadableStream
const nodeStream = createReadStream('./document.docx');
const webStream = StreamAdapter.toWebStream(nodeStream);
// Convert ReadableStream to Buffer
const buffer = await StreamAdapter.toBuffer(webStream);
// Create ReadableStream from Buffer
const streamFromBuffer = StreamAdapter.fromBuffer(buffer);
`
#### parseDocx(buffer, options?)
Processes a DOCX Buffer incrementally.
`typescript
import { parseDocx } from 'docx-parser';
import { readFileSync } from 'fs';
const buffer = readFileSync('doc.docx');
for await (const element of parseDocx(buffer, {
includeImages: true,
includeTables: true,
normalizeWhitespace: true
})) {
// Process each element
}
`
#### parseDocxStream(stream, options?)
Processes a DOCX ReadStream (Node.js) incrementally.
`typescript
import { parseDocxStream } from 'docx-parser';
import { createReadStream } from 'fs';
const stream = createReadStream('doc.docx');
for await (const element of parseDocxStream(stream)) {
// Process each element
}
`
#### parseDocxReadable(stream, options?)
Processes a DOCX from any Node.js Readable stream incrementally.
`typescript
import { parseDocxReadable } from 'docx-parser';
import { Readable } from 'stream';
// Custom Readable stream
const customStream = new Readable({
read() {
// Custom stream logic
}
});
for await (const element of parseDocxReadable(customStream)) {
// Process each element
}
// Transform streams, pipeline streams, etc.
import { Transform } from 'stream';
const transformedStream = someStream.pipe(new Transform({
transform(chunk, encoding, callback) {
// Transform logic
callback(null, chunk);
}
}));
for await (const element of parseDocxReadable(transformedStream)) {
// Process transformed stream
}
`
#### parseDocxHttpStream(stream, options?)
Processes a DOCX from HTTP Readable streams (axios, fetch, etc.) incrementally.
`typescript
import { parseDocxHttpStream } from 'docx-parser';
import axios from 'axios';
const response = await axios({
url: 'https://example.com/doc.docx',
responseType: 'stream'
});
for await (const element of parseDocxHttpStream(response.data)) {
// Process each element
}
`
#### parseDocxWebStream(stream, options?)
Processes a DOCX ReadableStream (Web API) incrementally.
`typescript
import { parseDocxWebStream } from 'docx-parser';
const response = await fetch('document.docx');
const stream = response.body!;
for await (const element of parseDocxWebStream(stream)) {
// Process each element
}
`
#### parseDocxToArray(source, options?)
Returns all elements as an array (non-streaming).
`typescript
import { parseDocxToArray } from 'docx-parser';
import { createReadStream } from 'fs';
// From buffer
const elements = await parseDocxToArray(buffer);
// From ReadStream
const stream = createReadStream('doc.docx');
const elements2 = await parseDocxToArray(stream);
console.log(Document has ${elements.length} elements);`
#### extractText(source, options?)
Extracts only text from the document.
`typescript
import { extractText } from 'docx-parser';
import { createReadStream } from 'fs';
// From buffer
const text = await extractText(buffer, {
preserveFormatting: true
});
// From ReadStream
const stream = createReadStream('doc.docx');
const text2 = await extractText(stream);
console.log(text);
`
#### extractImages(source)
Extracts only images from the document.
`typescript
import { extractImages } from 'docx-parser';
import { createReadStream } from 'fs';
// From buffer
for await (const image of extractImages(buffer)) {
console.log(Image: ${image.metadata?.filename});
}
// From ReadStream
const stream = createReadStream('doc.docx');
for await (const image of extractImages(stream)) {
// image.content contains the image Buffer
}
`
#### getMetadata(source)
Extracts only metadata from the document.
`typescript
import { getMetadata } from 'docx-parser';
import { createReadStream } from 'fs';
// From buffer
const metadata = await getMetadata(buffer);
// From ReadStream
const stream = createReadStream('doc.docx');
const metadata2 = await getMetadata(stream);
console.log(Title: ${metadata.title});Author: ${metadata.author}
console.log();Created: ${metadata.created}
console.log();`
#### ValidateDocumentUseCaseImpl
Validates DOCX document structure and integrity.
`typescript
import { ValidateDocumentUseCaseImpl } from 'docx-parser';
const validator = new ValidateDocumentUseCaseImpl();
const result = await validator.validate(buffer);
if (!result.isValid) {
console.log('Invalid document:');
result.errors.forEach(error => {
console.log(- ${error.message} (${error.code}));`
});
} else {
console.log('Valid document!');
}
The library now supports HTTP streams from popular libraries like axios, fetch, and others. This resolves the common error "stream.getReader is not a function" when working with HTTP responses.
`typescript
import axios from 'axios';
import { parseDocxHttpStream, parseDocxToArray } from 'docx-parser';
// Method 1: Specific HTTP stream function (recommended)
async function parseFromUrl(url: string) {
const response = await axios({
method: 'get',
url: url,
responseType: 'stream' // Important: use 'stream', not 'buffer'
});
for await (const element of parseDocxHttpStream(response.data)) {
console.log(${element.type}: ${element.content});
}
}
// Method 2: Automatic detection
async function parseToArray(url: string) {
const response = await axios({
method: 'get',
url: url,
responseType: 'stream'
});
// parseDocxToArray automatically detects Node.js streams
const elements = await parseDocxToArray(response.data, {
includeImages: true,
includeTables: true
});
return elements;
}
`
`typescriptHTTP error! status: ${response.status}
// Fetch API (Node.js 18+)
const response = await fetch('https://example.com/document.docx');
if (!response.ok) throw new Error();
// response.body is a Web ReadableStream - works automatically
const elements = await parseDocxToArray(response.body);
`
`typescript
import axios from 'axios';
import { parseDocxHttpStream } from 'docx-parser';
async function safeParseFromUrl(url: string) {
try {
const response = await axios({
method: 'get',
url: url,
responseType: 'stream',
timeout: 30000, // 30 seconds timeout
headers: {
'Accept': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
}
});
const elements = [];
for await (const element of parseDocxHttpStream(response.data)) {
elements.push(element);
}
return elements;
} catch (error) {
if (axios.isAxiosError(error)) {
console.error('HTTP Error:', error.response?.status, error.message);
} else if (error.name === 'DocxParseError') {
console.error('DOCX Parse Error:', error.message);
} else {
console.error('Unknown Error:', error);
}
throw error;
}
}
`
š For complete HTTP streams documentation, see: project/HTTP_STREAMS_GUIDE.md
`typescript
interface ParseOptions {
// Content control
includeMetadata?: boolean; // Include metadata (default: true)
includeImages?: boolean; // Include images (default: true)
includeTables?: boolean; // Include tables (default: true)
includeHeaders?: boolean; // Include headers (default: false)
includeFooters?: boolean; // Include footers (default: false)
// Image processing
imageFormat?: 'buffer' | 'base64' | 'stream'; // Image format
maxImageSize?: number; // Maximum size in bytes (default: 10MB)
// Text formatting
preserveFormatting?: boolean; // Preserve formatting (default: true)
normalizeWhitespace?: boolean; // Normalize whitespace (default: true)
// Performance
chunkSize?: number; // Chunk size (default: 64KB)
concurrent?: boolean; // Parallel processing (default: false)
}
`
typescript
{
type: 'metadata',
id: string,
position: { page: number, section: number, order: number },
content: {
title?: string,
author?: string,
subject?: string,
created?: Date,
modified?: Date,
// ... other metadata
}
}
`$3
`typescript
{
type: 'paragraph',
id: string,
position: { page: number, section: number, order: number },
content: string,
formatting?: {
fontFamily?: string,
fontSize?: number,
bold?: boolean,
italic?: boolean,
underline?: boolean,
color?: string,
highlight?: string
},
// Special properties
checkbox?: { // For checkbox list items
checked: boolean
},
isFootnote?: boolean, // If this is a footnote
footnoteId?: string // Footnote identifier
}
`$3
`typescript
{
type: 'image',
id: string,
position: { page: number, section: number, order: number },
content: Buffer,
metadata?: {
filename?: string,
format: 'png' | 'jpg' | 'gif' | 'svg' | 'wmf' | 'emf',
width: number,
height: number
},
positioning?: {
inline?: boolean,
x?: number,
y?: number
}
}
`$3
`typescript
{
type: 'table',
id: string,
position: { page: number, section: number, order: number },
content: Array<{
cells: Array<{ content: string }>,
isHeader: boolean
}>
}
`$3
`typescript
{
type: 'header',
id: string,
position: { page: number, section: number, order: number },
content: string,
level: 1 | 2 | 3 | 4 | 5 | 6,
hasPageNumber?: boolean // If header contains page numbers
}
`$3
`typescript
{
type: 'footer',
id: string,
position: { page: number, section: number, order: number },
content: string
}
`$3
`typescript
{
type: 'list',
id: string,
position: { page: number, section: number, order: number },
content: string[],
listType: 'bullet' | 'number',
level: number
}
`$3
`typescript
// Page breaks
{
type: 'pageBreak',
id: string,
position: { page: number, section: number, order: number },
content: null
}// Document sections
{
type: 'section',
id: string,
position: { page: number, section: number, order: number },
content: string,
sectionType: 'header' | 'footer' | 'body'
}
`šļø Advanced Examples
$3
`typescript
import { parseDocx } from 'docx-parser';// Only process text and tables
for await (const element of parseDocx(buffer, {
includeImages: false,
includeMetadata: false,
includeTables: true
})) {
if (element.type === 'paragraph') {
console.log('Paragraph:', element.content);
} else if (element.type === 'table') {
console.log('Table found');
element.content.forEach((row, i) => {
console.log(
Row ${i}:, row.cells.map(c => c.content).join(' | '));
});
}
}
`$3
`typescript
import { parseDocx } from 'docx-parser';for await (const element of parseDocx(buffer)) {
if (element.type === 'paragraph' && element.checkbox) {
const status = element.checkbox.checked ? 'ā
' : 'ā';
console.log(
${status} ${element.content});
}
}
`$3
`typescript
import { parseDocx } from 'docx-parser';const footnotes = [];
for await (const element of parseDocx(buffer)) {
if (element.type === 'paragraph' && element.isFootnote) {
footnotes.push({
id: element.footnoteId,
content: element.content
});
}
}
console.log('Found footnotes:', footnotes);
`$3
`typescript
import { parseDocx } from 'docx-parser';for await (const element of parseDocx(buffer)) {
if (element.type === 'header') {
const indent = ' '.repeat(element.level - 1);
console.log(
${indent}H${element.level}: ${element.content}); if (element.hasPageNumber) {
console.log(
${indent} (contains page number));
}
}
}
`$3
`typescript
import { parseDocxReadable } from 'docx-parser';
import { Readable, Transform, pipeline } from 'stream';
import { promisify } from 'util';// Custom stream processing
const customStream = new Readable({
read() {
// Custom logic to read DOCX data
}
});
for await (const element of parseDocxReadable(customStream)) {
console.log(
${element.type}: ${element.content});
}// Pipeline with transforms
const pipelineAsync = promisify(pipeline);
const transformStream = new Transform({
transform(chunk, encoding, callback) {
// Apply transformations if needed
callback(null, chunk);
}
});
// Process transformed stream
const transformedStream = someDocxStream.pipe(transformStream);
for await (const element of parseDocxReadable(transformedStream)) {
console.log('Transformed element:', element);
}
`$3
`typescript
import { writeFileSync } from 'fs';
import { extractImages } from 'docx-parser';let imageCount = 0;
for await (const image of extractImages(buffer)) {
const filename = image.metadata?.filename ||
image_${++imageCount}.${image.metadata?.format};
writeFileSync(./output/${filename}, image.content);
console.log(Image saved: ${filename});
}
`$3
`typescript
import { ValidateDocumentUseCaseImpl } from 'docx-parser';const validator = new ValidateDocumentUseCaseImpl();
const result = await validator.validate(buffer);
if (!result.isValid) {
console.log('Invalid document:');
result.errors.forEach(error => {
console.log(
- ${error.message} (${error.code}));
});
} else {
console.log('Valid document!');
}
`š§ Architecture
The library follows Clean Architecture with 4 layers:
- Domain: Types, interfaces, and business rules
- Application: Use cases and application logic
- Infrastructure: Concrete implementations (JSZip, XML parsing)
- Interfaces: Public API and controllers
`
src/
āāā domain/ # Entities and business rules
āāā application/ # Use cases and application logic
āāā infrastructure/ # Adapters and implementations
āāā interfaces/ # Public API
`šØ Error Handling
`typescript
import { DocxParseError } from 'docx-parser';try {
for await (const element of parseDocx(invalidBuffer)) {
console.log(element);
}
} catch (error) {
if (error instanceof DocxParseError) {
console.log('DOCX-specific error:', error.message);
} else {
console.log('General error:', error);
}
}
`š Performance Tips
1. Use streaming: Prefer
parseDocx() over parseDocxToArray() for large documents
2. Filter content: Disable includeImages if you don't need images
3. Chunk size: Adjust chunkSize based on document sizes
4. Limit images: Use maxImageSize to avoid very large imagesš¤ Contributing
1. Fork the project
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request
$3
`bash
Clone the repository
git clone https://github.com/ThaSMorato/docx-parser.git
cd docx-parserInstall dependencies
pnpm installRun tests
pnpm testRun linting
pnpm lintBuild the project
pnpm build
`$3
This project uses GitHub Actions for automated NPM publishing with semantic versioning:
- MAJOR:
MAJOR: breaking change or BREAKING CHANGE in commit
- MINOR: MINOR: new feature or feat: prefix
- PATCH: fix:, docs:, chore: or any other commit type.github/VERSIONING.md` for detailed information about the versioning system.MIT License - see LICENSE for details.
Open an issue on GitHub with:
- Library version
- Sample DOCX file (if possible)
- Code that reproduces the problem
- Complete error message
---
Made with ā¤ļø and TypeScript