# DocFlow

> A developer-friendly transformation engine for programmatic document manipulation, built on SuperDoc

Transform documents programmatically with a clean, chainable API. Built on top of the SuperDoc Editor, DocFlow makes it easy to load, query, transform, and export documents in multiple formats.
## Features

- 🔄 **Format Conversion**: DOCX ↔ Markdown ↔ HTML ↔ JSON ↔ Plain Text
- 📋 **Table Support**: DOCX tables export cleanly to Markdown with formatting preserved
- 🔍 **Powerful Queries**: CSS-like selectors and predicate functions
- ⚡ **Batch Processing**: Process multiple documents concurrently
- 🎯 **Type-Safe**: Full TypeScript support
- 🔗 **Chainable API**: Fluent, readable transformations
- 🚫 **Error Handling**: Multiple error-handling modes
- 📦 **Headless**: Runs in Node.js without a browser
## Installation

```bash
npm install @paulmeller/docflow
```
## Quick Start

```javascript
import DocFlow from '@paulmeller/docflow';

// Simple transformation
await new DocFlow()
  .load('document.docx')
  .transform('heading[level=2]', node => ({
    ...node,
    text: node.text.toUpperCase()
  }))
  .save('output.docx');
```
## load() vs loadContent()

**DO NOT confuse these two methods:**

| Method | Purpose | First Parameter | When to Use |
|--------|---------|-----------------|-------------|
| `load()` | Load from file path or buffer | File path string or Buffer | Reading files from disk |
| `loadContent()` | Load from content string | Content string (markdown/HTML/JSON) | When you have content in memory |
```javascript
// ❌ WRONG - This treats the JSON string as a file path!
const jsonString = JSON.stringify(docJson);
await doc.load(jsonString, { format: 'json' });
// Error: ENAMETOOLONG or "file not found"

// ❌ WRONG - This treats the markdown as a file path!
const markdown = '# Hello\n\nWorld';
await doc.load(markdown, { format: 'markdown' });
```
```javascript
// ✅ CORRECT - Use load() for file paths
await doc.load('document.docx');
await doc.load('data.json', { format: 'json' });

// ✅ CORRECT - Use loadContent() for content strings
const jsonString = JSON.stringify(docJson);
await doc.loadContent(jsonString);

const markdown = '# Hello\n\nWorld';
await doc.loadContent(markdown);
```
## Loading Documents

Load from a file, buffer, or content string:
```javascript
const doc = new DocFlow();

// From file path
await doc.load('document.docx');

// From buffer (binary data)
const buffer = fs.readFileSync('document.docx');
await doc.load(buffer);

// From JSON buffer with explicit format
const jsonBuffer = Buffer.from(jsonString, 'utf-8');
await doc.load(jsonBuffer, { format: 'json' });

// From content string (auto-detects markdown, HTML, JSON, or text)
await doc.loadContent('<p>Content</p>');
```

## Query
Find nodes using selectors:
```javascript
// CSS-like selectors
const headings = doc.query('heading[level=2]');
const paragraphs = doc.query('paragraph');

// Predicate functions
const longParagraphs = doc.query(node =>
  node.type === 'paragraph' &&
  node.content?.length > 100
);

// Work with results
console.log(headings.count);   // Number of matches
console.log(headings.text());  // Concatenated text
console.log(headings.first()); // First match
console.log(headings.last());  // Last match
```

## Transform
Modify document structure:
```javascript
import { Transforms } from '@paulmeller/docflow';

// Transform matching nodes
await doc.transform('heading[level=2]', node => ({
  ...node,
  text: Transforms.toTitleCase(node.text)
}));

// Transform with built-in helpers
await doc.transform('text', Transforms.replace(/foo/g, 'bar'));
await doc.transform('text', Transforms.addPrefix('> '));

// Transform the entire document
await doc.transformDocument(json => {
  // Modify the whole document structure
  return modifiedJSON;
});
```

## Export
Export to various formats:
```javascript
// Export to buffer/string
const docx = await doc.export('docx');
const html = await doc.export('html');
const markdown = await doc.export('markdown');
const json = await doc.export('json');
const text = await doc.export('text');

// Save directly to file
await doc.save('output.docx');
await doc.save('output.md', 'markdown');
await doc.save('output.html', 'html');
await doc.save('output.json', 'json');
await doc.save('output.txt', 'text');
```

## Examples

### Format Conversion
```javascript
import DocFlow from '@paulmeller/docflow';

// DOCX to Markdown
await new DocFlow()
  .load('document.docx')
  .save('document.md', 'markdown');

// Markdown to DOCX
await new DocFlow()
  .load('document.md')
  .save('document.docx');

// DOCX with tables to Markdown (tables preserved with formatting)
await new DocFlow()
  .load('report-with-tables.docx')
  .save('report.md', 'markdown');

// Extract plain text for analysis
const plainText = await new DocFlow()
  .load('document.docx')
  .export('text');
console.log(plainText); // All text without formatting
```

### JSON Workflows
```javascript
import fs from 'node:fs/promises';
import DocFlow from '@paulmeller/docflow';

// Export DOCX to JSON (for storage or transmission)
const doc1 = new DocFlow();
await doc1.load('report.docx');
const jsonData = await doc1.export('json');
await fs.writeFile('report.json', JSON.stringify(jsonData, null, 2));

// Import JSON and convert to Markdown
const doc2 = new DocFlow();
await doc2.load('report.json'); // Auto-detects the .json extension
const markdown = await doc2.export('markdown');

// Load JSON from a buffer (e.g., from an API or virtual filesystem)
const jsonString = JSON.stringify(jsonData);
const buffer = Buffer.from(jsonString, 'utf-8');
const doc3 = new DocFlow();
await doc3.load(buffer, { format: 'json' }); // Explicit format for buffers
const html = await doc3.export('html');

// Load JSON from a content string (when you have it in memory)
const doc4 = new DocFlow();
await doc4.loadContent(jsonString); // Auto-detects JSON
const docx = await doc4.export('docx');
```

### Standardize Headings
```javascript
import { DocFlow, Transforms } from '@paulmeller/docflow';

const doc = new DocFlow();
await doc
  .load('document.docx')
  .transform('heading[level=1]', node => ({
    ...node,
    text: Transforms.toTitleCase(node.text)
  }))
  .transform('heading[level=2]', node => ({
    ...node,
    text: Transforms.toSentenceCase(node.text)
  }))
  .save('standardized.docx');
```

### Fill a Template
```javascript
const doc = new DocFlow();
await doc
  .load('template.docx')
  .transform('text', Transforms.replace(/\{name\}/g, 'John Doe'))
  .transform('text', Transforms.replace(/\{date\}/g, new Date().toLocaleDateString()))
  .transform('text', Transforms.replace(/\{company\}/g, 'Acme Corp'))
  .save('filled-document.docx');
```

### Analyze Content
```javascript
const doc = new DocFlow();
await doc.load('document.docx');

// Find all headings
const headings = doc.query('heading');
console.log(`Document has ${headings.count} headings`);

// Find specific content
const importantSections = doc.query(node =>
  node.type === 'heading' &&
  node.text?.toLowerCase().includes('important')
);
console.log('Important sections:');
importantSections.map(h => console.log(`- ${h.text}`));
```

### Batch Processing
```javascript
import { BatchProcessor } from '@paulmeller/docflow';

const batch = new BatchProcessor({
  concurrency: 5
});

const results = await batch.process(
  ['doc1.docx', 'doc2.docx', 'doc3.docx'],
  async (doc, file) => {
    await doc
      .transform('heading[level=1]', node => ({
        ...node,
        attrs: { ...node.attrs, level: 2 }
      }))
      .save(file.replace('.docx', '-updated.docx'));
    return { processed: true };
  }
);

console.log(`✅ Processed ${results.successful} files`);
console.log(`❌ Failed ${results.failed} files`);
```

### Multi-Step Pipelines
```javascript
const doc = new DocFlow();

await doc
  .load('input.docx')
  // Step 1: Standardize heading levels
  .transform('heading[level=1]', node => ({
    ...node,
    text: node.text.toUpperCase()
  }))
  // Step 2: Remove empty paragraphs
  .transformDocument(json => {
    json.content = json.content.filter(node =>
      node.type !== 'paragraph' || node.content?.length > 0
    );
    return json;
  })
  // Step 3: Add metadata
  .transformDocument(json => ({
    ...json,
    attrs: {
      ...json.attrs,
      processed: true,
      timestamp: Date.now()
    }
  }))
  .save('processed.docx');
```

### Document Structure Analysis
```javascript
const doc = new DocFlow();
await doc.load('document.docx');
const json = doc.toJSON();

// Analyze structure
const stats = {
  headings: doc.query('heading').count,
  paragraphs: doc.query('paragraph').count,
  h1: doc.query('heading[level=1]').count,
  h2: doc.query('heading[level=2]').count,
  h3: doc.query('heading[level=3]').count
};
console.log('Document Structure:');
console.log(JSON.stringify(stats, null, 2));

// Table of contents
const toc = doc.query('heading')
  .map(h => `${' '.repeat(h.attrs.level - 1)}- ${h.text}`)
  .join('\n');

console.log('\nTable of Contents:');
console.log(toc);
```

### Error Handling
```javascript
// Mode 1: Throw on error (default)
try {
  await new DocFlow()
    .load('missing.docx')
    .save('output.docx');
} catch (error) {
  console.error('Failed:', error.message);
}

// Mode 2: Collect errors
const doc = new DocFlow({ errorMode: 'collect' });
await doc
  .load('input.docx')
  .transform('invalid-selector', node => node)
  .save('output.docx');

const errors = doc.getErrors();
if (errors.length > 0) {
  console.error('Errors occurred:', errors);
}

// Mode 3: Silent (ignore errors)
const doc2 = new DocFlow({ errorMode: 'silent' });
await doc2.load('maybe-exists.docx').save('output.docx');
```

### Custom Transformations
```javascript
// Define a custom transformation
function addTimestamp(node) {
  if (node.type === 'paragraph') {
    return {
      ...node,
      attrs: {
        ...node.attrs,
        timestamp: new Date().toISOString()
      }
    };
  }
  return node;
}

// Apply it
await doc
  .load('document.docx')
  .transform('paragraph', addTimestamp)
  .save('timestamped.docx');
```

### Conditional Processing
```javascript
const doc = new DocFlow();
await doc.load('document.docx');

// Only process if conditions are met
const validation = doc.validate();
if (validation.valid) {
  const headings = doc.query('heading');
  if (headings.count > 0) {
    await doc
      .transform('heading', node => ({
        ...node,
        text: `📌 ${node.text}`
      }))
      .save('processed.docx');
  }
}
```

## Format Auto-Detection
DocFlow automatically detects formats to make your code simpler and more intuitive.
### load()

- Auto-detects format from the file extension: `.docx`, `.md`, `.html`, `.json`, `.txt`
- Use the `format` option to override: `await doc.load('file.txt', { format: 'markdown' })`
- For Buffers, the format defaults to `'docx'` unless you specify otherwise

```javascript
// Auto-detected from extension
await doc.load('document.docx'); // → format: 'docx'
await doc.load('data.json');     // → format: 'json'
await doc.load('readme.md');     // → format: 'markdown'

// Explicit format for buffers
const buffer = Buffer.from(jsonString, 'utf-8');
await doc.load(buffer, { format: 'json' });
```

### loadContent()
- Auto-detects JSON: looks for the `{"type": "doc", ...}` pattern
- Auto-detects HTML: looks for tags such as `<p>`, `<div>`, `<h1>`, etc.
- Auto-detects Markdown: everything else is treated as markdown
- No `format` parameter needed - detection is automatic
```javascript
// All auto-detected
await doc.loadContent('# Heading\n\nText');    // → markdown
await doc.loadContent('<h1>Title</h1>');       // → HTML
await doc.loadContent('{"type": "doc", ...}'); // → JSON
```

## API Reference

### DocFlow
#### Constructor
```typescript
new DocFlow(options?: {
  headless?: boolean;        // Default: true
  validateSchema?: boolean;  // Default: true
  errorMode?: 'throw' | 'collect' | 'silent'; // Default: 'throw'
})
```
#### Methods
##### load(source, options?)
Load a document from file or buffer.
```typescript
load(source: string | Buffer, options?: {
  format?: 'docx' | 'html' | 'markdown' | 'json' | 'text'
}): Promise<DocFlow>
```
##### loadContent(content)
Load content directly. Automatically detects if content is markdown, HTML, or plain text. Auto-initializes a blank document if none exists.
```typescript
loadContent(content: string): Promise<DocFlow>
```
##### toJSON()
Get document as ProseMirror JSON.
```typescript
toJSON(): ProseMirrorJSON
```
##### query(selector)
Query document structure.
```typescript
query(selector: string | Function): QueryResult
```

Selectors:
- `"heading"` - All heading nodes
- `"heading[level=2]"` - H2 headings
- `"paragraph"` - All paragraphs
- `node => condition` - Custom predicate
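The string selector grammar is small; a hypothetical matcher for the `type[attr=value]` shape might look like this (illustrative only, not DocFlow's parser):

```javascript
// Illustrative matcher for the "type[attr=value]" selector shape; not DocFlow's parser.
function matchesSelector(node, selector) {
  const m = /^(\w+)(?:\[(\w+)=(\w+)\])?$/.exec(selector);
  if (!m) return false;
  const [, type, attr, value] = m;
  // The bare name must match the node type
  if (node.type !== type) return false;
  // Optional attribute filter, e.g. [level=2]; compare as strings
  if (attr && String(node.attrs?.[attr]) !== value) return false;
  return true;
}
```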
##### transform(selector, transformer)
Transform matching nodes.
```typescript
transform(
  selector: string | Function,
  transformer: (node: Node) => Node | Promise<Node>
): Promise<DocFlow>
```
##### transformDocument(transformer)
Transform entire document.
```typescript
transformDocument(
  transformer: (json: ProseMirrorJSON) => ProseMirrorJSON | Promise<ProseMirrorJSON>
): Promise<DocFlow>
```
##### export(format?, options?)
Export to format.
```typescript
export(
  format?: 'docx' | 'html' | 'markdown' | 'json' | 'text',
  options?: ExportOptions
): Promise<Buffer | string | object>
```
##### save(filepath, format?)
Save to file.
```typescript
save(filepath: string, format?: string): Promise<void>
```
##### validate()
Validate document.
```typescript
validate(): {
  valid: boolean;
  errors: string[];
  warnings: string[];
  document: ProseMirrorJSON;
}
```
##### getHistory()
Get operation history.
```typescript
getHistory(): Operation[]
```
##### getErrors()
Get collected errors (when errorMode: 'collect').
```typescript
getErrors(): ErrorRecord[]
```
### QueryResult

#### Properties

- `count`: Number of matching nodes
- `nodes`: Array of found nodes
- `selector`: Selector used
#### Methods
##### first()
Get first match.
```typescript
first(): Node | null
```
##### last()
Get last match.
```typescript
last(): Node | null
```
##### map(fn)
Map over results.
```typescript
map<T>(fn: (node, index) => T): T[]
```
##### filter(fn)
Filter results.
```typescript
filter(fn: (node, index) => boolean): QueryResult
```
##### text()
Get concatenated text content.
```typescript
text(): string
```
##### transform(transformer)
Transform all found nodes.
```typescript
transform(transformer: Function): Promise<void>
```
### BatchProcessor

Process multiple documents concurrently.
```typescript
const batch = new BatchProcessor({
  concurrency: 5,
  errorMode: 'collect'
});

const results = await batch.process(
  ['file1.docx', 'file2.docx'],
  async (doc, file) => {
    // Process each document
    await doc.transform(...).save(...);
    return { success: true };
  }
);
```
### Transforms

Built-in transformation helpers.
```typescript
import { Transforms } from '@paulmeller/docflow';

Transforms.toTitleCase(text)      // "hello world" → "Hello World"
Transforms.toSentenceCase(text)   // "HELLO WORLD" → "Hello world"
Transforms.replace(pattern, repl) // Replace matching text
Transforms.addPrefix(prefix)      // Add a prefix to text
Transforms.remove(condition)      // Remove matching nodes
```
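For intuition, the two case helpers behave roughly like the plain functions below. This is a plausible sketch of the documented behavior, not DocFlow's source:

```javascript
// Plausible implementations of the case helpers above; not DocFlow's source.
function toTitleCase(text) {
  // Capitalize the first letter of each word, lowercase the rest
  return text.replace(/\w\S*/g, w => w.charAt(0).toUpperCase() + w.slice(1).toLowerCase());
}

function toSentenceCase(text) {
  // Lowercase everything, then capitalize only the first character
  const lower = text.toLowerCase();
  return lower.charAt(0).toUpperCase() + lower.slice(1);
}

console.log(toTitleCase('hello world'));    // "Hello World"
console.log(toSentenceCase('HELLO WORLD')); // "Hello world"
```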
## Document Model

Documents are represented as ProseMirror JSON:

```json
{
  "type": "doc",
  "content": [
    {
      "type": "heading",
      "attrs": { "level": 1 },
      "content": [
        { "type": "text", "text": "Title" }
      ]
    },
    {
      "type": "paragraph",
      "content": [
        { "type": "text", "text": "Content" }
      ]
    }
  ]
}
```
Node types:

- `doc` - Root document
- `heading` - Heading (attrs: `level`)
- `paragraph` - Paragraph
- `text` - Text content
- `blockquote` - Block quote
- `codeBlock` - Code block
- `bulletList` - Bullet list
- `orderedList` - Numbered list
- `listItem` - List item
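Because the model is plain JSON, it can be traversed with ordinary recursion, no DocFlow APIs required. A small self-contained sketch that collects all text from a document shaped like the one above:

```javascript
// Recursively collect all text nodes from a ProseMirror-style JSON document.
function collectText(node, out = []) {
  if (node.type === 'text' && node.text) out.push(node.text);
  for (const child of node.content || []) collectText(child, out);
  return out;
}

const doc = {
  type: 'doc',
  content: [
    { type: 'heading', attrs: { level: 1 }, content: [{ type: 'text', text: 'Title' }] },
    { type: 'paragraph', content: [{ type: 'text', text: 'Content' }] }
  ]
};

console.log(collectText(doc).join(' ')); // "Title Content"
```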
## Best Practices

### Chain Operations

```javascript
// ✅ Good - Chainable, readable
await doc
  .load('input.docx')
  .transform('heading', ...)
  .transform('paragraph', ...)
  .save('output.docx');

// ❌ Avoid - Verbose
await doc.load('input.docx');
await doc.transform('heading', ...);
await doc.transform('paragraph', ...);
await doc.save('output.docx');
```
### Use Specific Selectors

```javascript
// ✅ Good - Specific
doc.query('heading[level=2]')

// ❌ Avoid - Too broad
doc.query('heading').filter(h => h.attrs.level === 2)
```
### Handle Errors

```javascript
// ✅ Good - Handle errors
try {
  await doc.load('file.docx');
} catch (error) {
  console.error('Failed to load:', error);
}

// Or use collect mode
const doc = new DocFlow({ errorMode: 'collect' });
await doc.load('file.docx');
if (doc.getErrors().length > 0) {
  // Handle errors
}
```
### Validate After Transforming

```javascript
// ✅ Good - Validate after transformation
await doc.transform(...);
const validation = doc.validate();
if (!validation.valid) {
  console.error('Validation failed:', validation.errors);
}
```
### Batch Multiple Files

```javascript
// ✅ Good - Concurrent processing
const batch = new BatchProcessor({ concurrency: 5 });
await batch.process(files, pipeline);

// ❌ Avoid - Sequential processing
for (const file of files) {
  await new DocFlow().load(file).transform(...).save(...);
}
```
## TypeScript

Full TypeScript support included:

```typescript
import DocFlow, { QueryResult, Transforms } from '@paulmeller/docflow';

const doc = new DocFlow({
  errorMode: 'collect'
});

await doc.load('document.docx');
const headings: QueryResult = doc.query('heading');
const count: number = headings.count;
```
## Known Limitations

Due to limitations in the underlying SuperDoc library's DOCX conversion engine:

- **Lists**: DOCX export/import loses list items beyond the first item (a confirmed SuperDoc limitation, even with the proper command API)
- **Tables**: ✅ Work reliably in DOCX round-trips
- **Recommendation**: For list transformations, use Markdown/HTML conversions instead of DOCX round-trips

What works well:

- ✅ Markdown ↔ HTML ↔ Markdown: Full fidelity for lists, tables, links, formatting
- ✅ HTML → Markdown: Complete preservation of all content
- ✅ DOCX → Markdown/HTML: One-way conversion works well for extracting content
- ✅ JSON export: Ideal for analyzing document structure
```javascript
// ✅ GOOD: Use HTML/Markdown for list transformations
await doc.createBlank();
await doc.loadContent('- Item 1\n- Item 2\n- Item 3');
const html = await doc.export('html');   // Preserves all items
const md = await doc.export('markdown'); // Preserves all items

// ⚠️ LIMITED: DOCX round-trips may lose list structure
await doc.load('document.docx');
const docx = await doc.export('docx');
await doc.load(docx); // May lose list items 2+
```
For applications requiring full DOCX round-trip fidelity with complex lists and tables, consider using Microsoft Word's native APIs or alternative DOCX libraries.
Upstream Issue Tracker: These limitations originate from the SuperDoc library. You can track progress at SuperDoc GitHub Issues.
## Performance Tips

1. Use batch processing for multiple files
2. Reuse DocFlow instances when possible
3. Use specific selectors to minimize traversal
4. Validate only when necessary (disable with `validateSchema: false`)
5. Handle large documents with streaming (future feature)
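As a rough illustration of tip 1, concurrency-limited processing can be sketched in plain Node.js. This is a hypothetical worker-pool pattern, not BatchProcessor's actual internals:

```javascript
// Minimal worker-pool style concurrency limiter; illustrative only.
async function processWithLimit(items, limit, worker) {
  const results = [];
  let next = 0;
  // Each "lane" repeatedly claims the next unprocessed index
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, lane));
  return results;
}
```

With `limit` lanes running, at most `limit` `worker` calls are in flight at once, while results stay in input order.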
## Troubleshooting

### "ENAMETOOLONG" or "file not found" when loading JSON

Error message:

```
Error: ENAMETOOLONG: name too long, open '{"type": "doc", "content": [...]}'
```

**Cause**: You're passing content to `load()` instead of a file path.

**Fix**: Use `loadContent()` for content strings:
```javascript
// ❌ WRONG - Treats the JSON string as a file path
const json = await doc.export('json');
await doc.load(JSON.stringify(json), { format: 'json' });

// ✅ CORRECT - Use loadContent() for content
const json = await doc.export('json');
await doc.loadContent(JSON.stringify(json));

// OR use load() with a buffer
const buffer = Buffer.from(JSON.stringify(json), 'utf-8');
await doc.load(buffer, { format: 'json' });
```
### Document won't load

```javascript
// Check that the file exists
if (fs.existsSync('document.docx')) {
  await doc.load('document.docx');
}

// Check the format
const format = path.extname('document.docx');
await doc.load('document.docx', { format: 'docx' });
```
### Transformations aren't applied

```javascript
// Verify the selector matches
const matches = doc.query('heading[level=2]');
console.log(`Found ${matches.count} matches`);

// Check the transformation logic
await doc.transform('heading', node => {
  console.log('Transforming:', node);
  return { ...node, text: node.text.toUpperCase() };
});
```
### Export produces the wrong format

```javascript
// Explicitly specify the format
await doc.save('output.docx', 'docx');

// Check supported formats
const formats = ['docx', 'html', 'markdown', 'json', 'text'];
```
## License

MIT

## Contributing

Contributions welcome! Please read CONTRIBUTING.md for guidelines.

## Credits

Built on SuperDoc by Harbour Enterprises.

## Support

- 📖 Documentation
- 💬 Discussions
- 🐛 Issues