Simplified document converter for common formats (PDF, DOCX, XLSX, PPTX) to Markdown
npm install @hivehub/transmutation-lite> Simplified TypeScript document converter for common formats to Markdown




Version: 0.6.1
Status: ✅ Production Ready - 177/177 Tests Passing (100%)
Transmutation Lite is a lightweight, TypeScript-based document conversion library that converts common document formats (PDF, DOCX, XLSX, PPTX, HTML, TXT) to Markdown. Designed specifically for integration with the HiveLLM Classify project, it provides fast, reliable conversions without the complexity of OCR, audio, video, or archive processing.
> ⚠️ Important: This is a simplified version optimized for use cases that do not require high precision, such as document classification. For production applications requiring high-quality conversion, advanced features, and superior performance, we strongly recommend using the full Transmutation Rust library, which offers:
> - 98x faster performance than Docling
> - 80%+ conversion quality with precision mode
> - Advanced OCR, audio, and video processing
> - Enterprise-grade reliability and accuracy
>
> Use Transmutation Lite for: Document classification, quick previews, prototyping
> Use Transmutation (Rust) for: RAG systems, production pipelines, high-quality document processing
pdf-parse-new for reliable text extractionmammoth for Word documentsxlsx for Excel spreadsheets (tables)turndown for clean conversionnpm installation (recommended):
``bash`
npm install @hivellm/transmutation-lite
Local usage (within monorepo):
`bash`From classify project
npm installUses: "file:../transmutation-lite" dependency
`typescript
import { Converter, Logger, LogLevel } from '@hivellm/transmutation-lite';
// Simple conversion
const converter = new Converter();
const result = await converter.convertFile('./document.pdf');
console.log(result.markdown);
// With caching and logging
const converter = new Converter({
enableCache: true,
cacheSize: 100,
logger: new Logger({ level: LogLevel.INFO }),
collectMetrics: true,
});
const result = await converter.convertFile('./document.docx', {
preserveFormatting: true,
maxPages: 10,
});
console.log('Format:', result.metadata.format);
console.log('Pages:', result.metadata.pageCount);
console.log('Time:', result.conversionTimeMs, 'ms');
// Get metrics
const metrics = converter.getMetricsSummary();
console.log('Success Rate:', metrics.successRate);
console.log('Cache Hit Rate:', metrics.cacheHitRate);
`
`bashConvert single file
npx transmutation-lite convert document.pdf -o output.md
API Reference
$3
Convenience function to convert a single file.
Parameters:
-
filePath (string): Path to the file to convert
- options (ConversionOptions): Optional conversion optionsReturns:
Promise`typescript
const result = await convert('./document.pdf', {
preserveFormatting: true,
maxPages: 10,
});
`$3
Main converter class that manages all format converters.
#### Methods
#####
convertFile(filePath, options?)Convert a file to Markdown.
`typescript
const converter = new Converter();
const result = await converter.convertFile('./document.docx');
`#####
convertBuffer(buffer, format, options?)Convert a buffer directly.
`typescript
const buffer = await readFile('./document.pdf');
const result = await converter.convertBuffer(buffer, DocumentFormat.PDF);
`#####
detectFormat(filePath)Detect document format from file extension.
`typescript
const format = converter.detectFormat('./document.pdf');
// Returns: DocumentFormat.PDF
`#####
isSupported(filePath)Check if a file format is supported.
`typescript
const supported = converter.isSupported('./document.pdf');
// Returns: true
`#####
getSupportedFormats()Get list of supported formats.
`typescript
const formats = converter.getSupportedFormats();
// Returns: [DocumentFormat.PDF, DocumentFormat.DOCX, ...]
`$3
####
ConversionOptions`typescript
interface ConversionOptions {
preserveFormatting?: boolean; // Default: true
extractImages?: boolean; // Default: false (not implemented)
maxPages?: number; // Default: 0 (all pages)
formatOptions?: Record;
}
`####
ConversionResult`typescript
interface ConversionResult {
markdown: string; // Converted markdown content
metadata: DocumentMetadata; // Document metadata
conversionTimeMs: number; // Conversion time in milliseconds
warnings?: string[]; // Any warnings during conversion
}
`####
DocumentMetadata`typescript
interface DocumentMetadata {
format: DocumentFormat; // Original format
fileSize: number; // File size in bytes
pageCount?: number; // Number of pages/sheets/slides
title?: string; // Document title
author?: string; // Document author
createdAt?: Date; // Creation date
extra?: Record; // Format-specific metadata
}
`####
DocumentFormat Enum`typescript
enum DocumentFormat {
PDF = 'pdf',
DOCX = 'docx',
XLSX = 'xlsx',
PPTX = 'pptx',
TXT = 'txt',
HTML = 'html',
UNKNOWN = 'unknown',
}
`Supported Formats
| Format | Extension | Library | Quality | Notes |
|--------|-----------|---------|---------|-------|
| PDF |
.pdf | pdf-parse | Basic | Text extraction only, no images, limited formatting |
| DOCX | .docx | mammoth | Good | Formatting support, quality depends on mammoth |
| XLSX | .xlsx, .xls | xlsx | Good | Converts to Markdown tables |
| PPTX | .pptx, .ppt | jszip (basic) | ⚠️ Limited | Basic text extraction only, experimental |
| HTML | .html, .htm | turndown | Good | Clean Markdown conversion |
| TXT | .txt, .md | native | Good | Direct text handling |Integration with Classify
Transmutation Lite is designed to integrate seamlessly with the HiveLLM Classify project:
`typescript
import { convert } from '@hivellm/transmutation-lite';
import { ClassifyClient } from '@hivellm/classify';// Convert document to Markdown
const conversionResult = await convert('./contract.pdf');
// Classify the markdown content
const classifier = new ClassifyClient({
provider: 'deepseek',
apiKey: process.env.DEEPSEEK_API_KEY,
});
const classificationResult = await classifier.classifyText(
conversionResult.markdown
);
console.log('Domain:', classificationResult.classification.domain);
console.log('Type:', classificationResult.classification.doc_type);
`CLI Commands
$3
Convert a single file to Markdown.
Options:
-
-o, --output - Output file path (default: )
- -m, --max-pages - Maximum pages/sheets to process
- --no-preserve-formatting - Disable formatting preservationExamples:
`bash
Basic conversion
transmutation-lite convert document.pdfSpecify output
transmutation-lite convert document.pdf -o output/document.mdLimit pages
transmutation-lite convert large.pdf --max-pages 10Disable formatting
transmutation-lite convert document.docx --no-preserve-formatting
`$3
Convert all supported files in a directory.
Options:
-
-o, --output - Output directory (default: )
- -r, --recursive - Process subdirectories recursively
- -m, --max-pages - Maximum pages/sheets to process
- --parallel - Number of parallel conversions (default: 4)
- --no-preserve-formatting - Disable formatting preservationExamples:
`bash
Basic batch conversion
transmutation-lite batch ./documentsRecursive with custom output
transmutation-lite batch ./docs -o ./markdown --recursiveHigh parallelism
transmutation-lite batch ./files --parallel 8Limit pages for all files
transmutation-lite batch ./pdfs --max-pages 5
`$3
List all supported file formats.
`bash
transmutation-lite formats
`Examples
$3
`typescript
import { Converter } from '@hivellm/transmutation-lite';const converter = new Converter();
const result = await converter.convertFile('./research-paper.pdf');
console.log('Title:', result.metadata.title);
console.log('Author:', result.metadata.author);
console.log('Pages:', result.metadata.pageCount);
console.log('Created:', result.metadata.createdAt);
console.log('Producer:', result.metadata.extra?.producer);
`$3
`typescript
import { Converter } from '@hivellm/transmutation-lite';
import { readdir } from 'fs/promises';
import { join } from 'path';const converter = new Converter();
const files = await readdir('./documents');
let completed = 0;
for (const file of files) {
const filePath = join('./documents', file);
if (converter.isSupported(filePath)) {
try {
const result = await converter.convertFile(filePath);
await writeFile(
./output/${file}.md,
result.markdown
);
completed++;
console.log(✅ [${completed}/${files.length}] ${file});
} catch (error) {
console.error(❌ Failed: ${file}, error);
}
}
}
`$3
`typescript
import { convert } from '@hivellm/transmutation-lite';const result = await convert('./data.xlsx');
// Output will be Markdown tables:
// # Sheet1
//
// | Column A | Column B | Column C |
// | --- | --- | --- |
// | Value 1 | Value 2 | Value 3 |
// | Value 4 | Value 5 | Value 6 |
console.log(result.markdown);
`$3
`typescript
import { convert, ConversionError } from '@hivellm/transmutation-lite';try {
const result = await convert('./document.pdf');
console.log(result.markdown);
} catch (error) {
if (error instanceof ConversionError) {
console.error('Conversion failed:', error.message);
console.error('Format:', error.format);
console.error('Cause:', error.cause);
} else {
console.error('Unexpected error:', error);
}
}
`Performance
> Note: Performance estimates below are theoretical. Real-world performance may vary based on document complexity and system specifications.
Estimated conversion times:
| Format | File Size | Pages | Estimated Time | Notes |
|--------|-----------|-------|----------------|-------|
| PDF | 2 MB | 15 | ~200-500ms | Text extraction only, depends on pdf-parse |
| DOCX | 500 KB | 20 | ~150-300ms | Depends on document complexity |
| XLSX | 1 MB | 10 sheets | ~100-200ms | Table conversion |
| PPTX | 3 MB | 30 slides | ~300-600ms | Basic text extraction via jszip |
| HTML | 200 KB | - | ~50-100ms | Turndown conversion |
Important: For production use requiring guaranteed performance and quality, use the full Transmutation Rust library.
Limitations
- No OCR - PDF images are not extracted or processed
- No Audio/Video - Audio and video files are not supported
- No Archives - ZIP, TAR, and other archives are not supported
- PPTX Quality - PowerPoint conversion is basic (text-only)
- Images - Image extraction is not implemented in this lite version
- Lower Precision - Text extraction quality is lower than the Rust version
$3
| Feature | Transmutation Lite (TypeScript) | Transmutation (Rust) |
|---------|--------------------------------|---------------------|
| Best For | Document classification, quick previews | RAG systems, production pipelines |
| Precision | ⚠️ Basic (untested, depends on libraries) | ✅ Excellent (80%+ tested) |
| Performance | Moderate (Node.js) | ✅ 98x faster than Docling (tested) |
| Memory | Moderate (Node.js overhead) | ✅ Very Low (~20MB tested) |
| OCR Support | ❌ No | ✅ Yes (Tesseract) |
| Audio/Video | ❌ No | ✅ Yes (Whisper) |
| Archives | ❌ No | ✅ Yes (ZIP, TAR, etc.) |
| Setup | ✅ npm install (local) | Requires Rust/binary |
| Integration | ✅ Easy (Node.js) | Moderate (CLI/FFI) |
| Status | ✅ Production Ready | ✅ Production Ready |
Recommendation: For production use cases requiring high-quality document conversion, we strongly recommend using the full Transmutation Rust library. Transmutation Lite is ideal for:
- Quick document classification (like HiveLLM Classify)
- Prototyping and development
- Simple text extraction where precision is not critical
- Node.js-only environments without Rust toolchain
Development
$3
`bash
npm run build
`$3
`bash
npm test
`$3
`bash
npm run lint
npm run lint:fix
`$3
`bash
npm run type-check
`Project Structure
`
transmutation-lite/
├── src/
│ ├── converters/
│ │ ├── base.ts # Base converter interface
│ │ ├── pdf.ts # PDF converter
│ │ ├── docx.ts # DOCX converter
│ │ ├── xlsx.ts # XLSX converter
│ │ ├── pptx.ts # PPTX converter
│ │ ├── html.ts # HTML converter
│ │ ├── txt.ts # TXT converter
│ │ └── index.ts # Converter exports
│ ├── types.ts # Type definitions
│ ├── index.ts # Main library export
│ └── cli.ts # CLI implementation
├── tests/ # Test files
├── dist/ # Compiled output
├── package.json
├── tsconfig.json
└── README.md
``- Classify - Intelligent document classification using LLMs (uses Transmutation Lite)
- Transmutation ⭐ - Recommended: Full-featured Rust document converter with 98x performance and 80%+ quality
- Vectorizer - Vector database and search engine (uses full Transmutation)
- Nexus - Graph database with vector search
MIT
This project follows the HiveLLM ecosystem standards:
1. TypeScript 5.x with strict mode
2. Tests with vitest (currently basic coverage)
3. Clear documentation
4. Semantic versioning
5. Conventional commits
Current Status: Production ready - 177/177 tests passing, fully documented, ready for npm publication.
See CHANGELOG.md for version history and release notes.
---
Contact: HiveLLM Development Team