Convert DOCX and PDF files to clean semantic HTML
Installation
```bash
npm install file2html
```
Usage
Convert DOCX to HTML
```javascript
import { docxToHtml } from 'file2html';

const html = await docxToHtml('document.docx');
console.log(html);
```
Convert PDF to HTML
```javascript
import { pdfToHtml } from 'file2html';

const html = await pdfToHtml('document.pdf');
console.log(html);
```
Complete Example
```javascript
import { docxToHtml, pdfToHtml } from 'file2html';

async function convertFiles() {
  try {
    // Convert DOCX file
    const docxHtml = await docxToHtml('sample.docx');
    console.log('DOCX HTML:', docxHtml);

    // Convert PDF file
    const pdfHtml = await pdfToHtml('sample.pdf');
    console.log('PDF HTML:', pdfHtml);
  } catch (error) {
    console.error('Conversion failed:', error.message);
  }
}

convertFiles();
```
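When a script handles a mix of files, it can help to choose the converter from the file extension. The helper below is a minimal sketch of that idea. `pickConverter` is a hypothetical name and is not part of file2html; it only assumes that the converters are passed in as an object keyed by extension.

```javascript
// Hypothetical helper (not provided by file2html): picks a converter
// function based on the file extension, case-insensitively.
function pickConverter(filePath, converters) {
  const dot = filePath.lastIndexOf('.');
  const ext = dot === -1 ? '' : filePath.slice(dot).toLowerCase();
  const convert = converters[ext];
  if (!convert) {
    throw new Error(`Unsupported file type: ${ext || '(none)'}`);
  }
  return convert;
}

// Usage, assuming docxToHtml and pdfToHtml are imported from 'file2html':
//   const convert = pickConverter('report.PDF', {
//     '.docx': docxToHtml,
//     '.pdf': pdfToHtml,
//   });
//   const html = await convert('report.PDF');
```

Passing the converters in as an argument keeps the helper independent of the library, so it can be exercised with stubs.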
API Reference
`docxToHtml(filePath)`
Converts a DOCX file to semantic HTML.
Parameters:
- `filePath` (string): Path to the DOCX file
Returns:
- `Promise<string>`: Clean semantic HTML
Features:
- Converts paragraphs to `<p>` tags
- Maps bold text to `<strong>` tags
- Maps italic text to `<em>` tags
- Converts tables to `<table>`, `<tr>`, `<td>` structure
- Detects headings based on paragraph styles and converts to `<h1>`, `<h2>`, etc.
- Extracts images and converts to `<img>` tags with base64 data URLs
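To make the mappings above concrete, here is a rough sketch of how formatted text runs and image bytes could be turned into HTML. It is not the library's implementation: the run object shape (`text`, `bold`, `italic`) and all function names are assumptions made for illustration, as is the choice of `<strong>` and `<em>` for bold and italic.

```javascript
// Illustrative only: a simplified model, not file2html's internals.

// Escape characters that would otherwise be parsed as markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Wrap one text run in semantic tags according to its formatting flags.
function runToHtml({ text, bold = false, italic = false }) {
  let html = escapeHtml(text);
  if (italic) html = `<em>${html}</em>`;
  if (bold) html = `<strong>${html}</strong>`;
  return html;
}

// Join a paragraph's runs inside a single <p> element.
function paragraphToHtml(runs) {
  return `<p>${runs.map(runToHtml).join('')}</p>`;
}

// Encode raw image bytes as a base64 data URL for an <img> src attribute.
function imageToDataUrl(bytes, mimeType) {
  return `data:${mimeType};base64,${Buffer.from(bytes).toString('base64')}`;
}
```

For example, `paragraphToHtml([{ text: 'Hello ' }, { text: 'world', bold: true }])` yields `<p>Hello <strong>world</strong></p>`.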
`pdfToHtml(filePath)`
Converts a PDF file to semantic HTML.
Parameters:
- `filePath` (string): Path to the PDF file
Returns:
- `Promise<string>`: Clean semantic HTML
Features:
- Groups text into paragraphs
- Detects headings based on text patterns and formatting
- Converts to semantic HTML structure
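The README does not spell out the heading and paragraph rules, so the following is only a sketch of one plausible heuristic, not what file2html actually does. It assumes the extracted text separates paragraphs with blank lines, and treats a short block that starts with a capital letter or digit and has no trailing punctuation as a heading. The function name, the 60-character limit, and the fixed `<h2>` level are all hypothetical.

```javascript
// Hypothetical heuristic; file2html's actual rules may differ.
function textToHtml(text) {
  const escape = (s) =>
    s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');

  // Blank lines separate blocks; line breaks inside a block become spaces.
  const blocks = text
    .split(/\n\s*\n/)
    .map((block) => block.replace(/\s+/g, ' ').trim())
    .filter(Boolean);

  return blocks
    .map((block) => {
      const looksLikeHeading =
        block.length <= 60 &&
        /^[A-Z0-9]/.test(block) &&
        !/[.!?:;,]$/.test(block);
      return looksLikeHeading
        ? `<h2>${escape(block)}</h2>`
        : `<p>${escape(block)}</p>`;
    })
    .join('\n');
}
```

A text-only heuristic like this cannot see font size or weight, which is why real converters usually combine it with formatting information when that is available.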
Output Format
The library generates clean, semantic HTML without inline styles:
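```html
<h1>Document Title</h1>
<p>This is a paragraph with <strong>bold text</strong> and <em>italic text</em>.</p>
<h2>Section Heading</h2>
<table>
  <tr>
    <td>Cell 1</td>
    <td>Cell 2</td>
  </tr>
</table>
```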
Error Handling
Both functions throw errors for:
- Non-existent files
- Invalid file formats
- Corrupted files
- Permission issues
```javascript
try {
  const html = await docxToHtml('invalid-file.docx');
} catch (error) {
  console.error('Conversion failed:', error.message);
}
```
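If you want to tell these failure cases apart, one option is to inspect the error before reporting it. The sketch below assumes that file-system failures reach your `catch` block with Node's standard `code` property (`ENOENT`, `EACCES`, `EPERM`). The README does not confirm that file2html preserves this property, so treat it as an assumption; `describeConversionError` is a hypothetical helper.

```javascript
// Assumption: file-system errors carry Node's standard `code` property.
// file2html may wrap errors differently, in which case only the default
// branch applies.
function describeConversionError(error) {
  switch (error && error.code) {
    case 'ENOENT':
      return 'File not found';
    case 'EACCES':
    case 'EPERM':
      return 'Permission denied';
    default: {
      const detail = error && error.message ? error.message : String(error);
      return `Conversion failed: ${detail}`;
    }
  }
}

// Usage inside a catch block:
//   console.error(describeConversionError(error));
```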
Development
Build
```bash
npm run build
```
Test
```bash
npm test
```
Test in Watch Mode
```bash
npm run test:watch
```
Dependencies
- `adm-zip`: For extracting DOCX files
- `fast-xml-parser`: For parsing WordprocessingML XML
- `pdf-parse`: For extracting text from PDF files
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.