file2html

Convert DOCX and PDF files to clean semantic HTML.

Features

- DOCX Support: Convert Microsoft Word documents to semantic HTML
- PDF Support: Convert PDF files to semantic HTML
- Clean Output: Generates semantic HTML with proper tags for headings, paragraphs, lists, tables, and inline emphasis
- No Inline Styles: Output is clean HTML without inline styles or absolute positioning
- TypeScript Support: Full TypeScript definitions included
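
One way to spot-check the "No Inline Styles" claim on converted output is a small guard in your own tests. The sketch below is an assumption-laden illustration: `hasInlineStyles` is a hypothetical helper, not part of file2html, and the regular expression is a rough heuristic rather than a full HTML parser.

```javascript
// Hypothetical helper (not provided by file2html): returns true when an HTML
// string contains a `style` attribute inside an opening tag.
function hasInlineStyles(html) {
  // "<" + tag name, then any characters up to (but not past) ">",
  // then whitespace followed by `style=`.
  return /<[a-z][^>]*\sstyle\s*=/i.test(html);
}

// Example inputs written by hand for illustration, not real library output.
console.log(hasInlineStyles('<p style="color: red">Hello</p>'));
console.log(hasInlineStyles('<p>Hello</p>'));
```

A check like this can be run against the string returned by `docxToHtml` or `pdfToHtml` to fail a test if styling ever leaks into the markup.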

Installation

```bash
npm install file2html
```

Usage

Convert a DOCX file

```javascript
import { docxToHtml } from 'file2html';

const html = await docxToHtml('document.docx');
console.log(html);
```

Convert a PDF file

```javascript
import { pdfToHtml } from 'file2html';

const html = await pdfToHtml('document.pdf');
console.log(html);
```

Complete example

```javascript
import { docxToHtml, pdfToHtml } from 'file2html';

async function convertFiles() {
  try {
    // Convert DOCX file
    const docxHtml = await docxToHtml('sample.docx');
    console.log('DOCX HTML:', docxHtml);

    // Convert PDF file
    const pdfHtml = await pdfToHtml('sample.pdf');
    console.log('PDF HTML:', pdfHtml);
  } catch (error) {
    console.error('Conversion failed:', error.message);
  }
}

convertFiles();
```

API Reference

docxToHtml(filePath)

Converts a DOCX file to semantic HTML.

Parameters:

- `filePath` (string): Path to the DOCX file

Returns:

- `Promise<string>`: Clean semantic HTML

Features:

- Converts paragraphs to `<p>` tags
- Maps bold text to `<strong>` tags
- Maps italic text to `<em>` tags
- Converts tables to `<table>`, `<tr>`, `<td>` structure
- Detects headings based on paragraph styles and converts to `<h1>`, `<h2>`, etc.
- Extracts images and converts them to `<img>` tags with base64 data URLs

pdfToHtml(filePath)

Converts a PDF file to semantic HTML.

Parameters:

- `filePath` (string): Path to the PDF file

Returns:

- `Promise<string>`: Clean semantic HTML

Features:

- Groups text into paragraphs
- Detects headings based on text patterns and formatting
- Converts to semantic HTML structure

Output Format

The library generates clean, semantic HTML without inline styles:

```html
<h1>Document Title</h1>
<p>This is a paragraph with <strong>bold text</strong> and <em>italic text</em>.</p>
<h2>Section Heading</h2>
<ul>
  <li>List item 1</li>
  <li>List item 2</li>
</ul>
<table>
  <tr>
    <td>Cell 1</td>
    <td>Cell 2</td>
  </tr>
</table>
```

Error Handling

Both functions throw errors for:

- Non-existent files
- Invalid file formats
- Corrupted files
- Permission issues

```javascript
try {
  const html = await docxToHtml('invalid-file.docx');
} catch (error) {
  console.error('Conversion failed:', error.message);
}
```

Development

Build

```bash
npm run build
```

Run tests

```bash
npm test
```

Run tests in watch mode

```bash
npm run test:watch
```

Dependencies

- `adm-zip`: For extracting DOCX files
- `fast-xml-parser`: For parsing WordprocessingML XML
- `pdf-parse`: For extracting text from PDF files
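
The usage and error-handling notes above can be combined into a small batch helper. This is a minimal sketch, not part of the library: `pickConverter` and `convertAll` are hypothetical names, and the converters are passed in as a plain object so the only assumption about file2html is the documented shape `(filePath) => Promise<string>`.

```javascript
// Hypothetical helper: choose a converter function by file extension.
// `converters` maps a lowercase extension to a function
// (filePath) => Promise<string>, e.g. { '.docx': docxToHtml, '.pdf': pdfToHtml }.
function pickConverter(filePath, converters) {
  const dot = filePath.lastIndexOf('.');
  const ext = dot === -1 ? '' : filePath.slice(dot).toLowerCase();
  const convert = converters[ext];
  if (!convert) {
    throw new Error(`Unsupported file type: ${filePath}`);
  }
  return convert;
}

// Hypothetical helper: convert each file in turn and record failures
// instead of stopping at the first error.
async function convertAll(filePaths, converters) {
  const results = [];
  for (const filePath of filePaths) {
    try {
      const convert = pickConverter(filePath, converters);
      const html = await convert(filePath);
      results.push({ filePath, ok: true, html });
    } catch (error) {
      // Missing, corrupted, or unsupported files end up here.
      results.push({ filePath, ok: false, error: error.message });
    }
  }
  return results;
}
```

With the library installed, a call might look like `await convertAll(['sample.docx', 'sample.pdf'], { '.docx': docxToHtml, '.pdf': pdfToHtml })`. Collecting per-file results keeps one bad input from aborting the whole batch, at the cost of the caller having to inspect each `ok` flag.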

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Package

- Name: file-to-html-converter (v1.0.0, TypeScript)
- Description: Convert DOCX and PDF files to clean semantic HTML
- Keywords: docx, pdf, html, conversion, semantic, parser
- License: MIT
- Unpacked size: 33.3 KB
- Published by: danielnsayensu
- Install from npm: `npm install file-to-html-converter`