TypeScript DOM cleaning and structuring library
npm install iris-extract

A TypeScript library for cleaning and structuring DOM content, inspired by Unstructured. Built with Cheerio for fast, server-side HTML processing.
- DOM Cleaning: Remove scripts, styles, navigation, and other unwanted elements
- Semantic Structure: Classify elements as titles, paragraphs, lists, tables, etc.
- Table Extraction: Extract tables with headers and structured data
- Image Handling: Extract images with metadata and alt text
- Fast Processing: Built on Cheerio for efficient server-side HTML parsing
- Configurable: Flexible options for different use cases
- TypeScript: Full type safety and IDE support
## Installation

```bash
npm install unstructured-ts
```

## Quick Start

```typescript
import { partitionHtml } from 'unstructured-ts';

// Sample document: a heading, a paragraph, a list, and a table
const html = `
  <h1>Main Title</h1>
  <p>This is a paragraph with some content.</p>
  <ul>
    <li>First item</li>
    <li>Second item</li>
  </ul>
  <table>
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td>John</td><td>30</td></tr>
  </table>
`;

const result = partitionHtml(html);
console.log(result.elements);
// [
//   { type: 'Title', text: 'Main Title', ... },
//   { type: 'NarrativeText', text: 'This is a paragraph with some content.', ... },
//   { type: 'ListItem', text: 'First item', ... },
//   { type: 'ListItem', text: 'Second item', ... },
//   { type: 'Table', text: 'Name | Age\n--- | ---\nJohn | 30', rows: [['John', '30']], headers: ['Name', 'Age'], ... }
// ]
```

## Advanced Usage
### Custom Configuration

```typescript
import { DOMPartitioner } from 'unstructured-ts';

const partitioner = new DOMPartitioner({
  skipNavigation: true,       // Remove navigation elements
  skipHeaders: false,         // Keep header elements
  skipFooters: true,          // Remove footer elements
  skipForms: true,            // Remove form elements
  minTextLength: 15,          // Minimum text length to include
  extractTables: true,        // Extract table structure
  extractImages: true,        // Extract image elements
  includeImageAlt: true,      // Include alt text in image elements
  includeOriginalHtml: false  // Include original HTML in metadata
});

// Any HTML string works here
const html = '<h1>Main Title</h1><p>This is a paragraph with some content.</p>';
const result = partitioner.partition(html);
```

### Working with Elements
```typescript
import { partitionHtml, ElementType } from 'unstructured-ts';

// Any HTML string works here
const html = '<h1>Main Title</h1><table><tr><th>Name</th><th>Age</th></tr><tr><td>John</td><td>30</td></tr></table>';
const result = partitionHtml(html);

// Filter by element type
const titles = result.elements.filter(el => el.type === ElementType.TITLE);
const paragraphs = result.elements.filter(el => el.type === ElementType.NARRATIVE_TEXT);
const tables = result.elements.filter(el => el.type === ElementType.TABLE);

// Access table data
tables.forEach(table => {
  // Check the type before reading table-specific fields
  if (table.type === ElementType.TABLE) {
    console.log('Headers:', table.headers);
    console.log('Rows:', table.rows);
  }
});

// Access metadata
result.elements.forEach(element => {
  console.log(`${element.type}: ${element.text}`);
  console.log('Metadata:', element.metadata);
});
```

## Element Types
The library classifies DOM elements into semantic types:
- Title: Headings (h1-h6) and title-like content
- NarrativeText: Paragraphs and article content
- ListItem: List items and bullet points
- Text: Generic text content
- Table: Structured tabular data
- Image: Images with metadata
- Header/Footer: Page headers and footers
- Navigation: Navigation menus and links
- Form: Form elements and inputs
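
Because each element carries a `type` field, downstream code can branch on it. The sketch below illustrates that idea with local stand-in types rather than the library's own exports: the string values come from the example output earlier in this README, while `ElementKind`, `BaseElement`, `TableLike`, `isTable`, and `countByType` are hypothetical names introduced only for this illustration.

```typescript
// Local stand-ins for the library's types (assumed shapes, not imported).
type ElementKind = 'Title' | 'NarrativeText' | 'ListItem' | 'Text' | 'Table' | 'Image';

interface BaseElement {
  type: ElementKind;
  text: string;
}

interface TableLike extends BaseElement {
  type: 'Table';
  rows: string[][];
  headers?: string[];
}

type AnyElement = BaseElement | TableLike;

// Type guard: narrows an element to the table shape.
function isTable(el: AnyElement): el is TableLike {
  return el.type === 'Table';
}

// Count how many elements of each type a list contains.
function countByType(elements: AnyElement[]): Map<ElementKind, number> {
  const counts = new Map<ElementKind, number>();
  for (const el of elements) {
    counts.set(el.type, (counts.get(el.type) ?? 0) + 1);
  }
  return counts;
}

const sample: AnyElement[] = [
  { type: 'Title', text: 'Main Title' },
  { type: 'ListItem', text: 'First item' },
  { type: 'ListItem', text: 'Second item' },
  { type: 'Table', text: 'Name | Age', rows: [['John', '30']], headers: ['Name', 'Age'] },
];

const counts = countByType(sample);
const tables = sample.filter(isTable);
console.log(counts.get('ListItem'), tables[0]?.headers);
```

With the real package, the same pattern would presumably be written against `ElementType` and the exported element interfaces instead of these stand-ins.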
## API Reference

### partitionHtml
Convenience function to partition HTML content.
### DOMPartitioner
Main class for partitioning DOM content.
#### Constructor
```typescript
new DOMPartitioner(options?: PartitionOptions)
```

#### Methods

- `partition(html: string): PartitionResult` - Partition HTML content

### PartitionOptions
Configuration options for partitioning:

```typescript
interface PartitionOptions {
  skipNavigation?: boolean;      // Default: true
  skipHeaders?: boolean;         // Default: false
  skipFooters?: boolean;         // Default: false
  skipForms?: boolean;           // Default: true
  minTextLength?: number;        // Default: 10
  preserveWhitespace?: boolean;  // Default: false
  extractTables?: boolean;       // Default: true
  extractImages?: boolean;       // Default: true
  includeImageAlt?: boolean;     // Default: true
  includeOriginalHtml?: boolean; // Default: false
}
```
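
Every option is optional and has a documented default, so a caller-supplied object is presumably overlaid on those defaults. This README does not show how the library does that internally. The following is one common way to sketch it, with the option shape copied locally and `DEFAULT_OPTIONS` and `resolveOptions` as hypothetical names that are not part of the package.

```typescript
// Local copy of the documented option shape (not imported from the library).
interface PartitionOptions {
  skipNavigation?: boolean;
  skipHeaders?: boolean;
  skipFooters?: boolean;
  skipForms?: boolean;
  minTextLength?: number;
  preserveWhitespace?: boolean;
  extractTables?: boolean;
  extractImages?: boolean;
  includeImageAlt?: boolean;
  includeOriginalHtml?: boolean;
}

// Defaults transcribed from the "Default:" comments in the interface above.
const DEFAULT_OPTIONS: Required<PartitionOptions> = {
  skipNavigation: true,
  skipHeaders: false,
  skipFooters: false,
  skipForms: true,
  minTextLength: 10,
  preserveWhitespace: false,
  extractTables: true,
  extractImages: true,
  includeImageAlt: true,
  includeOriginalHtml: false,
};

// Hypothetical helper: overlay caller-supplied options on the defaults.
// Note: a key explicitly set to `undefined` would overwrite its default.
function resolveOptions(options: PartitionOptions = {}): Required<PartitionOptions> {
  return { ...DEFAULT_OPTIONS, ...options };
}

const resolved = resolveOptions({ skipFooters: true, minTextLength: 15 });
console.log(resolved.minTextLength, resolved.skipForms);
```

Only the keys passed in change. Everything else keeps the default value listed in the interface comments.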
### Element

Base element interface:

```typescript
interface Element {
  id: string;
  type: ElementType;
  text: string;
  metadata: ElementMetadata;
}
```
### TableElement

Extended element for tables:

```typescript
interface TableElement extends Element {
  type: ElementType.TABLE;
  rows: string[][];
  headers?: string[];
}
```
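
Since `rows` is an array of string arrays and `headers` is optional, a frequent next step is to pair each cell with its column name. A minimal sketch of that conversion follows. `TableLike` is a local stand-in that carries only the two documented fields, and `tableToRecords` and the `col0`, `col1`, ... fallback keys are assumptions made for this example, not library behavior.

```typescript
// Local stand-in for the documented TableElement fields used here.
interface TableLike {
  rows: string[][];
  headers?: string[];
}

// Hypothetical helper: turn each row into an object keyed by header.
// Falls back to "col0", "col1", ... when a header is missing.
function tableToRecords(table: TableLike): Record<string, string>[] {
  return table.rows.map(row => {
    const record: Record<string, string> = {};
    row.forEach((cell, i) => {
      const key = table.headers?.[i] ?? `col${i}`;
      record[key] = cell;
    });
    return record;
  });
}

const records = tableToRecords({
  headers: ['Name', 'Age'],
  rows: [['John', '30']],
});
console.log(records); // records[0].Name is 'John'
```

Cell values stay strings, as in the documented `rows` type. Parsing `'30'` into a number is left to the caller.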
### ImageElement

Extended element for images:

```typescript
interface ImageElement extends Element {
  type: ElementType.IMAGE;
  src?: string;
  alt?: string;
  width?: number;
  height?: number;
}
```

## Comparison with Unstructured

This library is inspired by the Python Unstructured library but is designed specifically for TypeScript/JavaScript environments:
| Feature | unstructured-ts | Unstructured Python |
|---------|-----------------|---------------------|
| DOM Processing | ✅ Cheerio-based | ✅ BeautifulSoup-based |
| Element Classification | ✅ Simplified | ✅ Comprehensive |
| Table Extraction | ✅ Basic structure | ✅ Advanced analysis |
| Multiple File Formats | ❌ HTML only | ✅ PDF, DOCX, etc. |
| OCR Support | ❌ | ✅ |
| Language | TypeScript | Python |
| Performance | Fast | Slower |
| Dependencies | Minimal | Heavy |
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT