This library converts HTML DOM to a semantic Markdown format optimized for use with Large Language Models (LLMs). It preserves the semantic structure of web content, extracts essential metadata, and reduces token usage compared to raw HTML, making it easier for LLMs to understand and process information.
- Semantic Structure Preservation: Retains the meaning of semantic HTML elements.
- Metadata Extraction: Captures important metadata such as title, description, keywords, Open Graph tags, Twitter Card tags, and JSON-LD data.
- Token Efficiency: Optimizes for token usage through URL refification (converting URLs to reference-style links) and concise representation of content.
- Main Content Detection: Automatically identifies and extracts the primary content section of a webpage.
- Table Column Tracking: Adds unique identifiers to table columns, improving an LLM's ability to correlate data across rows.
Installation
```bash
pnpm add @alloc/dom-to-semantic-markdown
```
Usage
```javascript
import { convertHtmlToMarkdown } from "@alloc/dom-to-semantic-markdown";
```
convertHtmlToMarkdown(html, options?)

Converts an HTML string to semantic Markdown.

- html: string: The HTML string to be converted.
- options?: ConversionOptions: Optional configuration object to customize the conversion process. See ConversionOptions for available settings.

Returns: string - The Markdown string representation of the HTML content.
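
As a rough illustration of how the options argument might be used, the sketch below wraps a converter with a set of LLM-oriented defaults. The option names come from the ConversionOptions list in this README; the helper name `convertForLlm`, the `llmDefaults` object, and the chosen default values are hypothetical. The converter is passed in as a parameter so the sketch runs without the package installed.

```javascript
// Hypothetical defaults for LLM-oriented conversion. The option names are
// taken from the ConversionOptions list in this README; the values are
// only one plausible choice.
const llmDefaults = {
  extractMainContent: true,
  includeMetaData: "basic",
  refifyUrls: true,
  enableTableColumnTracking: true,
};

// `convert` is expected to be convertHtmlToMarkdown (or any function with
// the same (html, options) => string shape).
function convertForLlm(convert, html, overrides = {}) {
  // Caller-supplied overrides win over the defaults.
  return convert(html, { ...llmDefaults, ...overrides });
}
```

In real use this would presumably be called as `convertForLlm(convertHtmlToMarkdown, html, { includeMetaData: "extended" })`.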
convertElementToMarkdown(element, options?)

Converts an HTML Element to semantic Markdown.

- element: Element: The HTML DOM Element to be converted. This allows you to convert specific parts of a document, not just the entire HTML string.
- options?: ConversionOptions: Optional configuration object to customize the conversion process. See ConversionOptions for available settings.

Returns: string - The Markdown string representation of the provided HTML Element and its descendants.
extractMetaData(element, mode?)

Extracts metadata from an HTML Element.

- element: Element: The HTML DOM Element to extract metadata from.
- mode?: 'basic' | 'extended': Optional mode to control the level of metadata extraction.
  - 'basic': Includes standard meta tags like title, description, and keywords.
  - 'extended': Includes basic meta tags, Open Graph tags, Twitter Card tags, and JSON-LD data.

Returns: SemanticMarkdownAST.MetaDataNode['content'] - An object containing the extracted metadata.
htmlToMarkdownAST(element, options?, indentLevel?)

Converts an HTML Element into a Semantic Markdown Abstract Syntax Tree (AST). This function recursively parses the HTML structure and generates a structured Markdown representation. It uses extractMetaData to extract metadata from the element.

- element: Element: The HTML DOM element to be converted.
- options?: ExtractOptions: Optional configuration to customize the extraction process. See ExtractOptions for details.
- indentLevel?: number: The current indentation level, used for nested elements like lists. Defaults to 0.

Returns: SemanticMarkdownAST.Node[] - An array of AST nodes representing the semantic Markdown structure of the input HTML element. This AST can then be rendered into a Markdown string using a separate rendering function.

This function is not intended for direct use in most cases. Use convertHtmlToMarkdown or convertElementToMarkdown for simpler HTML-to-Markdown conversion. However, understanding htmlToMarkdownAST is crucial for customizing or extending the library's functionality.
Rendering an AST to Markdown
Converts a Semantic Markdown Abstract Syntax Tree (AST) back into a Markdown string. This function takes the AST generated by htmlToMarkdownAST and renders it into a human-readable Markdown format.

- nodes: Node[]: An array of SemanticMarkdownAST nodes representing the Markdown content. This is typically the output of the htmlToMarkdownAST function.
- options?: RenderOptions: Optional configuration object to customize the rendering process. See RenderOptions for available settings.
- indentLevel?: number: The initial indentation level for the Markdown output. Used for nested structures like lists and blockquotes. Defaults to 0.

Returns: string - The Markdown string representation of the AST.

This function is essential for completing the HTML to Markdown conversion process. It takes the structured AST and transforms it into a flat, string-based Markdown output.
Types
ExtractOptions

- debug?: boolean: Enable debug logging.
- websiteDomain?: string: The domain of the website being converted.
- extractMainContent?: boolean: Whether to extract only the main content of the page.
- includeMetaData?: 'basic' | 'extended' | false: Controls whether to include metadata extracted from the HTML head.
  - 'basic': Includes standard meta tags like title, description, and keywords.
  - 'extended': Includes basic meta tags, Open Graph tags, Twitter Card tags, and JSON-LD data.
  - false: Disables metadata extraction.
- excludeTagNames?: string[]: Avoid extracting content from these tags.
- excludeInvisibleElements?: boolean: Whether to exclude elements that are not visible.
- enableTableColumnTracking?: boolean: Adds unique identifiers to table columns.
- overrideElementProcessing?: (element: Element, options: ConversionOptions, indentLevel: number) => SemanticMarkdownAST[] | undefined: Custom processing for HTML elements.
- processUnhandledElement?: (element: Element, options: ConversionOptions, indentLevel: number) => SemanticMarkdownAST[] | undefined: Handler for unknown HTML elements.
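
To make the overrideElementProcessing signature more concrete, here is a minimal sketch of a handler that turns `time` elements into a custom node. The node shape (`{ type: "custom", content }`), the `kind` field, and the idea that returning undefined hands the element back to the default processing are all assumptions, not confirmed behavior of the library. The function name `processTimeElement` is hypothetical.

```javascript
// Sketch of a handler matching the overrideElementProcessing signature:
// (element, options, indentLevel) => nodes | undefined.
// ASSUMPTION: custom nodes look like { type: "custom", content: ... }.
function processTimeElement(element, options, indentLevel) {
  // Element.tagName is upper-case in HTML documents, so normalize it.
  if (element.tagName.toLowerCase() !== "time") {
    // ASSUMPTION: undefined means "not handled here".
    return undefined;
  }
  return [
    {
      type: "custom",
      content: {
        kind: "time", // hypothetical discriminator for this custom node
        datetime: element.getAttribute("datetime"),
        label: element.textContent.trim(),
      },
    },
  ];
}
```

Such a handler would be passed as `{ overrideElementProcessing: processTimeElement }` in the options object.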
RenderOptions

- emitFrontMatter?: boolean: Include the metadata as “front matter” in the output.
- overrideNodeRenderer?: (node: SemanticMarkdownAST, options: ConversionOptions, indentLevel: number) => string | undefined: Custom renderer for AST nodes.
- renderCustomNode?: (node: CustomNode, options: ConversionOptions, indentLevel: number) => string | undefined: Renderer for custom AST nodes.
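
A renderCustomNode callback could look roughly like the sketch below. It assumes a hypothetical custom node whose `content` is `{ kind: "time", datetime, label }`; that shape, the function name `renderTimeNode`, and the meaning of an undefined return value (leave the node to another renderer) are illustrative assumptions only.

```javascript
// Sketch of a renderer matching the renderCustomNode signature:
// (node, options, indentLevel) => string | undefined.
// ASSUMPTION: node.content is { kind: "time", datetime, label }.
function renderTimeNode(node, options, indentLevel) {
  const content = node.content;
  if (!content || content.kind !== "time") {
    return undefined; // not a node this renderer knows about
  }
  // Keep the machine-readable timestamp next to the human-readable label.
  return content.datetime
    ? `${content.label} (${content.datetime})`
    : content.label;
}
```

It would be supplied as `{ renderCustomNode: renderTimeNode }` in the options object.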
ConversionOptions

- refifyUrls?: boolean: Whether to convert URLs to reference-style links.
- overrideDOMParser?: DOMParser: Custom DOMParser for Node.js environments.
- _Everything in ExtractOptions and RenderOptions_
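
To illustrate the general idea behind reference-style links, the sketch below rewrites inline Markdown links into standard reference links with a shared definition list. It is not the library's implementation, and the actual output format of refifyUrls may differ; the function name `refifyLinks` and the `refN` labels are made up for this example.

```javascript
// Illustrative only: replaces each distinct inline link URL with a short
// reference label and appends the label definitions at the end.
function refifyLinks(markdown) {
  const refs = new Map(); // url -> label
  const body = markdown.replace(
    /\[([^\]]*)\]\(([^)\s]+)\)/g,
    (match, text, url) => {
      // Reuse the label when the same URL appears more than once.
      if (!refs.has(url)) refs.set(url, `ref${refs.size}`);
      return `[${text}][${refs.get(url)}]`;
    }
  );
  const definitions = [...refs].map(([url, label]) => `[${label}]: ${url}`);
  return definitions.length ? `${body}\n\n${definitions.join("\n")}` : body;
}
```

Long URLs that repeat across a page are then spelled out only once, which is where the token savings would come from.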
SemanticMarkdownAST
SemanticMarkdownAST is a type-only namespace that defines the structure of the Markdown Abstract Syntax Tree (AST) used by this library. It encompasses various node types that represent different semantic elements in Markdown, allowing for a structured and programmatically accessible representation of Markdown content.
The namespace includes the following type definitions for different Markdown elements:
- BlockquoteNode: Represents blockquotes.
- BoldNode: Represents bold text.
- CodeNode: Represents code blocks and inline code.
- CustomNode: Represents custom, user-defined nodes.
- HeadingNode: Represents headings with levels from 1 to 6.
- ImageNode: Represents images.
- ItalicNode: Represents italic text.
- LinkNode: Represents hyperlinks.
- ListItemNode: Represents items in a list.
- ListNode: Represents ordered and unordered lists.
- MetaDataNode: Represents metadata extracted from the HTML head, including standard meta tags, Open Graph, Twitter Card, and JSON-LD.
- SemanticHtmlNode: Represents semantic HTML elements.
- StrikethroughNode: Represents strikethrough text.
- TableCellNode: Represents cells within a table.
- TableNode: Represents tables.
- TableRowNode: Represents rows within a table.
- TextNode: Represents plain text content.
- VideoNode: Represents video embeds.

Each of these node types defines a specific structure with properties relevant to the represented Markdown element, such as content, level (for headings), href (for links), etc. These types are used throughout the library to represent and manipulate Markdown content programmatically.
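
As a sketch of what working with such an AST programmatically might look like, the function below walks a node array and collects a heading outline. The `content` and `level` properties are mentioned above; the `type` discriminator, its `"heading"` value, and the assumption that `content` is either a string or an array of child nodes are guesses for illustration, and the helper names are hypothetical.

```javascript
// Flattens a node's content to plain text.
// ASSUMPTION: content is either a string or an array of child nodes.
function plainText(content) {
  if (typeof content === "string") return content;
  if (Array.isArray(content)) {
    return content.map((child) => plainText(child.content)).join("");
  }
  return "";
}

// Collects { level, text } for every heading, descending into containers.
// ASSUMPTION: heading nodes are tagged with type === "heading".
function collectHeadings(nodes, out = []) {
  for (const node of nodes) {
    if (node.type === "heading") {
      out.push({ level: node.level, text: plainText(node.content) });
    } else if (Array.isArray(node.content)) {
      collectHeadings(node.content, out);
    }
  }
  return out;
}
```

An outline like this could, for example, be sent to an LLM ahead of the full document.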
Using the Output with LLMs
The semantic Markdown produced by this library is optimized for use with Large Language Models (LLMs). To use it effectively:
1. Extract the Markdown content using the library.
2. Start with a brief instruction or context for the LLM.
3. Wrap the extracted Markdown in a fenced code block (triple backticks).
4. Follow the Markdown with your question or prompt.

Example:
````
The following is a semantic Markdown representation of a webpage. Please analyze its content:

```markdown
{paste your extracted markdown here}
```

{your question, e.g., "What are the main points discussed in this article?"}
````
This format helps the LLM understand its task and the context of the content, enabling more accurate and relevant responses to your questions.
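
The prompt layout described above can be assembled with a small helper, sketched here. The function names `buildPrompt` and `fenceFor` are hypothetical and not part of the library. Lengthening the fence when the Markdown itself contains backtick runs is an added precaution that goes beyond the steps listed above.

```javascript
const DEFAULT_INSTRUCTION =
  "The following is a semantic Markdown representation of a webpage. Please analyze its content:";

// Picks a backtick fence longer than any backtick run inside the text, so
// code blocks within the extracted Markdown cannot close the outer fence.
function fenceFor(text) {
  const runs = text.match(/`+/g) || [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  return "`".repeat(Math.max(3, longest + 1));
}

// Instruction first, fenced Markdown second, question last.
function buildPrompt(markdown, question, instruction = DEFAULT_INSTRUCTION) {
  const body = markdown.trim();
  const fence = fenceFor(body);
  return [instruction, "", `${fence}markdown`, body, fence, "", question].join("\n");
}
```

Here `markdown` would be the string returned by convertHtmlToMarkdown or convertElementToMarkdown.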