Get browser context for AI agents.
A web content extraction library that pulls clean, readable content out of web pages and converts it to Markdown. Built for browser automation and content processing workflows.
- Smart Content Extraction: Uses advanced algorithms (Defuddle + Mozilla Readability) to extract main content from web pages
- HTML to Markdown Conversion: Clean conversion with support for GitHub Flavored Markdown (GFM)
- Browser Integration: Works seamlessly with Puppeteer and other browser automation tools
- Fallback Strategy: Automatically falls back to Readability when primary extraction fails
- Customizable: Configurable tag removal and conversion options
```bash
pnpm install @agent-infra/browser-context
```
Extract clean content from a web page using Puppeteer:
```typescript
import { extractContent } from '@agent-infra/browser-context';
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/article');

// Extract content as Markdown
const result = await extractContent(page);
console.log(result.title); // Article title
console.log(result.content); // Clean Markdown content

await browser.close();
```
Extract content from HTML strings:
```typescript
import {
  extractWithDefuddle,
  extractWithReadability,
} from '@agent-infra/browser-context';

// Assumes htmlString, url, and page (a Puppeteer page) are already defined.

// Using Defuddle (primary method)
const result1 = await extractWithDefuddle(htmlString, url, {
  markdown: true,
});

// Using Readability (fallback method)
const result2 = await extractWithReadability(page, {
  markdown: true,
});
```
Convert HTML content to Markdown:
```typescript
import { toMarkdown } from '@agent-infra/browser-context';

const html = '<h1>Title</h1><p>Content with <strong>bold</strong> text</p>';
const markdown = toMarkdown(html);

console.log(markdown);
// # Title
//
// Content with **bold** text
```
Pass an options object to customize the conversion:
```typescript
import {
  toMarkdown,
  DEFAULT_TAGS_TO_REMOVE,
} from '@agent-infra/browser-context';

const options = {
  gfmExtension: true,
  codeBlockStyle: 'fenced' as const,
  headingStyle: 'atx' as const,
  emDelimiter: '*',
  strongDelimiter: '**',
  removeTags: [...DEFAULT_TAGS_TO_REMOVE, 'footer', 'nav'], // Remove additional tags
};

// Assumes htmlContent is an HTML string defined elsewhere.
const markdown = toMarkdown(htmlContent, options);
```
`extractContent(page)`: Main extraction function. It tries Defuddle first, then falls back to Readability.
Parameters:
- `page`: Puppeteer page instance
Returns:
- `Promise<{title: string, content: string}>`: Extracted title and Markdown content
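Because the function resolves to a plain `{ title, content }` object, the result is easy to post-process before handing it to an agent. The following is a minimal sketch and not part of the library: `ExtractResult`, `toAgentContext`, and the `maxChars` default are hypothetical names and values chosen for illustration.

```typescript
// Result shape as documented above: a title plus Markdown content.
interface ExtractResult {
  title: string;
  content: string;
}

// Hypothetical helper: build one context string from the result,
// truncating long content to at most maxChars characters.
function toAgentContext(result: ExtractResult, maxChars = 4000): string {
  const body =
    result.content.length > maxChars
      ? result.content.slice(0, maxChars) + '\n\n[truncated]'
      : result.content;
  return `# ${result.title}\n\n${body}`;
}
```

In practice you would pass the value returned by `extractContent(page)` to such a helper; the truncation limit would depend on how much context your agent can accept.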
`extractWithDefuddle(html, url, options)`: Extract content using the Defuddle library.
Parameters:
- `html`: HTML content string
- `url`: Page URL
- `options`: Defuddle configuration options
`extractWithReadability(page, options)`: Extract content using Mozilla's Readability algorithm.
Parameters:
- `page`: Puppeteer page instance
- `options.markdown`: Whether to convert to Markdown (default: false)
`toMarkdown(html, options)`: Convert HTML to Markdown format.
Parameters:
- `html`: HTML content string
- `options`: Conversion options
ToMarkdownOptions:
- `gfmExtension`: Enable GitHub Flavored Markdown (default: true)
- `removeTags`: Array of HTML tags to remove
- `codeBlockStyle`: 'indented' | 'fenced'
- `headingStyle`: 'setext' | 'atx'
- `emDelimiter`: Emphasis delimiter
- `strongDelimiter`: Strong emphasis delimiter
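The option list above can be written down as a type. This is only a sketch derived from the descriptions in this README: `ToMarkdownOptionsSketch` and `withDocumentedDefaults` are hypothetical names, the delimiter fields are assumed to be plain strings, and the only default applied is the one documented here (`gfmExtension: true`). The package's real type may differ.

```typescript
// Hypothetical mirror of the documented options; every field is optional.
interface ToMarkdownOptionsSketch {
  gfmExtension?: boolean;
  removeTags?: string[];
  codeBlockStyle?: 'indented' | 'fenced';
  headingStyle?: 'setext' | 'atx';
  emDelimiter?: string;
  strongDelimiter?: string;
}

// Apply the single default stated in this README and leave the rest as given.
function withDocumentedDefaults(
  options: ToMarkdownOptionsSketch = {},
): ToMarkdownOptionsSketch {
  return { ...options, gfmExtension: options.gfmExtension ?? true };
}

const opts = withDocumentedDefaults({ headingStyle: 'atx' });
```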
The library uses a smart two-tier extraction strategy:
1. Primary: Defuddle library for accurate content extraction
2. Fallback: Mozilla Readability algorithm when Defuddle fails
Falling back this way improves the chance of a successful extraction across different website structures.
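The two-tier strategy can be illustrated with a generic fallback wrapper. This is a sketch of the pattern, not the library's implementation: `Extractor` and `extractWithFallback` are hypothetical names, and treating a thrown error or empty content as "failure" is an assumption made for the example.

```typescript
// Result shape as described in this README.
interface ExtractResult {
  title: string;
  content: string;
}

// Hypothetical extractor signature used only for this illustration.
type Extractor = (html: string) => Promise<ExtractResult>;

// Try the primary extractor; use the fallback when the primary throws
// or returns content that is empty after trimming whitespace.
async function extractWithFallback(
  html: string,
  primary: Extractor,
  fallback: Extractor,
): Promise<ExtractResult> {
  try {
    const result = await primary(html);
    if (result.content.trim().length > 0) {
      return result;
    }
  } catch {
    // Primary extractor failed; continue to the fallback below.
  }
  return fallback(html);
}
```

Errors thrown by the fallback itself are deliberately not caught, so the caller still sees a failure when both tiers fail.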
By default, the following HTML elements are removed during Markdown conversion:
- `script`, `style`, `link`, `head`
- `iframe`, `video`, `audio`, `canvas`
- `object`, `embed`, `noscript`
- `aside`, `dialog`
You can customize this list using the `removeTags` option.
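As a sketch of how a custom list might be assembled, the helper below extends the defaults while keeping selected tags in the output. `DEFAULT_TAGS` is a local copy of the list above so that the example is self-contained (real code would import `DEFAULT_TAGS_TO_REMOVE` from the package), and `buildRemoveTags` is a hypothetical helper, not a library function.

```typescript
// Local copy of the default list from this README, for illustration only.
const DEFAULT_TAGS: readonly string[] = [
  'script', 'style', 'link', 'head',
  'iframe', 'video', 'audio', 'canvas',
  'object', 'embed', 'noscript',
  'aside', 'dialog',
];

// Hypothetical helper: add extra tags to the defaults, drop duplicates,
// and exclude any tags the caller wants to keep in the output.
function buildRemoveTags(extra: string[], keep: string[] = []): string[] {
  const keepLower = keep.map((t) => t.toLowerCase());
  const merged = [...DEFAULT_TAGS, ...extra].map((t) => t.toLowerCase());
  return merged.filter(
    (tag, i) => merged.indexOf(tag) === i && keepLower.indexOf(tag) === -1,
  );
}

// Also remove footer and nav, but keep video elements.
const removeTags = buildRemoveTags(['footer', 'nav', 'aside'], ['video']);
```

The resulting array is meant to be passed as the `removeTags` option of `toMarkdown`.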
This library is designed to work with:
- Puppeteer
- Playwright
- Any browser automation tool that provides a Page-like interface
Apache License 2.0.
Special thanks to the open source projects that inspired this toolkit:
- readability - A standalone version of the readability lib