Browser Context

A web content extraction library that pulls the main, readable content out of web pages and converts it to Markdown. Built for browser automation and content processing workflows.

Features

- Smart Content Extraction: Uses Defuddle and Mozilla Readability to extract the main content from web pages
- HTML to Markdown Conversion: Clean conversion with support for GitHub Flavored Markdown (GFM)
- Browser Integration: Works with Puppeteer and other browser automation tools
- Fallback Strategy: Automatically falls back to Readability when Defuddle extraction fails
- Customizable: Configurable tag removal and conversion options

Installation

```bash
pnpm install @agent-infra/browser-context
```

Usage

Extract clean content from a web page using Puppeteer:

```typescript
import { extractContent } from '@agent-infra/browser-context';
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/article');

// Extract content as Markdown
const result = await extractContent(page);
console.log(result.title); // Article title
console.log(result.content); // Clean Markdown content

await browser.close();
```

Call a specific extractor directly. Defuddle takes an HTML string and URL, while Readability takes a Puppeteer page:

```typescript
import {
  extractWithDefuddle,
  extractWithReadability,
} from '@agent-infra/browser-context';

// htmlString, url, and page are assumed to be defined by the caller

// Using Defuddle (primary method)
const result1 = await extractWithDefuddle(htmlString, url, {
  markdown: true,
});

// Using Readability (fallback method)
const result2 = await extractWithReadability(page, {
  markdown: true,
});
```

Convert HTML content to Markdown:

```typescript
import { toMarkdown } from '@agent-infra/browser-context';

const html = '<h1>Title</h1><p>Content with <strong>bold</strong> text</p>';
const markdown = toMarkdown(html, {
  gfmExtension: true, // Enable GitHub Flavored Markdown
  codeBlockStyle: 'fenced', // Use fenced code blocks
  headingStyle: 'atx', // Use # style headings
});

console.log(markdown);
// # Title
//
// Content with **bold** text
```

Pass custom conversion options, including additional tags to remove:

```typescript
import {
  toMarkdown,
  DEFAULT_TAGS_TO_REMOVE,
} from '@agent-infra/browser-context';

// htmlContent is assumed to be an HTML string defined by the caller
const options = {
  gfmExtension: true,
  codeBlockStyle: 'fenced' as const,
  headingStyle: 'atx' as const,
  emDelimiter: '*',
  strongDelimiter: '**',
  removeTags: [...DEFAULT_TAGS_TO_REMOVE, 'footer', 'nav'], // Remove additional tags
};

const markdown = toMarkdown(htmlContent, options);
```

API Reference

`extractContent(page)`

Main extraction function that automatically tries Defuddle first, then falls back to Readability.

Parameters:

- page: Puppeteer page instance

Returns:

- Promise<{title: string, content: string}>: Extracted title and Markdown content

`extractWithDefuddle(html, url, options)`

Extract content using the Defuddle library.

Parameters:

- html: HTML content string
- url: Page URL
- options: Defuddle configuration options

`extractWithReadability(page, options)`

Extract content using Mozilla's Readability algorithm.

Parameters:

- page: Puppeteer page instance
- options.markdown: Whether to convert to Markdown (default: false)

`toMarkdown(html, options)`

Convert HTML to Markdown format.

Parameters:

- html: HTML content string
- options: Conversion options

ToMarkdownOptions:

- gfmExtension: Enable GitHub Flavored Markdown (default: true)
- removeTags: Array of HTML tags to remove
- codeBlockStyle: 'indented' | 'fenced'
- headingStyle: 'setext' | 'atx'
- emDelimiter: Emphasis delimiter
- strongDelimiter: Strong emphasis delimiter
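The option list above can be read as a TypeScript shape. The following is a rough sketch only: the field names come from the list, but `ToMarkdownOptionsSketch` and `withDefaults` are hypothetical names, and the delimiter fields are typed as plain strings because the accepted values are not documented here.

```typescript
// Illustrative shape of the options listed above (not the library's own type).
interface ToMarkdownOptionsSketch {
  gfmExtension?: boolean;
  removeTags?: string[];
  codeBlockStyle?: 'indented' | 'fenced';
  headingStyle?: 'setext' | 'atx';
  emDelimiter?: string;
  strongDelimiter?: string;
}

// Only the gfmExtension default (true) is documented, so it is the only
// default applied here; every other field is passed through untouched.
function withDefaults(
  options: ToMarkdownOptionsSketch = {},
): ToMarkdownOptionsSketch {
  return { ...options, gfmExtension: options.gfmExtension ?? true };
}

const resolved = withDefaults({ headingStyle: 'atx' });
console.log(resolved.gfmExtension, resolved.headingStyle);
```

Keeping every field optional mirrors how the usage examples pass only the options they care about.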

Content Extraction Strategy

The library uses a two-tier extraction strategy:

1. Primary: Defuddle library for accurate content extraction
2. Fallback: Mozilla Readability algorithm when Defuddle fails

The fallback improves the chance of a successful extraction across different website structures.
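The two-tier strategy can be sketched as a small generic helper. This is a minimal illustration, not the library's internal code: `Extracted`, `Extractor`, and `extractWithFallback` are hypothetical names, and treating an empty result as a failure is an assumption.

```typescript
// Sketch only: a generic two-tier fallback with hypothetical names.
interface Extracted {
  title: string;
  content: string;
}

type Extractor = () => Promise<Extracted>;

async function extractWithFallback(
  primary: Extractor,
  fallback: Extractor,
): Promise<Extracted> {
  try {
    const result = await primary();
    // Assumption: an empty body counts as a failed extraction.
    if (result.content.trim().length > 0) {
      return result;
    }
  } catch {
    // Ignore the primary error and move on to the fallback.
  }
  return fallback();
}
```

With this library, `primary` would typically wrap `extractWithDefuddle` and `fallback` would wrap `extractWithReadability`. Errors thrown by the fallback are left to propagate so the caller can see that both tiers failed.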

Removed HTML Elements

By default, the following HTML elements are removed during Markdown conversion:

- script, style, link, head
- iframe, video, audio, canvas
- object, embed, noscript
- aside, dialog

You can customize this list using the `removeTags` option.
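When extending the default list, it can help to normalize and de-duplicate tag names before passing them as `removeTags`. The sketch below assumes nothing about the library beyond the list above: `defaultTags` is a local stand-in for the exported `DEFAULT_TAGS_TO_REMOVE`, and `mergeRemoveTags` is a hypothetical helper.

```typescript
// Local stand-in for DEFAULT_TAGS_TO_REMOVE, copied from the list above
// so that the sketch runs without the library installed.
const defaultTags: string[] = [
  'script', 'style', 'link', 'head',
  'iframe', 'video', 'audio', 'canvas',
  'object', 'embed', 'noscript',
  'aside', 'dialog',
];

// Merge extra tags into the defaults, lowercasing and dropping duplicates.
function mergeRemoveTags(defaults: string[], extra: string[]): string[] {
  const merged = [...defaults, ...extra].map((tag) => tag.toLowerCase());
  return Array.from(new Set(merged));
}

// 'ASIDE' duplicates a default entry, so only 'footer' and 'nav' are added.
const removeTags = mergeRemoveTags(defaultTags, ['footer', 'nav', 'ASIDE']);
console.log(removeTags);
```

The resulting array can be passed as the `removeTags` option of `toMarkdown`.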

Browser Compatibility

This library is designed to work with:

- Puppeteer
- Playwright
- Any browser automation tool that provides a Page-like interface
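What "Page-like" means is not spelled out here, so the following is only a guess at a minimal shape. Puppeteer and Playwright pages both expose `content()` and `url()`, but the library may rely on other methods too; `PageLike`, `snapshot`, and `stubPage` are hypothetical names.

```typescript
// Assumption: a minimal structural type that Puppeteer and Playwright
// pages both satisfy. The library's real requirements may differ.
interface PageLike {
  content(): Promise<string>;
  url(): string;
}

// Hypothetical helper: collect what an HTML-string extractor would need.
async function snapshot(
  page: PageLike,
): Promise<{ html: string; url: string }> {
  return { html: await page.content(), url: page.url() };
}

// A stub page, useful for tests that should not launch a real browser.
const stubPage: PageLike = {
  content: async () => '<html><body><h1>Title</h1></body></html>',
  url: () => 'https://example.com/article',
};
```

A structural type like this lets the same calling code accept pages from either tool without importing both.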

License

Apache License 2.0.

Credits

Special thanks to the open source projects that inspired this toolkit:

- readability - A standalone version of the readability lib
