Web content extraction library for Playwright - extracts content from web pages and converts to Markdown
npm install @leibniz/extractorWeb 内容提取库,将网页内容智能转换为 Markdown。
``bash`
npm install @leibniz/extractor
`typescript
import { ClipperService } from '@leibniz/extractor';
const clipper = new ClipperService({ cwd: './output' });
// 从 HTML 提取并保存
const result = await clipper.clip(html, 'https://example.com/article');
console.log(result.markdownPath); // ./output/example.com/文章标题.md
console.log(result.imagesSaved); // 5
`
主要服务类,提供完整的网页内容提取和保存功能。
`typescript
import { ClipperService } from '@leibniz/extractor';
const clipper = new ClipperService({ cwd: '/path/to/output' });
`
#### clip(html, url) - 提取并保存
`typescript
const result = await clipper.clip(html, url);
// result: {
// success: boolean,
// markdownPath: string, // Markdown 文件路径
// assetsDir: string, // 图片资源目录
// imagesSaved: number // 保存的图片数量
// }
`
#### clipToMarkdown(html, url) - 仅提取不保存
`typescript
const result = clipper.clipToMarkdown(html, url);
// result: {
// success: boolean,
// markdown: string, // Markdown 内容
// title: string, // 页面标题
// images: ExtractedImage[] // 图片列表(未下载)
// }
`
#### clipWithPlaywright(url, options?) - 使用 Playwright 抓取
`typescript`
const result = await clipper.clipWithPlaywright(url, {
waitFor: 'networkidle', // 等待策略
waitForSelector: '[data-ready]', // 等待特定元素
cookies: [...], // 登录态 cookies
headers: { ... }, // 自定义请求头
headless: true // 无头模式
});
`typescript
import { clipFromHtml, extractContent } from '@leibniz/extractor';
// 从 HTML 字符串提取
const result = clipFromHtml(html, url);
// 从 Document 对象提取(适用于 Playwright)
const result = extractContent(document, url);
`
`typescript
import { hasSlateEditor, SlateWikiExtractor } from '@leibniz/extractor';
// 检测页面是否包含 Slate 编辑器
if (hasSlateEditor(document)) {
const result = SlateWikiExtractor.extract(document);
}
`
`typescript
import { chromium } from 'playwright';
import { ClipperService } from '@leibniz/extractor';
const clipper = new ClipperService({ cwd: './content' });
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/article');
const html = await page.content();
const result = await clipper.clip(html, page.url());
console.log(✓ ${result.markdownPath});
await browser.close();
`
| 提取器 | 检测方式 | 适用场景 |
|--------|----------|----------|
| SlateWiki | CSS [data-slate-editor] | Slate 编辑器页面 |.deep-research-panel
| GeminiDeepResearch | CSS | Gemini 深度研究报告 |
| Default | 自动 | 通用网页 (Readability) |
```
${cwd}/
├── example.com/
│ ├── 文章标题.md
│ └── assets/
│ └── 2026_01_23_10_50_23/
│ └── image_0.png
MIT