- Returns raw search result metadata only.
- `getArticle(article: Article): Promise<EnrichedArticle>`
  - Fetches a single article by its URL and enriches it with a summary and image.
- `listAllArticles(maxItems?: number): Promise<Article[]>`
  - Returns an array of all article stubs (url/id/title). Use `maxItems` to stop early.
- `getAllArticles(maxItems?: number): Promise<EnrichedArticle[]>`
  - Returns enriched articles for all pages (may be slow on large wikis). Use `maxItems` to limit the total.
- `streamAllArticles(maxItems?: number): AsyncGenerator<EnrichedArticle>`
  - Async generator that yields enriched articles progressively; useful for streaming ingestion.
- `getArticleCount(): Promise<number>`
  - Returns the exact article count using MediaWiki site statistics.
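For orientation, here is a minimal sketch that ties these methods together, using the same `jojo` wiki as the examples below:

```ts
import { WikiaPull } from 'wikia-pull';

const wiki = new WikiaPull('jojo');

// How many articles does the wiki report?
console.log('total articles:', await wiki.getArticleCount());

// Grab one stub, then enrich it on demand
const [stub] = await wiki.listAllArticles(1);
const full = await wiki.getArticle(stub);
console.log(full.title, full.article?.slice(0, 80));
```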
Types
```ts
interface Article {
  url: string;
  id: string;
  title: string;
}

interface EnrichedArticle extends Article {
  img?: string;     // article image URL, if one was found
  article?: string; // article summary text, if available
}
```
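Both enrichment fields are optional, so downstream code should narrow them before use. A small sketch, assuming the package exports these types (if not, copy the declarations above):

```ts
// Assumption: wikia-pull exports the types shown above.
import type { EnrichedArticle } from 'wikia-pull';

// Type guard: keep only articles that actually carry summary text.
function hasText(a: EnrichedArticle): a is EnrichedArticle & { article: string } {
  return typeof a.article === 'string' && a.article.length > 0;
}

// Usage: const usable = (await wiki.getAllArticles(50)).filter(hasText);
```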
---
🧪 Testing & Examples
Example scripts are available in the `tests/` directory:
- `search.ts` – Fetches and prints enriched articles for a query.
- `searchResults.ts` – Prints raw search result metadata.
- `getArticles.ts` – Fetches and prints a single enriched article from search results.
- `allItems.ts` – Enumerates or streams all pages; handy for building RAG datasets.
- `articleCount.ts` – Prints the exact number of articles via site statistics.
- `streamToFiles.ts` – Streams enriched articles and writes each to a text file.
Quick start
```ts
import { WikiaPull } from 'wikia-pull';

const wiki = new WikiaPull('jojo');

// 1) Just list article stubs
const stubs = await wiki.listAllArticles(1000); // limit to 1000 for demo
console.log(stubs.length, stubs[0]);

// 2) Stream enriched content (preferred for large ingestions)
let count = 0;
for await (const article of wiki.streamAllArticles(100)) { // limit to 100 for demo
  // send to your vector store, files, etc.
  console.log(article.title, article.url);
  count++;
}
console.log('streamed', count);
```
Streaming to files
```ts
import { WikiaPull } from 'wikia-pull';
import * as fs from 'fs';
import * as path from 'path';

const wiki = new WikiaPull('jojo');
const outDir = './output';
if (!fs.existsSync(outDir)) fs.mkdirSync(outDir, { recursive: true });

let n = 0;
for await (const article of wiki.streamAllArticles(25)) {
  // Strip characters that are invalid in filenames, then cap the length
  const filename = article.title.replace(/[<>:"/\\|?*]/g, '_').replace(/\s+/g, '_').slice(0, 100);
  const filepath = path.join(outDir, `${filename}.txt`);
  const content = `Title: ${article.title}\nURL: ${article.url}\nID: ${article.id}\nImage: ${article.img || 'None'}\n\n${article.article || ''}`;
  fs.writeFileSync(filepath, content, 'utf8');
  console.log(`Saved #${++n}:`, filepath);
}
```
---
⚠️ Error Handling
- Throws an error if no articles are found for a query.
- Throws an error if a network request fails or an article URL is missing.
- Errors include the failing URL when possible, e.g. `HTTP error 404 while fetching https://…`.
- For streaming, consider a try/catch per item so transient failures are skipped and ingestion continues (see the sketch below).
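One way to apply that per-item pattern is to enumerate stubs first and enrich each one individually, so a single failed fetch only skips that article. A sketch, using only the methods documented above:

```ts
import { WikiaPull } from 'wikia-pull';

const wiki = new WikiaPull('jojo');
const stubs = await wiki.listAllArticles(100); // cap for the demo

const enriched = [];
for (const stub of stubs) {
  try {
    enriched.push(await wiki.getArticle(stub));
  } catch (err) {
    // Transient failure (network hiccup, missing page): log and move on.
    console.warn(`Skipping ${stub.url}:`, err instanceof Error ? err.message : err);
  }
}
console.log(`Enriched ${enriched.length}/${stubs.length} articles`);
```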
---
🙏 Credits
This project is a TypeScript implementation (with additional features) of
HermitPurple by GeopJr.
Inspired by @yimura/scraper.
Original license applies. See below.
---
📄 License
ISC License
Copyright (c) 2020 GeopJr
Rewritten in TypeScript as wikia-pull by grml