# Google News Scraper

A lightweight async package that scrapes article data from Google News. Simply pass a keyword or phrase, and the results are returned as an array of JSON objects.

* Installation
* Usage
* Output
* Config
* TypeScript
* CommonJS
* Performance
* Upkeep
* Bugs
* Contribute
* Python version
## Installation

```bash
# Install via NPM
npm install google-news-scraper
```

```bash
# Install via Yarn
yarn add google-news-scraper
```

## Usage
Simply import the package and pass a config object.
```javascript
import googleNewsScraper from 'google-news-scraper';

const articles = await googleNewsScraper({ searchTerm: "The Oscars" });
```

Full documentation on the config object can be found below.

## Output
The output is an array of JSON objects, with each article following the structure below:

```json
[
    {
        "title": "Article title",
        "link": "http://url-to-website.com/path/to/article",
        "image": "http://url-to-website.com/path/to/image.jpg",
        "source": "Name of publication",
        "datetime": "2024-05-13T08:02:22.000Z",
        "time": "Time/date published (human-readable)",
        "articleType": "String, one of ['regular' | 'topicFeatured' | 'topicSmall']"
    }
]
```

## Config
The config object passed to the function above has the following properties:

#### searchTerm

This is the search query you'd like to find articles for; simply pass the search string like so: `searchTerm: "The Oscars"`. The search term is no longer a required field, as hahagu added support for topic pages in #44. If neither `searchTerm` nor `baseUrl` is supplied, the scraper will just return results from the Google News homepage.

#### baseUrl
The `baseUrl` property enables you to specify an alternate base URL for your search. This is useful when you want to scrape, for example, a specific Google News topic, as in the sketch below. PLEASE NOTE: Using both a `baseUrl` that points to a topic AND a `searchTerm` is not advised, as the `searchTerm` will typically be ignored in favour of the topic in the `baseUrl`.

In the scraped URL, your `baseUrl` will be immediately followed by query parameters (eg: `?hl=en-US&gl=US&ceid=US`), so it doesn't matter whether your `baseUrl` has a trailing slash or not.

Defaults to `https://news.google.com/search`.
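A minimal sketch of scraping a topic page (the topic ID in the URL below is a made-up placeholder, not a real topic):

```javascript
import googleNewsScraper from 'google-news-scraper';

// Scrape a Google News topic page instead of running a search.
// NOTE: "CAAqBwgKMMeFnAsw9ZLOAw" is a hypothetical topic ID used for illustration.
const articles = await googleNewsScraper({
    baseUrl: "https://news.google.com/topics/CAAqBwgKMMeFnAsw9ZLOAw",
});
```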
#### prettyURLs

The URLs that Google News supplies for each article are "ugly" links (eg: "https://news.google.com/articles/CAIiEPgfWP_e7PfrSwLwvWeb5msqFwgEKg8IACoHCAowjuuKAzCWrzwwt4QY?hl=en-GB&gl=GB&ceid=GB%3Aen"). By default, the scraper will retrieve the actual "pretty" URL (eg: "https://www.nytimes.com/2020/01/22/movies/expanded-best-picture-oscar.html"). This is done using some base64 decoding, so the overhead is negligible. To prevent this default behaviour and retrieve the "ugly" links instead, pass `prettyURLs: false` to the config object, as shown below. Credit to anthonyfranc for the base64 decode fix.

Defaults to `true`.
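A minimal sketch of opting out of the decoding:

```javascript
import googleNewsScraper from 'google-news-scraper';

// Keep the raw Google News redirect links instead of decoding them.
const articles = await googleNewsScraper({
    searchTerm: "The Oscars",
    prettyURLs: false,
});
```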
#### timeframe
The results can be filtered to articles published within a given timeframe prior to the request. The format of the timeframe is a string comprised of a number followed by a letter representing the time operator. For example, `1y` would signify 1 year. Full list of operators below (see the sketch after this list):

* h = hours (eg: `12h`)
* d = days (eg: `7d`)
* m = months (eg: `6m`)
* y = years (eg: `1y`)

Defaults to `7d`.
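A minimal sketch of restricting results to the last 12 hours:

```javascript
import googleNewsScraper from 'google-news-scraper';

// Only return articles published within the last 12 hours.
const articles = await googleNewsScraper({
    searchTerm: "The Oscars",
    timeframe: "12h",
});
```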
#### getArticleContent
By default, the scraper does not return the article content, as this would require Puppeteer to navigate to each individual article in the results (increasing execution time significantly). If you would like to enable this behaviour and receive the content of each article, simply pass `getArticleContent: true` in the config. This will add two fields to each article in the output: `content` and `favicon`.

```json
[
    {
        "title": "Article title",
        "link": "https://url-to-website.com/path/to/article",
        "image": "https://url-to-website.com/path/to/image.jpg",
        "source": "Name of publication",
        "time": "Time/date published (human-readable)",
        "content": "The full text content of the article...",
        "favicon": "https://url-to-website.com/path/to/favicon.png"
    }
]
```

PLEASE NOTE: Due to the large number of variable factors to take into account, this feature fails on many websites. All errors are handled gracefully and will return an empty string as the content. Please ensure you handle such outcomes in your application.
Defaults to `false`.
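A minimal sketch of enabling content scraping and guarding against the empty-string content returned when extraction fails:

```javascript
import googleNewsScraper from 'google-news-scraper';

// Fetch the full text of each article (slower, as Puppeteer visits every link).
const articles = await googleNewsScraper({
    searchTerm: "The Oscars",
    getArticleContent: true,
});

// Extraction fails on some sites and yields an empty string, so filter those out.
const withContent = articles.filter((article) => article.content !== "");
```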
#### logLevel
You can customise the log level to any of the following (a sketch follows the list):

- `none`: No logs will be output at all.
- `error`: Only errors will be output to the log.
- `warn`: Errors and warnings will be output to the log.
- `info`: Info, errors and warnings will be output to the log.
- `verbose`: All of the above and potentially more. Currently there are no specifically verbose logs, but in future I may move some of the info logs to verbose and/or add some debugging info there.

Defaults to `error`.
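A minimal sketch of silencing the scraper entirely:

```javascript
import googleNewsScraper from 'google-news-scraper';

// Suppress all log output from the scraper.
const articles = await googleNewsScraper({
    searchTerm: "The Oscars",
    logLevel: "none",
});
```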
#### queryVars
An object of additional query params to add to the Google News URL string, formatted as key-value pairs. This can be useful if you want to search for articles in a specific language, for example:

```javascript
const articles = await googleNewsScraper({
    searchTerm: "Últimas noticias en Madrid",
    queryVars: {
        gl: "ES",
        ceid: "ES:es"
    },
});
```

Defaults to `null`.

#### puppeteerArgs
An array of Chromium flags to pass to the browser instance. By default, this will be an empty array. A full list of available flags can be found here. NB: if you are launching this in a Heroku app, you will need to pass the `--no-sandbox` and `--disable-setuid-sandbox` flags, as explained in this SO answer (see the sketch below).

Defaults to `[]`.
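A minimal sketch of passing the two flags mentioned above:

```javascript
import googleNewsScraper from 'google-news-scraper';

// Flags commonly required on sandboxed hosts such as Heroku.
const articles = await googleNewsScraper({
    searchTerm: "The Oscars",
    puppeteerArgs: ["--no-sandbox", "--disable-setuid-sandbox"],
});
```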
#### puppeteerHeadlessMode
Whether or not Puppeteer should run in headless mode. Running in headless mode increases performance by approximately 30% (credit to ole-ve for finding this). If you're not sure about this setting, leave it as it is.
Defaults to `true`.
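A minimal sketch of disabling headless mode, e.g. to watch the browser while debugging:

```javascript
import googleNewsScraper from 'google-news-scraper';

// Run the browser with a visible window (slower, but easier to debug).
const articles = await googleNewsScraper({
    searchTerm: "The Oscars",
    puppeteerHeadlessMode: false,
});
```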
#### limit
The total number of articles that you would like to be returned. Please note that with higher numbers, the actual returned number may be lower. Typically the max is 99, but it varies depending on many variables in Puppeteer (such as rate limiting, network conditions, etc.).

Defaults to `99`.
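A minimal sketch of capping the result set:

```javascript
import googleNewsScraper from 'google-news-scraper';

// Ask for at most 10 articles.
const articles = await googleNewsScraper({
    searchTerm: "The Oscars",
    limit: 10,
});
```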
## TypeScript
Google News Scraper includes full TypeScript definitions. Your IDE should pick the types up automatically, but if not you can find them in the `dist/tsc/` folder.

## CommonJS
Google News Scraper is built to work as an ESM module out of the box, but it also works as a CommonJS module; just use `require` instead of `import`:
```javascript
const googleNewsScraper = require('google-news-scraper');

// CommonJS has no top-level await, so wrap the call in an async function.
(async () => {
    const articles = await googleNewsScraper({ searchTerm: "The Oscars" });
})();
```

Please report bugs via the issue tracker.