🍨 High-fidelity, browser-based, single-page web archiving library and CLI.
Use it in the terminal...
```bash
scoop "https://lil.law.harvard.edu"
```

... or in your Node.js project
```javascript
import { Scoop } from '@harvard-lil/scoop'

const capture = await Scoop.capture('https://lil.law.harvard.edu')
const wacz = await capture.toWACZ()
```
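A capture is usually written to disk once it completes. The sketch below shows one way to prepare that step. `captureToWaczBuffer` is a hypothetical helper name, and the code assumes that `toWACZ()` resolves to an `ArrayBuffer` (or other binary data accepted by `Buffer.from`).

```javascript
// Hypothetical helper: turns a finished capture into a Node.js Buffer holding the WACZ bytes.
// Assumption: `capture.toWACZ()` resolves to an ArrayBuffer or similar binary data.
async function captureToWaczBuffer (capture) {
  const wacz = await capture.toWACZ()
  return Buffer.from(wacz)
}

// Usage sketch with a real capture (needs @harvard-lil/scoop and node:fs/promises):
//   const capture = await Scoop.capture('https://lil.law.harvard.edu')
//   await writeFile('archive.wacz', await captureToWaczBuffer(capture))
```

Keeping the conversion in its own function makes it possible to try it with a stub object before running a real capture.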
---
Scoop is a high-fidelity, browser-based web archiving capture engine for witnessing the web, built by the Harvard Library Innovation Lab.
Fine-tune this custom web capture software to create robust single-page captures of the internet with accurate and complete provenance information.
With extensive options for asset formats and inclusions, Scoop will create .warc, .warc.gz or .wacz files to be stored by users and replayed using the web archive replay software of their choosing.
Scoop also comes with built-in support for the WACZ Signing and Verification specification, allowing users to cryptographically sign their captures.
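When scripting captures, it can help to derive the output file name from the chosen format. The sketch below pairs the format names listed by `scoop --help` with the file extensions mentioned above. `outputFilename` and the lookup table are illustrative assumptions, not part of Scoop's API.

```javascript
// Illustrative mapping between Scoop's --format choices and file extensions.
// Assumption: "wacz-with-raw" is still saved with a .wacz extension.
const EXTENSION_BY_FORMAT = new Map([
  ['warc', '.warc'],
  ['warc-gzipped', '.warc.gz'],
  ['wacz', '.wacz'],
  ['wacz-with-raw', '.wacz']
])

// Hypothetical helper: builds an output file name for a given format.
function outputFilename (basename, format = 'wacz') {
  const extension = EXTENSION_BY_FORMAT.get(format)
  if (!extension) {
    throw new Error(`Unknown format: ${format}`)
  }
  return `${basename}${extension}`
}

console.log(outputFilename('my-collection/lil', 'warc-gzipped'))
```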
More info:
- "Witnessing the web is hard: Why and how we built the Scoop web archiving capture engine 🍨"
  April 13 2023 - _lil.law.harvard.edu_
- "New Release: High Fidelity Capture Engine for Witnessing the Web 🍨"
  March 28 2023 - _blogs.harvard.edu/perma_
---
- .warc, .warc.gz and .wacz output formats
- Support for the WACZ Signing and Verification specification
- Optional preservation of _"raw"_ exchanges in WACZ files for later analysis or reprocessing _("wacz with raw exchanges")_
- 💾 Sample WACZ file captured with Scoop.
  Playback software such as replayweb.page can be used to explore this sample capture.
- 📷 Entry points
- 📷 Web Capture
- 📷 Provenance Summary
- 📷 PDF Snapshot
- 📷 Embedded videos as attachments [[1]](/.github/assets/screenshot-video-as-attachment-1.png?raw=true) [[2]](/.github/assets/screenshot-video-as-attachment-2.png?raw=true)

---
Getting started
Scoop requires Node.js 18+. Other _recommended_ system-level dependencies: curl, python3 (for the `--capture-video-as-attachment` option).

While the amount of resources Scoop needs depends entirely on what is being captured, a minimum of 4GB of RAM is recommended for complex captures.
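A script that wraps Scoop can check the Node.js requirement before attempting a capture. This is a minimal sketch: `meetsNodeRequirement` is a hypothetical helper, and the default minimum of 18 is taken from the requirement stated above.

```javascript
// Hypothetical helper: returns true when `version` (e.g. "20.11.1" or "v20.11.1")
// has a major number greater than or equal to `minMajor`.
function meetsNodeRequirement (version, minMajor = 18) {
  const major = Number.parseInt(String(version).replace(/^v/, '').split('.')[0], 10)
  return Number.isInteger(major) && major >= minMajor
}

// Warn early instead of failing halfway through a capture.
if (!meetsNodeRequirement(process.versions.node)) {
  console.warn(`Node.js ${process.versions.node} detected: Scoop expects Node.js 18 or newer.`)
}
```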

This program has been written for UNIX-like systems and is expected to work on Linux, macOS, and Windows Subsystem for Linux.

Scoop is available on npmjs.org and can be installed as follows:
```bash
# As a CLI
npm install -g @harvard-lil/scoop

# As a library
npm install @harvard-lil/scoop --save

# In both cases, you may need to install Playwright's dependencies:
sudo npx playwright install-deps chromium
```
Trouble installing the CLI?
- Make sure you are running Node.js 20-23 (`node -v`)
- Permissions issues are common when installing npm packages globally for the first time.
  See npm's documentation for solutions.
- On certain systems, using `install-deps` without the `chromium` argument might be necessary:
  ```bash
  sudo npx playwright install-deps
  ```
- `npx` may be used as an alternative to a global installation:
  ```bash
  # In a new folder
  npm init
  npm install @harvard-lil/scoop
  npx scoop "https://example.com"
  ```
---
Using Scoop on the command line
Here are a few examples of how the `scoop` command can be used to make a customized capture of a web page.

```bash
# This will capture a given url using the default settings.
scoop "https://lil.law.harvard.edu"

# Unless specified otherwise, scoop will save the output of the capture as "./archive.wacz".
# We can change this with the --output / -o option
scoop "https://lil.law.harvard.edu" -o my-collection/lil.wacz

# But what if I want to change the output format itself?
scoop "https://lil.law.harvard.edu" -f warc -o my-collection/lil.warc

# By default, Scoop runs in headless mode.
# I can turn the "headless" flag off to see what happens in Chromium during capture.
scoop "https://lil.law.harvard.edu" --headless false

# Although it comes with "good defaults", scoop is highly configurable ...
scoop "https://lil.law.harvard.edu" \
  --capture-video-as-attachment false \
  --screenshot false \
  --capture-window-x 320 \
  --capture-window-y 480 \
  --capture-timeout 30000 \
  --max-capture-size 100000 \
  --signing-url "https://example.com/sign"

# ... use --help to list the available options, and see what the defaults are.
scoop --help

# Timeout-related options are good dials to turn first when trying to customize "how much" of a page to capture.
scoop "https://lil.law.harvard.edu" --capture-timeout 90000 --load-timeout 60000 --network-idle-timeout 30000
```
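When the CLI is driven from a Node.js script, the long option lists shown above can be generated from a plain object. The sketch below is one possible approach: `buildScoopArgs` is a hypothetical helper that converts camelCase keys into the kebab-case flags used by `scoop`. It does not validate option names, so the caller is assumed to use flags listed by `scoop --help`.

```javascript
// Hypothetical helper: converts { captureTimeout: 90000 } into ['--capture-timeout', '90000'].
// The URL comes first, as in the command-line examples; undefined and null values are skipped.
function buildScoopArgs (url, options = {}) {
  const args = [url]
  for (const [key, value] of Object.entries(options)) {
    if (value === undefined || value === null) continue
    const flag = '--' + key.replace(/[A-Z]/g, (letter) => '-' + letter.toLowerCase())
    args.push(flag, String(value))
  }
  return args
}

const args = buildScoopArgs('https://lil.law.harvard.edu', {
  format: 'warc',
  output: 'my-collection/lil.warc',
  captureTimeout: 90000,
  screenshot: false
})

console.log(args.join(' '))
```

The resulting array can be passed to `execFile('scoop', args)` from `node:child_process`, which avoids shell quoting issues with URLs.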
See: Output of scoop --help
`
Usage: scoop [options] <url>

🍨 High-fidelity, browser-based, single-page web archiving library and CLI.

More info: https://github.com/harvard-lil/scoop
Options:
-v, --version Display Scoop and Scoop CLI version.
-o, --output Output path. (default: "./archive.wacz")
-f, --format Output format. (choices: "warc", "warc-gzipped", "wacz", "wacz-with-raw", default: "wacz")
--json-summary-output If set, allows for saving a capture summary as JSON. Must be a path to .json file.
--export-attachments-output If set, allows for exporting attachments (screenshot, certs, ...). Must be a path to an existing directory.
--signing-url Authsign-compatible endpoint for signing WACZ file.
--signing-token Authentication token to --signing-url, if needed.
--screenshot Add screenshot step to capture? (choices: "true", "false", default: "true")
--pdf-snapshot Add PDF snapshot step to capture? (choices: "true", "false", default: "false")
--dom-snapshot Add DOM snapshot step to capture? (choices: "true", "false", default: "false")
--capture-video-as-attachment Add capture video(s) as attachment(s) step to capture? (choices: "true", "false", default: "true")
--capture-certificates-as-attachment Add capture certificate(s) as attachment(s) step to capture? (choices: "true", "false", default: "true")
--provenance-summary Add provenance summary to capture? (choices: "true", "false", default: "true")
--attachments-bypass-limits If active, attachments will not count towards time and size constraints imposed on capture (--capture-timeout, --max-capture-size). (choices: "true", "false", default: "true")
--capture-timeout Maximum time allocated to capture process before hard cut-off, in ms. (default: 60000)
--load-timeout Max time Scoop will wait for the page to load, in ms. (default: 20000)
--network-idle-timeout Max time Scoop will wait for the in-browser networking tasks to complete, in ms. (default: 20000)
--behaviors-timeout Max time Scoop will wait for the browser behaviors to complete, in ms. (default: 20000)
--capture-video-as-attachment-timeout Max time Scoop will wait for the video capture process to complete, in ms. (default: 30000)
--capture-certificates-as-attachment-timeout Max time Scoop will wait for the certificates capture process to complete, in ms. (default: 10000)
--capture-window-x Width of the browser window Scoop will open to capture, in pixels. (default: 1600)
--capture-window-y Height of the browser window Scoop will open to capture, in pixels. (default: 900)
--max-capture-size Size limit for the capture's exchanges list, in bytes. (default: 209715200)
--max-video-capture-size Size limit for the video attachment, in bytes. Scoop will not capture video attachments larger than this. (default: 209715200)
--auto-scroll Should Scoop try to scroll through the page? (choices: "true", "false", default: "true")
--auto-play-media Should Scoop try to autoplay