bioRxiv MECA Downloader CLI

A comprehensive command-line interface (CLI) tool to download, process, and manage bioRxiv MECA (Manuscript Exchange Common Approach) files from AWS S3 for text and data mining purposes.

Features

- Multi-Server Support: Works with both bioRxiv and medRxiv servers
- AWS S3 Integration: Connect to S3 buckets with requester-pays support
- Content Exploration: List, search, and browse available content by month or batch
- Individual Downloads: Download MECA files by DOI with API integration
- Batch Processing: Process large amounts of data with configurable concurrency
- Content Summaries: Get detailed information about preprints
- Month/Batch Analysis: Detailed metadata for specific time periods
- XML Processing: Robust handling of bioRxiv XML files with entity replacement

Installation

$3

- Node.js 18.0.0 or higher
- AWS account with access to bioRxiv/medRxiv S3 buckets (--requester-pays)
- API key for bioRxiv API access

$3

``bash npm install -g biorxiv`

`Quick Start`

`$3`

Get a summary of a bioRxiv/medRxiv preprint from a URL or DOI.

Arguments:

- : bioRxiv URL or DOI to summarize

Options:

- -m, --more: Show additional details and full abstract --s, --server : Specify server (biorxiv or medrxiv)

Examples:

`bash biorxiv summary "10.1101/2024.05.08.593085" biorxiv summary -m "10.1101/2024.05.08.593085" biorxiv summary -s medrxiv "10.1101/2020.03.19.20039131" --more`

`$3`

Download MECA files from the bioRxiv/medRxiv S3 buckets by DOI.

Arguments:

- : DOI of the paper (e.g., "10.1101/2024.05.08.593085")

Options:

- -o, --output

: Output directory for downloaded files (default: "./downloads")
-

-a, --api-url

: API base URL
-

--requester-pays

: Enable requester-pays for S3 bucket access
Examples:

`bash biorxiv --requester-pays download "10.1101/2024.05.08.593085" biorxiv --requester-pays download "10.1101/2024.05.08.593085" --output "./papers" biorxiv --requester-pays download "10.1101/2024.05.08.593085" --api-url "https://custom-api.com"`

`$3`

List available content in the bioRxiv or medRxiv S3 bucket.

Options:

- -m, --month : Filter by specific month (e.g., "2024-01" or "January_2024") --b, --batch : Filter by specific batch (e.g., "Batch_01") --l, --limit : Limit the number of results (default: 50) --s, --server : Server to use: "biorxiv" or "medrxiv"

Examples:

`bash

`Local development`


biorxiv list
biorxiv list --month "2024-01"
biorxiv list --batch 1 --limit 100 --server medrxiv


Batch Processing
List detailed metadata for all files in a specific month or batch.
Options:

- -m, --month : Month to list (e.g., "January_2024" or "2024-01") --b, --batch : Batch to list (e.g., "1", "batch-1", "Batch_01") --s, --server : Server to use: "biorxiv" or "medrxiv"

Examples:

`bash biorxiv batch-info --month "2024-01" biorxiv batch-info --batch "1" biorxiv batch-info --server medrxiv --month "2024-01"`

`Global Options`

`$3`

Enable requester pays functionality. The S3 buckets require requester pays for external access.

`Batch Processing`

Batch process MECA files for a given month or batch.

Options:

Time Selection:

- -m, --month : Month(s) to process. Supports: YYYY-MM, comma-separated list, or wildcard patterns --b, --batch : Batch to process. Supports: single batch, range, or comma-separated list

Processing Control:

- -l, --limit : Maximum number of files to process --c, --concurrency : Number of files to process concurrently (default: 1) ---force: Force reprocessing of existing files ---dry-run: List files without processing them

Output Control:

- -o, --output

: Output directory for extracted files (default: "./batch-extracted")
-

--keep

: Keep MECA files after processing
-

--full-extract

: Extract entire MECA file instead of selective extraction
-

--max-file-size

: Skip files larger than this size (e.g. 1GB)
API Configuration:

- -a, --api-url : API base URL (default: "https://biorxiv.csf.now") --k, --api-key : API key for authentication (or use OPENRXIV_BATCH_PROCESSING_API_KEY env var) --s, --server : Server type: biorxiv or medrxiv

Examples:

`bash

`Process specific month`


biorxiv batch-process --month "2025-08" --requester-pays
Process multiple months

biorxiv batch-process --month "2024-01,2024-02,2024-03" --requester-pays
Dry run to see what would be processed

biorxiv batch-process --month "2025-08" --dry-run
Process all of 2025

biorxiv batch-process --month "2025-*" --requester-pays
Process with concurrency

biorxiv batch-process --month "2025-08" --concurrency 5 --requester-pays


Configuration
The tool reads AWS credentials from the home directory under the default profile, if available.
$3
You can also set credentials via environment variables:

`bash export OPENRXIV_BATCH_PROCESSING_API_KEY="your-api-key" export AWS_ACCESS_KEY_ID="your-access-key" export AWS_SECRET_ACCESS_KEY="your-secret-key"`

`Development`

`$3`

`bash git clone https://github.com/continuous-foundation/biorxiv cd biorxiv npm install`

`$3`

`bash npm run build`

`$3`

`bash npm test npm run test:watch`

`$3`

`bash npm run lint npm run lint:format``

License

MIT License - see LICENSE file for details.

Compliance

This tool is designed to comply with bioRxiv's and medRxiv's fair use policies:

- No content redistribution
- Link back to bioRxiv/medRxiv for indexing services
- Respect author copyright and licensing
- Intended for legitimate text and data mining purposes

bioRxiv MECA Downloader CLI

A comprehensive command-line interface (CLI) tool to download, process, and manage bioRxiv MECA (Manuscript Exchange Common Approach) files from AWS S3 for text and data mining purposes.

Features

Installation

$3

- Node.js 18.0.0 or higher
- AWS account with access to bioRxiv/medRxiv S3 buckets (--requester-pays)
- API key for bioRxiv API access

$3

``bash npm install -g biorxiv`

`Quick Start`

`$3`

Get a summary of a bioRxiv/medRxiv preprint from a URL or DOI.

Arguments:

- : bioRxiv URL or DOI to summarize

Options:

- -m, --more: Show additional details and full abstract --s, --server : Specify server (biorxiv or medrxiv)

Examples:

`bash biorxiv summary "10.1101/2024.05.08.593085" biorxiv summary -m "10.1101/2024.05.08.593085" biorxiv summary -s medrxiv "10.1101/2020.03.19.20039131" --more`

`$3`

Download MECA files from the bioRxiv/medRxiv S3 buckets by DOI.

Arguments:

- : DOI of the paper (e.g., "10.1101/2024.05.08.593085")

Options:

- -o, --output

: Output directory for downloaded files (default: "./downloads")
-

-a, --api-url

: API base URL
-

--requester-pays

: Enable requester-pays for S3 bucket access
Examples:

`$3`

List available content in the bioRxiv or medRxiv S3 bucket.

Options:

Examples:

`bash

`Local development`


biorxiv list
biorxiv list --month "2024-01"
biorxiv list --batch 1 --limit 100 --server medrxiv


Batch Processing
List detailed metadata for all files in a specific month or batch.
Options:

- -m, --month : Month to list (e.g., "January_2024" or "2024-01") --b, --batch : Batch to list (e.g., "1", "batch-1", "Batch_01") --s, --server : Server to use: "biorxiv" or "medrxiv"

Examples:

`bash biorxiv batch-info --month "2024-01" biorxiv batch-info --batch "1" biorxiv batch-info --server medrxiv --month "2024-01"`

`Global Options`

`$3`

Enable requester pays functionality. The S3 buckets require requester pays for external access.

`Batch Processing`

Batch process MECA files for a given month or batch.

Options:

Time Selection:

- -m, --month : Month(s) to process. Supports: YYYY-MM, comma-separated list, or wildcard patterns --b, --batch : Batch to process. Supports: single batch, range, or comma-separated list

Processing Control:

Output Control:

- -o, --output

: Output directory for extracted files (default: "./batch-extracted")
-

--keep

: Keep MECA files after processing
-

--full-extract

: Extract entire MECA file instead of selective extraction
-

--max-file-size

: Skip files larger than this size (e.g. 1GB)
API Configuration:

Examples:

`bash

`Process specific month`


biorxiv batch-process --month "2025-08" --requester-pays
Process multiple months

biorxiv batch-process --month "2024-01,2024-02,2024-03" --requester-pays
Dry run to see what would be processed

biorxiv batch-process --month "2025-08" --dry-run
Process all of 2025

biorxiv batch-process --month "2025-*" --requester-pays
Process with concurrency

biorxiv batch-process --month "2025-08" --concurrency 5 --requester-pays


Configuration
The tool reads AWS credentials from the home directory under the default profile, if available.
$3
You can also set credentials via environment variables:

`bash export OPENRXIV_BATCH_PROCESSING_API_KEY="your-api-key" export AWS_ACCESS_KEY_ID="your-access-key" export AWS_SECRET_ACCESS_KEY="your-secret-key"`

`Development`

`$3`

`bash git clone https://github.com/continuous-foundation/biorxiv cd biorxiv npm install`

`$3`

`bash npm run build`

`$3`

`bash npm test npm run test:watch`

`$3`

`bash npm run lint npm run lint:format``

License

MIT License - see LICENSE file for details.

Compliance

This tool is designed to comply with bioRxiv's and medRxiv's fair use policies:

- No content redistribution
- Link back to bioRxiv/medRxiv for indexing services
- Respect author copyright and licensing
- Intended for legitimate text and data mining purposes