> A data gathering and manipulation kit with support for a variety of database platforms, including MSSQL, MySQL, PostgreSQL, MariaDB, and SQLite.
 
A toolkit that allows a user to easily and programmatically crawl a site for data.
It includes a tool for extracting data to pipe-delimited csv for easy import and manipulation. Source is available on GitLab and documentation on
components is available at https://rdfedor.gitlab.io/datahorde.
1. Installation
2. Available Commands
    1. Crawler
    2. Devour
    3. Digest
    4. Egress
    5. Mimic
3. Development Setup
4. Release History
5. Contributing
## Installation

OS X & Linux:

To install the CLI globally so it is accessible from the terminal:

```sh
npm install -g datahorde
```

To install as part of a project:

```sh
npm install --save datahorde
```
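After a global install, you can confirm the CLI is on your path by printing its version (the `-V, --version` flag is listed in the command's help output below):

```sh
dh --version
```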
## Available Commands

```
Usage: dh [options] [command]

Options:
  -V, --version  output the version number
  -h, --help     output usage information

Commands:
  crawl|c [options]
  devour|d [options]
  digest|di [options]
  mimic|m [options]
  egress|v [options]
```
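Each subcommand prints its own option list via the `--help` flag shown above; for example:

```sh
dh crawl --help
```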
### Crawler

Allows a user to crawl a website, collecting data as defined by the crawl definition.
```
Usage: dh crawl|c [options]

Searches a collection of urls, collects data and writes to csv

Options:
  -l, --load                  load targetUrls from local csv
  -k, --loadUrlKey
  -e, --loadDelimiter
  -a, --loadHeaders
  -m, --loadRenameHeader      map column data to headers (default: false)
  -p, --profile
  -r, --rowSelector
  -c, --columnSelector
  -n, --nextPageLinkSelector
  -o, --outputHeaders
  -h, --headers
  -x, --proxy
  -t, --threads
  -b, --browser               emulate a browser request
  -h, --help                  output usage information
```
As an example, let's write to csv a list of the top release titles for the 42nd weekend of 2019, each release title's link, the movie studio that published it, and the link to the movie studio's page:

```sh
dh crawl -r 'table.mojo-body-table tr' -c 'td.mojo-field-type-release a,td.mojo-field-type-release a:attr(href),td.mojo-field-type-release_studios a,td.mojo-field-type-release_studios a:attr(href)' -o 'release_title,release_title_link,movie_studio,movie_studio_link' mojo_movies_weekend_24.csv https://www.boxofficemojo.com/weekend/2019W42/
```
Then, taking the output of the above, we can collect details from each release's page, such as the total domestic and international gross, genre, and feature length:

```sh
dh crawl -l -i release_title_link -h 'title,domestic_gross,international_gross,genre,feature_length' -r main -c 'h1,.mojo-performance-summary-table div:nth-child(2) span:nth-child(3) span:text,.mojo-performance-summary-table div:nth-child(3) span:nth-child(3) span:text,.mojo-summary-values > div:nth-child(4) > span:nth-child(2):text,.mojo-summary-values > div:nth-child(2) > span:nth-child(2):text' mojo_movie_details.csv mojo_movies_weekend_24.csv
```
### Devour

Ingests a file-based csv into a database table.
```
Usage: dh devour|d [options]

Allows for the easy ingestion of csvs to various datasources

Options:
  -V, --version        output the version number
  -s, --schema
  -d, --delimiter
  -u, --databaseUri
  -t, --tableName
  -h, --noHeaders      Csv file has no headers
  -r, --renameHeaders  Rename headers to match schema on import
  -h, --help           output usage information
```
Building on the output from the crawler, we can then use devour to ingest the csv into a file-based SQLite database:

```sh
dh devour --databaseUri='sqlite:./mojo_movies.sqlite' --schema 'release_title=str(75),release_title_link=str,movie_studio=str(75),movie_studio_link=str' mojo_movies_weekend_24.csv
```
We could also rename the columns as we import them if they don't line up with the schema we're aiming for:

```sh
dh devour --databaseUri='sqlite:mojo_movies.sqlite' --schema 'title=str(75),titleUrl=str,studio=str(75),studioUrl=str' --renameHeaders ./mojo_movies_weekend_24.csv
```
### Digest

Digest executes a series of SQL statements against a data source. It supports database migrations and allows for the ingestion and processing of data from csvs into a data source.
```
Usage: dh digest|di [options]

Processes digest action [execute,crawl,seed,egress,migrate,rollback] against a particular database

Options:
  -V, --version      output the version number
  -d, --data
  -u, --databaseUri
  -h, --help         output usage information
```
#### Action Definitions

* Execute (ex) - Runs crawl, migrate, and seed, processes the path files, then egresses the data back to csv.
* Crawl (cr) - Collects data from a given set of urls and exports them to csv.
* Seed (se) - Imports data from a csv to a particular table location. Supports the same parameters as defined in the CLI.
* Egress (eg) - Exports data to csv from a file-based query. The query can be templatized to use the values passed by the data parameter.
* Migrate (mi) - Applies schema migrations to a particular data source.
* Rollback (ro) - Reverses schema migrations applied to a data source.
A basic example is available to show how to build a digest, where the parameters are based on the options from the respective CLI implementations. Migrations are handled by Sequelize.

```sh
dh digest exec ./examples/digest-basic-example/
```
A more complex example with crawlers is also available. The year or week can be changed by passing them via the CLI as shown below; values passed this way override the defaults defined in the configuration.

```sh
dh digest exec -d 'year=2019,week=43' ./examples/digest-movie-studio-report/
```
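The individual digest actions can also be run on their own. As a sketch (assuming `migrate` and `rollback` accept the digest path the same way `exec` does above), applying and then reversing the example's migrations would look like:

```sh
# Apply the schema migrations defined in the example digest (path usage assumed to match exec)
dh digest migrate ./examples/digest-basic-example/

# Reverse the applied migrations
dh digest rollback ./examples/digest-basic-example/
```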
### Egress

Executes a given series of commands against a database and exports data from a table to a file-based csv.
```
Usage: dh egress|v [options]

Exports data from a datasource using a query to csv

Options:
  -V, --version      output the version number
  -o, --output
  -d, --delimeter
  -u, --databaseUri
  -h, --headers
  -h, --help         output usage information
```
Then export the same data we ingested earlier using egress:

```sh
dh egress --databaseUri='sqlite:./mojo_movies.sqlite' --output ./mojo_movies_output.csv 'select * from importTable'
```
### Mimic

Allows files to be templatized so they can be customized based on the parameters passed.
```
Usage: dh mimic|m [options]

Templating engine which uses ejs to build customized scripts

Options:
  -V, --version      output the version number
  -d, --data
  -n, --noOverwrite  does not overwrite the file if the outputFile already exists
  -h, --help         output usage information
```
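Since mimic is built on ejs, a template is an ordinary file with ejs placeholders, and the `-d, --data` option supplies the values (the same `key=value` list format used by digest above). The positional template and output file arguments below are assumptions, not documented syntax:

```sh
# weekly_report.sql.ejs is a hypothetical ejs template containing, e.g.:
#   select * from releases where year = <%= year %> and week = <%= week %>;
# Render it with values passed through the documented -d/--data option
dh mimic -d 'year=2019,week=43' ./weekly_report.sql.ejs ./weekly_report.sql
```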
_For more documentation, please refer to the jsdoc._
## Development Setup

To develop locally, simply check out the repository and run:

```sh
npm install
npm test
```
## Release History

* 0.0.9 - 0.0.14
    * Updates to documentation
* 0.0.8
    * Extract common database components
    * Add extensions to common database components
    * Add digest tool to automate commands
    * Add better error handling
* 0.0.7
    * Update version for npm release
* 0.0.2
    * Refactor CLI into a single command
    * Refactor library to support csv export
    * Add csv export utility "vomit"
* 0.0.1
    * Work in progress
## Contributing

1. Fork it (<https://gitlab.com/rdfedor/datahorde/-/forks/new>)
2. Create your feature branch (`git checkout -b feature/fooBar`)
3. Commit your changes (`git commit -am 'Add some fooBar'`)
4. Push to the branch (`git push origin feature/fooBar`)
5. Create a new Pull Request