> A data gathering and manipulation kit with support for a variety of database platforms, including MSSQL, MySQL, PostgreSQL, MariaDB, and SQLite.
 
A toolkit that allows a user to easily and programmatically crawl a site for data.
It includes a tool for extracting data to pipe-delimited csv for easy import and manipulation. Source is available on GitLab and documentation on
components is available at https://rdfedor.gitlab.io/datahorde.
1. Installation
2. Available Commands
    1. Crawler
    2. Devour
    3. Digest
    4. Egress
    5. Mimic
3. Development Setup
4. Release History
5. Contributing
## Installation

OS X & Linux:

To install the CLI globally so it is accessible from the terminal:

```sh
npm install -g datahorde
```

To install as part of a project:

```sh
npm install --save datahorde
```
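After a global install, you can confirm the CLI is on your path by printing its version (the `-V, --version` flag is listed in the command's help output below):

```sh
dh --version
```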
## Available Commands

```
Usage: dh [options] [command]

Options:
  -V, --version  output the version number
  -h, --help     output usage information

Commands:
  crawl|c [options]
  devour|d [options]
  digest|di [options]
  mimic|m [options]
  egress|v [options]
```
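Each subcommand prints its own option list via the `--help` flag shown above; for example:

```sh
dh crawl --help
```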
### Crawler

Allows a user to crawl a website, collecting data as defined by the crawl definition.
```
Usage: dh crawl|c [options]

Searches a collection of urls, collects data and writes to csv

Options:
  -l, --load                  load targetUrls from local csv
  -k, --loadUrlKey
  -e, --loadDelimiter
  -a, --loadHeaders
  -m, --loadRenameHeader      map column data to headers (default: false)
  -p, --profile
  -r, --rowSelector
  -c, --columnSelector
  -n, --nextPageLinkSelector
  -o, --outputHeaders
  -h, --headers
  -x, --proxy
  -t, --threads
  -b, --browser               emulate a browser request
  -h, --help                  output usage information
```
As an example, let's write to csv a list of the top release titles for the 42nd weekend of 2019, each release title's link, the movie studio that published it, and the link to the movie studio's page:

```sh
dh crawl -r 'table.mojo-body-table tr' -c 'td.mojo-field-type-release a,td.mojo-field-type-release a:attr(href),td.mojo-field-type-release_studios a,td.mojo-field-type-release_studios a:attr(href)' -o 'release_title,release_title_link,movie_studio,movie_studio_link' mojo_movies_weekend_24.csv https://www.boxofficemojo.com/weekend/2019W42/
```
Then, taking the output of the above, we can collect details from each release's page, such as the total domestic and international gross, genre, and feature length:

```sh
dh crawl -l -i release_title_link -h 'title,domestic_gross,international_gross,genre,feature_length' -r main -c 'h1,.mojo-performance-summary-table div:nth-child(2) span:nth-child(3) span:text,.mojo-performance-summary-table div:nth-child(3) span:nth-child(3) span:text,.mojo-summary-values > div:nth-child(4) > span:nth-child(2):text,.mojo-summary-values > div:nth-child(2) > span:nth-child(2):text' mojo_movie_details.csv mojo_movies_weekend_24.csv
```
### Devour

Ingests a file-based csv into a database table.
```
Usage: dh devour|d [options]

Allows for the easy ingestion of csvs to various datasources

Options:
  -V, --version        output the version number
  -s, --schema
  -d, --delimiter
  -u, --databaseUri
  -t, --tableName
  -h, --noHeaders      Csv file has no headers
  -r, --renameHeaders  Rename headers to match schema on import
  -h, --help           output usage information
```
Building on the output from the crawler, we can then use devour to ingest the csv into a file-based SQLite database:

```sh
dh devour --databaseUri='sqlite:./mojo_movies.sqlite' --schema 'release_title=str(75),release_title_link=str,movie_studio=str(75),movie_studio_link=str' mojo_movies_weekend_24.csv
```
We could also rename the columns as we import them if they don't line up with the schema we're aiming for:

```sh
dh devour --databaseUri='sqlite:mojo_movies.sqlite' --schema 'title=str(75),titleUrl=str,studio=str(75),studioUrl=str' --renameHeaders ./mojo_movies_weekend_24.csv
```
### Digest

Digest executes a series of SQL statements against a data source. It supports database migrations and allows for the ingestion and processing of data from csvs into a data source.
```
Usage: dh digest|di [options]

Processes digest action [execute,crawl,seed,egress,migrate,rollback] against a particular database

Options:
  -V, --version      output the version number
  -d, --data
  -u, --databaseUri
  -h, --help         output usage information
```
#### Action Definitions

* Execute (ex) - Runs crawl, migrate, and seed, processes the path files, then egresses the data back to csv.
* Crawl (cr) - Collects data from a given set of urls and exports them to csv.
* Seed (se) - Imports data from a csv to a particular table location. Supports the same parameters as defined in the CLI.
* Egress (eg) - Exports data to csv from a file-based query. The query can be templatized to use the values passed by the data parameter.
* Migrate (mi) - Applies schema migrations to a particular data source.
* Rollback (ro) - Reverses schema migrations applied to a data source.
A basic example is available to show how to build a digest, where the parameters are based on the options from the respective CLI implementations. Migrations are handled by Sequelize.

```sh
dh digest exec ./examples/digest-basic-example/
```
A more complex example with crawlers is also available. The year or week can be changed by passing them via the CLI as shown below; values passed this way override the defaults defined in the configuration.

```sh
dh digest exec -d 'year=2019,week=43' ./examples/digest-movie-studio-report/
```
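The individual digest actions can also be run on their own. As a sketch (assuming `migrate` and `rollback` accept the digest path the same way `exec` does above), applying and then reversing the example's migrations would look like:

```sh
# Apply the schema migrations defined in the example digest (path usage assumed to match exec)
dh digest migrate ./examples/digest-basic-example/

# Reverse the applied migrations
dh digest rollback ./examples/digest-basic-example/
```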
### Egress

Executes a given series of commands against a database and exports data from a table to a file-based csv.
```
Usage: dh egress|v [options]

Exports data from a datasource using a query to csv

Options:
  -V, --version      output the version number
  -o, --output
  -d, --delimeter
  -u, --databaseUri
  -h, --headers
  -h, --help         output usage information
```
Then export the same data we ingested earlier using egress:

```sh
dh egress --databaseUri='sqlite:./mojo_movies.sqlite' --output ./mojo_movies_output.csv 'select * from importTable'
```
### Mimic

Allows files to be templatized so they can be customized based on the parameters passed.
```
Usage: dh mimic|m [options]

Templating engine which uses ejs to build customized scripts

Options:
  -V, --version      output the version number
  -d, --data
  -n, --noOverwrite  does not overwrite the file if the outputFile already exists
  -h, --help         output usage information
```
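Since mimic is built on ejs, a template is an ordinary file with ejs placeholders, and the `-d, --data` option supplies the values (the same `key=value` list format used by digest above). The positional template and output file arguments below are assumptions, not documented syntax:

```sh
# weekly_report.sql.ejs is a hypothetical ejs template containing, e.g.:
#   select * from releases where year = <%= year %> and week = <%= week %>;
# Render it with values passed through the documented -d/--data option
dh mimic -d 'year=2019,week=43' ./weekly_report.sql.ejs ./weekly_report.sql
```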
_For more documentation, please refer to the jsdoc._
## Development Setup

To develop locally, simply check out the repository and run:

```sh
npm install
npm test
```
## Release History

* 0.0.9 - 0.0.14
    * Updates to documentation
* 0.0.8
    * Extract common database components
    * Add extensions to common database components
    * Add digest tool to automate commands
    * Add better error handling
* 0.0.7
    * Update version for npm release
* 0.0.2
    * Refactor CLI into a single command
    * Refactor library to support csv export
    * Add csv export utility "vomit"
* 0.0.1
    * Work in progress
## Contributing

1. Fork it (<https://gitlab.com/rdfedor/datahorde/-/forks/new>)
2. Create your feature branch (`git checkout -b feature/fooBar`)
3. Commit your changes (`git commit -am 'Add some fooBar'`)
4. Push to the branch (`git push origin feature/fooBar`)
5. Create a new Pull Request