Floodesh

Floodesh is a middleware-based web spider written in Node.js. The name "Floodesh" combines two words, "flood" and "mesh".

- Requirement
  * Gearman Server
  * MongoDB
- Quick start
  * Install scaffold
  * Initialize
- Context
  * Request
    + ctx.querystring
    + ctx.idempotent
    + ctx.search
    + ctx.method
    + ctx.query
    + ctx.path
    + ctx.url
    + ctx.origin
    + ctx.protocol
    + ctx.host
    + ctx.hostname
    + ctx.secure
  * Response
    + ctx.status
    + ctx.message
    + ctx.body
    + ctx.length
    + ctx.type
    + ctx.lastModifieds
    + ctx.etag
    + ctx.header
    + ctx.contentType
    + ctx.get(key)
    + ctx.is(types)
  * Other
    + ctx.tasks
    + ctx.dataSet
- Configuration
  * index
  * bottleneck
  * downloader
  * gearman
  * database
  * logger
  * seenreq
  * service
- Error handling
- Diagram
  * Client
    + State diagram
    + Flow chart
  * Worker
- Middlewares

Requirement

Gearman Server

Make sure g++, make, libboost-all-dev, gperf, libevent-dev and uuid-dev have been installed.

```sh
$ wget https://launchpad.net/gearmand/1.2/1.1.12/+download/gearmand-1.1.12.tar.gz
$ tar xzf gearmand-1.1.12.tar.gz
$ cd gearmand-1.1.12
$ ./configure
$ make
$ make install
```

MongoDB

Quick start


Install scaffold

```sh
$ npm install -g floodesh-cli
```


Initialize

Generate a new app from the templates with a single command.

```sh
$ mkdir demo
$ cd demo
$ floodesh-cli init  # all necessary files will be generated in your directory
```

Before first use, make sure the directories /data/tests and /var/log/bda/tests exist and are writable. You can customize the log path by modifying logBaseDir in config/[env]/index.js.

Context

A context instance is a kind of finite-state machine implemented with generators, an ECMAScript 6 feature. Through the context you can access almost all fields of the request and response, for example:

```javascript
worker.use((ctx, next) => {
  // ctx.body is a Buffer; keep a string copy for later middlewares
  ctx.content = ctx.body.toString();
  return next();
});
```
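To make the role of `next()` in the example above more concrete, here is a minimal sketch of how a chain of `(ctx, next)` middlewares can be run in order. The `compose` helper and the stand-in `demoCtx` object are hypothetical and only illustrate the general pattern; floodesh's own dispatcher may work differently.

```javascript
// Minimal middleware runner, for illustration only (not floodesh's dispatcher).
function compose(middlewares) {
  return (ctx) => {
    const dispatch = (i) => {
      if (i === middlewares.length) return Promise.resolve();
      // Each middleware receives ctx and a next() that runs the following one.
      return Promise.resolve(middlewares[i](ctx, () => dispatch(i + 1)));
    };
    return dispatch(0);
  };
}

const run = compose([
  (ctx, next) => { ctx.content = ctx.body.toString(); return next(); },
  (ctx, next) => { ctx.contentLength = ctx.content.length; return next(); },
]);

// Stand-in context; in floodesh the framework supplies the real one.
const demoCtx = { body: Buffer.from('hello') };
run(demoCtx).then(() => console.log(demoCtx.content, demoCtx.contentLength));
```

Each middleware decides whether the rest of the chain runs by calling, or not calling, `next()`.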

Request

ctx.querystring

Get the query string.

ctx.idempotent

Check if the request is idempotent.

ctx.search

Get the search string. Unlike querystring, it includes the leading "?".

ctx.method

Get the request method.

ctx.query

Get the parsed query string.

ctx.path

Get the request pathname.

ctx.url

Return the request URL, the same as ctx.href.

ctx.origin

Get the origin of the URL, for instance, "https://www.google.com".

ctx.protocol

Return the protocol string, "http:" or "https:".

ctx.host

Parse the "Host" header field and return the host as hostname:port. X-Forwarded-Host is supported when a proxy is enabled.

ctx.hostname

Parse the "Host" header field and return the hostname. X-Forwarded-Host is supported when a proxy is enabled.

ctx.secure

Check if the protocol is https.
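
To show how the URL-related request properties (querystring, search, query, path, origin, protocol, host, hostname, secure) relate to one another, the sketch below derives the same pieces from a sample URL with Node's built-in `URL` class. It is only an analogy: `describeUrl` and the sample URL are hypothetical, and floodesh's own getters may compute these values differently.

```javascript
// Illustration only: derive URL pieces with Node's built-in URL class.
// describeUrl is a hypothetical helper, not part of floodesh.
function describeUrl(href) {
  const u = new URL(href);
  return {
    querystring: u.search.replace(/^\?/, ''),  // without the leading "?"
    search: u.search,                           // with the leading "?"
    query: Object.fromEntries(u.searchParams),  // parsed query string
    path: u.pathname,
    origin: u.origin,
    protocol: u.protocol,                       // "http:" or "https:"
    host: u.host,                               // hostname:port
    hostname: u.hostname,
    secure: u.protocol === 'https:',
  };
}

console.log(describeUrl('https://example.com:8443/list?page=2&sort=asc'));
```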

Response

ctx.status

Get the status code from the response.

ctx.message

Get the status message from the response.

ctx.body

Get the response body as a Buffer.

ctx.length

Get the length of the response body.

ctx.type

Get the response mime type, for instance, "text/html".

ctx.lastModifieds

Get the Last-Modified date as a Date, if it exists.

ctx.etag

Get the ETag of the response.

ctx.header

Return the response header.

ctx.contentType

ctx.get(key)

  * key: name of the response header

Get the value for key in the response headers.

ctx.is(types)

  * types: a mime type or an array of mime types
  * Return: the first matching type, false, or null

Check whether the response contains the "Content-Type" header field and whether it matches any of the given mime types. If there is no response body, null is returned. If there is no content type, false is returned. Otherwise, the first type that matches is returned.
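
The three possible outcomes of the content-type check (null, false, or the first matching type) can be sketched as a small standalone function. `matchType` and the shape of the `response` argument are hypothetical, matching is simplified to an exact, case-insensitive comparison of mime types, and returning false when nothing matches is an assumption rather than documented floodesh behavior.

```javascript
// Hypothetical helper mirroring the rules described for ctx.is(types).
function matchType(response, types) {
  if (!response.body || response.body.length === 0) return null; // no body
  const header = response.headers['content-type'];
  if (!header) return false;                                      // no content type
  const mime = header.split(';')[0].trim().toLowerCase();         // drop "; charset=..."
  const wanted = Array.isArray(types) ? types : [types];
  const hit = wanted.find((t) => t.toLowerCase() === mime);
  return hit === undefined ? false : hit;                         // assumed: false on no match
}

const htmlResponse = {
  body: Buffer.from('<html></html>'),
  headers: { 'content-type': 'text/html; charset=utf-8' },
};
console.log(matchType(htmlResponse, ['application/json', 'text/html']));
```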

Other

ctx.tasks

Array of generated tasks. A task is an object consisting of opt and next: opt holds the request Options, and next is the name of the function in your spider to call for the next task. Supported format:

`[{ opt: <Options>, next: <function name> }]`

ctx.dataSet

A map that stores results, which will be parsed and saved by floodesh.
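
As a rough illustration of how a middleware might fill ctx.tasks and ctx.dataSet, consider the sketch below. Several things here are assumptions, not confirmed floodesh API: the stand-in `fakeCtx` object, the `uri` field inside opt, the function name `parseDetail`, and the idea that dataSet behaves like a standard `Map`.

```javascript
// Stand-in context for illustration; in floodesh the framework supplies ctx.
const fakeCtx = { tasks: [], dataSet: new Map() };

// A middleware that queues a follow-up task and stores a result.
const collect = (ctx, next) => {
  ctx.tasks.push({
    opt: { uri: 'https://example.com/item/1' }, // request options (field name assumed)
    next: 'parseDetail',                         // hypothetical function name in your spider
  });
  ctx.dataSet.set('titles', ['first item']);     // assumes a Map-like dataSet
  return next();
};

collect(fakeCtx, () => Promise.resolve());
console.log(fakeCtx.tasks.length, fakeCtx.dataSet.get('titles'));
```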

Configuration

index

* retry : Number of retries on the worker side, default 3
* logBaseDir : Directory under which the project's log directory lives, default '/var/log/bda/'
* parsers : Array of parsers, which are file names in the parser directory without '.js'
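
A config/[env]/index.js built from the options above might look like the following sketch. The values of retry and logBaseDir are the documented defaults; the parser names `home` and `detail` are hypothetical.

```javascript
// Sketch of config/[env]/index.js; parser names are made up for illustration.
const indexConfig = {
  retry: 3,                    // retries on the worker side
  logBaseDir: '/var/log/bda/', // base directory for the project's logs
  parsers: ['home', 'detail'], // would refer to home.js and detail.js in the parser directory
};

module.exports = indexConfig;
```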

bottleneck

* defaultCfg
  * rate : Number of milliseconds to delay between requests
  * concurrent : Size of the worker pool
  * priorityRange : Range of acceptable priorities starting from 0, default 3
  * defaultPriority : Priority of the request
  * homogenous : true
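
For orientation, a bottleneck config using these options could be sketched as below. The nesting of every option under defaultCfg is an assumption, and the values of rate, concurrent and defaultPriority are purely illustrative.

```javascript
// Sketch of a bottleneck config; nesting under defaultCfg is assumed.
const bottleneckConfig = {
  defaultCfg: {
    rate: 1000,         // wait 1000 ms between requests (illustrative)
    concurrent: 2,      // worker pool size (illustrative)
    priorityRange: 3,   // priorities 0..2
    defaultPriority: 1, // illustrative
    homogenous: true,
  },
};

module.exports = bottleneckConfig;
```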

downloader

* headers : HTTP headers

gearman

* jobs : Max number of jobs per worker, default 1
* srvQueueSize : Max number of jobs queued to the gearman server, default 1000
* mongodb : MongoDB connection string URI
* worker :
  * servers : Array of servers; each server should be an object like {'host':'gearman-server'}
* client :
  * servers : Same as above
  * loadBalancing : 'RoundRobin'
* retry : Number of retries on the client side
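
Put together, a gearman config might look like the sketch below. The MongoDB URI and the client-side retry value are hypothetical placeholders; jobs and srvQueueSize use the documented defaults.

```javascript
// Sketch of a gearman config; URI and retry value are placeholders.
const gearmanConfig = {
  jobs: 1,                                       // max jobs per worker
  srvQueueSize: 1000,                            // max jobs queued to the gearman server
  mongodb: 'mongodb://localhost:27017/floodesh', // hypothetical connection string
  worker: {
    servers: [{ host: 'gearman-server' }],
  },
  client: {
    servers: [{ host: 'gearman-server' }],
    loadBalancing: 'RoundRobin',
  },
  retry: 3,                                      // retries on the client side (illustrative)
};

module.exports = gearmanConfig;
```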

database

* mongodb : MongoDB connection string URI

logger

seenreq

* repo : 'redis' or 'mongodb'; by default, memory is used as the repo
* removeKeys : Array of keys in the query string to ignore when testing whether a URL has been seen
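
The effect of removeKeys can be illustrated with a small in-memory sketch: the listed keys are stripped from the query string before the URL is compared against the set of URLs already seen. The helpers `normalize` and `isSeen` are hypothetical and are not the seenreq API; sorting the remaining parameters is an extra assumption added here so that parameter order does not matter.

```javascript
// Hypothetical helpers: strip ignored query keys, then check an in-memory Set.
function normalize(href, removeKeys) {
  const u = new URL(href);
  for (const key of removeKeys) u.searchParams.delete(key);
  u.searchParams.sort(); // assumption: make parameter order irrelevant
  return u.toString();
}

const seen = new Set();
function isSeen(href, removeKeys) {
  const key = normalize(href, removeKeys);
  if (seen.has(key)) return true;
  seen.add(key);
  return false;
}

console.log(isSeen('https://example.com/a?id=1&ts=100', ['ts'])); // false
console.log(isSeen('https://example.com/a?ts=200&id=1', ['ts'])); // true
```

With `ts` ignored, both URLs normalize to the same key, so the second one counts as already seen.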

service

* server : Remote service origin

Error handling

In a synchronous middleware, just throw an Error; in an asynchronous one, return a rejected Promise. err.stack will be logged and err.code will be sent to the client to persist.

```javascript
// sync
module.exports = (ctx, next) => {
  // ...
  throw new Error('crash here');
};

// async
module.exports = (ctx, next) => {
  return new Promise((resolve, reject) => {
    // ...
    reject(new Error('got error'));
  });
};
```
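
Since err.code is what gets sent to the client, it can help to set it explicitly before rejecting. The sketch below shows one way to do that; the code value `EBLOCKED`, the message, and the `failing` middleware are hypothetical, and how floodesh treats a missing code is not covered here.

```javascript
// Hypothetical middleware that rejects with an error carrying a code.
const failing = (ctx, next) => {
  const err = new Error('blocked by target site');
  err.code = 'EBLOCKED'; // made-up code value for illustration
  return Promise.reject(err);
};

// Simulate what a caller would observe.
failing({}, () => Promise.resolve()).catch((err) => {
  console.log(err.code, err.message);
});
```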

Diagram

Client

State diagram

Figure: floodesh client state

Flow chart

Figure: floodesh client flow

Worker

Figure: floodesh worker flow

Middlewares

 * mof-cheerio: A simple wrapper of Cheerio.
 * mof-charsetparser: Parses the charset in response headers.
 * mof-iconv: Encoding converter middleware using iconv or iconv-lite.
 * mof-request: A wrapper of Request.js, with some default options.
 * mof-bottleneck: A wrapper of bottleneckp, which is an asynchronous rate limiter with priority.
 * mof-proxy: Acquires a proxy from a proxy service.
 * mof-whacko: A wrapper of whacko, which is a fork of cheerio that uses parse5 as its underlying platform.
 * mof-statsd: A wrapper of statsd-client, which lets you send metrics to a statsd daemon.
 * mof-uarotate: Rotates the User-Agent header automatically from a local file.
 * mof-seenreq: Only makes sense in floodesh; a simple wrapper of seenreq.
 * mof-validbody: Checks whether a response body matches a pattern; for instance, an HTML body should start with `<` and a JSON body with `{`.
 * mof-statuscode: Status code detector.
 * mof-genestamp: Prints the gene and URL of a task, along with the number of new tasks and the number of records.

