website scraper
npm install dbay-vogue
Table of Contents generated with DocToc
- Web Site Scraper
- Vogue Scraper
- Scraped Data Format
- Vogue Scheduler
- Construction
- Concurrent Writes
- Relevant Data Types
- Async Primitives
- Vogue DB
- To Do
- Is Done
* Vogue
* v: Vogue is the so-called 'hub'; the main object to be instantiated
* v.scrapers: Vogue_scrapers
* v.scrapers.d: { [dsk]: Vogue_scraper, }
* v.vdb: Vogue_db
* v.server: Vogue_server
* v.scheduler: Vogue_scheduler
* Vogue_ops: slim on-page script (renders trendlines &c)
* Vogue_telegraph for websocket-based client/server communication
* based on Datom XEmitter
* Mandatory and Standard Items
* Freeform Items
* field details must be a JSONifiable object (preferably a flat one)
* can contain further items as seen fit
* will be rendered as a list of named entries in a format determined by CSS
* Vogue_scraper has prototype method html_from_details() to turn
contents of details field into HTML; can be overridden for individual purposes
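A minimal override sketch; the subclass name is made up, the export path of Vogue_scraper is an
assumption, and the markup below merely follows the 'list of named entries' idea described above:

```coffee
{ Vogue_scraper } = require 'dbay-vogue'      # export name and path assumed for illustration

class My_shop_scraper extends Vogue_scraper   # hypothetical subclass
  html_from_details: ( details ) ->
    ### Render each key/value pair of the freeform `details` object as one named entry;
    styling is left to CSS as described above. ###
    entries = ( "<li><b>#{key}:</b> #{value}</li>" for key, value of details )
    return "<ul class='details'>#{entries.join ''}</ul>"
```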
* scheduler.add_interval: ( cfg ) -> (see the example call after the diagram below)
* Mandatory properties of cfg:
* task: function | asyncfunction: the task to run. It will be called without arguments.
* repeat: duration: how often to repeat a task.
* Optional properties of cfg:
* jitter: duration | percentage (default null meaning 0 seconds): how much to randomly vary the
repetition rate. Example: when the repetition is 1 hour, the jitter is 5 minutes, and the task was
first started at 00:00, then it will be repeated somewhere between 00:55 and 01:05. A jitter of 25%
would equal 15 minutes in this case.
* timeout: duration (default: null meaning no timeout): how long to wait for a task to finish. When
a task has been stopped because of timeout, the next session will be started following the same rules
as if it finished normally.
* pause: duration (default: null meaning 0 seconds): specifies the minimum time between the
finishing of one session and the earliest allowed start of the next one, taking jitter into account.
Example: When a task has been scheduled with { repeat: '10 minutes', jitter: '3 minutes', pause: '5 minutes', },
is first run at 00:00, and takes 9 minutes (until 00:09) to finish, it would, without pause or jitter,
be scheduled to run next at 00:09 (i.e. immediately). However, we have to wait at least until
00:09 + 5 minutes to account for the pause, and because a non-zero jitter could cause the start to be
that much earlier, we have to postpone until 00:09 + 5 minutes + 3 minutes = 00:17.
Note that for simplicity, we have only used minutes and hours in the above, which in reality could
have the undesirable effect that a task scheduled to have a pause of 1 minute finishes at 00:59:59
only to be run again at 01:00:00—just a second later. To avoid this, durations are internally
reckoned in milliseconds, so 1 minute really means 60,000 milliseconds and 1 hour equals
3,600,000 milliseconds.
* In case a session runs longer than its repeat duration, the start of the next session is determined as illustrated below:
```
0|0———0|1———0|2———0|3———0|4———0|5———0|6———0|7———0|8———0|9———1|0———1|1———1|2———1|3———1|4———1|5———1|6———1|7———1|8———1|9
session pause jitter
|—————————————————————————————————————————————————————|=============================|~~~~~~~~~~~~~~~~~|
|:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::|~~~~~~~~~~~~~~~~~|
repeat jitter interval for start of next session
|+++++++++++++++++|+++++++++++++++|
session pause jitter
|—————————————————|=============================|~~~~~~~~~~~~~~~~~| jitter
|:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::|~~~~~~~~~~~~~~~~~|
repeat interval for start of next session
|+++++++++++++++++|+++++++++++++++|
session pause jitter
|—————|=============================|~~~~~~~~~~~~~~~~~| jitter
|:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::|~~~~~~~~~~~~~~~~~|
repeat interval for start of next session
|+++++++++++++++++|+++++++++++++++|
```
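A hypothetical call that puts the above settings together; scrape_front_page stands in for a real
scraping task, and obtaining the scheduler instance from the hub (as listed under Construction) is an
assumption:

```coffee
scrape_front_page = -> console.log "scraping at #{new Date().toISOString()}"   # stand-in task

vogue.scheduler.add_interval
  task:     scrape_front_page
  repeat:   '10 minutes'
  jitter:   '3 minutes'       # or a relative duration such as '25%'
  timeout:  '2 minutes'
  pause:    '5 minutes'
```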
* any number of sessions may be active at any given point in time
* for a number of reasons, it would be nice if no partial results from any session would appear in DB
* the solution for this would usually be to wrap each session into a DB transaction
* however, SQLite does not allow concurrent writes and sessions can take a long time (e.g. because the
network is slow); this risks transaction timeouts which in turn call for logic to handle this failure mode
* on the other hand, the expected number of rows to be inserted by any session is low (no more than a few
dozen to a few hundred rows), so at any rate the performance gain from a transaction-per-session model
should be negligible
* another alternative is to queue rows in memory and pass them to a single writer method that does an
atomic insert
* yet another possibility is to use implicit (or explicit, but short-duration) transactions and allow
partial results
* in case each session also registers its start and finish timestamps we can always detect incomplete
sessions and treat them accordingly (hide them or add notification)
* the 'random inserts with implicit short transactions' (RIWIST) model then seems to be the simplest way to
deal with concurrent sessions
* we won't get consecutive row numbers with RIWIST, but that should not pose a problem
* since each insert is done immediately instead of being deferred, we get a chance to grab the (possibly
updated) row with insert ... returning *; statements, so that is a plus over a deferred model should we
ever need that data.
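A minimal sketch of the RIWIST idea; it uses better-sqlite3 directly for brevity (the project itself
goes through DBay), and the table and column names are made up for illustration:

```coffee
Database = require 'better-sqlite3'
db       = new Database 'path/to/dbay-vogue.db'
db.exec "create table if not exists posts ( session_id integer, pid text, details text );"

### Each insert runs in its own implicit, short transaction; `returning *` hands back the row as
stored, so no deferred bookkeeping is needed (table and columns are illustrative only): ###
insert_post = db.prepare """
  insert into posts ( session_id, pid, details )
  values ( ?, ?, ? )
  returning *;"""

save_post = ( session_id, pid, details ) ->
  ### Called from within a session; safe to interleave with inserts from other sessions. ###
  return insert_post.get session_id, pid, JSON.stringify details
```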
* (absolute) duration: a string spelling out a duration using a float literal and a unit, separated by a
space. Allowed units are week, day, hour, minute, second, all in singular and plural. Examples:
'1.5 seconds', '1e2 minutes', '1 week'.
* percentage (for relative durations): a string spelling out a percentage with a float literal immediately
followed by a percent sign, %. Examples: '4.2%', '0%'. The amount must be between 0 and 100.
These 'five letter' methods are available both on the Vogue_scheduler class and its instances; for many
people, the most straightforward way to understand what these methods do will be to read the very simple
definitions and recognize they are very thin shims over the somewhat less convenient JavaScript methods:
```
every: ( dts, f ) -> setInterval f, dts * 1000
after: ( dts, f ) -> setTimeout f, dts * 1000
cease: ( toutid ) -> clearTimeout toutid
sleep: ( dts    ) -> new Promise ( done ) => setTimeout done, dts * 1000
defer: ( f = -> ) -> await sleep 0; return await f()
```
In each case, dts denotes an interval (delta time) measured in seconds (not milliseconds) and f
denotes a function. every() and after() return so-called timeout IDs (toutids), i.e. values that are
recognized by cease() (clearTimeout(), clearInterval()) to stop a one-off or repetitive timed function
call. sleep() returns a promise that should be awaited as in await sleep 3, which will allow other
tasks on the event loop to run and will resume execution no sooner than after 3000 milliseconds have
elapsed. Finally, there is defer(), which should also be awaited. It is a special use-case of sleep()
where the timeout is set to zero, so the remaining effect is that other tasks on the event loop get a
chance to run. It accepts an optional function argument whose (synchronous or asynchronous) result will
be returned.
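A short usage sketch, assuming Vogue_scheduler can be required from the package's main export and
constructed without arguments (both are assumptions for illustration):

```coffee
{ Vogue_scheduler } = require 'dbay-vogue'      # export name and path assumed
scheduler           = new Vogue_scheduler()     # argument-less construction assumed

do ->
  toutid = scheduler.every 3, -> console.log "called every 3 seconds"
  await scheduler.sleep 10                      # let the interval fire a few times
  scheduler.cease toutid                        # stop the repeating call
  await scheduler.defer -> console.log "runs once other queued tasks had their turn"
```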
* [–] documentation
* [–] repurpose posts table to only contain unique posts with primary key ( pid, version ) where version
is a small integer that allows to store updated details for a given post.
* [–] versioned posts would e.g. allow pricing fluctuations to be linked to constant articles; IOW with
versions, a constant 'article'/'product' can now have a varying price detail
* [–] old posts becomes posts_and_sessions (?) which details occurrences of posts in completed sessions
('sightings').
* [–] introduce tagging in its simplest form (only value-less, affirmative tags such as +interesting,
+seen, +hide, +pin-to-top). Tags are linked to unversioned posts via pid (foreign key ( pid ) references
scr__posts).
* [–] consider to allow to tie tags to versioned posts with the additional rule that a 'version' may
be a wildcard like * that applies to all versions of a post. Tying a tag to the version that I see in
this listing then means 'OK, not interested in this case, but do show again if any details should have
changed'; tying a tag to the symbolic version * communicates 'I've seen this and am not interested in
being alerted to any changes'.
* [–] Some tags may be given systematic behavior:
* +seen—de-emphasize/grey-out post where it appears
* +watch—the opposite of +seen, hilite post where it appears
* +pin-to-top—always show post near top of listings even when its position in the data source's
listing is further down
* other tags may be added at user request and be associated with a certain style / color / theme (maybe
associate arbitrary CSS with each tag)
* [–] allow to specify a list of keys into details containing facts that are to be writ large, such as
the price of a product. Maybe also allow to track specific facets with a sparkline?
* [–] allow to specify from which key to derive rank from? In that case it would be best to keep all
the data in details, including the № in the original listing, and have a mapping like { rank: 'nr', } or
{ rank: 'price', } to determine how to derive which aspect from which field.
* [–] problem with this: if the № becomes part of details then a change in the position of an item in the
listing would in itself be treated as an update to the details of an item, which is not what we want
* i.e. change in rank is categorically different from change in other details
* [–] consider to either wrap cheerio handling (or make it optional), or use a library with a more
transparent API
* [–] Vogue_kaiju (shim over Playwright) for in-browser scraping of dynamic HTML pages
* [–] see https://github.com/kudla/promise-status-async, https://stackoverflow.com/a/53328182/7568091 on
how to check for promise status;
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/race on how to
implement session timeout; https://github.com/dondevi/promise-abortable for cancelling promises
* [–] what causes the error when status in scr_tagged_posts is uncommented?
* [–] make it so that sparklines are flush right, i.e. the data point for the current session is 'now',
and 'now' is the maximum of the displayed x values
* [–] add chart with all combined sparklines from 'latest' view
* [–] add field (first, last, return, cont) to each data point of trends to allow for
coloring in charts
* [–] may want to implement an option where only those sessions are shown where at least one change occurred
* [–] simplify construction of customized instances; currently one has to do:
```
path = 'path/to/dbay-vogue.db'
db = new DBay { path, }
vdb = new Vogue_db { db, }
vogue = new Vogue { vdb, }
```
i.e. one must know and get hold of the secondary classes involved and introduce them at appropriate
junctures.
* [+] name
* [+] POC
* [+] rename round -> session
* [+] rename seq -> rank
* [+] rename progress -> trends?
* [+] use D3, plot to produce sparklines (MVP)
* [+] Vogue_scraper -> Vogue_scraper_ABC
* [+] in HN scraping, one could say that the 'ultimate' identity is really the content URL, which can be
associated with any number of discussion URLs as time passes
* [+] only display the most recent session entry for each item ID (and up to so-and-so many items per
page), i.e. if an item was present in sessions 12 and 11, only show the entry for session 12 (with the
sparkline that does show a dot for session 11); if an item dropped out in session 9, also show it (as long
as the table does not exceed the line limit)