website scraper
npm install dbay-vogue
Table of Contents generated with DocToc
- Web Site Scraper
- Vogue Scraper
- Scraped Data Format
- Vogue Scheduler
- Construction
- Concurrent Writes
- Relevant Data Types
- Async Primitives
- Vogue DB
- To Do
- Is Done
* Vogue
* v: Vogue is the so-called 'hub'; the main object to be instantiated
* v.scrapers: Vogue_scrapers
* v.scrapers.d: { [dsk]: Vogue_scraper, }
* v.vdb: Vogue_db
* v.server: Vogue_server
* v.scheduler: Vogue_scheduler
* Vogue_ops: slim on-page script (renders trendlines &c)
* Vogue_telegraph for websocket-based client/server communication
* based on Datom XEmitter
* Mandatory and Standard Items
* Freeform Items
* field details must be a JSONifiable object (preferably a flat one)
* can contain further items as seen fit
* will be rendered as a list of named entries in a format determined by CSS
* Vogue_scraper has prototype method html_from_details() to turn
contents of details field into HTML; can be overridden for individual purposes
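A minimal override sketch; the subclass name is made up, the export path of Vogue_scraper is an
assumption, and the markup below merely follows the 'list of named entries' idea described above:

```coffee
{ Vogue_scraper } = require 'dbay-vogue'      # export name and path assumed for illustration

class My_shop_scraper extends Vogue_scraper   # hypothetical subclass
  html_from_details: ( details ) ->
    ### Render each key/value pair of the freeform `details` object as one named entry;
    styling is left to CSS as described above. ###
    entries = ( "<li><b>#{key}:</b> #{value}</li>" for key, value of details )
    return "<ul class='details'>#{entries.join ''}</ul>"
```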
* scheduler.add_interval: ( cfg ) -> (see the example call after the diagram below)
* Mandatory properties of cfg:
* task: function | asyncfunction: the task to run. It will be called without arguments.
* repeat: duration: how often to repeat a task.
* Optional properties of cfg:
* jitter: duration | percentage (default null meaning 0 seconds): how much to randomly vary the
repetition rate. Example: when the repetition is 1 hour, the jitter is 5 minutes, and the task was
first started at 00:00, then it will be repeated somewhere between 00:55 and 01:05. A jitter of 25%
would equal 15 minutes in this case.
* timeout: duration (default: null meaning no timeout): how long to wait for a task to finish. When
a task has been stopped because of timeout, the next session will be started following the same rules
as if it finished normally.
* pause: duration (default: null meaning 0 seconds): specifies the minimum time between the
finishing of one session and the earliest allowed start of the next one, taking jitter into account.
Example: When a task has been scheduled with { repeat: '10 minutes', jitter: '3 minutes', pause: '5 minutes', },
is first run at 00:00, and takes 9 minutes (until 00:09) to finish, it would, without pause or jitter,
be scheduled to run next at 00:09 (i.e. immediately). However, we have to wait at least until
00:09 + 5 minutes to account for the pause, and because a non-zero jitter could cause the start to be
that much earlier, we have to postpone until 00:09 + 5 minutes + 3 minutes = 00:17.
Note that for simplicity, we have only used minutes and hours in the above, which in reality could
have the undesirable effect that a task scheduled to have a pause of 1 minute finishes at 00:59:59
only to be run again at 01:00:00—just a second later. To avoid this, durations are internally
reckoned in milliseconds, so 1 minute really means 60,000 milliseconds and 1 hour equals
3,600,000 milliseconds.
* In case a session runs longer than its repeat duration, the start of the next session is determined as illustrated below:
```
0|0———0|1———0|2———0|3———0|4———0|5———0|6———0|7———0|8———0|9———1|0———1|1———1|2———1|3———1|4———1|5———1|6———1|7———1|8———1|9
session pause jitter
|—————————————————————————————————————————————————————|=============================|~~~~~~~~~~~~~~~~~|
|:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::|~~~~~~~~~~~~~~~~~|
repeat jitter interval for start of next session
|+++++++++++++++++|+++++++++++++++|
session pause jitter
|—————————————————|=============================|~~~~~~~~~~~~~~~~~| jitter
|:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::|~~~~~~~~~~~~~~~~~|
repeat interval for start of next session
|+++++++++++++++++|+++++++++++++++|
session pause jitter
|—————|=============================|~~~~~~~~~~~~~~~~~| jitter
|:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::|~~~~~~~~~~~~~~~~~|
repeat interval for start of next session
|+++++++++++++++++|+++++++++++++++|
```
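A hypothetical call that puts the above settings together; scrape_front_page stands in for a real
scraping task, and obtaining the scheduler instance from the hub (as listed under Construction) is an
assumption:

```coffee
scrape_front_page = -> console.log "scraping at #{new Date().toISOString()}"   # stand-in task

vogue.scheduler.add_interval
  task:     scrape_front_page
  repeat:   '10 minutes'
  jitter:   '3 minutes'       # or a relative duration such as '25%'
  timeout:  '2 minutes'
  pause:    '5 minutes'
```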
* any number of sessions may be active at any given point in time
* for a number of reasons, it would be nice if no partial results from any session would appear in DB
* the solution for this would usually be to wrap each session into a DB transaction
* however, SQLite does not allow concurrent writes and sessions can take a long time (e.g. because the
network is slow); this risks transaction timeouts which in turn call for logic to handle this failure mode
* on the other hand, the expected number of rows to be inserted by any session is low (no more than a few
dozen to a few hundred rows), so at any rate the performance gain from a transaction-per-session model
should be negligible
* another alternative is to queue rows in memory and pass them to a single writer method that does an
atomic insert
* yet another possibility is to use implicit (or explicit, but short-duration) transactions and allow
partial results
* in case each session also registers its start and finish timestamps we can always detect incomplete
sessions and treat them accordingly (hide them or add notification)
* the 'random inserts with implicit short transactions' (RIWIST) model then seems to be the simplest way to
deal with concurrent sessions
* we won't get consecutive row numbers with RIWIST, but that should not pose a problem
* since each insert is done immediately instead of being deferred, we get a chance to grab the (possibly
updated) row with insert ... returning *; statements, so that is a plus over a deferred model should we
ever need that data.
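A minimal sketch of the RIWIST idea; it uses better-sqlite3 directly for brevity (the project itself
goes through DBay), and the table and column names are made up for illustration:

```coffee
Database = require 'better-sqlite3'
db       = new Database 'path/to/dbay-vogue.db'
db.exec "create table if not exists posts ( session_id integer, pid text, details text );"

### Each insert runs in its own implicit, short transaction; `returning *` hands back the row as
stored, so no deferred bookkeeping is needed (table and columns are illustrative only): ###
insert_post = db.prepare """
  insert into posts ( session_id, pid, details )
  values ( ?, ?, ? )
  returning *;"""

save_post = ( session_id, pid, details ) ->
  ### Called from within a session; safe to interleave with inserts from other sessions. ###
  return insert_post.get session_id, pid, JSON.stringify details
```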
* (absolute) duration: a string spelling out a duration using a float literal and a unit, separated by a
space. Allowed units are week, day, hour, minute, second, all in singular and plural. Examples:
'1.5 seconds', '1e2 minutes', '1 week'.
* percentage (for relative durations): a string spelling out a percentage with a float literal immediately
followed by a percent sign, %. Examples: '4.2%', '0%'. The amount must be between 0 and 100.
These 'five letter' methods are available both on the Vogue_scheduler class and its instances; for many
people, the most straightforward way to understand what these methods do will be to read the very simple
definitions and recognize they are very thin shims over the somewhat less convenient JavaScript methods:
```
every: ( dts, f ) -> setInterval f, dts * 1000
after: ( dts, f ) -> setTimeout f, dts * 1000
cease: ( toutid ) -> clearTimeout toutid
sleep: ( dts    ) -> new Promise ( done ) => setTimeout done, dts * 1000
defer: ( f = -> ) -> await sleep 0; return await f()
```
In each case, dts denotes an interval (delta time) measured in seconds (not milliseconds) and f
denotes a function. every() and after() return so-called timeout IDs (toutids), i.e. values that are
recognized by cease() (clearTimeout(), clearInterval()) to stop a one-off or repetitive timed function
call. sleep() returns a promise that should be awaited as in await sleep 3, which will allow other
tasks on the event loop to run and will resume execution no sooner than after 3000 milliseconds have
elapsed. Finally, there is defer(), which should also be awaited. It is a special use-case of sleep()
where the timeout is set to zero, so the remaining effect is that other tasks on the event loop get a
chance to run. It accepts an optional function argument whose (synchronous or asynchronous) result will
be returned.
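A short usage sketch, assuming Vogue_scheduler can be required from the package's main export and
constructed without arguments (both are assumptions for illustration):

```coffee
{ Vogue_scheduler } = require 'dbay-vogue'      # export name and path assumed
scheduler           = new Vogue_scheduler()     # argument-less construction assumed

do ->
  toutid = scheduler.every 3, -> console.log "called every 3 seconds"
  await scheduler.sleep 10                      # let the interval fire a few times
  scheduler.cease toutid                        # stop the repeating call
  await scheduler.defer -> console.log "runs once other queued tasks had their turn"
```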
* [–] documentation
* [–] repurpose posts table to only contain unique posts with primary key ( pid, version ) where version
is a small integer that allows to store updated details for a given post.
* [–] versioned posts would e.g. allow pricing fluctuations to be linked to constant articles; IOW with
versions, a constant 'article'/'product' can now have a varying price detail
* [–] old posts becomes posts_and_sessions (?) which details occurrences of posts in completed sessions
('sightings').
* [–] introduce tagging in its simplest form (only value-less, affirmative tags such as +interesting,
+seen, +hide, +pin-to-top). Tags are linked to unversioned posts via pid (foreign key ( pid ) references
scr__posts).
* [–] consider to allow to tie tags to versioned posts with the additional rule that a 'version' may
be a wildcard like * that applies to all versions of a post. Tying a tag to the version that I see in
this listing then means 'OK, not interested in this case, but do show again if any details should have
changed'; tying a tag to the symbolic version * communicates 'I've seen this and am not interested in
being alerted to any changes'.
* [–] Some tags may be given systematic behavior:
* +seen—de-emphasize/grey-out post where it appears
* +watch—the opposite of +seen, hilite post where it appears
* +pin-to-top—always show post near top of listings even when its position in the data source's
listing is further down
* other tags may be added at user request and be associated with a certain style / color / theme (maybe
associate arbitrary CSS with each tag)
* [–] allow to specify a list of keys into details containing facts that are to be writ large, such as
the price of a product. Maybe also allow to track specific facets with a sparkline?
* [–] allow to specify from which key to derive rank from? In that case it would be best to keep all
the data in details, including the № in the original listing, and have a mapping like { rank: 'nr', } or
{ rank: 'price', } to determine how to derive which aspect from which field.
* [–] problem with this: if the № becomes part of details then a change in the position of an item in the
listing would in itself be treated as an update to the details of an item, which is not what we want
* i.e. change in rank is categorically different from change in other details
* [–] consider to either wrap cheerio handling (or make it optional), or use a library with a more
transparent API
* [–] Vogue_kaiju (shim over Playwright) for in-browser scraping of dynamic HTML pages
* [–] see https://github.com/kudla/promise-status-async, https://stackoverflow.com/a/53328182/7568091 on
how to check for promise status;
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/race on how to
implement session timeout; https://github.com/dondevi/promise-abortable for cancelling promises
* [–] what causes the error when status in scr_tagged_posts is uncommented?
* [–] make it so that sparklines are flush right, i.e. the data point for the current session is 'now',
and 'now' is the maximum of the displayed x values
* [–] add chart with all combined sparklines from 'latest' view
* [–] add field (first, last, return, cont) to each data point of trends to allow for
coloring in charts
* [–] may want to implement an option where only those sessions are shown where at least one change occurred
* [–] simplify construction of customized instances; currently one has to do:
```
path = 'path/to/dbay-vogue.db'
db = new DBay { path, }
vdb = new Vogue_db { db, }
vogue = new Vogue { vdb, }
```
i.e. one must know and get hold of the secondary classes involved and introduce them at appropriate
junctures.
* [+] name
* [+] POC
* [+] rename round -> session
* [+] rename seq -> rank
* [+] rename progress -> trends?
* [+] use D3, plot to produce sparklines (MVP)
* [+] Vogue_scraper -> Vogue_scraper_ABC
* [+] in HN scraping, one could say that the 'ultimate' identity is really the content URL, which can be
associated with any number of discussion URLs as time passes
* [+] only display the most recent session entry for each item ID (and up to so-and-so many items per
page), i.e. if an item was present in sessions 12 and 11, only show the entry for session 12 (with the
sparkline that does show a dot for session 11); if an item dropped out in session 9, also show it (as long
as the table does not exceed the line limit)