standardized immutable objects in the spirit of datomic, especially suited for use in data pipelines
npm install datom
Table of Contents generated with DocToc
- Dataclasses
- Export Bound Methods
- Creation of Bespoke Library Instances
- Configuration Parameters
- Methods
- Freezing & Thawing
- Stamping
- Type Testing
- Value Creation
- Selecting
- System Properties
- WIP
- PipeDreams Datoms (Data Events)
- select = ( d, selector ) ->
- Benchmarks
- To Do
* Dataclasses combine ES6 classes with Intertype type declarations
* derive your class from `( require 'datom' ).Dataclass`
* declare the class property `declaration` as an Intertype declaration
* simple example:
```coffee
{ Dataclass } = require 'datom'
class Quantity extends Dataclass
  @declaration:
    fields:
      q:    'float'
      u:    'nonempty.text'
    template:
      q:    0
      u:    'unit'
```
* now, when you do `q = new Quantity()`, you get a frozen object with properties `{ q: 0, u: 'unit', }`
  which is the default value for that class (representing the most generic measurement, zero
  dimensionless units)
* this is probably not very useful, so pass in values to override defaults, as in `new Quantity { u: 'km', }`
  to define a length or `new Quantity { q: 12.5, u: 's', }` to define a time span
* can modify using `quantity = DATOM.lets quantity, ( quantity ) -> quantity.q = 120`
* by default, instances of derivatives of `Dataclass` are deep-frozen, meaning neither the instance itself nor
  its properties can be mutated
* default can be made explicit by adding `freeze: 'deep'` to type declaration
* can also do shallow freezing by setting `freeze: true` or `freeze: 'shallow'`; in that case, properties
  like lists and objects can still be mutated, but properties can not be reassigned, added or deleted
* setting `freeze: false` will result in a fully mutable object
* `DATOM.thaw x` can always be used to obtain a fully mutable copy where that is called for
* one feature that may be useful for some use cases is that `Dataclass` instances can have computed
  properties; define those in the constructor:
```coffee
#.......................................................................................................
class Something extends Dataclass
  @declaration:
    freeze: false
    fields:
      mode:   'nonempty.text'
      name:   'nonempty.text'
    template:
      mode:   null
      name:   null
  constructor: ( P... ) ->
    super P...
    GUY.props.def @, 'id',
      enumerable: true
      get: => "#{@mode}:#{@name}"
      set: ( value ) =>
        @__types.validate.nonempty.text value
        parts = value.split ':'
        @mode = parts[ 0 ]
        @name = parts[ 1 .. ].join ':'
        return null
    return undefined
#.......................................................................................................
s = new Something { mode: 'mymode', name: 'p', }
debug '^464561^', s
T?.eq s, { mode: 'mymode', name: 'p', id: 'mymode:p', }
debug '^464561^', s.id
s.id = 'foo:bar'
T?.eq s, { mode: 'foo', name: 'bar', id: 'foo:bar', }
```
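The difference between the freezing modes listed above can be illustrated in plain JavaScript, independently of the library. This is only a sketch: it uses `Object.freeze()` for shallow freezing and a hypothetical helper `deepFreeze()` (not part of `datom`) for deep freezing; how `Dataclass` implements these modes internally is not shown here.

```javascript
// Hypothetical helper, not part of datom: freezes an object and, recursively,
// all objects reachable through its own enumerable properties.
// It does not guard against cyclic references.
function deepFreeze(value) {
  if (value === null || typeof value !== 'object') return value;
  Object.freeze(value);
  for (const key of Object.keys(value)) deepFreeze(value[key]);
  return value;
}

// Shallow: the nested list stays mutable, top-level properties do not.
const shallow = Object.freeze({ q: 1, tags: ['a'] });
shallow.tags.push('b');
try { shallow.q = 2; } catch (error) { /* throws only in strict mode */ }

// Deep: the nested list is frozen as well, so `push()` throws a TypeError.
const deep = deepFreeze({ q: 1, tags: ['a'] });
let deepError = null;
try { deep.tags.push('b'); } catch (error) { deepError = error; }
```

In both cases the top-level property `q` keeps its original value; only the deep variant also protects `tags`.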
NOTE: Documentation is outdated. WIP.
If you plan on using methods like new_datom() or select() a lot, consider using .export():
```coffee
DATOM = require 'datom'
{ new_datom
  select }  = DATOM.export()
```
Now `new_datom()` and `select()` are methods bound to `DATOM`. (Observe that because of the JavaScript
'tear-off' effect, when you do `method = DATOM.method`, then `method()` will likely fail as its reference to
`this` has been lost.)
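The tear-off effect is a property of JavaScript itself and can be reproduced without the library. The sketch below uses a stand-in class `Lib` with a method `make_key()` (both hypothetical, not part of `datom`); binding the method restores the lost `this`, which is what `.export()` is described as doing for you.

```javascript
// Stand-in for a library object whose methods rely on `this`.
class Lib {
  constructor() { this.prefix = '^'; }
  make_key(name) { return this.prefix + name; }
}

const lib = new Lib();

// Tear-off: the method is detached from its object, so `this` is undefined
// inside it (class bodies are always strict) and the call throws a TypeError.
const torn = lib.make_key;
let tornError = null;
try { torn('text'); } catch (error) { tornError = error; }

// Binding keeps the reference to the original object.
const bound = lib.make_key.bind(lib);
const boundKey = bound('text');
```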
In order to configure a copy of the library, pass in a settings object:
```coffee
_DATOM    = require 'datom'
settings  = { merge_values: false, }
DATOM     = new _DATOM.Datom settings
{ new_datom
  select }  = DATOM.export()
```
Or, more idiomatically:
```coffee
DATOM = new ( require 'datom' ).Datom { merge_values: false, }
{ new_datom
  select }  = DATOM.export()
```
The second form also helps to avoid accidental usage of the result of require 'datom', which is of
course the same library with a different configuration.
* `merge_values` (boolean, default: `true`): Whether to merge attributes of the second argument to `new_datom()`
  into the resulting value. When set to `false`, `new_datom '^somekey', somevalue` will always
  result in a datom `{ $key: '^somekey', $value: somevalue, }`; when left to the default, and if `somevalue`
  is an object, then its attributes will become attributes of the datom, which may result in name clashes in
  case any attribute name should start with a `$` (dollar sign).
* `freeze` (boolean, default: `true`): Whether to freeze datoms. When set to `false`, no freezing will
  be performed, which may entail slightly improved performance.
* `dirty` (boolean, default: `true`): Whether to automatically set `{ $dirty: true, }` when the copy
  of a datom has been treated with `lets()` and a modifier function.
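To make the effect of `merge_values` concrete, here is a minimal sketch in plain JavaScript. The function `makeDatom()` is a hypothetical re-implementation for illustration only, not the library's `new_datom()`; in particular, treating arrays as non-objects and letting `$key` win over a clashing attribute are assumptions.

```javascript
// Hypothetical stand-in for `new_datom()`, illustrating `merge_values` only.
function makeDatom($key, $value, settings = { merge_values: true }) {
  const isObject = $value !== null && typeof $value === 'object' && !Array.isArray($value);
  if (settings.merge_values && isObject) {
    // Attributes of the value become attributes of the datom;
    // `$key` is written last so it cannot be overwritten (assumption).
    return { ...$value, $key };
  }
  // Non-objects, or `merge_values: false`: the value goes into `$value`.
  return { $key, $value };
}

const merged   = makeDatom('^point', { x: 1, y: 2 });
const unmerged = makeDatom('^point', { x: 1, y: 2 }, { merge_values: false });
const scalar   = makeDatom('^number', 123);
```

Here `merged` carries `x` and `y` directly, whereas `unmerged` and `scalar` keep their payload under `$value`.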
* @freeze = ( d ) ->
* @thaw = ( d ) ->
* @lets = ( original, modifier ) ->
* @set = ( d, k, P... ) ->
* @unset = ( d, k ) ->
* @stamp = ( d, P... ) ->
* @unstamp = ( d ) ->
* @is_system = ( d ) ->
* @is_stamped = ( d ) ->
* @is_fresh = ( d ) ->
* @is_dirty = ( d ) ->
* @new_datom = ( $key, $value, other... ) ->
* @new_single_datom = ( $key, $value, other... ) ->
* @new_open_datom = ( $key, $value, other... ) ->
* @new_close_datom = ( $key, $value, other... ) ->
* @new_system_datom = ( $key, $value, other... ) ->
* @new_text_datom = ( $value, other... ) ->
* @new_end_datom = ->
* @new_warning = ( ref, message, d, other... ) ->
* @select = ( d, selector ) ->
* `d.$key`: key (i.e., type) of a datom.
* `d.$value`: 'the' proper value of a datom. This is always used in case `new_datom()` was called with a
  non-object in the value slot (as in `new_datom '^mykey', 123`), or when the library was configured with
  `{ merge_values: false, }`. In case there is no `d.$value`, the datom's proper value is the object that would
  result from deleting all properties whose names start with a `$` (dollar sign).
* `d.$dirty`: whether the object has been (thawed, then) changed (and then frozen again) since its `$dirty`
  property was last cleared or set to `false`.
* `d.$stamped`: whether the object has been marked as 'stamped' (i.e., processed).
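The rule for a datom's 'proper value' described above can be sketched as follows. `properValue()` is a hypothetical helper written for this illustration; it is not claimed to be part of the library's API.

```javascript
// Hypothetical helper: returns `$value` where present, otherwise a copy of
// the datom without its `$`-prefixed (system) properties.
function properValue(d) {
  if (Object.prototype.hasOwnProperty.call(d, '$value')) return d.$value;
  const result = {};
  for (const [name, value] of Object.entries(d)) {
    if (!name.startsWith('$')) result[name] = value;
  }
  return result;
}

const fromValue  = properValue({ $key: '^mykey', $value: 123 });
const fromMerged = properValue({ $key: '^point', $stamped: true, x: 1, y: 2 });
```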
-------------------------------------------------------------------------------
The below copied from PipeDreams docs, to be updated
Data streams—of which pull-streams,
PipeStreams, and NodeJS
Streams are examples—do their work by
sending pieces of data (that originate from a data source) through a number of
transforms (to finally end up in a data sink). (note)
> (note) I will ignore here alternative ways of dealing with streams, especially
> the `EventEmitter` way of dealing with streamed data.
> When I say 'streams', I also implicitly mean 'pipelines'; when I say
> 'pipelines', I also implicitly mean 'pipelines to stream data' and 'streams'
> in general.
When NodeJS streams started out, the thinking about those streams was pretty
much confined to saying that 'a stream is a series of
bytes'. Already back then,
an alternative view took hold (I'm slightly paraphrasing here):
> The core interpretation was that stream could be buffers or strings - but the
> userland interpretation was that a stream could be anything that is
> serializeable [...] it was a sequence of buffers, bytes, strings or objects.
> Why not use the same api?
I will not repeat here what I've written about perceived shortcomings of NodeJS
streams;
instead, let me list a few observations:
* In streaming, data is just data. There's no need for having a separate
'Object Mode' or
somesuch.
* There's a single exception to the above rule, and that is when the data item
being sent down the line is null. This has historically—by both NodeJS
streams and pull-streams—been interpreted as a termination signal, and I'm not
going to change that (although at some point I might as well).
* When starting out with streams and building fairly simple-minded pipelines,
sending down either raw pieces of business data or else null to indicate
termination is enough to satisfy most needs. However, when one transitions to
more complex environments, raw data is not sufficient any more: When
processing text from one format to another, how could a downstream transform
tell whether a given piece of text is raw data or the output of an upstream
transform?
Another case where raw data becomes insufficient are circular
pipelines—pipelines that re-compute (some or all) output values in a recursive
manner. An example which outputs the integer sequences of the Collatz
Conjecture is in the tests
folder.
There, whenever we see an even number `n`, we send down that even number `n`
along with half its value, `n/2`; whenever we see an odd number `n`, we
send it on, followed by its value tripled plus one, `3*n+1`. No matter whether
you put the transform for even numbers in front of that for odd numbers or the
other way round, there will be numbers that come out at the bottom that need
to be re-input into the top of the pipeline, and since there's no telling in
advance how long a Collatz sequence will be for a given integer, it is, in the
general case, insufficient to build a pipeline made from a (necessarily
finite) repetitive sequence of copies of those individual transforms. Thus,
classical streams cannot easily model this kind of processing.
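The recycling idea can be sketched without any streaming library. The following is not the PipeDreams implementation from the tests folder; it merely models the 'top of the pipeline' as a plain queue (an assumption made for brevity) into which the two transforms push the values they compute. It expects a positive integer as input.

```javascript
// Minimal model of a circular pipeline: values computed by the transforms
// are fed back into the top until the sequence reaches 1.
function collatz(start) {
  const output = [];
  const queue = [start];                      // values waiting to enter the pipeline
  while (queue.length > 0) {
    const n = queue.shift();
    output.push(n);                           // every number is sent on
    if (n === 1) continue;                    // end of sequence: nothing to recycle
    if (n % 2 === 0) queue.push(n / 2);       // transform for even numbers
    else             queue.push(3 * n + 1);   // transform for odd numbers
  }
  return output;
}

const sequence = collatz(6);  // [ 6, 3, 10, 5, 16, 8, 4, 2, 1 ]
```

The loop runs for as long as values keep coming back, which is exactly what a fixed, finite chain of transforms cannot do.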
The idea of datoms (short for *data atoms*, a term borrowed from Rich
Hickey's Datomic) is
simply to wrap each piece of raw data in a higher-level structure. This is of
course an old idea, but not one that is very prevalent in NodeJS streams, the
fundamental assumption (of classical stream processing) being that all stream
transforms get to process each piece of data, and that all pieces of data are of
equal status (with the exception of null).
The PipeDreams sample implementation of Collatz Sequences uses datoms to (1)
wrap the numerical pieces of data, which allows data to be marked as processed
(a.k.a. 'stamped'), to (2) mark data as 'to be recycled', and to (3) inject
system-level synchronization signals into the data stream to make sure that
recycled data gets processed before new data is allowed into the stream.
In PipeDreams datoms, **each piece of data is explicitly labelled for its
type; each datom may have a different status: there are system-level
datoms that serve to orchestrate the flow of data within the pipeline**; there
are user-level datoms which originate from the application; there are
**datoms to indicate the opening and closing of regions (phases) in the data
stream; there are stream transforms that listen to and act on specific
system-level events**.
Datoms are JS objects that must minimally have a `key` property, a string that
specifies the datom's category, namespace and name; in addition, they may have a
`value` property with the payload (where desired), and any number of other
attributes. The property `$` is used to carry metadata (e.g. from which line in
a source file a given datom was generated from). Thus, we may give the outline
of a datom as (in a rather informal notation)
`d := { $key, ?$value, ?$stamped,..., ?$, }`.
The `key` of a datom must be a string that consists of at least two parts, the
*sigil* and the *name*. The sigil, a single punctuation character, indicates
the 'category' of each datom; there are two levels and three elementary
categories, giving six types of datoms:
* Application level:
  * `^` for data datoms (a.k.a. 'singletons'),
  * `<` for start-of-region datoms,
  * `>` for end-of-region datoms.
* System level:
  * `~` for data datoms,
  * `[` for start-of-region datoms,
  * `]` for end-of-region datoms.
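The six sigils listed above lend themselves to a simple lookup table. The sketch below is an illustration only; `SIGILS` and `parseKey()` are hypothetical names and not part of the library.

```javascript
// Lookup table built from the six sigils: two levels times three categories.
const SIGILS = {
  '^': { level: 'application', category: 'data' },
  '<': { level: 'application', category: 'start-of-region' },
  '>': { level: 'application', category: 'end-of-region' },
  '~': { level: 'system',      category: 'data' },
  '[': { level: 'system',      category: 'start-of-region' },
  ']': { level: 'system',      category: 'end-of-region' },
};

// Hypothetical helper: splits a key into its sigil and name and looks up
// level and category; throws on keys that are too short or lack a known sigil.
function parseKey(key) {
  if (typeof key !== 'string' || key.length < 2) {
    throw new TypeError(`invalid key: ${String(key)}`);
  }
  const sigil = key[0];
  const info = SIGILS[sigil];
  if (info === undefined) throw new TypeError(`unknown sigil in key: ${key}`);
  return { sigil, name: key.slice(1), ...info };
}

const textKey = parseKey('^text');  // application-level data datom named 'text'
const divEnd  = parseKey('>div');   // application-level end-of-region datom named 'div'
```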
Normally, one will probably want to send around business data inside (the
value property of) application-level data datoms (hence their name, also
shortened to D-datoms); however, one can also set other properties of datom
objects, or send data around using properties of start- or end-of-region datoms.
Region events are intended to be used e.g. when parsing text with markup; say
you want to turn a snippet of HTML like this:
```html
<document><div>Helo <em>world!</em></div></document>
```
into another textual representation, you may want to turn that into a sequence
of datoms similar to these, in the order of sending and regions symbolized by
boxes: (note)
```
--------------------------------------------------------+
  { key: '<document', }                   # d1          |
------------------------------------------------------+ |
  { key: '<div', }                        # d2        | |
  { key: '^text', value: "Helo ", }       # d3        | |
----------------------------------------------------+ | |
  { key: '<em', }                         # d4      | | |
  { key: '^text', value: "world!", }      # d5      | | |
  { key: '>em', }                         # d6      | | |
----------------------------------------------------+ | |
  { key: '>div', }                        # d7        | |
------------------------------------------------------+ |
  { key: '>document', }                   # d8          |
--------------------------------------------------------+
```
> (note) By 'in the order of sending' I mean you'd have to send datom `d1`
> first, then `d2` and so on. Trivial until you imagine you write a pipeline and
> then picture how the events will travel down that pipeline:
>
>     pipeline.push $do_this()            # s1, might be processing d3 right now
>     pipeline.push $do_that()            # s2, might be processing d2 right now
>     pipeline.push $do_something_else()  # s3, might be processing d1 right now
>
> Although there's really no telling whether step `s3` will really process datom `d1`
> at the 'same point in time' that step `s2` processes datom `d2` and so on
> (in the strict sense, this is hardly possible in a single-threaded language
> anyway), the visualization still holds a grain of truth: stream transforms
> that come 'later' (further down) in the pipeline will see events near the top
> of your to-do list first, and vice versa. This can be mildly confusing.
The `select` method can be used to determine whether a given event `d` matches a
set of conditions; typically, one will want to use `select d, selector` to decide
whether a given event is suitable for processing by the stream transform at
hand, or whether it should be passed on unchanged.
The current implementation of `select()` is much dumber and faster than its predecessors; where previously,
it was possible to match datoms with multiple selectors that contained multiple sigils and so forth, the new
version does little more than check whether the single selector allowed equals the given datom's `key`
value. That's about it, except that one can still `select d, '^somekey#stamped'` to match both unstamped and
stamped datoms.
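The matching rule just described is simple enough to sketch in a few lines. What follows is a hypothetical re-implementation for illustration, not the library's `select()`. It assumes that the key is stored in `$key`, that stamped datoms carry `$stamped: true`, and that a selector without the `#stamped` suffix matches unstamped datoms only.

```javascript
// Hypothetical sketch of the selection rule; see the assumptions stated above.
function select(d, selector) {
  const suffix = '#stamped';
  const acceptStamped = selector.endsWith(suffix);
  const key = acceptStamped ? selector.slice(0, -suffix.length) : selector;
  // Stamped datoms are only matched when the selector asks for them.
  if (d.$stamped === true && !acceptStamped) return false;
  // Otherwise, a plain equality check against the datom's key.
  return d.$key === key;
}

const fresh   = { $key: '^somekey' };
const stamped = { $key: '^somekey', $stamped: true };

const matchesFresh      = select(fresh, '^somekey');            // true
const skipsStamped      = select(stamped, '^somekey');          // false
const matchesStampedToo = select(stamped, '^somekey#stamped');  // true
```

In a stream transform, one would typically pass a datom on unchanged whenever such a check returns `false`.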
Here is a speed comparison (code on GitHub) between Datom versions 7 and 8, using two methods of dealing with object freezing
and two Datom configurations, `f1` standing for the standard configuration (i.e. either `DATOM = require
'datom'` or `DATOM = ( require 'datom' ).new { freeze: true, }`) and `f0` for the non-freezing configuration
(obtained by `DATOM = ( require 'datom' ).new { freeze: false, }`). `datom_v7_thaw_freeze_f0` is missing here
because of a bug in the `thaw` method used in v7. Each run involved thawing 100 datoms with 5 key/value
pairs each (ex.: `{ '$key': '^vapeurs', '𤭨': 447, '媑': true, escamote: false, auditionnerais: true,
exacerbant: true, }`), changing 3 values and freezing the object again. Tests marked `...thaw_freeze...` use
explicit calls to `d = thaw d; ...; d = freeze d` to do this, the ones marked `...lets...` use a single call
`d = lets d, ( d ) -> ...` to accomplish the same.
We see an overall improvement in the performance of v8 as compared to v7 which can be ascribed to the update
of the `letsfreezethat` dependency which represents a
complete overhaul of that library:
```
datom_v8_thaw_freeze_f0 144,938 Hz 100.0 % │████████████▌│
datom_v8_lets_f0 128,930 Hz 89.0 % │███████████▏ │
datom_v8_thaw_freeze_f1 126,920 Hz 87.6 % │███████████ │
datom_v7_lets_f0 92,669 Hz 63.9 % │████████ │
datom_v8_lets_f1 81,917 Hz 56.5 % │███████▏ │
datom_v7_lets_f1 40,063 Hz 27.6 % │███▌ │
datom_v7_thaw_freeze_f1 39,334 Hz 27.1 % │███▍ │
```
For best performance, it is recommended to
* prefer `d = thaw d; ...; d = freeze d` over `lets()` although the latter is more elegant and prevents
  one from forgetting to `freeze()` a `thaw()`ed value, and to
* configure the `DATOM` library to forego actual freezing when moving from development to production, where
  appropriate, for a speed gain of around 10%.
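The two update styles compared above look roughly like this in plain JavaScript. The functions `thaw()`, `freeze()` and `lets()` below are shallow stand-ins written for this illustration; they are not the library's implementations and say nothing about its performance.

```javascript
// Stand-ins for illustration only (shallow copy, shallow freeze).
const thaw   = (d) => ({ ...d });
const freeze = (d) => Object.freeze(d);
const lets   = (d, modifier) => {
  const draft = thaw(d);
  modifier(draft);
  return freeze(draft);
};

const d0 = freeze({ $key: '^number', n: 1 });

// Style 1: explicit thaw and freeze; forgetting the last step would leave
// `d1` mutable.
let d1 = thaw(d0);
d1.n = 2;
d1 = freeze(d1);

// Style 2: a single call that cannot forget to freeze.
const d2 = lets(d0, (d) => { d.n = 3; });
```

Both styles leave the original `d0` untouched and end with a frozen copy; the first trades a little safety for fewer function calls.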
* [ ] implement piecemeal structural validation such that on repeated calls to a validator instance's
  `validate()` method an error will be thrown as soon as unbalanced regions (delimited by
  `{ $key: '<token', ..., }` and `{ $key: '>token', ..., }`) are encountered.
* [ ] VNRs:
  * [X] implement Vectorial NumbeRs (VNRs)
  * [ ] document Vectorial NumbeRs (VNRs)
  * [ ] remove either `cmp_total()` or `cmp_partial()` for simplification
  * [ ] assert and document that VNRs may be sorted element-wise lexicographically (e.g. in Postgres, but
    also in JS) by appending a single zero element (or, for that matter, by padding as many zeroes as needed
    to make all VNRs the same length)
  * [ ] consider disallowing a final zero element in VNRs
  * [ ] consider storing VNRs with an appended zero element
* [ ] implement & document standard attributes, `$`-prefixed and otherwise (?), such as
  * [ ] `^text`: key for 'text datoms'
  * [ ] `text`: the underlying source text where code, data is parsed
  * [ ] `$`: 'produced by', contains short label to point to source position, may be left-chained (most
    recent first) to obtain breadcrumbs path of responsible source locations
  * [ ] `$vnr`: for VNRs, the primary ordering criterion
  * [ ] `$ref`: do we still use this? See DataMill
  * [ ] `$pos`? `$range`? for `[ start, stop, ]` pairs, indices into a source; use inclusive or exclusive
    upper bound?
  * [ ] `$loc`? for `[ line_nr, col_nr, ]` pairs; NB might also want to use stop position of ranges
* [X] make `{ dirty: false, }` the default setting (i.e. not marking changed datoms)
* [ ] consider removing `$dirty` altogether; datoms-as-immutable-values can not be updated anyway, and
  whether an operation like `d2 = lets d1, ( d ) -> ...` has or has not caused any differences between `d1`
  and `d2` (short of a Turing-complete analysis of the function passed in to `lets()`) is only answerable
  by comparing all members of both datoms.
* [ ] Dependency `emittery@0.7.0` changed
  behavior: "Ensure `.emit()` doesn't return a value" which breaks contracts. The fix currently consists in
  not upgrading from 0.6.0 until a workaround has been implemented.
* [ ] Allow instantiation with configuration for freezing (`Object.freeze()`, `letsfreezethat.freeze()`)
  and cloning (`structuredClone()`, `Object.assign()`, `GUY.props.nonull_assign()`)
* [ ] re-implement (syntax or method for) selecting stamped datoms
* [ ] implement wildcards for `select()`; cache selectors to avoid re-interpretation of recurrent patterns
* [ ] dataclasses should optionally be mutable
* [X] make deep-freezing the default for `Dataclass`?
* [ ] devise way to declare type upon class declaration (possible at all?); next best solution:
  add class method `register()` or `declare()`
* [ ] when declaring, validating dataclass instances, consider using a private name (symbol) to avoid any
  chance of name clashes