# strtok3

A promise-based streaming tokenizer.
The strtok3 module provides several methods for creating a tokenizer from various input sources.
strtok3 can read from:
- Blobs
- buffers (Uint8Array)
- files (Node.js only)
- Node.js readable streams
- WHATWG (web) readable streams

## Installation

```sh
npm install strtok3
```
## Compatibility

Starting with version 7, the module has migrated from CommonJS to a pure ECMAScript Module (ESM).
The distributed JavaScript codebase is compliant with the ECMAScript 2020 (11th Edition) standard.
Requires a modern browser, a Node.js (V8) ≥ 18 engine, or Bun (JavaScriptCore) ≥ 1.2.
For backward compatibility with TypeScript CommonJS projects, you can use load-esm, as sketched below.
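For example, a minimal sketch of loading strtok3 from a CommonJS context, assuming the loadEsm helper exported by load-esm:

```js
// Sketch: dynamically import the ESM-only strtok3 from CommonJS via load-esm.
import { loadEsm } from 'load-esm';

async function loadStrtok3() {
  // loadEsm performs a dynamic import(), avoiding require() on an ESM module
  const strtok3 = await loadEsm('strtok3');
  return strtok3;
}
```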
> [!NOTE]
> This module requires a Node.js ≥ 18 engine.
> It can also be used in a browser environment when bundled with a module bundler.
## Support the Project
If you find this project useful and would like to support its development, consider sponsoring or contributing:
- Become a sponsor to Borewit
- Buy me a coffee
## API Documentation

### strtok3 methods
Use one of the methods to instantiate an abstract tokenizer:
- fromBlob
- fromBuffer
- fromFile*
- fromStream*
- fromWebStream
> [!NOTE]
> fromFile and fromStream are only available when using this module with Node.js
All methods return a Tokenizer, either directly or via a promise.
#### fromBlob() function
Create a tokenizer from a Blob.
```ts
function fromBlob(blob: Blob, options?: ITokenizerOptions): BlobTokenizer
```
| Parameter | Optional | Type | Description |
|-----------|-----------|---------------------------------------------------|----------------------------------------------------------------------------------------|
| blob | no | Blob | Blob or File to read from |
| options | yes | ITokenizerOptions | Tokenizer options |
Returns a tokenizer.
```js
import { fromBlob } from 'strtok3';
import { openAsBlob } from 'node:fs';
import * as Token from 'token-types';

async function parse() {
  const blob = await openAsBlob('somefile.bin');
  const tokenizer = fromBlob(blob);
  const myUint8Number = await tokenizer.readToken(Token.UINT8);
  console.log(`My number: ${myUint8Number}`);
}

parse();
```
#### fromBuffer() function
Create a tokenizer from memory (Uint8Array or Node.js Buffer).
```ts
function fromBuffer(uint8Array: Uint8Array, options?: ITokenizerOptions): BufferTokenizer
```
| Parameter | Optional | Type | Description |
|------------|----------|--------------------------------------------------|-----------------------------------|
| uint8Array | no | Uint8Array | Buffer or Uint8Array to read from |
| options | yes | ITokenizerOptions | Tokenizer options |
Returns a tokenizer.
```js
import { fromBuffer } from 'strtok3';
import * as Token from 'token-types';

const buffer = new Uint8Array([0x2a]); // example data to tokenize
const tokenizer = fromBuffer(buffer);

async function parse() {
  const myUint8Number = await tokenizer.readToken(Token.UINT8);
  console.log(`My number: ${myUint8Number}`);
}

parse();
```
#### fromFile() function
Creates a tokenizer from a local file.
```ts
function fromFile(sourceFilePath: string): Promise<FileTokenizer>
```
| Parameter | Type | Description |
|----------------|----------|----------------------------|
| sourceFilePath | string | Path to file to read from |
> [!NOTE]
> - Only available for Node.js engines
> - fromFile automatically embeds file information (see IFileInfo)
Returns a Promise resolving to a tokenizer which can be used to parse the file.
```js
import { fromFile } from 'strtok3';
import * as Token from 'token-types';

async function parse() {
  const tokenizer = await fromFile('somefile.bin');
  try {
    const myNumber = await tokenizer.readToken(Token.UINT8);
    console.log(`My number: ${myNumber}`);
  } finally {
    await tokenizer.close(); // Close the file
  }
}

parse();
```
#### fromWebStream() function
Create a tokenizer from a WHATWG ReadableStream.
```ts
function fromWebStream(webStream: AnyWebByteStream, options?: ITokenizerOptions): ReadStreamTokenizer
```
| Parameter | Optional | Type | Description |
|----------------|----------|--------------------------------------------------------------------------|------------------------------------|
| webStream | no | ReadableStream | WHATWG ReadableStream to read from |
| options | yes | ITokenizerOptions | Tokenizer options |
Returns a tokenizer.
```js
import { fromWebStream } from 'strtok3';
import * as Token from 'token-types';

async function parse(readableStream) {
  const tokenizer = fromWebStream(readableStream);
  try {
    const myUint8Number = await tokenizer.readToken(Token.UINT8);
    console.log(`My number: ${myUint8Number}`);
  } finally {
    await tokenizer.close();
  }
}

// Any WHATWG ReadableStream will do, e.g. response.body from fetch() or blob.stream()
parse(new Blob([new Uint8Array([42])]).stream());
```
### Tokenizer object

The tokenizer is an abstraction of a stream, file, or Uint8Array, allowing _reading_ or _peeking_ from the stream.
It can also be translated into chunked reads, as done in @tokenizer/http.
#### Key Features:
- Supports seeking within the stream using tokenizer.ignore().
- Offers peek methods to preview data without advancing the read pointer.
- Maintains the read position via tokenizer.position.
#### Tokenizer functions
_Read_ methods advance the stream pointer, while _peek_ methods do not.
There are two kinds of functions:
1. read methods: read a token or buffer from the tokenizer. The tokenizer-stream position advances by the size of the token.
2. peek methods: same as read, but they do not advance the pointer, allowing you to read (peek) ahead.
#### readBuffer function
Read data from the _tokenizer_ into the provided buffer (Uint8Array).

```ts
readBuffer(buffer: Uint8Array, options?: IReadChunkOptions): Promise<number>
```
| Parameter | Type                 | Description                                   |
|-----------|----------------------|-----------------------------------------------|
| buffer    | Buffer \| Uint8Array | Target buffer to write the data read to       |
| options   | IReadChunkOptions    | Read behaviour options; see IReadChunkOptions |

Returns a promise resolving to the number of bytes read.
The number of bytes read may be less than requested if the mayBeLess flag is set.
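For example, a minimal sketch reading the first bytes of an in-memory source into a pre-allocated buffer:

```js
import { fromBuffer } from 'strtok3';

const tokenizer = fromBuffer(new Uint8Array([0x49, 0x44, 0x33, 0x04]));
const target = new Uint8Array(4); // target buffer to fill

// Fills `target` and advances tokenizer.position by the number of bytes read
const bytesRead = await tokenizer.readBuffer(target, { mayBeLess: true });
console.log(bytesRead, tokenizer.position); // 4 4
```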
#### peekBuffer function
Peek (read ahead) from the tokenizer into the buffer, without advancing the stream pointer.

```ts
peekBuffer(uint8Array: Uint8Array, options?: IReadChunkOptions): Promise<number>
```

| Parameter  | Type                 | Description                                       |
|------------|----------------------|---------------------------------------------------|
| uint8Array | Buffer \| Uint8Array | Target buffer to write the data read (peeked) to. |
| options    | IReadChunkOptions    | Read behaviour options; see IReadChunkOptions.    |

Returns a promise resolving to the number of bytes peeked.
The number of bytes may be less than requested if the mayBeLess flag was set.
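For example, a short sketch showing that peeking leaves the read position untouched:

```js
import { fromBuffer } from 'strtok3';

const tokenizer = fromBuffer(new Uint8Array([0x01, 0x02, 0x03]));
const peeked = new Uint8Array(2);

await tokenizer.peekBuffer(peeked); // peeked = [0x01, 0x02]
console.log(tokenizer.position);    // 0: the stream pointer did not advance
```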
#### readToken function
Read a token from the tokenizer-stream.
```ts
readToken<T>(token: IGetToken<T>, position: number = this.position): Promise<T>
```
| Parameter | Type         | Description                                                                                              |
|-----------|--------------|----------------------------------------------------------------------------------------------------------|
| token     | IGetToken<T> | Token to read from the tokenizer-stream.                                                                   |
| position? | number       | Offset where to begin reading within the file. If omitted, data is read from the current file position.   |

Returns a promise resolving to the token value read from the tokenizer-stream.
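For example, a sketch reading a big-endian 16-bit integer at an explicit position:

```js
import { fromBuffer } from 'strtok3';
import * as Token from 'token-types';

const tokenizer = fromBuffer(new Uint8Array([0x00, 0x00, 0x12, 0x34]));

// Read a UINT16_BE token starting at absolute offset 2
const value = await tokenizer.readToken(Token.UINT16_BE, 2);
console.log(value); // 4660 (0x1234)
```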
#### peekToken function
Peek a token from the tokenizer.
```ts
peekToken<T>(token: IGetToken<T>, position: number = this.position): Promise<T>
```
| Parameter | Type         | Description                                                                                              |
|-----------|--------------|----------------------------------------------------------------------------------------------------------|
| token     | IGetToken<T> | Token to peek from the tokenizer-stream.                                                                   |
| position? | number       | Offset where to begin reading within the file. If omitted, data is read from the current file position.   |

Returns a promise resolving to the token value peeked from the tokenizer.
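For example, a sketch peeking a token and then reading the same bytes again:

```js
import { fromBuffer } from 'strtok3';
import * as Token from 'token-types';

const tokenizer = fromBuffer(new Uint8Array([0xff]));

const peeked = await tokenizer.peekToken(Token.UINT8); // 255, position still 0
const read = await tokenizer.readToken(Token.UINT8);   // 255 again, position now 1
console.log(peeked === read); // true
```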
#### readNumber function
Read a numeric token from the tokenizer.
```ts
readNumber(token: IGetToken<number>): Promise<number>
```
| Parameter | Type              | Description                                      |
|-----------|-------------------|--------------------------------------------------|
| token     | IGetToken<number> | Numeric token to read from the tokenizer-stream. |

Returns a promise resolving to the numeric value read and decoded from the tokenizer-stream.
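For example, a sketch decoding a signed 8-bit value:

```js
import { fromBuffer } from 'strtok3';
import * as Token from 'token-types';

const tokenizer = fromBuffer(new Uint8Array([0xfe]));
const value = await tokenizer.readNumber(Token.INT8);
console.log(value); // -2
```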
#### ignore function
Advance the stream pointer by the given number of bytes.

```ts
ignore(length: number): Promise<number>
```
| Parameter | Type | Description |
|------------|--------|------------------------------------------------------------------|
| length | number | Number of bytes to ignore. Will advance the tokenizer.position |
Returns a promise resolving to the number of bytes ignored from the tokenizer-stream.
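For example, a sketch skipping a 4-byte header before reading the payload:

```js
import { fromBuffer } from 'strtok3';
import * as Token from 'token-types';

const tokenizer = fromBuffer(new Uint8Array([0, 0, 0, 0, 0x07]));

await tokenizer.ignore(4); // skip the 4-byte header
const payload = await tokenizer.readToken(Token.UINT8);
console.log(payload); // 7
```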
#### close function
Clean up resources, such as closing a file pointer, if applicable. Returns a promise that resolves once cleanup has completed.
#### Tokenizer attributes
- fileInfo
  Optional attribute describing the file information; see IFileInfo.
- position
  Pointer to the current position in the tokenizer stream.
  If a position is provided to a _read_ or _peek_ method, it should be equal to or greater than this value.
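For example, a sketch inspecting both attributes on a file-based tokenizer:

```js
import { fromFile } from 'strtok3';
import * as Token from 'token-types';

const tokenizer = await fromFile('somefile.bin');
try {
  console.log(tokenizer.fileInfo.size); // file size in bytes, embedded by fromFile
  await tokenizer.readToken(Token.UINT8);
  console.log(tokenizer.position);      // 1: one byte has been consumed
} finally {
  await tokenizer.close();
}
```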
### IReadChunkOptions interface
Each attribute is optional:
| Attribute | Type    | Description                                                                                                                              |
|-----------|---------|------------------------------------------------------------------------------------------------------------------------------------------|
| length    | number  | Requested number of bytes to read.                                                                                                         |
| position  | number  | Position where to peek from the file. If omitted, data is read from the current file position. May not be less than tokenizer.position.   |
| mayBeLess | boolean | If set, no error is thrown when fewer bytes than requested could be read.                                                                  |
Example usage:
```js
tokenizer.peekBuffer(buffer, {mayBeLess: true});
```
### IFileInfo interface
Provides optional metadata about the file being tokenized.
| Attribute | Type | Description |
|-----------|---------|---------------------------------------------------------------------------------------------------|
| size | number | File size in bytes |
| mimeType | string | MIME-type of file. |
| path | string | File path |
| url | string | File URL |
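File information can also be supplied up front when the source itself cannot provide it, e.g. when tokenizing a stream. A sketch, assuming ITokenizerOptions accepts a fileInfo attribute:

```js
import { fromWebStream } from 'strtok3';

const blob = new Blob([new Uint8Array(1024)]);

// Pass metadata the stream itself cannot provide (assumes ITokenizerOptions.fileInfo)
const tokenizer = fromWebStream(blob.stream(), {
  fileInfo: { size: blob.size, mimeType: 'application/octet-stream' }
});
console.log(tokenizer.fileInfo.size); // 1024
```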
### Token object
The token is essentially a description of what to read from the tokenizer-stream.
A basic set of token types can be found in token-types.
A token is anything that implements the following interface:
```ts
export interface IGetToken<T> {

  /**
   * Length in bytes of encoded value
   */
  len: number;

  /**
   * Decode value from buffer at offset
   * @param buf Buffer to read the decoded value from
   * @param off Decode offset
   */
  get(buf: Uint8Array, off: number): T;
}
```
The tokenizer reads token.len bytes from the tokenizer-stream into a buffer.
token.get is then called with that buffer; it is responsible for converting the buffer into the desired output type.
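For example, a hypothetical custom token decoding a 4-byte ASCII identifier (a FourCC, as found in RIFF files); FourCcToken here is illustrative and not part of token-types:

```js
import { fromBuffer } from 'strtok3';

// Hypothetical custom token: reads 4 bytes and decodes them as an ASCII string
const FourCcToken = {
  len: 4, // number of bytes to read from the stream
  get(buf, off) {
    return String.fromCharCode(buf[off], buf[off + 1], buf[off + 2], buf[off + 3]);
  }
};

const tokenizer = fromBuffer(new Uint8Array([0x52, 0x49, 0x46, 0x46])); // "RIFF"
console.log(await tokenizer.readToken(FourCcToken)); // "RIFF"
```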
### Converting a Web-API readable stream to a Node.js readable stream

To convert a Web-API readable stream into a Node.js readable stream, you can use readable-web-to-node-stream.
```js
import { fromStream } from 'strtok3';
import { ReadableWebToNodeStream } from 'readable-web-to-node-stream';

(async () => {
  const response = await fetch(url); // url: the resource to read from
  const readableWebStream = response.body; // Web-API readable stream
  const nodeStream = new ReadableWebToNodeStream(readableWebStream); // Convert to a Node.js readable stream
  const tokenizer = await fromStream(nodeStream); // And we now have a tokenizer for the Node.js stream
})();
```
## Dependencies

- @tokenizer/token: Provides token definitions and utilities used by strtok3 for interpreting binary data.