**MiniParse** is a small TypeScript parser combinator library with an efficient regex based lexer.
npm install mini-parseMiniParse is a small TypeScript parser combinator library with an efficient
regex based lexer.
- MiniParse is a combinator library.
You write a grammar by combining simple TypeScript
functions like or(), repeat(), and seq().
It's just TypeScript so it's easy to mix with your existing code,
IDE, test frameworks, etc.
- MiniParse is a Parsing Expression Grammar (PEG) parser.
Parsing is simple to understand, top down, using recursive descent with backtracking.
- Parsers are modular - every grammar fragment is also a parser and can be tested
and reused independently.
- Small code size makes MiniParse suitable for runtime embedding (< 4kb compressed).
- Extensive debug tracing for use while developing grammars.
(Tracing is automatically removed from production builds).
- Error reporting is included, with full line context displayed.
- Swap lexers under parser control. Handy for parsing quotes,
or for mixed language parsing (e.g. html in jsx).
- Stack parsers to parse things that can appear almost anywhere in your grammar.
Handy for things like nested comments, semantic comments (jsdoc), or annotations.
- Named accumulators make it easy to collect parsing results
from deeply nested sub parsers.
- Mapped logging utilities are available for error reporting on preprocessed src.
Typically you can write a parser by mixing combinator functions from the library.
Here's a parser for a simple sum expression:
``ts`
export const simpleSum = seq(num, or("+", "-"), num);
Here's a parser for nested block comments:
`ts`
export const blockComment: Parser
"/*",
repeat(or(() => blockComment, anyNot("*/"))),
req("*/"),
);
The example above uses the combinators:
seq(), repeat(), or(), anyNot(), and req().
More combinators are available and documented in ParserCombinator.ts
To any combinator that accepts a parser as an argument, you can pass:
- another parser
- a function that returns a parser - uses the returned parser, but calls
the function lazily
- a string - a parser that accepts any token exactly matching the string
Parsing relies on a lexer to divide the source string into parts, called tokens.
(You could produce a token per character, but that would be comparatively slow)
Here's an example token matcher to identify three kinds of tokens for parsing
simple arithmetic expressions.
`ts`
export const simpleTokens = new RegexMatchers({
number: /\d+/,
symbol: matchOneOf("( ) ^ + - * /"),
ws: /\s+/,
});
To create a lexer that walks over a source text producing tokens, use:
`ts`
const text = "3 + 4";
const stream = matchingLexer(text, simpleTokens);
By default, the lexer will skip over whitespace tokens that you designate with
the name 'ws'. You can configure whether and which tokens to skip statically by
passing the set of names skip when you construct the matchingLexer, ortokensIgnore()
dynamically while parsing by using the combinator.
To run a parser, just give it a lexer attached to your source text.
`ts`
const result = simpleSum.parse({ stream });
The result will contain the combined results of the parsers, in this case
["3", "+", "4"].
Typically, it's convenient to use .map() to select the relevant parts from a
successful parse and do a bit of format conversion.
This parser will return a number rather than a string:
`ts`
const int = num.map(r => parseInt(r.value, 10));
Here's an example that even does some computation, and returns a numeric sum or
difference of the whole expression. It parses the same text as simpleSum
above, but converts to numbers and then adds or subtracts.
`ts`
// return the numeric sum, rather than a sequence of strings
export const sumResults = seq(int, or("+", "-"), int).map(r => {
const [a, op, b] = r.value;
return op === "+" ? a + b : a - b;
});
Note that .map() is only called on successful parses of the mapped expression,.map()
if the expression fails, the parser will backtrack and try any alternatives in
the grammar and will not be called on the failed part of the parse.
For larger parsers, you'll typically convert the parsed text into an
intermediate form, sometimes called an
Abstract Syntax Tree.
.parse() allows you to pass an application specific data structure that's.map()
visible in for convenience in building the abstract syntax tree with
each successfully parsed element.
`ts
type ASTElem = BinOpElem;
interface BinOpElem {
kind: "binOp";
left: number | BinOpElem;
right: number | BinOpElem;
op: "+" | "-";
}
export const sumElem = seq(int, or("+", "-"), int).map(r => {
const [a, op, b] = r.value;
const binOpElem: BinOpElem = {
kind: "binOp",
left: a,
right: b,
op: op as "+" | "-",
};
r.app.state.push(binOpElem);
});
`
For complicated nested or repeating parsers,
it's convenient to assign names to particular results of interest.
You can use a Symbol or a string to name a result using the .named()
method on all parsers.
Multiple results with the same name are accumulated into an array.
`ts
const op = or("+", "-");
export const taggedSum = seq(
int,
repeat(seq(op, int)), // accumulate an array of [op, int] pairs
).map(r => {
const [left, opRights] = r.value;
if (!opRights) return left;
return opRights.reduce((acc, opRight) => {
const [op, right] = opRight;
return op === "+" ? acc + right : acc - right;
}, left);
});
`
For debugging your grammar, it's useful to debug your grammar in pieces. One of
the nice features of parser combinators is that every part of the grammar is
independently testable.
To print out the progress of parsing:
1. Call enableTracing() to turn on the tracing facility (normally off and.trace(opts?)
removed from prod builds)
1. Create your parsers.
1. Call on any Parser. See TraceOptions for options.traceName()
controlling trace levels.
1. Add application relevant trace names to any parser using orsetTraceName()
..traceName()
- Use on any parser to set the trace name for debugging.setTraceName()
- Alternately, you can use protected by a tracing global
and the javascript bundler will remove the code in production builds to
save a few bytes.
`ts
if (tracing) {
const names: Record
fnCall,
fnParam,
fnParamList,
fnDecl,
};
Object.entries(names).forEach(([name, parser]) => {
setTraceName(parser, name);
});
`
Calculator - classic PEG style parser in
MiniParse syntax.
Calculator with Results calculator
example parser with inline calculation of results.
WGSL-D parsing selected parts of the WebGPU
WGSL shader language
along with #import and #export extensions.
Sometimes it's nice to let the grammar choose a different tokenizer
to parse different sections of the source text.
For example, to parse a programming language with quotes,
you'll probably want a different tokenizer for the text inside of quotes:
`ts`
const quote = seq('"', tokens(quoteTokens, repeat(nonQuote)), '"');
In unusual cases, it can be handy to choose the next parser based
on information outside the grammar itself.
For example, to handle an #ifdef style preprocessor, the parsing#ifdef
proceeds differently depending on whether the is true or false,.toParser
and information outside the source text may be required to decide.
If you want to check a runtime dictionary to decide which parser to
use for the next tokens, than is of use.
If the language you're parsing has some elements that can appear almost anywhere,
it'd be awkward to mention those elements at every possible position in the grammar.
Examples include nested block comments, comments containing semantic info, etc.
To handle pervasive elements, MiniParse lets you implement your own layered tokenizers.
There are two application specific objects that are passed to every parser:
state and context.app.state, as mentioned above, is handy for accumulating application
results of successful parses.
app.context is useful to store ephemeral application state discoveredapp.state
during parsing.
Like , app.context is just for applications -.map()
MiniParse doesn't use it
and applications can read and write it using the method on any parser.app.state
But unlike MiniParse will reset app.contextapp.context
when a sub-parser fails and backtracks. is passed to child parsers, but doesn't accumulate to parent parsers.
An example of using app.context is for parsing nested #ifdef #endif clauses.app.context` is a good place to store
the stack of active/inactive states discovered while
parsing.
Left recursive rules are typically disallowed in top down parsers, including MiniParse.
In the parser combinator setting, it's obvious why - a function calling itself
in its first statement is going to recurse forever.
Best to write the grammar so that recursion is in the middle or at the end.
See the block comment example or the calculator example.
PEG parsers like MiniParse can be sped up using a memoization algorithm called
packrat parsing.
Tratt
describes a technique to allow some left recursive rules, based on
Warth's proposal
for left recursion with packrat parsing.
Rossum
also has pursued this approach for Python.
But per Tratt, note that the resulting parse order is not as predictable, and there
are issues with rules that are simultaneously left and right recursive.
Allowing a regex as a parser argument would be convenient to avoid
the need for a separate lexer in some cases.
Is MiniParse right for your project? Consider the alternatives:
- Full Custom Parser - _maximum speed and ultimate malleability, lots of work._
For maximum speed and control, write a dedicated parser directly in Typescript.
This is the most effort, but if you're writing a production compiler and need
to squeeze every millisecond, it's worth it.
Otherwise use a parser generator tool suite or a parser combinator library.
- Parser Generator - _high speed, some work to adopt._
Parser generators statically analyze and precompile a grammar description language.
These mature tools can be a bit big, but there's lots of documentation,
rich ecosystems of example code and support tools.
Worthy examples include:
Nearley,
Lezer,
Antlr,
Chevrotain,
Tree-sitter,
or Ohm.
Each parser generator has its own textual format to describe the grammar. The library
compiles into an execution format before parsing.
Each of the above libraries uses a different base algorithm (Earley, GLR, LL, Packrat)
with different tradeoffs, but all have evolved robust features to classic parsing
problems of error recovery, left recursion, producing parse results, tracing, etc.
Parser generators are typically more complicated to adopt
than parser combinator libraries,
less flexible, require a separate build step,
and they're more code than typical parser combinators.
But for demanding parsing jobs, the complexity of a parser generator tool is
easily worth the investment.
- Parser Combinators - _lower speed, most flexibility, lightweight adoption._
Parser combinators define a grammar by mixing simple TypeScript functions
provided by the library or written by the user (aka combinator functions).
Execution of the grammar normally means simply running these functions.
The simplicity makes parser combinators flexible and easy to adopt - you're using
TypesScript for everything.
Parser combinators typically interpret rather than compiling the grammar in advance,
so they're slower to run. But they're plenty fast enough for most purposes.
In the Parser Combinator category, MiniParse has a few interesting features
and is notably lightweight.
As you're parser shopping, also consider other worthy and more mature libraries
in the TypeScript parser combinator category like:
ts-parsec
and ParJS.