cpp-peglib
==========

C++17 header-only PEG (Parsing Expression Grammars) library, by Yuji Hirose. You can start using it right away just by including peglib.h in your project.
Since this library supports only C++17 compilers, please make sure that the compiler option -std=c++17 is enabled (/std:c++17 and /Zc:__cplusplus for MSVC).
You can also try the online version, PEG Playground at https://yhirose.github.io/cpp-peglib.
The PEG syntax is well described on page 2 in the document by Bryan Ford. cpp-peglib also supports the following additional syntax for now:
* '...'i (Case-insensitive literal operator)
* [...]i (Case-insensitive character class operator)
* [^...] (Negated character class operator)
* [^...]i (Case-insensitive negated character class operator)
* {2,5} (Regex-like repetition operator)
* < ... > (Token boundary operator)
* ~ (Ignore operator)
* \x20 (Hex number char)
* \u10FFFF (Unicode char)
* %whitespace (Automatic whitespace skipping)
* %word (Word expression)
* $name( ... ) (Capture scope operator)
* $name< ... > (Named capture operator)
* $name (Backreference operator)
* | (Dictionary operator)
* ↑ (Cut operator)
* MACRO_NAME( ... ) (Parameterized rule or Macro)
* { precedence L - + L / * } (Parsing infix expression)
* %recovery( ... ) (Error recovery operator)
* exp⇑label or exp^label (Syntax sugar for (exp / %recover(label)))
* label { error_message "..." } (Error message instruction)
* { no_ast_opt } (No AST node optimization instruction)
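For example, a few of these operators can be combined in a single grammar. This is a small sketch (the rule names and input are made up for illustration):

```cpp
peg::parser parser(R"(
    ROOT     <- GREETING ' ' ZIPCODE
    GREETING <- 'hello'i            # case-insensitive literal
    ZIPCODE  <- < [0-9]{5,5} >      # repetition operator + token boundary
)");

assert(parser.parse("HELLO 12345"));
```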
An 'End of Input' check is performed by default. To disable the check, please call disable_eoi_check.
This library supports linear-time parsing, known as packrat parsing.
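Both toggles are plain method calls on the parser. A minimal sketch (the tiny grammar is only for illustration):

```cpp
peg::parser parser(R"(ROOT <- 'a'+)");

parser.enable_packrat_parsing(); // turn on packrat (linear-time) parsing
parser.disable_eoi_check();      // allow trailing, unparsed input

parser.parse("aaa bbb"); // succeeds; " bbb" is left unconsumed
```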
IMPORTANT NOTE for some Linux distributions such as Ubuntu and CentOS: Need -pthread option when linking. See #23, #46 and #62.
I am sure that you will enjoy this excellent "Practical parsing with PEG and cpp-peglib" article by bert hubert!
Installation
------------
Run:

```bash
$ npm i peglib.cxx
```

And then include peglib.h as follows:

```cpp
// main.cxx
#include <peglib.h>

int main() { /* ... */ }
```

Finally, compile while adding node_modules/peglib.cxx to your compiler's include paths:

```bash
$ clang++ -std=c++17 -I./node_modules/peglib.cxx main.cxx # or, use g++
$ g++ -std=c++17 -I./node_modules/peglib.cxx main.cxx
```
You may also use a simpler approach with the cpoach tool, which automatically adds the include paths of all installed dependencies to the compiler command:

```bash
$ cpoach clang++ -std=c++17 main.cxx # or, use g++
$ cpoach g++ -std=c++17 main.cxx
```
How to use
----------
This is a simple calculator sample. It shows how to define a grammar, associate semantic actions with the grammar, and handle semantic values.

```cpp
// (1) Include the header file
#include <peglib.h>
#include <assert.h>
#include <iostream>

using namespace peg;
using namespace std;

int main(void) {
  // (2) Make a parser
  parser parser(R"(
    # Grammar for Calculator...
    Additive       <- Multiplicative '+' Additive / Multiplicative
    Multiplicative <- Primary '*' Multiplicative / Primary
    Primary        <- '(' Additive ')' / Number
    Number         <- < [0-9]+ >
    %whitespace    <- [ \t]*
  )");

  assert(static_cast<bool>(parser) == true);

  // (3) Setup actions
  parser["Additive"] = [](const SemanticValues &vs) {
    switch (vs.choice()) {
    case 0: // "Multiplicative '+' Additive"
      return any_cast<int>(vs[0]) + any_cast<int>(vs[1]);
    default: // "Multiplicative"
      return any_cast<int>(vs[0]);
    }
  };

  parser["Multiplicative"] = [](const SemanticValues &vs) {
    switch (vs.choice()) {
    case 0: // "Primary '*' Multiplicative"
      return any_cast<int>(vs[0]) * any_cast<int>(vs[1]);
    default: // "Primary"
      return any_cast<int>(vs[0]);
    }
  };

  parser["Number"] = [](const SemanticValues &vs) {
    return vs.token_to_number<int>();
  };

  // (4) Parse
  parser.enable_packrat_parsing(); // Enable packrat parsing.

  int val;
  parser.parse(" (1 + 2) * 3 ", val);

  assert(val == 9);
}
```
To show syntax errors in grammar text:
```cpp
auto grammar = R"(
  # Grammar for Calculator...
  Additive       <- Multiplicative '+' Additive / Multiplicative
  Multiplicative <- Primary '*' Multiplicative / Primary
  Primary        <- '(' Additive ')' / Number
  Number         <- < [0-9]+ >
  %whitespace    <- [ \t]*
)";

parser parser;

parser.set_logger([](size_t line, size_t col, const string& msg, const string &rule) {
  cerr << line << ":" << col << ": " << msg << "\n";
});

auto ok = parser.load_grammar(grammar);
assert(ok);
```
There are four semantic actions available:

```cpp
[](const SemanticValues& vs, any& dt)
[](const SemanticValues& vs)
[](SemanticValues& vs, any& dt)
[](SemanticValues& vs)
```

A `SemanticValues` value contains the following information:

* Semantic values
* Matched string information
* Token information if the rule is literal or uses a token boundary operator
* Choice number when the rule is 'prioritized choice'

`any& dt` is 'read-write' context data which can be used for whatever purposes. The initial context data is set in the `peg::parser::parse` method.
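Here is a minimal sketch of threading user state through `dt`; the `parse` overload that takes a `dt` argument is assumed from the library API, and the counter is only for illustration:

```cpp
peg::parser parser(R"(ROOT <- [0-9]+)");

// The two-argument action form receives the shared context data.
parser["ROOT"] = [](const peg::SemanticValues &vs, std::any &dt) {
  auto &count = *std::any_cast<int *>(dt); // dt carries a pointer to user state
  count++;
};

int count = 0;
std::any dt = &count;
parser.parse("42", dt); // the initial context data is handed to parse()
```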
A semantic action can return a value of arbitrary data type, which will be wrapped by `peg::any`. If a user returns nothing in a semantic action, the first semantic value in the `const SemanticValues& vs` argument will be returned. (The Yacc parser has the same behavior.)

Here is the `SemanticValues` structure:
```cpp
struct SemanticValues : protected std::vector<std::any>
{
  // Input text
  const char* path;
  const char* ss;

  // Matched string
  std::string_view sv() const { return sv_; }

  // Line number and column at which the matched string is
  std::pair<size_t, size_t> line_info() const;

  // Tokens
  std::vector<std::string_view> tokens;
  std::string_view token(size_t id = 0) const;

  // Token conversion
  std::string token_to_string(size_t id = 0) const;
  template <typename T> T token_to_number() const;

  // Choice number (0 based index)
  size_t choice() const;

  // Transform the semantic value vector to another vector
  template <typename T> vector<T> transform(size_t beg = 0, size_t end = -1) const;
}
```
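For example, `transform` can gather every child value of one type into a vector. A small sketch using the members listed above (the grammar and rule names are made up for illustration):

```cpp
peg::parser parser(R"(
    LIST        <- NUMBER (',' NUMBER)*
    NUMBER      <- < [0-9]+ >
    %whitespace <- [ \t]*
)");

parser["NUMBER"] = [](const SemanticValues &vs) {
  return vs.token_to_number<int>();
};

parser["LIST"] = [](const SemanticValues &vs) {
  auto numbers = vs.transform<int>(); // std::vector<int> of the child values
  int sum = 0;
  for (auto n : numbers) sum += n;
  return sum;
};

int val = 0;
parser.parse("1, 2, 3", val);
assert(val == 6);
```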
The following example uses the `< ... >` operator, which is a token boundary operator.

```cpp
peg::parser parser(R"(
    ROOT  <- _ TOKEN (',' _ TOKEN)*
    TOKEN <- < [a-z0-9]+ > _
    _     <- [ \t\r\n]*
)");

parser["TOKEN"] = [](const SemanticValues& vs) {
  // 'token' doesn't include trailing whitespaces
  auto token = vs.token();
};

auto ret = parser.parse(" token1, token2 ");
```

We can ignore unnecessary semantic values from the list by using the `~` operator.

```cpp
peg::parser parser(R"(
    ROOT <- _ ITEM (',' _ ITEM _)*
    ITEM <- ([a-z0-9])+
    ~_   <- [ \t]*
)");

parser["ROOT"] = [&](const SemanticValues& vs) {
  assert(vs.size() == 2); // should be 2 instead of 5.
};

auto ret = parser.parse(" item1, item2 ");
```
The following grammar is the same as the above.
```cpp
peg::parser parser(R"(
    ROOT <- ~_ ITEM (',' ~_ ITEM ~_)*
    ITEM <- ([a-z0-9])+
    _    <- [ \t]*
)");
```
Semantic predicate support is available with a predicate action.
```cpp
peg::parser parser("NUMBER <- [0-9]+");

parser["NUMBER"] = [](const SemanticValues &vs) {
  return vs.token_to_number<long>();
};

parser["NUMBER"].predicate = [](const SemanticValues &vs,
                                const std::any & /*dt*/, std::string &msg) {
  if (vs.token_to_number<long>() != 100) {
    msg = "value error!!";
    return false;
  }
  return true;
};

long val;
auto ret = parser.parse("100", val);
assert(ret == true);
assert(val == 100);

ret = parser.parse("200", val);
assert(ret == false);
```
enter and leave actions are also available.
```cpp
parser["RULE"].enter = [](const Context &c, const char* s, size_t n, any& dt) {
  std::cout << "enter" << std::endl;
};

parser["RULE"] = [](const SemanticValues& vs, any& dt) {
  std::cout << "action!" << std::endl;
};

parser["RULE"].leave = [](const Context &c, const char* s, size_t n, size_t matchlen, any& value, any& dt) {
  std::cout << "leave" << std::endl;
};
```
You can receive error information via a logger:
```cpp
parser.set_logger([](size_t line, size_t col, const string& msg) {
  ...
});

parser.set_logger([](size_t line, size_t col, const string& msg, const string &rule) {
  ...
});
```
Ignoring Whitespaces
--------------------
As you can see in the first example, we can ignore whitespaces between tokens automatically with the `%whitespace` rule.

The `%whitespace` rule can be applied to the following three conditions:

* trailing spaces on tokens
* leading spaces on text
* trailing spaces on literal strings in rules

These are valid tokens:

```
KEYWORD  <- 'keyword'
KEYWORDI <- 'case_insensitive_keyword'
WORD     <- < [a-zA-Z0-9] [a-zA-Z0-9-_]* >   # token boundary operator is used.
IDENT    <- < IDENT_START_CHAR IDENT_CHAR* > # token boundary operator is used.
```

The following grammar accepts ` one, "two three", four `.

```
ROOT        <- ITEM (',' ITEM)*
ITEM        <- WORD / PHRASE
WORD        <- < [a-z]+ >
PHRASE      <- < '"' (!'"' .)* '"' >

%whitespace <- [ \t\r\n]*
```
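A quick check of that grammar from C++ (a sketch; the assert just confirms the parse succeeds):

```cpp
peg::parser parser(R"(
    ROOT        <- ITEM (',' ITEM)*
    ITEM        <- WORD / PHRASE
    WORD        <- < [a-z]+ >
    PHRASE      <- < '"' (!'"' .)* '"' >
    %whitespace <- [ \t\r\n]*
)");

assert(parser.parse(R"( one, "two three", four )"));
```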
Word expression
---------------
```cpp
peg::parser parser(R"(
    ROOT        <- 'hello' 'world'
    %whitespace <- [ \t\r\n]*
    %word       <- [a-z]+
)");

parser.parse("hello world"); // OK
parser.parse("helloworld");  // NG
```
Capture/Backreference
---------------------
```cpp
peg::parser parser(R"(
    ROOT      <- CONTENT
    CONTENT   <- (ELEMENT / TEXT)*
    ELEMENT   <- $(STAG CONTENT ETAG)
    STAG      <- '<' $tag< TAG_NAME > '>'
    ETAG      <- '</' $tag '>'
    TAG_NAME  <- 'b' / 'u'
    TEXT      <- TEXT_DATA
    TEXT_DATA <- ![<] .
)");

parser.parse("This is <b>a <u>test</u> text</b>."); // OK
parser.parse("This is <b>a <u>test</b> text</u>."); // NG
parser.parse("This is <b>a <u>test text</b>.");     // NG
```
Dictionary
----------
The `|` operator allows us to make a word dictionary for fast lookup by using a Trie structure internally. We don't have to worry about the order of words.

```peg
START <- 'This month is ' MONTH '.'
MONTH <- 'Jan' | 'January' | 'Feb' | 'February' | '...'
```

We are able to find which item is matched with `choice()`.

```cpp
parser["MONTH"] = [](const SemanticValues &vs) {
  auto id = vs.choice();
};
```

It supports the case-insensitive mode.

```peg
START <- 'This month is ' MONTH '.'
MONTH <- 'Jan'i | 'January'i | 'Feb'i | 'February'i | '...'i
```
Cut operator
------------
The `↑` operator could mitigate the backtrack performance problem, but has a risk of changing the meaning of the grammar.

```peg
S <- '(' ↑ P ')' / '"' ↑ P '"' / P
P <- 'a' / 'b' / 'c'
```

When we parse `(z` with the above grammar, we don't have to backtrack in S after `(` is matched, because a cut operator is inserted there.
Parameterized Rule or Macro
---------------------------
```peg
# Syntax
Start      ← _ Expr
Expr       ← Sum
Sum        ← List(Product, SumOpe)
Product    ← List(Value, ProOpe)
Value      ← Number / T('(') Expr T(')')

# Token
SumOpe     ← T('+' / '-')
ProOpe     ← T('*' / '/')
Number     ← T([0-9]+)
~_         ← [ \t\r\n]*

# Macro
List(I, D) ← I (D I)*
T(x)       ← < x > _
```
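From the C++ side, macros are used like any other rule. A minimal sketch, assuming the macro grammar above is stored in a `grammar` string:

```cpp
peg::parser parser(grammar); // `grammar` holds the macro grammar above
assert(static_cast<bool>(parser));
assert(parser.parse("1 + 2 * 3"));
```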
Parsing infix expression by Precedence climbing
-----------------------------------------------
Regarding the precedence climbing algorithm, please see this article.
```cpp
parser parser(R"(
  EXPRESSION             <- INFIX_EXPRESSION(ATOM, OPERATOR)
  ATOM                   <- NUMBER / '(' EXPRESSION ')'
  OPERATOR               <- < [-+/*] >
  NUMBER                 <- < '-'? [0-9]+ >
  %whitespace            <- [ \t]*

  # Declare order of precedence
  INFIX_EXPRESSION(A, O) <- A (O A)* {
    precedence
      L + -
      L * /
  }
)");

parser["INFIX_EXPRESSION"] = [](const SemanticValues& vs) -> long {
  auto result = any_cast<long>(vs[0]);
  if (vs.size() > 1) {
    auto ope = any_cast<char>(vs[1]);
    auto num = any_cast<long>(vs[2]);
    switch (ope) {
      case '+': result += num; break;
      case '-': result -= num; break;
      case '*': result *= num; break;
      case '/': result /= num; break;
    }
  }
  return result;
};
parser["OPERATOR"] = [](const SemanticValues& vs) { return vs.sv()[0]; };
parser["NUMBER"] = [](const SemanticValues& vs) { return vs.token_to_number<long>(); };

long val;
parser.parse(" -1 + (1 + 2) * 3 - -1", val);
assert(val == 9);
```
The `precedence` instruction can be applied only to the following 'list' style rule.

```
Rule <- Atom (Operator Atom)* {
  precedence
    L - +
    L / *
    R ^
}
```

The `precedence` instruction contains precedence info entries. Each entry starts with the associativity, which is 'L' (left) or 'R' (right), then operator literal tokens follow. The first entry has the highest order level.
AST generation
--------------
cpp-peglib is able to generate an AST (Abstract Syntax Tree) when parsing. The `enable_ast` method on the `peg::parser` class enables the feature.

NOTE: An AST node holds a corresponding token as `std::string_view` for performance and less memory usage. It is users' responsibility to keep the original source text along with the generated AST tree.

```cpp
peg::parser parser(R"(
  ...
  definition1 <- ... { no_ast_opt }
  definition2 <- ... { no_ast_opt }
  ...
)");

parser.enable_ast();

shared_ptr<peg::Ast> ast;
if (parser.parse("...", ast)) {
  cout << peg::ast_to_s(ast);

  ast = parser.optimize_ast(ast);
  cout << peg::ast_to_s(ast);
}
```

`optimize_ast` removes redundant nodes to make an AST simpler. If you want to disable this behavior for particular rules, the `no_ast_opt` instruction can be used.

It internally calls `peg::AstOptimizer` to do the job. You can make your own AST optimizers to fit your needs.
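To work with the tree directly, you can walk the nodes yourself. A minimal sketch; the `name`, `is_token`, `token`, and `nodes` members are assumed from the `peg::Ast` definition in peglib.h:

```cpp
#include <peglib.h>
#include <iostream>
#include <memory>
#include <string>

// Recursively print an AST (sketch only). The members `name`, `is_token`,
// `token`, and `nodes` are assumed from peg::Ast in peglib.h.
void dump(const std::shared_ptr<peg::Ast> &ast, int level = 0) {
  std::string indent(level * 2, ' ');
  if (ast->is_token) {
    std::cout << indent << ast->name << ": " << ast->token << "\n";
  } else {
    std::cout << indent << ast->name << "\n";
    for (const auto &node : ast->nodes) {
      dump(node, level + 1);
    }
  }
}
```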
See actual usages in the AST calculator example and PL/0 language example.
Make a parser with parser combinators
-------------------------------------
Instead of making a parser by parsing PEG syntax text, we can also construct a parser by hand with parser combinators. Here is an example:
```cpp
using namespace peg;
using namespace std;

vector<string> tags;

Definition ROOT, TAG_NAME, _;
ROOT     <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const SemanticValues& vs) {
              tags.push_back(vs.token_to_string());
            };
_        <= zom(cls(" \t"));

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");
```
The following are available operators:
| Operator | Description | Operator | Description |
|:---------|:--------------------------------|:---------|:--------------------|
| seq | Sequence | cho | Prioritized Choice |
| zom | Zero or More | oom | One or More |
| opt | Optional | apd | And predicate |
| npd | Not predicate | lit | Literal string |
| liti | Case-insensitive Literal string | cls | Character class |
| ncls | Negated Character class | chr | Character |
| dot | Any character | tok | Token boundary |
| ign | Ignore semantic value | csc | Capture scope |
| cap | Capture | bkr | Back reference |
| dic | Dictionary | pre | Infix expression |
| rec | Recovery | usr | User defined parser |
| rep | Repetition | | |
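For instance, an optional sign followed by digits can be composed directly from these combinators (a quick sketch, not taken from the library docs):

```cpp
using namespace peg;

Definition NUMBER;

// Roughly equivalent to the PEG rule: NUMBER <- ('-' / '+')? [0123456789]+
NUMBER <= seq(opt(cho(chr('-'), chr('+'))), oom(cls("0123456789")));

auto ret = NUMBER.parse("-42"); // ret reports whether the match succeeded
```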
Adjust definitions
------------------
It's possible to add/override definitions.
```cpp
auto syntax = R"(
  ROOT <- _ 'Hello' _ NAME '!' _
)";

Rules additional_rules = {
  {
    "NAME", usr([](const char* s, size_t n, SemanticValues& vs, any& dt) -> size_t {
      static vector<string> names = { "PEG", "BNF" };
      for (const auto& name: names) {
        if (name.size() <= n && !name.compare(0, name.size(), s, name.size())) {
          return name.size(); // processed length
        }
      }
      return -1; // parse error
    })
  },
  {
    "~_", zom(cls(" \t\r\n"))
  }
};

auto g = parser(syntax, additional_rules);
assert(g.parse(" Hello BNF! "));
```
Unicode support
---------------
cpp-peglib accepts UTF8 text. `.` matches a Unicode codepoint. Also, it supports `\u????`.
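Because `.` matches one codepoint rather than one byte, a rule like the following counts characters in UTF-8 input. A small sketch (the sample string and rule names are arbitrary):

```cpp
peg::parser parser(R"(
    ROOT <- CHAR+
    CHAR <- .   # one Unicode codepoint, not one byte
)");

parser["ROOT"] = [](const peg::SemanticValues &vs) {
  return vs.size(); // number of CHAR matches, i.e. codepoints
};

size_t count = 0;
parser.parse("日本語", count); // UTF-8 input
assert(count == 3);
```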
Error report and recovery
-------------------------
cpp-peglib supports the furthest failure error position report as described in the Bryan Ford original document.
For better error report and recovery, cpp-peglib supports the 'recovery' operator with a label, which can be associated with a recovery expression and a custom error message. This idea comes from the fantastic "Syntax Error Recovery in Parsing Expression Grammars" paper by Sergio Medeiros and Fabio Mascarenhas.
The custom message supports `%t`, which is a placeholder for the unexpected token, and `%c` for the unexpected Unicode char.

Here is an example of a Java-like grammar:

```peg
# java.peg
Prog       ← 'public' 'class' NAME '{' 'public' 'static' 'void' 'main' '(' 'String' '[' ']' NAME ')' BlockStmt '}'
BlockStmt  ← '{' (!'}' Stmt^stmtb)* '}' # Annotated with `stmtb`
Stmt       ← IfStmt / WhileStmt / PrintStmt / DecStmt / AssignStmt / BlockStmt
IfStmt     ← 'if' '(' Exp ')' Stmt ('else' Stmt)?
WhileStmt  ← 'while' '(' Exp^condw ')' Stmt # Annotated with `condw`
DecStmt    ← 'int' NAME ('=' Exp)? ';'
AssignStmt ← NAME '=' Exp ';'^semia # Annotated with `semia`
PrintStmt  ← 'System.out.println' '(' Exp ')' ';'
Exp        ← RelExp ('==' RelExp)*
RelExp     ← AddExp ('<' AddExp)*
AddExp     ← MulExp (('+' / '-') MulExp)*
MulExp     ← AtomExp (('*' / '/') AtomExp)*
AtomExp    ← '(' Exp ')' / NUMBER / NAME

NUMBER     ← < [0-9]+ >
NAME       ← < [a-zA-Z_][a-zA-Z_0-9]* >

%whitespace ← [ \t\n]*
%word       ← NAME

# Recovery operator labels
semia ← '' { error_message "missing semicolon in assignment." }
stmtb ← (!(Stmt / 'else' / '}') .)* { error_message "invalid statement" }
condw ← &'==' ('==' RelExp) / &'<' ('<' AddExp) / (!')' .)*
```

For instance, `';'^semi` is syntactic sugar for `(';' / %recovery(semi))`. The `%recovery` operator tries to recover the error at `';'` by skipping input text with the recovery expression `semi`. Also, `semi` is associated with the custom message "missing semicolon in assignment."
Here is the result:
```
> cat sample.java
public class Example {
  public static void main(String[] args) {
    int n = 5;
    int f = 1;
    while( < n) {
      f = f * n;
      n = n - 1
    };
    System.out.println(f);
  }
}

> peglint java.peg sample.java
sample.java:5:12: syntax error, unexpected '<', expecting '(',
sample.java:8:5: missing semicolon in assignment.
sample.java:8:6: invalid statement
```
As you can see, it can now show more than one error, and provide more meaningful error messages than the default messages.
We can associate custom error messages to definitions.
```peg
# custom_message.peg
START <- CODE (',' CODE)*
CODE  <- < '0x' [a-fA-F0-9]+ > { error_message 'code format error...' }
%whitespace <- [ \t]*
```

```
> cat custom_message.txt
0x1234,0x@@@@,0xABCD

> peglint custom_message.peg custom_message.txt
custom_message.txt:1:8: code format error...
```
NOTE: If there is more than one element with an error message instruction in a prioritized choice, this feature may not work as you expect.
Change the Start Definition Rule
--------------------------------
We can change the start definition rule as below.
```cpp
auto grammar = R"(
  Start       <- A
  A           <- B (',' B)*
  B           <- '[one]' / '[two]'
  %whitespace <- [ \t\n]*
)";

peg::parser parser(grammar, "A"); // Start Rule is "A"

// or

peg::parser parser;
parser.load_grammar(grammar, "A"); // Start Rule is "A"

parser.parse(" [one] , [two] "); // OK
```
peglint - PEG syntax lint utility
---------------------------------
```
> cd lint
> mkdir build
> cd build
> cmake ..
> make

> ./peglint
usage: grammar_file_path [source_file_path]

  options:
    --source: source text
    --packrat: enable packrat memoise
    --ast: show AST tree
    --opt, --opt-all: optimize all AST nodes except nodes selected with no_ast_opt instruction
    --opt-only: optimize only AST nodes selected with no_ast_opt instruction
    --trace: show concise trace messages
    --profile: show profile report
    --verbose: verbose output for trace and profile
```
```
> cat a.peg
Additive       <- Multiplicative '+' Additive / Multiplicative
Multiplicative <- Primary '*' Multiplicative / Primary
Primary        <- '(' Additive ')' / Number
%whitespace    <- [ \t\r\n]*

> peglint a.peg
[commandline]:3:35: 'Number' is not defined.
```
```
> cat a.peg
Additive       <- Multiplicative '+' Additive / Multiplicative
Multiplicative <- Primary '*' Multiplicative / Primary
Primary        <- '(' Additive ')' / Number
Number         <- < [0-9]+ >
%whitespace    <- [ \t\r\n]*

> peglint --source "1 + a * 3" a.peg
[commandline]:1:3: syntax error
```
```
> cat a.txt
1 + 2 * 3

> peglint --ast a.peg a.txt
+ Additive
  + Multiplicative
    + Primary
      - Number (1)
  + Additive
    + Multiplicative
      + Primary
        - Number (2)
      + Multiplicative
        + Primary
          - Number (3)
```
```
> peglint --ast --opt --source "1 + 2 * 3" a.peg
+ Additive
  - Multiplicative[Number] (1)
  + Additive[Multiplicative]
    - Primary[Number] (2)
    - Multiplicative[Number] (3)
```
```
> cat a.peg
Additive       <- Multiplicative '+' Additive / Multiplicative
Multiplicative <- Primary '*' Multiplicative / Primary
Primary        <- '(' Additive ')' / Number { no_ast_opt }
Number         <- < [0-9]+ >
%whitespace    <- [ \t\r\n]*

> peglint --ast --opt --source "1 + 2 * 3" a.peg
+ Additive/0
  + Multiplicative/1[Primary]
    - Number (1)
  + Additive/1[Multiplicative]
    + Primary/1
      - Number (2)
    + Multiplicative/1[Primary]
      - Number (3)

> peglint --ast --opt-only --source "1 + 2 * 3" a.peg
+ Additive/0
  + Multiplicative/1
    - Primary/1[Number] (1)
  + Additive/1
    + Multiplicative/0
      - Primary/1[Number] (2)
      + Multiplicative/1
        - Primary/1[Number] (3)
```
Sample codes
------------
* Calculator
* Calculator (with parser operators)
* Calculator (AST version)
* Calculator (parsing expressions by precedence climbing)
* Calculator (AST version and parsing expressions by precedence climbing)
* A tiny PL/0 JIT compiler in less than 900 LOC with LLVM and PEG parser
* A Programming Language just for writing Fizz Buzz program. :)
License
-------
MIT license (© 2022 Yuji Hirose)



