cpp-peglib
==========

C++17 header-only PEG (Parsing Expression Grammars) library, by Yuji Hirose. You can start using it right away just by including peglib.h in your project.
Since this library supports only C++17 compilers, please make sure that the compiler option -std=c++17 is enabled (/std:c++17 and /Zc:__cplusplus for MSVC).
You can also try the online version, PEG Playground at https://yhirose.github.io/cpp-peglib.
The PEG syntax is well described on page 2 in the document by Bryan Ford. cpp-peglib also supports the following additional syntax for now:
* '...'i (Case-insensitive literal operator)
* [...]i (Case-insensitive character class operator)
* [^...] (Negated character class operator)
* [^...]i (Case-insensitive negated character class operator)
* {2,5} (Regex-like repetition operator)
* < ... > (Token boundary operator)
* ~ (Ignore operator)
* \x20 (Hex number char)
* \u10FFFF (Unicode char)
* %whitespace (Automatic whitespace skipping)
* %word (Word expression)
* $name( ... ) (Capture scope operator)
* $name< ... > (Named capture operator)
* $name (Backreference operator)
* | (Dictionary operator)
* ↑ (Cut operator)
* MACRO_NAME( ... ) (Parameterized rule or Macro)
* { precedence L - + L / * } (Parsing infix expression)
* %recovery( ... ) (Error recovery operator)
* exp⇑label or exp^label (Syntax sugar for (exp / %recover(label)))
* label { error_message "..." } (Error message instruction)
* { no_ast_opt } (No AST node optimization instruction)
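For example, a few of these operators can be combined in a single grammar. This is a small sketch (the rule names and input are made up for illustration):

```cpp
peg::parser parser(R"(
    ROOT     <- GREETING ' ' ZIPCODE
    GREETING <- 'hello'i            # case-insensitive literal
    ZIPCODE  <- < [0-9]{5,5} >      # repetition operator + token boundary
)");

assert(parser.parse("HELLO 12345"));
```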
An 'End of Input' check is performed by default. To disable the check, please call disable_eoi_check.
This library supports linear-time parsing, known as packrat parsing.
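Both toggles are plain method calls on the parser. A minimal sketch (the tiny grammar is only for illustration):

```cpp
peg::parser parser(R"(ROOT <- 'a'+)");

parser.enable_packrat_parsing(); // turn on packrat (linear-time) parsing
parser.disable_eoi_check();      // allow trailing, unparsed input

parser.parse("aaa bbb"); // succeeds; " bbb" is left unconsumed
```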
IMPORTANT NOTE for some Linux distributions such as Ubuntu and CentOS: Need -pthread option when linking. See #23, #46 and #62.
I am sure that you will enjoy this excellent "Practical parsing with PEG and cpp-peglib" article by bert hubert!
Installation
------------
Run:

```bash
$ npm i peglib.cxx
```

And then include peglib.h as follows:

```cpp
// main.cxx
#include <peglib.h>

int main() { /* ... */ }
```

Finally, compile while adding node_modules/peglib.cxx to your compiler's include paths:

```bash
$ clang++ -std=c++17 -I./node_modules/peglib.cxx main.cxx # or, use g++
$ g++ -std=c++17 -I./node_modules/peglib.cxx main.cxx
```
You may also use a simpler approach with the cpoach tool, which automatically adds the include paths of all installed dependencies to the compiler command:

```bash
$ cpoach clang++ -std=c++17 main.cxx # or, use g++
$ cpoach g++ -std=c++17 main.cxx
```
How to use
----------
This is a simple calculator sample. It shows how to define a grammar, associate semantic actions with the grammar, and handle semantic values.

```cpp
// (1) Include the header file
#include <peglib.h>
#include <assert.h>
#include <iostream>

using namespace peg;
using namespace std;

int main(void) {
  // (2) Make a parser
  parser parser(R"(
    # Grammar for Calculator...
    Additive       <- Multiplicative '+' Additive / Multiplicative
    Multiplicative <- Primary '*' Multiplicative / Primary
    Primary        <- '(' Additive ')' / Number
    Number         <- < [0-9]+ >
    %whitespace    <- [ \t]*
  )");

  assert(static_cast<bool>(parser) == true);

  // (3) Setup actions
  parser["Additive"] = [](const SemanticValues &vs) {
    switch (vs.choice()) {
    case 0: // "Multiplicative '+' Additive"
      return any_cast<int>(vs[0]) + any_cast<int>(vs[1]);
    default: // "Multiplicative"
      return any_cast<int>(vs[0]);
    }
  };

  parser["Multiplicative"] = [](const SemanticValues &vs) {
    switch (vs.choice()) {
    case 0: // "Primary '*' Multiplicative"
      return any_cast<int>(vs[0]) * any_cast<int>(vs[1]);
    default: // "Primary"
      return any_cast<int>(vs[0]);
    }
  };

  parser["Number"] = [](const SemanticValues &vs) {
    return vs.token_to_number<int>();
  };

  // (4) Parse
  parser.enable_packrat_parsing(); // Enable packrat parsing.

  int val;
  parser.parse(" (1 + 2) * 3 ", val);

  assert(val == 9);
}
```
To show syntax errors in grammar text:
```cpp
auto grammar = R"(
  # Grammar for Calculator...
  Additive       <- Multiplicative '+' Additive / Multiplicative
  Multiplicative <- Primary '*' Multiplicative / Primary
  Primary        <- '(' Additive ')' / Number
  Number         <- < [0-9]+ >
  %whitespace    <- [ \t]*
)";

parser parser;

parser.set_logger([](size_t line, size_t col, const string& msg, const string &rule) {
  cerr << line << ":" << col << ": " << msg << "\n";
});

auto ok = parser.load_grammar(grammar);
assert(ok);
```
There are four semantic actions available:

```cpp
[](const SemanticValues& vs, any& dt)
[](const SemanticValues& vs)
[](SemanticValues& vs, any& dt)
[](SemanticValues& vs)
```

A `SemanticValues` value contains the following information:

* Semantic values
* Matched string information
* Token information if the rule is literal or uses a token boundary operator
* Choice number when the rule is 'prioritized choice'

`any& dt` is 'read-write' context data which can be used for whatever purposes. The initial context data is set in the `peg::parser::parse` method.
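Here is a minimal sketch of threading user state through `dt`; the `parse` overload that takes a `dt` argument is assumed from the library API, and the counter is only for illustration:

```cpp
peg::parser parser(R"(ROOT <- [0-9]+)");

// The two-argument action form receives the shared context data.
parser["ROOT"] = [](const peg::SemanticValues &vs, std::any &dt) {
  auto &count = *std::any_cast<int *>(dt); // dt carries a pointer to user state
  count++;
};

int count = 0;
std::any dt = &count;
parser.parse("42", dt); // the initial context data is handed to parse()
```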
A semantic action can return a value of arbitrary data type, which will be wrapped by `peg::any`. If a user returns nothing in a semantic action, the first semantic value in the `const SemanticValues& vs` argument will be returned. (The Yacc parser has the same behavior.)

Here is the `SemanticValues` structure:
```cpp
struct SemanticValues : protected std::vector<std::any>
{
  // Input text
  const char* path;
  const char* ss;

  // Matched string
  std::string_view sv() const { return sv_; }

  // Line number and column at which the matched string is
  std::pair<size_t, size_t> line_info() const;

  // Tokens
  std::vector<std::string_view> tokens;
  std::string_view token(size_t id = 0) const;

  // Token conversion
  std::string token_to_string(size_t id = 0) const;
  template <typename T> T token_to_number() const;

  // Choice number (0 based index)
  size_t choice() const;

  // Transform the semantic value vector to another vector
  template <typename T> vector<T> transform(size_t beg = 0, size_t end = -1) const;
}
```
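For example, `transform` can gather every child value of one type into a vector. A small sketch using the members listed above (the grammar and rule names are made up for illustration):

```cpp
peg::parser parser(R"(
    LIST        <- NUMBER (',' NUMBER)*
    NUMBER      <- < [0-9]+ >
    %whitespace <- [ \t]*
)");

parser["NUMBER"] = [](const SemanticValues &vs) {
  return vs.token_to_number<int>();
};

parser["LIST"] = [](const SemanticValues &vs) {
  auto numbers = vs.transform<int>(); // std::vector<int> of the child values
  int sum = 0;
  for (auto n : numbers) sum += n;
  return sum;
};

int val = 0;
parser.parse("1, 2, 3", val);
assert(val == 6);
```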
The following example uses the `< ... >` operator, which is a token boundary operator.

```cpp
peg::parser parser(R"(
    ROOT  <- _ TOKEN (',' _ TOKEN)*
    TOKEN <- < [a-z0-9]+ > _
    _     <- [ \t\r\n]*
)");

parser["TOKEN"] = [](const SemanticValues& vs) {
  // 'token' doesn't include trailing whitespaces
  auto token = vs.token();
};

auto ret = parser.parse(" token1, token2 ");
```

We can ignore unnecessary semantic values from the list by using the `~` operator.

```cpp
peg::parser parser(R"(
    ROOT <- _ ITEM (',' _ ITEM _)*
    ITEM <- ([a-z0-9])+
    ~_   <- [ \t]*
)");

parser["ROOT"] = [&](const SemanticValues& vs) {
  assert(vs.size() == 2); // should be 2 instead of 5.
};

auto ret = parser.parse(" item1, item2 ");
```
The following grammar is the same as the above.
```cpp
peg::parser parser(R"(
    ROOT <- ~_ ITEM (',' ~_ ITEM ~_)*
    ITEM <- ([a-z0-9])+
    _    <- [ \t]*
)");
```
Semantic predicate support is available with a predicate action.
```cpp
peg::parser parser("NUMBER <- [0-9]+");

parser["NUMBER"] = [](const SemanticValues &vs) {
  return vs.token_to_number<long>();
};

parser["NUMBER"].predicate = [](const SemanticValues &vs,
                                const std::any & /*dt*/, std::string &msg) {
  if (vs.token_to_number<long>() != 100) {
    msg = "value error!!";
    return false;
  }
  return true;
};

long val;
auto ret = parser.parse("100", val);
assert(ret == true);
assert(val == 100);

ret = parser.parse("200", val);
assert(ret == false);
```
enter and leave actions are also available.
```cpp
parser["RULE"].enter = [](const Context &c, const char* s, size_t n, any& dt) {
  std::cout << "enter" << std::endl;
};

parser["RULE"] = [](const SemanticValues& vs, any& dt) {
  std::cout << "action!" << std::endl;
};

parser["RULE"].leave = [](const Context &c, const char* s, size_t n, size_t matchlen, any& value, any& dt) {
  std::cout << "leave" << std::endl;
};
```
You can receive error information via a logger:
```cpp
parser.set_logger([](size_t line, size_t col, const string& msg) {
  ...
});

parser.set_logger([](size_t line, size_t col, const string& msg, const string &rule) {
  ...
});
```
Ignoring Whitespaces
--------------------
As you can see in the first example, we can ignore whitespaces between tokens automatically with the `%whitespace` rule.

The `%whitespace` rule can be applied to the following three conditions:

* trailing spaces on tokens
* leading spaces on text
* trailing spaces on literal strings in rules

These are valid tokens:

```
KEYWORD  <- 'keyword'
KEYWORDI <- 'case_insensitive_keyword'
WORD     <- < [a-zA-Z0-9] [a-zA-Z0-9-_]* >   # token boundary operator is used.
IDENT    <- < IDENT_START_CHAR IDENT_CHAR* > # token boundary operator is used.
```

The following grammar accepts ` one, "two three", four `.

```
ROOT        <- ITEM (',' ITEM)*
ITEM        <- WORD / PHRASE
WORD        <- < [a-z]+ >
PHRASE      <- < '"' (!'"' .)* '"' >

%whitespace <- [ \t\r\n]*
```
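A quick check of that grammar from C++ (a sketch; the assert just confirms the parse succeeds):

```cpp
peg::parser parser(R"(
    ROOT        <- ITEM (',' ITEM)*
    ITEM        <- WORD / PHRASE
    WORD        <- < [a-z]+ >
    PHRASE      <- < '"' (!'"' .)* '"' >
    %whitespace <- [ \t\r\n]*
)");

assert(parser.parse(R"( one, "two three", four )"));
```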
Word expression
---------------
```cpp
peg::parser parser(R"(
    ROOT        <- 'hello' 'world'
    %whitespace <- [ \t\r\n]*
    %word       <- [a-z]+
)");

parser.parse("hello world"); // OK
parser.parse("helloworld");  // NG
```
Capture/Backreference
---------------------
```cpp
peg::parser parser(R"(
    ROOT      <- CONTENT
    CONTENT   <- (ELEMENT / TEXT)*
    ELEMENT   <- $(STAG CONTENT ETAG)
    STAG      <- '<' $tag< TAG_NAME > '>'
    ETAG      <- '</' $tag '>'
    TAG_NAME  <- 'b' / 'u'
    TEXT      <- TEXT_DATA
    TEXT_DATA <- ![<] .
)");

parser.parse("This is <b>a <u>test</u> text</b>."); // OK
parser.parse("This is <b>a <u>test</b> text</u>."); // NG
parser.parse("This is <b>a <u>test text</b>.");     // NG
```
Dictionary
----------
The `|` operator allows us to make a word dictionary for fast lookup by using a Trie structure internally. We don't have to worry about the order of words.

```peg
START <- 'This month is ' MONTH '.'
MONTH <- 'Jan' | 'January' | 'Feb' | 'February' | '...'
```

We are able to find which item is matched with `choice()`.

```cpp
parser["MONTH"] = [](const SemanticValues &vs) {
  auto id = vs.choice();
};
```

It supports the case-insensitive mode.

```peg
START <- 'This month is ' MONTH '.'
MONTH <- 'Jan'i | 'January'i | 'Feb'i | 'February'i | '...'i
```
Cut operator
------------
The `↑` operator could mitigate the backtrack performance problem, but has a risk of changing the meaning of the grammar.

```peg
S <- '(' ↑ P ')' / '"' ↑ P '"' / P
P <- 'a' / 'b' / 'c'
```

When we parse `(z` with the above grammar, we don't have to backtrack in S after `(` is matched, because a cut operator is inserted there.
Parameterized Rule or Macro
---------------------------
```peg
# Syntax
Start      ← _ Expr
Expr       ← Sum
Sum        ← List(Product, SumOpe)
Product    ← List(Value, ProOpe)
Value      ← Number / T('(') Expr T(')')

# Token
SumOpe     ← T('+' / '-')
ProOpe     ← T('*' / '/')
Number     ← T([0-9]+)
~_         ← [ \t\r\n]*

# Macro
List(I, D) ← I (D I)*
T(x)       ← < x > _
```
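From the C++ side, macros are used like any other rule. A minimal sketch, assuming the macro grammar above is stored in a `grammar` string:

```cpp
peg::parser parser(grammar); // `grammar` holds the macro grammar above
assert(static_cast<bool>(parser));
assert(parser.parse("1 + 2 * 3"));
```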
Parsing infix expression by Precedence climbing
-----------------------------------------------
Regarding the precedence climbing algorithm, please see this article.
```cpp
parser parser(R"(
  EXPRESSION             <- INFIX_EXPRESSION(ATOM, OPERATOR)
  ATOM                   <- NUMBER / '(' EXPRESSION ')'
  OPERATOR               <- < [-+/*] >
  NUMBER                 <- < '-'? [0-9]+ >
  %whitespace            <- [ \t]*

  # Declare order of precedence
  INFIX_EXPRESSION(A, O) <- A (O A)* {
    precedence
      L + -
      L * /
  }
)");

parser["INFIX_EXPRESSION"] = [](const SemanticValues& vs) -> long {
  auto result = any_cast<long>(vs[0]);
  if (vs.size() > 1) {
    auto ope = any_cast<char>(vs[1]);
    auto num = any_cast<long>(vs[2]);
    switch (ope) {
      case '+': result += num; break;
      case '-': result -= num; break;
      case '*': result *= num; break;
      case '/': result /= num; break;
    }
  }
  return result;
};
parser["OPERATOR"] = [](const SemanticValues& vs) { return vs.sv()[0]; };
parser["NUMBER"] = [](const SemanticValues& vs) { return vs.token_to_number<long>(); };

long val;
parser.parse(" -1 + (1 + 2) * 3 - -1", val);
assert(val == 9);
```
The `precedence` instruction can be applied only to the following 'list' style rule.

```
Rule <- Atom (Operator Atom)* {
  precedence
    L - +
    L / *
    R ^
}
```

The `precedence` instruction contains precedence info entries. Each entry starts with the associativity, which is 'L' (left) or 'R' (right), then operator literal tokens follow. The first entry has the highest order level.
AST generation
--------------
cpp-peglib is able to generate an AST (Abstract Syntax Tree) when parsing. The `enable_ast` method on the `peg::parser` class enables the feature.

NOTE: An AST node holds a corresponding token as `std::string_view` for performance and less memory usage. It is users' responsibility to keep the original source text along with the generated AST tree.

```cpp
peg::parser parser(R"(
  ...
  definition1 <- ... { no_ast_opt }
  definition2 <- ... { no_ast_opt }
  ...
)");

parser.enable_ast();

shared_ptr<peg::Ast> ast;
if (parser.parse("...", ast)) {
  cout << peg::ast_to_s(ast);

  ast = parser.optimize_ast(ast);
  cout << peg::ast_to_s(ast);
}
```

`optimize_ast` removes redundant nodes to make an AST simpler. If you want to disable this behavior for particular rules, the `no_ast_opt` instruction can be used.

It internally calls `peg::AstOptimizer` to do the job. You can make your own AST optimizers to fit your needs.
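To work with the tree directly, you can walk the nodes yourself. A minimal sketch; the `name`, `is_token`, `token`, and `nodes` members are assumed from the `peg::Ast` definition in peglib.h:

```cpp
#include <peglib.h>
#include <iostream>
#include <memory>
#include <string>

// Recursively print an AST (sketch only). The members `name`, `is_token`,
// `token`, and `nodes` are assumed from peg::Ast in peglib.h.
void dump(const std::shared_ptr<peg::Ast> &ast, int level = 0) {
  std::string indent(level * 2, ' ');
  if (ast->is_token) {
    std::cout << indent << ast->name << ": " << ast->token << "\n";
  } else {
    std::cout << indent << ast->name << "\n";
    for (const auto &node : ast->nodes) {
      dump(node, level + 1);
    }
  }
}
```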
See actual usages in the AST calculator example and PL/0 language example.
Make a parser with parser combinators
-------------------------------------
Instead of making a parser by parsing PEG syntax text, we can also construct a parser by hand with parser combinators. Here is an example:
```cpp
using namespace peg;
using namespace std;

vector<string> tags;

Definition ROOT, TAG_NAME, _;
ROOT     <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const SemanticValues& vs) {
              tags.push_back(vs.token_to_string());
            };
_        <= zom(cls(" \t"));

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");
```
The following are available operators:
| Operator | Description | Operator | Description |
|:---------|:--------------------------------|:---------|:--------------------|
| seq | Sequence | cho | Prioritized Choice |
| zom | Zero or More | oom | One or More |
| opt | Optional | apd | And predicate |
| npd | Not predicate | lit | Literal string |
| liti | Case-insensitive Literal string | cls | Character class |
| ncls | Negated Character class | chr | Character |
| dot | Any character | tok | Token boundary |
| ign | Ignore semantic value | csc | Capture scope |
| cap | Capture | bkr | Back reference |
| dic | Dictionary | pre | Infix expression |
| rec | Recovery | usr | User defined parser |
| rep | Repetition | | |
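For instance, an optional sign followed by digits can be composed directly from these combinators (a quick sketch, not taken from the library docs):

```cpp
using namespace peg;

Definition NUMBER;

// Roughly equivalent to the PEG rule: NUMBER <- ('-' / '+')? [0123456789]+
NUMBER <= seq(opt(cho(chr('-'), chr('+'))), oom(cls("0123456789")));

auto ret = NUMBER.parse("-42"); // ret reports whether the match succeeded
```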
Adjust definitions
------------------
It's possible to add/override definitions.
```cpp
auto syntax = R"(
  ROOT <- _ 'Hello' _ NAME '!' _
)";

Rules additional_rules = {
  {
    "NAME", usr([](const char* s, size_t n, SemanticValues& vs, any& dt) -> size_t {
      static vector<string> names = { "PEG", "BNF" };
      for (const auto& name: names) {
        if (name.size() <= n && !name.compare(0, name.size(), s, name.size())) {
          return name.size(); // processed length
        }
      }
      return -1; // parse error
    })
  },
  {
    "~_", zom(cls(" \t\r\n"))
  }
};

auto g = parser(syntax, additional_rules);
assert(g.parse(" Hello BNF! "));
```
Unicode support
---------------
cpp-peglib accepts UTF8 text. `.` matches a Unicode codepoint. Also, it supports `\u????`.
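Because `.` matches one codepoint rather than one byte, a rule like the following counts characters in UTF-8 input. A small sketch (the sample string and rule names are arbitrary):

```cpp
peg::parser parser(R"(
    ROOT <- CHAR+
    CHAR <- .   # one Unicode codepoint, not one byte
)");

parser["ROOT"] = [](const peg::SemanticValues &vs) {
  return vs.size(); // number of CHAR matches, i.e. codepoints
};

size_t count = 0;
parser.parse("日本語", count); // UTF-8 input
assert(count == 3);
```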
Error report and recovery
-------------------------
cpp-peglib supports the furthest failure error position report as described in the Bryan Ford original document.
For better error report and recovery, cpp-peglib supports the 'recovery' operator with a label, which can be associated with a recovery expression and a custom error message. This idea comes from the fantastic "Syntax Error Recovery in Parsing Expression Grammars" paper by Sergio Medeiros and Fabio Mascarenhas.
The custom message supports `%t`, which is a placeholder for the unexpected token, and `%c` for the unexpected Unicode char.

Here is an example of a Java-like grammar:

```peg
# java.peg
Prog       ← 'public' 'class' NAME '{' 'public' 'static' 'void' 'main' '(' 'String' '[' ']' NAME ')' BlockStmt '}'
BlockStmt  ← '{' (!'}' Stmt^stmtb)* '}' # Annotated with `stmtb`
Stmt       ← IfStmt / WhileStmt / PrintStmt / DecStmt / AssignStmt / BlockStmt
IfStmt     ← 'if' '(' Exp ')' Stmt ('else' Stmt)?
WhileStmt  ← 'while' '(' Exp^condw ')' Stmt # Annotated with `condw`
DecStmt    ← 'int' NAME ('=' Exp)? ';'
AssignStmt ← NAME '=' Exp ';'^semia # Annotated with `semia`
PrintStmt  ← 'System.out.println' '(' Exp ')' ';'
Exp        ← RelExp ('==' RelExp)*
RelExp     ← AddExp ('<' AddExp)*
AddExp     ← MulExp (('+' / '-') MulExp)*
MulExp     ← AtomExp (('*' / '/') AtomExp)*
AtomExp    ← '(' Exp ')' / NUMBER / NAME

NUMBER     ← < [0-9]+ >
NAME       ← < [a-zA-Z_][a-zA-Z_0-9]* >

%whitespace ← [ \t\n]*
%word       ← NAME

# Recovery operator labels
semia ← '' { error_message "missing semicolon in assignment." }
stmtb ← (!(Stmt / 'else' / '}') .)* { error_message "invalid statement" }
condw ← &'==' ('==' RelExp) / &'<' ('<' AddExp) / (!')' .)*
```

For instance, `';'^semi` is syntactic sugar for `(';' / %recovery(semi))`. The `%recovery` operator tries to recover the error at `';'` by skipping input text with the recovery expression `semi`. Also, `semi` is associated with the custom message "missing semicolon in assignment."
Here is the result:
```
> cat sample.java
public class Example {
  public static void main(String[] args) {
    int n = 5;
    int f = 1;
    while( < n) {
      f = f * n;
      n = n - 1
    };
    System.out.println(f);
  }
}

> peglint java.peg sample.java
sample.java:5:12: syntax error, unexpected '<', expecting '(',
sample.java:8:5: missing semicolon in assignment.
sample.java:8:6: invalid statement
```
As you can see, it can now show more than one error, and provide more meaningful error messages than the default messages.
We can associate custom error messages to definitions.
```peg
# custom_message.peg
START <- CODE (',' CODE)*
CODE  <- < '0x' [a-fA-F0-9]+ > { error_message 'code format error...' }
%whitespace <- [ \t]*
```

```
> cat custom_message.txt
0x1234,0x@@@@,0xABCD

> peglint custom_message.peg custom_message.txt
custom_message.txt:1:8: code format error...
```
NOTE: If there is more than one element with an error message instruction in a prioritized choice, this feature may not work as you expect.
Change the Start Definition Rule
--------------------------------
We can change the start definition rule as below.
```cpp
auto grammar = R"(
  Start       <- A
  A           <- B (',' B)*
  B           <- '[one]' / '[two]'
  %whitespace <- [ \t\n]*
)";

peg::parser parser(grammar, "A"); // Start Rule is "A"

// or

peg::parser parser;
parser.load_grammar(grammar, "A"); // Start Rule is "A"

parser.parse(" [one] , [two] "); // OK
```
peglint - PEG syntax lint utility
---------------------------------
```
> cd lint
> mkdir build
> cd build
> cmake ..
> make

> ./peglint
usage: grammar_file_path [source_file_path]

  options:
    --source: source text
    --packrat: enable packrat memoise
    --ast: show AST tree
    --opt, --opt-all: optimize all AST nodes except nodes selected with no_ast_opt instruction
    --opt-only: optimize only AST nodes selected with no_ast_opt instruction
    --trace: show concise trace messages
    --profile: show profile report
    --verbose: verbose output for trace and profile
```
```
> cat a.peg
Additive       <- Multiplicative '+' Additive / Multiplicative
Multiplicative <- Primary '*' Multiplicative / Primary
Primary        <- '(' Additive ')' / Number
%whitespace    <- [ \t\r\n]*

> peglint a.peg
[commandline]:3:35: 'Number' is not defined.
```
```
> cat a.peg
Additive       <- Multiplicative '+' Additive / Multiplicative
Multiplicative <- Primary '*' Multiplicative / Primary
Primary        <- '(' Additive ')' / Number
Number         <- < [0-9]+ >
%whitespace    <- [ \t\r\n]*

> peglint --source "1 + a * 3" a.peg
[commandline]:1:3: syntax error
```
```
> cat a.txt
1 + 2 * 3

> peglint --ast a.peg a.txt
+ Additive
  + Multiplicative
    + Primary
      - Number (1)
  + Additive
    + Multiplicative
      + Primary
        - Number (2)
      + Multiplicative
        + Primary
          - Number (3)
```
```
> peglint --ast --opt --source "1 + 2 * 3" a.peg
+ Additive
  - Multiplicative[Number] (1)
  + Additive[Multiplicative]
    - Primary[Number] (2)
    - Multiplicative[Number] (3)
```
```
> cat a.peg
Additive       <- Multiplicative '+' Additive / Multiplicative
Multiplicative <- Primary '*' Multiplicative / Primary
Primary        <- '(' Additive ')' / Number { no_ast_opt }
Number         <- < [0-9]+ >
%whitespace    <- [ \t\r\n]*

> peglint --ast --opt --source "1 + 2 * 3" a.peg
+ Additive/0
  + Multiplicative/1[Primary]
    - Number (1)
  + Additive/1[Multiplicative]
    + Primary/1
      - Number (2)
    + Multiplicative/1[Primary]
      - Number (3)

> peglint --ast --opt-only --source "1 + 2 * 3" a.peg
+ Additive/0
  + Multiplicative/1
    - Primary/1[Number] (1)
  + Additive/1
    + Multiplicative/0
      - Primary/1[Number] (2)
      + Multiplicative/1
        - Primary/1[Number] (3)
```
Sample codes
------------
* Calculator
* Calculator (with parser operators)
* Calculator (AST version)
* Calculator (parsing expressions by precedence climbing)
* Calculator (AST version and parsing expressions by precedence climbing)
* A tiny PL/0 JIT compiler in less than 900 LOC with LLVM and PEG parser
* A Programming Language just for writing Fizz Buzz program. :)
License
-------
MIT license (© 2022 Yuji Hirose)



