Tutorial 1 - Write a Grammar, Compile and Use

This tutorial explains, through a simple example, how to use the parser generator in Rust. It is also available for C# and Java. It assumes you are moderately familiar with regular expressions and context-free grammars. The plan is to download the tooling, get a prepared grammar, compile it with himecc and use the generated parser. This tutorial uses the example of a small language for simple mathematical expressions. With this language, one can use the standard +, -, * and / operators, as well as ( ).

Pre-requisites for this guide are:
  • A local installation of a compatible .Net platform such as:
    • The .Net Framework 2.0 or higher on Windows (installed by default on Windows Vista and up).
    • Mono 4.6 or higher on Linux and MacOS.
    • .Net Core 2.0 or higher on a compatible OS.
  • A local installation of a recent version of the Rust toolchain with Cargo.

Get the tooling

The toolchain for Hime can be found on the download page.
Download and extract the toolchain package.
You may want some editing support for Hime grammars. In this case, head to the editors line-up page. For an IDE experience, we recommend the Visual Studio Code extension.

Write a grammar

Hime provides its own language for expressing context-free grammars. It is largely similar to the standard BNF form with some enhancements for the sake of expressivity. In the example below, the grammar is shown to have three sections:

grammar MathExp
{
    options { }
    terminals { }
    rules { }
}
  • The options section lets you specify options for the lexer and parser generator.
  • The terminals section is used to describe the regular expressions matching the terminals in the grammar. This section is optional.
  • The rules section is used for stating the context-free rules of the grammar.

In our example, we want to use numbers so we have to define regular expressions matching these:

terminals
{
    INTEGER -> [1-9] [0-9]* | '0' ;
    REAL    -> INTEGER? '.' INTEGER  (('e' | 'E') ('+' | '-')? INTEGER)?
            |  INTEGER ('e' | 'E') ('+' | '-')? INTEGER ;
    NUMBER  -> INTEGER | REAL ;
}

Here, we define three terminals:

  • INTEGER, which is defined using a very simple regular expression
  • REAL, which matches decimal numbers (23.15) and floats using a power expression (12e-15)
  • NUMBER, which is the union of the two former
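To make these definitions concrete, here is a minimal hand-written sketch of what the three expressions accept. It is an illustration only, not the lexer that himecc generates, and the function names (integer_prefix, exponent_prefix, is_integer, is_real, is_number) are made up for this example.

```rust
// Hand-written checks mirroring the INTEGER, REAL and NUMBER definitions.
// Illustration only: this is not the code generated by himecc.

/// Length of the INTEGER found at the start of `s`, if any.
/// INTEGER -> [1-9] [0-9]* | '0'
fn integer_prefix(s: &[u8]) -> Option<usize> {
    match s.first() {
        Some(b'0') => Some(1),
        Some(c) if (b'1'..=b'9').contains(c) => {
            Some(1 + s[1..].iter().take_while(|c| c.is_ascii_digit()).count())
        }
        _ => None,
    }
}

/// Length of the exponent found at the start of `s`, if any.
/// ('e' | 'E') ('+' | '-')? INTEGER
fn exponent_prefix(s: &[u8]) -> Option<usize> {
    if !matches!(s.first(), Some(b'e') | Some(b'E')) {
        return None;
    }
    let mut pos = 1;
    if matches!(s.get(pos), Some(b'+') | Some(b'-')) {
        pos += 1;
    }
    integer_prefix(&s[pos..]).map(|n| pos + n)
}

/// True when the whole text is an INTEGER.
fn is_integer(text: &str) -> bool {
    let s = text.as_bytes();
    integer_prefix(s) == Some(s.len())
}

/// True when the whole text is a REAL.
fn is_real(text: &str) -> bool {
    let s = text.as_bytes();
    let int_len = integer_prefix(s).unwrap_or(0);
    let rest = &s[int_len..];
    if rest.first() == Some(&b'.') {
        // First alternative: INTEGER? '.' INTEGER (exponent)?
        let frac = &rest[1..];
        match integer_prefix(frac) {
            Some(n) => {
                let tail = &frac[n..];
                tail.is_empty() || exponent_prefix(tail) == Some(tail.len())
            }
            None => false,
        }
    } else {
        // Second alternative: INTEGER exponent
        int_len > 0 && exponent_prefix(rest) == Some(rest.len())
    }
}

/// NUMBER -> INTEGER | REAL
fn is_number(text: &str) -> bool {
    is_integer(text) || is_real(text)
}

fn main() {
    for text in ["0", "42", "23.15", "12e-15", ".5", "007", "1.05"] {
        println!("{:>8} -> NUMBER: {}", text, is_number(text));
    }
}
```

Read literally, the definitions reject inputs such as 007 and 1.05, because neither 007 nor 05 is an INTEGER. Whether this is desirable depends on the language you are designing.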

Because NUMBER will match any INTEGER and REAL, we have to decide which of these is to be recognized by the lexer. The simple rule here is that the latest defined terminals have higher priority. Hence, the generated lexer will only be able to recognize NUMBER and will never yield INTEGER or REAL. Also, we want to define a regular expression matching blank spaces that will be used as separators:

WHITE_SPACE -> U+0020 | U+0009 | U+000B | U+000C ;
SEPARATOR -> WHITE_SPACE+;

Here, the WHITE_SPACE expression will match a single space, tab, vertical tab or form feed. The SEPARATOR expression will match any string of at least one WHITE_SPACE and has a higher priority than WHITE_SPACE.

Now, we can use these terminals to write the context-free grammar rules:

rules
{
    exp_atom   -> NUMBER
               |  '(' exp ')' ;

    exp_factor -> exp_atom
               |  exp_factor '*' exp_atom
               |  exp_factor '/' exp_atom ;

    exp_term   -> exp_factor
               |  exp_term '+' exp_factor
               |  exp_term '-' exp_factor ;

    exp        -> exp_term ;
}

Here, we have 4 grammar rules, using 4 grammar variables and the NUMBER terminal. Note that it is possible to add inline terminals in the grammar rules (e.g.: '+'). These terminals will have higher priority over those defined in the terminals section, although their definition is limited to simple text (no regular expressions here).

Context-free grammar rules are to be written as:
variable -> definition ;

You can also use the *, +, and ? operators as in regular expressions. The parser generator will automatically generate the corresponding additional rules. Use ( ) for grouping terms in rules and | for alternatives.
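To see what the layering of exp_atom, exp_factor and exp_term encodes (operator precedence and left associativity), the rules can be transcribed by hand into a small recursive-descent evaluator. This is only a sketch for illustration: Hime generates its own table-driven parsers, and the Parser struct and evaluate function below are hypothetical names that do not come from the generated code. NUMBER is also simplified here to digits with an optional fractional part.

```rust
// Illustration only: a tiny recursive-descent evaluator mirroring the
// MathExp rules. This is not what himecc produces.

struct Parser<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn new(text: &'a str) -> Self {
        Parser { input: text.as_bytes(), pos: 0 }
    }

    // Skips spaces and tabs, then returns the next byte without consuming it.
    fn peek(&mut self) -> Option<u8> {
        while matches!(self.input.get(self.pos), Some(b' ') | Some(b'\t')) {
            self.pos += 1;
        }
        self.input.get(self.pos).copied()
    }

    // exp -> exp_term
    fn exp(&mut self) -> Option<f64> {
        self.exp_term()
    }

    // exp_term -> exp_factor | exp_term '+' exp_factor | exp_term '-' exp_factor
    // The left recursion becomes a loop, which preserves left associativity.
    fn exp_term(&mut self) -> Option<f64> {
        let mut value = self.exp_factor()?;
        loop {
            match self.peek() {
                Some(b'+') => { self.pos += 1; value += self.exp_factor()?; }
                Some(b'-') => { self.pos += 1; value -= self.exp_factor()?; }
                _ => return Some(value),
            }
        }
    }

    // exp_factor -> exp_atom | exp_factor '*' exp_atom | exp_factor '/' exp_atom
    fn exp_factor(&mut self) -> Option<f64> {
        let mut value = self.exp_atom()?;
        loop {
            match self.peek() {
                Some(b'*') => { self.pos += 1; value *= self.exp_atom()?; }
                Some(b'/') => { self.pos += 1; value /= self.exp_atom()?; }
                _ => return Some(value),
            }
        }
    }

    // exp_atom -> NUMBER | '(' exp ')'
    fn exp_atom(&mut self) -> Option<f64> {
        if self.peek()? == b'(' {
            self.pos += 1;
            let value = self.exp()?;
            if self.peek()? != b')' {
                return None;
            }
            self.pos += 1;
            return Some(value);
        }
        // Simplified NUMBER: a run of digits and dots, validated by parse()
        let start = self.pos;
        while matches!(self.input.get(self.pos), Some(c) if c.is_ascii_digit() || *c == b'.') {
            self.pos += 1;
        }
        std::str::from_utf8(&self.input[start..self.pos]).ok()?.parse().ok()
    }
}

/// Evaluates an expression, or returns None when the text is not valid.
fn evaluate(text: &str) -> Option<f64> {
    let mut parser = Parser::new(text);
    let value = parser.exp()?;
    // The whole input must have been consumed.
    if parser.peek().is_some() {
        return None;
    }
    Some(value)
}

fn main() {
    for text in ["2 + 3", "2 + 3 * 4", "(2 + 3) * 4", "8 - 3 - 2"] {
        println!("{} = {:?}", text, evaluate(text));
    }
}
```

Because exp_factor sits below exp_term, multiplication and division bind tighter than addition and subtraction, and parentheses restart from exp.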

Finally, we set up the grammar options as follows:

options
{
    Axiom = "exp";
    Separator = "SEPARATOR";
}
  • The Axiom option specifies which grammar variable is to be used as the root symbol for the grammar.
  • The Separator option specifies which terminal is to be used as token separator in the lexer. Text matching the expression of the separator will automatically be discarded by the lexer.
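The effect of the Separator option can be pictured with a small hand-written tokenizer that consumes runs of WHITE_SPACE without emitting anything for them. This is a sketch under simplifying assumptions, not the generated lexer: the Token type and the lex function are hypothetical, and numbers are reduced to plain runs of digits.

```rust
// Illustration only: separators are consumed but never turned into tokens.

#[derive(Debug, PartialEq)]
enum Token {
    Number(String),
    Symbol(char),
}

/// WHITE_SPACE -> U+0020 | U+0009 | U+000B | U+000C
fn is_white_space(c: char) -> bool {
    matches!(c, '\u{0020}' | '\u{0009}' | '\u{000B}' | '\u{000C}')
}

fn lex(text: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = text.chars().peekable();
    while let Some(&c) = chars.peek() {
        if is_white_space(c) {
            // SEPARATOR -> WHITE_SPACE+ : consume the whole run, emit nothing
            while chars.peek().map_or(false, |&c| is_white_space(c)) {
                chars.next();
            }
        } else if c.is_ascii_digit() {
            // Simplified number: a run of ASCII digits
            let mut digits = String::new();
            while let Some(&d) = chars.peek() {
                if !d.is_ascii_digit() {
                    break;
                }
                digits.push(d);
                chars.next();
            }
            tokens.push(Token::Number(digits));
        } else {
            // Anything else is treated as a single-character inline terminal
            tokens.push(Token::Symbol(c));
            chars.next();
        }
    }
    tokens
}

fn main() {
    // Both inputs yield the same tokens: separators leave no trace.
    println!("{:?}", lex("2+3"));
    println!("{:?}", lex("  2 \t +   3 "));
}
```

As a consequence, the grammar rules never have to mention blank spaces explicitly.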

Compile the grammar

Now that we have a grammar, let's compile it to generate the parser. The toolchain package contains at its root useful front scripts that can be used to invoke the himecc compiler. On Windows, you should look for the himecc.bat script. On Linux and MacOS, you should look for the himecc script. If this does not suit you, you may invoke the himecc.exe assembly for your installed framework.

Compile MathExp.gram:

  • On Windows: himecc.bat MathExp.gram -t:rust or explicitly with net461/himecc.exe MathExp.gram -t:rust
  • On Linux and MacOS: ./himecc MathExp.gram -t:rust
  • On any OS, explicitly with .Net Core: dotnet netcore20/himecc.dll MathExp.gram -t:rust
  • On any OS, explicitly with Mono: mono net461/himecc.exe MathExp.gram -t:rust

The tool will generate 3 files:

  • MathExp.rs, the source file for the lexer and parser
  • MathExpLexer.bin, the binary representation of the lexer’s automaton
  • MathExpParser.bin, the binary representation of the parser’s automaton

Note here that the default target for himecc is the .Net platform, so we have to specify Rust as the target with the -t:rust option. For a complete guide to the options of himecc, head to the reference page.

Set up the test project

Set up a test project as a Cargo binary crate. Use the following project layout:

test/
+-> Cargo.toml
+-> src/
    +-> main.rs
    +-> MathExp.rs
    +-> MathExpLexer.bin
    +-> MathExpParser.bin

Set the minimal Cargo.toml:

[package]
name = "test_hime"
version = "1.0.0"

[dependencies]
hime_redist = "3.4.0"

Set the minimal main.rs:

mod MathExp; // default namespace for the parser is the grammar's name

extern crate hime_redist;

use hime_redist::ast::AstNode;

fn main() {
    let result = MathExp::parse_string("2 + 3");
    let ast = result.get_ast();
    let root = ast.get_root();
    print(root, Vec::<bool>::new());
}

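// Recursively prints a node and its sub-tree.
// `crossings` holds one flag per ancestor level: true when a vertical bar
// must be drawn at that level because further siblings follow.
fn print<'a>(node: AstNode<'a>, crossings: Vec<bool>) {
    let mut i = 0;
    if !crossings.is_empty() {
        // indentation for every level except the last one
        while i < crossings.len() - 1 {
            print!("{:}", if crossings[i] { "|   " } else { "    " });
            i += 1;
        }
        // connector for the current node
        print!("+-> ");
    }
    println!("{:}", node);
    i = 0;
    let children = node.children();
    while i < children.len() {
        let mut child_crossings = crossings.clone();
        // a bar is needed below this child only if it is not the last one
        child_crossings.push(i < children.len() - 1);
        print(children.at(i), child_crossings);
        i += 1;
    }
}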

The parse tree, or AST (Abstract Syntax Tree), produced by the parser is available as a handle to the tree's root in the code above. Nodes in the tree have two important accessors:

  • get_symbol gives the grammar symbol attached to the node: either a variable or a token, i.e. a piece of text matched by the lexer corresponding to one of the grammar's terminals.
  • children is the list of the current node's children.

For more info, look into the complete API documentation.
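The crossings technique used by the print function above can be tried without the generated parser. The sketch below applies the same algorithm to a hypothetical stand-in type, Node, which is not part of hime_redist, and writes into a String so that the result can be inspected. The sample tree is built by hand to have the shape of the parse tree for "2 + 3".

```rust
// Illustration only: `Node` is a hypothetical stand-in for an AST node,
// not the AstNode type from hime_redist.

struct Node {
    label: String,
    children: Vec<Node>,
}

fn node(label: &str, children: Vec<Node>) -> Node {
    Node { label: label.to_string(), children }
}

/// Renders a node and its sub-tree into `out`.
/// `crossings` holds one flag per ancestor level: true when a vertical bar
/// must be drawn at that level because further siblings follow.
fn render(node: &Node, crossings: &mut Vec<bool>, out: &mut String) {
    if let Some((_, ancestors)) = crossings.split_last() {
        // indentation for every level except the last one
        for &crossing in ancestors {
            out.push_str(if crossing { "|   " } else { "    " });
        }
        // connector for the current node
        out.push_str("+-> ");
    }
    out.push_str(&node.label);
    out.push('\n');
    let count = node.children.len();
    for (i, child) in node.children.iter().enumerate() {
        // a bar is needed below this child only if it is not the last one
        crossings.push(i + 1 < count);
        render(child, crossings, out);
        crossings.pop();
    }
}

/// Hand-built tree with the shape of the parse tree for "2 + 3".
fn sample_tree() -> Node {
    let factor = |text: &str| {
        node("exp_factor", vec![node("exp_atom", vec![node(text, vec![])])])
    };
    node("exp", vec![node("exp_term", vec![
        node("exp_term", vec![factor("NUMBER = 2")]),
        node("+ = +", vec![]),
        factor("NUMBER = 3"),
    ])])
}

fn main() {
    let mut out = String::new();
    render(&sample_tree(), &mut Vec::new(), &mut out);
    print!("{}", out);
}
```

Unlike the original, this variant pushes and pops on a single vector instead of cloning it for each child; the printed layout is the same.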

Build and execute the test project

Build the test project:

cargo build

Execute the test project:

  • On Windows: target/debug/test_hime.exe
  • On Linux and MacOS: ./target/debug/test_hime

The output of the program should be a text printout of the produced syntax tree, as follows:

exp
+-> exp_term
    +-> exp_term
    |   +-> exp_factor
    |       +-> exp_atom
    |           +-> NUMBER = 2
    +-> + = +
    +-> exp_factor
        +-> exp_atom
            +-> NUMBER = 3

The lexer and parser created in this manner are bound to the piece of text given to the lexer. In order to parse another piece of text, simply create a new lexer and parser in the same way, here by calling MathExp::parse_string again. The two objects are lightweight, so you should not have to worry about creating multiple ones. In this example, the lexer is given the full string that has to be parsed. Another constructor allows you to give a stream of text in the form of a TextReader.

Go to the next tutorial to see how to improve the produced parse trees by using tree actions.