Tutorial 1 - Write a Grammar Compile and Use

This tutorial explains through a simple example how to use the parser generator in Java. It is also available for C# and Rust. It assumes you are moderately familiar with regular expressions, and context-free grammars. The program here is to download the tooling, get a prepared grammar, compile it with himecc and use the generated parser. This tutorial will use the example of a small language for simple mathematical expressions. Using this language, one would be able to use the standards +, -, * and / operators, as well as ( ).

Pre-requisites for this tutorial are:
  • A local installation of a compatible .Net platform such as:
    • The .Net Framework 2.0 or higher on Windows (installed by default on Windows Vista and up).
    • Mono 4.6 or higher on Linux and MacOS.
    • .Net Core 2.0 or higher on a compatible OS.
  • A location installation of the Java JDK 7 or higher, for example OpenJDK and Maven.

Get the tooling

The toolchain for Hime can be found on the download page.
Download and extract the toolchain package.
You may want some editing support for Hime grammars. In this case, head to the editors line-up page. For an IDE experience, we recommend the Visual Studio Code extension.

Write a grammar

Hime provides its own language for expressing context-free grammars. It is largely similar to the standard BNF form with some enhancements for the sake of expressivity. In the example below, the grammar is shown to have three sections:

grammar MathExp
{
    options { }
    terminals { }
    rules { }
}
  • The options section lets you specify options for the lexer and parser generator.
  • The terminals section is used to describe the regular expressions matching the terminals in the grammar. This section is optional.
  • The rules section is used for stating the context-free rules of the grammar.

In our example, we want to use numbers so we have to define regular expressions matching these:

terminals
{
    INTEGER -> [1-9] [0-9]* | '0' ;
    REAL    -> INTEGER? '.' INTEGER  (('e' | 'E') ('+' | '-')? INTEGER)?
            |  INTEGER ('e' | 'E') ('+' | '-')? INTEGER ;
    NUMBER  -> INTEGER | REAL ;
}

Here, we define three terminals:

  • INTEGER, which is defined using a very simple regular expression
  • REAL, which matches decimal numbers ( 23.15) and floats using a power expression ( 12e-15)
  • NUMBER, which is the union of the two former

Because NUMBER will match any INTEGER and REAL, we have to decide which of these is to be recognized by the lexer. The simple rule here is that the latest defined terminals have higher priority. Hence, the generated lexer will only be able to recognize NUMBER and will never yield INTEGER or REAL. Also, we want to define a regular expression matching blank spaces that will be used as separators:

WHITE_SPACE -> U+0020 | U+0009 | U+000B | U+000C ;
SEPARATOR -> WHITE_SPACE+;

Here, the WHITE_SPACE expression will match a single space, or tab, or vertical tab. The SEPARATOR expression will match any string of at least one WHITE_SPACE and has a higher priority than WHITE_SPACE.

Now, we can use these terminals to write the context-free grammar rules:

rules
{
    exp_atom   -> NUMBER
               | '('exp ')' ;

    exp_factor -> exp_atom
               |  exp_factor '*' exp_atom
               |  exp_factor '/' exp_atom ;

    exp_term   -> exp_factor
               |  exp_term '+' exp_factor
               |  exp_term '-' exp_factor ;

    exp        -> exp_term ;
}

Here, we have 4 grammar rules, using 4 grammar variables and the NUMBER terminal. Note that it is possible to add inline terminals in the grammar rules (e.g.: '+'). These terminals will have higher priority over those defined in the terminals section, although their definition is limited to simple text (no regular expressions here).

Context-free grammar rules are to be written as:
variable -> definition ;

You can also use the the *, +, and ? operators as in regular expressions. The parser generator will automatically generates the additional corresponding rules. Use ( ) for grouping terms in rules and | for alternatives.

Finally, we setup the grammar options as follow:

options
{
    Axiom = "exp";
    Separator = "SEPARATOR";
}
  • The Axiom option specifies which grammar variable is to be used as the root symbol for the grammar.
  • The Separator option specifies which terminal is to be used as token separator in the lexer. Text matching the expression of the separator will automatically be discarded by the lexer.

Compile the grammar

Now that we have a grammar, let's compile it to generate the parser. The toolchain package contains at its root useful front scripts that can be to invoke the himecc compiler. On Windows, you should look for the himecc.bat script. On Linux and MacOS you should look for the himecc script. If this does not suit you, you may invoke the himecc.exe assembly for your installed framework.

Compile the MathExp.gram:

  • On Windows: himecc.bat MathExp.gram -t:java or explicitly with net461/himecc.exe MathExp.gram -t:java
  • On Linux and MacOS: ./himecc MathExp.gram -t:java
  • On any OS, explicitly with .Net Core: dotnet netcore20/himecc.dll MathExp.gram -t:java
  • On any OS, explicitly with Mono: mono net461/himecc.exe MathExp.gram -t:java

The tool will generate 4 files:

  • MathExpLexer.java, the source file for the lexer
  • MathExpLexer.bin, the binary representation of the lexer’s automaton
  • MathExpParser.java, the source file for the parser
  • MathExpParser.bin, the binary representation of the parser’s automaton

Note here that the default target for himecc is the .Net platform; so that we have to specify Java as the target with the -t:java option. For a complete guide to the options of himecc, head to the reference page.

Setup the test project

Setup a test project as a standard Maven project. Use the following project layout:

test/
+-> pom.xml
+-> src/
    +-> main/
        +-> java/
        |   +-> mathexp/
        |       +-> Program.java
        |       +-> MathExpLexer.java
        |       +-> MathExpParser.java
        +-> resources/
            +-> mathexp
                +-> MathExpLexer.bin
                +-> MathExpParser.bin

Set the minimal pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>fr.cenotelie.hime</groupId>
    <artifactId>test-hime</artifactId>
    <packaging>jar</packaging>
    <version>1.0.0</version>
    <dependencies>
        <!-- reference to the Java runtime for generated parsers -->
        <dependency>
            <groupId>fr.cenotelie.hime</groupId>
            <artifactId>hime-redist</artifactId>
            <version>3.4.0</version>
            <scope>compile</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>3.0.2</version>
                <configuration>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.1</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.0.0</version>
                <configuration>
                    <archive>
                        <manifest>
                            <!-- register the main class for the program -->
                            <mainClass>mathexp.Program</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <!-- produce a jar with all dependencies (Hime runtime is embedded) -->
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Set the minimal Program.java:

package mathexp; // default namespace for the parser is the grammar's name

import fr.cenotelie.hime.redist.ASTNode;
import fr.cenotelie.hime.redist.ParseResult;

public class Program {
    public static void main(String[] args) {
        MathExpLexer lexer = new MathExpLexer("2 + 3");
        MathExpParser parser = new MathExpParser(lexer);
        ParseResult result = parser.parse();
        ASTNode root = result.getRoot();
        print(root, new boolean[]{});
    }

    private static void print(ASTNode node, boolean[] crossings) {
        for (int i = 0; i < crossings.length - 1; i++)
            System.out.print(crossings[i] ? "|   " : "    ");
        if (crossings.length > 0)
            System.out.print("+-> ");
        System.out.println(node.toString());
        for (int i = 0; i != node.getChildren().size(); i++) {
            boolean[] childCrossings = Arrays.copyOf(crossings, crossings.length + 1);
            childCrossings[childCrossings.length - 1] = (i < node.getChildren().size() - 1);
            print(node.getChildren().get(i), childCrossings);
        }
    }
}

The parse tree, or AST (Abstract Syntax Tree) produced by the parser is available as a handle for the tree's root in the code above. Nodes in the tree have two important methods:

  • getSymbol() returns the grammar's symbol attached to the node, a variable or a token, i.e. a piece of text matched by the lexer corresponding to one of the grammar's terminals.
  • getChildren() returns the list of the current node's children.

For more info, look into the complete API documentation.

Build and execute the test project

Build the test project:

mvn clean verify

Execute the test project:

java -jar target/test-hime-1.0.0-jar-with-dependencies.jar

The output of the program should a text printout of the produced syntax tree be as follow:

exp
+-> exp_term
    +-> exp_term
    |   +-> exp_factor
    |       +-> exp_atom
    |           +-> NUMBER = 2
    +-> + = +
    +-> exp_factor
        +-> exp_atom
            +-> NUMBER = 3

The lexers and parser created in this manner are bound to parse the piece of text given to the lexer. In order to parse another piece of text, simply create a new lexer and parser in the same way. The two objects are lightweight so you should not have to worry about creating multiple ones. In this example, the lexer is given the full string that has to be parsed. Another constructor allows you to give a stream of text in the form of an InputStreamReader.

Go to the next tutorial to see how to improved the produced parse trees by using tree actions.