This tutorial explains through a simple example how to use the parser generator in Rust. It is also available for C# and Java. It assumes you are moderately familiar with regular expressions and context-free grammars. The plan is to download the tooling, get a prepared grammar, compile it with himecc, and use the generated parser. This tutorial uses the example of a small language for simple mathematical expressions. With this language, one can use the standard +, -, * and / operators, as well as ( ).
The toolchain for Hime can be found on the download page.
Download and extract the toolchain package.
You may want some editing support for Hime grammars. In this case, head to the editors line-up page. For an IDE experience, we recommend the Visual Studio Code extension.
Hime provides its own language for expressing context-free grammars. It is largely similar to the standard BNF form with some enhancements for the sake of expressivity. In the example below, the grammar is shown to have three sections:
grammar MathExp
{
    options { }
    terminals { }
    rules { }
}
The options section lets you specify options for the lexer and parser generator.
The terminals section is used to describe the regular expressions matching the terminals in the grammar. This section is optional.
The rules section is used for stating the context-free rules of the grammar.
In our example, we want to use numbers so we have to define regular expressions matching these:
terminals
{
    INTEGER -> [1-9] [0-9]* | '0' ;
    REAL    -> INTEGER? '.' INTEGER (('e' | 'E') ('+' | '-')? INTEGER)?
            |  INTEGER ('e' | 'E') ('+' | '-')? INTEGER ;
    NUMBER  -> INTEGER | REAL ;
}
Here, we define three terminals:
INTEGER, which is defined using a very simple regular expression.
REAL, which matches decimal numbers (23.15) and floats using a power expression (12e-15).
NUMBER, which is the union of the two previous terminals.
Because NUMBER will match any INTEGER and REAL, we have to decide which of these is to be recognized by the lexer. The simple rule here is that terminals defined later have higher priority. Hence, the generated lexer will only be able to recognize NUMBER and will never yield INTEGER or REAL. Also, we want to define a regular expression matching blank spaces that will be used as separators:
WHITE_SPACE -> U+0020 | U+0009 | U+000B | U+000C ;
SEPARATOR -> WHITE_SPACE+;
Here, the WHITE_SPACE expression will match a single space, tab, vertical tab, or form feed. The SEPARATOR expression will match any string of at least one WHITE_SPACE and has a higher priority than WHITE_SPACE.
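To make the terminal definitions more concrete, here is a hand-written sketch in plain Rust of what a lexer for these terminals has to do: discard SEPARATOR runs, and emit NUMBER tokens and single-character inline terminals. This is only an illustration under stated assumptions. It is not the code himecc generates, and the names `Token`, `scan_integer`, `scan_exponent`, `scan_number` and `tokenize` are hypothetical.

```rust
// Hand-written illustration only: this is not code produced by himecc.
// All names here are hypothetical.

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Number(String), // text matched by the NUMBER terminal
    Symbol(char),   // single-character inline terminals: + - * / ( )
}

// WHITE_SPACE -> U+0020 | U+0009 | U+000B | U+000C
fn is_white_space(c: char) -> bool {
    matches!(c, '\u{0020}' | '\u{0009}' | '\u{000B}' | '\u{000C}')
}

// INTEGER -> [1-9] [0-9]* | '0' ; returns the index just past the match.
fn scan_integer(chars: &[char], start: usize) -> Option<usize> {
    match chars.get(start).copied() {
        Some('0') => Some(start + 1),
        Some(c) if c.is_ascii_digit() => {
            let mut end = start + 1;
            while end < chars.len() && chars[end].is_ascii_digit() {
                end += 1;
            }
            Some(end)
        }
        _ => None,
    }
}

// ('e' | 'E') ('+' | '-')? INTEGER
fn scan_exponent(chars: &[char], start: usize) -> Option<usize> {
    if !matches!(chars.get(start).copied(), Some('e') | Some('E')) {
        return None;
    }
    let mut pos = start + 1;
    if matches!(chars.get(pos).copied(), Some('+') | Some('-')) {
        pos += 1;
    }
    scan_integer(chars, pos)
}

// NUMBER -> INTEGER | REAL, keeping the longest match.
fn scan_number(chars: &[char], start: usize) -> Option<usize> {
    let int_end = scan_integer(chars, start);
    let pos = int_end.unwrap_or(start);
    if chars.get(pos).copied() == Some('.') {
        // REAL -> INTEGER? '.' INTEGER with an optional exponent
        if let Some(frac_end) = scan_integer(chars, pos + 1) {
            return Some(scan_exponent(chars, frac_end).unwrap_or(frac_end));
        }
        return int_end;
    }
    // REAL -> INTEGER followed by an exponent, otherwise a plain INTEGER
    int_end.map(|end| scan_exponent(chars, end).unwrap_or(end))
}

fn tokenize(input: &str) -> Result<Vec<Token>, String> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if is_white_space(c) {
            // SEPARATOR -> WHITE_SPACE+ : the matched text is discarded
            while i < chars.len() && is_white_space(chars[i]) {
                i += 1;
            }
        } else if let Some(end) = scan_number(&chars, i) {
            tokens.push(Token::Number(chars[i..end].iter().collect()));
            i = end;
        } else if "+-*/()".contains(c) {
            tokens.push(Token::Symbol(c));
            i += 1;
        } else {
            return Err(format!("unexpected character '{}' at index {}", c, i));
        }
    }
    Ok(tokens)
}

fn main() {
    println!("{:?}", tokenize("2 + 3.5e-2"));
}
```

Note that the sketch never produces separate INTEGER or REAL tokens, which mirrors the priority rule described above: only NUMBER is visible to the parser.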
Now, we can use these terminals to write the context-free grammar rules:
rules
{
    exp_atom   -> NUMBER
               |  '(' exp ')' ;
    exp_factor -> exp_atom
               |  exp_factor '*' exp_atom
               |  exp_factor '/' exp_atom ;
    exp_term   -> exp_factor
               |  exp_term '+' exp_factor
               |  exp_term '-' exp_factor ;
    exp        -> exp_term ;
}
Here, we have 4 grammar rules, using 4 grammar variables and the NUMBER terminal. Note that it is possible to add inline terminals in the grammar rules (e.g. '+'). These terminals will have higher priority over those defined in the terminals section, although their definition is limited to simple text (no regular expressions here). Each rule has the form:
variable -> definition ;
You can also use the *, +, and ? operators as in regular expressions. The parser generator will automatically generate the additional corresponding rules. Use ( ) for grouping terms in rules and | for alternatives.
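The layering of exp_term, exp_factor and exp_atom is what gives * and / precedence over + and -, and the left-recursive alternatives make the operators left-associative. The sketch below illustrates this with a hand-written recursive-descent evaluator in plain Rust, with one method per grammar variable. This is a different technique from the automaton-driven parser that himecc generates; it is only meant to show what the rules express. All names (`Parser`, `evaluate`) are hypothetical, and number parsing is simplified compared to the NUMBER terminal.

```rust
// Hand-written recursive-descent evaluator mirroring the MathExp rules.
// Illustration only: himecc does not generate code like this.

struct Parser {
    chars: Vec<char>,
    pos: usize,
}

impl Parser {
    fn new(input: &str) -> Parser {
        Parser { chars: input.chars().collect(), pos: 0 }
    }

    // Skips whitespace and returns the next character without consuming it.
    fn peek(&mut self) -> Option<char> {
        while self.pos < self.chars.len() && self.chars[self.pos].is_whitespace() {
            self.pos += 1;
        }
        self.chars.get(self.pos).copied()
    }

    // exp -> exp_term
    fn exp(&mut self) -> Result<f64, String> {
        self.exp_term()
    }

    // exp_term -> exp_factor | exp_term '+' exp_factor | exp_term '-' exp_factor
    // The left recursion becomes a loop, which preserves left associativity.
    fn exp_term(&mut self) -> Result<f64, String> {
        let mut value = self.exp_factor()?;
        while let Some(op) = self.peek() {
            if op != '+' && op != '-' {
                break;
            }
            self.pos += 1;
            let right = self.exp_factor()?;
            value = if op == '+' { value + right } else { value - right };
        }
        Ok(value)
    }

    // exp_factor -> exp_atom | exp_factor '*' exp_atom | exp_factor '/' exp_atom
    fn exp_factor(&mut self) -> Result<f64, String> {
        let mut value = self.exp_atom()?;
        while let Some(op) = self.peek() {
            if op != '*' && op != '/' {
                break;
            }
            self.pos += 1;
            let right = self.exp_atom()?;
            value = if op == '*' { value * right } else { value / right };
        }
        Ok(value)
    }

    // exp_atom -> NUMBER | '(' exp ')'
    fn exp_atom(&mut self) -> Result<f64, String> {
        match self.peek() {
            Some('(') => {
                self.pos += 1;
                let value = self.exp()?;
                if self.peek() == Some(')') {
                    self.pos += 1;
                    Ok(value)
                } else {
                    Err(format!("expected ')' at index {}", self.pos))
                }
            }
            Some(c) if c.is_ascii_digit() || c == '.' => {
                // Simplified NUMBER: a run of digits and dots, no exponent.
                let start = self.pos;
                while self.pos < self.chars.len()
                    && (self.chars[self.pos].is_ascii_digit() || self.chars[self.pos] == '.')
                {
                    self.pos += 1;
                }
                let text: String = self.chars[start..self.pos].iter().collect();
                text.parse::<f64>()
                    .map_err(|e| format!("bad number '{}': {}", text, e))
            }
            _ => Err(format!("unexpected input at index {}", self.pos)),
        }
    }
}

// Evaluates a whole input string, rejecting trailing text.
fn evaluate(input: &str) -> Result<f64, String> {
    let mut parser = Parser::new(input);
    let value = parser.exp()?;
    match parser.peek() {
        None => Ok(value),
        Some(c) => Err(format!("unexpected '{}' at index {}", c, parser.pos)),
    }
}

fn main() {
    println!("{:?}", evaluate("2 + 3 * (4 - 1)"));
}
```

With these rules, "8 - 3 - 2" groups as (8 - 3) - 2, and "2 + 3 * 4" multiplies before adding, because exp_factor sits below exp_term in the grammar.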
Finally, we set up the grammar options as follows:
options
{
    Axiom = "exp";
    Separator = "SEPARATOR";
}
The Axiom option specifies which grammar variable is to be used as the root symbol for the grammar.
The Separator option specifies which terminal is to be used as token separator in the lexer. Text matching the expression of the separator will automatically be discarded by the lexer.
Now that we have a grammar, let's compile it to generate the parser. The toolchain package contains at its root front scripts that can be used to invoke the himecc compiler. On Windows, look for the himecc.bat script. On Linux and macOS, look for the himecc script. If this does not suit you, you may invoke the himecc.exe assembly for your installed framework.
Compile MathExp.gram:
himecc.bat MathExp.gram -t:rust
or explicitly with
net461/himecc.exe MathExp.gram -t:rust
./himecc MathExp.gram -t:rust
dotnet netcore20/himecc.dll MathExp.gram -t:rust
mono net461/himecc.exe MathExp.gram -t:rust
The tool will generate 3 files:
MathExp.rs, the source file for the lexer and parser
MathExpLexer.bin, the binary representation of the lexer’s automaton
MathExpParser.bin, the binary representation of the parser’s automaton
Note that the default target for himecc is the .NET platform, so we have to specify Rust as the target with the -t:rust option. For a complete guide to the options of himecc, head to the reference page.
Set up a test project as a standard Cargo project. Use the following project layout:
test/
+-> Cargo.toml
+-> src/
    +-> main.rs
    +-> MathExp.rs
    +-> MathExpLexer.bin
    +-> MathExpParser.bin
Set the minimal Cargo.toml:
[package]
name = "test_hime"
version = "1.0.0"
[dependencies]
hime_redist = "3.5.1"
Set the minimal main.rs:
mod math_exp; // default namespace for the parser is the grammar's name
extern crate hime_redist;
use hime_redist::ast::AstNode;

fn main() {
    let result = math_exp::parse_string("2 + 3");
    let ast = result.get_ast();
    let root = ast.get_root();
    print(root, Vec::<bool>::new());
}

// Recursively prints a node and its children as an indented tree.
// `crossings` records, for each ancestor level, whether a vertical bar is still needed.
fn print<'a>(node: AstNode<'a>, crossings: Vec<bool>) {
    let mut i = 0;
    if !crossings.is_empty() {
        while i < crossings.len() - 1 {
            print!("{:}", if crossings[i] { "|   " } else { "    " });
            i += 1;
        }
        print!("+-> ");
    }
    println!("{:}", node);
    i = 0;
    let children = node.children();
    while i < children.len() {
        let mut child_crossings = crossings.clone();
        // A bar is needed below this child only if it has following siblings.
        child_crossings.push(i < children.len() - 1);
        print(children.at(i), child_crossings);
        i += 1;
    }
}
The parse tree, or AST (Abstract Syntax Tree), produced by the parser is available as a handle for the tree's root in the code above. Nodes in the tree have two important accessors:
get_symbol returns the grammar's symbol attached to the node: either a variable or a token, i.e. a piece of text matched by the lexer corresponding to one of the grammar's terminals.
children returns the list of the current node's children.
For more info, look into the complete API documentation.
Build the test project:
cargo build
Execute the test project:
target/debug/test_hime.exe
./target/debug/test_hime
The output of the program should be a text printout of the produced syntax tree, as follows:
exp
+-> exp_term
    +-> exp_term
    |   +-> exp_factor
    |       +-> exp_atom
    |           +-> NUMBER = 2
    +-> + = +
    +-> exp_factor
        +-> exp_atom
            +-> NUMBER = 3
The lexer and parser created in this manner are bound to parse the piece of text given to the lexer. In order to parse another piece of text, simply create a new lexer and parser in the same way. The two objects are lightweight, so you should not have to worry about creating multiple ones. In this example, the lexer is given the full string that has to be parsed. Another constructor allows you to give a stream of text in the form of a reference to an object implementing Read.
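For readers less familiar with the standard library, the sketch below shows a few ways to obtain a value implementing `std::io::Read` from text. It does not call any Hime API: `read_all` is a hypothetical stand-in for an entry point that accepts a text stream, used here only so the example runs on its own.

```rust
use std::io::{Cursor, Read};

// Hypothetical stand-in for a function that consumes a stream of text.
// It is not part of hime_redist or of the generated code.
fn read_all(input: &mut dyn Read) -> std::io::Result<String> {
    let mut text = String::new();
    input.read_to_string(&mut text)?;
    Ok(text)
}

fn main() {
    // An in-memory buffer wrapped in a Cursor implements Read.
    let mut cursor = Cursor::new("2 + 3".as_bytes());
    println!("{}", read_all(&mut cursor).expect("reading from memory failed"));

    // A byte slice implements Read directly.
    let mut bytes: &[u8] = b"4 * 5";
    println!("{}", read_all(&mut bytes).expect("reading from memory failed"));

    // std::fs::File also implements Read, so a file opened with
    // File::open could be passed in the same way.
}
```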
Generated parsers also provide an API to easily visit the produced AST. First, one has to declare a new struct that implements the provided Visitor trait.
use hime_redist::symbols::SemanticElementTrait;

struct MyVisitor {}

impl math_exp::Visitor for MyVisitor {
    fn on_terminal_number(&self, node: &AstNode) {
        println!("Found value: {}", node.get_value().unwrap());
    }
}
In this example, we only override the on_terminal_number method, but equivalent methods are also available to react to other terminals and variables. Finally, to visit the full AST, we call:
// Creates the lexer and parser
let result = math_exp::parse_string("2 + 3");
let visitor = MyVisitor {};
math_exp::visit(&result, &visitor);
Note that the math_exp::visit_ast_node function is also available to start visiting the AST at a specific node instead of the root.
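A common next step after visiting a tree is to compute something from it. The sketch below shows a bottom-up walk that evaluates the expression, using a small stand-in `Node` type shaped like the tree printed for "2 + 3". The `Node` struct and the `leaf`, `var`, `sample_tree` and `eval` functions are hypothetical and are not the hime_redist types; with the real API, the same walk would presumably go through the node accessors described earlier.

```rust
// Stand-in tree type for illustration only; NOT the hime_redist API.
struct Node {
    symbol: String,        // name of the variable or terminal
    value: Option<String>, // matched text, for tokens only
    children: Vec<Node>,
}

// Builds a token node (no children).
fn leaf(symbol: &str, value: &str) -> Node {
    Node { symbol: symbol.to_string(), value: Some(value.to_string()), children: Vec::new() }
}

// Builds a variable node with the given children.
fn var(symbol: &str, children: Vec<Node>) -> Node {
    Node { symbol: symbol.to_string(), value: None, children }
}

// The tree for "2 + 3", as in the printout of the test program.
fn sample_tree() -> Node {
    var("exp", vec![var("exp_term", vec![
        var("exp_term", vec![var("exp_factor", vec![var("exp_atom", vec![leaf("NUMBER", "2")])])]),
        leaf("+", "+"),
        var("exp_factor", vec![var("exp_atom", vec![leaf("NUMBER", "3")])]),
    ])])
}

// Evaluates a node by looking at the shape of its children.
fn eval(node: &Node) -> Result<f64, String> {
    match node.children.len() {
        // A token: expected to be a NUMBER whose text parses as a float.
        0 => {
            let text = node
                .value
                .as_deref()
                .ok_or_else(|| format!("no value on {}", node.symbol))?;
            text.parse::<f64>()
                .map_err(|e| format!("bad number '{}': {}", text, e))
        }
        // A pass-through rule such as exp -> exp_term.
        1 => eval(&node.children[0]),
        // exp_atom -> '(' exp ')'
        3 if node.children[0].symbol == "(" => eval(&node.children[1]),
        // A binary rule: left operand, operator, right operand.
        3 => {
            let left = eval(&node.children[0])?;
            let right = eval(&node.children[2])?;
            match node.children[1].symbol.as_str() {
                "+" => Ok(left + right),
                "-" => Ok(left - right),
                "*" => Ok(left * right),
                "/" => Ok(left / right),
                other => Err(format!("unknown operator {}", other)),
            }
        }
        n => Err(format!("unexpected node {} with {} children", node.symbol, n)),
    }
}

fn main() {
    println!("{:?}", eval(&sample_tree()));
}
```

Most of the walk is spent passing through single-child nodes such as exp_factor and exp_atom, which is the kind of noise in the tree that a more compact representation would avoid.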
Go to the next tutorial to see how to improve the produced parse trees by using tree actions.