Get started
Grammars and edition
API Documentation
Release Notes
This page summarizes the basics of writing lexical rules. These rules are used to define grammar terminals and are written in the terminals
section of a grammar. They take the form of regular expressions. You can find some reference on regular expressions on Wikipedia or the IEEE.
All lexical rules are written as:
IDENTIFIER -> regexp here ;
The interpretation of this rule is that it declares the IDENTIFIER
grammar terminal and specifies its definition with the associated regular expression. Note that all rules end with a semicolon (;
). An example of valid rule is:
IDENTIFIER -> [a-z]+;
The order of appearance of the rules is important for two reasons:
NUMBER
in the body of NUMBER
's rule. In a rule's body you can only refer to previously defined terminals.ANIMAL -> 'cat' | 'dog' ;
is more specific that the IDENTIFIER
rule, any word matched by ANIMAL
is also matched by IDENTIFIER
. The lexer generator then have to decide which is to be preferred in the overlapping cases. This decision is taken by using a notion of priority in lexical rules. The later a rule is defined, the more priority is has. In our example, if the ANIMAL
rule is defined after IDENTIFIER
, then the cat
value will be matched by ANIMAL
. On the contrary, if IDENTIFIER
is defined after ANIMAL
, then ANIMAL
can never be matched because all its possible matches are overridden by IDENTIFIER
.In regular expressions, plain text must be enclosed in single quotes. The 'abc'
expression will only match the abc
string. The single quote character itself can also be used by escaping it. The 'a\'bc'
expression will match the a'bc
string. Blank spaces can also be used and will have to be matched. The 'a bc'
expression will only match the a bc
string. The possible escape sequences are summarized hereafter:
'\\'
, Backslash'\''
, single quote'\0'
, Unicode character 0'\a'
, Alert (character 7)'\b'
, Backspace (character 8)'\f'
, Form feed (character 12)'\n'
, New line (character 10)'\r'
, Carriage return (character 13)'\t'
, Horizontal tab (character 9)'\v'
, Vertical tab (character 11)Example:
DOG -> 'dog' ;
Individual Unicode characters can also be represented using their Unicode codepoint written in hexadecimal, prefixed by U+
. For example U+AA
, U+1234
, U+12526
are valid Unicode codepoints. The first two are in the Unicode plane 0, whereas the last one is in plane 1. You can refer to the Unicode character charts to find the codepoints of the characters. Example:
MYTERMINAL -> U+1234 ;
Ranges of characters can be specified using the U+1234 .. U+5678
construct. This expression will match any character which Unicode codepoint is between U+1234
and U+5678
(included). Ranges outside of Unicode plane 0 are also possible: U+13456 .. U+15678
. Example:
HEBREW -> U+0590 .. U+05FF ;
The dot (.
) can be used to match any character (exactly 1). It must not be enclosed within single quotes. The .
expression will match any character, whereas the '.'
expression will match only a dot (Unicode codepoint U+002E
). Note that the dot will also match line ending characters like New Line ( U+0A
) and Carriage Return ( U+0D
). Example:
SOMETHING -> 'do' . ;
Character classes are to be expressed using the usual []
construction. The [abc]
character class will match either the 'a'
, 'b'
, or 'c'
characters. [a-z]
will match any lowercase Latin letter. Negative classes can be expressed using [^]
. The [^a-z]
expression will match any character that is not a lowercase Latin letter. Examples:
UPPER_CASE_LETTER -> [A-Z] ;
LETTER -> [a-zA-Z] ;
NOT_A_LETTER -> [^a-zA-Z] ;
Unicode blocks can be used with the construct ub{NAME}
. For example, the ub{Cyrillic}
expression will only match characters in the U+0400 .. U+04FF
range (Cyrillic characters). List of the supported Unicode Blocks. Example:
CYRILLIC -> ub{Cyrillic} ;
AEGEAN_DIGIT -> ub{AegeanNumbers} ;
Unicode catgories can be used with the construct uc{NAME}
. For example, the uc{Lu}
expression will only match uppercase letters. List of the supported Unicode Categories. Example:
UPPERCASE -> uc{Lu} ;
SEPARATOR -> uc{Z} ;
Nested regular expressions are expressed by referring to the terminal's name. The 'a' NAME 'b'
expression will match any string matched by the regular expression corresponding to NAME
with a a
before and a b
after it. Example:
ART -> 'art' ;
PARTY -> 'p' ART 'y' ;
The following operators enable the specification of the cardinality of their preceding element:
*
+
?
{n}
{n, m}
Examples:
WORD -> [a-z]+ ; // a word is a sequence of one or more letters
IDENTIFIER -> '@'? WORD ; // an identifier is a word that may or may not be preceded by an @
The union operator|
marks an alternative. Example:
ANIMAL -> 'dog' | 'cat' ;
Use brackets ()
to group elements in an expression:
IDENTIFIER -> [a-z] ([a-z] | [0-9])* ; // string of letters and digits beginning with a letter
It is possible to exclude some possibilities from a regular expressions using the difference operator ( -
). Given two expressions exp1
and exp2
, the difference exp1 - exp2
will match any string that is matched by exp1
but not by exp2
. Strings only matched by exp2
are never matched. Examples:
COMMENT -> '/*' ( .* - (.* '*/' .*) ) '*/' ;
COMMENT
is a string that begins with /*
and ends with */
. The content between the markers can ben empty, but it cannot contains the string */
. The content is expressed as any string of zero or more characters ( .*
), from which are excluded all strings that contain */
, expressed as .* '*/' .*
.