Open-source image
open-source
Hime logo

Table of content

Basics of Lexical Rules

This page summarizes the basics of writing lexical rules. These rules are used to define grammar terminals and are written in the terminals section of a grammar. They take the form of regular expressions. You can find some reference on regular expressions on Wikipedia or the IEEE.

Rule expression

All lexical rules are written as:

IDENTIFIER -> regexp here ;

The interpretation of this rule is that it declares the IDENTIFIER grammar terminal and specifies its definition with the associated regular expression. Note that all rules end with a semicolon (;). An example of valid rule is:

IDENTIFIER -> [a-z]+;

Rule ordering

The order of appearance of the rules is important for two reasons:

  • First, a regular expression cannot reference itself or part of itself. That is to say, in the previous example you cannot reference NUMBER in the body of NUMBER's rule. In a rule's body you can only refer to previously defined terminals.
  • Second, different rules may overlap. For instance the rule ANIMAL -> 'cat' | 'dog' ; is more specific that the IDENTIFIER rule, any word matched by ANIMAL is also matched by IDENTIFIER. The lexer generator then have to decide which is to be preferred in the overlapping cases. This decision is taken by using a notion of priority in lexical rules. The later a rule is defined, the more priority is has. In our example, if the ANIMAL rule is defined after IDENTIFIER, then the cat value will be matched by ANIMAL. On the contrary, if IDENTIFIER is defined after ANIMAL, then ANIMAL can never be matched because all its possible matches are overridden by IDENTIFIER.

Simple text element

In regular expressions, plain text must be enclosed in single quotes. The 'abc' expression will only match the abc string. The single quote character itself can also be used by escaping it. The 'a\'bc' expression will match the a'bc string. Blank spaces can also be used and will have to be matched. The 'a bc' expression will only match the a bc string. The possible escape sequences are summarized hereafter:

  • '\\', Backslash
  • '\'', single quote
  • '\0', Unicode character 0
  • '\a', Alert (character 7)
  • '\b', Backspace (character 8)
  • '\f', Form feed (character 12)
  • '\n', New line (character 10)
  • '\r', Carriage return (character 13)
  • '\t', Horizontal tab (character 9)
  • '\v', Vertical tab (character 11)

Example:

DOG -> 'dog' ;

Unicode characters

Individual Unicode characters can also be represented using their Unicode codepoint written in hexadecimal, prefixed by U+. For example U+AA, U+1234, U+12526 are valid Unicode codepoints. The first two are in the Unicode plane 0, whereas the last one is in plane 1. You can refer to the Unicode character charts to find the codepoints of the characters. Example:

MYTERMINAL -> U+1234 ;

Unicode ranges

Ranges of characters can be specified using the U+1234 .. U+5678 construct. This expression will match any character which Unicode codepoint is between U+1234 and U+5678 (included). Ranges outside of Unicode plane 0 are also possible: U+13456 .. U+15678. Example:

HEBREW -> U+0590 .. U+05FF ;

Wildcard

The dot (.) can be used to match any character (exactly 1). It must not be enclosed within single quotes. The . expression will match any character, whereas the '.' expression will match only a dot (Unicode codepoint U+002E). Note that the dot will also match line ending characters like New Line ( U+0A) and Carriage Return ( U+0D). Example:

SOMETHING -> 'do' . ;

Character classes

Character classes are to be expressed using the usual [] construction. The [abc] character class will match either the 'a', 'b', or 'c' characters. [a-z] will match any lowercase Latin letter. Negative classes can be expressed using [^]. The [^a-z] expression will match any character that is not a lowercase Latin letter. Examples:

UPPER_CASE_LETTER -> [A-Z] ;
LETTER -> [a-zA-Z] ;
NOT_A_LETTER -> [^a-zA-Z] ;

Unicode blocks

Unicode blocks can be used with the construct ub{NAME}. For example, the ub{Cyrillic} expression will only match characters in the U+0400 .. U+04FF range (Cyrillic characters). List of the supported Unicode Blocks. Example:

CYRILLIC -> ub{Cyrillic} ;
AEGEAN_DIGIT -> ub{AegeanNumbers} ;

Unicode categories

Unicode catgories can be used with the construct uc{NAME}. For example, the uc{Lu} expression will only match uppercase letters. List of the supported Unicode Categories. Example:

UPPERCASE -> uc{Lu} ;
SEPARATOR -> uc{Z} ;

Reference to a previous terminal

Nested regular expressions are expressed by referring to the terminal's name. The 'a' NAME 'b' expression will match any string matched by the regular expression corresponding to NAME with a a before and a b after it. Example:

ART -> 'art' ;
PARTY -> 'p' ART 'y' ;

Repetition operators

The following operators enable the specification of the cardinality of their preceding element:

  • Zero or more: *
  • One or more: +
  • Optional (0 or 1 time): ?
  • Exactly n times: {n}
  • Between n and m times: {n, m}

Examples:

WORD -> [a-z]+ ; // a word is a sequence of one or more letters
IDENTIFIER -> '@'? WORD ; // an identifier is a word that may or may not be preceded by an @

Union

The union operator| marks an alternative. Example:

ANIMAL -> 'dog' | 'cat' ;

Grouping

Use brackets () to group elements in an expression:

IDENTIFIER -> [a-z] ([a-z] | [0-9])* ; // string of letters and digits beginning with a letter

Difference

It is possible to exclude some possibilities from a regular expressions using the difference operator ( -). Given two expressions exp1 and exp2, the difference exp1 - exp2 will match any string that is matched by exp1 but not by exp2. Strings only matched by exp2 are never matched. Examples:

COMMENT -> '/*' ( .* - (.* '*/' .*) ) '*/' ;

COMMENT is a string that begins with /* and ends with */. The content between the markers can ben empty, but it cannot contains the string */. The content is expressed as any string of zero or more characters ( .*), from which are excluded all strings that contain */, expressed as .* '*/' .*.