Hime Grammar Language - Context-Sensitive Lexing

Context-sensitive lexing is the ability for a lexer to yield different tokens depending on the context of the parser. The most common example is the use of context-sensitive keywords, such as the get and set keywords in C#:

public MyProperty { get { return someField; } }
public void DoStuff() { int get = 1; }

In this excerpt, the get name on the first line is a keyword that starts the expression of the get accessor for the property. However, on the second line, the get name is a normal symbol with no special meaning. A grammar for this kind of language will usually involve two lexical rules, one for the keyword and one for the normal symbol:

SYMBOL -> [a-zA-Z_] [a-zA-Z_0-9]* ;
GET -> 'get' ;

The problem is now that the get input will always be interpreted as the keyword because in the above definitions the GET rule comes after and therefore as a higher priority. Shall the order be reversed, the keyword would never be matched because GET is strictly a subset of SYMBOL. The problem is then to discriminate between the cases where get is a keyword or a normal symbol.

To resolve this, Hime supports the definition of context-sensitive terminals, i.e. terminals that can only be matched if the parser is in a recognized context. The first step to use context-sensitive lexing is to mark specific terminals as members of a context:

SYMBOL -> [a-zA-Z_] [a-zA-Z_0-9]* ;
context accessors { GET -> 'get' ; }

In the rule above, the context keyword is used to open a context, named accessors. Within this context, any number of terminals can be defined. The semantics of this construction is that the GET rule can only be matched and the GET token produced if and only if the accessors context is in effect for the parser. The next step is then tell the parser when to recognize an accessors context. To do so, we modify the syntactic rules:

property -> get_accessor set_accessor? | set_accessor get_accessor? ;
get_accessor -> #accessors { GET } block ;

In the second rule above, the accessors context is referenced preceded by a hash. It opens a rule body between curly brackets. The inside of the contextual body is then able to refer to the contextual terminals, here the GET terminal. The semantics of this rule is that when the parser is facing the beginning of the get_accessor rule, it opens the accessors context and tells the lexer to enable it. The lexer will then be able to match the GET terminal, instead of the SYMBOL terminal. Once the GET terminal has been produced and matched by the parser, the parser closes the context and continues on to the block rule as per the definition of the get_accessor rule. In this way, the get name will only be matched by the lexer as a GET keyword when the accessors context is open.

Some points to keep in mind:

The terminals defined in the lexical rules outside of any context are said to be in the default context. They can always be matched by the lexer.

The priority of the terminals (contextual or not) is not affected by their appearance in a context. The priority is always specified by the appearance order of the lexical rules, irrespective of the use of contexts. For example:

SYMBOL -> [a-zA-Z_] [a-zA-Z_0-9]* ;
context accessors { GET -> 'get' ; }
GWORD -> 'g' [a-zA-Z_0-9] ;

The use of a context above does not make the GET terminal have more priority than GWORD. The priority is always the order of appearance (the later the more priority). In this example, the GET terminal will never be produced because whether the accessors context is in effect or not, the GWORD definition will always be matched instead because of its priority (and the fact that it is a superset of GET).

Any number of context can be defined. A context can be split up in the rules definition to accommodate the priority of the terminals:

context c1 { A -> 'a'; }
X -> 'x' ;
context c1 { SPECIAL_X -> 'x'; }

At runtime, the parser keeps tracks of all the active contexts. So multiple contexts can be active at the same time and recursively activated.

Finally, keep in mind that there is an additional cost in performance for context-sensitive lexing.