Get started
Grammars and edition
API Documentation
Release Notes
Context-sensitive lexing is the ability for a lexer to yield different tokens depending on the context of the parser. The most common example is the use of context-sensitive keywords, such as the get
and set
keywords in C#:
public MyProperty { get { return someField; } }
public void DoStuff() { int get = 1; }
In this excerpt, the get
name on the first line is a keyword that starts the expression of the get accessor for the property. However, on the second line, the get
name is a normal symbol with no special meaning. A grammar for this kind of language will usually involve two lexical rules, one for the keyword and one for the normal symbol:
SYMBOL -> [a-zA-Z_] [a-zA-Z_0-9]* ;
GET -> 'get' ;
The problem is now that the get
input will always be interpreted as the keyword because in the above definitions the GET
rule comes after and therefore as a higher priority. Shall the order be reversed, the keyword would never be matched because GET
is strictly a subset of SYMBOL
. The problem is then to discriminate between the cases where get
is a keyword or a normal symbol.
To resolve this, Hime supports the definition of context-sensitive terminals, i.e. terminals that can only be matched if the parser is in a recognized context. The first step to use context-sensitive lexing is to mark specific terminals as members of a context:
SYMBOL -> [a-zA-Z_] [a-zA-Z_0-9]* ;
context accessors { GET -> 'get' ; }
In the rule above, the context
keyword is used to open a context, named accessors
. Within this context, any number of terminals can be defined. The semantics of this construction is that the GET
rule can only be matched and the GET
token produced if and only if the accessors context is in effect for the parser. The next step is then tell the parser when to recognize an accessors
context. To do so, we modify the syntactic rules:
property -> get_accessor set_accessor? | set_accessor get_accessor? ;
get_accessor -> #accessors { GET } block ;
In the second rule above, the accessors
context is referenced preceded by a hash. It opens a rule body between curly brackets. The inside of the contextual body is then able to refer to the contextual terminals, here the GET
terminal. The semantics of this rule is that when the parser is facing the beginning of the get_accessor
rule, it opens the accessors
context and tells the lexer to enable it. The lexer will then be able to match the GET
terminal, instead of the SYMBOL
terminal. Once the GET
terminal has been produced and matched by the parser, the parser closes the context and continues on to the block rule as per the definition of the get_accessor
rule. In this way, the get
name will only be matched by the lexer as a GET
keyword when the accessors
context is open.
Some points to keep in mind:
The terminals defined in the lexical rules outside of any context are said to be in the default context. They can always be matched by the lexer.
The priority of the terminals (contextual or not) is not affected by their appearance in a context. The priority is always specified by the appearance order of the lexical rules, irrespective of the use of contexts. For example:
SYMBOL -> [a-zA-Z_] [a-zA-Z_0-9]* ;
context accessors { GET -> 'get' ; }
GWORD -> 'g' [a-zA-Z_0-9] ;
The use of a context above does not make the GET
terminal have more priority than GWORD
. The priority is always the order of appearance (the later the more priority). In this example, the GET
terminal will never be produced because whether the accessors
context is in effect or not, the GWORD
definition will always be matched instead because of its priority (and the fact that it is a superset of GET).
Any number of context can be defined. A context can be split up in the rules definition to accommodate the priority of the terminals:
context c1 { A -> 'a'; }
X -> 'x' ;
context c1 { SPECIAL_X -> 'x'; }
At runtime, the parser keeps tracks of all the active contexts. So multiple contexts can be active at the same time and recursively activated.
Finally, keep in mind that there is an additional cost in performance for context-sensitive lexing.