Hime - Lexical contexts

Lexical contexts

Context-sensitive lexing is the ability for a lexer to yield different tokens depending on the context of the parser. The most common example is the use of context-sensitive keywords, such as the get and set keywords in C#:

public int MyProperty { get { return someField; } }
public void DoStuff() { int get = 1; }

In this excerpt, the get name on the first line is a keyword that introduces the get accessor of the property. On the second line, however, the get name is a normal symbol with no special meaning. A grammar for this kind of language will usually involve two lexical rules, one for the keyword and one for the normal symbol:

SYMBOL -> [a-zA-Z_] [a-zA-Z_0-9]* ;
GET -> 'get' ;

The problem is that the get input will always be interpreted as the keyword: in the definitions above, the GET rule comes later and therefore has a higher priority. Should the order be reversed, the keyword would never be matched, because every input matched by GET is also matched by SYMBOL. The challenge is then to discriminate between the cases where get is a keyword and the cases where it is a normal symbol.
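The priority problem can be reproduced with a small toy lexer. This is only a sketch, not Hime's implementation: the `first_token` helper is hypothetical, and it assumes that the longest match wins and that ties go to the rule defined last.

```python
import re

# Hypothetical toy model, not Hime's implementation.
# Assumption: the longest match wins; on a tie, the rule defined last wins.
def first_token(rules, text):
    best = None  # (match length, priority, terminal name)
    for priority, (name, pattern) in enumerate(rules):
        m = re.match(pattern, text)
        if m and m.end() > 0:
            candidate = (m.end(), priority, name)
            if best is None or candidate > best:
                best = candidate
    return None if best is None else (best[2], text[:best[0]])

SYMBOL = ("SYMBOL", r"[a-zA-Z_][a-zA-Z_0-9]*")
GET = ("GET", r"get")

keyword_last = [SYMBOL, GET]   # order used in the text: GET has priority
keyword_first = [GET, SYMBOL]  # reversed order: SYMBOL has priority

print(first_token(keyword_last, "get"))   # ('GET', 'get'): always the keyword
print(first_token(keyword_first, "get"))  # ('SYMBOL', 'get'): never the keyword
```

With either ordering, one of the two readings of get is unreachable, which is the gap that contexts fill.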

To resolve this, Hime supports the definition of context-sensitive terminals, i.e. terminals that can only be matched if the parser is in a recognized context. The first step to use context-sensitive lexing is to mark specific terminals as members of a context:

SYMBOL -> [a-zA-Z_] [a-zA-Z_0-9]* ;
context accessors { GET -> 'get' ; }

In the rule above, the context keyword opens a context named accessors. Any number of terminals can be defined within this context. The semantics of this construction is that the GET rule can be matched, and the GET token produced, only if the accessors context is in effect for the parser. The next step is then to tell the parser when to recognize an accessors context. To do so, we modify the syntactic rules:

property -> get_accessor set_accessor? | set_accessor get_accessor? ;
get_accessor -> #accessors { GET } block ;

In the second rule above, the accessors context is referenced by its name preceded by a hash (#). It opens a rule body between curly brackets. The inside of this contextual body can then refer to the contextual terminals, here the GET terminal. The semantics of this rule is that when the parser reaches the beginning of the get_accessor rule, it opens the accessors context and tells the lexer to enable it. The lexer is then able to match the GET terminal instead of the SYMBOL terminal. Once the GET terminal has been produced and matched by the parser, the parser closes the context and continues on to the block rule, as per the definition of the get_accessor rule. In this way, the get name is only matched by the lexer as a GET keyword when the accessors context is open.
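The effect of an open context on the lexer can be sketched with a toy model. It is not Hime's implementation or API: the `RULES` table and the `first_token` helper are hypothetical, and the sketch assumes that the longest match wins and that ties go to the rule defined last.

```python
import re

# Hypothetical toy model, not Hime's implementation.
# Each rule is (terminal name, pattern, context); None is the default context.
RULES = [
    ("SYMBOL", r"[a-zA-Z_][a-zA-Z_0-9]*", None),
    ("GET", r"get", "accessors"),
]

def first_token(text, active_contexts=frozenset()):
    best = None  # (match length, priority, terminal name)
    for priority, (name, pattern, context) in enumerate(RULES):
        if context is not None and context not in active_contexts:
            continue  # contextual terminal whose context is not open
        m = re.match(pattern, text)
        if m and m.end() > 0:
            candidate = (m.end(), priority, name)
            if best is None or candidate > best:
                best = candidate
    return None if best is None else (best[2], text[:best[0]])

# The parser has opened #accessors at the start of get_accessor.
print(first_token("get { return someField; }", {"accessors"}))  # ('GET', 'get')
# No context is open inside the method body.
print(first_token("get = 1;"))  # ('SYMBOL', 'get')
```

The same input yields a different token depending only on the set of contexts that the parser reports as open.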

Some points to keep in mind:

The terminals defined in the lexical rules outside of any context are said to be in the default context. They can always be matched by the lexer.

The priority of the terminals (contextual or not) is not affected by their appearance in a context. The priority is always specified by the appearance order of the lexical rules, irrespective of the use of contexts. For example:

SYMBOL -> [a-zA-Z_] [a-zA-Z_0-9]* ;
context accessors { GET -> 'get' ; }
GWORD -> 'g' [a-zA-Z_0-9]* ;

The use of a context above does not give the GET terminal priority over GWORD. The priority is always the order of appearance: the later a rule appears, the higher its priority. In this example, the GET terminal will never be produced. Whether the accessors context is in effect or not, the GWORD definition will always be matched instead, because of its higher priority and the fact that it is a superset of GET.
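The shadowing can be checked with a toy model. It is a sketch under assumptions, not Hime's implementation: the `winner` helper is hypothetical, GWORD is taken to match g followed by any number of symbol characters (as the superset remark implies), the longest match wins, and ties go to the rule defined last.

```python
import re

# Hypothetical toy model, not Hime's implementation.
# Each rule is (terminal name, pattern, context); None is the default context.
def winner(rules, text, active_contexts=frozenset()):
    best = None  # (match length, priority, terminal name)
    for priority, (name, pattern, context) in enumerate(rules):
        if context is not None and context not in active_contexts:
            continue  # contextual terminal whose context is not open
        m = re.match(pattern, text)
        if m and m.end() > 0:
            candidate = (m.end(), priority, name)
            if best is None or candidate > best:
                best = candidate
    return None if best is None else best[2]

SYMBOL = ("SYMBOL", r"[a-zA-Z_][a-zA-Z_0-9]*", None)
GET = ("GET", r"get", "accessors")
GWORD = ("GWORD", r"g[a-zA-Z_0-9]*", None)

as_written = [SYMBOL, GET, GWORD]  # GWORD is defined last
reordered = [SYMBOL, GWORD, GET]   # GET moved after GWORD

print(winner(as_written, "get", {"accessors"}))  # GWORD: GET is shadowed
print(winner(as_written, "get"))                 # GWORD
print(winner(reordered, "get", {"accessors"}))   # GET
print(winner(reordered, "get"))                  # GWORD
```

In this model, only moving the contextual rule after GWORD makes GET reachable; opening the context alone changes nothing.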

Any number of contexts can be defined. The definition of a context can be split across the lexical rules to accommodate the priority of the terminals:

context c1 { A -> 'a'; }
X -> 'x' ;
context c1 { SPECIAL_X -> 'x'; }

At runtime, the parser keeps track of all the active contexts. Multiple contexts can therefore be active at the same time, and a context can be activated recursively.
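One way to picture this bookkeeping is a counter per context. The `ContextTracker` class below is hypothetical and is not part of Hime's runtime API; it only illustrates how several contexts, including nested activations of the same one, could be tracked together.

```python
from collections import Counter

class ContextTracker:
    """Hypothetical bookkeeping for open contexts; not Hime's runtime API."""

    def __init__(self):
        self._depth = Counter()  # context name -> number of open activations

    def open(self, name):
        self._depth[name] += 1

    def close(self, name):
        if self._depth[name] == 0:
            raise ValueError(f"context {name!r} is not open")
        self._depth[name] -= 1

    def is_active(self, name):
        return self._depth[name] > 0

    def active(self):
        return {name for name, depth in self._depth.items() if depth > 0}

tracker = ContextTracker()
tracker.open("accessors")
tracker.open("c1")          # a second context, active at the same time
tracker.open("accessors")   # recursive activation of the same context
tracker.close("accessors")  # the outer activation is still open
print(sorted(tracker.active()))  # ['accessors', 'c1']
```

A context stays active until every activation of it has been closed, so closing the inner activation does not disable the terminals needed by the outer one.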

Finally, keep in mind that there is an additional cost in performance for context-sensitive lexing.