[Texas PLT logo]

COMP 202: Principles of Object-Oriented Programming II

  Design Patterns for Parsing (Part 2)  

Extended Visitors and Recursive Descent Parsing

You may download the supporting code for today here. It also includes a ToModelAlgo visitor that transforms the parse tree into an AST with addition, multiplication, integer and variable leaves.

Here is a simpler example using the extended visitor framework: ASTs with addition, multiplication, integer and variable leaves.

Simple Infix Arithmetic Expressions

Consider a program to read in from a text file a simple arithmetic expression in in-fix format, such as "abc + 45 + x + y", and build the corresponding parse tree.  We shall restrict ourselves to arithmetic expressions containing integers, strings (representing variables) and addition.  The syntax of such arithmetic expression can be expressed as a grammar with the following 4 rules.   

The format below is called "Baukus-Naur Form" or "BNF" for short.  It is an example of a "context-free grammar" that is "read Left-to-right with a Left-most derivation on a 1 token look-ahead" or, more simply, "LL(1)".  That means that you a) read the line from left to right, b) process the information from left to right and c) need only look at one token at any given point to know where you are in the grammar.   

E     ::=  F E1
E1    ::=  E1a | Empty
E1a   ::=  + E
F     ::=  num | id

The above grammar can be separated into two distinct sets of symbols:

Terminal symbols are atomic entities that are never defined in terms of other symbols, i.e. they only appear on the right hand side of the rules.   Non-terminal symbols, however are defined in terms of other symbols and hence always appear once on the left-hand side of the rules.  E is a special non-terminal symbol called the "Start Symbol" and represents the form of the grammar when one first encounters something in that grammar.  Likewise, Empty is a special symbol representing the lack of symbols at the end of a sequence of symbols.

It is important to note that there are really only two types of rules above: 

Here is how the above grammar generates the string "abc + 45".

E -> F E1              // E is the start symbol and has only one rule: E ::= F E1

  -> id E1             // Apply the rule F ::= id because "abc" is an id.

  -> id E1a            // Apply the rule E1 ::= E1a because there are more tokens

  -> id + E            // Apply the rule E1a ::= + E

  -> id + F E1         // Apply the rule E ::= F E1

  -> id + num E1        // Apply the rule F ::= num because "45" is a number

  -> id + num Empty    // Apply the rule E1 ::= Empty because there are no more tokens

  -> abc + 45          // the lexeme of id is "abc" and the lexeme of num is "45"

The above derivation of the input string "abc + 45" from the grammar rules can be represented by a tree structure called the parse tree as follows.

                                      E
                                      |
                     ------------------------------------
                    |                                    |
                    F                                   E1
                    |                                    |
                    id                                  E1a
                   "abc"                                 |
                                                  ---------------  
                                                  |              |
                                                              E
                                                            "+"             |
                                                             ----------
                                                             |         |
                                                             F        E1
                                                             |         |
                                                            num      Empty
                                                            "45"



First, let us consider the modeling of the tokens.   We want "intelligent" tokens that understand what they are, so they can be used in algorithms that depend on what type of tokens are being used.  In particular, when we are parsing a stream of incoming tokens, we simply want to run some algorithm that determines whether or not the token was the appropriate one for that moment as well as then proceed to process the subsequent tokens correctly, based on that one token (remember that we are only considering LL(1) grammars, so one token is all it takes).

Tokens are thus hosts for extended visitors, calling the case on the visitor corresponding to the name of the token.  A factory is used to hide the details of constructing tokens from the rest of the system.


Now consider the grammar symbols.   There are really only three distinct grammar symbols: empty, terminal and sequence.  A symbol for the combination rule is unnecessary because once a token stream has been parsed in terms of a grammar, the choices of a combination no longer exist and that point in the grammar can be represented by a single symbol, not a combination.

Once again, these grammar symbols are hosts for extended visitors so that algorithms can be run on them. 



Parsing thus involves running an algorithm on the stream of tokens from the tokenizer (actually, only the first one because the algorithm itself will pull the rest of the tokens from the tokenizer).  This algorithm will produce a grammar symbol (the "start symbol") which is at the root of a tree of grammar symbols (the parse tree).

In recursive descent parsing (RDP), we use a "tokenizer" to scan and "tokenize" the input from left to right, and build the parse tree  from the top down, based on the value of the tokens.  Building a parse tree from the top down means building it by first looking at the grammar rule(s) for the start symbol E and try to create the object  that represents E.  The task of building E is called "parsing E" and is to be carried out by an appropriate token visitor.

We can see from the above diagram that in order to parse E, we need to parse the sequence of parse F and E1.   But that means that we need to know how to parse F and E1:

Writing the concrete visitors to parse each of the non-terminal symbols will require, in general, performing a global analysis looking deeper than necessary  into the grammar structure.  One must analyze the details of the grammar to the level of knowing all possible tokens at that stage.  This analysis may require one to look beyond the immediate relationships a class may have with any other classes.  For instance, to find the possible tokens that could be the start of an E term, one must look down to the F term.  However, the code to process an E term should only be concerned with the fact that an E term is composed of a F term and an E1 term, and should not be coupled to the internal nature of the F, E1 or any other terms.  Likewise, when parsing a branch, such as F, one must create the union of all the visitors that parse the branch’s subclasses.

To remedy this problem, one must re-think the instantiation process of the visitors.  In particular, in order to decouple the construction of the visitor for one term from the details of other terms, one must abstract and encapsulate the construction process.  This is done by the abstract factory design pattern.  Using factories to instantiate the parsing visitors

          enables each term to be decoupled from any other term by hiding the instantiation details.

          enables the construction of the union of visitors by chaining, which is used to implement branches.

          enables the lazy construction of already installed visitors which is needed to create circular relationships.

Each non-terminal symbol (and its corresponding class) is associated with a factory that constructs its parsing visitor. 

All factories adhere to a basic factory interface which provides the methods to instantiate the parsing visitors.  For convenience, all the factories are derived from an abstract factory, ATokVisitorFact, which provides access to the tokenizer.

The major components of the parsing system are:

In practice, the following convenience classes are often useful:

The code to create a visitor to parse the above grammar is thus:

Note how the process of  creating a visitor for a particular grammar is purely a run-time operation and does not require any custom classes at all.

Parse Tree-to-Model Translation

The parse tree created by the parser can be a bit unwieldy to work with. It is reasonable to transform it to a simpler representation. The file source code contains a ToModelAlgo class that performs this transformation. It is a visitor on grammar symbols and returns an AAST.

Consider the different cases for a parse tree:

  1. If you have just a number, it will be the parse tree
    |_ parser.SequenceSymbol
      |_ parser.TerminalSymbol: 7.0 (Num)
      |_ parser.MTSymbol
  2. Similarly, if you just have a variable name, it will be the parse tree
    |_ parser.SequenceSymbol
      |_ parser.TerminalSymbol: a (Id)
      |_ parser.MTSymbol
  3. If you have something more complex, the parse tree will look like
    |_ parser.SequenceSymbol
     |_ parser.TerminalSymbol: 3.0
     |_ parser.SequenceSymbol
       |_ parser.TerminalSymbol: +
       |_ parser.SequenceSymbol
         |_ parser.TerminalSymbol: 7.0
         |_ parser.MTSymbol

So in all cases will it start with a SequenceSymbol. Therefore, the ToModelAlgo visitor rejects anything that is not a SequenceSymbol. If it encounters a Sequence, then the first symbol must be a terminal symbol, namely a Num or an Id (check all three cases above). The ToModelAlgo visitor therefore executes the NumOrIdHelper on the first symbol and passes its return value (an AAST) to the visitor that deals with the second symbol, SequenceHelper.

NumOrIdHelper is simple. It rejects anything that is not a Num or an Id. If it encounters a Num or an Id, it returns the AAST for it.

SequenceHelper takes the result from NumOrIdHelper as inps[0]. Consider the legal cases for SequenceHelper: In cases 1) and 2) above, SequenceHelper may encounter the empty symbol. In that case, the AAST is just whatever was passed from NumOrIdHelper in inps[0]. Otherwise, SequenceHelper must encounter a sequence ("+ ...").

If SequenceHelper encounters a sequence, it should have a "+" or "*" in the first symbol and an arithmetic expression in the second symbol. SequenceSymbol therefore uses the main ToModelAlgo visitor on the second symbol, and then passes that return value to OperatorHelper as inps[1], along with whatever it got from NumOrIdHelper in inps[0].

OperatorHelper is run on the first symbol of SequenceHelper's host. OperatorHelper takes the AAST from the left side of the operator as inps[0] and the AAST from the right side as inps[1]. There are two legal cases for OperatorHelper: "+" and "*". It creates the correct AASTs for these, e.g. new Add(inps[0], inps[1]); and rejects anything else.

This code transforms the parse tree into a simple AAST that contains only the information that is pertinent to the model.

  Design Patterns for Parsing (Part 2)  

URL: http://www.cs.rice.edu/teaching/202/08-fall/lectures/parsing2/index.shtml
Copyright © 2008-2010 Mathias Ricken and Stephen Wong