COMP 202: Principles of Object-Oriented Programming II
Design Patterns for Parsing (Part 2)
You may download the supporting code for today
here. It also includes a ToModelAlgo
visitor that transforms the parse tree into an AST with addition, multiplication, integer and variable leaves.
Here is a simpler example using the extended visitor framework: ASTs with addition, multiplication, integer and variable leaves.
Consider a program to read in from a text file a simple arithmetic expression in infix format, such as "abc + 45 + x + y", and build the corresponding parse tree. We shall restrict ourselves to arithmetic expressions containing integers, strings (representing variables) and addition. The syntax of such arithmetic expressions can be expressed as a grammar with the following four rules.
The format below is called "Backus-Naur Form" or "BNF" for short. It is an example of a "context-free grammar" that is "read Left-to-right with a Left-most derivation on a 1 token look-ahead" or, more simply, "LL(1)". That means that you a) read the line from left to right, b) process the information from left to right and c) need only look at one token at any given point to know where you are in the grammar.
E ::= F E1
E1 ::= E1a | Empty
E1a ::= + E
F ::= num | id
The above grammar can be separated into two distinct sets of symbols:
Terminal symbols are atomic entities that are never defined in terms of other symbols, i.e. they only appear on the right-hand side of the rules. Non-terminal symbols, however, are defined in terms of other symbols and hence each appears on the left-hand side of a rule. E is a special non-terminal symbol called the "Start Symbol" and represents the form of the grammar when one first encounters something in that grammar. Likewise, Empty is a special symbol representing the lack of symbols at the end of a sequence of symbols.
It is important to note that there are really only two types of rules above: sequence (combination) rules, such as E ::= F E1 and E1a ::= + E, which define a symbol as a fixed sequence of other symbols, and choice (branch) rules, such as E1 ::= E1a | Empty and F ::= num | id, which define a symbol as exactly one of several alternatives.
Here is how the above grammar generates the string "abc + 45".
E -> F E1 // E is the start symbol and has only one rule: E ::= F E1
-> id E1 // Apply the rule F ::= id because "abc" is an id.
-> id E1a // Apply the rule E1 ::= E1a because there are more tokens
-> id + E // Apply the rule E1a ::= + E
-> id + F E1 // Apply the rule E ::= F E1
-> id + num E1 // Apply the rule F ::= num because "45" is a number
-> id + num Empty // Apply the rule E1 ::= Empty because there are no more tokens
-> abc + 45 // the lexeme of id is "abc" and the lexeme of num is "45"
The above derivation of the input string "abc + 45" from the grammar rules can be represented by a tree structure called the parse tree as follows.
                E
        ________|________
       |                 |
       F                 E1
       |                 |
       id               E1a
     "abc"        _______|_______
                  |              |
                  +              E
                 "+"        _____|_____
                           |          |
                           F          E1
                           |          |
                          num       Empty
                         "45"
First, let us consider the modeling of the tokens.
We want "intelligent" tokens that understand what they are, so they can be used
in algorithms that depend on what type of tokens are being used. In
particular, when we are parsing a stream of incoming tokens, we simply want to
run some algorithm that determines whether or not the token was the appropriate
one for that moment as well as then proceed to process the subsequent tokens
correctly, based on that one token (remember that we are only considering LL(1)
grammars, so one token is all it takes).
Tokens are thus hosts for extended visitors, calling the case on the visitor corresponding to the name of the token. A factory is used to hide the details of constructing tokens from the rest of the system.
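As a minimal sketch (the names and method signatures here are illustrative assumptions, not the actual course library), such a token and its factory might look like this:

// Sketch only: an "intelligent" token that hosts a visitor by dispatching on
// its own name, plus a factory that hides how tokens are constructed.

// A token visitor whose cases are selected by token name at run time.
interface ITokVisitor {
    Object caseAt(String tokenName, Token host, Object param);
}

class Token {
    private final String name;    // e.g. "Num", "Id", "+"
    private final String lexeme;  // e.g. "45", "abc", "+"

    Token(String name, String lexeme) {
        this.name = name;
        this.lexeme = lexeme;
    }

    String getName()   { return name; }
    String getLexeme() { return lexeme; }

    // Host method: call the visitor case that corresponds to this token's name.
    Object execute(ITokVisitor v, Object param) {
        return v.caseAt(name, this, param);
    }
}

// Hides the details of classifying lexemes and constructing tokens.
class TokenFactory {
    Token makeToken(String lexeme) {
        if (lexeme.matches("\\d+(\\.\\d+)?")) return new Token("Num", lexeme);
        if (lexeme.matches("[a-zA-Z]\\w*"))   return new Token("Id", lexeme);
        return new Token(lexeme, lexeme);     // operators etc.: name == lexeme
    }
}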
Now consider the grammar symbols. There are really only three distinct grammar symbols: empty, terminal and sequence. A symbol for the combination rule is unnecessary because once a token stream has been parsed in terms of a grammar, the choices of a combination no longer exist and that point in the grammar can be represented by a single symbol, not a combination.
Once again, these grammar symbols are hosts for extended visitors so that algorithms can be run on them.
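Continuing the sketch, the three kinds of grammar symbols might look like this (simplified stand-ins for the course's parser package classes, built on the Token sketch above):

// Visitor over the three kinds of grammar symbols.
interface ISymVisitor {
    Object emptyCase(MTSymbol host, Object param);
    Object terminalCase(TerminalSymbol host, Object param);
    Object sequenceCase(SequenceSymbol host, Object param);
}

abstract class ASymbol {
    abstract Object execute(ISymVisitor v, Object param);
}

// Empty: the lack of symbols at the end of a sequence.
class MTSymbol extends ASymbol {
    Object execute(ISymVisitor v, Object param) { return v.emptyCase(this, param); }
}

// Terminal: wraps the token that was read from the input.
class TerminalSymbol extends ASymbol {
    private final Token token;
    TerminalSymbol(Token token) { this.token = token; }
    Token getToken() { return token; }
    Object execute(ISymVisitor v, Object param) { return v.terminalCase(this, param); }
}

// Sequence: two symbols in a row, e.g. the result of parsing E ::= F E1.
class SequenceSymbol extends ASymbol {
    private final ASymbol first, second;
    SequenceSymbol(ASymbol first, ASymbol second) { this.first = first; this.second = second; }
    ASymbol getFirst()  { return first; }
    ASymbol getSecond() { return second; }
    Object execute(ISymVisitor v, Object param) { return v.sequenceCase(this, param); }
}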
Parsing thus involves running an algorithm on the stream of tokens from the tokenizer (actually, only the first one because the algorithm itself will pull the rest of the tokens from the tokenizer). This algorithm will produce a grammar symbol (the "start symbol") which is at the root of a tree of grammar symbols (the parse tree).
In recursive descent parsing (RDP), we use a "tokenizer" to scan and "tokenize" the input from left to right, and build the parse tree from the top down, based on the value of the tokens. Building a parse tree from the top down means building it by first looking at the grammar rule(s) for the start symbol E and trying to create the object that represents E. The task of building E is called "parsing E" and is to be carried out by an appropriate token visitor.
We can see from the above diagram that in order to parse E, we need to parse the sequence F E1. But that means that we need to know how to parse F and E1: parsing F means handling either a num or an id token, while parsing E1 means handling either an E1a (which begins with a "+" token) or Empty.
Writing the concrete visitors to parse each of the non-terminal symbols will require, in general, performing a global analysis looking deeper than necessary into the grammar structure. One must analyze the details of the grammar to the level of knowing all possible tokens at that stage. This analysis may require one to look beyond the immediate relationships a class may have with any other classes. For instance, to find the possible tokens that could be the start of an E term, one must look down to the F term. However, the code to process an E term should only be concerned with the fact that an E term is composed of an F term and an E1 term, and should not be coupled to the internal nature of the F, E1 or any other terms. Likewise, when parsing a branch, such as F, one must create the union of all the visitors that parse the branch’s subclasses.
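To make that coupling concrete, here is what a hand-written recursive descent parser for this grammar might look like (plain methods rather than visitors, using the token and symbol sketches above; this is an illustration, not the course code). Notice how parseE1 must know that only a "+" can start an E1a, and parseF must know every token an F can start with.

// Hand-written recursive descent: each method hard-codes the tokens its
// grammar rule can start with, which couples it to the other rules' details.
class HandWrittenParser {
    private final java.util.Iterator<Token> tokens;
    private Token lookahead;   // the single LL(1) look-ahead token

    HandWrittenParser(java.util.Iterator<Token> tokens) {
        this.tokens = tokens;
        advance();
    }

    private void advance() { lookahead = tokens.hasNext() ? tokens.next() : null; }

    // E ::= F E1
    ASymbol parseE() { return new SequenceSymbol(parseF(), parseE1()); }

    // E1 ::= E1a | Empty   -- must know that only "+" starts an E1a
    ASymbol parseE1() {
        if (lookahead != null && lookahead.getName().equals("+")) return parseE1a();
        return new MTSymbol();
    }

    // E1a ::= + E
    ASymbol parseE1a() {
        Token plus = lookahead;
        advance();
        return new SequenceSymbol(new TerminalSymbol(plus), parseE());
    }

    // F ::= num | id   -- must know all the tokens an F can be
    ASymbol parseF() {
        if (lookahead == null
                || !(lookahead.getName().equals("Num") || lookahead.getName().equals("Id"))) {
            throw new IllegalArgumentException("expected Num or Id, got " + lookahead);
        }
        Token t = lookahead;
        advance();
        return new TerminalSymbol(t);
    }
}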
To remedy this problem, one must re-think the instantiation process of the visitors. In particular, in order to decouple the construction of the visitor for one term from the details of other terms, one must abstract and encapsulate the construction process. This is done by the abstract factory design pattern.
Using factories to instantiate the parsing visitors
· enables each term to be decoupled from any other term by hiding the instantiation details.
· enables the construction of the union of visitors by chaining, which is used to implement branches.
· enables the lazy construction of already installed visitors, which is needed to create circular relationships.
Each non-terminal symbol (and its corresponding class) is associated with a factory that constructs its parsing visitor.
All factories adhere to a basic factory interface which provides the methods to instantiate the parsing visitors. For convenience, all the factories are derived from an abstract factory, ATokVisitorFact, which provides access to the tokenizer.
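As a rough sketch of what such a factory layer could look like (continuing the illustrative classes above; apart from ATokVisitorFact, the interface and factory names below are assumptions, and the real library's signatures differ):

// Sketch only: a chainable parsing visitor plus the factories that build it.
interface ICase { ASymbol apply(Token host); }   // one parsing action for one token name

// Named cases plus a default; chaining two visitors yields the union of their cases.
class ParseVisitor implements ITokVisitor {
    final java.util.Map<String, ICase> cases = new java.util.LinkedHashMap<String, ICase>();
    private ICase deflt = new ICase() {
        public ASymbol apply(Token t) { throw new IllegalArgumentException("unexpected token: " + t.getLexeme()); }
    };
    ParseVisitor addCase(String name, ICase c) { cases.put(name, c); return this; }
    ParseVisitor setDefault(ICase c) { deflt = c; return this; }
    public Object caseAt(String name, Token host, Object param) {
        ICase c = cases.get(name);
        return (c != null ? c : deflt).apply(host);
    }
    // Union of this visitor and 'other': 'other' handles what this one does not.
    ParseVisitor chainWith(final ParseVisitor other) {
        for (java.util.Map.Entry<String, ICase> e : other.cases.entrySet())
            if (!cases.containsKey(e.getKey())) cases.put(e.getKey(), e.getValue());
        return setDefault(new ICase() {
            public ASymbol apply(Token t) { return (ASymbol) t.execute(other, null); }
        });
    }
}

interface ITokenizer { Token getNextToken(); }              // returns an "EOF" token at the end
interface ITokVisitorFact { ParseVisitor makeVisitor(); }   // the basic factory interface

// Abstract factory that gives every concrete factory access to the tokenizer.
abstract class ATokVisitorFact implements ITokVisitorFact {
    protected final ITokenizer tokenizer;
    ATokVisitorFact(ITokenizer tokenizer) { this.tokenizer = tokenizer; }
}

// Parses a single terminal such as Num, Id or "+".
class TerminalSymbolFact extends ATokVisitorFact {
    private final String name;
    TerminalSymbolFact(String name, ITokenizer t) { super(t); this.name = name; }
    public ParseVisitor makeVisitor() {
        return new ParseVisitor().addCase(name, new ICase() {
            public ASymbol apply(Token host) { return new TerminalSymbol(host); }
        });
    }
}

// Parses Empty: returns MTSymbol for any token.  (A real framework would push the
// look-ahead token back; in this grammar Empty only ever ends the input.)
class MTSymbolFact extends ATokVisitorFact {
    MTSymbolFact(ITokenizer t) { super(t); }
    public ParseVisitor makeVisitor() {
        return new ParseVisitor().setDefault(new ICase() {
            public ASymbol apply(Token host) { return new MTSymbol(); }
        });
    }
}

// Parses "first second".  The visitor re-uses the first term's case names (so chaining
// still works) and builds the second term's visitor lazily, at parse time, which is
// what makes circular grammar references possible.
class SequenceFact extends ATokVisitorFact {
    private final ITokVisitorFact first, second;
    SequenceFact(ITokVisitorFact first, ITokVisitorFact second, ITokenizer t) {
        super(t); this.first = first; this.second = second;
    }
    public ParseVisitor makeVisitor() {
        final ParseVisitor v1 = first.makeVisitor();
        ParseVisitor v = new ParseVisitor();
        for (String name : v1.cases.keySet()) {
            v.addCase(name, new ICase() {
                public ASymbol apply(Token host) {
                    ASymbol s1 = (ASymbol) host.execute(v1, null);
                    Token next = tokenizer.getNextToken();
                    ASymbol s2 = (ASymbol) next.execute(second.makeVisitor(), null);
                    return new SequenceSymbol(s1, s2);
                }
            });
        }
        return v;
    }
}

// Parses one of two alternatives: the union of their visitors, built by chaining.
class BranchFact extends ATokVisitorFact {
    private final ITokVisitorFact alt1, alt2;
    BranchFact(ITokVisitorFact a1, ITokVisitorFact a2, ITokenizer t) { super(t); alt1 = a1; alt2 = a2; }
    public ParseVisitor makeVisitor() { return alt1.makeVisitor().chainWith(alt2.makeVisitor()); }
}

// A lazy proxy: lets a term refer to a factory that is installed later (E1a refers back to E).
class LazyFact implements ITokVisitorFact {
    private ITokVisitorFact target;
    void setFact(ITokVisitorFact f) { target = f; }
    public ParseVisitor makeVisitor() { return target.makeVisitor(); }
}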
The major components of the parsing system are:
· the tokenizer, which scans the input and hands out tokens one at a time;
· the tokens, which are created by a token factory and host token visitors;
· the grammar symbols (empty, terminal and sequence), which make up the parse tree;
· the parsing visitors, which are run on tokens to build the parse tree;
· the factories, which construct the parsing visitors.
In practice, the following convenience classes are often useful:
The code to create a visitor to parse the above grammar is thus:
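(The sketch below uses the illustrative factory classes from the previous sketch; the actual course library's class names and constructors differ.)

class ParserDemo {
    public static void main(String[] args) {
        // A toy tokenizer: whitespace-separated lexemes for "abc + 45",
        // followed by an "EOF" token once the input is exhausted.
        final java.util.Iterator<String> lexemes =
            java.util.Arrays.asList("abc", "+", "45").iterator();
        final TokenFactory tokFact = new TokenFactory();
        ITokenizer t = new ITokenizer() {
            public Token getNextToken() {
                return lexemes.hasNext() ? tokFact.makeToken(lexemes.next())
                                         : new Token("EOF", "");
            }
        };

        // Wire the grammar together purely at run time -- no grammar-specific classes.
        ITokVisitorFact fFact   = new BranchFact(new TerminalSymbolFact("Num", t),
                                                 new TerminalSymbolFact("Id", t), t);          // F ::= num | id
        LazyFact lazyE          = new LazyFact();                                              // E, installed below
        ITokVisitorFact e1aFact = new SequenceFact(new TerminalSymbolFact("+", t), lazyE, t);  // E1a ::= + E
        ITokVisitorFact e1Fact  = new BranchFact(e1aFact, new MTSymbolFact(t), t);             // E1 ::= E1a | Empty
        ITokVisitorFact eFact   = new SequenceFact(fFact, e1Fact, t);                          // E ::= F E1
        lazyE.setFact(eFact);                                                                  // close the E1a -> E cycle

        // Parse: run the start symbol's visitor on the first token.  The result is
        // Sequence(Terminal "abc", Sequence(Terminal "+", Sequence(Terminal "45", Empty))).
        ASymbol parseTree = (ASymbol) t.getNextToken().execute(eFact.makeVisitor(), null);
        System.out.println(parseTree);
    }
}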
Note how the process of creating a visitor for a particular grammar is purely a run-time operation and does not require any custom classes at all.
The parse tree created by the parser can be a bit unwieldy to work with. It is reasonable to transform it into a simpler representation. The supporting source code contains a ToModelAlgo class that performs this transformation. It is a visitor on grammar symbols and returns an AAST.
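As a point of reference, here is a minimal sketch of what such an AAST hierarchy might look like; only AAST and Add are named on this page, so IntLeaf, VarLeaf and Multiply are illustrative names.

abstract class AAST { }

class IntLeaf extends AAST {                       // integer leaf
    final int value;
    IntLeaf(int value) { this.value = value; }
}

class VarLeaf extends AAST {                       // variable leaf
    final String name;
    VarLeaf(String name) { this.name = name; }
}

class Add extends AAST {                           // addition node
    final AAST left, right;
    Add(AAST left, AAST right) { this.left = left; this.right = right; }
}

class Multiply extends AAST {                      // multiplication node
    final AAST left, right;
    Multiply(AAST left, AAST right) { this.left = left; this.right = right; }
}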
Consider the different cases for a parse tree:
1)
|_ parser.SequenceSymbol
   |_ parser.TerminalSymbol: 7.0 (Num)
   |_ parser.MTSymbol

2)
|_ parser.SequenceSymbol
   |_ parser.TerminalSymbol: a (Id)
   |_ parser.MTSymbol

3)
|_ parser.SequenceSymbol
   |_ parser.TerminalSymbol: 3.0
   |_ parser.SequenceSymbol
      |_ parser.TerminalSymbol: +
      |_ parser.SequenceSymbol
         |_ parser.TerminalSymbol: 7.0
         |_ parser.MTSymbol
So in all cases, the parse tree starts with a SequenceSymbol. Therefore, the ToModelAlgo visitor rejects anything that is not a SequenceSymbol. If it encounters a SequenceSymbol, then the first symbol must be a terminal symbol, namely a Num or an Id (check all three cases above). The ToModelAlgo visitor therefore executes the NumOrIdHelper on the first symbol and passes its return value (an AAST) to the visitor that deals with the second symbol, SequenceHelper.
NumOrIdHelper is simple. It rejects anything that is not a Num or an Id. If it encounters a Num or an Id, it returns the AAST for it.
SequenceHelper takes the result from NumOrIdHelper as inps[0]. Consider the legal cases for SequenceHelper: in cases 1) and 2) above, SequenceHelper may encounter the empty symbol. In that case, the AAST is just whatever was passed from NumOrIdHelper in inps[0]. Otherwise, SequenceHelper must encounter a sequence ("+ ...").
If SequenceHelper encounters a sequence, it should have a "+" or "*" in the first symbol and an arithmetic expression in the second symbol. SequenceHelper therefore uses the main ToModelAlgo visitor on the second symbol, and then passes that return value to OperatorHelper as inps[1], along with whatever it got from NumOrIdHelper in inps[0]. OperatorHelper is run on the first symbol of SequenceHelper's host.
OperatorHelper takes the AAST from the left side of the operator as inps[0] and the AAST from the right side as inps[1]. There are two legal cases for OperatorHelper: "+" and "*". It creates the correct AASTs for these, e.g. new Add(inps[0], inps[1]), and rejects anything else.
This code transforms the parse tree into a simple AAST that contains only the information that is pertinent to the model.
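Pulling the walkthrough together, here is a rough sketch of how such a transformation could be structured, written against the simplified symbol and AAST sketches above rather than the actual parser package, so the real ToModelAlgo in the supporting code differs in its interfaces and details.

// Sketch only: a symbol visitor that converts a parse tree into an AAST,
// delegating to helper visitors as described in the walkthrough above.
class ToModelAlgo implements ISymVisitor {

    // First symbol of a sequence: must be a Num or an Id.
    private final ISymVisitor numOrIdHelper = new ISymVisitor() {
        public Object terminalCase(TerminalSymbol host, Object param) {
            Token t = host.getToken();
            if (t.getName().equals("Num")) return new IntLeaf(Integer.parseInt(t.getLexeme()));
            if (t.getName().equals("Id"))  return new VarLeaf(t.getLexeme());
            throw new IllegalArgumentException("expected Num or Id: " + t.getLexeme());
        }
        public Object emptyCase(MTSymbol host, Object param)          { throw new IllegalArgumentException("expected Num or Id"); }
        public Object sequenceCase(SequenceSymbol host, Object param) { throw new IllegalArgumentException("expected Num or Id"); }
    };

    // Operator symbol: "+" or "*", with the left AAST as inps[0] and the right as inps[1].
    private final ISymVisitor operatorHelper = new ISymVisitor() {
        public Object terminalCase(TerminalSymbol host, Object param) {
            AAST[] inps = (AAST[]) param;
            String op = host.getToken().getLexeme();
            if (op.equals("+")) return new Add(inps[0], inps[1]);
            if (op.equals("*")) return new Multiply(inps[0], inps[1]);
            throw new IllegalArgumentException("unknown operator: " + op);
        }
        public Object emptyCase(MTSymbol host, Object param)          { throw new IllegalArgumentException("expected an operator"); }
        public Object sequenceCase(SequenceSymbol host, Object param) { throw new IllegalArgumentException("expected an operator"); }
    };

    // Second symbol of a sequence: Empty ends the expression, a sequence is "operator, rest".
    private final ISymVisitor sequenceHelper = new ISymVisitor() {
        public Object emptyCase(MTSymbol host, Object param) {
            return ((AAST[]) param)[0];                       // nothing follows: the result is inps[0]
        }
        public Object sequenceCase(SequenceSymbol host, Object param) {
            AAST left  = ((AAST[]) param)[0];
            AAST right = (AAST) host.getSecond().execute(ToModelAlgo.this, null);   // recurse on the rest
            return host.getFirst().execute(operatorHelper, new AAST[] { left, right });
        }
        public Object terminalCase(TerminalSymbol host, Object param) { throw new IllegalArgumentException("unexpected terminal"); }
    };

    // The main algorithm: only a SequenceSymbol is a legal parse tree.
    public Object sequenceCase(SequenceSymbol host, Object param) {
        AAST first = (AAST) host.getFirst().execute(numOrIdHelper, null);
        return host.getSecond().execute(sequenceHelper, new AAST[] { first });
    }
    public Object emptyCase(MTSymbol host, Object param)          { throw new IllegalArgumentException("not a parse tree"); }
    public Object terminalCase(TerminalSymbol host, Object param) { throw new IllegalArgumentException("not a parse tree"); }
}

Under this sketch, running ToModelAlgo on the parse tree for "abc + 45" from the earlier sketches would yield Add(VarLeaf("abc"), IntLeaf(45)).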
URL: http://www.cs.rice.edu/teaching/202/08-fall/lectures/parsing2/index.shtml
Copyright © 2008-2010 Mathias Ricken and Stephen Wong