COMP 405
Extended Visitors and Recursive Descent Parsing
You may branch the supporting code from the course SVN site:
https://svn.rice.edu/r/comp405/course/ExtVisitorParsing/
This code also includes a ToModelAlgo visitor that transforms the parse tree into an AST with addition, multiplication, integer and variable leaves. A simpler example using the extended visitor framework is also provided; it has no main() method, only test cases, and is meant just to show the relevant code.

Reference Materials: this is just a sample; please search the web for more.
Consider a program that reads a simple arithmetic expression in in-fix format, such as "abc + 45 + x + y", from a text file and builds the corresponding parse tree. We shall restrict ourselves to arithmetic expressions containing integers, strings (representing variables) and addition. The syntax of such arithmetic expressions can be expressed as a grammar with the following 4 rules.
The format below is called "Backus-Naur Form" or "BNF" for short. It is an example of a "context-free grammar" that is "read Left-to-right with a Left-most derivation on a 1 token look-ahead" or, more simply, "LL(1)". That means that you a) read the line from left to right, b) process the information from left to right and c) need only look at one token at any given point to know where you are in the grammar.
E ::= F E1
E1 ::= E1a | Empty
E1a ::= + E
F ::= num | id
The above grammar can be separated into two distinct sets of symbols:
Terminal symbols are atomic entities that are never defined in terms of other symbols, i.e. they only appear on the right-hand side of the rules. Non-terminal symbols, however, are defined in terms of other symbols and hence each appears exactly once on the left-hand side of a rule. E is a special non-terminal symbol called the "Start Symbol"; it represents the form of the grammar when one first encounters something in that grammar. Likewise, Empty is a special symbol representing the lack of symbols at the end of a sequence of symbols.
It is important to note that there are really only two types of rules above: sequence rules, such as E ::= F E1, whose right-hand side is an ordered series of symbols, and combination (choice) rules, such as F ::= num | id, whose right-hand side is a set of alternatives.
Here is how the above grammar generates the string "abc + 45".
E -> F E1 // E is the start symbol and has only one rule: E ::= F E1
-> id E1 // Apply the rule F ::= id because "abc" is an id.
-> id E1a // Apply the rule E1 ::= E1a because there are more tokens
-> id + E // Apply the rule E1a ::= + E
-> id + F E1 // Apply the rule E ::= F E1
-> id + num E1 // Apply the rule F ::= num because "45" is a number
-> id + num Empty // Apply the rule E1 ::= Empty because there are no more tokens
-> abc + 45 // the lexeme of id is "abc" and the lexeme of num is "45"
The above derivation of the input string "abc + 45" from the grammar rules can be represented by a tree structure called the parse tree as follows.
                E
                |
        -----------------
        |               |
        F               E1
        |               |
        id              E1a
      "abc"              |
                 ---------------
                 |             |
                 +             E
                "+"            |
                          ----------
                          |        |
                          F        E1
                          |        |
                         num     Empty
                         "45"
THE STRUCTURE OF THE CODE SHOULD MATCH THE STRUCTURE OF THE DATA
This is one of the core design philosophies that runs through object-oriented design. The point is that the data drives how it is processed by its very nature, i.e. its structure. The structure of the data is not simply a convenient container for easy processing, but rather an embodiment, a model of the data that captures the essence of what the data is and how it behaves. In this inverted-control world of OOP/D where data drives processing rather than processing manipulates data, the processing of the data is thus effectively a behavior of the data and thus fundamentally must mirror the data's structure. What can and cannot be done with the data is then controlled by the data structure itself, leading to much more natural and robust processing algorithms.
In all the following design discussions, pay close attention to how this design philosophy is being applied at all levels. If you have done parsing algorithms in another course, compare the algorithms and how matching the code to the structure of the data changes how we look at and solve the problem.
First, let us consider the modeling of the tokens. We want "intelligent" tokens that understand what they are, so they can be used in algorithms that depend on what type of tokens are being used. In particular, when we are parsing a stream of incoming tokens, we simply want to run some algorithm that determines whether or not the token was the appropriate one for that moment as well as then proceed to process the subsequent tokens correctly, based on that one token (remember that we are only considering LL(1) grammars, so one token is all it takes).
Tokens are thus hosts for extended visitors, calling the case on the visitor corresponding to the name of the token. A factory is used to hide the details of constructing tokens from the rest of the system.
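As a sketch of what such an "intelligent" token looks like, consider the toy implementation below. The names here (Token, TokVisitor, caseAt) are illustrative assumptions, not the course library's actual API; they only demonstrate the dispatch-by-token-name idea.

```java
import java.util.HashMap;
import java.util.Map;

// One visitor command per token name, plus a default for everything else.
interface ITokVisitorCmd<R> { R apply(Token host); }

class TokVisitor<R> {
    private final Map<String, ITokVisitorCmd<R>> cmds = new HashMap<>();
    private final ITokVisitorCmd<R> defaultCmd;
    TokVisitor(ITokVisitorCmd<R> defaultCmd) { this.defaultCmd = defaultCmd; }
    void setCmd(String name, ITokVisitorCmd<R> cmd) { cmds.put(name, cmd); }
    R caseAt(String name, Token host) {
        return cmds.getOrDefault(name, defaultCmd).apply(host);
    }
}

// The token calls the visitor case that matches its own name ("Num", "Id", "+", ...),
// so the algorithm never has to inspect the token's type itself.
class Token {
    final String name;   // token type, e.g. "Num" or "Id"
    final String lexeme; // the actual text, e.g. "45" or "abc"
    Token(String name, String lexeme) { this.name = name; this.lexeme = lexeme; }
    <R> R execute(TokVisitor<R> v) { return v.caseAt(name, this); }
}
```

The key point is that a parsing algorithm installs one command per token it expects at that moment and lets the token pick the command — the conditional logic lives in the data's dispatch, not in the algorithm.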
Now consider the grammar symbols. There are really only three distinct grammar symbols: empty, terminal and sequence. A symbol for the combination rule is unnecessary because once a token stream has been parsed in terms of a grammar, the choices of a combination no longer exist and that point in the grammar can be represented by a single symbol, not a combination.
Once again, these grammar symbols are hosts for extended visitors so that algorithms can be run on them.
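The three symbol shapes form a classic composite. The toy sketch below uses a plain method where the real code would use an extended visitor, and all names (MTSymbol, TerminalSymbol, SequenceSymbol, toTreeString) are assumptions for illustration only.

```java
// Hypothetical grammar-symbol composite: empty, terminal, and binary sequence.
interface IGrammarSymbol {
    String toTreeString(String indent);
}

class MTSymbol implements IGrammarSymbol {
    public String toTreeString(String indent) { return indent + "|_ MTSymbol\n"; }
}

class TerminalSymbol implements IGrammarSymbol {
    final String lexeme;
    TerminalSymbol(String lexeme) { this.lexeme = lexeme; }
    public String toTreeString(String indent) {
        return indent + "|_ TerminalSymbol: " + lexeme + "\n";
    }
}

class SequenceSymbol implements IGrammarSymbol {
    final IGrammarSymbol symbol1, symbol2; // head and tail of the sequence
    SequenceSymbol(IGrammarSymbol s1, IGrammarSymbol s2) { symbol1 = s1; symbol2 = s2; }
    public String toTreeString(String indent) {
        return indent + "|_ SequenceSymbol\n"
             + symbol1.toTreeString(indent + "   ")
             + symbol2.toTreeString(indent + "   ");
    }
}
```

Notice how the recursive structure of the printing code exactly mirrors the recursive structure of the parse tree — the design philosophy stated above, in miniature.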
Parsing thus involves running an algorithm on the stream of tokens from the tokenizer (actually, only the first one because the algorithm itself will pull the rest of the tokens from the tokenizer). This algorithm will produce a grammar symbol (the "start symbol") which is at the root of a tree of grammar symbols (the parse tree).
In recursive descent parsing (RDP), we use a "tokenizer" to scan and "tokenize" the input from left to right, and build the parse tree from the top down, based on the value of the tokens. Building a parse tree from the top down means first looking at the grammar rule(s) for the start symbol E and trying to create the object that represents E. The task of building E is called "parsing E" and is to be carried out by an appropriate token visitor.
We can see from the above diagram that in order to parse E, we need to parse the sequence of F and E1. But that means that we need to know how to parse F and E1 as well.
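For contrast with the visitor-based approach developed below, a conventional hand-coded recursive-descent parser for this grammar can be sketched as follows. All names here are illustrative; the course code builds this same logic out of visitors and factories instead of hard-coded methods.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.Queue;

// Hand-coded LL(1) recursive descent for:
//   E ::= F E1      E1 ::= E1a | Empty      E1a ::= + E      F ::= num | id
class HandParser {
    private final Queue<String> toks;
    HandParser(String... toks) { this.toks = new LinkedList<>(Arrays.asList(toks)); }

    String parseE()  { return parseF() + parseE1(); }   // E ::= F E1

    String parseE1() {                                  // E1 ::= E1a | Empty
        if ("+".equals(toks.peek())) {                  // 1-token look-ahead
            toks.remove();
            return " + " + parseE();                    // E1a ::= + E
        }
        return "";                                      // Empty
    }

    String parseF() {                                   // F ::= num | id
        String t = toks.remove();
        if (!t.matches("[0-9]+|[A-Za-z]\\w*"))
            throw new IllegalArgumentException("expected num or id, got " + t);
        return t;
    }
}
```

Each non-terminal becomes one method, and each method's body mirrors that non-terminal's rule — the same structural correspondence the visitor-based design achieves, but frozen at compile time.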
Writing the concrete visitors to parse each of the non-terminal symbols will require, in general, performing a global analysis looking deeper than necessary into the grammar structure. One must analyze the details of the grammar to the level of knowing all possible tokens at that stage. This analysis may require one to look beyond the immediate relationships a class may have with any other classes. For instance, to find the possible tokens that could be the start of an E term, one must look down to the F term. However, the code to process an E term should only be concerned with the fact that an E term is composed of an F term and an E1 term, and should not be coupled to the internal nature of the F, E1 or any other terms. Likewise, when parsing a branch, such as F, one must create the union of all the visitors that parse the branch's subclasses.
To remedy this problem, one must re-think the instantiation process of the visitors. In particular, in order to decouple the construction of the visitor for one term from the details of other terms, one must abstract and encapsulate the construction process. This is done by the abstract factory design pattern. Using factories to instantiate the parsing visitors
Each non-terminal symbol (and its corresponding class) is associated with a factory that constructs its parsing visitor.
All factories adhere to a basic factory interface which provides the methods to instantiate the parsing visitors. For convenience, all the factories are derived from an abstract factory, ATokVisitorFact, which provides access to the tokenizer.
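In outline, the factory layer might look like the sketch below. The method and type shapes here are assumptions reduced to a minimum (the real ITokVisitorFact/ATokVisitorFact carry the tokenizer and richer visitor types); only the interface/abstract-class/concrete-factory layering is the point.

```java
// A "visitor" is reduced here to a function from a token name to a result,
// just enough to show the shape of the factory layer.
interface IParseVisitor { String parse(String tokenName); }

// The basic factory interface: the only service a client ever needs.
interface ITokVisitorFact { IParseVisitor makeVisitor(); }

// The abstract factory holds what every concrete factory shares (a name here,
// standing in for the shared tokenizer of the real abstract factory).
abstract class ATokVisitorFact implements ITokVisitorFact {
    protected final String name;
    ATokVisitorFact(String name) { this.name = name; }
}

// A concrete factory whose visitors accept exactly one named token,
// analogous in spirit to a terminal-symbol factory.
class TerminalFact extends ATokVisitorFact {
    TerminalFact(String name) { super(name); }
    public IParseVisitor makeVisitor() {
        return tok -> tok.equals(name) ? name : "error: expected " + name;
    }
}
```

Because clients see only ITokVisitorFact, the construction of one term's visitor is fully decoupled from the details of every other term — exactly the decoupling motivated above.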
The major components of the parsing system are:

- Tokens, which host the parsing visitors. More specifically, the visitors are ATokVisitor<IGrammarSymbol, Object>'s, which use a lambda-based Extended Visitor implementation.
- ATokVisitorFact, an ITokVisitorFact that contains the tokenizer and provides services to manage it.
- TerminalSymbolFact, an ATokVisitorFact whose instances parse specific Tokens, as specified by a given String, the name of the corresponding Token. There is a one-to-one correspondence between the instances of this class and the distinct types of tokens generated by the tokenizer.
- CombinationFact, an ATokVisitorFact that produces ATokVisitors whose cases can process the symbols created by either of two given ATokVisitors. The cases of the visitors produced by this factory are thus the union of the cases of the given visitors. The visitors created by this factory utilize the ITokVisitorCmds copied from the given visitors. The default case is that specified by the latter of the two given factories' visitors. The restriction to LL(1) grammars means that no two cases from the contributing visitors will ever be for the same Token.
- SequenceFact, an ATokVisitorFact that produces ATokVisitors whose cases can process a binary sequence of the symbols created by two given ATokVisitors. Thus, every case in the first of the given visitors is augmented (decorated) to first execute normally and then to execute the second of the given visitors on the next token. The default case is also so decorated.
- MTSymbolFact, an ATokVisitorFact that produces ATokVisitors whose cases parse the empty token/symbol defined at the end of a recursive structure of symbols in the grammar. Thus, only the default case is defined. All the visitors created by this factory do is push the token back into the tokenizer. Note that due to the way in which the CombinationFact works, this factory must always be the last when combining two or more visitors.
- ProxyFact, an ITokVisitorFact that is simply a proxy to a given ITokVisitorFact. This enables a given factory to be used without specifying a particular instance of that factory. In particular, this factory is very useful when a recursive loop is encountered.

In practice, the following convenience classes are often useful:

- MultiSequenceFact, an ITokVisitorFact that simply utilizes multiple SequenceFact's to create a resultant factory whose visitors can parse an arbitrarily long sequence of symbols.
- A corresponding multi-way combination factory, an ITokVisitorFact that simply utilizes multiple CombinationFact's to create a resultant factory whose visitors can parse a token chosen from an arbitrarily large set of symbols.

The code to create a visitor to parse the above grammar is thus:
ProxyFact eFact_Proxy = new ProxyFact(); // use this wherever E appears in the grammar, except the very beginning.
ITokVisitorFact eFact =
    new SequenceFact("E", tok,
        new CombinationFact("F", tok,
            new TerminalSymbolFact("Num", tok),
            new TerminalSymbolFact("Id", tok)),
        new CombinationFact("E1", tok,
            new MultiSequenceFact("E1a", tok,
                new TerminalSymbolFact("+", tok),
                eFact_Proxy),
            new MTSymbolFact(tok)));
eFact_Proxy.setFact(eFact); // close the loop
ITokVisitor<IGrammarSymbol, Object> parseEVisitor = eFact.makeVisitor(); // Make the actual visitor to perform the parsing
Note how the process of creating a visitor for a particular grammar is purely a run-time operation and does not require any custom classes at all.
One of the biggest hurdles in constructing a parsing system is how to effectively deal with recursive definitions, particularly if the grammar is left-recursive. While there are techniques to refactor a grammar to remove left-recursion, this is not always practical. Theoretically, a left-recursive grammar is not parsable by an LL(1) parser, so all left-recursive grammars must be refactored into a "weakly equivalent" right-recursive grammar. But a problem with recursion arises even if the grammar is not fully left-recursive, wherever a recursive non-terminal symbol ends up as the left-most symbol in a sequence somewhere.
Here is an example of a partially left-recursive grammar:
Expr ::= Term "]"
Term ::= Num | SubExpr
SubExpr ::= "[" Expr
If we were to hard-code the parser generator, we would get the following:
ProxyFact exprFact_Proxy = new ProxyFact(); // use this wherever Expr appears in the grammar, except the very beginning.
ITokVisitorFact exprFact =
    new SequenceFact("Expr", tok,
        new CombinationFact("Term", tok,
            new TerminalSymbolFact("Num", tok),
            new SequenceFact("SubExpr", tok,
                new TerminalSymbolFact("[", tok),
                exprFact_Proxy)),
        new TerminalSymbolFact("]", tok));
exprFact_Proxy.setFact(exprFact); // close the loop
ITokVisitor<IGrammarSymbol, Object> parseExprVisitor = exprFact.makeVisitor(); // Make the actual visitor to perform the parsing
Below is a sequence diagram that shows the instantiation of the factories in the above code snippet and the call to exprFact.makeVisitor() that instantiates the parsing visitor. The top of the boxes show, relatively, when Java instantiates each of the factories as it works from the left-most to right-most and from innermost to outermost calls.
The 3 purple boxes are the ATokVisitors corresponding to the 3 terminal symbols in the grammar. These are the only visitors for which processing commands, i.e. the commands to process each terminal token, are being instantiated and thus are the only source of commands being used to construct the overall processing visitor.
The first thing to notice is that the exprFact sequence factory cannot possibly instantiate the two parsing visitors it needs right when the factory itself is instantiated. If it did, the recursion would require that exprFact reference itself before it has completed its own instantiation.
But notice that even if we delay instantiation until the factory method, makeVisitor(), is called, we still have a problem. Note the red lines in the middle. What is returned, given that at that point, the Expr sequence factory hasn't finished instantiating its visitor?
In SequenceFact, the left-hand token visitor (the "head" token's visitor) is lazily instantiated the first time that its makeVisitor() is called. Subsequent calls simply return the now-eager visitor instance. The right-hand token visitor (the "tail" token's visitor) is lazily instantiated at the resultant visitor's run-time, where once again, any subsequent usage of the tail visitor is simply an eager reference.
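The lazy-instantiation trick can be sketched as a generic memoizing supplier. This is not the library's actual mechanism, just the essential idea: defer building a value until it is first needed, then cache it, so a reference through a ProxyFact cycle is never followed during construction.

```java
import java.util.function.Supplier;

// A supplier that defers creating its value until first use, then caches it.
// Holding Lazy<Visitor> instead of an eagerly-built visitor breaks the
// instantiation cycle: the recursive factory reference is not followed
// until the visitor is actually demanded.
class Lazy<T> implements Supplier<T> {
    private Supplier<T> source; // nulled out once the value is cached
    private T value;
    Lazy(Supplier<T> source) { this.source = source; }
    public T get() {
        if (source != null) {   // first call: actually build the value
            value = source.get();
            source = null;      // every later call is an "eager" reference
        }
        return value;
    }
}
```

A sequence factory can thus hold something like a lazily-built tail visitor and call get() only at parse time, by which point the whole factory graph, cycles included, is fully wired.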
The parse tree created by the parser can be a bit unwieldy to work with, so it is reasonable to transform it into a simpler representation. The supplied source code contains a ToModelAlgo class that performs this transformation. It is a visitor on grammar symbols and returns an AAST (abstract AST object). Note that this utility only works on simple arithmetic grammars and, as written, is not universally applicable.

A "toString" visitor, AAST_ToString, is also included that will give a tree-formatted string representation of the AST (AAST object).
Consider the different cases for a parse tree:

1) A lone number, e.g. "7.0":

   |_ parser.SequenceSymbol
        |_ parser.TerminalSymbol: 7.0 (Num)
        |_ parser.MTSymbol

2) A lone identifier, e.g. "a":

   |_ parser.SequenceSymbol
        |_ parser.TerminalSymbol: a (Id)
        |_ parser.MTSymbol

3) An operator expression, e.g. "3.0 + 7.0":

   |_ parser.SequenceSymbol
        |_ parser.TerminalSymbol: 3.0
        |_ parser.SequenceSymbol
             |_ parser.TerminalSymbol: +
             |_ parser.SequenceSymbol
                  |_ parser.TerminalSymbol: 7.0
                  |_ parser.MTSymbol
So in all cases the parse tree starts with a SequenceSymbol. Therefore, the ToModelAlgo visitor rejects anything that is not a SequenceSymbol. If it encounters a SequenceSymbol, then the first symbol must be a terminal symbol, namely a Num or an Id (check all three cases above). The ToModelAlgo visitor therefore executes the NumOrIdHelper on the first symbol and passes its return value (an AAST) to the visitor that deals with the second symbol, SequenceHelper.
NumOrIdHelper is simple. It rejects anything that is not a Num or an Id. If it encounters a Num or an Id, it returns the AAST for it.

SequenceHelper takes the result from NumOrIdHelper as inps[0].
Consider the legal cases for SequenceHelper: in cases 1) and 2) above, SequenceHelper may encounter the empty symbol. In that case, the AAST is just whatever was passed from NumOrIdHelper in inps[0]. Otherwise, SequenceHelper must encounter a sequence ("+ ...").

If SequenceHelper encounters a sequence, it should have a "+" or "*" in the first symbol and an arithmetic expression in the second symbol. SequenceHelper therefore uses the main ToModelAlgo visitor on the second symbol, and then passes that return value to OperatorHelper as inps[1], along with whatever it got from NumOrIdHelper in inps[0].
OperatorHelper is run on the first symbol of SequenceHelper's host. OperatorHelper takes the AAST from the left side of the operator as inps[0] and the AAST from the right side as inps[1]. There are two legal cases for OperatorHelper: "+" and "*". It creates the correct AASTs for these, e.g. new Add(inps[0], inps[1]), and rejects anything else.

This code transforms the parse tree into a simple AAST that contains only the information that is pertinent to the model.
package parser.visitor;

import parser.*;
import sample.*;

/**
 * Turn the parse tree into an AST that is easier to use.  For example,
 * "5.0+3.0+7.0" parses to:
 *
 * parser.SequenceSymbol
 *   |_ parser.TerminalSymbol: 5.0
 *   |_ parser.SequenceSymbol
 *        |_ parser.TerminalSymbol: +
 *        |_ parser.SequenceSymbol
 *             |_ parser.TerminalSymbol: 3.0
 *             |_ parser.SequenceSymbol
 *                  |_ parser.TerminalSymbol: +
 *                  |_ parser.SequenceSymbol
 *                       |_ parser.TerminalSymbol: 7.0
 *                       |_ parser.MTSymbol
 */
public class ToModelAlgo extends AGramSymVisitor<AAST, Void> {   // R = AAST, P = Void
    public static final ToModelAlgo Singleton = new ToModelAlgo();

    private ToModelAlgo() {
        super(new IGramSymVisitorCmd<AAST, Void>() {
            public AAST apply(String idx, IGrammarSymbol host, Void... inps) {
                throw new IllegalArgumentException("Parse tree does not represent a correct arithmetic expression.");
            }
        });
        setCmd("Sequence", new IGramSymVisitorCmd<AAST, Void>() {
            @SuppressWarnings("unchecked")
            public AAST apply(String idx, IGrammarSymbol host, Void... inps) {
                SequenceSymbol sh = (SequenceSymbol) host;
                // Process the head (Num or Id) first, then hand its AAST to the
                // visitor that deals with the tail of the sequence.
                return sh.getSymbol2().execute(SequenceHelper.Singleton,
                                               sh.getSymbol1().execute(NumOrIdHelper.Singleton));
            }
        });
    }

    // Host is the tail of a sequence, either:
    //   |_ parser.SequenceSymbol
    //        |_ parser.TerminalSymbol: +
    //        |_ parser.SequenceSymbol ...
    // or:
    //   |_ parser.MTSymbol
    // inps[0] is the AAST from the left side of the + or *.
    private static class SequenceHelper extends AGramSymVisitor<AAST, AAST> {   // R = AAST, P = AAST
        public static final SequenceHelper Singleton = new SequenceHelper();

        private SequenceHelper() {
            super(new IGramSymVisitorCmd<AAST, AAST>() {
                public AAST apply(String idx, IGrammarSymbol host, AAST... inps) {
                    throw new IllegalArgumentException("Parse tree does not represent a correct arithmetic expression.");
                }
            });
            // legal cases:
            //   - SequenceSymbol
            //   - MTSymbol
            setCmd("Sequence", new IGramSymVisitorCmd<AAST, AAST>() {
                public AAST apply(String idx, IGrammarSymbol host, AAST... inps) {
                    SequenceSymbol sh = (SequenceSymbol) host;
                    // Recur on the rest of the expression, then let OperatorHelper
                    // combine the left (inps[0]) and right AASTs.
                    return sh.getSymbol1().execute(OperatorHelper.Singleton,
                                                   inps[0],
                                                   sh.getSymbol2().execute(ToModelAlgo.Singleton));
                }
            });
            setCmd("MTSymbol", new IGramSymVisitorCmd<AAST, AAST>() {
                public AAST apply(String idx, IGrammarSymbol host, AAST... inps) {
                    return inps[0];   // end of the expression: pass the AAST through
                }
            });
        }
    }

    // Host is a terminal symbol holding a Num or an Id.
    private static class NumOrIdHelper extends AGramSymVisitor<AAST, Void> {   // R = AAST, P = Void
        public static final NumOrIdHelper Singleton = new NumOrIdHelper();

        private NumOrIdHelper() {
            super(new IGramSymVisitorCmd<AAST, Void>() {
                public AAST apply(String idx, IGrammarSymbol host, Void... inps) {
                    throw new IllegalArgumentException("Parse tree does not represent a correct arithmetic expression.");
                }
            });
            // legal cases:
            //   - Num
            //   - Id
            setCmd("Num", new IGramSymVisitorCmd<AAST, Void>() {
                public AAST apply(String idx, IGrammarSymbol host, Void... inps) {
                    return new IntLeaf(new Double(host.toString()).intValue());
                }
            });
            setCmd("Id", new IGramSymVisitorCmd<AAST, Void>() {
                public AAST apply(String idx, IGrammarSymbol host, Void... inps) {
                    return new VarLeaf(host.toString());
                }
            });
        }
    }

    // Host is the operator terminal:
    //   parser.TerminalSymbol: +   or   parser.TerminalSymbol: *
    // inps[0] is the AAST from the left side of the + or *.
    // inps[1] is the AAST from the right side of the + or *.
    private static class OperatorHelper extends AGramSymVisitor<AAST, AAST> {   // R = AAST, P = AAST
        public static final OperatorHelper Singleton = new OperatorHelper();

        private OperatorHelper() {
            super(new IGramSymVisitorCmd<AAST, AAST>() {
                public AAST apply(String idx, IGrammarSymbol host, AAST... inps) {
                    throw new IllegalArgumentException("Parse tree does not represent a correct arithmetic expression.");
                }
            });
            // legal cases:
            //   - +
            //   - *
            setCmd("+", new IGramSymVisitorCmd<AAST, AAST>() {
                public AAST apply(String idx, IGrammarSymbol host, AAST... inps) {
                    return new Add(inps[0], inps[1]);
                }
            });
            setCmd("*", new IGramSymVisitorCmd<AAST, AAST>() {
                public AAST apply(String idx, IGrammarSymbol host, AAST... inps) {
                    return new Multiply(inps[0], inps[1]);
                }
            });
        }
    }
}
The job of the tokenizer in our system is to process the raw text (stream of characters) of the language into a stream of manageable chunks of text called "tokens". Tokens include variable names, keywords, special characters, characters representing operations, line breaks, numeric values, quoted strings, etc. In our system, tokens are "intelligent" objects, meaning that they have the ability to execute a visitor algorithm that can process the token based on its value. There are a number of tokenizers in the supplied code being used in various places. There are a couple of important take-aways from this.
In our system, the ITokenizer interface is the top-level tokenizer representation. It defines the invariant behaviors of getting the next token from the input file stream and an uncommonly implemented feature, putBack(), which allows the system to put a token back into the token stream as if it had never been pulled out of the stream. This capability enables our algorithms to achieve the "backtracking" effect used by many parsing algorithms without doing so explicitly and without disturbing the natural recursive process of the algorithm.
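A minimal sketch of a tokenizer with putBack() is shown below. This is an assumed, simplified shape (the real ITokenizer implementations work on a character stream and produce intelligent Token objects, not plain strings); it only illustrates how put-back tokens are replayed ahead of the input.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy tokenizer: splits input into numbers, identifiers and single symbols,
// and supports pushing a token back onto the front of the stream.
class SimpleTokenizer {
    private static final Pattern TOK = Pattern.compile("\\d+|[A-Za-z]\\w*|\\S");
    private final Matcher m;
    private final Deque<String> pushedBack = new ArrayDeque<>();

    SimpleTokenizer(String input) { this.m = TOK.matcher(input); }

    String getNextToken() {
        if (!pushedBack.isEmpty()) return pushedBack.pop(); // replay put-back tokens first
        return m.find() ? m.group() : "EOF";
    }

    void putBack(String tok) { pushedBack.push(tok); } // as if it were never read
}
```

A stack (Deque) is the natural store for put-back tokens: the most recently returned token is the first one to be re-read, which is exactly what a one-token-look-ahead parser needs when a default case declines a token.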
Tokenizer1, XMLTokenizer and BNFTokenizer are all concrete tokenizers that are hard-coded to work for the sample language, XML and BNF respectively. The actual tokenizer used by the supplied code is a regular-expression-based tokenizer, ALexor, which is instantiated by an ILexorFactory instance. A static convenience method, LexorFactory.fromKeywords(), has been supplied to make a custom tokenizer from the grammar's terminal symbols. The terminal symbols can be collected as the BNF of the desired grammar is parsed. This is why ParserFactory.makeBNFParser() takes a Set<String>, an accumulator for the terminal grammar symbols, which are the lexemes of the associated tokens. The accumulated terminal symbols will then be used to create the tokenizer for the language described by the parsed BNF.
In the supplied demo code, the field model.RDPModel.tokenizer2symbols is a HashMap that is loaded with the terminal symbols for the pre-defined grammars: "original", XML and BNF. The BNFs for these grammars are bnf1.txt, bnfxml1.txt and bnfbnf1.txt respectively. model.RDPModel.tokenizer2symbols is used to create the tokenizers for the respective grammars. In the end, you will demonstrate that the only tokenizer and grammar that one ever has to hard-code is for BNF itself!
© 2013 by Stephen Wong