COMP 405
Spring 2014

Extended Visitors and Recursive Descent Parsing


You may branch the supporting code from the course SVN site: https://svn.rice.edu/r/comp405/course/ExtVisitorParsing/   This code also includes a ToModelAlgo visitor that transforms the parse tree into an AST with addition, multiplication, integer and variable leaves.

Here is a simpler example using the extended visitor framework:


Simple Infix Arithmetic Expressions

Consider a program that reads a simple arithmetic expression in infix format, such as "abc + 45 + x + y", from a text file and builds the corresponding parse tree.  We shall restrict ourselves to arithmetic expressions containing integers, strings (representing variables) and addition.  The syntax of such arithmetic expressions can be expressed as a grammar with the following 4 rules.

The format below is called "Backus-Naur Form" or "BNF" for short.  It is an example of a "context-free grammar" that is "read Left-to-right with a Left-most derivation on a 1 token look-ahead" or, more simply, "LL(1)".  That means that you: a) read the line from left to right, b) process the information from left to right and c) need only look at one token at any given point to know where you are in the grammar.

E     ::=  F E1
E1    ::=  E1a | Empty
E1a   ::=  + E
F     ::=  num | id

The above grammar can be separated into two distinct sets of symbols:

Terminal symbols are atomic entities that are never defined in terms of other symbols, i.e. they only appear on the right-hand side of the rules.  Non-terminal symbols, however, are defined in terms of other symbols and hence always appear on the left-hand side of some rule.  E is a special non-terminal symbol called the "Start Symbol"; every derivation in the grammar begins with it.  Likewise, Empty is a special symbol representing the lack of symbols at the end of a sequence of symbols.

It is important to note that there are really only two types of rules above: sequencing rules, where symbols are concatenated (e.g. E ::= F E1), and combination ("or") rules, where one of several alternatives is chosen (e.g. F ::= num | id).

Here is how the above grammar generates the string "abc + 45".

E -> F E1              // E is the start symbol and has only one rule: E ::= F E1

  -> id E1             // Apply the rule F ::= id because "abc" is an id.

  -> id E1a            // Apply the rule E1 ::= E1a because there are more tokens

  -> id + E            // Apply the rule E1a ::= + E

  -> id + F E1         // Apply the rule E ::= F E1

  -> id + num E1        // Apply the rule F ::= num because "45" is a number

  -> id + num Empty    // Apply the rule E1 ::= Empty because there are no more tokens

  -> abc + 45          // the lexeme of id is "abc" and the lexeme of num is "45"

The above derivation of the input string "abc + 45" from the grammar rules can be represented by a tree structure called the parse tree as follows.

                          E
                          |
               ------------------------
               |                      |
               F                      E1
               |                      |
               id                    E1a
             "abc"                    |
                              ----------------
                              |              |
                              +              E
                             "+"             |
                                        ----------
                                        |        |
                                        F        E1
                                        |        |
                                       num     Empty
                                       "45"


Design Philosophy

THE STRUCTURE OF THE CODE SHOULD MATCH THE STRUCTURE OF THE DATA

This is one of the core design philosophies that runs through object-oriented design.   The point is that the data drives how it is processed by its very nature, i.e. its structure.   The structure of the data is not simply a convenient container for easy processing, but rather an embodiment, a model of the data that captures the essence of what the data is and how it behaves.   In this inverted-control world of OOP/D where data drives processing rather than processing manipulates data, the processing of the data is thus effectively a behavior of the data and thus fundamentally must mirror the data's structure.   What can and cannot be done with the data is then controlled by the data structure itself, leading to much more natural and robust processing algorithms.

In all the following design discussions, pay close attention to how this design philosophy is being applied at all levels.    If you have done parsing algorithms in another course, compare the algorithms and how matching the code to the structure of the data changes how we look at and solve the problem. 


First, let us consider the modeling of the tokens.   We want "intelligent" tokens that understand what they are, so they can be used in algorithms that depend on what type of tokens are being used.  In particular, when we are parsing a stream of incoming tokens, we simply want to run some algorithm that determines whether or not the token is the appropriate one for that moment and then proceeds to process the subsequent tokens correctly, based on that one token (remember that we are only considering LL(1) grammars, so one token is all it takes).

Tokens are thus hosts for extended visitors, calling the case on the visitor corresponding to the name of the token.  A factory is used to hide the details of constructing tokens from the rest of the system.
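As a rough sketch of what such an "intelligent" token might look like, consider the following.  The names here (Token, TokVisitor, ITokVisitorCmd) are illustrative stand-ins, not the actual framework classes; the point is that the token itself selects the visitor case matching its own name, so client code never switches on token type:

```java
import java.util.HashMap;
import java.util.Map;

// A command to run for a particular token case.
interface ITokVisitorCmd<R> { R apply(Token host); }

// An extended visitor: a default case plus dynamically installed named cases.
class TokVisitor<R> {
    private final Map<String, ITokVisitorCmd<R>> cmds = new HashMap<>();
    private final ITokVisitorCmd<R> defaultCmd;
    TokVisitor(ITokVisitorCmd<R> defaultCmd) { this.defaultCmd = defaultCmd; }
    void setCmd(String name, ITokVisitorCmd<R> cmd) { cmds.put(name, cmd); }
    R caseAt(String name, Token host) {
        return cmds.getOrDefault(name, defaultCmd).apply(host);
    }
}

// An "intelligent" token: a host for visitors.
class Token {
    private final String name, lexeme;
    Token(String name, String lexeme) { this.name = name; this.lexeme = lexeme; }
    String getLexeme() { return lexeme; }
    // The host calls the case corresponding to its own name -- no instanceof needed.
    <R> R execute(TokVisitor<R> v) { return v.caseAt(name, this); }
}
```

A "Num" token handed a visitor with a "Num" case will run that case; any other token falls through to the visitor's default case.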

Tokens

Now consider the grammar symbols.   There are really only three distinct grammar symbols: empty, terminal and sequence.  A symbol for the combination rule is unnecessary because once a token stream has been parsed in terms of a grammar, the choices of a combination no longer exist and that point in the grammar can be represented by a single symbol, not a combination.

Once again, these grammar symbols are hosts for extended visitors so that algorithms can be run on them. 
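The three symbol types can be sketched as a small composite hierarchy.  The names below (GramSym, MTSym, TermSym, SeqSym) are illustrative, not the framework's, and the show() method is a simple stand-in for running a visitor on the symbol:

```java
// Hypothetical sketch of the three distinct grammar symbols.  An entire
// parse tree is built from just these: empty, terminal, and sequence.
abstract class GramSym {
    abstract String show();   // stand-in for executing a visitor on the symbol
}
// The empty symbol: the lack of symbols at the end of a sequence.
class MTSym extends GramSym {
    String show() { return "Empty"; }
}
// A terminal symbol, holding the lexeme of its token.
class TermSym extends GramSym {
    final String lexeme;
    TermSym(String lexeme) { this.lexeme = lexeme; }
    String show() { return lexeme; }
}
// A sequence symbol: a pair of symbols; longer sequences are nested pairs.
class SeqSym extends GramSym {
    final GramSym first, second;
    SeqSym(GramSym first, GramSym second) { this.first = first; this.second = second; }
    String show() { return "(" + first.show() + " " + second.show() + ")"; }
}
```

For example, the parse of "abc + 45" comes out as nested sequences terminated by the empty symbol, mirroring the parse tree above.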

Parsing thus involves running an algorithm on the stream of tokens from the tokenizer (actually, only the first one because the algorithm itself will pull the rest of the tokens from the tokenizer).  This algorithm will produce a grammar symbol (the "start symbol") which is at the root of a tree of grammar symbols (the parse tree).

In recursive descent parsing (RDP), we use a "tokenizer" to scan and "tokenize" the input from left to right, and build the parse tree from the top down, based on the values of the tokens.  Building a parse tree from the top down means first looking at the grammar rule(s) for the start symbol E and trying to create the object that represents E.  The task of building E is called "parsing E" and is carried out by an appropriate token visitor.
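For contrast with the visitor-based machinery developed below, here is what a conventional hand-coded recursive descent parser for this grammar might look like: one method per non-terminal, each consuming tokens left to right on one token of look-ahead.  This is only an illustrative sketch (the string-building "parse tree" is a stand-in for real symbol objects):

```java
import java.util.List;

// Hand-coded recursive descent for:
//   E ::= F E1,   E1 ::= E1a | Empty,   E1a ::= + E,   F ::= num | id
class RDParser {
    private final List<String> toks;  // pre-tokenized input, e.g. ["abc", "+", "45"]
    private int pos = 0;
    RDParser(List<String> toks) { this.toks = toks; }
    private String peek() { return pos < toks.size() ? toks.get(pos) : null; }

    String parseE() {                 // E ::= F E1
        return "E(" + parseF() + parseE1() + ")";
    }
    String parseE1() {                // E1 ::= E1a | Empty -- decided by 1 look-ahead token
        if ("+".equals(peek())) {     // E1a ::= + E
            pos++;                    // consume the "+"
            return " + " + parseE();
        }
        return "";                    // Empty: no more tokens to consume here
    }
    String parseF() {                 // F ::= num | id
        String t = toks.get(pos++);
        return t.matches("\\d+") ? "num:" + t : "id:" + t;
    }
}
```

Notice how the call structure of the methods exactly mirrors the structure of the grammar rules, i.e. the structure of the code matches the structure of the data.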

We can see from the above diagram that in order to parse E, we need to parse the sequence of F followed by E1.  But that means that we need to know how to parse F and E1 as well.

Writing the concrete visitors to parse each of the non-terminal symbols will require, in general, performing a global analysis looking deeper than necessary into the grammar structure.  One must analyze the details of the grammar to the level of knowing all possible tokens at that stage.  This analysis may require one to look beyond the immediate relationships a class may have with any other classes.  For instance, to find the possible tokens that could be the start of an E term, one must look down to the F term.  However, the code to process an E term should only be concerned with the fact that an E term is composed of an F term and an E1 term, and should not be coupled to the internal nature of the F, E1 or any other terms.  Likewise, when parsing a branch, such as F, one must create the union of all the visitors that parse the branch's subclasses.

To remedy this problem, one must re-think the instantiation process of the visitors.  In particular, in order to decouple the construction of the visitor for one term from the details of other terms, one must abstract and encapsulate the construction process.  This is done by the abstract factory design pattern.

Using factories to instantiate the parsing visitors

Each non-terminal symbol (and its corresponding class) is associated with a factory that constructs its parsing visitor. 

All factories adhere to a basic factory interface which provides the methods to instantiate the parsing visitors.  For convenience, all the factories are derived from an abstract factory, ATokVisitorFact, which provides access to the tokenizer.

The major components of the parsing system are:

In practice, the following convenience classes are often useful:

The code to create a visitor to parse the above grammar is thus:

    ProxyFact eFact_Proxy = new ProxyFact(); // use this wherever E appears in the grammar, except the very beginning.
    
    ITokVisitorFact eFact = new SequenceFact("E", tok,
                                             new CombinationFact("F", tok, 
                                                                 new TerminalSymbolFact("Num", tok), 
                                                                 new TerminalSymbolFact("Id", tok)),
                                             new CombinationFact("E1", tok, 
                                                                 new MultiSequenceFact("E1a", tok, 
                                                                                       new TerminalSymbolFact("+", tok), 
                                                                                       eFact_Proxy), 
                                                                 new MTSymbolFact(tok)));
    
    eFact_Proxy.setFact(eFact);  // close the loop
    
    ITokVisitor<IGrammarSymbol, Object> parseEVisitor = eFact.makeVisitor();  // Make the actual visitor to perform the parsing
    
    

Note how the process of  creating a visitor for a particular grammar is purely a run-time operation and does not require any custom classes at all.

The Problem of Recursion

One of the biggest hurdles in constructing a parsing system is how to effectively deal with recursive definitions, particularly if the grammar is left-recursive.  While there are techniques to refactor the grammar to remove left-recursion, this is not always practical.  Theoretically, a left-recursive grammar is not parsable by an LL(1) parser, so all left-recursive grammars must be refactored into a "weakly equivalent" right-recursive grammar.  But our problem with recursion arises even if the grammar is not fully left-recursive, whenever a recursive non-terminal symbol ends up as the left-most symbol in a sequence somewhere.

Here is an example of a partially left-recursive grammar:

Expr ::= Term "]"
Term ::= Num | SubExpr
SubExpr ::= "[" Expr

If we were to hard-code the parser generator, we would get the following:

    ProxyFact exprFact_Proxy = new ProxyFact(); // use this wherever Expr appears in the grammar, except the very beginning.

    ITokVisitorFact exprFact = new SequenceFact("Expr", tok,
                                             new CombinationFact("Term", tok, 
                                                                 new TerminalSymbolFact("Num", tok), 
                                                                 new SequenceFact("SubExpr", tok,
                                                                                  new TerminalSymbolFact("[", tok),
                                                                                  exprFact_Proxy)),
                                             new TerminalSymbolFact("]", tok));
    
    exprFact_Proxy.setFact(exprFact);  // close the loop
    
    ITokVisitor<IGrammarSymbol, Object> parseExprVisitor = exprFact.makeVisitor();  // Make the actual visitor to perform the parsing

Below is a sequence diagram that shows the instantiation of the factories in the above code snippet and the call to exprFact.makeVisitor() that instantiates the parsing visitor.    The top of the boxes show, relatively, when Java instantiates each of the factories as it works from the left-most to right-most and from innermost to outermost calls.    The 3 purple boxes are the ATokVisitors corresponding to the 3 terminal symbols in the grammar.   These are the only visitors for which processing commands, i.e. the commands to process each terminal token, are being instantiated and thus are the only source of commands being used to construct the overall processing visitor.

The first thing to notice is that the exprFact sequence factory cannot possibly instantiate the two parsing visitors it needs right when the factory itself is instantiated.    If it did, then the recursion would require that exprFact reference itself before it has completed its own instantiation. 

But notice that even if we delay instantiation until the factory method, makeVisitor(), is called, we still have a problem.   Note the red lines in the middle.   What is returned, given that at that point, the Expr sequence factory hasn't finished instantiating its visitor?

[Sequence diagram: instantiation of the factories, followed by the exprFact.makeVisitor() call that instantiates the parsing visitor]

In SequenceFact, the left-hand token visitor (the "head" token's visitor) is lazily instantiated the first time that its makeVisitor() is called.   Subsequent calls simply return the now eager visitor instance.   The right-hand token visitor (the "tail" token's visitor) is lazily instantiated at the resultant visitor's run-time, where once again, any subsequent usage of the tail visitor is simply an eager reference.
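The lazy-instantiation trick can be sketched as follows.  The names (IVisFact, ProxyFactSketch, SequenceFactSketch) are illustrative stand-ins for the framework classes, and strings stand in for real visitors; the essential point is that the tail visitor is never made at factory-construction or makeVisitor() time, so the recursive cycle through the proxy terminates:

```java
import java.util.function.Supplier;

// A factory for (stand-in) parsing visitors.
interface IVisFact { String makeVisitor(); }

// Proxy: defers the lookup of the real factory until a visitor is demanded,
// by which time setFact() has "closed the loop".
class ProxyFactSketch implements IVisFact {
    private IVisFact fact;
    void setFact(IVisFact f) { fact = f; }
    public String makeVisitor() { return fact.makeVisitor(); }
}

// Sequence factory: head visitor made lazily on first call, tail deferred
// entirely to the resultant visitor's run-time.
class SequenceFactSketch implements IVisFact {
    private final IVisFact headFact;
    private final Supplier<IVisFact> tailFact;  // only touched at visitor run-time
    private String headVisitor;                 // cached after the first call
    SequenceFactSketch(IVisFact head, Supplier<IVisFact> tail) {
        headFact = head; tailFact = tail;
    }
    public String makeVisitor() {
        if (headVisitor == null) headVisitor = headFact.makeVisitor(); // lazy, once
        // Crucially, tailFact.get().makeVisitor() is NOT called here; doing so
        // would recurse through the proxy back into this very factory.
        return "seq(" + headVisitor + ", <tail deferred>)";
    }
}
```

Building the recursive cycle (proxy wired back to the sequence factory) and then calling makeVisitor() terminates, precisely because the tail is deferred.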

 

Parse Tree-to-Model Translation

The parse tree created by the parser can be a bit unwieldy to work with. It is reasonable to transform it into a simpler representation. The supplied source code contains a ToModelAlgo class that performs this transformation. It is a visitor on grammar symbols and returns an AAST (abstract AST object).   Note that this utility only works on simple arithmetic grammars and, as written, is not universally applicable.   A "toString" visitor, AAST_ToString, is also included that will give a tree-formatted string representation of the AST (AAST object).

Consider the different cases for a parse tree:

  1. If you have just a number, it will be the parse tree
    |_ parser.SequenceSymbol
      |_ parser.TerminalSymbol: 7.0 (Num)
      |_ parser.MTSymbol
  2. Similarly, if you just have a variable name, it will be the parse tree
    |_ parser.SequenceSymbol
      |_ parser.TerminalSymbol: a (Id)
      |_ parser.MTSymbol
  3. If you have something more complex, the parse tree will look like
    |_ parser.SequenceSymbol
     |_ parser.TerminalSymbol: 3.0
     |_ parser.SequenceSymbol
       |_ parser.TerminalSymbol: +
       |_ parser.SequenceSymbol
         |_ parser.TerminalSymbol: 7.0
         |_ parser.MTSymbol

So in all cases, the parse tree starts with a SequenceSymbol. Therefore, the ToModelAlgo visitor rejects anything that is not a SequenceSymbol. If it encounters a SequenceSymbol, then the first symbol must be a terminal symbol, namely a Num or an Id (check all three cases above). The ToModelAlgo visitor therefore executes the NumOrIdHelper on the first symbol and passes its return value (an AAST) to the visitor that deals with the second symbol, SequenceHelper.

NumOrIdHelper is simple. It rejects anything that is not a Num or an Id. If it encounters a Num or an Id, it returns the AAST for it.

SequenceHelper takes the result from NumOrIdHelper as inps[0]. Consider the legal cases for SequenceHelper: In cases 1) and 2) above, SequenceHelper may encounter the empty symbol. In that case, the AAST is just whatever was passed from NumOrIdHelper in inps[0]. Otherwise, SequenceHelper must encounter a sequence ("+ ...").

If SequenceHelper encounters a sequence, it should have a "+" or "*" in the first symbol and an arithmetic expression in the second symbol. SequenceHelper therefore uses the main ToModelAlgo visitor on the second symbol, and then passes that return value to OperatorHelper as inps[1], along with whatever it got from NumOrIdHelper in inps[0].

OperatorHelper is run on the first symbol of SequenceHelper's host. OperatorHelper takes the AAST from the left side of the operator as inps[0] and the AAST from the right side as inps[1]. There are two legal cases for OperatorHelper: "+" and "*". It creates the correct AASTs for these, e.g. new Add(inps[0], inps[1]); and rejects anything else.

This code transforms the parse tree into a simple AAST that contains only the information that is pertinent to the model.

package parser.visitor;

import parser.*;
import sample.*;

/**
 * Turn the parse tree into an AST that is easier to use.
 5.0+3.0+7.0
 parser.SequenceSymbol
 |_ parser.TerminalSymbol: 5.0
 |_ parser.SequenceSymbol
   |_ parser.TerminalSymbol: +
   |_ parser.SequenceSymbol
     |_ parser.TerminalSymbol: 3.0
     |_ parser.SequenceSymbol
       |_ parser.TerminalSymbol: +
       |_ parser.SequenceSymbol
         |_ parser.TerminalSymbol: 7.0
         |_ parser.MTSymbol
 */
public class ToModelAlgo extends AGramSymVisitor {      // R = AAST, P = Void
    public static final ToModelAlgo Singleton = new ToModelAlgo();
    private ToModelAlgo() {
        super(new IGramSymVisitorCmd() {
            public AAST apply(String idx, IGrammarSymbol host, Void... inps) {
                throw new IllegalArgumentException("Parse tree does not represent a correct arithmetic expression.");
            }
        });
        setCmd("Sequence",new IGramSymVisitorCmd() {
            @SuppressWarnings("unchecked")
            public AAST apply(String idx, IGrammarSymbol host, Void... inps) {
                SequenceSymbol sh = (SequenceSymbol)host;
                return sh.getSymbol2().execute(SequenceHelper.Singleton,sh.getSymbol1().execute(NumOrIdHelper.Singleton));
            }
        });
    }
    
    // |_ parser.SequenceSymbol
    //   |_ parser.TerminalSymbol: +
    //   |_ parser.SequenceSymbol
    //     |_ parser.TerminalSymbol: 3.0
    //     |_ ...
    //or
    // |_ parser.MTSymbol
    // inps[0] is the AAST from the left side of the + or *
    private static class SequenceHelper extends AGramSymVisitor {         // R = AAST, P = AAST

        public static final SequenceHelper Singleton = new SequenceHelper();
        private SequenceHelper() {
            super(new IGramSymVisitorCmd() {
                public AAST apply(String idx, IGrammarSymbol host, AAST... inps) {
                    throw new IllegalArgumentException("Parse tree does not represent a correct arithmetic expression.");
                }
            });
            // legal cases:
            // - SequenceSymbol
            // - MTSymbol
            setCmd("Sequence",new IGramSymVisitorCmd() {
                public AAST apply(String idx, IGrammarSymbol host, AAST... inps) {
                    SequenceSymbol sh = (SequenceSymbol)host;
                    return sh.getSymbol1().execute(OperatorHelper.Singleton,
                                                   inps[0],
                                                   sh.getSymbol2().execute(ToModelAlgo.Singleton));
                }
            });
            setCmd("MTSymbol",new IGramSymVisitorCmd() {
                public AAST apply(String idx, IGrammarSymbol host, AAST... inps) {
                    return inps[0];
                }
            });
        }        
    }    
    
    // parser.TerminalSymbol: 3.0 (Num)
    // or
    // parser.TerminalSymbol: a (Id)
    private static class NumOrIdHelper extends AGramSymVisitor {       // R = AAST, P = Void
        public static final NumOrIdHelper Singleton = new NumOrIdHelper();
        private NumOrIdHelper() {
            super(new IGramSymVisitorCmd() {
                public AAST apply(String idx, IGrammarSymbol host, Void... inps) {
                    throw new IllegalArgumentException("Parse tree does not represent a correct arithmetic expression.");
                }
            });
            // legal cases:
            // - Num
            // - Id
            setCmd("Num",new IGramSymVisitorCmd() {
                public AAST apply(String idx, IGrammarSymbol host, Void... inps) {
                    return new IntLeaf(new Double(host.toString()).intValue());
                }
            });
            setCmd("Id",new IGramSymVisitorCmd() {
                public AAST apply(String idx, IGrammarSymbol host, Void... inps) {
                    return new VarLeaf(host.toString());
                }
            });
        }        
    }    
    
    // parser.TerminalSymbol: +
    // or
    // parser.TerminalSymbol: *
    // inps[0] is the AAST from the left side of the + or *
    // inps[1] is the AAST from the right side of the + or *
    private static class OperatorHelper extends AGramSymVisitor {       // R = AAST, P = AAST
        public static final OperatorHelper Singleton = new OperatorHelper();
        private OperatorHelper() {
            super(new IGramSymVisitorCmd() {
                public AAST apply(String idx, IGrammarSymbol host, AAST... inps) {
                    throw new IllegalArgumentException("Parse tree does not represent a correct arithmetic expression.");
                }
            });
            // legal cases:
            // - +
            // - *
            setCmd("+",new IGramSymVisitorCmd() {
                public AAST apply(String idx, IGrammarSymbol host, AAST... inps) {
                    return new Add(inps[0], inps[1]);
                }
            });
            setCmd("*",new IGramSymVisitorCmd() {
                public AAST apply(String idx, IGrammarSymbol host, AAST... inps) {
                    return new Multiply(inps[0], inps[1]);
                }
            });
        }        
    }    
}

 

A Pantheon of Tokenizers

The job of the tokenizer in our system is to process the raw text (stream of characters) of the language into a stream of manageable chunks of text called "tokens".   Tokens include variable names, keywords, special characters, characters representing operations, line breaks, numeric values, quoted strings, etc.   In our system, tokens are "intelligent" objects, meaning that they have the ability to execute a visitor algorithm that can process the token based on its value. There are a number of tokenizers running through the supplied code being used in various places.  There are a couple of important take-aways from this:

Tokenizers

In our system, the ITokenizer interface is the top-level tokenizer representation.    It defines the invariant behaviors of getting the next token from the input file stream plus an uncommonly implemented feature, putBack(), which allows the system to put a token back into the token stream as if it had never been pulled out of the stream.   This capability enables our algorithms to achieve the "backtracking" effect used by many parsing algorithms without explicitly doing so or disturbing the natural recursive process of the algorithm.
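The putBack() capability amounts to keeping a pushback buffer in front of the raw token stream.  A minimal sketch, assuming illustrative names (this is not the course's ITokenizer implementation, and tokens are plain strings here for brevity):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// A tokenizer wrapper with a pushback buffer: tokens that are put back are
// re-delivered, in LIFO order, before any new tokens are read from the source.
class PushbackTokenizer {
    private final Iterator<String> raw;
    private final Deque<String> pushedBack = new ArrayDeque<>();
    PushbackTokenizer(Iterator<String> raw) { this.raw = raw; }

    String getNextToken() {
        // Serve pushed-back tokens first, as if they had never been taken.
        return pushedBack.isEmpty() ? raw.next() : pushedBack.pop();
    }
    void putBack(String tok) {
        pushedBack.push(tok);
    }
}
```

A parsing visitor that pulls a token, discovers it belongs to a different rule, and puts it back thus gets the effect of backtracking without any explicit backtracking machinery, and without disturbing the recursion.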

Tokenizer1, XMLTokenizer and BNFTokenizer are all concrete tokenizers that are hard-coded to work for the sample language, XML and BNF respectively.   The actual tokenizer used by the supplied code is a regular expression-based tokenizer, ALexor, that is instantiated by an ILexorFactory instance.   A static convenience method, LexorFactory.fromKeywords(), has been supplied to make a custom tokenizer from the grammar's terminal symbols.    The terminal symbols can be collected as the BNF of the desired grammar is parsed.   This is why ParserFactory.makeBNFParser() takes a Set<String>, an accumulator for the terminal grammar symbols, which are the lexemes of the associated tokens.   The accumulated terminal symbols will then be used to create the tokenizer for the language described by the parsed BNF.

In the supplied demo code, the field model.RDPModel.tokenizer2symbols is a HashMap that is loaded with the terminal symbols for the pre-defined grammars: "original", XML and BNF.   The BNFs for these grammars are bnf1.txt, bnfxml1.txt and bnfbnf1.txt respectively.   model.RDPModel.tokenizer2symbols is used to create the tokenizers for the respective grammars.     In the end, you will demonstrate that the only tokenizer and grammar that one ever has to hard-code is for BNF itself!

Continue on to dynamic parser generation...


© 2013 by Stephen Wong