# Comp 212 - Lab 11: Recursive Descent Parsing

### Introduction

This tutorial describes an important parsing technique called "recursive descent parsing".  It leads you through the process of developing such a parser to parse Scheme-like arithmetic expressions and build the corresponding abstract syntax trees.  It makes use of the StreamTokenizer discussed in lecture 30 and the results from homework #3 on abstract syntax trees.

### Abstract Syntax Tree

We shall write a program to read Scheme-like arithmetic expressions in prefix format, such as (* 3 (+ 4 5)), from a text file and build the corresponding abstract syntax tree (AST):

```
    *
   / \
  3   +
     / \
    4   5
```

Once the AST is built, one can then traverse and evaluate it, as done in Homework #3 (download the solution)!  We shall restrict ourselves to arithmetic expressions containing integers, strings (representing variables), and binary operations such as  addition and multiplication.  The syntax of the Scheme-like arithmetic expression can be expressed as a "grammar" with the following rules.

Scheme Arithmetic Expression Grammar Rules:

| Rule | Description |
| --- | --- |
| E :: Leaf \| BinOp | An expression E is either a Leaf or a BinOp. |
| Leaf :: VarLeaf \| IntLeaf | A Leaf is either a VarLeaf or an IntLeaf. |
| VarLeaf :: id | A VarLeaf is an id, where id is a token representing a variable. |
| IntLeaf :: int | An IntLeaf is an int, where int is a token representing an integer. |
| BinOp :: AddOp \| MulOp | A BinOp is either an AddOp or a MulOp. |
| AddOp :: ( + E E ) | An AddOp is a left parenthesis token, followed by a plus token, followed by an expression E, followed by another expression E, followed by a right parenthesis token. |
| MulOp :: ( * E E ) | A MulOp is a left parenthesis token, followed by a star token, followed by an expression E, followed by another expression E, followed by a right parenthesis token. |

An example for such an expression is: (* 34 (+ xyz 7)), where

• ( is the left parenthesis token
• * is the star token
• 34 is an int token
• + is the plus token
• xyz is an id token
• ) is the right parenthesis token.

In the above grammar rules,

• E, Leaf, VarLeaf, IntLeaf, BinOp, AddOp, MulOp are called non-terminal symbols.  They are non-terminal because they are defined in terms of other symbols.
• In contrast, int, id, (, ), +, * are not defined in terms of any other symbols; they are called the terminal symbols (or sometimes "tokens").
• E plays a special role in "starting" the grammar rules, and is called the "start" symbol.  Whenever we specify a grammar, we must specify what the start symbol is.
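The grammar rules map naturally onto a composite class hierarchy, with one class per symbol. The following is a minimal sketch of that correspondence, not the lab's actual IAST classes; everything beyond the grammar symbol names (fields, constructors, toString) is an assumption for illustration.

```java
// A minimal composite hierarchy mirroring the grammar rules.
// Each abstract class is a non-terminal; the concrete leaves hold tokens.
abstract class E { }                           // start symbol: an expression

abstract class Leaf extends E { }              // Leaf :: VarLeaf | IntLeaf

class IntLeaf extends Leaf {                   // IntLeaf :: int
    int value;
    IntLeaf(int v) { value = v; }
    public String toString() { return Integer.toString(value); }
}

class VarLeaf extends Leaf {                   // VarLeaf :: id
    String id;
    VarLeaf(String id) { this.id = id; }
    public String toString() { return id; }
}

abstract class BinOp extends E {               // BinOp :: AddOp | MulOp
    E left, right;
    BinOp(E l, E r) { left = l; right = r; }
    abstract String op();
    public String toString() { return "(" + op() + " " + left + " " + right + ")"; }
}

class AddOp extends BinOp {                    // AddOp :: ( + E E )
    AddOp(E l, E r) { super(l, r); }
    String op() { return "+"; }
}

class MulOp extends BinOp {                    // MulOp :: ( * E E )
    MulOp(E l, E r) { super(l, r); }
    String op() { return "*"; }
}

public class GrammarDemo {
    public static void main(String[] args) {
        // Build the AST for (* 3 (+ 4 5)) directly from the constructors.
        E ast = new MulOp(new IntLeaf(3), new AddOp(new IntLeaf(4), new IntLeaf(5)));
        System.out.println(ast);               // prints: (* 3 (+ 4 5))
    }
}
```

Note that nothing here parses text yet; the parser's job, described next, is to build such a tree from a stream of tokens.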

### Recursive Descent Parsing

The manufacturing of an abstract syntax tree (AST) for the above grammar can be thought of as a factory method, makeAST(), of some abstract factory, IASTFactory.  How the AST is created is a variant, as there are many ways to parse the input stream.  We shall implement a special parsing technique called "recursive descent parsing" (RDP).

In RDP, we use a "tokenizer" to scan and "tokenize" the input from left to right, and build the AST from the top down, based on the values of the tokens.  Building an AST from the top down means first looking at the grammar rule for the start symbol E and trying to create an AST that represents E.

As we look at the grammar rule for the start symbol E in the above, we can see that in order for RDP to make an AST, it needs to know how to make a Leaf AST and a BinOp AST.

• Looking at the rule for Leaf, we see that in order to make a Leaf AST, RDP needs to know how to make a VarLeaf and an IntLeaf.
• Making VarLeaf and IntLeaf is easy and can be done directly.
• Looking at the rule for BinOp, we see that in order to make a BinOp AST, RDP needs to know how to make an AddOp AST and a MulOp AST.
• Looking at the rule for AddOp and MulOp, we see that in order to make an AddOp AST and a MulOp AST, RDP needs to know how to make an AST, which comes from the rule for E.  This completes the recursive construction.

Here is some pseudo-code.

makeAST():

• asks the tokenizer for the next token t,
• then asks t to call the appropriate factory method:
  • the int token and the id token call makeLeaf();
  • the left parenthesis token calls makeBinOp();
  • all other tokens should flag an error!
• Does the above "smell" like the visitor pattern to you or not?  Who are the hosts and who are the visitors?

makeLeaf():

• the int token calls makeIntLeaf().
• the id token calls makeVarLeaf().
• all other tokens should flag an error.

makeIntLeaf(): create the IntLeaf directly.

makeVarLeaf(): create the VarLeaf directly

makeBinOp():

• asks the tokenizer for the next token t,
• then asks t to call the appropriate factory method:
  • the plus token calls makeAddOp();
  • the star token calls makeMulOp();
  • all other tokens should flag an error!

makeAddOp() / makeMulOp():

• calls makeAST() to create the first AST,
• calls makeAST() again to create the second AST,
• asks the tokenizer for the next token t:
  • the right parenthesis token instantiates the appropriate AddOp or MulOp AST;
  • all other tokens should flag an error.

In short, an RDP

• has a tokenizer and maintains a current token;
• has a method for each of the non-terminal symbols in the grammar; and
• the method that corresponds to the start symbol initiates the parsing.

Usually, parsing only involves checking for the syntax of an input expression.  However, in our case, we decide to build the abstract syntax tree directly as we parse.

```java
public class RdpASTFactory implements IASTFactory {
    private AToken _currentTok;
    private ITokenizer _tokenizer;

    public RdpASTFactory(ITokenizer tokenizer) { _tokenizer = tokenizer; }

    public IAST makeAST() { /* etc... */ }

    private IAST makeLeaf() { /* etc... */ }

    // other (private) factory methods, etc.
}
```
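To make the recursion concrete before you write the real factory, here is a self-contained, deliberately simplified sketch of the scheme above. It departs from the lab design in two labeled ways: tokens are plain Strings dispatched with conditionals instead of an AToken union with visitors, and the "AST" is simply rebuilt as a prefix string rather than IAST objects. All names other than makeAST/makeLeaf/makeBinOp are assumptions.

```java
import java.util.Arrays;
import java.util.Iterator;

// A simplified, self-contained recursive descent parser sketch.
// One method per non-terminal; each method consumes exactly the
// tokens its grammar rule describes.
public class MiniRdp {
    private final Iterator<String> tokens;

    public MiniRdp(String input) {
        // Crude tokenizer: pad parentheses with spaces, then split.
        tokens = Arrays.asList(
            input.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+"))
            .iterator();
    }

    // E :: Leaf | BinOp  -- dispatch on the next token.
    public String makeAST() {
        String t = tokens.next();
        if (t.equals("(")) return makeBinOp();   // '(' starts a BinOp
        return makeLeaf(t);                      // otherwise an int or id token
    }

    // Leaf :: VarLeaf | IntLeaf
    private String makeLeaf(String t) {
        if (t.matches("-?\\d+")) return t;       // IntLeaf
        if (t.matches("[A-Za-z]\\w*")) return t; // VarLeaf
        throw new IllegalArgumentException("Wrong token: " + t);
    }

    // BinOp :: ( + E E ) | ( * E E )  -- the '(' is already consumed.
    private String makeBinOp() {
        String op = tokens.next();
        if (!op.equals("+") && !op.equals("*"))
            throw new IllegalArgumentException("Wrong token: " + op);
        String first = makeAST();                // first operand E (recursion!)
        String second = makeAST();               // second operand E
        String close = tokens.next();
        if (!close.equals(")"))
            throw new IllegalArgumentException("Wrong token: " + close);
        return "(" + op + " " + first + " " + second + ")";
    }

    public static void main(String[] args) {
        System.out.println(new MiniRdp("(* 34 (+ xyz 7))").makeAST());
        // prints: (* 34 (+ xyz 7))
    }
}
```

Your RdpASTFactory follows the same control flow, but replaces the if-chains with token visitors and returns IAST objects instead of strings.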

### Tokens and Tokenizer

The above pseudo-code suggests designing a union of token objects coupled with visitors, and an abstract tokenizer class that knows how to get the next token.

• Union of Tokens

```java
public abstract class AToken {
    protected String _lexeme;

    protected AToken(String lex) { _lexeme = lex; }

    public String getLexeme() { return _lexeme; }

    public abstract Object execute(ATokVisitor v, Object inp);
}
```

```java
public class IdToken extends AToken {
    // constructor, etc.

    public Object execute(ATokVisitor v, Object inp) {
        return v.idCase(this, inp);
    }
}
```

• Token visitors: the variant behaviors of the tokens are encapsulated in an abstract visitor class, as follows.

```java
public abstract class ATokVisitor {
    /**
     * This is the case when the host is a concrete IdToken.
     * Throws an IllegalArgumentException as default behavior.
     * @param host a concrete IdToken.
     * @param inp the input needed for this ATokVisitor to perform its task.
     */
    public Object idCase(IdToken host, Object inp) {
        throw new IllegalArgumentException("Wrong token!");
    }

    // etc..., a method for each concrete token case.
}
```
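The payoff of the throw-by-default design is that a parsing visitor only overrides the cases its grammar rule allows; every unexpected token flags an error for free. Here is a self-contained sketch: AToken/IdToken/idCase come from the text above, while IntToken/intCase and the VisitorDemo strings are assumptions that just follow the same pattern.

```java
// Self-contained sketch of the token union plus an anonymous visitor.
abstract class AToken {
    protected String _lexeme;
    protected AToken(String lex) { _lexeme = lex; }
    public String getLexeme() { return _lexeme; }
    public abstract Object execute(ATokVisitor v, Object inp);
}

class IdToken extends AToken {
    IdToken(String lex) { super(lex); }
    public Object execute(ATokVisitor v, Object inp) { return v.idCase(this, inp); }
}

class IntToken extends AToken {
    IntToken(String lex) { super(lex); }
    public Object execute(ATokVisitor v, Object inp) { return v.intCase(this, inp); }
}

abstract class ATokVisitor {
    // Default behavior for every case: this token is not expected here.
    public Object idCase(IdToken host, Object inp) {
        throw new IllegalArgumentException("Wrong token!");
    }
    public Object intCase(IntToken host, Object inp) {
        throw new IllegalArgumentException("Wrong token!");
    }
}

public class VisitorDemo {
    public static void main(String[] args) {
        // An anonymous visitor that accepts only id tokens -- the shape
        // a makeVarLeaf() helper might use.
        ATokVisitor idOnly = new ATokVisitor() {
            public Object idCase(IdToken host, Object inp) {
                return "VarLeaf(" + host.getLexeme() + ")";
            }
        };
        System.out.println(new IdToken("xyz").execute(idOnly, null));
        // An IntToken handed to idOnly would throw "Wrong token!",
        // since intCase keeps its default behavior.
    }
}
```

This is the same double-dispatch you will use in the parser: the token is the host, and each factory method supplies a visitor for the cases its grammar rule permits.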

• One way to implement this tokenizer class is to wrap a StreamTokenizer object.

```java
public interface ITokenizer {
    public AToken getNextToken();
}

public class StreamTokWrapper implements ITokenizer {
    private StreamTokenizer _st;
    // constructor and methods, etc...
}
```
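To get a feel for what the wrapper must do, the sketch below drives java.io.StreamTokenizer over a Scheme arithmetic expression. The lab's StreamTokWrapper would map each case to the appropriate concrete AToken; here the tokens are returned as lexeme Strings, and the class and method names are assumptions.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// A sketch of wrapping java.io.StreamTokenizer.  StreamTokenizer already
// classifies words (TT_WORD), numbers (TT_NUMBER), and single "ordinary"
// characters such as '(', ')', '+', '*'.
public class TokDemo {
    // Read every token from the input and return its lexeme as a String.
    public static List<String> lexemes(String input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.ordinaryChar('/');   // the default syntax treats '/' as a comment char
        List<String> result = new ArrayList<String>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            switch (st.ttype) {
                case StreamTokenizer.TT_NUMBER:   // an int token, e.g. 34
                    result.add(String.valueOf((int) st.nval));
                    break;
                case StreamTokenizer.TT_WORD:     // an id token, e.g. xyz
                    result.add(st.sval);
                    break;
                default:                          // '(', ')', '+', '*', ...
                    result.add(String.valueOf((char) st.ttype));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(lexemes("(* 34 (+ xyz 7))"));
        // prints: [(, *, 34, (, +, xyz, 7, ), )]
    }
}
```

In the real StreamTokWrapper, getNextToken() would perform this same switch but instantiate IdToken, IntToken, and the operator/parenthesis tokens instead of collecting strings.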

### Lab Activities

Divide the lab into 3 groups.

1. One group builds the tokenizer StreamTokWrapper: play around with TestStreamTokenizer and modify it, using a file with a Scheme arithmetic expression as input.
2. One group builds the union of tokens: AToken and its concrete variants.  This should be trivial.  After this group is done, it should split into two subgroups: one joins group 1 and the other joins group 3 to communicate its results to the other groups.
3. One group writes the recursive descent parser.  This will involve writing all the needed visitors as anonymous inner classes.

To test the code, download the solution to homework #3, unzip it, and move the various Java files into the appropriate subdirectories to match their packaging.

Have fun!

Dung X. Nguyen
revised 11/11/02