Rice University - Comp 212 - Intermediate Programming

Spring 2004

Lecture #30 - Design Patterns for Recursive Descent Parsing

Introduction

This lecture describes an important parsing technique called "recursive descent parsing". It leads you to the process of developing such a parser to parse scheme-like arithmetic expressions and build the corresponding abstract syntax trees. It makes use of the StreamTokenizer discussed in lecture 27 and the results from homework #3 on abstract syntax trees.

Click here for the paper on recursive descent parsing by D.X. Nguyen and S. B. Wonng.

Abstract Syntax Tree

We shall write a program to read in from a text file scheme-like arithmetic expressions in pre-fix format, such as (* 3 (+ 4 5)), and build the corresponding abstract syntax tree (AST):

    *
/    \
3      +
      /     \
    4        5

Once the AST is built, one can then traverse and evaluate it, as done in Homework #3 (download the solution)! We shall restrict ourselves to arithmetic expressions containing integers, strings (representing variables), and binary operations such as addition and multiplication. The syntax of the Scheme-like arithmetic expression can be expressed as a "grammar" with the following rules.

Scheme Arithmetic Expression Grammar Rules	Description
E :: Leaf \| BinOp	An expression E is either a Leaf or a BinOp. The notation is the left column cell is a short hand for two rules: E :: Leaf E:: BinOp
Leaf :: VarLeaf \| IntLeaf	A Leaf is either a VarLeaf or an IntLeaf There are actually two grammar rules here (see above cell).
VarLeaf :: id	A VarLeaf is an id, where id is a token representing a variable.
IntLeaf :: int	An IntLeaf is an int, where int is a token representing an integer.
BinOp :: AddOp \| MulOp	A BinOp is either an AddOp or a MulOp. There are actually two grammar rules here.
AddOp :: ( + E E )	An AddOp is a left parenthesis token, followed by a plus token, followed by an expression E, followed by another expression E, followed by a right parenthesis token.
MulOp :: ( * E E )	An MulOp is a left parenthesis token, followed by a star token, followed by an expression E, followed by another expression E, followed by a right parenthesis token.

An example for such an expression is: (* 34 (+ xyz 7)), where

( is the left parenthesis token
* is the star token
34 is an int token
+ is the plus token
xyz is an id token
) is the right parenthesis token.

In the above grammar rules,

E, Leaf, VarLeaf, IntLeaf, BinOp, AddOp, MulOp are called non-terminal symbols. They are non-terminal because they are defined in terms of other symbols.
In contrast, int, id, (, ), + , * are not defined in terms of any other symbols, and are called the terminal symbols, (or sometimes 'token').
E plays a special in "starting" the grammar rules, and is called the "start" symbol. Whenever we specify a grammar, we must specify what the start symbol is.

Context Free Grammar

The above grammar is an example of what is called an context free grammar (CFG). In general, a CFG is a quadruple (NT, T, R, S) where

NT is a finite set of symbols called non-terminal symbols,
T is a finite set of symbols called terminal symbols or tokens,
R is a finite set of "production (or grammar) rules" of the form X :: α, where X is a non-terminal and α is a sequence of zero or more non-terminals and/or terminals,
S is a non-terminal, called the "start" symbol of the grammar. By convention, we always list the grammar rules for S first.

Grammars are used to define the syntax of finite sequences of tokens. The set of all finite sequences of tokens that conform to the given grammar rules is called the language generated by the given grammar. Parsing a finite sequence of tokens according to a grammar means determining whether or not this token sequence (also called sentence) conforms to the syntax defined by the grammar. We shall implement a special parsing technique called "recursive descent parsing" (RDP) that works for a restricted set of grammars. The above grammar for prefix Scheme like expressions is one such grammar.

Recursive Descent Parsing

Here is again the class diagram of the abstract syntax tree that represents the Scheme-like arithmetic expression described by the above grammar. Note how each of the classes corresponds to a non-terminal symbol in the grammar.

The manufacturing of the above abstract syntax tree (AST) can be thought of a factory method, makeAST(), of some abstract factory, IASTFactory. How the AST is created is a variant as there are many ways to parse the input stream.

In RDP, we use a "tokenizer" to scan and "tokenize" the input from left to right, and build the AST from the top down, based on the value of the tokens. Building an AST from the top down means building it by first looking at the grammar rules for the start symbol E and try to create an AST that represent E.

As we look at the grammar rules for the start symbol E in the above, we can see that in order for RDP to make an E AST (makeAST), it needs to know how to make a Leaf AST (makeLeaf) and a BinOp AST (makeBinOp).

Looking at the rules for Leaf, we see that in order to make a Leaf AST, RDP needs to know how to make a VarLeaf (makeVarLeaf) and an IntLeaf (makeIntLeaf).
- Making VarLeaf and IntLeaf is easy and can be done directly.
Looking at the rules for BinOp, we see that in order to make a BinOp AST, RDP needs to know how to make an AddOp AST (makeAddOp) and a MulOp AST (makeMulOp).
Looking at the rules for AddOp and MulOp, we see that in order to make an AddOp AST and a MulOp AST, RDP needs to know how to make an E AST (makeAST), which comes from the rule for E. This completes the recursive construction.

Here is some pseudo-code.

makeAST():

asks the tokenizer for the next token t,
and then asks t to call the appropriate factory method
- the int token and the id token call makeLeaf()
- the left parenthesis token calls makeBinOp()
- all other tokens should flag an error!
- does the above "smell" like the visitor pattern to you or not? Who are the hosts and who are the visitors?

makeLeaf():

the int token calls makeIntLeaf().
the id token calls makeVarLeaf().
all other tokens should flag an error.

makeIntLeaf(): create the IntLeaf directly. To create an IntLeaf, we need to have the int token. Thus, makeIntLeaf() will need a token as a parameter.

makeVarLeaf(): create the VarLeaf directly. VarLeaf will need a token as parameter.

makeBinOp():

asks the tokenizer for the next token t,
then asks t to call the appropriate factory method
- the + token calls makeAddOp(),
- the * token calls makeMulOp()
- all other tokens should flag an error!

makeAddOp():

calls makeAST() to create the first E AST
calls makeAST() again to create the second E AST
asks the tokenizer for the next token t,
then asks t to create the Add AST:
- the right parenthesis token will instantiate an appropriate Add AST
- all other tokens should flag an error.

makeMulOp(): analogous to makeAddOp().

In short, a RDP

has a tokenizer;
has a method for each of the non-terminal symbols in the grammar; the method that corresponds to the start symbol initiates the parsing.

Usually, parsing only involves checking for the syntax of an input expression. However, in our case, we decide to build the abstract syntax tree directly as we parse.

public class RdpASTFactory implements IASTFactory {
    private ITokenizer _tokenizer:
    public RdpASTFactory(ITokenizer tokenizer) { _tokenizer = tokenizer;}
    public IAST makeAST() { // etc...}
    private IAST makeLeaf() { // etc...}
    // other (private) factory methods, etc.
}

Tokens and Tokenizer

The above pseudo-code suggests designing a union of token objects coupled with visitors, and an abstract tokenizer class that knows how to get the next token.

Union of Tokens

public abstract class AToken {
    protected String _lexeme;
    protected AToken(String lex) { _lexeme = lex; }
    public String toString() { return_lexeme;}
    public abstract Object execute(ATokVisitor v, Object inp);
}

public class IdToken extends AToken{
    // constructor, etc.
    public Object execute(ATokVisitor v, Object inp) {
        return v.idCase(this, inp);)
    }
}

Token visitors: the variant behaviors of the tokens are encapsulated in an abstract visitor class, as follows.

public abstract class ATokVisitor {
    /**
    * This is the case when the host is a concrete IdToken.
    * Throws an IllegalArgumentException as default behavior.
    * @param host a concrete IdToken.
    * @param inp the input needed for this ATokVisitor to perform its task.
    */
    protected abstract Object defaultCase(AToken host, Object inp) {
        // In many concrete cases, the code will look something like:
        // throw new IllegalArgumentException("Wrong token!");
    }
    public Object idCase(IdToken host, Object inp) {
        return defaultCase(host, inp);
    }
    // etc..., a method for each concrete token case.
}

One way to implement this tokenizer class is to wrap a StreamTokenizer object.

public interface ITokenizer (
public AToken getNextToken();
}

public class StreamTokWrapper implements ITokenizer {
private StreamTokenizer _st;
// constructor and methods, etc...
}

Dung X. Nguyen
revised 09/02/2004