COMP 405
Spring 2014

Parsing BNF

Home  Info  Owlspace  Java Resources  Eclipse Resources  SharePoint  Piazza

Pushing the Envelope!

(The code for these exercises can be branched from https://svn.rice.edu/r/comp405/course/ExtVisitor Parsing BNF incomplete)

In a previous lecture, we discussed out how extended visitors can be used to create a parsing architecture where the conversion from a grammar in BNF to factories that can create a visitor to parse that grammar lies only in the instantiation of a few different types of invariant factories. The difference between grammars lay only in the order in which the factories were strung together as they were being instantiated. This casts the entire parsing process in a new light and opens many new possibilities.

Note: This exercise constitutes unpublished research work. You have reached the forefront of OOP!

(This research project is called “YAK” which stands for “Yet Another Kooprey”, where Kooprey was a system built by Rice graduate student Mathias Ricken to parse BNF and automatically generate parsing code in the form of Java files, using our older recursive descent parsing system. A Kouprey is a Cambodian ox that is related to the Gnu and Yak. But YAK is also a play on YACC, “Yet Another Compiler Compiler”, which is the Unix parser generator utility.)

In particular, we have to ask the question, if everything is taking place at run-time, couldn’t we create a system that could dynamically create a parser, given the BNF for the grammar? After all, isn’t BNF a language and thus described by a grammar? That is, what if we could parse BNF itself?

If we can parse BNF itself, we could then take the parse tree that was created, analyze it and use that information to instantiate the parser for the grammar whose BNF we originally parsed!

Parsing an arbitrary grammar is a two-step process:

  1. Parse the BNF of the desired grammar:
  2. Parse the code written in the desired grammar.

 

Parsing BNF

To parse BNF, we need to start with a grammar for BNF. Here’s the BNF for BNF:

S   ::= D | S1
S1  ::= lf S
D   ::=  Id "::=" E L

L   ::=  L2 | Empty
L2  ::=  lf  L3
L3  ::=  L | D

E   ::=  T E1
E1  ::=  Empty | E1a
E1a ::=  "|" E

T   ::= T1 T2
T1  ::= Id| QuotedStringToken
T2  ::= Empty | T

Above, “lf” means linefeed since, unlike the examples from class, we will need to parse multiple lines of input here. The “QuotedStringToken” is necessary to distinguish between elements of a grammar vs. the elements of the BNF describing that grammar. Note, for instance, the quoted “|” on the 9’th line of the grammar above, where we need to distinguish between the “OR” symbol in the grammar being described (the quoted “|”) from the “OR” symbol in this BNF description (the unquoted “|”). Likewise we see the same issue for the assignment operator (“::=”).

One of the problems in processing any grammar is the presence of recursive loops in the definition. There are lots of different ways to detect loops but here, we will use a simple technique that, while potentially inefficient, is guaranteed to work:

The thing to note about loops in grammars is that fundamentally they can only ever involve non-terminal symbols (duh, right—“terminal” symbols terminate by definition!) So, since in general, all non-terminal symbols are suspect, what we will do is to first detect all the non-terminals in a grammar and then immediately create proxies for all of them. Then when we encountered a non-terminal while processing the right-hand side of any line of grammar (the left-hand side is the definition of a non-terminal), we simply always use the proxy. After all the non-terminals have been defined, we simply close all potential loops by setting the all the proxies. It may run a tad slower because we put proxies in places we didn’t have to, but it’s guaranteed to properly close all loops, which is a nice thing.

In general, the set of terminal symbols are simply the compliment of the set of non-terminal symbols, that is, if a symbol isn’t a non-terminal symbol then it is a terminal symbol. Here, since we know the grammar a priori, we can identify the terminal symbols manually: lf, assignment (quoted “::=”), OR (quoted “|”), Id, QuotedStringToken, and Empty.

Here’s our plan of attack:

  1. Create ITokVisitorFact instances of TerminalSymbolFact for all the terminal symbols. Note that for TerminalSymbolFact, the specified name in the constructor is given as a String, so for lf, the name is “lf”, for OR the name is “|”, and for assignment, the name is “::=”. The Empty token has its own factory class, MTSymbolFact whose constructor only takes the tokenizer as an input.
  2. Create ProxyFact instances for all the non-terminal symbols, i.e all symbols appearing on the left=hand side of the above grammar: S, S1, D, L, L2, L3, E, E1, E1a, T, T1, T2. It should be noted that S1, L2, L3, E1, E1a, T1 and T2 aren’t involved in loops, so it is actually unnecessary to make proxies for them. That is, when you encounter these symbols, you can simply make them as per their definitions right where they are being used.   In general however, as when we automate this process later, we can't make such statements and we are then forced to make proxies for all non-terminals.
  3. Create ITokVisitorFact instances for all the non-terminal symbols that have proxies. These will be either sequence or combination definitions. You can use either the binary, 2-symbol forms (SequenceFact or CombinationFact) or the unlimited-symbol, vararg input forms (MultiSequenceFact or MultiCombinationFact). The latter set is universally useful as it can be used wherever the former set can be used as well -- all they do is to create chains of binary compositions. Remember that these factories take factories in their constructors, so you will need to either supply the terminal symbol factories or the proxy factories or if the required factory is for a symbol not involved in a loop, the direct instantiation of that factory in terms of other factories. See the rest of the supplied code for examples of how to instantiate parsing visitor factories.
  4. Now that all the factories for the non-terminals have been instantiated, you can now close all the loops by calling the setFact() method on all the proxies passing in the appropriate non-terminal factory.
  5. Return the factory for the start symbol, the S symbol above.

EXERCISE 1: Complete the code for the model.ParserFactFactory.makeBNFParser() method that will parse a file containing BNF.

Examples of how to instantiate parser visitor factories can be seen in model.ParserFactFactory.makeOrigParser()and model.ParserFactFactory.makeXMLParser().

To run the demo:

 

Finding all the non-terminals

At this point, we are now switching to processing the parse tree that was produced by the parsing of the BNF. The parsing produces an IGrammarSymbol object corresponding to the start symbol. From class we saw that there are only a few types of grammar symbols:

It should be noted that IGrammarSymbol can execute two different types of visitors, one that is based on the value of the grammar symbol, i.e. as described above, and the other that is based on the type of the grammar symbol, that is the case index is "TerminalSymbol" for all TerminalSymbol instances,  "SequenceSymbol" for all SequenceSymbol instances, etc..   Since these two types of visitors are semantically quite different but unfortunately, to the compiler, syntactically indistinguishable,  IGrammarSymbol has two methods to execute visitors, execute() for the value-based visitors and typeExecute() for the type-based visitors.   (Yes, technically, it is possible to make the two type of visitors syntactically distinguishable -- it just haven't been done it yet.)

To find all the non-terminals in an expression, we have to walk through the grammar to find all the symbols on the left-hand side of each expression. Luckily, each line of a BNF grammar is expressed in a single symbol, “D”, in our BNF of BNF. Thus the problem reduces down to simply recording the first symbol in the D sequence for all D symbols encountered in the parse tree.

We will be writing the visitor, FindNonTerminalsAlgo by dividing the work into several pieces:

The constructor of FindNonTerminalsAlgo:

Here we need to install the main processing commands for the visitor using the addCmd method, which associates an index value (a String here) with a particular command (a IGramSymVisitorCmd here):

  1. First, we need to find the first D symbol. The start symbol, S, is a combination of D and S1 (that is, S is either D or S1), so we need to take into account the fact that we make first encounter an S1 symbol.
    1. S1 is just a sequence of a lf followed by an S, so to process an S1, we simply add a command that associated with the “S1” index value (a String). This command casts its host to SequenceSymbol (the host is typed to the superclass IGrammarSymbol) and then simply recurs the entire algorithm on the second symbol of the sequence, since the FindNonTerminalsAlgo is the algorithm to process the start symbol, S.

      To create a command that returns the proper Map object, use this template:

      new IGramSymVisitorCmd<Map<String, IGrammarSymbol>,Object>() {
      	public Map<String, IGrammarSymbol> apply(String idx, IGrammarSymbol host, Object... inps) {
      		// your code here
      	}
      }
      		
    2. To take care of the ability to handle a D symbol, we must remember that S is a combination, that is, it must combine the abilities of an S1 and D processing algorithms. We will break out the coding of the D symbol to its own section, so assume we have a variable called dAlgo that is a visitor to process a D symbol. I have added a pair of convenience methods to the extended visitors called getCmds and setCmds that are accessors to the set of commands in an extended visitor. getCmds actually returns a Set<Map.Entry<I,IExtVisitorCmd<R, I, P, H>>> which is a set of key-value pairs of the index and associated commands. setCmds takes this same structure as its input so the two methods can be used to do a wholesale transfer of commands from one visitor to another. That’s exactly what we need here because to combine the abilities of a D processing visitor with the FindNonTerminalsAlgo visitor, what we need to do is to install all the commands in the dAlgo visitor into the FindNonTerminalsAlgo visitor. That’s a lot of explanation of one very short line of code! (But remember this, because we will do more of this later.)

The dAlgo visitor

The skeleton of the dAlgo visitor has been included, so we need only concentrate on what its one command does.

The lAlgo visitor is created using a factory, AlgoMaker.Singleton, in a provided JAR file, /labs/LAlgoMaker.jar.   The following line of code is included in the constructor of FindNonTerminalsAlgo:

lAlgo = LAlgoMaker.Singleton.make(dAlgo);

Later, you will comment out this line of code and implement lAlgo yourself.

EXERCISE 2: Complete the code in the constructor and dAlgo visitor in the parser.visitor.bnf.FindNonTerminalsAlgo visitor.

At this point the “Check BNF” button in the demo will partially work. Go to the model.RDPModel.checkBNF() method and comment out the second half of that method, as indicated -- this will keep the system from trying to do operations you haven't yet coded.

EXERCISE 3: Implement the lAlgo visitor in the parser.visitor.bnf.FindNonTerminalsAlgo visitor.

Comment out the line in the FindNonTerminalsAlgo visitor that instantiates the lAlgo visitor:

lAlgo = LAlgoMaker.Singleton.make(dAlgo);

Then complete the field declaration of lAlgo with your own implementation of lAlgo.

Warning:  There is one issue that you need to be careful about because it could cause trouble depending on how you do your implementation.  Due to the recursive nature of the BNF grammar definition (the BNF of BNF), you may need to access the commands held by lAlgo while you are in the middle of defining those commands.   The problem is that those commands are not instantiated at that point, so your algorithm will fail if you try to access them.   There are a number of ways to get around this problem, but one cute albeit arguably not the fastest executing technique employs lazy evaluation that is transparently performed by an anonymous inner class (Hint: use anonymous inner classes everywhere and the solution will fall right out).    

Creating parser factories from a parse tree

It is beyond the scope of this exercise to develop the entire parser.visitor.bnf.MakeParserFactAlgo parser generator algorithm, so we will focus on some key portions of it. Fundamentally, the algorithm must walk through the BNF grammar and identify the rules of the grammar and then instantiate either a sequence or combination factory for to parse that rule. This algorithm assumes that all the non-terminals are already known, so it takes the Map<String, IGrammarSymbol> that FindNonTerminalsAlgo produces, as an input parameter.

Luckily, our BNF of BNF grammar isolates a rule in the D symbol. The first symbol in the D sequence is the non-terminal symbol that the rule defines, so its name is the name of the factory being made. The 3rd symbol in the sequence is the E symbol, which completely encapsulates the actual rule. The final L symbol just encapsulates the line feeds that lead up to and including the next rule.

An E symbol is a T symbol followed by an E1 symbol. The T symbol encapsulates a sequence, so we know that if we are processing a T symbol, we are constructing a sequence factory.

An E1 symbol is the OR operator (“|”) followed by another E symbol. Thus if we find ourselves processing an E1 symbol, we know that we need to make a combination factory the combines the previous E symbol with the following E symbol.

If we encounter a symbol that is not a non-terminal, then it must either be a terminal symbol or an Empty symbol. In any case, we can easily make the corresponding parser visitor factory. This determination only takes place inside the processing of a T symbol.

Let’s break this problem down into smaller pieces as before.

Constructor of MakeParserFactAlgo

As in the FindNonTerminalsAlgo, the first thing that this algorithm must do is deal with is the D | S1 combination. That means the code here will be exactly the same as in FindNonTerminalsAlgo, except for one addition: As mentioned before, the safe method of insuring that all loops in the grammar are properly closed, we need to make proxies for all the non-terminal symbols. The Map object given as the input parameter is a mapping from a String description to an IGrammarSymbol, so we need to make ProxyFact instances corresponding to all the String descriptions in the Map. In order to store all this information, another Map object is in order, which is a field called “proxies”. One can obtain a Set<String> of the keys in the non-terminals Map object through its keySet() method. So all that is needed is to loop over that set and store each key with a corresponding new instance of ProxyFact in the proxies Map.

The dAlgo visitor

Once again, the D symbol is the main focus. There are a number of things to do here:

  1. Get a reference to the symbol that this D symbol is defining (since the D symbol is a grammar rule).
    1. This symbol is the Id which is the first symbol in the sequence. This symbol will be one of the non-terminal symbols that were found earlier by FindNonTerminalsAlgo and hence is one of the symbols that has a proxy associated with it.
    2. We will need the String representation of this symbol later.
    3. Remember that the host, though typed to IGrammarSymbol, is really a SequenceSymbol because it is the D symbol.
  2. Get a reference to the E symbol that is the 3rd symbol in the D sequence (i.e. n=2). Remember that the getNthInSequence method that this class also provides, returns the SequenceSymbol for any request that is not at the end of the sequence. The desired symbol is thus the first symbol of the returned sequence.
    1. The E symbol is the actual definition of the corresponding to the non-terminal symbol found in previous section above. That is, the processing of the E symbol will return the factory that we want to associate with the symbol found in the previous section.
    2. Thus we want to first delegate to another visitor, eAlgo, to process the E symbol and return its factory.
    3. Now that we have the factory for E symbol, we can close any loops associated with it. We do so by retrieving the proxy associated with the String representation of the non-terminal symbol found in previous section above. We then call the proxy’s setFact method to close the loop.
    4. The factory that processes E, is in fact, the return value of processing D, and thus the return value for the entire MakeParserFactAlgo.
  3.  But before the visitor returns, the L symbol must be processed, just as in FindNonTerminalsAlgo. Do we need the return value from processing the L symbol? Will all the factories created from processing the rest of the rules in the grammar be lost? That question is left up to you.

The processing the E symbol and subsequently the T symbol are fairly involved, but follow essentially the same lines as we done so far. Their code has been provided. Some interesting points to notice about eAlgo and tAlgo are

It should be no surprise that the lAlgo in MakeParserFactAlgo is exactly the same as the lAlgo in FindNonTerminalsAlgo.

At this point the demo should be fully operational. Be sure to uncomment the section in the  model.RDPModel.checkBNF() method that was commented out in the previous exercise.

Run the demo:

(For the following, refer to the demo instructions right after the first exercise to find what files are associated with what grammars.)

  1. In the left-hand text field (File Input 1), enter a file containing the grammar you wish to parse.
  2. Click the “Parse BNF” file to create the parse tree for that grammar.
  3. In the right-hand text field (File Input 2) enter the name of a file containing text corresponding to the grammar file for which you created the parse tree.
  4. In the drop list, select the tokenizer corresponding to the grammar you parsed, e.g. Tokenizer1 for the grammars in bnf1.txt and bnf2.txt, XMLTokenizer for the grammar in bnfxml1.txt and BNFTokenizer for the grammar in bnfbnf1.txt.
  5. Click the “Check BNF” button

EXERCISE 4: Complete the code in the constructor and the dAlgo command in parser.visitor.bnf.MakeParserFactAlgo visitor.

More cutting edge stuff!

There is a 4th tokenizer that is available in the demo, RegexBNFTokenizer, that can be used to tokenize BNF grammar files.  It differs from the other tokenizers in a couple of ways:

To use the RegexBNFTokenizer tokenizer, first specify the bnfbnf3.txt file as the grammar file (File Input 1).   This file contains the definitions of the special tokens that RegexBNFTokenizer needs.   Parse this as normal using the "Parse BNF" button, which interestingly, is still using the BNFTokenizer.  To test the parser that can be made from this grammar,  select the RegexBNFTokenizer tokenizer and then type in the name of any BNF file (File Input 2).     Click "Check BNF" to see that a parser for BNF can be generated.   And yes, it is capably of parsing bnfbnf3.txt as well.

Big Questions

Our demo shows that by using an invariant algorithm, namely MakeParserFactAlgo, there exist grammars that can be used to dynamically generate parsers that can parse the very grammar text that was used to generate the parser in the first place. WHY???   What are the requirements of such grammars?  What is the required translation of special token name -> grammar token (this includes the removal of the quote marks) done by the tokenizer saying about this whole process?   

 


© 2013 by Stephen Wong