Comp405: Parsing BNF

COMP 405 Spring 2014	Parsing BNF

Pushing the Envelope!

(The code for these exercises can be branched from https://svn.rice.edu/r/comp405/course/ExtVisitor Parsing BNF incomplete)

In a previous lecture, we discussed out how extended visitors can be used to create a parsing architecture where the conversion from a grammar in BNF to factories that can create a visitor to parse that grammar lies only in the instantiation of a few different types of invariant factories. The difference between grammars lay only in the order in which the factories were strung together as they were being instantiated. This casts the entire parsing process in a new light and opens many new possibilities.

Note: This exercise constitutes unpublished research work. You have reached the forefront of OOP!

(This research project is called “YAK” which stands for “Yet Another Kooprey”, where Kooprey was a system built by Rice graduate student Mathias Ricken to parse BNF and automatically generate parsing code in the form of Java files, using our older recursive descent parsing system. A Kouprey is a Cambodian ox that is related to the Gnu and Yak. But YAK is also a play on YACC, “Yet Another Compiler Compiler”, which is the Unix parser generator utility.)

In particular, we have to ask the question, if everything is taking place at run-time, couldn’t we create a system that could dynamically create a parser, given the BNF for the grammar? After all, isn’t BNF a language and thus described by a grammar? That is, what if we could parse BNF itself?

If we can parse BNF itself, we could then take the parse tree that was created, analyze it and use that information to instantiate the parser for the grammar whose BNF we originally parsed!

Parsing an arbitrary grammar is a two-step process:

Parse the BNF of the desired grammar:
- Collect the terminal symbols in the desired grammar -- this can be done as the BNF is parsed.
- Collect the non-terminal symbols in the desired grammar -- this can be done by processing the parsed BNF.
Parse the code written in the desired grammar.

Parsing BNF

To parse BNF, we need to start with a grammar for BNF. Here’s the BNF for BNF:

S   ::= D | S1
S1  ::= lf S
D   ::=  Id "::=" E L

L   ::=  L2 | Empty
L2  ::=  lf  L3
L3  ::=  L | D

E   ::=  T E1
E1  ::=  Empty | E1a
E1a ::=  "|" E

T   ::= T1 T2
T1  ::= Id| QuotedStringToken
T2  ::= Empty | T

Above, “lf” means linefeed since, unlike the examples from class, we will need to parse multiple lines of input here. The “QuotedStringToken” is necessary to distinguish between elements of a grammar vs. the elements of the BNF describing that grammar. Note, for instance, the quoted “|” on the 9’th line of the grammar above, where we need to distinguish between the “OR” symbol in the grammar being described (the quoted “|”) from the “OR” symbol in this BNF description (the unquoted “|”). Likewise we see the same issue for the assignment operator (“::=”).

One of the problems in processing any grammar is the presence of recursive loops in the definition. There are lots of different ways to detect loops but here, we will use a simple technique that, while potentially inefficient, is guaranteed to work:

The thing to note about loops in grammars is that fundamentally they can only ever involve non-terminal symbols (duh, right—“terminal” symbols terminate by definition!) So, since in general, all non-terminal symbols are suspect, what we will do is to first detect all the non-terminals in a grammar and then immediately create proxies for all of them. Then when we encountered a non-terminal while processing the right-hand side of any line of grammar (the left-hand side is the definition of a non-terminal), we simply always use the proxy. After all the non-terminals have been defined, we simply close all potential loops by setting the all the proxies. It may run a tad slower because we put proxies in places we didn’t have to, but it’s guaranteed to properly close all loops, which is a nice thing.

In general, the set of terminal symbols are simply the compliment of the set of non-terminal symbols, that is, if a symbol isn’t a non-terminal symbol then it is a terminal symbol. Here, since we know the grammar a priori, we can identify the terminal symbols manually: lf, assignment (quoted “::=”), OR (quoted “|”), Id, QuotedStringToken, and Empty.

Here’s our plan of attack:

Create ITokVisitorFact instances of TerminalSymbolFactfor all the terminal symbols. Note that for TerminalSymbolFact, the specified name in the constructor is given as a String, so for lf, the name is “lf”, for OR the name is “|”, and for assignment, the name is “::=”. The Empty token has its own factory class, MTSymbolFact whose constructor only takes the tokenizer as an input.
Create ProxyFact instances for all the non-terminal symbols, i.e all symbols appearing on the left=hand side of the above grammar: S, S1, D, L, L2, L3, E, E1, E1a, T, T1, T2. It should be noted that S1, L2, L3, E1, E1a, T1 and T2 aren’t involved in loops, so it is actually unnecessary to make proxies for them. That is, when you encounter these symbols, you can simply make them as per their definitions right where they are being used. In general however, as when we automate this process later, we can't make such statements and we are then forced to make proxies for all non-terminals.
Create ITokVisitorFact instances for all the non-terminal symbols that have proxies. These will be either sequence or combination definitions. You can use either the binary, 2-symbol forms (SequenceFact or CombinationFact) or the unlimited-symbol, vararg input forms (MultiSequenceFact or MultiCombinationFact). The latter set is universally useful as it can be used wherever the former set can be used as well -- all they do is to create chains of binary compositions. Remember that these factories take factories in their constructors, so you will need to either supply the terminal symbol factories or the proxy factories or if the required factory is for a symbol not involved in a loop, the direct instantiation of that factory in terms of other factories. See the rest of the supplied code for examples of how to instantiate parsing visitor factories.
Now that all the factories for the non-terminals have been instantiated, you can now close all the loops by calling the setFact() method on all the proxies passing in the appropriate non-terminal factory.
Return the factory for the start symbol, the S symbol above.

EXERCISE 1: Complete the code for the `model.ParserFactFactory.makeBNFParser()` method that will parse a file containing BNF.

Examples of how to instantiate parser visitor factories can be seen in model.ParserFactFactory.makeOrigParser()and model.ParserFactFactory.makeXMLParser().

To run the demo:

The left hand text field is the name of the first file to be parsed.
- The "Parse Orig" button will parse a file as per a simple, hardwired grammar. The grammar that is used is described in bnf1.txt and equivalently, in bnf2.txt.
  - The test files inp1.txt, inp2.txt, inp3.txt and input1.txt files contain valid grammar examples.
  - The test files bad1.txt, bad2.txt and bad3.txt contain invalid grammar examples.
- The "Parse XML" button will parse XML files as per a hardwired grammar. The input files are labeled with "xml". The grammar for the XML is in bnfxml1.txt.
  - Valid grammar examples can be found in the files xml0.txt, xml1.txt, xml2.txt, xml3.txt, xml4.txt and xml5.txt.
  - Invalid grammar examples can be found in the files badxml1.txt, badxml2.txt and badxml3.txt
- The "Check XML" button will run a visitor on the resultant parse tree to check that all tags are matching. Works only on XML files obviously.
  - Files that have a valid grammar but whose tags do not match areincorrect_xml0.txt, incorrect_xml1.txt, incorrect_xml2.txt and incorrect_xml4.txt
- The "Parse BNF" will parse a BNF grammar file as per your hardwired grammar. The grammar used is given in bnfbnf1.txt. The files that can be parsed are bnf1.txt, bnf2.txt, bnfxml1.txt and bnfbnf1.txt. That is, you should be able to parse the grammars associated with the test grammar, the XML grammar and the BNF grammar itself!
The “Parse BNF” will print out the parser factory for the BNF grammar followed by the parsed input file. It should look the same as the text in the BNF file, even though the screen is actually showing the Stringrepresentation (.toString()) of the parse tree (the IGrammarSymbol for the start symbol)
We will work on the functionality of the rest of the buttons in the next section.

Finding all the non-terminals

At this point, we are now switching to processing the parse tree that was produced by the parsing of the BNF. The parsing produces an IGrammarSymbolobject corresponding to the start symbol. From class we saw that there are only a few types of grammar symbols:

MTSymbol – represents the empty symbol. Executes IGramSymVisitorvisitors with the case index = “MTSymbol”.
TerminalSymbol – represents a terminal symbol. Executes IGramSymVisitorvisitors with the case index = name of the terminal symbol.
SequenceSymbol – represents a binary sequence of symbols. Has two methods, getSymbol1() and getSymbol2() for retrieving the first and second symbols respectively. The MultiSequenceFact produces a parsing visitor that creates a string of SequenceSymbols to represent a sequence longer than 2 symbols (i.e. the second symbol is another SequenceSymbol). Executes IGramSymVisitor visitors with the case index = name of the sequence symbol.

It should be noted that IGrammarSymbol can execute two different types of visitors, one that is based on the value of the grammar symbol, i.e. as described above, and the other that is based on the type of the grammar symbol, that is the case index is "TerminalSymbol" for all TerminalSymbol instances, "SequenceSymbol" for all SequenceSymbolinstances, etc.. Since these two types of visitors are semantically quite different but unfortunately, to the compiler, syntactically indistinguishable, IGrammarSymbol has two methods to execute visitors, execute() for the value-based visitors and typeExecute() for the type-based visitors. (Yes, technically, it is possible to make the two type of visitors syntactically distinguishable -- it just haven't been done it yet.)

To find all the non-terminals in an expression, we have to walk through the grammar to find all the symbols on the left-hand side of each expression. Luckily, each line of a BNF grammar is expressed in a single symbol, “D”, in our BNF of BNF. Thus the problem reduces down to simply recording the first symbol in the D sequence for all D symbols encountered in the parse tree.

We will be writing the visitor, FindNonTerminalsAlgo by dividing the work into several pieces:

The constructor of `FindNonTerminalsAlgo`:

Here we need to install the main processing commands for the visitor using the addCmd method, which associates an index value (a Stringhere) with a particular command (a IGramSymVisitorCmd here):

First, we need to find the first D symbol. The start symbol, S, is a combination of D and S1 (that is, S is either D or S1), so we need to take into account the fact that we make first encounter an S1 symbol.
1. S1 is just a sequence of a lf followed by an S, so to process an S1, we simply add a command that associated with the “S1” index value (a String). This command casts its host to SequenceSymbol(the host is typed to the superclass IGrammarSymbol) and then simply recurs the entire algorithm on the second symbol of the sequence, since the FindNonTerminalsAlgo is the algorithm to process the start symbol, S.
  
  To create a command that returns the proper Map object, use this template:
```
new IGramSymVisitorCmd<Map<String, IGrammarSymbol>,Object>() {
	public Map<String, IGrammarSymbol> apply(String idx, IGrammarSymbol host, Object... inps) {
		// your code here
	}
}
		
```
2. To take care of the ability to handle a D symbol, we must remember that S is a combination, that is, it must combine the abilities of an S1and D processing algorithms. We will break out the coding of the Dsymbol to its own section, so assume we have a variable called dAlgothat is a visitor to process a D symbol. I have added a pair of convenience methods to the extended visitors called getCmds and setCmdsthat are accessors to the set of commands in an extended visitor. getCmds actually returns a Set<Map.Entry<I,IExtVisitorCmd<R, I, P, H>>> which is a set of key-value pairs of the index and associated commands. setCmds takes this same structure as its input so the two methods can be used to do a wholesale transfer of commands from one visitor to another. That’s exactly what we need here because to combine the abilities of a Dprocessing visitor with the FindNonTerminalsAlgo visitor, what we need to do is to install all the commands in the dAlgo visitor into the FindNonTerminalsAlgo visitor. That’s a lot of explanation of one very short line of code! (But remember this, because we will do more of this later.)

The `dAlgo` visitor

The skeleton of the dAlgo visitor has been included, so we need only concentrate on what its one command does.

The first thing to note is that a D symbol is sequence, whose first symbol is the non-terminal symbol we are looking for! Get the first symbol from the sequence and save it in the Map object that we will return, which is a field called “result”. The key value to use is the string representation of the symbol you are saving (toString), which is the name of the token associated with that symbol, i.e. its lexeme. This is not the same as the name of the symbol.
The rest of the lines of the grammar are buried in the L symbol at the end of the D symbol’s sequence. The L symbol is recursively defined to include the D symbol. You can either manually get that symbol from the sequence of sequences by repeatedly calling getSymbol2(), or you can use the little utility method provided called getNthInSequence()which will return you whatever symbol you want from a sequence. Here, we want n=3. Since the result depends on L, simply delegate to it using the visitor referenced by the lAlgo field.

The lAlgo visitor is created using a factory, AlgoMaker.Singleton, in a provided JAR file, /labs/LAlgoMaker.jar. The following line of code is included in the constructor of FindNonTerminalsAlgo:

lAlgo = LAlgoMaker.Singleton.make(dAlgo);

Later, you will comment out this line of code and implement lAlgo yourself.

EXERCISE 2: Complete the code in the constructor and `dAlgo` visitor in the `parser.visitor.bnf.FindNonTerminalsAlgo` visitor.

At this point the “Check BNF” button in the demo will partially work. Go to the model.RDPModel.checkBNF() method and comment out the second half of that method, as indicated -- this will keep the system from trying to do operations you haven't yet coded.

First parse a BNF grammar file as was done in the previous section. The demo will save the resultant parse tree.
When you click on “Check BNF” a listing of the non-terminals found in the parsed BNF file should appear in the output text area.

EXERCISE 3: Implement the `lAlgo` visitor in the `parser.visitor.bnf.FindNonTerminalsAlgo` visitor.

Comment out the line in the FindNonTerminalsAlgovisitor that instantiates the lAlgo visitor:

lAlgo = LAlgoMaker.Singleton.make(dAlgo);

Then complete the field declaration of lAlgo with your own implementation of lAlgo.

Warning: There is one issue that you need to be careful about because it could cause trouble depending on how you do your implementation. Due to the recursive nature of the BNF grammar definition (the BNF of BNF), you may need to access the commands held by lAlgo while you are in the middle of defining those commands. The problem is that those commands are not instantiated at that point, so your algorithm will fail if you try to access them. There are a number of ways to get around this problem, but one cute albeit arguably not the fastest executing technique employs lazy evaluation that is transparently performed by an anonymous inner class (Hint: use anonymous inner classes everywhere and the solution will fall right out).

Creating parser factories from a parse tree

It is beyond the scope of this exercise to develop the entire parser.visitor.bnf.MakeParserFactAlgo parser generator algorithm, so we will focus on some key portions of it. Fundamentally, the algorithm must walk through the BNF grammar and identify the rules of the grammar and then instantiate either a sequence or combination factory for to parse that rule. This algorithm assumes that all the non-terminals are already known, so it takes the Map<String, IGrammarSymbol> that FindNonTerminalsAlgoproduces, as an input parameter.

Luckily, our BNF of BNF grammar isolates a rule in the D symbol. The first symbol in the D sequence is the non-terminal symbol that the rule defines, so its name is the name of the factory being made. The 3rd symbol in the sequence is the E symbol, which completely encapsulates the actual rule. The final L symbol just encapsulates the line feeds that lead up to and including the next rule.

An E symbol is a T symbol followed by an E1symbol. The T symbol encapsulates a sequence, so we know that if we are processing a T symbol, we are constructing a sequence factory.

An E1 symbol is the OR operator (“|”) followed by another E symbol. Thus if we find ourselves processing an E1 symbol, we know that we need to make a combination factory the combines the previous E symbol with the following E symbol.

If we encounter a symbol that is not a non-terminal, then it must either be a terminal symbol or an Empty symbol. In any case, we can easily make the corresponding parser visitor factory. This determination only takes place inside the processing of a T symbol.

Let’s break this problem down into smaller pieces as before.

Constructor of `MakeParserFactAlgo`

As in the FindNonTerminalsAlgo, the first thing that this algorithm must do is deal with is the D | S1 combination. That means the code here will be exactly the same as in FindNonTerminalsAlgo, except for one addition: As mentioned before, the safe method of insuring that all loops in the grammar are properly closed, we need to make proxies for all the non-terminal symbols. The Map object given as the input parameter is a mapping from a String description to an IGrammarSymbol, so we need to make ProxyFact instances corresponding to all the Stringdescriptions in the Map. In order to store all this information, another Map object is in order, which is a field called “proxies”. One can obtain a Set<String> of the keys in the non-terminals Mapobject through its keySet() method. So all that is needed is to loop over that set and store each key with a corresponding new instance of ProxyFactin the proxies Map.

The `dAlgo` visitor

Once again, the D symbol is the main focus. There are a number of things to do here:

Get a reference to the symbol that this D symbol is defining (since the D symbol is a grammar rule).
1. This symbol is the Id which is the first symbol in the sequence. This symbol will be one of the non-terminal symbols that were found earlier by FindNonTerminalsAlgo and hence is one of the symbols that has a proxy associated with it.
2. We will need the String representation of this symbol later.
3. Remember that the host, though typed to IGrammarSymbol, is really a SequenceSymbol because it is the D symbol.
Get a reference to the E symbol that is the 3rd symbol in the Dsequence (i.e. n=2). Remember that the getNthInSequencemethod that this class also provides, returns the SequenceSymbol for any request that is not at the end of the sequence. The desired symbol is thus the first symbol of the returned sequence.
1. The E symbol is the actual definition of the corresponding to the non-terminal symbol found in previous section above. That is, the processing of the E symbol will return the factory that we want to associate with the symbol found in the previous section.
2. Thus we want to first delegate to another visitor, eAlgo, to process the E symbol and return its factory.
3. Now that we have the factory for E symbol, we can close any loops associated with it. We do so by retrieving the proxy associated with the String representation of the non-terminal symbol found in previous section above. We then call the proxy’s setFact method to close the loop.
4. The factory that processes E, is in fact, the return value of processing D, and thus the return value for the entire MakeParserFactAlgo.
But before the visitor returns, the L symbol must be processed, just as in FindNonTerminalsAlgo. Do we need the return value from processing the L symbol? Will all the factories created from processing the rest of the rules in the grammar be lost? That question is left up to you.

The processing the E symbol and subsequently the T symbol are fairly involved, but follow essentially the same lines as we done so far. Their code has been provided. Some interesting points to notice about eAlgoand tAlgo are

Since T represents a sequence, tAlgo always returns a SequenceFact.
Since E is a T followed by either an Emptyor an E1a, then E is a sequence factory if we have T-then-Empty, but a combination factory if we have T-then-E1a since E1a represents the OR symbol followed by an E.
eAlgo thus uses a forward accumulation process to delegate to the next symbol, either the Empty or E1a to determine whether to return a sequence factory (i.e. the result of processing T) or a combination factory (the combination of T and the subsequent Ein the E1a).

It should be no surprise that the lAlgo in MakeParserFactAlgo is exactly the same as the lAlgo in FindNonTerminalsAlgo.

At this point the demo should be fully operational. Be sure to uncomment the section in the model.RDPModel.checkBNF() method that was commented out in the previous exercise.

Run the demo:

(For the following, refer to the demo instructions right after the first exercise to find what files are associated with what grammars.)

In the left-hand text field (File Input 1), enter a file containing the grammar you wish to parse.
Click the “Parse BNF” file to create the parse tree for that grammar.
In the right-hand text field (File Input 2) enter the name of a file containing text corresponding to the grammar file for which you created the parse tree.
In the drop list, select the tokenizer corresponding to the grammar you parsed, e.g. Tokenizer1 for the grammars in bnf1.txt and bnf2.txt, XMLTokenizer for the grammar in bnfxml1.txt and BNFTokenizer for the grammar in bnfbnf1.txt.
Click the “Check BNF” button
- First, the list of non-terminal symbols should appear
- Then a printout of the parser factory that was generated. Note that some proxies may show null for their proxied symbol because their toString methods are being run before the loops are closed.
- The result of parsing the input file will then be shown.
  - The effect of choosing the bnf1.txt or bnf2.txt grammars and any of the corresponding test files should be the same as putting those same test files in the left-hand text field (File Input 1) and clicking the “Parse Orginial” button. That is, you can dynamically create the same parser as is hardwired for the “Parse Original” button.
  - Likewise, the same can be shown for the XML grammar and test files.
- Try parsing the grammar defined by bnfbnf1.txt, which the grammar of BNF itself and the grammar used by the "Parse BNF" button. Clicking the "Check BNF" button should therefore generate a parser for BNF, correct? This dynamically generated parser should therefore be able to parse the file that generated it, yes? Try it! (Select the BNFTokenizer and File Input 2 tobnfbnf1.txt.
- Until just two days ago (1/28/2013) however the system will be unable to dynamically generate a parser for BNF itself. That is, typing bnfbnf1.txt as both the input grammar and the file to be parsed (plus the BNFTokenizer) always resulted in a “Unknown Token” exception. The solution was a very simple removal of the quote marks in the lexeme stored by a QuotedStringToken. It is not clear yet exactly what this means; why did that simple change enable the system to self-generate? What are the underlying principles driving what we are seeing? The answers lie in more research into understanding the relationship and interaction between the parser and the tokenizer. Our current research indicates that the parser and the tokenizer are abstractly equivalent and serve to divide a higher order parsing problem down into two disconnected pieces of lower dimensionality, but we're not sure exactly what that means in a larger scope. Welcome to the edge of knowledge in computer science!

EXERCISE 4: Complete the code in the constructor and the `dAlgo` command in `parser.visitor.bnf.MakeParserFactAlgo` visitor.

More cutting edge stuff!

There is a 4th tokenizer that is available in the demo, RegexBNFTokenizer, that can be used to tokenize BNF grammar files. It differs from the other tokenizers in a couple of ways:

Instead of taking a list of keywards to define the special tokens of a language, RegexBNFTokenizer takes a specially formatted file containing regular expressions. This gives this tokenizer much more flexibility in defining the language but at the expense of a significantly more complicated specification format.
The BNF grammar used to describe BNF itself uses special names that subsitute for the BNF-specific tokens, e.g. "::=" and "|". Arguably, the BNFTokenizer also does this by requiring that those two tokens be surrounded by quote marks, though BNFTokenizer does this in a more generally applicable way than RegexBNFTokenizer because this behavior is not unique to BNFTokenizer; all the other tokenizers do this because the quote mark behavior (i.e. the QuotedStringToken's lexeme value does not include the quote marks) comes from their common superclass, ATokenizer and is thus not unique to any of its subclasses.

To use the RegexBNFTokenizer tokenizer, first specify the bnfbnf3.txt file as the grammar file (File Input 1). This file contains the definitions of the special tokens that RegexBNFTokenizer needs. Parse this as normal using the "Parse BNF" button, which interestingly, is still using the BNFTokenizer. To test the parser that can be made from this grammar, select the RegexBNFTokenizer tokenizer and then type in the name of any BNF file (File Input 2). Click "Check BNF" to see that a parser for BNF can be generated. And yes, it is capably of parsing bnfbnf3.txt as well.

Big Questions

Our demo shows that by using an invariant algorithm, namely MakeParserFactAlgo, there exist grammars that can be used to dynamically generate parsers that can parse the very grammar text that was used to generate the parser in the first place. WHY??? What are the requirements of such grammars? What is the required translation of special token name -> grammar token (this includes the removal of the quote marks) done by the tokenizer saying about this whole process?

COMP 405 Spring 2014

Parsing BNF

Pushing the Envelope!

Parsing BNF

EXERCISE 1: Complete the code for the model.ParserFactFactory.makeBNFParser() method that will parse a file containing BNF.

Finding all the non-terminals

The constructor of FindNonTerminalsAlgo:

The dAlgo visitor

EXERCISE 2: Complete the code in the constructor and dAlgo visitor in the parser.visitor.bnf.FindNonTerminalsAlgo visitor.

EXERCISE 3: Implement the lAlgo visitor in the parser.visitor.bnf.FindNonTerminalsAlgo visitor.

Creating parser factories from a parse tree

Constructor of MakeParserFactAlgo

The dAlgo visitor

EXERCISE 4: Complete the code in the constructor and the dAlgo command in parser.visitor.bnf.MakeParserFactAlgo visitor.

More cutting edge stuff!

COMP 405
Spring 2014

EXERCISE 1: Complete the code for the `model.ParserFactFactory.makeBNFParser()` method that will parse a file containing BNF.

The constructor of `FindNonTerminalsAlgo`:

The `dAlgo` visitor

EXERCISE 2: Complete the code in the constructor and `dAlgo` visitor in the `parser.visitor.bnf.FindNonTerminalsAlgo` visitor.

EXERCISE 3: Implement the `lAlgo` visitor in the `parser.visitor.bnf.FindNonTerminalsAlgo` visitor.

Constructor of `MakeParserFactAlgo`

The `dAlgo` visitor

EXERCISE 4: Complete the code in the constructor and the `dAlgo` command in `parser.visitor.bnf.MakeParserFactAlgo` visitor.