Matching Parentheses

In this example, we will build a parser that builds trees out of matching parentheses. It will accept any string with correctly matched parentheses, such as (((((b))))) or a(b(cde)d(1)(f))d(fg(2(f))).
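As a quick reference for what "correctly matched" means, here is a minimal sketch of a depth-counting check. The helper name hasMatchedParens is hypothetical and is not part of the parser library. It only answers yes or no, whereas the parser we build also produces a tree. It also treats an empty string as matched, which the grammar below does not.

```javascript
// Hypothetical helper, not part of the parser library: checks that every
// "(" has a matching ")" by tracking the current nesting depth.
function hasMatchedParens(input) {
	var depth = 0;
	for (var i = 0; i < input.length; i++) {
		var ch = input[i];
		if (ch === "(") {
			depth++;
		} else if (ch === ")") {
			depth--;
			// A ")" with no open "(" before it can never be matched later.
			if (depth < 0) return false;
		}
	}
	// Every "(" must have been closed by the end of the input.
	return depth === 0;
}

console.log(hasMatchedParens("(((((b)))))")); // true
console.log(hasMatchedParens("a(b(cde)d(1)(f))d(fg(2(f)))")); // true
console.log(hasMatchedParens("(a(b)")); // false
```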


Creating the Parser Description

We'll start with a basic parser description, then make some improvements as we go along.

var parserDescription = {
	"symbols":{
		"(": { "terminal":true, "match":"\\(" },
		")": { "terminal":true, "match":"\\)" },
		"chars": { "terminal":true, "match":"[^\\(\\)]+" }
	},
	"productions":{
		"S":[
			[ "(", "S", ")" ],
			[ "chars" ],
		]
	},
	"startSymbols": [ "S" ]
};
The grammar is pretty simple: it groups up the parentheses and allows some characters in the middle.
A few things of note here. The value of match when describing symbols is a regular expression string. Because it is written as a JavaScript string, regex metacharacters need a doubled backslash: "\\(" matches a literal (.
Also notice the chars symbol: its match string is a regular expression that matches any run of characters other than parentheses. The match strings should be thought of as the lexing step of parsing: they take the input string and break it into tokens, which are then fed into the grammar. For the input (abc), the lexing step causes the parser to receive the stream "(" -> chars -> ")". The "abc" value is also preserved in the token so it can be accessed later.
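To make the lexing step concrete, here is a rough sketch of a tokenizer driven by the same match strings. The tokenize function is hypothetical and the library's real lexer may work differently. The { type, value } token shape is assumed from the parser output shown later in this example.

```javascript
// Illustrative lexer sketch; the library's real lexer may work differently.
// Symbol names and match strings are copied from the parser description.
var symbols = {
	"(": { "terminal": true, "match": "\\(" },
	")": { "terminal": true, "match": "\\)" },
	"chars": { "terminal": true, "match": "[^\\(\\)]+" }
};

function tokenize(input) {
	var tokens = [];
	var pos = 0;
	while (pos < input.length) {
		var matched = false;
		for (var name in symbols) {
			// The sticky flag "y" only allows a match starting at lastIndex.
			var re = new RegExp(symbols[name].match, "y");
			re.lastIndex = pos;
			var m = re.exec(input);
			if (m) {
				// Keep the matched text so it can be accessed later.
				tokens.push({ type: name, value: m[0] });
				pos += m[0].length;
				matched = true;
				break;
			}
		}
		if (!matched) throw new Error("No symbol matches at position " + pos);
	}
	return tokens;
}

console.log(tokenize("(abc)"));
```

For the input (abc), this yields three tokens of type "(", chars and ")", with the chars token carrying the value "abc".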

Create the rest of the parser

As in the basic example, we will create a parser and give it our parser description:

var Parser = require("../../lib/index").Parser.LRParser;

// Create the parser
var parser = Parser.CreateWithLexer(parserDescription);

parser.on("accept", function(token_stack){
	console.log("Parser Accept:", require('util').inspect(token_stack, true, 1000));
});

parser.on("error", function(error){
	console.log("Parse Error: ", error.message);
	throw error.message;
});

Try it out

Try out the parser on a few different inputs and you'll see that it only works in pretty basic cases.

var input = "((((bfa))))";
parser.append(input);
parser.end();
In that case it works, but other simple variations don't: (a(b)), () and a(b(c(d))) all fail with this grammar. We can pick up the empty parentheses case by adding a simple rule. Modify the S production to look like this:
"S":[
	[ "(", "S", ")" ],
	[ "(", ")" ],
	[ "chars" ]
]
As you can see, we added the ["(", ")"] rule to our grammar. This will allow us to accept inputs like () or ((((())))).
Note that the order of rules in the grammar dictates the order in which they'll be processed, so the order is significant.
But what about the more complicated inputs? The intro promised a(b(cde)d(1)(f))d(fg(2(f))), and in this step we'll add the necessary rule to accomplish this. Consider the BNF representation of our current grammar:
S -> ( S )
S -> ( )
S -> chars
If you follow through the grammar, it's clear that characters can only appear at the innermost level of nesting, which rules out even basic strings like (a(b)). We also need to be careful because this is an LR-based parser, and there are limits on the way grammars can be specified.
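One way to convince yourself of this is a tiny hand-written recognizer for the three rules above. This is only a sketch: the matchesS function is hypothetical, works directly on the string rather than on tokens, and does not use the parser library.

```javascript
// Hand-written recognizer for the grammar:
//   S -> ( S )
//   S -> ( )
//   S -> chars
// Illustration only; it does not use the parser library.
function matchesS(input) {
	var last = input.length - 1;
	if (input.length >= 2 && input[0] === "(" && input[last] === ")") {
		var inner = input.slice(1, -1);
		// S -> ( )   or   S -> ( S )
		return inner === "" || matchesS(inner);
	}
	// S -> chars: one or more characters, none of them parentheses.
	return /^[^()]+$/.test(input);
}

console.log(matchesS("((((bfa))))")); // true
console.log(matchesS("()")); // true
console.log(matchesS("(a(b))")); // false
console.log(matchesS("a(b(c(d)))")); // false
```

The last two inputs fail because, once a chars token has been seen, none of the three rules allows another opening parenthesis to follow it.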

For instance, introducing the production [ "S", "(", "S", ")", "S" ] for S will result in what's known as a shift-reduce conflict. This is a fatal error in the grammar specification and will prevent the parser from being created.

We can start by allowing a preceding S construct before the parentheses, which will get us at least halfway there:
"S":[
	[ "S", "(", "S", ")" ],
	[ "(", "S", ")" ],
	[ "(", ")" ],
	[ "chars" ]
]
Now our grammar will accept strings like (a(b)) and a(b(c)), but not a(b()). That's easily fixed by adding [ "S", "(", ")" ] as the second rule. Another string that won't be accepted is a(b)c. We've seen that we can't fix this by appending a rule like ["S", "(", "S", ")", "S"], so what we need to do instead is allow characters to appear after the closing paren. Consider the following production object for S:
"S":[
	[ "S", "(", "S", ")", "chars" ],
	[ "S", "(", ")", "chars" ],
	[ "(", "S", ")" ],
	[ "(", ")" ],
	[ "chars" ]
]
By observing the parse tree, you can now see that the original input (a(b(cde)d(1)(f))d(fg(2(f)))) is accepted. In the final step we'll add some options to the symbols to clean up the output.

Cleaning up the output using excludeFromProduction

By using the excludeFromProduction flag, we can maintain the groupings of characters but remove the parentheses tokens from the output.

When the excludeFromProduction flag is set on a symbol, a token of that symbol that is about to be reduced into a production is left out of it: the production is still created, just without that token. It's useful for cases like parentheses, where the existence of the parens is not as important as the groupings that are created (these will be captured by S productions). Change the symbols section of the parser description to look like this:
"symbols":{
	"(": { "terminal":true, "match":"\\(", "excludeFromProduction":true },
	")": { "terminal":true, "match":"\\)", "excludeFromProduction":true },
	"chars": { "terminal":true, "match":"[^\\(\\)]+" }
},
Now when you parse a string like a(b)(c) you'll see a result like the following:
Parser Accept: [ { head: 'S',
    body: 
     [ { head: 'S',
         body: 
          [ { head: 'S',
              body: [ { type: 'chars', value: 'a' }, [length]: 1 ] },
            { head: 'S',
              body: [ { type: 'chars', value: 'b' }, [length]: 1 ] },
            [length]: 2 ] },
       { head: 'S',
         body: [ { type: 'chars', value: 'c' }, [length]: 1 ] },
       [length]: 2 ] },
  [length]: 1 ]
Instead of:
Parser Accept: [ { head: 'S',
    body: 
     [ { head: 'S',
         body: 
          [ { head: 'S',
              body: [ { type: 'chars', value: 'a' }, [length]: 1 ] },
            { type: '(', value: '(' },
            { head: 'S',
              body: [ { type: 'chars', value: 'b' }, [length]: 1 ] },
            { type: ')', value: ')' },
            [length]: 4 ] },
       { type: '(', value: '(' },
       { head: 'S',
         body: [ { type: 'chars', value: 'c' }, [length]: 1 ] },
       { type: ')', value: ')' },
       [length]: 4 ] },
  [length]: 1 ]
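
The difference between the two trees above can be mimicked with a small post-processing function. This is only a sketch to show the effect on the tree shape: stripTokens is a hypothetical helper, and the library applies the flag while reducing rather than after the parse. The { head, body } and { type, value } shapes are taken from the output shown above.

```javascript
// Hypothetical post-processing helper that mimics the effect of the flag on
// a finished tree; the library applies the flag while reducing instead.
function stripTokens(node, excludedTypes) {
	// Terminal tokens look like { type, value } and have no body.
	if (!node.body) return node;
	var body = node.body
		.filter(function (child) {
			// Drop terminal tokens whose type is in the excluded list.
			return !(child.type && excludedTypes.indexOf(child.type) !== -1);
		})
		.map(function (child) {
			return stripTokens(child, excludedTypes);
		});
	return { head: node.head, body: body };
}

// The inner a(b) production from the second tree above.
var tree = {
	head: "S",
	body: [
		{ head: "S", body: [ { type: "chars", value: "a" } ] },
		{ type: "(", value: "(" },
		{ head: "S", body: [ { type: "chars", value: "b" } ] },
		{ type: ")", value: ")" }
	]
};

console.log(JSON.stringify(stripTokens(tree, [ "(", ")" ]), null, 2));
```

After stripping, the body holds only the two S productions for a and b, which matches the inner production in the first tree.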

This is also a good chance to play with mergeIntoParent and mergeRecursive, as described in the basic example. They produce interesting results when applied to the S symbol.