Creating a Basic C Parser

In this tutorial we will build a C-style language parser from the ground up. The final code can be found on the js-parse GitHub.

The steps of building the parser are as follows:

  1. Literals
  2. Basic Math Expressions
  3. Statements and Statement Lists
  4. Debugging Tip


We'll start with a basic parsing script:
var Parser = require("../../lib/index").Parser.LRParser;

var parserDescription = {
	"symbols":{
		
	},
	"productions":{
		
	},
	"startSymbols": [  ]
};

// Create the parser
var parser = Parser.CreateWithLexer(parserDescription);

parser.on("accept", function(token_stack){
	console.log("\n\nParser Accept:" + parser.prettyPrint());
});

parser.on("error", function(error){
	console.log("Parse Error: ", error.message);
	throw error.message;
});

Literals

We will deal with only the basic literals: float, integer, and string. Full C grammars may include literals for enumerated types and other more complex structures, which we won't address in this example.


Numeric Literals

We will only cover basic numeric literals. We define an integer literal as a sequence of digits, optionally prefixed by a + or -. We define a floating-point literal as an optional sign, zero or more digits, a dot, and then one or more digits. You can refer to this file for more details on different types of literals, and more advanced representations.

We can start with a naive implementation of integers and floats:
var parserDescription = {
	"symbols":{
		"integer-literal":{
			"terminal":true,
			"match":"[\\+\\-]?[0-9]+"
		},
		"float-literal":{
			"terminal":true,
			"match":"[\\+\\-]?[0-9]*[\\.][0-9]+"
		}
	},
	"productions":{
		"literal": [
			[ "integer-literal" ],
			[ "float-literal" ]
		]
	},
	"startSymbols": [ "literal" ]
};

On the surface this looks fine, but testing reveals a problem. The input 1234 parses without trouble, but feeding in a float such as 123.456 produces an error:
Parse Error: Expecting one of ['*$*'], got float-literal('.456')
What happened here is that the parser picks up the 123 and calls it an integer, then thinks it's done, since the only production allows for a single literal. The solution is simple: we add a lookahead to the tokenizer so that it doesn't recognize an integer that is followed by a ".". Using the following code for the integer-literal symbol makes it work a little better:
"integer-literal":{
	"terminal":true,
	"match":"[\\+\\-]?[0-9]+",
	"lookAhead":"[^\\.]"
},

This ensures that an integer is recognized only when the character that follows it is not a dot. Now we have functioning, albeit basic, numeric types for our language.
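
To see why the lookahead matters, the two patterns can be tried outside the parser with plain JavaScript regular expressions. The sketch below does not use js-parse: matchTerminal is a hypothetical helper that assumes a lexer takes the longest match at the start of the input and then tests the lookahead pattern against whatever follows. js-parse's internal handling may differ.

```javascript
// Hypothetical helper, NOT part of js-parse. It mimics a lexer that takes the
// match at the start of the input, then checks the look-ahead pattern against
// the text that follows the match.
function matchTerminal(input, match, lookAhead) {
	var m = new RegExp("^(?:" + match + ")").exec(input);
	if (m === null) { return null; }
	if (lookAhead !== undefined) {
		var rest = input.slice(m[0].length);
		if (!new RegExp("^(?:" + lookAhead + ")").test(rest)) { return null; }
	}
	return m[0];
}

// The same pattern strings used in the parser description above.
var integerMatch = "[\\+\\-]?[0-9]+";
var floatMatch = "[\\+\\-]?[0-9]*[\\.][0-9]+";

console.log(matchTerminal("123.456", integerMatch));           // 123  (the naive integer steals the prefix)
console.log(matchTerminal("123.456", integerMatch, "[^\\.]")); // null (rejected: next character is a dot)
console.log(matchTerminal("123.456", floatMatch));             // 123.456
console.log(matchTerminal("123 ", integerMatch, "[^\\.]"));    // 123
```

With the integer rejected, the float pattern is free to claim the whole of 123.456.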

String literals

String literals are pretty straightforward since they are clearly delineated by quotes. For this example we will define a string literal as any characters inside of double-quotes.
We won't worry about escaped quotes, so we can represent strings simply as:
"string-literal":{
	"terminal":true,
	"match":"\"([^\"]*)\""
}
Be sure to add [ "string-literal" ] to the literal production and test it out.
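
The behaviour of this pattern, including the escaped-quote limitation mentioned above, can be checked with a plain JavaScript regular expression. This is only an illustration of the regular expression itself; it does not go through js-parse, and the variable names are made up for the example.

```javascript
// Same pattern string as the "match" property above, anchored to the start of the input.
var stringLiteral = new RegExp("^" + "\"([^\"]*)\"");

var simple = stringLiteral.exec("\"hello world\" rest");
console.log(simple[0]); // "hello world"  (the full match, quotes included)
console.log(simple[1]); // hello world    (the capture group, quotes stripped)

// Escaped quotes are not handled: the match stops at the first inner quote.
var escaped = stringLiteral.exec("\"say \\\"hi\\\"\"");
console.log(escaped[0]); // "say \"
```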

Ids

Ids are another type of token that we want to extract. They are basically variable names, and for our language we will match them with the regular expression [a-zA-Z_][a-zA-Z0-9_]*. Just add the symbol for this one:
"id":{
	"terminal":true,
	"match":"[a-zA-Z_][a-zA-Z0-9_]*"	
}
You can also add id to startSymbols if you want to test it out.

Basic Math Expressions

Now that we have the primitive elements we need, we can sketch out basic math expressions such as assignment, addition, and multiplication.



For reference, in this step we will implement the following grammar:
assignment_operator -> "=" | "*=" | "/=" | "%=" | "+=" | "-="
add_operator -> "+" | "-"
mult_operator -> "*" | "/" | "%"
exp -> assignment_exp | add_exp
assignment_exp -> primary_exp assignment_operator exp
primary_exp -> literal | "(" exp ")" | id
add_exp -> add_exp add_operator mult_exp | mult_exp
mult_exp -> primary_exp mult_operator mult_exp | primary_exp
We start by adding some of the basic operators as terminal symbols as well as opening and closing parens:
"assignment_operator": {
	"terminal":true,
	"match":"((=)|(\\*=)|(/=)|(\\%=)|(\\+=)|(\\-=))"
},
"add_operator": {
	"terminal":true,
	"match":"((\\+)|(\\-))",
	"lookAhead":"[^=]"
},
"mult_operator": {
	"terminal":true,
	"match":"((\\*)|(/)|(%))",
	"lookAhead":"[^=]"
},
"open_paren": {
	"terminal":true,
	"match":"\\(",
	"excludeFromProduction":true
},
"close_paren": {
	"terminal":true,
	"match":"\\)",
	"excludeFromProduction":true
}
Note that because the order in which the symbols are defined affects the way they are processed, the lookAhead properties on add_operator and mult_operator are not strictly necessary. They don't hurt, though, and they mean we don't have to worry about the order in which the symbols are defined.
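
The productions for the reference grammar are not listed in this tutorial. A direct transcription into the same format used for the literal production might look like the sketch below. The variable name expressionProductions and the small consistency check are illustrative assumptions, not taken from the js-parse repository; the entries would be merged into the "productions" object of the parser description.

```javascript
// Sketch only: the reference grammar written in the "productions" format.
// Symbol names follow the grammar and the terminals defined in this tutorial.
var expressionProductions = {
	"exp": [
		[ "assignment_exp" ],
		[ "add_exp" ]
	],
	"assignment_exp": [
		[ "primary_exp", "assignment_operator", "exp" ]
	],
	"primary_exp": [
		[ "literal" ],
		[ "open_paren", "exp", "close_paren" ],
		[ "id" ]
	],
	"add_exp": [
		[ "add_exp", "add_operator", "mult_exp" ],
		[ "mult_exp" ]
	],
	"mult_exp": [
		[ "primary_exp", "mult_operator", "mult_exp" ],
		[ "primary_exp" ]
	]
};

// Consistency check: list every symbol that is neither a production head in
// this fragment nor one of the terminals defined so far.
var terminals = [ "assignment_operator", "add_operator", "mult_operator",
	"open_paren", "close_paren", "id" ];
var undefinedSymbols = [];
Object.keys(expressionProductions).forEach(function (head) {
	expressionProductions[head].forEach(function (body) {
		body.forEach(function (symbol) {
			var known = Object.prototype.hasOwnProperty.call(expressionProductions, symbol) ||
				terminals.indexOf(symbol) !== -1;
			if (!known && undefinedSymbols.indexOf(symbol) === -1) {
				undefinedSymbols.push(symbol);
			}
		});
	});
});
console.log(undefinedSymbols); // only "literal" is left, which is the production defined earlier
```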

Set startSymbols to contain only exp. At this point our parser will accept a basic assignment or mathematical expression such as x = (10 + 55) * 10 + 5 or ((10 + 55) * 10 + 5) / 10.

Statements and Statement Lists

The final piece of the puzzle is to accept a program as a list of semicolon-terminated expressions.

When finished our parser will accept any number of math or assignment expressions in sequence. The additions to the grammar will look something like this:
statement_list -> exp ";" | statement_list exp ";"


The modifications at this point are simple: we add a terminal symbol to match the semicolon, and a production corresponding to the grammar rule above:
"semicolon": {
	"terminal":true,
	"match":";",
	"excludeFromProduction":true
},
And the production:
"statement_list":[
	[ "statement_list", "exp", "semicolon" ],
	[ "exp", "semicolon" ]
]
startSymbols should now contain only statement_list. Now we can run strings like "x = ((10 + 55) * 10 + 5) / 10; y = x; z = 10;" through the parser and get full parse trees.

Notice when parsing the string above, the parse tree starts off with:
statement_list(
	statement_list(
		statement_list(
This is less than ideal, but declaring statement_list as a non-terminal in the symbols object and giving it the mergeRecursive option will clean that right up. By adding the symbol:
"statement_list":{
	"terminal":false,
	"mergeRecursive":true
}
We now get a parse like:
statement_list( exp(...), exp(...), exp(...) )
Which makes much more sense.
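
Conceptually, merging a recursive production means splicing the children of a nested node into its parent whenever both have the same head. The sketch below shows that idea in plain JavaScript. It is not js-parse's implementation: the { head, body } node shape and the function names are assumptions made for the illustration.

```javascript
// Hypothetical node shape { head: string, body: array }. js-parse's real parse
// tree nodes may look different.
function mergeRecursive(node) {
	if (!node || !Array.isArray(node.body)) { return node; }
	var merged = [];
	node.body.forEach(function (child) {
		var flattened = mergeRecursive(child);
		if (flattened && flattened.head === node.head && Array.isArray(flattened.body)) {
			// Same head as the parent: lift the grandchildren up one level.
			merged = merged.concat(flattened.body);
		} else {
			merged.push(flattened);
		}
	});
	return { head: node.head, body: merged };
}

// Stand-in for an already parsed expression.
function exp(name) { return { head: "exp", body: [ name ] }; }

// The left-recursive shape produced for three statements.
var nested = {
	head: "statement_list",
	body: [
		{ head: "statement_list", body: [
			{ head: "statement_list", body: [ exp("x") ] },
			exp("y")
		] },
		exp("z")
	]
};

var flat = mergeRecursive(nested);
console.log(flat.body.length); // 3
console.log(flat.body.map(function (n) { return n.body[0]; }).join(", ")); // x, y, z
```

The statement order is preserved because each nested list is spliced in at the position where it appeared.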

Quick Debugging Tip

As your grammars get more and more complex, a few simple lines of code can give you a lot of insight into the parsing process.

By adding event handlers for basic lexing and parsing events, you can paint a picture of what's going on during the parse and diagnose issues as they arise.
parser.getLexer().on("token", function(token){
	console.log("token", token);
});

parser.on("production", function(head, body){
	console.log("prod", head, require('util').inspect(body, true, 1000));
});
This will print a line every time a token is recognized or a production is created, so you can tell what's going wrong in your parsing and diagnose errors in your parser description.