In this tutorial we will build a parser for a C-style language from the ground up. The final code can be found on the js-parse GitHub.
The steps of building the parser are as follows:
var Parser = require("../../lib/index").Parser.LRParser;
var parserDescription = {
    "symbols":{
    },
    "productions":{
    },
    "startSymbols": [ ]
};
// Create the parser
var parser = Parser.CreateWithLexer(parserDescription);
parser.on("accept", function(token_stack){
    console.log("\n\nParser Accept:" + parser.prettyPrint());
});
parser.on("error", function(error){
    console.log("Parse Error: ", error.message);
    throw error.message;
});
We will deal with only the basic literals: float, integer and string.
Full C grammars may include literals for enumerated types and other more complex structures, which we won't address in this example.
var parserDescription = {
    "symbols":{
        "integer-literal":{
            "terminal":true,
            "match":"[\\+\\-]?[0-9]+"
        },
        "float-literal":{
            "terminal":true,
            "match":"[\\+\\-]?[0-9]*[\\.][0-9]+"
        }
    },
    "productions":{
        "literal": [
            [ "integer-literal" ],
            [ "float-literal" ]
        ]
    },
    "startSymbols": [ "literal" ]
};
An integer such as 1234 works with no problem, but when we try to feed it a float such as 123.456 we get an error:
Parse Error: Expecting one of ['*$*'], got float-literal('.456')
What happened here is that the parser picks up the 123 and calls it an integer, then it thinks it's done, since the only production allows for a single literal. The solution is simple: we add a lookahead to the tokenizer so that it doesn't recognize an integer when it is followed by a ".". Using the following code for the integer-literal symbol makes it work a little better:
"integer-literal":{
    "terminal":true,
    "match":"[\\+\\-]?[0-9]+",
    "lookAhead":"[^\\.]"
},
Next comes the string-literal symbol:
"string-literal":{
    "terminal":true,
    "match":"\"([^\"]*)\""
}
Be sure to add [ "string-literal" ] to the literal production and test it out.
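The three literal patterns can also be checked outside the parser with plain regular expressions. The sketch below is only an illustration of the idea: matchAt is a hypothetical helper, not part of js-parse, and it assumes that a lookAhead pattern must match the text immediately after the token. It reproduces the 123 / .456 split from the error message and shows how the lookahead rejects the premature integer.

```javascript
// Patterns copied from the symbol definitions above.
var integerPattern = "[\\+\\-]?[0-9]+";
var floatPattern = "[\\+\\-]?[0-9]*[\\.][0-9]+";
var stringPattern = "\"([^\"]*)\"";

// Hypothetical helper (not js-parse code): match `pattern` exactly at `pos`.
// If `lookAhead` is given, the text right after the match must match it too.
function matchAt(input, pos, pattern, lookAhead) {
    var re = new RegExp(pattern, "y"); // sticky: the match must start at lastIndex
    re.lastIndex = pos;
    var m = re.exec(input);
    if (!m) { return null; }
    if (lookAhead) {
        var la = new RegExp(lookAhead, "y");
        la.lastIndex = pos + m[0].length;
        if (!la.test(input)) { return null; }
    }
    return m[0];
}

var integerAlone = matchAt("123.456", 0, integerPattern);             // "123"
var leftover = matchAt("123.456", 3, floatPattern);                   // ".456"
var integerGuarded = matchAt("123.456", 0, integerPattern, "[^\\.]"); // null
var floatToken = matchAt("123.456", 0, floatPattern);                 // "123.456"
var stringToken = matchAt("\"hello\" + 1", 0, stringPattern);         // "\"hello\""

console.log(integerAlone, leftover, integerGuarded, floatToken, stringToken);
```

Note that the string pattern stops at the first closing quote, so it does not handle escaped quotes inside a string.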
Identifiers match the pattern [a-zA-Z_][a-zA-Z0-9_]*, so we just add a symbol for them:
"id":{
    "terminal":true,
    "match":"[a-zA-Z_][a-zA-Z0-9_]*"
}
You can also add id to startSymbols if you want to test it out.
Now that we have the primitive elements we need, we can sketch out basic math expressions such as assignment, addition and multiplication:
assignment_operator -> '=' | '*=' | '/=' | '%=' | '+=' | '-='
add_operator -> "+" | "-"
mult_operator -> "*" | "/" | "%"
exp -> assignment_exp | add_exp
assignment_exp -> primary_exp assignment_operator exp
primary_exp -> literal | "(" exp ")" | id
add_exp -> add_exp add_operator mult_exp | mult_exp
mult_exp -> primary_exp mult_operator mult_exp | primary_exp
We start by adding some of the basic operators as terminal symbols, as well as opening and closing parens:
"assignment_operator": {
    "terminal":true,
    "match":"((=)|(\\*=)|(/=)|(\\%=)|(\\+=)|(\\-=))"
},
"add_operator": {
    "terminal":true,
    "match":"((\\+)|(\\-))",
    "lookAhead":"[^=]"
},
"mult_operator": {
    "terminal":true,
    "match":"((\\*)|(/)|(%))",
    "lookAhead":"[^=]"
},
"open_paren": {
    "terminal":true,
    "match":"\\(",
    "excludeFromProduction":true
},
"close_paren": {
    "terminal":true,
    "match":"\\)",
    "excludeFromProduction":true
}
Note that since the defined order of the symbols affects the way they are processed, the lookAhead properties on add_operator and mult_operator are not strictly necessary. It doesn't hurt to have them, though, and they save us from having to worry about the order in which the symbols are defined.
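The non-terminal rules of the grammar sketched earlier still have to be written as productions. The object below is one possible transcription, assuming the same array-of-alternatives format used for the literal production; the variable name expressionProductions and the small consistency check are illustrative only and not part of js-parse. The check lists every symbol used in a production body that is neither a production head nor one of the symbols defined so far, which catches misspelled names early.

```javascript
// A sketch of the expression grammar in the "productions" format used for
// "literal". These entries would be merged into parserDescription.productions.
var expressionProductions = {
    "exp": [
        [ "assignment_exp" ],
        [ "add_exp" ]
    ],
    "assignment_exp": [
        [ "primary_exp", "assignment_operator", "exp" ]
    ],
    "primary_exp": [
        [ "literal" ],
        [ "open_paren", "exp", "close_paren" ],
        [ "id" ]
    ],
    "add_exp": [
        [ "add_exp", "add_operator", "mult_exp" ],
        [ "mult_exp" ]
    ],
    "mult_exp": [
        [ "primary_exp", "mult_operator", "mult_exp" ],
        [ "primary_exp" ]
    ]
};

// Symbols defined earlier in the tutorial ("literal" is the existing production).
var alreadyDefined = [
    "literal", "id", "assignment_operator", "add_operator",
    "mult_operator", "open_paren", "close_paren"
];

// Illustrative check: collect body symbols that are not defined anywhere.
var defined = Object.keys(expressionProductions).concat(alreadyDefined);
var undefinedSymbols = [];
Object.keys(expressionProductions).forEach(function (head) {
    expressionProductions[head].forEach(function (body) {
        body.forEach(function (symbol) {
            if (defined.indexOf(symbol) === -1) {
                undefinedSymbols.push(symbol);
            }
        });
    });
});

console.log("undefined symbols:", undefinedSymbols);
```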
With the expression productions added, set startSymbols to contain only exp. At this point our parser will accept a basic assignment or mathematical expression such as x = (10 + 55) * 10 + 5 or ((10 + 55) * 10 + 5) / 10.
The final piece of the puzzle is to accept a program as a semicolon-delimited list of expressions.
When finished, our parser will accept any number of math or assignment expressions in sequence. The additions to the grammar will look something like this:
statement_list -> exp ";" | statement_list exp ";"
"semicolon": {
    "terminal":true,
    "match":";",
    "excludeFromProduction":true
},
And...
"statement_list":[
    [ "statement_list", "exp", "semicolon" ],
    [ "exp", "semicolon" ]
]
startSymbols should now contain only statement_list. Now we can run strings like "x = ((10 + 55) * 10 + 5) / 10; y = x; z = 10;" through the parser and get full parse trees.
However, because statement_list is left-recursive, the tree comes out nested, one level per statement:
statement_list( statement_list( statement_list(
This is less than ideal, but simply adding a non-terminal symbol to the symbols object and giving it the option mergeRecursive will clean that right up. By adding the symbol:
"statement_list":{
    "terminal":false,
    "mergeRecursive":true
}
We now get a parse like:
statement_list( exp(...), exp(...), exp(...) )
Which makes much more sense.
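To make the effect of merging concrete, here is a small stand-alone sketch of the idea. It is not js-parse's implementation: the node shape ({ head, body }) and the flatten function are assumptions made purely for illustration. Children that carry the same head as their parent are spliced into the parent, which turns the nested tree into a flat list.

```javascript
// Hypothetical node shape: { head: "name", body: [child nodes or token strings] }.
// Splice children named `name` into a parent that is also named `name`.
function flatten(node, name) {
    if (typeof node === "string") { return node; }
    var body = [];
    node.body.forEach(function (child) {
        var flat = flatten(child, name);
        if (node.head === name && typeof flat !== "string" && flat.head === name) {
            body = body.concat(flat.body); // merge the recursive child
        } else {
            body.push(flat);
        }
    });
    return { head: node.head, body: body };
}

// The nested shape produced by the left-recursive statement_list rule.
var nested = {
    head: "statement_list",
    body: [
        { head: "statement_list", body: [
            { head: "statement_list", body: [ { head: "exp", body: ["x"] } ] },
            { head: "exp", body: ["y"] }
        ] },
        { head: "exp", body: ["z"] }
    ]
};

var flat = flatten(nested, "statement_list");
var heads = flat.body.map(function (child) { return child.head; });
console.log(heads.join(", ")); // exp, exp, exp
```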
As your grammars get more and more complex, a few simple lines of code can give you a lot of insight into the parsing process.
By adding event handlers for basic lexing and parsing events, you can paint a picture of what's going on during the parse and diagnose issues as they arise.
parser.getLexer().on("token", function(token){
    console.log("token", token);
});
parser.on("production", function(head, body){
    console.log("prod", head, require('util').inspect(body, true, 1000));
});
This will print a line every time a token is recognized or a production is created, so you can tell what's going wrong in your parsing and diagnose errors in your parser description.