nearley.js 2.10.4

Writing a parser

This section describes the nearley grammar language, in which you can describe grammars for nearley to parse. Grammars are conventionally kept in .ne files. You can then use nearleyc to compile your .ne grammars to JavaScript modules.

You can find many examples of nearley grammars online, as well as some in the examples/ directory of the GitHub repository.

Vocabulary

By default, nearley attempts to parse the first nonterminal defined in the grammar. In the following grammar, nearley will try to parse input text as an expression.

expression -> number "+" number
expression -> number "-" number
expression -> number "*" number
expression -> number "/" number
number -> [0-9]:+

You can use the pipe character | to separate alternative rules for a nonterminal. In the example below, expression has four different rules.

expression ->
      number "+" number
    | number "-" number
    | number "*" number
    | number "/" number

The keyword null stands for the epsilon rule, which matches the empty string. The following nonterminal matches zero or more cows in a row, such as cowcowcow:

a -> null | a "cow"
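To make the recursion in this rule concrete, here is a minimal plain-JavaScript sketch that mirrors its two alternatives. It is only an illustration of which strings the rule accepts, not how nearley parses; the function name matchesCows is made up for this example.

```javascript
// Mirrors the grammar rule:  a -> null | a "cow"
// (Illustrative helper; not part of nearley.)
function matchesCows(text) {
    // Alternative 1: a -> null (the empty string matches)
    if (text === "") {
        return true;
    }
    // Alternative 2: a -> a "cow" (a shorter match followed by "cow")
    return text.endsWith("cow") && matchesCows(text.slice(0, -3));
}

console.log(matchesCows(""));          // zero cows
console.log(matchesCows("cowcowcow")); // three cows
console.log(matchesCows("cowco"));     // not a whole number of cows
```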

Postprocessors

By default, nearley wraps everything matched by a rule into an array. For example, when rule -> "foo" "bar" matches, it creates the “parse tree” ["foo", "bar"]. Most of the time, however, you need to process that data in some way: for example, you may want to filter out whitespace, or transform the results into a custom JavaScript object.

For this purpose, each rule can have a postprocessor: a JavaScript function that transforms the array and returns a “processed” version of the result. Postprocessors are wrapped in {% %}:

expression -> number "+" number {%
    function(data, location, reject) {
        return {
            operator: "sum",
            leftOperand: data[0],
            rightOperand: data[2] // data[1] is "+"
        };
    }
%}

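The rule above will parse the string 5+10 into { operator: "sum", leftOperand: "5", rightOperand: "10" }, assuming number has a postprocessor that returns the matched digits as a string (without one, each operand is a nested array of characters).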
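Because a postprocessor is an ordinary function, you can try it outside the grammar. The sketch below calls the same function by hand; the data array is a hypothetical input that assumes number already yields the strings "5" and "10", and the name sumPostprocessor is used only for this example.

```javascript
// The postprocessor from the rule above, extracted as a plain function.
function sumPostprocessor(data, location, reject) {
    return {
        operator: "sum",
        leftOperand: data[0],
        rightOperand: data[2] // data[1] is "+"
    };
}

// Hypothetical data array: one entry per symbol in the rule
// (number, "+", number). The empty object stands in for `reject`.
const sumResult = sumPostprocessor(["5", "+", "10"], 0, {});
console.log(sumResult);
```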

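The postprocessor can be any function. It will be passed three arguments: data, an array containing the results of parsing each part of the rule; location, the zero-based index at which the rule match starts; and reject, an object you can return to signal that the rule does not actually match.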

Remember that a postprocessor is scoped to a single rule, not the whole nonterminal. If a nonterminal has multiple alternative rules, each of them can have its own postprocessor.
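The third parameter in the postprocessor signature, reject, can be returned to tell nearley that a rule should not match after all. A minimal sketch, assuming a hypothetical rule such as identifier -> [a-z]:+ and a made-up keyword list; the rejectSentinel object below is a stand-in for the value nearley would pass in.

```javascript
// Hypothetical keyword list, for illustration only.
const keywords = new Set(["if", "else", "while"]);

// Postprocessor for a hypothetical rule:  identifier -> [a-z]:+
// Assumption: data[0] is the array of characters matched by [a-z]:+
function identifierPostprocessor(data, location, reject) {
    const name = data[0].join("");
    if (keywords.has(name)) {
        return reject; // signal that this rule does not match
    }
    return { type: "identifier", name: name, location: location };
}

// Stand-in sentinel for the demo; nearley supplies the real one.
const rejectSentinel = {};
const accepted = identifierPostprocessor([["f", "o", "o"]], 0, rejectSentinel);
const rejected = identifierPostprocessor([["i", "f"]], 4, rejectSentinel);

console.log(accepted);
console.log(rejected === rejectSentinel);
```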

For arrow-function users, a convenient pattern is to decompose the data array within the argument of the arrow function:

expression ->
      number "+" number {% ([first, _, second]) => first + second %}
    | number "-" number {% ([first, _, second]) => first - second %}
    | number "*" number {% ([first, _, second]) => first * second %}
    | number "/" number {% ([first, _, second]) => first / second %}
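Note that the arithmetic in these postprocessors only behaves numerically if number yields JavaScript numbers. The sketch below runs the "+" postprocessor on hand-written data arrays (hypothetical inputs) to show the difference; addNumeric is a variant introduced here for illustration.

```javascript
// The "+" postprocessor from the grammar above.
const add = ([first, _, second]) => first + second;

// If `number` already yields JavaScript numbers, this is a real sum.
const numericSum = add([5, "+", 10]);

// If `number` yields strings, `+` concatenates instead.
const stringSum = add(["5", "+", "10"]);

// Illustrative variant that converts its operands explicitly.
const addNumeric = ([first, _, second]) => Number(first) + Number(second);
const convertedSum = addNumeric(["5", "+", "10"]);

console.log(numericSum, stringSum, convertedSum);
```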

There are two built-in postprocessors for the most common scenarios: id, which returns the first element of the data array, and nuller, which returns null.
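As a rough sketch, the built-ins id and nuller behave like the plain functions below. These are stand-ins written for illustration, not nearley's own source; inside a grammar you use the built-in names directly, as in {% id %}.

```javascript
// Illustrative equivalents of the two built-in postprocessors.
const id = (data) => data[0]; // keep only the first matched element
const nuller = () => null;    // discard the match entirely

const unwrapped = id(["foo"]);   // useful for single-symbol rules
const discarded = nuller(["  "]); // useful for whitespace

console.log(unwrapped, discarded);
```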

Target languages

By default, nearleyc compiles your grammar to JavaScript. You can also choose CoffeeScript or TypeScript by adding @preprocessor coffee or @preprocessor typescript at the top of your grammar file. This is useful if you want to write your postprocessors in a different language, or if you want type annotations when using nearley from a statically typed dialect of JavaScript.

More syntax: tips and tricks

Comments

Comments are marked with ‘#’. Everything from # to the end of a line is ignored:

expression -> number "+" number # sum of two numbers

Charsets

You can use valid RegExp charsets in a rule (unless you’re using a tokenizer):

not_a_letter -> [^a-zA-Z]

The . character can be used to represent any character.
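Since charsets follow RegExp syntax, one quick way to check what a charset accepts is to try it as a JavaScript regular expression. The sketch below is plain JavaScript rather than nearley, and the ^ and $ anchors are added here only so that exactly one character is tested.

```javascript
// The charset from the not_a_letter rule, as a JavaScript RegExp.
const notALetter = /^[^a-zA-Z]$/;

const digitMatches = notALetter.test("7");  // a digit is not a letter
const letterMatches = notALetter.test("x"); // a letter is excluded

console.log(digitMatches, letterMatches);
```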

Case-insensitive string literals

You can create case-insensitive string literals by adding an i after the string literal:

cow -> "cow"i # matches CoW, COW, and so on.

Note that if you are using a lexer, you should not use the i suffix in nearley; make your lexer's regexes case-insensitive with the i flag instead.

EBNF

nearley supports the *, ?, and + operators from EBNF as shown:

batman -> "na":* "batman" # nananana...nanabatman

You can also use capture groups with parentheses. A group's contents can be anything that a rule can have:

banana -> "ba" ("na" {% id %} | "NA" {% id %}):+

Macros

Macros allow you to create polymorphic rules:

# Matches "'Hello?' 'Hello?' 'Hello?'"
main -> matchThree[inQuotes["Hello?"]]

matchThree[X] -> $X " " $X " " $X

inQuotes[X] -> "'" $X "'"
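One way to read these macros is as functions that build a pattern from their argument. The following plain-JavaScript analogy (not how nearleyc implements macros) composes two string-building functions in the same way and produces the text the grammar above is meant to match.

```javascript
// Analogy only: each macro becomes a function from text to text.
const inQuotes = (x) => "'" + x + "'";          // inQuotes[X] -> "'" $X "'"
const matchThree = (x) => [x, x, x].join(" "); // matchThree[X] -> $X " " $X " " $X

// main -> matchThree[inQuotes["Hello?"]]
const expanded = matchThree(inQuotes("Hello?"));

console.log(expanded);
```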

Macros are dynamically scoped, which means they see arguments passed to parent macros:

# Matches "Cows oink." and "Cows moo!"
main -> sentence["Cows", ("." | "!")]

sentence[ANIMAL, PUNCTUATION] -> animalGoes[("moo" | "oink" | "baa")] $PUNCTUATION

animalGoes[SOUND] -> $ANIMAL " " $SOUND # uses $ANIMAL from its caller

Macros are expanded at compile time and inserted wherever they are used; they are not “real” rules. Consequently, macros cannot be recursive: nearleyc will go into an infinite loop trying to expand a macro that refers to itself.

Additional JS

For more intricate postprocessors, or any other functionality you may need, you can include chunks of JavaScript code between production rules by surrounding it with @{% ... %}:

@{%
const cowSays = require("./cow.js");
%}

cow -> "moo" {% ([moo]) => cowSays(moo) %}

Note that it doesn’t matter where you add these; they all get hoisted to the top of the generated code.

Importing other grammars

You can include the content of other grammar files:

@include "../misc/primitives.ne" # path relative to file being compiled
sum -> number "+" number # uses "number" from the included file

There are several built-in helper files that you can include:

@builtin "cow.ne"
main -> cow:+

See the builtin/ directory for more details. Contributions are welcome!

Including a file imports all of the nonterminals defined in it, as well as any JS, macros, and configuration options defined there.