System Guide
This is the guide to the Lezer parser system. It provides a prose
description of the system's functionality. For the item-by-item
documentation of its interface, see the reference
manual.
Overview
Lezer is a parser system written in JavaScript. Given a formal
description of a grammar, it can produce a set of parse tables. Such
tables provide a description that the parser system can use to
efficiently construct a syntax tree for a given piece of text,
describing the structure of the text in terms of the grammar.
The tables are generated by the lezer-generator
tool, a build tool that takes a file in the format described later in
this guide, and outputs a big, largely unreadable
blob of JavaScript that represents the parse tables. This is something
that happens offline, as part of the build process for a grammar
package.
The lezer package provides the run-time parsing system.
Combined with a parser built by lezer-generator,
it gives you a parse
method that takes a
source file and returns a tree.
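For example, with a grammar package that exports such a parser (the lezer-javascript package and its parser export are used here for illustration), a parse can be as simple as this:
import {parser} from "lezer-javascript"

const tree = parser.parse("let x = 10")
// Tree's toString method prints a compact bracketed description of the nodes
console.log(tree.toString())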
These trees, represented by data structures from the
lezer-tree package, are more limited than the
abstract syntax trees you might have seen in other contexts. They
are not very abstract. Each node only stores a type, a start and end
position, and a flat sequence of children. When writing a grammar, you
choose which productions are stored as nodes—the others don't appear
in the tree at all.
This means that the tree is very compact in memory and cheap to build.
It does make doing refined analysis on it somewhat awkward. The use
case guiding the design of this library is an editor
system, which keeps a syntax tree of the
edited document, and uses it for things like syntax highlighting and
smart indentation.
To support this use case the parser has a few other interesting
properties. It can be used incrementally, meaning it can efficiently
re-parse a document that is slightly changed compared to some previous
version given the parse for the old version. And it has error
recovery built in, meaning that even if the input doesn't adhere to
the grammar, it can still produce some reasonable syntax tree for it.
This system's approach is heavily influenced by
tree-sitter, a similar
system written in C and Rust, and several papers by Tim Wagner and
Susan Graham on incremental parsing
(1,
2).
Parser Algorithm
Lezer is based on LR
parsing, an algorithm invented by Donald Knuth in 1965, which by
pre-analyzing the grammar can derive fully deterministic (and thus
efficient) parsers for some grammars.
Roughly, this algorithm abstractly
interprets the
grammar, recording the various states the parser can be in, and
creating a table for each state mapping terminal symbols (tokens) to
the action that should be taken when the parser sees that token in
this state. If there's more than one action to take for a given token
in a given state, the grammar can not be parsed with this algorithm.
Such problems are usually called “shift-reduce” or “reduce-reduce”
conflicts. More about that in a moment.
When writing grammars for LR-based tools, it can help to have a rough
feel for this algorithm. The
Wikipedia
article linked above is a good introduction. For a more in-depth
treatment, I recommend Chapter 9 of this
book
(PDF).
Ambiguity
A lot of grammars are either impossible or extremely awkward to
represent as a plain LR grammar. If an element's syntactic role only
becomes clear later in the parse (for example JavaScript's (x = 1) + 1
versus (x = 1) => x
, where (x = 1)
can either be an expression
or a parameter list), plain LR often isn't very practical.
GLR is an extension of the
parsing algorithm that allows a parse to ‘split’ at an ambiguous point
by applying more than one action for a given token. The alternative
parses then continue beside each other. When a parse can't make any
more progress, it is dropped. As long as at least one parse continues,
we can get our syntax tree.
GLR can be extremely helpful when dealing with local ambiguities, but
when applied naively it can easily lead to an explosion of concurrent
parses and, when the grammar is actually ambiguous, multiple parses
continuing indefinitely so that you're in effect parsing the document
multiple times at once. This completely ruins the properties that made
LR parsing useful: predictability and efficiency.
Lezer allows GLR parsing but requires you to explicitly annotate the
places in your grammar where it is allowed to happen, so that you can
use it to solve problems that were otherwise difficult, but it won't
accidentally start happening all over your grammar.
The parse state splitting is optimized in Lezer, so that, though it is
more expensive than just doing a linear parse, you can have
ambiguities in commonly encountered constructs in your language and
still have a fast parser.
Error Recovery
Though the parser has a strict
mode,
by default it'll proceed through any text, no matter how badly it fits
the grammar, and come up with a tree at the end.
To do this it uses the GLR mechanism to try various recovery tricks
(ignoring the current token or skipping ahead to a place that matches
the current token) alongside each other to see which one, a few tokens
later, gets the best result.
Ignored tokens are added to the tree, wrapped in an error node.
Similarly, the place where the parser skipped ahead is marked with an
error node.
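For instance, reusing the hypothetical parser import from the overview, feeding it syntactically broken input still produces a tree, with the recovery points showing up as error nodes when the tree is printed:
const broken = parser.parse("let x = ;") // not valid JavaScript
console.log(broken.toString())           // the output contains error nodes at the recovery points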
Incremental Parsing
In order to avoid re-doing work, the parser allows you to provide a
cache of nodes, which is a previous tree
mapped to the current document with the applyChanges
method. The parser will, when possible,
reuse nodes from this cache rather than re-parsing the parts of the
document they cover.
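A rough sketch of what that might look like for a single-character insertion (the name of the parse option used to pass the cache is an assumption here):
// oldDoc and newDoc hold the document text before and after the edit
const oldTree = parser.parse(oldDoc)
// describe the edit in terms of old (A) and new (B) offsets
const changes = [{fromA: 5, toA: 5, fromB: 5, toB: 6}]
// map the old tree onto the new document and pass it as a cache
const newTree = parser.parse(newDoc, {cache: oldTree.applyChanges(changes)})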
Because the syntax tree represents sequences of matches (specified in
the grammar notation with the +
and *
operators) as balanced
sub-trees, the cost of re-matching unchanged parts of the document is
low, and you can quickly create a new tree even for a huge document.
This isn't bulletproof, though—even a tiny document change, if it
changes the meaning of the stuff that comes after it, can require a
big part of the document to be re-parsed. An example would be adding
or removing a block comment opening marker.
Contextual Tokenization
In classical parsing techniques there is a strict separation between
the tokenizer, the thing that splits the input into a sequence of
atomic tokens, and the parser.
This separation can be problematic, though. Sometimes, the meaning of
a piece of text depends on context, such as with the ambiguity between
JavaScript's regular expression notation and its division operator
syntax. At other times, a sub-language used in the grammar (say, the
content of a string) has a different concept of what a token is than
the rest of the grammar.
Lezer supports contextual token reading. It will allow the tokens you
declare to overlap (to match the same input) as long as such tokens
can't occur in the same place anywhere in the grammar.
You can also define external tokenizers, which cause the parser to
call out to your code to read a token. Such tokenizers will, again,
only be called when the tokens they produce apply at the current
position.
Even white space, the type of tokens implicitly skipped by the parser,
is contextual, and you can have different rules skip different things.
Grammar Nesting
In some cases, such as JavaScript embedded in HTML or code snippets
inside a literal programming notation, you want to make another parser
responsible for parsing pieces of a document. Lezer supports one form
of this: If the nested syntax has a clearly identifiable end token,
you can specify that anything up to that token should be parsed by
another grammar.
This means that parse trees can contain node types from different
grammars.
Writing a Grammar
Lezer's parser generator defines its own notation for
grammars. You can take a look at the JavaScript
grammar
to see an example.
A grammar is a collection of rules, which define terms. Terms can
be either tokens, in which case they directly match a piece of input
text, or nonterminals, which match expressions made up of other
terms. Both tokens and nonterminals are defined using similar syntax,
but they are explicitly distinguished (tokens must appear in a
@tokens
block), and tokens are more limited in what they can
match—for example, they can't contain arbitrary recursion.
Formally, tokens must match a regular
language (roughly the
thing that basic regular expressions can match), whereas nonterminals
can express a context-free
language.
Here is an example of a simple grammar:
@top { expression }
expression { Name | Number | BinaryExpression }
BinaryExpression { "(" expression ("+" | "-") expression ")" }
@tokens {
Name { std.asciiLetter+ }
Number { std.digit+ }
}
This would match strings like (a+1)
or (100-(2+4))
. It doesn't
allow white space between tokens, and the parentheses around binary
expressions are required; without them, the parser generator would complain about ambiguity.
@top
defines the entry point to the grammar. This is the rule that
will be used to match the entire input. It'll usually contain
something that repeats, like statement+
.
You'll notice that the example has some terms starting with a lowercase
letter, and some that are capitalized. This difference is significant.
Capitalized rules will show up as nodes in the syntax tree produced by
the parser, lower-case rules will not.
(If you are writing rule names in a script that doesn't have case, you
can use an underscore at the start of a name to indicate that the rule
should not be in the tree.)
Operators
The grammar notation supports the repetition operators *
(any number
of repetitions of the thing before), +
(one or more repetitions),
and ?
(optional, zero or one repetitions). These have high
precedence and only apply to the element directly in front of them.
The pipe |
character is used to represent choice, matching either of
the expressions on its sides. It can, of course, be repeated to
express more than two choices (x | y | z
). Choice has the lowest
precedence of any operator, and, if there are no parentheses present,
will apply to the entire expressions to its left and right.
Choice in context-free grammars is
commutative,
meaning that a | b
is exactly equivalent to b | a
.
Parentheses can be used for grouping, as in x (y | z)+
.
Sequences of things are expressed by simply putting them next to each
other. a b
means a
followed by b
.
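Combining these operators, a bracketed, comma-separated list of expressions (a sketch, assuming the expression rule from the earlier example) could be written like this:
List { "[" (expression ("," expression)*)? "]" }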
Tokens
Named tokens are defined in a @tokens
block. You can also, outside
of the tokens
block, use string literals like "+"
as tokens. These
automatically define a new token. String literals use the same
escaping rules as JavaScript strings.
String literals inside of token rules work differently. They can be
combined with other expressions to form a bigger token. All token
rules (and literal tokens) are compiled to a deterministic finite
automaton,
which can then be used to efficiently match them against a text
stream.
So an expression like "a" "b" "c"
in a nonterminal rule is a
sequence of three tokens. In a token rule, it is exactly equivalent to
the string "abc"
.
You can express character sets using set notation, somewhat similar to
square bracket notation in regular expressions. $[.,]
means either a
period or a comma (there's no special meaning associated with .
in
this notation, or with escapes like \s
). $[a-z]
matches a
, z
,
and any character that, in the Unicode character code ordering, comes
between them. To create an inverted character set, matching only
characters not mentioned in the set, you write an exclamation mark
rather than a dollar sign before the brackets. So ![x]
matches any
character that is not x
.
The parser generator defines a few built-in character sets under names
that start with std.
:
- std.asciiLetter matches $[a-zA-Z]
- std.asciiLowercase matches $[a-z]
- std.asciiUppercase is equivalent to $[A-Z]
- std.digit matches $[0-9]
- std.whitespace matches any character the Unicode standard defines as whitespace.
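Putting character sets and the built-in classes together, a couple of token rules might look like this (a sketch):
@tokens {
  Identifier { std.asciiLetter (std.asciiLetter | std.digit)* }
  HexColor { "#" $[0-9a-fA-F]+ }
}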
Token rules cannot refer to nonterminal rules. But they can refer to
each other, as long as the references don't form a non-tail recursive
loop. I.e. a rule x
can not, directly or indirectly, include a
reference to x
, unless that reference appears at the very end of the
rule.
Skip Expressions
Our initial example does not allow any whitespace between the tokens.
Almost all real languages define some kind of special tokens, usually
covering spacing and comments, that may occur between the other tokens
and are ignored when matching the grammar.
To support whitespace, you must add a @skip
rule to your grammar.
@skip { space | Comment }
You could define the space
and Comment
tokens like this:
@tokens {
space { std.whitespace+ }
Comment { "//" ![\n]* }
// ...
}
A skip rule will be matched zero or more times between other tokens.
So the rule above also handles a long sequence of comments and
whitespace.
Skipped tokens may be capitalized (as Comment
is), in which case
they will appear in the syntax tree. It is allowed for a skip
expression to match things more complicated than single tokens.
Sometimes different parts of your grammar need different skip rules. For
example, when your strings aren't plain tokens but have their own internal
structure, you probably don't want whitespace and comments to be skipped
between the pieces of a string. For cases like that, you can create a skip
block like this:
@skip {} {
String {
stringOpen (stringEscape | stringContent)* stringClose
}
}
The initial braces contain the skip expression—in this case we want to
skip nothing so it is empty—and the second pair of braces contain
rules inside of which this skip expression is used.
Precedence
Let's go back to the binary operator rule in the example. If we define
it like this, removing the parentheses, we get an error message.
BinaryExpression { expression ("+" | "-") expression }
The error says "shift/reduce conflict" at expression "+" expression · "+"
. The ·
indicates the parse position, and the parser is telling
us that, after reading what looks like a binary expression and seeing
another operator, it doesn't know whether to first reduce the initial
tokens to a BinaryExpression
or to first shift the second operator
and leave the first for later. Basically, it doesn't know whether to
parse x + y + z
as (x + y) + z
or x + (y + z)
.
This kind of issue can be solved without resorting to GLR
parsing, fortunately. Doing so involves what is basically
a crude hack on top of the LR parser generator algorithm: specify
precedence to resolve these conflicts, so that when the parser
generator runs into the ambiguity, we tell it to take one parse and
discard the other.
This can be used not only for associativity (the error above), but
also for operator precedence (giving groupings involving *
precedence over groupings involving +
, for example) and various
other issues.
The way to specify precedence in Lezer is with, first, a @precedence
block that enumerates the precedence names you are going to use in the
grammar, in order of precedence (highest first), and then inserting
precedence markers at ambiguous positions.
@precedence { times @left, plus @left }
@top { expression }
expression { Number | BinaryExpression }
BinaryExpression {
expression !times "*" expression |
expression !plus "+" expression
}
@tokens {
Number { std.digit+ }
}
This tells the parser generator that the !times
precedence marker
has a higher precedence than !plus
, and that both are left
associative (preferring (a + b) + c
to a + (b + c)
). You can also
specify @right
after the precedence name to make it right
associative, or leave off the associativity to not specify any (in which
case the precedence won't resolve shift/reduce conflicts with itself).
The !
markers have to be inserted at the point where the conflict
occurs. In this case, the conflict happens directly in front of the
operators, so that's where the markers are placed.
This grammar can now correctly parse things like 1+2*3+4
into nodes
that group the operators by precedence, as in (1+(2*3))+4
.
It is also possible, instead of specifying an associativity for a
given precedence, to make it a cut operator by using the keyword
@cut
. A cut operator will override other interpretations even though
no conflict was detected yet. An example of where this is appropriate
is the way JavaScript's function
keyword has a different meaning in
statement position. A statement can start with an expression, and an
expression can be a function expression, but when we see the
function
token at the start of a statement, we only want to enter
the function declaration statement rule, not the function expression
rule.
@precedence { decl @cut }
@top { statement+ }
statement { FunctionDeclaration | FunctionExpression }
FunctionExpression { "function" "..." }
FunctionDeclaration { !decl "function" "..." }
This will parse function...
as a FunctionDeclaration
, though it
would also match FunctionExpression
(which could be used elsewhere
in the grammar, if it was a real grammar).
Allowing Ambiguity
Still, some things can not be resolved with precedence declarations.
To explicitly allow Lezer to try multiple actions at a given point,
you can use ambiguity markers.
@top { (GoodStatement | BadStatement)+ }
GoodStatement { "(" GoodValue ")" ";" }
GoodValue { "val" ~statement }
BadStatement { "(" BadValue ")" "!" }
BadValue { "val" ~statement }
This grammar is entirely nonsensical, but it shows the problem: in
order to know whether to match GoodValue
or BadValue
, the parser
has to decide whether it is parsing a GoodStatement
or a
BadStatement
. But at the time where it has to do that, the next
token is a closing parenthesis, which doesn't yet tell it what it
needs to know.
The parser generator will complain about that (it is a “reduce/reduce”
conflict) unless we add the ~statement
ambiguity markers at the
point where the reductions happen. The word (statement
) doesn't have
any special meaning, beyond that it has to be the same for all the
conflicting positions, so that the parser generator knows that we are
explicitly annotating these two positions as ambiguous.
With the markers, the grammar compiles, and whenever the parser gets
to the ambiguous position, it'll split its parse state, try both
approaches, and then drop off one path when it sees that it fails to
match on the next token.
Template Rules
It is possible to define parameterized rules, which can help avoid
repetitive grammar rules. Such rules act as templates—they are copied
for each set of parameters they are given.
Lezer uses angle brackets around parameters and arguments. You can
define a parameterized rule like this:
commaSep<content> { "" | content ("," content)* }
(The empty string literal is treated as the empty sequence, or ε,
matching nothing.)
After having defined the above rule, you can do commaSep<expression>
to match a comma-separated list of expressions.
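For instance, an argument list rule (a sketch, assuming expression is defined elsewhere in the grammar) could use the template like this:
ArgList { "(" commaSep<expression> ")" }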
When you pass arguments to a rule defined in the @tokens
block,
those arguments will be interpreted as if they are part of a token
rule, even when you call it from a nonterminal rule.
Token Precedence
By default, tokens are only allowed to overlap (match some prefix of
each other) when they do not occur in the same place of the grammar or
they are both simple tokens, defined without use of repeat
operators. I.e. you can have tokens "+"
and "+="
, but not tokens
"+"
and "+" "="+
.
To explicitly specify that, in such a case, one of the tokens takes
precedence, you can add @precedence
declarations to your @tokens
block.
@tokens {
Divide { "/" }
Comment { "//" ![\n]* }
@precedence { Comment, Divide }
}
By default, since Divide
is a prefix of Comment
, these would be
considered overlapping. The @precedence
declaration states that
Comment
overrides Divide
when both match. You can have multiple
@precedence
declarations in your @tokens
block, and each
declaration can contain any number of tokens, breaking ties between
those tokens.
Token Specialization
You could try to handle things like keywords, which overlap with
identifiers, by specifying that they take precedence. But this is
awkward, because if you define a keyword as matching "new"
and
having a higher precedence than identifiers, the input newest
would
first match the keyword "new"
and then the identifier "est"
.
In addition, having dozens of keywords in your tokenizer automaton
greatly increases its size and complexity. What many hand-written
tokenizers do is first match an identifier, and then check whether
its content corresponds to any keyword.
This is supported in Lezer with the @specialize
operator. This
operator declares a new token given a base token and a string literal.
When specializations have been defined for a given token type, its
specialize table will be consulted every time it is read, and if its
content matches a specialization, it is replaced by the specialized
token.
NewExpression { @specialize<identifier, "new"> expression }
This rule will match things like new Array
.
There is another operator similar to @specialize
, called @extend
.
Whereas specialized tokens replace the original token, extended
tokens allow both meanings to take effect, implicitly enabling GLR
when both apply. This can be useful for contextual keywords where it
isn't clear whether they should be treated as an identifier or a
keyword until a few tokens later.
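As a sketch, with hypothetical identifier and Property rules, a contextual get keyword could be declared like this, leaving get usable as a plain identifier everywhere else:
getKw { @extend<identifier, "get"> }
Getter { getKw Property }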
External Tokens
The regular way to define tokens is usually enough, and the most
convenient. But regular languages are notoriously limited in what they
can match, so some things, like the multiple-character look-ahead
needed to recognize the end of a comment, are awkward to express
inside the grammar. And something like JavaScript's automatic
semicolon insertion, which requires checking the content of preceding
whitespace and looking at the character ahead, can't be expressed in
the grammar at all.
Thus, Lezer allows you to declare some tokens as being external, read
by pieces of JavaScript code that you provide. An external tokenizer
might look like this:
@external tokens insertSemicolon from "./tokens" { insertedSemicolon }
This tells lezer-generator that it should import the insertSemicolon
symbol from "./tokens"
, and use that as a tokenizer which might
yield the insertedSemicolon
token.
Such a tokenizer should be an
ExternalTokenizer
instance. It will
only be called when the current state allows one of the tokens it
defines to be matched.
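A sketch of what the "./tokens" module could contain is shown below. The terms-file path, the callback signature, and the input.get call that reads a character code are assumptions here; the accept method is described in the section on running the parser.
import {ExternalTokenizer} from "lezer"
// term IDs are imported from the generated terms file (path assumed)
import {insertedSemicolon} from "./lang.terms.js"

const newline = 10 // character code for "\n"

export const insertSemicolon = new ExternalTokenizer((input, token, stack) => {
  // Emit a zero-length semicolon token when the character just before the
  // token position is a newline. The stack argument gives access to the
  // current parse state and is unused in this sketch.
  if (input.get(token.start - 1) == newline)
    token.accept(insertedSemicolon, token.start)
})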
The order in which @external tokens
declarations and the @tokens
block appear in the grammar file is significant. Those that appear
earlier will take precedence over those that appear later, meaning
that if they return a token, the others will not be tried.
Inline Rules
Rules that are only going to be used once may be written entirely
inside the rule that uses them. This is mostly useful to declare node
names for sub-expressions. For example this definition of a statement
rule (which itself does not create a tree node) gives names to each of
its choices.
statement {
ReturnStatement { Return expression } |
IfStatement { If expression statement } |
ExpressionStatement { expression ";" }
}
Node Props
It is possible to add node props, extra metadata
associated with tree nodes, directly in a grammar file. To do so, you
write square brackets after the rule's name.
ParenthesizedExpression[delim="( )"] { "(" expression ")" }
This will add the NodeProp.delim
prop to
this node type, which provides information about the delimiters
wrapping a node. Values that consist of only letters can be written
without quotes, values like the one above must be quoted.
Most of the props defined as static properties of
NodeProp
are available by default. You can also
import custom props using syntax like this:
@external prop myProp from "./props"
SomeRule[myProp=somevalue] { "ok" }
You can write as otherName
after the prop name to rename it locally.
Inline rules, external token names, and @specialize
/@extend
operators also allow square bracket prop notation to add props to the
nodes they define.
When name
is specified in square brackets, it sets the node's name
instead of a prop. So this rule would produce a node named x
despite
being itself called y
. When a name is explicitly specified like
this, the fact that it isn't capitalized isn't relevant.
y[name=x] { "y" }
Sometimes it is useful to insert the value of a rule parameter into a
prop. You can use curly braces inside a prop value to splice in the
content of an identifier or literal argument expression. This keyword
helper rule produces a specialized token named after the keyword:
kw<word> { @specialize[name={word}]<identifier, word> }
If you put a @detectDelim
directive in your grammar file, the parser
generator will automatically detect rule delimiter tokens, and create
delim
props for rules where it finds them.
Literal Token Declarations
Literal tokens will, by default, not create a node. But it is possible
to explicitly list those that you want in the tree in the tokens
block, optionally with props.
@tokens {
"("
")"
"#"[someProp=ok]
// ... other token rules
}
Nested Grammars
Similar to external tokenizers, it is possible to pull in an external
grammar to use it to parse bits of the input.
@external grammar parser as javascript from "lezer-javascript"
scriptBlock { "{{" nest.javascript "}}" }
This defines scriptBlock
to, after the "{{"
token has been
matched, scan ahead for an "}}"
token and then continue parsing the
text in between using the parser
export from "lezer-javascript"
.
You can leave off the as
part when the local name for the grammar,
the thing you write after nest
, is the same as the exported name. It
is also possible to write empty
instead of from "..."
, in which
case the content of the block will simply not be parsed, unless a
parser for javascript
is later registered with the
withNested
method.
The value of a nested grammar can be either
null
(don't parse), a Lezer Parser
instance, or
a function (see the docs for its signature) that delays the decision
on how to parse the content to run-time.
You can pass up to two arguments to nest.something
to further
specify the way the region should be parsed. The first argument may be
an explicit end token. By default, the token occurring directly after
the nest
expression is taken as the token that ends the region, but
sometimes that isn't what you want. A second argument may provide a
default expression to parse, instead of entering the nested grammar,
in case the nested grammar is a dynamic function and it returns
{stay: true}
(declines to enter the nesting).
Building a Grammar
The lezer-generator package comes with a command-line
tool lezer-generator
that converts a grammar
file into a JavaScript module that can be loaded
to get access to the parser interface.
(This functionality is also available through a programming
interface.)
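A rough sketch of that programmatic route, assuming lezer-generator exposes a buildParser function that takes the grammar source as a string:
import {readFileSync} from "fs"
import {buildParser} from "lezer-generator"

// Build a Parser instance directly from the grammar text, without writing
// the generated JavaScript to disk first.
const parser = buildParser(readFileSync("lang.grammar", "utf8"))
console.log(parser.parse("(100-(2+4))").toString())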
For example, we could run the tool as
lezer-generator lang.grammar -o lang.js
If there are no problems with the grammar specified in lang.grammar
,
this will write two files, lang.js
and lang.terms.js
.
The first is the main grammar file. It depends on "lezer"
and on any
files specified in external token or nested
grammar declarations. It exports a binding parser
that holds a Parser
instance.
The terms file contains, for every external token and every rule that
either is a tree node and doesn't have parameters, or has the keyword
@export
in front of it, an export with the same name as the term
that holds the term's ID. When you define an external tokenizer,
you'll usually need to import the token IDs for the tokens you want to
recognize from this file. (Since there'll be quite a lot of
definitions in the file for regular-sized grammars, you might want to
use a
tree-shaking
bundler to remove the ones you do not need.)
These are the command-line options supported by lezer-generator
:
- --output or -o can be used to specify the place where the tool should write its output. By default, it will write to its standard output stream, but since it generates two files, that will mean you don't get a term definition file.
- --cjs can be used to make the tool output CommonJS modules (the default is to output ES modules).
- --names can be given to make the tool include term names in the output, to help debugging. This adds quite a lot of bytes to the output file, and should probably not be used in production.
- --export can be used to change the name of the exported binding that holds the Parser instance. It defaults to parser.
- --noTerms, when given, tells the tool to not write a terms file, even though an output path was given.
Running the Parser
The simplest way to parse a file is to call
Parser.parse
on the parser you import from
the generated file. This will always succeed and return a
Tree
instance.
Sometimes you want to take advantage of the fact that LR parsing
happens in discrete steps, and have control over when the steps are
performed. You can create a ParseContext
with the startParse
method, which is an
object that represents an in-progress parse, on which you can
repeatedly call advance
to perform
the next action, until you decide you have parsed enough or the method
returns a tree to indicate that it has finished
parsing.
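A sketch of such a step-by-step parse, assuming advance returns null until it has produced a finished tree:
let context = parser.startParse("let x = 10")
let tree = null
while (!tree) tree = context.advance() // each call performs one parse action
console.log(tree.toString())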
Such a parse context holds a collection of Stack
instances, each of which represents an actual single parse state. When
the parse is unambiguous, there'll be one stack. When it is trying out
multiple options in parallel, there'll be more than one. You don't
usually need to concern yourself with these, but an external
tokenizer or nested grammar resolution
function gets access to a stack and can query
it to, for example, ask if the current state accepts a given
token or where the rule that is currently
being matched starts.
Parsing consumes an InputStream
, which
abstracts access to a string-like data structure. You may simply pass
a string to parse
, in which case the library
will do the wrapping itself, but if you have your data in some kind of
data structure more involved than a string, you'll need to write your
own class that implements InputStream
, so that
Lezer can read directly from your data structure, without copying its
content into a flat string.
External tokenizers get access to this stream, as well as to a
Token
object that should be filled with information
about the token. When an external tokenizer is called, only the
start
property of the token has been filled
in, and should be used by the tokenizer to decide where to start
reading. When it recognizes a token, it should set the
value
(token term ID) and
end
(end offset) properties (possibly by using
the accept
method).
Working with Trees
Syntax trees produced by Lezer are stored in a format that's optimized
for compactness and efficient reuse during incremental parsing. Using
this structure directly for other purposes can be very awkward.
Firstly, the trees contain nodes that don't correspond to named rules
in the grammar, but indicate repetitions from +
or *
operators in
the grammar. These are essential for incremental parsing, but usually
uninteresting in any other context.
Secondly, chunks of small nodes are compressed together in arrays of
16-bit numbers, with each node only encoding its type id, start
position, end position, and the index at which its children end. This
is compact, but it is not a format you want to be directly
interacting with.
Iteration
The easiest thing you can do with a tree is to iterate over its (real,
non-repeat) nodes. The iterate
method can
do either a forward or a backward walk over a part of the tree,
calling a function every time it enters a node, and, optionally,
another every time it exits a node.
These functions are passed the node type as well as its start and end
positions. You can use this, possibly along with some local state
tracking, to perform actions on the tree's content. For example, this
would be a crude way to convert a tree to XML:
function treeToXML(tree) {
let xml = ""
tree.iterate(0, tree.length, (type, start, end) => {
// Open tag on entering a node
xml += `<node type="${type.name}" start="${start}" end="${end}">`
}, () => {
// Close tag on exit
xml += `</node>`
})
return xml
}
Iteration will include all nodes that touch the given range, even when
they end at its start or start at its end.
The function called when entering a node has a somewhat complicated
return protocol. By default, when it returns undefined, iteration
proceeds as normal. When it returns the precise value false
, that
indicates that the node that's being entered should be skipped—no
callbacks will be called for its children, and no leave callback will
be called for the node itself.
Finally, if it returns any other value, the iteration is immediately
stopped and that value is returned from iterate
. This can be used
for search-like iteration, where you want to find a given node or check
whether some type of node is present.
let hasBreakStatement = tree.iterate(0, tree.length, type => {
return type.name == "BreakStatement" ? true : undefined
})
Subtree Cursors
To rummage through a tree in a less one-directional way, you can use
the Subtree
abstraction. This is a representation
of a given node in the tree, with awareness of its parents and
absolute position. It allows you to zoom in on a given node, inspect
it, continue to its children, or go back to its parent.
The top-level tree you get from the parser implements the Subtree
interface. From any subtree you can use the
resolve
method to find the innermost node
at a given position, returned as a Subtree
.
Each subtree has a parent
pointer, which
points at the subtree for its parent node (if any).
To query direct children, subtrees have
firstChild
,
lastChild
,
childBefore
, and
childAfter
properties.
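For example, to find the innermost node around a position and walk up to an enclosing node of a particular type (a sketch, using a node name from a hypothetical grammar and the start, end, and parent properties described above):
let node = tree.resolve(10) // innermost node at offset 10
while (node && node.name != "FunctionDeclaration")
  node = node.parent
if (node) console.log(node.name, node.start, node.end)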