Thursday, October 20, 2011

TreeQuery: a DSL for syntax tree exploration

My previous post teased about IDE support through Metalua, and among others, about a robust and readable DSL to describe AST visitors. I have hacked together a prototype of this library. This post will start describing what it can do.

This post will showcase TreeQuery's API, the interfaces which allow to deal with trees. Methods and functions fall in two categories:

  • Those which allow to describe an interesting subset of nodes;
  • Those which allow to perform action on those selected nodes.

Source code

The implementation is also cut in two parts:

Part I: selecting nodes in an AST

Queries

TreeQuery works on query objects, which represent sets of AST nodes (those sets are computed lazily). Queries are created by calling treequery() on an AST, and the resulting query initially represents the set of all nodes in the AST. By calling dedicated methods on the query, we'll eliminate irrelevant nodes, until only the wanted ones are left.

In the whole post, we'll alias "treequery" to "Q", to make things a bit terser; consider it a tribute to jQuery's $() shortcut. The following snippet loads the library, create an AST which we will use as example, then a query which represents every node in the AST:

local Q = require 'metalua.treequery'
ast = +{ block:
    local x=1
    for y=1,10 do
        print (x+i)
    end
    return math.cos(x) }
all_nodes = Q(ast)

The +{...} notation is a Metalua syntax which lets write an AST using regular syntax; it's the equivalent to Lisp's quasi-quoting forms (the anti-quote is written -{...}). Without the syntax sugar, ast would be written:

ast = {
    `Local{ { `Id "x" }, { `Number 1 } },
    `Fornum{ `Id "y", `Number 1, `Number 10, {
        `Call{ `Id "print", `Op{ "add", `Id "x", `Id "i" } }
    }
    `Return{ `Call{ `Index{ `Id "math", `String "cos" }, `Id "x" } } }

(the `Foo{bar} notation is itself a standard Metalua shortcut for {tag='Foo', bar}, which improves AST readability).

In the example above, Q(ast) represents all nodes and sub-nodes in ast: the block, the local statement, its "x" binder declaration, the "1" value, the "for" statement, etc. We will now intoduce the different methods which allow to filter those nodes and only keep the ones we're interested in.

:filter(pred), selecting a node according to its properties

The simplest operation consists of only keeping nodes which have a given property. This is done by method :filter(), which takes a predicate (a function taking a node and returning true or false). For instance, the following denotes the set of all nodes in the ast which have the 'Call' tag:

call_nodes = Q(ast) :filter (function(node) return node.tag=='Call' end)

Given the previous definition of ast, this query denotes the nodes +{print(x+i)} and +{math.cos(x)}.

:filter() allows some more sophisticated operations: in addition to receiving the tested node as first parameter, the predicate receives the node's parent as its second parameter, its grand-parent as third, etc. up to the AST root.

For instance, let's say we want to filter function-calls-as-statements (as opposed to function-calls-within-sub-expressions). These are the nodes which (1) have their tag equal to "Call" and (2) are in a block, i.e. their parent's tag is nil.

call_stats = Q(ast) :filter (function(node, parent)
    return node.tag=='Call' and parent and parent.tag==nil
end)

Moreover, we don't want users to write all of their predicates by hand. It's tedious, error-prone, and hard to read back. So we provide a library of standard predicate and predicate generators in TreeQuery.

Q.has_tag(tag_1, tag_2, ... tag_n) will return a predicate, which itself returns true if the node's tag is one of those listed as arguments. The first example, filtering all "Call" nodes, can therefore be rewritten:

call_nodes = Q(ast) :filter(Q.has_tag 'Call')

To further enhance readability, TreeQuery supports some shortcuts for the most common operations. Among others, when it expects a predicate but receives a string (or a sequence of strings), it assumes that the user meant to use the Q.has_tag() predicate generator. The last example can therefore be simplified into:

call_nodes = Q(ast) :filter 'Call'

Testing the parents

As mentioned above, predicates receive not only the node to test, but also its parent, grand-parent etc. up to the root node. We can therefore transform any predicate on a node into a predicate on this node's parent (it's just a matter of removing the predicate's first argument). Q.parent() does just that: for instance, predicate Q.is_block filters nodes which are blocks (i.e. whose tag equals nil); Q.parent(Q.is_block) therefore filters nodes whose parent is a block.

The example about "Call' nodes within blocks can therefore be rewritten by chaining two filters together:

call_stats = Q(ast) :filter 'Call' :filter (Q.parent(Q.is_block))

Positional filtering methods

:filter() acts locally. We also want to filter node according to their relative position to other node. To do that, we offer a series of methods :after(), :not_after(), :under(), not_under(), :under_or_after(), :not_under_or_after(). Each of these take a predicate, and only keep the nodes which are in the specified position relative to a node which passed the predicate.

For instance, if we want to eliminate all nodes under a "Function" node, we can write any of the following:

not_in_func = Q(ast) :not_under (function(x) x.tag=='Function' end)
not_in_func = Q(ast) :not_under (Q.has_tag 'Function')
not_in_func = Q(ast) :not_under 'Function'

Let's turn this into a more useful query: we want to find all return statements in a given function body. Those are the nodes with a 'Return' tag; however, returns which are in a nested 'Function' node must be ignored: they return from the nested function body, not from the body currently considered. For instance, in the following AST:

ast = +{block:
    if foo then return a end
    local function bar()
         return b
    end}

The first "return a" must be selected, but not the second "return b", which belongs to function bar. This is done as follows:

my_returns = Q(ast)
    :filter(Q.has_tag('Return'))
    :not_under(Q.has_tag('Function'))

Or with the string shortcuts:

my_returns = Q(ast) :filter 'Return' :not_under 'Function'

These positional filtering methods are still a work in progress. Today, they are to be interpreted strictly (i.e. a node is not considered to be under nor after itself). The option should also be offered to interpret them inclusively, and it's trivial to implement, but I can't think of a naming scheme or calling convention which makes this inclusiveness choice clear and readable. I'm OK with doubling the number of functions in the API, but I really want the resulting queries to read naturally; suggestions are welcome.

Variables and scopes

Many typical queries have to deal with identifiers: variable renaming, closing and moving a code block, finding rogue globals... For all of these jobs, we must parse variables correctly: detect potential variable captures, recognize occurrences of a same variable without being fooled by homonyms, etc. TreeQuery integrates a native support for these tasks.

First, TreeQuery makes a difference between binders, which create a local variable in a given scope ("for x=... end", "local x", "function (x)... end" etc.), and occurrences, i.e. cases when the variable is used as an expression (e.g. "print(x)").

To do so, it offers predicates to tell apart binders from occurrences, to retrieve the binder associated to a given occurrence (or return nil if it is an occurrence of a global varriable), and conversely to filter occurrences of a given binder. This is a pretty rich matter, which deserves to be explored in a post of its own. Let's just list the available APIs for now:

  • Q.binder(occurrence, ast_root) returns the binder node associated to the occurrence, or nil if the variable is open (i.e. global) in this AST.
  • Q.is_occurrence_of(binder) is a predicate which filters variables which are occurrences of the binder argument.
  • Q.is_binder(...) tests whether an 'Id' node is a binder or an occurrence.

Miscelaneous predicates

  • Q.is_nth(n)(node, ...) is true if node is the child number n of its parent
  • Q.is_nth(a, b)(node, ...) is true if node is the child number #n of its parent, and a <= n <= b
  • Q.is_stat(), Q.is_expr(), Q.is.block() test for statements, expressions, blocks.

Q.child() predicate transformer

This is the converse of Q.parent(): predicate Q.child(n, P) will return true if, when given the n-th child of the node, P returns true. It could be generalized to arbitrary descendants: for instance, Q.child(n1, n2, P) tests P on the n2-th child of the n1-th child of the tested node.

Part II: Acting on selected nodes

We've seen various ways to describe a set of nodes, within an AST root, which we're interested in; we haven't done anything with them yet. This part describes methods intended to act on this nodes set.

Extracting a list with :list()

The simplest way to do whatever you want on the nodes is to get a list of them. This is what's returned by the method :list(). The nodes in the list are ordered according to the depth-first traversal order, i.e. all the nodes in "a(b1(c11,c12),b2(c21,c22))" will be listed in order "{a, b1, c11, c12, b2, c21, c22}".

Extracting the first node with :first()

Sometimes you know you're only looking for one node, and there's no point traversing the rest of the tree once you've found it. :first() will stop as soon as it has found the first matching node (still in depth-first order), and returrn it. Moreover, it will return a multiple value: the node, its parent, grand-parent, etc. up to the root.

For instance, "Q(+{print(1+2*3)}) :filter 'Op' :first()" will return "+{1+2*3}, +{print(1+2*3)}".

Iterating on nodes with :foreach()

the :foreach() method takes one (or two, cf. below) function, and applies it on every node selected by the query. As usual with treequery, the parents of the node are passed as extra parameters to the callback.

(Historic note) In the old metalua.walk library, there were to visitor callbacks, down() and up(). down() corresponds to the top-to-bottom, depth first traversal order, whereas up() was called when going back up the tree. They guaranteed the following invariants:

  • when down() is called on a node, down() has already been called on all of its parents;
  • when up() is called on a node, up() has already been called on all of its children;
  • on a given node, down() is called before up().

This ways of controlling the traversal order sometimes remains important; therefore, when :foreach() receives two callbacks, the first is used as the down() visitor, and the second is used as the up() visitor.

(Not implemented) Mapping transformations on nodes with :map()

This method allows to transform nodes into other things before they're passed to :foreach() callback, :list() or :first(). But maybe more importantly, they allow these transformations to be conditional, to only apply on nodes which pass some predicates. This allows to perform several, partially specific operations in a single pass. :map(p1, p2, ..., pn, f) won't eliminate any node from the query: it will replace the node N which pass all predicates p1,...pn with f(N). Several mappings can be put in a single request, which will either transform different nodes, or transform the same node more than once.

As usual, f receives the node's ancestors. By returning more than one node, f() can transform not only the node itself, but also its ancestry.

A question still to be determined: when several :map() are chained, and a first map already transformed a node, should the predicates of the second map receive the transformed nodes, or the original ones?

(Not implemented) for loop iterators

It would be nicer, and more idiomatic, to let write:

for x, x_parent in Q(ast) :filter 'Call' do
     various_stuff_on_call_nodes(x, x_parent)
end

Instead of:

Q(ast) :filter 'Call' :foreach (function (x, x_parent)
     various_stuff_on_call_nodes(x, x_parent)
end)

This is tricky to do correctly, due to Lua 5.1's "don't yield across C/Lua boundaries" limitation. I keep that in mind nonetheless.

Enough already for this post. Next time I'll show some interesting uses of the query language described above.

No comments: