Sunday, December 16, 2007

Code walkers

When you write advanced macros, or when you're analyzing some code to check for some property, you often need to design a function that walks through an arbitrary piece of code, and does complex stuff on it. Such a function is called a code walker. A typical example of a code walker is CPS transformation (a way to reorganize the code so that it supports continuations). On Lisp contains a good description of such a CPS transformer. Anyway, code walkers are tricky to write, involve a lot of boilerplate code, and are generally brittle. To make things a bit easier, Metalua comes with a walk library, which intends to accelerate code walker implementation. Since code walking is intrinsically tricky, the lib won't magically make it trivial, but at least it will save you some time and code, when compared to writing all walkers from scratch.

Principles

Code walking is about traversing a tree, first from root to leaves, then from leaves back to the root. This tree is not uniform: some nodes are expressions, some others statements, some others blocks; and each of these node kinds is subdivided in several sub-cases (addition, numeric for loop...). The basic code walker just goes from root to leaves and back to root without doing anything. On the simplest form of tree possible, the 'up' and 'down' passages are materialized below by the eponymous function calls in the recursive function below:
function traverse(node)
   down(node)
   for c in children(node) do
      traverse(node)
   end
   up(node)
end
'walk' generates fancy versions of the simple traversal algorithm above, specialized for AST trees. Of course, it lets you define your own up() and down() functions. In addition, it gives you some more control:
  • Your up() and down() functions are different for each kind of node (expr/stat/block); you can use a predicate to indicate whether these functions should be called or not, typically based on the tags you're interested in.
  • When going down the tree, you can decide not to walk sub-nodes, by making your down() function return 'break'. For instance, you might decide that you're not interested to parse the body of functions. Then you put in your down() function something like match x with `Function{ params, body } -> return 'break' end

API

Now, about how to implement that in Metalua with walk.lua. walk contains 3 interesting functions: walk.expr(cfg), walk.stat(cfg) and walk.block(cfg), which respectively generate a walker for the AST of an expression, a statement and a block of statements. cfg is a table which describes the exact behavior of that walker. cfg is cut in 3 sections expr, stat and block, each describing how to parse the nodes of the corresponding kinds. Notice that an expr node can have block subnodes, stat nodes can have expr subnodes etc. Each of these 3 sections can contain any of the following fields:
  • down, the function to be applied on every node of this kind when going down;
  • up, the function to be applied on every node of this kind when going back up;
  • pred, the predicate which indicates whether up and down should be called. It can be:
    • a string: in this case the predicate is considered true if the node's tag is equal to that string;
    • a list of strings: in this case the predicate is considered true if the node's tag is one of the list's strings;
    • a function: in that case the function is the predicate, i.e. it takes the node as a parameter, and the returned result indicate whether the up/down function should be run;
    • a boolean: if true, the predicate is always true; if false, always false;
    • nothing: in that case, the predicate is considered true and the up/down function is run.
Moreover, the down() functions might return the string 'break'. In that case, it prevents the walking to go through the node's subnodes, and the node's up() function won't be called either. This allows to shortcut whole subtrees. A couple of details:
  • In nodes that create new local variables, those new variables ( `Local{ X, _ }, `Localrec{ X, _ }, `Fornum{ X, ... }, `Forin{ X, ... } and `Function{ X, ... }), the `Id sub-nodes that declare those variables aren't treated as expressions. For instance, in "local x=y"`Local{ {`Id 'x'}, {`Id 'y'} }, y is walked in as an expression, but not x. If you want to do something with those left-hand-side identifiers, you have to get them from their parent node.
  • Unkown nodes will cause a warning, and their sub-nodes won't be traversed by default (we can't know hoe to do this), but they won't crash the walker. This allows to walk through code with phony nodes, typically to turn them into real AST.
  • When a function call or a method invocation is used as a statement, it won't be traversed as an expression (although its subparts will): this allows to apply some functions only on statement calls.

First example

A bit of practice now. Let's build the simplest walker possible, that does nothing:
cfg = { }
walker = walk.block(cfg)
Now, let's say we want to catch and remove all statement calls to function assert(). This can be done by removing its tag and content: an empty list is simply ignored in an AST. So we're only interested by `Call node, and within these nodes, we want the function to be `Id 'assert'. All this is only relevant to stat nodes:
cfg.stat = {
   pred = 'Call',
   down = function(x)
      match x with
      | `Call{ `Id 'assert', ... } ->
         x.tag=nil; for i=1,#x do x[i]=nil end -- function call erased
      | `Call{ ... } -> -- nothing to do, it's a normal call
      | _ -> assert(false) -- impossible, pred filtered those out
      end
   end }
Notice that although sometimes handy, pred is not strictly necessary: the filtering could be done inside up() and down() functions. We'll now remove assert() calls in non-statement; these will simply be replaced by nil:
cfg.expr = {
   pred = 'Call',
   down = function(x)
      match x with
      | `Call{ `Id 'assert', ... } ->
         x.tag='Nil'; for i=1,#x do x[i]=nil end -- node is now `Nil
      | `Call{ ... } -> -- nothing to do, it's a normal call
      | _ -> assert(false) -- impossible, pred filtered those out
      end
   end }
Here's a remark for functional programmers: this API is very imperative; you might cringe at seeing the `Call nodes transformed in-place. Well, I tend to agree but it's generally counter-productive to work against the grain of the wood: Lua is imperative at heart, and my attempts at doing this functionally sucked more than approaches that embraced imperativeness. Feel free to find and publish a better approach :) Anyway, this idea of overriding the content of an AST node with another is pretty handy, and there's a table.override function in the standard lib which does just that, so the code above should be written:
-- [...]
match x with
| `Call{ `Id 'assert', ... } -> table.override(x, `Nil)
-- [...]

Cuts

By making down() return 'break', you can prevent the traversal to go further down. This might be either because you're not interested by the subtrees, or because you want to traverse them in a special way. In that later case, just do the traversal by yourself in the down() function, and cut the walking by returning 'break', so that nodes aren't re-traversed by the default walking algorithm. We'll see that in the next, more complex example.

Advenced example: free variables listing

In this example, we'll progressively build a walker that gathers all global variables used in a given block AST. This involves keeping, at all times, a set of the identifiers currently bound by a "local" declaration, by function parameters, as for loop variables etc. Then, every time an identifier is found in the AST, its presence is checked in the current set of bound variables. If it isn't in it, then it's a free (global) identifier. The first thing we'll need is a scope handling system: something that keeps track of what identifiers are currently in scope. It must also allow to save the current scope (e.g. before we enter a new block) and restore it afterwards (e.g. after we leave the block). This is quite straightforward and unrelated to code walking; here is the code:
require 'std'
require 'walk'

-{ extension 'match' }

--------------------------------------------------------------------------------
-- Scope handling: ':push()' saves the current scope, ':pop()'
-- restores the previously saved one. ':add(identifiers_list)' adds
-- identifiers to the current scope. Current scope is stored in
-- '.current', as a string->boolean hashtable.
--------------------------------------------------------------------------------

local scope = { }
scope.__index = scope

function scope:new()
   local ret = { current = { } }
   ret.stack = { ret.current }
   setmetatable (ret, self)
   return ret
end

function scope:push()
   table.insert (self.stack, table.shallow_copy (self.current))
end

function scope:pop()
   self.current = table.remove (self.stack)
end

function scope:add (vars)
   for id in values (vars) do
      match id with `Id{ x } -> self.current[x] = true end
   end
end
Now let's start designing the walker. We'll keep a scope object up to date, as well as a set of found free identifiers, gathered every time we find an `Id{ } node. To slightly simplify matter, we'll consider that the AST represent a block.
local function fv (term)
   local freevars = { }
   local scope    = scope:new()
   local cfg = { expr  = { pred = "Id" } }

   function cfg.expr.down(x)
      match x with
      | `Id{ name } -> if not scope.current[name] then freevars[name] = true end
      end
   end

   walk.block(cfg)(term)
   return freevars
end
Since we don't ever add any identifier to the scope, this will just list all the identifiers used in the AST, bound or not. Now let's start doing more interesting things:
  • We'll save the scope's state before entering a new block, and restore it when we leave it. That will be done by providing functions cfl.block.down() and cfg.block.up(). Saving and restoration will be performed by methods :push() and :pop().
  • Whenever we find a 'local' declaration, we'll add the list of identifiers to the current scope, so that they won't be gathered when parsing the `Id expression nodes. Notice that the identifiers declared by the 'local' statement only start being valid after the statement, so we'll add them in the cfg.stat.up() function rather than cfg.stat.down()
  local cfg = { expr  = { pred = "Id"  },
             stat  = { pred = "Local" },
             block = { } }
   [...]
   function cfg.stat.up(x)
      match x with
      | `Local{ vars, ... } -> scope:add(vars)
      end
   end

   -----------------------------------------------------------------------------
   -- Create a separate scope for each block, close it when leaving.
   -----------------------------------------------------------------------------
   function cfg.block.down() scope:push() end
   function cfg.block.up()   scope:pop()  end
This starts to be useful. We can also easily add the case for `Localrec{ } nodes (the ones generated by "local function foo() ... end"), where the variable is bound *in* the `Localrec statement, not after; we do the same as for `Local, but we do it in the down() function rather than in the up() one. We'll also take care of `Function, `Forin and `Fornum nodes, which introduce new bound identifiers as function parameters or loop variables. This is quite straightforward; the only thing to take care of is to save the scope before entering the function/loop body (in down()), and restore it when leaving (in up()). The code becomes:
local function fv (term)
   local freevars = { }
   local scope    = scope:new()
   local cfg = { expr  = { pred = { "Function", "Id" } },
                 stat  = { pred = { "Forin", "Fornum", "Local", "Localrec" } } }

   -----------------------------------------------------------------------------
   -- Check identifiers; add functions parameters to newly created scope.
   -----------------------------------------------------------------------------
   function cfg.expr.down(x)
      match x with
      | `Id{ name } -> if not scope.current[name] then freevars[name] = true end
      | `Function{ params, _ } -> scope:push(); scope:add (params)
      end
   end

   -----------------------------------------------------------------------------
   -- Close the function scope opened by 'down()'.
   -----------------------------------------------------------------------------
   function cfg.expr.up(x)  
      match x with
      | `Function{ ... } -> scope:pop()
      | `Id{ ... }       -> -- pass
      end
   end

   -----------------------------------------------------------------------------
   -- Create a new scope and register loop variable[s] in it
   -----------------------------------------------------------------------------
   function cfg.stat.down(x)
      match x with
      | `Forin{ vars, ... }    -> scope:push(); scope:add(vars)
      | `Fornum{ var, ... }    -> scope:push(); scope:add{var}
      | `Localrec{ vars, ... } -> scope:add(vars)
      | `Local{ ... }          -> -- pass
      end
   end

   -----------------------------------------------------------------------------
   -- Close the scopes opened by 'up()'
   -----------------------------------------------------------------------------
   function cfg.stat.up(x)
      match x with
      | `Forin{ ... } | `Fornum{ ... } -> scope:pop()
      | `Local{ vars, ... }            -> scope:add(vars)
      | `Localrec{ ... }               -> -- pass
      end
   end

   -----------------------------------------------------------------------------
   -- Create a separate scope for each block, close it when leaving.
   -----------------------------------------------------------------------------
   function cfg.block.down() scope:push() end
   function cfg.block.up()   scope:pop()  end

   walk.block(cfg)(term)
   return freevars
end
This is almost correct now. There's one last tricky point of Lua's semantics that we need to address: in "repeat foo until bar" loops, "bar" is included in "foo"'s scope. For instance, if we write "repeat local x=foo() until x>3", the "x" in the condition is the local variable "x" declared inside the body. This violates our way of handling block scopes: the scope must be kept alive after the block is finished. We'll fix this by providing a custom walking for the block inside `Repeat, and preventing the normal walking to happen by returning 'break':
  local cfg  = { expr  = { pred = { "Function", "Id" } },
                 stat = { pred = { "Forin", "Fornum", "Local",
                                "Localrec", "Repeat" } },
                 block = { } }

   [...]

   -----------------------------------------------------------------------------
   -- Create a new scope and register loop variable[s] in it
   -----------------------------------------------------------------------------
   function cfg.stat.down(x)
      match x with
      | `Forin{ vars, ... }    -> scope:push(); scope:add(vars)
      | `Fornum{ var, ... }    -> scope:push(); scope:add{var}
      | `Localrec{ vars, ... } -> scope:add(vars)
      | `Local{ ... }          -> -- pass
      | `Repeat{ block, cond } -> -- 'cond' is in the scope of 'block'
         scope:push()
         for s in values (block) do walk.stat(cfg)(s) end -- no new scope
         walk.expr(cfg)(cond)
         scope:pop()
         return 'break' -- No automatic walking of subparts 'cond' and 'body'
      end
   end
That's it, we've now got a full free variables lister, and have demonstrated all the APIs offered by 'walk'. Notice however that the library isn't mature yet, and its interface might change before metalua loses its alpha label.

4 comments:

Anonymous said...

consider changing the stylesheet. lightgrey on white is unreadable

Fabien said...

> lightgrey on white is unreadable.
Indeed; fixed.

Anonymous said...
This comment has been removed by a blog administrator.
Anonymous said...

Good for people to know.