Monday, December 1, 2008

Refactoring Lua code with Metalua

Metalua is primarily designed as a self-extensible language, close to Scheme through Lua's heritage, but with a more opinionated syntax than lisps (I've written something about the importance of opinionated syntax over there). But its static introspection features and its backward compatibility with Lua make it a potentially powerful tool for plain Lua code analysis and refactoring. This post will be about the first steps toward support for Lua source refactoring in Metalua. The main service offered by Metalua, in this domain, is to let users manipulate AST rather than ASCII text. However, AST are much easier to analyze and modify than source code, but they don't carry as much information as sources do: the later have some presentation features (indentation, skipped lines, comments, uses of syntax sugar, even code generating macros in Metalua) which are not present in an AST, yet must be preserved when refactoring sources. The most natural reflex would be to enrich the AST grammar to the point where it contains as much information as the source code, but that would be a ridiculously complex structure, which would become unpractical to define, learn and handle. So we take a different approach: every node of an AST generated from sources will remember, in an optional "lineinfo" field, the first and last characters which defined it in the source code. For instance, let's consider this code and its AST:
"for i=1,10 do print(i) end"
`For{ `Id "i", `Number 1, `Number 10, 
      { `Call{ `Id "print", `Id "i" } } }
The `For node will have a lineinfo range 1-27; the first `Id "i" will have 5-5, the second 21-21; `Call will have 15-22, and in it, `Id "print", 15-19; etc. Now, for each node we know both its original textual representation, and its AST structure. The idea is that when we refactor a piece of code:
  • we'll keep every untouched parts of the code under their original form;
  • we'll resynthetize code from AST for the parts we modified, and only those.
In this example, let's say that I decide to change "print" into "_G.print". In AST terms, that would be replacing `Id "print" with `Index{ `Id "_G", `String "print" }. The replacement in AST is trivial (ast [4] [1] = +{_G.print}). When regenerating the source code of the for-loop, we need to take its "skeleton", its source with holes in the places occupied by its children nodes:
for [ ] = [ ], [ ] do [ ] end
Then in this source skeleton, which keeps the original formatting, we replace each of the children-holes by the corresponding child's regenerated source:
  • the one taken straight out of the original sources for the first 3 children;
  • and a synthesized source, generated from the lineinfo-free +{_G.print} AST node, for the last one that we modified.
The first step happened a while ago: it has been to properly implement and fix the lineinfo generation system, thanks to users' friendly pressure :). That feature had been planned for a long time but remained quite sketchy until then. Step #2 now: ability to regenerate a source file from its AST+source. I won't insist on the ability to synthesize source out of raw AST, only on the ability to weave together the bits of sources, recursively, from AST leaves to top-level node; since I don't plan to mix those recomposed parts with generated ones yet, the initial goal is simply to retrieve the original source file! The program is basically a a code walker, which will, on its way down,
  • check which AST nodes can be weaved: to be able to weave it, we need to be able to extract its skeleton form the original sources, so we need the node and all of its direct children to have lineinfo fields; The list of conforming nodes will be kept in a weaveable node->Boolean table.
  • generate an easy to read parent / child relation. This is not the same as table inclusion in the AST: some tables in the AST don't represent an expression, a statement or a block, and we're only interested into these element kinds. The relation will be kept in an ast_children parent->children list table.
Here's the code of this first walker:
require 'walk'

function weave_ast (src, ast, name)
   local ast_children, weaveable, node_cfg = { }, { }, { }

   function node_cfg.down(ast, parent)
      if   not ast.lineinfo 
      then weaveable [parent] = nil
      else weaveable [ast] = true end
      ast_children [ast] = { }
      if parent then table.insert (ast_children [parent], ast) end
   end

   local cfg = { expr=node_cfg; stat=node_cfg; block=node_cfg }
   walk.block (cfg, ast)
end

local src = ???
local ast = mlc.luastring_to_ast (src, name)
weave_ast (src, ast)
Now, this information can be used on our way "back up": each weaveable node will be generated by stitching its children's sources in its own source skeleton. This has to be done children first, parents later, of course. Moreover, we'll need a tiny trick here: the code walker is imperative, not functional, it's designed to modify ASTs rather than returning a value. To "return" a node's source, we'll create a gen_source AST->source string table, which will remember all of the translatations for the duration of the operations.
function weave_ast (src, ast, name)
   local gen_source, ast_children, weaveable, node_cfg = { }, { }, { }, { }

   function node_cfg.down(ast, parent) ... end

   function node_cfg.up(ast)
      local function weave_skeleton (ast)
         local buffer = { }
         local a, d = ast.lineinfo.first[3], ast.lineinfo.last[3]
         for child in ivalues (ast_children [ast]) do
            local b, c = child.lineinfo.first[3], child.lineinfo.last[3]
            table.insert (buffer, acc (src:sub (a, b-1)) .. acc (gen_source [child]))
            a = c+1
         end
         gen_source [ast] = table.concat (buffer) .. src:sub (a, d)
      end
      if weaveable [ast] then weave (ast)
      else error "Can't handle nodes without proper lineinfo yet" end
   end

   local cfg = { expr=node_cfg; stat=node_cfg; block=node_cfg }
   walk.block (cfg, ast)
   return gen_source [ast]
end
This will be enough for today. Next steps will be:
  • step #3: ability to generate decent sources from an AST without any lineinfo nor sources.
  • step #4: ability to mix both approaches, generate decent sources from an AST that's only partially lineinfo'd.

No comments: