Tuesday, December 2, 2008

Refactoring Lua with Metalua: source generation

In a previous post, I've begun to write a system to keep the AST / source code of a program in sync together; as a result, we were able to associate a source skeleton (string with holes in it for children nodes) to any given AST node, provided that this node was generated from sources.

To make this useful, we must be able to generate source code from modified and/or generated AST parts; this post will describe how to do so. Be warned, there's nothing really fancy about it, it's pretty much brute force, systematic code walking. The resulting code is available in the Metalua repository, there:

src/samples/synth.mlua

First, I won't bother to use the walk library that comes with Metalua. This lib is a support for the visitor pattern, which is useful if you want to leave alone most of the AST nodes, or if you want to do pretty much the same thing on each node. In short, it saves its users from writing the tree traversal code for each kind of node, but this won't save much work if you have to do something different for each kind of node anyway.

So everything will be embedded in a synth class: this class will:
  • keep a generated source code accumulator;
  • keep track of the current indentation level;
  • Support some basic code creation methods: adding strings (:acc()), properly indented newlines (:nl()), handling indentation level (:nlindent() and :nldedent()),
  • and actually walk the AST nodes (:node() and :list()).
I won't go into a painstaking description of each of these methods, and the source file above is extensively commented anyway. The only noteworthy point is the way :node() works: its job is to properly generate source from an AST node, and it does so by delegating to a method specialized for the given node's tag. That's one of those cases where Lua's "Everything is a table" paradygm is useful: an AST whose tag is "Foo" is going to be rendered by a method synth:Foo(ast, ...). There's no need to establish an explicit tag->method association, the method's name is the association. So :node() dispatches nodes to those specialized methods, which perform series of calls to :acc(), :nl(), :nlindent(), :nldedent()), :node() and :list().

Now Lua supports some syntax sugar, and there are some patterns which should be recognized and printed in a nicer way:
  • "foo['bar']" should be printed "foo.bar" when bar is a valid identifier name;
  • "x.y.z = function() ... end" should be printed "function x.y.z() ... end";
  • "x.y.z.methodname = function (self, ...) end" should be printed "function x.y.z:methodname (...) end";
  • "{ ['foo'] = bar }" should be printed "{ foo = bar }" if foo is a valid identifier name;
  • "foo('bar')", "foo({bar})", "foo:bar('gnat')" and "foo:bar({gnat})" should be respectively printed "foo 'bar'", "foo{ bar }", "foo:bar 'gnat'" and "foo:bar{ gnat }";
  • Unnecessary parentheses around operators should be avoided.

All of these patterns, plus a couple of trivial ones, are easily detected with structural pattern matching: This is the main reason why synth's code is significantly nicer in Metalua than it would have been in plain Lua. As a result, it's a fairly good example of advanced patterns usage, for a purpose that will be intuitive to all Lua programmer.

This example will now have, in a future post, to be mixed with the "weaver" from a previous post. This weaver caused an error when it hadn't enough information to retrieve an AST's code skeleton. Now instead of causing an error, it should ask the synthetizer to produce decent looking source code for such AST parts. We'll see that it still hides a couple of non-trivial issues :)

Finally, I'd like to state that this program is useful by itself, as opposed to the previous one: the source it produces is readable, significantly more so than big chunks of ASTs, and as such it is a nice way to debug code generated by hairy macros; I've written it 3 days ago and I'm already dependent on it! It will certainly be added, under a form or another, to the compiler's optional outputs.

Metalua moves to Github

I've cloned the Metalua repository from http://repo.or.cz/w/metalua.git to Github, and plan to use the later as the primary repository. The rationale, technical reason behind this migration is that Github's pages look nicer :) More seriously, I'm eager to see whether "something interesting" happens thanks to Github's social networking dimension.

Next commits will be done in priority on Github, although I'll try to keep repo.or.cz updated regularly.

Monday, December 1, 2008

Refactoring Lua code with Metalua

Metalua is primarily designed as a self-extensible language, close to Scheme through Lua's heritage, but with a more opinionated syntax than lisps (I've written something about the importance of opinionated syntax over there). But its static introspection features and its backward compatibility with Lua make it a potentially powerful tool for plain Lua code analysis and refactoring.

This post will be about the first steps toward support for Lua source refactoring in Metalua. The main service offered by Metalua, in this domain, is to let users manipulate AST rather than ASCII text. However, AST are much easier to analyze and modify than source code, but they don't carry as much information as sources do: the later have some presentation features (indentation, skipped lines, comments, uses of syntax sugar, even code generating macros in Metalua) which are not present in an AST, yet must be preserved when refactoring sources.

The most natural reflex would be to enrich the AST grammar to the point where it contains as much information as the source code, but that would be a ridiculously complex structure, which would become unpractical to define, learn and handle. So we take a different approach: every node of an AST generated from sources will remember, in an optional "lineinfo" field, the first and last characters which defined it in the source code.

For instance, let's consider this code and its AST:

"for i=1,10 do print(i) end"
`For{ `Id "i", `Number 1, `Number 10,
{ `Call{ `Id "print", `Id "i" } } }

The `For node will have a lineinfo range 1-27; the first `Id "i" will have 5-5, the second 21-21; `Call will have 15-22, and in it, `Id "print", 15-19; etc. Now, for each node we know both its original textual representation, and its AST structure. The idea is that when we refactor a piece of code:
  • we'll keep every untouched parts of the code under their original form;
  • we'll resynthetize code from AST for the parts we modified, and only those.
In this example, let's say that I decide to change "print" into "_G.print". In AST terms, that would be replacing `Id "print" with `Index{ `Id "_G", `String "print" }. The replacement in AST is trivial (ast [4] [1] = +{_G.print}).

When regenerating the source code of the for-loop, we need to take its "skeleton", its source with holes in the places occupied by its children nodes:

for [ ] = [ ], [ ] do [ ] end


Then in this source skeleton, which keeps the original formatting, we replace each of the children-holes by the corresponding child's regenerated source:
  • the one taken straight out of the original sources for the first 3 children;
  • and a synthesized source, generated from the lineinfo-free +{_G.print} AST node, for the last one that we modified.

The first step happened a while ago: it has been to properly implement and fix the lineinfo generation system, thanks to users' friendly pressure :). That feature had been planned for a long time but remained quite sketchy until then.

Step #2 now: ability to regenerate a source file from its AST+source. I won't insist on the ability to synthesize source out of raw AST, only on the ability to weave together the bits of sources, recursively, from AST leaves to top-level node; since I don't plan to mix those recomposed parts with generated ones yet, the initial goal is simply to retrieve the original source file!

The program is basically a a code walker, which will, on its way down,
  • check which AST nodes can be weaved: to be able to weave it, we need to be able to extract its skeleton form the original sources, so we need the node and all of its direct children to have lineinfo fields; The list of conforming nodes will be kept in a weaveable node->Boolean table.
  • generate an easy to read parent / child relation. This is not the same as table inclusion in the AST: some tables in the AST don't represent an expression, a statement or a block, and we're only interested into these element kinds. The relation will be kept in an ast_children parent->children list table.


Here's the code of this first walker:
require 'walk'

function weave_ast (src, ast, name)
local ast_children, weaveable, node_cfg = { }, { }, { }

function node_cfg.down(ast, parent)
if not ast.lineinfo
then weaveable [parent] = nil
else weaveable [ast] = true end
ast_children [ast] = { }
if parent then table.insert (ast_children [parent], ast) end
end

local cfg = { expr=node_cfg; stat=node_cfg; block=node_cfg }
walk.block (cfg, ast)
end

local src = ???
local ast = mlc.luastring_to_ast (src, name)
weave_ast (src, ast)

Now, this information can be used on our way "back up": each weaveable node will be generated by stitching its children's sources in its own source skeleton. This has to be done children first, parents later, of course. Moreover, we'll need a tiny trick here: the code walker is imperative, not functional, it's designed to modify ASTs rather than returning a value. To "return" a node's source, we'll create a gen_source AST->source string table, which will remember all of the translatations for the duration of the operations.
function weave_ast (src, ast, name)
local gen_source, ast_children, weaveable, node_cfg = { }, { }, { }, { }

function node_cfg.down(ast, parent) ... end

function node_cfg.up(ast)
local function weave_skeleton (ast)
local buffer = { }
local a, d = ast.lineinfo.first[3], ast.lineinfo.last[3]
for child in ivalues (ast_children [ast]) do
local b, c = child.lineinfo.first[3], child.lineinfo.last[3]
table.insert (buffer, acc (src:sub (a, b-1)) .. acc (gen_source [child]))
a = c+1
end
gen_source [ast] = table.concat (buffer) .. src:sub (a, d)
end
if weaveable [ast] then weave (ast)
else error "Can't handle nodes without proper lineinfo yet" end
end

local cfg = { expr=node_cfg; stat=node_cfg; block=node_cfg }
walk.block (cfg, ast)
return gen_source [ast]
end

This will be enough for today. Next steps will be:
  • step #3: ability to generate decent sources from an AST without any lineinfo nor sources.
  • step #4: ability to mix both approaches, generate decent sources from an AST that's only partially lineinfo'd.

Tuesday, July 22, 2008

Statements as expressions

As most Lua users and every Metalua users know, Lua makes a difference between expressions and statements, and it's sometimes cumbersome to put a statement where the context expects an expression. This can be addressed in Lua by using a function closure, but at a considerable runtime cost.

Metalua introduces a `Stat{ } AST node which lets metaprogrammers introduce a statement in an expression's context, but no concrete syntax was provided, partly to discourage the use of stats-as-exprs (which often is of poor taste), partly because I couldn't come up with a satisfying syntax. In this post I'll propose a syntax that seems better, if trickier to implement, than anything I've come by until now.

`Stat{ block, expr } is an expression node, which evaluates the block, then evaluates the expression in the block's context (therefore having access to its local variables; that's also how "repeat foo until bar" works in Lua, where condition "bar" has access to "foo"'s local variables). The node's expression value is the second child's value. For instance, the following AST, when executed, prints 42:
print( -{ `Stat{ +{block: local x = 21}, +{expr: 2*x} })

I propose to present stat node between keyword "stat" ... "end". In that block, the last statement should be a return, and its value is the returned expression. The example above would be written: print(stat local x=21; return 2*x end).

This is easily done by analyzing the content of the block and extracting the content of the return statement:
mlp.lexer:add{ "stat" }
mlp.expr:add{ "stat", mlp.block, "end", builder = stat_builder }
function stat_builder(x)
local returned_expr
local walker_cfg = { ... } -- Here the magic happens
walk.block (walker_cfg, x[1])
return `Stat{ x[1], expr }
end

Walking a block to extract returns (which are not inside a nested function) has already been addressed multiple times, e.g. here. Now what happens when no return or multiple returns are found? If there's no return, it's easy, just return nil. So above, we simply initialize returned_expr at +{nil}. In case of multiple returns, we could fail at compile time, but we can actually do better, and rewrite this as follows (supposing that goto and label are supported by the syntax):
print(stat 
local x=21;
if foo then return foo else return 2*x end
end)

is rewritten as:
print(stat
local $1=nil
local x=21
if foo then $1=foo; goto 'retval'
else $1=2*x; goto 'retval' end
label 'retval'
return $1
end)

This way, multiple returns are handled gracefully. Moreover, this code has a semantically equivalent (although much less efficient) Lua implementation: "stat block end" can be translated into the closure "((function() block end)())".

The complete implementation looks like this:
-{ extension 'xmatch' }

require 'walk'

local function stat_builder (x)
local body, result = x[1]

-- Accumulate direct return statements in `return_stats' list
local return_stats = { }
local walker_cfg = { expr={ }, stat={ } }
match function walker_cfg.stat.up
| `Return{ _ } == x -> table.insert (return_stats, x)
| `Return{ } == x -> x[1] = +{nil}; table.insert (return_stats, x)
| `Return{...} -> error "No multi-returns in stat blocks!"
| _ ->
end
match function walker_cfg.expr.down
| `Function{...} -> return 'break' | _ ->
end
walk.block (walker_cfg, x[1])

match return_stats with
-- No return statement -> resulting value is nil, don't change the body
| { } -> result = +{ nil }
-- One return statement at the end -> just extract it
| { ret } if body[#body] == ret -> result = table.remove (body) [1]
-- Several returns -> use variable and goto statements
| _ ->
local retvar = mlp.gensym 'stat_value'
local label = mlp.gensym 'end_of_stat'
table.insert (body, 1, `Local{ {retvar}, { } })
table.insert (body, `Label{ label })
for ret in ivalues (return_stats) do
ret.tag = nil
ret <- { `Set{ {retvar}, {ret[1]} }, `Goto{ label } }
end
result = retvar
end
return `Stat{ body, result }
end

mlp.lexer:add{ "stat" }
mlp.expr:add{ "stat", mlp.block, "end", builder = stat_builder }


(I've used the xmatch extension described in a previous post).

Notice, if you find that implementation too verbose, that it performs some operations more complex than most macros found in other languages: code analysis, different replacements depending on the code structure, (manual) hygiene within a Lisp-1 context... The simplest possible implementation, not supporting advanced code walking, would have been:
function stat_builder (x)
local body, result = x[1], +{ nil }
if body[#body].tag == 'Return' then
result = table.remove (body) [1]
end
return `Stat{ body, result }
end

mlp.lexer:add{ "stat" }
mlp.expr:add{ "stat", mlp.block, "end", builder = stat_builder }


The only limitation of the complete version is that multiple returns inside a stat blocks aren't supported. I'd rather keep it that way: the implementation of `Stat{ } is already a hack, since Lua really dislikes statements-as-expressions, at low level as well as at the syntax level. Making it multiple-values compatible would be horribly messy for little added value.

However, it could be handled through rewriting: when at least one return in the block might have multiple values (that means, all returns with a comma, and all returns of a function call), we can:
  • Change all "return foo" statements into "return {foo}".
  • Change the default return value from +{nil} to +{{}}
  • Surround the `Stat{ } node with a call to unpack().
Definitely not worth the kludge, nor the performance hit when treating a statement of the form "return f()", in my opinion. However, just because I can, here is the code:
-{ extension 'xmatch' }

require 'walk'

local multireturn_tags = table.transpose{ 'Dots', 'Call', 'Invoke' }

local function stat_builder (x)
local body, result = x[1]

-- Accumulate direct return statements in `return_stats' list
local multireturn = false
local return_stats = { }
local walker_cfg = { expr={ }, stat={ } }
match function walker_cfg.stat.up
| `Return{...} == x ->
if #x>1 or multireturn_tags[x[#x].tag] then multireturn = true end
table.insert (return_stats, x)
| _ ->
end
match function walker_cfg.expr.down
| `Function{...} -> return 'break' | _ ->
end
walk.block (walker_cfg, x[1])

match return_stats with
-- No return statement -> resulting value is nil, don't change the body
| { } -> result = +{ nil }; assert(not multireturn)
-- One return statement at the end -> just extract it
| { ret } if body[#body] == ret ->
if multireturn then
result = table.remove (body)
result.tag = 'Table' -- change return stat into list
else
result = table.remove (body) [1]
end
-- Several returns -> use variable and goto statements
| _ ->
local retvar = mlp.gensym 'stat_value'
local label = mlp.gensym 'end_of_stat'
table.insert (body, 1, `Local{ {retvar}, { multireturn and `Table } })
table.insert (body, `Label{ label })
for ret in ivalues (return_stats) do
local this_result
if multireturn then
this_result = table.shallow_copy(ret)
this_result.tag = 'Table'
else
this_result = ret[1]
end
ret.tag = nil
ret <- { `Set{ {retvar}, {this_result} }, `Goto{ label } }
end
result = retvar
end
if multireturn then
return `Call{ `Id "unpack", `Stat{ body, result } }
else
return `Stat{ body, result }
end
end

mlp.lexer:add{ "stat" }
mlp.expr:add{ "stat", mlp.block, "end", builder = stat_builder }

Wednesday, February 27, 2008

One more catch

(This post might only be interesting to people with fairly good Lua proficiency).

Some time ago, I wrote a post about designing a try...catch extension. The point was to show that even seemingly simple extensions require care to many corner cases, if one wants them to work reliably; and the implied consequence was that lower level approaches to meta-programming, akin to source preprocessing, are not suitable to write robust language features.

I unwillingly demonstrated this by leaving an corner case unattended. The extension wraps portions of code that might fail into a "pcall(function()...end)"; this breaks return statements: they would return from the newly created function, not from the surrounding user's function as he would expect. Replacing return statements with something suitable was one of the extension's tasks (and one that would be hard to achieve with a code preprocessor). However, I forgot that "...", the extra arguments of a function call, can't be passed across function boundaries either. If they're used in a pcall+function wrapper, they have to be stored in a temporary table and unpacked when used. Here's the piece of code that has shown the issue:


-{ extension 'withdo' }

function run (fmt, ...)
with h = io.popen(string.format(fmt, ...), 'r') do
return h:read '*a'
end
end


It uses "with...do...end", which limits the lifetime of a resource (here a file handle) to a lexical scope: the "h" declared in "with h = ... do ... end" will be closed once "end" is reached, even if returns and/or errors happen in the block; of course, it is implemented using the try/catch extension, and expands to:


function run (fmt, ...)
do
local h
try
h = io.popen(string.format(fmt, ...), 'r') do
return h:read '*a'
finally
h:close()
end
end
end


This in turns is equivalent to the following plain Lua code:


function run (fmt, ...)
do
local h
local args = {...}
local caught_return
local user_success, user_error = pcall( function()
h = io.popen(string.format(fmt, unpack(args)), 'r')
caught_return = { h:read '*a' }
end)
h:close()
if user_success then
if caught_return then return unpack(caught_return) end
else
error (user_error)
end
end
end


The (repeated) moral of the story is: language extensions are even harder than regular code to make reusable. If your language encourages writing quick and dirty macros, it will have a hard time growing extensive libraries. And if your meta-programming framework hasn't got a deep understanding of the code it manipulates, it's simply pointless.

Tuesday, February 26, 2008

shorter lambdas

In a stunning illustration of the bikeshed principle, Arc's feature that got the most attention and praises has been its short notation for anonymous functions: "[foo _]" is equivalent to "(fn (_) (foo _))". Since I'm in an experimental mood these days, let's give it a try, as it's merely a one-liner.

Using "[...]" to mark the function's boundaries wouldn't be a great idea,
as it would cause ambiguities in table constructors, and when used at the beginning of a statement, might be agregated to the previous statement as an `Index accessor. Instead, I'll use a prefix antislash; there's no need for a closing marker: in Lua, the end of an expression can be determined implicitly. Here's the meta-code, to put directly in the file where the notation is to be used:

-{ mlp.expr.prefix:add{ '\\', prec=5, builder=|_,x| +{|_|-{x}} } }


It can be used, for instance, as:

table.foreach (mlp, \print('key:', _))


I chose to give it a very low precedence, this is what seemed the most sensible. When you're not happy with this, use parentheses.

What good is that extension? It can be used instead of the already short "|_| foo(_)" notation; it saves two characters, and its usage is restricted to single parameter functions; so in terms of code shortening, it only makes sense for the tiniest functions. The most decent example I could think of is "\_[1]" to extract the first element of a list.

Clearly this has no interest in terms of saving tokens nor AST nodes, yet I can't help but finding this notation rather likeable. Maybe it has some auto-documentary mojo in it? By not even bothering to give a name to the parameter of a tiny function, we give it some sort of point-free flavor: "\_[1]" is read as: "[1] as a function", rather than "the function that takes '_' and returns '_[1]'".

It would also look nice in "\_+1"; in short, it seems to have a role equivalent to Haskell's sections, which allow to write "(+)" instead of "|x,y| x+y", "(+1)" instead of "|x| x+1", "(1+)" instead of "|x| 1+x" etc. In any case, the notation seems to only make sense in very functional code. It should be tried next time I have an occasion...

Thursday, February 21, 2008

Syntax experiments

Now that metalua 0.4 has been released, I took a bit of time to play with syntax. I'm usually reluctant to superficial syntax tweaking: 90% of the time, it breaks source compatibility for a marginal or even negative increase in readability. However, it's impossible to identify the remaining 10% interesting cases unless you give them a fair shot. Hence I introduced two experimental extensions:
  • xloop the extended loop syntax;
  • xmatch the extended pattern matching syntax

xloop

The original idea behind xloop stemmed from Common Lisp's infamous loop macro. This is an incredibly powerful and arcane macro. It's quite fun to hack it, but it's also a surefire way to get unmaintainable code in most team coding conditions. So I reduced its scope so dramatically that, well, it's got almost nothing left from the original loop :) But I've got what I believe is a sound base, on which I'll be able to gradually add new features to try.

The central idea is that a Lua loop consists of a loop header, followed by a body "do ... end". What I'm going to provide is the ability to chain several headers, so that you can write:
for i=0,9 for j=0,90,10 do
print (i+j)
end
instead of:
for i=0, 9 do
for j=0, 90, 10 do
print (i+j)
end
end
The headers can be:
  • for, both in its numeric and enumerator versions. These are composed by nesting the loops, as shown above.
  • while, which gives a condition not to exit the loop.
  • until, which gives a condition to exit the loop.
  • if, which allows to zap an iteration of the loop without breaking it.
while and until header elements act as conditionial break statements, whereas if acts as a conditional continue. If a break appears in the loop's body, all the (implicit) nested loops are broken at once.

Some things that might turn out to be useful idioms include "for i=1,100 if not prime[i] do ... end", or "for i=1, 1000 while not canceled do ... end", "for i=1,10 for j=1,10 if i~=j do ... end"...

This might look not-so-interesting, but I plan to go on playing with it, in case something truly interesting emerges. I think that introducing some fold and map operators, as alternatives or complements to the loop's body, might be interesting (they would be roughly equivalent to directives such as collect or sum for Lisp's loop macro). There's a point where too many directives will hurt readability, and you can only tell after you've tried, but that's why I'm not merging these extensions with metalua's more mature stuff: they're virtually guaranteed to be at least partially rolled back.

xmatch

Structural pattern matching is already a metalua extension. However, it's the one I used and tuned the most extensively, and it's directly inspired by established functional programming languages, so it's fairly mature. The painfully repetitive idioms I've met are:
  • functions which solely consist of a match statement;
  • contrived uses of match statements when I meant it to return an expression.
Moreover, patterns are only usable in match statements, where several alternative patterns might apply. In some cases, we know that a given pattern will match, we only want to use it as a way to destructurate a piece of data, in order to easily create assignments.
match functions
instead of:
function f(a, b, c)
match a, b, c with
| x1, y1, z1 -> foo()
| x2, y2, z2 -> bar()
end
end
you can write:
match function f
| x1, y1, z1 -> foo()
| x2, y2, z2 -> bar()
end
This is very similar to Lua's "function f()... end" as a shortcut for "f=function()...end". Still in Lua's spirit, if you want to create a local match function, as in "local function f(...) match ... end end", you can use "local match function f ... end".
anonymous match functions
Of course, you can also declare anonymous functions, this is not Python after all :)
f = match function
| x if x%2==0 -> return 'even'
| _ -> return 'odd'
end
expression match
You are now allowed to put a match...with where an expression is expected. In that case, the conditional blocks are replaced by conditional expressions: print(match x with 1 -> 'one' | _ -> many). Notice that the end keyword in Lua terminates statement blocks, not expressions, and therefore isn't expected by the match expression.
Destructuring bind
This could be summed up as a single-case match statement which escapes its scope. Suppose that you know that x contains something of the form `If{ cond, { stat1, stat2, ... } } and you want to get the value of cond and the first statement of the body. You can write the full code:
assert(x.tag=='If')
cond = x[1]
stat = x[2][1]
Or you can use a bind: bind `If{ cond, { stat, ... } } = x. There is also a local bind, which declare the variables as new locals instead of merely assigning them: local bind `If{ cond, { stat, ... } } = x. Of course, if the pattern doesn't match, you get a runtime error.