Sunday, December 30, 2007

Pattern matching in metalua

Pattern matching is an extremely pleasant and powerful way to handle tree-like structures, such as ASTs. Unsurprisingly, it's a feature found in most ML-inspired languages, which excel at compilation-related tasks. There is a pattern matching extension for metalua, which I use extensively, and which is still largely undocumented. This post is intended to temporarily close this gap.

Prerequisites

I suppose that people are comfortable with AST representation of code, as described in the user's manual. Some mastering of Lua and Metalua's basic principles is also probably necessary. Finally, having something you want to do that is non-trivial and AST related will obviously make the whole thing more appealing.

Purpose

When manipulating trees, you want to check whether they have a certain structure (e.g. a `Local{ } node with as first child a list of variables whose tags are `Id{ }); when you've found a data that matches a certain pattern, you want to name the interesting sub-parts of it, so that you can manipulate them easily; finally, most of the time, you have a series of different possible patterns, and you want to apply the one that matches a given data. These are the needs addressed by pattern matching: it lets you give a list of (pattern -> code_to_execute_if_match) associations, selects the first matching pattern, and executes the corresponding code. Patterns both describe the expected structures and bind local variables to interesting parts of the data. Those variables' scope is obviously the code to execute upon matching success.

match statement

A match statement has the form:
match <some_value> with
| <pattern_1> -> <block_1>
| <pattern_2> -> <block_2>
...
| <pattern_n> -> <block_n>
end
Actually, the first vertical bar after the "with" is optional; moreover, a pattern can actually be a list of patterns, separated by bars. In this case, it's enough for one of them to match, to get the block to be executed:
match <some_value> with
| <pattern_1> | <pattern_1_bis > -> <block_1>
...
end
The first matching pattern is selected, the corresponding block is executed, and all other patterns and blocks are ignored. If no matchin pattern is found, an error "mismatch is raised (however, we'll see it's easy to add a catch-all pattern at the end of the match, when we want it to be failproof).

Patterns

Litterals
Syntactically, a pattern is mostly identical to the values it matches: numbers, booleans and strings, when used as patterns, match identical values.
match x with
| 1 -> print 'x is one'
| 2 -> print 'x is two'
end
Tables
Tables as patterns match tables with the same number of array-part elements, if each pattern field matches the corresponding value field. For instance, {1, 2, 3} as a pattern matches {1, 2, 3} as a value. Pattern {1, 2, 3} matches value {1, 2, 3, foo=4}, but pattern {1, 2, 3, foo=4} doesn't match value {1, 2, 3}: there can be extra hash-part fields in the value, not in the pattern. Notice that field 'tag' is a regular hash-part field, therefore {1, 2, 3} matches `Foo{1, 2, 3} (but not the other way around). Of course, table patterns can be nested. The table keys must currently be integers or strings. It's not difficult to add more, but the need hasn't yet emerged. If you want to match tables of arbitrary array-part size, you just have to add a "..." as the pattern's final element. For instance, pattern {1, 2, ...} will match all table with at least two array-part elements, and whose two first elements are 1 and 2.
Identifiers
The other fundamental kind of patterns are identifiers: they match everything, and bind whatever they match to themselves. For instance, pattern {1, 2, x} will match value {1, 2, 3}, and in the corresponding block, local variable x will be set to 3. By mixing tables and identifiers, we can already do interesting things, such as getting the identifiers list out of a local statement, as mentionned above:
match stat with
| `Local{ identifiers, values } -> table.foreach(identifiers, |x| print(x[1])
... -- other cases
end
When a variable appears several times in a single pattern, all the elements they match must be equal, in the sense of the "==" operator. Fore instance, pattern { x, x } will match value { 1, 1 }, but not { 1, 2 }. Both values would be matched by pattern { x, y }, though. A special identifier is "_", which doesn't bind its content. Even if it appears more than once in the pattern, metched value parts aren't required to be equal. The pattern "_" is therefore the simplest catch-all one.
Guards
Some tests can't be performed by pattern matching. For these cases, the pattern can be followed by an "if" keyword, followed by a condition.
match x with
| n if n%2 == 0 -> print 'odd'
| _ -> print 'even'
end
Notice that the identifiers bound by the pattern are available in the guard condition. Moreover, the guard can apply to several patterns:
match x with
| n | {n} if n%2 == 0 -> print 'odd'
| _ -> print 'even'
end

Multi-match

If you want to match several values, let's say 'a' and 'b', there's an easy way:
match {a,b} with
| {pattern_for_a, pattern_for_b} -> block
...
end
However, it introduces quite a lot of useless tables and checks. Since this kind of multiple matches are fairly common, they are supported natively:
match a, b with
| pattern_for_a, pattern_for_b -> block
...
end
This will save some useless tests and computation, and the compiler will complain if the number of patterns doesn't match the number of values.

Examples

There are quite a lot of samples using match in the metalua distribution, and more will come in the next one. Dig in the samples for fairly simple usages, or in the standard libs for more advanced ones.

Implementation hints

The implementation is fairly commented, and can be found in lib/ext-syntax/match.lua. It isn't completely up-to-date with this post (mismatches don't cause arrors, and multi-matches aren't handled IIRC), e-mail me if you can't wait for v0.4 to get the latest version (which is 300 lines long). Finally, there's a tutorial online which explains how to implement a (slightly simplified) version of the match extension from scratch in the manual (corresponding sources available as sample). And since it's usually presented as a terrible challenge to integrate a macro system with a syntax-enabled language, I'm just going to quote the part of the code that deals with syntax. Granted, it requires a bit of knowledge, but someone who couldn't master that kind of API should definitely not be allowed to write macros:
mlp.stat:add{ name = "match statement",
   "match", mlp.expr_list, "with",
   gg.optkeyword "|",
   gg.list{ name  = "match cases list",
      primary     = gg.sequence{ name = "match case",
         gg.list{ name  = "patterns",
            primary     = mlp.expr_list,
            separators  = "|",
            terminators = { "->", "if" } },
         gg.onkeyword{ "if", mlp.expr },
         "->",
         mlp.block },
      separators  = "|",
      terminators = "end" },
   "end",
   builder = |x| match_builder (x[1], x[3]) }
From there, the *only* thing you have to do is write the function match_builder() :P Macro writing is an essentially complex problem, which cannot be significantly reduced by today's programming languages. Those who can deal with that essential complexity can also deal with a bit of syntax and knowledge of AST; and if they do their job correctly, it's simpler for *everyone* to reuse the fruit of their precious work.

No comments: