The LuaJIT Language Toolkit is an implementation of the Lua programming language written in Lua itself. It generates LuaJIT bytecode complete with debug information. The generated bytecode, in turn, can be run by the LuaJIT virtual machine.
On its own, this toolkit does not do anything useful, since LuaJIT is already able to generate and run the bytecode for any Lua program. The purpose of the language toolkit is to provide a starting point for implementing a programming language that targets the LuaJIT virtual machine.
With the LuaJIT Language Toolkit it is easy to create a new language or to extend the Lua language, because the parser is cleanly separated from the bytecode generator and from the virtual machine's run-time environment.
The toolkit implements a complete pipeline that parses a Lua program, builds an AST and generates the bytecode.
The lexer's role is to recognize lexical elements in the program text. It takes the text of the program as input and produces a stream of "tokens".
Using the language toolkit you can run the lexer alone to examine the stream of tokens:
luajit run-lexer.lua tests/test-1.lua
For the following code fragment:
local x = {}
for k = 1, 10 do
    x[k] = k*k + 1
end
the command above produces this list of tokens:
TK_local
TK_name x
=
{
}
TK_for
TK_name k
=
TK_number 1
,
TK_number 10
TK_do
TK_name x
[
TK_name k
]
=
TK_name k
*
TK_name k
+
TK_number 1
TK_end
Each line represents a token, where the first element is the kind of token and the second element is its value, if any.
The lexer's code is an almost literal translation of LuaJIT's lexer.
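To make the token format shown above concrete, here is a minimal sketch of a toy tokenizer written in plain Lua. It is not the toolkit's lexer: the function name tokenize, the small keyword table and the handling of every other character as a one-character token are assumptions made for illustration only. It ignores strings, comments, floating point numbers and multi-character operators.

```lua
-- Toy tokenizer, for illustration only. It is NOT the toolkit's lexer:
-- the keyword set and the "TK_" output strings merely imitate the
-- listing shown in the text.
local keywords = {
  ["local"] = true, ["for"] = true, ["do"] = true,
  ["end"] = true, ["return"] = true,
}

local function tokenize(src)
  local tokens, pos = {}, 1
  while true do
    -- Move to the next non-whitespace character, or stop at end of input.
    pos = src:find("%S", pos)
    if not pos then break end
    local word = src:match("^[%a_][%w_]*", pos)
    local number = src:match("^%d+", pos)
    if word then
      -- Keywords become their own token kind, anything else is a name.
      if keywords[word] then
        tokens[#tokens + 1] = "TK_" .. word
      else
        tokens[#tokens + 1] = "TK_name " .. word
      end
      pos = pos + #word
    elseif number then
      tokens[#tokens + 1] = "TK_number " .. number
      pos = pos + #number
    else
      -- Every other character is emitted as a single-character token.
      tokens[#tokens + 1] = src:sub(pos, pos)
      pos = pos + 1
    end
  end
  return tokens
end

local tokens = tokenize("local x = {}\nfor k = 1, 10 do\n  x[k] = k*k + 1\nend")
print(table.concat(tokens, "\n"))
```

For the code fragment used earlier, this sketch prints one token per line in the same "kind value" layout as the listing above.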
The parser takes the stream of tokens produced by the lexer and forms statements and expressions according to the language's grammar. The parser takes a list of user-supplied rules that are invoked each time a parsing rule is completed. A user rule can return a result, which is then passed on to the rules invoked later.
For example, the grammar rule for the "return" statement is:
explist ::= {exp ','} exp
return_stmt ::= return [explist]
In this case the toolkit's parser rule parses the optional expression list by calling the function expr_list. Then, once the expressions are parsed, the user's rule ast:return_stmt(exps, line) is invoked with the expression list obtained before.
local function parse_return(ast, ls, line)
    ls:next() -- Skip 'return'.
    ls.fs.has_return = true
    local exps
    if EndOfBlock[ls.token] or ls.token == ';' then -- Base return.
        exps = { }
    else -- Return with one or more values.
        exps = expr_list(ast, ls)
    end
    return ast:return_stmt(exps, line)
end
As you can see, the user's parsing rules are invoked through the ast object.
In the LuaJIT Language Toolkit, a set of rules is defined in "lua-ast.lua" to build the AST of the program.
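As a rough illustration of what such a set of rules might look like, the sketch below defines a small rules object with a return_stmt method. Only the method name return_stmt and its arguments (exps, line) come from the text above; the literal rule and the layout of the node tables (the kind, arguments, value and line fields) are assumptions made for this example and may not match what "lua-ast.lua" actually builds.

```lua
-- Hypothetical rules object. The node layout is an assumption made for
-- this sketch, not the toolkit's actual AST format.
local AST = {}
AST.__index = AST

-- Rule invoked when a "return" statement has been parsed completely.
function AST:return_stmt(exps, line)
  return { kind = "ReturnStatement", arguments = exps, line = line }
end

-- Hypothetical rule for a numeric literal, used only to build sample input.
function AST:literal(value)
  return { kind = "Literal", value = value }
end

local ast = setmetatable({}, AST)

-- Simulate what a parser would do for "return 1, 2" found on line 3.
local exps = { ast:literal(1), ast:literal(2) }
local node = ast:return_stmt(exps, 3)
print(node.kind, #node.arguments, node.line)
```

Because the parser only calls methods on the ast object, replacing that object with a different set of rules is enough to build a different tree from the same grammar.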
In addition, the parser provides information about:
- the function prototype
- the syntactic scope
The function prototype is used to keep track of some information about the function currently being parsed.
The syntactic scope rules tell the user's rules when a new syntactic block begins or ends. Currently this is not really used by the AST builder, but it can be useful for other implementations.
The abstract syntax tree represents the whole Lua program with all its information. If you implement a new programming language, you can apply transformations to the AST if you need to. Currently the language toolkit does not perform any transformation and just passes the AST to the bytecode generator module.
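As an example of the kind of AST transformation a new language might add, here is a minimal sketch of constant folding. The toolkit itself performs no such transformation, and the node shapes used here (Literal, Identifier and BinaryExpression tables with kind, operator, left and right fields) are hypothetical, chosen only to keep the example self-contained.

```lua
-- Constant folding over a hypothetical AST. Assumed node shapes:
--   { kind = "Literal", value = n }
--   { kind = "Identifier", name = s }
--   { kind = "BinaryExpression", operator = op, left = node, right = node }
local function fold(node)
  if node.kind ~= "BinaryExpression" then return node end
  local left, right = fold(node.left), fold(node.right)
  if left.kind == "Literal" and right.kind == "Literal" then
    if node.operator == "+" then
      return { kind = "Literal", value = left.value + right.value }
    elseif node.operator == "*" then
      return { kind = "Literal", value = left.value * right.value }
    end
  end
  -- Not foldable: rebuild the node with the (possibly folded) children.
  return { kind = "BinaryExpression", operator = node.operator,
           left = left, right = right }
end

-- The tree for (2 * 3) + x is rewritten into the tree for 6 + x.
local tree = {
  kind = "BinaryExpression", operator = "+",
  left = {
    kind = "BinaryExpression", operator = "*",
    left = { kind = "Literal", value = 2 },
    right = { kind = "Literal", value = 3 },
  },
  right = { kind = "Identifier", name = "x" },
}
local folded = fold(tree)
print(folded.kind, folded.left.kind, folded.left.value, folded.right.name)
```

A pass like this would sit between the parser and the code generator, taking a tree and returning a new one.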
Once the AST is generated, it can be fed to the bytecode generator module, which generates the corresponding LuaJIT bytecode.
The bytecode generator is based on the original work of Richard Hundt for the Nyanga programming language. I modified it extensively to produce optimized code similar to what LuaJIT itself generates.
Instead of passing the AST to the bytecode generator, an alternative module can be used to generate Lua code. The module is called "luacode-generator" and can be used exactly like the bytecode generator.
The Lua code generator has the advantage of being simpler and safer: the generated code is parsed directly by LuaJIT, which ensures from the start that the resulting bytecode is fully compatible.
Currently the Lua Code Generator backend does not preserve the line numbers of the original source code. This is meant to be fixed in the future.
Use this backend instead of the bytecode generator if you prefer a safer way to convert the Lua AST to code. The module can also be used to pretty-print a Lua AST, since the code itself is probably the most human-readable representation of the AST.
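The idea of generating Lua source from an AST can be sketched in a few lines. The code below is not the "luacode-generator" module: the function emit and the node shapes (Literal, Identifier, BinaryExpression and ReturnStatement tables) are hypothetical and exist only for this example. It also shows why this approach is safe, since the emitted text is handed to the regular Lua parser through loadstring.

```lua
-- loadstring exists in LuaJIT and Lua 5.1; newer Lua versions use load.
local loadstring = loadstring or load

-- Turn a hypothetical AST node into Lua source text.
local function emit(node)
  if node.kind == "Literal" then
    return tostring(node.value)
  elseif node.kind == "Identifier" then
    return node.name
  elseif node.kind == "BinaryExpression" then
    -- Always parenthesize, so operator precedence never matters.
    return "(" .. emit(node.left) .. " " .. node.operator .. " "
      .. emit(node.right) .. ")"
  elseif node.kind == "ReturnStatement" then
    local parts = {}
    for i, exp in ipairs(node.arguments) do parts[i] = emit(exp) end
    return "return " .. table.concat(parts, ", ")
  end
  error("unknown node kind: " .. tostring(node.kind))
end

-- AST for: return 4 * 4 + 1
local tree = {
  kind = "ReturnStatement",
  arguments = {
    { kind = "BinaryExpression", operator = "+",
      left = { kind = "BinaryExpression", operator = "*",
               left = { kind = "Literal", value = 4 },
               right = { kind = "Literal", value = 4 } },
      right = { kind = "Literal", value = 1 } },
  },
}

local source = emit(tree)
print(source)

-- The generated text is compiled by the regular Lua parser.
local chunk = assert(loadstring(source))
local result = chunk()
print(result)
```

Printing the source string is also a simple way to inspect a tree, which is the pretty-printing use mentioned above.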
The application can be run with the following command:
luajit run.lua <filename>
The "run.lua" script will just invoke the complete pipeline of the lexer, parser and bytecode generator and it will pass the bytecode to luajit with "loadstring".
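The last step, handing bytecode to "loadstring", can be tried in isolation. In the sketch below the bytecode comes from the standard string.dump function rather than from the toolkit's generator; that substitution is an assumption made so the example stays self-contained, and the function square_plus_one is only a sample.

```lua
-- loadstring exists in LuaJIT and Lua 5.1; newer Lua versions use load.
local loadstring = loadstring or load

-- Sample function with no upvalues, so it can be dumped and reloaded.
local function square_plus_one(k)
  return k * k + 1
end

-- string.dump serializes a Lua function into a bytecode string.
-- Here it stands in for the output of a bytecode generator.
local bytecode = string.dump(square_plus_one)

-- loadstring accepts a bytecode string and rebuilds a function from it.
local reloaded = assert(loadstring(bytecode))
print(reloaded(3))
```

The reloaded function behaves like the original one, which is the property "run.lua" relies on when it executes generated bytecode.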
The script "run.lua" can optionally show the generated bytecode using the "-bl" flag. For example:
luajit run.lua -bl tests/test-1.lua
will print on the screen:
-- BYTECODE -- "test-1.lua":0-7
0001 TNEW 0 0
0002 KSHORT 1 1
0003 KSHORT 2 10
0004 KSHORT 3 1
0005 FORI 1 => 0010
0006 => MULVV 5 4 4
0007 ADDVN 5 5 0 ; 1
0008 TSETV 5 0 4
0009 FORL 1 => 0006
0010 => KSHORT 1 1
0011 KSHORT 2 10
0012 KSHORT 3 1
0013 FORI 1 => 0018
0014 => GGET 5 0 ; "print"
0015 TGETV 6 0 4
0016 CALL 5 1 2
0017 FORL 1 => 0014
0018 => RET0 0 1
You can compare it with the bytecode generated natively by LuaJIT using the command:
luajit -bl tests/test-1.lua
In the example above the generated bytecode is identical to that generated by LuaJIT. This is not by chance, since the Language Toolkit's bytecode generator is designed to produce the same bytecode that LuaJIT itself would generate. In some cases the generated code may differ, but this is not considered a problem as long as the generated code is still correct.
Currently the LuaJIT Language Toolkit should be considered beta software. The implementation is now complete in terms of features and well tested, even for the most complex cases, and a complete test suite is used to verify the correctness of the generated bytecode.