index.html

<html><head>
<title>Ur-Scheme: A GPL self-hosting compiler from a subset of 
R5RS Scheme to fast Linux x86 asm</title>
<link rel="stylesheet" href="../../style.css" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style type="text/css">
.comment     { /* font-lock-comment-face */ color: #b22222; }
.function-name { /* font-lock-function-name-face */ color: #0000ff; }
.keyword     { /* font-lock-keyword-face */ color: #a020f0; }
.py-builtins {       /* py-builtins-face */ color: #a020f0; }
</style>
</head><body>
<h1>Ur-Scheme: A GPL self-hosting compiler from a subset of 
R5RS Scheme to fast Linux x86 asm</h1>

<p>Ur-Scheme is a compiler from a small subset of 
<a href="http://schemers.org/Documents/Standards/R5RS/">R5RS
Scheme</a> to Intel x86
assembly language for Linux.  It can compile itself.  It is free software,
licensed under the GNU GPLv3+.  It might be useful as a base for a
more practical implementation (or a more compact one), or it might be
enjoyable to read (particularly since the entire development history
is browsable using darcs), but it probably isn't that useful in its current
form.</p>

<p>Ur-Scheme is:</p>

<ul>

<li><b>Reasonably fast.</b> It <b>generates reasonably fast code</b>
&mdash; when compiled with itself, it runs 2½ times faster (in user
CPU time) than when it's compiled with <a
href="http://www.call-with-current-continuation.org/" >Chicken</a>, 1½
times faster than
when it's compiled with <a
href="http://dynamo.iro.umontreal.ca/~gambit/wiki/index.php/Main_Page"
>Gambit-C</a>, and much faster than when
it's interpreted by any of <a
href="http://www.gnu.org/software/guile/guile.html" >Guile</a>, <a
href="http://plt-scheme.org/" >MzScheme</a>, SCM, Elk, or Chicken's
interpreter.  It also <b>runs reasonably fast</b>; on my laptop, it
compiles about 1000 lines of Scheme per user CPU second, which is not
fast in absolute terms (on the same hardware, gas assembles the
resulting 66&nbsp;000-line assembly file in just over half a second,
which is about 100&nbsp;000 lines of assembly per second) but seems to
be faster than a lot of Scheme compilers.  I imagine that it would run
faster if compiled with <a
href="http://www.cs.indiana.edu/~aghuloum/ikarus/">Ikarus</a> or
Stalin, but my CPU is too old to run Ikarus, and I can't get Stalin to
compile it successfully.</li>

<li><b>Impractical.</b> It hardly implements anything beyond what's in
R5RS, and it omits quite a bit of the stuff that is in R5RS; for
example, files, macros, vectors, eval, quasiquotation, call/cc, and
floating-point ("inexact numbers").  Only things that were needed for
the compiler to compile itself and run some unit tests are
included.</li>

<li><b>Small.</b> The implementation is about <a
href="compiler.scm.html">1600 lines of Scheme</a>, plus 700 lines of
comments and blank lines.  Compiled with itself on x86 Linux and
stripped, the executable is 134kiB.  On my 700MHz laptop from the
previous millennium, compiling it from scratch with MzScheme and gas
takes just under 12 seconds.</li>

<li><b>Safe.</b> Programs compiled with it do not crash or corrupt
their memory unless there is a bug in the compiler (er, except if they
run out of stack or heap).  All types and array indices are
dynamically checked at run-time.</li>

<li><b>Portable.</b> Last time I checked, it could compile itself
successfully when compiled by Guile, MzScheme (PLT Scheme), SCM, Elk,
Gambit-C, or Chicken, although it doesn't run in Bigloo, 
Stalin, RScheme, or TinyScheme.
Retargeting it to a different processor wouldn't be a huge pain (after
all, it's only 1600 lines of code) but wouldn't be that trivial
either.</li>

<li><b>Unit-tested.</b> It comes with a fairly comprehensive set of
unit tests, and additionally, it compiles itself well enough that the
compiled version produces the same assembly-language output,
byte-for-byte, as when it's running interpreted in some other Scheme
&mdash; at least when it's compiling itself.</li>

</ul>

<h2>Downloading</h2>

<p>If you want to get it, even though it's impractical, you can <b>use
Darcs to snarf the source repository</b>:</p>

<pre>darcs get http://pobox.com/~kragen/sw/urscheme/</pre>

<p>Or you can <b><a href="urscheme-3.tar.gz">download the source
tarball of Ur-Scheme version 3</a></b>, or <a 
href="urscheme-2.tar.gz">version 2</a>, <a 
href="urscheme-1.tar.gz">version 1</a>,
or <a href="urscheme-0.tar.gz">version 0</a> if you like.</p>

<p>On Linux, if you have MzScheme, GCC, and gas installed, you can
build it by just typing "<tt>make</tt>".  So far there it only runs on
Linux, although <a href="README.macos">it looks like a MacOS port
would be pretty easy.</a></p>

<h2>Limitations</h2>

<ul>

<li> The following are missing: user-defined macros, eval,
     quasiquotation, 
     first-class continuations, vectors, numerical constants with
     radix prefixes, ports, "let*", "letrec",
     non-top-level defines ("internal definitions"), "do", promises,
     non-integer numbers, bignums, load, apply,
     case-insensitivity, multiple-value returns. </li>

<li> Pairs are immutable, so there are no set-car! and set-cdr!.
     Despite this, eqv? is not the same as equal? for lists.</li>

<li> Procedures are allowed to take fixed numbers of arguments, or any
     arbitrary number of arguments, but the normal at-least-N syntax
     (lambda (a b . rest) ...) is not supported. </li>

<li> Rebinding standard procedures may break your program. </li>

<li> However, some standard procedures are treated as special forms by
     the compiler, so rebinding them will rarely have any effect. </li>

<li> Integer arithmetic silently overflows. </li>

<li> Most arithmetic operators are omitted; only 1+, 1-, +, -,
     quotient, and remainder are present. </li>

<li> Very many standard library procedures are missing.  Only the
     following 55 procedures were actually present last time I updated
     this list: <tt>procedure? 
     string?
     make-string string-set! string-ref string-length car cdr cons
     pair? symbol? symbol-&gt;string string-&gt;symbol display newline eq?
     current-input-port read-char integer? remainder quotient &lt;
     eof-object? char? integer-&gt;char char-&gt;integer list length assq
     memq memv append not string-append char-whitespace? char&lt;?
     char&lt;=? char-between? char-alphabetic? = char=? eqv? equal?
     string=? null? boolean? number? for-each map reverse string-&gt;list
     list-&gt;string number-&gt;string string-&gt;number write</tt>. 
     </li>

<li> make-string only takes one argument. </li>

<li> read-char takes a port argument and ignores it; write, display,
     and newline do not take port arguments.  current-input-port
     returns nil.</li>

<li> map and for-each only take two arguments. </li>

<li> number-&gt;string only takes one argument. </li>

<li> string-&gt;symbol will be slow with enough symbols. </li>

</ul>

<h2>Extensions</h2>

<p>These are not in R5RS.</p>

<ul>

<li> (display-stderr foo) is like display, but uses stderr. </li>

<li> (exit 37) makes the program exit with exit code 37. </li>

<li> (error foo bar baz) aborts the program and sends to stderr
     "error: " followed by foo, bar, and baz. </li>

<li> 1+ and 1- are as in Common Lisp. </li>

<li> (char-&gt;string c) is like (make-string 1 c). </li>

<li> (escape string) returns a list of character strings which, when
     concatenated, form a string representing the original string by
     escaping all the backslashes, quotes, or newlines in the string
     by preceding them with a backslash. </li>

<li> #\tab is the tab character.  I wouldn't think to mention it but
     apparently Stalin doesn't support this. </li>

</ul>

<h2>Bugs</h2>

<p>Output is unbuffered, which accounts for a third of its
run-time.</p>

<p>I still don't have a garbage collector, and programs crash when
they run out of memory.</p>

<h2>Influences and Inspirations</h2>
<p><b>Abdulaziz Ghuloum</b>'s paper, <a
href="http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf">An Incremental
Approach to Compiler Construction</a>, was an enjoyable read and
inspirational.  Ghuloum explained his objective and approach thus:</p>

<blockquote>
Compilers are perceived to be magical artifacts, carefully crafted
by the wizards, and unfathomable by the mere mortals...
Real-life compilers are too
complex to serve as an educational tool. And the gap between
real-life compilers and the educational toy compilers is too wide.
The novice compiler writer stands puzzled facing an impenetrable
barrier, "better write an interpreter instead."

<p>The goal of this paper is to break that barrier. We show that
building a compiler can be as easy as building an interpreter. The
compiler we construct accepts a large subset of the Scheme 
programming language and produces assembly code for the Intel-x86
architecture... The
development of the compiler is broken into many small 
incremental steps. Every step yields a fully working compiler for a 
progressively expanding subset of Scheme. Every compiler step produces
real assembly code that can be assembled then executed directly
by the hardware. ...</p>

<p>Compiler construction is not as complex as it is commonly 
perceived to be. In this paper, we showed that constructing a 
compiler for a large subset of Scheme that targets a real hardware is
simple. The basic compiler is achieved by concentrating on the 
essential aspects of compilation and freeing the compiler from 
sophisticated analysis and optimization passes. This helps the novice
compiler writers build the intuition for the inner-workings of 
compilers without being distracted by details.</p>

</blockquote>

<p>This is the objective and approach with which I started out
Ur-Scheme, and it turned out to be extremely successful.</p>

<p><a
href="http://www.iro.umontreal.ca/~boucherd/mslug/meetings/20041020/minutes-en.html"
>Marc Feeley's 2004 talk on the "<b>90 minute Scheme to C compiler</b>"</a>
was also quite inspirational, although I didn't use any
of the techniques he discussed!  It was useful to see that a
relatively complete Scheme compiler could be written in 800 lines of
Scheme, and that the performance of the resulting C code wasn't too
terrible.</p>

<p>In the realm of small self-hosting compilers, <a
href="http://fabrice.bellard.free.fr/otcc/" ><b>Fabrice
Bellard's 2002 OTCC</b></a> is near the pinnacle of compactness.  In
4748
bytes of code, <tt>otccelf</tt> compiles the subset of C it is written
in into standalone x86 ELF executables.  What's even more stunning is
that the original <tt>otcc</tt>, missing the ELF output, is only 3301
bytes, and Bellard explains that he actually started with a smaller
subset of C than that!  The extended version, <a
href="http://fabrice.bellard.free.fr/tcc/" >TCC</a>, was up to 100KiB
and could compile a Linux kernel from scratch in 10 seconds.</p>

<p>Looking at <b>disassembled code from <a
href="http://www.jwdt.com/~paysan/bigforth.html">Bernd Paysan's
bigFORTH</a></b>
showed me that a code generator could be really very simple.  (The
bigFORTH web page says, "The compiler generates optimized native code
for the i386.", but I don't know what meaning of "optimized" Bernd has
in mind.)  bigFORTH gets <a
href="http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=bigforth&lang2=gcc"
>pretty decent performance</a> in my experience, really, much
better than the <a
href="http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=all&lang=gforth&lang2=gcc"
>gFORTH threaded-code interpreter</a>.  But it looks like most of the
bigFORTH primitives simply inline fixed instruction sequences, and no
register 
allocation is done.  I figured I could glue together fixed instruction
sequences too.
</p>

<p>Once I got started, I learned about <a
href="http://www.accesscom.com/~darius/hacks/ichbins.tar.gz" >Darius
Bacon's brilliant seven-page self-hosting "<b>ichbins</b>"
compiler</a> and read it.
It's definitely worth a read.</p>

<p>I decided to call it "Ur-Scheme" because it's not quite a Scheme,
with all the missing parts mentioned above; it's more like something
that could grow into a Scheme.  I was inspired by <a
href="http://urjtag.sourceforge.net/book/_urjtag.html#_the_name_urjtag"
>UrJTAG</a>, but of course the echo of "<a
href="http://sc2.sf.net/">Ur-Quan Masters" (one of my favorite
games)</a> was a plus as well.  Also, the ancient capital of Sumer was
Ur of the Chaldees (אור כשדים in Hebrew), and it's an Unoriginal
Reimplementation of Scheme.  (Thanks to Ben Goetter for that.)  Or at
least some of it.</p>

<h2>Origins and Lessons</h2>

<p>In February 2008, <b>I wanted to write a metacompiler for <a
href="http://pobox.com/~kragen/sw/bicicleta" 
>Bicicleta</a></b>, but I was
intimidated because I'd never written a compiler before, and nobody
had ever written a compiler either <i>in</i> Bicicleta or <i>for</i>
Bicicleta.  So I thought I'd pick a language that other people knew a
lot about writing compilers for and that wasn't too hairy, write a
compiler for it, and then use what I'd learned to write the Bicicleta
compiler.</p>

<p>It took me about 18 days from the time I started on the project to
the time that the compiler could actually compile itself, which was a
lot longer than I expected.</p>

<p>I learned a bunch from doing it.  Here are some of the main things
I learned:</p>

<ol>

<li><b>Interpreters are a better way to bootstrap than
metacompilers.</b> Instead of writing a compiler for a subset of
Scheme in that subset, I should have written an interpreter in
standard Scheme (or whatever) and a metacompiler in the language
implemented by the interpreter.  (This is what Darius Bacon did with
"ichbins".)  Interpreters are a lot simpler than compilers, especially
if they don't have to run fast, and especially if you can write them
in a language like Scheme that gives you garbage collection and
closures for free.  

<p> The restriction that the compiler had to be
correct both in R5RS Scheme and in the language that it could compile
was really a pain.  For example, although I could add new syntactic
forms to the language that it could compile, and I could add new
syntactic forms with R5RS macro definitions, I couldn't simplify the
compiler by adding new syntactic forms, because the compiler can't
compile R5RS macro definitions.  (There's a portable implementation of
R5RS macros out there, but it's about twice the size of the entire
Ur-Scheme compiler.)  Similarly, I was stuck with a bunch of the
boneheaded design decisions of the 
Scheme built-in types &mdash; for example, no
auto-growing mutable containers, separate types for strings and
characters, and the difficulty of doing any string processing without
arithmetic and side effects.</p> </li>

<li><b>Start with the simplest thing that could possibly work.</b> I
keep on learning this every year.  In this case, the really expensive
thing was that I wanted normal function calls to be fast, so I used
normal C-style stack frames for their arguments.  This seems to have
paid off in speed, but it meant I spent four days and about 200 lines
of code implementing lexical closures of unlimited extent.  (Really!
According to my change log, from February 13 to February 16, I
basically didn't do anything else except add macros.)  If I'd just
allocated all my call frames on the heap, the result would have been
slow, but I would have gotten done a lot sooner.
</li>

<li><b>Tail-recursion makes your code hard to read.</b> More
traditional control structures, such as explicit loops, are both
terser and clearer.  This is the largest Scheme program I've written,
so this is my first time really experiencing this.  Look at this code
(discarded in favor of something simpler now, thank goodness):

<blockquote><pre>
(<span class="keyword">define</span> (<span class="function-name">string-append-3</span> length s2 buf idx)
  (<span class="keyword">if</span> (= idx (string-length buf)) buf
      (<span class="keyword">begin</span>
        (string-set! buf idx (string-ref s2 (- idx length)))
        (string-append-3 length s2 buf (1+ idx)))))
(<span class="keyword">define</span> (<span class="function-name">string-append-2</span> s1 s2 buf idx)
  (<span class="keyword">if</span> (= idx (string-length s1))
      (string-append-3 (string-length s1) s2 buf idx)
      (<span class="keyword">begin</span>
        (string-set! buf idx (string-ref s1 idx))
        (string-append-2 s1 s2 buf (1+ idx)))))
(<span class="keyword">define</span> (<span class="function-name">string-append</span> s1 s2)       <span class="comment">; standard
</span>  (string-append-2 s1 s2 (make-string (+ (string-length s1)
                                         (string-length s2)))
                   0))
</pre></blockquote>

<p>That horrible rat's nest is my attempt to straightforwardly
translate this very straightforward procedural approach, without
introducing any forward references:</p>

<blockquote><pre>
<span class="keyword">def</span> <span class="function-name">string_append</span>(s1, s2):
    buf = make_string(<span class="py-builtins">len</span>(s1) + <span class="py-builtins">len</span>(s2))
    idx = 0
    <span class="keyword">while</span> idx != <span class="py-builtins">len</span>(s1):
        buf[idx] = s1[idx]
        idx += 1
    length = <span class="py-builtins">len</span>(s1)
    <span class="keyword">while</span> idx != <span class="py-builtins">len</span>(buf):
        buf[idx] = s2[idx - length]
        idx += 1
    <span class="keyword">return</span> buf
</pre></blockquote>

<p>(You wouldn't really want to do that in Python &mdash; I'm just
explaining what I was thinking.)</p>

<p>Tail-calls are merely "goto with arguments", which means that they
can be implemented very efficiently, and also means that you can
easily use them to create unreadable code.</p>

</li>

<li><b>Native-code compilers get OK performance pretty easily.</b> On
the standard <a href="fib.scm">stupid Fibonacci microbenchmark</a>, on
my laptop, Ur-Scheme outperforms MzScheme's interpreter by about a
factor of 7.6 (excluding time to compile and assemble):

<blockquote><pre>
(<span class="keyword">define</span> (<span class="function-name">fib2</span> n) 
  (<span class="keyword">cond</span> ((= n 0) 1)
        ((= n 1) 1)
        (<span class="keyword">else</span> (+ (fib2 (1- n)) 
                 (fib2 (- n 2))))))
</pre></blockquote>

<p>GCC outperforms Ur-Scheme on that same benchmark by about a factor
of 5.</p>

<p>On larger programs, such as the Ur-Scheme compiler itself,
Ur-Scheme only outperforms MzScheme's interpreter by about a
factor of 5 (rather than 7 or so), which (if you're keeping track)
means that Scheme code interpreted in MzScheme is somewhere around
25-50 times
slower than C code compiled with GCC.</p>

<p>The literature about writing compilers is full of hairy techniques
like CPS conversion, SSA conversion, <a
href="ftp://ftp.cs.cmu.edu/afs/cs.cmu.edu/user/shivers/lib/papers/diss.ps.Z"
>higher-order
control-flow analysis</a>, type
inference, automated theorem proving, partial evaluation, escape
analysis, and so on.  Ur-Scheme doesn't do any of those things.
(Well, actually, it does do a little bit of escape analysis.)  In
fact, Ur-Scheme doesn't even do <i>register allocation</i>, and it
does full dynamic type checking at run-time.  I imagine that if you
wanted to get performance only 2 or 3 or 4 times worse than C compiled
with GCC (which is not exactly a gold standard itself, these days)
then you'd need to start hauling out the hairy techniques.  But it
already beats the pants off all the <i>interpreters</i> I can find,
and at least for itself, it generates faster code than the three
well-known Scheme compilers I've tried on it (mzc, Gambit-C, and
Chicken.)  (I suspect that Stalin and Ikarus would do better...)
</p>

</li>

</ol>

<h2>Interesting Features</h2>

<p>It has a <b>one-level parser</b>, like old versions of Microsoft
BASIC; there's no separate tokenizer.  I wrote the parser after
reading the "ichbins" parser, and there is definitely some influence
visible. </p>

<p>It contains <b>relatively little mutation</b>.  Although almost
every line of the compiler
has "side effects" like outputting lines of assembly code,
there are fairly few locations where the
compiler's internal state is mutated.  I count 25 calls to
<tt>set!</tt> and <tt>string-set!</tt>
in the 1600 lines of code, including the standard library.
The I/O side effects are similarly centralized
&mdash; there are three calls to <tt>display</tt>, no calls to
<tt>newline</tt>, and one call to <tt>read-char</tt> in the whole
code.  When was the last time you saw a 1600-line C program with only
25 assignment statements?</p>

<p>To simplify code generation, it follows BigFORTH and colorForth
by <b>treating the x86 as a stack machine</b>.
</p>

<p>What little optimization it does is done by <b>optimization
macros</b> that rewrite certain patterns in the source &mdash; for
example, <tt>(if (not a) b c)</tt> gets rewritten into <tt>(if a c
b)</tt>, and <tt>(if (null? x) y z)</tt> is rewritten into a call to a
special built-in <tt>(%ifnull x y z)</tt>.  As described in R5RS,
these macros also compile the various Scheme control structures
(<tt>cond</tt>, <tt>case</tt>, <tt>and</tt>, and so on) into more
fundamental control structures like <tt>if</tt> and
<tt>lambda</tt>.</p>

<p>At present, because it compiles one expression at a time, you can
<b>interactively type code at it</b> and see the compiled assembly
results.  Unfortunately this isn't very much fun because the generated
code is so verbose.  Darius discovered this feature.</p>

<p>The assembly code it produces <b>can compile to statically linked,
standalone executables</b> because it doesn't depend on any external
libraries &mdash; it just uses the Linux system call interface
directly.  They can be relatively small; a minimal executable is about
17KiB, because it includes the whole standard library (not just the
parts that are used).</p>

<h2>Future Work</h2>

<p>First, of course, there are the bugs to fix, especially including
the absence of a <b>garbage collector</b>.</p>

<p>If this compiler has any merit at all, it is in its small size and
comprehensibility.  <a
href="http://www.accesscom.com/~darius/hacks/ichbins.tar.gz" >Darius
Bacon's brilliant 385-line "<b>ichbins</b>" self-compiling Lisp-to-C
compiler</a> is much better at that, being less than one-fifth of the
size.  So one direction of evolution is to <b>figure out what can be
stripped out of it</b>.  ichbins has no arithmetic, no closures, no
separate string or symbol type, and only one side effect.</p>

<p>There's probably a lot that could be made clearer, as well.</p>

<p>Another direction is to try to <b>improve the speed and size</b> of
its output code a bit.  For example:

<ul>

<li> <b>Lambda-lifting</b> would keep let-expressions from consing
storage for all of the arguments of the outer procedure, and if you
were slightly more ambitious, you could even optimize away the
procedure call entirely.  That would make them a lot smaller, too.</li>

<li> Although it does some <b>inlining</b> now, it could do more.
Also, the existing inlining would benefit from a preliminary scanning
pass to see which <b>globals are never mutated</b> and can therefore
be inlined safely without breaking Scheme's semantics.  Such a scan
could also remove the type-check, the two fetches, and the computed
call instruction, on the vast majority of procedure calls, and the
argument-count check in the vast majority of procedure prologues.</li>

<li> Especially <b>conditional expressions</b> could benefit from some
more inlining and integration with their parent <tt>if</tt>; currently
only <tt>null?</tt>, <tt>eq?</tt>, and <tt>not</tt> have this special
handling.  (Although that covers about half of the conditionals in the
compiler.)  The procedure-call-and-return-and-test overhead totals
about 18 instructions.  The implementation of <tt>case</tt> is
particularly inefficient because it actually calls <tt>memv</tt>.</li>

<li> A little bit of <b>peephole optimization</b> usually goes a long
way, but I'm not sure it would in this case; we could get a little bit
of dead-code removal (e.g. procedure epilogues preceded by an
unconditional jump, as in the very common case of a tail call) and 
there's the occasional <tt>pop %eax; push %eax;
movl foo, %eax</tt> sequence.  But overall I don't think there's much
of an advantage yet.
</li>

<li> The <b>procedure prologues and epilogues</b> are pretty large,
and they could probably be shrunk quite a bit, especially for
non-variadic procedures.  It might be possible to arrange the stack
frame so that a simple <tt>leave; ret $12</tt> sequence, or something
similarly simple, could replace the current eleven bytes of crap, at
least for non-variadic procedures. Tail calls are particularly
horrific; here's a three-argument tail call:
<blockquote><pre>
 804977f:	50                   	push   %eax
 8049780:	a1 20 89 06 08       	mov    0x8068920,%eax
 8049785:	8d 5c 24 08          	lea    0x8(%esp),%ebx
 8049789:	8b 55 fc             	mov    0xfffffffc(%ebp),%edx
 804978c:	8b 65 f8             	mov    0xfffffff8(%ebp),%esp
 804978f:	8b 6d f4             	mov    0xfffffff4(%ebp),%ebp
 8049792:	ff 33                	pushl  (%ebx)
 8049794:	ff 73 fc             	pushl  0xfffffffc(%ebx)
 8049797:	ff 73 f8             	pushl  0xfffffff8(%ebx)
 804979a:	52                   	push   %edx
 804979b:	e8 fb e8 ff ff       	call   804809b &lt;ensure_procedure&gt;
 80497a0:	8b 58 04             	mov    0x4(%eax),%ebx
 80497a3:	ba 03 00 00 00       	mov    $0x3,%edx
 80497a8:	ff e3                	jmp    *%ebx
</pre></blockquote>

<p>That's 43 bytes of code!  In this case, as is typical, the function
being called (<tt>char-between?</tt>) is never redefined; it takes a
fixed number of arguments and is never called with any other number of
arguments; and it is not a closure.  These observations would allow
the following abbreviated version:</p>

<blockquote><pre>
50                   	push   %eax
8d 5c 24 08          	lea    0x8(%esp),%ebx
8b 55 fc             	mov    0xfffffffc(%ebp),%edx
8b 65 f8             	mov    0xfffffff8(%ebp),%esp
8b 6d f4             	mov    0xfffffff4(%ebp),%ebp
ff 33                	pushl  (%ebx)
ff 73 fc             	pushl  0xfffffffc(%ebx)
ff 73 f8             	pushl  0xfffffff8(%ebx)
52                   	push   %edx
e9 xx xx xx xx          jmp    _char_betweenP_4
</pre></blockquote>

<p>That's only 28 bytes, a substantial improvement, but still ugly.
If we could skip the prologue of tail-called non-closure procedures
and just jump directly to the body, then in cases where the caller has
at least as many arguments as the callee, we could also avoid copying
the saved %esp, %ebp, and return address to a new location on the
stack. In that case we could simply do something like this:</p>

<blockquote><pre>
8d 5c 24 08          	lea    0x8(%esp),%ebx
8b 65 f8             	mov    0xfffffff8(%ebp),%esp
ff 33                	pushl  (%ebx)
ff 73 fc             	pushl  0xfffffffc(%ebx)
50                   	push   %eax
e9 xx xx xx xx          jmp    _char_betweenP_4
</pre></blockquote>

<p>That's only 18 bytes.  Still not that pretty, but no longer
disgusting.</p>

</li>

<li> Storing <b>symbols in a skip list or hash table</b> would
probably help a lot with programs like compilers that use
<tt>string->symbol</tt> a lot. </li>

</ul>

<p>Another direction is to <b>make it more practical</b>.  For
example, improved error-reporting, a profiler, an FFI, and access to
files and sockets would be handy.</p>

<p>Another direction is to <b>make it run faster</b>. </p>

</body></html>