-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathindex.html
621 lines (514 loc) · 27.1 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
<html><head>
<title>Ur-Scheme: A GPL self-hosting compiler from a subset of
R5RS Scheme to fast Linux x86 asm</title>
<link rel="stylesheet" href="../../style.css" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style type="text/css">
.comment { /* font-lock-comment-face */ color: #b22222; }
.function-name { /* font-lock-function-name-face */ color: #0000ff; }
.keyword { /* font-lock-keyword-face */ color: #a020f0; }
.py-builtins { /* py-builtins-face */ color: #a020f0; }
</style>
</head><body>
<h1>Ur-Scheme: A GPL self-hosting compiler from a subset of
R5RS Scheme to fast Linux x86 asm</h1>
<p>Ur-Scheme is a compiler from a small subset of
<a href="http://schemers.org/Documents/Standards/R5RS/">R5RS
Scheme</a> to Intel x86
assembly language for Linux. It can compile itself. It is free software,
licensed under the GNU GPLv3+. It might be useful as a base for a
more practical implementation (or a more compact one), or it might be
enjoyable to read (particularly since the entire development history
is browsable using darcs), but it probably isn't that useful in its current
form.</p>
<p>Ur-Scheme is:</p>
<ul>
<li><b>Reasonably fast.</b> It <b>generates reasonably fast code</b>
— when compiled with itself, it runs 2½ times faster (in user
CPU time) than when it's compiled with <a
href="http://www.call-with-current-continuation.org/" >Chicken</a>, 1½
times faster than
when it's compiled with <a
href="http://dynamo.iro.umontreal.ca/~gambit/wiki/index.php/Main_Page"
>Gambit-C</a>, and much faster than when
it's interpreted by any of <a
href="http://www.gnu.org/software/guile/guile.html" >Guile</a>, <a
href="http://plt-scheme.org/" >MzScheme</a>, SCM, Elk, or Chicken's
interpreter. It also <b>runs reasonably fast</b>; on my laptop, it
compiles about 1000 lines of Scheme per user CPU second, which is not
fast in absolute terms (on the same hardware, gas assembles the
resulting 66 000-line assembly file in just over half a second,
which is about 100 000 lines of assembly per second) but seems to
be faster than a lot of Scheme compilers. I imagine that it would run
faster if compiled with <a
href="http://www.cs.indiana.edu/~aghuloum/ikarus/">Ikarus</a> or
Stalin, but my CPU is too old to run Ikarus, and I can't get Stalin to
compile it successfully.</li>
<li><b>Impractical.</b> It hardly implements anything beyond what's in
R5RS, and it omits quite a bit of the stuff that is in R5RS; for
example, files, macros, vectors, eval, quasiquotation, call/cc, and
floating-point ("inexact numbers"). Only things that were needed for
the compiler to compile itself and run some unit tests are
included.</li>
<li><b>Small.</b> The implementation is about <a
href="compiler.scm.html">1600 lines of Scheme</a>, plus 700 lines of
comments and blank lines. Compiled with itself on x86 Linux and
stripped, the executable is 134kiB. On my 700MHz laptop from the
previous millennium, compiling it from scratch with MzScheme and gas
takes just under 12 seconds.</li>
<li><b>Safe.</b> Programs compiled with it do not crash or corrupt
their memory unless there is a bug in the compiler (er, except if they
run out of stack or heap). All types and array indices are
dynamically checked at run-time.</li>
<li><b>Portable.</b> Last time I checked, it could compile itself
successfully when compiled by Guile, MzScheme (PLT Scheme), SCM, Elk,
Gambit-C, or Chicken, although it doesn't run in Bigloo,
Stalin, RScheme, or TinyScheme.
Retargeting it to a different processor wouldn't be a huge pain (after
all, it's only 1600 lines of code) but wouldn't be that trivial
either.</li>
<li><b>Unit-tested.</b> It comes with a fairly comprehensive set of
unit tests, and additionally, it compiles itself well enough that the
compiled version produces the same assembly-language output,
byte-for-byte, as when it's running interpreted in some other Scheme
— at least when it's compiling itself.</li>
</ul>
<h2>Downloading</h2>
<p>If you want to get it, even though it's impractical, you can <b>use
Darcs to snarf the source repository</b>:</p>
<pre>darcs get http://pobox.com/~kragen/sw/urscheme/</pre>
<p>Or you can <b><a href="urscheme-3.tar.gz">download the source
tarball of Ur-Scheme version 3</a></b>, or <a
href="urscheme-2.tar.gz">version 2</a>, <a
href="urscheme-1.tar.gz">version 1</a>,
or <a href="urscheme-0.tar.gz">version 0</a> if you like.</p>
<p>On Linux, if you have MzScheme, GCC, and gas installed, you can
build it by just typing "<tt>make</tt>". So far there it only runs on
Linux, although <a href="README.macos">it looks like a MacOS port
would be pretty easy.</a></p>
<h2>Limitations</h2>
<ul>
<li> The following are missing: user-defined macros, eval,
quasiquotation,
first-class continuations, vectors, numerical constants with
radix prefixes, ports, "let*", "letrec",
non-top-level defines ("internal definitions"), "do", promises,
non-integer numbers, bignums, load, apply,
case-insensitivity, multiple-value returns. </li>
<li> Pairs are immutable, so there are no set-car! and set-cdr!.
Despite this, eqv? is not the same as equal? for lists.</li>
<li> Procedures are allowed to take fixed numbers of arguments, or any
arbitrary number of arguments, but the normal at-least-N syntax
(lambda (a b . rest) ...) is not supported. </li>
<li> Rebinding standard procedures may break your program. </li>
<li> However, some standard procedures are treated as special forms by
the compiler, so rebinding them will rarely have any effect. </li>
<li> Integer arithmetic silently overflows. </li>
<li> Most arithmetic operators are omitted; only 1+, 1-, +, -,
quotient, and remainder are present. </li>
<li> Very many standard library procedures are missing. Only the
following 55 procedures were actually present last time I updated
this list: <tt>procedure?
string?
make-string string-set! string-ref string-length car cdr cons
pair? symbol? symbol->string string->symbol display newline eq?
current-input-port read-char integer? remainder quotient <
eof-object? char? integer->char char->integer list length assq
memq memv append not string-append char-whitespace? char<?
char<=? char-between? char-alphabetic? = char=? eqv? equal?
string=? null? boolean? number? for-each map reverse string->list
list->string number->string string->number write</tt>.
</li>
<li> make-string only takes one argument. </li>
<li> read-char takes a port argument and ignores it; write, display,
and newline do not take port arguments. current-input-port
returns nil.</li>
<li> map and for-each only take two arguments. </li>
<li> number->string only takes one argument. </li>
<li> string->symbol will be slow with enough symbols. </li>
</ul>
<h2>Extensions</h2>
<p>These are not in R5RS.</p>
<ul>
<li> (display-stderr foo) is like display, but uses stderr. </li>
<li> (exit 37) makes the program exit with exit code 37. </li>
<li> (error foo bar baz) aborts the program and sends to stderr
"error: " followed by foo, bar, and baz. </li>
<li> 1+ and 1- are as in Common Lisp. </li>
<li> (char->string c) is like (make-string 1 c). </li>
<li> (escape string) returns a list of character strings which, when
concatenated, form a string representing the original string by
escaping all the backslashes, quotes, or newlines in the string
by preceding them with a backslash. </li>
<li> #\tab is the tab character. I wouldn't think to mention it but
apparently Stalin doesn't support this. </li>
</ul>
<h2>Bugs</h2>
<p>Output is unbuffered, which accounts for a third of its
run-time.</p>
<p>I still don't have a garbage collector, and programs crash when
they run out of memory.</p>
<h2>Influences and Inspirations</h2>
<p><b>Abdulaziz Ghuloum</b>'s paper, <a
href="http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf">An Incremental
Approach to Compiler Construction</a>, was an enjoyable read and
inspirational. Ghuloum explained his objective and approach thus:</p>
<blockquote>
Compilers are perceived to be magical artifacts, carefully crafted
by the wizards, and unfathomable by the mere mortals...
Real-life compilers are too
complex to serve as an educational tool. And the gap between
real-life compilers and the educational toy compilers is too wide.
The novice compiler writer stands puzzled facing an impenetrable
barrier, "better write an interpreter instead."
<p>The goal of this paper is to break that barrier. We show that
building a compiler can be as easy as building an interpreter. The
compiler we construct accepts a large subset of the Scheme
programming language and produces assembly code for the Intel-x86
architecture... The
development of the compiler is broken into many small
incremental steps. Every step yields a fully working compiler for a
progressively expanding subset of Scheme. Every compiler step produces
real assembly code that can be assembled then executed directly
by the hardware. ...</p>
<p>Compiler construction is not as complex as it is commonly
perceived to be. In this paper, we showed that constructing a
compiler for a large subset of Scheme that targets a real hardware is
simple. The basic compiler is achieved by concentrating on the
essential aspects of compilation and freeing the compiler from
sophisticated analysis and optimization passes. This helps the novice
compiler writers build the intuition for the inner-workings of
compilers without being distracted by details.</p>
</blockquote>
<p>This is the objective and approach with which I started out
Ur-Scheme, and it turned out to be extremely successful.</p>
<p><a
href="http://www.iro.umontreal.ca/~boucherd/mslug/meetings/20041020/minutes-en.html"
>Marc Feeley's 2004 talk on the "<b>90 minute Scheme to C compiler</b>"</a>
was also quite inspirational, although I didn't use any
of the techniques he discussed! It was useful to see that a
relatively complete Scheme compiler could be written in 800 lines of
Scheme, and that the performance of the resulting C code wasn't too
terrible.</p>
<p>In the realm of small self-hosting compilers, <a
href="http://fabrice.bellard.free.fr/otcc/" ><b>Fabrice
Bellard's 2002 OTCC</b></a> is near the pinnacle of compactness. In
4748
bytes of code, <tt>otccelf</tt> compiles the subset of C it is written
in into standalone x86 ELF executables. What's even more stunning is
that the original <tt>otcc</tt>, missing the ELF output, is only 3301
bytes, and Bellard explains that he actually started with a smaller
subset of C than that! The extended version, <a
href="http://fabrice.bellard.free.fr/tcc/" >TCC</a>, was up to 100KiB
and could compile a Linux kernel from scratch in 10 seconds.</p>
<p>Looking at <b>disassembled code from <a
href="http://www.jwdt.com/~paysan/bigforth.html">Bernd Paysan's
bigFORTH</a></b>
showed me that a code generator could be really very simple. (The
bigFORTH web page says, "The compiler generates optimized native code
for the i386.", but I don't know what meaning of "optimized" Bernd has
in mind.) bigFORTH gets <a
href="http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=bigforth&lang2=gcc"
>pretty decent performance</a> in my experience, really, much
better than the <a
href="http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=all&lang=gforth&lang2=gcc"
>gFORTH threaded-code interpreter</a>. But it looks like most of the
bigFORTH primitives simply inline fixed instruction sequences, and no
register
allocation is done. I figured I could glue together fixed instruction
sequences too.
</p>
<p>Once I got started, I learned about <a
href="http://www.accesscom.com/~darius/hacks/ichbins.tar.gz" >Darius
Bacon's brilliant seven-page self-hosting "<b>ichbins</b>"
compiler</a> and read it.
It's definitely worth a read.</p>
<p>I decided to call it "Ur-Scheme" because it's not quite a Scheme,
with all the missing parts mentioned above; it's more like something
that could grow into a Scheme. I was inspired by <a
href="http://urjtag.sourceforge.net/book/_urjtag.html#_the_name_urjtag"
>UrJTAG</a>, but of course the echo of "<a
href="http://sc2.sf.net/">Ur-Quan Masters" (one of my favorite
games)</a> was a plus as well. Also, the ancient capital of Sumer was
Ur of the Chaldees (אור כשדים in Hebrew), and it's an Unoriginal
Reimplementation of Scheme. (Thanks to Ben Goetter for that.) Or at
least some of it.</p>
<h2>Origins and Lessons</h2>
<p>In February 2008, <b>I wanted to write a metacompiler for <a
href="http://pobox.com/~kragen/sw/bicicleta"
>Bicicleta</a></b>, but I was
intimidated because I'd never written a compiler before, and nobody
had ever written a compiler either <i>in</i> Bicicleta or <i>for</i>
Bicicleta. So I thought I'd pick a language that other people knew a
lot about writing compilers for and that wasn't too hairy, write a
compiler for it, and then use what I'd learned to write the Bicicleta
compiler.</p>
<p>It took me about 18 days from the time I started on the project to
the time that the compiler could actually compile itself, which was a
lot longer than I expected.</p>
<p>I learned a bunch from doing it. Here are some of the main things
I learned:</p>
<ol>
<li><b>Interpreters are a better way to bootstrap than
metacompilers.</b> Instead of writing a compiler for a subset of
Scheme in that subset, I should have written an interpreter in
standard Scheme (or whatever) and a metacompiler in the language
implemented by the interpreter. (This is what Darius Bacon did with
"ichbins".) Interpreters are a lot simpler than compilers, especially
if they don't have to run fast, and especially if you can write them
in a language like Scheme that gives you garbage collection and
closures for free.
<p> The restriction that the compiler had to be
correct both in R5RS Scheme and in the language that it could compile
was really a pain. For example, although I could add new syntactic
forms to the language that it could compile, and I could add new
syntactic forms with R5RS macro definitions, I couldn't simplify the
compiler by adding new syntactic forms, because the compiler can't
compile R5RS macro definitions. (There's a portable implementation of
R5RS macros out there, but it's about twice the size of the entire
Ur-Scheme compiler.) Similarly, I was stuck with a bunch of the
boneheaded design decisions of the
Scheme built-in types — for example, no
auto-growing mutable containers, separate types for strings and
characters, and the difficulty of doing any string processing without
arithmetic and side effects.</p> </li>
<li><b>Start with the simplest thing that could possibly work.</b> I
keep on learning this every year. In this case, the really expensive
thing was that I wanted normal function calls to be fast, so I used
normal C-style stack frames for their arguments. This seems to have
paid off in speed, but it meant I spent four days and about 200 lines
of code implementing lexical closures of unlimited extent. (Really!
According to my change log, from February 13 to February 16, I
basically didn't do anything else except add macros.) If I'd just
allocated all my call frames on the heap, the result would have been
slow, but I would have gotten done a lot sooner.
</li>
<li><b>Tail-recursion makes your code hard to read.</b> More
traditional control structures, such as explicit loops, are both
terser and clearer. This is the largest Scheme program I've written,
so this is my first time really experiencing this. Look at this code
(discarded in favor of something simpler now, thank goodness):
<blockquote><pre>
(<span class="keyword">define</span> (<span class="function-name">string-append-3</span> length s2 buf idx)
(<span class="keyword">if</span> (= idx (string-length buf)) buf
(<span class="keyword">begin</span>
(string-set! buf idx (string-ref s2 (- idx length)))
(string-append-3 length s2 buf (1+ idx)))))
(<span class="keyword">define</span> (<span class="function-name">string-append-2</span> s1 s2 buf idx)
(<span class="keyword">if</span> (= idx (string-length s1))
(string-append-3 (string-length s1) s2 buf idx)
(<span class="keyword">begin</span>
(string-set! buf idx (string-ref s1 idx))
(string-append-2 s1 s2 buf (1+ idx)))))
(<span class="keyword">define</span> (<span class="function-name">string-append</span> s1 s2) <span class="comment">; standard
</span> (string-append-2 s1 s2 (make-string (+ (string-length s1)
(string-length s2)))
0))
</pre></blockquote>
<p>That horrible rat's nest is my attempt to straightforwardly
translate this very straightforward procedural approach, without
introducing any forward references:</p>
<blockquote><pre>
<span class="keyword">def</span> <span class="function-name">string_append</span>(s1, s2):
buf = make_string(<span class="py-builtins">len</span>(s1) + <span class="py-builtins">len</span>(s2))
idx = 0
<span class="keyword">while</span> idx != <span class="py-builtins">len</span>(s1):
buf[idx] = s1[idx]
idx += 1
length = <span class="py-builtins">len</span>(s1)
<span class="keyword">while</span> idx != <span class="py-builtins">len</span>(buf):
buf[idx] = s2[idx - length]
idx += 1
<span class="keyword">return</span> buf
</pre></blockquote>
<p>(You wouldn't really want to do that in Python — I'm just
explaining what I was thinking.)</p>
<p>Tail-calls are merely "goto with arguments", which means that they
can be implemented very efficiently, and also means that you can
easily use them to create unreadable code.</p>
</li>
<li><b>Native-code compilers get OK performance pretty easily.</b> On
the standard <a href="fib.scm">stupid Fibonacci microbenchmark</a>, on
my laptop, Ur-Scheme outperforms MzScheme's interpreter by about a
factor of 7.6 (excluding time to compile and assemble):
<blockquote><pre>
(<span class="keyword">define</span> (<span class="function-name">fib2</span> n)
(<span class="keyword">cond</span> ((= n 0) 1)
((= n 1) 1)
(<span class="keyword">else</span> (+ (fib2 (1- n))
(fib2 (- n 2))))))
</pre></blockquote>
<p>GCC outperforms Ur-Scheme on that same benchmark by about a factor
of 5.</p>
<p>On larger programs, such as the Ur-Scheme compiler itself,
Ur-Scheme only outperforms MzScheme's interpreter by about a
factor of 5 (rather than 7 or so), which (if you're keeping track)
means that Scheme code interpreted in MzScheme is somewhere around
25-50 times
slower than C code compiled with GCC.</p>
<p>The literature about writing compilers is full of hairy techniques
like CPS conversion, SSA conversion, <a
href="ftp://ftp.cs.cmu.edu/afs/cs.cmu.edu/user/shivers/lib/papers/diss.ps.Z"
>higher-order
control-flow analysis</a>, type
inference, automated theorem proving, partial evaluation, escape
analysis, and so on. Ur-Scheme doesn't do any of those things.
(Well, actually, it does do a little bit of escape analysis.) In
fact, Ur-Scheme doesn't even do <i>register allocation</i>, and it
does full dynamic type checking at run-time. I imagine that if you
wanted to get performance only 2 or 3 or 4 times worse than C compiled
with GCC (which is not exactly a gold standard itself, these days)
then you'd need to start hauling out the hairy techniques. But it
already beats the pants off all the <i>interpreters</i> I can find,
and at least for itself, it generates faster code than the three
well-known Scheme compilers I've tried on it (mzc, Gambit-C, and
Chicken.) (I suspect that Stalin and Ikarus would do better...)
</p>
</li>
</ol>
<h2>Interesting Features</h2>
<p>It has a <b>one-level parser</b>, like old versions of Microsoft
BASIC; there's no separate tokenizer. I wrote the parser after
reading the "ichbins" parser, and there is definitely some influence
visible. </p>
<p>It contains <b>relatively little mutation</b>. Although almost
every line of the compiler
has "side effects" like outputting lines of assembly code,
there are fairly few locations where the
compiler's internal state is mutated. I count 25 calls to
<tt>set!</tt> and <tt>string-set!</tt>
in the 1600 lines of code, including the standard library.
The I/O side effects are similarly centralized
— there are three calls to <tt>display</tt>, no calls to
<tt>newline</tt>, and one call to <tt>read-char</tt> in the whole
code. When was the last time you saw a 1600-line C program with only
25 assignment statements?</p>
<p>To simplify code generation, it follows BigFORTH and colorForth
by <b>treating the x86 as a stack machine</b>.
</p>
<p>What little optimization it does is done by <b>optimization
macros</b> that rewrite certain patterns in the source — for
example, <tt>(if (not a) b c)</tt> gets rewritten into <tt>(if a c
b)</tt>, and <tt>(if (null? x) y z)</tt> is rewritten into a call to a
special built-in <tt>(%ifnull x y z)</tt>. As described in R5RS,
these macros also compile the various Scheme control structures
(<tt>cond</tt>, <tt>case</tt>, <tt>and</tt>, and so on) into more
fundamental control structures like <tt>if</tt> and
<tt>lambda</tt>.</p>
<p>At present, because it compiles one expression at a time, you can
<b>interactively type code at it</b> and see the compiled assembly
results. Unfortunately this isn't very much fun because the generated
code is so verbose. Darius discovered this feature.</p>
<p>The assembly code it produces <b>can compile to statically linked,
standalone executables</b> because it doesn't depend on any external
libraries — it just uses the Linux system call interface
directly. They can be relatively small; a minimal executable is about
17KiB, because it includes the whole standard library (not just the
parts that are used).</p>
<h2>Future Work</h2>
<p>First, of course, there are the bugs to fix, especially including
the absence of a <b>garbage collector</b>.</p>
<p>If this compiler has any merit at all, it is in its small size and
comprehensibility. <a
href="http://www.accesscom.com/~darius/hacks/ichbins.tar.gz" >Darius
Bacon's brilliant 385-line "<b>ichbins</b>" self-compiling Lisp-to-C
compiler</a> is much better at that, being less than one-fifth of the
size. So one direction of evolution is to <b>figure out what can be
stripped out of it</b>. ichbins has no arithmetic, no closures, no
separate string or symbol type, and only one side effect.</p>
<p>There's probably a lot that could be made clearer, as well.</p>
<p>Another direction is to try to <b>improve the speed and size</b> of
its output code a bit. For example:
<ul>
<li> <b>Lambda-lifting</b> would keep let-expressions from consing
storage for all of the arguments of the outer procedure, and if you
were slightly more ambitious, you could even optimize away the
procedure call entirely. That would make them a lot smaller, too.</li>
<li> Although it does some <b>inlining</b> now, it could do more.
Also, the existing inlining would benefit from a preliminary scanning
pass to see which <b>globals are never mutated</b> and can therefore
be inlined safely without breaking Scheme's semantics. Such a scan
could also remove the type-check, the two fetches, and the computed
call instruction, on the vast majority of procedure calls, and the
argument-count check in the vast majority of procedure prologues.</li>
<li> Especially <b>conditional expressions</b> could benefit from some
more inlining and integration with their parent <tt>if</tt>; currently
only <tt>null?</tt>, <tt>eq?</tt>, and <tt>not</tt> have this special
handling. (Although that covers about half of the conditionals in the
compiler.) The procedure-call-and-return-and-test overhead totals
about 18 instructions. The implementation of <tt>case</tt> is
particularly inefficient because it actually calls <tt>memv</tt>.</li>
<li> A little bit of <b>peephole optimization</b> usually goes a long
way, but I'm not sure it would in this case; we could get a little bit
of dead-code removal (e.g. procedure epilogues preceded by an
unconditional jump, as in the very common case of a tail call) and
there's the occasional <tt>pop %eax; push %eax;
movl foo, %eax</tt> sequence. But overall I don't think there's much
of an advantage yet.
</li>
<li> The <b>procedure prologues and epilogues</b> are pretty large,
and they could probably be shrunk quite a bit, especially for
non-variadic procedures. It might be possible to arrange the stack
frame so that a simple <tt>leave; ret $12</tt> sequence, or something
similarly simple, could replace the current eleven bytes of crap, at
least for non-variadic procedures. Tail calls are particularly
horrific; here's a three-argument tail call:
<blockquote><pre>
804977f: 50 push %eax
8049780: a1 20 89 06 08 mov 0x8068920,%eax
8049785: 8d 5c 24 08 lea 0x8(%esp),%ebx
8049789: 8b 55 fc mov 0xfffffffc(%ebp),%edx
804978c: 8b 65 f8 mov 0xfffffff8(%ebp),%esp
804978f: 8b 6d f4 mov 0xfffffff4(%ebp),%ebp
8049792: ff 33 pushl (%ebx)
8049794: ff 73 fc pushl 0xfffffffc(%ebx)
8049797: ff 73 f8 pushl 0xfffffff8(%ebx)
804979a: 52 push %edx
804979b: e8 fb e8 ff ff call 804809b <ensure_procedure>
80497a0: 8b 58 04 mov 0x4(%eax),%ebx
80497a3: ba 03 00 00 00 mov $0x3,%edx
80497a8: ff e3 jmp *%ebx
</pre></blockquote>
<p>That's 43 bytes of code! In this case, as is typical, the function
being called (<tt>char-between?</tt>) is never redefined; it takes a
fixed number of arguments and is never called with any other number of
arguments; and it is not a closure. These observations would allow
the following abbreviated version:</p>
<blockquote><pre>
50 push %eax
8d 5c 24 08 lea 0x8(%esp),%ebx
8b 55 fc mov 0xfffffffc(%ebp),%edx
8b 65 f8 mov 0xfffffff8(%ebp),%esp
8b 6d f4 mov 0xfffffff4(%ebp),%ebp
ff 33 pushl (%ebx)
ff 73 fc pushl 0xfffffffc(%ebx)
ff 73 f8 pushl 0xfffffff8(%ebx)
52 push %edx
e9 xx xx xx xx jmp _char_betweenP_4
</pre></blockquote>
<p>That's only 28 bytes, a substantial improvement, but still ugly.
If we could skip the prologue of tail-called non-closure procedures
and just jump directly to the body, then in cases where the caller has
at least as many arguments as the callee, we could also avoid copying
the saved %esp, %ebp, and return address to a new location on the
stack. In that case we could simply do something like this:</p>
<blockquote><pre>
8d 5c 24 08 lea 0x8(%esp),%ebx
8b 65 f8 mov 0xfffffff8(%ebp),%esp
ff 33 pushl (%ebx)
ff 73 fc pushl 0xfffffffc(%ebx)
50 push %eax
e9 xx xx xx xx jmp _char_betweenP_4
</pre></blockquote>
<p>That's only 18 bytes. Still not that pretty, but no longer
disgusting.</p>
</li>
<li> Storing <b>symbols in a skip list or hash table</b> would
probably help a lot with programs like compilers that use
<tt>string->symbol</tt> a lot. </li>
</ul>
<p>Another direction is to <b>make it more practical</b>. For
example, improved error-reporting, a profiler, an FFI, and access to
files and sockets would be handy.</p>
<p>Another direction is to <b>make it run faster</b>. </p>
</body></html>