Performance of Ur-Scheme
========================

Ur-Scheme runs faster, at least when compiling itself, when compiled
with Ur-Scheme than when compiled with MzScheme’s mzc, Chicken, or
Gambit-C. This does not necessarily mean that Ur-Scheme is faster
than these other implementations, even on the subset of the Scheme
language that it supports. As Richard Gabriel writes in Performance
and Evaluation of Lisp Systems:

> There is a range of methodologies for determining the speed of an
> implementation.… Finally, real (naturally occurring) code can
> be used for the benchmarks.
>
> Unfortunately, each of these representative methodologies has
> problems.… Finally, the real-code methodology, while
> accurately measuring a particular implementation [Namely, the
> implementation on which the program was developed] is not
> necessarily accurate when comparing implementations. For example,
> programmers, knowing the performance profile of their machine and
> implementation, will typically bias their style of programming on
> that piece of code. Hence, had an expert on another system
> attempted to program the same algorithms, a different program
> might have resulted.

I think that is very likely the case here.

For example, many parts of Ur-Scheme avoid creating and using
closures, because closures are expensive in Ur-Scheme; and all of its
arithmetic is on small integers, i.e. fixnums. (I think Deutsch and
Schiffman’s paper on JIT compilation points out that it’s probably a
good idea in general to assume that arithmetic will be on fixnums
until proven otherwise, but a Scheme that takes that approach will
fare poorly on e.g. floating-point FFT benchmarks. Since Ur-Scheme
doesn’t have bignums, flonums, or indeed any kind of number other
than fixnums, the choice was easy.)

As another example, Ur-Scheme is written in a mostly-purely-functional
fashion, so it allocates memory quite rapidly; it would therefore
probably perform badly on a Scheme system with a traditional
mark-and-sweep collector.

It probably ought to run a lot faster than it does, by one or more
orders of magnitude.

Specific Performance Thingies
-----------------------------

Two current bottlenecks are:

① line-at-a-time output, and
② string blitting on output.

Output is not buffered, with the consequence that when the compiler
compiles its output file (1.8 megabytes, 66 000 instructions) it does
so with around 40 000 system calls, almost all of which are
single-line write(2) calls. The unbuffered nature of output
simplifies the code and is convenient for debugging (you know what the
program had tried to output before it crashed) but sucks for
performance. I know it’s a performance bottleneck because time(1)
consistently reports that about a third of the program’s run time is
“system time”, i.e. spent in the kernel. If I merely make a copy of
the file with `time dd ibs=1024k obs=36 < compiler.s > foo.tmp`, dd
takes a comparable amount of “system time” to do the same thing.

If I comment out the call to “display”, the system time drops
essentially to zero, the user time does not change, and the program
produces no output.
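
One obvious fix is to buffer output in user space and flush it in
larger chunks. Here is a minimal sketch of what that could look like,
in ordinary R5RS-ish Scheme; the procedure names and the buffer size
are invented for illustration, not taken from Ur-Scheme:

    ;; Sketch only: accumulate output in a fixed-size string buffer and
    ;; hand it to `display' (and thence to write(2)) in big chunks.
    (define output-buffer-size 8192)
    (define output-buffer (make-string output-buffer-size))
    (define output-buffer-fill 0)

    (define (flush-output-buffer)
      (display (substring output-buffer 0 output-buffer-fill))
      (set! output-buffer-fill 0))

    (define (buffered-display str)
      (let ((len (string-length str)))
        (if (> (+ output-buffer-fill len) output-buffer-size)
            (flush-output-buffer))
        (if (> len output-buffer-size)
            (display str)      ; too big to buffer; write it out directly
            (begin             ; copy str in at the current fill pointer
              (do ((i 0 (+ i 1)))
                  ((= i len))
                (string-set! output-buffer (+ output-buffer-fill i)
                             (string-ref str i)))
              (set! output-buffer-fill (+ output-buffer-fill len))))))

With an 8192-byte buffer, the 1.8 megabytes of output would need only
a couple of hundred write(2) calls instead of 40 000.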

In order to avoid making things even worse, output is coalesced into
lines before being handed to the write(2) system call. A procedure
called “asm-flatten” takes care of this; its inner loop is the
string-blit! procedure, which copies bytes (“characters”) from one
string to another. Commenting out the call to string-blit! reduces
the user time from 1.072 1.104 1.088 1.080 CPU seconds (in four
trials) to 0.780 0.664 0.708 0.760 CPU seconds. That’s a drop from a
median of 1.084 to a median of 0.734 user CPU seconds, or about a
one-third reduction in user CPU time.
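
For reference, a character-at-a-time blit of the sort described above,
where every copied character costs a string-ref, a string-set!, and
loop bookkeeping, looks something like this (a sketch; the real
string-blit!’s argument order and internals may differ):

    ;; Copy `len' characters from `src', starting at `src-start', into
    ;; `dest', starting at `dest-start', one character per iteration.
    (define (string-blit! dest dest-start src src-start len)
      (do ((i 0 (+ i 1)))
          ((= i len))
        (string-set! dest (+ dest-start i)
                     (string-ref src (+ src-start i)))))

A memcpy-style primitive, or even copying a machine word at a time,
would eliminate most of that per-character overhead.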

Those two bottlenecks apparently account for half of the program’s run
time.

The other things I suspect to be bottlenecks are:

③ parsing, because it involves several function calls per input
  character (sketched below), and
④ closure analysis, because it repeatedly reanalyzes the same
  expressions.
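
To make point ③ concrete, here is roughly what the per-character cost
of a recursive-descent reader looks like. This is a generic sketch,
not Ur-Scheme’s actual reader:

    ;; Sketch only: every input character flows through a peek-char, a
    ;; predicate or two, and a read-char, each a full procedure call.
    (define (skip-whitespace port)
      (if (and (char? (peek-char port))
               (char-whitespace? (peek-char port)))
          (begin (read-char port) (skip-whitespace port))))

    (define (delimiter? c) (memv c '(#\( #\) #\')))

    (define (read-atom port acc)        ; accumulate one atom's characters
      (let ((c (peek-char port)))
        (if (or (eof-object? c) (char-whitespace? c) (delimiter? c))
            (list->string (reverse acc))
            (read-atom port (cons (read-char port) acc)))))

    (define (read-token port)
      (skip-whitespace port)
      (let ((c (peek-char port)))
        (cond ((eof-object? c) c)
              ((delimiter? c) (read-char port))
              (else (read-atom port '())))))

With a verbose procedure call and return sequence (point ⑤ below),
those per-character calls add up quickly.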

I tried removing the parsing by pasting the entire compiler into the
standard library prologue, and then removing the part of the compiler
that reads and parses input. This makes a compiler that compiles
exactly the same program, except that the comment “# (end of standard
library prologue)” is in a different place, and it doesn’t parse
input; it relies on the compiler that compiled it to have parsed its
input. This takes 0.896 0.936 0.928 0.876 seconds, with a median of
0.912, down from 1.084: only a 15% improvement. (Although it would
presumably be a 30% improvement if the first two bottlenecks were
removed.)

So removing the above three bottlenecks should speed the compiler up
by about a factor of three, while making it somewhat longer.

I don’t have an equally useful way to remove closure analysis to see
if it’s a real problem.

The other thing is:

⑤ procedure call and return are slow and verbose.

A little bit of whole-program optimization should fix that. See
index.html.

cachegrind output
-----------------

`cachegrind` says that the compiler takes 666 068 500 instructions to
compile itself. That’s almost 300 000 instructions per input line and
10 000 instructions per output line. This, rather than any cache
effect, is responsible for the poor performance.

About 25% of its cycle estimate is in “(unknown)”, but a lot of the
rest seems to be inefficient output. 13% is in `_assert_10`, which is
actually `string-blit!`; 9% is in `_string_setBang_3`, which is
`string-set!`; 7% is in `_string_ref_3`, which is `string-ref`,
presumably mostly inside `string-blit!`; then another 6% is in
`_asm_flatten_inner_3` and 5% in `_asm_flatten_size_3`. So about 40%
of the runtime is in that one small part of the program, the output
path.

The other function that’s high up there is `string->symbol`, with 8%
of the runtime; presumably it is slow because it walks the huge list
of interned symbols every time the compiler sees an input symbol,
which happens about 6700 times in the compilation of the compiler.
Indeed, 82% of L1 data read misses and 29% of L2 data read misses are
from `string->symbol`. `memq` and `assq` are also high on the list,
at about 3% each, for similar reasons.
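
The shape of that problem is easy to see in a sketch: interning
against a linear list scans, on average, half the list on a hit and
all of it on a miss, so interning n distinct symbols costs O(n²)
string comparisons. None of the names below are Ur-Scheme’s actual
internals:

    ;; Illustrative only: the linear-list interning pattern described
    ;; above, keeping (string . symbol) pairs in a global list.  The
    ;; `make-symbol' argument stands in for the real symbol constructor.
    (define *interned* '())

    (define (intern str make-symbol)
      (let loop ((pairs *interned*))
        (cond ((null? pairs)                     ; miss: scanned everything
               (let ((sym (make-symbol str)))
                 (set! *interned* (cons (cons str sym) *interned*))
                 sym))
              ((string=? str (car (car pairs)))  ; one compare per element
               (cdr (car pairs)))
              (else (loop (cdr pairs))))))

Replacing the list with a hash table keyed on the string would make
each lookup roughly constant-time, and would also cut the cache
misses, since only one bucket’s chain would be walked per lookup.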

So the #1 problem (according to `cachegrind`, which doesn’t measure
system calls) is the point I listed as ② above: string blitting on
output. The #2 problem is one I didn’t even list above: linear lists
of interned symbols (the built-in list for `string->symbol` and
apparently two more in the compiler for assigning labels to global
variables) and old label prefixes. Of course cachegrind won’t be able
to see problem ⑤ above, since it’s spread across all functions.