Thursday, August 7, 2014

Fast

A long-standing goal of the Crack language has been to produce a system that is fast.  Specifically, we wanted to minimize the code-test cycle.  The language has thus far fallen short in this respect: startup times for anything but very small programs have tended to be over five seconds.

We were never able to draw any solid conclusions from profiling the code, but we knew there was a lot to be gained from caching, so over two years ago we started work on it.  Needless to say, it was a harder problem than we imagined.

Caching is simply storing the artifacts of a successful compile so that we don't have to produce them again.  In theory, the work involved in fully parsing and validating the source, as well as generating and optimizing the LLVM IR, is greater than the effort involved in simply deserializing these structures.
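
To make the idea concrete, here's a minimal sketch in C++ of the freshness check at the heart of any compile cache.  This is illustration only, not Crack's actual cache code -- the real thing has to track inter-module dependencies and interface changes, not just timestamps:

    // A minimal sketch of the caching idea; illustration only.
    #include <sys/stat.h>
    #include <string>

    // Returns true if 'cached' exists and is at least as new as 'source',
    // in which case we can deserialize the stored artifacts instead of
    // parsing, validating, and regenerating LLVM IR from scratch.
    bool cacheIsFresh(const std::string &source, const std::string &cached) {
        struct stat src, art;
        if (stat(cached.c_str(), &art) != 0) return false;  // no artifact yet
        if (stat(source.c_str(), &src) != 0) return false;  // source missing
        return art.st_mtime >= src.st_mtime;
    }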

Getting it to work right has been very difficult, largely because of a number of shortcuts and bad design choices that I made while producing the original compiler.  My first reaction, of course, was to try to adapt caching to the existing code.  When those attempts failed, I finally did the right thing and made the big design changes necessary to get caching to work.

And now, finally, it does.  I can run code, make changes, and run again with the most complicated programs that I have, and everything just works.  The modules for the changed code, and everything that depends on them, get rebuilt; everything else gets loaded from the cache.  Sweet!  (I'm sure there are still some bugs in there somewhere...)

So I tried this out against the most complicated program that we have -- the "wurld" game engine demo in 'examples'.  Without any cached files, it takes 12 seconds to build and run on my laptop.  With a fully populated cache, it takes ... 9 seconds.

9 seconds?  Really?  A 25% reduction after all of that work?  That's depressing.  So I did some more benchmarking.

One result of the restructuring required for caching is that all of the jitting of a cached program now happens at the end, after all of the modules have been loaded from the cache and just before the program is run.  Originally we jitted every module as we finished loading it.  The new structure makes it much easier to see how much time we're spending on jitting versus everything else.

It turns out we were spending most of that time -- 7 to 8 seconds of it -- on jitting.

We were storing the addresses of all of our functions before running anything.  We need function addresses in order to report source file and function names in stack traces.  But in the course of looking all of those addresses up, we were also forcing the JIT to generate code for every function in the program, whether we needed it or not.
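
The old, eager approach looked roughly like this -- a sketch against LLVM's legacy JIT, not our actual source, with recordForStackTraces as a hypothetical stand-in for our registration code.  The catch is that getPointerToFunction() codegens the function if it hasn't been jitted yet, so walking every function jits the whole program:

    #include "llvm/ExecutionEngine/ExecutionEngine.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/raw_ostream.h"
    using namespace llvm;

    // Hypothetical stand-in for our stack trace registry.
    static void recordForStackTraces(StringRef name, void *addr) {
        errs() << name << " @ " << addr << "\n";
    }

    void registerAllFunctions(ExecutionEngine &ee, Module &mod) {
        for (Module::iterator fi = mod.begin(), fe = mod.end();
             fi != fe; ++fi) {
            if (fi->isDeclaration())
                continue;
            // getPointerToFunction() codegens 'fi' if it hasn't been
            // jitted yet -- the hidden cost eating seconds of startup.
            recordForStackTraces(fi->getName(),
                                 ee.getPointerToFunction(&*fi));
        }
    }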

LLVM's ExecutionEngine lets you register for notifications of a number of internal events.  In particular, it can notify you whenever a function gets jitted.  So I replaced the registration loop with a notification callback.  Now, instead of function registration driving jitting, jitting drives function registration, and only for the functions that the program actually uses.  This got us down to about 5 seconds of total runtime.
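
The callback version looks something like this -- again a sketch, using the legacy JIT's JITEventListener interface, with recordForStackTraces as the same hypothetical registry as above:

    #include "llvm/ExecutionEngine/ExecutionEngine.h"
    #include "llvm/ExecutionEngine/JITEventListener.h"
    #include "llvm/IR/Function.h"
    #include "llvm/Support/raw_ostream.h"
    using namespace llvm;

    // Hypothetical stand-in for our stack trace registry.
    static void recordForStackTraces(StringRef name, void *addr) {
        errs() << name << " @ " << addr << "\n";
    }

    class FuncRegistrar : public JITEventListener {
    public:
        virtual void NotifyFunctionEmitted(
            const Function &func, void *code, size_t size,
            const EmittedFunctionDetails &details
        ) {
            // Called once per function, and only when the JIT actually
            // emits it -- registration no longer drives codegen.
            recordForStackTraces(func.getName(), code);
        }
    };

    // Hooked up with: ee.RegisterJITEventListener(new FuncRegistrar());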

The ExecutionEngine also lets you enable "lazy jitting."  By default, when you request a function address, that function and all of its dependencies get jitted.  Lazy jitting, by contrast, defers jitting a function until it is actually called.  Enabling this feature brought the total run time down to under 2.5 seconds: the hacked version of "wurld" I was using for benchmarking loads everything and does its initialization but then terminates immediately, so much of the code is never called anyway.
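
On the legacy JIT, turning this on is essentially a one-liner -- DisableLazyCompilation(false) re-enables lazy mode, since eager compilation is the default.  A sketch:

    #include "llvm/ExecutionEngine/ExecutionEngine.h"
    #include "llvm/ExecutionEngine/JIT.h"  // links in the legacy JIT
    #include "llvm/IR/Module.h"
    using namespace llvm;

    ExecutionEngine *createLazyJit(Module *mod) {
        ExecutionEngine *ee = EngineBuilder(mod).create();
        // By default, requesting a function's address jits it and all of
        // its dependencies.  Passing false here re-enables lazy mode, so
        // a function isn't jitted until it is first called.
        ee->DisableLazyCompilation(false);
        return ee;
    }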

I'm not sure if there's anything more we can do to improve on this, but 2.5 seconds is well within the realm of tolerable.

It's somewhat embarrassing that the huge effort of caching yielded only a small share of the gains while an hour or two of investigation and simple tweaking had such a big impact.  On the other hand, the simple tweaks couldn't have worked without some of the changes we put in for caching.  And as it turns out, post-tweaks, caching ends up saving us 60% of the startup time, which is huge.

Going forward, after we release 1.0, I still want to experiment with alternate backends.  libjit, in particular, looks like a promising alternative to LLVM for our JIT backend.  But as it stands, in terms of performance, I think we're in pretty good shape for a 1.0 release.