Benchmarking interpreter dispatch: part 1
In this series we will be analyzing and evaluating various interpreter dispatch techniques and their performance characteristics.
Most dynamic programming languages handle execution via an interpreter core. In many cases this is implemented as a simple while-switch loop where dispatch is done on the instruction opcode, as seen here in Lua (though, as you can see, there's an optional indirect dispatch variant, as is the case in many other similar scripting languages).
In this post we'll explore the most popular alternative to this switch dispatch: indirect threaded, jump-table dispatch. More specifically, the performance differences between the two. Newer processor architectures have significantly improved branch predictors, so it may be interesting to see if the gap has narrowed over the years.
In indirect dispatch, instead of a loop and a switch, we just have labels for each of the instruction opcodes (types) and an array containing the addresses of those labels. We jump to a handler by indexing the array with the instruction opcode. Pretty nifty, right?
There are alternative approaches that, instead of bytecode, interpret AST nodes or even the source code line-by-line. They won't be covered as they are not quite as efficient.
The next section will contain a short explanation of the techniques, so if you're familiar with them or maybe you just want to see the numbers, feel free to skip it!
Dispatch techniques
Switch-based dispatch
This dispatch technique is very simple: an endless while loop fetches the next instruction and dispatches on its opcode, as shown below:
int switch_dispatch(unsigned char* ip) {
    int counter = 0, ret = 0;
    while (1) {
        switch (*ip++) {
        case 0: // increment
            ret++;
            break;
        case 1: // decrement
            ret--;
            break;
        case 2: // return
            return ret;
        }
    }
}
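As a quick illustration (this driver isn't part of the original example, just a sketch showing how the dispatcher is fed bytecode), a short program of increments and decrements could be run like this:

#include <stdio.h>

// assumes switch_dispatch() from the snippet above is in the same file
int main(void) {
    // opcode 0 = increment, 1 = decrement, 2 = return
    unsigned char program[] = { 0, 0, 1, 2 }; // inc, inc, dec, return
    printf("%d\n", switch_dispatch(program)); // prints 1
    return 0;
}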
Indirect threading
This technique relies on GCC's computed goto extension, which lets us goto a dynamically computed label.
It's structured around a static jump table containing the addresses of the labels for all of the instruction-handling branches; each instruction handler is simply a label followed by the code that executes the actual instruction.
There is no outer loop here; each instruction handler takes care of dispatch itself by fetching the next instruction, indexing the jump table with the opcode value, and jumping to that label. The dispatch code is thus replicated in each branch instead of being centralised as in the switch version, which may seem wasteful, but the cycle savings are worth it.
It sounds more complicated than it is; the simple example below should clear things right up.
int indirect_dispatch(unsigned char* ip) {
    int counter = 0, ret = 0;
    static void* jump_table[] = {
        &&inc,
        &&dec,
        &&ret,
    };
    goto *jump_table[*ip++];
inc:
    ret++;
    goto *jump_table[*ip++];
dec:
    ret--;
    goto *jump_table[*ip++];
ret:
    return ret;
}
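Both versions interpret the same toy bytecode, so a small sanity check (again just an illustrative sketch, assuming both functions live in the same translation unit) is to feed them the same program and confirm they agree:

#include <assert.h>
#include <stdio.h>

int main(void) {
    // three increments, one decrement, then return -> expected result: 2
    unsigned char program[] = { 0, 0, 0, 1, 2 };
    int a = switch_dispatch(program);
    int b = indirect_dispatch(program);
    assert(a == b); // the two dispatch techniques must agree on the result
    printf("%d\n", a); // prints 2
    return 0;
}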
Benchmarking
So for the purposes of benchmarking, we'll write a simple VM, very similar to the examples shown above.
It only contains the bare minimum, since we only need to benchmark the dispatch overhead itself. The main part used for benchmarking is a loop with some additions, so we don't need a ton of bytecode, which would bring the D-cache into the mix.
Here's the gist for the switch version, and the gist for the indirect version.
There are also a few extra unused instructions in there; otherwise both clang and gcc do some weird basic-block generation that's not really representative of a typical interpreter.
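For a rough idea of what's being measured (the real code is in the gists above; the opcode names and encoding below are my own guesses, not taken from the gists), the hot path boils down to a small loop body followed by a loop instruction that jumps back until a counter runs out:

// Rough sketch only, shown in the switch style; the indirect version
// dispatches the same bytecode via a jump table as in the earlier example.
enum { OP_ADD, OP_LOOP, OP_RET };

int run(const unsigned int* code) {
    const unsigned int* ip = code;
    int acc = 0, counter = 0;
    while (1) {
        switch (*ip++) {
        case OP_ADD:                // a bit of cheap work per iteration
            acc += 1;
            break;
        case OP_LOOP:               // loop back until the counter ticks over
            if (++counter < 30000000)
                ip -= 4;            // back over the three adds and the loop opcode
            break;
        case OP_RET:
            return acc;
        }
    }
}

// e.g. run((const unsigned int[]){ OP_ADD, OP_ADD, OP_ADD, OP_LOOP, OP_RET });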
The hardware used is a 2019 MacBook Pro with a 2.4 GHz quad-core i5.
The compiler used is Clang version 12.0.5 (clang-1205.0.22.11), with -O3 as the only option.
Hyperfine is used for benchmarking, and it comes very highly recommended. It provides all sorts of useful stuff, from minimising system interference to generating graphs in combination with the provided Python scripts. Super practical.
The benchmark itself was run with 3 warmup runs, as that provided results that were consistent enough.
Anyways, onto the results.
Results
Directly from hyperfine:
Benchmark 1: indirect
Time (mean ± σ): 72.6 ms ± 1.0 ms [User: 70.2 ms, System: 1.0 ms]
Range (min … max): 70.6 ms … 74.6 ms 40 runs
Benchmark 2: switch
Time (mean ± σ): 119.2 ms ± 2.1 ms [User: 116.6 ms, System: 1.1 ms]
Range (min … max): 116.3 ms … 123.4 ms 25 runs
As we can see, the indirect version is roughly 1.64 times as fast as the switch version (119.2 ms / 72.6 ms ≈ 1.64), or about 64% faster.
Now let's look at the assembly. Let's assume that we will simply execute the loop instruction and jump back to it until the counter has ticked over. Of course, the immediate integer values are wrong in this case, but that's not relevant for the analysis.
Switch:
.LBB0_1: # interpreter loop start, central dispatch
mov edx, dword ptr [rdi]
add rdi, 4
cmp rdx, 4
ja .LBB0_1
jmp qword ptr [8*rdx + .LJTI0_0] # jumps to an instruction handler via the address table
.LBB0_6: # loop instruction handler
cmp ecx, 30000000
lea ecx, [rcx + 1]
lea rdx, [rdi - 16]
cmovb rdi, rdx
jmp .LBB0_1 # jumps back to central dispatch
Indirect:
.LBB0_6: # loop instruction handler
cmp ecx, 30000000
lea ecx, [rcx + 1]
lea rdx, [rdi - 16]
cmovae rdx, rdi
add rdx, 4
mov esi, dword ptr [rdx - 4]
mov rdi, rdx
jmp qword ptr [8*rsi + indirect.jump_table] # jumps to an instruction handler via the address table
At first glance, we can see that both versions include an indirect jump, but the switch version has extra compare and branch instructions. Now, we can't exactly count cycles like in the olden days; modern out-of-order processors are a bit too complex for that. The implications for overall interpreter performance are not as drastic as in this benchmark, but given that we're measuring purely opcode dispatch, those two small differences do add up.
Now, exactly how and why they matter so much is a topic for a different blog post. I would love to give a microarchitectural analysis covering things like frontend and backend pressure, total uOps per cycle, and port usage, but that's out of scope for this post. This post is meant to be more of a sample of things to come.
The next one in the series will feature a more in-depth analysis, so stay tuned and thanks for reading!
You can send any comments and suggestions to bitsdraumar(at)pm.me