[Re: Open Clean Source]

John van Groningen johnvg@cs.kun.nl
Tue, 2 Feb 1999 12:36:15 +0100


Alex and Eduardo wrote:

>Let's consider ML. We will not speak about OCAML, because we don't
>know the compiler (we never could make it work). Let's talk about
>SML/NJ and Harlequin. They do have a deserved reputation for
>being slow. These two compilers generate code which is about
>four times slower than the code generated by gcc.

Yesterday I downloaded SML (version 110) for windows to see how fast it is.=
 It was slower than I expected. To find out why, I disassembled some code=
 generated by the compiler. I think I can now explain why it is so much=
 slower than C:

1. The SML compiler uses 'stosd' instructions to store data in the heap.=
 This instruction is slow on the Pentium, it takes 3 clock cycles (according=
 to Intel documentation) and can not be paired with another instruction. So=
 the code:
        mov     eax,0e2
        stosd
        mov     eax,ecx
        stosd
        mov     eax,ebx
        stosd
        mov     eax,ebp
        stosd
should take at least 16 clock cycles to execute. This can be done much=
 faster with (this is how the Clean compiler writes to the heap):
        mov     [edi],0e2
        mov     4[edi],ecx
        mov     8[edi],ebx
        mov     12[edi],ebp
        lea     16[edi],edi
This could take only 2.5 clock cycles on a Pentium.

The stosd instruction is faster on the Pentium2 and K6 (1 clock) than it is=
 on the Pentium, but the latter code is still faster.

2. SML detects integer arithmetic overflows, C and Clean do not. It is=
 implemented with a 'into' instruction after an arithmetic instruction. This=
 instruction is slow on the Pentium, 4 clocks and not pairable. It is also=
 slow on the Pentium2 (a complex 5 micro ops instruction) and K6 (complex, 4=
 clocks ?).
The overflow bit can also be tested with a jump instruction ('jo') that=
 usually takes only one clock, and can be executed in parallel with a=
 previous instructions. So by generating a 'jo' that jumps to an 'into'=
 instruction, we can save 4 clock cycles.

3. SML generates position independent code. To store the continuation=
 address in a closure, it calculates the address with:

        call    next_instruction
next_instruction:
        pop     eax
        add     eax,continuation_address-next_instruction

and stores to address in the heap with:

        stosd

C and Clean (if strict) will store the continuation address (return address)=
 on the stack, and jump to the function to be called with just a 'call'=
 instruction. Clean does not generate position independent code and can=
 store the evaluation address of a closure with one instruction:

        mov     [edi],evaluation_address

4. SML does not use a stack, but often creates closures in cases where C or=
 Clean would use a stack to pass parameters and store the return address.=
 Unfortunaly, writing to the heap is more expensive than writing to a stack,=
 because a stack can often fit inside the cache, and a heap will usually not=
 fit inside a cache. So we get more cache misses.

John van Groningen
johnvg@cs.kun.nl
University of Nijmegen