[Re: Open Clean Source]
John van Groningen
johnvg@cs.kun.nl
Tue, 2 Feb 1999 12:36:15 +0100
Alex and Eduardo wrote:
>Let's consider ML. We will not speak about OCAML, because we don't
>know the compiler (we never could make it work). Let's talk about
>SML/NJ and Harlequin. They do have a deserved reputation for
>being slow. These two compilers generate code which is about
>four times slower than the code generated by gcc.
Yesterday I downloaded SML (version 110) for windows to see how fast it is.=
It was slower than I expected. To find out why, I disassembled some code=
generated by the compiler. I think I can now explain why it is so much=
slower than C:
1. The SML compiler uses 'stosd' instructions to store data in the heap.=
This instruction is slow on the Pentium, it takes 3 clock cycles (according=
to Intel documentation) and can not be paired with another instruction. So=
the code:
mov eax,0e2
stosd
mov eax,ecx
stosd
mov eax,ebx
stosd
mov eax,ebp
stosd
should take at least 16 clock cycles to execute. This can be done much=
faster with (this is how the Clean compiler writes to the heap):
mov [edi],0e2
mov 4[edi],ecx
mov 8[edi],ebx
mov 12[edi],ebp
lea 16[edi],edi
This could take only 2.5 clock cycles on a Pentium.
The stosd instruction is faster on the Pentium2 and K6 (1 clock) than it is=
on the Pentium, but the latter code is still faster.
2. SML detects integer arithmetic overflows, C and Clean do not. It is=
implemented with a 'into' instruction after an arithmetic instruction. This=
instruction is slow on the Pentium, 4 clocks and not pairable. It is also=
slow on the Pentium2 (a complex 5 micro ops instruction) and K6 (complex, 4=
clocks ?).
The overflow bit can also be tested with a jump instruction ('jo') that=
usually takes only one clock, and can be executed in parallel with a=
previous instructions. So by generating a 'jo' that jumps to an 'into'=
instruction, we can save 4 clock cycles.
3. SML generates position independent code. To store the continuation=
address in a closure, it calculates the address with:
call next_instruction
next_instruction:
pop eax
add eax,continuation_address-next_instruction
and stores to address in the heap with:
stosd
C and Clean (if strict) will store the continuation address (return address)=
on the stack, and jump to the function to be called with just a 'call'=
instruction. Clean does not generate position independent code and can=
store the evaluation address of a closure with one instruction:
mov [edi],evaluation_address
4. SML does not use a stack, but often creates closures in cases where C or=
Clean would use a stack to pass parameters and store the return address.=
Unfortunaly, writing to the heap is more expensive than writing to a stack,=
because a stack can often fit inside the cache, and a heap will usually not=
fit inside a cache. So we get more cache misses.
John van Groningen
johnvg@cs.kun.nl
University of Nijmegen