NN code, further results
John van Groningen
johnvg@cs.kun.nl
Fri, 19 Mar 1999 17:16:18 +0100
Richard O'Keefe wrote:
>I've now got C, Clean, and Haskell versions of the NN code all running,
>and was interested in speeding up the Haskell version.
>
>I decided to focus on a smaller task instead: multiplying two
>100x100 double-precision matrices.
>
>The tests were done on an 84 MHz SPARCstation 5 using Clean 1.3.1
>and GHC 3.02 (yes, I know there is a later version of GHC, but this
>was the latest one with *prebuilt binaries* available for the Sparc).
>
> ...
>
>RESULTS
> Compiler List Ursl Array
> GHC 3.02 5 sec 3 sec 25 sec
> Clean 1.3.1 5 sec 4 sec 0.8 sec
> C 0.2 sec
>
The difference between List and Ursl is much smaller than I expected. I wrote the same algorithm in Clean to see why. These are the results (in seconds):
Sparc5(170mhz) SuperSparc(50mhz) UltraSparc(143mhz) G3 mac(266mhz)
matrix_l 1.27 2.25 0.47 0.25
matrix_lu 0.32 0.70 0.31 0.14
matrix_lu2 0.23 0.48 0.24 0.11
matrix_l uses [[Real]]
matrix_lu uses List ListR
:: List a = Cons !a !(List a) | Nil
:: ListR = ConsR !Real !ListR | NilR
matrix_lu2 uses List ListR
:: List a = Cons !a !(List a) | Nil
:: ListR = ConsR !Real !Real !ListR | NilR
For these programs unrolling and unboxing is more than 5 times as fast on a sparc 5, and still nearly twice as fast on an ultrasparc.
This is the source code for matrix_lu:
module matrix_lu
import StdEnv
:: List a = Cons !a !(List a) | Nil;
:: ListR = ConsR !Real !ListR | NilR;
matrix n v = gen 0 n (r n v);
where
r n v = genR 0 n v;
gen i n v | i<n = Cons v (gen (i+1) n v); = Nil;
genR i n v | i<n = ConsR v (genR (i+1) n v); = NilR;
transpose (Cons NilR _) = Nil;
transpose m=:(Cons r rs) = Cons (first_col_to_row m) (transpose (drop_c m));
first_col_to_row Nil = NilR;
first_col_to_row (Cons (ConsR e _) l) = ConsR e (first_col_to_row l);
drop_c Nil = Nil;
drop_c (Cons (ConsR e r) l) = Cons r (drop_c l);
mul m1 m2 = mul_t m1 (transpose m2);
mul_t Nil m = Nil;
mul_t (Cons r rs) m = Cons (mul_r r m) (mul_t rs m);
mul_r :: !ListR !(List ListR) -> ListR;
mul_r r Nil = NilR;
mul_r r (Cons c cs) = ConsR (mul_r_c r c 0.0) (mul_r r cs);
mul_r_c NilR NilR s = s;
mul_r_c (ConsR e1 l1) (ConsR e2 l2) s = mul_r_c l1 l2 (e1*e2+s);
sum_m m = Sum m 0.0;
where
Sum Nil s = s;
Sum (Cons r l) s = SumR r (Sum l s);
SumR l s = sum l s;
where
sum NilR s = s;
sum (ConsR e l) s = sum l (e+s);
Start = sum_m (mul (matrix 100 1.0) (matrix 100 1.0));
I also wrote a version in Clean using arrays ( {#{#Reals}} ) and compared this to one written in C. The results are (these programs do 10 multiplications instead of 1 for the list versions above):
Sparc5(170mhz) SuperSparc(50mhz) UltraSparc(143mhz) G3 mac(266mhz)
Clean 1.3.1 1.90 3.88 2.00 0.71
C (cc) 0.79 1.40 0.29
C (gcc) 0.73 1.46 0.70
C (egcs 1.1) 0.32
I used the following options for the c compilers:
for cc: -xO5 -dalign
for cc on ultrasparc also: -native
for gcc: -O3 -funroll-loops -msupersparc
for egcs: -O3 -funroll-loops -mcpu=603e
Unboxed lists are not much slower than arrays: (array times are divided by 10, because the array version does 10 multiplications instead of 1).
matrix_lu2 0.23 0.48 0.24 0.11
matrix(array)/10 0.19 0.388 0.200 0.071
I have made some improvements to the Clean compiler to improve the code generated for matrix multiplication. With these improvements the results are:
Sparc5(170mhz) SuperSparc(50mhz) UltraSparc(143mhz) G3 mac(266mhz)
Clean 1.53 3.18 1.49 0.56
The source code for this matrix multiplication in Clean is:
module matrix
import StdEnv
transpose :: !{#{#Real}} !Int -> {#{#Real}}
transpose a n = { { a.[j,i] \\ j<-[0..n-1]} \\ i<-[0..n-1]}
multiply_matrix_t :: !{#{#Real}} !{#{#Real}} !Int -> {#{#Real}}
multiply_matrix_t a b n
= { let
row = a.[i];
in {
let
mul_vector_row :: !Int !Int !Real !{#Real} !{#Real}-> Real
mul_vector_row i k v row col
| k>=n
= v
= mul_vector_row i (k+1) (v+row.[k]*col.[k]) row col
in
mul_vector_row i 0 0.0 row b.[j] \\ j<-[0..n-1]} \\ i<-[0..n-1]}
matrix :: !Int -> {#{#Real}}
matrix n = {createArray n 1.0 \\ i<-[0..n-1]}
repeat_mul c
| c==0
= 0
# n = 100;
| size (multiply_matrix_t (matrix n) (transpose (matrix n) n) n)>=0
= repeat_mul (c-1);
Start
= repeat_mul 10;
John van Groningen
johnvg@cs.kun.nl