NN code, further results

John van Groningen johnvg@cs.kun.nl
Fri, 19 Mar 1999 17:16:18 +0100


Richard O'Keefe wrote:
>I've now got C, Clean, and Haskell versions of the NN code all running,
>and was interested in speeding up the Haskell version.
>
>I decided to focus on a smaller task instead:  multiplying two
>100x100 double-precision matrices.
>
>The tests were done on an 84 MHz SPARCstation 5 using Clean 1.3.1
>and GHC 3.02 (yes, I know there is a later version of GHC, but this
>was the latest one with *prebuilt binaries* available for the Sparc).
>
> ...
>
>RESULTS
>    Compiler       List    Ursl   Array
>    GHC 3.02       5 sec   3 sec  25 sec
>    Clean 1.3.1    5 sec   4 sec   0.8 sec
>    C                              0.2 sec
>

The difference between List and Ursl is much smaller than I expected. I wrote the same algorithm in Clean to see why. These are the results (in seconds):

            Sparc5(170mhz)  SuperSparc(50mhz)  UltraSparc(143mhz) G3 mac(266mhz)
matrix_l        1.27            2.25                    0.47            0.25
matrix_lu       0.32            0.70                    0.31            0.14
matrix_lu2      0.23            0.48                    0.24            0.11

matrix_l uses [[Real]]

matrix_lu uses List ListR

        :: List a = Cons !a !(List a) | Nil
        :: ListR = ConsR !Real !ListR | NilR

matrix_lu2 uses List ListR

        :: List a = Cons !a !(List a) | Nil
        :: ListR = ConsR !Real !Real !ListR | NilR

For these programs unrolling and unboxing is more than 5 times as fast on a sparc 5, and still nearly twice as fast on an ultrasparc.

This is the source code for matrix_lu:

module matrix_lu

import StdEnv

:: List a = Cons !a !(List a) | Nil;
:: ListR = ConsR !Real !ListR | NilR;

matrix n v = gen 0 n (r n v);
    where
        r n v = genR 0 n v;

gen i n v | i<n = Cons v (gen (i+1) n v); = Nil;

genR i n v | i<n = ConsR v (genR (i+1) n v); = NilR;

transpose (Cons NilR _) = Nil;
transpose m=:(Cons r rs) = Cons (first_col_to_row m) (transpose (drop_c m));

first_col_to_row Nil = NilR;
first_col_to_row (Cons (ConsR e _) l) = ConsR e (first_col_to_row l);

drop_c Nil = Nil;
drop_c (Cons (ConsR e r) l) = Cons r (drop_c l);

mul m1 m2 = mul_t m1 (transpose m2);

mul_t Nil m = Nil;
mul_t (Cons r rs) m = Cons (mul_r r m) (mul_t rs m);

mul_r :: !ListR !(List ListR) -> ListR;
mul_r r Nil = NilR;
mul_r r (Cons c cs) = ConsR (mul_r_c r c 0.0) (mul_r r cs);

mul_r_c NilR NilR s = s;
mul_r_c (ConsR e1 l1) (ConsR e2 l2) s = mul_r_c l1 l2 (e1*e2+s);

sum_m m = Sum m 0.0;
where
    Sum Nil s = s;
    Sum (Cons r l) s = SumR r (Sum l s);

SumR l s = sum l s;
where
    sum NilR s = s;
    sum (ConsR e l) s = sum l (e+s);

Start = sum_m (mul (matrix 100 1.0) (matrix 100 1.0));


I also wrote a version in Clean using arrays ( {#{#Reals}} ) and compared this to one written in C. The results are (these programs do 10 multiplications instead of 1 for the list versions above):

            Sparc5(170mhz)  SuperSparc(50mhz)  UltraSparc(143mhz) G3 mac(266mhz)        
Clean 1.3.1     1.90            3.88                    2.00            0.71
C (cc)          0.79            1.40                    0.29            
C (gcc)         0.73            1.46                    0.70
C (egcs 1.1)                                                            0.32


I used the following options for the c compilers:
        for cc:   -xO5 -dalign
        for cc on ultrasparc also: -native
        for gcc:  -O3 -funroll-loops -msupersparc
        for egcs: -O3 -funroll-loops -mcpu=603e

Unboxed lists are not much slower than arrays: (array times are divided by 10, because the array version does 10 multiplications instead of 1).

matrix_lu2       0.23           0.48                    0.24            0.11
matrix(array)/10 0.19           0.388                   0.200           0.071

I have made some improvements to the Clean compiler to improve the code generated for matrix multiplication. With these improvements the results are:

            Sparc5(170mhz)  SuperSparc(50mhz)  UltraSparc(143mhz) G3 mac(266mhz)        
Clean           1.53            3.18                    1.49            0.56

The source code for this matrix multiplication in Clean is:

module matrix

import StdEnv

transpose :: !{#{#Real}} !Int -> {#{#Real}}
transpose a n = { { a.[j,i] \\ j<-[0..n-1]} \\ i<-[0..n-1]}

multiply_matrix_t :: !{#{#Real}} !{#{#Real}} !Int -> {#{#Real}}
multiply_matrix_t a b n
    = { let
            row = a.[i];
        in {
            let
                mul_vector_row :: !Int !Int !Real !{#Real} !{#Real}-> Real
                mul_vector_row i k v row col
                    | k>=n
                        = v
                        = mul_vector_row i (k+1) (v+row.[k]*col.[k]) row col
            in
                mul_vector_row i 0 0.0 row b.[j] \\ j<-[0..n-1]} \\ i<-[0..n-1]}

matrix :: !Int -> {#{#Real}}
matrix n = {createArray n 1.0 \\ i<-[0..n-1]}

repeat_mul c
    | c==0
        = 0
        # n = 100;
        | size (multiply_matrix_t (matrix n) (transpose (matrix n) n) n)>=0
            = repeat_mul (c-1);

Start
    = repeat_mul 10;



John van Groningen
johnvg@cs.kun.nl