[clean-list] Matrix timings

Siegfried Gonzi siegfried.gonzi@kfunigraz.ac.at
Wed, 31 Oct 2001 08:33:53 +0100


2. Matrix timings (Shivkumar Chandrasekaran) wrote:

>S. Gonzi wrote:

>[Only to be sure:

>A X = Y

>A is a 300x300 array and Y and the solution array X is also a 300x300
>array]

>In this case the C code (written on top of the Boehm garbage collector) 
>timings are 2.25 seconds on a PowerBook G4 (400MHz), while Scilab with 
>ATLAS BLAS takes 1.14 seconds. At this stage the factor of 2 could be 
>because my C code uses QR factorization (which is about twice as slow as 
>LU with partial pivoting), so the ATLAS BLAS may be of no use here (or 
>my Scilab is not actually linking to ATLAS BLAS which I thought it was!).

Yesterday was an interesting day. In the evening I installed an IDL demo
version (it aborts after 7 minutes) on my Mac. IDL is especially used in
astronomy (I do not use it very much on our Suns) and in medicine for
image processing.
IDL is not only expensive (a legal single copy costs about US$ 2500);
it is also often fast (at least at image processing).

And the above 300x300 problem takes 15.5sec on my Mac (their online help
says that the routine is based on Numerical Recipes code).

I then did a timing profile: with IDL, the LU factorisation takes 5.5sec
and the forward and backward substitution takes 9.5sec.

Clean takes (the Macintosh Clean version has a nice option to produce a
time profile on request) 5.5sec for the LU and 12sec (6sec+6sec) for the
forward and backward substitution.

I am not sure whether it matters that all these languages, such as Matlab,
Scilab, Yorick,... allocate their memory right at startup. Maybe one
could then also squeeze another 20% out of Clean.

>I don't have the Clean timings handy for linear system solution. But 
>here are the timings for matrix-matrix multiplication instead:

>Clean's best timings for 300x300 was 1.98seconds vs. 0.16seconds for 
>Scilab (so maybe that ATLAS BLAS got linked in after all). The C code 
>(on top of the Boehm gc) timing was 0.85 seconds. So the Clean code was 
>about 2.3 times slower than "gcc -O3 -funroll-all-loops" in this case 
>(which was not quite fair to C; 2.5 was more likely to be the case).

Be aware, in case you are using Zoerner's matrix-matrix-multiply
algorithm, that a faster one (by a factor of 1.5 or so) was posted on
this list by Groningen one or two years ago.

I remember that on my Macintosh a 512x512 matrix-matrix multiplication
takes 40sec (in the best case of my measurements) to 50sec. Equivalent C
code takes the same time. But at that size, memory access speed becomes
the limiting factor.

Let's play hardball: newer processors include parallelism in the
processor itself. The new G4 processor parallelises tasks heavily. C and
Fortran can always exploit this feature. I once read an article by a NASA
magnetohydrodynamics guy about G4 parallelism and how one can exploit it
with C or Fortran (as long as the compiler vendor supports it).

A few years ago, the Clean strategy was also to use parallelism, but
across two or more physical processors. It is only my assumption, but at
that time the prospect was that in the future we would use highly
parallelised machines in the form of two or more processors. It seems,
however, that hardware manufacturers decided to go a different way and
embed all of this into one processor.

Here C will always win.

>When I get a chance I will time S.Gonzi's code too.

No problem, it was only a suggestion.


S. Gonzi