[clean-list] Matrix-Matrix-Multiplication and Cleans memory management

Siegfried Gonzi siegfried.gonzi@kfunigraz.ac.at
Fri, 03 Nov 2000 10:29:48 +0100


Dear reader and Clean-team,


over the last few days I have been playing around with matrix-matrix
multiplication benchmarks, comparing Clean (for the Power Mac) with
Yorick, Forth, and OmicronBasic/Assembler. The detailed results will
follow next week (a first hint: Clean comes closest to Yorick).

But first I have to sort out some problems.

Look at this code:

++++++++++++++++++++++++++++++++++++++++++++
module matm

import SampleVec
import Clas1,Clas2,Clas3
import StdEnv

fillMatrix :: !Int !Int -> .Matrix
fillMatrix n m = { one2n m \\i<-[1..n]}

//*** denotes matrix-matrix-multiplication; see Clas3 library
Start= (fillMatrix 1024 1024) *** (fillMatrix 1024 1024)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The Console output is set to 'not'. The heap memory size is around
30000K.
My computer is a Performa 5300 (PowerPC 603e RISC with 48MB of RAM).
Virtual memory under MacOS 8.6 was set to 49MB (48+1MB).

If I start the above program I have to wait (in the best case) 725sec
(the garbage collection is about 5sec).

If I run the same program (compiled on the Performa as a stand-alone
application) on my Powerbook 5300 (RISC 603e, 100MHz) with 24MB of RAM
and MacOS 8.6 virtual memory set to 50MB, I have to wait (no typing
error): 1184sec + 525sec (for garbage collection). The program's memory
was set to 40MB (30MB is the minimum, as with compilation on the
Performa).
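
A rough back-of-the-envelope sketch of the memory figures quoted above,
written in Python purely for illustration. It assumes that each matrix
element is an 8-byte real and that three matrices are alive at once (the
two operands and the result). Neither assumption is confirmed here,
since the Clas libraries are not shown, and all function names are
hypothetical.

```python
# Assumption: matrix elements are 8-byte reals (the Clas libraries are not shown).
BYTES_PER_REAL = 8
MIB = 1024 * 1024


def matrix_mib(rows, cols, elem_bytes=BYTES_PER_REAL):
    """Size of one dense rows x cols matrix in MiB, ignoring any headers."""
    return rows * cols * elem_bytes / MIB


def working_set_mib(n, live_matrices=3):
    """Assumed working set: two n x n operands plus the n x n result."""
    return live_matrices * matrix_mib(n, n)


def exceeds_ram(heap_mb, ram_mb):
    """True if the configured program memory is larger than physical RAM."""
    return heap_mb > ram_mb


if __name__ == "__main__":
    print("one 1024x1024 matrix  :", matrix_mib(1024, 1024), "MiB")
    print("three 1024x1024       :", working_set_mib(1024), "MiB")
    print("three 256x256         :", working_set_mib(256), "MiB")
    print("Powerbook 40MB on 24MB:", exceeds_ram(40, 24))
    print("Performa 30MB on 48MB :", exceeds_ram(30, 48))
```

Under these assumptions the three 1024x1024 matrices alone would come to
about 24MB, which is all of the Powerbook's physical RAM, whereas three
256x256 matrices need only 1.5MB.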

What is *wrong* with the memory management? When I start the program on
the Powerbook, the disc works loudly the whole time.


Now someone might say that the slowdown is due to the Powerbook not
having enough built-in memory (24MB compared to 48MB in the Performa).
But I cannot believe that.

I compared it with Yorick (a matrix-based programming language like
MATLAB or IDL, but with no stand-alone programs possible). I did the
same matrix calculation in Yorick:

++++++++++++++++Yorick code++++++++++++++++++++++++++++++++++++++++++++++
a=span(1,1024,1024)(,-:1:1024)   /* 1024x1024 array in which every column
                                    has values ranging from 1..1024 */
b=a(,+)*a(+,)                    /* Matrix multiplication */
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On my Performa, Yorick took 500sec for the calculation (I do not know
how Yorick calculates the matrix multiplication; maybe it uses some sort
of tuning with subarrays: see the postscript at the end of my message).

The memory of Yorick was set to 40MB.

And *now*, when I did the same on my Powerbook, Yorick (memory again set
to 40MB) *also* needed only 500sec for the matrix multiplication. I
heard the disc working only during the first and last minutes (with
Clean the disc was working all the time).

Something is strange with the memory management. If I calculate a matrix
multiplication with only 256x256 elements, Clean takes the same time
(10sec) on both computers (Performa and Powerbook). But when it comes to
large arrays...

Why can Yorick manage memory so much better? Yorick itself is only an
interpreter (but the Fast Fourier transform and all array manipulations
are built-in functions, and they run at compiled speed). I do not want
to discuss here which is better, Yorick or Clean, but I cannot
understand why Yorick manages memory better, given that Yorick only
"interprets" code while Clean compiles code to stand-alone native
PowerPC code.

I must admit that I did not search through the Clean manuals in the hope
of finding a solution to the above problem.


Regards,
Siegfried Gonzi
[PS: David M. (physicist at the Livermore National Laboratory), the
developer of Yorick, wrote to me; a quotation:
"...No tricks.  However, on the current cache architectures, there is a
terrific penalty for accessing memory out of order; therefore I was
careful to write the loops in such a way that (as nearly as possible)
they access array elements in order.  That strategy works well across
many architectures.  I would never tune anything for a specific
architecture.  Things are changing far too quickly for that to pay
off..."
PPS: The port to MacOS was made by Steven L. (physicist at Livermore N.
L.)]
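
The remark quoted above about accessing array elements in order can be
sketched as follows. This is an illustration only, not the actual loop
structure of Yorick or of the Clas3 library (neither is shown here). It
is written in Python with hypothetical function names and assumes
matrices stored row by row. Both functions compute the same product and
differ only in loop order: the first walks down a column of b in its
inner loop, the second walks along rows of b and c.

```python
def matmul_ijk(a, b):
    """Textbook order: the inner loop reads b[k][j] for changing k,
    i.e. it jumps from row to row of b (out-of-order access)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_ikj(a, b):
    """Reordered loops: the inner loop runs along one row of b and one
    row of c, so elements are touched in storage order."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        row_c = c[i]
        for k in range(m):
            aik = a[i][k]
            row_b = b[k]
            for j in range(p):
                row_c[j] += aik * row_b[j]
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_ijk(a, b))
    print(matmul_ikj(a, b))
```

In an interpreted language such as Python the timing difference between
the two orders will not reliably reflect cache behaviour; the sketch is
only meant to show which elements each inner loop touches.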