[clean-list] Data intensive programming and Clean

Dmitry Popov info at infognition.com
Thu Dec 15 11:48:42 MET 2011


Hello Chide,

Here's my understanding based on experience with Clean and on reading
its source code.

In Clean, when you build your program you specify the size of its heap.
The program allocates its data and temporary values sequentially in
that heap and quickly exhausts it, at which point the garbage
collector starts working. This means that even if you have only 10 MB
of live data, a program built with a 100 MB heap will take 100 MB from
the system. And if the program tries to allocate 110 MB, it simply
dies, even if the system has gigabytes of free memory.

There are two GCs in the Clean runtime. One divides the heap into two
halves and lets you use only one of them, copying the live data into
the other half when the first is exhausted. The other lets you use the
whole heap but runs somewhat slower. All this means that on
data-intensive tasks you may see unimpressive performance and can
easily run out of memory.
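To make the two-halves scheme concrete, here is a toy sketch in OCaml
(not Clean's actual runtime code; all names are illustrative, and the
caller's `live` list stands in for a real root scan). Allocation bumps
a pointer in the active half; when that half is full, the live cells
are copied into the reserve half and the roles of the halves are
swapped. This is exactly why only half of the configured heap is ever
usable for allocation.

```ocaml
(* Toy semi-space heap: cells are plain ints, and the set of live
   cells is supplied explicitly instead of being traced from roots. *)
type semispace = {
  mutable from_space : int array;  (* half currently allocated into *)
  mutable to_space   : int array;  (* reserve half, used only during GC *)
  mutable next       : int;        (* bump-allocation pointer *)
}

let create half_size = {
  from_space = Array.make half_size 0;
  to_space   = Array.make half_size 0;
  next       = 0;
}

(* Copy the live cells into the reserve half, then swap the halves. *)
let collect heap live =
  let n = ref 0 in
  List.iter (fun v -> heap.to_space.(!n) <- v; incr n) live;
  let tmp = heap.from_space in
  heap.from_space <- heap.to_space;
  heap.to_space   <- tmp;
  heap.next       <- !n

(* Allocate one cell, collecting first if the active half is full. *)
let alloc heap live v =
  if heap.next >= Array.length heap.from_space then collect heap live;
  if heap.next >= Array.length heap.from_space then failwith "heap full";
  heap.from_space.(heap.next) <- v;
  heap.next <- heap.next + 1
```

With a half of size 2, a third allocation triggers a copy of the live
cells into the other half before it can succeed.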

Also, the Clean runtime is currently completely single-threaded: you
cannot use more than one core at a time within one process.

OCaml has a very fast generational garbage collector that does not
reserve much more memory than there is live data, and it grows the
heap as your data grows, so your program will not die prematurely the
way some Clean programs do (like the Heap Profiler shipped with the
CleanIDE, which can't even load a few tens of MBs).
But the OCaml runtime is also single-threaded, so it is not suited for
parallel execution within one process.
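You can watch this on-demand growth yourself with OCaml's standard
`Gc` module (this small demo is mine, not from the Clean sources): the
major heap starts small and expands as live data accumulates, instead
of being fixed at build time.

```ocaml
(* Observe the OCaml major heap growing as live data grows.
   Gc.stat reports heap sizes in words. *)
let () =
  let before = (Gc.stat ()).Gc.heap_words in
  (* keep roughly 10 million words of data live *)
  let data = Array.init 1_000_000 (fun i -> Array.make 10 i) in
  let after = (Gc.stat ()).Gc.heap_words in
  Printf.printf "major heap: %d -> %d words\n" before after;
  ignore data
```

Running this, `after` will be far larger than `before`, because the
runtime requested more memory from the system only once the data
needed it.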

Haskell and F# allow multiple parallel threads and have sophisticated
parallel GCs; however, the F# code generator is far from perfect,
since it has to translate functional-style constructs into an
object-oriented representation.

Scala is also fine with respect to memory and multithreading, but like
F# it can be slow when a functional style (with a lot of closures and
recursion) is used heavily.

--
 Best regards,
 Dmitry Popov


GC> I'm trying to find out:
GC> - How, in general, do functional programming languages perform
GC> on data-intensive tasks (manipulation of large datasets, e.g.
GC> doing some statistical analysis on a table with 100,000 instances
GC> and 30 columns), regarding speed and memory usage?
GC> - Which functional language performs best?

GC> A quick glance at the following benchmark gave me the
GC> impression that Clean and Caml seem to perform best with regard to
GC> memory consumption:

GC> http://shootout.alioth.debian.org/

GC> Is that true?

GC> Additional question: which functional language exploits
GC> (hardware) parallelism best when running on a multi-core CPU (or
GC> multiple CPUs)?

GC> Thanks in advance,

GC> Chide

More information about the clean-list mailing list