[FieldTrip] peer

Giovanni Piantoni g.piantoni at nin.knaw.nl
Thu Jan 13 18:19:16 CET 2011


Dear FieldTrip/peer developers,

I was trying out the peer module on two computers (both 64-bit Linux,
one Ubuntu and one Red Hat). Thanks for the effort in developing the
module!
The module works great, following the information on the wiki page. On
both computers, I'm able to call peermaster and peerslave, getting the
correct peerinfo and peerlist.

When I try on Ubuntu, it works flawlessly (but it only has two cores,
so not much gain there). However, when I try on Red Hat (which has many
more cores), peermaster sends the job to the peerslave and the job is
executed (e.g., peercellfun(@mkdir, {'test'}) creates a folder
called test), but the peerslave is not able to tell the peermaster
that the job has completed. See below for details:

In the peermaster matlab command line:

>> peermaster
peer: init
peerinit: user at computername, id = 3682858105
peer: spawning announce thread
peer: spawning discover thread
peer: spawning expire thread
peer: spawning tcpserver thread

>> peerinfo
hostid     = 3682858105
hostname   = computername
user       = user
group      = unknown
socket     =
port       = 1701
status     = master
memavail   = 4294967295 bytes
timavail   = 86400 seconds
allowuser  = {}
allowgroup = {}
allowhost  = {}
tcpserver thread is running
udsserver thread is NOT running
announce thread is running
discover thread is running
expire thread is running
there are 0 jobs in this peer's buffer

>> peerlist
there are   3 peers running in total (1 hosts, 1 users)
there are   1 peers running on  1 hosts as master
there are   2 peers running on  1 hosts as idle slave with 8.0 GB memory available
there are   0 peers running on  0 hosts as busy slave with 0 bytes and 0 seconds required
there are   0 peers running on  0 hosts as zombie
idle slave at user at computername:1702, memavail = 4.0 GB, timavail = 1.0 days
idle slave at user at computername:1703, memavail = 4.0 GB, timavail = 1.0 days
master     at user at computername:1701

>> peercellfun(@pause, {1 2})

Then, if I look at one of the two peerslave sessions, I see:

executing job 25 from user at computername (jobid=2068920377, memreq=1073741824, timreq=3600)
executing job took 2.002441 seconds and 0 bytes
Warning: failed to return job results to the master

And if I "dbstop if caught error", I see in the peerslave:

Error using ==> peer
failed to locate specified peer

Where the error occurs at:

peer('put', joblist.hostid, argout, options, 'jobid', joblist.jobid);

However:

>> joblist.hostid
ans =
                3682858105

which is the correct hostid for the peermaster.
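
A further check at that same breakpoint would be whether the peerslave
itself lists the master among the peers it has discovered. The lines
below are only a sketch: they assume that peerlist, when called with an
output argument, returns a struct array with a hostid field, which I
have not verified for this version of the module.

```matlab
% Run in the peerslave session, stopped at the failing peer('put', ...) call,
% so that joblist is still in scope.
% ASSUMPTION: peerlist returns a struct array with a hostid field when it is
% called with an output argument.
list     = peerlist;
masterid = joblist.hostid;                  % 3682858105 in the run above
known    = any([list.hostid] == masterid);  % false if the list is empty

if known
  fprintf('master %u is in this slave''s peer list\n', masterid);
else
  fprintf('master %u is NOT in this slave''s peer list\n', masterid);
end
```

If the master turns out to be missing from the slave's list, the slave
never registered the master's announcements, even though the master can
see the slaves.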

Do you have any idea where the problem might lie? How can I debug this?
If it's not easy to solve, how can I make each job run only once? (The
peermaster keeps resending the request because it doesn't know that the
first job has completed.)

Thanks a lot!

Gio

-- 
Giovanni Piantoni, Ph.D. student
Dept. Sleep & Cognition
Netherlands Institute for Neuroscience
Meibergdreef 47
1105 BA Amsterdam (NL)

+31 (0)20 5665492
g.piantoni at nin.knaw.nl
www.nin.knaw.nl/research_groups/van_someren_group/


