MPICH2 Tight Integration  //  Tuesday, December 16, 2008

After getting SGE installed, I spent some time getting familiar with the way it works. I am a little concerned by how easily a queue can be put into the Error state, and by the fact that an administrator then has to clear it manually. I will be working more on this in the coming weeks.
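If memory serves, the admin commands for inspecting and clearing queue state look something like the following (a sketch from memory; exact flags can differ between SGE releases, and 'all.q' is just a placeholder queue name):

```shell
# Show full queue status, including any queues stuck in the (E)rror state
qstat -f

# Show why a particular job is stuck (substitute a real job id)
qstat -j <job_id>

# As the grid administrator, clear the error state on a queue
qmod -c all.q
```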

The issue at hand is, of course, MPICH2 tight integration with SGE. I was using the howto on the Sun SGE site as a reference, but of course had to roll my own. Now that I've gotten all the kinks worked out, MPICH2 is successfully started whenever a parallel environment of type 'mpich2' is requested in the qsub script.

For example:

# Shell to execute this job under
#$ -S /bin/bash
# Name of this job
#$ -N mpihw
# My username
#$ -A caf
# Specifying the mpich2 parallel environment
#$ -pe mpich2 8

PROCS=$((NHOSTS * 2)) # Get the number of processors
/grid/mpich2/bin/mpiexec -n $PROCS ~caf/bin/mpihw

The above will execute an MPI Hello World program in parallel. Again, it concerns me that if this program were to hang, core dump, segfault, be killed, fart, or otherwise exit uncleanly, the queue in question would be put into an error state and be unusable until a grid administrator cleared it. Before I figured out the nifty qstat -f and qstat -j <job_id> commands, I spent many an hour scratching my head over why SGE was complaining about not having enough available queues to run the above script.
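As an aside, the PROCS line in the script is just plain shell arithmetic. Here's a standalone sketch of it; the NHOSTS value is made up for illustration (under SGE it is set in the job's environment automatically):

```shell
# Standalone illustration of the slot arithmetic in the job script above.
# NHOSTS is normally set by SGE for the job; the value here is a made-up
# example. The script assumes two processors per host.
NHOSTS=4
PROCS=$((NHOSTS * 2))
echo "$PROCS"   # 8
```

(If I recall correctly, SGE also exports $NSLOTS with the exact slot count requested via -pe, which would avoid hard-coding the two-processors-per-host assumption.)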

On to GridSphere. GridSphere is a JSR 168 compliant portlet container, and apparently the best available. However, compiling this thing was the stuff of nightmares: every single time I tried, I found a new unescaped string in a JSP page. The reason for this is that I like to keep things up-to-date, bleeding edge, etc. GridSphere was written in a time when unescaped double quotes were acceptable by Tomcat's standards, but as of Tomcat 5.5.26, they no longer are. Thank God Almighty for Google, because I scoured pages and pages of forums, mailing list archives, and release notes before stumbling across a single line buried deep in the bowels of the internet that described this issue. That being said, I popped over to the Tomcat 5 archive page, pulled down 5.5.25, copied my tomcat-users.xml into the appropriate directory, and ln -s'd the new version to /usr/local/tomcat. I've never been so happy to watch old software start up.
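For anyone who lands here via Google with the same problem: I've since read that Tomcat exposes a Jasper system property to relax the strict quote handling, which might spare you the downgrade. I haven't tested this myself, so treat it as a sketch:

```shell
# Relax Jasper's strict JSP attribute quote escaping (untested sketch).
# This would go in CATALINA_OPTS, e.g. in bin/setenv.sh, before starting
# Tomcat.
export CATALINA_OPTS="$CATALINA_OPTS \
  -Dorg.apache.jasper.compiler.Parser.STRICT_QUOTE_ESCAPING=false"
```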

Now that I've gotten GridSphere installed, I hope to get GridPortlets installed as well, but as of yet I haven't found a way to get it, because the host site has been tossing me a "Bad Gateway" error for almost two weeks now. WTF?!

Note: If someone clicks that link and doesn't get a bad gateway error, kindly contact me and let me know.

That's all for now. I'm off to Ghana for two weeks in the morning, so more updates will need to wait until I get back on the 2nd.

Merry Christmas!


posted by Christian @ 8:02 AM

Endgame Step 2  //  Tuesday, December 9, 2008

Since being introduced to dpkg/APT, I have become loath to install things that are not all nicely .deb'ed. I have got to say, however, that once I got going with the Sun Grid Engine installation, things went off without a hitch (well, except for one). It did pay to have everything planned out like I did, though. For details, see the endgame plan.

The snag I hit during the SGE install was a small one (I hope). It turns out that ARCo (the Accounting and Reporting Console) of SGE fame requires something called the Sun Java Web Console. In a fantastic case study of companies that have gone open-source-but-only-sorta, ARCo is distributed openly, but the Sun Java Web Console is not. That being said, it is downloadable, but only in the Red Hat RPM format for Linux systems, which doesn't help us much.

I had the misfortune of thinking that alien would save me. I was wrong. We are going to go without ARCo for now.
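For the curious, the attempt looked roughly like this (the package filename is a placeholder, not the real one; alien converts RPMs to .debs, but the resulting package didn't work out for me):

```shell
# Convert the RPM to a .deb for installation on Debian (sketch;
# filename is hypothetical)
alien --to-deb sun-java-web-console.rpm
```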

Next up is MPICH2 tight integration with SGE.




posted by Christian @ 6:49 PM

Endgame Step 1  //  Tuesday, December 2, 2008

Step 1 of the endgame plan mentioned in my previous post is underway. I have set up grid-control/proprietor on the spare PowerEdge, and have moved all of the NFS-shared files over to it. The former nfs-host - grid1/disseminate - is now mounting /home and /mpi via NFS like any good node ought.
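On each node, those mounts boil down to a couple of /etc/fstab entries along these lines (hostnames as above; the mount options are my usual NFS defaults, not gospel):

```
# /etc/fstab on a compute node (sketch; grid-control is the new NFS host)
grid-control:/home  /home  nfs  rw,hard,intr  0  0
grid-control:/mpi   /mpi   nfs  rw,hard,intr  0  0
```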


NIS has now been transferred to grid-control. The former master, grid1, has been stripped of its former position, and is now thoroughly nodified. I am going to wipe it and reinstall a fresh OS on it, just to make sure there are no weird configuration anomalies left over from its time as master.
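For my own notes, repointing a node at the new NIS master went roughly like this (a Debian-flavored sketch from memory; the domain name is a placeholder and init script paths may differ):

```shell
# On each node: bind to the new NIS master (sketch)
domainname <our-nis-domain>                   # placeholder domain name
echo "ypserver grid-control" >> /etc/yp.conf  # point ypbind at grid-control
/etc/init.d/nis restart
```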

Stay tuned, there's more to come.


posted by Christian @ 9:10 PM

Site Design Copyright © 2008 Christian Funkhouser

Site used in accordance with the Elon University Web Policy.
