MPICH2 Tight Integration  //  Tuesday, December 16, 2008

After getting SGE installed, I spent some time familiarizing myself with the way it works. I am a little concerned with how easily a queue can be put into the Error state, and with the fact that an administrator has to clear it manually. I will be working more on this in the coming weeks.

The issue at hand is, of course, MPICH2 tight integration with SGE. I was using the howto on the Sun SGE site as a reference, but of course had to roll my own. Now that I've worked out all the kinks, MPICH2 is successfully started when a parallel environment of type 'mpich2' is requested in the qsub script.

For example:

# Shell to execute this job under
#$ -S /bin/bash
# Name of this job
#$ -N mpihw
# Account string for this job (-A sets accounting info, not a login name)
#$ -A caf
# Specifying the mpich2 parallel environment
#$ -pe mpich2 8

PROCS=$((NHOSTS * 2)) # Two processors per host; $NHOSTS is set by SGE for PE jobs
/grid/mpich2/bin/mpiexec -n $PROCS ~caf/bin/mpihw
The above will execute an MPI Hello World program in parallel. Again, it concerns me that if this program were to hang, core dump, segfault, be killed, fart, or otherwise exit uncleanly, the queue in question will be put into an error state and be unusable until a grid administrator clears it. Before I figured out the nifty qstat -f and qstat -j <job_id> commands, I spent many an hour scratching my head over why SGE was complaining about not having enough available queues to run the above script.
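When a queue does land in the error state, the sequence below is roughly what's involved in diagnosing and clearing it. The queue instance name all.q@grid2 and the job ID are made up for illustration, and qmod -cq is the clear command as I recall it from the SGE 6.x man pages:

```shell
# Show full queue status; troubled queues show an 'E' in the state column
qstat -f

# Show why a particular job is stuck (use the numeric job ID from qstat)
qstat -j 42

# As an admin, clear the error state on the affected queue instance
qmod -cq all.q@grid2
```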

On to GridSphere. GridSphere is a JSR 168 compliant portlet container, and apparently the best available. However, compiling this thing was the stuff of nightmares, as every single time I tried, I found a new unescaped string in a JSP page. Well, the reason for this is that I like to keep things up-to-date, bleeding edge, etc. GridSphere was written in a time when unescaped double quotes were acceptable by Tomcat's standards, but as of Tomcat 5.5.26, they no longer are. Thank God Almighty for Google, because I scoured pages and pages of forums, mailing list archives, and release notes before stumbling across a single line buried deep in the bowels of the internet that described this issue. That being said, I popped over to the Tomcat 5 archive page and pulled down 5.5.25, copied my tomcat-users.xml into the appropriate directory, and ln -s'd the newly downloaded version to /usr/local/tomcat. I've never been so happy to watch old software start up.
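For anyone else bitten by this, the downgrade amounted to something like the following. The archive URL pattern and paths are from memory; adjust to your layout:

```shell
# Grab Tomcat 5.5.25 from the Apache archive and unpack it
cd /usr/local
wget http://archive.apache.org/dist/tomcat/tomcat-5/v5.5.25/bin/apache-tomcat-5.5.25.tar.gz
tar xzf apache-tomcat-5.5.25.tar.gz

# Carry over the existing users file from the current install
cp tomcat/conf/tomcat-users.xml apache-tomcat-5.5.25/conf/

# Point the well-known path at the older version
ln -sfn /usr/local/apache-tomcat-5.5.25 /usr/local/tomcat
```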

Now that I've gotten GridSphere installed, I hope to get GridPortlets installed, but as of yet, haven't found a way to get it, because the host site has been tossing me a "Bad Gateway" for almost two weeks now. WTF?!

Note: If someone clicks that link and doesn't get a bad gateway error, kindly contact me and let me know.

That's all for now. I'm off to Ghana for two weeks in the morning, so more updates will need to wait until I get back on the 2nd.

Merry Christmas!


posted by Christian @ 8:02 AM

Endgame Step 2  //  Tuesday, December 9, 2008

Since being introduced to dpkg/APT, I have become loath to install things that are not all nicely .deb'ed. I have to say, however, that once I got going with the Sun Grid Engine installation, things went off without a hitch (well, except for one). It did pay to have everything planned out like I did, though. For details, see the endgame plan.

The snag I hit during the SGE install was a small one (I hope). It turns out, ARCo (Accounting and Reporting Console) of SGE fame requires something called the Sun Java Web Console. In a fantastic case study of companies that have gone Open-Source-but-only-sorta, ARCo is distributed openly, but the Sun Java Web Console is not. That being said, it is downloadable, but only in the Red Hat RPM format for Linux systems, which doesn't help us much.

I had the misfortune of thinking that alien would save me. I was wrong. We are going to go without ARCo for now.
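For the record, the conversion attempt was along these lines (the RPM filename is approximate):

```shell
# Convert the Red Hat package to a .deb (run as root)
alien --to-deb webconsole-*.rpm

# Try to install the converted package
dpkg -i webconsole*.deb
```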

Next up is MPICH2 tight integration with SGE.




posted by Christian @ 6:49 PM

Endgame Step 1  //  Tuesday, December 2, 2008

Step 1 of the endgame plan mentioned in my previous post is underway. I have set up grid-control/proprietor on the spare PowerEdge, and have moved all of the NFS-shared files over to it. The former nfs-host - grid1/disseminate - is now mounting /home and /mpi via NFS like any good node ought.
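For reference, the node side of this boils down to a couple of /etc/fstab entries along these lines (hostname and paths per the setup above; the mount options here are illustrative, not necessarily what I used):

```
# /etc/fstab on each node: mount /home and /mpi from grid-control
grid-control:/home  /home  nfs  rw,hard,intr  0  0
grid-control:/mpi   /mpi   nfs  rw,hard,intr  0  0
```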


NIS has now been transferred to grid-control. The former master, grid1, has been stripped of its former position, and is now thoroughly nodified. I am going to wipe it and reinstall a fresh OS on it, just to make sure there are no weird configuration anomalies left over from its time as master.
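Re-pointing a node at the new master is a small change, assuming Debian's stock NIS packaging (file paths and init-script name are from its nis package):

```shell
# Bind explicitly to the new NIS master
echo "ypserver grid-control" >> /etc/yp.conf

# Restart the NIS client and confirm which server answers
/etc/init.d/nis restart
ypwhich    # should print grid-control
```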

Stay tuned, there's more to come.


posted by Christian @ 9:10 PM

We're Up  //  Sunday, November 23, 2008

All eight grid nodes have been set up. I decided to forgo the imaging option for the more antiquated install-and-copy-config-files option. For eight machines, it wasn't that bad, and I was able to use dselect, dpkg, and apt-get to my advantage.

All eight machines share users via NIS, and also share /home and /mpi. To my knowledge, MPICH2 is set up correctly, and I shouldn't need to touch it anymore. If I get a chance before the December 11th go-live date, I may attempt to upgrade to the newest version of MPICH2 (1.0.8; we're currently using 1.0.7), just so the next semester's class has the newest features and bugfixes to work with.
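For checking which version is actually on the nodes before and after an upgrade, MPICH2 ships a small utility (the path assumes the install prefix from my earlier qsub example):

```shell
# mpich2version ships with MPICH2 and reports the build's version and configure flags
/grid/mpich2/bin/mpich2version
```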

The next step is the SGE install, which I am currently planning. See what I am calling the Grid Endgame Proposal, with attached SGE Installation worksheet.

Next time I post, SGE ought to be up and running. Cross your fingers.



posted by Christian @ 2:07 PM

apt-get  //  Wednesday, October 29, 2008

Please note, there are some things that you should avoid, namely the following type of command:

for N in `seq 2 8`; do
    ssh grid$N apt-get -u dselect-upgrade
done

Especially if there are any configuration options that require interaction, say in the case of the Sun JDK. If you do this, apt will complain, and leave your package unconfigured. Make sure you go back through your nodes and configure them, or you will have a crippled system.
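If you do end up with half-configured packages, the cleanup can be done over ssh as well; a sketch (the -t flag gives debconf a real terminal for its prompts):

```shell
# Finish configuring any packages apt left unconfigured, node by node
for N in `seq 2 8`; do
    ssh -t grid$N dpkg --configure -a
done
```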

Also, for future reference, it might be a good idea to do something like:
apt-get -u dselect-upgrade -y -q
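Building on that, forcing debconf to take defaults should keep apt from stopping to ask questions at all. This is an assumption on my part rather than something battle-tested on the grid:

```shell
# Take debconf defaults instead of prompting, on every node
for N in `seq 2 8`; do
    ssh grid$N "DEBIAN_FRONTEND=noninteractive apt-get -y -q dselect-upgrade"
done
```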


posted by Christian @ 6:41 AM

New Digs  //  Tuesday, October 21, 2008

I spent the better part of the day today reorganizing the room in which the grid lives. Through various contacts, I was able to acquire a server shelf on which to house all of the systems, which allowed us to free up much of the real estate in the room. The entire grid is now side-by-side with the department web server. Nothing technical, but here are pictures:

Next time, I will discuss my solution to the node-addition problem. I'm considering g4l, to image the nodes and distribute the OS, although I admit, there doesn't really seem to be a good way to do it. I am also looking into the possibility of using something like BOINC to distribute grid jobs across more of campus.

Here's to next time.


posted by Christian @ 5:36 PM

Sticky Situation  //  Wednesday, October 1, 2008

I have gotten a sticky directory set up on each server, so the users will have a place to keep things locally on each machine. On each node (and the master), there exists /sticky, which looks thusly:

drwsrwsrwt 2 root users 4096 2008-09-21 15:32 sticky

For those that need explanation, there are two less-than-common things going on with this directory. First off, it's got the sticky bit set (chmod +t). This means that items inside /sticky can be renamed or deleted only by the item's owner, the directory's owner, or root (aka, me). The second bit of cool filesystem magic that I worked on this directory is that I set the setgid permission on /sticky. Setting the setgid permission on a directory (chmod g+s) causes new files and sub-directories created therein to inherit the parent directory's GID, rather than the primary GID of the user who created the file. I thought that this might come in handy somewhere down the road, especially if two users need to collaborate on something.
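A minimal sketch of the same setup, using a throwaway directory so it can be tried without root (the real directory is /sticky, owned root:users):

```shell
DIR=/tmp/sticky-demo
mkdir -p "$DIR"
chmod 777 "$DIR"   # world-writable, like /tmp
chmod +t "$DIR"    # sticky bit: only an entry's owner (or root) may delete/rename it
chmod g+s "$DIR"   # setgid: new files inherit this directory's group
ls -ld "$DIR"      # permissions read drwxrwsrwt
```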

As far as accessing the sticky directory is concerned, that part is cake. I have put a symlink in each user's home directory (and /etc/skel) called local that will point to /sticky. Observe:

lrwxrwxrwx 1 caf caf 7 2008-10-02 11:35 local -> /sticky
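Setting that up is one link in the skeleton directory plus one per existing home. Sketched below with a PREFIX variable so it can be tried outside of / (on the real systems PREFIX would be empty, and the user "alice" is made up):

```shell
# PREFIX lets this run against a scratch tree instead of the real filesystem
PREFIX=${PREFIX:-/tmp/skel-demo}
mkdir -p "$PREFIX/etc/skel" "$PREFIX/home/alice"

# New accounts get the link automatically via the skeleton directory
ln -sfn /sticky "$PREFIX/etc/skel/local"

# Existing users: drop the same link into each home directory
for d in "$PREFIX"/home/*; do
    ln -sfn /sticky "$d/local"
done
```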

In other news, I have gotten around to some physical moving of the grid, as well. I put the two machines (disseminate and mete) into the server closet where they will live, and adjusted their network configurations appropriately. All eight of the grid machines, along with the master, are now in their final locations. I have attached a few pictures of the setup, so you can see (kind of) what it will look like. Please note, these are not the highest of quality, as they were taken with my iPhone.

The Grid - Broad View

The Grid - Action Shot

That's all for now. Tonight, I hope to get the temporary master installed and imaged and whatnot. That will allow me to start planning for the implementation of the real master down the road.

Until then,


posted by Christian @ 11:42 AM

Site Design Copyright © 2008 Christian Funkhouser

Site used in accordance with the Elon University Web Policy.
