Perhaps the easiest way to distribute a programming project is to use dsh.
It runs Linux commands on all compute nodes or on particular ones.
If your batch programs take command-line arguments, you can pass different
parameters to each program so that every compute node works on a different
part of the problem. An example follows shortly.
To run a command on all 34 compute nodes use the following command:
dsh -a command
To run a command on a particular node (node 5 for example) use the command:
dsh -w cnode5 command
Here's a simple example of how you can distribute a program that calculates
data at various time intervals. This works if each time interval can be
calculated independently of the other intervals. Let's say the program is
called "simulation.exe" and we pass it a time value and a filename where the
result will be stored. Create a file called "batch1.bash" with the following
contents:
#!/bin/bash
dsh -w cnode1 simulation.exe 0.1 data01.txt
dsh -w cnode2 simulation.exe 0.2 data02.txt
dsh -w cnode3 simulation.exe 0.3 data03.txt
dsh -w cnode4 simulation.exe 0.4 data04.txt
. . .
The first line is the standard Linux way to mark a file as a bash shell
script. After the script is saved, you need to give yourself execute access
to it. Run the following command:
chmod u+x batch1.bash
You can run the batch program on the command line like this:
./batch1.bash
If you have a lot of batch programs to run it is probably worth taking the
time to write a program that generates the batch shell script as its output.
Using MPI, you can write programs that share data with one another while
they run on different compute nodes. See the MPI Resource Page for more
information.
One important difference between MPI programs and non-MPI programs is how
one compiles and runs the programs. Before compiling you should load the
appropriate module for the MPI environment. For C programs type the
following command:
module load mpich/enet-gcc
You can see what modules are available with the command "module avail".
To compile a C program that has MPI calls in it, use the command:
mpicc -o programname.exe programname.c
It will create an executable called "programname.exe".
As when compiling C programs with gcc, if you use math functions such as
sqrt, sin, or cos, you should add the "-lm" option to the mpicc command.
To run the executable produced by mpicc use the command
mpirun -np 10 programname.exe
where the -np option tells mpirun to start 10 copies (processes) of the
program, which are distributed over the compute nodes. It is a good idea to
write your code so that it works no matter how many nodes you run it on.
This makes it more flexible in the future, when you may have to run it on
fewer nodes than you wish because of down time or heavy load. Sometimes,
however, a program needs an even number of nodes, or even a power of 2.
A popular way to write MPI programs consists of writing one program that does
different things depending on which compute node it runs on. The program uses
calls to "MPI_Comm_size" and "MPI_Comm_rank" to determine how many nodes are
being used in the problem and which particular node this program is running
on. For example, if there are 10 nodes working on the problem, then "size"
will be 10 and "rank" will be a number between 0 and 9, depending on the node.
Within the program we use an if statement to decide what to do on this
particular node. In the following C example, the node of rank 0 opens a file
and waits for data from the other nodes, each of which solves the problem
for a different set of time values:
. . .
if(rank==0)
{
    //open the data file and wait for data from other nodes
}
else
{
    for(i=rank;i<=max_time_steps;i+=size-1)
    {
        float time=i*0.1;
        . . .
    }
}
. . .
In this example, if there are 10 nodes working on the problem (size=10) the
node with rank=1 will work on times 0.1, 1.0, 1.9, etc. The node with rank=2
will work on times 0.2, 1.1, 2.0, etc.
Data can be passed between the nodes using MPI_Send and MPI_Recv. In the
example above the node with rank=0 will call MPI_Recv in a for loop where
we specify which compute node we are receiving data from (which is incremented
in a round-robin fashion so the data can be written to the file in order).
Compute nodes of rank=1 through 9 will use MPI_Send to send their results to
the node of rank 0.
Notice that the for loop that assigns time values to each node keeps the
time values moving through the system in roughly increasing order. The time
values 0.1 through 0.9 are all computed concurrently, and the values 1.0
through 1.8 are computed next (presumably concurrently, unless some nodes
fall behind). Since the
node with rank=0 is blocked waiting for the results in order of time value,
it is important to get the time values processed somewhat in order. The calls
to MPI_Send and MPI_Recv will synchronize the nodes to a degree, so it is
important to proceed in a way that results in the shortest blocking time.
For the coding details, see the MPI Tutorial.