Brown CS: Using GridEngine

Our compute cluster (or "grid") can be accessed only via the GridEngine batch queuing system. Using GridEngine is unavoidably complicated and site-specific. This guide is designed to get you started quickly.

How to use the Grid

Scripts Only

GridEngine won't run binary executables. It will only run scripts.

You have jobs to run, and the grid has the resources you need. Tell GridEngine what you want and let it do the rest.

The Basics

Submit jobs to be run using the qsub command:

   % qsub runme
   Your job 98 ("runme") has been submitted

Your script "runme" will be scheduled to run in the next available slot in the grid, with a one-hour time limit.

Once your job is submitted, you can check on it with qstat:

   % qstat
   job-ID prior   name  user state submit/start at      queue               slots
   ------------------------------------------------------------------------------
       98 0.56000 runme jsb  r     12/08/2010 15:35:39  short.q@mblade1301  1

When the job is finished, its standard output and error output will be found in your home directory (which is where the script ran):

   % (cd; ls runme*)
   runme.e98  runme.o98
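
A clean run often leaves the error file empty, so listing the non-empty error files is a quick way to see which jobs may need attention. The following is a minimal sketch that assumes the "name.eJOBID" naming shown in the listing above; the function name nonempty_err_files is made up for illustration and is not part of GridEngine.

```shell
#!/bin/sh
# Hypothetical helper: list error-output files for a job name that are
# not empty. Assumes the "<name>.e<job-id>" naming shown above.
#   $1: job name, e.g. runme
#   $2: directory to search (defaults to your home directory)
nonempty_err_files() {
    find "${2:-$HOME}" -maxdepth 1 -type f -name "$1.e*" -size +0 | sort
}
```

For example, "nonempty_err_files runme" would print the path of every runme.e* file in your home directory that contains at least one byte. A non-empty error file does not always mean the job failed, so treat the list as a starting point.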

To find out more about your script's execution, run qacct:

   % qacct -j 98
   ==============================================================
   qname        short.q
   hostname     mblade1301.cs.brown.edu
   group        tstaff
   owner        jsb
   project      NONE
   department   defaultdepartment
   jobname      runme
   jobnumber    98
   [etc...]

The output of qacct is quite long. It includes how much time and memory your job used, along with many other details.
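
Because the report is long, it can help to filter it down to the few fields you care about. The sketch below reads "name value" lines like those above on standard input and keeps only the named fields. The function name qacct_fields is hypothetical, and the field names ru_wallclock, maxvmem and exit_status in the usage comment are ones qacct normally reports; check them against your own output.

```shell
#!/bin/sh
# Hypothetical helper: keep only selected fields from qacct-style output.
#   $1: space-separated list of field names to keep
# Reads the report on standard input.
qacct_fields() {
    awk -v want="$1" '
        BEGIN {
            # Build a lookup table of the requested field names.
            n = split(want, names, " ")
            for (i = 1; i <= n; i++) keep[names[i]] = 1
        }
        # Print a line only if its first word is a requested field.
        ($1 in keep) { print }
    '
}

# Usage on the grid (field names are assumptions, see above):
#   qacct -j 98 | qacct_fields "jobname ru_wallclock maxvmem exit_status"
```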

Running Many Jobs

qsub -t 0-99

No, that won't work. The range cannot start at zero.

You have hundreds or thousands of nearly identical jobs to run. A common case is a single program to be run on many datasets. You could call qsub over and over again, or you could use an array job:

   % qsub -t 1-100 runme

In the example above, your script "runme" will be run 100 times. Each process will be passed a different value for the environment variable SGE_TASK_ID, from 1 to 100. In this way, you can vary the execution to suit your needs. For example, if you have already partitioned your data set into separate files, your script might look like this:

   ~/project/sim < ~/data/data.$SGE_TASK_ID > ~/results/out.$SGE_TASK_ID

Array jobs have a single job id, but each individual task produces its own standard output and standard error files.
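
If your input files are not numbered, one common approach is to keep a list of them in a text file and let each task pick the line matching its SGE_TASK_ID. The sketch below shows that idea, plus the subtraction needed when a program counts from zero. The helper names nth_line and zero_based and the file ~/data/files.txt are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print line number $1 of the list file $2.
nth_line() {
    sed -n "${1}p" "$2"
}

# Hypothetical helper: convert a task ID (which starts at 1) into a
# zero-based index for programs that count from 0.
zero_based() {
    echo $(($1 - 1))
}

# In a job script on the grid you might then write (paths are placeholders):
#   input=$(nth_line "$SGE_TASK_ID" ~/data/files.txt)
#   ~/project/sim < "$input" > ~/results/out.$SGE_TASK_ID
```

With a 100-line list and "qsub -t 1-100 runme", each task would process exactly one line of the list.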

Long Running and Large Memory Jobs

If you need more than one hour of run time, or if your job uses a lot of memory (more than 1GB), then you will need to request those resources.

Time

Our grid puts all jobs into one of three categories: short, long and very long (vlong). Short jobs are the default and they will be killed if they run for more than one hour of wallclock time. Long jobs have up to 24 hours, and very long jobs can run forever.

Why would anyone use the default? Because long and very long jobs are limited to a fraction of the total grid slots at any one time. Only short jobs can fully populate the grid.

   % qsub -l hour runme    # or -l short (or no option)
   % qsub -l day runme     # or -l long
   % qsub -l inf runme     # or -l vlong
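
If you submit jobs from a wrapper script, the choice of category can be derived from a run-time estimate. This is only a sketch of the limits described above (one hour, 24 hours, unlimited); the function name time_class is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map an estimated run time in minutes to the
# resource name used with "qsub -l" in the examples above.
time_class() {
    if [ "$1" -le 60 ]; then
        echo hour      # short: up to one hour
    elif [ "$1" -le 1440 ]; then
        echo day       # long: up to 24 hours
    else
        echo inf       # vlong: no time limit
    fi
}

# Usage on the grid:
#   qsub -l "$(time_class 90)" runme
```

Since jobs are killed when they exceed their limit, it is safer to round your estimate up before passing it in.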

Memory

Your job will never be killed for using too much memory, but if you use a lot, you can avoid swapping by first requesting what you need. It's also good grid etiquette. To ensure you get 4GB of physical RAM:

   % qsub -l vf=4G runme

Note that vf stands for "virtual free," but in our grid it is set to the total physical RAM in each machine. The request above will only be run on a machine with at least 4GB of unused main memory. This doesn't prevent jobs from competing for memory as they run, since it only affects job placement, but it certainly improves your chances of getting what you need.

Parallel Jobs and Benchmarking

You need simultaneous access to a number of machines, or to all cores on one machine, or you just need to ensure that your job is the only one running on a set of machines. You need to use a parallel environment.

Multiple Cores

The smp parallel environment is designed to give you access to multiple cores on each machine. If your program is multi-threaded, and you want it to have 2 cores, you might run it this way:

   % qsub -pe smp 2 runme

That will ensure that the process gets two job slots on each machine on which it runs. GridEngine will also ensure that twice the memory (if you requested memory) is available.

Note that each machine has one job slot per core. Jobs can spawn more threads, or processes, than slots requested, without penalty. But requesting multiple cores, when needed, is in everyone's interest to keep grid resources from being oversubscribed.
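
One way to keep the thread count in line with your request is to read it from the job environment instead of hard-coding it. The sketch below assumes that GridEngine sets NSLOTS to the number of slots granted, and that your program honors OMP_NUM_THREADS (an OpenMP convention). Neither is stated in this guide, so verify both for your setup. The function name thread_count is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: number of threads to start, taken from NSLOTS
# (assumed to be set by GridEngine) and falling back to 1.
thread_count() {
    echo "${NSLOTS:-1}"
}

# For OpenMP programs, this limits the thread pool to the granted slots.
OMP_NUM_THREADS=$(thread_count)
export OMP_NUM_THREADS
```

Placed at the top of runme, this would let the same script run with "-pe smp 2" or "-pe smp 8" without editing.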

MPI

Applications that use Message Passing Interface (MPI) consist of multiple tasks that rely on a communication infrastructure. The orte (Open Run-Time Environment) parallel environment supports Open MPI applications.

   % qsub -pe orte 4 runme

In the example above, the runme script calls mpirun to start the tasks, which GridEngine distributes to 4 separate machines.

There is an example MPI application to help you get started. There is also much more information at open-mpi.org.

Multiple Machines

You want your parallel tasks to run on some number of separate machines. Without a well-defined framework, such as MPI (above), you'll need to do a little more work.

[more explanation once I figure this out]

Benchmarking

You need exclusive use of a machine for benchmarking. The easiest way to do this is to request all slots on the machine. Since we have a variety of machine types, you'll have to be explicit about which machines you want (see the grid page for hardware details).

   % qsub -pe smp 64 -q '*@@mblade13' runme

In this example, your job will only run on machines in the '@mblade13' host group, which are all 64-core machines, and your job will be the only job running.

There is an obvious mapping between host groups and machines. To list all host groups:

   % qconf -shgrpl

GPUs

Machines with GPUs are accessible only by requesting the gpus resource. When requested, a job setup script chooses idle GPUs and assigns them to the job. The Nvidia CUDA library automatically sees only the allocated GPUs. GPU jobs may share a machine, but will always have exclusive access to requested GPU resources.

This is the command to run a job that will use two GPUs:

   % qsub -l gpus=2 runme
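
To confirm from inside a job how many GPUs it was given, you can inspect the environment. The sketch below assumes the job setup script exposes the allocation through the CUDA_VISIBLE_DEVICES variable, which is the usual CUDA mechanism but is not confirmed by this guide. The function name visible_gpu_count is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the GPU ids listed in CUDA_VISIBLE_DEVICES.
# Assumes the allocation is exposed through that variable. Prints 0 when
# the variable is unset or empty, which outside the grid does not
# necessarily mean that no GPUs exist.
visible_gpu_count() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
    else
        # The variable holds a comma-separated list; count its entries.
        echo "$CUDA_VISIBLE_DEVICES" | awk -F, '{ print NF }'
    fi
}
```

A job submitted with "-l gpus=2" could call visible_gpu_count early on and stop if the result is not 2.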

GPU VRAM

Some GPUs have more memory than others. The GTX-class cards have between 8 and 11GB of VRAM. A smaller number of GPUs have more (see the grid resources page for details). To gain access to the GPUs with more VRAM, request the gmem resource:

   % qsub -l gpus=2 -l gmem=24 runme

Interactive Sessions

Running an interactive session on a grid machine is strongly discouraged, but sometimes unavoidable. Interactive sessions are available only for short (1 hour) or long (1 day) jobs. Very long jobs must be batch jobs.

   % qlogin	# or...
   % qrsh

These commands accept the same options that qsub accepts. Only qlogin provides X11 port forwarding.

Testing

Before you unleash 1000 jobs on the grid, you want to quickly test and make sure things are working properly, but the grid is so busy! You want to run a test.

   % qsub -l test runme

Tests run at a high priority, ahead of other waiting jobs. The caveat is that they can only run for up to 10 minutes, and they are limited to one slot per machine. Don't abuse this!

Helpful Hints

Current Working Directory

To ensure that your job runs in the directory from which you submit it (and that its standard output and error files land there), use the -cwd option:

   % qsub -cwd runme

Running Now

If you want GridEngine to run your job now or else fail, give it the -now option:

   % qsub -now y runme

Embedding Options

You don't have to remember all the qsub options you need for every job you run. You can embed them in your script:

   % cat runme
   #!/bin/sh
   #
   #  Execute from the current working directory
   #$ -cwd
   #
   #  This is a long-running job
   #$ -l inf
   #
   #  Can use up to 6GB of memory
   #$ -l vf=6G
   #
   ~/project/sim

With all the options in the script, executing it is simple:

   % qsub runme

You can, of course, still use command-line arguments to augment or override embedded options.
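
Before submitting, it can be useful to see which options a script carries. The sketch below prints every embedded option by stripping the leading "#$" marker from the lines that have one. The function name embedded_opts is hypothetical; it only reads the file and does not reproduce how qsub itself parses scripts.

```shell
#!/bin/sh
# Hypothetical helper: print the options embedded in a job script,
# one per line, by stripping the leading "#$" marker.
#   $1: path to the job script
embedded_opts() {
    sed -n 's/^#\$[[:space:]]*//p' "$1"
}
```

Run against the runme script above, "embedded_opts runme" would print -cwd, -l inf and -l vf=6G on separate lines.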

Mail Notification

To receive email notifications about your job, use the "-m" option:

   % qsub -m as runme

In the example above, you will get mail if the job aborts or is suspended. The mail options are:

   a - abort
   b - begin
   e - exit
   s - suspend

Deleting Your Jobs

Deleting your submitted jobs can be done with the qdel command:

   % qdel job-id
    The specified job-id is deleted.

   % qdel -u username
    All jobs owned by username are deleted.

Users can only delete their own jobs.

Getting help

The man pages for GridEngine commands are surprisingly helpful.

If you are using the grid, you must subscribe to the compute mailing list. All grid-related announcements are posted to this list only. You can also ask questions and coordinate grid usage there.

The grid is a supported department resource, so you can also mail "problem" if you need help.
