Using Slurm

CS Hydra Cluster

The CS Hydra cluster contains Debian Bookworm (12) and Ubuntu Noble (24.04) nodes. Each OS type has its own partition. Besides the two different operating systems, there are two types of resources available in the cluster: compute and GPU. Here is the partition listing for the Hydra cluster.

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      1    mix echidna
compute*     up   infinite     29   idle smblade16a[1-14],smblade16b[1-8],smblade24a[1-6],typhon
dcompute     up   infinite     14   idle smblade16b[1-8],smblade24a[1-6]
dgpus        up   infinite      2  alloc gpu[2201,2301]
dgpus        up   infinite     20   idle gpu[1701-1708,1801-1802,1901-1907,2001-2003]
gpus         up   infinite      2  alloc gpu[2201,2301]
gpus         up   infinite     26   idle gpu[1601-1605,1701-1708,1801-1802,1901-1907,2001-2003,2501]
ucompute     up   infinite      1    mix echidna
ucompute     up   infinite     15   idle smblade16a[1-14],typhon
ugpus        up   infinite      7   idle gpu[1601-1605,1801,2501]

The dcompute and dgpus partitions contain the Debian Bookworm compute and GPU nodes, respectively. The ucompute and ugpus partitions contain the Ubuntu Noble compute and GPU nodes, respectively. The compute and gpus partitions contain both the Debian and the Ubuntu nodes of each type.
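You can also restrict the sinfo listing to a single partition with the -p option, for example:

$ sinfo -p dgpus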

Connecting to Hydra

The simplest way to connect to Hydra is through the ssh.cs.brown.edu gateway or the FastX cluster. You need to set up your SSH keypair in order to use either the SSH gateway or the FastX cluster.
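For example, assuming your CS account name is <cs-username> (a placeholder here) and that an ed25519 key is acceptable, connecting through the gateway looks roughly like this:

$ ssh-keygen -t ed25519                  # create a keypair if you do not already have one
$ ssh <cs-username>@ssh.cs.brown.edu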

Submit a compute job

You can submit a job using sbatch:

$ sbatch batch_scripts/hello.sh
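The contents of batch_scripts/hello.sh are not shown here; a minimal sketch of such a script could look like this:

#!/bin/bash
# Print a greeting and the name of the node the job ran on
echo "Hello World from $(hostname)"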

You can confirm that your compute job ran successfully by running:

$ cat batch_scripts/slurm-<job id>*.out

By default, your job is submitted to the compute partition and will run for 1hr if you don't specify a partition name or run time limit.
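For example, to run the same job on the dcompute partition with a two-hour time limit, pass the partition and time options to sbatch:

$ sbatch --partition=dcompute --time=2:00:00 batch_scripts/hello.sh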

Submit a GPU job

To submit a GPU job, you must use the gpus partition and request at least one GPU. Use sbatch with the following options to submit a GPU job.

$ sbatch --partition=gpus --gres=gpu:1 gputest.sh
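The gputest.sh script itself can be minimal; a sketch, assuming the GPU nodes provide the nvidia-smi utility, might be:

#!/bin/bash
# List the GPU(s) that slurm allocated to this job
nvidia-smi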

You can confirm that your GPU job ran successfully by running:

$ cat slurm-<job id>*.out

The gpus partition contains all of the GPU hardware in the CIT datacenter. You must request at least one GPU resource (--gres=gpu:1) in order to run a job on the gpus partition.

Showing the job queue

To see the job queue, use the squeue command.
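For example, to show the whole queue or only your own jobs:

$ squeue
$ squeue -u $USER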

Cancel a job

To cancel your job, use the scancel command:

$ scancel <job id>
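You can also cancel all of your own jobs at once:

$ scancel -u $USER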

Using slurm options in a script

The script you submit to slurm can itself contain slurm options. Here is a simple template, batch.script, to use for that:

#!/bin/bash
# This is an example batch script for slurm on Hydra
#
# The commands for slurm start with #SBATCH
# All slurm commands need to come before the program
# you want to run. In this example, 'echo "Hello World!"'
# is the command we are running.
# is the command we are running.
#
# This is a bash script, so any line that starts with # is
# a comment. If you need to comment out an #SBATCH line,
# put another # in front of the #SBATCH.
#
# To submit this script to slurm do:
# sbatch batch.script
#
# Once the job starts you will see a file MySerialJob-****.out
# The **** will be the slurm JobID
# --- Start of slurm commands -----------
# Set the partition to run on; here we use the gpus partition.
# The Hydra cluster has the following partitions:
# compute, dcompute, dgpus, gpus, ucompute, ugpus
#SBATCH --partition=gpus
# request 1 gpu resource
#SBATCH --gres=gpu:1

# Request an hour of runtime. The default runtime on the compute partition is 1hr.
#SBATCH --time=1:00:00
# Request a certain amount of memory (4GB):
#SBATCH --mem=4G
# Specify a job name:
#SBATCH -J MySerialJob
# Specify an output file and send error output to the same file
# %j is a special variable that is replaced by the JobID when the job starts
#SBATCH -o MySerialJob-%j.out
#SBATCH -e MySerialJob-%j.out
#----- End of slurm commands ----
# Run a command
echo "Hello World!"

Slurm Training

Slurm training is available through the CCV slurm workshop. Go to the CCV Help page for details.