Slurm Workload Manager#
Typical Cluster Overview#
A typical layout of an HPC system looks like the following: there is a login node which you ssh to, and a number of what we call compute nodes, shown in the figure above in the groups CPU, GPU and LargeMem. The dual arrows indicate high-speed network connections between the nodes.
CPU: Typically, CPU nodes are the standard or default machines, with a certain number of server-grade cores and roughly 1-2 GB of RAM per core. These are the machines used largely for CPU-based computations, with parallelism implemented through either OpenMP or MPI.
GPU: These nodes are similar to the CPU nodes in that they have a similar number of CPU cores (sometimes fewer), but in addition they usually have at least one high-performance GPU card, and in practice often two or more. One uses these to run programs that can leverage the parallelism of GPUs. Programs with large parallel loops can usually be accelerated very well on GPUs, so that their wall time can be up to 20-100 times shorter than when run on CPUs using MPI/OpenMP.
LargeMem: These are nodes configured to address use cases or programs that require unusually large amounts of RAM. They can be of either CPU or GPU type in terms of compute capability, but they have an order of magnitude more RAM than the usual machines. On Meluxina, for example, CPU and GPU nodes have 512 GB of RAM per node, while large-memory nodes have 4 TB per node.
Slurm Overview#
The standard usage model for an HPC cluster is that you log into a front-end server or web portal and from there launch applications to run on one or more back-end servers. The software tool which manages this is called a workload manager or batch scheduler. Most HPC systems give direct user access only on the login node, from where you delegate your computations/simulations to the compute nodes.
HPC systems are essentially multi-user environments, where several users log in asynchronously and frequently to run their codes. It is the scheduler, or workload manager, that keeps track of which compute nodes are free to use, which are occupied running code, and how long they will remain in that state. It monitors the workload and assigns work submitted by users to idle nodes. The scheduler used on both Kay and Meluxina, and widely used across HPC systems, is the Slurm workload manager.
Slurm provides command-line tools to launch your code on appropriate compute nodes, monitor its progress, and stop or manipulate running jobs in a number of ways. We look into some of these aspects below.
Basic Usage#
There are two modes in which a user can run code on pretty much any HPC system on the planet, as illustrated in the flowchart below: the interactive mode and the batch mode.
The typical way a user will interact with compute resources managed by a workload manager is as follows:
Write a job script which describes the resources required (e.g. how many CPUs and for how long), instructions such as where to print standard output and error, and the commands to run once the job starts.
Submit the job to the workload manager which will then start the job once the requested resources are available.
Once the job completes, the user will find all results, as well as any output that would normally appear on screen, in the previously specified files.
The two types of job commonly used are:
Interactive : Request a set of nodes on which to run an interactive bash shell. This is useful for quick tests and development work. These types of jobs should only be used with the resources requested appropriately.
On Kay, there was a specific partition (queue) called DevQ, restricted to a maximum wall time of 1 hour. For example, the following command will submit an interactive job requesting 1 node for 1 hour, to be charged to myproj_id:
On Kay:
salloc -p DevQ -N 1 -A myproj_id -t 1:00:00
The equivalent command on Meluxina is:
salloc -p cpu -q test -N 1 -A myproj_id -t 1:00:00
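Once the allocation is granted, salloc starts a shell with the job environment set (where exactly that shell runs depends on the system configuration), and work is typically launched on the allocated node with srun. A minimal sketch, with a placeholder program name:
srun -n 4 ./my_program    # run 4 tasks on the allocated node (./my_program is a placeholder)
exit                      # leave the shell and release the allocation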
Batch : A script is submitted for later execution, whenever the requested resources become available. Both within this script and on the command line when submitting the job, a set of constraints, required information and instructions are given. The file must be a shell script (i.e. start with #!/bin/sh) and Slurm directives must be preceded by #SBATCH. A sample script is displayed below which requests 4 nodes for 20 minutes to run an 80-process MPI application; it could be submitted using the command:
sbatch mybatchjob.sh
where the file mybatchjob.sh looks like this:
#!/bin/sh
# Resource requests: 20 minutes of wall time, 4 nodes, charged to myproj_id
#SBATCH --time=00:20:00
#SBATCH --nodes=4
#SBATCH -A myproj_id

# Load the compiler/MPI environment and launch 80 MPI processes
module load intel/2019
mpirun -np 80 ./a.out
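After submission, sbatch prints the job id, and unless redirected with the --output/--error directives, everything the job writes to standard output and error ends up in a file named slurm-<jobid>.out in the directory the job was submitted from. A typical sequence, with a placeholder job id:
sbatch mybatchjob.sh     # prints something like: Submitted batch job 123456
squeue -j 123456         # check the state of the job while it is queued or running
cat slurm-123456.out     # inspect the output once the job has finished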
Slurm Commands#
Slurm informational command summary
- sinfo : lists all partitions/queues and their limits. Run man sinfo for more details about the command and further arguments.
- squeue : lists all queued jobs. Run man squeue for more details about the command and further arguments, such as:
  - squeue -u $USER prints the jobs of $USER.
  - squeue -j jobid shows info about a particular job with job id jobid.
  - squeue --start shows the estimated start time of a job.
  - squeue -A myproj_id lists all jobs using a specific account.
- mybalance : lists a summary of the core-hour balances of all of the user's accounts.
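For example, a quick check of the system and of your own jobs from the login node might look like this (the job id is a placeholder):
sinfo                      # list the partitions/queues and their limits
squeue -u $USER            # list your own queued and running jobs
squeue -j 123456 --start   # estimated start time of job 123456
squeue -A myproj_id        # all jobs charged to the myproj_id account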
Slurm job commands
Below is a table of some of the common Slurm commands used in running jobs, with typical use cases; example invocations follow the table.
| Command | Description |
|---|---|
| sbatch | Submit a job script |
| | Submit a script to use 5 nodes |
| | Submit a job with a dependency on the successful completion of other jobs |
| scancel | Cancel a job |
| sattach | Attach a terminal to the standard output of a running job (job step 0) |
| scontrol | Prevent a queued job from running (hold) |
| | Release a job hold |
| | Display detailed info about a specific job |
| srun | Run a parallel job (mostly within an allocation created with a job script) |
| | Run a 2-node interactive job for 30 minutes |
| | Run an MPI application within a Slurm submit script (using all cores allocated on all nodes) |
| | Run a hybrid MPI/OpenMP application using 1 MPI process per node |
| | Run a command on an already running job, e.g. to find out the CPU/memory usage |
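As a quick reference, the use cases in the table translate into invocations like the following; the options are standard Slurm, while mybatchjob.sh, ./a.out and the job ids are placeholders:
sbatch mybatchjob.sh                              # submit a job script
sbatch -N 5 mybatchjob.sh                         # submit the script to use 5 nodes
sbatch --dependency=afterok:123455 mybatchjob.sh  # start only after job 123455 completes successfully
scancel 123456                                    # cancel job 123456
sattach 123456.0                                  # attach to the standard output of job step 0
scontrol hold 123456                              # prevent a queued job from running
scontrol release 123456                           # release the job hold
scontrol show job 123456                          # display detailed info about the job
srun -N 2 -t 00:30:00 --pty bash                  # run a 2-node interactive job for 30 minutes
srun ./a.out                                      # inside a submit script: launches as many MPI tasks as requested with --ntasks
srun --ntasks-per-node=1 ./a.out                  # hybrid MPI/OpenMP: one MPI process per node (set OMP_NUM_THREADS for the threads)
srun --jobid=123456 --pty top                     # run a command inside an already running job (recent Slurm may also need --overlap)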
Further Information#
For a detailed understanding of Slurm, please see the official Slurm website (https://slurm.schedmd.com).