Slurm Workload Manager#
Typical Cluster Overview#
A typical layout of an HPC system looks like the following: there is a login node which you ssh to, and a number of what we call compute nodes, shown in the figure above in the groups CPU, GPU and LargeMem. The dual arrows indicate high-speed network connections between the nodes.
CPU: Typically, CPU nodes are the standard or default machines, with a certain number of server-grade cores and roughly 1-2 GB of RAM per core. These are the machines used largely for CPU-based computations, with parallelism implemented through either OpenMP or MPI.
GPU: These nodes are similar to the CPU nodes in that they have a similar number of CPU cores (sometimes fewer), but in addition they usually have at least one high-performance GPU card, and in practice often two or more. One uses these to run programs that can leverage the parallelism of GPUs. Programs with large parallel loops can usually be accelerated very well on GPUs, so that their wall time can be up to 20-100 times shorter than when run on CPUs using MPI/OpenMP.
LargeMem: These are nodes configured to address use cases or programs that require unusually large amounts of RAM. They can be of either CPU or GPU type in terms of compute capability, but they have an order of magnitude more RAM than the usual machines. On Meluxina, for example, CPU and GPU nodes have 512 GB of RAM per node, while large-memory nodes have 4 TB per node.
Slurm Overview#
The standard usage model for an HPC cluster is that you log into a front-end server or web portal and from there launch applications to run on one or more back-end servers. The software tool which manages this is called a workload manager or batch scheduler. Most HPC systems give direct user access only on the login node, from where you delegate your computations/simulations to the compute nodes.
HPC systems are essentially multi-user environments, where several users log in asynchronously and frequently to run their codes. It is the scheduler, or workload manager, that keeps track of which compute nodes are free to use, which are occupied running code, and how long they will remain in that state. It monitors the workload and assigns work submitted by users to idle nodes. The scheduler used on both Kay and Meluxina, and widely used across HPC systems, is the Slurm workload manager.
Slurm provides command-line tools to launch your code on appropriate compute nodes, monitor its progress, and stop or manipulate running jobs in a number of ways. We look into some of these aspects below.
Basic Usage#
There are two modes in which a user can run code on pretty much any HPC system on the planet, as illustrated in the flowchart below: the interactive mode and the batch mode.
The typical way a user will interact with compute resources managed by a workload manager is as follows:
Write a job script which describes the resources required (e.g. how many CPUs and for how long), instructions such as where to print standard output and error, and the commands to run once the job starts.
Submit the job to the workload manager which will then start the job once the requested resources are available.
Once the job completes, the user will find all results, as well as any output that would normally appear on screen, in the previously specified files.
The two types of job commonly used are:
Interactive : Request a set of nodes on which to run an interactive bash shell. This is useful for quick tests and development work. These types of jobs should only be used with the resources requested appropriately.
On Kay, there was a specific partition (queue) called DevQ, restricted to a maximum wall time of 1 hour. For example, the following command will submit an interactive job requesting 1 node for 1 hour, to be charged to myproj_id:
On Kay:
salloc -p DevQ -N 1 -A myproj_id -t 1:00:00
The equivalent command on Meluxina is:
salloc -p cpu -q test -N 1 -A myproj_id -t 1:00:00
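Once the allocation is granted, salloc starts a shell with the job environment set (where exactly that shell runs depends on the system configuration), and work is typically launched on the allocated node with srun. A minimal sketch, with a placeholder program name:
srun -n 4 ./my_program    # run 4 tasks on the allocated node (./my_program is a placeholder)
exit                      # leave the shell and release the allocation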
Batch : A script is submitted for later execution, whenever the requested resources become available. Both within this script and on the command line when submitting the job, a set of constraints, required information and instructions are given. The file must be a shell script (i.e. start with #!/bin/sh) and Slurm directives must be preceded by #SBATCH. A sample script is displayed below which requests 4 nodes for 20 minutes to run an 80-process MPI application; it could be submitted using the command:
sbatch mybatchjob.sh
where the file mybatchjob.sh looks like this:
#!/bin/sh
# Resource requests: 20 minutes of wall time, 4 nodes, charged to myproj_id
#SBATCH --time=00:20:00
#SBATCH --nodes=4
#SBATCH -A myproj_id

# Load the compiler/MPI environment and launch 80 MPI processes
module load intel/2019
mpirun -np 80 ./a.out
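After submission, sbatch prints the job id, and unless redirected with the --output/--error directives, everything the job writes to standard output and error ends up in a file named slurm-<jobid>.out in the directory the job was submitted from. A typical sequence, with a placeholder job id:
sbatch mybatchjob.sh     # prints something like: Submitted batch job 123456
squeue -j 123456         # check the state of the job while it is queued or running
cat slurm-123456.out     # inspect the output once the job has finished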
Slurm Commands#
Slurm informational command summary
- sinfo : lists all partitions/queues and their limits. Run man sinfo for more details about the command and further arguments.
- squeue : lists all queued jobs. Run man squeue for more details about the command and further arguments, such as:
  - squeue -u $USER prints the jobs of $USER.
  - squeue -j jobid shows info about a particular job with job id jobid.
  - squeue --start shows the estimated start time of a job.
  - squeue -A myproj_id lists all jobs using a specific account.
- mybalance : lists a summary of the core-hour balances of all of the user's accounts.
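For example, a quick check of the system and of your own jobs from the login node might look like this (the job id is a placeholder):
sinfo                      # list the partitions/queues and their limits
squeue -u $USER            # list your own queued and running jobs
squeue -j 123456 --start   # estimated start time of job 123456
squeue -A myproj_id        # all jobs charged to the myproj_id account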
Slurm job commands
Below is a table of some of the common Slurm commands used in running jobs, with typical use cases; example invocations follow the table.
| Command | Description |
|---|---|
| sbatch | Submit a job script |
| | Submit a script to use 5 nodes |
| | Submit a job with a dependency on the successful completion of other jobs |
| scancel | Cancel a job |
| sattach | Attach a terminal to the standard output of a running job (job step 0) |
| scontrol | Prevent a queued job from running (hold) |
| | Release a job hold |
| | Display detailed info about a specific job |
| srun | Run a parallel job (mostly within an allocation created with a job script) |
| | Run a 2-node interactive job for 30 minutes |
| | Run an MPI application within a Slurm submit script (using all cores allocated on all nodes) |
| | Run a hybrid MPI/OpenMP application using 1 MPI process per node |
| | Run a command on an already running job, e.g. to find out the CPU/memory usage |
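As a quick reference, the use cases in the table translate into invocations like the following; the options are standard Slurm, while mybatchjob.sh, ./a.out and the job ids are placeholders:
sbatch mybatchjob.sh                              # submit a job script
sbatch -N 5 mybatchjob.sh                         # submit the script to use 5 nodes
sbatch --dependency=afterok:123455 mybatchjob.sh  # start only after job 123455 completes successfully
scancel 123456                                    # cancel job 123456
sattach 123456.0                                  # attach to the standard output of job step 0
scontrol hold 123456                              # prevent a queued job from running
scontrol release 123456                           # release the job hold
scontrol show job 123456                          # display detailed info about the job
srun -N 2 -t 00:30:00 --pty bash                  # run a 2-node interactive job for 30 minutes
srun ./a.out                                      # inside a submit script: launches as many MPI tasks as requested with --ntasks
srun --ntasks-per-node=1 ./a.out                  # hybrid MPI/OpenMP: one MPI process per node (set OMP_NUM_THREADS for the threads)
srun --jobid=123456 --pty top                     # run a command inside an already running job (recent Slurm may also need --overlap)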
Further Information#
For a detailed understanding of Slurm, please see the official Slurm website (https://slurm.schedmd.com).