Slurm is used for cluster management and job scheduling. Slurm has three key functions: it allocates access to compute nodes for some duration of time, it provides a framework for starting, executing, and monitoring work on those nodes, and it arbitrates contention for resources by managing a queue of pending jobs.
Submit a bash script with sbatch. Slurm will schedule the script according to the #SBATCH directives given at the top of the file.
$ sbatch script.sh
Script example:
#!/bin/bash
#################### Batch Headers ####################
#SBATCH -p drcluster # Get it? DRC cluster ;)
#SBATCH -J hello_world # Custom name
#SBATCH -o results-%j.out # stdout/stderr redirected to file
#SBATCH -N 3 # Number of nodes
#SBATCH -n 1 # Number of tasks
#######################################################
python hello_world.py
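A few other commonly used directives (wall-time limit, memory per node, and email notification) go in the same header block. The values below are placeholders, not settings for any particular cluster:

```shell
#!/bin/bash
#SBATCH -p drcluster          # Partition to submit to
#SBATCH -J hello_world        # Custom job name
#SBATCH -o results-%j.out     # stdout/stderr file (%j expands to the job ID)
#SBATCH -N 1                  # Number of nodes
#SBATCH -n 1                  # Number of tasks
#SBATCH --time=01:00:00       # Wall-time limit (HH:MM:SS)
#SBATCH --mem=4G              # Memory per node
#SBATCH --mail-type=END,FAIL  # Email when the job ends or fails
#SBATCH --mail-user=user@example.com  # Placeholder address

python hello_world.py
```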
Check on a submitted job with sacct or squeue. (The %j in the output filename expands to the job number.)
scancel
is used to cancel a pending or running job
$ scancel <jobID>
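scancel can also select jobs by user, state, or name rather than by ID; the flags below are standard scancel options:

```shell
# Cancel every job belonging to a user
scancel -u <username>

# Cancel only that user's pending jobs
scancel --state=PENDING -u <username>

# Cancel jobs by name
scancel --name=hello_world
```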
scontrol
allows you to view or alter a job's details
View job details:
$ scontrol show job 14
Suspend a job:
$ sudo scontrol suspend 14
Continue a job:
$ scontrol resume 14
Release a held job:
$ scontrol release 14
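scontrol can also hold a pending job so the scheduler skips over it until it is released; hold and release are standard scontrol subcommands (job ID 14 is just an example):

```shell
# Prevent a pending job from starting
scontrol hold 14

# Let the scheduler consider it again
scontrol release 14
```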
sinfo
provides information about the cluster
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
drcluster* up infinite 3 idle node[02-04]
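sinfo can also give a node-oriented rather than partition-oriented view; -N and -l are standard sinfo flags:

```shell
# One line per node, with CPU, memory, and state details
sinfo -N -l
```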
squeue
displays jobs that are currently pending or running.
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
22 drcluster hostname drc14 PD 0:00 3 (PartitionConfig)
A list of job state codes can be found in the squeue man page (JOB STATE CODES section).
srun
allows you to run parallel jobs directly from the command line. srun accepts most of the same
command-line arguments as sbatch.
$ srun --nodes=3 hostname
node03
node02
node04
To run an interactive srun
session:
$ srun --pty /bin/bash
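When the cluster enforces resource limits, it helps to request them explicitly for the interactive shell; all flags below are standard srun options, with placeholder values:

```shell
# Interactive shell on one node, one task, for up to 30 minutes
srun --nodes=1 --ntasks=1 --time=00:30:00 --pty /bin/bash
```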