ECPS COMPUTING CLUSTER
ECPS cluster: SLURM administrator's page

Monitoring

Count of nodes by SLURM state: the sum should equal the number of compute nodes; any state with "DRAIN" in its name is bad (see below).
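
If the dashboard is unavailable, the same count can be obtained from the command line (a minimal sketch using sinfo):

sinfo -h -N -o "%T" | sort | uniq -c    # one line per state, with the number of nodes in that state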

Runbook

Troubleshoot the SLURM scheduler

  1. Check that the sinfo command returns something like this:
    PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
    Compute          up     infinite     3    idle  ecpsc[10-12]
    FastCompute*     up     infinite     2    idle  ecpsf[10-11]
    
    If not, see "Undrain all nodes" below.
    You can also use scontrol show nodes to get a node-by-node dump of their state.
  2. If you undrain the nodes but they go back to a drained state after a while, look at /var/log/slurm/slurmctl.log for the reason (see also the sinfo sketch after this list).
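
Beyond the log, sinfo can also report the reason SLURM recorded for each drained node (the node name in the second command is just one of the examples above):

sinfo -R                                      # REASON, USER, TIMESTAMP, NODELIST for down/drained nodes
scontrol show node ecpsc10 | grep -i reason   # per-node detail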

Undrain all nodes

for node in $(seq -w 10 12); do \
  scontrol update NodeName=ecpsc$node State=RESUME; \
  done
for fastnode in $(seq 10 11); do \
  scontrol update NodeName=ecpsf$fastnode State=RESUME; \
  done
scontrol show nodes | grep State    # Should show no DRAIN/DRAINED/DRAINING states
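
If your scontrol accepts hostlist expressions (recent SLURM releases do), the loops above can be collapsed into single commands; the ranges below assume the node names from the sinfo example:

scontrol update NodeName=ecpsc[10-12] State=RESUME
scontrol update NodeName=ecpsf[10-11] State=RESUME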

Nodes still drained / draining by themselves?

💡 There is a dashboard for that.

Take a look at /var/log/slurm/slurmctl.log to find out why. Common causes include:

  • slurmctld restarting (in which case you need to undrain again manually as above);
  • the RealMemory value in /etc/slurm/slurm.conf not matching what the node actually reports (see the check sketched below).
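
For the RealMemory case, a quick check (a sketch; run on the affected node) is to compare what slurmd detects with what is configured:

slurmd -C                                    # prints the node line slurmd would report, including RealMemory
grep -i realmemory /etc/slurm/slurm.conf     # configured value to compare against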