ECPS COMPUTING CLUSTER
ECPS cluster: SLURM administrator's page

Monitoring

Count of nodes by SLURM state: the sum should equal the number of compute nodes; any state with "DRAIN" in its name is bad (see below).
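
If the dashboard is unavailable, the same count can be obtained from the command line (a minimal sketch using sinfo):

sinfo -h -N -o "%T" | sort | uniq -c    # one line per state, with the number of nodes in that state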

Runbook

Troubleshoot the SLURM scheduler

  1. Check that the sinfo command returns something like this:
    PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
    Compute          up     infinite     3    idle  ecpsc[10-12]
    FastCompute*     up     infinite     2    idle  ecpsf[10-11]
    
    If not, see "Undrain all nodes" below.
    You can also use scontrol show nodes to get a node-by-node dump of their state.
  2. If you undrain the nodes but they go back to a drained state after a while, look at /var/log/slurm/slurmctl.log for the reason (see also the sinfo sketch after this list).
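
Beyond the log, sinfo can also report the reason SLURM recorded for each drained node (the node name in the second command is just one of the examples above):

sinfo -R                                      # REASON, USER, TIMESTAMP, NODELIST for down/drained nodes
scontrol show node ecpsc10 | grep -i reason   # per-node detail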

Undrain all nodes

for node in $(seq -w 10 12); do \
  scontrol update NodeName=ecpsc$node State=RESUME; \
  done
for fastnode in $(seq 10 11); do \
  scontrol update NodeName=ecpsf$fastnode State=RESUME; \
  done
scontrol show nodes | grep State    # Should show no DRAIN/DRAINED/DRAINING states
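
If your scontrol accepts hostlist expressions (recent SLURM releases do), the loops above can be collapsed into single commands; the ranges below assume the node names from the sinfo example:

scontrol update NodeName=ecpsc[10-12] State=RESUME
scontrol update NodeName=ecpsf[10-11] State=RESUME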

Nodes still drained / draining by themselves?

💡 There is a dashboard for that.

Take a look at /var/log/slurm/slurmctl.log to find out why. Common causes include:

  • slurmctld restarting (in which case you need to undrain again manually as above);
  • the RealMemory value in /etc/slurm/slurm.conf not matching what the node actually reports (see the check sketched below).
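
For the RealMemory case, a quick check (a sketch; run on the affected node) is to compare what slurmd detects with what is configured:

slurmd -C                                    # prints the node line slurmd would report, including RealMemory
grep -i realmemory /etc/slurm/slurm.conf     # configured value to compare against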