ECPS COMPUTING CLUSTER
ECPS cluster: SLURM administrator's page
Count of nodes by SLURM state: the sum should equal the number of compute nodes; any state containing "DRAIN" is bad (see below).
Troubleshoot the SLURM scheduler
- Check that the sinfo command returns something like this:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
Compute up infinite 3 idle ecpsc[10-12]
FastCompute* up infinite 2 idle ecpsf[10-11]
If not, see "Undrain all nodes" below.
You can also use scontrol show nodes for a node-by-node dump of their state.
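The sinfo check above can be scripted. This is a sketch: count_drained is a hypothetical helper, assuming sinfo's default six-column output where NODES is column 4 and STATE is column 5 (drained/draining states show up as "drain" or "drng"):

```shell
#!/bin/sh
# count_drained: hypothetical helper that sums the NODES column (4) of
# sinfo-style lines whose STATE column (5) contains "drain" or "drng".
count_drained() {
  awk 'tolower($5) ~ /drain|drng/ { total += $4 } END { print total + 0 }'
}

# On the cluster you would run:  sinfo -h | count_drained
# (-h suppresses the header line.)
# Example with captured output:
printf '%s\n' \
  'Compute up infinite 3 idle ecpsc[10-12]' \
  'FastCompute* up infinite 2 idle ecpsf[10-11]' \
  | count_drained   # → 0 when everything is healthy
```

A nonzero result means some nodes need the undrain procedure below.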
- If you undrain the nodes but they go back to a drained state after a while, check
/var/log/slurm/slurmctld.log for the reason.
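To pull the relevant lines out of a busy log, a simple filter is enough (drain_reasons is a hypothetical helper; the log path is the one above):

```shell
#!/bin/sh
# drain_reasons: hypothetical filter for drain-related lines on stdin;
# case-insensitive grep catches "DRAIN", "Draining", etc.
drain_reasons() {
  grep -i 'drain' | tail -n 20
}

# On the cluster:  drain_reasons < /var/log/slurm/slurmctld.log
# Example with captured log lines (content is illustrative):
printf '%s\n' \
  'update_node: node ecpsc10 state set to DRAIN' \
  'sched: Allocate JobId=42' \
  | drain_reasons   # → only the DRAIN line
```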
Undrain all nodes
for node in $(seq -w 10 12); do
    scontrol update NodeName=ecpsc$node State=RESUME
done
for fastnode in $(seq 10 11); do
    scontrol update NodeName=ecpsf$fastnode State=RESUME
done
scontrol show nodes | grep State  # Should show no DRAIN/DRAINED states
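If you do this often, the loops can be wrapped in a small generator so you can review the commands before running them (resume_cmds is a hypothetical helper; node names as above):

```shell
#!/bin/sh
# resume_cmds: hypothetical helper that prints the scontrol commands to
# resume a numeric range of nodes with a given prefix.
# Usage: resume_cmds PREFIX FIRST LAST
resume_cmds() {
  for n in $(seq -w "$2" "$3"); do
    echo "scontrol update NodeName=$1$n State=RESUME"
  done
}

# Review, then pipe to sh on the cluster:
#   { resume_cmds ecpsc 10 12; resume_cmds ecpsf 10 11; } | sh
resume_cmds ecpsc 10 12
```

Note that scontrol also accepts hostlist ranges (e.g. NodeName=ecpsc[10-12]), which avoids the loop entirely.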
Nodes still drained / draining by themselves?
💡 There is a dashboard for that.
Take a look at /var/log/slurm/slurmctld.log to find out why. Common causes include:
- slurmctld restarting (in which case you need to undrain again manually, as above);
- the RealMemory value in /etc/slurm/slurm.conf being higher than the memory the node actually reports, which drains the node with reason "Low RealMemory".
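For the RealMemory case: running slurmd -C on the compute node prints the detected hardware as a slurm.conf-style line, which you can compare against the configured value. real_memory below is a hypothetical helper for pulling the number out of either source:

```shell
#!/bin/sh
# real_memory: hypothetical helper that extracts the first RealMemory value
# (in MB) from slurm.conf-style input.
real_memory() {
  grep -o 'RealMemory=[0-9]*' | head -n 1 | cut -d= -f2
}

# On the node:        slurmd -C | real_memory
# On the controller:  real_memory < /etc/slurm/slurm.conf
# Example with a captured slurmd -C line (values are illustrative):
echo 'NodeName=ecpsc10 CPUs=16 RealMemory=64216 Sockets=2' | real_memory   # → 64216
```

If the configured value is higher than what slurmd -C reports, lower it in slurm.conf and run scontrol reconfigure.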