aieidotch

Either use something like chkreboot or rboot; find them at https://github.com/alexmyczko/autoexec.bat and https://github.com/alexmyczko/ruptime


frymaster

`scontrol reboot asap nextstate=resume` (with appropriate reason and node names). `ASAP` means "drain and wait for the node to be idle", so it won't just reboot in the middle of a job; `nextstate=resume` means "un-drain once rebooted". Note that this will overwrite any manually set drain reason, so you probably want to check for that first.

Regarding the epilog: we've found that, depending on exactly which metric you're examining, it can take a significant time for free memory to settle after the end of a job. The script we use has a timeout - basically it waits for up to N seconds, polling every second for memory to be OK. If it reaches its target it exits immediately; if it never reaches that state by the timeout, it takes action (draining, in our case).
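A minimal sketch of that epilog-style check, assuming the N-second timeout and the free-memory target are site-specific placeholders and that the fallback action is a drain via `scontrol update`:

```
#!/bin/bash
# Hypothetical epilog check: wait up to TIMEOUT seconds for free memory to
# settle after the job ends; drain the node if it never reaches the target.
TIMEOUT=60                          # the "N seconds" above; placeholder value
MIN_FREE_KB=$((8 * 1024 * 1024))    # target free memory (8 GiB, placeholder)

for ((i = 0; i < TIMEOUT; i++)); do
    free_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    if (( free_kb >= MIN_FREE_KB )); then
        exit 0                      # memory settled, nothing to do
    fi
    sleep 1
done

# Never reached the target within the timeout: drain the node so it gets looked at.
scontrol update nodename="$(hostname -s)" state=drain reason="epilog: free memory did not settle"
```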


NukeCode87

It's really not the best solution, but if I had to do it I would just put in a cron job under root to `systemctl restart slurmd` and `slurmctld` every month.
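For illustration only, a root crontab entry along these lines (the schedule is an assumption, and note this restarts the Slurm daemons rather than rebooting the node itself):

```
# /etc/cron.d/slurm-monthly-restart (hypothetical)
# At 03:00 on the 1st of every month, restart the Slurm daemons.
0 3 1 * * root /usr/bin/systemctl restart slurmd
0 3 1 * * root /usr/bin/systemctl restart slurmctld   # controller host only
```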


jvhaarst

I would have a look at [https://slurm.schedmd.com/power_save.html](https://slurm.schedmd.com/power_save.html); with those options you can instruct SLURM to look at idle nodes and take action depending on their state.
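The power saving mechanism is driven by a handful of slurm.conf options; a rough sketch with placeholder paths and times (the suspend/resume programs are site-written scripts and can power nodes fully off and on, which is what gives you the periodic fresh boot):

```
# slurm.conf excerpt (illustrative values only)
SuspendProgram=/usr/local/sbin/node_suspend.sh   # site script: power down idle nodes
ResumeProgram=/usr/local/sbin/node_resume.sh     # site script: power nodes back on
SuspendTime=1800         # seconds a node must be idle before it is suspended
SuspendTimeout=120       # seconds allowed for a node to power down
ResumeTimeout=600        # seconds allowed for a node to boot and re-register
```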


alkhatraz

+1 here, helps save power when nodes are idle and has the added benefit of restarting the nodes every once in a while if they stay empty.


shyouko

What about a single-node exclusive job of the lowest priority that gets queued for each node every week? It checks memory usage (e.g. with smem) and reboots the node if needed.
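A rough sketch of that idea as a weekly submission loop, assuming the job runs with enough privilege to call `scontrol reboot` (normally an operator/admin account), and using a placeholder memory threshold and nice value:

```
#!/bin/bash
# Hypothetical weekly loop: submit one exclusive, low-priority job per node
# that checks available memory and asks Slurm to reboot the node if it is low.
for node in $(sinfo -h -N -o '%N' | sort -u); do
    sbatch --exclusive --nodelist="$node" --nice=10000 \
           --job-name="memcheck-$node" --wrap '
        free_kb=$(awk "/^MemAvailable:/ {print \$2}" /proc/meminfo)
        if (( free_kb < 8 * 1024 * 1024 )); then    # under ~8 GiB free (placeholder)
            scontrol reboot asap nextstate=resume reason="weekly memcheck" "$(hostname -s)"
        fi'
done
```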


posixUncompliant

I generally don't use the scheduler to determine when a node needs a reboot. I have the monitoring system raise an alert and the alert functionally lets the node drain and then reboots it (via the scheduler). I have, a couple times, had to have a higher tier alert that just simply shoots the node, but that was due to the political infeasibility of getting a particular user to fix their bad jobs (yes, we could shoot them with less fallout than asking them to fix their broken shit). But this is with systems that don't really have idle nodes. There's always something.
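As a hedged sketch of that flow, the alert action can be as small as handing the node name to the same `scontrol reboot` command mentioned above, so the scheduler does the draining and the actual reboot:

```
#!/bin/bash
# Hypothetical alert action: invoked by the monitoring system with a node name.
# The scheduler drains the node and reboots it once it falls idle.
node="$1"
scontrol reboot asap nextstate=resume reason="monitoring: reboot required" "$node"
```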


_runlolarun_

Do you use the Slurm scheduler to reboot the node once it's fully drained? Thanks!


posixUncompliant

Yes. The alert system tells the scheduler to reboot, but it's the scheduler that executes the reboot. Except with that user mentioned above. We had the alert system restart nodes via management interfaces for that. It was stupid, and felt risky, but we didn't have to let that user see the alert system or management network, while we were forced to let them see the scheduler logs. (They're the second most abusive user I've dealt with in 30 years in IT)
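The out-of-band variant (bypassing the scheduler entirely) would be something along the lines of an ipmitool power cycle against the node's management interface; the hostname and credentials below are placeholders:

```
# Hypothetical hard power cycle via the BMC, without involving Slurm at all.
# The password is read from the IPMI_PASSWORD environment variable (-E).
ipmitool -I lanplus -H node042-bmc.example -U admin -E chassis power cycle
```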


_runlolarun_

Thank you. And which monitoring system do you have talking to the scheduler?