This is a new version of my Python script "now", this time for monitoring jobs on the HLRN III.
I am usually on the Hannover partition, but it should work perfectly in Berlin too.
This version keeps all the features of the cluster version except the “gossip” section, which shows who is calculating what and how many resources they are using. A supercomputer has too many users for that: we would need a terminal a couple of square meters in size.
As the picture shows, it displays:
- Job ID (the one used for cancelling/holding)
- Number of nodes
- Estimated time until the job starts calculating (red) or reaches its walltime (green)
- Path to output
- Notification of finished jobs
- Job history for the last 14 days (displays last day by default)
- Input/Output performance check*, for detecting cases where, e.g., our MPI code falls into a “live lock” and no longer produces results but keeps wasting CPU time
(* It contains a little hack for the time: apparently the clocks of the front-ends and the nodes are not synchronized via NTP, so the results of the calculations are marked as modified 18 seconds in the future.)
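To illustrate the idea behind the performance check and the time hack, here is a minimal sketch. It is not the code from the script itself. The names `seconds_since_last_write`, `looks_stalled` and `CLOCK_SKEW_S` and the 600-second threshold are assumptions made for this example; only the 18-second offset comes from the observation above.

```python
import os
import time

# Assumption: the nodes' clocks run this many seconds ahead of the
# front-end clock (18 s is the offset observed above).
CLOCK_SKEW_S = 18.0


def seconds_since_last_write(path, now=None, skew=CLOCK_SKEW_S):
    """Return how long ago `path` was last modified, corrected for clock skew."""
    if now is None:
        now = time.time()
    mtime = os.path.getmtime(path)
    # Files written on the nodes look `skew` seconds newer than they are,
    # so shift the timestamp back. Clamp at zero to avoid negative ages.
    return max(0.0, now - (mtime - skew))


def looks_stalled(path, threshold_s=600.0, now=None, skew=CLOCK_SKEW_S):
    """Return True if the output file has not been written for `threshold_s` seconds."""
    return seconds_since_last_write(path, now, skew) > threshold_s
```

A job whose output file passes `looks_stalled` while the job is still listed as running is a candidate for a live lock. The right threshold depends on how often the code is expected to write results.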
You can download the script from my GitHub repository: Source code for HLRN III version of now.
If you are using HLRN III, you can simply create a symbolic link to my scripts directory located at: