Job monitoring on a cluster or supercomputer can be done with several tools (such as qstat, showstart, …). Unfortunately, most of these standard tools are made by computer scientists for computer scientists. But now there is an alternative: now.
At some point in my life, I got sick of typing qstats, greps, awks, … because each of them takes a couple of seconds to write. Multiply those seconds by the number of times I need them and by the number of computational members in my group, and we get enough time to prepare a couple of new inputs, or to write some posts/comments on my blog.
So I wrote my own job monitoring/visualization script, based on some ideas from my friend Iñaki during my time in the theoretical chemistry group in Donostia. Initially, that script did nothing but execute these programs and display ONLY the information I need, and ALL the information I need. This information includes:
- job ID: for holding/killing
- Status: whether it is running/queued/held …
- Number of nodes: because it is a nice thing to check
- Path to output: because I don’t have a database in my brain linking tens of job IDs and paths
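To give an idea of how those four fields can be collected, here is a minimal Python sketch. It is not the code of now itself: it assumes a Torque/PBS-style `qstat -f` listing (attribute lines of the form `key = value`, long values wrapped onto tab-indented lines), and the function names `parse_qstat_full` and `summarize` are made up for this example.

```python
def parse_qstat_full(text):
    """Parse Torque/PBS-style `qstat -f` text into a list of dicts (one per job)."""
    jobs = []
    current = None
    last_key = None
    for raw in text.splitlines():
        if raw.startswith("Job Id:"):
            # A new job record starts here.
            current = {"id": raw.split(":", 1)[1].strip()}
            jobs.append(current)
            last_key = None
        elif current is None or not raw.strip():
            # Skip blank lines and anything before the first job.
            continue
        elif raw.startswith("\t") and last_key:
            # Assumed layout: wrapped values continue on tab-indented lines.
            current[last_key] += raw.strip()
        elif " = " in raw:
            key, value = raw.strip().split(" = ", 1)
            current[key] = value
            last_key = key
    return jobs


def summarize(job):
    """Return (job ID, status, nodes, output path) for one parsed job."""
    path = job.get("Output_Path", "?")
    # Output_Path is usually reported as host:/path; keep only the path.
    path = path.split(":", 1)[-1]
    return (
        job["id"],
        job.get("job_state", "?"),
        job.get("Resource_List.nodes", "?"),
        path,
    )
```

On a machine with such a queue system, the text could come from something like `subprocess.run(["qstat", "-f"], capture_output=True, text=True).stdout`, and each job would then be printed on one line with `summarize`.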
Those four fields were all that version 0.1 (which I used to call “nyt”) displayed. The latest version of now has some more useful features.
I have different versions of now: some for supercomputers, which display the queues, the estimated start time, …, and others for clusters. The one I am sharing in this post is the version for small clusters, which has the following highlights:
- Displays the usage of the whole cluster and of each active user as color-coded horizontal bars fitted to the terminal width
- Number of active users and what kind of jobs they are running
- Warning message if any of the current user's jobs have finished since the last execution of now
- Input/Output monitoring option (-io) to make sure that the jobs have not turned into “zombies”
- Job history, up to the last 7 days, because when we run a lot of calculations it is nice to quickly check what we calculated over the last few days
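Two of these highlights are easy to sketch in a few lines of Python. The snippet below is only an illustration under my own assumptions, not what now actually does: the color thresholds (50% and 80%), the helper names `usage_bar` and `is_stale`, and the idea of spotting a “zombie” by the age of its output file are all hypothetical choices made for the example.

```python
import os
import shutil
import time

# ANSI color codes; the thresholds used below are arbitrary assumptions.
GREEN, YELLOW, RED, RESET = "\033[32m", "\033[33m", "\033[31m", "\033[0m"


def usage_bar(used, total, width=None, color=True):
    """Return a horizontal usage bar that is exactly `width` columns wide."""
    if width is None:
        # Fit the bar to the current terminal.
        width = shutil.get_terminal_size().columns
    frac = used / total if total else 0.0
    frac = min(max(frac, 0.0), 1.0)
    suffix = f" {int(100 * frac):3d}%"
    # Two columns are reserved for the brackets.
    inner = max(width - len(suffix) - 2, 1)
    filled = int(inner * frac)
    bar = "#" * filled + "." * (inner - filled)
    if color:
        tint = GREEN if frac < 0.5 else YELLOW if frac < 0.8 else RED
        bar = tint + bar + RESET
    return "[" + bar + "]" + suffix


def is_stale(path, max_idle_seconds, now=None):
    """True if `path` has not been modified for more than `max_idle_seconds`."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path) > max_idle_seconds
```

For example, `print(usage_bar(24, 64))` would draw one bar across the terminal for 24 busy cores out of 64, and `is_stale(output_path, 3600)` would flag a running job whose output has not been written to for an hour.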