Rocks 6.1 – Login Node Monitor


The instructions/steps given below worked for me (and Michigan Technological University) running Rocks 6.1.1 (with CentOS 6.3 and GE 2011.11p1). As has been common practice for several years now, a full version of the operating system was installed. These instructions may very well work for you (or your institution) on Rocks-like or other Linux clusters. Please note that if you decide to use these instructions on your machine, you are doing so entirely at your own discretion, and neither this site nor its author (or Michigan Technological University) is responsible for any damage, intellectual or otherwise.

The problem

In spite of repeated (polite) reminders and requests to use the queuing system and let the compute nodes do the work, some users keep running programs (or scripts, utilities, tools, etc.) on the login node (or the front end, in the absence of a login node). Not only does this waste the resources available in the compute nodes, it can often bring the login node (or front end) to a crawl, depending on CPU and/or memory usage. Should the login node (or front end) fail as a result, the good users pay the price for absolutely no fault of their own while the bad users go unpunished.

The task

Devise a way (a shell script or a set of them, preferably run at regular intervals as a cron job) with which bad users receive the punishment they deserve (e.g., logging them out on the first violation, locking them out on every subsequent violation, and removing their account if the bad habit continues to die hard) while making sure good users have the resources they need to continue their work in a seamless fashion. This has the added advantage of freeing up systems administrators’ time/schedule for more meaningful things (e.g., teaching, research, writing proposals, developing collaborations, etc.) while the system polices itself.



Rocks Cluster Distribution makes sure that /share/apps/ from the front end is shared across all nodes of the cluster; also, root in the login node does not necessarily have the same privileges as root in the front end. With these in mind, the setup is as follows:

  1. Supposing wigner is the cluster’s short hostname (the first part of its FQDN), create the following three groups: wigner-admins, wigner-users and wigner-abusers.
    1. All users belong to wigner-users group.
    2. If a login node is present, users that belong to this group cannot directly SSH into the front end. This can be controlled by adding the following line to /etc/ssh/sshd_config in the front end, and restarting sshd: DenyGroups wigner-users
    3. A select list of users can be added to the wigner-admins group, if need be (e.g., those willing to help systems administrators out by testing policies, procedures, performance, etc.).
    4. The monitor script will add bad users with repeated violations to the wigner-abusers group, and these users cannot SSH into the login node. This is accomplished by adding the following line to /etc/ssh/sshd_config in the login node, and restarting sshd: DenyGroups wigner-abusers
  2. Make a list of all users and their respective (primary) advisors in /share/apps/bin/cluster_users_list.txt, one per line, in the following format: james:amy (i.e., username:advisor). This file should have 644 permissions so that root in the login node can read it.
  3. Place the following two scripts in /root/sbin/ in the login node and set their permissions to 700.
  4. Update the settings script to fit the needs of your institution.
  5. Set up the monitor script to run as a cron job every few minutes – first, preferably, in a test cluster and, then, in production clusters.
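The steps above can be sketched end to end as follows. This is a minimal sketch: the group names come from the example above, the service restart assumes stock CentOS 6 init scripts, and the two script filenames are not given in this post, so adjust the copy step to whatever you name them.

```shell
# On the front end (as root): create the three groups (step 1)
/usr/sbin/groupadd wigner-admins
/usr/sbin/groupadd wigner-users
/usr/sbin/groupadd wigner-abusers

# Keep regular users off the front end when a login node exists (step 1.2)
echo "DenyGroups wigner-users" >> /etc/ssh/sshd_config
/sbin/service sshd restart

# Shared user list, one 'username:advisor' pair per line (step 2)
echo "james:amy" >> /share/apps/bin/cluster_users_list.txt
/bin/chmod 644 /share/apps/bin/cluster_users_list.txt

# On the login node (as root): lock out repeat abusers (step 1.4)
echo "DenyGroups wigner-abusers" >> /etc/ssh/sshd_config
/sbin/service sshd restart

# On the login node: install the two scripts (step 3)
/bin/mkdir -p /root/sbin
# ... copy the settings and monitor scripts into /root/sbin/ ...
/bin/chmod 700 /root/sbin/*
```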

#! /bin/bash
# BASH script to define general and cluster-specific settings used by
# the login node monitor script.
# Place this script in /root/sbin/ in login node.
# Must be root to run this script.
# Usage (do not use this script directly; it is sourced by the monitor):
#
# Cluster identity (the variable names on the next three lines are
# reconstructed; adjust to taste)
export CLUSTER_FQDN=`hostname -f`
export CLUSTER_HOSTNAME=`echo $CLUSTER_FQDN | \
                           awk -F '.' '{ print $1 "." $2 }'`
export CLUSTER_NAME=`echo $CLUSTER_FQDN | \
                       awk -F '.' '{ print $1 }'`
export CLUSTER_UNIVERSITY="Michigan Technological University"
export EMAIL_SUFFIX=""   # campus email domain; fill in for your site
export CPU_LIMIT="30"
export MEM_LIMIT="15"
# List of processes that must be killed irrespective of usage
declare -a MUST_KILL_PROCESSES=("abaqus" "MATLAB" "matlab" "g09" "crystal" "Pcrystal" "properties" "lmp_parallel" "lmp_serial" "R" "molden" "gmolden" "xcrysden")
# List of processes that must not be killed irrespective of usage
declare -a MUST_NOT_KILL_PROCESSES=("aspell" "awk" "basename" "bash" "bc" "bg" "bunzip2" "bzcat" "bzip2" "bzless" "cal" "cat" "cd" "chgrp" "chmod" "chown" "clear" "cp" "cpio" "crontab" "curl" "cut" "date" "df" "diff" "diff3" "dir" "dirname" "dirs" "dos2unix" "du" "echo" "expr" "fg" "file" "find" "finger" "free" "ftp" "fuser" "grep" "groups" "gunzip" "gzip" "head" "history" "hostname" "id" "join" "kill" "last" "less" "ln" "locate" "logname" "ls" "lsof" "make" "man" "mkdir" "mount" "mv" "nice" "nohup" "open" "passwd" "paste" "patch" "pgrep" "ping" "pkill" "popd" "ps" "pushd" "pwd" "renice" "rev" "rm" "rmdir" "rsync" "scp" "screen" "sdiff" "sed" "seq" "sftp" "sftp-server" "sleep" "sort" "split" "ssh" "sshd" "su" "sudo" "tac" "tar" "tail" "tee" "time" "top" "touch" "tr" "type" "umask" "uname" "uniq" "units" "unix2dos" "unzip" "uptime" "vmstat" "w" "wait" "watch" "wc" "wget" "whatis" "whereis" "which" "who" "whoami" "xdvi" "zcat" "zdiff" "zip" "zless" "zmore" "znew")
# Mailing list(s)
export CLUSTER_ADMINS="hpcadmins-l@${EMAIL_SUFFIX}" # systems administrators
export CLUSTER_HELPLINE="it-help@${EMAIL_SUFFIX}"   # Campus help line 
# List of users that have been granted temporary admin privilege
declare -a CLUSTER_ADMINS_TEMP=("amy" "karen" "john")
export USERS_LIST=`cat /share/apps/bin/cluster_users_list.txt | \
                     awk -F ':' '{ print $1 }' | \
                     tr '\n' ' '`
declare -a USER_NAME_ARRAY=( $USERS_LIST )
# Group into which repeat abusers are placed
export CLUSTER_ABUSERS_GROUP="wigner-abusers"
export ABUSER_LIST="ln_abusers_list_tmp.txt"
export ABUSER_LOG="ln_abusers_log.txt"
# Check if an array contains an element
function needle_in_haystack() {
  local n=$#
  local value=${!n}
  for ((i=1; i < $#; i++)); do
    if [ "${!i}" == "${value}" ]
    then
      echo "y"
      return 0
    fi
  done
  echo "n"
  return 1
}
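A quick illustration of how needle_in_haystack is meant to be called: the needle goes last, after all elements of the haystack array. The function is reproduced here (in completed form) so the snippet is self-contained:

```shell
function needle_in_haystack() {
  local n=$#
  local value=${!n}
  for ((i=1; i < $#; i++)); do
    if [ "${!i}" == "${value}" ]
    then
      echo "y"
      return 0
    fi
  done
  echo "n"
  return 1
}

declare -a SAMPLE_USERS=("amy" "james" "karen")
needle_in_haystack "${SAMPLE_USERS[@]}" "james"   # prints "y"
needle_in_haystack "${SAMPLE_USERS[@]}" "zoe"     # prints "n"
```

Since the function both echoes y/n and returns 0/1, it can be used in a command substitution (as the monitor script does) or directly in an if statement.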

#! /bin/bash
# BASH script to monitor the login node. This script needs
# the settings script (above) to be present in the same folder (/root/sbin).
# Must be root to run this script.
# Usage (preferably run it every few minutes as a cron job):
#
# Check if this script is being run by a non-root user. If yes, exit with
# an error message.
if [ $UID != 0 ]
then
  echo "  You must be logged in as root!"
  echo "  Exiting..."
  exit 1
fi
# Check if this script is being run with arguments. If yes, exit with an
# error message.
EXPECTED_ARGS=0
E_BADARGS=65
if [ $# -ne $EXPECTED_ARGS ]
then
  echo "  Usage: `basename $0`"
  exit $E_BADARGS
fi
# Necessary variables
. /root/.bashrc
# List of currently running relevant processes
# 1. Run 'top' in batch mode (-b) once (-n 1)
# 2. Remove all blank lines
# 3. Remove the first 6 lines of header
# 4. Filter out system users
# 5. Print out the result to a flat text file
top -b -n 1      | \
  sed '/^$/d'    | \
  sed '1,6d'     | \
  grep -v "root" | \
  awk '{ printf "%-6s  %-8s  %-6s  %-6s  %-10s  %s\n", $1, $2, $9, $10, $11, $12 }' > $ABUSER_LIST
# Read $ABUSER_LIST line by line
# BEGIN read_abusers_list WHILE LOOP
exec<$ABUSER_LIST
while read line
do
  #
  # Assign each field to a variable
  PROCESS_ID=`echo $line | awk -F ' ' '{ print $1 }' | sed '/^ *#/d;s/#.*//'`
  USER_NAME=`echo $line  | awk -F ' ' '{ print $2 }' | sed '/^ *#/d;s/#.*//'`
  CPU_USAGE=`echo $line  | awk -F ' ' '{ print $3 }' | sed '/^ *#/d;s/#.*//'`
  MEM_USAGE=`echo $line  | awk -F ' ' '{ print $4 }' | sed '/^ *#/d;s/#.*//'`
  TIME_USAGE=`echo $line | awk -F ' ' '{ print $5 }' | sed '/^ *#/d;s/#.*//'`
  CMD_USED=`echo $line   | awk -F ' ' '{ print $6 }' | sed '/^ *#/d;s/#.*//'`
  #
  # Avoid 'integer expression expected' errors
  CPU_USAGE=`echo "($CPU_USAGE + 0.5)/1" | bc`
  MEM_USAGE=`echo "($MEM_USAGE + 0.5)/1" | bc`
  #
  # DEBUG
  # echo "$PROCESS_ID -- $USER_NAME -- $CPU_USAGE -- $MEM_USAGE -- $TIME_USAGE -- $CMD_USED"
  #
  # Check if user is a real user
  # USER_NAME must exist in USER_NAME_ARRAY array
  # BEGIN real_user_check IF
  if [ $(needle_in_haystack "${USER_NAME_ARRAY[@]}" "$USER_NAME") == "y" ]
  then
    #
    # DEBUG
    # echo "$USER_NAME exists in USERS_LIST"
    #
    # User's advisor
    export USER_ADVISOR=`grep "${USER_NAME}:" /share/apps/bin/cluster_users_list.txt | \
                           awk -F ':' '{ print $NF }'`
    #
    # Check if %CPU usage is greater than $CPU_LIMIT
    # 1: FALSE
    # 0: TRUE
    export CPU_VIOLATION=1
    if [ "$CPU_USAGE" -ge "$CPU_LIMIT" ]
    then
      export CPU_VIOLATION=0
    fi
    #
    # Check if %MEM usage is greater than $MEM_LIMIT
    # 1: FALSE
    # 0: TRUE
    export MEM_VIOLATION=1
    if [ "$MEM_USAGE" -ge "$MEM_LIMIT" ]
    then
      export MEM_VIOLATION=0
    fi
    #
    # Check if CMD_USED is in the list of programs ($MUST_KILL_PROCESSES)
    # that must be killed, irrespective of %MEM or %CPU usage
    # 1: FALSE
    # 0: TRUE
    export CMD_VIOLATION=1
    if [ $(needle_in_haystack "${MUST_KILL_PROCESSES[@]}" "$CMD_USED") == "y" ]
    then
      export CMD_VIOLATION=0
    fi
    #
    # If CPU_VIOLATION, MEM_VIOLATION or CMD_VIOLATION is true (0), then
    # set VIOLATION to be true (0)
    # 1: FALSE
    # 0: TRUE
    export VIOLATION=1
    if [ $CPU_VIOLATION -eq 0 -o $MEM_VIOLATION -eq 0 -o $CMD_VIOLATION -eq 0 ]
    then
      export VIOLATION=0
    fi
    #
    # BEGIN EXCEPTIONS
    # Check if USER_NAME is in the list of temporary admins ($CLUSTER_ADMINS_TEMP).
    # If yes, set the violation to be false (1).
    if [ $(needle_in_haystack "${CLUSTER_ADMINS_TEMP[@]}" "$USER_NAME") == "y" ]
    then
      export VIOLATION=1
    fi
    #
    # Check if CMD_USED is in the list of programs ($MUST_NOT_KILL_PROCESSES)
    # that must not be killed, irrespective of %MEM or %CPU usage.
    # Set VIOLATION to be false (1)
    if [ $(needle_in_haystack "${MUST_NOT_KILL_PROCESSES[@]}" "$CMD_USED") == "y" ]
    then
      export VIOLATION=1
    fi
    # END EXCEPTIONS
    #
    # If VIOLATION is true (0), then
    # 1. Kill the process without any warning
    # 2. Record the violation
    # 3. Count the total # of violations
    # 4. Compose an appropriate message, depending on whether it's
    #    first time or repeat violation
    # 5. Email the USER, copy her/his advisor(s) and systems administrators
    # 6. Also, log out the user without any warning, if it's a
    #    repeat violation
    # BEGIN violation_check IF
    if [ $VIOLATION -eq 0 ]
    then
      #
      # Kill the process
      /bin/kill -9 $PROCESS_ID
      #
      # Record the violation
      export ABUSE_DATE=`date -R`
      printf "%-8s -- %31s -- %-6s -- %-6s -- %-6s -- %-8s -- %-s\n" \
        "${USER_NAME}" "${ABUSE_DATE}" "${PROCESS_ID}" "${CPU_USAGE}" \
        "${MEM_USAGE}" "${TIME_USAGE}" "${CMD_USED}" >> $ABUSER_LOG
      # Count the total number of violations
      export ABUSE_COUNT=`grep "^$USER_NAME" $ABUSER_LOG | wc -l`
      # Compose the message (for SMS)
      cat << EndOfFile > /tmp/${USER_NAME}_${PROCESS_ID}.txt
Dear ${USER_NAME},
In spite of clear instructions given in the website as well as during the training session, you seem to be running a process in a login node of ${CLUSTER_HOSTNAME}. Details are as below:
  Violation # : ${ABUSE_COUNT}
  Date/Time   : ${ABUSE_DATE}
  Process ID  : ${PROCESS_ID}
  Process     : ${CMD_USED}
  % CPU       : ${CPU_USAGE}
  % Memory    : ${MEM_USAGE}
  Time        : ${TIME_USAGE}
EndOfFile
      if [ $ABUSE_COUNT -eq 1 ]
      then
        # Message
        cat << EndOfFile >> /tmp/${USER_NAME}_${PROCESS_ID}.txt
Note that the process has been terminated to safeguard work of other users. Subsequent violations will lead to disabling and/or removal of your account from this cluster.
EndOfFile
      fi
      if [ $ABUSE_COUNT -gt 1 ]
      then
        # Log out the user 
        /usr/bin/skill -KILL -u $USER_NAME
        # Previous violation information
        PREVIOUS_ABUSE=`grep "$USER_NAME" $ABUSER_LOG | sed '$d' | tail -1`
        ABUSE_DATE0=`echo $PREVIOUS_ABUSE | awk -F '--' '{ print $2 }' | sed '/^ *#/d;s/#.*//'`
        PROCESS_ID0=`echo $PREVIOUS_ABUSE | awk -F '--' '{ print $3 }' | sed '/^ *#/d;s/#.*//'`
        CMD_USED0=`echo $PREVIOUS_ABUSE   | awk -F '--' '{ print $7 }' | sed '/^ *#/d;s/#.*//'`
        CPU_USAGE0=`echo $PREVIOUS_ABUSE  | awk -F '--' '{ print $4 }' | sed '/^ *#/d;s/#.*//'`
        MEM_USAGE0=`echo $PREVIOUS_ABUSE  | awk -F '--' '{ print $5 }' | sed '/^ *#/d;s/#.*//'`
        TIME_USAGE0=`echo $PREVIOUS_ABUSE | awk -F '--' '{ print $6 }' | sed '/^ *#/d;s/#.*//'`
        # Message
        cat << EndOfFile >> /tmp/${USER_NAME}_${PROCESS_ID}.txt
Note that the process has been terminated to safeguard work of other users. Since this is a repeat violation, your account has been disabled until further notice. Details regarding your previous violation are as below:
  Date/Time   : ${ABUSE_DATE0}
  Process ID  : ${PROCESS_ID0}
  Process     : ${CMD_USED0}
  % CPU       : ${CPU_USAGE0}
  % Memory    : ${MEM_USAGE0}
  Time        : ${TIME_USAGE0}
Request your advisor(s) to send an email to '${CLUSTER_HELPLINE}' to have your account enabled.
EndOfFile
        # Disable the account by adding ${USER_NAME} to
        # ${CLUSTER_ABUSERS_GROUP} group
        /usr/sbin/usermod -L ${USER_NAME}
        /usr/sbin/usermod -a -G ${CLUSTER_ABUSERS_GROUP} ${USER_NAME}
      fi
      # Message
      cat << EndOfFile >> /tmp/${USER_NAME}_${PROCESS_ID}.txt
Do not reply to this email.
EndOfFile
      #
      # Email the user, advisor(s) and systems administrator(s)
      # (one possible approach, assuming a working 'mail' (mailx) command;
      # adjust to your mail setup)
      /bin/mail -s "${CLUSTER_HOSTNAME} - Login Node Violation #${ABUSE_COUNT}" \
        -c "${USER_ADVISOR}@${EMAIL_SUFFIX},${CLUSTER_ADMINS}" \
        "${USER_NAME}@${EMAIL_SUFFIX}" < /tmp/${USER_NAME}_${PROCESS_ID}.txt
      #
      # Delete the message
      /bin/rm -f /tmp/${USER_NAME}_${PROCESS_ID}.txt
    fi
    # END violation_check IF
    #
    # Unset the variables
    export ABUSE_COUNT=""
    export ABUSE_DATE=""
    export PREVIOUS_ABUSE=""
    export PROCESS_ID=""
    export USER_NAME=""
    export CPU_USAGE=""
    export MEM_USAGE=""
    export TIME_USAGE=""
    export CMD_USED=""
    export PROCESS_ID0=""
    export USER_NAME0=""
    export CPU_USAGE0=""
    export MEM_USAGE0=""
    export TIME_USAGE0=""
    export CMD_USED0=""
  fi
  # END real_user_check IF
done
# END read_abusers_list WHILE LOOP
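For step 5, a root crontab entry along these lines runs the monitor every five minutes; the filename ln_monitor.sh is a placeholder, since the post does not name the script:

```
# crontab -e (as root in the login node)
*/5 * * * * /root/sbin/ln_monitor.sh
```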


Revised versions of these scripts

One can find them in my script repository on GitHub.


What if my cluster has more than one login node? The two scripts will need to be tweaked so that a user locked out of one login node cannot SSH into the other login node. If interested, let me know and I will share the revised versions.


Thanks be to

all the bad users who made writing this script a necessity several years ago, and in turn freed up my time and resources so that they could be invested in more meaningful things such as teaching, research, writing proposals, developing collaborations, etc. Thanks also to their advisors (and the University’s administrators at every rung of the ladder) for sticking by this idea and, in turn, helping make it a default policy on all computing clusters.

3 Replies to “Rocks 6.1 – Login Node Monitor”

  1. This is most excellent. I know this has been something you’ve dealt with for a while. Nice to see you’ve finally got around to automating the denial of wrongful individuals in order to spend more time not worrying and doing amazing stuff like always 🙂 I will definitely be following your github!

  2. @Mario:
    It’s been in place, in some form, for past few years. This post was written — with sanitized and fully commented versions of the script — to share it with NSF XSEDE and Rocks mailing lists, so that more and more HPC admins wouldn’t have to waste their time.

  3. Hi Gowtham,
    Thanks for this great post!
    When you say ‘run as a cron job’, do you mean the root’s crontab?
    I recently had a problem with an NFS mount that changed IP (a NAS that I have no control over) and thought to make a cron job that makes sure the volume is mounted for the entire cluster, informs me if it’s not and tries to mount it. I wasn’t sure if this should go into the root’s crontab or if there is a cluster level, rocks specific way of doing this like editing extend-base.


Comments are closed.