Rocks 5.4.2 – Ganglia’s gmond Python module for monitoring NVIDIA GPU

Disclaimer

The instructions/steps given below worked for me (and Michigan Technological University) running Rocks 5.4.2 (with CentOS 5.5); as has been common practice for several years now, a full version of the operating system was installed. These instructions may very well work for you (or your institution) on Rocks-like or other Linux clusters. Please note that if you decide to use these instructions on your machine, you do so entirely at your own discretion, and that neither this site, sgowtham.com, nor its author (or Michigan Technological University) is responsible for any damage, intellectual or otherwise.

A Bit About Ganglia (gmond & gmetad)

Citing Ganglia website,

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRD tool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on thousands of clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes.

Further, citing Wikipedia,

gmond (Ganglia Monitoring Daemon) is a multi-threaded daemon which runs on each cluster node that needs to be monitored. Installation does not require having a common NFS file system or a database back-end, install special accounts or maintain configuration files. It has four main responsibilities: monitor changes in host state; announce relevant changes; listen to the state of all other ganglia nodes via a unicast or multicast channel; answer requests for an XML description of the cluster state.

Each gmond transmits information in two different ways: unicasting or multicasting host state in external data representation (XDR) format using UDP messages, or sending XML over a TCP connection.

Federation in Ganglia is achieved using a tree of point-to-point connections amongst representative cluster nodes to aggregate the state of multiple clusters. At each node in the tree, a Ganglia Meta Daemon (gmetad) periodically polls a collection of child data sources, parses the collected XML, saves all numeric, volatile metrics to round-robin databases and exports the aggregated XML over TCP sockets to clients. Data sources may be either gmond daemons, representing specific clusters, or other gmetad daemons, representing sets of clusters. Data sources use source IP addresses for access control and can be specified using multiple IP addresses for failover. The latter capability is natural for aggregating data from clusters since each gmond daemon contains the entire state of its cluster.

The Ganglia web front-end provides a view of the gathered information via real-time dynamic web pages. Most importantly, it displays Ganglia data in a meaningful way for system administrators and computer users. Although the web front-end to ganglia started as a simple HTML view of the XML tree, it has evolved into a system that keeps a colourful history of all collected data. The Ganglia web front-end caters to system administrators and users (e.g., one can view the CPU utilization over the past hour, day, week, month, or year). The web front-end shows similar graphs for memory usage, disk usage, network statistics, number of running processes, and all other Ganglia metrics. The web front-end depends on the existence of the gmetad which provides it with data from several Ganglia sources.

Specifically, the web front-end will open the local port 8651 (by default) and expects to receive a Ganglia XML tree. The web pages themselves are highly dynamic; any change to the Ganglia data appears immediately on the site. This behaviour leads to a very responsive site, but requires that the full XML tree be parsed on every page access. Therefore, the Ganglia web front-end should run on a fairly powerful, dedicated machine if it presents a large amount of data. The Ganglia web front-end is written in the PHP scripting language, and uses graphs generated by gmetad to display history information.
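The XML tree mentioned above can also be inspected by hand. The following is a minimal sketch, not part of the original write-up: list_hosts is a hypothetical helper name, and it assumes that each HOST element starts on its own line with NAME as its first attribute, and that nc is available for the commented-out usage line.

```shell
#!/bin/bash
# list_hosts (hypothetical helper): read a Ganglia XML tree on stdin and
# print the NAME attribute of every <HOST> element, one per line.
# Assumes one element per line, with NAME as the first attribute.
list_hosts() {
  sed -n 's/.*<HOST NAME="\([^"]*\)".*/\1/p'
}

# Possible use on the front end (assumes 'nc' is installed and gmetad
# serves XML on its default port, 8651):
#   nc localhost 8651 | list_hosts
```

Because the helper only reads standard input, it can be tried on a saved copy of the XML before pointing it at a live daemon.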

Installation & Configuration

The Rocks 5.4.2 installation itself takes care of almost everything pertaining to installing and configuring Ganglia, gmond, gmetad and the Ganglia web interface. However, by default and by design, a Rocks cluster’s web interface is not publicly accessible. To change this, the following commands were run:

#! /bin/bash
#
# update_web_firewall.sh
# BASH script to run necessary 'rocks' commands to update the firewall rules on a
# Rocks 5.4.2 cluster's front end to make the web interface accessible from anywhere
# Must be root (or at least have sudo privilege) to run this script
 
# Begin root-check IF
if [ $UID != 0 ]
then
  clear
  echo
  echo "  You must be logged in as root!"
  echo "  Exiting..."
  echo
  exit 1
else
  echo
  echo "  Step #0: display current firewall rules"
  /opt/rocks/bin/rocks report host firewall localhost
 
  echo "  Step #1: removing the current rule for www"
  /opt/rocks/bin/rocks remove host firewall localhost chain=INPUT \
    flags="-m state --state NEW --source &Kickstart_PublicNetwork;/&Kickstart_PublicNetmask;" \
    protocol=tcp service=www action=ACCEPT network=public
 
  /opt/rocks/bin/rocks sync host firewall localhost
 
  echo "  Step #2: removing the current rule for https"
  /opt/rocks/bin/rocks remove host firewall localhost chain=INPUT \
    flags="-m state --state NEW --source &Kickstart_PublicNetwork;/&Kickstart_PublicNetmask;" \
    protocol=tcp service=https action=ACCEPT network=public
 
  /opt/rocks/bin/rocks sync host firewall localhost
 
  echo "  Step #3: adding new rule for www"
  /opt/rocks/bin/rocks add host firewall localhost chain=INPUT \
    flags="-m state --state NEW --source 0.0.0.0/0.0.0.0" \
    protocol=tcp service=www action=ACCEPT network=public
 
  /opt/rocks/bin/rocks sync host firewall localhost
 
  echo "  Step #4: adding new rule for https"
  /opt/rocks/bin/rocks add host firewall localhost chain=INPUT \
    flags="-m state --state NEW --source 0.0.0.0/0.0.0.0" \
    protocol=tcp service=https action=ACCEPT network=public
 
  /opt/rocks/bin/rocks sync host firewall localhost
 
  echo "  Step #5: display current firewall rules"
  /opt/rocks/bin/rocks report host firewall localhost
  echo
 
fi 
# End root-check IF

Upon pointing the browser to http://FQDN/ganglia/ (with FQDN being the front end’s fully qualified domain name), the web page should display the relevant information.
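The new rules can also be confirmed from the command line by inspecting the output of iptables -L INPUT -n on the front end. The following is a minimal sketch, not part of the original procedure: has_open_rule is a hypothetical helper name, and it assumes the usual column layout of the iptables listing (target, protocol, options, source, destination, then match details such as dpt:80).

```shell
#!/bin/bash
# has_open_rule PORT (hypothetical helper): read 'iptables -L INPUT -n'
# output on stdin and succeed only if some ACCEPT rule for tcp PORT has
# 0.0.0.0/0 as its source.
has_open_rule() {
  local port="$1"
  awk -v port="$port" '
    $1 == "ACCEPT" && $2 == "tcp" && $4 == "0.0.0.0/0" &&
      $0 ~ ("dpt:" port "([^0-9]|$)") { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Possible use on the front end, as root:
#   /sbin/iptables -L INPUT -n | has_open_rule 80  && echo "www is open"
#   /sbin/iptables -L INPUT -n | has_open_rule 443 && echo "https is open"
```

Matching on the port number rather than the service name works because the -n flag makes iptables print numeric ports.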

Monitoring NVIDIA GPU

The aforementioned setup works as expected, but it does not provide any information about the GPU(s) that may be part of the hardware. For example, the cluster used in this case has two NVIDIA GeForce GTX 260 cards in each compute node. For testing purposes, only one compute node was installed, and one of its GTX 260 cards was replaced with an NVIDIA Quadro 6000. With more and more scientific & engineering computations moving towards GPU-based computing, it’d be useful to include GPU status/usage information in Ganglia’s web portal. To this effect, NVIDIA released a gmond Python module for GPUs (I was made aware of it by one of Michigan Tech’s ITSS directors). The instructions given in the NVIDIA-linked pages do work as described; however, they require Python 2.5 (or higher), which ships the ctypes library as standard, while Rocks 5.4.2 uses Python 2.4. For the GPU metrics to show up in Ganglia, the ctypes library therefore has to be added separately.
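Whether a given node’s Python can load ctypes is quick to check. The following is a minimal sketch; has_ctypes is a hypothetical helper name, and the interpreter defaults to whatever python resolves to on the node.

```shell
#!/bin/bash
# has_ctypes [INTERPRETER] (hypothetical helper): succeed if the given
# Python interpreter (default: python) can import the ctypes library.
has_ctypes() {
  local py="${1:-python}"
  "$py" -c 'import ctypes' >/dev/null 2>&1
}

# Possible use on a compute node:
#   if has_ctypes; then
#     echo "ctypes is available"
#   else
#     echo "ctypes is missing"
#   fi
```

On a stock Python 2.4 the check is expected to fail until a separate ctypes package is installed.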

Rebuilding Rocks Distribution with Python ctypes Library

I downloaded python-ctypes-1.0.2-2.el5.x86_64.rpm from http://ftp.osuosl.org/pub/fedora-epel/5/x86_64/ and placed it in

/export/rocks/install/contrib/5.4/x86_64/RPMS/

Rebuilding the distribution, with the following commands as usual, was uneventful.

#! /bin/bash
#
# update_rocks_distribution.sh
# BASH script to download python-ctypes-1.0.2-2.el5.x86_64.rpm from 
# http://ftp.osuosl.org/pub/fedora-epel/5/x86_64/ and rebuild the
# rocks distribution
# Must be root (or at least have sudo privilege) to run this script
 
# Begin root-check IF
if [ $UID != 0 ]
then
  clear
  echo
  echo "  You must be logged in as root!"
  echo "  Exiting..."
  echo
  exit 1
else
  echo
  cd /export/rocks/install/contrib/5.4/x86_64/RPMS/
  wget http://ftp.osuosl.org/pub/fedora-epel/5/x86_64/python-ctypes-1.0.2-2.el5.x86_64.rpm
 
  cd /export/rocks/install/
  rocks create distro
 
fi
# End root-check IF
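The scripts in this write-up all open with the same root check. One possible way to factor it out is sketched below; require_root is a hypothetical helper name, not something Rocks provides, and the optional argument exists only so that the check can be exercised without actually being root.

```shell
#!/bin/bash
# require_root [UID] (hypothetical helper): return 0 when the given user
# ID (default: the current $UID) is 0; otherwise print a message to
# stderr and return 1.
require_root() {
  local uid="${1:-$UID}"
  if [ "$uid" -ne 0 ]; then
    echo "  You must be logged in as root!" >&2
    echo "  Exiting..." >&2
    return 1
  fi
}

# Possible use at the top of a script:
#   require_root || exit 1
```

Exiting with a non-zero status lets a caller, such as a wrapper that runs the script on every node, tell a refused run from a successful one.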

Re-install the compute node(s) [in this case, compute-0-0]. When run in debug mode (i.e., gmond -d9 -f), gmond results in the following messages without and with the ctypes library.

NVIDIA Driver Installation

Once the compute node(s) were re-installed, the NVIDIA driver, NVIDIA-Linux-x86_64-285.05.33.run, was installed using the following script.

#! /bin/bash
#
# install_nvidia_driver.sh
# BASH script to install NVIDIA driver in compute node(s) - save this in /share/apps/bin/
# Assumes that NVIDIA-Linux-x86_64-285.05.33.run is located in /share/apps/src/nvidia_cuda/
# Also, assumes that CUDA SDK 4.1.28 has been installed on front end in /share/apps/cuda/
# Must be root to run this script and run this in all compute nodes from the front end via
# the command
#
# rocks run host '/share/apps/bin/install_nvidia_driver.sh'
#
 
# Begin root-check IF
if [ $UID != 0 ]
then
  clear
  echo
  echo "  You must be logged in as root!"
  echo "  Exiting..."
  echo
  exit 1
else
  echo
  # 1. Install NVIDIA driver
  /share/apps/src/nvidia_cuda/NVIDIA-Linux-x86_64-285.05.33.run --silent
 
  # 2: Updating /etc/ld.so.conf
  echo "/share/apps/cuda/lib64" >> /etc/ld.so.conf
  echo "/share/apps/cuda/lib" >> /etc/ld.so.conf
  /sbin/ldconfig
 
  # 3: Creating missing symbolic links to necessary libraries
  cd /usr/lib64/
  ln -sf libXmu.so.6.2.0 libXmu.so
  ln -sf libXi.so.6.0.0  libXi.so
 
fi
# End root-check IF
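Appending to /etc/ld.so.conf with >> adds the same lines again each time the script is re-run on a node. A minimal sketch of an idempotent alternative follows; append_once is a hypothetical helper name, not part of the original script.

```shell
#!/bin/bash
# append_once LINE FILE (hypothetical helper): append LINE to FILE only
# if FILE does not already contain an identical line.
append_once() {
  local line="$1" file="$2"
  # -x: match whole lines only; -F: treat LINE as a fixed string
  grep -qxF -- "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# Possible use in place of the two 'echo ... >> /etc/ld.so.conf' lines:
#   append_once "/share/apps/cuda/lib64" /etc/ld.so.conf
#   append_once "/share/apps/cuda/lib"   /etc/ld.so.conf
#   /sbin/ldconfig
```

The whole-line match matters here: without -x, the existing /share/apps/cuda/lib64 entry would be taken as a match for /share/apps/cuda/lib and the second path would never be added.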

Python Bindings for the NVIDIA Management Library

These bindings provide Python access to static information and monitoring data for NVIDIA GPUs, as well as management capabilities, by exposing the functionality of the NVIDIA Management Library (NVML). They need to be downloaded beforehand (nvidia-ml-py-2.285.01.tar.gz in this case); as before, the necessary steps are included in a BASH script.

#! /bin/bash
#
# install_python_nvml_bindings.sh
# BASH script to install Python Bindings for the NVML in compute node(s) - save this in /share/apps/bin/
# Assumes that nvidia-ml-py-2.285.01.tar.gz is in /share/apps/src/nvidia_ganglia/
# Must be root to run this script and run this in all compute nodes from the front end via the command
#
# rocks run host '/share/apps/bin/install_python_nvml_bindings.sh'
#
 
# Begin root-check IF
if [ $UID != 0 ]
then
  clear
  echo
  echo "  You must be logged in as root!"
  echo "  Exiting..."
  echo
  exit 1
else
  #
  # Download and install
  cd /tmp/
  cp /share/apps/src/nvidia_ganglia/nvidia-ml-py-2.285.01.tar.gz .
 
  tar -zxvpf nvidia-ml-py-2.285.01.tar.gz
  cd nvidia-ml-py-2.285.01
  python setup.py install
 
  # Copy nvidia_smi.py & pynvml.py to /opt/ganglia/lib64/ganglia/python_modules/
  cp nvidia_smi.py /opt/ganglia/lib64/ganglia/python_modules/
  cp pynvml.py /opt/ganglia/lib64/ganglia/python_modules/
 
fi
# End root-check IF
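Once the bindings are installed, a quick sanity check is to ask NVML how many GPUs it sees. The following is a minimal sketch under a few assumptions: nvml_device_count is a hypothetical helper name, the PYTHON variable is introduced here only to allow a different interpreter to be selected, and the node is assumed to have the NVIDIA driver loaded.

```shell
#!/bin/bash
# nvml_device_count (hypothetical helper): print the number of GPUs
# visible through the pynvml bindings. Set PYTHON to use an interpreter
# other than the default 'python'. Returns non-zero if pynvml cannot be
# imported or NVML cannot be initialised.
nvml_device_count() {
  "${PYTHON:-python}" -c '
import sys
import pynvml
pynvml.nvmlInit()
sys.stdout.write("%d\n" % pynvml.nvmlDeviceGetCount())
pynvml.nvmlShutdown()
'
}

# Possible use on a compute node:
#   nvml_device_count || echo "pynvml/NVML not usable on this node"
```

A failure at this stage points at the bindings, ctypes or the driver rather than at Ganglia, which narrows down later troubleshooting.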

`gmond` Python Module For Monitoring NVIDIA GPUs using NVML

After downloading ganglia-gmond_python_modules-3dfa553.tar.gz from the ganglia/gmond_python_modules repository on GitHub to /share/apps/src/nvidia_ganglia/, the following steps need to be performed. The first script is run on the compute nodes and the second one only on the front end:

#! /bin/bash
#
# copy_ganglia_gmond_python_computenodes.sh
# BASH script to copy relevant files from ganglia-gmond_python_modules to Ganglia, 
# and restart gmond - save this in /share/apps/bin/
# Assumes that ganglia-gmond_python_modules-3dfa553.tar.gz is in /share/apps/src/nvidia_ganglia/
# Must be root to run this script and run this in all compute nodes from the front end via the command
#
# rocks run host '/share/apps/bin/copy_ganglia_gmond_python_computenodes.sh'
#
 
# Begin root-check IF
if [ $UID != 0 ]
then
  clear
  echo
  echo "  You must be logged in as root!"
  echo "  Exiting..."
  echo
  exit 1
else
 
  # Copy relevant files to Ganglia
  cd /tmp/
  tar -zxvpf /share/apps/src/nvidia_ganglia/ganglia-gmond_python_modules-3dfa553.tar.gz
  cd ganglia-gmond_python_modules-3dfa553/gpu/nvidia/
  cp python_modules/nvidia.py /opt/ganglia/lib64/ganglia/python_modules/
  cp conf.d/nvidia.pyconf /etc/ganglia/conf.d/
 
  #
  # Restart gmond
  /etc/init.d/gmond restart
 
fi
# End root-check IF

#! /bin/bash
#
# copy_ganglia_gmond_python_frontend.sh
# BASH script to copy relevant files from ganglia-gmond_python_modules to Ganglia, 
# apply patch for Ganglia web interface and restart necessary services 
# Assumes that ganglia-gmond_python_modules-3dfa553.tar.gz is in /share/apps/src/nvidia_ganglia/
# Must be root (or at least have sudo privilege) to run this script and run this only on front end
 
# Begin root-check IF
if [ $UID != 0 ]
then
  clear
  echo
  echo "  You must be logged in as root!"
  echo "  Exiting..."
  echo
  exit 1
else
 
  #
  # Apply web patch for Ganglia to display custom graphs
  cd /tmp/
  tar -zxvpf /share/apps/src/nvidia_ganglia/ganglia-gmond_python_modules-3dfa553.tar.gz
  cd ganglia-gmond_python_modules-3dfa553/gpu/nvidia/
  cp graph.d/*.php /var/www/html/ganglia/graph.d/
 
  cd /var/www/html/ganglia/
  patch -p0 < /tmp/ganglia-gmond_python_modules-3dfa553/gpu/nvidia/ganglia_web.patch
 
  #
  # Restart necessary services
  /etc/init.d/gmetad restart
  /etc/init.d/gmond restart
 
fi
# End root-check IF
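Whether a compute node’s gmond is actually reporting GPU metrics can be checked from the command line as well. The following is a minimal sketch: gpu_metrics is a hypothetical helper name, the default prefix "gpu" is an assumption about how the module names its metrics, and the helper expects each METRIC element on its own line with NAME followed directly by VAL.

```shell
#!/bin/bash
# gpu_metrics [PREFIX] (hypothetical helper): read Ganglia XML on stdin
# and print NAME=VAL for every METRIC whose name starts with PREFIX
# (default: "gpu", an assumed naming convention).
gpu_metrics() {
  local prefix="${1:-gpu}"
  sed -n 's/.*<METRIC NAME="\('"$prefix"'[^"]*\)" VAL="\([^"]*\)".*/\1=\2/p'
}

# Possible use from the front end (assumes 'nc' is installed and gmond
# answers XML requests on its default TCP port, 8649):
#   nc compute-0-0 8649 | gpu_metrics
```

An empty result suggests that gmond did not load the module, in which case running gmond in debug mode on that node is a reasonable next step.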

Upon pointing the browser to http://FQDN/ganglia/ (e.g., http://paracuda.math.mtu.edu/ganglia/; the link will probably die or be changed to something else in due course), the display should include information about the GPUs as well, as shown in the screenshots below:

With a little more work, the rather unaesthetic-looking Ganglia web interface can be made to match a given institution’s theme:

Thanks be to

Dr. Allan Struthers for letting his paracuda.math be used for this purpose; my friendly neighbors for their kindness in letting me borrow an NVIDIA Quadro 6000 card; Robert Alexander of NVIDIA (http://developer.nvidia.com/ganglia-monitoring-system/), Bernard Li of Lawrence Berkeley National Laboratory and Jeremy Enos of National Center for Supercomputing Applications for developing this gmond Python module as well as making time to answer my questions.

Near Future Work

Work is currently underway to include all of the compute-node-related steps of the procedure described above in the local Rocks distribution, so that the compute nodes get them as soon as they are installed.