RHEL 6.2 – Ganglia’s gmond Python module for monitoring NVIDIA GPU

Disclaimer

The instructions/steps given below worked for me (and Michigan Technological University) running site-licensed Red Hat Enterprise Linux 6.2 – as has been common practice for several years now, a full version of the operating system was installed and all necessary patches/upgrades were applied. These instructions may very well work for you (or your institution) on Red Hat-like or other Linux distributions. Please note that if you decide to use these instructions on your machine, you are doing so entirely at your own discretion, and that neither this site, sgowtham.com, nor its author (or Michigan Technological University) is responsible for any damage – intellectual and/or otherwise.

A bit about Ganglia (gmond & gmetad)

Citing Ganglia website,

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRD tool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on thousands of clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes.

Further, citing Wikipedia,

gmond (Ganglia Monitoring Daemon) is a multi-threaded daemon which runs on each cluster node that needs to be monitored. Installation does not require having a common NFS file system or a database back-end, installing special accounts, or maintaining configuration files. It has four main responsibilities: monitor changes in host state; announce relevant changes; listen to the state of all other ganglia nodes via a unicast or multicast channel; answer requests for an XML description of the cluster state.

Each gmond transmits information in two different ways: unicasting or multicasting host state in external data representation (XDR) format using UDP messages, or sending XML over a TCP connection.

Federation in Ganglia is achieved using a tree of point-to-point connections amongst representative cluster nodes to aggregate the state of multiple clusters. At each node in the tree, a Ganglia Meta Daemon (gmetad) periodically polls a collection of child data sources, parses the collected XML, saves all numeric, volatile metrics to round-robin databases and exports the aggregated XML over a TCP socket to clients. Data sources may be either gmond daemons, representing specific clusters, or other gmetad daemons, representing sets of clusters. Data sources use source IP addresses for access control and can be specified using multiple IP addresses for failover. The latter capability is natural for aggregating data from clusters since each gmond daemon contains the entire state of its cluster.

The Ganglia web front-end provides a view of the gathered information via real-time dynamic web pages. Most importantly, it displays Ganglia data in a meaningful way for system administrators and computer users. Although the web front-end to ganglia started as a simple HTML view of the XML tree, it has evolved into a system that keeps a colourful history of all collected data. The Ganglia web front-end caters to system administrators and users (e.g., one can view the CPU utilization over the past hour, day, week, month, or year). The web front-end shows similar graphs for memory usage, disk usage, network statistics, number of running processes, and all other Ganglia metrics. The web front-end depends on the existence of the gmetad which provides it with data from several Ganglia sources.

Specifically, the web front-end will open the local port 8651 (by default) and expects to receive a Ganglia XML tree. The web pages themselves are highly dynamic; any change to the Ganglia data appears immediately on the site. This behaviour leads to a very responsive site, but requires that the full XML tree be parsed on every page access. Therefore, the Ganglia web front-end should run on a fairly powerful, dedicated machine if it presents a large amount of data. The Ganglia web front-end is written in the PHP scripting language, and uses graphs generated by gmetad to display history information.
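For a quick look at the raw XML tree described above, one can simply connect to the relevant TCP port. This is only a sketch, assuming nc is installed and the default ports are in use (gmond's tcp_accept_channel on 8649, configured later in this post, and gmetad's xml_port on 8651):

# gmond serves its own cluster's XML on the tcp_accept_channel port (8649 by default)
nc localhost 8649 | head -n 20

# gmetad exports the aggregated XML tree on its xml_port (8651 by default)
nc localhost 8651 > /tmp/ganglia_snapshot.xml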

Installation

In order to make sure that none of the required steps are missed when performing a similar installation on other machines (or when repeating it on the same machine), a BASH script was written.

#! /bin/bash
#
# install_ganglia.sh
# BASH script to install Ganglia on RHEL 6.2
# Must be root (or at least have sudo privilege) to run this script
 
# Begin root-check IF
if [ $UID != 0 ];
then
  clear
  echo
  echo "  You must be logged in as root!"
  echo "  Exiting..."
  echo
  exit
else
  #
  # Enable EPEL repository
  # EPEL: Extra Packages for Enterprise Linux
  cd /tmp/
  wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-5.noarch.rpm
  rpm -ivh epel-release-6-5.noarch.rpm
 
  #
  # Install Ganglia
  yum install ganglia ganglia-gmetad ganglia-gmond ganglia-web ganglia-gmond-python
 
  #
  # Make sure httpd, gmond and gmetad automatically start after each reboot
  chkconfig --level 345 httpd on
  chkconfig --level 345 gmond on
  chkconfig --level 345 gmetad on
 
fi 
# End root-check IF
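Once the script has run, a quick check confirms that the packages were installed and that the services are set to start at boot (a sketch; the exact package list may vary slightly with the EPEL version):

# Verify the installed Ganglia packages
rpm -qa | grep -i ganglia

# Verify that httpd, gmond and gmetad will start in runlevels 3, 4 and 5
chkconfig --list httpd
chkconfig --list gmond
chkconfig --list gmetad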

Configuration

Assuming all went well so far, one can expect to have the following files: /etc/ganglia/gmetad.conf, /etc/ganglia/gmond.conf and /etc/httpd/conf.d/ganglia.conf.


Edit /etc/ganglia/gmetad.conf to have the following line:

data_source "dirac.dcs" localhost:8649


/etc/ganglia/gmond.conf will have the following edits:

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
  name = "dirac.dcs"
  owner = "Michigan Technological University"
  latlong = "N47.11 W88.57"
  url = "http://www.it.mtu.edu/"
}
 
/* Feel free to specify as many udp_send_channels as you like.
 * Gmond used to only support having a single channel 
*/
udp_send_channel {
  bind_hostname = yes  # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  mcast_join = 239.2.11.71
  host = dirac.dcs.it.mtu.edu
  port = 8649
  ttl = 1
}
 
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}
 
/* You can specify as many tcp_accept_channels as you like to share
 * an xml description of the state of the cluster 
*/
tcp_accept_channel {
  port = 8649
  acl {
    default = "deny"
 
    access {
      ip = 127.0.0.1
      mask = 32
      action = "allow"
    }
  }
}
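After editing gmond.conf, running gmond in the foreground with debugging enabled is a convenient way to catch configuration mistakes before restarting the service (stop it with Ctrl-C once satisfied); the flags below are the standard short options:

# Run gmond in the foreground with debug output; a malformed
# configuration file will be reported immediately
gmond -d 2 -c /etc/ganglia/gmond.conf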


/etc/httpd/conf.d/ganglia.conf will have the following edits:

  #
  # Ganglia monitoring system php web frontend
  #

  Alias /ganglia /usr/share/ganglia

  <Location /ganglia>
    Order deny,allow
    # Deny from all
    # Allow from 127.0.0.1
    # Allow from ::1
    # # Allow from .example.com
    Allow from all
  </Location>
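Allowing everyone to see the front-end may not be desirable in every environment; the same stanza can instead restrict access to specific hosts or subnets, following the commented-out pattern above (the subnet below is purely an illustration):

  <Location /ganglia>
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
    Allow from ::1
    Allow from 192.168.1.0/24
  </Location>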


The firewall needs to be modified so that it accepts UDP & TCP requests on port 8649. To that effect, /etc/sysconfig/iptables will have the following lines (placed above the final REJECT rule, if any):

# Ganglia gmond/gmetad
-A INPUT -m udp -p udp --dport 8649 -j ACCEPT
-A INPUT -m tcp -p tcp --dport 8649 -j ACCEPT
#


(Re)Start the necessary services:

/etc/init.d/iptables restart
/etc/init.d/gmetad restart
/etc/init.d/gmond restart
/etc/init.d/httpd restart
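Once the services are back up, a couple of quick checks confirm that the firewall rules took effect and that the daemons are listening on their default ports:

# Confirm the new firewall rules are active
iptables -L INPUT -n | grep 8649

# Confirm that gmond (8649) and gmetad (8651/8652) are listening
netstat -tulpn | grep -E '8649|865[12]'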


After a few minutes of collecting data and upon pointing the browser to http://FQDN/ganglia/, the web page should display the relevant information.

If, instead of relevant information, the web page displays the following error message

There was an error collecting ganglia data (127.0.0.1:8652): fsockopen error: Permission denied

then, more often than not, it hints at an SELinux-related issue. Edit the file /etc/sysconfig/selinux to look like:

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted


Restart the machine and point the browser to http://FQDN/ganglia/; the web page should now display the relevant information.
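Alternatively, as pointed out in the replies below, SELinux need not be disabled entirely; granting Apache permission to make network connections (so the PHP front-end can reach gmetad) is sufficient:

# Allow httpd scripts to open network connections; -P makes the change persistent
setsebool -P httpd_can_network_connect 1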

Monitoring NVIDIA GPU

The aforementioned setup works fine and as expected, but it doesn't necessarily provide any information about GPU(s) that may be part of the hardware. For example, the test machine used in our case has two NVIDIA GeForce GTX 570 cards. With more and more scientific & engineering computations tending towards GPU-based computing, it'd be useful to include their status/usage information in Ganglia's web portal. To this effect, NVIDIA released a gmond Python module for GPUs (I was made aware of it by one of Michigan Tech's ITSS directors). The instructions given in the NVIDIA-linked pages do work as described and are included here for the sake of completeness.

Python Bindings for the NVIDIA Management Library

These bindings provide Python access to static information and monitoring data for NVIDIA GPUs, as well as management capabilities. They expose the functionality of the NVML, and one may download them from here – as before, the necessary steps are included in a BASH script.

#! /bin/bash
#
# install_python_nvml_bindings.sh
# BASH script to download and install Python Bindings for the NVML
# Must be root (or at least have sudo privilege) to run this script
# Does not work with Python 2.4 - needs higher/more recent version
 
# Begin root-check IF
if [ $UID != 0 ];
then
  clear
  echo
  echo "  You must be logged in as root!"
  echo "  Exiting..."
  echo
  exit
else
  #
  # Download and install
  cd /tmp/
  wget http://pypi.python.org/packages/source/n/nvidia-ml-py/nvidia-ml-py-2.285.01.tar.gz
 
  tar -zxvpf nvidia-ml-py-2.285.01.tar.gz
  cd nvidia-ml-py-2.285.01
  python setup.py install
 
fi
# End root-check IF
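Once installed, a quick sanity check (assuming the NVIDIA driver is loaded) confirms that the bindings can see the GPUs:

# Print the number of NVIDIA GPUs visible to NVML (Python 2.6 syntax)
python -c "from pynvml import *; nvmlInit(); print nvmlDeviceGetCount(); nvmlShutdown()"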

gmond Python Module For Monitoring NVIDIA GPUs using NVML

After downloading ganglia-gmond_python_modules-3dfa553.tar.gz (a snapshot of the ganglia/gmond_python_modules repository on GitHub) to /tmp/, the following steps need to be performed:

#! /bin/bash
#
# copy_ganglia_gmond_python.sh
# BASH script to copy nvidia_smi.py & pynvml.py to
# /usr/lib64/ganglia/python_modules/, copy relevant files from
# ganglia-gmond_python_modules to Ganglia, apply the patch for
# the Ganglia web interface and restart necessary services
# Must be root (or at least have sudo privilege) to run this script
 
# Begin root-check IF
if [ $UID != 0 ];
then
  clear
  echo
  echo "  You must be logged in as root!"
  echo "  Exiting..."
  echo
  exit
else
  #
  # Copy nvidia_smi.py & pynvml.py to /usr/lib64/ganglia/python_modules/
  cp /tmp/nvidia-ml-py-2.285.01/nvidia_smi.py /usr/lib64/ganglia/python_modules/
  cp /tmp/nvidia-ml-py-2.285.01/pynvml.py /usr/lib64/ganglia/python_modules/
 
  #
  # Copy relevant files to Ganglia
  cd /tmp/
  tar -zxvpf ganglia-gmond_python_modules-3dfa553.tar.gz
  cd ganglia-gmond_python_modules-3dfa553
 
  cd gpu/nvidia/
  cp python_modules/nvidia.py /usr/lib64/ganglia/python_modules/
  cp conf.d/nvidia.pyconf /etc/ganglia/conf.d/
  cp graph.d/*.php /usr/share/ganglia/graph.d/
 
  #
  # Apply web patch for Ganglia to display custom graphs
  cd /usr/share/ganglia/
  patch -p0 < /tmp/ganglia-gmond_python_modules-3dfa553/gpu/nvidia/ganglia_web.patch
 
  #
  # Restart necessary services
  /etc/init.d/gmetad restart
  /etc/init.d/gmond restart
 
fi
# End root-check IF
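A minute or two after the restart, the GPU metrics gathered by the module should show up in gmond's XML output; a quick check from the command line (metric names may differ slightly between module versions):

# Look for GPU-related metrics in the XML served by gmond
nc localhost 8649 | grep -i gpu | head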


Upon pointing the browser to http://FQDN/ganglia/ (e.g., http://dirac.dcs.it.mtu.edu/ganglia/ – the link will probably die or be changed to something else in due course), the display should include information about the GPUs as well, as shown in the screenshots below:

[Screenshots: Ganglia web front-end displaying NVIDIA GPU metrics]


With a little more work, the rather plain-looking Ganglia web interface can be made to match a given institution's theme:

[Screenshots: Ganglia web front-end re-themed to match the institution's look and feel]

Near Future Work

Work is currently underway, most certainly with help from the ever-awesome Rocks mailing list, to integrate the above into an NPACI Rocks 5.4.2 cluster with compute nodes having one or more GPUs. Another post will come along as and when this work has been completed and tested.

6 Replies to “RHEL 6.2 – Ganglia’s gmond Python module for monitoring NVIDIA GPU”

  1. Small note, SELinux doesn’t need to be disabled on the web node. You can give apache the permissions via:

    setsebool -P httpd_can_network_connect 1

    m

  2. I was wondering if there is any way I can restrict a graph, say tcpconns for TCP connections, to just a single cluster instead of all the clusters I have? I can't seem to find an option in Ganglia which does that.

  3. Are you using something else to monitor GPU usage? I am trying to get the NVIDIA stuff to work with Ganglia at Harvard Med! Came across your page, which was helpful. Thanks!

    1. It’s a bit of a long story, but the GPUs in our HPC cluster are quite old and we no longer monitor their usage (they are not even used). However, the university has a separate GPU cluster in its first year/iteration, and we are figuring out how to track usage. I believe its administrator is attempting to implement SLURM as the scheduler. I can share more info when they make more progress.
