Rocks 6.1 – IPoIB

Disclaimer

The instructions/steps given below worked for me (and Michigan Technological University) running Rocks 6.1 (with Service Pack 1, CentOS 6.3 and GE 2011.11p1); as has been common practice for several years now, a full version of the operating system was installed. The HPC cluster (wigner) used to prepare this documentation has Mellanox 56 Gb/s FDR InfiniBand switches and ports. Further, it is assumed that the eth0 interface is used for the private Ethernet network and that ib0 is the InfiniBand interface. These instructions may very well work for you (or your institution) on Rocks-like or other Linux clusters. Please note that if you decide to use these instructions on your machine, you are doing so entirely at your own discretion, and neither this site, sgowtham.com, nor its author (nor Michigan Technological University) is responsible for any damage, intellectual or otherwise.

A bit about InfiniBand

Citing Wikipedia:

InfiniBand, which forms a superset of the Virtual Interface Architecture (VIA), is a switched fabric communications link used in high-performance computing and enterprise data centers. Its features include high throughput, low latency, quality of service and failover, and it is designed to be scalable. The InfiniBand architecture specification defines a connection between processor nodes and high-performance I/O nodes such as storage devices. Mellanox and Intel are two well-known manufacturers of InfiniBand host channel adapters (HCAs) and network switches.

As with Fibre Channel, PCI Express, Serial ATA, and many other modern interconnects, InfiniBand offers point-to-point bidirectional serial links intended for the connection of processors with high-speed peripherals such as disks. On top of the point-to-point capabilities, InfiniBand also offers multicast operations. It supports several signaling rates and, as with PCI Express, links can be bonded together for additional throughput.

Configuring IPoIB

Unlike versions prior to Rocks 6.1, a default installation on hardware that has InfiniBand automagically detects the InfiniBand interface and configures it during installation. It also sets up a subnet manager on the front end. Depending on the needs of a specific cluster and/or situation, one may need to configure IP addresses for InfiniBand (commonly referred to as IPoIB). The need in this case was to serve the /research partition from a NAS node over InfiniBand and mount it on all other nodes in the cluster.
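
A quick sanity check at this point may be worthwhile: confirm that the InfiniBand link is up and that a subnet manager is indeed running. The tools and service name below (ibstat and sminfo from infiniband-diags, the opensmd init script from opensm) are assumptions based on a stock CentOS 6 installation and may differ under a vendor OFED stack; run them on the front end as root.

#
# InfiniBand link state (look for "State: Active" and "Physical state: LinkUp")
ibstat

#
# Query the subnet manager on the fabric
sminfo

#
# Status of OpenSM, if the front end runs it via the stock init script
service opensmd status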

Rocks, by default, designates the 10.1.x.y/255.255.0.0 address space for its private Ethernet network. In order to keep a one-to-one mapping/mirroring, the 10.2.x.y/255.255.0.0 address space was designated for the InfiniBand network. While the subnet for Ethernet (eth0) is denoted by .local, that for InfiniBand (ib0) will be denoted by .ibnet.

#
# Commands below need to be run from the front end as root.
# Front end usually has the IP address 10.1.1.1 for eth0.
 
#
# Network
rocks add network ibnet subnet=10.2.0.0 netmask=255.255.0.0
rocks set host interface ip     localhost iface=ib0 ip=10.2.1.1
rocks set host interface subnet localhost iface=ib0 subnet=ibnet
rocks set host interface module localhost iface=ib0 module=ip_ipoib
rocks set host interface name   localhost iface=ib0 name=wigner
rocks sync host network localhost
 
#
# Firewall
rocks add firewall host=localhost chain=INPUT protocol=all service=all action=ACCEPT network=ibnet iface=ib0 rulename="A80-IB0-PRIVATE"
rocks sync host firewall localhost
 
#
# /etc/hosts.local
echo "# Front end" > /etc/hosts.local
printf "%-14s  %-20s  %-14s\n"  "10.2.1.1"  "wigner.ibnet" "ib-wigner" >> /etc/hosts.local


Once the front end has been configured, the following shell script may be used to do the same for all other nodes:

#! /bin/bash
#
# BASH script to configure IPoIB for non-front-end nodes in a Rocks 6.1 cluster.
# IP address scheme for InfiniBand (ib0) will reflect that of Ethernet (eth0).
# Must be root to run this script.
 
# Function to convert the first character in a string to uppercase
function ucfirst_character () {
  original_string="$@"               
  first_character=${original_string:0:1}   
  rest_of_the_string=${original_string:1}       
  first_character_uc=`echo "$first_character" | tr a-z A-Z`
  echo "${first_character_uc}${rest_of_the_string}"  
}
 
 
# Necessary variables 
# Remove login and/or nas from the list below if the cluster does not have login and/or NAS nodes
export MYNODETYPES="login nas compute"
 
# Outer for loop begins
for x in $MYNODETYPES
do
 
  # List of nodes of given type (login, nas or compute)
  export MYNODES=`rocks list host | grep "$x" | awk -F ':' '{ print $1 }' | sort -t- -k 2,2n -k 3,3n`
 
  # /etc/hosts.local header for a given type of node
  if [ $x == "nas" ]
  then
    export y=`echo $x | tr a-z A-Z`
  else
    export y=`ucfirst_character $x`
  fi
  echo "# $y node(s)" >> /etc/hosts.local
 
  # Inner for loop begins
  for MYHOSTNAME_ETH0 in $MYNODES
  do
    #
    # Additional necessary variables
    export MYHOSTNAME_IB0="ib-$MYHOSTNAME_ETH0"
    export MYHOSTIP_ETH0=`rocks list host interface $MYHOSTNAME_ETH0 | grep "eth0" | awk '{ print $4 }'`
    export MYHOSTIP_IB0=`echo $MYHOSTIP_ETH0 | sed 's/^10\.1\./10.2./'`
    export MYSHORTNAME_ETH0=`echo $MYHOSTNAME_ETH0 | sed 's/compute/c/g' | sed 's/login/l/g' | sed 's/nas/n/g'`
    export MYSHORTNAME_IB0=`echo $MYHOSTNAME_IB0   | sed 's/compute/c/g' | sed 's/login/l/g' | sed 's/nas/n/g'`
 
    #
    # Network
    rocks set host interface ip $MYHOSTNAME_ETH0 iface=ib0 ip=$MYHOSTIP_IB0
    rocks set host interface subnet $MYHOSTNAME_ETH0 iface=ib0 subnet=ibnet
    rocks set host interface module $MYHOSTNAME_ETH0 iface=ib0 module=ip_ipoib
    rocks set host interface name $MYHOSTNAME_ETH0 iface=ib0 name=$MYHOSTNAME_ETH0
    rocks sync host network $MYHOSTNAME_ETH0
 
    #
    # Firewall
    rocks add firewall host=$MYHOSTNAME_ETH0 chain=INPUT protocol=all service=all action=ACCEPT network=ibnet iface=ib0 rulename="A80-IB0-PRIVATE"
    rocks sync host firewall $MYHOSTNAME_ETH0
 
    #
    # For debugging purposes only
    printf "%-14s  %-20s  %-14s  %-18s  %-10s\n"  "${MYHOSTIP_ETH0}" "${MYHOSTNAME_ETH0}.local" "${MYSHORTNAME_ETH0}.local" "${MYHOSTNAME_ETH0}" "${MYSHORTNAME_ETH0}"
    printf "%-14s  %-20s  %-14s  %-18s  %-10s\n"  "${MYHOSTIP_IB0}"  "${MYHOSTNAME_ETH0}.ibnet" "${MYSHORTNAME_ETH0}.ibnet" "${MYHOSTNAME_IB0}"  "${MYSHORTNAME_IB0}"
 
    #
    # /etc/hosts.local
    printf "%-14s  %-20s  %-14s  %-18s  %-10s\n"  "${MYHOSTIP_ETH0}" "${MYHOSTNAME_ETH0}.local" "${MYSHORTNAME_ETH0}.local" "${MYHOSTNAME_ETH0}" "${MYSHORTNAME_ETH0}" >> /etc/hosts.local
    printf "%-14s  %-20s  %-14s  %-18s  %-10s\n"  "${MYHOSTIP_IB0}"  "${MYHOSTNAME_ETH0}.ibnet" "${MYSHORTNAME_ETH0}.ibnet" "${MYHOSTNAME_IB0}"  "${MYSHORTNAME_IB0}"  >> /etc/hosts.local
 
  done
  # Inner for loop ends
 
done
# Outer for loop ends
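
Saved as, say, configure_ipoib.sh (the file name is arbitrary and used here only for illustration), the script is run once from the front end as root; it configures every node and appends the corresponding entries to /etc/hosts.local as it goes:

#
# Run once from the front end as root
chmod +x configure_ipoib.sh
./configure_ipoib.sh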


/etc/hosts.local (which will be included by the command rocks report host to generate /etc/hosts) looks as follows for a cluster (wigner) with 1 front end, 2 login nodes, 1 NAS node and 4 compute nodes:

# Front end
10.1.1.1        wigner.local          wigner
10.2.1.1        wigner.ibnet          ib-wigner
# Login node(s)
10.1.255.254    login-0-1.local       l-0-1.local     login-0-1           l-0-1
10.2.255.254    login-0-1.ibnet       l-0-1.ibnet     ib-login-0-1        ib-l-0-1
10.1.255.253    login-0-2.local       l-0-2.local     login-0-2           l-0-2
10.2.255.253    login-0-2.ibnet       l-0-2.ibnet     ib-login-0-2        ib-l-0-2
# NAS node(s)
10.1.255.252    nas-0-0.local         n-0-0.local     nas-0-0             n-0-0
10.2.255.252    nas-0-0.ibnet         n-0-0.ibnet     ib-nas-0-0          ib-n-0-0
# Compute node(s)
10.1.255.251    compute-0-0.local     c-0-0.local     compute-0-0         c-0-0
10.2.255.251    compute-0-0.ibnet     c-0-0.ibnet     ib-compute-0-0      ib-c-0-0
10.1.255.250    compute-0-1.local     c-0-1.local     compute-0-1         c-0-1
10.2.255.250    compute-0-1.ibnet     c-0-1.ibnet     ib-compute-0-1      ib-c-0-1
10.1.255.249    compute-0-2.local     c-0-2.local     compute-0-2         c-0-2
10.2.255.249    compute-0-2.ibnet     c-0-2.ibnet     ib-compute-0-2      ib-c-0-2
10.1.255.248    compute-0-4.local     c-0-4.local     compute-0-4         c-0-4
10.2.255.248    compute-0-4.ibnet     c-0-4.ibnet     ib-compute-0-4      ib-c-0-4
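
Since /etc/hosts.local only feeds the generated file, it may be worth confirming that the .ibnet entries actually show up in the output of rocks report host before relying on them:

#
# The .ibnet entries should appear alongside the .local ones
rocks report host | grep ibnet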


It is highly recommended that tests (as simple as ping, or as complex as one wants to make them) be performed to make sure that all nodes respond over the newly configured InfiniBand IP address range before letting real users run real simulations.
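
As one possible starting point, the loop below pings every non-front-end node over its .ibnet name; it assumes the /etc/hosts entries generated above resolve on the front end:

#! /bin/bash
#
# Quick reachability test over the InfiniBand (.ibnet) addresses.
for MYHOST in `rocks list host | awk -F ':' '/login|nas|compute/ { print $1 }'`
do
  if ping -c 2 -W 2 "${MYHOST}.ibnet" > /dev/null 2>&1
  then
    echo "${MYHOST}.ibnet : OK"
  else
    echo "${MYHOST}.ibnet : FAILED"
  fi
done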


Thanks be to

Rocks Cluster Distribution developers, Rocks mailing list and its participants.

9 Replies to “Rocks 6.1 – IPoIB”

  1. I have just installed Rocks 6.1 on my system. I was trying to configure my Mellanox InfiniBand using your instructions, and things looked like they worked, but the ibnet definitions would not save until I issued: “rocks add network ibnet subnet=10.2.0.0 netmask=255.255.0.0”

  2. Hi!
    I’ve configured IPoIB using your tutorial, thanks!
    But now I’m struggling to mount /share on the compute nodes via IB.
    It seems that changing the IP in /etc/exports is not enough.
    Do you use IB for NFS?
    Greg

    1. Greg,

      I didn’t play around with remounting /share on the compute nodes via IB at all. Sorry for not being very useful, but the Rocks discussion group might have an answer for you.

  3. Hi!
    I have set up rocks-3.1 with an InfiniBand connection without problems; it works fine. However, I think I have a performance problem. I was copying a file and I got this:
    # scp datosIB.tar.gz cluster@compute-0-1:
    datosIB.tar.gz 100% 118GB 63.4MB/s 31:52

    I thought maybe it could be a hard-drive problem, but I used hdparm and dd to check its read/write speeds:
    # hdparm -tT /dev/sda
    /dev/sda:
    Timing cached reads: 21092 MB in 2.00 seconds = 10555.57 MB/sec
    Timing buffered disk reads: 386 MB in 3.01 seconds = 128.44 MB/sec

    # dd if=/dev/zero of=/tmp/output bs=8k count=10k; rm -f /tmp/output
    10240+0 records in
    10240+0 records out
    83886080 bytes (84 MB) copied, 0.0492417 s, 1.7 GB/s

    So I guess it’s a misconfiguration of ib0. Any suggestions?
    Thanks in advance

    Edson

    1. Just to point out that I’m getting 63.4 MB/s on an IB connection. I can get that rate on a common Ethernet connection. What should I do to make it better?

  4. I followed the guide and in the end Ganglia only sees the .local nodes.
    I can ssh to compute-0-0.ibnet without any problems, but Ganglia shows just the .local nodes.
    If you can help me, that would be awesome.
