Disclaimer
The instructions/steps given below worked for me (and Michigan Technological University) running Rocks 6.1 (with Service Pack 1, CentOS 6.3 and GE 2011.11p1); as has been common practice for several years now, a full version of the operating system was installed. The HPC cluster (wigner) used to prepare this documentation has Mellanox 56 Gb/s FDR InfiniBand switches and ports. Further, it is assumed that the eth0 interface is used for the private ethernet network and that ib0 is the InfiniBand interface. These instructions may very well work for you (or your institution) on Rocks-like or other Linux clusters. Please note that if you decide to use these instructions on your machine, you are doing so entirely at your own discretion, and neither this site, sgowtham.com, nor its author (or Michigan Technological University) is responsible for any damage, intellectual or otherwise.
A bit about InfiniBand
Citing Wikipedia:
InfiniBand, which forms a superset of Virtual Interface Architecture (VIA), is a switched fabric communications link used in high-performance computing and enterprise data centers. Its features include high throughput, low latency, quality of service and failover, and it is designed to be scalable. The InfiniBand architecture specification defines a connection between processor nodes and high-performance I/O nodes such as storage devices. Mellanox and Intel are two well-known manufacturers of InfiniBand host bus adapters (HBAs) and network switches.
As with Fibre Channel, PCI Express, Serial ATA, and many other modern interconnects, InfiniBand offers point-to-point bidirectional serial links intended for the connection of processors with high-speed peripherals such as disks. On top of the point-to-point capabilities, InfiniBand also offers multicast operations. It supports several signaling rates and, as with PCI Express, links can be bonded together for additional throughput.
Configuring IPoIB
Unlike in versions prior to Rocks 6.1, a default installation on hardware that has InfiniBand automagically detects the InfiniBand interface and configures it during installation. It also sets up the subnet manager on the front end. Depending on the needs of a specific cluster and/or situation, one may still need to configure IP addresses for InfiniBand (commonly referred to as IPoIB). The need in this case was to serve/mount the /research partition from a NAS node over InfiniBand across all other nodes in the cluster.
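For context, the end state being worked toward looks roughly like the following. The names used (nas-0-0.ibnet, the 10.2.0.0/255.255.0.0 range) are the ones defined further down in this post, and the export/mount options shown are illustrative rather than the exact ones used on wigner:

# On the NAS node, /etc/exports restricts the export to the InfiniBand subnet
/research  10.2.0.0/255.255.0.0(rw,async,no_root_squash)

# On every other node, /etc/fstab mounts /research via the NAS node's ib0 name
nas-0-0.ibnet:/research  /research  nfs  defaults  0  0

After editing /etc/exports on the NAS node, exportfs -ra re-reads it, and mount -a (or a reboot) picks up the new fstab entry on the clients.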
Rocks, by default, designates the 10.1.x.y/255.255.0.0 address space for its private ethernet network. In order to keep a one-to-one mapping/mirroring, the 10.2.x.y/255.255.0.0 address space was designated for the InfiniBand network. While the subnet for ethernet (eth0) is denoted by .local, that for InfiniBand (ib0) will be denoted by .ibnet.
#
# Commands below need to be run from the front end as root.
# Front end usually has the IP address 10.1.1.1 for eth0.
#
# Network
rocks add network ibnet subnet=10.2.0.0 netmask=255.255.0.0
rocks set host interface ip localhost iface=ib0 ip=10.2.1.1
rocks set host interface subnet localhost iface=ib0 subnet=ibnet
rocks set host interface module localhost iface=ib0 module=ip_ipoib
rocks set host interface name localhost iface=ib0 name=wigner
rocks sync host network localhost
#
# Firewall
rocks add firewall host=localhost chain=INPUT protocol=all service=all action=ACCEPT network=ibnet iface=ib0 rulename="A80-IB0-PRIVATE"
rocks sync host firewall localhost
#
# /etc/hosts.local
echo "# Front end" > /etc/hosts.local
printf "%-14s %-20s %-14s\n" "10.2.1.1" "wigner.ibnet" "ib-wigner" >> /etc/hosts.local
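Before proceeding to the other nodes, it does not hurt to verify that the front end's ib0 came up with the intended address. A minimal check, assuming the commands above completed without errors (exact output will vary with hardware and drivers):

# What the Rocks database now knows about the front end's interfaces
rocks list host interface localhost

# Confirm that ib0 carries 10.2.1.1 and answers locally
ifconfig ib0
ping -c 3 10.2.1.1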
Once the front end has been configured, the following shell script may be used to do the same for all other nodes:
#! /bin/bash
#
# BASH script to configure IPoIB for non front end nodes in a Rocks 6.1 cluster.
# IP address scheme for InfiniBand (ib0) will reflect that of Ethernet (eth0).
# Must be root to run this script.

# Function to convert the first character in a string to uppercase
function ucfirst_character () {
  original_string="$@"
  first_character=${original_string:0:1}
  rest_of_the_string=${original_string:1}
  first_character_uc=`echo "$first_character" | tr a-z A-Z`
  echo "${first_character_uc}${rest_of_the_string}"
}

# Necessary variables
# Remove login and/or nas from the list below if the cluster does not have login and/or NAS nodes
export MYNODETYPES="login nas compute"

# Outer for loop begins
for x in $MYNODETYPES
do
  # List of nodes of given type (login, nas or compute)
  export MYNODES=`rocks list host | grep "$x" | awk -F ':' '{ print $1 }' | sort -t- -k 2,2n -k 3,3n`

  # /etc/hosts.local header for a given type of node
  if [ $x == "nas" ]
  then
    export y=`echo $x | tr a-z A-Z`
  else
    export y=`ucfirst_character $x`
  fi
  echo "# $y node(s)" >> /etc/hosts.local

  # Inner for loop begins
  for MYHOSTNAME_ETH0 in $MYNODES
  do
    #
    # Additional necessary variables
    export MYHOSTNAME_IB0="ib-$MYHOSTNAME_ETH0"
    export MYHOSTIP_ETH0=`rocks list host interface $MYHOSTNAME_ETH0 | grep "eth0" | awk '{ print $4 }'`
    export MYHOSTIP_IB0=`echo $MYHOSTIP_ETH0 | sed 's/10.1/10.2/g'`
    export MYSHORTNAME_ETH0=`echo $MYHOSTNAME_ETH0 | sed 's/compute/c/g' | sed 's/login/l/g' | sed 's/nas/n/g'`
    export MYSHORTNAME_IB0=`echo $MYHOSTNAME_IB0 | sed 's/compute/c/g' | sed 's/login/l/g' | sed 's/nas/n/g'`
    #
    # Network
    rocks set host interface ip $MYHOSTNAME_ETH0 iface=ib0 ip=$MYHOSTIP_IB0
    rocks set host interface subnet $MYHOSTNAME_ETH0 iface=ib0 subnet=ibnet
    rocks set host interface module $MYHOSTNAME_ETH0 iface=ib0 module=ip_ipoib
    rocks set host interface name $MYHOSTNAME_ETH0 iface=ib0 name=$MYHOSTNAME_ETH0
    rocks sync host network $MYHOSTNAME_ETH0
    #
    # Firewall
    rocks add firewall host=$MYHOSTNAME_ETH0 chain=INPUT protocol=all service=all action=ACCEPT network=ibnet iface=ib0 rulename="A80-IB0-PRIVATE"
    rocks sync host firewall $MYHOSTNAME_ETH0
    #
    # For debugging purposes only
    printf "%-14s %-20s %-14s %-18s %-10s\n" "${MYHOSTIP_ETH0}" "${MYHOSTNAME_ETH0}.local" "${MYSHORTNAME_ETH0}.local" "${MYHOSTNAME_ETH0}" "${MYSHORTNAME_ETH0}"
    printf "%-14s %-20s %-14s %-18s %-10s\n" "${MYHOSTIP_IB0}" "${MYHOSTNAME_ETH0}.ibnet" "${MYSHORTNAME_ETH0}.ibnet" "${MYHOSTNAME_IB0}" "${MYSHORTNAME_IB0}"
    #
    # /etc/hosts.local
    printf "%-14s %-20s %-14s %-18s %-10s\n" "${MYHOSTIP_ETH0}" "${MYHOSTNAME_ETH0}.local" "${MYSHORTNAME_ETH0}.local" "${MYHOSTNAME_ETH0}" "${MYSHORTNAME_ETH0}" >> /etc/hosts.local
    printf "%-14s %-20s %-14s %-18s %-10s\n" "${MYHOSTIP_IB0}" "${MYHOSTNAME_ETH0}.ibnet" "${MYSHORTNAME_ETH0}.ibnet" "${MYHOSTNAME_IB0}" "${MYSHORTNAME_IB0}" >> /etc/hosts.local
  done
  # Inner for loop ends
done
# Outer for loop ends
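Saved as, say, /root/configure_ipoib.sh (the filename is arbitrary), the script is run once from the front end as root:

chmod +x /root/configure_ipoib.sh
/root/configure_ipoib.sh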
/etc/hosts.local (which will be included by the command rocks report host to generate /etc/hosts) looks as follows for a cluster (wigner) with 1 front end, 2 login nodes, 1 NAS node and 4 compute nodes:
# Front end
10.1.1.1       wigner.ibnet         wigner
10.2.1.1       wigner.ibnet         ib-wigner
# Login node(s)
10.1.255.254   login-0-1.local      l-0-1.local    login-0-1          l-0-1
10.2.255.254   login-0-1.ibnet      l-0-1.ibnet    ib-login-0-1       ib-l-0-1
10.1.255.253   login-0-2.local      l-0-2.local    login-0-2          l-0-2
10.2.255.253   login-0-2.ibnet      l-0-2.ibnet    ib-login-0-2       ib-l-0-2
# NAS node(s)
10.1.255.252   nas-0-0.local        n-0-0.local    nas-0-0            n-0-0
10.2.255.252   nas-0-0.ibnet        n-0-0.ibnet    ib-nas-0-0         ib-n-0-0
# Compute node(s)
10.1.255.251   compute-0-0.local    c-0-0.local    compute-0-0        c-0-0
10.2.255.251   compute-0-0.ibnet    c-0-0.ibnet    ib-compute-0-0     ib-c-0-0
10.1.255.250   compute-0-1.local    c-0-1.local    compute-0-1        c-0-1
10.2.255.250   compute-0-1.ibnet    c-0-1.ibnet    ib-compute-0-1     ib-c-0-1
10.1.255.249   compute-0-2.local    c-0-2.local    compute-0-2        c-0-2
10.2.255.249   compute-0-2.ibnet    c-0-2.ibnet    ib-compute-0-2     ib-c-0-2
10.1.255.248   compute-0-4.local    c-0-4.local    compute-0-4        c-0-4
10.2.255.248   compute-0-4.ibnet    c-0-4.ibnet    ib-compute-0-4     ib-c-0-4
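Since /etc/hosts is generated from the Rocks database (with /etc/hosts.local folded in), it may need to be regenerated and propagated once the file above is in place. A sketch of that step, assuming stock Rocks 6.1 behavior:

# Regenerate configuration files, including /etc/hosts, from the Rocks database
rocks sync config

# Spot-check that the new .ibnet entries made it into the generated hosts file
rocks report host | grep ibnet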
It is highly recommended that tests (be they as simple as ping or as complex as one wants to make them) be performed to make sure that all nodes respond over the newly configured InfiniBand IP address range before letting real users run real simulations.
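A crude sketch of such a test, looping over the .ibnet names recorded in the generated /etc/hosts and assuming the front end already reaches every node over eth0:

#! /bin/bash
# Ping each host once over its InfiniBand (.ibnet) name and report failures
for MYHOST in `grep "\.ibnet" /etc/hosts | awk '{ print $2 }'`
do
  if ping -c 1 -W 2 $MYHOST > /dev/null 2>&1
  then
    echo "$MYHOST responds over InfiniBand"
  else
    echo "$MYHOST does NOT respond over InfiniBand"
  fi
done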
Thanks be to
Rocks Cluster Distribution developers, Rocks mailing list and its participants.
I have just installed rocks 6.1 on my system. I was trying to configure my mellanox infiniband using your instructions and things looked like they worked but the ibnet definitions would not save until I issued: “rocks add network ibnet subnet=10.2.0.0 netmask=255.255.0.0”
Not sure why I missed out on it but thank you for pointing it out. The post/script has been updated.
I found that nfs will not work without
“rocks set network servedns ibnet True”
Thank you, Cesar.
Hi!
I’ve configured ipoIB using your tutorial, thanks!
But now I’m struggling to mount /share on compute nodes via IB
It seems that changing IP in /etc/exports is not enough
Do you use IB for NFS?
Greg
Greg,
I didn’t play around with remounting /share on the compute nodes via IB at all. Sorry for not being very useful, but the Rocks discussion group might have an answer for you.
Hi!
I have set up rocks-3.1 with an InfiniBand connection without problems; it works fine. However, I think I have a performance problem. I was copying a file and I got this:
# scp datosIB.tar.gz cluster@compute-0-1:
datosIB.tar.gz 100% 118GB 63.4MB/s 31:52
I thought maybe it could be an HD problem, but I used hdparm and dd to check its read/write speed:
# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 21092 MB in 2.00 seconds = 10555.57 MB/sec
Timing buffered disk reads: 386 MB in 3.01 seconds = 128.44 MB/sec
# dd if=/dev/zero of=/tmp/output bs=8k count=10k; rm -f /tmp/output
10240+0 records in
10240+0 records out
83886080 bytes (84 MB) copied, 0.0492417 s, 1.7 GB/s
So I guess its a misconfiguration of the ib0. Any suggestion?
Thanks in advance
Edson
Just to point out that I’m getting 63.4 MB/s over an IB connection. I can get that rate on a common ethernet connection. What should I do to make it better?
I followed the guide and in the end Ganglia only sees .local nodes.
I can ssh compute-0-0.ibnet without any problems, but Ganglia shows just the .local nodes.
If you can help me that would be awesome.