Rocks 5.4.2 – Scheduling GPU jobs via SGE

Disclaimer

The instructions/steps given below worked for me (and Michigan Technological University) running Rocks 5.4.2 (with CentOS 5.5 and SGE 6.2u5); as has been common practice for several years now, a full version of the operating system was installed. These instructions may very well work for you (or your institution) on Rocks-like or other Linux clusters. Please note that if you decide to use these instructions on your machine, you are doing so entirely at your own discretion, and that neither this site, sgowtham.com, nor its author (or Michigan Technological University) is responsible for any damage, intellectual or otherwise.

A bit about GPU computing

Citing NVIDIA,

GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications. Pioneered five years ago by NVIDIA, GPU computing has quickly become an industry standard, enjoyed by millions of users worldwide and adopted by virtually all computing vendors.

GPU computing offers unprecedented application performance by offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU. From a user’s perspective, applications simply run significantly faster.

CPU + GPU is a powerful combination because CPUs consist of a few cores optimized for serial processing, while GPUs consist of thousands of smaller, more efficient cores designed for parallel performance. Serial portions of the code run on the CPU while parallel portions run on the GPU.

NVIDIA’s list of GPU applications is here.

A bit about SGE

From our internal documentation,

Sun Grid Engine [formerly known as Computing in Distributed Networked Environments (CODINE) or Global Resource Director (GRD) and later known as the Oracle Grid Engine (OGE)] is an open source queuing system developed and supported by Sun Microsystems. In December 2011, Oracle officially passed the torch for maintaining the Grid Engine open source code base to the Open Grid Scheduler project. Open Grid Scheduler/Grid Engine is a commercially supported open source batch queuing system for distributed resource management. OGS/GE is based on Sun Grid Engine, and is maintained by the same group of external (i.e. non-Sun) developers who have been contributing code since 2001.

SGE is a highly scalable, flexible and reliable distributed resource manager (DRM). An SGE cluster consists of worker machines (compute nodes), a master machine (front end), and zero or more shadow master machines. The compute nodes run copies of the SGE execution daemon (sge_execd). The front end runs the SGE qmaster daemon. The shadow master machines run the SGE shadow daemon. Often, the number of slots in a compute node is equal to the number of CPU cores available: each core can run one job and, as such, represents one slot.

Once a job has been submitted to the queue (either using the command line or the graphical interface), it enters the pending state. During the next scheduling run, the qmaster ranks the job against the other pending jobs. The relative importance of a job is decided by the scheduling policies in effect. The most important pending jobs are scheduled to the available slots. When a job requires a resource that is currently unavailable, it remains in the pending state.

Once the job has been scheduled to a compute node, it is sent to the execution daemon on that compute node. sge_execd executes the command specified by the job, and the job enters the running state. It remains in the running state until it completes, fails, is terminated, or is re-queued. The job may also be suspended, resumed, and/or checkpointed (SGE does not natively checkpoint any job; it will, however, run a script/program to checkpoint a job, when available) any number of times.

After a job has completed or failed, sge_execd cleans up and notifies the qmaster. The qmaster records the job’s information and drops that job from its list of active jobs. SGE provides commands with which a job’s information can be retrieved from the accounting logs, and such information can be used to design computing policies.
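
For instance, the accounting data can be queried with qacct; the job id and user name below are placeholders rather than values from this cluster:

# Summary of resource usage for a single completed job
qacct -j <job_id>

# Aggregate usage for one user over the last 30 days
qacct -o <username> -d 30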

Installation & configuration

A Rocks 5.4.2 installation includes a fully working instance of SGE. Let us suppose that the cluster has 4 compute nodes, and that each compute node has 8 CPU cores along with 4 NVIDIA GPUs. Also, suppose that the compute nodes are named compute-0-0, compute-0-1, compute-0-2 and compute-0-3. It is further assumed that each of these nodes has a relevant, recent and stable version of the NVIDIA drivers (and CUDA Toolkit) installed. The following command

rocks run host compute 'hostname; nvidia-smi -L'

should list the hostname of each compute node followed by its 4 NVIDIA GPUs.

Schematic representation of a Rocks cluster

By default, SGE puts all nodes (and the CPU cores, or slots, therein) in one queue, all.q. Also, SGE is unaware of the GPUs in each node and, as such, has no way to schedule & monitor jobs on a per-GPU basis.

The task

  1. Make SGE aware of the available GPUs
  2. Set every GPU in every node to compute-exclusive mode
  3. Split all.q into two queues: cpu.q and gpu.q
  4. Make sure a job running in cpu.q does not access GPUs
  5. Make sure a job running in gpu.q uses only one CPU core and one GPU

Making SGE aware of available GPUs

  1. Dump the current complex configuration into a flat text file via the command qconf -sc > qconf_sc.txt
  2. Open the file, qconf_sc.txt, and add the following line at the very end
    gpu                    gpu                BOOL        ==    FORCED      NO         0        0
  3. Save and close the file.
  4. Update the complex configuration via the command, qconf -Mc qconf_sc.txt
  5. Check: qconf -sc | grep gpu should return the above line
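
Put together, the steps above can be run from the front end as follows; this is simply a transcript of the list, with the complex column order noted for reference:

# Dump the current complex configuration
qconf -sc > qconf_sc.txt

# Append the new 'gpu' complex
# (columns: name  shortcut  type  relop  requestable  consumable  default  urgency)
echo "gpu gpu BOOL == FORCED NO 0 0" >> qconf_sc.txt

# Load the modified configuration and verify
qconf -Mc qconf_sc.txt
qconf -sc | grep gpu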

Setting GPUs in compute exclusive mode

Run the following command:

rocks run host compute 'nvidia-smi -c 1'

The manual page for nvidia-smi indicates that this setting does not persist across reboots.
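
One way to re-apply the setting at boot time is to append the command to /etc/rc.local on every compute node; this is only one possibility (Rocks admins may prefer to add it to extend-compute.xml instead) and is not something the manual page prescribes:

# Re-apply compute-exclusive mode at every boot (run once, from the front end)
rocks run host compute 'echo "nvidia-smi -c 1" >> /etc/rc.local'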

Splitting all.q into cpu.q and gpu.q

By default, all 8 CPU cores from each node, for a total of 32 CPU cores, are part of all.q. The setup requires:

  1. disabling all.q
  2. assigning 4 CPU cores from each node, for a total of 16 CPU cores, to cpu.q
  3. assigning the remaining 4 CPU cores and 4 GPUs from each node, for a total of 16 CPU cores & 16 GPUs, to gpu.q; each CPU core in gpu.q will serve as host (or parent) to one GPU

Disabling all.q

Once the current all.q configuration has been saved via the command qconf -sq all.q > all.q.txt, the queue can be disabled using the command qmod -f -d all.q. The contents of all.q.txt should look something like the following:

qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich mpi orte
rerun                 FALSE
slots                 1,[compute-0-0.local=8], \
                      [compute-0-1.local=8], \
                      [compute-0-2.local=8], \
                      [compute-0-3.local=8]
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
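
For reference, the save and disable steps look like the following; the qstat check at the end is an addition to confirm that all.q instances now report the 'd' (disabled) state:

# Save the current configuration, then force-disable the queue
qconf -sq all.q > all.q.txt
qmod -f -d all.q

# The 'states' column for every all.q instance should now show 'd'
qstat -f -q all.q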

Creating cpu.q

Copy all.q.txt to cpu.q.txt and edit it so that it looks as follows:

qname                 cpu.q
hostlist              @allhosts
seq_no                10
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich mpi orte
rerun                 FALSE
slots                 1,[compute-0-0.local=4], \
                      [compute-0-1.local=4], \
                      [compute-0-2.local=4], \
                      [compute-0-3.local=4]
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

The command, qconf -Aq cpu.q.txt, should create the new queue cpu.q. One may run 16 single-processor jobs OR one 16-processor job (OR any plausible combination in between that brings the total to 16 slots) in this queue at any given time, and these jobs will not be able to access the GPUs. For testing purposes, one may use this Hello, World! program with the following submission script:

#! /bin/bash
# 
# Save this file as hello_world_cpu.sh and submit to the queue using the command
# qsub hello_world_cpu.sh
# 
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe mpich 8
#$ -q cpu.q
#
 
# This assumes that the PATH variable knows about the 'mpirun' command
mpirun -np $NSLOTS -machinefile $TMP/machines ./hello_world.x
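
A typical submission and status check then looks like this; the job should appear under cpu.q once it is scheduled:

qsub hello_world_cpu.sh
qstat -u $USER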

Creating gpu.q

Copy all.q.txt to gpu.q.txt and edit it so that it looks as follows:

qname                 gpu.q
hostlist              @allhosts
seq_no                20
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich mpi orte
rerun                 FALSE
slots                 1,[compute-0-0.local=4], \
                      [compute-0-1.local=4], \
                      [compute-0-2.local=4], \
                      [compute-0-3.local=4]
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        gpu=TRUE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

The command, qconf -Aq gpu.q.txt, should create the new queue gpu.q. One may run 16 single CPU+GPU jobs in this queue at any given time and, by design (please see the sample job submission script below), each job will use only one GPU. For testing purposes, one may use this Hello, World! program with the following submission script:

#! /bin/bash
# 
# Save this file as hello_world_gpu.sh and submit to the queue using the command
# qsub hello_world_gpu.sh
# 
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q gpu.q
#$ -hard -l gpu=1
#
 
./hello_world_cuda.x
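
Before (or after) submitting, the queue-level gpu complex can be verified as follows; qstat -F limits the resource listing to the named complex:

# gpu.q should advertise the gpu complex
qconf -sq gpu.q | grep complex_values

# Per queue instance view of the gpu resource
qstat -F gpu -q gpu.q

# Submit the GPU test job and watch its state
qsub hello_world_gpu.sh
qstat -u $USER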

Thanks be to

Rocks mailing list, Grid Engine mailing list and their participants.
