Disclaimer
The instructions/steps given below worked for me (and Michigan Technological University) running Rocks 5.4.2 (with CentOS 5.5 and SGE 6.2u5); as has been common practice for several years now, a full version of the operating system was installed. These instructions may very well work for you (or your institution) on Rocks-like or other Linux clusters. Please note that if you decide to use these instructions on your machine, you are doing so entirely at your own discretion, and that neither this site, sgowtham.com, nor its author (or Michigan Technological University) is responsible for any damage, intellectual or otherwise.
A bit about GPU computing
Citing NVIDIA,
GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications. Pioneered five years ago by NVIDIA, GPU computing has quickly become an industry standard, enjoyed by millions of users worldwide and adopted by virtually all computing vendors.
GPU computing offers unprecedented application performance by offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU. From a user’s perspective, applications simply run significantly faster.
CPU + GPU is a powerful combination because CPUs consist of a few cores optimized for serial processing, while GPUs consist of thousands of smaller, more efficient cores designed for parallel performance. Serial portions of the code run on the CPU while parallel portions run on the GPU.
NVIDIA’s list of GPU applications is here.
A bit about SGE
From our internal documentation,
Sun Grid Engine [formerly known as Computing in Distributed Networked Environments (CODINE) or Global Resource Director (GRD) and later known as the Oracle Grid Engine (OGE)] is an open source queuing system developed and supported by Sun Microsystems. In December 2011, Oracle officially passed on the torch for maintaining the Grid Engine open source code base to the Open Grid Scheduler project. Open Grid Scheduler/Grid Engine is a commercially supported open source batch queuing system for distributed resource management. OGS/GE is based on Sun Grid Engine, and maintained by the same group of external (i.e. non-Sun) developers who started contributing code since 2001.
SGE is a highly scalable, flexible and reliable distributed resource manager (DRM). An SGE cluster consists of worker machines (compute nodes), a master machine (front end), and zero or more shadow master machines. The compute nodes run copies of the SGE execution daemon (sge_execd). The front end runs the SGE qmaster daemon. The shadow front end machines run the SGE shadow daemon. Often, the number of slots in a compute node is equal to the number of CPU cores available. Each core can run one job and, as such, represents one slot.
Once a job has been submitted to the queue (either using the command line or the graphical interface), it enters the pending state. During the next scheduling run, the qmaster ranks the job against the other pending jobs. The relative importance of a job is decided by the scheduling policies in effect. The most important pending jobs will be scheduled to available slots. When a job requires a resource that is currently unavailable, it will remain in the waiting state.
Once the job has been scheduled to a compute node, it is sent to the execution daemon on that compute node. sge_execd executes the command specified by the job, and the job enters the running state. It will remain in the running state until it completes, fails, is terminated, or is re-queued. The job may also be suspended, resumed, and/or checkpointed (SGE does not natively checkpoint any job; it will, however, run a script/program to checkpoint a job, when available) any number of times.
After a job has completed or failed, sge_execd cleans up and notifies the qmaster. The qmaster records the job's information and drops that job from its list of active jobs. SGE provides commands with which a job's information can be retrieved from the accounting logs, and such information can be used to design computing policies.
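To make this lifecycle concrete, here is a minimal sketch of the commands involved (sleep_test.sh is a hypothetical job script used purely for illustration; qsub, qstat and qacct are the standard SGE commands for submission, status and accounting):

# Submit a simple job to the queue
qsub sleep_test.sh

# Watch the job move from pending (qw) to running (r)
qstat -u $USER

# After the job finishes, retrieve its record from the accounting logs
qacct -j <job_id>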
Installation & configuration
Rocks 5.4.2 installation includes a fully working instance of SGE. Let us suppose that the cluster has 4 compute nodes, and that each compute node has 8 CPU cores along with 4 NVIDIA GPUs. Also, suppose that the compute nodes are named compute-0-0, compute-0-1, compute-0-2 and compute-0-3. It is further assumed that each of these nodes has a relevant, recent and stable version of the NVIDIA drivers (and CUDA Toolkit) installed. The following command
rocks run host compute 'hostname; nvidia-smi -L'
should list 4 NVIDIA GPUs along with the hostname of each compute node.
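As an additional sanity check, one might count the GPUs reported by each node; a minimal sketch built on the same rocks and nvidia-smi commands used above (4 GPUs per node are expected in this example):

# Each compute node should report 4 GPUs
rocks run host compute 'echo "$(hostname): $(nvidia-smi -L | wc -l) GPUs"'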
By default, SGE puts all nodes (and the CPU cores, or slots, therein) in one queue – all.q. Also, SGE is unaware of the GPUs in each node and, as such, provides no way to schedule & monitor jobs on a GPU.
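One can confirm this on a freshly installed front end; a brief sketch using standard SGE commands (qconf -sql lists the cluster queues, and qconf -sc shows the complex configuration):

# Only all.q should be listed
qconf -sql

# No gpu complex exists yet, so this should return nothing
qconf -sc | grep gpu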
The task
- Make SGE aware of available GPUs
- Set every GPU in every node in compute exclusive mode
- Split all.q into two queues: cpu.q and gpu.q
- Make sure a job running on cpu.q does not access GPUs
- Make sure a job running on gpu.q uses only one CPU core and one GPU
Making SGE aware of available GPUs
- Dump the current complex configuration into a flat text file via the command qconf -sc > qconf_sc.txt
- Open the file, qconf_sc.txt, and add the following line at the very end (the meaning of each field is summarized after this list):
gpu gpu BOOL == FORCED NO 0 0
- Save and close the file.
- Update the complex configuration via the command qconf -Mc qconf_sc.txt
- Check: qconf -sc | grep gpu should return the above line
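For reference, the fields in the added line correspond to the column headers printed by qconf -sc:

# name  shortcut  type  relop  requestable  consumable  default  urgency
gpu     gpu       BOOL  ==     FORCED       NO          0        0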
Setting GPUs in compute exclusive mode
Run the following command:
rocks run host compute 'nvidia-smi -c 1'
The manual page for nvidia-smi indicates that this setting does not persist across reboots.
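If the setting needs to survive a reboot, one possible approach (a sketch, not part of the original setup, assuming nvidia-smi is installed as /usr/bin/nvidia-smi) is to reapply it at boot time from /etc/rc.local on each compute node:

# Append the compute-exclusive setting to /etc/rc.local on every compute node
rocks run host compute 'echo "/usr/bin/nvidia-smi -c 1" >> /etc/rc.local'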
Splitting all.q into cpu.q and gpu.q
By default, all 8 CPU cores from each node, for a total of 32 CPU cores, are part of all.q. The setup needs the following:
- disabling of all.q
- 4 CPU cores from each node, for a total of 16 CPU cores, will become part of cpu.q
- 4 CPU cores and 4 GPUs from each node, for a total of 16 CPU cores & 16 GPUs, will become part of gpu.q; also, each CPU core in gpu.q will serve as host (or parent) to one GPU
Disabling all.q
Once the current all.q configuration is saved via the command qconf -sq all.q > all.q.txt, it can be disabled using the command qmod -f -d all.q. The contents of all.q.txt should look something like the following (a quick way to confirm the disabled state is shown after the listing):
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich mpi orte
rerun                 FALSE
slots                 1,[compute-0-0.local=8], \
                      [compute-0-1.local=8], \
                      [compute-0-2.local=8], \
                      [compute-0-3.local=8]
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
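To confirm that all.q is indeed disabled, one might inspect its queue instances; a sketch using standard SGE commands (each instance should show 'd' in the states column):

# Each all.q instance should report the disabled (d) state
qstat -f -q all.q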
Creating cpu.q
Copy all.q.txt as cpu.q.txt and make it look as follows:
qname                 cpu.q
hostlist              @allhosts
seq_no                10
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich mpi orte
rerun                 FALSE
slots                 1,[compute-0-0.local=4], \
                      [compute-0-1.local=4], \
                      [compute-0-2.local=4], \
                      [compute-0-3.local=4]
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
The command qconf -Aq cpu.q.txt should create the new queue cpu.q. One may run 16 single-processor jobs OR one 16-processor job (OR any plausible combination in between that brings the total to 16 slots) in this queue at any given time, and these jobs will not be able to access GPUs. For testing purposes, one may use this Hello, World! program with the following submission script:
#! /bin/bash
#
# Save this file as hello_world_cpu.sh and submit to the queue using the command
# qsub hello_world_cpu.sh
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe mpich 8
#$ -q cpu.q
#
# This assumes that the PATH variable knows about 'mpirun' command
mpirun -n $NSLOTS -machinefile $TMP/machines hello_world.x
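One possible way to exercise this queue, assuming the MPI Hello, World! source is saved as hello_world.c and an MPI compiler wrapper is in the PATH (the file names here are illustrative):

# Build the MPI test binary and submit the job to cpu.q
mpicc -o hello_world.x hello_world.c
qsub hello_world_cpu.sh

# The job should be listed against cpu.q only
qstat -u $USER -q cpu.q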
Creating gpu.q
Copy all.q.txt as gpu.q.txt and make it look as follows:
qname                 gpu.q
hostlist              @allhosts
seq_no                20
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich mpi orte
rerun                 FALSE
slots                 1,[compute-0-0.local=4], \
                      [compute-0-1.local=4], \
                      [compute-0-2.local=4], \
                      [compute-0-3.local=4]
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        gpu=TRUE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
The command qconf -Aq gpu.q.txt should create the new queue gpu.q. One may run 16 single CPU+GPU jobs at any given time and, by design (please see the sample job submission script below), each job will use only one GPU. For testing purposes, one may use this Hello, World! program with the following submission script:
#! /bin/bash
#
# Save this file as hello_world_gpu.sh and submit to the queue using the command
# qsub hello_world_gpu.sh
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q gpu.q
#$ -hard -l gpu=1
#
./hello_world_cuda.x
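Similarly, one possible way to exercise gpu.q, assuming the CUDA Hello, World! source is saved as hello_world_cuda.cu and nvcc is in the PATH (the file names here are illustrative):

# Build the CUDA test binary and submit the job to gpu.q
nvcc -o hello_world_cuda.x hello_world_cuda.cu
qsub hello_world_gpu.sh

# The job should be listed against gpu.q only
qstat -u $USER -q gpu.q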
Thanks be to
Rocks mailing list, Grid Engine mailing list and their participants.