GPU jobs

Here is a short guide to scheduling GPU jobs on the HPC (High Performance Computing) cluster.

As the joltik cluster is still in its pilot phase, this information may change; I will update this page as soon as I know about any changes. A complete guide to the HPC can be found at https://hpcugent.github.io/vsc_user_docs/pdf/intro-HPC-linux-gent.pdf

Make sure you have an HPC account

Go to https://account.vscentrum.be/ and upload your public RSA key (it should be in ~/.ssh). See SSH key if you do not have an RSA SSH key yet.
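
If you do not have a key pair yet, you can generate one on your own machine first (a minimal example; accept the default location so the key ends up in ~/.ssh/id_rsa and choose a passphrase):

telin$ ssh-keygen -t rsa -b 4096
telin$ cat ~/.ssh/id_rsa.pub

The contents of the .pub file is what you upload to the VSC account page; keep the private key (the file without .pub) to yourself.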

Wait until you get a confirmation e-mail. (more info @ https://hpc.ugent.be/userwiki/index.php/User:VscRequests)

Check if you can log in to the HPC

telin$ ssh {vsc_account_number}@login.hpc.ugent.be
hpc$ exit

Replace {vsc_account_number} with the account number you have been assigned, e.g. vsc40053@login.hpc.ugent.be.
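
To avoid typing the full address every time, you can add an entry to your SSH config (a sketch; the host alias hpcugent and the account number vsc40053 are just examples):

telin$ cat >>~/.ssh/config <<EOF
Host hpcugent
    HostName login.hpc.ugent.be
    User vsc40053
    IdentityFile ~/.ssh/id_rsa
EOF
telin$ ssh hpcugent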

Transfer your code to the HPC

telin$ scp -r {name_of_directory} {vsc_account_number}@login.hpc.ugent.be:
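
If you transfer the same directory more than once, rsync (assuming it is available on both machines) only copies the files that changed:

telin$ rsync -av {name_of_directory} {vsc_account_number}@login.hpc.ugent.be: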

Check if any GPUs are available on the cluster

telin$ ssh {vsc_account_number}@login.hpc.ugent.be
hpc$ module swap cluster/joltik
hpc$ pbsnodes |grep -B1 "state ="
node3300.joltik.os
    state = job-exclusive
--
node3301.joltik.os
    state = job-exclusive
--
node3302.joltik.os
    state = mixed
--
node3303.joltik.os
    state = completing
--
node3304.joltik.os
    state = job-exclusive
--
node3305.joltik.os
    state = draining
--
node3306.joltik.os
    state = draining
--
node3307.joltik.os
    state = free
--
node3308.joltik.os
    state = free
--
node3309.joltik.os
    state = free

As you can see, some nodes are fully occupied (job-exclusive), some still have GPUs free (mixed), some are finishing jobs or being prepared for maintenance (completing, draining), and some are free. If there is not a single node that is free or mixed, you cannot start your GPU job right away: it will be queued until a node becomes available again.
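
To quickly check whether there is any free or mixed node at all, you can filter the pbsnodes output, for example:

hpc$ pbsnodes | grep -B1 -E "state = (free|mixed)"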

Make a job file

In this example we will test the GPU with a Python PyTorch script.

hpc$ cat >py.py <<EOF
import torch

# Check that PyTorch can see the GPU
print('Is CUDA available?:', torch.cuda.is_available())

# Create a random tensor on the GPU and do a simple calculation with it
a = torch.rand(3, 3).cuda()
print('Tensor a:', a)
print('Calculated tensor a*a:', a*a)
EOF

And create the job file:

hpc$ cat >Python_batch.sh <<EOF
#!/bin/bash
#PBS -l walltime=71:40:0
#PBS -l vmem=30gb
#PBS -m abe
module swap cluster/joltik
module load CUDA/10.1.243
module load PyTorch/1.2.0-fosscuda-2019.08-Python-3.7.2
python py.py
EOF
hpc$ chmod +x Python_batch.sh

In the header we set the maximum running time to 71 hours and 40 minutes, request a maximum of 30 GB of (virtual) memory, and with -m abe ask the scheduler to send an e-mail when the job begins, ends or aborts. The standard output and errors of the job are written to files (see the last section). The .sh file should be executable.

Put it in the batch scheduler

hpc$ qsub -l nodes=1:gpus=1 ./Python_batch.sh

We request 1 node and 1 GPU to run our job. Make sure you have swapped to cluster/joltik in this session (as shown above), so the job is submitted to the GPU cluster.
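
Alternatively, you can put the same resource request in the header of Python_batch.sh by adding the line

#PBS -l nodes=1:gpus=1

below the other #PBS lines; a plain qsub ./Python_batch.sh then submits the same job.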

Check the queue

hpc$ watch qstat -n

You can see the status of your job (Q = queued, R = running, C = completed) and on which node it is running. You can even log in to the node your job is running on with ssh (from within the HPC login node).
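
For example, if qstat -n shows your job running on node3307, you can log in to that node and check the GPU usage (the node name is just an example; nvidia-smi should be available on the GPU nodes):

hpc$ ssh node3307.joltik.os
node3307$ nvidia-smi
node3307$ exit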

Check your results

The standard output of your job is saved in a file ending in .o[job_number], and any errors from running your batch job in a file ending in .e[job_number].
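
By default these files are named after the job script, so for a hypothetical job number 1234567 you would check them with:

hpc$ cat Python_batch.sh.o1234567
hpc$ cat Python_batch.sh.e1234567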