GPU jobs

Here is a short guide to scheduling GPU jobs on the HPC (High Performance Computing) infrastructure.

At the time of writing, the two available GPU clusters are joltik and accelgor. I will update this guide as soon as I know of new clusters.

VSC Account

Make sure you have followed the steps in vsc-account, then log in to the HPC login node:

ssh {vsc_account_number}

Check whether any GPUs are available on the cluster

(module swap cluster/joltik; pbsmon)
(module swap cluster/accelgor; pbsmon)

The output looks similar to:

 3300 3301 3302 3303 3304
    J    J    X    J    j

 3305 3306 3307 3308 3309
    R    J    R    J    _

   _ free                 : 1   |   X down                 : 1   |
   j partial              : 1   |   x down_on_error        : 0   |
   J full                 : 5   |   m maintenance          : 0   |
                                |   . offline              : 0   |
                                |   o other (R, *, ...)    : 2   |

As you can see, some nodes are fully occupied (J = full), some still have GPUs free (j = partial), some are down because they are starting up or going into maintenance mode (X = down), and some are completely free (_ = free). If no node is free or partial, you cannot work interactively right away: your job is queued until other jobs have finished and a node becomes available again.
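
The legend above can be turned into a quick availability check. The following is a minimal Python sketch and not part of pbsmon itself: it assumes the output has the layout shown above (a row of node numbers followed by a row of one-letter states), and the helper name `usable_nodes` is hypothetical.

```python
def usable_nodes(pbsmon_output):
    """Return {node number: state letter} for nodes that are free (_) or partial (j)."""
    usable = {}
    lines = pbsmon_output.splitlines()
    # Look at every pair of consecutive lines: node numbers, then their states.
    for ids_line, states_line in zip(lines, lines[1:]):
        ids = ids_line.split()
        states = states_line.split()
        # A node row consists only of numbers ...
        if not ids or not all(token.isdigit() for token in ids):
            continue
        # ... and is followed by an equally long row of one-letter states.
        if len(states) != len(ids) or not all(len(s) == 1 for s in states):
            continue
        for node, state in zip(ids, states):
            if state in ("_", "j"):
                usable[node] = state
    return usable


# Sample copied from the pbsmon output shown above.
sample = """\
 3300 3301 3302 3303 3304
    J    J    X    J    j

 3305 3306 3307 3308 3309
    R    J    R    J    _
"""
print(usable_nodes(sample))
```

For the sample above this reports node 3304 as partial and node 3309 as free, so a job requesting one GPU could start without waiting.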

PyTorch example

First we configure pip so that packages are not installed into the default $HOME/.local folder:

module load Python/3.10.8-GCCcore-12.2.0
mkdir -p $PP
pip config set $PP
echo "export PYTHONPATH=$PP" >> ~/.bashrc

This only needs to be set up once; the export line in your .bashrc file sets PYTHONPATH again at every subsequent login!

We loaded the Python/3.10.8-GCCcore-12.2.0 module. This is the same module that is used by Jupyter Notebook 7.0.3 GCCcore 12.2.0, explained here. Now we can install the latest PyTorch:

pip3 install torch torchvision torchaudio

This downloads the CUDA libraries that PyTorch needs and installs everything in your scratch folder, where you have enough quota!
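
To confirm that the packages really ended up in the folder you configured, you can ask Python where it would import them from. This is only a small sketch using the standard library; the helper name `package_location` is hypothetical, and it returns None when a package cannot be found.

```python
import importlib.util
import os


def package_location(name):
    """Return the folder a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    # Namespace packages have no single origin file, so treat them as "not found".
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


# find_spec only locates the package; it does not import torch itself.
print("torch found in:", package_location("torch"))
print("PYTHONPATH    :", os.environ.get("PYTHONPATH"))
```

If the first line does not point inside the folder listed in PYTHONPATH, the pip configuration was probably not picked up.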

Make a job file

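In this example we will test the GPU with the following Python script:

cat > <<EOF
import torch
print('Is Cuda Available? :',torch.cuda.is_available())
# Assumption: the original definition of a is missing here; any small tensor works for the test.
a = torch.tensor([1.0, 2.0, 3.0], device='cuda' if torch.cuda.is_available() else 'cpu')
print('Tensor a:', a)
print('Calculated Tensor a*a: ', a*a)
EOF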

And create the job file:

cat > <<EOF
#PBS -l walltime=00:10:00
#PBS -l mem=8gb
#PBS -l nodes=1:ppn=1:gpus=1
#PBS -m abe

module load Python/3.10.8-GCCcore-12.2.0
chmod +x

The header sets a maximum running time of 10 minutes, a maximum of 8 GB of RAM, 1 node, 1 core (ppn) and 1 GPU, and the job will output the results in… and errors in… . The -m abe option requests an email when the job aborts, begins and ends. The .sh file should be executable.
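
If you submit many similar jobs, it can help to generate the header from parameters instead of editing it by hand. The following is a rough Python sketch under stated assumptions: the function name `pbs_job_script` and the command `python test_gpu.py` are hypothetical placeholders, and only the #PBS options already used above are produced.

```python
def pbs_job_script(walltime, mem, nodes, ppn, gpus, modules, command, mail_events="abe"):
    """Build the text of a PBS job file using the same directives as the example above."""
    lines = [
        f"#PBS -l walltime={walltime}",                  # maximum running time (HH:MM:SS)
        f"#PBS -l mem={mem}",                            # maximum amount of RAM
        f"#PBS -l nodes={nodes}:ppn={ppn}:gpus={gpus}",  # nodes, cores per node, GPUs
        f"#PBS -m {mail_events}",                        # mail on abort, begin, end
        "",
    ]
    lines += [f"module load {module}" for module in modules]
    lines.append(command)
    return "\n".join(lines) + "\n"


script = pbs_job_script(
    walltime="00:10:00",
    mem="8gb",
    nodes=1,
    ppn=1,
    gpus=1,
    modules=["Python/3.10.8-GCCcore-12.2.0"],
    command="python test_gpu.py",  # hypothetical script name
)
print(script)
```

The returned string can be written to a .sh file of your choice before submitting it.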

Put it in the batch scheduler

module swap cluster/joltik
qsub ./

Our job uses 1 node and 1 GPU. Make sure the right cluster module is loaded before you submit your job!

Check the queue

watch qstat -n

You can see the status of your job (Q = queued, R = running, C = completed) and the node it is running on. You can even log in to the node running your job with ssh (from within the HPC login node).

Check your results

The output is saved in a file ending with .o[job_number]. If running your batch job produced any errors, they are saved in a file ending with .e[job_number].
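
When a folder fills up with results from several jobs, the naming pattern above makes it easy to pick out the files of one job. Below is a minimal Python sketch; the helper name `job_output_files`, the file names and the job number 1234 are made up for the demonstration, which runs against empty stand-in files in a temporary folder.

```python
import tempfile
from pathlib import Path


def job_output_files(directory, job_number):
    """Find the files ending in .o[job_number] and .e[job_number] in a folder."""
    folder = Path(directory)
    return {
        "output": sorted(folder.glob(f"*.o{job_number}")),
        "errors": sorted(folder.glob(f"*.e{job_number}")),
    }


# Demo with stand-in files; on the cluster you would pass the folder you
# submitted from and the job number reported by qsub.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("job.sh.o1234", "job.sh.e1234", "other.sh.o9999"):
        Path(tmp, name).write_text("")
    found = job_output_files(tmp, "1234")
    print("output:", [path.name for path in found["output"]])
    print("errors:", [path.name for path in found["errors"]])
```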