GPU jobs

Here is a short guide to scheduling GPU jobs on the HPC (High Performance Computing) cluster.

At the time of writing, the three available GPU clusters are joltik, accelgor and litleo. This page will be updated as soon as new clusters become available.

The status of running and queued GPU jobs can be followed here.

VSC Account

Make sure you have followed the vsc-account guide, then log in to the HPC login node:

ssh {vsc_account_number}@login.hpc.ugent.be
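Optionally, you can add a host alias to your ~/.ssh/config so that a plain ssh hpcugent suffices (the alias hpcugent and the account vsc40000 below are just examples, replace them with your own):

```
Host hpcugent
    HostName login.hpc.ugent.be
    User vsc40000
```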

Check whether any GPUs are available on the cluster:

(module swap cluster/joltik; pbsmon)
# or
(module swap cluster/accelgor; pbsmon)
# or
(module swap cluster/litleo; pbsmon)

This produces output similar to:

 3300 3301 3302 3303 3304
    J    J    X    J    j

 3305 3306 3307 3308 3309
    R    J    R    J    _

   _ free                 : 1   |   X down                 : 1   |
   j partial              : 1   |   x down_on_error        : 0   |
   J full                 : 5   |   m maintenance          : 0   |
                                |   . offline              : 0   |
                                |   o other (R, *, ...)    : 2   |

As you can see, some nodes are occupied (J = full), some still have free GPUs (j = partial), some are down or going into maintenance (X = down), and some are free (_). If no node is free or partial, you cannot start working interactively right away: your job will be queued until a node becomes available again as other jobs drain.

Pytorch example

Your $HOME directory has very limited space, so it is best to use uv with a venv in another location. If you have not done so already, install uv first:

curl -LsSf https://astral.sh/uv/install.sh | sh

Make sure ~/.local/bin is on your PATH:

export PATH=~/.local/bin:$PATH
export UV_LINK_MODE=copy

Add these lines to your .bashrc as well! Then create the uv environment:

cd $VSC_DATA
uv venv venv --python 3.12 --seed
source venv/bin/activate
uv pip install torch torchvision torchaudio --torch-backend=cu126

This setup is needed only once; it downloads the necessary CUDA libraries and installs everything in your $VSC_DATA folder, where you should have enough quota. Another possible location is $VSC_SCRATCH_KYUKON. If your promotor asked for extra VO data storage, you also have $VSC_DATA_VO and $VSC_SCRATCH_KYUKON_VO. Any of these locations is suitable for installing your uv venv.
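As a sketch of picking an install location, the following hypothetical helper selects the first of these storage variables that is set in the environment, falling back to $HOME (the preference order is an illustration, not an official recommendation):

```python
import os

# VSC-specific storage variables, most spacious candidates first
# (this fallback logic is only an illustration)
candidates = ["VSC_DATA_VO", "VSC_SCRATCH_KYUKON_VO", "VSC_DATA", "VSC_SCRATCH_KYUKON"]
base = next((os.environ[v] for v in candidates if v in os.environ),
            os.path.expanduser("~"))
print("venv location:", os.path.join(base, "venv"))
```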

Another problem that can arise is the .cache directory. It is normally located in your $HOME directory, which has very limited space in the HPC setup. You can symlink your cache directory to another location to fix this: https://telin.ugent.be/telin-docs/linux/hpc/vsc-account/#quota. There is also a variable you can set before creating the uv environment, e.g.

export UV_CACHE_DIR=$VSC_DATA/.uv_cache

Add this to your .bashrc if you did not symlink your cache directory!
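How the cache location is resolved can be sketched as follows (assuming uv's documented Linux default of ~/.cache/uv when UV_CACHE_DIR is unset):

```python
import os

# UV_CACHE_DIR wins; otherwise uv falls back to ~/.cache/uv on Linux
cache_dir = os.environ.get("UV_CACHE_DIR", os.path.expanduser("~/.cache/uv"))
print("uv cache:", cache_dir)
```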

Check your available storage here: https://account.vscentrum.be/.

Make a job file

In this example we will test the GPU with the following Python script:

cat >python_example.py <<EOF
import torch

# Check that PyTorch can see the GPU
print('Is CUDA available? :', torch.cuda.is_available())

# Create a random tensor on the GPU and do a simple computation
a = torch.rand(3, 3).cuda()
print('Tensor a:', a)
print('Calculated tensor a*a:', a * a)
EOF

And create the job file:

cat >python_example.sh <<'EOF'
#!/bin/bash
#PBS -l walltime=00:10:00
#PBS -l mem=8gb
#PBS -l nodes=1:ppn=1:gpus=1
#PBS -m abe

module load CUDA/12.6.0
source $VSC_DATA/venv/bin/activate
# the activated venv provides python and the installed torch packages
python python_example.py
EOF
chmod +x python_example.sh

The header sets a maximum walltime of 10 minutes, a maximum of 8 GB of RAM, 1 node, 1 core (ppn) and 1 GPU; -m abe sends a mail when the job begins, aborts or ends. The results are written to python_example.sh.o… and errors to python_example.sh.e… . The .sh file must be executable.
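The PBS output-file naming can be sketched as follows (the job id 1234567 is a made-up example; the scheduler assigns the real one at submission):

```python
job_script = "python_example.sh"
job_id = "1234567"  # hypothetical job id assigned by qsub

stdout_file = f"{job_script}.o{job_id}"  # standard output of the job
stderr_file = f"{job_script}.e{job_id}"  # standard error of the job
print(stdout_file)
print(stderr_file)
```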

Submit it to the batch scheduler

module swap cluster/joltik
qsub ./python_example.sh

We use 1 node and 1 GPU to run our job. Swap to the right cluster module before you submit!

Check the queue

watch qstat -n

You can see the status of your job (Q = queued, R = running, C = completed, F = failed) and on which node it is running. You can even log in to your running node with ssh (from within the HPC login node).

Check your results

The output of your batch job will be saved in a file ending with .o[job_number], and errors (if any) in a file ending with .e[job_number].