Here is a short guide to scheduling GPU jobs on the HPC (High Performance Computing) infrastructure.
At the time of writing, the three available GPU clusters are joltik, accelgor and litleo. I will update this list as soon as I know of new clusters.
The status of the running and queued GPU jobs can be followed here
Make sure you followed the vsc-account setup, then log in to the HPC login node:
ssh {vsc_account_number}@login.hpc.ugent.be
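Optionally, an SSH config entry lets you log in with a plain `ssh hpc` (a sketch; the `hpc` alias and the vsc4xxxx account number are placeholders for your own values):

```shell
# Optional convenience: SSH config entry ('hpc' and vsc4xxxx are placeholders)
mkdir -p ~/.ssh && chmod 700 ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host hpc
    HostName login.hpc.ugent.be
    User vsc4xxxx
EOF
```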
(module swap cluster/joltik; pbsmon)
#or
(module swap cluster/accelgor; pbsmon)
#or
(module swap cluster/litleo; pbsmon)
This results in output similar to:
3300 3301 3302 3303 3304
J J X J j
3305 3306 3307 3308 3309
R J R J _
_ free : 1 | X down : 1 |
j partial : 1 | x down_on_error : 0 |
J full : 5 | m maintenance : 0 |
| . offline : 0 |
| o other (R, *, ...) : 2 |
As you can see, some nodes are fully occupied (J = full), some still have some GPUs free (j = partial), some are down or heading into maintenance mode (X = down) and some are completely free (_ = free). If not a single node is free or partial, you will have to wait and you cannot work interactively right away: your job will be queued until a node becomes available again and the other jobs have drained.
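To check all three clusters in one go, a small loop like the following can help (a sketch, assuming the cluster module names shown above; it only works on the HPC login node where `module` and `pbsmon` exist):

```shell
# Sketch: print pbsmon output for each GPU cluster in turn
for c in joltik accelgor litleo; do
    echo "== $c =="
    (module swap "cluster/$c"; pbsmon)
done
```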
Your $HOME directory is very limited in space, so it is best to use uv with a venv. If you have not done this already, install uv first:
curl -LsSf https://astral.sh/uv/install.sh | sh
Make sure the uv install location is in your search PATH:
export PATH=~/.local/bin:$PATH
export UV_LINK_MODE=copy
Add these lines to your .bashrc as well! Then create the uv environment:
cd $VSC_DATA
uv venv venv --python 3.12 --seed
source venv/bin/activate
uv pip install torch torchvision torchaudio --torch-backend=cu126
This setup is only needed once: it downloads the necessary CUDA libraries and installs everything in your $VSC_DATA folder, where you should have enough quota. Another option is your $VSC_SCRATCH_KYUKON. If your promotor asked for extra VO data storage, you also have a $VSC_DATA_VO and a $VSC_SCRATCH_KYUKON_VO. All of these are locations where you can install your uv venv.
Another problem that can arise is the .cache directory. It is normally located in your $HOME directory, which has very limited space in the HPC setup. You can symlink your cache directory to another location to fix this problem: https://telin.ugent.be/telin-docs/linux/hpc/vsc-account/#quota. There is also a variable you can set before installing the uv environment, e.g.
export UV_CACHE_DIR=$VSC_DATA/.uv_cache
Add this to your .bashrc if you did not symlink your cache directory!
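Putting it all together, the uv-related settings can be kept in one small file that .bashrc sources (a sketch; the ~/.uv_env filename is my own choice, not a convention):

```shell
# Sketch: collect the uv-related settings in ~/.uv_env (hypothetical filename)
cat > ~/.uv_env <<'EOF'
export PATH=~/.local/bin:$PATH
export UV_LINK_MODE=copy
export UV_CACHE_DIR=$VSC_DATA/.uv_cache
EOF
# source it from .bashrc, but only add the line once
grep -q 'source ~/.uv_env' ~/.bashrc 2>/dev/null || echo 'source ~/.uv_env' >> ~/.bashrc
```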
Check your available storage here: https://account.vscentrum.be/.
In this example we will test the GPU with the following Python script:
cat >python_example.py <<EOF
import torch
print('Is Cuda Available? :',torch.cuda.is_available())
a=torch.rand(3,3).cuda()
print('Tensor a:', a)
print('Calculated Tensor a*a: ', a*a)
EOF
And create the job file:
cat >python_example.sh <<'EOF'
#!/bin/bash
#PBS -l walltime=00:10:00
#PBS -l mem=8gb
#PBS -l nodes=1:ppn=1:gpus=1
#PBS -m abe
module load CUDA/12.6.0
# the job starts in $HOME, so change to the directory you submitted from
cd $PBS_O_WORKDIR
source $VSC_DATA/venv/bin/activate
uv run python_example.py
EOF
chmod +x python_example.sh
In the header we set the maximum running time to 10 minutes and request a maximum of 8 GB of RAM, 1 node, 1 core (ppn) and 1 GPU; -m abe sends mail on job begin, end and abort. The results will be written to python_example.sh.o… and errors to python_example.sh.e… . The .sh file should be executable.
module swap cluster/joltik
qsub ./python_example.sh
We use 1 node and 1 GPU to run our job. Swap to the right cluster module before you submit to the queue!
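If you prefer to work interactively (as noted above, this only starts promptly when a node is free or partial), you can request an interactive session with the same resource flags, e.g.:

```shell
# Sketch: interactive session with the same resources as the batch job
module swap cluster/joltik
qsub -I -l walltime=00:10:00 -l mem=8gb -l nodes=1:ppn=1:gpus=1
```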
watch qstat -n
You can see the status of your job (Q = queued, R = running, C = completed, F = failed) and on which node it is running. You can even ssh into the node running your job (from within the HPC login node).
The output of your batch job will be saved in a file ending in .o[job_number], and any errors in a file ending in .e[job_number].