Matlab on HPC

Here is a short guide to start using Matlab on the HPC (High Performance Computer).

UPDATE: up-to-date list of clusters can be found here:

Make sure you have an HPC account

Go to https://account.vscentrum.be/ and upload your public RSA key (it should be in ~/.ssh). See SSH key if you have no RSA key.
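
Before uploading, it can help to confirm that a public key is actually present. The following is only a minimal sketch: it assumes the OpenSSH default file name id_rsa.pub, and the function name find_rsa_pubkey is made up for this example.

```shell
# Hypothetical helper: print the path of the RSA public key in a directory,
# or print a hint on how to create one. Assumes the default name id_rsa.pub.
find_rsa_pubkey() (
    dir="${1:-$HOME/.ssh}"
    if [ -f "$dir/id_rsa.pub" ]; then
        echo "$dir/id_rsa.pub"
    else
        echo "no RSA key in $dir; create one with: ssh-keygen -t rsa -b 4096" >&2
        exit 1
    fi
)
```

Calling find_rsa_pubkey without an argument checks ~/.ssh. The 4096-bit key size in the hint is a common choice, not something this guide prescribes.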

Wait until you get a confirmation e-mail. (more info @ http://hpc.ugent.be/userwiki/index.php/User:VscRequests)

Check if you can log in to the HPC

telin$ ssh {vsc_account_number}@login.hpc.ugent.be
hpc$ exit

Replace {vsc_account_number} with the account number you have been assigned, e.g. vsc40053@login.hpc.ugent.be.

Transfer your Matlab code to the HPC

telin$ cd {your_matlab_dir}
telin$ scp * {vsc_account_number}@login.hpc.ugent.be:

Compile your Matlab code

To compile on the HPC:

telin$ ssh {vsc_account_number}@login.hpc.ugent.be
hpc$ module load MATLAB/2018a
hpc$ mcc -m -R -nojvm -R -nodisplay -R -singleCompThread {matlab_file}
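
The **-R -singleCompThread** option is used because Matlab often performs very poorly with multithreading. It can even slow down your simulation, so double-check this parameter! Compile with the same MATLAB version that the job file loads, since the compiled program needs the matching runtime.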
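
A quick check after compiling can catch a failed build before a job is queued. This is a sketch under the assumption that mcc leaves an executable named after your Matlab file next to the run_{matlab_file}.sh wrapper that the job file calls; check_compiled is a hypothetical name.

```shell
# Hypothetical helper: verify that the two files the job needs exist and
# are executable. $1 is the Matlab file name without the .m extension.
check_compiled() (
    name="$1"
    for f in "$name" "run_$name.sh"; do
        if [ ! -x "$f" ]; then
            echo "missing or not executable: $f" >&2
            exit 1
        fi
    done
    echo "ok: $name and run_$name.sh are ready"
)
```

Run it in the directory where you compiled, for example check_compiled mysim for a file mysim.m.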

Make a job file

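You can copy and paste starting from the ‘cat’ command. The quotes around 'EOF' keep the shell from expanding $EBROOTMATLAB while the file is being written:

hpc$ cat >myjob.sh <<'EOF'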
#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l vmem=4gb
#PBS -m abe

module load MATLAB/2018a


./run_{matlab_file}.sh $EBROOTMATLAB > log.txt 2>&1
sleep 300
EOF

hpc$ chmod 755 myjob.sh

The job file requests only 1 node and 1 core. With the vmem parameter you specify the amount of virtual memory you want to reserve (4 GB in this case; if you don’t specify it, you get the maximum available). Standard output and standard error are merged into the file log.txt.
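
If you need several job files that differ only in program name, core count or memory, they can be generated. The sketch below mirrors the job file above; make_jobfile and its argument order are hypothetical, and MATLAB/2018a is simply the module version used in this guide.

```shell
# Hypothetical helper: write a job file for a compiled Matlab program.
# $1 = program name, $2 = cores (default 1), $3 = vmem (default 4gb),
# $4 = output file (default myjob.sh).
make_jobfile() (
    name="$1"; ppn="${2:-1}"; vmem="${3:-4gb}"; out="${4:-myjob.sh}"
    # Unquoted heredoc: $name, $ppn and $vmem are filled in now, while
    # \$EBROOTMATLAB is escaped so it is only expanded when the job runs.
    cat > "$out" <<EOF
#!/bin/sh
#PBS -l nodes=1:ppn=$ppn
#PBS -l vmem=$vmem
#PBS -m abe

module load MATLAB/2018a

./run_$name.sh \$EBROOTMATLAB > log.txt 2>&1
EOF
    chmod 755 "$out"
)
```

For example, make_jobfile mysim 1 2gb job2.sh writes an executable job2.sh that reserves 2 GB of virtual memory.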

Test-drive on the head node of the HPC and check with top

hpc$ ./myjob.sh & top

Kill it by pressing ‘k’ in top and entering your PID number; press ‘q’ to quit top.

Launch your Matlab job to the compute queue

hpc$ qsub --job_time=11:44:00 -s ./myjob.sh >myjob1.jobs
hpc$ cat myjob1.jobs

This command submits the job to the short compute queue with a maximum run time of 11 hours and 44 minutes, after which the job is stopped. You can set job_time to at most 71:44:00, which puts your job in the long queue.
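
A mistyped job_time is easy to miss, so a small check before submitting may be useful. This sketch only tests the HH:MM:SS form and the 71:44:00 maximum mentioned above; walltime_ok is a hypothetical name, and the scheduler may apply further rules.

```shell
# Hypothetical helper: succeed if $1 has the form HH:MM:SS and does not
# exceed 71:44:00.
walltime_ok() (
    case "$1" in
        [0-9][0-9]:[0-5][0-9]:[0-5][0-9]) ;;
        *) exit 1 ;;
    esac
    h=${1%%:*}
    rest=${1#*:}
    m=${rest%%:*}
    s=${rest#*:}
    # Strip one leading zero so values such as 08 are not read as octal.
    secs=$(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
    [ "$secs" -le $(( 71 * 3600 + 44 * 60 )) ]
)
```

It could guard the submission like this: walltime_ok 11:44:00 && qsub --job_time=11:44:00 -s ./myjob.sh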

Check your compute queue

hpc$ qstat                # list; Q = queued, R = running 
hpc$ qstat -n             # list and show the node it is running on; you can ssh to this node and check with top
hpc$ qdel {jobnumber}     # delete the job with the number you found with qstat
hpc$ qdel all             # delete all jobs
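
With many jobs, a one-line summary of the qstat listing is easier to read. The sketch below assumes the default Torque layout (two header lines, then one job per line with the state in the fifth column); count_job_states is a hypothetical name, and the layout on your cluster may differ.

```shell
# Hypothetical helper: summarise a qstat listing read from stdin.
# Skips the two header lines and counts the state column (Q or R).
count_job_states() {
    awk 'NR > 2 && NF >= 6 { n[$5]++ }
         END { printf "queued=%d running=%d\n", n["Q"], n["R"] }'
}
```

On the cluster this would be used as: qstat | count_job_states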

Switch cluster

When you log in you see a list of HPC clusters. To submit your jobs to another cluster, swap the cluster module before running the qsub command, e.g. to swalot:

hpc$ module swap cluster/swalot
hpc$ pbsmon

pbsmon shows you the nodes and the individual job submissions.

To ssh into a compute node, look up the node name with qstat -n and append the domain {clustername}.gent.vsc, e.g. ssh node2678.swalot.gent.vsc
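
The naming rule above can be wrapped in a tiny function so you do not have to type the domain each time. This is only a sketch; node_fqdn is a hypothetical name.

```shell
# Hypothetical helper: build the full host name of a compute node.
# $1 = short node name from qstat -n, $2 = cluster name.
node_fqdn() {
    printf '%s.%s.gent.vsc\n' "$1" "$2"
}
```

For example: ssh "$(node_fqdn node2678 swalot)"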

Some tips

  • tip 1: Check http://hpc.ugent.be/clusterstate/ to see which clusters have free nodes!
  • tip 2: If you add sleep 300 at the end of the job file, you can still log in to the node for 5 minutes and find out why Matlab crashed!
  • tip 3: I noticed phanpy, golett and swalot are the fastest:

    swalot: 2 x 10-core Intel E5-2660v3 (Haswell-EP @ 2.6 GHz)
    phanpy and golett: 2 x 12-core Intel E5-2680v3 (Haswell-EP @ 2.5 GHz)
    raichu and delcatty: 2 x 8-core Intel E5-2670 (Sandy Bridge @ 2.6 GHz)
    
  • tip 4: Disk usage is limited; check how much you have left with the show_quota command!

Shorten the runtime

If your job takes longer than 72 hours, you will have to adapt your code to save your variables at regular intervals. At the beginning of your program, check whether saved output exists and, if so, load the variables from the previous run. This way you can extend the total duration to several months if needed, e.g.

if(exist('myjob.sh-ended','file')>0)
 disp('myjob.sh has ended, use "rm myjob.sh-ended" to restart');
 quit
end

MAX=1000000;
i=1;
if(exist('result.mat','file')>0)
 load('result.mat');   % restores i and the other saved variables
 i=i+1;                % continue with the iteration after the last saved one
end
for n=i:MAX
 docalculations;
 i=n;                  % record the last completed iteration before saving
 % .. stands for the remaining variables of your simulation
 save('result.mat','i','variable1','variable2',..);
end

FileID = fopen('myjob.sh-ended','w');   % create the marker file that signals the end
fclose(FileID);
quit

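You will get an email when your simulation has ended, so you can restart it manually. You can, however, automate this restart. The first step creates the restart script ./restartsim in your home directory and makes it executable; the second step activates the cron job. The quotes around 'EOF' are needed so that $1, $jobs and the other variables are written to the script literally instead of being expanded while you paste:

hpc$ cat >restartsim <<'EOF'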
#!/bin/bash
jobs="myjob.sh" #you can have more than 1 job
cluster=$1

. /etc/profile.d/vsc.sh
. /etc/profile.d/modules.sh
module swap cluster/$cluster

for i in $jobs;do
qstat|grep -q "$i" 
[ $? -ne 0 ] && [ ! -f "$i-ended" ] && qsub $i #restart the job if not ended!
done
EOF

hpc$ chmod +x restartsim               #make it executable
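
The restart decision can be tried out without touching the queue by testing it against a saved listing. The sketch below restates the same rule (not listed by qstat and no -ended marker) as a function; needs_restart and the idea of saving the listing to a file are assumptions made for this example.

```shell
# Hypothetical helper: succeed if a job should be resubmitted.
# $1 = job file name, $2 = file holding a saved qstat listing.
needs_restart() {
    # Restart only if the job is not in the listing and no marker exists.
    ! grep -qF -- "$1" "$2" && [ ! -f "$1-ended" ]
}
```

A dry run could look like this: qstat > listing.txt; needs_restart myjob.sh listing.txt && echo "would resubmit myjob.sh"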

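The restartsim script takes the cluster name as its first argument and resubmits myjob.sh if it is neither listed by qstat nor marked as ended. The following crontab entry runs it every 15 minutes for the cluster swalot. Note that crontab used this way replaces any crontab you already have: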

hpc$ crontab <<EOF
*/15 * * * * ./restartsim swalot >& cron-swalot.log
EOF

You can stop the automatic restart by putting a hash sign (#) in front of the line, for example with the nano editor:

hpc$ EDITOR=nano crontab -e

The line should then look like this:

# */15 * * * * ./restartsim swalot >& cron-swalot.log
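
The same change can be made without an editor by filtering the crontab. This is a sketch only: disable_restartsim is a hypothetical name, and it comments out every active line that mentions restartsim.

```shell
# Hypothetical filter: comment out each active crontab line that calls
# restartsim; lines that are already commented are left unchanged.
disable_restartsim() {
    sed '/^[[:space:]]*#/!s|^\(.*restartsim.*\)$|# \1|'
}
```

It would be applied as: crontab -l | disable_restartsim | crontab -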

HPC checkpointing

A word about HPC checkpointing (the csub command): the Matlab save command causes problems after resuming from a checkpoint. When a checkpoint returns your program to a point before output that was already saved, save writes to the same file a second time, which crashes your Matlab simulation and clears all checkpoints. You can use the dlmwrite function instead; it writes ASCII text, for example:

a=rand(10,10);
dlmwrite('Result_a',a,'precision',16);

This writes out a 2D matrix with 16 significant digits. You can read it back with dlmread:

b=dlmread('Result_a');

dlmwrite writes a 3D matrix as 2D, with the pages appended horizontally, so you will have to use ranges in dlmread to read them back (see help dlmread).

More info @ http://hpc.ugent.be/userwiki/index.php/Main_Page & http://hpc.ugent.be/userwiki/index.php/User:Checkpointing