Simulations

The TELIN sysadmins have made some changes to the old simulation system and have now installed an LXD cluster.

Features

  • CPU and GPU nodes
  • browser-based graphical interface to individual nodes
  • user creation of custom (simulation) machines through Linux containers or virtual machines
  • external services with their own UGent DNS domain name or a path in the cluster domain name
  • upload/download through SFTP, SMB client and S3 storage

Infrastructure

The LXD ([lɛks'di:]) cluster is organized behind a head node called queenbee. The worker nodes are organized in 3 sets: b100..b107 for CPU-only nodes, and b400..b407 and b800..b806 for CPU/GPU nodes. IPI users are better off with the b8xx range because those nodes have more disk space for large datasets.

The nodes do not all have the same specs: we have nodes of the 4th, 7th and 10th Intel generation. Most GPU nodes are Pascal based (GTX 1xxx), a few are Turing (RTX 2xxx) and some are Ampere (RTX 3xxx), so the cluster cannot be compared to an HPC cluster. For serious calculations you should consider the UGent HPC cluster if you have the possibility.

Connect

Password

To access the cluster, ask the TELIN system administrator for a personal password.

browser

Point your browser to this address: https://queenbee.ugent.be/novnc/ (notice the ending slash)

You will see a list of all individual worker nodes. Connect and log in with the password your sysadmin has given you. A node can only have one graphical connection at a time, so if a node is shown in red it is already in use. All nodes run Ubuntu 22.04. Some commercial apps like Matlab and Maple are then available graphically. On the left you will notice a noVNC side panel where you can disconnect when finished. DO NOT SHUTDOWN OR SUSPEND any machine. The side panel also provides a clipboard to copy text between the node and your local machine.

It can happen that the connection hangs and only starts after you press the back button in your browser; a second attempt will then allow the connection.

ssh

Directly using port numbers

This works from within the TELIN LAN, or over the WireGuard VPN or UGent VPN.

Add the cluster member number to 10000, e.g. to access b403 the port is 10000+403=10403. Open a terminal on Windows, macOS or Linux and you can access the machine directly:

$ ssh -p 10403 lab@queenbee
lab@b403:~$ 

Using headnode queenbee

You can use ssh to connect to the head node queenbee if you don’t have access to VPN.

  • use 8822 as the connection port:
$ ssh -p 8822 lab@queenbee.ugent.be
  • once connected you can ssh to the individual nodes, e.g. b403:
$ ssh b403
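
If you prefer a single command from your own machine instead of two hops, recent OpenSSH versions can jump through the head node with the -J option (a sketch, with b403 as an example target):

$ ssh -J lab@queenbee.ugent.be:8822 lab@b403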

Storage

Network-based storage tends to slow down simulations if not used with care, hence we have chosen not to implement it. Create a folder named after your UGent account in the home folder (where you can find the Desktop and Documents folders) or in the /scratch folder. The /scratch folder is meant for downloaded datasets; if you delete your files there, they are gone permanently. Your own programs should live in the home folder, which is snapshotted, so they can be restored (for a while) if you accidentally remove them.
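
For example, if your UGent account were jdoe (just a placeholder), you could create your work folders like this:

$ mkdir -p ~/jdoe            # programs, snapshotted
$ mkdir -p /scratch/jdoe     # datasets, not backed up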

Directly using port numbers from your computer

This again works from within the TELIN LAN, or over the WireGuard VPN or UGent VPN.

Add the cluster member number to 10000, e.g. to access b403 the port is 10000+403=10403. Open a terminal on Windows, macOS or Linux and you can access the machine directly, e.g. to upload a file called “example.zip” with put:

$ sftp -P 10403 lab@queenbee
sftp> put example.zip
sftp> quit

Getting results is possible with get:

$ sftp -P 10403 lab@queenbee
sftp> get result.zip
sftp> quit

Use cd to change to the desired directory before you issue get or put.
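
For example, to upload into a subfolder of the home folder on the node (the folder name is only an illustration):

$ sftp -P 10403 lab@queenbee
sftp> cd jdoe/experiments
sftp> put example.zip
sftp> quit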

Alternatively, transfer your files with WinSCP on Windows, Cyberduck on macOS, or sftp:// in the location bar of your Linux file manager.

sftp from the worker node

Connect to your workgroup fileserver:

$ sftp _myusernameon_telin@_workgroup_fs
sftp> get example.zip
sftp> quit

(replace _myusernameon_telin with your username and _workgroup_fs with your workgroup fileserver)

Alternatively use sftp://_myusernameon_telin@_workgroup_fs in the location bar of your Linux file manager.

smbclient to the UGent server

From the worker node:

$ smbclient -U _myusernameon_ugent -W UGENT //files.ugent.be/_myusernameon_ugent/home
Password for [UGENT\_myusernameon_ugent]:
Try "help" to get a list of possible commands.
smb: \> get example.zip
smb: \> quit

Replace _myusernameon_ugent with your UGent username! Note that //files.ugent.be is part of the path, not a comment!

Alternatively use smb://files.ugent.be/_myusernameon_ugent/home in the location bar of your Linux file manager.

AWS S3

Point your browser to https://queenbee.ugent.be and log in with the lab user. You will see a user interface where you can upload and download your files. The files are then available in the /lab folder of queenbee.

E.g. suppose you uploaded a file called “example.zip” to the “dump” bucket on queenbee. You can then fetch it from any worker node with the mc command:

$ mc cp lab/dump/example.zip .

or upload a file with put:

$ mc cp example.zip lab/dump
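
To see which buckets and files are available, you can list them with the same lab alias used above:

$ mc ls lab
$ mc ls lab/dump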

Other tools are available to connect to queenbee S3 storage here.

LXD launch

If you launch a container or virtual machine it will be mapped into the cluster network automatically. This way your container can be accessed from the cluster as if it were an extra node. Services can be made available through a separate UGent DNS name, or through a path in the queenbee.ugent.be URL, e.g. https://queenbee.ugent.be/myserver/ will point to a newly created webserver with container name myserver. You can name your LXD server anything, as long as it doesn't conflict with an existing node of course.

Example 1

$ lxc launch ubuntu:22.04 c1

This will launch a container “c1”. Check it:

$ lxc list
+--------+---------+----------------------+------+-----------+-----------+----------+
|  NAME  |  STATE  |         IPV4         | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+--------+---------+----------------------+------+-----------+-----------+----------+
| c1     | RUNNING | 10.0.127.8 (eth0)    |      | CONTAINER | 0         | b100     |
+--------+---------+----------------------+------+-----------+-----------+----------+

To set up a password use this:

$ lxc exec c1 passwd ubuntu

Now you can log in to your custom server:

$ ssh ubuntu@c1
$ sudo -s
$ apt install ...

If you get an error, fix the password authentication of the ssh server:

$ lxc exec c1 sed -- -i 's/PasswordAuthentication/& yes #/' /etc/ssh/sshd_config
$ lxc exec c1 systemctl restart ssh

You can delete the c1 container:

$ lxc stop c1
$ lxc delete c1

Example 2

Now we will create a container “gpu1” which can access the graphical Nvidia card of node b401.

$ lxc launch --profile default --profile x11-nvidia ubuntu:22.04 gpu1 --target=b401 

It is however necessary that the lab user is logged in graphically. You can use the browser, log in to b401 and just disconnect. The profile x11-nvidia takes a while to install the extra packages and the cuda compiler. You can have a peek at what the profile does:

$ lxc profile show x11-nvidia
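
Once the profile has finished installing, a quick check that the container actually sees the GPU could look like this (assuming the Nvidia utilities were installed by the profile):

$ lxc exec gpu1 -- nvidia-smi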

Example 3

To make Docker run inside a Linux container, we added security.nesting=true to the default profile. You can use snap if using the Ubuntu OS, e.g.:

$ lxc launch ubuntu:22.04 c2
$ lxc exec c2 snap install docker

Change the daemon.json file in the Linux container to the following, to use the correct subnet and storage driver:

$ cat >/var/snap/docker/current/config/daemon.json <<EOF
{
    "log-level":        "error",
    "bip": "172.26.0.1/16",
    "storage-driver": "vfs"
}
EOF
$ systemctl start snap.docker.dockerd
$ docker run hello-world

Example 4

You can use the cluster to create self-hosted runners for GitHub. A demo on YouTube can be found via https://github.com/stgraber/lxd-github-actions

Token

Generate a token in your project repository from the https://github.com menu:

Settings -> Actions -> Runners -> New self-hosted runner

In the Configure box you will find a token. Use this token to prepare an ephemeral container, i.e. one that destroys itself after the job, so you get a clean environment each time.

Runners

On queenbee you will find the scripts to get you started:

$ cd /lab/lxd/lxd-github-actions
$ lxc launch images:ubuntu/22.04 base-c1
$ ./prepare-instance base-c1 https://github.com/myaccount/runner-test AFNB2AESW6ZNNAD6XGANFELCWV5FQ
$ lxc config set base-c1 security.idmap.isolated=true 
$ lxc stop base-c1
$ ./spawn base-c1 2

This creates 2 ephemeral containers that wait for a commit in the repository. In the example repository runner-test, a workflow YAML document had to be created in the Actions menu:

# This is a basic workflow to help you get started with Actions

name: CI

# Controls when the workflow will run
on:
  push:
  pull_request:

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "build"
  build:
    # The type of runner that the job will run on
    runs-on: self-hosted

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2

      # Runs a set of commands using the runner's shell
      - name: Run a multi-line script
        run: |
              lsb_release -idrc
              uname -a
              lscpu
              free
              sudo apt-get install -y cowsay
              /usr/games/cowsay "Hello world!"

The run stanza here is just an example; normally it compiles your source and can make a release automatically (actions/upload-artifact).

If you choose the nodes with a GPU, you can even use CUDA.

Anaconda

Initialize Anaconda first:

$ eval "$(/opt/anaconda3/bin/conda shell.bash hook)"

You can create a python virtual environment to setup your project:

$ conda create -y -n myproject #only do this once
$ conda activate myproject
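
Inside the activated environment you can then install whatever your project needs; the packages below are only placeholders:

$ conda install -y python=3.10 numpy matplotlib
$ pip install tqdm        # pip is available once python is installed in the environment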

A default work environment with pytorch has been set up:

$ conda activate work

On the GPU nodes it is possible to use CUDA in this work environment; check it:

$ python -c 'import torch; print(torch.rand(2,3).cuda())'

Some home-brewed scripts

List machines

A list of the machines available in the cluster can be viewed, together with their OS version, CPU type, CPU speed (GHz), built-in RAM memory (MB) and the available Nvidia card. Issue the following command (the list is sorted alphabetically):

$ machines

Check availability

If you plan to start a simulation, it's better to check which CPU is the fastest, but also how many simulations are already running on a particular machine. E.g. if you have a 1-core 2000 MHz CPU and 4 simulations are running concurrently, in the best case every simulation is assigned 500 MHz. If you then start an extra simulation, your simulation will only be assigned 400 MHz. Even an older computer with a 600 MHz core will run your simulation 1.5 times faster (if only 1 simulation is assigned to it). To account for all of this (plus the fact that these are multicore systems) and to figure out which machine is currently the fastest for you, run the script:

$ nextsim

It will generate a list of the available CPU speeds, sorted from low to high. The machines mentioned last are thus the best choice for you (column 1 = speed, column 2 = machine).
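
A purely hypothetical example of the output, with made-up numbers and machines, just to show how to read it:

$ nextsim
 750 b102
1800 b404
3200 b806

Here b806 would be the best pick, since it is listed last.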

Run on cluster

You can run a command on all nodes, e.g. to find the longest uptime:

$ oncluster uptime
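
Another example, assuming oncluster passes its arguments verbatim to each node: check the free space on the /scratch folders before downloading a large dataset:

$ oncluster df -h /scratch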

No hangup

If you want to run your simulations in the background, use nohup (no hangup): you can then log out without your simulations being stopped. The safest way is to redirect all standard file descriptors: stdin, stdout and stderr. Suppose you want to run a Matlab program “run_multiply.sh” with arguments “3.1415 99”; you issue the following command:

$ nohup ./run_multiply.sh $EBROOTMATLAB 3.1415 99 >& multiply.out &

Some programs read from a text input file e.g. maple:

$ nohup maple < maple-script.mpl >& result-maple.txt &
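
You can keep an eye on a running job by following its output file, e.g.:

$ tail -f multiply.out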

Check your running programs

You can check your running programs on a node with:

$ ps ux
# or
$ top
# or
$ htop
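
Since everybody works as the shared lab user, a simple filter helps to find your own program back (run_multiply is just the example from above):

$ ps ux | grep '[r]un_multiply'    # the [r] avoids matching the grep itself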

The LXD cluster consists of recovered machines, so there could be a glitch now and then. Also keep a copy of your programs and results on your local PC, as any node member can be wiped.