Department of Computer Science and Technology

Yiannos's guide to the Cambridge HPC

Welcome to my guide to using the Cambridge University HPC facilities. In this document I'd like to share some of my own personal experience of running computations and machine learning experiments on the HPC CPU and GPU clusters while doing research at the Department of Computer Science and Technology in Cambridge. Please note that this guide is a work in progress and I will be updating it as often as possible.

Leveraging apptainer for machine learning

Imagine this: you want to run a specific machine learning model on data you painstakingly collected yourself. You have read papers detailing models relevant to your research and downloaded the model source code from the authors' GitHub. There is one problem: the implementation depends on specific versions of certain Python packages. On your own machine or on departmental systems, you can spin up a Docker container with those packages installed and run the authors' code on your data. In contrast, the Cambridge HPC environment is tightly controlled and it is normally not possible to install arbitrary software packages on the cluster nodes.

Apptainer (formerly known as Singularity) is a lightweight equivalent to Docker designed for HPC use.

Setting up an apptainer container can be done (typically) with the following steps:

  1. Connect to the HPC and create a directory to build your apptainer image.
  2. Building your image with apptainer can require a lot of disk space, so we specify locations for the various apptainer caches using its environment variables.

    Type the following commands in the shell to tell apptainer where the build process should cache its intermediate files:


    export APPTAINER_CACHEDIR="<path to somewhere with enough space>"
    export APPTAINER_TMPDIR="<path to somewhere with enough space>"
    export SINGULARITY_CACHEDIR="<path to somewhere with enough space>"

  3. You need to create a def file with all the steps needed to install the packages you need in your image. Mine looks like this:


    BootStrap: docker
    From: pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
    # Update Ubuntu Software repository
    %post
    apt update
    apt upgrade --yes
    apt install python3-pip --yes
    pip3 install --upgrade pip --quiet --exists-action i
    pip3 install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.13.1+cu116.html --quiet --exists-action i
    pip3 install torch-geometric --quiet --exists-action i
    pip3 install flask
    pip3 install flask_compress
    pip3 install fastdist
    pip3 install sentence-transformers
    pip3 install flask-inflate
    pip3 install numba

    In the example def file above, we tell apptainer to bootstrap from the PyTorch + CUDA 11.6 + cuDNN 8 Docker image and then list what to install on top of it. You will have to find a suitable image for the TensorFlow/PyTorch and CUDA versions you need. A Google search points us to the NVIDIA TensorFlow Docker documentation.
  4. So, for TensorFlow, replace pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel with a suitable image, substituting xx.xx with your desired version:
    nvcr.io/nvidia/tensorflow:xx.xx-tfx-py3
  5. Build the image. Important: make sure you have ample quota, both disk capacity and file count (check with quota -v on the HPC). Use the following command to build:

               singularity build --sandbox --nv <target directory ending in .sif> <your def file>


          This will build the image as a writable sandbox directory (e.g., appimage.sif) and might take a while.

  6. Running Python scripts with apptainer. Typically I use an sbatch script to submit the job, pointing it at a shell script for execution. To execute a Python script through a bash entry point you'll need the following sequence of commands:

    module purge
    module unload cuda/8.0
    module load cuda/11.2
    apptainer run --nv <path to .sif image directory> python3 <path to your script with command line arguments>

    Note that in steps 5 and 6, --nv tells apptainer to enable the NVIDIA GPU runtime inside the container.
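Putting step 6 together, here is a minimal sketch of the sbatch-plus-shell-entry-point pattern. This is illustrative only: the partition name, account, time limit, image path and script name are all placeholders you would replace with your own (check the current CSD3 partition names and your project account).

```shell
#!/bin/bash
# submit.sbatch -- illustrative sketch only; all names and paths are placeholders
#SBATCH --job-name=ml-job
#SBATCH --partition=ampere          # a CSD3 GPU partition; check current names
#SBATCH --account=<your-project>    # your project/allocation account
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00             # stay within your tier's time allocation
#SBATCH --output=logs/%x-%j.out

module purge
module load cuda/11.2

apptainer run --nv /path/to/appimage.sif \
    python3 /path/to/train.py --epochs 10
```

You would submit this with `sbatch submit.sbatch`; SLURM then schedules it on a GPU node and writes stdout/stderr to the logs directory.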

Getting your code to run within HPC time limits

You have your data and you've prototyped your processing code. You queue the first task and wait for it to run, only to discover that it exceeded the chosen tier's time allocation.

How can you reliably and efficiently determine how to modify your code and data split strategy to fit within the tier's time allocation?

I like to use a 'binary search' approach to tune my data split, batch size and so on, so that my models or algorithms stay within the HPC time limits.

For example, you can scale the model parameters appropriately and measure how long, say, an epoch takes until you hit a target hours-per-epoch (or epochs per 12-hour period).

Another example of this approach is to split your data into chunks of size N and see how long it takes to process a single chunk. If a chunk exceeds the time limit, then try chunks of size N/2. If this run completes too soon, then try again with a chunk size between N/2 and N. Repeat this process, just like binary search looks for a number in an ordered list, until you find a chunk size that is close to the tier's time window.
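The chunk-size search above can be sketched in a few lines of Python. The timing function here is a stand-in: in practice you would run one real chunk on the cluster and measure its wall-clock time, and the cost model below is a toy assumption for illustration.

```python
# Sketch of the chunk-size binary search described above.

def largest_chunk_within_budget(time_for_chunk, budget_hours, lo=1, hi=1_000_000):
    """Binary-search the largest chunk size whose measured runtime
    stays within the job's time allocation (assumes runtime grows
    monotonically with chunk size)."""
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        if time_for_chunk(mid) <= budget_hours:
            best = mid          # fits within the window: try a larger chunk
            lo = mid + 1
        else:
            hi = mid - 1        # too slow: try a smaller chunk
    return best

# Toy cost model (hypothetical): 0.002 hours of processing per item.
chunk = largest_chunk_within_budget(lambda n: 0.002 * n, budget_hours=12)
print(chunk)  # 6000 items fit in a 12-hour window under this toy model
```

Each probe of `time_for_chunk` corresponds to one timed trial run, so the whole tuning process takes only a logarithmic number of trials rather than a linear sweep over chunk sizes.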

Maximising throughput when the HPC cluster is busy

One handy trick to maximise your application's throughput on the HPC, especially during busy periods, is to split your data into its smallest units and queue many little tasks (if the experiment allows for that sort of data parallelisation).

An illustrative example of this idea is grid search. For one experiment I ran a grid search for graph neural networks across many parameters. I initially tried to pack as many grid cells per HPC node as possible. However, this approach did not yield the throughput I was expecting, because my 'big tasks' often timed out after a long wait in the queue.

Instead, I modified my code to make one script invocation per cell and queued many script calls via SLURM. This put many smaller computations into the queue on (a very busy day on) the HPC, and got the job done fairly quickly and reliably.
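The one-invocation-per-cell pattern maps naturally onto a SLURM job array. A minimal sketch, assuming a hypothetical 10x10 parameter grid and an invented run_cell.py entry point (all names and paths are placeholders):

```shell
#!/bin/bash
# grid.sbatch -- one array task per grid cell (names/paths are placeholders)
#SBATCH --job-name=gridsearch
#SBATCH --array=0-99                # 100 cells, one small task each
#SBATCH --time=01:00:00             # short tasks schedule more easily

# Each task identifies its own cell via SLURM_ARRAY_TASK_ID, here
# decoded into two hyperparameter indices for a 10x10 grid:
ROW=$(( SLURM_ARRAY_TASK_ID / 10 ))
COL=$(( SLURM_ARRAY_TASK_ID % 10 ))

apptainer run --nv /path/to/appimage.sif \
    python3 /path/to/run_cell.py --row "$ROW" --col "$COL"
```

A single `sbatch grid.sbatch` then enqueues all 100 cells, and the scheduler can slot each short task into whatever capacity becomes free.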

Identifying bottlenecks in machine learning code (CUDA)

I will update this section with expanded discussions soon, but for now:

  • Valgrind is your friend for finding bottlenecks and mistakes in C++ code;

  • The GNU debugger is your friend when you need to find why your C/C++ code crashes with a segmentation fault;

  • The NVIDIA Visual Profiler (and its successors, Nsight Systems and Nsight Compute) is very useful for observing how your CUDA program (this includes nvcc-compiled parallel C programs as well as Python TensorFlow/PyTorch scripts) spends its runtime;

  • In my experience, I/O and moving data from the CPU/system memory to the GPU VRAM can be key bottlenecks in your GPU compute code. Try to cluster your copies so that large chunks of computation (e.g., an epoch) can run without delay;

  • Many data loading/sampling/batching interfaces for TensorFlow and PyTorch produce samples lazily (as in lazy evaluation, on demand) during training and evaluation. This is a key reason why you will sometimes see your machine learning script load the CPU for long periods before sending bursts of compute to the GPU. This is not ideal; one way to overcome it and maximise GPU utilisation is to pre-compute your batches and cache them in a format that can be reconstructed in memory with as little pre-processing as possible (e.g., pickle).
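As a hedged sketch of the pre-compute-and-cache idea: the batching function below is a stand-in for your own (typically much more expensive) preprocessing pipeline, and the file name is arbitrary.

```python
import pickle

# Pre-compute batches once and cache them with pickle, so the training
# loop can load ready-made batches instead of building them lazily on
# the CPU while the GPU sits idle.

def make_batches(samples, batch_size):
    """Stand-in for your own (expensive) preprocessing/batching step."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

def cache_batches(samples, batch_size, path):
    with open(path, "wb") as f:
        pickle.dump(make_batches(samples, batch_size), f)

def load_batches(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage: pay the preprocessing cost once, then every epoch just loads.
cache_batches(list(range(10)), batch_size=4, path="batches.pkl")
print(load_batches("batches.pkl"))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Loading a pickled list of batches is close to a raw memory read, so each epoch starts feeding the GPU almost immediately instead of repeating the sampling work.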

Acknowledgements

Big thanks to Dr Sean Holden for continuous input on the various HPC-related activities I'm doing.

Many thanks to Fredrik Rømming for testing material in this guide in the field and for refinements to the apptainer steps.

Many thanks to Dr Andrew Caines for his Beginner's Guide to Cambridge Uni's CSD3 that inspired this guide.