Please note this information was current as of November 2023; see the HPC documentation for the latest information. These notes are specifically targeted at students doing projects with the NLIP Group at the Cambridge Computer Lab. See also this wiki page about other options for accessing GPUs.

Registration

To sign up for the Cambridge Service for Data Driven Discovery (CSD3, aka “the HPC”, part of the University’s Research Computing Service), complete the online application form (Raven login).

Notes:

Log-in

Once your application has been approved (this can take up to a week, though it is usually faster), you can connect to the HPC servers like so:

ssh <username>@login.hpc.cam.ac.uk

where <username> is your CRSid; you'll be prompted for your password (by default the HPC uses your UIS password, the one you use for email, Raven, etc.). Note that reusing that password is bad practice security-wise: set your HPC password to something distinct from your UIS password using the passwd command. HPC log-in now also requires two-factor authentication.

Note that the log-in nodes are not intended for running your experiments, but only for environment set-up, experiment preparation, and SLURM workload management.

Modules

The HPC functions through modules and virtual environments. Once logged in, you can see the modules loaded by default by typing module list at the command prompt, and a list of available modules with module avail.

Load a module with module load <module>: e.g. module load python/3.8 (note there are finer-grained versions such as ‘python-3.6.1-gcc-5.4.0-xk7ym4l’), or load several at once: module load python/3.8 R/4.0.3

Unload a module with module unload <module> (tab-completion works), or unload all modules with module purge (note this unloads the useful SLURM modules too, so use with care).
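
As a quick reference, the module commands above combine like so (a sketch; check module avail for the module names and versions actually installed):

module list                       # modules loaded by default
module avail python               # filter the available modules by name
module load python/3.8            # load a module
module load python/3.8 R/4.0.3    # or load several at once
module unload R/4.0.3             # unload a single module
module purge                      # unload everything, including the SLURM modules - use with care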

Your environment

If you’re working with Python, it’s best to work in a virtual environment: python3 -m venv venvs/demo; source venvs/demo/bin/activate

Update pip and other fundamentals: pip install --upgrade pip; pip install --upgrade numpy scipy wheel; pip install tensorflow==1.15 (for the test experiment we’re going to run, we specifically need TensorFlow v1.15)
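
Putting those steps together (a sketch; note that pre-built TensorFlow 1.15 wheels target Python 3.7 and earlier, so if pip cannot find the package, load an older Python module than 3.8 — the module name below is an assumption, check module avail python):

module load python/3.7                   # assumed module name; TensorFlow 1.15 needs Python <= 3.7
python3 -m venv venvs/demo               # create the virtual environment
source venvs/demo/bin/activate           # activate it (leave it later with 'deactivate')
pip install --upgrade pip
pip install --upgrade numpy scipy wheel
pip install tensorflow==1.15             # pinned for the demo experiment below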

Obviously you can set up a conda environment too, for a more general virtual environment (see the HPC page about this, noting that you should use miniconda rather than anaconda), e.g.:

module load miniconda/3
conda create --prefix ./myenv
source activate ./myenv
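
Once the prefix environment is activated you install packages into it as usual; for instance (a sketch, with an arbitrary Python version chosen for illustration):

conda install --yes python=3.9 pip    # install an interpreter and pip into the prefix environment
pip install numpy                     # then pip-install packages inside the environment as normal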

Jobs

Restrictions:

Now for a GED (grammatical error detection) experiment with Marek Rei’s sequence labeller. First ssh into the HPC log-in nodes, then change directory out of your homespace, where you only have a 50GB quota, to the RDS (Research Data Store), where you have a 1TB quota (see the File and I/O Management page for more detail). The HPC-side commands from the following steps are gathered into a single sketch after the list:

  1. cd rds/hpc-work/
  2. git clone https://github.com/marekrei/sequence-labeler.git
  3. cd sequence-labeler
  4. mkdir data embeddings models
  5. on your local machine, download the CLC FCE corpus, the ‘dataset for error detection’, from the iLexIR website, and unpack it: cd to the resulting ‘fce-error-detection’ directory;
  6. then copy the files from the ‘tsv’ directory over to the HPC (run this from your local machine, adjusting the remote path to wherever you cloned the repository): scp tsv/fce-public.* <username>@login.hpc.cam.ac.uk:my_expt/sequence-labeler/data/; then back on the HPC check the files arrived: ls -lh data/
  7. download pre-trained GloVe embeddings from the Stanford NLP website: cd embeddings; wget http://nlp.stanford.edu/data/glove.6B.zip; unzip glove.6B.zip; rm glove.6B.zip
  8. edit the config file: e.g. cd ../conf/; emacs fcepublic.conf (or use your preferred text editor); update the train, dev and test file paths with the prefix ‘data/’, remove the dev and nucle files from the test entry, and reduce the number of epochs to 20 for this demo run
  9. save and exit, then return to the sequence-labeler root directory: cd ../
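
For reference, here is a sketch gathering the HPC-side commands from steps 1-4 and 7 above (the corpus download and scp in steps 5-6 happen on your local machine):

cd rds/hpc-work/
git clone https://github.com/marekrei/sequence-labeler.git
cd sequence-labeler
mkdir data embeddings models
# ... copy the FCE tsv files into data/ from your local machine (steps 5-6) ...
cd embeddings
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip
rm glove.6B.zip
cd ..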

Next, prepare your SLURM script: see the example scripts with ls -l /usr/local/Cluster-Docs/SLURM. Copy and edit the script where indicated, e.g. for a CPU job: cp -v /usr/local/Cluster-Docs/SLURM/slurm_submit.peta4-icelake .; emacs slurm_submit.peta4-icelake (or copy the wilkes3 script for a GPU job). A sketch of the edited script follows the checklist below.

  1. change job name, e.g. ‘my_first_hpc_job’
  2. which project should be charged? e.g. COMPUTERLAB-SL3-CPU (see your project names with mybalance), or COMPUTERLAB-SL3-GPU for a GPU job; SL3 is the free tier, used to test or run low-priority jobs
  3. how many nodes? 1 if in doubt
  4. how many tasks? 1 if in doubt
  5. (GPUs per node: note you are charged by GPU usage!)
  6. how much time is required? In hh:mm:ss format; leave as 02:00:00, or the HPC maximum (12:00:00 for SL3), if in doubt
  7. change --mail-type from NONE to ALL
  8. leave the --no-requeue line commented out with the extra hash character
  9. modify the environment section of the script: load your modules with module load miniconda/3 and activate your virtual environment, e.g. source activate ~/venvs/demo
  10. insert application command(s): application="python experiment.py conf/fcepublic.conf"
  11. with logging: options=">logfile 2>errfile"
  12. set the working directory to your subdirectory on the RDS: workdir="/home/your_id/rds/hpc-work/your_path/"
  13. save and exit
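
Putting the checklist together, the edited parts of the template might look roughly like this (a sketch only: the real template contains further boilerplate which you should keep, and the module, account, paths and command are the example values from above):

#!/bin/bash
#SBATCH -J my_first_hpc_job
#SBATCH -A COMPUTERLAB-SL3-CPU
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=02:00:00
#SBATCH --mail-type=ALL
##SBATCH --no-requeue

# environment set-up: keep the template's own module lines, then add your own
module load miniconda/3
source activate ~/venvs/demo      # conda-style activation; for a plain venv use: source ~/venvs/demo/bin/activate

# application command, logging and working directory
application="python experiment.py conf/fcepublic.conf"
options=">logfile 2>errfile"
workdir="/home/your_id/rds/hpc-work/your_path/"

# the rest of the template builds and runs the command from these variables
# (roughly: cd $workdir; eval "$application $options") - keep its lines as provided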

Now to submit and monitor your job:
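
The standard SLURM commands apply here (a sketch; substitute the script name, your CRSid and the job ID as appropriate):

sbatch slurm_submit.peta4-icelake    # submit the job; SLURM prints a job ID
squeue -u <username>                 # check the state of your queued and running jobs
scancel <job_id>                     # cancel a job if needed
tail -f logfile errfile              # follow the output once the job is running (the files named in 'options' above)
mybalance                            # check the remaining balance on your projects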

A note on PyTorch on the HPC

If you’re going to make use of pre-trained language models, such as those available in Hugging Face Transformers, or if you’re going to train your own neural networks, then it’s likely you’ll need the PyTorch library. At the time of writing, the latest version of PyTorch (2.1.1) is built for either CUDA 11.8 or CUDA 12.1, and both appear to be available as modules on the HPC.
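
For example, something along these lines should work inside your virtual environment (a sketch: the CUDA module name is an assumption, so check module avail cuda; the --index-url is PyTorch’s standard wheel index for CUDA 11.8 builds):

module load cuda/11.8              # assumed module name - check 'module avail cuda'
pip install torch --index-url https://download.pytorch.org/whl/cu118
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# note: on the log-in nodes (no GPU) the availability check will print False;
# run it inside a GPU job to confirm CUDA is visible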


Andrew Caines, apc38, November 2023