Please note this information was current as of November 2023; see the HPC documentation for the latest details. These notes are specifically targeted at students doing projects with the NLIP Group at the Cambridge Computer Lab. See also this wiki page about other options for accessing GPUs.
To sign up for the Cambridge Service for Data Driven Discovery (CSD3, aka “the HPC”, part of the University’s Research Computing Service), complete the online application form (Raven login).
Notes:
Once your application has been approved (this can take up to a week, but is usually faster), you can connect to the HPC servers like so:
ssh <username>@login.hpc.cam.ac.uk
where “username” is your CRSid. You will be prompted for your password: by default the HPC uses your UIS password (the one you use for email, Raven, etc.), which is bad practice security-wise, so set a distinct HPC password using the passwd
command. HPC log-in now also requires two-factor authentication.
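If you log in often, a host alias in the ~/.ssh/config file on your local machine saves some typing; a minimal sketch (the alias “csd3” is just an example, and <username> is a placeholder for your CRSid):
Host csd3
    HostName login.hpc.cam.ac.uk
    User <username>
after which ssh csd3 is equivalent to the command above.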
Note that the log-in nodes are not intended for running your experiments: use them only for environment set-up, experiment preparation, and SLURM workload management (submitting and monitoring jobs).
The HPC functions through modules and virtual environments. Once logged in, you can see the modules loaded by default by inputting module list
at the command prompt, and list the available modules with module avail
Load a module with module load <module>
e.g. module load python/3.8
(note there are finer-grained versions such as ‘python-3.6.1-gcc-5.4.0-xk7ym4l’), or load several at once: module load python/3.8 R/4.0.3
Unload a module with module unload <module>
(tab-completion works), or unload all modules with module purge
(note this unloads the useful SLURM modules too, so use with care).
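For example, to see what a module actually gives you, search for it, load it, and check which interpreter is now first on your PATH (the version here is just the one used above; pick whatever module avail lists):
module avail python          # list the Python modules on offer
module load python/3.8       # load one of them
which python3                # confirm the module's interpreter is now on your PATH
python3 --version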
If you’re working with Python, it’s best to work in a virtual environment: python3 -m venv venvs/demo; source venvs/demo/bin/activate
Update pip and other fundamentals: pip install --upgrade pip; pip install --upgrade numpy scipy wheel; pip install tensorflow==1.15
(for the test experiment we’re going to run we specifically need TensorFlow v1.15; note that the official 1.15 wheels only exist for older Pythons, up to 3.7, so load a suitable Python module before creating the venv)
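Note that activation does not persist between log-ins (or inside a batch job): in a new shell you reload the module and re-activate the venv, e.g.:
module load python/3.8                              # or whichever Python module you used to create the venv
source ~/venvs/demo/bin/activate
python -c "import numpy; print(numpy.__version__)"  # quick sanity check that the venv's packages are visible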
You can of course set up a conda environment instead, for a more general-purpose virtual environment (see the HPC page about this, noting that you use miniconda rather than anaconda), e.g.:
module load miniconda/3
conda create --prefix ./myenv
source activate ./myenv
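The prefix environment created above starts out empty; as a minimal sketch (the versions and packages here are illustrative, not requirements), you might first install a Python and pip into it:
conda install python=3.9 pip      # run inside the activated env
pip install numpy scipy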
Restrictions:
Now for a grammatical error detection (GED) experiment with Marek Rei’s sequence labeller. First ssh into the HPC log-in nodes, then change directory out of your homespace, where you have only 50GB, to the RDS (Research Data Store), where you have a 1TB quota (see the File and I/O Management page for more detail):
cd rds/hpc-work/
git clone https://github.com/marekrei/sequence-labeler.git
cd sequence-labeler
mkdir data embeddings models
Then, from your local machine, copy the publicly released FCE dataset files across: scp tsv/fce-public.* <username>@login.hpc.cam.ac.uk:rds/hpc-work/sequence-labeler/data/
and, back on the HPC, check that they have arrived: ls -lh data/
cd embeddings; wget http://nlp.stanford.edu/data/glove.6B.zip; unzip glove.6B.zip; rm glove.6B.zip
cd ../conf/; emacs fcepublic.conf
(or use your preferred text editor; update the train, dev and test file paths with the prefix ‘data/’, remove the dev and nucle files from the test entry, and reduce the epochs to 20 for this demo run; a sketch of the edited entries follows below). Then: cd ../
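For reference, the edited entries might look roughly like this; the key and file names below are indicative only, so keep whatever names your copy of conf/fcepublic.conf and the FCE release actually use:
path_train = data/fce-public.train.original.tsv
path_dev = data/fce-public.dev.original.tsv
path_test = data/fce-public.test.original.tsv
epochs = 20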
Next, prepare your SLURM submission script. Example scripts can be found here: ls -l /usr/local/Cluster-Docs/SLURM
Copy one and edit it where indicated, e.g. for a CPU job: cp -v /usr/local/Cluster-Docs/SLURM/slurm_submit.peta4-icelake .; emacs slurm_submit.peta4-icelake
(or copy the wilkes3 script for a GPU job). In particular:
- set the project account (the #SBATCH -A line) to COMPUTERLAB-SL3-CPU (you can check your project balances with mybalance), or COMPUTERLAB-SL3-GPU if a GPU job, using the free SL3 tier to test or run low-priority jobs
- change --mail-type from NONE to ALL
- leave the --no-requeue line commented out with the extra hash character (##SBATCH lines are ignored by SLURM)
- replace module load rhel8/default-amp... with module load miniconda/3 and activate your virtual environment: e.g. source activate ~/venvs/demo (for a plain Python venv the activation command is source ~/venvs/demo/bin/activate)
- set application="python experiment.py conf/fcepublic.conf"
- set options=">logfile 2>errfile"
- set workdir="/home/your_id/rds/hpc-work/your_path/"
A sketch of what the edited script might end up looking like follows.
Now to submit and monitor your job:
- check your balance: mybalance
- submit the job: sbatch slurm_submit.peta4-icelake
- monitor your jobs in the queue: showq -u
- cancel a job if need be: scancel <job_id>
- inspect the output file: ls slurm-NNNN.out (where NNNN is the job id)
- and check your balance again once the job has run: mybalance
A small sketch of scripting this submit-and-watch cycle follows.
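If you want to script that cycle, sbatch --parsable prints just the job id, which the other commands can then reuse:
jobid=$(sbatch --parsable slurm_submit.peta4-icelake)   # capture the numeric job id
squeue -j "$jobid"                                      # check its state in the queue
tail -f slurm-"$jobid".out                              # follow the output once the job starts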
To go further with the HPC, check out Chris Davis’s page (contact ccd38 with your GitHub username for access to the page) about using GPUs with batch jobs. Also be aware of the extensive documentation and the support helpdesk’s email address: support@hpc.cam.ac.uk
In particular, it may be useful to know your storage limits.
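To check your actual usage, du works anywhere, and the HPC provides a quota command (if memory serves; check the storage documentation):
quota                        # summary of home and RDS usage against your limits, if available
du -sh ~/rds/hpc-work        # total size of your RDS work directory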
Also, if you wish to run longer jobs with checkpointing, then see Richard Diehl Martinez’s minimal example of how to launch a follow-on job with Python’s subprocess module.
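Richard’s example uses Python’s subprocess module; purely as an illustration of the same chaining idea in shell (not his method), the tail end of a job script can resubmit the script itself until a completion flag appears. The flag file name and script path here are placeholders:
# at the end of your SLURM script, after the training command:
if [ ! -f training_complete.flag ]; then    # hypothetical flag written by your code when training is done
    sbatch /path/to/this_script.sh          # submit a follow-on job to continue from the latest checkpoint
fi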
James Thorne, now a professor in Korea and formerly of the NLIP group, wrote an optional unit for postgrad students on using the HPC. Video and resources may be found here, including 10 great tips. Note tip 5 (‘don’t be greedy’), which includes information about SLURM’s fairshare formula.
Xiaochen Zhu & Pietro Lesci put together this guide to connecting to the HPC with VSCode.
Please also see this guide to the HPC by Yiannos Stathopoulos.
If you’re going to make use of pre-trained language models, such as those available in Hugging Face Transformers, or to train your own neural networks, then it’s likely you’ll need the PyTorch library. At the time of writing, the latest version of PyTorch (2.1.1) is built for either CUDA 11.8 or CUDA 12.1, and both appear to be available as modules on the HPC.
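As a sketch of setting that up inside your environment (the pip index URL is PyTorch’s published wheel index for CUDA 11.8 at the time of writing, and the exact CUDA module name on the HPC is an assumption, so check module avail):
module avail cuda                                        # find a CUDA module matching your chosen wheels
module load cuda/11.8                                    # name is an assumption; use what module avail shows
pip install torch --index-url https://download.pytorch.org/whl/cu118
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
Bear in mind the availability check will only print True on a GPU node (e.g. inside a wilkes3 job), not on the log-in nodes.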
Andrew Caines, apc38, November 2023