Please note this information was current as of November 2023; see the HPC documentation for the latest information. These notes are specifically targeted at students doing projects with the NLIP Group at the Cambridge Computer Lab. See also this wiki page about other options for accessing GPUs.

Registration

To sign up for the Cambridge Service for Data Driven Discovery (CSD3, aka “the HPC”, part of the University’s Research Computing Service), complete the online application form (Raven login).

Notes:

Log-in

Once your application has been approved (this can take up to a week, though it is usually faster), you can connect to the HPC servers like so:

ssh <username>@login.hpc.cam.ac.uk

where <username> is your CRSid; you'll be prompted for your password (by default the HPC uses your UIS password, the one you use for email, Raven, etc.). Note that reusing that password is bad practice security-wise: set your HPC password to something distinct from your UIS password using the passwd command. HPC log-in now also requires two-factor authentication.

Note that the log-in nodes are not intended for running your experiments, but only for environment set-up, experiment preparation, and SLURM workload management.

Modules

The HPC functions through modules and virtual environments. Once logged in, you can see the modules loaded by default by typing module list at the command prompt, and a list of available modules with module avail.

Load a module with module load <module>: e.g. module load python/3.8 (note there are finer-grained versions such as ‘python-3.6.1-gcc-5.4.0-xk7ym4l’), or load several at once: module load python/3.8 R/4.0.3

Unload a module with module unload <module> (tab-completion works), or unload all modules with module purge (note this unloads the useful SLURM modules too, so use with care).
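
As a quick reference, the module commands above combine like so (a sketch; check module avail for the module names and versions actually installed):

module list                       # modules loaded by default
module avail python               # filter the available modules by name
module load python/3.8            # load a module
module load python/3.8 R/4.0.3    # or load several at once
module unload R/4.0.3             # unload a single module
module purge                      # unload everything, including the SLURM modules - use with care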

Your environment

If you’re working with Python, it’s best to work in a virtual environment: python3 -m venv venvs/demo; source venvs/demo/bin/activate

Update pip and other fundamentals: pip install --upgrade pip; pip install --upgrade numpy scipy wheel; pip install tensorflow==1.15 (for the test experiment we’re going to run, we specifically need TensorFlow v1.15)
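
Putting those steps together (a sketch; note that pre-built TensorFlow 1.15 wheels target Python 3.7 and earlier, so if pip cannot find the package, load an older Python module than 3.8 — the module name below is an assumption, check module avail python):

module load python/3.7                   # assumed module name; TensorFlow 1.15 needs Python <= 3.7
python3 -m venv venvs/demo               # create the virtual environment
source venvs/demo/bin/activate           # activate it (leave it later with 'deactivate')
pip install --upgrade pip
pip install --upgrade numpy scipy wheel
pip install tensorflow==1.15             # pinned for the demo experiment below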

Obviously you can set up a conda environment too, for a more general virtual environment (see the HPC page about this, noting that you should use miniconda rather than anaconda), e.g.:

module load miniconda/3
conda create --prefix ./myenv
source activate ./myenv
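
Once the prefix environment is activated you install packages into it as usual; for instance (a sketch, with an arbitrary Python version chosen for illustration):

conda install --yes python=3.9 pip    # install an interpreter and pip into the prefix environment
pip install numpy                     # then pip-install packages inside the environment as normal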

Jobs

Restrictions:

Now for a GED (grammatical error detection) experiment with Marek Rei’s sequence labeller. First ssh into the HPC log-in nodes, then change directory out of your homespace, where you only have a 50GB quota, to the RDS (Research Data Store), where you have a 1TB quota (see the File and I/O Management page for more detail). The HPC-side commands from the following steps are gathered into a single sketch after the list:

  1. cd rds/hpc-work/
  2. git clone https://github.com/marekrei/sequence-labeler.git
  3. cd sequence-labeler
  4. mkdir data embeddings models
  5. on your local machine, download the CLC FCE corpus, the ‘dataset for error detection’, from the iLexIR website, and unpack it: cd to the resulting ‘fce-error-detection’ directory;
  6. then copy the files from the ‘tsv’ directory over to the HPC (run this from your local machine, adjusting the remote path to wherever you cloned the repository): scp tsv/fce-public.* <username>@login.hpc.cam.ac.uk:my_expt/sequence-labeler/data/; then back on the HPC check the files arrived: ls -lh data/
  7. download pre-trained GloVe embeddings from the Stanford NLP website: cd embeddings; wget http://nlp.stanford.edu/data/glove.6B.zip; unzip glove.6B.zip; rm glove.6B.zip
  8. edit the config file: e.g. cd ../conf/; emacs fcepublic.conf (or use your preferred text editor); update the train, dev and test file paths with the prefix ‘data/’, remove the dev and nucle files from the test entry, and reduce the number of epochs to 20 for this demo run
  9. save and exit, then return to the sequence-labeler root directory: cd ../
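
For reference, here is a sketch gathering the HPC-side commands from steps 1-4 and 7 above (the corpus download and scp in steps 5-6 happen on your local machine):

cd rds/hpc-work/
git clone https://github.com/marekrei/sequence-labeler.git
cd sequence-labeler
mkdir data embeddings models
# ... copy the FCE tsv files into data/ from your local machine (steps 5-6) ...
cd embeddings
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip
rm glove.6B.zip
cd ..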

Next, prepare your SLURM script: see the example scripts with ls -l /usr/local/Cluster-Docs/SLURM. Copy and edit the script where indicated, e.g. for a CPU job: cp -v /usr/local/Cluster-Docs/SLURM/slurm_submit.peta4-icelake .; emacs slurm_submit.peta4-icelake (or copy the wilkes3 script for a GPU job). A sketch of the edited script follows the checklist below.

  1. change job name, e.g. ‘my_first_hpc_job’
  2. which project should be charged? e.g. COMPUTERLAB-SL3-CPU (see your project names with mybalance), or COMPUTERLAB-SL3-GPU for a GPU job; SL3 is the free tier, used to test or run low-priority jobs
  3. how many nodes? 1 if in doubt
  4. how many tasks? 1 if in doubt
  5. (GPUs per node: note you are charged by GPU usage!)
  6. how much time is required? In hh:mm:ss format; leave as 02:00:00, or the HPC maximum (12:00:00 for SL3), if in doubt
  7. change --mail-type from NONE to ALL
  8. leave the --no-requeue line commented out with the extra hash character
  9. modify the environment section of the script: load your modules with module load miniconda/3 and activate your virtual environment, e.g. source activate ~/venvs/demo
  10. insert application command(s): application="python experiment.py conf/fcepublic.conf"
  11. with logging: options=">logfile 2>errfile"
  12. set the working directory to your subdirectory on the RDS: workdir="/home/your_id/rds/hpc-work/your_path/"
  13. save and exit
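
Putting the checklist together, the edited parts of the template might look roughly like this (a sketch only: the real template contains further boilerplate which you should keep, and the module, account, paths and command are the example values from above):

#!/bin/bash
#SBATCH -J my_first_hpc_job
#SBATCH -A COMPUTERLAB-SL3-CPU
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=02:00:00
#SBATCH --mail-type=ALL
##SBATCH --no-requeue

# environment set-up: keep the template's own module lines, then add your own
module load miniconda/3
source activate ~/venvs/demo      # conda-style activation; for a plain venv use: source ~/venvs/demo/bin/activate

# application command, logging and working directory
application="python experiment.py conf/fcepublic.conf"
options=">logfile 2>errfile"
workdir="/home/your_id/rds/hpc-work/your_path/"

# the rest of the template builds and runs the command from these variables
# (roughly: cd $workdir; eval "$application $options") - keep its lines as provided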

Now to submit and monitor your job:
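
The standard SLURM commands apply here (a sketch; substitute the script name, your CRSid and the job ID as appropriate):

sbatch slurm_submit.peta4-icelake    # submit the job; SLURM prints a job ID
squeue -u <username>                 # check the state of your queued and running jobs
scancel <job_id>                     # cancel a job if needed
tail -f logfile errfile              # follow the output once the job is running (the files named in 'options' above)
mybalance                            # check the remaining balance on your projects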

A note on PyTorch on the HPC

If you’re going to make use of pre-trained language models, such as those available in Hugging Face Transformers, or if you’re going to train your own neural networks, then it’s likely you’ll need the PyTorch library. At the time of writing, the latest version of PyTorch (2.1.1) is built for either CUDA 11.8 or CUDA 12.1, and both appear to be available as modules on the HPC.
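
For example, something along these lines should work inside your virtual environment (a sketch: the CUDA module name is an assumption, so check module avail cuda; the --index-url is PyTorch’s standard wheel index for CUDA 11.8 builds):

module load cuda/11.8              # assumed module name - check 'module avail cuda'
pip install torch --index-url https://download.pytorch.org/whl/cu118
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# note: on the log-in nodes (no GPU) the availability check will print False;
# run it inside a GPU job to confirm CUDA is visible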


Andrew Caines, apc38, November 2023