Department of Computer Science and Technology

The High Performance Computing Service (HPCS)

The University High Performance Computing Service (HPCS) provides a means of running large numbers of parallel jobs on potentially large amounts of data.

This page is intended as a quick introduction for Computer Laboratory members, but for more detail see The High Performance Computing Service. Note that as of November 2017 the machines comprising the service have been updated, and the service is now collectively known as the Cambridge Service for Data Driven Discovery (CSD3).

The HPCS previously consisted of a large number of compute nodes (the Darwin cluster) and a smaller number of GPU nodes (the Wilkes cluster). CSD3 consists of two clusters, Peta4 and Wilkes2; Peta4 is itself subdivided into two types of system, Skylake (aka Peta4-Skylake) and KNL (aka Peta4-KNL).

  • Peta4-Skylake comprises 24,000 Intel Skylake cores, in nodes of 32 cores each.
  • Peta4-KNL comprises 342 Intel KNL units, in nodes containing 256 logical CPUs, allocated only as entire nodes.
  • Wilkes2 comprises 360 NVIDIA Tesla P100 GPUs, in nodes containing 4 GPUs each.

Unless working on something that really needs a GPU, lab members are expected to make use of Peta4-Skylake. Note that CSD3 is under constant development, so anything said here about numbers of nodes, quotas or time allocations is likely to go out of date; see the CSD3 documentation for definitive values.

Free vs Paying

Users fall into one of several Service Levels: SL2 is for paying customers working on medium-scale projects and making irregular use, while SL3 is for non-paying customers making small-scale use, subject to certain running-time and priority constraints. (There are also SL1 and SL4 users, but neither is likely to be of interest to lab users.) SL3 users get a fixed allocation of hours per quarter; SL2 users get what they pay for, for as long as their money lasts. Quotas are reset on four fixed dates per year.

The Computer Lab has given some money to the HPCS to allow Lab members to use the service at SL2. This money funds two "Projects", named COMPUTERLAB-SL2-CPU (for running on Peta4-Skylake) and COMPUTERLAB-SL2-GPU (for running on Wilkes2). (We do not currently have a paid project for Peta4-KNL as we do not anticipate using it, but if you have a need for it let us know.) When you run a job you specify which Project you want in the submission script, and the hours you use are charged to that Project. It is not a huge amount of money, so it is mainly intended for people wanting to try out the HPCS to see if it is suitable for their needs, or for student projects. If you intend to make extensive use of the HPCS you should ask your PI to set up their own project and provide funding. If you do not charge to a Project, you will receive Service Level 3, which has more restricted running time and priority than Project-funded service, but should still be usable. Use of a Project requires approval; in the case of the "Computer Lab" Projects, COMPUTERLAB-SL2-CPU and COMPUTERLAB-SL2-GPU, it will need to be approved by somebody on the sys-admin team.
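As a minimal sketch (assuming the standard SLURM directives described in the CSD3 documentation), the Project is selected with the account option in the submission script:

    #SBATCH -A COMPUTERLAB-SL2-CPU    # charge this job's hours to the Computer Lab CPU Project

or #SBATCH -A COMPUTERLAB-SL2-GPU for Wilkes2 jobs. The remaining balance of a Project can be checked from a login node; the CSD3 documentation describes a mybalance command for this.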

Applying for an account

Fill out and submit the application form. It may take up to a week for an account to be issued, but it is usually much faster (often the same day). Unless your PI has set up, or wishes to set up, a project of their own, specify either COMPUTERLAB-SL2-CPU (for running on Peta4-Skylake) or COMPUTERLAB-SL2-GPU (for running on Wilkes2, i.e. if you need GPUs) when asked for a project.

Operating Procedure

A user has two directories: a local space and a scratch space, both with quotas set (quotas change too often to be worth recording here).
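Current usage and limits can be checked from a login node; the CSD3 documentation describes a quota command for this (a minimal sketch):

    quota    # show current usage and limits for your home and scratch space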

In outline, the usual working procedure, described in full on the Quick Start page, is:

  • log in using ssh to one of the login servers: login-cpu.hpc.cam.ac.uk (for Peta4-Skylake) or login-gpu.hpc.cam.ac.uk (for Wilkes2),
  • copy your program/script into the local directory,
  • copy your data (if any) into the scratch space,
  • on the login server, compile your program if necessary and check that it runs on a simple case,
  • set up a submission script, which specifies how many instances of the program will run on which data (this is the only difficult part; refer to the Quick Start page and the sketch after this list),
  • submit the submission script to the scheduler,
  • wait for it to run (you can check its progress). Jobs can take up to 36 hours and there is no preemption, so in the worst case you could have to wait 36 hours for other jobs to finish before yours can start; however, that would require someone else to be using the entire pool of compute nodes at once, which in practice happens very rarely. SL2 users have higher priority than SL3 users.
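As a minimal sketch of the whole procedure, assuming the SLURM scheduler described in the CSD3 documentation (the script name, program name, data file, partition name and resource requests below are illustrative placeholders, not definitive values):

    # On your own machine: log in to a login node and copy files across
    ssh <CRSid>@login-cpu.hpc.cam.ac.uk
    scp mydata.tar.gz <CRSid>@login-cpu.hpc.cam.ac.uk:<your scratch directory>/

    # Contents of an illustrative submission script, submit.sh:
    #!/bin/bash
    #SBATCH -J myjob                  # job name
    #SBATCH -A COMPUTERLAB-SL2-CPU    # Project to charge the hours to
    #SBATCH -p skylake                # Peta4-Skylake partition (check the CSD3 docs for the current name)
    #SBATCH --nodes=1
    #SBATCH --ntasks=32               # one full 32-core node
    #SBATCH --time=02:00:00           # requested wall-clock time (36 hours is the maximum)
    ./myprogram <your scratch directory>/mydata

    # On the login node: submit the script and check progress
    sbatch submit.sh
    squeue -u <CRSid>                 # list your queued and running jobs

By default SLURM writes the job's output to a slurm-<jobid>.out file in the directory from which the job was submitted.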

Software

A wide range of software is available: Matlab, R, Java, and a variety of compilers. If you have specific version constraints, or need specific toolboxes, then it is best to check availability with [Javascript required] before applying for an account.
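Software on the cluster is typically made available through environment modules; a minimal sketch (the exact module names on CSD3 may differ):

    module avail            # list available software modules
    module avail matlab     # search for a particular package
    module load matlab      # make it available in the current session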

Training

There are periodic one-day training courses in how to use the HPCS. However, someone familiar with other batch systems, such as Condor, should be able to pick up enough from the HPCS documentation to get by. If you are new to such systems then the learning curve may be quite steep, and you should allow several days and quite a lot of trial and error. The HPCS support staff are extremely helpful and will be of considerable assistance, and there is extensive documentation available on the UIS webpage.