MPhil, Part III, and Part II Project Suggestions (2025-2026)

Please contact Eiko Yoneki (email: eiko.yoneki@cl.cam.ac.uk) if you are interested in any project below.

1. Multi-Objective Scheduling of Distributed LLM Serving on Heterogeneous GPUs

Heterogeneous GPU clusters offer a potential way to reduce the significant operational cost of Large Language Model (LLM) inference. However, existing scheduling systems do not adequately address the distinctive computational patterns of recent Mixture-of-Experts (MoE) models in these complex environments. This project addresses the gap by developing and evaluating a novel scheduling algorithm designed specifically to optimise MoE model inference on heterogeneous GPU clusters. Initial work is described in [1], where the algorithm is separated into two processing stages: an outer loop and an inner loop. The outer loop uses Bayesian Optimisation (BO) to search the complex configuration space efficiently, partitioning an inventory of different GPUs into small, optimal islands. The inner loop implements a new linear programming formulation that maps workload ranges, categorised by input sequence length and separated into prefill and decode phases, to the generated islands. The scheduling algorithm was evaluated using a simulation framework on real-world workload traces. In this project, multi-objective optimisation (e.g. multi-objective BO) will be explored over objectives such as latency, accuracy, and/or power consumption. The aim is to demonstrate that a workload-aware scheduling approach can unlock substantial performance and cost-efficiency gains when serving large-scale MoE models in complex, heterogeneous environments.
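The multi-objective part of the project can be illustrated with a minimal sketch that filters candidate cluster configurations down to their Pareto front over two objectives, latency and power. The configuration names and numbers are hypothetical and are not taken from [1]; a multi-objective BO loop would propose candidates rather than enumerate them as done here.

```python
from typing import Dict, List, Tuple

# Hypothetical candidate: (latency in ms, power in W); both are minimised.
Candidate = Tuple[float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Candidate]) -> List[str]:
    """Return the names of all non-dominated candidates, sorted by name."""
    front = []
    for name, score in candidates.items():
        if not any(dominates(other, score) for other in candidates.values()):
            front.append(name)
    return sorted(front)


# Made-up GPU islands and scores, for illustration only.
configs = {
    "2xA100+4xT4": (120.0, 900.0),
    "4xA100": (80.0, 1600.0),
    "8xT4": (200.0, 560.0),
    "1xA100+6xT4": (210.0, 950.0),  # worse than "8xT4" on both objectives
}
print(pareto_front(configs))  # ['2xA100+4xT4', '4xA100', '8xT4']
```

The three surviving configurations are the trade-offs a user would choose between; adding accuracy as a third objective only requires longer tuples.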

Desirable skills: BO, LLM, GPU, Multi-Objective

References:

[1] Nathan Rignall: Scheduling of Distributed LLM Serving on Heterogeneous GPUs.

[2] S. Alabed: BoGraph: Structured Bayesian Optimization From Logs for Systems with High-dimensional Parameter Space.

2. Multi-model serving on heterogeneous clusters

Contact: Eiko Yoneki

Serving multiple LLMs (such as GPT, Llama, OPT, and Falcon) efficiently on heterogeneous clusters presents unique challenges because computational power, communication bandwidth, memory bandwidth, and memory limits vary across GPU types [1]. The project aims to extend multi-model serving [2] to heterogeneous clusters. The initial phase involves setting up a benchmarking suite to evaluate model-serving frameworks such as vLLM [3] and DistServe [4] on various cluster configurations. Subsequently, the project will focus on designing a custom LLM serving framework that uses dynamic resource allocation to optimise throughput and latency across heterogeneous environments. This involves developing algorithms for job scheduling and resource management that take into account the characteristics of each cluster node. The goal is to improve the efficiency and scalability of serving multiple models in diverse computing environments, which is critical for applications in areas such as autonomous driving and real-time data analytics. There is an ongoing project in our group on this topic, so the student can build on the existing platform and focus on benchmarking tasks and on extending the scheduling algorithms.

Desirable skills: Python/CUDA/C++ programming, LLM inference and serving

References:

[1] Jiang, Youhe, et al. "HexGen: Generative Inference of Large Language Model over Heterogeneous Environment." Forty-first International Conference on Machine Learning.
[2] Duan, Jiangfei, et al. "MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving." Forty-first International Conference on Machine Learning.
[3] Kwon, Woosuk, et al. "Efficient memory management for large language model serving with paged attention." Proceedings of the 29th Symposium on Operating Systems Principles. 2023.
[4] Zhong, Yinmin, et al. "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving." 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 2024.

3. Probabilistic Inference for Reward Specification in Reinforcement Learning

Contact: Eiko Yoneki

Reinforcement learning (RL) relies on hand-crafted reward functions, which are often difficult to design and may not fully capture task goals. Probabilistic inference provides an alternative view, treating decision making as an inference task where rewards can be derived from probabilistic models rather than explicitly specified.

This project will investigate how simple probabilistic models such as Hidden Markov Models (HMMs) and Bayesian Neural Networks (BNNs) can be adapted for RL tasks. The focus will be on using inference techniques (e.g., expectation maximisation, variational inference) to derive approximate reward signals from observed trajectories. The student will implement these models using existing probabilistic programming libraries and evaluate them on small benchmark environments (e.g., CartPole, MountainCar, GridWorld), comparing against standard RL approaches with hand-crafted rewards.
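One possible reading of "deriving approximate reward signals from observed trajectories" is to score a trajectory by its likelihood under a model fitted to desirable behaviour. The sketch below computes that score with the scaled forward algorithm for a discrete HMM. The two-state parameters are made up, and using the log-likelihood as a reward proxy is an assumption for illustration, not the method prescribed by the project.

```python
import math
from typing import List, Sequence


def forward_log_likelihood(
    obs: Sequence[int],
    start: List[float],
    trans: List[List[float]],
    emit: List[List[float]],
) -> float:
    """Log-probability of an observation sequence under a discrete HMM.

    Uses the scaled forward algorithm so long sequences do not underflow.
    Assumes obs is non-empty and has non-zero probability under the model.
    """
    n_states = len(start)
    # alpha[i] is proportional to P(state i, observations so far).
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)          # P(obs[t] | obs[:t])
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


# Hypothetical two-state model over two discretised observations (0 and 1).
START = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]

# Higher (less negative) values mean the trajectory looks more like the model.
proxy_reward = forward_log_likelihood([0, 0, 1, 1], START, TRANS, EMIT)
```

In an RL loop, `proxy_reward` (or its per-step increments) would stand in for a hand-crafted reward; whether this helps on CartPole or MountainCar is exactly what the project would evaluate.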

Desirable skills: HMM, RL, Probabilistic Model

References:

[1] Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909.
[2] Toussaint, M., & Storkey, A. (2006). Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. ICML.
[3] Koller, D., Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

4. BO+RL (or Bandit) for Multi-Model LLM Serving

Contact: Eiko Yoneki

This project proposes a two-timescale, learning-augmented framework for multi-model LLM serving on heterogeneous GPU clouds. The outer loop performs static optimisation, selecting model placements and replica counts via cost-aware Bayesian Optimisation (BO) that accounts for resource limits, reconfiguration overheads, and workload uncertainty. The inner loop performs dynamic optimisation, using safe reinforcement learning (RL) and bandits for online routing, batching, and draft-verify orchestration (speculative decoding [5]), informed by output-length prediction and short-horizon demand forecasts. The formulation is explicitly multi-objective: maximise goodput while satisfying latency SLOs (time to first token (TTFT) and end-to-end latency), and jointly reduce cloud cost, energy, and unfairness using constraints and CVaR/tail controls. The loops are coupled by contracts (budgets, SLO targets) and telemetry (utilisation, tail latency), enabling robust adaptation. Evaluation on real traces and mixed GPU clusters will produce Pareto fronts and ablations (heuristics, BO-only, RL-only, BO+RL), with the aim of demonstrating scalable, adaptive, and economical LLM serving.
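The "CVaR/tail controls" mentioned above can be made concrete with a small sketch of an empirical Conditional Value at Risk estimate over latency samples: the mean of the worst (1 - alpha) fraction of observations. The estimator below is one common convention, and the sample values and the 400 ms target are hypothetical.

```python
import math
from typing import Sequence


def cvar(latencies: Sequence[float], alpha: float = 0.95) -> float:
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of samples."""
    if not latencies:
        raise ValueError("need at least one sample")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(latencies, reverse=True)
    # Round before ceil to avoid floating-point noise such as 1.0000000000000009,
    # and keep at least one sample so the tail is never empty.
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:k]
    return sum(tail) / len(tail)


def violates_tail_slo(latencies: Sequence[float], target: float, alpha: float = 0.95) -> bool:
    """True if the tail average exceeds the SLO target."""
    return cvar(latencies, alpha) > target


# Hypothetical TTFT samples in milliseconds.
ttft_ms = [110.0, 120.0, 115.0, 130.0, 125.0, 118.0, 122.0, 119.0, 480.0, 520.0]
print(cvar(ttft_ms, alpha=0.8))  # 500.0, the mean of the two worst samples
print(violates_tail_slo(ttft_ms, target=400.0, alpha=0.8))  # True
```

A constraint of this form could be fed back to the inner loop as a penalty or a hard filter on routing decisions; that choice is part of the design space the project explores.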

Desirable skills: Python, LLM, BO, RL

References:

[1] Kwon, W. et al. Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). SOSP 23.
[2] Zhong, Y. et al. DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving. OSDI 24.
[3] Jiang, Y. et al. ThunderServe: High-Performance and Cost-Efficient LLM Serving in Cloud Environments. arXiv 2025.
[4] Duan, J. et al. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. arXiv 2024.
[5] Leviathan, Y., Kalman, M., Matias, Y. Fast Inference from Transformers via Speculative Decoding. ICML 2023.

5. Bandwidth-Aware Parallelism Planning for Distributed LLM Training and Inference

Contact: Eiko Yoneki

Cloud environments expose a wide range of communication bandwidths, for instance NVLink (800/600/400 GBps, depending on generation), InfiniBand (400 GBps), RoCE (40 GBps), TCP (1-5 GBps), PCIe (20-80 GBps), and Ethernet (more than 1 GBps), and there are multiple parallelism strategies to choose from, including data parallelism [1], tensor parallelism [2,3], pipeline parallelism [4,5], expert parallelism, and sequence parallelism [6]. It is therefore essential to design an automatic parallelism method that takes the physical network topology as input and outputs a parallelism plan that minimises communication volume on low-bandwidth links and concentrates it on high-bandwidth links [7,8,9]. The system should include a network-topology-aware optimisation engine that automatically profiles the physical infrastructure and searches over parameter slice allocations, and the corresponding communication operations among slices, to minimise overall communication overhead. Key metrics include end-to-end training time, communication-to-computation ratio, network utilisation efficiency across bandwidth tiers, and performance relative to existing frameworks such as Megatron-LM [10,11]. The work proceeds in stages: first, understand the detailed characteristics of the different parallelism strategies; then build appropriate models of them; next, integrate some of these characteristics into an existing simulator and obtain simulation results that reflect real-world experimental results; and finally, complete the automatic search so that it can guide real-world deployment.
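As a toy version of the planning problem, the sketch below assigns each parallelism dimension to a link tier so that total communication time is minimised. It models only the bandwidth term of a ring all-reduce, 2(n-1)/n × bytes / bandwidth, and ignores latency, overlap with computation, and topology constraints. The traffic volumes, group sizes, and tier names are hypothetical; the two bandwidth figures are borrowed from the list above purely as example inputs.

```python
from itertools import permutations
from typing import Dict, Optional, Tuple


def ring_allreduce_seconds(num_bytes: float, n_devices: int, gb_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce; latency terms are ignored."""
    if n_devices < 2:
        return 0.0
    return 2.0 * (n_devices - 1) / n_devices * num_bytes / (gb_per_s * 1e9)


def best_mapping(
    groups: Dict[str, Tuple[float, int]],   # dimension -> (bytes per step, group size)
    links: Dict[str, float],                # link tier -> bandwidth in GB/s
) -> Tuple[Optional[Dict[str, str]], float]:
    """Exhaustively try every assignment of dimensions to distinct link tiers."""
    dims = sorted(groups)
    best_plan, best_cost = None, float("inf")
    for tiers in permutations(sorted(links), len(dims)):
        cost = sum(
            ring_allreduce_seconds(groups[d][0], groups[d][1], links[t])
            for d, t in zip(dims, tiers)
        )
        if cost < best_cost:
            best_plan, best_cost = dict(zip(dims, tiers)), cost
    return best_plan, best_cost


# Hypothetical workload: tensor parallelism moves far more data per step.
groups = {"tensor": (8e9, 4), "data": (1e9, 4)}
links = {"intra_node": 600.0, "inter_node": 40.0}
plan, seconds = best_mapping(groups, links)
print(plan)  # {'data': 'inter_node', 'tensor': 'intra_node'}
```

Exhaustive enumeration only works for a handful of dimensions and tiers; the project's search engine would need to handle real topologies and many more slicing choices.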

Desirable skills: Distributed systems programming, performance modelling and profiling, network topology analysis

References:

[1] Li, M., et al. "Scaling Distributed Machine Learning with the Parameter Server." OSDI 2014
[2] Shoeybi, M., et al. "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv:1909.08053, 2019
[3] Narayanan, D., et al. "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM." SC 2021
[4] Huang, Y., et al. "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." NeurIPS 2019
[5] Fan, S., et al. "Dapple: A Pipelined Data Parallel Approach for Training Large Models." PPoPP 2021
[6] Korthikanti, V., et al. "Reducing Activation Recomputation in Large Transformer Models." MLSys 2023
[7] Zhang, S., et al. "Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform." arXiv 2023

6. Optimisation of scheduling GPU-native assembly with Reinforcement Learning

Contact: Eiko Yoneki

This project builds an optimiser for scheduling GPU-native assembly. The closed-source nature of GPU compilers hinders tuning of GPU kernel performance on the hardware platform, and the default compiler-generated assembly schedules are not optimal for critical kernels. This project addresses these issues by converting the traditional manual scheduling process into a search problem and then employing an RL agent to search for optimal assembly schedules using runtime feedback. The project starts by reproducing CuAsmRL [1], which explores optimising NVIDIA GPU CUDA SASS schedules via deep reinforcement learning. After understanding CuAsmRL, further optimisation will be attempted on other GPUs. If time allows, porting CuAsmRL to the Tenstorrent (https://tenstorrent.com/en) environment will be explored.
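To show what "scheduling as a search problem" can look like, the sketch below defines a toy action space: swapping two adjacent instructions whenever no register dependency forbids it. This is an illustrative assumption and not necessarily the action space used by CuAsmRL [1]. The mnemonics are SASS-like but the registers and the instruction sequence are invented, and memory dependencies and hardware stall counts are ignored.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dst: str
    srcs: Tuple[str, ...]


def independent(a: Instr, b: Instr) -> bool:
    """True if a (earlier) and b (later) can be reordered without changing results."""
    raw = a.dst in b.srcs   # b reads what a writes
    war = b.dst in a.srcs   # b overwrites what a reads
    waw = a.dst == b.dst    # both write the same register
    return not (raw or war or waw)


def legal_swaps(schedule: List[Instr]) -> List[int]:
    """Indices i such that schedule[i] and schedule[i + 1] may be swapped."""
    return [
        i for i in range(len(schedule) - 1)
        if independent(schedule[i], schedule[i + 1])
    ]


def apply_swap(schedule: List[Instr], i: int) -> List[Instr]:
    """Return a new schedule with instructions i and i + 1 exchanged."""
    new = list(schedule)
    new[i], new[i + 1] = new[i + 1], new[i]
    return new


kernel = [
    Instr("LDG", "R0", ("R10",)),
    Instr("LDG", "R1", ("R11",)),
    Instr("FADD", "R2", ("R0", "R1")),
    Instr("LDG", "R3", ("R12",)),
    Instr("FMUL", "R4", ("R2", "R3")),
]
print(legal_swaps(kernel))  # [0, 2]
```

An RL agent would pick among the legal swaps at each step and receive the measured kernel runtime as feedback; here swap 2 hoists the third load above the `FADD`, which is the kind of latency-hiding move such a search might discover.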

Desirable skills: CUDA, RL, GPU

References:

[1] G. He and E. Yoneki: CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning. 2025. (https://www.cl.cam.ac.uk/~ey204/pubs/2025_CGO.pdf)

7. Multi-Objective Compiler Optimization

Contact: Eiko Yoneki

[1] demonstrated the feasibility of using Reinforcement Learning (RL) to optimize LLVM pass lists, improving runtime performance over heuristic-based baselines. A natural extension is to move beyond single-objective optimisation and treat compiler design as a multi-objective problem. In practice, developers must balance competing goals such as runtime speed, binary size, and energy efficiency. These objectives often conflict: passes that accelerate execution may increase code size or energy usage. This project will extend the RL framework of [1] by defining multi-objective reward functions, enabling agents to explore trade-offs rather than optimise a single performance metric. Approaches include scalarisation (weighted combinations of objectives) and Pareto-based RL, which approximates the set of non-dominated optimization strategies. Benchmarks from PolyBench, CoreMark, and MiBench will be used to evaluate outcomes against LLVM defaults (-O3, -Os). The result will be a prototype system showing how learning-based compilers can adapt policies across multiple optimization criteria, aligning with real-world deployment needs.
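The scalarisation approach can be sketched as a reward that takes a weighted sum of relative improvements over a baseline build. The metric names, measurements, and weights below are hypothetical, and normalising each objective by its baseline value is one design choice among several.

```python
from typing import Dict


def scalarised_reward(
    measured: Dict[str, float],
    baseline: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of fractional reductions relative to the baseline.

    Every metric is treated as a cost (lower is better), so a positive
    reward means the pass list beats the baseline under these weights.
    """
    reward = 0.0
    for name, weight in weights.items():
        improvement = (baseline[name] - measured[name]) / baseline[name]
        reward += weight * improvement
    return reward


# Hypothetical measurements: the candidate pass list is faster but larger.
baseline = {"runtime_s": 2.0, "size_kb": 100.0, "energy_j": 50.0}
candidate = {"runtime_s": 1.6, "size_kb": 120.0, "energy_j": 50.0}

speed_first = {"runtime_s": 0.8, "size_kb": 0.1, "energy_j": 0.1}
size_first = {"runtime_s": 0.1, "size_kb": 0.8, "energy_j": 0.1}

r_speed = scalarised_reward(candidate, baseline, speed_first)  # positive
r_size = scalarised_reward(candidate, baseline, size_first)    # negative
```

The same pass list is rewarded under one weighting and penalised under the other, which is why a single scalarised agent captures only one point of the trade-off and Pareto-based RL is listed as an alternative.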

Desirable skills: RL, Compiler

References:

[1] Yilin Sun: Optimizing LLVM Pass List using Reinforcement Learning.
[2] A. Mammadli: Reinforcement Learning for Compiler Optimization Pass Ordering, 2008.
[3] S. Makula: Compiler Optimization Pass Sequence Exploration using Structured Bayesian Optimization. MSc Dissertation, University of Cambridge, 2017.
[4] Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T.: A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE TEC, 2002.
[5] Van Moffaert, K., & Nowé, A.: Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies. JMLR, 2014.

8. Better sharding strategy search with deep reinforcement learning

Contact: Eiko Yoneki

Deep learning recommendation models (DLRMs) are among the most important applications of deep learning. A key challenge with DLRMs is sharding the embedding tables across multiple devices, which involves column-wise and row-wise sharding. NeuroShard [1] uses a DNN as a cost model to guide the search for a sharding strategy, and finds the strategy with a combination of beam search and greedy search. We would like to see whether deep reinforcement learning (RL) can find a better sharding strategy. Your solution should be compared and benchmarked against [1][2].
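A simple baseline helps frame what "better" means. The sketch below places whole tables greedily, largest cost first, on the least-loaded device. It is not NeuroShard's algorithm: the per-table costs are invented numbers standing in for a learned cost model, and column-wise and row-wise splitting are left out.

```python
import heapq
from typing import Dict, List


def greedy_shard(table_costs: Dict[str, float], n_devices: int) -> List[List[str]]:
    """Assign tables to devices, largest first, always to the least-loaded device."""
    if n_devices < 1:
        raise ValueError("need at least one device")
    # Heap entries are (current load, device index); ties go to the lower index.
    heap = [(0.0, d) for d in range(n_devices)]
    heapq.heapify(heap)
    placement: List[List[str]] = [[] for _ in range(n_devices)]
    for name, cost in sorted(table_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, device = heapq.heappop(heap)
        placement[device].append(name)
        heapq.heappush(heap, (load + cost, device))
    return placement


def max_load(placement: List[List[str]], table_costs: Dict[str, float]) -> float:
    """Cost of the most heavily loaded device, the quantity to minimise."""
    return max(sum(table_costs[t] for t in shard) for shard in placement)


# Hypothetical per-table costs (e.g. estimated lookup time in ms).
costs = {"user": 8.0, "item": 7.0, "ad": 6.0, "geo": 5.0, "tag": 4.0}
placement = greedy_shard(costs, n_devices=2)
print(placement)                   # [['user', 'geo', 'tag'], ['item', 'ad']]
print(max_load(placement, costs))  # 17.0
```

Here greedy reaches a maximum load of 17.0, while the split {user, item} / {ad, geo, tag} achieves 15.0 on both devices. Gaps of this kind are what a learned search policy would aim to close.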

Desirable skills: Python, deep reinforcement learning

References:

[1] Pre-train and Search: Efficient Embedding Table Sharding with Pre-trained Neural Cost Models

[2] AutoShard: Automated Embedding Table Sharding for Recommender Systems

9. Structured Bayesian Optimisation in BoTorch

Contact: Eiko Yoneki

Optimising system performance is challenging due to the high computational cost of performance evaluation, which is time-consuming and involves navigating a vast search space of variables. Traditional techniques like Bayesian Optimisation (BO) [2] often take a long time to converge and struggle with high-dimensional parameter spaces. However, incorporating structural information into surrogate models can significantly accelerate BO's convergence.

This project explores DagBO [1][3], an open-source extension of BO developed by our group, which allows for user-definable surrogate models based on directed acyclic graphs (DAGs). The project will investigate DagBO's performance in tuning Spark benchmarks compared to traditional BO. Potential extensions of the project include distributed computing and the development of sub-models within DagBO.
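For readers new to BO, the step that decides which configuration to evaluate next can be sketched with Expected Improvement (EI), a common acquisition function. The surrogate's posterior mean and standard deviation are taken as given here; how DagBO's DAG-structured surrogates produce them, and which acquisition function it uses, is not shown. The candidate names and numbers are hypothetical.

```python
import math
from typing import Dict, Tuple


def expected_improvement(mean: float, std: float, best_so_far: float) -> float:
    """EI for minimisation under a Gaussian posterior N(mean, std**2)."""
    if std <= 0.0:
        return max(best_so_far - mean, 0.0)
    z = (best_so_far - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best_so_far - mean) * cdf + std * pdf


def next_candidate(posterior: Dict[str, Tuple[float, float]], best_so_far: float) -> str:
    """Pick the candidate with the highest EI (ties broken by name)."""
    return max(
        sorted(posterior),
        key=lambda name: expected_improvement(*posterior[name], best_so_far),
    )


# Hypothetical posterior over job runtime in seconds: name -> (mean, std).
posterior = {
    "executors=8": (9.5, 0.1),    # slightly better than the incumbent, low uncertainty
    "executors=32": (10.5, 2.0),  # worse on average, but very uncertain
}
choice = next_candidate(posterior, best_so_far=10.0)
```

With these numbers the uncertain candidate is chosen, which illustrates the exploration behaviour that makes each evaluation count when a single benchmark run is expensive.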

Desirable skills: Computer Systems, Bayesian Optimisation, Spark

References:

[1] Ross Tooley: Auto-tuning Spark with Bayesian optimisation.

[2] Jonas B. Mockus. The Bayesian approach to global optimization. Freie Univ., Fachbereich Mathematik, 1984.

[3] https://github.com/Tyv217/dagbo/

10. Bayesian Optimisation for tensor code generation

Contact: Eiko Yoneki

Tensor codes are run on massively parallel hardware. When generating tensor code (also called auto-scheduling), the TVM compiler [1] needs to search through many parameters. A state-of-the-art auto-scheduler is Ansor [2], which applies rule-based mutation to generate tensor code templates and then fine-tunes those templates via evolutionary search. We think Bayesian Optimisation (BayesOpt) [3] is a more efficient way to search the tensor code templates than evolutionary search.

First, TVM will be set up for benchmarking with ResNet and BERT on CPU and GPU (possibly a few different types of GPU). Next, the same benchmarks should be run with NVIDIA's Cutlass [8].

Afterwards, the project will explore using BayesOpt for high-performance tensor code generation and benchmark black-box algorithms for tensor code generation. The main interface for tensor code generation in TVM will be MetaScheduler [6], which provides a fairly simple Python interface for various search methodologies [7]. We also have a particular interest in tensor code generation for tensor cores, which recent generations of GPUs (since the Turing micro-architecture) provide as a domain-specific architecture to massively accelerate tensor programs.

The project can take advantage of a former student's work [9][10] and focus on performance improvement on GPU, multi-objective BO, and scalability.

Desired Skills: Strong interest in tensor program optimisation, Some knowledge/interest in Bayesian optimisation, Python, with some knowledge in C++

References:
[1] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning https://www.usenix.org/system/files/osdi18-chen.pdf.
[2] Ansor: Generating High-Performance Tensor Programs for Deep Learning https://www.usenix.org/system/files/osdi20-zheng.pdf.
[3] A Tutorial on Bayesian Optimization https://arxiv.org/abs/1807.02811.
[4] HEBO Pushing The Limits of Sample-Efficient Hyperparameter Optimisation https://arxiv.org/abs/2012.03826.
[5] Are Random Decompositions all we need in High Dimensional Bayesian Optimisation? https://arxiv.org/abs/2301.12844.
[6] Tensor Program Optimization with Probabilistic Programs https://arxiv.org/abs/2205.13603.
[7] https://github.com/apache/tvm/blob/4267fbf6a173cd742acb293fab4f77693dc4b887/python/tvm/meta_schedule/search_strategy/search_strategy.py#L238.
[8] NVIDIA Compiler https://github.com/NVIDIA/cutlass.

[9] https://github.com/hgl71964/unity-tvm

[10] Discovering Performant Tensor Programs with Bayesian Optimization

11. Tensor expression superoptimisation via deep reinforcement learning

Contact: Eiko Yoneki

AlphaDev [1][2] shows that an RL agent can discover faster sorting algorithms by playing an assembly game. Among recent advances in machine learning compilers, EinNet [3][4] shows how to discover faster tensor programs via rule-based transformations. We are interested in investigating how well RL can discover faster tensor programs through transformations at the tensor expression level. You should implement a graph RL agent that plays a tensor game to discover faster tensor programs. Alternatively, we have an internal RL-driven graph transformation system, which you can leverage and apply to tensor-expression-level transformation. To get started, build and benchmark EinNet to see how it works with rule-based transformation on a simple DNN (e.g. self-attention). Then replace its search algorithm with an RL algorithm.

Desired Skills: Programming with Python and C++, and Python-to-C++ bindings, e.g. pybind11

References:
[1] Faster sorting algorithms discovered using deep reinforcement learning, Nature, 2023.
[2] https://github.com/google-deepmind/alphadev.
[3] EINNET: Optimizing Tensor Programs with Derivation-Based Transformations. OSDI, 2023.
[4] https://github.com/InfiniTensor/InfiniTensor.
[5] https://github.com/ucamrl/xrlflow/tree/main.

12. Advancing Computer Systems Optimisation through Model-Based Reinforcement Learning

Contact: Eiko Yoneki

Reinforcement Learning (RL) is gaining interest as a generic optimisation and control method in data management tasks such as resource management, scheduling, database tuning, and stream processing. However, its application is hindered by sample inefficiency and long decision evaluation times. Model-Based Reinforcement Learning (MBRL), which employs techniques such as probabilistic ensemble-based models [1], World Models [2], and Dyna-style planning [4], addresses these issues by learning models of the environment. This project tackles pass list selection in LLVM compiler optimisation. Pass list selection is critical because it affects code performance and size, and MBRL's ability to model these interactions is valuable. A further challenge for MBRL is improving generalisation across different programs, which is essential because training RL agents is expensive. You would use MBRL for intelligent pass list selection in LLVM and work on enhancing generalisation across programs. You would investigate various MBRL techniques and evaluate their effectiveness in modelling the compiler environment and improving transfer learning. You would also evaluate a world-model-based approach against a model-free approach using software infrastructure. The previous project work [3] can be used as a starting point. Evaluating the project with the SPEC CPU 2017 benchmarks (https://www.spec.org/cpu2017/) would be ideal.
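The core idea of MBRL, learning a model from logged experience and then planning inside it instead of running more expensive evaluations, can be shown with a tabular, deterministic sketch. The states, pass names, and rewards are invented. A real LLVM environment would need learned function approximators such as the probabilistic ensembles of [1] or the world models of [2] rather than a lookup table.

```python
from typing import Dict, List, Tuple

# (state, action, reward, next_state); rewards are hypothetical speed-up gains.
Transition = Tuple[str, str, float, str]
Model = Dict[Tuple[str, str], Tuple[float, str]]


def fit_model(transitions: List[Transition]) -> Model:
    """Tabular model: remember the last observed outcome of each (state, action)."""
    model: Model = {}
    for state, action, reward, next_state in transitions:
        model[(state, action)] = (reward, next_state)
    return model


def plan(model: Model, gamma: float = 0.9, sweeps: int = 50):
    """Value iteration on the learned model; no further environment calls needed."""
    states = {s for s, _ in model}
    values = {s: 0.0 for s in states}

    def q(state: str, action: str) -> float:
        reward, nxt = model[(state, action)]
        # States never seen as a source (e.g. terminal ones) have value 0.
        return reward + gamma * values.get(nxt, 0.0)

    def actions(state: str) -> List[str]:
        return sorted(a for s, a in model if s == state)

    for _ in range(sweeps):
        for s in states:
            values[s] = max(q(s, a) for a in actions(s))
    policy = {s: max(actions(s), key=lambda a: q(s, a)) for s in states}
    return policy, values


# Logged experience from a toy pass-ordering environment.
log = [
    ("start", "inline", 0.0, "inlined"),
    ("start", "unroll", 0.1, "unrolled"),
    ("inlined", "unroll", 1.0, "done"),
    ("unrolled", "inline", 0.2, "done"),
]
policy, values = plan(fit_model(log))
print(policy["start"])  # inline
```

Planning picks `inline` first even though `unroll` has the higher immediate reward, because the model predicts a larger pay-off one step later. A model-free agent would need many more real compilations to learn the same thing.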

Desired Skills: Reinforcement Learning, LLVM Optimisation

References:
[1] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. 2018. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. NIPS'18. https://dl.acm.org/doi/10.5555/3327345.3327385.
[2] World Models. https://arxiv.org/abs/1803.10122.
[3] Yilin Sun: Optimizing LLVM Pass List using Reinforcement Learning.
[4] Sutton, Richard S., et al. Dyna-style planning with linear function approximation and prioritized sweeping. https://arxiv.org/pdf/1206.3285.pdf.

13. ML-compiler: Joint optimisation of tensor graph and GPU kernels

Contact: Eiko Yoneki

TASO [2] optimises tensor graph structure, and its backend uses cuDNN to execute tensor operators [1]. A recent library, Cutlass, allows users to customise their own kernels, much as if they were configuring the hardware themselves. This opens up the opportunity to jointly optimise the tensor graph and the GPU kernels. The project aims to replace TASO's backend, moving from cuDNN [3] to Cutlass [4]. Ideal candidates should have some experience writing C++ code and preferably know the basics of CUDA programming. Candidates can get started by benchmarking common tensor operators, such as MatMul and Conv2D, and comparing performance between different ML compilers. It is a great fit for those who want to gain a deeper understanding of ML compilers.

[1] End-to-end comparison.
[2] Z. Jia, et al.: TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions, SOSP, 2019.
[3] CuDNN: NVIDIA cuDNN Installation Guide
[4] Cutlass: CUDA C++ template abstractions

14. Cost modelling for tensor programs

Contact: Eiko Yoneki

Cost models provide cheap estimates of the performance of tensor programs without actual execution, so they are at the heart of accelerating tensor program generation. For example, TVM [1] uses XGBoost [2] to estimate tensor program runtime. More recently, advanced DNN-based cost models, such as GNNs [3] and transformers [4], have been proposed to give more accurate estimates. Cost modelling can even be cross-hardware [5]. Distributed DNN training also needs careful cost modelling to provide cheap estimates of the throughput of parallelisation strategies, as in [6] and [7]. However, distributed cost modelling is mostly based on mathematical estimation. This project aims to investigate and benchmark state-of-the-art cost modelling. In particular, we want to understand how good those cost models are and, if possible, to replace the distributed cost modelling with a DNN-based cost model.

[1] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.
[2] AutoTVM: Learning to Optimize Tensor Programs.
[3] A Graph Neural Network-Based Performance Model for Deep Learning Applications.

Contact Email

Please email eiko.yoneki@cl.cam.ac.uk with any questions.

Computer Laboratory