MPhil, Part III, and Part II Project Suggestions (2025-2026)
Please contact Eiko Yoneki (email: eiko.yoneki@cl.cam.ac.uk) if you are interested in any project below.
1. Multi Objective Scheduling of Distributed LLM Serving on Heterogeneous GPUs
Contact: Eiko Yoneki
Heterogeneous GPU clusters offer a potential way to mitigate the significant operational cost of Large Language Model (LLM) inference. However, existing scheduling systems do not adequately address the unique computational patterns of new Mixture-of-Experts (MoE) models within these complex environments. This project addresses this gap by developing and evaluating a novel scheduling algorithm specifically designed to optimise MoE model inference on heterogeneous GPU clusters. Initial work was explored in [1], where the algorithm is separated into two distinct processing stages: an outer loop and an inner loop. The outer loop uses Bayesian Optimisation (BO) to efficiently search the complex configuration space, partitioning an inventory of different GPUs into small optimal islands. For the inner loop, that work implements a new linear programming formulation that precisely maps workload ranges, categorised by input sequence length and separated into prefill and decode phases, to the generated islands. In this project, multi-objective optimisation (e.g. multi-objective BO) over objectives such as latency, accuracy, and/or power consumption will be explored. The scheduling algorithm in [1] was evaluated using a simulation framework on real-world workload traces. The aim is to demonstrate that a workload-aware scheduling approach can unlock substantial performance and cost-efficiency gains for serving large-scale MoE models in complex, heterogeneous environments.
Desirable skills: BO, LLM, GPU, Multi-Objective
References:
[1] Nathan Rignall: Scheduling of Distributed LLM Serving on Heterogeneous GPUs.
[2] S. Alabed: BoGraph: Structured Bayesian Optimization From Logs for Systems with High-dimensional Parameter Space.
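To make the multi-objective part of this project concrete, the following minimal sketch filters a set of candidate cluster configurations down to its Pareto front over two objectives that are both minimised. The configuration names and the latency and power figures are made-up illustrative values, not results from [1], and `dominates` and `pareto_front` are hypothetical helper names.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimised)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of all candidates that no other candidate dominates."""
    return sorted(
        name
        for name, objectives in candidates.items()
        if not any(dominates(other, objectives) for other in candidates.values())
    )


# Hypothetical (latency in ms, power in W) per candidate configuration.
configs = {
    "A": (120.0, 900.0),
    "B": (150.0, 700.0),
    "C": (160.0, 950.0),  # worse than A on both objectives
    "D": (200.0, 650.0),
}

front = pareto_front(configs)
print(front)
```

A multi-objective optimiser would propose the candidates rather than enumerate them, but the dominance test used to report the resulting trade-off curve is the same.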
2. Multi-model serving on heterogeneous clusters
Contact: Eiko Yoneki
Serving multiple LLMs (such as GPT, Llama, OPT, Falcon, etc.) efficiently in heterogeneous clusters presents unique challenges due to varying computational power, communication bandwidth, memory bandwidth, and memory limits across different types of GPUs [1]. The project aims to extend multi-model serving [2] to heterogeneous clusters. The initial phase involves setting up a benchmarking suite to evaluate different model-serving frameworks, such as vLLM [3] and DistServe [4], on various cluster configurations. Subsequently, the project will focus on designing a custom LLM serving framework that uses dynamic resource allocation to optimise throughput and latency across heterogeneous environments. This involves developing algorithms for intelligent job scheduling and resource management that consider the characteristics of each cluster node. The goal is to enhance the efficiency and scalability of serving multiple models in diverse computing environments, which is critical for applications in areas like autonomous driving and real-time data analytics. There is an ongoing project in our group on this topic, so the student can take advantage of the platform already built and focus on benchmarking tasks and on extending the scheduling algorithms.
Desirable skills: Python/CUDA/C++ programming, LLM inference and serving
References:
[1] Jiang, Youhe, et al. "HexGen: Generative Inference of Large Language Model over Heterogeneous Environment." Forty-first International Conference on Machine Learning.
3. Probabilistic Inference for Reward Specification in Reinforcement Learning
Contact: Eiko Yoneki
Reinforcement learning (RL) relies on hand-crafted reward functions, which are often difficult to design and may not fully capture task goals. Probabilistic inference provides an alternative view, treating decision making as an inference task where rewards can be derived from probabilistic models rather than explicitly specified.
Desirable skills: HMM, RL, Probabilistic Model
References:
[1] Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909.
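As a small illustration of the inference view, the formulation reviewed in [1] introduces a binary optimality variable whose probability is exp(r(s, a)), and with a uniform action prior the resulting backups use a "soft" maximum, V(s) = log sum_a exp(Q(s, a)), with policy pi(a|s) = exp(Q(s, a) - V(s)). The sketch below computes both for a single state; the Q-values are hypothetical, and the function names are illustrative only.

```python
import math
from typing import List


def soft_value(q_values: List[float]) -> float:
    """Soft maximum V = log(sum_a exp(Q(a))), computed stably by
    subtracting the largest Q-value before exponentiating."""
    m = max(q_values)
    return m + math.log(sum(math.exp(q - m) for q in q_values))


def soft_policy(q_values: List[float]) -> List[float]:
    """Action distribution pi(a) = exp(Q(a) - V)."""
    v = soft_value(q_values)
    return [math.exp(q - v) for q in q_values]


# Hypothetical Q-values for three actions in one state.
q = [0.0, -1.0, -3.0]
v = soft_value(q)
pi = soft_policy(q)
print(v, pi)
```

The soft value is never below the hard maximum of the Q-values, and the policy places most of its mass on the best action while keeping non-zero probability on the others.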
4. BO+RL (or Bandit) for Multi-Model LLM Serving
Contact: Eiko Yoneki
This project proposes a two-timescale, learning-augmented framework for multi-model LLM serving on heterogeneous GPU clouds. The outer loop performs static optimisation, selecting model placements and replica counts via cost-aware Bayesian Optimisation that accounts for resource limits, reconfiguration overheads, and workload uncertainty. The inner loop performs dynamic optimisation, using safe reinforcement learning and bandits for online routing, batching, and draft-verify orchestration, informed by output length prediction and short-horizon demand forecasts. The formulation is explicitly multi-objective: maximise goodput while satisfying latency SLOs (TTFT, end-to-end), and jointly reduce cloud cost, energy, and unfairness using constraints and CVaR/tail controls. The loops are coupled by contracts (budgets, SLO targets) and telemetry (utilisation, tail latency), enabling robust adaptation. Evaluation on real traces and mixed GPU clusters will produce Pareto fronts and ablations (heuristics, BO-only, RL-only, BO+RL), demonstrating scalable, adaptive, and economical LLM serving.
Desirable skills: Python, LLM, BO, RL
References:
[1] Kwon, W. et al. Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). SOSP 23.
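The bandit side of the inner loop can be sketched with the standard UCB1 algorithm used as a request router: each replica is an arm, and the reward is 1 when a request meets its latency SLO. This is a plain UCB1 sketch, not the safe RL method the project proposes; the class name `UCB1Router`, the simulated SLO hit rates, and the step count are all assumptions made for illustration.

```python
import math
import random
from typing import List


class UCB1Router:
    """Routes each request to the replica with the highest UCB1 score."""

    def __init__(self, n_replicas: int) -> None:
        self.counts = [0] * n_replicas   # requests sent to each replica
        self.means = [0.0] * n_replicas  # running mean reward per replica

    def select(self) -> int:
        # Try every replica once before using the confidence bound.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        total = sum(self.counts)
        scores = [
            m + math.sqrt(2.0 * math.log(total) / c)
            for m, c in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, replica: int, reward: float) -> None:
        self.counts[replica] += 1
        self.means[replica] += (reward - self.means[replica]) / self.counts[replica]


def simulate(slo_hit_rates: List[float], steps: int = 2000, seed: int = 0) -> List[int]:
    """Simulate routing where replica i meets the SLO with probability slo_hit_rates[i]."""
    rng = random.Random(seed)
    router = UCB1Router(len(slo_hit_rates))
    for _ in range(steps):
        replica = router.select()
        reward = 1.0 if rng.random() < slo_hit_rates[replica] else 0.0
        router.update(replica, reward)
    return router.counts


# Hypothetical per-replica probabilities of meeting the latency SLO.
counts = simulate([0.2, 0.9, 0.4])
print(counts)
```

In a real system the reward would come from telemetry rather than a simulated coin flip, and the router would also need to respect the budgets and SLO targets handed down by the outer loop.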
5. Bandwidth-Aware Parallelism Planning for Distributed LLM Training and Inference
Contact: Eiko Yoneki
Cloud environments offer a wide range of communication bandwidths, for instance NVLink (800/600/400 GBps, depending on generation), InfiniBand (400 GBps), RoCE (40 GBps), TCP (1-5 GBps), PCIe (20-80 GBps), and Ethernet (more than 1 GBps), and there are multiple parallelism strategies to choose from, including data parallelism [1], tensor parallelism [2,3], pipeline parallelism [4,5], expert parallelism, and sequence parallelism [6]. It is therefore essential to design an automatic-parallelism method that takes the physical network topology as input and outputs a parallelism plan that minimises communication volume on low-bandwidth links and maximises communication volume on high-bandwidth links [7,8,9]. The system should build a network-topology-aware optimisation engine that automatically profiles the physical infrastructure and searches over parameter slice allocations, and the corresponding communication operations among slices, to minimise overall communication overhead. The key metrics to measure include end-to-end training time, communication-to-computation ratio, network utilisation efficiency across different bandwidth tiers, and performance compared with existing frameworks like Megatron-LM [10,11]. First, you need to understand the detailed characteristics of the different parallelism strategies. You then build an appropriate model, integrate some of these characteristics into an existing simulator, obtain simulation results that reflect real-world experimental results, and finally complete the automatic search that could guide real-world deployment.
Desirable skills: Distributed systems programming, performance modelling and profiling, network topology analysis
References:
[1] Li, M., et al. "Scaling Distributed Machine Learning with the Parameter Server." OSDI 2014
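The kind of cost model such a planner needs can be sketched with the standard bandwidth-only estimate for a ring all-reduce, in which each device transfers 2(n-1)/n times the message size. The example compares two placements of a heavy and a light collective on a fast and a slow link. The link speeds, message sizes, and group size are hypothetical values chosen for illustration, and per-message latency is ignored.

```python
def ring_allreduce_seconds(
    message_bytes: float, n_devices: int, bandwidth_bytes_per_s: float
) -> float:
    """Bandwidth-only time estimate for a ring all-reduce over n devices."""
    if n_devices <= 1:
        return 0.0
    return 2.0 * (n_devices - 1) / n_devices * message_bytes / bandwidth_bytes_per_s


GB = 1e9
FAST = 400 * GB  # hypothetical high-bandwidth link, bytes per second
SLOW = 5 * GB    # hypothetical low-bandwidth link, bytes per second

HEAVY = 8 * GB   # hypothetical volume of the communication-heavy collective
LIGHT = 1 * GB   # hypothetical volume of the communication-light collective


def plan_cost(heavy_bw: float, light_bw: float, group_size: int = 8) -> float:
    """Total time when the heavy and light collectives use the given links."""
    return ring_allreduce_seconds(HEAVY, group_size, heavy_bw) + ring_allreduce_seconds(
        LIGHT, group_size, light_bw
    )


good = plan_cost(heavy_bw=FAST, light_bw=SLOW)  # heavy traffic on the fast link
bad = plan_cost(heavy_bw=SLOW, light_bw=FAST)   # heavy traffic on the slow link
print(good, bad)
```

An automatic planner would evaluate a model like this for every candidate mapping of parallelism groups onto the profiled topology and keep the cheapest plan.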
6. Optimisation of scheduling GPU-native assembly with Reinforcement Learning
Contact: Eiko Yoneki
This project builds an optimiser for scheduling GPU-native assembly. The closed-source nature of GPU compilers hinders the tuning of GPU kernel performance on the hardware platform, and the default compiler-generated assembly schedules are not optimal for critical kernels. This project addresses these issues by converting the traditional manual scheduling process into a search problem and then employing an RL agent to search for optimal assembly schedules using runtime feedback. The project starts by reproducing CuAsmRL [1], which explores optimising NVIDIA GPU CUDA SASS schedules via deep reinforcement learning. Once CuAsmRL is understood, further optimisation will be tried on other GPUs. If time allows, porting CuAsmRL to the Tenstorrent (https://tenstorrent.com/en) environment will be explored.
Desirable skills: CUDA, RL, GPU
References:
[1] G. He and E. Yoneki: CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning. 2025. (https://www.cl.cam.ac.uk/~ey204/pubs/2025_CGO.pdf)
7. Multi-Objective Compiler Optimization
Contact: Eiko Yoneki
[1] demonstrated the feasibility of using Reinforcement Learning (RL) to optimise LLVM pass lists, improving runtime performance over heuristic-based baselines. A natural extension is to move beyond single-objective optimisation and treat compiler optimisation as a multi-objective problem. In practice, developers must balance competing goals such as runtime speed, binary size, and energy efficiency. These objectives often conflict: passes that accelerate execution may increase code size or energy usage. This project will extend the RL framework of [1] by defining multi-objective reward functions, enabling agents to explore trade-offs rather than optimise a single performance metric. Approaches include scalarisation (weighted combinations of objectives) and Pareto-based RL, which approximates the set of non-dominated optimisation strategies. Benchmarks from PolyBench, CoreMark, and MiBench will be used to evaluate outcomes against LLVM defaults (-O3, -Os). The result will be a prototype system showing how learning-based compilers can adapt policies across multiple optimisation criteria, aligning with real-world deployment needs.
Desirable skills: RL, Compiler
References:
[1] Yilin Sun: Optimizing LLVM Pass List using Reinforcement Learning.
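The scalarisation approach mentioned in this project can be sketched as a weighted sum of relative improvements over a baseline build. The metric names, the measurements, and the weight vectors below are hypothetical, and the reward used in [1] may be defined differently.

```python
from typing import Dict


def scalarised_reward(
    metrics: Dict[str, float],
    baseline: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of relative improvements over the baseline.
    All metrics are treated as lower-is-better."""
    reward = 0.0
    for name, weight in weights.items():
        improvement = (baseline[name] - metrics[name]) / baseline[name]
        reward += weight * improvement
    return reward


# Hypothetical measurements for a baseline build and a candidate pass list.
baseline = {"runtime_s": 2.0, "size_kb": 400.0, "energy_j": 10.0}
candidate = {"runtime_s": 1.6, "size_kb": 440.0, "energy_j": 9.0}

# Two hypothetical weightings: one favours speed, the other binary size.
speed_weights = {"runtime_s": 1.0, "size_kb": 0.0, "energy_j": 0.0}
size_weights = {"runtime_s": 0.0, "size_kb": 1.0, "energy_j": 0.0}

r_speed = scalarised_reward(candidate, baseline, speed_weights)
r_size = scalarised_reward(candidate, baseline, size_weights)
print(r_speed, r_size)
```

The same candidate is rewarded under the speed weighting and penalised under the size weighting, which is the conflict that Pareto-based methods try to expose without fixing the weights in advance.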
8. Better sharding strategy search with deep reinforcement learning
Contact: Eiko Yoneki
Deep learning recommender models (DLRMs) are one of the most important applications of deep learning. The challenge with DLRMs is to shard the embedding tables across multiple devices, which involves column-wise and row-wise sharding. NeuroShard [1] proposes to use a DNN as a cost model to guide the search for a sharding strategy, and then uses a combination of beam search and greedy search to find the strategy. We would like to see whether deep reinforcement learning (RL) can find a better sharding strategy. Your solution should be compared and benchmarked against [1][2].
Desirable skills: Python, deep reinforcement learning
References:
[1] Pre-train and Search: Efficient Embedding Table Sharding with Pre-trained Neural Cost Models
[2] AutoShard: Automated Embedding Table Sharding for Recommender Systems
9. Structured Bayesian Optimisation in BoTorch
Contact: Eiko Yoneki
Optimising system performance is challenging because performance evaluation is computationally expensive and time-consuming, and the search space of variables is vast. Traditional techniques like Bayesian Optimisation (BO) [2] often take a long time to converge and struggle with high-dimensional parameter spaces. However, incorporating structural information into surrogate models can significantly accelerate BO's convergence. This project explores DagBO [1][3], an open-source extension of BO developed by our group, which allows for user-definable surrogate models based on directed acyclic graphs (DAGs). The project will investigate DagBO's performance in tuning Spark benchmarks compared to traditional BO. Potential extensions of the project include distributed computing and the development of sub-models within DagBO.
Desirable skills: Computer Systems, Bayesian Optimisation, Spark
References:
[1] Ross Tooley: Auto-tuning Spark with Bayesian optimisation.
[2] Jonas B. Mockus. The bayesian approach to global optimization. Freie Univ., Fachbereich Mathematik, 1984.
[3] https://github.com/Tyv217/dagbo/
10. Bayesian Optimisation for tensor code generation
Contact: Eiko Yoneki
Tensor codes are run on massively parallel hardware. When generating tensor code (also called auto-scheduling), the TVM compiler [1] needs to search through many parameters. A state-of-the-art auto-scheduler is Ansor [2], which applies rule-based mutation to generate tensor code templates and then fine-tunes those templates via Evolutionary Search. We think Bayesian Optimisation (BayesOpt) [3] is a more efficient way to search the tensor code templates than Evolutionary Search. First, TVM is set up for benchmarking with ResNet and BERT on CPU and GPU (possibly a few different types of GPU). Next, the same benchmarking should be carried out with NVIDIA's Cutlass. Afterwards, the project explores using BayesOpt for high-performance tensor code generation and benchmarks black-box algorithms for tensor code generation. The main interface for tensor code generation in TVM will be MetaScheduler [6], which provides a fairly simple Python interface for various search methodologies [7]. We also have a particular interest in tensor code generation for tensor cores, which recent generations of GPUs (since the Turing micro-architecture) provide as a domain-specific architecture to massively accelerate tensor programs. The project can take advantage of a former student's work [9][10] and focus on performance improvement on GPU, multi-objective BO, and scalability.
Desired Skills: Strong interest in tensor program optimisation, some knowledge of or interest in Bayesian optimisation, Python, with some knowledge of C++
References:
[9] https://github.com/hgl71964/unity-tvm
[10] Discovering Performant Tensor Programs with Bayesian Optimization
11. Tensor expression superoptimisation via deep reinforcement learning
AlphaDev [1][2] shows that an RL agent can discover faster sorting algorithms by playing an assembly game. Among recent advances in machine learning compilers, EinNet [3][4] shows how to discover faster tensor programs via rule-based transformation. We are interested in investigating how well RL works for discovering faster tensor programs through tensor-expression-level transformations. You should implement a graph RL agent that plays the tensor game to discover faster tensor programs. Alternatively, we have an internal RL-driven graph transformation system, which you can leverage and apply to tensor-expression-level transformation. To get started, build and benchmark EinNet to see how it works with rule-based transformation on a simple DNN (e.g. self-attention). Then, replace its search algorithm with an RL algorithm.
Desired Skills: Inverse Reinforcement Learning, LLVM Optimisation
References:
[1] Ng, Andrew Y., and Stuart Russell: Algorithms for inverse reinforcement learning. ICML 2000.
Contact: Eiko Yoneki
Reinforcement Learning (RL) is gaining interest as a generic optimisation and control method in data management tasks such as resource management, scheduling, database tuning, or stream processing. However, its application is hindered by sample inefficiency and long decision evaluation times. Model-Based Reinforcement Learning (MBRL), employing techniques such as probabilistic ensemble-based models [1], World Models [2], and Dyna-style planning [4], addresses these issues by learning environment models. This project tackles LLVM compiler optimisation, specifically the selection of the pass list. This is critical because pass list selection affects code performance and size, and MBRL's ability to model these interactions is valuable. An additional challenge for MBRL is improving generalisation across different programs, which is essential because training RL agents is expensive. You would use MBRL for intelligent pass list selection in LLVM and work on enhancing generalisation across programs. You would investigate various MBRL techniques and evaluate their effectiveness in modelling the compiler environment and improving transfer learning. You would also evaluate a world-model-based approach against a model-free approach using software infrastructure. The previous project work [3] can be used as a starting point. Evaluating the project with the SPEC 2017 benchmarks (https://www.spec.org/cpu2017/) would be ideal.
Desired Skills: Reinforcement Learning, LLVM Optimisation
References:
Contact: Eiko Yoneki
TASO [2] optimises tensor graph structure, and its backend uses CuDNN to execute tensor operators [1]. A recent hardware library, Cutlass, allows users to customise their own kernels, as if they could configure their hardware. This brings the opportunity to jointly optimise the tensor graph and GPU kernels. The project aims to replace TASO's backend, moving from CuDNN [3] to Cutlass [4].
Ideal candidates should have some experience writing C++ code and preferably know the basics of CUDA programming. Candidates can get started by benchmarking common tensor operators, such as MatMul and Conv2D, and comparing the performance of different ML compilers. It is a great fit for those who want to understand ML compilers in more depth.
[1] End-to-end comparison.
14. Cost modelling for tensor programs
Contact: Eiko Yoneki
Cost models provide cheap estimates of the performance of tensor programs without actual execution, so they are at the heart of accelerating tensor program generation. For example, TVM [1] uses XGBoost [2] to estimate tensor program runtime. More recently, advanced DNN-based cost models, such as GNNs [3] and transformers [4], have been proposed to give more accurate estimates. Cost modelling can even be cross-hardware [5]. Distributed DNN training also needs careful cost modelling to provide cheap estimates of the throughput of parallelisation strategies, as in [6] and [7]. However, distributed cost modelling is mostly mathematical estimation. This project aims to investigate and benchmark state-of-the-art cost modelling. In particular, we want to understand how good those cost models are and, if possible, replace the distributed cost modelling with a DNN-based cost model.
[1] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.
Contact Email
Please email eiko.yoneki@cl.cam.ac.uk with any questions.