MPhil, Part III, and Part II Project Suggestions (2024-2025)
Please contact Eiko Yoneki (email: eiko.yoneki@cl.cam.ac.uk) if you are interested in any project below. * indicates high recommendation!
1. Better sharding strategy search with deep reinforcement learning
Contact: Eiko Yoneki
Deep learning recommendation models (DLRMs) are among the most important applications of deep learning. A central challenge of DLRMs is sharding the embedding tables across multiple devices, which involves both column-wise and row-wise sharding. NeuroShard [1] uses a DNN as a cost model to guide the search for a sharding strategy, and then applies a combination of beam search and greedy search to find one. We would like to see whether deep reinforcement learning (RL) can find a better sharding strategy. Your solution should be benchmarked against [1][2].
Desirable skills: Python, deep reinforcement learning
References:
[1] Pre-train and Search: Efficient Embedding Table Sharding with Pre-trained Neural Cost Models
[2] AutoShard: Automated Embedding Table Sharding for Recommender Systems
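To make the search space concrete, here is a minimal sketch (not the NeuroShard/AutoShard implementation) of a gym-style environment in which an RL agent places embedding tables on devices one at a time; estimate_cost is a placeholder for a learned cost model such as the pre-trained DNN of [1]:

    import numpy as np

    class ShardingEnv:
        def __init__(self, table_sizes, num_devices):
            self.table_sizes = table_sizes          # rows per embedding table
            self.num_devices = num_devices
            self.reset()

        def reset(self):
            self.placement = []                     # device id per table placed so far
            return self._obs()

        def _obs(self):
            load = np.zeros(self.num_devices)
            for t, d in enumerate(self.placement):
                load[d] += self.table_sizes[t]
            return load

        def estimate_cost(self, load):
            # Placeholder: latency dominated by the most loaded device.
            return float(load.max())

        def step(self, device):
            self.placement.append(device)
            done = len(self.placement) == len(self.table_sizes)
            # Negative estimated latency as reward once all tables are placed.
            reward = -self.estimate_cost(self._obs()) if done else 0.0
            return self._obs(), reward, done, {}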
2. Enhancing Generalization through Task Vector Fusion for Query Optimizer *
Contact: Eiko Yoneki
Recent advances in integrating Reinforcement Learning (RL) with database systems, notably in areas like query plan optimization [3], underscore the growing interest in exploiting RL's strengths in scheduling and in solving complex combinatorial optimization problems. The project aims to enhance our previous RL-based query optimizer, Aurora [2], which uses equality saturation and query graph transformation to achieve efficient query execution. Like other RL models, Aurora struggles to adapt to evolving database workloads and requires retraining when conditions change. To address this challenge, we propose a novel approach that dynamically adapts to new workloads with minimal retraining. By leveraging task vectors [1], parameter sets specific to individual workloads extracted from previously trained models, we can perform arithmetic operations to predict optimal parameter adjustments for new tasks. This method allows existing optimizers to maintain high performance amid changing demands, improving both efficiency and usability.
Desirable skills: Reinforcement Learning, Equality Saturation, Database, Query Optimization
References:
[1] Ilharco G, Ribeiro M T, Wortsman M, et al. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.
[2] Bărbulescu, George-Octavian, et al. "Learned Graph Rewriting with Equality Saturation: A New Paradigm in Relational Query Rewrite and Beyond." arXiv preprint arXiv:2407.12794 (2024).
[3] Kraska T, Alizadeh M, Beutel A, et al. SageDB: A learned database system. 2021.
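A minimal sketch of task arithmetic in the sense of [1]: a task vector is the element-wise difference between fine-tuned and base parameters, and vectors from several workloads can be combined with scaling coefficients. The coefficients and state dicts here are illustrative only:

    import torch

    def task_vector(base_state, tuned_state):
        # tau = theta_finetuned - theta_base, per parameter tensor
        return {k: tuned_state[k] - base_state[k] for k in base_state}

    def apply_task_vectors(base_state, vectors, coeffs):
        # theta_new = theta_base + sum_i lambda_i * tau_i
        merged = {k: v.clone() for k, v in base_state.items()}
        for vec, lam in zip(vectors, coeffs):
            for k in merged:
                merged[k] += lam * vec[k]
        return merged

    # Usage sketch: build vectors from optimizers trained on two old workloads,
    # then initialise the optimizer for a new workload from their weighted sum.
    # merged = apply_task_vectors(
    #     base, [task_vector(base, wl_a), task_vector(base, wl_b)], [0.5, 0.5])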
3. Multi-model Serving on Heterogeneous Clusters *
Contact: Eiko Yoneki
Serving multiple LLMs (such as GPT, Llama, OPT, and Falcon) efficiently on heterogeneous clusters presents unique challenges due to the varying computational power, communication bandwidth, memory bandwidth, and memory limits of different types of GPUs [1]. The project aims to extend multi-model serving [2] to heterogeneous clusters. The initial phase involves setting up a benchmarking suite to evaluate model-serving frameworks such as vLLM [3] and DistServe [4] on various cluster configurations. Subsequently, the project will focus on designing a custom LLM serving framework that leverages dynamic resource allocation to optimize throughput and latency across heterogeneous environments. This involves developing algorithms for intelligent job scheduling and resource management that consider the unique characteristics of each cluster node. The goal is to enhance the efficiency and scalability of serving multiple models in diverse computing environments, which is critical for applications such as autonomous driving and real-time data analytics.
Desirable skills: Proficiency in Python/CUDA/C++ programming, understanding of Large Language Model (LLM) inference and serving, and experience with cluster management and resource allocation
References:
[1] Jiang, Youhe, et al. "HexGen: Generative Inference of Large Language Model over Heterogeneous Environment." Forty-first International Conference on Machine Learning.
[2] Duan, Jiangfei, et al. "MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving." Forty-first International Conference on Machine Learning.
[3] Kwon, Woosuk, et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." Proceedings of the 29th Symposium on Operating Systems Principles. 2023.
[4] Zhong, Yinmin, et al. "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving." 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 2024.
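For the benchmarking phase, a minimal throughput-measurement sketch using vLLM's offline inference API might look as follows (the model name and request mix are placeholders, not part of the project specification):

    import time
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-1.3b")            # placeholder model
    params = SamplingParams(temperature=0.8, max_tokens=128)
    prompts = ["Explain paged attention."] * 64     # synthetic request batch

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start

    tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{tokens / elapsed:.1f} generated tokens/s")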
4. Automated Workload Compression for Index Selection in Scaling Workloads *
Contact: Eiko Yoneki
Scaling workloads involve numerous queries arriving within a fixed time interval in a relational database management system [1,2]. Compression aims to select a representative subset of queries, reducing redundant information and optimizing query performance [1,2]. This significantly decreases tuning time, particularly in index selection, where a compressed subset can generate beneficial index candidates with minimal performance loss [3,4]. Existing compression techniques rely on strong heuristics [1,2], often involving phases like column and query selection. Since the compression process can be viewed as a sequential search and scheduling problem, methods like dynamic programming or Monte Carlo Tree Search (MCTS) [5] can be applied to automate and maximize compression efficiency. This project seeks to develop a cost-efficient compression approach that minimizes performance loss relative to the original workload, optimizing indexing for downstream execution.
Desirable skills: Indexing, Database Optimization, Monte Carlo Tree Search
References:
[1] Siddiqui, Tarique, et al. "ISUM: Efficiently Compressing Large and Complex Workloads for Scalable Index Tuning." Proceedings of the 2022 International Conference on Management of Data. 2022.
[2] Brucato, Matteo, et al. "Wred: Workload Reduction for Scalable Index Tuning." Proceedings of the ACM on Management of Data 2.1 (2024): 1-26.
[3] Kossmann, Jan, Alexander Kastius, and Rainer Schlosser. "SWIRL: Selection of Workload-aware Indexes Using Reinforcement Learning." EDBT. Vol. 2. 2022.
[4] Wang, Taiyi, and Eiko Yoneki. "IA2: Leveraging Instance-Aware Index Advisor with Reinforcement Learning for Diverse Workloads." Proceedings of the 4th Workshop on Machine Learning and Systems. 2024.
[5] Chaslot, Guillaume Maurice Jean-Bernard. "Monte-Carlo Tree Search." (2010).
5. Predictive Long-Term Indexing with RL-Enhanced Control *
Contact: Eiko Yoneki
Traditional learned index tuners such as AirIndex [1] and the LSM-tree tuner of [2] primarily optimize for the current workload, neglecting the impact of tuning decisions on future workloads. This can lead to inefficiencies when new data arrives, as the system lacks foresight regarding future workload patterns. To address this, the project proposes serializing the indexing process at a higher level: an agent, potentially trained with RL, would control when and whether the local index tuner should act as new workloads arrive or data updates happen. By testing this on streaming data with various distributions, such as the SOSD data [4] or skewed datasets [3], or on real data traces from sources like Microsoft or Snowflake, the project aims to develop a more predictive and efficient indexing system that better accommodates incoming workloads over time, improving long-term performance.
Desirable skills: Reinforcement Learning, Indexing, Predictive Control, Learned Index
References:
[1] Chockchowwat, Supawit, Wenjie Liu, and Yongjoo Park. "AirIndex: Versatile Index Tuning Through Data and Storage." Proceedings of the ACM on Management of Data 1.3 (2023): 1-26.
[2] Mo, Dingheng, et al. "Learning to Optimize LSM-trees: Towards a Reinforcement Learning Based Key-Value Store for Dynamic Workloads." Proceedings of the ACM on Management of Data 1.3 (2023): 1-25.
[3] Ding, Jialin, et al. "Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads." arXiv preprint arXiv:2006.13282 (2020).
[4] Kipf, Andreas, et al. "SOSD: A Benchmark for Learned Indexes." arXiv preprint arXiv:1911.13014 (2019).
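A trivial heuristic baseline for the higher-level controller, which the RL agent would replace, is a cost-benefit gate; both estimators below are placeholders:

    def should_retune(est_gain_per_query, expected_future_queries, retune_cost):
        # Retune only when projected benefit over a lookahead horizon
        # outweighs the cost of rebuilding the index.
        return est_gain_per_query * expected_future_queries > retune_cost

An RL formulation would expose the same decision as a binary action, with observed workload statistics as state and long-run query latency as reward.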
6. Structured Bayesian Optimisation in BoTorch **
Contact: Eiko Yoneki
Optimising system performance is challenging due to the high cost of performance evaluation, which is time-consuming, and the vast search space of variables. Traditional Bayesian Optimisation (BO) [2] often takes a long time to converge and struggles with high-dimensional parameter spaces. However, incorporating structural information into the surrogate model can significantly accelerate BO's convergence. This project explores DagBO [1][3], an open-source extension of BO developed by our group, which allows user-definable surrogate models based on directed acyclic graphs (DAGs). The project will investigate DagBO's performance in tuning Spark benchmarks compared with traditional BO. Potential extensions include distributed computing and the development of sub-models within DagBO.
Desirable skills: Computer Systems, Bayesian Optimisation, Spark
References:
[1] Ross Tooley: Auto-tuning Spark with Bayesian optimisation.
[2] Jonas B. Mockus. The Bayesian approach to global optimization. Freie Univ., Fachbereich Mathematik, 1984.
[3] https://github.com/Tyv217/dagbo/
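The vanilla-BO baseline that DagBO's structured surrogates would be compared against can be written in a few lines of BoTorch; run_spark_benchmark below is a placeholder for the actual evaluation of a Spark configuration:

    import torch
    from botorch.models import SingleTaskGP
    from botorch.fit import fit_gpytorch_mll
    from botorch.acquisition import ExpectedImprovement
    from botorch.optim import optimize_acqf
    from gpytorch.mlls import ExactMarginalLogLikelihood

    def run_spark_benchmark(x):                      # placeholder objective
        return -(x - 0.3).pow(2).sum(dim=-1, keepdim=True)

    bounds = torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double)
    X = torch.rand(5, 2, dtype=torch.double)
    Y = run_spark_benchmark(X)

    for _ in range(20):
        gp = SingleTaskGP(X, Y)                      # unstructured surrogate
        fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))
        acq = ExpectedImprovement(gp, best_f=Y.max())
        cand, _ = optimize_acqf(acq, bounds=bounds, q=1,
                                num_restarts=5, raw_samples=64)
        X = torch.cat([X, cand])
        Y = torch.cat([Y, run_spark_benchmark(cand)])

DagBO replaces the single GP with a DAG of sub-models reflecting the system's structure; the surrounding loop stays the same.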
7. Bayesian Optimisation for tensor code generation **
Contact: Eiko Yoneki
Tensor programs run on massively parallel hardware. When generating tensor code (also called auto-scheduling), the TVM compiler [1] needs to search over many parameters. A state-of-the-art auto-scheduler is Ansor [2], which applies rule-based mutation to generate tensor code templates and then fine-tunes those templates via evolutionary search. We believe Bayesian Optimisation (BayesOpt) [3] can search the tensor code templates more efficiently than evolutionary search. First, TVM is set up for benchmarking with ResNet and BERT on CPU and GPU (possibly a few different types of GPU). Next, the same benchmarks should be run against NVIDIA's Cutlass compiler. Afterwards, the project explores BayesOpt for high-performance tensor code generation and benchmarks black-box algorithms for tensor code generation. The main interface for tensor code generation in TVM will be the MetaScheduler [6], which provides a fairly simple Python interface to various search methodologies [7]. We also have a particular interest in tensor code generation for tensor cores, which recent generations of GPUs (since the Turing micro-architecture) provide as a domain-specific architecture to massively accelerate tensor programs. The project can take advantage of a former student's work [9][10] and focus on performance improvement on GPU, multi-objective BO, and scalability.
Desired Skills: Strong interest in tensor program optimisation, some knowledge of or interest in Bayesian optimisation, Python, with some knowledge of C++
References:
[9] https://github.com/hgl71964/unity-tvm
[10] Discovering Performant Tensor Programs with Bayesian Optimization
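A sketch of BayesOpt over a discrete template space, standing in for Ansor's evolutionary fine-tuning; measure_latency is a placeholder for compiling and running a candidate schedule through TVM's MetaScheduler, and the tile-size space is illustrative:

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor

    def measure_latency(cfg):                     # placeholder for TVM build+run
        tile_x, tile_y, unroll = cfg
        return abs(tile_x - 16) + abs(tile_y - 8) + 0.1 * unroll

    space = np.array([[tx, ty, u] for tx in (4, 8, 16, 32)
                      for ty in (4, 8, 16, 32) for u in (0, 1)])
    X = space[:4].tolist()
    y = [measure_latency(c) for c in space[:4]]

    for _ in range(20):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        mu, sigma = gp.predict(space, return_std=True)
        imp = min(y) - mu                          # improvement (minimisation)
        z = imp / (sigma + 1e-9)
        ei = imp * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
        cfg = space[int(np.argmax(ei))]
        X.append(cfg.tolist())
        y.append(measure_latency(cfg))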
8. Probabilistic Inference within Reinforcement Learning *
Contact: Eiko Yoneki
Incorporating probabilistic inference within RL, treating decision-making problems as probabilistic inference tasks, is an interesting research direction. Unlike classical RL, which requires explicit reward functions, probabilistic assumptions of the model can be implied. This enables a fundamentally new way to design the reward function through model selection, and allows existing probabilistic modelling techniques to be applied to RL problems. The project includes:
- Design of probabilistic models for RL tasks: investigate how different probabilistic models (e.g., hidden Markov models, Bayesian neural networks) can be adapted or developed for RL tasks.
- Reward function implication and learning: develop methods for deriving reward functions from probabilistic models, possibly through techniques like expectation-maximisation or variational inference.
- Evaluation and application: test the proposed approach on benchmark RL problems, comparing its performance to classical RL approaches.
Desirable skills: Reinforcement Learning, Probabilistic Modelling, Reward Function
References:
[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
[2] Ghavamzadeh, M., Mannor, S., Pineau, J., & Tamar, A. (2015). Bayesian Reinforcement Learning: A Survey. Foundations and Trends in Machine Learning, 8(5-6), 359-483.
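As a toy illustration of the "control as inference" view (not the project's method), the reward can be implied as the log-likelihood of an optimality variable under a probabilistic model, and soft value iteration then recovers a policy; the 2-state MDP and model probabilities below are made up:

    import numpy as np

    p_opt = np.array([[0.9, 0.2],      # p(O=1 | s, a) from a probabilistic model
                      [0.1, 0.8]])     # rows: states, cols: actions
    reward = np.log(p_opt)             # implied reward function

    P = np.zeros((2, 2, 2))            # transition probabilities P[s, a, s']
    P[0, 0], P[0, 1] = [1.0, 0.0], [0.0, 1.0]
    P[1, 0], P[1, 1] = [1.0, 0.0], [0.0, 1.0]

    gamma, V = 0.9, np.zeros(2)
    for _ in range(200):               # soft Bellman backup: V = logsumexp_a Q
        Q = reward + gamma * P @ V
        V = np.log(np.exp(Q).sum(axis=1))
    policy = np.exp(Q - V[:, None])    # softmax policy p(a | s)
    print(policy)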
9. Tensor expression superoptimisation via deep reinforcement learning
Contact: Eiko Yoneki
AlphaDev [1][2] showed that an RL agent can discover faster sorting algorithms by playing the assembly game. Among recent advances in machine learning compilers, EinNet [3][4] shows how to discover faster tensor programs via rule-based transformation. We are interested in investigating how RL can discover faster tensor programs through transformations at the tensor expression level. You would implement a graph RL agent that plays the tensor game to discover faster tensor programs. Alternatively, we have an internal RL-driven graph transformation system, which you can leverage and apply to tensor expression level transformation. To get started, build and benchmark EinNet to see how it works with rule-based transformation on a simple DNN (e.g. self-attention). Then, replace its search algorithm with an RL algorithm.
References:
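A minimal "tensor game" sketch: states are expressions, actions are rewrite rules, and the reward is the cost reduction. Real rules and cost models would come from EinNet; the toy rules, root-only matching, and node-count cost below are placeholders:

    RULES = [
        (("add", "x", 0), "x"),                   # x + 0 -> x
        (("mul", "x", 1), "x"),                   # x * 1 -> x
        (("mul", "x", 0), 0),                     # x * 0 -> 0
    ]

    def cost(expr):                               # toy cost: node count
        return 1 + sum(cost(e) for e in expr[1:]) if isinstance(expr, tuple) else 1

    def apply_rule(expr, rule):                   # match at the root only
        lhs, rhs = rule
        if isinstance(expr, tuple) and expr[0] == lhs[0] and expr[2] == lhs[2]:
            return expr[1] if rhs == "x" else rhs
        return expr

    expr = ("add", ("mul", "y", 1), 0)
    for rule in RULES:                            # a learned policy chooses here
        new = apply_rule(expr, rule)
        print(rule, "reward:", cost(expr) - cost(new))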
10. Exploring reward function design with Inverse Reinforcement Learning
Contact: Eiko Yoneki
Reinforcement Learning (RL) has achieved impressive results recently, from playing video games to solving robotic control problems. However, in many applications it is challenging to design a reward function that robustly describes the desired task. Inverse Reinforcement Learning (IRL) [1], which evolved from imitation learning, can be formulated as the problem of inferring the reward function of an agent given its policy or observed behaviour. Recent works focus more on the robustness of IRL in unknown environments [2,3]. IRL is suitable for environments where it is hard to define the reward function, or where it is expensive to obtain the real-time reward [4]. System-level RL tuning problems [6] are such cases, where people typically define reward functions based on the actual execution time. One of the most popular open-source IRL methods is Generative Adversarial Imitation Learning (GAIL) [5]. GAIL combines a Generative Adversarial Network (GAN) with IRL and alternately updates the agent's reward and policy until reaching an optimum. One challenge is that IRL needs expert data to update the agent's reward function and policy. However, unlike video games, building a virtual environment for system-level optimisation problems to collect such data is sometimes impossible; some underlying optimisers may help here. Thus, a good direction is to use GAIL to infer reward functions within such expensive environments. In this project you would explore the reward function using GAIL in our existing project on the optimisation of the LLVM pass list with Reinforcement Learning [7].
Desired Skills: Inverse Reinforcement Learning, LLVM Optimisation
References:
[1] Ng, Andrew Y., and Stuart Russell: Algorithms for inverse reinforcement learning. ICML 2000.
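A compact GAIL-flavoured sketch: a discriminator is trained to separate expert transitions from agent transitions, and its output is reused as a learned reward. The feature vectors below are synthetic placeholders for (state, action) encodings of LLVM pass-list decisions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    expert = rng.normal(1.0, 1.0, size=(200, 8))      # placeholder expert data
    agent = rng.normal(0.0, 1.0, size=(200, 8))       # placeholder agent rollouts

    X = np.vstack([expert, agent])
    y = np.concatenate([np.ones(200), np.zeros(200)]) # 1 = expert
    disc = LogisticRegression().fit(X, y)

    def learned_reward(sa):
        d = disc.predict_proba(sa.reshape(1, -1))[0, 1]
        return -np.log(1.0 - d + 1e-8)                # GAIL-style reward signal

    print(learned_reward(agent[0]), learned_reward(expert[0]))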
11. Model-Based Reinforcement Learning for LLVM pass list optimisation
Contact: Eiko Yoneki
Reinforcement Learning (RL) is gaining interest as a generic optimisation and control method in data management tasks such as resource management, scheduling, database tuning, and stream processing. However, its application is hindered by sample inefficiency and extensive decision evaluation times. Model-Based Reinforcement Learning (MBRL), employing techniques like probabilistic ensemble models [1], World Models [2], and Dyna-style planning [4], addresses these issues by learning a model of the environment. This project tackles LLVM compiler optimisation through selection of the pass list. This is critical, as pass list selection affects code performance and size, and MBRL's capability to model these interactions is invaluable. MBRL also faces the challenge of generalisation across different programs, which is essential given the expensive training of RL agents. You would use MBRL for intelligent pass list selection in LLVM and work on enhancing generalisation across programs, investigating various MBRL techniques and evaluating their effectiveness in modelling the compiler environment and improving transfer learning. You would also evaluate a world-model based approach against a model-free approach using our software infrastructure. Previous project work [3] can be used as a starting point. Evaluating the project with SPEC 2017 benchmarks (https://www.spec.org/cpu2017/) would be ideal.
Desired Skills: Reinforcement Learning, LLVM Optimisation
References:
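A minimal model-based loop sketch: learn a dynamics model from logged (state, action, next state) compiler interactions, then plan by random shooting through the model. The data and reward below are placeholders for program features and measured speedup:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    S = rng.normal(size=(500, 6))                 # program-feature states
    A = rng.integers(0, 4, size=500)              # pass choices
    S_next = S + 0.1 * A[:, None] + rng.normal(0.0, 0.05, size=S.shape)

    model = RandomForestRegressor().fit(np.column_stack([S, A]), S_next)

    def plan(state, horizon=5, samples=64):
        best_a, best_ret = 0, -np.inf
        for _ in range(samples):                  # random-shooting planner
            s, ret = state, 0.0
            actions = rng.integers(0, 4, size=horizon)
            for a in actions:
                s = model.predict(np.append(s, a).reshape(1, -1))[0]
                ret += s.sum()                    # placeholder reward
            if ret > best_ret:
                best_a, best_ret = actions[0], ret
        return best_a

    print(plan(S[0]))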
12. Hyperparameter tuning with Monte Carlo Tree Search and Bayesian Optimisation
Contact: Eiko Yoneki
Hyperparameter tuning of large-scale deep learning models, such as ResNet-18, VGG-19, ResNet-50, and ResNet-152, is a critical but computationally intensive task due to the large search space. This project proposes a hyperparameter tuning algorithm that integrates Monte Carlo Tree Search (MCTS) and Bayesian Optimisation (BO). MCTS, known for its efficiency in large and high-dimensional spaces, will be employed to partition the hyperparameter space into smaller subspaces, which inherently enables parallelism. MCTS's probabilistic, heuristic-driven approach makes it lightweight and well suited to black-box optimisation [1, 3]. BO, on the other hand, is particularly efficient in smaller parameter spaces, making it an ideal choice for searching within the partitions created by MCTS. By combining MCTS's efficiency in high-dimensional spaces with BO's efficiency in lower-dimensional spaces, this approach aims to significantly improve the speed and quality of hyperparameter tuning. Suggested benchmarks are the CIFAR-10 and CIFAR-100 datasets [7], with comparison against existing methods in terms of computational efficiency, accuracy improvements, and resource allocation strategies.
Desired Skills: Reinforcement Learning, Gaussian Processes, Bayesian Optimisation, Parallel Computing
References:
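A deliberately simplified version of the partition-then-search idea: split one hyperparameter dimension into subspaces, select a subspace with UCB (the MCTS ingredient), then sample inside it, where BO would take over. The objective is a stand-in for validation accuracy:

    import math, random

    def validate(lr):                             # placeholder objective
        return -(math.log10(lr) + 2.5) ** 2

    subspaces = [(1e-4, 1e-3), (1e-3, 1e-2), (1e-2, 1e-1)]
    stats = [[0, 0.0] for _ in subspaces]         # visits, total score

    for t in range(1, 60):
        ucb = [(s[1] / s[0] + math.sqrt(2 * math.log(t) / s[0]))
               if s[0] else float("inf") for s in stats]
        i = ucb.index(max(ucb))                   # UCB subspace selection
        lo, hi = subspaces[i]
        lr = 10 ** random.uniform(math.log10(lo), math.log10(hi))  # BO goes here
        score = validate(lr)
        stats[i][0] += 1
        stats[i][1] += score

    best = max(range(len(subspaces)), key=lambda i: stats[i][0])
    print("most promising subspace:", subspaces[best])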
13.
Contact: Eiko Yoneki
Desired Skills: Reinforcement Learning, Database Systems, Machine Learning, Parameter Tuning
References:
14. Learning from Demonstration to Guide Query Rewrite
Contact: Eiko Yoneki
Desired Skills: Machine Learning, Database Systems, Query Optimisation of SQL Database
References:
15. Joint optimisation of tensor graphs and GPU kernels with Cutlass
Contact: Eiko Yoneki
TASO [2] optimises the tensor graph structure, and its backend uses cuDNN to execute tensor operators [1]. A recent hardware library, Cutlass, allows users to customise their own kernels, almost as if configuring the hardware themselves. This opens up the opportunity to jointly optimise the tensor graph and the GPU kernels. The project aims to replace TASO's backend from cuDNN [3] with Cutlass [4]. Ideal candidates should have some experience writing C++-level code and preferably know the basics of CUDA programming. You can get started by benchmarking common tensor operators, such as MatMul and Conv2D, and comparing the performance of different ML compilers. It is a great fit for those who want a deeper understanding of ML compilers.
References:
[1] End-to-end comparison.
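For the suggested starting point, a minimal operator-timing sketch (here with PyTorch as one convenient backend; the shapes are illustrative) establishes baselines that the same shapes can later be compared against under cuDNN-backed frameworks and hand-configured Cutlass kernels:

    import time
    import torch

    def bench_matmul(n, iters=50,
                     device="cuda" if torch.cuda.is_available() else "cpu"):
        a = torch.randn(n, n, device=device)
        b = torch.randn(n, n, device=device)
        torch.matmul(a, b)                        # warm-up
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            torch.matmul(a, b)
        if device == "cuda":
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    for n in (512, 1024, 2048):
        print(n, f"{bench_matmul(n) * 1e3:.2f} ms")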
16. Bayesian optimisation tuning with transfer learning from simulation to databases
Contact: Eiko Yoneki
Databases come with a large number of tunable configurations [1]. We have successfully used Bayesian optimisation to tune varying numbers of parameters. However, a common issue when tuning these parameters is that a single experiment takes a long time to evaluate. One way to deal with the long evaluation time is to leverage other data sources. The goal of this project is to study the use of fast-to-execute simulators to aid the convergence speed of the optimiser. FoundationDB [2] is a recent popular database built as a simulator first, and it provides first-class support for performance and simulation testing at one tenth of the actual execution time [3]. This project proposes using multi-fidelity optimisation [4] to integrate the simulator into the Bayesian optimisation loop. This framework enables transfer learning from a low-accuracy, cheap model (the simulator) to a high-accuracy, expensive model (FoundationDB).
References:
[1] https://github.com/apple/foundationdb/blob/master/flow/Knobs.cpp
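A deliberately simplified stand-in for multi-fidelity BO: cheap simulator runs seed the surrogate, and only the most promising configurations are promoted to full runs on the real system. Both objectives below are placeholders:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def simulator(x):                             # cheap, slightly biased estimate
        return -(x - 0.6) ** 2 + 0.05 * np.sin(20 * x)

    def real_system(x):                           # expensive ground truth
        return -(x - 0.55) ** 2

    grid = np.linspace(0, 1, 200).reshape(-1, 1)
    X_sim = np.random.rand(40, 1)                 # many cheap evaluations
    gp = GaussianProcessRegressor(normalize_y=True).fit(
        X_sim, simulator(X_sim).ravel())

    mu, sigma = gp.predict(grid, return_std=True)
    candidates = grid[np.argsort(mu + sigma)[-5:]]    # optimistic picks
    best = max(candidates, key=lambda x: real_system(float(x)))
    print("promoted configuration:", float(best))

A genuine multi-fidelity method would model both fidelities jointly rather than in two stages, but the promotion structure is the same.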
17. Task scheduling with Contextual Multi-Armed Bandits
Contact: Eiko Yoneki
Task scheduling over a finite number of resources to maximise resource efficiency is a traditionally difficult problem, usually approximated with heuristics. Previous work used Hierarchical Reinforcement Learning [1] to find optimal task schedules for TensorFlow graph operations. This project proposes investigating Contextual Multi-Armed Bandits [2] with Bayesian learning (possibly Bayesian neural networks), which allows expert knowledge to be injected into the system. Using Bayesian models to define a hierarchy over the system allows it to converge faster and be easier to understand. For this project, Park [3] provides a simulation-based environment to test task scheduling on a TensorFlow graph, and TensorFlow Probability [2] provides a probabilistic programming language for defining Bayesian models that interact with deep learning models.
References:
[1] https://ai.google/research/pubs/pub46646
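A minimal contextual bandit sketch with Thompson sampling on a Bayesian linear model, the kind of component the project would extend with expert priors; the contexts and rewards are synthetic placeholders:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_arms = 4, 3
    A = [np.eye(d) for _ in range(n_arms)]        # posterior precision per arm
    b = [np.zeros(d) for _ in range(n_arms)]

    for t in range(500):
        ctx = rng.normal(size=d)                  # e.g. TF-graph features
        # Sample weights from each arm's Gaussian posterior, act greedily.
        samples = [rng.multivariate_normal(np.linalg.solve(A[k], b[k]),
                                           np.linalg.inv(A[k]))
                   for k in range(n_arms)]
        k = int(np.argmax([ctx @ w for w in samples]))
        reward = ctx @ (k * 0.1 * np.ones(d)) + rng.normal(0, 0.1)  # synthetic
        A[k] += np.outer(ctx, ctx)                # Bayesian posterior update
        b[k] += reward * ctx

    print("posterior mean of arm 2:", np.linalg.solve(A[2], b[2]))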
18. Hierarchical device placement with Variational Autoencoders and Bayesian Optimisation
Contact: Eiko Yoneki
This project tackles hierarchical device placement from a probabilistic perspective, seeking to improve on the original work by Google [1]. In [1], two networks are used: a grouper and a placer. The grouper learns valid groupings of TensorFlow operations, and the placer learns optimal placements on multiple devices. The idea in this project is to replace the grouper network with a Variational Autoencoder (VAE) that generates valid grouping representations in its latent space, with Bayesian Optimisation (BO) searching that space, a similar approach to [2]. The output is fed into a second BO that learns the optimal placement, similar to the approach described in our BOAT project [3] for learning the placement for distributed SGD. Furthermore, the second placer could learn multiple objectives rather than just runtime improvement: by learning more correlated tasks, it should be expected to converge faster.
References:
[1] A. Mirhoseini et al.: A Hierarchical Model for Device Placement. ICLR, 2018. https://openreview.net/pdf?id=Hkc-TeZ0W
19. Improving generalisation in RL-guided Equality Saturation with population-based methods
Contact: Eiko Yoneki
Omelette [1] and Aurora [2] use reinforcement learning to guide the application of rules in Equality Saturation. A pressing challenge is generalisation. While the proposed methodologies can in theory generalise across unseen expressions (e.g. query rewrite problems), expecting an RL agent to solve NP-hard combinatorial problems across sparse problem instances (such as the space of possible SQL query e-graphs) is not realistic. To improve generalisation, we propose an investigation into population-based Reinforcement Learning (for example, Poppy [3]), which comes with theoretical guarantees for solving combinatorial optimisation problems more efficiently.
References:
[1] Zak Singh: Omelette: Deep Reinforcement Learning for Equality Saturation.
20. Learned Extraction for Equality Saturation
Contact: Eiko Yoneki
One of the core practical limitations of applying Equality Saturation to rewriting query plans is the latency overhead. In the database domain, Aurora [1] shows that most of the planning latency comes from extracting the optimal expression from the e-graph. This is achieved by formulating query plan extraction as an integer linear program (ILP). To minimise planning latency, a promising idea is to learn the e-graph extraction with reinforcement learning. However, a naive application of RL to extracting queries (or other types of expressions) from the set of all possible e-graphs is set up for failure due to the enormous search space. To address this, we propose an investigation into guiding the ILP with RL (e.g. learning to branch [2]) to minimise the search space.
References:
[1] George-Octavian Barbulescu: An Inquiry into Database Query Optimisation with Equality Saturation and Reinforcement Learning.
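To see what the learned component has to compete with, here is a minimal e-graph extraction ILP in PuLP: pick e-nodes so that the root e-class is built, children of chosen nodes are built, and total cost is minimal. The tiny e-graph is hard-coded and cycle handling is omitted for brevity:

    from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value

    # class -> list of (node_name, cost, child_classes)
    egraph = {
        "root": [("join_ab", 5, ["a", "b"]), ("join_ba", 4, ["b", "a"])],
        "a": [("scan_a", 1, [])],
        "b": [("scan_b", 1, []), ("idx_b", 0.5, [])],
    }

    x = {n: LpVariable(n, cat="Binary")
         for nodes in egraph.values() for n, _, _ in nodes}

    prob = LpProblem("extract", LpMinimize)
    prob += lpSum(c * x[n] for nodes in egraph.values() for n, c, _ in nodes)
    prob += lpSum(x[n] for n, _, _ in egraph["root"]) >= 1   # root must be built
    for nodes in egraph.values():
        for n, _, children in nodes:
            for cls in children:                  # children of chosen nodes
                prob += lpSum(x[m] for m, _, _ in egraph[cls]) >= x[n]

    prob.solve()
    print([n for n in x if value(x[n]) == 1])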
21. Cost modelling for tensor programs
Contact: Eiko Yoneki
Cost models provide cheap estimates of the performance of tensor programs without actual execution, so they are at the heart of accelerating tensor program generation. For example, TVM [1] uses XGBoost [2] to estimate tensor program runtime. More recently, advanced DNN-based cost models have been proposed for more accurate estimation, such as GNNs [3] and transformers [4]. Cost modelling can even be cross-hardware [5]. Distributed DNN training also needs delicate cost modelling to provide cheap estimates of the throughput of parallelisation strategies, as in [6] and [7]; however, distributed cost modelling is mostly mathematical estimation. This project aims to investigate and benchmark state-of-the-art cost modelling. In particular, we want to understand how good these cost models are and, if possible, replace the distributed cost modelling with a DNN-based cost model.
References:
[1] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.
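A minimal learned cost model in the spirit of TVM's XGBoost ranker (assuming the xgboost package is available): train a regressor from schedule features to measured latency and use it to rank candidates without running them. The features and latencies below are synthetic:

    import numpy as np
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    feats = rng.uniform(size=(1000, 10))          # schedule feature vectors
    latency = feats @ rng.uniform(size=10) + rng.normal(0, 0.05, 1000)

    model = XGBRegressor(n_estimators=200).fit(feats[:800], latency[:800])
    pred = model.predict(feats[800:])

    # Rank unseen candidates by predicted latency; only the top few
    # would actually be compiled and measured.
    top = np.argsort(pred)[:5]
    print("candidates to measure:", top, "predicted:", pred[top])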
Contact Email: Please email eiko.yoneki@cl.cam.ac.uk with any questions.