Department of Computer Science and Technology

Technical reports

Distributional and relational inductive biases for graph representation learning in biomedicine

Paul M. Scherer

April 2024, 164 pages

This technical report is based on a dissertation submitted May 2023 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Gonville and Caius College.

DOIhttps://doi.org/10.48456/tr-993

Abstract

The immense complexity in which DNAs, RNAs, proteins and other biomolecules interact amongst themselves, with one another, and the environment to bring about life processes motivates the mass collection of biomolecular data and data-driven modelling to gain insights into physiological phenomena. Recent predictive modelling efforts have focused on deep representation learning methods which offer a flexible modelling paradigm to handling high dimensional data at scale and incorporating inductive biases. The emerging field of representation learning on graph structured data opens opportunities to leverage the abundance of structured biomedical knowledge and data to improve model performance.

Grand international initiatives have been coordinated to organise and structure our growing knowledge about the interactions and putative functions of biomolecular entities using graphs and networks. This dissertation considers how we may use the inductive biases within recent graph representation learning methods to leverage these structures and incorporate biologically relevant relational priors into machine learning methods for biomedicine. We present contributions in two parts with the aim to foster research in this multidisciplinary domain and present novel methods that achieve strong performance through the use of distributional and relational inductive biases operating on graph-structured biomedical knowledge and data.

The first part is concerned with consolidating and expanding the current ecosystem of practical frameworks dedicated to graph representation learning. Our first contribution presents Geo2DR, the first practical framework and software library for constructing methods capable of learning distributed representations of graphs. Our second contribution, Pytorch Geometric Temporal, is the first open source representation learning library for dynamic graphs, expanding the scope of research software on graph neural networks that were previously limited to static graphs.

The second part presents three methods wherein each contribution tackles an active biomedical research problem using relational structures that exist within different aspects of the data. First, we present a methodology for learning distributed representations of molecular graphs in the context of drug pair scoring. Next, we present a method for leveraging structured knowledge on the variables of gene expression profiles to automatically construct sparse neural models for cancer subtyping. Finally, we present a state-of-the-art cell deconvolution model for spatial transcriptomics data using the positional relationships between observations in the dataset.

Full text

PDF (13.0 MB)

BibTeX record

@TechReport{UCAM-CL-TR-993,
  author =	 {Scherer, Paul M.},
  title = 	 {{Distributional and relational inductive biases for graph
         	   representation learning in biomedicine}},
  year = 	 2024,
  month = 	 apr,
  url = 	 {https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-993.pdf},
  institution =  {University of Cambridge, Computer Laboratory},
  doi = 	 {10.48456/tr-993},
  number = 	 {UCAM-CL-TR-993}
}