CROSS-MODAL GRAPH CONTRASTIVE LEARNING WITH CELLULAR IMAGES

Abstract

Constructing discriminative representations of molecules lies at the core of a number of domains such as drug discovery, material science, and chemistry. State-of-the-art methods employ graph neural networks (GNNs) and self-supervised learning (SSL) to learn structural representations from unlabeled data, which can then be fine-tuned for downstream tasks. Albeit powerful, these methods, pre-trained solely on molecular structures, cannot generalize well to tasks involving intricate biological processes. To cope with this challenge, we propose using high-content cell microscopy images to assist in learning molecular representations. The fundamental rationale of our method is to leverage the correspondence between molecular topological structures and the perturbations they cause at the phenotypic level. By including cross-modal pre-training with different types of contrastive loss functions in a unified framework, our model can efficiently learn generic and informative representations from cellular images, which are complementary to molecular structures. Empirical experiments demonstrate that the model transfers non-trivially to a variety of downstream tasks and is often competitive with existing SSL baselines, e.g., a 15.4% absolute Hit@10 gain in the graph-image retrieval task and a 4.0% absolute AUC improvement in clinical outcome prediction. Further zero-shot case studies show the potential of the approach to be applied to real-world drug discovery.

1. INTRODUCTION

Learning discriminative representations of molecules is a fundamental task for numerous applications such as molecular property prediction, de novo drug design and material discovery Wu et al. (2018). Since molecular structures are essentially topological graphs of atoms and covalent bonds, graph representation learning can be naturally introduced to capture the representation of molecules Duvenaud et al. (2015); Xu et al. (2019); Song et al. (2020); Ying et al. (2021). Despite this fruitful progress, graph neural networks (GNNs) are known to be data-hungry, i.e., they require a large amount of labeled data for training. However, task-specific labeled data are far from sufficient, as wet-lab experiments are resource-intensive and time-consuming. As a result, training datasets in chemistry and drug discovery are typically limited in size, and neural networks tend to overfit them, leading to poor generalization capability of the learned representations.

Inspired by the fruitful achievements of self-supervised learning in computer vision Chen et al. (2020a); He et al. (2020) and natural language processing Devlin et al. (2018); Radford et al. (2018), there have been attempts to extend self-supervised schemes to molecular representation learning, leading to remarkable results Hu et al. (2020b); Xu et al. (2021); You et al. (2020; 2021); Rong et al. (2020); Velickovic et al. (2019). The core of self-supervised learning lies in establishing meaningful pre-training objectives to harness the power of large unlabeled data. The pre-trained neural networks can then be fine-tuned for small-scale downstream tasks.

However, pre-training on molecular graph structures remains a stiff challenge. One limitation of current approaches is the lack of domain knowledge in chemistry or chemical synthesis. Recent studies have pointed out that pre-training GNNs with random node/edge masking gives limited improvements and often leads to negative transfer on downstream tasks Hu et al. (2020b); Stärk et al. (2021), as perturbations of graph structures can hurt the structural inductive bias of molecules. Furthermore, molecular modeling tasks often require predicting the binding/interaction between molecules and other biological entities (e.g., RNA, proteins, pathways), and further generalizing to the phenotypic/clinical outcomes caused by these specific bindings Kuhn et al. (2016). Self-supervised learning methods that solely manipulate molecular structures struggle to handle downstream tasks that involve complex biological processes, limiting their practicality in a wide range of drug discovery applications.

To this end, we propose using high-content cell microscopy images to assist in learning molecular representations, extending molecular representations beyond chemical structures and thus improving their generalization capability. High-content cell microscopy imaging (HCI) has become an increasingly important biotechnology in recent years in the domains of drug discovery and systems biology Chandrasekaran et al. (2021); Bray et al. (2016). As shown in Figure 1, small molecules enter cells and affect their biological functions and pathways, leading to morphological changes in cell shape, number, structure, etc., that are visible in microscopy images after staining. Analysis and modeling based on these high-content images have shown great success in molecular bioactivity prediction Hofmarcher et al. (2019), mechanism identification Caicedo et al. (2022), polypharmacology prediction Chow et al. (2022), etc. Thus, we hypothesize that this phenotypic modality has complementary strengths with molecular structures that can be used to build enhanced representations and thus benefit downstream tasks involving intricate biological processes.

Figure 1: (a) Illustration of the high-content screening experimental protocol. (b) Samples of molecule-image pairs. Note that one molecule produces multiple images.

However, building connections between molecular structures and high-content cellular images is a challenging task that highlights representative, unsolved problems in cross-modal learning. The first challenge comes from the unclear fine-grained correspondence between molecules and images. Unlike conventional cross-modal paired data such as captions and pictures, the patterns of the molecule are not directly reflected in the cellular image, which prevents us from using traditional cross-modal encoders for alignment. The second challenge arises from the noise and batch effects of the cellular data. For example, cellular images obtained from the same molecule can vary considerably. Existing cross-modal pre-training objectives may overfit the noisy images and decrease the model's generalization ability.

Herein, we propose Molecular graph and hIgh content imaGe Alignment (MIGA), a novel cross-modal graph-and-image pre-training framework to address the above issues. We first encode the molecule and cell microscopy images independently with a molecular graph encoder and an image encoder. Then we align the graph embeddings with the image embeddings through three contrastive modules: graph-image contrastive (GIC) learning, masked graph modeling (MGM) and generative graph-image matching (GGIM). Specifically, (i) GIC encourages high similarity between the latent embeddings of matched graph-image pairs while pushing those of non-matched pairs apart; (ii) MGM, a local cross-modal module, utilizes both the observed (unmasked) graph and the cell image to predict the masked molecular patterns; and (iii) GGIM further distinguishes hard negative graph-image pairs that share similar global semantics but differ in fine-grained details. The three modules are complementary, and their combination can (1) make it easier for the encoders to perform cross-modal learning by capturing structural and localized information, and (2) learn a common low-dimensional space in which graph and image embeddings are biologically meaningful.

Enabled by the massive publicly available high-content screening data Bray et al. (2017), we establish a novel cross-modal benchmark dataset that contains 750k molecular graph-cellular image pairs. To evaluate models on this benchmark, we propose a new biologically meaningful retrieval task specific to
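To make the graph-image contrastive (GIC) idea concrete, the following is a minimal sketch of a symmetric InfoNCE-style objective over a batch of matched graph-image embedding pairs: matched pairs sit on the diagonal of the similarity matrix and are pulled together, while off-diagonal (non-matched) pairs are pushed apart. The paper does not specify the exact loss formulation, temperature, or encoder outputs, so every name and hyperparameter here is an illustrative assumption:

```python
import numpy as np

def graph_image_contrastive_loss(graph_emb, image_emb, temperature=0.1):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    graph_emb, image_emb: (batch, dim) arrays where row i of each array
    comes from the same molecule, i.e., forms a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = g @ v.T / temperature  # (batch, batch); matched pairs on the diagonal

    def xent_diagonal(l):
        # cross-entropy with the diagonal as the target class, computed stably
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average of graph->image and image->graph retrieval directions
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))
```

With perfectly aligned, mutually orthogonal embeddings the loss approaches zero; for random embeddings it stays near log(batch size), which is the usual behavior of an InfoNCE objective.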

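The masked graph modeling (MGM) module needs a corruption step that hides a fraction of atoms before the model is asked to reconstruct them from the remaining graph context and the paired cell image. The sketch below shows only that generic node-masking step in NumPy; the mask ratio, mask token, and feature layout are assumptions, not the paper's specification:

```python
import numpy as np

def mask_graph_nodes(node_features, mask_ratio=0.15, mask_token=0.0, seed=None):
    """Randomly hide a fraction of node (atom) feature rows.

    Returns the corrupted feature matrix and a boolean mask marking which
    nodes were hidden (the reconstruction targets for an MGM-style loss).
    """
    rng = np.random.default_rng(seed)
    n = node_features.shape[0]
    n_mask = max(1, int(round(n * mask_ratio)))  # always mask at least one node
    idx = rng.choice(n, size=n_mask, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    corrupted = node_features.copy()
    corrupted[mask] = mask_token  # overwrite hidden atoms with a mask token
    return corrupted, mask
```

A pre-training step would feed `corrupted` (plus the image embedding) to the model and score its predictions only at positions where `mask` is true.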

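The graph-image retrieval evaluation reported as Hit@10 can be sketched as follows: given paired graph and image embeddings, a query graph scores a hit when its matched image appears among the k most similar candidates. This is an illustrative NumPy implementation under the assumption of cosine-similarity ranking, not the paper's evaluation code:

```python
import numpy as np

def hit_at_k(graph_emb, image_emb, k=10):
    """Fraction of query graphs whose true image (same row index)
    is ranked within the top-k images by cosine similarity."""
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = g @ v.T                           # (n_graphs, n_images) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k best images per graph
    return float(np.mean([i in topk[i] for i in range(len(topk))]))
```

The image-to-graph direction is obtained by swapping the two arguments; averaging both directions is a common convention in cross-modal retrieval benchmarks.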