MULTIMODAL JOINT EMBEDDING TRANSFORMER FOR CONDITIONAL DE NOVO MOLECULAR DESIGN AND MULTI-PROPERTY OPTIMIZATION

Anonymous

Abstract

Multi-property constrained optimization of molecules using generative de novo design models is vital for the successful application of Artificial Intelligence (AI) to materials and drug discovery. Yet there remains a gap between the reported performance of such models in the literature and their practical utility in real-world design scenarios. Furthermore, existing models are largely inaccessible to chemists without an extensive background in computer science. To address these challenges, we propose a generative foundation model, the Multimodal Joint Embedding Transformer (MOLJET), which performs conditional generation of desired molecular distributions based on human-interpretable chemistry prompts in a zero-shot manner. We assess MOLJET on the standard benchmarks available in the GuacaMol and MIMOSA evaluation frameworks. These include structure-based sampling tasks as well as a range of multi-property optimization tasks that probe a model's ability to design drug-like molecules under realistic property constraints. We demonstrate that with self-supervised pretraining alone, MOLJET outperforms 80% of task-optimized models using zero-shot inference and beats all baselines after minimal supervision. Moreover, the performance of MOLJET on text-only conditioning tasks improves with the inclusion of property modalities during training, highlighting the importance of a multimodal approach to molecular design. MOLJET is the first example of text-based de novo molecular design using large-scale multimodal foundation models and should serve as a building block for further improvements to accessible AI for chemists.

1. INTRODUCTION

Emerging crises in climate, disease and human health threaten to permanently disrupt global stability and must be actively met with creative solutions. Many such solutions depend on the rapid discovery of innovative functional materials or novel drug-like molecules with optimal properties. For instance, the viability of using redox-flow batteries (RFBs) for long-term and large-scale energy storage is contingent on finding stable redox species with fast electrochemical kinetics, a feasible redox potential and high solubility (Zhang et al., 2018). Due to the immense size and complexity of chemical phase space (Polishchuk et al., 2013), the search for suitable materials is far from trivial, and traditional "direct" design approaches based on iterative modifications to existing chemical structures are often far too slow (Kuhn & Beratan, 1996). To address this issue, researchers have increasingly turned to generative de novo design models to efficiently navigate the vast molecular phase space (Meyers et al., 2021). These models are evaluated on their ability to generate a diverse array of novel molecular structures while simultaneously biasing them towards a desired property distribution (Polykovskiy et al., 2020). Due to the ubiquity of string-based molecular representations (Weininger, 1988; Krenn et al., 2020), recent innovations in natural language modeling have been successfully applied to de novo molecular design. For instance, transformer architectures have achieved state-of-the-art results on property prediction tasks that require quantum-level accuracy (Ross et al., 2021) and have also been shown to increase the diversity of candidates sampled from machine-learned molecular distributions (Dollar et al., 2021). Aside from string-based representations of molecular structures, there are other textual modalities which could provide additional context to generative models and thus improve their performance.
Such modalities include IUPAC names, molecular formulas, descriptions of important chemical moieties or functional groups, and natural language descriptions of chemical behavior. Yet despite the large overlap between the architectures used for natural language modeling and those used for molecular sequence modeling, there have been only a few attempts to incorporate more than a single modality within a model (Rothchild et al., 2021; Sun et al., 2021; Zeng et al., 2022), and none have included the capacity for property-driven molecular design. Massive scaling has also been primarily limited to property prediction tasks (Honda et al., 2019; Chithrananda et al., 2020), despite growing evidence of the performance benefits derived from increasing model size, dataset size and compute across all downstream tasks (Kaplan et al., 2020; Hoffmann et al., 2022). In this work we introduce MOLJET, a large-scale multimodal joint embedding transformer for conditional molecular generation and multi-property optimization. Within this framework, molecular generation is conditioned on text-based prompts that control the structural and physicochemical characteristics of the desired molecular distributions, as depicted in Figure 1. We demonstrate conditional generation on three modalities: textual descriptions of molecular structural features, physicochemical properties, and 1D atomistic molecular graphs. We also provide a general framework for the inclusion of additional modalities during pretraining. To prove the efficacy of our models in realistic design scenarios, we evaluate MOLJET on a diverse set of tasks including molecular rediscovery, similarity- and substructure-based sampling, isomer generation and multi-property optimization (Brown et al., 2019; Fu et al., 2021).
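The three conditioning modalities must ultimately be serialized into a single text prompt. The exact prompt syntax used by MOLJET is not reproduced here; the sketch below uses hypothetical tags (`<desc>`, `<props>`, `<graph>`) purely to illustrate how optional modalities could be concatenated into one conditioning string:

```python
def build_prompt(description=None, properties=None, selfies=None):
    """Assemble a multimodal conditioning prompt from optional modalities.

    NOTE: the tag names and ordering are illustrative assumptions,
    not MOLJET's actual prompt format.
    """
    parts = []
    if description:
        parts.append(f"<desc> {description}")
    if properties:
        # Serialize property constraints as key=value pairs.
        parts.append("<props> " + " ".join(f"{k}={v}" for k, v in sorted(properties.items())))
    if selfies:
        parts.append(f"<graph> {selfies}")
    return " ".join(parts)

prompt = build_prompt(
    description="aromatic ring with a hydroxyl group",
    properties={"MW": 94.1, "logP": 1.5},
)
# e.g. "<desc> aromatic ring with a hydroxyl group <props> MW=94.1 logP=1.5"
```

Because every field is optional, the same function covers text-only, property-only and fully multimodal conditioning, mirroring the zero-shot prompting described above.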
With only self-supervised pretraining, MOLJET outperforms all task-optimized baseline models on five of the eight task categories, and it outperforms the baselines on all eight task categories after minimal task-specific supervised optimization. Furthermore, the prompts are designed to be easily interpretable by chemists without any prior knowledge of deep learning and are thus accessible to a wider audience. We provide access to our pretrained models through an online API and hope to encourage increased participation in AI-driven de novo molecular design among scientific researchers, in much the same way that DALL-E and GPT have inspired increased interaction with deep learning models among the general public (Brown et al., 2020; Ramesh et al., 2022).
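Because every symbol in a SELFIES string (Krenn et al., 2020) is enclosed in square brackets, the token sequence consumed by a sequence model can be recovered with a one-line regular expression. This is a generic property of the SELFIES grammar rather than a detail of MOLJET's tokenizer:

```python
import re

def tokenize_selfies(selfies_string):
    """Split a SELFIES string into its bracket-delimited symbols."""
    return re.findall(r"\[[^\]]*\]", selfies_string)

# SELFIES representation of ethanol (SMILES: CCO)
print(tokenize_selfies("[C][C][O]"))  # ['[C]', '[C]', '[O]']
```

This unambiguous tokenization is one reason SELFIES strings are convenient targets for autoregressive generation: unlike SMILES, there is no need for a learned or heuristic tokenizer to segment the sequence.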

2. RELATED WORK

Multi-Property Optimization. Several strategies for multi-property optimization of molecular structures have been explored to date. Some works propose to condition the generation of molecular structures on a learnable embedding corresponding to the values of one or more desired properties (Lim et al., 2018; Li et al., 2018; Gebauer et al., 2022). These models jointly learn the conditional



Figure 1: MOLJET Framework. Prompts are (i) stochastically sampled from the available modalities in the dataset and (ii) used to condition autoregressive reconstruction of SELFIES strings. Conditions are then chosen during inference to (iii) shift the generated molecular distribution towards the desired structural or physicochemical properties.
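Step (i) of the caption, stochastic sampling of prompt modalities, can be approximated by independently keeping or dropping each available modality per training example, so the model learns to condition on any subset (the empty subset recovers unconditional generation). The keep probability and field names below are illustrative assumptions, not values taken from the paper:

```python
import random

def sample_modalities(example, p_keep=0.5, rng=None):
    """Keep each available modality independently with probability p_keep."""
    rng = rng or random.Random()
    return {name: value for name, value in example.items()
            if rng.random() < p_keep}

example = {
    "text": "soluble aromatic amine",
    "properties": {"logP": 1.2},
    "graph": "[C][N]",
}
conditioning = sample_modalities(example, p_keep=0.5, rng=random.Random(0))
```

The surviving modalities would then be serialized into the prompt that conditions autoregressive reconstruction of the SELFIES string in step (ii).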

