LEARNABLE TOPOLOGICAL FEATURES FOR PHYLOGENETIC INFERENCE VIA GRAPH NEURAL NETWORKS

Abstract

Structural information of phylogenetic tree topologies plays an important role in phylogenetic inference. However, finding appropriate topological structures for specific phylogenetic inference tasks often requires significant design effort and domain expertise. In this paper, we propose a novel structural representation method for phylogenetic inference based on learnable topological features. By combining the raw node features that minimize the Dirichlet energy with modern graph representation learning techniques, our learnable topological features can provide efficient structural information of phylogenetic trees that automatically adapts to different downstream tasks, without requiring domain expertise. We demonstrate the effectiveness and efficiency of our method on a simulated-data tree probability estimation task and a benchmark of challenging real-data variational Bayesian phylogenetic inference problems.

1. INTRODUCTION

Phylogenetics is an important discipline of computational biology where the goal is to identify the evolutionary history and relationships among individuals or groups of biological entities. In statistical approaches to phylogenetics, this has been formulated as an inference problem on hypotheses of shared history, i.e., phylogenetic trees, based on observed sequence data (e.g., DNA, RNA, or protein sequences) under a model of evolution. The phylogenetic tree defines a probabilistic graphical model, based on which the likelihood of the observed sequences can be efficiently computed (Felsenstein, 2003). Many statistical inference procedures can therefore be applied, including maximum likelihood and Bayesian approaches (Felsenstein, 1981; Yang & Rannala, 1997; Mau et al., 1999; Huelsenbeck et al., 2001). Phylogenetic inference, however, has been challenging due to the composite parameter space of both continuous and discrete components (i.e., branch lengths and the tree topology) and the combinatorial explosion in the number of tree topologies with the number of sequences. Harnessing the topological information of trees hence becomes crucial in the development of efficient phylogenetic inference algorithms. For example, by assuming conditional independence of separated subtrees, Larget (2013) showed that conditional clade distributions (CCDs) can provide more reliable tree probability estimation that generalizes beyond observed samples. A similar approach was proposed to design more efficient proposals for tree movement when implementing Markov chain Monte Carlo (MCMC) algorithms for Bayesian phylogenetics (Höhna & Drummond, 2012). Utilizing more sophisticated local topological structures, CCDs were later generalized to subsplit Bayesian networks (SBNs) that provide more flexible distributions over tree topologies (Zhang & Matsen IV, 2018).
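To make the CCD idea concrete, the sketch below estimates a rooted tree's probability as a product of empirical conditional clade probabilities computed from sampled trees. The encoding of a tree as a set of (parent clade, split) pairs is a hypothetical choice for illustration, not the authors' implementation:

```python
from collections import Counter

def ccd_estimator(sampled_trees):
    """Build a CCD tree-probability estimator from rooted tree samples.
    Each tree is encoded (hypothetically) as a frozenset of pairs
    (parent_clade, frozenset({left_clade, right_clade})), where clades
    are frozensets of taxon names."""
    clade_counts, split_counts = Counter(), Counter()
    for tree in sampled_trees:
        for clade, split in tree:
            clade_counts[clade] += 1
            split_counts[(clade, split)] += 1

    def tree_prob(tree):
        # probability = product of empirical conditional clade probabilities
        p = 1.0
        for clade, split in tree:
            p *= split_counts[(clade, split)] / clade_counts[clade]
        return p

    return tree_prob
```

Because the estimate factorizes over clades, tree topologies never seen in the sample can still receive positive probability, which is the source of the generalization beyond observed samples noted above.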
Besides MCMC, variational Bayesian phylogenetic inference (VBPI) was recently proposed, which leverages SBNs and a structured amortization of branch lengths to deliver competitive posterior estimates in a more timely manner (Zhang & Matsen IV, 2019; Zhang, 2020; Zhang & Matsen IV, 2022). Azouri et al. (2021) used a machine learning approach to accelerate maximum likelihood tree-search algorithms by providing more informative topology moves. Topological features have also been found useful for comparison and interpretation of the reconstructed phylogenies (Matsen IV, 2007; Hayati et al., 2022). While these approaches prove effective in practice, they all rely on heuristic features (e.g., clades and subsplits) of phylogenetic trees that often require significant design effort and domain expertise, and may be insufficient for capturing complicated topological information.

Graph Neural Networks (GNNs) are an effective framework for learning representations of graph-structured data. To encode the structural information about graphs, GNNs follow a neighborhood aggregation procedure that computes the representation vector of a node by recursively aggregating and transforming the representation vectors of its neighboring nodes. After the final iteration of aggregation, the representation of the entire graph can also be obtained by pooling all the node embeddings together via some permutation invariant operators (Ying et al., 2018). Many GNN variants have been proposed and have achieved superior performance on both node-level and graph-level representation learning tasks (Kipf & Welling, 2017; Hamilton et al., 2017; Li et al., 2016; Zhang et al., 2018; Ying et al., 2018). A natural idea, therefore, is to adapt GNNs to phylogenetic models for automatic topological feature learning. However, the lack of node features for phylogenetic trees makes this challenging, as most GNN variants assume fully observed node features at initialization.
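As a simplified instance of this neighborhood aggregation scheme, the sketch below implements one mean-aggregation message-passing layer and a sum readout. The concatenate-then-transform parameterization and the tanh activation are illustrative placeholders, not any specific published GNN variant:

```python
import numpy as np

def gnn_layer(H, neighbors, W, activation=np.tanh):
    """One neighborhood-aggregation step: each node's new embedding is a
    transformed combination of its current embedding and the mean of its
    neighbors' embeddings. H: (n, d) node features; neighbors: list of
    neighbor-index lists; W: (2d, d_out) weights (hypothetical scheme)."""
    n, d = H.shape
    out = []
    for v in range(n):
        nbrs = neighbors[v]
        agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(d)
        out.append(activation(np.concatenate([H[v], agg]) @ W))
    return np.stack(out)

def graph_readout(H):
    """Permutation-invariant pooling of node embeddings into a
    graph-level representation."""
    return H.sum(axis=0)
```

Stacking several such layers lets information propagate over multi-hop neighborhoods, and the sum readout is invariant to node ordering, as required for a graph-level representation.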
In this paper, we propose a novel structural representation method for phylogenetic inference that automatically learns efficient topological features based on GNNs. To obtain the initial node features for phylogenetic trees, we follow previous studies (Zhu & Ghahramani, 2002; Rossi et al., 2021) and minimize the Dirichlet energy, with one-hot encodings for the tip nodes. Unlike these previous studies, we present a fast linear-time algorithm for Dirichlet energy minimization by taking advantage of the hierarchical structure of phylogenetic trees. Moreover, we prove that these features are sufficient for identifying the corresponding tree topology, i.e., there is no information loss in our raw feature representations of phylogenetic trees. These raw node features are then passed to GNNs for the more sophisticated structure representation learning required by downstream tasks. Experiments on a synthetic-data tree probability estimation problem and a benchmark of challenging real-data variational Bayesian phylogenetic inference problems demonstrate the effectiveness and efficiency of our method.
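For intuition, minimizing the (unweighted) Dirichlet energy $\sum_{(u,v)\in E(\tau)}(f_u - f_v)^2$ with tip values held fixed forces each internal node's value to be the average of its neighbors' values, and on a tree this harmonic system can be solved exactly with one leaf-to-root elimination pass and one root-to-leaf back-substitution pass, i.e., in linear time. The sketch below illustrates this idea for a single scalar feature coordinate (one coordinate of the one-hot encoding); it is our own illustrative implementation, not necessarily the authors' algorithm:

```python
def minimize_dirichlet(adj, tip_vals, root):
    """Linear-time minimizer of sum_{(u,v) in E} (f_u - f_v)^2 on a tree,
    with tip values fixed. adj: dict node -> neighbor list; tip_vals:
    dict tip -> fixed value; root: any internal node. Internal values
    satisfy the harmonic condition f_u = mean of neighbors."""
    # iterative DFS to get parents and a parent-before-child order
    order, parent, stack = [], {root: None}, [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if v != parent[u]:
                parent[v] = u
                stack.append(v)
    # leaf-to-root pass: express f_u = a[u] + b[u] * f[parent(u)]
    a, b, f = {}, {}, dict(tip_vals)
    for u in reversed(order):
        if u in tip_vals:
            a[u], b[u] = tip_vals[u], 0.0
            continue
        kids = [v for v in adj[u] if v != parent[u]]
        sa = sum(a[v] for v in kids)
        sb = sum(b[v] for v in kids)
        denom = len(adj[u]) - sb
        if parent[u] is None:      # root has no parent term
            f[u] = sa / denom
        else:
            a[u], b[u] = sa / denom, 1.0 / denom
    # root-to-leaf back-substitution
    for u in order:
        if u not in f:
            f[u] = a[u] + b[u] * f[parent[u]]
    return f
```

On the path A–u–v–B with tips fixed at 0 and 1, for example, the harmonic solution assigns u and v the values 1/3 and 2/3, and the algorithm recovers this with a single up-down sweep.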

2. BACKGROUND

Notation. A phylogenetic tree is denoted as $(\tau, q)$, where $\tau$ is a bifurcating tree that represents the evolutionary relationship of the species and $q$ is a non-negative branch length vector that characterizes the amount of evolution along the edges of $\tau$. The tip nodes of $\tau$ correspond to the observed species, and the internal nodes of $\tau$ represent the unobserved characters (e.g., DNA bases) of the ancestral species. The transition probability $P_{ij}(t)$ from character $i$ to character $j$ along an edge of length $t$ is often defined by a continuous-time substitution model (e.g., Jukes & Cantor (1969)), whose stationary distribution is denoted as $\eta$. Let $E(\tau)$ be the set of edges of $\tau$, and let $r$ be the root node (or any internal node if the tree is unrooted and the substitution model is reversible). Let $Y = \{Y_1, Y_2, \ldots, Y_M\} \in \Omega^{N\times M}$ be the observed sequences (with characters in $\Omega$) of length $M$ over $N$ species.

Phylogenetic posterior. Assuming different sites $Y_i,\, i = 1, \ldots, M$ are independent and identically distributed, the likelihood of observing $Y$ given the phylogenetic tree $(\tau, q)$ takes the form
$$p(Y|\tau, q) = \prod_{i=1}^M p(Y_i|\tau, q) = \prod_{i=1}^M \sum_{a^i} \eta(a^i_r) \prod_{(u,v)\in E(\tau)} P_{a^i_u a^i_v}(q_{uv}),$$
where $a^i$ ranges over all extensions of $Y_i$ to the internal nodes, with $a^i_u$ being the assigned character of node $u$. The above phylogenetic likelihood function can be computed efficiently through the pruning algorithm (Felsenstein, 2003). Given a prior distribution $p(\tau, q)$ on the tree topology and the branch lengths, Bayesian phylogenetics then amounts to properly estimating the phylogenetic posterior $p(\tau, q|Y) \propto p(Y|\tau, q)p(\tau, q)$.

Variational Bayesian phylogenetic inference. Let $Q_\phi(\tau)$ be an SBN-based distribution over the tree topologies and $Q_\psi(q|\tau)$ be a distribution over the non-negative branch lengths.
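The sum over ancestral assignments $a^i$ in the likelihood above is computed without enumeration by the pruning algorithm, a postorder dynamic program over the tree. A minimal one-site sketch is given below; the two-state symmetric substitution model used for testing it is a hypothetical stand-in for a real model such as Jukes-Cantor:

```python
import numpy as np

def site_log_likelihood(children, tip_partials, P, eta):
    """Felsenstein pruning for a single site. children: dict node ->
    list of (child, branch_length), with [] for tips; tip_partials:
    dict tip -> indicator vector of the observed character; P(t):
    transition matrix along a branch of length t; eta: stationary
    distribution at the root."""
    def partial(u):
        # partial(u)[x] = P(observed tips below u | character x at u)
        if not children[u]:
            return tip_partials[u]
        L = np.ones(len(eta))
        for child, t in children[u]:
            L *= P(t) @ partial(child)
        return L
    return float(np.log(eta @ partial("root")))
```

Summing this quantity over the $M$ independent sites gives the full log-likelihood $\log p(Y|\tau, q)$.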
VBPI finds the best approximation to $p(\tau, q|Y)$ from the family of products of $Q_\phi(\tau)$ and $Q_\psi(q|\tau)$ by maximizing the following multi-sample lower bound:
$$L^K(\phi, \psi) = \mathbb{E}_{Q_{\phi,\psi}(\tau^{1:K}, q^{1:K})} \log\left(\frac{1}{K}\sum_{i=1}^K \frac{p(Y|\tau^i, q^i)\, p(\tau^i, q^i)}{Q_\phi(\tau^i)\, Q_\psi(q^i|\tau^i)}\right) \le \log p(Y),$$
where $Q_{\phi,\psi}(\tau^{1:K}, q^{1:K}) = \prod_{i=1}^K Q_\phi(\tau^i) Q_\psi(q^i|\tau^i)$. To properly parameterize the variational distributions, a support of the conditional probability tables (CPTs) is often acquired from a sample

