TABULAR DEEP LEARNING WHEN d ≫ n BY USING AN AUXILIARY KNOWLEDGE GRAPH

Anonymous authors
Paper under double-blind review

Abstract

Machine learning models exhibit strong performance on datasets with abundant labeled samples. However, for tabular datasets with extremely high d-dimensional features but limited n samples (i.e. d ≫ n), machine learning models struggle to achieve strong performance. Here, our key insight is that even in tabular datasets with limited labeled data, input features often represent real-world entities about which there is abundant prior information which can be structured as an auxiliary knowledge graph (KG). For example, in a tabular medical dataset where every input feature is the amount of a gene in a patient's tumor and the label is the patient's survival, there is an auxiliary knowledge graph connecting gene names with drug, disease, and human anatomy nodes. We therefore propose PLATO, a machine learning model for tabular data with d ≫ n and an auxiliary KG with input features as nodes. PLATO uses a multilayer perceptron (MLP) to predict the output labels from the tabular data and the auxiliary KG with two methodological components. First, PLATO predicts the parameters in the first layer of the MLP from the auxiliary KG. PLATO thereby reduces the number of trainable parameters in the MLP and integrates auxiliary information about the input features. Second, PLATO predicts different parameters in the first layer of the MLP for every input sample, thereby increasing the MLP's representational capacity by allowing it to use different prior information for every input sample. Across 10 state-of-the-art baselines and 6 d ≫ n datasets, PLATO exceeds or matches the prior state-of-the-art, achieving performance improvements of up to 10.19%. Overall, PLATO uses an auxiliary KG about input features to enable tabular deep learning prediction when d ≫ n.

1. INTRODUCTION

Machine learning models have reached state-of-the-art performance in domains with abundant labeled data like computer vision (Wortsman et al., 2022; Deng et al., 2009) and natural language processing (Wang et al., 2019; Devlin et al., 2019; Ramesh et al., 2022). However, for tabular datasets in which the number d of features vastly exceeds the number n of samples, machine learning models struggle to achieve strong performance (Hastie et al., 2009; Liu et al., 2017). Unfortunately, many high impact domains like chemistry (Guyon et al., 2004), biology (Iorio et al., 2016; Yang et al., 2012; Garnett et al., 2012; Gao et al., 2015), and physics (Kasieczka et al., 2021) produce datasets with high-dimensional features but limited labeled samples due to the high time and labor costs associated with experiments. In chemistry, for example, mass spectrometry datasets can have tens of thousands of features but only tens or hundreds of samples (Guyon et al., 2004). For these and other tabular datasets with d ≫ n, the performance of machine learning systems is currently limited.

To date, deep learning approaches for tabular data have focused on data regimes with far more samples than features (n ≫ d) (Grinsztajn et al., 2022; Gorishniy et al., 2021; Shwartz-Ziv & Armon, 2022). In the low-data regime with far more features than samples (d ≫ n), the dominant approaches are classical statistical methods (Hastie et al., 2009). These statistical methods reduce the dimensionality of the input space (Abdi & Williams, 2010; Liu et al., 2017; Van der Maaten & Hinton, 2008; Van Der Maaten et al., 2009), select features (Tibshirani, 1996; Climente-González et al., 2019; Freidling et al., 2021; Meier et al., 2008), impose regularization penalties on parameter magnitudes (Marquardt & Snee, 1975), or use ensembles of weak tree-based models (Friedman, 2001; Chen & Guestrin, 2016; Ke et al., 2017; Lou & Obukhov, 2017; Prokhorenkova et al., 2018).
Here, we present a novel problem setting and framework for tabular deep learning when d ≫ n (Figure 1). Our key insight is that even in tabular settings with limited labeled data, input features often represent real-world entities about which there is abundant prior information which can be structured as an auxiliary knowledge graph (KG). We propose a novel problem setting in which every input feature of a tabular dataset corresponds to a node in an auxiliary KG (Figure 1(a)). For example, consider a tabular medical dataset in which every row is a cancer patient, every column is a gene, every value is the amount of that gene in the patient's tumor, and the task is to predict the patient's survival. For this tabular dataset, there exists an auxiliary KG which consists of each gene's function, the relationships between genes, how a gene affects a part of human anatomy, and how human anatomy itself is structured. Note that the KG does not capture the relationships between input data samples but instead captures the relationships between input features.

Within our novel problem setting, we propose PLATO, a deep learning method for tabular data with d ≫ n and an auxiliary KG with input features as nodes (Figure 1(b)-(e)). PLATO uses a modified multilayer perceptron (MLP) to predict the output labels from the input samples and the auxiliary KG with two methodological components. First, the parameters in the first layer of the MLP are predicted from the auxiliary KG and the input sample rather than learned from just the tabular data. PLATO thereby integrates prior information about the input features from the auxiliary KG and drastically reduces the number of trainable parameters in the MLP. Second, the parameters in the first layer of the MLP are predicted differently for every sample by using the auxiliary KG and the sample values. PLATO thereby increases the representational capacity of the MLP and enables effective predictions.
We evaluate PLATO on 6 d ≫ n datasets from computational biology. We choose computational biology as it is a rich domain for d ≫ n in which we can construct a single knowledge graph to serve as a unified backbone for many distinct tabular datasets with distinct input features. We compare PLATO to 10 state-of-the-art baselines spanning dimensionality reduction, feature selection, classic statistical models, deep tabular learning methods, and parameter-prediction methods. Following a rigorous evaluation protocol from the tabular deep learning literature (Grinsztajn et al., 2022; Gorishniy et al., 2021), PLATO exceeds or matches the prior state-of-the-art on all 6 datasets, achieving performance improvements of up to 10.19%. Ablation studies further demonstrate the necessity of each methodological component of PLATO. Ultimately, PLATO uses an auxiliary KG about input features to enable tabular deep learning prediction when d ≫ n.

2. RELATED WORK

Tabular deep learning methods. In contrast to PLATO's setting, tabular deep learning methods have been developed for settings with far more samples than features (i.e. n ≫ d). Recent tabular deep learning benchmarks ignore datasets with high numbers of features and low numbers of samples (Grinsztajn et al., 2022; Gorishniy et al., 2021; Shwartz-Ziv & Armon, 2022). In the n ≫ d setting, various categories of deep tabular models have been benchmarked. We select the state-of-the-art models to compare against PLATO. First, decision tree models like NODE (Popov et al., 2020) make decision trees differentiable to enable gradient-based optimization (Hazimeh et al., 2020; Kontschieder et al., 2015; Yang et al., 2018). Second, tabular transformer architectures use an attention mechanism to select and learn interactions among features. These include TabNet (Arik & Pfister, 2021), TabTransformer (Huang et al., 2020), FT-Transformer (Gorishniy et al., 2021), and others (Song et al., 2019; Somepalli et al., 2021; Kossen et al., 2021).

d ≫ n methods. For PLATO's setting in which d ≫ n, various tabular machine learning approaches exist (Hastie et al., 2009). First, dimensionality reduction techniques like PCA (Abdi & Williams, 2010) aim to reduce the dimensionality of the input data while preserving as much of the variance in the data as possible (Liu et al., 2017; Van der Maaten & Hinton, 2008; Van Der Maaten et al., 2009). Second, feature selection approaches select a parsimonious set of features, leading to a smaller feature space. Classical feature selection approaches include LASSO (Tibshirani, 1996) and its variants (Climente-González et al., 2019; Freidling et al., 2021; Meier et al., 2008). For feature selection with deep learning, Stochastic Gates (Yamada et al., 2020) are among the best performing of many variants (Balın et al., 2019; Lu et al., 2018).
Finally, classical tree-based models like XGBoost learn ensembles of weak decision tree models to make an overall prediction (Friedman, 2001; Chen & Guestrin, 2016; Ke et al., 2017; Prokhorenkova et al., 2018).

Knowledge graph approaches. Existing knowledge graph approaches are designed for tasks directly on the graph such as link prediction (Wang et al., 2017; Trouillon et al., 2016; Wang et al., 2014; Yang et al., 2015; d'Amato et al., 2021). By contrast, PLATO does not make any predictions on the knowledge graph. Instead, PLATO makes predictions on a separate, tabular dataset by using the knowledge graph as prior information about the features and domain.

Graph classification approaches. In graph classification models, every input sample is a graph with node attributes, and a model must make a prediction for that graph. Graph classification models are not relevant for PLATO's problem setting. Graph classification models assume that different samples correspond to different graphs (Ying et al., 2021; Hu et al., 2020b; a). However, in PLATO every input sample corresponds to the exact same graph. Additional comments are in Appendix B.

Parameter prediction. Using one network to predict the parameters of another has been extensively studied (Denil et al., 2013; Schmidhuber, 1992; Bengio et al., 1991). For example, Ha et al. (2016) predict the weights in all layers of a sequential model (i.e. RNN, LSTM) by using information about the structure of the weights. Another approach, Diet Networks (Romero et al., 2017), predicts parameters by hand-crafting prior information about the input features or using random projections. By contrast, PLATO predicts parameters in a network by leveraging prior information about the input features in an auxiliary KG. PLATO systematically constructs an embedding for each input feature which contains the prior information about the feature that is most relevant to a given sample.

3. PLATO

PLATO is a machine learning method for tabular datasets with d ≫ n and an auxiliary KG with input features as nodes (Section 3.1). The key insight of PLATO is that even in tabular datasets with limited labeled samples, input features often represent real-world entities about which there is abundant prior information which can be structured as an auxiliary knowledge graph (KG) G (Figure 1(a)). PLATO uses a modified MLP to predict the output labels from the input samples and the auxiliary KG. PLATO's modified MLP has two key methodological components. First, the parameters in the first layer of the MLP are predicted from the auxiliary KG and the input sample rather than learned from just the tabular data (Figure 1(b)-(e)). PLATO thereby integrates prior information about the input features from the auxiliary KG and drastically reduces the number of trainable parameters in the MLP. Second, the parameters in the first layer of the MLP are predicted differently for every sample by using the auxiliary KG and the sample values. PLATO thereby increases the representational capacity of the MLP and enables effective predictions. The full PLATO algorithm is given in Algorithm 1.

3.1. PROBLEM SETTING

Consider a tabular dataset $X \in \mathbb{R}^{n \times d}$ with labels $y \in \mathbb{R}^{n}$ and far more features than samples such that d ≫ n. The goal is to train a machine learning model F to predict labels $\hat{y}$ from the input X. PLATO assumes the existence of an auxiliary knowledge graph $G = (V, E)$ with $|V|$ nodes and $|E|$ edges such that every input feature j corresponds to a node in G. Formally, $\forall j \in \{1, \dots, d\}, \exists v \in V$ s.t. $j \to v$, as shown in Figure 1(a). G also contains additional nodes which represent broader knowledge about the domain. The edges in G are (head node, relation type, tail node) triplets.
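The requirement that every input feature maps to a KG node can be checked mechanically before any training. The following is a minimal sketch under stated assumptions: the `Triplet` type, the `unmapped_features` helper, and all node and relation names are hypothetical and chosen only for illustration; they are not part of PLATO's released code or of the KG used in this paper.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """One KG edge stored as (head node, relation type, tail node)."""
    head: str
    relation: str
    tail: str


def unmapped_features(feature_names, triplets):
    """Return the input features that do not appear as any node in the KG."""
    nodes = {t.head for t in triplets} | {t.tail for t in triplets}
    return [name for name in feature_names if name not in nodes]


# Toy KG: three feature nodes (genes) plus two broader-knowledge nodes.
# All names below are placeholders, not real biological facts.
kg = [
    Triplet("gene_A", "interacts_with", "gene_B"),
    Triplet("gene_B", "expressed_in", "tissue_Y"),
    Triplet("gene_C", "associated_with", "disease_X"),
]
features = ["gene_A", "gene_B", "gene_C"]

# An empty list means every feature j has a corresponding node v in V.
missing = unmapped_features(features, kg)
```

Nodes such as `tissue_Y` and `disease_X` play the role of the additional nodes that carry broader domain knowledge: they have no column in X but still appear in triplets.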

3.2. THE PLATO MLP F

Consider a standard MLP $\hat{y} = T(X; \Theta)$ with $L$ layers, $h$ hidden units in the first layer, and trainable parameters $\Theta = \{\Theta^{[1]}, \Theta^{[2]}, \dots, \Theta^{[L]}\}$. The PLATO MLP F differs from T in two key ways.

First, the parameters in the first layer of PLATO's MLP F are predicted from prior information rather than learned only from the tabular data. We observe that every parameter in the first layer of T is associated with an input feature j. In particular, $\Theta^{[1]} \in \mathbb{R}^{d \times h}$ such that $\Theta^{[1]}_j \in \mathbb{R}^{h}$ is a vector of parameters connecting input feature j to every hidden unit in the first layer of the MLP (Figure 1(e)). Typically, T learns the parameters $\Theta^{[1]}_j$ and $\Theta^{[1]}_k$ associated with two features j and k independently by gradient backpropagation. In PLATO, we propose that if two input features j and k represent real-world entities that are related, then their corresponding parameters $\Theta^{[1]}_j$ and $\Theta^{[1]}_k$ should also be related. Parameter prediction details are in Section 3.3. For now, note that the parameters in the first layer of PLATO's MLP F are predicted such that $\Theta^{[1]} \in \mathbb{R}^{d \times h}$. The parameters $\Theta^{[2]}, \dots, \Theta^{[L]}$ in the remaining layers of PLATO's MLP F are learned normally.

Second, the parameters in the first layer of PLATO's MLP are allowed to vary for every input sample. In the standard MLP T, all parameters $\Theta^{[1]}, \dots, \Theta^{[L]}$ are the same for every input sample $X_i$. In the first layer of PLATO's MLP F, however, the parameters $\Theta^{[1]}$ are predicted from prior information about the input features. We observe that for each input sample $X_i$, the most relevant prior information about each input feature j might differ. Therefore, for each sample $X_i$, PLATO uses different prior information about each input feature j to predict the parameters $\Theta^{[1]}_j$. As a result, the parameters $\Theta^{[1]}$ in the first layer of F vary with each input sample $X_i$, increasing the representational capacity of F. How PLATO uses different prior information for parameter prediction is left to Section 3.3.
For now, note that PLATO predicts $\Theta^{[1]}_j$ from prior information about feature j. The prior information about feature j that is used depends on the input sample $X_i$. As a result, $\Theta^{[1]}$ is conditional on $X_i$ according to $\Theta^{[1]} \mid X_i$.

Formal notation. Overall, PLATO's MLP F takes the form

$$\hat{y}_i = F(X_i; \Theta \mid X_i). \tag{1}$$

F has parameters $\Theta = \{\Theta^{[1]} \mid X_i\} \cup \{\Theta^{[2]}, \dots, \Theta^{[L]}\}$ where $L$ is the number of layers in F. The parameters $\Theta^{[1]} \mid X_i$ in the first layer of F are predicted from the input sample $X_i$ via message-passing on the KG according to Section 3.3. For every sample i, a new $\Theta^{[1]}$ is predicted such that $\Theta^{[1]}$ is conditional on $X_i$ at both training and inference time. The dimensionality of $\Theta^{[1]} \in \mathbb{R}^{d \times h}$ is the same as in a normal MLP where $h$ is the number of hidden units in the first layer of F. The parameters in the remaining layers of PLATO's MLP, $\Theta^{[2]}, \dots, \Theta^{[L]}$, are the same as in a standard MLP: they are learned normally, are the same for every sample at both training and inference time, and are thus not conditional on $X_i$.
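To make the form of Eq. (1) concrete, the sketch below evaluates a two-layer network in which the first-layer weights differ for every sample while the second layer is shared. It is only an illustration under assumptions: the per-sample weights are random stand-ins for the output of the parameter predictor in Section 3.3, the ReLU nonlinearity, the absence of bias terms, and the choice L = 2 are all assumed, and the names `forward` and `theta1_per_sample` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 4, 6, 3  # toy sizes; the paper's setting has d much larger than n

X = rng.normal(size=(n, d))

# Stand-in for the predicted first layer: one (d x h) weight matrix per sample.
theta1_per_sample = rng.normal(size=(n, d, h))

# Remaining layer: learned normally and shared across all samples.
theta2 = rng.normal(size=(h, 1))


def forward(X, theta1_per_sample, theta2):
    """Compute y_hat[i] = relu(X[i] @ theta1[i]) @ theta2 for every sample i."""
    # einsum contracts the feature axis d separately for each sample i.
    hidden = np.einsum("id,idh->ih", X, theta1_per_sample)
    hidden = np.maximum(hidden, 0.0)  # ReLU is an assumed choice of nonlinearity
    return (hidden @ theta2).ravel()


y_hat = forward(X, theta1_per_sample, theta2)
```

In a standard MLP the same computation would use a single (d x h) matrix for all rows of X; the extra leading axis of `theta1_per_sample` is what expresses the conditioning on each sample.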

3.3. PREDICTING THE PARAMETERS IN THE FIRST LAYER OF PLATO'S MLP F

PLATO uses prior information about the input features to predict the parameters in the first layer of PLATO's MLP F. PLATO predicts these parameters in three steps. First, PLATO uses self-supervision on the auxiliary KG to pretrain an embedding for every input feature (Section 3.3.1, Figure 1(b)). Second, since different input samples might rely on different prior information about each input feature, PLATO updates each feature embedding to contain the most relevant prior information about the input feature for the given input sample (Section 3.3.2, Figure 1(c)). Finally, PLATO predicts the parameters in the first layer of F from the updated feature embeddings with a small neural network that is shared across input features (Section 3.3.3, Figure 1(d)-(e)).

3.3.1. PRETRAINING FEATURE EMBEDDINGS WITH SELF-SUPERVISION

First, PLATO learns general prior information about each input feature j from the auxiliary KG G (Figure 1(b)). PLATO represents the general prior information about each input feature j as a low-dimensional embedding $M_j \in \mathbb{R}^{c}$. Since every input feature j corresponds to a specific node in G, PLATO can learn $M_j$ by learning an embedding for the corresponding feature node in G. Any self-supervised node embedding method on G can be used within PLATO's framework.

Formal notation. Formally, PLATO uses self-supervision on G to pretrain an embedding for every input feature according to

$$M = H(G). \tag{2}$$

$M \in \mathbb{R}^{d \times c}$ is the matrix of all feature embeddings. H is a self-supervised node embedding method. We refer to Eq. (2) as pretraining since only the auxiliary KG G is used and the tabular data (X, y) are ignored. After pretraining, the feature embeddings M are fixed.

Implementation. For H, we choose ComplEx as it is a prominent and highly scalable KG node embedding method (Trouillon et al., 2016). ComplEx uses a self-supervised objective which learns an embedding for every node in G by classifying whether a proposed edge exists in G. ComplEx's proposed edges include both feature nodes and other nodes in G, thereby integrating prior information about the input features and the broader domain.
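For readers unfamiliar with ComplEx, the sketch below shows its scoring function for a single (head, relation, tail) triplet, namely the real part of the trilinear product of the head embedding, the relation embedding, and the complex conjugate of the tail embedding, followed by a sigmoid that turns the score into an edge probability. This is a minimal illustration, not the training pipeline used in this paper: the embedding size, the random initialization, and the helper names are assumptions, and negative sampling and optimization are omitted.

```python
import numpy as np


def complex_score(e_head, w_rel, e_tail):
    """ComplEx score: Re(sum_k e_head[k] * w_rel[k] * conj(e_tail[k]))."""
    return float(np.real(np.sum(e_head * w_rel * np.conj(e_tail))))


def edge_probability(e_head, w_rel, e_tail):
    """Sigmoid of the score, read as the probability that the edge exists."""
    return 1.0 / (1.0 + np.exp(-complex_score(e_head, w_rel, e_tail)))


rng = np.random.default_rng(0)
c = 8  # toy complex embedding size (assumed; not the paper's setting)


def random_complex(size):
    """Small random complex vector used as a placeholder embedding."""
    return 0.5 * (rng.normal(size=size) + 1j * rng.normal(size=size))


e_a, e_b, w_r = random_complex(c), random_complex(c), random_complex(c)
p_edge = edge_probability(e_a, w_r, e_b)
```

A useful property of this score is that a relation embedding with only real entries gives a symmetric relation, while a purely imaginary one gives an antisymmetric relation, which is one reason the model can represent varied relation types in a single KG.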

3.3.2. UPDATING FEATURE EMBEDDINGS TO CONTAIN THE MOST RELEVANT INFORMATION FOR AN INPUT SAMPLE

Since different input samples might rely on different prior information about each input feature, PLATO next updates each feature embedding $M_j \in \mathbb{R}^{c}$ to $Q_j \in \mathbb{R}^{c}$, a feature embedding which contains the most relevant prior information about feature j for a given input sample $X_i$ (Figure 1(c)). PLATO uses a message-passing network $\mathcal{Q}$ on the KG to update the feature embeddings in a way that minimizes the number of additional trainable parameters:

$$Q = \mathcal{Q}(X_i, M, G; \Pi).$$

The message-passing network $\mathcal{Q}$ uses an attention mechanism which considers the sample values $X_i$ to update the feature embeddings. The attention mechanism has a small number of trainable parameters $\Pi$.

The message-passing network $\mathcal{Q}$. Let $Q^{[r]}_j$ be the embedding of input feature j after round $r \in \{1, \dots, R\}$ of message passing. For every input feature j, $\mathcal{Q}$ first initializes the updated feature embedding to the pretrained feature embedding:

$$Q^{[0]}_j = M_j.$$

$\mathcal{Q}$ then conducts R rounds of message passing. In each round of message passing, the feature embedding $Q^{[r]}_j$ is updated from the feature embedding of each neighbor k in the prior round, $Q^{[r-1]}_k$, and its own feature embedding in the prior round, $Q^{[r-1]}_j$. The "message" being passed is the embedding of each feature from the prior round:

$$Q^{[r]}_j = \sigma\Big( \underbrace{\beta \sum_{k \in N_j} A_{ijk}\, Q^{[r-1]}_k}_{\text{weighted messages from neighbors}} + \underbrace{(1 - \beta)\, Q^{[r-1]}_j}_{\text{weighted message from self}} \Big). \tag{3b}$$

For a single round of message passing (R = 1), this reduces to

$$Q_j = \sigma\Big( \beta \sum_{k \in N_j} A_{ijk}\, M_k + (1 - \beta)\, M_j \Big).$$

$\sigma$ is an optional nonlinearity. $N_j$ are the neighbors of feature node j in G. During message-passing, $\mathcal{Q}$ uses two scalar values $\beta \in \mathbb{R}$ and $A_{ijk} \in \mathbb{R}$ to control the weights of messages. First, $\mathcal{Q}$ uses the hyperparameter $\beta$ to control the weight of the messages aggregated from all neighbors vs. the message from the feature node j itself. Second, $\mathcal{Q}$ calculates an attention score $A_{ijk}$ to control the weight of the specific message between feature j and neighbor k.
The attention score is different for every sample i and is calculated by a shallow neural network $\mathcal{A}$ with a small number of trainable parameters $\Pi$. The attention score $A_{ijk}$ thus enables the message-passing network to update the information in the feature embedding in a way that is most relevant for the input sample i. Formally:

$$A_{ijk} = \frac{\exp\big(\mathcal{A}(X_{ij}, X_{ik}; \Pi)\big)}{\sum_{t \in N_j} \exp\big(\mathcal{A}(X_{ij}, X_{it}; \Pi)\big)}.$$

The number of trainable parameters in $\Pi$ is small since the input of $\mathcal{A}$ is in $\mathbb{R}^{2}$ and the output of $\mathcal{A}$ is a scalar in $\mathbb{R}$. $\mathcal{A}$ and its parameters $\Pi$ are shared for all samples and features. Finally, the updated feature embeddings $Q_j$ are set after R rounds of message-passing:

$$Q_j = Q^{[R]}_j.$$

Every parameter in the first layer of F is associated with a feature j. PLATO thus predicts $\Theta^{[1]}_j$, the parameters associated with the feature j, from $Q_j$, the prior information about j.

Formal notation. PLATO predicts the parameters associated with every input feature j in the first layer of F according to

$$\Theta^{[1]}_j = P(Q_j \mid X_i; \Phi).$$

P is a shallow neural network parameterized by $\Phi$. $Q_j$ is the updated feature embedding of j which is conditional on the specific input sample $X_i$. P and its parameters $\Phi$ are shared for every feature $j \in \{1, \dots, d\}$.

PLATO drastically reduces the number of trainable parameters compared to a standard MLP. The sharing of P and $\Phi$ across all input features is what enables this reduction. For a high-dimensional tabular dataset (i.e. d ≫ n), a standard MLP T with $h$ hidden units has a large number of trainable parameters in the first layer since $\Theta^{[1]} \in \mathbb{R}^{d \times h}$. A standard MLP T must learn all $dh$ of these trainable parameters independently. By contrast, P uses a shared set of trainable parameters $\Phi$ to predict $\Theta^{[1]}_j$ from $Q_j$ for every $j \in \{1, \dots, d\}$. The number of trainable parameters in $\Phi$ is small compared to $dh$ since P need only transform every $Q_j \in \mathbb{R}^{c}$ to $\Theta^{[1]}_j \in \mathbb{R}^{h}$.
Thus, $|\Phi| = ch$ (assuming that P is a single-layer neural network). c, the dimensionality of the feature embedding, is much less than d, the number of input features. As a result, $|\Phi| = ch \ll dh$ and PLATO drastically reduces the number of trainable parameters in the first layer of an MLP.

Algorithm 1: The PLATO Algorithm.

Input: a data sample $X_i \in \mathbb{R}^{d}$, a knowledge graph (KG) $G = (V, E)$
Output: predicted label $\hat{y}_i \in \mathbb{R}$

Pretrain KG embedding for every feature: $M = H(G)$
Initialize feature embedding for feature j: $Q^{[0]}_j = M_j$
Compute sample i-specific attention weight: $A_{ijk} = \frac{\exp(\mathcal{A}(X_{ij}, X_{ik}; \Pi))}{\sum_{t \in N_j} \exp(\mathcal{A}(X_{ij}, X_{it}; \Pi))}$, ∀ features j, k, where $\mathcal{A}$ is a NN parameterized by $\Pi$
for r = 1, ..., R do
    Update feature embedding with message-passing neural network at layer r: $Q^{[r]}_j = \sigma\big(\beta \sum_{k \in N_j} A_{ijk} Q^{[r-1]}_k + (1 - \beta) Q^{[r-1]}_j\big)$
end
Obtain feature j embedding from GNN last layer R: $Q_j = Q^{[R]}_j$
Predict the parameters of the first layer of an MLP from the feature embedding: $\Theta^{[1]} = P(Q \mid X_i; \Phi)$, where P is a NN parameterized by $\Phi$
Combine the predicted first-layer parameters with the parameters of the remaining layers: $\Theta = \{\Theta^{[1]}\} \cup \{\Theta^{[2]}, \dots, \Theta^{[L]}\}$
Predict label: $\hat{y}_i = F(X_i; \Theta \mid X_i)$
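The steps of Algorithm 1 after pretraining can be sketched for a single sample as follows. This is a simplified illustration under several assumptions that are not stated in the paper: neighbors are restricted to feature nodes (so that every neighbor has a sample value), every feature has at least one neighbor, the nonlinearity is tanh, the attention network has one hidden layer of arbitrary width, and P is a single linear layer without bias. All function and variable names are hypothetical.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def attention_logit(x_j, x_k, Pi):
    """Shallow attention network: maps the pair (X_ij, X_ik) to a scalar."""
    W1, b1, w2 = Pi
    return float(w2 @ np.tanh(W1 @ np.array([x_j, x_k]) + b1))


def plato_first_layer(x_i, M, neighbors, Pi, Phi, beta=0.5, R=2):
    """Predict the sample-specific first-layer weights (d x h) for sample x_i."""
    d = M.shape[0]
    # Attention weights depend only on the sample values, so compute them once.
    A = [softmax(np.array([attention_logit(x_i[j], x_i[k], Pi)
                           for k in neighbors[j]]))
         for j in range(d)]
    Q = M.copy()  # round 0: start from the pretrained embeddings
    for _ in range(R):
        # Each round mixes attention-weighted neighbor messages with the
        # node's own previous embedding; all rows use the previous round's Q.
        Q = np.stack([np.tanh(beta * (A[j] @ Q[neighbors[j]])
                              + (1.0 - beta) * Q[j])
                      for j in range(d)])
    return Q @ Phi  # shared single-layer predictor: (d, c) @ (c, h)


def first_layer_param_counts(d, c, h):
    """Trainable first-layer parameters: (shared predictor, standard MLP)."""
    return c * h, d * h


rng = np.random.default_rng(0)
d, c, h, width = 5, 4, 3, 8
M = rng.normal(size=(d, c))                     # stand-in for pretrained M
neighbors = [[1, 2], [0], [0, 3], [2, 4], [3]]  # toy feature-only adjacency
Pi = (rng.normal(size=(width, 2)), rng.normal(size=width), rng.normal(size=width))
Phi = rng.normal(size=(c, h))
x_i = rng.normal(size=d)

theta1 = plato_first_layer(x_i, M, neighbors, Pi, Phi)
hidden = np.maximum(x_i @ theta1, 0.0)  # input to the remaining, shared layers
```

As a rough sense of scale, taking the smallest dataset size of 12,932 features from Appendix B, the 200-dimensional embedding mentioned in Appendix E, and an assumed h = 64, `first_layer_param_counts(12932, 200, 64)` gives 12,800 shared parameters against 827,648 for a standard first layer.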

4. EXPERIMENTS

We evaluate PLATO against 10 statistical and deep baselines on 6 tabular datasets with d ≫ n.

Datasets. We use 6 tabular d ≫ n datasets in biomedicine compiled from prior studies (Gao et al., 2015; Garnett et al., 2012; Iorio et al., 2016; Yang et al., 2012). We focus on biomedicine because it is a rich domain for d ≫ n in which a single KG can be used as a unified knowledge backbone across many datasets. Additional data descriptions are in Appendix E.

Auxiliary Knowledge Graph. As a unified knowledge backbone for the datasets, we compile a general biomedical knowledge graph from prior studies (et al., 2020; 2016; Kuhn et al., 2015; Ruiz et al., 2021; Szklarczyk et al., 2020; Wishart et al., 2017a; b). Our knowledge graph contains 108,447 nodes, 3,066,156 edges, and 99 relation types. All datasets include features which map to a subset of nodes in the knowledge graph. The remaining nodes serve as broader domain knowledge. The same KG is used across all datasets even though all datasets have distinct feature sets with distinct cardinalities. PLATO thus allows a single KG to serve as a unified knowledge backbone for different datasets in a domain. Additional KG details are in Appendix F.

Baselines. We compare PLATO to 10 state-of-the-art statistical and deep baselines. We consider classic regularization with Ridge Regression (Marquardt & Snee, 1975), dimensionality reduction with PCA (Abdi & Williams, 2010), feature selection with LASSO (Tibshirani, 1996), deep feature selection with Stochastic Gates (Yamada et al., 2020), and gradient boosted decision trees with XGBoost (Chen & Guestrin, 2016). We also consider deep tabular learning methods including a standard MLP, self-attention-based tabular methods with TabTransformer (Huang et al., 2020) and TabNet (Arik & Pfister, 2021), differentiable decision trees with NODE (Popov et al., 2020), and parameter prediction with Diet Networks (Romero et al., 2017).
We also attempted FT-Transformer (Gorishniy et al., 2021), but it ran out of memory on all datasets due to the large number of features.

Fair Comparison of PLATO with Baselines. To ensure a fair comparison with baselines, we follow evaluation protocols in tabular benchmarks (Grinsztajn et al., 2022; Gorishniy et al., 2021). We conduct a random search with 500 configurations of every model (including PLATO) on every dataset across a broad range of hyperparameters (Appendix A). We split each dataset into 60/20/20 training, validation, and test sets. All results are computed across 3 data splits and 3 runs of each model in each data split. We report the mean and standard deviation of the Pearson correlation (PearsonR) between $y$ and $\hat{y}$ on the test set across runs and splits. Each model is run on a GeForce RTX 2080 Ti GPU.
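The reporting part of this protocol (60/20/20 splits, 3 splits by 3 runs, mean and standard deviation of test PearsonR) can be sketched as below. This is an illustration under assumptions rather than the evaluation code used for the paper: the data are synthetic, the model is a closed-form ridge regression standing in for any of the compared methods, the 500-configuration random search is omitted, and the names `split_indices`, `evaluate`, and `ridge_fit_predict` are hypothetical.

```python
import numpy as np


def split_indices(n, seed):
    """Random 60/20/20 train/validation/test split of n sample indices."""
    perm = np.random.default_rng(seed).permutation(n)
    n_train, n_val = (6 * n) // 10, (2 * n) // 10
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]


def pearson_r(y, y_hat):
    return float(np.corrcoef(y, y_hat)[0, 1])


def evaluate(X, y, fit_predict, n_splits=3, n_runs=3):
    """Mean and std of test PearsonR over n_splits data splits x n_runs runs."""
    scores = []
    for split_seed in range(n_splits):
        train, val, test = split_indices(len(y), split_seed)
        for run_seed in range(n_runs):
            y_hat = fit_predict(X[train], y[train], X[val], y[val],
                                X[test], run_seed)
            scores.append(pearson_r(y[test], y_hat))
    return float(np.mean(scores)), float(np.std(scores)), scores


def ridge_fit_predict(X_tr, y_tr, X_val, y_val, X_te, seed, lam=1.0):
    """Stand-in model. It is deterministic, so the run seed has no effect, and
    the validation set (normally used for model selection) is unused here."""
    d = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
    return X_te @ w


# Synthetic d >> n data: 50 samples, 200 features, 5 of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
y = X[:, :5] @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

mean_r, std_r, scores = evaluate(X, y, ridge_fit_predict)
```

For a stochastic model, `run_seed` would control weight initialization and minibatch order, so the 3 runs within a split would differ and contribute to the reported standard deviation.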

4.1. RESULTS

PLATO outperforms statistical and deep baselines when d ≫ n. PLATO outperforms all baselines across all 6 datasets with d ≫ n (Table 1). PLATO achieves the largest relative improvement on the PDAC dataset, improving by 10.19% vs. XGBoost, the best baseline for PDAC (0.400 vs. 0.363). While PLATO achieves the strongest performance across all 6 datasets, the best performing baseline varies with different datasets. Ridge Regression is the strongest baseline for BRCA, LASSO for CM and CRC, XGBoost for PDAC and CH, and TabTransformer for MNSCLC. The remaining baselines (PCA, STG, Diet Networks, MLP, NODE, and TabNet) are not the strongest baseline for any dataset. We also find that the performance of a specific baseline depends largely on the specific dataset. TabTransformer, for example, is the best baseline for the MNSCLC dataset but the worst baseline for the CH dataset. The rank order of all models on all datasets is in Appendix C.

PLATO's performance depends on updating feature embeddings to contain information relevant to a sample. PLATO predicts the parameters $\Theta^{[1]}$ in the first layer of a modified MLP F by using feature embeddings which contain prior information about the input features. PLATO first pretrains general feature embeddings $M \in \mathbb{R}^{d \times c}$. PLATO then updates the feature embeddings to $Q \in \mathbb{R}^{d \times c}$, which contain information about the input features that is most relevant to a given sample $X_i$. We test whether updating the feature embeddings based on a given $X_i$ is necessary by evaluating PLATO on the BRCA dataset in three configurations (Table 2). The default configuration uses the updated feature embeddings Q to predict $\Theta^{[1]}$ according to $\Theta^{[1]} = P(Q \mid X_i)$. The second configuration uses the general feature embeddings M instead of Q to predict $\Theta^{[1]}$ according to $\Theta^{[1]} = P(M)$. The third configuration does not use feature embeddings and thus reduces to a standard MLP.
We compare PLATO's performance when it uses feature embeddings which contain the relevant information for a given sample $X_i$ (i.e. $\Theta^{[1]} = P(Q \mid X_i)$) vs. the pretrained feature embeddings M which contain general information about the input features that is not specific to a given sample (i.e. $\Theta^{[1]} = P(M)$). Using general feature embeddings M improves over not using feature embeddings at all (0.522 vs. 0.240). Using feature embeddings Q that are specific to a given input sample further improves performance (0.583 vs. 0.522). Therefore, updating the feature embeddings to Q such that they contain the information specific to a given sample is critical to PLATO's performance.

PLATO's performance depends on both feature nodes and broader knowledge nodes in the auxiliary KG. PLATO relies on an auxiliary KG G which contains information about input features and information about the broader domain. Information about input features is represented as feature nodes while information about the broader domain is represented as other nodes in G (Section 3.1). To test the relative importance of the feature information in G vs. the broader domain information, we measured the performance of PLATO on the BRCA dataset in two KG configurations: PLATO with the full KG (i.e. both the feature nodes and the broader domain nodes) and PLATO with a "feature-only KG" (i.e. an induced subgraph on only the feature nodes) (Table 3). We also compare to a "no KG" configuration in which PLATO does not have access to the KG. Without auxiliary information about the input features or the broader domain, PLATO is ablated to become a standard MLP. We find that both the feature nodes and the broader knowledge nodes are important for PLATO's performance. Using the "feature-only KG" configuration of PLATO improves performance vs. the "no KG" configuration of PLATO (0.539 vs. 0.240). Using the "full KG" configuration further improves performance vs. the "feature-only KG" configuration (0.583 vs. 0.539).
PLATO's performance thus relies on both the feature information and the broader domain information in the KG.

For datasets with d ∼ n, PLATO is competitive with baselines. Finally, we test PLATO's performance for datasets with d ∼ n. We test 4 datasets with d ∼ n ranging from $d/n = 1.1$ to $d/n = 2.0$ (Table 4). We find that on these 4 datasets, PLATO is competitive with the best performing baseline, XGBoost, but does not improve performance substantially. PLATO's stronger performance for datasets with d ≫ n than for datasets with d ∼ n is expected. PLATO's key idea is to include auxiliary information about the input features. Auxiliary information is likely to help performance the most in settings with the least labeled data (i.e. d ≫ n). When d ∼ n, auxiliary information is less helpful since the tabular dataset may already have enough information to train a strong predictive model. We further find that XGBoost is consistently the strongest baseline for datasets with d ∼ n, in contrast to the varied performance of XGBoost on the datasets with d ≫ n (Table 1).

5. DISCUSSION

PLATO is a machine learning model for tabular data with d ≫ n and an auxiliary KG with input features as nodes. Across 6 datasets and 10 baselines, PLATO achieves state-of-the-art performance, including relative performance improvements of up to 10.19%. Ablation studies also confirm the importance of each component of PLATO. PLATO has several limitations. First, PLATO matches but does not substantially improve the performance of baselines for high-dimensional datasets with more samples (i.e. d ∼ n). Second, PLATO relies on the coverage of prior information. Datasets with input features that have little prior information in the KG are less likely to benefit from PLATO. Overall, PLATO uses an auxiliary KG about input features to enable tabular deep learning when d ≫ n.

B GRAPH CLASSIFICATION APPROACHES

Graph classification models are not relevant for PLATO's setting. In graph classification models, every input sample is a graph with node attributes, and a model must make a prediction for that graph. The PLATO problem setting breaks fundamental assumptions made by graph classification models, rendering them inapplicable. First, graph classification models assume that different samples correspond to different graphs (Ying et al., 2021; Hu et al., 2020b; a). However, in PLATO every sample corresponds to the exact same graph. There is a single background knowledge graph for all samples, and every sample has input features that correspond to the exact same nodes within the knowledge graph. Second, graph classification approaches typically assume that every node in an input graph has a node attribute (Ying et al., 2021; Hu et al., 2020b; a). However, in PLATO only a small subset of the nodes in the knowledge graph have measured feature values. Finally, graph classification approaches typically assume small graphs: the largest graph classification task in the Open Graph Benchmark has only 244 nodes (Hu et al., 2020a). However, in PLATO, the knowledge graph contains 108,447 nodes and the smallest dataset has 12,932 features corresponding to nodes.

C RANK ORDERING OF METHODS FOR DATASETS WITH d ≫ n

In Supplementary Table 3, we show the rank order performance of all models on all d ≫ n datasets. We find that PLATO exhibits consistent and strong performance while the performance of the baselines depends on the specific d ≫ n dataset. For example, TabTransformer is the second best performing of all models on the MNSCLC dataset but the worst performing of all models on the PDAC and CH datasets. Similarly, XGBoost is the second best performing of all models on PDAC but only the seventh best performing of all models on BRCA. The baselines with the most stable performance are LASSO and Ridge Regression, which rank consistently between the second and fifth best of all models.

Supplementary Table 3: For datasets with d ≫ n, PLATO exhibits consistent and strong performance. By contrast, the performance of the baselines varies with each dataset. For every dataset, the rank order of performance from

D CODE DETAILS

Code to run PLATO will be included as a supplementary file in the final version of the manuscript.

E DATASET DETAILS

We compiled 6 datasets with d ≫ n and 4 datasets with d ∼ n. In all datasets, a machine learning model must predict the response of a cell or mouse to a drug. In the tabular data, every row corresponds to a specific cell or mouse. Every column corresponds to a gene name. Every value corresponds to the amount of that gene in the tumor of the specific cell or mouse. The label is the response of the cell or mouse. All genes are nodes in the knowledge graph. In practice, the number of genes is large for all tasks and the number of samples is comparatively small, making the drug response setting appropriate for d ≫ n. Dataset statistics, names, and sources are given below. Dataset pre-processing followed the standard process described in Mourragui et al. (2021). Briefly, gene expression values underwent TMM normalization and log transformation (i.e. log(x + 1)). Values were then standardized to have zero mean and unit standard deviation. As labels, we used ln-ic50 for datasets from (Iorio et al., 2016; Yang et al., 2012; Garnett et al., 2012) and the minimum average percent tumor growth (i.e. "min-avg-pcttumor-growth") for datasets from (Gao et al., 2015). For all datasets, we use a 200-dimensional ComplEx (Trouillon et al., 2016) embedding of the drug as the input feature vector. All datasets will be released in the final version of the manuscript.
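The log transformation and standardization steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes TMM normalization has already been applied upstream, and it assumes standardization is done per gene (per column), which the text does not state explicitly. The function name `preprocess_expression` and the toy matrix are hypothetical.

```python
import numpy as np

def preprocess_expression(values: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Log-transform and standardize a (samples x genes) matrix.

    Assumption: TMM normalization was applied before this function.
    Assumption: zero mean / unit std is enforced per gene (column).
    """
    logged = np.log1p(values)                      # log(x + 1)
    mean = logged.mean(axis=0, keepdims=True)      # per-gene mean
    std = logged.std(axis=0, keepdims=True)        # per-gene std
    return (logged - mean) / (std + eps)           # eps guards constant genes

# Toy matrix: 8 samples (rows) x 5 genes (columns), values are placeholders.
toy_values = np.arange(40, dtype=float).reshape(8, 5)
toy_z = preprocess_expression(toy_values)
```

In a train/test setting, the per-gene mean and standard deviation would normally be estimated on the training split only and reused for held-out samples; the text does not say which convention was followed.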



Figure 1: PLATO is a machine learning model for tabular data with d ≫ n and an auxiliary knowledge graph with input features as nodes. Machine learning models struggle to achieve strong performance on tabular datasets with far more features (d) than samples (n), i.e. d ≫ n. (a) The key insight of PLATO is that even in settings with limited labeled samples, input features often represent real-world entities about which there is abundant prior knowledge. We propose a new problem setting in which every feature in the input matrix corresponds to a node in an auxiliary knowledge graph (KG) G. (b-e) PLATO uses G to predict the parameters in the first layer of a modified MLP F. (b) First, PLATO pretrains an embedding M_j ∈ R^c for every feature node in G using H, a self-supervised KG node embedding approach. (c) Second, PLATO updates each feature embedding to focus on the feature information that is most relevant to an input sample X_i. PLATO uses a message passing network Q to produce Q_j ∈ R^c. Q uses an attention mechanism which considers the input sample X_i. Q_j thus depends on the input sample (i.e. Q_j | X_i). (d, e) Finally, PLATO uses a small neural network P to predict the parameters in the first layer of an MLP F from Q_j. The parameters in the first layer of F vary for every input sample X_i.
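Steps (d, e) of Figure 1 can be illustrated with a small NumPy sketch. It is written under several assumptions and is not the authors' implementation: the sample-specific embeddings Q_j | X_i are replaced by random placeholders (the message passing network Q is not reproduced), P is assumed to be a two-layer ReLU network shared across features, and all sizes (`d`, `c`, `h`, the width 32) are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, h = 1000, 16, 8   # features, embedding dim, hidden units (toy sizes)

# Hypothetical stand-in for the sample-specific embeddings Q_j | X_i.
Q_i = rng.normal(size=(d, c))
x_i = rng.normal(size=d)            # one input sample X_i

# P: a small network shared across all features, mapping R^c -> R^h.
# Its architecture here (one hidden layer of width 32) is an assumption.
W_p1 = rng.normal(scale=0.1, size=(c, 32))
W_p2 = rng.normal(scale=0.1, size=(32, h))

def predict_first_layer(Q: np.ndarray) -> np.ndarray:
    """Predict one row of first-layer weights per feature from its embedding."""
    return np.maximum(Q @ W_p1, 0.0) @ W_p2        # shape (d, h)

theta1 = predict_first_layer(Q_i)                  # first layer for this sample
hidden = np.maximum(x_i @ theta1, 0.0)             # first-layer output, shape (h,)

# Parameter counts: weights of P versus a directly trained d x h first layer.
# The count for P excludes the pretrained embeddings and the network Q.
trainable_in_P = W_p1.size + W_p2.size
dense_first_layer = d * h
```

Because P is shared across features, its weight count depends on c and h but not on d, which is the sense in which predicting the first layer reduces the number of trainable parameters when d is large.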

too. To capture the intuition that related input features j and k should have related parameters, PLATO's MLP F predicts Θ^[1]_j and Θ^[1]_k from prior information known about j and k. If input features j and k are related, then the parameter prediction module produces related Θ^[1]_j and Θ^[1]_k.

3.3.3 PREDICTING THE FIRST LAYER OF PARAMETERS IN F FROM THE UPDATED FEATURE EMBEDDINGS

Finally, PLATO predicts the parameters in the first layer of F from each updated feature embedding (Figure 1(d)(e))

where F is an MLP parameterized by Θ

PLATO outperforms statistical and deep baselines when d ≫ n. For every dataset, the best overall model is in bold and the second best model is underlined.

PLATO's performance depends on updating feature embeddings to contain information that is specific to a given sample.

PLATO's performance depends on both feature nodes in G and other nodes which represent broader domain knowledge.

PLATO's performance is competitive with baselines when d ∼ n. For every dataset, the best overall model is in bold and the second best model is underlined.

Table 1 is shown. The best overall model is in bold and the second best model is underlined.

For datasets with d ∼ n, PLATO is competitive with baselines. XGBoost is consistently the strongest baseline. For every dataset, the rank order of performance from Table 4 is shown. The best overall model is in bold and the second best model is underlined.

Knowledge graph relations between node types.

A EVALUATION PROTOCOL AND HYPERPARAMETER RANGES

To ensure a fair comparison with baselines, we follow the evaluation protocols outlined in prior tabular benchmarks (Grinsztajn et al., 2022; Gorishniy et al., 2021). We conduct a random search with 500 configurations of every model (including PLATO) on every dataset across a broad range of hyperparameters. We base the hyperparameter ranges on those used in prior tabular learning benchmarks (Grinsztajn et al., 2022; Gorishniy et al., 2021) and on the ranges mentioned in the original papers of the methods. Hyperparameter ranges for PLATO are given in Supplementary 
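A random search of the kind described in this section can be sketched as below. The search space, the hyperparameter names, and the `toy_evaluate` scoring function are hypothetical placeholders and do not reflect the ranges or metrics used in the paper; the sketch also assumes that a higher validation score is better.

```python
import math
import random

# Hypothetical search space; names and ranges are placeholders only.
SEARCH_SPACE = {
    "learning_rate": ("log_uniform", 1e-5, 1e-2),
    "hidden_dim": ("choice", [64, 128, 256, 512]),
    "dropout": ("uniform", 0.0, 0.5),
}

def sample_config(rng: random.Random) -> dict:
    """Draw one configuration independently from each hyperparameter range."""
    config = {}
    for name, spec in SEARCH_SPACE.items():
        kind = spec[0]
        if kind == "log_uniform":
            low, high = spec[1], spec[2]
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        elif kind == "uniform":
            config[name] = rng.uniform(spec[1], spec[2])
        else:  # "choice"
            config[name] = rng.choice(spec[1])
    return config

def random_search(evaluate, n_configs: int = 500, seed: int = 0):
    """Evaluate n_configs random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for _ in range(n_configs):
        config = sample_config(rng)
        score = evaluate(config)       # assumed: higher validation score is better
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_evaluate(config: dict) -> float:
    """Placeholder for 'train the model, return a validation score'."""
    return -abs(config["dropout"] - 0.25)

best_config, best_score = random_search(toy_evaluate)
```

Giving every model, including the proposed one, the same budget of sampled configurations per dataset is what makes the comparison fair in the sense described above.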

F KNOWLEDGE GRAPH DETAILS

As a unified knowledge backbone for the datasets, we compile a general biomedical knowledge graph from prior studies (et al., 2020; 2016; Kuhn et al., 2015; Ruiz et al., 2021; Szklarczyk et al., 2020; Wishart et al., 2017a; b). A schematic of the KG is in Supplementary Figure 2. A detailed breakdown of relation types is in Supplementary Table 5.

Our knowledge graph contains 108,447 total nodes, including 7,975 drugs, 18,370 diseases, 11,447 phenotypes, 22,319 genes, 11,153 molecular functions, 28,748 biological processes, and 4,184 cellular components. Our knowledge graph contains 3,066,156 edges with 99 distinct relation types.

All datasets include features which map to a subset of nodes in the knowledge graph, primarily genes and drugs. The remaining node types and their relationships serve as broader domain knowledge.

Edges between drug nodes and gene/protein nodes were derived from Drugbank (Wishart et al., 2017b), Gao (Gao et al., 2015), and the Genomics of Drug Sensitivity in Cancer (Yang et al., 2012; Iorio et al., 2016; Garnett et al., 2012). Edges between diseases and genes/proteins were derived from DisGeNet (Bauer-Mehren et al., 2010). Edges between diseases and phenotypes were derived from the Human Phenotype Ontology (et al., 2016). Edges between drugs and diseases were derived from the Multiscale Interactome (Ruiz et al., 2021). Edges between drugs and side effects were derived from SIDER (Kuhn et al., 2015). Edges between genes/proteins and other genes/proteins were derived from BioGRID (Oughtred et al., 2019), (Rual et al., 2005), the Database of Interacting Proteins (Salwinski et al., 2004), (et al., 2020), (Menche et al., 2015), (Rolland et al., 2014), (Yu et al., 2011), (Venkatesan et al., 2009), and STRING (Szklarczyk et al., 2020). Finally, edges from genes/proteins to molecular functions, biological processes, and cellular components as well as edges between molecular functions, biological processes, and cellular components were derived from the Gene Ontology (Consortium, 2018). The full knowledge graph will be included as a supplementary file in the final version of the manuscript.
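As a rough illustration of the structure described in this section, a typed knowledge graph can be held as a list of (head, relation, tail) triples, from which node-type counts, relation types, and the mapping from tabular features to nodes can be derived. The entity names, relation names, and the `type:name` identifier convention below are invented for the example and are not taken from the actual graph.

```python
from collections import Counter

# Toy (head, relation, tail) triples; all names are illustrative only.
triples = [
    ("gene:A", "interacts_with", "gene:B"),
    ("drug:X", "targets", "gene:A"),
    ("disease:D", "associated_with", "gene:B"),
    ("gene:B", "participates_in", "process:P"),
]

def node_type(node: str) -> str:
    """Return the type prefix of a 'type:name' node identifier."""
    return node.split(":", 1)[0]

# Collect the node set, per-type node counts, and distinct relation types.
nodes = {n for head, _, tail in triples for n in (head, tail)}
type_counts = Counter(node_type(n) for n in nodes)
relation_types = {relation for _, relation, _ in triples}

# Map tabular feature (column) names to gene nodes and flag uncovered features.
feature_names = ["A", "B", "C"]
feature_to_node = {f: f"gene:{f}" for f in feature_names if f"gene:{f}" in nodes}
uncovered = [f for f in feature_names if f not in feature_to_node]
```

A coverage check like `uncovered` is one simple way to see how many input features have a corresponding node in the graph before training.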

