PREDICTING ANTIMICROBIAL MICS FOR NONTYPHOIDAL SALMONELLA USING MULTITASK REPRESENTATIONS LEARNING OF TRANSFORMER

Abstract

Antimicrobial resistance (AMR) in pathogens has become an increasingly serious worldwide issue, posing a significant threat to global public health. To achieve an optimal therapeutic effect, antibiotic sensitivity is usually evaluated in clinical settings, but traditional culture-dependent antimicrobial sensitivity tests are labor-intensive and relatively slow. Rapid methods can greatly optimize antimicrobial therapeutic strategies and improve patient outcomes by reducing the time it takes to test antibiotic sensitivity. The rapid development of sequencing technology and machine learning techniques provides promising alternative approaches for sequencing-based antimicrobial resistance prediction. In this study, we used a lightweight Multitask Learning Transformer to predict the minimum inhibitory concentrations (MICs) of 14 antibiotics for Salmonella strains based on genomic information, including point mutations, pan-genome structure, and the profile of antibiotic resistance genes, from 5,278 publicly available whole genomes of nontyphoidal Salmonella. We obtained better prediction results (an improvement of more than 10% in raw accuracy and 3% in accuracy within ±1 two-fold dilution step) and better interpretability than the other ML models. Beyond their potential clinical application, our models shed light on the mechanistic understanding of key genetic regions influencing AMR.

1. INTRODUCTION

Antibiotics are chemical compounds used to kill bacteria or inhibit their growth, and they play a pivotal role in the control of infectious diseases. However, ever-increasing antimicrobial resistance (AMR) threatens the clinical effectiveness of antibiotic treatments. Antibiotic resistance in pathogens can result in treatment failure, with high morbidity or mortality, and can substantially increase health care costs. Over 70 percent of the bacteria that cause hospital-acquired infections are resistant to at least one common antibiotic used for treatment (Stone et al., 2009).

In clinical settings, testing the antimicrobial resistance of pathogens is critical for choosing appropriate antibiotics for treatment. Antimicrobial susceptibility/sensitivity testing (AST) determines whether antibiotics can inhibit the growth of bacteria or fungi, and thus measures their susceptibility, or reflects their resistance, to specific antibiotics. Several AST methods are widely used, including broth microdilution, antimicrobial gradient, disk diffusion, and rapid automated instrument methods (Barth et al., 2009). The minimum inhibitory concentration (MIC) is one of the most frequently used AST measures; it quantifies the lowest concentration of an antibiotic that prevents the growth of a microorganism. Qualitative descriptions of antimicrobial sensitivity (resistant, sensitive, etc.) provide no accurate quantification, which limits their power in certain scientific and clinical applications. In contrast, MIC measurements provide sufficient resolution to capture how antimicrobial susceptibility varies among strains in a population, which is useful for many epidemiological and clinical objectives. Because traditional antimicrobial sensitivity testing relies on culture-dependent methods, it is labor-intensive and relatively slow.
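
As a minimal illustration of the MIC definition above, the following Python sketch reads an MIC from a two-fold dilution series as the lowest tested concentration at which no growth is observed. The function name `read_mic`, the concentrations, and the growth readings are hypothetical and chosen only for illustration; real laboratory readings follow standardized protocols and handle cases (such as skipped wells) that this sketch ignores.

```python
from typing import Optional, Sequence


def read_mic(concentrations: Sequence[float],
             growth: Sequence[bool]) -> Optional[float]:
    """Return the lowest tested concentration with no observed growth.

    Returns None when growth is observed at every tested concentration,
    i.e. the MIC lies above the tested range.
    """
    if len(concentrations) != len(growth):
        raise ValueError("concentrations and growth must have the same length")
    # Keep only the concentrations at which the organism did not grow.
    inhibitory = [c for c, grew in zip(concentrations, growth) if not grew]
    return min(inhibitory) if inhibitory else None


# Hypothetical two-fold dilution series (mg/L) and growth readings.
series = [0.25, 0.5, 1, 2, 4, 8]
observed_growth = [True, True, True, False, False, False]
mic = read_mic(series, observed_growth)
print(mic)
```

In this hypothetical series the organism stops growing at 2 mg/L, so that value is reported as the MIC.
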
In conventional microbiological laboratory diagnosis, the total time for bacterial growth, isolation, taxonomic identification, and MIC determination may exceed 36 h for fast-growing bacteria and may reach several days for slow-growing bacteria (Opota et al., 2015). From a clinical point of view, testing antimicrobial sensitivity with more accurate and rapid methods could greatly optimize antimicrobial therapeutic strategies and improve patient outcomes (Llor et al., 2014).

Whole-genome sequencing (WGS) has been widely used for public health surveillance in the past decades, guiding clinical diagnosis and health care decisions. WGS-based data mining is used to assess phylogenetic relationships, conduct outbreak investigations, detect antimicrobial resistance, and predict the virulence or pathogenicity of potential pathogens (Varma et al., 2002). Several recent studies have used WGS data to predict AMR phenotypes. The most common approach relies on a homology search against a reference set of antimicrobial resistance genes and the polymorphisms associated with them (Stoesser et al., 2013). Such reference-guided homology search approaches can describe antimicrobial resistance in a rough way, provided that the target organisms have been adequately studied and the mechanisms of antimicrobial resistance are known. However, the demand for more accurate and quantitative prediction of antibiotic sensitivity or resistance necessitates novel predictive models. With the increase in publicly available whole-genome sequences, machine learning models have been developed in recent studies to predict antimicrobial sensitivity from WGS data. Advanced statistical and machine learning (ML) models, including logistic regression (LR), gradient boosted decision trees (GBoost), random forests (RF), and deep neural networks (DNN), have been applied to predict antimicrobial sensitivity (Bálint., 2016).
Based on the whole-genome sequences of different strains and the corresponding MIC information, such predictive models can identify critical genes or regions associated with antimicrobial resistance without a priori information (Zankari, 2012). One study applied four machine learning methods (random forests, gradient boosted decision trees, deep neural networks, and a rule-based baseline) to whole-genome sequencing data of E. coli to predict antibiotic resistance, using the presence or absence of genes, population structure, and year of isolation as predictors. Without prior knowledge of the causal mechanism, the gradient boosted decision trees model achieved an average accuracy of 0.92 and a recall of 0.83 (Moradigaravand, 2018). Another study analyzed 704 E. coli genomes with MIC measurements for ciprofloxacin. The models identified three mutations in gyrA, one mutation in parC, and the presence of any qnrS gene as collectively and strongly associated with the MIC (Kouchaki, 2018). Although such predictive approaches require many genomes and experimentally validated MICs for modelling, they are unbiased, accurate, and able to discover genomic features associated with AMR.

Salmonella is one of the most common causes of foodborne diseases worldwide, including gastroenteritis (stomach flu) and diarrhea, causing about 80 million illnesses annually (World Health Organization, 2015). Among Salmonella isolates, antimicrobial resistance is widespread, and infections caused by antibiotic-resistant strains are worse than those caused by antibiotic-susceptible strains (Varma, 2005). As a result of surveillance efforts by public health agencies, many whole-genome sequences and antimicrobial susceptibility data of Salmonella strains have become available (Hunt et al., 2017).
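
MIC prediction results in this paper are reported both as raw accuracy and as accuracy within ±1 two-fold dilution step. The Python sketch below shows one plausible way to compute the two metrics, assuming that MICs lie on a two-fold dilution scale so that the distance between a predicted and a measured MIC can be counted in log2 steps. The function name `mic_accuracies` and the example values are hypothetical and are not taken from the study.

```python
import math
from typing import Sequence, Tuple


def mic_accuracies(true_mic: Sequence[float],
                   pred_mic: Sequence[float]) -> Tuple[float, float]:
    """Return (raw accuracy, accuracy within +/-1 two-fold dilution step).

    Assumes all MICs are positive and lie on a two-fold dilution scale,
    so the error is measured as the number of log2 steps between them.
    """
    if len(true_mic) != len(pred_mic) or len(true_mic) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    exact = 0
    within_one = 0
    for t, p in zip(true_mic, pred_mic):
        # Number of two-fold dilution steps between prediction and truth.
        steps = abs(round(math.log2(p / t)))
        if steps == 0:
            exact += 1
        if steps <= 1:
            within_one += 1
    n = len(true_mic)
    return exact / n, within_one / n


# Hypothetical measured and predicted MICs (mg/L) for five isolates.
measured = [0.5, 1, 2, 8, 16]
predicted = [0.5, 2, 2, 2, 32]
raw_acc, acc_within_one = mic_accuracies(measured, predicted)
print(raw_acc, acc_within_one)
```

In this hypothetical example two of the five predictions are exact and four are within one dilution step, which illustrates why the ±1 dilution accuracy is always at least as high as the raw accuracy.
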
One recent study used a machine learning model called extreme gradient boosting (XGBoost) to predict the MICs of 15 antibiotics based on over five thousand nontyphoidal Salmonella genomes (Marcus et al., 2017). The overall average accuracy of these MIC prediction models is 95% within ±1 two-fold dilution step. Despite the excellent predictive performance of the models, the k-mers (the features used) identified as contributing most to the models offer weak biological interpretability. To understand how different genomic features contribute to antimicrobial resistance, machine learning models should be developed with more interpretable features, such as the copy number of antibiotic resistance genes and particular polymorphisms, instead of k-mers.

The transformer model (introduced in the paper "Attention Is All You Need") follows the same general pattern as a standard sequence-to-sequence model with attention. The input sentence is passed through N encoder layers that generate an output for each token in the sequence. The decoder attends to the encoder's output and to its own input (self-attention) to predict the next word. The transformer has been shown to be superior in quality on many sequence-to-sequence problems while being more parallelizable. Here, we perform genome analysis and MIC prediction, which is not a sequence-to-sequence problem, so we use only the transformer encoder. Self-attention networks (SANs) can capture long-distance dependencies by explicitly attending to all elements, regardless of distance. However, multiple relatively distant parts of a long genome sequence can work together, which may be overlooked by a single attention. To solve this problem, we use the

