REPROGRAMMING LARGE PRETRAINED LANGUAGE MODELS FOR ANTIBODY SEQUENCE INFILLING

Abstract

Antibodies comprise the most versatile class of binding molecules, with numerous applications in biomedicine. Therapeutic antibody development requires designing novel and diverse sequences with improved properties, while maintaining structural consistency. Computational design of antibodies involves unusual challenges relative to designing other classes of proteins, as antibodies comprise multiple long, variable, and unstructured loops at the complementarity-determining region (CDR) that determine the antigen binding affinity and specificity of an antibody. Recently, deep language models and graph neural networks have shown impressive success in antibody sequence generation. However, since only a limited number of antibody structures are known, training a model on this limited data can lead to degraded performance, particularly a lack of diversity in the generated samples. To address such issues, here we leverage Model Reprogramming (MR), which focuses on repurposing pretrained machine learning models for target-domain tasks with scarce data, where it may be difficult to train a high-performing model from scratch. Prior works in MR have primarily focused on classification-based tasks. We extend the capabilities of reprogramming beyond classification tasks and toward the more complex problem of antibody sequence generation. Specifically, we introduce Reprogramming for Protein Sequence Infilling, a framework in which pretrained natural language models are repurposed via reprogramming to infill protein sequence templates as a method of novel protein generation. For variable CDR sequence design, we formulate the task as text infilling that uses the constant region of an antibody as the sequence template. Results on antibody design benchmarks show that our model, reprogrammed on a low-resource antibody sequence dataset, generates highly diverse CDR sequences, with up to a more-than-two-fold increase in diversity over the baselines, without losing structural integrity or naturalness. The performance benefit of the reprogrammed model, which learns only from antibody sequences, is most evident for longer CDR design or for infilling multiple loops at once, compared to existing graph-based models that require additional structural information. The generated sequences also demonstrate enhanced antigen binding specificity and virus neutralization ability.

1. INTRODUCTION

Antibodies have emerged as essential therapeutic agents in the treatment of cancer and various other autoimmune, infectious and metabolic diseases. Since 1985, approximately 100 monoclonal antibodies (mAbs) have been approved as drugs by the FDA (Jin et al., 2022). Compared to small molecule drugs, the advantage of using antibody proteins as therapeutics is their high specificity, resulting in fewer adverse effects. A key challenge in antibody design is tailoring their binding specificity, which is mainly influenced by the complementarity determining region (CDR). The CDR plays a crucial role in antigen recognition and binding. It is composed of six hypervariable loops, three contributed by each of the heavy (H) and light (L) chains. Together, the CDRs shape the antigen binding site of the antibody. Five of the six loops usually adopt well-characterized canonical conformations. In contrast, the CDR-H3 loop shows substantial variability in sequence and structure, and hence cannot be described by a canonical structure model. Even when compared to other protein loop structures, the CDR-H3 clearly stands out with its significantly higher structural diversity. There is a high demand for efficient in-silico methods for designing CDRs with improved specificity and other desired properties, to reduce the cost and time associated with wet lab production and testing of antibody candidates. Generative machine learning has emerged as an attractive and viable path for this purpose. For example, for the more general task of protein design, creating new protein sequences that fold to a desired 3D structure and/or exhibit a specific function, many deep generative models have been adapted and expanded (Ingraham et al., 2019; Cao et al., 2021; Karimi et al., 2020; Syrlybaeva & Strauch, 2022; Lee & Kim, 2022; Anand & Achim, 2022). However, compared to other protein design challenges, CDR design (Akbar et al., 2022b; Eguchi et al., 2020; Shin et al., 2021; Adolf-Bryfogle et al., 2018; Fu & Sun, 2022; Kong et al., 2022; Luo et al., 2022), especially CDR-H3 design, comes with additional complexities, such as out-of-distribution generation to accommodate functional novelty. Additionally, in antibody design, sequence similarity may not reflect binding behavior. For example, among HER2-binding antibodies, two very similar sequences (Levenshtein distance < 2) had opposing binding behavior (Mason et al., 2021). Furthermore, it is often desirable to explore new antigen binding modes when designing antibodies for a target of interest. Such out-of-distribution sample generation remains challenging, particularly in a template-constrained generation scenario.
Most prior works sacrifice sequence and structural diversity in the generated CDRs in exchange for high amino acid recovery and low root mean square deviation (RMSD) from the ground truth structure. Moreover, these models typically involve either LLM training from scratch on NGS repertoires (Olsen et al., 2022) or GNN training on a small set of antibody sequence-structure pairs (Jin et al., 2021). The GNN-based models also come with an inference cost, e.g., iterative design of nodes and edges in a graph via autoregressive decoding. To address these challenges, we propose an alternative sequence-only framework (see Fig. 1 for an overview): reprogramming an existing out-of-domain English language BERT model (Devlin et al., 2018) for the protein infilling task, where the CDRs are generated given the rest of the sequence as a template. We term this model ReprogBert. Additionally, for our sequence-based infilling task we also consider the in-domain specialized protein model ProtBert (Elnaggar et al., 2020), as well as the English language BERT (EnglishBert), whose out-of-domain language token embeddings are replaced with in-domain amino acid embeddings (see Fig. 5 in the Appendix for details).
We compare all of our proposed infilling methods with physics-based and graph-based generative models on a range of tasks, from template-constrained CDR design to CDR sequences with predicted SARS-CoV-2 neutralization ability. We show that ReprogBert matches the baselines' high structural consistency with lower sequence perplexity, and achieves high amino acid CDR recovery while providing the additional benefit of generating highly diverse CDR sequences. These results suggest the potential of ReprogBert for on-demand generation of out-of-distribution sequences when learning from limited data. The other proposed baseline systems, EnglishBert and ProtBert, achieve high CDR sequence recovery rates with consistent structural integrity, although with modest sequence diversity. In summary, in this work we: (i) propose ReprogBert, a system for protein sequence infilling using model reprogramming for the task of antibody CDR design, (ii) show promising performance results as compared to many baselines (including our own proposed ProtBert and EnglishBert baseline infilling methods) and over multiple benchmarks, where our ReprogBert model upholds structural integrity and sequence recovery while achieving high diversity of the generated sequences; moreover, the generated CDR sequences frequently have the lowest perplexity, reflecting their well-formed composition and naturalness, and ReprogBert further shows its promise in harder CDR design tasks, can handle multiple CDR infilling at once, and does not need structure template information, and (iii) observe high data efficiency of the reprogrammed model: having only a few trainable parameters, it can be efficiently trained in data-scarce domains, such as antibody design, while still leveraging information from large out-of-domain language pretraining.

2. REPROGRAMMING FOR PROTEIN SEQUENCE INFILLING

The field of model reprogramming (MR) has focused on repurposing pretrained machine learning (ML) models for varied ML tasks in different domains. It was first proposed in an adversarial ML setting (Elsayed et al., 2018) and later extended to cross-domain, resource-efficient transfer learning (Chen, 2022; Neekhara et al., 2022). MR achieves state-of-the-art performance in many tasks, especially in data-limited classification settings, including reprogramming general images for bio-medical measurements (Tsai et al., 2020), human voice for time-series (Yang et al., 2021), and sentence sentiment for protein property prediction (Vinod et al., 2020), to name a few. While current MR techniques focus on classification tasks, in our work we seek to extend MR capabilities to generative tasks by reprogramming large pretrained language models for protein sequence infilling. To the best of our knowledge, this work is the first study of such an endeavor. Given a protein sequence, we pose novel CDR loop design as a form of template infilling. The template is provided by the amino acid sequence of the constant region of the antibody, as those regions are conserved and less likely to change, while the sequences corresponding to the CDRs can vary and change the structure of the antigen binding interface, resulting in modified antigen affinity and specificity. Although the infilling here is performed to design antibody CDRs, the framework can be leveraged to infill any protein sequence. Figure 1 presents an overview of our proposed framework, ReprogBert, the reprogrammed language model for protein sequence infilling. Specifically, we use the pretrained English BERT model (Devlin et al., 2018) (in our experiments, bert-base-uncased from HuggingFace) and reprogram it to infill the CDR part of the antibodies. The set of tokens in the original language task (i.e., source domain) is denoted by V_s (in our experiments |V_s| = 30522 word tokens). A language sentence can then be represented as y_s = ⟨w_1, w_2, ..., w_n⟩, where w_i is a word token. The set of tokens in the task of interest (i.e., target domain) is denoted by V_t (in our experiments |V_t| = 30 protein tokens: 20 amino acid tokens and 10 auxiliary tokens). A protein sentence can then be represented as x_t = ⟨a_1, a_2, ..., a_n⟩, where a_i is an amino acid token.
We define two mappings (see the bottom plot in Fig. 1): f_θ : x_t → x_s, transforming an input protein sequence into an input word sequence, and g_γ : y_s → y_t, reversing the transformation by mapping an output word sequence into a protein one. Following the approach in (Elsayed et al., 2018; Tsai et al., 2020; Vinod et al., 2020), we constrain these mappings to be linear transformations between the source and target domains. In other words, the mappings are represented as x_s = x_t θ and y_t = y_s γ, where the linear projection matrices θ ∈ R^{|V_t|×|V_s|} and γ ∈ R^{|V_s|×|V_t|} are the parameters of the transformations. During training, all model parameters are fixed and only θ and γ are optimized. Specifically, we update θ and γ to minimize L_NLL(y_t, y*_t), the loss between the estimated infilled protein sequence y_t = g_γ(M(f_θ(x_t))), given the CDR-masked antibody x_t, and the ground truth sequence y*_t, where M denotes the frozen pretrained language model.
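To make the two mappings concrete, below is a minimal PyTorch sketch of the reprogramming wrapper. It is an illustration, not the released implementation: the class name ReprogBertSketch is ours, the softmax that turns the projected scores into a soft mixture over BERT's word embeddings and the 0.02 initialization scale are assumptions, and for brevity we fold the target-domain amino acid embeddings into the one-hot-times-θ product.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertForMaskedLM

class ReprogBertSketch(nn.Module):
    """Sketch of reprogramming a frozen English BERT for protein infilling.

    Only `theta` (amino acids -> words) and `gamma` (words -> amino acids)
    are trainable; the pretrained language model M stays frozen.
    """

    def __init__(self, v_t: int = 30):
        super().__init__()
        self.bert = BertForMaskedLM.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():
            p.requires_grad = False  # source model M is never updated
        v_s = self.bert.config.vocab_size  # 30522 word tokens
        self.theta = nn.Parameter(0.02 * torch.randn(v_t, v_s))  # R^{|V_t| x |V_s|}
        self.gamma = nn.Parameter(0.02 * torch.randn(v_s, v_t))  # R^{|V_s| x |V_t|}

    def forward(self, x_t: torch.Tensor) -> torch.Tensor:
        # x_t: (batch, length) amino-acid token ids, with CDR positions masked
        one_hot = F.one_hot(x_t, num_classes=self.theta.shape[0]).float()
        # f_theta: map each protein token to a soft mixture over word tokens,
        # then embed with BERT's own (frozen) word-embedding table
        word_mix = torch.softmax(one_hot @ self.theta, dim=-1)        # (B, L, V_s)
        inputs_embeds = word_mix @ self.bert.get_input_embeddings().weight
        logits_s = self.bert(inputs_embeds=inputs_embeds).logits      # (B, L, V_s)
        # g_gamma: project word logits back to amino-acid logits
        return logits_s @ self.gamma                                  # (B, L, V_t)
```

During training only θ and γ would be handed to the optimizer, e.g. torch.optim.Adam([model.theta, model.gamma], lr=1e-4), with a cross-entropy loss computed at the masked CDR positions against the ground-truth amino acids.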

3. EXPERIMENTS

In this section we present evaluation results of our proposed methods on template-constrained CDR design using the Structural Antibody Database (SabDab) (Dunbar et al., 2013) and Rosetta Antibody Design (RabD) (Jin et al., 2021) benchmarks, as well as on SARS-CoV-2 neutralization (CoV-AbDab dataset (Raybould et al., 2021)) using the models' generated antibodies. In what follows, we first discuss the evaluation metrics, then introduce the baseline models, and finally present the results on the three datasets.

3.1. EVALUATION METRICS

For each input protein sequence in our experiments we generated 100 samples using our infill models. To measure the quality of these samples, we compute the following evaluation metrics (see Fig. 2 for an illustration). Amino acid recovery (AAR) is computed for the specific sequence region of interest (e.g., CDR-H3), measuring the percentage of exact matches between the ground truth and the sampled sequences. The range is 0-100, and the higher the AAR, the more accurate the recovery. Diversity (DIV), on the other hand, uses only the sampled proteins to compute the complement of the average recovery over all pairwise comparisons in the set. Here the range is also 0-100, and the higher the number, the more dissimilar the samples are among themselves. While in general recovery and diversity are inversely correlated, i.e., a higher recovery rate leads to lower diversity and vice versa, CDR design calls for generative models that achieve at least above 30% recovery (Weitzner et al., 2015) while maintaining high sequence diversity. For perplexity (computed from the model's predicted probabilities for every residue in a given sequence) we use the off-the-shelf autoregressive Transformer protein model ProGen (Nijkamp et al., 2022) to compute PPL-ProGen as the average over 100 samples (masking only the region of interest). Specifically, we used ProGen2-small (151M parameters), which has been pretrained on a mixture of the Uniref90 (Suzek et al., 2015) and BFD30 (Steinegger & Söding, 2018) datasets. For perplexity, lower values mean better performance, indicating stronger "naturalness" of the generated CDRs. The sampled protein sequence with the minimum perplexity is then used for 3D structure prediction with a protein folding model (e.g., AlphaFold (Jumper et al., 2021) or IgFold (Ruffolo et al., 2022)). The full predicted and ground truth structures are then compared to compute the template modeling (TM) score (Zhang & Skolnick, 2004) (range 0-100, the higher the better) and the root mean squared deviation (RMSD) (the lower the better), focusing only on the CDR part. The suffix AF in the metric names denotes AlphaFold, while IF denotes IgFold.
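The two sequence metrics can be made precise with a short Python sketch of our own (an illustration, assuming all samples share the ground-truth region's length, which holds here since the mask tokens fix the infill length):

```python
def _identity(a: str, b: str) -> float:
    """Fraction of positions where two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def aar(ground_truth: str, samples: list[str]) -> float:
    """Amino acid recovery: average % of positions matching the ground truth."""
    return 100.0 * sum(_identity(ground_truth, s) for s in samples) / len(samples)

def div(samples: list[str]) -> float:
    """Diversity: complement of the average pairwise recovery among samples."""
    pairs = [(s, t) for i, s in enumerate(samples) for t in samples[i + 1:]]
    avg_recovery = sum(_identity(s, t) for s, t in pairs) / len(pairs)
    return 100.0 * (1.0 - avg_recovery)
```

For example, aar("ARLINHYYGGAFDI", generated) scores 100 generated CDR-H3s against the ground truth, while div(generated) ignores the ground truth entirely and compares the samples only among themselves.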

3.2. BASELINE MODELS

We included the following baseline methods to compare against our BERT-based infilling models. LSTM from (Saka et al., 2021) and (Akbar et al., 2022a), which, similar to ours, is a sequence-only model, but of smaller capacity, having a single attention layer between the input and output layers. AR-GNN, an autoregressive graph neural network (Jin et al., 2021), which is a sequence- and structure-based model: at each step it first predicts an amino acid and then generates edges between the current and all past residues. RefineGNN (Jin et al., 2021), a model that designs the protein sequence and 3D structure of the CDR jointly as graphs: at each step the method predicts residues autoregressively and simultaneously refines the predicted global structure, which in turn helps subsequent residue prediction. To improve computational efficiency, RefineGNN employs coarse-grained modeling by clustering a predefined number of context residues into each block, thus reducing the size of the computational graph.

Table 4: Evaluation results on the SabDab dataset for CDR-H3. Compared to CDR-H1 (Table 2) and CDR-H2 (Table 3), the longer CDR-H3 design is more challenging, which shows as a drop in AAR across all methods. ReprogBert clearly outperforms RefineGNN on this hard task, as evident from lower PPL, better AAR, and better diversity.

3.3. EXPERIMENTS ON THE STRUCTURAL ANTIBODY DATABASE (SABDAB)

SabDab (Dunbar et al., 2013) is a dataset containing antibody sequences and the corresponding 3D structure information, annotated with several properties such as gene details, heavy and light chain pairings, CDR locations, etc. For this experiment, we used the dataset curated by (Jin et al., 2021); the statistics are shown in Table 1. The evaluation results are shown in Tables 2, 3, and 4. We note that the values of the PPL and RMSD metrics for LSTM, AR-GNN and RefineGNN are from the published results (Jin et al., 2021). Comparing across the three experiments, we can see that CDR-H1, CDR-H2 and CDR-H3 estimation are progressively harder problems, which is reflected in the drop of AAR across all the methods. Among the proposed infill methods, ProtBert achieves the highest AAR across all experiments. We can also see that ReprogBert has good recovery accuracy and at the same time generates very diverse CDR sequences. We emphasize that this performance is achieved without access to the available 3D structure information. RefineGNN, on the other hand, using both sequence and structure constraints, overall performs competitively, generating CDR sequences that are accurate and diverse. Nevertheless, the advantage of ReprogBert is more prominent for the longer CDR-H3, the hardest design task of the three, where ReprogBert evidently outperforms RefineGNN in terms of perplexity, AAR, and diversity, while maintaining structural integrity. Finally, in Table 5 we show the results of infilling all three CDRs at once. Our BERT-based models are not architecturally limited to a single CDR generation and can therefore infill multiple regions at once with similarly high recovery, structural consistency, and diversity scores.

Since our BERT-based infill models do not estimate protein structure, we use AlphaFold (Jumper et al., 2021) and IgFold (Ruffolo et al., 2022) to estimate the 3D structure from the generated sequence and compute TM and RMSD scores with respect to the ground truth native structure. We can see from Tables 2, 3, 4, and 5 that all the methods have similar structural consistency results (TM and RMSD-AF). However, these values are consistently higher when compared to the RMSD of the "natively" predicted structures (AR-GNN and RefineGNN), which is likely due to estimation errors introduced by the AlphaFold or IgFold algorithms. Since RefineGNN focuses on recovering both the ground truth sequence and structure, it does so by sacrificing exploration of the broader sequence space accessible to a given structure (Tian & Best, 2017), which is not the case for ReprogBert.

To further qualitatively illustrate the effect of recovery and diversity on the sampled sequences, in Fig. 3 we show AlphaFold-generated 3D structures of the protein sequences generated by the ReprogBert model. The high structural diversity of CDR-H3 is clearly visible from the coverage of the CDR-H3 ensemble (ground truth shown opaque, generated structures shown transparent). Fig. 4 presents a visualization of sequence similarity (in green) and diversity (in white to blue) across models. For example, for ProtBert the third column has the residue D in all rows (high frequency), thus having the darkest shade, while for ReprogBert the last column has only two generated Y's (low frequency), thus colored in a light shade of blue. Therefore, a method with high recovery and high diversity will have many green and light blue cells.
Comparing with Table 4, we indeed see that ReprogBert has the highest diversity, represented by the largest number of light blue cells, while ProtBert has the most green cells (highest AAR) but also many dark blue cells (low diversity). It can also be seen that RefineGNN has lower diversity and lower recovery than ReprogBert. Further, the 2D kernel density plot as a function of isoelectric point (the pH at which the net charge is 0) and CDR-H3 length, shown in Figure 8 in the Appendix, implies that ReprogBert maintains the highest physicochemical similarity to the natural CDRs.
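The two axes of that density plot are straightforward to reproduce; as a small illustration (our tooling choice, not necessarily the authors'), Biopython can compute the isoelectric point of a CDR-H3 sequence such as the example infill from Fig. 5:

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

cdr_h3 = "ARLINHYYGGAFDI"  # example CDR-H3 infill from Fig. 5
# Coordinates for one point of the 2D density plot: (isoelectric point, loop length)
print(ProteinAnalysis(cdr_h3).isoelectric_point(), len(cdr_h3))
```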

3.4. ANTIGEN-SPECIFIC ANTIBODY DESIGN

The goal here is to design a CDR that binds a given antigen, given the antibody sequence template. For this experiment, we used the dataset curated by (Jin et al., 2021), whose statistics are shown in Table 6. In particular, it consists of all SabDab antibodies for training, excluding sequences in the same cluster as the test antibodies, which were proposed by (Adolf-Bryfogle et al., 2018). The second step of our evaluation is to measure the ability of the generated antibodies to neutralize the SARS-CoV-2 virus, for which we follow the setup of (Jin et al., 2021). Specifically, we employ the neutralization classifier, composed of an SRU encoder (Lei, 2021), pooling, and a feed-forward network, as provided in (Jin, 2022), together with the iterative target augmentation (ITA) framework (Yang et al., 2020); a minimal sketch of the ITA loop is given after this paragraph. The goal is to additionally fine-tune the infilling models to generate CDRs resulting in better neutralizing antibodies, as measured by the classifier. Table 10 presents the results. Note that the performance values for the neutralization classifier, LSTM, AR-GNN and RefineGNN are from the published results in (Jin et al., 2021), for which they pretrained these models on the SabDab dataset followed by training on CoV-AbDab. As can be seen from the table, under both training scenarios, our ReprogBert infilling method achieves the largest improvement over the original neutralization classifier, reaching 75.6% and 76.7% neutralization scores, respectively.
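The following Python sketch shows the general shape of the ITA loop as we use it. It is schematic: the interfaces model.sample, model.train_step, and classifier.score are hypothetical placeholders, and the round, sample, and keep counts are illustrative rather than the settings of (Yang et al., 2020).

```python
def ita_finetune(model, classifier, templates, rounds=5, n_samples=100, keep=4):
    """Iterative target augmentation, minimal sketch.

    model      -- an infilling model (e.g., ReprogBert) with hypothetical
                  sample() and train_step() methods
    classifier -- a frozen neutralization predictor with a hypothetical
                  score() method returning P(neutralizing)
    templates  -- antibody sequences with their CDR positions masked
    """
    for _ in range(rounds):
        pseudo_targets = []
        for template in templates:
            candidates = model.sample(template, n=n_samples)       # infill the masked CDRs
            ranked = sorted(candidates, key=classifier.score, reverse=True)
            pseudo_targets.extend(ranked[:keep])                   # keep best-scoring designs
        for sequence in pseudo_targets:
            model.train_step(sequence)                             # fine-tune on pseudo-labels
    return model
```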

4. CONCLUSION

In this work we introduced Reprogramming for Protein Sequence Infilling, ReprogBert, a framework leveraging pretrained language models for protein sequence infilling. Specifically, we formulated variable CDR loop design as template infilling, where the template is provided by the constant region of the antibody. Results show promising performance when compared to existing sequence- and graph-based deep generative baselines over multiple benchmarks, where our ReprogBert model upholds structural integrity, sequence recovery, and naturalness, while achieving high novelty and diversity of the generated sequences. The improvement is more pronounced for the longer CDR-H3. ReprogBert can also handle multiple CDR infilling at once without losing performance. The generated antibodies further show antigen specificity and improved virus neutralization. Finally, it is worth emphasizing the high data efficiency of the reprogrammed model, which results from having only a few trainable parameters (the two linear projection matrices) that can be efficiently trained in data-scarce domains, such as antibody design, while still leveraging information from large out-of-domain language pretraining. This advantage allows the sequence-based reprogrammed model to perform competitively with, or better than, other BERT-based models and baselines that learn from both sequences and structures.

A RELATED WORK ON PROTEIN DESIGN

Protein design involves the design of new protein sequences that fold to a desired 3D structure and/or exhibit a specific function. Computational techniques for designing novel and diverse proteins are an active area of research. Physics-based methods that rely on energy minimization have been proposed for designing general proteins (Leaver-Fay et al., 2011; Huang et al., 2011), as well as specifically for antibodies (Pantazes & Maranas, 2010; Li et al., 2014; Adolf-Bryfogle et al., 2018), but these are computationally expensive. Recently, generative deep learning techniques such as Generative Adversarial Networks (Goodfellow et al., 2020), Variational Autoencoders (Kingma & Welling, 2013), Graph Neural Networks (Scarselli et al., 2008; Gilmer et al., 2017), autoregressive language models (LSTM- and Transformer-based) (Vaswani et al., 2017), and diffusion-based models (Ho et al., 2020) have been used for protein and antibody design (Wang et al., 2018; Akbar et al., 2022b; Amimeur et al., 2020; Eguchi et al., 2020; Shin et al., 2021; Kong et al., 2022; Fu & Sun, 2022; Syrlybaeva & Strauch, 2022; Lee & Kim, 2022; Anand & Achim, 2022). Some representative works are discussed below. (Ingraham et al., 2019) and (Cao et al., 2021) proposed a graph-based and a multimodal transformer-based model, respectively, for designing proteins conditioned on the backbone structure/fold. (Karimi et al., 2020) developed a guided conditional Wasserstein Generative Adversarial Network (gcWGAN) for fold-based protein design. Another method, which uses GANs to generate a distance matrix representation of proteins from which 3D coordinates can be recovered, was proposed by (Anand & Huang, 2018). Variational autoencoder based methods have also been proposed for conditional generation of protein sequences (Greener et al., 2018; Das et al., 2021) and for direct generation of the 3D coordinates of immunoglobulin proteins (Eguchi et al., 2020). Several of the above-mentioned architectures have been extended to the specific problem of antibody design, which is considered challenging due to the focus on designing long, variable, and unstructured CDRs. (Melnyk et al., 2021) provides benchmarking of several deep generative models on antibody design. Recently, (Jin et al., 2021) proposed an iterative refinement graph neural network for jointly designing the sequence and 3D structure of the CDR regions of antibodies to improve their properties. A deep generative model that jointly models sequences and structures of CDRs based on diffusion processes and equivariant neural networks has been proposed in (Luo et al., 2022). A geometry-constrained energy-based model has been suggested by (Fu & Sun, 2022). Other approaches to protein design include modeling it as a constraint satisfaction problem (Strokach et al., 2020), equivariant 3D translation (Kong et al., 2022), and combinatorial Bayesian optimization (Khan et al., 2022).

B OVERVIEW OF PROPOSED BASELINE MODELS

Figure 5 shows diagrams of the proposed baseline BERT-based infilling models: ProtBert, a specialized model pretrained on millions of protein sequences, and EnglishBert, the traditional English language model whose word embeddings are replaced with new learnable amino acid embeddings. Like our main proposed method, ReprogBert, these two models are sequence-only methods and use masking to infill the regions of interest.

Figure 5: Baseline methods proposed for protein sequence infilling. Given an input antibody sequence where part of the amino acids is missing (e.g., CDR-H3, as in …PEDTAI[MASK]…[MASK]GTMVTV… → …PEDTAIARLINHYYGGAFDIGTMVTV…), the goal is to infill them using information from the rest of the protein. The infilling problem is formulated similarly to the masked-language modeling task: the missing amino acids are marked with a ⟨MASK⟩ token and the model generates amino acid tokens to infill them. These are sequence-only methods and do not rely on any structure information during the generation process. The top diagram shows ProtBert, the BERT model that has been pretrained on protein sequences and can therefore be applied to the protein infilling task as is (the entire model is still fine-tuned on the downstream infilling task). The bottom diagram shows the traditional English language BERT model (EnglishBert), whose incompatible word embeddings (V_s × h, where V_s is the number of language tokens and h is the latent model dimension) are swapped with trainable amino acid embeddings (V_t × h, where V_t is the number of amino acid tokens). The full model is then fine-tuned on the infilling dataset.
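For concreteness, a minimal sketch of the EnglishBert embedding swap might look as follows. The use of resize_token_embeddings followed by re-initialization is our shorthand, not necessarily the authors' exact implementation; v_t = 30 matches the protein vocabulary described in Section 2.

```python
import torch.nn as nn
from transformers import BertForMaskedLM

def make_english_bert(v_t: int = 30) -> BertForMaskedLM:
    """EnglishBert sketch: shrink BERT's 30522-entry word-embedding table to a
    v_t-entry amino-acid table and re-initialize it, keeping the rest of the
    pretrained encoder. The whole model is then fine-tuned on infilling."""
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.resize_token_embeddings(v_t)  # also resizes the tied output head
    # Fresh, randomly initialized amino-acid embeddings (discard the word rows)
    nn.init.normal_(model.get_input_embeddings().weight, std=0.02)
    return model
```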

C MODEL ARCHITECTURE AND TRAINING

In Table 11 we present the architectural details of our BERT-based models for protein sequence infilling, while Table 12 shows the settings used for model training.

D ABLATION ON DATA

In Table 13 we show ablation results on the effect of training data size on model performance.

SabDab-H3 ablation (ProtBert)

Training data fraction   PPL-ProGen   AAR    DIV
1.0                      6.8          41.5   14.5
0.8                      6.7          41.3   13.1
0.6                      6.6          40.9   15.9
0.4                      6.4          40.5   18.9
0.2



Figure 1: Overview of the proposed protein sequence infilling using model reprogramming. Given the heavy chain of an antibody, the goal is to design the three complementarity-determining regions (CDR-H1, CDR-H2, CDR-H3), shown in green, blue and red, using information from the rest of the protein. The infilling problem is formulated similarly to the masked-language modeling task: the missing amino acids are marked with a ⟨MASK⟩ token and the model generates tokens to infill them. We emphasize that our system is a sequence-only method; while structure information might be available (bottom of the figure, showing the Y-shaped antibody structure with CDRs), our method does not rely on it in the generation process. This makes the model computationally efficient while still achieving high sequence recovery and diversity rates as compared to the current baselines. The reprogrammed language BERT model (ReprogBert) is our proposed infilling model, where the English language BERT remains unchanged and frozen (source domain), and we introduce additional amino acid embeddings (target domain) together with the linear matrices (θ ∈ R^{|V_t|×|V_s|} and γ ∈ R^{|V_s|×|V_t|}) to project from one domain to the other. During CDR infilling training, only the projection matrices and protein embeddings are fine-tuned; the language model remains unmodified. The bottom diagram shows a schematic view of the reprogramming: f_θ : x_t → x_s transforms the input protein sequence (target domain (T)) into an input word sequence (source domain (S)) and g_γ : y_s → y_t reverses the mapping. Thus, for a masked protein sequence x_t we get the predicted CDR-infilled antibody y_t = g_γ(M(f_θ(x_t))), where M is the pretrained language model.

Figure 2: Evaluation process and the computed metrics. For each masked antibody input sequence we generate 100 predicted samples. Amino acid recovery (AAR) is computed for the specific sequence region (e.g., CDR-H3), measuring the fraction of exact matches between the ground truth and the sampled sequences. Diversity (DIV) uses only the generated samples to compute the complement of the average recovery for all pairwise comparisons in the set (the higher the number, the more dissimilar each sample is to all the others). Perplexity (PPL-ProGen) is computed as the average over all the sampled sequences (masking only the region of interest), using the off-the-shelf autoregressive Transformer protein model ProGen (Nijkamp et al., 2022), and reflects the "naturalness" of the designed sequences. The sample with the minimum perplexity (red box with an arrow) is then used for 3D structure prediction using the AlphaFold (Jumper et al., 2021) or IgFold (Ruffolo et al., 2022) models and compared with the ground truth to compute the template modeling (TM) score (Zhang & Skolnick, 2004) and the root mean squared deviation (RMSD) from the input structure.

Figure 3: AlphaFold-estimated 3D structures of the proteins generated by the ReprogBert model on the SabDab dataset. Each plot shows 30 generated samples for a specific PDB ID, where the CDR-H3 part of the input has been masked and the model then generates the CDR-H3 sequence. The ground truth and the generated CDRs are shown on the bottom part of each figure using solid and faded colors, respectively. As can be seen, the CDR-H3 part shows high structural diversity, confirming the same findings as in Table 4, i.e., that ReprogBert achieves a high recovery rate while maintaining the highest sequence diversity.

Figure 4: Visualization of the sequence recovery and diversity metrics for generated CDR-H3 (PDB ID 7e7y) across different models. The top row, colored in red, shows the ground truth CDR-H3 sequence, while the following 20 rows show the generated CDR-H3s. A green cell with a star symbol represents the same amino acid as in the ground truth, while a white/blue cell shows a new, different generated residue. The darker the shade of a blue cell, the higher the frequency of that amino acid in its column.


Figures 6 and 7 show additional visualizations of the recovery and diversity metrics for CDR-H3 across different methods.

Figure 6: Visualization of the recovery and diversity metrics for CDR-H3 (PDB ID 2r56) across different models. The top red line shows the ground truth CDR-H3 sequence, while the following lines show the CDR-H3s generated by each model. A green cell with a star symbol represents the same amino acid as in the ground truth, while a blue cell shows a new, different generated residue. The shade of a blue cell represents the frequency of that amino acid in its column. We see that ReprogBert has the highest diversity, represented by the largest number of light blue cells, while ProtBert has the most green cells (highest AAR) but also many dark blue cells (low diversity). It can also be seen that RefineGNN has lower diversity and lower recovery than ReprogBert.

Figure 7: Visualization of the recovery and diversity metrics for CDR-H3 (PDB ID 5y7z) across different models. The top red line shows the ground truth CDR-H3 sequence, while the following lines show the CDR-H3s generated by each model. A green cell with a star symbol represents the same amino acid as in the ground truth, while a blue cell shows a new, different generated residue. The shade of a blue cell represents the frequency of that amino acid in its column. We see that ReprogBert has the highest diversity, represented by the largest number of light blue cells, while ProtBert has the most green cells (highest AAR) but also many dark blue cells (low diversity). It can also be seen that RefineGNN has lower diversity and lower recovery than ReprogBert.

Table 1: Statistics of the Structural Antibody Database (SabDab) for the training, validation and test splits across the three CDRs. We also show the average number of amino acids per CDR and the average CDR diversity (length-normalized) across proteins. As can be seen, CDR-H3 is the longest and most diverse and therefore represents the most challenging prediction task.

Table 2: Evaluation results on the SabDab dataset for CDR-H1 in the heavy chain. Dark grey cells denote the best results, while light grey cells denote the second best. ReprogBert generates sequences with the lowest perplexity, the second-best diversity, and sufficiently high AAR and structural consistency. RefineGNN yields the best diversity. ProtBert and EnglishBert both lack diversity in the generated CDR sequences.

Table 3: Evaluation results on the SabDab dataset for CDR-H2 in the heavy chain. Compared to Table 2, all of our proposed infill methods now outperform RefineGNN in terms of the AAR metric, while ReprogBert also provides the second-best diversity.

Table 5: Evaluation results on the three heavy chain CDR loops generated at once using the SabDab dataset. This is the most challenging task compared to designing one CDR at a time. However, since our BERT-based models are not architecturally limited to a single CDR generation, they can infill multiple protein regions at once with similarly high recovery scores, as opposed to AR-GNN and RefineGNN. Moreover, the reprogrammed model shows the lowest perplexity, good structural consistency, and the highest sequence variability among the three proposed methods.

Table 6: Statistics of the Rosetta Antibody Design (RabD) dataset for CDR-H3.

Evaluation results on the RabD dataset for CDR-H3. Our infilling models outperform RefineGNN: ProtBert achieves the highest AAR score, while ReprogBert has the best diversity rate with AAR comparable to RefineGNN. As before, ProtBert and EnglishBert show better recovery performance but suffer from less diverse generation, while ReprogBert achieves the highest diversity while maintaining good sequence recovery and low perplexity.

Evaluation results on the CoV-AbDab dataset for generated CDR-H3. Since no ground truth structure is available for this dataset, the other structure consistency metrics are not computed.



Table 11: Architectural details of the BERT-based models for protein sequence infilling. Note that for ReprogBert the number of trainable parameters is defined by the two R^{30522×30} projection matrices, i.e., 2 × 30522 × 30 ≈ 1.8M parameters.

Table 12: Training details for ProtBert, EnglishBert and ReprogBert. For example, on the SabDab dataset, reaching the best performance took 5 hours for ReprogBert, 6 hours for EnglishBert and 14 hours for ProtBert, which is equivalent to approximately 1800 epochs (134 minibatch iterations per epoch). The average inference time per protein sequence is 0.02 seconds for ProtBert and 0.008 seconds for ReprogBert and EnglishBert (as measured on the test set of SabDab for CDR-H3 infilling). For reference, the average inference time for RefineGNN is 0.004 seconds, which is comparable to our ReprogBert.

Table 13: Ablation results on the effect of training data size on model performance. The fractions 1.0, 0.8, 0.6, 0.4 and 0.2 represent progressively smaller subsets of the original SabDab training dataset. As the size of the training data drops, the recovery rate decreases while the diversity increases (expected, as the generated sequences become less accurate). However, for ProtBert the decrease is slower, likely because this model was pretrained on a large protein dataset and thus retains its prediction capacity.


Comparing this region across the other models, we see that ReprogBert has the closest resemblance to the ground truth, while the others place too much weight there. The second row from the top shows the density for 10 protein sequences, where visual inspection of the region marked with the orange arrow reveals that ReprogBert has the closest similarity to the ground truth based on the distribution and orientation of the highly dense region, while for the other methods the shape of this region is tilted and a second minimum appears.

