REPROGRAMMING LARGE PRETRAINED LANGUAGE MODELS FOR ANTIBODY SEQUENCE INFILLING

Abstract

Antibodies comprise the most versatile class of binding molecules, with numerous applications in biomedicine. Therapeutic antibody development requires designing novel and diverse sequences with improved properties while maintaining structural consistency. Computational design of antibodies involves unusual challenges relative to designing other classes of proteins, as antibodies comprise multiple long, variable, and unstructured loops at the complementarity-determining region (CDR) that determine the antigen binding affinity and specificity of an antibody. Recently, deep language models and graph neural networks have shown impressive success in antibody sequence generation. Since only a limited number of antibody structures are known, training a model on this limited data can lead to degraded performance, in particular a lack of diversity in the generated samples. To address such issues, we leverage Model Reprogramming (MR), which focuses on repurposing pretrained machine learning models for target-domain tasks with scarce data, where it may be difficult to train a high-performing model from scratch. Prior works in MR have primarily focused on classification tasks. We extend reprogramming beyond classification, towards the more complex problem of antibody sequence generation. Specifically, we introduce Reprogramming for Protein Sequence Infilling, a framework in which pretrained natural language models are repurposed, via reprogramming, to infill protein sequence templates as a method of novel protein generation. For variable CDR sequence design, we formulate the task as text infilling that uses the constant region of an antibody as the sequence template. Results on antibody design benchmarks show that our model, reprogrammed on a low-resource antibody sequence dataset, provides highly diverse CDR sequences, with up to more than a two-fold increase in diversity over the baselines, without losing structural integrity and naturalness. The performance benefit of the reprogrammed model, which learns only from antibody sequences, is more evident for longer CDR design or for infilling multiple loops at once, compared to existing graph-based models that require additional structural information. The generated sequences also demonstrate enhanced antigen binding specificity or virus neutralization ability.

1. INTRODUCTION

Antibodies have emerged as essential therapeutic agents in the treatment of cancer and various autoimmune, infectious, and metabolic diseases. Since 1985, approximately 100 monoclonal antibodies (mAbs) have been designated as drugs by the FDA (Jin et al., 2022). Compared to small-molecule drugs, the advantage of antibody proteins as therapeutics is their high specificity, which results in fewer adverse effects. A key challenge in antibody design is tailoring binding specificity, which is mainly influenced by the complementarity-determining region (CDR). The CDR plays a crucial role in antigen recognition and binding. It is composed of six hypervariable loops, three formed by each of the heavy (H) and light (L) chains. Together, the CDRs shape the antigen binding site of the antibody. Five of the six loops usually adopt well-characterized canonical conformations. In contrast, the CDR-H3 loop shows substantial variability in sequence and structure, and hence cannot be described by a canonical structure model. Even when compared to other protein loop structures, the CDR-H3 clearly stands out with its significantly higher structural diversity.

There is high demand for efficient in-silico methods for designing CDRs with improved specificity and other desired properties, to reduce the cost and time associated with wet-lab production and testing of antibody candidates. Generative machine learning has emerged as an attractive and viable path for this purpose. For example, for the more general task of protein design (creating new protein sequences that fold to a desired 3D structure and/or exhibit a specific function), many deep generative models have been adapted and expanded (Ingraham et al., 2019; Cao et al., 2021; Karimi et al., 2020; Syrlybaeva & Strauch, 2022; Lee & Kim, 2022; Anand & Achim, 2022). However, compared to other protein design challenges, CDR design (Akbar et al., 2022b; Eguchi et al., 2020; Shin et al., 2021; Adolf-Bryfogle et al., 2018; Fu & Sun, 2022; Kong et al., 2022; Luo et al., 2022), especially CDR-H3 design, comes with additional complexities, such as out-of-distribution generation to accommodate functional novelty. Additionally, in antibody design, sequence similarity may not reflect binding behavior. For example, among HER2-binding antibodies, two very similar sequences (Levenshtein distance < 2) had opposing binding behavior (Mason et al., 2021). Furthermore, it is often desirable to explore new antigen binding modes when designing antibodies for a target of interest. Such out-of-distribution sample generation remains challenging, particularly in a template-constrained generation scenario.
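To make the sequence-similarity remark concrete, the following is a minimal sketch of the Levenshtein (edit) distance between two amino acid strings. The two example sequences are hypothetical and chosen only for illustration; they are not the HER2 binders from Mason et al. (2021).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-residue insertions, deletions, or
    substitutions needed to turn sequence a into sequence b."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, res_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, res_b in enumerate(b, start=1):
            cost = 0 if res_a == res_b else 1
            curr.append(min(prev[j] + 1,          # delete res_a
                            curr[j - 1] + 1,      # insert res_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical CDR-H3-like strings (illustration only), differing at one position.
seq_a = "ARDYYGSSYFDY"
seq_b = "ARDYYGSTYFDY"
print(levenshtein(seq_a, seq_b))  # prints 1
```

A distance of 1 means the two sequences differ by a single edit, which is the regime in which the text notes that binding behavior can nevertheless diverge.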
Most prior works compromise on the sequence and structural diversity of generated CDRs in exchange for high amino acid recovery and low root mean square deviation (RMSD) from the ground-truth structure. Moreover, these models typically involve either training a large language model (LLM) from scratch on next-generation sequencing (NGS) repertoires (Olsen et al., 2022), or training a graph neural network (GNN) on a small sample of antibody sequence-structure pairs (Jin et al., 2021). The GNN-based models also incur an inference cost, e.g., from the iterative design of nodes and edges in a graph via autoregressive decoding.
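The two evaluation quantities named above can be sketched as follows. This section does not give their exact definitions, so the code assumes the common ones: amino acid recovery as the fraction of positions where the generated residue matches the reference, and RMSD over paired 3D coordinates that are already superimposed (no alignment step is performed). The function names and example inputs are illustrative.

```python
import math


def amino_acid_recovery(generated: str, reference: str) -> float:
    """Fraction of positions where the generated residue equals the reference.
    Assumes both sequences cover the same region and have equal length."""
    if len(generated) != len(reference) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(g == r for g, r in zip(generated, reference))
    return matches / len(reference)


def rmsd(coords_a, coords_b) -> float:
    """Root mean square deviation between two equally sized lists of
    (x, y, z) points. Assumes the structures are already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical inputs, for illustration only.
print(amino_acid_recovery("ARDY", "ARDF"))
print(rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
           [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]))
```

A model tuned only for these two numbers is rewarded for reproducing the reference loop, which is why high recovery and low RMSD can coincide with low diversity.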



[Figure 1 diagram. Example masked heavy-chain input, with * marking the positions to infill: VQLVESGGGLVQPGGSLRLSCAAS********MSWVRQAPGKGLEWV SA*******YYADSVKGRFTISRHNSKNTLYLQMKSLRPEDTAIYYC ******************]

Figure 1: Overview of the proposed Protein Sequence Infilling using Model Reprogramming. Given the heavy chain of an antibody, the goal is to design its three complementarity-determining regions (CDR-H1, CDR-H2, CDR-H3), shown in green, blue, and red, using information from the rest of the protein. The infilling problem is formulated similarly to masked language modeling: the missing amino acids are marked with a ⟨MASK⟩ token and the model generates tokens to infill them. We emphasize that our system is a sequence-only method; although structure information might be available (bottom of the figure, showing the Y-shaped antibody structure with CDRs), our method does not rely on it during generation. This makes the model computationally efficient while still achieving high sequence recovery and diversity relative to current baselines. The reprogrammed BERT model (ReprogBert) is our proposed infilling model: the English-language BERT (source domain) remains frozen, and we introduce additional amino acid embeddings (target domain) together with linear projection matrices ($\theta \in \mathbb{R}^{|V_t| \times |V_s|}$ and $\gamma \in \mathbb{R}^{|V_s| \times |V_t|}$) that map from one domain to the other. During CDR infilling training, only the projection matrices and protein embeddings are trained; the language model remains unmodified. The bottom diagram shows a schematic view of the reprogramming: $f_\theta : x_t \to x_s$ transforms the input protein sequence (target domain, T) into an input word sequence (source domain, S), and $g_\gamma : y_s \to y_t$ reverses the mapping. Thus, for a masked protein sequence $x_t$ we obtain the predicted CDR-infilled antibody $y_t = g_\gamma(M(f_\theta(x_t)))$, where $M$ is the pretrained language model.
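The composition described in the caption, input projection, frozen language model, then output projection, can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: the frozen model is a toy stand-in for BERT, the vocabulary and hidden sizes are arbitrary, the protein embeddings are assumed to be the product of theta and the frozen word embeddings, and all matrices are random and untrained. The masked template is the first fragment of the masked heavy chain shown in the figure.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<MASK>"
TARGET_VOCAB = list(AMINO_ACIDS) + [MASK]  # V_t: amino acids plus the mask token
SOURCE_VOCAB_SIZE = 50                     # |V_s|: toy stand-in for BERT's word vocabulary
HIDDEN = 8                                 # h: toy hidden size

rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Frozen source-domain parameters: never updated during reprogramming.
E_s = rand_matrix(SOURCE_VOCAB_SIZE, HIDDEN)  # word embeddings, |V_s| x h


def frozen_lm(embedded):
    """Toy stand-in for the pretrained model M (assumption): average each
    position with its neighbours, then score source tokens with E_s^T."""
    mixed = []
    for i in range(len(embedded)):
        window = embedded[max(0, i - 1): i + 2]
        mixed.append([sum(col) / len(window) for col in zip(*window)])
    E_s_T = [list(col) for col in zip(*E_s)]  # h x |V_s|
    return matmul(mixed, E_s_T)               # n x |V_s| logits


# Target-domain parameters: the only ones a training loop would update.
theta = rand_matrix(len(TARGET_VOCAB), SOURCE_VOCAB_SIZE)  # |V_t| x |V_s|
gamma = rand_matrix(SOURCE_VOCAB_SIZE, len(TARGET_VOCAB))  # |V_s| x |V_t|


def f_theta(tokens):
    """Input reprogramming: protein tokens -> source-domain embeddings."""
    protein_embeddings = matmul(theta, E_s)   # |V_t| x h (assumed form)
    return [protein_embeddings[TARGET_VOCAB.index(t)] for t in tokens]


def g_gamma(source_logits):
    """Output reprogramming: source-vocabulary logits -> target-vocabulary logits."""
    return matmul(source_logits, gamma)       # n x |V_t|


def infill(tokens):
    """y_t = g_gamma(M(f_theta(x_t))), keeping template residues unchanged."""
    logits = g_gamma(frozen_lm(f_theta(tokens)))
    out = []
    for tok, row in zip(tokens, logits):
        if tok == MASK:
            scores = row[:len(AMINO_ACIDS)]   # never predict the mask token itself
            out.append(AMINO_ACIDS[scores.index(max(scores))])
        else:
            out.append(tok)
    return "".join(out)


# First fragment of the masked heavy chain from the figure; '*' marks masked positions.
template = [MASK if c == "*" else c
            for c in "VQLVESGGGLVQPGGSLRLSCAAS********MSWVRQAPGKGLEWV"]
print(infill(template))
```

Because the matrices are random, the infilled residues carry no meaning; the sketch only shows how data flows through the two projections while the language model parameters stay fixed. In an actual training setup, a masked-token loss on the outputs of `g_gamma` would update `theta` and `gamma` only.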

