STOCHASTIC BRIDGES AS EFFECTIVE REGULARIZERS FOR PARAMETER-EFFICIENT TUNING

Abstract

Parameter-efficient tuning methods (PETs) have achieved promising results in tuning large pre-trained language models (PLMs). By formalizing the frozen PLM and the additional tunable parameters as a system and its controls respectively, PETs can be grounded in optimal control theory and viewed as optimizing the terminal cost and running cost in the optimal control literature. Despite the elegance of this theoretical grounding, in practice existing PETs often ignore the running cost and only optimize the terminal cost, i.e., they focus on optimizing the loss function of the output state, regardless of the running cost that depends on the intermediate states. Since it is non-trivial to directly model the intermediate states and design a running cost function, we propose to use latent stochastic bridges to regularize the intermediate states and use this regularization as the running cost of PETs. As the first work to propose regularized PETs that use stochastic bridges as the regularizers (running costs) for the intermediate states, we show the effectiveness and generality of this regularization across different tasks, PLMs, and PETs. Given the demonstrated potential and capacity of this approach, we believe more sophisticated regularizers can be designed for PETs, leading to better performance in the future.

1. INTRODUCTION

Recent years have witnessed the dramatic growth of pre-trained language models (PLMs) in various fields (Devlin et al., 2019; Dosovitskiy et al., 2021). As the size of PLMs continues to increase, the number of parameters has now reached hundreds of billions (Brown et al., 2020; Smith et al., 2022), making fine-tuning the whole PLM both computationally impractical and environmentally unfriendly. In view of this, a variety of Parameter-Efficient Tuning methods (PETs) have been proposed (Houlsby et al., 2019; Hu et al., 2022; Zaken et al., 2022; Lester et al., 2021). By tuning only a small number of additional parameters, PETs can be comparable to full-parameter fine-tuning. Despite the success of PETs, their underlying mechanism remains an open problem. Recently, several works have proposed to interpret PETs with optimal control theory. Yang & Liu (2022) first show that the optimization in Prefix Tuning (Li & Liang, 2021), a typical PET, can be considered as the search for optimal control variables in the context of optimal control, i.e., the trainable prefixes can be seen as the control variables that drive the PLM (the system) to the desired output. Ding et al. (2022) further show that the optimal control perspective can be applied to almost all PETs. The optimization of PETs' parameters can be seen as minimizing the two cost functions in the optimal control literature: (1) the terminal cost $\mathcal{L}_T$, which measures the quality of the terminal state, and (2) the running cost $\mathcal{L}_R$, which measures the feasibility of the controlled intermediate states and the control variables. Although $\mathcal{L}_T$ corresponds well to the loss function of the model output, $\mathcal{L}_R$ is only vaguely described as a regularizer on the parameters of PETs (the control variables) in Yang & Liu (2022) and Ding et al. (2022), ignoring the dependency of $\mathcal{L}_R$ on the intermediate states.
In this work, we show that designing a running cost to regularize intermediate states not only makes the optimal control perspective of PETs more theoretically sound, but also empirically leads to better PETs. We begin by assuming that in PLMs, the intermediate hidden states for generating different tokens in a sentence have different dynamics (or trajectories), and that these dynamics can be approximated with stochastic processes in a latent space. Specifically, we first freeze the PLM and learn a mapping from the original hidden state space of the PLM to a latent space. In the latent space, the dynamics of the intermediate hidden states for generating different target tokens can be approximated with different diffusion bridges. To fit the mapping, we propose two methods: (1) approximating the transition probability density function (PDF) of the bridge, and (2) fitting the SDE directly. These two methods act as a trade-off between efficiency and effectiveness: the first method incurs only negligible computational cost and has satisfactory results, while the second one is slower but yields better regularizers.

[Figure 1: Overview of the framework: (1) endpoint determination; (2) learning the mapping g on the pretraining corpus by approximating the drift vector field defined by the bridge; (3) regularization for PETs with the terminal cost $\mathcal{L}_T$.]

We conduct experiments on different PLMs of different sizes, and the experimental results on GLUE (Wang et al., 2019) under both full-set and few-shot settings demonstrate the effectiveness of our proposal across four different PETs. Further analyses show that the learned regularizer helps pull apart the hidden states of different label words. We also observe that when we project the intermediate hidden states of PETs without our regularizer into our latent space, the better the PETs perform, the closer the latent states are to our latent bridges.
This spontaneous approaching behavior may indicate that stochastic-bridge-like latent dynamics naturally exists in well-trained PETs. In summary, our work has the following contributions: (1) Guided by the perspective of optimal control for PETs, we design latent stochastic bridge regularizers on the intermediate states during the training of PETs. (2) We propose two methods to construct the latent space according to the two representations of stochastic bridges, offering a trade-off between efficiency and effectiveness. (3) Our regularizers are shown to be effective and general across different PLMs, different PETs, and different tasks. (4) We show that well-trained PETs without any regularization spontaneously exhibit stochastic-bridge-like latent dynamics.

2. BACKGROUND

2.1. DEFINITION AND MATHEMATICAL NOTATIONS

Consider using an $L$-layer PLM with the vocabulary $\mathcal{V}$ to handle a text-to-text task $\mathcal{D}$. For each sample $(x, y) \in \mathcal{D}$, $y \in \mathcal{V}$ is the output token and $x \in \mathcal{V}^N$ is the input token sequence, where $N$ is the length of $x$. With $x$ as the input, each layer of the PLM outputs a sequence of hidden states, and we denote the hidden states of the $i$-th PLM layer as $h^{(i)} = \{h^{(i)}_j\}_{j=1}^N$, where $h^{(i)}_j \in \mathbb{R}^d$ is the state at position $j$ of the $i$-th layer, and $d$ is the model dimension. We denote the position where the model outputs the target $y$ as $o$, i.e., the model should predict $y$ with the hidden state $h^{(L)}_o$.
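As a concrete illustration of the notation, here is a minimal sketch with toy dimensions (all values hypothetical): the hidden states form an $(L+1) \times N \times d$ array, and the prediction is read off at position $o$ of the last layer.

```python
import numpy as np

# Toy dimensions (hypothetical): an L-layer model over a length-N input.
L, N, d = 4, 8, 16
rng = np.random.default_rng(0)

# h[i] is the hidden-state sequence of layer i; h[0] holds the input embeddings.
h = rng.normal(size=(L + 1, N, d))

o = N - 1          # position where the target token y is predicted
h_L_o = h[L, o]    # the state h^(L)_o used for the prediction
assert h_L_o.shape == (d,)
```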

2.2. OPTIMAL CONTROL PERSPECTIVE OF PETS

Conventionally, adapting the PLM to $\mathcal{D}$ requires full-parameter fine-tuning, which is given as:

$$\min_{\Delta\theta} \; \mathbb{E}_{x,y\sim\mathcal{D}} \left[ \mathcal{L}\big(h^{(L)}_o, y\big) + \mathcal{R}(\Delta\theta) \right], \quad h^{(i)} = \begin{cases} h^{(i-1)} + G^{(i)}_{\theta+\Delta\theta}\big(h^{(i-1)}\big), & i = 1, \dots, L, \\ \mathrm{Embed}(x), & i = 0, \end{cases} \tag{1}$$

where $\theta$ is the PLM parameters, $\Delta\theta$ is the full-parameter update, $\mathcal{L}(\cdot,\cdot)$ is the loss function, $\mathcal{R}(\cdot)$ is the regularization function, $G^{(i)}_{\theta+\Delta\theta}(\cdot)$ is the forward propagation of the $i$-th PLM layer after updating the parameters, and $\mathrm{Embed}$ transforms the input token sequence into input embeddings. As $|\theta|$ continues to increase, full-parameter fine-tuning becomes impractical, and various PETs have been proposed to mitigate this problem. Let $\phi = \{\phi^{(i)}\}_{i=0}^L$ be the PETs' parameters. Ding et al. (2022) give a unified view of PETs from the perspective of optimal control, and Eq. 1 can be re-written as

$$\min_{\phi} \; \mathbb{E}_{x,y\sim\mathcal{D}} \left[ \mathcal{L}_T\big(h^{(L)}_o, y\big) + \sum_{i=0}^{L} \mathcal{L}_R\big(\phi^{(i)}\big) \right], \quad h^{(i)} = \begin{cases} h^{(i-1)} + G^{(i)}_{\theta}\big(h^{(i-1)}, \phi^{(i)}\big), & i = 1, \dots, L, \\ \big[\phi^{(0)}; \mathrm{Embed}(x)\big], & i = 0, \end{cases} \tag{2}$$

where $G^{(i)}_{\theta}(\cdot, \phi^{(i)})$ is the forward propagation of the $i$-th layer intervened by the PET. Typically, for classification and generation tasks, $\mathcal{L}_T$ corresponds to the cross-entropy loss of the prediction, and $\mathcal{L}_R$ can be seen as a regularizer on the PETs' parameters $\phi$. However, in the optimal control literature, $\mathcal{L}_R$ depends not only on the control variables $\phi$, but also on the controlled intermediate states $\{h^{(i)}_o\}_{i=1}^L$. In this paper, we show that including intermediate states to build $\mathcal{L}_R$ (i.e., using $\sum_{i=0}^{L} \mathcal{L}_R\big(\phi^{(i)}, h^{(i)}_o\big)$ instead of $\sum_{i=0}^{L} \mathcal{L}_R\big(\phi^{(i)}\big)$) not only makes the optimal control perspective of PETs more theoretically sound, but also empirically leads to better PETs.
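To make the role of the running cost concrete, here is a minimal numeric sketch of the controlled residual forward pass of Eq. 2 that accumulates a cost on every intermediate state; the layer function `G`, the PET parameters `phi`, and the quadratic running cost are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 16

def G(i, h, phi_i):
    # Hypothetical stand-in for the PET-intervened layer function G^(i)_theta(., phi^(i)).
    return np.tanh(h + phi_i)

def forward_with_running_cost(x_embed, phi, running_cost):
    """Residual forward pass of the controlled system, accumulating a running
    cost L_R on every intermediate state -- the dependency most PETs drop."""
    h = x_embed + phi[0]                     # stand-in for the PET acting on the input layer
    total = running_cost(0, h)
    for i in range(1, L + 1):
        h = h + G(i, h, phi[i])              # h^(i) = h^(i-1) + G^(i)(h^(i-1), phi^(i))
        total += running_cost(i, h)          # L_R(phi^(i), h^(i))
    return h, total

phi = [rng.normal(scale=0.1, size=d) for _ in range(L + 1)]
h_final, lr = forward_with_running_cost(rng.normal(size=d), phi,
                                        lambda i, h: float(np.mean(h ** 2)))
assert h_final.shape == (d,) and lr > 0.0
```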

2.3. DIFFUSION BRIDGES

A diffusion process $X = (X_t)_{t\in[0,T]}$ is a continuous-time Markov process. For any $t_a < t_b$, the diffusion process is equipped with a transition Probability Density Function (PDF) $p(t_b, b \mid t_a, a)$, which gives the probability density of reaching $b$ at time $t_b$ given the history of reaching $a$ at time $t_a$. A diffusion process is also the solution to an Itô SDE

$$dX_t = \mu(t, X_t)\,dt + \sigma(t, X_t)\,dB_t,$$

where $B_t$ is a standard Brownian motion, $\mu(\cdot,\cdot)$ is called the drift function, and $\sigma(\cdot,\cdot)$ is called the diffusion function. A diffusion bridge $X^{T;\alpha,\beta}$ is a diffusion process conditioned on the path observations of the two endpoints $(0, \alpha)$ and $(T, \beta)$, i.e., $X^{T;\alpha,\beta}_0 = \alpha$ and $X^{T;\alpha,\beta}_T = \beta$. For simplicity, we assume $\alpha = 0$ in this work and omit the superscript $\alpha$. We consider two typical diffusion bridges, the Brownian bridge and the Ornstein-Uhlenbeck bridge (OU bridge). We present here the properties of the Brownian bridge and leave the properties of the OU bridge to Appendix B.

Proposition 2.1 (The properties of the Brownian bridge). A Brownian bridge $X^{T;\beta}$ with $X^{T;\beta}_0 = 0$ and $X^{T;\beta}_T = \beta$ is the solution to the following SDE:

$$dX_t = \frac{\beta - X_t}{T - t}\,dt + dB_t, \quad X_0 = 0. \tag{3}$$

The transition PDF from $X^{T;\beta}_0 = 0$ to $X^{T;\beta}_{t_b} = b$ is given as

$$p^{T;\beta}(t_b, b \mid 0, 0) = \frac{1}{\sqrt{2\pi t_b (T - t_b)/T}} \exp\left( -\frac{\big(b - (t_b/T)\beta\big)^2}{2 t_b (T - t_b)/T} \right). \tag{4}$$

Diffusion bridges and SDEs are battle-tested tools for describing the stochastic dynamics of complex systems in engineering (Sobczyk, 2013), finance (Wang & Sloan, 2011), biology (Horsthemke & Lefever, 1984), etc. Considering that the dynamics of PLMs' hidden states are highly complex, diffusion bridges and SDEs serve as ideal tools for us to model these dynamics.
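The Brownian bridge SDE of Eq. 3 can be checked numerically. The sketch below simulates it with a simple Euler-Maruyama scheme (step count and seed are arbitrary choices); the final point is pinned to $\beta$ explicitly, since the drift diverges as $t \to T$.

```python
import numpy as np

def simulate_brownian_bridge(beta, T=1.0, n_steps=1000, seed=0):
    """Euler-Maruyama simulation of dX_t = (beta - X_t)/(T - t) dt + dB_t, X_0 = 0."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = 0.0
    path = [x]
    for k in range(n_steps - 1):           # stop one step early: the drift diverges at t = T
        t = k * dt
        drift = (beta - x) / (T - t)
        x = x + drift * dt + np.sqrt(dt) * rng.normal()
        path.append(x)
    path.append(beta)                       # the bridge is pinned to beta at t = T
    return np.array(path)

path = simulate_brownian_bridge(beta=2.0)
assert path[0] == 0.0 and path[-1] == 2.0
```

The strong pull of the drift term keeps sample paths close to the straight line from $0$ to $\beta$, which is exactly the behavior the regularizer later exploits.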

3.1. THE OVERALL FRAMEWORK

Building latent dynamics in the latent space. Since directly regularizing the intermediate states and constructing the running cost are non-trivial, we introduce a projection from the intermediate state space to a latent space, and leverage diffusion bridges as regularizers to construct the running cost. Specifically, we define an $r$-dimensional latent space $\mathcal{U} \subseteq \mathbb{R}^r$ ($r < d$) and a learnable mapping $g_\gamma: \mathbb{R}^d \times \mathbb{R}^d \to \mathcal{U}$, where $\gamma$ denotes the parameters. $g_\gamma$ projects the hidden state $h^{(i)}_o$ and its context state $\bar h^{(i)}$ into the latent space $\mathcal{U}$ at each layer of the PLM. Since $h^{(i)}_o$ is contextualized while latent bridges are not, introducing the dependency on $\bar h^{(i)}$ can inform $g_\gamma$ about the context at the $i$-th layer and allow $g_\gamma$ to decontextualize the hidden states. We simply take the averaged states at the $i$-th layer, $\bar h^{(i)} = \frac{1}{N}\sum_{j=1}^N h^{(i)}_j$, as the context. We define the latent states with discrete time in $\mathcal{U}$ as

$$u_D\big(g_\gamma, \{h^{(i)}_o\}_{i=0}^L\big) = \big\{\big(t_{i+1}, g_\gamma(h^{(i)}_o, \bar h^{(i)})\big)\big\}_{i=0}^L, \quad t_{i+1} = (i+1)/(L+2), \tag{5}$$

where $t_{i+1}$ is the normalized layer index. We include the 0-th layer (the input layer) because some PETs (e.g., prompt tuning) act on the 0-th layer. We use $t_0 = 0$ and $t_{L+2} = 1$ to represent the two endpoints. By using a natural cubic spline knotted at $\{h^{(i)}_o\}_{i=0}^L$ to interpolate over $[-1, L+1]$, we further give a continuous representation of the states in the latent space $\mathcal{U}$ as

$$u_C\big(g_\gamma, \{h^{(x)}_o\}_{x\in[-1,L+1]}\big) = \big\{\big(t_{x+1}, g_\gamma(h^{(x)}_o, \bar h^{(x)})\big)\big\}_{x\in[-1,L+1]}, \quad t_{x+1} = (x+1)/(L+2) \in [0, 1]. \tag{6}$$

Learning the mapping from hidden state space to latent space. Since adapting PLMs to downstream tasks can be seen as transferring the knowledge obtained from pre-training tasks to downstream tasks, we argue that the latent dynamics of the intermediate hidden states for generating the same token $y$ should be similar in both the pre-training and downstream tasks.
Therefore, we train the mapping $g_\gamma$ on the corpus that was used to pre-train the backbone PLM, and then apply the learned mapping to downstream tasks to encourage the latent dynamics to be similar to those in pre-training. Specifically, we assume that the states for generating the token $y$ in the latent space $\mathcal{U}$ form a trajectory that is, with high probability, a path sampled from $X^{1;\beta_y}$, where $X^{1;\beta_y}$ is the pre-determined diffusion bridge describing the latent dynamics for generating $y$, and $\beta_y$ is the tail endpoint of the diffusion bridge. More details of $X^{1;\beta_y}$ will be discussed in Section 3.2. On the corpus where the PLM was pre-trained, we fix the PLM and use its hidden states $\{h^{(i)}_o\}_{i=0}^L$ to learn $g_\gamma$ by maximizing the goodness of approximation for the latent states $u$ under the bridge $X^{1;\beta_y}$:

$$\gamma \leftarrow \arg\max_{\gamma'} \; \text{goodness-of-approx}\big(u(g_{\gamma'}, \{h^{(\cdot)}_o\}), X^{1;\beta_y}\big), \tag{7}$$

where $u$ can be $u_D$ (Eq. 5) or $u_C$ (Eq. 6) depending on the fitting method, and goodness-of-approx$(\cdot,\cdot)$ is a function that also depends on the choice of the fitting method, measuring how likely $u$ is a sample trajectory of $X^{1;\beta_y}$. In Section 3.3, we will define this function alongside the fitting methods.

Regularizing PETs with latent dynamics. After learning $g_\gamma$ with Eq. 7, we freeze $\gamma$ and use the goodness-of-approx function as the running cost in Eq. 2 for PETs on downstream tasks. The final objective for PETs becomes

$$\mathcal{L} = \mathcal{L}_T\big(h^{(L)}_o, y\big) + \alpha \cdot \text{goodness-of-approx}\big(u(g_\gamma, \{h^{(\cdot)}_o\}), X^{1;\beta_y}\big), \tag{8}$$

where the second term is the running cost and $\alpha$ is a hyper-parameter controlling the regularization strength. By optimizing Eq. 8, PETs need to ensure that the final hidden state is capable of predicting $y$, while keeping the latent states at position $o$ conformant to the diffusion bridge $X^{1;\beta_y}$. Note that introducing $g_\gamma$ as the regularizer does not increase the number of trainable parameters for PETs during the training stage, since $\gamma$ is fixed. Also, $g_\gamma$ is not involved in the inference stage.
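A minimal sketch of how the discrete latent states $u_D$ of Eq. 5 could be assembled; the linear map standing in for the learned $g_\gamma$ and the toy dimensions are hypothetical (the paper uses a three-layer MLP).

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, d, r = 4, 8, 16, 4

# Hypothetical linear stand-in for the learned mapping g_gamma: R^d x R^d -> U.
W = rng.normal(scale=0.1, size=(2 * d, r))
def g_gamma(h_o, h_ctx):
    return np.concatenate([h_o, h_ctx]) @ W

def discrete_latent_states(h, o):
    """u_D of Eq. 5: pairs (t_{i+1}, g(h^(i)_o, hbar^(i))) for i = 0..L,
    with t_{i+1} = (i+1)/(L+2) and hbar^(i) the layer-average context state."""
    states = []
    for i in range(L + 1):
        t = (i + 1) / (L + 2)
        h_ctx = h[i].mean(axis=0)            # hbar^(i) = (1/N) sum_j h^(i)_j
        states.append((t, g_gamma(h[i, o], h_ctx)))
    return states

u_D = discrete_latent_states(rng.normal(size=(L + 1, N, d)), o=N - 1)
assert len(u_D) == L + 1
assert 0.0 < u_D[0][0] < u_D[-1][0] < 1.0    # normalized times stay inside (0, 1)
```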

3.2. DETERMINING ENDPOINTS FOR DIFFUSION BRIDGES

An intuitive approach to determining the endpoints of the diffusion bridges for each target token is to optimize the endpoints together with the mapping $g_\gamma$. However, optimizing the endpoints and $g_\gamma$ jointly may admit a trivial solution: the endpoints are both $0 \in \mathbb{R}^r$ and $g_\gamma$ always outputs $0$. Since $0$ is always a point on the sample path of such a degenerate diffusion bridge, the value of the goodness-of-approximation function can be meaninglessly high. Although sophisticated constraints could be imposed here, as the first work that uses diffusion bridges as regularizers, we simply pre-determine the endpoints and keep them fixed, and leave introducing constraints to future work. Specifically, we apply principal component analysis (PCA) to the output token embedding matrix $V \in \mathbb{R}^{|\mathcal{V}| \times d}$ of the PLM, obtaining an $r$-dimensional embedding matrix, and re-normalize each row to have norm $\eta$. Let the resulting embedding matrix be $\beta \in \mathbb{R}^{|\mathcal{V}| \times r}$. We then use $0 \in \mathbb{R}^r$ as the head for all the bridges, and the rows of $\beta$ as the tails of the diffusion bridges, i.e., the $r$-dimensional embedding of $y$ in $\beta$ is used as $\beta_y$ in $X^{1;\beta_y}$. The intuition for using $\beta$ as the tails is that the trajectories of the intermediate states for similar target tokens should be close. In $V$, similar tokens are close, and $\beta$ obtained by PCA preserves the token similarity well after reducing dimensions.
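The endpoint construction above can be sketched directly with an SVD-based PCA; the toy vocabulary size, dimensions, and norm $\eta$ are arbitrary choices for illustration.

```python
import numpy as np

def bridge_endpoints(V, r, eta):
    """Tail endpoints beta: project the output embedding matrix V (|V| x d)
    onto its top-r principal components, then rescale each row to norm eta."""
    Vc = V - V.mean(axis=0, keepdims=True)            # center before PCA
    _, _, Vt = np.linalg.svd(Vc, full_matrices=False)  # rows of Vt = principal axes
    B = Vc @ Vt[:r].T                                  # r-dimensional coordinates
    B = eta * B / np.linalg.norm(B, axis=1, keepdims=True)
    return B

rng = np.random.default_rng(0)
V = rng.normal(size=(100, 16))                         # toy vocabulary embeddings
beta = bridge_endpoints(V, r=4, eta=1.0)
assert beta.shape == (100, 4)
assert np.allclose(np.linalg.norm(beta, axis=1), 1.0)  # every tail has norm eta
```

Because PCA is distance-preserving up to the discarded components, rows of `beta` for similar tokens stay close, which is the property the bridge tails rely on.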

3.3. FITTING THE MAPPING g γ

We use the Brownian bridge to illustrate the fitting of $g_\gamma$; the procedure extends to the OU bridge analogously.

Method 1: Approximating the Transition PDF. Generalizing Eq. 4 to high dimensions, we can derive the transition PDF from $(0, 0)$ to $(t_{i+1}, g_\gamma(h^{(i)}_o, \bar h^{(i)}))$ for $X^{1;\beta_y}$:

$$p^{1;\beta_y}\big(t_{i+1}, g_\gamma(h^{(i)}_o, \bar h^{(i)}) \mid 0, 0\big) \propto \exp\left( -\frac{\big\|g_\gamma(h^{(i)}_o, \bar h^{(i)}) - t_{i+1}\beta_y\big\|^2}{2 t_{i+1} (1 - t_{i+1})} \right), \quad i = 0, \dots, L, \tag{9}$$

where $t_{i+1}$ has the same definition as in $u_D$ (Eq. 5). To make $g_\gamma$ approximate the transition PDF, we maximize the sum of the log-probabilities of $u_D$ under the Brownian bridge $X^{1;\beta_y}$:

$$\text{goodness-of-approx} = \sum_{i=0}^{L} \log p^{1;\beta_y}\big(t_{i+1}, g_\gamma(h^{(i)}_o, \bar h^{(i)}) \mid 0, 0\big) + \text{const.} \tag{10}$$

Here, $g_\gamma$ can be seen as a mapping from the hidden state space to the latent space that predicts the expectation of the Brownian bridge $X^{1;\beta_y}$ at $\{t_{i+1}\}_{i=0}^L$.

Method 2: Approximating the SDE. Since the Brownian bridge is the solution to the SDE in Eq. 3, we can let $g_\gamma$ approximate the SDE. Solving the SDE requires continuous latent states, while we only have $L + 1$ discrete observations; we thus use the continuous representation $u_C$ introduced in Eq. 6. Generalizing Eq. 3 to high dimensions, the SDE approximated by $g_\gamma$ can be defined as:

$$dZ_t = g_\gamma\big(h^{(x)}_o, \bar h^{(x)}, t\big)\,dt + dB_t, \quad x = (L + 2)t - 1, \tag{11}$$

where $x$ is the same as in Eq. 6, and $B: [0, 1] \to \mathbb{R}^r$ is a standard $r$-dimensional Brownian motion. Here, we additionally introduce a dependency on $t$ for $g_\gamma$, since time information has been shown to be important in previous neural differential equation works (Zhang et al., 2020; Dupont et al., 2019). Following Li et al. (2020), when two SDEs share the same diffusion function, the KL divergence between the probability measures induced by the two SDEs is finite. Since the diffusion function $\sigma \equiv I$ for both Eq. 11 and the multi-dimensional generalization of Eq. 3, the KL divergence between the probability measure $\mu_Y$ of Eq. 11 and $\mu_X$ of generalized Eq. 3 can be estimated by:

$$D_{\mathrm{KL}}(\mu_X \,\|\, \mu_Y) = \mathbb{E}_Z\left[ \int_0^1 \frac{1}{2} \|u(t, \gamma)\|_2^2 \, dt \right], \quad u(t, \gamma) = \sigma^{-1}\Big( g_\gamma\big(h^{(x)}_o, \bar h^{(x)}, t\big) - \mu(t, Z_t) \Big) = g_\gamma\big(h^{(x)}_o, \bar h^{(x)}, t\big) - \frac{\beta_y - Z_t}{1 - t},$$

where $\mu(\cdot,\cdot)$ is the drift function of the pre-determined Brownian bridge $X^{1;\beta_y}$. We use the KL divergence as the goodness-of-approximation function to optimize the mapping $g_\gamma$. Here, $g_\gamma$ can be seen as a mapping from the hidden state space to the latent space that approximates the drift vector field of the underlying Brownian bridge $X^{1;\beta_y}$.
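A minimal sketch of the PDF regularizer's goodness-of-approx (the sum of Brownian-bridge log-densities, with $T = 1$ so the bridge variance at time $t$ is $t(1-t)$); the toy latent states are hypothetical, and as a sanity check, states lying exactly on the bridge mean score strictly higher than shifted ones.

```python
import numpy as np

def pdf_goodness_of_approx(latents, beta_y):
    """Sum over layers of the Brownian-bridge (T = 1) log-density of the latent
    state z at time t, up to an additive constant:
        log p(t, z | 0, 0) = -||z - t * beta_y||^2 / (2 t (1 - t)) + const.
    `latents` is a list of (t, z) pairs as in u_D."""
    total = 0.0
    for t, z in latents:
        total += -np.sum((z - t * beta_y) ** 2) / (2.0 * t * (1.0 - t))
    return total

rng = np.random.default_rng(0)
L = 4
beta_y = rng.normal(size=4)                                   # toy bridge tail
# Latent states exactly on the bridge mean t * beta_y, and a shifted copy.
on_bridge = [((i + 1) / (L + 2), ((i + 1) / (L + 2)) * beta_y) for i in range(L + 1)]
off_bridge = [(t, z + 1.0) for t, z in on_bridge]
assert pdf_goodness_of_approx(on_bridge, beta_y) > pdf_goodness_of_approx(off_bridge, beta_y)
```

Negating this quantity gives a loss, so the running-cost term in Eq. 8 penalizes latent states that drift away from the bridge.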

4. EXPERIMENTS

To verify the effectiveness and generality of the regularizers built on stochastic bridges, we conduct experiments on (1) different PLMs: BERT large (340M) (Devlin et al., 2019) and Deberta xlarge (750M) (He et al., 2021) ; (2) different PETs: Prompt tuning, LoRA, BitFit and Adapter; (3) different diffusion bridges: Brownian bridge and OU bridge. We show that the regularizers effectively improve the performance on GLUE (Wang et al., 2019) under both full-set and few-shot settings.

4.1. EXPERIMENTAL SETUPS

Datasets. Since both BERT large and Deberta xlarge use Wikipedia and BookCorpus (Zhu et al., 2015) for pre-training, we use these two corpora to train $g_\gamma$. We report accuracy for MNLI, SST-2, QNLI, and RTE; F1 for MRPC and QQP; and Matthews correlation for CoLA. We report the average performance and the standard deviation on the development set over 3 different runs. We append a special token [MASK] to each sequence, and require the PLM to output the label word at [MASK] (e.g., negative or positive for SST-2). We exclude STS-B because it is a regression task.

Models and PETs. We use the checkpoint released by Megatron (Shoeybi et al., 2019) for BERT large, and the official v1 checkpoint for Deberta xlarge. We use a simple three-layer MLP to build $g_\gamma$. For Prompt tuning, we use a soft prompt of length 20, appended to the end of each sequence. For LoRA, we apply it to the query and value projections of the attention modules. For Adapter, we apply it to the output of the attention and feed-forward modules. For BitFit, we tune all the bias terms in linear layers and layer normalization modules. Hereafter, we use "PDF regularizer" to refer to using $g_\gamma$ fitted by approximating the transition PDF, "SDE regularizer" to refer to using $g_\gamma$ fitted by approximating the SDE, and "vanilla x" to refer to the PET x without regularizers. We report performance on the original development set $\mathcal{D}_{dev}$.

Hyper-parameters. We list the hyper-parameters in Appendix E. We mainly focus on the difference in performance between vanilla PETs and regularized PETs. Therefore, we directly set the hyper-parameters to reasonable values and do not perform much hyper-parameter search, but we ensure the hyper-parameters for vanilla PETs and regularized PETs are the same for a fair comparison.

4.2. FULL-SET RESULTS

The experimental results for BERT large are reported in Table 1. Due to space limitations, see Table 7 for the complete results, including the OU bridge regularizers. The first line of each block in the table is the performance of the vanilla PET, and the remaining lines are the performances of the regularized PETs. We also conduct the same experiments for Deberta xlarge and place the results in Appendix C. In general, both Brownian and OU bridges, and both PDF and SDE regularizers, effectively improve the performance of PETs, showing the effectiveness of our proposed regularizers. Particularly, for Prompt tuning, the SDE regularizer with both diffusion bridges yields an increase in average performance of more than 2 points. We assume this is because Prompt tuning has far fewer trainable parameters than other PETs, and it only acts at the input layer, which is far from the supervision signals of the terminal cost $\mathcal{L}_T$. Therefore, when provided with the regularization on the hidden states, the prompts receive more guidance and eventually reach a better local optimum. Overall, the two diffusion bridges in our experiments do not show much difference. As for the two fitting methods, the SDE regularizer is generally more effective, especially for Prompt tuning, where the number of trainable parameters is restricted. However, we also observe that the SDE regularizer is about 3 times slower than the PDF regularizer, which brings a trade-off between performance and efficiency. One can expect better performance by leveraging more sophisticated underlying stochastic bridges, exploring more reasonable endpoints for the bridges, and designing a better mapping $g_\gamma$. As the first work using latent stochastic bridges as regularizers, we mainly consider the most straightforward cases and aim to show the great potential of the approach and the importance of regularizing hidden states.

4.3. FEW-SHOT RESULTS

We observe in Table 1 that the improvements brought by our regularizers are more substantial on the small datasets MRPC, CoLA, and RTE. We assume this is because in rich-resource datasets, the abundant data itself provides enough information to learn high-quality PETs, while in low-resource datasets, the data is insufficient and the regularizer can offer additional helpful supervision. To validate this, we conduct experiments under the few-shot setting on GLUE. The 16-shot results are shown in Table 2; the results for the OU bridge, the results for the 4-, 8-, and 32-shot settings, and the results for Deberta xlarge are placed in Appendix C. For all PETs, the SDE regularizer yields an improvement of more than 3 points. Particularly, the SDE regularizer on LoRA brings an improvement of 5.2 points. By applying regularizers under the few-shot setting, there is now a substantial boost even on datasets that are rich-resource in the full-set setting, such as MNLI, QQP, and QNLI. The PDF regularizer also gives modest improvements. Although slightly inferior to the SDE regularizer, it is still satisfactory, considering that the PDF regularizer brings such a performance improvement with little additional computational cost. We additionally observe that the improvement is more significant on Deberta xlarge in Table 9, demonstrating the potential of our regularizers on larger models.

5. ANALYSES

In order to better understand the role played by our regularizers, we analyze in this section the hidden states of PETs trained with and without regularizers, choosing Prompt tuning as the representative. By varying the hyper-parameter α in Eq. 8, we show that as the regularization gets stronger, the clusters of hidden states corresponding to different label tokens become more compact and more distinguishable. We also show that the hidden states of vanilla PETs spontaneously approach the latent bridges in the latent space without knowing the bridges, indicating that there may exist intrinsically diffusion-bridge-like latent dynamics for PETs. We vary the regularization strength by adjusting the coefficient α in Eq. 8 to inspect its impact on the hidden states. Note that when α = 0, the objective degenerates to that of the vanilla PET.
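Concretely, the quantity being varied here is the weight of the running cost in the overall objective, which combines the terminal cost with the α-weighted regularization (the exact form of Eq. 8 is outside this excerpt, so the function name below is a hypothetical sketch of ours):

```python
def pet_objective(task_loss, running_cost, alpha):
    """Terminal cost (the task loss) plus the alpha-weighted running
    cost (the bridge regularizer on intermediate hidden states).
    Setting alpha = 0 recovers the vanilla PET objective."""
    return task_loss + alpha * running_cost

# With alpha = 0 the regularizer has no effect on the objective.
```

In the analyses below, sweeping `alpha` from 0 upwards interpolates between the vanilla PET and increasingly regularized variants.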

5.1. THE REGULARIZER WIDENS THE DISTANCES BETWEEN LABEL CLUSTERS

We randomly sample 100 examples for each label in MNLI, use UMAP (McInnes et al., 2018) to reduce the dimension of the last layer's hidden states under different prompts, and plot them in Figure 2. It clearly shows that for both regularizers, as the regularization strength becomes stronger, the last layer's hidden states become more distinguishable among different labels. By inspecting the axes of these plots, we find that the distance between the clusters generally increases as the regularization strength is increased. We also notice that the SDE regularizer better separates the hidden states of the last layer by substantially enlarging the distance between the centroids of different labels, which could be one of the reasons why the SDE regularizer is more effective in almost all our experiments. We also calculate the Pearson's correlation between the regularization strength α and the average distance between the centroids of different clusters. The results are shown in Table 3. On all the datasets, the regularization strength α has a positive correlation with the average centroid distance, and on most of the datasets the correlations are significant in the sense that the p-value is < .05. This indicates that as the regularization strength becomes stronger, the centroids of different label clusters move further apart, which is a desired effect because the regularizer encourages the hidden states of different label tokens to conform to different latent diffusion bridges. An interesting phenomenon we observe is that the vanilla PETs' intermediate hidden states spontaneously approach our latent bridges when projected by our mapping g_γ: as the performance of vanilla PETs improves, the average distance from g_γ({h^(·)_o, h^(·)}) to our latent bridges decreases.
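The cluster-separation statistic used here can be sketched in a few lines: compute the centroid of each label's (dimension-reduced) hidden states and average the pairwise centroid distances. This is a minimal pure-Python illustration of ours, not the paper's exact analysis code:

```python
import math
from itertools import combinations

def centroid(points):
    """Mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def avg_centroid_distance(clusters):
    """Average pairwise Euclidean distance between label-cluster
    centroids, quantifying how separated the label clusters are."""
    cents = [centroid(c) for c in clusters]
    dists = [math.dist(a, b) for a, b in combinations(cents, 2)]
    return sum(dists) / len(dists)

# Two toy 2-D clusters centred near (0, 0) and (4, 0).
clusters = [[(0.1, 0.0), (-0.1, 0.0)], [(3.9, 0.0), (4.1, 0.0)]]
```

Correlating this statistic with α across runs yields the values reported in Table 3.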

5.2. HIDDEN STATES APPROACH SPONTANEOUSLY TO THE LATENT BRIDGES

Here, similar to Wang et al. (2022), we define the distance from g_γ(h^(·)_o, h^(·)) to its corresponding latent bridge X^y using Eq. 10 without the constant. Note that the vanilla PETs have no access to g_γ and the latent bridges during training, and g_γ likewise has no access to the PETs during its fitting. We demonstrate the above observation by conducting analyses in few-shot scenarios and reporting the correlation between (1) the number of shots and the average distance from latent hidden states to latent bridges in Table 4, and (2) the performance and the average distance from latent hidden states to latent bridges in Table 5. Specifically, we report Kendall's rank correlation for (1), and Pearson's correlation for (2). See Appendix D for the detailed setup of the correlation calculations. From Table 4, the number of shots has a negative correlation with the distance, and the correlation is significant on 4 out of 6 datasets. This indicates that as the amount of available data increases for vanilla PETs, their intermediate hidden states in the latent space spontaneously approach the latent bridges even without knowing the mapping g_γ and the bridges. Additionally, the results in Table 5 show a negative correlation between the performance of vanilla PETs and the distance to the latent bridges, significant on 3 out of 6 datasets. Altogether, the two correlation findings show that as PETs learn to solve the task, their intermediate hidden states spontaneously approach our latent bridges in the latent space projected by g_γ. This implies that there exist intrinsically diffusion-bridge-like latent dynamics for PETs, and it also justifies our use of latent diffusion bridges as regularizers.

6. RELATED WORKS

Recent years have witnessed the success of PLMs (Raffel et al., 2020; Brown et al., 2020), which acquire rich knowledge from unlabeled data in a self-supervised manner and can be adapted to specific tasks by tuning their parameters. Despite the success of PLMs, as their size continues to grow, it becomes increasingly impractical to perform full-parameter fine-tuning for downstream tasks. Therefore, various recent efforts have been devoted to PETs, which freeze the PLM and tune only a few additional parameters for task adaptation, such as Prompt tuning (Lester et al., 2021), which prepends tunable tokens per task to the input text; Adapter (Houlsby et al., 2019), which inserts small modules into each layer; BitFit (Zaken et al., 2022), which tunes only the bias terms; and LoRA (Hu et al., 2022), which decomposes the weight updates into low-rank matrices. In this paper, based on the theoretical grounding of PETs in optimal control (Yang & Liu, 2022; Ding et al., 2022), we develop stochastic bridges as regularizers for intermediate hidden states and introduce regularized PETs, showing effectiveness and generality on different PLMs and tasks. Since we adopt latent stochastic bridges as the regularizer, our work closely relates to continuous-time neural ODEs (Chen et al., 2018; Rubanova et al., 2019) and neural SDEs (Li et al., 2019; Kidger et al., 2021), which model the dynamics of hidden states with ODEs and SDEs parameterized by neural networks, respectively. For example, Chen et al. (2018) show the resemblance between ODEs and ResNets (He et al., 2016), and propose neural ODEs as a continuous-time generalization of ResNets. Inspired by these works, we use SDEs to represent the latent dynamics of PETs in the latent space. Our work differs from these works in that we focus on using neural SDEs as regularizers for intermediate hidden states, rather than as feature extractors on downstream tasks.
We also notice that Wang et al. (2022) explore the use of the Brownian bridge in PLMs. However, they use the Brownian bridge to regularize text dynamics across time, while we use it to regularize the dynamics of intermediate hidden states across the model's layers. We additionally show that diffusion bridges other than the Brownian bridge can easily be applied in our regularizer. As far as we know, our work is the first to show the diffusion-bridge-like dynamics of hidden states across layers and to use diffusion bridges as regularizers for intermediate hidden states.

7. CONCLUSION

In this work, we start from the optimal control perspective of PETs and notice that existing PETs lack a running cost that regularizes the intermediate hidden states. Therefore, we propose to use stochastic bridges in a latent space as regularizers for PETs. Experimental results on different models, tasks and PETs show that the proposed regularizer effectively improves the PETs' performance. Our analyses further show that even when the PETs are trained without the regularizer, the hidden states spontaneously approach our diffusion bridges, indicating that there exist intrinsically diffusion-bridge-like dynamics for PETs. As the first work using stochastic bridges as regularizers, we show its effectiveness and generality even with simple diffusion bridges. We believe it is a promising direction and we are excited to see more future work.

A BACKGROUND FOR PARAMETER-EFFICIENT TUNING METHODS

The large number of parameters in PLMs makes full fine-tuning impractical; various PETs have therefore been proposed to mitigate this problem. Current PETs can be categorized into three groups: addition-based, specification-based and reparameterization-based (Ding et al., 2022). To verify the generality of our method, we include one or two PETs from each category in this work and give a brief review of these PETs.

Prompt Tuning is an addition-based PET. It prepends or appends trainable virtual tokens P ∈ R^{m×d} to each sequence x ∈ R^{n×d} to form a new input sequence [P; x] or [x; P] ∈ R^{(n+m)×d}, where n and m are the lengths of the original sequence and the virtual tokens respectively, and d is the embedding dimension. The virtual tokens P can either be continuous (Lester et al., 2021) or be restricted to embeddings of discrete tokens in the vocabulary (Gao et al., 2021).
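The shape bookkeeping of Prompt Tuning is simple enough to sketch directly; below, embedding sequences are represented as lists of d-dimensional rows (this toy helper is ours, not the paper's code):

```python
def prepend_prompt(P, x):
    """Form the Prompt-tuning input [P; x]: m trainable virtual-token
    embeddings P (m rows of dimension d) prepended to the n input
    embeddings x (n rows of dimension d), giving an (n + m) x d sequence."""
    assert all(len(p) == len(x[0]) for p in P), "embedding dims must match"
    return P + x  # row-wise concatenation

P = [[0.0, 1.0]]              # m = 1 virtual token, d = 2
x = [[1.0, 2.0], [3.0, 4.0]]  # n = 2 input tokens
```

Appending `[x; P]` is the same operation with the operands swapped; only `P` is trained while the PLM stays frozen.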

C OTHER RESULTS FOR GLUE EXPERIMENTS

In this section, we present the complete results, including the OU bridge regularizer, for Table 1 and Table 2. We also report the results for Deberta xlarge, and the results on few-shot GLUE for both BERT large and Deberta xlarge under the 4-, 8-, 16- and 32-shot settings. We observe that BERT large cannot give reasonable answers on CoLA: the Matthews correlations are around 0 for all the PETs and all the shots we have experimented with. The situation improves for the larger model Deberta xlarge. Therefore, we exclude CoLA for BERT large and keep it for Deberta xlarge. We select only the Brownian bridge as the representative in this section, since the Brownian bridge and the Ornstein-Uhlenbeck bridge show no significant difference in Table 1. In Table 7 and Table 6, we report the performance of the OU bridge regularizers; the experimental setups are the same as those of Table 1 and Table 2 respectively. The performances of the OU bridge and the Brownian bridge do not differ significantly. In Table 8, we report the performance of Deberta xlarge on the full GLUE datasets. On all four PETs, the SDE regularizer outperforms the PDF regularizer, which is consistent with the results in Table 1. The results for 4-, 8- and 32-shot for BERT large and Deberta xlarge are plotted in Figure 3 and Figure 4 respectively. For simplicity, we only plot the average performance for each PET. We report the results of the 16-shot experiments for Deberta xlarge in Table 9. The setup is almost the same as that of the experiment in Section 4.3, and the hyper-parameters are listed in Appendix E. The SDE regularizer outperforms the PDF regularizer on most PETs except Prompt tuning. We notice that the SDE regularizer helps Deberta xlarge substantially on CoLA for most PETs, indicating that the SDE regularizer can effectively provide useful guidance when data is scarce and the task is hard.

D CALCULATION OF CORRELATION IN SECTION 5

In this section, we elaborate on how we calculate the correlations reported in Table 4 and Table 5.

D.1 CORRELATION BETWEEN NUMBER OF SHOTS AND DISTANCE TO BRIDGE

A pair of observations {(x_i, y_i), (x_j, y_j)} is defined as tied if x_i = x_j or y_i = y_j. Since we generate the few-shot datasets using 5 random seeds for each shot, each PET has 5 results per shot. This results in observations with ties; e.g., the two distance observations for the first and second seeds at 8-shot, {(8, d_1), (8, d_2)}, are tied. To calculate the correlation for data with ties, the tau-b variant of Kendall's rank correlation is more suitable than Pearson's correlation. We therefore report Kendall's rank correlation for the correlation between the number of shots and the hidden states' distance to the latent bridges.
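The tau-b statistic corrects the concordant/discordant pair count for ties in either variable; a self-contained sketch (in practice one would use a library routine such as `scipy.stats.kendalltau`):

```python
import math
from collections import Counter
from itertools import combinations

def kendall_tau_b(xs, ys):
    """Tau-b of Kendall's rank correlation, which corrects for ties
    (e.g. repeated shot counts across random seeds)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        prod = (x1 - x2) * (y1 - y2)
        if prod > 0:
            concordant += 1
        elif prod < 0:
            discordant += 1  # tied pairs (prod == 0) count for neither
    n = len(xs)
    n0 = n * (n - 1) // 2
    n1 = sum(c * (c - 1) // 2 for c in Counter(xs).values())  # ties in x
    n2 = sum(c * (c - 1) // 2 for c in Counter(ys).values())  # ties in y
    return (concordant - discordant) / math.sqrt((n0 - n1) * (n0 - n2))
```

With shot counts repeated across seeds (e.g. `xs = [4, 4, 8, 8, 16, 16]`), the tie terms n1 shrink the denominator so that the statistic stays meaningful.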

D.2 CORRELATION BETWEEN PERFORMANCE AND DISTANCE TO BRIDGE

We mix all the few-shot results for different shots and different seeds to form observations of performance and the hidden states distances to bridges, and then calculate the Pearson's correlation.
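Pooling all (performance, distance) pairs and computing Pearson's correlation amounts to the standard sample formula; a minimal sketch of ours:

```python
import math

def pearson(xs, ys):
    """Pearson's correlation between, e.g., performance and the
    hidden states' distance to the latent bridge."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)
```

A negative value here, as reported in Table 5, means better-performing PETs sit closer to the latent bridges.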

E HYPER-PARAMETERS

E.1 TRAINING g_γ

We use a simple 3-layer MLP for g_γ in all of our experiments. For the PDF regularizer, the output dimensions of the layers are 1024, 256 and 128; for the SDE regularizer, they are 1024, 256 and 32. We observe that the running time of the SDE regularizer increases noticeably with the final output dimension, so we choose a smaller one for it. The hyper-parameters for training g_γ are listed in Table 10.

E.2 FULL-SET GLUE EXPERIMENTS

We run all the experiments for 50k steps and evaluate on the development set every 1k steps. For BERT large we use a batch size of 32, while for Deberta xlarge we use 16. We choose a learning rate of 1e-3 for Prompt tuning and 1e-4 for the other PETs, for both PLMs. We use 0.01 weight decay, a maximum gradient norm of 1.0 and no learning-rate warm-up in all experiments. We search for the best regularization strength α in {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} for the PDF regularizer, and in {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5, 1.0} for the SDE regularizer. The best α for the PDF regularizer are listed in Table 11, and the best α for the SDE regularizer in Table 12.

E.3 FEW-SHOT GLUE EXPERIMENTS

We run all the experiments for 1k steps and evaluate on the development set every 50 steps. For all shots, both regularizers and both models, we use a batch size of 2. Other hyper-parameters are kept the same as in the full-set GLUE experiments.

F PERFORMANCE OF REGULARIZERS TRAINED WITH TINY CORPUS

Although we use the pre-training corpus to train our mapping g_γ, the training is actually fast and data-efficient. We show that when using only 10,000 documents from the pre-training corpus (about 0.1% of the corpus), the obtained regularizers still perform well and are comparable to the regularizers trained on the whole pre-training corpus. We train the mapping g_γ for 5,000 iterations with a batch size of 128. On a single NVIDIA A100 GPU, the training can be done in 1 hour for the PDF regularizer and 3 hours for the SDE regularizer. The cost of training our regularizer is quite small compared to the resources required for pre-training. We conduct the same experiments as in Section 4.2 and Section 4.3 with the regularizers trained on the tiny corpus. The results are presented in Table 13 and Table 14 respectively. On full-set GLUE, the PDF regularizer performs even better on three out of four PETs, and although its performance is affected on Adapter, it still outperforms the vanilla Adapter. The SDE regularizer is slightly affected on three out of four PETs, but it still brings substantial improvements on all the PETs. On few-shot GLUE, the impact of shrinking the corpus is relatively obvious, but overall the regularizers still perform well on all the PETs. The drops in performance are relatively small compared to the boosts they bring over the vanilla PETs.



Here we assume y ∈ V, since a sample where y ∈ V^M can be decomposed into M samples: the i-th sample is ([x; y_{<i}], y_i) for auto-regressive language modeling or ([x; y_{-i}], y_i) for auto-encoding language modeling. The reason we choose Kendall's rank correlation is that it is suitable for data with ties; see Appendix D.



Figure 1: An overview of our proposed latent stochastic bridge regularizer.

The mapping g_γ projects hidden states for different targets onto different target-specific diffusion bridges. The obtained mapping can then be plugged into the model to regularize the intermediate hidden states when training the PET parameters. Besides, since a diffusion bridge is (1) a Markov process and (2) a solution to a stochastic differential equation (SDE), we correspondingly propose two methods to learn the mapping: (1) fitting the Markov transition probability density function (PDF) and (2) fitting the SDE directly. These two methods act as a trade-off between efficiency and effectiveness: the first incurs only negligible computational cost and achieves satisfactory results, while the second is slower but yields better regularizers.
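To make the first (PDF-fitting) method concrete, the sketch below scores a sequence of latent states under the Brownian bridge transition density, which is the kind of running cost our regularizer adds. This is a simplified scalar illustration of ours (the actual regularizer operates on g_γ-projected vectors, with endpoints and constants as defined in the paper):

```python
import math

def brownian_bridge_nll(zs, beta, T=1.0, sigma=1.0):
    """Negative log-likelihood of latent states zs (one scalar per layer,
    at times t_i = i * T / L) under a Brownian bridge pinned at z(0) = 0
    and z(T) = beta. Serves as a running cost on intermediate states."""
    L = len(zs)
    nll, s, x = 0.0, 0.0, 0.0  # start from the pinned point (0, 0)
    for i, y in enumerate(zs, start=1):
        t = T * i / L
        mean = x + (t - s) / (T - s) * (beta - x)          # bridge drift toward beta
        var = sigma ** 2 * (T - t) * (t - s) / (T - s)     # bridge variance
        if var > 0:  # the final pinned point has zero variance
            nll += 0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)
        s, x = t, y
    return nll
```

States that march steadily from 0 toward the pinned endpoint score a lower NLL than states that wander, which is exactly the behavior the regularizer encourages across layers.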

represents the forward propagation of the i-th PLM layer intervened by PETs, [•; •] is the concatenation operation, L T (•, •) is the terminal cost and L R (•) is the running cost. Since |ϕ| ≪ |θ|, PETs can greatly reduce the tuning cost (more details in Appendix A).
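As a toy illustration of how such interventions enter a layer's forward pass, here is a minimal pure-Python sketch of the Adapter and LoRA computations reviewed in Appendix A (list-based linear algebra with tiny dimensions; the helper names are ours):

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def adapter(h, W_d, W_u, act=lambda v: max(v, 0.0)):
    """Adapter: h <- W_u * act(W_d * h) + h, a residual bottleneck MLP
    with W_d in R^{r x d} and W_u in R^{d x r}."""
    mid = [act(v) for v in matvec(W_d, h)]
    up = matvec(W_u, mid)
    return [u + hv for u, hv in zip(up, h)]

def lora_forward(x, W, B, A):
    """LoRA: (W + B A) x computed as W x + B (A x), so the low-rank
    update delta_W = B A is never materialized."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]
```

In both cases only the small matrices (W_d, W_u or B, A) are trained, which is why |ϕ| ≪ |θ|.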

k-shot Experiments. We randomly sample 2 × k examples from the original training set D_train for each class. The sampling is performed 5 times with different seeds to form 5 training and development sets {(D^(i)_train, D^(i)_dev)}^5_{i=1}, each being k-shot. Each time, we train PETs on D^(i)_train and select the best model on D^(i)_dev.
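The sampling procedure above can be sketched as follows (a hypothetical helper of ours; datasets are lists of `(text, label)` pairs):

```python
import random
from collections import defaultdict

def sample_k_shot(train_set, k, seed):
    """Sample 2*k examples per class: k for the k-shot training set and
    k for the development set, as in the few-shot setup."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in train_set:
        by_label[label].append((text, label))
    train, dev = [], []
    for label, examples in by_label.items():
        picked = rng.sample(examples, 2 * k)  # without replacement
        train.extend(picked[:k])
        dev.extend(picked[k:])
    return train, dev
```

Running this with 5 different seeds produces the 5 train/dev splits used for each shot count.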

Figure 2: A visualization of the last layer's hidden states on MNLI using the prompt that is trained (a) without regularizer (b-d) with the SDE regularizer (e-g) with the PDF regularizer.

Adapter (Houlsby et al., 2019) is an addition-based PET. It inserts two-layer MLPs after the attention module and the feed-forward module at each layer. Denote by h ∈ R^d the input of the Adapter, r the intermediate dimension of the Adapter's MLP, W_d ∈ R^{r×d} and W_u ∈ R^{d×r} the down-projection and up-projection of the Adapter, and σ the activation function. The computation of the Adapter can then be formulated as h ← W_u σ(W_d h) + h.

BitFit (Zaken et al., 2022) is a specification-based PET. It specifies the bias terms in the layer-normalization and linear-transformation modules as trainable.

LoRA (Hu et al., 2022) is a reparameterization-based PET. It assumes that, when training the model, the updates ∆W to the model's pre-trained parameters W ∈ R^{d×k} are low-rank, and therefore reparameterizes the ∆W of each matrix in the attention module with a low-rank decomposition ∆W = BA, where B ∈ R^{d×r} and A ∈ R^{r×k}. For a forward pass h = Wx, the computation of LoRA can be written as h = (W + ∆W)x = Wx + BAx.

B PROPERTIES FOR ORNSTEIN-UHLENBECK BRIDGE

Proposition B.1 (Properties of the Ornstein-Uhlenbeck Bridge). An Ornstein-Uhlenbeck bridge X^{T;β} pinned at X^{T;β}_0 = 0 and X^{T;β}_T = β is the solution to the following SDE:

dX_t = q ( −coth[q(T − t)] X_t + β / sinh[q(T − t)] ) dt + σ dB_t,   X_0 = 0,

where q is the mean-reversion rate and σ is the diffusion coefficient of the OU process. The transition probability density function reads

p_{T;β}(t, y | s, x) = (1 / √(2π σ̄(s, t))) exp( −( y − (x sinh(q(T − t)) + β sinh(q(t − s))) / sinh(q(T − s)) )² / (2 σ̄(s, t)) ),

where σ̄(s, t) = (σ² / q) · sinh(q(T − t)) sinh(q(t − s)) / sinh(q(T − s)).
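A quick numerical sanity check on the OU bridge variance σ̄(s, t) = (σ²/q) sinh(q(T − t)) sinh(q(t − s)) / sinh(q(T − s)): since sinh(qx) ≈ qx for small q, it should reduce to the Brownian bridge variance σ²(T − t)(t − s)/(T − s) as q → 0. The function names below are ours:

```python
import math

def ou_bridge_var(s, t, T, q, sigma):
    """Variance sigma_bar(s, t) of the OU bridge transition density."""
    return (sigma ** 2 / q) * (
        math.sinh(q * (T - t)) * math.sinh(q * (t - s)) / math.sinh(q * (T - s))
    )

def brownian_bridge_var(s, t, T, sigma):
    """Brownian bridge counterpart: sigma^2 (T - t)(t - s)/(T - s)."""
    return sigma ** 2 * (T - t) * (t - s) / (T - s)
```

Evaluating both at, say, s = 0, t = 0.3, T = 1 with a small q confirms the degeneration, which is consistent with the empirical observation that the two bridges perform similarly.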

Figure 3: The average BERT large few-shot GLUE results with different PETs under different shots. The results are averaged across 5 different seeds and the error bars indicate 95% confidence intervals. The SDE regularizer consistently outperforms the PDF regularizer.


Figure 4: The average Deberta xlarge few-shot GLUE results with different PETs under different shots. The results are averaged across 5 different seeds and the error bars indicate 95% confidence intervals.


Table 1: The results on GLUE for BERT large. The values are the averages of the best performances over three different runs, and the subscripts are the standard deviations. The ∆ column shows the difference in average performance between the vanilla PETs and the regularized PETs.

Table 2: The results on GLUE for BERT large under the 16-shot setting. We exclude CoLA because all PETs fail to give reasonable results under the few-shot setting.

We use the different prompts obtained with or without regularizers on the full-set GLUE, and record the intermediate hidden states at the prediction position as {h^(i)_[MASK]}^L_{i=1}.



Table 5: Pearson's correlation between the performance and the distance to the latent bridges.


Table 6: The complete results on GLUE for BERT large under the 16-shot setting. We exclude CoLA because all PETs fail to give reasonable results in the few-shot setting.

Table 7: The complete results on GLUE for BERT large. The values are the averages of the best performances over three different runs, and the subscripts are the standard deviations. The ∆ column shows the difference in average performance between the PETs with and without our regularizers.
89.2±0.2 93.5±0.2 95.5±0.1 84.6±0.4 62.8±1.6 78.9±1.6 84.8±0.3 -
+BROWN PDF 88.9±0.1 89.6±0.1 93.9±0.1 95.6±0.2 85.1±0.7 63.7±0.5 80.0±0.5 85.2±0.1 0.4
+OU PDF 88.9±0.2 89.4±0.1 93.7±0.1 95.7±0.3 86.0±0.5 63.6±0.6 80.5±0.8 85.4±0.1 0.6
+BROWN SDE 88.9±0.1 89.5±0.1 93.7±0.1 95.7±0.1 86.5±1.2 63.9±0.4 80.9±0.8 85.6±0.1 0.8
89.7±0.0 93.8±0.1 95.8±0.1 86.8±0.6 61.9±0.2 82.0±0.6 85.6±0.1 1.1
+BROWN SDE 88.9±0.1 89.8±0.1 93.9±0.2 95.8±0.1 85.9±0.4 62.3±1.8 82.2±0.2 85.5±0.2 1.0
+OU SDE 88.9±0.1 89.8±0.1 93.7±0.1 95.7±0.1 85.9±0.7 62.5±1.2 82.7±0.5 85.6±0.2 1.1

Table 8: The results on GLUE for Deberta xlarge.

            MNLI      QQP       QNLI      SST-2     MRPC      CoLA      RTE       AVG       ∆
PROMPT      87.2±0.1  86.5±0.1  93.8±0.1  96.8±0.1  75.4±2.5  64.2±3.8  78.8±3.5  83.2±0.5  -
+BROWN PDF  87.6±0.1  86.8±0.3  94.2±0.1  96.8±0.1  80.3±2.9  65.5±1.3  79.8±0.5  84.5±0.6  1.3
+BROWN SDE  87.6±0.1  86.8±0.1  94.0±0.1  96.8±0.1  84.4±0.6  64.8±0.5  79.5±1.3  84.8±0.2  1.6
LORA        91.1±0.1  90.3±0.1  95.1±0.1  96.8±0.1  88.7±0.7  68.0±1.3  83.4±1.1  87.6±0.3  -
+BROWN PDF  91.1±0.0  90.5±0.0  95.2±0.0  97.0±0.1  90.1±0.8  68.6±0.8  85.9±1.3  88.3±0.2  0.7
+BROWN SDE  91.1±0.1  90.4±0.0  95.1±0.0  96.9±0.2  90.5±0.6  69.6±1.1  85.6±0.9  88.5±0.1  0.9
BITFIT      90.0±0.1  88.4±0.0  95.0±0.0  96.6±0.1  87.3±0.6  66.9±0.2  82.4±0.6  86.7±0.1  -
+BROWN PDF  90.2±0.0  88.3±0.1  95.0±0.1  96.6±0.1  89.8±0.5  67.9±0.8  82.9±0.6  87.2±0.1  0.5
+BROWN SDE  90.1±0.1  88.3±0.0  94.8±0.0  96.6±0.1  90.4±0.5  67.9±0.4  83.8±0.5  87.4±0.1  0.7
ADAPTER     91.1±0.1  90.0±0.1  95.2±0.0  96.8±0.2  87.9±0.5  68.8±1.8  85.0±0.6  87.8±0.2  -
+BROWN PDF  91.2±0.1  90.0±0.0  95.3±0.1  96.9±0.2  89.2±0.8  70.1±1.0  86.9±1.5  88.5±0.5  0.7
+BROWN SDE  91.2±0.2  90.1±0.1  95.2±0.0  96.9±0.2  90.3±0.9  70.8±1.1  86.3±1.4  88.7±0.4  0.9

Table 9: The results on GLUE for Deberta xlarge under the 16-shot setting.

            MNLI      QQP       QNLI      SST-2     MRPC      CoLA      RTE       AVG       ∆
PROMPT      34.4±1.2  53.2±5.1  51.7±1.7  73.3±8.7  50.2±3.2  2.5±2.5   52.2±4.9  45.4±2.1  -
+BROWN PDF  35.8±0.9  57.7±2.4  53.5±1.5  87.5±3.0  52.2±1.9  2.8±2.9   54.1±1.8  49.1±0.8  3.7
+BROWN SDE  35.9±1.6  59.6±6.4  53.1±1.9  82.1±4.2  55.4±1.4  2.9±3.0   54.1±1.3  49.0±1.1  3.6
LORA        43.1±3.6  68.4±2.9  60.1±5.3  91.8±1.1  57.6±1.5  3.1±3.8   56.6±2.3  54.4±1.0  -
+BROWN PDF  52.1±3.4  70.2±2.9  73.3±6.3  91.7±1.1  59.5±4.5  14.2±5.6  60.4±4.2  60.2±1.2  5.8
+BROWN SDE  49.6±4.3  70.6±1.5  72.4±6.0  90.7±1.2  59.8±4.5  28.9±1.5  60.6±3.1  61.8±0.5  7.4
BITFIT      41.9±3.8  67.7±2.6  60.3±4.2  91.8±0.8  54.9±2.5  9.4±2.4   57.6±2.0  54.8±0.8  -
+BROWN PDF  45.2±3.7  70.3±1.2  65.4±6.7  90.9±0.8  55.6±2.1  8.2±2.5   59.6±2.3  56.5±0.7  1.7
+BROWN SDE  45.7±3.8  69.2±2.3  69.2±6.3  89.7±1.3  57.8±4.0  24.2±4.6  59.4±2.8  59.3±1.2  4.5
ADAPTER     43.1±2.9  67.7±2.7  55.9±5.3  91.1±0.9  56.1±2.1  8.6±5.6   59.0±2.3  54.5±1.4  -
+BROWN PDF  50.7±3.0  70.1±1.6  70.9±5.2  90.6±1.7  57.6±4.3  16.1±7.4  60.6±3.9  59.5±0.8  5.0
+BROWN SDE  47.1±1.6  72.0±1.1  71.3±4.3  91.0±1.0  59.7±4.5  26.4±5.5  60.0±5.7  61.1±0.6  6.6

Table 10: Hyper-parameters for training g_γ.

Table 11: Best α for the PDF regularizer on full-set GLUE. Columns: MNLI, QQP, QNLI, SST-2, MRPC, CoLA, RTE.

Table 12: Best α for the SDE regularizer on full-set GLUE.

G SPEED OF THE REGULARIZERS

As mentioned in Section 1, the PDF regularizer incurs only negligible computational cost. In this section, we present the step-metric curves on full-set GLUE in Figures 5, 6, 7 and 8. Across different PETs, the regularized PETs with the PDF regularizer have running times similar to the vanilla PETs. On the two large datasets, QQP and MNLI, regularized PETs with the SDE regularizer take about 2 to 3 times longer to achieve their best performance than vanilla PETs. However, on the medium-sized (QNLI, SST-2) and small datasets (CoLA, MRPC, RTE), the time to achieve the best results with the SDE regularizer is comparable to that of vanilla PETs. Overall, the PDF regularizer can effectively improve the performance of PETs without introducing much computational cost. In scenarios with relatively more focus on the inference performance of PETs and less concern about a somewhat longer training time, or when the dataset is small, the SDE regularizer is a good choice.

