PREDICTING INDUCTIVE BIASES OF PRE-TRAINED MODELS

Abstract

Most current NLP systems are based on a pre-train-then-fine-tune paradigm, in which a large neural network is first trained in a self-supervised way designed to encourage the network to extract broadly-useful linguistic features, and then fine-tuned for a specific task of interest. Recent work attempts to understand why this recipe works and to explain when it fails. Currently, such analyses have produced two sets of apparently-contradictory results. Work that analyzes the representations that result from pre-training (via "probing classifiers") finds evidence that rich features of linguistic structure can be decoded with high accuracy, but work that analyzes model behavior after fine-tuning (via "challenge sets") indicates that decisions are often not based on such structure but rather on spurious heuristics specific to the training set. In this work, we test the hypothesis that the extent to which a feature influences a model's decisions can be predicted from a combination of two factors: the feature's extractability after pre-training (measured using information-theoretic probing techniques), and the evidence available during fine-tuning (defined as the feature's co-occurrence rate with the label). In experiments with both synthetic and naturalistic data, we find strong evidence (statistically significant correlations) supporting this hypothesis.

1. INTRODUCTION

Large pre-trained language models (LMs) (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020) have demonstrated impressive empirical success on a range of benchmark NLP tasks. However, analyses have shown that such models are easily fooled when tested on distributions that differ from those they were trained on, suggesting they are often "right for the wrong reasons" (McCoy et al., 2019). Recent research which attempts to understand why such models behave in this way has primarily made use of two analysis techniques: probing classifiers (Adi et al., 2017; Hupkes et al., 2018), which measure whether or not a given feature is encoded by a representation, and challenge sets (Cooper et al., 1996; Linzen et al., 2016; Rudinger et al., 2018), which measure whether model behavior in practice is consistent with use of a given feature. The results obtained via these two techniques currently suggest different conclusions about how well pre-trained representations encode language. Work based on probing classifiers has consistently found evidence that models contain rich information about syntactic structure (Hewitt & Manning, 2019; Bau et al., 2019; Tenney et al., 2019a), while work using challenge sets has frequently revealed that models built on top of these representations do not behave as though they have access to such rich features; rather, they fail in trivial ways (Dasgupta et al., 2018; Glockner et al., 2018; Naik et al., 2018).

In this work, we attempt to link these two contrasting views of feature representations. We assume the standard recipe in NLP, in which linguistic representations are first derived from large-scale self-supervised pre-training intended to encode broadly-useful linguistic features, and then are adapted for a task of interest via transfer learning, or fine-tuning, on a task-specific dataset.
We test the hypothesis that the extent to which a fine-tuned model uses a given feature can be explained as a function of two metrics: the extractability of the feature after pre-training (as measured by probing classifiers) and the evidence available during fine-tuning (defined as the rate of co-occurrence with the label). We first show results on a synthetic task, and second using state-of-the-art pre-trained LMs on language data. Our results suggest that probing classifiers can be viewed as a measure of the pre-trained representation's inductive biases: the more extractable a feature is after pre-training, the less statistical evidence is required in order for the model to adopt the feature during fine-tuning.

Contribution. This work establishes a relationship between two widely-used techniques for analyzing LMs. Currently, the question of how models' internal representations (measured by probing classifiers) influence model behavior (measured by challenge sets) remains open (Belinkov & Glass, 2019; Belinkov et al., 2020). Understanding the connection between these two measurement techniques can enable more principled evaluation of and control over neural NLP models.

2. SETUP AND TERMINOLOGY

2.1. FORMULATION

Our motivation comes from McCoy et al. (2019), who demonstrated that, when fine-tuned on a natural language inference task (Williams et al., 2018, MNLI), a model based on a state-of-the-art pre-trained LM (Devlin et al., 2019, BERT) categorically fails on test examples which defy the expectation of a "lexical overlap heuristic". For example, the model assumes that the sentence "the lawyer followed the judge" entails "the judge followed the lawyer" purely because all the words in the latter appear in the former. While this heuristic is statistically favorable given the model's training data, it is not infallible: McCoy et al. (2019) report that 90% of the training examples containing lexical overlap had the label "entailment", but the remaining 10% did not. Moreover, the results of recent studies based on probing classifiers suggest that more robust features are extractable with high reliability from BERT representations. For example, given the pair "the lawyer followed the judge"/"the judge followed the lawyer", if the model can represent that "lawyer" is the agent of "follow" in the first sentence but the patient in the second, then the model should conclude that the sentences have different meanings. Such semantic role information can be recovered at >90% accuracy from BERT embeddings (Tenney et al., 2019b). Thus, the question is: why would a model prefer a weak feature over a stronger one, if both features are extractable from the model's representations and justified by the model's training data?

Abstracting over details, we distill the basic NLP task setting described above into the following, to be formalized in Section 2.2. We assume a binary sequence classification task where a target feature t perfectly predicts the label (e.g., the label is 1 iff t holds). Here, t represents features which actually determine the label by definition, e.g., whether one sentence semantically entails another.
Additionally, there exists a spurious feature s that frequently co-occurs with t in training but is not guaranteed to generalize outside of the training set. Here, s (often called a "heuristic" or "bias" elsewhere in the literature) corresponds to features like lexical overlap, which are predictive of the label in some datasets but are not guaranteed to generalize.

Assumptions. In this work, we assume there is a single t and a single s; in practice there may be many s features. Still, our definition of a feature accommodates multiple spurious or target features. In fact, some of our spurious features already encompass multiple features: the lexical feature, for example, is a combination of several individual-word features because it holds if any one of a set of words is in the sentence. This type of spurious feature is common in real datasets: e.g., the hypothesis-only baseline in NLI is a disjunction of lexical features (with semantically unrelated words like "no", "sleeping", etc.) (Poliak et al., 2018b; Gururangan et al., 2018). We assume that s and t frequently co-occur, but that only s occurs in isolation. This assumption reflects realistic NLP task settings, since datasets always contain some heuristics, e.g., lexical cues, cultural biases, or artifacts from crowdsourcing (Gururangan et al., 2018). Thus, our experiments focus on manipulating the occurrence of s alone, but not t alone: this means giving the model evidence against relying on s. This is in line with prior applied work that attempts to influence model behavior by increasing the evidence against s during training (Elkahky et al., 2018; Zmigrod et al., 2019; Min et al., 2020).

2.2. DEFINITIONS

Let X be the set of all sentences and S be the space of all sentence-label pairs (x, y) ∈ X × {0, 1}. We use D ⊂ S to denote a particular training sample drawn from S. We define two types of binary features: target (t) and spurious (s). Each is a function from sentences x ∈ X to a binary label {0, 1} that indicates whether the feature holds.

Target and spurious features. The target feature t is such that there exists some function f : {0, 1} → {0, 1} such that ∀(x, y) ∈ S, f(t(x)) = y. In other words, the label can always be perfectly predicted given the value of t. A feature s is spurious if it is not a target feature.

Partitions of S. To facilitate analysis, we partition S into four regions (Figure 1). We define S s-only to be the set of examples in which the spurious feature occurs alone (without the target). Similarly, S t-only is the set of examples in which the target occurs without the spurious feature. S both and S neither are defined analogously. For clarity, we sometimes drop the S and write, e.g., s-only in place of S s-only.
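As a concrete illustration of these definitions, the four regions can be computed directly from the feature functions. The sketch below is our own toy instantiation, not code from this work: it takes t to be "contains the symbol 1" and s to be "contains the symbol 2", anticipating the synthetic setup of Section 3.

```python
def partition(examples, t, s):
    """Split (sentence, label) pairs into the four regions of Figure 1."""
    regions = {"both": [], "t-only": [], "s-only": [], "neither": []}
    for x, y in examples:
        if t(x) and s(x):
            regions["both"].append((x, y))
        elif t(x):
            regions["t-only"].append((x, y))
        elif s(x):
            regions["s-only"].append((x, y))
        else:
            regions["neither"].append((x, y))
    return regions

t = lambda x: 1 in x  # target: sequence contains the symbol 1
s = lambda x: 2 in x  # spurious: sequence contains the symbol 2

# Label is 1 iff t holds, so f is the identity here.
data = [([1, 2, 3], 1), ([1, 5, 6], 1), ([2, 7, 8], 0), ([9, 9, 9], 0)]
regions = partition(data, t, s)
```

Note that because the label is a function of t alone, every example in the s-only region carries label 0, which is exactly the "evidence against s" manipulated in the experiments.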
Use of Spurious Feature. If a model has falsely learned that the spurious feature s alone is predictive of the label, it will have a high error rate when classifying examples for which s holds but t does not. We define the s-only error to be the classifier's error on examples from S s-only. When relevant, the t-only, both, and neither errors are defined analogously. In this work, "feature use" refers to the consistency of a model's predictions with that feature; we are not making a causal argument.

Extractability of a Feature. We want to compare features in terms of how extractable they are given a representation.
For example, given a sentence embedding, it may be possible to predict multiple features with high accuracy, e.g., whether the word "dog" occurs, and also whether the word "dog" occurs as the subject of the verb "run". However, detecting the former will no doubt be an easier task than detecting the latter. We use the prequential minimum description length (MDL; Rissanen, 1978), first used for probing by Voita & Titov (2020), to quantify this intuitive difference. MDL is an information-theoretic metric that measures both how accurately a feature can be decoded and the amount of effort required to decode it. Formally, MDL measures the number of bits required to communicate the labels given the representations. Conceptually, MDL can be understood as a measure of the area under the loss curve: if a feature is highly extractable, a model trained to detect that feature will converge quickly to high accuracy, resulting in a low MDL. Computing MDL requires repeatedly training a model over a dataset labeled by the feature in question. To compute MDL(s), we train a classifier (without freezing any parameters) to differentiate S s-only vs. S neither, and similarly compute MDL(t). See Voita & Titov (2020) for additional details on MDL.

2.3. HYPOTHESIS

Stated using the above-defined terminology, our hypothesis is that a model's use of the target feature is modulated by two factors: the relative extractability of the target feature t compared to the spurious feature s, and the evidence from s-only examples provided by the training data. In particular, we expect that higher extractability of t relative to s, measured by the ratio MDL(s)/MDL(t), will yield models that achieve better performance despite less training evidence.
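The prequential coding scheme can be sketched in a few lines. This is a minimal stand-in, not the probing implementation used in this work: the neural probe is replaced by a trivial Laplace-smoothed label-frequency estimator (`fit` is our own illustrative name), but the codelength bookkeeping mirrors the description above — the first block is transmitted with a uniform code, and each later block is encoded by a model trained on all previously transmitted data.

```python
import math

def prequential_mdl(examples, fit, block_ends):
    """Bits needed to transmit the labels, prequential (online) coding.

    `block_ends` gives the cumulative block boundaries; the first block is
    sent with a uniform 1-bit-per-label code, and each subsequent block is
    coded under a model fit on everything sent so far.
    """
    bits = block_ends[0] * 1.0  # uniform code for the first block
    for start, end in zip(block_ends, block_ends[1:]):
        model = fit(examples[:start])
        for x, y in examples[start:end]:
            p = max(model(x, y), 1e-12)  # probability of the true label
            bits += -math.log2(p)
    return bits

def fit(seen):
    """Toy probe: Laplace-smoothed frequency of the positive label."""
    ones = sum(y for _, y in seen)
    p1 = (ones + 1) / (len(seen) + 2)
    return lambda x, y: p1 if y == 1 else 1.0 - p1

constant = [(None, 1)] * 32                 # trivially predictable labels
alternating = [(None, i % 2) for i in range(32)]  # unlearnable for this probe
ends = [2, 4, 8, 16, 32]
mdl_easy = prequential_mdl(constant, fit, ends)
mdl_hard = prequential_mdl(alternating, fit, ends)
```

On this toy data, trivially predictable labels compress far better (mdl_easy is well below the 32-bit uniform baseline, while mdl_hard sits at it), matching the area-under-the-loss-curve intuition: the faster the probe converges, the fewer bits are spent.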

3. EXPERIMENTS WITH SYNTHETIC DATA

Since it is often difficult to fully decouple the target feature from competing spurious features in practice, we first use synthetic data in order to test our hypothesis in a clean setting. We use a simple classifier with an embedding layer, a 1-layer LSTM, and an MLP with one hidden layer with tanh activation. We use a synthetic sentence classification task with k-length sequences of numbers as input and binary labels as output. We use a symbolic vocabulary V with the integers 0, …, |V| − 1, fixing k = 10 and |V| = 50K. We begin with an initial training set of 200K examples, evenly split between examples from S both and S neither. Then, varied across runs, we manipulate the evidence against the spurious feature (i.e., the s-only rate) by replacing a percentage p of the initial data with examples from S s-only, for p ∈ {0%, 0.1%, 1%, 5%, 10%, 20%, 50%}. Test and validation sets consist of 1,000 examples each from S both, S neither, S t-only, and S s-only. In all experiments, we set the spurious feature s to be the presence of the symbol 2. We consider several different target features t (Table 1), intended to vary in their extractability.

Figure 2 shows model performance as a function of s-only rate for each of the four features described above. Here, performance is reported using error rate (lower is better) on each partition (S s-only, S t-only, S both, S neither) separately. We are primarily interested in whether the relative extractability of the target feature (compared to the spurious feature) predicts model performance. We indeed see a fairly clear relationship between the relative extractability (MDL(s)/MDL(t)) and model performance at every level of training evidence (s-only rate). For example, when t is no less extractable than s (i.e., contains-1), the model achieves zero error at an s-only rate of 0.001, meaning it learns that t alone predicts the label despite having only a handful of examples that support this inference.
In contrast, when t is harder to extract than s (e.g., first-last), the model fails to make this inference, even when a large portion of training examples provide evidence supporting it.
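The data manipulation above can be sketched as follows. This is an illustrative reconstruction under our own naming, with t fixed to the contains-1 feature; the other target features of Table 1 would replace the `1 in seq` check.

```python
import random

def make_example(region, k=10, vocab=50_000, rng=random):
    """Sample one k-length sequence from the given partition of S.

    Filler symbols are drawn from 3..vocab-1 so that the symbols 1 (target)
    and 2 (spurious) appear only where the region requires them.
    """
    seq = [rng.randrange(3, vocab) for _ in range(k)]  # 'neither' by default
    if region in ("both", "t-only"):
        seq[rng.randrange(k)] = 1
    if region in ("both", "s-only"):
        positions = [i for i, v in enumerate(seq) if v != 1]
        seq[rng.choice(positions)] = 2
    label = 1 if 1 in seq else 0  # label is determined by t alone
    return seq, label

def make_training_set(n, s_only_rate, rng=random):
    """n examples: an even both/neither split, with a fraction replaced by s-only."""
    n_sonly = int(n * s_only_rate)
    half = (n - n_sonly) // 2
    data = ([make_example("both", rng=rng) for _ in range(half)]
            + [make_example("neither", rng=rng) for _ in range(half)]
            + [make_example("s-only", rng=rng) for _ in range(n_sonly)])
    rng.shuffle(data)
    return data
```

Every injected s-only example carries label 0 while containing the symbol 2, so raising `s_only_rate` directly raises the statistical evidence against the spurious feature.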

4. EXPERIMENTS WITH NATURALISTIC DATA

We investigate whether the same trend holds for language models fine-tuned on naturalistic data, e.g., grammar-generated English sentences. To do this, we fine-tune models for the linguistic acceptability task, a simple sequence classification task as defined in Warstadt & Bowman (2019), in which the goal is to differentiate grammatical sentences from ungrammatical ones. We focus on acceptability judgments since formal linguistic theory guides how we define the target features, and recent work in computational linguistics shows that neural language models can be sensitive to spurious features in this task (Marvin & Linzen, 2018; Warstadt et al., 2020a).

Figure 2: A model that has learned to use the target feature alone to predict the label will achieve zero error across all partitions. s-only and t-only error reach 0 quickly when t is as easy to extract as s (i.e., the relative extractability is 1). However, when t is harder to extract than s (rel. extractability < 1), performance lags until evidence from s-only examples is quite strong.

4.1. DATA

We design a series of simple natural language grammars that generate a variety of feature pairs (s, t), which we expect will exhibit different levels of relative extractability (MDL(s)/MDL(t)). We focus on three syntactic phenomena (described below). In each case, we consider the target feature t to be whether a given instance of the phenomenon obeys the expected syntactic rules. We then introduce several spurious features s which we deliberately correlate with the positive label during fine-tuning.

The Subject-Verb Agreement (SVA) construction requires detecting whether the verb agrees in number with its subject, e.g., "the girls are playing" is acceptable while "the girls is playing" is not. In general, recognizing agreement requires some representation of hierarchical syntax, since subjects may be separated from their verbs by arbitrarily long clauses. We introduce four spurious features: 1) lexical, grammatical sentences begin with specific lexical items (e.g., "often"); 2) length, grammatical sentences are longer; 3) recent-noun, verbs in grammatical sentences agree with the immediately preceding noun (in addition to their subject); and 4) plural, verbs in grammatical sentences are preceded by singular nouns as opposed to plural ones.

The Negative Polarity Items (NPI) construction requires detecting whether a negative polarity item (e.g., "any", "ever") is grammatical in a given context, e.g., "no girl ever played" is acceptable while "a girl ever played" is not. In general, NPIs are only licensed in contexts that fall within the scope of a downward entailing operator (such as negation). We again consider four types of spurious features: 1) lexical, in which grammatical sentences always include one of a set of lexical items ("no" and "not"); 2) length (as above); 3) plural, in which each noun in a grammatical sentence is singular, as opposed to plural; and 4) tense, in which grammatical sentences are in present tense.

The Filler-Gap Dependencies (GAP) construction requires detecting whether a sentence containing a gap is grammatical. Some verbs (e.g., "recognize") require a direct object; however, in the right syntactic contexts (i.e., when in the correct syntactic relation with a wh-word), the object position can be empty, creating what is known as a "gap". E.g., "I know what you recognized __" is acceptable while "I know that you recognized __" is not. For our GAP tasks, we again consider four spurious features (lexical, length, plural, and tense), defined similarly to above.

The templates above (and slight variants) result in 20 distinct fine-tuning datasets, over which we perform our analyses (see Appendix for details). Table 2 shows several examples. For the purposes of this paper, we are interested only in the relative extractability of t vs. s given the pre-trained representation; we do not intend to make general claims about the linguistic phenomena per se. Thus, we do not focus on the details of the features themselves, but rather consider each template as generating one data point, i.e., an (s, t) pair representing a particular level of relative extractability.
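As a toy illustration of how a spurious feature is correlated with the positive label, the following sketch pairs a minimal SVA template with the lexical feature ("grammatical sentences begin with 'often'"). The word lists and template are our own stand-ins, not the paper's actual grammars.

```python
import random

# Illustrative vocabulary for a minimal SVA template (our own stand-ins).
SUBJECTS = {"sg": ["the girl", "the boy"], "pl": ["the girls", "the boys"]}
VERBS = {"sg": "is", "pl": "are"}

def sva_example(grammatical, spurious, rng=random):
    """Generate one (sentence, label) pair for the SVA + lexical template.

    `grammatical` controls the target feature t (subject-verb agreement);
    `spurious` controls the lexical feature s (sentence begins with "often").
    """
    num = rng.choice(["sg", "pl"])
    subj = rng.choice(SUBJECTS[num])
    wrong = "pl" if num == "sg" else "sg"
    verb = VERBS[num] if grammatical else VERBS[wrong]
    prefix = "often " if spurious else ""
    return f"{prefix}{subj} {verb} playing", int(grammatical)
```

Here `sva_example(True, True)` yields a both example and `sva_example(False, True)` an s-only example, i.e., the spurious prefix attached to an agreement violation — exactly the examples whose rate is manipulated during fine-tuning.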

4.2. SETUP

We evaluate T5, BERT, RoBERTa, GPT-2, and an LSTM with GloVe embeddings (Raffel et al., 2020; Devlin et al., 2019; Liu et al., 2019b; Radford et al., 2019; Pennington et al., 2014). Both T5 and BERT learn to perform well over the whole test set, whereas the GloVe model struggles with many of the tasks. We expect that this is because contextualized pre-training encodes certain syntactic features which let the models better leverage small training sets (Warstadt & Bowman, 2020). Again, we begin with an initial training set of 2,000 examples, evenly split between both and neither, and then introduce s-only examples at rates of 0%, 0.1%, 1%, 5%, 10%, 20%, and 50%, using three random seeds each.

4.3. RESULTS

For each (s, t) feature pair, we plot the use of the spurious feature (s-only error) as a function of the evidence against the spurious feature seen in training (the s-only example rate). We expect to see the same trend we observed in our synthetic data, i.e., the more extractable the target feature t is relative to the spurious feature s, the less evidence the model will require before preferring t over s. To quantify this trend, we compute correlations between 1) the relative extractability of t compared to s and 2) the test F-score averaged across all rates and partitions of the data, capturing how readily the model uses (i.e., makes predictions consistent with the use of) the target feature.

Figure 3: Relative Extractability Correlates with Target Feature Use. In (a) we show the Spearman's ρ between the test F-score and measures of extractability of the (s, t) pairs; * indicates significance. Relative extractability, whether ratio (MDL(s)/MDL(t)) or difference (MDL(s) − MDL(t)), explains learning behavior better than absolute extractability of either feature.

Figure 3 shows these correlations and associated scatter plots. We can see that relative extractability is strongly correlated with average test F-score (Figure 3a), with high correlations for both BERT (ρ = 0.79) and T5 (ρ = 0.57). That is, the more extractable t is relative to s, the less evidence the model requires before preferring t, performing better across all partitions. This relationship holds regardless of whether relative extractability is computed using a ratio of MDL scores or an absolute difference. We also see that, in most cases, the relative extractability explains the model's behavior better than does the extractability of s or t alone. For GloVe there is little variation in model behavior: for most of the 11/20 pairs on which the model is able to learn the task, it requires an s-only example rate of 0.5.
Thus, the correlations are weak, but the qualitative results appear steady (Figure 8 in Appendix A), following the pattern that when s is easier to extract than t, more evidence is required to stop using s.

Figure 4 shows the performance curves for BERT and T5 (with the others in Appendix A), i.e., use of the spurious feature (s-only error) as a function of the evidence from s-only examples seen in training (the s-only example rate). Each line corresponds to a different (s, t) feature pair, and each data point is the test performance on a dataset with a given s-only example rate (which varies along the x-axis). For pairs with high MDL ratios (i.e., when t is actually easier to extract than s), the model learns to solve the task "the right way" even when the training data provides no incentive to do so: that is, in such cases, the model's decisions do not appear to depend on the spurious feature s even when s and the target feature t perfectly co-occur in the fine-tuning data. Figure 4 also shows that T5 (compared to BERT) requires more data to perform well. This may be because we fine-tuned T5 with a linear classification head, rather than the text-only output on which it was pre-trained. We made this decision 1) because we had trouble training T5 in the original manner, and 2) because using a linear classification head was consistent with the other model architectures.
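The correlation analysis itself is straightforward to reproduce. The sketch below computes Spearman's ρ from scratch; the input values are made-up placeholders rather than the paper's measurements, and the rank computation assumes no ties (tied values would require average ranks).

```python
def rank(values):
    """1-based ranks, assuming no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = float(r + 1)
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Placeholder values, not the paper's data: one point per (s, t) pair.
mdl_ratio = [0.4, 0.7, 1.1, 1.6]   # relative extractability MDL(s)/MDL(t)
mean_f = [0.55, 0.62, 0.80, 0.93]  # test F-score averaged over rates/partitions
rho = spearman_rho(mdl_ratio, mean_f)
```

Because ρ depends only on ranks, the same value results whether relative extractability is entered as a ratio or as a difference, provided the two orderings agree.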

5. DISCUSSION

Our experimental results provide support for our hypothesis: the relative extractability of features given an input representation (as measured by information-theoretic probing techniques) is predictive of the decisions a trained model will make in practice. In particular, we see evidence that models tend to use imperfect features that are more readily extractable over perfectly predictive features that are harder to extract. This insight is closely related to prior work which has shown, e.g., that neural networks learn "easy" examples before they learn "hard" examples (Mangalam & Prabhu, 2019). Our findings additionally connect probing techniques, which have received significant attention in NLP, to explanations of and predictions about state-of-the-art models' decisions in practice.

Fine-tuning may not uncover new features. The models are capable of learning both the s and t features in isolation, so our experiments show that when the relative extractability is highly skewed, one feature may hide the other: a fine-tuned model may not use the harder-to-extract feature. This suggests a pattern that seems intuitive but is in fact non-trivial: if one classifier does not pick up on a feature readily enough, another classifier (or, rather, the same classifier trained with different data) may not be sensitive to that feature at all. This has ramifications for how we view fine-tuning, which is generally considered beneficial because it allows models to learn new, task-relevant features.
Our findings suggest that if the needed feature is not already sufficiently extractable after pre-training, fine-tuning may not have the desired effect.

Probing classifiers can be viewed as measures of a pre-trained representation's inductive biases. Analysis with probing classifiers has primarily focused on whether important linguistic features can be decoded from representations at better-than-baseline rates, but there has been little insight into what it would mean for a representation's encoding of a feature to be "sufficient". Based on these experiments, we argue that a feature is "sufficiently" encoded if it is as available to the model as are surface features of the text. For example, if a fine-tuned model can access features about a word's semantic role as easily as it can access features about that word's lexical identity, the model may need little (or no) explicit training signal to prefer a decision rule based on the former, structural feature. The desire for models with such behavior motivates the development of architectures with explicit inductive biases (e.g., TreeRNNs). Evidence that similar generalization behavior can result from pre-trained representations has exciting implications for those interested in sample efficiency and cognitively-plausible language learning (Warstadt & Bowman, 2020; Linzen, 2020). We note that this work has not established that the relationship between extractability and feature use is causal. This could be explored using intermediate-task training (Pruksachatkun et al., 2020) to influence the extractability of features prior to fine-tuning for the target task; e.g., Merchant et al. (2020) suggest that fine-tuning on parsing might improve the extractability of syntactic features.

6. RELATED WORK

Significant prior work analyzes the representations and behavior of pre-trained LMs. Work using probing classifiers (Veldhoen et al., 2016; Adi et al., 2017; Conneau et al., 2018; Hupkes et al., 2018) suggests that such models capture a wide range of relevant linguistic phenomena (Hewitt & Manning, 2019; Bau et al., 2019; Dalvi et al., 2019; Tenney et al., 2019a;b). Similar techniques include attention maps/visualizations (Voita et al., 2019; Serrano & Smith, 2019) and relational similarity analyses (Chrupała & Alishahi, 2019). A parallel line of work uses challenge sets to understand model behavior in practice. Some works construct evaluation sets to analyze weaknesses in the decision procedures of neural NLP models (Ettinger et al., 2016; Linzen et al., 2016; Isabelle et al., 2017; Jia & Liang, 2017a;b; Glockner et al., 2018; Dasgupta et al., 2018; Gururangan et al., 2018; Poliak et al., 2018b; Elkahky et al., 2018; Naik et al., 2018; Goldberg, 2019, and others). Others use such datasets to improve models' handling of linguistic features (Min et al., 2020; Poliak et al., 2018a; Liu et al., 2019a), or to mitigate biases (Zmigrod et al., 2019; Zhao et al., 2018; 2019; Hall Maudslay et al., 2019; Lu et al., 2020). Nie et al. (2020) and Kaushik et al. (2020) explore augmenting training sets with human-in-the-loop methods. Our work is also related to work on the generalization of neural NLP models; research on adversarial examples (Iyyer et al., 2018; Alzantot et al., 2018; Hsieh et al., 2019; Jia et al., 2019; Ilyas et al., 2019; Madry et al., 2017; Athalye et al., 2018) is relevant as well, as it concerns the influence of dataset artifacts on models' decisions. A still larger body of work studies feature representation and generalization in neural networks outside of NLP. Mangalam & Prabhu (2019) show that neural networks learn "easy" examples (as defined by shallow machine learning model performance) before they learn "hard" examples.
Zhang et al. (2016) and Arpit et al. (2017) show that neural networks which are capable of memorizing noise nonetheless achieve good generalization performance, suggesting that such models may have an inherent preference for learning more general features. Finally, ongoing theoretical work characterizes the ability of over-parameterized networks to generalize in terms of complexity (Neyshabur et al., 2019) and implicit regularization (Blanc et al., 2020). Concurrent work (Warstadt et al., 2020b) also investigates the inductive biases of large pre-trained models.

7. CONCLUSION

This work bears on an open question in NLP, namely, how models' internal representations (as measured by probing classifiers) influence model behavior (as measured by challenge sets). We find that feature extractability can be viewed as an inductive bias: the more extractable a feature is after pre-training, the less statistical evidence is required for the model to adopt the feature during fine-tuning. Understanding the connection between these two measurement techniques can enable more principled evaluation of, and control over, neural NLP models.


A ADDITIONAL RESULTS

Figures 6, 7, 8, 9, and 10 show additional results for all models over all partitions (both accuracy, neither accuracy, and F-score). These charts appear at the end of the Appendix. Details on the MDL statistics are available in Table 3. For the transformer models, for 18/20 feature pairs, the models are able to solve all the spurious and target probing tasks in isolation. (They do solve the test set in all cases; it is that two of the spurious features ended up being very difficult for the models.)

D TEMPLATES FOR NATURALISTIC DATA

Each template corresponds to a combination of target features, grammars, and spurious features (the target and spurious features are discussed in Section 4.1). See Table 8 for a complete list of templates. See Table 10 for further details about the templates that are used for each of the target features. Complete details about the implementation of these templates (and all data) will be released upon acceptance.

E WHY EXACTLY IS IT HARD TO GENERATE T-ONLY EXAMPLES?

Target features may be unavoidably linked to spurious ones. For example, for a Negative Polarity Item (NPI) to be licensed (perhaps smoothing over some intricacies), the NPI ("any", "all", etc.) must appear in a downward-entailing context. These downward-entailing contexts are created by triggers, e.g., a negative word like "no" or "not", or a quantifier like "some". Linguists who study the problem have assembled a list of such triggers (see Hoeksema (2008)). Arguably, one cannot write down a correct example of NPI licensing that does not contain one of these memorizable triggers. Thus, we cannot train or test models on correct examples of NPI usage while simultaneously preventing them from having access to trigger-specific features.
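The point about triggers can be made concrete with a toy check: because every well-formed NPI sentence contains a trigger, a surface "trigger present?" feature is available whenever the structural licensing feature is. The trigger list below is a small illustrative subset, not the full inventory from Hoeksema (2008).

```python
# Toy illustration of why t-only NPI examples are hard to construct: a
# memorizable set of downward-entailing trigger words co-occurs with the
# licensing structure. This list is illustrative, not exhaustive.
NPI_TRIGGERS = {"no", "not", "nobody", "never", "few", "without", "doubt"}

def contains_trigger(sentence):
    """Check whether any downward-entailing trigger word appears."""
    tokens = sentence.lower().rstrip(".").split()
    return any(tok in NPI_TRIGGERS for tok in tokens)
```

Any licensed NPI use, e.g. "No student has ever failed.", necessarily contains a trigger, so no grammatical positive example can withhold the trigger feature from the model.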

Hyperparameters

Hyperparameter: Value
random seed: 1, 2, 3
batch size: 128
cumulative MDL block sizes (%): 0.1, 0.1, 0.2, 0.4, 0.8, 1.6, 3.05, 6.25, 12.5, 25
s-only rates (%): 0, 0.
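The cumulative MDL block sizes above define the transmission schedule for the online (prequential) codelength estimate. A minimal, self-contained sketch of that computation follows; a Laplace-smoothed label-frequency model stands in for the trained probe used in the paper, and the doubling schedule below is illustrative rather than the paper's exact block sizes.

```python
import math

def prequential_mdl(labels, block_ends, num_classes=2):
    """Online (prequential) codelength in bits: the first block is sent with
    a uniform code, and each subsequent block is coded by a model fit on all
    previously transmitted examples. Here the "model" is a Laplace-smoothed
    label-frequency estimate standing in for a trained probe."""
    n = len(labels)
    bounds = [max(1, round(f * n)) for f in block_ends]  # block end indices
    total_bits = bounds[0] * math.log2(num_classes)      # uniform code, block 1
    for start, end in zip(bounds, bounds[1:]):
        seen = labels[:start]
        counts = [seen.count(c) + 1 for c in range(num_classes)]  # Laplace
        probs = [c / sum(counts) for c in counts]
        total_bits += sum(-math.log2(probs[y]) for y in labels[start:end])
    return total_bits

# Illustrative doubling schedule (fractions of the data, ending at 100%).
SCHEDULE = [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.0]

easy = prequential_mdl([0] * 200, SCHEDULE)     # constant labels: cheap code
hard = prequential_mdl([0, 1] * 100, SCHEDULE)  # balanced labels: ~1 bit/ex.
```

An easily predictable label sequence yields a much smaller codelength than a balanced one, which is exactly why lower MDL corresponds to a more extractable feature.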



Without loss of generality, we define t in our datasets s.t. t(x) = y for all (x, y) ∈ S; this rules out the case where t outputs the opposite of the value of y.

We observe similar overall trends when using an alternative metric based on validation loss (Appendix A.3).

Note that our reported MDL is in some cases higher than that given by the uniform code (the number of sentences being encoded); the MDL is computed as a sum of the costs of transmitting successively larger blocks of data.

Note that all models are ultimately able to learn to detect t (achieve high test accuracy) on the both partition, but not on the t-only partition.

In pilot studies, we found that standard BOW- and CNN-based models were unable to solve the tasks.

This control does not impact results; see Appendix A.1.

See the Appendix for both error and neither error; both are stable and low in general.

Initially, we used a more complicated metric based on the s-only example rate required for the model to solve the test set. Both metrics report similar trends and correlations. For posterity, we include details in Appendix A.2.




Figure 1: We partition datasets into four sections, defined by the features (spurious and/or target) that hold. We sample training datasets D, which provide varying amounts of evidence against the spurious feature, in the form of s-only examples. In the illustration above, the s-only rate is 2/10 = 0.2, i.e., 20% of examples in D provide evidence that s alone should not be used to predict y.

Evidence from Spurious-Only Examples. We are interested in spurious features which are highly correlated with the target during training. Given a training sample D and features s and t, we define the s-only example rate as the evidence against the use of s as a predictor of y. Concretely, s-only rate = |D_s-only| / |D|, the proportion of training examples in which s occurs without t (and y = 0).
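The partitioning and the s-only rate defined above can be sketched as follows; the dictionary fields "s", "t", and "y" are a hypothetical encoding of whether each feature holds and what the label is.

```python
def partition(dataset):
    """Split examples into the four sections of Figure 1, keyed by which of
    the spurious (s) and target (t) features hold for each example."""
    parts = {"both": [], "t-only": [], "s-only": [], "neither": []}
    for ex in dataset:
        if ex["s"] and ex["t"]:
            parts["both"].append(ex)
        elif ex["t"]:
            parts["t-only"].append(ex)
        elif ex["s"]:
            parts["s-only"].append(ex)
        else:
            parts["neither"].append(ex)
    return parts

def s_only_rate(train_set):
    """|D_s-only| / |D|: the proportion of training examples in which s
    holds without t (and hence y = 0)."""
    return sum(1 for ex in train_set if ex["s"] and not ex["t"]) / len(train_set)

# The illustration in Figure 1: 2 s-only examples out of 10 -> rate 0.2.
D = ([{"s": 1, "t": 1, "y": 1}] * 4
     + [{"s": 1, "t": 0, "y": 0}] * 2
     + [{"s": 0, "t": 0, "y": 0}] * 4)
```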

Figure 4: Learning Curves for BERT & T5. Curves show use of the spurious feature (s-only accuracy) as a function of training evidence (s-only rate). Each line represents one (s, t) pair (described in §4.1). Pairs vary in the relative extractability of t vs. s (measured by the ratio MDL(s)/MDL(t) and summarized in the bar chart). When t is much harder to extract relative to s (lower ratios), the classifier requires much more statistical evidence during training (a higher s-only rate) in order to achieve low error. We find similar patterns for GPT2 and RoBERTa; see Appendix A for all the results.

Figure 11: Overfitting on the Synthetic Tasks.

Table 1 contains MDL metrics for each feature (computed on training sets of 200K examples, averaged over 3 random seeds). We see some gradation of feature extractability, but having more features with wider variation would help solidify our results.

Instantiations of the target feature t in our synthetic experiments. The spurious feature s is always the presence of the symbol 2. Features are intended to differ in how hard they are for an LSTM to detect given sequential input (measured by MDL per §2.2, reported in k-bits).

Examples of features used to generate fine-tuning sets with target/spurious features of varying extractability scores. Top examples show a case in which t and s both occur and the sentence is acceptable, and bottom examples show a case in which s occurs without t and the sentence is unacceptable. Only s is highlighted since t is often defined over the structure of the sentence (see text) and thus difficult to localize to a few tokens. Table 9 in the Appendix has neither examples.

Test and validation sets consist of 1000 examples each from S_both, S_neither, and S_s-only. In the natural language setting, it is often difficult to generate t-only examples, and thus we cannot compute the extractability of the target feature t by training a classifier to distinguish S_t-only from a random subset of S_neither, as we did in Section 3. Therefore, we estimate MDL by training a classifier to distinguish between examples from S_s-only and examples from S_both. Using the simulated data from Section 3, we confirm that both methods (S_s-only vs. S_both and S_t-only vs. S_neither) produce similar estimates of MDL(t) (see Appendix). Per model, we filter out feature pairs for which the model could not achieve at least 90% accuracy on each probing task in isolation.

Warstadt et al. (2020b) also investigate the inductive biases of large pre-trained models (RoBERTa); in particular, they ask when (i.e., at what amount of pre-training data) such models shift from a surface feature (what we call a spurious feature) to a linguistic feature (what we call a target feature). In our work, we focus on predicting which of these two biases characterizes the model (via relative MDL).

Summary of extractability (MDL in bits) for t and s for each template and each model.

Hyper and System Parameters. We use Hugging Face for the underlying model implementations.

We measure the target MDL directly on the toy data, where we can access the target feature. Recall that in natural conditions we cannot generate examples with the target feature (free of spurious features), so we use a dataset comprising both and s-only examples. In the simulated setting, our approach of measuring the target extractability indirectly (using both and s-only examples) reports results similar to those obtained when directly using t-only and neither examples. These values are in k-bits.

ACKNOWLEDGEMENTS

We would like to thank Michael Littman for helpful suggestions on how to better present our findings and Ian Tenney for insightful comments on a previous draft of this work. We also want to thank our reviewers for their detailed and helpful comments. This work is supported by DARPA under grant number HR00111990064. This research was conducted using computational resources and services at the Center for Computation and Visualization, Brown University.


During the reviews, we did not control for the cases where the model did not solve the probing task. These 2 extraneous points accentuate the line-plot curves, but do not change the character of the results (nor much change the correlations). In the paper, we now control for accuracy by filtering out these cases. With or without this control, accuracy alone provides no predictive power about the inductive biases. We present the correlations without filtering for these cases for consistency with the reviews (Table 4 above); nevertheless, we believe it is important to control for these cases, because they could have acted as giveaways where even accuracy might have worked.

A.2 ALTERNATE METRIC: s-RATE

We initially used a different metric when computing the correlations to summarize the line plots. Rather than using the average test performance, we looked at the evidence required for the model to solve the test set. Both metrics conceptually capture what we are interested in, but the new one (simply averaging test performance) is easier to understand and captures performance across all partitions. Here we report the correlations with this evidence-required metric instead, which we call the s-rate. Specifically, we define it as follows: the s-rate is the lowest s-only example rate at which the fine-tuned model achieves essentially perfect performance (F-score > 0.99) (see Figure 5a). Intuitively, the s-rate is the (observed) minimum amount of evidence from which the model can infer that t alone is predictive of the label. See Table 5b for the results.

There are two major parts to this project in terms of reproducibility: (1) the data and (2) the model implementations. We describe the templates for the data in Appendix D; the full details are in the project source. For the transformer models, we use Hugging Face for the implementations and access to the pre-trained embeddings (Wolf et al., 2020). We use PyTorch Lightning to organize the training code (Falcon, 2019). We fix all hyperparameters, which are reported in Table 6.

At face value, it seems that BERT requires much less data than T5 to capture our target features. However, we are wary of making such strong claims. Something to consider here (noted in Appendix B) is that for T5 we used a linear classification head rather than formatting the task in text (which is how T5 is trained). We made this decision (1) because we had trouble training T5 in this purely textual manner, and (2) because using a linear classification head over two classes is consistent with the other model architectures.
Again, GPT2 and RoBERTa performed on par with BERT, so the difference between the performance of BERT and T5 may be due to how we trained T5.
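The s-rate metric as defined above can be sketched in a few lines; the rate-to-F-score mapping passed in is a stand-in for a measured learning curve, not the paper's data.

```python
def s_rate(rate_to_fscore, threshold=0.99):
    """Lowest s-only example rate at which the fine-tuned model reaches
    essentially perfect test performance (F-score > threshold). Returns
    None if the model never crosses the threshold at any observed rate."""
    for rate, fscore in sorted(rate_to_fscore.items()):
        if fscore > threshold:
            return rate
    return None

# Hypothetical learning curve: the model first exceeds F = 0.99 at rate 0.10.
curve = {0.0: 0.70, 0.01: 0.95, 0.10: 0.995, 0.50: 1.0}
```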

C MEASURING EXTRACTABILITY INDIRECTLY

We measure the MDL for t with both and s-only examples. In the simulated setting we can compare this approach with measuring the MDL directly (t-only vs. neither). See Table 7 for MDL results. The ordering of the features' difficulty holds across the two methods.

NPI: "The N^p ever V." *The man who some lawyer hated ever traveled.
NPI ✗: "The N^p-ever V." *The boy who ever smiled shouted.
SVA: "The N^x_1 of the (N^i of the ..." The piano teacher of the lawyers wounds the handyman.
SVA ✗: "The N^x_1 of the (N^i of the ..." *The piano teacher of the lawyers wound the handyman. Whether or not a noun and verb agree is given by whether their superscripts match.
GAP: I know who he believed. *I know who he knew the lawyers that she believed.

Table 10: List of templates that are used in Section 4. These do not include the spurious features, which are discussed in Section 4.1. For NPI, each N^* represents a noun phrase: N^p-neg is valid after a negation (might contain a polarity item); N^p is valid after a determiner (cannot contain an unlicensed polarity item); N^p-ever is not valid after a determiner (contains an unlicensed polarity item). These phrases have complex nesting behavior and can become arbitrarily long. In addition, a sentence might consist of multiple independent clauses, each of which is given by one of these templates. For SVA, the base templates do not have the additional nouns in the starred parentheticals, while the nested templates have zero or more. For GAP, the harder set of templates includes ISL (island) examples as additional s-only examples that force the model not to violate (one specific type of) island constraint. For complete details and lexicons see the source.

Similar to the NPI example, it is not possible (to our knowledge) to construct target-only examples for filler-gap, since the construction requires a wh-word and a syntactic gap; thus, we cannot create a positively labeled, grammatical sentence that exhibits a filler-gap dependency without these elements. In summary, target-only examples may add new spurious features (as with NPI), or be impossible to construct because the presence of the target feature implies the presence of the spurious feature (as with filler-gaps). Still, our setup permits the MDL to be computed directly with target-only examples, and so, in cases where it is feasible to create target-only examples (e.g., the Subject-Verb Agreement templates), it would have bolstered our argument to do so.
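The nested SVA templates can be illustrated with a toy generator. The mini-lexicon and function below are hypothetical; the real templates (see the source) support arbitrary nesting, and the paper's "superscript" notation corresponds here to a number feature that the subject noun and verb must share.

```python
import random

# Hypothetical mini-lexicon: the superscript x in the template maps here to
# a number feature ("sg"/"pl") shared by a noun and its matching verb form.
NOUNS = {"sg": ["piano teacher", "lawyer"], "pl": ["piano teachers", "lawyers"]}
VERBS = {"sg": ["wounds"], "pl": ["wound"]}

def sva_example(subj_num, verb_num, distractor_num, rng):
    """One (sentence, acceptable) pair from the nested SVA template: the
    sentence is acceptable iff the subject and verb numbers agree; the
    distractor noun between them is irrelevant to the label."""
    subj = rng.choice(NOUNS[subj_num])
    distractor = rng.choice(NOUNS[distractor_num])
    verb = rng.choice(VERBS[verb_num])
    sentence = f"The {subj} of the {distractor} {verb} the handyman."
    return sentence, subj_num == verb_num

rng = random.Random(0)
good = sva_example("sg", "sg", "pl", rng)  # acceptable: numbers agree
bad = sva_example("sg", "pl", "pl", rng)   # unacceptable: number mismatch
```

The intervening distractor noun is exactly what makes an agreement decision require structure rather than linear proximity.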

F MDL ISSUES: OVERFITTING IN THE SYNTHETIC EXPERIMENTS

We found that the MDL exceeds the uniform code length in some of the synthetic experiments. This occurs because the model overfits on the small early block sizes. See Figure 11.

