DOES DEEP LEARNING LEARN TO ABSTRACT? A SYSTEMATIC PROBING FRAMEWORK

Abstract

Abstraction is a desirable capability for deep learning models: it means inducing abstract concepts from concrete instances and flexibly applying them beyond the learning context. At the same time, there is a lack of clear understanding about both the presence and further characteristics of this capability in deep learning models. In this paper, we introduce a systematic probing framework to explore the abstraction capability of deep learning models from a transferability perspective. A set of controlled experiments is conducted based on this framework, providing strong evidence that two probed pre-trained language models (PLMs), T5 and GPT2, have the abstraction capability. We also conduct in-depth analysis, shedding further light on this capability: (1) the whole training phase exhibits a "memorize-then-abstract" two-stage process; (2) the learned abstract concepts are gathered in a few middle-layer attention heads, rather than evenly distributed throughout the model; (3) the probed abstraction capabilities exhibit robustness against concept mutations, and are more robust to low-level/source-side mutations than high-level/target-side ones; (4) generic pre-training is critical to the emergence of abstraction capability, and PLMs exhibit better abstraction with larger model sizes and data scales.

1. INTRODUCTION

Whereas concrete concepts are typically concerned only with things in the world, abstract concepts are about internal events.
- Barsalou et al. (1999)

Abstraction means capturing the general patterns (often referred to as abstract concepts) efficiently in a specific learning context and reusing these patterns flexibly beyond the context (Mitchell, 2021; Kumar et al., 2022; Giunchiglia & Walsh, 1992; Hull, 1920). For instance, abstraction in language means recognizing the underlying syntax and semantics behind concrete sentences. It is thought to be one of the fundamental faculties in human cognition for effectively learning, understanding and robustly generalizing, and has been studied for a long time in cognitive psychology and behavioral sciences (Gentner & Medina, 1998; Barsalou et al., 1999; Shivhare & Kumar, 2016; Konidaris, 2019).

The abstraction capability is also critical for deep learning, but many previous studies suggested that the surprising success of deep learning may come from the memorization of some surface patterns (also called superficial correlations or shortcuts) (Geirhos et al., 2020; Du et al., 2022), such as some special tokens (Niven & Kao, 2020; Gururangan et al., 2018), overlapping contexts (Lai et al., 2021; Sen & Saffari, 2020), and familiar vocabularies (Aji et al., 2020). It is still unclear whether the models just memorize these patterns without abstractions, or whether they do learn abstract concepts (which are nevertheless overwhelmed by surface patterns when applied in a context similar to training). Therefore, this paper aims to take a step forward to probe the abstraction capability of deep learning models, keeping the effects of abstract concepts and surface patterns decoupled and controlled individually.
Our key idea is to probe the abstraction capability from a transferability perspective, since surface patterns are always bound to task-specific characteristics while abstract concepts can be more

Motivating Example. As shown in Figure 1, suppose we want to examine whether a model can learn the abstract rule (i.e., the symbolic mapping rule x 1 x 2 → X 1 X 2 , in which x i and X i are general variable slots) from task A, or just memorize surface maps (e.g., ab → AB, in which a and A are task-specific symbols). To reveal the different transferability of the two learning mechanisms, we utilize a probing task B that contains the same underlying abstract rule as task A but does not overlap with its symbol set. If the model could learn the abstract rule from task A, it would reuse it to interpret the new context, thus effectively solving task B. If not, memorizing some surface maps that are bound to task-specific symbols would be less effective for solving task B.

Motivated by this example, we design a systematic framework for probing abstraction capability. This framework considers a set of probing tasks along with three procedures of experiments based on the transfer learning paradigm. The use of abstract concepts and task-specific characteristics in probing tasks is separately controlled. To probe the abstraction capability of language models, this work mainly considers grammar as the abstract concept. The grammar of a formal language is a set of hidden rules behind concrete sentences and determines how terminals are combined into syntactically valid sentences. We want to explore whether the model can be aware of the grammar, or simply memorizes some specific word combinations. We instantiate our framework as a grammar probe that is constructed from the designed formal grammar and terminal sets.
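The contrast in the motivating example can be made concrete with a toy sketch. Everything below is an illustrative assumption rather than the paper's setup: the symbol sets, the two hand-written "learners", and the helper names are hypothetical, and a real probe would use a neural model instead of a lookup table or a hard-coded rule.

```python
# Toy illustration of the motivating example; all names and symbols are hypothetical.

def make_task(lower, upper):
    """All two-symbol examples of a task whose surface map is lower[i] -> upper[i]."""
    mapping = dict(zip(lower, upper))
    return {x1 + x2: mapping[x1] + mapping[x2] for x1 in lower for x2 in lower}

def infer_symbol_map(examples):
    """Read off per-symbol correspondences from aligned (source, target) pairs."""
    symbol_map = {}
    for source, target in examples.items():
        for s, t in zip(source, target):
            symbol_map[s] = t
    return symbol_map

def accuracy(predict, task):
    """Fraction of task examples for which the prediction is exactly right."""
    return sum(predict(src) == tgt for src, tgt in task.items()) / len(task)

task_a = make_task("abcd", "ABCD")  # task A symbols
task_b = make_task("wxyz", "WXYZ")  # task B symbols, disjoint from task A
few_b = {"wx": "WX", "yz": "YZ"}    # small transfer set for task B

# Surface learner: memorizes whole input-output pairs from A and the few B examples.
lookup_table = {**task_a, **few_b}

def memorizer(src):
    return lookup_table.get(src)

# Abstract learner: keeps the slot-wise rule x1 x2 -> X1 X2 and only rebinds symbols.
symbol_map_b = infer_symbol_map(few_b)

def rule_user(src):
    return "".join(symbol_map_b[ch] for ch in src)

memorizer_acc = accuracy(memorizer, task_b)
rule_acc = accuracy(rule_user, task_b)
print(memorizer_acc, rule_acc)
```

In this sketch the memorizer only solves the two task-B pairs it has seen, while the rule-based learner solves all sixteen, which is the kind of gap the probing framework is designed to detect.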
The probing results show strong evidence that two probed PLMs (specifically, T5-Base (Raffel et al., 2020) and GPT2-Medium (Radford et al., 2019) ) have the abstraction capability to learn abstract concepts from concrete instances, rather than just simply memorizing surface patterns. After probing the existence of abstraction capability, we further explore the following questions. RQ1: What is the characteristic of the training dynamics on learning abstract concepts? RQ2: How are these learned abstract concepts distributed in the model? Concentrated in a few modules or evenly distributed in whole model? RQ3: How robust is the abstraction capability on tasks with mutated abstract concepts? RQ4: How would generic pre-training and general factors influence abstraction? Here we outline some interesting findings from our in-depth investigations: (1) the training phase exhibits a "memorize-then-abstract" two-stage process; (2) the abstract concepts learned in our probes are gathered in a few middle-layer heads; (3) abstraction capability is more robust to source-side/low-level mutations than to target-side/high-level ones; (4) generic pre-training is critical to the emergence of abstraction capability, and larger model size and data scale are beneficial. Contributions 1) We propose a systematic probing framework for abstraction capability, guiding the design of controlled tasks and procedures from a transferability perspective. 2) We instantiate this framework with concrete tasks and show strong evidence that two probed PLMs have the abstraction capability. 3) We further analyze this capability and provide insightful conclusions by investigating the above research questions. Our code and data are publicly available at https://github.com/microsoft/ContextualSP/tree/master/abstraction_probing.

2. RELATED WORK

Probing deep learning models. To explore whether deep learning models have certain capabilities, there has been much work examining these black-box models in some specially designed settings, called probes (Petroni et al., 2019; Tenney et al., 2018; Warstadt et al., 2019; Lin et al., 2019; Hewitt & Manning, 2019; Vulić et al., 2020) . The key challenge in designing probes is to exclude superficial correlations. That is, the performance of the model in the probing setting should be highly correlated with the capability to be probed rather than other influencing factors. For instance, to probe whether the model encodes some knowledge/information in the representation rather than just over-fit the data, a standard approach is to freeze the model parameters (Petroni et al., 2019; Tenney et al., 2018) ; to probe whether the model have compositionality rather than just memorize the label distribution, previous work injected statistical bias into the data splits (Lake & Baroni, 2018; Keysers et al., 2019; Kim & Linzen, 2020) . In this work, to explore whether models have abstraction capability rather than just memorize surface patterns, we leverage the transferability of abstract concepts, which has been considered as one essential aspect of abstraction (Mitchell, 2021; Kumar et al., 2022) and explored from a cognitive science perspective on neural networks (Dienes et al., 1999; Geiger et al., 2022) . Abstraction capability. Abstraction has been studied for a long term in cognitive psychology and behavioral sciences (Hull, 1920; Gentner & Medina, 1998; Barsalou et al., 1999; Burgoon et al., 2013; Wang, 2015; Shivhare & Kumar, 2016; Lake et al., 2017; Daniel, 2017; Konidaris, 2019) and has attracted attention in the artificial intelligence field (Giunchiglia & Walsh, 1992; Richardson et al., 2020; Clark et al., 2020; Talmor et al., 2020; Mitchell, 2021; Zadrozny, 2021; Millhouse et al., 2021; Kumar et al., 2022) . 
The abstraction capability of DNN models has been explored in many tasks such as visual reasoning (Johnson et al., 2017; Barrett et al., 2018; Chollet, 2019; Kumar et al., 2022), grounded language understanding (Ruis et al., 2020), and game playing (Tsividis et al., 2021). As our work focuses on language models, another closely related topic is compositional generalization (Lake & Baroni, 2018; Keysers et al., 2019; Kim & Linzen, 2020), which explored whether neural models could learn high-level grammars from specially designed training examples and apply the learned grammars through compositions. These works concluded that general-purpose neural models (such as LSTM and Transformer) could not learn the full grammar with biased observations and demonstrated the importance of symbolic mechanisms for abstraction (Liu et al., 2020; Chen et al., 2020; Liu et al., 2021a). Some other previous work also explored the abstraction of language models in their specially designed tasks (Chollet, 2019; Mitchell, 2021; Zadrozny, 2021). Most previous explorations of DNN abstraction capabilities did not explicitly avoid or check the influence of task-specific characteristics, thus leaving the risk that the model may perform well because of surface patterns over-fitted to task-specific designs (e.g., patterns in candidate answers (Zhang et al., 2019)) rather than abstract concepts. Some implicit strategies have been leveraged to alleviate such potential influence in indirect ways: some previous work used biased task-specific designs in training and test data separately (Kim & Linzen, 2020; Barrett et al., 2018); some attempted to fix the observed problems in existing probes on an ad hoc basis (Hu et al., 2021; Benny et al., 2021); some injected great task diversity, which implicitly increases the difficulty of learning practical shortcuts (Chollet, 2019).
In this work, rather than implicitly alleviating these potential risks, we explicitly check whether there is performance leakage from surface patterns by leveraging the transferability of abstraction capability and comparing performance among a set of controlled experiments.

3. PROBING FRAMEWORK

As mentioned in Section 1, abstraction is the capability to induce abstract concepts from concrete instances in a certain learning context and flexibly generalize these concepts beyond the context. A key difference between a surface pattern and an abstract concept is their cross-task transferability, as the former is always bound to some task-specific characteristics (e.g., a certain vocabulary) while the latter is transferable across tasks. We define this property as follows.

Property: Transferability of Abstract Concepts. Consider two machine learning tasks A and B that do not share any common instances between their task-specific characteristics spaces, but have essentially the same set of abstract concepts behind them. The transferability of abstract concepts means that learning A can help better learn B.

Based on this property, we can verify the learning of abstract concepts by checking whether the transferability is exhibited (i.e., assessing ∆(A ⇒ B)). In the following, we design a framework for probing the learning of abstract concepts in a systematic manner and illustrate it in Figure 2.

Aiming

• This framework examines whether a probed model could learn abstract concepts $C_A$ from the aiming task A with a train set $\overline{A}$.

Task Design

• Probing task B, with the transfer set $\hat{B}$ and the test set $\overline{B}$, contains the abstract concepts $C_B$, which are required to be the same as $C_A$. The task-specific characteristics used to construct $\hat{B} \cup \overline{B}$ do not overlap with those of the train set $\overline{A}$. In addition, the examples in the transfer set $\hat{B}$ are restricted to contain insufficient information for the probed model to learn $C_B$ perfectly. Thus, the gain from the abstraction in task A would be noticeable.

Hypothesis and Expectations

• Hypothesis: the probed model can learn abstract concepts $C_A$ from the train set $\overline{A}$.
• Expectation 1: ∆(A ⇒ B) is significantly high, i.e., A ⇒ B brings considerable gain compared with ⇑ B.
• Expectation 2: ∆(C ⇒ B) is significantly lower than ∆(A ⇒ B) (or close to zero), i.e., Expectation 1 is highly correlated with the learning of abstract concepts rather than other factors.
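The two expectations can be checked mechanically once the three experiments have produced test scores. The sketch below is only one possible reading: it assumes that ∆ is the difference between a score and the fine-tuning-only score ⇑ B, and the thresholds `min_gain` and `max_contrast_ratio`, as well as the example scores, are hypothetical placeholders rather than values from the paper. A careful study would replace the thresholds with a proper significance test over repeated runs.

```python
def transfer_gain(score_after_pretraining, control_score):
    """Gain on task B over the fine-tuning-only control (assumed definition of Delta)."""
    return score_after_pretraining - control_score

def check_expectations(control, main, contrast, min_gain=10.0, max_contrast_ratio=0.5):
    """Compare the Main (A => B) and Contrast (C => B) runs against the Control run.

    min_gain and max_contrast_ratio are illustrative thresholds, not from the paper.
    """
    delta_main = transfer_gain(main, control)
    delta_contrast = transfer_gain(contrast, control)
    return {
        "delta_main": delta_main,
        "delta_contrast": delta_contrast,
        # Expectation 1: pre-training on A brings a considerable gain.
        "expectation_1": delta_main >= min_gain,
        # Expectation 2: the gain from C is much smaller than the gain from A.
        "expectation_2": delta_contrast <= max_contrast_ratio * delta_main,
    }

# Hypothetical accuracies (in percent) for the three procedures.
report = check_expectations(control=20.0, main=80.0, contrast=25.0)
print(report)
```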

4. PROBING ABSTRACTION CAPABILITY OF LANGUAGE MODELS

The abstract concept mainly considered in this work is grammar, a set of syntactic rules hidden behind concrete sentences that determine how terminals are combined into syntactically valid sentences. To design a grammar probe, we instantiate the framework with formal language translation (FLT) tasks. We assume that the generative grammars of the source and target languages contain the abstract concepts of FLT tasks, and that the surface patterns (e.g., familiar bigrams) are bound to task-specific terminal sets. We give a more specific definition of abstraction based on FLT tasks:

Definition: Consider an FLT task $T: L_s \to L_t$ that translates the source language $L_s$ (with grammar $G_s$ and terminals $S_s$) to the target language $L_t$ (with grammar $G_t$ and terminals $S_t$), and a set of concrete pairs $\overline{T} = \{(l_s^i \to l_t^i)\}^{k}$ in which $l_s^i$ and $l_t^i$ are sentences from $L_s$ and $L_t$ respectively. The abstraction capability is learning the map from $G_s$ to $G_t$ during training on $\overline{T}$, rather than simply memorizing terminal-specific patterns that are bound to $S_s$ and $S_t$.

Our FLT tasks are mainly derived from the synthetic semantic parsing task COGS (Kim & Linzen, 2020) and the Probabilistic Context-Free Grammar (PCFG) it used. We directly take the source grammar $G_s$ in COGS, which mimics English natural language grammar, and reconstruct the target grammar $G_t$ in COGS to be chain-structured (detailed in Appendix K.1). The map from $G_s$ to $G_t$ is a homomorphism (partly shown in Table 1). Terminals can be divided into three groups: the verbs $S_v$ in $G_s$ (and the PREDICATEs $S_P$ in $G_t$), the nouns $S_n$ (the ENTITYs $S_E$) and the conjunctions $S_c$ (the CONCATs $S_C$). The production rules can be categorized as T-Production rules (only containing terminals on the right side) and N-Production rules. We assign tasks A and B the same set of production rules but different terminals.
This means that tasks A and B share the same abstract concepts while having no overlap between their task-specific characteristic spaces. For constructing task C, we completely change the production rules of A while preserving the terminal sets; thus tasks A and C do not share abstract concepts but can have similar task-specific characteristics. We describe the instantiation of the different sets in detail as follows. Examples in these sets are given in Appendix F.1.

Train set $\overline{A}$. To generate examples in $\overline{A}$, we derive $G_s^{+}$ and $G_t^{+}$ by replacing the terminals in $G_s$ and $G_t$ one-to-one with new ones, leaving everything else unchanged. New terminals are sampled from the Wordlist Corpora in NLTK (Bird et al., 2009). Additionally, as the original $S_c$ (also $S_C$) only contains a single terminal, we add 31 additional terminals into the new $S_c$ (and $S_C$) to increase the diversity. The terminal diversity is further discussed in Appendix G.1.

Transfer set $\hat{B}$ and test set $\overline{B}$. We take the train set in COGS as the transfer set $\hat{B}$, and take the sentential complement (Com.) set and subject modification (Mod.) set as the test set $\overline{B}$ for two sub-probes. The transfer set $\hat{B}$ only contains examples with up to 2 recursions and object modifications, while the test set $\overline{B}$ contains up to 12 recursions and subject modifications. It has been shown that training on $\hat{B}$ is not enough for a DNN model to learn the full grammars of COGS for handling the test cases in $\overline{B}$ (Kim & Linzen, 2020).

Contrast set $\overline{C}$. Compared with $\overline{A}$, $\overline{C}$ is generated with the same source grammar $G_s^{+}$, but the target grammar is completely changed to $G_t^{-}$: for each rule of $G_t^{+}$, its right-side word order is reversed. Except for the generative grammar, all other factors are kept the same as for $\overline{A}$ when generating $\overline{C}$.
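The two grammar transformations described above (one-to-one terminal replacement for the train set, and right-side reversal for the contrast set) can be sketched on a toy grammar. The grammar, the replacement word list, and the function names below are hypothetical stand-ins: the actual probe uses the COGS-derived grammars and samples new terminals from the NLTK Wordlist Corpora, neither of which is reproduced here.

```python
import random

def terminals_of(grammar):
    """Symbols that appear on a right-hand side but never as a left-hand side."""
    return {sym for rhss in grammar.values() for rhs in rhss for sym in rhs
            if sym not in grammar}

def replace_terminals(grammar, new_words, seed=0):
    """Derive a G+ style grammar: same production rules, terminals swapped one-to-one."""
    old = sorted(terminals_of(grammar))
    rng = random.Random(seed)
    # Sampling without replacement keeps the replacement one-to-one.
    mapping = dict(zip(old, rng.sample(new_words, len(old))))
    return {lhs: [[mapping.get(sym, sym) for sym in rhs] for rhs in rhss]
            for lhs, rhss in grammar.items()}

def reverse_right_sides(grammar):
    """Derive a G- style grammar: the right-side order of every rule is reversed."""
    return {lhs: [list(reversed(rhs)) for rhs in rhss] for lhs, rhss in grammar.items()}

# Toy target grammar (hypothetical; not the grammar in Table 1).
toy_target = {
    "CLAUSE": [["PREDICATE", "AGENT", "THEME"]],
    "PREDICATE": [["EAT"], ["SEE"]],
    "AGENT": [["CAT"], ["DOG"]],
    "THEME": [["CAKE"]],
}
new_words = ["ZORB", "FLIM", "QUAX", "PLEN", "DRUV", "MOSK"]  # stand-in word list

target_plus = replace_terminals(toy_target, new_words)   # used for the train set
target_minus = reverse_right_sides(target_plus)           # used for the contrast set
print(target_plus["CLAUSE"], target_minus["CLAUSE"])
```

Note that reversing a T-Production rule with a single terminal on the right side leaves it unchanged, so in this sketch only the multi-symbol rules differ between the two derived grammars.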

5. EXPERIMENTAL SETUP AND MAIN RESULTS

We probe two pre-trained language models: T5-Base and GPT2-Medium. Our experiments are based on the Huggingface Transformer models (Wolf et al., 2020) . For both (continue) pre-training and fine-tuning, we take Adam (Loshchilov & Hutter, 2018) with 1e-5 learning rate and 0.01 weight decay. Batch size is 8 and max training step is 100k. We generate 3 groups of new terminals, repeat the experiments on each group with 2 random seeds, and finally average 6 results. The early-stopping strategy is applied to avoid catastrophic forgetting. Detailed settings are listed in Appendix K. Table 2 shows the main results of our probe. For both two sub-probes, the performances of two probed models are in line with two Expectations set in Section 3. First, the results of ⇑ B are very low, and A ⇒ B can bring significant improvement, which is in line with Expectation 1. Second, the results of C ⇒ B are much lower than A ⇒ B (and are even just comparable with ⇑ B), which is in line with Expectation 2. As two expectations are experimentally examined, we can draw a preliminary conclusion: our probing results provide strong evidence that two probed PLMs have the abstraction capability to learn abstract concepts from concrete instances rather than just memorize surface patterns, and to transfer the learned abstract concepts beyond specific tasks.

6. ANALYSIS

Based on our designed probe and results above, we further analyze the abstraction capability of PLMs to answer the RQs mentioned in Section 1. All experiments below are derived from Com. sub-probe, and are mainly conducted with T5-Base model except that are explicitly mentioned.

6.1. LEARNING PROCESS OF ABSTRACT CONCEPTS

To investigate the learning process of abstract concepts, we save checkpoints for every 1,000 steps during training on A. Each checkpoint is further fine-tuned on B and tested on B. For comparison, we also investigate the process of memorizing surface patterns by directly examining each checkpoint on the held-out dev set in task A. Figure 3 shows the performance curves of two learning processes. The training phase exhibits a "memorize-then-abstract" two-stage process. As shown in Figure 3 , there is an obvious phase difference (48k training steps) between two time points that two learning processes achieve their 90% relative performance, respectively. Such a phase difference means that when the model has already performed well on task A in an early training phase, the learning of desired abstract concepts is still on-going. In other words, the in-task performance in an early training phase comes mainly from the effects of some task-specific surface patterns rather than general abstract concepts. With extending the training phase, the abstract concepts can be further extracted/enhanced. This phase difference also suggests that the pre-training process should be continued even if the model has already achieved a good in-pre-training performance. The learning of abstract concepts is accelerated after in-task examples are well learned. After the model reaches 90% in-task relative performance (i.e. right side of the red dashed line), the learning curve of abstract concepts (i.e., the blue curve) rises more rapidly. The learning of abstract concepts is not stable in the early training phase. The curve of in-task performance is much smoother than the cross-task one. This suggests that the learning and transfer of abstract concepts is not stable. Nevertheless, the large fluctuations occur mainly in the early phases of training. With increasing training steps, this instability gradually decreases.
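The phase difference between the two curves can be computed from checkpoint scores as sketched below. This is a hedged illustration: it assumes that "90% relative performance" means 90% of the best score reached on each curve, and the step counts and scores are made-up values, not the data behind Figure 3.

```python
def first_step_reaching(steps, scores, fraction=0.9):
    """First checkpoint step whose score reaches `fraction` of the curve's best score."""
    target = fraction * max(scores)
    for step, score in zip(steps, scores):
        if score >= target:
            return step
    return None

# Hypothetical checkpoint curves (accuracy in percent).
steps = [0, 10_000, 20_000, 30_000, 40_000, 50_000, 60_000]
in_task = [5.0, 92.0, 97.0, 98.0, 99.0, 99.0, 99.0]        # dev set of task A
cross_task = [2.0, 10.0, 25.0, 20.0, 45.0, 72.0, 78.0]     # fine-tuned, then tested on task B

memorize_step = first_step_reaching(steps, in_task)
abstract_step = first_step_reaching(steps, cross_task)
phase_difference = abstract_step - memorize_step
print(memorize_step, abstract_step, phase_difference)
```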

6.2. ABSTRACT ATTENTION HEADS

To investigate how the learned abstract concepts are distributed in the model, we first conduct preliminary experiments by separately freezing parameters in each layer and sub-layer during fine-tuning, and find that the parameters in attention sub-layers play important roles (detailed in Appendix E). To further determine the contribution of each attention head, we measure the performance degradation after excluding the effect of each head. Specifically, we evaluate the change in perplexity (PPL) of examples in the test set $\overline{B}$ after pruning the normalized weight of each head as follows,

$$\Delta_{\theta,\overline{B}}(h) = \frac{1}{|\overline{B}|} \sum_{i} \left[ \mathrm{PPL}(l_t^i \mid l_s^i; \theta_{-h}) - \mathrm{PPL}(l_t^i \mid l_s^i; \theta) \right],$$

in which $h$ represents a certain head, $\theta$ is the full set of parameters in the PLM after fine-tuning on the transfer set $\hat{B}$, $\theta_{-h}$ means pruning the head $h$, and $(l_s^i, l_t^i)$ is the $i$-th input-output pair in $\overline{B}$. Note that a higher PPL means a lower performance. Considering that some heads may store the task-specific knowledge learned from the fine-tuning data $\hat{B}$, pruning these heads may also lead to performance changes. Therefore, we also evaluate a baseline PPL change $\Delta_{\theta,\hat{B}}(h)$ on the fine-tuned examples in $\hat{B}$ and measure the difference in PPL changes, $\mathrm{DPC}(h) = \Delta_{\theta,\overline{B}}(h) - \Delta_{\theta,\hat{B}}(h)$. The DPC of each head is shown in Figure 4.

Abstract concepts are largely contained in a few heads, not evenly distributed in all heads. Note that there are 432 attention heads in total in T5-Base. Figure 4 shows that among hundreds of heads, only a dozen of them are highly correlated with storing abstract concepts in our probe.

These abstract attention heads are gathered in the middle layers of T5. A larger index in Figure 4 means that the corresponding head is farther from the input side and closer to the output side. It shows that the abstract attention heads (i.e., heads with high DPC) are mainly located in the middle layers of the T5-Base model, i.e., the last encoder layers and first decoder layers.

We further explore whether abstract concepts are modularized in the model.
A module is a part of parameters that can individually perform a specific target functionality (Csordás et al., 2020) . To investigate modularity, we take the method of freezing certain parameters during fine-tuning to examine whether the update of these parameters can be independent. We consider the top 36 heads with the highest DPC (which contain some redundant heads) as abstract heads. For comparison, we separately experiment with freezing 36 random heads. Table 3 shows that freezing abstract heads takes effect while freezing random heads does not. We further explore the modularity in Appendix E. 
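The DPC measure and the selection of the highest-DPC heads can be sketched as follows, given per-example perplexities that have already been computed with and without each head. The head names and perplexity values are hypothetical, and the sketch deliberately leaves out how a head is pruned and how PPL is obtained from the model.

```python
from statistics import mean

def mean_ppl_change(ppl_pruned, ppl_full):
    """Average per-example PPL increase caused by pruning one head."""
    return mean(pruned - full for pruned, full in zip(ppl_pruned, ppl_full))

def dpc(test_full, test_pruned, transfer_full, transfer_pruned):
    """Difference in PPL changes: change on the test set minus change on the transfer set."""
    return (mean_ppl_change(test_pruned, test_full)
            - mean_ppl_change(transfer_pruned, transfer_full))

def top_heads(dpc_by_head, k):
    """The k heads with the highest DPC."""
    return sorted(dpc_by_head, key=dpc_by_head.get, reverse=True)[:k]

# Hypothetical per-example perplexities of the fine-tuned model without pruning.
test_full = [2.0, 3.0]
transfer_full = [1.5, 1.5]

# Hypothetical per-example perplexities after pruning each head.
pruned = {
    "enc11.h3": {"test": [6.0, 9.0], "transfer": [2.0, 2.0]},
    "enc0.h1": {"test": [2.5, 3.5], "transfer": [2.0, 2.0]},
}

dpc_by_head = {
    head: dpc(test_full, ppl["test"], transfer_full, ppl["transfer"])
    for head, ppl in pruned.items()
}
abstract_heads = top_heads(dpc_by_head, k=1)
print(dpc_by_head, abstract_heads)
```

Subtracting the transfer-set change is what separates heads that matter for the test set specifically from heads whose removal hurts the model everywhere: in the toy numbers, both heads cost the same on the transfer set, but only one of them is costly on the test set.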

6.3. ROBUSTNESS OF ABSTRACTION CAPABILITY

We explore the robustness of the probed abstraction capability when the abstract concepts in our designed probes are mutated. Different from the contrast task C in which the target grammar is totally changed, here we consider partially injecting mutations into source/target-side grammar. According to the formal grammar in Table 1 , we consider injecting mutations at different abstract levels: changing T-Production rules can be regarded as a low-level mutation, since only terminals will be influenced and the whole sentence structure is kept; changing non-iterative N-Production rules can be regarded as a mid-level mutation, since the local structure will be mutated but the whole recursive structure is preserved; changing iterative N-Production rules can be regarded as a high-level mutation, since the whole recursive structure will be reconstructed. Based on the grammar used in formal language task A, we design three derivations G The local word order in a sentence is reversed. Specifically, we reverse the right-side word orders of the N-Production rules, except for the rule in the last row of Table 1 which is an iterative one. It means that the order of CLAUSEs (determined by the last rule) remains the same, while the terminals in each CLAUSE are locally reversed.

Nested $G_t^{*}$ (high-level mutation): It is obtained by changing the iterative rule (i.e., the last rule in Table 1) from chain-structured to nested. The new N-Production rule is "CLAUSE ↠ PREDICATE ( AGENT, CONCAT CLAUSE )".

We can also construct $G_s^{*}$ from $G_s^{+}$ with the same technique except for the coarse one, as the source language must contain enough information to generate targets. Thus, we design a Redundant $G_s^{*}$ which contains redundant terminals that are not mapped into targets (detailed in Appendix F.3). We separately change the source and target grammars to derivations and show results in Figure 5.

PLMs can exhibit robustness against mutations in abstract concepts. Results of these derivations with mutations are higher than the Control Exp (and Contrast Exp), indicating that even though the learned abstract concepts only partially match those in downstream tasks, the abstraction capability of PLMs can still leverage the similar parts in the two sets of mutated abstract concepts.

Abstraction capability is more robust to low-level mutations. Among the three kinds of derivations, the low-level mutated ones (Coarse $G_t^{*}$ and Redundant $G_s^{*}$) perform best, while the high-level mutated ones (Nested $G_t^{*}$ and $G_s^{*}$) perform worst. This trend implies that the robustness of the abstraction capability decreases as the mutation level of the abstract concept rises. This also suggests that matching of high-level abstract concepts should be prioritized when selecting pre-training tasks.

Abstraction capability is more robust to source-side mutations. Comparing the results in Figure 5a and 5b, source-side mutations have less effect on downstream performance than target-side ones, indicating that PLMs can more robustly reuse source-side abstract concepts.

Redundant information barely affects abstraction.
Surprisingly, the performance of Redundant $G_s^{*}$ is nearly the same as that of the original $G_s^{+}$, which means that injecting redundant information into inputs hardly affects the learning of abstract concepts. It indicates that the abstraction capability of PLMs can naturally exclude the influence of irrelevant information.

Fuzzy abstract concepts can also be learned and transferred. Compared with the formal grammar discussed above, which can be concretely defined, fuzzy grammar (such as natural language grammar) is freer. To explore how abstraction capability performs on fuzzy grammar, we take natural language sentences for experiments and design different sets by mimicking the Com. sub-probe. Detailed designs are described in Appendix H. We report BLEU scores in Table 4. They show that the performance of PLMs on learning fuzzy grammar is also in line with our two expectations.

6.4. GENERAL FACTORS & GENERIC PRE-TRAINING

As explored in previous work, there are some general factors that influence the performance of DNN models (Bommasani et al., 2021; Wei et al., 2022; Henighan et al., 2020), such as model size and data scale. We investigate how these general factors and generic pre-training influence the learning of abstract concepts. More results and analysis can be found in Appendix G.1.

PLMs exhibit better abstraction with larger model sizes. Figure 6a and 6b show that for both T5 and GPT2 architectures, larger pre-trained language models have better abstraction capability than smaller ones, as the gains from the Control Exp to the Main Exp become greater as model size increases.

Larger data scale in pre-training helps better exhibit abstraction. Figure 6c shows T5-Base performance with different scales of the train set A. Performance increases rapidly from ∼300 to ∼3.4K (with ∼50% absolute accuracy improvement) and improves marginally (and unstably) from ∼3.4K to ∼680K (with ∼5% absolute accuracy improvement). Overall, performance trends upward as the data scale increases, indicating that larger data scales benefit abstraction.

Generic pre-training is critical for the emergence of abstraction. We probe the abstraction capability of randomly initialized T5-Base and GPT2-Medium (i.e., without loading pre-trained checkpoints) and report the results in Table 5. The poor performance on A ⇒ B reveals that without generic pre-training, these deep learning models can hardly extract transferable abstract concepts from task A, even though they can still achieve >98% dev set performance on task A by fitting some task-specific surface patterns. The comparison of the results in Table 2 and Table 5 demonstrates that broader background pre-training is critical for the emergence of abstraction capability.

7. CONCLUSIONS

In this paper, we introduce a systematic probing framework from a transferability perspective to guide the design of probes for abstraction capability. We instantiate this framework as a grammar probe and show strong evidence that two probed PLMs have the abstraction capability. We further analyze this probed capability by investigating several in-depth questions and provide insightful conclusions.

ETHICS STATEMENT

A sufficiently robust abstraction capability that can perfectly extract abstract concepts and exclude concrete information in any situation will help deep learning models avoid many potential risks of ethical issues such as social bias and privacy breaches. However, as investigated in this work, the abstraction capability of some commonly used deep learning models may be fragile and can be affected by their training situation. This suggests that the abstraction capabilities of these models are still not reliable enough to naturally avoid these potential ethical issues, and calls for future work to explore ways to strengthen the robustness of the abstraction capabilities of deep learning models.

This is the Appendix of the paper Does Deep Learning Learn to Abstract? A Systematic Probing Framework.

A DISCUSSIONS

Below are more discussions about our work. Potential factors that may hinder our probing. We consider the main factor that could block the use of our probing framework is the catastrophic forgetting problem in deep learning (Goodfellow et al., 2013; Kemker et al., 2018) . Since our probing framework relies on the transferability property of abstract concepts, if catastrophic forgetting dominates the learning of downstream tasks, such transferability will hardly take effect and the probing results will fail to reveal the abstraction capabilities. Considering this problem, we utilize the early-stopping strategy (detailed in Appendix) to alleviate catastrophic forgetting. Moreover, our tested pre-trained models are naturally more robust to catastrophic forgetting (Ramasesh et al., 2021) . Better understanding "why does transfer learning work". Recent success of transfer learning shows that pre-training (or continue pre-training) with similar source tasks can help better solve downstream target task (e.g., question answering (Khashabi et al., 2020; Liu et al., 2021b) , face verification (Cao et al., 2013) , and general NLU tasks (Pruksachatkun et al., 2020) ). Some previous work in cross-lingual transfer learning empirically observed that the model can transfer some knowledge beyond vocabulary (Artetxe et al., 2020; Ri & Tsuruoka, 2022) , but they did not consider to exclude the influence from other potential factors. Our results can serve as stronger evidence for the reason to the success of transfer learning, that in addition to transferring some surface patterns, the better target performance can also benefit from similar abstract concepts learned from source tasks. Limitations and future work. The main limitations in this work are 1) we do not quantify the abstraction capability and 2) we only test two large pre-trained models. We leave these two points to our future work. Another future direction is to further explore the mechanisms behind abstractions.

B COMPARISONS WITH PREVIOUS FINDINGS ABOUT LEARNING DYNAMICS

Comparison with information bottleneck. Shwartz-Ziv & Tishby (2017) found a two-phase learning process from the view of information flow in deep neural networks: an empirical error minimization phase followed by a representation compression phase. This process differs from the memorize-then-abstract process because the two measure training dynamics from quite different perspectives. The former focuses on the compression of representations (and the reduction of mutual information), while the latter portrays the learning of abstract concepts. The analogy between the two may lie in the fact that extracting abstract concepts from concrete instances has, in some way, the effect of information compression.

Comparison with grokking. Power et al. (2022) reveal that the improvement in generalization (on the validation set) can happen well past the point of over-fitting (on the train set). Both the 'grokking' and 'memorize-then-abstract' phenomena indicate that some general patterns are always learned in a later training stage. The difference is that 'grokking' focuses on generalization beyond over-fitting the training data, while 'memorize-then-abstract' portrays the transfer of abstract concepts beyond task-specific characteristics.

C PRINCIPLES FOR DESIGNING PROBING TASKS

To verify whether the model can learn abstract concepts from task A by assessing ∆(A ⇒ B), we propose the following principles for designing task B:

1) The space of task-specific characteristics of B should be very different from that of A, so that memorizing surface patterns in A does not help with B.
2) The abstract concepts of B should be the same as those of A, so that abstraction on A is reflected in better performance on B.
3) The fine-tuning-only approach ⇑ B should not be enough to learn task B perfectly; otherwise, the gain from abstraction on A would not be noticeable.

Furthermore, to verify that the performance gain on B comes from abstraction on A rather than from other factors, we consider a contrast task C of A:

4) The abstract concepts of C should be very different from those of A (and B), while other latent factors, such as data scale and pre-training steps, are kept the same.

The use of a contrast task is similar to selectivity (Hewitt & Liang, 2019).
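The comparison behind these principles can be sketched numerically. The snippet below is a minimal illustration, assuming that ∆(X ⇒ B) denotes the gain of the procedure X ⇒ B over the fine-tuning-only baseline ⇑ B. This definition, the function name, and all scores are assumptions made for illustration; none of the numbers come from our experiments.

```python
def delta(score_transfer: float, score_finetune_only: float) -> float:
    """Gain of X => B over the fine-tuning-only baseline (assumed definition)."""
    return score_transfer - score_finetune_only


# Hypothetical exact-match accuracies (%), for illustration only.
score_up_b = 20.0     # fine-tune on B only
score_a_to_b = 85.0   # pre-train on A, then fine-tune on B
score_c_to_b = 22.0   # pre-train on contrast task C, then fine-tune on B

delta_a = delta(score_a_to_b, score_up_b)
delta_c = delta(score_c_to_b, score_up_b)

# A large gain from A together with a small gain from C is the pattern
# that principles 1) to 4) are designed to make interpretable.
print(f"delta(A => B) = {delta_a:.1f}, delta(C => B) = {delta_c:.1f}")
```

Under principle 3), the baseline score must leave headroom; if the fine-tuning-only score were already near 100%, neither gain could be distinguished from zero.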

D AN OPERATION PROBE

The operation semantics (e.g., conjunction) in Boolean algebra can be regarded as transition functions from given Boolean variables to the corresponding outputs. We want to examine whether the model can learn the meaning of operations from concrete logical expressions, or whether it just learns superficial correlations from specific sketches in expressions.

In the operation probe, we instantiate the framework with logical expression evaluation (LEE) tasks. We consider the operation semantics in logical expressions as abstract concepts, while surface patterns (e.g., local string matching) are bound to operation sketches. Figure 7a shows two kinds of sketches: the chain sketch and the tree sketch. A model trained on the chain sketch may learn the meaning of operations (e.g., the conjunction of two Boolean variables) or simply memorize some head or tail patterns of strings (e.g., if the head of the input string is "False AND ( ", the output is always "False"). Learning operation semantics helps more generally in understanding expressions with other sketches, whereas memorizing head or tail patterns in chain sketches is unhelpful or even harmful for understanding tree sketches, since these surface patterns can lead to wrong results under a different sketch.

We give a more specific definition of abstraction based on LEE tasks:

Definition 2: Consider an LEE task $L: E_s \to B_t$ in which the source logical expressions $E_s$ (with operation semantics $P_s$ and sketch $K_s$) are evaluated as Boolean values $B_t$, and a set of input-output pairs $L = \{(e_s^i \to b_t^i)\}^k$ in which $e_s^i$ is a logical expression sampled from $E_s$ and $b_t^i \in \{\mathrm{True}, \mathrm{False}\}$ is the evaluation result of $e_s^i$. The abstraction capability is learning the meanings of the operations $P_s$ during training on $L$, rather than memorizing sketch-specific patterns that are bound to $K_s$.

Our LEE tasks probe the learning of four logical operations: $P_s^+$ = {Conjunction (Conj.), Disjunction (Disc.), Alternative Denial (Alt.), Joint Denial (Joi.)}. Figure 7b illustrates the transition functions of these operations. For data generation, each operation in $P_s^+$ is consistently aligned with one operator (i.e., a concrete string) in $S_o$. Examples of these sets are given in Appendix F.2.

Train set A. We synthesize the data in A with $P_s^+$ and the chain sketch. Each expression contains eight operators sampled from $S_o$.

Transfer set B and Test set B. We synthesize the data in both sets with $P_s^+$ and the tree sketch. Each expression contains eight operators sampled from $S_o$. When probing a certain operation $p_s \in P_s^+$, we require that the expressions in the transfer set B do not contain $p_s$, while each expression in the test set B must contain $p_s$. So that the model becomes familiar with the operators in $S_o$ and does not forget them during further fine-tuning, we supplement the transfer set B with 100 two-operator expressions that cover the full $S_o$. Empirically, as DNN models are easily influenced by statistical bias in label distributions, we balance the 'True' and 'False' examples during sampling.

Contrast set C. The operators and sketch in C are kept the same as in A, but each operator in $S_o$ is aligned with an operation from another set of logical operations, $P_s^-$ = {Material Implication, Converse Implication, Material Non-implication, Converse Non-implication}. The transition functions of these operations are listed in Appendix F.2.

The results of our operation probe are shown in Table 6.
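To make the operation semantics concrete, the following minimal sketch spells out the standard truth-functional definitions of the four probed operations and the four contrast operations, and evaluates nested expressions with them. The nested-tuple representation, the dictionary keys, and the two example expressions are assumptions for illustration only; they are not the operator strings in S o or the data format used in our probe.

```python
from typing import Callable, Dict, Tuple, Union

BoolOp = Callable[[bool, bool], bool]

# Probed operations (standard truth-functional definitions).
PROBED_OPS: Dict[str, BoolOp] = {
    "conjunction": lambda a, b: a and b,
    "disjunction": lambda a, b: a or b,
    "alternative_denial": lambda a, b: not (a and b),  # NAND
    "joint_denial": lambda a, b: not (a or b),         # NOR
}

# Contrast operations (standard truth-functional definitions).
CONTRAST_OPS: Dict[str, BoolOp] = {
    "material_implication": lambda a, b: (not a) or b,
    "converse_implication": lambda a, b: a or (not b),
    "material_nonimplication": lambda a, b: a and (not b),
    "converse_nonimplication": lambda a, b: (not a) and b,
}

# Hypothetical representation: a Boolean leaf or a (left, op_name, right) triple.
Expr = Union[bool, Tuple["Expr", str, "Expr"]]


def evaluate(expr: Expr, ops: Dict[str, BoolOp]) -> bool:
    """Recursively evaluate a nested expression with the given operations."""
    if isinstance(expr, bool):
        return expr
    left, name, right = expr
    return ops[name](evaluate(left, ops), evaluate(right, ops))


def truth_table(op: BoolOp) -> Dict[Tuple[bool, bool], bool]:
    """Transition function of a binary operation over all Boolean inputs."""
    return {(a, b): op(a, b) for a in (False, True) for b in (False, True)}


# Chain-like nesting: the right operand is itself an expression.
chain = (False, "conjunction", (True, "disjunction", False))
# Tree-like nesting: both operands are expressions.
tree = ((True, "disjunction", False), "conjunction", (False, "joint_denial", False))

print(evaluate(chain, PROBED_OPS), evaluate(tree, PROBED_OPS))
print(truth_table(PROBED_OPS["alternative_denial"]))
```

In the chain example, a leading `False` under conjunction fixes the result regardless of the rest of the expression, which is exactly the kind of head pattern a model could memorize without learning what conjunction means; the tree example has no such shortcut.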

E ABSTRACT CONCEPTS ARE MODULARIZED IN PLMS

To supplement our analysis of abstract attention heads, we provide here our detailed explorations to identify the modules in PLMs that store abstract concepts in our probes. Our explorations are based on the following property and assumption.

Property: Forgetting of Abstract Concepts. Consider the Main Exp A ⇒ B, in which tasks A and B share abstract concepts but do not share task-specific knowledge. After full fine-tuning on task B, the model's parameters will to some extent over-fit the task-specific knowledge in task B, and the abstract concepts stored in these parameters will be partially forgotten.

Assumption: Identifying the Modularization of Abstract Concepts by Freezing Parameters. If a module in the model individually stores part of the abstract concepts, its parameters can be directly reused in new tasks without further fine-tuning. Furthermore, given the property above, freezing this abstract module avoids forgetting the abstract concepts, resulting in a performance improvement in the Main Exp A ⇒ B.

Based on this assumption, to identify whether abstract concepts are modularized in some parameters, we partially freeze the model in a coarse-to-fine manner. The following experiments are conducted on one of the three pre-training terminal sets.

First, we freeze each layer in the model separately, as shown in Figure 8. We find that the last layer of the encoder and the first layer of the decoder modularize part of the abstract concepts. Furthermore, we freeze these two layers in our Contrast Exp C ⇒ B and find no improvement (Figure 9a), indicating that the improvement from freezing these two layers comes from keeping abstract concepts.

Next, we try to identify whether the abstract concepts are stored in attention layers or FF layers. We separately freeze the attention sub-layer and the FF sub-layer in the last encoder layer and the first decoder layer. Figure 9b shows that attention layers take more responsibility for storing abstract concepts.
Then, we analyze whether these abstract concepts are modularized in some attention heads or spread evenly over whole attention layers. Our investigation in Section 6.2 finds that the abstract concepts are concentrated in some middle-layer attention heads. Based on the results in Figure 4, we freeze the top 36 heads to further verify that they are responsible for storing abstract concepts. The results in Table 7 indicate that abstract concepts are modularized in this small set of attention heads.

[Figure 8 plot. x-axis: Frozen Layer, with ticks E0 to E11 and D0 to D11.]

11 shows examples in Conj. sub-probe in operation probe. Figure 10 illustrates the transitions of operations used in contrast task in operation probe.

F.3 REDUNDANT DESIGNS

For Redundant G * s , we supplement the second T-Production rule in Table 1 as "sub / direct-obj / indirect-obj ↠ S adj S n ". Terminals in S adj can be regarded as adjectives for S n . These terminals

G MORE EXPERIMENTAL RESULTS

We present additional experimental results to supplement our probing and analysis.

G.1 DATA SCALE AND TERMINAL DIVERSITY

Figure 11a shows the effect of data scale for different model sizes. Performance improves only marginally and unstably as the data scale increases from 1.7K to 680K instances. Moreover, the performance gap between models of different sizes still appears considerable even when the data scale is large enough.

Figure 11b shows the effect of data diversity for different model sizes. Here, we treat terminal diversity, i.e., the number of terminals in the grammar, as one aspect of data diversity. Following Section 4, we only change the number of terminals in S c and S C , increasing it from 1 to 128. The overall trend is that performance improves marginally and unstably as diversity increases. Interestingly, for all three models, performance peaks before reaching 128 terminals and then keeps oscillating. We speculate that performance is bounded by the limited data scale, as we fix the data scale at 34K instances while increasing terminal diversity. To examine this speculation, we conduct another experiment in which T5-Base is pre-trained on 680K instances with 128 terminals, achieving an average accuracy of 93.5%. This is higher than the result after pre-training on 680K instances with 32 terminals (89.2%) and higher than the best average accuracy of T5-Base in Figure 11b (88.4%), suggesting that higher data diversity should be paired with a larger data scale.

We take natural language sentences for experiments and design different sets by mimicking the Com. sub-probe in our grammar probe. Data in our fuzzy grammar probe is taken from Europarl v7 (2005), a large parallel corpus for machine translation.
When probing on natural language data, we cannot guarantee that the requirements of our framework are perfectly satisfied, as the grammar of natural language is not as controllable as that of a formal language. We describe the instantiation of the different sets as follows.

Train set A. We take German-to-French (De-Fr) sentence pairs as A.

Transfer set B and Test set B. We take English-to-Romanian (En-Ro) as the probing task B. As German and English both belong to the West Germanic language branch, while French and Romanian both belong to the Italic branch, the abstract grammars used in De-Fr and En-Ro have some similarities. To ensure that ⇑ B performs poorly, we limit the transfer set B to short sentences (15.7/13.8 words per En/Ro sentence on average) and the test set B to long sentences (78.0/74.4 words per En/Ro sentence on average). This means that the model can learn most of the lexicon from the transfer set B but cannot become aware of the grammar of long sentences.

Contrast set C. Mimicking the construction of C in the formal language task, we also reverse the word order in the target language of A.

Table 4 shows the performance of the two models on natural language data. These results indicate that fuzzy grammar in natural language data can also be learned and transferred by the two PLMs. In addition, as this setting can also be regarded as a length generalization problem, the low ∆(C ⇒ B) further confirms that our probing results benefit from learning abstract concepts rather than surface patterns (i.e., length distribution).
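The two data manipulations described above, reversing the target word order for the contrast set and separating short from long sentence pairs for the probing task, can be sketched as follows. This is only an illustration under stated assumptions: whitespace tokenization, the function names, the length thresholds, and the placeholder sentences are all hypothetical and are not taken from our data pipeline.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def reverse_target(pair: Pair) -> Pair:
    """Contrast-set style example: reverse the word order of the target side."""
    source, target = pair
    return source, " ".join(reversed(target.split()))


def split_by_length(
    pairs: List[Pair], max_transfer_len: int, min_test_len: int
) -> Tuple[List[Pair], List[Pair]]:
    """Keep short pairs for the transfer set and long pairs for the test set.

    Length is measured in whitespace tokens on the source side (an assumption).
    """
    transfer = [p for p in pairs if len(p[0].split()) <= max_transfer_len]
    test = [p for p in pairs if len(p[0].split()) >= min_test_len]
    return transfer, test


# Placeholder sentences; real data would be parallel sentence pairs.
pairs = [
    ("s1 s2 s3", "t1 t2 t3"),
    (" ".join(["s"] * 70), " ".join(["t"] * 68)),
]

# Hypothetical thresholds, chosen only to separate the two toy pairs.
transfer_set, test_set = split_by_length(pairs, max_transfer_len=20, min_test_len=60)
print(len(transfer_set), len(test_set))
print(reverse_target(pairs[0]))
```

Pairs whose length falls between the two thresholds are dropped by this sketch, which keeps the short and long sets clearly separated.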

I TRY TO MEASURE ABSTRACTION CAPABILITY

As mentioned in Section A, one limitation of our probing is the lack of a metric that quantitatively measures the abstraction capability. Thus, we cannot compare the abstraction capability of two models with different architectures. Here, we try to design such a metric to compare the abstraction capability of T5-Base, with ∼220M parameters, and GPT2-Medium, with 345M parameters. We first want to clarify that this metric is only a preliminary exploration, as it is based on a strong assumption that cannot be satisfied in all situations.

Assumption: We assume that the performance score on a certain task (such as accuracy or BLEU score) linearly reflects the ability of the model to solve this task. This means that, for instance, improving accuracy from 90% to 100% is not harder than improving it from 40% to 50%.

This assumption clearly does not suit all tasks and performance scores (even the tasks and scores in our probing), but it does not influence the comparison between T5-Base and GPT2-Medium. The reason will be discussed later.

Intuitively, we consider the contribution of abstract concepts to the overall performance as a measure of abstraction capability, that is,

$$\mathrm{MoA} = \frac{\mathrm{score}_a}{\mathrm{score}_f}. \tag{2}$$

For the full performance score in the denominator of Equation 2, we evaluate the model performance on the test set B after (only) fine-tuning on the full set B, which is sampled from the same distribution as the test set B (rather than the limited distribution of the transfer set B). We denote this procedure as ⇑ B_full. Thus, with score a given by Equation 3, the metric in Equation 2 can be formalized as:

$$\mathrm{MoA} = \frac{\mathrm{score}(A \Rightarrow B) - \max[\mathrm{score}(\Uparrow B), \mathrm{score}(C \Rightarrow B)]}{\mathrm{score}(\Uparrow B_{\mathrm{full}})}.$$

Table 16 shows MoA for the two models on the grammar probe and the fuzzy grammar probe, and lists the scores required to calculate MoA. On each task, the MoA of T5-Base is higher than that of GPT2-Medium. Furthermore, when calculating MoA, the baseline score max[score(⇑ B), score(C ⇒ B)] of T5-Base is always higher than that of GPT2-Medium. As it is harder for a model to improve accuracy and BLEU scores on these tasks from a relatively higher baseline, MoA can only underestimate the abstraction capability of T5-Base. Therefore, we can roughly conclude that the abstraction capability of T5-Base is higher than that of GPT2-Medium.
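As a worked illustration of the metric, the sketch below computes MoA from four scores: the score after A ⇒ B, the fine-tuning-only score, the contrast score after C ⇒ B, and the full performance score. The function and argument names are hypothetical, and the numbers are invented for illustration; they are not the values reported in Table 16.

```python
def metric_of_abstraction(
    score_main: float,
    score_finetune_only: float,
    score_contrast: float,
    score_full: float,
) -> float:
    """MoA = (score(A => B) - max[score(up B), score(C => B)]) / full score."""
    if score_full <= 0:
        raise ValueError("the full performance score must be positive")
    # Part of the score attributed to abstract concepts (numerator).
    score_a = score_main - max(score_finetune_only, score_contrast)
    return score_a / score_full


# Hypothetical scores (%), for illustration only.
moa = metric_of_abstraction(
    score_main=80.0,
    score_finetune_only=20.0,
    score_contrast=30.0,
    score_full=100.0,
)
print(f"MoA = {moa:.2f}")
```

Taking the maximum of the two baselines is the conservative choice: whichever of fine-tuning only and contrast pre-training performs better is subtracted, so the numerator credits abstraction only with the gain that neither baseline explains.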

J COMPARISON WITH PREVIOUS NEGATIVE RESULTS

Some previous work demonstrated that neural models could not learn abstract concepts (Liu et al., 2020; Chen et al., 2020; Liu et al., 2021a; Chollet, 2019; Mitchell, 2021; Zadrozny, 2021). Our probing results suggest that neural models, especially PLMs, exhibit abstraction capability to some extent. Compared with previous work, two points could lead to the different conclusions.

The first point is the probing methodology. In all works (including ours), the basic idea of probing abstraction is to separate it from memorization. To implement this idea, previous work has almost always designed a special probing task in which memorizing the train set does not help solve the test set. However, such an implementation constrains the generation of the train set, which could introduce biases or limitations into the training data. To overcome these biases or limitations, the model needs abilities beyond abstraction, such as reasoning and systematic generalizability. Therefore, the previous disappointing results may have been caused by a lack of these other abilities rather than a lack of abstraction.

The second point is the tested model. Some previous work probed the vanilla Transformer or LSTM, while we probe pre-trained language models. We suppose that a model may acquire better abstraction capability from the pre-training corpus, and can better exhibit this ability with larger model sizes.

K DETAILS OF EXPERIMENTS

K.1 DATA

We show more details about the sets described in Section 4, including data scales, average input lengths and average output lengths. For the target side grammar of our formal language tasks, we mentioned in Section 4 that we change the original target grammar of COGS to be chain-structured. In Table 18, we list some examples with the original target grammar and the new chain-structured grammar. First, to distinguish the input and output tokens, we capitalize all output tokens (e.g., from "rose" to "ROSE"). Second, we replace the variables (e.g., "x _ 1") in the original grammar with its corresponding terminals (e.g., "ROSE"). Then, we group the terminals of AGENT (e.g., "DOG"), THEME (e.g., "ROSE") and RECIPIENT with their corresponding terminal of PREDICATE (e.g., "HELP") and combine this group of terminals in a function format, i.e., "PREDICATE ( AGENT, THEME, RECIPIENT )". If the predicate is not equipped with an agent, theme or recipient in the original grammar, the corresponding new non-terminals (i.e., AGENT, THEME and RECIPIENT, respectively) in the function format above will be filled with the terminal NONE (e.g., "HELP ( DOG, ROSE, NONE )"). For simplicity, we omitted NONE in Table 1, Table 8, and Table 12. Such a function format is the minimum unit of a CLAUSE. Finally, each CLAUSE is concatenated with another CLAUSE by the terminal CCOMP (e.g., "HOPE ( LIAM, NONE, NONE ) CCOMP PREFER ( DOG, NONE, NONE )").
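The function format can be illustrated with a short formatting sketch. It starts from role fillers that have already been extracted for each predicate and does not parse the original COGS logical forms; the function names and the dictionary-based input are assumptions for illustration, not our conversion script.

```python
from typing import Dict, List, Optional, Tuple

# Argument order of the function format: PREDICATE ( AGENT, THEME, RECIPIENT ).
ROLES = ("agent", "theme", "recipient")

Clause = Tuple[str, Dict[str, Optional[str]]]  # (predicate, role fillers)


def format_clause(predicate: str, fillers: Dict[str, Optional[str]]) -> str:
    """Capitalize all tokens and fill missing roles with the terminal NONE."""
    args = [(fillers.get(role) or "NONE").upper() for role in ROLES]
    return f"{predicate.upper()} ( {', '.join(args)} )"


def format_chain(clauses: List[Clause]) -> str:
    """Concatenate consecutive clauses with the terminal CCOMP."""
    return " CCOMP ".join(format_clause(pred, fillers) for pred, fillers in clauses)


print(format_clause("help", {"agent": "dog", "theme": "rose"}))
print(format_chain([("hope", {"agent": "Liam"}), ("prefer", {"agent": "dog"})]))
```

The two calls reproduce the examples quoted in the text, "HELP ( DOG, ROSE, NONE )" and "HOPE ( LIAM, NONE, NONE ) CCOMP PREFER ( DOG, NONE, NONE )".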

K.2 PROCEDURE

Training. Each pre-training run takes 100,000 steps, and the final-step checkpoint is used for fine-tuning. Each fine-tuning run takes 100,000 steps, and a checkpoint is saved every 10,000 steps.

Evaluation. We adopt an early-stopping strategy in our evaluation to avoid catastrophic forgetting. First, each checkpoint saved during fine-tuning is evaluated on the held-out dev set. We then choose the first checkpoint that achieves the best dev score for testing. For formal language tasks, we use a constrained decoding strategy in which the model can only generate words in the vocabulary.
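The checkpoint selection rule, choosing the first checkpoint that achieves the best dev score, can be sketched as below. The function name and the dev scores are hypothetical; the sketch only assumes that checkpoints are listed in the order they were saved and that a higher score is better.

```python
from typing import List, Tuple


def select_checkpoint(dev_scores: List[Tuple[int, float]]) -> int:
    """Return the step of the first checkpoint with the best dev score.

    dev_scores holds (step, score) pairs in saving order.
    """
    if not dev_scores:
        raise ValueError("no checkpoints to select from")
    best_step, best_score = dev_scores[0]
    for step, score in dev_scores[1:]:
        # Strict comparison keeps the earliest checkpoint on ties.
        if score > best_score:
            best_step, best_score = step, score
    return best_step


# Hypothetical dev scores for checkpoints saved every 10,000 steps.
scores = [(10_000, 0.52), (20_000, 0.81), (30_000, 0.81), (40_000, 0.74)]
print(select_checkpoint(scores))
```

Preferring the earliest of several equally good checkpoints keeps fine-tuning as short as possible, which matches the motivation of limiting catastrophic forgetting.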

Compute and Resources

We mainly use Tesla-V100-16GB GPUs for training and evaluation, except for the experiments on T5-Large or GPT2-Large, which require Tesla-V100-32GB GPUs. On average, one pre-training run takes ∼15 GPU hours, one fine-tuning run takes ∼15 GPU hours (including saving checkpoints), and one test run takes ∼2 GPU hours (as test cases are very long).

K.3 HYPERPARAMETERS

Hyperparameters used for training and testing are listed in (5)

K.5 RESULTS

We list the detailed results that are plotted in the figures (i.e., Figure 5 and Figure 6), including the average scores, minimum scores, maximum scores, and standard deviations for all replicate experiments.

Table 18: Examples with the original grammar and the new chain-structured grammar.

| Original Target Grammar | Chain-Structured Target Grammar |
| --- | --- |
| rose ( x _ 1 ) AND help . theme ( x _ 3 , x _ 1 ) AND help . agent ( x _ 3 , x _ 6 ) AND dog ( x _ 6 ) | HELP ( DOG, ROSE, NONE ) |
| * captain ( x _ 1 ) ; eat . agent ( x _ 2 , x _ 1 ) | EAT ( CAPTAIN, NONE, NONE ) |
| * dog ( x _ 4 ) ; hope . agent ( x _ 1 , Liam ) AND hope . ccomp ( x _ 1 , x _ 5 ) AND prefer . agent ( x _ 5 , x _ 4 ) | HOPE ( LIAM, NONE, NONE ) CCOMP PREFER ( DOG, NONE, NONE ) |

(Column groups: Main Exp and Control Exp; models: T5-Small, T5-Base, T5-Large, GPT2, GPT2-Medium, GPT2-Large.)

(c) T5-Small

| Data Scale (T5-Small) | 1.7K | 3.4K | 6.8K | 17K | 34K | 68K | 170K | 680K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Avg | 28.3 | 31.9 | 26.1 | 27.3 | 34.4 | 40.5 | 38.7 | 39.4 |
| Min | 27.4 | 11.5 | 16.1 | 21.2 | 20.7 | 19.7 | 23.5 | 29.7 |
| Max | 29.2 | 50.5 | 36.2 | 33.7 | 47.4 | 62.9 | 50.9 | 56.0 |
| Std | 0.9 | 15.6 | 9.2 | 4.1 | 9.8 | 19.0 | 9.9 | 9.5 |



We also probed other abstract concepts such as operation semantics in Appendix D. Some non-semantic terminals are kept the same, such as the period in Lsrc and parentheses in Ltgt. Some basic rules are preserved (e.g., the order of the preceding and following parentheses).



Figure 2: The illustration of the probing framework.

Figure 3: Two learning processes. Blue curves represent the learning performance of abstract concepts and red curves represent the learning performance of in-task examples.

Figure 4: DPC of each pruned head. The heads sorted from left to right are located from the first to the last layer in the model.

Figure 5: Performance with different derivations of (a) source and (b) target grammar.

Figure 6: Performance with different model sizes, data scales and data diversity.

Figure 7: Four operations and two sketches in our operation probe. (a) shows the chain sketch used in task A and C, and the tree sketch used in B ∪ B. Each 'OP' represents one operation. (b) shows the transition results of four operations with different left and right Boolean variables.

Figure 8: Freeze 24 layers separately in A ⇒ B.

Figure 9: Further explore the modularity in the middle two layers.

Figure 10: Transitions of operations used in contrast task in operation probe.

Figure 11: More results for different data scales and data diversity.

in which the MoA means the Metric of Abstraction, $\mathrm{score}_f$ means the full performance score on a certain task without limiting the training data, and $\mathrm{score}_a$ means the part of the score contributed by the abstract concepts. Following our probing framework, we consider the $\mathrm{score}_a$ as the relative gain from ⇑ B to A ⇒ B. Furthermore, considering the influence of other factors which is reflected by C ⇒ B, we design the $\mathrm{score}_a$ as:

$$\mathrm{score}_a = \mathrm{score}(A \Rightarrow B) - \max[\mathrm{score}(\Uparrow B), \mathrm{score}(C \Rightarrow B)], \tag{3}$$

in which score() represents the performance score of a certain procedure, and max[score(⇑ B), score(C ⇒ B)] means to choose the maximum performance score between ⇑ B and C ⇒ B.

Contrast task C with the contrast set C aims to further confirm that the performance in task B is principally correlated with abstractions rather than other factors. The abstract concepts $C_C$ are constructed by greatly breaking (changing) $C_A$; thus, compared with task A, the abstraction on task C is less effective for solving task B. The task-specific characteristics and other latent factors in constructing C are kept the same as those in A.

Part of G s and G t . Rules in the last row are allowed to iterate up to 12 times.

The main results of our probe. ∆(A ⇒ B) and ∆(C ⇒ B) are in brackets. The evaluation metric is exact match accuracy (%). Com. and Mod. represent the sentential complement and subject modification sets for B. These results are in line with our two expectations.

Freeze abstract heads.

PLMs performance with fuzzy abstract concepts.

Probing results for randomly initialized models. ∆(A ⇒ B) and ∆(C ⇒ B) are in brackets.

Results in our operation probe.

Supplement to Table 3. Freeze a set of abstract heads and fine-tune.

Examples in different sets in Com. sub-probe in grammar probe. The terminal NONE in target side is omitted to more clearly show the structure of the target example.

Examples in different sets in Mod. sub-probe in grammar probe.

only contained in the source side and no terminals in target side are aligned with them. Table 13 shows an example in Redundant G * s .

Examples in different G * t .

Examples in Redundant G * s . Gray terminals are redundant that would not be mapped to targets.

Downstream performance under different multi-grammar settings.

Increase total number of terminals.

MoA of two models on both grammar probe and fuzzy grammar probe and the scores required to calculate MoA. T5 and GPT2 are T5-Base and GPT2-Medium, respectively.

Data scales, average input lengths, and average output lengths of different sets in our probing. (Columns: Data Scale, Avg Input Len, Avg Output Len.)



Hyperparameters for training and testing.

Detailed results for Figure 5.

Detailed results for Figure 6a and 6b.

Detailed results for Figure 6c.

ACKNOWLEDGMENTS

We thank all the anonymous reviewers for their valuable comments. This work was supported in part by NSFC under grant No. 62088102. We would like to thank Qian Liu for his valuable suggestions and feedback on this work.

G.2 MULTI-GRAMMAR PRE-TRAINING

Up to this point, we have considered the setup in which the model sees only one pair of input and output grammars during pre-training. This section explores whether multi-grammar pre-training influences how the model exhibits abstraction. Here, we consider two cases: the model can or cannot access the golden grammar, i.e., the grammar used in the downstream task. For multi-grammar pre-training, we assemble the different target grammars from Section 6.3 while keeping the source grammar fixed. When generating pre-training data with more than one target grammar, for each instance we add a prefix (chosen from original, coarse, localreverse, nest, and reverse) at the beginning of the source tokens, indicating to the model which target grammar it should use. Table 14 shows the results with different ensemble grammars.

First, consider the case in which we have no access to the golden grammar. We take the target grammar that performs best in Section 6.3, Coarse G * t , as the single-grammar baseline. Table 14 shows that augmenting the Coarse G * t with other target grammars always performs better than the single grammar. Even augmenting with the Reverse G * t from the contrast task brings a slight gain (1.4% accuracy). This indicates that even though the model has no access to the golden abstract concepts, increasing the diversity of abstract concepts can make the model more aware of the existence of abstract concepts. This awareness can be regarded as a higher level of abstraction capability.

Then, consider the case in which the model has access to the golden grammar. The downstream task performance is lower than that of pre-training only on the golden grammar (accuracy 88.2%). Therefore, augmenting with other similar abstract concepts can confuse the model and make it hard to choose which concepts should be used for the downstream task.
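The prefixing step can be sketched as follows. The five prefix names are those listed above; treating the prefix as a single extra token placed before the source tokens, as well as the function name and the example sentence, are assumptions made for illustration.

```python
from typing import List, Sequence

# Prefix names for the target grammars, as listed in the text.
GRAMMAR_PREFIXES = ("original", "coarse", "localreverse", "nest", "reverse")


def add_grammar_prefix(source_tokens: Sequence[str], grammar: str) -> List[str]:
    """Prepend the grammar name so the model knows which target grammar to use."""
    if grammar not in GRAMMAR_PREFIXES:
        raise ValueError(f"unknown target grammar: {grammar!r}")
    return [grammar] + list(source_tokens)


# Hypothetical source sentence, tokenized on whitespace.
source = "the dog helped a rose .".split()
for grammar in ("coarse", "reverse"):
    print(" ".join(add_grammar_prefix(source, grammar)))
```

The same source sentence can thus appear several times in the pre-training data, once per target grammar, with only the prefix and the target side differing.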

G.3 INCREASE TOTAL NUMBER OF TERMINALS

We increase the total number of terminals to ∼1,500 in our probing tasks and report the results with T5-Base in Table 15 . These results are similar to the original results in Table 2 and are still in line with our two expectations.

H FUZZY GRAMMAR

The abstract concepts discussed in the grammar probe in Section 4 can be concretely defined, but in many application scenarios, abstract concepts can be fuzzy (e.g., natural language grammar). Here, we want to examine whether models can learn fuzzy grammar or can only recognize concrete ones.

