ON A BUILT-IN CONFLICT BETWEEN DEEP LEARNING AND SYSTEMATIC GENERALIZATION

Abstract

Out-of-distribution or systematic generalization is a desirable property that most deep learning algorithms lack. In this paper, we hypothesize that internal function sharing is one of the reasons that systematic generalization is weakened in deep learning classification tasks. Under equivalent prediction, a model partitions an input space into multiple parts separated by boundaries. Function sharing prefers to reuse boundaries, leading to fewer parts for new outputs, which conflicts with systematic generalization. We demonstrate such phenomena in standard deep learning models, such as fully connected, convolutional, and residual networks, LSTMs, and (Vision) Transformers. We hope this study provides novel insights and forms a basis for new research directions to improve systematic generalization.



Systematic generalization is enabled by producing an unseen combination of seen factor values. For example, models trained on blue rectangles and green triangles should predict blue triangles. We adopt factors mainly in designing experiments and developing intuitions. Factors help the experimental design because new outputs are related only to function sharing between factors (Section 3), so we limit our claim to cases involving the recombination of factors. One stream of artificial intelligence is Connectionism (Feldman & Ballard, 1982; Rumelhart et al., 1986), which uses many simple neuron-like units that are richly interconnected and processed in parallel. Connectionist models were criticized for not supporting systematic generalization well (Fodor & Pylyshyn, 1988; Marcus, 1998). This paper contributes to uncovering a built-in conflict between deep learning and systematic generalization. We hope this study provides novel insights, forms a basis for new research directions, and helps improve machine intelligence toward the human level.
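To make the factor-recombination setup concrete, the following sketch builds a train/test split that holds out one combination of two factors. This is an illustration of the experimental design described above, not code from the paper; the factor names and values are hypothetical:

```python
from itertools import product

# Hypothetical factors: every (color, shape) pair describes one sample class.
colors = ["blue", "green", "red"]
shapes = ["rectangle", "triangle", "circle"]

# Hold out one unseen combination to test systematic generalization.
held_out = ("blue", "triangle")

all_combos = list(product(colors, shapes))
train_combos = [c for c in all_combos if c != held_out]
test_combos = [held_out]

# The model sees "blue" and "triangle" separately during training,
# but never together; success at test time requires recombination.
assert held_out not in train_combos
print(len(train_combos), len(test_combos))  # prints: 8 1
```

Each factor value appears in training, so the test sample is only novel as a combination, which is exactly the regime in which function sharing between factors matters.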



Figure 2: Intuitions for the model preference of deep learning (DL). Each panel shows an input space containing three sets of training samples, with outputs +/+, -/-, and -/+, respectively. Models have two decision boundaries (cyan and magenta). The orange dot is a test sample with a new ground-truth factor combination +/-. We argue that the training process prefers the first case (a) over the second (b) because it likes to share or reuse functions. Suppose the first function (c) is learned; the process then tends to reuse it by learning a simple function (d) and combining the two, instead of learning the complicated magenta function in (b) from scratch. The function between +/+ and -/- is thus shared, few inputs are mapped to the new output, and systematic generalization is not achieved.

Figure 2 gives an intuitive example of why the conflict happens. A test sample (in orange) equals a set of training samples (+/+) on the first output. They are then also equal on the second output if the function is reused (see caption). Therefore, they are equal on all outputs, which conflicts with systematic generalization. More generally, the two boundaries jointly partition an input space into multiple parts, and deep learning prefers (a) because it has fewer parts than (b). Figure 3 has a visualized example. It is similar to the example in Figure 2a: the two functions are shared in the top-right region, and few inputs are predicted as the new combination. Figure 1 has a simplified plot for a result in the experiment section. As the degree of function sharing increases (more shared layers), the accuracy on the test dataset, i.e., the generalization capacity, decreases accordingly. This supports the claim that function sharing weakens systematic generalization. Please refer to Section 3 for more details.
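The geometric intuition can be sketched numerically. Below is a minimal, hypothetical construction (not taken from the paper's experiments): two binary output functions over a 2D grid, once with a reused boundary as in Figure 2a and once with independent boundaries as in Figure 2b, counting how many inputs receive the new combination +/-:

```python
import numpy as np

# Dense grid over a 2D input space.
xs, ys = np.meshgrid(np.linspace(-1, 1, 201), np.linspace(-1, 1, 201))

# First output: a single decision boundary (cyan in Figure 2).
f1 = xs > 0

# Case (a): the second output reuses f1's boundary combined with a simple
# function, so wherever f1 is +, f2 is also + (the boundary is shared).
f2_shared = f1 | (ys > 0)

# Case (b): the second output has its own independent boundary (magenta).
f2_independent = ys > 0

def new_combo_fraction(f2):
    # Fraction of inputs mapped to the unseen combination +/-.
    return np.mean(f1 & ~f2)

print(new_combo_fraction(f2_shared))       # 0.0 -- no input gets the new output
print(new_combo_fraction(f2_independent))  # about 0.25 -- a whole quadrant does
```

With the shared boundary, `f1 = +` forces `f2 = +`, so the region for the new combination +/- is empty; with independent boundaries, roughly a quarter of the space receives it. The boundary functions here are assumed linear for simplicity.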

