GRADIENT DESCENT RESISTS COMPOSITIONALITY

Abstract

In this paper, we argue that gradient descent is one of the reasons that compositionality is hard to learn during neural network optimization. We find that the optimization process imposes a bias toward non-compositional solutions: gradient descent tries to use all available, redundant information from the input, violating the conditional independence property of compositionality. Based on this finding, we suggest that compositionality learning approaches that consider only model architecture design are unlikely to achieve complete compositionality. This is the first work to investigate the relation between compositional learning and gradient descent. We hope this study provides novel insights into compositional generalization and forms a basis for new research directions to equip machine learning models with such skills toward human-level intelligence. The source code is included in the supplementary material.

1. INTRODUCTION

Compositional generalization is the algebraic capacity to understand and produce many novel combinations from known components (Chomsky, 1957; Montague, 1970), and it is a key element of human intelligence (Minsky, 1986; Lake et al., 2017) for recognizing the world efficiently and imagining novel combinations. Broadly speaking, compositional generalization is a class of out-of-distribution generalization (Bengio, 2017), where the training and test distributions differ. A sample in such a setting is a combination of several components, and generalization is enabled by recombining, at inference time, the seen components of an unseen combination. For example, in the image domain, an object is a combination of multiple parts or properties; in the language domain, a sentence is a combination of syntax and semantics. Each component of an output depends only on the corresponding input component, not on other variables. We call this the conditional independence property, and we formally introduce it in Section 3. We would like machine learning algorithms to have compositional generalization skills, but conventional neural network models generally lack this ability. There have been many attempts to equip models with compositionality (Fodor & Pylyshyn, 1988; Bahdanau et al., 2019), and most efforts focus on designing neural network architectures (Graves et al., 2014; Andreas et al., 2016; Henaff et al., 2016; Shazeer et al., 2017; Li et al., 2018; Santoro et al., 2018; Kirsch et al., 2018; Rosenbaum et al., 2019; Goyal et al., 2019). Recently, multiple approaches have shown progress on specific tasks (Li et al., 2019; 2020; Lake, 2019; Russin et al., 2019), but we still do not know why standard approaches seldom achieve good compositionality in general. In this paper, we argue that gradient descent introduces a bias that prevents parameters from reaching compositional solutions (see Figure 1 for an illustration).
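As a concrete sketch of the conditional independence property, consider a toy two-component map (our own hypothetical illustration; the component maps f1 and f2 are not from this paper): each output component is a function of its corresponding input component alone, so varying one input component leaves the other output component unchanged.

```python
# Hypothetical component maps (illustrative only, not from the paper).
def f1(x1):
    return 2.0 * x1          # first output depends only on x1

def f2(x2):
    return x2 + 1.0          # second output depends only on x2

def compositional_model(x1, x2):
    # Conditional independence: output y_i is a function of x_i alone.
    return f1(x1), f2(x2)

# Varying x2 does not change y1.
y1_a, _ = compositional_model(3.0, 0.0)
y1_b, _ = compositional_model(3.0, 100.0)
assert y1_a == y1_b
```

A non-compositional model would let, say, y1 depend on both x1 and x2, and the assertion above would fail for it.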
The bias arises because the gradient seeks the steepest descent direction, which uses all available, redundant input information and thus contradicts the conditional independence property of compositionality. This problem is not due to how the gradient is computed, e.g., by backpropagation, but is caused by an essential property of the gradient itself. We derive a theoretical relation between gradient descent and compositionality using information theory. We also provide examples and visualizations to show in detail how the gradient resists compositionality. Based on this finding, we propose that compositionality learning approaches relying on model structure design (manual or searched) alone are unlikely to achieve complete compositionality. We hope this research provides new insights, forms a basis for new research directions in compositional generalization, and helps move machine intelligence toward the human level. The contributions of this paper can be summarized as follows.
• The novelty of this work is to uncover the relation between compositional learning and gradient descent in the optimization process, i.e., gradient descent resists compositionality.
• We theoretically derive this result and explain why standard approaches based on architecture design alone do not address compositionality.
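The claim that the steepest-descent direction exploits redundant inputs can be sketched with a minimal linear-regression toy (our own construction, not an experiment from this paper): when a second feature exactly duplicates the first, gradient descent from zero initialization spreads weight evenly over both features, converging to (0.5, 0.5) rather than the compositional solution (1, 0), even though both fit the data perfectly.

```python
import numpy as np

# Toy construction (ours, not the paper's experiment): the target depends
# on x1 only, but x2 is an exact copy of x1, i.e., fully redundant.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(1000, 1))
X = np.hstack([x1, x1])          # the two feature columns are identical
y = x1[:, 0]                     # compositional solution: w = (1, 0)

w = np.zeros(2)                  # zero initialization
lr = 0.1
for _ in range(500):
    grad = 2.0 * X.T @ (X @ w - y) / len(y)   # MSE gradient
    w -= lr * grad

# The gradient is identical in both coordinates, so gradient descent
# moves along the (1, 1) direction and converges to the minimum-norm
# solution (0.5, 0.5): it uses the redundant feature instead of the
# equally valid compositional solution (1, 0).
print(w)   # ≈ [0.5 0.5]
```

The same effect appears with early stopping when x2 is merely correlated with x1: the initial gradient is nonzero for every feature correlated with the target, so all of them receive weight.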

2. RELATED WORK

Compositionality Humans learn language and recognize the world in a flexible way by leveraging systematic compositionality. Compositional generalization is critical in human cognition (Minsky, 1986; Lake et al., 2017), and it helps humans connect a limited number of learned concepts to form unseen combinations. Though deep learning has achieved much in recent years (LeCun et al., 2015; Krizhevsky et al., 2012; Yu & Deng, 2012; He et al., 2016; Wu & et al, 2016), compositional generalization has not been well addressed (Fodor & Pylyshyn, 1988; Marcus, 1998; Fodor & Lepore, 2002; Marcus, 2003; Calvo & Symons, 2014). There are observations that current neural network models do not learn compositionality (Bahdanau et al., 2019). Most recently, multiple approaches have been proposed to address compositionality in neural networks (Li et al., 2019; 2020; Lake, 2019; Russin et al., 2019) for specific tasks. However, we are still not sure why compositionality is hard to achieve in general, and this work discusses the problem from an optimization perspective. Another line of related work is independent disentangled representation learning (Higgins et al., 2017; Locatello et al., 2019). Its main assumption is that the expected components are statistically independent in the training data. This setting does not have a transfer problem at test time, because all combinations have positive joint probabilities in training (please refer to Section 3). Compositionality is applied in different areas such as continual learning (Jin et al., 2020; Li et al., 2020), question answering (Andreas et al., 2016; Hudson & Manning, 2019; Keysers et al., 2020), and reasoning (Talmor et al., 2020).
Gradient descent Most of the previous work focuses on faster reduction of loss and theoretical convergence analysis of SGD (Bottou et al., 2018; Luo, 1991; Reddi et al., 2018; Chen et al., 2018; Zhou et al., 2018; Zou & Shen, 2018; De et al., 2018; Zou et al., 2018; Ward et al., 2018; Barakat & Bianchi, 2019). In contrast, this work investigates why standard neural network training achieves only a limited level of compositionality, by studying the relationship between gradient descent and compositionality.



Figure 1: Conceptual illustration of compositionality and the impact of gradient descent. X_1, X_2 are entangled inputs, and Ŷ_1, Ŷ_2 are entangled outputs; Ŷ_i aligns with X_i for i = 1, 2. (Left) Compositional solution with parameters θ_A. (Middle) Non-compositional solution with θ_B. (Right) In parameter space, gradient descent pushes parameters closer to θ_B than to θ_A, hence resisting compositionality.


