ANALYZING ATTENTION MECHANISMS THROUGH THE LENS OF SAMPLE COMPLEXITY AND LOSS LANDSCAPE

Abstract

Attention mechanisms have advanced state-of-the-art deep learning models for many machine learning tasks. Despite significant empirical gains, theoretical analysis of their effectiveness is lacking. In this paper, we address this problem by studying the sample complexity and loss landscape of attention-based neural networks. Our results show that, under mild assumptions, every local minimum of the attention model has low prediction error, and that attention models require lower sample complexity than models without attention. Besides revealing why the popular self-attention mechanism works, our theoretical results also provide guidelines for designing future attention models. Experiments on various datasets validate our theoretical findings.

1. INTRODUCTION

Significant research in machine learning has focused on designing network architectures for superior performance, faster convergence, and better generalization. Attention mechanisms are one such design choice, widely used in many natural language processing and computer vision tasks. Inspired by human cognition, attention mechanisms advocate focusing on relevant regions of the input data to solve the desired task, rather than ingesting the entire input. Several variants of attention mechanisms have been proposed, and they have advanced the state of the art in machine translation (Bahdanau et al., 2014; Luong et al., 2015; Vaswani et al., 2017), image captioning (Xu et al., 2015), video captioning (Pu et al., 2018), visual question answering (Zhou et al., 2015; Lu et al., 2016), generative modeling (Zhang et al., 2018), etc. In computer vision, spatial/spatiotemporal attention masks are employed to focus only on the regions of images/video frames relevant to the underlying downstream task (Mnih et al., 2014). In natural language tasks, where input-output pairs are sequential data, attention mechanisms focus on the most relevant elements in the input sequence to predict each symbol of the output sequence. Hidden state representations of a recurrent neural network are typically used to compute these attention masks.

Substantial empirical evidence of the effectiveness of attention mechanisms motivates us to study the problem through a theoretical lens. To this end, it is important to understand the loss landscape and optimization of neural networks with attention. Analyzing the loss landscape of neural networks is an active area of ongoing research, and it can be challenging even for two-layer neural networks (Poggio & Liao, 2017; Rister & Rubin, 2017; Soudry & Hoffer, 2018; Zhou & Feng, 2017; Mei et al., 2018b; Soltanolkotabi et al., 2017; Ge et al., 2017; Nguyen & Hein, 2017a; Arora et al., 2018).
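To make the mechanism concrete, the following is a minimal sketch (not the specific formulation analyzed in this paper) of dot-product attention over a sequence of encoder hidden states: each state is scored against the current decoder state, the scores are normalized with a softmax to form an attention mask, and the mask produces a weighted sum (context vector). All variable names and dimensions are illustrative.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Dot-product attention: score each encoder hidden state against the
    current decoder state, normalize with softmax, and return the
    attention weights together with the weighted sum (context vector)."""
    scores = encoder_states @ decoder_state   # (T,) one score per input element
    weights = softmax(scores)                 # attention mask: nonnegative, sums to 1
    context = weights @ encoder_states        # (d,) attention-weighted combination
    return weights, context

# toy example: T = 4 encoder states of dimension d = 3, random for illustration
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))   # hypothetical RNN encoder hidden states
s = rng.normal(size=3)        # hypothetical decoder state at one output step
w, c = attention_context(s, H)
```

The softmax ensures the mask is a probability distribution over input elements, so the model "focuses" by concentrating weight on the most relevant positions rather than consuming the whole input uniformly.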
Convergence of gradient descent for two-layer neural networks has been studied in Allen-Zhu et al. (2019); Mei et al. (2018b); Du et al. (2019). Ge et al. (2017) show that there are no bad local minima for two-layer neural networks under a specific loss landscape design. These works reveal the importance of understanding the loss landscape of neural networks. Unfortunately, their results cannot be directly applied to attention mechanisms: in attention models, the network structure is different, and the attention mechanism introduces additional parameters that are jointly optimized. To the best of our knowledge, no existing work analyzes the loss landscape and optimization of attention models. In this work, we present a theoretical analysis of self-attention models (Vaswani et al., 2017), which use correlations among the elements of the input sequence to learn an attention mask.
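As a reference point for the mechanism being analyzed, here is a minimal sketch of standard single-head scaled dot-product self-attention in the style of Vaswani et al. (2017); the projection matrices are random placeholders, and this is an illustration of the general mechanism, not the simplified model studied in our theoretical results.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention: every sequence
    element attends to every other element via query-key correlations,
    and the resulting mask reweights the value vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (T, T) pairwise correlations
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)              # row-wise softmax: attention mask
    return A, A @ V                                    # mask and attended output

# toy example: sequence length T = 5, model dimension d = 8
rng = np.random.default_rng(1)
T, d = 5, 8
X = rng.normal(size=(T, d))                            # input sequence
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # illustrative projections
A, out = self_attention(X, Wq, Wk, Wv)
```

Each row of the mask `A` is a distribution over the input positions, so the attention parameters (here `Wq`, `Wk`, `Wv`) are learned jointly with the rest of the network — the source of the extra parameters that complicate the loss landscape analysis.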

