ATOMIZED DEEP LEARNING MODELS

Abstract

Deep learning models often tackle the intra-sample structure, such as the order of words in a sentence or of pixels in an image, but have not paid much attention to the inter-sample relationship. In this paper, we show that explicitly modeling the inter-sample structure to be more discretized can potentially improve a model's expressivity. We propose a novel method, Atom Modeling, that discretizes a continuous latent space by drawing an analogy between a data point and an atom, which is naturally spaced away from other atoms at distances that depend on their internal structures. Specifically, we model each data point as an atom composed of electrons, protons, and neutrons, and minimize the potential energy caused by the interatomic forces among data points. Through experiments and qualitative analyses of Atom Modeling on synthetic and real datasets, we find that Atom Modeling can improve performance by maintaining the inter-sample relation, and can capture an interpretable intra-sample relation by mapping each component of a data point to an electron/proton/neutron.

1. INTRODUCTION

Many widely used neural networks are composed of two parts: the first part projects data points into another space, and the second performs regression/classification on this space. By transforming raw data features into another, potentially more tractable space, deep learning models have recently shown potential in many areas, ranging from dialogue systems (Vinyals & Le, 2015; López et al., 2017; Chen et al., 2017) and medical image analysis (Kononenko, 2001; Ker et al., 2017; Erickson et al., 2017; Litjens et al., 2017; Razzak et al., 2018; Bakator & Radosav, 2018) to robotics (Peters et al., 2003; Kober et al., 2013; Pierson & Gashler, 2017; Sünderhauf et al., 2018). One major challenge in deep learning is to better model the intra- and inter-sample structures of complex data features. Recent works often model the intra-sample structure by considering the order and adjacency of the input features, for instance, positional encoding for text/speech in Transformers (Vaswani et al., 2017) and kernel width for images in convolutional neural networks (LeCun et al., 2015). Regarding the inter-sample structure, the literature often assumes that a dataset can be represented in a continuous space and that an interpolation of two embeddings might be meaningful (Bowman et al., 2016; Chen et al., 2016), while the data might be naturally discrete (van den Oord et al., 2017). Moreover, mainstream approaches rely on a non-fully-transparent optimization function to reorganize the space. Therefore, in this work, we explore how to explicitly and dynamically rearrange the space (inter-sample structure) by leveraging the intra-sample structures. Inspired by Atomic Physics, where an atom is the smallest unit of matter and atoms are discretely distributed, we propose modeling a data point as an atom.
As illustrated in the left of Figure 2, an atom in the Bohr model (Bohr, 1913), a concept often adopted in Physics (Halliday et al., 2013) and Chemistry (Brown, 2009), contains a dense nucleus, composed of positively charged protons and uncharged neutrons, surrounded by negatively charged electrons orbiting at a nucleus radius. Further, multiple atoms exert interatomic forces on one another, composed of attractive and repulsive forces, that keep the atoms at a non-zero distance. Such interatomic forces are also the reason atoms form the molecules, crystals, and metals of our observable life. In this paper, we propose Atom Modeling, a science- and theoretically-grounded method that explicitly models the intra-sample relation via atomic structure and the inter-sample relation via interatomic forces. Specifically, we consider a data point as an atom and let a model automatically learn the mapping of each component of a data point to an electron, a proton, or a neutron. We then dynamically estimate the interatomic forces from the learned subatomic particles, nucleus radius, and atomic spacing. Finally, the model is optimized to minimize the potential energy induced by the interatomic forces and to maintain the balance of total charges and of the numbers of electrons, protons, and neutrons. This method is not only found effective, but is also easy to implement in tens of lines for any model architecture. We validate the effects of Atom Modeling on synthetic and real data in the domains of text and image classification, on both convolutional neural networks and transformers.

Figure 1: An illustration of our motivation, where a neural network is seen as composed of two groups of functions h(·) and g(·). For a fixed capacity of g(·), e.g., a linear function, the left latent space cannot be separated. If the lower-left part is distanced away, the same g(·) can split the classes.
The empirical results show that Atom Modeling can consistently improve performance across data amounts, domains, and output complexities. The analyses demonstrate that Atom Modeling can capture intra-sample structures with interpretable meanings of the subatomic particles in a data point, while forming an inter-sample structure that increases the model's expressivity. Our contributions are:
• We propose to study the problem of discrete representation in deep learning models.
• We propose Atom Modeling, a simple method adapted from Atomic Physics for machine learning, where the distances among data points (inter-sample structure) naturally depend on their intra-sample structures.
• We empirically demonstrate that Atom Modeling can improve performance across different setups and provides an interpretable atomic structure for a data point.
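Since the paper states the method is implementable in tens of lines, the following is a minimal sketch of what an interatomic potential energy over a batch of embeddings could look like. The function name, the net-charge representation, and the Lennard-Jones/Coulomb-style functional form are all illustrative assumptions, not the paper's actual equations:

```python
import numpy as np

def atom_potential_loss(embeddings, charges, radius=1.0):
    """Hypothetical sketch of an interatomic potential over a batch of
    data-point embeddings. Each embedding is treated as an atom with a
    net charge (protons minus electrons). The loss combines a short-range
    repulsive term, which keeps atoms apart when they come closer than
    the nucleus radius, with an electrostatic-like term whose sign
    depends on the product of the charges. Both terms are assumptions
    for illustration, not the paper's actual formulation.
    """
    n = embeddings.shape[0]
    # pairwise Euclidean distances between atoms (data points)
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-8)
    # exclude self-interactions on the diagonal
    mask = ~np.eye(n, dtype=bool)
    # repulsion grows as atoms approach closer than the nucleus radius
    repulsion = (radius / dist) ** 2
    # like charges repel (positive term), opposite charges attract (negative)
    coulomb = charges[:, None] * charges[None, :] / dist
    return (repulsion + coulomb)[mask].mean()
```

In practice such a term would be added to the task loss and minimized jointly, so that the encoder learns to space the atoms apart; the sketch above only computes the penalty for a fixed batch.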

2. MOTIVATION

We are motivated by a property of the hidden layer of a neural network, and the property of the naturally existing atoms.

2.1. DISCRETE REPRESENTATION

A neural network can be represented as a composition of functions f(·) = f_N ∘ f_{N-1} ∘ ... ∘ f_1(·), where each function is one of its N layers. When a neural network f(·) is seen as two groups of functions f(x) = g ∘ h(x), the first half h(·) = f_n ∘ f_{n-1} ∘ ... ∘ f_1(·) encodes the input into a hidden space and the second half g(·) = f_N ∘ f_{N-1} ∘ ... ∘ f_{n+1}(·) transforms the latent representation into the output space. We consider a situation in which the model capacity of the second half, c(g), is fixed. The latent space projected by the first half is hence important to the final output of the model. For ease of mathematical description, we denote a quantification of the simplicity of the encoded hidden space as s(h). The whole model capacity c(f) is therefore bounded by:

min(s(h), c(g)) ≤ c(f) ≤ max(s(h), c(g))

Our intuition is that if the distance and shapes of two classes make them hard to separate, the simplicity s(h) is small; a second half with a model capacity equal to that of a linear function then cannot separate them. An example is shown in Figure 1. If the points' distances are larger, they can be split by the same linear function. That is, the space simplicity s(h) can be increased by a more discrete latent space, so one of the bounds on the model capacity c(f) can be improved. However, if we only naively increase the distances among all the points, the space will be unboundedly enlarged and the relative positions might not change much. A method that focuses on separating points that are nearby but have different properties (e.g., the lower-left region in the two plots of Figure 1) is desired.
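The intuition above can be made concrete with a toy experiment: apply a displacement that pushes apart only pairs of points that are close together yet belong to different classes, leaving well-separated points untouched. The function below is an illustrative sketch of that idea; its name, the neighborhood threshold, and the update rule are assumptions for demonstration, not the Atom Modeling method itself:

```python
import numpy as np

def repel_mixed_neighbors(x, y, step=0.5, near=1.0):
    """Toy sketch of the Figure 1 intuition: each point is pushed away
    from opposite-class points that lie within a `near` radius, while
    pairs that are already well separated are left alone. The threshold
    `near` and step size are arbitrary illustrative choices.
    """
    x = x.copy()
    n = len(x)
    for i in range(n):
        for j in range(n):
            if y[i] != y[j]:
                d = x[i] - x[j]
                dist = np.linalg.norm(d)
                if 0 < dist < near:
                    # move point i directly away from the nearby
                    # opposite-class point j
                    x[i] += step * d / dist
    return x
```

Running this on a batch where two opposite-class points nearly overlap increases the minimum inter-class distance, which is exactly the kind of rearrangement that lets a fixed linear g(·) separate the classes; distant pairs keep their relative layout.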

