LESS IS MORE: RETHINKING FEW-SHOT LEARNING AND RECURRENT NEURAL NETS

Abstract

The statistical supervised learning framework assumes an input-output set with a joint probability distribution that is reliably represented by the training dataset. The learner is then required to output a prediction rule learned from the training dataset's input-output pairs. In this work, we provide meaningful insights into the asymptotic equipartition property (AEP) (Shannon, 1948) in the context of machine learning, and illuminate some of its potential ramifications for few-shot learning. We provide theoretical guarantees for reliable learning under the information-theoretic AEP, and for the generalization error with respect to the sample size. We then focus on a highly efficient recurrent neural net (RNN) framework and propose a reduced-entropy algorithm for few-shot learning. We also propose a mathematical intuition for the RNN as an approximation of a sparse coding solver. We verify the applicability, robustness, and computational efficiency of the proposed approach with image deblurring and optical coherence tomography (OCT) speckle suppression. Our experimental results demonstrate significant potential for improving learning models' sample efficiency, generalization, and time complexity, which can therefore be leveraged for practical real-time applications.

1. INTRODUCTION

In recent years, machine learning (ML) methods have led to many state-of-the-art results spanning various fields of knowledge. Nevertheless, a clear theoretical understanding of important aspects of artificial intelligence (AI) is still missing. Furthermore, there are many challenges concerning the deployment and implementation of AI algorithms in practical applications, primarily due to their extensive computational complexity and insufficient generalization. Concerns have also been raised regarding the energy consumption of training large-scale deep learning systems (Strubell et al., 2020). Improving sample efficiency and generalization, and integrating physical models into ML, have been the focus of attention and effort for many in the industrial and academic research communities. Over the years, significant progress has been made in training large models. Nevertheless, it remains unclear what makes a representation good for complex learning systems (Bottou et al., 2007; Vincent et al., 2008; Bengio, 2009; Zhang et al., 2021).

Main Contributions.

In this work we investigate the theoretical and empirical possibilities of few-shot learning, and the use of RNNs as a powerful platform given limited ground-truth training data. (1) Based on the information-theoretic asymptotic equipartition property (AEP) (Cover & Thomas, 2006), we show that there exists a relatively small set that can empirically represent the input-output data distribution for learning. (2) In light of the theoretical analysis, we promote the use of a compact RNN-based framework, and demonstrate its applicability and efficiency for few-shot learning in natural image deblurring and optical coherence tomography (OCT) speckle suppression. We demonstrate the use of a single-image training dataset that generalizes well, as an analogue to universal source coding with a known dictionary. The method may be applicable to other learning architectures, as well as to other applications where the signal can be processed locally, such as speech and audio, video, seismic imaging, MRI, ultrasound, natural language processing, and more. Training of the proposed framework is extremely time efficient: it takes about 1-30 seconds on a GPU workstation and 2-4 minutes on a CPU workstation, and thus does not require expensive computational resources. (3) We propose an upgraded RNN framework incorporating receptive field normalization (RFN) (see (Pereg et al., 2021), Appendix C) that decreases the entropy of the input data distribution and improves visual quality in noisy environments. (4) We illuminate a possible optimization mechanism behind RNNs. We observe that an RNN can be viewed as a sparse solver starting from an initial condition based on the previous time step. The proposed interpretation offers an intuitive explanation for the mathematical functionality behind the popularity and success of RNNs in solving practical problems.

2. DATA-GENERATION MODEL -A DIFFERENT PERSPECTIVE

Typically, in statistical learning (Shalev-Shwartz & Ben-David, 2014; Vapnik, 1999), it is assumed that the instances of the training data are generated by some probability distribution. For example, we can assume a training input set $\Psi_Y = \{\{y_i\}_{i=1}^{m} : y_i \sim P_Y\}$, such that there is some correct target output, unknown to the learner, and each pair $(y_i, x_i)$ in the training data $\Psi$ is generated by first sampling a point $y_i$ according to $P_Y(\cdot)$ and then labeling it. The examples in the training set are randomly chosen and, hence, independently and identically distributed (i.i.d.) according to the distribution $P_Y(\cdot)$. We have access to the training error (also referred to as the empirical risk), which we normally try to minimize. The known phenomenon of overfitting occurs when the learning system fits the training set perfectly but fails to generalize. In classification problems, probably approximately correct (PAC) learning defines the minimal size of a training set required to guarantee a PAC solution. The sample complexity is a function of the accuracy of the labels and a confidence parameter; it also depends on properties of the hypothesis class. To describe generalization, we normally differentiate between the empirical risk (training error) and the true risk. It is known that the curse of dimensionality means that learning in high dimension always amounts to extrapolation (out-of-sample extension) (Balestriero et al., 2021).
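The gap between empirical and true risk can be made concrete with a small sketch (a hypothetical toy setup, not one of this paper's experiments): we fit polynomials of two degrees to a small training set drawn from an assumed noisy sine model, and estimate the true risk on a large held-out sample. The interpolating high-degree model drives the training error toward zero while its true risk remains large.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generation model: y_i ~ P_Y uniform on [-1, 1],
# labeled by a smooth function plus Gaussian noise.
def sample(m):
    y = rng.uniform(-1.0, 1.0, size=m)
    x = np.sin(np.pi * y) + 0.3 * rng.normal(size=m)
    return y, x

def risks(degree, m_train=10, m_test=10_000):
    """Empirical risk on the training set vs. estimated true risk."""
    y_tr, x_tr = sample(m_train)
    y_te, x_te = sample(m_test)
    coef = np.polyfit(y_tr, x_tr, degree)              # ERM over polynomials
    emp = np.mean((np.polyval(coef, y_tr) - x_tr) ** 2)   # training error
    true = np.mean((np.polyval(coef, y_te) - x_te) ** 2)  # held-out estimate
    return emp, true

emp3, true3 = risks(degree=3)
emp9, true9 = risks(degree=9)   # degree 9 interpolates the 10 training points
```

Here the degree-9 fit achieves near-zero empirical risk (it interpolates the ten training points), yet its true risk exceeds the noise floor: exactly the overfitting scenario described above.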

INFORMATION THEORETIC PERSPECTIVE: CONNECTION TO AEP

Hereafter, we use the notation $u^n$ to denote a sequence $u_1, u_2, \dots, u_n$. In information theory, a stationary stochastic process $u^n$ taking values in some finite alphabet $U$ is called a source. In communication theory we often refer to discrete memoryless sources (DMS) (Kramer et al., 2008; Cover & Thomas, 2006). However, many signals, such as image patches, are usually modeled as entities belonging to some probability distribution forming statistical dependencies (e.g., a Markov random field (MRF) (Roth & Black, 2009; Weiss & Freeman, 2007)) describing the relations between data points in close spatial or temporal proximity. Here, we briefly summarize the AEP for ergodic sources with memory (Austin, 2017). Although the formal definition of an ergodic process is somewhat involved, the general idea is simple. In an ergodic process, every sequence produced by the process is the same in statistical properties (Shannon, 1948): the symbol frequencies obtained from particular sequences approach a definite statistical limit as the length of the sequence is increased. More formally, we assume an ergodic source with memory that emits $n$ symbols from a discrete and finite alphabet $U$, with probability $P_U(u_1, u_2, \dots, u_n)$. We recall a theorem (Breiman, 1957), here without proof.

Theorem 1 (Entropy and Ergodic Theory). Let $u_1, u_2, \dots, u_n$ be a stationary ergodic process ranging over a finite alphabet $U$. Then there is a constant $H$ such that $H = \lim_{n \to \infty} -\frac{1}{n} \log_2 P_U(u_1, \dots, u_n)$. $H$ is the entropy rate of the source.

Intuitively, when we observe a source with memory over several time units, the uncertainty grows more slowly as $n$ grows, because once we know the previous source entries, the dependencies reduce the overall conditional uncertainty. The entropy rate $H$, which represents the average uncertainty per time unit, converges over time. This, of course, makes sense, as it is known that $H(X, Y) \le H(X) + H(Y)$.
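Theorem 1 can be checked numerically on the simplest ergodic source with memory: a two-state Markov chain (a minimal sketch with an arbitrarily chosen transition matrix, not a source used in this paper). For a Markov chain, the entropy rate has the closed form $H = \sum_i \pi_i H(P_{i,\cdot})$, where $\pi$ is the stationary distribution; the empirical quantity $-\frac{1}{n}\log_2 P_U(u_1, \dots, u_n)$ computed on one long sample path should approach it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-state Markov source; transition matrix chosen for illustration.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Stationary distribution pi solves pi P = pi (left eigenvector, eigenvalue 1).
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Closed-form entropy rate: H = sum_i pi_i * H(P[i, :]).
H = -sum(pi[i] * P[i, j] * np.log2(P[i, j])
         for i in range(2) for j in range(2))

# One long sample path, then the empirical rate -(1/n) log2 P_U(u_1..u_n).
n = 200_000
r = rng.random(n)
u = np.empty(n, dtype=int)
u[0] = int(r[0] > pi[0])
for t in range(1, n):
    u[t] = int(r[t] > P[u[t - 1], 0])   # next state drawn from row u[t-1]
H_hat = -(np.log2(pi[u[0]]) + np.log2(P[u[:-1], u[1:]]).sum()) / n
```

For this chain $\pi = (0.8, 0.2)$ and $H \approx 0.569$ bits per symbol; `H_hat` concentrates around this value as $n$ grows, as the theorem states.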
In other words, the uncertainty of a joint event is less than or equal to the sum of the individual uncertainties. The generalization of the AEP to arbitrary ergodic sources is as follows (McMillan, 1953).

Theorem 2 (Shannon-McMillan (AEP)). For $\epsilon > 0$, the typical set $A_\epsilon^n$ with respect to the ergodic process $P_U(u)$ is the set of sequences $u = (u_1, u_2, \dots, u_n) \in U^n$ obeying:
1. $\Pr[u \in A_\epsilon^n] > 1 - \epsilon$, for $n$ sufficiently large;
2. $2^{-n(H+\epsilon)} \le P_U(u) \le 2^{-n(H-\epsilon)}$.
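Both defining properties can be verified empirically. The sketch below (an illustrative assumption: an i.i.d. Bernoulli source, the simplest ergodic case, with hypothetical parameters $p = 0.2$ and $\epsilon = 0.05$) tests property 2 directly on sampled sequences, since $2^{-n(H+\epsilon)} \le P_U(u) \le 2^{-n(H-\epsilon)}$ is equivalent to $|-\frac{1}{n}\log_2 P_U(u) - H| \le \epsilon$, and estimates $\Pr[u \in A_\epsilon^n]$ by Monte-Carlo, showing it approach 1 as $n$ grows, as property 1 predicts.

```python
import numpy as np

rng = np.random.default_rng(2)

# Bernoulli(p) i.i.d. source; its entropy rate is the binary entropy H(p).
p, eps = 0.2, 0.05
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def prob_typical(n, trials=5_000):
    """Monte-Carlo estimate of Pr[u^n in A_eps^n]."""
    u = rng.random((trials, n)) < p                 # sampled length-n sequences
    k = u.sum(axis=1)                               # number of ones per sequence
    # -(1/n) log2 P_U(u^n) for a Bernoulli sequence with k ones:
    nlp = -(k * np.log2(p) + (n - k) * np.log2(1 - p)) / n
    return np.mean(np.abs(nlp - H) <= eps)          # membership test (property 2)

small, large = prob_typical(50), prob_typical(2000)
```

At $n = 50$ only a minority of sequences are typical, while at $n = 2000$ nearly all are: the typical set captures almost all the probability mass even though it contains only roughly $2^{nH}$ of the $2^n$ possible sequences.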

