BIG LEARNING: A UNIVERSAL MACHINE LEARNING PARADIGM?

Abstract

Recent breakthroughs based on big/foundation models reveal a vague avenue for AI, that is, big data, big/foundation models, big learning, • • • . Following that avenue, here we elaborate on our newly introduced big learning. Specifically, big learning exhaustively exploits the information/tasks inherent in its large-scale complete/incomplete training data, by learning to simultaneously model many/all joint/conditional/marginal data distributions (thus named big learning) with one universal foundation model. We reveal that big learning is what existing foundation models are implicitly doing; accordingly, our big learning provides high-level guidance for the flexible design and improvement of foundation models. Besides, big learning (i) is equipped with great flexibility for complete/incomplete training data and for customizing trustworthy data tasks; (ii) potentially delivers all joint/conditional/marginal data capabilities after training; (iii) significantly reduces the training-test gap with improved model generalization; and (iv) potentially unifies conventional machine learning paradigms and enables their flexible cooperation, manifested as a universal learning paradigm. Preliminary experiments verify the effectiveness of the presented big learning.

1. INTRODUCTION

AI is undergoing a paradigm shift with the rise of big/foundation models (Bommasani et al., 2021; Yuan et al., 2022), e.g., BERT (Stickland & Murray, 2019), GPT-3 (Brown et al., 2020), DALL-Es (Ramesh et al., 2021; 2022), MAE (He et al., 2021), etc. Foundation models, often based on mask-and-predict pretraining and downstream finetuning, are capable of benefiting from pretraining on broad data at scale and, accordingly, demonstrate diverse downstream task capabilities with impressive robustness (Stickland & Murray, 2019), adaptability (He et al., 2021), and generalization (Ramesh et al., 2021). Therefore, they are rapidly being integrated into real-world AI systems, e.g., BERT into Google search, Codex (Chen et al., 2021a) into GitHub's Copilot, etc.

Despite the impressive capabilities and characteristics of foundation models, a unified theoretical framework justifying their successes remains missing (Bommasani et al., 2021; Yuan et al., 2022), which is crucial for their further improvements and is likely a milestone for the foundation model community (Tamkin et al., 2021). To address that challenge, we first notice that the successes of foundation models are mainly attributed to the following two properties, in addition to increasingly powerful parallel computing techniques.

• Data comprehensiveness. Foundation models are not picky about their training data and therefore embrace flexible training on massive, easily accessible data with great diversity (e.g., those crawled from the Internet). These training data, thanks to their massiveness, diversity, and minimal human interventions in the collection, are likely more consistent with the "true" data distribution that underlies both training and test phases, leading to a narrowed training-test gap from the data perspective and serving as one reason for the improved generalization and robustness of foundation models.

• Task comprehensiveness. A foundation model is often pretrained in a massive-task manner on a wealth of data tasks (like mask-and-predict), which can be flexibly specified as modeling some conditional data distributions across potentially diverse domains (see Section 3 for details). Such massive-task and potentially diverse pretraining of foundation models narrows the pretraining-finetuning/training-test gap from the learning perspective, i.e., a downstream task is likely to resemble a pretraining one, and therefore contributes to the successes of foundation models. Moreover, the massive-task pretraining may encourage learning compositional intrinsic meta-knowledge encoded in the model parameters (Lu et al., 2021; Aghajanyan et al., 2021), which may hold the key for out-of-distribution generalization (Bommasani et al., 2021).

Based on the above observations and by reviewing the development history of deep learning, we perceive a vague avenue for AI, that is, big data, big/foundation models, big learning, • • • . Specifically, one leverages big data to comprehensively represent the underlying data distribution, develops big/foundation models to serve as a big information "container," relies on big learning to comprehensively and exhaustively convey data information into that container, and so on. Accordingly, different from existing machine learning paradigms that exploit only limited information contained in training data, we present big learning for exhaustive data information exploitation, following that AI avenue. The presented big learning further strengthens the above-mentioned data and task comprehensiveness by leveraging a universal foundation model to simultaneously model many/all joint/conditional/marginal data distributions (across potentially diverse domains), manifested as a "big" training task that exhaustively exploits the data information.
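The "big" training task above can be sketched as sampling, for each complete data sample, a random split of its dimensions into a conditioning (source) subset and a prediction (target) subset, so that one model is trained across many joint/conditional/marginal tasks. The following is a minimal toy sketch under assumed interfaces; `sample_task` and `toy_predict` are hypothetical placeholders (a real instantiation would use a foundation model in place of `toy_predict`).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(x, rng):
    """Sample a random (source, target) index split of the L dimensions,
    turning one complete sample into one of many conditional tasks.
    k = 0 yields an unconditional (joint/marginal) modeling task."""
    L = x.shape[-1]
    perm = rng.permutation(L)
    k = int(rng.integers(0, L + 1))
    return perm[:k], perm[k:]

def toy_predict(x, src, tgt):
    """Placeholder 'model': predict each held-out dimension as the mean of
    the observed dimensions (0.0 when nothing is observed)."""
    ctx = x[src].mean() if len(src) else 0.0
    return np.full(len(tgt), ctx)

x = rng.normal(size=8)                        # one complete sample with L = 8 dims
src, tgt = sample_task(x, rng)                # a randomly drawn conditional task
pred = toy_predict(x, src, tgt)
loss = float(((pred - x[tgt]) ** 2).mean())   # minimized over many random splits
```

Training then repeats this split-predict-update loop over the dataset, so that the one set of parameters serves all sampled conditionals simultaneously.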
Such big learning behavior closely resembles the fundamental unconscious mind and the vision system of human brains, which are excellent at comprehensive information exploitation in a multitasking manner (Bargh & Morsella, 2008; Mesquita, 2015; Ludwig et al., 2014; Saarela & Landy, 2015). Our big learning makes three main contributions.

• It serves as a theoretical platform for analyzing, justifying, and improving big/foundation models, because most of them are implicitly doing (parts of) big learning, as revealed in Section 3.

• By modeling many/all joint/conditional/marginal data distributions, big learning (i) comprehensively exploits the available data information and embraces statistical sharing power to encourage summarizing intrinsic compositional meta-knowledge within model parameters and (ii) potentially delivers all joint/conditional/marginal data capabilities after training, which are of great value, e.g., for arbitrary data completion, flexible counter-factual analysis, and reasoning.

• It delivers extraordinary data and training-task flexibility by enabling large-scale training with complete/incomplete data on diverse learning tasks across different domains, leading to (i) minimal human interventions in data collection and learning-task specification, (ii) a significantly reduced training-test (or pretraining-finetuning) gap, and (iii) potentially a universal machine learning paradigm that unifies and enables cooperation among conventional ones.

2. RELATED WORK AND PRELIMINARY

Big/Foundation models. Taking shape in NLP, big/foundation models have drastically changed the research and practice of AI (Bommasani et al., 2021; Yuan et al., 2022) . BERT (Stickland & Murray, 2019) and GPT series (Radford et al., 2019; Brown et al., 2020) significantly accelerate the development of natural language processing, while models like DALL-Es (Ramesh et al., 2021; 2022) effectively promote interdisciplinary research among different research fields. Most foundation models are pretrained in a mask-and-predict manner, i.e., holding out a portion of the input followed by training the model to use the remaining parts to predict that held-out portion. We will reveal in Section 3 that such mask-and-predict pretraining is a special case of the proposed big learning, which accordingly reveals the underlying principle of foundation models and serves as a theoretical platform for their analysis, justification, and further improvements. 
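The mask-and-predict pretraining described above can be made concrete with a short sketch: a random fraction of token positions is replaced by a mask token, and the model is trained to recover the originals at those positions. In big-learning terms, this fixes one particular conditional of the joint distribution. The `MASK_ID` value and `corrupt` helper below are hypothetical illustrations, not any library's API.

```python
import numpy as np

MASK_ID = 0                          # hypothetical mask-token index
rng = np.random.default_rng(1)

def corrupt(tokens, mask_ratio, rng):
    """Replace a random mask_ratio fraction of tokens with MASK_ID.
    Returns the corrupted sequence, the masked positions, and the
    original tokens at those positions (the prediction targets)."""
    tokens = tokens.copy()
    L = len(tokens)
    idx = rng.permutation(L)[: int(round(mask_ratio * L))]
    targets = tokens[idx].copy()
    tokens[idx] = MASK_ID
    return tokens, idx, targets

seq = np.arange(1, 11)                          # toy token sequence, ids 1..10
corrupted, idx, targets = corrupt(seq, 0.3, rng)
# a model would now be trained to predict `targets` at positions `idx`
# given `corrupted` as input
```

BERT-style pretraining uses a small `mask_ratio` (around 0.15), while MAE masks a large fraction (around 0.75) of image patches; both are special cases of the subset-conditional tasks that big learning samples freely.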
Transformers and Vision Transformers (ViTs). Based on the self-attention mechanism (Vaswani et al., 2017), Transformers have been serving as the de facto model architecture for foundation models. Often Transformers (like BERT) take as input a sequence of discrete indexes x ∈ Z^L with length L and output the corresponding latent embedding h ∈ R^{L×D} with embedding dimension D for downstream applications; attention is computed among the L locations layer-wise. ViTs (Dosovitskiy et al., 2020) are Transformers modified for dealing with continuous images, which have been empirically shown to have better generalization and robustness than convolutional neural networks (Naseer et al., 2021). Different from Transformers, which embed discrete indexes into high-dimensional continuous features, ViTs directly employ flattened image patches as those features, as demonstrated in Fig. 1b. It is well known that Transformers/ViTs are over-parameterized (Lan et al.).
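The patch-as-token construction above can be sketched as follows: an H × W × C image is split into non-overlapping P × P tiles, each flattened into a vector, yielding the length-(H/P · W/P) sequence a ViT consumes in place of word embeddings. This is a minimal numpy sketch of the standard patchification; a real ViT would additionally apply a learned linear projection and add positional embeddings.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into non-overlapping (patch x patch) tiles
    and flatten each tile, producing a (L, patch*patch*C) token sequence
    with L = (H // patch) * (W // patch)."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    img = img.transpose(0, 2, 1, 3, 4)            # (H/P, W/P, P, P, C)
    return img.reshape(-1, patch * patch * C)     # (L, P*P*C)

tokens = patchify(np.zeros((32, 32, 3)), patch=8)  # -> shape (16, 192)
```

With P = 8 on a 32 × 32 × 3 image, this gives L = 16 tokens of dimension 192, which then play the role of the continuous input features discussed above.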

