BIG LEARNING: A UNIVERSAL MACHINE LEARNING PARADIGM?

Abstract

Recent breakthroughs based on big/foundation models reveal a vague avenue for AI, that is, big data, big/foundation models, big learning, … Following that avenue, here we elaborate on our newly introduced big learning. Specifically, big learning exhaustively exploits the information/tasks inherent in its large-scale complete/incomplete training data, by learning to simultaneously model many/all joint/conditional/marginal data distributions (thus the name big learning) with one universal foundation model. We reveal that big learning is what existing foundation models are implicitly doing; accordingly, our big learning provides high-level guidance for the flexible design and improvement of foundation models. Moreover, big learning (i) offers great flexibility for complete/incomplete training data and for customizing trustworthy data tasks; (ii) potentially delivers all joint/conditional/marginal data capabilities after training; (iii) significantly reduces the training-test gap with improved model generalization; and (iv) potentially unifies conventional machine learning paradigms and enables their flexible cooperation, manifesting itself as a universal learning paradigm. Preliminary experiments verify the effectiveness of the presented big learning.
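To make the above description concrete, the following is a minimal, self-contained sketch (in PyTorch) of the kind of training loop the abstract implies: a single model is repeatedly trained to predict a randomly chosen target subset of each data sample from a randomly chosen observed subset, i.e., to simultaneously model many conditional distributions p(x_target | x_observed). The architecture, the helper names (ToyJointModel, sample_task, big_learning_step), and the toy squared-error objective are illustrative assumptions for exposition only, not the paper's actual design.

    # Illustrative sketch only: one model trained on many randomly sampled
    # conditional tasks (a generalized mask-and-predict over data subsets).
    # All names and the toy objective are assumptions, not the paper's method.
    import torch
    import torch.nn as nn

    D = 16  # number of dimensions per data sample (toy setting)

    class ToyJointModel(nn.Module):
        """Toy conditional model: predicts all D dimensions from a masked
        input plus a binary mask indicating which dimensions are observed."""
        def __init__(self, dim: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
            )

        def forward(self, x_masked: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
            return self.net(torch.cat([x_masked, mask], dim=-1))

    def sample_task(batch: torch.Tensor):
        """Sample a random conditional task: which dimensions are observed
        (source) and which are to be predicted (target)."""
        # Each sample gets its own random observed fraction, yielding diverse tasks.
        mask = (torch.rand_like(batch) < torch.rand(batch.size(0), 1)).float()
        x_masked = batch * mask  # unobserved dimensions are zeroed out
        return x_masked, mask

    def big_learning_step(model, batch, optimizer):
        """One training step on one randomly sampled conditional task."""
        x_masked, mask = sample_task(batch)
        pred = model(x_masked, mask)
        # Penalize errors only on unobserved (target) dims: model p(x_target | x_observed).
        target_weight = 1.0 - mask
        loss = ((pred - batch) ** 2 * target_weight).sum() / target_weight.sum().clamp(min=1.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    if __name__ == "__main__":
        model = ToyJointModel(D)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for step in range(100):
            batch = torch.randn(32, D)  # stand-in for real training data
            big_learning_step(model, batch, opt)

Because the observed/target subsets are resampled at every step, the single model is exposed to many conditional data tasks over the course of training, which is the intuition behind "simultaneously modeling many/all joint/conditional/marginal data distributions" with one foundation model.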

1. INTRODUCTION

AI is undergoing a paradigm shift with the rise of big/foundation models (Bommasani et al., 2021; Yuan et al., 2022), e.g., BERT (Stickland & Murray, 2019), GPT-3 (Brown et al., 2020), DALL-Es (Ramesh et al., 2021; 2022), MAE (He et al., 2021), etc. Foundation models, often based on mask-and-predict pretraining and downstream finetuning, are capable of benefiting from pretraining on broad data at scale and, accordingly, demonstrate diverse downstream task capabilities with impressive robustness (Stickland & Murray, 2019), adaptability (He et al., 2021), and generalization (Ramesh et al., 2021). Therefore, they are rapidly being integrated into real-world AI systems, e.g., BERT into Google search, Codex (Chen et al., 2021a) into GitHub's Copilot, etc.

Despite the impressive capabilities and characteristics of foundation models, a unified theoretical framework justifying their successes remains missing (Bommasani et al., 2021; Yuan et al., 2022); such a framework is crucial for their further improvement and is likely a milestone for the foundation-model community (Tamkin et al., 2021). To address that challenge, we first notice that the successes of foundation models are mainly attributed to the following two properties, in addition to increasingly powerful parallel computing techniques.

• Data comprehensiveness. Foundation models are not picky about their training data and therefore embrace flexible training on massive, easily accessible data with great diversity (e.g., data crawled from the Internet). These training data, thanks to their massiveness, diversity, and minimal human intervention during collection, are likely more consistent with the "true" data distribution that underlies both training and test phases, leading to a narrowed training-test gap from the data perspective and serving as one reason for the improved generalization and robustness of foundation models.

• Task comprehensiveness. A foundation model is often pretrained in a massive-task manner on a wealth of data tasks (like mask-and-predict), which can be flexibly specified as modeling some conditional data distributions across potentially diverse domains (see Section 3 for details). Such massive-task and potentially diverse pretraining of foundation models narrows the pretraining-finetuning/training-test gap from the learning perspective, i.e., it's likely the downstream task

