Q-MATCH: SELF-SUPERVISED LEARNING FOR TABULAR DATA BY MATCHING DISTRIBUTIONS INDUCED BY A QUEUE

Abstract

In semi-supervised learning, student-teacher distribution matching has been successful in improving the performance of models that use unlabeled data in conjunction with few labeled samples. In this paper, we aim to replicate that success in the self-supervised setup, where we do not have access to any labeled data during pre-training. We show it is possible to induce the student-teacher distributions without any knowledge of downstream classes by using a queue of embeddings of samples from the unlabeled dataset. We show that Q-Match outperforms previous self-supervised learning techniques on tabular datasets when measuring downstream classification performance. Furthermore, we show that our method is sample efficient, both in the number of labels required for downstream task training and in the amount of unlabeled data required for pre-training.

1. INTRODUCTION

Tabular data is the most common form of data for problems in industry. While many robust techniques exist to solve real-world machine learning problems on tabular data, most of these techniques require access to labels. Leveraging unlabeled data to learn good representations remains a key open problem in the tabular domain. In this work, we propose a flexible and powerful deep learning framework that helps us use unlabeled data in the tabular domain.

Deep learning has been successful in processing data in many different domains, such as images, audio, and text. Learning non-linear features with deep architectures has been shown to be a key factor in improving performance across a wide variety of problems, such as image recognition (Krizhevsky et al., 2017), speech recognition (Deng et al., 2013), and machine translation (Singh et al., 2017). Until recently, achieving state-of-the-art performance would not have been possible without large, manually annotated datasets. However, a class of learning algorithms called self-supervised learning (Wu et al., 2018; Devlin et al., 2018) has shown that highly performant features can be learned without large labeled datasets as well. In this work, we propose a new self-supervised learning algorithm and study its effectiveness in the tabular data domain.

In self-supervised learning, a task known as the pretext task is first solved on a dataset that is typically large and unlabeled. The aim of self-supervised learning is to learn an encoding function f, parameterized by θ, that captures variances and invariances in the dataset without the need for human-annotated labels. This initial training is usually referred to as the pre-training stage. The parameters learned from the pretext task during pre-training are then used to solve a new objective called the downstream task.
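The two-stage pipeline above (pretext pre-training, then a downstream fit on the frozen encoder) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the data, the one-layer ReLU encoder, and the least-squares linear probe are hypothetical stand-ins, and the "pre-trained" weights are simply random placeholders where a real pretext objective would have been optimized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a large unlabeled pre-training set and a small
# labeled downstream set, both with input dimensionality d = 8.
X_unlabeled = rng.normal(size=(256, 8))
X_labeled = rng.normal(size=(32, 8))
y_labeled = rng.integers(0, 2, size=32)

def encoder(x, theta):
    """f: maps inputs in R^d to embeddings, via parameters theta."""
    return np.maximum(x @ theta, 0.0)  # one-layer ReLU encoder (toy)

# Stage 1 (pretext): a real method would learn theta from X_unlabeled
# alone; here random weights stand in for the pre-trained parameters.
theta = rng.normal(size=(8, 4)) * 0.1

# Stage 2 (downstream): freeze theta and fit a linear classifier
# (a least-squares probe onto one-hot labels) on the embeddings.
Z = encoder(X_labeled, theta)
W, *_ = np.linalg.lstsq(Z, np.eye(2)[y_labeled], rcond=None)
preds = (Z @ W).argmax(axis=1)
accuracy = (preds == y_labeled).mean()
```

Alternatively, the downstream stage can fine-tune θ itself rather than freezing it; the probe variant is shown here only because it isolates the quality of the pre-trained representation.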
Concretely, f is a function with parameters θ that maps a sample x from the input dimensionality d to an embedding of size e, that is, f : (X, θ) → R^e, where x is a single data point in the input space X. After the pretext algorithm updates the model parameters θ during the pre-training stage, the downstream data (typically a manually annotated dataset) is used either to fine-tune θ or to learn the parameters of a linear classifier on the output of f. Recently proposed self-supervised approaches for tabular data (Darabi et al., 2021; Yoon et al., 2020; Arik & Pfister, 2021; Lee et al., 2020) have shown encouraging results.

In this work, we propose a novel method for self-supervised learning called Q-Match, which is closely related to the semi-supervised learning framework FixMatch (Sohn et al., 2020). For labeled data, FixMatch uses the standard supervised learning loss. For unlabeled data, FixMatch proposes to match the student and teacher distributions over the set of classes used in the downstream task. In a

