Q-MATCH: SELF-SUPERVISED LEARNING FOR TABULAR DATA BY MATCHING DISTRIBUTIONS INDUCED BY A QUEUE

Abstract

In semi-supervised learning, student-teacher distribution matching has been successful in improving the performance of models that use unlabeled data in conjunction with few labeled samples. In this paper, we aim to replicate that success in the self-supervised setup, where we do not have access to any labeled data during pre-training. We show it is possible to induce the student-teacher distributions without any knowledge of downstream classes by using a queue of embeddings of samples from the unlabeled dataset. We show that Q-Match outperforms previous self-supervised learning techniques on tabular datasets when measuring downstream classification performance. Furthermore, we show that our method is sample efficient, both in the number of labels required for downstream task training and in the amount of unlabeled data required for pre-training.

1. INTRODUCTION

Tabular data is the most common form of data for problems in industry. While many robust techniques exist to solve real-world machine learning problems on tabular data, most of these techniques require access to labels. Leveraging unlabeled data to learn good representations remains a key open problem in the tabular domain. In this work, we propose a flexible and powerful deep learning framework that helps us use unlabeled data in the tabular domain.

Deep learning has been successful in processing data in many different domains such as images, audio, and text. Learning non-linear features using deep architectures has been shown to be key to improving performance across a wide variety of problems such as image recognition (Krizhevsky et al., 2017), speech recognition (Deng et al., 2013), and machine translation (Singh et al., 2017). Until recently, achieving state-of-the-art performance would not have been possible without large, manually annotated datasets. However, a class of learning algorithms called self-supervised learning (Wu et al., 2018; Devlin et al., 2018) has shown that highly performant features can be learned without large labeled datasets as well. In this work, we propose a new self-supervised learning algorithm and study its effectiveness on tabular data.

In self-supervised learning, a task known as the pretext task is first solved on a dataset that is typically large and unlabeled. The aim of self-supervised learning is to learn an encoding function f, parameterized by θ, that captures variances and invariances in the dataset without the need for human-annotated labels. This initial training is usually referred to as the pre-training stage. The parameters learned from the pretext task during pre-training are then used to solve a new objective called the downstream task.
Concretely, f is a function with parameters θ that maps a sample x from the input space X to a d-dimensional embedding, that is, f : (X, θ) → R^d, where x is a single data point in the input space X. After the pretext algorithm updates the model parameters θ during the pre-training stage, the downstream data (typically a manually annotated dataset) is used either to fine-tune θ or to learn the parameters of a linear classifier on the output of f. Recently proposed self-supervised approaches for tabular data (Darabi et al., 2021; Yoon et al., 2020; Arik & Pfister, 2021; Lee et al., 2020) have shown encouraging results.

In this work, we propose a novel method for self-supervised learning called Q-Match, which is closely related to the semi-supervised learning framework FixMatch (Sohn et al., 2020). For labeled data, FixMatch uses the standard supervised learning loss. For unlabeled data, FixMatch proposes to match the student and teacher distributions over the set of classes used in the downstream task. In a self-supervised setup, however, we have no access to the relevant classes or to any labeled data during pre-training. For this reason, Q-Match uses a queue of embeddings instead of known classes to induce the teacher and student distributions. The queue is implemented similarly to MoCo (He et al., 2020) and NNCLR (Dwibedi et al., 2021): as training proceeds, a list of embeddings is updated with newer embeddings while older ones are discarded. We use individual samples (via their embeddings) to generate the student and teacher distributions required for training. We find that Q-Match not only improves the final performance of the downstream model, but also reduces the number of labeled samples required for the downstream task.
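The two-stage protocol described above (pre-train the encoder f, then evaluate by fitting a linear classifier on its frozen outputs) can be sketched as follows. This is only an illustration of the evaluation setup: the random-weight encoder and the synthetic data are stand-ins, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained encoder f(x; theta). In practice theta comes
# from the pretext stage; here a fixed random linear map + ReLU suffices
# to illustrate the protocol.
d_in, d_emb, n_classes = 20, 8, 3
W_enc = rng.normal(size=(d_in, d_emb))
encode = lambda x: np.maximum(x @ W_enc, 0.0)  # frozen features

# Downstream stage: learn a linear classifier on the frozen embeddings.
x = rng.normal(size=(128, d_in))
y = rng.integers(0, n_classes, size=128)
feats = encode(x)                 # encoder is NOT updated here

W = np.zeros((d_emb, n_classes))  # linear classifier weights
onehot = np.eye(n_classes)[y]
for _ in range(200):              # plain gradient descent on softmax CE
    logits = feats @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.1 * feats.T @ (p - onehot) / len(x)

acc = (np.argmax(feats @ W, axis=1) == y).mean()
```

Fine-tuning would instead let gradients flow back into the encoder parameters θ as well.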

2. PROPOSED APPROACH

Motivation. Our self-supervised approach is based on the success of the semi-supervised learning algorithm FixMatch (Sohn et al., 2020) . In their method, the authors show it is possible to leverage a large unlabeled dataset and a few labeled samples to improve the performance of the model on a downstream task. They do so by matching the class distributions of the student and teacher views produced by augmenting the input in two different ways. We hypothesize that it might be possible to adapt their framework to the self-supervised setup by removing the dependency on the known classes during training. To do so, we keep a queue of past embeddings that can serve as a proxy for classes. We use this queue to produce the target and student distributions used to train a model. The training then proceeds by performing continuous knowledge distillation from the teacher to the student model such that the student ultimately learns to predict the distribution induced by the teacher.

Method.

In Figure 1, we outline the pretext training approach used in Q-Match. We corrupt the input x_i two times independently using the method proposed in VIME (Yoon et al., 2020) to produce the student view x_{i,s} and the teacher view x_{i,t}. We pass x_{i,s} through the student model to produce the student embedding z_{i,s}. Similarly, we pass x_{i,t} through the teacher model to produce the teacher embedding z_{i,t}. We want this pair of embeddings to induce similar probability distributions over a representative set of samples of the dataset. In other words, we want our encoder to maintain similar relationships between different data samples in spite of the corruption introduced while generating the views. As the dataset can be quite large, we maintain a fixed-size queue Q of past embeddings, as in MoCo (He et al., 2020) and NNCLR (Dwibedi et al., 2021). A stop-gradient is applied to the teacher embeddings, so only the student weights are updated at each iteration. We multiply the student embeddings and the teacher embeddings with the embeddings in the queue to produce the student and teacher logits. Taking the softmax of these logits produces the student distribution p_{i,s} and the teacher distribution p_{i,t}, respectively. The teacher distribution p_{i,t} is the target distribution which the student distribution p_{i,s} should match. We define the distribution matching loss as follows:

L^{QM}_i = H(p_{i,t}, p_{i,s}) = H( softmax(z_{i,t} · Q / τ_t), softmax(z_{i,s} · Q / τ_s) )

where τ_t is a scalar temperature that controls the sharpness of the distribution produced by the teacher, τ_s is a scalar temperature that controls the sharpness of the student distribution, Q ∈ R^{d×m} is the queue of the m most recent teacher embeddings, and H(p, q) is the cross-entropy between two distributions p and q. We L2-normalize the embeddings z_{i,s} and z_{i,t} before computing the logits.
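A minimal NumPy sketch of the loss above, assuming embeddings are stored row-wise and the queue as a d×m matrix. Function names and the default temperature values are illustrative assumptions, not taken from the paper; the stop-gradient on the teacher is implicit here because plain NumPy arrays carry no gradients.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Normalize each embedding to unit L2 norm.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def q_match_loss(z_s, z_t, queue, tau_s=0.1, tau_t=0.04):
    """Cross-entropy H(p_t, p_s) between teacher and student distributions
    over the queue entries.

    z_s:   (batch, d) student embeddings
    z_t:   (batch, d) teacher embeddings (targets; stop-gradient in training)
    queue: (d, m)     past teacher embeddings
    """
    z_s = l2_normalize(z_s)
    z_t = l2_normalize(z_t)
    p_s = softmax(z_s @ queue / tau_s)  # student distribution over queue
    p_t = softmax(z_t @ queue / tau_t)  # teacher (target) distribution
    # H(p_t, p_s), averaged over the batch.
    return -np.mean(np.sum(p_t * np.log(p_s + 1e-12), axis=-1))
```

Note that a sharper teacher (τ_t < τ_s), as in the defaults above, makes the target distribution more peaked around the nearest queue entries.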
The queue has constant size and is refreshed at every training iteration: the previous batch of teacher embeddings is appended while the oldest embeddings are removed. The parameters of the teacher model are updated using an exponential moving average (EMA) of the student model parameters. Empirically, we observe that corrupting the teacher view with a smaller probability leads to better performance (discussed later in Section 3.3).

Model Architecture. In all experiments, we use an MLP as our encoding function f. We follow the same architecture as i-Mix (Lee et al., 2020), which consists of five fully connected layers (2048-2048-4096-4096-8192). The final layer uses a 4-set max-out activation (Goodfellow et al., 2013). All layers except the output layer use batch normalization followed by a ReLU. We use a linear projector of 128 dimensions on top of the encoder in all our experiments. Similar to i-Mix, we also tried adding a 2-layer (512-128) MLP projection head, but we found that it performed similarly to the model without the projection head during Q-Match training.

Data Preprocessing. To compute normalization statistics, we place a batch normalization layer (without the learnable scale and bias parameters) just after the input layer. During evaluation, the accumulated exponential moving average statistics are applied to the input features.
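The queue refresh and EMA teacher update described above can be sketched as follows. The helper names and the decay value are illustrative assumptions; the queue is stored as a d×m matrix to match the notation above, with the newest batch of teacher embeddings replacing the oldest columns.

```python
import numpy as np

def ema_update(teacher_params, student_params, decay=0.999):
    """Teacher <- decay * teacher + (1 - decay) * student, per parameter array."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

def refresh_queue(queue, new_embeddings):
    """FIFO refresh of a (d, m) queue: drop the oldest columns, append the
    newest batch of teacher embeddings (shape (batch, d))."""
    b = new_embeddings.shape[0]
    return np.concatenate([queue[:, b:], new_embeddings.T], axis=1)
```

One training iteration would thus compute the loss, update the student by gradient descent, then call `ema_update` on the teacher and `refresh_queue` with the current batch of teacher embeddings.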

