QAID: QUESTION ANSWERING INSPIRED FEW-SHOT INTENT DETECTION

Abstract

Intent detection with semantically similar fine-grained intents is a challenging task. To address it, we reformulate intent detection as a question-answering retrieval task by treating utterances and intent names as questions and answers. To that end, we utilize a question-answering retrieval architecture and adopt a two-stage training schema with batch contrastive loss. In the pre-training stage, we improve query representations through self-supervised training. Then, in the fine-tuning stage, we increase contextualized token-level similarity scores between queries and answers from the same intent. Our results on three few-shot intent detection benchmarks achieve state-of-the-art performance.

1. INTRODUCTION

Intent detection (ID) is the task of classifying an incoming user query into one class from a set of mutually-exclusive classes, a.k.a. intents (Wang et al., 2014; Schuurmans & Frasincar, 2019; Liu et al., 2019a). This ability is a cornerstone for task-oriented dialogue systems, as correctly identifying the user intent at the beginning of an interaction is crucial to its success. However, labeled data is required for training, and manual annotation is costly. This calls for sample-efficient methods that gain high accuracy with minimal amounts of labeled data. Recent works tackling few-shot ID have relied on large-scale pre-trained language models, such as BERT (Devlin et al., 2018). These works leverage task-adaptive training and focus on pre-training a model on a large open-domain dialogue corpus and fine-tuning it for ID classification (Mehri et al., 2020; Wu et al., 2020a; Casanueva et al., 2020; Zhang et al., 2021a). Alternative approaches tried to learn query representations based on query-to-query matching (henceforth, Match-QQ systems) (Zhang et al., 2020; Mass et al., 2020; Mehri et al., 2021). Zhang et al. (2020); Mass et al. (2020) adopt pairwise-encoding systems with cross-attention to deploy a K-Nearest-Neighbor (K-NN) (Fix & Hodges, 1989) classification schema where training queries are fully utilized in both the training and inference stages. Nevertheless, these methods' downside is their processing time, combined with the difficulty of scaling to a large number of intents (Liu et al., 2021c). The need to efficiently compare an incoming query to a large set of possible answers resides at the core of any question answering (QA) retrieval system (henceforth, Match-QA systems) (Karpukhin et al., 2020).
Recently, Khattab & Zaharia (2020) introduced ColBERT, which allows faster training and inference by replacing the cross-attention mechanism used by Match-QQ systems (Zhang et al., 2020; Mass et al., 2020; Nogueira & Cho, 2019) with a fast contextualized token-level similarity mechanism dubbed late interaction. In this work, we present a Question Answering inspired Intent Detection system, named QAID. We start by formulating the ID task as a question-answering retrieval task by treating the utterances and the intent names as queries and answers, respectively. This reformulation allows us to introduce valuable additional signal from the intent names. Then, we adapt the efficient architecture of ColBERT while replacing its triplet loss function with a batch contrastive loss, which has been shown to be more robust (Khosla et al., 2020) and to perform well in various tasks (Gunel et al., 2021; Gao et al., 2021a), including ID classification (Zhang et al., 2021b). In contrast to ColBERT, which compares a query to a pair of positive and negative documents, we also include queries as positive examples, and so we compare the queries both to their answers and to other queries from the same intent. This allows QAID to represent queries and answers of the same intent similarly. Therefore, our training method assumes the settings of both Match-QQ and Match-QA. At inference, QAID relies on the token-level similarity (late interaction) mechanism between an incoming query and all intent names for its predictions (Khattab & Zaharia, 2020). Our contribution is thus threefold. (1) We show that few-shot intent detection can be successfully handled by QA systems when letting the intent name play the role of the answer. (2) We show how intent detection architectures can benefit from recent advancements in supervised batch contrastive training and late-interaction scores. (3) We report state-of-the-art results on three few-shot intent detection benchmarks.

2. METHOD

Our method addresses the few-shot intent detection task, in which we have C defined intents and the task is to classify an incoming user query, q, into one of the C classes. In our formulation, upon getting a new user query q, we need to retrieve the most suited intent name. We set balanced K-shot learning for each intent (Mehri et al., 2020; Casanueva et al., 2020; Zhang et al., 2020), i.e., the training data contains K examples per intent.¹ In the following section, we describe the structure of our QAID framework and its training stages. First, in Section 2.1 we elaborate on the different components of QAID. Then, in Section 2.2 we present the two training stages: the self-supervised contrastive pre-training in 2.2.1 and the supervised batch contrastive fine-tuning in 2.2.2. Lastly, in Section 2.3 we briefly touch on our decision to formulate ID as a question retrieval task.

2.1. REPRESENTATION LEARNING FRAMEWORK

The main components of our framework are:

• Data Augmentation module, Aug(⋅). For each input query q, we generate two random augmentations, q̃ = Aug(q), each of which represents a different view of the input q. For our augmentation we use the combination of two simple and intuitive 'corruption' techniques (Gao et al., 2021a; Wu et al., 2020b; Liu et al., 2021b): (i) randomly masking tokens from q (Devlin et al., 2018); (ii) dropping a small subset of neurons and representation dimensions. Technique (i) is applied before passing the query to the encoder, and technique (ii) is applied during the forward propagation through the encoder model.

• Encoder model, Enc(⋅), which maps a query q, consisting of tokens q_1, ..., q_m, to Enc(q) ∈ R^(m×D_E), where D_E is the embedding dimension; in our experiments, it is either 768 or 1024.

• Projection layer, Proj(⋅), a single linear layer that maps vectors of dimension D_E to vectors of dimension D_P = 128, followed by normalization to the unit hypersphere.

• Token-level score, Score(⋅, ⋅). Given two queries u = (u_1, ..., u_m) and v = (v_1, ..., v_l), the relevance score of u with respect to v, denoted Score(u, v), is calculated by the late interaction between their bags of projected contextualized representations, z(u) = Proj(Enc(u)), namely the sum of the maximum token-wise cosine similarities of their projected representations (Khattab & Zaharia, 2020). Equation 1 shows the formulation of this score:

Score(u, v) = Σ_{i∈[m]} max_{j∈[l]} z(u)_i ⋅ z(v)_j    (1)
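The late-interaction score in Equation 1 can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation; it assumes the two inputs are already projected and L2-normalized token-embedding bags, so the dot products below are cosine similarities:

```python
import torch

def late_interaction_score(z_u: torch.Tensor, z_v: torch.Tensor) -> torch.Tensor:
    """MaxSim late-interaction score (Equation 1).

    z_u: (m, D_P) projected, L2-normalized token embeddings of query u
    z_v: (l, D_P) projected, L2-normalized token embeddings of query v
    Returns a scalar: for each token of u, the maximum cosine similarity
    against any token of v, summed over u's tokens.
    """
    # (m, l) matrix of token-wise cosine similarities
    sim = z_u @ z_v.T
    # best-matching v token per u token, then sum over u tokens
    return sim.max(dim=1).values.sum()
```

Note that the score is asymmetric: it sums over the tokens of the first argument, so Score(u, v) and Score(v, u) generally differ when m ≠ l.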

2.2. TWO-STAGE CONTRASTIVE TRAINING

In both stages, given a batch of input samples Q = (q_1, ..., q_n), we first apply Aug followed by the encoder and projection layer, denoted by the z(⋅) function described in the last section, so that we have X_Q = z(Aug(Q)) ∈ R^(2n×D_P), where 2n results from the two random augmentations applied to each query. In the self-supervised training, each two augmented queries are the only
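The batch construction described above can be sketched as follows. This is an illustrative sketch, not the paper's code: `token_mask` stands in for corruption technique (i), the `mask_id` value and masking rate are hypothetical choices, and dropout inside the (placeholder) encoder supplies technique (ii):

```python
import torch
import torch.nn.functional as F

def token_mask(q: torch.Tensor, mask_id: int = 0, p: float = 0.15) -> torch.Tensor:
    """Technique (i): randomly replace a fraction p of token ids with a mask id."""
    mask = torch.rand(q.shape) < p
    return torch.where(mask, torch.full_like(q, mask_id), q)

def build_batch(Q: torch.Tensor, encoder, proj) -> torch.Tensor:
    """Produce X_Q = z(Aug(Q)): two augmented, projected, normalized views per query.

    Q: (n, m) token ids. Returns a tensor of shape (2n, m, D_P); the first n
    rows are the first view of each query, the last n rows the second view.
    """
    views = torch.cat([token_mask(Q), token_mask(Q)], dim=0)  # (2n, m) token ids
    reps = proj(encoder(views))                               # (2n, m, D_P)
    return F.normalize(reps, dim=-1)  # project onto the unit hypersphere
```

Because each query contributes two independently corrupted views, row i and row n + i of the output correspond to the same underlying query.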



¹ In the rest of the paper, we refer to the intent examples as queries and use intents and classes interchangeably.

