KNOWLEDGE-DRIVEN ACTIVE LEARNING

Abstract

The deployment of Deep Learning (DL) models is still precluded in those contexts where the amount of supervised data is limited. To address this issue, active learning strategies aim at minimizing the amount of labelled data required to train a DL model. Most active strategies are based on uncertain sample selection, often restricted to samples lying close to the decision boundary. These techniques are theoretically sound, but an understanding of the selected samples based on their content is not straightforward, further driving non-experts to consider DL as a black box. For the first time, here we propose a different approach, taking into consideration common domain knowledge and enabling non-expert users to train a model with fewer samples. In our Knowledge-driven Active Learning (KAL) framework, rule-based knowledge is converted into logic constraints and their violation is checked as a natural guide for sample selection. We show that even simple relationships among data and output classes offer a way to spot predictions for which the model needs supervision. The proposed approach (i) outperforms many active learning strategies in terms of average F1 score, particularly in those contexts where domain knowledge is rich. Furthermore, we empirically demonstrate that (ii) KAL discovers data distributions lying far from the initial training data, unlike uncertainty-based strategies, (iii) it ensures domain experts that the provided knowledge is respected by the model on test data, and (iv) it can be employed even when domain knowledge is not available, by coupling it with an XAI technique. Finally, we show that KAL is also suitable for object recognition tasks and that its computational demand is low, unlike many recent active learning strategies.

1. INTRODUCTION

Deep Learning (DL) methods have achieved impressive results over the past decade in fields ranging from computer vision to text generation (LeCun et al., 2015). However, most of these contributions relied on overly data-intensive models (e.g., Transformers), trained on huge amounts of data (Marcus, 2018). With the advent of Big Data, sample collection no longer represents an issue; nonetheless, in some contexts the amount of supervised data is limited, and manual labelling can be expensive (Yu et al., 2015). Therefore, a common situation is the unlabelled pool scenario (McCallumzy & Nigamy, 1998), where many data are available, but only some are annotated. Historically, two strategies have been devised to tackle this situation: semi-supervised learning, which exploits the unlabelled data to enrich feature representations (Zhu & Goldberg, 2009), and active learning, which selects the smallest set of data whose annotation most improves model performance (Settles, 2009). The main assumption behind active learning strategies is that there exists a subset of samples that allows training a model to a similar accuracy as when fed with all training data. Iteratively, the strategy indicates the optimal samples to be annotated from the unlabelled pool. This is generally done by ranking the unlabelled samples w.r.t. a given measure, usually computed on the model predictions (Settles, 2009; Netzer et al., 2011; Wang & Shang, 2014) or on the input data distribution (Zhdanov, 2019; Santoro et al., 2017), and by selecting the samples associated with the highest rankings (Ren et al., 2021; Zhan et al., 2021). While theoretically sound, an understanding of the selected samples based on their content is not straightforward, in particular for non-ML experts.
This issue becomes particularly relevant when considering that Deep Neural Networks are already seen as black-box models (Gilpin et al., 2018; Das & Rad, 2020). On the contrary, we believe that neural models must be linked to the commonsense knowledge related to a given learning problem. Therefore, in this paper, we propose for the first time to exploit this symbolic knowledge in the selection process of an active learning strategy. This not only lowers the amount of supervised data required, but it also enables domain experts to train a model leveraging their knowledge. More precisely, we propose to compare the predictions over the unlabelled data with the available knowledge and to exploit the inconsistencies as a criterion for selecting the data to be annotated. Domain knowledge, indeed, can be expressed as First-Order Logic (FOL) clauses and translated into real-valued logic constraints (among other choices) by means of T-Norms (Klement et al., 2013) to assess its satisfaction (Gnecco et al., 2015; Diligenti et al., 2017; Melacci et al., 2021). In the experiments, we show that the proposed Knowledge-driven Active Learning (KAL) strategy (i) performs better (on average) than several standard active learning methods, particularly in those contexts where domain knowledge is rich. We empirically demonstrate (ii) that this is mainly due to the fact that the proposed strategy allows discovering data distributions lying far from the initial training data, unlike uncertainty-based approaches.
Furthermore, we show that (iii) the provided knowledge is acquired by the trained model and respected on unseen data, (iv) that KAL can also work on domains where no knowledge is available, by extracting knowledge from the model itself by means of an XAI technique, (v) that the KAL strategy can be easily employed also in the object-detection context, where standard uncertainty-based strategies are not straightforward to apply (Haussmann et al., 2020) and, finally, (vi) that it is not computationally expensive, unlike many recent methods. The paper is organized as follows: in Section 2 the proposed method is explained in detail, first with an example on inferring the XOR operation and then contextualized in more realistic active learning domains; the aforementioned experimental results on different datasets are reported in Section 3, comparing the proposed technique with several active learning strategies; in Section 4 the related work about active learning and about integrating reasoning with machine learning is briefly reviewed; finally, in Section 5 we conclude the paper by considering possible future work.

2. KNOWLEDGE-DRIVEN ACTIVE LEARNING

In this paper, we focus on classification problems with c classes, in which each input x ∈ X is associated with one or more classes. Let us consider the classification problem f : X → Y , where X ⊆ R^d represents the feature space of input dimensionality d, which may also comprise non-structured data (e.g., input images), and Y ⊆ {0, 1}^c is the output space composed of c ≥ 1 dimensions. More precisely, we consider a vector function f = [f_1, . . . , f_c], where each function f_i predicts the probability that x belongs to the i-th class. In the active learning context, we also define X_s ⊂ X as the portion of input data already associated with an annotation y_i ∈ Y_s ⊂ Y, and n as the size of the starting set of labelled data. At each iteration, a set of p samples X_p ⊂ X_u ⊂ X is selected by the active learning strategy to be annotated from X_u, the unlabelled data pool, and added to X_s. This process is repeated for q iterations, after which the training terminates. The maximum budget of annotations b therefore amounts to b = n + q · p. When considering an object-detection problem, for a given class i and a given image x_j, we consider as class membership probability the maximum score value among all predicted bounding boxes around the objects belonging to the i-th class. Formally, f_i(x_j) = max_{s ∈ S_i(x_j)} s, where S_i(x_j) is the set of the confidence scores of the bounding boxes predicting the i-th class for sample x_j. Let us also consider the case in which additional domain knowledge is available for the problem at hand, involving relationships between data and classes. By associating a logic predicate with each function f_i, First-Order Logic (FOL) becomes the natural way of describing these relationships.
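The pool-based loop described above can be sketched as follows. This is a minimal, hypothetical skeleton (the `rank` and `oracle` callables stand in for the selection strategy and the human annotator, and model retraining is elided); it only illustrates how the annotation budget b = n + q · p accumulates over the q iterations.

```python
import numpy as np

def active_learning_loop(pool, n, p, q, rank, oracle):
    """Pool-based active learning skeleton. `pool` is the unlabelled set X_u,
    `rank` scores a sample (higher = more useful to annotate) and `oracle`
    returns the label for a sample. Returns the final labelled set."""
    rng = np.random.default_rng(0)
    # Initial labelled set X_s of size n, drawn at random from the pool.
    idx = rng.choice(len(pool), size=n, replace=False)
    labelled = {int(i): oracle(pool[i]) for i in idx}
    for _ in range(q):  # q selection iterations
        unlab = [i for i in range(len(pool)) if i not in labelled]
        scores = [rank(pool[i]) for i in unlab]
        # Select the p highest-ranked unlabelled samples (the set X_p) ...
        picked = [unlab[j] for j in np.argsort(scores)[-p:]]
        # ... annotate them and add them to X_s.
        for i in picked:
            labelled[i] = oracle(pool[i])
        # (retraining of f on the enlarged labelled set would happen here)
    return labelled  # |labelled| == b == n + q * p

pool = np.linspace(0.0, 1.0, 100)
labels = active_learning_loop(pool, n=10, p=5, q=4,
                              rank=lambda x: x,            # toy ranking
                              oracle=lambda x: int(x > 0.5))
assert len(labels) == 10 + 4 * 5  # budget b = n + q*p = 30
```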
For example, ∀x ∈ X, x_1(x) ∧ x_2(x) ⇒ f(x), where x_1(x), x_2(x) respectively represent the logic predicates associated with the first and the second input features, meaning that when both predicates are true, the output function f(x) also needs to be true. Also, we can consider relations among classes, such as ∀x ∈ X, f_v(x) ∧ f_z(x) ⇒ f_u(x), meaning that the intersection between the v-th class and the z-th class is always included in the u-th one.
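To give a concrete sense of how a rule like ∀x, f_v(x) ∧ f_z(x) ⇒ f_u(x) can guide sample selection, the sketch below grounds it with the product t-norm: a ⇒ b is falsified as a · (1 − b), so the rule's degree of violation on a sample is f_v · f_z · (1 − f_u), which is largest when both premises are confidently predicted but the conclusion is not. This is an illustrative simplification of the KAL criterion, not the paper's exact formulation.

```python
import numpy as np

def implication_violation(fv, fz, fu):
    """Violation of the rule f_v AND f_z -> f_u under the product t-norm:
    maximal when both premises are near 1 and the conclusion is near 0."""
    return fv * fz * (1.0 - fu)

# Toy predictions for three unlabelled samples (columns: f_v, f_z, f_u).
preds = np.array([[0.9, 0.8, 0.1],   # premises high, conclusion low -> violated
                  [0.9, 0.8, 0.9],   # rule satisfied
                  [0.1, 0.2, 0.0]])  # premises false -> vacuously satisfied
viol = implication_violation(preds[:, 0], preds[:, 1], preds[:, 2])
# KAL-style selection: annotate the samples with the largest violation first.
order = np.argsort(-viol)
assert order[0] == 0  # the inconsistent prediction is ranked first
```

Ranking the unlabelled pool by this violation score singles out exactly the samples on which the model's predictions contradict the available knowledge.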

2.1. CONVERTING DOMAIN-KNOWLEDGE INTO LOSS FUNCTIONS

The Learning from Constraints framework (Gnecco et al., 2015; Diligenti et al., 2017) defines a way to convert domain knowledge into logic constraints and how to use them in the learning problem. Among a variety of other types of constraints (see, e.g., Table 2 in (Gnecco et al., 2015)), it studies the process of handling FOL formulas so that they can be either injected into the learning problem (in

