WEAKLY SUPERVISED KNOWLEDGE TRANSFER WITH PROBABILISTIC LOGICAL REASONING FOR OBJECT DETECTION

Abstract

Training object detection models usually requires instance-level annotations, such as the positions and labels of all objects present in each image. Such supervision is unfortunately not always available and, more often, only image-level information, also known as weak supervision, is provided. Recent works have addressed this limitation by leveraging knowledge from a richly annotated domain. However, the scope of weak supervision supported by these approaches has been very restrictive, preventing them from using all available information. In this work, we propose ProbKT, a framework based on probabilistic logical reasoning that allows training object detection models with arbitrary types of weak supervision. We show empirically on different datasets that using all available information is beneficial, as ProbKT leads to significant improvements on the target domain and better generalization compared to existing baselines. We also showcase the ability of our approach to handle complex logic statements as the supervision signal.

1. INTRODUCTION

Object detection is a fundamental component of numerous high-level machine learning pipelines such as autonomous driving [4; 16], augmented reality [42] or image retrieval [17]. However, training state-of-the-art object detection models generally requires detailed image annotations, such as the box coordinates and labels of each object present in each image. While several large benchmark datasets with detailed annotations are available [26; 15], providing such detailed annotation on new specific datasets comes with a significant cost that is often not affordable for many applications. More frequently, datasets come with only limited annotation, also referred to as weak supervision. This has sparked research in weakly-supervised object detection approaches [25; 6; 40], using techniques such as multiple instance learning [40] or variations of class activation maps [3]. However, these approaches have been shown to significantly underperform their fully-supervised counterparts in terms of robustness and accurate localization of objects [39]. An appealing and intuitive way to improve the performance of weakly supervised object detection is to transfer knowledge from an existing object detection model pre-trained on a fully annotated dataset [14; 46; 43]. This approach, also referred to as transfer learning or domain adaptation, consists in leveraging transferable knowledge from the pre-trained model (such as bounding box prediction capabilities) to the new weakly supervised domain. This transfer has been embodied in different ways in the literature. Examples include a simple fine-tuning of the bounding box proposal classifier of the pre-trained model [43], or an iterative relabeling of the weakly supervised dataset for retraining a new full object detection model on the re-labeled data [46]. However, existing approaches are very restrictive in the type of weak supervision they are able to harness.
Indeed, some do not support new object classes in the new domain [20], while others can only use a label indicating the presence of an object class [46]. In practice, however, the supervision on the new domain can come in very different forms. For instance, the count of each object class can be given, such as in atom detection from molecule images where only the chemical formula might be available. Or, when many objects are present in an image, a range can be provided instead of an exact class count (e.g. "there are at least 4 cats in this image"). Crucially, this variety of potential supervisory signals on the target domain cannot be fully utilized by existing domain adaptation approaches. To address this limitation, we introduce ProbKT, a novel framework that generalizes knowledge transfer in object detection to arbitrary types of weak supervision using neural probabilistic logical reasoning [27]. This paradigm connects the probabilistic outputs of neural networks with logical rules and infers the resulting probability of particular queries. One can then evaluate the probability of a query such as "the image contains at least two animals" and differentiate through the probabilistic engine to train the underlying neural network. Our approach allows for arbitrarily complex logical statements and therefore supports weak supervision such as class counts or ranges, among others. To our knowledge, this is the first approach to allow such versatility in utilizing the available information on the new domain. To assess the capabilities of this framework, we provide an extensive empirical analysis on multiple object detection datasets. Our approach supports any type of object detection backbone architecture. We thus use two popular backbone architectures, DETR [7] and RCNN [34], and evaluate their performance in terms of accuracy, convergence, as well as generalization on out-of-distribution data.
Our experiments show that, due to its ability to use the complete supervisory signal, our approach outperforms previous works in a wide range of setups.

Key contributions:

(1) We propose a novel knowledge transfer framework for object detection relying on probabilistic programming that uniquely allows using arbitrary types of weak supervision on the target domain.
(2) We make our approach amenable to different levels of computational capability by proposing different approximations of ProbKT.
(3) We provide an extensive experimental setup to study the capabilities of our framework for knowledge transfer and out-of-distribution generalization.

2. RELATED WORKS

A comparative summary of related works is given in Table 1. We distinguish three main categories: (1) pure weakly supervised object detection methods (WSOD) that do not leverage a richly annotated source domain, (2) unsupervised object detection methods with knowledge transfer (DA, or domain adaptation, methods) that do not use supervision on the target domain, and (3) weakly supervised object detection methods with knowledge transfer (WSOD w/ transfer) that are restrictive in the type of supported weak supervision. To our knowledge, our work is the first to allow arbitrary supervision on the target domain (including new classes in the target domain) while also leveraging knowledge from richly annotated domains. ProbKT supports arbitrary weak supervision thanks to the inherited expressiveness of Prolog [41], which is based on a subset of first-order predicate logic, Horn clauses, and is Turing-complete.

Weakly supervised object detection (WSOD) This class of methods allows training object detection models with only weak supervision. One can thus train these approaches directly on the target domain. However, they do not leverage potentially available richly annotated datasets, which has been shown to lead to worse performance [39]. Different flavors of WSOD architectures have been proposed, relying on a variety of techniques such as multiple instance learning (MIL) [25; 40] or class activation maps (CAM) [47; 3]. In contrast to WSOD methods, our approach is designed to exploit existing richly annotated datasets and thus provides increased performance on the target domain. For a comprehensive review of WSOD methods we refer the reader to Shao et al. [39].

Domain adaptation methods (DA)

In contrast to WSOD methods, domain adaptation methods do rely on a fully supervised source domain dataset. However, they do not assume any supervision on the target domain and are therefore not equipped to exploit such a signal when available [37; 8; 22; 48].

WSOD with knowledge transfer Our approach belongs to the class of weakly supervised object detection models with knowledge transfer. These methods aim to transfer knowledge from a source domain, where full supervision is available, to a target domain where only weak labels are available. Existing work in this class of models only allows for limited types of supervision on the target domain. Most architectures only support a label indicating the presence or absence of a class of object in the image [14; 46; 43]. Inoue et al. [20] allow for class counts as weak supervision but unfortunately do not allow for new classes in the target domain. In contrast, ProbKT natively allows for class counts and new classes, as well as other types of weak supervision.

Neural probabilistic logical reasoning Probabilistic logical reasoning combines logic and probability theory. Favored for its high-level reasoning abilities, it was introduced as an alternative to deep learning in the quest for artificial intelligence [10]. Statistical artificial intelligence [32; 23] and probabilistic logic programming [11] are examples of areas relying on these premises. In a unification effort, researchers have proposed hybrid architectures embedding both deep learning and logical reasoning components [38; 35]. Our work builds upon recent advances in the field, where combinations of deep learning, logical, and probabilistic approaches were introduced [27], allowing high-level reasoning with uncertainty using differentiable neural network architectures.

3.1. PROBLEM STATEMENT

We consider the problem of weakly supervised knowledge transfer for object detection. Starting from a model trained on a richly annotated source domain, we aim at improving its performance on a less richly annotated target domain. Let $D_s = \{(I_s^i, b_s^i, y_s^i) : i = 1, \dots, N_s\}$ be a dataset issued from the source domain, consisting of $N_s$ images $I_s$ along with their annotations. We write $b_s^i \in \mathbb{R}^{n_i \times 4}$ and $y_s^i \in \{1, \dots, K_s\}^{n_i}$ for the box coordinates and class labels of objects in image $I_s^i$, where $n_i$ is the number of objects present in image $I_s^i$ and $K_s$ is the total number of object classes in the source domain. This represents the typical dataset required to train classical fully-supervised object detection architectures. The target dataset $D_t = \{(I_t^i, q_t^i) : i = 1, \dots, N_t\}$ contains $N_t$ images from the target domain along with image-level annotations $q_t^i$. These annotations are logical statements about the content of the image in terms of object classes and their location. Examples include the presence of different classes in each image (the classical assumption in weakly supervised object detection) but also extend to the counts of classes or a complex combination of counts of object attributes (e.g. "two red objects, and at least two bicycles"). What is more, the logical statements $q_t^i$ can include classes not present in the source domain. This type of logical annotation is thus strictly broader than the restrictive supervision usually assumed. Given a source dataset and a target dataset as described above, our goal is to harness the available detailed information from the source domain to perform accurate object detection on the target domain. A graphical illustration of this process is given in Figure 1.

3.2.1. OBJECT DETECTION

Object detection aims at predicting the location and labels of objects in images. One wishes to learn a parametric function $f_\theta : \mathcal{I} \to \{\mathcal{B} \times \mathbb{R}^K\}^{\mathbb{Z}}$ with $f_\theta(I) = \{(\hat{b}, \hat{p}_y)\}_n = \{(\hat{b}_i, \hat{p}_{y,i}) : i = 1, \dots, n\}$ such that the distance between predicted and true boxes and labels, $d(\{(\hat{b}, \hat{p}_y)\}_n, \{(b, y)\}_n)$, is minimal. Object detection architectures usually output box feature proposals $\{h_i : i = 1, \dots, n\}$, conditioned on which they predict the probability vector of class labels $\hat{p}_{y,i} = g_p(h_i)$ and the box location predictions $\hat{b}_i = g_b(h_i)$, using shared parametric functions $g_p(\cdot)$ and $g_b(\cdot)$. For an object $n$, we write the predicted probability of the object belonging to class $k$ as $\hat{p}_{y,n}^k$.
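To make the notation concrete, the shared prediction heads can be sketched as follows. This is a toy sketch with randomly initialized weights; the dimensions and the linear/softmax head shapes are illustrative assumptions, not those of any particular backbone:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical shared heads over a box feature h_i: g_p predicts the class
# probability vector p_{y,i} and g_b predicts the box coordinates b_i.
rng = np.random.default_rng(0)
d, K = 8, 3                        # feature dimension, number of classes (toy)
W_p = rng.normal(size=(K, d))      # classifier weights (toy)
W_b = rng.normal(size=(4, d))      # box regressor weights (toy)

def g_p(h):
    return softmax(W_p @ h)        # probability vector over K classes

def g_b(h):
    return W_b @ h                 # 4 box coordinates

h = rng.normal(size=d)             # one box feature proposal
p_y, b_hat = g_p(h), g_b(h)
```

The key point is that both heads condition on the same box feature $h_i$, which is what ProbKT later reuses when swapping in new target-domain heads.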

3.2.2. PROBABILISTIC LOGICAL REASONING

Probabilistic logical reasoning uses a knowledge representation relying on probabilities, which allows encoding uncertainty in knowledge. Such knowledge is encoded in a probabilistic logical program $P$ as a set of $N$ probabilistic facts $U = \{U_1, \dots, U_N\}$ and $M$ logical rules $F = \{f_1, \dots, f_M\}$ connecting them. A simple example of a probabilistic fact is "Alice and Bob will each pass their exam with probability 0.5", and an example of a logical rule is "if both Alice and Bob pass their exam, they will host a party". Combining probabilistic facts and logical rules, one can construct complex probabilistic knowledge representations, which can also be depicted as probabilistic graphical models. Probabilistic logical programming allows performing inference by computing the probability of a particular statement, or query. For instance, one could query the probability that "Alice and Bob will host a party". This query is executed by summing over the probabilities of occurrence of the different worlds $w = \{u_1, \dots, u_N\}$ (i.e. individual realizations of the set of probabilistic facts) that are compatible with the query $q$. The probability of a query $q$ in a program $P$ can then be inferred as $P_P(q) = \sum_w P(w) \cdot \mathbb{I}[F(w) \equiv q]$, where $F(w) \equiv q$ denotes that propagating the realization $w$ across the knowledge graph according to the logical rules $F$ leads to $q$ being true. Remarkably, recent advances in probabilistic programming have led to learnable probabilistic facts [27]. In particular, the probability of a fact can be generated by a neural network with learnable weights. Such a learnable probabilistic fact is referred to as a neural predicate $U_\theta$, where we make the dependence on the weights $\theta$ explicit. One can then train these weights to minimize a loss that depends on the probability of a query $q$: $\theta^* = \arg\min_\theta \mathcal{L}(P(q \mid \theta))$. Our approach builds upon this ability to learn neural predicates and uses DeepProbLog [27] as the probabilistic reasoning backbone.
DeepProbLog is a neural probabilistic logic programming language that allows one to conveniently perform inference and differentiation with neural predicates. We refer the reader to the excellent introduction of Manhaeve et al. [28] for further details about this framework.
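The inference rule $P_P(q) = \sum_w P(w) \cdot \mathbb{I}[F(w) \equiv q]$ can be illustrated on the Alice-and-Bob example by brute-force enumeration of worlds. This is a minimal sketch; engines such as DeepProbLog use knowledge compilation rather than naive enumeration:

```python
from itertools import product

# Two independent probabilistic facts: Alice and Bob each pass w.p. 0.5.
p_pass = {"alice": 0.5, "bob": 0.5}

# Logical rule: a party happens iff both pass their exam.
def party(world):
    return world["alice"] and world["bob"]

def query_probability(query):
    """P(q) = sum over worlds w of P(w) * I[w entails q]."""
    total = 0.0
    for alice, bob in product([True, False], repeat=2):
        world = {"alice": alice, "bob": bob}
        p_world = 1.0
        for name, passed in world.items():
            p_world *= p_pass[name] if passed else 1.0 - p_pass[name]
        if query(world):
            total += p_world
    return total

query_probability(party)  # 0.25: only the world where both pass entails the query
```

Replacing the constants 0.5 with neural network outputs turns these facts into neural predicates, which is exactly the hook ProbKT exploits.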

3.3. KNOWLEDGE TRANSFER WITH PROBABILISTIC LOGICAL REASONING

A graphical description of our approach is presented in Figure 2. Our framework starts from an object detection model $f_\theta$ pre-trained on the source domain. The backbone of this model is extracted and inserted into a new object detection model $f_\theta^*$ with new target box position predictors and box label classifiers. This new model is then used to predict box proposals, along with the corresponding box features, on target domain images $I_t$. These box features are fed to the new target box position predictor and box label classifier. The predictions of this classifier are treated as neural predicates and are passed to a probabilistic logical module. This module computes the probability of the queries $q_t$, the resulting loss, and the corresponding gradients with respect to $\hat{p}_y$ and $\hat{b}$, which can be backpropagated through the entire network. As we want to maximize the probability of the queries being true, we use the following loss function:

$$\mathcal{L}_\theta = \sum_{(I_t, q_t) \in D_t} -\log P_P(q_t \mid f_\theta^*(I_t)) \quad (1)$$

In theory, the backbone can be trained end-to-end with this procedure. However, our experiments showed that updating only the box feature classifiers resulted in more stability, as also observed in previous works [46]. We therefore adopt the same iterative relabeling strategy, as described next.
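As an illustration of Equation 1 on a single image, the query probability can be computed by enumerating joint label assignments over the proposed boxes. This is a naive sketch with made-up class probabilities; DeepProbLog evaluates the same quantity far more efficiently and differentiably:

```python
from itertools import product
import math

# Hypothetical class probabilities (neural predicates) for 3 proposed boxes
# over 2 classes.
p_y = [
    [0.9, 0.1],   # box 0: P(class 0), P(class 1)
    [0.2, 0.8],
    [0.6, 0.4],
]

# Weak query: "exactly two objects of class 0 in the image".
def query(labels):
    return labels.count(0) == 2

def query_prob(p_y, query):
    """P(q): sum over all worlds (joint label assignments) satisfying q."""
    prob = 0.0
    for labels in product(range(len(p_y[0])), repeat=len(p_y)):
        if query(list(labels)):
            prob += math.prod(p_y[i][k] for i, k in enumerate(labels))
    return prob

loss = -math.log(query_prob(p_y, query))  # per-image term of Equation 1
```

Minimizing this loss pushes probability mass onto label assignments compatible with the weak query, without ever observing per-box labels.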


3.3.1. ITERATIVE RELABELING

The approach described above allows fine-tuning our model $f_\theta^*$ to the target domain. To further improve performance, we propose an iterative relabeling strategy consisting of multiple steps: fine-tuning, re-labeling, and re-training. A similar strategy has also been proposed by Zhong et al. [46]. Fine-tuning. This step corresponds to training ProbKT on the weakly supervised labels by minimizing the loss of Equation 1. Re-labeling. Once ProbKT has been trained, we can use its predictions to annotate images in the target domain. In practice, we only relabel images for which the model predictions comply with the available query labels, in order to avoid overly noisy labels. Re-training. The re-labeled target domain can be used to re-train the object detection backbone of ProbKT in a fully-supervised fashion. This procedure can be repeated multiple times to improve the quality of the relabeling and the number of relabeled images in the target domain dataset. A graphical representation of the relabeling pipeline is presented in Figure 3.
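The compliance filter used in the re-labeling step can be sketched as follows. The helper names and the plain-list data layout are hypothetical; the real pipeline operates on detector outputs and query objects:

```python
from collections import Counter

def complies(pred_labels, query_counts):
    """True iff predicted per-class counts match the weak count label exactly."""
    return Counter(pred_labels) == Counter(query_counts)

def relabel(predictions, queries):
    """Keep pseudo-annotations only for images whose predictions satisfy the query."""
    return {
        img_id: preds
        for img_id, preds in predictions.items()
        if complies([label for label, _box in preds], queries[img_id])
    }

# Toy example: image "a" complies with its count query, image "b" does not.
preds = {"a": [("cat", (0, 0, 10, 10)), ("dog", (5, 5, 15, 15))],
         "b": [("cat", (0, 0, 10, 10))]}
queries = {"a": ["cat", "dog"], "b": ["cat", "cat"]}
pseudo = relabel(preds, queries)  # only image "a" is kept for re-training
```

Only the surviving images are handed to the fully-supervised re-training step, which keeps the pseudo-labels from becoming too noisy.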


Figure 3: Iterative relabeling. A full cycle is composed of a fine-tuning, a re-labeling, and a re-training step. After one cycle, the fine-tuning step and/or re-labeling step can be iteratively repeated.

3.3.2. COMPUTATIONAL COMPLEXITY AND APPROXIMATIONS

The computational complexity of inference in probabilistic programming depends on the specific query $q$, and several approximations have been proposed for improving the computation time [44]. We propose two approaches for reducing the computational cost, adapted to object detection: (1) filtering the data samples before applying ProbKT (see Appendix Section C.1), or (2) when the supervision consists of class label counts, considering only the most probable world (ProbKT*) instead of all possible worlds.

3.3.3. PROBKT*: THE MOST PROBABLE WORLD AND CONNECTION TO HUNGARIAN MATCHING

The probabilistic inference step requires aggregating over all worlds compatible with the query $q$. Yet, in certain cases, one can reduce the computational cost by considering only the most probable world. Indeed, consider the case when the query consists of the list of class labels in the image. For $n$ boxes proposed by the object detection model, the query can be written as the set of labels $q = \{y_i : i = 1, \dots, n\}$. Writing $\hat{p}_{y,i}^k$ for the probability that the label of box $i$ belongs to class $k$ (as introduced in Section 3.2.1), we have:

$$P_P(q) = \sum_{j=1}^{n!} \hat{p}_{y,1}^{\sigma_j(1)} \cdot \hat{p}_{y,2}^{\sigma_j(2)} \cdots \hat{p}_{y,n}^{\sigma_j(n)} = \sum_{j=1}^{n!} \prod_{i=1}^{n} \hat{p}_{y,i}^{\sigma_j(i)}$$

where $\sigma_j$ corresponds to the $j$-th permutation of the query vector $q$. To avoid computing the contribution of every possible world, one can keep only the configuration with the largest contribution to $P_P(q)$ and discard the others. This possible world corresponds to the permutation $\sigma^*$ that satisfies:

$$\sigma^* = \arg\max_\sigma \log \Big( \prod_{i=1}^{n} \hat{p}_{y,i}^{\sigma(i)} \Big) = \arg\max_\sigma \sum_{i=1}^{n} \hat{p}_{y,i}^{\sigma(i)} = \arg\min_\sigma \sum_{i=1}^{n} \big(1 - \hat{p}_{y,i}^{\sigma(i)}\big).$$

Remarkably, this corresponds to the best alignment found by the Hungarian matching algorithm with cost $c(i) = 1 - \hat{p}_{y,i}^{\sigma(i)}$, as used, among others, in DETR [7]. Thus, when the query is the set of class labels, the most probable world can be inferred with the Hungarian matching algorithm. In Appendix C.2, we also show that the gradient of ProbKT can be interpreted as a probability-weighted extension of the gradient resulting from Hungarian matching.
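The most probable world can therefore be found with an off-the-shelf Hungarian solver. A minimal sketch using SciPy, with toy probabilities and a query of one box per class:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical box-label probabilities for 3 proposed boxes over classes {0, 1, 2}.
p_y = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

# Weak query: the multiset of class labels present in the image.
q = [0, 1, 2]

# Cost of assigning box i to query label q[j]: c = 1 - p_{y,i}^{q[j]}.
cost = 1.0 - p_y[:, q]

# Hungarian matching minimizes the total cost, i.e. recovers the
# most probable world (box-to-label assignment).
rows, cols = linear_sum_assignment(cost)
assignment = {int(i): q[j] for i, j in zip(rows, cols)}  # box -> class label
```

For this toy example the solver assigns each box to its highest-probability class, the same alignment DETR's matching step would produce with this cost.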

4.1. DATASETS

We evaluate our approach on three different datasets: (1) a CLEVR-mini dataset, (2) a Molecules dataset with images of chemical compounds, and (3) an MNIST-based object detection dataset. For each dataset, three subsets, corresponding to different domains, are used: (1) a source domain, (2) a target domain, and (3) an out-of-distribution (OOD) domain. The source domain is the richly annotated domain used to pre-train the object detection model. The target domain is the domain of interest, with image-level annotations only. Lastly, the OOD domain contains images from a different distribution than the source and target domains and is used to study the generalizability of the models. Source and target domains are split into 5 folds of train and validation sets and an independent test set. We focused our experiments on the small-sample regime (1k to 2k samples) for both the source and the target domain. More details on each dataset can be found in Appendix B.

4.2. MODELS

In the experiments, we apply our method ProbKT on two different pre-trained object detection backbone models: (1) DETR [7] and (2) FasterRCNN [34]. Both are pre-trained on the COCO dataset [26]. We also evaluate a Hungarian-algorithm approximation (ProbKT*) of our method when the weak supervision allows it. For the sake of conciseness, we omit the results of ProbKT* here; they can be found in Appendix D. The details of the training procedures, as well as the hyper-parameters used for the different models and datasets, are summarized in Table 4 in Appendix A.

4.2.1. BASELINE MODELS

As shown in Section 2, all available approaches for weakly supervised object detection are very restrictive in terms of the supervision signal they support. Our main comparison partner is the state-of-the-art WSOD-transfer method [46]. Additionally, we compare our approach against a Resnet50 [18] backbone pre-trained on ImageNet [12]. Fine-tuning is performed by adding an extra multitask regression layer that is trained to predict the individual counts of the objects in the image, as in Xue et al. [45]. This architecture naturally relies only on label counts in the target images for fine-tuning. We then obtain box predictions using class activation maps, as in Bae et al. [3], to compare its performance on object localization. We call this approach Resnet50-CAM. When the supervision signal allows it, we also compare with a DETR model trained end-to-end jointly on target and source domains, masking the box costs in the matching cost of the Hungarian algorithm for image-level annotated samples. We call this approach DETR-joint.

Table 2: Results per model and data domain, reporting CLEVR count accuracy, CLEVR mAP (mAP@IoU=0.5), Molecules count accuracy, and Molecules mAP (mAP@IoU=0.5).

4.3. EVALUATION METRICS

We evaluate the performance of the models on the different datasets based on two criteria: count accuracy and object localization performance. The count accuracy measures the fraction of images for which all individual counts of (detected) objects are correct. To evaluate how well the model localizes the different objects in the image, we report the mean average precision (mAP), a widely used metric for evaluating object detection models.
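The count accuracy metric can be sketched as follows (toy example; in practice the predicted labels come from the detector's output boxes):

```python
from collections import Counter

def count_accuracy(pred_labels, true_labels):
    """Fraction of images whose predicted per-class counts are all correct."""
    correct = sum(
        Counter(p) == Counter(t) for p, t in zip(pred_labels, true_labels)
    )
    return correct / len(true_labels)

# Toy predictions for three images: only the second image has wrong counts.
preds = [["cat", "dog"], ["cat"], ["dog", "dog"]]
truth = [["dog", "cat"], ["cat", "cat"], ["dog", "dog"]]
count_accuracy(preds, truth)  # 2/3
```

Note that an image only counts as correct if every class count matches, which makes this a strict, image-level metric.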

4.4. WEAKLY SUPERVISED KNOWLEDGE TRANSFER WITH CLASS COUNTS

We first investigate the performance of ProbKT when the weak supervision consists of class counts only. The query $q$ for each image then specifies the number of objects of each class in the image. We evaluate the models on the CLEVR-mini and Molecules datasets. For the Molecules dataset, an image containing 6 carbon atoms (C), 6 oxygen atoms (O), and 12 hydrogen atoms (H) would result in the query $q = ([C, O, H], [6, 6, 12])$. For the Molecules dataset, these weak labels are widely and easily available in the form of the chemical formula of the molecule in the image (e.g. C6H12O6). The recognition of atomic-level entities in images of molecules is a challenge in the field of Optical Chemical Structure Recognition (OCSR) [9; 33; 29; 19]. For the CLEVR-mini dataset, the query for an example image containing 2 spheres, 1 cylinder, and 3 cubes would be $q = ([Cube, Cylinder, Sphere], [3, 1, 2])$. Formal descriptions of the queries for each task are presented in Appendix E. Results of the experiments are summarized in Table 2. On both datasets, we observe that ProbKT is able to transfer knowledge from the source domain to the target domain, improving count accuracy on the target domain and, in most cases, also on the source domain. The count accuracy increases both on the target domain and on OOD data, suggesting better generalization. This is in contrast with Resnet50-CAM, which performs well on the target domain of the Molecules dataset but fails on OOD data. We also note a significant improvement in object localization (mAP) for ProbKT on the CLEVR-mini dataset. However, fine-tuning seems detrimental for mAP on the Molecules dataset. This can be explained by the very small bounding boxes in the Molecules dataset. We therefore also report mAP@IoU=0.5, where we observe some increase in performance after fine-tuning. Lastly, our approach outperforms WSOD-transfer on all metrics for both datasets.
WSOD-transfer performs well on CLEVR-mini but fails on the Molecules dataset. This can be explained by the fact that this method only supports class indicators (whether a class is present in the image), which is particularly detrimental for molecule images containing many objects.
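For illustration, turning a chemical formula into the count query used above is a small parsing step. This is a sketch handling simple Hill-style formulas like C6H12O6 only, not bracketed groups or full SMILES:

```python
import re

def formula_to_query(formula):
    """Parse a chemical formula (e.g. 'C6H12O6') into per-atom counts,
    the weak count label used for the Molecules dataset."""
    counts = {}
    # An element symbol is one uppercase letter optionally followed by a
    # lowercase one; a missing count means 1.
    for symbol, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(num) if num else 1)
    return counts

formula_to_query("C6H12O6")  # {'C': 6, 'H': 12, 'O': 6}
```

The resulting dictionary is exactly the $([C, O, H], [6, 6, 12])$-style count query, which makes the chemical formula a free source of weak supervision.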

4.5.1. CLASS RANGES

The annotation of images is a tedious task, which limits the availability of fully annotated datasets. When the number of objects in an image is large, counting the exact number of objects of a particular class becomes too time-consuming. A typical annotation in this case consists of class ranges: instead of exact class counts, an interval is given for each count. For example, an image from the CLEVR-mini dataset with more than 4 cubes, exactly 4 cylinders, and less than 4 spheres would result in the query $q = ([cube, cylinder, sphere], [[4, \infty[, [4, 5[, [0, 4[])$. We evaluate this experimental setup and report results in Table 3. We observe that ProbKT performs significantly better on count accuracy than WSOD-transfer, which still uses only presence/absence labels. We note that Resnet50-CAM is unable to use this type of supervision and is thus reported as n/a.

More complex types of weak supervision than the ones considered above are also possible. To illustrate the capabilities of our approach, we build an MNIST object detection dataset where images show multiple digits as objects. Example images are available in Appendix B. The weak supervision here is the sum of all digits in the image: $q = \mathrm{SUM}(digits)$. ProbKT can seamlessly integrate this type of supervision, as shown in Table 3. As all other baselines are unable to process this type of supervision, we compare against a pre-trained RCNN and a variation of Resnet50-CAM where we add an extra neural network layer that sums the individual counts to produce the resulting sum. We report count accuracy, mAP, and sum accuracy. The sum accuracy measures the fraction of images for which the predicted sum (rather than the digit labels) is correct. Results of extra experiments with DETR as backbone using complex types of weak supervision can be found in Appendix D. Iterative relabeling.
In Figure 4, we plot the evolution of the performance on the test sets after multiple rounds of fine-tuning and re-labeling, as detailed in Section 3.3.1. The final performance reported in the results tables is selected based on the best relabeling iteration on the validation set. We observe that iterative relabeling after fine-tuning can improve performance significantly. Nevertheless, the benefit of iterative relabeling is less pronounced for DETR on the Molecules dataset. We attribute this to the fact that the fine-tuned DETR model is less accurate on this dataset.
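Returning to the sum-of-digits supervision on the MNIST dataset, the query probability again follows the possible-worlds semantics of Section 3.2.2. A brute-force sketch with made-up digit probabilities (the actual engine avoids full enumeration):

```python
from itertools import product
import math

# Hypothetical digit probabilities for two detected boxes, restricted to
# digits 0-3 for brevity.
p_digit = [
    [0.1, 0.7, 0.1, 0.1],   # box 0: P(digit 0), ..., P(digit 3)
    [0.2, 0.1, 0.6, 0.1],   # box 1
]

def prob_sum_equals(p_digit, target):
    """P(sum of digits == target), summing over all compatible worlds."""
    total = 0.0
    for digits in product(range(len(p_digit[0])), repeat=len(p_digit)):
        if sum(digits) == target:
            total += math.prod(p_digit[i][d] for i, d in enumerate(digits))
    return total

prob_sum_equals(p_digit, 3)  # mass of worlds (0,3), (1,2), (2,1), (3,0)
```

The per-image loss is then $-\log P(\mathrm{sum} = q)$, exactly as in Equation 1, which is why this supervision type requires no architectural change.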

Object detection backbone

Our method can seamlessly accommodate different object detection backbones. In Table 2, we present results for our method with a DETR [7] and a FasterRCNN [34] backbone. We observe that FasterRCNN typically performs better. In particular, the DETR backbone performs poorly on the Molecules dataset. This could be due to the small objects in the Molecules dataset; indeed, Carion et al. [7] recommend using DETR-DC5 or DETR-DC5-R101 for small objects instead.

5. CONCLUSIONS AND DISCUSSION

Object detection models are a key component of machine learning deployment in the real world. However, training such models usually requires large amounts of richly annotated images, which is often prohibitive for many applications. In this work, we proposed a novel approach to train object detection models by leveraging richly annotated datasets from other domains while allowing arbitrary types of weak supervision on the target domain. Our architecture relies on a probabilistic logical programming engine that blends the power of symbolic reasoning and deep learning architectures. As such, our model also inherits the current limitations of probabilistic reasoning implementations, such as higher computational complexity. We proposed several approaches to speed up the inference process significantly, and our work will directly benefit from further advances in this field. Lastly, the versatility of probabilistic programming could help support other related tasks in the future, such as image-to-graph translation.

Reproducibility Statement Details for reproducing all experiments shown in this work are available in Appendix E. More details on the datasets used in the experiments can be found in Appendix B.

[46] Yuanyi Zhong, Jianfeng Wang, Jian Peng, and Lei Zhang. Boosting weakly supervised object detection with progressive knowledge transfer. In European Conference on Computer Vision, pages 615-631. Springer, 2020.
[47] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921-2929, 2016.
[48] Xinge Zhu, Jiangmiao Pang, Ceyuan Yang, Jianping Shi, and Dahua Lin. Adapting object detectors via selective cross-domain alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 687-696, 2019.

A TRAINING DETAILS

For the hyper-parameters, the idea was to stay as close as possible to the defaults of the pre-trained standard models, although some lightweight tuning was done. The hyper-parameters used for the different models and datasets are summarized in Table 4.

B DATASETS

We evaluate our approach on three different datasets: (1) a CLEVR-mini dataset, (2) a Molecules dataset with images of chemical compounds, and (3) an MNIST-based object detection dataset. For each dataset, three subsets, corresponding to different domains, are used: (1) a source domain, (2) a target domain, and (3) an out-of-distribution (OOD) domain. Source and target domains are split into 5 folds of train and validation sets and an independent test set. Sizes of the different splits per dataset are summarized in Table 5.

Table 5: Dataset sizes for the different splits. For train and validation splits, 5 folds are used.

B.0.1 CLEVR-MINI DATASET

The CLEVR-mini dataset for our experiments is a selection of samples from the CLEVR dataset [21]. The object types available in the CLEVR dataset are combinations of shapes (cube, sphere, and cylinder), materials (metal and rubber), and sizes (large and small). Colors are ignored, as the images are first converted to grayscale before being fed to the models. For the richly annotated source domain, we randomly select images with only sphere- or cylinder-shaped objects (no cubes) and with a minimum of three and a maximum of four objects per image. For the weakly annotated target domain, we experiment with two types of annotations. In the first, the class counts of objects in the image are available. In the second, instead of exact class counts, the annotations only specify whether there is exactly one object class in the image or multiple. The advantage of this kind of labeling is that the annotator does not need to count the objects and only has to distinguish between one object class and multiple. The images in the target domain can contain all combinations of object types (including cube-shaped objects), with a minimum of five and a maximum of six objects per image. For the OOD dataset, we also select images with all possible combinations of object types, always with 10 objects per image. Some example images from the CLEVR-mini dataset can be found in Figure 1.

B.0.2 MOLECULES DATASET

The Molecules dataset contains images depicting chemical compounds. For the richly annotated source domain, a procedure similar to the one described in Oldenhof et al. [29, 30] was executed, using an RDKit [2] fork to generate the bounding box labels for the individual atoms present in the images. In the source domain, we allow the following atom types: carbon (C), hydrogen (H), oxygen (O), and nitrogen (N). In the weakly annotated target domain, we only have the counts of the atoms present, which translate to the chemical formula of the molecule in the image (e.g., C6H12O6). The same classes from the source domain (C, H, O, and N) are also present in the target domain, as well as an extra atom type: sulfur (S). The OOD test dataset consists of 1000 images from the external UoB dataset [36] depicting chemical compounds that contain only the atom types present in the target domain (C, H, O, N, and S). Some example images from the Molecules dataset are visualized in Figure 5.
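As an illustration, the conversion from per-atom counts to a chemical-formula weak label can be sketched in Python. The helper name `formula_label` and the use of the Hill convention (carbon first, hydrogen second, remaining elements alphabetically) are our assumptions for this sketch, not part of the ProbKT code:

```python
def formula_label(atom_counts):
    """Build a chemical-formula weak label from per-atom counts.

    Atoms are ordered with the Hill convention (an assumption):
    carbon first, hydrogen second, remaining elements alphabetically.
    """
    order = ["C", "H"] + sorted(a for a in atom_counts if a not in ("C", "H"))
    parts = []
    for atom in order:
        n = atom_counts.get(atom, 0)
        if n == 0:
            continue
        # A count of one is written without a number, as in chemical formulas.
        parts.append(atom if n == 1 else f"{atom}{n}")
    return "".join(parts)

# Glucose: 6 carbons, 12 hydrogens, 6 oxygens.
print(formula_label({"C": 6, "H": 12, "O": 6}))  # C6H12O6
```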

B.0.3 MNIST OBJECT DETECTION DATASET

The MNIST object detection dataset is generated [1] using the original MNIST dataset [13]. Each image consists of three MNIST digits randomly positioned in the image. This dataset allows experimenting with a more arbitrary type of weak supervision: each object represents a digit, and digits can be aggregated. An image can thus be labeled with only the sum of all digits it contains instead of the class counts of the objects. For the richly annotated source domain, the digits 7, 8, and 9 are left out. The weakly annotated target domain contains all possible digit classes (0-9); its labels only contain the sum of all digits. For the OOD test dataset, images are used that contain a maximum of four MNIST digits, instead of three digits as in the other domains. Some example images from the MNIST object detection dataset are visualized in Figure 6.

Figure 6: Weakly supervised knowledge transfer with probabilistic logical reasoning (ProbKT). (Left) A model can be trained on the source domain using bounding box information (labels, positions) but only on a limited set of digits (0, 1, 2, 3, 4, 5, 6). (Middle) The pre-trained model does not recognize the digit eight (8) from the target domain correctly. (Right) After probabilistic reasoning with weak labels (e.g., the sum of the digits in the image), the model adapts to the target domain and recognizes the digit eight (8).

C PROBKT AND PROBKT* SUPPLEMENTARY DETAILS

C.1 FILTERING SAMPLES

The computational complexity of inference in the probabilistic programming module grows with the number of possible worlds. In turn, the number of possible worlds grows with the number of probabilistic facts. One avenue to reduce the computational cost of the inference step is then to artificially reduce the number of probabilistic facts in each image. Let $\{p_{y,n} : n = 1, \ldots, N\}$ be the probabilistic facts and $q$ the corresponding inference query. We compute the filtered set of probabilistic facts $\tilde{p}_{y,n}$ by setting

$$\tilde{p}^k_{y,n} = \begin{cases} 1 & \text{if } p^k_{y,n} \geq \delta \\ 0 & \text{if } \exists k' \text{ s.t. } p^{k'}_{y,n} \geq \delta \text{ and } p^k_{y,n} < \delta \\ p^k_{y,n} & \text{otherwise.} \end{cases} \quad (2)$$

The parameter $\delta \in [0, 1]$ is a threshold at which we consider a probabilistic fact as certain. A probability of 1 or 0 effectively discards the probabilistic fact $\tilde{p}_{y,n}$ from the inference procedure. However, we also have to update the inference query $q$ to reflect this filtration. We write $\tilde{q}$ for the filtered query.

Example. To illustrate this filtration strategy, consider an MNIST image with 3 digits: {3, 4, 7}. The query $q$ corresponds to the class labels in the image, that is, $q = \{3, 4, 7\}$. The object detection backbone outputs 3 box features with corresponding probabilities $\{p_{y,0}, p_{y,1}, p_{y,2}\}$. Now let, e.g., $p^3_{y,1} = 0.99$. We can filter out $p_{y,1}$ (i.e., the prediction for a digit 3 is certain) and compute the filtered query $\tilde{q} = \{4, 7\}$.

Remark. Equation 2 suggests a filtering based on the output probabilities only. However, one can also use information about the query for the filtration. For instance, one would only filter out a probabilistic fact if it is consistent with the query $q$. In the example above, it would be wiser not to filter out, e.g., $p^9_{y,1} = 0.99$, as no nines are supposedly present in the image. One should then ideally propagate this probabilistic fact to the inference module so as to update the weights of the backbone and learn from this error.
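A minimal Python sketch of this filtering step, including the query-consistent variant from the remark, might look as follows. The function name, the array layout, and the hard removal of certain facts are illustrative assumptions, not the actual ProbKT implementation:

```python
import numpy as np

def filter_facts(probs, query, delta=0.99):
    """Filter near-certain probabilistic facts (a sketch of Eq. 2).

    probs: (n_objects, n_classes) array of class probabilities p_{y,n}.
    query: list of class labels present in the image (a multiset).
    Returns the remaining probability rows and the filtered query.
    """
    keep_rows, filtered_query = [], list(query)
    for p in probs:
        k = int(np.argmax(p))
        if p[k] >= delta and k in filtered_query:
            # The fact is certain and consistent with the query:
            # discard it and remove one occurrence of its class.
            filtered_query.remove(k)
        else:
            keep_rows.append(p)
    return np.array(keep_rows), filtered_query

# Digits {3, 4, 7}; the second object is (almost) certainly a 3.
probs = np.array([
    [0.1, 0.2, 0.1, 0.2, 0.2, 0.05, 0.05, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.99, 0.0, 0.0, 0.0, 0.01, 0.0, 0.0],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.0, 0.0],
])
remaining, q = filter_facts(probs, [3, 4, 7])
print(q)  # [4, 7]
```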

C.2 GRADIENT OF THE LIKELIHOOD

The ProbKT likelihood has the following form:

$$P_P(q) = \sum_{\alpha \in E_q} \prod_i \prod_j p_{ij}^{\alpha_{ij}},$$

where $\alpha$ is a "possible world" matrix of indicator variables,

$$\alpha_{ij} = \begin{cases} 1 & \text{if object } i \text{ is of class } j \\ 0 & \text{otherwise,} \end{cases}$$

and $E_q$ is the set of all possible worlds $\alpha$ compatible with the logical annotation $q$.

Lemma 1. The gradient of the likelihood has the following form:

$$\frac{\partial P_P(q)}{\partial \theta} = \sum_i \sum_j \frac{\partial p_{ij}}{\partial \theta} C_{ij},$$

where the weight has the form

$$C_{ij} = P(E \mid O_i = j) = \sum_{\alpha \in E_q \mid O_i = j} \prod_{i'} \prod_{j'} p_{i'j'}^{\alpha_{i'j'} \, \mathbb{I}(i \neq i' \vee j \neq j')}.$$

In the case of Hungarian matching, the most probable possible world is selected, which corresponds to setting the conditional probability $P(E \mid O_i = j)$ to 1 if object $i$ is paired with label $j$ and 0 otherwise. The ProbKT gradient can thus be interpreted as a probability-weighted extension of the gradient resulting from Hungarian matching.

D FULL RESULTS

In Table 6, we present the full results for the MNIST experiment. We report the count accuracy (i.e., correct identification of the digits in the image), the sum accuracy (i.e., correct estimation of the sum of the digits in the image), and the mean average precision (mAP), a common object detection metric that reflects the ability to predict the positions and labels of the objects. We observe that the ResNet baseline performs poorly, lacking the necessary logic to process this dataset. We used both DETR and RCNN as object detection backbones in our experiments, showing high test accuracies when fine-tuned with our approach. As the results suggest, RCNN backbones lead to better performance than the DETR backbone.






Figure 1: ProbKT: Weakly supervised knowledge transfer with probabilistic logical reasoning. (Left) A model can be trained on the source domain using full supervision (labels, positions) but only on a limited set of shapes (cylinders and spheres). (Middle) The pre-trained model does not recognize the cubes from the target domain correctly. (Right) The model can adapt to the target domain after applying ProbKT and can recognize the cubes.

Figure 2: ProbKT. The pre-trained object detection backbone outputs the box features $h$ for the detected objects. Box classifiers (red) and box position predictors (blue) then predict corresponding label predictions $p_y$ and box position predictions $b$ that are fed to the probabilistic reasoning layer. This layer computes the probability of the query along with the gradients with respect to $p_y$ and $b$ that can be backpropagated through the entire network.

Figure 4: Iterative relabeling performance for the different datasets. Iteration 0: pre-trained on the source domain. Iteration 1: fine-tuned. Iterations 2-4: re-labeled and re-trained.

Figure 5: Weakly supervised knowledge transfer with probabilistic logical reasoning (ProbKT). (Left) A model can be trained on the source domain using bounding box information (labels, positions) but only on a limited set of atom types (C, H, O, N). (Middle) The pre-trained model does not recognize the sulfur (S) from the target domain correctly. (Right) After probabilistic reasoning with weak labels (e.g., the counts of objects in the image), the model adapts to the target domain and recognizes the sulfur (S).

Results of the experiments on the CLEVR-mini and Molecules datasets. Reported are test accuracies over the 5 folds. The best method is in bold for each metric and data distribution. *: the OOD test set of the Molecules dataset has no bounding box labels.



An overview of the hyperparameters used for the different models is given. Most hyperparameters are left at their defaults from the standard models. Tuning was mostly done on the learning rate and the learning rate scheduling. For every fold/dataset, the best model in terms of epoch, learning rate, and lr_step_size is selected based on validation data.

Results of the SUM experiments on the MNIST object detection dataset. Reported are test accuracies over the 5 folds.




ACKNOWLEDGMENTS

AA, MO and YM are funded by (1) Research Council KU Leuven: Symbiosis 4 (C14/22/125), Symbiosis3 (C14/18/092); (2) Federated cloud-based Artificial Intelligence-driven platform for liquid biopsy analyses (C3/20/100); (3) CELSA - Active Learning (CELSA/21/019); (4) European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 956832; (5) Flemish Government (FWO: SBO (S003422N), Elixir Belgium (I002819N), SB and Postdoctoral grants: S003422N, 1SB2721N, 1S98819N, 12Y5623N); and (6) VLAIO PM: Augmenting Therapeutic Effectiveness through Novel Analytics (HBC.2019.2528). YM, AA, EDB, and MO are affiliated to Leuven.AI and received funding from the Flemish Government (AI Research Program). EDB is funded by an FWO-SB grant (S98819N). Computational resources and services used in this work were partly provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government department EWI.

CODE AVAILABILITY

Our code is available at https://github.com/molden/ProbKT 

E SOURCE CODE AND DATASETS

The source code and basic instructions are available at https://github.com/molden/ProbKT. The source code integrates features from the Weights & Biases (WandB) platform [5]. Basic features are supported without an account on WandB, but to make full use of all features we recommend creating an account. Datasets can be downloaded here:

• CLEVR-mini dataset: https://figshare.com/s/db012765e5a38e14ef9c

The query q in the case of class counts is count_objects(X,L,C). For example, an image X with 1 small metal cube and 3 large rubber cylinders results in the following query: count_objects(X,[small_metal_cube,large_rubber_cylinder],[1,3]).

The ProbLog script used in the ProbKT probabilistic logical reasoning framework for aggregating the digits in an image uses the query sum_digits(X,Y). For example, an image X with a digit sum of 12 results in the following query: sum_digits(X,12).

The query q in the case of non-exact counts of objects is range_count_objects(X,L,C,S). For example, an image X with exactly one small metal cube and multiple large rubber spheres results in the following query: range_count_objects(X,[s_metal_cube,l_rubber_sphere],[1,1],[0,1]).
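For illustration, the weak annotations can be rendered into such ProbLog query strings with small helpers. These function names are hypothetical and only mirror the query formats described above; they are not part of the ProbKT codebase:

```python
def count_objects_query(image_id, counts):
    """Render class-count annotations, given as (label, count) pairs,
    as a ProbLog count_objects/3 query string."""
    labels = ",".join(label for label, _ in counts)
    values = ",".join(str(c) for _, c in counts)
    return f"count_objects({image_id},[{labels}],[{values}])"

def sum_digits_query(image_id, digit_sum):
    """Render a sum-of-digits annotation as a ProbLog sum_digits/2 query."""
    return f"sum_digits({image_id},{digit_sum})"

q = count_objects_query("x", [("small_metal_cube", 1), ("large_rubber_cylinder", 3)])
print(q)  # count_objects(x,[small_metal_cube,large_rubber_cylinder],[1,3])
print(sum_digits_query("x", 12))  # sum_digits(x,12)
```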

E.1 INFERENCE EXAMPLE FOR MNIST DATASET

To illustrate the inference process, let us follow the evaluation of the clause sum([x1,x2],8), which can result from the query sum_digits(X,8) in the case of two visible digits in the image X. This clause is true if and only if $X_1 + X_2 = 8$. In the case of MNIST digits ($\{0, 1, \ldots, 9\}$), enumerating the possible worlds gives the following set: $\{(0, 8), (1, 7), (2, 6), \ldots, (8, 0)\}$. Summing the probability of all possible worlds, we get

$$P(X_1 + X_2 = 8) = \sum_{k=0}^{8} p_1(k)\, p_2(8 - k),$$

where $p_1$ and $p_2$ are the distributions of the random variables $X_1$ and $X_2$, respectively. Or, in a general form,

$$p_Y(y) = \sum_{k} p_1(k)\, p_2(y - k).$$

As expected, the distribution of the sum is the convolution of the distributions of the two terms. This observation trivially generalizes to more than two terms. The cost function corresponding to the maximum likelihood estimation is the negative log-likelihood $-\log(p_Y(Y))$.
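The enumeration above amounts to a discrete convolution, which can be checked numerically with a short Python sketch (the uniform digit distributions are assumed purely for illustration):

```python
def sum_distribution(p1, p2):
    """Distribution of X1 + X2: the discrete convolution of p1 and p2."""
    out = [0.0] * (len(p1) + len(p2) - 1)
    for a, pa in enumerate(p1):
        for b, pb in enumerate(p2):
            out[a + b] += pa * pb
    return out

# Uniform distributions over the ten MNIST digit classes.
p = [0.1] * 10
p_sum = sum_distribution(p, p)
# P(X1 + X2 = 8): nine worlds (0,8),(1,7),...,(8,0), each with mass 0.01.
print(round(p_sum[8], 4))  # 0.09
```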

