SEMI-SUPERVISED KEYPOINT LOCALIZATION

Abstract

Knowledge about the locations of keypoints of an object in an image can assist in fine-grained classification and identification tasks, particularly for objects that exhibit large variations in pose that greatly influence their visual appearance, such as wild animals. However, supervised training of a keypoint detection network requires annotating a large image dataset for each animal species, which is a labor-intensive task. To reduce the need for labeled data, we propose to simultaneously learn keypoint heatmaps and pose-invariant keypoint representations in a semi-supervised manner, using a small set of labeled images along with a larger set of unlabeled images. Keypoint representations are learnt with a semantic keypoint consistency constraint that forces the keypoint detection network to learn similar features for the same keypoint across the dataset. Pose invariance is achieved by bringing the keypoint representations of an image and its augmented copies closer together in feature space. Our semi-supervised approach significantly outperforms previous methods on several benchmarks for human and animal body landmark localization.

1. INTRODUCTION

Detecting keypoints helps with fine-grained classification (Guo & Farrell, 2019) and re-identification (Zhu et al., 2020; Sarfraz et al., 2018). In the domain of wild animals (Mathis et al., 2018; Moskvyak et al., 2020; Liu et al., 2019a;b), annotating data is especially challenging due to large pose variations and the need for domain experts to annotate. Moreover, there is less commercial interest in keypoint estimation for animals than for humans, and little effort is invested in collecting and annotating public datasets.

Unsupervised detection of landmarks¹ (Jakab et al., 2018; Thewlis et al., 2017; 2019) can extract useful features, but is not able to detect perceptible landmarks without supervision. On the other hand, supervised learning risks overfitting if trained on only a limited number of labeled examples. Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data during training. It is mostly studied for the classification task (van Engelen & Hoos, 2019), but it is also important for the keypoint localization problem, because annotating multiple keypoints per image is time-consuming manual work for which precision is the most important factor.

Pseudo-labeling (Lee, 2013) is a common semi-supervised approach where unlabeled examples are assigned labels (called pseudo-labels) predicted by a model trained on a labeled subset. A heuristic unsupervised criterion is adopted to select the pseudo-labeled data for a retraining procedure. More recently, Dong & Yang (2019) and Radosavovic et al. (2018) apply variations of the selection criteria in pseudo-labeling for semi-supervised facial landmark detection.
However, there are fewer variations in facial landmark positions than in human or animal body joints, where there is a high

Previous work by Honari et al. (2018) in semi-supervised landmark detection utilizes additional class attributes and tests only on datasets that provide these attribute annotations. Our work focuses on the keypoint localization task in a common real-world scenario where annotations are available for only a small subset of a large unlabeled dataset.

More specifically, we propose a method for semi-supervised keypoint localization that learns a list of heatmaps and a list of semantic keypoint representations for each image (Figure 1). A semantic keypoint representation is a vector of real numbers in a low-dimensional space relative to the image size, and the same keypoints in different images have similar representations. We leverage properties that are specific to the landmark localization problem to design constraints for jointly optimizing both representations. We extend the transformation consistency constraint of Honari et al. (2018) so that it can be applied to each representation differently (i.e. a transformation equivariance constraint for heatmaps and a transformation invariance constraint for semantic representations). Moreover, we formulate a semantic consistency constraint that encourages detecting similar features across images for the same landmark, independent of the pose of the object (e.g. an eye should look similar in all images). Learning both representations simultaneously allows us to use the power of both supervised and unsupervised learning. Our work is motivated by data scarcity in the domain of wild animals, but it is not limited to animals and is also applicable to human body landmark detection.
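The two transformation constraints described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the 90-degree rotation standing in for an augmentation, the mean squared error as the distance, and all function names are assumptions chosen for illustration. The equivariance term compares the transformed heatmap of the original image with the heatmap predicted for the transformed image, while the invariance term compares the semantic representations of the two images directly.

```python
def rotate90(grid):
    """Rotate a 2-D list 90 degrees clockwise (hypothetical stand-in for an augmentation)."""
    return [list(row) for row in zip(*grid[::-1])]


def mean_squared_error(a, b):
    """Mean squared difference between two equally sized 2-D lists."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def equivariance_loss(heatmap, heatmap_of_rotated_image):
    """Heatmaps should move with the image: compare T(h(x)) against h(T(x))."""
    return mean_squared_error(rotate90(heatmap), heatmap_of_rotated_image)


def invariance_loss(representation, representation_of_rotated_image):
    """Semantic vectors should stay the same when the image is transformed."""
    diffs = [(x - y) ** 2 for x, y in zip(representation, representation_of_rotated_image)]
    return sum(diffs) / len(diffs)


# Toy check with a made-up 2x2 heatmap whose peak is in the top-right cell.
heatmap = [[0.0, 1.0], [0.0, 0.0]]
print(equivariance_loss(heatmap, [[0.0, 0.0], [0.0, 1.0]]))  # peak moved with the rotation
print(equivariance_loss(heatmap, heatmap))                   # peak did not move
print(invariance_loss([0.2, 0.7], [0.2, 0.7]))               # identical representations
```

In this sketch, neither term needs ground-truth keypoint annotations, which is why such constraints can be evaluated on unlabeled images.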
The contribution of our work is three-fold:

• We propose a technique for semi-supervised keypoint localization that jointly learns keypoint heatmaps and semantic representations optimised with supervised and unsupervised constraints;
• Our method can be easily added to any existing keypoint localization network with no structural overhead and only minimal computational overhead;
• We evaluate the proposed method on annotated image datasets for both humans and animals. As demonstrated by our results, our method significantly outperforms previously proposed supervised and unsupervised methods on several benchmarks, using only limited labeled data.

The paper is organised as follows. Related work on semi-supervised learning and keypoint localization is reviewed in Section 2. Our proposed method is described in Section 3. Experimental settings, datasets and results are discussed in Section 4.



¹ We use the terms keypoints and landmarks interchangeably in our work. These terms are more generic than body joints (used in human pose estimation) because our method is applicable to a variety of categories.



Figure 1: Our semi-supervised keypoint localization system learns a list of heatmaps and a list of semantic keypoint representations for each image. In addition to a supervised loss optimized on the labeled subset of the data, we propose several unsupervised constraints of transformation equivariance, transformation invariance, and semantic consistency.
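As a rough illustration of the semantic consistency idea named in the caption, the sketch below measures how far the representations of the same keypoint spread across images. This is only one simple way to express "similar features for the same keypoint" and is not necessarily the loss used in this work; the function name and the nested-list layout `representations[image][keypoint]` are assumptions made for the example.

```python
def semantic_spread(representations):
    """Mean squared distance of each keypoint vector to its per-keypoint mean.

    representations[i][k] is the semantic vector of keypoint k in image i
    (hypothetical layout). A value of 0.0 means every keypoint has an
    identical representation in all images.
    """
    num_images = len(representations)
    num_keypoints = len(representations[0])
    total, count = 0.0, 0
    for k in range(num_keypoints):
        vectors = [representations[i][k] for i in range(num_images)]
        dim = len(vectors[0])
        # Average representation of keypoint k over all images.
        mean = [sum(v[d] for v in vectors) / num_images for d in range(dim)]
        for v in vectors:
            total += sum((v[d] - mean[d]) ** 2 for d in range(dim))
            count += 1
    return total / count


# Two images, two keypoints, 2-D vectors (made-up numbers).
consistent = [[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]]
print(semantic_spread(consistent))
```

Minimising such a spread on its own could be satisfied trivially by mapping every keypoint to the same vector, so in practice it would need to be combined with a term that keeps different keypoints apart.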

