SEMI-SUPERVISED KEYPOINT LOCALIZATION

Abstract

Knowledge about the locations of keypoints of an object in an image can assist in fine-grained classification and identification tasks, particularly for objects that exhibit large variations in pose that greatly influence their visual appearance, such as wild animals. However, supervised training of a keypoint detection network requires annotating a large image dataset for each animal species, which is a labor-intensive task. To reduce the need for labeled data, we propose to simultaneously learn keypoint heatmaps and pose-invariant keypoint representations in a semi-supervised manner, using a small set of labeled images along with a larger set of unlabeled images. Keypoint representations are learnt with a semantic keypoint consistency constraint that forces the keypoint detection network to learn similar features for the same keypoint across the dataset. Pose invariance is achieved by pulling the keypoint representations of an image and its augmented copies closer together in feature space. Our semi-supervised approach significantly outperforms previous methods on several benchmarks for human and animal body landmark localization.
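To make the pose-invariance idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each keypoint already has a fixed-length representation vector for both the original image and one augmented copy, and that closeness is measured as the squared Euclidean distance between L2-normalized vectors. The function names `l2_normalize` and `pose_invariance_loss`, and the choice of distance, are hypothetical and chosen for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def pose_invariance_loss(reps_original, reps_augmented):
    """Mean squared distance between matching keypoint representations.

    reps_original, reps_augmented: lists of K vectors (one per keypoint),
    where entry k in both lists describes the same keypoint.
    Assumption: squared Euclidean distance on unit-normalized vectors.
    """
    if len(reps_original) != len(reps_augmented):
        raise ValueError("both inputs must contain the same number of keypoints")
    total = 0.0
    for a, b in zip(reps_original, reps_augmented):
        a, b = l2_normalize(a), l2_normalize(b)
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total / len(reps_original)


# Hypothetical representations for K = 2 keypoints with 3-dimensional features.
original = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
augmented = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2]]
loss = pose_invariance_loss(original, augmented)
```

Minimizing such a term over unlabeled images would push the representation of each keypoint to stay stable under the augmentation; the loss is zero when every pair of matching representations points in the same direction.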

1. INTRODUCTION

Detecting keypoints helps with fine-grained classification (Guo & Farrell, 2019) and re-identification (Zhu et al., 2020; Sarfraz et al., 2018). In the domain of wild animals (Mathis et al., 2018; Moskvyak et al., 2020; Liu et al., 2019a; b), annotating data is especially challenging due to large pose variations and the need for domain experts to annotate. Moreover, there is less commercial interest in keypoint estimation for animals than for humans, and little effort is invested in collecting and annotating public datasets.

Unsupervised detection of landmarks¹ (Jakab et al., 2018; Thewlis et al., 2017; 2019) can extract useful features, but it is not able to detect perceptible landmarks without supervision. On the other hand, supervised learning risks overfitting if trained only on a limited number of labeled examples. Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data during training. It has mostly been studied for classification tasks (van Engelen & Hoos, 2019), but it is also important for keypoint localization because annotating multiple keypoints per image is time-consuming manual work in which precision is the most important factor.

Pseudo-labeling (Lee, 2013) is a common semi-supervised approach in which unlabeled examples are assigned labels (called pseudo-labels) predicted by a model trained on a labeled subset. A heuristic unsupervised criterion is adopted to select the pseudo-labeled data for a retraining procedure. More recently, Dong & Yang (2019) and Radosavovic et al. (2018) apply variations of the selection criteria in pseudo-labeling for semi-supervised facial landmark detection. However, there are fewer variations in facial landmark positions than in human or animal body joints, where there is a high



¹ We use the terms keypoints and landmarks interchangeably in our work. These terms are more generic than body joints (used in human pose estimation) because our method is applicable to a variety of categories.

