DECOMPOSE TO GENERALIZE: SPECIES-GENERALIZED ANIMAL POSE ESTIMATION

Abstract

This paper tackles the cross-species generalization problem for animal pose estimation, aiming to learn a pose estimator that generalizes well to novel species. We find that the relation between different joints is important, with a two-fold impact: 1) on the one hand, some relations are consistent across all species and may help two joints mutually confirm each other, e.g., the eyes help confirm the nose and vice versa because they are close in all species; 2) on the other hand, some relations are inconsistent across species due to species variation and may bring severe distraction rather than benefit. With these two insights, we propose a Decompose-to-Generalize (D-Gen) pose estimation method that breaks the inconsistent relations while preserving the consistent ones. Specifically, D-Gen first decomposes the body joints into several joint concepts so that each concept contains multiple closely related joints. Given these joint concepts, D-Gen 1) promotes the interaction between intra-concept joints to enhance their reliable mutual confirmation, and 2) suppresses the interaction between inter-concept joints to prohibit their mutual distraction. Importantly, we explore various decomposition approaches, i.e., heuristic, geometric and attention-based approaches. Experimental results show that all these decomposition approaches yield reasonable joint concepts and substantially improve cross-species generalization, with the attention-based approach performing best.

1. INTRODUCTION

Animal pose estimation (Cao et al., 2019; Li & Lee, 2021b; Mu et al., 2020; Mathis et al., 2021) aims to identify and localize the anatomical joints of animal bodies, and has received increasing attention for its wide applications in, e.g., biology, zoology, and aquaculture. A critical challenge in realistic animal pose estimation is the cross-species problem, i.e., applying an already-learned pose estimator to novel species. It is infeasible to collect and annotate data for all animal species, because the animal kingdom comprises millions of different species. Cross-species generalization is therefore of great value for realistic applications.
This paper tackles cross-species animal pose estimation from the domain-generalization (DG) viewpoint (i.e., each species is a separate domain) and reveals a factor unique to cross-species generalization: the relation between different joints. The joint relation in our view can be visual (e.g., the color relation between neighboring joints), structural (e.g., the nose is under the eyes), and of other kinds. Our focus on the joint relation differs from the popular concern in general domain generalization (Ben-David et al., 2010; Blanchard et al., 2021; 2011; David et al., 2010), which mainly considers the distribution shift between the source and the unseen target domain(s). Generic DG methods usually learn domain-invariant representations (Muandet et al., 2013; Ghifary et al., 2015; Li et al., 2018b; c; Bui et al., 2021; Yang et al., 2021; Gong et al., 2021), or enhance generalizability through meta-learning (Dou et al., 2019; Balaji et al., 2018; Li et al., 2018a; 2019) or data augmentation (Zhou et al., 2021; Shankar et al., 2018; Carlucci et al., 2019; Volpi et al., 2018). While these methods are potentially applicable to cross-species generalization as well, the joint relation is a unique viewpoint that has not been explored in other DG scenarios.
The importance of the joint relation is two-fold: 1) on the one hand, some joint relations are consistent across all species and are beneficial. With a consistent relation, two joints may mutually confirm each other, e.g., the eye helps confirm the nose and vice versa, because they are consistently close in all species. 2) on the other hand, some joint relations are inconsistent across species due to species variation and are thus harmful for generalization, e.g., the length of non-rigid body parts such as legs. Such inconsistent relations turn the already-learned mutual confirmation into a severe distraction rather than a benefit. We note that the latter (negative) impact has been recognized to some extent in earlier literature (Cao et al., 2019), while the former (positive) impact was neglected. In contrast to Cao et al. (2019), we argue that both factors are important and should be considered in combination.
With these two insights, we propose a Decompose-to-Generalize (D-Gen) pose estimation method that breaks the inconsistent relations while preserving the consistent ones. Specifically, D-Gen first decomposes the body joints into several joint concepts. The decomposition is designed so that each individual concept contains multiple closely related joints, while joints in different concepts are far apart or prone to inconsistent relations. Given these joint concepts, D-Gen promotes the interaction between intra-concept joints and meanwhile suppresses the interaction between inter-concept joints. The approach for interaction promotion and suppression is very simple: D-Gen splits the top layers of the backbone network into several pose-estimation branches, each of which is responsible for one joint concept. Intuitively, joints in different branches interact less than joints in the same branch.
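To make the branch split described above concrete, the following is a minimal sketch of the bookkeeping it implies: each joint is assigned to exactly one concept branch, and the per-branch predictions are merged back into the original joint order. The joint names, the concept grouping, and all function names are hypothetical and chosen only for illustration; the network layers themselves are omitted.

```python
# Hypothetical joint names and concept grouping, for illustration only.
JOINTS = ["left_eye", "right_eye", "nose", "left_front_paw",
          "right_front_paw", "left_back_paw", "right_back_paw", "tail_root"]

CONCEPTS = {
    "head": ["left_eye", "right_eye", "nose"],
    "front_legs": ["left_front_paw", "right_front_paw"],
    "rear": ["left_back_paw", "right_back_paw", "tail_root"],
}


def build_branch_index(joints, concepts):
    """Map every joint to (branch name, output channel inside that branch)."""
    index = {}
    for branch, members in concepts.items():
        for channel, joint in enumerate(members):
            if joint in index:
                raise ValueError(f"{joint} is assigned to more than one concept")
            index[joint] = (branch, channel)
    missing = [j for j in joints if j not in index]
    if missing:
        raise ValueError(f"joints without a concept: {missing}")
    return index


def shares_branch(index, joint_a, joint_b):
    """True when both joints are predicted by the same branch (intra-concept)."""
    return index[joint_a][0] == index[joint_b][0]


def merge_branch_outputs(joints, index, branch_outputs):
    """Reassemble per-branch predictions into the original joint order."""
    return [branch_outputs[index[j][0]][index[j][1]] for j in joints]
```

In this sketch, `shares_branch` marks exactly the joint pairs whose interaction is kept (same branch); every other pair is separated at the top layers and only shares the lower backbone layers.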
Consequently, D-Gen suppresses the distraction from inconsistent joint relations and yet preserves the beneficial mutual confirmation of consistent joint relations, thus improving cross-species generalization.
We explore three strategies for joint decomposition, i.e., the heuristic, geometric and attention-based strategies. The geometric strategy clusters the joints based on their geometric distances. The attention-based strategy uses the attention mechanism to learn the affinity between joint features and uses the affinity matrix for decomposition. Experimental results show that all three strategies substantially improve cross-species generalization, validating the effectiveness of our joint decomposition. Another interesting observation is that the attention-based strategy surpasses the other two, indicating that attention-based concepts are better than concepts derived from human intuition (i.e., heuristic) or from the pure geometric relation. Since the attention-based approach combines deep features and geometric priors, its superiority over the geometric strategy suggests that there are multiple forms of joint relation beyond the structural relation.
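As an illustration of the geometric strategy, the sketch below groups joints by their average pairwise distance over a set of annotated poses. The text above does not specify the clustering algorithm, so average-linkage agglomerative clustering is an assumption made here for simplicity, and the function names are hypothetical.

```python
import math


def mean_distance_matrix(poses):
    """Average Euclidean distance between every joint pair over all poses.

    `poses` is a list of poses; each pose is a list of (x, y) joint coordinates.
    """
    n = len(poses[0])
    dist = [[0.0] * n for _ in range(n)]
    for pose in poses:
        for i in range(n):
            for j in range(n):
                dist[i][j] += math.dist(pose[i], pose[j]) / len(poses)
    return dist


def cluster_joints(dist, num_concepts):
    """Average-linkage agglomerative clustering (an assumed choice)."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Mean distance between all joint pairs drawn from the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_concepts:
        # Find the closest pair of clusters and merge it.
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]
```

The same routine could in principle be applied to a learned affinity matrix by first converting affinities to distances (for example, one minus the affinity), although that conversion is also an assumption rather than something stated in the text.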

2. RELATED WORK

Our work is closely related to two research areas, i.e., pose estimation and domain generalization.
2D pose estimation for humans and animals. 2D pose estimation refers to identifying all the anatomical joints of bodies in images. There are mainly two paradigms, top-down and bottom-up. The top-down paradigm (Huang et al., 2017; Papandreou et al., 2017; Zhang et al., 2020; Cai et al., 2020; Newell et al., 2016; Moon et al., 2019; Khirodkar et al., 2021) first detects each person and then localizes the joints of each detected person. The bottom-up paradigm (Sun et al., 2019a; Wang et al.,



Figure 1: Two reasons for the domain gap in the joint relation. 1) Structural discrepancy: the part lengths (i.e., the distances between different joints) may vary across species (leftmost); 2) visual discrepancy: the visual similarities between some joints are inconsistent across species, e.g., the visual similarities between faces and other body parts differ for the tiger, the fox, and the cow.

