TOPOZERO: DIGGING INTO TOPOLOGY ALIGNMENT ON ZERO-SHOT LEARNING

Abstract

Common space learning, which associates the semantic and visual domains in a common latent space, is essential for transferring knowledge from seen classes to unseen ones in Zero-Shot Learning (ZSL). Existing methods for common space learning rely heavily on structure alignment because of the heterogeneous nature of the semantic and visual domains, but the existing design is sub-optimal. In this paper, we use persistent homology to investigate geometry structure alignment and observe the following two issues: (i) The sampled mini-batch data points present a distinct structure gap compared to the global data points, so the learned structure alignment space inevitably neglects abundant and accurate global structure information. (ii) The latent visual and semantic spaces fail to preserve multi-dimensional geometry structure, especially high-dimensional structure information. To address the first issue, we propose a Topology-guided Sampling Strategy (TGSS) to mitigate the gap between sampled and global data points. Both theoretical analysis and empirical results support the effectiveness of TGSS. To solve the second issue, we introduce a Topology Alignment Module (TAM) to preserve multi-dimensional geometry structure in the latent visual and semantic spaces, respectively. The proposed method is dubbed TopoZero. Empirically, TopoZero achieves superior performance on three authoritative ZSL benchmark datasets.

1. INTRODUCTION

Given a large amount of training data, deep learning has exhibited excellent performance on various vision tasks, e.g., image recognition He et al. (2016); Dosovitskiy et al. (2020), object detection Lin et al. (2017); Liu et al. (2021), and instance segmentation He et al. (2017); Bolya et al. (2019). However, in a more realistic situation where the test classes do not appear at the training stage, a deep learning model fails to give predictions for these novel classes. To remedy this, pioneering work Lampert et al. (2014); Mikolov et al. (2013) points out that auxiliary semantic information (sentence embeddings and attribute vectors) is available for both seen and unseen classes. By employing this common semantic representation, Zero-Shot Learning (ZSL) was proposed to transfer knowledge from seen classes to unseen ones.

Common space learning, which enables a significant alignment between semantic and visual information in a common embedding space, is a mainstream approach to ZSL. Existing approaches for common space learning can be divided into two categories: algorithms with 1) distribution alignment and 2) structure and distribution alignment. Typical methods in the first category employ various encoding networks to directly align the distributions of the visual and semantic domains, e.g., the variational autoencoder in Schönfeld et al. (2019), the bidirectional latent embedding framework in Wang & Chen (2017), and the deep visual-semantic embedding network in Tsai et al. (2017). Even though these methods encourage distribution alignment between the visual and semantic domains, alignment of the geometry structure is usually neglected. Note that a structure gap naturally exists between these two domains due to their heterogeneous nature Chen et al. (2021c). To mitigate this structure gap and promote alignment between the visual and semantic domains, HSVA Chen et al. (2021c) was proposed and became a pioneering work in the second category. Inspired by the successful structure alignment work Lee et al. (2019) in unsupervised domain adaptation, HSVA introduces a novel hierarchical semantic-visual adaptation framework to align the structure and distribution progressively.

Figure 1: Motivation illustration. (a)-(c) Starting from the same randomly sampled data points $X^{(m)}$ in (a), the batch data points sampled by our Topology-guided Sampling Strategy (TGSS) are closer to the global data points than those sampled by the random sampling strategy (D H (X, X m+1) R ) < D H (X, X )). Combining this illustrative example with our theoretical analysis shows that TGSS can mitigate the structure gap between mini-batch and global data points. (d)-(f) Compared to the input space, the HSVA latent space preserves only 0-dimensional topological features, indicating that some high-dimensional structure representation is lost during the dimension reduction phase. In contrast, our TopoZero latent space preserves more accurate topological features by taking advantage of our proposed Topology Alignment Module.

Although HSVA empirically works well, we find two issues in HSVA's structure alignment module. To present our findings clearly, we first introduce some background on persistent homology Zomorodian & Carlsson (2005). Persistent homology is a tool for computing topological features¹ of a data set at different spatial resolutions. More persistent features are found over a wide range of spatial scales and represent true features of the underlying geometry space.

We first introduce the concept of simplicial homology. For a simplicial complex $R$, i.e., a generalised graph with higher-order connectivity information such as cliques, simplicial homology employs matrix reduction algorithms to assign $R$ a family of groups, namely homology groups. The $d$-th homology group $H_d(R)$ of $R$ contains $d$-dimensional topological features, such as connected components ($d = 0$), cycles/tunnels ($d = 1$), and voids ($d = 2$). Homology groups are typically summarised by their ranks, thereby obtaining a simple invariant signature of a manifold. For example, a circle in $\mathbb{R}^2$ has one feature with $d = 1$ (a cycle) and one feature with $d = 0$ (a connected component).

Based on this background, we describe how to compute persistent homology for a given point cloud $X$. First, we denote the Vietoris-Rips complex Vietoris (1927) of $X$ at scale $\epsilon$ as $V_\epsilon(X)$. Then, we obtain the persistent homology $\mathrm{PH}(V_\epsilon(X))$ of the Vietoris-Rips complex $V_\epsilon(X)$, which consists of persistence diagrams $\{D_1, D_2, \ldots\}$ and persistence pairs $\{\pi_1, \pi_2, \ldots\}$. The $d$-dimensional persistence diagram $D_d$ contains coordinates of the form $(a, b)$, where $a$ is the threshold $\epsilon$ at which a $d$-dimensional topological feature appears and $b$ is the threshold $\epsilon'$ at which it disappears. The $d$-dimensional persistence pairs contain indices $(i, j)$ corresponding to simplices $s_i, s_j \in V_\epsilon(X)$, which create and destroy the corresponding topological feature determined by $(a, b) \in D_d$. More detailed background (e.g., simplex, Vietoris-Rips complex) is given in Section A.

¹ Connectivity-based features, e.g., connected components among 0-dimensional, cycles among 1-dimensional, and voids among 2-dimensional topological features.
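The persistence computation described above can be made concrete for the simplest case, the 0-dimensional features (connected components). The following is a minimal sketch, not the authors' implementation. It assumes that an edge between two points enters the Vietoris-Rips complex when the scale reaches their Euclidean distance (some references use half that distance), and it records each destroying edge by its two endpoint indices. The function name `zero_dim_persistence` and the toy point cloud are hypothetical. Under these assumptions every component is born at scale 0 and dies when an edge first merges it into another component, so a union-find pass over the sorted edges suffices.

```python
import math
from itertools import combinations


def zero_dim_persistence(points):
    """0-dimensional persistence of a Vietoris-Rips filtration (sketch).

    Assumed convention: edge (i, j) enters the complex at scale
    epsilon = Euclidean distance between points i and j.
    Returns (diagram, pairs): diagram holds (birth, death) coordinates,
    pairs holds the endpoint indices of the edge that destroys each feature.
    The one component that never dies is not listed.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All candidate edges, ordered by the scale at which they appear.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    diagram, pairs = [], []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # This edge merges two components: one of them dies here.
            parent[root_i] = root_j
            diagram.append((0.0, length))
            pairs.append((i, j))
    return diagram, pairs


# Toy point cloud (hypothetical): a group of three points and a group of two.
toy_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0), (11.0, 0.0)]
toy_diagram, toy_pairs = zero_dim_persistence(toy_points)
```

In the toy example, three components die at scale 1 (points merging inside each group) and one dies at scale 9 (the two groups merging). The long-lived feature reflects the two-cluster structure of the data, which matches the statement that more persistent features represent true structure. Higher-dimensional features (cycles, voids) require the matrix reduction mentioned in the text and are not covered by this sketch.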
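The structure gap between a mini-batch and the global data points, as discussed in the Figure 1 caption, can be illustrated with a small numeric sketch. It assumes that the distance written as D H in the caption denotes the Hausdorff distance between two point sets, which this excerpt does not state explicitly. The helper `hausdorff`, the toy data, and the two candidate extensions of the batch are hypothetical, and the comparison is not the TGSS procedure itself. It only shows that the choice of the next sampled point changes how well the batch covers the global point cloud.

```python
import math


def hausdorff(set_a, set_b):
    """Symmetric Hausdorff distance between two finite point sets."""
    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(set_a, set_b), directed(set_b, set_a))


# Toy "global" data (hypothetical): two well-separated groups on a line.
global_points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]

# A mini-batch that happens to miss the second group entirely.
batch = [(0.0,), (1.0,)]
gap_before = hausdorff(global_points, batch)

# Two ways to extend the batch by one point.
gap_same_group = hausdorff(global_points, batch + [(2.0,)])
gap_other_group = hausdorff(global_points, batch + [(10.0,)])
```

Here `gap_before` is 10.0, adding a point from the already covered group lowers it only to 9.0, and adding a point from the missed group lowers it to 1.0. Because the batch is a subset of the global points, adding a point can never increase this distance, but which point is added determines how much of the gap is closed.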

