TOPOZERO: DIGGING INTO TOPOLOGY ALIGNMENT ON ZERO-SHOT LEARNING

Abstract

Common space learning, which associates the semantic and visual domains in a common latent space, is essential for transferring knowledge from seen classes to unseen ones in the Zero-Shot Learning (ZSL) realm. Existing methods for common space learning rely heavily on structure alignment due to the heterogeneous nature of the semantic and visual domains, but the existing design is sub-optimal. In this paper, we utilize persistent homology to investigate geometry structure alignment and observe the following two issues: (i) The sampled mini-batch data points present a distinct structure gap compared to the global data points, so the learned structure alignment space inevitably neglects abundant and accurate global structure information. (ii) The latent visual and semantic spaces fail to preserve multi-dimensional geometry structure, especially high-dimensional structure information. To address the first issue, we propose a Topology-guided Sampling Strategy (TGSS) to mitigate the gap between sampled and global data points. Both theoretical analysis and empirical results guarantee the effectiveness of the TGSS. To solve the second issue, we introduce a Topology Alignment Module (TAM) to preserve multi-dimensional geometry structure in the latent visual and semantic spaces, respectively. The resulting method is dubbed TopoZero. Empirically, TopoZero achieves superior performance on three authoritative ZSL benchmark datasets.

1. INTRODUCTION

Given a large amount of training data, deep learning has exhibited excellent performance on various vision tasks, e.g., image recognition He et al. (2016); Dosovitskiy et al. (2020), object detection Lin et al. (2017); Liu et al. (2021), and instance segmentation He et al. (2017); Bolya et al. (2019). However, in a more realistic situation, e.g., when testing classes do not appear at the training stage, deep learning models fail to make predictions on these novel classes. To remedy this, some pioneering researchers Lampert et al. (2014); Mikolov et al. (2013) pointed out that auxiliary semantic information (sentence embeddings and attribute vectors) is available for both seen and unseen classes. Thus, by employing this common semantic representation, Zero-Shot Learning (ZSL) was proposed to transfer knowledge from seen classes to unseen ones. Common space learning, which enables a significant alignment between semantic and visual information in a common embedding space, is a mainstream family of ZSL algorithms. Existing approaches for common space learning can be divided into two categories: algorithms with 1) distribution alignment and 2) structure and distribution alignment. Typical methods in the first category employ various encoding networks to directly align the distributions of the visual and semantic domains, e.g., the variational autoencoder in Schönfeld et al. (2019), the bidirectional latent embedding framework in Wang & Chen (2017), and the deep visual-semantic embedding network in Tsai et al. (2017). Even though these methods encourage distribution alignment between the visual and semantic domains, alignment of the geometry structure is usually neglected. Note that a structure gap naturally exists between these two domains due to their heterogeneous nature Chen et al. (2021c). To mitigate the structure gap and promote alignment between the visual and semantic domains, HSVA Chen et al.
(2021c) was proposed and became a pioneering work in the second category. Inspired by the successful structure alignment work Lee et al. (2019) in unsupervised domain adaptation, HSVA introduces a novel hierarchical semantic-visual adaptation framework to align the structure and distribution progressively.

Figure 1: … (i.e., D_H(X, X_T^(m+1)) < D_H(X, X_R^(m+1))). Combining this illustrative example with our theoretical analysis guarantees that our TGSS can mitigate the structure gap between mini-batch and global data points. (d)-(f) Compared to the input space, the HSVA latent space can only preserve 0-dimensional topological features, indicating that some high-dimensional structure representation is lost during the dimension reduction phase. In contrast, our TopoZero latent space preserves more accurate topological features by taking advantage of our proposed Topology Alignment Module.

Although HSVA empirically works well, we discover two issues in HSVA's structure alignment module. To clarify our findings, we first introduce some background on Persistent Homology Zomorodian & Carlsson (2005). Persistent homology is a tool for computing topological features [1] of a data set at different spatial resolutions. Features that persist over a wide range of spatial scales represent true features of the underlying geometry space. We first introduce the concept of simplicial homology. For a simplicial complex R, i.e., a generalised graph with higher-order connectivity information such as cliques, simplicial homology employs matrix reduction algorithms to assign R a family of groups, namely homology groups. The d-th homology group H_d(R) of R contains d-dimensional topological features, such as connected components (d = 0), cycles/tunnels (d = 1), and voids (d = 2). Homology groups are typically summarised by their ranks, thereby obtaining a simple invariant signature of a manifold.
For example, a circle in R^2 has one feature with d = 1 (a cycle) and one feature with d = 0 (a connected component). Based on this background knowledge, we further introduce how to compute persistent homology given a point cloud X. First, we denote the Vietoris-Rips complex Vietoris (1927) of X at scale ε as V_ε(X). Then, we can obtain the persistent homology PH(V_ε(X)) of the Vietoris-Rips complex V_ε(X), which consists of persistence diagrams {D_1, D_2, ...} and persistence pairs {π_1, π_2, ...}. The d-dimensional persistence diagram D_d contains coordinates of the form (a, b), where a refers to the threshold ε at which a d-dimensional topological feature appears and b refers to the threshold ε' at which it disappears. The d-dimensional persistence pairs contain indices (i, j) corresponding to simplices s_i, s_j ∈ V_ε(X) that create and destroy the topological feature determined by (a, b) ∈ D_d. Note that more detailed background knowledge (e.g., simplex, Vietoris-Rips complex) is introduced in Section A. Based on the powerful geometry feature analysis ability of persistent homology, we discover two problems in the existing state-of-the-art (SOTA) structure alignment module Chen et al. (2021c): (i) Due to the limitation of batch size, the underlying geometry structure of mini-batch samples cannot represent that of the global samples. Thus, when applying a structure alignment metric (i.e., sliced Wasserstein discrepancy Lee et al. (2019)) on randomly sampled [2] mini-batch visual and semantic data points, we can only achieve a local-level structure alignment, indicating that accurate global geometry information is inevitably lost. (ii) HSVA utilizes sliced Wasserstein discrepancy to align the latent visual and semantic spaces for bridging structure alignment. In fact, this implementation requires the assumption that the latent visual and semantic spaces represent their underlying geometry structure adequately.
To verify the correctness of this assumption, we adopt persistent homology to visualize the underlying geometry structure of the input space and latent space of the visual domain. As shown in Fig. 1 (d)-(f), there is a distinct gap between the blue dashed line and the orange dashed line, which is further expanded in the latter two images, representing that the HSVA latent visual space loses abundant geometry structure, especially 1-dimensional and 2-dimensional topological features. The rationale is that after dimensionality reduction (namely the curse of dimensionality Wang & Chen (2017)), the topological structure is difficult to maintain. In this paper, we devise a TopoZero framework to achieve a more desirable structure alignment by solving the two aforementioned issues. Concretely, our TopoZero adopts CADA-VAE Schönfeld et al. (2019) as the distribution alignment module and develops a Topology Alignment Module (TAM) with the following two novelties. (i) To alleviate the structure gap between the sampled mini-batch data points and the global data points, we propose a Topology-guided Sampling Strategy (TGSS) to explicitly and progressively mine topology-preserving data points into the sampled mini-batch. Moreover, the theoretical analysis in Section A guarantees the advantage of our TGSS. Besides, as shown in Fig. 1 (b)-(c), we visualize the advantage of our TGSS in an illustrative example: based on the same randomly sampled data points X^(m), the sets X_T^(m+1) and X_R^(m+1) are constructed by our TGSS and the random sampling strategy, respectively. The Hausdorff Distance [3] D_H(X, X_T^(m+1)) between X_T^(m+1) and the global data points X is bounded by D_H(X, X_R^(m+1)), indicating that our TGSS alleviates the gap between sampled and global data points compared to the random sampling strategy.
(ii) To preserve the topological structure of the visual and semantic latent spaces, we develop a dual topology-aware branch as well as a topology-preserving loss to learn a topology-invariant latent representation. Moreover, based on the open-source tool Ripser [4], we compute the persistent homology to analyze the multi-dimensional topological features of the input space, the HSVA latent structure space, and the TopoZero latent structure space on the visual domain. Given a set of data points, Ripser can compute the corresponding persistent homology, which consists of persistence diagrams {D_1, D_2, ...} and persistence pairs {π_1, π_2, ...}. Based on the obtained persistence diagrams and persistence pairs, we can calculate the number of alive 0/1/2-dimensional topological features under different thresholds ε. As such, we draw Fig. 1 (d)-(f), where each line represents the trend of the number of alive topological features under different thresholds ε. As revealed by these visualization results, by taking advantage of our proposed TAM, the multi-dimensional topology feature gap between our TopoZero latent space and the input space is negligible.
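To make the counting procedure above concrete, the following sketch computes the 0-dimensional persistent homology of a tiny point cloud with a union-find (Kruskal-style) sweep instead of Ripser, and counts how many connected components are still alive at a given threshold ε. The point cloud, function names, and thresholds are illustrative, not from the paper; Ripser would additionally return the 1- and 2-dimensional diagrams.

```python
from itertools import combinations
from math import dist, inf

def zero_dim_persistence(points):
    """0-dimensional persistence via a Kruskal/union-find sweep.

    Every feature is born at scale 0. Each merge of two connected
    components at edge length d kills one feature: pair (0, d).
    One essential component never dies: pair (0, inf)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist(p, q), i, j)
                   for (i, p), (j, q) in combinations(enumerate(points), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # two components merge: one feature dies at d
            parent[ri] = rj
            deaths.append(d)
    return [(0.0, d) for d in deaths] + [(0.0, inf)]

def alive_features(diagram, eps):
    """Number of features alive at scale eps (birth <= eps < death)."""
    return sum(1 for b, d in diagram if b <= eps < d)

# Two well-separated pairs of points: components merge at scales 1, 1, 10.
cloud = [(0, 0), (0, 1), (10, 0), (10, 1)]
dgm = zero_dim_persistence(cloud)
print(alive_features(dgm, 0.5))   # 4: four separate points
print(alive_features(dgm, 5.0))   # 2: two clusters remain
print(alive_features(dgm, 20.0))  # 1: one essential component
```

Plotting `alive_features` over a sweep of ε reproduces, for dimension 0, the kind of curves shown in Fig. 1 (d)-(f).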

2. RELATED WORKS

Zero-Shot Learning. In recent years, the ZSL realm has attracted many researchers' attention Zhang & Saligrama (2016); Li et al. (2017); Zhu et al. (2019a); Fu et al. (2015); Ye & Guo (2017); Yu & Lee (2019b); Chen et al. (2018). One typical branch of solutions to the ZSL problem learns a common embedding space for aligning the semantic and visual domains, termed common space learning. Early common space learning methods focus on framework design for better distribution alignment. Wang & Chen (2017) proposed a bidirectional latent embedding framework with two subsequent learning stages. Liu et al. (2018) map visual features and semantic representations of class prototypes into a common embedding space to guarantee that the seen data is compatible with seen and unseen classes. CADA-VAE Schönfeld et al. (2019) demonstrated that only two variational autoencoders, together with a distribution alignment loss, can achieve significant distribution alignment in a common space. However, as pointed out by HSVA Chen et al. (2021c), due to the heterogeneous nature of the feature representations in the semantic and visual domains, distribution and structure variation intrinsically exists. Motivated by this, Chen et al. (2021c) propose a hierarchical semantic-visual adaptation framework to align structure and distribution progressively. Structure alignment thus emerged as a new state of the art for the task of common space learning in ZSL.

Persistent Homology. Persistent homology, a tool for topological data analysis, is used for understanding topological features across different dimensions. Concretely, persistent homology can detect multi-dimensional topological features (connected components, cycles, voids) of the underlying manifold of a set of sampled data points.
Based on this property, persistent homology has been applied to a broad range of scenarios, e.g., characterizing graphs in Archambault et al. (2007); Carrière et al. (2020); Li et al. (2012), analysing underlying manifolds in Bae et al. (2017); Futagami et al. (2019), and topology-preserving autoencoders in Moor et al. (2020). In this paper, by leveraging persistent homology, we discover that the latent visual and semantic spaces cannot preserve multi-dimensional topological features. Furthermore, to improve the geometry representation of the latent spaces in both domains, we propose a Topology Alignment Module for explicitly encoding multi-dimensional topological representations.

3. METHODOLOGY

To begin with, we formulate the ZSL task. Assume we have a set of seen samples S for training and a set of unseen samples U for testing only, where S = {(x_s, y_s, a_s) | x_s ∈ X_s, y_s ∈ Y_s, a_s ∈ A} is the training set. Here x_s is a seen image feature extracted from a pre-trained CNN backbone (ResNet-101 He et al. (2016) in this paper), and y_s and a_s are the class label and semantic vector corresponding to x_s, respectively. Analogously, let U = {(x_u, y_u) | x_u ∈ X_u, y_u ∈ Y_u}. Note that Y_s ∩ Y_u = ∅. The objective of conventional ZSL (CZSL) is to learn a classifier that maps unseen image features into unseen categories, i.e., F_czsl : X_u → Y_u, while the more challenging generalized ZSL (GZSL) focuses on learning a classifier that maps image features to both seen and unseen categories, i.e., F_gzsl : X → Y_u ∪ Y_s. As shown in Fig. 2, our TopoZero contains two parallel alignment modules: a Distribution Alignment Module and a Topology Alignment Module. Specifically, we directly adopt the architecture of CADA-VAE Schönfeld et al. (2019) as our Distribution Alignment Module. For our TAM, a topology-guided sampling strategy and a dual topology-aware branch are proposed to mitigate the geometry structure gap between mini-batch and global data points and to preserve multi-dimensional topological structure in the visual and semantic domains, respectively.

3.1. TOPOLOGY-GUIDED SAMPLING STRATEGY

To bridge the structure gap between mini-batch and global data points, we propose a Topology-guided Sampling Strategy (TGSS) together with a theoretical analysis that guarantees its superiority.

3.1.1. DESCRIPTION

Algorithm 1 describes how our TGSS samples mini-batch data points from the global data points. First, we randomly sample b/2 [5] data points (X_b2) from the global training samples (X). After that, we select the incremental data point x_max according to Eq. 1. Then, we construct a candidate set C by Eq. 2 and randomly sample b/2 − 1 data points from C to form C_mini. Finally, the mini-batch is constructed by integrating X_b2, x_max, and C_mini. The advantage of our TGSS relies heavily on the selection of x_max.

Figure 2: … to preserve multi-dimensional structure information. L_TA is also applied to align z^t_x and z^t_a. For the distribution alignment module, we adopt the framework of CADA-VAE, which consists of two variational autoencoders optimized by L_BCE, L_KL, L_DA, and L_CA.

∃ x_max ∈ X, x'_max ∈ X_b2, s.t. dist(x_max, x'_max) = d_H(X, X_b2)    (1)

C(x_max, d) = {x_i ∈ X | x_i ∉ T and dist(x_i, x_max) < d}    (2)

where T denotes the set of already sampled data points from X, dist represents the distance metric (Euclidean distance in this paper), and d_H refers to the Hausdorff distance Huttenlocher et al. (1993) between X and X_b2. We then revisit the definition of the Hausdorff distance, d_H(X, Y) = max{sup_{x∈X} d(x, Y), sup_{y∈Y} d(X, y)}, which measures how far two subsets of a metric space are from each other. Informally speaking, x_max is the data point in X farthest from the sampled X_b2 under the Hausdorff distance metric. Thus, by integrating x_max into X_b2, the Hausdorff distance between X and X_b2 is reduced, indicating that the gap between sampled and global data points is also mitigated according to Theorem 1. Moreover, considering that the advantage of our TGSS relies heavily on the selection of x_max, we provide a theoretical analysis to guarantee its superiority.
Besides, the introduction of C_mini is intended to maintain the representation of the local topology structure surrounding x_max.
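The selection step described above can be sketched as follows. This is a minimal illustration of Eqs. 1-2 under the paper's definitions; the helper names and the toy point sets are ours, and a real implementation would operate on batched high-dimensional features.

```python
from itertools import combinations
from math import dist

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    d_ab = max(min(dist(a, b) for b in B) for a in A)
    d_ba = max(min(dist(b, a) for a in A) for b in B)
    return max(d_ab, d_ba)

def select_x_max(X, X_b2):
    """Eq. 1: the point of X farthest from the current sample X_b2."""
    return max((x for x in X if x not in X_b2),
               key=lambda x: min(dist(x, s) for s in X_b2))

def candidate_set(X, x_max, d, sampled):
    """Eq. 2: unsampled points within distance d of x_max."""
    return [x for x in X if x not in sampled and dist(x, x_max) < d]

# Toy global set: a tight cluster near the origin plus two far outliers.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (9.0, 9.0), (9.5, 9.0)]
X_b2 = [(0.0, 0.0), (0.1, 0.0)]

x_max = select_x_max(X, X_b2)
C = candidate_set(X, x_max, hausdorff(X_b2, X), X_b2 + [x_max])
print(x_max)  # (9.5, 9.0): the farthest outlier
print(C)      # [(9.0, 9.0)]: the outlier's neighbour
# Adding points to a subset never increases the Hausdorff distance to X:
print(hausdorff(X, X_b2 + [x_max]) <= hausdorff(X, X_b2))  # True
```

Because x_max is by construction the point realizing d_H(X, X_b2), appending it removes the largest term of the supremum, which is exactly the effect the theoretical analysis below quantifies in expectation.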

3.1.2. THEORETICAL ANALYSIS FOR TGSS

The core design of our TGSS is the procedure for selecting x_max (line 4 in Algorithm 1), which reduces the structure gap compared to the random sampling strategy. Here, we provide a theoretical analysis to guarantee the advantage of this selection procedure. Before carrying out the analysis, we fix some definitions and notation. For a point cloud X := {x_1, . . . , x_n} ⊆ R^d, let X^(m) denote a subsample of X with cardinality m. Based on X^(m) and TGSS's selection of x_max, the constructed set is denoted X_T^(m+1), while for the random sampling strategy we have X_R^(m+1). Thus:

X_T^(m+1) = X^(m) ∪ {x_max}    (3)
X_R^(m+1) = X^(m) ∪ {x}, x ∈ X \ X^(m)    (4)

where x_max is defined in Eq. 1.

Theorem 1 (Moor et al. (2020)). Let X be a point cloud of cardinality n and X^(m) ⊆ X one subsample of cardinality m, sampled without replacement. The probability of the persistence diagrams of X^(m) exceeding a threshold in terms of the bottleneck distance is bounded as

P(d_b(D^X, D^{X^(m)}) > ε) ≤ P(d_H(X, X^(m)) > 2ε)    (5)

Algorithm 1: Topology-guided Sampling Strategy
1: T = ∅
2: for each mini-batch do
3:   X_b2 ← randomly sample b/2 data points from X
4:   select x_max according to Eq. 1
5:   C = C(x_max, d_H(X_b2, X))
6:   X_b2 = X_b2 ∪ {x_max}
7:   if len(C) < b/2 − 1 then
8:     M ← randomly select b/2 − 1 − len(C) data points from X \ X_b2
9:     X_b2 = X_b2 ∪ C ∪ M
10:  else
11:    M ← randomly select b/2 − 1 data points from C
12:    X_b2 = X_b2 ∪ M
13:  end if
14:  T = T ∪ X_b2
15: end for
16: return T

Theorem 2. Let A_{X,X_T^(m+1)} ∈ R^{n×(m+1)} be the distance matrix between the samples of X and X_T^(m+1), and A_{X,X_R^(m+1)} ∈ R^{n×(m+1)} the distance matrix between the samples of X and X_R^(m+1). Both matrices are sorted so that the first m+1 rows correspond to the columns of the m+1 subsampled points, with diagonal elements a_ii = 0. Assume that the entries a_ij of both matrices are independent and follow the same distance distribution F_D when i > m+1.
For A_{X,X_T^(m+1)}, define the minimal distances δ'_i = min_{1≤j≤m+1} a_ij for i > m+1, and define δ''_i analogously for A_{X,X_R^(m+1)}. Then

E[d_H(X, X_T^(m+1))] ≤ E[d_H(X, X_R^(m+1))]

We include the proof in the appendix. Theorem 2 illustrates that, compared to the random sampling strategy (X_R^(m+1)), the batch data points X_T^(m+1) sampled by our TGSS are closer to the global data points X under the Hausdorff distance metric, which constitutes the upper bound of the bottleneck distance between the two persistence diagrams (Theorem 1). Since the bottleneck distance is commonly used to measure the distance between two persistence diagrams in the topological space Beketayev et al. (2014); Bubenik et al. (2010), we conclude that, compared to the random sampling strategy, the batch data points sampled by our TGSS are closer to the global data points in the topological space.

3.2. TOPOLOGY ALIGNMENT MODULE

As shown in Fig. 1 (a)-(c), HSVA, a state-of-the-art common space learning method that takes structure alignment into account, fails to preserve multi-dimensional topological features. This poor structure representation in the latent space inevitably leads to sub-optimal structure alignment. To remedy this, we propose a Topology Alignment Module, consisting of a dual topology-aware branch and a topology-preserving loss, to encode multi-dimensional topological information into the latent visual and semantic spaces for a more desirable structure alignment. Our Dual Topology-aware Branch is illustrated in Fig. 2; it contains two autoencoders for obtaining topology-aware latent representations in the visual and semantic domains. Specifically, the encoder E^t_x / E^t_a encodes the image feature x / semantic vector a into the latent space to obtain the visual / semantic latent representation, and the decoder D^t_x / D^t_a reconstructs the latent representation into x / a. We first apply a reconstruction loss to optimize our Dual Topology-aware Branch:

L^x_AE = L_REC = ||D^t_x(E^t_x(x)) − x||_2    (7)
L^a_AE = L_REC = ||D^t_a(E^t_a(a)) − a||_2    (8)

Then we utilize the topology-preserving loss proposed by Moor et al. (2020) to preserve multi-dimensional topological features in the latent visual and semantic spaces, which is calculated by the following steps: 1) given a batch of visual features X^(m)_v, compute its pairwise distance matrix, termed A_{X^(m)_v}; 2) compute the corresponding persistent homology of X^(m)_v, recorded as PH(V_ε(X^(m)_v)) = (D_{X^(m)_v}, π_{X^(m)_v}); analogously, for X^(m)_a, Z^(m)_v, and Z^(m)_a, we obtain the corresponding distance matrices A_{X^(m)_a}, A_{Z^(m)_v}, and A_{Z^(m)_a} and persistence pairings π_{X^(m)_a}, π_{Z^(m)_v}, and π_{Z^(m)_a}; 3) finally, we retrieve the values of the 0-/1-/2-dimensional persistence diagrams [6] from the distance matrix with the indices provided by the persistence pairings, namely D^0_{X^(m)_v} ≃ A_{X^(m)_v}[π^0_{X^(m)_v}].
Through this computation process, we obtain the 0/1/2-dimensional persistence diagrams of X^(m)_v, X^(m)_a, Z^(m)_v, and Z^(m)_a, which are optimized by the following topology-preserving losses:

L^x_TP = Σ_{i=0}^{2} ||D^i_{X^(m)_v} − D^i_{Z^(m)_v}||^2,  L^a_TP = Σ_{i=0}^{2} ||D^i_{X^(m)_a} − D^i_{Z^(m)_a}||^2

Finally, to encourage interaction between the visual and semantic domains in the topological space, we directly minimize the L2 distance between the latent visual and latent semantic topological representations:

L_TA = ||Z^(m)_v − Z^(m)_a||^2

3.3. DISTRIBUTION ALIGNMENT MODULE

For the Distribution Alignment Module, we adopt the framework of CADA-VAE Schönfeld et al. (2019), which consists of two variational autoencoders optimized by the following losses:

L^x_VAE = L_BCE − βL_KL = E_{E^d_x(x)}[log D^d_x(z^d_x)] − βD_KL(E^d_x(x) || p(z))    (13)
L^a_VAE = L_BCE − βL_KL = E_{E^d_a(a)}[log D^d_a(z^d_a)] − βD_KL(E^d_a(a) || p(z))    (14)

where D_KL represents the Kullback-Leibler divergence and p(z) is a prior distribution (the standard Gaussian N(0, 1) in this paper). The binary cross-entropy loss L_BCE serves as the reconstruction loss. Following Schönfeld et al. (2019), β is the balancing weight measuring the importance of D_KL. The distribution alignment loss is formulated as:

L_DA = ( ||μ_x − μ_a||_2^2 + ||(δ_x)^{1/2} − (δ_a)^{1/2}||_F^2 )^{1/2}    (15)

[6] Due to page limits, we provide a more detailed computation process in the appendix.

where ||·||_F^2 is the squared matrix Frobenius norm, and the cross-alignment losses are formulated as:

L^x_CA = |x − D^d_x(E^d_a(a))|    (16)
L^a_CA = |a − D^d_a(E^d_x(x))|    (17)
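The 0-dimensional part of the topology-preserving loss can be sketched as follows. As in Moor et al. (2020), the 0-dimensional persistence pairings are exactly the edges of a minimum spanning tree, and the loss compares input-space and latent-space distances retrieved at those pairings. The toy data and helper names are ours; a full implementation would also handle the 1- and 2-dimensional diagrams and both retrieval directions, and would backpropagate through the retrieved distances.

```python
from itertools import combinations
from math import dist

def mst_pairings(points):
    """0-dim persistence pairings: the edges (i, j) of a minimum
    spanning tree, found by a Kruskal/union-find sweep. Each edge is
    the 'destroyer' simplex of one connected component."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairings = []
    for d, i, j in sorted((dist(p, q), i, j)
                          for (i, p), (j, q) in combinations(enumerate(points), 2)):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            pairings.append((i, j))
    return pairings

def topo_loss_0d(X, Z):
    """Squared difference between input and latent distances retrieved
    at the 0-dim pairings of the input space (one retrieval direction)."""
    return sum((dist(X[i], X[j]) - dist(Z[i], Z[j])) ** 2
               for i, j in mst_pairings(X))

# Toy batch: latent codes are the input coordinates shrunk by half, so
# every distance retrieved at the pairings is halved.
X = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
Z = [(x / 2, y / 2) for x, y in X]
print(topo_loss_0d(X, Z))  # (1-0.5)^2 * 2 + (10-5)^2 = 25.5
```

A latent space that preserved these 0-dimensional features exactly (Z = X up to isometry) would drive this term to zero.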

3.4. TOPOZERO OBJECTIVE FUNCTION

Our TopoZero is optimized by the following objective function:

L_TopoZero = L^x_AE + L^a_AE + λ_1(L^x_TP + L^a_TP) + λ_2 L_TA + λ_3(L^x_CA + L^a_CA + L^x_VAE + L^a_VAE + L_DA)    (18)

where λ_1, λ_2, and λ_3 are balancing weights measuring the importance of each module in TopoZero. In the TAM branch, L^x_AE and L^a_AE aim to obtain the latent visual and semantic representations, L^x_TP and L^a_TP help the latent visual and semantic representations preserve multi-dimensional topology structure, and L_TA associates the semantic and visual latent representations in a common space. For the distribution alignment branch, all the objective functions remain the same as those in CADA-VAE Schönfeld et al. (2019).
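The weighting structure of Eq. 18 can be written out as a plain weighted sum. The dictionary keys below are our own shorthand for the paper's loss terms, and the per-term values are dummies; the λ values 0.05, 0.05, and 1 are those reported in the hyper-parameter analysis.

```python
def topozero_objective(losses, lam1, lam2, lam3):
    """Eq. 18 as a weighted sum; `losses` maps term names to scalars."""
    return (losses["ae_x"] + losses["ae_a"]
            + lam1 * (losses["tp_x"] + losses["tp_a"])
            + lam2 * losses["ta"]
            + lam3 * (losses["ca_x"] + losses["ca_a"]
                      + losses["vae_x"] + losses["vae_a"] + losses["da"]))

# With every term equal to 1: 2 + 0.05*2 + 0.05*1 + 1*5 = 7.15.
terms = {k: 1.0 for k in ["ae_x", "ae_a", "tp_x", "tp_a", "ta",
                          "ca_x", "ca_a", "vae_x", "vae_a", "da"]}
print(round(topozero_objective(terms, 0.05, 0.05, 1.0), 2))  # 7.15
```

This also makes visible the imbalance discussed later: with λ_3 = 1, the distribution alignment group dominates the objective whenever its raw values are large.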

3.5. ZERO-SHOT PREDICTION

After the optimization of TopoZero, we train F_gzsl and F_czsl for predicting unseen or seen samples. Given a seen image feature x_s, we can obtain its latent distribution representation z^d_{x_s} = E^d_x(x_s).
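The section is truncated here; in the CADA-VAE framework that TopoZero builds on, the final classifier is trained on latent codes of seen images together with latent codes generated from the semantic vectors of unseen classes. The sketch below illustrates that recipe with a nearest-centroid stand-in for the learned softmax classifier; all names and toy data are ours, not the paper's.

```python
from math import dist

def fit_centroids(latents, labels):
    """Average the latent codes per class to form class centroids."""
    sums, counts = {}, {}
    for z, y in zip(latents, labels):
        acc = sums.setdefault(y, [0.0] * len(z))
        sums[y] = [a + b for a, b in zip(acc, z)]
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in s) for y, s in sums.items()}

def predict(z, centroids):
    """GZSL prediction: the class whose centroid is nearest in latent space."""
    return min(centroids, key=lambda y: dist(z, centroids[y]))

# Seen latents would come from the visual encoder E^d_x; unseen latents
# from the semantic encoder E^d_a (here: fixed toy vectors).
latents = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0)]
labels = ["seen_cat", "seen_cat", "unseen_zebra", "unseen_zebra"]
centroids = fit_centroids(latents, labels)
print(predict((0.1, 0.1), centroids))   # seen_cat
print(predict((4.9, 5.1), centroids))   # unseen_zebra
```

Because seen and unseen classes share one latent space, the same classifier covers Y_s ∪ Y_u for GZSL, while restricting the candidate classes to Y_u yields the CZSL classifier.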

4. EXPERIMENTS

In this section, we first elaborate on the implementation details and the three authoritative benchmark datasets in the field of ZSL. Then we compare our TopoZero with existing state-of-the-art ZSL methods. Finally, we provide qualitative and quantitative analyses to illustrate the advantages of our TopoZero. Due to page limits, several parts are placed in Appendix A.

4.1. DATASETS AND IMPLEMENTATION

Datasets. We verify TopoZero on three popular ZSL benchmark datasets: CUB Welinder et al. (2010), SUN Patterson & Hays (2012), and AWA2.

Evaluation Protocols. Following the standard evaluation protocol Xian et al. (2018a), our TopoZero is evaluated by top-1 accuracy. For CZSL, we only compute the accuracy on unseen classes, while for GZSL we calculate the accuracy on both seen and unseen classes. To determine GZSL performance under a unified criterion, the harmonic mean, defined as H = (2 × S × U)/(S + U), is adopted in this paper.

Results on Conventional Zero-Shot Learning. Since some branches of methods take advantage of data augmentation, they are not taken into account in this part. Compared to methods with distribution alignment only, our TopoZero shows a significant improvement of at least 6.4%, 3.1%, and 8.0% on the CUB, SUN, and AWA2 datasets, respectively. Compared to HSVA Chen et al. (2021c), which performs both distribution and structure alignment, our TopoZero still achieves improvements of 1.5% and 0.9% on the CUB and SUN datasets, respectively. Such performance directly verifies the effectiveness of topology alignment for the ZSL task. Results on Generalized Zero-Shot Learning. Looking at the challenging GZSL results in Tab. 2, our TopoZero also achieves a dominant harmonic mean performance of 57.3%, 44.7%, and 68.0% on the CUB, SUN, and AWA2 datasets, respectively. The superior results of TopoZero in both the CZSL and GZSL settings demonstrate that our TopoZero is better than HSVA at structure alignment.
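The harmonic mean criterion above is a one-liner; the seen/unseen accuracies in the example are illustrative values, not results from the paper.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """GZSL criterion H = (2 * S * U) / (S + U), in percentage points."""
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

# E.g., a model with 60% seen and 40% unseen top-1 accuracy:
print(harmonic_mean(60.0, 40.0))  # 48.0
```

H is always pulled toward the smaller of S and U, so a model cannot score well by sacrificing unseen-class accuracy for seen-class accuracy.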

5. CONCLUSION

In this paper, we propose a TopoZero framework to improve structure alignment for common space learning methods. To begin with, we discover that existing structure alignment approaches confront two challenging issues: 1) sampled mini-batch data points present a distinct gap compared to global ones; 2) the latent visual and semantic spaces lose some high-dimensional structure information due to the 'curse of dimensionality'. To solve these two problems, a Topology-guided Sampling Strategy and a Topology Alignment Module are proposed to construct our TopoZero. Furthermore, we provide a theoretical analysis as well as visualization results to guarantee the advantage of our TopoZero, namely its excellent multi-dimensional topology-preserving and topology-alignment ability. Finally, the extensive and superior experimental results demonstrate that our TopoZero has great potential to advance the ZSL community.

APPENDIX

B. PROOF OF THEOREM 2

Proof. First, we derive the distributions F_Δ'(y) and F_Δ''(y). For the TGSS sample:

F_Δ'(y) = P(δ'_i ≤ y) = 1 − P(δ'_i > y) = 1 − P(min_{1≤j≤m+1} a_ij > y)    (19)
        = 1 − P(∩_j {a_ij > y}) = 1 − (1 − F_D(y))^{m+1}    (20)

so that

F_Δ'(y) = { 1 − (1 − F_D(y))^{m+1},  y < E[d_H(X, X^(m))]
          { 1,                        otherwise    (21)

Analogously, for the random sample:

F_Δ''(y) = P(δ''_i ≤ y) = 1 − P(δ''_i > y) = 1 − P(∩_j {a_ij > y})    (22)
         = 1 − (1 − F_D(y))^{m+1}    (23)

For convenience, we denote 1 − (1 − F_D(y))^{m+1} as F_Δ(y). Next, we derive the distributions F_Z'(z) and F_Z''(z) of Z' and Z'', respectively:

F_Z'(z) = P(Z' ≤ z) = P(max_{m+1<i≤n} δ'_i ≤ z) = P(∩_{m+1<i≤n} {δ'_i ≤ z})    (25)
        = { F_Δ(z)^{n−m−1},  z < E[d_H(X, X^(m))]
          { 1,               otherwise    (26)

Analogously,

F_Z''(z) = P(Z'' ≤ z) = P(max_{m+1<i≤n} δ''_i ≤ z) = P(∩_{m+1<i≤n} {δ''_i ≤ z})    (27)
         = { F_Δ(z)^{n−m−1},  z < E[d_H(X, X^(m))]
           { F_Δ(z),          otherwise    (28)

Thus, writing T := E[d_H(X, X^(m))],

E_{Z'∼F_Z'}[Z'] = ∫_0^{+∞} (1 − F_Z'(z)) dz − ∫_{−∞}^0 F_Z'(z) dz    (29)
              = ∫_0^{+∞} (1 − F_Z'(z)) dz    (30)
              = ∫_0^T (1 − F_Z'(z)) dz + ∫_T^{+∞} (1 − F_Z'(z)) dz    (31)
              = ∫_0^T (1 − F_Δ(z)^{n−m−1}) dz + ∫_T^{+∞} (1 − 1) dz    (32)
              = ∫_0^T (1 − F_Δ(z)^{n−m−1}) dz    (33)

and:

E_{Z''∼F_Z''}[Z''] = ∫_0^{+∞} (1 − F_Z''(z)) dz    (34)
                 = ∫_0^T (1 − F_Z''(z)) dz + ∫_T^{+∞} (1 − F_Z''(z)) dz    (35)
                 = ∫_0^T (1 − F_Δ(z)^{n−m−1}) dz + ∫_T^{+∞} (1 − F_Δ(z)) dz    (36)

Finally,

E_{Z'∼F_Z'}[Z'] − E_{Z''∼F_Z''}[Z''] = ∫_T^{+∞} (F_Δ(z) − 1) dz ≤ 0    (38)
⇒ E_{Z'∼F_Z'}[Z'] ≤ E_{Z''∼F_Z''}[Z'']    (39)
⇒ E[d_H(X, X_T^(m+1))] ≤ E[d_H(X, X_R^(m+1))]

C. PERSISTENT HOMOLOGY

Here, we provide further explanation of the definitions of simplex, simplicial complex, abstract simplicial complex, and Vietoris-Rips complex. (a) Simplex: In geometry, a simplex is a generalization of the notion of a triangle or tetrahedron to arbitrary dimensions. The simplex is so named because it represents the simplest possible polytope made with line segments in any given dimension. For example, a 0-simplex is a point, a 1-simplex is a line segment, and a 2-simplex is a triangle. (b) Simplicial Complex: In topology, it is common to "glue together" simplices to form a simplicial complex. A simplicial complex is a set composed of points, line segments, triangles, and their n-dimensional counterparts.
The strict definition is as follows: a simplicial complex K is a set of simplices that satisfies the following conditions: 1) every face of a simplex from K is also in K; 2) the non-empty intersection of any two simplices σ_1, σ_2 ∈ K is a face of both σ_1 and σ_2. (c) Abstract Simplicial Complex: The purely combinatorial counterpart of a simplicial complex is an abstract simplicial complex. (d) Vietoris-Rips complex: In topology, the Vietoris-Rips complex, also called the Vietoris complex or Rips complex, is a way of forming a topological space from distances in a set of points. It is an abstract simplicial complex that can be defined from any metric space M and distance δ by forming a simplex for every finite set of points that has diameter at most δ. That is, it is a family of finite subsets of M, in which we think of a subset of k points as forming a (k−1)-dimensional simplex (an edge for two points, a triangle for three points, a tetrahedron for four points, etc.); if a finite set S has the property that the distance between every pair of points in S is at most δ, then we include S as a simplex in the complex. As illustrated in Moor et al. (2020), we can compute the persistent homology of a set of data points X based on this background information.
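The Vietoris-Rips construction above can be enumerated directly for small point sets; the brute-force sketch below (function name and toy triangle are ours) builds all simplices up to dimension 2 by checking the diameter condition on every subset.

```python
from itertools import combinations
from math import dist

def vietoris_rips(points, delta, max_dim=2):
    """Enumerate the simplices of V_delta(points): every subset of up
    to max_dim + 1 points whose diameter (largest pairwise distance)
    is at most delta. A subset of k points is a (k-1)-simplex."""
    simplices = []
    for k in range(1, max_dim + 2):
        for subset in combinations(range(len(points)), k):
            if all(dist(points[i], points[j]) <= delta
                   for i, j in combinations(subset, 2)):
                simplices.append(subset)
    return simplices

# An equilateral triangle with unit sides.
tri = [(0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2)]
print(len(vietoris_rips(tri, 0.5)))  # 3: only the three vertices
print(len(vietoris_rips(tri, 1.0)))  # 7: 3 vertices + 3 edges + 1 triangle
```

Sweeping δ upward and tracking when simplices enter the complex is exactly the filtration whose homology the persistence diagrams summarise; practical tools like Ripser avoid this exponential enumeration.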

D COMPUTATION PROCEDURE OF TOPOLOGY-PRESERVING LOSS

Here, we further introduce how to retrieve the values of the 0-/1-/2-dimensional persistence diagrams from the distance matrix with the indices provided by the persistence pairings, namely D^0_{X^(m)_v} ≃ A_{X^(m)_v}[π^0_{X^(m)_v}]. In essence, this retrieval procedure amounts to selecting retrieval indices from the 0-/1-/2-dimensional persistence pairings. Concretely, for 0-dimensional topological features, we select the "destroyer" simplices in the 0-dimensional persistence pairings. For 1-dimensional and 2-dimensional topological features, we regard the maximum edge of the "destroyer" simplices in the corresponding persistence pairings as the retrieval indices.

Model Complexity Analysis. Our TopoZero has a clear intuition of leveraging parallel structure and distribution alignment for advancing ZSL. This design leads to the first five terms in Eq. 18 for multi-dimensional structure alignment and the last five terms in Eq. 18 for distribution alignment. Although TopoZero has four autoencoders in total, the entire training process is simultaneous, and the loss weights of all terms in Eq. 18 are the same for all datasets. The consistently significant results on all datasets show that our model is robust and easy to train. Additionally, several losses are formulated with similar forms, which cooperate for easy optimization, i.e., L^x_AE and L^a_AE, L^x_CA and L^a_CA, and L^x_TP and L^a_TP. Finally, TAM and DAM are parallel, and this disentangled design makes the learning curve smooth and maximizes the role of each branch. Benefiting from this disentangled design, our TopoZero is easier to train than HSVA, which adopts a coupled framework.

Hyper-parameter Analysis. In this part, we further verify the sensitivity of the hyper-parameters λ_1, λ_2, and λ_3 in our TopoZero by conducting experiments on the CUB dataset. As shown in Fig.
3 , the performance of TopoZero is of great robustness when varying hyper-parameter from {0.01, 0.05, 0.1, 0.25, 0.5, 1.0}. Finally, λ 1 , λ 2 and λ 3 are set 0.05, 0.05, and 1 in this paper for the better result. Although this hyper-parameter configuration achieves a great performance on 3 ZSL benchmark datasets, it also raises an interesting question: given these 3 hyper-parameters play distinct role in our TopoZero framework, why their effects are so consistent? For instance, the green lines in Fig. 3 almost present a consistent trending. The reason for this question is that the configuration in the hyperparameter selection setting is unreasonable, where the candidate range of λ 3 is small. This hides the role of each term in the objective function since the value of 4-th term (controlled by λ 3 ) is far larger than that of 2-nd (controlled by λ 1 ) and 3rd (controlled by λ 2 ) terms, where the value of L a V AE and L a V AE in 4-th term is extraordinarily large. Thus, to conduct a detailed hyper-parameter analysis, we extend the range of λ 3 into {0.0001, 0.0005, 0.001, 0.0050.01, 0.05, 0.1, 0.25, 0.5, 1.0}. Based on this revision, the individual effects of the three hyper-parameters are expanded remarkably, that is illustrated in Fig. 4 . Simultaneously, our TopoZero achieves a higher CZSL accuracy of 64.9% on the cub dataset via this step. In our opinion, this improvement benefits from this more reasonable hyper-parameter selection procedure, which is conducive to getting rid of "hyper-parameter overfitting" via mining the role of each item accurately. Considering this step involves some tricks of hyper-parameter tuning, we only discuss this situation rather than adopting this hyper-parameter configuration for better results. Visualization Result. As shown in Fig. 1 (a) -(c), we utilize persistent homology to visualize the multi-dimensional topological features of TopoZero and HSVA latent structure space. We can
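As a concrete illustration of the 0-dimensional case, the sketch below (a hypothetical helper, not the paper's implementation; in practice a library such as Ripser computes the pairings) finds the 0-dimensional persistence pairing of a distance matrix via Kruskal's algorithm: the "destroyer" edges of connected components are exactly the minimum-spanning-tree edges, and indexing the distance matrix at these pairs recovers the death times, which is what keeps the loss differentiable with respect to the distances:

```python
import numpy as np

def zero_dim_persistence_pairing(dist_matrix):
    """0-dimensional persistence pairing of a point cloud.

    Each connected component (0-dimensional feature) is destroyed by the
    edge that first merges it with another component; these destroyer
    edges are the minimum-spanning-tree edges of the distance matrix,
    found here with Kruskal's algorithm and union-find.
    """
    n = dist_matrix.shape[0]
    parent = list(range(n))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All candidate edges, shortest first.
    edges = sorted(
        (dist_matrix[i, j], i, j) for i in range(n) for j in range(i + 1, n)
    )
    pairing = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # this edge merges two components: a destroyer
            parent[ri] = rj
            pairing.append((i, j))
        if len(pairing) == n - 1:
            break
    return pairing

# Retrieval step: death times are distance-matrix entries at the pairing
# indices, so gradients flow back through the distances to the embedding.
```

For three collinear points at 0, 1, and 3, the pairing consists of the edges of lengths 1 and 2, matching the n − 1 finite deaths expected for n points.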



Footnotes: (1) Connectivity-based features, e.g., connected components as 0-dimensional, cycles as 1-dimensional, and voids as 2-dimensional topological features. (2) Existing methods all adopt a random sampling strategy to generate mini-batch data points. (3) A metric that measures the bounded distance between two persistence diagrams. (4) Available at https://github.com/Ripser/ripser. (5) b represents the size of batch training samples.



Figure 1: Motivation Illustration. (a)-(c) Based on the same randomly sampled data points X^(m) in (a), the batch data points sampled by our Topology-guided Sampling Strategy (TGSS) are closer to the global data points than those sampled by the random sampling strategy.

for rows with i > (m+1) follow a distribution F_Δ'. Let Z' := max_{1≤i≤n} δ'_i with corresponding distribution F_{Z'}. For A_{X, X^(m+1)_R}, the minimal distances δ''_i for rows with i > (m+1) follow a distribution F_Δ''. Letting Z'' := max_{1≤i≤n} δ''_i with corresponding distribution F_{Z''}, the expected Hausdorff distance between X and X
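The δ quantities in the statement above are row-wise minima of a pairwise distance matrix, and the Hausdorff distance is the maximum of such minima. A minimal sketch of the symmetric Hausdorff distance (`hausdorff_distance` is an illustrative helper, not code from the paper) might look as follows:

```python
import numpy as np

def hausdorff_distance(X, Y):
    """Symmetric Hausdorff distance between two point sets X and Y.

    d_H(X, Y) = max( max_x min_y ||x - y||, max_y min_x ||y - x|| ).
    The row-wise minima of the pairwise distance matrix A_{X,Y} play the
    role of the delta_i quantities; d_H is the larger of their maxima
    taken over rows and over columns.
    """
    # Pairwise Euclidean distances via broadcasting: A[i, j] = ||X_i - Y_j||.
    A = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return max(A.min(axis=1).max(), A.min(axis=0).max())
```

For X = {(0,0), (1,0)} and Y = {(0,0), (3,0)}, every point of X is within 1 of Y, but the point (3,0) of Y is at distance 2 from X, so the symmetric distance is 2.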

, and AWA2 Xian et al. (2018a). CUB contains 11,788 images of 200 bird classes (seen/unseen classes = 150/50) with 312 attributes. SUN consists of 14,340 images from 717 classes (seen/unseen classes = 645/72) with 102 attributes. AWA2 includes 37,322 images of 50 animal classes (seen/unseen classes = 40/10) with 85 attributes. Finally, we adopt the "split version 2.0" protocol Xian et al. (2018b) to split the data on CUB, SUN, and AWA2. Network Architecture. As illustrated in Fig. 2, our TopoZero contains 2 Encoders and 2 Decoders, which are basic Multi-Layer Perceptrons with 2 fully connected (FC) layers and 4096 hidden units. The dimensions of the latent variables in the distribution alignment and topology alignment modules are both set to 64. The architecture of the CZSL and GZSL classifiers is a single FC layer. Optimization Details. Our TopoZero is optimized by the Adam optimizer with an initial learning rate of 10^-4. The total number of training epochs of TopoZero is set to 100 with a batch size of 50. For training the final CZSL and GZSL classifiers, the number of training epochs, batch size, and initial learning rate are set to 25, 28, and 10^-3, respectively.
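For illustration, the encoder/decoder shapes described above can be sketched with plain NumPy (a random-weight forward pass only; the 2048-dimensional input is an assumption, e.g., ResNet-101 features, and the paper's actual implementation details may differ):

```python
import numpy as np

def make_mlp(dims, seed=0):
    """Random (weight, bias) pairs for an MLP with the given layer sizes."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    """Forward pass with ReLU on all but the last layer."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

# Encoder: 2 FC layers, 4096 hidden units, 64-dimensional latent space.
encoder = make_mlp([2048, 4096, 64])
decoder = make_mlp([64, 4096, 2048])
z = forward(encoder, np.ones((50, 2048)))   # batch size 50, as in training
x_rec = forward(decoder, z)
```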

COMPARISON WITH STATE-OF-THE-ART. Results on Conventional Zero-Shot Learning. Tab. 1 reports the CZSL results of our TopoZero and recent state-of-the-art (sota) methods on the 3 ZSL datasets. Considering that attribute-based sota methods Huynh & Elhamifar (2020); Chen et al. (2021a) exploit the advantage of the pre-trained NLP model GloVe and generation-based sota methods Xian et al. (2018b); Yu et al. (

Figure 3: The coarse effects of λ 1 , λ 2 and λ 3 on the CUB dataset.

Figure 4: The fine effects of λ 1 , λ 2 and λ 3 on the CUB dataset.

Algorithm 1 Topology-guided Sampling Strategy
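As a hedged sketch of the idea behind Algorithm 1 (the candidate-batch selection below is our own illustrative reading, not necessarily the paper's exact procedure), one can draw several random batches and keep the one with the smallest Hausdorff distance to the global point set, so the kept batch better preserves global geometry than a single random draw:

```python
import numpy as np

def hausdorff(X, Y):
    """Symmetric Hausdorff distance between point sets X and Y."""
    A = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return max(A.min(axis=1).max(), A.min(axis=0).max())

def tgss_sample(X, batch_size, n_candidates=10, seed=0):
    """Hypothetical topology-guided sampling step.

    Draw `n_candidates` random batches from X and keep the one whose
    Hausdorff distance to the full set is smallest. By construction the
    result can never be farther from X than a single random batch.
    """
    rng = np.random.default_rng(seed)
    best_idx, best_d = None, np.inf
    for _ in range(n_candidates):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        d = hausdorff(X[idx], X)
        if d < best_d:
            best_idx, best_d = idx, d
    return best_idx, best_d
```

Because the first candidate coincides with a plain random draw, the selected batch is at least as close to the global set as random sampling, which mirrors the motivation illustrated in Fig. 1.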

Results (%) of the state-of-the-art models on the CUB, SUN, and AWA2 datasets. The best result is marked in bold. The symbol "-" indicates no available result.

E EXPERIMENTS. E.1 ABLATION STUDY

Based on CADA-VAE Schonfeld et al. (2019), we conduct ablative experiments on the CUB, SUN, and AWA2 datasets to verify the effectiveness of our proposed Topology-guided Sampling Strategy and Topology Alignment Module. We first clarify the notations in Tab. 2. TAM denotes our Topology Alignment Module. TAM_0 / TAM_0-1 represents our Topology Alignment Module preserving 0-dimensional / 0-dimensional and 1-dimensional topological features. We can see that the 4-th row with TAM achieves a better result than the 2-nd row with TAM_0 and the 3-rd row with TAM_0-1, indicating the effectiveness of multi-dimensional (especially high-dimensional) structure alignment. Then, with the addition of TGSS, the performance is further enhanced, demonstrating that TGSS achieves better structure alignment. This experimental result is highly compatible with the theoretical analysis we provide for TGSS.

E.2 ANALYSIS

The effectiveness of TAM. To verify the effectiveness of the Topology Alignment Module alone, we disentangle it from our overall TopoZero framework. As reported in Tab. 3, although a single TAM achieves strong performance, there still exists a distinct performance gap compared with recent sota methods Chen et al. (2021c; b). This is why we introduce an off-the-shelf distribution alignment module into our TopoZero framework. We can see that our TopoZero topological latent space presents an almost consistent trend with respect to the input topological space while HSVA fails, indicating that our TopoZero preserves more geometry information than HSVA when handling the 'curse of dimensionality' Wang & Chen (2017).

