LEARNING VISUAL REPRESENTATION WITH SYNTHETIC IMAGES AND TOPOLOGICALLY-DEFINED LABELS

Abstract

We propose a scheme for neural networks to learn visual representation from synthetic images and mathematically-defined labels that capture topological information. To verify that the model acquires a visual representation different from that of the usual supervised learning with manually-defined labels, we show that models pretrained with our scheme can be finetuned for image classification tasks with improved convergence compared to models trained from scratch. Convolutional neural networks, built upon iterative local operations, are good at learning local features of the image, such as texture, whereas they tend to pay less attention to larger structures. Our method provides a simple way to encourage the model to learn global features through a specifically designed task based on topology. Furthermore, our method requires neither real images nor manual labels; hence it speaks to some recently raised concerns in computer vision, such as the cost and fairness of data collection and annotation.

1. INTRODUCTION

Self-supervised learning (SSL) has seen great success, both practically and scientifically, by providing a scheme to train neural networks without human-annotated labels and by exhibiting similarities to humans in the learning process and the learned representation. However, large gaps remain between SSL and human learning. In visual representation, convolutional neural networks (CNNs) are famously known to be "shortsighted", being biased toward textural information (Geirhos et al., 2019). Most image features, whether hand-crafted or obtained in a data-driven manner, are local in the sense that they are computed from patches of the image. For example, convolution, a typical and versatile operation used to construct various image features, is local unless a huge kernel is used. Although local features can be aggregated by applying reduction operations such as the mean, the global information obtained in this way is limited.

One way to capture global characteristics of images is to introduce a new model architecture. The attention mechanism has enabled models to learn global image features in a data-driven manner, as demonstrated by the Vision Transformer (Dosovitskiy et al., 2021). Another way, which we pursue in this paper, is to devise a new training scheme, which works with virtually any model architecture with little modification.

Topology is the study of shapes, whose ultimate goal is to classify shapes by their topological types. Topologists have invented various topological invariants that discern different shapes. For example, homology is used to classify manifolds, which are locally indistinguishable from Euclidean space but may differ globally. Our idea is to design a task of computing such invariants from an input image, so that the neural network is encouraged to learn visual representation that is relevant to the topology of the image.
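To make the notion of a topological invariant concrete, the following is a minimal illustrative sketch (not the method of this paper) that computes the first two Betti numbers of a small synthetic binary image: b0 counts connected components, and b1 counts holes, which for a 2-D binary image can be read off as the number of background components beyond the outer one. The function names are ours, chosen for illustration only.

```python
import numpy as np

def connected_components(mask):
    """Count 4-connected components of True pixels via flood fill."""
    mask = mask.copy()
    h, w = mask.shape
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j]:
                count += 1
                stack = [(i, j)]
                mask[i, j] = False
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx]:
                            mask[ny, nx] = False
                            stack.append((ny, nx))
    return count

def betti_numbers(img):
    """b0 = number of foreground components; b1 = number of holes,
    i.e. background components minus the single outer region
    (the image is padded so the outer background is one component)."""
    fg = img.astype(bool)
    bg = np.pad(~fg, 1, constant_values=True)
    b0 = connected_components(fg)
    b1 = connected_components(bg) - 1
    return b0, b1
```

For instance, a filled disk yields (1, 0), while an annulus of the same outer radius yields (1, 1): both are one component, but only the annulus encloses a hole. The two shapes are locally similar yet topologically distinct, which is exactly the kind of global information a patch-based feature misses.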
In contrast to the popular SSL schemes, which are largely based on learning the distribution of real (plausible) images, our method is based on approximating the mathematical computation of an invariant, called persistent homology, of synthesised images. Persistent homology (PH), one of the main tools of the emerging field of Topological Data Analysis (TDA), provides efficient machinery for computing global topological features of data (Adams & Moy, 2021). It has proven practically useful for image processing tasks such as classification (Dunaeva et al., 2016) and segmentation (Tanabe et al., 2021). In recent years, applications of persistent homology have expanded to many areas of science and led to new discoveries (Giunti, 2021), but in most cases it is used merely as yet another feature extractor. In view
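As a toy illustration of how persistent homology is computed (a sketch under simplifying assumptions, not the pipeline of this paper, which would build cubical complexes from 2-D images, e.g. with the GUDHI library): the 0-dimensional persistence of the sublevel-set filtration of a 1-D function can be obtained with a union-find structure and the elder rule. Each local minimum gives birth to a component, and when two components meet at a saddle, the younger one (the one with the higher birth value) dies.

```python
import math

def persistence_0d(f):
    """0-dimensional persistence pairs (birth, death) of the sublevel-set
    filtration of a 1-D function, via union-find and the elder rule."""
    n = len(f)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    order = sorted(range(n), key=lambda i: f[i])  # add vertices by increasing value
    alive = [False] * n
    birth = {}  # component root -> birth value
    pairs = []
    for i in order:
        alive[i] = True
        birth[i] = f[i]  # i starts as its own component
        for j in (i - 1, i + 1):  # merge with already-added neighbours
            if 0 <= j < n and alive[j]:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                if birth[ri] > birth[rj]:  # elder rule: younger component dies
                    ri, rj = rj, ri
                b = birth.pop(rj)
                if b < f[i]:  # discard zero-persistence pairs
                    pairs.append((b, f[i]))
                parent[rj] = ri
    # the oldest component never dies
    pairs.append((birth[find(order[0])], math.inf))
    return sorted(pairs)
```

For f = [3, 1, 4, 0, 2], this returns [(0, inf), (1, 4)]: the global minimum at value 0 persists forever, while the secondary minimum born at value 1 dies when the two valleys merge at the saddle value 4. The long-lived pairs summarise the global shape of the function, whereas short-lived ones correspond to small-scale fluctuations.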

