LEARNING VISUAL REPRESENTATION WITH SYNTHETIC IMAGES AND TOPOLOGICALLY-DEFINED LABELS

Abstract

We propose a scheme for neural networks to learn visual representation from synthetic images and mathematically-defined labels that capture topological information. To verify that the model acquires a visual representation different from that of the usual supervised learning with manually-defined labels, we show that models pretrained with our scheme can be finetuned for image classification tasks with improved convergence compared to those trained from scratch. Convolutional neural networks, built upon iterative local operations, are good at learning local features of an image, such as texture, whereas they tend to pay less attention to larger structures. Our method provides a simple way to encourage the model to learn global features through a specifically designed task based on topology. Furthermore, our method requires neither real images nor manual labels; hence it speaks to recently raised concerns in computer vision, such as the cost and fairness of data collection and annotation.

1. INTRODUCTION

Self-supervised learning (SSL) has seen great success both practically and scientifically by providing a scheme to train neural networks without human-annotated labels and by showing similarity to humans in the learning process and the learned representation. However, there are still large gaps between SSL and human learning. In visual representation, convolutional neural networks (CNNs) are famously known to be "shortsighted", being biased towards textural information (Geirhos et al., 2019). Most image features, hand-crafted or obtained in a data-driven manner, are local in the sense that they are computed from patches in the image. For example, convolution, a typical and versatile operation used to construct various image features, is local unless a huge kernel is used. Although local features can be aggregated by applying reduction operations such as the mean, the global information obtained in this way is limited.

One way to capture global characteristics of images is to introduce a new model architecture. The attention mechanism has enabled models to learn global image features in a data-driven manner, as demonstrated by Vision Transformer (Dosovitskiy et al., 2021). Another way, which we pursue in this paper, is to devise a new training scheme, which works with virtually any model architecture with little modification.

Topology is the study of shapes whose ultimate goal is to classify shapes by their topological types. Topologists have invented various topological invariants that can discern different shapes. For example, homology is used to classify manifolds that are locally the same Euclidean space but globally different. Our idea is to design a task of computing invariants from an input image by a neural network, so that the neural network is encouraged to learn visual representation relevant to the topology of the image.
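As an elementary illustration of such invariants, the Betti numbers of a binary image can be read off from connected components. The following sketch is our own, not part of the paper's pipeline, and assumes `scipy` is available: beta_0 counts foreground components and beta_1 counts holes via background components.

```python
import numpy as np
from scipy import ndimage

def betti_numbers(img):
    """img: 2D boolean array; returns (beta_0, beta_1) for the foreground."""
    # beta_0: connected components of the foreground (8-connectivity).
    _, beta0 = ndimage.label(img, structure=np.ones((3, 3), dtype=int))
    # beta_1: holes are background components other than the unbounded
    # exterior; pad with background so the exterior forms one component,
    # and use the dual 4-connectivity for the background.
    padded = np.pad(~img, 1, constant_values=True)
    _, n_bg = ndimage.label(padded)  # default structure = 4-connectivity
    return beta0, n_bg - 1

yy, xx = np.mgrid[-16:16, -16:16]
disc = xx**2 + yy**2 < 100               # one blob, no hole
annulus = disc & (xx**2 + yy**2 > 25)    # one blob, one hole
```

The disc and the annulus are locally indistinguishable, yet their Betti numbers differ; this is exactly the kind of global information that local convolutional features struggle to capture.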
In contrast to popular SSL schemes, which are largely based on learning the distribution of real (plausible) images, our method is based on approximating the mathematical computation of an invariant, called persistent homology, of synthesised images. Persistent homology (PH), one of the main tools of the emerging field of Topological Data Analysis (TDA), provides efficient machinery for computing global topological features of data mathematically (Adams & Moy, 2021). It has proved practically useful for image processing tasks such as classification (Dunaeva et al., 2016) and segmentation (Tanabe et al., 2021). In recent years, applications of persistent homology have expanded to many areas of science and led to new discoveries (Giunti, 2021), but in most cases it is used simply as yet another feature extractor. In view of TDA, we go a step further and ask whether we can teach topology to a neural network so that the model learns low-level image features together with the mechanism for computing high-level topological features from them. Answering this question affirmatively would lead to the acquisition of more abstract and generalisable representation by the model. More precisely, our scheme trains the model on regression of the vectorised persistent homology of synthesised images. Through this task, the model is expected to learn the visual representation required to approximate persistent homology. We demonstrate the validity of our scheme with experiments showing that a CNN pretrained by the proposed method can be finetuned for image classification tasks with improved convergence compared with one trained from scratch. It is also worth pointing out that our scheme relies on neither real images nor labels, but uses mathematically generated images annotated with mathematically-defined features.
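To make the pretext labels concrete, here is a minimal sketch under our own simplifying assumptions (0-dimensional homology only, 4-connectivity, and a plain lifetime histogram as the vectorisation; the paper's actual vectorisation may differ): the 0-dimensional persistence diagram of the sublevel-set filtration of a grayscale image, computed with a union-find sweep and turned into a fixed-length regression target.

```python
import numpy as np

def ph0_diagram(img):
    """img: 2D float array. Returns (birth, death) pairs for the
    sublevel-set filtration; the essential component dies at img.max()."""
    h, w = img.shape
    order = np.argsort(img, axis=None, kind="stable")  # pixel ids by value
    parent = np.full(h * w, -1, dtype=int)   # -1 = not yet in the filtration
    birth = np.empty(h * w)
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    for p in order:
        parent[p] = p
        birth[p] = img.flat[p]
        x = p % w
        for q in (p - w, p + w, p - 1, p + 1):   # 4-neighbourhood
            if q == p - 1 and x == 0:
                continue  # would wrap to the previous row
            if q == p + 1 and x == w - 1:
                continue  # would wrap to the next row
            if q < 0 or q >= h * w or parent[q] == -1:
                continue
            a, b = find(p), find(q)
            if a != b:
                if birth[a] > birth[b]:
                    a, b = b, a   # elder rule: the younger component dies
                if birth[b] < img.flat[p]:   # skip zero-lifetime pairs
                    pairs.append((float(birth[b]), float(img.flat[p])))
                parent[b] = a
    pairs.append((float(birth[find(order[0])]), float(img.max())))
    return pairs

def vectorise(pairs, bins=8, vmax=1.0):
    """Fixed-length histogram of lifetimes: the regression label."""
    life = [d - b for b, d in pairs]
    hist, _ = np.histogram(life, bins=bins, range=(0.0, vmax))
    return hist.astype(np.float32)
```

For example, the 1x5 image `[[0.0, 1.0, 0.5, 1.0, 0.25]]` has three local minima, giving two finite pairs (born at 0.5 and 0.25, both dying at 1.0) plus the essential class born at 0.0.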
In this way, it is free from human bias, which lies not only in the manual annotation but also in the images themselves; photos reflect the present world and the view of the photographer. In fact, models trained on ImageNet by SSL, even without human-annotated labels, are known to be subject to bias (Steed & Caliskan, 2021).

2. RELATED WORK

The proposed method is built on three key ingredients, which we discuss in this section. We introduce some novel ideas to each of the three and combine them to develop our scheme.

2.1. SELF-SUPERVISED LEARNING ON IMAGES

Self-supervised learning tasks, called pretext tasks, are designed to learn image features without manually-defined labels. There are three major types of pretext tasks. The first is to tell whether two given images come from the same image, where variations of images are generated by applying transformations in the spatial and colour domains (see Jaiswal et al. (2021) for a survey). The second is to undo degradation, such as added noise or masking, and reconstruct the original image. The third is similar to the second but performs a pair of (approximately) invertible processes, such as compression-expansion, as represented by the celebrated autoencoder (Hinton & Salakhutdinov, 2006). All three tasks demand that the model acquire high-level representation of the images. The main objective of these methods lies, more or less, in capturing the distribution of the training data; to be good at in-painting or compression, one has to find a low-dimensional manifold which models the training data well. We propose another type of task that puts more emphasis on the computation process rather than on the distribution of the data. When the computation is based on a certain mathematical structure of the data, a model will be incentivised to focus on that structure through learning the task. Our pretext task is to approximate the computation of the persistent homology of the image. Persistent homology can be computed mathematically from the data and used as the label for a regression task. It is also notable that our pretext task relies not on semantics or human perception but solely on mathematics. This allows us to use synthetic images that are almost meaningless to human eyes, making the procedure completely free of real data.
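Stripped of the CNN and the synthetic images, the pretext task is ordinary multi-output regression with a mean-squared-error loss. The toy sketch below is our own illustration, not the paper's setup: a linear model and random targets stand in for the CNN and the vectorised PH labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 64, 16, 4                 # samples, input dim, label length
X = rng.normal(size=(n, d))         # stand-in for (flattened) images
W_true = rng.normal(size=(d, k))
Y = X @ W_true                      # stand-in for vectorised PH labels

W = np.zeros((d, k))                # the "model": a single linear layer
for _ in range(500):
    grad = X.T @ (X @ W - Y) / n    # gradient of 0.5 * mean squared error
    W -= 0.1 * grad                 # plain gradient descent

mse = float(np.mean((X @ W - Y) ** 2))
```

Any architecture whose last layer outputs a vector of the label's length can be trained this way, which is why the scheme works "with virtually any model architecture with little modification".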

2.2. LEARNING WITH SYNTHETIC IMAGES

Even though SSL saves annotation costs, the preparation of training data is still a vexing problem. Publicly available datasets can be of low quality, subject to bias, or in violation of usage rights and privacy. ImageNet, one of the most popular large-scale datasets, suffers from fairness issues (Yang et al., 2020), and there has been growing interest in the fairness of machine learning (Mehrabi et al., 2021). Steed & Caliskan (2021) point out that even models trained on ImageNet with SSL, without using labels, learn racial, gender, and intersectional biases from the way people are stereotypically portrayed on the web. No matter how much care and attention are paid to data collection, it is impossible to be free from all of these issues as long as real images are used. Using generative adversarial networks (GANs) to generate image datasets for training is a popular and successful strategy to mitigate the situation (Besnier et al., 2020), but GANs are also trained with natural images and cannot avoid the above-mentioned problems. A promising approach is to use algorithmically synthesised images. Formula-driven Supervised Learning, introduced in Kataoka et al. (2020), considers pretraining with synthetic images generated by a mathematical formula. The labels are assigned according to the
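As a hypothetical illustration of the formula-driven idea (deliberately simpler than the fractal-based recipe of Kataoka et al.; the formula and class definition below are our own invention), each sample is generated by a parametrised formula and labelled by that parameter, so neither real photographs nor human annotation enter the pipeline.

```python
import numpy as np

def make_sample(rng, size=32, classes=4):
    """Threshold a random sinusoidal field; the label is its frequency."""
    label = int(rng.integers(classes))
    freq = label + 1                          # the formula's parameter
    y, x = np.mgrid[0:size, 0:size] / size    # coordinates in [0, 1)
    ph = rng.uniform(0.0, 2 * np.pi, size=2)  # random phases for variety
    field = (np.sin(2 * np.pi * freq * x + ph[0])
             + np.sin(2 * np.pi * freq * y + ph[1]))
    return (field > 0).astype(np.float32), label

rng = np.random.default_rng(42)
images, labels = zip(*(make_sample(rng) for _ in range(8)))
```

Because the label is derived from the generating formula rather than from human judgement, arbitrarily many perfectly-annotated samples can be produced on demand.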

