CHANGE DETECTION FOR BI-TEMPORAL IMAGES CLASSIFICATION BASED ON SIAMESE VARIATIONAL AUTOENCODER AND TRANSFER LEARNING

Abstract

Siamese structures enable Deep Learning (DL) models to learn how to extract the relevant temporal features from the input data, which increases their efficiency. In this paper, a Siamese Variational Auto-Encoder (VAE) model based on transfer learning (TL) is applied to change detection (CD) in bi-temporal images. The introduced method is trained in a supervised manner for classification tasks. First, the proposed generative method uses two VAEs to extract features from the bi-temporal images and concatenates them into a single feature vector. To produce a classification map of the source scene, the classifier receives this vector and the ground truth data as input. The source model is then fine-tuned with a TL strategy so that it can be applied to the target scene with less ground truth data. Experiments were carried out in two study areas in the arid regions of southern Tunisia. The results show that the proposed method outperformed the Siamese Convolutional Neural Network (SCNN), achieving an accuracy of more than 98% in the source scene, and that applying the TL strategy increased the accuracy in the target scene by 1.25%.
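The feature-extraction pipeline summarized in the abstract (two VAE encoders, concatenation of their latent vectors, then a classifier) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the layer sizes, the single hidden layer, the untrained random weights, the choice of two separate (non-shared) encoders, and all function names are hypothetical, and the decoders and the training loop are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_encoder(in_dim, hidden_dim, latent_dim):
    """Random (untrained) weights for one small VAE encoder; sizes are assumptions."""
    return {
        "W_h": rng.normal(0.0, 0.1, (in_dim, hidden_dim)),
        "b_h": np.zeros(hidden_dim),
        "W_mu": rng.normal(0.0, 0.1, (hidden_dim, latent_dim)),
        "b_mu": np.zeros(latent_dim),
        "W_logvar": rng.normal(0.0, 0.1, (hidden_dim, latent_dim)),
        "b_logvar": np.zeros(latent_dim),
    }

def encode(params, x):
    """Map inputs to the mean and log-variance of the approximate posterior."""
    h = np.tanh(x @ params["W_h"] + params["b_h"])
    mu = h @ params["W_mu"] + params["b_mu"]
    logvar = h @ params["W_logvar"] + params["b_logvar"]
    return mu, logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def siamese_features(enc_t1, enc_t2, x_t1, x_t2):
    """Encode each acquisition date with its own encoder and concatenate the latents."""
    z1 = reparameterize(*encode(enc_t1, x_t1))
    z2 = reparameterize(*encode(enc_t2, x_t2))
    return np.concatenate([z1, z2], axis=1)

def classify(W, b, features):
    """Softmax classifier applied to the concatenated feature vector."""
    logits = features @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Toy bi-temporal input: 5 pixels with 6 spectral bands at each date (made-up data).
n_pixels, n_bands, latent_dim, n_classes = 5, 6, 4, 3
x_t1 = rng.random((n_pixels, n_bands))
x_t2 = rng.random((n_pixels, n_bands))

enc_t1 = init_encoder(n_bands, 8, latent_dim)
enc_t2 = init_encoder(n_bands, 8, latent_dim)
features = siamese_features(enc_t1, enc_t2, x_t1, x_t2)

W_cls = rng.normal(0.0, 0.1, (2 * latent_dim, n_classes))
probs = classify(W_cls, np.zeros(n_classes), features)
print(features.shape, probs.shape)
```

In a trained model, the ground truth labels would drive the classifier loss, and the per-pixel class probabilities would be reshaped into the classification map.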

1. INTRODUCTION

The feature extraction step of the classification process improves DL model performance in several fields (Hakak et al., 2021; Islam & Nahiduzzaman, 2022; Xiong & Zuo, 2022). In particular, the Convolutional Neural Network (CNN) has been employed efficiently to solve computer vision problems in a variety of fields, including industry, environment, and healthcare (Alzubaidi et al., 2021; Huang et al., 2022). Nevertheless, the performance of these algorithms depends on the datasets used. Furthermore, the CNN has shown low performance in classification tasks when the input data are highly similar and poorly dispersed. Given these challenges, the VAE has recently demonstrated good performance in classification tasks, as it relies on distribution-free assumptions and nonlinear approximation (Zerrouki et al., 2020; Ran et al., 2022). However, the periodicity of the input data reduces its efficiency and, therefore, prevents it from ensuring the temporal consistency of the extracted features (Zhao & Peng, 2022). Moreover, traditional DL models (e.g., CNN and VAE) cannot capture temporal information and thus have a limited capability to extract temporal features. To overcome this shortcoming, the Siamese structure, which is one of the best approaches for CD in bi-temporal images, can be a good solution. Siamese networks were first used for signature verification. Subsequently, they were applied to feature matching, particularly between pairs of images (Ghosh et al., 2021; Zhang et al., 2022). Recent studies focusing on classification tasks have employed bi-temporal images for CD (Lee et al., 2021; Zheng et al., 2022). The CD process consists of identifying the differences between bi-temporal images of the same geographic location subject to anthropogenic and climatic factors. Exploring the generalization of Siamese DL models is a key challenge. 
Analyzing the TL capabilities of these models is one of the most common ways to study this generalization (Krishnamurthy et al., 2021; Abou Baker et al., 2022). TL aims at gaining knowledge by solving one problem and applying it to another, related problem. In practice, TL transfers knowledge from a context with many labeled samples to another context with limited labels. In application, TL consists of re-using the weights of the model trained on the source data and applying a fine-tuning approach to obtain a model adapted to the target data (Raffel et al., 2020; Shabbir et al., 2021; Toseef et al., 2022). By using the pre-trained source model as the starting point for the target scene instead of training from scratch, the fine-tuning technique reinforces learning and considerably reduces model overfitting (Tan et al., 2018; Cao et al., 2022).

The contributions of the present work are as follows:
• Proposing a new method for bi-temporal image classification based on a Siamese VAE, in order to extract the relevant temporal features.
• Using a TL strategy to transfer the pre-trained Siamese VAE from the source scene to the target scene.
• Evaluating the introduced method against the SCNN, in two study areas, using bi-temporal multispectral images acquired by Landsat.

The rest of this manuscript is organized as follows. Related works are described in Section 2. The developed technique and the background of the Siamese VAE and the TL strategy used in this study are presented in Section 3. Section 4 describes the experimental settings, the implementation details and the applied evaluation metrics. The obtained results are provided and discussed in Section 5, while Section 6 concludes the paper and outlines some future perspectives.
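The fine-tuning idea described in this introduction (re-use the weights learned on the source scene, then adapt them with a small labeled target set) can be illustrated with a short NumPy sketch. It is only a toy example under stated assumptions: the features are synthetic two-dimensional points standing in for already-extracted feature vectors, only a softmax classification layer is trained and fine-tuned, and the learning rates, epoch counts, sample sizes, and function names are hypothetical. Which layers the proposed method actually fine-tunes is not specified at this point of the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_head(features, labels, W, b, lr, epochs):
    """Gradient descent on the cross-entropy loss of a softmax layer.

    Starts from the given weights and returns updated copies, so the
    input weights are left untouched.
    """
    W, b = W.copy(), b.copy()
    onehot = np.eye(W.shape[1])[labels]
    for _ in range(epochs):
        grad = (softmax(features @ W + b) - onehot) / len(labels)
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def accuracy(features, labels, W, b):
    return float((np.argmax(features @ W + b, axis=1) == labels).mean())

def make_scene(rng, n, shift):
    """Synthetic two-class 'scene'; `shift` mimics a source/target domain gap."""
    labels = rng.integers(0, 2, size=n)
    centers = np.array([[-1.0, -1.0], [1.0, 1.0]]) + shift
    features = centers[labels] + rng.normal(0.0, 0.7, (n, 2))
    return features, labels

rng = np.random.default_rng(0)
X_src, y_src = make_scene(rng, 400, shift=0.0)   # many labels in the source scene
X_tgt, y_tgt = make_scene(rng, 20, shift=0.5)    # few labels in the target scene
X_test, y_test = make_scene(rng, 200, shift=0.5)

# 1) Train on the source scene from scratch.
W_src, b_src = train_head(X_src, y_src, np.zeros((2, 2)), np.zeros(2),
                          lr=0.5, epochs=200)

# 2) Fine-tune: initialize from the source weights, use a smaller learning
#    rate and fewer epochs on the small labeled target set.
W_tgt, b_tgt = train_head(X_tgt, y_tgt, W_src, b_src, lr=0.05, epochs=50)

acc_before = accuracy(X_test, y_test, W_src, b_src)
acc_after = accuracy(X_test, y_test, W_tgt, b_tgt)
print(f"target accuracy, source model: {acc_before:.3f}")
print(f"target accuracy, fine-tuned:   {acc_after:.3f}")
```

Starting from the source weights rather than from zeros is the essential point of the sketch; whether fine-tuning helps on a given target scene depends on the size of the domain gap and on the amount of target labels.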

2. RELATED WORK

Recently, numerous studies have focused on Siamese structures that use bi-temporal images to enhance feature-extraction-based classification models for CD. For example, Zhu et al. (2022) designed a Siamese global learning (Siam-GL) framework for high spatial resolution (HSR) remote sensing images, in which the Siamese structure was used to improve the feature extraction of bi-temporal HSR images. The authors concluded that the Siam-GL framework outperformed advanced semantic CD methods, as it provided more pixel data and ensured high-precision classification. In addition, Zhao & Peng (2022) presented a semi-supervised technique relying on a VAE with a Siamese structure to detect changes in Synthetic Aperture Radar (SAR) images. The technique concatenates the features extracted from the bi-temporal images into a single latent space vector in order to extract the pixel change characteristics. Moreover, Daudt et al. (2018) proposed a CD framework based on a Siamese CNN. In the suggested method, pairs of Sentinel-2 multispectral images were encoded by a Siamese network to extract a new data representation. Then, the extracted bi-temporal representations were combined to produce an urban CD map. The designed network was trained in a fully supervised manner and showed excellent test performance.

The performance of a DL model, especially a CD model based on a Siamese structure, can be further enhanced by applying a TL strategy. The remainder of this section reviews some research works that combine Siamese CD and TL. For instance, Yang et al. (2019) proposed a DL-based CD framework with a TL strategy to apply a CD model learned in a source domain to a target domain. The introduced framework includes pre-training and fine-tuning steps, in which the target domain data are used to adapt the concept of change learned in the source domain. 
An image-differencing method was applied in the target domain to select the pixels with a high probability of being correctly classified by an unsupervised technique, which improved the change detection network (CDN) of the target domain. The reported findings showed that the developed method outperformed state-of-the-art CD techniques and offered an excellent ability to transfer the concept of change from the source domain to the target domain. Heidari & Fouladi-Ghaleh (2020) presented a face recognition platform based on TL in a Siamese network made up of two identical CNNs. On this platform, the Siamese network extracts the features from the input pair of images and determines whether or not they belong to the same person. The experimental results revealed that the accuracy of the proposed model (95.62%) was better than that of state-of-the-art face recognition methods. Bandara & Patel (2022) designed a transformer-based Siamese network architecture (ChangeFormer) for CD from a pair of co-registered remote sensing images. Unlike CD frameworks based on fully convolutional networks (ConvNets), the proposed Siamese network combines a hierarchical transformer encoder with a multi-layer perceptron (MLP) decoder to capture the multi-scale, long-range details needed for CD. The experimental findings demonstrated that the developed end-to-end trainable architecture outperformed existing ones in terms of CD performance. Andresini et al. (2022) suggested a Siamese network trained on labeled imagery of the same land scene, acquired by Sentinel-2 at different times, to detect land cover changes in bi-temporal images. The Siamese network trained on a labeled scene was transferred to a new unlabeled scene by applying a fine-tuning-based TL strategy. The lack of change labels in the new scene was addressed by estimating pseudo-change labels in an unsupervised manner. The experiment

