A MULTI-SCALE STRUCTURE-PRESERVING HETEROLOGOUS IMAGE TRANSFORMATION ALGORITHM BASED ON CONDITIONAL ADVERSARIAL NETWORK LEARNING

Abstract

Image transformation model learning is a basic technology for image enhancement, image super-resolution, image generation, multimodal image fusion, and related tasks. It uses a deep convolutional network as a representation model for arbitrary functions and solves the transformation model between images in different sets by fitting optimization on a paired image training set. Because the 3D shape of an actual scene and the pixel-level optical properties of its materials vary in complex and diverse ways, solving the heterologous image conversion model is an ill-posed problem. Most conditional adversarial learning methods for image transformation networks proposed in recent years consider only an overall image consistency loss constraint, so the generated images often contain pseudo-features or local structural deformations. To solve this problem, this paper draws on the idea of multi-scale image coding and perception and proposes a multi-scale structure-preserving heterologous image transformation method based on conditional adversarial network learning. First, using the idea of multi-scale coding and reconstruction, a lightweight multi-scale, stage-by-stage generator network structure is designed. Then, two multi-scale image structure loss functions are proposed and combined with the existing overall consistency loss to form the loss function for generative adversarial learning. Finally, experiments are performed on the KAIST-MPD-set1 dataset. The results show that, compared with state-of-the-art algorithms, the proposed algorithm better suppresses local structural distortion and has significant advantages on evaluation indicators such as RMSE, LPIPS, PSNR, and SSIM.

1. INTRODUCTION

Heterologous image conversion generates a modality-B image of a scene from a modality-A image of the same scene by constructing a pixel-level conversion model from modality A to modality B. Early heterologous image conversion required a 3D scene model and complete material information so that the pixel conversion model could be designed by hand. In practical applications, however, building a 3D model of an actual scene is difficult, material composition is complex, and the imaging characteristics of materials are hard to acquire, which makes solving the heterologous image conversion model an ill-posed problem with low efficiency and limited applicability. In recent years, deep learning has led the trend in computer vision, and image transformation model learning offers a new way to solve heterologous image conversion models. Image transformation model learning is an image generation technique used for image super-resolution, image enhancement, and multimodal image fusion. It uses a deep convolutional network as a representation model for an arbitrary function and solves the conversion model between images of two modalities by fitting optimization on a training set of paired images. The most mainstream way to learn image conversion models is the Conditional Generative Adversarial Net (CGAN) proposed by Mirza & Osindero (2014). Because CGAN can model the semantic information of images and constrain the output, CGAN-based image transformation has become a research hotspot in computer vision, and the latest heterologous image transformation methods are built on it. Image conversion model learning relies on a large amount of training data.
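For reference, conditional adversarial learning of an image-to-image mapping is commonly written in the following form, where the discriminator D observes the input image x alongside either the real target y or the generator output G(x). This is the standard paired-image formulation in the literature, not necessarily this paper's exact loss:

```latex
\min_G \max_D \;
\mathbb{E}_{x,y}\!\left[\log D(x, y)\right]
+ \mathbb{E}_{x}\!\left[\log\!\big(1 - D(x, G(x))\big)\right]
```

Conditioning D on x is what lets the objective constrain the semantic correspondence between input and output, rather than only the realism of the output.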
This paper uses the Multispectral Pedestrian Detection (MPD) dataset published by the Korea Advanced Institute of Science and Technology (KAIST) (Hwang & Park, 2015), referred to as the KAIST-MPD dataset, which is currently the only publicly available dataset containing a large number of approximately common-aperture visual and infrared images of multiple scenes. Although the heterologous image pairs in this dataset exhibit a viewing-angle deviation and the field of view of the visual image is smaller, the deviation is small and the field of view is fixed, so it is feasible to use this dataset to solve the transformation model. This paper solves the heterologous infrared-to-visual image conversion model on the KAIST-MPD dataset, as shown in Figure 1. The contributions of this paper are summarized as follows: 1. A new four-layer multi-scale encoder generator is proposed, in which the input image is downsampled three times and the resulting scales are encoded and decoded in turn from small to large; such a network structure makes full use of the structural information in the input images. 2. Two structure-sensitive loss functions are proposed to solve the problem that the L1 loss and the generative adversarial loss cannot effectively constrain the structural information of a single image.
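The "downsampled three times" scheme of contribution 1 amounts to building a four-level image pyramid that the generator then processes from the coarsest scale up. The following minimal NumPy sketch illustrates only the pyramid construction; the function names and the choice of average pooling are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def downsample2x(img):
    """Average-pool a (H, W) image by a factor of 2 along each axis.
    Illustrative choice; the paper's downsampling operator may differ."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def build_pyramid(img, levels=3):
    """Return the input plus `levels` successive downsamplings
    (full, 1/2, 1/4, 1/8 resolution for levels=3)."""
    scales = [img]
    for _ in range(levels):
        scales.append(downsample2x(scales[-1]))
    return scales

img = np.arange(64, dtype=float).reshape(8, 8)
pyramid = build_pyramid(img, levels=3)
print([p.shape for p in pyramid])  # [(8, 8), (4, 4), (2, 2), (1, 1)]
```

A multi-scale generator would start encoding/decoding at the coarsest entry of this list and progressively fuse in the finer scales, so coarse structure is fixed before fine detail is synthesized.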



Figure 1: Heterologous image conversion from infrared to visual.

