MAXIMUM CATEGORICAL CROSS ENTROPY (MCCE): A NOISE-ROBUST ALTERNATIVE LOSS FUNCTION TO MITIGATE RACIAL BIAS IN CONVOLUTIONAL NEURAL NETWORKS (CNNS) BY REDUCING OVERFITTING

Anonymous

Abstract

Categorical Cross Entropy (CCE) is the most commonly used loss function in deep neural networks such as Convolutional Neural Networks (CNNs) for multi-class classification problems. Despite its widespread use, CCE is highly susceptible to noise: CNN models trained without accounting for the unique noise characteristics of the input data, or for noise introduced during model training, invariably suffer from overfitting that affects model generalizability. The lack of generalizability becomes especially apparent in ethnicity/racial image classification problems encountered in the domain of computer vision. One such problem is unintended discriminatory racial bias, which CNN models trained using CCE fail to adequately address; in other words, CNN models trained using CCE offer a skewed representation of classification performance favoring lighter skin tones. In this paper, we propose and empirically validate a novel noise-robust extension to the existing CCE loss function, called Maximum Categorical Cross Entropy (MCCE), which combines CCE loss with a novel reconstruction loss calculated using the Maximum Entropy (ME) measures of the convolutional kernel weights and the input training dataset. We compare MCCE-trained models with CCE-trained models on two benchmark datasets, colorFERET and UTKFace, using a Residual Network (ResNet) CNN architecture. MCCE-trained models reduce overfitting by 5.85% and 4.3% on the colorFERET and UTKFace datasets respectively. In cross-validation testing, MCCE-trained models outperform CCE-trained models by 8.8% and 25.16% on the colorFERET and UTKFace datasets respectively. MCCE thus mitigates the persistent problem of inadvertent racial bias in facial recognition within the domain of computer vision.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) offer state-of-the-art results in computer vision tasks He et al. (2016); Szegedy et al. (2015); Simonyan & Zisserman (2014), but are susceptible to inherent noise in the input training data, which predisposes them to overfitting on the input data during information propagation. When new data is presented, overfit models do not generalize well and offer significantly lower classification performance, exacerbating the problem of bias towards a specific subset of data. The fundamental learning theory behind CNNs is to approximate an underlying d-dimensional interpolated function f(X) ∈ R^d using information from n d-dimensional input vectors Maiorov (2006), where X = {x_1, x_2, ..., x_n}, x_i = <x_1, x_2, ..., x_d>, and i, d ∈ Z_{>0}. The approximation problem is theoretically non-linear, and there is empirical evidence to support the assertion that CNNs simply memorize the input training data Zhang et al. (2016). Overfitting occurs when the internal parameters of a CNN model are so finely tuned to the unique variances of the input training data that the model perfectly captures its characteristics Hawkins (2004). Misclassification occurs when overfit models are unable to distinguish between overlapping variances for different classes of images. Reducing overfitting is also difficult, since the mechanisms of learning in CNNs for non-convex optimization problems such as image classification are generally not well understood Shamir (2018). A simple way to reduce overfitting is to train models on a very large number of images Shorten & Khoshgoftaar (2019), such as the ImageNet dataset, which consists of millions of training images used for natural image classification. While big data solutions might mask the underlying problem of model overfitting, the acquisition of clean/noise-free labeled data for supervised model training is challenging.
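As a concrete illustration of CCE's noise susceptibility (our own minimal NumPy sketch, not from the paper), a single confidently mislabelled example produces a per-sample loss that grows without bound, which is one mechanism by which noisy labels can push a model toward memorizing the training data:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Per-sample CCE for one-hot labels; the loss grows without bound
    as the predicted probability of the labelled class approaches zero."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# The model is confident (and correct for the true class), but a noisy
# label flips the target: the mislabelled sample dominates the batch loss.
prediction = np.array([0.9, 0.05, 0.05])
clean = categorical_cross_entropy(np.array([1.0, 0.0, 0.0]), prediction)
noisy = categorical_cross_entropy(np.array([0.0, 1.0, 0.0]), prediction)
print(clean, noisy)  # noisy loss is roughly 28x the clean loss
```

Gradient descent then spends disproportionate effort fitting such samples, tuning the kernel weights to the noise rather than to class-discriminative structure.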
The problem of data acquisition is compounded further by ethical, societal, and practical concerns when dealing with facial datasets, especially for the task of race or gender classification. Another key challenge when creating datasets is the consideration that must be given to the distribution of data among the multiple classes, along with the variability of data within an individual class. Unbalanced datasets, where the distribution of images is not equal across all classes, introduce bias during model training Ganganwar (2012). The only viable solution to rectify imbalanced datasets is to augment or supplement them with new images, which, as mentioned before, is an ongoing challenge. To the best of our knowledge, no prior work has attempted to optimize the data distribution of the convolutional kernel weights during model training. We hypothesize that balancing the convolutional kernel data during model training could aid in mitigating bias and increase classification performance by alleviating the severity of inherent noise. Some researchers attribute the racial bias of CNN models to noise in the training data and associated labels, proposing alternative loss functions such as Mean Absolute Error (MAE) Ghosh et al. (2017) in place of commonly used loss functions like Categorical Cross Entropy (CCE), as explained in Section 2.1. MAE was proposed as a noise-robust alternative to mitigate the susceptibility of CNNs to noise, but as Zhang & Sabuncu (2018) assert, MAE is not applicable to complex natural image datasets like ImageNet, and as such it is not considered in this paper. The task of classifying race in human faces is established to be more complex than natural image classification because there exists only a narrow range of possible variations in features between human faces of different races, especially when skin tone is not the major determining factor for racial identity Fu et al. (2014).
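To make the CCE/MAE contrast concrete, the following illustrative sketch (ours, not the cited works') compares the two losses as the predicted probability of the true class shrinks. Per-sample CCE is unbounded, while MAE over class probabilities saturates at 2; this boundedness is the property behind MAE's noise robustness:

```python
import numpy as np

def cce(y_true, y_pred, eps=1e-12):
    # Unbounded: -log(p) -> infinity as the true-class probability p -> 0.
    return -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)))

def mae(y_true, y_pred):
    # Sum of absolute errors over class probabilities; bounded in [0, 2].
    return np.sum(np.abs(y_true - y_pred))

y_true = np.array([1.0, 0.0, 0.0])
for p in [0.5, 0.1, 0.01, 0.001]:
    y_pred = np.array([p, (1 - p) / 2, (1 - p) / 2])
    print(f"p={p}: CCE={cce(y_true, y_pred):.2f}  MAE={mae(y_true, y_pred):.2f}")
```

The flip side of the bounded loss is a bounded gradient, which is one intuition for why MAE trains poorly on complex datasets such as ImageNet, as noted above.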
In this paper, we explore the problem of overfitting with respect to racial classification by assessing the train-test divergence to quantify the degree of generalizability, where a higher train-test divergence indicates a greater degree of model overfitting on the training data. We also propose a novel extension to the commonly used CCE loss function using Maximum Entropy (ME) Hartley (1928) measures, called Maximum Categorical Cross Entropy (MCCE). MCCE loss calculations take into account the distribution of convolutional kernel weights during model training in addition to the traditional CCE loss. Most related works explore model over-parameterization Zhang et al. (2019) or under-parameterization Soltanolkotabi et al. (2018) with unrealistic assumptions made about the distribution of the input data; we make no such assumptions. The contributions of this paper are as follows:
• We propose a novel extension to the Categorical Cross Entropy (CCE) loss function using Maximum Entropy (ME) measures, known as Maximum Categorical Cross Entropy (MCCE) loss, to reduce model overfitting.
• We empirically validate the MCCE loss function with respect to model overfitting using train-test divergence as a metric, and evaluate generalizability across datasets using cross-validation testing.
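This excerpt describes MCCE only at a high level, so the following is purely a hypothetical sketch under our own assumptions: we model the reconstruction term as the gap between the Maximum Entropy bound and the Shannon entropy of a histogram of the convolutional kernel weights, combined additively with CCE through an assumed weighting factor `lam`. The function names, the histogram-based entropy estimate, and the additive combination are illustrative, not the authors' actual definition:

```python
import numpy as np

def cce(y_true, y_pred, eps=1e-12):
    # Mean categorical cross entropy over a batch of one-hot labels.
    return -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)), axis=-1).mean()

def entropy_gap(weights, bins=32, eps=1e-12):
    """Gap between the Maximum Entropy bound log2(bins) and the Shannon
    entropy of the empirical kernel-weight histogram. The gap is zero
    when the weights are uniformly spread over the bins, and largest
    when they collapse into a single bin."""
    hist, _ = np.histogram(np.ravel(weights), bins=bins)
    p = hist / max(hist.sum(), 1)
    h = -np.sum(p * np.log2(p + eps))
    return np.log2(bins) - h

def mcce(y_true, y_pred, kernel_weights, lam=0.1):
    # Hypothetical combination: standard CCE plus an ME-based
    # reconstruction penalty on the convolutional kernel weights.
    return cce(y_true, y_pred) + lam * entropy_gap(kernel_weights)
```

Under this reading, the penalty discourages kernel-weight distributions that concentrate on a few values (a plausible signature of memorizing training noise), which is consistent with the paper's stated aim of balancing convolutional kernel data during training.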

