MAXIMUM CATEGORICAL CROSS ENTROPY (MCCE): A NOISE-ROBUST ALTERNATIVE LOSS FUNCTION TO MITIGATE RACIAL BIAS IN CONVOLUTIONAL NEURAL NETWORKS (CNNS) BY REDUCING OVERFITTING

Anonymous

Abstract

Categorical Cross Entropy (CCE) is the most commonly used loss function in deep neural networks such as Convolutional Neural Networks (CNNs) for multi-class classification problems. However, CCE is highly susceptible to noise: CNN models trained without accounting for the unique noise characteristics of the input data, or for noise introduced during model training, invariably overfit, which degrades model generalizability. The lack of generalizability becomes especially apparent in ethnicity/racial image classification problems encountered in the domain of computer vision. One such problem is unintended discriminatory racial bias, which CNN models trained using CCE fail to adequately address; in other words, CNN models trained using CCE offer a skewed representation of classification performance favoring lighter skin tones. In this paper, we propose and empirically validate a novel noise-robust extension to the existing CCE loss function called Maximum Categorical Cross-Entropy (MCCE), which combines CCE loss with a novel reconstruction loss, calculated using the Maximum Entropy (ME) measures of the convolutional kernel weights and the input training dataset. We compare MCCE-trained with CCE-trained models on two benchmarking datasets, colorFERET and UTKFace, using a Residual Network (ResNet) CNN architecture. MCCE-trained models reduce overfitting by 5.85% and 4.3% on the colorFERET and UTKFace datasets respectively. In cross-validation testing, MCCE-trained models outperform CCE-trained models by 8.8% and 25.16% on the colorFERET and UTKFace datasets respectively. MCCE thus addresses and mitigates the persistent problem of inadvertent racial bias in facial recognition within the domain of computer vision.
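The paper's exact formulation of MCCE is not given in this section; purely as an illustrative sketch, the NumPy snippet below shows one way a CCE loss could be combined with an entropy-based reconstruction term of the kind the abstract describes. The function names, the weighting factor `lam`, and the specific form of the reconstruction term (the absolute difference between the Shannon entropies of the kernel weights and the input batch) are assumptions for illustration, not the authors' definition.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Standard CCE, averaged over the batch (y_true one-hot, y_pred softmax probabilities)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def shannon_entropy(values, eps=1e-12):
    """Entropy of the distribution obtained by normalizing the absolute values."""
    p = np.abs(values).ravel()
    p = p / (p.sum() + eps)
    return -np.sum(p * np.log(p + eps))

def mcce_loss(y_true, y_pred, kernel_weights, x_batch, lam=0.1):
    """Hypothetical MCCE-style loss: CCE plus an entropy-matching
    reconstruction term between kernel weights and the input batch."""
    cce = categorical_cross_entropy(y_true, y_pred)
    recon = abs(shannon_entropy(kernel_weights) - shannon_entropy(x_batch))
    return cce + lam * recon
```

Since the reconstruction term is non-negative, this sketch always penalizes the model at least as much as plain CCE, with `lam` controlling the trade-off between classification fit and the entropy-matching regularizer.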



1 Introduction

Convolutional Neural Networks (CNNs) offer state-of-the-art results in computer vision tasks (He et al., 2016; Szegedy et al., 2015; Simonyan & Zisserman, 2014) but are susceptible to inherent noise in the input training data, which promotes overfitting during information propagation. When new data is presented, overfit models do not generalize well and offer significantly lower classification performance, exacerbating the problem of bias towards a specific subset of the data. The fundamental learning theory behind CNNs is to approximate an underlying d-dimensional interpolated function f(X) ∈ R^d using information from n d-dimensional input vectors X = {x_1, x_2, . . . , x_n}, where x_i = ⟨x_{i1}, x_{i2}, . . . , x_{id}⟩ and i, d ∈ Z_{>0}. The approximation problem is inherently non-linear, and there is empirical evidence to support the assertion that CNNs simply memorize the input training data (Zhang et al., 2016).
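For reference, the standard CCE objective that MCCE extends can be written, for n training samples and C classes, with y_{i,c} the one-hot ground truth and ŷ_{i,c} the predicted softmax probability, as:

```latex
\mathcal{L}_{\mathrm{CCE}} \;=\; -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}_{i,c}
```

Because only the term for the true class contributes to each inner sum, any label noise in y directly perturbs the gradient, which is one intuition for CCE's sensitivity to noisy training data.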

