EXPRESSIVE MONOTONIC NEURAL NETWORKS

Abstract

The monotonic dependence of the outputs of a neural network on some of its inputs is a crucial inductive bias in many scenarios where domain knowledge dictates such behavior. This is especially important for interpretability and fairness considerations. In a broader context, scenarios in which monotonicity is important can be found in finance, medicine, physics, and other disciplines. It is thus desirable to build neural network architectures that implement this inductive bias provably. In this work, we propose a weight-constrained architecture with a single residual connection to achieve exact monotonic dependence in any subset of the inputs. The weight constraint scheme directly controls the Lipschitz constant of the neural network and thus provides the additional benefit of robustness. Compared to existing techniques for enforcing monotonicity, our method is simpler to implement and simpler in its theoretical foundations, has negligible computational overhead, is guaranteed to produce monotonic dependence, and is highly expressive. We show how the algorithm is used to train powerful, robust, and interpretable discriminators that achieve competitive performance compared to current state-of-the-art methods across various benchmarks, from social applications to the classification of the decays of subatomic particles produced at the CERN Large Hadron Collider.

1. INTRODUCTION

The need to model functions that are monotonic in a subset of their inputs is prevalent in many ML applications. Enforcing monotonic behavior can help improve generalization capabilities (Milani Fard et al., 2016; You et al., 2017) and assist with interpretation of the decision-making process of the neural network (Nguyen & Martínez, 2019). Real-world scenarios include various applications with fairness, interpretability, and security aspects; examples can be found in the natural sciences and in many social applications.

Monotonic dependence of a model output on a certain feature in the input can be informative of how an algorithm works, and in some cases is essential for real-world usage. For instance, a good recommender engine will favor a product with a high number of reviews over another with fewer but otherwise identical reviews (ceteris paribus). The same applies to systems that assess health risk, evaluate the likelihood of recidivism, rank applicants, filter inappropriate content, etc.

In addition, robustness to small perturbations in the input is a desirable property for models deployed in real-world applications, particularly when they are used to inform decisions that directly affect human actors or where the consequences of making an unexpected and unwanted decision could be extremely costly. The continued development of adversarial methods demonstrates the possibility of malicious attacks on current algorithms (Akhtar et al., 2021). A natural way of ensuring the robustness of a model is to constrain its Lipschitz constant. To this end, we recently developed an architecture whose Lipschitz constant is constrained by design using layer-wise normalization, which allows the architecture to be more expressive than the current state-of-the-art while training stably and quickly (Kitouni et al., 2021). Our algorithm has been adopted to classify the decays of subatomic particles produced at the CERN Large Hadron Collider in the real-time data-processing system of the LHCb experiment, which was our original motivation for developing this novel architecture.

In this paper, we present expressive monotonic Lipschitz networks. This new class of architectures employs the Lipschitz-bounded networks from Kitouni et al. (2021) along with residual connections to implement monotonic dependence in any subset of the inputs by construction. It also provides exact robustness guarantees while keeping the constraints minimal, such that the architecture remains a universal approximator of Lipschitz continuous monotonic functions. We show how the algorithm is used to train powerful, robust, and interpretable discriminators that achieve competitive performance compared to current state-of-the-art methods across various benchmarks, from social applications to our original target application: the classification of the decays of subatomic particles produced at the CERN Large Hadron Collider.
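To make the layer-wise normalization idea mentioned above concrete, the following PyTorch sketch illustrates one way such a constraint can be realized. This is a minimal illustration, not the authors' released implementation: the names LipschitzLinear and lipschitz_mlp are ours, and we assume each weight matrix is rescaled so that its L1 operator norm (maximum absolute column sum) stays within a per-layer budget of λ^(1/depth), which bounds the Lipschitz constant of the composition by λ with respect to the L1 norm.

```python
# Minimal sketch (our illustration, not the authors' released code) of a
# Lipschitz-bounded MLP via layer-wise weight normalization.
import torch
import torch.nn as nn


class LipschitzLinear(nn.Linear):
    """Linear layer whose L1 operator norm is softly capped at `max_norm`."""

    def __init__(self, in_features, out_features, max_norm):
        super().__init__(in_features, out_features)
        self.max_norm = max_norm  # per-layer budget, e.g. lam ** (1 / depth)

    def forward(self, x):
        # L1 operator norm of W: maximum absolute column sum.
        norm = self.weight.abs().sum(dim=0).max()
        # Rescale only when the budget is exceeded (identity otherwise).
        scale = torch.clamp(self.max_norm / norm, max=1.0)
        return nn.functional.linear(x, self.weight * scale, self.bias)


def lipschitz_mlp(d_in, width, depth, lam):
    """MLP that is Lipschitz with constant <= lam w.r.t. the L1 input norm."""
    budget = lam ** (1.0 / depth)
    dims = [d_in] + [width] * (depth - 1) + [1]
    layers = []
    for i in range(depth):
        layers.append(LipschitzLinear(dims[i], dims[i + 1], budget))
        if i < depth - 1:
            # A gradient-norm-preserving activation (e.g. GroupSort) is needed
            # for full expressiveness; ReLU is used here only for brevity.
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)
```

Because the rescaling is a no-op whenever the norm budget is respected, the constraint adds negligible computational overhead during training and none at inference if the weights are rescaled once after training.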

2. RELATED WORK

Prior work in the field of monotonic models can be split into two major categories:

• Built-in and constrained monotonic architectures: Examples of this category include Deep Lattice Networks (You et al., 2017) and networks in which all weights are constrained to have the same sign (Sill, 1998). The major drawbacks of most implementations of constrained architectures are a lack of expressiveness or poor performance due to superfluous complexity.

• Heuristic and regularized architectures (with or without certification): Examples of such methods include Sill & Abu-Mostafa (1996) and Gupta et al., which penalize point-wise negative gradients on the training sample (a minimal sketch of such a penalty is given below). These methods work on arbitrary architectures and retain much expressive power, but offer no guarantees as to the monotonicity of the trained model. Another similar method is Liu et al. (2020), which relies on Mixed Integer Linear Programming (MILP) to certify the monotonicity of piece-wise linear architectures. The method uses a heuristic regularization to penalize the non-monotonicity of the model on points sampled uniformly in the domain during training, and the procedure is repeated with increasing regularization strength until the model passes the certification. This iteration can be expensive, and while the method is more flexible than constrained architectures (it is valid for any MLP with piece-wise linear activations), the computational overhead of the certification process can be prohibitive. Similarly, Sivaraman et al. (2020) propose guaranteed monotonicity for standard ReLU networks by letting a Satisfiability Modulo Theories (SMT) solver find counterexamples to the monotonicity definition and adjusting the prediction at inference time such that monotonicity is guaranteed. However, this approach requires queries to the SMT solver during inference for each monotonic feature, and the computation time scales poorly with the number of monotonic features and the model size (see Figures 3 and 4 in Sivaraman et al. (2020)).

Our architecture falls into the first category. However, we overcome both main drawbacks: lack of expressiveness and impractical complexity. Other related work appears in the context of monotonic functions for normalizing flows, where monotonicity is a key ingredient to enforce invertibility (De Cao et al., 2020; Huang et al., 2018; Behrmann et al., 2019; Wehenkel & Louppe, 2019).
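The point-wise penalty referenced in the second category can be sketched as follows. This is our hedged illustration, not code from any of the cited papers: the function name and the uniform box-domain sampling are assumptions made for clarity.

```python
# Hedged sketch of a point-wise monotonicity penalty: negative partial
# derivatives w.r.t. the monotonic features are penalized on points sampled
# uniformly from a box-shaped input domain [low, high]^d.
import torch


def monotonicity_penalty(model, low, high, mask, n_samples=1024):
    # Sample points uniformly in the domain; low and high have shape (d,).
    x = low + (high - low) * torch.rand(n_samples, low.shape[-1])
    x.requires_grad_(True)
    grads = torch.autograd.grad(model(x).sum(), x, create_graph=True)[0]
    # Penalize only negative slopes, and only over the monotonic features
    # (mask is the indicator vector of the monotonic feature subset).
    return (torch.relu(-grads) * mask).sum(dim=-1).mean()
```

As the text above notes, adding such a penalty to the training loss encourages, but does not guarantee, monotonicity; that is precisely the gap that certification procedures such as the MILP and SMT approaches attempt to close, at substantial computational cost.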

3. METHODS

The goal is to develop a neural network architecture representing a vector-valued function f : R^d → R^n, d, n ∈ N, that is provably monotonic in any subset of its inputs. We first define a few ingredients.

Definition 3.1 (Monotonicity). Let x ∈ R^d and let x_S ≡ 1_S ⊙ x be the Hadamard product of x with the indicator vector 1_S, where (1_S)_i = 1 if i ∈ S and 0 otherwise, for a subset S ⊆ {1, …, d}. We say that outputs Q ⊆ {1, …, n} of f are monotonically increasing in features S if

f(x′_S + x_S̄)_i ≤ f(x_S + x_S̄)_i  ∀ i ∈ Q and ∀ x′_S ≤ x_S,

where S̄ denotes the complement of S and the inequality on the right uses the product (component-wise) order.

Definition 3.2 (Lip^p function). g : R^d → R^n is Lip^p if it is Lipschitz continuous with respect to the L^p norm in every output dimension, i.e.,

‖g(x) − g(y)‖_∞ ≤ λ ‖x − y‖_p  ∀ x, y ∈ R^d.  (3)
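Combining the two definitions, the residual construction described in the abstract and introduction can be sketched as follows. This is our illustrative code, not the paper's released implementation: it assumes g is a Lip^1 network with Lipschitz constant λ (e.g., the lipschitz_mlp sketch in the introduction), so that every partial derivative of g is bounded in magnitude by λ, and a single residual term λ Σ_{i∈S} x_i restores a nonnegative slope over the monotonic features.

```python
# Sketch of the residual construction f(x) = g(x) + lam * sum_{i in S} x_i.
# If g is Lip^1 with constant lam, then |dg/dx_i| <= lam for every feature,
# hence df/dx_i = lam + dg/dx_i >= 0 for all i in S: f is monotonically
# increasing in the features indexed by S, by construction.
import torch
import torch.nn as nn


class MonotonicLipschitzNet(nn.Module):
    def __init__(self, g, lam, monotonic_mask):
        super().__init__()
        self.g = g        # any Lip^1 network with Lipschitz constant lam
        self.lam = lam
        # Indicator vector 1_S over input features (1 = monotonic feature).
        self.register_buffer("mask", monotonic_mask.float())

    def forward(self, x):
        residual = self.lam * (x * self.mask).sum(dim=-1, keepdim=True)
        return self.g(x) + residual


# Example usage: monotonic in features 0 and 2 of a 4-dimensional input,
# reusing the hypothetical lipschitz_mlp helper sketched earlier.
g = lipschitz_mlp(d_in=4, width=32, depth=3, lam=1.0)
f = MonotonicLipschitzNet(g, lam=1.0,
                          monotonic_mask=torch.tensor([1., 0., 1., 0.]))
```

Note that monotonicity here is exact rather than certified after the fact: no sampling, regularization, or solver queries are required, since the guarantee follows directly from the Lipschitz bound on g.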




