TREE-STRUCTURE SEGMENTATION FOR LOGISTIC REGRESSION

Abstract

The decision for a financial institution to accept or deny a loan is based on the probability of a client paying back their debt in time. This probability is given by a model such as a logistic regression, and estimated based on, e.g., the clients' characteristics, their credit history, and their repayment performance. Historically, different models have been developed for different markets and/or credit products and/or populations addressed. We show that this amounts to modelling default as a mixture model composed of a decision tree and logistic regressions on its leaves (hereafter a "logistic regression tree"). We seek to optimise this practice by considering the population to which a client belongs as a latent variable, which we will estimate. After exposing the context, the notations and the problem formalisation, we will conduct estimation using a Stochastic-Expectation-Maximisation (SEM) algorithm. We will finally show the performance on simulated data, on real retail credit data from [COMPANY], as well as on real open-source data.

1. INTRODUCTION AND NOTATIONS

1.1. CONTEXT

[COMPANY], like most financial institutions, has a largely automatic procedure to accept or deny loans and to estimate its capital requirements. The procedure is based on credit scores. A client fills in a questionnaire with socio-demographic and banking-behaviour questions, whose answers are used to compute a score. This score determines the financing of the client and the impairment the bank must set aside in case of a potential default. The score is learned from past clients' characteristics (from the questionnaire), which we denote by x, and the repayment in time, or not, of their loan, which we denote by y ∈ {0, 1} (where 1 represents default). The score is directly proportional to the probability p(1|x) of the client not paying back the loan in time, which is estimated with a model {p_θ(y|x)}_{θ∈Θ}. A parametric family Θ is chosen (usually logistic regression) and the optimal parameter θ ∈ Θ is estimated from an n-sample (x, y) = (x_i, y_i)_{i=1}^n, usually by maximum likelihood. Such a model is relatively weak, in the sense that its hypothesis space is too restricted to fit the whole clientele of a large financial institution.
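As a toy illustration of the scoring step just described, the following numpy-only sketch fits a logistic regression p_θ(1|x) by maximum likelihood, using plain gradient ascent on the log-likelihood. The function names and the simulated data are ours, not the paper's.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Maximum-likelihood fit of p_theta(1|x) = sigmoid(theta . x) by gradient ascent."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))
        theta += lr * Xb.T @ (y - p) / len(y)    # gradient of the mean log-likelihood
    return theta

def predict_default(theta, X):
    """Estimated default probability p(1|x) for each row of X."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ theta))

# toy data: one feature, higher value -> higher default probability
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-2 * x[:, 0]))).astype(float)
theta_hat = fit_logistic(x, y)
```

In practice the fit would be done with a standard solver (e.g. iteratively reweighted least squares); gradient ascent is used here only to keep the sketch self-contained.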

1.2. A MODEL FOR EACH SEGMENT OF CLIENTS

Most financial institutions address multiple markets, e.g. automobile, home appliances, or partners who sell such products, and different populations of clients (professionals, organisations, agriculture, private clients). We call such a sub-population a "segment" and denote it by c ∈ C = {1, . . . , K}, where K denotes the total number of segments. Formally, we have a vector of customer characteristics X = (X_j)_{j=1}^d, made of d features, either continuous (i.e. valued in R) or categorical (i.e. valued in {1, . . . , m_j} without order). The aim is to predict the default Y ∈ {0, 1} from an observation x. These features differ depending on the segment of the population; for instance, "time since creation of the company" is a feature which does not apply to private clients. However, for simplicity, in the rest of the paper we will assume that all d features are shared by all segments. This leads to little loss of generality since continuous features can be discretised and a "Not Relevant" level can be introduced for categorical features; additionally, we can resort to feature selection at the segment level (see Section 4.6).
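The two pre-processing tricks mentioned above can be sketched in a few lines: equal-width discretisation of a continuous feature, and an extra "Not Relevant" level for categorical features that do not apply to a segment. The helper names and the choice of equal-width bins are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def discretise(x, bins):
    """Map a continuous feature to ordinal bin indices (here, equal-width bins)."""
    edges = np.linspace(x.min(), x.max(), bins + 1)[1:-1]  # interior cut points
    return np.digitize(x, edges)

NOT_RELEVANT = 0  # hypothetical extra level for features that do not apply

def encode_categorical(values, levels):
    """Encode a categorical feature, sending inapplicable values to NOT_RELEVANT."""
    mapping = {lv: i + 1 for i, lv in enumerate(levels)}
    return np.array([mapping.get(v, NOT_RELEVANT) for v in values])
```

With this convention, "time since creation of the company" would be encoded as NOT_RELEVANT for every private client, so all segments share one feature space.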


Subsequently, financial institutions create a different predictive model {p_{θ_c}(y|x)}_{c=1}^K for each population c, where θ_c denotes the coefficient vector used for segment c (with potential null entries), as shown in Figure 1, which leads to K models. This means we learn "expert" logistic regression models on separate "segments" of clients arranged in a tree. Since this structure is inherited from past a priori decisions, it is likely to be sub-optimal; hence we seek to optimise the performance on the whole population. To this end, we formalise the data generating process in the next section.
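The current practice can thus be sketched as fitting one independent "expert" logistic regression θ_c per a-priori segment, then scoring each client with their own segment's model. This numpy-only sketch (function names ours) reuses a basic gradient-ascent fit; the simulated data give the two segments opposite-signed slopes.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Basic maximum-likelihood logistic regression via gradient ascent."""
    Xb = np.column_stack([np.ones(len(X)), X])
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))
        theta += lr * Xb.T @ (y - p) / len(y)
    return theta

def fit_segmented(X, y, seg):
    """One 'expert' logistic regression theta_c per a-priori segment c."""
    return {c: fit_logistic(X[seg == c], y[seg == c]) for c in np.unique(seg)}

def predict_segmented(thetas, X, seg):
    """Score each client with their own segment's model."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.array([1.0 / (1.0 + np.exp(-Xb[i] @ thetas[c]))
                     for i, c in enumerate(seg)])

# toy data: two segments in which the same feature acts in opposite directions
rng = np.random.default_rng(0)
x0 = rng.normal(size=(100, 1))
y0 = (rng.uniform(size=100) < 1 / (1 + np.exp(-3 * x0[:, 0]))).astype(float)
x1 = rng.normal(size=(100, 1))
y1 = (rng.uniform(size=100) < 1 / (1 + np.exp(+3 * x1[:, 0]))).astype(float)
X = np.vstack([x0, x1])
y = np.concatenate([y0, y1])
seg = np.array([0] * 100 + [1] * 100)
thetas = fit_segmented(X, y, seg)
```

A single pooled logistic regression would average the two opposite slopes away, which is precisely why practitioners split the population before fitting.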

1.3. FORMALISATION OF THE DATA GENERATING PROCESS

We assume that the model in Figure 1 used by financial institutions accurately depicts the data generation, i.e. for a given client x, there exists a segment c and a logistic regression parameter θ_c for which the default y is drawn from p_{θ_c}(·|x; c). In other words, we assume that this model is well-specified. We denote by C ∼ p(·) the random variable valued in {1, . . . , K} which corresponds to the assignment to a group (the tree's leaves in Figure 1). C specifies both the distribution of the predictive variables, i.e. x|c ∼ p(·|c), and the default law for each group, which we suppose to be logistic. Just like for Gaussian mixture models, we seek to estimate p(c|x) (a simple proportion in the latter example), which will subsequently allow estimation of p_{θ_c}(y|x; c). Current procedures are described in the next section.
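The generative process just described can be simulated directly: draw the latent segment C from p(·), draw x|c, then draw the default y from the segment's logistic law. The parameter values below (K = 2 segments, Gaussian x|c, one intercept and slope per segment) are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n = 2, 1000
pi = np.array([0.6, 0.4])            # p(c): segment proportions
mu = np.array([-1.0, 2.0])           # x|c ~ N(mu_c, 1): features depend on the segment
theta = np.array([[0.5, 2.0],        # theta_c = (intercept, slope) for each segment
                  [-1.0, -1.5]])

c = rng.choice(K, size=n, p=pi)                       # latent segment assignment C
x = rng.normal(mu[c], 1.0)                            # predictors drawn given the segment
logit = theta[c, 0] + theta[c, 1] * x                 # segment-specific logistic law
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
```

Only (x, y) would be observed in practice; the estimation problem of the paper is to recover p(c|x) and the θ_c without ever observing c.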

1.4. AD HOC "TWO-STAGES" PRACTICE

The ad hoc methods rely on "two-stage" procedures: first optimising the segmentation, then learning a separate logistic regression on each segment. The segmentation is done by practitioners using simple unsupervised "clustering" techniques such as Principal Component Analysis (PCA) and its refinements. In the presence of (possibly only) categorical features, Multiple Correspondence Analysis (MCA), by Lebart et al. (1995), or Factor Analysis of Mixed Data (FAMD), by Pagès (2014), can be more appropriate. Practitioners then visually assess whether clusters appear in the projection of samples onto the 2-3 first principal components, as in the examples of Appendix A, which results in a qualitative, clustering-like technique that often performs poorly.

In Section 2 we review the existing approaches to create logistic regression trees. In Section 3 we formalise the problem of determining the best logistic regression tree as a mixture model, and propose an estimation strategy in Section 4. We devote Section 5 to numerical experiments on simulated data, and Section 6 to experiments on real data.
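The first stage of this two-stage practice can be mimicked in a few lines: project the samples on the first principal components, then cluster the projection, here with a naive k-means standing in for the practitioner's visual inspection. This is a numpy-only sketch under our own naming; real practice would use PCA/MCA/FAMD tooling and human judgement.

```python
import numpy as np

def pca_project(X, k=2):
    """Project samples on the first k principal components (via SVD of centred data)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(Z, K, n_iter=20):
    """Naive k-means on the projected data, replacing visual cluster assessment."""
    order = np.argsort(Z[:, 0])  # deterministic init: spread centres along PC1
    centres = Z[order[np.linspace(0, len(Z) - 1, K).astype(int)]]
    for _ in range(n_iter):
        labels = np.argmin(((Z[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        centres = np.array([Z[labels == k].mean(axis=0) for k in range(K)])
    return labels
```

The second stage would then fit one logistic regression per recovered cluster. Note that the clustering objective never looks at y, which is one reason the resulting segmentation can be sub-optimal for predicting default.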

2. LITERATURE REVIEW OF EXISTING DIRECT APPROACHES: LOGISTIC REGRESSION TREES

The first research work focusing on a problem similar to the present one seems to be LOTUS, by Chan & Loh (2004), where logistic regression trees are constructed by splitting the data, at the tree's nodes, on the features which break the linearity assumption of logistic regression. Its authors' motivation is that logistic regression has a fixed parameter space, defined by the number of input features, whereas trees adapt their flexibility (i.e. depth) to the sample size n. Thus, they search for trees whose leaves are logistic regressions with a few continuous features and whose intermediate nodes (found via an appropriate χ² test) split the population based on categorical or

