TREE-STRUCTURE SEGMENTATION FOR LOGISTIC REGRESSION

Abstract

A financial institution's decision to accept or deny a loan is based on the probability that a client repays their debt on time. This probability is given by a model such as a logistic regression and is estimated from, e.g., the clients' characteristics, their credit history, and their repayment performance. Historically, different models have been developed for different markets, credit products, and/or target populations. We show that this amounts to modelling default as a mixture model composed of a decision tree with a logistic regression on each of its leaves (hereafter "logistic regression tree"). We seek to optimise this practice by treating the population to which a client belongs as a latent variable, which we estimate. After presenting the context, the notation and the problem formalisation, we conduct estimation using a Stochastic Expectation-Maximisation (SEM) algorithm. Finally, we show the performance on simulated data, on real retail credit data from [COMPANY], and on real open-source data.

1. INTRODUCTION AND NOTATIONS

1.1. CONTEXT

[COMPANY], like most financial institutions, has a largely automated procedure to accept or deny loans and to estimate its capital requirements. The procedure is based on credit scores. A client fills in a questionnaire with socio-demographic information and banking behavioural questions, whose answers are used to compute a score. This score determines the financing of the client and the impairment the bank needs in order to be ready for a potential default. The score is learned from past clients' characteristics (from the questionnaire), which we denote by $x$, and from whether or not they repaid their loan on time, which we denote by $y \in \{0, 1\}$ (where 1 represents default). The score is directly proportional to the probability $p(1|x)$ of the client not paying back the loan on time, which is estimated with a model $\{p_\theta(y|x)\}_{\theta \in \Theta}$. A parametric family indexed by $\Theta$ is chosen (usually logistic regression) and the optimal parameter $\theta \in \Theta$ in this family is estimated from an $n$-sample $(\mathbf{x}, \mathbf{y}) = (x_i, y_i)_{i=1}^n$, usually using a maximum likelihood approach. Such a model is relatively weak, in the sense that its hypothesis space is too restricted to fit the whole clientele of a big financial institution.
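To make the estimation step concrete, the following is a minimal sketch of fitting a logistic regression by maximum likelihood on a simulated n-sample. Everything in it is an illustrative assumption rather than the procedure used by [COMPANY]: the two Gaussian features, the coefficients in `true_theta`, the sample size, and the use of plain gradient ascent (production tools typically rely on Newton-type solvers).

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """p_theta(1 | x) for a logistic regression; theta[0] is the intercept."""
    return sigmoid(theta[0] + sum(t * v for t, v in zip(theta[1:], x)))


def log_likelihood(theta, xs, ys):
    """Log-likelihood of the sample (xs, ys) under p_theta."""
    total = 0.0
    for x, y in zip(xs, ys):
        # Clamp to avoid log(0) for extreme predictions.
        p = min(max(predict_proba(theta, x), 1e-12), 1.0 - 1e-12)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, lr=1.0, n_iter=500):
    """Approximate the maximum likelihood estimate by gradient ascent."""
    n, d = len(xs), len(xs[0])
    theta = [0.0] * (d + 1)
    for _ in range(n_iter):
        grad = [0.0] * (d + 1)
        for x, y in zip(xs, ys):
            err = y - predict_proba(theta, x)  # derivative w.r.t. the linear term
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        theta = [t + lr * g / n for t, g in zip(theta, grad)]
    return theta


# Simulated n-sample (assumed setting: two standard Gaussian features).
random.seed(0)
true_theta = [-1.0, 2.0, -1.0]  # hypothetical "true" parameter
xs = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(400)]
ys = [1 if random.random() < predict_proba(true_theta, x) else 0 for x in xs]

theta_hat = fit_logistic(xs, ys)
# Estimated default probability of a new (hypothetical) client.
score = predict_proba(theta_hat, [0.5, -0.5])
```

Because the log-likelihood of a logistic regression is concave, gradient ascent with a small enough step size moves towards the maximum likelihood estimate; the step size and iteration count above are arbitrary choices for this toy sample.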

1.2. A MODEL FOR EACH SEGMENT OF CLIENTS

Most financial institutions address multiple markets, e.g. automobile, home appliances, or partners who sell such products, and different populations of clients (professionals, organisations, agriculture, private clients). We call such a sub-population a "segment", and denote it by $c \in C = \{1, \ldots, K\}$, where $K$ denotes the total number of segments. Formally, we have a vector of customer characteristics $X = (X_j)_{j=1}^d$, made of $d$ features, either continuous (i.e. valued in $\mathbb{R}$) or categorical (i.e. valued in $\{1, \ldots, m_j\}$, without order). The aim is to predict the default $Y \in \{0, 1\}$ from an observation $x$. These features differ depending on the segment of the population; for instance, "time since creation of the company" is a feature which does not apply to private clients. However, for simplicity, in the rest of the paper, we will assume that all $d$ features are shared by all segments. This leads to little loss of generality, since continuous features can be discretised and a "Not Relevant" level can be introduced for categorical features; additionally, we can resort to feature selection at the segment level (see Section 4.6).
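The practice of fitting one model per segment can be sketched as follows. This toy example assumes the segment label c is observed for every client, which is the historical setting described above and not the latent-variable approach of this paper; the two segments, their opposite coefficients in `true_theta`, and the gradient-ascent fitting routine are all hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """p_theta(1 | x); theta[0] is the intercept."""
    return sigmoid(theta[0] + sum(t * v for t, v in zip(theta[1:], x)))


def log_likelihood(theta, xs, ys):
    """Log-likelihood of (xs, ys) under p_theta, clamped to avoid log(0)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict_proba(theta, x), 1e-12), 1.0 - 1e-12)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, lr=1.0, n_iter=300):
    """Approximate maximum likelihood by gradient ascent (illustrative only)."""
    n, d = len(xs), len(xs[0])
    theta = [0.0] * (d + 1)
    for _ in range(n_iter):
        grad = [0.0] * (d + 1)
        for x, y in zip(xs, ys):
            err = y - predict_proba(theta, x)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        theta = [t + lr * g / n for t, g in zip(theta, grad)]
    return theta


# Hypothetical setting: K = 2 segments in which the single feature
# has opposite effects on the probability of default.
random.seed(1)
true_theta = {1: [-1.0, 2.0], 2: [-1.0, -2.0]}
data = []  # list of (segment c, features x, default y)
for _ in range(600):
    c = random.choice([1, 2])
    x = [random.gauss(0, 1)]
    y = 1 if random.random() < predict_proba(true_theta[c], x) else 0
    data.append((c, x, y))

# A single logistic regression fitted on the whole clientele.
xs_all = [x for _, x, _ in data]
ys_all = [y for _, _, y in data]
ll_global = log_likelihood(fit_logistic(xs_all, ys_all), xs_all, ys_all)

# One logistic regression per segment, fitted on that segment's clients only.
models = {}
ll_segmented = 0.0
for c in (1, 2):
    xs_c = [x for cc, x, _ in data if cc == c]
    ys_c = [y for cc, _, y in data if cc == c]
    models[c] = fit_logistic(xs_c, ys_c)
    ll_segmented += log_likelihood(models[c], xs_c, ys_c)
```

In this constructed case the single global model cannot capture the opposite effects of the feature, so `ll_segmented` exceeds `ll_global`. The gap comes from the simulation design and says nothing about how large the gain is on real credit data.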

