INTERPRETABLE (META)FACTORIZATION OF CLINICAL QUESTIONNAIRES TO IDENTIFY GENERAL DIMENSIONS OF PSYCHOPATHOLOGY

Abstract

Psychiatry research aims at understanding manifestations of psychopathology in behavior, in terms of a small number of latent constructs. These are usually inferred from questionnaire data using factor analysis. The resulting factors and their relationship to the original questions are not necessarily interpretable. Furthermore, this approach does not provide a way to separate the effects of confounds from those of constructs, and requires explicit imputation for missing data. Finally, there is no clear way to integrate multiple sets of constructs estimated from different questionnaires. An important question is whether there is a universal, compact set of constructs that would span all the psychopathology issues listed across those questionnaires. We propose a new matrix factorization method designed for questionnaires, aimed at promoting interpretability through bound and sparsity constraints. We provide an optimization procedure with theoretical convergence guarantees, and validate automated methods to detect latent dimensionality on synthetic data. We first demonstrate the method on a commonly used general-purpose questionnaire. We then show it can be used to extract a broad set of 15 psychopathology factors spanning 21 questionnaires from the Healthy Brain Network study. We show that our method preserves diagnostic information relative to competing methods, even as it imposes more constraints. Finally, we demonstrate that it can be used for defining a short, general questionnaire that allows recovery of those 15 meta-factors, using data more efficiently than other methods.

1. INTRODUCTION

Standardized questionnaires are a common tool in psychiatric practice and research, for purposes ranging from screening to diagnosis or quantification of severity. A typical questionnaire comprises questions (usually referred to as items) reflecting the degree to which particular symptoms or behavioral issues are present in study participants. Items are chosen as evidence for the presence of latent constructs giving rise to the psychiatric problems observed. For many common disorders, there is a practical consensus on constructs. If so, a questionnaire may be organized so that subsets of the items can be added up to yield a subscale score quantifying the presence of their respective construct. Otherwise, the goal may be to discover constructs through factor analysis.

The factor analysis of a questionnaire matrix (#participants × #items) expresses it as the product of a factor matrix (#participants × #factors) and a loading matrix (#factors × #items). The method assumes that answers to items may be correlated, and can therefore be explained in terms of a smaller number of factors. The method yields two real-valued matrices, with uncorrelated columns in the factor matrix. The number of factors needs to be specified a priori, or estimated from data. This solution is often subjected to rotation so that, after transformation, each factor has non-zero loadings on few variables, and each variable has a high loading on a single factor, if possible. The values of the factors for each participant can then be viewed as a succinct representation of them. Interpreting what construct a factor may represent is done by considering its loadings over all the items. Ideally, if very few items have a non-zero loading, it will be easy to associate the factor with them. However, in practice, the loadings could be an arbitrary linear combination of items, with positive and negative weights. Factors are real-valued, and neither their magnitude nor their sign is intrinsically meaningful.
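
The factorization described above, and the reason factor sign and magnitude are not pinned down, can be stated compactly. The symbols M, F, L, R, n, m and k are introduced here for illustration only and are not taken from the text:

```latex
\[
\begin{aligned}
M &\approx F L, \qquad
M \in \mathbb{R}^{n \times m},\;
F \in \mathbb{R}^{n \times k},\;
L \in \mathbb{R}^{k \times m},\;
k < m, \\
F L &= (F R)\,(R^{-1} L)
\quad \text{for any invertible } R \in \mathbb{R}^{k \times k}.
\end{aligned}
\]
```

Here n is the number of participants, m the number of items and k the number of factors. Choosing R diagonal with entries such as -1 or 2 flips or rescales a factor while leaving the product unchanged; rotation methods select a particular R that makes the loadings easier to read.
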
Beyond this, any missing data will have to be imputed, or the respective items omitted, before factor analysis can be used. Finally, patterns in answers that are driven by other characteristics of participants (e.g. age or sex) are absorbed into factors themselves, acting as confounders, instead of being represented separately or controlled for.

Over time, many different questionnaires have been developed. Some focus on constructs relevant to particular disorders or behavioral issues; others aim at screening for a wide range of problems. One important question for psychiatry researchers is how many constructs would suffice to explain most manifestations of psychopathology. In addition to its scientific interest, an answer to this question would also be of clinical use, informing the design of lightweight questionnaires that estimate all key constructs from a minimal number of items. The availability of datasets such as Healthy Brain Network (Alexander et al., 2017), where tens of questionnaires are collected for thousands of child and adolescent participants, makes it possible to address this question in a data-driven way. However, a joint factor analysis of many questionnaires faces additional obstacles, e.g. their having different response scales, very disparate numbers of items, or patterns of missing entries.

In this paper, we propose to address all of the issues above with a novel matrix factorization method specifically designed for use with questionnaire data, through the following contributions.

Contribution #1: We introduce Interpretability-Constrained Questionnaire Factorization (ICQF), a new matrix factorization method for questionnaire data. Our method was designed to incorporate characteristics that increase the interpretability of the resulting factors, based on several desiderata from active clinical researchers in psychiatry. First, factor values are constrained to be in the range [0, 1], so as to represent a degree of presence of the factor.
Second, the loadings across items for each factor have to be in the same range as answers in the original questionnaire (typically, [0, max]). This makes it possible to examine them as a pattern of answers associated with the factor. Third, the reconstructed matrix obtained by multiplying factors by factor loadings is constrained, so that no entry exceeds the range (or observed maximum value) of the original questionnaire. Fourth, the method handles missing data directly, so no imputation is required. Finally, the method supports pre-specifying some factors to model known variables, such as age or sex, to capture the answer patterns correlated with them (e.g. drinking problems appearing as age increases). We demonstrate ICQF on the Child Behavior Checklist (CBCL), a widely used questionnaire, and show that it preserves all diagnostic information in various questionnaires, even with additional regularization.

Contribution #2: We provide theoretical guarantees on the convergence and performance of the optimization procedure. We introduce an optimization procedure for ICQF, using alternating minimization with ADMM. We demonstrate that this procedure converges to a local minimum of the optimization problem. We implement blockwise cross-validation (BCV) to determine the number of factors. If this number of factors is close to that underlying the data, the solution will be close to a global minimum. Finally, we show that our procedure detects the number of factors more precisely than competing methods, as evaluated on synthetic data with different noise densities.

Contribution #3: We use a two-level meta-factorization of 21 questionnaires to identify 15 general factors of psychopathology in children and adolescents. We apply ICQF individually to 21 Healthy Brain Network questionnaires (first-level), and then again to a concatenation of the resulting 21 factor matrices (second-level), yielding a meta-factorization with 15 interpretable meta-factors.
We show that these meta-factors can outperform individual questionnaires in diagnostic prediction. We also show that the meta-factorization can be used to produce a short, general questionnaire, with little loss of diagnostic information, using data much more efficiently than competing methods.
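
To make the constraints listed under Contribution #1 concrete, the following is a minimal NumPy sketch of a bounded, masked factorization with a pre-specified confound column. It is an illustration under stated assumptions, not the ICQF procedure: it uses plain projected alternating gradient steps instead of ADMM, it omits the sparsity regularization and the bound on the reconstructed matrix, and all names (`bounded_factorization`, `masked_loss`, `W`, `Q`, `C`) are hypothetical.

```python
import numpy as np

def masked_loss(M0, mask, A, Q):
    # Squared error over observed entries only; missing entries contribute 0.
    R = mask * (A @ Q - M0)
    return 0.5 * float(np.sum(R ** 2))

def bounded_factorization(M, k, C=None, n_iter=300, seed=0):
    """Illustrative sketch: M ~ [W, C] @ Q with W in [0, 1], Q in [0, max(M)].

    M : (participants x items) array, NaN marks a missing answer.
    C : optional (participants x confounds) array of fixed columns in [0, 1].
    """
    rng = np.random.default_rng(seed)
    n, m = M.shape
    mask = ~np.isnan(M)                 # True where an answer was observed
    M0 = np.where(mask, M, 0.0)         # placeholder zeros, never scored
    q_max = float(np.nanmax(M))         # observed maximum answer
    if C is None:
        C = np.zeros((n, 0))
    W = rng.uniform(0.0, 1.0, size=(n, k))
    Q = rng.uniform(0.0, q_max, size=(k + C.shape[1], m))
    losses = [masked_loss(M0, mask, np.hstack([W, C]), Q)]
    for _ in range(n_iter):
        A = np.hstack([W, C])           # learned factors plus fixed confounds
        # Loadings step: gradient step of size 1/L, then clip into [0, q_max].
        R = mask * (A @ Q - M0)
        step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-12)
        Q = np.clip(Q - step * (A.T @ R), 0.0, q_max)
        # Factor step: only W is updated; the confound columns stay fixed.
        R = mask * (A @ Q - M0)
        Qw = Q[:k]
        step = 1.0 / (np.linalg.norm(Qw, 2) ** 2 + 1e-12)
        W = np.clip(W - step * (R @ Qw.T), 0.0, 1.0)
        losses.append(masked_loss(M0, mask, np.hstack([W, C]), Q))
    return W, Q, losses

# Synthetic example: 80 participants, 12 items, 3 factors, about 10% missing.
rng = np.random.default_rng(1)
W_true = rng.uniform(0.0, 1.0, size=(80, 3))
Q_true = rng.uniform(0.0, 1.0, size=(3, 12))
M = W_true @ Q_true                     # answers lie in [0, 3]
M[rng.uniform(size=M.shape) < 0.1] = np.nan
age = rng.uniform(0.0, 1.0, size=(80, 1))  # hypothetical rescaled confound
W, Q, losses = bounded_factorization(M, k=3, C=age)
print(W.shape, Q.shape, losses[0], losses[-1])
```

Because every factor value lies in [0, 1] and every loading lies in the answer range, each row of `Q` can be read as a pattern of answers, and the last row of `Q` holds the answer pattern associated with the confound.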

2. RELATED WORK

The extraction of latent variables (a.k.a. factors) from matrix data is often done through low-rank matrix factorizations, such as singular value decomposition (SVD), principal component analysis (PCA) and exploratory factor analysis (hereafter, just FA) (Golub & Van Loan, 2013; Bishop & Nasrabadi, 2006). While SVD and PCA aim at reconstructing the data, FA aims at explaining correlations between (questionnaire) items through latent factors (Bandalos & Boehm-Kaufman, 2010). Factor rotation (Browne, 2001; Sass & Schmitt, 2010; Schmitt & Sass, 2011) is then performed to obtain a sparser solution which is easier to interpret and analyze. For a comprehensive review of FA, see Thompson (2004); Gaskin & Happell (2014); Gorsuch (2014); Goretzko et al. (2021).
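
As a small illustration of the reconstruction-oriented methods mentioned above, the sketch below applies a truncated SVD to a centered synthetic questionnaire matrix. The data and variable names are made up for the example, and the SVD here stands in for PCA; it is not FA and involves no rotation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_participants, n_items, k = 50, 8, 2

# Synthetic answers on a 0..3 response scale (illustrative data only).
M = rng.integers(0, 4, size=(n_participants, n_items)).astype(float)
Mc = M - M.mean(axis=0)                  # center each item, as in PCA

U, s, Vt = np.linalg.svd(Mc, full_matrices=False)
factors = U[:, :k] * s[:k]               # participants x factors
loadings = Vt[:k]                        # factors x items
recon = factors @ loadings               # best rank-k approximation of Mc

# Reconstruction error is determined by the discarded singular values.
err = np.linalg.norm(Mc - recon)
print(err, np.sqrt(np.sum(s[k:] ** 2)))

# Loadings are unconstrained reals, so signs are typically mixed.
print("negative loadings:", int((loadings < 0).sum()), "of", loadings.size)
```

The factor columns obtained this way are uncorrelated, but nothing constrains the loadings to the answer range or to a single sign, which is the interpretability issue raised in the introduction.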

