INTERPRETABLE (META)FACTORIZATION OF CLINICAL QUESTIONNAIRES TO IDENTIFY GENERAL DIMENSIONS OF PSYCHOPATHOLOGY

Abstract

Psychiatry research aims to understand manifestations of psychopathology in behavior in terms of a small number of latent constructs. These are usually inferred from questionnaire data using factor analysis. The resulting factors and their relationship to the original questions are not necessarily interpretable. Furthermore, this approach does not provide a way to separate the effects of confounds from those of constructs, and requires explicit imputation of missing data. Finally, there is no clear way to integrate multiple sets of constructs estimated from different questionnaires. An important question is whether there is a universal, compact set of constructs that would span all the psychopathology issues listed across those questionnaires. We propose a new matrix factorization method designed for questionnaires, aimed at promoting interpretability through bound and sparsity constraints. We provide an optimization procedure with theoretical convergence guarantees, and validate automated methods to detect latent dimensionality on synthetic data. We first demonstrate the method on a commonly used general-purpose questionnaire. We then show that it can be used to extract a broad set of 15 psychopathology factors spanning 21 questionnaires from the Healthy Brain Network study. We show that our method preserves diagnostic information relative to competing methods, even as it imposes more constraints. Finally, we demonstrate that it can be used to define a short, general questionnaire that allows recovery of those 15 meta-factors, using data more efficiently than other methods.

1. INTRODUCTION

Standardized questionnaires are a common tool in psychiatric practice and research, for purposes ranging from screening to diagnosis or quantification of severity. A typical questionnaire comprises questions, usually referred to as items, reflecting the degree to which particular symptoms or behavioral issues are present in study participants. Items are chosen as evidence for the presence of latent constructs giving rise to the psychiatric problems observed. For many common disorders, there is a practical consensus on constructs. If so, a questionnaire may be organized so that subsets of the items can be added up to yield a subscale score quantifying the presence of their respective construct. Otherwise, the goal may be to discover constructs through factor analysis.

The factor analysis of a questionnaire matrix (#participants × #items) expresses it as the product of a factor matrix (#participants × #factors) and a loading matrix (#factors × #items). The method assumes that answers to items may be correlated, and can therefore be explained in terms of a smaller number of factors. It yields two real-valued matrices, with uncorrelated columns in the factor matrix. The number of factors needs to be specified a priori, or estimated from data. This solution is often subjected to rotation so that, after transformation, each factor has non-zero loadings on few variables and, if possible, each variable has a high loading on a single factor. The values of the factors for each participant can then be viewed as a succinct representation of them. Interpreting what construct a factor may represent is done by considering its loadings over all the items. Ideally, if very few items have a non-zero loading, it will be easy to associate the factor with them. However, in practice, the loadings could be an arbitrary linear combination of items, with positive and negative weights. Factors are real-valued, and neither their magnitude nor their sign is intrinsically meaningful.
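The matrix shapes involved can be illustrated with a truncated SVD, used here only as a simple stand-in for the factorization described above (a hedged sketch: actual factor analysis uses a different estimation procedure, followed by rotation, and the function name is ours):

```python
import numpy as np

def truncated_factorization(M, k):
    """Decompose a (participants x items) matrix into a factor matrix and a
    loading matrix with k latent factors, via truncated SVD. A stand-in for
    the factor-analysis decomposition described in the text."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    W = U[:, :k] * s[:k]   # (participants x factors) factor matrix
    L = Vt[:k, :]          # (factors x items) loading matrix
    return W, L

# A rank-2 toy "questionnaire": answers are driven by 2 latent traits.
rng = np.random.default_rng(0)
M = rng.random((100, 2)) @ rng.random((2, 10))
W, L = truncated_factorization(M, k=2)
print(np.allclose(W @ L, M))  # True: rank-2 data is reconstructed exactly
```

With real, noisy answers the product only approximates the data, and the number of factors k must be chosen, which is exactly the model-selection problem discussed later.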
Beyond this, any missing data will have to be imputed, or the respective items omitted, before factor analysis can be used. Finally, patterns in answers that are driven by other characteristics of participants (e.g. age or sex) are absorbed into the factors themselves, acting as confounders, instead of being represented separately or controlled for. Over time, many different questionnaires have been developed. Some focus on constructs relevant to particular disorders or behavioral issues; others aim at screening for a wide range of problems. One important question for psychiatry researchers is how many constructs would suffice to explain most manifestations of psychopathology. In addition to its scientific interest, an answer to this question would also be of clinical use, informing the design of lightweight questionnaires that estimate all key constructs from a minimal number of items. The availability of datasets such as Healthy Brain Network (Alexander et al., 2017), where tens of questionnaires are collected for thousands of child and adolescent participants, makes it possible to address this question in a data-driven way. However, a joint factor analysis of many questionnaires faces additional obstacles, e.g. their having different response scales, very disparate numbers of items, or patterns of missing entries. In this paper, we propose to address all of the issues above with a novel matrix factorization method specifically designed for use with questionnaire data, through the following contributions.

Contribution #1: We introduce Interpretability-Constrained Questionnaire Factorization (ICQF), a new matrix factorization method for questionnaire data. Our method was designed to incorporate characteristics that increase the interpretability of the resulting factors, based on several desiderata from active clinical researchers in psychiatry. First, factor values are constrained to be in the range [0, 1], so as to represent a degree of presence of the factor.
Second, the loadings across items for each factor have to be in the same range as answers in the original questionnaire (typically, [0, max]). This makes it possible to examine them as a pattern of answers associated with the factor. Third, the reconstructed matrix obtained by multiplying factors by factor loadings is constrained, so that no entry exceeds the range (or observed maximum value) of the original questionnaire. Fourth, the method handles missing data directly, so no imputation is required. Finally, the method supports pre-specifying some factors to model known variables, such as age or sex, to capture the answer patterns correlated with them (e.g. drinking problems appearing as age increases). We demonstrate ICQF on the Child Behavior Checklist (CBCL), a widely used questionnaire, and show that it preserves all diagnostic information in various questionnaires, even with additional regularization.

Contribution #2: We provide theoretical guarantees on the convergence and performance of the optimization procedure. We introduce an optimization procedure for ICQF, using alternating minimization with ADMM. We demonstrate that this procedure converges to a local minimum of the optimization problem. We implement blockwise cross-validation (BCV) to determine the number of factors. If this number of factors is close to that underlying the data, the solution will be close to a global minimum. Finally, we show that our procedure detects the number of factors more precisely than competing methods, as evaluated on synthetic data with different noise densities.

Contribution #3: We use a two-level meta-factorization of 21 questionnaires to identify 15 general factors of psychopathology in children and adolescents. We apply ICQF individually to 21 Healthy Brain Network questionnaires (first level), and then again to a concatenation of the resulting 21 factor matrices (second level), yielding a meta-factorization with 15 interpretable meta-factors.
We show that these meta-factors can outperform individual questionnaires in diagnostic prediction. We also show that the meta-factorization can be used to produce a short, general questionnaire, with little loss of diagnostic information, using data much more efficiently than competing methods.

2. RELATED WORK

The extraction of latent variables (a.k.a. factors) from matrix data is often done through low-rank matrix factorizations, such as singular value decomposition (SVD), principal component analysis (PCA) and exploratory factor analysis (hereafter, just FA) (Golub & Van Loan, 2013; Bishop & Nasrabadi, 2006). While SVD and PCA aim at reconstructing the data, FA aims at explaining correlations between items (questions) through latent factors (Bandalos & Boehm-Kaufman, 2010). Factor rotation (Browne, 2001; Sass & Schmitt, 2010; Schmitt & Sass, 2011) is then performed to obtain a sparser solution which is easier to interpret and analyze. For a comprehensive review of FA, see Thompson (2004); Gaskin & Happell (2014); Gorsuch (2014); Goretzko et al. (2021). Non-negative matrix factorization (NMF) was proposed as a way of identifying sparser, more interpretable latent variables, which can be added to reconstruct the data matrix. It was introduced in Paatero & Tapper (1994) and further developed in Lee & Seung (2000). Different varieties of NMF-based models have been proposed for various applications, such as sparsity-controlled (Eggert & Korner, 2004; Qian et al., 2011), manifold-regularized (Lu et al., 2012), orthogonal (Ding et al., 2006; Choi, 2008), convex/semi-convex (Ding et al., 2008), or archetypal-regularized NMF (Javadi & Montanari, 2020). Recently, Deep-NMF (Trigeorgis et al., 2016; Zhao et al., 2017) and Deep-MF (Xue et al., 2017; Fan & Cheng, 2018; Arora et al., 2019) have been introduced, which can model non-linearities on top of (non-negative) factors when the sample size is large (Fan, 2021). These methods do not directly model either the interpretability characteristics or the constraints that we view as desirable.
If the goal is to identify latent variables relevant for multiple matrices, the standard approach is multi-view learning (Sun et al., 2019), or variants that can handle only partial overlap in participants across matrices (Ding et al., 2014; Gunasekar et al., 2015; Gaynanova & Li, 2019). Finally, non-negative matrix tri-factorization (NMTF) (Li et al., 2009; Pei et al., 2015) supports an additional matrix mapping between latent representations for different matrices. Obtaining a factorization with these methods requires both specifying the number of latent variables and solving an optimization problem. In SVD/PCA, the number of variables is often selected based on the percentage of variance explained, or determined via techniques such as spectral analysis, the Laplace-PCA method, or Velicer's MAP test (Velicer, 1976; Velicer et al., 2000; Minka, 2000). For FA, several methods have been proposed: Bartlett's test (Bartlett, 1950), parallel analysis (Horn, 1965; Hayton et al., 2004), the MAP test, and comparison data (Ruscio & Roche, 2012). For NMF, iterative detection algorithms are recommended, e.g. the Bayesian information criterion (BIC) (Stoica & Selen, 2004), the cophenetic correlation coefficient (CCC) (Fogel et al., 2007) and the dispersion coefficient (Brunet et al., 2004). More recent proposals for NMF are bi-cross-validation (BiCV) (Owen & Perry, 2009) and its generalization, blockwise cross-validation (BCV) (Kanagal & Sindhwani, 2010), which we use in this paper. The optimization problem for NMF is non-convex, and different algorithms for solving it have been proposed. Multiplicative update (MU) (Lee & Seung, 2000) is the simplest and most widely used. Projected gradient algorithms such as block coordinate descent (Cichocki & Phan, 2009; Xu & Yin, 2013; Kim et al., 2014) and alternating optimization (Kim & Park, 2008; Mairal et al., 2010) aim at scalability and efficiency on larger matrices.
Given that our optimization problem has various constraints, we use a combination of alternating optimization and the Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011; Huang et al., 2016).

3.1. INTERPRETABLE CONSTRAINED QUESTIONNAIRE FACTORIZATION (ICQF)

Inputs: Our method operates on a questionnaire data matrix $M \in \mathbb{R}_{\geq 0}^{n \times m}$ with $n$ participants and $m$ questions, where entry $(i,j)$ is the answer given by participant $i$ to question $j$. Given that questionnaires often have missing data, we also have a mask matrix $\mathcal{M} \in \{0,1\}^{n \times m}$ of the same dimensionality as $M$, indicating whether each entry is available ($=1$) or not ($=0$). Optionally, we may have a confound matrix $C \in \mathbb{R}_{\geq 0}^{n \times c}$, encoding $c$ known variables for each participant that could account for correlations across questions (e.g. age or sex). If the $j$-th confound $C_{[:,j]}$ is categorical, we convert it to indicator columns for each value. If it is continuous, we first rescale it into $[0,1]$ (range in the dataset), and replace it with two new columns, $C_{[:,j]}$ and $1 - C_{[:,j]}$. This mirroring procedure ensures that both directions of the confounding variable are under consideration (e.g. answer patterns more common the younger or the older the participants are).

Optimization problem: We seek to factorize the questionnaire matrix $M$ as the product of an $n \times k$ factor matrix $W \in [0,1]^{n \times k}$, with the confound matrix $C \in [0,1]^{n \times c}$ as optional additional columns, and an $m \times (k+c)$ loading matrix $Q := [{}^{R}Q, {}^{C}Q]$, with a loading pattern ${}^{R}Q$ over the $m$ questions for each of the $k$ factors (and ${}^{C}Q$ for the optional confounds). Denoting the Hadamard product as $\odot$, our optimization problem minimizes the squared error of this factorization:

$$\underset{W \in \mathcal{W},\, Q \in \mathcal{Q},\, Z \in \mathcal{Z}}{\text{minimize}} \;\; \frac{1}{2}\,\|\mathcal{M} \odot (M - Z)\|_F^2 + \beta \cdot R(W, Q) \quad \text{such that} \quad [W, C]\,Q^T = Z, \tag{ICQF}$$

with $\mathcal{Z} = \{Z \mid \min(M) \leq Z_{ij} \leq \max(M)\}$, $\mathcal{Q} = \{Q \mid 0 \leq Q_{ij}\}$ and $\mathcal{W} = \{W \mid 0 \leq W_{ij} \leq 1\}$. The constraints keep entries of $Q$ in the same value range as question answers, so loadings are interpretable, and bound the reconstruction by the range of values in the questionnaire matrix $M$. We further regularize $W$ and $Q$ through $R(W, Q) := \|W\|_{p,q} + \gamma \|Q\|_{p,q}$, with $\gamma = \frac{n}{m \max(M)}$, where $\|A\|_{p,q} := \big( \sum_{i=1}^{m} \big( \sum_{j=1}^{n} |A_{ij}|^p \big)^{q/p} \big)^{1/q}$. Here, we use $p = q = 1$ for sparsity control.
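The confound-encoding step described above (indicator columns for categorical variables; rescaling plus mirroring for continuous ones) can be sketched as follows. This is our own minimal illustration, and the helper name is ours, not from the ICQF code:

```python
import numpy as np

def encode_confounds(values, kind):
    """Encode one confound column as described in the text.
    values: 1-D array. kind: 'categorical' or 'continuous'.
    Returns an (n x d) non-negative matrix. (Helper name is ours.)"""
    values = np.asarray(values)
    if kind == "categorical":
        # one indicator column per observed level
        levels = np.unique(values)
        return (values[:, None] == levels[None, :]).astype(float)
    # continuous: rescale to [0, 1] over the dataset, then mirror
    lo, hi = values.min(), values.max()
    c = (values - lo) / (hi - lo)
    return np.column_stack([c, 1.0 - c])

age = np.array([6.0, 10.0, 14.0, 18.0])
C_age = encode_confounds(age, "continuous")
# First column grows with age, second shrinks: both directions of the
# confound can absorb age-correlated answer patterns.
print(C_age)
```

The mirrored pair sums to 1 per participant, so "older" and "younger" answer patterns each get a non-negative column to load on.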
$\gamma$ is a heuristic to balance the sparsity control between $W$ and $Q$. With a slight abuse of notation, $\gamma$ is absorbed into the $\beta$ of $Q$ if no ambiguity results.

Choice of number of factors: For each $\beta$, we choose the number of factors $k$ using blockwise cross-validation (BCV). Given a matrix $M$, for each $k$, we shuffle the rows and columns of $M$ and subdivide it into $b_r \times b_c$ blocks. These blocks are split into 10 folds, and we repeatedly omit the blocks in a fold, factorize the remainder, impute the omitted blocks via matrix completion, and compute the error of that imputation. We choose the $k$ with the lowest average error. This procedure can adapt to the distribution of confounds $C$ by stratified splitting. We compared this with other approaches for choosing $k$, for ICQF and other methods, over synthetic data, and report the results in Appendix F.
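The block hold-out pattern behind BCV can be sketched as below: shuffle rows and columns, tile the matrix into $b_r \times b_c$ blocks, and deal the blocks into folds. This is a schematic of the splitting step only (the factorization and imputation steps are omitted), and details such as how blocks are dealt to folds are our assumptions:

```python
import numpy as np

def bcv_fold_map(n, m, b_r, b_c, n_folds=10, seed=0):
    """Blockwise-CV fold assignment: shuffle rows and columns, tile the
    n x m matrix into b_r x b_c blocks, and deal blocks into folds.
    Returns an (n x m) array of fold indices. (A sketch; the paper's
    splitting may additionally stratify by confounds.)"""
    rng = np.random.default_rng(seed)
    # block index of each row/column after shuffling
    row_blk = (rng.permutation(n) * b_r) // n
    col_blk = (rng.permutation(m) * b_c) // m
    blk_id = row_blk[:, None] * b_c + col_blk[None, :]
    # deal the b_r*b_c blocks into folds in random order
    fold_of_blk = rng.permutation(b_r * b_c) % n_folds
    return fold_of_blk[blk_id]

folds = bcv_fold_map(n=60, m=40, b_r=5, b_c=4, n_folds=10)
held_out = folds == 0   # entries omitted when evaluating fold 0
print(held_out.mean())  # 0.1: each fold holds out 2 of the 20 equal blocks
```

For each candidate k, the entries outside `held_out` would be factorized and the held-out entries imputed; the k with the lowest average imputation error across folds is selected.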

3.2. SOLVING THE OPTIMIZATION PROBLEM

Optimization procedure: The ICQF problem is non-convex and requires satisfying multiple constraints. We solve it through an ADMM optimization procedure. The augmented Lagrangian $L_\rho$ is:

$$L_\rho(W, Q, Z, \alpha_Z) = \frac{1}{2}\|\mathcal{M} \odot (M - Z)\|_F^2 + I_{\mathcal{W}}(W) + \beta\|W\|_{1,1} + I_{\mathcal{Q}}(Q) + \beta\|Q\|_{1,1} + \langle \alpha_Z,\, Z - [W, C]Q^T \rangle + \frac{\rho}{2}\left\|Z - [W, C]Q^T\right\|_F^2 + I_{\mathcal{Z}}(Z) \tag{1}$$

where $\rho$ is the penalty parameter, $\alpha_Z$ is the matrix of Lagrangian multipliers, and $I_{\mathcal{X}}(X) = 0$ if $X \in \mathcal{X}$ and $\infty$ otherwise. We alternately update the primal variables $W$, $Q$ and the auxiliary variable $Z$ by solving the following sub-problems:

$$W^{(i+1)} = \underset{W \in \mathcal{W}}{\arg\min}\; \frac{\rho}{2}\left\|Z^{(i)} - [W, C]Q^{(i),T} + \rho^{-1}\alpha_Z^{(i)}\right\|_F^2 + \beta\|W\|_{1,1} \tag{2}$$

$$Q^{(i+1)} = \underset{Q \in \mathcal{Q}}{\arg\min}\; \frac{\rho}{2}\left\|Z^{(i)} - [W^{(i+1)}, C]Q^T + \rho^{-1}\alpha_Z^{(i)}\right\|_F^2 + \beta\|Q\|_{1,1} \tag{3}$$

$$Z^{(i+1)} = \underset{Z \in \mathcal{Z}}{\arg\min}\; \frac{1}{2}\left\|\mathcal{M} \odot (M - Z)\right\|_F^2 + \frac{\rho}{2}\left\|Z - [W^{(i+1)}, C]Q^{(i+1),T} + \rho^{-1}\alpha_Z^{(i)}\right\|_F^2 \tag{4}$$

for some penalty parameter $\rho$. Lastly, $\alpha_Z$ is updated via

$$\alpha_Z^{(i+1)} \leftarrow \alpha_Z^{(i)} + \rho\left(Z^{(i+1)} - [W^{(i+1)}, C]Q^{(i+1),T}\right). \tag{5}$$

Equations 2 and 3 can be further split into row-wise constrained Lasso problems, and equation 4 has a closed-form solution. The optimization details are further discussed in Appendix A. Given the flexibility of ADMM, a similar procedure can also be used with other regularizations.

Convergence of the optimization procedure: In Appendix B, we provide a proof that the constraint $\rho \geq \sqrt{2}$ on the penalty parameter $\rho$ guarantees monotonicity of the optimization procedure, and that it will converge to a local minimum. Integrating this constraint with the adaptive selection of $\rho$ (Xu et al., 2017), we obtain an efficient optimization for ICQF. Furthermore, Bjorck et al. (2021) showed that, if $k = k^*$ for a ground-truth solution $(W^*, Q^*)$ in non-negative matrix factorization, the error $\|M - WQ^T\|_F^2$ is star-convex towards $(W^*, Q^*)$, and the solution is close to a global minimum. In Appendix C, we show that, if $k \neq k^*$, the relative error between $W^*$ and $W$ increases with $|k/k^* - 1|$.
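The Z sub-problem is separable over entries, so its closed form can be derived by setting the entrywise gradient to zero and projecting onto the box constraint. The derivation and code below are our own sketch of that step (setting $\mathcal{M}_{ij}(Z_{ij}-M_{ij}) + \rho(Z_{ij} - A_{ij} + \alpha_{ij}/\rho) = 0$ with $A = [W, C]Q^T$ gives $Z_{ij} = (\mathcal{M}_{ij}M_{ij} + \rho A_{ij} - \alpha_{ij})/(\mathcal{M}_{ij} + \rho)$, then clip):

```python
import numpy as np

def z_update(M, mask, A, alpha, rho, lo, hi):
    """Closed-form solution of the Z sub-problem (our derivation):
    minimize 1/2 ||mask*(M - Z)||_F^2 + rho/2 ||Z - A + alpha/rho||_F^2
    entrywise, then project onto the box [lo, hi]."""
    Z = (mask * M + rho * A - alpha) / (mask + rho)
    return np.clip(Z, lo, hi)

rng = np.random.default_rng(1)
M = rng.integers(0, 3, size=(5, 4)).astype(float)     # 0-2 Likert answers
mask = rng.integers(0, 2, size=(5, 4)).astype(float)  # missing-data mask
A = rng.random((5, 4)) * 2                            # current [W, C]Q^T
alpha = np.zeros((5, 4))                              # multipliers
Z = z_update(M, mask, A, alpha, rho=2.0, lo=0.0, hi=2.0)
# Where an answer is missing (mask=0) and alpha=0, Z simply clips A:
print(np.allclose(Z[mask == 0], np.clip(A, 0.0, 2.0)[mask == 0]))  # True
```

Missing entries are thus pulled only toward the current reconstruction, which is how the method avoids explicit imputation.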
Inaccurate estimation of k * thus affects both the interpretability of (W , Q) and the convergence to global minima. As reported in Appendix F, BCV is more robust to noise when estimating k than other alternatives, and this is why we use it.

3.3. META-FACTORIZATION

ICQF produces interpretable factors for individual questionnaires. As discussed earlier, our second goal is to obtain interpretable factors that explain psychopathology across a range of questionnaires. ICQF can also be used to obtain these meta-factors, through a two-level factorization: factorize each individual questionnaire, concatenate the resulting factor matrices, and then factorize this concatenation. The main obstacle is that each participant may only have answered a subset of the questionnaires available. This is the second reason for including a mask matrix $\mathcal{M}$ in our problem formulation. In describing our meta-factorization procedure, we suppress $C$ and the regularization terms $R$ to simplify the discussion.

Let $\{M_i\}_{i=1}^S$ be the data matrices of $S$ questionnaires, with dimensions $\{(n_i, m_i)\}_{i=1}^S$. Note that the participant sets of the $\{M_i\}_{i=1}^S$ can fully, partially, or not overlap with each other. For dimension consistency, we extend each $M_i$ (and likewise $W_i$, $\mathcal{M}_i$) to $n$ rows by padding rows of zeros for missing participants, where $n \geq n_i \;\forall i = 1, \ldots, S$ denotes the number of unique participants across the $S$ questionnaires. We then introduce mask matrices $E_i := D_i \cdot \mathcal{M}_i \in \{0,1\}^{n \times m_i}$, composed of the extended $\mathcal{M}_i$ and a diagonal mask matrix $D_i \in \{0,1\}^{n \times n}$ indicating the availability of participants. Performing matrix factorization for each questionnaire, we obtain $E_i \odot M_i \approx E_i \odot (W_i Q_i^T)$ for $i = 1, \ldots, S$. We then concatenate the $\{W_i\}_{i=1}^S$ and perform a second-level factorization:

$$[D_1 W_1, \cdots, D_S W_S] \approx [D_1 \mathbf{1}_1, \ldots, D_S \mathbf{1}_S] \odot \mathcal{W}\mathcal{Q}^T, \tag{6}$$

where $\mathbf{1}_i$ is an all-ones matrix of dimension $n \times k_i$, with $k_i$ denoting the number of factors in $W_i$, for $i = 1, \ldots, S$. The columns of $\mathcal{W}$ are the meta-factors. There are alternative approaches for factorizing multiple questionnaires. The most obvious would be to factorize the concatenation of all the $\{M_i\}_{i=1}^S$ as $[E_1 \odot M_1, \ldots, E_S \odot M_S] \approx WQ^T$.
This requires a wider detection range for the best $k$ ($\{1, \ldots, \sum_i m_i\}$); it is what we use for the competing methods in our experiments. Moreover, any low-rank matrix completion algorithm could be used for meta-factorization. However, constraining $W$ and $Z$ for each questionnaire is crucial; as discussed in Maisog et al. (2021), and witnessed by us in practice, simple normalization before estimating $k$ for the meta-factor matrix $\mathcal{W}$ may induce unpredictable effects. We could also optimize a meta-objective function: $\frac{1}{2}\sum_{i=1}^{S} \alpha_i \|E_i \odot (M_i - W Q_i^T)\|_F^2$. This has a smaller range of $k$ in practice, but introduces extra hyper-parameters $\alpha_i$ (the relative importance of data matrix $M_i$). Finally, we could use tri-factorization: $[E_1 \odot M_1, \ldots, E_S \odot M_S] \approx WGQ^T$. This did not work well in our case.
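The padding step that aligns first-level factor matrices over partially overlapping participant sets can be sketched as follows (our own minimal illustration; the function name and participant IDs are ours):

```python
import numpy as np

def pad_first_level(W_i, participants_i, all_participants):
    """Extend a first-level factor matrix W_i (n_i x k_i) to the full
    roster of n unique participants, padding zero rows for those who did
    not answer questionnaire i, and return the row-availability indicator
    (the diagonal of D_i in the text)."""
    index = {p: r for r, p in enumerate(all_participants)}
    n, k_i = len(all_participants), W_i.shape[1]
    W_ext = np.zeros((n, k_i))
    avail = np.zeros(n, dtype=bool)
    for row, p in enumerate(participants_i):
        W_ext[index[p]] = W_i[row]
        avail[index[p]] = True
    return W_ext, avail

# Two questionnaires with partially overlapping participants:
all_p = ["a", "b", "c", "d"]
W1, av1 = pad_first_level(np.ones((3, 2)), ["a", "b", "c"], all_p)
W2, av2 = pad_first_level(np.ones((2, 3)), ["c", "d"], all_p)
concat = np.hstack([W1, W2])  # input to the second-level factorization
print(concat.shape)           # (4, 5): 4 participants, 2 + 3 factors
```

The `avail` vectors play the role of the diagonal masks $D_i$: during the second-level factorization, rows padded with zeros are masked out rather than treated as genuine zero factor scores.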

4. DATA

The Healthy Brain Network (HBN) (Alexander et al., 2017) is an ongoing project to create a biobank from New York City area children and adolescents. Data are publicly available, and include psychiatric, behavioral, cognitive, multimodal brain imaging, and genetic measures. In this work, we use a subset of 21 psychiatric questionnaires about behavioral and emotional problems. They were selected by domain experts based on their focus on psychopathology, frequency of use in clinical research, and completeness in the HBN dataset. This subset contains general-purpose questionnaires covering different domains of psychopathology (e.g. CBCL, SDQ and SympChck) and others focusing on specific disorders (e.g. ASSQ for autism screening, SWAN for ADHD, and SCARED for anxiety). The full list of 21 questionnaires is reported in Table 4 in Appendix G. Across all questionnaires, we have 978 questions and 3572 unique participants. Finally, we have the age and sex at birth of each participant, which will be used as confounds, and diagnostic labels for 11 conditions, if applicable.

5.1. ICQF FACTORIZATION OF THE CHILD BEHAVIOR CHECKLIST

We begin with a qualitative assessment of ICQF applied to the 2001 Child Behavior Checklist (CBCL), which is designed to detect behavioral issues. The checklist includes 113 questions, grouped into 8 syndrome subscales: Aggressive, Anxiety/Depressed, Attention, Rule Break, Social, Somatic, Thought, and Withdrawn problems. Answers are scored on a three-point Likert scale (0=absent, 1=occurs sometimes, 2=occurs often), and the time frame for the responses is the past 6 months. We estimated the latent dimensionality $k = 8$ using BCV to compute a test error for ICQF at each possible $k$. The regularization parameter $\beta = 0.5$ was set the same way (see the bottom-left panel of Figure 1). The top panel shows the heat map of $Q := [{}^{R}Q, {}^{C}Q]$: the loadings over questions ${}^{R}Q$ for the latent factors $W$, and the loadings ${}^{C}Q$ for the confounds $C$. Questions are grouped by syndrome subscale. While there were factors that loaded primarily on questions from one subscale, as expected, we were surprised to find others that grouped questions from multiple subscales. These were deemed sensible co-occurrences by our clinical collaborators. We show the top 10 questions, ranked by magnitude of loading, for the first factor $Q_{[:,1]}$ from ${}^{R}Q$, as a demonstration of how one might interpret the factor (bottom-right panel of Figure 1). As a further sanity check, we inspected the loadings of the confound Old (increasing age) and observed that they covered issues such as "Argues", "Act Young", "Swears" and "Alcohol". The loadings of $Q$ also reveal the relative importance among questions in each estimated factor; subscales deem all questions equally important.
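The per-factor question ranking used above (top 10 questions by magnitude of loading) is straightforward to express in code; this is our own illustration with toy question labels:

```python
import numpy as np

def top_questions(Q, factor, questions, t=10):
    """Return the top-t questions for one factor, ranked by the magnitude
    of their loadings in Q (questions x factors)."""
    order = np.argsort(-np.abs(Q[:, factor]))[:t]
    return [(questions[j], float(Q[j, factor])) for j in order]

questions = [f"q{j}" for j in range(6)]
Q = np.array([[0.1], [0.9], [0.0], [0.5], [0.7], [0.2]])
print(top_questions(Q, factor=0, questions=questions, t=3))
# -> [('q1', 0.9), ('q4', 0.7), ('q3', 0.5)]
```

Because ICQF loadings are non-negative and share the answer scale, the ranked loadings can be read directly as the answer pattern most associated with the factor.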

5.2. META-ICQF FOR META-FACTORIZATION OF THE HBN QUESTIONNAIRES

This section provides a qualitative evaluation of meta-ICQF, analogous to that of ICQF in Section 5.1. As described earlier, the meta-factorization requires a first-level ICQF of each questionnaire in HBN, yielding factor matrices $W_i$ and loading matrices $Q_i$. For the second-level factorization, we concatenate the $\{W_i\}_{i=1}^S$ and use ICQF to get meta-factors $\mathcal{W}$ and respective loadings $\mathcal{Q}$ over first-level factors, as described in equation 6 (note the change in font for these). Figure 2 (left) shows the lower triangular part of the correlation matrix of $\mathcal{Q}^T$. The first-level factors from all questionnaires are grouped through agglomerative clustering (as many clusters as meta-factors, $k$) on their meta-factor loadings. The sparse, block-diagonal pattern and the diversified factor origins within each block demonstrate how meta-factorization can combine related latent factors from multiple questionnaires. Figure 2 (bottom-right) shows the trend of validation errors with different $(k, \beta)$ using the BCV detection scheme. The optimal inflection point is $k = 15$ and $\beta = 0.1$. Finally, we can back-propagate $\mathcal{Q}$ from factor level to question level by multiplying $\mathrm{diag}[Q_1, \ldots, Q_S] \cdot \mathcal{Q} =: \widetilde{Q}$. These loadings retrieve each question's latent representation in the meta-factor space; the magnitude of each question's entry in each column of $\widetilde{Q}$ reveals its influence on the corresponding meta-factor. Figure 2 (top-right) shows the top 10 questions of the first column of $\widetilde{Q}$, ranked by their magnitude, with their questionnaire of origin. This meta-factor reflects attention issues. Similar plots for all 15 meta-factors are reported in Appendix I. There is strong topic coherence among the top-ranked questions in each meta-factor.
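The back-propagation of meta-factor loadings to question level multiplies the block-diagonal concatenation of first-level loadings by the meta-factor loadings; a minimal sketch (function name ours, toy sizes):

```python
import numpy as np

def backpropagate_loadings(Q_list, Q_meta):
    """Map meta-factor loadings back to question level by forming
    diag[Q_1, ..., Q_S] and multiplying by Q_meta, where each Q_i is
    (m_i x k_i) and Q_meta is (sum_i k_i) x k over the concatenated
    first-level factors."""
    m_tot = sum(Q.shape[0] for Q in Q_list)
    k_tot = sum(Q.shape[1] for Q in Q_list)
    B = np.zeros((m_tot, k_tot))   # block-diagonal of the Q_i
    r = c = 0
    for Qi in Q_list:
        mi, ki = Qi.shape
        B[r:r + mi, c:c + ki] = Qi
        r, c = r + mi, c + ki
    return B @ Q_meta

Q1 = np.ones((4, 2))   # questionnaire 1: 4 questions, 2 factors
Q2 = np.ones((3, 1))   # questionnaire 2: 3 questions, 1 factor
Q_meta = np.eye(3)     # 3 first-level factors -> 3 meta-factors
Qq = backpropagate_loadings([Q1, Q2], Q_meta)
print(Qq.shape)        # (7, 3): every question in meta-factor space
```

Each row of the result is one question's representation in meta-factor space, so ranking a column by magnitude recovers the per-meta-factor question lists shown in Figure 2.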
The meta-factors have been deemed interpretable and clinically plausible presentations by our psychiatry collaborators.

5.3. DIAGNOSTIC CLASSIFICATION

Given the absence of ground-truth factorizations for participants in the HBN study, it is challenging to carry out a quantitative evaluation of ICQF versus other factorization methods or subscales. In this section, we report on two different experiments based on predicting diagnostic labels for each participant from factor scores. The first tests whether the factor matrices $W$ preserve the necessary information for this, when applied to general-purpose questionnaires or a combination of every HBN questionnaire. The second tests whether the question ranking induced by $Q$, across all questionnaires in HBN, selects the most informative questions for each factor.

5.3.1. EXPERIMENTAL SETUP

Baseline methods: Our first baseline method is $\ell_1$-regularized NMF ($\ell_1$-NMF) (Cichocki & Phan, 2009), as it also imposes non-negativity and sparsity constraints. As constructs (or questions) can be correlated, we rule out other NMF methods with orthogonality constraints. FA with promax rotation (FA-promax) (Hendrickson & White, 1964), using minimum residual as the estimation method, is included because it is commonly used in analyzing questionnaires. Syndrome subscales are included if available for a questionnaire, since they are often used for diagnosis. Finally, we include raw questionnaire answers, as they have all the information available. To estimate the number of factors $k$, we use BCV for $\ell_1$-NMF and ICQF, and parallel analysis for FA. The choice was driven by the experiments on synthetic questionnaire data reported in Appendix F.

Questionnaires: The two experiments are motivated by the routine use of general-purpose questionnaires in our dataset, namely CBCL, SDQ (Symptoms and Difficulties) and SympChck (Symptom Check), to screen and refer patients to pediatric psychiatry clinics for a variety of diagnoses (Heflinger et al., 2000; Biederman et al., 2005; 2020). The referral is based either on raw answers to the questionnaire or on syndrome-specific subscales derived from them. Beyond this, and given that we have 21 questionnaires from HBN, we carried out experiments on factors derived from all of them. The factors are obtained using the meta-factorization described in Section 3.3 (meta-ICQF), by concatenating the questionnaires and factorizing the result ($\ell_1$-NMF, FA-promax), or simply by using the concatenation (raw), possibly aggregated (by subscales, if defined, or all added otherwise).

Dataset splits: We use a similar evaluation procedure in both experiments.
We group the 21 HBN questionnaires, and split participants into train, validation, and test sets with ratio 70/15/15, based on participant availability across questionnaires and the distribution of confounds and diagnostic labels. This ensures a similar data distribution in the three sets, as shown in Figure 5 in Appendix E, where more details are provided. We resample 50 dataset splits using different seeds, and carry out both experiments in each split. The results reported are the average across all splits.

Model training and inference: Let $W_i^{\text{set}}$ denote the participant factor matrix in ICQF or NMF, or the factor score in FA. The subscript $i$ is the questionnaire index, and is dropped if considering only one. The superscript denotes the set. Similarly, let $Q_i$ denote the question loadings for each method. Model training will yield a $(W^{\text{train}}, Q)$ for participants in the training set. Inference with the model will produce $W^{\text{validate}}$ and $W^{\text{test}}$ in the validation and test sets, using the trained $Q$ and confounds $C^{\text{validate}}$, $C^{\text{test}}$ (if applicable); see Figure 4. For meta-ICQF, inference is first carried out for each questionnaire, yielding $\{W_i\}_{i=1}^S$; this then gives $\mathcal{W}^{\text{validate}}$ and $\mathcal{W}^{\text{test}}$. While meta-factorization is possible for NMF or FA, we do not use it, as results were the same as or worse than factorizing concatenated questionnaires.

5.3.2. DIAGNOSTIC PREDICTION FROM FACTORS

For each one of the 11 diagnostic labels, we train a logistic regression model with $\ell_2$ regularization and balanced class weights on $W^{\text{train}}$ ($\mathcal{W}^{\text{train}}$ for meta-factors). The regularization strength is tuned using $W^{\text{validate}}$ ($\mathcal{W}^{\text{validate}}$). Prediction assessment is conducted on $W^{\text{test}}$ ($\mathcal{W}^{\text{test}}$) using the ROC-AUC metric (Krzanowski & Hand, 2009). The ROC-AUC of each setting is then averaged across all random dataset splits. As results were obtained over the same datasets, for 11 different classification problems, we use a Friedman test with significance level $\alpha = 0.05$ (followed by a post-hoc Nemenyi test if the null hypothesis of the Friedman test is rejected), following Demšar (2006). Table 1 shows the summary of AUCs obtained on the various questionnaires using different factorizations, averaged across all 11 diagnostic labels. HBN corresponds to the use of all 21 questionnaires, as described earlier. Results for each problem are provided in Table 5 in Appendix J. In CBCL, the null hypothesis is rejected, and the post-hoc Nemenyi test indicates that subscales are significantly worse than all other factorizations. The null is not rejected for SDQ or SympChck. In the HBN setting, the null hypothesis is rejected, and the Nemenyi test indicates a significant difference between the group (meta-ICQF, raw) and the group ($\ell_1$-NMF, subscales). Overall, we conclude that ICQF and meta-ICQF preserve diagnostic information, in spite of additional regularization and constraints versus other methods. Human-defined subscales are slightly but significantly worse. A parallel experiment on another CBCL dataset, from a different population, is reported in Appendix K.
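The ROC-AUC used above can be computed via the Mann-Whitney rank identity: it is the probability that a randomly chosen positive participant is scored above a randomly chosen negative one, with ties counted as 1/2. A minimal numpy sketch (any standard ROC-AUC implementation computes the same quantity):

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney) identity: the probability
    that a random positive is scored above a random negative, counting
    ties as 1/2."""
    y_true = np.asarray(y_true, dtype=bool)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0, 0, 1, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
print(roc_auc(y, s))  # 5/6: one positive-negative pair is misordered
```

Because the metric depends only on the ranking of scores, it is insensitive to the class imbalance that is common in diagnostic labels.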

5.3.3. DIAGNOSTIC PREDICTION FROM SYNTHESIZED QUESTIONNAIRES

Our goal is to assess the degree to which the process of extracting factors from all of the HBN questionnaires yields a $Q$ whose loadings identify informative questions for every domain of psychopathology. If so, we should be able to parlay those loadings into general-purpose questionnaires defined in a purely data-driven way, using the most informative subset of questions. We operationalize this as follows. We first perform factorization in the training set, as described for the previous prediction problem. For each column $Q_{[:,i]}$, we rank questions according to the absolute magnitude of their loadings. By grouping the top $t$ questions from each of the $k$ (meta-)factors, we can derive a new questionnaire, which we call the $t$-questionnaire. The $t$-questionnaire inherits the ranking of questions and, ideally, preserves the key information for diagnostic prediction. The experimental setup parallels Section 5.3.2, but uses either the 21 components of the $t$-questionnaire, for meta-ICQF, or their concatenation, otherwise. For the latter, we trained $\ell_1$-NMF and FA-promax, either selecting the number of factors from data as before, or setting it to that of meta-ICQF ($k = 15$). We also trained ICQF directly on the concatenation (estimated $k = 22$). We then trained an $\ell_2$-regularized logistic regression, with the regularization parameter set over validation data, and evaluated its performance on the test set. This procedure was carried out for $t$ up to 40 (or the number of available questions, if smaller). Figure 3 shows the trend of prediction performance for Depression with increasing $t$. The red dotted line is the average performance without eliminating any questions (same as HBN in Table 5). The trends for the other 10 diagnostic predictions are reported in Appendix L. They are broadly similar in relative terms across the methods (except for suspected ASD). Across the range of $t$, meta-ICQF and ICQF had substantially higher AUCs than other methods, especially when $t$ was small.
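The t-questionnaire construction (union of the top-t questions per factor) can be sketched as below; this is our own illustration with a random loading matrix standing in for the real one:

```python
import numpy as np

def t_questionnaire(Q, t):
    """Select the union of the top-t questions per (meta-)factor, ranked
    by absolute loading in Q (questions x factors). Returns the sorted
    question indices of the synthesized t-questionnaire."""
    selected = set()
    for f in range(Q.shape[1]):
        top = np.argsort(-np.abs(Q[:, f]))[:t]
        selected.update(top.tolist())
    return sorted(selected)

rng = np.random.default_rng(0)
Q = rng.random((100, 15))   # toy stand-in: 100 questions, 15 meta-factors
items = t_questionnaire(Q, t=3)
print(len(items))           # at most k*t = 45; overlaps across factors shrink it
```

A question highly loaded on several factors is selected only once, so the synthesized questionnaire can be shorter than k·t items while still covering every factor.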
This suggests that meta-ICQF is effective at determining the relative importance of ∼ 1000 questions from 21 questionnaires, as well as grouping them into interpretable meta-factors.
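As a sketch of the question-selection step described above (hypothetical loading values; in the paper the loading matrix comes from the meta-ICQF fit):

```python
import numpy as np

def t_questionnaire(Q, t):
    """Union of the top-t questions per factor, ranked by |loading|.

    Q : (n_questions, k) loading matrix; returns sorted question indices.
    """
    selected = set()
    for i in range(Q.shape[1]):
        top = np.argsort(-np.abs(Q[:, i]))[:t]   # rank by absolute loading
        selected.update(int(j) for j in top)
    return sorted(selected)

rng = np.random.default_rng(0)
Q = rng.random((1000, 15))            # ~1000 HBN questions, 15 meta-factors
idx = t_questionnaire(Q, t=10)
assert len(idx) <= 10 * 15            # at most t questions per meta-factor
```

A question loading highly on several meta-factors is selected only once, so the resulting t-questionnaire can be shorter than t × k items.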

6. DISCUSSION

In this paper, we have introduced ICQF, a method for non-negative matrix factorization with additional constraints to further enhance the interpretability of factors, and the capability to directly handle confounds. We have demonstrated ICQF on a widely used questionnaire, and showed that interpretability does not affect our ability to make diagnostic predictions from factors. We also showed that ICQF can be used for two-level meta-factorizations of sets of questionnaires. This allowed us to identify 15 meta-factors of psychopathology that coherently group questions from many questionnaires, and correspond to clinical presentations of patients. Furthermore, we showed that the resulting meta-factorization induces a ranking of the most informative questions for each questionnaire. This makes it possible to generate a minimal general-purpose questionnaire for estimating the 15 meta-factors, while maintaining a specified level of diagnostic prediction performance. In the future, we plan to use the 15 meta-factors as latent variables for studying the structural and functional brain imaging data available in HBN. We also plan on releasing the ICQF code for community use, and carrying out clinical validations of the t-questionnaires generated from the meta-factorization.

Following the ADMM approach, we alternately update the primal variables W and Q and the auxiliary variable Z, instead of updating them jointly. In particular, we iteratively solve the following sub-problems:

W^(i+1) = argmin_{W ∈ 𝒲} (ρ/2) ∥Z^(i) − [W, C]Q^(i),T + (1/ρ)α_Z^(i)∥_F² + β∥W∥_{1,1}   (Sub-problem 1)

Q^(i+1) = argmin_{Q ∈ 𝒬} (ρ/2) ∥Z^(i) − [W^(i+1), C]Q^T + (1/ρ)α_Z^(i)∥_F² + β∥Q∥_{1,1}   (Sub-problem 2)

Z^(i+1) = argmin_{Z ∈ 𝒵} (1/2) ∥ℳ ⊙ (M − Z)∥_F² + (ρ/2) ∥Z − [W^(i+1), C]Q^(i+1),T + (1/ρ)α_Z^(i)∥_F²   (Sub-problem 3)

for some penalty parameter ρ, where ⊙ denotes the Hadamard product and ℳ the missing-data mask.
The vector of Lagrange multipliers α_Z is updated via

α_Z^(i+1) ← α_Z^(i) + ρ(Z^(i+1) − [W^(i+1), C](Q^(i+1))^T)

SUB-PROBLEMS 1 AND 2 (EQUATIONS 2 AND 3)

Note that equation 2 (and similarly equation 3, by taking the transpose) splits into row-wise constrained Lasso problems. Specifically, the r-th row problem simplifies to

x* = argmin_{0 ≤ x_i ≤ 1} (ρ/2) ∥b − Ax∥_2² + β∥x∥_1,  where A = Q^(i), b = (Z^(i) − CQ^(i),T + (1/ρ)α_Z^(i))_[r,:]   (8)

Here we use the Matlab notation •_[r,:] for the row-extraction operation. As suggested in Gaines et al. (2018), one can use ADMM to solve equation 8:

x^(i+1) = argmin_x (ρ/2) ∥b − Ax∥_2² + (τ/2) ∥x − y^(i) + (1/τ)µ^(i)∥_2² + β∥x∥_1   (9)
y^(i+1) = Proj_[0,1](x^(i+1) + (1/τ)µ^(i))   (10)
µ^(i+1) ← µ^(i) + τ(x^(i+1) − y^(i+1))   (11)

Similarly, µ is the vector of Lagrange multipliers and τ is the penalty parameter; Proj_[0,1] is the orthogonal projection onto [0, 1] (inherited from the box constraints on W). Equation 9 can be solved via the well-established FISTA algorithm (Beck & Teboulle, 2009). Consider the optimization problem

argmin_x λ∥x∥_1 + (1/2) f(x)   (12)

The FISTA algorithm for solving equation 12 is summarized as follows:

Algorithm 1: FISTA for equation 12
Input: L, the Lipschitz constant of ∇f
Initialize: δ = 1e-6; x_{−1} = x̂_0 = 0; t_0 = 1
Result: solution x of equation 12
while ∥x_i − x_{i−1}∥_2 > δ:
    x_{i+1} = argmin_z (λ/L) ∥z∥_1 + (1/2) ∥z − (x̂_i − (1/L)∇f(x̂_i))∥_2²
    t_{i+1} = (1 + √(1 + 4t_i²)) / 2
    x̂_{i+1} = x_{i+1} + ((t_i − 1)/t_{i+1}) (x_{i+1} − x_i)

To solve equation 9 with the FISTA algorithm, using the notation introduced in equation 8, we take

f(x) = ρ∥b − Ax∥_2² + τ∥x − y^(i) + (1/τ)µ^(i)∥_2²   (13)

To compute L, the Lipschitz constant of ∇f, note that

∇f(x) = 2ρA^T(Ax − b) + 2τ(x − c) = 2(ρA^TA + τI)x − 2(ρA^Tb + τc)   (14)

where c = y^(i) − (1/τ)µ^(i). Thus, L is just the largest eigenvalue of 2(ρA^TA + τI). As noted in Huang et al. (2016), ADMM provides the flexibility to use various types of loss functions and regularizers without changing the procedure. For example, replacing the ℓ_{1,1} penalty with the L_{2,1} norm turns equation 8 into a constrained ridge-regression problem, which can be solved efficiently by non-negative quadratic programming algorithms. For most clinical usage, the size of questionnaire data is manageable on a single machine. However, if optimal computational and memory efficiency is required, stochastic optimization approaches such as Mairal et al. (2010) can replace the ADMM procedure. This, however, requires an unbiased sampling scheme for generating random batches that handles missing responses, which is non-trivial to obtain, especially in the multi-questionnaire scenario.
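For concreteness, a minimal numpy sketch of the FISTA inner solver for equation 9 (soft-thresholding proximal step; variable names follow the text, but the implementation details are ours):

```python
import numpy as np

def soft_threshold(v, thr):
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def fista_lasso(A, b, c, rho, tau, beta, n_iter=500):
    """FISTA for min_x (rho/2)||b - Ax||^2 + (tau/2)||x - c||^2 + beta*||x||_1
    (the x-update of equation 9, with c = y - mu/tau)."""
    # Lipschitz constant of the smooth part's gradient: lambda_max(rho*A^T A + tau*I)
    L = np.linalg.eigvalsh(rho * A.T @ A + tau * np.eye(A.shape[1])).max()
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0                                 # momentum iterate, scalar
    for _ in range(n_iter):
        grad = rho * A.T @ (A @ z - b) + tau * (z - c)
        x_new = soft_threshold(z - grad / L, beta / L)   # proximal step
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = x_new + (t - 1) / t_new * (x_new - x)        # momentum update
        x, t = x_new, t_new
    return x

# with A = I and tau = 0, the solution is the soft-thresholding of b
x = fista_lasso(np.eye(3), np.array([2.0, -0.5, 1.0]), np.zeros(3),
                rho=1.0, tau=0.0, beta=1.0)
assert np.allclose(x, [1.0, 0.0, 0.0], atol=1e-6)
```

The projection onto [0, 1] is not performed here; in the text it happens in the outer ADMM y-update (equation 10).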

SUB-PROBLEM 3 (EQUATION 4)

Since both terms in equation 4 are Frobenius norms, Z can be optimized entry-wise. In particular, we have the following closed-form solution for Z^(i+1):

Z^(i+1) = Proj_[min(M), max(M)] [ (ℳ ⊙ M + ρ[W^(i+1), C](Q^(i+1))^T − α_Z^(i)) ⊘ (ρ1 + ℳ) ]   (15)

where ℳ is the missing-data mask, 1 is the all-ones matrix of appropriate dimension and ⊘ denotes Hadamard (entry-wise) division.
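A numpy sketch of this entry-wise update, assuming the mask is a 0/1 indicator of observed entries (function and variable names are ours):

```python
import numpy as np

def update_Z(M, mask, WC, Q, alpha, rho):
    """Closed-form solution of sub-problem 3 (equation 15).

    mask : 0/1 indicator of observed entries; WC = [W, C]; alpha = alpha_Z."""
    R = WC @ Q.T
    Z = (mask * M + rho * R - alpha) / (rho + mask)   # Hadamard division
    return np.clip(Z, M.min(), M.max())               # Proj onto [min(M), max(M)]

# single observed entry: the unclipped solution zeroes the gradient of
# 0.5*(m - z)^2 + (rho/2)*(z - r + a/rho)^2
m, r, a, rho = 0.4, 0.6, 0.1, 3.0
z = (m + rho * r - a) / (rho + 1)
assert abs((z - m) + rho * (z - r) + a) < 1e-12
```

Because the objective is separable across entries, clipping the unconstrained minimizer to the box gives the box-constrained minimizer.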

B NON-INCREASING PROPERTY OF THE OPTIMIZATION ALGORITHM

In the following, we provide a self-contained convergence proof and show that, under an appropriate choice of the penalty parameter ρ, the ADMM optimization scheme discussed in Section 3.2 converges to a local minimum. To simplify notation, we write V^(i,j,k) = {W^(i), Q^(j), Z^(k)} for the tuple of variables W, Q and Z at iterations (i), (j) and (k) respectively; when i = j = k we abbreviate it as V^(i). We also write R^(i) = [W^(i), C](Q^(i))^T and, for matrices A, B of compatible dimensions, ⟨A, B⟩ = Trace(A^T B). As before, ℳ denotes the missing-data mask.

We show that the Lagrangian is non-increasing across iterations. Consider the difference of the Lagrangian between consecutive iterations:

L_ρ(V^(i+1), α_Z^(i+1)) − L_ρ(V^(i), α_Z^(i)) = [L_ρ(V^(i+1), α_Z^(i+1)) − L_ρ(V^(i+1), α_Z^(i))]  (I)  +  [L_ρ(V^(i+1), α_Z^(i)) − L_ρ(V^(i), α_Z^(i))]  (II)   (16)

Expanding term (I), and using the multiplier update α_Z^(i+1) = α_Z^(i) + ρ(Z^(i+1) − R^(i+1)):

(I) = ⟨α_Z^(i+1) − α_Z^(i), Z^(i+1) − R^(i+1)⟩ = (1/ρ) ∥α_Z^(i+1) − α_Z^(i)∥_F²   (17)

Term (II) decomposes according to the order in which the three blocks are updated:

(II) = [L_ρ(V^(i+1), α_Z^(i)) − L_ρ(V^(i+1,i+1,i), α_Z^(i))]  (A)
     + [L_ρ(V^(i+1,i+1,i), α_Z^(i)) − L_ρ(V^(i+1,i,i), α_Z^(i))]  (B)
     + [L_ρ(V^(i+1,i,i), α_Z^(i)) − L_ρ(V^(i), α_Z^(i))]  (C)   (18)

For (A), only Z changes, with W^(i+1), Q^(i+1) (and hence R^(i+1)) fixed. Writing g(Z) = (1/2)∥ℳ ⊙ (M − Z)∥_F² + (ρ/2)∥Z − R^(i+1) + (1/ρ)α_Z^(i)∥_F², we have (A) = g(Z^(i+1)) − g(Z^(i)). Since g is quadratic, an exact expansion around Z^(i+1) gives

g(Z^(i)) = g(Z^(i+1)) + ⟨∇g(Z^(i+1)), Z^(i) − Z^(i+1)⟩ + (1/2)∥ℳ ⊙ (Z^(i+1) − Z^(i))∥_F² + (ρ/2)∥Z^(i+1) − Z^(i)∥_F²

and, since Z^(i+1) is the minimizer of equation 4 over the convex set 𝒵 containing Z^(i), first-order optimality yields ⟨∇g(Z^(i+1)), Z^(i) − Z^(i+1)⟩ ≥ 0. Therefore

(A) ≤ −(1/2)∥ℳ ⊙ (Z^(i+1) − Z^(i))∥_F² − (ρ/2)∥Z^(i+1) − Z^(i)∥_F² ≤ −(ρ/2)∥Z^(i+1) − Z^(i)∥_F²   (19)

For the second term (B), by definition,

(B) = (ρ/2)∥Z^(i) − R^(i+1) + (1/ρ)α_Z^(i)∥_F² − (ρ/2)∥Z^(i) − [W^(i+1), C]Q^(i),T + (1/ρ)α_Z^(i)∥_F² + β∥Q^(i+1)∥_{1,1} − β∥Q^(i)∥_{1,1}
  = ρ⟨R^(i+1) − Z^(i) − (1/ρ)α_Z^(i), [W^(i+1), C](Q^(i+1),T − Q^(i),T)⟩ − (ρ/2)∥[W^(i+1), C](Q^(i+1),T − Q^(i),T)∥_F² + β(∥Q^(i+1)∥_{1,1} − ∥Q^(i)∥_{1,1})

We recall that Q is updated by solving a constrained Lasso problem for every row Q_[r,:]:

y = argmin_{x ≥ 0} β∥x∥_1 + (ρ/2)∥b − Ax∥_2²,  where A = [W^(i+1), C], b = (Z^(i) + (1/ρ)α_Z^(i))_[r,:]

One obtains y if and only if there exists g ∈ ∂∥y∥_1, the sub-differential of ∥·∥_1, such that ρA^T(Ay − b) + βg = 0. As ∥·∥_1 is convex, we have ∥x∥_1 ≥ ∥y∥_1 + ⟨x − y, g⟩, which gives

∥y∥_1 − ∥x∥_1 ≤ ⟨y − x, g⟩ = −(ρ/β)⟨A(y − x), Ay − b⟩   (23)

Re-substituting x = Q^(i),T_[r,:], y = Q^(i+1),T_[r,:], A = [W^(i+1), C], b = (Z^(i) + (1/ρ)α_Z^(i))_[r,:] and summing over r, we have

β∥Q^(i+1)∥_{1,1} − β∥Q^(i)∥_{1,1} ≤ −ρ⟨R^(i+1) − Z^(i) − (1/ρ)α_Z^(i), [W^(i+1), C](Q^(i+1),T − Q^(i),T)⟩   (24)

Therefore,

(B) ≤ −(ρ/2)∥[W^(i+1), C](Q^(i+1),T − Q^(i),T)∥_F²   (25)

With a similar argument, we can bound (C) by

(C) ≤ −(ρ/2)∥[(W^(i+1) − W^(i)), C]Q^(i),T∥_F²   (26)

To get an upper bound on ∥α_Z^(i+1) − α_Z^(i)∥_F², we have

∥α_Z^(i+1) − α_Z^(i)∥_F² ≤ ∥Z^(i+1) − Z^(i)∥_F² + ∥R^(i+1) − R^(i)∥_F²
  ≤ ∥Z^(i+1) − Z^(i)∥_F² + ∥[W^(i+1), C](Q^(i+1),T − Q^(i),T)∥_F² + ∥[(W^(i+1) − W^(i)), C]Q^(i),T∥_F²   (27)

Combining equations 17, 27, 18, 19, 25 and 26 with equation 16, we have

L_ρ(V^(i+1), α_Z^(i+1)) − L_ρ(V^(i), α_Z^(i))
  ≤ (1/ρ)∥α_Z^(i+1) − α_Z^(i)∥_F² − (ρ/2)[∥Z^(i+1) − Z^(i)∥_F² + ∥[W^(i+1), C](Q^(i+1),T − Q^(i),T)∥_F² + ∥[(W^(i+1) − W^(i)), C]Q^(i),T∥_F²]
  ≤ (1/ρ − ρ/2)(∥Z^(i+1) − Z^(i)∥_F² + ∥[W^(i+1), C](Q^(i+1),T − Q^(i),T)∥_F² + ∥[(W^(i+1) − W^(i)), C]Q^(i),T∥_F²)   (28)

which is non-positive whenever 1/ρ ≤ ρ/2. This is summarized in the following theorem:

Theorem B.1 (Non-increasing property). If ρ ≥ √2, then for all i, L_ρ(W^(i+1), Q^(i+1), Z^(i+1), α_Z^(i+1)) ≤ L_ρ(W^(i), Q^(i), Z^(i), α_Z^(i)). We set ρ = 3 in all experiments, which is sufficient.

C ERROR ANALYSIS

WHEN k ≠ k*

Assume that there is a ground-truth factorization (W*, Q*) of the given M = W*(Q*)^T, with latent dimension k*, where W* and Q* are matrix-valued random variables with entries sampled from bounded distributions. With high probability, the error ∥M − WQ^T∥_F² we are minimizing is star-convex towards (W*, Q*) whenever k = k* (Bjorck et al., 2021). To demonstrate the importance of the choice of k, we consider the scenario k ≠ k* below.

First, a more precise assumption for ICQF is to model W as a row-independent bounded random matrix. Recall that W is generated by arranging the n participants' latent representations as rows of an n × k matrix, where the n participants are assumed to be independent from each other and their latent representations follow a high-dimensional bounded distribution. Second, let (W₁, Q₁) and (W₂, Q₂) be two factorizations with dimensions k₁ and k₂ respectively. Assume both factorizations achieve (a) equal mismatch loss in expectation and (b) equal approximation of the data matrix M in expectation:

(a): E[∥M − W₁Q₁^T∥_F²] = E[∥M − W₂Q₂^T∥_F²]   and   (b): E[W₁Q₁^T] = E[W₂Q₂^T]

We also assume (c): E[Σ_{j=1}^n (W_i)²_{jκ}] =: σ²_{W_i} and E[Σ_{j=1}^m (Q_i)²_{jκ}] =: σ²_{Q_i} for all κ = 1, …, k_i, i = 1, 2. Expanding (a), we have

E[Trace((M − W₁Q₁^T)^T(M − W₁Q₁^T))] = E[Trace((M − W₂Q₂^T)^T(M − W₂Q₂^T))]

This gives

E[Trace(W₁^TW₁Q₁^TQ₁ − 2M^TW₁Q₁^T)] = E[Trace(W₂^TW₂Q₂^TQ₂ − 2M^TW₂Q₂^T)]

Denote E[W_i] = µ_{W_i} and E[Q_i] = µ_{Q_i} for i = 1, 2, and write W_i = W̃_i + µ_{W_i}, Q_i = Q̃_i + µ_{Q_i}, where W̃_i and Q̃_i denote the corresponding centered variables. By the independence of W_i and Q_i and the linearity of the trace and expectation operators,

E[Trace(M^TW₁Q₁^T)] = Trace(M^T E[W₁] E[Q₁^T]) = Trace(M^T E[W₂] E[Q₂^T]) = E[Trace(M^TW₂Q₂^T)]   (30)

which yields

E[Trace(W₁^TW₁Q₁^TQ₁)] = E[Trace(W₂^TW₂Q₂^TQ₂)]   (31)

Consider E[Trace(W₁^TW₁Q₁^TQ₁)] via the definition. The diagonal entries of E[W₁^TW₁] are Σ_{j=1}^n (W₁)²_{jκ} and, since E[Q₁^TQ₁] is diagonal with entries Σ_{j=1}^m (Q₁)²_{jκ}, the trace picks out products of diagonal entries:

E[Trace(W₁^TW₁Q₁^TQ₁)] = Σ_{κ=1}^{k₁} E[Σ_{j=1}^n (W₁)²_{jκ}] E[Σ_{j=1}^m (Q₁)²_{jκ}]

Incorporating assumption (c), we have

E[Trace(W₁^TW₁Q₁^TQ₁)] = k₁ σ²_{W₁} σ²_{Q₁}   (33)

Consider equation 31 with k₁ > k₂. W.L.O.G., we pad W₂ and Q₂ with k₁ − k₂ columns of zeros so that both factorizations have the same number of columns. Moreover, letting P be an optimal permutation matrix, we also have

E[Trace((W₂P)^TW₂P(Q₂P)^TQ₂P)] = E[Trace(W₂^TW₂Q₂^TQ₂)] = k₂ σ²_{W₂} σ²_{Q₂}   (34)

Combining with equation 31, this is equivalent to

k₁ σ²_{W₁} σ²_{Q₁} = k₂ σ²_{W₂} σ²_{Q₂}   (35)

which gives

E[∥W₁∥_F²] = (σ²_{Q₂}/σ²_{Q₁}) E[∥W₂∥_F²] = (σ²_{Q₂}/σ²_{Q₁}) E[∥W₂P∥_F²]   (36)

To evaluate the impact on the interpretability of the latent representation under different latent dimensions, we consider E[∥W₁ − W₂P∥_F²]:

E[∥W₁ − W₂P∥_F²] = E[Trace((W₁ − W₂P)^T(W₁ − W₂P))] = E[∥W₁∥_F²] + (σ²_{Q₁}/σ²_{Q₂}) E[∥W₁∥_F²] − 2E[Trace(W₁^TW₂P)]

As Trace(W₁^TW₂P) ≤ ∥W₁∥_F ∥W₂P∥_F, we also have (by independence of W₁ and W₂, and Jensen's inequality)

E[Trace(W₁^TW₂P)] ≤ E[∥W₁∥_F] E[∥W₂P∥_F] ≤ √(E[∥W₁∥_F²] E[∥W₂∥_F²]) = √(σ²_{Q₁}/σ²_{Q₂}) E[∥W₁∥_F²]   (38)

which implies

E[∥W₁ − W₂P∥_F²] ≥ (1 − 2√(σ²_{Q₁}/σ²_{Q₂}) + σ²_{Q₁}/σ²_{Q₂}) E[∥W₁∥_F²] = (1 − √(σ²_{Q₁}/σ²_{Q₂}))² E[∥W₁∥_F²]   (39)

Since W_i is generated from a row-wise independent bounded distribution, if we add the mild assumption that σ²_{W_i} =: σ²_W for all i (through re-scaling), equation 35 implies k₁σ²_{Q₁} = k₂σ²_{Q₂} and therefore

E[∥W₁ − W₂P∥_F²] ≥ (1 − 2√(k₂/k₁) + k₂/k₁) E[∥W₁∥_F²] = (√(k₂/k₁) − 1)² E[∥W₁∥_F²]   (40)

If we substitute k₁ = k*, (W₁, Q₁) = (W*, Q*), we have

E[∥W* − W₂P∥_F²] ≥ (√(k₂/k*) − 1)² E[∥W*∥_F²]   (41)

which means the relative expected difference between W* and W₂ is bounded below by (√(k₂/k*) − 1)². To show that equation 41 holds in general, we consider matrix concentration inequalities and show that large deviations from the means are exponentially unlikely. Benefiting from the model constraints, we can further assume that W is generated from some high-dimensional bounded distribution. In the following, we make use of the main theorem proposed in Meckes & Szarek (2012) on concentration of non-commutative random matrix polynomials. As the W_i are generated from bounded distributions, ∥W_i − E[W_i]∥_F is uniformly bounded; therefore, it satisfies the convex concentration property. The theorem yields

P(|∥W∥_F² − E[∥W∥_F²]| > t k n²) ≤ C₁ exp(−C₂ min(t², t^{1/2}) n)   (42)

Recall that E[∥W₁ − W₂P∥_F²] = E[∥W₁∥_F²] + (σ²_{Q₁}/σ²_{Q₂}) E[∥W₁∥_F²] − 2E[Trace(W₁^TW₂P)]. By padding W₁ and W₂ with zero columns, we may assume the W_i are all n × n matrices. The probability that any one of these terms deviates from its mean by a relative factor ε is then at most C₁ exp(−C₂ε²n) for some small ε. By the union bound, the probability that any of them does is at most C₃ exp(−C₄ε²n).

This experiment aimed at comparing the effectiveness of BCV and other procedures for estimating the number of latent factors in a synthetic example, for ICQF and two baseline methods. We generated synthetic questionnaires with k* = 10 factors (bounded in [0, 1]), where each factor was present in isolation for 200 participants and in tandem with another factor for the others. Each factor had an associated loading over 100 questions. The answers to questions are bounded between 0 and 100.
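To get a feel for the bound in equation 41, we can tabulate the relative lower bound (√(k₂/k*) − 1)² for k* = 10 (illustrative values, not from the paper):

```python
import numpy as np

k_star = 10
# relative lower bound on E||W* - W_2 P||^2 / E||W*||^2 from equation 41
bound = lambda k2: (np.sqrt(k2 / k_star) - 1) ** 2

# the guaranteed deviation of W_2 from W* shrinks as k_2 approaches k*
vals = [bound(k2) for k2 in (2, 5, 8, 10)]
assert all(v1 > v2 for v1, v2 in zip(vals, vals[1:]))   # monotone decreasing
assert bound(k_star) == 0.0                              # no penalty at k2 = k*
```

For example, underestimating the dimension at k₂ = 5 already forces an expected relative squared deviation of about 8.6% from W*.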

D VISUALIZATION OF THE EXPERIMENTAL SETUP FOR DIAGNOSTIC PREDICTION EVALUATION

To generate the answer matrix M ∈ ℝ^{200×100}_{≥0}, we design the underlying matrices W and Q directly. We first create a matrix D of size 200 × 10, shown in Figure 6 (left). The 10 columns of D are correlated with a step-like pattern, where each "step" has length 20 and entries on the "step" have weight 1. Every consecutive pair of steps overlaps by 10 units, to synthesize correlation between latent factors. An entry W[i, j] is then defined as

W[i, j] := D[i, j] · a · b,  a ∼ U(0.5, 1),  b ∼ B(1, p = 0.9)

where U(0.5, 1) is the uniform distribution on [0.5, 1] and B(1, p = 0.9) is the Bernoulli distribution with probability p = 0.9. The matrix Q of size 100 × 10, shown in Figure 6 (center), is defined as

Q[i, j] := c · d,  c ∼ U(0, 100),  d ∼ B(1, 0.3)

We compared ICQF against ℓ1-regularized NMF (ℓ1-NMF) (Cichocki & Phan, 2009) and factor analysis with promax rotation (FA-promax) (Hendrickson & White, 1964), as factors can be correlated. Both ICQF and ℓ1-NMF were initialized with NNDSVD (Boutsidis & Gallopoulos, 2008), and used the same sparsity (β = 1e-1) and stopping criterion (relative iteration convergence tolerance ϵ < 1e-3) for fairness. The estimation method for FA was minimum residual.

Table 3 shows the mean error ε and the standard error s_E of the detected k versus the ground truth k* = 10, across the different generated datasets. Statistics are marked with an asterisk if they contain runs where no inflection point was detected using the kneed algorithm (Satopaa et al., 2011). We tested five popular detection algorithms: BCV (Kanagal & Sindhwani, 2010), {BIC₁, BIC₂} (Stoica & Selen, 2004)³, CCC (Fogel et al., 2007) and Dispersion (Brunet et al., 2004). For ICQF, BCV is the best overall detection scheme at all noise levels; BIC₂ performs well only at low noise. Among the three common schemes for FA, Horn's PA (Horn, 1965) and MAP (Velicer, 1976) are superior to BIC₃ (Preacher et al., 2013). PA achieves the best performance when the noise density is high.
This result aligns with the conclusions in Velicer et al. (2000), Watkins (2018) and Goretzko et al. (2021).

³ The two BIC versions are, respectively, BIC₁(k) := log∥M − WQ^T∥_F² + k·((m+n)/(mn))·log(mn/(m+n)) and BIC₂(k) := log∥M − WQ^T∥_F² + k·((m+n)/(mn))·log(min(√m, √n)²). There are other similar versions of BIC, but their performances are indistinguishable, so we include only two representatives in the manuscript.

In Section 5.3.3, we demonstrated that the additional regularization and constraints in ICQF improve the interpretability of the factorization. Consistently, we should expect ICQF to yield a more stable factorization when applied to data from another population. Using the CBCL-ABCD dataset, we extend our evaluation by studying the stability of the factor loadings (matrix Q) in a cross-population scenario. For each method (ICQF, ℓ1-NMF and FA-promax), we factorize the CBCL-HBN dataset (a care-seeking population) and the CBCL-ABCD dataset (a general, diversified population) separately, to obtain the factor loadings Q_HBN and Q_ABCD respectively. In this experiment, for fairness' sake, we set the number of factors to k = 8 for all methods. Taking Q_HBN as the reference, we perform the greedy-matching algorithm to re-order the factors in Q_ABCD and compute the Pearson correlation coefficient for every best-matched pair of factors from (Q_HBN, Q_ABCD). Figure 15 shows the 8 Pearson correlation coefficients achieved by the ICQF, ℓ1-NMF and FA-promax factorizations. Since FA-promax has a sign ambiguity, we take the absolute value of all correlation coefficients obtained from FA-promax. The first 5 correlation coefficients are close to 1 for all three methods, suggesting that all methods capture similar factors in the two populations. However, from the 6th pair of factors onward, the correlation coefficients achieved by ICQF are superior to those from ℓ1-NMF and FA-promax. This indicates that the extra constraints help stabilize the factorization outcomes and are potentially advantageous for transfer learning or model interpretation.
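A sketch of this cross-population stability check, with a minimal greedy matcher of our own (the paper's exact greedy-matching algorithm may differ in tie-breaking details):

```python
import numpy as np

def greedy_match_corr(Q_ref, Q_other):
    """Greedily pair columns of Q_other with columns of Q_ref by Pearson
    correlation; returns the per-pair correlations (non-increasing)."""
    k = Q_ref.shape[1]
    C = np.corrcoef(Q_ref.T, Q_other.T)[:k, k:]   # k x k cross-correlations
    corrs, rows, cols = [], set(range(k)), set(range(k))
    for _ in range(k):
        # pick the best remaining (reference, candidate) pair
        i, j = max(((i, j) for i in rows for j in cols), key=lambda p: C[p])
        corrs.append(float(C[i, j]))
        rows.remove(i)
        cols.remove(j)
    return corrs

# a column-permuted copy of the loadings should match perfectly
rng = np.random.default_rng(0)
Q_hbn = rng.random((113, 8))            # e.g. 113 CBCL items, k = 8 factors
assert np.allclose(greedy_match_corr(Q_hbn, Q_hbn[:, ::-1]), 1.0)
```

Because the global maximum is taken at each round over the remaining pairs, the returned correlations are automatically sorted in decreasing order, matching the layout of Figure 15.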



An appropriate weight is applied to the error when the last fold contains fewer blocks than the others. In the multi-questionnaire setting, we abuse notation and use Q to denote the latent representation of questions, obtained either by direct factorization or by the meta-factorization followed by back-propagation.



Figure 1: Top: Heat map of factor and confound loadings Q := [Q_R, Q_C]. Note that questions are grouped by syndrome subscale; some factors are syndrome-specific, while others bridge syndromes. Bottom left: Detection of (k, β) by BCV. Bottom right: Top 10 questions of Q[:,1], loading mostly on the Aggressive subscale. Top 10 questions for all factors are shown in Appendix H.

Figure 2: Left: Correlation matrix of Q, showing clusters of similar ICQF factors for all questionnaires on the left (format questionnaire-#), and the meta-factors (1-15) those clusters correspond to at the bottom. Bottom-right: Detection of (k, β) by BCV. Top-right: The top 10 questions associated with meta-factor 1, their loading value, and their questionnaire of origin. This meta-factor reflects attention issues. A similar plot for every meta-factor is reported in Appendix I.

See Appendix D for a diagram. A sans serif font (e.g. W) indicates a second-level factorization result. In meta-factorization, both model training and inference are performed on individual questionnaires to obtain {W_i^train}_{i=1}^S at the first level. At the second level, we concatenate {W_i^train}_{i=1}^S and do model training to get the meta-factors W^train and Q, followed by model inference on the concatenated {W_i^validate}_{i=1}^S.

Figure 3: ROC-AUC trend of prediction performance for Depression using the t-questionnaire, for increasing t. The methods are ICQF (estimated k = 22), meta-ICQF (estimated k = 15), ℓ1-NMF and FA-promax (with k = 15 or estimated), and meta-factor (meta-ICQF with all questions in HBN).

Figure 4: Setup for diagnostic prediction experiments.

Having W and Q, we obtain a noiseless data matrix M_clean := min(max(WQ^T, 0), 100) (44). To introduce additive noise, we modify M_clean via

M := min(max(M_clean + e · f, 0), 100),  f ∼ U(−100, 100)   (45)

where e follows a discrete probability distribution with P(e = 1) = δ, P(e = 0) = 1 − δ. This yields the data matrix M, shown in Figure 6 (right) for δ = 0.3. The higher δ is, the more entries of M_clean are contaminated by the additive noise f.

Figure 6: Synthetic example of W , Q and M with noise density δ = 0.3.
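The synthetic-data pipeline of Appendix F (step-pattern D, bounded W and Q, clipping and additive noise from equations 44-45) can be sketched as follows; the exact row layout of the steps in D is our assumption, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, delta = 200, 100, 10, 0.3

# Step-like design D: steps of length 20, consecutive steps overlapping by 10
# (we assume column j is active on rows [10j, 10j + 20) -- layout is a guess).
D = np.zeros((n, k))
for j in range(k):
    D[10 * j : 10 * j + 20, j] = 1.0

W = D * rng.uniform(0.5, 1.0, (n, k)) * rng.binomial(1, 0.9, (n, k))
Q = rng.uniform(0.0, 100.0, (m, k)) * rng.binomial(1, 0.3, (m, k))

M_clean = np.clip(W @ Q.T, 0.0, 100.0)          # noiseless answers (eq. 44)
e = rng.binomial(1, delta, (n, m))              # P(e = 1) = delta
f = rng.uniform(-100.0, 100.0, (n, m))          # additive noise (eq. 45)
M = np.clip(M_clean + e * f, 0.0, 100.0)

assert M.shape == (n, m) and M.min() >= 0.0 and M.max() <= 100.0
```

Increasing delta contaminates a larger fraction of the entries of M_clean, reproducing the noise-density sweep used in the detection experiment.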


Figure 7: Top 10 questions ranked by Q in CBCL using Q obtained from ICQF (Factor 1-3).

Figure 15: Sorted Pearson correlation of best-matched pair of factor loadings Q obtained from CBCL-ABCD and CBCL-HBN questionnaire.

Figure 16: ROC-AUC trend of GenAnxiety, ADHD, Suspected ASD and {Panic, Agoraphobia, SeparationAnx, SocialAnx}.

Figure 17: ROC-AUC trend of BPD, Specific Phobia, OCD and Eating Disorder.

Averaged ROC-AUC scores of the diagnostic prediction under different factorizations.

Number of positive diagnostic labels in HBN dataset

Experiment of k estimation for synthetic questionnaires. The average error (ε) and the standard error of estimated k (s E ) are reported. Statistics with asterisk (*) ignore undetectable inflection points.

Optimal (k, β) of all 21 questionnaires.

ROC-AUC scores of the diagnostic prediction under different data inputs and factorizations.

APPENDIX CONTENT

• Appendix A: Detailed description of the optimization algorithm.
• Appendix B: Demonstrations of theoretical guarantees on algorithm convergence.
• Appendix C: Error analysis with respect to the choice of latent dimension k.
• Appendix D: Visualization of the experimental setup for diagnostic prediction evaluation.
• Appendix E: Visualization of train, validation and test sets in the multi-questionnaire setting.
• Appendix F: Details of the synthetic questionnaire experiment (both data generation and the experimental setup).
• Appendix G: Table summarizing the optimal (k, β) detected using ICQF with BCV for each of the 21 questionnaires.
• Appendix H: Full list of the top 10 questions for each factor in the loading matrix Q, obtained by factorizing the CBCL questionnaire.
• Appendix I: Full list of the top 10 questions for each factor in the loading matrix Q, obtained by performing meta-ICQF on the 21 questionnaires.
• Appendix J: ROC-AUC scores of each diagnostic prediction obtained using different factorizations (expansion of Table 1).
• Appendix K: Evaluation and model inference on the CBCL-ABCD dataset.
• Appendix L: ROC-AUC trend of the t-questionnaire for all 11 diagnoses.

A OPTIMIZATION PROCEDURE OF ICQF

Recall that the Lagrangian L_ρ of ICQF is

L_ρ(W, Q, Z, α_Z) = (1/2)∥ℳ ⊙ (M − Z)∥_F² + β(∥W∥_{1,1} + ∥Q∥_{1,1}) + ⟨α_Z, Z − [W, C]Q^T⟩ + (ρ/2)∥Z − [W, C]Q^T∥_F²

where ℳ is the missing-data mask.

We perform the same diagnostic prediction evaluation on the CBCL-ABCD dataset as reported in Section 5.3.2. We split participants into train, validation and test sets with a 70/15/15 ratio, based on the distribution of confounds and diagnostic labels. We resample 50 dataset splits using different seeds and carry out the prediction experiment in each split. Table 7 summarizes the average AUCs achieved by the different factorization methods. The null hypothesis is rejected, and the post-hoc Nemenyi test indicates that subscales and ℓ1-NMF are slightly but significantly worse than the raw setting, while other differences are inconclusive. As before, we conclude that ICQF preserves diagnostic information even though it introduces additional regularization and constraints relative to other methods.
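The evaluation loop (resampled splits, logistic regression on factor scores, ROC-AUC) can be sketched as follows; the data here are simulated, and a plain gradient-descent classifier stands in for the paper's ℓ2-regularized logistic regression:

```python
import numpy as np

def auc_score(y, s):
    """ROC-AUC via the rank-sum (Mann-Whitney) formulation."""
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def logreg_fit(X, y, l2=1.0, lr=0.1, n_iter=2000):
    """L2-regularized logistic regression by plain gradient descent (sketch)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y) + l2 * w) / len(y)
    return w

rng = np.random.default_rng(0)
n, k = 300, 15
W = rng.normal(size=(n, k))                        # stand-in for ICQF factor scores
y = (W[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(float)

aucs = []
for seed in range(10):                             # the paper resamples 50 splits
    idx = np.random.default_rng(seed).permutation(n)
    tr, te = idx[:int(0.7 * n)], idx[int(0.85 * n):]   # 70/15 train/test; the
    w = logreg_fit(W[tr], y[tr])                       # middle 15% would serve
    aucs.append(auc_score(y[te], W[te] @ w))           # as the validation set
assert np.mean(aucs) > 0.8                         # informative factors predict y
```

In the paper, the middle 15% validation split is used to tune the regularization parameter; here it is simply held out to keep the sketch short.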

