LEVERAGING DOUBLE DESCENT FOR SCIENTIFIC DATA ANALYSIS: FACE-BASED SOCIAL BEHAVIOR AS A CASE STUDY

Abstract

Scientific data analysis often involves making use of a large number of correlated predictor variables to predict multiple response variables. Understanding how the predictor and response variables relate to one another, especially in the presence of relatively scarce data, is a common and challenging problem. Here, we leverage the recently popular concept of "double descent" to develop a particular treatment of the problem, including a set of key theoretical results. We also apply the proposed method to a novel experimental dataset consisting of human ratings of social traits and social decision-making tendencies based on the facial features of strangers, and resolve a scientific debate regarding the existence of a "beauty premium" or "attractiveness halo," which refers to a (presumed) advantage attractive people enjoy in social situations. We demonstrate that more attractive faces indeed enjoy a social advantage, but that this is indirectly due to the facial features that contribute to both perceived attractiveness and trustworthiness, and that the component of attractiveness perception due to facial features unrelated to trustworthiness actually elicits a "beauty penalty." Conversely, the facial features that contribute to trustworthiness and not to attractiveness still contribute positively to pro-social trait perception and decision making. Thus, what was previously thought to be an attractiveness halo/beauty premium is actually a trustworthiness halo/premium plus a "beauty penalty." Moreover, we see that the facial features that contribute to the trustworthiness halo primarily have to do with how smiley a face is, while the facial features that contribute to attractiveness but actually act as a beauty penalty are anti-correlated with age. In other words, youthfulness and smiley-ness both contribute to attractiveness, but only smiley-ness contributes positively to pro-social perception and decision making, while youthfulness actually contributes negatively to them. A further interesting wrinkle is that youthfulness as a whole does not contribute negatively to social traits/decision-making; only the component of youthfulness contributing to attractiveness does.

1. INTRODUCTION

Scientific data analysis often involves building a linear regression model between a large number of predictor variables and multiple response variables. Understanding how the predictor and response variables relate to one another, especially in the presence of relatively scarce data, is an important but challenging problem. For example, a geneticist might have a genomic dataset with many genetic features as predictor variables and disease prevalence data as response variables: the geneticist may want to know how the different types of disease are related to each other through their genetic underpinnings. Another example is that a social psychologist might have a set of face images (with many facial features) that have been rated by a relatively small set of subjects for perceived social traits and social decision making tendencies, and wants to discover how the different social traits and decision making tendencies relate to each other through the underlying facial features. A common problem encountered in these types of problems is that the large number of features relative to the number of data points typically entails some kind of dimensionality reduction and feature selection, and this process needs to be differently parameterized in order to optimize for each response variable, making direct comparison of the features underlying different response variables challenging. In the worst case, there may not be any subset of features that can predict all response variables better than chance level. Here, we leverage the "double descent" phenomenon to develop and present a novel analysis framework that obviates such issues by relying on a universal, overly parameterized feature representation. As a case study, we apply the framework to better understand the underlying facial features that contribute separately and conjointly to human trait perception and social decision making. 
Humans readily infer social traits, such as attractiveness and trustworthiness, from as little as a 100 ms exposure to a stranger's face (Willis & Todorov, 2006). Though the veracity of such judgments is still an area of active research (Valla et al., 2011; Todorov et al., 2015), such trait evaluations have been found to predict important social outcomes, ranging from electoral success (Todorov et al., 2005; Ballew & Todorov, 2007; Little et al., 2007) to prison sentencing decisions (Blair et al., 2004; Eberhardt et al., 2006). In particular, psychologists have observed an "attractiveness halo", whereby humans tend to ascribe more positive attributes to more attractive individuals (Eagly et al., 1991; Langlois et al., 2000), and economists have observed a related phenomenon, the "beauty premium", whereby more attractive individuals out-earn less attractive individuals in economics games (Mobius & Rosenblat, 2006). However, these claims are not without controversy (Andreoni & Petrie, 2008; Willis & Todorov, 2006), as more attractive people can also incur a "beauty penalty" in certain situations. Moreover, a robust correlation between attractiveness and trustworthiness (Willis & Todorov, 2006; Oosterhof & Todorov, 2008; Xu et al., 2012; Ryali et al., 2020) has also been reported, making it unclear how much of the attractiveness halo effect might be indirectly due to perceived trustworthiness. To tease apart the contributions of trustworthiness and attractiveness to social perception and decision-making, we perform linear regression of different response variables, consisting of subjects' ratings of social perception and social decision-making tendencies, against features of the Active Appearance Model (AAM), a well-established computer vision model (Cootes et al., 2001) whose features have been found to be linearly encoded by macaque face-processing neurons (Chang & Tsao, 2017).
A similar regression framework has been adopted by previous work modeling human face trait perception (Oosterhof & Todorov, 2008; Said & Todorov, 2011; Song et al., 2017; Guan et al., 2018; Ryali et al., 2020), using features either from AAM or deep neural networks. Because the number of features is typically quite large, usually larger than the number of rated faces, previous approaches have all used some combination of dimensionality reduction and feature selection. This approach gives rise to a dilemma when one wants to compare the facial features contributing to different types of social perceptions (response variables), since the number of features that optimizes prediction accuracy for each task can be quite different (see Figure 1). Either one optimizes this quantity separately for each task, and thus has no common set of features to compare across tasks; or one fixes a particular set of features for all tasks, and then has suboptimal prediction accuracy (in the worst case, perhaps worse than chance-level performance). To overcome this challenge, we appeal to the "double descent" (Belkin et al., 2019; 2020) trick: the use of a highly overparameterized representation (more features than data points) to achieve good performance. In particular, if we use the original AAM feature representation, while forgoing any kind of dimensionality reduction or feature selection, then we have a universal representation that may also perform well on all tasks, even novel tasks not seen before, or responses corresponding to predictor variable settings totally different from those previously seen. While overparameterized linear regression has chiefly been used as an analytically tractable case study (Belkin et al., 2019; Xu & Hsu, 2019; Belkin et al., 2020) to gain insight into the theoretical basis and properties of "double descent", we use it as a practical setting for scientific data analysis.
Notably, while previous papers on overparameterized regression defined statistical assumptions and constraints in the generative sense (e.g. whether two features are "truly" decorrelated (Xu & Hsu, 2019)), we work, for pragmatic reasons, directly with sample statistics (e.g. whether two feature vectors across a set of data points have a correlation coefficient of 0). For this reason, our theoretical results are distinct from and novel with respect to those prior results. Finally, it is noteworthy that the human visual pathway also exhibits feature expansion rather than feature reduction, from the sensory periphery to higher cortical areas (Wandell, 1995); this raises the intriguing possibility that the brain has also discovered an overparameterized representation as a universal representation for learning to perform well on many tasks, including novel ones not previously encountered. In Section 2, we use over-parameterized linear regression to develop a framework that generalizes well across both tasks and data space. We provide theoretical conditions under which the complete over-parameterized representation yields linear estimators that are 1) guaranteed to perform better than chance, and 2) optimal among the class of hard regularizers. We also provide exact error expressions for these estimators, as well as an exact measure of how far the estimators are from the optimal hard regularizers when the over-parameterized estimators are suboptimal. In Section 3, we verify the practical usefulness of our theoretical framework by comparing the prediction accuracy of over-parameterized regression against task-specific classical (under-parameterized) regression that optimizes feature selection for each task, on original data collected from a face-based social perception and decision-making study.
In Section 4, we apply our mathematical and computational framework to show that the halo effect appears to arise from trustworthiness rather than attractiveness per se, and that attractiveness unrelated to trustworthiness actually induces a beauty penalty, while trustworthiness unrelated to attractiveness induces a premium, thus reconciling conflicting results in the literature regarding the existence of an attractiveness halo. Finally, we present a novel finding that the component of attractiveness related to pro-social perception and judgment is related to how smiley a face appears, while the component of attractiveness unrelated to trustworthiness is related to the youthfulness of facial appearance.

Figure 1: Loss (prediction MSE, mse(ŷ)) as a function of the number of features for face-based social-perception tasks, using cross-validation as specified in Section 3. The vertical dashed lines indicate the minimum error in the under-parameterized regime for the different tasks (response variables), illustrating the difficulty of finding a common number of features to use for all tasks in the under-parameterized regime. The fully over-parameterized regression results in better-than-chance MSE for all tasks. Horizontal dashed line: variance of collected responses, normalized to 1 in all tasks for ease of visual comparison; error bars: standard error of the mean.

2. MATHEMATICAL FRAMEWORK

We consider a linear regression problem where each response $y$ is a linear function of $n$ real-valued variables $x \in \mathbb{R}^n$, parameterized by a vector $\beta \in \mathbb{R}^n$, in addition to some noise ($\epsilon$). More formally, for $m$ datapoints (with $n \ge m$), we assume

$$y = X\beta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2),$$

with both $\beta$ and $\epsilon$ zero-mean and i.i.d. We further assume, without loss of generality, that both the design matrix $X \in \mathbb{R}^{m \times n}$ and the vector of responses $y \in \mathbb{R}^m$ are centered, and that $X$ is full rank, with rank denoted by $r$. Note that for an over-parameterized, centered full-rank matrix, $r = m - 1$. We use the pseudoinverse to obtain the $n$-dimensional minimum L2-norm estimator $\hat{\beta}$ of $\beta$,

$$\hat{\beta} = X^{\dagger} y,$$

and the mean squared error (MSE) to evaluate the estimator $\hat{\beta}$:

$$\mathrm{mse}(\hat{\beta}) = \operatorname{tr} E[(\beta - \hat{\beta})(\beta - \hat{\beta})^T],$$

where $\operatorname{tr}(\cdot)$ denotes the matrix trace and $E[\cdot]$ the expected value. We also use $\|\cdot\|$ to denote the L2-norm.
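The setup above can be illustrated with a minimal NumPy sketch. The data are synthetic, the sizes `m` and `n` are hypothetical, and the explicit `rcond` cutoff is an assumption added here so that the numerically zero singular value created by centering is discarded; none of these choices come from the paper.

```python
import numpy as np

def min_norm_estimator(X, y, rcond=1e-10):
    """Minimum L2-norm estimator: pinv(Xc) @ yc on centered data."""
    Xc = X - X.mean(axis=0)   # center every feature column
    yc = y - y.mean()         # center the responses
    # rcond drops the (numerically) zero singular value created by centering
    return np.linalg.pinv(Xc, rcond=rcond) @ yc

# Synthetic over-parameterized problem (n >= m); sizes are illustrative only
rng = np.random.default_rng(0)
m, n = 20, 200
X = rng.normal(size=(m, n))
beta = rng.normal(size=n) / np.sqrt(n)
y = X @ beta + 0.1 * rng.normal(size=m)

beta_hat = min_norm_estimator(X, y)

Xc, yc = X - X.mean(axis=0), y - y.mean()
rank = np.linalg.matrix_rank(Xc)                   # rank of the centered design matrix
max_residual = np.max(np.abs(Xc @ beta_hat - yc))  # training residual
print(rank, max_residual)
```

In this over-parameterized setting the estimator interpolates the centered training responses, which is why the quality of the estimator has to be judged by the estimator MSE or by held-out prediction error rather than by the training fit.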

2.1. THEORETICAL CONDITIONS FOR A GOOD ESTIMATOR β

Estimator Condition 1. If the noise variance in $y$ ($\sigma_\epsilon^2$) is less than or equal to half the signal variance in $y$ ($\sigma_y^2$), then $\hat{\beta}$ is an above-chance estimator, i.e.

$$\sigma_\epsilon^2 \le \frac{\sigma_y^2}{2} \iff \mathrm{mse}(\hat{\beta}) \le \sigma_\beta^2,$$

where $\sigma_\beta^2$ is the variance of the parameter vector $\beta$.

Proof Sketch. This follows from the definition of $\mathrm{mse}(\hat{\beta})$ and $\sigma_\beta^2$, the cyclic property of the trace and linearity of expectation, as well as the i.i.d. assumptions on $\beta$ and $\epsilon$. See the appendix for an explicit derivation.

Estimator Condition 2. If the smallest singular value ($s_r$) of the design matrix $X$ satisfies

$$s_r^2 \ge \frac{\sigma_\epsilon^2}{\|\beta\|^2 / r},$$

then $\hat{\beta}$ is the minimum MSE (MMSE) estimator among the class of hard regularizers subject to linear constraints.

Proof Sketch. Park (1981) proved the above for the prediction MSE of a PCR estimator in the under-parameterized regime. An extension to the over-parameterized regime follows as 1) the MSE is invariant under orthogonal transformations, and 2) any over-parameterized estimator has an "equivalent" under-parameterized estimator (equivalent in the sense that the estimators yield the same MSE). See the appendix for a detailed proof.
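As an illustration, the two conditions can be written as simple checks. This is only a sketch: in real data the quantities `sigma_eps2`, `sigma_y2`, and `beta_norm2` are unknown, so the inputs here are hypothetical values, and the function names are introduced for illustration.

```python
import numpy as np

def above_chance(sigma_eps2, sigma_y2):
    """Estimator Condition 1: noise variance at most half the signal variance."""
    return sigma_eps2 <= sigma_y2 / 2.0

def full_estimator_is_mmse(singular_values, sigma_eps2, beta_norm2):
    """Estimator Condition 2: s_r^2 >= sigma_eps^2 / (||beta||^2 / r)."""
    s = np.asarray(singular_values, dtype=float)
    s = s[s > 1e-10 * s.max()]          # keep the r non-zero singular values
    r = s.size
    threshold = sigma_eps2 / (beta_norm2 / r)
    return bool(s.min() ** 2 >= threshold)

# Hypothetical inputs, for illustration only
print(above_chance(sigma_eps2=0.4, sigma_y2=1.0))
print(full_estimator_is_mmse([3.0, 2.0, 1.0], sigma_eps2=0.5, beta_norm2=3.0))
```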

2.2. EXACT ERROR EXPRESSIONS

Error Expression 1. The MSE of the over-parameterized estimator $\hat{\beta}$ is given by

$$\mathrm{mse}(\hat{\beta}) = \sigma_\epsilon^2 \sum_{i=1}^{r} \frac{1}{s_i^2},$$

where $s_1, \ldots, s_r$ are the singular values of the design matrix $X$.

Proof Sketch. Once again, this follows from the definition of $\mathrm{mse}(\hat{\beta})$, the cyclic property of the trace and linearity of expectation, as well as the i.i.d. assumptions on $\beta$ and $\epsilon$. See the appendix for an explicit derivation.

Error Expression 2. Suppose the MMSE estimator $\hat{\theta}^*$ has $p$ components. Then the difference in MSEs between the MMSE estimator and the fully over-parameterized estimator is given by

$$\mathrm{mse}(\hat{\beta}) - \mathrm{mse}(\hat{\theta}^*) = \sigma_\epsilon^2 \sum_{i=p+1}^{r} \frac{1}{s_i^2} - \frac{\|\beta\|^2}{r}(r - p).$$

Proof Sketch. This follows from extending Park (1981) to the over-parameterized regime, in addition to the definition of MSE, the cyclic property of the trace and linearity of expectation, as well as the i.i.d. assumptions on $\beta$ and $\epsilon$. See the appendix for a detailed proof.
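The two error expressions translate directly into code. The sketch below is illustrative: the function names and the random design matrix are assumptions, and the last lines only check the trace identity $\operatorname{tr}(X^{\dagger} X^{\dagger T}) = \sum_i 1/s_i^2$ on which Error Expression 1 rests.

```python
import numpy as np

def estimator_mse(s, sigma_eps2):
    """Error Expression 1: sigma_eps^2 * sum_i 1 / s_i^2."""
    s = np.asarray(s, dtype=float)
    return sigma_eps2 * np.sum(1.0 / s**2)

def mse_gap(s, sigma_eps2, beta_norm2, p):
    """Error Expression 2: mse(beta_hat) - mse(theta_star) for a p-component MMSE estimator."""
    s = np.sort(np.asarray(s, dtype=float))[::-1]   # s_1 >= ... >= s_r
    r = s.size
    return sigma_eps2 * np.sum(1.0 / s[p:]**2) - beta_norm2 / r * (r - p)

# Numerical check of the trace identity on a random full-rank design (illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 30))
s = np.linalg.svd(X, compute_uv=False)
P = np.linalg.pinv(X)
trace_identity_holds = np.isclose(np.trace(P @ P.T), np.sum(1.0 / s**2))
print(trace_identity_holds)
```

A non-positive gap for every truncation level $p < r$ corresponds to the case in which Estimator Condition 2 holds and no hard truncation improves on the full estimator.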

3. EXPERIMENTAL VALIDATION & COMPUTATIONAL FRAMEWORK

As the theoretical conditions and error expressions established in the previous section depend on variables that are unknown in real data (such as the noise and signal variance, the norm of the true parameter vector, etc.), and as real data may violate the theoretical assumptions, we validate how well the over-parameterized representation generalizes in practice using data collected in a face-based social decision-making study (Figure 2).

3.1. SOCIAL DECISION-MAKING EXPERIMENT

613 undergraduate students at the University of California, San Diego participated in a three-block, hour-long study in which they were asked to rate social traits (block A), make decisions in social scenarios (block B), and play economic games (block C) with novel face images (Figure 2). All blocks were counterbalanced across subjects.

Inclusion/exclusion criteria. Participants whose response entropy and/or correlation coefficient (CC) between their responses and the average response fell more than two standard deviations below the mean were excluded, resulting in standardized responses from 485 subjects being included in the analysis.

Face stimuli. 72 white female faces with direct gaze and natural expressions were sampled from the 10K US Adult Faces Database (Bainbridge et al., 2013). A sub-sample of 52 faces was then used in blocks A and C, while 36 face pairings were used in block B. The 52 face images used in all blocks were included in the analysis.
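One possible reading of the exclusion rule is sketched below. Several details are assumptions rather than statements from the study: the entropy is taken in bits over the 1-9 response scale, the average response includes the participant being tested, a participant is excluded if either criterion is violated, and a fully constant response row (for which the CC is undefined) is also excluded. The data are synthetic.

```python
import numpy as np

def response_entropy(row, levels=np.arange(1, 10)):
    """Shannon entropy (bits) of one participant's responses on a 1-9 scale."""
    counts = np.array([(row == k).sum() for k in levels], dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def corr_with_mean(responses):
    """CC between each participant's responses and the across-participant mean."""
    mean_resp = responses.mean(axis=0)
    cc = np.full(len(responses), np.nan)
    for i, row in enumerate(responses):
        if row.std() > 0:                       # CC is undefined for constant rows
            cc[i] = np.corrcoef(row, mean_resp)[0, 1]
    return cc

def inclusion_mask(responses, n_sd=2.0):
    """True for participants kept under the (assumed) two-SD rule."""
    ent = np.array([response_entropy(r) for r in responses])
    cc = corr_with_mean(responses)

    def too_low(v):
        return np.isnan(v) | (v < np.nanmean(v) - n_sd * np.nanstd(v))

    return ~(too_low(ent) | too_low(cc))

# Synthetic example: 40 participants x 30 items, participant 0 responds almost constantly
rng = np.random.default_rng(2)
base = rng.integers(1, 10, size=30)
responses = np.clip(base + rng.integers(-1, 2, size=(40, 30)), 1, 9)
responses[0] = 5
responses[0, 0] = 6
keep = inclusion_mask(responses)
print(keep.sum(), "of", len(keep), "participants kept")
```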

3.2. COMPUTATIONAL FRAMEWORK

Feature Representation. We train a three-color-channel AAM on the Chicago Face Database (Ma et al., 2015) plus the 10K US Adult Faces Database (Bainbridge et al., 2013). As in conventional practice, we perform principal component analysis (PCA) on the faces the AAM was trained on, but unlike conventional practice, we do not reduce the number of principal components, resulting in a representation with $n = 10{,}764$ features.

Model Evaluation. Using leave-one-out cross-validation ($m = 52$), we evaluate the prediction MSE on held-out test data. More formally, for each held-out face $x_i$, we predict a social decision ($\hat{y}_i$):

$$\hat{y}_i = x_i^T \hat{\beta},$$

where $\hat{\beta}$ is the minimum L2-norm estimator specified in Section 2. We then evaluate:

$$\mathrm{mse}(\hat{y}) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2.$$
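The evaluation procedure can be sketched as follows. This is a minimal illustration on synthetic data, not the study's pipeline: centering with the training-fold means and the `rcond` cutoff are assumptions added here, and the feature count is kept far below 10,764 so the example runs quickly.

```python
import numpy as np

def loo_prediction_mse(X, y, rcond=1e-10):
    """Leave-one-out prediction MSE of the minimum L2-norm estimator."""
    m = X.shape[0]
    sq_err = np.empty(m)
    for i in range(m):
        train = np.arange(m) != i
        x_mean = X[train].mean(axis=0)          # training-fold statistics only
        y_mean = y[train].mean()
        beta_hat = np.linalg.pinv(X[train] - x_mean, rcond=rcond) @ (y[train] - y_mean)
        y_pred = (X[i] - x_mean) @ beta_hat + y_mean
        sq_err[i] = (y[i] - y_pred) ** 2
    return float(sq_err.mean())

# Synthetic stand-in for 52 faces with many more features than faces
rng = np.random.default_rng(3)
m, n = 52, 500
X = rng.normal(size=(m, n))
y = X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=m)
y = (y - y.mean()) / y.std()                    # unit variance, so chance level is about 1
loo_mse = loo_prediction_mse(X, y)
print("LOO prediction MSE:", loo_mse)
```

With responses normalized to unit variance, a value below 1 corresponds to better-than-chance prediction in the sense used in Figure 1.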

3.3. VALIDATION

Using the computational framework, as well as the responses collected in the social decision-making study, we observe that the prediction MSE on unseen test data is 1) within the standard error of the mean of the MMSE estimator, and 2) well below chance for a wide variety of social decision-making tasks, indicating that the over-parameterized representation generalizes well across tasks in practice (Figure 3). The one exception is dominance, which cannot be predicted better than chance by any model; the dominance estimator is therefore not used in any subsequent analysis.

4. APPLICATION: BEAUTY PENALTY AND TRUSTWORTHINESS HALO

Consistent with previous studies, we observe a strong positive correlation between collected attractiveness ratings and social decisions in both social scenarios and economic games (Figure 4A), indicating both an attractiveness halo and a beauty premium. We observe an even stronger positive correlation between trustworthiness ratings and social decisions (Figure 4A), which indicates both a trustworthiness halo (this typically refers to trait perception and decision making in social scenarios) and a trustworthiness premium (this typically refers to decision making in economic games). However, the strong positive correlation between attractiveness and trustworthiness (CC = 0.53, p-value ≤ 0.001) makes it impossible to separate the contributions of attractiveness and trustworthiness to the halo and premium effects from collected responses alone. To tease these contributions apart, we use the mathematical and computational framework developed above. Using leave-one-out cross-validation, we compute predicted social decisions using orthogonalized estimators ($\hat{\beta}_{A \perp T}$ and $\hat{\beta}_{T \perp A}$), then compute the correlation between these predictions ($\hat{y}_{A \perp T}$ and $\hat{y}_{T \perp A}$) and social decisions. Each orthogonalized estimator contains facial feature information unique to one trait (task) and not related to the other trait. To orthogonalize the estimators, we subtract from one estimator its normalized projection onto the other. We calculate the component of $\hat{\beta}_T$ orthogonal to $\hat{\beta}_A$ as $\hat{\beta}_{T \perp A} = \hat{\beta}_T - (\hat{\beta}_A \cdot \hat{\beta}_T)\,\hat{\beta}_A$, where $(\cdot)$ denotes the normalized dot product, and vice versa for the component of $\hat{\beta}_A$ orthogonal to $\hat{\beta}_T$. We observe (Figure 4B) that attractiveness unrelated to trustworthiness is not significantly correlated with any social scenarios (except dating app), while trustworthiness unrelated to attractiveness is significantly correlated with all social scenarios (except dating app).
This indicates the halo effect is driven by trustworthiness, rather than attractiveness, though it appears as an attractiveness effect due to the facial features that contribute to both attractiveness and trustworthiness. We also observe (Figure 4B) that attractiveness unrelated to trustworthiness is significantly anti-correlated with two out of three economic games, while trustworthiness unrelated to attractiveness is significantly correlated with all economic games. Once again, it seems that what masquerades as an attractiveness effect is truly a trustworthiness effect, and that rather than inducing a beauty premium, attractiveness by itself (excluding those facial features also contributing to trustworthiness) induces a beauty penalty. Without teasing apart the two components using feature orthogonalization, the beauty penalty effect is masked by the strong beauty/trustworthiness premium effect.
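The orthogonalization step can be sketched as below. The "normalized dot product" is read here as a projection onto the unit vector along the other estimator, which is an assumption about the normalization; the estimators, features, and decisions are synthetic, and the cross-validation loop of the actual analysis is omitted for brevity.

```python
import numpy as np

def orthogonalize(b, against):
    """Component of b orthogonal to `against` (projection onto the unit vector removed)."""
    u = against / np.linalg.norm(against)
    return b - (u @ b) * u

# Synthetic, correlated "attractiveness" and "trustworthiness" estimators
rng = np.random.default_rng(4)
shared = rng.normal(size=100)
beta_A = shared + 0.5 * rng.normal(size=100)
beta_T = shared + 0.5 * rng.normal(size=100)

beta_A_perp_T = orthogonalize(beta_A, beta_T)   # attractiveness unrelated to trustworthiness
beta_T_perp_A = orthogonalize(beta_T, beta_A)   # trustworthiness unrelated to attractiveness

# Correlate predictions from an orthogonalized estimator with (synthetic) decisions
X = rng.normal(size=(52, 100))
decisions = X @ beta_T + rng.normal(size=52)
cc = np.corrcoef(X @ beta_T_perp_A, decisions)[0, 1]
print("CC between orthogonalized prediction and decisions:", cc)
```

Because both estimators live in the same feature space, the subtraction is well defined, which is the property the analysis relies on.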


Since AAM readily generates faces for any coordinates in the feature space, we can visualize the estimator (regression coefficient) axes and their orthogonalized versions (Figure 5). Visual inspection reveals that both more attractive (top row) and more trustworthy (bottom row) faces smile more, while less attractive faces also appear older. More interestingly, the attractiveness estimator orthogonalized against trustworthiness is no longer related to smiley-ness but appears anti-correlated with age (more youthful-looking faces are more attractive, which has previously been observed (Sutherland et al., 2013)). Notably, the projections of the face stimuli used in the experiment along this dimension are indeed significantly negatively correlated (CC = -0.29, p-value < 0.05) with previously collected age ratings of these faces (Bainbridge et al., 2013), while these projections are also significantly negatively correlated with ratings in economic games (Figure 4). To summarize, the above results imply that the youthfulness-related component of attractiveness drives a "beauty penalty" effect in economic games, while the facial features that drive both attractiveness and trustworthiness perception are what give rise to an attractiveness/trustworthiness halo. In addition, when we orthogonalize trustworthiness against attractiveness, a strong smiley-ness effect remains (just as in the unorthogonalized case), while the age effect mostly disappears. Moreover, we find that this residual component unrelated to attractiveness is still positively correlated with social scenario decisions and economic games.
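The two operations used in this paragraph, stepping through feature space along an estimator axis and projecting stimuli onto that axis, can be sketched as follows. The AAM decoder that turns feature coordinates into face images is not reproduced here; the function names, step size, and synthetic features and age ratings are all hypothetical.

```python
import numpy as np

def walk_along_axis(center, direction, step, k=1):
    """Feature coordinates at equal steps along a unit-normalized estimator axis."""
    u = direction / np.linalg.norm(direction)
    offsets = step * np.arange(-k, k + 1)        # e.g. [-step, 0, +step] for k = 1
    return center + offsets[:, None] * u

def project_onto_axis(features, direction):
    """Signed position of each (centered) stimulus along the estimator axis."""
    u = direction / np.linalg.norm(direction)
    return (features - features.mean(axis=0)) @ u

rng = np.random.default_rng(6)
features = rng.normal(size=(52, 30))             # stand-in for AAM features of 52 faces
direction = rng.normal(size=30)                  # stand-in for an orthogonalized estimator
center = features.mean(axis=0)                   # the "average face" coordinates

triplet = walk_along_axis(center, direction, step=2.0)    # negative, average, positive
projections = project_onto_axis(features, direction)
age_ratings = -0.5 * projections + rng.normal(size=52)    # synthetic ratings
cc = np.corrcoef(projections, age_ratings)[0, 1]
print("CC between projections and age ratings:", cc)
```

Each row of `triplet` would be passed to the generative model to render one face of the kind shown in Figure 5.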

5. DISCUSSION

In this paper, we provided conditions under which an over-parameterized representation is guaranteed to yield optimal, as well as better-than-chance, generalization performance in a linear regression setting with hard constraints. We also provided an exact expression for the estimator error, as well as an exact expression for how far the fully over-parameterized estimator is from the optimal hard regularizer. We next validated the usefulness of our mathematical framework by applying it to a wide range of social decision-making tasks, in which the fully over-parameterized estimator performed within the error bounds (standard error of the mean) of the theoretically optimal estimator on all tasks. We then used this framework to show that 1) the halo effect appears to arise from trustworthiness rather than attractiveness, and 2) trustworthiness unrelated to attractiveness induces a premium in economic games, while attractiveness unrelated to trustworthiness induces a penalty, indicating a trustworthiness premium and beauty penalty, which helps reconcile conflicting reports in the existing literature. Moreover, AAM-based visualization indicated that the trustworthiness halo/premium is underpinned by smiley-ness and the beauty penalty by youthfulness (through the component specifically important for attractiveness). While some of the statistical analyses among traits, social scenarios, and economic games could have been done using only ratings, the ability to ground those ratings in an image-computable and generative model representation is highly valuable. Without the latter, we would not have been able to orthogonalize estimated regression coefficient vectors against one another (orthogonalization makes no sense if feature vectors do not live in the same feature space), or visualize faces along those vectors. Such visualizations (Figure 5) reveal that both trustworthiness and trustworthiness unrelated to attractiveness appear highly correlated with smiling.
This raises the question of how smiling, or emotional states such as happiness, contribute to the halo effect. Such data can be collected in future experiments, and our framework helps to identify concrete directions for future research endeavors. Having a universal, overparameterized representation that serves all tasks can assist with iterative scientific analysis and hypothesis generation, as new experiments are designed, new data are collected, and new conclusions are drawn. One limitation of our framework is that it does not include contributions from non-linear components, which have been found to contribute to trait ratings, including attractiveness (Ryali & Yu, 2018; Todorov & Oosterhof, 2011) and trustworthiness (Todorov & Oosterhof, 2011). A further limitation of our study is that we only focused on female faces. There is evidence that dominance and trustworthiness are rated using gender-based internal models (He & Yu, 2021), which could also be true of social decisions. In addition, a strong correlation between dominance and election success has been established in the literature for male faces (Berinsky et al., 2019). However, a preliminary analysis of dominance ratings collected in the social decision-making experiment reveals no such correlation for female faces (Figure 6), indicating that different traits might contribute to halo effects for female and male faces. These questions remain exciting avenues for future work. The more general limitations of our theoretical framework are that the optimality conditions only hold for hard regularizers, and that our error expressions are for estimator MSE rather than prediction MSE. Extending the general theoretical results to soft regularizers (such as ridge and lasso regression) and to the more practically useful prediction MSE are also exciting future directions.

5.1. RELATED THEORETICAL WORK IN OVER-PARAMETERIZED LINEAR REGRESSION

Our PCR approach might at first glance seem identical to that of Xu & Hsu (2019). However, while Xu & Hsu (2019) analyze what they call an "oracle" estimator, which uses the generative ("true") covariance matrix, we use the more classical version of PCR, which is based on the sample covariance matrix. This results in quite different behavior. For instance, there is no over-parameterized regime in PCR, as $m$ data points can be expressed by at most $m$ linearly independent features ($m - 1$ when the data is centered). As such, there is no "second descent" in PCR. Xu & Hsu (2019) also noted that a full analysis that accounts for estimation errors in PCR remains open, though it is worth noting that an extensive analysis of the under-parameterized regime was done by Park (1981). Also worth noting is that there seems to be a sharp divide between the "classical" under-parameterized and the "modern" over-parameterized regime in the literature, with an understanding of the latter "now only starting to emerge" (Belkin et al., 2020). We offer a different view by showing that any over-parameterized representation has an "equivalent" under-parameterized representation, and as such, the over-parameterized regime can be fully understood in terms of the under-parameterized regime.

$$\ldots = \mathrm{mse}(\hat{\varphi}) \ge \mathrm{mse}(\hat{\varphi}^*) \overset{\text{Park (1981)}}{\ge} \mathrm{mse}(\hat{\theta}^*).$$

Theorem 1. The MSE of the minimum MSE PCR estimator lower bounds the MSE in both the over-parameterized and under-parameterized regimes.

Proof. This follows as the minimum MSE PCR estimator lower bounds the MSE of under-parameterized estimators (Park, 1981), as well as over-parameterized estimators (Lemmas 4 and 5).

Park 1. If the $p$-th singular value ($s_p$) of the design matrix $X$ satisfies

$$s_p^2 \ge \frac{\sigma_\epsilon^2}{\|\beta\|^2 / r},$$

then $\hat{\theta}_p$ is the MMSE PCR estimator (Park, 1981).

Proof. See Park (1981).

Estimator Condition 2. If the smallest singular value ($s_r$) of the design matrix $X \in \mathbb{R}^{m \times n}$ satisfies

$$s_r^2 \ge \frac{\sigma_\epsilon^2}{\|\beta\|^2 / r},$$

then the over-parameterized estimator with all $n$ features is the minimum MSE estimator.

Proof. This follows from Park 1, as well as Theorem 1. Denote the minimum MSE PCR estimator as $\hat{\theta}^*$, the PCR estimator with all PCs as $\hat{\theta}$, and the over-parameterized estimator with all $n$ features as $\hat{\beta}$, and suppose the threshold is satisfied for the $r$-th singular value. Then

$$\mathrm{mse}(\hat{\theta}^*) = \mathrm{mse}(\hat{\theta}) = \mathrm{mse}(\hat{\beta}),$$

which we know from Theorem 1 lower bounds the MSE in both the under-parameterized and over-parameterized regimes. As such, $\hat{\beta}$ is an MMSE estimator.

Error Expression 1. The MSE of the over-parameterized estimator without feature reduction is given by

$$\mathrm{mse}(\hat{\beta}) = \sigma_\epsilon^2 \sum_{i=1}^{r} \frac{1}{s_i^2},$$

where $\sigma_\epsilon^2$ is the noise variance, and $s_1, \ldots, s_r$ are the singular values of the design matrix $X$.

Proof. Note that $\beta$ can be written in terms of $X$, $y$, and $\epsilon$ as $\beta = X^{\dagger}(y - \epsilon)$. Then,

$$\mathrm{mse}(\hat{\beta}) := \operatorname{tr} E[(\beta - \hat{\beta})(\beta - \hat{\beta})^T] = \operatorname{tr} E[(X^{\dagger}\epsilon)(X^{\dagger}\epsilon)^T] = \operatorname{tr} E[X^{\dagger}\epsilon\epsilon^T X^{\dagger T}] = \operatorname{tr}(X^{\dagger} E[\epsilon\epsilon^T] X^{\dagger T}) = \sigma_\epsilon^2 \operatorname{tr}(X^{\dagger} X^{\dagger T}) = \sigma_\epsilon^2 \operatorname{tr}(U \Sigma^{\dagger 2} U^T) = \sigma_\epsilon^2 \operatorname{tr}(\Sigma^{\dagger 2} U^T U) = \sigma_\epsilon^2 \operatorname{tr} \Sigma^{\dagger 2} = \sigma_\epsilon^2 \sum_{i=1}^{r} \frac{1}{s_i^2}.$$

Lemma 6. The variance of the true parameter vector $\beta$ is given by

$$\sigma_\beta^2 = (\sigma_y^2 - \sigma_\epsilon^2) \sum_{i=1}^{r} \frac{1}{s_i^2},$$

where $\sigma_y^2$ is the signal variance, $\sigma_\epsilon^2$ the noise variance, and $s_1, \ldots, s_r$ are the singular values of the design matrix.

Proof.

Estimator Condition 1. If the noise variance in $y$ ($\sigma_\epsilon^2$) is less than or equal to half the signal variance in $y$ ($\sigma_y^2$), then $\hat{\beta}$ is an above-chance estimator, i.e.

$$\sigma_\epsilon^2 \le \frac{\sigma_y^2}{2} \iff \mathrm{mse}(\hat{\beta}) \le \sigma_\beta^2,$$

where $\sigma_\beta^2$ is the variance of the parameter vector $\beta$.

Proof. This follows from Error Expression 1 and Lemma 6.

Proof. Recall that $\theta$ can be written in terms of $Z$, $y$, and $\epsilon$ as $\theta = Z^{\dagger}(y - \epsilon)$. First note that the feature-reduced PCR estimator $\hat{\theta}_p$ is given by

$$\hat{\theta}_p := Z_p^{\dagger} y = Z_p^{\dagger}(Z\theta + \epsilon) = Z_p^{\dagger} Z \theta + Z_p^{\dagger}\epsilon = \Sigma_p^{\dagger} U^T U \Sigma \theta + Z_p^{\dagger}\epsilon = \Sigma_p^{\dagger} \Sigma \theta + Z_p^{\dagger}\epsilon = I_p \theta + Z_p^{\dagger}\epsilon,$$

where $I_p$ is an $r \times r$ matrix with ones on the diagonal for the first $p$ entries and zeros elsewhere.
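The relationship between the PCR estimator and the over-parameterized estimator can be checked numerically. The sketch below uses synthetic data and assumes the economy-size SVD $X = U \Sigma V^T$ of the centered design matrix, with $Z = U\Sigma$; the function name `pcr_estimator` and the singular-value cutoff are illustrative choices.

```python
import numpy as np

def pcr_estimator(U, s, y, p):
    """theta_hat_p = Z_p^+ y = Sigma_p^+ U^T y; components beyond p are set to zero."""
    theta = np.zeros(s.size)
    theta[:p] = (U[:, :p].T @ y) / s[:p]
    return theta

rng = np.random.default_rng(5)
m, n = 12, 40
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
Xc, yc = X - X.mean(axis=0), y - y.mean()

# SVD of the centered design matrix, restricted to the r non-zero singular values
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
keep = s > 1e-10 * s.max()
U, s, Vt = U[:, keep], s[keep], Vt[keep]
r = s.size                                        # m - 1 for centered data

theta_full = pcr_estimator(U, s, yc, r)           # PCR estimator with all PCs
beta_hat = np.linalg.pinv(Xc, rcond=1e-10) @ yc   # over-parameterized estimator

# The full PCR estimator maps to beta_hat through V, and the two have equal norm
same_estimator = np.allclose(Vt.T @ theta_full, beta_hat)
same_norm = np.isclose(np.linalg.norm(theta_full), np.linalg.norm(beta_hat))
print(r, same_estimator, same_norm)
```

Truncating to $p < r$ components simply zeroes the trailing entries of $\hat{\theta}$, which is the hard regularization the theoretical results refer to.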



Figure 2: Overview of the face-based social decision-making experiment. (A)The trait rating tasks (block A), social scenario tasks (block B) and economic games (block C) with sample screenshots from the experiment display. For each task, participants respond on a scale of 1-9. For the social scenario tasks, 1/9 indicates maximal preference for the face on the left/right, while 5 indicates equally preferable. In Prisoner's Dilemma (PD;Kremp et al., 1982)  participants are asked how likely they are to cooperate (rather than defect). In the Ultimatum Game (UG; Solnick and Schweitzer, 1999) and Trust Game (TG;Wilson and Eckel, 2006)  participants are asked how much money (in $) they would invest (TG) or propose (UG). (B) Questions displayed in the five social scenario tasks.

Figure3: Experimental validation of our mathematical framework for several social decision tasks. For all tasks, the fully over-parameterized estimator (orange bars, right) is within the error bounds (standard error of the mean) of the MMSE estimator (blue bars, left) and below chance level (+; variance of collected responses) for all tasks (except dominance, which cannot be predicted better than chance by any model), indicating the over-parameterized representation generalizes well across tasks in practice. Note that the below-chance dominance estimator will not be used in any subsequent analysis.

Figure 4: Heatmap of correlation coefficients (CCs) between social traits (cols. 1-6), social scenario decisions (cols. 10-14), economic games (cols. 7-9) with significance levels (*: p-value ≤ 0.05, **: p-value ≤ 0.01, ***: p-value ≤ 0.001).(A) Both attractiveness (first row) and trustworthiness (second row) are significantly positively correlated with all economic games and social scenarios (except trustworthiness and dating app). However, the strong positive correlation between attractiveness and trustworthiness (CC = 0.53, p-value < 0.001) makes it impossible to tease the contributions of these two traits apart using just the collected responses. (B) When attractiveness is unrelated to trustworthiness (first row), the significant positive correlation with social scenarios disappears (except dating app), dispelling an attractiveness halo effect. The positive correlation with two of the three economic games (PD and TG) also becomes significantly negative, indicating a beauty penalty. When trustworthiness is unrelated to attractiveness (second row), on the other hand, the significant positive correlation remains. This shows a trustworthiness halo effect in social scenario decisions, as well as a trustworthiness premium in economic games. Note that there is a significant anti-correlation between the attractiveness and unrelated trustworthiness (row 1, col. 2), indicating non-linear effects in the responses, which cannot be captured by the linear models.

Figure 5: Face visualization along regression coefficient (estimator) directions without (A) and with (B) orthogonalization. Top row: attractiveness; bottom row: trustworthiness. The middle face in each triplet is the average face (corresponding to feature coordinates averaged over all 52 faces used in the study). The visualization moves in equal steps along the estimator axes (left: negative direction, right: positive direction).
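The caption for Figure 5 does not spell out how the orthogonalization or the stepping along an estimator axis is carried out. As a minimal sketch, assuming orthogonalization means removing from one estimator its projection onto the other (a single Gram-Schmidt step), and using random vectors as hypothetical stand-ins for the fitted estimators and the average face, the two operations might look as follows. The names `orthogonalize`, `faces_along`, `beta_attract`, and `beta_trust` are illustrative and do not come from the paper.

```python
import numpy as np

def orthogonalize(a, b):
    """Remove from `a` its projection onto `b` (one Gram-Schmidt step)."""
    b_unit = b / np.linalg.norm(b)
    return a - (a @ b_unit) * b_unit

def faces_along(mean_features, direction, step_size, n_steps=1):
    """Feature vectors at equal steps along a unit-normalized direction.

    Row `n_steps` of the result is the mean face itself.
    """
    unit = direction / np.linalg.norm(direction)
    offsets = np.arange(-n_steps, n_steps + 1) * step_size
    return mean_features + offsets[:, None] * unit

# Hypothetical stand-ins: random vectors instead of fitted estimators.
rng = np.random.default_rng(0)
n_features = 10
mean_features = rng.normal(size=n_features)
beta_attract = rng.normal(size=n_features)
beta_trust = 0.5 * beta_attract + rng.normal(size=n_features)

# Attractiveness direction with its trustworthiness component removed.
attract_only = orthogonalize(beta_attract, beta_trust)

# Triplet of feature vectors: negative step, average face, positive step.
triplet = faces_along(mean_features, attract_only, step_size=2.0)
```

Under this reading, each row of `triplet` would be passed to a face generator to render one face of the triplet; the generator itself is outside the scope of this sketch.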

Figure 6: Heatmap of CCs between dominance and social traits (cols. 1-6), social scenario decisions (cols. 10-14), and economic games (cols. 7-9), with significance levels (*: p-value ≤ 0.05, **: p-value ≤ 0.01, ***: p-value ≤ 0.001). None of the CCs between dominance and social scenarios/economic games are significant, indicating that dominance does not significantly contribute to decisions in these tasks. As there is an established correlation between dominance and election for male faces in the literature, this suggests that different traits might contribute to halo effects for male and female faces.

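$$
\begin{aligned}
\operatorname{tr} E[\beta\beta^T] &= \operatorname{tr} E\big[X^\dagger (y-\epsilon)(y-\epsilon)^T X^{\dagger T}\big] \\
&= \operatorname{tr}\big(X^\dagger E[(y-\epsilon)(y-\epsilon)^T] X^{\dagger T}\big) \\
&= \operatorname{tr}\big(X^\dagger (E[yy^T] - E[\epsilon\epsilon^T]) X^{\dagger T}\big) \\
&= (\sigma_y^2 - \sigma_\epsilon^2) \operatorname{tr}(X^\dagger X^{\dagger T})
\end{aligned}
$$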
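The third equality above drops the cross terms between $y$ and $\epsilon$. A short sketch of why this is justified, assuming the linear model $y = X\beta + \epsilon$ with $\beta$ and $\epsilon$ zero-mean and uncorrelated, and assuming isotropic second moments $E[yy^T] = \sigma_y^2 I_m$ and $E[\epsilon\epsilon^T] = \sigma_\epsilon^2 I_m$ for the final step (these assumptions are inferred from the regression setting and are not stated next to the equation):

```latex
\begin{aligned}
E[y\epsilon^T] &= X\,E[\beta\epsilon^T] + E[\epsilon\epsilon^T] = E[\epsilon\epsilon^T], \\
E[(y-\epsilon)(y-\epsilon)^T] &= E[yy^T] - E[y\epsilon^T] - E[\epsilon y^T] + E[\epsilon\epsilon^T] \\
&= E[yy^T] - E[\epsilon\epsilon^T] \\
&= (\sigma_y^2 - \sigma_\epsilon^2)\, I_m .
\end{aligned}
```

Substituting the last line between $X^\dagger$ and $X^{\dagger T}$ and pulling the scalar out of the trace gives the final expression.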

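The MSE of the feature-reduced PCR estimator $\hat{\theta}_p$ (with $p \leq r$ coefficients) is given by

$$
\begin{aligned}
\operatorname{mse}(\hat{\theta}_p) &:= \operatorname{tr} E\big[(\theta - \hat{\theta}_p)(\theta - \hat{\theta}_p)^T\big] \\
&= \operatorname{tr} E\big[((I_r - I_p)\theta + Z_p^\dagger \epsilon)(\epsilon^T Z_p^{\dagger T} + \theta^T (I_r - I_p)^T)\big] \\
&= \operatorname{tr} E\big[(I_r - I_p)\theta\theta^T (I_r - I_p)^T\big] + \operatorname{tr} E\big[Z_p^\dagger \epsilon\epsilon^T Z_p^{\dagger T}\big] \\
&= \operatorname{tr}\big((I_r - I_p) E[\theta\theta^T] (I_r - I_p)^T\big) + \sigma_\epsilon^2 \operatorname{tr} \Sigma_p^\dagger \\
&= \frac{\|\theta\|^2}{r} \operatorname{tr}(I_r - I_p)^2 + \sigma_\epsilon^2 \operatorname{tr} \Sigma_p^\dagger
\end{aligned}
$$

Suppose the MMSE PCR estimator $\hat{\theta}^*$ has $p \leq r$ components. Then the difference in MSEs between the MMSE estimator and the fully over-parameterized estimator is given by

$$
\operatorname{mse}(\hat{\beta}) - \operatorname{mse}(\hat{\theta}^*) = \sigma^2 \dots
$$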

A APPENDIX

A.1 PROOFS

A.1.1 PRELIMINARIES

Recall, we consider an over-parameterized linear regression setting ($n \geq m$), with both $\beta \in \mathbb{R}^n$ and $\epsilon \in \mathbb{R}^m$ assumed to be zero-mean and i.i.d., and the rank of $X \in \mathbb{R}^{m \times n}$ denoted by $r$. We use the fact that any matrix $X$ can be written in terms of its singular value decomposition as $X = U \Sigma V^T$, and $X$ expressed in its principal component (PC) representation is simply $U\Sigma$, which we denote by $Z$. We also use the PCR setting, where $Z \in \mathbb{R}^{m \times r}$ is the design matrix expressed in terms of its PCs and $\hat{\theta}$ is the PCR estimator.

Note on terminology. By the MMSE estimator, we mean the minimum mean squared error estimator among the class of hard regularizers subject to linear constraints (this does not include soft regularizers, such as ridge and lasso estimators).

A.1.2 PROOFS

Lemma 1. The PCR estimator with all PCs is an orthogonal transformation of the over-parameterized estimator with all features.

Proof. This follows from the definitions. Let $\hat{\theta}$ denote the PCR estimator and $\hat{\beta}$ denote the over-parameterized estimator. Then,

$$
\hat{\beta} = X^\dagger y = V \Sigma^\dagger U^T y = V (U\Sigma)^\dagger y = V Z^\dagger y = V \hat{\theta}.
$$

Lemma 2. Estimator MSE is invariant under orthogonal transformations.

Proof. Let $\hat{\beta}$ denote the estimator and $V$ denote an orthogonal matrix. Then,

$$
\operatorname{mse}(V\hat{\beta}) = \operatorname{tr} E\big[V(\beta - \hat{\beta})(\beta - \hat{\beta})^T V^T\big] = \operatorname{tr}\big(V^T V\, E[(\beta - \hat{\beta})(\beta - \hat{\beta})^T]\big) = \operatorname{mse}(\hat{\beta}).
$$

Lemma 3. The MSE of the PCR estimator with all PCs equals the MSE of the over-parameterized estimator with all features.

Proof. This follows from Lemmas 1 and 2. Let $\hat{\theta}$ denote the PCR estimator and $\hat{\beta}$ denote the over-parameterized estimator. Then,

$$
\operatorname{mse}(\hat{\theta}) \overset{(1)}{=} \operatorname{mse}(V\hat{\theta}) \overset{(2)}{=} \operatorname{mse}(\hat{\beta}).
$$

Lemma 4. The MSE of the minimum MSE PCR estimator lower bounds the MSE of the over-parameterized estimator with all $n$ features.

Proof. This follows from the definition of the minimum MSE, as well as Lemma 3. Denote the minimum MSE PCR estimator as $\hat{\theta}^*$, the PCR estimator with all PCs as $\hat{\theta}$, and the over-parameterized estimator with all features as $\hat{\beta}$. Then,

$$
\operatorname{mse}(\hat{\theta}^*) \leq \operatorname{mse}(\hat{\theta}) \overset{(3)}{=} \operatorname{mse}(\hat{\beta}).
$$

Lemma 5. The MSE of the minimum MSE PCR estimator lower bounds the MSE of an over-parameterized, feature-reduced estimator.

Proof. This follows from Lemma 3, as well as Park (1981). Let $X_n$ be the original design matrix with all $n$ features, and denote the minimum MSE PCR estimator of this design matrix as $\hat{\theta}^*$. Let $X_p$ be a feature-reduced design matrix, expressed in any $p < n$ of the original features, and denote the $p$-dimensional estimator of $X_p$ as $\hat{\alpha}$. Denote $\hat{\varphi}^*$ and $\hat{\varphi}$ as the minimum MSE PCR estimator and the PCR estimator with all PCs, respectively. Note that these estimators are under-parameterized estimators. It then follows that,

