CROSS-QUALITY FEW-SHOT TRANSFER FOR ALLOY YIELD STRENGTH PREDICTION: A NEW MATERIAL SCIENCE BENCHMARK AND AN INTEGRATED OPTIMIZATION FRAMEWORK

Abstract

Discovering high-entropy alloys (HEAs) with high yield strength is an important yet challenging task in material science. However, yield strength can only be measured accurately through expensive and time-consuming real-world experiments, and hence cannot be acquired at scale. Learning-based methods could facilitate the discovery process, but the lack of a comprehensive dataset on HEA yield strength has created barriers. We present X-Yield, a large-scale material science benchmark with 240 experimentally measured ("high-quality") and over 100K simulated (imperfect or "low-quality") HEA yield strength annotations. Due to the scarcity of experimental annotations and the quality gap in imperfectly simulated data, existing transfer learning methods cannot generalize well on our dataset. We address this cross-quality few-shot transfer problem by leveraging model sparsification "twice": as a noise-robust feature learning regularizer at the pre-training stage, and as a data-efficient learning regularizer at the few-shot transfer stage. While the workflow already performs decently with ad-hoc sparsity patterns tuned independently for either stage, we take a step further by proposing a bi-level optimization framework, termed Bi-RPT, that jointly learns optimal masks and automatically allocates sparsity levels for both stages. The optimization problem is solved efficiently using gradient unrolling, which is seamlessly integrated with the training process. The effectiveness of Bi-RPT is validated through extensive experiments on our new challenging X-Yield dataset, alongside other synthesized testbeds. Specifically, we achieve an 8.9 ∼ 19.8% reduction in the test mean squared error and a 0.98 ∼ 1.53% gain in test accuracy, using merely 5-10% of the experimental data. Code and sample data are in the supplement.

1. INTRODUCTION

Machine learning (ML) methods have recently demonstrated great promise in the important field of material science, and in this paper, we focus on ML-assisted high-entropy alloy (HEA) (Yeh et al., 2004) discovery and property prediction. HEAs possess promising properties that traditional alloys lack, such as extraordinary mechanical performance at high temperatures, making them well-suited options for various material applications. One property in particular, the yield strength of an HEA, characterizes the maximum stress a material can endure before starting to deform, which is a critical parameter for customized HEA design. However, in order to accurately measure the yield strength of specific HEAs, expensive scientific experiments need to be conducted for each alloy, often involving hard-to-create experimental conditions, especially at high temperatures (mainly caused by difficulties with oxidation control), as well as extremely long experimental durations. At high temperatures, these measurements are typically taken with the Gleeble system (Gle). The process from sample preparation to yield strength measurement can take two to four weeks even for a team of domain experts, including melting the alloy, machining the sample, and preparing and mechanically testing it with the Gleeble. Therefore, it is challenging to acquire yield strength measurements from those "high-quality" experiments at scale. Similar to trends in computer vision (Tremblay et al., 2018), recent efforts attempt to mitigate the scarcity of real-world measurements with ML-based predictors that directly predict yield strength from alloy inputs (Bhandari et al., 2021a); such predictors can be trained on simulated data. Indeed, material science applications are often blessed with well-developed simulation models, e.g., Maresca & Curtin (2020).
However, such a blessing is often compromised by the domain gap between the simulated data and the "ground-truth" experimental data, often due to many inevitable simplifications in simulation modeling. For example, the yield strength of a material can vary greatly based on processing and testing conditions as well as grain size and texture (Toda-Caraballo et al., 2014; Lin et al., 2014); yet simulation models commonly rely on properties intrinsic to the alloy and do not incorporate variations in experimental conditions. The lack of public datasets in this field also renders it difficult to benchmark ML models' progress. In this paper, we start by curating a large-scale benchmark, called X-Yield, that for the first time combines experimental data with simulation data to address the problem of predicting yield strength in HEAs. While experimental data are always preferred since they are "high-quality" ground truths, it is impractical to generate them in high quantities, especially for capturing yield strength at elevated temperatures. Thus, simulation data can be acquired in massive quantities to fill the gap, despite their relatively "low quality" due to inherent model misspecification or simplification. The low-quality simulation data were selected to represent ternary-septenary systems from an eleven-element palette consisting of mostly refractory elements (Al-Cr-Fe-Hf-Mo-Nb-Ta-Ti-V-W-Zr). While there are existing experimental databases (Borg et al., 2020) and models to predict high-temperature yield strength in HEAs (Maresca & Curtin, 2020), to our best knowledge, this is the first multi-fidelity dataset in the public domain that combines real experimental measurements and large quantities (over 100K) of simulation data for mechanical property prediction in HEAs. This specialized dataset should enable prediction of high-temperature yield strength across a broad range of HEAs.
The predictions of this model could be used to pinpoint which alloys are the strongest at elevated temperatures, allowing experiments to focus on pre-sorted candidates for future study and eliminating the need to spend several weeks testing a candidate without promise.

Figure 1: Proposed two-stage workflow. The HEA yield strength prediction model is first pre-trained on massive "low-quality" simulation data, and is then fine-tuned/transferred on few-shot "high-quality" experimental data to optimize its prediction in this target domain. Note that the tool of sparsity is leveraged in both the pre-training and fine-tuning stages, for the purposes of gaining noise robustness/transferability and enhancing data efficiency, respectively.

The new X-Yield benchmark is set to facilitate ML for HEA yield strength prediction, but learning from such a multi-fidelity dataset is highly non-trivial. To this end, we next conceptualize a cross-quality few-shot transfer workflow: first pre-training the prediction model on the data-rich yet "low-quality" source domain (simulated data), and then fine-tuning the model towards the data-scarce yet "high-quality" target domain (experimental data). However, this vanilla workflow is challenged by two issues: a significant quality gap between the source and target domains, and the extreme scarcity of target data. Inspired by the recent success of sparsity regularizers, we propose to incorporate sparsity to regularize both stages: sparsifying pre-training to improve the robustness and cross-domain transferability of learned features (Guo et al., 2018; Sehwag et al., 2019; Chen et al., 2022; Sehwag et al., 2020; Ding et al., 2022; Diffenderfer et al., 2021), and sparsifying fine-tuning to overcome data shortfalls (Liu et al., 2020; Chen et al., 2021; Tao et al., 2022). We demonstrate through proof-of-concept experiments that even the simplest magnitude-based weight pruning can play an effective regularization role in our workflow.
Furthermore, to avoid the ad-hoc two-step pruning as well as trial-and-error sparsity ratio selection at either stage, we propose a novel integrated optimization framework termed Bi-Level Regularized Pre-training and Transfer (Bi-RPT), which jointly learns optimal sparse masks and automatically allocates sparsity levels for both stages. Our main contributions are summarized as follows:

• Dataset: We present X-Yield, the first public large-scale, multi-quality material science benchmark for HEA yield strength prediction, containing alloys' compositions, processing temperatures, and yield strengths. Specifically, the yield strengths of 240 HEAs are experimentally measured, while those of the remaining samples (over 100K) are calculated by simulations.

• Methodology: We formulate a cross-quality few-shot transfer workflow that can jointly exploit the simulated and experimental data for accurate predictions, and we innovate by leveraging sparsity to address both the simulated/experimental domain gap and the scarcity of experimental data. While ad-hoc magnitude-based weight pruning is already found to be helpful, we further formulate an integrated bi-level optimization framework called Bi-RPT to automate optimal sparse mask generation and sparsity ratio allocation at both the pre-training and fine-tuning stages.

• Results: Extensive experiments show that Bi-RPT boosts performance on the X-Yield benchmark alongside other synthesized testbeds. In particular, for the yield strength regression task, we achieve a reduction of 8.9 ∼ 19.8% on the test mean squared error by merely using 5-10% of the available experimental data. For the yield strength classification task, we achieve a 0.98 ∼ 1.53% improvement in test accuracy.

2. RELATED WORK

2.1. MACHINE LEARNING IN MATERIALS RESEARCH

ML has been applied to solve a wide range of problems in materials science, ranging from inorganic chemistry (Kailkhura et al., 2019) to sustainability (Gomes et al., 2021) and metallurgy (Stan et al., 2020), typically to predict material properties and accelerate simulations (Pilania, 2021). In both cases, ML techniques are hailed for reducing computational time compared to traditional materials science methods and are typically fast to develop (Wei et al., 2019). More recently, deep learning has been successfully applied to problems in the field of HEAs, in particular to predict phase formation (Lee et al., 2021b; Zhu et al., 2022). These approaches provide significant speedups over phase predictions with CALculation of PHAse Diagrams (CALPHAD) (Saunders & Miodownik, 1998), density functional theory (Parr, 1983), and molecular dynamics methods (Shuichi, 1991) commonly used in materials science. Other properties predicted with deep learning include crystal structures, elastic constants (Liu et al., 2023), and hardness (Bhandari et al., 2021b). The yield strength of HEAs specifically has also been explored with deep learning (Liu et al., 2023; Bhandari et al., 2021a). However, a majority of these efforts are restricted to the development of specific alloys (Zheng et al., 2021; Bhandari et al., 2021b) or to alloys consisting solely of transition metals (Wen et al., 2019), and many studies use only a small experimental dataset for prediction (Wen et al., 2021). A generalized multi-fidelity ML model to predict HEA yield strength at scale remains absent yet highly demanded.

2.2. SPARSITY REGULARIZATION IN DEEP LEARNING

Sparsity or pruning was traditionally treated as a mainstream model compression approach in deep learning (Han et al., 2015) . Recently, sparse regularizers have been increasingly used to enhance deep model robustness to various noise, malicious attacks, and distribution shifts. Guo et al. (2018) ; Sehwag et al. (2019) ; Gui et al. (2019) studied the intrinsic relationship between pruning and adversarial robustness. Recently, Diffenderfer et al. (2021) comprehensively demonstrated the benefit of model sparsification to improve robustness to distributional shifts (Hendrycks & Dietterich, 2019; Bulusu et al., 2020) . Sparse regularizers also exhibit promise in improving data efficiency. For example, Zheng et al. (2019) ; Liu et al. (2020) proposed to learn model pruning strategies for few-shot learning; Tian et al. (2020) combined model sparsification with meta-learning to improve few-shot performance. Sparse regularizers have even been proven effective beyond few-shot image classification, such as enhancing the data efficiency in image generation (Chen et al., 2021) .

2.3. BI-LEVEL OPTIMIZATION

Bi-level optimization is a hierarchical framework where the variables in the upper-level optimization problem depend on the lower-level problem. Finn et al. (2017); Rajeswaran et al. (2019) formulated the meta-learning problem as bi-level optimization and solved it using first-order approximations. Other applications of bi-level optimization include data and label poisoning (Mehra et al., 2021; Huang et al., 2020) and adversarial training (Zhang et al., 2021). In this work, we utilize bi-level optimization to formulate our two-stage workflow with sparsification, finding each stage's optimal weights and sparse masks while accounting for their sequential dependency.

3. X-YIELD: A NEW BENCHMARK FOR HEA YIELD STRENGTH PREDICTION

Overview. Conventional alloys typically have one principal element with small amounts of other elements added to improve material properties (Ye et al., 2016), while HEAs can have multiple principal elements. The discovery of HEAs opened the door to a significantly wider design space to explore, most of which has yet to be examined (Miracle & Senkov, 2017). To address the task of using ML to predict HEA yield strength, we focus on the sub-field of refractory HEAs (RHEAs). These materials have been demonstrated to maintain excellent mechanical properties at high temperatures (Li et al., 2020), making them ideal candidates for hypersonics and aerospace industry applications. Prior work adopting ML to predict RHEA properties either uses solely experimental data (Wen et al., 2021), or restricts predictions to only transition metals (Wen et al., 2019) or specific alloys such as MoNbTaTiW (Bhandari et al., 2021a). Hence, a generalizable ML prediction model for a broad range of RHEAs is still absent. As mentioned earlier, it is impractical to generate high quantities of experimental data, especially for capturing yield strength at elevated temperatures. There are also challenges specific to high-temperature measurements, such as controlling oxidation, confirming the heating profile and gradient within the samples, and the use of more challenging experimental techniques (crosshead displacement) than those at lower temperatures (extensometers). This work develops X-Yield, the first publicly available, multi-fidelity dataset consisting of over 100K low-quality simulated points and 240 experimental data points to explore the RHEA design space. In this study alone, the entire composition space of all ternary to septenary systems from the Al-Cr-Fe-Hf-Mo-Nb-Ta-Ti-V-W-Zr family is examined.
Since obtaining real high-temperature yield strength data is challenging, a majority of the experimental yield strength data in the literature was taken close to room temperature (Borg et al., 2020) even though there is more interest in RHEA properties at the high-temperature end (Miracle & Senkov, 2017) . From X-Yield, a multi-fidelity ML model is expected to be trained to predict high-temperature yield strength for a broad palette of RHEAs. The combination of high-temperature yield strengths from the simulated dataset and experimental input can generate an ML model to accurately and efficiently predict high-temperature yield strengths of alloys not included in the training set.

Dataset Construction

The yield strength of the simulation data was predicted using the analytic and parameter-free mechanistic yield strength model developed by Maresca & Curtin (2020). This model describes body-centered cubic (BCC) multi-principal element alloy (MPEA) solid-solution strengthening associated with edge dislocations, in terms of elemental atomic volumes and elastic moduli. The yield strength was predicted for all ternary (1% increments), quaternary (1% increments), quinary (5% increments), senary (5% increments), and septenary (5% increments) alloys from the Al-Cr-Fe-Hf-Mo-Nb-Ta-Ti-V-W-Zr element family, at temperatures between 300K and 2500K in increments of 100K. This resulted in over three billion data points, of which approximately 100,000 were randomly selected for inclusion in this study. Note that even this advanced simulation model suffers from notable oversimplification and data quality issues. For example, phase stability and dislocation character were not used to filter alloys in the study, and the model may overpredict the yield strength of alloys with non-BCC phases and underpredict the yield strength of alloys with a different dislocation character, e.g., screw. The high-quality experimental dataset was carefully filtered and curated from the database generated by Borg et al. (2020), which contains mechanical property information for MPEAs. We extracted all data points that consisted solely of elements from the above element family, contained only BCC phases, were measured at temperatures above 20°C, and included a yield strength value.

Dataset Characteristics and "Quality Gap". As depicted in Figure 2, the simulation and experimental yield stresses have different distributions. In the low-quality simulation data, a considerable portion of yield stress annotations is greater than 2, while the experimental data hardly contain yield stress points beyond 2 (with one datapoint exception) due to experimental condition constraints.
The distribution of the simulated yield stress is also significantly more skewed than that of the experimental data. Pairwise visualization of the yield stress on the 240 high-quality experimental samples suggests a substantial deviation between the simulation and experimental results. The distributions of the processing temperatures are also heterogeneous: the simulation data presents a uniform pattern, while the temperatures in the conducted experiments are bimodal. These observations showcase the domain shift, or "quality gap", between simulations and experiments.
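As a concrete illustration of the scale of the simulated grid described under Dataset Construction, the composition enumeration for a single ternary system can be sketched as follows. This is an illustrative helper, not the authors' generation code; the function names are ours.

```python
from itertools import combinations
from math import comb


def count_compositions(n_elements: int, step_pct: int) -> int:
    """Number of ways to split 100% into `n_elements` strictly positive
    parts on a `step_pct`% grid (stars-and-bars on 100/step_pct units)."""
    units = 100 // step_pct
    return comb(units - 1, n_elements - 1)


def enumerate_ternary(elements, step_pct=1):
    """Yield (element_triple, percentages) pairs for one ternary system."""
    for a in range(step_pct, 100, step_pct):
        for b in range(step_pct, 100 - a, step_pct):
            c = 100 - a - b
            if c >= step_pct:
                yield elements, (a, b, c)


palette = ["Al", "Cr", "Fe", "Hf", "Mo", "Nb", "Ta", "Ti", "V", "W", "Zr"]
ternary_systems = list(combinations(palette, 3))  # 165 element triples
per_ternary = count_compositions(3, 1)            # 4851 compositions each
temperatures = range(300, 2501, 100)              # 23 temperature levels
```

With 1% increments, each of the 165 ternary systems alone contributes 4851 compositions at each of 23 temperatures; repeating the count for quaternary through septenary grids is what pushes the raw pool past three billion points, from which ~100K were subsampled.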

4. CROSS-QUALITY FEW-SHOT TRANSFER: A TWO-STAGE WORKFLOW AIDED BY SPARSITY (TWICE)

In this section, we first introduce the basic two-stage workflow, upon which we propose sparsification methods (a vanilla approach, "Hand-Tune", and an improved principled framework, "Bi-RPT").

Basic Two-Stage Workflow: Pre-training then Fine-tuning. Let us denote the high-quality target domain (experimental data) by D_t, and the low-quality source domain (simulated data) by D_s. Our goal is to learn a generalizable predictor over D_t while leveraging the aid of D_s. One naive idea is to simply combine the two data domains and jointly train a supervised model. However, the large domain gap between D_s and D_t, as well as the sample scarcity in D_t, causes the jointly trained predictor to fit D_t poorly. Instead, we formulate our workflow as a two-stage pipeline: first pre-training a model on D_s, and then fine-tuning it to optimize the prediction over D_t.

Incorporating Bi-Stage Sparsity: A Vanilla Approach. The features learned from D_s will inevitably suffer from domain gap and noise when applied to D_t, and the extreme data scarcity of D_t remains another challenge. Inspired by the recent success of sparse regularizers in improving both robustness/transferability and data efficiency, we incorporate sparsity into both stages to address these two-fold challenges. We first prove our concept with a vanilla ad-hoc approach, which we refer to as Hand-Tune. Starting from pre-training over D_s, we perform standard iterative magnitude pruning (IMP) (Frankle & Carbin, 2019) during pre-training. In particular, we alternate between (re-)training and pruning; each time, we prune the 20% smallest-magnitude weights from the existing non-zero weights by default and continue (re-)training the remaining non-zero weights. This "prune-and-retrain" routine is repeated for N_s rounds to obtain the final sparse mask m_s (1 denotes a non-zero element and 0 a pruned one) associated with the pre-trained model weights.
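The magnitude-based pruning step of the routine above can be sketched as follows; this is a minimal NumPy illustration of one IMP round (the retraining in between is omitted), and the function name is ours rather than from the released code.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, mask: np.ndarray, frac: float = 0.2) -> np.ndarray:
    """One IMP pruning step: zero out the `frac` smallest-magnitude weights
    among the currently unpruned (mask == 1) entries. Already-pruned entries
    stay pruned; ties at the threshold are all removed."""
    alive = np.abs(weights[mask == 1])
    k = int(frac * alive.size)
    if k == 0:
        return mask.copy()
    threshold = np.sort(alive)[k - 1]  # k-th smallest surviving magnitude
    new_mask = mask.copy()
    new_mask[np.abs(weights) <= threshold] = 0
    return new_mask
```

Repeating retrain-then-prune for N_s rounds leaves roughly 0.8^{N_s} of the weights non-zero, which is how N_s controls the sparsity allocated to the pre-training stage.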
Then, we move on to fine-tuning over D_t, and start another round of IMP on top of the pre-trained model; note that this second-stage IMP continues only on the subset of currently non-zero weights, i.e., the 1-valued regions of m_s. IMP in fine-tuning repeats for another N_t rounds (with the identical protocol as the first stage), yielding another sparse mask m_t. The final model uses the joint sparse mask m_s ⊙ m_t, where ⊙ represents the point-wise product. Hereby, N_s and N_t are hyperparameters that control the sparsity allocation between the two stages. Intuitively, while a certain amount of sparsity may contribute to noise resilience, an overly large N_s will cause the pre-trained model to be over-sparsified, limiting its capacity to learn sufficiently informative and transferable features. Fine-tuning has a similar trade-off. Therefore, N_s and N_t have to be manually tuned for the two-stage workflow to achieve good performance (see Appendix B.2).

Principled Bi-Stage Sparsity Integration with Bi-RPT. Hand-Tune has some apparent flaws: (1) it removes weight elements merely using weight magnitude information, which is not explicitly task-driven; (2) the two sparse masks m_s and m_t are decided sequentially rather than jointly optimized, e.g., learning m_t passively suffers from any artifact in learning m_s; (3) the sparsity ratios assigned in both stages, as controlled by N_s and N_t, need to be manually tuned, without any obvious insight beyond exhaustive hyperparameter search. We therefore devise a more principled framework that can jointly learn the optimal sparse masks as well as the sparsity allocation for both stages, termed Bi-Level Regularized Pre-training and Transfer (Bi-RPT). The optimization problem is expressed as follows (γ is a coefficient):

min_{θ, m_s, m_t}  E_{(x_t, y_t) ∼ D_t} [ L_t((m_s ⊙ m_t) ⊙ θ, x_t, y_t | θ*, m_s*) ] + γ R(m_s* ⊙ m_t)    (1)

s.t.  {θ*, m_s*} = argmin_{θ, m_s}  E_{(x_s, y_s) ∼ D_s} L_s(m_s ⊙ θ, x_s, y_s),    (2)

where L_s and L_t represent the objective functions of the two stages, respectively, θ represents the model's parameters, and R represents the sparsity regularizer. Although seemingly complicated at first glance, the bi-level formulation of Bi-RPT admits a clear "workflow" interpretation. Let us start from the lower-level problem (2), which instantiates the sparsity-regularized pre-training stage over D_s: its outputs include the pre-trained weights θ* and the corresponding sparse mask m_s*. The upper-level problem (1) then depicts the sparsity-regularized fine-tuning over D_t, which inherits both θ* and m_s* as its starting point. It continues to modify the weights as well as to evolve another sparse mask m_t. Eventually, a sparsity-promoting function R enforces the total sparsity over the joint mask m_s ⊙ m_t, and the final model weights can be represented as (m_s ⊙ m_t) ⊙ θ. Importantly, the lower- and upper-level problems in Bi-RPT are solved in an end-to-end manner: even though the fine-tuning depends on θ* and m_s*, it can in turn provide feedback for adjusting them, so a synergistic optimization is achieved between the two stages. The sparse mask selection now directly hinges on the end task (the target-domain loss L_t) rather than heuristics such as weight magnitudes. Lastly, the sparsity levels of m_s and m_t need not be separately designated nor manually controlled: we automatically learn the sparsity ratio allocation, under only the total sparsity regularizer R. To practically solve the bi-level optimization of Bi-RPT, we derive algorithms whose details can be found in Appendix A. For the sparsity regularizer R, we adopt a smoothed ℓ0 term (Guo et al., 2021) to facilitate differentiable training: a gate function g_ε(x) = x²/(x² + ε), whose outputs are almost binary when ε is small, is used.

In general, for the lower-level optimization problem, we update the model's parameters θ by gradient descent to minimize L_s; for the upper-level problem, we utilize gradient unrolling to develop the update rules for θ.
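The smoothed ℓ0 gate described above admits a one-line implementation. The sketch below is illustrative (function names are ours): the gate is applied element-wise to a mask, and summing its outputs gives a differentiable surrogate for the number of non-zero entries, i.e., the R term in Eqn. (1).

```python
import numpy as np


def gate(x: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """Smoothed l0 gate g_eps(x) = x^2 / (x^2 + eps): differentiable,
    equal to 0 at x = 0 and close to 1 when |x| >> sqrt(eps)."""
    return x ** 2 / (x ** 2 + eps)


def sparsity_regularizer(mask: np.ndarray, eps: float = 1e-4) -> float:
    """Differentiable surrogate for the count of non-zero mask entries."""
    return float(gate(mask, eps).sum())
```

As eps shrinks, the gate's output approaches a hard 0/1 indicator, so minimizing the regularizer pushes mask entries toward exact zeros while keeping the loss differentiable.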

4.1. PROOF-OF-CONCEPT EXPERIMENTS ON IMAGE DATA

For proof-of-concept, we conduct experiments on a synthesized image classification testbed to compare Hand-Tune and Bi-RPT. We adopt two source-domain dataset options: ImageNet (Deng et al., 2009) and ImageNet-C (Hendrycks & Dietterich, 2019), the latter noisier and corrupted. Two target-domain options are likewise considered: CUB-200 (Wah et al., 2011) and a few-shot subset of CUB-200, the latter designed to be rigorously "few-shot" with only 10 training samples per class. Different combinations of D_s/D_t allow us to conduct controlled experiments for stress-testing the noise robustness as well as the data efficiency of the various algorithm options. For the methods involving IMP, we hand-select the sparsity ratio(s) for either or both stages that yield the highest generalization performance on D_t, i.e., via hyperparameter grid search with cross-validation. Table 1 reports the accuracies of all methods over the various source/target combinations, on the same CUB-200 testing set. All methods use the same ResNet-18 backbone. We highlight several key observations: (1) incorporating D_s in general helps both target settings, and the improvement margin is much more substantial in the few-shot case; (2) models trained by Mix Training fail to generalize on D_t; in fact, they perform even worse than No Pretraining, showcasing the negative influence of the quality gap; (3) within the same pre-training-then-fine-tuning regime, adding appropriate sparsity helps, and two-stage sparsity helps more; (4) Bi-RPT stably outperforms Hand-Tune (most notably in the few-shot cases), despite our best efforts in tuning the latter's hyperparameters. More observations and analysis can be found in Appendix B (Tables A5-A8, and Figure A4), including but not limited to the backfiring effect of "over-sparsification" and the compound influence of per-stage IMP sparsity allocation in Hand-Tune.

Task Definition.
The most naturally defined task on X-Yield is regression, i.e., predicting the yield strength of alloys and computing the error between the model prediction and the "ground truth" (experimental results). Besides the regression task, we formulate a surrogate classification task by constructing five categorical labels based on the bin intervals in which the ground-truth yield strengths fall. These intervals are: [0, 0.5), [0.5, 1), [1, 1.5), [1.5, 2), and [2, ∞).

Data Representations. We featurize each HEA by mapping its composition and temperature into a "pseudoimage" (please refer to Appendix B.5 and Figure A5). The pseudoimages have two channels: the first channel is constructed from the alloy's composition using the randomized periodic table structure (Feng et al., 2021). As the temperatures are originally recorded in Kelvin, we convert and normalize them by T_normalized = (K - 273.15)/2000, where K is the temperature in Kelvin, and embed the converted temperature as the second channel of the pseudoimage.

Architectures and Baselines. The ML predictor we use is a convolutional neural network. It consists of 3 convolutional layers, each with a kernel size of 3, followed by Batch Normalization (Ioffe & Szegedy, 2015) and ReLU (Glorot et al., 2011) activation. A multilayer perceptron is appended after the convolutional layers to generate the final prediction for both the regression and classification tasks. We focus on comparing our main proposal, Bi-RPT, with the two baselines of No Pretraining and Pretrain-and-transfer, as defined in Section 4.1.

Evaluation Metrics and Data Splits. We evaluate each method in two ways. Besides the widely used 10-fold cross-validation, we explore two challenging extreme few-shot settings: we sample 5% (or 10%) of the experimental data from each alloy type (ternary, quaternary, quinary, and senary) as our training set, and leave the rest as the test set.
Note that these classes are not the classification labels. Eventually, we have only 11 (23) training samples and 229 (217) testing samples under the 5% (10%) setting. All the low-quality (simulated) data are used for pretraining where applicable. For the regression task, we report the best mean squared error (MSE) on the test splits; for the classification task, we report the models' test-split accuracy.

Training Hyperparameters. We pretrain the ML predictor on the simulation data for 10 epochs. During pretraining, we use the Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 1 × 10^-4 and a cosine annealing schedule (Loshchilov & Hutter, 2016). For the transfer stage, we fine-tune the pretrained model on the experimental data for 90 epochs, using the SGD optimizer with an initial learning rate of 1 × 10^-3 and decaying the learning rate by 10 every 30 epochs. The batch sizes for pretraining and fine-tuning are 16 and 4, respectively.
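The preprocessing and learning-rate schedules described in this section can be sketched as follows. This is a minimal illustration under the stated hyperparameters; all function names are ours, not from the released code, and the periodic-table channel construction is omitted.

```python
import math

import numpy as np

# Classification bin edges: [0, .5), [.5, 1), [1, 1.5), [1.5, 2), [2, inf)
BIN_EDGES = [0.5, 1.0, 1.5, 2.0]


def normalize_temperature(kelvin: float) -> float:
    """Second pseudoimage channel: T_normalized = (K - 273.15) / 2000."""
    return (kelvin - 273.15) / 2000.0


def strength_to_class(yield_strength: float) -> int:
    """Surrogate classification label from the binned ground-truth strength."""
    return int(np.digitize(yield_strength, BIN_EDGES))


def pretrain_lr(epoch: int, base_lr: float = 1e-4, total: int = 10) -> float:
    """Cosine-annealed Adam learning rate over the 10 pre-training epochs
    (eta_min = 0): base_lr * (1 + cos(pi * epoch / total)) / 2."""
    return base_lr * (1 + math.cos(math.pi * epoch / total)) / 2


def finetune_lr(epoch: int, base_lr: float = 1e-3, decay_every: int = 30) -> float:
    """SGD fine-tuning learning rate, divided by 10 every 30 epochs."""
    return base_lr * (0.1 ** (epoch // decay_every))
```

For example, the pretraining rate halves to 5e-5 at epoch 5, and fine-tuning runs at 1e-3, 1e-4, and 1e-5 over its three 30-epoch phases.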

5.2. MAIN RESULTS

Classification and regression under extreme few-shot settings. We first apply Bi-RPT to the regression and classification tasks under the two extreme few-shot settings, where only 5% and 10% of the experimental data are available, respectively. Table 2 shows that: (1) pretraining on simulation data benefits the ML predictor consistently on both the regression (over 10% reduction in MSE) and classification (over 11% improvement in accuracy) tasks, especially when the data is more scarce; (2) the integration of sparsity into the pretraining-and-transfer workflow further strengthens the predictor's generalization, improving accuracy by 0.98% and reducing MSE by 8.91% using merely 10% of the training experimental data; the improvement becomes even more significant with 5% training data (a 1.53% increase in accuracy and a 19.75% reduction in MSE).

Classification and regression with 10-fold cross-validation. On the slightly "data-rich" 10-fold cross-validation setting, we observe a similar trend: the bi-stage regime of pretraining and transfer outperforms the single-stage training pipeline, and incorporating sparsity consistently provides remarkable improvement to the ML predictor, particularly in regression performance.

Case studies. Figure 4 shows the predicted yield stress of three representative alloys (MoNbTaTi, MoNbTaTiW, and HfMoNbTaTiZr) using Bi-RPT and the baselines. On the quinary and senary alloy systems, Bi-RPT shows exceptional precision in predicting the experimental yield stress. Closer scrutiny of these predictions reveals several findings that align neatly with our materials science expertise. For example, it is known that screw dislocations are more likely than edge dislocations to be dominant in the MoNbTi and NbTaTi ternaries (shown by the ternary comparison in the Citrine database (Borg et al., 2020)). Thus it makes sense that the edge-based simulation model under-predicts the MoNbTaTi and MoNbTaTiW cases: our model seems to correctly pick up these differences and predicts a higher yield strength.
Another example is that our model over-predicts HfMoNbTaTiZr at lower temperatures (300K ∼ 900K). Since all our collected experimental samples are 100% body-centered cubic (which, admittedly, marks a limitation of X-Yield compared to the tremendous variation in real-world HEAs), it is likely that a non-BCC phase appears at lower temperatures, lowering the yield strength.

Performance at high temperatures. One of the important tasks in the alloy design community is to find alloys capable of withstanding stress at high temperatures. To verify whether Bi-RPT can provide reliable recommendations toward this goal, we look deeper into its predictive performance in high-temperature regimes. We train our model with 10% of the data, predict the yield stress for the remaining 90%, and compare the predictive quality of the models at high temperatures in Figure 3. Bi-RPT significantly outperforms the other baselines, especially at temperatures greater than 1400K. These results suggest Bi-RPT could serve as a strong tool for designing HEAs with superior yield stress at elevated temperatures.
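The high-temperature evaluation above amounts to restricting the MSE to test points beyond a temperature threshold. A minimal sketch of this protocol (our helper, not the authors' evaluation code):

```python
import numpy as np


def high_temperature_mse(y_true, y_pred, temps_kelvin, threshold=1400.0) -> float:
    """MSE restricted to test points whose processing temperature exceeds
    `threshold` K, mirroring the high-temperature comparison in Figure 3."""
    sel = np.asarray(temps_kelvin) > threshold
    err = np.asarray(y_true)[sel] - np.asarray(y_pred)[sel]
    return float(np.mean(err ** 2))
```

Sweeping the threshold over the test temperature range produces an error-versus-temperature curve for each method, which is how the baselines can be compared in the high-temperature regime.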

6. CONCLUSIONS

To address the important yet challenging problem of HEA yield stress prediction, we curated and released X-Yield, the first large-scale, multi-fidelity benchmark. To effectively leverage this benchmark, we designed a two-stage cross-quality few-shot transfer workflow and proposed to utilize sparsity to tackle both low data quality at pretraining and data scarcity at transfer. Beyond ad-hoc methods, we formulated a principled bi-level optimization framework that automatically learns the optimal sparse masks and the sparsity allocation between the two stages. Extensive experiments on both image-data testbeds and X-Yield demonstrate that Bi-RPT substantially improves over existing baselines. Moving forward, we are working closely with material scientists to validate our ML predictions against their domain expertise, and the team has already identified several alloy candidates that appear promising for experimental validation.

Upper-level problem and Sparse Regularization Loss

The upper-level objective is the sum of two losses: a standard training loss $L_t$ and a sparse regularization loss $R$ (with coefficient $\gamma$). We first develop the update rules induced by the training loss $L_t$. The weights $\theta^* := \theta_l^{(p)}$ and mask $m_s^* := m_{s,l}^{(p)}$ obtained from the lower-level problem after $p$ unroll steps serve as the initialization of the upper-level problem. We update the model weights $\theta$ and the masks at the upper level with gradient-based methods (taking SGD as an example):

$$\theta^{(k+1)} = \theta^* - \lambda_u \frac{\mathrm{d} L_t}{\mathrm{d} \theta^*} = \theta^* - \lambda_u \Big( \frac{\partial L_t}{\partial \theta^*} + \frac{\partial L_t}{\partial m_s^*} \frac{\partial m_s^*}{\partial \theta^*} \Big), \qquad (5)$$

where $\lambda_u$ is the learning rate for the weights in the upper-level problem. The gradient on $m_t$ is simply $\partial L_t / \partial m_t$, while the gradient on $m_s$ is slightly more complicated:

$$\frac{\mathrm{d} L_t}{\mathrm{d} m_s^*} = \frac{\partial L_t}{\partial m_s^*} + \frac{\partial L_t}{\partial \theta^*} \frac{\partial \theta^*}{\partial m_s^*}. \qquad (6)$$

We expand the cross terms in Eqn. 5 and Eqn. 6 using a first-order approximation of the lower-level problem (picking $p = 1$):

$$\frac{\partial \theta^*}{\partial m_s^*} = \frac{\partial \big(\theta_l^{(0)} - \lambda_l \nabla_{\theta} L_s\big)}{\partial \big(m_{s,l}^{(0)} - \lambda_{m,l} \nabla_{m_s} L_s\big)} = \big(I - \lambda_l \nabla^2_{\theta} L_s\big)\big(-\lambda_{m,l} \nabla^2_{m_s \theta} L_s\big)^{-1} + \big(-\lambda_l \nabla^2_{m_s \theta} L_s\big)\big(I - \lambda_{m,l} \nabla^2_{m_s} L_s\big)^{-1},$$

$$\frac{\partial m_s^*}{\partial \theta^*} = \frac{\partial \big(m_{s,l}^{(0)} - \lambda_{m,l} \nabla_{m_s} L_s\big)}{\partial \big(\theta_l^{(0)} - \lambda_l \nabla_{\theta} L_s\big)} = \big(I - \lambda_l \nabla^2_{\theta} L_s\big)^{-1}\big(-\lambda_{m,l} \nabla^2_{m_s \theta} L_s\big) + \big(-\lambda_l \nabla^2_{m_s \theta} L_s\big)^{-1}\big(I - \lambda_{m,l} \nabla^2_{m_s} L_s\big).$$

Further approximations avoid the matrix inverses and save computation:

$$\frac{\partial \theta^*}{\partial m_s^*} \approx -\lambda_l \nabla^2_{m_s \theta} L_s, \qquad \frac{\partial m_s^*}{\partial \theta^*} \approx -\lambda_{m,l} \nabla^2_{m_s \theta} L_s.$$

Based on these rules, $m_s$ and $m_t$ can be optimized by:

$$m_t^{(k+1)} = m_t^{(k)} - \lambda_m \frac{\partial L_t}{\partial m_t} \Big|_{m_t = m_t^{(k)}}, \qquad m_s^{(k+1)} = m_s^{(k)} - \lambda_m \frac{\partial L_t}{\partial m_s} + \lambda_m \lambda_l \frac{\partial L_t}{\partial \theta^*} \nabla^2_{m_s \theta} L_s \Big|_{m_s = m_s^{(k)}}, \qquad (9)$$

where the superscript $(k)$ denotes the number of update steps taken.

We then turn to the sparse regularization term. We choose the $\ell_0$ loss (i.e., the number of non-zero elements) as the sparse regularizer $R$, which is non-differentiable and difficult to optimize. Therefore, we follow Guo et al. (2021) and use a smoothed $\ell_0$ formulation to enable differentiable training. Specifically, a gate function $g_{\epsilon}(x) := \frac{x^2}{x^2 + \epsilon}$, where $\epsilon$ is a small positive number, replaces the binary masks, which are instead parameterized as $g_{\epsilon}(m_s)$ and $g_{\epsilon}(m_t)$. We decay the value of $\epsilon$ every epoch so that the gate function gradually outputs only polarized values (i.e., 0 and 1). We further apply proximal-SGD (Nitanda, 2014) to minimize the $\ell_0$ loss: after updating $m_s$ and $m_t$ with respect to $L_t$ by gradient descent (Eqn. 9), we use the proximal operator to alternately update each mask. For $m_s$, with $\hat{m}_s^{(k+1)}$ and $\hat{m}_t^{(k+1)}$ denoting the outputs of Eqn. 9, the formulation can be written as:

$$\mathrm{prox}_{\lambda_m \gamma R}\big(\hat{m}_s^{(k+1)}\big) = \arg\min_{m_s} \frac{1}{2} \big\| m_s \odot \hat{m}_t^{(k+1)} - \hat{m}_s^{(k+1)} \odot \hat{m}_t^{(k+1)} \big\|_2^2 + \lambda_m \gamma \big\| m_s \odot \hat{m}_t^{(k+1)} \big\|_0.$$

Following Guo et al. (2021), we solve it by relaxing it to an $\ell_1$-norm problem, which has the closed-form solution

$$m_{s,i}^{(k+1)} = \begin{cases} \hat{m}_{s,i}^{(k+1)} - \dfrac{\gamma \lambda_m}{\hat{m}_{t,i}^{(k+1)}}, & \hat{m}_{s,i}^{(k+1)} \ge \dfrac{\gamma \lambda_m}{\hat{m}_{t,i}^{(k+1)}} \\[4pt] \hat{m}_{s,i}^{(k+1)} + \dfrac{\gamma \lambda_m}{\hat{m}_{t,i}^{(k+1)}}, & \hat{m}_{s,i}^{(k+1)} \le -\dfrac{\gamma \lambda_m}{\hat{m}_{t,i}^{(k+1)}} \\[4pt] 0, & \text{otherwise}, \end{cases} \qquad (10)$$

where $m_{s,i}$ is the $i$-th element of $m_s$ (likewise for $m_{t,i}$). Similarly, we derive the update for $m_t$:

$$m_{t,i}^{(k+1)} = \begin{cases} \hat{m}_{t,i}^{(k+1)} - \dfrac{\gamma \lambda_m}{\hat{m}_{s,i}^{(k+1)}}, & \hat{m}_{t,i}^{(k+1)} \ge \dfrac{\gamma \lambda_m}{\hat{m}_{s,i}^{(k+1)}} \\[4pt] \hat{m}_{t,i}^{(k+1)} + \dfrac{\gamma \lambda_m}{\hat{m}_{s,i}^{(k+1)}}, & \hat{m}_{t,i}^{(k+1)} \le -\dfrac{\gamma \lambda_m}{\hat{m}_{s,i}^{(k+1)}} \\[4pt] 0, & \text{otherwise}. \end{cases} \qquad (11)$$

Finally, we combine all these components into Algorithm 2.

Algorithm 2 Solving Bi-RPT
Input: Initialization weights θ0, training loss functions Ls and Lt for the two stages, low-quality pretraining dataset Ds, high-quality fine-tuning dataset Dt, number of gradient-unroll steps p.
Output: Trained model weights θ, sparse masks ms and mt.
Train θ0 on Ds to obtain the weights θ.
while not converged do
    Given the fixed ms, update the weights θ on Ds by gradient unrolling (Eqn. 3).
    Update the weights θ by Eqn. 5.
    Update the masks ms and mt by Eqn. 9.
    Update the masks ms and mt by Eqn. 10 and Eqn. 11.
end while
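To make the mask updates concrete, here is a minimal NumPy sketch of the smoothed ℓ0 gate and of the element-wise soft-thresholding step of Eqn. 10. The function names and the small clipping guard against division by zero are ours; this is an illustration, not the paper's implementation.

```python
import numpy as np

def smoothed_l0_gate(m, eps):
    """Smoothed l0 gate g_eps(x) = x^2 / (x^2 + eps); as eps decays toward
    zero it approaches a polarized 0/1 indicator of the mask entries."""
    return m ** 2 / (m ** 2 + eps)

def prox_soft_threshold(m_s_hat, m_t_hat, gamma, lam_m):
    """Closed-form soft-thresholding update for m_s (Eqn. 10): each entry's
    threshold gamma * lam_m / m_t_hat[i] is scaled by the other mask, and
    entries falling inside the threshold band are zeroed out."""
    thresh = gamma * lam_m / np.maximum(np.abs(m_t_hat), 1e-12)  # guard /0
    return np.where(m_s_hat >= thresh, m_s_hat - thresh,
                    np.where(m_s_hat <= -thresh, m_s_hat + thresh, 0.0))
```

The symmetric update for m_t (Eqn. 11) follows by swapping the roles of the two masks.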

B.1 BASELINES AND HYPERPARAMETERS

We list the hyper-parameters used for all baselines in this section. General settings. When pre-training the models on D s (ImageNet and ImageNet-C), we use the SGD optimizer with an initial learning rate of 4 × 10⁻¹. We linearly warm up the learning rate over the first 5 epochs, and then decay it by a factor of 10 every 30 epochs. Models are pretrained for 95 epochs on D s with a batch size of 1024. On D t , we set the initial learning rate to 1 × 10⁻³; the learning rate is decayed by a factor of 10 every 30 epochs, and the model is trained for 90 epochs with a batch size of 64. For Hand-Tune, we train the models from scratch on D s for 95 epochs to obtain a densely pretrained model; the number of training epochs is reduced to 45 after the pretrained model is derived. After the pretraining stage ends, we transfer the model to D t following the above hyper-parameters; the number of epochs is likewise reduced to 45 after we prune the weights. For No-Pretraining, we train the model with an initial learning rate of 1 × 10⁻² and a batch size of 64. For Mix-Training, since the numbers of classes differ between ImageNet and CUB-200, we use two fully-connected layers on top of the standard ResNet-18 backbone and train them simultaneously; we sample batches from the two domains (D s and D t ) with the same batch size of 64. The initial learning rate for these methods is 1 × 10⁻², decayed by a factor of 10 every 30 epochs. For Bi-RPT, we follow the same learning-rate settings, although some additional hyper-parameters are introduced: the learning rate for the lower-level problem (λ l ) is 1 × 10⁻³, the same as that for the upper-level problem (λ u ); the value of γ is set to 1 × 10⁻⁴, determined through the ablation study in Table A10; and the value of λ m is set to 3.5, also determined through the ablation study in Table A11.
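As a sanity check on the pre-training schedule above, here is a small sketch of the learning rate per epoch. The function name is ours, and we assume (as one reasonable interpretation) that the step decay counts epochs after warm-up ends.

```python
def lr_at_epoch(epoch, base_lr=0.4, warmup_epochs=5,
                decay_every=30, decay_factor=10.0):
    """Linear warm-up over the first `warmup_epochs` epochs, then divide
    the learning rate by `decay_factor` every `decay_every` epochs."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr / (decay_factor ** ((epoch - warmup_epochs) // decay_every))
```

For example, under these assumptions the rate ramps from 0.08 at epoch 0 up to 0.4 by epoch 4, and drops to 0.04 at epoch 35.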

B.2 PERFORMANCE OF HAND-TUNE UNDER DIFFERENT LEVELS OF SPARSITY

We report the performance of the Hand-Tune method under different levels of sparsity. We conduct experiments with N s = {0, 1, 2, 3, 4, 5} and N t = {0, 1, 2, 3, 4}, resulting in sparsity levels at the pre-training stage of {0.00%, 20.00%, 36.00%, 48.80%, 59.04%, 67.23%} and sparsity levels at the transfer stage of {0.00%, 20.00%, 36.00%, 48.80%, 59.04%}. We run experiments over all combinations of pretraining and transfer pruning rounds: we first perform IMP on D s for N s rounds, and then continue to perform IMP on D t for another N t rounds. The results over various source and target combinations are shown in Table A5 to Table A8. Note that all models are evaluated on the testing samples in D t . From this series of tables we observe that: (1) sparsity at pretraining helps improve the model's performance on D t after fine-tuning, and the gain is larger when D s contains more noise and exhibits larger domain shifts; (2) sparsity at transfer is also beneficial, and the improvement is more significant when D t is more "data-scarce"; (3) the optimal sparsity levels for the two stages vary across different combinations of pretrain and transfer domains, highlighting the importance of choosing the correct pruning rounds for both stages.

The performance comparison is shown in Table A9, where we can see that learning masks at both stages yields the highest performance.

Effects of γ. We conduct a set of ablation experiments to study the effects of γ, again on ResNet-18 (pretrained on ImageNet-C, fine-tuned on CUB-200). We vary γ within {0.5, 1, 2, 3} × 10⁻⁴ and present the results in Table A10, which shows that 1 × 10⁻⁴ yields the highest performance among all choices.

Effects of learning rates. We conduct a set of ablation experiments on ResNet-18 (pretrained on ImageNet-C, fine-tuned on CUB-200) to study the effect of different learning rates for m s and m t . The learning rates studied are {2.5, 3.0, 3.5, 4.0, 4.5, 5.0}. We present the test accuracies in Table A11, and observe that Bi-RPT stably outperforms the baselines (74.01%) within a wide range of λ m .
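The sparsity levels quoted above follow directly from pruning 20% of the surviving weights per IMP round; a one-line check (function name is ours):

```python
def imp_sparsity(rounds, prune_frac=0.2):
    """Cumulative sparsity after `rounds` of iterative magnitude pruning,
    each round removing `prune_frac` of the surviving weights."""
    return 1.0 - (1.0 - prune_frac) ** rounds

# Reproduces the levels quoted above for N = 0..5:
levels = [round(100 * imp_sparsity(n), 2) for n in range(6)]
# → [0.0, 20.0, 36.0, 48.8, 59.04, 67.23]
```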

B.5 HEA DATA REPRESENTATIONS

The raw inputs to our ML predictor are the alloy's composition and the temperature at which the experiment is conducted; together they form an 11-dimensional vector. We map these vectors into 2D images following the pipeline shown in Figure A5. Given the formulation of an alloy, the periodic table representation (PTR) places the percentage of each element at a specific position according to its location in the periodic table, while the randomized periodic table representation (RPTR) places the percentage of each element according to a pre-defined shuffled periodic table. In our experiments, we use the RPTR to map values in a more balanced way.
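As an illustration of this mapping, here is a minimal NumPy sketch. The grid shape, the element positions, and the temperature normalization below are placeholders of our own, not the paper's actual RPTR layout; only the overall structure (fractions in one channel, temperature broadcast in another, as in Figure A5) follows the text.

```python
import numpy as np

# Hypothetical (row, col) cells for each element in a shuffled grid.
# The real RPTR layout is pre-defined by the authors; these positions
# are placeholders for illustration only.
RPTR_POSITIONS = {
    "Al": (0, 3), "Cr": (1, 7), "Fe": (2, 1), "Mo": (3, 5), "Nb": (4, 0),
    "Ta": (5, 8), "V": (6, 2), "W": (7, 6), "Hf": (8, 4), "Ti": (0, 8),
    "Zr": (5, 2),
}

def to_pseudoimage(composition, temperature, shape=(9, 9), t_max=2500.0):
    """Map an alloy composition (element -> atomic fraction) plus the test
    temperature into a 2-channel pseudo-image: channel 0 holds the fractions
    at element positions, channel 1 broadcasts the normalized temperature."""
    img = np.zeros((2,) + shape, dtype=np.float32)
    for elem, frac in composition.items():
        r, c = RPTR_POSITIONS[elem]
        img[0, r, c] = frac
    img[1].fill(temperature / t_max)
    return img
```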

C ADDITIONAL EXPERIMENTS

C.1 UNCERTAINTY QUANTIFICATION

We provide additional analysis of uncertainty quantification. We ensemble ten models trained with Bi-RPT and with the pretrain-and-transfer (PT) method by averaging their predictions (Lakshminarayanan et al., 2017), and take the standard deviation of the predictions as the uncertainty. The results after ensembling are shown in Table A12: an ensemble of sparse models provides more reliable results than the pretrain-and-transfer baseline. Compared with the ensemble of dense models (PT), the ensemble of sparse models also exhibits a stronger correlation between the uncertainty and the prediction error.
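The aggregation itself is straightforward; a sketch of the mean/std rule used by deep ensembles (the function name is ours):

```python
import numpy as np

def ensemble_predict(member_preds):
    """Deep-ensemble aggregation (Lakshminarayanan et al., 2017): the mean
    over member predictions is the final prediction, and the standard
    deviation across members is the uncertainty estimate."""
    preds = np.stack(member_preds)          # shape (n_models, n_samples)
    return preds.mean(axis=0), preds.std(axis=0)
```

Samples where the members disagree get a large standard deviation, which is the quantity reported in brackets in Table A12.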

D DATASET COMPARISON

We provide a comparison between relevant datasets in Table A13 and elaborate on the differences: 1. Maresca & Curtin (2020) provide only sparse data from the Mo-Nb-Ta-V-W element family. 2. Lee et al. (2021a) released a database of the predicted yield strength of 10 million alloys from the Al-Cr-Mo-Nb-Ta-V-W-Hf-Ti-Zr family at 1300 K. Our dataset contains alloys from the Al-Cr-Fe-Mo-Nb-Ta-V-W-Hf-Ti-Zr family at temperatures from 300 K to 2500 K, and our simulation data are significantly larger (over 3 billion samples). The whole simulation dataset will be made available, while only 100K samples are used to train the ML models in this study. 3. Borg et al. (2020) compile experimental data from material science articles published since 2004; their dataset contains 630 samples with different crystal structures. Our experimental dataset also compiles experimental data from published articles, but we additionally sub-select the data points using material science domain knowledge: in contrast to Borg et al. (2020), we only keep alloys with BCC structures.



Figure 2: Left: the distribution of the yield stress; Middle: the distribution of the temperature; Right: pairwise visualization of the yield stress.

Several baselines are compared to Hand-Tune and Bi-RPT: (1) Pretrain-and-transfer: the basic workflow of pre-training on D s followed by fine-tuning on D t , with no sparsity involved; (2) Pretrain sparsity only / transfer sparsity only: following our proposed pretrain-and-transfer workflow, but applying IMP only at the pre-training or fine-tuning stage, respectively; (3) No Pretraining: directly training on D t without using D s ; (4) Mix Training: training one model on D s and D t combined.

Figure 3: Prediction MSE under different temperatures. We compare the results of three methods: No Pretraining (NP), Pretrain-and-transfer (PT), and Bi-RPT.

Figure A4: Layerwise sparsity learned by Bi-RPT on CUB-200 and Birds-S with ImageNet and ImageNet-C pretraining. We report the sparsity level of the two masks, as well as their combined sparsity (note that Bi-RPT allows the two masks to partially overlap).

Figure A5: The pipeline for converting a raw input into a pseudoimage. The temperature is embedded as the value of the second channel.

Experiments on image data: testing accuracy of fine-tuned ResNet-18 on CUB-200 / CUB-200 (10-shot) as D t , after pretraining on ImageNet and ImageNet-C as D s , respectively.

Test accuracy on the testing set of different splits of high-fidelity alloy data. The experiments are repeated 10 times, and we report both the mean and the 95% confidence interval.

Classification and regression performance under the 10-fold cross-validation settings.

Performance comparison on alloys at various temperatures. Based on the model trained with 10% experimental data, we predict the yield strength of three alloys, MoNbTaTi, MoNbTaTiW and HfMoNbTaTiZr, at different temperatures.

Predicted yield stress of different alloys under different temperatures. Only 10% of the experimental data are available during fine-tuning. We compare the predicted yield stress generated by Bi-RPT with our "No Pretraining" (NP) and "Pretrain-and-transfer" (PT) baselines and the simulation. The numbers with the smallest error are marked in bold.

Test accuracy of fine-tuned ResNet-18 on CUB-200 after pretraining on ImageNet, under different levels of sparsity at pretraining and sparsity at transfer.

Test accuracy of fine-tuned ResNet-18 on CUB-200 after pretraining on ImageNet-C, under different levels of sparsity at pretraining and sparsity at transfer.

Test accuracy of fine-tuned ResNet-18 on CUB-200 (10-shot) after pretraining on ImageNet, under different levels of sparsity at pretraining and sparsity at transfer.

Test accuracy of fine-tuned ResNet-18 on CUB-200 (10-shot) after pretraining on ImageNet-C, under different levels of sparsity at pretraining and sparsity at transfer.

Ablation study on different mask-learning schemes.

Ablation study on the coefficient γ.

Ablation study on the mask learning rate λm.

Uncertainty estimation calculated by ensembling independently trained models. We study two methods: pretrain-and-transfer (PT) and Bi-RPT. The results after ensemble are reported as PT-Ensemble and Bi-RPT-Ensemble, respectively. The estimated uncertainty is reported in brackets.

Comparison between different datasets.

A MORE DETAILS ON METHODS

In this section, we present the technical details of our proposed method and framework ("Hand-Tune" and "Bi-RPT").

A.1 HAND-TUNE

Hand-Tune decides the sparse masks for the two stages in an iterative way, as explained in Algorithm 1.

Algorithm 1 Hand-Tune

Input: Initialization weights θ0, low-quality pretraining dataset Ds, high-quality fine-tuning dataset Dt, number of IMP rounds Ns for the pretraining stage and Nt for the fine-tuning stage.
Output: The trained weights θ*, the sparse mask ms for the pretraining stage, and the sparse mask mt for the fine-tuning stage.
Initialize the sparse mask ms for the pretraining stage to be an all-"1" mask.
Initialize the model's weights as θ0 and train them on Ds to obtain θs.
for i = 1, 2, ..., Ns do ▷ IMP at the pre-training stage
    Prune 20% of the smallest-magnitude weights from the non-zero region of ms ⊙ θs, by setting the corresponding positions in ms to "0".
    (Re-)train the sparse weights ms ⊙ θs on Ds. Only θs is updated.
end for
Initialize the sparse mask mt for the fine-tuning stage to be an all-"1" mask and freeze ms.
Initialize the model's weights as ms ⊙ θs, and train on Dt to obtain ms ⊙ θt.
for i = 1, 2, ..., Nt do ▷ IMP at the fine-tuning stage
    Prune 20% of the smallest-magnitude weights from the non-zero region of (ms ⊙ mt) ⊙ θt, by setting the corresponding positions in mt to "0".
    (Re-)train the sparse weights (ms ⊙ mt) ⊙ θt on Dt. Only θt is updated.
end for
Obtain the final sparse weights (ms ⊙ mt) ⊙ θ* and return θ*, ms and mt.
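A minimal NumPy sketch of a single pruning step from the loops above, covering only the mask update (the (re-)training of the surviving weights happens outside; the function name is ours):

```python
import numpy as np

def imp_prune_round(theta, mask, prune_frac=0.2):
    """One IMP pruning step as in Algorithm 1: zero out `prune_frac` of the
    smallest-magnitude weights among the currently unpruned positions by
    flipping the corresponding mask entries to 0."""
    alive = np.flatnonzero(mask)                       # unpruned positions
    k = int(prune_frac * alive.size)                   # weights to remove
    if k == 0:
        return mask
    order = np.argsort(np.abs(theta.ravel()[alive]))   # smallest magnitudes
    new_mask = mask.copy().ravel()
    new_mask[alive[order[:k]]] = 0
    return new_mask.reshape(mask.shape)
```

Calling this Ns times on Ds (with retraining in between), then Nt more times on Dt with a second mask, reproduces the 20%-per-round schedule of Hand-Tune.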

A.2 BI-RPT

We now build the techniques to solve the bi-level optimization problem formulated in Bi-RPT.

Lower-level problem. We solve the lower-level problem through a p-step SGD unrolling. Let $\theta^{(k)}$ be the model weights and $m_s^{(k)}$ be the mask for the pretraining stage, where the superscript $(k)$ indicates that they have been updated at the upper level for $k$ steps. $\theta^{(k)}$ and $m_s^{(k)}$ are the starting points of the lower-level optimization, i.e., $\theta_l^{(0)} = \theta^{(k)}$ and $m_{s,l}^{(0)} = m_s^{(k)}$, and for $t = 0, \dots, p-1$ we iterate

$$\theta_l^{(t+1)} = \theta_l^{(t)} - \lambda_l \nabla_{\theta} L_s\big(\theta_l^{(t)}, m_{s,l}^{(t)}\big), \qquad m_{s,l}^{(t+1)} = m_{s,l}^{(t)} - \lambda_{m,l} \nabla_{m_s} L_s\big(\theta_l^{(t)}, m_{s,l}^{(t)}\big), \qquad (3)$$

where $\lambda_l$ is the learning rate for the model weights $\theta$, and $\lambda_{m,l}$ is the learning rate for the mask $m_{s,l}^{(t)}$ in the lower-level optimization problem.
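A minimal sketch of the p-step unroll, exercised on a toy objective. The gradient callables stand in for ∇θLs and ∇msLs, and all names are ours; both iterates are updated simultaneously from the previous step's values, as in Eqn. 3.

```python
def unroll_lower_level(theta, m_s, grad_theta, grad_ms, lam_l, lam_ml, p):
    """p-step SGD unroll of the lower-level problem: simultaneously update
    the weights and the pre-training mask, each with its own learning rate.
    The tuple assignment evaluates both gradients at the old iterates."""
    for _ in range(p):
        theta, m_s = (theta - lam_l * grad_theta(theta, m_s),
                      m_s - lam_ml * grad_ms(theta, m_s))
    return theta, m_s

# Toy quadratic L_s = 0.5 * (theta^2 + m_s^2), whose gradients are the
# iterates themselves:
theta_p, ms_p = unroll_lower_level(1.0, 2.0,
                                   lambda t, m: t, lambda t, m: m,
                                   lam_l=0.1, lam_ml=0.1, p=1)
# → theta_p = 0.9, ms_p = 1.8
```

The pair returned after p steps plays the role of (θ*, m_s*) that initializes the upper-level updates.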

