TEMPORAL LABEL SMOOTHING FOR EARLY PREDICTION OF ADVERSE EVENTS Anonymous authors Paper under double-blind review

Abstract

Models that can predict adverse events ahead of time with low false-alarm rates are critical to the acceptance of decision support systems in the medical community. This challenging machine learning task remains typically treated as simple binary classification, with few bespoke methods proposed to leverage temporal dependency across samples. We propose Temporal Label Smoothing (TLS), a novel learning strategy that modulates smoothing strength as a function of proximity to the event of interest. This regularization technique reduces model confidence at the class boundary, where the signal is often noisy or uninformative, thus allowing training to focus on clinically informative data points away from this boundary region. From a theoretical perspective, we also show that our method can be framed as an extension of multi-horizon prediction, a learning heuristic proposed in other early prediction work. TLS empirically matches or outperforms all competitor methods on various early prediction benchmark tasks. In particular, our approach significantly improves performance on clinically-relevant metrics such as event recall at low false-alarm rates.

Early prediction of adverse events is key to safety-critical operations such as clinical care [1] or environmental monitoring [2]. In particular, adverse event prediction is highly relevant to clinical decision-making, as the deployment of in-patient risk stratification models can significantly improve patient outcomes and facilitate resource planning [1]. For instance, the National Early Warning Score (NEWS), a simple rule-based model predicting acute deterioration in critical care units, has been demonstrated to reduce in-patient mortality [3; 4].

Deteriorating patient signals are often identified by mining large quantities of existing medical data and associated patient outcomes, which has sparked a growing interest in machine learning and medical literature. Applications of such adverse event prediction models include alarm systems for delirium [5], septic shock [6], as well as circulatory or kidney failure in the intensive care unit (ICU) [7; 8].

Adverse event prediction remains a challenging modeling task requiring specific technical solutions. Recent years have seen the development of deep learning architectures for electronic health records (EHR), which help tackle the high dimensionality, irregular sampling, and informative missingness patterns in patient covariates [6; 9; 10; 8]. Still, adverse clinical events are often noisy, infrequent, and, as illustrated in Figure 1, must be predicted with enough anticipation to allow for appropriate physician response. Yet early prediction remains largely considered a simple binary classification task [7; 11; 9; 8]. As a result, current decision support models often suffer from high false positive prediction rates, with associated risks of alarm fatigue and thus limited physician engagement [12; 13; 1].
As highlighted in Figure 2a, the traditional cross-entropy objective results in the highest error rates near the class boundary, corresponding to the prediction horizon before the event. Data in this boundary region dominates the loss but may not be clinically discriminative of patient deterioration patterns. Motivated by this observation, we propose Temporal Label Smoothing (TLS), a novel regularization strategy making label smoothing [14] time-dependent to better match prediction uncertainty patterns over time. As visualized in Figure 2b, our method is designed to reduce model confidence with stronger smoothing at the class boundary, allowing training to focus on more clinically informative data points away from this noisily labeled region.

Contributions. The contributions of our work are threefold: (i) In Section 3.2, we introduce a novel label smoothing method, which leverages the temporal structure of early prediction tasks to focus training and model confidence on areas with a stronger predictive signal. (ii) In Section 5, we show that our approach improves prediction performance over previously proposed objectives, particularly for clinically relevant criteria. (iii) In Section 3.3, we bridge the gap between prior work on multi-horizon prediction (MHP) [8] and label smoothing [14] by showing the former is equivalent to a special case of TLS under reasonable assumptions that we verify empirically.

Figure 2: Illustration of temporal label smoothing for early prediction of adverse events. Predictions are carried out over a horizon $h$ and $t_e$ is the time of the next event, shaded in grey. True labels in black. (a) Model confusion is highest near the label boundary $t_e - h$ (maximum false positive rate, FPR, and minimum true positive rate, TPR), while performance is best close to event occurrence ($t_e$) and away from it ($t_e - 2h$). This motivates greater smoothing near $t_e - h$. Metrics are computed over four-hour bins based on a 50% precision threshold. (b) $\gamma$ controls the smoothing strength of surrogate labels $q_{TLS}$.

2. RELATED WORK

Recent years have seen the development of custom machine learning methods to predict expected patient evolution and support clinical decision-making [15; 16; 17; 7]. Amongst these, early prediction of adverse clinical events is a particularly complex task due to their typically rare occurrence and noisy label definition, which induces challenging, highly imbalanced datasets for model training [8]. As a result, prediction systems often suffer from high false-alarm rates with limited usefulness in the clinical context [1]. Prior works on early event prediction have adopted various approaches to tackle this issue, which we compare in Table 1 and formalize in Appendix A.4. We also discuss similarities and distinctions between our task and the frameworks of early time-series classification and survival analysis [18] in Appendix A.3.

Learning objectives for imbalanced datasets. Class imbalance is often addressed through loss reweighting techniques. Static class reweighting was used for sepsis or circulatory failure prediction [17; 7] through a balanced cross-entropy, which assigns a higher weight to samples from the minority class [19]. Still, performance improvements with this objective remain limited on highly imbalanced prediction tasks [20]. In contrast, dynamic reweighting methods such as focal loss and extensions [21; 22] induce a learning bias towards samples with high model uncertainty, typically harder to classify. This approach can improve the prediction of disease progression from imbalanced datasets [23] but does not consider patterns of sample informativeness over time.

Table 1: Comparison of learning objectives for early event prediction (three property columns in their original order, followed by the per-sample objective term).
Balanced cross-entropy [19] | ✗ | ✓ | ✗ | $\omega_y \, \delta_{y=c} \log(\hat{y})$
Focal loss [21] | ✗ | ✓ | ✗ | $\omega_y (1 - \hat{y})^{\zeta} \, \delta_{y=c} \log(\hat{y})$
Label smoothing [14] | ✗ | ✓ | ✓ | $q_{LS}(c|y) \log(\hat{y})$
Multi-horizon prediction [8] | ✓ | ✗ | ✓ | $\sum_h y^h \log(\hat{y}^h)$
Temporal label smoothing | ✓ | ✓ | ✓ | $q_{TLS}(c|y, t) \log(\hat{y})$

Multi-horizon prediction. In contrast, other early prediction models learn to leverage temporal trends in the data by outputting event predictions over several horizons [8; 24; 25]. This training heuristic improves prediction performance on the horizon of interest but scales poorly with the number of output horizons. In Section 3.3, we highlight that TLS can induce a similar temporal bias in learning while overcoming scalability limitations.

Label smoothing. For greater generalization of models applied to heterogeneous real-world data, another well-known training strategy is to avoid model overconfidence through label smoothing [14]. This regularization technique improves both the calibration of deep learning models [26] and their performance under noisy labeling [27; 26]. Still, despite extensions including novel prior distributions over classes [28] or modifications to the objective itself [29; 30], label smoothing remains designed for classification problems with i.i.d. samples, ill-adapted to the time-dependent nature of our data. To the best of our knowledge, ours is the first work to explore adding a temporal dependence to label smoothing and to empirically demonstrate the added value of this approach.

Whereas reweighted loss functions only bias learning towards minority or uncertain data points, multi-horizon prediction and label smoothing approaches alter the individual sample optimum. As a consequence, these approaches avoid model overconfidence and are thus more robust to noisy labeling [27]. In this work, we propose to combine the respective advantages of these established methods in a novel way to improve the early prediction of adverse events.

We first formalize the problem of early adverse event prediction and introduce temporal label smoothing. We then highlight how MHP can be framed as a special case of TLS.

3.1. PROBLEM FORMALISM

We assume access to a dataset of $N$ patient stays. These consist of irregular time series of high-dimensional patient covariates $X_{i,t} = [x_{i,0}, \dots, x_{i,t}]$ and binary event labels $e_{i,t}$ encoding whether a patient of index $i$ is undergoing an adverse event of interest at time $t$. For each patient, we thus have a sequence $\{(x_{i,1}, e_{i,1}), \dots, (x_{i,T_i}, e_{i,T_i})\}$ of length $T_i$. Our early prediction task consists of modeling a binary target variable $y_{i,t}$, positive if the event occurs within a given prediction horizon $h$. For labelling purposes, we define the next event time for each time point, $t_e(i,t) = \arg\min_{\tau : \tau \geq t} \{e_{i,\tau} : e_{i,\tau} = 1\}$. If patient $i$ never undergoes any event, we set $t_e(i,t) = +\infty$. Thus, we have:

$$y_{i,t} = \mathbb{1}[t_e - h < t < t_e]$$

As our task focuses specifically on early modeling for clinical relevance, no prediction is carried out if the patient is currently undergoing the event. Then, as for any binary learning problem, we define a model $f$ parameterized by $\theta$ with $\hat{y}_{i,t} = f_\theta(X_{i,t}) = p_\theta(y_{i,t} = 1)$. We denote the optimal set of parameters minimizing the objective function as $\theta^*$, giving $y^*_{i,t} = f_{\theta^*}(X_{i,t})$.

Temporal structure. An important distinction must be made with the classification tasks typically addressed with label smoothing. In adverse event prediction, data is not independent and identically distributed (i.i.d.), as each sample $x_{i,t}$ depends on a timestep $t$ and a patient stay indexed as $i$. Contiguous samples within a common stay are thus temporally dependent:

$$p(y_{i,t+d} = 1) \geq p(y_{i,t} = 1) \quad \forall d : 0 \leq d < t_e(i,t) - t \qquad (1)$$

Treating data as i.i.d., as is commonly done in early prediction works [7; 11], does not account for the increase in signal strength as the prediction time approaches the event. Our goal is to leverage this structure in our data to focus training on relevant timesteps and help address issues of noisy label boundaries and class imbalance, which are inherent to our choice of real-world medical datasets.
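As a minimal sketch of the labelling rule in this section, the snippet below derives $t_e(i,t)$ and $y_{i,t}$ from a binary event sequence for a single stay. It assumes regularly spaced integer time steps (the datasets themselves are irregularly sampled), and the function names and the use of `None` for steps where no prediction is made are illustrative choices, not the authors' implementation.

```python
import math


def next_event_time(events, t):
    """Return the first index tau >= t with events[tau] == 1, or math.inf."""
    for tau in range(t, len(events)):
        if events[tau] == 1:
            return tau
    return math.inf


def early_prediction_labels(events, horizon):
    """Label each time step; None marks steps where no prediction is made."""
    labels = []
    for t, e in enumerate(events):
        if e == 1:
            # The patient is currently undergoing the event: no prediction.
            labels.append(None)
            continue
        t_e = next_event_time(events, t)
        # Positive only if the next event lies strictly within the horizon.
        labels.append(1 if t_e - horizon < t < t_e else 0)
    return labels


events = [0, 0, 0, 0, 0, 1, 1, 0, 0]
print(early_prediction_labels(events, horizon=3))
# [0, 0, 0, 1, 1, None, None, 0, 0]
```

With a strict inequality on both sides, a horizon of 3 steps yields two positive labels before the event onset at index 5, and steps after the last event are negative because $t_e = +\infty$.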

3.2. TEMPORAL LABEL SMOOTHING

As introduced by Szegedy et al. [14], label smoothing consists of substituting the original label distribution, $\delta_{y_i = c}$ for class $c$, with a smooth version $q_{LS}(c|y_i)$ in the cross-entropy objective $L_i = L_{CE}(y_i, \hat{y}_i)$. For binary tasks, label smoothing becomes a linear interpolation:

$$q_{LS}(1|y_i) = (1 - \alpha) y_i + \alpha (1 - y_i)$$

where parameter $\alpha$ controls the smoothing strength. By shifting the minimum of the objective function away from $y^*_i = y_i$ towards $y^*_i = q_{LS}$, label smoothing prevents models from becoming overconfident during training. This approach should therefore help improve the robustness of early prediction models against the inherently noisy nature of the task [27] but does not account for the time dependency between samples of a given stay.

For this purpose, we propose temporal label smoothing, an approach to modulate smoothing based on time $t$ to infuse this prior knowledge into the training objective. We define the corresponding surrogate distribution similarly to label smoothing:

$$q_{TLS}(1|i,t) = 1 - \alpha(i,t)$$

For early prediction of events, to enforce the temporal inductive bias in Equation 1, we parametrize $\alpha(i,t)$ as a monotonically decreasing function of $t \in [0, t_e(i,t)]$. In practice, as illustrated in Figure 3a, this increases smoothing strength around the label boundary $t = t_e - h$, reducing prediction certainty in this region prone to high error rates, as shown in Figure 2a.

Smoothing parametrizations. We propose various temporal smoothing parametrizations for $\alpha(i,t)$ in Appendix A.2. Experimental results suggest that an exponential parametrization, defined as follows, performs best on considered tasks. Corresponding smoothed labels $q_{exp}(1|i,t)$ can be visualized in Figure 2b.

$$\alpha_{exp}(i,t) = \begin{cases} 1 - e^{-\gamma (t_e(i,t) - t - d)} - A & \text{if } h_{min} < t_e(i,t) - t < h_{max} \\ 0 & \text{if } t_e(i,t) - t \leq h_{min} \\ 1 & \text{if } t_e(i,t) - t \geq h_{max} \end{cases} \qquad (4)$$

The constants $d$ and $A$ are fixed by the boundary conditions, as detailed in Appendix A.2.

Figure 3: Label smoothing strength over time under different parametrizations, with $(h_{min}, h_{max}) = (0, 2h)$. Note that $|y - q_{TLS}|$ corresponds to the difference in optimum $y^*$ between the TLS objective and cross-entropy. Smoothing function $\alpha_{step}$ is equivalent to multi-horizon prediction with a unique output. (b) Parametrization $\alpha_{step}$ (Equation 5).
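As a rough illustration of Equation 4, the sketch below evaluates the exponential smoothing strength for $(h_{min}, h_{max}) = (0, 2h)$, using the closed forms for $d$ and $A$ given in Appendix A.2. The function names and the values $\gamma = 0.2$, $h = 12$ are illustrative assumptions, and the time-to-event $d_e = t_e(i,t) - t$ is passed in directly rather than computed from a stay.

```python
import math


def exp_constants(gamma, h):
    """Constants d and A enforcing alpha(d_e=0) = 0 and alpha(d_e=2h) = 1."""
    d = -math.log(1.0 - math.exp(-2.0 * gamma * h)) / gamma
    A = -math.exp(-gamma * (2.0 * h - d))
    return d, A


def alpha_exp(d_e, gamma, h):
    """Exponential smoothing strength as a function of time-to-event d_e."""
    if d_e <= 0.0:
        return 0.0
    if d_e >= 2.0 * h:
        return 1.0
    d, A = exp_constants(gamma, h)
    return 1.0 - math.exp(-gamma * (d_e - d)) - A


def q_tls(d_e, gamma, h):
    """Smoothed positive-class target q_TLS(1 | i, t) = 1 - alpha(i, t)."""
    return 1.0 - alpha_exp(d_e, gamma, h)


gamma, h = 0.2, 12.0
for d_e in (0.0, 1.0, 6.0, 12.0, 18.0, 24.0):
    print(f"d_e={d_e:5.1f}  alpha={alpha_exp(d_e, gamma, h):.4f}  q={q_tls(d_e, gamma, h):.4f}")
```

Under these assumptions the smoothed target decays from 1 at the event to 0 at $2h$ before it, so samples just inside the hard label boundary at $d_e = h$ receive a soft target well below 1.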

3.3. LINK WITH MULTI-HORIZON PREDICTION

As motivated above, temporal label smoothing adapts the contribution of each sample to reflect prior knowledge about the temporal structure of event prediction labels. In this section, we find that MHP leverages the same information in Equation 1 to teach the model to predict events over multiple horizons [8]. Under simplifying assumptions justified empirically in Section 5.2, we show that this approach can be seen as a special case of temporal label smoothing with a 'staircase' parametrization.

In this framework, the unique label $y_{i,t}$ associated with patient covariates $X_{i,t}$ is replaced by labels $y^{h_k}_{i,t}$ defined over $H$ horizons $\{h_k\}_k$, with one model output $\hat{y}^{h_k}_{i,t}$ per horizon, and the training objective becomes:

$$L_{MHP}(y_{i,t}, \hat{y}_{i,t}) = -\frac{1}{H} \sum_{k=1}^{H} \left[ y^{h_k}_{i,t} \log(\hat{y}^{h_k}_{i,t}) + (1 - y^{h_k}_{i,t}) \log(1 - \hat{y}^{h_k}_{i,t}) \right]$$

Proposition 1. Under the assumption that model outputs $\{\hat{y}^{h_k}_{i,t}\}_k$ are equal for all $\{h_k\}_k$ (rather than monotonically increasing), MHP is equivalent to temporal label smoothing parameterized with $\alpha_{step}(i,t)$. This function, illustrated in Figure 3b, is defined as the following sequence of step functions in time:

$$\alpha_{step}(i,t) = \begin{cases} \frac{k}{H} & \text{if } h_k < t_e(i,t) - t \leq h_{k+1}, \; 1 \leq k \leq H - 1 \\ 0 & \text{if } t_e(i,t) - t \leq h_1 \\ 1 & \text{if } t_e(i,t) - t > h_H \end{cases} \qquad (5)$$

Proof. See Appendix A.1.

Proposition 1 frames MHP as a special case of TLS with step-function parametrization. We empirically justify the equal-output assumption through an ablation study in Section 5.2.
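Proposition 1 can be checked numerically with a small sketch: when every horizon shares one output, the averaged multi-horizon cross-entropy coincides with a single cross-entropy against the staircase target $q_{step} = 1 - \alpha_{step}$. The code assumes the label convention of Appendix A.1 ($y^{h} = 1$ when $t_e - t \leq h$); the helper names and the horizon grid are hypothetical.

```python
import math


def bce(q, y_hat):
    """Cross-entropy between a soft target q and a predicted probability y_hat."""
    return -(q * math.log(y_hat) + (1.0 - q) * math.log(1.0 - y_hat))


def mhp_loss(d_e, horizons, y_hat):
    """MHP objective when all horizon outputs share the single value y_hat."""
    labels = [1.0 if d_e <= h else 0.0 for h in horizons]
    return sum(bce(y, y_hat) for y in labels) / len(horizons)


def alpha_step(d_e, horizons):
    """Staircase smoothing strength: fraction of horizons shorter than d_e."""
    return sum(1 for h in horizons if h < d_e) / len(horizons)


horizons = [4.0, 8.0, 12.0, 16.0, 20.0, 24.0]
for d_e in (1.0, 4.0, 6.0, 12.0, 13.0, 24.0, 30.0):
    for y_hat in (0.1, 0.5, 0.9):
        q_step = 1.0 - alpha_step(d_e, horizons)
        # Both objectives agree up to floating-point error.
        assert math.isclose(mhp_loss(d_e, horizons, y_hat), bce(q_step, y_hat))
```

The check passes for time-to-event values inside, on, and beyond the horizon grid, which mirrors the three cases treated in the proof.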

4.1. EARLY PREDICTION TASKS

We demonstrate the effectiveness of our method on three clinical early prediction tasks with different characteristics, to understand its added value in each case. All tasks, established in existing literature and published benchmarks, deal with electronic health records from the ICU, where early prediction of organ failure or acute deterioration is critical to patient management [1] . Our work is first benchmarked on the prediction of acute circulatory failure and mild respiratory failure within the next h = 12 hours. These tasks are part of HiRID-ICU-Benchmark (HiB) [20] , built on the publicly available HiRID dataset [7] . The dataset contains high-resolution observations of over 33,000 ICU admissions. Our third evaluation task consists of early prediction of patient mortality, or decompensation, within a horizon of h = 24 hours. Although less clinically relevant, this task has been widely studied in the machine learning literature [31] . Defined in the MIMIC-III Benchmark (M3B) [32] , this task originates from the widely used MIMIC-III dataset [33] , counting approximately 40,000 patient stays. All three clinical events are labeled following internationally accepted criteria as in Harutyunyan et al. [32] and Yèche et al. [20] . Positive label prevalence is 4.3%, 38.6%, and 2.1% of timepoints for circulatory, respiratory failure, and decompensation prediction respectively -with rarer events associated with more severe states, in this instance. Further details on task definition and data pre-processing are provided in Appendix B. Signal deterioration over time. As visualized in Figure 4 , all tasks show a reduction in recall between event time t e and prediction horizon t e -h, suggesting a weakening in the discriminative signal associated with events and an increase in noise close to the label boundary. Whereas this performance decay is strong for the circulatory failure and decompensation tasks, respiratory failure shows a more consistent recall over time. 
From this observation, we expect that temporal label smoothing should improve performance on circulatory failure and decompensation prediction to a greater extent than on respiratory failure.

4.2. BENCHMARKING STRATEGY

Baselines. We quantify the added value of our method by comparing its performance to alternative learning approaches used for early event prediction, discussed in Section 2. Our first baselines consist of balanced cross-entropy [19] and focal loss [21], popular sample reweighting methods for imbalanced tasks. We also implement multi-horizon prediction as a multi-output model trained to predict event occurrence over different horizons between 0 and $2h$. Note that for a fair comparison, we set $(h_{min}, h_{max}) = (0, 2h)$ in TLS. As in Tomašev et al. [8], a cumulative distribution function layer on logits enforces the monotonicity of predictions (Equation 1). Finally, we also compare our method to conventional label smoothing [14] to confirm that a temporal dependency does improve performance.

Hyperparameter tuning. Hyperparameters introduced by our method, such as strength term $\gamma$ in smoothing parametrization $\alpha_{exp}$ (Equation 4), are optimized through grid searches on the validation set. The same approach is adopted for hyperparameters specific to each baseline, as shown in Figure 5.

Architecture choice. As our method and baselines are model-agnostic and only vary in terms of optimization objective, a unique model architecture is used for each task, selected through a random search on cross-entropy validation performance. Following a published benchmark on the HiRID dataset [20], we use a GRU [34] and transformer [35] architecture for the circulatory and respiratory failure tasks respectively. For decompensation prediction, transformers outperform the LSTM-based models [36] originally proposed in the M3B benchmark [32], and are thus used in our work. As recommended by Tomašev et al. [8], we apply $\ell_1$-regularization to input embedding layers, which improves performance on all tasks. Further implementation details are provided in Appendix C.

4.3. EVALUATION METRICS

To account for the imbalanced nature of clinical early prediction tasks, model performance is often reported through the area under the receiver operating characteristic curve (AUROC). Although this widely-used metric can be informative for moderate imbalances, the area under the precision-recall curve (AUPRC) provides more insight for our tasks: under a low prevalence of positive samples, precision is more sensitive to false alarms than specificity [37] . Still, "area under the curve" metrics can be poorly representative of clinical usefulness, as improvements in low precision regions can dominate such global metrics but remain incompatible with the low false alarm rates required for clinical deployment. Thus, to better assess model performance in this context, we also measure performance at a clinically motivated operating point through recall at 50% precision [24] . In addition to timestep-level metrics, which measure prediction performance at each data point, we also evaluate models in an event-based approach. Following Tomašev et al. [8] 's definition, an event prediction is positive if the model outputs a positive prediction at any time over the h hours before the event. The threshold defining a positive prediction is chosen based on a precision lower-bound: in practice, we use a 50% stepwise precision criterion. This allows us to measure the event recall of our approach in comparison to published baselines. Unless stated otherwise, we always report mean performance with 95% confidence intervals computed over ten training runs.
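The two operating-point metrics described above can be sketched as follows. This is a simplified reading under stated assumptions, not the evaluation code used for the reported results: time steps are integer indices, candidate thresholds are the observed scores, precision is taken as 1 when no alarm is raised, and the function names are hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Timestep-level precision and recall for alarms raised at score >= threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0  # convention: no alarms
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


def recall_at_precision(scores, labels, min_precision=0.5):
    """Return (threshold, recall) maximizing recall subject to the precision bound."""
    best = None
    for thr in sorted(set(scores)):
        precision, recall = precision_recall(scores, labels, thr)
        if precision >= min_precision and (best is None or recall > best[1]):
            best = (thr, recall)
    return best


def event_recall(stays, threshold, horizon):
    """Fraction of events with at least one alarm in the `horizon` steps before onset."""
    detected = total = 0
    for scores, event_onsets in stays:
        for t_e in event_onsets:
            window = scores[max(0, t_e - horizon):t_e]
            total += 1
            if any(s >= threshold for s in window):
                detected += 1
    return detected / total if total else 0.0


scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [0, 0, 0, 1, 1, 0]
threshold, recall = recall_at_precision(scores, labels)
print(threshold, recall)                               # 0.35 1.0
print(event_recall([(scores, [5])], threshold, 2))     # 1.0
```

Selecting the threshold on timestep-level precision first and then reusing it for event-level recall follows the order described in the text, where the positive-prediction threshold is fixed by a precision lower bound.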

5.1. PREDICTION PERFORMANCE

Overall, our results highlight that TLS improves performance over other approaches proposed to address the challenges of early clinical prediction. In Table 2, we find TLS to outperform other baselines across all metrics for both circulatory failure and decompensation. Despite overlapping confidence intervals between multi-horizon and TLS on decompensation due to indi-

In contrast, as illustrated in Figure 5, loss reweighting methods designed to tackle class imbalance were found to reduce performance on all tasks over traditional cross-entropy. For weighted cross-entropy, we attribute this to the increase in false alarms resulting from the drive to improve recall. This further reduces the low precision of all models, thus negatively affecting the AUPRC (as visualized in Appendix D.5). On the other hand, focal loss down-weights confident samples in training, constraining the model to focus on samples with uncertain predictions. In the context of noisy labeling, as is the case close to our class boundary, data points with ambiguous signals cannot be correctly predicted and thus dominate the loss, impeding improvements in other regions of input space. We analyze model performance over time in Section 5.2 to further support this hypothesis.

Clinically-relevant performance. We also compare the full precision-recall curve of models trained with these different objectives in Figure 6a. Note that we obtain comparable results for decompensation prediction in Appendix D.2. In addition to visually confirming the numerical results in Table 2, we find that our training objective affords particular performance improvements in the clinically-relevant region corresponding to low false-alarm rates (precision greater than 50%) [1].

Event-based analysis. Finally, as highlighted in Figure 6b, TLS improves performance in terms of predicting overall adverse event episodes throughout a stay on all prediction tasks. This suggests that performance improvements at the timestep level affect a large number of events and translate to better event detection. Indeed, we demonstrate in Section 5.2 that TLS affords larger performance gains close to the event time, thus leading to a better recall of imminent events. We obtain similar conclusions for both other tasks (see Appendix D.1). For circulatory failure, temporal label smoothing is able to predict 7.4% more events than the closest baseline (multi-horizon prediction): this corresponds to reducing the number of missed events in the test set by a factor of 2, from 303 to 152 out of 2045 events on average. TLS predicts the events not captured by MHP on average 104 minutes before their occurrence, giving clinicians sufficient time to take action and avoid patient degradation.
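The event-level figures quoted for circulatory failure are mutually consistent, as the short arithmetic check below suggests. It only reuses the three counts stated in the text (2045 events, 303 and 152 missed on average); the variable names are illustrative, and the 7.4% gain is read here as percentage points of all test-set events.

```python
total_events = 2045
missed_mhp = 303   # events missed by multi-horizon prediction, on average
missed_tls = 152   # events missed by temporal label smoothing, on average

extra_detected = missed_mhp - missed_tls              # 151 additional events
gain_in_event_recall = extra_detected / total_events  # about 0.074
reduction_factor = missed_mhp / missed_tls            # about 1.99

recall_mhp = 1 - missed_mhp / total_events            # about 0.852
recall_tls = 1 - missed_tls / total_events            # about 0.926

print(f"{100 * gain_in_event_recall:.1f} points, factor {reduction_factor:.2f}")
# 7.4 points, factor 1.99
```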

5.2. ILLUSTRATIVE INSIGHTS

We propose ablations and analyses to build intuition around our proposed method. In particular, we aim to highlight how temporal smoothing works and why it outperforms other training approaches for early prediction tasks.

Performance over time. In Figure 7, we compare the performance difference between our method, TLS, and the regular cross-entropy objective over time, previously studied in Figure 2a. We perform the same analysis in Appendix D for other tasks. As expected, the prediction model trained with TLS is less competitive where label smoothing is strongest, near $t_e - h$, but this performance loss remains minor even with significant smoothing. This result validates our hypothesis that the signal is too noisy in the boundary region for any model to recover the original label distribution. In contrast, away from the label boundary, TLS results in a significant increase in true positive and negative rates. From a clinical perspective, errors made in the boundary region are less critical, as they result in the latest false positives or earliest false negatives. Consequently, TLS not only improves global event prediction performance but allows these gains to occur at more critical times for clinicians.

Empirical comparison to multi-horizon prediction. In our theoretical discussion in Section 3.3, we demonstrated how MHP is a restriction of label smoothing with a step function $\alpha_{step}(i,t)$. This claim relies on the constraint to produce a unique prediction across all considered horizons, reflecting the design of our method. We verify the impact of this assumption by measuring performance gains afforded by learning distinct predictions per horizon. As shown in Table 3, with full precision-recall curves in Figure 19, we find no statistical evidence for performance gain over using $\alpha_{step}$ on all tasks and studied metrics. Thus, models do not appear to leverage this additional flexibility offered by MHP. With superior results on all timestep- and event-based experiments, and greater scalability thanks to the single prediction horizon modeled, we find temporal label smoothing to be a superior training objective to MHP in early prediction tasks.

5.3. TRADE-OFFS AND LIMITATIONS

Despite the demonstrated advantage of our training paradigm for two distinct early prediction tasks, we observed more limited performance gain over traditional cross-entropy when predicting respiratory failure in Table 2, as with other baselines. This observation motivated an analysis of the specific problem settings in which our objective helps.

Respiratory failure events are much more frequent than circulatory failure or decompensation, with the majority of ICU patients undergoing approximately two such events during their stay, as quantified in Appendix B. We hypothesize that this reduced class imbalance leads to sufficient discriminative information within the label boundary region. This belief is supported by the lower performance drop from event to prediction time in Figure 4 in comparison to other tasks, and results in a more significant performance loss close to $t_e - h$ with TLS, with a 1% drop in true positive rate (TPR) in Figure 8. However, as expected by design, our method improves recall (+1% TPR) over cross-entropy close to the event. This also leads to a non-negligible 0.4% (p-value < 0.05) improvement in event recall, visualized in Appendix D.2.

Overall, this analysis reveals that whereas TLS has little impact on global metrics for tasks with limited performance reduction over time (e.g., close-to-balanced tasks, often of limited usefulness in clinical decision support efforts [8]), it still results in clinically meaningful performance improvements along per-horizon and event-based metrics.

6. CONCLUSION

Early prediction of adverse events is paramount to the development of clinical decision support systems, with a demonstrated potential to improve patient outcomes [3] . Still, this task remains poorly studied in the machine learning literature, with few training solutions tailored to address its challenges. Based on typically rare and noisy labels, models must learn to discriminate a predictive signal in anticipation of events to allow an adequate medical response. After highlighting the limitations of traditional classification objectives and methods designed to address the class imbalance, we propose a novel training framework that leverages trends in event signals over time. We show that multi-horizon prediction, a heuristic used to improve early prediction, can be formalized as a restriction of our framework. Simple but effective, temporal label smoothing empirically matches or outperforms all considered baselines on various tasks and datasets, with significant improvements on clinically-relevant evaluation metrics. Performance gains are limited, as with other baselines, for respiratory failure prediction in which higher event prevalence provides sufficient informative data points for the model to learn through a conventional cross-entropy objective. In further work, we aim to explicitly adapt the temporal inductive bias to the task at hand and to combine temporal label smoothing with recent objectives designed to directly optimize AUPRC, such as minimum precision constraint [39] or dice-based loss functions [40] . Looking ahead, we expect that temporal label smoothing will be leveraged to develop more clinically reliable systems for risk prediction of rare adverse events. Further research on tailored machine learning solutions to improve real-world decision support holds promise for better clinical care and operations management.

A THEORETICAL DETAILS

A.1 MULTI-HORIZON PREDICTION: PROOF OF PROPOSITION 1

Equivalency between MHP and TLS objectives. Recalling the formalism of multi-horizon prediction outlined in Section 3.3, true labels and model predictions can be rewritten as $y_{i,t} = [y^{h_1}_{i,t}, \dots, y^{h}_{i,t}, \dots, y^{h_H}_{i,t}]$ and $\hat{y}_{i,t} = [\hat{y}^{h_1}_{i,t}, \dots, \hat{y}^{h}_{i,t}, \dots, \hat{y}^{h_H}_{i,t}]$, where $H$ is the number of horizons considered. The training objective for patient $i$ becomes:

$$L_{MHP}(y_{i,t}, \hat{y}_{i,t}) = -\frac{1}{H} \sum_{k=1}^{H} \left[ y^{h_k}_{i,t} \log(\hat{y}^{h_k}_{i,t}) + (1 - y^{h_k}_{i,t}) \log(1 - \hat{y}^{h_k}_{i,t}) \right]$$

The assumption that $\{\hat{y}^{h_k}_{i,t}\}_k$ is equal for all $k$ allows us to rewrite the objective as follows:

$$L_{MHP}(y_{i,t}, \hat{y}_{i,t}) = -\left[ \log(\hat{y}_{i,t}) \, \frac{1}{H} \sum_{k=1}^{H} y^{h_k}_{i,t} + \log(1 - \hat{y}_{i,t}) \, \frac{1}{H} \sum_{k=1}^{H} (1 - y^{h_k}_{i,t}) \right]$$

with $\hat{y}_{i,t}$ being the common prediction shared across all horizons. This equation can now be viewed as a temporal label smoothing objective with smoothed labels $q_{step}(1|i,t) = \frac{1}{H} \sum_{k=1}^{H} y^{h_k}_{i,t}$:

$$L_{MHP}(y_{i,t}, \hat{y}_{i,t}) = -\left[ \log(\hat{y}_{i,t}) \cdot q_{step}(1|i,t) + \log(1 - \hat{y}_{i,t}) \cdot \left(1 - q_{step}(1|i,t)\right) \right]$$

Smoothing parametrization. Next, we aim to recover the explicit form of $q_{step}(1|i,t)$. Without loss of generality, we assume that horizons $\{h_k\}_k$ are in ascending order. The temporal dependency between samples, formalized in Equation 1, results in the following relationship between labels at horizons $h_u$ and $h_v$:

$$v \leq u \text{ and } y^{h_v}_{i,t} = 1 \implies y^{h_u}_{i,t} = 1 \qquad (6)$$
$$v \geq u \text{ and } y^{h_v}_{i,t} = 0 \implies y^{h_u}_{i,t} = 0 \qquad (7)$$

Thanks to the above property, we can determine $q_{step}(1|i,t)$ by studying three cases of multi-horizon labels, illustrated in Figure 9. For notational simplicity, we define $d_e(i,t) = t_e(i,t) - t$.

Figure 9: Label values for multi-horizon prediction, and conversion to smoothed labels $q_{step}(1|t)$.

Case 1: $d_e(i,t) \leq h_1$. From the label definition, we have that $y^{h_1}_{i,t} = 1$ if $d_e(i,t) \leq h_1$. As $h_1$ is the smallest horizon, following Equation 6, we have $y^{h_c}_{i,t} = 1, \forall c \in \llbracket 1, H \rrbracket$. We can rewrite the objective as:

$$L_{MHP}(y_{i,t}, \hat{y}_{i,t}) = -\log(\hat{y}_{i,t}) = -\left[ q_{step}(1|i,t) \log(\hat{y}_{i,t}) + (1 - q_{step}(1|i,t)) \log(1 - \hat{y}_{i,t}) \right]$$

where $q_{step}(1|i,t) = 1$.

Case 2: $d_e(i,t) > h_H$. Similarly, if $d_e(i,t) > h_H$, then $y^{h_H}_{i,t} = 0$, which implies $y^{h_c}_{i,t} = 0, \forall c \in \llbracket 1, H \rrbracket$ from Equation 7. The objective can be rewritten as:

$$L_{MHP}(y_{i,t}, \hat{y}_{i,t}) = -\log(1 - \hat{y}_{i,t}) = -\left[ q_{step}(1|i,t) \log(\hat{y}_{i,t}) + (1 - q_{step}(1|i,t)) \log(1 - \hat{y}_{i,t}) \right]$$

where $q_{step}(1|i,t) = 0$.

Case 3: $\exists k \in \llbracket 1, H - 1 \rrbracket$ s.t. $h_k < d_e(i,t) \leq h_{k+1}$. Following the same reasoning as in the first two cases, we now have a specific index $k$ which separates positive and negative labels. We have $y^{h_c}_{i,t} = 0, \forall c \in \llbracket 1, k \rrbracket$ and $y^{h_c}_{i,t} = 1, \forall c \in \llbracket k + 1, H \rrbracket$. This allows us to rewrite the objective as follows:

$$L_{MHP}(y_{i,t}, \hat{y}_{i,t}) = -\left[ \frac{H - k}{H} \log(\hat{y}_{i,t}) + \frac{k}{H} \log(1 - \hat{y}_{i,t}) \right] = -\left[ q_{step}(1|i,t) \log(\hat{y}_{i,t}) + (1 - q_{step}(1|i,t)) \log(1 - \hat{y}_{i,t}) \right]$$

where $q_{step}(1|i,t) = \frac{H - k}{H}$.

Defining a new smoothing parametrization $\alpha_{step}$ such that $q_{step}(1|i,t) = 1 - \alpha_{step}(i,t)$, we obtain:

$$\alpha_{step}(i,t) = \begin{cases} \frac{k}{H} & \text{if } h_k < d_e(i,t) \leq h_{k+1}, \; 1 \leq k \leq H - 1 \\ 0 & \text{if } d_e(i,t) \leq h_1 \\ 1 & \text{if } d_e(i,t) > h_H \end{cases}$$

Thus, $\forall d_e(i,t) > 0$, we find that $L^{MHP}_i = L^{TLS}_i$ when smoothed labels are defined as $q_{step}(1|i,t) = 1 - \alpha_{step}(i,t)$. This concludes our proof.

A.2 TEMPORAL LABEL SMOOTHING FUNCTIONS

Motivated by prior work [8; 18], we compare the performance of various smoothing functions $\alpha(i,t)$.

All proposed parametrizations are continuous and monotonically decreasing functions that satisfy the boundary conditions $\alpha(i, t_e(i,t) - 2h) = 1$ and $\alpha(i, t_e(i,t)) = 0$. As evidenced in Table 4, we find exponential label smoothing to perform best, or as well as the alternatives, across all tasks and metrics. Performance as a function of hyperparameter setting can be visualized in Figure 11. All model and hyperparameter selection was carried out on the validation set, including the final choice of parametrization function.

$$\alpha_{shift}(i,t) = \mathbb{1}\left[ t_e(i,t) - t \ge h_{shift} \right]$$

where $h_{shift}$ is a hyperparameter controlling the horizon of the smoothed labels ($h_{shift} = h$ corresponds to cross-entropy training). The strength of this smoothing function is illustrated in Figure 10a. Figure 11 outlines the performance of this alternative smoothing parametrization as a function of $h_{shift}$. For both decompensation and respiratory failure, shifting the label boundary closer to the event time decreases performance. On circulatory failure, performance does improve over traditional cross-entropy training as the label horizon is brought closer to the event of interest, which can be interpreted as an inductive bias similar to that induced by the exponential smoothing function.

Linear label smoothing. The most straightforward extension to the step function $\alpha_{step}$ described in Section 3.3 is a linear label smoothing corresponding to the case $H \to +\infty$. Our parametrization $\alpha_{linear}(i,t)$ is thus defined as follows:

$$\alpha_{linear}(i,t) = \begin{cases} \frac{t_e(i,t) - t}{2h} & \text{if } t_e(i,t) - t < 2h \\ 1 & \text{if } t_e(i,t) - t \ge 2h \end{cases}$$

We illustrate the impact of the number of steps $H$ in Figure 10b.

Sigmoidal label smoothing. Another natural direction to explore is to smooth labels starting from the true distribution, a unique step function at $t = t_e(i,t) - h$.
This can be achieved by defining $\alpha(t)$ as a generalized logistic function [41]:

$$\alpha_{sigmoid}(i,t) = \begin{cases} 1 - \frac{K - A}{1 + e^{\frac{t_e(i,t) - t - d}{\gamma}}} - A & \text{if } t_e(i,t) - t < 2h \\ 1 & \text{if } t_e(i,t) - t \ge 2h \end{cases}$$

where $K$, $A$ and $d$ are three constants fixed by imposing the boundary conditions at $t = t_e(i,t) - 2h$ and $t = t_e(i,t)$, as well as the midpoint condition $\alpha(t_e(i,t) - h) = \frac{1}{2}$. This yields:

$$K = -A e^{\frac{2h - d}{\gamma}}, \qquad A = \frac{e^{-\frac{d}{\gamma}} + 1}{e^{-\frac{d}{\gamma}} - e^{\frac{2h - d}{\gamma}}}, \qquad d = h$$

As shown in Figure 10d, $\gamma$ controls the smoothing strength, interpolating between the true distribution $\delta_{y_i = 1}$ as $\gamma \to 0$ and $q_{linear}$ when $\gamma \to +\infty$.

Exponential label smoothing. The smoothing function we find to perform best is exponential decay. This idea is motivated by survival analysis, where patient survival probability can be modeled as the exponential decay of a cumulative hazard function [18; 42]. In practice, as defined in Section 3.2, our exponential smoothing function $\alpha_{exp}(i,t)$ is defined as follows:

$$\alpha_{exp}(i,t) = \begin{cases} 1 - e^{-\gamma (t_e(i,t) - t - d)} - A & \text{if } t_e(i,t) - t < 2h \\ 1 & \text{if } t_e(i,t) - t \ge 2h \end{cases}$$

where parameters $\{d, A\}$ are set to satisfy the boundary conditions:

$$A = -e^{-\gamma (2h - d)}, \qquad d = -\frac{1}{\gamma}\ln\left(1 - e^{-\gamma 2h}\right)$$

Here, $\gamma$ also controls the smoothing strength, interpolating between $q_{linear}$ when $\gamma \to 0$ and $q(t) = 0 \ \forall t < t_e$ when $\gamma \to +\infty$. Overall, despite $\alpha_{sigmoid}$ and $\alpha_{shift}$ achieving good results on respiratory and circulatory failure respectively, $\alpha_{exp}$ statistically outperforms these smoothing parameterizations across all tasks on validation metrics. An interesting avenue for further work would be to combine exponential smoothing with the boundary shift approach, or effectively change $(h_{min}, h_{max})$, which was fixed to $(0, 2h)$ in our work for a fair comparison to multi-horizon prediction.

Concave exponential label smoothing.
Finally, to mirror the behavior of the exponential smoothing function away from linear interpolation and investigate its effect on performance, we designed the following concave smoothing function $\alpha_{concave}$:

$$\alpha_{concave}(i,t) = \begin{cases} e^{-\gamma (d - t_e(i,t) + t)} - A & \text{if } t_e(i,t) - t < 2h \\ 1 & \text{if } t_e(i,t) - t \ge 2h \end{cases}$$

Parameters $\{d, A\}$ are identical to the convex smoothing function parameters, set to satisfy the boundary conditions. The strength of this concave smoothing function is illustrated in Figure 10c. No performance gains were obtained through temporal label smoothing with a concave function, as shown in Figure 11. This smoothing function effectively penalizes false positives more heavily than false negatives, which is less suited to our tasks of interest (in contrast to the convex $\alpha_{exp}$). As a result, the best-performing concave parametrization is consistently obtained with the lowest value of $\gamma$, closer to a linear function choice.
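The linear, sigmoidal and exponential parametrizations of this appendix can be written as functions of the time to event $d_e = t_e(i,t) - t$. The sketch below is illustrative only: the function names and the values of `h` and `gamma` are assumptions, and the sigmoid uses the midpoint condition $\alpha = \frac{1}{2}$ at $d_e = h$, which is the condition under which the stated constants ($d = h$) hold. The final assertions check the boundary conditions $\alpha = 0$ at the event and $\alpha = 1$ at $d_e = 2h$.

```python
import math

def alpha_linear(d_e, h):
    """Linear smoothing between the event (0) and 2h before it (1)."""
    return min(d_e / (2 * h), 1.0)

def alpha_exp(d_e, h, gamma):
    """Exponential smoothing; d and A enforce the boundary conditions."""
    if d_e >= 2 * h:
        return 1.0
    d = -math.log(1 - math.exp(-gamma * 2 * h)) / gamma
    A = -math.exp(-gamma * (2 * h - d))
    return 1 - math.exp(-gamma * (d_e - d)) - A

def alpha_sigmoid(d_e, h, gamma):
    """Generalized logistic smoothing centred on the label boundary d = h."""
    if d_e >= 2 * h:
        return 1.0
    d = h
    e0 = math.exp(-d / gamma)
    e2 = math.exp((2 * h - d) / gamma)
    A = (e0 + 1) / (e0 - e2)
    K = -A * e2
    return 1 - (K - A) / (1 + math.exp((d_e - d) / gamma)) - A

h = 12.0  # hypothetical prediction horizon, in hours
for alpha in (lambda x: alpha_linear(x, h),
              lambda x: alpha_exp(x, h, 0.2),
              lambda x: alpha_sigmoid(x, h, 2.0)):
    assert abs(alpha(0.0)) < 1e-9                 # no smoothing at the event
    assert abs(alpha(2 * h - 1e-9) - 1.0) < 1e-6  # continuous at d_e = 2h
```

With these definitions, the smoothed positive label used in training is $q(1|i,t) = 1 - \alpha(i,t)$.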

A.3 RELATED TIME-SERIES TASKS

Comparison to survival analysis. Survival analysis consists of statistical methods concerned with predicting the probability of a certain event taking place over time [42] . In our formalism outlined in Section 3.1, the corresponding task is to regress the time of the next event, t e (t, i), based on patient information accumulated up to time t. To recover early event prediction, a threshold on the hazard model can thus be applied to determine whether an event will happen within our horizon of interest h. Modeling constraints imposed in survival analysis improve time-to-event prediction performance over traditional regression methods, which supports our approach to leverage the temporal structure of our comparable task. Interestingly, recent developments in survival modeling to deal with dynamic predictions have been addressed with multi-horizon prediction [43] . Still, distinctions must be highlighted between our adverse event prediction problem and the typical experimental setup for survival analysis: in our case, multiple events can occur over the course of a patient's stay, with unknown patient states during and immediately after event occurrence. This results in complex, informative censoring patterns and challenges common assumptions in survival analysis, which can therefore not be directly applied to our task. Comparison to early time-series classification. A distinction must be drawn between our task of early prediction of adverse events and that of early time-series classification. The latter has been more extensively explored in the literature [44; 45; 46] , but addresses a distinct problem. Considering a time series up to timestep t, early event prediction is concerned with classifying whether a particular event will occur between t and t + h, for a fixed horizon h. 
Predictions are made at each timepoint over the entire time series: as multiple samples arise from the same time series and therefore depend on one another over time, these should not be considered i.i.d. In contrast, early classification of time series aims to regress the first timepoint t at which a label for the entire time series can be predicted with a desired accuracy [44]. A single prediction is made, as soon as possible, for the entire series, which can be considered an independent sample from the dataset of time series. This latter task can be framed as early prediction of the event "prediction is possible", where h = ∞, given a separate time-series classifier. As a result, an interesting avenue of further work would be to apply temporal label smoothing to the latter task. On the other hand, early event prediction cannot be translated into a simple early classification problem. Methods designed for early time-series classification are therefore not applicable to this problem setting.

A.4 BASELINE OBJECTIVE FUNCTIONS

In this section, we clarify the mathematical formalism behind our baselines to facilitate comparison to temporal label smoothing. All baselines explored effectively propose a modification of the cross-entropy objective often used for binary classification tasks, $L_i = L_{CE}(y_i, \hat{y}_i)$.

Balanced cross-entropy. To facilitate learning from highly imbalanced datasets, balanced cross-entropy relies on reweighting samples based on their class prevalence, as follows:

$$L_{CE} = \frac{1}{N}\sum_{i}^{N} \omega_{y_i} L(\hat{y}_i, y_i), \qquad \omega_{y_i} = \frac{1}{C \cdot b(y_i)}$$

where $C$ is the number of classes and $b(c)$ defines the prevalence of class $c$ such that $\sum_c b(c) = 1$. Regular cross-entropy corresponds to the case where $b(c) = \frac{1}{C}$ for all classes. In the binary setting, $b(1)$ can be treated as a hyperparameter determining the contribution of the minority class to the loss.

Focal loss. Denoting our output prediction as $\hat{y}_i = p_\theta(y_i = 1)$, the focal loss objective for binary classification of target $y_i$ is a variant on the balanced cross-entropy loss:

$$L_{focal}(y_i, \hat{y}_i) = -\omega_1 (1 - \hat{y}_i)^{\zeta} y_i \log(\hat{y}_i) - \omega_0 \hat{y}_i^{\zeta} (1 - y_i) \log(1 - \hat{y}_i)$$

where $\omega_{y_i}$ is a balancing weight for class $y_i$ and $\zeta$ is the focal loss weight.

Multi-horizon prediction. As highlighted in Section 3.3, multi-horizon training can be formalized as the following objective:

$$L_{MHP}(y_{i,t}, \hat{y}_{i,t}) = -\frac{1}{H}\sum_{k=1}^{H}\left[ y^{h_k}_{i,t}\log(\hat{y}^{h_k}_{i,t}) + (1 - y^{h_k}_{i,t})\log(1 - \hat{y}^{h_k}_{i,t}) \right]$$

where true labels and model predictions are given by $y_{i,t} = [y^{h_1}_{i,t}, \dots, y^{h}_{i,t}, \dots, y^{h_H}_{i,t}]$ and $\hat{y}_{i,t} = [\hat{y}^{h_1}_{i,t}, \dots, \hat{y}^{h}_{i,t}, \dots, \hat{y}^{h_H}_{i,t}]$, for $H$ distinct horizons.

Label smoothing. As introduced by Szegedy et al. [14], label smoothing consists of substituting the original label distribution $\delta_{y_i = c}$ in the cross-entropy objective $L_i = L_{CE}(y_i, \hat{y}_i)$ by a smoothed version $q_{LS}(c|y_i)$. This surrogate distribution over classes $c$ is defined as follows:

$$q_{LS}(c|y_i) = \delta_{y_i = c}(1 - \alpha) + u(c)\alpha$$

In the original approach, $u$ is uniform and $\alpha \in [0, 1]$ controls the smoothing strength.
By shifting the minimum of the objective function away from $\hat{y}_i = 1$, label smoothing prevents the model from becoming overconfident during training. Alternative designs for $u$ have been proposed [28; 29; 30] but are incompatible with the binary nature of adverse event prediction. In binary tasks, labeling is defined according to the positive class such that $y_i \in \{0, 1\}$ and $\hat{y}_i = p_\theta(y_i = 1)$. Label smoothing therefore becomes a linear interpolation with parameter $\alpha$ such that $q_{LS}(1|y_i) = p(y_i = 1)$:

$$q_{LS}(1|y_i) = (1 - \alpha) y_i + \alpha (1 - y_i)$$

As suggested by Lukasik et al. [27], label smoothing can be used to regularize early prediction models due to the inherently noisy nature of the task. It does not, however, account for the time dependency between samples of a given stay, highlighted in our problem formalism (Section 3.1). In contrast, temporal label smoothing modulates smoothing based on time $t$ to infuse this prior knowledge into the training objective.
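The baseline objectives of this section can be compared side by side on a single sample. The following is a minimal per-sample sketch written from the formulas above, not the authors' implementation; the function names and the numeric values are hypothetical. It shows the balancing weight $\omega_{y} = 1 / (C \cdot b(y))$, the focal loss, and the binary label-smoothing target.

```python
import math

def bce(q, p):
    """Binary cross-entropy of prediction p against a (possibly soft) label q."""
    return -(q * math.log(p) + (1 - q) * math.log(1 - p))

def class_weight(y, b_pos, num_classes=2):
    """Balancing weight 1 / (C * b(y)), with b(0) = 1 - b(1) in the binary case."""
    b = b_pos if y == 1 else 1 - b_pos
    return 1 / (num_classes * b)

def focal_loss(y, p, w1, w0, zeta):
    """Focal loss; zeta = 0 recovers (weighted) cross-entropy."""
    return (-w1 * (1 - p) ** zeta * y * math.log(p)
            - w0 * p ** zeta * (1 - y) * math.log(1 - p))

def label_smoothing_target(y, alpha):
    """Binary label smoothing: q_LS(1|y) = (1 - alpha) * y + alpha * (1 - y)."""
    return (1 - alpha) * y + alpha * (1 - y)

# With b(1) = 1/2, both classes receive weight 1 (regular cross-entropy).
assert class_weight(1, 0.5) == 1.0 and class_weight(0, 0.5) == 1.0
# With unit weights and zeta = 0, focal loss equals cross-entropy.
assert abs(focal_loss(1, 0.7, 1.0, 1.0, 0.0) - bce(1, 0.7)) < 1e-12
# A smoothed target moves the loss minimum away from p = 1.
q = label_smoothing_target(1, 0.05)
assert bce(q, 0.95) < bce(q, 0.99)
```

Temporal label smoothing uses the same cross-entropy against a soft label, but replaces the constant $\alpha$ with a function $\alpha(i,t)$ of the time to the next event.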

B DATASET DETAILS

B.1 TASK DEFINITION

In this section, we provide more details on the definition of our early prediction tasks for circulatory failure and respiratory failure from HiB [20] and decompensation from M3B [32]. A breakdown of event prevalence for each clinical endpoint is given in Table 5.

Circulatory failure is a failure of the cardiovascular system, detected in practice through elevated arterial lactate (> 2 mmol/l) and either low mean arterial pressure (< 65 mmHg) or administration of a vasopressor drug. Yèche et al. [20] define a patient to be experiencing a circulatory failure event at a given time if those conditions are met for 2/3 of time points in a surrounding two-hour window. Early prediction labels are then derived from these event labels as outlined in Section 3.1.

Respiratory failure is defined by Yèche et al. [20] as a P/F ratio (arterial $pO_2$ over $FiO_2$) below 300 mmHg. This definition includes mild respiratory failure, which explains the higher event prevalence in Table 5. As above, Yèche et al. [20] consider a patient to be experiencing respiratory failure if 2/3 of timepoints are positive within a surrounding two-hour window.

Decompensation refers to the death of a patient. Event labels are directly extracted from the MIMIC-III [33] metadata about the time of death of a patient. Early prediction labels are also extracted following Section 3.1. Note that decompensation can occur outside of the ICU stay if a patient is sent to a palliative unit, for instance, which can result in patient stays with fewer than 24 positive samples.
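The windowed event definition used for circulatory and respiratory failure can be illustrated with a short sketch. This is an assumption-laden reading of the rule stated above rather than the benchmark's code: the function name, the centred window, the truncation of the window at the edges of the stay, and the treatment of the 2/3 threshold as inclusive are all hypothetical choices.

```python
from fractions import Fraction

def event_labels(condition_met, window_steps, min_fraction=Fraction(2, 3)):
    """Label timestep t positive when the clinical condition holds for at
    least `min_fraction` of the timesteps in a window centred on t."""
    n = len(condition_met)
    half = window_steps // 2
    labels = []
    for t in range(n):
        # Assumption: the window is truncated at the start and end of the stay.
        lo, hi = max(0, t - half), min(n, t + half + 1)
        window = condition_met[lo:hi]
        labels.append(Fraction(sum(window), len(window)) >= min_fraction)
    return labels

# Hypothetical stay sampled on a coarse grid, with a window of two steps.
print(event_labels([1, 1, 1, 0, 0, 0], window_steps=2))
```

Exact fractions are used so that a window meeting the condition in exactly 2/3 of its timesteps is not lost to floating-point rounding.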

B.2 PRE-PROCESSING

We describe the pre-processing steps we applied to both datasets, HiRID and MIMIC-III.

Imputation. Diverse imputation methods exist for ICU time series. For simplicity, we follow the approach of the original benchmarks [32; 20] by using forward imputation when a previous measurement exists. The remaining missing values are zero-imputed after scaling, which corresponds to mean imputation.

Scaling. Whereas prior work explored clipping the data to remove potential outliers [8], we do not adopt this approach as we found it to reduce performance on early prediction tasks. A possible explanation is that, due to the rarity of events, clipping extreme quantiles may remove parts of the signal rather than noise. Instead, we simply standard-scale the data based on training set statistics.

Hidden Dimension (32, 64, 128, 256)
L1 Regularization (1e-2, 1e-1, 1, 10)

Multi-horizon prediction. Following Tomašev et al. [8], we consider H horizons on both sides of the true horizon h, between 0 and 2h. As we did not find H → +∞ to increase performance, we selected H = 11 (including the true horizon h), compared to H = 8 in Tomašev et al. [8], which we found to perform slightly worse. This means we made a prediction every 2 hours for HiB tasks and every 4 hours for decompensation.

Label smoothing. Label smoothing [14], as defined in Section 3.2, is normally used in a multi-class setting. We still compared our method to it for two reasons: first, to explore whether it can help when dealing with a noisy signal, as we claim is the case for early event detection; second, to ablate the impact of adding a temporal dependency to the method. Again, we select the hyperparameter α through a grid search. Interestingly, we found label smoothing to slightly improve performance on the validation set for all tasks, as opposed to the results reported for the test set in Table 2. We found α = 0.05 to perform best for circulatory failure and decompensation. For respiratory failure, we found α = 0.1 to have the best validation performance.
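The pre-processing steps of Appendix B.2 (forward imputation, standard scaling on training statistics, then zero imputation of the remaining gaps) can be sketched for a single variable as follows. This is an illustrative sketch, not the benchmark pipeline: the function names are hypothetical, missing values are represented as `None`, and computing the scaler on observed training values only is an assumption.

```python
def forward_fill(series):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for x in series:
        if x is not None:
            last = x
        filled.append(last)
    return filled

def fit_scaler(train_values):
    """Mean and standard deviation of the observed training values."""
    observed = [x for x in train_values if x is not None]
    mean = sum(observed) / len(observed)
    var = sum((x - mean) ** 2 for x in observed) / len(observed)
    return mean, var ** 0.5

def preprocess(series, mean, std):
    """Forward-fill, standard-scale, then zero-impute what is still missing.
    After scaling, zero is the training mean, hence mean imputation."""
    return [0.0 if x is None else (x - mean) / std
            for x in forward_fill(series)]

mean, std = fit_scaler([1.0, None, 3.0])       # hypothetical training values
print(preprocess([None, 1.0, None, 3.0], mean, std))
```

The leading gap has no previous measurement to carry forward, so it is the only entry that falls back to zero imputation.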

C.2 TLS IMPLEMENTATION

TLS depends on two components: the temporal range over which we smooth labels, defined by $h_{min}$ and $h_{max}$, and the smoothing function $\alpha(i,t)$. For a fair comparison, we fix the temporal range to match MHP, setting $h_{min} = 0$ and $h_{max} = 2h$ in all experiments. For the smoothing function, we perform a grid search over the types of function discussed in Appendix A.2 and over the smoothing strength parameter $\gamma$. For all experiments, we found $\alpha_{exp}$ to outperform the other functions considered. Given validation performance, we used $\gamma = 0.2$ for circulatory failure and $\gamma = 0.05$ for respiratory failure and decompensation. As discussed in Section 3.2, unlike MHP, TLS does not require any change to the architecture that would add computational overhead. The smoothing of the labels can be easily integrated into the data loader, as shown in Figure 12.
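To make the data-loader integration concrete, the sketch below converts per-timestep event labels into exponentially smoothed early prediction labels with $h_{min} = 0$ and $h_{max} = 2h$. It is written from the formulas of Appendix A.2 and is not the code of Figure 12: the function names, the fixed sampling step `step_hours`, and the example values are hypothetical, and any masking of timesteps during or after an event is left out.

```python
import math

def time_to_next_event(event_labels, step_hours):
    """Hours until the next positive event label, or inf if none follows."""
    dist = [math.inf] * len(event_labels)
    next_event = None
    for t in range(len(event_labels) - 1, -1, -1):
        if event_labels[t]:
            next_event = t
        if next_event is not None:
            dist[t] = (next_event - t) * step_hours
    return dist

def smooth_labels(event_labels, step_hours, h, gamma):
    """Smoothed labels q(1|t) = 1 - alpha_exp(t) over the range [0, 2h]."""
    h_max = 2 * h
    d = -math.log(1 - math.exp(-gamma * h_max)) / gamma
    A = -math.exp(-gamma * (h_max - d))
    return [0.0 if d_e >= h_max else math.exp(-gamma * (d_e - d)) + A
            for d_e in time_to_next_event(event_labels, step_hours)]

# Hypothetical stay: one event at the last of five hourly timesteps, h = 1 hour.
q = smooth_labels([0, 0, 0, 0, 1], step_hours=1.0, h=1.0, gamma=0.5)
assert q[:3] == [0.0, 0.0, 0.0]   # further than 2h from the event
assert 0.0 < q[3] < 0.5           # convex decay lies below the linear label
assert abs(q[4] - 1.0) < 1e-9     # label is 1 at the event
```

Because the smoothed labels depend only on event times, they can be computed once per stay and passed to a standard binary cross-entropy loss that accepts soft targets.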

D ADDITIONAL EXPERIMENTS AND ABLATION STUDIES

This section provides additional results and experiments to complete our findings from the main manuscript. Unless otherwise stated, mean results are shown with a 95% confidence interval on the mean shaded or in error bars. Decompensation. Precision-recall curves obtained for timestep-level event prediction on respiratory failure and decompensation tasks are given in Figure 14 . As for circulatory failure prediction, decompensation recall gains are concentrated in regions of low false-alarm rates (>50% precision) which are most clinically relevant. Likewise, whereas recall near the label boundary t e -h is slightly negatively affected by temporal label smoothing in Figure 15 , true positive rates are significantly improved leading up to the event time t e . This mirrors the temporal smoothing pattern which favors higher model confidence away from the label boundary. As discussed in Section 5.2, this is aligned with clinical priorities in terms of model performance, as it ensures imminent events are better predicted.

D.1 EVENT-BASED METRICS FOR OTHER TASKS

Respiratory Failure. As discussed in Section 5.3, on respiratory failure, there is no clear advantage of using temporal label smoothing (or any baseline) over cross-entropy on timestep level metrics as in Figure 14 . This can be attributed to the more balanced nature of this task. Still, we find that performance over time in Figure 16 reflects the design of temporal label smoothing, as true positive rates are negatively affected near the highly smoothed label boundary but improve when approaching event time. 

D.3 SUB-GROUP ANALYSIS

Populations in the intensive care unit are often heterogeneous. This has motivated recent works to focus on the fairness of deep learning across these sub-populations. In this analysis, we ensure that temporal label smoothing does not negatively affect performance in specific subgroups, compared to the objectives commonly used in the literature [8; 7; 11]. To achieve this, we measured event prediction performance across genders and age groups (below 50, between 50 and 70, and over 70 years old). As shown in Table 9, TLS matches or outperforms baseline performance across all studied subgroups, suggesting that the overall population-wide improvements are not achieved by disproportionately favouring specific cohorts. While some algorithmic bias can be observed across all methods, for instance in poorer decompensation performance amongst female patients, TLS does not appear to amplify this issue. In further work, we look forward to extending this analysis to more specific subgroups and to studying the fairness of early event prediction methods for clinical applications.

Table 9: Sub-group performance analysis. We color in green improvement above the 95% confidence interval and in orange differences within it, often within the confidence interval of cross-entropy (CE).

Impact of weighted cross-entropy on precision. With a relative weight for the positive class $\omega_1 = \frac{0.5}{b(c=1)} > 1$, weighted cross-entropy encourages a greater number of true positives to improve recall. Doing so also increases the number of false positives, impairing precision. In Figure 18, as the starting precision of all cross-entropy models is poor, no discernible improvements in recall can be observed as class weights are increased, whereas precision is markedly reduced in low-recall regions. This explains the overall reduction in AUPRC with this method across all tasks.



All code is made publicly available at https://anonymous.4open.science/r/tls/. While some methods have overlapping confidence intervals on circulatory failure and decompensation prediction, TLS remains superior on each training run, giving p-values of 0.



Figure 1: Early prediction task.

Timestep performance of regular cross-entropy training for decompensation on MIMIC-III. Comparison of temporally-smoothed and ground-truth labels.

Figure 5: Performance loss with class reweighting methods, on the validation set for circulatory failure prediction. Weighted cross-entropy corresponds to ζ = 0.

Precision-recall curve. Inset shows the clinically-applicable region with precision > 50%.

Event-level performance for a 50% timestep-level precision threshold.

Figure 6: Clinically-oriented performance analysis of different training objectives on circulatory failure prediction. See Appendix D for results on other tasks.

True positive rate (TPR).

Figure 7: Performance improvement over time for TLS over traditional cross-entropy on circulatory failure prediction. Timestep-level metrics computed for a precision of 0.5 over two-hour bins.

Figure 8: Performance improvement over time for TLS over traditional cross-entropy, for respiratory failure. True positive rates (TPR) are computed for a precision of 0.5 over 2-hour bins.

Figure 10: Illustration of temporal label smoothing with alternative smoothing parametrizations.

and b(c) defines the prevalence of class c such that c b(c) = 1. Regular cross-entropy corresponds to the case where b(c) = 1

Figure 12: Temporal label smoothing algorithm. Python-style code to obtain smooth early prediction labels from event labels.

Figure 13: Event recall at 50% timestep-level precision, for two additional tasks.

Figure 14: Precision-recall curves, for two additional tasks. Inset shows the clinically-applicable region with precision greater than 0.5.

True positive rate (TPR).

Figure 15: Performance improvement over time for TLS over traditional cross-entropy on decompensation prediction. Timestep-level metrics computed for precision of 0.5 over two-hour bins.

Figure 16: Performance improvement over time for TLS over traditional cross-entropy on respiratory failure prediction. Timestep-level metrics computed for precision of 0.5 over two-hour bins.

Figure 17: Performance loss with class reweighting methods, on validation set. Balanced crossentropy corresponds to ζ = 0.

Related work. Comparison with different training objectives for binary early prediction tasks. y ∈ {0, 1} corresponds to a sample's true label at time t and ŷ ∈ [0, 1] to the model's prediction. Additional details and respective advantages of each work are further discussed in Appendix A.4.

Parameters h min and h max define the time range over which we apply smoothing, namely [t e -h max , t e -h min ]. Under this constraint, parameters {d, A} are defined to enforce α(i, t) to be continuous at boundary points (see Appendix A.2). Finally, γ controls the smoothing strength at a given time.

Timestep-level performance on different early prediction tasks. Recall is reported at a 50% precision. Circulatory and respiratory failure are predicted on the HiB dataset, decompensation on M3B. In bold, we highlight best-performing methods with statistically significant p-values (< 0.05) under paired Student's t-tests [38]. As expected, performance gains on respiratory failure prediction are not significant on these metrics.

Temporal Label Smoothing 40.6 ± 0.3 32.3 ± 0.7 35.5 ± 0.3 29.3 ± 0.4 60.4 ± 0.2 77.0 ± 0.3

Do MHP's multiple outputs improve performance over TLS with q step ? We provide p-values for the paired Student-t test[38] on the null hypothesis H 0 : µ step ≥ µ M HP . With no statistically significant improvements (p < 0.05), we justify our assumption in Proposition 1.

Performance of different smoothing functions on early prediction tasks. Recall is reported at a 50% precision.

sigmoid 39.4 ± 0.3 29.7 ± 0.8 34.9 ± 0.4 28.8 ± 0.5 60.6 ± 0.2 77.3 ± 0.5
α concave 39.4 ± 0.3 29.7 ± 0.8 35.1 ± 0.4 29.2 ± 0.6 60.3 ± 0.3 77.0 ± 0.6
α exp 40.6 ± 0.3 32.3 ± 0.7 35.5 ± 0.3 29.3 ± 0.4 60.4 ± 0.2 77.0 ± 0.3

window of interest. This defines the following smoothing parametrization $\alpha_{shift}(i,t)$:

Event prevalence analysis, highlighting class imbalance. Positive timesteps are counted for 12-hour and 24-hour horizons for HiRID tasks and decompensation respectively. Statistics are computed on the training set.

Hyperparameter search range for respiratory failure with Transformer[35] backbone. In bold are parameters selected by random search.

Hyperparameter search range for decompensation with Transformer[35] backbone. In bold are parameters selected by random search.

± 0.5 29.4 ± 0.6 38.8 ± 0.6 29.6 ± 1.1 39.2 ± 0.3 29.0 ± 1.0 39.3 ± 0.6 30.0 ± 0.7 39.1 ± 0.4 29.0 ± 1.0
TLS 40.4 ± 0.5 32.7 ± 1.0 41.1 ± 0.4 32.6 ± 0.7 40.0 ± 0.3 31.7 ± 0.7 41.2 ± 0.3 32.8 ± 0.6 40.4 ± 0.3 32.0 ± 0.8
± 0.2 70.3 ± 0.8 62.5 ± 0.3 78.0 ± 0.5 63.2 ± 0.3 81.7 ± 0.8 54.2 ± 0.2 72.9 ± 0.8 63.7 ± 0.2 79.8 ± 0.6
TLS 51.9 ± 0.4 70.5 ± 0.7 62.4 ± 0.2 77.8 ± 0.3 63.2 ± 0.3 81.1 ± 0.6 53.9 ± 0.2 72.3 ± 0.6 63.7 ± 0.2 79.8 ± 0.4
± 0.8 25.3 ± 1.2 34.9 ± 0.9 27.4 ± 0.6 35.8 ± 0.2 29.4 ± 0.6 30.9 ± 0.4 24.8 ± 0.6 38.3 ± 0.6 31.4 ± 0.5
TLS 30.5 ± 0.5 26.2 ± 1.1 36.7 ± 0.5 29.1 ± 0.5 36.3 ± 0.3 30.3 ± 0.4 31.6 ± 0.3 25.7 ± 0.5 39.6 ± 0.5 32.8 ± 0.6

D.4 UNCERTAINTY ESTIMATION WITH PIVOT BOOTSTRAP

As mentioned in Appendix C, our uncertainty estimation approach was based on measuring standard error across 10 training runs. With the uncertainty evaluation framework from Tomašev et al. [8], bootstrapping patients from our test set 200 times for each training instance, we obtained similar means to Table 2 but with confidence intervals all smaller than or equal to 0.1%. Variance within bootstrap samples from the same training instance is therefore much smaller than across instances. Our alternative uncertainty estimation approach, measuring variability between training runs, returns more conservative estimates, and was thus chosen for all results reported in this work.

D.5 LOSS REWEIGHTING METHODS

Hyperparameter grid search results for different loss reweighting methods are shown in Figures 5 and 17. For all three tasks, both weighted cross-entropy and focal loss were found to negatively affect performance in comparison to traditional cross-entropy. Likely explanations for these results are provided in Section 5.2: focal loss focuses training on noisily labeled samples, and weighted cross-entropy largely reduces precision. We validate the latter hypothesis by visualizing precision-recall curves of models trained with this objective in Figure 18.

C IMPLEMENTATION DETAILS

Training details. For all models, we set the batch size according to the available hardware capacity. Because transformers are memory-consuming, we train the models for respiratory failure and decompensation with a batch size of 8 stays. On the other hand, we train the GRU model for circulatory failure with a batch size of 64. We stopped training each model early when its validation loss had not improved for 10 epochs.

Libraries. A full list of libraries and the versions we used is provided in the environment.yml file. The main libraries on which we build our experiments are the following: pytorch 1.11.0 [47], scikit-learn 0.24.1 [48], ignite 0.4.4, CUDA 10.2.89 [49], cuDNN 7.6.5 [50], gin-config 0.5.0 [51].

Infrastructure. We follow all guidelines provided by the pytorch documentation to ensure the reproducibility of our results. However, reproducibility across devices is not ensured. Thus we provide here the characteristics of our infrastructure. We trained all models on a single NVIDIA RTX2080Ti with a Xeon E5-2630v4 core. Training took between 3 and 10 hours for a single run.

Uncertainty estimation. We compute uncertainty estimates over a population of 10 training instances with different seeds. This widely-used approach has the advantage of accounting for the stochasticity of the training procedure, which we found to be predominant in early prediction tasks. This approach differs from other work [25; 23; 8; 24], which computes uncertainty estimates by bootstrapping the test population. We compare both approaches in Appendix D.4 to demonstrate that using a pivot bootstrap estimator decreases confidence intervals by effectively increasing the population size. To be conservative with our results, we retained the former approach to compute statistics across 10 training instances. We report the 95% confidence interval over the population means in all experiments.

Architecture choices. We used the architecture and hyperparameters reported to give the best performance on respiratory and circulatory failure in Yèche et al. [20]. For these tasks, we only optimized embedding regularization parameters [8]. Exact parameters are reported in Table 6 and Table 7. For decompensation, as we found a transformer architecture to perform better than originally proposed models [32], we carried out our own random search on validation AUPRC performance. Exact parameters for this task are reported in Table 8.

Table 6: Hyperparameter search range for circulatory failure with GRU [34] backbone. In bold are parameters selected by random search.

Hyperparameter Values

Learning Rate (1e-5, 3e-5, 1e-4, 3e-4) 
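The run-level uncertainty estimate described under "Uncertainty estimation" in Appendix C can be sketched as a mean with a 95% confidence interval over training runs. This is only one plausible reading: the function name is hypothetical, and the normal-approximation multiplier 1.96 is an assumption, since the text does not state which critical value was used (with 10 runs, a Student-t value of about 2.262 would give wider intervals).

```python
import statistics

def mean_and_ci95(run_scores, z=1.96):
    """Mean over training runs and half-width of a 95% confidence interval
    on that mean, using the standard error across runs."""
    mean = statistics.fmean(run_scores)
    sem = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, z * sem

# Hypothetical AUPRC values from training runs with different seeds.
mean, half_width = mean_and_ci95([40.2, 40.9, 40.5, 40.7, 40.4])
print(f"{mean:.1f} ± {half_width:.1f}")
```

The interval reflects variability between training runs only, which is the more conservative of the two approaches compared in the text.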

D.6 VISUAL COMPARISON OF TLS WITH q step AND MHP PERFORMANCE

In Figure 19 , we compare the precision-recall curve of multi-horizon prediction and temporal label smoothing with q step smoothing, ensuring that there is no area where MHP is superior. In complement to Table 3 and to the analysis in Section 5.2, this confirms that predicting a single horizon with a step function smoothing is sufficient to match the performance of multi-horizon prediction.

D.7 COMBINING TLS WITH OTHER METHODS

Finally, we investigated whether temporal label smoothing could be combined with other objective functions to leverage their respective added value and further improve prediction performance. The performance of temporal label smoothing combined with a weighted cross-entropy objective is given in Figure 20. Balanced reweighting per class results in a performance drop, as observed when applied to traditional cross-entropy (see Section 5.1, Figure 5). Another possible approach to combine these methods would be to leverage temporal information in sample re-weighting, and we reserve this investigation for further work.

Similarly, no additional performance gains were obtained from combining multi-horizon prediction or focal loss with temporal label smoothing over using TLS with cross-entropy loss.

[...] 3, further demonstrating that the multiple outputs of multi-horizon prediction do not lead to superior performance, and supporting assumptions in Proposition 1.

