FACTORS INFLUENCING GENERALIZATION IN CHAOTIC DYNAMICAL SYSTEMS

Abstract

Many real-world systems exhibit chaotic behaviour, for example weather, fluid dynamics, stock markets, natural ecosystems, and disease transmission. While chaotic systems are often thought to be completely unpredictable, there are in fact patterns within and across such systems that experts frequently describe and contrast qualitatively. We hypothesise that, given the right supervision or task definition, representation learning systems will be able to pick up on these patterns and generalize successfully both in- and out-of-distribution (OOD). This work therefore explores and identifies key factors that lead to good generalization. We observe a variety of interesting phenomena, including: learned representations transfer much better when fine-tuned than when frozen; forecasting appears to be the best pre-training task; OOD robustness falls off very quickly outside the training distribution; and recurrent architectures generally outperform others on OOD generalization. Our findings are of interest to any prediction domain where chaotic dynamics play a role.

1. INTRODUCTION

There are many reasons to be interested in understanding and predicting the behaviour of chaotic systems. For example, the current climate crisis is arguably the most important issue of our time. From atmospheric circulation and weather prediction to economic and social patterns, chaotic dynamics appear in much of the data relevant to mitigating the impact of, and adapting to, climate change. Most natural ecosystems exhibit chaos, and a better understanding of the mechanisms of our impact on our environment is essential to ensuring a sustainable future on our planet. The spread of information in social networks, many aspects of market economies, and the spread of diseases all have chaotic dynamics too; of course, these are not isolated systems: they all interact in complex ways, and the interaction dynamics can also exhibit chaos. This makes chaotic systems a compelling challenge for machine learning, particularly representation learning: Can models learn representations that capture high-level patterns and are useful across other tasks? Which losses, architectures, and other design choices lead to better representations? These are some of the questions we aim to answer. Our main contributions are:
• The development of a lightweight evaluation framework, ValiDyna, to evaluate representations learned by deep-learning models on new tasks, in new scenarios, and on new data.
• The design of experiments using this framework, showcasing its usefulness and flexibility.
• A comparative analysis of 4 popular deep-learning architectures using these experiments.
Table 1: Summary of the generalisation results. S, C and F stand for the tasks of Self-supervised featurisation, Classification, and Forecasting. A ↛ B and A → B indicate strict (see Section 5.2) and loose (see Section 5.3) feature-transfer from task A to task B. All runs generalise in-distribution. ✓ and - indicate whether or not the model-run pair achieves OOD generalisation in the final task.
model        S  C  F  S↛C  F↛C  S↛F  C↛F  F→S  F→C  C→S  C→F
GRU          ✓  ✓  -  ✓    ✓    -    -    ✓    ✓    ✓    -
LSTM         ✓  ✓  -  ✓    ✓    -    -    ✓    ✓    ✓    -
Transformer  ✓  ✓  -  ✓    ✓    -    -    ✓    -    -    -
N-BEATS      -  -  -  -    -    -    -    -    -    -    -

2. RELATED WORK

Many works have studied factors influencing generalization for deep networks; see Maharaj (2022) for a review, and Arjovsky (2021) for OOD generalization specifically. To our knowledge, ours is the first such analysis for data exhibiting chaotic dynamics. Our work relies on that of Gilpin (2021), which presents a dataset of dynamical systems that show chaotic behaviour under certain conditions. They benchmark statistical and deep-learning models typically used with time series on a variety of tasks including forecasting and dataset-transfer, and highlight some connections between model performance and chaotic properties. Although not directly addressing chaos, the intersection of the physics-informed and dynamical-systems literature with representation learning holds relevance for chaotic dynamics: e.g. Raissi et al. (2019) show how to train models whose predictions respect the laws of physics, by employing partial differential equations as regularisation. Yin et al. (2022) propose a framework to learn contextual dynamics by decomposing the learned dynamical function into two components that capture context-invariant and context-specific patterns. As AI systems are increasingly deployed in the real world, researchers have noted shortcomings of standard practice (i.e. performance on a validation/test set) for comprehensively evaluating learned representations. A number of evaluation frameworks have been proposed to help address this: e.g. Gulrajani & Lopez-Paz (2021) propose model selection algorithms and develop a framework (DomainBed) for testing domain/OOD generalisation. Of particular relevance, Wang et al. (2020) discuss the difference between generalisation to new data domains and to new ODE parameters in the context of dynamical systems. They show that ML techniques generalise badly when the parameters of a test system are not included in the train set (extrapolation).

3. DATA

Our data is generated using dysts, a Python library of 130+ chaotic dynamical systems published by Gilpin (2021). In dysts, each dynamical system can be integrated into a trajectory with any desired initial condition, length and granularity, thus allowing the generation of an unlimited number of trajectories. It can also generate trajectories of similar time scales across different chaotic systems. See Figure 1 for examples and Figure A9 for further examples. Figure 1: Sample trajectories from two related chaotic attractors. Both systems have two 'lobes'; Arneodo (left) has a characteristic shell shape with one lobe inside the other, while Lorenz (right) shows a characteristic butterfly shape with lobes at an angle to one another. This is the kind of high-level pattern experts describe for many real-world chaotic systems, which we hypothesize representation learning systems could pick up on.

3.1. THE DATA GENERATION PROCESS

We sample data from each dynamical system by picking different initial conditions. This leads to trajectories that are sufficiently different from each other, yet representative of the underlying chaotic system. However, dysts relies on numerical ODE solvers to generate trajectories, which can fail due to numerical instabilities when the initial condition is too extreme. To avoid this, we generate the default trajectory for each system, compute the component-wise minima and maxima, and use a percentage p of the resulting intervals to sample random initial conditions for that system. In addition to the properties of the trajectory, the parameters of this process are the random seed and the percentage p of the observed initial-condition range to be used for sampling.
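As a concrete illustration, the sampling procedure above can be sketched in plain Python. This is a hypothetical helper, not the framework's actual code, and the interpretation of p as shrinking each interval symmetrically around its centre is our assumption:

```python
import random

def sample_initial_conditions(default_trajectory, p, seed, n_samples):
    """Sample initial conditions from a fraction p of the observed
    component-wise range of a system's default trajectory.

    default_trajectory: list of points, each a list of d components.
    p: fraction (0 < p <= 1) of each observed interval to sample from.
    """
    rng = random.Random(seed)
    dims = len(default_trajectory[0])
    mins = [min(pt[i] for pt in default_trajectory) for i in range(dims)]
    maxs = [max(pt[i] for pt in default_trajectory) for i in range(dims)]
    samples = []
    for _ in range(n_samples):
        ic = []
        for lo, hi in zip(mins, maxs):
            centre, half = (lo + hi) / 2, (hi - lo) / 2
            # shrink the observed interval to a fraction p around its centre
            ic.append(rng.uniform(centre - p * half, centre + p * half))
        samples.append(ic)
    return samples
```

Keeping the sampled initial conditions inside the observed range is what protects the numerical ODE solver from the instabilities mentioned above.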

3.2. OUR DATASETS AND THEIR PARAMETERS

We generate three sets of data for all our experiments: training, validation, and test. The training set is used to optimise the model weights. The validation set is used for early stopping and learning rate adaptation (see Appendix A.3), and to measure in-distribution generalisation. The test set is used to measure OOD generalisation. The train and validation sets come from the same data distribution, while the test set comes from a larger distribution containing the former. All the sets contain trajectories with the same length (5 periods) and granularity (50 points per period). The parameters used to generate the data can be found in Table A2 . We choose to only include dysts systems of 3 dimensions in the datasets (i.e. 100 out of 131 systems, cf. Table A3 ) to avoid adapting models for variable input dimensions, and for faster training. The default trajectory from each included system can be seen in Figure A9 .

4. THE VALIDYNA EVALUATION FRAMEWORK

The exploratory and comparative nature of this work requires a common experimental framework for consistency and configurability across experiments, including different tasks and combinations of losses. We present ValiDyna, an open-source, lightweight framework built on top of PyTorch and PyTorch Lightning. It is built with extensibility in mind, so that new model architectures, metrics and training objectives can be easily added. The framework avoids a large amount of code repetition and complex indexing/references, e.g. in multi-task experiments.

4.1. TASKS

ValiDyna currently includes three tasks for learned representations of time series data:
1. (Task S) Self-supervised featurisation (a.k.a. feature extraction) involves extracting features from time series such that similar time series have similar features, with similarity defined as coming from the same dynamical system. We use a triplet margin loss, which takes 3 feature vectors as input: that of an anchor time series a, a positive series p similar to it, and a negative (dissimilar) series n:
L_triplet(a, p, n) = max(d(a, p) - d(a, n) + m, 0)
where d is a distance metric (Euclidean in our case) and m is the margin of tolerance, i.e. the minimum difference between the positive and negative distances for the loss to be non-zero. The number of features to be extracted and the margin value are the main parameters of this task.
2. (Task C) Classification involves predicting a single discrete class for each time series, in our case the chaotic system from which it came. We use the cross-entropy loss to measure how close the model's output is to the true class. The main parameter is the data-dependent number of classes.
3. (Task F) Forecasting, perhaps the most popular task for time series, involves predicting the future values of a time series based on its past values. We use the mean squared error (MSE) loss. Although the numbers of time steps in the past and in the future need not necessarily be fixed, we fix them due to N-BEATS' architecture (cf. Section 4.2). Thus, this task is parameterised by the number T_in of time steps that are input to the model, and the number T_out of time steps output by the model.
Each of these tasks is implemented in ValiDyna as a separate Lightning module ([Slice]Featuriser, Classifier and Forecaster) that wraps around a model architecture to allow for easy training and metric logging. All such modules log the corresponding loss during training for all data sets, while the Classifier module additionally logs the classification accuracy.
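The triplet margin loss of Task S is simple enough to sketch in plain Python; this is a toy, single-example stand-in for the batched PyTorch implementation, with function names of our choosing:

```python
import math

def euclidean(u, v):
    """Euclidean distance d between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """L_triplet(a, p, n) = max(d(a, p) - d(a, n) + m, 0)."""
    return max(euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin, 0.0)
```

The loss is zero exactly when the negative is further from the anchor than the positive by at least the margin m, which is the behaviour described above.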

4.2. MODEL ARCHITECTURES

ValiDyna currently includes 4 machine learning architectures often used for temporal data:
• GRU (Cho et al., 2014) and LSTM (Hochreiter & Schmidhuber, 1997): these Recurrent Neural Networks (RNNs) are likely the most popular ML architectures for time series, as they allow crunching a series of variable size into a fixed-size representation.
• Transformer (Vaswani et al., 2017): an attention-based architecture that achieves state-of-the-art performance for seq2seq tasks, and has replaced LSTMs in many time series tasks.
• N-BEATS (Oreshkin et al., 2020): a purely deep-learning-based, state-of-the-art forecasting architecture built from residual blocks. Originally written in TensorFlow, we provide a PyTorch implementation based on that of Herzen et al. (2021).
The main challenge of our multi-task setup is adapting the model architectures above to tasks they were not originally built for. The most straightforward way is to use the architecture (or part of it) as a feature extractor, and then attach a classification or forecasting head. For RNNs, we consider the outputs of the last layer as the "features". For the Transformer, we use its encoder as a feature extractor and discard the decoder entirely. For N-BEATS, we use the concatenation of the forecast neural basis expansions of all blocks as the "features". To ensure fair comparisons, the framework makes it easy to fix the number of features (N_features) across model architectures. To accomplish this, we insert a simple linear layer with N_features output units between the vanilla feature extractor and the task-specific heads.
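The wrapper pattern just described (a backbone feature extractor, a projection to a fixed N_features, and swappable task-specific heads) can be sketched as follows. This is a plain-Python caricature with hypothetical names, not ValiDyna's actual PyTorch classes:

```python
class MultiTaskModel:
    """Sketch of the multi-task wrapper: a backbone feature extractor,
    a projection to a fixed number of features, and swappable
    task-specific heads (names are ours, for illustration only)."""

    def __init__(self, featurise, project, n_features=32):
        self.featurise = featurise   # backbone: series -> raw features
        self.project = project       # linear layer: raw -> n_features
        self.n_features = n_features
        self.heads = {}

    def add_head(self, task, head):
        # e.g. 'C' -> classification head, 'F' -> forecasting head
        self.heads[task] = head

    def forward(self, task, series):
        features = self.project(self.featurise(series))
        assert len(features) == self.n_features
        return self.heads[task](features)
```

The fixed-width projection is what makes comparisons across backbones fair: every head sees exactly N_features inputs regardless of the architecture behind it.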

5. EXPERIMENTS

5.1. RANDOM SAMPLING

In this baseline experiment, we measure the dependence of model performance on the specific set of trajectories used for training. We construct 5 subsets (using random seeds 0 to 4), sampling 75% of the available trajectories in each set without replacement. For each sub-sampled set, we train each model architecture on each of the 3 tasks. The results in Table 2 also serve as a performance baseline for the more complex experiments that follow. Table 2 shows the following: N-BEATS consistently performs poorly for featurisation and classification (an alternative featuriser/forecaster decomposition could potentially perform better); OOD forecasting generalisation is bad overall; GRU, LSTM and Transformer perform better on the validation set than on the train set (possibly an effect of dropout regularisation). The training curves in Figure A2 show that different seeds result in different training times although final performance is stable, and that test curves are noise-like (no improvement) for N-BEATS on all tasks, and for forecasting with all models.
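The sub-sampling setup can be sketched as follows (a hypothetical helper; the framework's actual sampling code may differ):

```python
import random

def subsample_trajectories(trajectories, fraction=0.75, seed=0):
    """Draw a fixed fraction of trajectories without replacement,
    reproducibly for a given seed."""
    rng = random.Random(seed)
    k = int(len(trajectories) * fraction)
    return rng.sample(trajectories, k)

# one 75% subset per random seed, as in the experiment above
subsets = {seed: subsample_trajectories(list(range(100)), 0.75, seed)
           for seed in range(5)}
```

Fixing the seed per subset is what makes each of the 5 training runs reproducible while still varying the trajectories seen by the models.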

5.2. STRICT FEATURE-TRANSFER ACROSS TASKS (FROZEN WEIGHTS)

With this experiment, we seek to evaluate the usefulness of transferring learned representations from one of the three tasks to another. We expect forecasting to rely on implicitly learning the system underlying a time series to generate better predictions, and hope that pre-training on tasks that do this explicitly will be beneficial. Moreover, since classification and featurisation try to achieve very similar goals, we expect that training a model for one is beneficial for the other. In each run, we: 1. pre-train a model for one task (the pre-training task); 2. freeze the parameters associated with the feature extractor (cf. Figure 2); 3. train for another task (the main task). Note that featurisation cannot be a main task, as freezing the featuriser would result in no learning. Table 3 shows that in almost all cases, pre-training on other tasks results in worse performance. The only exception is pre-training the Transformer for forecasting and then classifying, but even then GRU and LSTM still achieve better classification accuracy with no pre-training. Figure A3 shows that pre-training for other tasks puts models in a better initial position during the main training, but performance then stops improving and ends up worse than without pre-training. Given this initial performance boost, we speculate that learned features are useful across tasks, but freezing the featuriser is too extreme and prevents learning during the main training phase.

5.3. PROBING FOR OTHER TASKS (FINE-TUNING)

The goal of this experiment is to better understand how training a model for one task impacts its performance on other tasks. The flexibility of our framework allows us to write this experiment in a simple manner, by implementing a new "prober" metric that actively treats a task module A as if it were the module for task B, and logs metrics for B. Note that to probe for some tasks, a model must have already been pre-trained for that task: for instance, one cannot probe for classification or forecasting during featurisation training. However, we can probe for featurisation while training for other tasks, as the featuriser is used for those tasks (except for N-BEATS, which does not train its featuriser layer). A side-effect of this experiment is that it solves the main limitation of the previous one, by allowing the transfer of features across tasks without freezing any component of the model. First, we look at task performance using pre-training, extending the observations of Section 5.2. Table 4 shows that: the best classification performance is obtained with forecasting pre-training; all models except N-BEATS perform forecasting at least as well on the train/val sets with classification pre-training; featurisation loss is generally better with pre-training than without (cf. the Table 2a baseline). Then, for task probing, Table 4 shows that the classification metrics are consistently bad during the training of other tasks: accuracy ≈ 1% (the random baseline), and the loss is an order of magnitude higher than its counterparts on all sets. This is not surprising: consider probing for classification during forecasting training; as the featuriser is updated during forecasting training, it is no longer compatible with the classification head. Considering the evolution of metrics during training, Figure 3 shows that the performance on task A when training for task B is initially good, but either collapses after one or two epochs, or stays stable.
This agrees with our theory that the featuriser and task-specific heads become incompatible. In summary, this experiment shows that pre-training on one task can be greatly beneficial for another task, likely because it places the model parameters in a region of space that is easier to optimise. However, the update of the featuriser weights during training renders the previously-trained task-specific heads useless, as they cannot adapt to the new features they receive.

5.4. FEW-SHOT LEARNING

With this experiment, we hope to better understand how our models adapt to a distribution shift consisting of a new environment with a new chaotic system in it. We focus on the dynamical system SprottE. We consider two sets of systems: a set of 4 toy chaotic systems with simple equations similar to SprottE's (Sprott, 1994): SprottA, SprottB, SprottC and SprottD; and a set of 4 systems with more complex differential equations: Arneodo, Lorenz, Sakarya and QiChen. We show the default trajectory and differential equations of each system in Figure A5. The experiment is set up as follows. In some cases, we pre-train a model on one set of 4 systems (similar or different), then add SprottE to the data and train fully. In other cases, we train models directly on one set of 5 systems (SprottE + similar/different). Runs are identified by the similarity of the other systems (similar vs. different), and by whether SprottE is included ("no" during pre-training, "no→yes" after pre-training, and "yes" when SprottE is there from the beginning). Since we care about model performance on SprottE in particular, we introduce new metrics: for forecasting, the MSE loss on SprottE only, i.e. S-MSE; for classification, the sensitivity, i.e. true positive rate (TPR), and specificity, i.e. true negative rate (TNR), of SprottE vs. the other classes; for featurisation, the standard deviation of the features extracted from SprottE series. Note that these metrics can only be tracked when SprottE is included in the data (i.e. not during pre-training). Table 5 confirms our choice of similar and different systems: classification accuracy and featurisation loss are better for the different systems (i.e. easier to differentiate), while forecasting loss is better for the similar ones (i.e. reusable representations). Pre-training does not have a significant impact on classification or forecasting metrics, but is better for featurisation, in particular for the different systems.
This could be due to the triplet margin loss needing more samples to be optimised than the other two losses. Figure A6 mainly shows that convergence is faster for pre-trained models, as expected. There does not seem to be any relationship between the SprottE feature standard deviation and performance. In this experiment, we also visualise the features learned by the models using 2D PCA projections. Figure 4 shows features learned under the feature extraction task for the 4 different architectures; full plots for all settings are in Figure A11. A very noticeable result is that the two principal axes of the PCA projections of the features generated by N-BEATS always explain at least 95% of the feature variance (we call this value 'r'). We speculate that, although the feature extractor of N-BEATS has an output dimension of 32, its effective number of degrees of freedom (i.e. its effective capacity) is much lower, around 2, which would explain why the features for all systems are mixed in a linear or "V" shape in most PCA projections.
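The explained-variance ratio r used above can be computed from the singular values of the centred feature matrix. A minimal numpy sketch (the helper name is ours; the paper's actual PCA code may differ):

```python
import numpy as np

def top2_explained_variance(features):
    """Fraction of total variance explained by the first two principal
    components of an (n_samples, n_features) feature matrix."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)                  # centre the data
    s = np.linalg.svd(X, compute_uv=False)  # singular values
    var = s ** 2                            # proportional to PC variances
    return var[:2].sum() / var.sum()
```

A value of r near 1 for 32-dimensional N-BEATS features is what motivates the low-effective-capacity speculation above: the features occupy an essentially 2-dimensional subspace.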

5.5. COMBINING TASK LOSSES

In this experiment, we explore optimising the three task losses simultaneously. Concretely, we implement a new SliceModule whose loss is a weighted mean of all task losses. In particular, we consider forecasting as the main task, and use the other losses to explicitly enforce the learning of a series' system, by setting L_total = α L_MSE + (1 - α)(L_triplet + L_cross). Figure A10 shows no evident benefit from enforcing shared representations across tasks.
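The weighted objective amounts to a one-line combination of the three scalar losses (a sketch; the real SliceModule combines batched tensor losses):

```python
def combined_loss(mse, triplet, cross_entropy, alpha):
    """Weighted multi-task objective with forecasting as the main task:
    L_total = alpha * L_MSE + (1 - alpha) * (L_triplet + L_cross)."""
    return alpha * mse + (1 - alpha) * (triplet + cross_entropy)

# alpha = 1 recovers pure forecasting; alpha = 0 keeps only the
# auxiliary featurisation and classification terms.
```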

6. LIMITATIONS AND FUTURE DIRECTIONS

Limitations: Despite ValiDyna's configurability and extensibility, it has some limitations inherent in choosing an experimental scope:
• It is built with time series data at its core, and would not work with other types of data such as static images (e.g. forecasting is time series-specific).
• While adding new architectures is simple, adapting new models to multi-task learning requires some expertise.
• There are many ways to extract features for some of the model architectures (e.g. for N-BEATS), and the current version only supports a single scheme per model architecture.
• It only supports training models on a single dataset, and has no way of distinguishing between different training environments, corresponding for example to different ecosystems.
There are also limitations in our experiments:
• They were done for a particular timescale of data, and results might differ substantially for coarser or finer granularities. This is a challenge in general for climate data, as some events occur on large timescales and others on smaller ones, and climate models must account for both.
• Out-of-distribution generalisation was only explored in the context of extrapolating the initial conditions of trajectories.
• Although the dysts library allows generating noisy trajectories, the data used in the experiments is free of noise, while real-world measurements often contain noise.
Future directions: Apart from evaluating further architectures and adding new losses to the existing framework, there are a few ways in which we would like to expand our framework. While initial experiments with meta-learning were not promising (which is why we did not focus the framework around them), adding the ability to use multi-dataset losses such as meta-learning losses is something we would like to pursue. We hypothesize that the benefits of this approach might only show in massively multi-environment settings.
We would also like to explore generative modelling and the role it can play in encouraging good representations, as well as further experimental settings, such as mixing different data timescales, dynamical-system parameters (e.g. changing β in Lorenz), and trajectory noise levels. Finally, an area of ongoing work is to perform a comparison with real ecological measurements and see how the results differ from the synthetic case.

7. CONCLUSION

In summary, we present an experimental analysis of factors influencing generalization for data exhibiting chaotic dynamics. To do so, we built a configurable and extensible model evaluation framework called ValiDyna. Using ValiDyna, we constructed and ran five experiments (random sampling, transfer learning with frozen weights, fine-tuning with unfrozen weights (probing), few-shot learning, and multi-task loss) to better understand the quality of representations learned by four popular machine learning architectures (GRU, LSTM, Transformer, N-BEATS) on three tasks (feature extraction, classification, forecasting), for in- and out-of-distribution generalization.
Takeaways. Summarizing our extensive experiments, the main takeaways from our work are:
• All four model architectures generalise poorly to an unseen data distribution for the forecasting task. This is likely due to the chaotic nature of our data.
• Our feature extractor for N-BEATS performs very poorly, while the others perform better.
• All four model architectures are robust to data sub-sampling in the sense that their final performance is stable, but training times can vary considerably.
• Dropout seems to be an effective regularizer for in- and out-of-distribution generalisation.
• Learned representations can transfer well across tasks, especially from forecasting to classification, but not when the feature extraction module is frozen.
• There is no straightforward relationship between optimising for the triplet loss and for the cross-entropy loss, although they aim at very similar goals.
• The cross-entropy loss and the classification accuracy of a model do not necessarily track each other when models are optimised for other losses.
• There is no evident benefit from enforcing shared representations across tasks.
These results provide insights and starting points for future research in representation learning for chaotic dynamical systems.

A APPENDIX

A.1 ADDITIONAL DETAILS ON THE VALIDYNA EVALUATION FRAMEWORK

A.1.1 DATA PROCESSING

Entire trajectories cannot be fed to the models, due to N-BEATS' requirement of a fixed number of input and output time steps T_in and T_out. Also, task-specific data, i.e. positive and negative examples for the (S) featurisation task, needs to be constructed. And, although the generation process is transparent to the different scales of system values (cf. Section 3.1), the generated trajectories still need to be scaled. To address these issues, we:
1. compute the component-wise minima and maxima of the train trajectories of each system;
2. use them to scale the trajectories per system in all 3 sets;
3. map each (scaled) trajectory of length N to the list of all possible contiguous slices of length T_in + T_out;
4. split each such slice into two parts, of length T_in and T_out;
5. attach the name of the system which generated it, encoded as a number.
Thus, a single data sample in our setting is a triplet (X_in, X_out, c), i.e. the model input for all 3 tasks, the true future for forecasting, and the class label for classification. In addition, given a batch of anchor time series, the Featuriser module can retrieve a batch of one positive and one negative example per anchor.
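The slicing and splitting steps above amount to a sliding-window decomposition; a minimal sketch (hypothetical helper name, not the framework's actual pipeline):

```python
def make_samples(trajectory, t_in, t_out, label):
    """Map one scaled trajectory of length N to all contiguous windows
    of length t_in + t_out, each split into a (X_in, X_out, class)
    sample as described in the data-processing steps."""
    n, window = len(trajectory), t_in + t_out
    return [(trajectory[i:i + t_in],          # model input X_in
             trajectory[i + t_in:i + window], # true future X_out
             label)                           # encoded system class c
            for i in range(n - window + 1)]
```

A trajectory of length N thus yields N - (T_in + T_out) + 1 overlapping samples, which is how a limited number of trajectories produces large training sets.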

A.1.2 FRAMEWORK ARCHITECTURE, EXTENSIBILITY

Introducing a new multi-task model to the framework is straightforward: it suffices to implement a new MultiTaskModel sub-class and adapt the model to all available tasks. Adding new metrics to be tracked for specific tasks or experiments is also very straightforward, and the process is briefly explained in the experiments of Sections 5.3 and 5.4. Creating variations of current tasks can also be done quickly; e.g. adding a Forecaster that optimises the mean absolute error (MAE) instead of the MSE is trivial. However, introducing a new task can be complicated, as it requires: adapting all existing multi-task models to it; implementing the core training objective in a new SliceModule; and adapting MultiTaskDataset and the data processing pipeline if new kinds of data are necessary. See Figure A1 for a high-level class diagram.

A.2 EXAMPLE OF EXPERIMENT CODE

With our framework, one run of the strict feature-transfer experiment can be written in a few lines, as shown in the Python pseudocode listing included with the appendix figures.

A.3 EXPERIMENTAL SETTINGS

We use the values T_in = T_out = 5 for data processing and for the models. For each dataset, we shuffle the data and use batches of size 1024. After every training batch, we compute the metrics of interest for a random validation and test batch. All models are optimised using PyTorch's AdamW optimiser, with a starting learning rate of 0.01 that is divided by 5 when the validation loss does not improve, with a patience of 1 epoch. All training procedures include early stopping, so that training stops when the validation loss stops decreasing, with a patience of 3 epochs. A maximum of 100 training epochs is allowed, but no training run attains it. Each model training is run deterministically, by setting Lightning's random seed to 2022. We also want to make sure that our models have comparable representative power, and we use the number of parameters as a proxy. Table A1 shows the number of parameters obtained for each model using the hyper-parameters that follow. All models use the value N_features = 32. All forecasting and classification heads in our models are simple feed-forward neural networks with 3 hidden layers of width equal to N_features = 32, and ReLU activations after each hidden layer. For N-BEATS, we use 4 stacks of 4 blocks each, a neural-basis expansion dimension of 4, and the fully-connected network of each block has 4 hidden layers of 8 units each with ReLU activation. For the two RNNs, we use a dropout probability of 0.1 and 2 layers, with GRU having 30 hidden units per layer and LSTM 26. The Transformer uses 4 encoder layers, each having a feed-forward dimension of 6, as well as 4 attention heads, an embedding dimension of 16, and a dropout probability of 0.1 in the feed-forward and self-attention networks. The margin of the triplet loss is equal to the default of 1. Table A1: The number of parameters of each model architecture as used in the experiments. We consider the number of parameters used in the featuriser and in total (including any task-specific heads).
The total parameter count is very stable (around 11-12k), while there is slightly more variability in the featuriser count (around 7-9k).
(Task C) Classification metrics per model: loss (↓) and accuracy (↑, random baseline of 0.01) on the train, validation and test sets.
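The learning-rate schedule and early stopping described in Appendix A.3 can be sketched by replaying a sequence of validation losses. This is a simplified stand-in for PyTorch's ReduceLROnPlateau scheduler and Lightning's EarlyStopping callback, and the exact patience semantics are our assumption:

```python
def train_with_schedules(epoch_losses, lr0=0.01, lr_factor=5,
                         lr_patience=1, stop_patience=3):
    """Replay validation losses through the schedule described above:
    divide the learning rate by lr_factor after more than lr_patience
    epochs without improvement, and stop training entirely after
    stop_patience epochs without improvement."""
    lr, best = lr0, float('inf')
    bad_lr = bad_stop = 0
    for epoch, loss in enumerate(epoch_losses):
        if loss < best:
            best, bad_lr, bad_stop = loss, 0, 0
        else:
            bad_lr += 1
            bad_stop += 1
            if bad_lr > lr_patience:
                lr /= lr_factor     # 0.01 -> 0.002 -> ...
                bad_lr = 0
            if bad_stop >= stop_patience:
                return epoch + 1, lr  # epochs run, final learning rate
    return len(epoch_losses), lr
```

For example, a run whose validation loss improves for two epochs and then plateaus triggers one learning-rate reduction before early stopping ends training.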

Figure A10 caption: The forecasting train/val loss decreases with α, showing that GRU does not benefit from the enforced system representations from the other tasks. The featurisation and classification losses generally increase with α, so they do not benefit from the forecasting representations either.

Differential equations from Figure A5:
∂x/∂t = y, ∂y/∂t = z, ∂z/∂t = -ax - by - cz + dx³
∂x/∂t = σ(y - x), ∂y/∂t = x(ρ - z) - y, ∂z/∂t = xy - βz
∂x/∂t = a(y - x) + yz, ∂y/∂t = cx + y - xz, ∂z/∂t = xy - bz
∂x/∂t = ax + hy + syz



Figure 3: Prober experiment: training curves per model and pre/main task combination. A running average of length 500 (roughly a quarter of an epoch) is used for readability. The best task performance is obtained when the model is being trained for that specific task. Metrics of tasks other than the training task seem to stay stable during training.

Figure 4: Comparison of the features extracted by the 4 different architectures. Note the effective low dimensionality of the N-BEATS features compared to the others. In these examples and in general, recurrent architectures LSTM and GRU appear to have the most separable learned features.

Python 3.9 pseudocode:

model: MultiTaskModel = GRU(...)
dataset: MultiTaskDataset = ...

# Pre-train for classification
classifier = Classifier(model)
classifier.fit(dataset)

# Freeze feature extractor weights
model.freeze_featurizer()

# Train model for forecasting
forecaster = Forecaster(model)
forecaster.fit(dataset)

Figure A1: Class diagram of the ValiDyna Framework.

Figure A2: Random sampling experiment: training curves. A running average of length 700 (roughly half an epoch) is used for readability.The random seeds don't seem to impact the final performance of the models, but they do impact the training times and speed of convergence.

Figure A2: Random sampling experiment: training curves (cont.)

Figure A4: Prober experiment: training curves (cont.)

∂y/∂t = -by - px + qxz,  ∂z/∂t = cz - rxy

(c) The set of systems different to SprottE.

Figure A5: Few-shot learning experiment: The set of 9 systems used in the experiment. The default trajectory of 500 points per period and 10 periods is shown for each. Each trajectory component is re-scaled to be in the range [-1, 1]. The differential equations of SprottE and the similar systems are simpler, with at most a sum of two elementary products, while those of the different systems involve more complicated terms.
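Rescaling each trajectory component to [-1, 1] is a per-component min-max transform; a minimal sketch:

```python
def rescale_component(values):
    """Affinely map a sequence onto [-1, 1] (constant sequences map to 0)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
```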

Figure A6: Few-shot learning experiment: Forecasting: training curves (only runs with SprottE are included). A running average of length 100 (roughly an epoch) is used for readability. All runs with pre-training converge faster except those of N-BEATS and classification sensitivity.

Figure A7: Few-shot learning experiment: Classification loss, accuracy: training curves (only runs with SprottE are included). A running average of length 100 (roughly an epoch) is used for readability. All runs with pre-training converge faster except those of N-BEATS and classification sensitivity.

Figure A8: Few-shot learning experiment: Classification TPR, TNR: training curves (only runs with SprottE are included). A running average of length 100 (roughly an epoch) is used for readability. All runs with pre-training converge faster except those of N-BEATS and classification sensitivity.
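The TPR and TNR tracked in these curves are the standard confusion-matrix rates; for completeness, computed from true and predicted binary labels:

```python
def tpr_tnr(y_true, y_pred):
    """True-positive and true-negative rates from binary label sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tpr = tp / (tp + fn) if tp + fn else 0.0   # sensitivity / recall
    tnr = tn / (tn + fp) if tn + fp else 0.0   # specificity
    return tpr, tnr
```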

Figure A9: The 100 dynamical systems of dimension 3 in our synthetic dataset (only Torus is non-chaotic). A single trajectory is shown for each, with the default initial condition, 500 points per period, and 10 periods. Each trajectory component is re-scaled to be in the range [-1, 1].

Figure A9: The 100 chaotic dynamical systems of dimension 3. (cont.)

Figure A9: The 100 chaotic dynamical systems of dimension 3. (cont.)

Figure A9: The 100 chaotic dynamical systems of dimension 3. (cont.)

Figure A9: The 100 chaotic dynamical systems of dimension 3. (cont.)

These models are implemented in ValiDyna as sub-classes of a MultiTaskModel with all of the functionality above, as shown in Figure 2. For further details on the framework, see Appendix A.1.

Random sampling experiment: final metric means and standard deviations, aggregated over the 5 different random sampling seeds. See Table A4 for the full results per sampling seed. We highlight the best mean value obtained on each set. N-BEATS performs poorly for classification and featurisation. GRU is consistently among the best performers, with LSTM close behind. Validation metrics are equal to or better than the training ones for all models but N-BEATS. Generalisation to the OOD test set is generally good, except for forecasting.

Feature-freeze experiment: final task metrics as a function of the pre-training task. We highlight the best metric value obtained for each model-dataset pair. In general, pre-training on other tasks results in worse performance. Interestingly, the representations learned by Transformer during forecasting seem to transfer well to classification. Freezing the feature extractor entirely seems to prevent learning.

Prober experiment: final metrics per pre-training task, training task, model, and dataset. The best metric value is highlighted for each model-set pair. Forecasting features transfer well to classification, and the reverse is often true. Classification and forecasting features transfer decently to featurisation (cf. baseline in Table 2a).


Few-shot learning experiment: final metrics per training task, system similarity (= or ̸=), and SprottE status, averaged over the train/validation/test sets. See Table A5 for the full table. Forecasting MSE is significantly better for the similar systems. Featurisation loss and classification accuracy are better for the different systems.


Parameters used to generate trajectories for each data set. Theoretically, the trajectories from the train and validation sets come from the same distribution, while those from the test set come from a larger distribution containing it.
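One way to realise such nested distributions (an illustration, not the paper's exact generation code) is to draw initial conditions from a narrow interval for train/validation and a wider interval containing it for test:

```python
import random

def sample_initial_condition(split, rng):
    """Train/val draw from [-0.5, 0.5]; test from the wider [-1.0, 1.0]."""
    half_width = 0.5 if split in ("train", "val") else 1.0
    return rng.uniform(-half_width, half_width)

rng = random.Random(0)
train_ics = [sample_initial_condition("train", rng) for _ in range(100)]
test_ics = [sample_initial_condition("test", rng) for _ in range(100)]
```

Every train/validation draw is by construction also a valid test draw, but not vice versa, which is what makes the test set (partly) out-of-distribution.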

The distribution of attractor dimensions in dysts. Most attractors have dimension 3.

Random sampling experiment: full final metrics per random sampling seed.

Few-shot learning experiment: full final metrics.


