META-ACTIVE LEARNING IN PROBABILISTICALLY-SAFE OPTIMIZATION

Abstract

Learning to control a safety-critical system with latent dynamics (e.g. for deep brain stimulation) requires judiciously taking calculated risks to gain information. We present a probabilistically-safe, meta-active learning approach to efficiently learn system dynamics and optimal configurations. The key to our approach is a novel integration of meta-learning and chance-constrained optimization in which we 1) meta-learn an LSTM-based embedding of the active learning sample history, 2) encode a deep learning-based acquisition function with this embedding into a mixed-integer linear program (MILP), and 3) solve the MILP to find the optimal action trajectory, trading off the predicted information gain from the acquisition function and the likelihood of safe control. We set a new state-of-the-art in active learning to control a high-dimensional system with latent dynamics, achieving a 46% increase in information gain and a 20% speedup in computation time. We then outperform baseline methods in learning the optimal parameter settings for deep brain stimulation in rats to enhance the rats' performance on a cognitive task while safely avoiding unwanted side effects (i.e., triggering seizures).

1. INTRODUCTION

Safe and efficient control of novel systems with latent dynamics is an important objective in domains from healthcare to robotics. In healthcare, deep brain stimulation devices implanted in the brain can improve memory deficits in patients with Alzheimer's disease (Posporelis et al., 2018), and responsive neurostimulators (RNS) can counter epileptiform activity to mitigate seizures. Yet, the surgeon's trial-and-error process of finding effective RNS parameters for each patient is time-consuming and risky, with poor device settings possibly damaging the brain. Researchers studying active learning and Bayesian optimization have sought to develop algorithms to efficiently and safely learn a system's dynamics, e.g., learning a brain's dynamics for RNS configuration (Ashmaig et al., 2018; Sui et al., 2018). However, because these algorithms fail to scale to higher-dimensional state-action spaces, researchers utilize only simple voltage and frequency controls rather than all 32 channels of the RNS waveform (Ashmaig et al., 2018). Similarly, tasks in robotics, e.g., learning the dynamics of novel robotic systems (such as an autopilot learning to fly a damaged aircraft), require active learning methods that succeed in higher-dimensional domains. In this paper, we develop a probabilistically-safe, meta-active learning approach to tackle these challenging tasks by efficiently learning system dynamics and optimal configurations. We draw inspiration from recent contributions in meta-learning (Finn et al., 2017; Nagabandi et al., 2019; Wang et al., 2016; Andrychowicz et al., 2016) that seek to leverage a distribution over training tasks to optimize the parameters of a neural network for efficient, online adaptation. Researchers have previously investigated meta-learning for active learning, e.g., learning a Bayesian prior over a Gaussian process (Wang et al., 2018b) or learning an acquisition function.
However, these approaches do not consider the important problem of safely and actively learning to control a system with altered dynamics, which is a requirement for safety-critical robotic applications. Furthermore, as we show in Section 5, on challenging control tasks for healthcare and robotics, the performance of prior active learning approaches (Kirsch et al., 2019; Hastie et al., 2017) leaves much to be desired. We seek to overcome these key limitations of prior work by harnessing the power of meta-learning for active learning in a chance-constrained optimization framework for safe, online adaptation, encoding a learned representation of the sample history. Instead of hand-engineering an acquisition function for our specific domains, our approach employs a data-driven, meta-learning approach, which results in better performance than prior approaches, as shown in Section 5. Furthermore, our approach has the unique ability to impose analytical safety constraints over a sample trajectory.

Contributions - We develop a probabilistically-safe, meta-learning approach for active learning ("meta-active learning") that sets a new state-of-the-art. Our acquisition function (i.e., the function that predicts the expected information gain of a data point) is meta-learned offline, allowing the policy to benefit from past experience and provide a more robust measure of the value of an action. The key to our approach is a novel interweaving of our deep, meta-learned acquisition function, represented as a Long Short-Term Memory (LSTM) network (Gers et al., 1999), within a chance-constrained, mixed-integer linear program (MILP) (Schrijver, 1998). By encoding the LSTM's piecewise-linear output layers into the MILP, we directly optimize an action trajectory that best ensures the safety of the system while also maximizing the information learned about the system.
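The idea of encoding a piecewise-linear (ReLU) network layer as mixed-integer linear constraints can be illustrated with the standard big-M encoding, in which a binary indicator selects which linear piece of the ReLU is active. The sketch below is a minimal feasibility check of that encoding, not the paper's actual program; the function names and the bound M are illustrative.

```python
def relu_bigm_feasible(a, y, delta, M=100.0):
    """Check the standard big-M MILP encoding of y = max(0, a).

    delta is a binary indicator (1 when the unit is active, i.e. a > 0).
    Given |a| <= M, the four linear constraints below are jointly
    satisfiable iff y equals ReLU(a) for some choice of delta.
    """
    return (
        y >= 0
        and y >= a
        and y <= a + M * (1 - delta)  # inactive piece forces y = 0 via next line
        and y <= M * delta            # delta = 0 clamps y to 0
    )

# The true ReLU output is feasible with the matching indicator...
for a in (-3.0, 0.0, 2.5):
    y = max(0.0, a)
    delta = 1 if a > 0 else 0
    assert relu_bigm_feasible(a, y, delta)

# ...while a wrong output value is infeasible for either indicator choice.
assert not relu_bigm_feasible(-3.0, 1.0, 0)
assert not relu_bigm_feasible(-3.0, 1.0, 1)
```

Stacking such constraints layer by layer lets an off-the-shelf MILP solver reason exactly about a trained ReLU network's output, which is what makes embedding a learned acquisition function inside a chance-constrained program tractable.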
In this paper, we describe our novel architecture, which uniquely combines the power of a learned acquisition function with chance-constrained optimization, and evaluate its performance against state-of-the-art baselines in several relevant domains. To the best of our knowledge, this is the only architecture that meta-learns an acquisition function for optimization tasks and is capable of embedding this acquisition function in a chance-constrained linear program to guarantee a minimum level of safe operation. The contributions of this paper are as follows:

1. Meta-active learning for autonomously synthesizing an acquisition function to efficiently infer altered or unknown system dynamics and optimize system parameters.

2. Probabilistically-safe control combined with an active-learning framework through the integration of our deep learning-based acquisition function and integer linear programming.

3. State-of-the-art results for safe, active learning: a 46% increase in information gain in the high-dimensional environment of controlling a damaged aircraft, and a 58% increase in information gain in our deep brain stimulation domain over our baselines.

2. PRELIMINARIES

In this section, we review the foundations of our work in active, meta-, and reinforcement learning.

Active Learning - Labelled training data is often difficult to obtain due either to tight time constraints or a lack of expert resources. Active learning attempts to address this problem by utilizing an "acquisition function" to quantify the amount of information an unlabelled training sample, x ∈ D_U = {x_i}_{i=1}^n, would provide a base learner, T_ψ, if that sample were given a label, y, and added to the labeled dataset, D_L = {(x_j, y_j)}_{j=1}^m, i.e., D_L ← D_L ∪ {(x, y)}. The active learning algorithm queries its acquisition function, H(D_U, D_L, T_ψ), to select which x ∈ D_U should be labeled and added to D_L; then, a label is queried for x (e.g., by taking an action in an environment and observing the effect), and the new labeled sample is added to D_L (Muslea et al., 2006; Pang et al., 2018).

Meta-Learning - Meta-learning approaches attempt to learn a method to quickly adapt to new tasks online. In contrast to active learning, meta-learning attempts to learn a skill or learning method, e.g., learning an acquisition function, which can be transferred to novel tasks or scenarios. These tasks or skills are trained offline, and a common assumption is that the tasks selected at test time are drawn from the same distribution used for training (Hospedales et al., 2020).

Reinforcement Learning and Q-Learning - A Markov decision process (MDP) is a stochastic control process for decision making and can be defined by the 5-tuple ⟨X, U, T, R, γ⟩. X represents the set of states and U the set of actions. T : X × U × X → [0, 1] is the transition function that returns the probability of transitioning to state x′ from state x by applying action u. R : X × U → R is a reward function that maps a state and action to a reward, and γ weights the discounting of future rewards.
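The pool-based query loop from the active learning review above can be sketched as follows. Predictive entropy serves as a stand-in for the acquisition function H (the paper's H is meta-learned); the 1-D pool, logistic base learner, and oracle here are purely illustrative.

```python
import math

def entropy(p):
    """Binary predictive entropy of positive-class probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def active_learning_step(D_U, D_L, predict, oracle):
    """Query the highest-entropy unlabeled point and move it to D_L."""
    x_star = max(D_U, key=lambda x: entropy(predict(x)))
    D_U.remove(x_star)
    D_L.append((x_star, oracle(x_star)))
    return x_star

# Toy 1-D pool with a logistic base learner: the least confident point
# (closest to the decision boundary at 0) is queried first.
D_U, D_L = [-2.0, -0.1, 3.0], []
predict = lambda x: 1.0 / (1.0 + math.exp(-x))
x_star = active_learning_step(D_U, D_L, predict, oracle=lambda x: int(x > 0))
# x_star == -0.1; D_L == [(-0.1, 0)]
```

In the settings considered in this paper, "querying the oracle" corresponds to executing an action on the physical system and observing its effect, which is exactly why the query choice must additionally respect safety constraints.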
Reinforcement learning seeks to synthesize a policy, π : X → U, mapping states to actions to maximize the future expected reward. When π is the optimal policy, π*, the following Bellman condition holds: Q^π*(x, u) := E_{x′∼T}[R(x, u) + γ Q^π*(x′, π*(x′))] (Sutton & Barto, 2018).
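As a concrete instance of the Bellman condition, the minimal tabular Q-learning sketch below drives Q to the fixed point on a two-state toy chain; the MDP, step function, and hyperparameters are illustrative and not from the paper.

```python
def step(x, u):
    """Deterministic toy chain: from state 0, action 1 reaches the
    terminal goal state 1 with reward 1; action 0 stays at 0."""
    if u == 1:
        return 1, 1.0, True   # (next state, reward, done)
    return 0, 0.0, False

def q_update(Q, x, u, r, x_next, done, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step toward the Bellman target."""
    target = r if done else r + gamma * max(Q[x_next])
    Q[x][u] += alpha * (target - Q[x][u])

Q = [[0.0, 0.0]]  # Q-values for the single non-terminal state 0
for _ in range(200):  # sweep all state-action pairs until convergence
    for u in (0, 1):
        x_next, r, done = step(0, u)
        q_update(Q, 0, u, r, x_next, done)

# At the fixed point, Q satisfies the Bellman condition:
# Q*(0, 1) = 1.0 and Q*(0, 0) = 0 + 0.9 * max_u Q*(0, u) = 0.9
```

The same fixed-point structure carries over when, as in this paper, the "reward" being propagated is expected information gain rather than an extrinsic task reward.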

2.1. PROBLEM SET-UP

Our work is at the unique nexus of active learning, meta-learning, and deep reinforcement learning, with the objective of learning the Q-function as an acquisition function describing the expected future information gained when taking action u in state x, given a set of previously experienced states

