DEEP REINFORCEMENT LEARNING BASED INSIGHT SELECTION POLICY

Abstract

We live in the era of ubiquitous sensing and computing. More and more data are being collected and processed from devices, sensors, and systems. This opens up opportunities to discover patterns in these data that can lead to a better understanding of the sources that produce them. Such knowledge is useful in a wide range of domains, especially in personal health, where it can help users comprehend their behavior and, indirectly, improve their lifestyle. Insight generators are systems that identify such patterns and verbalize them in a readable text format, referred to as insights. Insights are selected using a scoring algorithm that optimizes this process with respect to multiple objectives, e.g., the factual correctness, usefulness, and interestingness of the insights. In this paper, we propose a novel Reinforcement Learning (RL) framework that, for the first time, recommends health insights in a dynamic environment based on user feedback and estimates of lifestyle quality. Using simple, highly reusable principles for automatic user simulation based on real data, we demonstrate in this preliminary study that the RL solution can improve the selection of insights towards multiple pre-defined objectives.

1. INTRODUCTION

The latest developments in big data, the Internet of Things, and personal health monitoring have led to a massive increase in the ease and scale at which data are collected and processed. Learning from the information present in these data has been shown to help run businesses better, manage health care services, and even maintain a healthier lifestyle. Such understanding mostly takes the form of identifying a significant rise or fall of a certain measurement given a context of interest. Suppose the sleep logs of a user of a health monitoring service show that they went to sleep later during the weekends than during the weekdays. This can be communicated to the user as a statement such as, "You went to sleep later during the weekends than the weekdays". Here, the time at which they went to sleep is the measurement, and whether the day is a weekday or a weekend is the context of interest. We call such statements 'insights'. Providing insights that accurately describe the scenarios in which a certain health parameter improved or deteriorated can enable the user to make better lifestyle choices. Moreover, it is accepted that providing relevant information to users can improve their behavior (Abraham & Michie, 2008).

The insight generation task can be seen as a natural language generation task in which a generator model creates appropriate insight statements. A generalized framework for such an insight generator (Genf) was proposed by Susaiyah et al. (2020), in which components that analyze the data and generate the statements play an important role. More importantly, the framework includes a user feedback mechanism that captures what type of insights the user is interested in. Implementations of this framework follow the "overgenerate and rank" approach, in which all candidates permitted by the definition are generated and later filtered using a calculated rank or score (Gatt & Krahmer, 2018; Varges & Mellish, 2010); a small sketch of this idea is given below.

The selection of the most relevant insight from a list of candidates via ranking or scoring is an ongoing research topic. Earlier works used purely statistical insight selection mechanisms in which the top-ranking insights according to a statistical algorithm are selected (Härmä & Helaoui, 2016), often combined with machine-readable knowledge (Musto et al., 2017). Other approaches combined neural networks with statistical algorithms and simulated user feedback (Susaiyah et al., 2021). All of these techniques have limitations: they either over-simplify user preferences or require a huge amount of data. On the other hand, as noted in Afsar et al. (2021), a recommendation is by nature a sequential decision problem and can therefore be modelled as a Markov Decision Process (MDP). Reinforcement Learning (RL) can thus be used to solve it, taking into account the dynamics of the user's interactions with the system, their long-term engagement with specific topics, and more complex feedback than binary ratings. In this paper, we introduce a novel Deep Reinforcement Learning (DRL) based framework that recommends health insights to users based on their health status and feedback.
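To make the "overgenerate and rank" idea concrete, the following minimal sketch enumerates candidate weekday-vs-weekend insights for a few sleep measurements and ranks them with a simple significance-based score. It is an illustration only: the DataFrame columns (`date`, `bedtime_min`, `sleep_duration_h`), the verbalization template, and the use of Welch's t-test as the scoring statistic are assumptions, not the exact statistics of the cited systems.

```python
import pandas as pd
from scipy import stats

def generate_candidates(sleep_log: pd.DataFrame,
                        measurements=("bedtime_min", "sleep_duration_h")):
    """Overgenerate: one candidate insight per measurement for the weekday/weekend context."""
    is_weekend = pd.to_datetime(sleep_log["date"]).dt.dayofweek >= 5
    candidates = []
    for m in measurements:
        weekday, weekend = sleep_log.loc[~is_weekend, m], sleep_log.loc[is_weekend, m]
        # Welch's t-test gives a simple significance-based ranking score.
        _, p_value = stats.ttest_ind(weekday, weekend, equal_var=False)
        direction = "higher" if weekend.mean() > weekday.mean() else "lower"
        text = f"Your {m} was {direction} during the weekends than the weekdays."
        candidates.append({"text": text, "score": 1.0 - p_value})
    return candidates

def rank_and_select(candidates, k=1):
    """Rank: keep the k candidates with the highest scores."""
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
```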
While it incorporates previously developed insight generation techniques (Susaiyah et al., 2020), the presented framework is based on a completely new training pipeline that uses real-time simulated data instead of retrospective data, with the objective of choosing the best insight rather than assigning scores to all insights. Through DRL, the presented system is able to deliver insights that are both useful and preferred by the user. To the best of our knowledge, no other existing system reaches both of these objectives. In this preliminary study, we evaluate the framework in terms of the significance of life-quality improvements, the speed and accuracy of adaptation to the dynamics of user preferences, and its deployability in a real-life scenario.
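The sketch below shows one way the recommendation problem can be cast as an MDP with a simulated user, as motivated above: the state summarizes the user's current observations, the action is the index of the insight shown, and the reward combines immediate feedback with a lifestyle-quality estimate. The state features, reward weights, and simulated preference model are illustrative assumptions and not the exact environment used in this work.

```python
import numpy as np

class InsightRecommendationEnv:
    """Toy MDP for insight selection: state = user health/behavior features,
    action = index of the candidate insight shown, reward = simulated feedback
    plus the resulting change in a lifestyle-quality estimate (illustrative only)."""

    def __init__(self, n_insights=20, n_features=8, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_insights, self.n_features = n_insights, n_features
        # Hidden user preference over insights, unknown to the agent.
        self.preference = self.rng.normal(size=n_insights)

    def reset(self):
        self.quality = 0.5  # lifestyle-quality estimate in [0, 1]
        self.state = self.rng.normal(size=self.n_features)
        return self.state

    def step(self, action):
        # Simulated binary feedback drawn from the hidden preference.
        liked = float(self.rng.random() < 1.0 / (1.0 + np.exp(-self.preference[action])))
        # Useful, liked insights slightly improve the simulated lifestyle quality.
        self.quality = float(np.clip(self.quality + 0.01 * (liked - 0.3), 0.0, 1.0))
        reward = liked + self.quality                        # feedback + wellbeing signal
        self.state = self.rng.normal(size=self.n_features)   # next day's observation
        return self.state, reward, False, {}
```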

2. RELATED WORK

Traditionally, insights were generated using association rule mining techniques (Agrawal & Shafer, 1996), in which associations between different contexts in a dataset are discovered. However, these techniques do not work for continuous variables. This led to the work of Härmä & Helaoui (2016), where both continuous and categorical variables were considered. That approach, however, lacked the ability to adapt to specific users, which is very important because what counts as an insight is highly subjective. Later, a Genf was introduced in Susaiyah et al. (2020) to make the user part of the insight generation system. This framework requires highly dynamic mechanisms to rank and recommend insights based on the dynamics of user interests.

To clarify the main goals of this task, the survey of Pramod & Bafna (2022) presents ten challenges to overcome for conversational recommender systems. Our approach was designed to address nine of the ten; the remaining one relates only to dialogue management, which is outside the scope of the present study. The main challenges that we focus on are to: 1) keep the ratings given by the user reliable, 2) minimize the time spent on rating, 3) allow cold-start recommendations, 4) balance cognitive load against user control, 5) remove the user's need to state technical requirements, 6) allow a fallback recommendation when no items are found, 7) limit the complexity of item presentation, 8) allow domain-dependent recommendations using a Genf, and 9) convince users of the recommendation by presenting facts.

The neural insight selection model presented in Susaiyah et al. (2021) was agnostic to the user's overall objective in using the insights: to improve a behavior or performance. The authors modelled the problem as a scoring objective that assigns each insight a score between 0 and 1 reflecting its relevance to the user. The top insights were then selected on a need basis in order to improve the system's understanding of user preferences. The main drawback of this approach is that it focuses only on insight selection for user preferences and uses supervised learning from binary feedback. It can therefore consider neither the long-term nor the short-term impact that a given insight will have on a given user. Nowadays, the problem of long-term interaction, i.e., daily recommendations over multiple months, can be addressed using DRL. However, DRL is known to have poor sample efficiency, whether model-free (policy-gradient, value-based, or actor-critic) or model-based. DRL algorithms such as SAC (Haarnoja et al., 2018), A3C (Mnih et al., 2016), DDPG (Lillicrap et al., 2015), DQN (Mnih et al., 2013), and PPO (Schulman et al., 2017) all suffer from this problem and require, on average, several million interactions with their environment to solve complex problems, as demonstrated in their respective papers. This is even more problematic given that the continuous supervised learning in Susaiyah et al. (2021) already required, on average, 15.6 insights labelled with user feedback every day.
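For concreteness, the sketch below shows a single DQN-style temporal-difference update for insight selection, the kind of value-based learning step whose sample cost is discussed above. It is a minimal illustration in PyTorch; the network size, hyper-parameters, and batch layout are assumptions and do not describe the configuration used in this work.

```python
import torch
import torch.nn as nn

n_features, n_insights = 8, 20
q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_insights))
target_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_insights))
target_net.load_state_dict(q_net.state_dict())  # target network starts as a copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99  # discount factor

def dqn_update(batch):
    """One TD update from a replay batch of (state, action, reward, next_state) tensors."""
    s, a, r, s_next = batch  # shapes: (B, n_features), (B,), (B,), (B, n_features)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a) for taken actions
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values
    loss = nn.functional.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```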

