TOWARDS A RELIABLE AND ROBUST DIALOGUE SYSTEM FOR MEDICAL AUTOMATIC DIAGNOSIS

Abstract

Dialogue system for medical automatic diagnosis (DSMAD) aims to learn an agent that mimics the behavior of a human doctor, i.e., inquiring about symptoms and informing diseases. Since DSMAD can be formulated as a Markov decision process, many studies apply reinforcement learning methods to solve it. Unfortunately, existing works rely solely on diagnostic accuracy to justify the effectiveness of their DSMAD agents, while ignoring the medical rationality of the inquiry process. From the perspective of medical application, it is critical to develop an agent that produces reliable and convincing diagnostic processes and is also robust when making diagnoses in the face of noisy interactions with patients. To this end, we propose a novel DSMAD agent, INS-DS (Introspective Diagnosis System), comprising two separate yet cooperative modules: an inquiry module for proposing symptom inquiries and an introspective module for deciding when to inform a disease. INS-DS is inspired by the introspective decision-making process of humans: the inquiry module first proposes the most valuable symptom inquiry, then the introspective module intervenes on the potential responses to this inquiry and decides to inquire only if the diagnoses under these interventions vary. We also propose two evaluation metrics to validate the reliability and robustness of DSMAD methods. Extensive experimental results demonstrate that INS-DS achieves a new state of the art under various experimental settings and possesses the advantages of reliability and robustness compared to other methods.

1. INTRODUCTION

Dialogue system for medical automatic diagnosis (DSMAD) aims to learn an agent that collects a patient's information and makes a preliminary diagnosis in an interactive manner, like a human doctor. This task has increasingly attracted the attention of researchers because of its huge industrial potential (Tang et al., 2016). Similar to other task-oriented dialogue tasks (Lipton et al., 2018; Wen et al.; Yan et al., 2017; Lowe et al., 2015), DSMAD is composed of a sequence of dialogue-based interactions between the patient and the agent, which can be formulated as a Markov decision process and solved by reinforcement learning (RL) (Mnih et al., 2015; Van Hasselt et al., 2016). Although several frameworks have been proposed (Xu et al., 2019; Wei et al., 2018; Peng et al., 2018; Tang et al., 2016), DSMAD is still far from applicable, because these works evaluate the agent only on diagnostic accuracy, ignoring the robustness and reliability required for practical medical applications. The two major shortcomings of current DSMAD methods are summarized below.

Unreliable symptom-inquiry and disease-diagnosis. It is reasonable to measure DSMAD by diagnostic accuracy, since accuracy is the ultimate goal of the task. However, in the unilateral pursuit of high accuracy, a DSMAD agent pays less attention to the rationale of the diagnosis process, reducing the trust of users. For example, a DSMAD agent might jump to a conclusion without inquiring about any symptom; as long as the diagnosis is correct, such an agent still receives a positive reward. In this sense, the correctness of diagnoses is not sufficient to reflect the performance of DSMAD, and might lead the agent to make hasty diagnoses without interaction. Moreover, a DSMAD agent should learn to make consistent disease diagnoses according to the symptom-disease relations in the training data, insensitive to the noise arising during training.

Sensitivity to small disturbances.
Almost all current DSMAD methods combine the operations of symptom-inquiry and disease-diagnosis and let the model make sequential decisions in a black-box manner (Zhang & Zhu, 2018; Koh & Liang, 2017) without regulation, resulting in a system vulnerable to noise during the interaction process. For instance, if we move one of the inquired symptoms into the self-report (so that the information remains consistent between the two cases), an agent sensitive to noise would make a different diagnosis. To this end, we propose a novel DSMAD agent, the Introspective Diagnosis System (INS-DS) (Fig. 1), together with two new evaluation metrics for reliability and robustness. The diagnosis logic of INS-DS draws on the introspective decision-making process of human doctors: in real life, a doctor comes to a conclusion only when further inquiries would make no difference. In INS-DS, the inquiry module is responsible for selecting the most valuable symptom to inquire about, while the introspective module intervenes on the potential answers to this inquiry to decide whether to inquire about the symptom or inform the disease. Specifically, the introspective module assigns each possible answer to the inquiry, producing multiple one-step-look-ahead dialogue states, and then inspects whether the diagnosis results of these states vary. If the predicted results are all the same, meaning that inquiring about the most valuable symptom would not change the diagnosis, INS-DS informs the disease instead; otherwise, the agent inquires about the symptom. Such mandatory introspection makes the inquiries more disease-related, because the agent is not allowed to make a diagnosis until it has collected sufficient symptom information. It also makes the disease-diagnosis more consistent, because a disease can only be informed after this comprehensive hypothesis test is passed.
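The one-step-look-ahead decision described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `policy` and `classifier` are assumed interfaces (an inquiry module that scores candidate symptoms, and a disease classifier mapping a dialogue state, here a dict of symptom-to-answer pairs, to a predicted disease), and the answer set is a hypothetical placeholder.

```python
def introspective_step(state, policy, classifier, answers=("yes", "no", "not_sure")):
    """Return ('inform', disease) or ('inquire', symptom) for the current state.

    state: dict mapping already-known symptoms to the patient's answers.
    policy: inquiry module exposing best_inquiry(state) -> symptom (assumed API).
    classifier: diagnosis module exposing predict(state) -> disease (assumed API).
    """
    # 1. The inquiry module proposes the most valuable symptom to ask about.
    symptom = policy.best_inquiry(state)

    # 2. The introspective module intervenes on each possible patient response,
    #    producing one-step-look-ahead dialogue states and their diagnoses.
    diagnoses = set()
    for answer in answers:
        lookahead = dict(state, **{symptom: answer})
        diagnoses.add(classifier.predict(lookahead))

    # 3. If every hypothetical answer yields the same diagnosis, asking would
    #    make no difference: inform the disease. Otherwise, ask the symptom.
    if len(diagnoses) == 1:
        return ("inform", diagnoses.pop())
    return ("inquire", symptom)
```

Under this sketch, the episode loop would repeatedly call `introspective_step`, updating `state` with the patient's real answer whenever the action is `inquire`.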
To quantify the reliability and robustness of DSMAD, we propose two novel evaluation metrics. Reliability. The reliability metric quantifies how confident a diagnosis is from the perspectives of the model (internal trust, Int.) and the user (external trust, Ext.). Internal trust tests whether the diagnoses made by the model are insensitive to task-irrelevant factors, e.g., sampling noise and parameter initialization. Specifically, for Int., we adopt the expected diagnostic probability of a set of bootstrapped models. These models are initialized with different parameters and trained on data re-sampled with replacement, reducing the effect of parameter initialization and data sampling. Therefore, the higher Int. is, the less sensitive the diagnosis result is to noise in the training process. External trust indicates how trustworthy a diagnostic process appears to users. Intuitively, patients are more likely to believe a diagnosis made by an agent that requests symptoms the way a human doctor does. Accordingly, for Ext., we compute the symptom overlap ratio based on the co-occurrence between symptoms and diseases in the diagnostic dialogue dataset. The higher Ext. is, the more likely the agent is to inquire about symptoms like a human doctor. Robustness. For robustness, we draw inspiration from the well-known adversarial attacks (Kurakin et al., 2018) on machine learning models, which use samples with subtle modifications that are indistinguishable to humans but may flip a machine's prediction, exposing the vulnerability of a model. Our robustness metric is the proportion of correct diagnoses that remain unaltered after feeding the model with attack samples constructed according to the formulas in Sec. 5.
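The two reliability scores can be sketched as simple averages, under assumed interfaces; the function names and the `predict_proba` API below are illustrative, not the paper's actual formulas. Int. averages the probability that each bootstrap-trained model assigns to the informed disease; Ext. measures the fraction of inquired symptoms that co-occur with that disease in the training dialogues.

```python
def internal_trust(models, state, disease):
    """Int. sketch: expected probability that a bootstrap ensemble assigns
    to `disease` for the given dialogue state. Each model is assumed to
    expose predict_proba(state) -> dict mapping disease -> probability."""
    probs = [m.predict_proba(state).get(disease, 0.0) for m in models]
    return sum(probs) / len(probs)


def external_trust(inquired_symptoms, disease, cooccurrence):
    """Ext. sketch: overlap ratio between the symptoms the agent inquired
    about and the symptoms that co-occur with `disease` in the dataset.
    `cooccurrence` maps each disease to its set of co-occurring symptoms."""
    if not inquired_symptoms:
        return 0.0
    related = cooccurrence.get(disease, set())
    overlap = sum(1 for s in inquired_symptoms if s in related)
    return overlap / len(inquired_symptoms)
```

A higher `internal_trust` means the ensemble agrees confidently on the diagnosis despite resampling and re-initialization; a higher `external_trust` means the agent's questions track the disease's known symptom profile.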
Extensive experimental results show that INS-DS achieves superior performance compared to other DSMAD baselines under various settings and possesses the advantages of reliability and robustness. We also conduct human evaluations of INS-DS and observe significant improvements in diagnosis validity, symptom rationality, and topic transition smoothness.

2. RELATED WORK

Task-oriented dialogue systems are designed to accomplish specific tasks, such as ticket booking, restaurant booking, and online shopping (Lipton et al., 2018; Wen et al.; Yan et al., 2017). Most current task-oriented dialogue systems adopt the framework of reinforcement learning (Mnih et al., 2015; Lipton et al., 2018; Li et al., 2017), while some adopt a sequence-to-sequence style for dialogue generation (Madotto et al., 2018; Wu et al., 2019; Lei et al., 2018). For medical dialogue systems, due to the large number of symptoms, reinforcement learning is a better choice for topic selection (Tang et al., 2016; Kao et al., 2018; Peng et al., 2018). Tang et al. (2016) apply a Deep Q-Network (DQN) (Mnih et al., 2015) to diagnose using synthetic data, while Wei et al. (2018) were the first to conduct experiments on real-world data using DQN. To include explicit medical inductive bias for improving diagnostic performance, Xu et al. (2019) proposed an end-to-end model

