X2T: TRAINING AN X-TO-TEXT TYPING INTERFACE WITH ONLINE LEARNING FROM USER FEEDBACK

Abstract

We aim to help users communicate their intent to machines using flexible, adaptive interfaces that translate arbitrary user input into desired actions. In this work, we focus on assistive typing applications in which a user cannot operate a keyboard, but can instead supply other inputs, such as webcam images that capture eye gaze or neural activity measured by a brain implant. Standard methods train a model on a fixed dataset of user inputs, then deploy a static interface that does not learn from its mistakes, in part because extracting an error signal from user behavior can be challenging. We investigate a simple idea that would enable such interfaces to improve over time, with minimal additional effort from the user: online learning from user feedback on the accuracy of the interface's actions. In the typing domain, we leverage backspaces as feedback that the interface did not perform the desired action. We propose an algorithm called x-to-text (X2T) that trains a predictive model of this feedback signal, and uses this model to fine-tune any existing, default interface for translating user input into actions that select words or characters. We evaluate X2T through a small-scale online user study with 12 participants who type sentences by gazing at their desired words, a large-scale observational study on handwriting samples from 60 users, and a pilot study with one participant using an electrocorticography-based brain-computer interface. The results show that X2T learns to outperform a non-adaptive default interface, stimulates user co-adaptation to the interface, personalizes the interface to individual users, and can leverage offline data collected from the default interface to improve its initial performance and accelerate online learning.

1. INTRODUCTION

Recent advances in user interfaces have enabled people with sensorimotor impairments to more effectively communicate their intent to machines. For example, Ward et al. (2000) enable users to type characters using an eye gaze tracker instead of a keyboard, and Willett et al. (2020) enable a paralyzed human patient to type using a brain implant that records neural activity. The main challenge in building such interfaces is translating high-dimensional, continuous user input into desired actions. Standard methods typically calibrate the interface on predefined training tasks for which expert demonstrations are available, then deploy the trained interface. Unfortunately, this does not enable the interface to improve with use or adapt to distributional shift in the user inputs. In this paper, we focus on the problem of assistive typing: helping a user select words or characters without access to a keyboard, using eye gaze inputs (Ward et al., 2000); handwriting inputs (see Figure 7 in the appendix), which can be easier to provide than direct keystrokes (Willett et al., 2020); or inputs from an electrocorticography-based brain implant (Leuthardt et al., 2004; Silversmith et al., 2020). To enable any existing, default interface to continually adapt to the user, we train a model using online learning from user feedback. The key insight is that the user provides feedback on the interface's actions via backspaces, which indicate that the interface did not perform the desired action in response to a given input. By learning from this naturally-occurring feedback signal instead of an explicit label, we do not require any additional effort from the user to improve the interface. Furthermore, because our method is applied on top of the user's default interface, our approach is complementary to other work that develops state-of-the-art, domain-specific methods for problems like gaze tracking and handwriting recognition.
Figure 1 describes our algorithm: we initialize our model using offline data generated by the default interface, deploy our interface as an augmentation to the default interface, collect online feedback, and update our model. We formulate assistive typing as an online decision-making problem, in which the interface receives observations of user inputs, performs actions that select words or characters, and receives a reward signal that is automatically constructed from the user's backspaces. To improve the default interface's actions, we fit a neural network reward model that predicts the reward signal given the user's input and the interface's action. Upon observing a user input, our interface uses the trained reward model to update the prior policy given by the default interface to a posterior policy conditioned on optimality, then samples an action from this posterior (see Figure 1). We call this method x-to-text (X2T), where x refers to the arbitrary type of user input; e.g., eye gaze or brain activity. Our primary contribution is the X2T algorithm for continual learning of a communication interface from user feedback. We primarily evaluate X2T through an online user study with 12 participants who use a webcam-based gaze tracking system to select words from a display. To run ablation experiments that would be impractical in the online study, we also conduct an observational study with 60 users who use a tablet and stylus to draw pictures of individual characters. The results show that X2T quickly learns to map input images of eye gaze or character drawings to discrete word or character selections. By learning from online feedback, X2T improves upon a default interface that is only trained once using supervised learning and, as a result, suffers from distribution shift (e.g., caused by changes in the user's head position and lighting over time in the gaze experiment).
X2T automatically overcomes calibration problems with the gaze tracker by adapting to the miscalibrations over time, without the need for explicit re-calibration. Furthermore, X2T leverages offline data generated by the default interface to accelerate online learning, stimulates co-adaptation from the user in the online study, and personalizes the interface to the handwriting style of each user in the observational study.
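To make the posterior update described above concrete, the following is a minimal sketch (not the authors' implementation) of how a default interface's prior over actions can be reweighted by a learned reward model. It assumes the reward model outputs, for each candidate action, the probability that the action will not be backspaced; the function name and the example probabilities are illustrative only.

```python
import numpy as np

def x2t_posterior(prior_probs, p_correct):
    """Reweight the default interface's action distribution by the
    learned probability that each action is correct (i.e., will not
    trigger a backspace), then renormalize to form the posterior
    policy that the interface samples from."""
    posterior = np.asarray(prior_probs, dtype=float) * np.asarray(p_correct, dtype=float)
    return posterior / posterior.sum()

# Hypothetical example with three candidate word selections: the
# default interface slightly prefers action 0, but the reward model
# has learned from past backspaces that action 1 is rarely undone.
prior = np.array([0.5, 0.3, 0.2])
p_correct = np.array([0.2, 0.9, 0.5])
posterior = x2t_posterior(prior, p_correct)
action = int(np.argmax(posterior))  # -> 1
```

In this sketch, multiplying the prior by the per-action correctness probabilities and renormalizing implements the Bayes-style conditioning on optimality: actions that the reward model predicts will be backspaced are down-weighted without discarding the default interface's prior.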



Figure 1: We formulate assistive typing as a human-in-the-loop decision-making problem, in which the interface observes user inputs (e.g., neural activity measured by a brain implant) and performs actions (e.g., word selections) on behalf of the user. We treat a backspace as feedback from the user that the interface performed the wrong action. By training a model online to predict backspaces, we continually improve the interface.

