EXTRACTING MEANINGFUL ATTENTION ON SOURCE CODE: AN EMPIRICAL STUDY OF DEVELOPER AND NEURAL MODEL CODE EXPLORATION

Anonymous authors
Paper under double-blind review

Abstract

The high effectiveness of neural models of code, such as OpenAI Codex and AlphaCode, suggests coding capabilities that are at least comparable to those of humans. However, previous work has used these models only for their raw completions, ignoring how the model's reasoning, in the form of attention weights, can be used for other downstream tasks. Disregarding the attention weights means discarding a considerable portion of what those models compute when queried. To profit more from the knowledge embedded in these large pre-trained models, this work compares multiple approaches to post-process these valuable attention weights for supporting code exploration. Specifically, we measure to what extent the transformed attention signal of CodeGen, a large and publicly available pre-trained neural model, agrees with how developers look at and explore code when answering the same sense-making questions about code. At the core of our experimental evaluation, we collect, manually annotate, and open-source a novel eye-tracking dataset comprising 25 developers answering sense-making questions on code over 92 sessions. We empirically evaluate five attention-agnostic heuristics and ten attention-based post-processing approaches against our ground truth of developers exploring code, including the novel concept of follow-up attention, which exhibits the highest agreement. Beyond the dataset contribution and the empirical study, we also introduce a novel practical application of the attention signal of pre-trained models with a fully analytical solution, going beyond how neural models' attention mechanisms have traditionally been used.

1. INTRODUCTION

Recent large neural models of source code, such as Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2022), and AlphaCode (Li et al., 2022), are remarkably effective at program synthesis and competitive programming tasks. Yet our understanding of why they produce a particular solution is limited. In practice, the models are mostly used for their predictions alone, i.e., as generative models, and the way they reason about code internally remains largely untapped. These models are often based on the attention mechanism (Bahdanau et al., 2016), a key component of the transformer architecture (Vaswani et al., 2017). Besides providing substantial performance benefits, attention weights have been used to provide interpretability of neural models (Lin et al., 2017; Vashishth et al., 2019; Paltenghi & Pradel, 2021). In particular, Wan et al. (2022) and Vig & Belinkov (2019) have shown that attention weights contain important syntactic information on both the Abstract Syntax Tree (AST) of source code and Part of Speech (POS) tags in natural language. Moreover, Wan et al. (2022) showed that using attention weights to infer the distance between two tokens outperforms techniques using hidden representations. In a similar direction, Zhang et al. (2022) have shown that a novel graph representation of source code, derived solely from attention weights, achieves performance on the VarMisuse dataset (Allamanis et al., 2018) comparable to that of a hand-crafted graph representation based on control flow and data flow (Hellendoorn et al., 2020).

The work cited above suggests that the attention mechanism reflects or encodes objective properties of the source code processed by the model. We argue that, just as software developers consider different locations in the code individually and follow precise connections between them, so the self-attention of transformers connects and creates information flow between similar and linked code locations. If those relations are indeed comparable, this raises the question: can the knowledge about source code conveyed by the attention weights of neural models be leveraged to support human code exploration?

There are datasets tracking developers' visual attention while looking at code, but they are not well suited to this task. The largest ones either put the developers in an unnatural (and thus possibly biasing) environment where most of the vision is blurred (Paltenghi & Pradel, 2021), or they contain few and very specific code comprehension tasks (Bednarik et al., 2020) on code snippets too short to exhibit any interesting code navigation pattern.

This work. To address these limitations and to stimulate both developers and the neural model to not only glance at code but also deeply reason about it, we prepare an ad-hoc code understanding assignment called the sense-making task. It involves questions on code covering mental code execution, side-effect detection, algorithmic complexity, and deadlock detection. We collect an eye-tracking dataset of 92 valid sessions with developers. The sense-making task is additionally designed to be machine-friendly, with a specific prompt that triggers a completion from the model and thereby stimulates its reasoning. We then query CodeGen on the same sense-making task and compare its attention signal¹ to the attention of developers. The two turn out to be positively correlated (r=+0.23), motivating the use of raw and processed versions of the attention signal for code exploration.
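To make this comparison concrete, the following minimal sketch illustrates how such a line-level correlation can be computed. The array values and the aggregation of both signals to the line level are hypothetical placeholders, not the exact protocol of the paper.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical line-level signals for one code snippet:
# model_att[i] = aggregated attention the neural model places on line i,
# human_att[i] = total eye-fixation duration (ms) developers spent on line i.
model_att = np.array([0.05, 0.30, 0.10, 0.40, 0.15])
human_att = np.array([120.0, 450.0, 90.0, 600.0, 200.0])

r, p = pearsonr(model_att, human_att)
print(f"Pearson r = {r:+.2f} (p = {p:.3f})")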
To that end, we experimentally evaluate how well existing and novel attention post-processing methods align with the code exploration patterns derived from the chronological sequence of eye-fixation events in our dataset. To the best of our knowledge, this work is the first to investigate the attention signal of these pre-trained models for supporting code exploration, a specific code-related task. We empirically demonstrate that post-processing methods based on the attention signal can align well with the way developers explore code. In particular, using the novel concept of follow-up attention, we achieve the highest overlap with the developers' top-3 ground truth of which line to explore next, as sketched below.
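As an illustration of this kind of agreement measure, the following sketch computes an overlap@k score between the lines ranked highest by a post-processing method and the lines developers actually explored next. The function and the example rankings are illustrative assumptions, not the paper's exact evaluation protocol.

def overlap_at_k(predicted_lines, ground_truth_lines, k=3):
    """Fraction of the top-k predicted next lines that also appear among
    the top-k lines developers explored next (illustrative metric)."""
    predicted = set(predicted_lines[:k])
    ground_truth = set(ground_truth_lines[:k])
    return len(predicted & ground_truth) / k

# E.g., the method ranks lines [12, 4, 7] highest for the next step,
# while developers most often moved to lines [4, 12, 30]:
print(overlap_at_k([12, 4, 7], [4, 12, 30]))  # 2 of 3 lines agree -> 0.67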

Main contributions. Our key contributions are:

• A novel dataset of eye-tracking data, comprising 92 visual attention sessions of 25 developers engaged in sense-making tasks while using a common code editor with code written in three popular programming languages (Python, C++, and C#).
• The first experimental comparison of both the effectiveness and the visual attention of GPT-like models and developers when reasoning about sense-making questions.
• A demonstration of the connection between the neural attention signal and the temporal sequence of shifts in developer focus.
• An analytical formula for follow-up attention, a novel post-processing approach derived solely from the attention signal, which aligns well with developers' choices of which line to look at next when exploring code.
• An empirical evaluation comprising ten post-processing approaches of the attention signal, five heuristics, and an ablation study of follow-up attention against the collected ground truth of developers exploring code.

2. RELATION TO EXISTING WORK

Attention as explanation. Previous work (Jain & Wallace, 2019) on studying the attention weights of recurrent neural models has found that attention weights do not always agree with other explanation methods and that alternative weights can be adversarially constructed while still preserving the same model prediction. However, in response, a subsequent study (Wiegreffe & Pinter, 2019) showed that such alternative attention weights can only be constructed for a single prediction instance, whereas obtaining a model that is consistently wrong in its explanations is very unlikely.



¹ Attention signal refers to the attention weights produced during a forward pass by the transformer blocks.
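For concreteness, a minimal sketch of how this attention signal can be extracted from a publicly available CodeGen checkpoint, assuming the HuggingFace transformers API; the checkpoint name, the example snippet, and the averaging over layers and heads are illustrative choices, not necessarily those used in the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the paper uses CodeGen, but possibly a different size.
name = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

code = "def add(a, b):\n    return a + b\n"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions holds one tensor per layer, each of shape (batch, heads, seq, seq).
att = torch.stack(out.attentions)      # (layers, batch, heads, seq, seq)
att = att.mean(dim=(0, 2)).squeeze(0)  # average over layers and heads -> (seq, seq)
received = att.sum(dim=0)              # total attention each token receives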

