LEARNING TO IMPROVE CODE EFFICIENCY

Abstract

Improvements in the performance of computing systems, driven by Moore's Law, have transformed society. As these hardware-driven gains slow down, it becomes even more important for software developers to focus on performance and efficiency during development. While several studies have demonstrated the potential of improved code efficiency (e.g., 2x better generational improvements compared to hardware), unlocking these gains in practice has been challenging. Reasoning about algorithmic complexity and the interaction of coding patterns with hardware is difficult for the average programmer, especially when combined with pragmatic constraints around development velocity and multi-person development. This paper seeks to address this problem. We analyze a large competitive programming dataset from the Google Code Jam competition (Google Code-Jam) and find that efficient code is indeed rare, with a 2x runtime difference between the median and the 90th percentile of solutions. We propose using machine learning to automatically provide prescriptive feedback in the form of hints that guide programmers towards writing high-performance code. To automatically learn these hints from the dataset, we propose a novel discrete variational auto-encoder, where each discrete latent variable represents a different learned category of code edit that increases performance. We show that this method represents the multi-modal space of code efficiency edits better than a sequence-to-sequence baseline and generates a distribution of more efficient solutions.

1. INTRODUCTION

The computational efficiency of code is often front-and-center in any computer science curriculum. While there are many ways to solve a particular problem, there is often wide variance in the runtime of different implementations. This variance can be attributed to many different factors: the algorithmic complexity of the code in question, the data structures that are used, the libraries that are called, and lower-level execution effects like efficient caching or memory usage. Similarly, computational efficiency is a critical component of professional software development. The computing industry as a whole has relied on the automatic performance increases of Moore's Law to scale massive warehouse computing systems to meet the internet requirements of the world. As these automatic performance increases slow down, the burden of reducing computational cost and carbon footprint now falls on writing high-performance code (Patterson et al., 2021). Writing efficient code is challenging, even for experienced programmers, as it requires understanding computational complexity as well as the underlying hardware. Lower-level performance optimizations are therefore automated by compilers, which apply a small set of known, sound low-level program transformations to an already written program to increase its efficiency. However, compilers and current tooling have more difficulty identifying higher-level optimizations, such as more efficient algorithms for the same problem. So far, these types of optimizations could only be identified by humans. We hypothesize that machine learning can be used to guide humans towards such optimizations, by suggesting edits that optimize code efficiency. To study this problem, we examine a competitive programming dataset where tens of thousands of developers have submitted answers to about 180 different questions.
Studying these solutions, we find wide variance in computational cost: the runtime difference between a median solution and the 90th percentile is over two-fold. The scarcity of high-performance solutions highlights the difficulty of our task. Therefore, we aim to provide prescriptive feedback to developers to guide them towards writing high-performance code.
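
As a rough illustration of the runtime-gap statistic above, the following sketch computes the ratio between the median runtime and the runtime of a fast solution for a single problem. The runtimes are hypothetical and not taken from the dataset, the function names are ours, and we assume that the "90th percentile" solution is one that is faster than 90% of submissions (i.e., the 10th percentile of runtime); the exact definition used for the reported numbers may differ.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0 <= q <= 100) with linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = math.floor(rank)
    hi = math.ceil(rank)
    # Interpolate between the two nearest order statistics.
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def median_to_fast_ratio(runtimes_ms, fast_q=10):
    """Ratio of the median runtime to the runtime at the fast_q-th percentile.

    Assumption: a solution at the "90th percentile" of efficiency sits at
    the 10th percentile of runtime, hence the default fast_q=10.
    """
    return percentile(runtimes_ms, 50) / percentile(runtimes_ms, fast_q)


# Hypothetical runtimes (in milliseconds) for one problem; illustrative only.
example_runtimes_ms = [40, 50, 55, 90, 100, 110, 120, 150, 200, 400, 800]
ratio = median_to_fast_ratio(example_runtimes_ms)
print(f"median / fast-solution runtime ratio: {ratio:.2f}")
```

A ratio above 2 on a given problem would correspond to the kind of gap described above; in practice the statistic would be computed per problem and then aggregated across the dataset.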

