Students interested in any of the following should contact me by email.
There are lots of tools for statically detecting programming errors. In practice, however, it is impossible to resolve every finding on a codebase: some will be false positives, others will be low priority to fix. What is needed in this context is a way of tracking individual findings over time.
For this project you will look at developing fingerprinting techniques for static analysis findings. These techniques should be as stable as possible over code changes, i.e. if new lines are added or some refactoring is applied, the system should be able to tell that a finding is the same one as before.
Various techniques exist for code fingerprinting which could be investigated here but my hypothesis is that the best results will come from choosing different techniques for different checks. If this turns out to be the case then approaches for choosing the right technique for a check will also be interesting.
Proposed work: the Google Error Prone Java compiler provides a large range of checks and is easy to apply to Java code, making it a good source of findings. Fingerprinting approaches would then be evaluated against real, evolving code bases, e.g. projects taken from GitHub.
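As a starting point, one simple fingerprinting technique hashes what a finding says and the code it points at, rather than its file/line coordinates, so the fingerprint survives insertions and deletions elsewhere in the file. The sketch below is illustrative only (the function and field names are assumptions, not part of any existing tool), but it captures the kind of stability the project would evaluate:

```python
import hashlib

def fingerprint(check_name: str, line_text: str, context: list[str]) -> str:
    """Location-independent fingerprint for a static-analysis finding.

    Hashes the check name plus whitespace-normalised source text instead
    of file/line numbers, so adding unrelated lines above the finding
    does not change its identity. (Hypothetical sketch, not an existing API.)
    """
    def normalise(s: str) -> str:
        # collapse all runs of whitespace so reformatting is ignored
        return " ".join(s.split())

    payload = "\n".join([check_name, normalise(line_text)]
                        + [normalise(c) for c in context])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Note that this scheme is deliberately weak against edits to the flagged line itself, which is exactly why different checks may want different techniques: a naming check might fingerprint the enclosing declaration, while a null-dereference check might fingerprint the surrounding control flow.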
See the work described in: User-Guided Program Reasoning using Bayesian Inference (Raghothaman et al.) and Pointer Analysis (Smaragdakis et al.)
Systems such as TabNine use deep learning to build a language model for source code so as to automatically complete code in an IDE. This project will involve implementing a variety of different learning algorithms and deploying them in an IDE for Java source code. One option for doing this would be using Visual Studio Code and the Language Server Protocol (LSP). You could start with simple approaches such as n-grams and then consider LSTMs (perhaps bidirectional) or graph neural networks. Particularly interesting would be pointer networks, which can copy from other parts of the source file to complete the current text. Evaluation of such a system is quite convenient: delete parts of a file and measure how well the model fills them back in.