Dr Yiannos Stathopoulos

I am a researcher at the University of Cambridge, Department of Computer Science and Technology working on the ALEXANDRIA project. My work on the project revolves around user support tools and methods, based on AI and machine learning, for the interactive theorem prover (ITP) Isabelle.

I am the co-creator of the SErAPIS search engine for the Isabelle library and Archive of Formal Proofs (AFP) along with my colleague, Dr Angeliki Koutsoukou-Argyraki. Working with my colleague, Dr Anthony Bordg, we have created the Isabelle Parallel Corpus (IPC), a parallel corpus pairing formal Isabelle artefacts (e.g., theorems and their proofs) to their natural language counterparts. The IPC is intended to be a "living corpus" for use in machine learning and natural language tasks.

I have had the privillage of collaborating with Dr Alexis Litvine and Dr Oliver Dunn on the THOTH project. THOTH is an initiative within Cambridge University to apply AI and machine learning in the humanities and social sciences.

My background is in Information Retrieval (i.e., search), Artificial Intelligence, Machine Learning, Computer Vision and Natural Language Processing (NLP)

I have previously worked on extracting and parsing mathematical expressions directly from PDF documents. As part of this work, supervised by Dr Brian Harrington, I have:

Built tools that extract text box, text line and character data from PDFs and produce machine-readable XML representations of page data. I modified pdftotext to align data in PDF documents with rasterised pages in PDF documents.
Built Mathalyzer, an interactive machine learning annotation and exploratory tool for automatically extracting and parsing mathematical expressions from PDF documents (see below).
Obtained extensive experience working with computer vision algorithms and the OpenCV library (in C++ and Python).

My PhD, titled Retrieval of research-level mathematics via joint modelling of text and types, is about Mathematical Information Retrieval (MIR) of research mathematics and was supervised by Professor Simone Teufel. My thesis was examined by Professor Tim Griffin (internal) and Professor Akiko Aizawa (external).

Education

BSc in Computer Science, First class -- University of Nottingham
MSc in Statistics -- University of Nottingham
MSc in Computer Science -- University of Oxford
PhD in Computer Science -- University of Cambridge

Recent Invited Talks

National Institute of Informatics (Tokyo, Japan), Title: Recent Work by the ALEXANDRIA Project at the University of Cambridge;
EuroproofNet 2022 (Tbilisi, Georgia), Title: Finding Facts in Large Formalization Libraries: Two Isabelle/AFP attempts (joint talk with Fabian Huch).

Guides and Tutorials

Isabelle users: checkout the SErAPIS Youtube channel for a tutorial on how to use the SErAPIS concept-oriented Isabelle search engine.
Cambridge researchers: checkout my Guide to using the Cambridge HPC facilities.

Selected Publications

2022

A parallel corpus for natural language machine translation to isabelle.
Anthony Bordg, Yiannos Stathopoulos and Lawrence Paulson
In The 15th Conference on Intelligent Formal Mathematics (CICM), Work in Progress papers. Tblisi, Georgia. September, 2022.
A Parallel Corpus of Natural Language and Isabelle Artefacts.
Anthongy Bordg, Yiannos Stathopoulos and Lawrence Paulson
In Informal Proceedings of the 7th Conference on Artificial Intelligence and Theorem Proving (AITP 2022).

2020

SErAPIS: A Concept-Oriented Search Engine for the Isabelle Libraries Based on Natural Language
Yiannos A. Stathopoulos, Angeliki Koutsoukou-Argyraki, Lawrence Paulson
Isabelle Workshop 2020. March, 2020

2019

Developing a Concept-Oriented Search Engine for Isabelle Based on Natural Language: Technical Challenges
Yiannos A. Stathopoulos, Angeliki Koutsoukou-Argyraki, Lawrence Paulson
Artificial Intelligence and Theorem Proving (AITP) 2020. December 2019

2018

Variable Typing: Assigning Meaning to Variables in Mathematical Text
Yiannos A. Stathopoulos, Simon Baker, Marek Rei and Simone Teufel
In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2018) New Orleans, United States, 2018

2016

Mathematical Information Retrieval Based on Type Embeddings and Query Expansion
Yiannos A. Stathopoulos and Simone Teufel
In Proceedings of the 26th International Conference on Computational Linguistics (Coling 2016). Osaka, Japan, 2016.

2015

Retrieval of research-level mathematical information needs: A Test Collection and Technical Terminology Experiment
Yiannos A. Stathopoulos and Simone Teufel
In Proceedings of the Short Papers of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015). Beijing, China, 2015.

2011

OMEX: Software for Mining Mathematical Expression Semantics from Scientific Documents.
Yiannos A. Stathopoulos and Brian Harrington
In Proceedings of the 2011 IEEE Fifth International Conference on Semantic Computing (ICSC '11). IEEE Computer Society, Washington, DC, USA, 209-210. DOI=10.1109/ICSC.2011.65 http://dx.doi.org/10.1109/ICSC.2011.65

Code and Data Downloads

Download the Cambridge University MathIR Test Collection (for retrieval of research-level mathematics) described in
"Retrieval of research-level mathematical information needs: A Test Collection and Technical Terminology Experiment"
Download the Cambridge Dictionary of Mathematical Types (CDMT) seed type dictionary (10601 phrases), gold-standard data set for type detection from "Mathematical Information Retrieval Based on Type Embeddings and Query Expansion" and extended type dictionary (1.23m phrases) from "Variable Typing: Assigning Meaning to Variables in Mathematical Text"
Download the Variable Typing Data Set for assigning meaning to mathematical variables using Machine Learning

Cool things I've built

This is a partial list of cool stuff I've built.

Mathalyzer -- an interactive tool for analysing mathematical formuale in PDF documents. Written in C++ and GTK+, this tool employs the Presentation-Abstraction-Control (PAC) pattern to synchronise multiple data elements in a unified presentation. The idea behind Mathalyzer is to produce a tool that combines elements of Acrobat, Photoshop and SPSS.
Spine -- A small C++ library, forked from the subsystems of Mathalyzer, that implements Presentation-Abstraction-Control (PAC) message passing with GTK+ controls. This library is used to synchronise the data-model of GUI apps, with various independent GUI elements implemented in GTK+.
Interval and range trees -- A small C++ library of interval and range trees for optimising the Mathalyzer canvas. My implementation of interval and range trees is built on top of Red-black trees. Upon rotation, the R-B tree implementation raises a rotation event. Event handlers at higher levels are responsible for applying transformations that re-establish the invariants of the interval and range trees.
OMEX -- Software that detects and extracts mathematical expressions from PDF. The pipeline is the subject of my paper with Dr. Brian Harrington. Mathalyzer was built to extend aspects of this pipeline with machine learning.
MapReduce in C++ -- I built a small C++ implementation of Google's MapReduce. The implementation is designed to abstract parallelisation of tasks using Mappers, Grouppers and Reducers on multi-core systems.