Open source and NLP Workshop - Cambridge

Organisers: Ann Copestake (Uni. Cambridge) and Aurelie Herbelot (Uni. Trento)
If you have any questions about the workshop, do not hesitate to contact us at ann copestake AT cl cam ac uk, or aurelie herbelot AT cantab net.

Abstract

There are many good reasons to open-source one's code. It gives a project visibility and attracts potential contributors who will help develop new ideas (and find bugs!). For research projects, it can be a way to test theories and implementations in real-world settings. Still, open source NLP is not as widespread as it could be, especially in terms of developing end-user applications. This workshop gathers open source practitioners and interested participants to discuss motivations, methods and strategies.

Schedule:

DAY 2: Thursday, November 10th, FW11

Thursday is the main workshop day. We will have a mix of short talks and informal discussions, split into the following sessions:

09:00 - 10:30 - Session 1: Why open source NLP?

Ann Copestake (University of Cambridge).
Open Source in NLP research

NLP research requires software and lingware, a generic term for linguistic resources, including dictionaries, (annotated) corpora, grammars, lexicons and so on. Scientific research requires that others can reproduce reported findings: lingware typically cannot be fully described or defined (unlike an algorithm), so it has to be distributed. WordNet, which was first released in 1991, is one of the best-known and most widely used examples of lingware, but its open licence (a variant of BSD) was very much the exception. Indeed, the idea of open source lingware has met considerable resistance. I'll explain why I see open source as essential for NLP research using complex lingware and how I respond to some of the counter-arguments.

Aurelie Herbelot (University of Trento).
Let meaning remain: on the necessity of open sourced AI

In recent years, computational semantics has made massive progress towards representing meaning in a machine-readable way. Concepts, including not only their linguistic but also their visual, auditory and even olfactory properties, can be expressed as mathematical vectors which have been shown to closely simulate the type of representation that an average human holds in her brain. But what is the significance of building meaning from only a handful of algorithms and corpora? In this talk, I'll show that meaning vectors are brittle representations, which can easily be broken or biased, with dramatic consequences for the user of an AI system. I will argue that such biases are not problematic as long as they are diverse and plentiful: like human meaning, machine-extracted meaning must be individual and idiosyncratic to fulfill its communicative and social functions. I'll suggest that open-sourcing the generation of conceptual representations, in an accessible and transparent fashion, is a necessary step in retaining the value of meaning in an AI world.

10:30 - 12:00 - Session 2: Turning research into open source

Diana Maynard (University of Sheffield).
20 years of Text Mining Applications with GATE: from Donald Trump to curing cancer

The GATE open source NLP toolkit has now been in continuous development for 20 years at the University of Sheffield. Originally funded by a small EPSRC research grant, it now involves a team of 12 researchers and has been downloaded by hundreds of thousands of users all over the world. Its users range from solitary research students to multinational companies and government institutions. In this talk, I will give an overview of GATE and its history, and present real-life case studies, ranging from analysing polarised opinions in online political debates (Brexit, the UK and US elections) through to finding a new cause of cancer by analysing information in the biomedical domain.

Behrang QasemiZadeh (University of Düsseldorf).
PoP for PeARS: The PPP-Hashing Technique for Vector Space Construction (work with A. Herbelot and L. Kallmeyer)

Since their inception, text-based information retrieval systems have relied on vector space mathematics to build computable models of natural language documents and queries (e.g., Salton's model, Landauer's latent semantic indexing, and even the so-called embedding technique). In this talk, we introduce yet another method for vector representation of natural language artefacts, called PoP. Further, based on the mathematical principles behind PoP, we suggest the PPP-hashing technique. Compared to locality sensitive hashing techniques, PPP-Hashing can be seen as a semantically-sensitive counterpart. We explain the advantages of using PoP and PPP-hashing for a distributed search engine such as PeARS, namely interoperability, scalability, and high discriminatory power.

12:00 - 14:00 - Lunch
Self-arranged lunch. Several cafes are available in the vicinity of the Lab.

14:00 - 15:00 - Session 3: Sharing code for open source

Nandaja Varma.
Git-ting your way to Open Source

Once you have set out on your programming journey, the one tool that is going to help you the most on the road is Git. Git is a version control tool that lets you keep coding without worrying about getting your project to others or getting others to contribute to it. What makes Git stand out from its alternatives is its powerful yet simple design and how easy it is to get started with. This session touches upon what Git is, how it manages source code, and how it helps in collaborative coding, by looking at the workflows followed by popular open source projects. By the end of this session, hopefully everyone will know enough to make their first open source contribution using Git.

Q&A session: bring your git problems!

15:30 - 17:00 - Session 4: Into the real world: end-users and community

Oeistein Andersen (University of Cambridge).
RASP

Esther Seyffarth (University of Düsseldorf).
NLP for bots, and bots for NLP

With open source NLP libraries and freely available corpora, it is easy and fun to develop creative twitterbots, which makes them a good choice for programming exercises in the context of NLP education. This session will give an introduction to the open source philosophy of the international twitterbot-making community, accompanied by examples of twitterbot developers providing or contributing to open source NLP resources.

Hrishikesh K.B. (Swathanthra Malayalam Computing).
How technology changes a language: reflections from the Malayalam Cyberspace from a Free and Open Source point of view

As time passes, every language evolves. Linguistic change is traceable through many different aspects, from the meaning of words to the writing system. This talk covers the changes that technology (with an emphasis on Free and Open Source technology) has introduced into the script of Malayalam, a language spoken in India. We will cover some historical background, technology basics (fonts, ASCII, Unicode, rendering, input methods), and the way they have affected the Malayalam script.