Project Suggestion from Martin Kleppmann

I think there are several interesting potential projects around Apache Samza, an open source stream processing system. As it is such a new technology, many of the open problems are quite tractable, so a student could contribute significantly within the scope of a Part II project.

As the bug tracker links below show, these are real projects that people out there would find useful. When your project is complete, you can contribute your code to the open source project if you want, so that it lives on and is maintained by the community (and it will look good on your CV).

Samza is written in Scala and Java, so you would probably need to use one of those languages.

Proposal

High-volume streams of events are becoming widespread: sensor data from the internet of things, activity events from social media, and monitoring events for fraud detection, to mention just a few. However, the tools we have today for processing such streams fall short: many systems either do not scale well enough or are too complicated to work with.

Apache Samza (http://samza.incubator.apache.org/) is an open source system, developed at LinkedIn, that aims to improve on this. It scales to very large streams by distributing stream processing across multiple machines and handling failures automatically. There are several open problems with Samza that would make good Part II projects:

- Implement a high-level query language for streams, e.g. StreamSQL, on top of Samza (https://issues.apache.org/jira/browse/SAMZA-390). Evaluate the pros and cons of different query languages.

- Implement change data capture for MySQL (https://issues.apache.org/jira/browse/SAMZA-200) or PostgreSQL (https://issues.apache.org/jira/browse/SAMZA-212), which would allow users of an existing database to efficiently get their data into a stream format. Evaluate the performance.

- Implement hot standby for Samza (https://issues.apache.org/jira/browse/SAMZA-406), allowing the system to continue processing without interruption, even when a machine dies. Evaluate by intentionally killing processes and observing how the system responds.

- Implement a network interface for Samza jobs (https://issues.apache.org/jira/browse/SAMZA-316), allowing random-access queries to be mixed with stream processing. Evaluate by implementing a Twitter-like newsfeed that can push out status updates in real time.
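
To give a flavour of what the query language project involves: a streaming query along the lines of "count events per key in each one-minute window" ultimately has to be turned into code that handles events one at a time. The following is a minimal sketch in plain Java. It deliberately does not use Samza's API, and the class name, the Event fields and the window size are all assumptions made for illustration, not part of any existing design.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindowCount {

    /** A timestamped event, e.g. one page view by one user (hypothetical schema). */
    public static final class Event {
        final long timestampMs;
        final String key;

        public Event(long timestampMs, String key) {
            this.timestampMs = timestampMs;
            this.key = key;
        }
    }

    /**
     * Counts events per key within fixed, non-overlapping ("tumbling") windows.
     * Returns a map from window start time to (key -> count), ordered by window start.
     */
    public static Map<Long, Map<String, Integer>> countByWindow(List<Event> events, long windowSizeMs) {
        if (windowSizeMs <= 0) {
            throw new IllegalArgumentException("windowSizeMs must be positive");
        }
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (Event e : events) {
            // Round the timestamp down to the start of the window it falls into.
            long windowStart = Math.floorDiv(e.timestampMs, windowSizeMs) * windowSizeMs;
            // Increment the counter for this key within that window.
            result.computeIfAbsent(windowStart, w -> new TreeMap<>())
                  .merge(e.key, 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event(1_000L, "alice"),
            new Event(2_000L, "bob"),
            new Event(59_999L, "alice"),
            new Event(60_000L, "alice"),
            new Event(125_000L, "bob"));
        // Print per-key counts for each one-minute window.
        System.out.println(countByWindow(events, 60_000L));
    }
}
```

This sketch processes a finite list held in memory. A real implementation on top of Samza would have to handle unbounded input, keep its counters in fault-tolerant state, and decide when a window is complete, which is where much of the interesting project work lies.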