Systems Research Group – NetOS

Student Projects (2011—2012)

NetOS

This page collects together various Part II project suggestions from the Network and Operating Systems part of the Systems Research Group. In all cases there is a contact e-mail address given; please get in touch if you want more information about the project.

Under construction: please keep checking back, as more ideas will hopefully be added to this page during the coming weeks.

Note: there are a number of stumbling blocks in Part II project selection, proposal generation and execution. Some useful guidance from a CST alumnus (and current NetOS PhD student) is here.

Twitteropticon
Contact: Jon Crowcroft
Twitter users frequently tweet (and retweet) mentions of websites.
There are many studies of the twittersphere and the characteristics of
its inhabitants and even of the kinds of things they mention and talk
about[1], but one area largely missing is the behaviour of
people as they enter and leave twitterspace. A particular example is when
people receive a tweet and follow a URL.

This project would instrument an android phone twitter client and web
browser to track the user's behaviour - in particular, the goal is to
find out how long is spent looking at different web pages and even to
try and understand the users reaction (positive/negative, weak/strong)
to a web page by looking at movement of the phone and so on (as in the
EmotionSense work[2].

The twitter client would be released via the Android Market place, and
an experiment (with informed consent of users) to collect (anonymized
and secured) logs of user behaviour carried out. Statistical
Analysis of the logs would hopefully reveal distributions of
popularity of content, and their relationship to the tweet source or
social structure of the source. This would potentially inform people
working on opinion dynamics, and perhaps be useful for marketting as
well as public information services (e.g. earthquake warnings, or
simple systems like travel alerts) in the twitter space, which are increasingly
common.

The work would entail

1. implementing (or modifying) an open source
twitter client for Android, and writing a logging system (this could
easily be an extension of twitter itself) which would monitor the
interaction with browsers or else intercept UI events caused by the
browser and read environmental/context data from the phone's
peripherals to establish user behaviour while reading web pages.
See also
http://perscon.net/blog/2011/09/01/aethers-notebook.html
for possible starting point.

2. Building a suite of analysis tools and running some statistics

3. Writing up the work as a dissertation + potential paper for a
conference (such as ICWSM for example).

[1] http://twitter.mpi-sws.org/
[2] http://www.cl.cam.ac.uk/research/srg/netos/emotionsense/
SpringBoard
Contact: Jon Crowcroft
This project is to build a fully decentralised system for
censorship-proof anonymous communication like safebook[1].
The system would leverage the Haggle system for Opportunistic
Communication[2], and would deliver (initially) twitter-like
functionality. Optionally, the ability to transmit documents to an
anonymous dropbox [4] (think "wikileaks") could also be produced. the
system would distribute content over a disconnection tolerant cloud of
android phones using any and every opportunity for transmission along
a sequence of application layer hops. Privacy preserving forwarding
(like ToR, but using the Matryoshka scheme from facebook), plus cover
traffic (crowds) would provide anonymity. The decentralised nature of
Haggle would avoid the need (but allow it if possible) for potentially
censored infrastructure networks like the Internet or 3G.

The project would initially build an android application based on the
android haggle code, plus the safebook privacy mechanisms. Subsequent
work would deploy the system amongst local users and evaluate the
latency for the system running with and without anonymity mechanisms.
Additional evaluation could address battery usage for various cases.

A dissertation would be written, and a paper on the system could be
submitted to one of the relevant systems security conferences (perhaps
Usenix Security).

[1] http://www.safebook.us/home.html
[2] http://www.haggleproject.org/
[3] http://twitter.mpi-sws.org/
[4] A prototype for a Haggle style dropbox, called Mistify, was
implemented by an MPhil student, Karthik Nilakant
now a PHD student in the lab, currently the subject of a paper
submission. This could help accelerate the implementation work.
Antadvertisement
Contact: Jon Crowcroft
Google (and a number of other online service providers) make their money (and
a lot of it) by targeted[1] advertising. Services such as Youtube have
effectively legalised Napster by allowing uploads of copyright material by
third parties, and then post-hoc arrangements for delivery of advertising
revenue to the copyright owner.

Under the bonnet, there are several parties in the system on the web today,
not just the client and server. Client search info is sent to analytics who
produce an ordered list of answers together with links to sites + links to
advertising. The order of the answers defaults to what you get from the
pagerank algorithm [4]. However, different orderings are produced by people
paying for additional "clicks" - hence, one can promote a web site in the
result for search, or promote an advert in a sidebar, by paying more.

Google (and others') success has been that the payment for the advert occurs
when someone actually clicks on the site. This means that advertisers get to
know how succesful their advert is. In the "old world" (e.g. of commercial
broadcast tv) it was often said that an advertisement was 50% succesful, but
one could not know which 50%. A now legendary experiment was carried out by
the Guinness company decades ago in reducing its advertising voluntarily
for a period, and then analysing the fall of sales in different customer
demographics. This led to them being one of the very few people that knew the
actual value of advertisements, not just the success rate.

However, the price paid for advertising on google does not reflect the value
of business gained or lost, as it is typically set effectively by a blind
auction between competitors.

Privacy issues have received some attention (e.g. [2] and some solutions have
been proposed (e.g. privad [3]). Some of this literature illustrates the
architecture of today's systems well and makes good background.

What this project will look at is tools for advertisers to run
controlled experiments in reducing advertisement in various areas. This would
allow advertisers to establish a link between value,
clickthrough specifics and demographics region by region. This would then
allow them to set their maximum bid when paying for priority on google.

The project would build a system for ad blocking that worked together with an
analytics engine, and deploy this experimentally for some trial users in a
small city. Statistical analysis of the results would be an inherent part of
the deliverables.

A dissertation could lead to a paper to be submitted, say, to WWW.

[1] http://en.wikipedia.org/wiki/Clickthrough_rate
[2] Privacy Diffusion on the Web: A Longitudinal Perspective
Balachander Krishnamurthy and Craig E. Wills
Proceedings of the World Wide Web Conference, April 2009
[3] http://saikat.guha.cc/pub/nsdi11-privad/index.php
[4] http://en.wikipedia.org/wiki/PageRank
CIEL-o-scope
Contact: Anil Madhavapeddy
CIEL is a Turing-powerful distributed execution engine that can run code across hundreds of machines, with fault-tolerance and scheduling all taken care of by the system. It has language bindings for Scala, Java, OCaml, Haskell and Python, and additional ones are easy to add.

A cool and useful project would be to combine source code with the execution traces of the engine (which detail which bit of the code ran on which machine, when it ran, and what its inputs and outputs are). This could be implemented as a Javascript user interface, with the D3 visualisation library to render the graphs. You should have a decent grasp of Javascript and Python if you want to do this project as a Part II student, or the strong desire to learn some.
Functional Protocols
Contact: Anil Madhavapeddy
Mirage is a new operating system being built in the Computer Lab that compiles OCaml code directly into Xen microkernels, UNIX binaries, or Javascript fragments that can execute on node.js. An educational way to learn how the Internet works is to pick a protocol and implement a client and server under Mirage.

Some protocols that would come in handy are:
- Secure Sockets Layer (SSL) is the cryptographic protocol of choice on the Internet
- XMPP is the instant messaging layer that lets you talk from Facebook to Google
- IMAP is the messaging protocol used by most e-mail providers
You should know some ML, or want to learn it, to do this project.
Building Naming Scheme in NDN over V2V Communication
Contact: Eiko Yoneki
Originator/Supervisor: Eiko Yoneki/Lixia Zhang (with Karthik Nilakant)

Keywords: Named Data Networking, Content Centric Networking, Naming

This project explores an emerging new Internet architecture called "Named Data Networking" (NDN) originated by the project Content Centric Networking (CCN) by Van Jacobson in PARC. NDN provides network storage, mobility, and multi-path forwarding. In NDN, communication is driven by the data consumer who initiates sending out a packet, which contains the interest of the consumer by a hierarchically-structured name. It is similar to a URL in an HTTP GET request: the name in an Interest does not indicate the data itself, but may contain information that a receiving node interprets to dynamically generate a appropriate response. The response to an Interest is called a Data packet, which carries both the complete name and the data itself together with a digital signature created using a key of the data publisher. Interests are routed based on prefixes of their names, like routing IP packets on prefixes of their destination IP address. Data packet takes the reverse path of the Interest to be delivered to the consumers. One Interest retrieves one Data Packet, and a sequence of Interests is sent to retrieve the sequence of individually-named Data packets that make up a large piece of content such as a video.
The naming in the NDN architecture is not mature. In this project, you will explore naming schemes in NDN. The semantics of names is opaque to the network but depends on the application. The structure of names (e.g. hierarchically structured with an arbitrary number of components), discovery, and namespace navigation should be investigated and the vehicle-to-vehicle communication environment is targeted.
You can exploit location information (e.g. latitude and longitude by GPS or any geographical information that can be obtained from the map) to navigate the data flow depending on the consumers' interests. The meaning of names can be determined by local and global distribution requirements, and the content producer and consumer application conventions, as embodied in individual router's routing tables and application-level transport libraries.
You will use CCNx open source environment http://www.ccnx.org/software-download-information-request/download-releases/. If time allows, you can add a comparison study with another content centric networking paradigm Haggle from naming aspects.
References:
• Named Data Networking: http://www.named-data.org/ndn-proj.pdf
• CCNx: http://www.ccnx.org/
• NDN: http://www.named-data.org/
• Haggle: http://code.google.com/p/haggle/
Building Graph Query Function using Functional Programming
Contact: Eiko Yoneki
Keywords: Graph, Functional programming, NONSQL database

Demand to store and search of online data with graph structure is emerging. Such data range from online social networks to web links and it requires efficient query processing in scalable manner. In this project, you will build a graph query function (layer) to achieve efficient graph data processing. The graph query function builds on a lazy graph loaded from multiple sources and performs queries at the same speed or faster than a relational database. The function should be written in Clojure or another functional language to support practical lazy loading from data source to an in-memory graph database. Efficient representations for graph queries should be also investigated. The evaluation includes comparisons with existing Graph Database and the query capability comparing with SQL. You can start the project by using our on-going work on this topic Crackle.
References:
-Microsoft Research, Trinity project: Distributed graph database, http://research.microsoft.com/en-us/projects/trinity/
-Neo Technology, Java graph database, http://neo4j.org/
Sync-Ball
Contact: Eiko Yoneki
Keywords: Mobile Agent, Opportunistic networks, Haggle

In this project, you will build function/mechanism for migrating data objects over the Haggle platform [1]. Haggle is built over a "layerless" networking architecture that incorporates event-driven and asynchronous operations, which reside in the device as a kernel component. Functional components implement the functional logic and interact only directly with the kernel (see [2] for further details about the Haggle implementation). A data object is an object defined in Haggle as an entity which can move between Haggle equipped devices. Each data object contains metadata, consisting of name and attribute pairs. Metadata operations like search or forwarding depend on the size and complexity of the network.
Individual nodes in Haggle (e.g. smart phones) are powerful enough to perform substantial amounts of computing. An example of computation would be to pass around a "Sync-Ball" from node to node as they meet, and into which any node may stick the result of some computation. The main task of the "Sync Ball" project is to design a data object which is able to operate at a Haggle node, collect/append data, and then migrate to other devices. Haggle already has an Android platform implementation, and Sync-Ball can be developed over it. A challenge is that the data object should be designed for multiple purposes and simple programming languages for such data object is desired or design the function as part of database like operation. For example, you can design a migrating (distributed) Excel, where computations are a set of formulae each computed over a column, and each node may append zero or more rows when it has the ball.
If time allows, you can add parallel operation of data objects and writing also runtime to support it. It might also be interesting to consider issues of data integrity and availability (through redundancy). Potential applications for Sync-Ball:
-Synchronisation among devices: information to be synchronised is in the data object, devices receives/updates the data then sent out to the other devices.
-Twitter entries, which are accumulated through the migration within Haggle name space until the data object is sent out to the Internet based twitter.
-Environmental monitoring object: dispatch sync-ball gathering Co_2 values from various sensor devices in the area.
References:
[1] EU FP6 Haggle Project
[2] Erik Nordstrom, Per Gunningberg, and Christian Rohner. A Search-based Network Architecture for Mobile Devices. Tech Report Uppsala University, 2009-003.
[3] E. Yoneki, I. Baltopoulos and J. Crowcroft " D^3N: Programming Distributed Computation in Pocket Switched Networks", ACM MOBIHELD at SIGCOMM, 2009.
[4] Xerox PARC's Bayou Project
Recommender Systems for Privacy Settings
Contact: Daniele Quercia
On social media sites, concealment and disclosure are key to meet a user's individual need for privacy. In theory, striking the right balance between concealment and disclosure takes an extraordinary amount of knowledge and judgment. In (social networking) practice, this translates into fiddling with privacy settings and applying restrictions on what can be viewed, and by whom. This project would design new ways of recommending default privacy settings based on existing recommender system algorithms. These algorithms will be quantitively evaluated upon already available Facebook data.
Personality and Use of Social Media
Contact: Daniele Quercia
Psychological personality has been shown to affect a variety of aspects: preferences for interaction styles in the digital world and for music genres, for example. Consequently, the design of personalized user interfaces and music recommender systems might benefit from understanding the relationship between personality and use of social media. This project would evaluate machine learning algorithms that predict individuals' personality traits based on social media use [1]. These algorithms will be evaluated upon already available Facebook and Twitter data and upon to-be-crawled Foursquare data.
[1] Automatic Recognition of Personality in Conversation. HLT-NAACL 2006.
Me, You, and Our Special Friend: The social dynamics of triangles
Contact: Daniele Quercia
Triadic relationships consist of two people with one common friend. It has been shown that if the common friend is female, the relationship tends to be strong; by contrast, if the common friend is male, the relationship is more likely to break [1]. This holds not only in the offline world, but also in the online worlds of Twitter and Facebook. This project is about evaluating machine learning algorithms (including ERGM models) that predict which Facebook triads are likely to decay and ultimately disappear, and which triads are likely to persist.
[1] Gender clustering in friendship networks: some sociological implications. Methodological Innovations Online. 2009.
[2] Maintaining Ties on Social Media Sites: The Competing Effects of Balance, Exchange, and Betweenness. AAAI 2011.
Influencers and Subordinates: Predicting power relations in Twitter
Contact: Daniele Quercia
Based on a macro analysis of the Twitter graph, researchers claimed that "Twitter is not a social network but a news media"[1]. While a (macro) graph analysis suggests that Twitter is not a social network, recent (micro) studies counsel caution. These studies went beyond a graph analysis and analyzed the activity of individual Twitter profiles (i.e., who replies to whom, which relationships are mutual, which tweets are passed on) and found that the social media site can be classified as a set of real communities according to well-established sociological definitions of community. This project would study social power relationships in Twitter [2]. More specifically, this project would use machine learning techniques (including SVM and L-LDA) to predict social power relationships based on Twitter direct messages.
[1] What is Twitter, a social network or a news media? WWW 2010.
[2] Extracting Social Power Relationships from Natural Language. ACL 2011.
Predicting "Gross Community Happiness" from Tweets
Contact: Daniele Quercia
Policy makers are calling for new socio-economic measures that reflect subjective well-being, to complement traditional measures of material welfare as the Gross Domestic Product (GDP). Self-reporting has been found to be reasonably accurate in measuring one's well-being and conveniently tallies with sentiment expressed on social media (e.g., those satisfied with life use more positive than negative words in their Facebook status updates). Social media content can thus be used to track well-being of individuals. A question left unexplored is whether such content can be used to track well-being of entire physical communities as well. To this end, this project would: consider Twitter and Foursquare activities of London census communities; study the relationship between these activities (e.g., sentiment [1], space segmentation [2,3]) and community socio-economic well-being; and finally predict well-being based on these activities using machine learning algorithms.
[1] An unobtrusive behavioral model of "gross national happiness". CHI 2010.
[2] A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 2006.
[3] Eigenplaces: Segmenting Space through Digital Signatures. IEEE Pervasive Magazine 2010.
Differences in the use of emotional words across topics
Contact: Daniele Quercia
Conventional wisdom holds that different kinds of information are associated with different uses of emotion words. For example, political discussions tend to be emotionally charged, while discussions pertaining to hermeneutics tend to be less so. This assertion makes intuitive sense but is hard to quantify. This project proposes to study this issue on Twitter, analysing the ways in which tweets of different topics use emotion words differently (in a way similar to [1]). We will do so by combining existing algorithms of sentiment analysis and topic modelling.
[1] Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter. WWW 2011.
Recommending Friends and Foes on Facebook
Contact: Daniele Quercia
This project would design and evaluate Facebook recommender algorithms for suggesting people one may want to befriend and people one may want to "unfriend". Existing data of online activities will be processed by a variety of algorithms that are based on social network theories of geographical proximity and of link prediction [1], and a quantitative evaluation of these algorithms would be required.
[1] FriendSensing: Recommending Friends Using Mobile Phones. RecSys 2009.

Department of Computer Science and Technology

Student Projects (2011—2012)

NetOS