MPhil, Part III, and Part II Project Suggestions (20142015)

Please contact Eiko Yoneki (email: eiko.yoneki@cl.cam.ac.uk) if you are interested in any project below. 1.
Clustering Entities across Multiple
Documents in Massive Scale Originator/Supervisor:
Eiko Yoneki Keywords: Clustering, Graph Partitioning, Random
Walk, Distributed Algorithms Many largescale distributed problems, including the
optimal storage of large sets of graph structured data over several hosts, are
crucial for today’s Cloud infrastructure. However, in very largescale
distributed scenarios, stateoftheart algorithms are not directly applicable,
because frequent global operations over the entire graph are difficult. In [1],
balanced graph partitioning is achieved by a fully distributed algorithm, called
JabeJa, which uses local search and simulated annealing techniques for graph
partitioning. The algorithm is massively parallel: each node is processed
independently, and only the direct neighbours of the nodes and a small subset of
random nodes in the graph need to be known locally. Strict synchronisation is
not required. These features allow JabeJa to be easily adapted to any
distributed graph processing system. This will involve the understanding of
JabeJa as a starting point, and investigate further performance improvements.
A case study: a graphbased approach to coreference resolution, where a graph
representation of documents and their context is used, and applying a community
detection algorithm based on [1] can speed up the task of coreference resolution
by a very large degree. References: [1] Fatemeh Rahimian, Amir H. Payberah, Sarunas Girdzijauskas, Mark
Jelasity and Seif Haridi: JabeJa: A Distributed Algorithm for Balanced
Graph Partitioning, IEEE International Conference on SelfAdaptive and
SelfOrganizing Systems (SASO), 2013. [2] Fatemeh Rahimian, Amir H. Payberah, Sarunas Girdzijauskas, and Seif
Haridi: Distributed VertexCut Partitioning, DAIS, 2014. [3] Fatemeh Rahimian:
GossipBased Algorithms for Information Dissemination and Graph Clustering,
PhD Thesis, 2014.
2. Sampling and Approximation for Triangle Counting in
Massive Graph Computation Originator/Supervisor:
Eiko Yoneki Keywords: Sampling, Approximation, Triangle Counting,
Cluster Coefficient, Big Data Graphs are becoming important to analyse the data.
Two key metrics used to characterise a graph are its triangle count (TC) and its
clustering coefficient (CC). Both metrics give an intuition of the community
structure. When the data volume gets larger or the data rate becomes higher, we
need to sample/approximate for speeding up data processing. We have extended the
methods presented in [1] to allow the approximation of the TC and CC of graphs
stored in external memory on a single machine. These methods are based on wedge
sampling. A wedge is a path of length 2: a pair of edges sharing a vertex. As is
often the case for graph computations, the algorithms we use show a lack of
locality in memory accesses. Key to our approach is the parallel prefetching of
memory accesses which will be needed in the future. This project addresses this
parallel prefetching, where you need a smart indexing system, in which the graph
is stored in such indexed form (e.g. CSR). The project will fully explore
sampling and approximation techniques and investigate the format of the graph
data, including reading the data in streaming manner such as described in
XStream [2]. As an extension, local TC and CC could be explored. References: [1] C. Seshadhri, A. Pinar, and T. G. Kolda.
Triadic measures on graphs: the power of
wedge sampling. CoRR, abs/1202.5230, 2013. [2] Amitabha Roy, Ivo Mihailovic, Willy Zwaenepoel: XStream: Edgecentric Graph Processing using Streaming Partitions. SOSP 2013
3. Approximate Algorithms Determining
Local Clustering Coefficients Anonymously Originator/Supervisor:
Eiko Yoneki
(with Amitabha Roy (EPFL)) Keywords: Sampling, Approximation, Privacy, Cluster Coefficient Anonymous social networks are a new phenomenon in an increasingly privacy conscious world. A natural question to ask in this setting is whether we can continue to apply known principles of network science in such settings where neighbourhood information is concealed both between nodes and to external observers. This project is to work on approximate algorithms that determines clustering coefficient in such settings. Clustering coefficients provide a way to relatively order nodes by importance in a network, determine densely connected neighbourhoods and distinguishing social graphs from web graphs. Algorithms to measure clustering coefficients have hitherto required global knowledge of graph links. This requires individual nodes in the graph to give up the identity of their neighbours. This project investigates an algorithm for estimating the clustering coefficient of a vertex by exchanging only anonymised set summaries of neighbours, from which it is difficult to reverse engineer individual links. The bulk of the work will consist of working on and improving sampling techniques to arrive at accurate local clustering coefficients without exchanging explicit neighbour lists in social networks. [1] P. Flajolet, ric Fusy, O. Gandouet, and et al.
Hyperloglog: The analysis of a nearoptimal cardinality estimation algorithm. In
Proceedings of the International Conference on Analysis of Algorithms, 2007.
4. Graph Compression Originator/Supervisor:
Eiko Yoneki
(with Weixiong Rao) Keywords: Graph Compression This project explores graph compression mechanisms as
part of a project looking into high performance semiexternal memory graph
processing (see [1] to get an idea of semiexternal memory approaches). The
graph compression work will build on an inhouse graph representation format
that we have developed that allows succinct representation of graphs that show
hierarchical structures. The project will explore ways to improve the
representation yielding smaller graphs on disk that are less costly to traverse.
A key element of the representation scheme is a recursive graph partitioning
step that minimises the number of edges between partitions. This is a rich space
for exploration of suitable algorithms. We are concerned primarily with
experimentally evaluating I/O costs on large graphs and measuring the
compression performance. However a student with a theoretical background might
consider designing algorithms with provable bounds on compression performance,
which would be a big plus. If time allows you could also implement an efficient
transformation tool based on the developed graph compression algorithm using
parallel processing tools (e.g. map/reduce). References: [1] R. Pearce, M. Gokhale and N. Amato:
Multithreaded Asynchronous Graph
Traversal for InMemory and SemiExternal Memory, ACM/IEEE High Performance
Computing, Networking, Storage and Analysis, 2010.
http://dl.acm.org/citation.cfm?id=1884675 5. Develop ScaleOut SSD based Graph Traversal
Platform Originator/Supervisor:
Eiko Yoneki
(with Karthik Nilakant) Keywords: Graph processing, I/O optimisation, Graph
structure This project contributes to our ongoing work on high
performance semiexternal memory graph processing (see [1]). We have developed
an inhouse prefetching algorithm for mining large graphs stored on Solid State
Drives. A solid state drive provides the ability to service many random reads in
parallel, provided that the requesting application is able to deliver requests
at a rate that keeps the drive sufficiently busy. Thus prefetching multiple
pages based on the current request in an “intelligent” manner can greatly
improve performance. Our current prefetcher implementations are aligned with
the graph traversal approach, (that is the “graph iterator” being employed). Furthermore, the prefetcher implementation could be
extended to utilise multiple storage devices in parallel. The aim in this
setting would be to outperform standard storage parallelisation techniques such
as hardware or software RAID. The performance of this approach would also depend
on the method of dividing graph data between available storage devices. Your
task is developing scaleout version of our current work. You could also extend
it in distributed graph traversals to study how a large graph (e.g. over 2
billion vertices) could get benefits from scaleout version, even though the
current version can handle large graphs as a single node. You could also explore
generating a large graph for testing by combining various entities of social
network data (e.g. what happens when every "Like", "Comment" and "Photo" on
Facebook are treated as vertices inside a graph  not just "People" and "Pages"
 such graphs would be huge). References: [1] E. Yoneki and A. Roy: Scaleup Graph Processing:
A Storagecentric View. ACM SIGMOD  GRADES, 2013. [2] GraphLab:
http://graphlab.org/
6. Building Graph Query Function using Functional
Programming Originator/Supervisor: Eiko Yoneki (with Karthik
Nilakant) Keywords: Graph, Functional programming Demand to store and search of online data with graph
structure is emerging. Such data range from online social networks to web links
and it requires efficient query processing in a scalable manner. In this
project, you will build a graph query function (layer) to achieve efficient
graph data processing. The graph query function builds on a lazy graph loaded
from multiple sources and performs queries at the same speed or faster than a
relational database. The function should be written in Clojure or another
functional language to support practical lazy loading from data source to an
inmemory graph database. Efficient representations for graph queries should be
also investigated. The evaluation includes comparisons with existing Graph
Database and the query capability comparing with SQL. You can start the project
by our existing work in this area. References: [1] Microsoft Research:
Trinity project: Distributed graph
database,
http://research.microsoft.com/enus/projects/trinity/ [2] Neo Technology,
Java graph database,
http://neo4j.org/ 7. RasPiNet: Building Stream Data Processing
Platform over RasPiNET Originator/Supervisor:
Eiko Yoneki Keywords: Raspberry Pi, Delay Tolerant Networks,
Satellite Communication, Stream Processing We have built a decentralised Raspberry Pi network
(RasPiNET [1]), which can be deployed in wild and remote regions as a standalone
network. The gateway Raspberry Pi nodes are integrated with satellite
communication devices, where the light version of Delay Tolerant Network (DTN)
bundle protocol is embedded. RasPiNET could consist of 10100 nodes. As an
example, a remote sensing application could be written either in RasPi or Smart
phones that can connect to RasPi. Collected data could be processed within
RasPiNET to reduce data size that streams over the satellite communication to
the base location. The crowd sourcing application can also run on top of
RasPiNET. The goal of this project is to build a stream processing platform in
both directions: from data collection from RasPiNET nodes to the data processing
nodes possibly via a satellite gateway and from bulk of data delivery to the
satellite gateway node to disseminate necessary information to RasPiNET nodes. A
good filtering function and RasPiNET innetwork data aggregation could be
developed. References: [1] E. Yoneki:
RasPiNET: Decentralised Communication and Sensing Platform with Satellite
Connectivity. ACM CHANTS, 2014. [2] Delay Tolerant Network Bundle Protocol:
http://tools.ietf.org/html/rfc6255 [3] RockBlock Technology:http://rockblock.rock7mobile.com/ 8. RaspberryBSP: Cheap and Safe Bulk
Synchronous Processing on Wimpy Nodes Originator/Supervisor:
Eiko Yoneki
(with Karthik Nilakant) Keywords: Graph Processing, Data Parallel, Bulk
Synchronous Processing This project is inspired by FAWN [1] and aims to
replicate the benefit of an array of low power processors backed by persistent
storage to graph mining. The aim is to build a software stack for RaspberryPI
that allows a set of such devices, each equipped with either an SD card or a USB
flash drive to act as a node for Bulk Synchronous Processing of graphs (see
Pregel [2] for an example of bulk synchronous processing). All the necessary
data structures will reside on Flash with the small 256MB RAM on the
RaspberryPI acting as a scratchpad. The aim will be to show that when
processing time, energy consumption and cost are taken together this solution is
competitive to running BSP on a cluster of PCs. References: [1] D. Andersen, J. Franklin, A. Phanishayee, L. Tan
and V. Vasudevan: FAWN: A Fast Array of
Wimpy Nodes, Communications of the ACM, July 2011. [2] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I.
Horn, N. Leiser, and G. Czajkowski:
Pregel: A System for LargeScale Graph Processing, SIGMOD, 2010.
9. Building
Dynamic TimeDependent Multicast Tree in Twitter Networks Originator/Supervisor:
Eiko Yoneki
(with Valentin Dalibard) Keywords: Joint Diagonalisation, Multicast Tree,
TimeDependent Spread Mode, Content Distribution Simulation, Twitter The content diffusion occurs spontaneously in online
social networks based on the network topology built by the followers, the
content cascades. This project aims at investigating the characteristics of such
dynamic and temporal cascading trees, which may appear as multiple different
spanning trees even from the same source node depending on the contents or
depending on the time. The ultimate goal of the project is building multicast
trees based on aggregating such spanning trees to accelerate the cascade process
and potentially providing a more efficient content delivery process. The methodology that extracts such spanning trees can
be obtained from our previous work on Joint diagonalisation (JD) [1]. JD is a
technique used to estimate an average eigenspace of a set of matrices. JD on
matrices of spanning trees of a network extracts multiple modes. Note that there
is no single underlying static graph in most of real world networks. The average
eigenspace may be used to construct a graph which represents the ‘average
spanning tree’ of the network or a representation of the most common propagation
paths. Examining the distribution of deviations from the average reveals this
distribution is multimodal; thus indicating several modes in the underlying
network. These modes are identified and are found to correspond to particular
times. You can explore this project in two ways: 1)
modelling multicast tree from aggregation of spanning tree in more theoretical
manner or 2) using OSN data and demonstrate the extracted tree and efficiency of
multicast tree for content dissemination in content distribution network
simulation. References: [1] D. Fay, J. Kunegis, and E. Yoneki: Centrality and
Mode Detection in Dynamic Contact Graphs; a Joint Diagonalisation Approach.
IEEE/ACM ASONAM, 2013 [2] Meeyoung Cha, Hamed Haddadi, Fabrício Benevenuto,
P. Krishna Gummadi: Measuring User Influence in Twitter: The Million Follower
Fallacy. ICWSM 2010. [3] Evan T.R. Rosenman: Retweets, but Not Just
Retweets: Quantifying and Predicting Influence on Twitter. Thesis, Harvard
University. Contact EmailPlease email to eiko.yoneki@cl.cam.ac.uk for any question. 