Systems Research Group – NetOS

ACS Projects (2015—2016)

NetOS

This page collects together various ACS project suggestions from the Network and Operating Systems part of the Systems Research Group. In all cases there is a contact e-mail address given; please get in touch if you want more information about the project.

Under construction: please keep checking back, as more ideas will hopefully be added to this page during the coming weeks.

Contents

1. Hardware-assisted generation garbage collection with CHERI
2. Hardware page-table walker for BERI
3. OS Support for Garbage Collection
4. Dynamic Task Scheduling on Heterogeneous CPU/GPU Environment using Machine Learning Technique for Parallel Processing
5. Clustering Entities across Multiple Documents in Massive Scale
6. Approximate Algorithms Determining Local Clustering Coefficients Anonymously
7. RasPi-Net: Building Stream Data Processing Platform over RasPiNET
8. Graph Compression in the Semi-External Memory Environment
9. Scale-Out SSD based Graph Traversal Platform
10. Building Graph Query Function using Functional Programming
11. Fun with Bus Data
12. Packet Capture on MirageOS
13. Exploring Architectures for SDN Switching
14. Exploring 100Gb/s Datapath Architectures
15. Understanding the performance of consensus systems
More Systems Projects at the DTG Project Page

1. Hardware-assisted generation garbage collection with CHERI

Contact: David Chisnall

Prerequisites: Knowledge of C, graph theory, virtual memory, MIPS assembly

Sun Labs had a project (Project Maxwell) that identified that the objects in the young generation for a generational garbage collector and the objects in the cache are similar. This project added a full object memory to SPARC and allowed the young-generational collector to run entirely in the cache.

CHERI adds a capability-oriented view of virtual memory to the MIPS ISA. With CHERI, it is possible for the hardware to distinguish between pointers (capabilities to memory) and other data. A previous project wrote a purely software garbage collector that worked for C.

It should be possible to combine these two approaches into a single system that would provide generation garbage collection. The basic idea would be a young generation implemented as a semi-space collector in SRAM in a range of the virtual address space. When a capability to this is stored in the cache, it would be marked as a root. When a cache line containing a capability to such a location is flushed to main memory, it would trap to software, which would promote the object to an older generation.

The young generation could be collected entirely asynchronously.

2. Hardware page-table walker for BERI

Contact: David Chisnall

BERI implements the MIPS R4K system-level interface, which mandates a software-managed TLB. When the CPU can not find an entry in the TLB, it raises an interrupt and the OS is responsible for installing the relevant TLB entry.

The advantage of this approach is that it makes it much easier to experiment with different page table designs - very important when the R4K was released in the early 1990s and virtual memory was still a very active research topic. Unfortunately, it also provides several disadvantages:

TLB fills must happen synchronously with respect to the main pipeline - the TLB can’t be speculatively filled waiting for a future load.
The TLB-handling interrupt routine consumes instruction cache space.
Entering the TLB-fill routine causes a pipeline flush.

A successful implementation will provide a hardware walker that can inspect the FreeBSD page table format and automatically fill the TLB if there is a page ready. As an extension, you could make the TLB-fill logic programmable so that it can support multiple page table formats, or hard-code some others (for example, an inverted page table).

Evaluation should show whether a hardware implementation is faster for a simple in-order pipeline (BERI) by running a variety of workloads with and without the hardware support.

Note: This project requires familiarity with Bluespec SystemVerilog and so is probably best suited to a Part III / ACS student taking the Advanced Computer Design course.

3. OS Support for Garbage Collection

Contact: David Chisnall

Prerequisites: A good knowledge of C and the ability to work with concurrent data structures using fine-grained locking

Microsoft Windows provides an API for receiving notifications related to which pages have been modified. This is used in the .NET runtime, but similar interfaces are lacking on other systems. This project will involve modifying the FreeBSD virtual memory subsystem to provide an API that allows garbage collectors to query this data and modifying the Boehm collector to use it.

The Boehm collector provides a platform-independent mechanism for retrieving a list of dirty pages, along with multiple implementations (the Windows API, using mmap() to mark pages as read-only and catching the faults, and a few others) so these changes will be relatively small. The OS changes will be more significant. Some important considerations include:

The OS uses the dirty bit to identify pages that are candidates for swapping, so its notion of a dirty page is not the same as the garbage collector’s.
Programs use multiple threads. As a first approximation, it would be acceptable to only query for dirty bits after calling pthread_suspend_all_np().
For correctness in a garbage collector, it is acceptable to provide a superset of the modified pages - scanning a page that is not modified only hurts performance - but missing a write is a serious problem.

The implementation will most likely involve adding a counter to each page that is incremented when the page moves from clean to dirty status and querying a range of pages to identify whether their counters have incremented since a previous call (make sure you handle overflow in the counters sensibly! This can be just by providing a ‘reset all counters’ API to userspace and).

Evaluation should involve running the modified Boehm collector on some benchmarks and determining whether it provides better performance. If it doesn’t, then evaluation should describe what the overheads were that offset the speedup.

4. Dynamic Task Scheduling on Heterogeneous CPU/GPU Environment using Machine Learning Technique for Parallel Processing

Originator/Supervisor: Eiko Yoneki (with Anton [Javascript required])

Keywords: GPU Clusters, Heterogeneous many/multi-core, Parallel Computing, OpenCL, Task Scheduling

In this project, various aspects of parallel processing will be explored using a new generation of CPU/GPU integrated board, where more than one GPU clusters are placed on a chip. We use ARM based Mali-T628 MP6, in Exynos 5422 [1] [2], which has two clusters, of four (MP4) and two cores (MP2). Using OpenCL, tasks can be dispatched to GPU and CPU code in parallel. This new GPUs makes it possible to cluster the GPU nodes for different scale of parallel processing. GPUs offer a much higher hardware thread count and have access to higher memory bandwidth. We use a simulator on top of the hardware to experiment various task scheduling strategies explored by the machine learning methodologies for prediction of workload, vector instructions, and mixture of model parallelism and data parallelism.

Application running on top could be image analysis or irregular graph processing. Graph processing can take advantage of processor heterogeneity to adapt to structural data patterns. The overall aim of graph processing can be seen as scheduling irregular tasks to optimise data-parallel heterogeneous graph processing, by analysing the graph at runtime and dispatching graph elements to appropriate computation units. Efficient scheduling underlies the vision of a heterogeneous runtime platform for graph computation, where a data-centric scheduler is used to achieve optimal workload.

[1] http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored

[2] www.samsung.com/global/business/semiconductor/product/application/detail?productId=7978&iaId=2341

5. Clustering Entities across Multiple Documents in Massive Scale

Originator/Supervisor: Eiko Yoneki

Keywords: Clustering, Graph Partitioning, Random Walk, Distributed Algorithms

Many large-scale distributed problems including the optimal storage of large sets of graph structured data over several hosts - a key problem in today's Cloud infrastructure. However, in very large-scale distributed scenarios, state-of-the-art algorithms are not directly applicable, because frequent global operations over the entire graph are difficult. In [1], balanced graph partitioning is achieved by a fully distributed an algorithm, called Ja-be-Ja, that uses local search and simulated annealing techniques for graph partitioning annealing techniques for graph partitioning. The algorithm is massively parallel: each node is processed independently, and only the direct neighbours of the nodes and a small subset of random nodes in the graph need to be known locally. Strict synchronisation is not required. These features allow Ja-be-Ja to be easily adapted to any distributed graph processing system. This project starts understanding Ja-be-Ja as a starting point, and investigates further performance improvement. A case study: a graph-based approach to coreference resolution, where a graph representation of the documents and their context is used and applying a community detection algorithm based in [1] can speed up the task of coreference resolution by a very large degree.

[1] Fatemeh Rahimian, Amir H. Payberah, Sarunas Girdzijauskas, Mark Jelasity and Seif Haridi: JabeJa: A Distributed Algorithm for Balanced Graph Partitioning, IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO), 2013.

[2] Fatemeh Rahimian, Amir H. Payberah, Sarunas Girdzijauskas, and Seif Haridi: Distributed Vertex-Cut Partitioning, DAIS, 2014.

6. Approximate Algorithms Determining Local Clustering Coefficients Anonymously

Originator/Supervisor: Eiko Yoneki (with Amitabha Roy)

Keywords: Sampling, Approximation, Privacy, Cluster Coefficient

Anonymous social networks are a new phenomenon in an increasingly privacy conscious world. A natural question to ask in this setting is whether we can continue to apply known principles of network science in such settings where neighbourhood information is concealed both between nodes and to external observers. This project is to work on approximate algorithms that determines clustering coefficient in such settings. Clustering coefficients provide a way to relatively order nodes by importance in a network, determine densely connected neighbourhoods and distinguishing social graphs from web graphs. Algorithms to measure clustering coefficients have hitherto required global knowledge of graph links. This requires individual nodes in the graph to give up the identity of their neighbours. This project investigates an algorithm for estimating the clustering coefficient of a vertex by exchanging only anonymised set summaries of neighbours, from which it is difficult to reverse engineer individual links. The bulk of the work will consist of working on and improving sampling techniques to arrive at accurate local clustering coefficients without exchanging explicit neighbour lists in social networks.

[1] P. Flajolet, Eric Fusy, O. Gandouet, and et al. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Proceedings of the International Conference on Analysis of Algorithms, 2007.

7. RasPi-Net: Building Stream Data Processing Platform over RasPiNET

Originator/Supervisor: Eiko Yoneki

Keywords: Raspberry Pi, Delay Tolerant Networks, Satellite Communication, Stream Processing

We have built a decentralised Raspberry Pi network (RasPiNET [1]), which can be deployed in wild and remote regions as a standalone network. The gateway Raspberry Pi nodes are integrated with satellite communication devices, where the light version of Delay Tolerant Network (DTN) bundle protocol is embedded. RasPiNET could consist of 10-100 nodes. As an example, a remote sensing application could be written either in RasPi or Smart phones that can connect to RasPi. Collected data could be processed within RasPiNET to reduce data size that streams over the satellite communication to the base location. The crowd sourcing application can run on top of RasPiNET, too. The goal of this project is building a stream processing platform in both directions: from data collection from RasPiNET nodes to the data processing nodes possibly via a satellite gateway and from bulk of data delivery to the satellite gateway node to disseminate necessary information to RasPiNET nodes. A good filtering function and RasPiNET in-network data aggregation could be developed.

References:

[1] E. Yoneki: RasPiNET: Decentralised Communication and Sensing Platform with Satellite Connectivity. ACM CHANTS, 2014.

[2] Delay Tolerant Network Bundle Protocol: http://tools.ietf.org/html/rfc6255

[3] RockBlock Technology:http://rockblock.rock7mobile.com/

8. Graph Compression in the Semi-External Memory Environment

Originator/Supervisor: Eiko Yoneki

Keywords: Graph Compression

This project explores graph compression mechanisms as part of a project looking into high performance semi-external memory graph processing (see [1] to get an idea of semi-external memory approaches). The graph compression work will build on an in-house graph representation format that we have developed that allows succinct representation of graphs that show hierarchical structures. The project will explore ways to improve the representation yielding smaller graphs on disk that are less costly to traverse. A key element of the representation scheme is a recursive graph partitioning step that minimises the number of edges between partitions. This is a rich space for exploration of suitable algorithms. We are concerned primarily with experimentally evaluating I/O costs on large graphs and measuring the compression performance. However a student with a theoretical background might consider designing algorithms with provable bounds on compression performance, which would be a big plus. If time allows you could also implement an efficient transformation tool based on the developed graph compression algorithm using parallel processing tools (e.g. map/reduce).

References:

[1] R. Pearce, M. Gokhale and N. Amato: Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory, ACM/IEEE High Performance Computing, Networking, Storage and Analysis, 2010. http://dl.acm.org/citation.cfm?id=1884675

9. Scale-Out SSD based Graph Traversal Platform

Originator/Supervisor: Eiko Yoneki

Keywords: Graph processing, I/O optimisation, Graph structure

This project contributes to our ongoing work on high performance semi-external memory graph processing (see [1]). We have developed an in-house prefetching algorithm for mining large graphs stored on Solid State Drives. A solid state drive provides the ability to service many random reads in parallel, provided that the requesting application is able to deliver requests at a rate that keeps the drive sufficiently busy. Thus pre-fetching multiple pages based on the current request in an intelligent manner can greatly improve performance. Our current pre-fetcher implementations are aligned with the graph traversal approach, (that is the graph iterator being employed).

Furthermore, the prefetcher implementation could be extended to utilise multiple storage devices in parallel. The aim in this setting would be to out-perform standard storage parallelisation techniques such as hardware or software RAID. The performance of this approach would also depend on the method of dividing graph data between available storage devices. Your task is developing scale-out version of our current work. You could also extend it in distributed graph traversals to study how a large graph (e.g. over 2 billion vertices) could get benefits from scale-out version, even though the current version can handle large graphs as a single node. You could also explore generating a large graph for testing by combining various entities of social network data (e.g. what happens when every "Like", "Comment" and "Photo" on Facebook are treated as vertices inside a graph -- not just "People" and "Pages" - such graphs would be huge).

References:

[1] E. Yoneki and A. Roy: Scale-up Graph Processing: A Storage-centric View. ACM SIGMOD - GRADES, 2013.

[2] GraphLab: http://graphlab.org/

10. Building Graph Query Function using Functional Programming

Originator/Supervisor: Eiko Yoneki

Keywords: Graph, Functional programming

Demand to store and search of online data with graph structure is emerging. Such data range from online social networks to web links and it requires efficient query processing in a scalable manner. In this project, you will build a graph query function (layer) to achieve efficient graph data processing. The graph query function builds on a lazy graph loaded from multiple sources and performs queries at the same speed or faster than a relational database. The function should be written in Clojure or another functional language to support practical lazy loading from data source to an in-memory graph database. Efficient representations for graph queries should be also investigated. The evaluation includes comparisons with existing Graph Database and the query capability comparing with SQL. You can start the project by our existing work in this area.

References:

[1] Microsoft Research, Trinity project: Distributed graph database, http://research.microsoft.com/en-us/projects/trinity/

[2] Neo Technology, Java graph database, http://neo4j.org/

11. Fun with Bus Data

Contact: Richard Mortier

We have access to a near-realtime feed of bus data for Cambridge public buses, which presents several opportunities for a range of projects. The following two are of particular interest, but others may also be considered.

At present, data is received via POST to a webserver, which then both archives and serves it. This project would design and build a scalable custom webserver to perform these tasks using MirageOS as one or more microservices. This would make use of Irmin to store and retrieve the data, and would then present a range of computed views of the data via URLs against which applications and visualisations might be developed. Extensions might include using Jitsu to auto-scale components in response to load, exploring how to manage long-term archival of data in an "append-only" store such as Irmin, and many others.
The data is currently being simply visualised using Google Maps and similar mapping services. This project would explore the many more advanced data visualisations and analyses that could be carried out. Starting points include implementation of online visualisations such as for the MBTA metro dataset, or analyses as published on an earlier version of these data by Bacon et al (2011). "Using Real-Time Road Traffic Data to Evaluate Congestion". LNCS 6875:93-117.

12. Packet Capture on MirageOS

Contact: Richard Mortier

PCAP is a simple packet capture format, emitted by many standard tools such as tcpdump and wireshark. The commonly-used implementation of PCAP is the libpcap library which is, unfortunately, a common source of zero-day exploits at events such as DefCon. Rudimentary support for capturing and parsing PCAP format dumps already exists for MirageOS. This project would rationalise and extend this existing support to produce a working packet capture unikernel able to store captures to a MirageOS block device such as the recently introduced Archive format (where a tar file is mounted as a block device). The focus of the project will be system-level support for secure and efficient operation, able to capture large traces over long periods of time. Extensions might include playback of previously captured traces, with control over playback parameters (e.g., timing, packet metadata).

Pre-requisites: Familiarity with network package capture. Existing competence programming in OCaml, or familiarity with another functional programming language.

13. Exploring Architectures for SDN Switching

Contact: Noa Zilberman

The concept of Software Defined Networks was introduced less than a decade ago, and has quickly overtaken a significant role in today's networking. The NetFPGA [1] platform was the open-source platform used to first implement OpenFlow, the first example of SDN in practice [2] and today's common interface between the control and the data plane.
NetFPGA-SUME [3] is the third generation of NetFPGA and a technology enabler for datacentre research. In this project you will implement an OpenFlow switch [4] over the NetFPGA-SUME plarform, based on the original OpenFlow switch design [5], and explore various architectures for high bandwidth designs, new OpenFlow features and research performance trade-offs in implementations, experimented over real networking platforms.

References:
[1] http://www.netfpga.org
[2] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, Openflow: enabling innovation in campus networks, ACM SIGCOMM CCR, vol. 38, no. 2, pp. 69–74, 2008
[3] Noa Zilberman, Yury Audzevich, Adam Covington, Andrew W. Moore. NetFPGA SUME: Toward Research Commodity 100Gb/s, IEEE Micro, vol.34, no.5, pp.32,41, Sept.-Oct. 2014
[4] Open Networking Foundation, OpenFlow Switch Specification, April 2015, https://www.opennetworking.org/sdn-resources/technical-library
[5] NetFPGA-1G OpenFlow switch project

Pre-requisites: This project requires basic knowledge of computer networks and Verilog.

14. Exploring 100Gb/s Datapath Architectures

Contact: Noa Zilberman

The bandwidth of network switching silicon has increased by several orders of magnitude over the last decade. This rapid increase has called for innovation in datapath architectures, with many solutions competing for their place.
NetFPGA-SUME [1] , the third generation of the NetFPGA [2] open source networking platforms, is a technology enabler for 100Gb/s and datacentre research. As a reconfigurable hardware platform, it allows rapid prototyping of various architectures, allowing to explore various trade offs.
In this project you will explore multiple 100Gb/s datapath architectures over the NetFPGA-SUME plarform and study different types of high bandwidth designs. You will research trade-offs in implementations, such as bandwidth performance, latency, power and resource usage . Real networking hardware platforms will be used for experimentation.

References:
[1] Noa Zilberman, Yury Audzevich, Adam Covington, Andrew W. Moore. NetFPGA SUME: Toward Research Commodity 100Gb/s, IEEE Micro, vol.34, no.5, pp.32,41, Sept.-Oct. 2014
[2] http://www.netfpga.org

Pre-requisites: This project requires basic knowledge of computer networks and Verilog.

15. Understanding the performance of consensus systems

Originator/Supervisor: Matthew P. Grosvenor / Andrew W. Moore

Keywords: Consensus, protcols, I/O performance, network performance

Consensus systems like Zookeeper, Chubby, Raft, Log-cabin, Paxos and others are required for fault tolerant operation of many large scale distributed systems e.g at Google, Facebook, Microsoft etc. Due to the intense emphasis on correctness, little is known about the limits of performance of these systems other than the accepted truth that they are “slow". In this project we would like to see, a coherent and systematic study of the performance of consensus systems in order predict what benefits might be had form hardware based acceleration of the protocols (in the future). Some interesting avenues to explore might include:

Constructing a detailed model of the communication patterns of consensus systems and applying them to the known limits of performance of high speed networking devices. With this we would hope to be able to put a bounding box on the absolute best case performance of existing protocols “in the real world”.
Modifying/augmenting the I/O code to intentionally induce communication latency. If we can predict how real systems behave we we slow them down, we might be able to predict how they behave when we speed them up.
Modifying the I/O code to employ kernel bypass / userspace network stacks to increase performance without altering the code. If we can predict how systems behave when we speed them up a little, we might be able to predict how they will behave when we speed them up a lot.
Perform a detailed code analysis on hot-spots within the code. If we can understand where the bottlenecks are, we might be able to understand which parts would benefit from hardware acceleration in the future.

More Systems Projects at the DTG Project Page

Contact: Ripduman Sohan

Please see the DTG project suggestions page for a number of interesting systems projects.

Department of Computer Science and Technology

ACS Projects (2015—2016)

NetOS

1. Hardware-assisted generation garbage collection with CHERI

2. Hardware page-table walker for BERI

3. OS Support for Garbage Collection

4. Dynamic Task Scheduling on Heterogeneous CPU/GPU Environment using Machine Learning Technique for Parallel Processing

5. Clustering Entities across Multiple Documents in Massive Scale

6. Approximate Algorithms Determining Local Clustering Coefficients Anonymously

7. RasPi-Net: Building Stream Data Processing Platform over RasPiNET

8. Graph Compression in the Semi-External Memory Environment

9. Scale-Out SSD based Graph Traversal Platform

10. Building Graph Query Function using Functional Programming

11. Fun with Bus Data

12. Packet Capture on MirageOS

13. Exploring Architectures for SDN Switching

14. Exploring 100Gb/s Datapath Architectures

15. Understanding the performance of consensus systems

More Systems Projects at the DTG Project Page