Project Archive

This is a list of Part II/MPhil ACS projects from previous years.

Cypher to SQL

Cypher is a powerful graph query language created by the developers of Neo4J. It offers a succinct way to represent large, complex queries over graph data. However, Neo4J itself is a relatively slow database, at least until it is scaled to several database servers working in coordination.

As a compromise, this project would investigate whether a more performant solution could be found by implementing a translator from a subset of Cypher to SQL, and then running the backend data store on a fast SQL database such as PostgreSQL. The project would be easily evaluated by comparing the speed of various operations when performed on a Neo4J database, through the translation system, or written natively in SQL.
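To make the idea concrete, the toy sketch below translates one fixed Cypher pattern shape into SQL over a hypothetical nodes/edges relational schema. It is illustrative only: a real translator would need a proper Cypher parser, and the schema (nodes(id, label, name), edges(src, dst, type)) is an assumption, not part of the project spec.

```python
import re

# Handles exactly one pattern shape with a regex; a real translator
# would build an AST from the full Cypher grammar.
PATTERN = re.compile(
    r"MATCH \((\w+):(\w+)\)-\[:(\w+)\]->\((\w+):(\w+)\) RETURN (\w+)\.name"
)

def cypher_to_sql(query):
    m = PATTERN.fullmatch(query.strip())
    if m is None:
        raise ValueError("unsupported Cypher fragment")
    a, a_label, rel, b, b_label, ret = m.groups()
    # Graph pattern -> self-join over the assumed nodes/edges tables.
    return (
        f"SELECT {ret}.name FROM nodes {a} "
        f"JOIN edges e ON e.src = {a}.id AND e.type = '{rel}' "
        f"JOIN nodes {b} ON e.dst = {b}.id "
        f"WHERE {a}.label = '{a_label}' AND {b}.label = '{b_label}'"
    )
```

For example, `MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN b.name` becomes a two-way join filtered on node labels and edge type, which is the general flavour of the mapping the project would systematise.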

Several extensions to the project are available. Firstly, mapping the Cypher language more completely. Secondly, Cypher currently has no primitives for defining stored procedures, so a student could look at extending the Cypher language to allow for stored procedures, implementing them under the hood using SQL stored procedures.

Contact: Dr R. Sohan

Library-level Fault Injection

A typical application interacts with the kernel through the standard C library, which acts as a wrapper around the system calls, abstracting away the platform- and architecture-dependent complexity of the system-call mechanism.

Under most conditions these system/library calls behave as expected, but under certain run-time environment conditions they may return valid but unexpected values. For example, a write may return a disk-full error, or a read may return fewer bytes than expected. Previous work has shown that few applications test for these errors and that most will fail if one of the rarer errors is returned.

The goal of this project is to create a framework that allows programmers to test against, mask and mitigate these errors by interposing at the application/library-call interface. For MPhil students this could be extended to a comprehensive survey of popular applications' responses to system-call errors.
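The real framework would interpose at the C-library boundary (for instance via LD_PRELOAD). The Python sketch below illustrates the same idea by wrapping a call in a fault-injecting shim; the `save` "application" and the `fail_rate` parameter are purely illustrative.

```python
import errno
import os
import random

def inject_faults(func, fail_rate, error=errno.ENOSPC, rng=None):
    """Wrap a library call so it sometimes raises an injected OSError.

    A C implementation would interpose on the real libc symbol; wrapping
    a Python function demonstrates the shape of the mechanism.
    """
    rng = rng or random.Random()
    def wrapper(*args, **kwargs):
        if rng.random() < fail_rate:
            raise OSError(error, os.strerror(error))
        return func(*args, **kwargs)
    return wrapper

# An "application" that, like many surveyed programs, does not
# handle write errors at all.
def save(data, write=os.write):
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        return write(fd, data)
    finally:
        os.close(fd)
```

Running `save` with `write=inject_faults(os.write, fail_rate=1.0)` exposes the unhandled ENOSPC path without needing a full disk, which is exactly the kind of condition the framework would let programmers reproduce on demand.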

This project was completed in 2015-2016 by Laurynas Karazija. New students can either re-attempt the project in full, with Laurynas's system offering a benchmarking point, or pursue one of the several extensions that his work has revealed.

Firstly, there is an interesting amount of work in determining whether a program has failed while it is under fault injection. This task is highly non-trivial: the system needs to distinguish between faults that have been handled and ones that have caused a failure, and to differentiate between graceful and non-graceful failure.

Secondly, there is notable work in automatically determining the possible error modes of library code. This can be attempted by parsing the library binary and determining, by static code analysis or otherwise, what error conditions the code can produce. A useful result would be checking how accurate the error details on man pages are, and possibly auto-generating corrections.
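As a sketch of the man-page-accuracy idea, the snippet below extracts the errno constants that an (abbreviated, illustrative) ERRORS section claims a call can return, and diffs them against errors observed by other means. A real tool would parse actual groff sources or rendered pages.

```python
import re

# A fragment in the style of the write(2) man page's ERRORS section,
# abbreviated here for illustration.
MAN_ERRORS = """
ERRORS
       EAGAIN The file descriptor fd refers to a file other than a socket
              and has been marked nonblocking.
       EBADF  fd is not a valid file descriptor.
       ENOSPC The device containing the file has no room for the data.
"""

def documented_errors(errors_section):
    """Errno constants the man page claims the call can return.

    Entries are indented; the section header is not, so anchoring on
    leading spaces/tabs avoids matching the word ERRORS itself.
    """
    return set(re.findall(r"^[ \t]+(E[A-Z]+)\b", errors_section, re.MULTILINE))

def undocumented(observed, errors_section):
    """Errors seen in practice (e.g. via analysis of the binary)
    but missing from the man page."""
    return observed - documented_errors(errors_section)
```

Comparing `documented_errors` output against the error set derived from the binary would directly yield the candidate man-page corrections mentioned above.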

As part of the OPUS project in FRESCO, we have an implementation of a library-level interposition framework for capturing provenance. This codebase can be made available for the purposes of this project.

Interested students should have C experience and a basic understanding of the Linux C-library/system-call interface and implementation.

Contact: Dr R. Sohan

Code generation from PVM mapping

As part of the OPUS project we have created a semi-formal logic for provenance versioning called PVM. From this versioning scheme we then constructed a mapping from the POSIX interface to the PVM system, such that POSIX function calls could be interpreted as a set of operations on a graph stored in a database.

Currently this mapping is partially encoded in JSON and partially hard-coded in Python. This project would look at designing a suitable DSL to encompass the PVM mappings, opening up several opportunities.

The first goal of the project would be to generate the entirety of the current POSIX-call-to-database-transaction code. Given Python functions that implement the PVM primitives, the system should be able to generate code that applies those primitives to inbound messages in order to manipulate the database appropriately.
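The intended shape might look like the following sketch, in which a declarative table (standing in for the JSON/DSL encoding; all primitive and call names here are illustrative) maps each POSIX call to a sequence of PVM primitives, and handlers are generated from it. Real OPUS primitives would issue database transactions rather than mutate an in-memory dict.

```python
# Hypothetical PVM primitives: each takes the graph and the inbound
# message and applies one operation. Stand-ins for the real ones.
def new_version(graph, msg):
    graph.setdefault(msg["path"], []).append(msg["pid"])

def add_edge(graph, msg):
    graph.setdefault("edges", []).append((msg["pid"], msg["path"]))

PRIMITIVES = {"new_version": new_version, "add_edge": add_edge}

# Declarative mapping in the style of the JSON encoding: POSIX call
# name to the sequence of PVM primitives it triggers.
MAPPING = {
    "open": ["new_version", "add_edge"],
    "close": ["add_edge"],
}

def build_handler(call_name):
    """Generate the handler for one POSIX call from the declarative table."""
    steps = [PRIMITIVES[p] for p in MAPPING[call_name]]
    def handler(graph, msg):
        for step in steps:
            step(graph, msg)
    return handler
```

The point of the DSL would be that `MAPPING` (and richer conditions around it) lives in one declarative artefact from which both this dispatch code and verification output can be generated.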

The second goal of the project would be to produce output that can be used to validate and verify properties of the mapping in question.

The third goal would be to extend the DSL and see whether it could also absorb code in the capture segment of OPUS. There, a definition of all the captured functions is stated and used to generate capture code; however, that list of functions mirrors the mapping stored in the server. Unifying these two datasets would help reduce duplication and confusion.

Contact: Dr R. Sohan

Descriptive file names from content and context

It is a common failing of some computer users that they create many files with extremely unintuitive or non-descriptive names (e.g. foo, foo1, foobar, bar3). However, based on the contents of a file and the context within which it was created, it is possible to automatically produce descriptive names for files. Systems such as OPUS can already track the relations between programs and data objects on a system, effectively providing the context in which these files were produced. By utilising NLP techniques to parse the contents and systems such as OPUS to understand the context, you would build an application that presents a view onto the file system in which these unintuitively named files are renamed in a way appropriate to their derivation or purpose.
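A minimal sketch of the idea, assuming crude frequency-based keyword extraction in place of real NLP, and a single provenance fact (the name of the creating program) in place of a full OPUS context query:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def keywords(text, n=2):
    """The n most frequent content words; a stand-in for real NLP."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS and len(w) > 2]
    return [w for w, _ in Counter(words).most_common(n)]

def suggest_name(content, creating_program):
    """Combine provenance context (creating program) with content
    keywords into a descriptive name for an otherwise opaque file."""
    return "-".join([creating_program] + keywords(content))
```

So a file called `foo1` containing budget notes written from an editor might be presented as `vim-budget-notes`, with the derivation recoverable from the provenance graph.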

Contact: Dr R. Sohan

Characterising asynchronous behavior in the Linux kernel

Modern operating system kernels have moved towards a model where the preferred way of executing operations is nonblocking/asynchronous. Many applications use these asynchronous features (for example, in the storage and network stacks of the Linux kernel) to achieve better scalability properties.

However, this also means that irrespective of application constraints, the exact timing of I/O operations now depends on asynchronous operation schedulers in the kernel. Those schedulers (for example, the IO scheduler) are responsible for managing the system-wide multiplexing of resources. Because of this, it becomes difficult to predict and understand how different applications will interact with each other and whether the workloads they service are synergistic or antagonistic.

In this project, you will look at quantifying the side effects of applications executing asynchronous operations on other applications running on the same physical machine. The purpose is to understand the variance introduced in the latency/response times by interacting workloads and to discover optimisation opportunities (either in applications themselves or in the scheduling of operations).

In order to achieve your goal, you will:

  1. Use a measurement infrastructure called Resourceful and extend it to record asynchronous behavior.
  2. Instrument a number of applications to expose their use of asynchronous operations.
  3. Run experiments to understand the interactions/variations in response times of those applications when they run concurrently on the same machine.
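The core of step 3 is comparing response-time distributions between solo and co-scheduled runs. The sketch below shows the summary statistics that comparison needs, computed over recorded per-operation latencies (the numbers fed to it would come from the instrumentation above; nothing here is a real measurement).

```python
import statistics

def latency_summary(samples_us):
    """Summarise per-operation latencies in microseconds: mean, p99 and
    variance, the quantities compared between solo and contended runs."""
    ordered = sorted(samples_us)
    p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
    return {
        "mean": statistics.fmean(samples_us),
        "p99": p99,
        "variance": statistics.pvariance(samples_us),
    }

def interference(solo, contended):
    """Relative slowdown of mean latency introduced by a co-runner."""
    return latency_summary(contended)["mean"] / latency_summary(solo)["mean"]
```

An `interference` value near 1.0 would indicate synergistic (or at least neutral) workloads; values well above 1.0, or a large jump in variance/p99, would flag antagonistic interaction through the kernel's asynchronous schedulers.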

If you complete this first stage of the project, we will look at extending the experiments for Linux containers.

Interested students should have basic Operating Systems knowledge. Experience with the Linux kernel is advantageous as is programming experience in C.

Well executed, this project will result in a top-tier publication.

Contact: Dr R. Sohan


Understanding Performance Interference with Machine Learning

We have internally created a machine-learning technique for understanding performance interference. Currently we only use this to infer the performance overhead of virtualisation on a highly contended machine. This project would pick another area where there is performance interference and apply our machine-learning approach to comprehend it.

Well executed, this project will result in a top-tier publication.

Contact: Dr R. Sohan

Resourceful For the HPCS Clan

High-performance computing workloads multiplex jobs across a cluster of machines. When they do so, there is often performance interference between badly interacting workloads. This project would be a collaboration with the university's high performance computing service to find out how to combine existing Linux mechanisms (e.g. cgroups) with an internal tool for measuring fine-grained resource consumption in order to limit bad interactions.
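One plausible shape for the cgroups side is sketched below, assuming a cgroup v2 unified hierarchy and an illustrative /sys/fs/cgroup/hpc/<job> layout (both are assumptions about the eventual deployment; actually applying the writes needs root and an existing cgroup).

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # cgroup v2 unified hierarchy

def plan_limits(job, cpu_quota_pct, mem_bytes, pid):
    """Return the (control file, value) writes that would confine one job.

    cpu.max takes "<quota> <period>" in microseconds; a 100000 us period
    with a proportional quota caps the job at cpu_quota_pct of one CPU.
    memory.max is a hard byte limit; writing the pid to cgroup.procs
    moves the job into the cgroup.
    """
    cg = CGROUP_ROOT / "hpc" / job
    period = 100_000
    quota = period * cpu_quota_pct // 100
    return [
        (cg / "cpu.max", f"{quota} {period}"),
        (cg / "memory.max", str(mem_bytes)),
        (cg / "cgroup.procs", str(pid)),
    ]

def apply_limits(plan):
    for path, value in plan:
        path.write_text(value)  # requires root and an existing cgroup
```

The interesting part of the project is upstream of this: deciding the quota values from the fine-grained resource measurements, so that limits are applied only when two jobs are actually interacting badly.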

Well executed, this project will result in a top-tier publication.

Contact: Dr R. Sohan

Evaluating and improving low-level kernel probing mechanisms

System measurement tools such as DTrace or SystemTap make use of low-level probing mechanisms to insert small pieces of code into the normal kernel execution flow at runtime. These mechanisms, together with the infrastructure for inserting, activating, deactivating and running the inserted code, have a significant impact on the efficiency and side effects of the higher-level measurement tools.

In this project, you will be looking at comparing the mechanisms employed by DTrace, SystemTap (kprobes) and a new kind of probes developed internally in our group (kamprobes). The purpose is to further optimise the probes and to understand their execution and overheads on different computing architectures. Even simple improvements in those low-level mechanisms will generate opportunities for significantly advancing in-production system monitoring and performance diagnosis.

Interested students should have basic Operating Systems knowledge. Experience with the Linux kernel, programming in C and ASM is advantageous.

Well executed, this project will result in a top-tier publication.

Contact: Dr R. Sohan

Improving DTrace reliability

While DTrace is one of the leading tools for system-level measurement and introspection, it has been built with a focus on 'use as the exception': you turn probes on when something goes wrong or cannot be explained, solve the issue and then disable the probes. However, a number of use cases (in security, provenance recording and automatic root-cause analysis) require probing to be extensive and "always on". In such a scenario, DTrace prioritises kernel liveness over correctness by dropping events from the recorded traces.

In this project, you will explore the reverse tradeoff: maintaining correctness even at the cost of slowing down applications or other kernel tasks. You will first identify the causes of DTrace bottlenecks (limited buffer size, the process that reads from the trace buffers not being scheduled, too many probes firing) and then proceed to change DTrace in order to eliminate the dropping of events.

You will explore the following strategies:

  1. Moving DTrace to per-application/per-cpu buffers instead of shared per-cpu buffers
  2. Keeping buffer highmarks in order to detect high event rates produced by certain applications
  3. Throttling applications that produce too many events by scheduling them less often (scheduler changes required)
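Strategy 2 can be modelled as a per-application trace buffer that tracks its high-water mark, flagging offenders before events would be dropped. The following is a Python model of the data structure, not DTrace code; the capacity and threshold values are illustrative.

```python
from collections import deque

class TraceBuffer:
    """Per-application trace buffer with a high-water mark.

    When occupancy crosses `threshold`, the producing application would
    be flagged for throttling (strategy 3) instead of events being
    silently dropped later, as stock DTrace does today.
    """
    def __init__(self, capacity, threshold):
        self.buf = deque()
        self.capacity = capacity
        self.threshold = threshold
        self.highmark = 0
        self.dropped = 0

    def record(self, event):
        if len(self.buf) == self.capacity:
            self.dropped += 1          # the behaviour to be eliminated
        else:
            self.buf.append(event)
        self.highmark = max(self.highmark, len(self.buf))

    def needs_throttling(self):
        return self.highmark >= self.threshold

    def drain(self):
        """Consumer side: the reader process emptying the buffer."""
        events, self.buf = list(self.buf), deque()
        return events
```

The high-water mark gives the scheduler an early signal: an application whose buffer repeatedly approaches capacity between drains is a candidate for being scheduled less often.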

If the primary goal of the project is achieved, opportunities exist for extending it with more subtle strategies for avoiding dropped events. For example, one could explore artificially prolonging the duration of system calls for offending processes.

Interested students should have basic Operating Systems knowledge. Experience with the Linux/FreeBSD kernel is advantageous as is programming experience in C.

Well executed, this project will result in a top-tier publication.

Contact: Dr R. Sohan

Fast I/O Data Paths using eBPF

Today, doing complex processing for network packets or general I/O streams requires traversing the full network or I/O stack (we exclude kernel-bypass mechanisms from the discussion). However, with the implementation of eBPF, an opportunity exists for applications to insert pieces of code into the kernel, at runtime and in a safe manner. This means that at least part of the complex logic for filtering, forwarding, load balancing or caching can be pushed towards the lowest points in the software stack.

This project will be looking at implementing a programmable, high performance I/O data path by using eBPF programs. Depending on the interest of students, the focus can be placed on:

  • Programmable network data paths for mitigating DDOS attacks
  • Fine grained application hints for I/O caching and resource usage
  • Custom, application controlled I/O coalescing or persistence
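As an illustration of the first direction, the decision logic an XDP program might implement for DDOS mitigation (per-source token-bucket rate limiting backed by a BPF hash map) is sketched below in plain Python for clarity; the in-kernel version would be written in restricted C and loaded as an eBPF program.

```python
# XDP verdicts as string constants; in a real program these are the
# integer return codes of the XDP hook.
DROP, PASS = "XDP_DROP", "XDP_PASS"

class RateLimiter:
    """Token bucket per source address; `buckets` plays the role of the
    BPF hash map shared between the kernel program and user space."""
    def __init__(self, rate_pps, burst):
        self.rate = rate_pps        # tokens replenished per second
        self.burst = burst          # bucket capacity
        self.buckets = {}           # src_ip -> (tokens, last_ns)

    def verdict(self, src_ip, now_ns):
        tokens, last = self.buckets.get(src_ip, (self.burst, now_ns))
        # Replenish proportionally to elapsed time, capped at burst.
        tokens = min(self.burst, tokens + (now_ns - last) * self.rate / 1e9)
        if tokens < 1:
            self.buckets[src_ip] = (tokens, now_ns)
            return DROP
        self.buckets[src_ip] = (tokens - 1, now_ns)
        return PASS
```

Because the drop decision is made at the earliest point in the receive path, flood traffic never traverses the full network stack, which is precisely the saving the project aims to quantify.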

Interested students should have basic Operating Systems knowledge. Experience with the Linux kernel is advantageous as is programming experience in C.

Well executed, this project will result in a top-tier publication.

Contact: Dr R. Sohan

Characterizing the execution of interactive workloads on wimpy-core (Calxeda) and Tile (Tilera) architectures

The current server/data center space is dominated by traditional x86-64 architectures. However, the drive for improved efficiency and low energy consumption has created the space for alternative architectures to exist. Calxeda (ARM) and Tilera are two such architectures, with hardware available publicly (and accessible within the DTG).

Because of their novelty, the performance of running existing applications and the opportunities for optimising them for those architectures are not fully known.

You will have the opportunity of choosing one of the architectures and characterizing the execution of server applications on top of it, in direct comparison to x86-64. We will try to understand things like:

  • The performance of the network and I/O stacks in comparison to x86-64.
  • The deployment of kernel-level measurement tools (perf / kprobes) or our custom probes implementation (kamprobes) to understand practical architectural differences.
  • Optimal scheduling/placement of interrupts, I/O and CPU tasks.
  • The scope for deploying virtualisation (Xen, containers) on top of those architectures.

Well executed, this project will result in a top-tier publication.

Contact: Dr R. Sohan

Linux Kernel specialisation for advanced, virtualisation-assisted sandboxing

We propose a new way of enforcing the sandboxing of Linux applications based on a primitive we have developed, called shadow kernels. In order to deny access to particular kernel functionalities for a given application, one can present that application with a kernel image in which the memory pages containing the restricted features are zeroed-out.

We already have implemented the basic mechanisms for creating different kernel text sections and switching to them, under the control of the hypervisor (Xen).

You will need to use this primitive to implement sandboxing and show that, even given exploitable code (e.g. NULL-pointer dereferences), the application still cannot access restricted features.

Well executed, this project will result in a top-tier publication.

Contact: Dr R. Sohan

Compartmentalising applications using SGX and deprivileged OS services

Applications can become quite complex, consisting of numerous components in the form of libraries and modules. The OS treats each application as a single executing entity and grants all components in the application the same privileges and access to the same set of resources. This leads to the problem where a security vulnerability in any one component can affect all other components, compromising the entire application. Thus it is desirable to isolate and compartmentalise individual components from each other and only grant each component the privileges it requires to operate.

Recent work (SOAAP) has shown that compartmentalisation can be implemented with the help of source code annotations combined with a custom compiler toolset to restrict the privileges (system calls) granted to individual components within applications and carefully controlling the communication between component boundaries. This approach however does not enforce strict memory isolation as the entire application's address space is still addressable from any code in the process.

With modern hardware such as Intel SGX (ref) it is possible to isolate specific parts of an application's memory address space from the rest of the process including the OS by using the concept of memory enclaves. However, a major drawback of using SGX enclaves is that all system calls are prohibited from within an enclave region. Thus if a component in an application is provisioned within an enclave it is effectively cut off from the system. SGX also restricts ring-0 code from executing within an enclave, thus it is not possible in the conventional model to execute the OS within an enclave.

In this project we endeavour to implement application compartmentalisation using SGX by de-privileging various OS services and running them in ring-3 mode within their own SGX enclaves, enabling user-space applications (in SGX enclaves) to invoke/link with required/allowed services directly. This can guarantee both memory isolation as well as restricted access to system resources for components within applications.

Contact: Dr R. Sohan

Fine-grained lineage for Apache Spark

In-memory processing frameworks such as Apache Spark are increasingly being adopted in industry due to their good performance for many applications. We plan to add fine-grained provenance support for Spark. In fact, Spark uses coarse-grained lineage (instead of data replication) to achieve fault tolerance, by recomputing a lost data partition. However, Spark is not able to capture precise relationships between input and output because (1) lineage is coarse-grained and (2) stateful data flow is not tracked. This project will augment Spark to capture fine-grained lineage that can be leveraged effectively for data audit and debugging use cases.
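The difference from Spark's partition-level lineage can be illustrated with record-level tags: each value carries the tuple of input record ids it derives from, propagated through map and filter stages. This is a plain-Python sketch of the bookkeeping, not Spark's RDD API; all names are illustrative.

```python
def tag(values):
    """Tag raw input with its record id as the initial lineage."""
    return [(v, (i,)) for i, v in enumerate(values)]

def lineage_map(f, records):
    """Apply f to each value, carrying its provenance along. This
    per-record input/output link is what coarse-grained partition
    lineage cannot recover."""
    return [(f(v), lin) for v, lin in records]

def lineage_filter(pred, records):
    return [(v, lin) for v, lin in records if pred(v)]

def backtrace(records, value):
    """The audit/debugging query: which input records produced `value`?"""
    return [lin for v, lin in records if v == value]
```

With partition-level lineage, losing one output record would force recomputing the whole partition and could not answer `backtrace` at all; the record-level tags make both targeted recomputation and data audit possible, at the cost of the extra metadata this project would have to manage efficiently.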

Well executed, this project will result in a top-tier publication.

Contact: Dr R. Sohan

Deep packet-level inspection using Hadoop

Inspecting network packets is important in many applications, such as root-cause analysis and intrusion detection. For these applications to scale to current data volumes, the task of packet-level inspection has to leverage distributed "big data" processing frameworks. In this project, we plan to investigate how to build deep packet inspection tools on top of Hadoop. These tools will enable "realtime" analysis of network traffic without requiring costly or specialised hardware.
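Such a job would take the usual MapReduce shape: a mapper emitting keyed measurements per packet and a reducer aggregating them. Below is a local sketch over synthetic packet records; real input would be parsed pcap splits, and the shuffle and reduce would be Hadoop's.

```python
from collections import defaultdict

# Synthetic packet records standing in for parsed pcap input.
PACKETS = [
    {"src": "10.0.0.1", "dst": "10.0.0.9", "proto": "TCP", "len": 1500},
    {"src": "10.0.0.2", "dst": "10.0.0.9", "proto": "UDP", "len": 512},
    {"src": "10.0.0.1", "dst": "10.0.0.9", "proto": "TCP", "len": 40},
]

def mapper(packet):
    """Emit (key, value) pairs, as a Hadoop map task would.

    Keying on (source, protocol) gives per-flow-class byte counts; a
    deeper inspector would key on payload features instead.
    """
    yield (packet["src"], packet["proto"]), packet["len"]

def reduce_all(packets):
    """Local stand-in for the shuffle + reduce phase."""
    totals = defaultdict(int)
    for p in packets:
        for key, length in mapper(p):
            totals[key] += length
    return dict(totals)
```

The per-key aggregates are what downstream detectors (e.g. for volumetric anomalies) would consume; the project's challenge is doing the parse and map fast enough that "realtime" holds at line rate.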

Well executed, this project will result in a top-tier publication.

Contact: Dr R. Sohan