New Systems Projects

This is the list of new systems projects available for Part II and MPhil ACS students.

Machine-learning graph filtering

The project involves training a basic machine-learning model to select nodes of interest from a graph. In this case, the graph describes system-level interactions (processes reading files, writing to sockets, and many other OS events). Starting from a given source node, the goal is to discriminate between successor nodes that could have been part of a malicious intrusion or suspicious behaviour and nodes that just describe normal system operation. Based on the recommendations provided by the model you develop, nodes will be either displayed, hidden or grouped together in a cyber-security analyst UI developed for forensically analysing attacks.

You will start from data describing real network intrusions and attacks, as well as normal operation, collected as part of the CADETS and OPUS research projects. From those graphs, you will extract examples of malicious/benign behaviour based on hand-crafted rules. For example, a rule might be that all nodes belonging to a path describing "a file downloaded from the internet that was executed" are marked as suspicious/interesting for analysis. Based on the extracted labels, you will then perform supervised learning in order to build a model that recognizes nodes of interest and classifies them accordingly. You will then evaluate and refine the approach.
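As an illustration, the rule-based label-extraction step could look like the following sketch. The graph layout, event names and the specific rule are illustrative assumptions, not the actual CADETS/OPUS data format:

```python
# Each node is id -> type; each edge is (src, dst, event).
nodes = {"p1": "process", "f1": "file", "p2": "process", "f2": "file"}
edges = [
    ("p1", "f1", "download"),   # p1 downloaded f1 from the network
    ("f1", "p2", "execute"),    # f1 was then executed, spawning p2
    ("p1", "f2", "write"),      # an ordinary file write
]

def label_nodes(nodes, edges):
    """Mark every node on a 'downloaded then executed' path as suspicious."""
    labels = {n: "benign" for n in nodes}
    downloaded = {dst for src, dst, ev in edges if ev == "download"}
    for src, dst, ev in edges:
        if ev == "execute" and src in downloaded:
            # the downloaded file and the process it spawned are flagged
            labels[src] = labels[dst] = "suspicious"
    return labels

labels = label_nodes(nodes, edges)
# f1 (downloaded, then executed) and p2 (spawned from it) are flagged;
# p1 and f2 describe normal operation
```

These labels would then feed a standard supervised learner (the keywords below suggest an SVM) over per-node features.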

Once the primary goal is achieved, numerous extension possibilities exist. For example, you could "switch sides" and, as an attacker, try to hand-craft situations where the model developed above is ineffective.

Students should have an interest in machine learning and security. Experience with OS-level POSIX semantics and graph databases is advantageous.

Keywords: SVM, Context-aware collaborative filtering.

Contact: Lucian Carata Originator: Dr R. Sohan

OS Performance Tools (exposing kernel subsystems performance metrics)

Linux exposes numerous aggregate statistics describing the behaviour of the system as a whole. However, when it comes to understanding per-application behaviour, its introspection abilities are greatly reduced. Using existing tools such as SystemTap or DTrace requires significant expertise and can incur large overheads.

In this project, you will focus on an alternative approach. You will be extending the Resourceful framework, a research project developed in our group and available as open source, to expose detailed metrics about OS operations triggered by user-space code. The first proposed tool will track per-socket statistics (data sent/received, latencies) for all the sockets created by an application. This will involve defining a new Resourceful monitoring subsystem and automatically adding minimal binary instrumentation to an application.
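A user-space sketch of the kind of per-socket accounting the new subsystem would perform in-kernel (the `TracedSocket` wrapper and the stats layout are illustrative, not the Resourceful API):

```python
import socket
import time
from collections import defaultdict

# per-fd counters: bytes sent/received and cumulative send latency
stats = defaultdict(lambda: {"sent": 0, "received": 0, "send_ns": 0})

class TracedSocket:
    """Wrap a socket and account every send/recv against its fd."""
    def __init__(self, sock):
        self._sock = sock

    def send(self, data):
        t0 = time.perf_counter_ns()
        n = self._sock.send(data)
        entry = stats[self._sock.fileno()]
        entry["sent"] += n
        entry["send_ns"] += time.perf_counter_ns() - t0
        return n

    def recv(self, bufsize):
        data = self._sock.recv(bufsize)
        stats[self._sock.fileno()]["received"] += len(data)
        return data

raw_a, raw_b = socket.socketpair()
a, b = TracedSocket(raw_a), TracedSocket(raw_b)
a.send(b"ping")
reply = b.recv(16)
```

The real tool would attach such accounting via binary instrumentation rather than an explicit wrapper, so the application needs no source changes.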

You will then use the data from your measurements to evaluate the performance of different ways of using sockets at the application level (blocking/non-blocking).

Contact: Lucian Carata Originator: Dr R. Sohan

Query-engine for fine-grained application/OS performance metrics (SQL-like)

The project involves extending Resourceful, a Linux tracing framework developed in our group. The goal is to take the fine-grained OS measurements performed by applications through the existing Resourceful API, develop a simple per-application backing store, and then extend an existing SQL engine to allow complex queries on this data. The aim is to provide simple ways of interacting with large fine-grained measurement datasets when debugging the performance of user-space applications or kernel drivers.
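To make the idea concrete, here is a minimal sketch using the stdlib sqlite3 engine as a stand-in for the backing store. The schema is an assumption about what a Resourceful measurement record might contain:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE measurement (
    pid INTEGER, subsystem TEXT, syscall TEXT, latency_ns INTEGER)""")
rows = [
    (1000, "vfs", "read",   1200),
    (1000, "vfs", "read",    900),
    (1000, "net", "sendto",  450),
    (1001, "vfs", "write",  3000),
]
db.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?)", rows)

# Aggregate latency per kernel subsystem for one application
query = """SELECT subsystem, COUNT(*), AVG(latency_ns)
           FROM measurement WHERE pid = 1000
           GROUP BY subsystem ORDER BY subsystem"""
result = db.execute(query).fetchall()
# [('net', 1, 450.0), ('vfs', 2, 1050.0)]
```

The project's challenge is doing this at scale, over a physical layout tuned for fine-grained trace data rather than a generic row store.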

You will then perform a scalability analysis of your solution and determine how various types of queries could be optimized. As an extension, you will investigate different ways of indexing and modifying the backing store physical layout for increasing performance.

Contact: Lucian Carata Originator: Dr R. Sohan

Allow running eBPF programs on top of low-overhead probes (kamprobes)

Currently, the most advanced kernel tracing framework on Linux is eBPF, which allows the safe execution of small programs, compiled for a special eBPF ISA, directly inside the kernel. However, the mechanism for attaching those programs to various points inside the kernel suffers from performance and scalability issues. In this project, you will extend both eBPF and our own low-overhead kernel probing framework (kamprobes) to work together, overcoming the limitations of existing solutions.

You will evaluate the ability to run Linux kernels with many active probe points and compare the performance against previous solutions.

Contact: Lucian Carata Originator: Dr R. Sohan

Automated identification of event loops and application workqueues given binaries and/or source code

One of the things that makes application-level tracing challenging is the increased reliance on event loops and task workqueues working together to provide better performance. Such applications are notoriously difficult to debug because very little data is exposed about what executes, when, and with what interleavings. To overcome that difficulty, you will set out to perform static analysis of binaries and/or source code to automatically identify event loops and the corresponding workqueue data structures. The goal is to then apply automatic instrumentation at those points without any programmer effort, providing application introspection "for free".
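As a toy illustration of the analysis, the sketch below uses Python's `ast` module (a stand-in for the binary-level analysis the project actually targets) to flag candidate event loops: unconditional loops whose body blocks on a queue-like `.get()` call. The heuristic and code shape are assumptions:

```python
import ast

SOURCE = """
import queue
work = queue.Queue()
def worker():
    while True:
        task = work.get()
        task()
def helper():
    for i in range(10):
        print(i)
"""

def find_event_loops(source):
    candidates = []
    for node in ast.walk(ast.parse(source)):
        # `while True:` is the classic event-loop shape
        if isinstance(node, ast.While) and \
           isinstance(node.test, ast.Constant) and node.test.value is True:
            for inner in ast.walk(node):
                # does the loop body block on a queue-like .get()?
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Attribute)
                        and inner.func.attr == "get"):
                    candidates.append(node.lineno)
    return candidates

loops = find_event_loops(SOURCE)
# only worker()'s `while True:` loop is flagged; helper()'s bounded
# for-loop is not
```

On binaries, the equivalent analysis would look for back-edges dominating blocking syscalls rather than syntax, but the overall shape of the problem is the same.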

You will evaluate your system on a number of real applications such as Apache, Nginx and Postgres and compare your results with those of manual instrumentation.

An existing approach doing initial work in this space is available here

Contact: Lucian Carata Originator: Dr R. Sohan

Instrumenting Go's runtime to track the effects of goroutine scheduling on application latency

Similar to the project described above, but you will look at instrumenting a runtime (the Go runtime in particular) to add measurements at work-multiplexing points. This means instrumenting the goroutine scheduler in order to determine how green threading works for a given application. You will use the results to characterise the Go scheduler and provide general optimisation strategies for writing Go programs.

Contact: Lucian Carata Originator: Dr R. Sohan

Intel PT Enhanced Record and Replay

OS-level deterministic replay is a useful tool for many domains, such as debugging, low-overhead (virtual server) migration between hosts, deterministic state logging for the purposes of reproducibility, etc. Traditionally, deterministic replay has shown a 20-40x overhead in SMP configurations due to the need to trap and log every read and write in shared memory regions.

Intel PT provides a low-overhead, hardware-assisted mechanism for fine-grained tracing of processor actions, including reads and writes at the byte level. It is possible to leverage Intel PT to accelerate deterministic replay and thereby build a system that provides fast deterministic replay for SMP configurations.

Done well, this project would form the basis of a top-tier publication.

Contact: Lucian Carata Originator: Dr R. Sohan

Parallelize and Increase the efficiency of SNORT

SNORT is a tool that forms the core of many IDS and IPS tools in the security industry, but it suffers from being very CPU-intensive, which makes it slow. A single instance of SNORT is unable to process more than 200 MB/sec on most modern CPUs. Its slowness is due to the fact that (a) it has to check all incoming data against a number of rulesets and (b) these rulesets are checked linearly. There is quite a lot of low-hanging fruit in (a) architecture-level optimisations (e.g. SIMD instructions), (b) rule dependency analysis, and (c) changing the design of the tool to incorporate multi-threaded scalability.
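To illustrate the rule-dependency angle, the sketch below indexes rules by destination port so that each packet is only checked against its relevant bucket instead of the whole ruleset. The rule format is a drastic simplification of real SNORT rules:

```python
from collections import defaultdict

rules = [
    {"id": 1, "port": 80, "pattern": b"cmd.exe"},
    {"id": 2, "port": 80, "pattern": b"/etc/passwd"},
    {"id": 3, "port": 22, "pattern": b"SSH-1.0"},
]

# Pre-compute: group rules by the destination port they apply to
by_port = defaultdict(list)
for r in rules:
    by_port[r["port"]].append(r)

def match(packet_port, payload):
    # Still a linear scan, but only over the rules relevant to this port
    return [r["id"] for r in by_port[packet_port] if r["pattern"] in payload]

hits = match(80, b"GET /etc/passwd HTTP/1.1")
# only the two port-80 rules are consulted; rule 2 matches
```

Real gains would come from combining such dependency analysis with SIMD pattern matching and sharding packets across worker threads.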

Done well, this project would form the basis of a top-tier publication.

Contact: Dr R. Sohan

Linux-to-FreeBSD Kernel Module Translator

The Linux OS has much better support for modern devices, due to its popularity, compared to FreeBSD, yet FreeBSD is used in more security-conscious and enterprise environments. The kernel-level abstractions and paradigms exported by both OSes are conceptually similar but differ at the implementation level. In this project you will write a tool that takes a Linux kernel module as input and automatically translates it to work on FreeBSD (for some definition of "work"). In doing so you will reduce the barrier to entry for FreeBSD module creation, allowing it to make use of newer hardware more quickly, as well as reducing the time-to-adoption of hardware for FreeBSD users.

Contact: Dr R. Sohan


Cross-OS Subsystem Sharing

Code is code. There is no reason one shouldn't be able to use FreeBSD jails from Linux, or the Linux NVMF subsystem from within FreeBSD. With modern processors supporting hardware virtualization and IOMMU memory isolation, it should be possible for one OS to use the subsystem of another by spawning the OS it wants to use within its innards and leveraging the code written for that OS. In this project you will create a tool that allows users to run a subset of another OS within their own, in order to use newer (e.g. NVMF) or different (e.g. SGX) technology. The working OS set in this project will be FreeBSD and Linux.

Contact: Dr R. Sohan

Improving Container Fairness

Containers are a very useful abstraction, as they seem to provide the right mix of logical isolation and resource efficiency for most workloads. However, misbehaving containers can seriously impact overall runtime fairness. In this project you will build on the DTG Resourceful project to create a tool that allows devops personnel to improve container fairness by reacting continuously to fine-grained resource measurements of running containers. In doing so, you will create a tool that provides a very useful function in a low-overhead and unintrusive manner.
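A simulated sketch of the feedback loop such a tool might run. In the real tool the measurements would come from Resourceful and the throttling would go through cgroup controls; both are stubbed here, and the 1.5x tolerance threshold is an arbitrary assumption:

```python
def fairness_controller(usage_ms, quota_ms):
    """usage_ms: {container: cpu-ms used this interval}.
    Returns the set of containers to throttle for the next interval."""
    fair_share = quota_ms / len(usage_ms)
    # throttle any container using more than 1.5x its fair share
    return {c for c, used in usage_ms.items() if used > fair_share * 1.5}

sample = {"web": 40, "db": 35, "batch": 225}   # batch is misbehaving
throttled = fairness_controller(sample, quota_ms=300)
# fair share is 100 ms; only "batch" (225 ms) exceeds 150 ms
```

The interesting research questions are how fine-grained the measurement intervals can be made without adding overhead, and what control policy avoids oscillation.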

Contact: Lucian Carata Originator: Dr R. Sohan

PT Enhanced DTrace Provenance Collection

Intel PT provides a low-overhead, hardware-assisted mechanism for fine-grained tracing of processor actions, including reads and writes at the byte level. This facility is extremely useful for collecting the lineage of running processes as they execute. In turn, this lineage is used for specific use-cases such as detecting whether program execution was erroneous or deviated from the expected state.

Previous researchers have leveraged PTrace for provenance collection, but their work has focused on collecting trace information. In this work you will extend the FreeBSD DTrace mechanism to augment it with Intel PT information, thereby increasing its performance, reducing its resource consumption and providing greater context for instrumentation points. This work will feed into the CADETS project, increasing its scalability and efficiency.

Contact: Lucian Carata Originator: Dr R. Sohan

Spark Logs Summarization

Data-Intensive Scalable Computing (DISC) systems such as Spark and Hadoop have gained widespread adoption recently. However, debugging these systems is still a cumbersome task. Programmers spend a significant number of hours checking huge volumes of distributed log events to debug these systems. Spark log analysis and summarization would allow programmers to grasp the key logged events, facilitate root cause analysis and accelerate debugging.
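One plausible first summarization step, sketched below, is collapsing log lines into templates by masking variable fields (numbers, task/stage IDs) and counting occurrences, so each distinct event is shown once. The masking rules are illustrative assumptions:

```python
import re
from collections import Counter

LOGS = [
    "INFO TaskSetManager: Finished task 0.0 in stage 1.0 in 210 ms",
    "INFO TaskSetManager: Finished task 1.0 in stage 1.0 in 195 ms",
    "WARN MemoryStore: Not enough space to cache rdd_4_2",
    "INFO TaskSetManager: Finished task 2.0 in stage 1.0 in 230 ms",
]

def template(line):
    line = re.sub(r"\d+(\.\d+)?", "<NUM>", line)   # mask numbers and IDs
    return re.sub(r"rdd_\S+", "<RDD>", line)       # mask RDD block names

summary = Counter(template(l) for l in LOGS)
# the three "Finished task" lines collapse into one template with count 3
```

More sophisticated approaches would learn templates from the data rather than hard-coding the masks, and rank templates by rarity to surface anomalies.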

Contact: Lucian Carata Originator: Dr R. Sohan

Non-repudiable NFS

Our current prototype for non-repudiable disk I/O leverages SGX to provide strong assurances regarding data read and written by applications. The goal of this project is to extend our current design and implementation to NFS, wherein the application's system call and its corresponding NFS client RPC requests are captured securely and verified on the NFS server side in a trustworthy manner. In a standard NFS setup both the client and the server run within the kernel; therefore, a major technical challenge in this project is to execute them (or specific parts of the client and server) in user space within an SGX enclave. The existing mechanism for non-repudiable disk I/O may then be used on the server side to ensure that we have a trusted path down to the underlying storage device.

Contact: Lucian Carata Originator: Dr R. Sohan

Automatically dockerise applications from runtime dependency information

A typical application comes with several runtime dependencies, usually in the form of libraries and files required by the application. This makes distributing such applications cumbersome, since the required dependencies must first be set up and configured on a given machine. It also poses a risk, since the installed dependencies may be incompatible with other applications on the system sharing the same dependencies, e.g. due to specific version requirements or a specific configuration option. One solution could be to manually package all dependencies as part of the application install and ensure that the existing environment on the end system is not changed. However, this process is non-trivial, since (i) developers must manually maintain and update the list of dependencies as the application evolves and (ii) developers have no easy way to determine the dependency graph for their applications. Containerisation technologies such as Docker solve most of these problems, but the burden is still on the application developer to provide the corresponding Dockerfile containing the configuration and dependency setup information.

As part of the OPUS project in FRESCO we have an implementation of a library-level interposition framework that captures file and process provenance as programs execute. This provenance data can contain information about dynamic libraries, imported modules (e.g. for Python applications), etc. The goal of this project is to automatically generate Dockerfiles for applications from their runtime dependency information. A quantitative evaluation of the system will measure the overheads of capturing such data, and a qualitative evaluation will measure the accuracy and effectiveness of the approach for a diverse set of applications.
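The generation step might look like this sketch, where the `observed` dictionary is a hand-written stand-in for what OPUS provenance capture would report; the package names, base image and entrypoint are all illustrative:

```python
observed = {
    "base": "ubuntu:20.04",
    "packages": ["libssl1.1", "libcurl4"],
    "files": ["/etc/myapp/config.yaml"],
    "entrypoint": "/usr/local/bin/myapp",
}

def generate_dockerfile(deps):
    """Emit Dockerfile text from observed runtime dependencies."""
    lines = [f"FROM {deps['base']}"]
    if deps["packages"]:
        pkgs = " ".join(deps["packages"])
        lines.append(f"RUN apt-get update && apt-get install -y {pkgs}")
    for path in deps["files"]:
        # copy each observed config/data file into the image
        lines.append(f"COPY {path.lstrip('/')} {path}")
    lines.append(f'ENTRYPOINT ["{deps["entrypoint"]}"]')
    return "\n".join(lines)

dockerfile = generate_dockerfile(observed)
```

The hard part the project addresses is not the emission but mapping observed library loads and file accesses back to distribution packages reliably.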

Contact: Lucian Carata Originator: Dr R. Sohan

Protecting sensitive state in NFV applications using SGX

Network Function Virtualisation (NFV) is a way to virtualise and package existing network functions that run on specialised dedicated hardware into virtual machines that can run on commodity servers. The advantage of this approach is that it decouples network functions from reliance on specialised hardware, speeds up product development cycles and eases the process of deploying such network applications. The downside, however, is that such applications are now exposed to the common attack vectors that exist on commodity servers. NFV applications implementing an Intrusion Detection System (IDS) or an Intrusion Prevention System (IPS) typically contain sensitive state pertaining to network flows that is used to properly detect or prevent intrusions. With NFV, security vulnerabilities may allow attackers to steal or manipulate this internal state. The goal of this project is to run the sensitive parts of existing NFV application(s) within SGX and evaluate the cost (overheads on throughput, effort required to decouple and port sensitive state into SGX) of doing so.

Contact: Lucian Carata Originator: Dr R. Sohan

Distributed indexing for Neo4j

Neo4j relies on a strict index-creation method wherein a centralised node generates the index and passes it on to the other nodes in the system. A distributed indexing scheme would allow multiple nodes to create an index that locally fulfils the ACID properties, while globally the ACID constraints are relaxed. These indexes are then merged and the databases updated. The different possible relaxation policies (especially relaxing Consistency and Isolation) and distributed transaction-handling strategies are to be experimented with.
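A minimal sketch of one possible relaxation: each node builds its local index under full ACID, and the global index is produced by a last-writer-wins merge over versioned entries. The index layout and the versioning scheme are assumptions, not Neo4j internals:

```python
def merge_indexes(local_indexes):
    """local_indexes: list of {key: (value, version)} dicts, one per node.
    Returns a merged index keeping the highest-versioned entry per key."""
    merged = {}
    for index in local_indexes:
        for key, (value, version) in index.items():
            # last writer wins: keep the entry with the highest version
            if key not in merged or version > merged[key][1]:
                merged[key] = (value, version)
    return merged

node_a = {"alice": ("node-17", 3), "bob": ("node-02", 1)}
node_b = {"alice": ("node-41", 5)}
merged = merge_indexes([node_a, node_b])
# merged["alice"] comes from node_b (version 5 beats version 3)
```

Other relaxation policies, e.g. bounded staleness or quorum reads, would replace the version comparison here and are exactly the design space the project explores.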

Contact: Lucian Carata Originator: Dr R. Sohan

Distributed overlapping graph partitioning of streaming data for graph database indexing

Given a stream of data, develop a method to partition the graph over a distributed system. Currently Neo4j (a graph database) indexes the graph at a single node. This is suboptimal, especially for large graphs that evolve over time. This project proposes to partition the graph into sub-graphs that overlap. Initially the graph is naively partitioned; as it is populated and its structure becomes better understood, the model is updated and the partitioning improved accordingly. The partitioned graph is then independently indexed on each node. The overlapping partitions allow inter-machine caching to be improved. The aim of this work is to reduce indexing and querying overheads.
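As a starting point, streaming partitioners in the literature use greedy heuristics; the sketch below implements a linear deterministic greedy (LDG) style placement, assigning each arriving vertex to the partition holding most of its already-seen neighbours, penalised by partition fullness. The stream format and parameters are illustrative:

```python
def ldg_partition(stream, k, capacity):
    """stream: iterable of (vertex, already-seen neighbours).
    Returns a vertex -> partition assignment over k partitions."""
    assignment = {}
    sizes = [0] * k
    for v, neighbours in stream:
        def score(p):
            # neighbours already placed in p, penalised by p's fullness
            placed = sum(1 for n in neighbours if assignment.get(n) == p)
            return placed * (1 - sizes[p] / capacity)
        # prefer the highest score; break ties towards the emptiest partition
        best = max(range(k), key=lambda p: (score(p), -sizes[p]))
        assignment[v] = best
        sizes[best] += 1
    return assignment

stream = [("a", []), ("b", ["a"]), ("c", ["a", "b"]), ("d", [])]
parts = ldg_partition(stream, k=2, capacity=4)
# b and c are placed with a; the unconnected d goes to the other partition
```

Extending this to overlapping partitions means allowing a vertex to be replicated into every partition where it has enough neighbours, which is the caching trade-off this project investigates.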

Contact: Lucian Carata Originator: Dr R. Sohan

Provenance for Internet Artifacts

Often artwork and other media are posted and re-posted on the internet without appropriate attribution to the original artist. Internet search engines now offer advanced search systems that allow you to search by image or even by audio clip. Applying these capabilities to deriving the original sources of artworks would be useful for providing appropriate attribution.

Contact: Lucian Carata Originator: Dr R. Sohan

Provenance Augmentation for Environmental Setup

Extend a tool like Vagrant to log data about what actions are being performed on which VMs, in order to give a better picture of what has and hasn't run on different hosts, and to better debug provisioning issues.

Contact: Lucian Carata Originator: Dr R. Sohan