Part III & MPhil Project Suggestions

These are some suggestions for Part III and MPhil projects within the Digital Technology Group. If you have an idea related to our group's research interests that isn't mentioned here, get in touch.

10G Networking

  1. Content Based Steering for Efficient Data Delivery in 10G NICs

    The coarse-grained coupling between the NIC and the end-host interface in current high-speed NIC designs can lower application performance. In particular, NIC receive queues can only be mapped to CPUs at a very coarse (per-connection) level. This makes it very hard to arrange the system so that data is always received on the same CPU on which it is consumed, resulting in higher application data-processing latency (due to CPU cache misses) and lower throughput (due to CPU stalls). Furthermore, mapping a single connection to one or more CPUs can overload those CPUs with network processing overhead. While recent technologies such as Receive Side Scaling and FlowDirector try to mitigate this issue by spreading load over CPUs and trying to automatically deliver data to the "right" CPUs, they are very simple mechanisms and therefore not very effective.

    In this project you would explore the idea of "content based data steering". The core hypothesis of this work is that if the NIC possesses some simple application logic it can efficiently steer packets to the correct CPUs, reducing latency and increasing performance. You would examine the feasibility of the approach, then design and implement it on the NetFPGA platform. Ideally the system would be based on a BPF-like language that allows applications to specify steering policies to the NIC, coupled with the minor application changes required to leverage the architecture.
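
    As a rough illustration of the idea, the following Python sketch simulates a steering decision in software. It is not NetFPGA code: the policy format, the `make_policy` and `steer` names, the fixed key offset and the use of CRC32 as the hash are all assumptions made for this example.

```python
import zlib

NUM_CPUS = 4  # assumed number of receive queues / CPUs


def make_policy(offset, length):
    """Hypothetical policy: the application key sits at a fixed payload offset."""
    return {"offset": offset, "length": length}


def steer(payload, policy, num_cpus=NUM_CPUS):
    """Return the index of the CPU that should receive this packet."""
    start = policy["offset"]
    key = payload[start:start + policy["length"]]
    # CRC32 stands in for whatever cheap hash the hardware could provide.
    return zlib.crc32(key) % num_cpus


policy = make_policy(offset=0, length=8)
packets = [b"user0001GET /a", b"user0002GET /b", b"user0001GET /c"]
queues = [steer(p, policy) for p in packets]
```

    Packets carrying the same application key land on the same CPU regardless of which connection delivered them, which is the property that per-connection mapping cannot offer.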

    Contact: R. Sohan

  2. Framework For High-Speed Endhost NIC Evaluation

    Benchmarking high-speed (10G) end-host NICs can be a complicated, error-prone and tedious task. There can be significant differences in performance based on the hardware configuration, kernel, driver and software versions involved. Moreover, there are OS-, application- and device-specific runes which often get overlooked or misconfigured, impacting performance. Finally, most device benchmarks are irreproducible because the full configuration is unknown.

    In this project you will attempt to create an extensible open-source high-speed end-host NIC evaluation framework that makes it simple and fast to obtain reproducible benchmarks over a wide range of 10G NICs and application types. You will then proceed to test it by benchmarking a number of production 10G NICs for the purposes of quantitative comparison.

    Contact: R. Sohan


  1. Support For Stateful Lazy Label Propagation In Hadoop MapReduce

    We have previously published on the use of lazy label propagation as a mechanism for post-hoc tracking and auditing of MapReduce data, see link.

    However, this framework is of limited use if it doesn't support stateful jobs -- i.e. jobs that employ state that is shared between different records (e.g., TopK queries). In this project you would extend our framework to stateful MapReduce jobs.

    In order to precisely track lineage when state is involved we have to expose this state to the MapReduce framework. One way to achieve this is by statically analysing the user code and adding instrumentation at specific points in the program that will enable the framework to capture the correct and precise lineage at record level. There are already techniques available for static code analysis that can be leveraged as building blocks for this project.
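
    To illustrate why shared state complicates lineage, here is a minimal TopK sketch. It is written in Python for brevity rather than as Hadoop/Java code, it is not taken from our framework, and the function name and record format are assumptions. The heap is the shared state: whether a record reaches the output depends on every record seen before it, so the record id travels with each heap entry.

```python
import heapq


def top_k_with_lineage(records, k):
    """records: iterable of (record_id, value) pairs.

    Returns (top values, ids of the input records they came from).
    """
    heap = []  # min-heap of (value, record_id): the state shared across records
    for record_id, value in records:
        if len(heap) < k:
            heapq.heappush(heap, (value, record_id))
        elif value > heap[0][0]:
            # The smallest retained value is evicted, along with its lineage.
            heapq.heapreplace(heap, (value, record_id))
    ranked = sorted(heap, reverse=True)
    return [v for v, _ in ranked], [rid for _, rid in ranked]


values, lineage = top_k_with_lineage(
    [("r1", 5), ("r2", 9), ("r3", 1), ("r4", 7)], k=2
)
```

    Here the lineage is threaded through by hand; the project would aim to obtain the equivalent instrumentation automatically from static analysis of the user code.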

    The work done in this project could potentially be published at a technical conference.

    Interested students should have Java programming experience and will ideally be familiar with the Hadoop MapReduce implementation.

    Contact: Dr S. Akoush

  2. Transparent Virtual Machine State Introspection

    Cloud platforms are a boon for users wishing to build and deploy distributed applications. It takes seconds to provision and fire up a new virtual machine and machines can be brought down just as easily.

    From an external viewpoint, however, it is very difficult for the cloud operator to obtain visibility into the configuration and dependencies of their customers' machines. Cloud operators often need immediate answers to pressing questions such as:

    1. What OS is this virtual machine running?
    2. What version is the OS?
    3. Is (e.g.) Apache running? If so, is it patched?
    4. Which other machines is it communicating with?
    5. Why is this virtual machine consuming so much CPU?
    6. Why is this virtual machine opening up so many listening ports?

    Having immediate and accurate answers to these (and other related) questions allows operators to be more flexible, responsive and accurate when addressing customer and system needs.

    We observe, however, that modern operators can leverage CPU support for virtualisation to:

    • Access virtual machine system memory.
    • Trap system calls made in the virtual machine.
    • Inject system calls into the virtual machine's kernel.
    • Intercept all network and disk I/O generated to and from the virtual machine.

    The goal of this project will be to build a set of tools that allow cloud operators to answer questions about their customers' virtual machines. In particular you will leverage memory fingerprinting, system-call interception, system-call injection and I/O interception to build applications that allow an operator to characterise (and possibly modify) a running virtual machine without requiring client co-operation.

    The work done in this project could potentially be published at a technical conference.

    Interested students should have basic knowledge of virtualisation. Experience with the Xen or KVM hypervisors is advantageous as is programming experience in C.

    Contact: Dr R. Sohan

  3. Programmatic Generation Of Current Operating Environment

    The current generation of cloud providers sell compute units in the form of virtual machines that end-users customise for their use. While a number of users are beginning to use automatic provisioning and configuration tools such as Puppet and Chef, the vast majority will hand-roll configurations in an ad-hoc manner.

    While hand-rolling configurations is simple and easy, it makes it hard to record changes in configuration and data in order to replicate or store them. Effectively, retrieving and recording changes from the baseline configuration becomes a hit-and-miss affair.

    In this project the goal is to create a tool that will compare an existing system configuration against a baseline and automatically generate an output file that allows a user to characterise and replicate the changes in configuration required to bring a pristine install into the current state. Possible extensions to this work include auto-generation of different configuration formats (e.g. output Puppet manifests or Chef recipes), translation between configuration formats and auto-detection and segregation of machine-specific variables in the output configuration.
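
    The core comparison step might look something like the sketch below. It assumes each system has already been summarised as a mapping from file path to content hash; the snapshot format and the `diff_config` name are hypothetical, and a real tool would also need to cover packages, services, users and so on.

```python
def diff_config(baseline, current):
    """Compare two {path: content_hash} snapshots and describe the changes."""
    base_paths, cur_paths = set(baseline), set(current)
    return {
        "added": sorted(cur_paths - base_paths),
        "removed": sorted(base_paths - cur_paths),
        "modified": sorted(
            p for p in base_paths & cur_paths if baseline[p] != current[p]
        ),
    }


baseline = {"/etc/hosts": "a1", "/etc/motd": "b2"}
current = {"/etc/hosts": "a1", "/etc/motd": "c3", "/etc/nginx/nginx.conf": "d4"}
changes = diff_config(baseline, current)
```

    The resulting dictionary is the kind of intermediate form from which a Puppet manifest or Chef recipe could later be generated.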

    The work done in this project could potentially be published at a technical conference.

    Interested students should have basic knowledge of Linux system administration.

    Contact: Dr R. Sohan

  4. Library-level Fault Injection And Masking Towards Increased Dependability

    A typical application interacts with the kernel using the standard C library, which acts as a wrapper around the system calls, abstracting away the complexity associated with the system call mechanism, which can be platform and architecture dependent.

    Under most conditions these system/library calls behave as expected, but under certain run-time environment conditions they may return valid but unexpected values. For example, a write may return a disk full error or a read may return fewer bytes than expected. Previous work has shown that few applications test for these errors and most will fail if one of the rarer errors is returned.

    The goal of this project is to create a framework to allow programmers to test against, mask and mitigate these errors by interposing between the application and library call interface. For MPhil students this could be extended to a comprehensive survey of popular application response to system-call errors.
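
    The sketch below illustrates the inject-and-mask idea in Python rather than at the C library level, purely to keep the example short. `FaultyWriter` is a hypothetical stand-in for an interposed write call, and `write_all` is one possible masking wrapper; neither comes from the OPUS codebase.

```python
import errno


class FaultyWriter:
    """Hypothetical interposed write(): accepts at most `max_bytes` per call
    and can fail once with a chosen errno before behaving normally."""

    def __init__(self, max_bytes, fail_with=None):
        self.max_bytes = max_bytes
        self.fail_with = fail_with
        self.data = bytearray()

    def write(self, buf):
        if self.fail_with is not None:
            err, self.fail_with = self.fail_with, None
            raise OSError(err, "injected fault")
        chunk = buf[:self.max_bytes]
        self.data += chunk
        return len(chunk)  # may be a short write


def write_all(writer, buf):
    """Mask short writes and EINTR by retrying; propagate all other errors."""
    remaining = bytes(buf)
    while remaining:
        try:
            written = writer.write(remaining)
        except OSError as exc:
            if exc.errno == errno.EINTR:
                continue
            raise
        remaining = remaining[written:]


w = FaultyWriter(max_bytes=3, fail_with=errno.EINTR)
write_all(w, b"hello world")
```

    An application that calls `write` once and ignores the return value would silently lose data against the same faulty writer, which is the class of bug the framework is meant to expose.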

    As part of the OPUS project in FRESCO we have an implementation of a library level interposition framework to capture provenance. This codebase can be made available for the purposes of this project.

    The work done in this project could potentially be published at a technical conference.

    Interested students should have C experience and a basic understanding of the Linux C-library/system-call interface and implementation.

    Contact: N. Balakrishnan, Dr R. Sohan

  5. IORecordReplay (IORe^2)

    The complexity of the modern software stack makes deterministic reproduction of application-kernel interactions difficult if not impossible. This is especially pronounced in the storage stack where a single read or write call has to traverse the application, the system call, VFS, IO scheduling and block layers to reach the hardware device.

    While the obvious mechanism for reproducing application-kernel interaction is interposing at the syscall layer, doing so tends to introduce the "observer effect", affecting both accuracy and precision. Moreover, most of these techniques require significant kernel and userspace modification.

    In the FRESCO project we've produced a very low-overhead (less than 5% in the common case) system interposition library that is capable of recording all process IO calls. You will leverage this library to:

    1. Produce a tool that is capable of accurately and precisely recording the IO signature of a user-level process.
    2. Extend the tool to replay the trace.
    3. (Time permitting) Modify the replay according to user specifications.
    4. (Time permitting) Correlate user-level IO calls with kernel level actions (e.g. scheduling, read-ahead, flush to disk etc).
    5. (Time permitting) Create a model for the IO behaviour of applications and generate synthetic workloads that can be tested on various environments.
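
    The first two steps can be sketched in a few lines of Python. This is only an illustration of the record/replay idea on file-like objects: it does not use the FRESCO interposition library, and the `RecordingFile` class, the trace format and the `replay` function are assumptions made for the example.

```python
import io


class RecordingFile:
    """Wrap a file-like object and log each read/write as (op, offset, size)."""

    def __init__(self, raw):
        self.raw = raw
        self.trace = []

    def read(self, size):
        offset = self.raw.tell()
        data = self.raw.read(size)
        self.trace.append(("read", offset, len(data)))
        return data

    def write(self, data):
        offset = self.raw.tell()
        written = self.raw.write(data)
        self.trace.append(("write", offset, written))
        return written

    def seek(self, offset):
        return self.raw.seek(offset)


def replay(trace, target):
    """Re-issue the recorded operations against another file-like object."""
    for op, offset, size in trace:
        target.seek(offset)
        if op == "read":
            target.read(size)
        else:
            target.write(b"\0" * size)  # only the size was recorded, not the payload


f = RecordingFile(io.BytesIO())
f.write(b"abcdef")
f.seek(2)
f.read(3)
trace = f.trace
```

    Because the trace is plain data, the later steps (modifying the replay, or fitting a model and generating synthetic workloads) can operate on it without touching the recorder.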

    Well executed, this project will result in an academic paper in the storage area. Interested students should have some experience programming in C/C++ and a basic understanding of the Linux C-library/system-call interface and implementation.

    Contact: Dr R. Sohan, N. Balakrishnan

  6. Demincer: Making Meat from Mince

    The concept of scientific reproducibility is becoming an increasingly important issue both in academia and industry [1]. One of the pivotal aspects of reproducibility is the ability to reuse existing results for the purposes of extension and/or comparison. However, given that the standard method for disseminating results in today's scientific environment is (PDF) paper publication, it would be very useful to have a tool that is able to extract information from existing papers so that it can be augmented and/or extended.

    In essence we are advocating the creation of an extensible tool that is able to extract text, figures, tables and other domain specific information from published papers in a manner that facilitates their reuse. Such a tool would be very useful in lowering the entry barrier to reproducibility and is thus likely to encourage reproducible research.

    Contact: Dr R. Sohan

  7. Elephant: Long memory for reproducible runs

    The concept of scientific reproducibility is becoming an increasingly important issue both in academia and industry [1]. One of the pivotal aspects of reproducibility is the ability to reproduce the environment in which an experiment is run. However, the ad-hoc nature of today's scientific reporting process means most researchers only record the parameters deemed interesting, meaning that those seeking to reproduce their results are reduced to either trying to replicate them from published results or contacting the authors directly.

    In this project we are advocating the creation of "Elephant", an extensible reproducibility assurance tool. Elephant will capture the environment of a program (including hardware and software) such that the details of the run are recorded in a precise and complete manner. Further to this, Elephant can then be used to validate the environment before a run is reproduced. Elephant is intended to be extensible so it can be augmented with domain-specific reproducibility information.
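
    A minimal sketch of the capture-then-validate cycle, using only Python's standard `platform` module, is shown below. The choice of fields and the `capture_environment` and `validate_environment` names are assumptions; a real tool would record far more (libraries, kernel parameters, hardware details).

```python
import platform


def capture_environment():
    """Record a small, illustrative subset of the run-time environment."""
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
    }


def validate_environment(recorded, current=None):
    """Return {field: (recorded, current)} for every field that differs."""
    if current is None:
        current = capture_environment()
    return {
        key: (value, current.get(key))
        for key, value in recorded.items()
        if value != current.get(key)
    }


snapshot = capture_environment()
mismatches = validate_environment(snapshot)
```

    An empty result means the current environment matches the recorded one on every captured field; domain-specific extensions would add their own keys to the snapshot.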

    Contact: R. Sohan

DNS Analysis, Optimisation And Security

  1. Classifying DNS use

    Given about 240M DNS message question/response pairs per day from two of Nominet's main nameservers that are responsible for the UK namespace, build a system that is able to classify the requestors in near real time (that is, with as little data and in as little time as possible). Some of the heavy users are large ISPs' resolvers, such as Google's DNS. Others are continuously scanning the namespace for changes for various reasons, ranging from speculation on the domain name market to botnets, spambots and regular uptime tests. Additionally, given the dataset and possible NoSQL solutions like Hadoop/MapReduce, what other interesting aspects can be derived?

    Contact: R. Sohan

  2. Spotting distributed whois abuse

    We occasionally see attempts to mine large amounts of data from our whois service. This may be seen as, say, a large number of requests from a single client. However, we have anti-abuse limits in place to stop this, so an obvious tactic would be to use a distributed system where each client performs a small number of queries and so no individual limits are reached. It might be possible to see this happening and impose a single limit on "cooperating clients" (assuming it could be done in real time).

    Contact: R. Sohan

  3. Design a dedicated DNSSEC hardware security module (HSM)

    Building a security module is hard. Outside interference may lead to compromised keys, or might trick the device into performing undesired operations. Lack of interference implies lack of an entropy source as well, hence the device needs a good internal source of entropy to generate keys. Design a device that is able to generate a key pair, guard the private keys, generate signatures given the right credentials and perform well (at least 20 signatures a second).

    Contact: R. Sohan

  4. Next Generation Rate Limiting

    DNS servers are often abused for amplified reflection attacks, where a third party receives large responses from a large nameserver due to incoming queries with a spoofed source address. Currently, some basic rate limiting is possible which, however, can easily be circumvented. We're interested in a taxonomy of possible solutions or novel ideas to defend networks against large scale amplified reflection attacks.
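
    For reference, the kind of basic rate limiting referred to above can be sketched as a token bucket per source prefix. This is an illustrative assumption about what "basic" means here, not a description of any particular nameserver's implementation; the class names, the /24 grouping and the parameters are all hypothetical.

```python
class TokenBucket:
    """Allow `rate` responses per second with bursts of up to `burst`."""

    def __init__(self, rate, burst, now):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ResponseRateLimiter:
    """One bucket per (claimed) source /24, keyed on the dotted-quad prefix."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow(self, source_ip, now):
        prefix = source_ip.rsplit(".", 1)[0]
        bucket = self.buckets.get(prefix)
        if bucket is None:
            bucket = self.buckets[prefix] = TokenBucket(self.rate, self.burst, now)
        return bucket.allow(now)


limiter = ResponseRateLimiter(rate=1.0, burst=2)
decisions = [limiter.allow("192.0.2.10", now=0.0) for _ in range(3)]
```

    Each nameserver enforces such a limit independently, so an attacker can stay under it by spreading queries across many servers; this is one reason a taxonomy of stronger defences is of interest.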

    Contact: R. Sohan

Indoor Smartphone and Person Tracking

  1. Bluetooth Low Energy Phone Tracking

    This project will look at positioning mobile phones using BLE beacons distributed around the environment. This is an active topic in industry at the moment, but the beacons are expensive and the phone platforms have only just begun to support BLE properly. We will create beacons using Raspberry Pis and cheap Bluetooth dongles. Unknowns include what power to beacon at, what update rate can be expected, what the continuous scan costs on the phone, whether we can infer good distance estimates and how easy it will be to spoof beacons and thereby cause havoc! The project has a high chance of international publication and further PhD work. Programming for Android and/or iOS needed, along with good Linux skills. Previous knowledge of Bluetooth (in any form) is valuable but not essential.
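
    One common starting point for distance estimation is the log-distance path-loss model, sketched below. The default calibration values (RSSI of -59 dBm at 1 m, path-loss exponent of 2) are placeholder assumptions rather than measurements; finding out whether any such model gives good estimates indoors is part of the project.

```python
def estimate_distance(rssi_dbm, rssi_at_1m_dbm=-59.0, path_loss_exponent=2.0):
    """Log-distance path-loss model: distance in metres from a single RSSI reading.

    rssi_at_1m_dbm and path_loss_exponent are environment-specific and
    would need to be calibrated; the defaults here are illustrative only.
    """
    return 10 ** ((rssi_at_1m_dbm - rssi_dbm) / (10 * path_loss_exponent))


near = estimate_distance(-59.0)
far = estimate_distance(-79.0)
```

    In practice RSSI fluctuates considerably from reading to reading, so any real estimate would need filtering over many readings.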

    Contact: R. Harle

  2. Bluetooth Low Energy Sensor Network

    This project will invert the typical Bluetooth tracking scenario by using BLE beacons on the person rather than in the environment, building a sensor network of Raspberry Pi BLE sensors. Beacons are small and last for years, so each person could conceivably carry several to aid positioning and to minimise body attenuation. The research aim will be to establish the capabilities of such a system in terms of range, accuracy, power consumption, update rate and maximum beacon numbers. The project has a high chance of international publication and further PhD work. Experience working with Raspberry Pis or Linux environments essential. Previous knowledge of Bluetooth (in any form) is valuable but not essential.

    Contact: R. Harle

  3. Smartphone Camera-based Movement Classification

    A key problem in Pedestrian Dead Reckoning is determining the direction of motion. It is hard to distinguish between back steps, side steps and forward steps. This project will look at repurposing smartphone cameras to estimate relative movement direction based on feature tracking applied to the ceiling or floor. Many optical flow-like algorithms exist that can be trialled. The important result is not just that the direction is correct, but also that the drain on the smartphone battery is minimised. This project will be carried out using the Android platform: some experience of programming for it will be necessary. The optical algorithms may be imported from e.g. OpenCV (which has an Android port), or written from scratch if preferable.
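
    Once feature displacements are available, the classification step itself can be very simple, as in the Python sketch below. It assumes the camera faces the floor, that the top of the image points in the user's forward direction, and that image y grows downwards; the `classify_step` name and the labels are hypothetical, and the feature tracking itself is not shown.

```python
def classify_step(flow_vectors):
    """flow_vectors: list of (dx, dy) feature displacements in pixels.

    The scene appears to move opposite to the camera, so a forward step
    shifts floor features towards the bottom of the image (positive dy).
    """
    n = len(flow_vectors)
    mean_dx = sum(dx for dx, _ in flow_vectors) / n
    mean_dy = sum(dy for _, dy in flow_vectors) / n
    if abs(mean_dy) >= abs(mean_dx):
        return "forward" if mean_dy > 0 else "backward"
    return "side-right" if mean_dx < 0 else "side-left"


step = classify_step([(0.5, 8.0), (-0.3, 7.5), (0.1, 9.2)])
```

    Averaging over a handful of features keeps the per-frame cost low, which matters given the emphasis on battery drain.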

    Contact: R. Harle

  4. WiFi-IR Positioning

    The dominant indoor positioning approach is the use of WiFi fingerprints. However, these typically cannot locate a device unambiguously to a room, since WiFi penetrates walls. We have in the past developed an infrared-based location system (the Active Badge) that had people wear IR emitters and used IR receivers in each room. IR was very good at room localisation (IR does not penetrate walls) but we could not realistically install receivers everywhere. This project will look at exploiting the rising number of IR transmitters on recent smartphones (e.g. Galaxy S4) and simple networked IR receivers (built from e.g. Raspberry Pis or similar) to create a modern-day Active Badge that is fused with WiFi positioning data to create a more robust and ubiquitous tracking system. Experience of Android programming essential.
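
    One very simple way the two sources could be fused is sketched below: IR sightings are used to prune the WiFi candidates, with a fallback to WiFi alone where no receiver has seen the phone. The `fuse_position` name, the score format and the room names are assumptions for illustration; a real system would more likely use a probabilistic filter.

```python
def fuse_position(wifi_scores, ir_rooms):
    """wifi_scores: {room: likelihood} from WiFi fingerprinting.
    ir_rooms: set of rooms whose IR receiver recently saw the phone.
    """
    confirmed = {room: s for room, s in wifi_scores.items() if room in ir_rooms}
    # Fall back to WiFi alone when no receiver-equipped room matches.
    candidates = confirmed or wifi_scores
    return max(candidates, key=candidates.get)


wifi_scores = {"office_a": 0.45, "office_b": 0.40, "corridor": 0.15}
room = fuse_position(wifi_scores, ir_rooms={"office_b"})
```

    The fallback is what lets the system remain usable in rooms without receivers.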

    Contact: R. Harle

Programming language research to support physical science researchers

The Computer Lab is currently running a research project to apply programming language research to support programming in the sciences, via tools and languages (a slightly longer synopsis can be found here). As part of this project, we are investigating augmenting code with specifications to aid verification, program comprehension and construction, and improve bug analysis.

  1. Stencil access specifications for verifying numerical code

    Stencil computations are array-based transformations, where each element of an output array, at position i, is computed from a finite set of neighbouring elements around position i in some input array(s) (e.g., convolution, the Game of Life). Some stencils are complicated, detailed, and dense (see for example, this stencil computation in a Navier-Stokes fluid simulator), and errors can easily be introduced by accidentally permuting indices, offsets or arrays, or even by omitting particular indices.

    The goal of this project is to design and implement a language of abstract stencil specifications, which can be attached to an existing general-purpose language, e.g. Fortran. These specifications will provide a guide to the programmer and a verification technique for the compiler.
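
    The following Python sketch gives a flavour of what checking a kernel against an access specification means. It is a run-time check on a toy three-point stencil, whereas the project targets specifications verified by the compiler; `TracedArray`, `offsets_used` and the set-of-offsets specification format are assumptions made for this example.

```python
class TracedArray:
    """List wrapper that records every index read."""

    def __init__(self, data):
        self.data = data
        self.reads = []

    def __getitem__(self, i):
        self.reads.append(i)
        return self.data[i]


def offsets_used(kernel, array, i):
    """Run the kernel at position i and return the relative offsets it read."""
    array.reads.clear()
    kernel(array, i)
    return {j - i for j in array.reads}


def smooth(a, i):
    return (a[i - 1] + a[i] + a[i + 1]) / 3


def buggy_smooth(a, i):
    return (a[i - 1] + a[i] + a[i]) / 3  # a[i + 1] accidentally omitted


SPEC = {-1, 0, 1}  # hypothetical specification: a symmetric three-point stencil
a = TracedArray([1.0, 2.0, 3.0, 4.0, 5.0])
ok = offsets_used(smooth, a, 2) == SPEC
bad = offsets_used(buggy_smooth, a, 2) == SPEC
```

    The buggy kernel still produces a plausible number, but its access pattern no longer matches the specification, which is the kind of error a specification language would catch.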

    For more examples of why this might be useful and how it might work see here

    Contact: Dominic Orchard

Smart phone usage and energy consumption

  1. Energy consumption of web-service APIs

    It is common for smartphone apps to make requests to server-side APIs either to download information or to post notifications. Commonly this is done using XML-RPC over HTTP. However, it could well be expected that this carries a considerable energy overhead due to the use of a TCP connection, the addition of HTTP headers and the text-based encoding of information. This project seeks to measure the potential energy savings of different options such as more efficient encodings (e.g. Google's Protobuf) and the use of UDP.
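
    As a rough indication of the encoding overhead alone, the sketch below compares the size of an XML-RPC request body with a fixed-layout binary encoding of the same values, using only the Python standard library. The method name, the payload and the `struct` layout are made up for the example, and `struct` is used as a stand-in for a binary format such as Protobuf. Byte counts are only a proxy: the relationship between bytes sent and energy consumed is precisely what the project would measure.

```python
import struct
import xmlrpc.client

# Hypothetical payload: a device id and a latitude/longitude pair.
params = (42, 51.5074, -0.1278)

# Text-based XML-RPC request body (HTTP headers would add further bytes).
xml_body = xmlrpc.client.dumps(params, methodname="report_location").encode("utf-8")

# Fixed-layout binary encoding: one unsigned 32-bit int and two doubles,
# in network byte order with no padding.
binary_body = struct.pack("!Idd", *params)

sizes = {"xmlrpc": len(xml_body), "binary": len(binary_body)}
```

    The binary form is far smaller but carries no field names or type information, so both ends must agree on the layout in advance.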

    Energy measurement hardware is available, as are Android phones for testing and equipment for building a controlled WiFi testbed.

    Interested students will need to demonstrate good programming ability and application development experience, along with a good understanding of TCP, UDP, IP, WiFi and cellular networking.

    Contact: Andrew Rice

  2. Reality-based benchmarks

    There are a variety of benchmarking tools available for Android which can produce a performance score for a particular handset. However, these tests do not really reflect the needs of actual phone users. The idea of this project is to use the Device Analyzer dataset to come up with better benchmarks.

    The project will need to survey the various properties that current benchmarks are attempting to measure. Data analysis from Device Analyzer can then be used to work out how many users would be interested in these measurements and to see if there are more important properties to measure. We have a variety of phone handsets which can then be used for testing to see how they perform with the new designs.

    Interested students will need to demonstrate good programming ability and have an interest in systems measurement. It is expected that data analysis will require some sort of distributed processing system such as Hadoop running on the DTG cluster.

    Contact: Andrew Rice