Kiwi Bowtie Genome Matcher Demo

PAGE UNDER CONSTRUCTION LATE 2016

Preface

The well-known BowTie algorithm uses the Burrows-Wheeler transform (BWT) (both of whom were Computer Laboratory staff) to search for DNA strings with minor edits. The search string is processed from both ends at once using meet-in-the-middle heuristics.

On this page we will put the files needed to run this demo under Kiwi and also as a free-standing dotnet application for performance extrapolation.

Possible Project Proposal

Investigation of Domain-specific Language Constructs for Genomic Algorithm Acceleration targeting FPGA.

It is widely accepted that many problems in scientific computing can be vastly accelerated using either FPGA or GPU execution resources. Also, the FPGA approach, in particular, leads generally to a significant saving in execution energy. The product of the two gains can be typically one thousand fold. The problems however, tend to resolve around the need to re-factor code, difficulty of use for the advanced tool chains and lack of debugging visibility. A field-programmable gate array (FPGA) is a large silicon chip that is completely programmable in terms of its internal circuitry.

In the last decade more than 100 papers have reported the aforesaid savings with biomedical applications, particularly those concerned with genomes, when implemented on the FPGA. The present author published such work in 2008 [U1] based on his Kiwi approach where dotnet programs are converted to hardware circuits for the FPGA, augmented with large amounts of DRAM and a high-performance connection to a file server over Ethernet.

The Kiwi work is distinguished from the mainstream research on compilation of algorithms to hardware in two main ways. Firstly it starts with dotnet bytecode and secondly it supports parallel programs. The mainstream research is exemplified by a flow such as LegUp that compile only single-threaded C without dynamic storage allocation. Alternatively, one can redirect to FPGA the open\_CL. Open\_CL was intended for GPUs which is somewhat tortious to use. Another option is Maxerler's OpenFlow which is suitable for a certain class of applications. The two leading suppliers of FPGA (Altera and Xilinx) both offer C-to-gates flows that are broadly equivalent to LegUp. Note that Greaves developed a C-to-gates tool that was licensed commercially in 2000 to Tenison EDA (now part of Synopsys) but this was suppressed in favour of the reverse flow that was commercialised for totally separate purposes.

The dotnet system was developed by Microsoft but is now a public standard with open source implementations from the Mono project. Dotnet code is most often generated from C\# programs. C\# is significantly more advanced and cleaner than C or C++ but is currently not as widely used as C/C++, or indeed the other languages commonly used in bioinformatics, Java and Python. But Dotnet is a sound intermediate code not only for C\# compilers, but also for custom domain-specific front end experiments that are proposed in this current project.

Parallel programming splits into three broad classes according to the sharing of data between parallel tasks. So-called embarrassingly parallel programs have no sharing and each task can run to completion in isolation. A typical example is the calculation of pixel in a Mandlebrot plot. A second class is stream or pipelined processing where each task does some small amount of processing of the data before forwarding it on to the next task. This style is the basis of Maxeler's stream processing on FPGA and it probably works well provided no complex feedback of data to earlier in the stream is required.

Implementation

When searching for a given fragment, the transform of the fragment can equally well be done offline or on the FPGA. Ditto its indexing. If the index is done offline then more data needs to be moved into the FPGA before commencing the search, but the data to be searched can be quite big and high-performance filesystem I/O is needed in any case.

The frequency of index points is a major design decision: too few and a lot of linear scan is needed to find the correct entry, and too many leads to the index being too large. A good size will fit in a DRAM row so that no new RAS cycles are needed during the linear scan. The Kiwi Performance Predictor should assist on this matter.

When using inexact match, a major design decision is whether to backtrack over assumptions and whether to increase parallelism by exploring both sides of an edit at once.

Tasks/Progress

The Burrows-Wheeler transform, and its inverse, was implemented in C# as a library component that could be used in a number of projects. This works fine on mono and FPGA.

The parsing of ASCII-format FASTQ files on the FPGA works fine.

Some basic tests of the whole algorithm have been run on RTL_SIM (Verilator applied to the KiwiC output) in conjunction with the Kiwi substrate filesystem, also in simulation.

We currently do not have a working file server for the physical FPGAs owing to the previous implementation being on Windows and us not having any Windows machines. But getting it working again on the Zynq platform should be no problem.

We have not done any evaluation, comparing our performace with other solutions, but we expect performance to be similar but ease of design and performance prediction to be much better.

Source Code

Bowtie Genome Alignment (Burrows-Wheeler Transform) Src Code Fragments CSharp.

Alternative n-gram implementation

The great thing about HLS is that it's a piece of cake to explore totally different algorithms without manually recoding any RTL.

Using the same FASTQ and DNA parser framework and FPGA, it was relatively easy to also test a simple n-gram based matcher ...

Conclusions

This is an ideal demo for determining the accuracy of our performance predictor, getting rapid feedback after each C# edit without an FPGA place and route. There are a lot of minor design aspects that greatly affect speed when all the data is preloaded in the DRAM. It is also an application that can nicely demonstrate streaming data from the fileserver since it is not sufficiently I/O bound that FPGAs stop delivering performance. We are working on an analyical model that captures and models the sweet spots for FPGA accelleration.

Bibliographic Links

[U1] Synthesis of a Parallel Smith-Waterman Sequence Alignment Kernel into FPGA Hardware Satnam Singh, David Greaves, and Sutirtha Sanyal Many-Core and Reconfigurable Supercomputing Conference 2009, Berlin. MSR LINK. PDF.

There are a number of papers where BWT, BowTie and TopHat have been run on FPGA ... please cite here.