Suggestion from ARM

Tucked in at the top of this page: ARM Cortex-M based SoC design for Video Processing (October 2017) PDF

Project suggestions from David Greaves.

ACS/Part II Project Suggestion(s)

These Kiwi-related suggestions were new for 2016/17.

Kiwi Performance IDE Plugin

The Kiwi HLS compiler converts C# programs to Verilog RTL. It takes all afternoon to compile once for the FPGA, so getting a rapid performance indication after each edit in the IDE would be very useful. In this project, after each edit in the IDE, a window automatically updates to show the expected performance of the program on datasets of various sizes, taking into account the expected fileserver data rate and DRAM cache hit rate. Hence the user can see whether their edit is likely to improve the true performance when the real system is compiled and run.

KiwiC Exception Handling

The Kiwi HLS compiler converts C# programs to Verilog RTL. The C# programs may contain catch blocks for exceptions, but Kiwi ignores these, so no exceptions can be caught. The project is to implement catch blocks LINK.

KiwiC Custom-Width Floating Point

The Kiwi HLS compiler converts C# programs to Verilog RTL. The C# programs may contain floating point operations of single or double precision. However, FPGAs can implement any width, and a lower-precision floating point arithmetic system is preferred for some dense or low-energy applications. So that the RTL continues to behave the same as the original C#, all of the basic arithmetic types need overloading with a C# class for the new precision. Moreover, FPGA implementations of the floating point ALUs need to be created and plumbed into Kiwi's technology library so that it instantiates them on the FPGA. Finally, the energy costs of running with the custom precision for a chosen application can be investigated. LINK.
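
As a rough illustration of the overloading idea, the sketch below emulates a reduced mantissa width in software by rounding an IEEE-754 single after every operation. It is written in C++ rather than C#, and the names `round_mantissa` and `SmallFloat` are hypothetical; nothing here is taken from Kiwi or its technology library. It assumes `float` is a 32-bit IEEE-754 single and leaves the exponent range unchanged.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Round a 32-bit IEEE-754 single to `mant_bits` explicit mantissa bits
// (1..23) using round-to-nearest, ties-to-even. Hypothetical helper.
inline float round_mantissa(float x, int mant_bits) {
    assert(mant_bits >= 1 && mant_bits <= 23);
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    const int drop = 23 - mant_bits;               // low-order bits to discard
    if (drop == 0) return x;                       // already at full width
    if ((u & 0x7F800000u) == 0x7F800000u) return x; // leave Inf and NaN alone
    const std::uint32_t half = 1u << (drop - 1);
    const std::uint32_t lsb = (u >> drop) & 1u;    // lowest bit that is kept
    u += half - 1u + lsb;                          // nearest, ties to even
    u &= ~((1u << drop) - 1u);                     // clear the discarded bits
    float y;
    std::memcpy(&y, &u, sizeof y);
    return y;
}

// Hypothetical wrapper type: compute in single precision, then round the
// result back to the narrower mantissa after each operation.
template <int MantBits>
struct SmallFloat {
    float v;
    explicit SmallFloat(float x) : v(round_mantissa(x, MantBits)) {}
    friend SmallFloat operator+(SmallFloat a, SmallFloat b) { return SmallFloat(a.v + b.v); }
    friend SmallFloat operator*(SmallFloat a, SmallFloat b) { return SmallFloat(a.v * b.v); }
};
```

With `MantBits` set to 7 the type behaves much like bfloat16, so `SmallFloat<7>(1.0f) + SmallFloat<7>(0.001f)` yields 1.0 because the small addend falls below half a unit in the last place. A software model of this kind could serve as the reference against which the FPGA ALUs are checked.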

Kiwi Sequential Consistency

Modern CPUs do not preserve memory read and write ordering, leading to concurrency bugs or to the need to insert memory fence instructions. The Kiwi HLS compiler converts multithreaded C# programs to Verilog RTL for FPGA and it exposes the same problems. Are there different possible solutions for HLS compared with general multiprocessing? "shared_memory_example50".
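
The ordering problem on CPUs can be illustrated with the classic store-buffering litmus test. The sketch below is C++ rather than C#, says nothing about how Kiwi itself behaves, and uses a hypothetical function name. Each thread writes one flag and then reads the other; under sequential consistency at least one thread must observe the other's write.

```cpp
#include <atomic>
#include <thread>

// Store-buffering litmus test. Returns true if both threads read 0,
// which is the outcome that sequential consistency forbids.
inline bool both_reads_saw_zero() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join();   // join makes r1 and r2 safe to read here
    t2.join();
    return r1 == 0 && r2 == 0;
}
```

With `std::memory_order_seq_cst` the function should never return true. Weakening all four accesses to `std::memory_order_relaxed` permits the both-zero outcome, and hardware store buffers can make it observable in practice. The project question is whether an HLS flow, which controls the memory system it generates, needs fences at all to rule this out.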

Older Suggestions Now Follow

Random Instruction Sequence Generator

The OpenRISC processor is an open source family of CPU cores and SoCs with a GNU C compiler and GNU toolchain. Ditto RISC-V. Currently there are various RTL models in Verilog and a fast instruction set simulator (ISS) written in C. The two are tested using both hand-crafted and compiler-generated sequences of instructions. However, testing with random sequences of instructions has not been done, despite this being a known good means of finding bugs.

In this project you will generate random sequences of instructions that are valid. You will take one of the simulators for OpenRISC, such as the SystemC simulator from Greaves+Pusovnik that contains both the RTL and fast ISS simulators. You will measure and predict the 'fault coverage' you have achieved. You will also find one or two real bugs in the OpenRISC implementation - a useful contribution since this core is now being used in real projects (e.g. on the International Space Station).
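
A minimal sketch of generating sequences that are valid by construction is given below. The instruction format is a hypothetical toy ISA, not the OpenRISC or RISC-V encoding, and the names `Insn`, `is_valid` and `random_program` are invented for illustration. Branches are restricted to forward targets so that every generated sequence terminates, which is one possible design choice among several.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical toy ISA for illustration only.
enum class Op : std::uint8_t { Add, Sub, Load, Store, Branch };

struct Insn {
    Op op;
    std::uint8_t rd, ra, rb;  // register numbers, 0..15
    std::int32_t imm;         // data offset, or relative branch distance
};

constexpr int kNumRegs = 16;

// Valid means: registers in range, and any branch jumps strictly forward
// to a target no further than one past the last instruction.
inline bool is_valid(const Insn& i, std::size_t pc, std::size_t len) {
    if (i.rd >= kNumRegs || i.ra >= kNumRegs || i.rb >= kNumRegs) return false;
    if (i.op == Op::Branch) {
        const long target = static_cast<long>(pc) + i.imm;
        return target > static_cast<long>(pc) && target <= static_cast<long>(len);
    }
    return true;
}

// Generate `len` random instructions from a seed, so that any failing
// sequence can be reproduced from the seed alone.
inline std::vector<Insn> random_program(std::size_t len, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> op_dist(0, 4);
    std::uniform_int_distribution<int> reg_dist(0, kNumRegs - 1);
    std::uniform_int_distribution<int> imm_dist(-128, 127);
    std::vector<Insn> prog;
    prog.reserve(len);
    for (std::size_t pc = 0; pc < len; ++pc) {
        Insn i{};
        i.op = static_cast<Op>(op_dist(rng));
        i.rd = static_cast<std::uint8_t>(reg_dist(rng));
        i.ra = static_cast<std::uint8_t>(reg_dist(rng));
        i.rb = static_cast<std::uint8_t>(reg_dist(rng));
        if (i.op == Op::Branch) {
            // Pick the target first, then derive the relative distance.
            std::uniform_int_distribution<long> tgt(static_cast<long>(pc) + 1,
                                                    static_cast<long>(len));
            i.imm = static_cast<std::int32_t>(tgt(rng) - static_cast<long>(pc));
        } else {
            i.imm = imm_dist(rng);
        }
        prog.push_back(i);
    }
    return prog;
}
```

In the real project the same sequence would be fed to both the RTL model and the ISS, and the architectural state compared after each instruction or at the end of the run.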

CPU Energy Use Logging

The OpenRISC processor is an open source family of CPU cores and SoCs with a GNU C compiler and GNU toolchain. It is available in Verilog RTL and other forms. The Verilog can be converted to C++ using a free program called Verilator. The resulting C++ uses assignment macros for each update. Energy use in processors depends greatly on the number of bits that change value at each clock event. The project is to replace the macros with assignment functions that log the number of bits that have changed. (Verilator is an open source tool similar to the commercial tool from Carbon Design Systems and they already have some energy use logging.) The resulting numbers can be logged by the TLM POWER3 library. The project is to understand how different application programs cause different patterns of energy use in the different parts of the processor. An interesting aspect for exploration is how frequently the bit-level activity needs to be observed to get an accurate measurement of energy use and whether results of a similar accuracy can be obtained from suitable annotations to the high-level instruction set simulator (ISS) for the OpenRISC.
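
The core of the assignment-function idea can be sketched as follows. The names `ToggleLog` and `logged_assign` are hypothetical; they are not the macros that Verilator emits, nor part of the TLM POWER3 library. The sketch only shows how the number of changed bits is obtained from the XOR of the old and new values.

```cpp
#include <cstdint>

// Hypothetical per-module activity record.
struct ToggleLog {
    std::uint64_t bit_flips = 0;  // total bits that changed value
    std::uint64_t writes = 0;     // total assignments observed
};

// Portable population count: clears the lowest set bit each iteration.
inline unsigned count_set_bits(std::uint32_t v) {
    unsigned n = 0;
    while (v != 0) {
        v &= v - 1;
        ++n;
    }
    return n;
}

// Stand-in for a plain assignment: record how many bits differ between
// the old and new values, then perform the update.
inline void logged_assign(std::uint32_t& reg, std::uint32_t next, ToggleLog& log) {
    log.bit_flips += count_set_bits(reg ^ next);
    log.writes += 1;
    reg = next;
}
```

Keeping one `ToggleLog` per module or per register bank would let the totals be attributed to different parts of the processor, and sampling the counters every N clock events is one way to explore how often the activity needs to be observed.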

Parallel SystemC Implementation

The free SystemC simulation library provides C++ threads to component models. However, all of these threads run on a single CPU core on the hosting workstation, which is no longer ideal given the prevalence of multicore CPUs. Coding styles used in C++ tend to assume non-reentrant, non-preemptive scheduling.

The project is to take the free simulation kernel and make it use multiple processor-level threads (using, say, POSIX pthreads) and then to look at the problems in user-level models that may arise from assumptions about the threading model.

Evaluation can be in terms of how much speedup is achieved per additional core and of what percentage of some existing SystemC code bases needed modification for truly parallel execution.

Parallel Verilator Implementation

This one is perhaps too complex for part II and should be an ACS PROJECT.

The OpenRISC processor is an open source family of CPU cores and SoCs with a GNU C compiler and GNU toolchain. It is available in Verilog RTL and other forms. The Verilog can be converted to C++ using a free program called Verilator. However, Verilator generates models that only exploit one CPU core of today's multicore workstations.

The project is to see what style of cooperation between POSIX pthreads can support the fine-grain parallelism needed to make these models go faster. If the resulting hardware model 'clocks' at tens of kilohertz, the inter-thread communication will typically need to be an order of magnitude or two faster, meaning that spinning on shared variables is the best approach. The project will examine the metrics reported by 'oprofile' and similar and try to find an analytical explanation for any speedup gained by using multiple threads.
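
The spinning approach can be sketched with a small barrier built on shared atomic variables. The example below uses C++ `std::thread` and `std::atomic` in place of raw pthreads calls, and the `SpinBarrier` class and the two-register 'model' are hypothetical, not Verilator output. Each simulated clock cycle has an evaluate phase and a commit phase, separated by barriers.

```cpp
#include <atomic>
#include <thread>

// Hypothetical spin barrier for a fixed number of threads.
class SpinBarrier {
public:
    explicit SpinBarrier(unsigned n) : n_(n) {}
    void arrive_and_wait() {
        const unsigned gen = generation_.load(std::memory_order_acquire);
        if (waiting_.fetch_add(1, std::memory_order_acq_rel) + 1 == n_) {
            // Last arrival: reset the count, then release the others.
            waiting_.store(0, std::memory_order_relaxed);
            generation_.store(gen + 1, std::memory_order_release);
        } else {
            // Spin on the shared variable until the generation advances.
            while (generation_.load(std::memory_order_acquire) == gen) {
            }
        }
    }
private:
    const unsigned n_;
    std::atomic<unsigned> waiting_{0};
    std::atomic<unsigned> generation_{0};
};

struct TwoRegs { long a = 0; long b = 0; };

// Toy two-partition model: a <= b + 1 and b <= a + 2 on every clock.
inline TwoRegs simulate(unsigned cycles) {
    TwoRegs regs;
    SpinBarrier barrier(2);
    auto half = [&](bool first) {
        for (unsigned c = 0; c < cycles; ++c) {
            // Evaluate: read the register owned by the other partition.
            const long next = first ? regs.b + 1 : regs.a + 2;
            barrier.arrive_and_wait();
            // Commit: write only the register this partition owns.
            (first ? regs.a : regs.b) = next;
            barrier.arrive_and_wait();
        }
    };
    std::thread other(half, false);
    half(true);
    other.join();
    return regs;
}
```

Two barrier crossings per simulated clock is the cost that has to be small compared with the work in each partition, which is why the partitioning quality and the barrier latency dominate any speedup.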

There is already some discussion on the Verilator IRC about this project. A fair amount of work would be involved in restructuring the output from Verilator to run on multiple cores - finding good static partitions and hoping they make a good dynamic partition or else using profile-directed feedback to refine the partitioning.

Algorithmic Energy on Multicore

Multicore computers pass cache update messages between the cores to maintain an accurate view of main memory. There is a view that main memory is now a cheap resource and algorithms that write to each heap location only once are feasible, especially on multicore systems where evicting modified cache lines consumes more energy than using fresh memory that will never change ownership between cores.

In the past...

Previous Years' Suggestions.

ACS Project Suggestion(s)

Originator DJ Greaves

Parallel Verilator Implementation

See above.

Scheduler for Toy Bluespec RTL Compiler

There is a locally-written, toy Bluespec Verilog compiler on this LINK.

A basic parser has just been added but there are many details missing compared with the compiler from Bluespec Inc. The most important and interesting thing is the rule scheduler. The toy version currently just puts the rules in the priority order found in the source file.
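
One possible reading of 'priority order found in the source file' is sketched below in C++ (the toy compiler itself is not written in C++). The `Rule` structure and `schedule_cycle` function are hypothetical and are not taken from the toy compiler or from the Bluespec Inc compiler. A rule fires in a given cycle if its guard holds and no earlier rule has already claimed one of the registers it writes.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical summary of one rule for a single clock cycle.
struct Rule {
    std::string name;
    bool guard;                    // does the rule's guard hold this cycle?
    std::set<std::string> writes;  // registers the rule would update
};

// Walk the rules in source order. Fire a rule when its guard holds and
// none of its written registers is already claimed by an earlier rule.
inline std::vector<std::string> schedule_cycle(const std::vector<Rule>& rules) {
    std::set<std::string> claimed;
    std::vector<std::string> fired;
    for (const Rule& r : rules) {
        if (!r.guard) continue;
        bool conflict = false;
        for (const std::string& w : r.writes) {
            if (claimed.count(w) != 0) {
                conflict = true;
                break;
            }
        }
        if (conflict) continue;
        claimed.insert(r.writes.begin(), r.writes.end());
        fired.push_back(r.name);
    }
    return fired;
}
```

This treats only write-write clashes as conflicts. A fuller scheduler would also have to order rules that read what other rules write, which is where the more interesting scheduling questions for the project arise.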

The toy compiler is written in F Sharp.

The project would be to consider several basic design problems that can benefit from a good scheduler or which cannot be scheduled using the standard approach. Several small examples and one larger example would make a good basis. The next step is to make sure the toy compiler can compile these designs to some extent (one or two basic Bluespec features might need to be added to the compiler). Finally, explore the performance of the designs as scheduled in new and different ways.