Old Reading Lists

This page holds the remains of previous years' reading lists.

You are encouraged to subscribe to or peruse Electronics Times and Design and Reuse. Perhaps also EDN and ESLsyn (Electronic System Level Synthesis Conference). Also, please make sure you are familiar with this year's undergraduate course, and do ask me to go over any material of interest: Bachelor's Course.

Michvac Heads-Up Reading List

Preferably scan through as much of this as possible before the P35 module commences.

Make sure you are confident with much of the material covered in recent Undergraduate Exam Questions.
There is a collection of System On Chip Lecture Notes on this link: SoC Design and Modelling Patterns (PDF). Some of this material will be lectured as part of the P35 ACS module but most of it is lecture notes given to undergraduates in recent years.
Xilinx Zynq Product Brief. Xilinx UltraScale MP-SoC.
AT91SAM SoC Datasheet. This is the chip we looked at in session 1.
Recommended book: Transaction-Level Modeling with SystemC: TLM Concepts and Applications for Embedded Systems by Frank Ghenassia. Published Springer 2010.
Recommended book: System Design with SystemC by Grötker, Liao, Martin and Swan. Published Springer.
Recommended book: System Level Design with .Net Technology edited by El Mostapha Aboulhamid, Frederic Rousseau.

Lent 2017 Reading List 1: General Miscellany

The LEAP FPGA Operating System Fleming et al.
LegUp: An Open Source High-Level Synthesis Tool for FPGA-Based Processor/Accelerator Systems. Canis et al., ACM TECS.
A Comparison of x86 Computer Architecture Simulators. A. Akram, L. Sawalha. 1-Page Poster. No mention of Prazor!
Accurate Fine-Grained Processor Power Proxies. Huang et al. (See also Fine-grained Energy/Power Instrumentation for Software-level Efficiency Optimization. Greaves et al., FDL'15 Barcelona).

Lent 2017 Reading List 2

I am collecting some papers here but will continue to alter this part of the list in the first half of February.

C-Squared Bound: A Capacity and Concurrency Driven Analytical Model for Many-core Design Yu-Hang Liu, Xian-He Sun. 2015.
"Simulation-Based Verification of the MOST NetInterface Specification Revision 3.0". A. Braun, D. Lettnin, O. Bringmann and W. Rosenstiel. DATE'10.
An ESL Timing & Power Estimation and Simulation Framework for Heterogeneous SoCs - Gruttner 2014
Combining System Level Modeling with Assertion Based Verification. Dahan
Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators PDF.
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks PDF.
PMHLS 2.0: An Automated Optimization of Power Management During High-Level Synthesis PDF.

Last Year - Lent 2016 Reading List 1: General

Enable IoT ASIC Design Using Platforms Pradeep Sukumaran, Sr. Solutions Architect, Open-Silicon -January 07, 2016.
Take a look through RISC-V Workshop Jan 2016 and choose an article of personal interest.
Emulating Future HPC SoC Architectures Using RISC-V. Farzad Fatollahi-Fard, Dave Donofrio, John Shalf, Lawrence Berkeley National Lab. Also a similar one from Oracle: link to be added.
VectorBlox: RISC-V Meeting Slides.

Last Year - Lent 2016 Reading List 2: IP-XACT

How do we name, document, test and implement the registers forming the programmer's view?

Both hardware and software need a common view of the multitude of configuration registers found in a modern SoC.

Generally we can automatically generate all of the following outputs from a common source representation:

A memory map where each device and register is allocated a disjoint address.
A human-readable summary document and detailed document.
A set of .h files for C/C++ inclusion.
SystemVerilog or SystemC code conforming to OVM/UVM coding styles for automated testing and set up of the registers.
RTL for the register file inside each IP block, perhaps to be manually edited afterwards to add further functionality, but ideally something neater than that. Can we explore this for Chisel HCL?
Glue logic and address decoders to wire up the SoC busses according to the memory map.
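
As a minimal sketch of the "common source representation" idea above, the Python below derives two of the listed outputs (a checked memory map and the text of a C header) from one in-memory register description. It does not read or write IP-XACT XML; the `Register` and `Block` classes, the block names and the addresses are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Register:
    name: str
    offset: int  # byte offset from the block's base address
    doc: str = ""


@dataclass
class Block:
    name: str
    base: int   # base address in the SoC memory map
    size: int   # bytes of address space reserved for the block
    registers: list = field(default_factory=list)


def check_disjoint(blocks):
    """Raise ValueError if two blocks claim overlapping address ranges."""
    ordered = sorted(blocks, key=lambda b: b.base)
    for low, high in zip(ordered, ordered[1:]):
        if low.base + low.size > high.base:
            raise ValueError(f"{low.name} overlaps {high.name}")


def memory_map(blocks):
    """Return {REGISTER_NAME: absolute address} for every register."""
    check_disjoint(blocks)
    return {
        f"{b.name}_{r.name}".upper(): b.base + r.offset
        for b in blocks
        for r in b.registers
    }


def c_header(blocks):
    """Emit the text of a .h file with one #define per register."""
    lines = ["/* Generated file: do not edit by hand. */"]
    for name, addr in memory_map(blocks).items():
        lines.append(f"#define {name}_ADDR 0x{addr:08X}u")
    return "\n".join(lines) + "\n"


# Hypothetical two-block SoC used only for illustration.
blocks = [
    Block("uart0", 0x40000000, 0x1000,
          [Register("ctrl", 0x0, "control"), Register("status", 0x4, "status")]),
    Block("timer0", 0x40001000, 0x1000, [Register("load", 0x0, "reload value")]),
]

print(c_header(blocks))
```

The same description could in principle drive the other outputs in the list (documentation, test-bench register models, RTL stubs, address decoders); a real flow would take the description from an IP-XACT file rather than from Python literals.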

Would it be sensible to generate XML from something embedded in Chisel/Scala files, or is it more sensible for a master IP-XACT file to be edited with an XML editor or Eclipse plugin and for a Chisel file to be configured by it?

The IP-XACT at Accellera
Check out the IP-XACT support in Xilinx Vivado Vivado_IP_Integrator_Backgrounder.
Straightforward IP Integration with IP-XACT RTL-TLM Switching
Date 2008: Kruijtzer: Industrial IP Integration Flows based on IP-XACT Standards

Lent 2016 Reading List 3: Reconfigurable Computing and Accelerators

PushPush: Seamless Integration of Hardware and Software Objects Via Function Calls over AXI. Shane T. Fleming ...
More to be added ... actually not - we focussed on IP block export and import instead.

Lent 2016 Reading List 4: ESL and High-level Modelling

ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems - Sanchez @ ISCA 2013.
Sniper is a next generation parallel, high-speed and accurate x86 simulator. This multi-core simulator is based on the interval core model and the Graphite simulation infrastructure, allowing for fast and accurate simulation and for trading off simulation speed for accuracy to allow a range of flexible simulation options when exploring different homogeneous and heterogeneous multi-core architectures. Sniper
The gem5 simulator is a modular platform for computer system architecture research, encompassing system-level architecture as well as processor microarchitecture. GEM5
QEMU is a generic and open source machine emulator and virtualizer. When used as a machine emulator, QEMU can run OSes and programs made for one machine (e.g. an ARM board) on a different machine (e.g. your own PC). By using dynamic translation, it achieves very good performance. QEMU
Location, Location, Location—The Role of Spatial Locality in Asymptotic Energy Minimization. André DeHon, FPGA 2013. PDF.
Advanced SoC virtual prototyping for system-level power planning and validation F Mischkalla, W Mueller - Power and Timing Modeling, Optimization and …, 2014 Abstract—Today's electronic devices imply significant efforts in pre-silicon low-power design. Key techniques such as scaling of operating points, or switching power off to unused blocks play a major role and are usually managed at entire system scope. Finally, the ... PDF.
VPPET: Virtual platform power and energy estimation tool for heterogeneous MPSoC based FPGA platforms SK Rethinagiri, O Palomar, J Arias Moreno, O Unsal… - Power and Timing Modeling …, 2014 Abstract-Using low-power symmetric multi-cores on FPGAs are becoming ubiquitous in embedded computing. This is due to the emergence of power and energy as key design metrics, as important as performance. This leads to the requirement of powerful and ... PDF.
An ESL Timing & Power Estimation and Simulation Framework for Heterogeneous SoCs - Gruttner 2014
Parallel Simulation of SystemC TLM 2.0 - Mello 2012

Whole-System Energy Design

Identifying Compiler Options to Minimise Energy Consumption for Embedded Platforms James Pallister, Simon Hollis, Jeremy Bennett.
This paper presents an analysis of the energy consumption of an extensive number of the optimisations a modern compiler can perform. Using GCC as a test case, we evaluate a set of ten carefully selected benchmarks for five different embedded platforms. A fractional factorial design is used to systematically explore the large optimisation space (2^82 possible combinations), whilst still accurately determining the effects of optimisations and optimisation combinations. Hardware power measurements on each platform are taken to ensure all architectural effects on the energy consumption are captured. We show that fractional factorial design can find more optimal combinations than relying on built in compiler settings. We explore the relationship between run-time and energy consumption, and identify scenarios where they are and are not correlated. A further conclusion of this study is the structure of the benchmark has a larger effect than the hardware architecture on whether the optimisation will be effective, and that no single optimisation is universally beneficial for execution time or energy consumption. PDF.
An Experimental Survey of Energy Management Across the Stack - Kambadur OOPSLA 2014

General and Further Reading

Reading on Power Estimation from High-Level Models

McPAT (Multicore Power, Area, and Timing) is an integrated power, area, and timing modeling framework for multithreaded, multicore, and manycore architectures. McPAT Tech Report PDF.
XEEMU: An Improved XScale Power Simulator. Herczeg et al. PDF.
O. Celebican, T. Simunic Rosing and V. Mooney, "Energy Estimation of Peripheral Devices in Embedded Systems," Proceedings of the 2004 ACM Great Lakes Symposium on VLSI (GLVLSI'04), pp.430-435, April 2004. PDF.
SLIP 2001. Multi-terminal Nets do Change Conventional Wire Length Distribution Models. Dirk Stroobandt. Tutorial Slides.
2005. `A Power Estimation Methodology for SystemC Transaction Level Models'. Dhanwada. A majority of existing works on system level power estimation have focused on the processor, while there are very few that address power consumption of peripherals in a SoC. With the presence of complex cores in current day embedded system-on-chip devices, the problem of complete system level power estimation is gaining significance... PDF.
`MPSoC Power Estimation Framework at Transaction Level Modeling' Rabie Ben Atitallah, Smail Niar and Jean-Luc Dekeyser. Early power estimation is increasingly important in MultiProcessor System-On-Chip (MPSoC) architectures for a reliable Design Space Exploration (DSE). In this paper, we present an MPSoC power modeling framework at the Timed Programmer View (PVT) level that offers a good performance/power tradeoff to be found early in the design flow. PDF.
PowerViP: SoC Power Estimation Framework at Transaction Level. Ikhwan Lee, Hyunsuk Kim, Peng Yang, Sungjoo Yoo, Eui-Young Chung, Kyu-Myung Choi, Jeong-Taek Kong, and Soo-Kwan Eo. In this work, we propose a SoC power estimation framework built on our system-level simulation environment. Our framework provides designers with the system-level power profile in a cycle-accurate manner. PDF.
'Creation of ESL Power Models for Communication Architectures using Automatic Calibration' Stefan Schurmans et al. PDF.
`HIGH-LEVEL POWER ESTIMATION & THE AREA COMPLEXITY OF BOOLEAN FUNCTIONS' Mahadevamurty Nemani and Farid N. Najm PDF.
'System-Level Power Estimation using an On-Chip Bus Performance Monitoring Unit' Youngjin Cho, Younghyun Kim, Sangyoung Park and Naehyuck Chang. In this paper we propose an on-chip bus PMU which makes accurate estimates of system power consumption from a first-order linear power model by utilizing system-level activity information exchanged on the on-chip bus. It can easily be customized for different on-chip and off-chip memory devices, and is not dependent on a specific CPU core... PDF.
D&R Activity-Based System Level Power Estimation
D&R System Level Power Estimation
2012. `TLM POWER3: Power Estimation Methodology for SystemC TLM 2.0' DJ Greaves & MM Yasin. At FDL'12 Forum on specification & Design Languages. Vienna. September 2012. 6 pages. We report on a SystemC add-on library which extends every SystemC module with non-functional data regarding power consumption ... Paper: Full Text PDF. Slides: SLIDES PDF.
`Black box power estimation for digital signal processors using virtual platforms' by Gereon Onnebrink et al., RWTH Aachen. RAPIDO'16. PDF.
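
Several entries above, notably the on-chip bus PMU paper by Cho et al., rest on a first-order linear power model: a static term plus a weighted sum of activity counts. The sketch below shows only that arithmetic. It is not taken from any of the papers; the function name, the event names and all coefficient values are assumptions made up for illustration.

```python
import math


def estimate_power_mw(static_mw, energy_per_event_nj, event_counts, interval_s):
    """First-order linear model: static power plus energy-weighted event rates.

    energy_per_event_nj maps an event name to an assumed energy cost in nJ.
    event_counts maps an event name to how often it occurred in the interval.
    """
    dynamic_nj = sum(
        cost * event_counts.get(event, 0)
        for event, cost in energy_per_event_nj.items()
    )
    # nJ per second is nW; multiply by 1e-6 to express the result in mW.
    return static_mw + (dynamic_nj / interval_s) * 1e-6


# Hypothetical coefficients and bus activity for a 10 ms window.
coefficients = {"bus_read": 2.0, "bus_write": 3.0}
activity = {"bus_read": 1_000_000, "bus_write": 500_000}

power = estimate_power_mw(50.0, coefficients, activity, interval_s=0.01)
print(f"estimated power: {power:.1f} mW")
```

In practice the coefficients would be calibrated against measurements of the target system, which is the hard part that the papers address.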

2015 Current Articles

You are encouraged to subscribe to or peruse Electronics Times and Design and Reuse. Perhaps also ESLsyn (Electronic System Level Synthesis Conference).

Here are some randomly-picked articles for discussion (2015):

Reading on System Simulators

Sorin: AMVA Techniques for High Service Time Variability The work in this paper is motivated by a recent highly efficient heuristic AMVA model for evaluating shared memory architectures that contain complex modern processors [24]. In that architecture model, each processor is modeled by a FCFS queue. Service times at the processor represent the time between memory requests that miss in the second level cache when the processor is active. Sorin00 (PDF).
Lee: CPR: Composable Performance Regression for Scalable Multiprocessor Models. Uniprocessor simulators track resource utilization cycle by cycle to estimate performance. Multiprocessor simulators, however, must account for synchronization events that increase the cost of every cycle simulated and shared resource contention that increases the total number of cycles simulated. These effects cause multiprocessor simulation times to scale superlinearly with the number of cores. paper (PDF), slides (PDF).
DRAMSIM2 and others are collated here: citeulike klimkin.
(Basic queue modelling theory: The purpose of this site is to teach the user basic Queueing Theory. Queue Theory Tutor).
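
For readers new to the queueing background behind the entries above, the sketch below evaluates the textbook M/M/1 results for a single FCFS server: utilisation ρ = λ/μ, mean number in system L = ρ/(1 − ρ) and mean response time W = 1/(μ − λ), which together satisfy Little's law L = λW. This is the simplest standard model, not the AMVA technique of the Sorin paper, and the arrival and service rates used are arbitrary illustrative values.

```python
import math


def mm1_metrics(arrival_rate, service_rate):
    """Return (utilisation, mean jobs in system, mean response time) for M/M/1."""
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    rho = arrival_rate / service_rate
    mean_jobs = rho / (1.0 - rho)
    mean_response = 1.0 / (service_rate - arrival_rate)
    return rho, mean_jobs, mean_response


# Arbitrary example: 8 requests per unit time offered to a server that can do 10.
rho, jobs, response = mm1_metrics(8.0, 10.0)
print(f"utilisation={rho:.2f} jobs={jobs:.2f} response={response:.2f}")
```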

Reading on Power-Aware Design

2012. `Power-Aware Multi-Core Simulation for Early Design Stage Hardware/Software Co-Optimization' Wim Heirman, Souradip Sarkar, Trevor E. Carlson, Ibrahim Hur, Lieven Eeckhout. Stringent performance targets and power constraints push designers towards building specialized workload-optimized systems across a broad spectrum of the computing arena, including supercomputing applications as exemplified by the IBM BlueGene and Intel MIC architectures. In this paper, we make the case for hardware/software co-design during early design stages of processors for scientific computing applications. PDF.
2010. `Energy-Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis'. Omid Azizi, Aqeel Mahesri, Benjamin C. Lee, Sanjay J. Patel, Mark Horowitz. Power consumption has become a major constraint in the design of processors today. To optimize a processor for energy-efficiency requires an examination of energy-performance tradeoffs in all aspects of the processor design space, including both architectural and circuit design choices. In this paper, we apply an integrated architecture-circuit optimization framework to map out energy-performance trade-offs ... PDF.

Reading on High-level Synthesis

Conservation Cores: Reducing the Energy of Mature Computations. ASPLOS'10. Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, Michael Bedford Taylor. Growing transistor counts, limited power budgets, and the breakdown of voltage scaling are currently conspiring to create a utilization wall that limits the fraction of a chip that can run at full speed at one time. In this regime, specialized, energy-efficient processors can increase parallelism by reducing the per-computation power requirements and allowing more computations to execute under the same power budget. To pursue this goal, this paper introduces conservation cores. Conservation cores, or c-cores, are specialized processors that focus on reducing energy and energy-delay instead of increasing performance. This focus on energy makes c-cores an excellent match for many applications that would be poor candidates for hardware acceleration (e.g., irregular integer codes). We present a toolchain for automatically synthesizing c-cores from application source code and demonstrate that they can significantly reduce energy and energy-delay for a wide range of applications. The c-cores support patching, a form of targeted reconfigurability, that allows them to adapt to new versions of the software they target. Our results show that conservation cores can reduce energy consumption by up to 16.0x for functions and by up to 2.1x for whole applications, while patching can extend the useful lifetime of individual c-cores to match that of conventional processors. PDF.
Date 2008: Prevostini: Executable Models and Verification from MARTE and SysML: a Comparative Study of Code Generation Capabilities
OOPSLA'10 `Lime: a Java-compatible and synthesizable language for heterogeneous architectures.' The halt in clock frequency scaling has forced architects and language designers to look elsewhere for continued improvements in performance. We believe that extracting maximum performance will require compilation to highly heterogeneous architectures that include reconfigurable hardware. We present a new language, Lime, which is designed to be executable across a broad range of architectures, from FPGAs to conventional CPUs. We present the language as a whole, focusing on its novel features for limiting side-effects and integration of the streaming paradigm into an object-oriented language. We conclude with some initial results demonstrating applications running either on a CPU or co-executing on a CPU and an FPGA. PDF
FPGA 2011. 'LegUp: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems'. Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason Anderson, Stephen Brown, and Tomasz Czajkowski. In this paper, we introduce a new open source high-level synthesis tool called LegUp that allows software techniques to be used for hardware design. LegUp accepts a standard C program as input and automatically compiles the program to a hybrid architecture containing an FPGA-based MIPS soft processor and custom hardware accelerators that communicate through a standard bus interface. Results show that the tool produces hardware solutions of comparable quality to a commercial high-level synthesis tool. PDF
2009. Microsoft Tech Report: 'Exploiting System-Level Concurrency Abstractions for Hardware Descriptions'. Greaves, Singh. PDF.
ACM Trans Reconfigurable Systems: 2010 'Reducing Memory Constraints in Modulo Scheduling Synthesis for FPGAs' Nadav Rotem, Yosi Ben Asher, Danny Meisler. In High-Level Synthesis (HLS), extracting parallelism in order to create small and fast circuits is the main advantage of HLS over software execution. Modulo Scheduling (MS) is a technique in which a loop is parallelized by overlapping different parts of successive iterations. This ability to extract parallelism makes MS an attractive synthesis technique for loop acceleration. In this work we consider two problems involved in the use of MS which are central when targeting FPGAs. Current MS scheduling techniques sacrifice execution times in order to meet resource and delay constraints. Let “ideal” execution times be the ones that could have been obtained by MS had we ignored resource and delay constraints.
Online corporate web site: C-to-Verilog.
2012. `Deadlock Avoidance and Combinational Balancing for High-Level Synthesis'. DJ Greaves. At Compiling Complete Programs into Circuits Workshop (CCPC 2012) 4th March 2012, London. The Bluespec and Kiwi tool chains project systems of communicating processes into hardware circuits. When a number of processes are composed, two problems commonly arise at the system level: deadlock and excessive combinational delay. Both problems are emergent as the system grows and are best solved using a global pass of the whole assembly, rather than by systematic modification to components before composition. SLIDES PDF.
FDL 2009. Jan Langer, Ulrich Heinkel. OneSpin: High Level Synthesis Using Operation Properties
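
The modulo scheduling entry above (Rotem et al.) concerns overlapping successive loop iterations under resource and delay constraints. As background, the sketch below computes the standard lower bound on the initiation interval (the number of cycles between the starts of successive iterations): the larger of the resource-constrained bound and the recurrence-constrained bound. It is the textbook bound rather than the method of that paper, and the operation counts, unit counts and recurrence values are invented for illustration.

```python
import math


def res_mii(op_uses, units):
    """Resource bound: each unit type must serve all its operations every interval."""
    return max(math.ceil(op_uses[kind] / units[kind]) for kind in op_uses)


def rec_mii(recurrences):
    """Recurrence bound over (total latency, iteration distance) dependence cycles."""
    if not recurrences:
        return 1
    return max(math.ceil(latency / distance) for latency, distance in recurrences)


def min_initiation_interval(op_uses, units, recurrences):
    """No modulo schedule can start iterations more often than this."""
    return max(res_mii(op_uses, units), rec_mii(recurrences))


# Hypothetical loop body: 3 multiplies and 4 adds per iteration,
# 1 multiplier and 2 adders available, and two loop-carried dependence cycles.
uses = {"mul": 3, "add": 4}
available = {"mul": 1, "add": 2}
cycles = [(4, 1), (6, 2)]

print(min_initiation_interval(uses, available, cycles))
```

A scheduler would normally start at this bound and increase the interval until a legal schedule is found.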

Reading on Co-simulation

`A Timing-Accurate HW/SW Co-simulation of an ISS with SystemC' Luca Formaggio, Franco Fummi, Graziano Pravadelli. The paper presents a system level co-simulation methodology for modeling, validating, and analyzing the performance of embedded systems. The proposed solution relies on the integration between an instruction set simulator (ISS) and the SystemC simulation kernel. In this way, the ISS is used to abstract the model of the real programmable device where the SW should run, while SystemC is used to model HW components that interact with the SW. PDF.
Combining System-Level Modeling with Assertion Based Verification. Dahan

Reading on Assertion Based Design

Formal Techniques for SystemC Verification. M Vardi
MTV 2005: Nicola Bombieri: On PSL Properties Re-use in SoC Design Flow Based on Transaction Level Modeling
