LG6 Notes: Structure, Flow and Tools

Slides for LG6. Topics: Bus Structures, Design Flow and Tools.

Throughput of Bridged Bus Systems.

The most basic bus has one initiator and several targets. The initiator does not need to arbitrate for the bus since it has no competitors.

Bus operations are reads or writes. Most on-chip busses support burst transactions, whereby multiple consecutive reads or writes can be performed as a single transaction with subsequent addresses being implied as offsets from the first address.

Typical capacity might be 32 bits times 200 MHz which is 6.4 Gb/s. This figure can be thought of as unity (i.e. one word per clock tick) in comparisons with other configurations.
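The arithmetic above can be checked with a short sketch. The function name is made up for illustration, and the model assumes one word is transferred on every clock tick with no idle cycles.

```python
def bus_capacity_bits_per_s(width_bits: int, clock_hz: float) -> float:
    """Peak capacity, assuming one word moves on every clock tick."""
    return width_bits * clock_hz


# 32-bit bus clocked at 200 MHz, as in the text.
capacity = bus_capacity_bits_per_s(32, 200e6)
print(f"{capacity / 1e9:.1f} Gb/s")
```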

Interrupt signals are not shown in these figures. In a SoC they do not need to be part of the bus as such: they can just be dedicated wires running from device to device.

Un-buffered wiring can potentially serve for the write and address busses, whereas multiplexors are needed for read data. Buffering is needed in all directions for busses that go a long way over the chip.

Multiple Initiators

With multiple initiators, the bus may be busy when a new initiator wants to use it, so there are various arbitration policies that might be used: preemptive or non-preemptive, with static priority, round robin and so on. The maximum bus throughput of unity is now shared among initiators.

The latency in a non-preemptive system depends on how long the bus is held for. The maximum bus holding time is a design issue.
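As a rough illustration of two of the policies named above, the following sketch picks a winner from a list of request flags. The function names and the list-of-booleans interface are assumptions made for this example; a real arbiter is combinational logic plus a small amount of state.

```python
def static_priority_grant(requests):
    """Grant the lowest-numbered requester (index 0 has highest priority)."""
    for index, requesting in enumerate(requests):
        if requesting:
            return index
    return None  # bus idle


def round_robin_grant(requests, last_granted):
    """Grant the first requester found after the previous winner."""
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None  # bus idle


# Initiators 0 and 2 are both requesting.
requests = [True, False, True]
print(static_priority_grant(requests))   # initiator 0 always wins
print(round_robin_grant(requests, 0))    # initiator 0 went last, so 2 wins
```

Static priority can starve low-priority initiators, whereas round robin bounds the wait to one turn per competitor.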

Bridged Bus Structures

To make use of the additional capacity from bridged structures we need at least one main initiator for each bus. However, a low speed bus might not have its own initiators: it is just a slave to one of the other busses.

Bus bridges provide full or partial connectivity and some may post writes. There is a global address space, but access time is non-uniform (NUMA). Some busses might be slower, narrower or in different clock domains from others.

The maximum throughput is the sum of that of all the busses that have their own initiators, but the achieved throughput will be lower if the bridges are used a lot: a bridged cycle consumes bandwidth on both sides.
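A crude model of this effect is sketched below. It assumes every bus has unity capacity and its own initiator, that traffic is uniform, and that a bridged transaction costs exactly one cycle on each of two busses; none of these assumptions comes from the notes, and the function name is invented for the example.

```python
def achieved_throughput(n_buses: int, bridged_fraction: float) -> float:
    """Transactions per clock tick across the whole system.

    Each transaction uses one bus cycle locally, plus one more on the far
    side of a bridge for the fraction of traffic that is bridged.
    """
    cycles_per_transaction = 1.0 + bridged_fraction
    return n_buses / cycles_per_transaction


for fraction in (0.0, 0.5, 1.0):
    print(fraction, achieved_throughput(2, fraction))
```

With no bridged traffic the two busses deliver their full sum; if every cycle is bridged, the pair does no better than a single bus.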

How and where to connect DRAM is always a key design issue. The DRAM may be connected via a cache. The cache may be dual ported on to two busses, or more.

Bus bridges and the top levels of structural wiring can be automatically generated. An example tool that does this is ARChitect2 from ARC International.

Network On Chip.

Switched networks require switching elements. With a 2x2 element it is easy to build a ring network. The switching element may contain buffering or it may rely on back-pressure to make sources reduce their load.

Single ring: throughput=2. Counter-rotating ring (one ring in each direction): throughput=4 since a packet only travels 1/4 of the way round the ring on average.
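The average distance figures can be checked numerically. This sketch assumes destinations are uniformly distributed over the other nodes and that, with two rings, a packet always takes the shorter direction; the function name is illustrative only.

```python
def mean_fraction_travelled(n_nodes: int, bidirectional: bool) -> float:
    """Average fraction of the ring a packet crosses to reach its target."""
    total_hops = 0
    for hops in range(1, n_nodes):  # clockwise distance to each other node
        if bidirectional:
            hops = min(hops, n_nodes - hops)  # take the shorter way round
        total_hops += hops
    return total_hops / ((n_nodes - 1) * n_nodes)


print(mean_fraction_travelled(64, bidirectional=False))  # half way round
print(mean_fraction_travelled(64, bidirectional=True))   # about a quarter
```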

When using a network, the delay may be multiple clock cycles and so a write posting approach is reasonable. If an initiator is to have multiple outstanding read requests pending it must put a token in each request that is returned in the response packet for identification purposes.

Although there can be effective local arbitration in each element, a network on a chip can suffer from deadlock. Some implementations use separate request and response networks, so that a response is never held up by new requests, but this just pushes deadlock to the next higher logical level when some requests might not be serviceable without the server issuing a subsidiary request to a third node. Global policies and careful design are required to avoid deadlock and starvation.
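The token scheme might be sketched as follows. The class, the dictionary packet format and the field names are all assumptions made for illustration; a hardware implementation would also use a small, finite tag space rather than an unbounded counter.

```python
class Initiator:
    """Tracks outstanding reads so responses may return in any order."""

    def __init__(self):
        self._next_tag = 0
        self._pending = {}  # tag -> address of the outstanding read

    def issue_read(self, addr):
        tag = self._next_tag
        self._next_tag += 1  # unbounded here; real hardware recycles tags
        self._pending[tag] = addr
        return {"tag": tag, "addr": addr}

    def accept_response(self, response):
        # The tag echoed by the target identifies which read this answers.
        addr = self._pending.pop(response["tag"])
        return addr, response["data"]

    def outstanding(self):
        return len(self._pending)


ini = Initiator()
req_a = ini.issue_read(0x100)
req_b = ini.issue_read(0x200)
# The network delivers the second response first.
first = ini.accept_response({"tag": req_b["tag"], "data": 0xBB})
second = ini.accept_response({"tag": req_a["tag"], "data": 0xAA})
print(first, second)
```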

Multi-stage switching networks

A simple ring is not very effective. Instead, richer meshes of elements are used and the elements can have a higher radix, such as 4x4. There are a number of well-known switch wiring schemes, with names such as Benes, Clos, Shuffle, Delta.

The throughput is potentially equal to the number of ports, but the fabric may partially block and there may be uneven traffic flows leading to receiver contention. These effects reduce throughput. Typically one will not need quite as many initiators as targets, so a symmetric switch system will be over-provisioned.

Such systems can be overly complex on the small scale, but scale up well.

Tool: Mullins et al. network generator.

An on-chip IP protocol with temporal decoupling: OCP BVCI.

The Open Core Protocol (OCP) is a freely available bus-independent protocol. Download full spec from OCPIP. See also D&R.

A prominent feature is totally separate request and response ports. This makes it highly tolerant of delays over the network and amenable to crossing clock domains. However, requests and responses must not get out of order since there is no id token. Older-style protocols where targets had to respond within a prescribed number of clock cycles cannot be used in these situations.

For each half of the port there are request and acknowledge signals, with data being transferred on any positive edge of the clock where both are asserted.
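The handshake rule can be illustrated with a cycle-by-cycle trace. The signal lists and function name are invented for this example and do not use the actual OCP signal names; each list index stands for one positive clock edge.

```python
def transferred_words(req, ack, data):
    """Return the words that move: one per edge where req and ack are both high."""
    return [word for r, a, word in zip(req, ack, data) if r and a]


# Edge 0: the sender holds word A because ack is low.
# Edge 3: the sender has nothing to offer (req low).
req  = [1,   1,   1,   0,    1]
ack  = [0,   1,   1,   1,    1]
data = ["A", "A", "B", None, "C"]
print(transferred_words(req, ack, data))
```

Either side can stall the transfer for as long as it likes, which is what makes the protocol tolerant of variable delay.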

If a block is both an initiator and a target, such as our DMA controller example, then there are two complete instances of the port.

Memory subsystems: DRAM & Controller, Caches and Memory Macrocell Generator.

DRAM is not normally put on the main SoC chip. Instead a standard part is put down and wired up. A row address is first sent to a bank in the DRAM and then one has random access to the columns of that row using different column addresses. When finished with the row, that bank goes busy while it writes back the data: DRAM cells internally have destructive read out.

DRAM is slow to access and certainly not 'random access' compared with on-chip RAM. Hence a cache is typically put on the SoC as part of the memory controller. The controller may have error correction logic as well.
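A toy model of one bank shows why access order matters: a row is opened once, columns within it are then accessed freely, and changing row forces a write-back of the old one. The class and counter names are assumptions for illustration, and timing is reduced to simple event counts.

```python
class DramBank:
    """Counts row activations and write-backs for a sequence of accesses."""

    def __init__(self):
        self.open_row = None
        self.activations = 0
        self.writebacks = 0

    def access(self, row, col):
        if row != self.open_row:
            if self.open_row is not None:
                # Closing a row: the bank is busy while data is restored.
                self.writebacks += 1
            self.open_row = row
            self.activations += 1
        return (row, col)  # column access within the open row


bank = DramBank()
for row, col in [(3, 0), (3, 1), (3, 7), (5, 0)]:
    bank.access(row, col)
print(bank.activations, bank.writebacks)
```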

RAM Compiler

The average SoC is 71 percent RAM memory. The RAMs are typically generated by a RAM compiler. The input parameters are:

The outputs are a datasheet for the RAM, high and low detail simulation models and something that turns into actual polygons in the final masks.

Sometimes self test modules are also generated. For example Mentor's MBIST Architect(TM) generates an SRTL BIST with the memory and ARM/Artisan's Generator will generate a wrapper that implements self repair of the RAM by diverting access from a faulty row to a spare row.
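The idea of diverting accesses to a spare row can be sketched as a small remapping table. This is not how any particular vendor's wrapper works; the class, its methods and the choice to place spare rows after the normal rows are all assumptions for the example.

```python
class RepairableRam:
    """Maps faulty logical rows onto spare physical rows."""

    def __init__(self, rows, spare_rows):
        self.rows = rows
        # Spare rows are assumed to sit just beyond the normal rows.
        self.free_spares = list(range(rows, rows + spare_rows))
        self.remap = {}  # faulty row -> spare row

    def mark_faulty(self, row):
        """Record a fault found by self test; False if no spare is left."""
        if row in self.remap:
            return True
        if not self.free_spares:
            return False
        self.remap[row] = self.free_spares.pop(0)
        return True

    def physical_row(self, row):
        """Row actually accessed for a given logical row."""
        return self.remap.get(row, row)


ram = RepairableRam(rows=4, spare_rows=1)
repaired = ram.mark_faulty(2)
print(repaired, ram.physical_row(2), ram.physical_row(1))
```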

Other related generators might exist: e.g. a FIFO generator would be similar and a masked ROM generator or PLA generator.

Dynamic Clock Gate Insertion

Clock trees consume quite a lot of the power in an ASIC and considerable savings can be made by turning off the clocks to small regions. A region of logic is idle if all of the flip-flops are being loaded with their current contents, either through synchronous clock enables or just through the nature of the design.

Instead of using synchronous clock enables, current design practice is to use a clock gating insertion tool that gates the clock instead.

Care must be taken not to generate glitches on the clock as it is gated and transparent latches in the clock enable signal can re-time it to prevent this.

How can clock enable conditions be generated? One can have software control (additional control register flags) or automatic detection. Automatic tools compute 'clock needed' conditions. A clock is 'needed' if any register will change on a clock edge.
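The 'clock needed' condition amounts to comparing each register's current contents with the value it is about to load. The sketch below models registers as dictionary entries; the function and register names are hypothetical, and a real tool derives this condition symbolically from the RTL rather than evaluating it at run time.

```python
def clock_needed(current, next_values):
    """True if any register would change on the next clock edge."""
    return any(current[name] != next_values[name] for name in current)


state = {"count": 7, "valid": 1}
print(clock_needed(state, {"count": 7, "valid": 1}))  # idle: clock can be gated
print(clock_needed(state, {"count": 8, "valid": 1}))  # count changes: clock needed
```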

A lot of clock needed computation can get expensive, resulting in no net saving, but it can be effective if computed once at head of a pipeline.

Beyond just turning off the clock or power to certain regions, in LG8 we look at further power saving techniques: dynamic frequency and voltage scaling.

Test Program Generator

Not lectured in 2008/9.

Scan Path Insertion and JTAG standard test port.

Not lectured in 2008/9.

Cell Library Tour

A cell library in the public domain: TANNER AMI.

Another: VLSI TECH.

Things to note: there's a good variety of basic gates, including quite a few multi-level gates, such as the AND-OR gate. There are also I/O pads, flip-flops and special function cells. Many gates are available with various output powers.

For each gate there are comprehensive figures that enable one to predict its delay, taking into account its track loading, how many other gates it is feeding and the current supply voltage.

ASIC Design flow: Signoffs prior to 'Tapeout'.

Mask making is an expensive step, equivalent to several man-years of a well-paid hardware engineer's time, so hardware engineers must be given the best tools to make sure their chips will work first time!

Manufacturing Industry Taxonomy

The IC industry and its products can be partitioned up as follows:

Most of the world's 'SoCs' are made by fabless manufacturers. The actual manufacture is mostly done by one company, TSMC: Taiwan Semiconductor Manufacturing Company Limited.

IC Categories: Analog, Power, RF, Processors, Memories, Commodity: logic, discretes, FPGA and CPLD, SoC/ASIC, Other high volume (disk drive, LCD, ... ).