LG3 Notes: System On Chip Design

Slides for LG3.

Overview of bus structure in an example SoC.

A platform chip is a SoC that is used in a number of products although chunks of it might be turned off in any one application: for example, the USB port might not be made available on a portable media player despite being on the core chip.

Generally devices must be allocated to busses with knowledge of the expected access and traffic patterns. Commonly there is one main bus master per bus. The bus master is the device that generates the address for the next data movement (read or write operation).

Busses are connected to one another by bridges, but crossing a bridge adds latency and also uses up bandwidth on both busses.

Lower-speed busses may go off chip.

DRAM is always an important component. It used to be exclusively off chip, but today some SoCs include on-chip DRAM.

Microcomputer from 1975 era

A basic microprocessor such as the original Intel 8080 device has a 16-bit address bus and an 8-bit data bus, so it can address 64 Kbytes of memory. It is an A16/D8 processor core. Internally it has instruction fetch, decode and execute logic. It is a bus initiator in ESL terms.

The interrupt input makes it save its PC and load a fixed value instead: an external hardware event forces it to make a jump.

The high-order address bits are decoded to create chip enable signals for each of the connected peripherals, such as the RAM, ROM and UART.
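
The decoding step can be sketched in C as a function from address to chip enable. The memory map below (ROM in the bottom half, RAM and UART sharing the top half) and the names `decode_address` and `CE_*` are assumptions made for illustration, not the map used on the slide.

```c
#include <stdint.h>

/* Hypothetical memory map for an A16/D8 system (illustrative only):
 *   0x0000-0x7FFF  ROM   (A15 = 0)
 *   0x8000-0xBFFF  RAM   (A15:A14 = 10)
 *   0xC000-0xFFFF  UART  (A15:A14 = 11, so its registers alias many times) */
enum chip_enable { CE_ROM, CE_RAM, CE_UART };

/* Decode only the top two address bits, as a simple hardware decoder would. */
static enum chip_enable decode_address(uint16_t addr)
{
    unsigned top = addr >> 14;              /* A15:A14 */

    if ((top & 2u) == 0)
        return CE_ROM;                      /* A15 = 0 */
    return (top == 2u) ? CE_RAM : CE_UART;
}
```

Because only two bits are decoded, each device responds across a whole quarter or half of the address space; the low-order bits are left for the device to decode internally.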

The first SoCs were perhaps the microcontrollers. The Intel 8051 used in the mouse shipped with the first IBM PC is a good example. For the first time, RAM, ROM, processor and I/O devices were all on one piece of silicon. We all now have many such devices: one in every card in our wallet or purse. Today's SoCs are the same, just much more complex.

Programmed I/O

I've given a complete set of macro definitions using the C preprocessor that enables a C program to access a hardware UART with programmed input/output.

The input and output subroutines spin in busy-wait loops. The receiver spins until the empty flag in the status register goes away. Reading the data register makes the status register go empty again. The actual hardware device might have a receive FIFO, so instead of going empty, the next character from the FIFO would become available straightaway.

The output function is exactly the same in principle, except it spins while the device is still busy with any data written previously.
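
A minimal sketch of this style of access follows. It is not the macro set from the slide: the register offsets, the bit positions `UART_RX_EMPTY` and `UART_TX_BUSY`, and the function names are all assumptions, and a small array stands in for the memory-mapped hardware so that the fragment can run on a workstation.

```c
#include <stdint.h>

/* Simulated register file standing in for memory-mapped hardware.
 * On a real SoC, IO_BASE would be a fixed physical address taken from
 * the memory map. Every name and bit position here is an assumption. */
static volatile uint8_t sim_uart_regs[2] = { 0x01 /* RX empty */, 0 };
#define IO_BASE        ((uintptr_t)sim_uart_regs)

#define UART_STATUS    (*(volatile uint8_t *)(IO_BASE + 0))
#define UART_DATA      (*(volatile uint8_t *)(IO_BASE + 1))
#define UART_RX_EMPTY  0x01u   /* no received character is waiting   */
#define UART_TX_BUSY   0x02u   /* previous character still being sent */

/* Spin until a character has arrived, then read it. */
static uint8_t uart_polled_read(void)
{
    while (UART_STATUS & UART_RX_EMPTY)
        ;                      /* busy-wait on the status register */
    return UART_DATA;          /* on real hardware this read sets RX_EMPTY again */
}

/* Spin while the transmitter is busy, then write the next character. */
static void uart_polled_write(uint8_t c)
{
    while (UART_STATUS & UART_TX_BUSY)
        ;                      /* busy-wait until the device is free */
    UART_DATA = c;
}
```

The `volatile` qualifier matters: without it the compiler may read the status register once and spin forever on the cached value.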

General Interrupt Structure

Nearly all devices have a master interrupt enable control flag that can be set and cleared under programmed I/O by the controlling processor. Its output is just ANDed with the local interrupt source.

The programmed I/O uses the write enable (wen) signal to guard the transfer of data from the main data bus into the control register. A ren signal is used for reading.

The principle of programming is

With only a single interrupt wire to the processor, all interrupt sources share it and the processor must poll around on each interrupt to find the device that needs attention.
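
The shared-wire arrangement can be modelled in C as below. This is a behavioural sketch under assumed names (`struct device`, `irq_wire`, `shared_isr`); a real handler would read each device's status register over the bus rather than inspect a struct.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_DEVICES 3

/* Hypothetical per-device state: the master interrupt enable flag
 * and the local interrupt source it is ANDed with. */
struct device {
    uint8_t  int_enable;
    uint8_t  int_pending;
    unsigned serviced;       /* how many times the handler attended to it */
};

static struct device devices[NUM_DEVICES];

/* The single wire to the processor: OR over devices of (enable AND source). */
static int irq_wire(void)
{
    for (size_t i = 0; i < NUM_DEVICES; i++)
        if (devices[i].int_enable && devices[i].int_pending)
            return 1;
    return 0;
}

/* Single entry point: poll round every device to find who needs attention. */
static void shared_isr(void)
{
    for (size_t i = 0; i < NUM_DEVICES; i++) {
        if (devices[i].int_enable && devices[i].int_pending) {
            devices[i].int_pending = 0;   /* servicing clears the source */
            devices[i].serviced++;
        }
    }
}
```

A device whose enable flag is clear never pulls the wire, so its pending source is left untouched by the handler.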

Enhancement: a vectored interrupt makes the processor branch to a device-specific location.

Interrupts can also be associated with priorities, so that an interrupt of a higher priority level than the one currently being serviced preempts it.

Programming for the UART using Interrupts

LINK: C Code For Device Driver.

The code fragment illustrates the complete set of four software routines needed to manage a pair of circular buffers for input and output to the UART.

If the UART has a single interrupt output for both send and receive events, then two of the four routines are combined with a software dispatch.
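
The four routines might look roughly like the sketch below. It is not the linked driver: the buffer size, the `hw_*` variables standing in for device registers, and all function names are assumptions. A real driver would also mask interrupts briefly around the application-side buffer updates.

```c
#include <stdint.h>

#define BUF_SIZE 16u   /* power of two, so indices wrap with a mask */
#define BUF_MASK (BUF_SIZE - 1u)

struct ring { uint8_t data[BUF_SIZE]; unsigned head, tail; };

static int ring_empty(const struct ring *r) { return r->head == r->tail; }
static int ring_full(const struct ring *r)  { return ((r->head + 1u) & BUF_MASK) == r->tail; }

static void ring_put(struct ring *r, uint8_t c)
{
    r->data[r->head] = c;
    r->head = (r->head + 1u) & BUF_MASK;
}

static uint8_t ring_get(struct ring *r)
{
    uint8_t c = r->data[r->tail];
    r->tail = (r->tail + 1u) & BUF_MASK;
    return c;
}

static struct ring rx_ring, tx_ring;

/* Stand-ins for the hardware: data registers and the TX interrupt enable. */
static uint8_t hw_rx_data, hw_tx_data;
static int hw_tx_int_enable;

/* 1. Receive ISR: the device has a character; move it into rx_ring. */
static void uart_rx_isr(void)
{
    uint8_t c = hw_rx_data;          /* on real hardware the read clears the interrupt */
    if (!ring_full(&rx_ring))
        ring_put(&rx_ring, c);       /* otherwise the character is dropped */
}

/* 2. Transmit ISR: the device is ready for the next character. */
static void uart_tx_isr(void)
{
    if (ring_empty(&tx_ring))
        hw_tx_int_enable = 0;        /* nothing left: stop TX interrupts */
    else
        hw_tx_data = ring_get(&tx_ring);
}

/* 3. Application read: returns -1 if nothing is buffered. */
static int uart_getchar(void)
{
    return ring_empty(&rx_ring) ? -1 : ring_get(&rx_ring);
}

/* 4. Application write: returns -1 if the buffer is full. */
static int uart_putchar(uint8_t c)
{
    if (ring_full(&tx_ring))
        return -1;
    ring_put(&tx_ring, c);
    hw_tx_int_enable = 1;            /* ensure the TX interrupt drains the buffer */
    return 0;
}
```

With a single interrupt output, one handler would read the status register and call `uart_rx_isr` or `uart_tx_isr` as appropriate.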

General Purpose I/O Pins (GPIO)

The general structure of GPIO pins has not changed since the early days of the 6821 I/O controller. A number of pins are provided that can either be input or output. A data direction register sets the direction on a per-pin basis. If an output, data comes from a data register. Interrupt polarity and masks are available on a per-pin basis for received events. A master interrupt enable mask is also provided.

The slide illustrates the schematic and the Verilog RTL for such a device. All of the registers are accessed by the host using programmed I/O.
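
As a companion to the RTL, the register behaviour described above can be expressed as a small C model. The field names, the 8-bit width and the level-sensitive interrupt are assumptions for illustration; the slide's Verilog remains the reference.

```c
#include <stdint.h>

/* Hypothetical 8-pin GPIO block; one bit per pin in every register. */
struct gpio {
    uint8_t ddr;        /* data direction: 1 = output, 0 = input      */
    uint8_t dout;       /* data register driving the output pins      */
    uint8_t int_pol;    /* pin level that counts as an interrupt      */
    uint8_t int_mask;   /* per-pin interrupt enable                   */
    uint8_t master_en;  /* master interrupt enable                    */
};

/* Value on the pins: outputs come from dout, inputs from the outside world. */
static uint8_t gpio_pins(const struct gpio *g, uint8_t external)
{
    return (uint8_t)((g->dout & g->ddr) | (external & (uint8_t)~g->ddr));
}

/* Interrupt request: any unmasked input pin sitting at its configured
 * polarity, gated by the master enable. */
static int gpio_irq(const struct gpio *g, uint8_t external)
{
    uint8_t inputs = (uint8_t)~g->ddr;
    /* A bit is 1 where the pin level equals the polarity bit. */
    uint8_t active = (uint8_t)~(gpio_pins(g, external) ^ g->int_pol);

    return g->master_en && (active & g->int_mask & inputs) != 0;
}
```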

The keyboard scanner

The keyboard scanner scans each key until it finds one pressed. It then loads the scan code into the pkey register where the host finds it when it does a programmed I/O read.

The host knows to do a read when it gets an interrupt. The interrupt is raised when a key is pressed and is cleared when the host does a read (hren).

In practice, one would not scan at the speed of the processor clock. One would scan more slowly and use an extra register on the asynchronous input pressed (see crossing clock domains). Or, typically, one might use a separate microcontroller to scan the keyboard.

Note, a standard PC keyboard generates an output byte on press and release and implements a short FIFO internally.

Counter/Timer Block

The counter/timer block is essentially a counter that counts internal clock pulses or external events and which interrupts the processor on a certain count value.

An automatic re-load register accommodates poor interrupt latency, so that the processor does not need to re-load the counter before the next event.
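
The effect of the re-load register can be seen in a short behavioural model. The struct layout and the name `timer_tick` are assumptions; the point is only that the hardware restarts the count itself, so the interrupt period stays fixed however late the host responds.

```c
#include <stdint.h>

/* Hypothetical down-counting timer with automatic re-load. */
struct timer {
    uint32_t count;    /* current count                         */
    uint32_t reload;   /* value copied into count on expiry     */
    int      irq;      /* set on expiry, cleared by the host    */
};

/* One pre-scaled clock tick. With reload = N the period is N + 1 ticks. */
static void timer_tick(struct timer *t)
{
    if (t->count == 0) {
        t->count = t->reload;   /* hardware re-loads without host help */
        t->irq = 1;
    } else {
        t->count--;
    }
}
```

For example, with `reload` set to 2 the interrupt is raised on every third tick, whether or not the host has yet cleared the previous one.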

The timer illustrated in the RTL counts the pre-scaled system clock, whereas a counter has external inputs, as shown on the schematic (e.g. a car rev counter).

Four to eight versatile, configurable counter/timers are generally provided in one block.

All registers are also configured as bus slave read/write resources for programmed I/O.

In this example, the interrupt is cleared by host programmed I/O to host_op.

Framestore

The framestore reads out the contents of its frame buffer again and again. The memory is implemented in a Verilog array and this has two address ports. Another approach is to have a single address port and for the RAM to be simply 'stolen' from the output device when the host makes a write to it. This will cause noticeable display artefacts if writes are at all frequent.

This framestore has fixed resolution, but real ones have programmable values read from registers instead of the fixed numbers 230 and 110.

The framestore in this example has its own local RAM. This reduces RAM bandwidth costs on the main RAM but uses more silicon area. A delicate trade off! A typical compromise, also used on audio and other DSP I/O, is to have a small staging RAM or FIFO in the actual device but to keep as much as possible in the main memory.

It's an output-only device that never goes busy, so it generates no interrupts.

DMA Controller

The DMA controller is the first device we have seen that is a bus initiator as well as a bus target. It has two complete sets of bus connections. Note the direction reversal of all nets.

This controller just makes block copies from source to destination with the length being set in a third register. Finally, a status/control register controls interrupts and kicks off the procedure.

The RTL code for the controller is relatively straightforward, with much of it being dedicated to providing the target side programmed I/O access to each register.

The active RTL code that embodies the function of the DMA controller is contained in the two blocks qualified with the active net in their conjunct.

Typically, DMA controllers are multi-channel, being able to handle four or so concurrent transfers. Many devices have their own DMA controllers built in, rather than relying on dedicated external controllers. However, this is not possible for devices connected on the other side of bus bridges that do not allow mastering (initiating) in the reverse direction. This is a common-enough situation for peripherals such as IDE disk drives.

Rather than using a DMA controller one can just use another processor. If the processor runs out of a small, local instruction RAM it will not impact on memory bus bandwidth with its fetches and it might not be that much larger in terms of silicon area.

An enhancement might be to keep either the source or the destination register constant for streaming device access. For instance, to play audio out of a sound card, the destination address could be set to the programmed I/O address of the output register for audio samples and set not to increment.
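
A behavioural C model of such a controller, including the constant-address enhancement, might look like this. It is a sketch, not a translation of the slide's RTL: the `src_inc` and `dest_inc` control bits and the byte-wide transfer are assumptions, and a real controller needs a separate read cycle and write cycle per item.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-channel DMA controller. */
struct dma {
    size_t src, dest, len;    /* the three programmed registers            */
    int    src_inc, dest_inc; /* assumed control bits: 0 holds the address */
    int    active;            /* set by the host to kick off a transfer    */
    int    irq;               /* raised when the transfer completes        */
};

/* One transfer step as bus initiator: move one byte, then update state. */
static void dma_step(struct dma *d, uint8_t *mem, size_t mem_size)
{
    if (!d->active)
        return;
    if (d->len > 0) {
        assert(d->src < mem_size && d->dest < mem_size);
        mem[d->dest] = mem[d->src];
        if (d->src_inc)  d->src++;
        if (d->dest_inc) d->dest++;
        d->len--;
    }
    if (d->len == 0) {        /* finished: go idle and interrupt the host */
        d->active = 0;
        d->irq = 1;
    }
}
```

With `dest_inc` clear, every item is written to the same address, which is the behaviour wanted when the destination is a device's output register rather than a block of RAM.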

For media with hard real-time characteristics, such as audio, video and modem signals, a small staging FIFO is likely to be needed in the device itself because the initiator port may experience latency when it is serviced. The DMA controller then initiates the next burst of its transfer when the local FIFO reaches a trigger depth.

Network Devices

Network devices, such as Ethernet, USB, Firewire, 802.11 are rather similar to the audio/video/modem device with embedded DMA controller just discussed. For high throughput these devices should likely be bus masters or use a DMA channel.

DMA offloads work from the main processor, but, equally importantly, using DMA requires less staging RAM or FIFO in the device. In the majority of cases, RAM is the dominant cost in terms of SoC area.

Another advantage of a shared RAM pool is statistical multiplexing gain. It is well known in queueing theory that having a monolithic server performs better than having a number of smaller servers that are each dedicated to one client. If the clients all share one server and arrive more or less at random, the system can be more efficient in terms of server utilisation. So it goes with RAM buffer allocation: having a central pool requires less overall RAM, on average, than having the RAM split around the various devices.

The DMA controller in a network device will often have the ability to follow elaborate data structures set up by the host, linking and de-linking buffer pointers from a central pool in hardware.

Bus Bridge

The basic idea of the bus bridge is that bus operations slaved on one side are mastered on the other. The bridge need not be symmetric: speeds and data widths may be different on each side.

A bus bridge connects together two busses that are potentially able to operate independently when traffic is not crossing. However, in some circumstances, especially when bridging down to a slower bus, there may be no initiator on the other side, so it never actually operates independently.

The bridge need not support a flat address space: addresses seen on one side may be totally re-organised when viewed on the other side, or be unaddressable. However, for debugging and test purposes, it is generally helpful to maintain a flat address space and to implement paths that are not likely to be used in normal operation.

A bus bridge might implement write posting using an internal FIFO. However, it will generally block when reading. In another LG we cover networks on a chip that go further in that respect.
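
The asymmetry between posted writes and blocking reads can be illustrated with a small model. The FIFO depth, the names, and in particular the choice to drain all posted writes before a read (to preserve ordering) are assumptions of this sketch rather than properties of any specific bridge.

```c
#include <stdint.h>

#define POST_DEPTH 4u

struct posted_write { uint32_t addr, data; };

/* Hypothetical bridge with a write-posting FIFO. */
struct bridge {
    struct posted_write fifo[POST_DEPTH];
    unsigned  head, count;
    uint32_t *far_mem;          /* memory on the far bus, word addressed */
};

/* Near-side write: returns 1 if posted at once, 0 if the initiator must stall. */
static int bridge_write(struct bridge *b, uint32_t addr, uint32_t data)
{
    if (b->count == POST_DEPTH)
        return 0;
    unsigned tail = (b->head + b->count) % POST_DEPTH;
    b->fifo[tail].addr = addr;
    b->fifo[tail].data = data;
    b->count++;
    return 1;
}

/* Far-side master performs one posted write, if any is queued. */
static void bridge_drain_one(struct bridge *b)
{
    if (b->count == 0)
        return;
    b->far_mem[b->fifo[b->head].addr] = b->fifo[b->head].data;
    b->head = (b->head + 1u) % POST_DEPTH;
    b->count--;
}

/* Near-side read: blocks while earlier writes drain, then fetches.
 * far_cycles reports how many far-bus operations the initiator waited for. */
static uint32_t bridge_read(struct bridge *b, uint32_t addr, unsigned *far_cycles)
{
    *far_cycles = 0;
    while (b->count > 0) {
        bridge_drain_one(b);
        (*far_cycles)++;
    }
    (*far_cycles)++;            /* the read itself */
    return b->far_mem[addr];
}
```

The writer returns immediately while space remains in the FIFO, whereas the reader always waits for at least one far-bus operation.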

Note, the 'busses' on each side use multiplexors and not tri-states on a SoC. The multiplexors are different from bus bridges since they do not provide spatial reuse of bandwidth.

With a bus bridge, total system bandwidth ranges from 1.0 to 2.0 times the bandwidth of a single bus, in inverse proportion to the share of cycles that cross the bridge.
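
One way to put numbers on this range, assuming two busses of equal bandwidth where a local transaction occupies one bus cycle and a crossing transaction occupies one cycle on each bus: if a fraction c of transactions cross, aggregate throughput is 2/(1+c) in units of one bus's bandwidth. This formula is a simplification introduced here, not taken from the slide.

```c
#include <assert.h>

/* Aggregate throughput of two equal busses joined by a bridge, in units
 * of a single bus's bandwidth. crossing_fraction is the share of
 * transactions that cross the bridge (0.0 = none, 1.0 = all).
 * Assumes a crossing transaction costs one cycle on each bus. */
static double system_bandwidth(double crossing_fraction)
{
    assert(crossing_fraction >= 0.0 && crossing_fraction <= 1.0);
    return 2.0 / (1.0 + crossing_fraction);
}
```

With no crossing traffic the two busses run fully in parallel (2.0); when every transaction crosses, the pair behaves like a single bus (1.0).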

Clock Tree

The clock tree delivers a clock to all flops in a domain with sufficiently low skew to avoid shoot-through. This is achieved by balancing wire lengths between the drivers.

The clock frequency is a multiple of the external reference which is commonly sourced from the piezo-effect of sound waves in a thin slice of quartz crystal.

This slide was not lectured in 2009. However, later on we talk about having a programmable clock frequency, so it's worth noting that the multiplication factor of 10 illustrated in the slide can be variable and programmed in some systems (e.g. laptops).

Clock Domain Crossing

Only briefly lectured in 2009.

A clock crossing bridge is like a bus bridge, but has different clock domains on each side.

A simplex clock domain crossing bridge carries information in only one direction. Duplex carries in both directions.

These are commonly needed when connecting to I/O devices that operate at independent speeds: for example, an Ethernet receiver sub-circuit works at the exact rate of the remote transmitter that is sending to it.

Simplex: can never be sure about the precise delay.

We need a protocol with insertable/deletable padding symbols that have no semantic meaning. Or at a higher level, the protocol must have elidable idle states between transactions.

100 percent utilisation is impossible when crossing clock domains. The four-phase handshake limits utilisation to 50 percent (or 25 percent if registered at both sides). Other protocols can get arbitrarily close to saturating one side or the other provided we know the maximum tolerance in the nominal clock rates.
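
The 50 and 25 percent figures can be reproduced with a deliberately simplified model in which both sides step once per cycle at the same rate. That single-clock abstraction, and the function name `four_phase_transfers`, are assumptions of this sketch; real domains have unrelated clocks and need proper synchronisers.

```c
/* Count completed transfers over a number of cycles of a four-phase
 * (req up, ack up, req down, ack down) handshake.
 * registered = 0: the receiver sees req in the cycle it changes.
 * registered = 1: each side sees the other's signal one cycle late. */
static unsigned four_phase_transfers(unsigned cycles, int registered)
{
    int req = 0, ack = 0;
    unsigned transfers = 0;

    for (unsigned t = 0; t < cycles; t++) {
        int ack_seen = ack;      /* sender sees ack from the previous cycle */
        int req_prev = req;      /* req as it stood at the end of last cycle */

        /* Sender: raise req when idle, drop it once ack has been seen. */
        if (!req && !ack_seen)
            req = 1;
        else if (req && ack_seen)
            req = 0;

        /* Receiver: capture data and raise ack on req, drop ack when req falls. */
        int req_seen = registered ? req_prev : req;
        if (req_seen && !ack) {
            ack = 1;
            transfers++;         /* data word captured here */
        } else if (!req_seen && ack) {
            ack = 0;
        }
    }
    return transfers;
}
```

Over 100 cycles this model completes 50 transfers unregistered and 25 when both sides are registered, since each of the four phases then takes a full cycle.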

Duplex: cannot rely on any precise timing relationship between the two directions. The protocol must rely on sequencing or explicit transaction tokens. In other words, we need a lot of temporal decoupling of requests and acks for crossing clock domains (and also network on chip later). (NB: This is not the same as the temporal decoupling in ESL modelling.)


END (C) 2009 DJ GREAVES.