LG1 Notes: RTL

Slides for LG1.

What is a SoC?

A SoC is a complete system on a chip. By a system, we mean something that generates activity, such as a processor, together with most or all of its memory and peripherals. The processor may be a custom or standard microprocessor, or it could be a specialised media processor for sound, modem or video applications. Other generators of bus cycles are DMA controllers. DMA controllers can be arbitrarily complex, and are really only distinguished from microprocessors by their complete or partial lack of instruction fetching.

A SoC typically contains several such systems: that is, it is made up of a number of co-operating processor subsystems. These are interconnected using a variety of mechanisms, including shared memory and message-passing hardware entities such as mailboxes.

SoCs are found in consumer products of every kind, from modems and telephones to DVD players, televisions and iPods.

Review/revision of Verilog RTL

From the point of view of this course, Verilog and VHDL are completely equivalent as register transfer languages (RTLs). Both support simulation and synthesis with nearly-identical paradigms. Of course, each has its proponents.

Verilog constructs can be thought of as falling into four classes:

  • Structural RTL enables an hierarchic component tree to be instantiated and supports wiring (a netlist) between components.
  • Continuous assignments describe combinational logic using a rich set of integer operators, including all those found in software languages such as C++ and Java.
  • Synthesisable behavioural RTL uses parallel threads to describe the reactive behaviour of sub-circuits. In Verilog a thread is introduced with the 'always' keyword. However, standards for synthesisable RTL greatly restrict the allowable patterns of execution: they do not allow a thread to leave the module where it was defined, they do not allow a variable to be written by more than one thread and they can restrict the amount of event control (i.e. waiting for clock edges) that the thread performs.
  • Finally, there are the so-called 'non-synthesisable' constructs: everything else in the language, including anything that pauses for an amount of time.

    All the time values in the RTL are ignored for synthesis and zero-delay components are synthesisable. For them also to be simulatable in a deterministic way the simulator core implements the delta cycle mechanism.

    One can argue that anything written in RTL that describes deterministic and finite-state behaviour ought to be synthesisable. However, this is not what the community wanted in the past: they wanted a simple set of rules for generating hardware from RTL so that engineers could retain good control over circuit structures from what they wrote in the RTL.

    Today, one might argue that the designer/programmer should not be forced into such low-level expression or into the excessively-parallel thought patterns that follow on. Certainly it is good that programmers are forced to express designs in ways that can be parallelised, but the tool chain perhaps should have much more control over the details of allocation of events to clock cycles and the state encoding.

    RTL synthesis tools are not normally expected to re-time a design, or alter the amount of state or the state encoding. Newer languages and flows (such as Bluespec and Kiwi) still encourage the user to express a design in parallel terms, yet provide easier-to-use constructs with the expectation that detailed timing and encoding might be chosen by the tool.

    What is synthesisable RTL? Look at its AST!

    The abstract syntax tree for synthesisable RTL supports a rich set of expression operators but just the assignment and branching commands (no loops). (Loops in synthesisable VHDL and Verilog are restricted to so-called structural generation statements that are fully unwound by the compiler front end and so have no data-dependent exit conditions).
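    As an illustration, the command part of such an AST needs only a handful of forms. The course fragments are in ML; this Python rendering and its constructor names are my own sketch, not the lectured encoding:

```python
# A toy AST for the command part of synthesisable RTL (hypothetical encoding).
# Expressions: ("num", n) | ("net", name) | ("binop", op, e1, e2)
# Commands:    ("assign", dest, expr) | ("if", cond, then_cmds, else_cmds)
# Note there is no loop form: loops must be fully unwound by the front end.

def nets_written(cmds):
    """Collect every net assigned anywhere in a command list."""
    out = set()
    for c in cmds:
        if c[0] == "assign":
            out.add(c[1])
        elif c[0] == "if":
            out |= nets_written(c[2]) | nets_written(c[3])
    return out

prog = [
    ("assign", "q", ("net", "d")),
    ("if", ("net", "sel"),
        [("assign", "y", ("binop", "+", ("net", "a"), ("net", "b")))],
        [("assign", "y", ("net", "a"))]),
]
```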

    The slide shows the full syntax tree for synthesisable RTL: it's not very complex!

    An RTL program can be used both for simulation and synthesis.

    Simulation uses event-driven simulation (EDS). When using zero-delay models, we can use the compute/commit paradigm that requires the EDS kernel to be augmented to support delta cycles.

    Synthesis involves converting to a parallel form with one right-hand-side expression per variable, then converting each expression to a logic tree, preferably taking into account sub-expression sharing and user speed/power/area requirements. Pointers are given to ML fragments that implement a basic form of both halves of this process.

    Toy implementation of EDS RTL Simulator

    Toy EDS ML fragment.

    Pointer to a toy implementation in ML. This maintains an ordered queue of events. An event is a change in value of a net at some time in the future. Operation takes the next event from the head of the queue and executes it. Execution means changing the net to that value. All component models that are sensitive to changes on that net then run, potentially generating new events that are inserted into the event queue.

    The current simulation time, tnow, is defined as the time of the event at the head of the queue.

    A modification is needed when zero-delay models are present because new events will all be at time tnow: at a minimum, we need to insert them after all other events with time tnow. The preferred solution is to implement the compute/commit paradigm, whereby all new events are generated as pending updates, waiting to be committed, but the commit is not done while there are still events with time tnow on the event queue.
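    The referenced fragment is in ML; the following Python sketch (my own, with hypothetical names) shows the same kernel with the compute/commit modification: all events at tnow are gathered as pending updates, committed together, and only then are the sensitive models run:

```python
import heapq

class Sim:
    """Toy event-driven simulator with a compute/commit (delta cycle) kernel."""
    def __init__(self):
        self.queue = []       # heap of (time, seq, net, value)
        self.seq = 0          # tie-breaker: preserves insertion order at equal times
        self.nets = {}        # committed value of each net
        self.sensitive = {}   # net name -> list of models run when it changes

    def schedule(self, t, net, value):
        """Post an event: net takes value at time t (t >= tnow)."""
        heapq.heappush(self.queue, (t, self.seq, net, value))
        self.seq += 1

    def run(self):
        while self.queue:
            tnow = self.queue[0][0]
            # Compute phase: gather every update pending at time tnow.
            pending = {}
            while self.queue and self.queue[0][0] == tnow:
                _, _, net, value = heapq.heappop(self.queue)
                pending[net] = value
            # Commit phase: apply all updates at once, then run the models
            # sensitive to the changed nets; zero-delay models post new
            # events back at tnow, giving a fresh delta cycle.
            changed = [n for n, v in pending.items() if self.nets.get(n) != v]
            self.nets.update(pending)
            for net in changed:
                for model in self.sensitive.get(net, []):
                    model(self, tnow)
```

    A zero-delay inverter, for example, is a model that schedules the complement of its input back at tnow; the kernel terminates because the second delta cycle causes no further change.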

    VHDL and Verilog RTL both support the compute/commit paradigm. A net that is to have its update deferred in VHDL (and SystemC) is called a signal. In Verilog, all nets can be assigned in either way and two different assignment operators are provided.

    Toy implementation of RTL Synthesiser

    Compiling RTL to gates involves two basic stages.

    Firstly, for each register we need exactly one hardware circuit for its input, regardless of how many times it is assigned, so we need to build a multiplexor expression that ranges over all its sources and is controlled by the conditions that make each assignment occur. In other words, we need a list for each clock domain that holds pairs of the form (register name, value to be assigned on the clock edge).

    Conversion to list form ML fragment.

    There are two variations of the conversion-to-list algorithm, depending on whether non-blocking signal assignment or normal (blocking) variable assignment is used. The difference is simply whether we need to look up variables occurring on the right-hand side of expressions in the list of already-assigned variables.
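    The actual fragment is in ML; this Python sketch (my own encoding, matching no particular tool) shows both variations. With blocking semantics, right-hand sides are substituted from the pending environment; with non-blocking semantics they are left referring to the registered values. Branches are merged by wrapping differing entries in a mux:

```python
def subst(e, env):
    """Rewrite net reads using the pending (already-assigned) expressions."""
    if e[0] == "net":
        return env.get(e[1], e)
    if e[0] == "binop":
        return ("binop", e[1], subst(e[2], env), subst(e[3], env))
    if e[0] == "mux":
        return ("mux", subst(e[1], env), subst(e[2], env), subst(e[3], env))
    return e  # constants

def to_list(cmds, env, blocking):
    """Flatten commands into one right-hand-side expression per register.
    With blocking=True, right-hand sides see values assigned earlier in the
    same thread; with blocking=False they see only the registered values."""
    env = dict(env)
    for c in cmds:
        if c[0] == "assign":
            env[c[1]] = subst(c[2], env) if blocking else c[2]
        elif c[0] == "if":
            cond = subst(c[1], env) if blocking else c[1]
            env_t = to_list(c[2], env, blocking)
            env_f = to_list(c[3], env, blocking)
            for reg in set(env_t) | set(env_f):
                old = env.get(reg, ("net", reg))
                t, f = env_t.get(reg, old), env_f.get(reg, old)
                env[reg] = t if t == f else ("mux", cond, t, f)
    return env
```

    Running both modes on `if (sel) begin x = 1; y = x; end else y = 0;` shows the difference: under blocking rules y's mux selects the freshly-assigned constant, under non-blocking rules it selects the old registered x.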

    Secondly, for each register that is more than one bit wide and is sourced from arithmetic operators or logic operators that cross the bit lanes (e.g. variable shifts), we must produce binary, gate-level circuits for each of these operations. In other words, we must remove the arithmetic operators and leave only boolean operators.

    Conversion to Bit Blasted Form ML fragment.

    Here is the additional example for bit-blasting the barrel shifter: BARREL ML fragment. Once everything is represented in store as a gate-level datatype, it can be printed out to an RTL file with code such as this: gatebuilder ML fragment.
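    The flavour of bit-blasting can be shown in a few lines of Python (my own sketch, reusing the same hypothetical tuple encoding for gates). An integer addition becomes per-bit XOR/AND/OR expressions via the ripple-carry recurrence, and a small evaluator checks the gates against ordinary arithmetic:

```python
def ripple_adder(a_bits, b_bits):
    """Bit-blast addition into AND/OR/XOR gate expressions, LSB first.
    Returns one expression per sum bit plus the final carry out."""
    carry = ("num", 0)
    sums = []
    for a, b in zip(a_bits, b_bits):
        half = ("xor", a, b)
        sums.append(("xor", half, carry))
        carry = ("or", ("and", a, b), ("and", carry, half))
    return sums + [carry]

def eval_gate(e, env):
    """Evaluate a gate expression over an environment of 0/1 net values."""
    if e[0] == "num":
        return e[1]
    if e[0] == "net":
        return env[e[1]]
    x, y = eval_gate(e[1], env), eval_gate(e[2], env)
    return {"and": x & y, "or": x | y, "xor": x ^ y}[e[0]]
```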

    Arrays are slightly more problematic.

    The name alias problem is that at compile time we might not be able to determine whether a pair of subscripts are going to be the same or not at run time, and hence, for blocking variable assigns we cannot always do a lookup. Secondly, the restricted number of ports leads to hazards that may need the design to be re-timed.

    Of course, this is a simplified approach to logic synthesis, and real tools must consider sub-expression sharing and replication depending on whether they are aiming for speed, area, power or some composite performance goal. Also, not all arithmetic units should be converted to gates: it is better to implement them by instantiating special-purpose components. If these components are not fully pipelined then we get further hazards.

    Memories and other pipeline hazards.

    We have a hazard when an operation cannot proceed because some information is not known or a resource is in use.

    RAMs have a small number of ports but when RTL arrays are held in RAM it is easy to write RTL expressions that require many operations on the contents of a RAM in one operation, even from within one thread. For instance we might need three operations on a RAM to implement

     A[x] <= A[y + A[z]].
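    Assuming a RAM with a single read/write port, those three operations must be serialised over three clock cycles with holding registers. A behavioural sketch (my own; the function and register names are illustrative only):

```python
def schedule_ram_update(ram, x, y, z):
    """Serialise A[x] <= A[y + A[z]] onto a RAM model with a single
    read/write port: two reads, then one write, over three clock cycles.
    Returns the port operation performed in each cycle."""
    t0 = ram[z]           # cycle 1: read A[z] into holding register t0
    t1 = ram[y + t0]      # cycle 2: read A[y + t0] into holding register t1
    ram[x] = t1           # cycle 3: write the held value back to A[x]
    return [("read", z), ("read", y + t0), ("write", x)]
```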

    Because RTL is a very low-level language, it requires the user to schedule port use manually. (However, some current FPGA tools do some of this automatically.)

    Multipliers and floating point units also typically present hazards.

    To overcome hazards automatically, stalls and holding registers must be inserted. The user's model of the design is stalled as ports are re-used in the time domain, using extra clock cycles to copy data to and from the holding registers.

    A non-fully-pipelined component cannot start a new operation on every clock cycle. Instead it has handshake wires that start it and inform the surrounding logic when it is ready.

    Automatic retiming.

    Apart from using retiming of the design to overcome hazards, the technique is also useful for balancing logic between pipeline stages. Retiming is therefore helpful for achieving timing closure: ensuring the critical path of the design is short enough that the clock can meet the envisioned target frequency.

    D-type migration is a transform that re-codes the state. For example, rather than forming the conjunction of (ANDing) two signals and registering the conjunct (feeding it through a D-type), one can instead form the conjunct of the registered inputs. A complete algebra can be used to annotate each signal with its offset from the designer's first-envisaged timing, and D-types can then be shunted around more or less at will. However, it is impossible to put a negative D-type on an input to the circuit (i.e. a gate that predicts the future); instead one must delay every other signal to compensate.
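    A small simulation (my own sketch, assuming all registers reset to 0) confirms that the two codings of the AND example produce identical output streams, since 0 AND 0 matches the reset value of the single migrated register:

```python
def and_then_reg(a_seq, b_seq):
    """Form a & b combinationally, then register the conjunct (one D-type)."""
    q, out = 0, []
    for a, b in zip(a_seq, b_seq):
        out.append(q)     # registered value visible this cycle
        q = a & b         # captured at the clock edge
    return out

def reg_then_and(a_seq, b_seq):
    """Register each input first, then AND the registered values (two D-types)."""
    qa, qb, out = 0, 0, []
    for a, b in zip(a_seq, b_seq):
        out.append(qa & qb)
        qa, qb = a, b
    return out
```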

    RAM and Offchip Memory

    RAM makes up more than half of most SoCs by transistor count and by area.

    Notes on RAM testing: use embedded software, wafer probe or dedicated BIST logic.

    RAM testing is not needed on an FPGA since the whole chip is tested at manufacture.

    Flash memory requires special mask processes and power supplies so is not normally included on the SoC. Ditto DRAM. These tend to be large area units that justify their own silicon chip on that basis as well. They are commodity items that one can buy futures in.

    Adder and multiplier structures.

    The ripple carry adder is generated by the ML fragment above. The Kogge-Stone adder is frequently used in practice as a synthesisable adder that is fast and not critically dependent on its layout.
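    The carry network of a Kogge-Stone adder can be modelled at word level in Python (my own sketch, not lectured material): generate/propagate pairs are combined in a parallel-prefix tree, so all carries are ready after log2(n) combining levels instead of n ripple stages:

```python
def kogge_stone_add(x, y, n=8):
    """n-bit Kogge-Stone adder: generate/propagate pairs combined in a
    parallel-prefix tree; carries are ready after log2(n) levels.
    Result is modulo 2**n (carry out discarded)."""
    g = [((x >> i) & (y >> i)) & 1 for i in range(n)]    # generate
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]    # propagate
    half = p[:]                     # XOR half-sums, kept for the final stage
    d = 1
    while d < n:                    # each level doubles the prefix span
        g2 = [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(n)]
        p2 = [p[i] & p[i - d] if i >= d else p[i] for i in range(n)]
        g, p = g2, p2
        d *= 2
    carry_in = [0] + g[:-1]         # carry into bit i = prefix generate below i
    return sum((half[i] ^ carry_in[i]) << i for i in range(n))
```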

    Flash multipliers use special Wallace tree cells (not lectured), whereas long multiplication using Booth's algorithm is a nice example of a non-fully-pipelined multiplier.
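    The sequential flavour of Booth's algorithm can be sketched in Python (my own rendering): the multiplier bits are scanned with an appended b(-1) = 0, adding or subtracting the shifted multiplicand at the boundaries of runs of 1s, one bit per clock cycle, which is why the unit is busy for n cycles per operation:

```python
def booth_multiply(a, b, n=8):
    """Radix-2 Booth long multiplication. b is an n-bit two's-complement
    value; a is an ordinary integer. One iteration models one clock cycle,
    so the unit cannot accept a new operation for n cycles."""
    product, prev = 0, 0            # prev is the appended bit b_{-1} = 0
    for i in range(n):
        bit = (b >> i) & 1
        if (prev, bit) == (0, 1):
            product -= a << i       # start of a run of 1s: subtract a * 2^i
        elif (prev, bit) == (1, 0):
            product += a << i       # end of a run of 1s: add a * 2^i
        prev = bit
    return product
```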

    Conclusion

    RTL is not as expressive for algorithms as software.

    Higher-level entry forms are ideally needed, perhaps scheduling within a thread at compile time and between threads at run time? This is what we do in our Kiwi project.


    END (C) 2009 DJ GREAVES.