Many of the principles taught in this book are relevant to system architects of all
kinds, including those designing cloud-scale applications, custom accelerators or
IoT devices, or those making FPGA designs. However, the details of design verification
in Chapter 8 are likely to be of interest mainly to those designing semi-custom silicon
using standard cells. A git repository of additional online material is available at
bitbucket.org/djg11/modern-soc-design-djg. This contains the data used to generate the
tables and graphs in the book, as well as further source code, lab materials, examples
and answers to selected exercises. The repository contains a SystemC model of the Zynq
super FPGA device family, coded in a blocking TLM style. It is sufficient to run an
Arm A9 Linux kernel using the same boot image as the real silicon. Published by Arm
Education Media, 605 pages in softback and ePDF. ISBN 978-1-911531-36-4.
Figure 1: Typical wireless link for digital communications, showing the transmitter
(top) and receiver (bottom) for a simplex link.
Figure 2: Block diagram and photographs of a first-generation Bluetooth USB
dongle.
Figure 3: Two PCIe 802.11 Wi-Fi modules for a laptop. Their shielding lids have
been removed. The older unit, on the left, has a three-chip set from Broadcom. It
is physically nearly twice the size of the newer unit, shown on the right, which has
been magnified. The new unit uses a Qualcomm QCA9565 device that has a higher
throughput and also includes Bluetooth. The only other active part needed is the
40-MHz crystal oscillator to the left of the chip. As can be seen at top right, to give
a lower-cost lower-performance product, the second of the two diversity antenna
sockets, labelled ‘ALT’, was not fitted.
temp := 200                    // Set initial temperature to a high value
ans := first_guess             // This is the design vector (or tree)
metric := metric_metric ans    // We seek the highest-metric answer
while (temp > 1)
{  // Create new design point, offsetting with delta proportional to temperature
   ans' := perturb_ans temp ans
   // Evaluate (scalar) objective function (figure of merit) for new design point
   metric' := metric_metric ans'
   // Accept if better, or probabilistically while the temperature is high
   accept := (metric' > metric) || rand(100..200) < temp;
   if (accept) (ans, metric, temp) := (ans', metric', temp * 0.99)
}
return ans;
Figure 4: An iteration for hill climbing using simulated annealing to find a design
point with the highest scalar metric.
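The pseudocode of Figure 4 translates directly into an ordinary imperative language. A minimal C++ sketch, assuming user-supplied metric-evaluation and perturbation routines (both names hypothetical, not from the book), might be:

#include <cstdlib>
#include <vector>

// Sketch of the annealing loop of Figure 4. The design point is held in a vector;
// measure_metric() and perturb() are supplied by the user for their own problem.
std::vector<int> anneal(std::vector<int> ans,
                        double (*measure_metric)(const std::vector<int>&),
                        std::vector<int> (*perturb)(double temp, const std::vector<int>&))
{
    double temp = 200.0;                      // Initial temperature, high
    double metric = measure_metric(ans);      // We seek the highest-metric answer
    while (temp > 1.0)
    {
        std::vector<int> trial = perturb(temp, ans);   // New design point
        double trial_metric = measure_metric(trial);   // Figure of merit
        // Accept if better, or probabilistically while the temperature is still high
        bool accept = (trial_metric > metric) || (100 + std::rand() % 100) < temp;
        // As in the figure, the temperature is only lowered when a move is accepted
        if (accept) { ans = trial; metric = trial_metric; temp *= 0.99; }
    }
    return ans;
}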
Figure 5: Screenshot from a place-and-route tool, showing how the wiring avoids
a region of hardened layout (grey area on the right). Some area is required to route
these interconnections.
Table 1: Determining the ROM needs of an MP3 decoder by disassembling its
.rodata segment.
Label | Start address (hex)
g_drmp3_pow43-0x120 | 0x0000
g_drmp3_pow43> | 0x120
g_scf_partitions.6678> | 0x0c40
… | …
_end_of_static | 0x2350
Table 2: Statistics logged during one second of a profile run of MP3 stream
decoding. The numbers of integer ALU operations exclude loop control and
array-subscripting operations.
Event type | Number of operations
Input bytes | 16 392
Output frames | 44 352
DCT operations | 154
Floating-point adds and subtracts | 874 965
Floating-point multiplies | 401 255
Integer adds and subtracts | 162 107
Integer multiplies | 88 704
Figure 6: Mailbox, relay and pump paradigms: three initiator/target configurations
possible for dual-ported IP blocks.
Figure 7: Schematic symbols of a FIFO buffer following the mailbox pattern:
synchronous (left) and clock-domain crossing (right).
Table 3: Synchronous delay and combinational paths for basic FIFO types.
Type | Data latency | Ready latency | Combinational paths
Fully registered | 1 | 1 | None
Bypass | 0 | 1 | WR_EN to RD_RDY
Pipelined | 1 | 0 | RD_EN to WR_RDY
Bubble-free | 0 | 0 | Both directions
Asynchronous | Several | Several | None
Credit-controlled | 1 | n/a | None
Figure 8: A generic schematic of a one-place synchronous FIFO buffer. A
synchronous set/reset flip-flop records whether there are valid data in the broadside
register. The dashed lines show optional latency-reducing wiring. The output
multiplexor is needed only if the dashed bypass wire is installed. Also illustrated are
typical patterns of combinational path wiring in the client source and sink. The red
annotations show a potential combinational loop.
Figure 9: Relay and mailbox paradigms for a one-place synchronous FIFO buffer. If
there are no combinational paths, the only differences are the net names.
Figure 10: A custom ALU inside a CPU core implemented in two similar ways: (a) as
a custom instruction or (b) as a coprocessor.
Figure 11: A specialised inter-core message-passing fabric allows cores to rapidly
exchange a packet composed of several successive registers with each other.
Figure 12: A custom function implemented as a peripheral IP block, with optional
DMA (bus master) capability.
Figure 13: Example of a hardware accelerator receiving data from a specialist
interface. It uses DMA for data transfer to the main memory. PIO is managed by the
host processor.
Figure 14: Representative bump-in-wire server blade architecture that has the FPGA
in series with the network interface and secondary storage (disk drives or SSD).
Figure 15: Simplified block diagram of a super FPGA, like the Xilinx Zynq devices.
Table 4: Key parameters for a range of super-FPGA parts available from Xilinx in
2018.
Device name | VU31P | VU33P | VU35P | VU37P | VU11P | VU13P
System logic cells (k) | 962 | 962 | 1907 | 2852 | 2835 | 3780
CLB flip-flops (k) | 879 | 879 | 1743 | 2607 | 2592 | 3456
CLB LUTs (k) | 440 | 440 | 872 | 1304 | 1296 | 1728
Maximum distributed RAM (Mb) | 12.5 | 12.5 | 24.6 | 36.7 | 36.2 | 48.3
Total block RAM (Mb) | 23.6 | 23.6 | 47.3 | 70.9 | 70.9 | 94.5
Ultra RAM (Mb) | 90.0 | 90.0 | 180.0 | 270.0 | 270.0 | 360.0
HBM DRAM (GB) | 4 | 8 | 8 | 8 | – | –
Clock management tiles | 4 | 4 | 5 | 3 | 6 | 4
DSP slices | 2880 | 2880 | 5952 | 9024 | 9216 | 12 288
PCIe ports | 4 | 4 | 5 | 6 | 3 | 4
CCIX ports | 4 | 4 | 4 | 4 | – | –
150G Interlaken | 0 | 0 | 2 | 4 | 6 | 8
100G Ethernet with RS-FEC | 2 | 2 | 5 | 8 | 9 | 12
Maximum single-ended I/O | 208 | 208 | 416 | 624 | 624 | 832
Multi-standard Gbps SERDES | 32 | 32 | 64 | 96 | 96 | 128
Table 5: Basic parameters for two caches with different speed/power ratios but the
same technology and bit density, versus cache size.
Cache size | L1 energy | L1 area | L1 hit rate | L1 access time | L1 mean time | L2 energy | L2 area | L2 hit rate | L2 access time | L2 mean time
1 | 0.01 | 0.001 | 0.002 | 0.0 | 200 | 0.001 | 0.001 | 0.002 | 0.1 | 200
2 | 0.02 | 0.002 | 0.004 | 0.0 | 199 | 0.002 | 0.002 | 0.004 | 0.1 | 199
4 | 0.04 | 0.004 | 0.008 | 0.0 | 198 | 0.004 | 0.004 | 0.008 | 0.2 | 198
8 | 0.08 | 0.008 | 0.015 | 0.0 | 197 | 0.008 | 0.008 | 0.015 | 0.3 | 197
16 | 0.16 | 0.016 | 0.030 | 0.0 | 194 | 0.016 | 0.016 | 0.030 | 0.4 | 194
32 | 0.32 | 0.032 | 0.059 | 0.1 | 188 | 0.032 | 0.032 | 0.059 | 0.6 | 188
64 | 0.64 | 0.064 | 0.111 | 0.1 | 178 | 0.064 | 0.064 | 0.111 | 0.8 | 178
128 | 1.28 | 0.128 | 0.200 | 0.1 | 160 | 0.128 | 0.128 | 0.200 | 1.1 | 160
256 | 2.56 | 0.256 | 0.333 | 0.2 | 133 | 0.256 | 0.256 | 0.333 | 1.6 | 134
512 | 5.12 | 0.512 | 0.500 | 0.2 | 100 | 0.512 | 0.512 | 0.500 | 2.3 | 101
1024 | 10.24 | 1.024 | 0.667 | 0.3 | 67 | 1.024 | 1.024 | 0.667 | 3.2 | 69
2048 | 20.48 | 2.048 | 0.800 | 0.5 | 40 | 2.048 | 2.048 | 0.800 | 4.5 | 44
4096 | 40.96 | 4.096 | 0.889 | 0.6 | 23 | 4.096 | 4.096 | 0.889 | 6.4 | 28
8192 | 81.92 | 8.192 | 0.941 | 0.9 | 13 | 8.192 | 8.192 | 0.941 | 9.1 | 20
16 384 | 163.84 | 16.384 | 0.970 | 1.3 | 7 | 16.384 | 16.384 | 0.970 | 12.8 | 18
32 768 | 327.68 | 32.768 | 0.985 | 1.8 | 5 | 32.768 | 32.768 | 0.985 | 18.1 | 21
65 536 | 655.36 | 65.536 | 0.992 | 2.6 | 4 | 65.536 | 65.536 | 0.992 | 25.6 | 27
131 072 | 1310.72 | 131.072 | 0.996 | 3.6 | 4 | 131.072 | 131.072 | 0.996 | 36.2 | 37
262 144 | 2621.44 | 262.144 | 0.998 | 5.1 | 5 | 262.144 | 262.144 | 0.998 | 51.2 | 51
Table 6: Some composite design examples.
L1 size | L2 size | L2 energy | Composite energy | Composite area | Composite mean time
64 | 262 144 | 233.0 | 233.6 | 262.2 | 45.8
128 | 262 144 | 209.7 | 211.0 | 262.3 | 41.2
1024 | 262 144 | 87.4 | 97.6 | 263.2 | 17.4
4096 | 262 144 | 29.1 | 70.1 | 266.2 | 6.3
Table 7: Power laws for performance delivered and power used in terms of three
independent parameters.
Metric | Core complexity ( ) | DVFS voltage ( ) | Number of cores ( )
Performance delivered | | |
Power used | | |
Increase in power for double performance | 4 | 8 | 2.16
Figure 16: Relative performance of simple and complex CPU cores with DVFS.
Figure 17: Three multi-access NoC topologies.
Figure 18: Comparison of various multi-access and switched NoC designs in terms
of throughput (136 bit words per clock cycle) per gate input as the number of end
points (stations) is increased.
# Unfold=1 is the baseline design. Unfold=3 uses three times more silicon.
static_dynamic_tradeoff <- function(clock_freq, leakage, unfold, xx)
{
  op_count <- 2e7;
  # Model: Pollack-like unfold benefit
  execution_time <- op_count / clock_freq / (unfold ^ 0.75);
  # Model: Higher supply needed for higher clock and leakage resistance slightly increasing with Vdd
  vdd <- 1 + 0.5 * (clock_freq/100e6);
  static_power <- leakage * vdd ^ 0.9 * unfold;
  # Integrate static power and energy
  static_energy <- static_power * execution_time;
  # Use CV^2 for dynamic energy
  dynamic_energy <- op_count * vdd ^ 2.0 * 5e-10;
}
Figure 19: Trading off Aesop’s hare versus the tortoise for increasingly leaky
technology. In a hard real-time computation, we know the number of clock cycles
needed but should we do them quickly and halt (Hare) or do them slowly and finish
just in time (Tortoise)?
Figure 20: Shift in the sweet spot (small vertical line) for DVFS with high-leakage
technology for a real-time task. The total energy (solid line) is the sum of the static
energy (dashes) and dynamic energy (dots).
Figure 21: Block diagram for an MPEG compression subsystem. The input is a
real-time video stream and the output is an MPEG transport stream (without sound).
The DCT, IDCT and motion estimators are highly compute intensive.
/* Generate variable-length codes for an intra-coded
block (6.2.6, 6.3.17) */
void putintrablk(Picture picture, short [] blk, int cc)
{
/* DC Difference from previous block (7.2.1) */
int dct_diff = blk[0] - picture.dc_dct_pred[cc];
picture.dc_dct_pred[cc] = blk[0];
if (cc==0) putDClum(dct_diff);
else putDCchrom(dct_diff);
/* AC coefficients (7.2.2) */
int run = 0;
byte [] scan_tbl = (picture.altscan ? alternate_scan:
zig_zag_scan);
for (int n=1; n<64; n++)
{ // Use appropriate entropy scanning pattern
int signed_level = blk[scan_tbl[n]];
if (signed_level!=0)
{
putAC(run, signed_level, picture.intravlc);
run = 0;
}
else run++; /* count zero coefficients */
}
/* End of Block -- normative block punctuation */
if (picture.intravlc!=0) putbits(6,4); // 0110 (B-15)
else putbits(2,2); // 10 (B-14)
}
// Return difference between two (8*h) sub-sampled blocks
// blk1, blk2: addresses of top left pels of both blocks
// rowstride: distance (in bytes) of vertically adjacent pels
// h: height of block (usually 8 or 16)
int sumsq_sub22(byte [] blk1, byte [] blk2, int rowstride, int h)
{
int ss = 0, p1 = 0, p2 = 0;
for (int j=0; j<h; j++)
{
for (int i=0; i<8; i++)
{ int v = blk1[p1+i] - blk2[p2+i];
ss += v*v;
}
p1+= rowstride; p2+= rowstride;
}
return ss;
}
Figure 22: Code fragments for an MPEG encoder, coded in C# for Kiwi HLS. The
putintrablk routine takes the 64 values from the quantised DCT and transmits
them using variable-length coding, with a run-length encoding of zero values
in the routines it calls. The sumsq_sub22 function is one of many used during
motion estimation to compute differences between macroblocks.
Figure 23: Typical examples of FUs deployed by higher-level design languages,
especially HLS.
Figure 24: Fixed-point implementation of a floating-point accumulator.
Figure 25: Dynamic load balancing using the server farm paradigm.
Figure 26: Generic setup of a system integrator tool. This is the IP-XACT reference
model for design capture and synthesis. A central interactive editor enables a
system to be created from externally provided IP blocks by invoking generator and
configurator tools.
Figure 27: IP-XACT structures for documenting the memory map and register field
definitions, plus the access type supported for each bit field.
Figure 28: Generating an interconnect from IP-XACT: abstract view (top) using five
IP-XACT XML files and synthesised wiring (bottom).
Figure 29: Example of high-level manual wiring between subsystems with wiring
configured using a GUI.
Figure 30: Screenshot of the connectivity matrix editor within Arm Socrates.
Figure 31: An example Chisel module: a static-priority arbiter, showing Chisel
source code (left) and the circuit generated (right) when n_inputs is 3.
Figure 32: An 8-bit counter with a synchronous load using sequential logic and a
when/otherwise clause: Chisel source code and schematic symbol.
Figure 33: Three communication processes expressed using Handel-C.
module mkTb1 (Empty);            // This module has no externally callable methods
   Reg#(int) rx <- mkReg (23);   // Create an instance of a register called rx, initialised to 23
   rule countone (rx < 30);      // A rule named 'countone' with an explicit guard
      int y = rx + 1;            // This is short for int y = rx.read() + 1;
      rx <= rx + 1;              // This is short for rx.write(rx.read() + 1);
      $display ("countone: rx = %0d, y = %0d", rx, y);
   endrule
   rule counttwo (rx > 20);      // A competing rule, also guarded
      rx <= rx + 2;              // This increments by two
      $display ("counttwo: rx = %0d", rx);
   endrule
   rule done (rx >= 40);         // A third rule
      $finish (0);
   endrule
endmodule: mkTb1
Figure 34: A Bluespec example in which three rules compete to act on a simple
broadside register.
interface Pipe_ifc;
method Action put(int arg);
method int get();
endinterface
_________________________________________
module mkTb2 (Empty); // Testbench
Reg#(int) x <- mkReg ('h10);
Pipe_ifc thepipe <- mkPipe;
rule fill; // explicit guard of (true) is implied
thepipe.put(x);
// This is short for x.write(x.read() + 'h10);
x <= x + 'h10;
endrule
rule drain;
let y = thepipe.get();
$display (" y = %0h", y);
endrule
endmodule
Figure 35: Bluespec definition of an example interface, the Pipe_ifc (top left),
the definition of a testbench component that instantiates a component with that
interface and two rules to operate on it (bottom left), and a net-level schematic
symbol for the instantiated component showing handshake nets. The symbol for
the Tb2 component has no external nets (apart from clock and reset) since it has an
empty interface.
Figure 36: Basic steps of an HLS tool chain for ASIC or FPGA.
// A simple long multiplier with
// variable latency
int multiply(int A, int B)
{
int RA=A;
int RB=B;
int RC=0;
while(RA>0)
{
if (odd(RA)) RC = RC + RB;
RA = RA >> 1;
RB = RB << 1;
}
return RC;
}
module LONGMULT8b8(clk, reset, C, Ready, A, B, Start);
input clk, reset, Start;
output Ready;
input [7:0] A, B;
output [15:0] C;
reg [15:0] RC, RB, RA;
reg Ready;
reg xx, yy, qq, pp; // Control and predicate nets
reg [1:0] fc;
reg [3:0] state;
always @(posedge clk) begin
xx = 0; // default settings.
yy = 0;
fc = 0;
// Predicates
pp = (RA!=16'h0); // Work while pp holds
qq = RA[0]; // Odd if qq holds
if (reset) begin // Sequencer
state <= 0;
Ready <= 0;
end
else case (state)
0: if (Start) begin
xx = 1;
yy = 1;
fc = 2;
state <= 1;
end
1: begin
fc = qq;
if (!pp) state <= 2;
end
2: begin
Ready <= 1;
if (!Start) state <= 3;
end
3: begin
Ready <= 0;
state <= 0;
end
endcase // case (state)
RB <= (yy) ? B: RB<<1; // Data path
RA <= (xx) ? A: RA>>1;
RC <= (fc==2) ? 0: (fc==1) ? RC+RB: RC;
end
assign C = RC;
endmodule
Figure 37: A very basic worked HLS example showing the input source code (left)
and generated RTL (right) for a 32-bit multiplier as a variable-latency FU with start
and ready handshake nets.
Figure 38: Long multiplier output from HLS showing the internal data path and
sequencer (left) and the schematic symbol (right). The result is used as an FU in a
higher-level design.
Figure 39: An example of a static schedule for a basic block containing a single
assignment.
Figure 40: Various circuits that compute a running sum. The everyday circuit C1 can
be used if the adder has a latency of 1. However, if the adder has a latency of 2, more
complex circuits are needed, namely C2 and C3.
Figure 41: Fragment of an example inter-block initiation and hazard graph. Block 1
may be followed by itself or by block 2. Each block schedule contains both C (control
flow) predicate evaluation and D (data) computation operations.
public static int associative_reduction_example(int starting)
{  int vr = 0;
   for (int i=0; i<15; i++) // or also i+=4
   {  int vx = (i+starting)*(i+3)*(i+5); // Mapped computation
      vr ^= ((vx&128)>0 ? 1:0);          // Associative reduction
   }
   return vr;
}
Figure 42: An iteration that performs associative reduction (in vr).
Figure 43: An iteration that has a loop-carried data dependency (through variable
vd).
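The listing for Figure 43 is not reproduced above. Purely as an illustration (not the book's code), a loop-carried dependency of the kind the caption describes, where each iteration needs the vd value computed by the previous one and so iterations cannot be overlapped, might look like this C++ sketch:

// Hypothetical sketch only: vd carries data from one iteration to the next,
// so iteration i cannot start until iteration i-1 has produced its vd.
int loop_carried_example(int seed)
{
    int vd = seed;
    for (int i = 0; i < 20; i++)
    {
        vd = (vd * 7 + i) & 0xFFFF;   // Next value needs the previous vd
    }
    return vd;
}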
static int [] foos = new int [10];
static int ipos = 0;

public static int loop_forwarding_example(int newdata)
{  foos[ipos ++] = newdata;
   ipos %= foos.Length;
   int sum = 0;
   for (int i=0; i<foos.Length-1; i++)
   {  int dv = foos[i]^foos[i+1]; // Two adjacent locations are read
      sum += dv;                  // Associative scalar reduction in sum
   }
   return sum;
}
Figure 44: A loop where data fetched in one iteration (foos[i]) can usefully be
forwarded to a subsequent iteration.
public static int data_dependent_controlflow_example(int seed)
{  int vr = 0;
   int i;
   for (i=0; i<20; i++)
   {  vr += i*i*seed;
      if (vr > 1111) break; // Early loop exit
   }
   return i;
}
Figure 45: A loop that has data-dependent control flow (the loop exit depends on
variable vr).
Figure 46: Lookup procedure when searching for a string using the BWT. A haystack
has been transformed and stored in Bwt[]. An index Ranks[] is computed for it.
Looking for the string in Needle is then very fast.
Figure 47: Compacted Ranks array for BWT, redefinition of the ranks() routine for
an interpolated lookup and a sensible layout in a DRAM row.
Figure 48: Data dependencies (slightly simplified) in the Smith–Waterman
alignment-finding algorithm.
Figure 49: Affine transformation examples, adapted from [?]. Nest 1 runs after nest
2, but by skewing the access pattern of each loop, there is increased parallelism
available within each nest and pipelining becomes possible. Nest 2 commences just
after nest 1 has produced its first diagonal output.
Figure 50: Shuffle data flow for an FFT (left), elemental butterfly (top right) and code
(bottom right).
Figure 51: A circuit intended to compute the running sum of streaming data values.
Figure 52: Overall setup of a generic ESL flow. The heavy lines show the fastest
technique, which avoids using an instruction set simulator (ISS) through native
cross-compiling of the embedded firmware and direct linking with behavioural
models of the hardware.
SC_MODULE(mycounter)       // An example of a leaf module (no subcomponents)
{
  sc_in < bool > clk, reset;
  sc_out < sc_int<10> > myout;

  void mybev()             // Internal behaviour, invoked as an SC_METHOD
  {
    myout = (reset) ? 0: (myout.read()+1); // Use .read() since sc_out makes a signal
  }

  SC_CTOR(mycounter)       // Constructor
  {
    SC_METHOD(mybev);      // Require that mybev is called on each positive edge of clk
    sensitive << clk.pos();
  }
};
Figure 53: A 10-bit binary counter with synchronous reset, coded as a SystemC class.
Figure 57: An example waveform view plotted by gtkwave.
sc_trace_file *tf = sc_create_vcd_trace_file("tracefilename");
// Now call:
//   sc_trace(tf, <traced variable>, <string>);
sc_signal < bool > serialin("serialin");  // A named signal
sc_signal < bool > serialout;             // An unnamed signal
float fbar;
sc_trace(tf, clk);
sc_trace(tf, serialin);
sc_trace(tf, serialout, "serialout");     // Give name since not named above
sc_trace(tf, fbar, "fbar");               // Give name since POD form
sc_start(1000, SC_NS);                    // Simulate for 1 microsecond (old API)
sc_close_vcd_trace_file(tf);
return 0;
Figure 58: Naming nets.
sc_signal < bool > mywire;   // Rather than a channel conveying just one bit

struct capsule
{
  int ts_int1, ts_int2;

  bool operator== (struct capsule other)
  {
    return (ts_int1 == other.ts_int1) && (ts_int2 == other.ts_int2);
  }

  int next_ts_int1, next_ts_int2; // Pending updates

  void update()
  {
    ts_int1 = next_ts_int1;
    ts_int2 = next_ts_int2;
  }
  ... ... // Also must define read(), write() and value_changed()
};

sc_signal < struct capsule > myast; // We can send two integers at once
Figure 59: Equality operator overload.
void mymethod() { .... }

SC_METHOD(mymethod)
sensitive << myast.pos(); // User must define concept of posedge for their own abstract type
Figure 60: Positive edge.
Figure 61: Three views of the four-phase transactional protocol from §?? operating
between a sender and a receiver. The views are for a net-level connection, an
untimed TLM push and an untimed TLM pull.
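To make the push/pull distinction concrete, here is a minimal hedged C++ sketch (plain structs rather than real SystemC/TLM classes; all names are illustrative): in the push style the data producer is the initiator and calls a method exported by the consumer, whereas in the pull style the consumer initiates and calls a method exported by the producer.

#include <queue>

// Illustrative only: untimed push versus pull styles for a word-at-a-time stream.
struct push_target {                  // Push style: the producer calls put() on the consumer
    std::queue<int> backlog;
    void put(int word) { backlog.push(word); }   // Returning from put() completes the transaction
};

struct pull_target {                  // Pull style: the consumer calls get() on the producer
    int next_word = 0;
    int get() { return next_word++; }            // Returning from get() completes the transaction
};

In both cases the four-phase net-level handshake collapses into a single blocking method call whose return marks the end of the transaction.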
Figure 62: Schematic of a FIFO device (top) and its typical instantiation (bottom).
// Define the interfaces:
class write_if: public sc_interface
{ public:
    virtual void write(char) = 0;
    virtual void reset() = 0;
};

class read_if: public sc_interface
{ public:
    virtual char read() = 0;
};

// Define a component that inherits:
class fifo_dev: sc_module("fifo_dev"), public write_if, public read_if, ...
{
  void write(char) { ... }
  void reset() { ... }
  ...
};
Figure 63: SystemC module definition of an interface (left) and FIFO writer (right).
Figure 64: TLM 2.0 generic payload structure (left) and interconnection of three IP
blocks showing passthrough and demultiplexing (right).
// Filling in the fields of a TLM 2.0 generic payload:
trans.set_command(tlm::TLM_WRITE_COMMAND);
trans.set_address(addr);
trans.set_data_ptr(reinterpret_cast<unsigned char*>(&data));
trans.set_data_length(4);
trans.set_streaming_width(4);
trans.set_byte_enable_ptr(0);
trans.set_response_status( tlm::TLM_INCOMPLETE_RESPONSE );

// Sending the payload through a TLM socket:
socket->b_transport(trans, delay);
Figure 65: Setting up a generic payload.
Table 8: List of TLM 2.0 convenience socket types.
simple_initiator_socket.h: A version of an initiator socket that has a default implementation of all interfaces. It allows the registration of an implementation for any of the interfaces to the socket, either unique interfaces or tagged interfaces (carrying an additional ID).

simple_target_socket.h: A basic target socket that has a default implementation of all interfaces. It also allows the registration of an implementation for any of the interfaces to the socket, either unique interfaces or tagged interfaces (carrying an additional ID). This socket allows only one of the transport interfaces (blocking or non-blocking) to be registered and implements a conversion if the socket is used on the other interface.

passthrough_target_socket.h: A target socket that has a default implementation of all interfaces. It also allows the registration of an implementation for any of the interfaces to the socket.

multi_passthrough_initiator_socket.h: An implementation of a socket that allows multiple targets to be bound to the same initiator socket. It implements a mechanism that allows the index of the socket the call passed through in the backward path to be identified.

multi_passthrough_target_socket.h: An implementation of a socket that allows multiple initiators to bind to the same target socket. It implements a mechanism that allows the index of the socket the call passed through in the forward path to be identified.
Figure 66: Illustration of modelling artefacts relating to model interleaving and
timing arising from loosely timed TLM. In a cycle-accurate or approximately timed
model (left), two threads, T0 and T1, can access a shared resource. One blocks while
the resource is busy. With loose timing (right), the average contention delay is added
to the response times of all customers. With a large quantum, one thread may run
entirely before the other.
vqueue::b_transact(pkt, sc_time &delay)
{
  // Measure utilisation and predict queue delay based on last 32 transactions
  if (++opcount == 32)
  {
    sc_time delta = sc_time_stamp() + delay - last_measure_time;
    local_processing_delay += (delay_formula(delta/32) - local_processing_delay)/16;
    logging.log(25, delta);        // Record utilisation
    last_measure_time = sc_time_stamp() + delay;
    opcount = 0;
  }
  // Add estimated (virtual) queuing penalty
  delay += local_processing_delay;
  // Do actual work
  output.b_transact(pkt, delay);
}
Figure 67: Essence of the code for a virtual queue contention point. A number of
clients send work using vqueue::b_transact. The work is done straight away by
output.b_transact, but the virtual time spent in the queue is added to the loosely
timed delay.
Figure 68: Typical setup of a thread using loosely timed modelling with a quantum
keeper for one core of an MPSoC. In reality there would be multiple interconnect
components between the initiator and its final target.
Figure 69: Mixing modelling styles using a target-to-initiator transactor. An
initiator-to-target transactor, not shown, would receive net-level transactions and
make method calls on a TLM target.
Figure 70: Mixing modelling styles using a mailbox paradigm transactor.
Figure 71: Five views of the LocalLink protocol (also of AXI4-Stream). (a) Timing
diagram. (b) Net-level wiring. (c) TLM abstract view. (d) Blocking TLM target code,
big step. (e) Blocking TLM, small step.
void mips64iss::step()
{
  u32_t ins = ins_fetch(pc);
  pc += 4;
  u8_t opcode = ins >> 26;       // Major opcode
  u8_t scode = ins&0x3F;         // Minor opcode
  u5_t rs = (ins >> 21)&31;      // Registers
  u5_t rd = (ins >> 11)&31;
  u5_t rt = (ins >> 16)&31;
  if (!opcode) switch (scode)    // Decode minor opcode
  {
    case 052: /* SLT - set on less than */
      regfile_up(rd, ((int64_t)regfile[rs]) < ((int64_t)regfile[rt]));
      break;
    case 053: /* SLTU - set on less than unsigned */
      regfile_up(rd, ((u64_t)regfile[rs]) < ((u64_t)regfile[rt]));
      break;
    ... ...
  }
}

void mips64iss::regfile_up(u5_t d, u64_t w32)
{
  if (d != 0) // Register zero stays at zero
  {
    TRC(trace("[ r%i := %llX ]", d, w32));
    regfile[d] = w32;
  }
}
Figure 72: The main step function of a MIPS64 ISS, together with the helper that updates the register file.
Table 9: Typical relative performance of different
virtual platform processor modelling approaches compared with real time. Figures
assume a high-performance modelling workstation and a SoC with just one core.
Index | Type of ISS | I-cache traffic modelled | D-cache traffic modelled | Relative performance
(1) | Interpreted RTL | Y | Y | 0.000001
(2) | Compiled RTL | Y | Y | 0.00001
(3) | V-to-C C++ | Y | Y | 0.001
(4) | Handcrafted cycle-accurate C++ | Y | Y | 0.1
(5) | Handcrafted high-level C++ | Y | Y | 1.0
(6) | Trace buffer/JIT C++ | N | Y | 20.0
(7) | Native cross-compile | N | N | 50.0
Figure 73: Block diagram of one channel of a DMA unit. (Repeated from
Figure 254).
// Behavioural model of the target side: PIO register read/write
uint32 src, dest, length;
bool busy, int_enable;

u32_t status() { return (busy << 31) | (int_enable << 30); }

u32_t completer_read(u32_t a)
{
  return (a==0)? src: (a==4) ? dest: (a==8) ? (length) : status();
}

void completer_write(u32_t a, u32_t d)
{
  if (a==0) src=d;
  else if (a==4) dest=d;
  else if (a==8) length = d;
  else if (a==12) { busy = d >> 31; int_enable = d >> 30; }
}
// Behavioural model of the bus-mastering portion
while(1)
{
  waituntil(busy);
  while (length-- > 0) mem.write(dest++, mem.read(src++));
  busy = 0;
}
Interrupt code:
interrupt = int_enable&!busy;
Figure 74: Behavioural model of (one channel of) the DMA controller.
Table 10: Product space of power and area estimates for an RTL implementation.
 | Without simulation | Using simulation
Without place and route | Fast design exploration. Area and delay heuristics needed | Can generate indicative activity ratios that can be used instead of a simulation in further runs
With place and route | Static timing analyser will give an accurate clock frequency | Gold standard: only bettered by measuring a real chip
module CTR16( input mainclk, input din, input cen, output o);

  reg [3:0] count, oldcount;              // D-types
  always @(posedge mainclk) begin
    if (cen) count <= count + 1;          // ALU
    if (din) oldcount <= count;           // Wiring
  end

  assign o = count[3] ^ count[1];         // Combinational
endmodule
Figure 75: Example RTL fragment used in a static analysis.
Table 11: Normal parameters for a RAM compiler together with selected values.
Parameter | Value
Fabrication geometry | 22 nm
Organisation | 64K x 64
Nominal VDD | 1.0 V
Number of banks | 1
Read/write ports per bank | 1
Read ports per bank | 0
Write ports per bank | 0
Figure 76: Performance and overheads for various sizes of a single-ported 32-bit
data width SRAM implemented in 22 nm, as modelled by Cacti 7.0. The tool gives
the access time, read and write dynamic energy use, and static power and area use
for each size.
void sram64_cbg::recompute_pvt_parameters() // Called in constructor and when Vcc is changed
{
  m_latency = sc_time(0.21 + 3.8e-4 * sqrt(float(m_bits)), SC_NS);
  pw_power leakage = pw_power(82.0 * m_bits, PW_nW);
  set_static_power(leakage);
  set_fixed_area(pw_area(13359.0 + 4.93/8 * m_bits, PW_squm));
  m_read_energy_op = pw_energy(5.0 + 1.2e-4 / 8.0 * m_bits, pw_energy_unit::PW_pJ);
  m_write_energy_op = 2.0 * m_read_energy_op; // Rule of thumb!
  // NB: Might want different energy when high-order address bits change
  pw_voltage vcc = get_vcc();
  m_latency = m_latency / vcc.to_volts();
  cout << name () << ":" << kind() << ": final latency = " << m_latency << "\n";
}
void sram64_cbg::b_access(PW_TLM_PAYTYPE &trans, sc_time &delay)
{
  tlm::tlm_command cmd = trans.get_command();
  // Log wiring power consumed by transaction arriving here.
  // Also set which nets modelled by the TLM will be active after this operation:
  // for a write, none (except a response ack) and for a read, the payload data.
  trans.pw_log_hop(this,
       (cmd==tlm::TLM_READ_COMMAND ? PW_TGP_DATA: PW_TGP_NOFIELDS) | PW_TGP_ACCT_CKP,
       &read_bus_tracker);
  if (cmd == tlm::TLM_READ_COMMAND)
  {
    // Log internal transaction energy for read
    pw_module_base::record_energy_use(m_read_energy_op);
    ...
  }
  else if (cmd == tlm::TLM_WRITE_COMMAND)
  {
    // Log internal transaction energy for write
    pw_module_base::record_energy_use(m_write_energy_op);
    ...
  }
}
Figure 77: TLM+TE model of an SRAM. Constructor and blocking transport
methods.
Figure 78: A die-stacked DRAM subsystem in the style of the Micron HMC and
HBM. Several DRAM chips are piled on top of an upside-down (flip-chip) controller
chip, interconnected into a 3-D structure using through-silicon vias. There could be
several stacks. A passive interposer (ceramic PCB) wires the logic die using a parallel
interconnect to SERDES chips (also upside-down) for serial connection over the
circuit board to the hosts.
Table 12: Example performance data from collated DRAMSim4 data sheets for
a variety of DDR4 devices, showing their closed page, random access time and
dynamic energy for several major operations (total energy for 8 devices making up
a 32-bit data bus). Only two significant figures are accurate.
Device type (capacity, word size and speed) | (cycles) | Clock frequency (MHz) | Access time (ns) | Activate (pJ) | Read (pJ) | Write (pJ)
DDR4_4Gb_x4_1866 | 13 | 934.6 | 32.1 | 4006.1 | 5752.3 | 6985.0
DDR4_4Gb_x4_2133 | 16 | 1063.8 | 33.8 | 3754.0 | 5775.4 | 7219.2
DDR4_4Gb_x4_2400 | 17 | 1204.8 | 31.5 | 4063.7 | 5418.2 | 7330.6
DDR4_4Gb_x4_2666 | 19 | 1333.3 | 31.5 | 4104.0 | 6048.0 | 7488.0
DDR4_8Gb_x4_1866 | 13 | 934.6 | 32.1 | 5156.5 | 5341.4 | 4930.6
DDR4_8Gb_x4_2133 | 16 | 1063.8 | 33.8 | 5270.0 | 4692.5 | 4331.5
DDR4_8Gb_x4_2666 | 19 | 1333.3 | 31.5 | 6105.6 | 4608.0 | 4089.6
DDR4_8Gb_x4_2933 | 21 | 1470.6 | 31.3 | 6632.4 | 4595.7 | 4021.2
DDR4_8Gb_x4_3200 | 22 | 1587.3 | 30.2 | 7136.6 | 4644.9 | 4015.9
Figure 79: A Parallella single-board computer. The two 3320C USB driver devices
and connectors are highlighted in red boxes.
Table 13: Phase/mode example: supply rail voltage and current and total power
consumption for 3320C-EZK USB line driver devices.
Operating mode | Rail 1 (volts) | Rail 1 (mA) | Rail 2 (volts) | Rail 2 (mA) | Rail 3 (volts) | Rail 3 (mA) | Total power (mW)
Standby | 3.3 | 0.018 | 1.8 | 0.0007 | 3.3 | 0.03 | 0.16
L/S mode | 3.3 | 6.3 | 1.8 | 11 | 3.3 | 5 | 57
H/S mode | 3.3 | 29 | 1.8 | 22 | 3.3 | 59 | 155
Figure 80: Example power estimation spreadsheet. The inventory of instantiated
components is annotated with clock frequencies and activity ratios to generate an
approximate total power use in watts.
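The arithmetic behind such a spreadsheet can be sketched as follows, assuming the standard a*C*V^2*f expression for dynamic power plus a per-component leakage term. This is a hedged illustration; the component names and numbers are invented, not taken from the figure.

#include <cstdio>

// Illustrative power-estimation rows: each instantiated component contributes
// activity * C * Vdd^2 * f of dynamic power plus its static leakage.
struct component { double cap_farads, vdd_volts, freq_hz, activity, leakage_watts; };

double total_power(const component *rows, int n)
{
    double watts = 0.0;
    for (int i = 0; i < n; i++)
        watts += rows[i].activity * rows[i].cap_farads
               * rows[i].vdd_volts * rows[i].vdd_volts * rows[i].freq_hz
               + rows[i].leakage_watts;
    return watts;
}

int main()
{
    component rows[] = {
        { 200e-12, 0.9, 400e6, 0.2, 0.003 },   // e.g. a CPU core (illustrative numbers)
        {  50e-12, 0.9, 200e6, 0.1, 0.001 },   // e.g. a peripheral block
    };
    std::printf("Approximate total power: %f W\n", total_power(rows, 2));
    return 0;
}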
Figure 81: Two similar designs with different Rent exponents (top) and two
non-Rentian design points (bottom).
Figure 82: Left: average net length in systems composed of 59 and 671 gates for
various values of rent exponent. Right: average net length for subsystems of different
sizes for Rent exponent of 0.65 using four improved equations presented in [?].
Figure 83: Lowest common parent of the end-point logic blocks. The distance
between two end points is always roughly the same for any sensible layout of a
design, so a detailed layout, like the one shown, is not required.
Figure 84: Example of a transactional energy modelling plot. Static power depends
on the power mode, whereas the dynamic power depends on internal and external
activity events.
Figure 85: Example reports generated by the TLM POWER3 library. The energy use
between two checkpoints is shown in the upper table and power consumption in
the lower table. A total for the whole device is given, along with subtotals for parts of
the design hierarchy specifically selected by the user.
Figure 86: Back-end stages in the synthesis and physical design of a SoC, showing
representative diagrams for five of the stages. The red track on the left is the physical
intent flow.
Figure 87: A small example of a universal power format (UPF) file.
Table 14: Representative microprocessors
Year introduced | Microprocessor | No of transistors | Geometry
2007 | Dual-core Intel Itanium 2 | 1.6 billion | 90 nm
2010 | 8-core Intel Nehalem | 2.3 billion | 45 nm
2010 | Altera Stratix IV FPGA | 2.5 billion | 40 nm
2015 | Intel CPU | circa 10 billion | 19 nm
2020 | Nvidia's GA100 Ampere | 54 billion | 7 nm
Figure 88: Technology scaling scatter plot.
Figure 89: Dark silicon (using data from [?, ?]).
Figure 90: Basic layers in planar CMOS VLSI. Cross section through an invertor
fabricated using planar transistors (not FinFETs) in a twin-well process and metal
layers, M1 to M7 (not to scale).
Figure 91: Generic 3-D structure of a FinFET (left). The gate is wrapped around the
channel. To get a sufficiently low on-resistance, a number of FETs are arranged in
parallel using techniques shown in the other three diagrams, which are in 2-D plan
view.
Figure 92: Typical first-generation standard cell layout showing three rows of cells.
Power rails of alternating polarity run horizontally using the metal M1 layer (blue).
This is also used for the internal wiring within the cells. N- and P-type diffusion
zones are shown in green and yellow. FET gates are formed where the polysilicon
(orange) crosses a diffusion layer. The violet and brown indicate metal wiring
layers used for inter-cell connections. The brown layer is predominantly used for
horizontal net segments and the violet for vertical segments.
Table 15: ITRS roadmap projection for geometry, supply voltage, transistor
properties and FO3 gate energy. Predictions are made for both high-performance
(HP) and low-power (LP) transistors.
Year of production | 2015 | 2017 | 2019 | 2021 | 2024 | 2027 | 2030
Logic device technology names | P70M56 | P48M36 | P42M24 | P32M20 | P24M12G1 | P24M12G2 | P24M12G3
Logic industry node range label (nm) | 16/14 | 11/10 | 8/7 | 6/5 | 4/3 | 3/2.5 | 2/1.5
Logic device structure | FinFET, FDSOI | FinFET, FDSOI | FinFET, LGAA | FinFET, LGAA, VGAA | VGAA, M3D | VGAA, M3D | VGAA, M3D
Device Electrical Specifications
Power supply voltage (V) | 0.80 | 0.75 | 0.70 | 0.65 | 0.55 | 0.45 | 0.40
Sub-threshold slope (mV/decade) | 75 | 70 | 68 | 65 | 40 | 25 | 25
Inversion layer thickness (nm) | 1.10 | 1.00 | 0.90 | 0.85 | 0.80 | 0.80 | 0.80
Vt,sat (mV) at nA/µm, HP logic | 129 | 129 | 133 | 136 | 84 | 52 | 52
Vt,sat (mV) at nA/µm, LP logic | 351 | 336 | 333 | 326 | 201 | 125 | 125
Effective mobility (cm²/V·s) | 200 | 150 | 120 | 100 | 100 | 100 | 100
(m), HP logic | 280 | 238 | 202 | 172 | 146 | 124 | 106
Ballisticity: injection velocity (cm/s)
(V), HP logic | 0.115 | 0.127 | 0.136 | 0.128 | 0.141 | 0.155 | 0.170
(V), LP logic | 0.125 | 0.141 | 0.155 | 0.153 | 0.169 | 0.186 | 0.204
(A/m) at nA/µm, HP logic with | 2311 | 2541 | 2782 | 2917 | 3001 | 2670 | 2408
(A/m) at nA/µm, HP logic after | 1177 | 1287 | 1397 | 1476 | 1546 | 1456 | 1391
(A/m) at pA/µm, LP logic with | 1455 | 1567 | 1614 | 1603 | 2008 | 1933 | 1582
(A/m) at pA/µm, LP logic after | 596 | 637 | 637 | 629 | 890 | 956 | 821
Cch, total (fF/µm), HP/LP logic | 31.38 | 34.52 | 38.35 | 40.61 | 43.14 | 43.14 | 43.14
Cgate, total (fF/µm), HP logic | 1.81 | 1.49 | 1.29 | 0.97 | 1.04 | 1.04 | 1.04
Cgate, total (fF/µm), LP logic | 1.96 | 1.66 | 1.47 | 1.17 | 1.24 | 1.24 | 1.24
CV/I (ps), FO3 load, HP logic | 3.69 | 2.61 | 1.94 | 1.29 | 1.11 | 0.96 | 0.89
I/(CV) (1/ps), FO3 load, HP logic | 0.27 | 0.38 | 0.52 | 0.78 | 0.90 | 1.04 | 1.12
Energy per switching (CV²) (fJ/switching), FO3 load, HP logic | 3.47 | 2.52 | 1.89 | 1.24 | 0.94 | 0.63 | 0.50
Figure 93: Cross section showing stacked wiring metal layers.
Figure 94: EDA tools applied to synthesisable RTL for a simulation (left) and the
synthesis to a gate-level or structural netlist (right).
Figure 96: A structural RTL example (left) and the net-level circuit it defines (a
divide-by-five Johnson counter, right).
Figure 97: A combinational RTL example (top) and the naive net-level circuit it
defines (bottom).
Figure 98: Example RTL fragment, before and after flattening. For many designs,
the flattened netlist is often bigger than the hierarchic netlist owing to multiple
instances of the same component. Here it was smaller.
Figure 99: Schematic symbol, Verilog model, timing diagram and possible
implementation for a dual-edge-triggered RS flop.
Figure 100: Simple synthesisable Verilog examples, including a transparent latch.
Figure 101: Four-value logic-level encoding and its behaviour for six common gates.
Figure 102: EDS event queue, which is a linked list, sorted in ascending temporal
order.
Figure 103: Behaviour of a runt pulse in an RS latch when modelling with transport
delay.
Figure 109: Essence of logic synthesised for integer division of a 32-bit value by
the constant 10 using just adders, based on 8/10 being 0.11001100 recurring in binary.
A logic synthesiser can create similar bespoke divide circuits for any constant denominator.
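The same constant-division trick can be written in software with shifts and adds. The following is the well-known 32-bit unsigned divide-by-10 sequence, shown as a C++ sketch for illustration; it is not the synthesiser's actual adder netlist.

#include <cstdint>

// Build q ~= n * 0.110011... (binary, i.e. n * 8/10), shift right by 3 to obtain
// an estimate of n/10 that is at most one too small, then apply a final correction.
uint32_t divu10(uint32_t n)
{
    uint32_t q = (n >> 1) + (n >> 2);        // q ~= n * 0.11
    q = q + (q >> 4);                         // q ~= n * 0.110011
    q = q + (q >> 8);                         // extend the recurring pattern
    q = q + (q >> 16);
    q >>= 3;                                  // q ~= n/10, possibly one too small
    uint32_t r = n - (((q << 2) + q) << 1);   // r = n - 10*q
    return q + (r > 9);                       // correct the estimate
}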
Figure 110: AND-OR-INVERT gate, logical function (left) and actual circuit when
realised in CMOS (right).
module SSRAM( input clk,                 // Synchronous reads and writes
              input ren,                 // Read enable (optional)
              input wen,                 // Write enable
              input [14:0] addr,         // Address input
              input [31:0] wdata,        // Write data in
              output reg [31:0] rdata);  // Read data out

  reg [31:0] myram [32767:0];            // 32k words of 32 bits each

  always @(posedge clk) begin
    if (ren) rdata <= myram[addr];
    if (wen) myram[addr] <= wdata;
  end
endmodule
Figure 111: Typical RTL coding style for RAM inference. Data out is registered once
without otherwise being used and the same subscript expression is used in both the
read and write contexts.
Figure 112: Scatter plot of area versus operating frequency for a design (Arm
Cortex-A9 Falcon) on a common process node and cell library for different core
utilisation ratios. The best performers are highlighted with black crosses (and joined
together with the black dashed line).
Figure 113: FinFET invertor area versus number of fins for different cell heights at a
process node.
Figure 114: Scatter plot of area versus operating frequency for a given design (Arm
Cortex-A9 Falcon) and standard cell libraries on a common process node that vary
in cell height, measured in track equivalents.
Figure 115: A broad-brush classification of digital ICs
into full-custom, semi-custom and field-programmable classes with some example
device roles.
Figure 116: SoC application diagram for an automotive headlight controller. A heavy
load is connected to a controller-area network (CAN bus).
Figure 117: Two-input NOR gate, circuit schematic (left) and several layers of a
standard cell layout (right). Blue is the bottom layer of metal, red is polysilicon,
green is channel diffusion and yellow is the P well in which the N-channel FETs are
constructed.
Figure 118: Typical cell data sheet from a human-readable version of a standard cell
library.
Table 16: Example of process, voltage and temperature (PVT) ranges.
Parameter | Values
Process variation | 0.9 to 1.1
Supply voltage range | 0.85 to 1.1 V
Temperature range | 0 to 70 °C
Figure 119: Delay versus output loading capacitance for various input transition
rates.
Figure 120: Input capacitance is augmented or diminished due to the Miller effect.
It is augmented by the opposite-moving plate potentials arising from an inverting
configuration, which is the common case.
Figure 121: Real-world gate, with a linear slew input waveform, showing the timing
behaviour of , and .
Figure 122: Metal layers in a mask-programmed gate array (ECL codec for fibre optic
ring network [?]).
Figure 123: Transmission gate or bilateral switch: internal wiring (left) and
schematic symbols (right).
Figure 124: An active multiplexor (a) needs more silicon area than a pass-transistor
multiplexor (b), but restores logic levels rather than degrading the signal. Larger
pass-transistor multiplexors (c) are efficient and easy to lay out.
Figure 125: FPGA, showing the I/O blocks around the edge, the interconnection
matrix blocks and the configurable logic blocks. Recently, the regular structure has
been broken up by custom blocks, including RAM and multiplier (aka DSP) blocks.
Figure 126: CLB (left) and an IOB (right) for a simple FPGA.
Figure 128: Power and floor plan for a simple SoC along with the external supply,
which uses battery-backed RAM (left) and buck regulators (right). PSUC: power
supply controller.
Figure 129: Macroscopic back-end flow highlighting the verification flow paths.
Figure 86 is a detailed flow diagram.
Figure 130: Standard cell polygons for a two-input NOR gate. The abstract view (left)
identifies the power connections (top and bottom strips) and the input and output
contact sites (denoted with asterisks) on the M1 and M2 metal layers, respectively
(shades of blue). The full polygon set (right) is required for final design rule checking
and tapeout. In reality, the right-hand polysilicon layer (red) could be under the M2
output strip but is here shown displaced for clarity.
Figure 131: A placed horizontal strip of standard cells showing the polygons
significant for routing, as rendered in a layout editor. The power and ground rails are
a distinctive feature.
Figure 132: Snapshot from a place-and-route tool, showing hundreds of interlaced
wires that are routed with several metal layers, both vertically and horizontally.
Figure 133: Enlargement of Figure 132, showing several wires routed with several
metal layers (one colour for each layer), vertically and horizontally. Vias are
marked with an X.
000 000 0 001 111 1
123 456 7 890 123 4
[ 00H 00H p H00 x00 p ]
[ 01H 00H p H00 x00 p ]
[ 10H 00H p H00 x00 p ]
[ 11L 00H p H00 x00 p ]
Figure 134: Pin connections for a 7400 quad NAND gate device and part of a test
program.
Figure 135: A wafer (6 to 10 inches diameter) is diced into chips (1 cm on a side or
so).
Figure 136: Load card with wafer probe pins for testing a chip before the wafer is
diced.
Figure 137: General configuration of a wafer probe testing machine. The robotic
staging system delivers wafers to and from conveyor systems as well as steps the
probes over the wafer.
Figure 138: Probe needles just visible inside the testing load card.
Figure 139: Plastic chip package.
Figure 140: An FPGA composed of several chiplets with broadside inter-chiplet
bonding, connected closely to several DRAM chips in an MCM.
Figure 141: Example of structural Verilog RTL that instantiates disconnected
standard cells as a ‘sewing kit’.
Table 17: Simplistic and rough estimates of recurring (RE) and non-recurring
expenses (NRE) for the first production run of wafers.
Type of expense | Item | Item cost | Total cost
NRE | 6 months: 10 software engineers | $100k pa | $500k
NRE | 6 months: 10 hardware engineers | $250k pa | $1250k
NRE | 4 months: 20 verification engineers | $200k pa | $1333k
NRE | 1 mask set (22 nm) | $1500k | $1500k
RE | Per device IP licence fees | ? | $??
RE | 6-inch wafer | $5k | $5k
Total | | | $4583k + 5k
Table 18: Die yield. The cost for a working die given a 6-inch wafer with a processing
cost of $5000 and a probability of a square millimetre being defect-free of 99.55 per
cent.
Area (mm²) | Number of wafer dies | Number of working dies | Cost per working die ($)
2 | 9000 | 8910 | 0.56
3 | 6000 | 5910 | 0.85
4 | 4500 | 4411 | 1.13
6 | 3000 | 2911 | 1.72
9 | 2000 | 1912 | 2.62
13 | 1385 | 1297 | 3.85
19 | 947 | 861 | 5.81
28 | 643 | 559 | 8.95
42 | 429 | 347 | 14.40
63 | 286 | 208 | 24.00
94 | 191 | 120 | 41.83
141 | 128 | 63 | 79.41
211 | 85 | 30 | 168.78
316 | 57 | 12 | 427.85
474 | 38 | 4 | 1416.89
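The figures in Table 18 follow, approximately, from a simple model: a die of area A mm² is defect-free with probability 0.9955^A, so the cost per working die is the wafer cost divided by the expected number of working dies. A hedged C++ sketch of this calculation (function and variable names are illustrative; the book's exact die-count model may differ slightly):

#include <cmath>
#include <cstdio>

// cost_per_good_die = wafer_cost / (dies_per_wafer * p_defect_free_per_mm2 ^ area_mm2)
double cost_per_working_die(double wafer_cost, double dies_per_wafer,
                            double area_mm2, double p_defect_free_per_mm2)
{
    double working = dies_per_wafer * std::pow(p_defect_free_per_mm2, area_mm2);
    return wafer_cost / working;
}

int main()
{
    // First row of Table 18: 2 mm^2 dies, 9000 per wafer -> roughly $0.56 each.
    std::printf("%.2f\n", cost_per_working_die(5000.0, 9000.0, 2.0, 0.9955));
    return 0;
}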
Figure 142: The three main timing specifications for a sequential cell, such as a
flip-flop (repeated from Figure 270).
Figure 143: An example circuit with static timing annotations for maximum GBA
analysis.
Figure 144: CMOS chip structure, highlighting layers most affected by FEOL and
BEOL variations.
Table 19: Examples of process corners.
String | Meaning
ss_0p9v_m40c | Slow P and N channel transistors at 0.9 V and −40 °C
tt_1p0v_25c | Typical P and N channel transistors at 1.0 V and room temperature
ff_1p1v_125c | Fast P and N channel transistors at 1.1 V and 125 °C
Table 20: Examples of BEOL corners.
BEOL corner | Meaning
C or C | Narrow wires with wide spacing for the smallest capacitance component
RC or RC | Thick wires with less resistance to minimise the RC product and net delay
Typical | Wires and vias meet the target dimensions
RC or RC | Thin wires with more resistance to maximise the RC product and net delay
C or C | Wide wires with narrow spacing for the largest capacitance component
Figure 145: Troublesome early and late path configurations for maximum (left) and
minimum (right) timing. Late path components are shaded dark.
Figure 146: A logic structure with a gated clock that could suffer in an FS process
corner (left) and a combinational logic circuit with two paths that differ in their
levels of inversion.
Figure 147: An example of an SDC file. TCL commands are used to specify and
constrain the STA behaviour.
Figure 148: Global and localised views of process distributions.
Figure 149: LVF path delay distribution and sigma multiplier.
Figure 150: Statistical shift due to bias temperature instability for various bias
times.
Figure 151: Comparing simulation (left) with a formal proof of correctness (right).
assert(x<4); x := x + 1000; assert(x<1004);
Figure 152: Examples of imperative or procedural safety assertions in a software
language such as C++. These conditions must hold when reached by the program’s
flow of control.
Figure 153: FSM transition diagram with liveness, fairness and deadlock
indications.
Figure 154: Two simple PSL assertions.
Table 21: The three principal regular expression operators and concise derived
shorthands.
{A;B} (core): Semicolon denotes sequence concatenation
{A[*]} (core): A postfix asterisk denotes arbitrary repetition
{A|B} (core): Vertical bar (stile) denotes alternation
{A[+]} (derived): One or more occurrences of A
{A[*n]} (derived): Repeat n times
{A[=n]} (derived): Repeat n times, non-consecutively
{A[->n]} (derived): As [=n] but ending on the last occurrence
{A:B} (derived): Fusion concatenation (the last of A occurs during the first of B)
Table 22: Summary of the main SERE temporal conjunction and sequencing
dyadic operators.
Simple conjunction (A & B): A and B finish matching at once
Length-matching conjunction (A && B): A and B occur at once with a common duration (length matching)
Simple conjunction (A within B): A occurred at some point during B
Strong positive sequencing (A until B): A held at all times until B started
Weak positive sequencing (A before B): A held before B held
Sequence implication (A |=> B): Whenever A finishes, B immediately starts
Fusion implication (A |-> B): The same, but with the first event of B coincident with the last event of A
Table 23: Some built-in primitive macros in PSL.
rose(X): X changed from zero to one
fell(X): X changed from one to zero
stable(X): X did not change
changed(X): X did change
onehot(X): X is a power of 2
onehot0(X): X is zero or a power of 2
Figure 155: Dynamic validation: Monitoring bus operation with an RTL checker
(top) or interface PCs (bottom).
Figure 156: Framed standard synchronous connection, with interface checker
instance (left) and allowable protocol transitions (right). Only allowed qualified
edges are shown. Any other qualified transitions are protocol errors.
module framed_standard_sync_monitor(
   input reset,
   input clk,          // Clock input. ALL CONNECTIONS ARE INPUTS!
   input req,          // Request signal
   input rdy,          // Ready signal, for the reverse direction
   input [7:0] data,   // Data bus
   input dfirst,       // First word of packet indicator
   input dlast);       // Last word indicator

   bit q1;
   integer error_flag;

   always @(posedge clk) if (reset) q1 = 0;
   else begin
      error_flag = 0;
      if (req && rdy && !q1) begin
         if (dfirst && !dlast) q1 = 1;   // Frame start
         else if (dlast && !dfirst) begin
            $display("%m: %1t: C2: End outside of frame.", $time);
            error_flag = 2;
         end
         else if (!dlast && !dfirst) begin
            $display("%m: %1t: C3: Byte outside a frame.", $time);
            error_flag = 3;
         end
      end
      else if (req && rdy && q1) begin
         if (!dfirst && dlast) q1 = 0;   // Frame end
         else if (dfirst && dlast) begin
            $display("%m: %1t: C1b: One-word frame during existing frame.", $time);
            error_flag = 1;
         end
         else if (!dlast && dfirst) begin
            $display("%m: %1t: C1a: Frame start during existing frame.", $time);
            error_flag = 1;
         end
      end
   end
endmodule
Figure 157: Example of a PC for the framed interface implemented using an RTL
state machine.
wire en = req && rdy;

// The transition from Q0 -> Q1 -> ... -> Q1 -> Q0:
sva_transaction: assert property (@(posedge clk)
   ( (en && dfirst && !dlast)
     |=> (!en || (!dfirst && !dlast))[*0:$] ##0 (en && !dfirst && dlast) ) )

// Forbid any exit from Q0 except with dfirst:
good_Q0: assert property (@(posedge clk)
   ( (en && dlast) || reset
     |=> (!(en && dfirst))[*0:$] ##0 (en && dfirst) ) )
Figure 158: Comparable example of the PC implemented with SVA. The ##0 form is
an idiomatic marker that merely separates successive SERE components.
Figure 159: A mitre compares the outputs from a pair of supposedly equivalent
combinational components.
Figure 160: A two-bit shift register (left) with a conventional design. By using a
clock-enabled flip-flop (centre), an alternative implementation is possible (right).
The state encoding is totally different, but the observable black-box behaviour is
identical.
Figure 161: An X-propagation mitre around two instances of a simple DUT. The
equivalence comparison is modified to ignore mismatches on the data bus if the
valid output from the upper instance does not hold. A counterexample, as found by a
model checker, is denoted in red. The alternative values that would be encountered
in a simulation are in blue.
Figure 162: An alternative DUT. The circuit produces a false negative under
X-propagation simulation. The blue annotations show CRV simulation values,
whereas model checker values, in red, correctly indicate that the DUT has passed
the test.
Figure 163: A formal glue shim around a data path component (a LIFO stack). This
enables symbolic verification using a basic model checker. All inputs on the left are
unconstrained, as is the pop input on the right. No stimulus pattern should make
the FAIL output hold.
Figure 164: A pulse generator: schematic symbol and timing waveforms.
Figure 165: Four of six permutations that implement a simplex protocol adaptor.
Each has the same signature of one input and two output ports of 32 and 8 bits,
respectively. More typically, duplex implementations are required.
Figure 166: Comparison of speed-of-light (LC) and RC-based Elmore delay models
for propagation along a net. Also shown is a real-world design rule for 16-nm
geometry.
Figure 167: Timing diagram for an asynchronous four-phase handshake.
Figure 168: Timing diagram for AHB bus write cycle with one wait state.
Figure 169: Timing diagrams for the synchronous LocalLink protocol. Left:
Back-to-back transfer of words because the destination is always ready. Right:
Pauses to match the acceptance rate of the destination. Note that all control signals
are active low (denoted with the _n RTL suffix) in LocalLink.
Figure 170: BVCI core nets. Arrows indicate signal directions on the initiator. All of
these are reversed for the target.
Figure 172: BVCI protocol: Response phase timing diagram. Operations are
qualified with the conjunction of req and ack. Response and acknowledge cycles
maintain their respective ordering. Bursts are common. Successive addressing may
be implied.
Figure 173: Out-of-order bus formed from a pair of in-order busses with added tags
(blue).
Figure 174: Augmenting the tag width through successive multiplexing stages.
Figure 175: The baseline port structure for an A32D32 AXI standard initiator
showing the five temporally floating channels, two for reads and three for writes.
Parity bits may be present on any of the busses. Writing an address and writing data
operate close to lockstep for single-word writes, but are split to support multi-word
bursts. Signal directions are reversed for a target.
Table 24: Major predefined AMBA AXI bus standards and profiles. Within each
profile, there can be variations in address, data, tag and other bus widths per
instance.
Profile | Channels | Other nets | Description
AXI3 | AR+R, AW+W+B | Tag ID, WLanes | Bursts 1–16 beats
AXI4 | AR+R, AW+W+B | Tag ID, WLanes, QoS | Bursts 1–256 beats
AXI4-Lite | AR+R, AW+W+B | – | No burst transfers. No byte lanes
AXI4-Stream | W | – | Simplex. No addressing. Unrestricted length
AXI ACE | All of AXI4 | AC+CR+CD | Cache coherency extensions
ACE5-Lite | All of AXI4 | AC+CR+CD | Single beat. Out-of-order responses
Figure 176: A basic SoC bus structure for the MSOC1 protocol. One initiator
addresses three targets (high-level view and detailed wiring).
Figure 177: Example where one of the targets is also an initiator (e.g. a DMA
controller).
Figure 178: A system design using three bridged busses. Each main bus has its own
primary initiator (pink), which is typically a CPU, but the bus bridges (green) also
initiate transactions.
Figure 179: Bidirectional bus bridge for the MSOC1 protocol. It has a pair of
back-to-back simplex bridges. It could be a single IP block on a single chip, or the
two halves could be on different chips with a SERDES (§??) serial link between them.
Figure 180: Crossbar connectivity implemented in three different ways: with a
high-capacity TDM bus (left), crossbar elements (centre) and multiplexors (right).
Figure 181: Multiphase (split transaction) bus demultiplexor and remultiplexor for
responses (left). Three of these structures could be used to implement the bus fabric
hub (right).
Figure 182: Two folded busses that have similar component counts. For each, the
number of switching elements grows linearly with the number of end points (EP).
The linear form (left) is suitable for multi-chip use, but the tree form (right) has lower
latency. The receive half of the tree form often has a tree structure running exactly
parallel to the transmit half, but here it is shown flattened to almost a bus.
Figure 183: A radix-3 switching element for a NoC using a broadcast bus at the link
bandwidth and credit-based flow control.
Figure 184: Example of a NoC fabric using radix-3 switching in a unidirectional torus
mesh.
Figure 185: Illustration of cut-through routing on a 2-D mesh NoC (left), wormhole
operation (centre) and potentially deadlocking routes (right). The thin black lines
are the NoC links, which are bidirectional. The thick coloured lines are the routes
chosen for a packet. The white marks indicate the end of a packet.
Figure 186: Four elements in a 2-D mesh. Two transactions, red and green, are in a
fabric deadlock, since unconstrained cut-through switching with wormhole routing
has allowed each to block the other. The dashed parts cannot complete.
Figure 187: The eight possible 2-D turns (left), the allowable 2-D turns with
Y-before-X (north–south first) routing (centre) and an example set of turns used
by two transactions that obey the north–south first global policy but still manage
to deadlock owing to interference between requests and responses. The turns that
violate the global policy are marked with an exclamation mark. BLA: bottom left
anticlockwise; BRA: bottom right anticlockwise; TRA: top right anticlockwise; TLA:
top left anticlockwise; BLC: bottom left clockwise; BRC: bottom right clockwise; TRC:
top right clockwise; TLC: top left clockwise.
Figure 188: Sample behavioural traffic specification containing a load profile and a
transaction phase dependency.
Figure 189: One possible structure using link-level credit-based flow control,
showing tolerance to pipeline stages in the interconnect nets. The forward and
return paths need not be matched in delay terms.
Figure 190: Two end points interconnected via some number of switching elements.
The AMBA 5 CHI protocol layers are shown.
Figure 191: A minimal application of the AMBA 5 CHI specification between a
requester and a completer (left) and full net-level details of the six channels found
on a more-typical request node (RN) (right).
Table 25: Some (simplified) synthetic traffic generation vectors and their
descriptions.
No. | Name | Description
1 | Rate: open loop; Flows: all to all; Length: 8; Spacing: regular | Average-rate injection from all ingress ports to all egress ports of 8-byte payloads, with no burstiness.
2 | Rate: open loop; Flows: all to one; Length: 8; Spacing: regular | Average-rate injection from all ingress ports to one egress port, with no burstiness.
3 | Rate: saturated; Flows: all to all; Length: 8; Spacing: regular | Injection at peak capacity from all ingress ports to all egress ports, with no burstiness.
4 | Rate: open loop; Flows: all to all; Length: 8; Spacing: random | Average injection rate with random delays between injections, from all ingress ports to all egress ports.
5 | Rate: open loop; Flows: all to all; Length: variable; Spacing: regular | Average injection rate from all ingress ports to all egress ports, with variable-length packets.
6 | Rate: closed loop; Flows: all to all; Length: 32; Spacing: regular | An ingress port only generates a new message after the previous response. All packets are long (32 bytes).
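As a rough sketch of how such vectors might be encoded for a traffic generator, the following C fragment defines one descriptor per row of Table 25; the type names, enumerations and the idea of a static vector array are invented for illustration and are not part of any particular tool.

/* Hypothetical encoding of one synthetic traffic vector (one row of Table 25). */
typedef enum { RATE_OPEN_LOOP, RATE_SATURATED, RATE_CLOSED_LOOP } rate_mode_t;
typedef enum { FLOWS_ALL_TO_ALL, FLOWS_ALL_TO_ONE } flow_pattern_t;
typedef enum { SPACING_REGULAR, SPACING_RANDOM } spacing_t;

typedef struct {
    rate_mode_t    rate;
    flow_pattern_t flows;
    int            payload_bytes;   /* 0 could denote variable length */
    spacing_t      spacing;
} traffic_vector_t;

/* Vectors 1, 3 and 6 from Table 25, expressed with this encoding. */
static const traffic_vector_t vectors[] = {
    { RATE_OPEN_LOOP,   FLOWS_ALL_TO_ALL, 8,  SPACING_REGULAR },
    { RATE_SATURATED,   FLOWS_ALL_TO_ALL, 8,  SPACING_REGULAR },
    { RATE_CLOSED_LOOP, FLOWS_ALL_TO_ALL, 32, SPACING_REGULAR },
};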
Figure 192: A ring network. This is a low-complexity NoC structure.
Figure 193: Regular cubic structures with 1, 2, 3 or 4 dimensions. 5-D and above are
hard to draw.
Figure 194: The butterfly (or shuffle) network topology (left) and a flattened version
(right), with some example paths highlighted for comparison. The vertical links in
the flattened form are bidirectional.
Figure 195: Example of a switching fabric that uses radix-2 elements and the delta
wiring pattern. Interchanging the initiators and targets is equally valid.
Figure 196: A one-place pipelined FIFO stage using credit-based flow control. A
multi-place FIFO stage would replace the synchronous S/R flops used for the initial
credit and backlogged state with counters, which result in better density than
cascading one-place stages. (The vertical line on the S input denotes that setting
has priority over resetting when both are asserted.)
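The credit mechanism of Figure 196 can also be modelled behaviourally. The C sketch below is an abstract software approximation, not the RTL of the figure: the sender holds a credit counter initialised to the receiver's buffer depth, decrements it on each word launched and increments it when a credit return is seen on the reverse path. The function and type names are assumptions for illustration.

#include <stdbool.h>

/* Behavioural model of one credit-counted link (sender side).
   depth is the number of buffer places at the receiver. */
typedef struct { int credits; } credit_link_t;

void link_reset(credit_link_t *l, int depth) { l->credits = depth; }

/* Returns true if a word may be launched this cycle. */
bool link_try_send(credit_link_t *l)
{
    if (l->credits == 0) return false;  /* All receiver places may be full. */
    l->credits--;                       /* One place is now provisionally occupied. */
    return true;
}

/* Called when a credit return arrives on the reverse path,
   i.e. the receiver has freed one buffer place. */
void link_credit_return(credit_link_t *l) { l->credits++; }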
Figure 197: Metastability illustrated by a pencil balancing on a razor’s edge (left).
The essential structure of a transparent bistable (latch) and a transfer function
(centre). The gate signal and measured responses from a transparent latch with an
input wired to a voltage source close to the metastable point (right).
Figure 198: Generic structure of a simplex CBRI. Parallel data are reliably sent
between CDs.
Figure 199: Timing diagram (top) and a basic structure for two systems that use
harmonically locked clocks (bottom). Relative edge timings that require tight phase
control are highlighted.
Figure 200: Net-level view of the AMBA P-channel interface for device power
control.
Figure 201: Main components of an 8b10b, block-coded
SERDES (serialiser/deserialiser) transceiver (transmitter/receiver) operating over a
twisted-pair channel.
Table 26: The four quadrants of a computer.
Control unit | Execution unit
Primary storage | I/O devices
Figure 202: Transactional-level model (TLM) of the simplest computer (left) and
code fragments (right).
Figure 203: Schematic symbol and external connections for a tri-state version of a
simple microprocessor.
Table 27: Net-level connections.
Connection | Direction | Use
data[7:0] | I/O | Bidirectional data bus
addr[15:0] | Output | Selection of internal address
hren | Output | Asserted during a data read from the target to the host
hwen | Output | Asserted during a write of data from the host to the target
ack | Input | Asserted when the addressed device has completed its operation
Figure 204: A simple A16D8 microcomputer structure. A microprocessor initiates
all transactions on a bidirectional/tri-state data bus, which is connected to all other
components.
Figure 205: Connections to memory
Table 28: Memory map.
Start | End | Resource
0000 | 03FF | ROM (1 kbyte)
0400 | 3FFF | Unused images of ROM
4000 | 7FFF | RAM (16 kbytes)
8000 | BFFF | Unused
C000 | C007 | Registers (8) in the UART
C008 | FFFF | Unused images of the UART
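A minimal C sketch of an address decoder matching Table 28 follows; the enumeration names are invented for illustration. The 'unused images' in the map arise because only a few high-order address bits are decoded, so the ROM and UART respond at every alias of their base addresses.

#include <stdint.h>

typedef enum { SEL_ROM, SEL_RAM, SEL_UART, SEL_NONE } select_t;

/* Decode a 16-bit address according to the map in Table 28.
   Only the top address bits are examined (partial decode), so the
   ROM and UART also appear at their otherwise unused images. */
select_t decode(uint16_t addr)
{
    if (addr <= 0x3FFF) return SEL_ROM;   /* 0000-03FF plus images up to 3FFF */
    if (addr <= 0x7FFF) return SEL_RAM;   /* 4000-7FFF */
    if (addr >= 0xC000) return SEL_UART;  /* C000-C007 plus images up to FFFF */
    return SEL_NONE;                      /* 8000-BFFF unused */
}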
Figure 206: MSOC1 reference bus protocol, with read and write examples.
Table 29: Net-level connections of an MSOC1 port in an A32D32 system.
Connection | Direction | Use
addr[31:0] | Output | Selection of internal address; not all 32 bits are used
hwen | Input | Asserted during a write from the host to the target
hren | Input | Asserted during a read from the target to the host
wdata[31:0] | Input | Data to a target when writing or storing
rdata[31:0] | Output | Data read from a target when reading or loading
interrupt | Output | Asserted by the target when it needs attention
Figure 207: Structure of a typical microcontroller, a single-chip microcomputer.
Figure 208: Hitachi HD614080 microcontroller chip from 1980. Such devices were
often in very large, dual in-line (DIL) packages to make a large number of GPIO pins
available.
Figure 209: Contact plate for a smart card. The card reader supplies VCC power,
clock and reset. Inputs and outputs are then made via the 1-bit bidirectional data
pin.
Figure 210: An invertor viewed at various levels of abstraction.
Figure 211: Front-end flow in SoC design.
Figure 212: Main PCB of an ADSL home modem.
Figure 213: Main PCB structure of an ADSL home modem.
Figure 214: Overall design and manufacturing flow for a SoC.
Figure 215: General internal view of a mobile phone (left) and views of both sides
of the main circuit board (centre and right). Highlighted in red are the main SoC,
which contains several Arm processors, and a multi-chip module containing several
memory chips (a proprietary mix of DRAM, SRAM and flash).
Figure 216: An Apple SoC with two Arm and three GPU cores. It was made by
arch-rival Samsung.
Figure 217: A platform chip example: the Virata Helium 210, as used in many ADSL
modems.
Figure 218: Helium chip as part of a home gateway ADSL modem (partially masked
by the 802.11 module).
Figure 219: Another platform chip intended for use in thin-client display devices,
such as a tablet.
Figure 220: Moore’s law (§??).
Figure 221: TLM connection diagrams for CPU and memory configurations: (a)
direct connection, (b) with a cache, (c) with a split cache and (d) two cores with a
shared L2 cache. The fan-in to the TLM socket denotes that the bandwidth is shared.
In reality, debug and management ports also need to be connected.
Figure 222: TLM diagrams for a CPU core with integrated first-level caches and
external coprocessor connections.
Figure 223: Main data paths in a generic five-stage RISC microarchitecture,
excluding the back sides of the cache and MMU. Instruction field bit-extracts to
control the multiplexors and ALU are not shown.
Figure 224: Key components of a super-scalar CPU core, excluding the back sides of
the cache and MMU.
Figure 225: TLM connection diagrams for a CPU and memory management unit
(MMU): (a) naive view and (b) one possible cache arrangement.
Figure 226: A generalised PRAM model of computing. A number of synchronous
processors, each with some private local store, make random access and read and
write operations on shared memory that has many atomic locations of some size.
The interconnect implements some variant of coherence (value and sequential
consistency).
Figure 227: Two possible configurations for scratchpad memory.
Figure 228: Typical arrangement of virtual and physical mapping with L1, TLB and
L2. Data are served from L1 or L2, but are unlikely to be served directly from primary
storage. Instead, update mechanisms (not shown) are used. Four-way associativity
is illustrated, but a higher degree is typically used, to increase both the hit ratio and
capacity, since it is generally required that the directly mapped axis is not wider than
the virtual memory page size.
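The constraint in the caption (the directly mapped axis must not be wider than the page size, so that the L1 index can be taken from untranslated address bits while the TLB lookup proceeds in parallel) reduces to a one-line check, sketched below in C; the cache and page sizes used are assumptions for illustration only.

#include <stdbool.h>
#include <stdio.h>

/* A virtually indexed, physically tagged L1 can be indexed with
   untranslated address bits only if cache_bytes / ways <= page_bytes. */
bool vipt_index_fits_in_page(unsigned cache_bytes, unsigned ways, unsigned page_bytes)
{
    return (cache_bytes / ways) <= page_bytes;
}

int main(void)
{
    /* Example: 32 KB, 8-way L1 with 4 KB pages -> 4 KB per way, which fits. */
    printf("%d\n", vipt_index_fits_in_page(32 * 1024, 8, 4096));
    /* Example: 64 KB, 4-way L1 with 4 KB pages -> 16 KB per way, which does not. */
    printf("%d\n", vipt_index_fits_in_page(64 * 1024, 4, 4096));
    return 0;
}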
Figure 229: An example system using a snoop control IP block from Arm (Cortex A9
family).
Figure 230: MOESI-like state diagram used in the Arm ACE and CHI protocols.
Figure 231: Three I/O blocks
connected to a CPU, memory
and an interrupt controller.
Figure 232: Three I/O blocks with flexible
interrupt distribution in a multi-core
system.
Figure 233: Interrupt generation: general structure within a device and at system
level.
Table 30: Principal characteristics of memory technologies currently used for
booting, caches, primary storage and secondary storage.
Memory | Volatile | Main applications | Implementation
ROM | No | Booting, coefficients | Content set by a tapeout mask
SRAM | Yes | Caches, scratchpads, FIFO buffers | One bistable (invertor pair) per bit
DRAM | Yes | Primary storage | Capacitor charge storage
EA-ROM | No | Secondary storage | Floating-gate FET charge storage
Memristive | No | Next generation | Electrically induced resistance changes
Figure 234: Floorplan visualisation of a 3D implementation of the Arm Neoverse
N1 design using two chips, vertically stacked. The cache memories are placed on
the lower piece of silicon, shown in monochrome, whereas the logic for the ALUs,
register files and everything else is shown in colour on top. The 1MB L2 label is the
L2 logic and not L2 memory arrays.
Figure 235: Low-aspect-ratio RAM array showing the shared peripheral circuitry,
which includes the word-line decoder, sense amplifiers, tri-state buffers and I/O
multiplexor. Externally, it offers some number of words of a fixed bit width; internally,
it uses fewer, wider rows of bit cells, with the I/O multiplexor selecting the required
word from each row.
Figure 236: Mask-programmed NMOS ROM structure providing a number of locations,
each holding a multi-bit word.
Figure 237: Transistor-level view of a standard six-transistor (6T) SRAM cell. M1,
M2, M3 and M4 are used for storage. M5 and M6 are used to access the cell for read
and write operations.
Figure 238: SSRAM with a single port, showing the logic symbol (left) and internal
RTL model (right).
Figure 239: 8T SRAM cell with true dual-porting.
Figure 240: Dual-ported SSRAM logic symbol.
Figure 241: Micrograph of a simple DRAM chip circa 1994. This is a Micron
Technology MT4C1024 device with organisation 1024x1024x1.
Figure 242: Left: pinout for a typical asynchronous DRAM, such as the MT4C1024
pictured above, which is controlled by the RAS and CAS signals instead of having a
clock input. Right: basic internal structure schematic. This device only has a single
bank so there are no bank address inputs.
Table 31: DRAM address terminology and hierarchy with typical sizes.
Quantity | Aggregate capacity | Description
1 channel | 16 GB | A physical bus: 64 data bits, 3 bank bits and 14 address bits
4 DIMMs | 16 GB | Multiple DIMMs are connected on the PCB to one channel
1 rank | 4 GB | A number of logical DIMMs within a physical DIMM
16 chips | 4 GB | This DIMM uses 16 4-bit chips making a 64-bit word
Lanes/chip | 4 bit lanes = 1 GB | Each chip serves a word 4 bits wide
8 banks | Gbit | Each bank has its own bit-cell arrays (simultaneously open)
Rows | 64 Mbit | A page or row is one row of bit cells in an array
(Burst) | 8 words = 64 bytes | The unit of transfer over the channel
Columns | 16 kbit | The data read/write line to a bit cell
Figure 243: 4-Gbyte DRAM dual in-line memory module (DIMM) for a laptop
computer. Eight chips are mounted on a small PCB, four on each side. A label on the
back says ‘1600 11-11-11’. More detailed information, including the supply voltage,
is stored electronically in a small serial presence detect (SPD) ROM in the centre of
the DIMM.
Figure 244: Internal block diagram of a 4-bit-wide DRAM device. There are four
banks, although eight are normal. (The data strobe (DS) signal is omitted for clarity.)
Table 32: Typical DIMM connections.
Signal | Function
Clock | Clock (400 MHz)
RAS- | Row address strobe
CAS- | Column address strobe
WE- | Write enable
DQ[63:0] | Data in/out
Reset | Power-on reset
WQ[7:0] | Write-lane qualifiers
DS[7:0] | Data strobes
DM[7:0] | Data masks
CS- | Chip select
MAddr[15:0] | Address input
BK[2:0] | Bank select
spd[3:0] | Serial presence detect
Figure 245: NOR ROM block diagram (left), whose capacity in bits is the product of
its array dimensions, and alternative NAND ROM array detail (right). For a ROM more
than 1 bit wide, the Dout connection becomes a bus of the word width and the
low-order address bus is correspondingly narrower.
Figure 246: EPROM device from the 1980s. The silicon die is visible through the top
window.
Figure 247: Schematic of a possible bit cell for EPROM and EEPROM. There is a kink
on the top of the extra gate of this floating-gate transistor to denote that tunnelling
is possible.
Figure 248: Typical increase in benchmark performance in response to L3 cache size
enlargement for L2 = 64 KB (blue), L2 = 128 KB (yellow) and L2 = 256 KB (red).
Figure 249: Typical I/O ports. Shown are two serial ports, one parallel port, one
Ethernet port, two USB ports and three audio ports.
Figure 250: Two typical configurations for a serial port using a UART. Inset: Serial
port timing diagram. The old wiring to a D9 connector for an RS-232 port is today
often replaced with a very short connection to a second UART inside a USB dongle
for connecting to a laptop.
Figure 251: Timing diagram for an asynchronous four-phase handshake.
// Programming model
reg [31:0] ddr;       // Data direction reg
reg [31:0] sfunction; // Special mode
reg [31:0] dout;      // Output register
reg [31:0] imask;     // Interrupt mask
reg [31:0] ipol;      // Interrupt polarities
reg [31:0] pins_r;    // Registered pin data
reg int_enable;       // Master int enable (for all bits)

always @(posedge clk) begin
  pins_r <= pins;
  if (hwen && addr==0)  ddr        <= wdata;
  if (hwen && addr==4)  sfunction  <= wdata;
  if (hwen && addr==8)  dout       <= wdata;
  if (hwen && addr==12) imask      <= wdata;
  if (hwen && addr==16) ipol       <= wdata;
  if (hwen && addr==20) int_enable <= wdata[0];
end

// Tri-state buffers
bufif1 b0(pins[0], sfunction[0] ? special[0] : dout[0], ddr[0]);
... // 30 others here
bufif1 b31(pins[31], sfunction[31] ? special[31] : dout[31], ddr[31]);

// Generally the programmer can read all the
// programming model registers, but not here
assign rdata = pins_r;

// Interrupt masking
wire int_pending = (|((pins_r ^ ipol) & imask));
assign interrupt = int_pending && int_enable;
Figure 252: Schematic and RTL implementation of 32 GPIO bits connected to an
MSOC1 bus.
// RTL for one channel of a typical timer
// Programmers' model state
reg int_enable, int_pending;
reg [31:0] prescaler;
reg [31:0] reload;

// Programmer-invisible internal state
reg ovf;
reg [31:0] counter, prescale;

// Host write operations
always @(posedge clk) begin
  if (hwen && addr==0) int_enable <= wdata[0];
  if (hwen && addr==4) prescaler  <= wdata;
  if (hwen && addr==8) reload     <= wdata;
  // Write to addr==12 to clear the interrupt
end
wire irq_clr = hwen && addr == 12;

// Host read operations
assign rdata = (addr==0) ? {int_pending, int_enable} :
               (addr==4) ? prescaler :
               (addr==8) ? reload : 0;

// A timer counts system clock cycles
// A counter counts transitions from an external input
always @(posedge clk) begin
  ovf      <= (prescale == prescaler);
  prescale <= (ovf) ? 0 : prescale + 1;
  if (ovf) counter <= counter - 1;
  if (counter == 0) begin
    int_pending <= 1;
    counter     <= reload;
  end
  if (irq_clr) int_pending <= 0;
end

// Interrupt generation
assign interrupt = int_pending && int_enable;
Figure 253: Schematic symbol for a counter/timer block and internal RTL for one
timer function.
// State for programmers' model
reg [31:0] count, src, dest;
reg int_enable, active;

// Other local state
reg [31:0] datareg;
reg intt, rwbar;

always @(posedge clk) begin // Target side
  if (hwen && addr==0) begin
    { int_enable, active } <= wdata[1:0];
    intt  <= 0;
    rwbar <= 1;
  end
  if (hwen && addr==4)  count <= wdata;
  if (hwen && addr==8)  src   <= wdata;
  if (hwen && addr==12) dest  <= wdata;
end
assign rdata = ... // Target readbacks

always @(posedge clk) begin // Initiator side
  if (active && rwbar && m_ack) begin
    datareg <= m_rdata;
    rwbar   <= 0;
    src     <= src + 4;
  end
  if (active && !rwbar && m_ack) begin
    rwbar <= 1;
    dest  <= dest + 4;
    count <= count - 1;
  end
  if (count==1 && active && !rwbar) begin
    active <= 0;
    intt   <= 1;
  end
end

assign m_wdata = datareg;
assign m_ren   = active && rwbar;
assign m_wen   = active && !rwbar;
assign m_addr  = (rwbar) ? src : dest;
assign interrupt = intt && int_enable;
Figure 254: A simple DMA controller: schematic symbol, example TLM wiring and
RTL for one channel.
Figure 255: Net-level schematic symbol for a DMA-capable network interface IP
block.
Figure 256: Structure of a simple frame store, RTL implementation and generated
timing waveforms.
Figure 257: Doorbell and mailbox block.
Figure 258: Example of parallel speedup. 35 units of work run across four servers,
showing dependency arcs typical in the map-reduce design pattern. Arcs implicitly
exist between all adjacent work unit boxes.
Figure 259: A schematic of a typical arbiter (left) and the RTL implementation (right)
for a three-port synchronous example using static priority with pre-emption. See
also Figure 31.
Figure 260: General structure of a queue/server pair with mean customer arrival rate λ
and mean service rate μ.
Figure 261: Plots of the average number of customers in a system versus the
utilisation for three common arrival/service disciplines.
Table 33: Formulae for the mean number of customers and the mean delay for each arrival/service discipline, in terms of the utilisation ρ = λ/μ.
Discipline
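For reference, the standard M/M/1 result, one of the disciplines usually tabulated here, gives the average number of customers as ρ/(1 − ρ). The short C sketch below evaluates that expression across utilisations; it is a textbook formula, not a reconstruction of the table's missing entries.

#include <stdio.h>

/* Average number of customers in an M/M/1 queue at utilisation rho (0 <= rho < 1). */
static double mm1_customers(double rho) { return rho / (1.0 - rho); }

int main(void)
{
    /* The knee of the curve: occupancy explodes as utilisation approaches 1. */
    for (double rho = 0.1; rho < 1.0; rho += 0.2)
        printf("rho=%.1f  N=%.2f\n", rho, mm1_customers(rho));
    return 0;
}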
Figure 262: Separate queueing for high-priority and low-priority traffic gives an
expedited service.
Figure 263: Generic switch with buffering at both the input and output ports of a
buffer-less switch fabric. Flow control operates between the switch fabric and the
buffers in the switch.
Figure 264: Two scenarios that demonstrate statistical multiplexing gain for N
sources. Sharing the channel bandwidth (a) raises the question of whether
reservations are made on a peak or average basis. If N is small, we should use peak
allocation, but for large N, we can use average allocation. The law of large numbers
states that N needs only to be about 50 for an average allocation to be acceptable.
Buffer pools can be partitioned (b) or shared (c).
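The peak-versus-average trade-off can be illustrated numerically. The C sketch below uses a Gaussian approximation for N independent on/off sources (the per-source rates, on-probability and margin are invented for the example): as N grows, the capacity required per source falls from near the peak rate towards the average rate, which is the statistical multiplexing gain.

#include <stdio.h>
#include <math.h>

/* Capacity needed so that the aggregate of n independent on/off sources
   (each at rate 'peak' when on, on with probability p) rarely overflows.
   Gaussian approximation: mean plus k standard deviations. */
static double capacity_needed(int n, double peak, double p, double k)
{
    double mean = n * p * peak;
    double sd   = sqrt(n * p * (1.0 - p)) * peak;
    return mean + k * sd;
}

int main(void)
{
    double peak = 10.0, p = 0.2, k = 3.0;   /* Illustrative numbers only. */
    for (int n = 4; n <= 256; n *= 4) {
        double c = capacity_needed(n, peak, p, k);
        printf("n=%3d  capacity=%7.1f  per-source=%5.2f (peak=%.0f, average=%.0f)\n",
               n, c, c / n, peak, p * peak);
    }
    return 0;
}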
Figure 265: A policed queue showing both input and output regulators, although
commonly only one site would be policed. Each policer has a number of rate
channels (e.g. three are shown for the input site).
int burst_tolerance, credit_rate; // Set up by PIO
int credit;                       // State variable

void reset() // Complete setup
{
  credit = 0;
  register_timer_callback(crediter, credit_rate);
}

void crediter() // Called at 1/credit_rate intervals
{
  if (credit < burst_tolerance) credit += 1;
}

bool police() // Check whether the operation is currently allowed
{
  if (credit == 0) return false;
  credit -= 1;
  return true;
}
Figure 266: Essence of a software implementation for one channel of a generic
traffic policer or regulator.
Figure 267: Abstract views of the principal axes that span the SoC design space for
a task. Parallelism can be traded for clock frequency (left), which shows two design
points. At a given clock frequency, the silicon area (and parallelism) can be traded
for execution time using a time/space fold/unfold (right).
Figure 268: Thermal management of a high-power chip. The primary heat sink
makes thermal contact with the chip. A heat pipe is connected to a second heat sink.
There is no other connection to the second heat sink.
Figure 269: Generic thermal circuit, showing on/off heat source, thermal node and
thermal resistance between the node and its heat sink.
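As a numerical illustration of the thermal circuit in Figure 269, the steady-state temperature rise of the node is the dissipated power times the thermal resistance to the heat sink, and the node approaches that value exponentially with time constant R times C. The C sketch below uses invented component values.

#include <stdio.h>
#include <math.h>

/* First-order thermal model: node temperature above ambient after time t
   when power P (W) flows through thermal resistance R (K/W) into
   thermal capacitance C (J/K). */
static double temp_rise(double P, double R, double C, double t)
{
    return P * R * (1.0 - exp(-t / (R * C)));
}

int main(void)
{
    double P = 5.0, R = 8.0, C = 2.0;   /* 5 W, 8 K/W, 2 J/K: illustrative only. */
    for (double t = 0.0; t <= 80.0; t += 16.0)
        printf("t=%4.0f s  rise=%5.1f K (steady state %.1f K)\n",
               t, temp_rise(P, R, C, t), P * R);
    return 0;
}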
Figure 270: Primary timing characteristics of a D-type flip-flop (left). Typical nature
of a critical path in a synchronous clock domain, indicating how the maximum clock
frequency (fmax) is calculated (right).
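The right-hand side of Figure 270 amounts to the standard timing identity: the clock period must cover the clock-to-Q delay of the launching flip-flop, the combinational logic delay and the setup time of the capturing flip-flop, plus any skew margin. A minimal C sketch with purely illustrative delays:

#include <stdio.h>

/* Maximum clock frequency implied by a critical path. All delays in ns. */
static double fmax_mhz(double clk_to_q, double logic, double setup, double skew)
{
    double period_ns = clk_to_q + logic + setup + skew;
    return 1000.0 / period_ns;   /* ns period -> MHz */
}

int main(void)
{
    /* Illustrative numbers: 0.2 ns clock-to-Q, 2.5 ns logic, 0.15 ns setup, 0.1 ns skew. */
    printf("fmax = %.0f MHz\n", fmax_mhz(0.2, 2.5, 0.15, 0.1));
    return 0;
}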
Figure 271: A circuit before (top) and after (bottom) insertion of an additional
pipeline stage.
Figure 272: Flip-flop migration. Two circuits that behave identically but which have
different state encodings.
Figure 273: Schematic and count sequence for a divide-by-five counter using
Johnson encoding.
Figure 274: Two code fragments using message-passing in shared memory with
explicit memory fences. The yield() call could be to the suspend primitive of
the operating system scheduler. Alternatively, if running on bare metal, it is a
dedicated instruction (called YIELD in the Arm ISA) that interacts with hardware
hyper-threading if present.
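A minimal C11 sketch of the pattern described in Figure 274, using explicit fences around a flag-and-payload hand-off; the variable names are assumptions and the busy-wait loop stands in for the yield() or YIELD hint mentioned in the caption.

#include <stdatomic.h>
#include <stdbool.h>

int data;                          /* Payload passed through shared memory. */
atomic_bool ready = false;         /* Flag written after the payload. */

void producer(int v)
{
    data = v;
    atomic_thread_fence(memory_order_release);   /* Payload visible before flag. */
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

int consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;   /* In a real system, a yield() call or YIELD instruction would go here. */
    atomic_thread_fence(memory_order_acquire);   /* Flag observed before payload read. */
    return data;
}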
Figure 275: SRAM memory bank structures for MCMs (top) with two
interconnection patterns inside the SoC (bottom).
Figure 276: Pseudo-dual porting of RAM (using write mirroring).
Figure 277: Typical structure of a 32-bit DRAM controller connected to DRAM
devices. The on-SoC controller manages a DRAM channel with one rank made of
eight individual DRAM chips. Each chip has eight banks, each holding four bit planes,
and each bit plane is a large array of bit cells. A DRAM is typically made available to various on-SoC subsystems
using a multi-way bus multiplexor.
Figure 278: Three possible arrangements of DRAM address fields within a physical
address. In the centre arrangement (b), the bank field is lower than its naive position
in the top arrangement (a). This improves load balancing over banks, but the field is
not moved so low that a cache line or burst transfer is split over banks. The rank and
channel fields can also be moved lower. Arrangement (c) shows a channel field that
has been split and dispersed.
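A hedged C sketch of arrangement (b) follows: the bits that address within a burst stay at the bottom, the bank bits sit just above them so that successive cache lines land in different banks, and the row bits stay at the top. The field widths are placeholders for illustration, not those of any particular controller.

#include <stdint.h>
#include <stdio.h>

/* Illustrative field widths (placeholders): 64-byte burst, 8 banks,
   1024 remaining columns, 8k rows; together they cover a 32-bit address. */
#define BURST_BITS  6
#define BANK_BITS   3
#define COL_BITS    10
#define ROW_BITS    13

typedef struct { unsigned row, col, bank, offset; } dram_addr_t;

/* Arrangement (b) of Figure 278, low to high: offset | bank | column | row. */
static dram_addr_t split(uint32_t pa)
{
    dram_addr_t a;
    a.offset = pa & ((1u << BURST_BITS) - 1);   pa >>= BURST_BITS;
    a.bank   = pa & ((1u << BANK_BITS)  - 1);   pa >>= BANK_BITS;
    a.col    = pa & ((1u << COL_BITS)   - 1);   pa >>= COL_BITS;
    a.row    = pa & ((1u << ROW_BITS)   - 1);
    return a;
}

int main(void)
{
    /* Successive 64-byte lines land in different banks, balancing load. */
    for (uint32_t pa = 0x12340000; pa < 0x12340000 + 3 * 64; pa += 64) {
        dram_addr_t a = split(pa);
        printf("pa=%08x row=%u bank=%u col=%u\n", pa, a.row, a.bank, a.col);
    }
    return 0;
}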
Figure 279: (a) Lithium-ion battery for a mobile phone (3.7 V, 1650 mA h, 6 W h or
22 kJ), external view. (b) Typical internal structure. (c) Two sugar cubes.
Figure 280: Lumped-element electrical equivalent modelling of dynamic power use
of a CMOS SoC. The energy drawn from the battery each clock cycle is essentially
‘wasted’ as heat in the distributed resistance of the active parts of the SoC.
Figure 281: A generic CMOS invertor structure shown with explicit parasitic
resistances and lumped-equivalent output loading. This illustrates the primary
electric paths in CMOS logic.
Figure 282: A tri-state data line showing driving gates, sensing buffers and a bus
holder (or keeper). The bus holder has a weak output driver, denoted ‘W’, which acts
like a series resistor to reduce the output current during transient bus fights (§??).
Figure 283: Logic net with a single source and three loads, showing tracking and
input load capacitances.
Figure 284: Dynamic energy use for various technologies.
Figure 285: Fanout 4 (FO4) delay specification (left) and CMOS logic propagation
delay versus supply voltage (right).
// spice-cmos-inverter-djg-demo.hsp
// Updated 2017 by DJ Greaves
// Based on demo by David Harris harrisd@leland.stanford.edu

// Declare global supply nets and connect them to a constant-voltage supply.
.global Vdd Gnd
Vsupply Vdd Gnd DC 'VddVoltage'

///////////////////////////////////////////
// Set up the transistor geometry by defining lambda.
.opt scale=0.35u * Define lambda // This is half the minimum channel length.

// Set up some typical MOSFET parameters.
// http://www.seas.upenn.edu/~jan/spice/spice.models.html#mosis1.2um
.MODEL CMOSN NMOS LEVEL=3 PHI=0.600000 TOX=2.1200E-08 XJ=0.200000U
+ TPG=1 VTO=0.7860 DELTA=6.9670E-01 LD=1.6470E-07 KP=9.6379E-05
+ UO=591.7 THETA=8.1220E-02 RSH=8.5450E+01 GAMMA=0.5863
+ NSUB=2.7470E+16 NFS=1.98E+12 VMAX=1.7330E+05 ETA=4.3680E-02
+ KAPPA=1.3960E-01 CGDO=4.0241E-10 CGSO=4.0241E-10
+ CGBO=3.6144E-10 CJ=3.8541E-04 MJ=1.1854 CJSW=1.3940E-10
+ MJSW=0.125195 PB=0.800000
.MODEL CMOSP PMOS LEVEL=3 PHI=0.600000 TOX=2.1200E-08 XJ=0.200000U
+ TPG=-1 VTO=-0.9056 DELTA=1.5200E+00 LD=2.2000E-08 KP=2.9352E-05
+ UO=180.2 THETA=1.2480E-01 RSH=1.0470E+02 GAMMA=0.4863
+ NSUB=1.8900E+16 NFS=3.46E+12 VMAX=3.7320E+05 ETA=1.6410E-01
+ KAPPA=9.6940E+00 CGDO=5.3752E-11 CGSO=5.3752E-11
+ CGBO=3.3650E-10 CJ=4.8447E-04 MJ=0.5027 CJSW=1.6457E-10
+ MJSW=0.217168 PB=0.850000

/////////////////////////////////////////////
// Define the invertor, made of two MOSFETs as usual, using a subcircuit.
.subckt myinv In Out N=8 P=16
// Assumes 5 lambda of diffusion on the source/drain
m1 Out In Gnd Gnd CMOSN l=2 w=N
+ as='5*N' ad='5*N'
+ ps='N+10' pd='N+10'
m2 Out In Vdd Vdd CMOSP l=2 w=P
+ as='5*P' ad='5*P'
+ ps='P+10' pd='P+10'
.ends myinv

//////////////////////////////////////////////
// Top-level simulation net list:
// one instance of my invertor and a load capacitor.
x1 In Out myinv // Invertor
C1 Out Gnd 0.1pF // Load capacitor

//////////////////////////////////////////////
// Stimulus: Create a waveform generator to drive In.
// Use a "piecewise linear source" PWL that takes a list of time/voltage pairs.
Vstim In Gnd PWL(0 0 1ns 0 1.05ns 'VddVoltage' 3ns 'VddVoltage' 3.2ns 0)

//////////////////////////////////////////////
// Invoke transient simulation (that itself will first find a steady state).
.tran .01ns 6ns // Set the time step and total duration.
.plot TRAN v(In) v(Out)
.end
Figure 286: SPICE description and setup for two transistors arranged as a CMOS
invertor simulated with a two-step input.
Figure 287: Plots of the invertor when running from VCC supplies of 2.5 V (left) and
1.5 V (right). Red is the input stimulus and blue is the simulated output.
Table 34: Design space for dynamic power-saving techniques.
 | Clock | Power
On/Off | Clock gating | Power supply gating
Variable | Dynamic frequency scaling (DFS) | Dynamic voltage scaling (DVS)
Table 35: Example of static and dynamic power use for a three-level DVFS
configuration. (The static current was estimated using a simple leakage model whose
coefficient was chosen to give an approximate 1:3 ratio of static to dynamic power at
the middle supply voltage.)
Supply voltage (V) | Clock frequency (MHz) | Static power (mW) | Dynamic power (mW) | Total power (mW)
0.8 | 100 | 40 | 24 | 64
1.35 | 100 | 67 | 68 | 135
1.35 | 200 | 67 | 136 | 204
1.8 | 100 | 90 | 121 | 211
1.8 | 200 | 90 | 243 | 333
1.8 | 400 | 90 | 486 | 576
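The dynamic-power column of Table 35 follows the usual V-squared-times-frequency relationship: scaling the 0.8 V, 100 MHz entry by (V/0.8)^2 and (f/100) reproduces the other rows to within rounding. A small C check, taking the first row as the reference point:

#include <stdio.h>

/* Dynamic power scales with V^2 * f. Reference row: 0.8 V, 100 MHz, 24 mW. */
static double dyn_mw(double v, double f_mhz)
{
    return 24.0 * (v / 0.8) * (v / 0.8) * (f_mhz / 100.0);
}

int main(void)
{
    double rows[][2] = { {0.8, 100}, {1.35, 100}, {1.35, 200},
                         {1.8, 100}, {1.8, 200},  {1.8, 400} };
    for (int i = 0; i < 6; i++)
        printf("%.2f V %3.0f MHz -> %5.1f mW dynamic\n",
               rows[i][0], rows[i][1], dyn_mw(rows[i][0], rows[i][1]));
    return 0;
}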
Figure 288: Clock enable using (a) a multiplexor, (b) an AND gate and (c) an OR gate.
Figure 289: Illustrating a transparent latch and its use to suppress clock gating
glitches.
Figure 290: Using XOR gates to determine whether a clock edge would have any
effect.
Figure 291: Clock-needed computations forwarded down a pipeline.
Figure 292: Power gating in general (left) and an actual power gate circuit (right).
Table 36: Summary of the properties of four power-saving techniques.
Technique | Clock gating | Supply gating | DVFS
Control | Automatic | Various | Software
Granularity | Register or FSM | Larger blocks | Macroscopic
Clock tree | Mostly free-runs | Turned off | Slows down
Response time | Instant | 2 to 3 cycles | Instant (or ms if PLL adjusted)
Figure 293: Debugging hardware for a single-core SoC. A debug access port is
connected via JTAG and USB to a debug workstation.
Figure 294: Typical additional event-stream debug resources for a modern MPSoC.
Operational data busses are black. Event-stream busses are green. Debug-access
busses are blue. The pink arrows denote event-monitoring nets from miscellaneous
IP blocks that either do not have their own counters or need to be made available
for cross-triggering.
Figure 295: JTAG interface chaining at board level to create a top-level node in the
debug device tree.
Table 37: JTAG signal list.
Signal | Direction | Function
TDI | In | Test data in: serial bits from test agent or previous device
TMS | In | Test mode select: frame data and addresses
TCK | In | Test clock: clocks each bit in and out
TDO | Out | Test data out: to next device or back to agent
Figure 296: Basic structure of the additional logic required in input and output pads
for boundary scanning. Serial data are shifted from one pad to the next using the
test clock. The result from the previous test vector is shifted out as a vector is shifted
in. The two strobe signals are used to apply the test vector to the input pads and to
sample the data at the output pads.
Figure 297: A static RAM with self-test wrapper around it. An error correction
wrapper has a similar structure, but then the SRAM data bus will be wider than the
external data bus. Both wrappers are often logically present, in which case a single
wrapper may implement both functionalities.
Figure 298: Crystal oscillator circuit (left), canned crystal and contents (centre), and
specification and output waveform (right).
Figure 299: Clock multiplication using a PLL (left) and clock distribution layout
using a fractal H-tree (right).
Figure 300: Clock skewing. The delivery of the clock edge to some D-types is delayed
or advanced to balance out timing margins either side.
Figure 301: Basic form of a timing slack graph.
Figure 302: Schematic (left) and timing diagram (right) of a typical multi-cycle path.
The delay through logic block C ranges from a minimum to a maximum value, both of
which lie between one and two clock periods.
Figure 303: Environmental testing under various temperature and humidity
conditions in a climate-controlled chamber.
Figure 304: EMC testing of radio-frequency interference.