Figures and Tables from Modern SoC Design on Arm

DJ Greaves
University of Cambridge

May 2001
PIC

1 A Textbook

The book is externally available as an ePDF from the Arm web site for free or in hardcopy from Amazon etc.

Full text local access download (via Raven or equivalent): https://www.cl.cam.ac.uk/~djg11/pubs/modern-soc-design-djg/msd-private/DJG-Modern-SoC-Design-On-Arm-First-Edition.pdf PDF (607 pages).

Many of the principles taught in this book are relevant to all kinds of system architect, including those designing cloud-scale applications, custom accelerators or IoT devices in general, or those making FPGA designs. The details of design verification in Chapter 8, however, are likely to be of interest only to those designing semi-custom silicon using standard cells. A git repository of additional online material is available at bitbucket.org/djg11/modern-soc-design-djg. It contains the data used for generating tables and graphs in the book, as well as further source code, lab materials, examples and answers to selected exercises. The repo also contains a SystemC model of the Zynq super FPGA device family, coded in blocking TLM style, which is sufficient to run an Arm A9 Linux kernel using the same boot image as the real silicon. Published by Arm Education Media, 605 pages in softback and ePDF. ISBN 978-1-911531-36-4

https://www.cl.cam.ac.uk/~djg11/pubs/modern-soc-design-djg/

2 Pictures and Tables

PIC

Figure 1: Typical wireless link for digital communications, showing the transmitter (top) and receiver (bottom) for a simplex link.

 
 
 
 
 
 

PIC

Figure 2: Block diagram and photographs of a first-generation Bluetooth USB dongle.

 
 
 
 
 
 
 
 
 

PIC
PIC
Figure 3: Two PCIe 802.11 Wi-Fi modules for a laptop. Their shielding lids have been removed. The older unit, on the left, has a three-chip set from Broadcom. It is physically nearly twice the size of the newer unit, shown on the right, which has been magnified 2×. The new unit uses a Qualcomm QCA9565 device that has a higher throughput and also includes Bluetooth. The only other active part needed is the 40-MHz crystal oscillator to the left of the chip. As can be seen at top right, to give a lower-cost lower-performance product, the second of the two diversity antenna sockets, labelled ‘ALT’, was not fitted.

 
 
 

temp   := 200                 // Set initial temperature to a high value 
ans    := first_guess         // This is the design vector (or tree) 
metric := metric_metric ans   // We seek the highest-metric answer 
 
while (temp > 1) 
{ 
   // Create new design point, offsetting with delta proportional to temperature 
   ans’ := perturb_ans temp ans 
 
   // Evaluate (scalar) objective function (figure of merit) for new design point 
   metric’ := metric_metric ans’ 
 
   // Accept if better probabilistically 
   accept := (metric’ > metric) || rand(100..200) < temp; 
   if (accept) (ans, metric, temp) := (ans’, metric’, temp * 0.99) 
} 
return ans;
Figure 4: An iteration for hill climbing using simulated annealing to find a design point with the highest scalar metric.
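The loop in Figure 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the one-dimensional design point, the `metric` function (a single peak at 3.0) and the `perturb` helper are hypothetical stand-ins for a real design vector and figure of merit. Unlike the listing above, the temperature here cools on every iteration, so the loop always terminates.

```python
import random

def metric(x):
    # Hypothetical figure of merit: a single peak of height 10 at x = 3.
    return 10.0 - (x - 3.0) ** 2

def perturb(temp, x, rng):
    # Offset the design point by a delta proportional to the temperature.
    return x + rng.uniform(-1.0, 1.0) * temp / 200.0

def anneal(first_guess, seed=0):
    rng = random.Random(seed)
    temp = 200.0                       # start hot
    ans = first_guess
    current = metric(ans)
    while temp > 1.0:
        cand = perturb(temp, ans, rng)
        cand_metric = metric(cand)
        # Always accept an improvement; accept a worse point only while hot,
        # with a probability that falls as the temperature drops towards 100.
        accept = cand_metric > current or rng.uniform(100.0, 200.0) < temp
        if accept:
            ans, current = cand, cand_metric
        temp *= 0.99                   # assumption: cool on every iteration
    return ans, current

final_point, final_metric = anneal(first_guess=0.0)
```

Once the temperature falls below 100 the random acceptance term can no longer fire, so the search degenerates into plain hill climbing with a shrinking step size.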

 
 
 

PIC
Figure 5: Screenshot from a place-and-route tool, showing how the wiring avoids a region of hardened layout (grey area on the right). Some area is required to route these interconnections.

 
 
 

Table 1: Determining the ROM needs of an MP3 decoder by disassembling the .rodata segment.
Label                    Start address (hex)
g_drmp3_pow43-0x120      0x0000
g_drmp3_pow43            0x0120
g_scf_partitions.6678    0x0c40
_end_of_static           0x2350
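The ROM requirement follows from the label addresses in Table 1 by simple subtraction. The sketch below assumes that each region extends up to the next label; the helper name `region_sizes` is hypothetical.

```python
# Label addresses copied from Table 1.
labels = [
    ("g_drmp3_pow43-0x120", 0x0000),
    ("g_drmp3_pow43", 0x0120),
    ("g_scf_partitions.6678", 0x0C40),
    ("_end_of_static", 0x2350),
]

def region_sizes(table):
    """Size in bytes of each region, taken as the gap to the next label."""
    return [(name, nxt - start)
            for (name, start), (_, nxt) in zip(table, table[1:])]

sizes = region_sizes(labels)
total_rom_bytes = labels[-1][1] - labels[0][1]
```

Under that assumption the segment needs 0x2350 bytes, which is 9040 bytes in decimal.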

 
 
 

Table 2: Statistics logged during one second of a profile run of MP3 stream decoding. The numbers of integer ALU operations exclude loop control and array subscription operations.
Event type Number of operations
Input bytes 16 392
Output frames 44 352
DCT operations 154
Floating-point adds and subtracts 874 965
Floating-point multiplies 401 255
Integer adds and subtracts 162 107
Integer multiplies 88 704

 
 
 

PIC
Figure 6: Mailbox, relay and pump paradigms: three initiator/target configurations possible for dual-ported IP blocks.

 
 
 

PIC
Figure 7: Schematic symbols of a FIFO buffer following the mailbox pattern: synchronous (left) and clock-domain crossing (right).

 
 
 

Table 3: Synchronous delay and combinational paths for basic FIFO types.
Type               Data latency  Ready latency  Combinational paths
Fully registered   1             1              None
Bypass             0             1              WR_EN → RD_RDY
Pipelined          1             0              RD_EN → WR_RDY
Bubble-free        0             0              Both directions
Asynchronous       Several       Several        None
Credit-controlled  1             n/a            None

 
 
 

PIC
Figure 8: A generic schematic of a one-place synchronous FIFO buffer. A synchronous set/reset flip-flop records whether there are valid data in the broadside register. The dashed lines show optional latency-reducing wiring. The output multiplexor is needed only if the dashed bypass wire is installed. Also illustrated are typical patterns of combinational path wiring in the client source and sink. The red annotations show a potential combinational loop.
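A cycle-level Python sketch can make the four synchronous variants of Table 3 concrete. It is a behavioural illustration, not the book's circuit: the class name `OnePlaceFifo`, its `cycle` method and the two option flags are hypothetical, with `bypass` standing for the WR_EN to RD_RDY path and `pipelined` for the RD_EN to WR_RDY path. Setting both flags models the bubble-free type.

```python
class OnePlaceFifo:
    """One-place synchronous FIFO with optional latency-reducing paths."""

    def __init__(self, bypass=False, pipelined=False):
        self.bypass = bypass        # combinational WR_EN -> RD_RDY path
        self.pipelined = pipelined  # combinational RD_EN -> WR_RDY path
        self.full = False           # set/reset flip-flop: register holds valid data
        self.data = None            # broadside register

    def cycle(self, wr_en, wr_data, rd_en):
        """Advance one clock. Returns (write_accepted, data_read_or_None)."""
        rd_rdy = self.full or (self.bypass and wr_en)
        do_read = rd_en and rd_rdy
        wr_rdy = (not self.full) or (self.pipelined and do_read)
        do_write = wr_en and wr_rdy

        rd_data = None
        if do_read:
            # The output multiplexor selects the bypass wire when empty.
            rd_data = self.data if self.full else wr_data

        if self.full:
            if do_write:
                self.data = wr_data          # refilled in the cycle it is emptied
            elif do_read:
                self.full = False
        elif do_write and not do_read:
            self.data, self.full = wr_data, True
        return do_write, rd_data

registered = OnePlaceFifo()
first = registered.cycle(True, "a", True)    # data cannot pass straight through
second = registered.cycle(True, "b", True)   # "a" leaves, "b" is refused

bypassed = OnePlaceFifo(bypass=True)
through = bypassed.cycle(True, "a", True)    # zero data latency

piped = OnePlaceFifo(pipelined=True)
piped.cycle(True, "a", False)
refill = piped.cycle(True, "b", True)        # read "a" and accept "b" together
```

The fully registered instance shows the bubble that the pipelined option removes: while "a" is being read, the write of "b" is refused for a cycle.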

 
 
 

PIC
Figure 9: Relay and mailbox paradigms for a one-place synchronous FIFO buffer. If there are no combinational paths, the only differences are the net names.

 
 
 

PIC
Figure 10: A custom ALU inside a CPU core implemented in two similar ways: (a) as a custom instruction or (b) as a coprocessor.

 
 
 

PIC
Figure 11: A specialised inter-core message-passing fabric allows cores to rapidly exchange a packet composed of several successive registers with each other.

 
 
 

PIC
Figure 12: A custom function implemented as a peripheral IP block, with optional DMA (bus master) capability.

 
 
 

PIC
Figure 13: Example of a hardware accelerator receiving data from a specialist interface. It uses DMA for data transfer to the main memory. PIO is managed by the host processor.

 
 
 

PIC
Figure 14: Representative bump-in-wire server blade architecture that has the FPGA in series with the network interface and secondary storage (disk drives or SSD).

 
 
 

PIC
Figure 15: Simplified block diagram of a super FPGA, like the Xilinx Zynq devices.

 
 
 
 
 
 
 
 
 

Table 4: Key parameters for a range of super-FPGA parts available from Xilinx in 2018.
Device name                   VU31P   VU33P   VU35P   VU37P   VU11P   VU13P
System logic cells (k)        962     962     1907    2852    2835    3780
CLB flip-flops (k)            879     879     1743    2607    2592    3456
CLB LUTs (k)                  440     440     872     1304    1296    1728
Maximum distributed RAM (Mb)  12.5    12.5    24.6    36.7    36.2    48.3
Total block RAM (Mb)          23.6    23.6    47.3    70.9    70.9    94.5
Ultra RAM (Mb)                90.0    90.0    180.0   270.0   270.0   360.0
HBM DRAM (GB)                 4       8       8       8       -       -
Clock management tiles        4       4       5       3       6       4
DSP slices                    2880    2880    5952    9024    9216    12 288
PCIe ports                    4       4       5       6       3       4
CCIX ports                    4       4       4       4       -       -
150G Interlaken               0       0       2       4       6       8
100G Ethernet with RS-FEC     2       2       5       8       9       12
Maximum single-ended I/O      208     208     416     624     624     832
Multi-standard Gbps SERDES    32      32      64      96      96      128

 
 
 

Table 5: Basic parameters for two caches with different speed/power ratios but the same technology and bit density, versus cache size.
Cache size  L1 energy  L1 area  L1 hit rate  L1 access time  L1 mean time  L2 energy  L2 area  L2 hit rate  L2 access time  L2 mean time
1 0.01 0.001 0.002 0.0 200 0.001 0.001 0.002 0.1 200
2 0.02 0.002 0.004 0.0 199 0.002 0.002 0.004 0.1 199
4 0.04 0.004 0.008 0.0 198 0.004 0.004 0.008 0.2 198
8 0.08 0.008 0.015 0.0 197 0.008 0.008 0.015 0.3 197
16 0.16 0.016 0.030 0.0 194 0.016 0.016 0.030 0.4 194
32 0.32 0.032 0.059 0.1 188 0.032 0.032 0.059 0.6 188
64 0.64 0.064 0.111 0.1 178 0.064 0.064 0.111 0.8 178
128 1.28 0.128 0.200 0.1 160 0.128 0.128 0.200 1.1 160
256 2.56 0.256 0.333 0.2 133 0.256 0.256 0.333 1.6 134
512 5.12 0.512 0.500 0.2 100 0.512 0.512 0.500 2.3 101
1024 10.24 1.024 0.667 0.3 67 1.024 1.024 0.667 3.2 69
2048 20.48 2.048 0.800 0.5 40 2.048 2.048 0.800 4.5 44
4096 40.96 4.096 0.889 0.6 23 4.096 4.096 0.889 6.4 28
8192 81.92 8.192 0.941 0.9 13 8.192 8.192 0.941 9.1 20
16 384 163.84 16.384 0.970 1.3 7 16.384 16.384 0.970 12.8 18
32 768 327.68 32.768 0.985 1.8 5 32.768 32.768 0.985 18.1 21
65 536 655.36 65.536 0.992 2.6 4 65.536 65.536 0.992 25.6 27
131 072 1310.72 131.072 0.996 3.6 4 131.072 131.072 0.996 36.2 37
262 144 2621.44 262.144 0.998 5.1 5 262.144 262.144 0.998 51.2 51

 
 
 

Table 6: Some composite design examples.
L1 size  L2 size  L2 energy  Composite energy  Composite area  Composite mean time
64 262 144 233.0 233.6 262.2 45.8
128 262 144 209.7 211.0 262.3 41.2
1024 262 144 87.4 97.6 263.2 17.4
4096 262 144 29.1 70.1 266.2 6.3
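The entries of Tables 5 and 6 appear to follow simple closed forms, and the Python sketch below reproduces most of them to within rounding (the composite energy of the first row of Table 6 comes out about 0.06 higher). The formulas are inferred by fitting the tabulated values and are not stated in this listing, so treat them as assumptions: hit rate size/(size+512), access time proportional to the square root of the size, a miss cost of 200 time units, and L2 energy scaled by the L1 miss rate in a composite design. All function names are hypothetical.

```python
import math

MISS_TIME = 200.0  # assumed cost of a miss to main memory, inferred from Table 5

def hit_rate(size):
    return size / (size + 512.0)

def cache(size, energy_per_unit, access_scale):
    h = hit_rate(size)
    access = access_scale * math.sqrt(size)
    return {
        "energy": energy_per_unit * size,
        "area": 0.001 * size,
        "hit_rate": h,
        "access_time": access,
        "mean_time": h * access + (1.0 - h) * MISS_TIME,
    }

def l1(size):
    return cache(size, energy_per_unit=0.01, access_scale=0.01)  # fast, power-hungry

def l2(size):
    return cache(size, energy_per_unit=0.001, access_scale=0.1)  # slow, frugal

def composite(l1_size, l2_size):
    a, b = l1(l1_size), l2(l2_size)
    miss = 1.0 - a["hit_rate"]
    l2_energy = miss * b["energy"]          # L2 is exercised only on an L1 miss
    return {
        "l2_energy": l2_energy,
        "energy": a["energy"] + l2_energy,
        "area": a["area"] + b["area"],
        "mean_time": a["hit_rate"] * a["access_time"] + miss * b["mean_time"],
    }

design = composite(4096, 262144)
```

With these assumptions, the composite with a 4096-unit L1 in front of the large L2 has a far lower mean access time than either cache on its own, for only a small area increase.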

 
 
 

Table 7: Power laws for performance delivered and power used in terms of three independent parameters.
Metric                                    Core complexity (c)  DVFS voltage (v)  Number of cores (n)
Performance delivered                     c^0.5                v^1.0             n^0.9
Power used                                c^1.0                v^3.0             n^1.0
Increase in power for double performance  4                    8                 2.16
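The last row of Table 7 follows from the two rows of exponents above it: to double performance through one parameter, that parameter is scaled by 2^(1/p), where p is its performance exponent, and the power then rises by that scale raised to the power exponent. A small Python check, with hypothetical names:

```python
# Exponents from Table 7 (performance and power as powers of c, v and n).
PERF_EXP = {"c": 0.5, "v": 1.0, "n": 0.9}
POWER_EXP = {"c": 1.0, "v": 3.0, "n": 1.0}

def power_increase_for_speedup(param, speedup=2.0):
    """Factor by which power rises when `param` alone delivers `speedup`."""
    scale = speedup ** (1.0 / PERF_EXP[param])   # required growth in the parameter
    return scale ** POWER_EXP[param]

increases = {p: power_increase_for_speedup(p) for p in ("c", "v", "n")}
```

Adding cores is the cheapest route to double performance under these laws, and raising the voltage is the most expensive.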

 
 
 

PIC
Figure 16: Relative performance of simple and complex CPU cores with DVFS.

 
 
 

PIC
Figure 17: Three multi-access NoC topologies.

 
 
 

PIC
Figure 18: Comparison of various multi-access and switched NoC designs in terms of throughput (136-bit words per clock cycle) per gate input as the number of end points (stations) is increased.

 
 
 

# Unfold=1 is the baseline design. Unfold=3 uses three times more silicon. 
static_dynamic_tradeoff <- function(clock_freq, leakage, unfold, xx) 
{ 
  op_count <- 2e7; 
 
  # Model: Pollack-like unfold benefit 
  execution_time <- op_count / clock_freq / (unfold ^ 0.75); 
 
  # Model: Higher supply needed for higher clock and leakage resistance slightly increasing with Vdd 
  vdd <- 1 + 0.5 * (clock_freq/100e6); 
  static_power <- leakage * vdd ^ 0.9 * unfold; 
 
  # Integrate static power and energy 
  static_energy <- static_power * execution_time; 
 
  # Use CV^2 for dynamic energy 
  dynamic_energy <- op_count * vdd ^ 2.0 * 5e-10; 
}
Figure 19: Trading off Aesop’s hare versus the tortoise for increasingly leaky technology. In a hard real-time computation, we know the number of clock cycles needed but should we do them quickly and halt (Hare) or do them slowly and finish just in time (Tortoise)?
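The model above can be transliterated into Python and swept over clock frequency to locate the lowest-energy operating point. This is a sketch under assumptions: the unused `xx` argument is dropped, the total is taken as static plus dynamic energy, and the two leakage values and the 10 to 400 MHz sweep are hypothetical choices for illustration.

```python
def static_dynamic_tradeoff(clock_freq, leakage, unfold=1):
    """Return (static_energy, dynamic_energy) for the fixed-size task."""
    op_count = 2e7
    # Pollack-like benefit from unfolding the design.
    execution_time = op_count / clock_freq / (unfold ** 0.75)
    # Higher supply voltage is needed for a higher clock frequency.
    vdd = 1 + 0.5 * (clock_freq / 100e6)
    static_power = leakage * vdd ** 0.9 * unfold
    static_energy = static_power * execution_time
    # CV^2 model for dynamic energy.
    dynamic_energy = op_count * vdd ** 2.0 * 5e-10
    return static_energy, dynamic_energy

def sweet_spot(leakage, freqs):
    """Clock frequency in the sweep that minimises total energy."""
    return min(freqs, key=lambda f: sum(static_dynamic_tradeoff(f, leakage)))

freqs = [mhz * 1e6 for mhz in range(10, 401)]
low_leakage_best = sweet_spot(0.01, freqs)    # hypothetical leakage values
high_leakage_best = sweet_spot(0.1, freqs)
```

With these numbers the leakier technology favours a noticeably faster clock: finishing early and halting (the hare) saves more static energy than the higher supply voltage costs in dynamic energy.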

 
 
 

PIC
Figure 20: Shift in the sweet spot (small vertical line) for DVFS with high-leakage technology for a real-time task. The total energy (solid line) is the sum of the static energy (dashes) and dynamic energy (dots).

 
 
 

PIC
Figure 21: Block diagram for an MPEG compression subsystem. The input is a real-time video stream and the output is an MPEG transport stream (without sound). The DCT, IDCT and motion estimators are highly compute intensive.

 
 
 

// Output bit-to-byte buffer (bits are packed LSB first)
void putbits(uint val, uint no_of_bits)
{
  buffer |= val << (int)buffer_bits; // Append above the bits already pending
  buffer_bits += no_of_bits;
  while (buffer_bits >= 8)
  { yield_byte((byte)(buffer & 0xFF));
    buffer_bits -= 8;
    buffer >>= 8;                    // Discard the byte just emitted
  }
}

// Send a DC component
void putDC(sVLCtable [] tab, int val)
{
  uint absval, size;
  absval = (uint) Math.Abs(val);
  /* Compute dct_dc_size */
  size = 0;
  while (absval!=0)
  { absval >>= 1;
    size ++;
  }
  // Generate VLC for dct_dc_size (B-12 or B-13)
  putbits(tab[size].code, tab[size].len);
  // Append fixed-length code (dc_dct_differential)
  if (size!=0) // Send val + (2 ^ size) - 1
  { if (val>=0) absval = (uint)val;
    else absval = (uint)(val + (1 << (int)size) - 1);
    putbits(absval, size);
  }
}

void putDClum(int val)
{
  putDC(DClumtab, val);
}

void putDCchrom(int val)
{
  putDC(DCchromtab, val);
}

void putAC(int run, int signed_level, int vlcformat)
{
    // ...
}

/* Generate variable-length codes for an intra-coded
   block (6.2.6, 6.3.17) */
void putintrablk(Picture picture, short [] blk, int cc)
{
  /* DC Difference from previous block (7.2.1) */
  int dct_diff = blk[0] - picture.dc_dct_pred[cc];
  picture.dc_dct_pred[cc] = blk[0];

  if (cc==0) putDClum(dct_diff);
  else putDCchrom(dct_diff);

  /* AC coefficients (7.2.2) */
  int run = 0;
  byte [] scan_tbl = (picture.altscan ? alternate_scan:
                      zig_zag_scan);
  for (int n=1; n<64; n++)
  { // Use appropriate entropy scanning pattern
    int signed_level = blk[scan_tbl[n]];
    if (signed_level!=0)
    {
      putAC(run, signed_level, picture.intravlc);
      run = 0;
    }
    else run++; /* count zero coefficients */
  }

  /* End of Block -- normative block punctuation */
  if (picture.intravlc!=0) putbits(6,4); // 0110 (B-15)
    else putbits(2,2); // 10 (B-14)
}

// Return difference between two (8*h) sub-sampled blocks
// blk1, blk2: addresses of top left pels of both blocks
// rowstride:  distance (in bytes) of vertically adjacent pels
// h:          height of block (usually 8 or 16)
int sumsq_sub22(byte [] blk1, byte [] blk2, int rowstride, int h)
{
  int ss = 0, p1 = 0, p2 = 0;
  for (int j=0; j<h; j++)
  {
    for (int i=0; i<8; i++)
    { int v = blk1[p1+i] - blk2[p2+i];
      ss += v*v;
    }
    p1+= rowstride; p2+= rowstride;
  }
  return ss;
}

Figure 22: Code fragments for an MPEG encoder, coded in CSharp for Kiwi HLS. The putintrablk routine takes the 64 values from the quantised DCT and transmits them using the variable-length coding and with a run-length encoding of zero values in the routines it calls. The sumsq_sub22 function is one of many used during motion estimation to compute differences between macro blocks.
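Two of the steps in these fragments are easy to check in isolation: the size and fixed-length code computed by putDC, and the run-length pairing of zero coefficients in putintrablk. The Python sketch below mirrors that arithmetic only; the function names are hypothetical, the VLC tables are omitted, and the coefficients are assumed to be supplied already in scan order.

```python
def dc_size_and_bits(val):
    """Return (dct_dc_size, fixed-length code value) for a DC difference."""
    size = abs(val).bit_length()          # number of shifts until absval is zero
    if size == 0:
        return 0, 0                       # nothing is appended for a zero difference
    bits = val if val >= 0 else val + (1 << size) - 1
    return size, bits

def run_level_pairs(coeffs_in_scan_order):
    """Pair each non-zero AC coefficient with the count of zeros before it."""
    pairs, run = [], 0
    for level in coeffs_in_scan_order[1:]:    # element 0 is the DC term
        if level != 0:
            pairs.append((run, level))
            run = 0
        else:
            run += 1                          # count zero coefficients
    return pairs

pairs = run_level_pairs([9, 0, 0, 4, 0, -1, 0, 0])
```

Trailing zeros produce no pair at all, which is why the end-of-block code is sent after the loop.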

 
 
 

PIC
Figure 23: Typical examples of FUs deployed by higher-level design languages, especially HLS.

 
 
 

PIC
Figure 24: Fixed-point implementation of a floating-point accumulator.

 
 
 

PIC
Figure 25: Dynamic load balancing using the server farm paradigm.

 
 
 

PIC
Figure 26: Generic setup of a system integrator tool. This is the IP-XACT reference model for design capture and synthesis. A central interactive editor enables a system to be created from externally provided IP blocks by invoking generator and configurator tools.

 
 
 

PIC
Figure 27: IP-XACT structures for documenting the memory map and register field definitions, plus the access type supported for each bit field.

 
 
 
 
 
 

PIC
Figure 28: Generating an interconnect from IP-XACT: abstract view (top) using five IP-XACT XML files and synthesised wiring (bottom).

 
 
 
 
 
 

PIC
Figure 29: Example of high-level manual wiring between subsystems with wiring configured using a GUI.

 
 
 
 
 
 

PIC
Figure 30: Screenshot of the connectivity matrix editor within Arm Socrates.

 
 
 
 
 
 
 
 
 

PIC
Figure 31: An example Chisel module: a static-priority arbiter, showing Chisel source code (left) and the circuit generated (right) when n_inputs is 3.

 
 
 

PIC
Figure 32: An 8-bit counter with a synchronous load using sequential logic and a when/otherwise clause: Chisel source code and schematic symbol.

 
 
 

   // Generator (src) 
   while(1) 
   { 
     ch1 ! (x); 
     x += 3; 
   }

   // Processor 
   while(1) 
   { 
     ch2 ! (ch1? + 2) 
   } 

   // Consumer (sink) 
   while(1) 
   { 
     $display(ch2?); 
   } 
Figure 33: Three communicating processes expressed using Handel-C.
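The data flow through the two channels can be imitated with Python generators. This is only an approximation: generators are demand driven, whereas a Handel-C channel is a synchronous rendezvous between two running processes. The starting value of x (zero) and the number of items consumed are assumptions.

```python
from itertools import islice

def generator(x=0):
    while True:
        yield x          # ch1 ! (x)
        x += 3

def processor(ch1):
    for value in ch1:    # ch1?
        yield value + 2  # ch2 ! (ch1? + 2)

def consumer(ch2, count):
    # Stands in for the display loop: collect the first `count` values.
    return list(islice(ch2, count))

received = consumer(processor(generator()), 4)
```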

 
 
 

module mkTb1 (Empty);             // This module has no externally callable methods 
 
   Reg#(int) rx <- mkReg (23);    // Create an int register called rx with reset value 23 
 
   rule countone (rx < 30);       // A rule named 'countone' with an explicit guard 
      int y = rx + 1;             // This is short for int y = rx.read() + 1; 
      rx <= rx + 1;               // This is short for rx.write(rx.read() + 1); 
      $display ("countone: rx = %0d, y = %0d", rx, y); 
   endrule 
 
   rule counttwo (rx > 20);       // A competing rule, also guarded 
      rx <= rx + 2;               // This adds two to rx each time the rule fires 
      $display ("counttwo: rx = %0d", rx); 
   endrule 
 
   rule done (rx >= 40);          // A third rule 
      $finish (0); 
   endrule 
 
endmodule: mkTb1
Figure 34: A Bluespec example in which three rules compete to act on a simple broadside register.

 
 
 

interface Pipe_ifc;
  method Action put(int arg);
  method int get();
endinterface

_________________________________________

module mkTb2 (Empty); // Testbench

  Reg#(int) x       <- mkReg ('h10);
  Pipe_ifc  thepipe <- mkPipe;

  rule fill;   // explicit guard of (true) is implied
    thepipe.put(x);
     // This is short for  x.write(x.read() + 'h10);
     x <= x + 'h10;
  endrule

  rule drain;
     let y = thepipe.get();
     $display ("    y = %0h", y);
  endrule
endmodule

PIC
Figure 35: Bluespec definition of an example interface, the Pipe_ifc (top left), the definition of a component that has an instance of a component that has that interface and two rules to operate on it (bottom left) and a net-level schematic symbol for the instantiated component showing handshake nets. The symbol for the Tb2 component has no external nets (apart from clock and reset) since it has an empty interface.

 
 
 

PIC
Figure 36: Basic steps of an HLS tool chain for ASIC or FPGA.

 
 
 

// A simple long multiplier with
// variable latency
int multiply(int A, int B)
{
  int RA=A;
  int RB=B;
  int RC=0;
  while(RA>0)
  {
    if odd(RA) RC = RC + RB;
    RA = RA >> 1;
    RB = RB << 1;
  }
  return RC;
}

module LONGMULT8b8(clk, reset, C, Ready, A, B, Start);
   input clk, reset, Start;
   output Ready;
   input [7:0] A, B;
   output [15:0] C;
   reg [15:0] RC, RB, RA;
   reg        Ready;

   reg xx, yy, qq, pp; // Control and predicate nets
   reg [1:0] fc;
   reg [3:0] state;
   always @(posedge clk) begin
      xx = 0; // default settings.
      yy = 0;
      fc = 0;

      // Predicates
      pp = (RA!=16'h0);   // Work while pp holds
      qq = RA[0];         // Odd if qq holds

      if (reset) begin    // Sequencer
         state <= 0;
         Ready <= 0;
         end
      else case (state)
             0: if (Start) begin
                xx = 1;
                yy = 1;
                fc = 2;
                state <= 1;
             end

             1: begin
                fc = qq;
                if (!pp) state <= 2;
             end
             2: begin
                Ready <= 1;
                if (!Start) state <= 3;
             end

             3: begin
                Ready <= 0;
                state <= 0;
             end

           endcase // case (state)
      RB <= (yy) ? B: RB<<1;   // Data path
      RA <= (xx) ? A: RA>>1;
      RC <= (fc==2) ? 0: (fc==1) ? RC+RB: RC;
      end

   assign C = RC;
endmodule

Figure 37: A very basic worked HLS example showing the input source code (left) and generated RTL (right) for a 32-bit multiplier as a variable-latency FU with start and ready handshake nets.
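The variable latency of this FU is easy to see in a software model of the same shift-and-add loop. The Python sketch below follows the C source and also counts loop iterations, which correspond to cycles spent in the working state of the sequencer; the function name and the iteration counter are additions for illustration.

```python
def long_multiply(a, b):
    """Shift-and-add multiply. Returns (product, number of loop iterations)."""
    ra, rb, rc = a, b, 0
    steps = 0
    while ra > 0:
        if ra & 1:        # odd(RA)
            rc += rb
        ra >>= 1
        rb <<= 1
        steps += 1
    return rc, steps

product, steps = long_multiply(13, 11)
```

The iteration count equals the position of the most significant set bit of the first operand, so small values of A complete sooner.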

 
 
 

PIC
Figure 38: Long multiplier output from HLS showing the internal data path and sequencer (left) and the schematic symbol (right). The result is used as an FU in a higher-level design.

 
 
 

PIC
Figure 39: An example of a static schedule for a basic block containing a single assignment.

 
 
 
 
 
 

PIC
Figure 40: Various circuits that compute a running sum. The everyday circuit C1 can be used if the adder has a latency of 1. However, if the adder has a latency of 2, more complex circuits are needed, namely C2 and C3.

 
 
 

PIC
Figure 41: Fragment of an example inter-block initiation and hazard graph. Block 1 may be followed by itself or by block 2. Each block schedule contains both C (control flow) predicate evaluation and D (data) computation operations.

 
 
 

  public static int associative_reduction_example(int starting) 
  { 
    int vr = 0; 
    for (int i=0;i<15;i++) // or also i+=4 
      { 
      int vx = (i+starting)*(i+3)*(i+5); // Mapped computation 
      vr ^= ((vx&128)>0 ? 1:0);          // Associative reduction 
      } 
    return vr; 
  }
Figure 42: An iteration that performs associative reduction (in vr).
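Because exclusive-OR is associative and commutative, the reduction can be split across independent lanes whose partial results are combined at the end, which is what the "i+=4" comment hints at. A minimal Python sketch, assuming four lanes and using hypothetical function names:

```python
from functools import reduce
from operator import xor

def mapped(i, starting):
    vx = (i + starting) * (i + 3) * (i + 5)   # mapped computation
    return 1 if (vx & 128) > 0 else 0

def serial(starting, n=15):
    vr = 0
    for i in range(n):
        vr ^= mapped(i, starting)             # one long dependency chain
    return vr

def partitioned(starting, n=15, lanes=4):
    # Each lane handles i = lane, lane+lanes, ... and could run in parallel.
    partials = [
        reduce(xor, (mapped(i, starting) for i in range(lane, n, lanes)), 0)
        for lane in range(lanes)
    ]
    return reduce(xor, partials, 0)           # final combining step

same = all(serial(s) == partitioned(s) for s in range(50))
```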

 
 
 

  double loop_carried_example(double seed, double arg0) 
  { 
    double vr = 0.0, vd = seed; 
    for (int i=0;i<15;i++) 
    { 
        double vx = xf1(i*arg0);     // Parallelisable 
        vd = xf2(vd + vx) * 3.14;    // Non-parallelisable: uses vd from the previous iteration 
        vr += vd; 
    } 
    return vr; 
  }
Figure 43: An iteration that has a loop-carried data dependency (through variable vd).

 
 
 

  static int [] foos = new int [10]; 
  static int ipos = 0; 
  public static int loop_forwarding_example(int newdata) 
  { 
    foos[ipos ++] = newdata; 
    ipos %= foos.Length; 
    int sum = 0; 
    for (int i=0;i<foos.Length-1;i++) 
      { 
      int dv = foos[i]^foos[i+1];  // Two adjacent locations are read 
        sum += dv;                   // Associative scalar reduction in sum 
      } 
    return sum; 
  }
Figure 44: A loop where data fetched in one iteration (foos[i+1]) can usefully be forwarded to the subsequent iteration, where it is needed again as foos[i].
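The benefit of forwarding can be counted in memory reads. The Python sketch below compares the loop as written with a version that keeps the previously fetched element in a holding register; the function names and the read counters are hypothetical additions.

```python
def naive(foos):
    """Two array reads per iteration, as in the source loop."""
    total, reads = 0, 0
    for i in range(len(foos) - 1):
        total += foos[i] ^ foos[i + 1]
        reads += 2
    return total, reads

def forwarded(foos):
    """One array read per iteration: the previous value is kept in a register."""
    prev, reads = foos[0], 1
    total = 0
    for i in range(1, len(foos)):
        cur = foos[i]
        reads += 1
        total += prev ^ cur
        prev = cur            # forward to the next iteration
    return total, reads

data = list(range(10))
naive_result = naive(data)
forwarded_result = forwarded(data)
```

For the ten-entry array the read count drops from 18 to 10, which matters when the array sits in a single-ported RAM.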

 
 
 

  public static int data_dependent_controlflow_example(int seed) 
  { 
    int vr = 0; 
    int i; 
    for (i=0;i<20;i++) 
      { 
      vr += i*i*seed; 
      if (vr > 1111) break; // Early loop exit 
      } 
    return i; 
  }
Figure 45: A loop that has data-dependent control flow (the loop exit depends on variable vr).

 
 
 

PIC
Figure 46: Lookup procedure when searching for a string using the BWT. A haystack has been transformed and stored in Bwt[]. An index Ranks[] is computed for it. Looking for the string in Needle is then very fast.

 
 
 

PIC
Figure 47: Compacted Ranks array for BWT, redefinition of the ranks() routine for an interpolated lookup and a sensible layout in a DRAM row.

 
 
 

PIC
Figure 48: Data dependencies (slightly simplified) in the Smith–Waterman alignment-finding algorithm.

 
 
 

PIC
Figure 49: Affine transformation examples, adapted from [?]. Nest 1 runs after nest 2, but by skewing the access pattern of each loop, there is increased parallelism available within each nest and pipelining becomes possible. Nest 2 commences just after nest 1 has produced its first diagonal output.

 
 
 

PIC
Figure 50: Shuffle data flow for an FFT (left), elemental butterfly (top right) and code (bottom right).

 
 
 

PIC
Figure 51: A circuit intended to compute the running sum of streaming data values.

 
 
 

PIC
Figure 52: Overall setup of a generic ESL flow. The heavy lines show the fastest technique, which avoids using an instruction set simulator (ISS) through native cross-compiling of the embedded firmware and direct linking with behavioural models of the hardware.

 
 
 

SC_MODULE(mycounter) // An example of a leaf module (no subcomponents) 
{ 
   sc_in  < bool       > clk, reset; 
   sc_out < sc_int<10> > myout; 
 
   void mybev()  // Internal behaviour, invoked as an SC_METHOD 
   { 
      myout =  (reset) ? 0: (myout.read()+1); // Use .read() since sc_out makes a signal 
   } 
 
   SC_CTOR(mycounter)        // Constructor 
     { SC_METHOD(mybev);     // Require that mybev is called on each positive edge of clk 
       sensitive << clk.pos(); 
     } 
};
Figure 53: A 10-bit binary counter with synchronous reset, coded as a SystemC class.

 
 
 

PIC

//Example of structural hierarchy and wiring between levels: 
SC_MODULE(shiftreg)  // Two-bit shift register 
{   sc_in  < bool >  clk, reset, din; 
    sc_out < bool >  dout; 
 
    sc_signal < bool > q1_s; 
    dff dff1, dff2;      // Instantiate FFs 
 
    SC_CTOR(shiftreg) : dff1("dff1"), dff2("dff2") 
    {   dff1.clk(clk); 
        dff1.reset(reset); 
        dff1.d(din); 
        dff1.q(q1_s); 
 
        dff2.clk(clk); 
        dff2.reset(reset); 
        dff2.d(q1_s); 
        dff2.q(dout); 
    } 
};
Figure 54: Schematic (left) and SystemC structural netlist (right) for a 2-bit shift register.

 
 
 

  int nv;                       // nv is a simple C variable (POD, plain old data) 
  sc_out    < int > data;       // data and mysig are sc_signals (non-POD) 
  sc_signal < int > mysig;      // 
  ... 
        nv += 1; 
        data = nv; 
        mysig = nv; 
        printf("Before nv=%i, %i %i\n", nv, data.read(), mysig.read()); 
        wait(10, SC_NS); 
        printf("After  nv=%i, %i %i\n", nv, data.read(), mysig.read()); 
  ... 
Before nv=96, 95 95 
After  nv=96, 96 96
Figure 55: Compute/commit behaviour.
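The behaviour shown in the two output lines can be mimicked with a two-slot signal class. This is a toy model of the idea rather than of the SystemC kernel: the class `Signal` and its `update` method are hypothetical, and the explicit update loop stands in for what the kernel does when the thread waits.

```python
class Signal:
    """Holds a current value and a pending value that is committed later."""

    def __init__(self, value=0):
        self._cur = value
        self._next = value

    def write(self, value):
        self._next = value        # compute phase: only the pending slot changes

    def read(self):
        return self._cur          # readers still see the old value

    def update(self):
        self._cur = self._next    # commit phase

nv = 95
data, mysig = Signal(95), Signal(95)

nv += 1
data.write(nv)
mysig.write(nv)
before = (nv, data.read(), mysig.read())

for sig in (data, mysig):         # stands in for the kernel's update step
    sig.update()
after = (nv, data.read(), mysig.read())
```

The plain variable changes at once, whereas both signals show the new value only after the commit, matching the Before and After lines above.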

 
 
 

SC_MODULE(mydata_generator) 
{ sc_out < int  > data; 
  sc_out < bool > req; 
  sc_in  < bool > ack; 
 
  void myloop() 
  { while(1) 
    { data = data.read() + 1; 
      wait(10, SC_NS); 
      req = 1; 
      do { wait(10, SC_NS); } while(!ack.read()); 
      req = 0; 
      do { wait(10, SC_NS); } while(ack.read()); 
    } 
  } 
 
  SC_CTOR(mydata_generator) 
  { 
    SC_THREAD(myloop); 
  } 
};
Figure 56: Sample code using SC_THREAD.

 
 
 

PIC
Figure 57: An example waveform view plotted by gtkwave.

 
 
 

      sc_trace_file *tf = sc_create_vcd_trace_file("tracefilename"); 
 
      // Now call: 
      // sc_trace(tf, <traced variable>, <string>); 
 
      sc_signal < bool  > serialin("serialin"); // A named signal 
      sc_signal < bool  > serialout;            // An unnamed signal 
      float fbar; 
      sc_trace(tf, clk); 
      sc_trace(tf, serialin); 
      sc_trace(tf, serialout, "serialout");     // Give name since not named above 
      sc_trace(tf, fbar, "fbar");               // Give name since POD form 
 
      sc_start(1000, SC_NS);                    // Simulate for 1 microsecond (old API) 
      sc_close_vcd_trace_file(tf); 
      return 0;
Figure 58: Naming nets.

 
 
 

  sc_signal < bool > mywire; // Rather than a channel conveying just one bit 
 
  struct capsule 
  { int ts_int1, ts_int2; 
    bool operator== (struct capsule other) 
    { return (ts_int1 == other.ts_int1) && (ts_int2 == other.ts_int2); } 
 
    int next_ts_int1, next_ts_int2; // Pending updates 
    void update() 
    { ts_int1 = next_ts_int1; ts_int2 = next_ts_int2; 
    } 
    ... 
    ... // Also must define read(), write() and value_changed() 
  }; 
 
  sc_signal < struct capsule > myast; // We can send two integers at once
Figure 59: Equality operator overload.

 
 
 

  void mymethod()  { .... } 
  SC_METHOD(mymethod) 
  sensitive << myast.pos(); // User must define concept of posedge for their own abstract type
Figure 60: Positive edge.

 
 
 

PIC
Figure 61: Three views of the four-phase transactional protocol from §?? operating between a sender and a receiver. The views are for a net-level connection, an untimed TLM push and an untimed TLM pull.

 
 
 

PIC

Figure 62: Schematic of a FIFO device (top) and its typical instantiation (bottom).

 
 
 

//Define the interfaces: 
class write_if: public sc_interface 
{ public: 
  virtual void write(char) = 0; 
  virtual void reset() = 0; 
}; 
 
class read_if: public sc_interface 
{ public: 
  virtual char read() = 0; 
}; 
 
//Define a component that inherits: 
class fifo_dev: public sc_module, 
public write_if, public read_if, ... 
{ 
  void write(char) { ... } 
  void reset() { ... } 
  ... 
};

SC_MODULE(fifo_writer) 
{ 
  sc_port< write_if > outputport; 
  sc_in < bool > clk; 
  void writer() 
  { 
    outputport->write(random()); // Interface methods are called through the port with -> 
  } 
 
  SC_CTOR(fifo_writer) { 
     SC_METHOD(writer); 
     sensitive << clk.pos(); 
  } 
}; 
 
//Top level instances: 
fifo_dev myfifo("myfifo"); 
fifo_writer mywriter("mywriter"); 
// Port binding: 
mywriter.outputport(myfifo);
Figure 63: SystemC module definition of an interface (left) and FIFO writer (right).

 
 
 

PIC
Figure 64: TLM 2.0 generic payload structure (left) and interconnection of three IP blocks showing passthrough and demultiplexing (right).

 
 
 

// Filling in the fields of a TLM 2.0 generic payload: 
  trans.set_command(tlm::TLM_WRITE_COMMAND); 
  trans.set_address(addr); 
  trans.set_data_ptr(reinterpret_cast<unsigned char*>(&data)); 
  trans.set_data_length(4); 
  trans.set_streaming_width(4); 
  trans.set_byte_enable_ptr(0); 
  trans.set_response_status( tlm::TLM_INCOMPLETE_RESPONSE ); 
 
// Sending the payload through a TLM socket: 
  socket->b_transport(trans, delay);
Figure 65: Setting up a generic payload.

 
 
 
 
 
 

Table 8: List of TLM 2.0 convenience socket types.
simple_initiator_socket.h

A version of an initiator socket that has a default implementation of all interfaces. It allows the registration of an implementation for any of the interfaces to the socket, either unique interfaces or tagged interfaces (carrying an additional ID).

simple_target_socket.h

A basic target socket that has a default implementation of all interfaces. It also allows the registration of an implementation for any of the interfaces to the socket, either unique interfaces or tagged interfaces (carrying an additional ID). This socket allows only one of the transport interfaces (blocking or non-blocking) to be registered and implements a conversion if the socket is used on the other interface.

passthrough_target_socket.h

A target socket that has a default implementation of all interfaces. It also allows the registration of an implementation for any of the interfaces to the socket.

multi_passthrough_initiator_socket.h

An implementation of a socket that allows multiple targets to be bound to the same initiator socket. It implements a mechanism that allows the index of the socket the call passed through in the backward path to be identified.

multi_passthrough_target_socket.h

An implementation of a socket that allows multiple initiators to bind to the same target socket. It implements a mechanism that allows the index of the socket the call passed through in the forward path to be identified.

 
 
 

PIC
Figure 66: Illustration of modelling artefacts relating to model interleaving and timing arising from loosely timed TLM. In a cycle-accurate or approximately timed model (left), two threads, T0 and T1, can access a shared resource. One blocks while the resource is busy. With loose timing (right), the average contention delay is added to the response times of all customers. With a large quantum, one thread may run entirely before the other.

 
 
 

  vqueue::b_transact(pkt, sc_time &delay) 
  { 
     // Measure utilisation and predict queue delay based on last 32 transactions 
     if (++opcount == 32) 
     {  sc_time delta = sc_time_stamp()+delay-last_measure_time; 
        local_processing_delay += (delay_formula(delta/32)-local_processing_delay)/16; 
        logging.log(25, delta);                  // record utilisation 
        last_measure_time = sc_time_stamp()+delay; 
        opcount = 0; 
     } 
 
     // Add estimated (virtual) queuing penalty 
     delay += local_processing_delay; 
 
     // Do actual work 
     output.b_transact(pkt, delay); 
  }
Figure 67: Essence of the code for a virtual queue contention point. A number of clients send work using vqueue::b_transact. The work is done straight away by output.b_transact, but the virtual time spent in the queue is added to the loosely timed delay.
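The bookkeeping in this listing can be exercised in a few lines of Python. The sketch keeps the 32-transaction window and the 1/16 smoothing step, but everything else is an assumption: the class `VirtualQueue` is hypothetical, simulation time is passed in explicitly as `now`, and `delay_formula` is filled in with the M/M/1 mean waiting time purely as a stand-in, since the listing leaves it unspecified.

```python
class VirtualQueue:
    def __init__(self, service_time, window=32):
        self.service_time = service_time
        self.window = window
        self.opcount = 0
        self.last_measure_time = 0.0
        self.local_processing_delay = 0.0

    def delay_formula(self, mean_interarrival):
        # Hypothetical stand-in: M/M/1 mean wait, capped near saturation.
        if mean_interarrival <= 0:
            rho = 0.99
        else:
            rho = min(self.service_time / mean_interarrival, 0.99)
        return self.service_time * rho / (1.0 - rho)

    def b_transact(self, now, delay):
        """Return the loosely timed delay after adding the virtual queue penalty."""
        self.opcount += 1
        if self.opcount == self.window:
            # Measure utilisation over the last window of transactions.
            delta = now + delay - self.last_measure_time
            target = self.delay_formula(delta / self.window)
            self.local_processing_delay += (target - self.local_processing_delay) / 16
            self.last_measure_time = now + delay
            self.opcount = 0
        # The real work would be forwarded here; only the penalty is modelled.
        return delay + self.local_processing_delay

vq = VirtualQueue(service_time=1.0)
delays = [vq.b_transact(now=2.0 * i, delay=0.0) for i in range(1, 33)]
```

With one transaction every two time units and a service time of one unit, the stand-in formula predicts a wait of one unit, and the smoothing moves the applied penalty one sixteenth of the way there at the first measurement.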

 
 
 
 
 
 

PIC
Figure 68: Typical setup of a thread using loosely timed modelling with a quantum keeper for one core of an MPSoC. In reality there would be multiple interconnect components between the initiator and its final target.

 
 
 

PIC
Figure 69: Mixing modelling styles using a target-to-initiator transactor. An initiator-to-target transactor, not shown, would receive net-level transactions and make method calls on a TLM target.

 
 
 

PIC
Figure 70: Mixing modelling styles using a mailbox paradigm transactor.

 
 
 

PIC
Figure 71: Five views of the LocalLink protocol (also of AXI4-Stream). (a) Timing diagram. (b) Net-level wiring. (c) TLM abstract view. (d) Blocking TLM target code, big step. (e) Blocking TLM, small step.

 
 
 

void mips64iss::step() 
{ 
  u32_t ins = ins_fetch(pc); 
  pc += 4; 
  u8_t opcode = ins >> 26;    // Major opcode 
  u8_t scode = ins&0x3F;      // Minor opcode 
  u5_t rs = (ins >> 21)&31;   // Registers 
  u5_t rd = (ins >> 11)&31; 
  u5_t rt = (ins >> 16)&31; 
 
  if (!opcode) switch (scode) // decode minor opcode 
  { 
    case 052: /* SLT - set on less than */ 
      regfile_up(rd, ((int64_t)regfile[rs]) < ((int64_t)regfile[rt])); 
      break; 
 
    case 053: /* SLTU - set on less than unsigned */ 
      regfile_up(rd, ((u64_t)regfile[rs]) < ((u64_t)regfile[rt])); 
      break; 
      ... 
   ... 
 
void mips64iss::regfile_up(u5_t d, u64_t w32) 
{ if (d != 0) // Register zero stays at zero 
    { TRC(trace("[ r%i := %llX ]", d, w32)); 
      regfile[d] = w32; 
    } 
}
Figure 72: Main step function of the MIPS64 ISS, together with its register-file update helper.

 
 
 

Table 9: Typical relative performance of different virtual platform processor modelling approaches compared with real time. Figures assume a high-performance modelling workstation and a SoC with just one core.
Index Type of ISS I-cache traffic modelled D-cache traffic modelled Relative performance
(1) Interpreted RTL Y Y 0.000001
(2) Compiled RTL Y Y 0.00001
(3) V-to-C C++ Y Y 0.001
(4) Handcrafted cycle-accurate C++ Y Y 0.1
(5) Handcrafted high-level C++ Y Y 1.0
(6) Trace buffer/JIT C++ N Y 20.0
(7) Native cross-compile N N 50.0

 
 
 

PIC
Figure 73: Block diagram of one channel of a DMA unit. (Repeated from Figure 254).

 
 
 

  // Behavioural model of 
  // target side: PIO register r/w 
  u32_t src, dest, length; 
  bool busy, int_enable; 
 
  // Status word: busy in bit 31, interrupt enable in bit 30 
  u32_t status() { return ((u32_t)busy << 31) 
            | ((u32_t)int_enable << 30); } 
 
  u32_t completer_read(u32_t a) 
  { 
    return (a==0)? src: (a==4) ? dest: 
     (a==8) ? (length) : status(); 
  } 
  void completer_write(u32_t a, u32_t d) 
  { 
     if (a==0) src=d; 
     else if (a==4) dest=d; 
     else if (a==8) length = d; 
     else if (a==12) 
     { busy = (d >> 31) & 1;        // Bit 31 only 
       int_enable = (d >> 30) & 1; } // Bit 30 only 
  }

  // Behavioural model of bus mastering portion 
  while(1) 
  { 
    waituntil(busy); 
    while (length-- > 0) 
      mem.write(dest++, mem.read(src++)); 
    busy = 0; 
  }

Interrupt code:

   interrupt = int_enable&!busy;
Figure 74: Behavioural model of (one channel of) the DMA controller.

 
 
 

Table 10: Product space of power and area estimates for an RTL implementation.

 | Without simulation | Using simulation
Without place and route | Fast design exploration; area and delay heuristics needed | Can generate indicative activity ratios that can be used instead of a simulation in further runs
With place and route | Static timing analyser will give an accurate clock frequency | Gold standard: only bettered by measuring a real chip

 
 
 

    module CTR16( 
      input mainclk, 
      input din, input cen, 
      output o); 
 
      reg [3:0] count, oldcount;          // D-types 
 
      always @(posedge mainclk) begin 
          if (cen) count <= count + 1;    // ALU 
          if (din) oldcount <= count;     // Wiring 
          end 
 
     assign o = count[3] ^ count[1];      // Combinational 
 
   endmodule
Figure 75: Example RTL fragment used in a static analysis.

 
 
 

Table 11: Normal parameters for a RAM compiler together with selected values.
Parameter Value
Fabrication geometry 22 nm
Organisation 64Kx64
Nominal VDD 1.0 V
Number of banks 1
Read/write ports per bank 1
Read ports per bank 0
Write ports per bank 0

 
 
 

PIC
Figure 76: Performance and overheads for various sizes of a single-ported 32-bit data width SRAM implemented in 22 nm, as modelled by Cacti 7.0. The tool gives the access time, read and write dynamic energy use, and static power and area use for each size.

 
 
 

void sram64_cbg::recompute_pvt_parameters() // Called in constructor and when Vcc is changed 
{ 
  m_latency =  sc_time(0.21 + 3.8e-4 *sqrt(float(m_bits)), SC_NS); 
 
  pw_power leakage = pw_power(82.0 * m_bits, PW_nW); 
  set_static_power(leakage); 
 
  set_fixed_area(pw_area(13359.0 + 4.93/8 * m_bits, PW_squm)); 
 
  m_read_energy_op = pw_energy(5.0 + 1.2e-4 / 8.0 *m_bits, pw_energy_unit::PW_pJ); 
  m_write_energy_op = 2.0 * m_read_energy_op; // rule of thumb! 
 
  // NB: Might want different energy when high-order address bits change 
 
  pw_voltage vcc = get_vcc(); 
  m_latency = m_latency / vcc.to_volts(); 
  cout << name () << ":" << kind() << ": final latency = " << m_latency << "\n"; 
}
void sram64_cbg::b_access(PW_TLM_PAYTYPE &trans, sc_time &delay) 
{ 
  tlm::tlm_command cmd = trans.get_command(); 
 
  // Log wiring power consumed by transaction arriving here. 
  // Also set which nets modelled by the TLM will be active after this operation: 
  // For a write none (except a response ack) and for read the payload data. 
  trans.pw_log_hop(this,  (cmd==tlm::TLM_READ_COMMAND ? PW_TGP_DATA: PW_TGP_NOFIELDS) | 
          PW_TGP_ACCT_CKP,  &read_bus_tracker); 
 
  if (cmd == tlm::TLM_READ_COMMAND) 
    { 
      // Log internal transaction energy for read 
      pw_module_base::record_energy_use(m_read_energy_op); 
      ... 
    } 
  else if (cmd == tlm::TLM_WRITE_COMMAND) 
    { 
      // Log internal transaction energy for write 
      pw_module_base::record_energy_use(m_write_energy_op); 
      ... 
    } 
}
Figure 77: TLM+TE model of an SRAM. Constructor and blocking transport methods.

 
 
 

PIC
Figure 78: A die-stacked DRAM subsystem in the style of the Micron HMC and HBM. Several DRAM chips are piled on top of an upside-down (flip-chip) controller chip, interconnected into a 3-D structure using through-silicon vias. There could be several stacks. A passive interposer (ceramic PCB) wires the logic die using a parallel interconnect to SERDES chips (also upside-down) for serial connection over the circuit board to the hosts.

 
 
 

Table 12: Example performance data from collated DRAMSim4 data sheets for a variety of DDR4 devices, showing their closed page, random access time and dynamic energy for several major operations (total energy for 8 devices making up a 32-bit data bus). Only two significant figures are accurate. Note: VDD=1.2 V and tCL=tRCD=tRP.
Device type (capacity, word size and speed) tCL (cycles) Clock frequency (MHz) Access time (ns) Activate (pJ) Read (pJ) Write (pJ)
DDR4_4Gb_x4_1866 13 934.6 32.1 4006.1 5752.3 6985.0
DDR4_4Gb_x4_2133 16 1063.8 33.8 3754.0 5775.4 7219.2
DDR4_4Gb_x4_2400 17 1204.8 31.5 4063.7 5418.2 7330.6
DDR4_4Gb_x4_2666 19 1333.3 31.5 4104.0 6048.0 7488.0
DDR4_8Gb_x4_1866 13 934.6 32.1 5156.5 5341.4 4930.6
DDR4_8Gb_x4_2133 16 1063.8 33.8 5270.0 4692.5 4331.5
DDR4_8Gb_x4_2666 19 1333.3 31.5 6105.6 4608.0 4089.6
DDR4_8Gb_x4_2933 21 1470.6 31.3 6632.4 4595.7 4021.2
DDR4_8Gb_x4_3200 22 1587.3 30.2 7136.6 4644.9 4015.9

PIC
Figure 79: A Parallella single-board computer. The two 3320C USB driver devices and connectors are highlighted in red boxes.

 
 
 

Table 13: Phase/mode example: supply rail voltage and current and total power consumption for 3320C-EZK USB line driver devices.
Operating mode Rail 1 (volts) Rail 1 (mA) Rail 2 (volts) Rail 2 (mA) Rail 3 (volts) Rail 3 (mA) Total power (mW)
Standby 3.3 0.018 1.8 0.0007 3.3 0.03 0.16
L/S mode 3.3 6.3 1.8 11 3.3 5 57
H/S mode 3.3 29 1.8 22 3.3 5.9 155
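
The total power column is the sum over the three rails of voltage times current. A minimal sketch of that arithmetic follows; the names Rail and total_power_mw are hypothetical and are not taken from the book or its repository.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One supply rail: voltage in volts and current in milliamps.
struct Rail {
    double volts;
    double milliamps;
};

// Total power in milliwatts: sum of V * I over all rails.
// Volts multiplied by milliamps gives milliwatts directly.
double total_power_mw(const std::vector<Rail>& rails) {
    double total = 0.0;
    for (const Rail& r : rails) {
        assert(r.volts >= 0.0 && r.milliamps >= 0.0);
        total += r.volts * r.milliamps;
    }
    return total;
}
```

Feeding in the standby row (3.3 V at 0.018 mA, 1.8 V at 0.0007 mA, 3.3 V at 0.03 mA) gives roughly 0.16 mW, and the L/S row gives roughly 57 mW, matching the table to two significant figures.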

 
 
 
 
 
 

PIC
Figure 80: Example power estimation spreadsheet. The inventory of instantiated components is annotated with clock frequencies and activity ratios to generate an approximate total power use in watts.

 
 
 

PIC
Figure 81: Two similar designs with different Rent exponents (top) and two non-Rentian design points (bottom).

 
 
 

PIC
Figure 82: Left: average net length in systems composed of 59 and 671 gates for various values of Rent exponent. Right: average net length for subsystems of different sizes for Rent exponent of 0.65 using four improved equations presented in [?].

 
 
 

PIC
Figure 83: Lowest common parent of the end-point logic blocks. The distance between two end points is always roughly the same for any sensible layout of a design, so a detailed layout, like the one shown, is not required.

 
 
 

PIC
Figure 84: Example of a transactional energy modelling plot. Static power depends on the power mode, whereas the dynamic power depends on internal and external activity events.

 
 
 

PIC
Figure 85: Example reports generated by the TLM POWER3 library. The energy use between two checkpoints is shown in the upper table and power consumption in the lower table. A total for the whole device is given, along with subtotals for parts of the design hierarchy specifically selected by the user.

 
 
 

PIC
Figure 86: Back-end stages in the synthesis and physical design of a SoC, showing representative diagrams for five of the stages. The red track on the left is the physical intent flow.

 
 
 

PIC
Figure 87: A small example of a universal power format (UPF) file.

 
 
 

Table 14: Representative microprocessors
Year introduced Microprocessor No of transistors Geometry
2007 Dual-core Intel Itanium 2 1.6 billion 90 nm
2010 8-core Intel Nehalem 2.3 billion 45 nm
2010 Altera Stratix IV FPGA 2.5 billion 40 nm
2015 Intel CPU circa 10 billion 19 nm
2020 Nvidia’s GA100 Ampere 54 billion 7 nm

 
 
 

PIC
Figure 88: Technology scaling scatter plot.

 
 
 
 
 
 

PIC
Figure 89: Dark silicon (using data from [??]).

 
 
 

PIC
Figure 90: Basic layers in planar CMOS VLSI. Cross section through an invertor fabricated using planar transistors (not FinFETs) in a twin-well process and metal layers, M1 to M7 (not to scale).

 
 
 

PIC
Figure 91: Generic 3-D structure of a FinFET (left). The gate is wrapped around the channel. To get a sufficiently low on-resistance, a number of FETs are arranged in parallel using techniques shown in the other three diagrams, which are in 2-D plan view.

 
 
 
 
 
 

PIC
Figure 92: Typical first-generation standard cell layout showing three rows of cells. Power rails of alternating polarity run horizontally using the metal M1 layer (blue). This is also used for the internal wiring within the cells. N- and P-type diffusion zones are shown in green and yellow. FET gates are formed where the polysilicon (orange) crosses a diffusion layer. The violet and brown indicate metal wiring layers used for inter-cell connections. The brown layer is predominantly used for horizontal net segments and the violet for vertical segments.

 
 
 
 
 
 

Table 15: ITRS roadmap projection for geometry, supply voltage, transistor properties and FO3 gate energy. Predictions are made for both high-performance (HP) and low-power (LP) transistors.

Year of production | 2015 | 2017 | 2019 | 2021 | 2024 | 2027 | 2030
Logic device technology names | P70M56 | P48M36 | P42M24 | P32M20 | P24M12G1 | P24M12G2 | P24M12G3
Logic industry node range label (nm) | 16/14 | 11/10 | 8/7 | 6/5 | 4/3 | 3/2.5 | 2/1.5
Logic device structure | FinFET | FinFET | FinFET | FinFET | VGAA | VGAA | VGAA
Logic device structure (continued) | FDSOI | FDSOI | LGAA | LGAA | M3D | M3D | M3D
Logic device structure (continued) | VGAA

Device Electrical Specifications

Power supply voltage, Vdd (V) | 0.80 | 0.75 | 0.70 | 0.65 | 0.55 | 0.45 | 0.40
Sub-threshold slope (mV/decade) | 75 | 70 | 68 | 65 | 40 | 25 | 25
Inversion layer thickness (nm) | 1.10 | 1.00 | 0.90 | 0.85 | 0.80 | 0.80 | 0.80
VT sat (mV) at Ioff=100 nA/µm, HP logic | 129 | 129 | 133 | 136 | 84 | 52 | 52
VT sat (mV) at Ioff=100 nA/µm, LP logic | 351 | 336 | 333 | 326 | 201 | 125 | 125
Effective mobility (cm² V⁻¹ s⁻¹) | 200 | 150 | 120 | 100 | 100 | 100 | 100
Rext (Ω·µm), HP logic | 280 | 238 | 202 | 172 | 146 | 124 | 106
Ballisticity: injection velocity (cm/s) | 1.20×10⁷ | 1.32×10⁷ | 1.45×10⁷ | 1.60×10⁷ | 1.76×10⁷ | 1.93×10⁷ | 2.13×10⁷
Vdsat (V), HP logic | 0.115 | 0.127 | 0.136 | 0.128 | 0.141 | 0.155 | 0.170
Vdsat (V), LP logic | 0.125 | 0.141 | 0.155 | 0.153 | 0.169 | 0.186 | 0.204
Ion (µA/µm) at Ioff=100 nA/µm, HP logic with Rext=0 | 2311 | 2541 | 2782 | 2917 | 3001 | 2670 | 2408
Ion (µA/µm) at Ioff=100 nA/µm, HP logic after Rext | 1177 | 1287 | 1397 | 1476 | 1546 | 1456 | 1391
Ion (µA/µm) at Ioff=100 pA/µm, LP logic with Rext=0 | 1455 | 1567 | 1614 | 1603 | 2008 | 1933 | 1582
Ion (µA/µm) at Ioff=100 pA/µm, LP logic after Rext | 596 | 637 | 637 | 629 | 890 | 956 | 821
Cch, total (fF/µm²), HP/LP logic | 31.38 | 34.52 | 38.35 | 40.61 | 43.14 | 43.14 | 43.14
Cgate, total (fF/µm), HP logic | 1.81 | 1.49 | 1.29 | 0.97 | 1.04 | 1.04 | 1.04
Cgate, total (fF/µm), LP logic | 1.96 | 1.66 | 1.47 | 1.17 | 1.24 | 1.24 | 1.24
CV/I (ps), FO3 load, HP logic | 3.69 | 2.61 | 1.94 | 1.29 | 1.11 | 0.96 | 0.89
I/(CV) (1/ps), FO3 load, HP logic | 0.27 | 0.38 | 0.52 | 0.78 | 0.90 | 1.04 | 1.12
Energy per switching (CV²) (fJ/switching), FO3 load, HP logic | 3.47 | 2.52 | 1.89 | 1.24 | 0.94 | 0.63 | 0.50

PIC
Figure 93: Cross section showing stacked wiring metal layers.

 
 
 

PIC
Figure 94: EDA tools applied to synthesisable RTL for a simulation (left) and the synthesis to a gate-level or structural netlist (right).

 
 
 

wire [39:0] dout;                          // Five 8-bit result fields 
reg [3:0] values[0:4] = '{5, 6, 7, 8, 15}; 
 
generate 
  genvar i; 
  for (i=0; i<5; i++)  begin : mut         // Instances are mut[0] to mut[4] 
    MUT u ( 
    .out(dout[i*8+7:i*8]), 
    .value_in(values[i]), 
    .clk(clk) 
    ); 
  end 
endgenerate

PIC

Figure 95: Example of a generate statement in Verilog RTL (left) and the resulting structural netlist (right).

 
 
 

module subcircuit( 
  input clk, 
  input rst, 
  output q2); 
   wire q1, q3, a; 
   DFFR Ff_1(clk, rst, a, q1, qb1), 
        Ff_2(clk, rst, q1, q2, qb2), 
        Ff_3(clk, rst, q2, q3, qb3); 
   NOR2 Nor2_1(a, q2, q3); 
endmodule

PIC

Figure 96: A structural RTL example (left) and the net-level circuit it defines (a divide-by-five Johnson counter, right).

 
 
 

PIC
Figure 97: A combinational RTL example (top) and the naive net-level circuit it defines (bottom).

 
 
 

PIC
Figure 98: Example RTL fragment, before and after flattening. For many designs, the flattened netlist is often bigger than the hierarchic netlist owing to multiple instances of the same component. Here it was smaller.

 
 
 

PIC

PIC

Figure 99: Schematic symbol, Verilog model, timing diagram and possible implementation for a dual-edge-triggered RS flop.

 
 
 

PIC
Figure 100: Simple synthesisable Verilog examples, including a transparent latch.

 
 
 

PIC
Figure 101: Four-value logic-level encoding and its behaviour for six common gates.

 
 
 

PIC
Figure 102: EDS event queue, which is a linked list, sorted in ascending temporal order.
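
The queue discipline named in the caption can be sketched as follows: new events are inserted at the position that keeps the list sorted by time, and the simulator always removes the head. This is only an illustration under assumed names (Event, EventQueue, schedule and pop_next are hypothetical), not the data structure of any particular simulator.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <string>

// A pending net change in an event-driven simulator (EDS).
struct Event {
    std::uint64_t time;   // simulation time at which the change takes effect
    std::string net;      // name of the net to update (illustrative)
    int value;            // new logic value
};

class EventQueue {
public:
    // Insert so that the list stays sorted in ascending time order.
    // Events with equal times keep their insertion order.
    void schedule(const Event& e) {
        auto pos = events_.begin();
        while (pos != events_.end() && pos->time <= e.time) ++pos;
        events_.insert(pos, e);
    }

    bool empty() const { return events_.empty(); }

    // Remove and return the earliest pending event.
    Event pop_next() {
        assert(!events_.empty());
        Event e = events_.front();
        events_.pop_front();
        return e;
    }

private:
    std::list<Event> events_;
};
```

Insertion is linear in the queue length, which is the price of the simple linked list; a simulator with many pending events might prefer a heap or a time wheel.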

 
 
 

PIC
Figure 103: Behaviour of a runt pulse in an RS latch when modelling with transport delay.

 
 
 

PIC
Figure 104: Hybrid automobile transmission system.

 
 
 

PIC
Figure 105: Zeno wondered whether Achilles could ever catch the tortoise, but we know that the sum of a geometric progression often converges.

 
 
 

PIC
Figure 106: Hybrid model simulation of a bouncing ball dropped from 7.0 m. It stops completely at 9.366 s.

 
 
 
 
 
 
 
 
 

PIC
Figure 107: Fragment of a synthesised netlist rendered in a schematic viewer.

 
 
 
 
 
 

module TC(input clk, input cen); 
  reg [1:0] count; 
  always @(posedge clk) if (cen) count<=count+1; 
endmodule

module TC(input clk, input cen); 
  wire u10022, u10021, u10020, u10019; 
  wire [1:0] count; 
  CVINV  i10021(u10021, count[0]); 
  CVMUX2  i10022(u10022, cen, u10021, count[0]); 
  CVDFF  u10023(count[0], u10022, clk, 1'b1, 1'b0, 1'b0); 
  CVXOR2  i10019(u10019, count[0], count[1]); 
  CVMUX2  i10020(u10020, cen, u10019, count[1]); 
  CVDFF  u10024(count[1], u10020, clk, 1'b1, 1'b0, 1'b0); 
endmodule
Figure 108: Baseline RTL elaboration example showing synthesisable RTL input (left) and structural netlist output that uses generic gates (right).

 
 
 

  reg [31:0] q, n; 
  ... 
  q = n / 10; 
  return q; 
 
 

   reg [31:0] q, n; 
   ... 
   q = (n >> 1) + (n >> 2); 
   q += (q >> 4); 
   q += (q >> 8); 
   q += (q >> 16); 
   return q >> 3;
Figure 109: Essence of logic synthesised for integer division of the 32-bit value n by the constant 10 using just adders, based on 8/10 being 0.11001100 recurring in binary. A logic synthesiser can create similar bespoke divide circuits for any constant denominator.
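
As a software sketch of the same idea, the shift-and-add sequence can be written in C++. Because every right shift truncates, the estimate can come out one too low (for n = 10 it yields 0), so the version below adds a remainder-based correction step of the kind given in Hacker's Delight. The function names are hypothetical, and the correction is an addition for illustration rather than something shown in the figure.

```cpp
#include <cassert>
#include <cstdint>

// Shift-and-add estimate of n / 10, following the figure.
// Truncation in the shifts means the result may be one too low.
std::uint32_t div10_estimate(std::uint32_t n) {
    std::uint32_t q = (n >> 1) + (n >> 2);   // n * 0.11 (binary) = n * 0.75
    q += q >> 4;                             // n * 0.110011
    q += q >> 8;                             // repeat the pattern further
    q += q >> 16;                            // now close to n * 0.8
    return q >> 3;                           // divide by 8: close to n / 10
}

// Exact unsigned division by 10 using the estimate plus one correction.
std::uint32_t div10(std::uint32_t n) {
    std::uint32_t q = div10_estimate(n);
    std::uint32_t r = n - q * 10;            // remainder relative to the estimate
    return q + ((r + 6) >> 4);               // adds 1 when the estimate was one too low
}
```

The correction costs one small constant multiply (itself two shifts and an add) and a comparison, which is still far cheaper than a general divider.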

 
 
 

PIC
Figure 110: AND-OR-INVERT gate, logical function (left) and actual circuit when realised in CMOS (right).

 
 
 

module SSRAM( 
  input clk,                  // Synchronous reads and writes 
  input ren,                  // Read enable (optional) 
  input wen,                  // Write enable 
  input [14:0] addr,          // Address input 
  input [31:0] wdata,         // Write data in 
  output reg [31:0] rdata);   // Read data out 
  reg [31:0] myram [32767:0]; // 32k words of 32 bits each 
  always @(posedge clk) begin 
     if (ren) rdata <= myram[addr]; 
     if (wen) myram[addr] <= wdata; 
     end 
endmodule

PIC

Figure 111: Typical RTL coding style for RAM inference. Data out is registered once without otherwise being used and the same subscript expression is used in both the read and write contexts.

 
 
 

PIC
Figure 112: Scatter plot of area versus operating frequency for a design (Arm Cortex-A9 Falcon) on a common process node and cell library for different core utilisation ratios. The best performers are highlighted with black crosses (and joined together with the black dashed line).

 
 
 

PIC
Figure 113: FinFET invertor area versus number of fins for different cell heights at a process node.

 
 
 

PIC
Figure 114: Scatter plot of area versus operating frequency for a given design (Arm Cortex-A9 Falcon) and standard cell libraries on a common process node that vary in cell height, measured in track equivalents.

 
 
 

PIC
Figure 115: A broad-brush classification of digital ICs into full-custom, semi-custom and field-programmable classes with some example device roles.

 
 
 

PIC
Figure 116: SoC application diagram for an automotive headlight controller. A heavy load is connected to a controller-area network (CAN bus).

 
 
 

PIC
Figure 117: Two-input NOR gate, circuit schematic (left) and several layers of a standard cell layout (right). Blue is the bottom layer of metal, red is polysilicon, green is channel diffusion and yellow is the P well in which the N-channel FETs are constructed.

 
 
 

PIC
Figure 118: Typical cell data sheet from a human-readable version of a standard cell library.

 
 
 
 
 
 

Table 16: Example of process, voltage and temperature (PVT) ranges.
Parameter Values
Process variation 0.9 to 1.1
Supply voltage range 0.85 to 1.1 V
Temperature range 0 to 70°C

PIC
Figure 119: Delay versus output loading capacitance for various input transition rates.

 
 
 

PIC PIC
Figure 120: Input capacitance is augmented or diminished due to the Miller effect. It is augmented by the opposite-moving plate potentials arising from an inverting configuration, which is the common case.

 
 
 

PIC
Figure 121: Real-world gate, with a linear slew input waveform, showing the timing behaviour of Vin, Vout and Iout.

 
 
 

PIC
Figure 122: Metal layers in a mask-programmed gate array (ECL codec for fibre optic ring network [?]).

 
 
 

PIC
Figure 123: Transmission gate or bilateral switch: internal wiring (left) and schematic symbols (right).

 
 
 

PIC
Figure 124: An active multiplexor (a) needs more silicon area than a pass-transistor multiplexor (b), but restores logic levels rather than degrading the signal. Larger pass-transistor multiplexors (c) are efficient and easy to lay out.

 
 
 

PIC
Figure 125: FPGA, showing the I/O blocks around the edge, the interconnection matrix blocks and the configurable logic blocks. Recently, the regular structure has been broken up by custom blocks, including RAM and multiplier (aka DSP) blocks.

 
 
 

PIC

PIC

Figure 126: CLB (left) and an IOB (right) for a simple FPGA.

 
 
 

PIC
Figure 127: The DSP block in the Xilinx Virtex 7 family (©Xilinx Inc). The main functions offered are multiply–accumulate and multiply–sum for long multiplication of wider words than supported by the 25×18 base unit.

 
 
 
 
 
 
 
 
 

PIC
Figure 128: Power and floor plan for a simple SoC along with the external supply, which uses battery-backed RAM (left) and buck regulators (right). PSUC: power supply controller.

 
 
 
 
 
 

PIC
Figure 129: Macroscopic back-end flow highlighting the verification flow paths. Figure 86 is a detailed flow diagram.

 
 
 
 
 
 

PIC
Figure 130: Standard cell polygons for a two-input NOR gate. The abstract view (left) identifies the power connections (top and bottom strips) and the input and output contact sites (denoted with asterisks) on the M1 and M2 metal layers, respectively (shades of blue). The full polygon set (right) is required for final design rule checking and tapeout. In reality, the right-hand polysilicon layer (red) could be under the M2 output strip but is here shown displaced for clarity.

 
 
 

PIC
Figure 131: A placed horizontal strip of standard cells showing the polygons significant for routing, as rendered in a layout editor. The power and ground rails are a distinctive feature.

 
 
 

PIC
Figure 132: Snapshot from a place-and-route tool, showing hundreds of interlaced wires that are routed with several metal layers, both vertically and horizontally.
PIC
Figure 133: Enlargement of Figure 132, showing several wires routed with several metal layers (one colour for each layer), vertically and horizontally. Vias are marked with an X.

 
 
 

PIC
      000 000 0 001 111 1
      123 456 7 890 123 4
    [ 00H 00H p H00 x00 p ]
    [ 01H 00H p H00 x00 p ]
    [ 10H 00H p H00 x00 p ]
    [ 11L 00H p H00 x00 p ]

Figure 134: Pin connections for a 7400 quad NAND gate device and part of a test program.

 
 
 

PIC
Figure 135: A wafer (6 to 10 inches diameter) is diced into chips (1 cm on a side or so).

 
 
 

PIC
Figure 136: Load card with wafer probe pins for testing a chip before the wafer is diced.

 
 
 

PIC
Figure 137: General configuration of a wafer probe testing machine. The robotic staging system delivers wafers to and from conveyor systems as well as steps the probes over the wafer.

 
 
 

PIC
Figure 138: Probe needles just visible inside the testing load card. This image must be deleted from the book since we can’t get rights clearance.

 
 
 
 
 
 

PIC
Figure 139: Plastic chip package.

 
 
 

PIC
Figure 140: An FPGA composed of several chiplets with broadside inter-chiplet bonding, connected closely to several DRAM chips in an MCM.

 
 
 
 
 
 

module sewkit(  // TSMC 0.18u library 
       input clk, 
       input n_reset); 
    // verilint 630  on : Port connected to a NULL expression 
    dfcfb1  DZBRB1_1(.CDN(n_reset), .CPN(clk), .D(1'b0), .Q(), .QN()); 
    dfcfb1  DZBRB1_2(.CDN(n_reset), .CPN(clk), .D(1'b0), .Q(), .QN()); 
 
    nd02d2  ND02D2_1 (.A1(1'b0), .A2(1'b0), .ZN() ); 
    nd02d2  ND02D2_2 (.A1(1'b0), .A2(1'b0), .ZN() ); 
 
    inv0d2  INV0D2_1(.I(1'b0), .ZN()); 
    inv0d2  INV0D2_2(.I(1'b0), .ZN() ); 
    inv0d2  INV0D4_1(.I(1'b0), .ZN() ); 
    inv0d2  INV0D4_2(.I(1'b0), .ZN() ); 
 
    buffd7  BUFFD1_1(.I(1'b0), .Z() ); 
    buffd7  BUFFD1_2(.I(1'b0), .Z() ); 
 
    mx02d2  MX02D1_1(.I0(1'b0), .I1(1'b0), .S(1'b0), .Z() ); 
    mx02d2  MX02D1_2(.I0(1'b0), .I1(1'b0), .S(1'b0), .Z() ); 
 
    nr02d2  NR02D2_1 (.A1(1'b0), .A2(1'b0), .ZN() ); 
    nr02d2  NR02D2_2 (.A1(1'b0), .A2(1'b0), .ZN() ); 
 
    aoi211d2 AOI311D1_1(.A(1'b0), .B(1'b0), .C1(1'b0), .C2(1'b0), .ZN() ); 
    aoi211d2 AOI311D1_2(.A(1'b0), .B(1'b0), .C1(1'b0), .C2(1'b0), .ZN() ); 
endmodule
Figure 141: Example of structural Verilog RTL that instantiates disconnected standard cells as a ‘sewing kit’.

 
 
 

Table 17: Simplistic and rough estimates of recurring (RE) and non-recurring expenses (NRE) for the first production run of n wafers.
Type of expense | Item | Item cost | Total cost
NRE | 6 months: 10 software engineers | $100k pa | $500k
NRE | 6 months: 10 hardware engineers | $250k pa | $1250k
NRE | 4 months: 20 verification engineers | $200k pa | $1333k
NRE | 1 mask set (22 nm) | $1500k | $1500k
RE | Per device IP licence fees | ? | $?? ×n
RE | 6-inch wafer | $5k | $5k ×n
Total | | | $4583k + 5k×n
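
The staff rows follow from months of effort, headcount and annual cost, and the total adds the mask set and the per-wafer charge. A small sketch of that arithmetic is given below; the function names are hypothetical, all amounts are in thousands of dollars, and the unknown IP licence fees are left out.

```cpp
#include <cassert>
#include <cmath>

// Staff cost in $k: months of effort, headcount, annual cost per engineer in $k.
double staff_cost_k(double months, int engineers, double annual_cost_k) {
    assert(months >= 0.0 && engineers >= 0 && annual_cost_k >= 0.0);
    return months / 12.0 * engineers * annual_cost_k;
}

// Sum of the four NRE rows of the table, in $k.
double table_nre_k() {
    return staff_cost_k(6, 10, 100)    // software engineers
         + staff_cost_k(6, 10, 250)    // hardware engineers
         + staff_cost_k(4, 20, 200)    // verification engineers
         + 1500.0;                     // one 22 nm mask set
}

// Cost in $k of a first run of n wafers at $5k each (IP licence fees excluded).
double first_run_cost_k(int wafers) {
    assert(wafers >= 0);
    return table_nre_k() + 5.0 * wafers;
}
```

For example, first_run_cost_k(100) evaluates to about 5083, that is, the $4583k of NRE plus $500k for 100 wafers.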

 
 
 

Table 18: Die yield. The cost for a working die given a 6-inch wafer with a processing cost of $5000 and a probability of a square millimetre being defect-free of 99.5 per cent.
Area (mm²) Number of wafer dies Number of working dies Cost per working die ($)
2 9000 8910 0.56
3 6000 5910 0.85
4 4500 4411 1.13
6 3000 2911 1.72
9 2000 1912 2.62
13 1385 1297 3.85
19 947 861 5.81
28 643 559 8.95
42 429 347 14.40
63 286 208 24.00
94 191 120 41.83
141 128 63 79.41
211 85 30 168.78
316 57 12 427.85
474 38 4 1416.89
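
A sketch of the yield model behind this table, assuming each square millimetre is defect-free independently of the others, so that a die of area A works with probability p to the power A. The function names are hypothetical. Taking p as 0.995 reproduces the first rows shown (8910 working dies and $0.56 per die for the 2 mm² case).

```cpp
#include <cassert>
#include <cmath>

// Expected number of working dies on a wafer. Each square millimetre is
// assumed defect-free with probability p_mm2, independently of the rest,
// so a die of area_mm2 works with probability p_mm2 ^ area_mm2.
double working_dies(double dies_per_wafer, double area_mm2, double p_mm2) {
    assert(p_mm2 > 0.0 && p_mm2 <= 1.0 && area_mm2 > 0.0);
    return dies_per_wafer * std::pow(p_mm2, area_mm2);
}

// Wafer processing cost shared among the working dies only.
double cost_per_working_die(double wafer_cost, double dies_per_wafer,
                            double area_mm2, double p_mm2) {
    return wafer_cost / working_dies(dies_per_wafer, area_mm2, p_mm2);
}
```

Because the yield falls exponentially with area while the die count falls only inversely, the cost per working die rises steeply for large dies, as the lower rows of the table show.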

 
 
 

PIC
Figure 142: The three main timing specifications for a sequential cell, such as a flip-flop (repeated from Figure 270).

 
 
 

PIC
Figure 143: An example circuit with static timing annotations for maximum GBA analysis.

PIC
Figure 144: CMOS chip structure, highlighting layers most affected by FEOL and BEOL variations.

 
 
 

Table 19: Examples of process corners.
String Meaning
ss_0p9v_m40c Slow P and N channel transistors at 0.9 V and -40°C
tt_1p0v_25c Typical P and N channel transistors at 1.0 V and room temperature
ff_1p1v_125c Fast P and N channel transistors at 1.1 V and 125°C
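
Corner strings of this shape can be decoded mechanically. The sketch below assumes the usual reading of the fields: the process letters first, then the voltage with 'p' standing for the decimal point, then the temperature with a leading 'm' meaning minus. The Corner type and parse_corner function are hypothetical names, and real libraries may use other field layouts.

```cpp
#include <cassert>
#include <cmath>
#include <sstream>
#include <string>

struct Corner {
    std::string process;   // e.g. "ss", "tt" or "ff"
    double volts;          // supply voltage
    int celsius;           // temperature
};

// Parse a corner string shaped like "ss_0p9v_m40c".
Corner parse_corner(const std::string& s) {
    std::istringstream in(s);
    std::string process, v, t;
    std::getline(in, process, '_');
    std::getline(in, v, '_');
    std::getline(in, t, '_');
    assert(!process.empty() && v.size() >= 2 && t.size() >= 2);

    // Voltage field: drop the trailing 'v', then 'p' becomes the decimal point.
    v.pop_back();
    for (char& ch : v) if (ch == 'p') ch = '.';

    // Temperature field: drop the trailing 'c'; a leading 'm' means minus.
    t.pop_back();
    bool negative = (t[0] == 'm');
    if (negative) t.erase(0, 1);

    Corner c;
    c.process = process;
    c.volts = std::stod(v);
    c.celsius = (negative ? -1 : 1) * std::stoi(t);
    return c;
}
```

With this reading, "ss_0p9v_m40c" decodes to slow transistors at 0.9 V and -40°C, and "ff_1p1v_125c" to fast transistors at 1.1 V and 125°C.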

 
 
 

Table 20: Examples of BEOL corners.
BEOL corner | Meaning
Cmin or Cbest | Narrow wires with wide spacing for the smallest capacitance component
RCmin or RCbest | Thick wires with less resistance to minimise the RC product and net delay
Typical | Wires and vias meet the target dimensions
RCmax or RCworst | Thin wires with more resistance to maximise the RC product and net delay
Cmax or Cworst | Wide wires with narrow spacing for the largest capacitance component

 
 
 
 
 
 

PIC
Figure 145: Troublesome early and late path configurations for maximum (left) and minimum (right) timing. Late path components are shaded dark.

 
 
 

PIC
Figure 146: A logic structure with a gated clock that could suffer in an FS process corner (left) and a combinational logic circuit with two paths that differ in their levels of inversion (right).

 
 
 

# ---- Create Clocks ---- 
create_clock -add -period $clock_period -name VCLK 
 
foreach clock_name $clock_list { 
  create_clock -add -period $clock_period [get_ports $clock_name] -name $clock_name 
  set_clock_latency $clock_latency [get_clocks $clock_name] 
} 
 
set_clock_uncertainty [expr $setup_margin + $clock_jitter] -setup [all_clocks] 
set_clock_uncertainty [expr $hold_margin]                  -hold  [all_clocks] 
 
set_driving_cell -lib_cell $clock_driving_cell \ 
                 -input_transition_rise $max_clock_transition \ 
                 -input_transition_fall $max_clock_transition \ 
                 [get_ports $clock_list] 
 
# ---- I/O timing constraints ---- 
set_input_delay $max_input_constraint -max -clock VCLK \ 
             [remove_from_collection [all_inputs] $clock_list] 
set_input_delay $min_input_constraint -min -clock VCLK \ 
             [remove_from_collection [all_inputs] $clock_list] 
 
set_output_delay $max_output_constraint -max -clock VCLK [all_outputs] 
set_output_delay $min_output_constraint -min -clock VCLK [all_outputs] 
 
# ---- Path groups ---- 
group_path -name reg2reg -from [all_registers] -to [all_registers] 
 
# ---- Timing exceptions ---- 
set_multicycle_path 2 -setup -end -from [get_ports DFT*] 
set_multicycle_path 1 -hold  -end -from [get_ports DFT*] 
Figure 147: An example of an SDC file. TCL commands are used to specify and constrain the STA behaviour.

 
 
 

 
 
 

PIC
Figure 148: Global and localised views of process distributions.

 
 
 

PIC
Figure 149: LVF path delay distribution and sigma multiplier.

 
 
 

PIC
Figure 150: Statistical VT shift due to bias temperature instability for various bias times.

 
 
 
 
 
 

PIC
Figure 151: Comparing simulation (left) with a formal proof of correctness (right).

 
 
 

   assert(x<4); 
   x = x + 1000; 
   assert(x<1004);
Figure 152: Examples of imperative or procedural safety assertions in a software language such as C++. These conditions must hold when reached by the program’s flow of control.

 
 
 

PIC
Figure 153: FSM transition diagram with liveness, fairness and deadlock indications.

 
 
 

PIC
Figure 154: Two simple PSL assertions.

 
 
 

Table 21: The three principal regular expression operators and concise derived shorthands.
Syntax Fundamental Description
{A;B} Core Semicolon denotes sequence concatenation
{A[*]} Core A postfix asterisk denotes arbitrary repetition
{A|B} Core Vertical bar (stile) denotes alternation
{A[+]} Derived One or more occurrences of A
{A[*n]} Derived Repeat n times
{A[=n]} Derived Repeat n times non-consecutively
{A[->n]} Derived As =n but ending on the last occurrence
{A:B} Derived Fusion concatenation (last of A occurs during first of B)

 
 
 

Table 22: Summary of the main SERES temporal conjunction and sequencing dyadic operators.
Operator | Syntax | Description
Simple conjunction | A & B | A and B finish matching at once
Length-matching conjunction | A && B | A and B occur at once with common duration (length matching)
Simple conjunction | A within B | A occurred at some point during B
Strong positive sequencing | A until B | A held at all times until B started
Weak positive sequencing | A before B | A held before B held
Sequence implication | A |=> B | Whenever A finishes, B immediately starts
Fusion implication | A |-> B | The same, but with the last event of A coincident with the first of B

 
 
 

Table 23: Some built-in primitive macros in PSL.
Macro function Description
rose(X) X changed from zero to one
fell(X) X changed from one to zero
stable(X) X did not change
changed(X) X did change
onehot(X) X is a power of 2
onehot0(X) X is zero or a power of 2
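
The meaning of these macros can be sketched in C++ over a pair of successive samples of a signal. This is an illustration of the descriptions in the table, not the PSL definition; the function signatures (explicit previous and current values, 32-bit vectors) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// 'prev' is the value sampled on the previous clock edge, 'now' the current one.
bool rose(bool prev, bool now) { return !prev && now; }   // zero to one
bool fell(bool prev, bool now) { return prev && !now; }   // one to zero
bool stable(std::uint32_t prev, std::uint32_t now)  { return prev == now; }
bool changed(std::uint32_t prev, std::uint32_t now) { return prev != now; }

// True when exactly one bit is set, that is, x is a power of 2.
// Clearing the lowest set bit with x & (x - 1) must leave zero.
bool onehot(std::uint32_t x) { return x != 0 && (x & (x - 1)) == 0; }

// True when at most one bit is set: zero or a power of 2.
bool onehot0(std::uint32_t x) { return (x & (x - 1)) == 0; }
```

For instance, onehot(8) and onehot0(0) hold, whereas onehot(0) and onehot(6) do not.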

 
 
 

PIC
Figure 155: Dynamic validation: Monitoring bus operation with an RTL checker (top) or interface PCs (bottom).

 
 
 

PIC
Figure 156: Framed standard synchronous connection, with interface checker instance (left) and allowable protocol transitions (right). Only allowed qualified edges are shown. Any other qualified transitions are protocol errors.

 
 
 

module framed_standard_sync_monitor( 
               input reset, 
               input clk,           // Clock input. ALL CONNECTIONS ARE INPUTS! 
               input req,           // Request signal 
               input rdy,           // Ready signal, for the reverse direction 
               input [7:0] data,    // Data bus 
               input dfirst,        // First word of packet indicator 
               input dlast);        // Last word indicator 
  bit q1; integer error_flag; 
  always @(posedge clk) 
    if (reset) q1 = 0; 
    else begin 
    error_flag = 0; 
    if (req && rdy && !q1) begin 
      if (dfirst && !dlast) q1 = 1; // Frame start 
      else if (dlast && !dfirst) begin 
        $display("%m: %1t: C2: End outside of frame.", $time);  error_flag = 2; 
        end 
      else if (!dlast && !dfirst) begin 
        $display("%m: %1t: C3: Byte outside a frame.", $time); error_flag = 3; 
        end 
    end 
    else if (req && rdy && q1) begin 
       if (!dfirst && dlast) q1 = 0; // Frame end 
       else if (dfirst && dlast) begin 
         $display("%m: %1t: C1b: One-word frame during existing frame.", $time); error_flag = 1; 
         end 
      else if (!dlast && dfirst) begin 
         $display("%m: %1t: C1a: Frame start during existing frame.", $time); error_flag = 1; 
         end 
      end 
   end 
endmodule
Figure 157: Example of a PC for the framed interface implemented using an RTL state machine.

 
 
 

  wire en = req && rdy; 
 
  // The transition from Q0 -> Q1 -> ... -> Q1 -> Q0: 
  sva_transaction: assert property (@(posedge clk) 
    ( (en && dfirst && !dlast) |=> (!en || (!dfirst && !dlast))[*0:$] ##0 (en && !dfirst && dlast) ) ); 
 
  // Forbid any exit from Q0 except with dfirst: 
  good_Q0: assert property (@(posedge clk) 
    ( (en && dlast) || reset |=> (!(en && dfirst))[*0:$] ##0 (en && dfirst) ) ); 
Figure 158: Comparable example of the PC implemented with SVA. The ##0 form is an idiomatic marker that merely separates successive SERES components.

 
 
 

PIC
Figure 159: A mitre compares the outputs from a pair of supposedly equivalent combinational components.

 
 
 

PIC
Figure 160: A two-bit shift register (left) with a conventional design. By using a clock-enabled flip-flop (centre), an alternative implementation is possible (right). The state encoding is totally different, but the observable black-box behaviour is identical.

 
 
 

PIC
Figure 161: An X-propagation mitre around two instances of a simple DUT. The equivalence comparison is modified to ignore mismatches on the data bus if the valid output from the upper instance does not hold. A counterexample, as found by a model checker, is denoted in red. The alternative values that would be encountered in a simulation are in blue.

 
 
 

PIC
Figure 162: An alternative DUT. The circuit produces a false negative under X-propagation simulation. The blue annotations show CRV simulation values, whereas model checker values, in red, correctly indicate that the DUT has passed the test.

 
 
 

PIC
Figure 163: A formal glue shim around a data path component (a LIFO stack). This enables symbolic verification using a basic model checker. All inputs on the left are unconstrained, as is the pop input on the right. No stimulus pattern should make the FAIL output hold.

 
 
 

PIC
Figure 164: A pulse generator: schematic symbol and timing waveforms.

 
 
 

PIC
Figure 165: Four of six permutations that implement a simplex protocol adaptor. Each has the same signature of one input and two output ports of 32 and 8 bits, respectively. More typically, duplex implementations are required.

 
 
 

PIC
Figure 166: Comparison of speed-of-light (LC) and RC-based Elmore delay models for propagation along a net. Also shown is a real-world design rule for 16-nm geometry.
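
The contrast between the two delay models can be reproduced numerically. The sketch below is a rough illustration, not the book's calculation: it treats the net as a uniform RC ladder and sums, for each segment, the upstream resistance times the segment capacitance (the Elmore delay), which grows with the square of the length, whereas the LC (time-of-flight) delay grows linearly. The propagation velocity of half the speed of light and all parameter names are assumptions.

```python
def elmore_delay(r_per_mm, c_per_mm, length_mm, segments=1000):
    """Elmore delay (seconds) of a uniform wire modelled as an RC ladder."""
    r_seg = r_per_mm * length_mm / segments
    c_seg = c_per_mm * length_mm / segments
    upstream_r = 0.0
    delay = 0.0
    for _ in range(segments):
        upstream_r += r_seg            # Resistance between the driver and this node
        delay += upstream_r * c_seg    # Each node capacitor charges through it
    return delay

def lc_delay(length_mm, velocity_mm_per_s=1.5e11):
    """Time-of-flight delay, assuming signals travel at about half of c."""
    return length_mm / velocity_mm_per_s
```

Doubling the length of the net quadruples the Elmore delay but only doubles the LC delay, which is why long nets are broken up with repeaters or pipeline stages.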

 
 
 

PIC
Figure 167: Timing diagram for an asynchronous four-phase handshake.

 
 
 

PIC
Figure 168: Timing diagram for AHB bus write cycle with one wait state.

 
 
 

PIC
Figure 169: Timing diagrams for the synchronous LocalLink protocol. Left: Back-to-back transfer of words because the destination is always ready. Right: Pauses to match the acceptance rate of the destination. Note that all control signals are active low (denoted with the _n RTL suffix) in LocalLink.

 
 
 

PIC
Figure 170: BVCI core nets. Arrows indicate signal directions on the initiator. All of these are reversed for the target.

 
 
 

PIC
Figure 171: BVCI protocol: Command phase timing diagram.

 
 
 

PIC
Figure 172: BVCI protocol: Response phase timing diagram. Operations are qualified with the conjunction of req and ack. Response and acknowledge cycles maintain their respective ordering. Bursts are common. Successive addressing may be implied.

 
 
 

PIC
Figure 173: Out-of-order bus formed from a pair of in-order busses with added tags (blue).

 
 
 

PIC
Figure 174: Augmenting the tag width through successive multiplexing stages.
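
One way to picture the tag widening is as bit concatenation at each multiplexing stage. In the sketch below, which is an assumption about a typical implementation rather than a description of the figure, each stage places its ingress port number above the incoming tag on the forward path and strips it again on the return path to steer the response. The function names are hypothetical.

```python
def add_port_tag(tag, tag_width, port, port_bits):
    """Forward path: place the ingress port number above the existing tag."""
    return (port << tag_width) | tag, tag_width + port_bits

def strip_port_tag(tag, tag_width, port_bits):
    """Return path: recover the ingress port and the narrower original tag."""
    inner_width = tag_width - port_bits
    port = tag >> inner_width
    return port, tag & ((1 << inner_width) - 1), inner_width
```

A 3-bit initiator tag passing through a four-way and then a two-way multiplexor becomes a 6-bit tag at the target, and the response retraces the same ports in reverse order.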

 
 
 

PIC
Figure 175: The baseline port structure for an A32D32 AXI standard initiator showing the five temporally floating channels, two for reads and three for writes. Parity bits may be present on any of the busses. Writing an address and writing data operate close to lockstep for single-word writes, but are split to support multi-word bursts. Signal directions are reversed for a target.

 
 
 

Table 24: Major predefined AMBA AXI bus standards and profiles. Within each profile, there can be variations in address, data, tag and other bus widths per instance.
Profile        Channels          Other nets             Description
AXI3           AR+R, AW+W+B      Tag ID, WLanes         Bursts 1–16 beats
AXI4           AR+R, AW+W+B      Tag ID, WLanes, QoS    Bursts 1–256 beats
AXI4-Lite      AR+R, AW+W+B      -                      No burst transfers. No byte lanes
AXI4-Stream    W                 -                      Simplex. No addressing. Unrestricted length
AXI ACE        All of AXI4       AC+CR+CD               Cache coherency extensions
ACE5-Lite      All of AXI4       AC+CR+CD               Single beat. Out-of-order responses

 
 
 
 
 
 

PIC
Figure 176: A basic SoC bus structure for the MSOC1 protocol. One initiator addresses three targets (high-level view and detailed wiring).

 
 
 

PIC
Figure 177: Example where one of the targets is also an initiator (e.g. a DMA controller).

 
 
 

PIC
Figure 178: A system design using three bridged busses. Each main bus has its own primary initiator (pink), which is typically a CPU, but the bus bridges (green) also initiate transactions.

 
 
 

PIC
Figure 179: Bidirectional bus bridge for the MSOC1 protocol. It has a pair of back-to-back simplex bridges. It could be a single IP block on a single chip, or the two halves could be on different chips with a SERDES (§??) serial link between them.

 
 
 

PIC
Figure 180: Crossbar 4×4 connectivity implemented in three different ways: with a high-capacity TDM bus (left), crossbar elements (centre) and multiplexors (right).

 
 
 

PIC
Figure 181: Multiphase (split transaction) bus demultiplexor and remultiplexor for responses (left). Three of these structures could be used to implement the 2×3 bus fabric hub (right).

 
 
 

PIC
Figure 182: Two folded busses that have similar component counts. For each, the number of switching elements grows linearly with the number of end points (EP). The linear form (left) is suitable for multi-chip use, but the tree form (right) has lower latency. The receive half of the tree form often has a tree structure running exactly parallel to the transmit half, but here it is shown flattened to almost a bus.

 
 
 

PIC
Figure 183: A radix-3 switching element for a NoC using a broadcast bus at 4× the link bandwidth and credit-based flow control.

 
 
 

PIC
Figure 184: Example of a NoC fabric using radix-3 switching in a unidirectional torus mesh.

 
 
 

PIC

PIC
PIC
Figure 185: Illustration of cut-through routing on a 2-D mesh NoC (left), wormhole operation (centre) and potentially deadlocking routes (right). The thin black lines are the NoC links, which are bidirectional. The thick coloured lines are the routes chosen for a packet. The white marks indicate the end of a packet.

 
 
 
 
 
 

PIC
Figure 186: Four elements in a 2-D mesh. Two transactions, red and green, are in a fabric deadlock, since unconstrained cut-through switching with wormhole routing has allowed each to block the other. The dashed parts cannot complete.

 
 
 

PIC
Figure 187: The eight possible 2-D turns (left), the allowable 2-D turns with Y-before-X (north–south first) routing (centre) and an example set of turns used by two transactions that obey the north–south first global policy but still manage to deadlock owing to interference between requests and responses (right). The turns that violate the global policy are marked with an exclamation mark. BLA: bottom left anticlockwise; BRA: bottom right anticlockwise; TRA: top right anticlockwise; TLA: top left anticlockwise; BLC: bottom left clockwise; BRC: bottom right clockwise; TRC: top right clockwise; TLC: top left clockwise.
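
Y-before-X routing is simple to state as code. The sketch below is a minimal model on an (x, y) grid with hypothetical function names: a packet first travels along the Y axis until it reaches the destination row and only then moves along X, so any route makes at most one turn. As the caption notes, such a policy removes turn cycles within one class of traffic but does not by itself prevent deadlock between requests and responses.

```python
def y_before_x_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst, moving in Y first."""
    x, y = src
    path = [(x, y)]
    while y != dst[1]:                 # North-south leg comes first
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    while x != dst[0]:                 # East-west leg follows
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    return path

def count_turns(path):
    """Count the changes of direction along a path."""
    dirs = [(bx - ax, by - ay) for (ax, ay), (bx, by) in zip(path, path[1:])]
    return sum(1 for d0, d1 in zip(dirs, dirs[1:]) if d0 != d1)
```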

 
 
 

Profiles: 
    t0: { src: u_M0, type: readRequest, avg: 10, peak: 100, req_beats: 1, 
          resp_beats: 4, qos: 0, lc: false, dst: u_S0 } 
    t1: { src: u_M0, type: writeRequest, avg: 10, peak: 94.3, req_beats: 4, 
          resp_beats: 1, qos: 0, lc: false, dst: u_S0 } 
Dependencies: 
    # Receipt of readRequest at u_S0, triggers a transaction at u_M0 
    d0: { from: u_S0.readRequest, to: u_M0.readRequest } 
Figure 188: Sample behavioural traffic specification containing a load profile and a transaction phase dependency.

 
 
 

PIC
Figure 189: One possible structure using link-level credit-based flow control, showing tolerance to pipeline stages in the interconnect nets. The forward and return paths need not be matched in delay terms.
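
The tolerance to pipeline delay can be checked with a small cycle-based simulation. The sketch below is an illustrative model, not the circuit in the figure: the sender starts with one credit per receiver buffer slot, spends a credit for each word sent and regains one when the receiver drains a word. The forward and return delays are independent parameters (each assumed to be at least one cycle), and all names are hypothetical.

```python
from collections import deque

def simulate_credit_link(buffer_slots, fwd_delay, ret_delay, drain_every, cycles):
    """Simulate a sender, a pipelined link and a receiver FIFO.

    Returns (words_delivered, peak_fifo_occupancy).
    """
    credits = buffer_slots                 # One initial credit per receiver slot
    fwd = deque([0] * fwd_delay)           # Words in flight towards the receiver
    ret = deque([0] * ret_delay)           # Credits in flight back to the sender
    fifo = 0
    delivered = 0
    peak = 0
    for t in range(cycles):
        fifo += fwd.popleft()              # A word may arrive at the receiver
        credits += ret.popleft()           # A credit may arrive at the sender
        peak = max(peak, fifo)
        freed = 0
        if fifo > 0 and t % drain_every == 0:
            fifo -= 1                      # Receiver consumes one word
            delivered += 1
            freed = 1                      # and returns one credit
        ret.append(freed)
        if credits > 0:                    # Sender transmits only when it holds a credit
            credits -= 1
            fwd.append(1)
        else:
            fwd.append(0)
    return delivered, peak
```

Whatever the two delays are, the receiver FIFO never holds more than buffer_slots words, because every word in flight or in the FIFO is backed by a credit that the sender has already spent.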

 
 
 

PIC
Figure 190: Two end points interconnected via some number of switching elements. The AMBA 5 CHI protocol layers are shown.

 
 
 

PIC

PIC
Figure 191: A minimal application of the AMBA 5 CHI specification between a requester and a completer (left) and full net-level details of the six channels found on a more-typical request node (RN) (right).

 
 
 

Table 25: Some (simplified) synthetic traffic generation vectors and their descriptions.
No.  Name                                                                Description
1.   Rate: open loop; Flows: all to all; Length: 8; Spacing: Regular     Average rate injection from all ingress ports to all egress ports of 8 byte payloads, with no burstiness.
2.   Rate: open loop; Flows: all to one; Length: 8; Spacing: Regular     Average rate injection from all ingress ports to one egress port, with no burstiness.
3.   Rate: saturated; Flows: all to all; Length: 8; Spacing: Regular     Injection at peak capacity from all ingress ports to all egress ports, with no burstiness.
4.   Rate: open loop; Flows: all to all; Length: 8; Spacing: Random      Average injection rate with random delays between injections, from all ingress ports to all egress ports.
5.   Rate: open loop; Flows: all to all; Length: Variable; Spacing: Regular   Average injection rate from all ingress ports to all egress ports, with variable length packets.
6.   Rate: closed loop; Flows: all to all; Length: 32; Spacing: Regular  Ingress port only generates a new message after previous response. All packets are long (32 bytes).

 
 
 

PIC
Figure 192: A ring network. This is a low-complexity NoC structure.

 
 
 

PIC
Figure 193: Regular cubic structures with 1, 2, 3 or 4 dimensions. 5-D and above are hard to draw.

 
 
 

PIC
Figure 194: The butterfly (or shuffle) network topology (left) and a flattened version (right), with some example paths highlighted for comparison. The vertical links in the flattened form are bidirectional.

 
 
 

PIC
Figure 195: Example of an 8×8 switching fabric that uses radix-2 elements and the delta wiring pattern. Interchanging the initiators and targets is equally valid.

 
 
 

PIC
Figure 196: A one-place pipelined FIFO stage using credit-based flow control. A multi-place FIFO stage would replace the synchronous S/R flops used for the initial credit and backlogged state with counters, which result in better density than cascading one-place stages. (The vertical line on the S input denotes that setting has priority over resetting when both are asserted.)

 
 
 

PIC

PIC

PIC

Figure 197: Metastability illustrated by a pencil balancing on a razor’s edge (left). The essential structure of a transparent bistable (latch) and a transfer function (centre). The gate signal and measured responses from a transparent latch with an input wired to a voltage source close to the metastable point (right).

 
 
 

PIC
Figure 198: Generic structure of a simplex CBRI. Parallel data are reliably sent between CDs.

 
 
 

PIC
Figure 199: Timing diagram (top) and a basic structure for two systems that use harmonically locked clocks (bottom). Relative edge timings that require tight phase control are highlighted.

 
 
 

PIC
Figure 200: Net-level view of the AMBA P-channel interface for device power control.

 
 
 

PIC
Figure 201: Main components of an 8b10b, block-coded SERDES (serialiser/deserialiser) transceiver (transmitter/receiver) operating over a twisted-pair channel.

 
 
 

Table 26: Four quadrants of a computer


Control unit       Execution unit
Primary storage    I/O devices



 
 
 

PIC
Figure 202: Transactional-level model (TLM) of the simplest computer (left) and code fragments (right).

 
 
 

PIC
Figure 203: Schematic symbol and external connections for a tri-state version of a simple microprocessor.

 
 
 

Table 27: Net-level connections
Connection Direction Use
data[7:0] I/O Bidirectional data bus
addr[15:0] Output Selection of internal address; not all 16 bits are used
hren Output Asserted during a data read from the target to the host
hwen Output Asserted during a write of data from the host to the target
ack Input Asserted when the addressed device has completed its operation

 
 
 

PIC
Figure 204: A simple A16D8 microcomputer structure. A microprocessor initiates all transactions on a bidirectional/tri-state data bus, which is connected to all other components.

 
 
 

PIC

Figure 205: Connections to memory

 
 
 

Table 28: Memory map.
Start End Resource
0000 03FF ROM (1 kbytes)
0400 3FFF Unused images of ROM
4000 7FFF RAM (16 kbytes)
8000 BFFF Unused
C000 C007 Registers (8) in the UART
C008 FFFF Unused images of the UART
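
The 'unused images' rows are what results from partial address decoding. The sketch below is one decoder that would produce exactly this map, assuming (the table does not say so) that only addr[15:14] is used to select a device and that each device ignores the address bits above its own size. The function name is hypothetical.

```python
def decode(addr):
    """Map a 16-bit address to (resource, offset), or (None, None) if unused."""
    top = (addr >> 14) & 0x3          # addr[15:14] selects one 16-kbyte quadrant
    if top == 0:
        return "ROM", addr & 0x03FF   # 1 kbyte of ROM: 10 offset bits
    if top == 1:
        return "RAM", addr & 0x3FFF   # 16 kbytes of RAM: 14 offset bits
    if top == 3:
        return "UART", addr & 0x0007  # 8 UART registers: 3 offset bits
    return None, None                 # 8000 to BFFF is not decoded
```

With this decoder, address 0400 reads the same ROM location as 0000, and C008 reaches the same UART register as C000.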

 
 
 

PIC
Figure 206: MSOC1 reference bus protocol, with read and write examples.

 
 
 

Table 29: Net-level connections of an MSOC1 port in an A32D32 system.
Connection Direction Use
addr[31:0] Input Selection of internal address; not all 32 bits are used
hwen Input Asserted during a write from the host to the target
hren Input Asserted during a read from the target to the host
wdata[31:0] Input Data to a target when writing or storing
rdata[31:0] Output Data read from a target when reading or loading
interrupt Output Asserted by target when needing attention

 
 
 
 
 
 

PIC
Figure 207: Structure of a typical microcontroller, a single-chip microcomputer.

 
 
 

PIC
Figure 208: Hitachi HD614080 microcontroller chip from 1980. Such devices were often in very large, dual in-line (DIL) packages to make a large number of GPIO pins available.

 
 
 

PIC
Figure 209: Contact plate for a smart card. The card reader supplies VCC power, clock and reset. Inputs and outputs are then made via the 1-bit bidirectional data pin.

 
 
 

PIC
Figure 210: An invertor viewed at various levels of abstraction.

 
 
 

PIC
Figure 211: Front-end flow in SoC design.

 
 
 

PIC
Figure 212: Main PCB of an ADSL home modem.

 
 
 

PIC
Figure 213: Main PCB structure of an ADSL home modem.

 
 
 

PIC
Figure 214: Overall design and manufacturing flow for a SoC.

 
 
 
 
 
 

PIC
Figure 215: General internal view of a mobile phone (left) and views of both sides of the main circuit board (centre and right). Highlighted in red are the main SoC, which contains several Arm processors, and a multi-chip module containing several memory chips (a proprietary mix of DRAM, SRAM and flash).

 
 
 

PIC
Figure 216: An Apple SoC with two Arm and three GPU cores. It was made by arch-rival Samsung.

 
 
 

PIC
Figure 217: A platform chip example: the Virata Helium 210, as used in many ADSL modems.

 
 
 

PIC
Figure 218: Helium chip as part of a home gateway ADSL modem (partially masked by the 802.11 module).

 
 
 

PIC
Figure 219: Another platform chip intended for use in thin-client display devices, such as a tablet.

 
 
 
 
 
 

PIC
Figure 220: Moore’s law (§??).

 
 
 

PIC
Figure 221: TLM connection diagrams for CPU and memory configurations: (a) direct connection, (b) with a cache, (c) with a split cache and (d) two cores with a shared L2 cache. The fan-in to the TLM socket denotes that the bandwidth is shared. In reality, debug and management ports also need to be connected.

 
 
 

PIC
Figure 222: TLM diagrams for a CPU core with integrated first-level caches and external coprocessor connections.

 
 
 

PIC
Figure 223: Main data paths in a generic five-stage RISC microarchitecture, excluding the back sides of the cache and MMU. Instruction field bit-extracts to control the multiplexors and ALU are not shown.

 
 
 

PIC
Figure 224: Key components of a super-scalar CPU core, excluding the back sides of the cache and MMU.

 
 
 

PIC
Figure 225: TLM connection diagrams for a CPU and memory management unit (MMU): (a) naive view and (b) one possible cache arrangement.

 
 
 

PIC
Figure 226: A generalised PRAM model of computing. A number of synchronous processors, each with some private local store, make random-access read and write operations on shared memory that has many atomic locations of some size. The interconnect implements some variant of coherence (value and sequential consistency).

 
 
 

PIC
Figure 227: Two possible configurations for scratchpad memory.

 
 
 

PIC
Figure 228: Typical arrangement of virtual and physical mapping with L1, TLB and L2. Data are served from L1 or L2, but are unlikely to be served directly from primary storage. Instead, update mechanisms (not shown) are used. Four-way associativity is illustrated, but a higher degree is typically used, to increase both the hit ratio and capacity, since it is generally required that the directly mapped axis is not wider than the virtual memory page size.

 
 
 

PIC
Figure 229: An example system using a snoop control IP block from Arm (Cortex A9 family).

 
 
 

PIC
Figure 230: MOESI-like state diagram used in the Arm ACE and CHI protocols.

 
 
 

PIC
Figure 231: Three I/O blocks connected to a CPU, memory and an interrupt controller.
PIC
Figure 232: Three I/O blocks with flexible interrupt distribution in a multi-core system.

 
 
 

PIC
Figure 233: Interrupt generation: general structure within a device and at system level.

 
 
 

Table 30: Principal characteristics of memory technologies currently used for booting, caches, primary storage and secondary storage.
Memory Volatile Main applications Implementation
ROM No Booting, coefficients Content set by a tapeout mask
SRAM Yes Caches, scratchpads, FIFO buffers One bistable (invertor pair) per bit
DRAM Yes Primary storage Capacitor charge storage
EA-ROM No Secondary storage Floating-gate FET charge storage
Memristive No Next generation Electrically induced resistance changes

 
 
 

Figure 234: Floorplan visualisation of a 3D implementation of the Arm Neoverse N1 design using two chips, vertically stacked. The cache memories are placed on the lower piece of silicon, shown in monochrome, whereas the logic for the ALUs, register files and everything else is shown in colour on top. The 1MB L2 label is the L2 logic and not L2 memory arrays.
PIC

 
 
 

PIC
Figure 235: Low-aspect-ratio RAM array showing the shared peripheral circuitry, which includes the word-line decoder, sense amplifiers, tri-state buffers and I/O multiplexor. Externally it offers $2^N$ words of M bits where $N=\log_2(R)+\log_2(C)$. Internally it uses R rows each with (M×C)-bit cells.
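
The relationship between the external address width and the internal geometry can be illustrated by splitting an address into its two parts. In the sketch below, the choice of the low-order bits for the column (I/O multiplexor) select and the high-order bits for the word-line decoder is an assumption, as is the function name; R and C are taken to be powers of two.

```python
def split_address(addr, rows, cols):
    """Split an N-bit word address into (row, column) for an R-by-C array."""
    row_bits = rows.bit_length() - 1   # log2(R) for a power-of-two R
    col_bits = cols.bit_length() - 1   # log2(C) for a power-of-two C
    if addr >> (row_bits + col_bits):
        raise ValueError("address out of range")
    return addr >> col_bits, addr & (cols - 1)
```

For R=256 and C=4, the external address has N=10 bits, giving 1024 words from only 256 physical rows.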

 
 
 
 
 
 

PIC
Figure 236: Mask-programmed NMOS ROM structure with a capacity of $2^N$ locations, each holding an M-bit word.

 
 
 

PIC
Figure 237: Transistor-level view of a standard six-transistor (6T) SRAM cell. M1, M2, M3 and M4 are used for storage. M5 and M6 are used to access the cell for read and write operations.

 
 
 
 
 
 
 
 
 
 
 
 

PIC
Figure 238: SSRAM with a single port, showing the logic symbol (left) and internal RTL model (right).

 
 
 
 
 
 

PIC
Figure 239: 8T SRAM cell with true dual-porting.

 
 
 

PIC
Figure 240: Dual-ported SSRAM logic symbol.

 
 
 

PIC
Figure 241: Micrograph of a simple DRAM chip circa 1994. This is a Micron Technology MT4C1024 device with organisation 1024x1024x1.

 
 
 

PIC

PIC

Figure 242: Left: pinout for a typical asynchronous DRAM, such as the MT4C1024 pictured above, that is controlled by the RAS and CAS signals instead of having a clock input. Right: basic internal structure schematic. This device only has a single bank so there are no bank address inputs.

 
 
 
 
 
 

Table 31: DRAM address terminology and hierarchy with typical sizes.
Quantity Aggregate capacity Description
1 channel 16 GB A physical bus: 64 data bits, 3 bank bits and 14 address bits
4 DIMMs 16 GB Multiple DIMMs are connected on the PCB to one channel
1 rank 4 GB A number of logical DIMMs within a physical DIMM
16 chips 16×0.5 = 4 GB This DIMM uses 16 4-bit chips making a 64-bit word
Lanes/chip 4 bit lanes = 1 GB Each chip serves a word 4 bits wide
8 banks $2^{14+12} \times 8 = 0.5$ Gbit Each bank has its own bit-cell arrays (simultaneously open)
$2^{12}$ rows 64 Mbit A page or row is one row of bit cells in an array
(Burst) 8 words = 64 bytes The unit of transfer over the channel
$2^{14}$ columns 16 kbit The data read/write line to a bit cell

 
 
 

PIC
Figure 243: 4-Gbyte DRAM dual in-line memory module (DIMM) for a laptop computer. Eight chips are mounted on a small PCB, four on each side. A label on the back says ‘1600 11-11-11’. More detailed information, including the supply voltage, is stored electronically in a small serial presence detect (SPD) ROM in the centre of the DIMM.

 
 
 
 
 
 

PIC
Figure 244: Internal block diagram of a 4-bit-wide DRAM device. There are four banks, although eight are normal. (The data strobe (DS) signal is omitted for clarity.)

 
 
 

Table 32: Typical DIMM connections.
Clock± Clock (400 MHz)
RAS- Row address strobe
CAS- Column address strobe
WE- Write enable
DQ[63:0] Data in/out
Reset Power-on reset
WQ[7:0] Write-lane qualifiers

DS[7:0] Data strobes
DM[7:0] Data masks
CS- Chip select
MAddr[15:0] Address input
BK[2:0] Bank select
spd[3:0] Serial presence detect

 
 
 

PIC
Figure 245: NOR ROM block diagram (left) with capacity $2^{R+C}$ bits and alternative NAND ROM array detail (right). For a ROM more than 1 bit wide, the Dout connection is a bus of width M and the low address bus width is reduced by $\log_2(M)$ bits.

 
 
 

PIC
Figure 246: EPROM device from the 1980s. The silicon die is visible through the top window.

 
 
 

PIC
Figure 247: Schematic of a possible bit cell for EPROM and EEPROM. There is a kink on the top of the extra gate of this floating-gate transistor to denote that tunnelling is possible.

 
 
 
 
 
 

Figure 248: Typical increase in benchmark performance in response to L3 cache size enlargement for L2=64 KB (blue), L2=128 KB (yellow) and L2=256 KB (red).
PIC

 
 
 

PIC
Figure 249: Typical I/O ports. Shown are two serial ports, one parallel port, one Ethernet port, two USB ports and three audio ports.

 
 
 

PIC
Figure 250: Two typical configurations for a serial port using a UART. Inset: Serial port timing diagram. The old wiring to a D9 connector for an RS-232 port is today often replaced with a very short connection to a second UART inside a USB dongle for connecting to a laptop.

 
 
 
 
 
 
 
 
 

PIC
Figure 251: Timing diagram for an asynchronous four-phase handshake.

 
 
 
 
 
 

PIC

// Programming model 
reg [31:0] ddr;       // Data direction reg 
reg [31:0] sfunction; // Special mode 
reg [31:0] dout;      // Output register 
reg [31:0] imask;     // Interrupt mask 
reg [31:0] ipol;      // Interrupt polarities 
reg [31:0] pins_r;    // Registered pin data 
 
reg int_enable;// Master int enable (for all bits) 
 
always @(posedge clk) begin 
   pins_r <= pins; 
   if (hwen && addr==0) ddr <= wdata; 
   if (hwen && addr==4) sfunction <= wdata; 
   if (hwen && addr==8) dout <= wdata; 
   if (hwen && addr==12) imask <= wdata; 
   if (hwen && addr==16) ipol <= wdata; 
   if (hwen && addr==20) int_enable <= wdata[0]; 
   end 
 
// Tri-state buffers (bufif1 assumed: a ddr bit of 1 enables the output driver). 
bufif1 b0(pins[0], 
          sfunction[0]?special[0]:dout[0], ddr[0]); 
... // 30 others here 
bufif1 b31(pins[31], 
           sfunction[31]?special[31]:dout[31], ddr[31]); 
 
// Generally the programmer can read all the 
// programming model registers but not here 
assign rdata = pins_r; 
 
// Interrupt masking 
wire int_pending = (|((pins_r ^ ipol)&imask)); 
assign interrupt = int_pending && int_enable;
Figure 252: Schematic and RTL implementation of 32 GPIO bits connected to an MSOC1 bus.
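
The interrupt expression in the RTL is compact, so a bit-level software model may help when writing a device driver or a test. The sketch below mirrors the two assign statements for interrupt generation: a pin contributes when its registered value differs from its polarity bit and its mask bit is set, and the master enable gates the result. The function name and the width parameter are hypothetical.

```python
def gpio_interrupt(pins_r, ipol, imask, int_enable, width=32):
    """Model of: interrupt = (|((pins_r ^ ipol) & imask)) && int_enable."""
    all_ones = (1 << width) - 1
    int_pending = ((pins_r ^ ipol) & imask & all_ones) != 0
    return int_pending and bool(int_enable)
```

Setting a bit in ipol therefore makes the corresponding pin raise an interrupt when it is low rather than high.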

 
 
 

PIC

  // RTL for one channel of a typical timer 
 
  // Programmers’ model state 
  reg int_enable, int_pending; 
 
  reg [31:0] prescaler; 
  reg [31:0] reload; 
 
  // Programmer-invisible internal state 
  reg ovf; 
  reg [31:0] counter, prescale; 
 
  // Host write operations 
  always @(posedge clk) begin 
     if (hwen && addr==0) int_enable <= wdata[0]; 
     if (hwen && addr==4) prescaler <= wdata; 
     if (hwen && addr==8) reload <= wdata; 
     // Write to addr==12 to clear the interrupt 
     end 
  wire irq_clr = hwen && addr == 12; 
 
  // Host read operations 
  assign rdata = 
    (addr==0) ? {int_pending, int_enable}: 
    (addr==4) ? prescaler: 
    (addr==8) ? reload: 0; 
 
  // A timer counts system clock cycles 
  // A counter counts transitions from an external input 
  always @(posedge clk) begin 
     ovf <= (prescale == prescaler); 
     prescale <= (ovf) ? 0: prescale+1; 
     if (ovf) counter <= counter -1; 
     if (counter == 0) begin 
         int_pending <= 1; 
         counter <= reload; 
         end 
     if (irq_clr) int_pending <= 0; 
     end 
 
  // Interrupt generation 
  assign interrupt = int_pending && int_enable;

Figure 253: Schematic symbol for a counter/timer block and internal RTL for one timer function.

 
 
 

PIC

PIC

  // State for programmers’ model 
  reg [31:0] count, src, dest; 
  reg int_enable, active; 
 
  // Other local state 
  reg [31:0] datareg; 
  reg intt, rwbar; 
 
  always @(posedge clk) begin // Target 
     if (hwen && addr==0) begin 
         { int_enable, active } <= wdata[1:0]; 
         intt <= 0; rwbar <= 1; 
         end 
     if (hwen && addr==4) count <= wdata; 
     if (hwen && addr==8) src <= wdata; 
     if (hwen && addr==12) dest <= wdata; 
     end 
   assign rdata = ...// Target readbacks 
 
   always @(posedge clk) begin // Initiator 
     if (active && rwbar && m_ack) begin 
           datareg <= m_rdata; 
           rwbar <= 0; 
           src <= src + 4; 
           end 
     if (active && !rwbar && m_ack) begin 
           rwbar <= 1; 
           dest <= dest + 4; 
           count <= count - 1; 
           end 
     if (count==1 && active && !rwbar) begin 
           active <= 0; 
           intt <= 1; 
           end 
     end 
  assign m_wdata = datareg; 
  assign m_ren = active && rwbar; 
  assign m_wen = active && !rwbar; 
  assign m_addr = (rwbar) ? src:dest; 
  assign interrupt = intt && int_enable;
Figure 254: A simple DMA controller: schematic symbol, example TLM wiring and RTL for one channel.

 
 
 

PIC
Figure 255: Net-level schematic symbol for a DMA-capable network interface IP block.

 
 
 
 
 
 

PIC

  reg [2:0] framestore[32767:0]; 
  reg [7:0] hptr, vptr; 
  output reg [2:0] rgb_video; 
  output reg hsynch, vsynch; 
 
  always @(posedge clk) begin 
       hptr <= (hsynch) ? 0: hptr + 1; 
       hsynch <= (hptr >= 230); 
       if (hsynch) vptr <= (vsynch) ? 0: vptr + 1; 
       vsynch <= (vptr == 110); 
 
       if (hwen) framestore[haddr]<= wdata[2:0]; 
       /*else*/ rgb_video <= framestore[{vptr[6:0], hptr}]; 
 
       end
Figure 256: Structure of a simple frame store, RTL implementation and generated timing waveforms.

 
 
 

PIC
Figure 257: Doorbell and mailbox block.

 
 
 

PIC
Figure 258: Example of parallel speedup. 35 units of work run across four servers, showing dependency arcs typical in the map-reduce design pattern. Arcs implicitly exist between all adjacent work unit boxes.

 
 
 

PIC
module arbiter(input clk, 
               input reset, 
               input [2:0] reqs, 
               output reg [2:0] grants); 
 
  always @(posedge clk) if (reset) grants <= 0; 
      else begin 
         grants[0] <= reqs[0]; // Highest static priority 
         grants[1] <= reqs[1] && !(reqs[0]); 
         grants[2] <= reqs[2] && !(reqs[0] || reqs[1]); 
      end
endmodule
Figure 259: A schematic of a typical arbiter (left) and the RTL implementation (right) for a three-port synchronous example using static priority with pre-emption. See also Figure 31.
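
The static-priority rule generalises to any number of requesters: a request is granted only if no lower-numbered request is active. The sketch below is a software model of the combinational part of the arbiter, with a hypothetical function name; the RTL registers the result on the next clock edge.

```python
def arbiter_next_grants(reqs):
    """Return the grant vector for a list of requests, index 0 having highest priority."""
    grants = []
    higher_active = False
    for r in reqs:
        grants.append(bool(r) and not higher_active)
        higher_active = higher_active or bool(r)
    return grants
```

Because the grants are recomputed every cycle, a newly asserted reqs[0] takes the grant away from a lower-priority requester, which is the pre-emption mentioned in the caption.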

 
 
 

PIC
Figure 260: General structure of a queue/server pair with mean customer rate λ and mean service rate μ.

 
 
 

PIC
Figure 261: Plots of the average number of customers in a system versus the utilisation ρ for three common arrival/service disciplines.

 
 
 

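Table 33: Formulae for $N_{\text{system}}$ and $N_Q$ where $N_{\text{system}} = N_Q + N_{\text{server}}$.
Discipline   $N_{\text{system}}$                   $N_Q$
M/M/1        $\frac{\rho}{1-\rho}$                 $\frac{\rho^2}{1-\rho}$
M/D/1        $\frac{\rho^2}{2(1-\rho)} + \rho$     $\frac{\rho^2}{2(1-\rho)}$
D/D/1        $N_Q + \rho$                          $\frac{1}{(1-\rho)}$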
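
The M/M/1 and M/D/1 rows can be evaluated directly. The sketch below uses the standard textbook results N_Q = ρ²/(1−ρ) for M/M/1 and N_Q = ρ²/(2(1−ρ)) for M/D/1, together with N_system = N_Q + ρ since the mean number in the server equals the utilisation ρ. The function names are hypothetical and ρ is assumed to lie in the range 0 ≤ ρ < 1.

```python
def mm1(rho):
    """Return (N_system, N_Q) for an M/M/1 queue at utilisation rho."""
    n_q = rho ** 2 / (1.0 - rho)
    return n_q + rho, n_q

def md1(rho):
    """Return (N_system, N_Q) for an M/D/1 queue at utilisation rho."""
    n_q = rho ** 2 / (2.0 * (1.0 - rho))
    return n_q + rho, n_q
```

At 50 per cent utilisation, M/M/1 holds one customer on average, whereas M/D/1 holds 0.75: deterministic service halves the mean queue length.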

 
 
 
 
 
 

PIC
Figure 262: Separate queueing for high-priority and low-priority traffic gives an expedited service.

 
 
 

PIC
Figure 263: Generic switch that includes both input and output buffering on the input and output ports, respectively, of a buffer-less switch fabric. Flow control operates between the switch fabric and the buffers in the switch.

 
 
 

PIC
Figure 264: Two scenarios that demonstrate statistical multiplexing gain for N=10 sources. Sharing the channel bandwidth (a) raises the question of whether reservations are made on a peak or average basis. If N is small, we should use peak allocation, but for large N, we can use average allocation. The law of large numbers states that N needs only to be about 50 for an average allocation to be acceptable. Buffer pools can be partitioned (b) or shared (c).

 
 
 

PIC
Figure 265: A policed queue showing both input and output regulators, although commonly only one site would be policed. Each policer has a number of rate channels (e.g. three are shown for the input site).

 
 
 

  int burst_tolerance, credit_rate; // Set up by PIO 
  int credit;                       // State variable 
  void reset() // Complete setup 
  { credit = 0; 
    register_timer_callback(crediter, credit_rate); 
  } 
  void crediter() // Called at 1/credit_rate intervals 
  { if (credit < burst_tolerance) credit += 1; 
  } 
  bool police() // Check operation currently allowed 
  { if (credit==0) return false; 
    credit -= 1; 
    return true; 
  }
Figure 266: Essence of a software implementation for one channel of a generic traffic policer or regulator.
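
The behaviour of the policer is easiest to see by exercising it. The sketch below is a Python transcription of the same three operations, with the timer callback replaced by an explicit call so that it can be run without an operating system. The class name Policer is hypothetical.

```python
class Policer:
    """One channel of a credit-based (token-bucket style) policer."""

    def __init__(self, burst_tolerance):
        self.burst_tolerance = burst_tolerance
        self.credit = 0                 # As after reset() in the listing

    def crediter(self):
        """Call once per 1/credit_rate interval; credit saturates at the tolerance."""
        if self.credit < self.burst_tolerance:
            self.credit += 1

    def police(self):
        """Return True and consume a credit if an operation is currently allowed."""
        if self.credit == 0:
            return False
        self.credit -= 1
        return True
```

After a long idle period, a channel with a burst tolerance of 3 admits three back-to-back operations and then refuses the fourth until another credit interval has elapsed.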

 
 
 

PIC
Figure 267: Abstract views of the principal axes that span the SoC design space for a task. Parallelism can be traded for clock frequency (left), which shows two design points. At a given clock frequency, the silicon area (and parallelism) can be traded for execution time using a time/space fold/unfold (right).

 
 
 

PIC
Figure 268: Thermal management of a high-power chip. The primary heat sink makes thermal contact with the chip. A heat pipe is connected to a second heat sink. There is no other connection to the second heat sink.

 
 
 

PIC
Figure 269: Generic thermal circuit, showing on/off heat source, thermal node and thermal resistance between the node and its heat sink.

 
 
 

PIC

PIC

Figure 270: Primary timing characteristics of a D-type flip-flop (left). Typical nature of a critical path in a synchronous clock domain indicating how the maximum clock frequency (F) is calculated (right).
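
As a worked instance of the calculation, the clock period must cover the clock-to-Q delay of the launching flip-flop, the combinational delay along the critical path and the setup time of the capturing flip-flop. The sketch below assumes this simple sum, with an optional allowance for clock skew; the parameter names and the example figures are illustrative and are not taken from the figure.

```python
def max_clock_hz(t_clk_to_q, t_logic, t_setup, t_skew=0.0):
    """Maximum clock frequency F for one critical path, all times in seconds."""
    return 1.0 / (t_clk_to_q + t_logic + t_setup + t_skew)
```

For example, 100 ps of clock-to-Q delay, 800 ps of logic and 100 ps of setup time give a 1 ns minimum period, so F is about 1 GHz.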

 
 
 

PIC
Figure 271: A circuit before (top) and after (bottom) insertion of an additional pipeline stage.

 
 
 

PIC
Figure 272: Flip-flop migration. Two circuits that behave identically but which have different state encodings.

 
 
 

PIC
Figure 273: Schematic and count sequence for a divide-by-five counter using Johnson encoding.
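
A divide-by-five count sequence can be generated in software to see how an odd-length twisted-ring counter behaves. The sketch below uses one common construction, which may differ from the circuit in the figure: a three-stage shift register whose input is the NOR of the last two stages. The function name is hypothetical.

```python
def johnson_div5_sequence(steps):
    """Return the first 'steps' states (q0, q1, q2) of a 3-flop divide-by-five ring."""
    q = (0, 0, 0)
    seq = []
    for _ in range(steps):
        seq.append(q)
        d0 = int(not (q[1] or q[2]))   # NOR feedback shortens the six-state ring to five
        q = (d0, q[0], q[1])           # Shift one place on each clock edge
    return seq
```

The state returns to 000 after five clock edges, so any one of the flip-flop outputs has one-fifth of the clock frequency.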

 
 
 

             Thread 1 - Requestor       |           Thread 2 - Server 
                                        | 
          ...                           |    while(true) 
          buffer[1] = operand1;         |    { 
          buffer[2] = operand2;         |      if (!buffer[0]) { yield(); continue; } 
          write_fence();                |      read_fence(); 
          buffer[0] = COMMAND;          |      handle(buffer); 
          ...                           |      buffer[0] = 0; 
                                        |    }
Figure 274: Two code fragments using message-passing in shared memory with explicit memory fences. The yield() call could be to the suspend primitive of the operating system scheduler. Alternatively, if running on bare metal, it is a dedicated instruction (called YIELD in the Arm ISA) that interacts with hardware hyper-threading if present.

 
 
 

PIC
Figure 275: SRAM memory bank structures for MCMs (top) with two interconnection patterns inside the SoC (bottom).

 
 
 

PIC
Figure 276: Pseudo-dual porting of RAM (using write mirroring).

 
 
 

PIC
Figure 277: Typical structure of a 32-bit DRAM controller connected to DRAM devices. The on-SoC controller manages a DRAM channel with one rank made of eight individual DRAM chips. Each chip has eight banks with four bit planes. Each bit plane has $2^{12} \times 2^{12}$ bits. A DRAM is typically made available to various on-SoC subsystems using a multi-way bus multiplexor.

 
 
 

PIC
Figure 278: Three possible arrangements of DRAM address fields within a physical address. In the centre arrangement (b), the bank field is lower than its naive position in the top arrangement (a). This improves load balancing over banks, but the field is not moved so low that a cache line or burst transfer is split over banks. The rank and channel fields can also be moved lower. Arrangement (c) shows a channel field that has been split and dispersed.
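
The effect of moving the bank field can be demonstrated by decoding a run of consecutive cache-line addresses under two layouts. In the sketch below, the field widths (a 6-bit burst offset, 8 column bits, 3 bank bits and 12 row bits) and the layout names are hypothetical and chosen only to illustrate arrangements like (a) and (b).

```python
def split_fields(addr, layout):
    """Decode addr using a list of (name, width) pairs, least significant field first."""
    fields = {}
    for name, width in layout:
        fields[name] = addr & ((1 << width) - 1)
        addr >>= width
    return fields

# Bank field above the column field: consecutive lines stay in one bank.
NAIVE = [("burst", 6), ("column", 8), ("bank", 3), ("row", 12)]

# Bank field just above the burst offset: consecutive lines rotate over the banks,
# but a single 64-byte burst is never split across banks.
INTERLEAVED = [("burst", 6), ("bank", 3), ("column", 8), ("row", 12)]
```

Eight consecutive 64-byte lines all fall in bank 0 under NAIVE, but touch all eight banks under INTERLEAVED, which is the load-balancing improvement described in the caption.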

 
 
 

PIC

PIC

Figure 279: (a) Lithium-ion battery for a mobile phone (3.7 V, 1650 mA h, 6 W h or 22 kJ), external view. (b) Typical internal structure. (c) Two sugar cubes.
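
The energy figures in the caption follow from the voltage and charge ratings. The sketch below shows the arithmetic: multiplying volts by ampere-hours gives watt-hours, and one watt-hour is 3600 joules. The function name is hypothetical.

```python
def battery_energy(volts, milliamp_hours):
    """Return (watt_hours, joules) for a battery of the given rating."""
    watt_hours = volts * milliamp_hours / 1000.0
    joules = watt_hours * 3600.0
    return watt_hours, joules
```

For 3.7 V and 1650 mA h this gives about 6.1 W h, or roughly 22 kJ, matching the rounded values quoted.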

 
 
 

PIC
Figure 280: Lumped-element electrical equivalent modelling of dynamic power use of a CMOS SoC. The energy drawn from the battery each clock cycle is essentially ‘wasted’ as heat in the distributed resistance of the active parts of the SoC.

 
 
 

PIC
Figure 281: A generic CMOS invertor structure shown with explicit parasitic resistances and lumped-equivalent output loading. This illustrates the primary electric paths in CMOS logic.

 
 
 

PIC
Figure 282: A tri-state data line showing driving gates, sensing buffers and a bus holder (or keeper). The bus holder has a weak output driver, denoted ‘W’, which acts like a series resistor to reduce the output current during transient bus fights (§??).

 
 
 

PIC
Figure 283: Logic net with a single source and three loads, showing tracking and input load capacitances.

 
 
 

PIC
Figure 284: Dynamic energy use for various technologies.

 
 
 

PIC

PIC

Figure 285: Fanout 4 (FO4) delay specification (left) and CMOS logic propagation delay versus supply voltage (right).

 
 
 

// spice-cmos-inverter-djg-demo.hsp 
// Updated 2017 by DJ Greaves 
// Based on demo by David Harris harrisd@leland.stanford.edu 
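// Declare global supply nets and connect them to a constant-voltage supply 
// The next line is an assumed definition: VddVoltage is not declared in this listing. 
.param VddVoltage=2.5 
.global Vdd Gnd 
Vsupply Vdd Gnd DC 'VddVoltage' 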
/////////////////////////////////////////// 
// Set up the transistor geometry by defining lambda 
.opt scale=0.35u  * Define lambda // This is half the minimum channel length. 
// Set up some typical MOSFET parameters. 
//http://www.seas.upenn.edu/~jan/spice/spice.models.html#mosis1.2um 
 
.MODEL CMOSN NMOS LEVEL=3 PHI=0.600000 TOX=2.1200E-08 XJ=0.200000U 
    +TPG=1 VTO=0.7860 DELTA=6.9670E-01 LD=1.6470E-07 KP=9.6379E-05 
    +UO=591.7 THETA=8.1220E-02 RSH=8.5450E+01 GAMMA=0.5863 
    +NSUB=2.7470E+16 NFS=1.98E+12 VMAX=1.7330E+05 ETA=4.3680E-02 
    +KAPPA=1.3960E-01 CGDO=4.0241E-10 CGSO=4.0241E-10 
    +CGBO=3.6144E-10 CJ=3.8541E-04 MJ=1.1854 CJSW=1.3940E-10 
    +MJSW=0.125195 PB=0.800000 
 
.MODEL CMOSP PMOS LEVEL=3 PHI=0.600000 TOX=2.1200E-08 XJ=0.200000U 
    +TPG=-1 VTO=-0.9056 DELTA=1.5200E+00 LD=2.2000E-08 KP=2.9352E-05 
    +UO=180.2 THETA=1.2480E-01 RSH=1.0470E+02 GAMMA=0.4863 
    +NSUB=1.8900E+16 NFS=3.46E+12 VMAX=3.7320E+05 ETA=1.6410E-01 
    +KAPPA=9.6940E+00 CGDO=5.3752E-11 CGSO=5.3752E-11 
    +CGBO=3.3650E-10 CJ=4.8447E-04 MJ=0.5027 CJSW=1.6457E-10 
    +MJSW=0.217168 PB=0.850000 
///////////////////////////////////////////// 
// Define the invertor, made of two MOSFETs as usual, using a subcircuit. 
.subckt myinv In Out N=8 P=16 // Assumes 5 lambda of diffusion on the source/drain 
m1 Out In Gnd Gnd CMOSN l=2 w=N 
+ as='5*N' ad='5*N' 
+ ps='N+10' pd='N+10' 
m2 Out In Vdd Vdd CMOSP l=2 w=P 
+ as='5*P' ad='5*P' 
+ ps='P+10' pd='P+10' 
.ends myinv 
 
////////////////////////////////////////////// 
// Top-level simulation net list 
//  One instance of my invertor and a load capacitor 
x1 In Out  myinv         // Invertor 
C1 Out Gnd 0.1pF         // Load capacitor 
////////////////////////////////////////////// 
// Stimulus: Create a waveform generator to drive In 
// Use a  "Piecewise linear source"  PWL that takes a list of time/voltage pairs. 
Vstim In Gnd PWL(0 0 1ns 0   1.05ns 'VddVoltage'   3ns 'VddVoltage'    3.2ns 0) 
////////////////////////////////////////////// 
// Invoke transient simulation (that itself will first find a steady state) 
.tran .01ns 6ns // Set the time step and total duration 
.plot TRAN v(In) v(Out) 
.end
Figure 286: SPICE description and setup for two transistors arranged as a CMOS invertor simulated with a two-step input.

 
 
 

PIC

PIC

Figure 287: Plots of the invertor when running from Vdd supplies of 2.5 V (left) and 1.5 V (right). Red is the input stimulus and blue is the simulated output.

 
 
 
 
 
 

Table 34: Design space for dynamic power-saving techniques.
            Clock                            Power
On/Off      Clock gating                     Power supply gating
Variable    Dynamic frequency scaling (DFS)  Dynamic voltage scaling (DVS)

 
 
 

Table 35: Example of static and dynamic power use for a three-level DVFS configuration. (The static current was estimated using βV^0.9, where β was chosen to give an approximate 1:3 ratio of static to dynamic power at the middle supply voltage.)
Supply voltage (V)  Clock frequency (MHz)  Static power (mW)  Dynamic power (mW)  Total power (mW)
0.8                 100                    40                 24                  64

1.35                100                    67                 68                  135
1.35                200                    67                 136                 204

1.8                 100                    90                 121                 211
1.8                 200                    90                 243                 333
1.8                 400                    90                 486                 576
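
The dynamic power column of Table 35 is consistent with the usual CMOS relation P = C × V² × f. The sketch below fits an effective switched capacitance from the first row and then predicts the other rows. The fitted value c_eff and the helper name dynamic_mw are assumptions made for illustration and do not appear in the book. The static power column is taken from the table as given and is not modelled.

```python
# (supply V, clock MHz, static mW, dynamic mW, total mW) copied from Table 35
TABLE_35 = [
    (0.8, 100, 40, 24, 64),
    (1.35, 100, 67, 68, 135),
    (1.35, 200, 67, 136, 204),
    (1.8, 100, 90, 121, 211),
    (1.8, 200, 90, 243, 333),
    (1.8, 400, 90, 486, 576),
]

# Fit the effective switched capacitance from the first row: P = C * V^2 * f.
v0, f0, _, p0, _ = TABLE_35[0]
c_eff = (p0 * 1e-3) / (v0 ** 2 * f0 * 1e6)  # farads


def dynamic_mw(v, f_mhz):
    """Dynamic power in mW predicted by P = C * V^2 * f."""
    return c_eff * v ** 2 * f_mhz * 1e6 * 1e3


for v, f, static, dyn, total in TABLE_35:
    model = dynamic_mw(v, f)
    print(f"{v:4.2f} V {f:3d} MHz: model {model:6.1f} mW, table {dyn} mW")
```

The fitted capacitance is about 375 pF, and every predicted value lies within 1 mW of the tabulated one, which suggests the table entries were truncated to whole milliwatts.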

 
 
 
 
 
 

PIC
Figure 288: Clock enable using (a) a multiplexor, (b) an AND gate and (c) an OR gate.

 
 
 

PIC
Figure 289: Illustrating a transparent latch and its use to suppress clock gating glitches.

 
 
 

PIC
Figure 290: Using XOR gates to determine whether a clock edge would have any effect.
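
The XOR-based test can be modelled in a few lines. This is a minimal sketch, assuming one XOR gate per flip-flop comparing its D input with its current Q output, with the results ORed together to form a clock-needed signal. The function names are hypothetical.

```python
def clock_needed(d_bits, q_bits):
    """Return True if clocking the register would change any stored bit.

    Models one XOR gate per flip-flop (D xor Q) feeding an OR reduction.
    """
    return any(d ^ q for d, q in zip(d_bits, q_bits))


def gated_clock_edge(d_bits, q_bits):
    """Apply one clock edge, skipping it when nothing would change.

    Returns the new register contents and whether the edge was delivered.
    """
    if clock_needed(d_bits, q_bits):
        return list(d_bits), True   # register loads, edge delivered
    return list(q_bits), False      # edge suppressed, state held


print(gated_clock_edge([1, 0, 1], [1, 0, 1]))  # no change, edge suppressed
print(gated_clock_edge([1, 1, 1], [1, 0, 1]))  # one bit differs, edge delivered
```

The register contents are the same whether or not the edge is suppressed; only the clock activity differs.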

 
 
 

PIC
Figure 291: Clock-needed computations forwarded down a pipeline.

 
 
 

PIC

PIC

Figure 292: Power gating in general (left) and an actual power gate circuit (right).

 
 
 
 
 
 

Table 36: Summary of the properties of four power-saving techniques.
Technique      Clock gating      Supply gating   DVFS
Control        Automatic         Various         Software
Granularity    Register or FSM   Larger blocks   Macroscopic
Clock tree     Mostly free runs  Turned off      Slows down
Response time  Instant           2 to 3 cycles   Instant (or ms if PLL adjusted)

 
 
 

PIC
Figure 293: Debugging hardware for a single-core SoC. A debug access port is connected via JTAG and USB to a debug workstation.

 
 
 

PIC
Figure 294: Typical additional event-stream debug resources for a modern MPSoC. Operational data busses are black. Event-stream busses are green. Debug-access busses are blue. The pink arrows denote event-monitoring nets from miscellaneous IP blocks that either do not have their own counters or need to be made available for cross-triggering.

 
 
 

PIC
Figure 295: JTAG interface chaining at board level to create a top-level node in the debug device tree.

 
 
 

Table 37: JTAG signal list.
Signal  Direction  Description
TDI     In         Test data in: serial bits from test agent or previous device
TMS     In         Test mode select: frame data and addresses
TCK     In         Test clock: clocks each bit in and out
TDO     Out        Test data out: to next device or back to agent

 
 
 

PIC
Figure 296: Basic structure of the additional logic required in input and output pads for boundary scanning. Serial data are shifted from one pad to the next using the test clock. The result from the previous test vector is shifted out as a vector is shifted in. The two strobe signals are used to apply the test vector to the input pads and to sample the data at the output pads.
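
The shift, sample and apply behaviour described in the caption can be sketched as a simple model. This is an illustration under stated assumptions: one shift stage per pad, a capture strobe that loads pad values into the stages, and an update strobe that applies the shifted-in vector. The class name and method names are hypothetical and do not follow any particular JTAG library.

```python
class BoundaryScanChain:
    """Behavioural model of a boundary-scan shift path, one stage per pad."""

    def __init__(self, n_pads):
        self.shift_reg = [0] * n_pads  # serial shift stages
        self.applied = [0] * n_pads    # values held after the update strobe

    def capture(self, pad_values):
        """Sample strobe: load the current pad values into the shift stages."""
        self.shift_reg = list(pad_values)

    def shift(self, tdi_bits):
        """Clock bits in at TDI; return the bits that emerge at TDO."""
        tdo_bits = []
        for bit in tdi_bits:
            tdo_bits.append(self.shift_reg[-1])           # last stage drives TDO
            self.shift_reg = [bit] + self.shift_reg[:-1]  # move one pad along
        return tdo_bits

    def update(self):
        """Apply strobe: copy the shifted-in vector to the pad outputs."""
        self.applied = list(self.shift_reg)


chain = BoundaryScanChain(4)
chain.capture([1, 0, 1, 1])          # result of the previous test vector
result = chain.shift([0, 1, 0, 0])   # new vector goes in as old result comes out
chain.update()
print(result, chain.applied)
```

The previous result emerges last stage first, and the new vector ends up in the stages in the reverse of the order in which it was shifted in.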

 
 
 

PIC
Figure 297: A static RAM with self-test wrapper around it. An error correction wrapper has a similar structure, but then the SRAM data bus will be wider than the external data bus. Both wrappers are often logically present, in which case a single wrapper may implement both functionalities.

 
 
 
 
 
 
 
 
 
 
 
 

PIC
Figure 298: Crystal oscillator circuit (left), canned crystal and contents (centre), and specification and output waveform (right).

 
 
 

PIC
Figure 299: Clock multiplication using a PLL (left) and clock distribution layout using a fractal H-tree (right).

 
 
 

PIC
Figure 300: Clock skewing. The delivery of the clock edge to some D-types is delayed or advanced to balance out timing margins either side.

 
 
 

PIC
Figure 301: Basic form of a timing slack graph.

 
 
 

PIC
Figure 302: Schematic (left) and timing diagram (right) of a typical multi-cycle path. The delay through logic block C ranges from Cmin to Cmax with both limits being between one and two clock periods.

 
 
 

PIC
Figure 303: Environmental testing under various temperature and humidity conditions in a climate-controlled chamber.

 
 
 

PIC
Figure 304: EMC testing of radio-frequency interference.

 
 
 

End

©2021 - DJ Greaves. All figures are available for use under the Creative Commons CC BY 4.0 licence, unless otherwise stated.