Many of the principles taught in this book are relevant to system architects of all
kinds, including those designing cloud-scale applications, custom accelerators or
IoT devices, or those making FPGA designs. However, the details of design verification
in Chapter 8 are likely to be of interest mainly to those designing semi-custom silicon
using standard cells. A git repository of additional online material is available at
bitbucket.org/djg11/modern-soc-design-djg. This contains the data used to generate the
tables and graphs in the book, as well as further source code, lab materials, examples
and answers to selected exercises. The repository contains a SystemC model of the Zynq
super FPGA device family, coded in a blocking TLM style. It is sufficient to run an
Arm A9 Linux kernel using the same boot image as the real silicon. Published by Arm
Education Media, 605 pages in softback and ePDF. ISBN 978-1-911531-36-4.
Figure 1: Typical wireless link for digital communications, showing the transmitter
(top) and receiver (bottom) for a simplex link.
Figure 2: Block diagram and photographs of a first-generation Bluetooth USB
dongle.
Figure 3: Two PCIe 802.11 Wi-Fi modules for a laptop. Their shielding lids have
been removed. The older unit, on the left, has a three-chip set from Broadcom. It
is physically nearly twice the size of the newer unit, shown on the right, which has
been magnified. The new unit uses a Qualcomm QCA9565 device that has a higher
throughput and also includes Bluetooth. The only other active part needed is the
40-MHz crystal oscillator to the left of the chip. As can be seen at top right, to give
a lower-cost lower-performance product, the second of the two diversity antenna
sockets, labelled ‘ALT’, was not fitted.
temp := 200                    // Set initial temperature to a high value
ans := first_guess             // This is the design vector (or tree)
metric := metric_metric ans    // We seek the highest-metric answer
while (temp > 1)
{  // Create new design point, offsetting with delta proportional to temperature
   ans' := perturb_ans temp ans
   // Evaluate (scalar) objective function (figure of merit) for new design point
   metric' := metric_metric ans'
   // Accept if better, or probabilistically while the temperature is high
   accept := (metric' > metric) || rand(100..200) < temp;
   if (accept) (ans, metric, temp) := (ans', metric', temp * 0.99)
}
return ans;
Figure 4: An iteration for hill climbing using simulated annealing to find a design
point with the highest scalar metric.
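The pseudocode of Figure 4 translates directly into an ordinary imperative language. A minimal C++ sketch, assuming user-supplied metric-evaluation and perturbation routines (both names hypothetical, not from the book), might be:

#include <cstdlib>
#include <vector>

// Sketch of the annealing loop of Figure 4. The design point is held in a vector;
// measure_metric() and perturb() are supplied by the user for their own problem.
std::vector<int> anneal(std::vector<int> ans,
                        double (*measure_metric)(const std::vector<int>&),
                        std::vector<int> (*perturb)(double temp, const std::vector<int>&))
{
    double temp = 200.0;                      // Initial temperature, high
    double metric = measure_metric(ans);      // We seek the highest-metric answer
    while (temp > 1.0)
    {
        std::vector<int> trial = perturb(temp, ans);   // New design point
        double trial_metric = measure_metric(trial);   // Figure of merit
        // Accept if better, or probabilistically while the temperature is still high
        bool accept = (trial_metric > metric) || (100 + std::rand() % 100) < temp;
        // As in the figure, the temperature is only lowered when a move is accepted
        if (accept) { ans = trial; metric = trial_metric; temp *= 0.99; }
    }
    return ans;
}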
Figure 5: Screenshot from a place-and-route tool, showing how the wiring avoids
a region of hardened layout (grey area on the right). Some area is required to route
these interconnections.
Table 1: Determining the ROM needs of an MP3 decoder by disassembling its
.rodata segment.
Label | Start address (hex)
g_drmp3_pow43-0x120 | 0x0000
g_drmp3_pow43> | 0x120
g_scf_partitions.6678> | 0x0c40
… | …
_end_of_static | 0x2350
Table 2: Statistics logged during one second of a profile run of MP3 stream
decoding. The numbers of integer ALU operations exclude loop control and
array-subscripting operations.
Event type | Number of operations
Input bytes | 16 392
Output frames | 44 352
DCT operations | 154
Floating-point adds and subtracts | 874 965
Floating-point multiplies | 401 255
Integer adds and subtracts | 162 107
Integer multiplies | 88 704
Figure 6: Mailbox, relay and pump paradigms: three initiator/target configurations
possible for dual-ported IP blocks.
Figure 7: Schematic symbols of a FIFO buffer following the mailbox pattern:
synchronous (left) and clock-domain crossing (right).
Table 3: Synchronous delay and combinational paths for basic FIFO types.
Type | Data latency | Ready latency | Combinational paths
Fully registered | 1 | 1 | None
Bypass | 0 | 1 | WR_EN to RD_RDY
Pipelined | 1 | 0 | RD_EN to WR_RDY
Bubble-free | 0 | 0 | Both directions
Asynchronous | Several | Several | None
Credit-controlled | 1 | n/a | None
Figure 8: A generic schematic of a one-place synchronous FIFO buffer. A
synchronous set/reset flip-flop records whether there are valid data in the broadside
register. The dashed lines show optional latency-reducing wiring. The output
multiplexor is needed only if the dashed bypass wire is installed. Also illustrated are
typical patterns of combinational path wiring in the client source and sink. The red
annotations show a potential combinational loop.
Figure 9: Relay and mailbox paradigms for a one-place synchronous FIFO buffer. If
there are no combinational paths, the only differences are the net names.
Figure 10: A custom ALU inside a CPU core implemented in two similar ways: (a) as
a custom instruction or (b) as a coprocessor.
Figure 11: A specialised inter-core message-passing fabric allows cores to rapidly
exchange a packet composed of several successive registers with each other.
Figure 12: A custom function implemented as a peripheral IP block, with optional
DMA (bus master) capability.
Figure 13: Example of a hardware accelerator receiving data from a specialist
interface. It uses DMA for data transfer to the main memory. PIO is managed by the
host processor.
Figure 14: Representative bump-in-wire server blade architecture that has the FPGA
in series with the network interface and secondary storage (disk drives or SSD).
Figure 15: Simplified block diagram of a super FPGA, like the Xilinx Zynq devices.
Table 4: Key parameters for a range of super-FPGA parts available from Xilinx in
2018.
Device name | VU31P | VU33P | VU35P | VU37P | VU11P | VU13P
System logic cells (k) | 962 | 962 | 1907 | 2852 | 2835 | 3780
CLB flip-flops (k) | 879 | 879 | 1743 | 2607 | 2592 | 3456
CLB LUTs (k) | 440 | 440 | 872 | 1304 | 1296 | 1728
Maximum distributed RAM (Mb) | 12.5 | 12.5 | 24.6 | 36.7 | 36.2 | 48.3
Total block RAM (Mb) | 23.6 | 23.6 | 47.3 | 70.9 | 70.9 | 94.5
Ultra RAM (Mb) | 90.0 | 90.0 | 180.0 | 270.0 | 270.0 | 360.0
HBM DRAM (GB) | 4 | 8 | 8 | 8 | – | –
Clock management tiles | 4 | 4 | 5 | 3 | 6 | 4
DSP slices | 2880 | 2880 | 5952 | 9024 | 9216 | 12 288
PCIe ports | 4 | 4 | 5 | 6 | 3 | 4
CCIX ports | 4 | 4 | 4 | 4 | – | –
150G Interlaken | 0 | 0 | 2 | 4 | 6 | 8
100G Ethernet with RS-FEC | 2 | 2 | 5 | 8 | 9 | 12
Maximum single-ended I/O | 208 | 208 | 416 | 624 | 624 | 832
Multi-standard Gbps SERDES | 32 | 32 | 64 | 96 | 96 | 128
Table 5: Basic parameters for two caches with different speed/power ratios but the
same technology and bit density, versus cache size.
Cache size | L1 energy | L1 area | L1 hit rate | L1 access time | L1 mean time | L2 energy | L2 area | L2 hit rate | L2 access time | L2 mean time
1 | 0.01 | 0.001 | 0.002 | 0.0 | 200 | 0.001 | 0.001 | 0.002 | 0.1 | 200
2 | 0.02 | 0.002 | 0.004 | 0.0 | 199 | 0.002 | 0.002 | 0.004 | 0.1 | 199
4 | 0.04 | 0.004 | 0.008 | 0.0 | 198 | 0.004 | 0.004 | 0.008 | 0.2 | 198
8 | 0.08 | 0.008 | 0.015 | 0.0 | 197 | 0.008 | 0.008 | 0.015 | 0.3 | 197
16 | 0.16 | 0.016 | 0.030 | 0.0 | 194 | 0.016 | 0.016 | 0.030 | 0.4 | 194
32 | 0.32 | 0.032 | 0.059 | 0.1 | 188 | 0.032 | 0.032 | 0.059 | 0.6 | 188
64 | 0.64 | 0.064 | 0.111 | 0.1 | 178 | 0.064 | 0.064 | 0.111 | 0.8 | 178
128 | 1.28 | 0.128 | 0.200 | 0.1 | 160 | 0.128 | 0.128 | 0.200 | 1.1 | 160
256 | 2.56 | 0.256 | 0.333 | 0.2 | 133 | 0.256 | 0.256 | 0.333 | 1.6 | 134
512 | 5.12 | 0.512 | 0.500 | 0.2 | 100 | 0.512 | 0.512 | 0.500 | 2.3 | 101
1024 | 10.24 | 1.024 | 0.667 | 0.3 | 67 | 1.024 | 1.024 | 0.667 | 3.2 | 69
2048 | 20.48 | 2.048 | 0.800 | 0.5 | 40 | 2.048 | 2.048 | 0.800 | 4.5 | 44
4096 | 40.96 | 4.096 | 0.889 | 0.6 | 23 | 4.096 | 4.096 | 0.889 | 6.4 | 28
8192 | 81.92 | 8.192 | 0.941 | 0.9 | 13 | 8.192 | 8.192 | 0.941 | 9.1 | 20
16 384 | 163.84 | 16.384 | 0.970 | 1.3 | 7 | 16.384 | 16.384 | 0.970 | 12.8 | 18
32 768 | 327.68 | 32.768 | 0.985 | 1.8 | 5 | 32.768 | 32.768 | 0.985 | 18.1 | 21
65 536 | 655.36 | 65.536 | 0.992 | 2.6 | 4 | 65.536 | 65.536 | 0.992 | 25.6 | 27
131 072 | 1310.72 | 131.072 | 0.996 | 3.6 | 4 | 131.072 | 131.072 | 0.996 | 36.2 | 37
262 144 | 2621.44 | 262.144 | 0.998 | 5.1 | 5 | 262.144 | 262.144 | 0.998 | 51.2 | 51
Table 6: Some composite design examples.
L1 size | L2 size | L2 energy | Composite energy | Composite area | Composite mean time
64 | 262 144 | 233.0 | 233.6 | 262.2 | 45.8
128 | 262 144 | 209.7 | 211.0 | 262.3 | 41.2
1024 | 262 144 | 87.4 | 97.6 | 263.2 | 17.4
4096 | 262 144 | 29.1 | 70.1 | 266.2 | 6.3
Table 7: Power laws for performance delivered and power used in terms of three
independent parameters.
Metric | Core complexity ( ) | DVFS voltage ( ) | Number of cores ( )
Performance delivered | | |
Power used | | |
Increase in power for double performance | 4 | 8 | 2.16
Figure 16: Relative performance of simple and complex CPU cores with DVFS.
Figure 17: Three multi-access NoC topologies.
Figure 18: Comparison of various multi-access and switched NoC designs in terms
of throughput (136 bit words per clock cycle) per gate input as the number of end
points (stations) is increased.
# Unfold=1 is the baseline design. Unfold=3 uses three times more silicon.
static_dynamic_tradeoff <- function(clock_freq, leakage, unfold, xx)
{
  op_count <- 2e7;
  # Model: Pollack-like unfold benefit
  execution_time <- op_count / clock_freq / (unfold ^ 0.75);
  # Model: Higher supply needed for higher clock and leakage resistance slightly increasing with Vdd
  vdd <- 1 + 0.5 * (clock_freq/100e6);
  static_power <- leakage * vdd ^ 0.9 * unfold;
  # Integrate static power and energy
  static_energy <- static_power * execution_time;
  # Use CV^2 for dynamic energy
  dynamic_energy <- op_count * vdd ^ 2.0 * 5e-10;
}
Figure 19: Trading off Aesop’s hare versus the tortoise for increasingly leaky
technology. In a hard real-time computation, we know the number of clock cycles
needed but should we do them quickly and halt (Hare) or do them slowly and finish
just in time (Tortoise)?
Figure 20: Shift in the sweet spot (small vertical line) for DVFS with high-leakage
technology for a real-time task. The total energy (solid line) is the sum of the static
energy (dashes) and dynamic energy (dots).
Figure 21: Block diagram for an MPEG compression subsystem. The input is a
real-time video stream and the output is an MPEG transport stream (without sound).
The DCT, IDCT and motion estimators are highly compute intensive.
/* Generate variable-length codes for an intra-coded
block (6.2.6, 6.3.17) */
void putintrablk(Picture picture, short [] blk, int cc)
{
/* DC Difference from previous block (7.2.1) */
int dct_diff = blk[0] - picture.dc_dct_pred[cc];
picture.dc_dct_pred[cc] = blk[0];
if (cc==0) putDClum(dct_diff);
else putDCchrom(dct_diff);
/* AC coefficients (7.2.2) */
int run = 0;
byte [] scan_tbl = (picture.altscan ? alternate_scan:
zig_zag_scan);
for (int n=1; n<64; n++)
{ // Use appropriate entropy scanning pattern
int signed_level = blk[scan_tbl[n]];
if (signed_level!=0)
{
putAC(run, signed_level, picture.intravlc);
run = 0;
}
else run++; /* count zero coefficients */
}
/* End of Block -- normative block punctuation */
if (picture.intravlc!=0) putbits(6,4); // 0110 (B-15)
else putbits(2,2); // 10 (B-14)
}
// Return difference between two (8*h) sub-sampled blocks
// blk1, blk2: addresses of top left pels of both blocks
// rowstride: distance (in bytes) of vertically adjacent pels
// h: height of block (usually 8 or 16)
int sumsq_sub22(byte [] blk1, byte [] blk2, int rowstride, int h)
{
int ss = 0, p1 = 0, p2 = 0;
for (int j=0; j<h; j++)
{
for (int i=0; i<8; i++)
{ int v = blk1[p1+i] - blk2[p2+i];
ss += v*v;
}
p1+= rowstride; p2+= rowstride;
}
return ss;
}
Figure 22: Code fragments for an MPEG encoder, coded in C# for Kiwi HLS. The
putintrablk routine takes the 64 values from the quantised DCT and transmits
them using variable-length coding, with a run-length encoding of zero values
in the routines it calls. The sumsq_sub22 function is one of many used during
motion estimation to compute differences between macroblocks.
Figure 23: Typical examples of FUs deployed by higher-level design languages,
especially HLS.
Figure 24: Fixed-point implementation of a floating-point accumulator.
Figure 25: Dynamic load balancing using the server farm paradigm.
Figure 26: Generic setup of a system integrator tool. This is the IP-XACT reference
model for design capture and synthesis. A central interactive editor enables a
system to be created from externally provided IP blocks by invoking generator and
configurator tools.
Figure 27: IP-XACT structures for documenting the memory map and register field
definitions, plus the access type supported for each bit field.
Figure 28: Generating an interconnect from IP-XACT: abstract view (top) using five
IP-XACT XML files and synthesised wiring (bottom).
Figure 29: Example of high-level manual wiring between subsystems with wiring
configured using a GUI.
Figure 30: Screenshot of the connectivity matrix editor within Arm Socrates.
Figure 31: An example Chisel module: a static-priority arbiter, showing Chisel
source code (left) and the circuit generated (right) when n_inputs is 3.
Figure 32: An 8-bit counter with a synchronous load using sequential logic and a
when/otherwise clause: Chisel source code and schematic symbol.
Figure 33: Three communication processes expressed using Handel-C.
module mkTb1 (Empty);            // This module has no externally callable methods
   Reg#(int) rx <- mkReg (23);   // Create an instance of a register called rx, initialised to 23
   rule countone (rx < 30);      // A rule named 'countone' with an explicit guard
      int y = rx + 1;            // This is short for int y = rx.read() + 1;
      rx <= rx + 1;              // This is short for rx.write(rx.read() + 1);
      $display ("countone: rx = %0d, y = %0d", rx, y);
   endrule
   rule counttwo (rx > 20);      // A competing rule, also guarded
      rx <= rx + 2;              // This increments by two
      $display ("counttwo: rx = %0d", rx);
   endrule
   rule done (rx >= 40);         // A third rule
      $finish (0);
   endrule
endmodule: mkTb1
Figure 34: A Bluespec example in which three rules compete to act on a simple
broadside register.
interface Pipe_ifc;
method Action put(int arg);
method int get();
endinterface
_________________________________________
module mkTb2 (Empty); // Testbench
Reg#(int) x <- mkReg ('h10);
Pipe_ifc thepipe <- mkPipe;
rule fill; // explicit guard of (true) is implied
thepipe.put(x);
// This is short for x.write(x.read() + 'h10);
x <= x + 'h10;
endrule
rule drain;
let y = thepipe.get();
$display (" y = %0h", y);
endrule
endmodule
Figure 35: Bluespec definition of an example interface, the Pipe_ifc (top left),
the definition of a testbench component that instantiates a component with that
interface and two rules to operate on it (bottom left), and a net-level schematic
symbol for the instantiated component showing handshake nets. The symbol for
the Tb2 component has no external nets (apart from clock and reset) since it has an
empty interface.
Figure 36: Basic steps of an HLS tool chain for ASIC or FPGA.
// A simple long multiplier with
// variable latency
int multiply(int A, int B)
{
int RA=A;
int RB=B;
int RC=0;
while(RA>0)
{
if (odd(RA)) RC = RC + RB;
RA = RA >> 1;
RB = RB << 1;
}
return RC;
}
module LONGMULT8b8(clk, reset, C, Ready, A, B, Start);
input clk, reset, Start;
output Ready;
input [7:0] A, B;
output [15:0] C;
reg [15:0] RC, RB, RA;
reg Ready;
reg xx, yy, qq, pp; // Control and predicate nets
reg [1:0] fc;
reg [3:0] state;
always @(posedge clk) begin
xx = 0; // default settings.
yy = 0;
fc = 0;
// Predicates
pp = (RA!=16'h0); // Work while pp holds
qq = RA[0]; // Odd if qq holds
if (reset) begin // Sequencer
state <= 0;
Ready <= 0;
end
else case (state)
0: if (Start) begin
xx = 1;
yy = 1;
fc = 2;
state <= 1;
end
1: begin
fc = qq;
if (!pp) state <= 2;
end
2: begin
Ready <= 1;
if (!Start) state <= 3;
end
3: begin
Ready <= 0;
state <= 0;
end
endcase // case (state)
RB <= (yy) ? B: RB<<1; // Data path
RA <= (xx) ? A: RA>>1;
RC <= (fc==2) ? 0: (fc==1) ? RC+RB: RC;
end
assign C = RC;
endmodule
Figure 37: A very basic worked HLS example showing the input source code (left)
and generated RTL (right) for a 32-bit multiplier as a variable-latency FU with start
and ready handshake nets.
Figure 38: Long multiplier output from HLS showing the internal data path and
sequencer (left) and the schematic symbol (right). The result is used as an FU in a
higher-level design.
Figure 39: An example of a static schedule for a basic block containing a single
assignment.
Figure 40: Various circuits that compute a running sum. The everyday circuit C1 can
be used if the adder has a latency of 1. However, if the adder has a latency of 2, more
complex circuits are needed, namely C2 and C3.
Figure 41: Fragment of an example inter-block initiation and hazard graph. Block 1
may be followed by itself or by block 2. Each block schedule contains both C (control
flow) predicate evaluation and D (data) computation operations.
public static int associative_reduction_example(int starting)
{  int vr = 0;
   for (int i=0; i<15; i++) // or also i+=4
   {  int vx = (i+starting)*(i+3)*(i+5); // Mapped computation
      vr ^= ((vx&128)>0 ? 1:0);          // Associative reduction
   }
   return vr;
}
Figure 42: An iteration that performs associative reduction (in vr).
Figure 43: An iteration that has a loop-carried data dependency (through variable
vd).
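The listing for Figure 43 is not reproduced above. Purely as an illustration (not the book's code), a loop-carried dependency of the kind the caption describes, where each iteration needs the vd value computed by the previous one and so iterations cannot be overlapped, might look like this C++ sketch:

// Hypothetical sketch only: vd carries data from one iteration to the next,
// so iteration i cannot start until iteration i-1 has produced its vd.
int loop_carried_example(int seed)
{
    int vd = seed;
    for (int i = 0; i < 20; i++)
    {
        vd = (vd * 7 + i) & 0xFFFF;   // Next value needs the previous vd
    }
    return vd;
}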
static int [] foos = new int [10];
static int ipos = 0;

public static int loop_forwarding_example(int newdata)
{  foos[ipos ++] = newdata;
   ipos %= foos.Length;
   int sum = 0;
   for (int i=0; i<foos.Length-1; i++)
   {  int dv = foos[i]^foos[i+1]; // Two adjacent locations are read
      sum += dv;                  // Associative scalar reduction in sum
   }
   return sum;
}
Figure 44: A loop where data fetched in one iteration (foos[i]) can usefully be
forwarded to a subsequent iteration.
public static int data_dependent_controlflow_example(int seed)
{  int vr = 0;
   int i;
   for (i=0; i<20; i++)
   {  vr += i*i*seed;
      if (vr > 1111) break; // Early loop exit
   }
   return i;
}
Figure 45: A loop that has data-dependent control flow (the loop exit depends on
variable vr).
Figure 46: Lookup procedure when searching for a string using the BWT. A haystack
has been transformed and stored in Bwt[]. An index Ranks[] is computed for it.
Looking for the string in Needle is then very fast.
Figure 47: Compacted Ranks array for BWT, redefinition of the ranks() routine for
an interpolated lookup and a sensible layout in a DRAM row.
Figure 48: Data dependencies (slightly simplified) in the Smith–Waterman
alignment-finding algorithm.
Figure 49: Affine transformation examples, adapted from [?]. Nest 1 runs after nest
2, but by skewing the access pattern of each loop, there is increased parallelism
available within each nest and pipelining becomes possible. Nest 2 commences just
after nest 1 has produced its first diagonal output.
Figure 50: Shuffle data flow for an FFT (left), elemental butterfly (top right) and code
(bottom right).
Figure 51: A circuit intended to compute the running sum of streaming data values.
Figure 52: Overall setup of a generic ESL flow. The heavy lines show the fastest
technique, which avoids using an instruction set simulator (ISS) through native
cross-compiling of the embedded firmware and direct linking with behavioural
models of the hardware.
SC_MODULE(mycounter)       // An example of a leaf module (no subcomponents)
{
  sc_in < bool > clk, reset;
  sc_out < sc_int<10> > myout;

  void mybev()             // Internal behaviour, invoked as an SC_METHOD
  {
    myout = (reset) ? 0: (myout.read()+1); // Use .read() since sc_out makes a signal
  }

  SC_CTOR(mycounter)       // Constructor
  {
    SC_METHOD(mybev);      // Require that mybev is called on each positive edge of clk
    sensitive << clk.pos();
  }
};
Figure 53: A 10-bit binary counter with synchronous reset, coded as a SystemC class.
Figure 57: An example waveform view plotted by gtkwave.
sc_trace_file *tf = sc_create_vcd_trace_file("tracefilename");
// Now call:
//   sc_trace(tf, <traced variable>, <string>);
sc_signal < bool > serialin("serialin");  // A named signal
sc_signal < bool > serialout;             // An unnamed signal
float fbar;
sc_trace(tf, clk);
sc_trace(tf, serialin);
sc_trace(tf, serialout, "serialout");     // Give name since not named above
sc_trace(tf, fbar, "fbar");               // Give name since POD form
sc_start(1000, SC_NS);                    // Simulate for 1 microsecond (old API)
sc_close_vcd_trace_file(tf);
return 0;
Figure 58: Naming nets.
sc_signal < bool > mywire;   // Rather than a channel conveying just one bit

struct capsule
{
  int ts_int1, ts_int2;

  bool operator== (struct capsule other)
  {
    return (ts_int1 == other.ts_int1) && (ts_int2 == other.ts_int2);
  }

  int next_ts_int1, next_ts_int2; // Pending updates

  void update()
  {
    ts_int1 = next_ts_int1;
    ts_int2 = next_ts_int2;
  }
  ... ... // Also must define read(), write() and value_changed()
};

sc_signal < struct capsule > myast; // We can send two integers at once
Figure 59: Equality operator overload.
void mymethod() { .... }

SC_METHOD(mymethod)
sensitive << myast.pos(); // User must define concept of posedge for their own abstract type
Figure 60: Positive edge.
Figure 61: Three views of the four-phase transactional protocol from §?? operating
between a sender and a receiver. The views are for a net-level connection, an
untimed TLM push and an untimed TLM pull.
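To make the push/pull distinction concrete, here is a minimal hedged C++ sketch (plain structs rather than real SystemC/TLM classes; all names are illustrative): in the push style the data producer is the initiator and calls a method exported by the consumer, whereas in the pull style the consumer initiates and calls a method exported by the producer.

#include <queue>

// Illustrative only: untimed push versus pull styles for a word-at-a-time stream.
struct push_target {                  // Push style: the producer calls put() on the consumer
    std::queue<int> backlog;
    void put(int word) { backlog.push(word); }   // Returning from put() completes the transaction
};

struct pull_target {                  // Pull style: the consumer calls get() on the producer
    int next_word = 0;
    int get() { return next_word++; }            // Returning from get() completes the transaction
};

In both cases the four-phase net-level handshake collapses into a single blocking method call whose return marks the end of the transaction.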
Figure 62: Schematic of a FIFO device (top) and its typical instantiation (bottom).
// Define the interfaces:
class write_if: public sc_interface
{ public:
    virtual void write(char) = 0;
    virtual void reset() = 0;
};

class read_if: public sc_interface
{ public:
    virtual char read() = 0;
};

// Define a component that inherits:
class fifo_dev: sc_module("fifo_dev"), public write_if, public read_if, ...
{
  void write(char) { ... }
  void reset() { ... }
  ...
};
Figure 63: SystemC module definition of an interface (left) and FIFO writer (right).
Figure 64: TLM 2.0 generic payload structure (left) and interconnection of three IP
blocks showing passthrough and demultiplexing (right).
// Filling in the fields of a TLM 2.0 generic payload:
trans.set_command(tlm::TLM_WRITE_COMMAND);
trans.set_address(addr);
trans.set_data_ptr(reinterpret_cast<unsigned char*>(&data));
trans.set_data_length(4);
trans.set_streaming_width(4);
trans.set_byte_enable_ptr(0);
trans.set_response_status( tlm::TLM_INCOMPLETE_RESPONSE );

// Sending the payload through a TLM socket:
socket->b_transport(trans, delay);
Figure 65: Setting up a generic payload.
Table 8: List of TLM 2.0 convenience socket types.
simple_initiator_socket.h: A version of an initiator socket that has a default implementation of all interfaces. It allows the registration of an implementation for any of the interfaces to the socket, either unique interfaces or tagged interfaces (carrying an additional ID).

simple_target_socket.h: A basic target socket that has a default implementation of all interfaces. It also allows the registration of an implementation for any of the interfaces to the socket, either unique interfaces or tagged interfaces (carrying an additional ID). This socket allows only one of the transport interfaces (blocking or non-blocking) to be registered and implements a conversion if the socket is used on the other interface.

passthrough_target_socket.h: A target socket that has a default implementation of all interfaces. It also allows the registration of an implementation for any of the interfaces to the socket.

multi_passthrough_initiator_socket.h: An implementation of a socket that allows multiple targets to be bound to the same initiator socket. It implements a mechanism that allows the index of the socket the call passed through in the backward path to be identified.

multi_passthrough_target_socket.h: An implementation of a socket that allows multiple initiators to bind to the same target socket. It implements a mechanism that allows the index of the socket the call passed through in the forward path to be identified.
Figure 66: Illustration of modelling artefacts relating to model interleaving and
timing arising from loosely timed TLM. In a cycle-accurate or approximately timed
model (left), two threads, T0 and T1, can access a shared resource. One blocks while
the resource is busy. With loose timing (right), the average contention delay is added
to the response times of all customers. With a large quantum, one thread may run
entirely before the other.
vqueue::b_transact(pkt, sc_time &delay)
{
  // Measure utilisation and predict queue delay based on last 32 transactions
  if (++opcount == 32)
  {
    sc_time delta = sc_time_stamp() + delay - last_measure_time;
    local_processing_delay += (delay_formula(delta/32) - local_processing_delay)/16;
    logging.log(25, delta);        // Record utilisation
    last_measure_time = sc_time_stamp() + delay;
    opcount = 0;
  }
  // Add estimated (virtual) queuing penalty
  delay += local_processing_delay;
  // Do actual work
  output.b_transact(pkt, delay);
}
Figure 67: Essence of the code for a virtual queue contention point. A number of
clients send work using vqueue::b_transact. The work is done straight away by
output.b_transact, but the virtual time spent in the queue is added to the loosely
timed delay.
Figure 68: Typical setup of a thread using loosely timed modelling with a quantum
keeper for one core of an MPSoC. In reality there would be multiple interconnect
components between the initiator and its final target.
Figure 69: Mixing modelling styles using a target-to-initiator transactor. An
initiator-to-target transactor, not shown, would receive net-level transactions and
make method calls on a TLM target.
Figure 70: Mixing modelling styles using a mailbox paradigm transactor.
Figure 71: Five views of the LocalLink protocol (also of AXI4-Stream). (a) Timing
diagram. (b) Net-level wiring. (c) TLM abstract view. (d) Blocking TLM target code,
big step. (e) Blocking TLM, small step.
void mips64iss::step()
{
  u32_t ins = ins_fetch(pc);
  pc += 4;
  u8_t opcode = ins >> 26;       // Major opcode
  u8_t scode = ins&0x3F;         // Minor opcode
  u5_t rs = (ins >> 21)&31;      // Registers
  u5_t rd = (ins >> 11)&31;
  u5_t rt = (ins >> 16)&31;
  if (!opcode) switch (scode)    // Decode minor opcode
  {
    case 052: /* SLT - set on less than */
      regfile_up(rd, ((int64_t)regfile[rs]) < ((int64_t)regfile[rt]));
      break;
    case 053: /* SLTU - set on less than unsigned */
      regfile_up(rd, ((u64_t)regfile[rs]) < ((u64_t)regfile[rt]));
      break;
    ... ...
  }
}

void mips64iss::regfile_up(u5_t d, u64_t w32)
{
  if (d != 0) // Register zero stays at zero
  {
    TRC(trace("[ r%i := %llX ]", d, w32));
    regfile[d] = w32;
  }
}
Figure 72: The main step function of a MIPS64 ISS, together with the helper that updates the register file.
Table 9: Typical relative performance of different
virtual platform processor modelling approaches compared with real time. Figures
assume a high-performance modelling workstation and a SoC with just one core.
Index | Type of ISS | I-cache traffic modelled | D-cache traffic modelled | Relative performance
(1) | Interpreted RTL | Y | Y | 0.000001
(2) | Compiled RTL | Y | Y | 0.00001
(3) | V-to-C C++ | Y | Y | 0.001
(4) | Handcrafted cycle-accurate C++ | Y | Y | 0.1
(5) | Handcrafted high-level C++ | Y | Y | 1.0
(6) | Trace buffer/JIT C++ | N | Y | 20.0
(7) | Native cross-compile | N | N | 50.0
Figure 73: Block diagram of one channel of a DMA unit. (Repeated from
Figure 254).
// Behavioural model of the target side: PIO register read/write
uint32 src, dest, length;
bool busy, int_enable;

u32_t status() { return (busy << 31) | (int_enable << 30); }

u32_t completer_read(u32_t a)
{
  return (a==0)? src: (a==4) ? dest: (a==8) ? (length) : status();
}

void completer_write(u32_t a, u32_t d)
{
  if (a==0) src=d;
  else if (a==4) dest=d;
  else if (a==8) length = d;
  else if (a==12) { busy = d >> 31; int_enable = d >> 30; }
}
// Behavioural model of the bus-mastering portion
while(1)
{
  waituntil(busy);
  while (length-- > 0) mem.write(dest++, mem.read(src++));
  busy = 0;
}
Interrupt code:
interrupt = int_enable&!busy;
Figure 74: Behavioural model of (one channel of) the DMA controller.
Table 10: Product space of power and area estimates for an RTL implementation.
 | Without simulation | Using simulation
Without place and route | Fast design exploration. Area and delay heuristics needed | Can generate indicative activity ratios that can be used instead of a simulation in further runs
With place and route | Static timing analyser will give an accurate clock frequency | Gold standard: only bettered by measuring a real chip
module CTR16( input mainclk, input din, input cen, output o);

  reg [3:0] count, oldcount;              // D-types
  always @(posedge mainclk) begin
    if (cen) count <= count + 1;          // ALU
    if (din) oldcount <= count;           // Wiring
  end

  assign o = count[3] ^ count[1];         // Combinational
endmodule
Figure 75: Example RTL fragment used in a static analysis.
Table 11: Normal parameters for a RAM compiler together with selected values.
Parameter | Value
Fabrication geometry | 22 nm
Organisation | 64K x 64
Nominal VDD | 1.0 V
Number of banks | 1
Read/write ports per bank | 1
Read ports per bank | 0
Write ports per bank | 0
Figure 76: Performance and overheads for various sizes of a single-ported 32-bit
data width SRAM implemented in 22 nm, as modelled by Cacti 7.0. The tool gives
the access time, read and write dynamic energy use, and static power and area use
for each size.
void sram64_cbg::recompute_pvt_parameters() // Called in constructor and when Vcc is changed
{
  m_latency = sc_time(0.21 + 3.8e-4 * sqrt(float(m_bits)), SC_NS);
  pw_power leakage = pw_power(82.0 * m_bits, PW_nW);
  set_static_power(leakage);
  set_fixed_area(pw_area(13359.0 + 4.93/8 * m_bits, PW_squm));
  m_read_energy_op = pw_energy(5.0 + 1.2e-4 / 8.0 * m_bits, pw_energy_unit::PW_pJ);
  m_write_energy_op = 2.0 * m_read_energy_op; // Rule of thumb!
  // NB: Might want different energy when high-order address bits change
  pw_voltage vcc = get_vcc();
  m_latency = m_latency / vcc.to_volts();
  cout << name () << ":" << kind() << ": final latency = " << m_latency << "\n";
}
void sram64_cbg::b_access(PW_TLM_PAYTYPE &trans, sc_time &delay)
{
  tlm::tlm_command cmd = trans.get_command();
  // Log wiring power consumed by transaction arriving here.
  // Also set which nets modelled by the TLM will be active after this operation:
  // for a write, none (except a response ack) and for a read, the payload data.
  trans.pw_log_hop(this,
       (cmd==tlm::TLM_READ_COMMAND ? PW_TGP_DATA: PW_TGP_NOFIELDS) | PW_TGP_ACCT_CKP,
       &read_bus_tracker);
  if (cmd == tlm::TLM_READ_COMMAND)
  {
    // Log internal transaction energy for read
    pw_module_base::record_energy_use(m_read_energy_op);
    ...
  }
  else if (cmd == tlm::TLM_WRITE_COMMAND)
  {
    // Log internal transaction energy for write
    pw_module_base::record_energy_use(m_write_energy_op);
    ...
  }
}
Figure 77: TLM+TE model of an SRAM. Constructor and blocking transport
methods.
Figure 78: A die-stacked DRAM subsystem in the style of the Micron HMC and
HBM. Several DRAM chips are piled on top of an upside-down (flip-chip) controller
chip, interconnected into a 3-D structure using through-silicon vias. There could be
several stacks. A passive interposer (ceramic PCB) wires the logic die using a parallel
interconnect to SERDES chips (also upside-down) for serial connection over the
circuit board to the hosts.
Table 12: Example performance data from collated DRAMSim4 data sheets for
a variety of DDR4 devices, showing their closed page, random access time and
dynamic energy for several major operations (total energy for 8 devices making up
a 32-bit data bus). Only two significant figures are accurate.
Device type (capacity, word size and speed) | (cycles) | Clock frequency (MHz) | Access time (ns) | Activate (pJ) | Read (pJ) | Write (pJ)
DDR4_4Gb_x4_1866 | 13 | 934.6 | 32.1 | 4006.1 | 5752.3 | 6985.0
DDR4_4Gb_x4_2133 | 16 | 1063.8 | 33.8 | 3754.0 | 5775.4 | 7219.2
DDR4_4Gb_x4_2400 | 17 | 1204.8 | 31.5 | 4063.7 | 5418.2 | 7330.6
DDR4_4Gb_x4_2666 | 19 | 1333.3 | 31.5 | 4104.0 | 6048.0 | 7488.0
DDR4_8Gb_x4_1866 | 13 | 934.6 | 32.1 | 5156.5 | 5341.4 | 4930.6
DDR4_8Gb_x4_2133 | 16 | 1063.8 | 33.8 | 5270.0 | 4692.5 | 4331.5
DDR4_8Gb_x4_2666 | 19 | 1333.3 | 31.5 | 6105.6 | 4608.0 | 4089.6
DDR4_8Gb_x4_2933 | 21 | 1470.6 | 31.3 | 6632.4 | 4595.7 | 4021.2
DDR4_8Gb_x4_3200 | 22 | 1587.3 | 30.2 | 7136.6 | 4644.9 | 4015.9
Figure 79: A Parallella single-board computer. The two 3320C USB driver devices
and connectors are highlighted in red boxes.
Table 13: Phase/mode example: supply rail voltage and current and total power
consumption for 3320C-EZK USB line driver devices.
Operating mode | Rail 1 (volts) | Rail 1 (mA) | Rail 2 (volts) | Rail 2 (mA) | Rail 3 (volts) | Rail 3 (mA) | Total power (mW)
Standby | 3.3 | 0.018 | 1.8 | 0.0007 | 3.3 | 0.03 | 0.16
L/S mode | 3.3 | 6.3 | 1.8 | 11 | 3.3 | 5 | 57
H/S mode | 3.3 | 29 | 1.8 | 22 | 3.3 | 59 | 155
Figure 80: Example power estimation spreadsheet. The inventory of instantiated
components is annotated with clock frequencies and activity ratios to generate an
approximate total power use in watts.
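The arithmetic behind such a spreadsheet can be sketched as follows, assuming the standard a*C*V^2*f expression for dynamic power plus a per-component leakage term. This is a hedged illustration; the component names and numbers are invented, not taken from the figure.

#include <cstdio>

// Illustrative power-estimation rows: each instantiated component contributes
// activity * C * Vdd^2 * f of dynamic power plus its static leakage.
struct component { double cap_farads, vdd_volts, freq_hz, activity, leakage_watts; };

double total_power(const component *rows, int n)
{
    double watts = 0.0;
    for (int i = 0; i < n; i++)
        watts += rows[i].activity * rows[i].cap_farads
               * rows[i].vdd_volts * rows[i].vdd_volts * rows[i].freq_hz
               + rows[i].leakage_watts;
    return watts;
}

int main()
{
    component rows[] = {
        { 200e-12, 0.9, 400e6, 0.2, 0.003 },   // e.g. a CPU core (illustrative numbers)
        {  50e-12, 0.9, 200e6, 0.1, 0.001 },   // e.g. a peripheral block
    };
    std::printf("Approximate total power: %f W\n", total_power(rows, 2));
    return 0;
}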
Figure 81: Two similar designs with different Rent exponents (top) and two
non-Rentian design points (bottom).
Figure 82: Left: average net length in systems composed of 59 and 671 gates for
various values of rent exponent. Right: average net length for subsystems of different
sizes for Rent exponent of 0.65 using four improved equations presented in [?].
Figure 83: Lowest common parent of the end-point logic blocks. The distance
between two end points is always roughly the same for any sensible layout of a
design, so a detailed layout, like the one shown, is not required.
Figure 84: Example of a transactional energy modelling plot. Static power depends
on the power mode, whereas the dynamic power depends on internal and external
activity events.
Figure 85: Example reports generated by the TLM POWER3 library. The energy use
between two checkpoints is shown in the upper table and power consumption in
the lower table. A total for the whole device is given, along with subtotals for parts of
the design hierarchy specifically selected by the user.
Figure 86: Back-end stages in the synthesis and physical design of a SoC, showing
representative diagrams for five of the stages. The red track on the left is the physical
intent flow.
Figure 87: A small example of a universal power format (UPF) file.
Table 14: Representative microprocessors
Year introduced | Microprocessor | No of transistors | Geometry
2007 | Dual-core Intel Itanium 2 | 1.6 billion | 90 nm
2010 | 8-core Intel Nehalem | 2.3 billion | 45 nm
2010 | Altera Stratix IV FPGA | 2.5 billion | 40 nm
2015 | Intel CPU | circa 10 billion | 19 nm
2020 | Nvidia's GA100 Ampere | 54 billion | 7 nm
Figure 88: Technology scaling scatter plot.
Figure 89: Dark silicon (using data from [?, ?]).
Figure 90: Basic layers in planar CMOS VLSI. Cross section through an invertor
fabricated using planar transistors (not FinFETs) in a twin-well process and metal
layers, M1 to M7 (not to scale).
Figure 91: Generic 3-D structure of a FinFET (left). The gate is wrapped around the
channel. To get a sufficiently low on-resistance, a number of FETs are arranged in
parallel using techniques shown in the other three diagrams, which are in 2-D plan
view.
Figure 92: Typical first-generation standard cell layout showing three rows of cells.
Power rails of alternating polarity run horizontally using the metal M1 layer (blue).
This is also used for the internal wiring within the cells. N- and P-type diffusion
zones are shown in green and yellow. FET gates are formed where the polysilicon
(orange) crosses a diffusion layer. The violet and brown indicate metal wiring
layers used for inter-cell connections. The brown layer is predominantly used for
horizontal net segments and the violet for vertical segments.
Table 15: ITRS roadmap projection for geometry, supply voltage, transistor
properties and FO3 gate energy. Predictions are made for both high-performance
(HP) and low-power (LP) transistors.
Year of production | 2015 | 2017 | 2019 | 2021 | 2024 | 2027 | 2030
Logic device technology names | P70M56 | P48M36 | P42M24 | P32M20 | P24M12G1 | P24M12G2 | P24M12G3
Logic industry node range label (nm) | 16/14 | 11/10 | 8/7 | 6/5 | 4/3 | 3/2.5 | 2/1.5
Logic device structure | FinFET, FDSOI | FinFET, FDSOI | FinFET, LGAA | FinFET, LGAA, VGAA | VGAA, M3D | VGAA, M3D | VGAA, M3D
Device Electrical Specifications
Power supply voltage (V) | 0.80 | 0.75 | 0.70 | 0.65 | 0.55 | 0.45 | 0.40
Sub-threshold slope (mV/decade) | 75 | 70 | 68 | 65 | 40 | 25 | 25
Inversion layer thickness (nm) | 1.10 | 1.00 | 0.90 | 0.85 | 0.80 | 0.80 | 0.80
Vt,sat (mV) at nA/µm, HP logic | 129 | 129 | 133 | 136 | 84 | 52 | 52
Vt,sat (mV) at nA/µm, LP logic | 351 | 336 | 333 | 326 | 201 | 125 | 125
Effective mobility (cm²/V·s) | 200 | 150 | 120 | 100 | 100 | 100 | 100
(m), HP logic | 280 | 238 | 202 | 172 | 146 | 124 | 106
Ballisticity: injection velocity (cm/s)
(V), HP logic | 0.115 | 0.127 | 0.136 | 0.128 | 0.141 | 0.155 | 0.170
(V), LP logic | 0.125 | 0.141 | 0.155 | 0.153 | 0.169 | 0.186 | 0.204
(A/m) at nA/µm, HP logic with | 2311 | 2541 | 2782 | 2917 | 3001 | 2670 | 2408
(A/m) at nA/µm, HP logic after | 1177 | 1287 | 1397 | 1476 | 1546 | 1456 | 1391
(A/m) at pA/µm, LP logic with | 1455 | 1567 | 1614 | 1603 | 2008 | 1933 | 1582
(A/m) at pA/µm, LP logic after | 596 | 637 | 637 | 629 | 890 | 956 | 821
Cch, total (fF/µm), HP/LP logic | 31.38 | 34.52 | 38.35 | 40.61 | 43.14 | 43.14 | 43.14
Cgate, total (fF/µm), HP logic | 1.81 | 1.49 | 1.29 | 0.97 | 1.04 | 1.04 | 1.04
Cgate, total (fF/µm), LP logic | 1.96 | 1.66 | 1.47 | 1.17 | 1.24 | 1.24 | 1.24
CV/I (ps), FO3 load, HP logic | 3.69 | 2.61 | 1.94 | 1.29 | 1.11 | 0.96 | 0.89
I/(CV) (1/ps), FO3 load, HP logic | 0.27 | 0.38 | 0.52 | 0.78 | 0.90 | 1.04 | 1.12
Energy per switching (CV²) (fJ/switching), FO3 load, HP logic | 3.47 | 2.52 | 1.89 | 1.24 | 0.94 | 0.63 | 0.50
Figure 93: Cross section showing stacked wiring metal layers.
Figure 94: EDA tools applied to synthesisable RTL for a simulation (left) and the
synthesis to a gate-level or structural netlist (right).
Figure 96: A structural RTL example (left) and the net-level circuit it defines (a
divide-by-five Johnson counter, right).
Figure 97: A combinational RTL example (top) and the naive net-level circuit it
defines (bottom).
Figure 98: Example RTL fragment, before and after flattening. For many designs,
the flattened netlist is often bigger than the hierarchic netlist owing to multiple
instances of the same component. Here it was smaller.
Figure 99: Schematic symbol, Verilog model, timing diagram and possible
implementation for a dual-edge-triggered RS flop.
Figure 100: Simple synthesisable Verilog examples, including a transparent latch.
Figure 101: Four-value logic-level encoding and its behaviour for six common gates.
Figure 102: EDS event queue, which is a linked list, sorted in ascending temporal
order.
Figure 103: Behaviour of a runt pulse in an RS latch when modelling with transport
delay.
Figure 109: Essence of logic synthesised for integer division of a 32-bit value by
the constant 10 using just adders, based on 8/10 being 0.11001100 recurring in binary.
A logic synthesiser can create similar bespoke divide circuits for any constant denominator.
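The same constant-division trick can be written in software with shifts and adds. The following is the well-known 32-bit unsigned divide-by-10 sequence, shown as a C++ sketch for illustration; it is not the synthesiser's actual adder netlist.

#include <cstdint>

// Build q ~= n * 0.110011... (binary, i.e. n * 8/10), shift right by 3 to obtain
// an estimate of n/10 that is at most one too small, then apply a final correction.
uint32_t divu10(uint32_t n)
{
    uint32_t q = (n >> 1) + (n >> 2);        // q ~= n * 0.11
    q = q + (q >> 4);                         // q ~= n * 0.110011
    q = q + (q >> 8);                         // extend the recurring pattern
    q = q + (q >> 16);
    q >>= 3;                                  // q ~= n/10, possibly one too small
    uint32_t r = n - (((q << 2) + q) << 1);   // r = n - 10*q
    return q + (r > 9);                       // correct the estimate
}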
Figure 110: AND-OR-INVERT gate, logical function (left) and actual circuit when
realised in CMOS (right).
module SSRAM( input clk,                 // Synchronous reads and writes
              input ren,                 // Read enable (optional)
              input wen,                 // Write enable
              input [14:0] addr,         // Address input
              input [31:0] wdata,        // Write data in
              output reg [31:0] rdata);  // Read data out

  reg [31:0] myram [32767:0];            // 32k words of 32 bits each

  always @(posedge clk) begin
    if (ren) rdata <= myram[addr];
    if (wen) myram[addr] <= wdata;
  end
endmodule
Figure 111: Typical RTL coding style for RAM inference. Data out is registered once
without otherwise being used and the same subscript expression is used in both the
read and write contexts.
Figure 112: Scatter plot of area versus operating frequency for a design (Arm
Cortex-A9 Falcon) on a common process node and cell library for different core
utilisation ratios. The best performers are highlighted with black crosses (and joined
together with the black dashed line).
Figure 113: FinFET invertor area versus number of fins for different cell heights at a
process node.
Figure 114: Scatter plot of area versus operating frequency for a given design (Arm
Cortex-A9 Falcon) and standard cell libraries on a common process node that vary
in cell height, measured in track equivalents.
Figure 115: A broad-brush classification of digital ICs
into full-custom, semi-custom and field-programmable classes with some example
device roles.
Figure 116: SoC application diagram for an automotive headlight controller. A heavy
load is connected to a controller-area network (CAN bus).
Figure 117: Two-input NOR gate, circuit schematic (left) and several layers of a
standard cell layout (right). Blue is the bottom layer of metal, red is polysilicon,
green is channel diffusion and yellow is the P well in which the N-channel FETs are
constructed.
Figure 118: Typical cell data sheet from a human-readable version of a standard cell
library.
Table 16: Example of process, voltage and temperature (PVT) ranges.
Parameter | Values
Process variation | 0.9 to 1.1
Supply voltage range | 0.85 to 1.1 V
Temperature range | 0 to 70 °C
Figure 119: Delay versus output loading capacitance for various input transition
rates.
Figure 120: Input capacitance is augmented or diminished due to the Miller effect.
It is augmented by the opposite-moving plate potentials arising from an inverting
configuration, which is the common case.
Figure 121: Real-world gate, with a linear slew input waveform, showing the timing
behaviour of , and .
Figure 122: Metal layers in a mask-programmed gate array (ECL codec for fibre optic
ring network [?]).
Figure 123: Transmission gate or bilateral switch: internal wiring (left) and
schematic symbols (right).
Figure 124: An active multiplexor (a) needs more silicon area than a pass-transistor
multiplexor (b), but restores logic levels rather than degrading the signal. Larger
pass-transistor multiplexors (c) are efficient and easy to lay out.
Figure 125: FPGA, showing the I/O blocks around the edge, the interconnection
matrix blocks and the configurable logic blocks. Recently, the regular structure has
been broken up by custom blocks, including RAM and multiplier (aka DSP) blocks.
Figure 126: CLB (left) and an IOB (right) for a simple FPGA.
Figure 128: Power and floor plan for a simple SoC along with the external supply,
which uses battery-backed RAM (left) and buck regulators (right). PSUC: power
supply controller.
Figure 129: Macroscopic back-end flow highlighting the verification flow paths.
Figure 86 is a detailed flow diagram.
Figure 130: Standard cell polygons for a two-input NOR gate. The abstract view (left)
identifies the power connections (top and bottom strips) and the input and output
contact sites (denoted with asterisks) on the M1 and M2 metal layers, respectively
(shades of blue). The full polygon set (right) is required for final design rule checking
and tapeout. In reality, the right-hand polysilicon layer (red) could be under the M2
output strip but is here shown displaced for clarity.
Figure 131: A placed horizontal strip of standard cells showing the polygons
significant for routing, as rendered in a layout editor. The power and ground rails are
a distinctive feature.
Figure 132: Snapshot from a place-and-route tool, showing hundreds of interlaced
wires that are routed with several metal layers, both vertically and horizontally.
Figure 133: Enlargement of Figure 132, showing several wires routed with several
metal layers (one colour for each layer), vertically and horizontally. Vias are
marked with an X.
000 000 0 001 111 1
123 456 7 890 123 4
[ 00H 00H p H00 x00 p ]
[ 01H 00H p H00 x00 p ]
[ 10H 00H p H00 x00 p ]
[ 11L 00H p H00 x00 p ]
Figure 134: Pin connections for a 7400 quad NAND gate device and part of a test
program.
Figure 135: A wafer (6 to 10 inches diameter) is diced into chips (1 cm on a side or
so).
Figure 136: Load card with wafer probe pins for testing a chip before the wafer is
diced.
Figure 137: General configuration of a wafer probe testing machine. The robotic
staging system delivers wafers to and from conveyor systems as well as steps the
probes over the wafer.
Figure 138: Probe needles just visible inside the testing load card.
Figure 139: Plastic chip package.
Figure 140: An FPGA composed of several chiplets with broadside inter-chiplet
bonding, connected closely to several DRAM chips in an MCM.
Figure 141: Example of structural Verilog RTL that instantiates disconnected
standard cells as a ‘sewing kit’.
Table 17: Simplistic and rough estimates of recurring (RE) and non-recurring
expenses (NRE) for the first production run of wafers.
Type of expense | Item | Item cost | Total cost
NRE | 6 months: 10 software engineers | $100k pa | $500k
NRE | 6 months: 10 hardware engineers | $250k pa | $1250k
NRE | 4 months: 20 verification engineers | $200k pa | $1333k
NRE | 1 mask set (22 nm) | $1500k | $1500k
RE | Per device IP licence fees | ? | $??
RE | 6-inch wafer | $5k | $5k
Total | | | $4583k + 5k
Table 18: Die yield. The cost for a working die given a 6-inch wafer with a processing
cost of $5000 and a probability of a square millimetre being defect-free of 99.55 per
cent.
Area (mm²) | Number of wafer dies | Number of working dies | Cost per working die ($)
2 | 9000 | 8910 | 0.56
3 | 6000 | 5910 | 0.85
4 | 4500 | 4411 | 1.13
6 | 3000 | 2911 | 1.72
9 | 2000 | 1912 | 2.62
13 | 1385 | 1297 | 3.85
19 | 947 | 861 | 5.81
28 | 643 | 559 | 8.95
42 | 429 | 347 | 14.40
63 | 286 | 208 | 24.00
94 | 191 | 120 | 41.83
141 | 128 | 63 | 79.41
211 | 85 | 30 | 168.78
316 | 57 | 12 | 427.85
474 | 38 | 4 | 1416.89
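The figures in Table 18 follow, approximately, from a simple model: a die of area A mm² is defect-free with probability 0.9955^A, so the cost per working die is the wafer cost divided by the expected number of working dies. A hedged C++ sketch of this calculation (function and variable names are illustrative; the book's exact die-count model may differ slightly):

#include <cmath>
#include <cstdio>

// cost_per_good_die = wafer_cost / (dies_per_wafer * p_defect_free_per_mm2 ^ area_mm2)
double cost_per_working_die(double wafer_cost, double dies_per_wafer,
                            double area_mm2, double p_defect_free_per_mm2)
{
    double working = dies_per_wafer * std::pow(p_defect_free_per_mm2, area_mm2);
    return wafer_cost / working;
}

int main()
{
    // First row of Table 18: 2 mm^2 dies, 9000 per wafer -> roughly $0.56 each.
    std::printf("%.2f\n", cost_per_working_die(5000.0, 9000.0, 2.0, 0.9955));
    return 0;
}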
Figure 142: The three main timing specifications for a sequential cell, such as a
flip-flop (repeated from Figure 270).
Figure 143: An example circuit with static timing annotations for maximum GBA
analysis.
Figure 144: CMOS chip structure, highlighting layers most affected by FEOL and
BEOL variations.
Table 19: Examples of process corners.
String | Meaning
ss_0p9v_m40c | Slow P and N channel transistors at 0.9 V and −40 °C
tt_1p0v_25c | Typical P and N channel transistors at 1.0 V and room temperature
ff_1p1v_125c | Fast P and N channel transistors at 1.1 V and 125 °C
Table 20: Examples of BEOL corners.
BEOL corner | Meaning
C or C | Narrow wires with wide spacing for the smallest capacitance component
RC or RC | Thick wires with less resistance to minimise the RC product and net delay
Typical | Wires and vias meet the target dimensions
RC or RC | Thin wires with more resistance to maximise the RC product and net delay
C or C | Wide wires with narrow spacing for the largest capacitance component
Figure 145: Troublesome early and late path configurations for maximum (left) and
minimum (right) timing. Late path components are shaded dark.
Figure 146: A logic structure with a gated clock that could suffer in an FS process
corner (left) and a combinational logic circuit with two paths that differ in their
levels of inversion.
Figure 147: An example of an SDC file. TCL commands are used to specify and
constrain the STA behaviour.
Figure 148: Global and localised views of process distributions.
Figure 149: LVF path delay distribution and sigma multiplier.
Figure 150: Statistical shift due to bias temperature instability for various bias
times.
Figure 151: Comparing simulation (left) with a formal proof of correctness (right).
assert(x<4); x := x + 1000; assert(x<1004);
Figure 152: Examples of imperative or procedural safety assertions in a software
language such as C++. These conditions must hold when reached by the program’s
flow of control.
Figure 153: FSM transition diagram with liveness, fairness and deadlock
indications.
Figure 154: Two simple PSL assertions.
Table 21: The three principal regular expression operators and concise derived
shorthands.
{A;B} (core): Semicolon denotes sequence concatenation
{A[*]} (core): A postfix asterisk denotes arbitrary repetition
{A|B} (core): Vertical bar (stile) denotes alternation
{A[+]} (derived): One or more occurrences of A
{A[*n]} (derived): Repeat n times
{A[=n]} (derived): Repeat n times, non-consecutively
{A[->n]} (derived): As [=n] but ending on the last occurrence
{A:B} (derived): Fusion concatenation (the last of A occurs during the first of B)
Table 22: Summary of the main SERE temporal conjunction and sequencing
dyadic operators.
Simple conjunction (A & B): A and B finish matching at once
Length-matching conjunction (A && B): A and B occur at once with a common duration (length matching)
Simple conjunction (A within B): A occurred at some point during B
Strong positive sequencing (A until B): A held at all times until B started
Weak positive sequencing (A before B): A held before B held
Sequence implication (A |=> B): Whenever A finishes, B immediately starts
Fusion implication (A |-> B): The same, but with the first event of B coincident with the last event of A
Table 23: Some built-in primitive macros in PSL.
rose(X): X changed from zero to one
fell(X): X changed from one to zero
stable(X): X did not change
changed(X): X did change
onehot(X): X is a power of 2
onehot0(X): X is zero or a power of 2
Figure 155: Dynamic validation: Monitoring bus operation with an RTL checker
(top) or interface PCs (bottom).
Figure 156: Framed standard synchronous connection, with interface checker
instance (left) and allowable protocol transitions (right). Only allowed qualified
edges are shown. Any other qualified transitions are protocol errors.
module framed_standard_sync_monitor(
   input reset,
   input clk,          // Clock input. ALL CONNECTIONS ARE INPUTS!
   input req,          // Request signal
   input rdy,          // Ready signal, for the reverse direction
   input [7:0] data,   // Data bus
   input dfirst,       // First word of packet indicator
   input dlast);       // Last word indicator

   bit q1;
   integer error_flag;

   always @(posedge clk) if (reset) q1 = 0;
   else begin
      error_flag = 0;
      if (req && rdy && !q1) begin
         if (dfirst && !dlast) q1 = 1;   // Frame start
         else if (dlast && !dfirst) begin
            $display("%m: %1t: C2: End outside of frame.", $time);
            error_flag = 2;
         end
         else if (!dlast && !dfirst) begin
            $display("%m: %1t: C3: Byte outside a frame.", $time);
            error_flag = 3;
         end
      end
      else if (req && rdy && q1) begin
         if (!dfirst && dlast) q1 = 0;   // Frame end
         else if (dfirst && dlast) begin
            $display("%m: %1t: C1b: One-word frame during existing frame.", $time);
            error_flag = 1;
         end
         else if (!dlast && dfirst) begin
            $display("%m: %1t: C1a: Frame start during existing frame.", $time);
            error_flag = 1;
         end
      end
   end
endmodule
Figure 157: Example of a PC for the framed interface implemented using an RTL
state machine.
wire en = req && rdy;

// The transition from Q0 -> Q1 -> ... -> Q1 -> Q0:
sva_transaction: assert property (@(posedge clk)
   ( (en && dfirst && !dlast)
     |=> (!en || (!dfirst && !dlast))[*0:$] ##0 (en && !dfirst && dlast) ) )

// Forbid any exit from Q0 except with dfirst:
good_Q0: assert property (@(posedge clk)
   ( (en && dlast) || reset
     |=> (!(en && dfirst))[*0:$] ##0 (en && dfirst) ) )
Figure 158: Comparable example of the PC implemented with SVA. The ##0 form is
an idiomatic marker that merely separates successive SERE components.
Figure 159: A mitre compares the outputs from a pair of supposedly equivalent
combinational components.
Figure 160: A two-bit shift register (left) with a conventional design. By using a
clock-enabled flip-flop (centre), an alternative implementation is possible (right).
The state encoding is totally different, but the observable black-box behaviour is
identical.
Figure 161: An X-propagation mitre around two instances of a simple DUT. The
equivalence comparison is modified to ignore mismatches on the data bus if the
valid output from the upper instance does not hold. A counterexample, as found by a
model checker, is denoted in red. The alternative values that would be encountered
in a simulation are in blue.
Figure 162: An alternative DUT. The circuit produces a false negative under
X-propagation simulation. The blue annotations show CRV simulation values,
whereas model checker values, in red, correctly indicate that the DUT has passed
the test.
Figure 163: A formal glue shim around a data path component (a LIFO stack). This
enables symbolic verification using a basic model checker. All inputs on the left are
unconstrained, as is the pop input on the right. No stimulus pattern should make
the FAIL output hold.
Figure 164: A pulse generator: schematic symbol and timing waveforms.
Figure 165: Four of six permutations that implement a simplex protocol adaptor.
Each has the same signature of one input and two output ports of 32 and 8 bits,
respectively. More typically, duplex implementations are required.
Figure 166: Comparison of speed-of-light (LC) and RC-based Elmore delay models
for propagation along a net. Also shown is a real-world design rule for 16-nm
geometry.
Figure 167: Timing diagram for an asynchronous four-phase handshake.
Figure 168: Timing diagram for AHB bus write cycle with one wait state.
Figure 169: Timing diagrams for the synchronous LocalLink protocol. Left:
Back-to-back transfer of words because the destination is always ready. Right:
Pauses to match the acceptance rate of the destination. Note that all control signals
are active low (denoted with the _n RTL suffix) in LocalLink.
Figure 170: BVCI core nets. Arrows indicate signal directions on the initiator. All of
these are reversed for the target.
Figure 172: BVCI protocol: Response phase timing diagram. Operations are
qualified with the conjunction of req and ack. Response and acknowledge cycles
maintain their respective ordering. Bursts are common. Successive addressing may
be implied.
Figure 173: Out-of-order bus formed from a pair of in-order busses with added tags
(blue).
Figure 174: Augmenting the tag width through successive multiplexing stages.
Figure 175: The baseline port structure for an A32D32 AXI standard initiator
showing the five temporally floating channels, two for reads and three for writes.
Parity bits may be present on any of the busses. Writing an address and writing data
operate close to lockstep for single-word writes, but are split to support multi-word
bursts. Signal directions are reversed for a target.
Table 24: Major predefined AMBA AXI bus standards and profiles. Within each
profile, there can be variations in address, data, tag and other bus widths per
instance.
Profile | Channels | Other nets | Description
AXI3 | AR+R, AW+W+B | Tag ID, WLanes | Bursts 1–16 beats
AXI4 | AR+R, AW+W+B | Tag ID, WLanes, QoS | Bursts 1–256 beats
AXI4-Lite | AR+R, AW+W+B | – | No burst transfers. No byte lanes
AXI4-Stream | W | – | Simplex. No addressing. Unrestricted length
AXI ACE | All of AXI4 | AC+CR+CD | Cache coherency extensions
ACE5-Lite | All of AXI4 | AC+CR+CD | Single beat. Out-of-order responses
Figure 176: A basic SoC bus structure for the MSOC1 protocol. One initiator
addresses three targets (high-level view and detailed wiring).
Figure 177: Example where one of the targets is also an initiator (e.g. a DMA
controller).
Figure 178: A system design using three bridged busses. Each main bus has its own
primary initiator (pink), which is typically a CPU, but the bus bridges (green) also
initiate transactions.
Figure 179: Bidirectional bus bridge for the MSOC1 protocol. It has a pair of
back-to-back simplex bridges. It could be a single IP block on a single chip, or the
two halves could be on different chips with a SERDES (§??) serial link between them.
Figure 180: Crossbar connectivity implemented in three different ways: with a
high-capacity TDM bus (left), crossbar elements (centre) and multiplexors (right).
Figure 181: Multiphase (split transaction) bus demultiplexor and remultiplexor for
responses (left). Three of these structures could be used to implement the bus fabric
hub (right).
Figure 182: Two folded busses that have similar component counts. For each, the
number of switching elements grows linearly with the number of end points (EP).
The linear form (left) is suitable for multi-chip use, but the tree form (right) has lower
latency. The receive half of the tree form often has a tree structure running exactly
parallel to the transmit half, but here it is shown flattened to almost a bus.
Figure 183: A radix-3 switching element for a NoC using a broadcast bus at the link
bandwidth and credit-based flow control.
Figure 184: Example of a NoC fabric using radix-3 switching in a unidirectional torus
mesh.
Figure 185: Illustration of cut-through routing on a 2-D mesh NoC (left), wormhole
operation (centre) and potentially deadlocking routes (right). The thin black lines
are the NoC links, which are bidirectional. The thick coloured lines are the routes
chosen for a packet. The white marks indicate the end of a packet.
Figure 186: Four elements in a 2-D mesh. Two transactions, red and green, are in a
fabric deadlock, since unconstrained cut-through switching with wormhole routing
has allowed each to block the other. The dashed parts cannot complete.
Figure 187: The eight possible 2-D turns (left), the allowable 2-D turns with
Y-before-X (north–south first) routing (centre) and an example set of turns used
by two transactions that obey the north–south first global policy but still manage
to deadlock owing to interference between requests and responses. The turns that
violate the global policy are marked with an exclamation mark. BLA: bottom left
anticlockwise; BRA: bottom right anticlockwise; TRA: top right anticlockwise; TLA:
top left anticlockwise; BLC: bottom left clockwise; BRC: bottom right clockwise; TRC:
top right clockwise; TLC: top left clockwise.
Figure 188: Sample behavioural traffic specification containing a load profile and a
transaction phase dependency.
Figure 189: One possible structure using link-level credit-based flow control,
showing tolerance to pipeline stages in the interconnect nets. The forward and
return paths need not be matched in delay terms.
Figure 190: Two end points interconnected via some number of switching elements.
The AMBA 5 CHI protocol layers are shown.
Figure 191: A minimal application of the AMBA 5 CHI specification between a
requester and a completer (left) and full net-level details of the six channels found
on a more-typical request node (RN) (right).
Table 25: Some (simplified) synthetic traffic generation vectors and their
descriptions.
No. | Name | Description
1 | Rate: open loop; Flows: all to all; Length: 8; Spacing: regular | Average-rate injection from all ingress ports to all egress ports of 8-byte payloads, with no burstiness.
2 | Rate: open loop; Flows: all to one; Length: 8; Spacing: regular | Average-rate injection from all ingress ports to one egress port, with no burstiness.
3 | Rate: saturated; Flows: all to all; Length: 8; Spacing: regular | Injection at peak capacity from all ingress ports to all egress ports, with no burstiness.
4 | Rate: open loop; Flows: all to all; Length: 8; Spacing: random | Average injection rate with random delays between injections, from all ingress ports to all egress ports.
5 | Rate: open loop; Flows: all to all; Length: variable; Spacing: regular | Average injection rate from all ingress ports to all egress ports, with variable-length packets.
6 | Rate: closed loop; Flows: all to all; Length: 32; Spacing: regular | An ingress port only generates a new message after the previous response. All packets are long (32 bytes).
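As a rough sketch of how such vectors might be encoded for a traffic generator, the following C fragment defines one descriptor per row of Table 25; the type names, enumerations and the idea of a static vector array are invented for illustration and are not part of any particular tool.

/* Hypothetical encoding of one synthetic traffic vector (one row of Table 25). */
typedef enum { RATE_OPEN_LOOP, RATE_SATURATED, RATE_CLOSED_LOOP } rate_mode_t;
typedef enum { FLOWS_ALL_TO_ALL, FLOWS_ALL_TO_ONE } flow_pattern_t;
typedef enum { SPACING_REGULAR, SPACING_RANDOM } spacing_t;

typedef struct {
    rate_mode_t    rate;
    flow_pattern_t flows;
    int            payload_bytes;   /* 0 could denote variable length */
    spacing_t      spacing;
} traffic_vector_t;

/* Vectors 1, 3 and 6 from Table 25, expressed with this encoding. */
static const traffic_vector_t vectors[] = {
    { RATE_OPEN_LOOP,   FLOWS_ALL_TO_ALL, 8,  SPACING_REGULAR },
    { RATE_SATURATED,   FLOWS_ALL_TO_ALL, 8,  SPACING_REGULAR },
    { RATE_CLOSED_LOOP, FLOWS_ALL_TO_ALL, 32, SPACING_REGULAR },
};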
Figure 192: A ring network. This is a low-complexity NoC structure.
Figure 193: Regular cubic structures with 1, 2, 3 or 4 dimensions. 5-D and above are
hard to draw.
Figure 194: The butterfly (or shuffle) network topology (left) and a flattened version
(right), with some example paths highlighted for comparison. The vertical links in
the flattened form are bidirectional.
Figure 195: Example of a switching fabric that uses radix-2 elements and the delta
wiring pattern. Interchanging the initiators and targets is equally valid.
Figure 196: A one-place pipelined FIFO stage using credit-based flow control. A
multi-place FIFO stage would replace the synchronous S/R flops used for the initial
credit and backlogged state with counters, which result in better density than
cascading one-place stages. (The vertical line on the S input denotes that setting
has priority over resetting when both are asserted.)
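The credit mechanism of Figure 196 can also be modelled behaviourally. The C sketch below is an abstract software approximation, not the RTL of the figure: the sender holds a credit counter initialised to the receiver's buffer depth, decrements it on each word launched and increments it when a credit return is seen on the reverse path. The function and type names are assumptions for illustration.

#include <stdbool.h>

/* Behavioural model of one credit-counted link (sender side).
   depth is the number of buffer places at the receiver. */
typedef struct { int credits; } credit_link_t;

void link_reset(credit_link_t *l, int depth) { l->credits = depth; }

/* Returns true if a word may be launched this cycle. */
bool link_try_send(credit_link_t *l)
{
    if (l->credits == 0) return false;  /* All receiver places may be full. */
    l->credits--;                       /* One place is now provisionally occupied. */
    return true;
}

/* Called when a credit return arrives on the reverse path,
   i.e. the receiver has freed one buffer place. */
void link_credit_return(credit_link_t *l) { l->credits++; }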
Figure 197: Metastability illustrated by a pencil balancing on a razor’s edge (left).
The essential structure of a transparent bistable (latch) and a transfer function
(centre). The gate signal and measured responses from a transparent latch with an
input wired to a voltage source close to the metastable point (right).
Figure 198: Generic structure of a simplex CBRI. Parallel data are reliably sent
between CDs.
Figure 199: Timing diagram (top) and a basic structure for two systems that use
harmonically locked clocks (bottom). Relative edge timings that require tight phase
control are highlighted.
Figure 200: Net-level view of the AMBA P-channel interface for device power
control.
Figure 201: Main components of an 8b10b, block-coded
SERDES (serialiser/deserialiser) transceiver (transmitter/receiver) operating over a
twisted-pair channel.
Table 26: The four quadrants of a computer.
Control unit | Execution unit
Primary storage | I/O devices
Figure 202: Transactional-level model (TLM) of the simplest computer (left) and
code fragments (right).
Figure 203: Schematic symbol and external connections for a tri-state version of a
simple microprocessor.
Table 27: Net-level connections.
Connection | Direction | Use
data[7:0] | I/O | Bidirectional data bus
addr[15:0] | Output | Selection of internal address
hren | Output | Asserted during a data read from the target to the host
hwen | Output | Asserted during a write of data from the host to the target
ack | Input | Asserted when the addressed device has completed its operation
Figure 204: A simple A16D8 microcomputer structure. A microprocessor initiates
all transactions on a bidirectional/tri-state data bus, which is connected to all other
components.
Figure 205: Connections to memory
Table 28: Memory map.
Start | End | Resource
0000 | 03FF | ROM (1 kbyte)
0400 | 3FFF | Unused images of ROM
4000 | 7FFF | RAM (16 kbytes)
8000 | BFFF | Unused
C000 | C007 | Registers (8) in the UART
C008 | FFFF | Unused images of the UART
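A minimal C sketch of an address decoder matching Table 28 follows; the enumeration names are invented for illustration. The 'unused images' in the map arise because only a few high-order address bits are decoded, so the ROM and UART respond at every alias of their base addresses.

#include <stdint.h>

typedef enum { SEL_ROM, SEL_RAM, SEL_UART, SEL_NONE } select_t;

/* Decode a 16-bit address according to the map in Table 28.
   Only the top address bits are examined (partial decode), so the
   ROM and UART also appear at their otherwise unused images. */
select_t decode(uint16_t addr)
{
    if (addr <= 0x3FFF) return SEL_ROM;   /* 0000-03FF plus images up to 3FFF */
    if (addr <= 0x7FFF) return SEL_RAM;   /* 4000-7FFF */
    if (addr >= 0xC000) return SEL_UART;  /* C000-C007 plus images up to FFFF */
    return SEL_NONE;                      /* 8000-BFFF unused */
}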
Figure 206: MSOC1 reference bus protocol, with read and write examples.
Table 29: Net-level connections of an MSOC1 port in an A32D32 system.
Connection | Direction | Use
addr[31:0] | Output | Selection of internal address; not all 32 bits are used
hwen | Input | Asserted during a write from the host to the target
hren | Input | Asserted during a read from the target to the host
wdata[31:0] | Input | Data to a target when writing or storing
rdata[31:0] | Output | Data read from a target when reading or loading
interrupt | Output | Asserted by the target when it needs attention
Figure 207: Structure of a typical microcontroller, a single-chip microcomputer.
Figure 208: Hitachi HD614080 microcontroller chip from 1980. Such devices were
often in very large, dual in-line (DIL) packages to make a large number of GPIO pins
available.
Figure 209: Contact plate for a smart card. The card reader supplies VCC power,
clock and reset. Inputs and outputs are then made via the 1-bit bidirectional data
pin.
Figure 210: An invertor viewed at various levels of abstraction.
Figure 211: Front-end flow in SoC design.
Figure 212: Main PCB of an ADSL home modem.
Figure 213: Main PCB structure of an ADSL home modem.
Figure 214: Overall design and manufacturing flow for a SoC.
Figure 215: General internal view of a mobile phone (left) and views of both sides
of the main circuit board (centre and right). Highlighted in red are the main SoC,
which contains several Arm processors, and a multi-chip module containing several
memory chips (a proprietary mix of DRAM, SRAM and flash).
Figure 216: An Apple SoC with two Arm and three GPU cores. It was made by
arch-rival Samsung.
Figure 217: A platform chip example: the Virata Helium 210, as used in many ADSL
modems.
Figure 218: Helium chip as part of a home gateway ADSL modem (partially masked
by the 802.11 module).
Figure 219: Another platform chip intended for use in thin-client display devices,
such as a tablet.
Figure 220: Moore’s law (§??).
Figure 221: TLM connection diagrams for CPU and memory configurations: (a)
direct connection, (b) with a cache, (c) with a split cache and (d) two cores with a
shared L2 cache. The fan-in to the TLM socket denotes that the bandwidth is shared.
In reality, debug and management ports also need to be connected.
Figure 222: TLM diagrams for a CPU core with integrated first-level caches and
external coprocessor connections.
Figure 223: Main data paths in a generic five-stage RISC microarchitecture,
excluding the back sides of the cache and MMU. Instruction field bit-extracts to
control the multiplexors and ALU are not shown.
Figure 224: Key components of a super-scalar CPU core, excluding the back sides of
the cache and MMU.
Figure 225: TLM connection diagrams for a CPU and memory management unit
(MMU): (a) naive view and (b) one possible cache arrangement.
Figure 226: A generalised PRAM model of computing. A number of synchronous
processors, each with some private local store, make random access and read and
write operations on shared memory that has many atomic locations of some size.
The interconnect implements some variant of coherence (value and sequential
consistency).
Figure 227: Two possible configurations for scratchpad memory.
Figure 228: Typical arrangement of virtual and physical mapping with L1, TLB and
L2. Data are served from L1 or L2, but are unlikely to be served directly from primary
storage. Instead, update mechanisms (not shown) are used. Four-way associativity
is illustrated, but a higher degree is typically used, to increase both the hit ratio and
capacity, since it is generally required that the directly mapped axis is not wider than
the virtual memory page size.
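The constraint in the caption (the directly mapped axis must not be wider than the page size, so that the L1 index can be taken from untranslated address bits while the TLB lookup proceeds in parallel) reduces to a one-line check, sketched below in C; the cache and page sizes used are assumptions for illustration only.

#include <stdbool.h>
#include <stdio.h>

/* A virtually indexed, physically tagged L1 can be indexed with
   untranslated address bits only if cache_bytes / ways <= page_bytes. */
bool vipt_index_fits_in_page(unsigned cache_bytes, unsigned ways, unsigned page_bytes)
{
    return (cache_bytes / ways) <= page_bytes;
}

int main(void)
{
    /* Example: 32 KB, 8-way L1 with 4 KB pages -> 4 KB per way, which fits. */
    printf("%d\n", vipt_index_fits_in_page(32 * 1024, 8, 4096));
    /* Example: 64 KB, 4-way L1 with 4 KB pages -> 16 KB per way, which does not. */
    printf("%d\n", vipt_index_fits_in_page(64 * 1024, 4, 4096));
    return 0;
}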
Figure 229: An example system using a snoop control IP block from Arm (Cortex A9
family).
Figure 230: MOESI-like state diagram used in the Arm ACE and CHI protocols.
Figure 231: Three I/O blocks
connected to a CPU, memory
and an interrupt controller.
Figure 232: Three I/O blocks with flexible
interrupt distribution in a multi-core
system.
Figure 233: Interrupt generation: general structure within a device and at system
level.
Table 30: Principal characteristics of memory technologies currently used for
booting, caches, primary storage and secondary storage.
Memory | Volatile | Main applications | Implementation
ROM | No | Booting, coefficients | Content set by a tapeout mask
SRAM | Yes | Caches, scratchpads, FIFO buffers | One bistable (invertor pair) per bit
DRAM | Yes | Primary storage | Capacitor charge storage
EA-ROM | No | Secondary storage | Floating-gate FET charge storage
Memristive | No | Next generation | Electrically induced resistance changes
Figure 234: Floorplan visualisation of a 3D implementation of the Arm Neoverse
N1 design using two chips, vertically stacked. The cache memories are placed on
the lower piece of silicon, shown in monochrome, whereas the logic for the ALUs,
register files and everything else is shown in colour on top. The 1MB L2 label is the
L2 logic and not L2 memory arrays.
Figure 235: Low-aspect-ratio RAM array showing the shared peripheral circuitry,
which includes the word-line decoder, sense amplifiers, tri-state buffers and I/O
multiplexor. Externally, it offers some number of words of a fixed bit width; internally,
it uses fewer, wider rows of bit cells, with the I/O multiplexor selecting the required
word from each row.
Figure 236: Mask-programmed NMOS ROM structure providing a number of locations,
each holding a multi-bit word.
Figure 237: Transistor-level view of a standard six-transistor (6T) SRAM cell. M1,
M2, M3 and M4 are used for storage. M5 and M6 are used to access the cell for read
and write operations.
Figure 238: SSRAM with a single port, showing the logic symbol (left) and internal
RTL model (right).
Figure 239: 8T SRAM cell with true dual-porting.
Figure 240: Dual-ported SSRAM logic symbol.
Figure 241: Micrograph of a simple DRAM chip circa 1994. This is a Micron
Technology MT4C1024 device with organisation 1024x1024x1.
Figure 242: Left: pinout for a typical asynchronous DRAM, such as the MT4C1024
pictured above, which is controlled by the RAS and CAS signals instead of having a
clock input. Right: basic internal structure schematic. This device only has a single
bank so there are no bank address inputs.
Table 31: DRAM address terminology and hierarchy with typical sizes.
Quantity | Aggregate capacity | Description
1 channel | 16 GB | A physical bus: 64 data bits, 3 bank bits and 14 address bits
4 DIMMs | 16 GB | Multiple DIMMs are connected on the PCB to one channel
1 rank | 4 GB | A number of logical DIMMs within a physical DIMM
16 chips | 4 GB | This DIMM uses 16 4-bit chips making a 64-bit word
Lanes/chip | 4 bit lanes = 1 GB | Each chip serves a word 4 bits wide
8 banks | Gbit | Each bank has its own bit-cell arrays (simultaneously open)
Rows | 64 Mbit | A page or row is one row of bit cells in an array
(Burst) | 8 words = 64 bytes | The unit of transfer over the channel
Columns | 16 kbit | The data read/write line to a bit cell
Figure 243: 4-Gbyte DRAM dual in-line memory module (DIMM) for a laptop
computer. Eight chips are mounted on a small PCB, four on each side. A label on the
back says ‘1600 11-11-11’. More detailed information, including the supply voltage,
is stored electronically in a small serial presence detect (SPD) ROM in the centre of
the DIMM.
Figure 244: Internal block diagram of a 4-bit-wide DRAM device. There are four
banks, although eight are normal. (The data strobe (DS) signal is omitted for clarity.)
Table 32: Typical DIMM connections.
Signal | Function
Clock | Clock (400 MHz)
RAS- | Row address strobe
CAS- | Column address strobe
WE- | Write enable
DQ[63:0] | Data in/out
Reset | Power-on reset
WQ[7:0] | Write-lane qualifiers
DS[7:0] | Data strobes
DM[7:0] | Data masks
CS- | Chip select
MAddr[15:0] | Address input
BK[2:0] | Bank select
spd[3:0] | Serial presence detect
Figure 245: NOR ROM block diagram (left), whose capacity in bits is the product of
its array dimensions, and alternative NAND ROM array detail (right). For a ROM more
than 1 bit wide, the Dout connection becomes a bus of the word width and the
low-order address bus is correspondingly narrower.
Figure 246: EPROM device from the 1980s. The silicon die is visible through the top
window.
Figure 247: Schematic of a possible bit cell for EPROM and EEPROM. There is a kink
on the top of the extra gate of this floating-gate transistor to denote that tunnelling
is possible.
Figure 248: Typical increase in benchmark performance in response to L3 cache size
enlargement for L2 = 64 KB (blue), L2 = 128 KB (yellow) and L2 = 256 KB (red).
Figure 249: Typical I/O ports. Shown are two serial ports, one parallel port, one
Ethernet port, two USB ports and three audio ports.
Figure 250: Two typical configurations for a serial port using a UART. Inset: Serial
port timing diagram. The old wiring to a D9 connector for an RS-232 port is today
often replaced with a very short connection to a second UART inside a USB dongle
for connecting to a laptop.
Figure 251: Timing diagram for an asynchronous four-phase handshake.
// Programming model
reg [31:0] ddr;       // Data direction reg
reg [31:0] sfunction; // Special mode
reg [31:0] dout;      // Output register
reg [31:0] imask;     // Interrupt mask
reg [31:0] ipol;      // Interrupt polarities
reg [31:0] pins_r;    // Registered pin data
reg int_enable;       // Master int enable (for all bits)

always @(posedge clk) begin
  pins_r <= pins;
  if (hwen && addr==0)  ddr        <= wdata;
  if (hwen && addr==4)  sfunction  <= wdata;
  if (hwen && addr==8)  dout       <= wdata;
  if (hwen && addr==12) imask      <= wdata;
  if (hwen && addr==16) ipol       <= wdata;
  if (hwen && addr==20) int_enable <= wdata[0];
end

// Tri-state buffers
bufif1 b0(pins[0], sfunction[0] ? special[0] : dout[0], ddr[0]);
... // 30 others here
bufif1 b31(pins[31], sfunction[31] ? special[31] : dout[31], ddr[31]);

// Generally the programmer can read all the
// programming model registers, but not here
assign rdata = pins_r;

// Interrupt masking
wire int_pending = (|((pins_r ^ ipol) & imask));
assign interrupt = int_pending && int_enable;
Figure 252: Schematic and RTL implementation of 32 GPIO bits connected to an
MSOC1 bus.
// RTL for one channel of a typical timer
// Programmers' model state
reg int_enable, int_pending;
reg [31:0] prescaler;
reg [31:0] reload;

// Programmer-invisible internal state
reg ovf;
reg [31:0] counter, prescale;

// Host write operations
always @(posedge clk) begin
  if (hwen && addr==0) int_enable <= wdata[0];
  if (hwen && addr==4) prescaler  <= wdata;
  if (hwen && addr==8) reload     <= wdata;
  // Write to addr==12 to clear the interrupt
end
wire irq_clr = hwen && addr == 12;

// Host read operations
assign rdata = (addr==0) ? {int_pending, int_enable} :
               (addr==4) ? prescaler :
               (addr==8) ? reload : 0;

// A timer counts system clock cycles
// A counter counts transitions from an external input
always @(posedge clk) begin
  ovf      <= (prescale == prescaler);
  prescale <= (ovf) ? 0 : prescale + 1;
  if (ovf) counter <= counter - 1;
  if (counter == 0) begin
    int_pending <= 1;
    counter     <= reload;
  end
  if (irq_clr) int_pending <= 0;
end

// Interrupt generation
assign interrupt = int_pending && int_enable;
Figure 253: Schematic symbol for a counter/timer block and internal RTL for one
timer function.
// State for programmers' model
reg [31:0] count, src, dest;
reg int_enable, active;

// Other local state
reg [31:0] datareg;
reg intt, rwbar;

always @(posedge clk) begin // Target side
  if (hwen && addr==0) begin
    { int_enable, active } <= wdata[1:0];
    intt  <= 0;
    rwbar <= 1;
  end
  if (hwen && addr==4)  count <= wdata;
  if (hwen && addr==8)  src   <= wdata;
  if (hwen && addr==12) dest  <= wdata;
end
assign rdata = ... // Target readbacks

always @(posedge clk) begin // Initiator side
  if (active && rwbar && m_ack) begin
    datareg <= m_rdata;
    rwbar   <= 0;
    src     <= src + 4;
  end
  if (active && !rwbar && m_ack) begin
    rwbar <= 1;
    dest  <= dest + 4;
    count <= count - 1;
  end
  if (count==1 && active && !rwbar) begin
    active <= 0;
    intt   <= 1;
  end
end

assign m_wdata = datareg;
assign m_ren   = active && rwbar;
assign m_wen   = active && !rwbar;
assign m_addr  = (rwbar) ? src : dest;
assign interrupt = intt && int_enable;
Figure 254: A simple DMA controller: schematic symbol, example TLM wiring and
RTL for one channel.
Figure 255: Net-level schematic symbol for a DMA-capable network interface IP
block.
Figure 256: Structure of a simple frame store, RTL implementation and generated
timing waveforms.
Figure 257: Doorbell and mailbox block.
Figure 258: Example of parallel speedup. 35 units of work run across four servers,
showing dependency arcs typical in the map-reduce design pattern. Arcs implicitly
exist between all adjacent work unit boxes.
Figure 259: A schematic of a typical arbiter (left) and the RTL implementation (right)
for a three-port synchronous example using static priority with pre-emption. See
also Figure 31.
Figure 260: General structure of a queue/server pair with mean customer arrival rate λ
and mean service rate μ.
Figure 261: Plots of the average number of customers in a system versus the
utilisation for three common arrival/service disciplines.
Table 33: Formulae for the mean number of customers and the mean delay for each arrival/service discipline, in terms of the utilisation ρ = λ/μ.
Discipline
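For reference, the standard M/M/1 result, one of the disciplines usually tabulated here, gives the average number of customers as ρ/(1 − ρ). The short C sketch below evaluates that expression across utilisations; it is a textbook formula, not a reconstruction of the table's missing entries.

#include <stdio.h>

/* Average number of customers in an M/M/1 queue at utilisation rho (0 <= rho < 1). */
static double mm1_customers(double rho) { return rho / (1.0 - rho); }

int main(void)
{
    /* The knee of the curve: occupancy explodes as utilisation approaches 1. */
    for (double rho = 0.1; rho < 1.0; rho += 0.2)
        printf("rho=%.1f  N=%.2f\n", rho, mm1_customers(rho));
    return 0;
}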
Figure 262: Separate queueing for high-priority and low-priority traffic gives an
expedited service.
Figure 263: Generic switch with buffering at both the input and output ports of a
buffer-less switch fabric. Flow control operates between the switch fabric and the
buffers in the switch.
Figure 264: Two scenarios that demonstrate statistical multiplexing gain for N
sources. Sharing the channel bandwidth (a) raises the question of whether
reservations are made on a peak or average basis. If N is small, we should use peak
allocation, but for large N, we can use average allocation. The law of large numbers
states that N needs only to be about 50 for an average allocation to be acceptable.
Buffer pools can be partitioned (b) or shared (c).
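The peak-versus-average trade-off can be illustrated numerically. The C sketch below uses a Gaussian approximation for N independent on/off sources (the per-source rates, on-probability and margin are invented for the example): as N grows, the capacity required per source falls from near the peak rate towards the average rate, which is the statistical multiplexing gain.

#include <stdio.h>
#include <math.h>

/* Capacity needed so that the aggregate of n independent on/off sources
   (each at rate 'peak' when on, on with probability p) rarely overflows.
   Gaussian approximation: mean plus k standard deviations. */
static double capacity_needed(int n, double peak, double p, double k)
{
    double mean = n * p * peak;
    double sd   = sqrt(n * p * (1.0 - p)) * peak;
    return mean + k * sd;
}

int main(void)
{
    double peak = 10.0, p = 0.2, k = 3.0;   /* Illustrative numbers only. */
    for (int n = 4; n <= 256; n *= 4) {
        double c = capacity_needed(n, peak, p, k);
        printf("n=%3d  capacity=%7.1f  per-source=%5.2f (peak=%.0f, average=%.0f)\n",
               n, c, c / n, peak, p * peak);
    }
    return 0;
}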
Figure 265: A policed queue showing both input and output regulators, although
commonly only one site would be policed. Each policer has a number of rate
channels (e.g. three are shown for the input site).
int burst_tolerance, credit_rate; // Set up by PIO
int credit;                       // State variable

void reset() // Complete setup
{
  credit = 0;
  register_timer_callback(crediter, credit_rate);
}

void crediter() // Called at 1/credit_rate intervals
{
  if (credit < burst_tolerance) credit += 1;
}

bool police() // Check whether the operation is currently allowed
{
  if (credit == 0) return false;
  credit -= 1;
  return true;
}
Figure 266: Essence of a software implementation for one channel of a generic
traffic policer or regulator.
Figure 267: Abstract views of the principal axes that span the SoC design space for
a task. Parallelism can be traded for clock frequency (left), which shows two design
points. At a given clock frequency, the silicon area (and parallelism) can be traded
for execution time using a time/space fold/unfold (right).
Figure 268: Thermal management of a high-power chip. The primary heat sink
makes thermal contact with the chip. A heat pipe is connected to a second heat sink.
There is no other connection to the second heat sink.
Figure 269: Generic thermal circuit, showing on/off heat source, thermal node and
thermal resistance between the node and its heat sink.
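As a numerical illustration of the thermal circuit in Figure 269, the steady-state temperature rise of the node is the dissipated power times the thermal resistance to the heat sink, and the node approaches that value exponentially with time constant R times C. The C sketch below uses invented component values.

#include <stdio.h>
#include <math.h>

/* First-order thermal model: node temperature above ambient after time t
   when power P (W) flows through thermal resistance R (K/W) into
   thermal capacitance C (J/K). */
static double temp_rise(double P, double R, double C, double t)
{
    return P * R * (1.0 - exp(-t / (R * C)));
}

int main(void)
{
    double P = 5.0, R = 8.0, C = 2.0;   /* 5 W, 8 K/W, 2 J/K: illustrative only. */
    for (double t = 0.0; t <= 80.0; t += 16.0)
        printf("t=%4.0f s  rise=%5.1f K (steady state %.1f K)\n",
               t, temp_rise(P, R, C, t), P * R);
    return 0;
}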
Figure 270: Primary timing characteristics of a D-type flip-flop (left). Typical nature
of a critical path in a synchronous clock domain, indicating how the maximum clock
frequency (fmax) is calculated (right).
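The right-hand side of Figure 270 amounts to the standard timing identity: the clock period must cover the clock-to-Q delay of the launching flip-flop, the combinational logic delay and the setup time of the capturing flip-flop, plus any skew margin. A minimal C sketch with purely illustrative delays:

#include <stdio.h>

/* Maximum clock frequency implied by a critical path. All delays in ns. */
static double fmax_mhz(double clk_to_q, double logic, double setup, double skew)
{
    double period_ns = clk_to_q + logic + setup + skew;
    return 1000.0 / period_ns;   /* ns period -> MHz */
}

int main(void)
{
    /* Illustrative numbers: 0.2 ns clock-to-Q, 2.5 ns logic, 0.15 ns setup, 0.1 ns skew. */
    printf("fmax = %.0f MHz\n", fmax_mhz(0.2, 2.5, 0.15, 0.1));
    return 0;
}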
Figure 271: A circuit before (top) and after (bottom) insertion of an additional
pipeline stage.
Figure 272: Flip-flop migration. Two circuits that behave identically but which have
different state encodings.
Figure 273: Schematic and count sequence for a divide-by-five counter using
Johnson encoding.
Figure 274: Two code fragments using message-passing in shared memory with
explicit memory fences. The yield() call could be to the suspend primitive of
the operating system scheduler. Alternatively, if running on bare metal, it is a
dedicated instruction (called YIELD in the Arm ISA) that interacts with hardware
hyper-threading if present.
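A minimal C11 sketch of the pattern described in Figure 274, using explicit fences around a flag-and-payload hand-off; the variable names are assumptions and the busy-wait loop stands in for the yield() or YIELD hint mentioned in the caption.

#include <stdatomic.h>
#include <stdbool.h>

int data;                          /* Payload passed through shared memory. */
atomic_bool ready = false;         /* Flag written after the payload. */

void producer(int v)
{
    data = v;
    atomic_thread_fence(memory_order_release);   /* Payload visible before flag. */
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

int consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;   /* In a real system, a yield() call or YIELD instruction would go here. */
    atomic_thread_fence(memory_order_acquire);   /* Flag observed before payload read. */
    return data;
}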
Figure 275: SRAM memory bank structures for MCMs (top) with two
interconnection patterns inside the SoC (bottom).
Figure 276: Pseudo-dual porting of RAM (using write mirroring).
Figure 277: Typical structure of a 32-bit DRAM controller connected to DRAM
devices. The on-SoC controller manages a DRAM channel with one rank made of
eight individual DRAM chips. Each chip has eight banks, each holding four bit planes,
and each bit plane is a large array of bit cells. A DRAM is typically made available to various on-SoC subsystems
using a multi-way bus multiplexor.
Figure 278: Three possible arrangements of DRAM address fields within a physical
address. In the centre arrangement (b), the bank field is lower than its naive position
in the top arrangement (a). This improves load balancing over banks, but the field is
not moved so low that a cache line or burst transfer is split over banks. The rank and
channel fields can also be moved lower. Arrangement (c) shows a channel field that
has been split and dispersed.
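A hedged C sketch of arrangement (b) follows: the bits that address within a burst stay at the bottom, the bank bits sit just above them so that successive cache lines land in different banks, and the row bits stay at the top. The field widths are placeholders for illustration, not those of any particular controller.

#include <stdint.h>
#include <stdio.h>

/* Illustrative field widths (placeholders): 64-byte burst, 8 banks,
   1024 remaining columns, 8k rows; together they cover a 32-bit address. */
#define BURST_BITS  6
#define BANK_BITS   3
#define COL_BITS    10
#define ROW_BITS    13

typedef struct { unsigned row, col, bank, offset; } dram_addr_t;

/* Arrangement (b) of Figure 278, low to high: offset | bank | column | row. */
static dram_addr_t split(uint32_t pa)
{
    dram_addr_t a;
    a.offset = pa & ((1u << BURST_BITS) - 1);   pa >>= BURST_BITS;
    a.bank   = pa & ((1u << BANK_BITS)  - 1);   pa >>= BANK_BITS;
    a.col    = pa & ((1u << COL_BITS)   - 1);   pa >>= COL_BITS;
    a.row    = pa & ((1u << ROW_BITS)   - 1);
    return a;
}

int main(void)
{
    /* Successive 64-byte lines land in different banks, balancing load. */
    for (uint32_t pa = 0x12340000; pa < 0x12340000 + 3 * 64; pa += 64) {
        dram_addr_t a = split(pa);
        printf("pa=%08x row=%u bank=%u col=%u\n", pa, a.row, a.bank, a.col);
    }
    return 0;
}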
Figure 279: (a) Lithium-ion battery for a mobile phone (3.7 V, 1650 mA h, 6 W h or
22 kJ), external view. (b) Typical internal structure. (c) Two sugar cubes.
Figure 280: Lumped-element electrical equivalent modelling of dynamic power use
of a CMOS SoC. The energy drawn from the battery each clock cycle is essentially
‘wasted’ as heat in the distributed resistance of the active parts of the SoC.
Figure 281: A generic CMOS invertor structure shown with explicit parasitic
resistances and lumped-equivalent output loading. This illustrates the primary
electric paths in CMOS logic.
Figure 282: A tri-state data line showing driving gates, sensing buffers and a bus
holder (or keeper). The bus holder has a weak output driver, denoted ‘W’, which acts
like a series resistor to reduce the output current during transient bus fights (§??).
Figure 283: Logic net with a single source and three loads, showing tracking and
input load capacitances.
Figure 284: Dynamic energy use for various technologies.
Figure 285: Fanout 4 (FO4) delay specification (left) and CMOS logic propagation
delay versus supply voltage (right).
// spice-cmos-inverter-djg-demo.hsp
// Updated 2017 by DJ Greaves
// Based on demo by David Harris harrisd@leland.stanford.edu

// Declare global supply nets and connect them to a constant-voltage supply.
.global Vdd Gnd
Vsupply Vdd Gnd DC 'VddVoltage'

///////////////////////////////////////////
// Set up the transistor geometry by defining lambda.
.opt scale=0.35u * Define lambda // This is half the minimum channel length.

// Set up some typical MOSFET parameters.
// http://www.seas.upenn.edu/~jan/spice/spice.models.html#mosis1.2um
.MODEL CMOSN NMOS LEVEL=3 PHI=0.600000 TOX=2.1200E-08 XJ=0.200000U
+ TPG=1 VTO=0.7860 DELTA=6.9670E-01 LD=1.6470E-07 KP=9.6379E-05
+ UO=591.7 THETA=8.1220E-02 RSH=8.5450E+01 GAMMA=0.5863
+ NSUB=2.7470E+16 NFS=1.98E+12 VMAX=1.7330E+05 ETA=4.3680E-02
+ KAPPA=1.3960E-01 CGDO=4.0241E-10 CGSO=4.0241E-10
+ CGBO=3.6144E-10 CJ=3.8541E-04 MJ=1.1854 CJSW=1.3940E-10
+ MJSW=0.125195 PB=0.800000
.MODEL CMOSP PMOS LEVEL=3 PHI=0.600000 TOX=2.1200E-08 XJ=0.200000U
+ TPG=-1 VTO=-0.9056 DELTA=1.5200E+00 LD=2.2000E-08 KP=2.9352E-05
+ UO=180.2 THETA=1.2480E-01 RSH=1.0470E+02 GAMMA=0.4863
+ NSUB=1.8900E+16 NFS=3.46E+12 VMAX=3.7320E+05 ETA=1.6410E-01
+ KAPPA=9.6940E+00 CGDO=5.3752E-11 CGSO=5.3752E-11
+ CGBO=3.3650E-10 CJ=4.8447E-04 MJ=0.5027 CJSW=1.6457E-10
+ MJSW=0.217168 PB=0.850000

/////////////////////////////////////////////
// Define the invertor, made of two MOSFETs as usual, using a subcircuit.
.subckt myinv In Out N=8 P=16
// Assumes 5 lambda of diffusion on the source/drain
m1 Out In Gnd Gnd CMOSN l=2 w=N
+ as='5*N' ad='5*N'
+ ps='N+10' pd='N+10'
m2 Out In Vdd Vdd CMOSP l=2 w=P
+ as='5*P' ad='5*P'
+ ps='P+10' pd='P+10'
.ends myinv

//////////////////////////////////////////////
// Top-level simulation net list:
// one instance of my invertor and a load capacitor.
x1 In Out myinv // Invertor
C1 Out Gnd 0.1pF // Load capacitor

//////////////////////////////////////////////
// Stimulus: Create a waveform generator to drive In.
// Use a "piecewise linear source" PWL that takes a list of time/voltage pairs.
Vstim In Gnd PWL(0 0 1ns 0 1.05ns 'VddVoltage' 3ns 'VddVoltage' 3.2ns 0)

//////////////////////////////////////////////
// Invoke transient simulation (that itself will first find a steady state).
.tran .01ns 6ns // Set the time step and total duration.
.plot TRAN v(In) v(Out)
.end
Figure 286: SPICE description and setup for two transistors arranged as a CMOS
invertor simulated with a two-step input.
Figure 287: Plots of the invertor when running from VCC supplies of 2.5 V (left) and
1.5 V (right). Red is the input stimulus and blue is the simulated output.
Table 34: Design space for dynamic power-saving techniques.
 | Clock | Power
On/Off | Clock gating | Power supply gating
Variable | Dynamic frequency scaling (DFS) | Dynamic voltage scaling (DVS)
Table 35: Example of static and dynamic power use for a three-level DVFS
configuration. (The static current was estimated using a simple leakage model whose
coefficient was chosen to give an approximate 1:3 ratio of static to dynamic power at
the middle supply voltage.)
Supply voltage (V) | Clock frequency (MHz) | Static power (mW) | Dynamic power (mW) | Total power (mW)
0.8 | 100 | 40 | 24 | 64
1.35 | 100 | 67 | 68 | 135
1.35 | 200 | 67 | 136 | 204
1.8 | 100 | 90 | 121 | 211
1.8 | 200 | 90 | 243 | 333
1.8 | 400 | 90 | 486 | 576
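The dynamic-power column of Table 35 follows the usual V-squared-times-frequency relationship: scaling the 0.8 V, 100 MHz entry by (V/0.8)^2 and (f/100) reproduces the other rows to within rounding. A small C check, taking the first row as the reference point:

#include <stdio.h>

/* Dynamic power scales with V^2 * f. Reference row: 0.8 V, 100 MHz, 24 mW. */
static double dyn_mw(double v, double f_mhz)
{
    return 24.0 * (v / 0.8) * (v / 0.8) * (f_mhz / 100.0);
}

int main(void)
{
    double rows[][2] = { {0.8, 100}, {1.35, 100}, {1.35, 200},
                         {1.8, 100}, {1.8, 200},  {1.8, 400} };
    for (int i = 0; i < 6; i++)
        printf("%.2f V %3.0f MHz -> %5.1f mW dynamic\n",
               rows[i][0], rows[i][1], dyn_mw(rows[i][0], rows[i][1]));
    return 0;
}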
Figure 288: Clock enable using (a) a multiplexor, (b) an AND gate and (c) an OR gate.
Figure 289: Illustrating a transparent latch and its use to suppress clock gating
glitches.
Figure 290: Using XOR gates to determine whether a clock edge would have any
effect.
Figure 291: Clock-needed computations forwarded down a pipeline.
Figure 292: Power gating in general (left) and an actual power gate circuit (right).
Table 36: Summary of the properties of four power-saving techniques.
Technique | Clock gating | Supply gating | DVFS
Control | Automatic | Various | Software
Granularity | Register or FSM | Larger blocks | Macroscopic
Clock tree | Mostly free-runs | Turned off | Slows down
Response time | Instant | 2 to 3 cycles | Instant (or ms if PLL adjusted)
Figure 293: Debugging hardware for a single-core SoC. A debug access port is
connected via JTAG and USB to a debug workstation.
Figure 294: Typical additional event-stream debug resources for a modern MPSoC.
Operational data busses are black. Event-stream busses are green. Debug-access
busses are blue. The pink arrows denote event-monitoring nets from miscellaneous
IP blocks that either do not have their own counters or need to be made available
for cross-triggering.
Figure 295: JTAG interface chaining at board level to create a top-level node in the
debug device tree.
Table 37: JTAG signal list.
Signal | Direction | Function
TDI | In | Test data in: serial bits from test agent or previous device
TMS | In | Test mode select: frame data and addresses
TCK | In | Test clock: clocks each bit in and out
TDO | Out | Test data out: to next device or back to agent
Figure 296: Basic structure of the additional logic required in input and output pads
for boundary scanning. Serial data are shifted from one pad to the next using the
test clock. The result from the previous test vector is shifted out as a vector is shifted
in. The two strobe signals are used to apply the test vector to the input pads and to
sample the data at the output pads.
Figure 297: A static RAM with self-test wrapper around it. An error correction
wrapper has a similar structure, but then the SRAM data bus will be wider than the
external data bus. Both wrappers are often logically present, in which case a single
wrapper may implement both functionalities.
Figure 298: Crystal oscillator circuit (left), canned crystal and contents (centre), and
specification and output waveform (right).
Figure 299: Clock multiplication using a PLL (left) and clock distribution layout
using a fractal H-tree (right).
Figure 300: Clock skewing. The delivery of the clock edge to some D-types is delayed
or advanced to balance out timing margins either side.
Figure 301: Basic form of a timing slack graph.
Figure 302: Schematic (left) and timing diagram (right) of a typical multi-cycle path.
The delay through logic block C ranges from a minimum to a maximum value, both of
which lie between one and two clock periods.
Figure 303: Environmental testing under various temperature and humidity
conditions in a climate-controlled chamber.
Figure 304: EMC testing of radio-frequency interference.