
Instruction Set Simulator (ISS)

An instruction set simulator is a program that interprets or otherwise models the behaviour of machine code.

It is typically implemented as a C++ class:

class mips64iss
{
public:
  // Programmer's view state:
  u64_t regfile[32];   // General-purpose registers (R0 is constant zero)
  u64_t pc;            // Program counter (low two bits always zero)
  u5_t  mode;          // Mode (user, supervisor, etc.)
  ...
  void step();         // Run one instruction
  ...
};

An ISS can be cycle-accurate or merely programmer's-view accurate; in the latter case, the hidden registers that overcome structural hazards or implement pipeline stages are not modelled.

This fragment of a main step function evaluates one instruction per call. That does not necessarily correspond to one hardware clock cycle: in a pipelined processor, for example, the instructions being fetched and executed in the same cycle are different ones.

void mips64iss::step()
{
  u32_t ins = ins_fetch(pc);
  pc += 4;
  u8_t opcode = ins >> 26;    // Major opcode (bits 31..26)
  u8_t scode = ins & 0x3F;    // Minor (function) opcode (bits 5..0)
  u5_t rs = (ins >> 21) & 31; // Register fields
  u5_t rt = (ins >> 16) & 31;
  u5_t rd = (ins >> 11) & 31;

  if (!opcode) switch (scode) // decode minor opcode
  {
    case 052: /* SLT - set on less than (octal 052 = 0x2A) */
      regfile_up(rd, ((int64_t)regfile[rs]) < ((int64_t)regfile[rt]));
      break;

    case 053: /* SLTU - set on less than unsigned (octal 053 = 0x2B) */
      regfile_up(rd, ((u64_t)regfile[rs]) < ((u64_t)regfile[rt]));
      break;

    ...   // other minor opcodes
  }
  ...     // other major opcodes
}

void mips64iss::regfile_up(u5_t d, u64_t w)
{ if (d != 0) // Register zero stays at zero
    { TRC(trace("[ r%i := %llX ]", (int)d, (unsigned long long)w));
      regfile[d] = w;
    }
}
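The fragments above can be assembled into a small self-contained model, offered as a minimal sketch rather than the notes' own code. It assumes that u64_t, u32_t and u8_t are aliases for the standard fixed-width types (C++ has no 5-bit integer, so u8_t stands in for u5_t), models instruction memory as a hypothetical std::vector of words, omits the TRC tracing and the mode register, and decodes only SLT and SLTU. encode_special is a hypothetical helper for building R-type (SPECIAL) instruction words.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Assumed aliases for the notes' types; register indices (u5_t in the notes)
// are held in a u8_t here because C++ has no 5-bit integer type.
using u64_t = std::uint64_t;
using u32_t = std::uint32_t;
using u8_t  = std::uint8_t;

// Hypothetical helper: build a MIPS R-type word with major opcode 0 (SPECIAL).
constexpr u32_t encode_special(u32_t rs, u32_t rt, u32_t rd, u32_t funct)
{
  return (rs << 21) | (rt << 16) | (rd << 11) | funct;
}

class mips64iss
{
public:
  // Programmer's view state:
  u64_t regfile[32] = {};   // General-purpose registers (R0 is constant zero)
  u64_t pc = 0;             // Program counter (low two bits always zero)

  explicit mips64iss(std::vector<u32_t> program) : imem(std::move(program)) {}

  void step()               // Run one instruction
  {
    u32_t ins = ins_fetch(pc);
    pc += 4;
    u8_t opcode = static_cast<u8_t>(ins >> 26);          // Major opcode
    u8_t scode  = static_cast<u8_t>(ins & 0x3F);         // Minor (function) opcode
    u8_t rs = static_cast<u8_t>((ins >> 21) & 31);       // Register fields
    u8_t rt = static_cast<u8_t>((ins >> 16) & 31);
    u8_t rd = static_cast<u8_t>((ins >> 11) & 31);

    if (opcode == 0) switch (scode)
    {
      case 052:  // SLT: compare as signed 64-bit values
        regfile_up(rd, static_cast<std::int64_t>(regfile[rs]) <
                       static_cast<std::int64_t>(regfile[rt]));
        break;
      case 053:  // SLTU: compare as unsigned 64-bit values
        regfile_up(rd, regfile[rs] < regfile[rt]);
        break;
      default:   // Other minor opcodes are not modelled in this sketch
        break;
    }
  }

private:
  std::vector<u32_t> imem;  // Hypothetical instruction memory, one word per entry

  u32_t ins_fetch(u64_t addr) const
  {
    assert(addr % 4 == 0);  // Instructions are word aligned
    return imem.at(addr / 4);
  }

  void regfile_up(u8_t d, u64_t w)
  {
    if (d != 0)             // Register zero stays at zero
      regfile[d] = w;
  }
};
```

With r1 holding all ones and r2 holding 1, stepping over SLT r3, r1, r2 and then SLTU r4, r1, r2 leaves 1 in r3 and 0 in r4: the same bit pattern is -1 when compared as signed but the largest value when compared as unsigned. An SLT that names r0 as its destination leaves r0 at zero.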

See the nominal processor example in the 'Toy ESL' practical material.

Various forms of ISS are possible, modelling more or less detail:

Type of ISS                         I-cache traffic  D-cache traffic  Relative speed
                                    modelled         modelled         (x real time)

1. Interpreted RTL                  Y                Y                0.000001
2. Compiled RTL                     Y                Y                0.00001
3. V-to-C C++                       Y                Y                0.001
4. Hand-crafted cycle-accurate C++  Y                Y                0.1
5. Hand-crafted high-level C++      Y                Y                1.0
6. Trace buffer/JIT C++             N                Y                20.0
7. Native cross-compile             N                N                50.0

A cycle-accurate model of the processor core is normally available in RTL. Running it under an interpreted event-driven simulator (EDS) gives a system that typically runs at about one millionth of real-time speed (1). Compiled RTL, as is now normal practice, is about ten times faster, but is still far too slow for serious software testing (2).
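To make the relative speeds more concrete, the following sketch converts each figure in the table into the wall-clock time needed to simulate a given stretch of target execution. It assumes, as the table implies, that relative speed means simulated target time per unit of wall-clock time; the names IssKind, kIssKinds, wall_clock_seconds and print_costs are hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// One row of the table: relative speed is simulated target time per unit
// of wall-clock time, so 1.0 means the model keeps pace with real hardware.
struct IssKind
{
  const char* name;
  double relative_speed;
};

const IssKind kIssKinds[] = {
  {"Interpreted RTL", 0.000001},
  {"Compiled RTL", 0.00001},
  {"V-to-C C++", 0.001},
  {"Hand-crafted cycle-accurate C++", 0.1},
  {"Hand-crafted high-level C++", 1.0},
  {"Trace buffer/JIT C++", 20.0},
  {"Native cross-compile", 50.0},
};

// Wall-clock seconds needed to simulate target_seconds of target execution.
double wall_clock_seconds(double target_seconds, double relative_speed)
{
  assert(relative_speed > 0.0);
  return target_seconds / relative_speed;
}

// Print the cost of simulating target_seconds with each kind of ISS.
void print_costs(double target_seconds)
{
  for (const IssKind& k : kIssKinds) {
    double secs = wall_clock_seconds(target_seconds, k.relative_speed);
    std::printf("%-34s %16.1f s  (%.2f days)\n", k.name, secs, secs / 86400.0);
  }
}
```

For a hypothetical workload needing 10 s of target time, interpreted RTL would take 10 / 0.000001 = 10^7 s, roughly 116 days, whereas native cross-compilation would take 10 / 50 = 0.2 s.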

Tools such as Tenison VTOC and Verilator can generate a fast, cycle-accurate C++ model of the core from its RTL, giving intermediate performance (3). A hand-crafted model is generally much faster, requiring perhaps 100 workstation instructions to be executed for each modelled instruction (4). Since the workstation clock frequency is generally about 10 times that of the modelled embedded system, such a model runs at roughly one tenth of real-time speed.
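The estimate for row 4 can be written as a general formula. As an assumption not stated in the notes, both the workstation and the modelled core are taken to complete roughly one instruction per clock cycle. The symbols are introduced here for illustration: f_h is the workstation clock frequency, f_t the target clock frequency, N the number of workstation instructions executed per modelled instruction, and S the relative speed.

```latex
S = \frac{\text{modelled instructions per second}}{\text{target instructions per second}}
  \approx \frac{f_h / N}{f_t}
  = \frac{f_h}{f_t} \cdot \frac{1}{N}
  = 10 \times \frac{1}{100}
  = 0.1
```

Under these assumptions, halving N (a leaner model) or doubling the clock ratio f_h / f_t each doubles S.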

If we dispense with cycle accuracy, a hand-crafted model (5) runs at about real-time speed; its performance is then generally limited by the overhead of modelling each instruction fetch and data access as an operation on the model of the system bus.

Just-in-time (JIT) cross-compilation of the target machine code to native workstation machine code gives excellent performance (say 20 times faster than real time), but instruction-fetch traffic is no longer fully modelled, because translated code is reused without being fetched again (6). Techniques that unroll loops and concatenate basic blocks, such as those used for trace caches in processor architecture, are applicable.
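Why instruction-fetch traffic drops out of the model can be seen in a much simpler relative of JIT translation: a cache of predecoded basic blocks. This sketch is not native code generation, and to stay short it uses a hypothetical three-instruction toy ISA rather than MIPS. Each block is decoded, and its instruction fetches counted, only the first time control reaches its start address; afterwards it is replayed from the cache. All names (Op, enc, Decoded, BlockCacheIss) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy ISA (not MIPS): op in bits 31..24, register in bits 19..16,
// immediate or branch target (word index) in bits 15..0.
// DBNZ decrements a register and branches if the result is non-zero.
enum Op : std::uint8_t { HALT = 0, ADDI = 1, DBNZ = 2 };

constexpr std::uint32_t enc(Op op, std::uint32_t r, std::uint32_t imm)
{
  return (static_cast<std::uint32_t>(op) << 24) | ((r & 0xF) << 16) | (imm & 0xFFFF);
}

struct Decoded
{
  Op op;
  std::uint8_t r;
  std::uint16_t imm;
};

class BlockCacheIss
{
public:
  std::int64_t reg[16] = {};
  std::uint32_t pc = 0;          // Word index of the next instruction
  unsigned long ifetches = 0;    // Instruction fetches actually modelled

  explicit BlockCacheIss(std::vector<std::uint32_t> program) : mem(std::move(program)) {}

  // Run until HALT. A block is translated once, on first entry, then replayed
  // from the cache with no further fetches. Self-modifying code is not handled.
  void run()
  {
    for (;;) {
      auto it = cache.find(pc);
      if (it == cache.end())
        it = cache.emplace(pc, translate(pc)).first;
      for (const Decoded& d : it->second) {
        switch (d.op) {
          case ADDI: reg[d.r] += d.imm; ++pc; break;
          case DBNZ: pc = (--reg[d.r] != 0) ? d.imm : pc + 1; break;
          case HALT:
          default: return;
        }
      }
    }
  }

private:
  std::vector<std::uint32_t> mem;
  std::unordered_map<std::uint32_t, std::vector<Decoded>> cache;

  // Decode from 'start' up to and including the first non-ADDI instruction.
  // This is the only place where instruction fetches are counted.
  std::vector<Decoded> translate(std::uint32_t start)
  {
    std::vector<Decoded> block;
    for (std::uint32_t a = start;; ++a) {
      std::uint32_t w = mem.at(a);
      ++ifetches;
      Decoded d{static_cast<Op>(w >> 24),
                static_cast<std::uint8_t>((w >> 16) & 0xF),
                static_cast<std::uint16_t>(w & 0xFFFF)};
      block.push_back(d);
      if (d.op != ADDI) return block;  // A branch or HALT ends the basic block
    }
  }
};
```

For a loop that counts r1 down from 5 while adding 3 to r2 on each iteration (ADDI r1,5; ADDI r2,3; DBNZ r1,1; HALT), twelve instructions are executed but only six fetches are modelled. An ISS that modelled every fetch would report twelve.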

Finally (row 7), compiling the embedded software with the workstation's native compiler (as described later) exposes the raw, unfettered performance of the workstation for CPU-intensive code, although neither instruction nor data traffic is then modelled.
