LG4 Notes: ESL - Electronic System Level Design

Slides for LG4.

Introduction: Architectural Exploration

ESL is electronic system level modelling: a recent development whereby transactional models of hardware components can be called directly by device driver code without modelling processor cores or busses. It is especially useful for architectural exploration where a designer can rapidly experiment with different SoC configurations in terms of how many busses, what is connected to which bus and how wide the various busses and caches are.

We cover the motivational history of ESL, looking at how firmware and behavioural models were two types of IP divided from each other despite being generally in a common language: C++. We discuss architectural exploration using mixed-abstraction models.

ESL uses procedure calls between components in a S/W (software) coding style whereas traditional hardware modelling has used shared variables to model nets that connect the components. We need, at times, to convert between these S/W and H/W styles. We will need a transactor. This is a small software entity that converts between the two modelling styles.

To help us investigate the key aspects of this methodology without using extensive libraries of any sort, we use our own processor, the almost trivial nominalproc, and we cook our own transactional modelling library rather than using the industry-standard TLM 2.0 library.

The practical classes take an instruction set simulator of a nominal processor and then sub-class it in two different ways: one to make a conventional net-level model and the other to make an ESL version. The nominal processor is wired up in various example configurations, some using mixed-abstraction modelling.

We briefly look at blocking and non-blocking transaction styles and then extend them with timing annotations to give performance estimates that vary in accuracy. The various phases of the non-blocking style can be separately annotated to give a result, known as approximately-timed, that should be as accurate as a cycle-based simulation result.

On the other hand, there is a much looser timing model that is more suitable for use with blocking models. The ESL models can then run ahead of each other and of the global simulation time, each keeping track of its local offset in a variable called delta and checkpointing with the EDS kernel when necessary or at intervals of a time quantum. Having a large quantum can expose design bugs (good) and leads to a fast model.

Embedded Firmware Modelling Methods

Pre-ESL, the main tool was the EDS simulator. An instruction set simulator (ISS) or actual processor model can be used to interpret the firmware, but this runs very slowly, perhaps a million times slower than real time, meaning operating system boot would take a day of modelling.

Improvements on this are to implement the ISS efficiently in C++ or using JIT techniques on the actual firmware and to compile the RTL to C++ for faster simulation.

A further improvement is to avoid the ISS entirely by cross-compiling the firmware so that it can run natively on the modelling workstation. This can lead to a system that runs faster than real time, because the modelling workstation might be ten times faster than the true embedded processor core, depending on the activity ratios between hardware models and firmware.

Finally, for top speed, one can envisage a scenario where the firmware device drivers communicate directly with high-level models of the hardware using procedure calls. This avoids modelling the processor bus operations entirely.

Example H/W Protocol: 4P Handshake

A common asynchronous protocol is the simple four-phase handshake. Data is transferred once per complete handshake, which involves four phases: the sender asserts req, the receiver asserts ack, the sender removes req, and finally the receiver removes ack.

A simple transactor that wraps the pin-level implementation so that software can call it might be coded as follows:

  void putbyte(char d)
  {
    wait_until(!ack); // Wait for any previous handshake to complete
    data = d;         // Drive the data onto the bus
    settle();         // Let data settle before raising req (avoids skew)
    req = 1;          // Phase 1: assert request
    wait_until(ack);  // Phase 2: receiver has taken the data
    req = 0;          // Phase 3: withdraw request
  }                   // Phase 4 (ack falling) completes before the next call

  char getbyte()
  {
    wait_until(req);  // Phase 1: sender has presented valid data
    char r = data;    // Sample the data bus
    ack = 1;          // Phase 2: acknowledge receipt
    wait_until(!req); // Phase 3: sender has withdrawn its request
    ack = 0;          // Phase 4: withdraw acknowledge, ready for next transfer
    return r;
  }

It is also possible to have transactors that keep their own thread and make up-calls to user callbacks for each transaction.

What is a transaction?

In general, a transaction has atomicity, with commit or rollback. But in ESL the term means less than that. In ESL we might just mean that a thread from one component executes a method on another. However, the call and return of the thread normally achieve flow control and implement the atomic transfer of some datum, so the term remains relatively intact.

We can have blocking and non-blocking coding styles.

In SystemC: blocking requires an SC_THREAD, whereas non-blocking can use an SC_METHOD.
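For instance, a minimal sketch of how the two kinds of process are registered (the module and process names here are invented for illustration):

  SC_MODULE(sender)
  {
    sc_in<bool> clk;
    void blocking_process();   // may call wait(): needs an SC_THREAD
    void nb_process();         // must never block: an SC_METHOD suffices
    SC_CTOR(sender)
    {
      SC_THREAD(blocking_process);
      SC_METHOD(nb_process);
      sensitive << clk.pos();  // the method is re-triggered on each clock edge
    }
  };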

Which is better? It is partly a matter of style. Non-blocking enables finer-grained concurrency and results closer to cycle-accurate timing. (TLM 2.0 convenience sockets will even automatically convert between the two styles.)

Example of the non-blocking coding style:

  bool putbyte_nb_start(char d)
  {
    if (ack) return false; // Not ready: previous handshake still completing
    data = d;
    settle(); // A H/W delay for skew issues, or a memory fence in S/W for sequential consistency.
    req = 1;
    return true;
  }

  bool putbyte_nb_end()
  {
    if (!ack) return false; // Receiver has not yet taken the data
    req = 0;
    return true;
  }

  bool getbyte_nb_start(char &r)
  {
    if (!req) return false; // No data on offer yet
    r = data;
    ack = 1;
    return true;
  }

  bool getbyte_nb_end()
  {
    if (req) return false; // Sender has not yet withdrawn its request
    ack = 0;
    return true;
  }

Adding Timing Annotations

Both coding styles can be used for zero delay models that do not advance simulation time and hence do not give useful timing information. However, if we know how long something is likely to take (we don't have the detailed answer because this is just a high-level model) we can add timing annotations to models written in either coding style.

In the non-blocking model, we can make a note of the simulator tnow in a local variable at the start of a transaction and not return success of the transaction (or a sub-part of it) until tnow has sufficiently advanced.
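A minimal sketch of this pattern (the state variables, the memory array and the latency constant are all assumptions of this illustration, held as members of the target module):

  sc_time start_time;                      // tnow noted at transaction start
  bool in_progress = false;
  const sc_time READ_LATENCY(20, SC_NS);   // assumed modelled delay

  bool mem_read_nb(unsigned int addr, unsigned int &data)
  {
    if (!in_progress)
    { start_time = sc_time_stamp();        // note tnow in a local variable
      in_progress = true;
    }
    if (sc_time_stamp() - start_time < READ_LATENCY)
      return false;                        // tnow has not advanced enough yet
    data = mem_array[addr];
    in_progress = false;
    return true;                           // success once tnow has advanced
  }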

In the blocking model we can block the caller's thread for the appropriate number of cycles before returning with the result.
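The blocking equivalent is shorter, since the SystemC wait call suspends the calling SC_THREAD for the modelled time (again using the assumed latency constant from above):

  unsigned int mem_read(unsigned int addr)
  {
    wait(READ_LATENCY);      // block the caller's thread for the modelled delay
    return mem_array[addr];
  }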

Going further, we can allow transactions to execute out of order with loose timing (see last slide).

TLM in SystemC: TLM 1.0

The OSCI TLM 1.0 standard used conventional C++ concepts of multiple inheritance. As illustrated in LG2, an SC_MODULE that implements an interface just inherits it.
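A sketch of that style, with an invented bus interface and module:

  struct bus_if: virtual public sc_interface
  {
    virtual unsigned int read(unsigned int addr) = 0;
    virtual void write(unsigned int addr, unsigned int data) = 0;
  };

  SC_MODULE(memory), public bus_if  // the module simply inherits the interface
  {
    unsigned int ram[1024];
    unsigned int read(unsigned int a)          { return ram[a]; }
    void write(unsigned int a, unsigned int d) { ram[a] = d; }
    SC_CTOR(memory) { }
  };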

SystemC 2.0 implemented an extension called sc_export that allows a parent module to inherit the interface of one of its children. This was a vital step needed in the common situation where the exporting module is not the top-level module of the component being wired up.
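A sketch of the sc_export idiom, reusing the hypothetical memory module above as the child:

  SC_MODULE(subsystem)
  {
    sc_export<bus_if> port;   // parent re-exports the child's interface
    memory child;
    SC_CTOR(subsystem) : child("child")
    {
      port(child);            // bind: callers of port reach the child directly
    }
  };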

However, TLM 1.0 had no standardised or recommended structure for payloads and no standardised timing annotation mechanisms.

There was also the problem of how to have multiple TLM ports on a component with the same interface: e.g.\ a packet router. (Full exam credit in answers can be gained using any of TLM 1.0 or TLM 2.0 styles or even your own pseudo code. You just need to understand the concepts of interfaces and inheritance in OO programming.)
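One workaround in the TLM 1.0 style is to give the router one small shim object per port, each implementing the same interface (the bus_if sketched above) and tagging calls with its port number. A sketch with invented names:

  struct router
  {
    unsigned int read(int port, unsigned int a)          { /* route by port */ return 0; }
    void write(int port, unsigned int a, unsigned int d) { /* route by port */ }
  };

  struct port_shim: public bus_if   // one instance per router port
  {
    router *parent; int id;
    port_shim(router *p, int i) : parent(p), id(i) { }
    unsigned int read(unsigned int a)          { return parent->read(id, a); }
    void write(unsigned int a, unsigned int d) { parent->write(id, a, d); }
  };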

SoC Component, TLM Form: Example DMA Controller

The RTL coding style of the DMA controller in LG3 was a little bit hard to understand, but at least it was synthesisable.

The active component of an ESL version of such a DMA controller looks like this:

  // Bus mastering side
  while (1)
  {
    wait_until(busy);                      // Sleep until started by a programmed-I/O write
    while (length-- > 0)                   // Copy the requested number of words
      mem.write(dest++, mem.read(src++));  // One read and one write transaction per word
    busy = 0;                              // Signal completion
  }

In other words, it looks just like a simple block copy in C++. (More details are on the slides and in the practical class examples library).

However, we can see that the memory operations are likely to get well out of synchronisation with the real system, since this copying loop just goes as fast as it can without worrying about the speed of the hardware it would really be implemented in. It is governed only by the number of cycles the read and write calls block for, which could be none. The whole block copy might occur in zero simulation time!

This sort of modelling is useful for exposing certain types of bugs in a design, but it does not give useful performance feedback.
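One way to recover some performance feedback is to block for an assumed per-word cost inside the copying loop (BUS_CYCLE is an invented constant for this sketch):

  while (length-- > 0)
  {
    mem.write(dest++, mem.read(src++));
    wait(2 * BUS_CYCLE);   // assumed cost: one read plus one write per word
  }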

TLM in SystemC: TLM 2.0

TLM 2.0 was defined in July 2008. It defines a Generic Payload that is used a bit like this (see the header files for the real thing on the PWF):

  trans->set_command(tlm::TLM_WRITE_COMMAND);
  trans->set_address(addr);
  trans->set_data_ptr(reinterpret_cast<unsigned char*>(&data));
  trans->set_data_length(4);
  trans->set_streaming_width(4);
  trans->set_byte_enable_ptr(0);
  trans->set_response_status(tlm::TLM_INCOMPLETE_RESPONSE);

  socket->b_transport(*trans, delay);
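After the call returns, the initiator would typically check the response status, e.g.:

  if (trans->is_response_error())
    SC_REPORT_ERROR("TLM-2.0", trans->get_response_string().c_str());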

The generic payload can be extended on a custom basis, and intermediate bus bridges and routers can be polymorphic about this: not needing to know about all the extensions but still able to update timestamps to model routing delays.

It also defines memory-management rules for payload ownership and transport primitives with timing annotations. Finally, it defines a raft of useful features, such as automatic conversion between blocking and non-blocking styles. However, it is a bit too complex to get a deep understanding of in the time available, so for this course I have mostly ignored it and implemented things in TLM 1.0 style.

Timing Models

Let's retain the EDS kernel with its tnow variable defined by the head of the event queue. This is our main reference time stamp, but let's try not to use the kernel very much, only entering it when inter-module communication is needed. This reduces context swap overhead (I know it's all in user space but it is a computed branch that does not get predicted) and we can run a large number of ISS instructions at one time, giving good use of the caches on the modelling workstation.

We can always print the current time with:

  cout << "Time now is : " << simcontext()->time_stamp() << "\n";

Basically, coarse processing granularity leads to high-speed simulation. So how often must we re-enter the SystemC kernel? The next two timing styles give different answers.

Approximate Timing

With approximate timing, processes typically need to run in lockstep with SystemC simulation time. This is most appropriate for architectural exploration and performance analysis.

Delays annotated onto process interactions are implemented using wait calls.

Detailed timing within the various phases of a transaction can be modelled. TLM 2.0 defines multiple transaction phases with timing points. The four default phases of the base protocol are BEGIN_REQ, END_REQ, BEGIN_RESP, and END_RESP. Other phases can be added.
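For instance, an initiator using the non-blocking transport interface passes the phase and a delay annotation with each call (the 10 ns figure is just an assumed request latency):

  tlm::tlm_phase phase = tlm::BEGIN_REQ;
  sc_time delay = sc_time(10, SC_NS);   // assumed latency annotation
  tlm::tlm_sync_enum status = socket->nb_transport_fw(*trans, phase, delay);
  // The target may advance 'phase' and add to 'delay' before returning.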

Loose Timing and Temporal Decoupling

The loosely-timed coding style is normally used with blocking models and is appropriate for software development or when only very rough timing is needed. The accuracy can be improved by reducing the quantum, but this also reduces performance.

There are two timing points per transaction: call and return. EDS tnow simulation time still exists, but processes become temporally decoupled from it. This means each process keeps a tally of how far it has run ahead of simulation time and gets back in sync from time to time by yielding. To do this, the thread simply calls wait() for the appropriate time difference that it has accumulated.

It may yield because it reaches an explicit synchronization point, such as when it must block waiting for some other process to do something, or because it has consumed its time quantum.

All processes will typically use the same quantum (although they need not). When they detect their local delta time has gone above the quantum they yield. The delta time can be passed by reference between components so that one process can effectively give another process a time penalty equivalent to the time it would have blocked that process under approximate timing.
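A sketch of this bookkeeping (the names and the quantum value are invented for illustration; TLM 2.0 packages the same idea as its quantum-keeper utility):

  sc_time delta = SC_ZERO_TIME;       // how far we have run ahead of tnow
  const sc_time quantum(1, SC_US);    // assumed global quantum

  void account(sc_time cost)          // called after each modelled action
  {
    delta += cost;                    // run ahead without entering the kernel
    if (delta >= quantum)
    { wait(delta);                    // yield: resynchronise with the EDS kernel
      delta = SC_ZERO_TIME;
    }
  }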

In summary, large time quantum means fast simulation whereas small time quantum means transaction order interleaving is more accurate.

See the example in the practical work.


END (C) 2009 DJ GREAVES.