System-On-Chip Design and Modelling, 2012.

Accelerator Example: Ethernet CRC

Problem Statement

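A cyclic redundancy check (CRC) is like a checksum over a block of data, but it is much more robust: for example, a 32-bit CRC detects every burst error of up to 32 bits, which a simple additive checksum does not.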

The following code embodies the CRC algorithm used in Ethernet. It makes intensive use of bit-level operations and so does not run efficiently on general-purpose CPU cores. (As well as Ethernet, this same CRC is used in MPEG-2, PKZIP, Gzip, Bzip2 and PNG.)


// Generating polynomial:
const uint32_t ethernet_polynomial_le = 0xedb88320U;

// Bit-oriented implementation: processes a byte array.
unsigned ether_crc_le(int length, u8_t *data, int foxes)
{
    unsigned int crc = (foxes) ? 0xffffffff : 0;  /* Initial value. */
    while (--length >= 0)
    {
        unsigned char current_octet = *data++;
        int bit;
        // printf("%02X, %08X,  inv %08X\n", current_octet, crc, ~crc);

        for (bit = 8; --bit >= 0; current_octet >>= 1) {
            if ((crc ^ current_octet) & 1) {
                crc >>= 1;
                crc ^= ethernet_polynomial_le;
            } else
                crc >>= 1;
        }
    }
    // printf("crc final %x\n", crc);
    return crc;
}
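The results table later in these notes lists a "LUT" variant of hltest.c whose source is not shown. As a rough illustration of what a table-driven version can look like, here is a minimal sketch that consumes one byte per step instead of one bit. The names `crc_lut`, `crc_lut_init` and `ether_crc_le_lut` are assumptions made for this example and are not taken from the course files.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_lut[256];   /* hypothetical name: one entry per byte value */

/* Fill the table by running the bit-serial step eight times for each byte value. */
void crc_lut_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int bit = 0; bit < 8; bit++)
            c = (c & 1u) ? (c >> 1) ^ 0xedb88320u : (c >> 1);
        crc_lut[n] = c;
    }
}

/* Byte-at-a-time counterpart of ether_crc_le(); call crc_lut_init() first. */
uint32_t ether_crc_le_lut(size_t length, const uint8_t *data, int foxes)
{
    uint32_t crc = foxes ? 0xffffffffu : 0u;   /* Initial value. */
    while (length-- > 0)
        crc = (crc >> 8) ^ crc_lut[(crc ^ *data++) & 0xffu];
    return crc;
}
```

Like the bit-oriented code, this sketch does not invert the result at the end, so the value it returns is the raw register contents. The table costs 1 KiB of memory in exchange for removing the inner eight-iteration loop.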

We want to offload the inner bit manipulations to custom hardware.

Possible implementations: We can consider a hardware peripheral that uses programmed I/O or DMA, or a co-processor or custom instruction.

First design: Programmed I/O Peripheral

To implement this as a hardware peripheral unit, a new device must be designed and instantiated in the memory map of the controlling host CPU. A programmer's view of its register file must be defined, such as:

#ifndef ETHERCRC_MEMMAP_H
#define ETHERCRC_MEMMAP_H

// Memory map (programmer's model) for programmed I/O operations on this device.
// We can perform programmed I/O on this component by reading and writing the following registers.
// Each register has an offset from a system-dependent base address.

#define CRC_RESET_REG              8   // Write any value here to reset this CRC unit.
#define CRC_DATA_8_PROCESS_REG     16  // Write one byte here to have it included in the CRC.
#define CRC_DATA_32_PROCESS_BE_REG 24  // Write a big-endian word here to have it included.
#define CRC_DATA_32_PROCESS_LE_REG 32  // Write a little-endian word here to have it included.
#define CRC_READ_REG               48  // Read here for the current CRC value in host endianness.
#define CRC_CONTROL_REG            56  // Read/write here for control/status flags (if needed).

#endif

We'll also need some access macros which we initially define in the following style:

// The register offsets are byte offsets, so they are added to the base address before the cast.
#define CRC_RESET_OPERATION(BASE)      (*(volatile int *)((BASE) + CRC_RESET_REG) = 1)
#define CRC_PROCESS_BYTE(BASE, DATA)   (*(volatile int *)((BASE) + CRC_DATA_8_PROCESS_REG) = (DATA))
#define CRC_READ_CRC(BASE)             (*(volatile int *)((BASE) + CRC_READ_REG))
#define CRC_READ_CONTROL(BASE)         (*(volatile int *)((BASE) + CRC_CONTROL_REG))
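The address decode in the TLM wrapper later in these notes compares the bus address directly against the register constants, which suggests they are byte offsets from the base. Under that reading, register access can be sketched with two small helper functions instead of macros. The names `crc_reg_write` and `crc_reg_read` are assumptions for this example; the three offsets are copied from the memory map above.

```c
#include <stdint.h>

#define CRC_RESET_REG           8
#define CRC_DATA_8_PROCESS_REG  16
#define CRC_READ_REG            48

/* Write a 32-bit value to the register at byte offset 'offset' from 'base'. */
void crc_reg_write(uintptr_t base, unsigned offset, uint32_t value)
{
    *(volatile uint32_t *)(base + offset) = value;
}

/* Read the 32-bit register at byte offset 'offset' from 'base'. */
uint32_t crc_reg_read(uintptr_t base, unsigned offset)
{
    return *(volatile uint32_t *)(base + offset);
}
```

On a workstation the helpers can be exercised against an ordinary array standing in for the register file; on the target, `base` would be the device's address in the memory map. The `volatile` qualifier stops the compiler from merging or removing accesses that have side effects in the device.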

And a simple test application will look like this:

int main()
{
  const int crc_unit0 = 0xFFFED000; // Or whatever - base address of device in memory map.  
  CRC_RESET_OPERATION(crc_unit0);
  for (int i=0;i<64; i++)
    {
      u8_t dd = test_data[i];
      CRC_PROCESS_BYTE(crc_unit0, dd);
      //printf("%02X, %02X   %08X\n", i, dd, CRC_READ_CRC(crc_unit0));
    }
  printf("CRC final %x\n", CRC_READ_CRC(crc_unit0));
  return 0;
}

More typically, we would write a kernel-space device driver, and the test application would invoke it through a system call such as write(const char *buffer, int length).
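From user space, a driver of this kind would usually be reached through the POSIX write() call on a file descriptor. The sketch below assumes a hypothetical character device (the notes do not name one) that has already been opened; the function name `crc_dev_feed` is likewise an assumption. It shows only the usual loop that copes with short writes and interrupted calls.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Push 'length' bytes to an already-open descriptor, retrying on short
   writes and on EINTR.  Returns 0 on success, -1 on error (errno is set). */
int crc_dev_feed(int fd, const uint8_t *buffer, size_t length)
{
    while (length > 0) {
        ssize_t n = write(fd, buffer, length);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;
        }
        buffer += n;            /* skip the bytes the kernel accepted */
        length -= (size_t)n;
    }
    return 0;
}
```

How the final CRC would be read back (for example with read() or an ioctl) depends on the driver and is left open here.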


Run Complete Test Application on OpenRISC OR1K

As a simple port and sanity test, we can cross-compile the functional model and run it on the target processor emulation without hardware assistance of any kind.

The source and object files are in the zip file hltest.zip.

This gives baseline figures for clock cycles used and power consumption before any acceleration of the problem.


Partition the functional model into hardware and software parts

If we divide the program into two halves, we can join them up again using C preprocessor macros to restore the original functional model.

But an alternative definition of the macros, using a preprocessor flag, enables the application to control the hardware portion using programmed I/O:


#ifdef CRC_DIRECT_BEVMODEL
#define CRC_RESET_OPERATION(BASE)      ((ethercrc_bev_core *)BASE)->reset_operation();
#define CRC_PROCESS_BYTE(BASE, DATA)   ((ethercrc_bev_core *)BASE)->process_byte(DATA);
#define CRC_READ_CRC(BASE)             ((ethercrc_bev_core *)BASE)->read_crc()
#define CRC_READ_CONTROL(BASE)         ((ethercrc_bev_core *)BASE)->read_control()
#else
// The register offsets are byte offsets, so they are added to the base address before the cast.
#define CRC_RESET_OPERATION(BASE)      (*(volatile int *)((BASE) + CRC_RESET_REG) = 1)
#define CRC_PROCESS_BYTE(BASE, DATA)   (*(volatile int *)((BASE) + CRC_DATA_8_PROCESS_REG) = (DATA))
#define CRC_READ_CRC(BASE)             (*(volatile int *)((BASE) + CRC_READ_REG))
#define CRC_READ_CONTROL(BASE)         (*(volatile int *)((BASE) + CRC_CONTROL_REG))
#endif

First half of functional model (remains software)

//
// A test application that may be run on the OR1K or on the native workstation.
//
int main()
{
#ifdef CRC_DIRECT_BEVMODEL
  ethercrc_bev_core *crc_unit0 = new ethercrc_bev_core();
#else
  const int crc_unit0 = 0xFFFED000; // Or whatever - base address of device in memory map.
#endif
  CRC_RESET_OPERATION(crc_unit0);
  for (int i=0;i<64; i++)
    {
      u8_t dd = test_data[i];
      CRC_PROCESS_BYTE(crc_unit0, dd);
      printf("%02X, %02X   %08X\n", i, dd, CRC_READ_CRC(crc_unit0));
    }
  printf("CRC final %x\n", CRC_READ_CRC(crc_unit0));
  return 0;
}

Second half of functional model (becomes hardware)

We define a C++ model that encapsulates the behaviour of the functional model as separately-callable methods:

#ifndef ETHERCRC_BEV_CORE_H
#define ETHERCRC_BEV_CORE_H
// Ethernet CRC Processor - behavioural model
// $Id: ethercrc_bev_core.h,v 1.1 2011/03/22 07:30:34 djg11 Exp $
#include <stdint.h>

const uint32_t ethernet_polynomial_le = 0xedb88320U;

class ethercrc_bev_core
{
  // Programmer's view (ironically made private).
  uint32_t crc_reg;
  uint32_t control_reg;

public:
  uint32_t read_crc();
  uint32_t read_control();
  void write_control(uint32_t);
  void reset_operation();
  void process_byte(uint8_t dd);
  void process_word32_be(uint32_t dd);
  void process_word32_le(uint32_t dd);
};
#endif

The implementations of the methods would be:

uint32_t ethercrc_bev_core::read_crc()
{
  return crc_reg;
}

uint32_t ethercrc_bev_core::read_control()
{
  return control_reg;
}

void ethercrc_bev_core::reset_operation()
{
  crc_reg = 0xFFFFFFFF;    
}


void ethercrc_bev_core::process_word32_le(uint32_t dd)
{
  process_byte(dd >> 0);
  process_byte(dd >> 8);
  process_byte(dd >> 16);
  process_byte(dd >> 24);
}

void ethercrc_bev_core::process_word32_be(uint32_t dd)
{
  process_byte(dd >> 24);
  process_byte(dd >> 16);
  process_byte(dd >> 8);
  process_byte(dd >> 0);
}


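void ethercrc_bev_core::process_byte(uint8_t dd)
{
  // crc32_table is a 256-entry lookup table defined elsewhere (not shown).
  // The CRC register shifts left by one whole byte per call.
  uint8_t c = crc_reg >> 24;
  crc_reg = (crc_reg << 8) ^ crc32_table[dd] ^ crc32_table[c];
}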
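The notes do not show how crc32_table is filled. Because process_byte() shifts the register left, one plausible choice is an MSB-first table for the polynomial 0x04C11DB7, which is the bit-reversed form of 0xedb88320. The sketch below is written under that assumption, in plain C with file-scope state rather than a class; it is not the author's table and the real one may differ.

```c
#include <stdint.h>

static uint32_t crc32_table[256];
static uint32_t crc_reg;

/* ASSUMPTION: MSB-first table for polynomial 0x04C11DB7. */
void crc32_table_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n << 24;
        for (int bit = 0; bit < 8; bit++)
            c = (c & 0x80000000u) ? (c << 1) ^ 0x04C11DB7u : (c << 1);
        crc32_table[n] = c;
    }
}

void reset_operation(void)
{
    crc_reg = 0xFFFFFFFFu;
}

/* Same update expression as ethercrc_bev_core::process_byte(). */
void process_byte(uint8_t dd)
{
    uint8_t c = (uint8_t)(crc_reg >> 24);
    crc_reg = (crc_reg << 8) ^ crc32_table[dd] ^ crc32_table[c];
}

uint32_t read_crc(void)
{
    return crc_reg;
}
```

A table built this way is linear over XOR, so crc32_table[dd] ^ crc32_table[c] equals crc32_table[dd ^ c] and the update is the familiar one-lookup-per-byte form. With this table the model computes the MSB-first variant of the CRC, so its numeric result for a given byte sequence should not be expected to equal the value returned by the LSB-first ether_crc_le().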

TLM Wrapper for Device Behavioural Core

The above methods also need encapsulating in an SC_MODULE as a model of the device. Using the TLM 2.0 blocking coding style, the important part is the following entry point:

void ethercrc_tlm::b_access(tlm::tlm_generic_payload &trans, sc_time &delay)
{
  tlm::tlm_command cmd = trans.get_command();
  uint32_t    adr = ((uint32_t)trans.get_address() & 0x1Ffffffc);
  uint8_t*	ptr = trans.get_data_ptr();
  uint32_t    len = trans.get_data_length();
  uint8_t*	lanes = trans.get_byte_enable_ptr();
  uint32_t    wid = trans.get_streaming_width();

  delay += latency;

  if (cmd == tlm::TLM_READ_COMMAND)
    {
      u32_t r = 0;
      switch (adr)
	{
	case CRC_RESET_REG:
	  break;
	case CRC_DATA_8_PROCESS_REG:
	  break;

	case CRC_READ_REG:
	  r = bevcore.read_crc();
	  break;
	case CRC_CONTROL_REG:
	  r = bevcore.read_control();
	  break;
	}
      ((u32_t *)ptr)[0] = r;
    }
  else if (cmd == tlm::TLM_WRITE_COMMAND)
    {
      u32_t d32 = ((u32_t *)ptr)[0];
      switch (adr)
	{
	case CRC_RESET_REG:
	  bevcore.reset_operation();
	  break;

	case CRC_DATA_8_PROCESS_REG:
	  bevcore.process_byte(d32 & 0xFF);
	  break;

	case CRC_DATA_32_PROCESS_LE_REG:
	  bevcore.process_word32_le(d32);
	  break;
	case CRC_DATA_32_PROCESS_BE_REG:
	  bevcore.process_word32_be(d32);
	  break;
	case CRC_READ_REG:
	  break;
	case CRC_CONTROL_REG:
	  bevcore.write_control(d32);
	  break;
	}
    }
  trans.set_response_status( tlm::TLM_OK_RESPONSE);
}

Alternative implementation as a coprocessor

We can define some unused instructions to be coprocessor instructions. Some processors already have instructions allocated for communicating with as-yet unspecified coprocessors.

// EtherCRC example co-processor instructions:
#define NOP_CUSTOM_ETHERCRC_READ     0x0008  
#define NOP_CUSTOM_ETHERCRC_WRITE    0x0009
#define NOP_CUSTOM_ETHERCRC_PROCESS  0x000A  


// We use in-line assembler shims to instantiate the coprocessor instructions.
void ethercrc_write(int value)
{
  asm("l.addi r3,%0,0": :"r" (value)                    : "r3");
  asm("l.nop %0"       : :"K" (NOP_CUSTOM_ETHERCRC_WRITE));
}

void ethercrc_process(int value)
{
  asm("l.addi r3,%0,0": :"r" (value)                      : "r3");
  asm("l.nop %0"       : :"K" (NOP_CUSTOM_ETHERCRC_PROCESS));
}

int ethercrc_read()
{
  volatile int value;
  asm("l.nop %0":        :"K" (NOP_CUSTOM_ETHERCRC_READ): "r3");
  asm("l.sw  %0(r1),r3": :"m" (value)                   :);
  return value;
}

Generated assembler is:

00000548 <ethercrc_write>:
     548:	a8 83 00 00 	l.ori r4,r3,0x0
     54c:	9c 64 00 00 	l.addi r3,r4,0x0
     550:	15 00 00 09 	l.nop 0x9
     554:	44 00 48 00 	l.jr r9
     558:	15 00 00 00 	l.nop 0x0

0000055c <ethercrc_process>:
     55c:	a8 83 00 00 	l.ori r4,r3,0x0
     560:	9c 64 00 00 	l.addi r3,r4,0x0
     564:	15 00 00 0a 	l.nop 0xa
     568:	44 00 48 00 	l.jr r9
     56c:	15 00 00 00 	l.nop 0x0

00000570 <ethercrc_read>:
     570:	9c 21 ff fc 	l.addi r1,r1,0xfffffffc
     574:	15 00 00 08 	l.nop 0x8
     578:	d4 01 18 00 	l.sw 0x0(r1),r3
     57c:	85 61 00 00 	l.lwz r11,0x0(r1)
     580:	44 00 48 00 	l.jr r9
     584:	9c 21 00 04 	l.addi r1,r1,0x4

And modify the application code as follows:

// coprocessor implementation
unsigned ether_crc_le(int length, uint8_t *data, int foxes)
{
    int crc = (foxes) ? 0xffffffff: 0;	/* Initial value. */
    ethercrc_write(crc);
    int *idata = (int *) data;
    for( ; length > 0; length -=4)  // Assumes length is a multiple of 4 and data is word-aligned.
      {
	int d = *idata ++;

	printf("d=%08X, old=%08X\n", d, ethercrc_read());
	ethercrc_process(d);
      }
    crc = ethercrc_read();
    printf("crc final %x\n", crc);
    /*  if (crc != 0xdebb20e3) printf("crc is wrong");*/
    return crc;
  }

We must make corresponding modifications to the RTL of the processor (not shown) and to its ISS.

 // new cases in the ISS 

	  case NOP_CUSTOM_ETHERCRC_WRITE:
	    the_ethercrc_customins.write_crc(evalsim_reg(3));
	    break;

	  case NOP_CUSTOM_ETHERCRC_PROCESS:
	    the_ethercrc_customins.process_word32_be(evalsim_reg(3));
	    break;

	  case NOP_CUSTOM_ETHERCRC_READ:
	    cpu_state.reg[3] = the_ethercrc_customins.read_crc();
	    break;
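To show how these cases fit together, here is a self-contained toy model of the dispatch, written without reference to the real ISS sources. The type `toy_cpu_t`, the functions `toy_iss_nop` and `crc_step_byte`, and the choice of an MSB-first update with polynomial 0x04C11DB7 are all assumptions made for this sketch; only the three immediate codes and the use of r3 come from the notes.

```c
#include <stdint.h>

#define NOP_CUSTOM_ETHERCRC_READ     0x0008
#define NOP_CUSTOM_ETHERCRC_WRITE    0x0009
#define NOP_CUSTOM_ETHERCRC_PROCESS  0x000A

/* Hypothetical stand-ins for the simulator's register file and CRC unit. */
typedef struct {
    uint32_t reg[32];   /* general-purpose registers r0..r31 */
    uint32_t crc;       /* state of the custom CRC unit */
} toy_cpu_t;

/* Fold one byte into the CRC, MSB first (ASSUMED polynomial 0x04C11DB7). */
static uint32_t crc_step_byte(uint32_t crc, uint8_t byte)
{
    crc ^= (uint32_t)byte << 24;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : (crc << 1);
    return crc;
}

/* Emulate one l.nop with immediate 'code'; the operand travels in r3. */
void toy_iss_nop(toy_cpu_t *cpu, unsigned code)
{
    switch (code) {
    case NOP_CUSTOM_ETHERCRC_WRITE:
        cpu->crc = cpu->reg[3];
        break;
    case NOP_CUSTOM_ETHERCRC_PROCESS:
        /* Big-endian word: most significant byte is processed first. */
        for (int shift = 24; shift >= 0; shift -= 8)
            cpu->crc = crc_step_byte(cpu->crc, (uint8_t)(cpu->reg[3] >> shift));
        break;
    case NOP_CUSTOM_ETHERCRC_READ:
        cpu->reg[3] = cpu->crc;
        break;
    default:
        break;          /* ordinary l.nop: no effect */
    }
}
```

A handy self-check for this form of CRC is that feeding the register's own value back in as the next word drives the register to zero, which gives a quick test of the write, process and read paths together.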

Power and Performance Comparison

The four programs run were: ZIP FILE

Results table

For 64000 bytes of CRC computation.

Technique     Application    Instruction   Duration   Energy (mJ)      Power (mW)
                             count         (ms)       dynamic/static   dynamic/static

Bit-oriented  hltest.c       6345349       70         2.9/0.5          37/6
LUT           hltest.c       1159943       16         3.8/0.092        225/5
Peripheral    periphtest.c    517612       8.2        0.3/0.037        45/5
Coprocessor   coproctest.c    126950       1.9        0.1/0.012        96/6

These figures were obtained without power annotations on the coprocessor instructions or the peripheral unit.

Conclusions

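We see a considerable speed-up from acceleration: compared with the bit-oriented software, the peripheral version is about 8.5 times faster and the coprocessor version about 37 times faster. The dynamic power goes up (from 37 mW to 45 mW and 96 mW respectively), but that does not matter, since the total energy goes down.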
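The comparison can be reproduced from the Duration and Energy columns of the results table. The helper names below are assumptions for this worked example; the numbers passed to them in the usage note are the table entries.

```c
/* How many times faster the accelerated run is than the baseline run. */
double speedup(double baseline_ms, double accelerated_ms)
{
    return baseline_ms / accelerated_ms;
}

/* Ratio of total (dynamic + static) energy: baseline over accelerated. */
double energy_ratio(double base_dyn_mj, double base_stat_mj,
                    double acc_dyn_mj, double acc_stat_mj)
{
    return (base_dyn_mj + base_stat_mj) / (acc_dyn_mj + acc_stat_mj);
}
```

For instance, speedup(70, 1.9) compares the bit-oriented and coprocessor durations, and energy_ratio(2.9, 0.5, 0.1, 0.012) compares their total energies. Since the table entries are rounded and omit power annotations for the accelerators, the ratios are indicative only.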


(C) 2012 DJ Greaves, University of Cambridge, Computer Laboratory.