System-On-Chip Design and Modelling, 2012.
A CRC is like a checksum over a block of data, but it is much more robust.
The following code embodies the CRC algorithm used in Ethernet. It makes intensive use of bit-level operations and so does not run efficiently on general-purpose CPU cores. (As well as Ethernet, this same CRC check is used in MPEG-2, PKZIP, Gzip, Bzip2 and PNG.)
    // Generating polynomial:
    const uint32_t ethernet_polynomial_le = 0xedb88320U;

    // Bit-oriented implementation: processes a byte array.
    unsigned ether_crc_le(int length, u8_t *data, int foxes)
    {
      unsigned int crc = (foxes) ? 0xffffffff : 0;   /* Initial value. */
      while (--length >= 0)
      {
        unsigned char current_octet = *data++;
        int bit;
        // printf("%02X, %08X, inv %08X\n", current_octet, crc, ~crc);
        for (bit = 8; --bit >= 0; current_octet >>= 1)
        {
          if ((crc ^ current_octet) & 1)
          {
            crc >>= 1;
            crc ^= ethernet_polynomial_le;
          }
          else crc >>= 1;
        }
      }
      // printf("crc final %x\n", crc);
      return crc;
    }
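As a quick sanity check of the bit-oriented routine, the sketch below restates it with fixed-width types and compares the result against the widely published CRC-32 check value for the ASCII string "123456789". The helper names crc32_check_value and crc32_demo are hypothetical and not part of the course code. The routine above returns the raw register, so the conventional final inversion is applied by the helper.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reflected (LSB-first) form of the Ethernet generator polynomial. */
const uint32_t ethernet_polynomial_le = 0xedb88320U;

/* Same bit-serial algorithm as above, restated with fixed-width types. */
uint32_t ether_crc_le(int length, const uint8_t *data, int foxes)
{
    uint32_t crc = foxes ? 0xffffffffU : 0U;   /* Initial value. */
    while (--length >= 0) {
        uint8_t current_octet = *data++;
        for (int bit = 8; --bit >= 0; current_octet >>= 1) {
            if ((crc ^ current_octet) & 1U)
                crc = (crc >> 1) ^ ethernet_polynomial_le;
            else
                crc >>= 1;
        }
    }
    return crc;
}

/* Hypothetical helper: CRC-32 as conventionally reported (register inverted). */
uint32_t crc32_check_value(const char *text)
{
    return ~ether_crc_le((int)strlen(text), (const uint8_t *)text, 1);
}

/* Hypothetical demo routine: checks and prints the value for "123456789". */
void crc32_demo(void)
{
    uint32_t check = crc32_check_value("123456789");
    assert(check == 0xCBF43926U);
    printf("CRC-32 of \"123456789\" = %08X\n", (unsigned)check);
}
```

Calling crc32_demo() from a test harness is enough to confirm that a port of the routine to a new compiler or target still computes the same function.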
We desire to offload the inner bit-manipulations to custom hardware.
Possible implementations: We can consider a hardware peripheral that uses programmed I/O or DMA, or a co-processor or custom instruction.
To implement this as a hardware peripheral unit, a new device must be designed and instantiated in the memory map of the controlling host CPU. A programmer's view of its register file must be defined, such as:
    #ifndef ETHERCRC_MEMMAP_H
    #define ETHERCRC_MEMMAP_H

    // Memory map (programmer's model) for programmed I/O operations on this device.
    // We can perform programmed I/O on this component by reading and writing to the following registers.
    // Each register has an offset from a system-dependent base address.

    #define CRC_RESET_REG               8   // Write any value here to reset this CRC unit.
    #define CRC_DATA_8_PROCESS_REG      16  // Write one byte here to have it included in the CRC.
    #define CRC_DATA_32_PROCESS_BE_REG  24  // Write a big-endian word here to have it included.
    #define CRC_DATA_32_PROCESS_LE_REG  32  // Write a little-endian word here to have it included.
    #define CRC_READ_REG                48  // Read here for current CRC value in host endian.
    #define CRC_CONTROL_REG             56  // Read/write here for control/status flags (if needed).

    #endif
We'll also need some access macros which we initially define in the following style:
    // Register offsets are in bytes; the array index is in 32-bit words, hence the division by 4.
    #define CRC_RESET_OPERATION(BASE)     (((volatile int *)(BASE))[CRC_RESET_REG / 4] = 1)
    #define CRC_PROCESS_BYTE(BASE, DATA)  (((volatile int *)(BASE))[CRC_DATA_8_PROCESS_REG / 4] = (DATA))
    #define CRC_READ_CRC(BASE)            (((volatile int *)(BASE))[CRC_READ_REG / 4])
    #define CRC_READ_CONTROL(BASE)        (((volatile int *)(BASE))[CRC_CONTROL_REG / 4])
And a simple test application will look like this:
    int main()
    {
      const int crc_unit0 = 0xFFFED000;   // Or whatever - base address of device in memory map.
      CRC_RESET_OPERATION(crc_unit0);
      for (int i = 0; i < 64; i++)
      {
        u8_t dd = test_data[i];
        CRC_PROCESS_BYTE(crc_unit0, dd);
        // printf("%02X, %02X %08X\n", i, dd, CRC_READ_CRC(crc_unit0));
      }
      printf("CRC final %x\n", CRC_READ_CRC(crc_unit0));
      return 0;
    }
Of course, more typically we would write a device driver for kernel space and the test app would invoke it using a system call, such as write(const char *buffer, int length).
As a simple port and sanity test, we can cross compile and run the functional model on the target processor emulation without hardware assist of any kind.
Zip file of src and object files hltest.zip.
This gives us baseline figures for clock cycles used and power consumption before any optimisation of the problem.
If we divide the program into two halves, we can join them up again using C preprocessor macros to restore the original functional model.
But an alternative definition of the macros, using a preprocessor flag, enables the application to control the hardware portion using programmed I/O:
    #ifdef CRC_DIRECT_BEVMODEL
    // Call the behavioural model directly (runs on the native workstation).
    #define CRC_RESET_OPERATION(BASE)     ((ethercrc_bev_core *)(BASE))->reset_operation()
    #define CRC_PROCESS_BYTE(BASE, DATA)  ((ethercrc_bev_core *)(BASE))->process_byte(DATA)
    #define CRC_READ_CRC(BASE)            ((ethercrc_bev_core *)(BASE))->read_crc()
    #define CRC_READ_CONTROL(BASE)        ((ethercrc_bev_core *)(BASE))->read_control()
    #else
    // Programmed I/O. Register offsets are in bytes; the array index is in 32-bit words.
    #define CRC_RESET_OPERATION(BASE)     (((volatile int *)(BASE))[CRC_RESET_REG / 4] = 1)
    #define CRC_PROCESS_BYTE(BASE, DATA)  (((volatile int *)(BASE))[CRC_DATA_8_PROCESS_REG / 4] = (DATA))
    #define CRC_READ_CRC(BASE)            (((volatile int *)(BASE))[CRC_READ_REG / 4])
    #define CRC_READ_CONTROL(BASE)        (((volatile int *)(BASE))[CRC_CONTROL_REG / 4])
    #endif

    //
    // A test application that may be run on the OR1K or on the native workstation.
    //
    int main()
    {
    #ifdef CRC_DIRECT_BEVMODEL
      ethercrc_bev_core *crc_unit0 = new ethercrc_bev_core();
    #else
      const int crc_unit0 = 0xFFFED000;   // Or whatever - base address of device in memory map.
    #endif
      CRC_RESET_OPERATION(crc_unit0);
      for (int i = 0; i < 64; i++)
      {
        u8_t dd = test_data[i];
        CRC_PROCESS_BYTE(crc_unit0, dd);
        printf("%02X, %02X %08X\n", i, dd, CRC_READ_CRC(crc_unit0));
      }
      printf("CRC final %x\n", CRC_READ_CRC(crc_unit0));
      return 0;
    }
We define a C++ model that encapsulates the behaviour of the functional model as separately-callable methods:
    #ifndef ETHERCRC_BEV_CORE_H
    #define ETHERCRC_BEV_CORE_H

    #include <stdint.h>

    // Ethernet CRC Processor - behavioural model
    // $Id: ethercrc_bev_core.h,v 1.1 2011/03/22 07:30:34 djg11 Exp $

    const uint32_t ethernet_polynomial_le = 0xedb88320U;

    class ethercrc_bev_core
    {
      // Programmer's view (ironically made private).
      uint32_t crc_reg;
      uint32_t control_reg;

    public:
      uint32_t read_crc();
      uint32_t read_control();
      void write_control(uint32_t);
      void reset_operation();
      void process_byte(uint8_t dd);
      void process_word32_be(uint32_t dd);
      void process_word32_le(uint32_t dd);
    };

    #endif
The implementations of the methods would be:
    uint32_t ethercrc_bev_core::read_crc()
    {
      return crc_reg;
    }

    uint32_t ethercrc_bev_core::read_control()
    {
      return control_reg;
    }

    // Declared in the class and called by the TLM wrapper; assumed here to be a plain register write.
    void ethercrc_bev_core::write_control(uint32_t d)
    {
      control_reg = d;
    }

    void ethercrc_bev_core::reset_operation()
    {
      crc_reg = 0xFFFFFFFF;
    }

    // Least significant byte first.
    void ethercrc_bev_core::process_word32_le(uint32_t dd)
    {
      process_byte(dd >> 0);
      process_byte(dd >> 8);
      process_byte(dd >> 16);
      process_byte(dd >> 24);
    }

    // Most significant byte first.
    void ethercrc_bev_core::process_word32_be(uint32_t dd)
    {
      process_byte(dd >> 24);
      process_byte(dd >> 16);
      process_byte(dd >> 8);
      process_byte(dd >> 0);
    }

    // Table-driven update: crc32_table[] is a 256-entry lookup table that is not shown here.
    void ethercrc_bev_core::process_byte(uint8_t dd)
    {
      dd &= 0xFF;
      uint8_t c = crc_reg >> 24;
      crc_reg = (crc_reg << 8) ^ crc32_table[dd] ^ crc32_table[c];
    }
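The process_byte method above depends on a crc32_table that is not listed, and its update shifts the register left, which presumes a table built for the MSB-first form of the polynomial. The following is a minimal sketch, in plain C, of how a lookup table can instead be generated for the reflected polynomial 0xedb88320 and checked against the bit-serial loop of the functional model. The names crc32_table_le, crc32_le_step_table and the other helpers are hypothetical and are not taken from the course sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERNET_POLYNOMIAL_LE 0xedb88320U

/* Hypothetical table name; this is not the crc32_table used above. */
uint32_t crc32_table_le[256];

/* Entry i holds the register contents after clocking byte i through
   eight steps of the bit-serial algorithm, starting from zero. */
void crc32_table_le_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t r = i;
        for (int bit = 0; bit < 8; bit++)
            r = (r & 1U) ? (r >> 1) ^ ETHERNET_POLYNOMIAL_LE : (r >> 1);
        crc32_table_le[i] = r;
    }
}

/* One table lookup replaces the eight-iteration inner loop. */
uint32_t crc32_le_step_table(uint32_t crc, uint8_t dd)
{
    return (crc >> 8) ^ crc32_table_le[(crc ^ dd) & 0xFFU];
}

/* Bit-serial reference, equivalent to the inner loop of ether_crc_le. */
uint32_t crc32_le_step_bitwise(uint32_t crc, uint8_t dd)
{
    crc ^= dd;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc & 1U) ? (crc >> 1) ^ ETHERNET_POLYNOMIAL_LE : (crc >> 1);
    return crc;
}

/* Run a whole buffer through the table-driven step. */
uint32_t crc32_le_buffer(const uint8_t *data, size_t length, uint32_t crc)
{
    for (size_t i = 0; i < length; i++)
        crc = crc32_le_step_table(crc, data[i]);
    return crc;
}

/* Check that both steps agree for every byte value along a running CRC. */
void crc32_table_le_self_check(void)
{
    crc32_table_le_init();
    uint32_t crc = 0xffffffffU;
    for (unsigned dd = 0; dd < 256; dd++) {
        uint32_t by_table = crc32_le_step_table(crc, (uint8_t)dd);
        uint32_t by_bits = crc32_le_step_bitwise(crc, (uint8_t)dd);
        assert(by_table == by_bits);
        crc = by_table;
    }
}
```

The table costs 1 KiB of memory and trades eight shift-and-XOR iterations for one load per byte, which is the trade-off behind the LUT row of the results table.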
The above methods also need encapsulating in an SC_MODULE as a model of the device. Using the TLM2.0 blocking TLM coding style, the important part is the following entry point:
    void ethercrc_tlm::b_access(tlm::tlm_generic_payload &trans, sc_time &delay)
    {
      tlm::tlm_command cmd = trans.get_command();
      uint32_t adr = ((uint32_t)trans.get_address() & 0x1Ffffffc);
      uint8_t* ptr = trans.get_data_ptr();
      uint32_t len = trans.get_data_length();
      uint8_t* lanes = trans.get_byte_enable_ptr();
      uint32_t wid = trans.get_streaming_width();

      delay += latency;

      if (cmd == tlm::TLM_READ_COMMAND)
      {
        u32_t r = 0;
        switch (adr)
        {
          case CRC_RESET_REG:
            break;
          case CRC_DATA_8_PROCESS_REG:
            break;
          case CRC_READ_REG:
            r = bevcore.read_crc();
            break;
          case CRC_CONTROL_REG:
            r = bevcore.read_control();
            break;
        }
        ((u32_t *)ptr)[0] = r;
      }
      else if (cmd == tlm::TLM_WRITE_COMMAND)
      {
        u32_t d32 = ((u32_t *)ptr)[0];
        switch (adr)
        {
          case CRC_RESET_REG:
            bevcore.reset_operation();
            break;
          case CRC_DATA_8_PROCESS_REG:
            bevcore.process_byte(d32 & 0xFF);
            break;
          case CRC_DATA_32_PROCESS_LE_REG:
            bevcore.process_word32_le(d32);
            break;
          case CRC_DATA_32_PROCESS_BE_REG:
            bevcore.process_word32_be(d32);
            break;
          case CRC_READ_REG:
            break;
          case CRC_CONTROL_REG:
            bevcore.write_control(d32);
            break;
        }
      }
      trans.set_response_status(tlm::TLM_OK_RESPONSE);
    }
We can define some unused instructions to be coprocessor instructions. Some processors have instructions already allocated for communication with as-yet unspecified coprocessors.
    // EtherCRC example co-processor instructions:
    #define NOP_CUSTOM_ETHERCRC_READ     0x0008
    #define NOP_CUSTOM_ETHERCRC_WRITE    0x0009
    #define NOP_CUSTOM_ETHERCRC_PROCESS  0x000A

    // We use in-line assembler shims to instantiate the coprocessor instructions.
    void ethercrc_write(int value)
    {
      asm("l.addi r3,%0,0": :"r" (value) : "r3");
      asm("l.nop %0" : :"K" (NOP_CUSTOM_ETHERCRC_WRITE));
    }

    void ethercrc_process(int value)
    {
      asm("l.addi r3,%0,0": :"r" (value) : "r3");
      asm("l.nop %0" : :"K" (NOP_CUSTOM_ETHERCRC_PROCESS));
    }

    int ethercrc_read()
    {
      volatile int value;
      asm("l.nop %0": :"K" (NOP_CUSTOM_ETHERCRC_READ): "r3");
      asm("l.sw %0(r1),r3": :"m" (value) :);
      return value;
    }
Generated assembler is:
    00000548 <ethercrc_write>:
     548:   a8 83 00 00   l.ori r4,r3,0x0
     54c:   9c 64 00 00   l.addi r3,r4,0x0
     550:   15 00 00 09   l.nop 0x9
     554:   44 00 48 00   l.jr r9
     558:   15 00 00 00   l.nop 0x0

    0000055c <ethercrc_process>:
     55c:   a8 83 00 00   l.ori r4,r3,0x0
     560:   9c 64 00 00   l.addi r3,r4,0x0
     564:   15 00 00 0a   l.nop 0xa
     568:   44 00 48 00   l.jr r9
     56c:   15 00 00 00   l.nop 0x0

    00000570 <ethercrc_read>:
     570:   9c 21 ff fc   l.addi r1,r1,0xfffffffc
     574:   15 00 00 08   l.nop 0x8
     578:   d4 01 18 00   l.sw 0x0(r1),r3
     57c:   85 61 00 00   l.lwz r11,0x0(r1)
     580:   44 00 48 00   l.jr r9
     584:   9c 21 00 04   l.addi r1,r1,0x4
And modify the application code as follows:
    // Coprocessor implementation.
    unsigned ether_crc_le(int length, uint8_t *data, int foxes)
    {
      int crc = (foxes) ? 0xffffffff : 0;   /* Initial value. */
      ethercrc_write(crc);
      int *idata = (int *) data;
      // One 32-bit word per custom instruction; length is assumed to be a multiple of 4.
      for ( ; length > 0; length -= 4)
      {
        int d = *idata++;
        printf("d=%08X, old=%08X\n", d, ethercrc_read());
        ethercrc_process(d);
      }
      crc = ethercrc_read();
      printf("crc final %x\n", crc);
      /* if (crc != 0xdebb20e3) printf("crc is wrong"); */
      return crc;
    }
We must add the new instructions to the RTL of the processor (not shown) and to its ISS.
    // New cases in the ISS:
    case NOP_CUSTOM_ETHERCRC_WRITE:
      the_ethercrc_customins.write_crc(evalsim_reg(3));
      break;

    case NOP_CUSTOM_ETHERCRC_PROCESS:
      the_ethercrc_customins.process_word32_be(evalsim_reg(3));
      break;

    case NOP_CUSTOM_ETHERCRC_READ:
      cpu_state.reg[3] = the_ethercrc_customins.read_crc();
      break;
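The application loads the data a word at a time and the ISS hands each word to process_word32_be, so the bytes reach the CRC in memory order only if the target loads words big-endian. The sketch below illustrates that point in portable C, assuming a big-endian word load and a buffer length that is a multiple of 4. The function names are hypothetical stand-ins for the coprocessor operations, and the CRC step is the bit-serial one from the functional model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-serial update for one byte (reflected polynomial 0xedb88320). */
uint32_t crc_step_byte(uint32_t crc, uint8_t dd)
{
    crc ^= dd;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc & 1U) ? (crc >> 1) ^ 0xedb88320U : (crc >> 1);
    return crc;
}

/* Most significant byte first, in the manner of process_word32_be. */
uint32_t crc_step_word32_be(uint32_t crc, uint32_t w)
{
    crc = crc_step_byte(crc, (uint8_t)(w >> 24));
    crc = crc_step_byte(crc, (uint8_t)(w >> 16));
    crc = crc_step_byte(crc, (uint8_t)(w >> 8));
    crc = crc_step_byte(crc, (uint8_t)(w >> 0));
    return crc;
}

/* The value a 32-bit load returns on a big-endian machine. */
uint32_t load_word32_be(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Reference: one byte at a time, in memory order. */
uint32_t crc_bytewise(const uint8_t *data, size_t length, uint32_t crc)
{
    for (size_t i = 0; i < length; i++)
        crc = crc_step_byte(crc, data[i]);
    return crc;
}

/* Word at a time; length is assumed to be a multiple of 4. */
uint32_t crc_wordwise_be(const uint8_t *data, size_t length, uint32_t crc)
{
    for (size_t i = 0; i + 4 <= length; i += 4)
        crc = crc_step_word32_be(crc, load_word32_be(data + i));
    return crc;
}

/* Hypothetical self-check: both routes give the same register contents. */
void crc_wordwise_self_check(void)
{
    const uint8_t msg[8] = { '1', '2', '3', '4', '5', '6', '7', '8' };
    assert(crc_wordwise_be(msg, 8, 0xffffffffU) ==
           crc_bytewise(msg, 8, 0xffffffffU));
}
```

On a little-endian host the equivalent pairing would be a native word load with a least-significant-byte-first step. A buffer whose length is not a multiple of 4 needs its tail bytes handled separately.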
Results table
For 64000 bytes of CRC computation:

    Technique     Application    Ins Count  Duration  Energy (mJ)  Power (mW)
                  name                      (ms)      Dyn/Static   Dyn/Static
    bit-oriented  hltest.c       6345349    70        2.9/0.5      37/6
    LUT           hltest.c       1159943    16        3.8/0.092    225/5
    Peripheral    periphtest.c   517612     8.2       0.3/0.037    45/5
    Coprocessor   coproctest.c   126950     1.9       0.1/0.012    96/6

This is without putting power annotations on the coprocessor instructions or the peripheral unit.
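We see a considerable speed-up using acceleration. The power goes up, but that does not matter, since the total energy goes down: the coprocessor version draws 96 mW of dynamic power against 37 mW for the bit-oriented code, but it finishes in 1.9 ms instead of 70 ms, so its dynamic energy falls from 2.9 mJ to 0.1 mJ.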
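The comparison can be made mechanical with a few lines of C. This sketch copies the durations and energies from the results table, then derives each technique's speed-up over the bit-oriented baseline and its total (dynamic plus static) energy. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the results table (durations in ms, energies in mJ). */
struct crc_result {
    const char *technique;
    double duration_ms;
    double dyn_energy_mj;
    double static_energy_mj;
};

const struct crc_result crc_results[4] = {
    { "bit-oriented", 70.0, 2.9, 0.5   },
    { "LUT",          16.0, 3.8, 0.092 },
    { "Peripheral",    8.2, 0.3, 0.037 },
    { "Coprocessor",   1.9, 0.1, 0.012 },
};

/* Speed-up of row i relative to the bit-oriented baseline (row 0). */
double speed_up(int i)
{
    return crc_results[0].duration_ms / crc_results[i].duration_ms;
}

/* Dynamic plus static energy of row i, in mJ. */
double total_energy_mj(int i)
{
    return crc_results[i].dyn_energy_mj + crc_results[i].static_energy_mj;
}

/* Print one summary line per technique. */
void print_summary(void)
{
    for (int i = 0; i < 4; i++)
        printf("%-12s speed-up %5.1fx, total energy %.3f mJ\n",
               crc_results[i].technique, speed_up(i), total_energy_mj(i));
    /* Both hardware-assisted versions use less energy than the baseline. */
    assert(total_energy_mj(2) < total_energy_mj(0));
    assert(total_energy_mj(3) < total_energy_mj(0));
}
```

On these figures the LUT version is the exception: it is faster than the bit-oriented code but its total energy is slightly higher, so the energy saving comes from the peripheral and coprocessor versions.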
(C) 2012 DJ Greaves, University of Cambridge, Computer Laboratory.