Kiwi AES Encryption Demo

PAGE UNDER CONSTRUCTION, AUGUST 2016. Gosh, it's 2018 now, so construction had better be finished off soon!

Preface

The FPGAs for Software Programmers (FSP 2016) Workshop in Lausanne included a paper 'Building hardware from C# models' by Kenneth Skovhede and Brian Vinter.

The paper concentrates on using Occam/Kahn networks and uses the C# 'await' primitive for clock synchronisation where needed.

A demo in that paper was to compile the C# AES code from https://github.com/mono/mono/blob/ef407901f8fdd9ed8c377dbec8123b5afb932ebb/mcs/class/System.Core/System.Security.Cryptography/AesTransform.cs.

(We reproduced that demo using Kiwi while the paper was being presented.)

Introduction

AES is well-known to be well-suited to hardware acceleration.

The C# source code compiled straightaway with Kiwi, but, like the authors of the workshop paper, we clearly needed to manually code a testbench.

The main entry point of the library source code is the class constructor, which has the following signature. This coding style is not ideal for hardware implementation, since we prefer to first generate one or more component instances and then pass data to those instances.

public AesTransform (Aes algo, bool encryption, byte[] key, byte[] iv)

So we made a few light edits to the provided code so that key installation was separated from the data operations, and both could be applied to an AesUnit after construction.
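The two-phase shape described above can be sketched as follows. This is a minimal illustration only, not the edited AesTransform.cs used in the demo: the class name AesUnitSketch and the Demo wrapper are hypothetical, and the cipher itself is delegated to the standard .NET System.Security.Cryptography.Aes class rather than to the Mono table-driven code. The check value is the AES-128 example vector from FIPS-197 Appendix C.1.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical sketch: construct first, install the key second, stream blocks third.
public class AesUnitSketch
{
    private Aes aes;                      // kept alive alongside the transform
    private ICryptoTransform transform;   // set by InstallKey

    // Phase 1: install the key once, after construction.
    public void InstallKey(bool encryption, byte[] key)
    {
        aes = Aes.Create();
        aes.Mode = CipherMode.ECB;        // each 16-byte block is independent
        aes.Padding = PaddingMode.None;   // caller always supplies whole blocks
        aes.Key = key;
        transform = encryption ? aes.CreateEncryptor() : aes.CreateDecryptor();
    }

    // Phase 2: process one 16-byte block using the installed key.
    public void ECB(byte[] inv, byte[] outv)
    {
        if (transform == null)
            throw new InvalidOperationException("InstallKey must be called first");
        transform.TransformBlock(inv, 0, 16, outv, 0);
    }
}

public static class Demo
{
    // Encrypts the FIPS-197 Appendix C.1 plaintext and returns the result as hex.
    public static string EncryptHex()
    {
        byte[] key = new byte[16], inv = new byte[16], outv = new byte[16];
        for (int p = 0; p < 16; p++)
        {
            key[p] = (byte)p;             // 00 01 02 ... 0F
            inv[p] = (byte)(p * 0x11);    // 00 11 22 ... FF
        }
        AesUnitSketch unit = new AesUnitSketch();
        unit.InstallKey(true, key);
        unit.ECB(inv, outv);
        return BitConverter.ToString(outv).Replace("-", "");
    }

    public static void Main()
    {
        Console.WriteLine(EncryptHex());  // 69C4E0D86A7B0430D8CDB78070B4C55A
    }
}
```

Because the key is installed by a method call rather than by the constructor, one instance can be created up front and then fed a stream of blocks, which is closer to how a hardware unit is used.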

Testbench

public class GreavesTestBench
{
  [Kiwi.OutputBitPort("done")]  static bool done;

  // Typically AES uses 16 byte (128 bit) blocks and a 128 bit key in CBC mode.
  static byte [] key   = new byte [16];
  static byte [] inv   = new byte [16];
  static byte [] outv  = new byte [16];
  [Kiwi.HardwareEntryPoint()]
  public static void Main()
  {
    Console.WriteLine("Kiwi AES Demo 2 - Key=2, din=3,4,5");
    Aes aes = new Aes();

    bool encf = true;

    // Create a Key and select encrypt and ECB mode.
    AesTransformRef aesUnit1 = new AesTransformRef();
    waypoint(0, 2, "START-KEY-CREATE");
    for (int p=0; p<key.Length;p++) key[p] = (byte)(p==15? 2:0); 
    waypoint(0, 3, "START-KEY-INSTALL");
    aesUnit1.InstallKey(aes, encf, key); 

    string msg = (encf) ? "Encrypt": "Decrypt";
    Console.WriteLine("Key installed {0} ({1} bytes).", msg, key.Length);
    for (int test_no=0; test_no<3; test_no++)
      {
        waypoint(test_no, 4, "START-DATA-GEN");
        for (int p=0; p<16; p++) inv[p] = (byte)(p==15 ? 3+test_no : 0);  // Create input data
        Console.Write("IN:  ");
        for (int j=0; j<inv.Length; j++) Console.Write("{0:X02} ", inv[j]);
        Console.WriteLine();
        waypoint(test_no, 5, "START-DATA-ENCRYPT");
        aesUnit1.ECB(inv, outv);
        waypoint(test_no, 6, "STOP-DATA-ENCRYPT");
        Console.Write("OUT: ");
        for (int j=0; j<outv.Length; j++) Console.Write("{0:X02} ", outv[j]);
        Console.WriteLine();
        Console.WriteLine();
        Console.WriteLine();
      }
    waypoint(0, 7, "STOP-ALL");
    Console.WriteLine("Kiwi AES Demo 2 Done at {0}", Kiwi.tnow);
  }
}

The full source code is here: AesTransform.cs

Results

Running Under Mono

mcs AesTransform.cs -r:/home/djg11/d320/hprls/kiwipro/kiwic/distro/support/Kiwi.dll 
mono AesTransform.exe
Kiwi AES Demo 2 - Key=2, din=3,4,5
Install key start
Key installed Encrypt (16 bytes).
IN:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03 
Round 0 at 0 (decrypt)
End Dec
Finish decrypt at 0.
OUT: 3C 02 67 12 D1 A2 0E E4 32 1A 8E 2D 7B F9 09 97 


IN:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 
Round 0 at 10 (decrypt)
End Dec
Finish decrypt at 10.
OUT: 48 60 82 07 EA 85 59 E9 92 A6 89 6A FB EC 0F F5 


IN:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 
Round 0 at 20 (decrypt)
End Dec
Finish decrypt at 20.
OUT: 7A 74 04 06 CD 78 5D 55 FA CE A3 4A 00 F1 E4 9B 


Kiwi AES Demo 2 Done at 30
Kiwi AES Demo - Hello
IN:  00 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 
OUT: 98 AF 3A BD 8C 60 E0 38 03 62 23 A3 3D 8B 1D 12 
IN:  03 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 
OUT: 62 98 99 F1 38 E6 6E CE 47 98 F1 BF C9 88 32 A1 
IN:  06 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 
OUT: EE D9 93 98 FF EA 15 CC 9E 07 B0 9C 81 23 1D 6B 
Kiwi AES Demo Done

KiwiC Compile

We compiled it with a logic cost setting (soft pause threshold) of 5, aiming for a quick-to-compile, small implementation that might use a lot of clocks.

$ kiwic -vnl-rootmodname DUT AesTransform.exe -vnl AesTransform.v -vnl-roundtrip=disable \
        -vnl-resets=synchronous -res2-loadstore-port-count=0 -bevelab-default-pause-mode=soft -bevelab-soft-pause-threshold=5
$ ls -l AesTransform.v
-rw-rw-r-- 1 djg11 djg11 3648914 Aug 30 05:57 AesTransform.v
$ wc AesTransform.v
  60418  314571 3648914 AesTransform.v

We need a suitable testbench, such as:

module SIMSYS();
   reg clk, reset;
   initial begin reset = 1; clk = 0; #33 reset = 0; end
   always #5 clk = !clk;

   initial begin # (1000 * 100 * 1000) $display("Finish HDL simulation on timeout %t.", $time); $finish(); end
   initial begin $dumpfile("vcd.vcd"); $dumpvars(); end

   wire done;
   wire [31:0] codesent;
   DUT the_dut(.clk(clk),
               .reset(reset),
               .done(done));

   always @(posedge clk) begin
      if (done) begin
         $display("Exit on done asserted after %d clocks.", $time/10);
         $finish;
      end
   end
endmodule

RTL simulation gives the following output:

iverilog AesTransform.v vsys.v /home/djg11/d320/hprls/kiwipro/kiwic/distro/lib/cvgates.v
./a.out
VCD info: dumpfile vcd.vcd opened for output.
Kiwi AES Demo 2 - Key=2, din=3,4,5
Waypoint 0 2 1398030674
Waypoint 0 3 1398030674
Install key start
Key installed 1164862322 (16 bytes).
Waypoint 0 4 1398030674
IN:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03 
Waypoint 0 5 1398030674
Round 0 at 975 (decrypt)
End Dec
Finish decrypt at 1325.
Waypoint 0 6 1398034256
OUT: 3c 02 67 12 d1 a2 0e e4 32 1a 8e 2d 7b f9 09 97 


Waypoint 1 4 1398030674
IN:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 
Waypoint 1 5 1398030674
Round 0 at 1365 (decrypt)
End Dec
Finish decrypt at 1715.
Waypoint 1 6 1398034256
OUT: 48 60 82 07 ea 85 59 e9 92 a6 89 6a fb ec 0f f5 


Waypoint 2 4 1398030674
IN:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 
Waypoint 2 5 1398030674
Round 0 at 1755 (decrypt)
End Dec
Finish decrypt at 2105.
Waypoint 2 6 1398034256
OUT: 7a 74 04 06 cd 78 5d 55 fa ce a3 4a 00 f1 e4 9b 


Waypoint 0 7 1398034256
Kiwi AES Demo 2 Done at 2125
Finished with code 0x01 at 2215 after         218 clocks

The most interesting parameter for a hardware AES implementation is the number of clock cycles used.

The degree of pipelining in the resultant hardware is also of great interest. The input design is expressed as processing one item at a time, but a serious hardware implementation will be fully pipelined.
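As a worked example, clock counts can be read off the timestamps in the RTL log above. The sketch below assumes that the printed times are simulation time units and that one clock is 10 units, as in the SIMSYS testbench (always #5 clk = !clk). The class and method names are illustrative only. It measures just the interval between the "Round 0 at" and "Finish decrypt at" messages, so it is not necessarily the same figure as the clocks per encryption operation read from the waveforms.

```csharp
using System;

public static class ClockCount
{
    // Assumption: 10 simulation time units per clock, from "always #5 clk = !clk".
    public const int TimeUnitsPerClock = 10;

    // Number of whole clocks between two logged timestamps.
    public static int Clocks(int fromTime, int toTime)
    {
        return (toTime - fromTime) / TimeUnitsPerClock;
    }

    public static void Main()
    {
        int[] roundStart = { 975, 1365, 1755 };   // "Round 0 at ..." lines
        int[] finish     = { 1325, 1715, 2105 };  // "Finish decrypt at ..." lines

        for (int i = 0; i < roundStart.Length; i++)
            Console.WriteLine("block {0}: {1} clocks between the two messages",
                              i, Clocks(roundStart[i], finish[i]));

        // Spacing between the starts of successive blocks, including testbench overhead.
        Console.WriteLine("block-to-block spacing: {0} clocks",
                          Clocks(roundStart[0], roundStart[1]));
    }
}
```

For this log the interval is 35 clocks for each of the three blocks, and successive blocks start 39 clocks apart.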

Further runs with different HLS time/space tuning

We can compile with different settings of -bevelab-soft-pause-threshold to trade off time against space (clock cycles versus silicon area).

The files generated are in this zip file demo-zip.zip.

Very Fast

If we increase the threshold to, say, 2132, then KiwiC will pack a massive amount of work into each clock cycle. This gives the following result, with 11 clocks per encryption operation (the interval during which the value 5 is shown on the LED display at the bottom of the waveform).

Figure: Big Area, Low Latency AES Waveform Timing

Medium

We next compiled with -bevelab-soft-pause-threshold=132. This gives the following result, with 37 clocks per encryption operation.

Figure: Smaller Area, Higher Latency AES Waveform Timing

We note that the key installation phase in the high-performance run took longer than in the medium-performance run. The key had to be installed in more places, but since key installation is not on the critical path for normal data operations, that is not a worry.

Difference in number of FUs

From the post-HLS report files, we see the main difference between the compilations was the number of SBox ROMs instantiated:

High-Area, Quick Resource Inventory


// 1 vectors of width 6
// 408 vectors of width 1
// 264 vectors of width 8
// 148 vectors of width 32
// 54 array locations of width 8
// 158 array locations of width 32
// Total state bits in module = 12750 bits.
// 6816 continuously assigned (wire/non-state) bits 
//   cell SROM_A_8_US_CC_SBox_SCALbx32_SBox_ARB0_FL1 count=44
//   cell SROM_A_UINT_CC_iT1_SCALbx20_iT1_ARF0_FL1 count=36
//   cell SROM_A_UINT_CC_iT0_SCALbx18_iT0_ARE0_FL1 count=36
//   cell SROM_A_UINT_CC_iT2_SCALbx16_iT2_ARD0_FL1 count=36
//   cell SROM_A_UINT_CC_iT3_SCALbx14_iT3_ARC0_FL1 count=36
//   cell SROM_A_8_US_CC_iSBox_SCALbx38_iSBox_ARE0_FL1 count=16
// Total number of leaf cells = 204

Medium Resource, Medium Performance Inventory

// 1 vectors of width 6
// 332 vectors of width 1
// 188 vectors of width 8
// 160 vectors of width 32
// 54 array locations of width 8
// 158 array locations of width 32
// Total state bits in module = 12450 bits.
// 5840 continuously assigned (wire/non-state) bits 
//   cell SROM_A_8_US_CC_SBox_SCALbx32_SBox_ARB0_FL1 count=6
//   cell SROM_A_UINT_CC_iT1_SCALbx20_iT1_ARF0_FL1 count=36
//   cell SROM_A_UINT_CC_iT0_SCALbx18_iT0_ARE0_FL1 count=36
//   cell SROM_A_UINT_CC_iT2_SCALbx16_iT2_ARD0_FL1 count=36
//   cell SROM_A_UINT_CC_iT3_SCALbx14_iT3_ARC0_FL1 count=36
//   cell SROM_A_8_US_CC_iSBox_SCALbx38_iSBox_ARE0_FL1 count=16
// Total number of leaf cells = 166

ROM mirroring

The AES code contains a large number of ROMs. These are initialised in the C# source code with lines like the following:

    static readonly byte[] iSBox = {
      0x52, 0x09, 0x6a, 0xd5, 0x30, 0x36, 0xa5, 0x38, 0xbf, 0x40, 0xa3, 0x9e, 0x81, 0xf3, 0xd7, 0xfb,
      ... };

These ROMs are invoked in a way that causes structural hazards, which KiwiC has to serialise in the time domain, increasing the clock cycle consumption for the task.

We can get higher performance by asking KiwiC to mirror all ROMs by a factor of up to 4 for a blocking design (i.e. re-initiation interval and latency are both 11 clock cycles) or by a factor of 40 for a fully-pipelined design (i.e. a re-initiation interval of unity). This is 16 and 160 ROMs respectively.
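One way to arrive at the figures of 16 and 160 is sketched below. It is a reading of the numbers, not something taken from the KiwiC reports. It assumes four distinct 32-bit look-up tables on the data path (such as the iT0 to iT3 ROMs listed in the inventories), four reads of each table per round (one per state column), ten rounds for a 128-bit key, and ROMs that serve one read per clock. All names in the code are illustrative.

```csharp
using System;

public static class RomMirroring
{
    public const int DistinctTables = 4;           // assumed: four T-table ROMs
    public const int LookupsPerTablePerRound = 4;  // assumed: one per 32-bit state column
    public const int Rounds = 10;                  // AES with a 128-bit key

    // Copies of each table needed so that no two reads collide on one ROM.
    public static int MirrorFactor(bool fullyPipelined)
    {
        // Blocking design: one round per clock, so only one round's reads overlap.
        // Fully pipelined: every round has its own stage, so all rounds overlap.
        return fullyPipelined ? LookupsPerTablePerRound * Rounds
                              : LookupsPerTablePerRound;
    }

    public static int TotalRoms(bool fullyPipelined)
    {
        return DistinctTables * MirrorFactor(fullyPipelined);
    }

    public static void Main()
    {
        Console.WriteLine("blocking:        factor {0}, {1} ROMs",
                          MirrorFactor(false), TotalRoms(false));
        Console.WriteLine("fully pipelined: factor {0}, {1} ROMs",
                          MirrorFactor(true), TotalRoms(true));
    }
}
```

Under these assumptions the blocking design needs 4 x 4 = 16 ROMs and the fully-pipelined design needs 4 x 40 = 160 ROMs.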

... under construction ...

Conclusions

AES is simple, consisting just of ROM look-ups, XORs and multiplexing. It is well known to accelerate well in hardware compared with von Neumann implementations. A 2018 laptop with a 2 GHz Intel i3 gets about 25 MB/sec of throughput per core, using the simple C program at this LINK. This is just a tight loop on local data. A higher-performance workstation CPU gets double this, and using the AES-NI custom instructions gains another factor of 10 (according to Wikipedia) within the Intel architecture. But note that, in reality, shovelling the data will consume time or else must be offloaded to other cores.
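To set the clock counts from the earlier runs beside the 25 MB/sec software figure, throughput can be estimated as block size times clock frequency divided by clocks per block. The sketch below is arithmetic only. The 100 MHz clock is a hypothetical value chosen for illustration (no achieved clock frequency is reported on this page), the names are illustrative, and 1 MB is taken as 10^6 bytes.

```csharp
using System;

public static class AesThroughput
{
    public const double BlockBytes = 16.0;   // one AES block is 128 bits

    // Throughput in MB/s (10^6 bytes per second) for a unit that accepts
    // a new block every clocksPerBlock cycles.
    public static double MBytesPerSecond(double clockHz, double clocksPerBlock)
    {
        return BlockBytes * clockHz / clocksPerBlock / 1e6;
    }

    public static void Main()
    {
        double clockHz = 100e6;              // hypothetical 100 MHz clock
        int[] clocksPerBlock = { 37, 11, 1 };  // medium run, very fast run, fully pipelined

        foreach (int c in clocksPerBlock)
            Console.WriteLine("{0,2} clocks/block -> {1:F1} MB/s",
                              c, MBytesPerSecond(clockHz, c));
    }
}
```

At that assumed clock rate, 37 clocks per block gives roughly 43 MB/s, 11 clocks per block roughly 145 MB/s, and a fully-pipelined unit accepting one block per clock would give 1600 MB/s.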

If we take a standard software implementation of a function like AES and put it through HLS, we get an executable hardware unit. The throughput, in terms of clocks per block, is adjustable with HLS tuning parameters. But what stops HLS making a fully-pipelined implementation for such a system, and would any changes to the C# coding style be needed?

This demo will be reimplemented using the new Kiwi real-time pipelined accelerator synthesis engine... watch this space.