Kiwi Cuckoo Cache

Here is the single-threaded implementation of a four-way Cuckoo Cache.

Wikipedia description: Cuckoo Hashing.

Introduction

A cuckoo hash is, like a set-associative cache, a means to mostly avoid chains of entries stored under a common hash (avoiding the Birthday Paradox) by using parallel lookup in some number of ways. When all ways are full for a given entry, one of the items is chosen for relocation to a different way, where it will be hashed to a different 'set' since the hash functions used in each way differ. Hopefully that alternative location is not filled. If it is, either the entry stored there can be similarly relocated, or else a different eviction candidate can be chosen. Backtracking is potentially needed to explore alternative eviction candidates, so it is easier to just push forward further items on a simple linear trajectory. Lookup cost remains constant at unity, regardless of how much relocation is done.

Source Code - Single Threaded

This is the single-threaded version. Parallel speedup is gained here through automatic unwind of the inner loops.

This source file is a Kiwi Scientific application, compiled in 'soft pause mode'. The user's program does not contain calls to 'Kiwi.Pause()'. Instead, the allocation of work to clock cycles is controlled by the budgeted number of ALUs and the soft-pause-threshold setting.

The demo contains statistics counters and print statements for them. These will be converted to RTL and operate during the RTL simulation. But when compiled further to the FPGA, without an O/S console stream and printf implementation (see KiwiC manual for info) these will be stripped by the FPGA tools and add little or no runtime overhead. (Runtime overhead arises when we print data from inside RAMs where the read cycles on the RAM ports must be schedulled by KiwiC.)

// Kiwi Scientific Acceleration Example - Cuckoo Hash Demo
// (C) 2016 DJ Greaves, University of Cambridge, Computer Laboratory.
//
//
using System;
using KiwiSystem;


public class DataGenerator
{
        int seed;
        ulong dk;

  // PRGEN Park and Miller CACM October 1988 
  int random()
  {
    
    const int a = 2147001325;
    const int b = 715136305;
    seed = a * seed + b;
    return seed;
  }

  public void Reset()
  {
    seed = 123456;
    dk = 9999999900000000L;
  }
  
  public DataGenerator() // Constructor
  {
    Reset();
  }
  
  public int Generate(out int key, out ulong value)
  {
    key = random();
    value = dk ++;
    return 0;
  }
}

public class CuckooHasher
{
  const int n_ways = 4;
  ulong [] dataArray;
  int [] [] keyTables = new int [n_ways] [];
  int [] [] valuePointerTables = new int [n_ways] [];
  int waycap;
  int next_free = 0;
  int next_victim = 0;

  int stats_inserts = 0;
  int stats_insert_probes = 0;
  int stats_insert_evictions = 0;
  int stats_lookups = 0;
  int stats_lookup_probes = 0;


  public void printStats()
  {
    
    Console.WriteLine("cuckoo cache: this={0}, inserts={1}, lookups={2}", this, stats_inserts, stats_lookups);
    Console.WriteLine("cuckoo cache: insert_probes={0}, insert_evictions={1}", stats_insert_probes, stats_insert_evictions);
    Console.WriteLine("cuckoo cache: lookup_probes={0}", stats_lookup_probes);

  }

  public CuckooHasher(int capacity) // constructor
  {
    waycap = capacity / n_ways;
    for (int n=0; n<n_ways; n++) keyTables[n] = new int [waycap];
    for (int n=0; n<n_ways; n++) valuePointerTables[n] = new int [waycap];   
    dataArray = new ulong [capacity];
  }
  
  public void Clear()
   {
    for (int m=0; m<waycap; m++)
      for (int n=0; n<n_ways; n++) keyTables[n][m]= 0;
   }
  
  int hash(int hashno, int arg)
  {
    int v = hashno * 51  + arg;
    if (v < 0) v = - v;
    return (v % waycap);
  }
  
  
  public int insert(int key_in, ulong value)
  {
    int key = key_in;
    int evict_stat = 0;
    int p = 0;
    if (key ==0)
      {
        System.Diagnostics.Debug.Assert(false, "Key of zero used");
        return -4;
      }
    else if (next_free == waycap * n_ways)
      {
        System.Diagnostics.Debug.Assert(false, "Table is full");
        return -2;
      }
    else
      {
        p = next_free ++;
        dataArray[p] = value;
        stats_inserts += 1;
        while (true)
          {
            int nn, hh=0;
            for (nn=0; nn<n_ways; nn++)
              {
                stats_insert_probes += 1;
                hh = hash(nn, key);
                if (keyTables[nn][hh] == 0) { keyTables[nn][hh] = key; break; }
              } 
            if (nn==n_ways)
              {
                Console.WriteLine("Eviction {0} needed", evict_stat);
                evict_stat++;
                stats_insert_evictions += 1;
                int key1 = keyTables[next_victim][hh];
                int p1 = valuePointerTables[next_victim][hh];
                keyTables[next_victim][hh] = key;
                valuePointerTables[next_victim][hh] = p;
                key = key1;
                p = p1;
                next_victim = (++next_victim) % n_ways;
              }
            else
              {
                keyTables[nn][hh] = key;
                valuePointerTables[nn][hh] = p;
                return 0;
              }
          }
      }
  }


  public int lookup(int key, out ulong value)
  {
    stats_lookups += 1;
    value = 0;
    if (key ==0)
       {
         System.Diagnostics.Debug.Assert(false, "Key of zero used");
         return -4;
       }
    else
      {
         int nn, h=0;
         for (nn=0; nn<n_ways; nn++)
         {
           stats_lookup_probes += 1;
           h = hash(nn, key);
           if (keyTables[nn][h] == key) break; 
         } 
         if (nn==n_ways)
         {
           return -5; // not found
         }
         int p = valuePointerTables[nn][h];
         value = dataArray[p];
         return 0;
      }
   }


}


static class Testbench
{
  const int items = 32768;

  [Kiwi.HardwareEntryPoint()]  
  static public void Main()
  {
    Kiwi.KppMark("Start");
    Console.WriteLine("Cuckoo cache testbench start. Capacity={0}", items);
    CuckooHasher chasher = new CuckooHasher(items);
    chasher.Clear();
    Console.WriteLine("Cuckoo cache cleared");           
    Kiwi.KppMark("Cach Cleared");
    DataGenerator dg = new DataGenerator();
    
    if (true)
      {
        dg.Reset();
        int trials = 0, successes = 0;
        for (int iv = 0; iv < 2*items/3; iv++)
          { int key; ulong value;
            dg.Generate(out key, out value);
            int rc = chasher.insert(key, value);
            if (rc == 0) successes ++;
            trials ++;
          }
        Console.WriteLine("Cuckoo cache inserted items {0}/{1}", successes, trials);             
      }
    
    Kiwi.KppMark("Data Entered");
    if (true)
      {
        dg.Reset();
        int trials = 0, successes = 0;
        for (int iv = 0; iv < 2*items/3; iv++)
          { int key; ulong wvalue, rvalue;
            dg.Generate(out key, out wvalue);
            int rc = chasher.lookup(key, out rvalue);
            if (rc == 0 && rvalue == wvalue) successes ++;
            trials ++;
          }
        Console.WriteLine("Cuckoo cache retrieved items {0}/{1}", successes, trials);             
      }
    Kiwi.KppMark("Readback Done");
    chasher.printStats();
    Console.WriteLine("Cuckoo cache demo finished.");
    }
}
// eof

KiwiC Compile

We used a Makefile to compile the demo:

#//
#// Kiwi Scientific Acceleration Example - Cuckoo Hash Demo
#// (C) 2016 DJ Greaves, University of Cambridge, Computer Laboratory.
#//
ANAME=cuckoo_hash_demo
CSC     ?=gmcs
KLIB1   ?=${HPRLS}/kiwipro/kiwic/distro/support/Kiwi.dll
KIWIC   ?=${HPRLS}/kiwipro/kiwic/distro/bin/kiwic -give-backtrace
CV_INT_ARITH ?= $(HPRLS)/kiwipro/kiwic/distro/lib/cvgates.v

KIWIFLAGS= -vnl-resets=synchronous \
           -kiwic-cil-dump=combined \
           -kiwic-kcode-dump=enable \
           -res2-loadstore-port-count=0 \
           -vnl-roundtrip=disable \
           -bevelab-default-pause-mode=soft \
           -bevelab-soft-pause-threshold=15 \
           -vnl-rootmodname=DUT
all: isim

$(ANAME).exe:$(ANAME).cs
        $(CSC) $(ANAME).cs -r:$(KLIB1) 

$(ANAME).v:$(ANAME).exe
        $(KIWIC)  -vnl=$(ANAME).v $(ANAME).exe $(KIWIFLAGS)

monorun:$(ANAME).exe
        MONO_PATH=$(HPRLS)/kiwipro/kiwic/distro/support mono $(ANAME).exe

isim:$(ANAME).v
        iverilog $(ANAME).v vsys.v $(CV_INT_ARITH)
        ./a.out
        cp vcd.vcd ~/Dropbox

# eof

Make compiled this program with the following command line

gmcs cuckoo_hash_demo.cs -r:/home/djg11/d320/hprls/kiwipro/kiwic/distro/support/Kiwi.dll
/home/djg11/d320/hprls/kiwipro/kiwic/distro/bin/kiwic -res2-loadstore-port-count=0 
 -rootmodname=DUT -vnl=cuckoo_hash_demo.v cuckoo_hash_demo.exe 
 -vnl-resets=synchronous -vnl-roundtrip=disable -bevelab-default-pause-mode=soft  -bevelab-soft-pause-threshold=20

Generated Verilog RTL

The KiwiC compiler generated 12000 lines of Verilog RTL (more than half are embedded comments) that instantiated the following modules:

The full RTL output file is cuckoo_demo.v.

You can see it instantiated the following structural resources:

  CV_INT_VL_MODULUS_S isMODULUS10(clk, reset, isMODULUS10_rdy, isMODULUS10_req, isMODULUS10_RR, isMODULUS10_NN, isMODULUS10_DD, isMODULUS10_err);

  CV_INT_VL_MODULUS_S isMODULUS12(clk, reset, isMODULUS12_rdy, isMODULUS12_req, isMODULUS12_RR, isMODULUS12_NN, isMODULUS12_DD, isMODULUS12_err);

  CV_SP_SSRAM_FL1 #(6'd32, 4'd13, 14'd8192, 6'd32) A_SINT_CC_MAPR12NoCE3_ARD0(clk, reset, A_SINT_CC_MAPR12NoCE3_ARD0_RDD0, A_SINT_CC_MAPR12NoCE3_ARD0_AD0,
    A_SINT_CC_MAPR12NoCE3_ARD0_WEN0, A_SINT_CC_MAPR12NoCE3_ARD0_REN0, A_SINT_CC_MAPR12NoCE3_ARD0_WRD0);

  CV_SP_SSRAM_FL1 #(6'd32, 4'd13, 14'd8192, 6'd32) A_SINT_CC_MAPR12NoCE2_ARD0(clk, reset, A_SINT_CC_MAPR12NoCE2_ARD0_RDD0, A_SINT_CC_MAPR12NoCE2_ARD0_AD0,
    A_SINT_CC_MAPR12NoCE2_ARD0_WEN0, A_SINT_CC_MAPR12NoCE2_ARD0_REN0, A_SINT_CC_MAPR12NoCE2_ARD0_WRD0);

  CV_SP_SSRAM_FL1 #(6'd32, 4'd13, 14'd8192, 6'd32) A_SINT_CC_MAPR12NoCE1_ARD0(clk, reset, A_SINT_CC_MAPR12NoCE1_ARD0_RDD0, A_SINT_CC_MAPR12NoCE1_ARD0_AD0,
    A_SINT_CC_MAPR12NoCE1_ARD0_WEN0, A_SINT_CC_MAPR12NoCE1_ARD0_REN0, A_SINT_CC_MAPR12NoCE1_ARD0_WRD0);

  CV_SP_SSRAM_FL1 #(6'd32, 4'd13, 14'd8192, 6'd32) A_SINT_CC_MAPR12NoCE0_ARD0(clk, reset, A_SINT_CC_MAPR12NoCE0_ARD0_RDD0, A_SINT_CC_MAPR12NoCE0_ARD0_AD0,
    A_SINT_CC_MAPR12NoCE0_ARD0_WEN0, A_SINT_CC_MAPR12NoCE0_ARD0_REN0, A_SINT_CC_MAPR12NoCE0_ARD0_WRD0);

The data has been stored in FPGA block-RAM (BRAM) since the capacity is below the compilation default thresholds. If we had requested a larger capacity then it would be stored in DRAM outside the FPGA. If we had left the capacity to be set at runtime size, this would fail under KiwiC 1.0 (but would be put in DRAM by KiwiC 2.0 which is under development.)

If you look closely at the output file (or read the report file), you see KiwiC has used a full run-time divider to take the mod by waycap since a slight bug in that release has stopped it realising that waycap is a constant that can be replaced with a bitmask! The PRBS generators need dividers of course.

Simulation Output

iverilog cuckoo_hash_demo.v vsys.v /home/djg11/d320/hprls/kiwipro/kiwic/distro/lib/cvgates.v
./a.out
Cuckoo cache testbench start. Capacity=32768
Cuckoo cache cleared
Eviction 0 needed
Eviction 0 needed

Cuckoo cache inserted items 21845/21845
Cuckoo cache retrieved items 21598/21845
cuckoo cache: this=0, inserts=21845, lookups=21845
cuckoo cache: insert_probes=45687, insert_evictions=343
cuckoo cache: lookup_probes=44756
Cuckoo cache demo finished.

Process make finished

Run Under Mono

For comparison, the same .exe file is run under mono:

MONO_PATH=/home/djg11/d320/hprls/kiwipro/kiwic/distro/support mono cuckoo_hash_demo.exe
Cuckoo cache testbench start. Capacity=32768
Cuckoo cache cleared
Eviction 0 needed
Eviction 0 needed
...
Cuckoo cache inserted items 21845/21845
Cuckoo cache retrieved items 21598/21845
cuckoo cache: this=CuckooHasher, inserts=21845, lookups=21845
cuckoo cache: insert_probes=45687, insert_evictions=343
cuckoo cache: lookup_probes=44756
Cuckoo cache demo finished.

Performance and Conclusions

We need to explore the performance of this design. So far we just simply coded it up and ran it as a demo. The parallel speedup potentially arises from the multiple RAMs used for each way that are consulted concurrently.


UP