Implementing a 4-way set-associative cache is relatively straightforward.
We do not need an associative RAM macrocell: just synthesise four sets of XOR gates (one tag comparator per way) from RTL using the `==' operator!
    // addr[16:2] selects the set (32768 sets); addr[31:17] is the 15-bit tag.
    reg [31:0] data0 [0:32767], data1 [0:32767], data2 [0:32767], data3 [0:32767];
    reg [14:0] tag0 [0:32767], tag1 [0:32767], tag2 [0:32767], tag3 [0:32767];

    always @(posedge clk) begin
        miss <= 0;  // default is a hit; overridden by the final else on a miss
        if      (tag0[addr[16:2]]==addr[31:17]) dout <= data0[addr[16:2]];
        else if (tag1[addr[16:2]]==addr[31:17]) dout <= data1[addr[16:2]];
        else if (tag2[addr[16:2]]==addr[31:17]) dout <= data2[addr[16:2]];
        else if (tag3[addr[16:2]]==addr[31:17]) dout <= data3[addr[16:2]];
        else miss <= 1;
    end
Of course we also need a write and evict mechanism... (not shown).
Rather than implement least recently used (LRU), one tends to do `random' replacement, which can be as simple as keeping a two-bit counter to say which `way' to evict next.
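As a rough illustration of the counter idea, here is a minimal sketch of a refill path. It assumes the same array shapes and address split as the lookup code, that lines are never dirty (so eviction is a plain overwrite with no write-back), and a single counter shared by all sets. The names `cache_fill', `fill_en', `fill_addr', `fill_data' and `victim' are hypothetical and not part of the design above; in practice this logic would be merged into the module holding the lookup block.

```verilog
// Hypothetical refill logic: overwrite the way named by a two-bit counter.
// Assumes clean lines only (no dirty bits, no write-back on eviction).
module cache_fill (
    input  wire        clk,
    input  wire        fill_en,     // high for one cycle when refill data arrives
    input  wire [31:0] fill_addr,   // address that missed
    input  wire [31:0] fill_data    // word returned from main memory
);
    reg [31:0] data0 [0:32767], data1 [0:32767], data2 [0:32767], data3 [0:32767];
    reg [14:0] tag0  [0:32767], tag1  [0:32767], tag2  [0:32767], tag3  [0:32767];

    reg [1:0] victim = 2'd0;        // which way to evict next

    wire [14:0] index = fill_addr[16:2];   // set index, as in the lookup
    wire [14:0] tag   = fill_addr[31:17];  // tag, as in the lookup

    always @(posedge clk) begin
        if (fill_en) begin
            case (victim)
                2'd0: begin tag0[index] <= tag; data0[index] <= fill_data; end
                2'd1: begin tag1[index] <= tag; data1[index] <= fill_data; end
                2'd2: begin tag2[index] <= tag; data2[index] <= fill_data; end
                2'd3: begin tag3[index] <= tag; data3[index] <= fill_data; end
            endcase
            victim <= victim + 2'd1;       // two-bit counter wraps from 3 to 0
        end
    end
endmodule
```

Because the counter only advances on a fill, successive fills cycle through the ways in order; the choice looks `random' only in the sense that it is unrelated to how recently each way was used. A real design would also want a valid bit per line so that uninitialised tags cannot produce false hits.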
Comp-arch exercise: add a `way prediction cache' that avoids the double lookup latency.