Implementing a 4-way set-associative cache is relatively straightforward.
We do not need an associative RAM macrocell: just synthesise four sets of XOR gates from RTL using the `==' operator!
    // Four ways, each with 32768 one-word lines and a 15-bit tag per line.
    reg [31:0] data0 [0:32767], data1 [0:32767], data2 [0:32767], data3 [0:32767];
    reg [14:0] tag0 [0:32767], tag1 [0:32767], tag2 [0:32767], tag3 [0:32767];

    // addr[16:2] indexes every way; addr[31:17] is compared with each stored tag.
    always @(posedge clk) begin
        miss <= 0;
        if      (tag0[addr[16:2]]==addr[31:17]) dout <= data0[addr[16:2]];
        else if (tag1[addr[16:2]]==addr[31:17]) dout <= data1[addr[16:2]];
        else if (tag2[addr[16:2]]==addr[31:17]) dout <= data2[addr[16:2]];
        else if (tag3[addr[16:2]]==addr[31:17]) dout <= data3[addr[16:2]];
        else miss <= 1;   // no way matched
    end
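The address fields used in the RTL can be checked with a small software model. The following Python sketch is only an illustration: the names `split_address` and `capacity_bytes` are hypothetical, and it assumes a 32-bit byte address with one 32-bit word per cache line, as the array declarations suggest.

```python
INDEX_BITS = 15   # addr[16:2] selects one of 32768 lines in each way
TAG_BITS = 15     # addr[31:17] is compared against the stored tag
WAYS = 4
BYTES_PER_LINE = 4  # assumption: one 32-bit word per line


def split_address(addr):
    """Return (tag, index, byte_offset) for a 32-bit byte address."""
    byte_offset = addr & 0x3
    index = (addr >> 2) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> 17) & ((1 << TAG_BITS) - 1)
    return tag, index, byte_offset


def capacity_bytes():
    """Total data storage across all four ways."""
    return WAYS * (1 << INDEX_BITS) * BYTES_PER_LINE
```

Under these assumptions the data arrays hold 4 x 32768 x 4 bytes = 512 KiB, and any two addresses that differ by a multiple of 128 KiB share an index and so compete for the same four ways.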
Of course we also need a write and evict mechanism... (not shown).
Rather than implementing least-recently-used (LRU) replacement, one tends to use `random' replacement, which can be as simple as keeping a two-bit counter to say which `way' to evict next.
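A behavioural sketch of the counter-based replacement might look as follows. It is a software model in Python, not RTL, and makes several assumptions that the notes do not state: a single counter shared by all lines, advanced once per eviction; a hypothetical `backing_read` callback standing in for main memory; and read-only traffic, so no dirty-line write-back is modelled.

```python
class FourWayCache:
    """Behavioural model: four ways, 32768 lines per way, one word per line."""

    WAYS = 4

    def __init__(self, backing_read):
        # Hypothetical callback that fetches a word from main memory.
        self.backing_read = backing_read
        self.tags = [dict() for _ in range(self.WAYS)]  # per way: index -> tag
        self.data = [dict() for _ in range(self.WAYS)]  # per way: index -> word
        self.next_victim = 0  # the two-bit counter naming the way to evict

    def read(self, addr):
        """Return (word, hit) for a 32-bit byte address."""
        index = (addr >> 2) & 0x7FFF
        tag = (addr >> 17) & 0x7FFF
        for way in range(self.WAYS):
            if self.tags[way].get(index) == tag:
                return self.data[way][index], True
        # Miss: overwrite the way named by the counter, then advance it.
        way = self.next_victim
        self.next_victim = (self.next_victim + 1) & 0x3
        word = self.backing_read(addr)
        self.tags[way][index] = tag
        self.data[way][index] = word
        return word, False
```

Reading five addresses that share an index but differ in tag fills ways 0 to 3 and then evicts way 0 again, regardless of how recently that way was used. That is the price of the cheap policy compared with true LRU.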
Typically an IP company like ARM will provide a high-quality, carefully-tuned implementation, ready to go.
Comp-arch exercise: add a `way prediction cache' that avoids the double lookup latency. A way cache records which way was last accessed and optimistically forwards the result from it, giving access times closer to those of a direct-mapped cache, without the aliasing overheads.
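As a starting point for the exercise, the idea can be modelled in software before writing any RTL. The Python sketch below is one possible interpretation, not a reference solution: it assumes the predictor keeps one remembered way per index (a single global register would be another option), and the class name `WayPredictedCache`, the `install` helper and the outcome labels are all hypothetical.

```python
class WayPredictedCache:
    """Behavioural model of a 4-way cache with a per-index way predictor."""

    WAYS = 4

    def __init__(self):
        self.tags = [dict() for _ in range(self.WAYS)]  # per way: index -> tag
        self.data = [dict() for _ in range(self.WAYS)]  # per way: index -> word
        self.predicted_way = {}  # index -> way that hit most recently

    def install(self, addr, word, way):
        """Place a word in a chosen way (stands in for the fill logic)."""
        index = (addr >> 2) & 0x7FFF
        self.tags[way][index] = (addr >> 17) & 0x7FFF
        self.data[way][index] = word

    def read(self, addr):
        """Return (word, outcome); outcome is 'fast', 'slow' or 'miss'."""
        index = (addr >> 2) & 0x7FFF
        tag = (addr >> 17) & 0x7FFF
        guess = self.predicted_way.get(index, 0)
        # Fast path: only the predicted way is consulted.
        if self.tags[guess].get(index) == tag:
            return self.data[guess][index], "fast"
        # Slow path: check the remaining ways and retrain the predictor.
        for way in range(self.WAYS):
            if way != guess and self.tags[way].get(index) == tag:
                self.predicted_way[index] = way
                return self.data[way][index], "slow"
        return None, "miss"
```

Repeated accesses to the same line take the fast path, while alternating between two lines that share an index mispredicts every time. A hardware version would need to decide how many cycles the slow path costs.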
(C) 2008-17, DJ Greaves, University of Cambridge, Computer Laboratory.