The Bluespec System Verilog language is increasingly seen as a viable and productive alternative to conventional RTL coding for hardware design. For compiler writers like myself, the best way to learn a new language was to write a toy compiler for it. Also, where there's a lack of formal semantics for a language, writing a compiler is a good way to learn about the important and/or arbitrary decisions that have to be made. (These seem to be in the scheduler and in the way implicit guards of implicit guards are processed)
The user manual from Bluespec Inc may have restricted circulation, so this informal study was based on a couple of published studies: `HARDWARE LANGUAGES AND PROOF' by Dominic Richards of Manchester University and `A Unified Model for Hardware/Software Codesign' by Nirav Dave when at MIT.
This article gives the essence of a compiler for Bluespec. This toy compiler was coded in FSharp using the HPR L/S library. That library provides various logic manipulation facilities and output functions for SMV, SystemC, Verilog and so on. Rather than initially writing a Bluespec parser, the test abstract-syntax trees were manually entered as separate FSharp source files. Later a full parser was added.
An Open-Source Bluespec Compiler PDF.
Bluespec is based around the concept of modules and rules. A module contains zero or more rules. A module also instantiates zero or more lower modules. Modules instantiated at the lowest levels are primitives, such as FIFOs, registers and RAMs. Bluespec starts structural elaboration at a top-level module. The module hierarchy is nominally flattened during the structural elaboration process. Once elaboration is complete, we have essentially a flat collection of interconnected Bluespec rules and primitives.
Our first implementation was just a proof of concept, without a parser or structural elaborator. Instead we manually entered the abstract syntax tree of the elaborated rules and primitives as an FSharp data structure.
let test_ast1 = (* module test1(); Reg#(#UInt(15)) ctr1 <- mk_register(0) rule test_ast1 (True) ; ctr1.write(ctr1.read()+1) endrule endmodule *) Bsv_moduleDef("test1", ([], [], Bsv_typen "Empty"), [ Bsv_varDeclAssign("ctr1", [B_num [15]; B_num[1]], "mkReg", []); Bsv_rule("test_ast1", B_true, [ g_write("ctr1", B_diadic(V_plus, g_read("ctr1"), B_num[1]))]) ], [])
This first test generated the following output (which works fine: ctr is incremented on every clock cycle):
// CBG Orangepath HPR/LS System // /home/djg11/d320/hprls/bsvc/bsv.exe -vnl test1.vnl test1 module test1(input clk, input RST_N); reg test_ast1_FIRE; reg [14:0] test1_ctr1_DREG; reg [14:0] test1_ctr1_Din; wire test1_ctr1_RDY; reg test1_ctr1_EN; always @(posedge clk ) begin if (!RST_N || test1_ctr1_EN) test1_ctr1_DREG <= (!RST_N ? 0: test1_ctr1_Din); end assign test_ast1_FIRE = 1; always @(*) test1_ctr1_EN = test_ast1_FIRE; always @(*) test1_ctr1_Din = 1+test1_ctr1_DREG; // 2 vectors of width 1 // 2 vectors of width 15 // Total state bits in module = 32 bits. // 1 continuously assigned (wire/non-state) bits endmodule // eof (HPR/LS Verilog)
The second test was to compile my Dining Philosophers example: Dining Philosophers in Bluespec Verilog.
Not originally having a parser, I had to manually enter the syntax tree for the example. Fortunately it is not too long.
The main part of the grammar is:
// Bluespec: Toy abstract syntax (may not be right!) // Various modifiers for an interface: type bsv_provisos_t = | Proviso_alwaysReady | Proviso_alwaysEnabled ; // Method prototype protocol has one of these three forms: type bsv_protoprotocol_t = | BsvProtoValue of bsv_type_name_t | BsvProtoActionValue of bsv_type_name_t | BsvProtoAction and bsv_type_name_t = | Bsv_typen of string // for now - separate from bsv_if_t ... but can join. | Bsv_typen_uint_ns // Native/primitive run-time type | Bsv_typen_uint1 of int // Native/primitive run-time type | Bsv_typen_action // Method prototype: has name, provisio/pragmas, return type and method formals (pairs of formal and actuals so-far bound). and bsv_methodProto_t = Bsv_methodProto of string * bsv_provisos_t list * bsv_protoprotocol_t * (bsv_type_name_t * string) list and bsv_sigma_protocol_t = | BsvValue | BsvAction of hexp_t option * (hbexp_t * hexp_t) // (en-net if needed) * (rdy-x, rdy-net) and bsv_sigma_t = Bsv_current of string * bsv_sigma_protocol_t * (bsv_type_name_t * hexp_t option) * (hexp_t * hexp_t) list and actuals_t = (string * (bsv_type_name_t * bsv_exp_t)) list and formals_t = (bsv_exp_t * string) list and bsv_moduleStmt_t = // Or actionValueStmt ? | Bsv_varDeclAssign of string * bsv_exp_t list * string * string list | Bsv_rule of string * bsv_bexp_t * bsv_moduleStmt_t list | Bsv_assignStmt of string * bsv_exp_t | Bsv_eascActionStmt of bsv_exp_t | Bsv_pliStmt of string * bsv_exp_t list // for $display and so on | Bsv_beginEndStmt of bsv_moduleStmt_t list | Bsv_ifThenStmt of bsv_bexp_t * bsv_moduleStmt_t * bsv_moduleStmt_t option | Bsv_whileStmt of bsv_bexp_t * bsv_moduleStmt_t * bsv_moduleStmt_t option | Bsv_returnStmt of bsv_exp_t // Only in actionvalue blocks. | Bsv_CaseStmt of bsv_exp_t * (bsv_exp_t * bsv_moduleStmt_t) list * bsv_moduleStmt_t option | Bsv_methodDef of string * formals_t * bsv_bexp_t * bsv_moduleStmt_t list | Bsv_primBuffer of bsv_exp_t * bsv_exp_t (* Only for internal implementation *) // There are three lots of 'parameters' to the typical module as stored in this triple. // They are the parametric type args of the exported interface, the parameters to the generate // function and the list of interfaces, which consists of those used plus the single interface exported. and bsv_moduleParams_t = (bsv_type_name_t * string) list * (bsv_type_name_t * string) list * bsv_type_name_t // The Bluespec interface definition: Generic parameters and list of methods: and bsv_if_t = Bsv_if of (bsv_type_name_t * string) list * bsv_methodProto_t list // Boolean expressions and bsv_bexp_t = | B_true | B_false | B_firing of string // Backdoor access to the composite guard for any rule. | B_not of bsv_bexp_t | B_and of bsv_bexp_t list | B_or of bsv_bexp_t list | B_bexp of hbexp_t // Pre-compiled forms | B_bdiop of x_bdiop_t * bsv_exp_t list // Equality, less than and orreduce where list is a singleton. | B_orred of bsv_exp_t // Integer expressions and bsv_exp_t = | B_num of int list | B_blift of bsv_bexp_t | B_query of bsv_bexp_t * bsv_exp_t * bsv_exp_t | B_var of string | B_string of string | B_hexp of hexp_t | B_fsmStmt of bsv_exprFsmStmt_t | B_ifcRef of string // Interface reference | B_diadic of x_diop_t * bsv_exp_t * bsv_exp_t | B_apply of string list * bsv_exp_t list // Finite-state machine sub-language. and bsv_exprFsmStmt_t = | Bsv_seqFsmStmt of bsv_exprFsmStmt_t list | Bsv_parFsmStmt of bsv_exprFsmStmt_t list | Bsv_ifFsmStmt of bsv_bexp_t * bsv_exprFsmStmt_t * bsv_exprFsmStmt_t option | Bsv_whileFsmStmt of bsv_bexp_t * bsv_exprFsmStmt_t | Bsv_repeatFsmStmt of bsv_exprFsmStmt_t | Bsv_eascFsmStmt of bsv_exp_t | Bsv_breakFsmStmt | Bsv_continueFsmStmt ;The encoding of the spoon (really should be called a fork!) and the philosopher and the table of five of them look like this:
(* interface Spoon_if; method Action pickup(); method Action putdown(); endinterface *) Bsv_ifDef("Spoon_if", Bsv_if([], [ Bsv_methodProto("pickup", [], BsvProtoAction, []); Bsv_methodProto("putdown", [], BsvProtoAction, []); ]) ) (* module spoon (Spoon_if) ; Reg#(Bool) inuse <- mkReg(?); method Action pickup() if (!inuse); action inuse <= True; endaction endmethod method Action putdown(); action inuse <= False; endaction endmethod endmodule *) Bsv_moduleDef("Spoon", ([], [], Bsv_typen "Spoon_if"), [ Bsv_varDeclAssign("inuse", [B_num[1]; B_num[0]], "mkReg", []); Bsv_methodDef("pickup", [], B_bdiop(V_dned, [g_read "inuse"; B_num[1]]), [ g_write("inuse", B_num[1]); Bsv_pliStmt("$display", [B_string "pickup"]) ]); Bsv_methodDef("putdown", [], B_true, [ g_write("inuse", B_num[0]); Bsv_pliStmt("$display", [B_string "putdown"]) ]) ], []) (* interface Random_if; method ActionValue#(UInt#(15)) gen(); endinterface *) Bsv_ifDef("Random_if", Bsv_if([], [ Bsv_methodProto("gen", [], BsvProtoActionValue(Bsv_typen_uint1 15), []); ]) ) (* module mkRandom_gen #(UInt#(15) seed) (Random_if); Reg#(UInt#(15)) prbs <- mkReg(seed); method ActionValue#(UInt#(15)) gen(); prbs <= (prbs << 1) | (((prbs >> 14) ^ (prbs >> 13)) & 1); return prbs; endmethod endmodule *) Bsv_moduleDef("mkRandom_gen", ([(Bsv_typen "int", "seed")], [], Bsv_typen "Random_if"), [ Bsv_varDeclAssign("prbs", [B_num[15]; B_var "seed"], "mkReg", []); Bsv_methodDef("gen", [], B_true, [ g_write("prbs", B_diadic(V_bitor, B_diadic(V_lshift, g_read "prbs", B_num[1]), B_diadic(V_bitand, (B_diadic(V_bitor, B_diadic(V_rshift, g_read "prbs", B_num[14]), B_diadic(V_rshift, g_read "prbs", B_num[13]))), B_num[1]))); Bsv_returnStmt(g_read "prbs"); ]) ], []) Bsv_moduleDef("Philo", ([(Bsv_typen "int", "seed"); (Bsv_typen "int", "on")], [(Bsv_typen "Spoon_if", "left"); (Bsv_typen "Spoon_if", "right")], Bsv_typen "Empty"), [ Bsv_varDeclAssign("gen", [B_var "seed"], "mkRandom_gen", []); Bsv_varDeclAssign("timer", [B_num[15]; B_num[0]], "mkReg", []); Bsv_varDeclAssign("x", [B_num[15]; B_num[0]], "mkReg", []); Bsv_rule("foo", B_bdiop(V_dned, [g_read("timer"); B_num[0]]), [ g_write("timer", B_diadic(V_minus, g_read "timer", B_num[1]))]); Bsv_varDeclAssign("eating", [B_num[1]; B_num[0]], "mkReg", []); Bsv_assignStmt("seq_behaviour", B_fsmStmt( Bsv_whileFsmStmt(B_true, Bsv_seqFsmStmt [ fsm_write("x", g_actionVal("gen", "gen", [])); fsm_write("timer", B_diadic(V_bitand, g_read "timer", B_num[31])); fsm_await(B_bdiop(V_deqd, [g_read "timer"; B_num[0]])); fsm_action("left", "pickup", []); fsm_noAction(); fsm_write("x", g_actionVal("gen", "gen", [])); fsm_write("timer", B_diadic(V_bitand, g_read "timer", B_num[31])); fsm_await(B_bdiop(V_deqd, [g_read "timer"; B_num[0]])); fsm_action("right", "pickup", []); fsm_noAction(); fsm_write("eating", B_num[1]); fsm_write("timer", B_var "on"); fsm_await(B_bdiop(V_deqd, [g_read "timer"; B_num[0]])); fsm_write("eating", B_num[0]); fsm_noAction(); fsm_action("left", "putdown", []); fsm_noAction(); fsm_action("right", "putdown", []); fsm_noAction(); ]))); Bsv_varDeclAssign("FSM", [B_var "seq_behaviour" ], "mkFSM", [ ]); Bsv_rule("kickoff", B_true, [ g_action("FSM", "start", []) ]); ], []) Bsv_moduleDef("philoBENCH", ([], [], Bsv_typen "Empty"), [ Bsv_varDeclAssign("spoon0", [], "Spoon", []); Bsv_varDeclAssign("spoon1", [], "Spoon", []); Bsv_varDeclAssign("spoon2", [], "Spoon", []); Bsv_varDeclAssign("spoon3", [], "Spoon", []); Bsv_varDeclAssign("spoon4", [], "Spoon", []); Bsv_varDeclAssign("philo0", [B_num[3]; B_num[7] ], "Philo", [ "spoon0"; "spoon1" ]); Bsv_varDeclAssign("philo1", [B_num[4]; B_num[4] ], "Philo", [ "spoon1"; "spoon2" ]); Bsv_varDeclAssign("philo2", [B_num[2]; B_num[9] ], "Philo", [ "spoon3"; "spoon2" ]); Bsv_varDeclAssign("philo3", [B_num[3]; B_num[6] ], "Philo", [ "spoon3"; "spoon4" ]); Bsv_varDeclAssign("philo4", [B_num[3]; B_num[6] ], "Philo", [ "spoon4"; "spoon0" ]); ], []) ;
Note that I have reversed the spoons to philosopher number 2. This stops the system from deadlocking straightaway!
The compiler creates instances of each leaf or child component. Each component has some number of invokable methods. It is not normally important to distinguish which method is sported by which component, except for re-entrancy constraints. The general method has ready and enable handshake nets. An OR-gate asserts enable when any of the rules invokes the method. A multiplexor steers the arguments from the rule and hence only one rule should assert enable at a time. The ready output and optional return value output are fanned out to the rules that invoke the method.
General view of the Bluespec wiring between scheduller, rule logic and leaf components. Each rule connects to some number of leaf components. Dotted outline shows example component boundaries.
The scheduler connects to each rule with a handshake pair. It knows which rules conflict with each others and will not simultaneously assert the fireguard to any pair in a conflict group.
The Verilog RTL output looks as follows (note: this was the very first run and there might be some bugs in this, but it seems to work!) HERE.
Does this deadlock? The HPR library can output also an NuSMV file of this design. We can give SMV the -ctt command line flag and it will check for deadlock (eventually?).
The above toy compiler has a default rule precedence based on the textual order of the rules in the source code. It's not hard to add priorities.
The above compiler does not have the interlock that BSV defaults to ON for each rule that stops it firing more than once per cycle. This can be added where the comment says.
Finally, the above compiler does not check alternate compositions of rule pairs to see if they have the same effect, but a simple approach based on textual equivalence of effect would work quite well because the expressions are all converted to a normal memoized form by the HPR library, so most minor artefacts arising from alternate compositions, such as operator association and operand interchange in commutative operators are eliminated.
A simple parser was implemented using fsyacc. The resulting abstract syntax tree required a typechecker, normaliser and elaborator for all the functional hardware construction (aka generate) forms, but this then produced the same core lanugage AST as used in the initial experiments.
Parser example run example run.
One of the interesting small tests available online is SimpleProcessor.
mono bsv.exe -incdir=/home/djg11/d320/hprls/bsvc/smalltests -cpp=disable -comb-assignment-delay=10 -vnl=dut.v smalltests/SimpleProcessorTb.bsv cver dut.v vsys.v GPLCVER_2.11a of 07/05/05 (Linux-elf). Copyright (c) 1991-2005 Pragmatic C Software Corp. All Rights reserved. Licensed under the GNU General Public License (GPL). See the 'COPYING' file for details. NO WARRANTY provided. Today is Wed Nov 7 14:44:56 2012. Compiling source file "dut.v" Compiling source file "vsys.v" Highest level modules: SIMSYS 48: output regs[ 1] = 3 49: Halt at pc13 There were 0 error(s), 0 warning(s), and 24 inform(s).
The online Bluespec examples define a simple microprocessor that is provided with an assembly coding of Euclid's algorith. The above plot shows the toy compiler can now parse and correctly execute the processor and the example algorithm: the GCD of 27 and 15 is 3 which is printed from R1 at the end of the computation. The delay at the start is the download of the program into the processor by the testbench. Mind you, this hardly stretches the interesting parts of the Bluespec scheduler.
Listing ((C) Bluespec Inc):
Instruction code [14] = { tagged MovI { rd: R0, v: 0 }, // 0: The constant 0 tagged MovI { rd: R1, v: 15 }, // 1: x = 21 tagged MovI { rd: R2, v: 27 }, // 2: y = 27 // label_loop tagged Brz { rs: R2, dest:label_done }, // 3: if (y == 0) goto done tagged Gt { rd: R3, rs1: R1, rs2: R2 }, // 4: tmp = (x > y) tagged Brz { rs: R3, dest: label_subtract }, // 5: if (x <= y) goto subtract // swap tagged Minus { rd: R3, rs1: R1, rs2: R0 }, // 6: tmp = x; tagged Minus { rd: R1, rs1: R2, rs2: R0 }, // 7: x = y; tagged Minus { rd: R2, rs1: R3, rs2: R0 }, // 8: y = tmp; tagged Br label_loop, // 9: goto loop // label_subtract tagged Minus { rd: R2, rs1: R2, rs2: R1 }, // 10: y = y - x tagged Br label_loop, // 11: goto loop // label_done tagged Output R1, // 12: output x // label_last tagged Halt // 13: halt };
Verilog object file generated: dut.v.
Simon Moore implemented Chuck Thacker's Tiny3 Processor in Bluespec. Simon's source files are here SWM Tiny3. This design now (Jan 2013) goes through the Toy Compiler and works!
The test program is coded in an assembly language entirely written in Bluespec. It is a simple program that writes the integers 9 down to 0 to the output FIFO.
function List#(AsmLineT) my_program(function ImmT findaddr(String label)) = /*0*/ cons(asmi("", 0, 0), // Keep register 0 clear. /*1*/ cons(asmi("", 1, 10), // put 10 in r1 /*2*/ cons(asmi("", 2, findaddr("loop")), // loop address in r2 /*3*/ cons(asm("loop", OpOut, FDECb, ShiftNone, SkipNever, 1, 0, 1), // r1-- and output /*4*/ cons(asm("", OpJump, FaORb, ShiftNone, SkipZero, 3, 2, 0), // jump r2 if not zero /*5*/ cons(asm("", OpReserved, Freserved, ShiftNone, SkipNever, 0, 0, 0), // stop processor tagged Nil))))));
Simulation output:
cver dut.v vsys.v GPLCVER_2.12a of 05/16/07 (Linux-elf). Copyright (c) 1991-2007 Pragmatic C Software Corp. All Rights reserved. Licensed under the GNU General Public License (GPL). See the 'COPYING' file for details. NO WARRANTY provided. Today is Sun Jan 20 18:39:48 2013. Compiling source file "dut.v" Compiling source file "vsys.v" Highest level modules: SIMSYS End of prog load 2100: fetch 00 80000000: 2700: doSkip = False 2700: r 0 = imm = 0 2900: fetch 01 8100000a: 3500: doSkip = False 3500: r 1 = imm = 10 3700: fetch 02 82000003: 4300: doSkip = False 4300: r 2 = imm = 3 4500: fetch 03 01000583: 5100: doSkip = False 5100: rd <- 1 9 Write out 00000009 5300: fetch 5300: output = 9 04 03040296: 5900: doSkip = False 6100: fetch 03 01000583: 6700: doSkip = False 6700: rd <- 1 8 Write out 00000008 6900: fetch 6900: output = 8 04 03040296: 7500: doSkip = False 7700: fetch 03 01000583: 8300: doSkip = False 8300: rd <- 1 7 Write out 00000007 8500: fetch 8500: output = 7 04 03040296: 9100: doSkip = False 9300: fetch 03 01000583: 9900: doSkip = False 9900: rd <- 1 6 Write out 00000006 10100: fetch 10100: output = 6 04 03040296: 10700: doSkip = False 10900: fetch 03 01000583: 11500: doSkip = False 11500: rd <- 1 5 Write out 00000005 11700: fetch 11700: output = 5 04 03040296: 12300: doSkip = False 12500: fetch 03 01000583: 13100: doSkip = False 13100: rd <- 1 4 Write out 00000004 13300: fetch 13300: output = 4 04 03040296: 13900: doSkip = False 14100: fetch 03 01000583: 14700: doSkip = False 14700: rd <- 1 3 Write out 00000003 14900: fetch 14900: output = 3 04 03040296: 15500: doSkip = False 15700: fetch 03 01000583: 16300: doSkip = False 16300: rd <- 1 2 Write out 00000002 16500: fetch 16500: output = 2 04 03040296: 17100: doSkip = False 17300: fetch 03 01000583: 17900: doSkip = False 17900: rd <- 1 1 Write out 00000001 18100: fetch 18100: output = 1 04 03040296: 18700: doSkip = False 18900: fetch 03 01000583: 19500: doSkip = False 19500: rd <- 1 0 Write out 00000000 19700: fetch 19700: output = 0 04 03040296: 20300: doSkip = True 20500: fetch 06 6c697374: 21100: doSkip = False 21100: rd <- 108 x 21300: fetch 07 6c697374: 21900: doSkip = True 21900: rd <- 108 x 22100: fetch 09 xxxxxxxx: Halted at location **vsys.v(16) time 35000 from call to $finish. There were 0 error(s), 1 warning(s), and 63 inform(s).
It is good that this test worked, since it uses a large fraction of the language.
Here we show the output of some experimental language extension experiments. Full details to be in a white paper.
Standard semantics are that a register is updated at most once per clock cycle. (Indeed the standard is a leaf action method is invoked at most once per cycle, but we allready allow all action methods with the same argument to be run more than once on idempotent-marked methods by sharing use of a single call.)
With the superscalar update extension, we allow operations to be invoked more than once per clock cycle for certain methods where extensions have been hardwired into the main body of the compiler. We have so-far implemented support for any number of operations on registers per clock cycle and any number of operations on BRAMs (see register forwarding extension) provided the number of different addresses expressions is less than or equal to the number of address ports on the hardware entity. (An approach for FIFO FUs is also possible in the future and it should be working already for FIFO modules that are defined purely within Bluespec simply out of multiply-updated maybe registers.)
The Test12f demo is compiled in normal and superscalar action mode, an experimental mode that allows multiple updates to a register in one clock cycle.
module mkTest1f2(); Reg#(Bit#(10)) vodka <- mkReg(30); Reg#(Bool) grd <- mkReg(0); rule test_1f_inc1 if (True); vodka <= vodka + 1; if (vodka > 50) $finish; endrule rule test_1f_inc3 if (grd); vodka <= vodka + 3; endrule rule shower if (True); grd <= !grd; $display("Vodka is %1d", vodka); endrule endmodule
RTL output is Verilog. The normal simulation listing generated is:
Vodka is 30 Vodka is 31 Vodka is 32 Vodka is 35 Vodka is 36 Vodka is 39 Vodka is 40 Vodka is 43 Vodka is 44 Vodka is 47 Vodka is 48
But in the new mode we now see that the register either increases by 1 or 4 each clock cycle. An increase of 4 arises from the composition of both register writing rules firing in one clock cycle.
Vodka is 31 Vodka is 35 Vodka is 36 Vodka is 40 Vodka is 41 Vodka is 45 Vodka is 46 Vodka is 50 Vodka is 51
The magic line in the output RTL that encapsulates the compose effect is
always @(*) vodka_write_din = (mkTest1f2_test_1f_inc3_FIRE? 32'd3+rtl_unsigned_extend0((mkTest1f2_test_1f_inc1_FIRE? rtl_unsigned_bitextract1(10'd1+vodka_read): vodka_read )): rtl_unsigned_bitextract1(10'd1+vodka_read));
The following two rules from SimpleProcessorTb do not conflict under the idempotent actions scheme since they both write the same expression to the register iax. But they do write different values under either sequential composition. Under the super-scalar extension, this is fine, both are packed into the same clock cycle whether they conflict or not. TODO explain ...
rule loadInstrs (iax <= label_last); sp.loadInstruction (iax, code [iax]); iax <= iax + 1; endrule rule go (iax == label_last + 1); sp.start(); iax <= iax + 1; endrule
The standard Bluespec compiler performs method calls only on the component that has been explicitly instantiated and named in the source code whereas typical HLS tools choose the number of component instances to instantiate and decide the binding (mapping) between operations and those instances. A component in the Bluespec context is either a leaf component or a child Bluespec module.
In HLS, the relevant components are normally called FUs (functional units). FUs may be stateful or stateless. Binding decisions for stateful FUs are restricted in that state must be read back from the FU where the state lies whereas stateless FUs may be allocated to an operation without such restriction. A stateful FU might be a RAM. A stateless FU might be a multiplier. Moreover, in typical HLS, pipelined operators, such as a three clock cycle multiplier, are regarded as stateless since the compiler has no ambiguity discriminating the scope of the actual state inside the FU and the pipelining is not visible to the user.
In Bluespec, however, it is common to use the Put/Get paradigm for access to pipelined FUs whether stateful or stateless to overcome the lack of intrinsic access to pipelined operators with the standard compiler.
...
Standard semantics are that a static schedule is created that cannot span clock cycles. Hence some rules may be starved of service.
Manual instantiation of arbiters and other anti-hog mechanisms is one solution.
Two automated solutions are multi-cycle static schedules and dynamic schedules. Here we consider the latter.
This demo uses two compilation units that sit each side of an interface. A single compilation unit would allow both rules in the parent/master unit to fire at once, since the method in the interface will be elaborated separately for each call site. But as separate units, we have a structural hazard: the method is manifested at the net level in the interface and can only be invoked once per clock cycle.
Interface definition:
interface BarFace; method Action orderDrink(int which, int no); endinterface
Lower unit definition:
module mkBarTender(BarFace); Reg#(Bit#(10)) beerdrink <- mkReg(20); Reg#(Bit#(10)) winedrink <- mkReg(10); method Action orderDrink(int which, int no); if (which == 1) beerdrink <= beerdrink + no; if (which == 2) winedrink <= winedrink + no; endmethod rule shower if (True); $display("Beer is %1d and wine is %1d", beerdrink, winedrink); endrule endmodule
Upper unit definition:
module mkTest1iBench(); BarFace fbar <- mkBarTender(); rule drinkBeer; fbar.orderDrink(1, 2); // Beer should increase by two every clock cycle. endrule rule drinkWine if (True); fbar.orderDrink(2, 10); // This rule will be starved in the absence of a stateful arbiter -bsv-round-robin=enable endrule endmodule
Compiling with the standard semantics we see a starvation warning from the compiler:
** Warning: sheduler=default_schedule: starvation detected for rule Test1i.mkTest1iBench.drinkWine ** Warning: sheduler=default_schedule: earlier rules being greedy are Test1i.mkTest1iBench.drinkBeer In partition shedtree1, 1 of the 2 rules are potentially starving. No arbiter will be generated owing to recipe setting -bsv-round-robin=disableAnd simulation demonstrates that wine is never consumed owing to beer hogging:
Beer is 22 and wine is 12 Beer is 24 and wine is 12 Beer is 26 and wine is 12 Beer is 28 and wine is 12
The mkTest1gBench demo is now compiled under an experimental mode that instantiates run-time arbiters to provide fairness between the rules.
The simulation listing now generated is
Beer is 20 and wine is 10 Beer is 22 and wine is 10 Beer is 22 and wine is 20 Beer is 24 and wine is 20 Beer is 24 and wine is 30
We see the rules have taken it in turns to fire.
Standard semantics are that a rule is fired at most once per clock cycle.
The multiple updates system gives superscalar performance when the same rule is applied more than once per clock cycle. This needs to be used sparingly since combinational logic can grow quickly when operating on vectors of registers...
An associated extension is the inclusion of %r in format strings for $display and so on. Like the standard %m for module name, no argument is consumed, but the new format string gives the rule name. This is a helpful debug aid when a rule is executed more than once since the substituted string gives not only the rule name but a cardinal extension giving the firing instance number. This means we can insert PLI calls such as the following in each clause of the rule:
$display("Brz (%r) r%d == %0d. Branch taken=%0d ", rs, regs[rs], b);
An an example, we compiled smalltests/SimpleProcessor.bsv with the fetchAndExecute rule being allowed to fire twice per clock cycle. This nearly doubled the performance, with the 50 instructions in the GCD demo being completed in 28 clock cycles instead of 50. The allocation of work to the two instances of the rule is illustrated in the following listing:
Pc:: 0 Instr=0000000000000000 Load immediate (fetchAndExecute_1) R0 with 0x00000000 Pc:: 1 Instr=000000010000000f Load immediate (fetchAndExecute_1) R1 with 0x0000000f Pc:: 2 Instr=000000020000001b Load immediate (fetchAndExecute_1) R2 with 0x0000001b Pc:: 3 Instr=000000080000002c Brz (fetchAndExecute_0) r2 == 27. Branch taken=0 Pc:: 4 Instr=0000000c00000036 Gt (fetchAndExecute_1) r1 > r2 15 27 res=0 Pc:: 5 Instr=000000080000003a Brz (fetchAndExecute_0) r3 == 0. Branch taken=1 Pc:: a Instr=0000001000000029 Subtract (fetchAndExecute_1) 27 15 Pc:: b Instr=0000000400000003 Br (fetchAndExecute_0) 000000003 Pc:: 3 Instr=000000080000002c Brz (fetchAndExecute_0) r2 == 12. Branch taken=0 Pc:: 4 Instr=0000000c00000036 Gt (fetchAndExecute_1) r1 > r2 15 12 res=1 Pc:: 5 Instr=000000080000003a Brz (fetchAndExecute_0) r3 == 4294967295. Branch taken=0 Pc:: 6 Instr=0000001000000034 Subtract (fetchAndExecute_1) 15 0 Pc:: 7 Instr=0000001000000018 Subtract (fetchAndExecute_0) 12 0 Pc:: 8 Instr=000000100000002c Subtract (fetchAndExecute_1) 15 0 Pc:: 9 Instr=0000000400000003 Br (fetchAndExecute_0) 000000003 Pc:: 3 Instr=000000080000002c Brz (fetchAndExecute_0) r2 == 15. Branch taken=0 Pc:: 4 Instr=0000000c00000036 Gt (fetchAndExecute_1) r1 > r2 12 15 res=0 Pc:: 5 Instr=000000080000003a Brz (fetchAndExecute_0) r3 == 0. Branch taken=1 Pc:: a Instr=0000001000000029 Subtract (fetchAndExecute_1) 15 12 Pc:: b Instr=0000000400000003 Br (fetchAndExecute_0) 000000003 Pc:: 3 Instr=000000080000002c Brz (fetchAndExecute_0) r2 == 3. Branch taken=0 Pc:: 4 Instr=0000000c00000036 Gt (fetchAndExecute_1) r1 > r2 12 3 res=1 Pc:: 5 Instr=000000080000003a Brz (fetchAndExecute_0) r3 == 4294967295. Branch taken=0 Pc:: 6 Instr=0000001000000034 Subtract (fetchAndExecute_1) 12 0 Pc:: 7 Instr=0000001000000018 Subtract (fetchAndExecute_0) 3 0 Pc:: 8 Instr=000000100000002c Subtract (fetchAndExecute_1) 12 0 Pc:: 9 Instr=0000000400000003 Br (fetchAndExecute_0) 000000003 Pc:: 3 Instr=000000080000002c Brz (fetchAndExecute_0) r2 == 12. Branch taken=0 Pc:: 4 Instr=0000000c00000036 Gt (fetchAndExecute_1) r1 > r2 3 12 res=0 Pc:: 5 Instr=000000080000003a Brz (fetchAndExecute_0) r3 == 0. Branch taken=1 Pc:: a Instr=0000001000000029 Subtract (fetchAndExecute_1) 12 3 Pc:: b Instr=0000000400000003 Br (fetchAndExecute_0) 000000003 Pc:: 3 Instr=000000080000002c Brz (fetchAndExecute_0) r2 == 9. Branch taken=0 Pc:: 4 Instr=0000000c00000036 Gt (fetchAndExecute_1) r1 > r2 3 9 res=0 Pc:: 5 Instr=000000080000003a Brz (fetchAndExecute_0) r3 == 0. Branch taken=1 Pc:: a Instr=0000001000000029 Subtract (fetchAndExecute_1) 9 3 Pc:: b Instr=0000000400000003 Br (fetchAndExecute_0) 000000003 Pc:: 3 Instr=000000080000002c Brz (fetchAndExecute_0) r2 == 6. Branch taken=0 Pc:: 4 Instr=0000000c00000036 Gt (fetchAndExecute_1) r1 > r2 3 6 res=0 Pc:: 5 Instr=000000080000003a Brz (fetchAndExecute_0) r3 == 0. Branch taken=1 Pc:: a Instr=0000001000000029 Subtract (fetchAndExecute_1) 6 3 Pc:: b Instr=0000000400000003 Br (fetchAndExecute_0) 000000003 Pc:: 3 Instr=000000080000002c Brz (fetchAndExecute_0) r2 == 3. Branch taken=0 Pc:: 4 Instr=0000000c00000036 Gt (fetchAndExecute_1) r1 > r2 3 3 res=0 Pc:: 5 Instr=000000080000003a Brz (fetchAndExecute_0) r3 == 0. Branch taken=1 Pc:: a Instr=0000001000000029 Subtract (fetchAndExecute_1) 3 3 Pc:: b Instr=0000000400000003 Br (fetchAndExecute_0) 000000003 Pc:: 3 Instr=000000080000002c Brz (fetchAndExecute_0) r2 == 0. Branch taken=1 Pc:: c Instr=0000001400000001 38: (fetchAndExecute_0) output regs[1] = 3 Pc:: d Instr=0000001800000000 39: Halt (fetchAndExecute_1) at pc13
This was an easy, get-started example, since in that design, the program code is in the same sort of register file as the processor registers and there is no structural hazard on instruction memory read when performing dual issue.
Normal Execution of the SimpleProcessor design. The result '3' is returned at cycle 48 in the original design.
Superscalar Execution of the SimpleProcessor design. The result '3' is returned at cycle 28 which is a 2-to-1 speed up.
For Xilinx Virtex7, we compiled the baseline SimpleProcessor and two variants with one and two extra copies of the fetchAndExecute rule.
Not surprisingly, the length of the critical path in each run was roughly 5, 10 and 15 ns. Hence performance in terms of time to completion was roughly unchanged, but IPC did not keep up with lowered clock frequency, with single-scalar being the best. (Strangely, the number of CARRY4 units on the critical path was 4, 7 and 8 respectively.) Here is the critical path for the triple-scalar verison:
Max Delay Paths -------------------------------------------------------------------------------------- Slack: inf Source: the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_imem_6_read_RV_reg[2]/C (rising edge-triggered cell FDRE) Destination: the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]/D Path Group: (none) Path Type: Max at Slow Process Corner Data Path Delay: 14.269ns (logic 2.265ns (15.874%) route 12.004ns (84.126%)) Logic Levels: 28 (CARRY4=8 FDRE=1 LUT2=3 LUT4=1 LUT5=1 LUT6=12 MUXF7=1 MUXF8=1) Location Delay type Incr(ns) Path(ns) Netlist Resource(s) ------------------------------------------------------------------- ------------------- SLICE_X70Y126 FDRE 0.000 0.000 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_imem_6_read_RV_reg[2]/C SLICE_X70Y126 FDRE (Prop_fdre_C_Q) 0.259 0.259 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_imem_6_read_RV_reg[2]/Q net (fo=20, estimated) 0.829 1.088 the_dut/SimpleProcessorTb_mkTb_sp/rtl_unsigned_bitextract54_return[2] SLICE_X62Y133 LUT6 (Prop_lut6_I1_O) 0.043 1.131 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_311/O net (fo=1, routed) 0.000 1.131 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_311_n_0 SLICE_X62Y133 MUXF7 (Prop_muxf7_I1_O) 0.117 1.248 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_135/O net (fo=1, routed) 0.000 1.248 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_135_n_0 SLICE_X62Y133 MUXF8 (Prop_muxf8_I0_O) 0.046 1.294 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_55/O net (fo=4, estimated) 0.837 2.131 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_55_n_0 SLICE_X64Y147 LUT6 (Prop_lut6_I0_O) 0.125 2.256 f the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_27/O net (fo=41, estimated) 0.697 2.953 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_27_n_0 SLICE_X65Y142 LUT5 (Prop_lut5_I0_O) 0.043 2.996 f the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_1_read_RV[0]_i_10/O net (fo=116, estimated) 0.836 3.832 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_1_read_RV[0]_i_10_n_0 SLICE_X53Y151 LUT2 (Prop_lut2_I1_O) 0.043 3.875 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_204/O net (fo=1, routed) 0.000 3.875 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_204_n_0 SLICE_X53Y151 CARRY4 (Prop_carry4_S[0]_CO[3]) 0.259 4.134 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_162/CO[3] net (fo=1, estimated) 0.000 4.134 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_162_n_0 SLICE_X53Y152 CARRY4 (Prop_carry4_CI_CO[3]) 0.053 4.187 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_122/CO[3] net (fo=1, estimated) 0.000 4.187 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_122_n_0 SLICE_X53Y153 CARRY4 (Prop_carry4_CI_CO[3]) 0.053 4.240 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_82/CO[3] net (fo=1, estimated) 0.000 4.240 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_82_n_0 SLICE_X53Y154 CARRY4 (Prop_carry4_CI_CO[3]) 0.053 4.293 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_62/CO[3] net (fo=5, estimated) 0.850 5.143 the_dut/SimpleProcessorTb_mkTb_sp/hprbpin501107x1020998_in SLICE_X53Y142 LUT4 (Prop_lut4_I1_O) 0.049 5.192 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_38/O net (fo=4, estimated) 0.687 5.879 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_38_n_0 SLICE_X57Y143 LUT6 (Prop_lut6_I0_O) 0.136 6.015 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_16/O net (fo=1, estimated) 0.457 6.472 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_16_n_0 SLICE_X57Y143 LUT6 (Prop_lut6_I1_O) 0.043 6.515 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_10/O net (fo=31, estimated) 0.647 7.162 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_10_n_0 SLICE_X52Y143 LUT6 (Prop_lut6_I1_O) 0.043 7.205 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_7/O net (fo=191, estimated) 0.695 7.900 the_dut/SimpleProcessorTb_mkTb_sp/hprbpin501107x10 SLICE_X61Y146 LUT2 (Prop_lut2_I1_O) 0.043 7.943 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_1619/O net (fo=1, routed) 0.000 7.943 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_1619_n_0 SLICE_X61Y146 CARRY4 (Prop_carry4_S[0]_CO[3]) 0.259 8.202 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_1103/CO[3] net (fo=1, estimated) 0.000 8.202 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_1103_n_0 SLICE_X61Y147 CARRY4 (Prop_carry4_CI_CO[3]) 0.053 8.255 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_800/CO[3] net (fo=1, estimated) 0.000 8.255 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_800_n_0 SLICE_X61Y148 CARRY4 (Prop_carry4_CI_CO[3]) 0.053 8.308 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_591/CO[3] net (fo=1, estimated) 0.000 8.308 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_591_n_0 SLICE_X61Y149 CARRY4 (Prop_carry4_CI_CO[3]) 0.053 8.361 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_373/CO[3] net (fo=4, estimated) 0.764 9.125 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_373_n_0 SLICE_X49Y148 LUT6 (Prop_lut6_I0_O) 0.043 9.168 f the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_375/O net (fo=3, estimated) 0.464 9.632 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_375_n_0 SLICE_X48Y147 LUT6 (Prop_lut6_I0_O) 0.043 9.675 f the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_199/O net (fo=1, estimated) 0.729 10.404 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_199_n_0 SLICE_X50Y136 LUT2 (Prop_lut2_I1_O) 0.047 10.451 f the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_97/O net (fo=1, estimated) 0.945 11.396 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_97_n_0 SLICE_X50Y148 LUT6 (Prop_lut6_I0_O) 0.134 11.530 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_41/O net (fo=1, estimated) 0.640 12.170 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_41_n_0 SLICE_X56Y152 LUT6 (Prop_lut6_I2_O) 0.043 12.213 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_24/O net (fo=1, estimated) 0.567 12.780 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_24_n_0 SLICE_X63Y148 LUT6 (Prop_lut6_I2_O) 0.043 12.823 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_10/O net (fo=4, estimated) 0.815 13.638 the_dut/SimpleProcessorTb_mkTb_sp/hprbpin504915x10 SLICE_X53Y129 LUT6 (Prop_lut6_I4_O) 0.043 13.681 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_2/O net (fo=1, estimated) 0.545 14.226 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_2_n_0 SLICE_X53Y129 LUT6 (Prop_lut6_I0_O) 0.043 14.269 r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_1/O net (fo=1, routed) 0.000 14.269 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_1_n_0 SLICE_X53Y129 FDRE r the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]/D ------------------------------------------------------------------- -------------------
The sensible behaviour of the curious Bluespec wire types, such as PulseWire and RWire, under super-scalar operation is that they are reset to their normal starting point between each operation. The execution order restrictions associated with such wires get more restrictive for super-scalar behaviour ...
In today's FPGAs, multipliers and BRAM memories are abundant and programmers are used to accessing such resources via the standard operator notations (asterisk and square brackets) in high-level languages. On FPGAs, these resources are pipelined and a multi-cycle schedule is needed to use them in this way. (Standard Bluespec uses these resources via the Put/Get paradigm...)
Latency-of-one operators can be used infix style via the register forwarding extension described in the next section.
For general use of pipelined operators outside of forwardable coding styles, there are two basic approaches:
The HPR L/S library provides easy support for superstate based multi-cycle schedules: its restructure component will convert infix operators appearing in the NSF of an FSM, replacing them with pipelined FUs and padding out the initial FSM states into multi-cycle superstates. It will also overlap these superstates as much as possible, solving all RaW hazards and inserting holding registers as needed. It can handle fixed-latency FUs as well as variable latency FUs that have handshake nets. The lower-level FU can be a previous Bluespec compile and the necessary meta-info to instantiate it is held in an IP-XACT file read in by the higher-level compilation.
Multi-cycle schedules can be a simple modulo-like schedule with superstate concepts, but these will give low throughput owing to the initiation interval being extended to the cycle length. Pipelined schedules generally require speculation ...
... demo needed by adding restructure to the recipe ... also mention of previous work Karczmarek supported multi-cycle rules but not mutli-cycle methods ...
Sadly, the standard Bluespec handshake cannot support multi-cycle operations (except by backdoor methods such as global stall mechanism). The restructure plugin, when used in Bluespec like mode, maps its req to Bluespec's EN and its reqrdy to Bluespec's RDY. But for multi-cycle builds, using either a seqeuencer (II=latency) or pipelined accelerator mode, the ack signal needs also to used to deliver the output. Hence it automatically converts to Put/Get style which is not what we want.
TODO demonstrate the performance gain from re-pipelining a design as a whole compared with making longer rules stall shorter ones in a gated/superstate scheme...
Access to pipelined hardware resources in conventional Bluespec is typically via the Put/Get interface. This requires a shift register that runs alongside the pipelined FU to record which stage is live with data. An output FIFO is required to store the results until they are consumed.
Where no clock enables are used (as illustrated), the output FIFO must have sufficient space to avoid overrun taking into account the operator's latency. A clock enable on the FU and shotgun shift register reduces the FIFO capacity requirements since we can use the FU internal registers as storage by stalling and de-asserting put_RDY when the FIFO is full.
We provide additional syntax for easier access to single and dual-ported RAMs with certain restrictions. This mechanism will operate for any latency-of-one stateful or stateless operator, but RAM is the only demo so far.
The write operation for the synchronous BRAM memory is easily supported with a syntax extension to the Bluespec language. Synchronous RAM writes are not multi-cycle: they are the same as register writes but with the value to be stored and the address being presented to the RAM instance in the same clock cycle.
Bluespec already has concrete syntax for subscription of vectors using square brackets. The additional dot added to our concrete syntax for RAM subscriptions reminds us that this operation will consume RAM port capacity and potentially raise a structural hazard.
The write construct added is:
mybram.[store_address ...] <= value_to_be_stored;
Latency-of-one synchronous RAM reads have an intrinsic pipeline delay compared with the combinational read of asynchronous RAMs. Efficient asynchronous RAMs are not found in today's mainstream technologies (see BRAM versus distributed RAM discussions ...). Reads where the result is stored directly in a register without otherwise being used can, however, be fairly easily supported with a new Bluespec language extension that generates an extra bit of hidden state per read destination. The extra bit is so-called scoreboard flag, recording that reads of the destination should be forwarded from the RAM read bus, instead of being served from the register. One such bit is needed per RAM per read destination of that RAM. The scoreboard bits may be optimised away if always constant value when read.
In this example, there is one read site and one read destination (called rval) with direct connection. The rval value used in both the local_reader rule and the remote_reader rule will be served from the output register of the synchronous BRAM when it has been activated the clock cycle immediately before and from the named rval register otherwise. The extra state bit manages which source to read. Reads where the data is looked at in the current cycle cannot be managed this way and the multi-cycle schedule is needed instead. [ If the local_reader has nothing stopping it firing every clock cycle, it will never need the rval register. ] Net-level debugging can be facilitated by providing the rval bus driving logic, even if it is not used. The debugging logic will be later stripped by backend tools.
The new read construct is illustrated in this example.
rule local_reader; rval <= mybram.[read_address ...]; sum1 <= sum1 + rval ... etc; endrule rule remote_reader; ... sum2 <= sum2 + rval ... etc; endrule
(A more complex example would show conditional reading of the BRAM (via IF statements or tertiary operator) and multiple destination registers with the destination registers being served from a number of different BRAMs, varying addresses on a commmon BRAM and also conventional updates to the target register(s). The Put/Get interface to the BRAM and its accompanying FIFO structure are not intended to be mixed with the new syntax on a common BRAM.)
Where the register rval is used, the value is either supplied from the read bus of the BRAM or from rval itself. The real rval is loaded from the RAM read bus the clock cycle afterwards if no other rule writes to that destination at that time.
Where there is more than one read site and associated read register, the mechanism is repeated for each read register.
The various read and write sites for such a RAM will each have an address expression. The expressions in use may be determined (at compile time) to always be identical or not. Where the expressions are different, there is a structural hazard, but where they are determined to be the same, a read and a write on the same address are possible through a single port and the read may be fanned out to multiple read sites. There are intrinsic limitations to such static identity checking arising from computability theory, but our use of an internal memoising heap with normal forms arising from multiple commutative, distributive and boolean laws, helps spot identity. Our implementation puts the expressions into equivalence classes and serves one class per RAM port.
Port over-subscription requires arbitration and we cast this into the Bluespec intrinsic guard paradigm to be resolved by traditional static rule priority or with our new run-time arbiters mechanism.
Dual-ported synchronous RAMs are commonly found on today's FPGAs.
For multi-ported RAMs, automatic load balancing within the clock domain can be supported with a schedulling
conflict only arising where more addresses than ports is attempted...
... being added June/Aug 2019 ... explain about superscalar interactions and RAM-to-RAM operations...
Our implementation supports certain simple combinational operations on the forwarding path, including bit extract, field select and inversion. Addition or subtraction of a constant could also have been supported, but was left out since this requires carry chains instead of just multiplexors.
With or without profile-directed feedback, the compiler can make an estimate and report the expected firing rate of each rule. Given
the ability to schedule a rule more than once per cycle, or at a rate less than normal under control of a new generation of arbiters
called policers, we can synthesise a design to meet real-time rate targets ...
... being added June 2019 ... This was done. See the paper from Memocode referenced below.
Given that this compiler is implemented on top of the HPR library, that supports floating-point hardware, adding floating-point arithmetic to Bluespec is nominally a very simple step, requiring a small concrete syntax extension. However, floating-point ALUs are normally heavily pipelined, so this only makes sense alongside the pipelined FU extension. IEEE floats are perhaps of little use since Bluespec is not intended for scientific computation. Custom arithmetic (e.g. for CNNs) might be of more use and should be possible with the same literal constant and infix operator extension mechanism...
... being added June 2019
Spare text: This is actually an attractive approach but needs tidying up from what is here. The attraction is that Bluespec has a static elaborator which is basically a Haskel interpreter, but by embedding our constructs in FSharp instead, we have the power of the ML dialect of F Sharp to do exactly the same kind of elaboration. All of the higher-order iterators and functors are exactly the same, modulo inserting 'fun () -> ' in a few places to get lazy behaviour. The good work needed is to use the domain-specific language (DSL) embedding facilities of F Sharp for the elaborated leaves. I might do this if I update this page.
(C) 2012-13 Orangepath Project. DJ Greaves.