HPR L/S Project: Hardware and Embedded Software Synthesis from Executable Specifications.
CBG-BSV Toy Bluespec Compiler.


  • Introduction.

  • Initial Tests.

  • Parser for concrete syntax.

  • Experimental Language Extensions.

  • Citations and Links.

    CBG-BSV Toy Bluespec Compiler

    The Bluespec System Verilog language is increasingly seen as a viable and productive alternative to conventional RTL coding for hardware design. For compiler writers like myself, the best way to learn a new language was to write a toy compiler for it. Also, where there's a lack of formal semantics for a language, writing a compiler is a good way to learn about the important and/or arbitrary decisions that have to be made. (These seem to be in the scheduler and in the way implicit guards of implicit guards are processed)

    The user manual from Bluespec Inc may have restricted circulation, so this informal study was based on a couple of published studies: `HARDWARE LANGUAGES AND PROOF' by Dominic Richards of Manchester University and `A Unified Model for Hardware/Software Codesign' by Nirav Dave when at MIT.

    This article gives the essence of a compiler for Bluespec. This toy compiler was coded in FSharp using the HPR L/S library. That library provides various logic manipulation facilities and output functions for SMV, SystemC, Verilog and so on. Rather than initially writing a Bluespec parser, the test abstract-syntax trees were manually entered as separate FSharp source files. Later a full parser was added.

    An Open-Source Bluespec Compiler PDF.

    Flow diagram for Toy Bluespec Compiler.

    DOWNLOAD: I shall put a version of the Toy Bluespec Compiler on the following link for your own experiments: HERE.

    Initial Tests without a front-end Parser

    Bluespec is based around the concept of modules and rules. A module contains zero or more rules. A module also instantiates zero or more lower modules. Modules instantiated at the lowest levels are primitives, such as FIFOs, registers and RAMs. Bluespec starts structural elaboration at a top-level module. The module hierarchy is nominally flattened during the structural elaboration process. Once elaboration is complete, we have essentially a flat collection of interconnected Bluespec rules and primitives.

    Our first implementation was just a proof of concept, without a parser or structural elaborator. Instead we manually entered the abstract syntax tree of the elaborated rules and primitives as an FSharp data structure.

    First simple test

    The first test was a rule that has a single register incremented on every clock cycle:
    let test_ast1 =
        (*     module test1();
                  Reg#(#UInt(15)) ctr1 <- mk_register(0)
                  rule test_ast1 (True) ;
                      ([], [], Bsv_typen "Empty"),
                        Bsv_varDeclAssign("ctr1", [B_num [15]; B_num[1]], "mkReg", []);
                        Bsv_rule("test_ast1", B_true, [ g_write("ctr1", B_diadic(V_plus, g_read("ctr1"), B_num[1]))])

    This first test generated the following output (which works fine: ctr is incremented on every clock cycle):

    // CBG Orangepath HPR/LS System
    //  /home/djg11/d320/hprls/bsvc/bsv.exe  -vnl test1.vnl test1
    module test1(input clk, input RST_N);
      reg test_ast1_FIRE;
      reg [14:0] test1_ctr1_DREG;
      reg [14:0] test1_ctr1_Din;
      wire test1_ctr1_RDY;
      reg test1_ctr1_EN;
      always   @(posedge clk )  begin
          if (!RST_N || test1_ctr1_EN)  test1_ctr1_DREG <= (!RST_N ? 0: test1_ctr1_Din);
      assign test_ast1_FIRE = 1;
      always @(*) test1_ctr1_EN =  test_ast1_FIRE;
      always @(*) test1_ctr1_Din =  1+test1_ctr1_DREG;
      // 2 vectors of width 1
      // 2 vectors of width 15
      // Total state bits in module = 32 bits.
      // 1 continuously assigned (wire/non-state) bits
    // eof (HPR/LS Verilog)

    Second Test

    The second test was to compile my Dining Philosophers example: Dining Philosophers in Bluespec Verilog.

    Not originally having a parser, I had to manually enter the syntax tree for the example. Fortunately it is not too long.

    The main part of the grammar is:

    // Bluespec: Toy abstract syntax (may not be right!)
    // Various modifiers for an interface:
    type bsv_provisos_t =
        | Proviso_alwaysReady
        | Proviso_alwaysEnabled    
    // Method prototype protocol has one of these three forms:
    type bsv_protoprotocol_t =
        | BsvProtoValue of bsv_type_name_t
        | BsvProtoActionValue of bsv_type_name_t
        | BsvProtoAction 
    and bsv_type_name_t =
        | Bsv_typen of string // for now - separate from bsv_if_t ... but can join.
        | Bsv_typen_uint_ns // Native/primitive run-time type
        | Bsv_typen_uint1 of int  // Native/primitive run-time type
        | Bsv_typen_action
    // Method prototype: has name, provisio/pragmas, return type and method formals (pairs of formal and actuals so-far bound).
    and bsv_methodProto_t = Bsv_methodProto of string * bsv_provisos_t list * bsv_protoprotocol_t * (bsv_type_name_t * string) list
    and bsv_sigma_protocol_t =
        | BsvValue 
        | BsvAction of hexp_t option * (hbexp_t * hexp_t)  //  (en-net if needed) * (rdy-x, rdy-net)
    and bsv_sigma_t = Bsv_current of string * bsv_sigma_protocol_t * (bsv_type_name_t * hexp_t option) * (hexp_t * hexp_t) list
    and actuals_t = (string * (bsv_type_name_t * bsv_exp_t)) list
    and formals_t = (bsv_exp_t * string) list
    and bsv_moduleStmt_t = // Or actionValueStmt ?
        | Bsv_varDeclAssign  of string * bsv_exp_t list * string * string list
        | Bsv_rule           of string * bsv_bexp_t * bsv_moduleStmt_t list
        | Bsv_assignStmt     of string * bsv_exp_t
        | Bsv_eascActionStmt of bsv_exp_t
        | Bsv_pliStmt        of string * bsv_exp_t list      // for $display and so on
        | Bsv_beginEndStmt   of bsv_moduleStmt_t list
        | Bsv_ifThenStmt     of bsv_bexp_t * bsv_moduleStmt_t * bsv_moduleStmt_t option
        | Bsv_whileStmt     of bsv_bexp_t * bsv_moduleStmt_t * bsv_moduleStmt_t option      | Bsv_returnStmt    of bsv_exp_t // Only in actionvalue blocks.
        | Bsv_CaseStmt      of bsv_exp_t * (bsv_exp_t * bsv_moduleStmt_t) list *  bsv_moduleStmt_t option
        | Bsv_methodDef     of string * formals_t * bsv_bexp_t * bsv_moduleStmt_t list
        | Bsv_primBuffer    of bsv_exp_t * bsv_exp_t (* Only for internal implementation *)
    // There are three lots of 'parameters' to the typical module as stored in this triple.
    // They are the parametric type args of the exported interface, the parameters to the generate
    // function and the list of interfaces, which consists of those used plus the single interface exported. 
    and bsv_moduleParams_t = (bsv_type_name_t * string) list * (bsv_type_name_t * string) list * bsv_type_name_t
    // The Bluespec interface definition: Generic parameters and list of methods:
    and bsv_if_t = Bsv_if of (bsv_type_name_t * string) list *  bsv_methodProto_t list
    // Boolean expressions
    and bsv_bexp_t =
        | B_true
        | B_false
        | B_firing of string // Backdoor access to the composite guard for any rule.
        | B_not  of bsv_bexp_t
        | B_and  of bsv_bexp_t list
        | B_or   of bsv_bexp_t list        
        | B_bexp of hbexp_t // Pre-compiled forms
        | B_bdiop of x_bdiop_t * bsv_exp_t list // Equality, less than and orreduce where list is a singleton.
        | B_orred of bsv_exp_t
    // Integer expressions    
    and bsv_exp_t =
        | B_num of int list
        | B_blift of bsv_bexp_t
        | B_query of bsv_bexp_t * bsv_exp_t * bsv_exp_t
        | B_var of string
        | B_string of string    
        | B_hexp of hexp_t
        | B_fsmStmt of bsv_exprFsmStmt_t
        | B_ifcRef of string // Interface reference
        | B_diadic of x_diop_t * bsv_exp_t * bsv_exp_t
        | B_apply of string list * bsv_exp_t list
    // Finite-state machine sub-language.    
    and bsv_exprFsmStmt_t =
     | Bsv_seqFsmStmt    of bsv_exprFsmStmt_t list
     | Bsv_parFsmStmt    of bsv_exprFsmStmt_t list
     | Bsv_ifFsmStmt     of bsv_bexp_t * bsv_exprFsmStmt_t * bsv_exprFsmStmt_t option 
     | Bsv_whileFsmStmt  of bsv_bexp_t * bsv_exprFsmStmt_t
     | Bsv_repeatFsmStmt of bsv_exprFsmStmt_t
     | Bsv_eascFsmStmt   of bsv_exp_t
     | Bsv_breakFsmStmt
     | Bsv_continueFsmStmt
    The encoding of the spoon (really should be called a fork!) and the philosopher and the table of five of them look like this:
    (* interface Spoon_if;
         method Action pickup();
         method Action putdown();
                             Bsv_methodProto("pickup", [], BsvProtoAction, []);
                             Bsv_methodProto("putdown", [], BsvProtoAction, []);              
       module spoon (Spoon_if) ;
         Reg#(Bool) inuse <- mkReg(?);    
    	method Action pickup() if (!inuse);
    	   inuse <= True;
    	method Action putdown();
    	   inuse <= False;
                      ([], [], Bsv_typen "Spoon_if"),
                        Bsv_varDeclAssign("inuse", [B_num[1]; B_num[0]], "mkReg", []);
                        Bsv_methodDef("pickup", [], B_bdiop(V_dned, [g_read "inuse"; B_num[1]]), [ g_write("inuse", B_num[1]); Bsv_pliStmt("$display", [B_string "pickup"]) ]);
                        Bsv_methodDef("putdown", [], B_true, [ g_write("inuse", B_num[0]); Bsv_pliStmt("$display", [B_string "putdown"]) ])
        (* interface Random_if;
           method ActionValue#(UInt#(15)) gen();
                             Bsv_methodProto("gen", [], BsvProtoActionValue(Bsv_typen_uint1 15), []);
         (*  module mkRandom_gen #(UInt#(15) seed) (Random_if);
               Reg#(UInt#(15)) prbs <- mkReg(seed);
               method ActionValue#(UInt#(15)) gen();
                 prbs <= (prbs << 1) | (((prbs >> 14) ^ (prbs >> 13)) & 1);
                 return prbs;
                      ([(Bsv_typen "int", "seed")], [], Bsv_typen "Random_if"),
                        Bsv_varDeclAssign("prbs", [B_num[15]; B_var "seed"], "mkReg", []);
                        Bsv_methodDef("gen", [], B_true,
                                      [ g_write("prbs", B_diadic(V_bitor, B_diadic(V_lshift, g_read "prbs", B_num[1]), B_diadic(V_bitand, (B_diadic(V_bitor, B_diadic(V_rshift, g_read "prbs", B_num[14]), B_diadic(V_rshift, g_read "prbs", B_num[13]))), B_num[1])));
                                        Bsv_returnStmt(g_read "prbs");
                      ([(Bsv_typen "int", "seed"); (Bsv_typen "int", "on")],
                       [(Bsv_typen "Spoon_if", "left"); (Bsv_typen "Spoon_if", "right")], Bsv_typen "Empty"),
                         Bsv_varDeclAssign("gen", [B_var "seed"], "mkRandom_gen", []);
                         Bsv_varDeclAssign("timer", [B_num[15]; B_num[0]], "mkReg", []);
                         Bsv_varDeclAssign("x", [B_num[15]; B_num[0]], "mkReg", []);
                         Bsv_rule("foo", B_bdiop(V_dned, [g_read("timer"); B_num[0]]), [ g_write("timer", B_diadic(V_minus, g_read "timer", B_num[1]))]);
                         Bsv_varDeclAssign("eating", [B_num[1]; B_num[0]], "mkReg", []);
                                                                  fsm_write("x", g_actionVal("gen", "gen", []));
                                                                  fsm_write("timer", B_diadic(V_bitand, g_read "timer", B_num[31]));
                                                                  fsm_await(B_bdiop(V_deqd, [g_read "timer"; B_num[0]]));
                                                                  fsm_action("left", "pickup", []);
                                                                  fsm_write("x", g_actionVal("gen", "gen", []));
                                                                  fsm_write("timer", B_diadic(V_bitand, g_read "timer", B_num[31]));
                                                                  fsm_await(B_bdiop(V_deqd, [g_read "timer"; B_num[0]]));
                                                                  fsm_action("right", "pickup", []);
                                                                  fsm_write("eating", B_num[1]);
                                                                  fsm_write("timer", B_var "on");
                                                                  fsm_await(B_bdiop(V_deqd, [g_read "timer"; B_num[0]]));
                                                                  fsm_write("eating", B_num[0]);
                                                                  fsm_action("left", "putdown", []);
                                                                  fsm_action("right", "putdown", []);
                        Bsv_varDeclAssign("FSM", [B_var "seq_behaviour" ], "mkFSM", [ ]);                    
                        Bsv_rule("kickoff", B_true, [ g_action("FSM", "start", []) ]);
                      ([], [], Bsv_typen "Empty"),
                        Bsv_varDeclAssign("spoon0", [], "Spoon", []);
                        Bsv_varDeclAssign("spoon1", [], "Spoon", []);
                        Bsv_varDeclAssign("spoon2", [], "Spoon", []);
                        Bsv_varDeclAssign("spoon3", [], "Spoon", []);
                        Bsv_varDeclAssign("spoon4", [], "Spoon", []);
                        Bsv_varDeclAssign("philo0", [B_num[3]; B_num[7]  ], "Philo", [ "spoon0"; "spoon1" ]);                    
                        Bsv_varDeclAssign("philo1", [B_num[4]; B_num[4]  ], "Philo", [ "spoon1"; "spoon2" ]);                    
                        Bsv_varDeclAssign("philo2", [B_num[2]; B_num[9]  ], "Philo", [ "spoon3"; "spoon2" ]);                    
                        Bsv_varDeclAssign("philo3", [B_num[3]; B_num[6]  ], "Philo", [ "spoon3"; "spoon4" ]);                    
                        Bsv_varDeclAssign("philo4", [B_num[3]; B_num[6]  ], "Philo", [ "spoon4"; "spoon0" ]);                    

    Note that I have reversed the spoons to philosopher number 2. This stops the system from deadlocking straightaway!

    Compiler Wiring

    The compiler creates instances of each leaf or child component. Each component has some number of invokable methods. It is not normally important to distinguish which method is sported by which component, except for re-entrancy constraints. The general method has ready and enable handshake nets. An OR-gate asserts enable when any of the rules invokes the method. A multiplexor steers the arguments from the rule and hence only one rule should assert enable at a time. The ready output and optional return value output are fanned out to the rules that invoke the method.

    General view of the Bluespec wiring between scheduller, rule logic and leaf components.
    General view of the Bluespec wiring between scheduller, rule logic and leaf components. Each rule connects to some number of leaf components. Dotted outline shows example component boundaries.

    The scheduler connects to each rule with a handshake pair. It knows which rules conflict with each others and will not simultaneously assert the fireguard to any pair in a conflict group.

    Verilog Output

    The Verilog RTL output looks as follows (note: this was the very first run and there might be some bugs in this, but it seems to work!) HERE.

    Modelsim simulation of Dining Philosophers using the Toy Bluespec Compiler

    Does this deadlock? The HPR library can output also an NuSMV file of this design. We can give SMV the -ctt command line flag and it will check for deadlock (eventually?).

    The above toy compiler has a default rule precedence based on the textual order of the rules in the source code. It's not hard to add priorities.

    The above compiler does not have the interlock that BSV defaults to ON for each rule that stops it firing more than once per cycle. This can be added where the comment says.

    Finally, the above compiler does not check alternate compositions of rule pairs to see if they have the same effect, but a simple approach based on textual equivalence of effect would work quite well because the expressions are all converted to a normal memoized form by the HPR library, so most minor artefacts arising from alternate compositions, such as operator association and operand interchange in commutative operators are eliminated.

    Toy Compiler With Concrete Syntax Parser

    A simple parser was implemented using fsyacc. The resulting abstract syntax tree required a typechecker, normaliser and elaborator for all the functional hardware construction (aka generate) forms, but this then produced the same core lanugage AST as used in the initial experiments.

    Parser example run example run.

    SimpleProcessor.bsv Test

    One of the interesting small tests available online is SimpleProcessor.

    gtkwave plot of the Bluespec SimpleProcessor.bsv example

    mono bsv.exe -incdir=/home/djg11/d320/hprls/bsvc/smalltests -cpp=disable -comb-assignment-delay=10  -vnl=dut.v  smalltests/SimpleProcessorTb.bsv
    cver dut.v vsys.v
    GPLCVER_2.11a of 07/05/05 (Linux-elf).
    Copyright (c) 1991-2005 Pragmatic C Software Corp.
      All Rights reserved.  Licensed under the GNU General Public License (GPL).
      See the 'COPYING' file for details.  NO WARRANTY provided.
    Today is Wed Nov  7 14:44:56 2012.
    Compiling source file "dut.v"
    Compiling source file "vsys.v"
    Highest level modules:
    48: output regs[           1] = 3
    49: Halt at pc13
      There were 0 error(s), 0 warning(s), and 24 inform(s).

    The online Bluespec examples define a simple microprocessor that is provided with an assembly coding of Euclid's algorith. The above plot shows the toy compiler can now parse and correctly execute the processor and the example algorithm: the GCD of 27 and 15 is 3 which is printed from R1 at the end of the computation. The delay at the start is the download of the program into the processor by the testbench. Mind you, this hardly stretches the interesting parts of the Bluespec scheduler.

    Listing ((C) Bluespec Inc):

    Instruction code [14] =                                                               
         tagged MovI { rd: R0, v: 0 },                // 0: The constant 0                
         tagged MovI { rd: R1, v: 15 },               // 1: x = 21                        
         tagged MovI { rd: R2, v: 27 },               // 2: y = 27                        
         // label_loop                                                                    
         tagged Brz { rs: R2, dest:label_done },      // 3: if (y == 0) goto done         
         tagged Gt  { rd: R3, rs1: R1, rs2: R2 },     // 4: tmp = (x > y)                 
         tagged Brz { rs: R3, dest: label_subtract }, // 5: if (x <= y) goto subtract     
         // swap                                                                          
         tagged Minus { rd: R3, rs1: R1, rs2: R0 },   // 6: tmp = x;                      
         tagged Minus { rd: R1, rs1: R2, rs2: R0 },   // 7: x = y;                        
         tagged Minus { rd: R2, rs1: R3, rs2: R0 },   // 8: y = tmp;                      
         tagged Br  label_loop,                       // 9: goto loop                     
         // label_subtract                                                                
         tagged Minus { rd: R2, rs1: R2, rs2: R1 },   // 10: y = y - x                    
         tagged Br  label_loop,                       // 11: goto loop                    
         // label_done                                                                    
         tagged Output R1,                            // 12: output x                     
         // label_last                                                                    
         tagged Halt                                  // 13: halt                         

    Verilog object file generated: dut.v.

    Chuck Thacker's Tiny Computer

    Simon Moore implemented Chuck Thacker's Tiny3 Processor in Bluespec. Simon's source files are here SWM Tiny3. This design now (Jan 2013) goes through the Toy Compiler and works!

    The test program is coded in an assembly language entirely written in Bluespec. It is a simple program that writes the integers 9 down to 0 to the output FIFO.

       function List#(AsmLineT) my_program(function ImmT findaddr(String label)) =
    /*0*/      cons(asmi("",    0, 0),  // Keep register 0 clear.
    /*1*/      cons(asmi("",    1, 10), // put 10 in r1
    /*2*/      cons(asmi("",    2, findaddr("loop")), // loop address in r2
    /*3*/      cons(asm("loop", OpOut, FDECb, ShiftNone, SkipNever, 1, 0, 1), // r1-- and output
    /*4*/      cons(asm("",     OpJump, FaORb, ShiftNone, SkipZero, 3, 2, 0), // jump r2 if not zero
    /*5*/      cons(asm("",     OpReserved, Freserved, ShiftNone, SkipNever, 0, 0, 0), // stop processor
          tagged Nil))))));

    Simulation output:

    cver dut.v vsys.v
    GPLCVER_2.12a of 05/16/07 (Linux-elf).
    Copyright (c) 1991-2007 Pragmatic C Software Corp.
      All Rights reserved.  Licensed under the GNU General Public License (GPL).
      See the 'COPYING' file for details.  NO WARRANTY provided.
    Today is Sun Jan 20 18:39:48 2013.
    Compiling source file "dut.v"
    Compiling source file "vsys.v"
    Highest level modules:
    End of prog load
                    2100: fetch 00 80000000: 
                    2700: doSkip = False
                    2700: r         0 = imm =          0
                    2900: fetch 01 8100000a: 
                    3500: doSkip = False
                    3500: r         1 = imm =         10
                    3700: fetch 02 82000003: 
                    4300: doSkip = False
                    4300: r         2 = imm =          3
                    4500: fetch 03 01000583: 
                    5100: doSkip = False
                    5100: rd <-          1         9
    Write out 00000009
                    5300: fetch
                    5300: output =          9 04 03040296: 
                    5900: doSkip = False
                    6100: fetch 03 01000583: 
                    6700: doSkip = False
                    6700: rd <-          1         8
    Write out 00000008
                    6900: fetch
                    6900: output =          8 04 03040296: 
                    7500: doSkip = False
                    7700: fetch 03 01000583: 
                    8300: doSkip = False
                    8300: rd <-          1         7
    Write out 00000007
                    8500: fetch
                    8500: output =          7 04 03040296: 
                    9100: doSkip = False
                    9300: fetch 03 01000583: 
                    9900: doSkip = False
                    9900: rd <-          1         6
    Write out 00000006
                   10100: fetch
                   10100: output =          6 04 03040296: 
                   10700: doSkip = False
                   10900: fetch 03 01000583: 
                   11500: doSkip = False
                   11500: rd <-          1         5
    Write out 00000005
                   11700: fetch
                   11700: output =          5 04 03040296: 
                   12300: doSkip = False
                   12500: fetch 03 01000583: 
                   13100: doSkip = False
                   13100: rd <-          1         4
    Write out 00000004
                   13300: fetch
                   13300: output =          4 04 03040296: 
                   13900: doSkip = False
                   14100: fetch 03 01000583: 
                   14700: doSkip = False
                   14700: rd <-          1         3
    Write out 00000003
                   14900: fetch
                   14900: output =          3 04 03040296: 
                   15500: doSkip = False
                   15700: fetch 03 01000583: 
                   16300: doSkip = False
                   16300: rd <-          1         2
    Write out 00000002
                   16500: fetch
                   16500: output =          2 04 03040296: 
                   17100: doSkip = False
                   17300: fetch 03 01000583: 
                   17900: doSkip = False
                   17900: rd <-          1         1
    Write out 00000001
                   18100: fetch
                   18100: output =          1 04 03040296: 
                   18700: doSkip = False
                   18900: fetch 03 01000583: 
                   19500: doSkip = False
                   19500: rd <-          1         0
    Write out 00000000
                   19700: fetch
                   19700: output =          0 04 03040296: 
                   20300: doSkip = True
                   20500: fetch 06 6c697374: 
                   21100: doSkip = False
                   21100: rd <-        108         x
                   21300: fetch 07 6c697374: 
                   21900: doSkip = True
                   21900: rd <-        108         x
                   22100: fetch 09 xxxxxxxx: 
    Halted at location **vsys.v(16) time 35000 from call to $finish.
      There were 0 error(s), 1 warning(s), and 63 inform(s).

    Simulation of Simon Moore's Implementation of Chuck Thacker's Tiny3 Processor Design using the Toy Bluespec Compiler

    It is good that this test worked, since it uses a large fraction of the language.

    Language Extensions

    Here we show the output of some experimental language extension experiments. Full details to be in a white paper.

    Semantic Extension: Superscalar Action Mode: multiple calls to certain action methods (registers, wires and BRAMs currently) within one clock cycle

    Standard semantics are that a register is updated at most once per clock cycle. (Indeed the standard is a leaf action method is invoked at most once per cycle, but we allready allow all action methods with the same argument to be run more than once on idempotent-marked methods by sharing use of a single call.)

    With the superscalar update extension, we allow operations to be invoked more than once per clock cycle for certain methods where extensions have been hardwired into the main body of the compiler. We have so-far implemented support for any number of operations on registers per clock cycle and any number of operations on BRAMs (see register forwarding extension) provided the number of different addresses expressions is less than or equal to the number of address ports on the hardware entity. (An approach for FIFO FUs is also possible in the future and it should be working already for FIFO modules that are defined purely within Bluespec simply out of multiply-updated maybe registers.)

    The Test12f demo is compiled in normal and superscalar action mode, an experimental mode that allows multiple updates to a register in one clock cycle.

    module mkTest1f2();
           Reg#(Bit#(10)) vodka <- mkReg(30);
           Reg#(Bool) grd <- mkReg(0);
           rule test_1f_inc1 if (True);
              vodka <= vodka + 1;
              if (vodka > 50) $finish;
           rule test_1f_inc3 if (grd);
              vodka <= vodka + 3;
           rule shower if (True);
              grd <= !grd;
              $display("Vodka is %1d", vodka);

    RTL output is Verilog. The normal simulation listing generated is:

    Vodka is 30
    Vodka is 31
    Vodka is 32
    Vodka is 35
    Vodka is 36
    Vodka is 39
    Vodka is 40
    Vodka is 43
    Vodka is 44
    Vodka is 47
    Vodka is 48

    But in the new mode we now see that the register either increases by 1 or 4 each clock cycle. An increase of 4 arises from the composition of both register writing rules firing in one clock cycle.

    Vodka is 31                                                                                                     Vodka is 35
    Vodka is 36
    Vodka is 40
    Vodka is 41
    Vodka is 45
    Vodka is 46
    Vodka is 50
    Vodka is 51

    The magic line in the output RTL that encapsulates the compose effect is

    always @(*) vodka_write_din = (mkTest1f2_test_1f_inc3_FIRE? 32'd3+rtl_unsigned_extend0((mkTest1f2_test_1f_inc1_FIRE? rtl_unsigned_bitextract1(10'd1+vodka_read): vodka_read
    )): rtl_unsigned_bitextract1(10'd1+vodka_read));

    The following two rules from SimpleProcessorTb do not conflict under the idempotent actions scheme since they both write the same expression to the register iax. But they do write different values under either sequential composition. Under the super-scalar extension, this is fine, both are packed into the same clock cycle whether they conflict or not. TODO explain ...

      rule loadInstrs (iax <= label_last);
         sp.loadInstruction (iax, code [iax]);
         iax <= iax + 1;
      rule go (iax == label_last + 1);
        iax <= iax + 1;

    Semantic Extension: Dynamic Binding for Load Balancing

    The standard Bluespec compiler performs method calls only on the component that has been explicitly instantiated and named in the source code whereas typical HLS tools choose the number of component instances to instantiate and decide the binding (mapping) between operations and those instances. A component in the Bluespec context is either a leaf component or a child Bluespec module.

    In HLS, the relevant components are normally called FUs (functional units). FUs may be stateful or stateless. Binding decisions for stateful FUs are restricted in that state must be read back from the FU where the state lies whereas stateless FUs may be allocated to an operation without such restriction. A stateful FU might be a RAM. A stateless FU might be a multiplier. Moreover, in typical HLS, pipelined operators, such as a three clock cycle multiplier, are regarded as stateless since the compiler has no ambiguity discriminating the scope of the actual state inside the FU and the pipelining is not visible to the user.

    In Bluespec, however, it is common to use the Put/Get paradigm for access to pipelined FUs whether stateful or stateless to overcome the lack of intrinsic access to pipelined operators with the standard compiler.


    Semantic Extension: Automatic Insertion of Stateful Schedullers

    Standard semantics are that a static schedule is created that cannot span clock cycles. Hence some rules may be starved of service.

    Manual instantiation of arbiters and other anti-hog mechanisms is one solution.

    Two automated solutions are multi-cycle static schedules and dynamic schedules. Here we consider the latter.

    This demo uses two compilation units that sit each side of an interface. A single compilation unit would allow both rules in the parent/master unit to fire at once, since the method in the interface will be elaborated separately for each call site. But as separate units, we have a structural hazard: the method is manifested at the net level in the interface and can only be invoked once per clock cycle.

    Interface definition:

    interface BarFace;
      method Action orderDrink(int which, int no);

    Lower unit definition:

    module mkBarTender(BarFace);
           Reg#(Bit#(10)) beerdrink <- mkReg(20);
           Reg#(Bit#(10)) winedrink <- mkReg(10);       
           method Action orderDrink(int which, int no);
               if (which == 1) beerdrink <= beerdrink + no;
               if (which == 2) winedrink <= winedrink + no;	   
           rule shower if (True);
              $display("Beer is %1d and wine is %1d", beerdrink, winedrink);

    Upper unit definition:

    module mkTest1iBench();
           BarFace fbar <- mkBarTender();
           rule drinkBeer;
             fbar.orderDrink(1, 2);  // Beer should increase by two every clock cycle.
           rule drinkWine if (True);
           	  fbar.orderDrink(2, 10); // This rule will be starved in the absence of a stateful arbiter  -bsv-round-robin=enable

    Compiling with the standard semantics we see a starvation warning from the compiler:

    ** Warning: sheduler=default_schedule: starvation detected for rule Test1i.mkTest1iBench.drinkWine
    ** Warning: sheduler=default_schedule: earlier rules being greedy are Test1i.mkTest1iBench.drinkBeer
    In partition shedtree1, 1 of the 2 rules are potentially starving. No arbiter will be generated owing to recipe setting -bsv-round-robin=disable
    And simulation demonstrates that wine is never consumed owing to beer hogging:
    Beer is 22 and wine is 12
    Beer is 24 and wine is 12
    Beer is 26 and wine is 12
    Beer is 28 and wine is 12

    The mkTest1gBench demo is now compiled under an experimental mode that instantiates run-time arbiters to provide fairness between the rules.

    The simulation listing now generated is

    Beer is 20 and wine is 10
    Beer is 22 and wine is 10
    Beer is 22 and wine is 20
    Beer is 24 and wine is 20
    Beer is 24 and wine is 30

    We see the rules have taken it in turns to fire.

    RTL output is Parent, Child.

    Semantic Extension: Multiple Rule Executions Per Clock Cycle

    Standard semantics are that a rule is fired at most once per clock cycle.

    The multiple updates system gives superscalar performance when the same rule is applied more than once per clock cycle. This needs to be used sparingly since combinational logic can grow quickly when operating on vectors of registers...

    An associated extension is the inclusion of %r in format strings for $display and so on. Like the standard %m for module name, no argument is consumed, but the new format string gives the rule name. This is a helpful debug aid when a rule is executed more than once since the substituted string gives not only the rule name but a cardinal extension giving the firing instance number. This means we can insert PLI calls such as the following in each clause of the rule:

       $display("Brz (%r) r%d == %0d.  Branch taken=%0d  ", rs, regs[rs], b);

    An an example, we compiled smalltests/SimpleProcessor.bsv with the fetchAndExecute rule being allowed to fire twice per clock cycle. This nearly doubled the performance, with the 50 instructions in the GCD demo being completed in 28 clock cycles instead of 50. The allocation of work to the two instances of the rule is illustrated in the following listing:

    Pc:: 0  Instr=0000000000000000  Load immediate (fetchAndExecute_1) R0 with 0x00000000
    Pc:: 1  Instr=000000010000000f  Load immediate (fetchAndExecute_1) R1 with 0x0000000f
    Pc:: 2  Instr=000000020000001b  Load immediate (fetchAndExecute_1) R2 with 0x0000001b
    Pc:: 3  Instr=000000080000002c  Brz (fetchAndExecute_0) r2 == 27.  Branch taken=0
    Pc:: 4  Instr=0000000c00000036  Gt (fetchAndExecute_1) r1 > r2   15 27  res=0
    Pc:: 5  Instr=000000080000003a  Brz (fetchAndExecute_0) r3 == 0.  Branch taken=1
    Pc:: a  Instr=0000001000000029  Subtract (fetchAndExecute_1) 27 15
    Pc:: b  Instr=0000000400000003  Br (fetchAndExecute_0) 000000003
    Pc:: 3  Instr=000000080000002c  Brz (fetchAndExecute_0) r2 == 12.  Branch taken=0
    Pc:: 4  Instr=0000000c00000036  Gt (fetchAndExecute_1) r1 > r2   15 12  res=1
    Pc:: 5  Instr=000000080000003a  Brz (fetchAndExecute_0) r3 == 4294967295.  Branch taken=0
    Pc:: 6  Instr=0000001000000034  Subtract (fetchAndExecute_1) 15 0
    Pc:: 7  Instr=0000001000000018  Subtract  (fetchAndExecute_0) 12 0
    Pc:: 8  Instr=000000100000002c  Subtract (fetchAndExecute_1) 15 0
    Pc:: 9  Instr=0000000400000003  Br (fetchAndExecute_0) 000000003
    Pc:: 3  Instr=000000080000002c  Brz (fetchAndExecute_0) r2 == 15.  Branch taken=0
    Pc:: 4  Instr=0000000c00000036  Gt (fetchAndExecute_1) r1 > r2   12 15  res=0
    Pc:: 5  Instr=000000080000003a  Brz (fetchAndExecute_0) r3 == 0.  Branch taken=1
    Pc:: a  Instr=0000001000000029  Subtract (fetchAndExecute_1) 15 12
    Pc:: b  Instr=0000000400000003  Br (fetchAndExecute_0) 000000003
    Pc:: 3  Instr=000000080000002c  Brz (fetchAndExecute_0) r2 == 3.  Branch taken=0
    Pc:: 4  Instr=0000000c00000036  Gt (fetchAndExecute_1) r1 > r2   12 3  res=1
    Pc:: 5  Instr=000000080000003a  Brz (fetchAndExecute_0) r3 == 4294967295.  Branch taken=0
    Pc:: 6  Instr=0000001000000034  Subtract (fetchAndExecute_1) 12 0
    Pc:: 7  Instr=0000001000000018  Subtract (fetchAndExecute_0) 3 0
    Pc:: 8  Instr=000000100000002c  Subtract (fetchAndExecute_1) 12 0
    Pc:: 9  Instr=0000000400000003  Br (fetchAndExecute_0) 000000003
    Pc:: 3  Instr=000000080000002c  Brz (fetchAndExecute_0) r2 == 12.  Branch taken=0
    Pc:: 4  Instr=0000000c00000036  Gt (fetchAndExecute_1) r1 > r2   3 12  res=0
    Pc:: 5  Instr=000000080000003a  Brz (fetchAndExecute_0) r3 == 0.  Branch taken=1
    Pc:: a  Instr=0000001000000029  Subtract (fetchAndExecute_1) 12 3
    Pc:: b  Instr=0000000400000003  Br (fetchAndExecute_0) 000000003
    Pc:: 3  Instr=000000080000002c  Brz (fetchAndExecute_0) r2 == 9.  Branch taken=0
    Pc:: 4  Instr=0000000c00000036  Gt (fetchAndExecute_1) r1 > r2   3 9  res=0
    Pc:: 5  Instr=000000080000003a  Brz (fetchAndExecute_0) r3 == 0.  Branch taken=1
    Pc:: a  Instr=0000001000000029  Subtract (fetchAndExecute_1) 9 3
    Pc:: b  Instr=0000000400000003  Br (fetchAndExecute_0) 000000003
    Pc:: 3  Instr=000000080000002c  Brz (fetchAndExecute_0) r2 == 6.  Branch taken=0
    Pc:: 4  Instr=0000000c00000036  Gt (fetchAndExecute_1) r1 > r2   3 6  res=0
    Pc:: 5  Instr=000000080000003a  Brz (fetchAndExecute_0) r3 == 0.  Branch taken=1
    Pc:: a  Instr=0000001000000029  Subtract (fetchAndExecute_1) 6 3
    Pc:: b  Instr=0000000400000003  Br (fetchAndExecute_0) 000000003
    Pc:: 3  Instr=000000080000002c  Brz (fetchAndExecute_0) r2 == 3.  Branch taken=0
    Pc:: 4  Instr=0000000c00000036  Gt (fetchAndExecute_1) r1 > r2   3 3  res=0
    Pc:: 5  Instr=000000080000003a  Brz (fetchAndExecute_0) r3 == 0.  Branch taken=1
    Pc:: a  Instr=0000001000000029  Subtract (fetchAndExecute_1) 3 3
    Pc:: b  Instr=0000000400000003  Br (fetchAndExecute_0) 000000003
    Pc:: 3  Instr=000000080000002c  Brz (fetchAndExecute_0) r2 == 0.  Branch taken=1
    Pc:: c  Instr=0000001400000001  38: (fetchAndExecute_0) output regs[1] = 3
    Pc:: d  Instr=0000001800000000  39: Halt (fetchAndExecute_1) at pc13

    This was an easy, get-started example, since in that design, the program code is in the same sort of register file as the processor registers and there is no structural hazard on instruction memory read when performing dual issue.

    Normal Execution of the SimpleProcessor design. The result '3' is returned at cycle 48 in the original design.

    Superscalar Execution of the SimpleProcessor design. The result '3' is returned at cycle 28 which is a 2-to-1 speed up.

    Three copies of fetch and execute rule

    For Xilinx Virtex7, we compiled the baseline SimpleProcessor and two variants with one and two extra copies of the fetchAndExecute rule.

    Not surprisingly, the length of the critical path in each run was roughly 5, 10 and 15 ns. Hence performance in terms of time to completion was roughly unchanged, but IPC did not keep up with lowered clock frequency, with single-scalar being the best. (Strangely, the number of CARRY4 units on the critical path was 4, 7 and 8 respectively.) Here is the critical path for the triple-scalar verison:

    Max Delay Paths
    Slack:                    inf
      Source:                 the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_imem_6_read_RV_reg[2]/C
                                (rising edge-triggered cell FDRE)
      Destination:            the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]/D
      Path Group:             (none)
      Path Type:              Max at Slow Process Corner
      Data Path Delay:        14.269ns  (logic 2.265ns (15.874%)  route 12.004ns (84.126%))
      Logic Levels:           28  (CARRY4=8 FDRE=1 LUT2=3 LUT4=1 LUT5=1 LUT6=12 MUXF7=1 MUXF8=1)
        Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
      -------------------------------------------------------------------    -------------------
        SLICE_X70Y126        FDRE                         0.000     0.000 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_imem_6_read_RV_reg[2]/C
        SLICE_X70Y126        FDRE (Prop_fdre_C_Q)         0.259     0.259 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_imem_6_read_RV_reg[2]/Q
                             net (fo=20, estimated)       0.829     1.088    the_dut/SimpleProcessorTb_mkTb_sp/rtl_unsigned_bitextract54_return[2]
        SLICE_X62Y133        LUT6 (Prop_lut6_I1_O)        0.043     1.131 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_311/O
                             net (fo=1, routed)           0.000     1.131    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_311_n_0
        SLICE_X62Y133        MUXF7 (Prop_muxf7_I1_O)      0.117     1.248 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_135/O
                             net (fo=1, routed)           0.000     1.248    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_135_n_0
        SLICE_X62Y133        MUXF8 (Prop_muxf8_I0_O)      0.046     1.294 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_55/O
                             net (fo=4, estimated)        0.837     2.131    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_55_n_0
        SLICE_X64Y147        LUT6 (Prop_lut6_I0_O)        0.125     2.256 f  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_27/O
                             net (fo=41, estimated)       0.697     2.953    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_27_n_0
        SLICE_X65Y142        LUT5 (Prop_lut5_I0_O)        0.043     2.996 f  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_1_read_RV[0]_i_10/O
                             net (fo=116, estimated)      0.836     3.832    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_1_read_RV[0]_i_10_n_0
        SLICE_X53Y151        LUT2 (Prop_lut2_I1_O)        0.043     3.875 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_204/O
                             net (fo=1, routed)           0.000     3.875    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_204_n_0
        SLICE_X53Y151        CARRY4 (Prop_carry4_S[0]_CO[3])
                                                          0.259     4.134 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_162/CO[3]
                             net (fo=1, estimated)        0.000     4.134    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_162_n_0
        SLICE_X53Y152        CARRY4 (Prop_carry4_CI_CO[3])
                                                          0.053     4.187 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_122/CO[3]
                             net (fo=1, estimated)        0.000     4.187    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_122_n_0
        SLICE_X53Y153        CARRY4 (Prop_carry4_CI_CO[3])
                                                          0.053     4.240 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_82/CO[3]
                             net (fo=1, estimated)        0.000     4.240    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_82_n_0
        SLICE_X53Y154        CARRY4 (Prop_carry4_CI_CO[3])
                                                          0.053     4.293 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV_reg[8]_i_62/CO[3]
                             net (fo=5, estimated)        0.850     5.143    the_dut/SimpleProcessorTb_mkTb_sp/hprbpin501107x1020998_in
        SLICE_X53Y142        LUT4 (Prop_lut4_I1_O)        0.049     5.192 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_38/O
                             net (fo=4, estimated)        0.687     5.879    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_38_n_0
        SLICE_X57Y143        LUT6 (Prop_lut6_I0_O)        0.136     6.015 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_16/O
                             net (fo=1, estimated)        0.457     6.472    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_16_n_0
        SLICE_X57Y143        LUT6 (Prop_lut6_I1_O)        0.043     6.515 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_10/O
                             net (fo=31, estimated)       0.647     7.162    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_10_n_0
        SLICE_X52Y143        LUT6 (Prop_lut6_I1_O)        0.043     7.205 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_result_reg_read_RV[8]_i_7/O
                             net (fo=191, estimated)      0.695     7.900    the_dut/SimpleProcessorTb_mkTb_sp/hprbpin501107x10
        SLICE_X61Y146        LUT2 (Prop_lut2_I1_O)        0.043     7.943 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_1619/O
                             net (fo=1, routed)           0.000     7.943    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_1619_n_0
        SLICE_X61Y146        CARRY4 (Prop_carry4_S[0]_CO[3])
                                                          0.259     8.202 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_1103/CO[3]
                             net (fo=1, estimated)        0.000     8.202    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_1103_n_0
        SLICE_X61Y147        CARRY4 (Prop_carry4_CI_CO[3])
                                                          0.053     8.255 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_800/CO[3]
                             net (fo=1, estimated)        0.000     8.255    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_800_n_0
        SLICE_X61Y148        CARRY4 (Prop_carry4_CI_CO[3])
                                                          0.053     8.308 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_591/CO[3]
                             net (fo=1, estimated)        0.000     8.308    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_591_n_0
        SLICE_X61Y149        CARRY4 (Prop_carry4_CI_CO[3])
                                                          0.053     8.361 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_373/CO[3]
                             net (fo=4, estimated)        0.764     9.125    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]_i_373_n_0
        SLICE_X49Y148        LUT6 (Prop_lut6_I0_O)        0.043     9.168 f  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_375/O
                             net (fo=3, estimated)        0.464     9.632    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_375_n_0
        SLICE_X48Y147        LUT6 (Prop_lut6_I0_O)        0.043     9.675 f  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_199/O
                             net (fo=1, estimated)        0.729    10.404    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_199_n_0
        SLICE_X50Y136        LUT2 (Prop_lut2_I1_O)        0.047    10.451 f  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_97/O
                             net (fo=1, estimated)        0.945    11.396    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_97_n_0
        SLICE_X50Y148        LUT6 (Prop_lut6_I0_O)        0.134    11.530 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_41/O
                             net (fo=1, estimated)        0.640    12.170    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_41_n_0
        SLICE_X56Y152        LUT6 (Prop_lut6_I2_O)        0.043    12.213 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_24/O
                             net (fo=1, estimated)        0.567    12.780    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_24_n_0
        SLICE_X63Y148        LUT6 (Prop_lut6_I2_O)        0.043    12.823 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_10/O
                             net (fo=4, estimated)        0.815    13.638    the_dut/SimpleProcessorTb_mkTb_sp/hprbpin504915x10
        SLICE_X53Y129        LUT6 (Prop_lut6_I4_O)        0.043    13.681 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_2/O
                             net (fo=1, estimated)        0.545    14.226    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_2_n_0
        SLICE_X53Y129        LUT6 (Prop_lut6_I0_O)        0.043    14.269 r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_1/O
                             net (fo=1, routed)           0.000    14.269    the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV[0]_i_1_n_0
        SLICE_X53Y129        FDRE                                         r  the_dut/SimpleProcessorTb_mkTb_sp/SimpleProcessor_mkSimpleProcessor_regs_0_read_RV_reg[0]/D
      -------------------------------------------------------------------    -------------------

    The sensible behaviour of the curious Bluespec wire types, such as PulseWire and RWire, under super-scalar operation is that they are reset to their normal starting point between each operation. The execution order restrictions associated with such wires get more restrictive for super-scalar behaviour ...

    Semantic Extension: Pipelined FU Infix Access via Multi-cycle Schedules

    In today's FPGAs, multipliers and BRAM memories are abundant and programmers are used to accessing such resources via the standard operator notations (asterisk and square brackets) in high-level languages. On FPGAs, these resources are pipelined and a multi-cycle schedule is needed to use them in this way. (Standard Bluespec uses these resources via the Put/Get paradigm...)

    Latency-of-one operators can be used infix style via the register forwarding extension described in the next section.

    For general use of pipelined operators outside of forwardable coding styles, there are two basic approaches:

    The HPR L/S library provides easy support for superstate based multi-cycle schedules: its restructure component will convert infix operators appearing in the NSF of an FSM, replacing them with pipelined FUs and padding out the initial FSM states into multi-cycle superstates. It will also overlap these superstates as much as possible, solving all RaW hazards and inserting holding registers as needed. It can handle fixed-latency FUs as well as variable latency FUs that have handshake nets. The lower-level FU can be a previous Bluespec compile and the necessary meta-info to instantiate it is held in an IP-XACT file read in by the higher-level compilation.

    Multi-cycle schedules can be a simple modulo-like schedule with superstate concepts, but these will give low throughput owing to the initiation interval being extended to the cycle length. Pipelined schedules generally require speculation ...

    ... demo needed by adding restructure to the recipe ... also mention of previous work Karczmarek supported multi-cycle rules but not mutli-cycle methods ...

    Sadly, the standard Bluespec handshake cannot support multi-cycle operations (except by backdoor methods such as global stall mechanism). The restructure plugin, when used in Bluespec like mode, maps its req to Bluespec's EN and its reqrdy to Bluespec's RDY. But for multi-cycle builds, using either a seqeuencer (II=latency) or pipelined accelerator mode, the ack signal needs also to used to deliver the output. Hence it automatically converts to Put/Get style which is not what we want.

    TODO demonstrate the performance gain from re-pipelining a design as a whole compared with making longer rules stall shorter ones in a gated/superstate scheme...

    Reference Implementation of Pipelined FUs

    Access to pipelined hardware resources in conventional Bluespec is typically via the Put/Get interface. This requires a shift register that runs alongside the pipelined FU to record which stage is live with data. An output FIFO is required to store the results until they are consumed.

    Where no clock enables are used (as illustrated), the output FIFO must have sufficient space to avoid overrun taking into account the operator's latency. A clock enable on the FU and shotgun shift register reduces the FIFO capacity requirements since we can use the FU internal registers as storage by stalling and de-asserting put_RDY when the FIFO is full.

    Semantic Extension: Register Forwarding: Simple Synchronous RAM Support Without Multi-cycle Schedules

    We provide additional syntax for easier access to single and dual-ported RAMs with certain restrictions. This mechanism will operate for any latency-of-one stateful or stateless operator, but RAM is the only demo so far.

    The write operation for the synchronous BRAM memory is easily supported with a syntax extension to the Bluespec language. Synchronous RAM writes are not multi-cycle: they are the same as register writes but with the value to be stored and the address being presented to the RAM instance in the same clock cycle.

    Bluespec already has concrete syntax for subscription of vectors using square brackets. The additional dot added to our concrete syntax for RAM subscriptions reminds us that this operation will consume RAM port capacity and potentially raise a structural hazard.

    The write construct added is:

        mybram.[store_address ...] <= value_to_be_stored;

    Latency-of-one synchronous RAM reads have an intrinsic pipeline delay compared with the combinational read of asynchronous RAMs. Efficient asynchronous RAMs are not found in today's mainstream technologies (see BRAM versus distributed RAM discussions ...). Reads where the result is stored directly in a register without otherwise being used can, however, be fairly easily supported with a new Bluespec language extension that generates an extra bit of hidden state per read destination. The extra bit is so-called scoreboard flag, recording that reads of the destination should be forwarded from the RAM read bus, instead of being served from the register. One such bit is needed per RAM per read destination of that RAM. The scoreboard bits may be optimised away if always constant value when read.

    In this example, there is one read site and one read destination (called rval) with direct connection. The rval value used in both the local_reader rule and the remote_reader rule will be served from the output register of the synchronous BRAM when it has been activated the clock cycle immediately before and from the named rval register otherwise. The extra state bit manages which source to read. Reads where the data is looked at in the current cycle cannot be managed this way and the multi-cycle schedule is needed instead. [ If the local_reader has nothing stopping it firing every clock cycle, it will never need the rval register. ] Net-level debugging can be facilitated by providing the rval bus driving logic, even if it is not used. The debugging logic will be later stripped by backend tools.

    The new read construct is illustrated in this example.

        rule local_reader;
          rval <= mybram.[read_address ...];
          sum1 <= sum1 + rval ... etc;
        rule remote_reader;
          sum2 <= sum2 + rval ... etc;

    (A more complex example would show conditional reading of the BRAM (via IF statements or tertiary operator) and multiple destination registers with the destination registers being served from a number of different BRAMs, varying addresses on a commmon BRAM and also conventional updates to the target register(s). The Put/Get interface to the BRAM and its accompanying FIFO structure are not intended to be mixed with the new syntax on a common BRAM.)

    Where the register rval is used, the value is either supplied from the read bus of the BRAM or from rval itself. The real rval is loaded from the RAM read bus the clock cycle afterwards if no other rule writes to that destination at that time.

    Where there is more than one read site and associated read register, the mechanism is repeated for each read register.

    The various read and write sites for such a RAM will each have an address expression. The expressions in use may be determined (at compile time) to always be identical or not. Where the expressions are different, there is a structural hazard, but where they are determined to be the same, a read and a write on the same address are possible through a single port and the read may be fanned out to multiple read sites. There are intrinsic limitations to such static identity checking arising from computability theory, but our use of an internal memoising heap with normal forms arising from multiple commutative, distributive and boolean laws, helps spot identity. Our implementation puts the expressions into equivalence classes and serves one class per RAM port.

    Port over-subscription requires arbitration and we cast this into the Bluespec intrinsic guard paradigm to be resolved by traditional static rule priority or with our new run-time arbiters mechanism.

    Dual-ported synchronous RAMs are commonly found on today's FPGAs. For multi-ported RAMs, automatic load balancing within the clock domain can be supported with a schedulling conflict only arising where more addresses than ports is attempted...
    ... being added June/Aug 2019 ... explain about superscalar interactions and RAM-to-RAM operations...

    Our implementation supports certain simple combinational operations on the forwarding path, including bit extract, field select and inversion. Addition or subtraction of a constant could also have been supported, but was left out since this requires carry chains instead of just multiplexors.

    Semantic Extension: Real-time Rule Firing Rate Targets

    With or without profile-directed feedback, the compiler can make an estimate and report the expected firing rate of each rule. Given the ability to schedule a rule more than once per cycle, or at a rate less than normal under control of a new generation of arbiters called policers, we can synthesise a design to meet real-time rate targets ...
    ... being added June 2019

    Semantic Extension: Floating-Point Support

    Given that this compiler is implemented on top of the HPR library, that supports floating-point hardware, adding floating-point arithmetic to Bluespec is nominally a very simple step, requiring a small concrete syntax extension. However, floating-point ALUs are normally heavily pipelined, so this only makes sense alongside the pipelined FU extension. IEEE floats are perhaps of little use since Bluespec is not intended for scientific computation. Custom arithmetic (e.g. for CNNs) might be of more use and should be possible with the same literal constant and infix operator extension mechanism...

    ... being added June 2019

    See Also (citations and links)

  • An Open-Source Bluespec Compiler: PDF.

  • TNDJG:0004: I converted my processor design to Bluespec and it went more than twice as fast!

  • Download our open-source Bluespec Compiler (including the fairly complete front-end parser) for your own experiments: HERE.

  • Further Sub-cycle and Multi-cycle Schedulling support for Bluespec Verilog (Memocode'19 paper): Paper PDF       Slides PDF.

  • Satnam's Kami Oak Hardware Demo README.md.

    Spare text: This is actually an attractive approach but needs tidying up from what is here. The attraction is that Bluespec has a static elaborator which is basically a Haskel interpreter, but by embedding our constructs in FSharp instead, we have the power of the ML dialect of F Sharp to do exactly the same kind of elaboration. All of the higher-order iterators and functors are exactly the same, modulo inserting 'fun () -> ' in a few places to get lazy behaviour. The good work needed is to use the domain-specific language (DSL) embedding facilities of F Sharp for the elaborated leaves. I might do this if I update this page.

    (C) 2012-13 Orangepath Project. DJ Greaves.