ECAD and Architecture Practical Classes
Exercise 5: RISC-V programming
Clarvi - a RISC-V processor
Clarvi ('Computer LAboratory RISC-V Implementation') is a simple, in-order, 6-stage pipeline implementation of a processor in SystemVerilog. It implements the base 32-bit RISC-V instruction set (RV32I) with minimal supervisor mode support. It can use a shared external instruction and data memory but has no caches. Additionally it can communicate with other peripherals in a system using a simple memory-mapped I/O bus, which we will use later on. Clarvi is described in more detail in the Computer Design lectures.
The full specification of the RISC-V instruction set can be found on RiscV.org, and we also have a copy of the RISC-V Green Card instruction set summary (also included in your Computer Design handout). You may also wish to consult our Assembly Programming Guide for tips on assembly language programming.
Clarvi tour
Find the Clarvi code in ecad_labs/ex5_riscv/clarvi/
It contains several files:
- clarvi.sv
- The main processor description
- riscv.svh
- RISC-V instruction set definitions
- clarvi_debug.sv
- Debugging $display statements for the processor in simulation, in a separate file for clarity
- clarvi_sim.sv
- A toplevel testbench for the processor in simulation
- bram.sv
- The shared instruction/data memory for use in simulation
- clarvi_avalon.sv
- A wrapper for Clarvi when used on FPGA
- clarvi_test.do
- A Tcl script to compile and configure a Clarvi simulation
- clarvi_hw.tcl
- A script to generate a Qsys component for Clarvi
Glance through clarvi.sv, riscv.svh and clarvi_sim.sv and familiarise yourself with the key parts of the code (this will be covered in lectures).
The Spike instruction-set simulator
In Route A (and optionally Route B) we'll simulate Clarvi's SystemVerilog directly. This means the simulator is tracing the action of every wire and the state of every flip-flop in the system. The basic building blocks are state transitions of flip-flops.
Simulating in this way is accurate, but time-consuming and scales poorly as our design gets bigger. Instead of this gate-level simulation, an alternative approach at a higher level of abstraction is an instruction-set architecture (ISA) simulator. An ISA simulator considers instructions as the basic building-blocks, and the simulator is designed to model the fetch-execute cycle of a specific architecture (for instance Arm or RISC-V).
A gate-level processor simulator is said to be cycle-accurate, because it has a full description of the processor pipeline (implemented in this case in SystemVerilog). An ISA simulator has models of instructions written in a higher-level language such as C++ or Java. While those models could be annotated with timing information, the full pipeline is not being modelled. This makes it faster, but less accurate. For example, some kinds of bugs that might arise from specific pipeline conditions might not be evident.
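To make the contrast concrete, here is a minimal sketch of an ISA-level fetch-execute loop in C. It is not how Spike is implemented; the names toy_cpu and toy_run are hypothetical, and only two RV32I instructions (ADDI and ADD) are modelled. The point is that the only state is the registers and the program counter, with no pipeline stages, which is why such a model is fast but not cycle-accurate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural state only: 32 registers and a program counter.
   There are no pipeline registers, so no timing is modelled. */
typedef struct {
    uint32_t x[32];
    uint32_t pc;
} toy_cpu;

/* Execute instructions from `program` until the pc leaves the program or an
   unsupported instruction is fetched. Returns the number of instructions
   executed. (Hypothetical helper, for illustration only.) */
size_t toy_run(toy_cpu *cpu, const uint32_t *program, size_t n_words)
{
    size_t executed = 0;
    assert(cpu != NULL && program != NULL);

    while (cpu->pc / 4 < n_words) {
        uint32_t insn   = program[cpu->pc / 4];   /* fetch */
        uint32_t opcode = insn & 0x7f;            /* decode the fields */
        uint32_t rd     = (insn >> 7) & 0x1f;
        uint32_t funct3 = (insn >> 12) & 0x7;
        uint32_t rs1    = (insn >> 15) & 0x1f;
        uint32_t rs2    = (insn >> 20) & 0x1f;
        uint32_t funct7 = insn >> 25;
        uint32_t imm    = insn >> 20;             /* I-type immediate */
        uint32_t result;

        if (imm & 0x800)
            imm |= 0xfffff000u;                   /* sign-extend 12 bits */

        if (opcode == 0x13 && funct3 == 0)        /* ADDI */
            result = cpu->x[rs1] + imm;
        else if (opcode == 0x33 && funct3 == 0 && funct7 == 0) /* ADD */
            result = cpu->x[rs1] + cpu->x[rs2];
        else
            break;                                /* not modelled: stop */

        if (rd != 0)                              /* x0 is hard-wired to zero */
            cpu->x[rd] = result;
        cpu->pc += 4;
        executed++;
    }
    return executed;
}
```

For example, running the three words 0x00c00513, 0x00400593, 0x00b50533 (addi a0, zero, 12; addi a1, zero, 4; add a0, a0, a1) from a zeroed toy_cpu leaves 16 in x[10].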
For Routes B and C we'll use the Spike ISA simulator, which is produced by the RISC-V Foundation. This is said to be the 'golden model' simulator for the RISC-V architecture - many hardware designers consult the behaviour of Spike to check details of their implementation.
(There exist also formal models of the ISA, a higher level again, which represent behaviour of instructions in terms of abstract logic that makes them amenable to formal proof. RISC-V have adopted the Sail formal model from this department as their formal ISA specification).
Programming RISC-V
We will begin by writing RISC-V assembly code, and then later link our assembly with code written in C. We compile code into a memory image that we can either load into Spike, load into a Modelsim simulation of Clarvi, or build into Clarvi running on FPGA.
To generate the memory contents we use a C compiler and assembler, using the following process:
First, source files are compiled into matching object files, which contain the instructions for each function but do not yet fix where those instructions will go in memory. Then all the object files, together with any additional libraries that might be used (none in this example), are linked together into a program binary, using a linker script to indicate where in memory all the parts should go. In our case the binary is in the ELF format. Since the target architecture we are generating code for (32-bit RISC-V) doesn't match the architecture the compiler is running on (64-bit x86 Linux), this process is called cross-compiling.
Each binary is made up of a number of sections, including program instructions (.text), read-only data (.rodata), pre-initialised writable data (.data), and data the program has declared but not pre-defined (.bss). In a traditional operating system the binary would be loaded into memory using the operating system's loader or runtime linker, which would also allocate memory, maybe start a new process, and jump to the loaded code.
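As a rough illustration of how C source maps onto these sections, the sketch below shows one object of each kind. The names (days_in_month, call_count, scratch, sum_days) are hypothetical, and the placement described in the comments is what GCC typically does; the exact result depends on the compiler options and the linker script, so it is worth checking against the disassembly dump.

```c
#include <assert.h>
#include <stddef.h>

/* Read-only data: typically placed in .rodata. */
const int days_in_month[4] = {31, 28, 31, 30};

/* Pre-initialised writable data: typically placed in .data. */
int call_count = 5;

/* Declared but given no initial value: typically placed in .bss. */
int scratch[8];

/* Program instructions: placed in .text. */
int sum_days(size_t n)
{
    int total = 0;
    assert(n <= 4);
    for (size_t i = 0; i < n; i++)
        total += days_in_month[i];
    call_count++;        /* writes to .data */
    scratch[0] = total;  /* writes to .bss */
    return total;
}
```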
We are using RISC-V processors bare metal, i.e. with no operating system, so we use a much simpler technique. We extract the relevant sections from the ELF binary and simply bundle them all together into one memory image which we arrange to be in memory when the processor starts. This way the first instruction the CPU executes is the first instruction of your program - it doesn't have to worry about how to load code or initialise the program's data.
To manage the multiple steps of this build process, we use a Makefile. Make is a command-line tool that calculates build dependencies and does the minimal number of steps necessary. A Makefile is a series of rules describing how to build one type of file from another type of file. For example, if you want to build sourcefile.o, Make will look for a source file called sourcefile.c or sourcefile.s and try and build it, using the rules provided in the Makefile. Make will do the minimal amount of work necessary - if it sees that the date on sourcefile.c has not changed since the last time it compiled it, it will not build it again but instead use the existing sourcefile.o. If you type 'make' alone, it will look for a file of rules called Makefile, and attempt to execute the first rule it finds in it. You can also follow 'make' with a rule name, for example 'make clean'.
The RISC-V toolchain
For creating RISC-V programs that can run on a RISC-V CPU, there is a cross-compiler installed with the prefix riscv32-unknown-elf-. To compile a simple program, you can invoke riscv32-unknown-elf-gcc. To deal with the specific details of our processor, a Makefile, linker script (link.ld) and example program framework are provided.
Navigate to the ecad_labs/ex5_riscv/assembly/ directory in your terminal and type "make". You should see output something like this:
$ make
mkdir -p build
riscv32-unknown-elf-gcc -c -o build/init.o src/init.s -O0 -march=rv32i
riscv32-unknown-elf-gcc -c -o build/div.o src/div.s -O0 -march=rv32i
riscv32-unknown-elf-gcc -c -o build/main.o src/main.c -O0 -march=rv32i
riscv32-unknown-elf-gcc -o build/program.elf -O0 -march=rv32i -static -fvisibility=hidden -nostdlib -nostartfiles -T link.ld build/init.o build/div.o build/main.o
riscv32-unknown-elf-objcopy -O binary --only-section=.data* --only-section=.text* build/program.elf build/mem.bin
hexdump -v -e '"%08x\n"' build/mem.bin > build/mem.txt
python3 txt2hex.py build/mem.txt build/mem.hex 4
riscv32-unknown-elf-objdump -S -s build/program.elf > build/program.dump
In this case the Makefile will search the src directory for files ending in .c and .s and attempt to build them. After creating a directory to hold the output, you can see it assembles three source files main.c, init.s and div.s into .o files. These are linked (using the .ld script) into an ELF file, which is then converted into a flat binary image (.bin). Then we generate memory images (in two different hexadecimal formats, .txt for Modelsim, .hex for Quartus). Finally a disassembly dump is generated.
Have a look at build/program.dump. See if you can follow how the information in link.ld, init.s, div.s and main.c was used to build the program and how the program is structured.
To save time while compiling, make will try hard not to recompile files you didn't change. In case you want to start afresh, the clean target inside the Makefile will delete all the compiler output files. Typing make clean will force the next run of make to rebuild everything.
Simulating Clarvi (Route A, optional for Route B)
Now we have a program, we need to simulate the Clarvi processor.
In a terminal:
To simulate the Clarvi, start Modelsim from the clarvi directory. When started, type:
do clarvi_test.do ../assembly/build/mem.txt TRACE
in the transcript window. This should compile the Clarvi sources, set up the waveforms and initialise the clock signal. Now you can use the run buttons to start simulation, or type run 1us in the transcript window to run for a short length of time.
In addition to waveforms, Clarvi also outputs an instruction trace in the transcript window. Besides tracing program flow, we can use this trace output to track both intermediate and return values of a function by looking at the register and memory writes. See here for an explanation of what is happening in the trace.
Simulating using Spike (Routes B and C)
In the files bundle we have provided a Makefile to simulate the code using Spike. Run:
make spike-log
from the 'assembly' folder. You can also use make spike-debug which provides an interactive debugger which can single-step through the assembly instructions (type 'h' for help, or see usage notes). Using make spike will run without the instruction trace being printed.
Debugging your code
In simulation, the basic Clarvi and Spike have no input/output devices like LEDs or displays. We have added a simple debugging printout to display a 32-bit value: we have implemented Control/Status Register (CSR) 0x800 so that any write to this register prints the register name and value to the simulation log.
Register xN can be printed by:
csrw 0x800, xN
Because this construction is a bit clumsy to use regularly, in the assembly source we have defined an assembler macro to give it a nicer name:
.macro DEBUG_PRINT reg
    csrw 0x800, \reg
.endm
And then you can print register t1 with:
DEBUG_PRINT t1
You can also see the value appear on the debug_scratch wires in your Spike log or Clarvi simulation. For Clarvi, if the instruction trace makes it hard to find your debug printouts, you can turn it off by removing TRACE from the clarvi_test.do command line.
Exercise 5a
Write a RISC-V assembly program to perform integer division.
Division is slow and expensive in hardware, and we use it relatively rarely. Instead we can use a subroutine to perform division in software, and call this instead of a hardware divide instruction.
Wikipedia gives the long division algorithm as follows (in Pascal-like pseudo-code):
Q := 0                  -- initialize quotient and remainder to zero
R := 0
for i = n-1...0 do      -- where n is number of bits in N
    R := R << 1         -- left-shift R by 1 bit
    R(0) := N(i)        -- set the least-significant bit of R equal to bit i of the numerator
    if R >= D then
        R := R - D
        Q(i) := 1
    end
end
In assembly, write a function div that, for two numbers in registers a0 and a1, calculates (a0/a1) and returns the quotient in a0 and the remainder in a1. Use the provided div.s framework as a starting point. The main.s file will call your div() function. You might want to think about how your function should behave when given 0 as a denominator: the above pseudo-code would lead to a quotient with n 1's as the least significant bits, and a remainder equal to the numerator. You should return 0 when the denominator is zero.
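Before writing the assembly, it can help to have a reference model of the algorithm to compare results against. The following is a minimal C sketch of the long division pseudo-code for 32-bit unsigned values, intended to run on your workstation rather than on Clarvi. The names divmod_result and div_ref are hypothetical, and returning zero for both quotient and remainder when the denominator is zero is an assumption about how to read the requirement above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t quotient;
    uint32_t remainder;
} divmod_result;

/* Bit-serial long division of n by d, following the pseudo-code with 32-bit values. */
divmod_result div_ref(uint32_t n, uint32_t d)
{
    divmod_result out = {0, 0};

    if (d == 0)
        return out;  /* assumption: quotient and remainder both 0 */

    for (int i = 31; i >= 0; i--) {
        /* R := R << 1; R(0) := N(i) */
        out.remainder = (out.remainder << 1) | ((n >> i) & 1u);
        if (out.remainder >= d) {
            out.remainder -= d;
            out.quotient |= 1u << i;  /* Q(i) := 1 */
        }
    }

    /* Sanity check: n = q*d + r with r < d. */
    assert(out.remainder < d);
    assert(out.quotient * d + out.remainder == n);
    return out;
}
```

For the three test cases used in this exercise, the model gives 12/4 = 3 remainder 0, 93/7 = 13 remainder 2, and 0x12345678/255 = 1197725 (0x12469d) remainder 21 (0x15).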
Verify that the program you just wrote behaves as expected. Use this template to invoke your code from main.s (you can replace the code between # *** with this):
.macro DEBUG_PRINT reg
    csrw 0x800, \reg
.endm

    addi a0, zero, 12 # a0 <- 12
    addi a1, zero, 4 # a1 <- 4
    call div
    DEBUG_PRINT a0 # display the quotient
    DEBUG_PRINT a1 # display the remainder

    addi a0, zero, 93 # a0 <- 93
    addi a1, zero, 7 # a1 <- 7
    call div
    DEBUG_PRINT a0 # display the quotient
    DEBUG_PRINT a1 # display the remainder

    lui a0, (0x12345000>>12)
    addi a0, a0, 0x678 # a0 <- 0x12345678
    # we could also use the pseudo-instruction 'li a0, 0x12345678'
    # which will assemble to the above two instructions
    addi a1, zero, 255 # a1 <- 255
    call div
    DEBUG_PRINT a0 # display the quotient
    DEBUG_PRINT a1 # display the remainder
Follow the procedure to convert your source code into textual format and preload the CPU memory with it. Carefully analyse the program trace to compare the implementation's output with your expectation.
C on RISC-V
Clarvi and Spike implement enough of the RISC-V instruction set to be targetable by a C compiler. C enables us to build larger programs and makes it easier to port them to different CPUs.
Change to the ex5_riscv/c/src directory. This is similar to the assembly project, except with C functions. C programs always begin at the main() function, which we have provided in main.c. As before, init.s sets up the environment so that we can begin execution of the C program. The Makefile works in the same way.
C syntax is similar to Java. See An Introduction to C for differences.
Mixing C and assembler
You can write some parts of your program in C and some in assembler. To do this, we need to arrange for the C to put the parameter values into the correct registers for the assembler to pick up, and vice versa. This is defined by the processor's calling convention. On the RISC-V the calling convention is to use registers 10-17 (named a0-a7) for function arguments and 10-11 (a0-a1) for return values. a0-a7 correspond to the first 8 arguments provided to a C function, and the return value from C should be provided in a0. Note that C only provides access to a single return value.
For example, an addition function might look like:
# int add(int a, int b): add two numbers together
# parameters supplied in a0 and a1
# doesn't call any other functions, so no need to store return address register (ra)
# doesn't corrupt any callee-save registers, so no need to use the stack
.global add # export the function symbol so the linker can find it
add:
    # add the two parameters, returning the result in the return register a0
    add a0,a0,a1
    ret
To call from C code we need a function prototype to tell the C compiler the types and parameters of the function, without defining it. For example, if we have an assembler function taking two integers and returning another, in a file called asmfunctions.h (or some other name) enter:
int add(int a, int b);
Then your C code can:
#include "asmfunctions.h"
to make the declaration available to this C file. You can then call add() anywhere below the #include. The advantage of .h header files is that you can easily include them in multiple files, to enable calling your function from multiple places.
Note that, while C is a type-checked language, the assembler will not type check you. So there are no safeguards if your declaration of add() does not match your usage of registers in assembler.
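Put together, the C side might look like the sketch below. The caller sum_of_three is a hypothetical name, and the C body of add() at the bottom is only a host-side stand-in so that the example links on a workstation; on the RISC-V build you would leave it out and link the assembly version instead.

```c
/* Prototype, as it would appear in asmfunctions.h. */
int add(int a, int b);

/* C caller. On RISC-V the compiler places the first argument in a0 and the
   second in a1 before each call, then reads the result back from a0. */
int sum_of_three(int x, int y, int z)
{
    return add(add(x, y), z);
}

/* Host-only stand-in for the assembly routine (an assumption for testing
   off-target). Omit this when linking against the assembly add. */
int add(int a, int b)
{
    return a + b;
}
```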
Exercise 5b
Copy your div.s into the ex5_riscv/c/src directory. Write a C function prototype for it in asmfunctions.h.
We have provided an implementation of mod() in init.s that calls div() and copies the remainder from a1 to a0 to obey the C calling convention. Write a prototype for mod() as well.
In C, write a program using div and mod to display a digital clock in minutes, seconds and centiseconds. You'll produce the output in Binary Coded Decimal, i.e. 17m 34.89s is 0x00173489.
To read the time, we'll use the CPU's internal cycle counter. Make a new file cycles.s:
.section .text
.global get_time
get_time:
    csrr a0, cycle
    ret
This defines a function with the prototype:
int get_time(void);
The get_time function will now return the number of CPU cycles since power on (as this is a 32-bit value, it will wrap around roughly every 86 seconds at 50 MHz). To make the simulation faster, for now you should assume there are 1,000 ticks of this value per second. Using the actual value (100,000,000 in simulation) would require you to simulate for a very long time to test even the seconds counter of your clock. This highlights how slow simulation can be compared to FPGA, which could run this in real time.
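The wraparound figure, and one way to cope with it, can be sketched in C as follows. The helper names elapsed_ticks and seconds_until_wrap are hypothetical. The sketch assumes you treat the value returned by get_time() as an unsigned 32-bit quantity (for instance by casting it to uint32_t), so that subtraction is performed modulo 2^32 and stays correct across a single wraparound.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two readings of the 32-bit counter.
   Unsigned subtraction wraps modulo 2^32, so the result is still correct
   if the counter overflowed once between `start` and `now`. */
uint32_t elapsed_ticks(uint32_t start, uint32_t now)
{
    return now - start;
}

/* Whole seconds before a 32-bit counter wraps at the given clock frequency. */
uint32_t seconds_until_wrap(uint32_t clock_hz)
{
    assert(clock_hz != 0);
    return (uint32_t)(((uint64_t)1 << 32) / clock_hz);
}
```

At 50 MHz, 2^32 / 50,000,000 is about 85.9, so seconds_until_wrap(50000000) returns 85 whole seconds.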
grep Debug transcript
will print all the lines containing the string Debug. You can also run without TRACE to not display the instruction trace.
On Clarvi only, output the value by writing to address 0x04000080, the address of the hex LEDs in the simulator. This will cause a message in the log. You can use this function:
void hex_output(int value)
{
    int *hex_leds = (int *) 0x04000080; // define a pointer to the register
    *hex_leds = value; // write the value to that address
}
(On Spike we don't have any memory mapped at 0x04000000 so calling hex_output() will generate an exception and terminate the program.)
Note also that we don't have a standard library, so functions like printf, malloc or memset do not exist (unless we write them). The Clarvi CPU also does not have multiply - if the compiler cannot optimise a constant multiply to adds and shifts, it will call a multiply function which we don't have (unless you write one).
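If you do need a multiply, a shift-and-add routine is one straightforward option. The sketch below is a minimal version for 32-bit values; soft_mul is a hypothetical name. When targeting RV32I without the multiply extension, GCC typically emits a call to the libgcc routine __mulsi3 instead, which is not linked in here because the build uses -nostdlib.

```c
#include <stdint.h>

/* Shift-and-add multiplication: returns the low 32 bits of a * b.
   The low 32 bits are the same for signed and unsigned operands, so the
   result can be cast back to int for two's complement signed values. */
uint32_t soft_mul(uint32_t a, uint32_t b)
{
    uint32_t product = 0;

    while (b != 0) {
        if (b & 1u)
            product += a;  /* add the current shifted copy of a */
        a <<= 1;           /* next bit of b is worth twice as much */
        b >>= 1;
    }
    return product;
}
```

The loop runs once per significant bit of b, so passing the smaller operand as b reduces the work.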
To use the debug log from C you can use the following C function (an example of inline assembly):
void dprint(int value)
{
    asm ("csrw 0x800, %0" : : "r" (value) );
}
We have provided this in main.c.
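If you would like to try out your clock logic natively on your workstation before simulating it, one possible approach is a variant of dprint() that falls back to printf when not compiling for RISC-V. This is only a sketch: the __riscv macro is predefined by RISC-V GCC, while last_debug_value and the "Debug:" output format are assumptions made for this example and are not the format produced by Clarvi or Spike.

```c
#include <stdio.h>

#ifndef __riscv
/* Host-only: remembers the most recent value so tests can inspect it. */
static int last_debug_value;
#endif

void dprint(int value)
{
#ifdef __riscv
    /* On the RISC-V target: write the value to the debug CSR as before. */
    __asm__ volatile ("csrw 0x800, %0" : : "r" (value));
#else
    /* On the host: record and print the value instead. */
    last_debug_value = value;
    printf("Debug: 0x%08x\n", (unsigned)value);
#endif
}
```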
Continuously read and display the time so that you can check your system correctly displays transitions from centiseconds to seconds and seconds to minutes.
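One way to structure the conversion from a tick count to the BCD clock value is sketched below, as a host-side model to check your results against. The names to_bcd2 and ticks_to_bcd_clock are hypothetical, the tick rate is the 1,000 ticks per second assumed for simulation, and wrapping the minutes at 100 is an assumption. The sketch uses the C / and % operators as stand-ins; in your Clarvi program these would be calls to your div() and mod() functions.

```c
#include <stdint.h>

#define TICKS_PER_SECOND 1000u  /* simulation assumption from the text */

/* Encode a value in the range 0..99 as two BCD digits. */
static uint32_t to_bcd2(uint32_t v)
{
    return ((v / 10u) << 4) | (v % 10u);
}

/* Convert a tick count to 0x00MMSSCC in Binary Coded Decimal. */
uint32_t ticks_to_bcd_clock(uint32_t ticks)
{
    uint32_t centis        = ticks / (TICKS_PER_SECOND / 100u);
    uint32_t centiseconds  = centis % 100u;
    uint32_t total_seconds = centis / 100u;
    uint32_t seconds       = total_seconds % 60u;
    uint32_t minutes       = (total_seconds / 60u) % 100u; /* assumption: wrap at 100 */

    return (to_bcd2(minutes) << 16) | (to_bcd2(seconds) << 8) | to_bcd2(centiseconds);
}
```

With 1,000 ticks per second, 1,054,890 ticks corresponds to 17m 34.89s and encodes as 0x00173489, while 59,990 and 60,000 ticks give 0x00005999 and 0x00010000, which exercises the seconds-to-minutes transition.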