ECAD and Architecture Practical Classes

Exercise 5: RISC-V programming

Clarvi - a RISC-V processor

Clarvi ('Computer LAboratory RISC-V Implementation') is a simple, in-order, 6-stage pipeline implementation of a processor in SystemVerilog. It implements the base 32-bit RISC-V instruction set (RV32I) with minimal supervisor mode support. It can use a shared external instruction and data memory but has no caches. Additionally it can communicate with other peripherals in a system using a simple memory-mapped I/O bus, which we will use later on. Clarvi is described in more detail in the Computer Design lectures.

The full specification of the RISC-V instruction set can be found on RiscV.org, and we also have a copy of the RISC-V Green Card instruction set summary (also included in your Computer Design handout). You may also wish to consult our Assembly Programming Guide for tips on assembly language programming.

Clarvi tour

Find the Clarvi code in ecad_labs/ex5_riscv/clarvi/

It contains several files:

clarvi.sv: The main processor description
riscv.svh: RISC-V instruction set definitions
clarvi_debug.sv: Debugging $display statements for the processor in simulation, in a separate file for clarity
clarvi_sim.sv: A toplevel testbench for the processor in simulation
bram.sv: The shared instruction/data memory for use in simulation
clarvi_avalon.sv: A wrapper for Clarvi when used on FPGA
clarvi_test.do: A Tcl script to compile and configure a Clarvi simulation
clarvi_hw.tcl: A script to generate a Qsys component for Clarvi

Glance through clarvi.sv, riscv.svh and clarvi_sim.sv and familiarise yourself with the key parts of the code (this will be covered in lectures).

The Spike instruction-set simulator

In Route A (and optionally Route B) we'll simulate Clarvi's SystemVerilog directly. This means the simulator is tracing the action of every wire and the state of every flip-flop in the system. The basic building blocks are state transitions of flip-flops.

Simulating in this way is accurate, but time-consuming and scales poorly as our design gets bigger. Instead of this gate-level simulation, an alternative approach at a higher level of abstraction is an instruction-set architecture (ISA) simulator. An ISA simulator considers instructions as the basic building-blocks, and the simulator is designed to model the fetch-execute cycle of a specific architecture (for instance Arm or RISC-V).

A gate-level processor simulator is said to be cycle-accurate, because it has a full description of the processor pipeline (implemented in this case in SystemVerilog). An ISA simulator has models of instructions written in a higher-level language such as C++ or Java. While those models could be annotated with timing information, the full pipeline is not being modelled. This makes it faster, but less accurate. For example, some kinds of bugs that might arise from specific pipeline conditions might not be evident.
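To make the idea of instructions as the basic building blocks concrete, here is a minimal sketch of a fetch-decode-execute loop in C. It models only two RV32I instructions (ADDI and ADD) and carries no timing information. The names toy_cpu, toy_run and imm_i are invented for this illustration, and this is not how Spike itself is written.

```c
#include <stdint.h>
#include <stddef.h>

/* Architectural state of a toy RV32I model: 32 registers and a program counter. */
typedef struct {
    uint32_t x[32];
    uint32_t pc;
} toy_cpu;

/* Sign-extend the 12-bit immediate held in bits 31:20 of an I-type instruction. */
static uint32_t imm_i(uint32_t insn)
{
    uint32_t imm = insn >> 20;
    return (imm & 0x800u) ? (imm | 0xFFFFF000u) : imm;
}

/* Execute instructions from prog until the end of the program or an
 * unmodelled instruction; returns the number of instructions executed. */
size_t toy_run(toy_cpu *cpu, const uint32_t *prog, size_t n_words)
{
    size_t executed = 0;
    while (cpu->pc / 4 < n_words) {
        uint32_t insn   = prog[cpu->pc / 4];       /* fetch */
        uint32_t opcode = insn & 0x7Fu;            /* decode */
        uint32_t rd     = (insn >> 7) & 0x1Fu;
        uint32_t funct3 = (insn >> 12) & 0x7u;
        uint32_t rs1    = (insn >> 15) & 0x1Fu;
        uint32_t rs2    = (insn >> 20) & 0x1Fu;
        uint32_t funct7 = insn >> 25;
        uint32_t result;

        if (opcode == 0x13u && funct3 == 0u) {                        /* ADDI */
            result = cpu->x[rs1] + imm_i(insn);
        } else if (opcode == 0x33u && funct3 == 0u && funct7 == 0u) { /* ADD */
            result = cpu->x[rs1] + cpu->x[rs2];
        } else {
            break;                                  /* unmodelled instruction */
        }
        if (rd != 0)
            cpu->x[rd] = result;                    /* x0 is hard-wired to zero */
        cpu->pc += 4;
        executed++;
    }
    return executed;
}
```

Given the three words 0x00c00513, 0x00400593 and 0x00b50533 (addi a0, zero, 12; addi a1, zero, 4; add a0, a0, a1), the loop leaves 16 in register x10 (a0). Nothing in the model says how many clock cycles that took, which is the accuracy given up in exchange for speed.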

For Routes B and C we'll use the Spike ISA simulator, which is produced by the RISC-V Foundation. This is said to be the 'golden model' simulator for the RISC-V architecture - many hardware designers consult the behaviour of Spike to check details of their implementation.

(There also exist formal models of the ISA, at a higher level again, which represent the behaviour of instructions in terms of abstract logic that makes them amenable to formal proof. RISC-V has adopted the Sail formal model from this department as its formal ISA specification.)

Programming RISC-V

We will begin by writing RISC-V assembly code, and then later link our assembly with code written in C. We compile code into a memory image that we can load into Spike, load into a Modelsim simulation of Clarvi, or build into Clarvi running on FPGA.

To generate the memory contents we use a C compiler and assembler, using the following process:

First, source files are compiled into matching object files, which contain the instructions for each function but do not yet fix where those instructions will go in memory. Then all the object files, together with any additional libraries that might be used (none in this example), are linked together into a program binary, using a linker script to indicate where in memory all the parts should go. In our case the binary is in the ELF format. Since the target architecture we are generating code for (32-bit RISC-V) doesn't match the architecture the compiler is running on (64-bit x86 Linux), this process is called cross-compiling.

Each binary is made up of a number of sections, including program instructions (.text), read-only data (.rodata), pre-initialised writable data (.data), and data the program has declared but not pre-defined (.bss). In a traditional operating system the binary would be loaded into memory using the operating system's loader or runtime linker, which would also allocate memory, maybe start a new process, and jump to the loaded code.
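As a rough illustration of how C declarations map onto these sections, the sketch below annotates each object with the section a typical GCC configuration would place it in. The names lookup, counter, scratch and bump are invented for this example, and the exact placement depends on the compiler options and linker script in use.

```c
#include <stdint.h>

/* .rodata: constant data that is never written at run time */
const uint32_t lookup[4] = {1, 2, 4, 8};

/* .data: writable, with a non-zero initial value stored in the binary */
uint32_t counter = 5;

/* .bss: writable, declared but given no initial value */
uint32_t scratch[4];

/* .text: the instructions of the function itself */
uint32_t bump(uint32_t i)
{
    scratch[i & 3u] = lookup[i & 3u] + counter;
    counter++;
    return scratch[i & 3u];
}
```

One way to check where each symbol actually ended up is to look for these names in the disassembly dump of the linked program.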

We are using RISC-V processors bare metal, i.e. with no operating system, so we use a much simpler technique. We extract the relevant sections from the ELF binary and simply bundle them all together into one memory image which we arrange to be in memory when the processor starts. This way the first instruction the CPU executes is the first instruction of your program - it doesn't have to worry about how to load code or initialise the program's data.

To manage the multiple steps of this build process, we use a Makefile. Make is a command-line tool that calculates build dependencies and does the minimal number of steps necessary. A Makefile is a series of rules describing how to build one type of file from another type of file. For example, if you want to build sourcefile.o, Make will look for a source file called sourcefile.c or sourcefile.s and try and build it, using the rules provided in the Makefile. Make will do the minimal amount of work necessary - if it sees that the date on sourcefile.c has not changed since the last time it compiled it, it will not build it again but instead use the existing sourcefile.o. If you type 'make' alone, it will look for a file of rules called Makefile, and attempt to execute the first rule it finds in it. You can also follow 'make' with a rule name, for example 'make clean'.

The RISC-V toolchain

For creating programs that can run on a RISC-V CPU, there is a cross-compiler installed, prefixed riscv32-unknown-elf-. To compile a simple program, you can invoke riscv32-unknown-elf-gcc. To deal with the specific details of our processor, a Makefile, a linker script (link.ld) and an example program framework are provided.

Navigate to the ecad_labs/ex5_riscv/assembly/ directory in your terminal and type "make". You should see output something like this:

$ make
mkdir -p build
riscv32-unknown-elf-gcc -c -o build/init.o src/init.s -O0 -march=rv32i
riscv32-unknown-elf-gcc -c -o build/div.o src/div.s -O0 -march=rv32i
riscv32-unknown-elf-gcc -c -o build/main.o src/main.c -O0 -march=rv32i
riscv32-unknown-elf-gcc -o build/program.elf -O0 -march=rv32i -static -fvisibility=hidden -nostdlib -nostartfiles -T link.ld build/init.o build/div.o build/main.o
riscv32-unknown-elf-objcopy -O binary --only-section=.data* --only-section=.text* build/program.elf build/mem.bin
hexdump -v -e '"%08x\n"' build/mem.bin > build/mem.txt
python3 txt2hex.py build/mem.txt build/mem.hex 4
riscv32-unknown-elf-objdump -S -s build/program.elf > build/program.dump

In this case the Makefile will search the src directory for files ending in .c and .s and attempt to build them. After creating a directory to hold the output, you can see it assembles three source files main.c, init.s and div.s into .o files. These are linked (using the .ld script) into an ELF file, which is then converted into a flat binary image (.bin). Then we generate memory images (in two different hexadecimal formats, .txt for Modelsim, .hex for Quartus). Finally a disassembly dump is generated.
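The step from flat binary to one-word-per-line text can be pictured with a small C sketch. It assumes that the hexdump format string above groups the image into 4-byte units and prints each as a 32-bit value in the host's byte order, which on little-endian x86 matches RISC-V's little-endian layout. The function name word_from_bytes is invented for this illustration.

```c
#include <stdint.h>

/* Combine four consecutive bytes of the flat image into one 32-bit word,
 * least-significant byte first (little-endian). */
uint32_t word_from_bytes(const uint8_t b[4])
{
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}
```

For example, the bytes 13 05 c0 00 combine into 0x00c00513, which is the RV32I encoding of addi a0, zero, 12.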

Have a look at build/program.dump. See if you can follow how the information in link.ld, init.s, div.s and main.c was used to build the program and how the program is structured.

To save time while compiling, make will try hard not to recompile files you didn't change. In case you want to start afresh, the clean target inside the Makefile will delete all the compiler output files. Typing make clean will force the next run of make to rebuild everything.

Simulating Clarvi (Route A, optional for Route B)

Now we have a program, we need to simulate the Clarvi processor.

In a terminal:

To simulate the Clarvi, start Modelsim from the clarvi directory. When started, type:

do clarvi_test.do ../assembly/build/mem.txt TRACE

in the transcript window. This should compile the Clarvi sources, set up the waveforms and initialise the clock signal. Now you can use the run buttons to start simulation, or type run 1us in the transcript window to run for a short length of time.

In addition to waveforms, Clarvi also outputs an instruction trace in the transcript window. Besides tracing program flow, we can use this trace output to track both intermediate and return values of a function by looking at the register and memory writes. See here for an explanation of what is happening in the trace.

Simulating using Spike (Routes B and C)

In the files bundle we have provided a Makefile to simulate the code using Spike. Run:

make spike-log

from the 'assembly' folder. You can also use make spike-debug which provides an interactive debugger which can single-step through the assembly instructions (type 'h' for help, or see usage notes). Using make spike will run without the instruction trace being printed.

Debugging your code

In simulation, the basic Clarvi and Spike have no input/output devices like LEDs or displays. We have added a simple debugging printout to display a 32-bit value: we have implemented Control/Status Register (CSR) 0x800 so that any write to this register prints the register name and value to the simulation log.

	csrw 0x800, xN

Because this construction is a bit clumsy to use regularly, in the assembly source we have defined an assembler macro to give it a nicer name:

	.macro  DEBUG_PRINT     reg
	csrw 0x800, \reg
	.endm

And then you can print register t1 with:

	DEBUG_PRINT	t1

You can also see the value appear on the debug_scratch wires in your Spike log or Clarvi simulation. For Clarvi, if the instruction trace makes it hard to find your debug printouts, you can turn it off by removing TRACE from the clarvi_test.do command line.

Exercise 5a

Write a RISC-V assembly program to perform integer division.

Division is slow and expensive in hardware, and we use it relatively rarely. Instead we can use a subroutine to perform division in software, and call this instead of a hardware divide instruction.

Wikipedia gives the long division algorithm as follows (in Pascal-like pseudo-code):

Q := 0                 -- initialize quotient and remainder to zero
R := 0                     
for i = n-1...0 do     -- where n is number of bits in N
  R := R << 1          -- left-shift R by 1 bit
  R(0) := N(i)         -- set the least-significant bit of R equal to bit i of the numerator
  if R >= D then
    R := R - D
    Q(i) := 1
  end
end

In assembly, write a function div that, for two numbers in registers a0 and a1, calculates a0/a1 and returns the quotient in a0 and the remainder in a1. Use the provided div.s framework as a starting point. The main.s file will call your div() function. You might want to think about how your function should behave when given 0 as a denominator: the above pseudo-code would lead to a quotient with n 1's as the least significant bits, and a remainder equal to the numerator. You should return 0 when the denominator is zero.
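When checking an assembly implementation, it can help to have a reference model of the same algorithm to run on your own machine. The following C sketch follows the pseudo-code bit by bit. It assumes 32-bit unsigned operands and, as an assumption of this sketch, returns zero for both quotient and remainder when the denominator is zero. The names divmod_result and ref_div are invented here and are not part of the provided framework.

```c
#include <stdint.h>

/* Quotient and remainder pair, mirroring the a0/a1 return registers. */
typedef struct {
    uint32_t quotient;
    uint32_t remainder;
} divmod_result;

/* Unsigned long division, producing one quotient bit per iteration.
 * ASSUMPTION: a zero denominator yields quotient 0 and remainder 0. */
divmod_result ref_div(uint32_t n, uint32_t d)
{
    divmod_result res = {0u, 0u};
    if (d == 0u)
        return res;
    for (int i = 31; i >= 0; i--) {
        /* R := R << 1, then R(0) := N(i) */
        res.remainder = (res.remainder << 1) | ((n >> i) & 1u);
        if (res.remainder >= d) {
            res.remainder -= d;
            res.quotient |= 1u << i;    /* Q(i) := 1 */
        }
    }
    return res;
}
```

With this model, 12 / 4 gives quotient 3 and remainder 0, 93 / 7 gives 13 and 2, and 0x12345678 / 255 gives 1197725 and 21 (in decimal).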

Verify that the program you just wrote behaves as expected. Use this template to invoke your code from main.s (you can replace the code between # *** with this):

	.macro  DEBUG_PRINT     reg
	csrw 0x800, \reg
	.endm

        addi    a0, zero, 12    # a0 <- 12
        addi    a1, zero, 4     # a1 <- 4
        call    div
        DEBUG_PRINT a0          # display the quotient
        DEBUG_PRINT a1          # display the remainder

        addi    a0, zero, 93    # a0 <- 93
        addi    a1, zero, 7     # a1 <- 7
        call    div
        DEBUG_PRINT a0          # display the quotient
        DEBUG_PRINT a1          # display the remainder

        lui     a0, (0x12345000>>12)
        addi    a0, a0, 0x678   # a0 <- 0x12345678
	# we could also use the pseudo-instruction 'li a0, 0x12345678'
	# which will assemble to the above two instructions
        addi    a1, zero, 255   # a1 <- 255
        call    div
        DEBUG_PRINT a0          # display the quotient
        DEBUG_PRINT a1          # display the remainder

Follow the procedure to convert your source code into textual format and preload the CPU memory with it. Carefully analyse the program trace to compare the implementation's output with your expectation.

C on RISC-V

Clarvi and Spike implement enough of the RISC-V instruction set to be targetable by a C compiler. C enables us to build larger programs and makes it easier to port them to different CPUs.

Change to the ex5_riscv/c/src directory. This is similar to the assembly project, except with C functions. C programs always begin at the main() function, which we have provided in main.c. As before, init.s sets up the environment so that we can begin execution of the C program. The Makefile works in the same way.

C syntax is similar to Java. See An Introduction to C for differences.

The GCC C compiler has a number of optimisation levels, that set how hard it works to make the output assembler faster or smaller. Try adding -O0 to -O3 to CFLAGS in the Makefile to optimise for speed, or -Os to optimise for size, and look at the assembly code it generates. A full list of options can be found in the GCC manual. (-Os will try to use functions not present in our environment and so fail to link, but you can still look at the generated .s file).

Mixing C and assembler

You can write some parts of your program in C and some in assembler. To do this, we need to arrange for the C to put the parameter values into the correct registers for the assembler to pick up, and vice versa. This is defined by the processor's calling convention. On RISC-V the calling convention is to use registers 10-17 (named a0-a7) for function arguments and 10-11 (a0-a1) for return values. a0-a7 correspond to the first 8 arguments provided to a C function, and the return value from C should be provided in a0. Note that C only provides access to a single return value.

For example, an addition function might look like:

	# int add(int a, int b): add two numbers together
	# parameters supplied in a0 and a1
	# doesn't call any other functions, so no need to store return address register (ra)
	# doesn't corrupt any callee-save registers, so no need to use the stack

	.global add		# export the function symbol so the linker can find it
	add:
	        # add the two parameters, returning the result in the return register a0
	       	add		a0,a0,a1
	        
	        ret

To call from C code we need a function prototype to tell the C compiler the types and parameters of the function, without defining it. For example, if we have an assembler function taking two integers and returning another, in a file called asmfunctions.h (or some other name) enter:

		int add(int a, int b);

Then your C code can:

		#include "asmfunctions.h"

to make the declaration available to this C file. You can then call add() anywhere below the #include. The advantage of .h header files is that you can easily include them in multiple files, to enable calling your function from multiple places.

Note that, while C is a type-checked language, the assembler will not type check you. So there are no safeguards if your C declaration of a function does not match your usage of registers in assembler.

Exercise 5b

Copy your div.s into the ex5_riscv/c/src directory. Write a C function prototype for it in asmfunctions.h.

We have provided an implementation of mod() in init.s that calls div() and copies the remainder from a1 to a0 to obey the C calling convention. Write a prototype for mod() as well.

In C, write a program using div and mod to display a digital clock in minutes, seconds and centiseconds. You'll produce the output in Binary Coded Decimal, i.e. 17m 34.89s is 0x00173489.
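To illustrate the Binary Coded Decimal encoding, the sketch below packs each two-digit field into one byte, one decimal digit per 4-bit nibble. The helper names to_bcd2 and pack_clock are invented for this example. It is written with the C operators / and % so that it runs on an ordinary host; the base RV32I instruction set has no divide instruction, so on Clarvi those operations would be replaced by calls to div() and mod().

```c
#include <stdint.h>

/* Pack a value in the range 0..99 into two BCD digits (e.g. 34 -> 0x34). */
uint32_t to_bcd2(uint32_t value)
{
    return ((value / 10u) << 4) | (value % 10u);
}

/* Assemble minutes, seconds and centiseconds into the form 0x00MMSSCC. */
uint32_t pack_clock(uint32_t minutes, uint32_t seconds, uint32_t centis)
{
    return (to_bcd2(minutes) << 16) | (to_bcd2(seconds) << 8) | to_bcd2(centis);
}
```

Calling pack_clock(17, 34, 89) yields 0x00173489, matching the example above.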

To read the time, we'll use the CPU's internal cycle counter. Make a new file cycles.s:

.section .text
.global get_time

get_time:
        csrr a0, cycle
        ret

This defines a function with the prototype:

int get_time(void);

The get_time function will now return the number of CPU cycles since power on (as this is a 32-bit value, it will wrap around every 86 seconds at 50MHz). To make the simulation faster, for now you should assume there are 1,000 ticks of this value per second. Using the actual value (100,000,000 in simulation) would require you to simulate for a very long time to test even the seconds counter of your clock. This highlights how slow simulation can be compared to FPGA, which could run this in real time.
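The wraparound figure comes from dividing 2^32 by the clock frequency: 4,294,967,296 / 50,000,000 is roughly 85.9 seconds. The sketch below reproduces that arithmetic and shows one common way to measure an interval that stays correct across a single wrap, by treating the counter as unsigned. The names wrap_period_seconds and elapsed_cycles are invented for this illustration, and wrap_period_seconds uses 64-bit arithmetic, so it is intended as a host-side calculation rather than code for Clarvi.

```c
#include <stdint.h>

/* Whole seconds before a 32-bit cycle counter wraps at the given clock rate. */
uint32_t wrap_period_seconds(uint32_t clock_hz)
{
    return (uint32_t)((UINT64_C(1) << 32) / clock_hz);
}

/* Cycles elapsed between two readings. Unsigned subtraction is modulo 2^32,
 * so the result is still right if the counter wrapped once in between. */
uint32_t elapsed_cycles(uint32_t start, uint32_t now)
{
    return now - start;
}
```

Because get_time() is declared as returning int, its result would need to be cast to an unsigned 32-bit type before being used with elapsed_cycles.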

The Modelsim log window will only display a limited number of lines. It also outputs the log to a file called transcript in the directory you started Modelsim from. You can open this in an editor, or use the UNIX command
grep Debug transcript
which will print all the lines containing the string Debug. You can also run without TRACE to not display the instruction trace.

On Clarvi only, output the value by writing to address 0x04000080, the address of the hex LEDs in the simulator. This will cause a message in the log. You can use this function:

void hex_output(int value)
{
	volatile int *hex_leds = (volatile int *) 0x04000080;  // pointer to the memory-mapped register
	*hex_leds = value;   // volatile tells the compiler this write must really happen
}

(On Spike we don't have any memory mapped at 0x04000000, so calling hex_output() will generate an exception and terminate the program.)

Note also that we don't have a standard library, so functions like printf, malloc or memset do not exist (unless we write them). The Clarvi CPU also does not have multiply - if the compiler cannot optimise a constant multiply to adds and shifts, it will call a multiply function which we don't have (unless you write one).
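If you do need a multiply, one possible software routine is the classic shift-and-add method, sketched below for 32-bit unsigned values. The name soft_mul is invented for this example. GCC normally expects its own multiply helper to be called __mulsi3, so an unresolved reference to that symbol at link time is usually the sign that the compiler wanted a multiply routine.

```c
#include <stdint.h>

/* Shift-and-add multiplication; the result is the low 32 bits of a * b. */
uint32_t soft_mul(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0u) {
        if (b & 1u)
            product += a;   /* add the shifted multiplicand for each set bit of b */
        a <<= 1;
        b >>= 1;
    }
    return product;
}
```

The loop runs once per bit of b up to and including its highest set bit, so small multipliers finish quickly. For instance, soft_mul(6, 7) returns 42.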

To use the debug log from C you can use the following C function (an example of inline assembly):

void dprint(int value)
{
	asm ("csrw	0x800, %0" : : "r" (value) );
}

We have provided this in main.c.

Continuously read and display the time so that you can check your system correctly displays transitions from centiseconds to seconds and seconds to minutes.