

             Two Benchmark Tests for BCPL Style Coroutines


This directory contains various implementions of two benchmark
programs to test the efficiency of BCPL style coroutines when
implemented in a variety of different languages.

Download cobench.tgz (or cobench.zip) and un-tar (or un-zip) it to
form the directory Cobench. The enter one of the diectories: bcpl,
natbcpl, mcpl, java, c and New Jersey SML and type the command:
make. This should compile and run the benchmark, provided you are on a
Pentium machine running Linux. You may find you have to re-install the
BCPL Cintcode system and you may have to edit one or more of the
Makefiles.

There are directories for each language and implementation style. They
are currently as follows:

bcpl      The definitive BCPL implementation run using Cintcode
natbcpl   BCPL compiled (very naively) in machine code
c         Implementation in C using a coroutine library involving
          setjmp, longjump and two instructions of inline assembler.
nat-c     Implemetation in C using a hand written coroutine library
          in assembler (for the Pentium).
java      Implementation in Java using a coroutine library based on
          threads using wait, notify and synchronisation.
mcpl      An interpretive implementation in MCPL
smlnj     An implementation in Standard ML using New Jersey Continuations.
haskell   An untested approximation to the benchmark in Haskell.

All these implementations can be run under Linux (provided the
relevant compilers have been installed). They are typically run use
make as follows:

make             Same as make help
make help        Display the help infomation
make bench       Run cobench
make bencht      Run cobench with tracing
make sim         Run cosim
make simt        Run cosim with tracing
make test        Run cotest
make clean       Delete any files that can be reconstructed

The program cotest tests where the BCPL style coroutines have been
implemented correctly.

************************ cobench ************************

This benchmark program creates a source coroutine, 500 copy coroutines
and a sink coroutine. It then causes the source coroutine to transmit
10,000 numbers through all the copy coroutines to be thrown away by
the sink coroutine.  The numbers are sent one at a time, and all
transmission from one coroutine to the next uses a coroutine
implementation of Occam channels.

Reading and writing are done (in the BCPL version) using coread and
cowrite, respectively. These are defined as follows:

AND coread(ptr) = VALOF
{ LET cptr = !ptr
  TEST cptr
  THEN { !ptr := 0         // Clear the channel word
         RESULTIS resumeco(cptr, currco)
       }
  ELSE { !ptr := currco    // Set channel word to this coroutine
         RESULTIS cowait() // Wait for value from cowrite
       }
}

AND cowrite(ptr, val) BE
{ LET cptr = !ptr
  TEST cptr
  THEN { !ptr := 0
         callco(cptr, val) // Send val to coread
       }
  ELSE { !ptr := currco
         callco(cowait(), val)
       }
}

In both functions, ptr points to a channel word which hold zero
(=FALSE) if no coroutine is waiting on the channel. The first
coroutine to call coread (or cowrite) puts its own identity into the
channel word and then waits. When the second coroutine reaches the
corresponding cowrite (or coread) call, it can see the identity of
the other coroutine that is waiting.

If the writing coroutine arrives at the communication point second, it
picks the identity of the reading coroutine out of the channel word,
clears the channel word, and then transmits the value to the waiting
read coroutine using a simple call of callco.

If the reading coroutine arrives at the communication point second, it
gives its own identity to the waiting write coroutine then waits for
the value to be sent. A little thought will show you why coread uses
resumeco to transmit its own identity to the write coroutine.

If you are not familiar with BCPL coroutines, you can read about them
in the BCPL manual that is available via my home page. The BCPL Cintcode
version of this cobench executes 257,569,294 Cintcode instructions
changing coroutines 10,024,518 times. (These figures were obtained
using the stats command).

This directory currently holds four implementations of the benchmark:
bcpl, natbcpl, java and c. They take the following times when run on a
1 GHz Pentium III, unless otherwise stated.


BCPL Cintcode byte stream interpreter:                      4.78 secs

BCPL Cintcode byte stream interpreter
  on a 2.4GHz AMD64, under Cygwin:                          1.79 secs

BCPL Native 386 code (unoptimised):                         0.81 secs

BCPL under Cintpos using an interpreter implemented
  in C and all debugging aids on turned on:                10    secs

MCPL Mintcode byte stream interpreter:                     10.51 secs

Java j2sdk1.4.0 JIT (using threads):                      183.60 secs

C using setjmp and longjmp + inline assembler:              1.36 secs

C using colib written in assembler                          0.61 secs

New Jersey SML using continuations:                         4.37 secs

BCPL Cintcode on a 75Mhz SH3 (HP 620LX Handheld)          115.52 secs

The first of the C versions using setjmp, longjmp and inline assembler
by Jeremy Singer.


************************ cosim ************************

This benchmark is a discrete event simulator that simulates messages
passing through a network of nodes. The nodes are numbered 1 to 500
and are each initially given a message to process. The time to process
a message is randomly distributed between 0 and 999 units of simulated
time. Once a message has been processed it is sent to another node
selected at random. The transmission delay is assumed to be ABS(i-j)
where i and j are the source and destination node numbers. Nodes can
only process one message at a time and so messages may have to be
queued. Initially each node is assumed to have just received a
message. The simulation starts a time 0 and finishes at time
1,000,000. The result, 501907, is the total count of messages that
have been completely processed by the finishing time.  This benchmark
executes 435,363,350 Cintcode instructions and performs 2,510,520
coroutine changes. The timing results are as follows:

BCPL Cintcode byte stream interpreter:                     9.02 secs

BCPL Cintcode byte stream interpreter
  on a 2.4GHz AMD64 under Cygwin:                          3.29 secs

BCPL Native 386 code (unoptimised):                        1.11 secs

BCPL under Cintpos using an interpreter implemented
  in C and all debugging aids on turned on:               17    secs

MCPL Mintcode byte stream interpreter:                    15.39 secs

Java j2sdk1.4.0 JIT (using threads, wait and notify):     57.39 secs

C using setjmp and longjmp + inline assembler:             0.80 secs

C using colib written in assembler                         0.58 secs

New Jersey SML using continuations:                        4.08 secs

BCPL Cintcode on a 75Mhz SH3 (HP 620LX Handheld)         121.22 secs

Martin Richards
23 February 2007

