A Coroutine-Thread Benchmark

This directory contains various implementations of a language independent benchmark program to test how well different languages can implement an application which is well suited to using BCPL style coroutines running in a multi-threading environment. The general application area is process control, where many activities run in parallel, servicing requests from clients and communicating with several asynchronous devices. The activities have to cooperate and so require various forms of synchronisation. The program is most naturally implemented using both threads and coroutines, but could be implemented using threads alone.

Six implementations of this benchmark are planned: one in BCPL running under the Cintpos portable operating system, two in C using Posix Pthreads with and without the use of a hand written coroutine library, one in C using threads and Microsoft fibers under Windows, and two in Java, again with and without a coroutine library package. The definitive version is in BCPL and is currently the only one that is complete.

Specification

The benchmark has the following parameters:

  k   The number of times each client performs its schedule of work.
  n   The number of read and write client threads, and read and write server threads.
  w   The number of work coroutines per server thread.
  m   The number of multiplexor threads.
  c   The number of channels controlled by each multiplexor thread.
  d   The real time delay in milliseconds when a work coroutine processes a delayed request.
  t   A flag that turns on the optional trace output.

These parameters can be set by optional command line arguments using the keywords -k, -n, -w, -m, -c, -d and -t. For example,

  tcobench -k 10 -n 10 -w 9 -m 15 -c 10 -d 500

will invoke the benchmark with these settings. These are, in fact, the default settings, so the same effect can be achieved by just

  tcobench

If tracing is specified, the default settings are:

  k=2 n=2 w=3 m=2 c=3 d=500

but these can, of course, be overridden.

The benchmark creates 4n+m+3 threads, consisting of a controller thread, n write clients (WC1 .. WCn), n read clients (RC1 .. RCn), n write servers (WS1 .. WSn), n read servers (RS1 .. RSn), m multiplexor threads (M1 .. Mm), a print thread and a bounce thread. Each server thread has a logger coroutine and w work coroutines, and each multiplexor thread looks after c channels.

Each read (or write) client has a schedule of work consisting of requests to send to the read (or write) servers. Each request identifies a server number, a multiplexor number and a channel number. All n*m*c combinations occur exactly once in the schedule but appear in a machine independent random order, with a different order for each client (a sketch of one way to build such a schedule follows this section). One request in the schedule is selected at random to cause a delay of d milliseconds in the client before sending the request to its server, and a second request in the schedule is selected at random as a delayed read (or write) request, causing it to be held for d milliseconds in its server before being processed. The delayed request from RCi (or WCi) always identifies RSi (or WSi), so that when all read (or write) clients have completed one run of their respective schedules, each read (or write) server will have processed just one delayed request. During these delays there is usually sufficient work for the other threads to do to keep the application fairly busy. Write clients include random data values in their requests, and read clients receive data values from read requests.
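As a concrete illustration of the schedule construction, here is a minimal C sketch, written in the spirit of the planned C/Pthreads versions (which are not yet available). It enumerates all n*m*c (server, multiplexor, channel) triples and shuffles them with a Fisher-Yates shuffle driven by a deterministic linear congruential generator, which yields the same "random" order on every machine. The names (Request, make_schedule, nextrand) and the generator constants are assumptions, not taken from the benchmark source.

  #include <stdint.h>
  #include <stdlib.h>

  /* One scheduled request: a (server, multiplexor, channel) triple. */
  typedef struct { int server, mpx, chan; } Request;

  /* Deterministic LCG so the shuffle is identical on every machine;
     these constants are illustrative, not necessarily those used by
     the BCPL original. */
  static uint32_t seed;
  static uint32_t nextrand(void)
  {
      seed = seed*2147001325u + 715136305u;
      return seed >> 1;
  }

  /* Build a schedule containing every triple exactly once, then apply
     a Fisher-Yates shuffle. Each client would pass a different seed,
     giving each client its own machine independent order. */
  Request *make_schedule(int n, int m, int c, uint32_t clientseed)
  {
      int total = n*m*c, i = 0;
      Request *sched = malloc(total*sizeof(Request));
      for (int s = 1; s <= n; s++)
          for (int x = 1; x <= m; x++)
              for (int ch = 1; ch <= c; ch++)
                  sched[i++] = (Request){ s, x, ch };
      seed = clientseed;
      for (i = total-1; i > 0; i--) {
          int j = nextrand() % (i+1);
          Request tmp = sched[i]; sched[i] = sched[j]; sched[j] = tmp;
      }
      return sched;
  }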
All data sent (or received) by clients is checksummed, and these checksums are sent to the controller at the end of the run. The controller checks that the grand total checksum of all data sent by write clients equals the corresponding sum from the read clients.

When a client completes its schedule of work, it sends sync requests to all of its servers. A server only returns sync request packets when such packets have been received from all of its clients (see the first sketch following this section). When all of a client's sync packets have been returned, it is ready to start on the next cycle of scheduled work, finally stopping when k cycles have been completed.

Read (and write) servers use work coroutines to process their requests. Each request is typically passed directly to an available work coroutine, but is queued for later processing if all work coroutines are busy. Read and write requests are processed by work coroutines as follows.

1) If a work coroutine receives a delayed request, it suspends itself for d milliseconds.

2) The coroutine extracts the parameters from the request and, if it was a write request, the packet is returned to the client immediately.

3) A suitable read (or write) request is sent to the specified multiplexor, giving it the parameters of the original request. The coroutine awaits the packet's return and, if it was a read request, extracts the returned data value and sends it to the originating read client in its original request packet.

4) Work coroutines within each read (or write) server maintain counts of the number of requests that have been processed. These counts are held in a structure accessible to all the other work coroutines of the same server, in a form that allows the smallest (countmin) and largest (countmax) counts to be determined easily. There is a global constraint that these two values may not differ by more than 16. This will sometimes cause a work coroutine to suspend itself when trying to increment its count. When a work coroutine finds that it is increasing the value of countmin, it releases any coroutine waiting to increment its count. This is implemented using a condition variable and the functions wait and notify (not the Java functions of the same names); a sketch of the scheme appears after this section.

5) Each server thread has a logger coroutine to simulate the actions of a log service. When a work coroutine completes the processing of a request it has a conversation with the logger, but this is only allowed once it has acquired the logger lock. The conversation is as follows. The work coroutine sends two values, x+y and x-y, separately via an Occam style channel (loggerin), then reads the reply via a second Occam style channel (loggerout). Both x and y are non-negative random values. The reply should be a sequence of ones and zeros encoding x, followed by minus one. This is checked by the work coroutine before it releases the logger lock; notice that this tests that the locking mechanism is working correctly. Finally, the work coroutine sets a flag to indicate that it is ready to process another request. The arithmetic involved is shown in the last sketch following this section.

To simulate the behaviour of a logging service having to wait for i/o, the logger coroutine sends a request to the printer thread once in every 25 values it receives from the work coroutines. The printer thread runs at very low priority, so this request is likely to delay progress in the logger without any chance of causing deadlock.

Multiplexor threads receive read and write requests, each specifying a channel number. Each channel has a FIFO buffer to hold data written but not yet read.
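The sync exchange behaves like a reusable barrier: each server holds sync packets until one has arrived from every one of its clients, then releases them all together. A minimal C/Pthreads sketch of that behaviour follows. In the benchmark proper it is the packets themselves that are held and returned, so this thread-blocking version is only an analogue, and all names are illustrative.

  #include <pthread.h>

  /* Per-server sync state.  The generation counter makes the barrier
     safely reusable across the k cycles of scheduled work. */
  typedef struct {
      pthread_mutex_t mtx;
      pthread_cond_t  all_arrived;
      int arrived;           /* sync packets received so far     */
      int nclients;          /* packets expected, one per client */
      unsigned generation;   /* current cycle number             */
  } SyncPoint;

  void sync_request(SyncPoint *sp)
  {
      pthread_mutex_lock(&sp->mtx);
      unsigned gen = sp->generation;
      if (++sp->arrived == sp->nclients) {
          /* Last packet has arrived: release the whole cycle. */
          sp->arrived = 0;
          sp->generation++;
          pthread_cond_broadcast(&sp->all_arrived);
      } else {
          while (gen == sp->generation)   /* guard against spurious wakeups */
              pthread_cond_wait(&sp->all_arrived, &sp->mtx);
      }
      pthread_mutex_unlock(&sp->mtx);
  }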
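The count constraint of step 4 can be sketched as follows, again in C with Pthreads. In the BCPL original the work coroutines of one server share a single thread, so wait and notify there are coroutine operations; a condition variable is the nearest Pthreads analogue. WORKMAX, MAXGAP, Counts and increment are illustrative names only.

  #include <pthread.h>

  #define WORKMAX 9    /* work coroutines per server (the default w) */
  #define MAXGAP 16    /* countmax - countmin may never exceed this  */

  typedef struct {
      pthread_mutex_t mtx;
      pthread_cond_t  cv;
      int count[WORKMAX];   /* one request count per work coroutine */
      int nwork;
  } Counts;

  static int countmin(Counts *cs)
  {
      int min = cs->count[0];
      for (int i = 1; i < cs->nwork; i++)
          if (cs->count[i] < min) min = cs->count[i];
      return min;
  }

  /* Called by work coroutine w after processing a request. */
  void increment(Counts *cs, int w)
  {
      pthread_mutex_lock(&cs->mtx);
      /* Suspend while this increment would push the spread past MAXGAP. */
      while (cs->count[w] + 1 - countmin(cs) > MAXGAP)
          pthread_cond_wait(&cs->cv, &cs->mtx);      /* "wait"   */
      int oldmin = countmin(cs);
      cs->count[w]++;
      if (countmin(cs) > oldmin)                     /* we raised countmin */
          pthread_cond_broadcast(&cs->cv);           /* "notify" */
      pthread_mutex_unlock(&cs->mtx);
  }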
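The logger conversation of step 5 rests on the identity x = ((x+y) + (x-y))/2, so the logger can recover x without ever being told y. The following hedged sketch shows the arithmetic; chan_read and chan_write stand for the Occam style channel operations and are assumed, not real library calls, and the most-significant-bit-first reply encoding is a guess, since the original's exact encoding is not specified here.

  /* Occam style channel operations, assumed to exist elsewhere. */
  typedef struct Channel Channel;
  int  chan_read(Channel *c);
  void chan_write(Channel *c, int v);

  /* The logger receives a = x+y and b = x-y on loggerin, recovers
     x = (a+b)/2 (the y terms cancel), and replies with the binary
     digits of x followed by -1. */
  void logger_serve(Channel *loggerin, Channel *loggerout)
  {
      int a = chan_read(loggerin);    /* x + y */
      int b = chan_read(loggerin);    /* x - y */
      int x = (a + b) / 2;
      int bit = 30;
      while (bit > 0 && (x >> bit) == 0) bit--;   /* skip leading zeros */
      for (; bit >= 0; bit--)
          chan_write(loggerout, (x >> bit) & 1);
      chan_write(loggerout, -1);      /* terminator */
  }

  /* The work coroutine, still holding the logger lock, checks that
     the reply decodes back to the x it chose. */
  int reply_ok(Channel *loggerout, int x)
  {
      int v = 0, d;
      while ((d = chan_read(loggerout)) >= 0)
          v = 2*v + d;
      return v == x;
  }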
Data originating from write clients is inserted into these buffers on a first come first served basis, so the data in any buffer may originate from any write client via any write server. Similarly, data retrieved from a channel may satisfy read requests from any read client via any read server. Thus data written by a particular write client will ultimately be read by an essentially random read client. A write request becomes blocked if its channel buffer is full and, similarly, a read request becomes blocked if its channel buffer is empty. Channel buffers are just large enough to hold all the data from one cycle of scheduled work from all the write clients; this removes one possible cause of deadlock in the program. A sketch of such a buffer appears after this section.

The controller thread supervises the execution of the benchmark. It is entered when all threads have been created and causes them to start. While the benchmark is running, it repeatedly sends a packet to the bounce thread, which returns it as soon as it is able to; but, since the bounce thread runs at the lowest priority, this will only happen (under Cintpos) when all other threads in the benchmark are blocked. Every tenth of a second the controller inspects the count of how many bounces were achieved in the previous period and uses these values to estimate the approximate CPU utilisation over the period (see the second sketch below). At the end of the run a table of utilisation statistics is output.

The bounce thread calls an echo coroutine ten times for each bounce packet received. This maintains a reasonably high ratio of coroutine changes to thread changes, even on systems where the bounce count is high. These coroutine calls occur when the system would otherwise be idle and so do not contribute to the execution time of the benchmark (which should be around 37 seconds).
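Each channel is, in effect, a classic bounded FIFO. A minimal C/Pthreads sketch follows; in the benchmark proper a blocked request is a queued packet rather than a blocked thread, so this is an analogue of the behaviour, not the actual mechanism. (In the sample run below each channel appears to receive n*n = 100 data values per cycle, and the reported buffer size is 101, which seems consistent with the sizing rule, though this reading is an inference.)

  #include <pthread.h>

  /* One multiplexor channel: a bounded FIFO of data words.
     Writers block when the buffer is full, readers when it is empty. */
  typedef struct {
      pthread_mutex_t mtx;
      pthread_cond_t  notfull, notempty;
      int *buf, size, head, tail, used;
  } ChanBuf;

  void chan_put(ChanBuf *ch, int val)
  {
      pthread_mutex_lock(&ch->mtx);
      while (ch->used == ch->size)              /* write blocks when full */
          pthread_cond_wait(&ch->notfull, &ch->mtx);
      ch->buf[ch->tail] = val;
      ch->tail = (ch->tail + 1) % ch->size;
      ch->used++;
      pthread_cond_signal(&ch->notempty);
      pthread_mutex_unlock(&ch->mtx);
  }

  int chan_get(ChanBuf *ch)
  {
      pthread_mutex_lock(&ch->mtx);
      while (ch->used == 0)                     /* read blocks when empty */
          pthread_cond_wait(&ch->notempty, &ch->mtx);
      int val = ch->buf[ch->head];
      ch->head = (ch->head + 1) % ch->size;
      ch->used--;
      pthread_cond_signal(&ch->notfull);
      pthread_mutex_unlock(&ch->mtx);
      return val;
  }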
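One plausible reconstruction of the bounce-based utilisation estimate, inferred from the description rather than taken from the benchmark source: if an otherwise idle system achieves bouncesmax bounces per 100 msec period (measured during calibration), then a period with b bounces was busy for roughly the fraction 1 - b/bouncesmax.

  /* Rough utilisation estimate for one 100 msec period; an
     assumption based on the description, not the actual code. */
  int utilisation_percent(int bounces, int bouncesmax)
  {
      if (bounces > bouncesmax) bounces = bouncesmax;
      return 100 - (100 * bounces) / bouncesmax;
  }

  /* Accumulate each period's estimate into the ten-bucket histogram
     (0-10%, 10-20%, ..., >90%) printed at the end of the run. */
  void record(int hist[10], int percent)
  {
      int b = percent / 10;
      if (b > 9) b = 9;
      hist[b]++;
  }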
Typical output from the benchmark, when run under the byte stream interpretive Cintpos Portable Operating System on a 1.6GHz Centrino Duo machine (a Toshiba Tecra M5L), is as follows:

  c178$
  c178$ make bench
  cintpos -c c b tcobench

  Cintpos System (27 July 2006)
  1> c b tcobench
  bcpl tcobench.b to tcobench hdrs POSHDRS

  BCPL (27 Jul 2006)
  Code size = 16776 bytes
  1>
  time cintpos -c tcobench

  Cintpos System (27 July 2006)
  1> tcobench

  Thread and Coroutine Benchmark

  loopmax        =  10  (k)
  climax, sermax =  10  (n)
  workmax        =   9  (w)
  mpxmax         =  15  (m)
  chnmax         =  10  (c)
  delaymsecs     = 500  (d)
  Requests per schedule = 1500
  Channel buffer size = 101
  controller: Calibrating the bounce counter
  controller: bouncesmax = 5470
  Controller: All tasks initialised
  Start time: 23-Oct-06 12:58:44
  Controller: Releasing all read clients
  Controller: Releasing all write clients
  controller: final clock packet received
  controller: final bounce packet received
  Finish time: 23-Oct-06 12:59:17
  Controller: rdsum=wrsum=754522 -- Good
  Controller: rddatacount=wrdatacount=150000 -- Good
  Controller: making final call of cowait
  number of calls of taskwait:      5285326
  number of calls of qpkt:          5285326
  number of coroutine changes:     42193504
  number of calls of wait(..):       115732
  number of calls of notify(..):      88975
  number of calls of increment(..):  300000
  increment had to wait:             115732
  number of calls of lock(..):       300000
  lock had to wait:                  208783
  number of 500 msec delays:            400
  print task counter:                 42840
  bounce task counter:              1997295

  Approximate CPU utilisation over 334 periods of 100 msecs
   0-10%  10-20%  20-30%  30-40%  40-50%  50-60%  60-70%  70-80%  80-90%   >90%
     143      15      27      18      23      11      12      10      16     59

  Test completed
  1>
  34.61user 0.04system 0:35.17elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+5114minor)pagefaults 0swaps
  c178$

Other Implementations

This benchmark will be translated into other languages. Three in C and two in Java are planned, but most of these are not yet available. They will eventually be in the directories: c-colib, c-thread, c-fiber, java-colib and java-thread.

A TCP/IP Variant

In due course a variant of this benchmark will appear that runs the read and write sides on different machines, using TCP/IP connections to transmit the data through the channels. Typically between one and two hundred TCP/IP connections will be used. The two halves of this version of the benchmark can be implemented in different languages.

Martin Richards
23 February 2007