A Coroutine-Thread Benchmark

This directory contains various implementations of a language independent benchmark program to test how well different languages can implement an application which is well suited to using BCPL style coroutines running in a multi-threading environment. The general application area is process control, where many activities run in parallel, servicing requests from clients and communicating with several asynchronous devices. The activities have to cooperate and so require various forms of synchronisation. The program is most naturally implemented using both threads and coroutines, but could be implemented using threads alone.

Six implementations of this benchmark are planned: one in BCPL running under the Cintpos portable operating system, two in C using Posix Pthreads with and without the use of a hand written coroutine library, one in C using threads and Microsoft fibers under Windows, and two in Java, again with and without a coroutine library package. The definitive version is in BCPL and is currently the only one that is complete.

Specification

The benchmark has the following parameters:

  k   The number of times each client performs its schedule of work.
  n   The number of read and write client threads, and read and write server threads.
  w   The number of work coroutines per server thread.
  m   The number of multiplexor threads.
  c   The number of channels controlled by each multiplexor thread.
  d   The real time delay in milliseconds when a work coroutine processes a delayed request.
  t   A flag that turns on the optional trace output.

These parameters can be set by optional command line arguments using the keywords -k, -n, -w, -m, -c, -d and -t. For example,

  tcobench -k 10 -n 10 -w 9 -m 15 -c 10 -d 500

will invoke the benchmark with these settings. These are, in fact, the default settings, so the same effect can be achieved by just

  tcobench

If tracing is specified, the default settings are:

  k=2 n=2 w=3 m=2 c=3 d=500

but these can, of course, be overridden.

The benchmark creates 4n+m+3 threads, consisting of a controller thread, n write clients (WC1 .. WCn), n read clients (RC1 .. RCn), n write servers (WS1 .. WSn), n read servers (RS1 .. RSn), m multiplexor threads (M1 .. Mm), a print thread and a bounce thread. Each server thread has a logger coroutine and w work coroutines, and each multiplexor thread looks after c channels.

Each read (or write) client has a schedule of work consisting of requests to send to the read (or write) servers. Each request identifies a server number, a multiplexor number and a channel number. All n*m*c combinations occur exactly once in the schedule but appear in a machine independent random order, with a different order for each client (a sketch of one way to build such a schedule follows this section). One request in the schedule is selected at random to cause a delay of d milliseconds in the client before sending the request to its server, and a second request in the schedule is selected at random as a delayed read (or write) request, causing it to be held for d milliseconds in its server before being processed. The delayed request from RCi (or WCi) always identifies RSi (or WSi), so that when all read (or write) clients have completed one run of their respective schedules, each read (or write) server will have processed just one delayed request. During these delays there is usually sufficient work for the other threads to do to keep the application fairly busy. Write clients include random data values in their requests, and read clients receive data values from read requests.
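As a concrete illustration of the schedule construction, here is a minimal C sketch, written in the spirit of the planned C/Pthreads versions (which are not yet available). It enumerates all n*m*c (server, multiplexor, channel) triples and shuffles them with a Fisher-Yates shuffle driven by a deterministic linear congruential generator, which yields the same "random" order on every machine. The names (Request, make_schedule, nextrand) and the generator constants are assumptions, not taken from the benchmark source.

  #include <stdint.h>
  #include <stdlib.h>

  /* One scheduled request: a (server, multiplexor, channel) triple. */
  typedef struct { int server, mpx, chan; } Request;

  /* Deterministic LCG so the shuffle is identical on every machine;
     these constants are illustrative, not necessarily those used by
     the BCPL original. */
  static uint32_t seed;
  static uint32_t nextrand(void)
  {
      seed = seed*2147001325u + 715136305u;
      return seed >> 1;
  }

  /* Build a schedule containing every triple exactly once, then apply
     a Fisher-Yates shuffle. Each client would pass a different seed,
     giving each client its own machine independent order. */
  Request *make_schedule(int n, int m, int c, uint32_t clientseed)
  {
      int total = n*m*c, i = 0;
      Request *sched = malloc(total*sizeof(Request));
      for (int s = 1; s <= n; s++)
          for (int x = 1; x <= m; x++)
              for (int ch = 1; ch <= c; ch++)
                  sched[i++] = (Request){ s, x, ch };
      seed = clientseed;
      for (i = total-1; i > 0; i--) {
          int j = nextrand() % (i+1);
          Request tmp = sched[i]; sched[i] = sched[j]; sched[j] = tmp;
      }
      return sched;
  }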
All data sent (or received) by clients is checksummed, and these checksums are sent to the controller at the end of the run. The controller checks that the grand total checksum of all data sent by write clients equals the corresponding sum from the read clients.

When a client completes its schedule of work, it sends sync requests to all of its servers. A server only returns sync request packets when such packets have been received from all of its clients (see the first sketch following this section). When all of a client's sync packets have been returned, it is ready to start on the next cycle of scheduled work, finally stopping when k cycles have been completed.

Read (and write) servers use work coroutines to process their requests. Each request is typically passed directly to an available work coroutine, but is queued for later processing if all work coroutines are busy. Read and write requests are processed by work coroutines as follows.

1) If a work coroutine receives a delayed request, it suspends itself for d milliseconds.

2) The coroutine extracts the parameters from the request and, if it was a write request, the packet is returned to the client immediately.

3) A suitable read (or write) request is sent to the specified multiplexor, giving it the parameters of the original request. The coroutine awaits the packet's return and, if it was a read request, extracts the returned data value and sends it to the originating read client in its original request packet.

4) Work coroutines within each read (or write) server maintain counts of the number of requests that have been processed. These counts are held in a structure accessible to all the other work coroutines of the same server, in a form that allows the smallest (countmin) and largest (countmax) counts to be determined easily. There is a global constraint that these two values may not differ by more than 16. This will sometimes cause a work coroutine to suspend itself when trying to increment its count. When a work coroutine finds that it is increasing the value of countmin, it releases any coroutine waiting to increment its count. This is implemented using a condition variable and the functions wait and notify (not the Java functions of the same names); a sketch of the scheme appears after this section.

5) Each server thread has a logger coroutine to simulate the actions of a log service. When a work coroutine completes the processing of a request it has a conversation with the logger, but this is only allowed once it has acquired the logger lock. The conversation is as follows. The work coroutine sends two values, x+y and x-y, separately via an Occam style channel (loggerin), then reads the reply via a second Occam style channel (loggerout). Both x and y are non-negative random values. The reply should be a sequence of ones and zeros encoding x, followed by minus one. This is checked by the work coroutine before it releases the logger lock; notice that this tests that the locking mechanism is working correctly. Finally, the work coroutine sets a flag to indicate that it is ready to process another request. The arithmetic involved is shown in the last sketch following this section.

To simulate the behaviour of a logging service having to wait for i/o, the logger coroutine sends a request to the printer thread once in every 25 values it receives from the work coroutines. The printer thread runs at very low priority, so this request is likely to delay progress in the logger without any chance of causing deadlock.

Multiplexor threads receive read and write requests, each specifying a channel number. Each channel has a FIFO buffer to hold data written but not yet read.
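The sync exchange behaves like a reusable barrier: each server holds sync packets until one has arrived from every one of its clients, then releases them all together. A minimal C/Pthreads sketch of that behaviour follows. In the benchmark proper it is the packets themselves that are held and returned, so this thread-blocking version is only an analogue, and all names are illustrative.

  #include <pthread.h>

  /* Per-server sync state.  The generation counter makes the barrier
     safely reusable across the k cycles of scheduled work. */
  typedef struct {
      pthread_mutex_t mtx;
      pthread_cond_t  all_arrived;
      int arrived;           /* sync packets received so far     */
      int nclients;          /* packets expected, one per client */
      unsigned generation;   /* current cycle number             */
  } SyncPoint;

  void sync_request(SyncPoint *sp)
  {
      pthread_mutex_lock(&sp->mtx);
      unsigned gen = sp->generation;
      if (++sp->arrived == sp->nclients) {
          /* Last packet has arrived: release the whole cycle. */
          sp->arrived = 0;
          sp->generation++;
          pthread_cond_broadcast(&sp->all_arrived);
      } else {
          while (gen == sp->generation)   /* guard against spurious wakeups */
              pthread_cond_wait(&sp->all_arrived, &sp->mtx);
      }
      pthread_mutex_unlock(&sp->mtx);
  }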
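The count constraint of step 4 can be sketched as follows, again in C with Pthreads. In the BCPL original the work coroutines of one server share a single thread, so wait and notify there are coroutine operations; a condition variable is the nearest Pthreads analogue. WORKMAX, MAXGAP, Counts and increment are illustrative names only.

  #include <pthread.h>

  #define WORKMAX 9    /* work coroutines per server (the default w) */
  #define MAXGAP 16    /* countmax - countmin may never exceed this  */

  typedef struct {
      pthread_mutex_t mtx;
      pthread_cond_t  cv;
      int count[WORKMAX];   /* one request count per work coroutine */
      int nwork;
  } Counts;

  static int countmin(Counts *cs)
  {
      int min = cs->count[0];
      for (int i = 1; i < cs->nwork; i++)
          if (cs->count[i] < min) min = cs->count[i];
      return min;
  }

  /* Called by work coroutine w after processing a request. */
  void increment(Counts *cs, int w)
  {
      pthread_mutex_lock(&cs->mtx);
      /* Suspend while this increment would push the spread past MAXGAP. */
      while (cs->count[w] + 1 - countmin(cs) > MAXGAP)
          pthread_cond_wait(&cs->cv, &cs->mtx);      /* "wait"   */
      int oldmin = countmin(cs);
      cs->count[w]++;
      if (countmin(cs) > oldmin)                     /* we raised countmin */
          pthread_cond_broadcast(&cs->cv);           /* "notify" */
      pthread_mutex_unlock(&cs->mtx);
  }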
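The logger conversation of step 5 rests on the identity x = ((x+y) + (x-y))/2, so the logger can recover x without ever being told y. The following hedged sketch shows the arithmetic; chan_read and chan_write stand for the Occam style channel operations and are assumed, not real library calls, and the most-significant-bit-first reply encoding is a guess, since the original's exact encoding is not specified here.

  /* Occam style channel operations, assumed to exist elsewhere. */
  typedef struct Channel Channel;
  int  chan_read(Channel *c);
  void chan_write(Channel *c, int v);

  /* The logger receives a = x+y and b = x-y on loggerin, recovers
     x = (a+b)/2 (the y terms cancel), and replies with the binary
     digits of x followed by -1. */
  void logger_serve(Channel *loggerin, Channel *loggerout)
  {
      int a = chan_read(loggerin);    /* x + y */
      int b = chan_read(loggerin);    /* x - y */
      int x = (a + b) / 2;
      int bit = 30;
      while (bit > 0 && (x >> bit) == 0) bit--;   /* skip leading zeros */
      for (; bit >= 0; bit--)
          chan_write(loggerout, (x >> bit) & 1);
      chan_write(loggerout, -1);      /* terminator */
  }

  /* The work coroutine, still holding the logger lock, checks that
     the reply decodes back to the x it chose. */
  int reply_ok(Channel *loggerout, int x)
  {
      int v = 0, d;
      while ((d = chan_read(loggerout)) >= 0)
          v = 2*v + d;
      return v == x;
  }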
Data originating from write clients is inserted into these buffers on a first come first served basis, so the data in any buffer may originate from any write client via any write server. Similarly, data retrieved from a channel may satisfy read requests from any read client via any read server. Thus data written by a particular write client will ultimately be read by an essentially random read client. A write request becomes blocked if its channel buffer is full and, similarly, a read request becomes blocked if its channel buffer is empty. Channel buffers are just large enough to hold all the data from one cycle of scheduled work from all the write clients; this removes one possible cause of deadlock in the program. A sketch of such a buffer appears after this section.

The controller thread supervises the execution of the benchmark. It is entered when all threads have been created and causes them to start. While the benchmark is running, it repeatedly sends a packet to the bounce thread, which returns it as soon as it is able to; but, since the bounce thread runs at the lowest priority, this will only happen (under Cintpos) when all other threads in the benchmark are blocked. Every tenth of a second the controller inspects the count of how many bounces were achieved in the previous period and uses these values to estimate the approximate CPU utilisation over the period (see the second sketch below). At the end of the run a table of utilisation statistics is output.

The bounce thread calls an echo coroutine ten times for each bounce packet received. This maintains a reasonably high ratio of coroutine changes to thread changes, even on systems where the bounce count is high. These coroutine calls occur when the system would otherwise be idle and so do not contribute to the execution time of the benchmark (which should be around 37 seconds).
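Each channel is, in effect, a classic bounded FIFO. A minimal C/Pthreads sketch follows; in the benchmark proper a blocked request is a queued packet rather than a blocked thread, so this is an analogue of the behaviour, not the actual mechanism. (In the sample run below each channel appears to receive n*n = 100 data values per cycle, and the reported buffer size is 101, which seems consistent with the sizing rule, though this reading is an inference.)

  #include <pthread.h>

  /* One multiplexor channel: a bounded FIFO of data words.
     Writers block when the buffer is full, readers when it is empty. */
  typedef struct {
      pthread_mutex_t mtx;
      pthread_cond_t  notfull, notempty;
      int *buf, size, head, tail, used;
  } ChanBuf;

  void chan_put(ChanBuf *ch, int val)
  {
      pthread_mutex_lock(&ch->mtx);
      while (ch->used == ch->size)              /* write blocks when full */
          pthread_cond_wait(&ch->notfull, &ch->mtx);
      ch->buf[ch->tail] = val;
      ch->tail = (ch->tail + 1) % ch->size;
      ch->used++;
      pthread_cond_signal(&ch->notempty);
      pthread_mutex_unlock(&ch->mtx);
  }

  int chan_get(ChanBuf *ch)
  {
      pthread_mutex_lock(&ch->mtx);
      while (ch->used == 0)                     /* read blocks when empty */
          pthread_cond_wait(&ch->notempty, &ch->mtx);
      int val = ch->buf[ch->head];
      ch->head = (ch->head + 1) % ch->size;
      ch->used--;
      pthread_cond_signal(&ch->notfull);
      pthread_mutex_unlock(&ch->mtx);
      return val;
  }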
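One plausible reconstruction of the bounce-based utilisation estimate, inferred from the description rather than taken from the benchmark source: if an otherwise idle system achieves bouncesmax bounces per 100 msec period (measured during calibration), then a period with b bounces was busy for roughly the fraction 1 - b/bouncesmax.

  /* Rough utilisation estimate for one 100 msec period; an
     assumption based on the description, not the actual code. */
  int utilisation_percent(int bounces, int bouncesmax)
  {
      if (bounces > bouncesmax) bounces = bouncesmax;
      return 100 - (100 * bounces) / bouncesmax;
  }

  /* Accumulate each period's estimate into the ten-bucket histogram
     (0-10%, 10-20%, ..., >90%) printed at the end of the run. */
  void record(int hist[10], int percent)
  {
      int b = percent / 10;
      if (b > 9) b = 9;
      hist[b]++;
  }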
Typical output from the benchmark, when run under the byte stream interpretive Cintpos Portable Operating System on a 1.6GHz Centrino Duo machine (a Toshiba Tecra M5L), is as follows:

  c178$
  c178$ make bench
  cintpos -c c b tcobench

  Cintpos System (27 July 2006)
  1> c b tcobench
  bcpl tcobench.b to tcobench hdrs POSHDRS

  BCPL (27 Jul 2006)
  Code size = 16776 bytes
  1>
  time cintpos -c tcobench

  Cintpos System (27 July 2006)
  1> tcobench

  Thread and Coroutine Benchmark

  loopmax        =  10  (k)
  climax, sermax =  10  (n)
  workmax        =   9  (w)
  mpxmax         =  15  (m)
  chnmax         =  10  (c)
  delaymsecs     = 500  (d)
  Requests per schedule = 1500
  Channel buffer size = 101
  controller: Calibrating the bounce counter
  controller: bouncesmax = 5470
  Controller: All tasks initialised
  Start time: 23-Oct-06 12:58:44
  Controller: Releasing all read clients
  Controller: Releasing all write clients
  controller: final clock packet received
  controller: final bounce packet received
  Finish time: 23-Oct-06 12:59:17
  Controller: rdsum=wrsum=754522 -- Good
  Controller: rddatacount=wrdatacount=150000 -- Good
  Controller: making final call of cowait
  number of calls of taskwait:      5285326
  number of calls of qpkt:          5285326
  number of coroutine changes:     42193504
  number of calls of wait(..):       115732
  number of calls of notify(..):      88975
  number of calls of increment(..):  300000
  increment had to wait:             115732
  number of calls of lock(..):       300000
  lock had to wait:                  208783
  number of 500 msec delays:            400
  print task counter:                 42840
  bounce task counter:              1997295

  Approximate CPU utilisation over 334 periods of 100 msecs
   0-10%  10-20%  20-30%  30-40%  40-50%  50-60%  60-70%  70-80%  80-90%   >90%
     143      15      27      18      23      11      12      10      16     59

  Test completed
  1>
  34.61user 0.04system 0:35.17elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+5114minor)pagefaults 0swaps
  c178$

Other Implementations

This benchmark will be translated into other languages. Three in C and two in Java are planned, but most of these are not yet available. They will eventually be in the directories: c-colib, c-thread, c-fiber, java-colib and java-thread.

A TCP/IP Variant

In due course a variant of this benchmark will appear that runs the read and write sides on different machines, using TCP/IP connections to transmit the data through the channels. Typically between one and two hundred TCP/IP connections will be used. The two halves of this version of the benchmark can be implemented in different languages.

Martin Richards
23 February 2007