Dhrystone Benchmark for MCUs

Background

The Dhrystone benchmark was devised in 1984 to measure the performance of a computer/compiler combination with the specific exclusion of floating-point arithmetic.  (Floating-point performance is measured by the Whetstone benchmark; the name Dhrystone is a pun on it, reflecting the lack of floats.)  The structure of the benchmark is a loop containing the author's notion of a representative mix of C-language computations.  This is surrounded by a correctness-checking, timing and reporting framework.  The benchmark produces two results.  The first is the number of times the loop is executed per second, which is the "Dhrystones-per-second" rating of the computer/compiler combination.  This is then divided by 1757 to give the second result, "VAX MIPs".  The divisor comes from the Digital Equipment Corporation VAX 11/780 which, with the DEC-supplied compiler, executed 1757 Dhrystones per second and, according to the marketing literature, ran at one million instructions per second (although it did not achieve this in reality).

What's in a MIP

I would not suggest that the Dhrystone benchmark is in any way ideal, particularly with respect to embedded systems.  However, nothing as widely available has emerged to take its place and give a universal meaning to the much-abused notion of a MIP.  MCU manufacturers typically specify the throughput of their products in MIPs, but the figure they quote is simply the rate at which fairly primitive machine instructions can be fetched and executed.

ZiLOG rate the Z8 Encore! MCU at "up to 10 MIPs".  The highest clock frequency is 20 MHz and many two-byte instructions can be fetched and executed in two clock cycles, which gives the quoted rate.  However, many more instructions are longer than two bytes and/or take more than two cycles to execute.  Fairly simple statements in C generate several such instructions.

Atmel rate the ATmega MCUs at "up to 16 MIPs".  ATmega instructions are 16 bits in length and can therefore do more work per cycle than 8 bits of Z8 instruction.  The maximum clock rate is 16 MHz (at 5 volts - halve these numbers for most 3.3 volt versions).  Again, you often need several instructions to implement a simple C statement.

Using the manufacturers' data, you might expect an ATmega running at 16 MHz to have 60% higher performance than a Z8 Encore! running at 20 MHz.  But is this really the case?  Unless you build your product using both MCUs, there is no way to know with much accuracy.  However, running a benchmark will give you something of an idea.  (My results are given below.)

Dhrystone for Embedded Systems

ECROS Technology has adapted the widely-available Dhrystone benchmark for MCUs with 2 Kbytes of RAM or more.  The source code is available for download as a Zip archive and is free to all users.  The following sections describe how to use the benchmark and how it was derived from the standard UNIX-workstation based Dhrystone distribution.

Using the Benchmark

On a workstation, you use the Dhrystone benchmark by running it, specifying the number of iterations of the loop, and reading the results from the screen.  Operating system calls are used to measure total execution time and to display and store the results.  The adapted benchmark does not require any support for elapsed-time measurement or output to a display device.  Instead, it requires one bit of an output port, at which you observe a signal and measure its frequency.  The port bit is set and reset once in each pass through the Dhrystone loop, so the frequency you measure is the Dhrystones-per-second result.  To compute the VAX MIPs result, divide this by 1757.
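The conversion from the measured frequency is a one-line calculation.  As a minimal sketch in C (the names vax_mips and mips_per_mhz are illustrative and do not appear in the benchmark source):

```c
/* Reference rating of the VAX 11/780, in Dhrystones per second. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Convert the frequency measured at the port pin (in Hz) to VAX MIPs. */
double vax_mips(double measured_hz)
{
    return measured_hz / VAX_DHRYSTONES_PER_SEC;
}

/* Normalize by the clock rate to compare MCUs running at different speeds. */
double mips_per_mhz(double measured_hz, double clock_mhz)
{
    return vax_mips(measured_hz) / clock_mhz;
}
```

For example, a reading of 8487 Hz on a part clocked at 14.7456 MHz works out to about 4.83 VAX MIPs, or about 0.328 MIPs per MHz.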

If you don't have a frequency counter, inexpensive hand-held digital multi-function meters are available with frequency measurement and are fine for this application.  Some, however, are susceptible to double-counting if the signal bounces around a bit.  To avoid this, use a simple RC low-pass filter between the output port and the meter or estimate the frequency with an oscilloscope before accepting the meter reading.  As a last resort, measuring the period of the signal with the oscilloscope provides a rough estimate.

As delivered, the adapted benchmark uses bit 0 of port A on a Z8 Encore! and bit 0 of port B on an Atmel AVR.

How the Benchmark was Adapted

In this section, I describe the changes I made to get from the standard Dhrystone distribution to the version adapted for MCUs.  I would appreciate any comments on this process, particularly pointing out any errors.  I tried to make modifications that are easy to see and understand when the adapted source is compared to the original.

K & R to ANSI C

The standard distribution of the Dhrystone benchmark is written in the Kernighan and Ritchie (K & R) dialect of C.  The first step is to convert it to ANSI C, which eliminates a large number of errors reported by an ANSI compiler.  Formal argument declarations must be moved inside the parentheses of function definitions and return types must be corrected.  Several function prototypes must be placed in the header file dhry.h.  Note that compilers that are very good at detecting problems, such as GCC with -Wall, will report uninitialized variables.  I have not corrected these.
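To illustrate the kind of change involved, here is a small function in both dialects.  The function add_offset is a hypothetical example written for this purpose, not a routine taken from the benchmark; the K & R form is shown inside a comment so that the block compiles with an ANSI compiler.

```c
/*
 * K & R style: parameter types follow the parameter list and the
 * return type is left to default to int.
 *
 *   add_offset (value, result)
 *   int  value;
 *   int *result;
 *   {
 *     *result = value + 2;
 *   }
 */

/* ANSI style: the prototype would normally live in a header such as dhry.h. */
void add_offset(int value, int *result);

/* Types move inside the parentheses and the return type is explicit. */
void add_offset(int value, int *result)
{
    *result = value + 2;
}
```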

System Services

This section discusses how I dealt with system services that may not be available in your system.

dtime()

Function dtime() in file timers_b.c is intended to retrieve wall-clock time so as to measure the time taken by the selected number of iterations of the main loop.  This is not required in the adapted benchmark.  I have chosen to write a stub function that returns 0.0 rather than remove calls to this function from the code.  Make sure that in your system a function of this name exists to satisfy the linker.  My stub function appears if either __AVR_ARCH__ or __ENCORE__ is defined.
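A stub of this kind might look like the following sketch.  The two target macros are the ones named above; the #else branch is an assumption added only so that the sketch can also be built on a workstation, and is not taken from timers_b.c.

```c
#if defined(__AVR_ARCH__) || defined(__ENCORE__)

/* Embedded targets: the returned time is never used, so any value will do. */
double dtime(void)
{
    return 0.0;
}

#else

#include <time.h>

/* Host stand-in (assumption): processor time in seconds, not wall-clock time. */
double dtime(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

#endif
```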

Input/Output

Functions fopen(), fprintf(), fclose(), scanf(), and printf() are called for readily apparent reasons.  These are not required in the adapted benchmark.  Again, rather than edit the calls out of the source, I have neutralized them in other ways.

I have provided stub functions for fopen(), fclose() and scanf() in a new source file, estubs.c.  fopen() must return a non-zero value or the program will exit.  There is no reason why scanf() must provide a result, but in my stub I have chosen to emulate the user typing the number 30000 as the iteration count.  You may need to prototype some of the functions listed above in dhry.h if your system does not provide prototypes for them.  See also Output Port Control below for further information on the stubs for fopen() and fclose().

Failing all means of frequency measurement, you may be able to get a rough reading by uncommenting ++Run_Index in the loop, connecting an LED to the output port and using a watch to time how long the LED stays lit.  Don't forget that your integer type most probably limits Run_Index to 32767.
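The three stubs could be sketched roughly as follows.  This is not the content of estubs.c: the stub_ prefix and the STUB_FILE type are hypothetical names, used here so that the sketch does not clash with a workstation's own library.  On the target, the functions would carry the standard names and use the toolchain's FILE type.

```c
#include <stdarg.h>
#include <stddef.h>

/* Stand-in for FILE; the benchmark only needs a non-null pointer. */
typedef struct stub_file { int unused; } STUB_FILE;

static STUB_FILE stub_stream;

/* Called once before the main loop; must not return NULL or the program exits. */
STUB_FILE *stub_fopen(const char *name, const char *mode)
{
    (void)name;
    (void)mode;
    return &stub_stream;
}

/* Nothing to close; report success. */
int stub_fclose(STUB_FILE *stream)
{
    (void)stream;
    return 0;
}

/* Pretend the user typed 30000 in answer to the iteration-count prompt. */
int stub_scanf(const char *format, ...)
{
    va_list args;
    int *count;

    (void)format;
    va_start(args, format);
    count = va_arg(args, int *);   /* assumes a single "%d" conversion */
    va_end(args);
    *count = 30000;
    return 1;                      /* one item "converted" */
}
```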

Functions fprintf() and printf() need to be dealt with more harshly.  The strings in their argument lists eat up a great deal of RAM.  Removing the function calls would make it more difficult to compare the adapted source with the original.  Instead, I have #defined fprintf and printf as the single-line comment token (//) in dhry.h.  This seems to have the effect of removing the function calls without disturbing the source text, although GCC produces a large number of warnings about statements with no effect.

malloc()

Some embedded microcontroller systems may not provide malloc().  To side-step this issue, I have faked this function with emalloc() in estubs.c.  This was nowhere near as hard as it sounds because we know that malloc() is called twice and we can find out with a debugger how much memory is requested (for example, 39 bytes).  I have just parceled out a piece of statically allocated memory.  I should not have been forced to change the function name, but ZiLOG's ZDS II insists on linking the library version of malloc(), which does not work, instead of my replacement.
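The idea of parceling out statically allocated memory can be sketched as below.  This is an illustration rather than the code in estubs.c: the pool size, the alignment rule and the internal names are assumptions, and there is deliberately no way to free a block because the benchmark never does so.

```c
#include <stddef.h>

#define EMALLOC_POOL_BYTES 128u            /* hypothetical; size it for your requests */
#define EMALLOC_ALIGN      sizeof(void *)  /* assumed alignment requirement */

/* The union forces suitable alignment for the start of the pool. */
static union {
    unsigned char bytes[EMALLOC_POOL_BYTES];
    void *force_pointer_alignment;
    long force_long_alignment;
} emalloc_pool;

static size_t emalloc_used = 0;

/* Hand out the next unused piece of the pool, or NULL if it does not fit. */
void *emalloc(size_t size)
{
    size_t rounded = (size + EMALLOC_ALIGN - 1u) / EMALLOC_ALIGN * EMALLOC_ALIGN;
    void *block;

    if (rounded < size || rounded > EMALLOC_POOL_BYTES - emalloc_used) {
        return NULL;                       /* overflow or pool exhausted */
    }
    block = &emalloc_pool.bytes[emalloc_used];
    emalloc_used += rounded;
    return block;
}
```

Two calls requesting 39 bytes each return two separate blocks from the pool; a request that no longer fits returns NULL.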

Output Port Control

As mentioned above, the "result" of the benchmark is an output port bit that generates a frequency equal to the rate at which Dhrystone loops are executed.  This requires new code to initialize and control that output port bit.  Function fopen() is conveniently called once before the loop begins.  The stub of this function is an ideal place to put the port setup code.  If you want to clean up the port configuration at the end of the benchmark, a good place to do it is fclose() but since my version loops forever I have not done this.  To avoid the overhead of a function call, the port state is controlled by a macro, defined in dhry.h and invoked at two places in the main loop in dhry21a.c.
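A sketch of such macros for the AVR case (bit 0 of port B) is shown below.  The macro names, the bench_loop function and the host stand-in variables are all hypothetical; only PORTB and DDRB come from the AVR toolchain's <avr/io.h>.  The loop counts rising edges to show that one pass through the loop produces exactly one cycle at the pin.

```c
#include <stdint.h>

#if defined(__AVR_ARCH__)
#include <avr/io.h>
#define BENCH_PORT PORTB
#define BENCH_DDR  DDRB
#else
/* Host stand-ins so the sketch can be compiled and run off-target. */
static volatile uint8_t sim_port, sim_ddr;
#define BENCH_PORT sim_port
#define BENCH_DDR  sim_ddr
#endif

#define BENCH_BIT 0x01u

#define BENCH_PIN_INIT() (BENCH_DDR  |= BENCH_BIT)            /* make the pin an output */
#define BENCH_PIN_HIGH() (BENCH_PORT |= BENCH_BIT)
#define BENCH_PIN_LOW()  (BENCH_PORT &= (uint8_t)~BENCH_BIT)

/* Runs the given number of passes and returns the rising edges produced. */
unsigned long bench_loop(unsigned long iterations)
{
    unsigned long rising_edges = 0;
    unsigned long i;

    BENCH_PIN_INIT();      /* in the adapted benchmark this belongs in the fopen() stub */
    BENCH_PIN_LOW();
    for (i = 0; i < iterations; ++i) {
        if ((BENCH_PORT & BENCH_BIT) == 0u) {
            ++rising_edges;    /* pin is about to go from low to high */
        }
        BENCH_PIN_HIGH();
        /* ... first part of the loop body ... */
        BENCH_PIN_LOW();
        /* ... remainder of the loop body ... */
    }
    return rising_edges;
}
```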

Other Modifications

The Dhrystone benchmark defines some rather large integer arrays.  Fortunately, it seems that few elements of the arrays are actually accessed and that halving their size does not change the computational results.  The offsets of some (but not all) of the accessed elements have also been halved to keep them within the smaller arrays.  This allows the benchmark to fit in an MCU with 2 Kbytes of RAM (ZiLOG ZDS II reports 1640 bytes of EDATA, AVR-GCC reports 1634 bytes in the .data and .bss segments).
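A rough calculation shows why the arrays dominate the RAM budget.  The sketch below rests on three assumptions that are not stated above: that the standard distribution declares one 50-element array and one 50 x 50 array of int, that int occupies two bytes on the MCU, and that "halving" means each dimension becomes 25.  The function name array_bytes is hypothetical.

```c
#include <stddef.h>

/* Bytes needed for a 1-D array of dim elements plus a dim x dim 2-D array. */
size_t array_bytes(size_t dim, size_t element_size)
{
    return (dim + dim * dim) * element_size;
}
```

Under those assumptions, array_bytes(50, 2) is 5100 bytes, which cannot fit in 2 Kbytes of RAM, while array_bytes(25, 2) is 1300 bytes, which can.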

Measurements Using the Benchmark

These are the results I have obtained running my adapted Dhrystone benchmark on various MCUs.

| MCU | Clock (MHz) | Compiler and settings | Dhrystones per sec | VAX MIPs | MIPs per MHz |
|---|---|---|---|---|---|
| ZiLOG Z8F6422 | 18.432 | ZDS II 4.8.0, optimize for speed | 2105 | 1.20 | 0.065 |
| Atmel ATmega64 | 14.7456 | AVR-GCC 3.3.2, -g -Os | 8487 | 4.83 | 0.328 |
| TI MSP430F149 | 8.000 | CrossWorks for MSP430 (Note 1) | 4047 | 2.30 | 0.288 |
| MAXQ2000 | 8.000 | CrossWorks for MAXQ (Note 1) | 6762 | 3.85 | 0.481 |
| MAXQ2000 | 20.000 | CrossWorks for MAXQ (Note 1) | 16906 | 9.62 | 0.481 |

Note 1 - Thanks to Paul Curtis of Rowley Associates Ltd. for the Texas Instruments MSP430F149 and the Dallas/Maxim MAXQ2000 measurements.  A version of CrossWorks exists for the Atmel AVR and Paul is confident it will beat GCC!

I have made every effort to make sure that the above results are correct and fair.  If you disagree, please try it yourself.  In particular, I have verified the factor of approximately four between the Z8F6422 and ATmega64 using a test completely unrelated to the Dhrystone benchmark and more representative of an embedded system.  This suggests that, inappropriate as this benchmark might seem, its results are nevertheless useful.