
PC Benchmarks For 64 Bit Windows

Index

     
Conversion
System ID
Maximum CPU Speed
Maximum MP Speed
Classic Benchmarks
Linpack
Livermore Loops
Dhrystone
Whetstone MP
BusSpeed MP
Rand MP
More 64 Bit Tests
       
Main Page
Other Results
Download Win64.zip
Download DualCore.zip

This page was set up as 770 pixels wide and accommodates preformatted text <PRE> results tables. Some browsers
produce monospaced font of an unexpected size but this might be adjustable via browser Preferences.

Conversion

Most of my benchmarks have been converted to run as 64 bit programs and have been tested via Windows XP Pro x64 and 64 bit Windows Vista. The first step was to download Microsoft Platform SDK for Windows Server 2003 SP1. This includes a 64 bit compiler (cl), assembler (ml64) and linker. These can be used via the command line or in a .BAT file and the package can be installed using Win32 or Win64. For Windows programs, .RC files can be converted (rc) to .RES files and the latter to .OBJ (cvtres). Library names used are the same as Win32, like GDI32.LIB.

The compiler does not accept asm type assembler functions, so these have to be converted to MASM format, but headers and .INC files are different to the 32 bit varieties. 64 bit Windows programs cannot use the old x87 floating point instructions or MMX instructions. The former have to be converted to SSE1/2/3 instructions. MMX instruction names are the same as some provided in SSE2 but memory addresses have to be changed to suit 128 bit registers instead of 64 bits. 32 bit instructions can still be used, including CPUID and RDTSC. The only complication appears to be that push/pop should refer to a 64 bit register (push rdx instead of push edx). There appear to be complications in passing parameters to assembly code but I have avoided this by using global variables.

The SDK includes a 32 bit compiler that checks for 64 bit compatibility and has options to use SSE or SSE2 instructions for floating point. In some cases, this produces identical code to the 64 bit version. In other cases, it restricts the number of registers used, for 32 bit compatibility. MASM type assembly requires an assembler, such as the one that comes with Microsoft Visual C++ 6.0 Pro. In order to compare 64 bit versus 32 bit speeds, some of the benchmarks have also been compiled using the SDK 32 bit compiler.

The C/C++ and Assembler source codes for these benchmarks are available in NewSource.zip. The original versions can be obtained via the Main Page.


More 64 Bit Benchmarks

Windows, DirectDraw, OpenGL and Image Processing benchmarks have also been converted to run at 64 bits and a DirectX 9 benchmark has also been produced. See 64 Bit Graphics Tests.htm. Download benchmarks and C/C++ source codes via Video64.zip. Then, there are benchmarks for disks, CD/DVD drives, networks and peripherals in More64bit.zip, with results in 64 Bit Disk Tests.htm.

The latest conversions, including source code, are also in More64bit.zip. These are three versions of my Fast Fourier Transform benchmarks (see also FFTGraf.zip), SSE/SSE2 benchmark and burn-in/reliability tests (see also SSE3Dnow.zip) and BusSpd2K burn-in/reliability tests (see also BusSpd2K.zip). The latter burn-in tests have been modified to demonstrate paging speeds more quickly. See Paging.htm for results via 64-Bit Vista and XP Pro x64.


Other Results

Results of 64 bit tests, descriptions and some comparisons are included in results reports for 32 bit versions.

Whetstone Results.htm Linpack Results.htm
Livermore Loops Results.htm Dhrystone Results.htm
WhatCPU Results.htm BusSpd2K Results.htm
SSE3Dnow Results.htm Randmem Results.htm
FFTgraf Results.htm BMPspeed Results.htm
64 Bit Graphics Tests.htm BurnIn64.htm
DualCore.htm DiskGraf Results.htm
CDDVDSpd Results.htm VideoWin Results.htm
DirectDraw Results.htm Direct3D Results.htm
OpenGL Results.htm



System ID

Each benchmark includes a new system identification test. This is limited because Intel appear to make significant changes with each new CPU (identification is now much too complicated for dual CPUs with HT enabled or for cache sizes on a range of CPUs). Windows functions also lag considerably behind hardware capabilities. The following shows details provided by the 64 bit programs, then differences at 32 bits. Note that on both AMD and Intel CPUs, with 64 bit working, info.wProcessorArchitecture from GetSystemInfo(&info) indicates PROCESSOR_ARCHITECTURE_AMD64. With 32 bit operation, PROCESSOR_ARCHITECTURE_INTEL is supplied.
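
As a minimal sketch of the architecture check described above, the following C fragment maps a wProcessorArchitecture value to the text shown in the listings. The helper names arch_name and current_arch are hypothetical and are not taken from the benchmark source. The two constants are repeated from the Windows SDK headers only so that the fragment also compiles on other systems.

```c
#ifdef _WIN32
#include <windows.h>
#else
/* Not on Windows: repeat the SDK values so the mapping can still be compiled. */
#define PROCESSOR_ARCHITECTURE_INTEL 0
#define PROCESSOR_ARCHITECTURE_AMD64 9
#endif

/* Hypothetical helper: text for a wProcessorArchitecture value. */
const char *arch_name(unsigned int code)
{
    switch (code) {
    case PROCESSOR_ARCHITECTURE_AMD64:
        return "AMD64 processor architecture";
    case PROCESSOR_ARCHITECTURE_INTEL:
        return "Intel processor architecture";
    default:
        return "Other processor architecture";
    }
}

#ifdef _WIN32
/* Hypothetical helper: ask Windows which architecture this process sees. */
const char *current_arch(void)
{
    SYSTEM_INFO info;
    GetSystemInfo(&info);
    return arch_name(info.wProcessorArchitecture);
}
#endif
```

A 64 bit build calling current_arch() would be expected to report the AMD64 text on either make of CPU, and a 32 bit build the Intel text.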


  AMD Windows XP Pro x64 

 CPUID and RDTSC Assembly Code
 CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00020FB1
 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz 
 Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow, 
 Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
 AMD64 processor architecture, 2 CPUs 
 Windows NT  Version 5.2, build 3790, Service Pack 1
 Memory 1024 MB, Free 656 MB
 User Virtual Space 8388608 MB, Free 8388557 MB

  Intel Windows Vista 64-Bit 

  CPUID and RDTSC Assembly Code
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
  Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz Measured 2402 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
  Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
  AMD64 processor architecture, 2 CPUs 
  Windows NT  Version 6.0, build 6000, 
  Memory 4094 MB, Free 3207 MB
  User Virtual Space 8388608 MB, Free 8388547 MB

Differences between Win32 and Win64 at 32 bits


 Intel processor architecture, 2 CPUs 
 User Virtual Space 4096 MB, Free 4047 MB - Win64
 User Virtual Space 2048 MB, Free 2022 MB - Win32
 Memory 4095 MB, Free 3103 MB             - Win64
 Memory less than 3.5 GB                  - Win32
    

The C/C++ and Assembler source codes for these utilities are available in NewSource.zip.
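
The Features Code and Model Code values in the listings can be decoded by hand. The sketch below assumes that the Features Code is the EDX value and the Model Code the EAX value returned by CPUID function 1. That is an assumption about the utility, not something confirmed by its source. The function and macro names are hypothetical. SSE3 and 3DNow are reported in other CPUID registers, so they are not covered here.

```c
#include <stdint.h>

/* Standard bit positions in EDX from CPUID function 1. */
#define FEATURE_MMX  (1u << 23)
#define FEATURE_SSE  (1u << 25)
#define FEATURE_SSE2 (1u << 26)

/* Returns 1 if the given feature bit is set in the features code. */
int has_feature(uint32_t features, uint32_t mask)
{
    return (features & mask) != 0;
}

/* Stepping is held in bits 0 to 3 of the signature. */
unsigned cpu_stepping(uint32_t signature)
{
    return signature & 0xF;
}

/* Family is bits 8 to 11, plus the extended family (bits 20 to 27)
   when the base family is 15. */
unsigned cpu_family(uint32_t signature)
{
    unsigned base = (signature >> 8) & 0xF;
    unsigned extended = (signature >> 20) & 0xFF;
    return (base == 0xF) ? base + extended : base;
}

/* Model is bits 4 to 7, with the extended model (bits 16 to 19)
   as the high digit when the base family is 6 or 15. */
unsigned cpu_model(uint32_t signature)
{
    unsigned base_family = (signature >> 8) & 0xF;
    unsigned model = (signature >> 4) & 0xF;
    unsigned extended = (signature >> 16) & 0xF;
    if (base_family == 0x6 || base_family == 0xF)
        return (extended << 4) + model;
    return model;
}
```

On that assumption, 178BFBFF and BFEBFBFF both have the MMX, SSE and SSE2 bits set, 00020FB1 decodes as family 15, model 43, stepping 1 and 000006F6 as family 6, model 15, stepping 6.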


Maximum CPU Speed

This benchmark CPUID64 (In Win64.zip) is based on the original in Whatcpu.zip. The latter executes a long series of assembler coded add instructions to 1, 2, 3 and 4 registers to identify maximum speeds of integer, floating point and MMX instructions. The 64 bit version has the same 32 bit integer test and an identical one using 64 bit mode. SSE/SSE2 32/64 bit floating point tests are the same. As indicated above, normal floating point and MMX instructions are invalid under Win64. Instead of MMX, SSE2 32 bit and 64 bit add speeds are measured. A revised 32 bit version is included in Win64.zip to show SSE2 integer speeds.

In the following example of 64 bit results, 64 bit integer Millions of Instructions Per Second (MIPS) are similar to 32 bit speeds. That would be expected, as 32 bit operations use half of the real register size. With more pipelines, 64 bit normal integer MIPS can be faster than using integer SSE2 instructions. As usual, 64 bit floating point MFLOPS (Millions of FLoating point Operations Per Second) run at half speed compared with 32 bits (2 words versus 4 words in 128 bit registers). This is also the case with AMD on 32/64 bit SSE2 integers, but not with this Intel CPU, where the 64 bit SSE2 integer speed is nearer to a quarter of that at 32 bits.


     CPU ID and Speed Test 64 bit Version - Windows XP Pro x64
 
       Assembled with Microsoft ml64.exe Version 8.00.40310.39

  AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz

 Speeds adding to     1 Register  2 Registers  3 Registers  4 Registers

 32 bit Integer MIPS     2430         4864         5650         6080
 64 bit Integer MIPS     2430         4864         6356         6485
 32 bit SSE2 Int MIPS    4421         8895         8844         8844
 64 bit SSE2 Int MIPS    2211         4419         4447         4422
 32 bit SSE MFLOPS       2214         4421         4421         4434
 64 bit SSE2 MFLOPS      1105         2210         2217         2217


     CPU ID and Speed Test 64 bit Version - Windows Vista 64-Bit

      Intel(R) Core(TM)2 CPU 6600  @ 2.40GHz Measured 2402 MHz

 Speeds adding to     1 Register  2 Registers  3 Registers  4 Registers

 32 bit Integer MIPS     2609         4081         5261         7044
 64 bit Integer MIPS     2613         4177         5225         7044
 32 bit SSE2 Int MIPS    9488        14638        17542        17466
 64 bit SSE2 Int MIPS    2401         4575         4585         4575
 32 bit SSE MFLOPS       3201         6405         9607         9607
 64 bit SSE2 MFLOPS      1601         3202         4804         4804
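
One way to read the tables above is to divide each speed by the measured MHz, giving results per clock cycle. The following is a small sketch of that arithmetic. The function name ops_per_clock is hypothetical and the calculation is an interpretation of the published figures, not part of the benchmark.

```c
/* Results per clock cycle: millions of operations per second
   divided by millions of clock cycles per second. */
double ops_per_clock(double millions_per_second, double measured_mhz)
{
    return millions_per_second / measured_mhz;
}
```

For the AMD results, 4421 MFLOPS at 2211 MHz is about 2 results per clock with 32 bit SSE and 2210 MFLOPS is about 1 per clock with 64 bit SSE2. For the Core 2, 9607 and 4804 MFLOPS at 2402 MHz are about 4 and 2 per clock, the expected 4 word versus 2 word ratio.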

Download Win64.zip


Maximum MP Speed

This benchmark CPUIDMP64 (In DualCore.zip) uses some of the instruction sequences from CPUID64. First, an integer test and an SSE floating point test are run separately. They are then run as two threads of equal priority, where both should run at full speed with 2 CPUs. Finally, four threads are run: an FP test at normal priority, then another FP test and two integer tests at lower priority. With 2 CPUs, the first FP test should run at full speed and the others at the whim of the OS. A 32 bit version is included in DualCore.zip using the same 32 bit instructions. Results of 64 bit and 32 bit tests can be expected to be the same, except possibly for sharing with 4 threads.

When run on a single CPU, the floating point and integer tests are likely to run at half speed with two threads. With four threads, the lower priority tests might obtain a small amount of time.


   CPU ID and MP Speed Test 64 bit Version - Windows XP Pro x64
 
     Assembled with Microsoft ml64.exe Version 8.00.40310.39

  AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz

  Speed adding to registers   Pass 1   Pass 2   Pass 3

  Separate Tests
  32 bit SSE   MFLOPS          4411     4411     4415
  32 bit Integer MIPS          6068     6070     6070

  Two Threads Equal Priority
  32 bit SSE   MFLOPS          4405     4409     4408
  32 bit Integer MIPS          6067     6053     5992

  Four Threads, First Normal Priority, Others Normal - 1
  32 bit SSE   MFLOPS          4401     4411     4410
  32 bit Integer MIPS          2903     2053     2898
  32 bit SSE   MFLOPS             0     1433        0
  32 bit Integer MIPS          3454     2227     3455


  CPU ID and MP Speed Test 64 bit Version - Windows Vista 64-Bit

      Intel(R) Core(TM)2 CPU 6600  @ 2.40GHz Measured 2402 MHz

  Speed adding to registers   Pass 1   Pass 2   Pass 3

  Separate Tests
  32 bit SSE   MFLOPS          9582     9595     9600
  32 bit Integer MIPS          6934     6936     6950

  Two Threads Equal Priority
  32 bit SSE   MFLOPS          9501     9600     9600
  32 bit Integer MIPS          7002     7006     7013

  Four Threads, First Normal Priority, Others Normal - 1
  32 bit SSE   MFLOPS          9592     9575     9576
  32 bit Integer MIPS          3447     3414     3329
  32 bit SSE   MFLOPS          4844        0        0
  32 bit Integer MIPS             0     3337     3366

Download DualCore.zip


Classic Benchmarks

The Classic Benchmarks are the first programs that set standards of performance for computers. Details are available from Classic.htm and benchmark programs and results obtained via BenchNT.zip. The Linpack, Livermore Loops and Whetstone Benchmarks have been compiled for 64 bit systems and for 32 bit PCs using automatic compilation with SSE or SSE2 instructions. Dhrystone Benchmarks are now included. The benchmarks and sample results can be obtained from Win64.zip or DualCore.zip and source codes from NewSource.zip.


Linpack and Livermore Loops Benchmarks

Linpack and Livermore Loops benchmarks use double precision floating point, so are compiled with SSE2 instructions. The compilers are not as efficient as they could be, producing instructions using one 64 bit word in the 128 bit registers, rather than two for Single Instruction Multiple Data (SIMD) operation. The original 64 bit Linpack results on Core 2 Duo were disappointing but this was corrected in 2009 by using a later version of the compiler.


           Linpack Benchmark Results

  AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ 
      Measured 2211 MHz and XP Pro x64

    Original       SSE2 Win32         SSE2 Win64

  838 MFLOPS      1014 MFLOPS        1044 MFLOPS

2009 Compilation                     1091 MFLOPS  

    Intel(R) Core(TM)2 CPU 6600  @ 2.40GHz 
      Measured 2402 MHz and Vista 64-Bit

    Original       SSE2 Win32         SSE2 Win64

 1315 MFLOPS      1480 MFLOPS         823 MFLOPS

 2009 Compilation                    1602 MFLOPS 


There are 24 Livermore Loops, with the performance of each measured in MFLOPS. Overall ratings are also produced, the Geometric Mean being the official average quoted. Following are results for the original Watcom version, 32 bits with SSE2 and 64 bits with SSE2. The 64 bit compilation can use up to 16 registers to speed up processing. However, some 32 bit SSE2 results are faster, as are a few from the original Watcom version. The Intel Core 2 Duo results, compiled for 64 bits, are more frequently slower than at 32 bits, again corrected in a 2009 recompilation. See below for Whetstone Benchmark.


  AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz and XP Pro x64

 ********************************************************

 Livermore Loops Benchmark Original Optimised via C/C++

 MFLOPS for 24 loops

 2032.4 1312.4  345.7 1031.6  275.6  334.9 2565.9 2288.0 2337.2  121.9  183.7  550.5
   49.8  131.6  393.5  350.4  217.7 1474.2  309.3  290.9  612.8  458.5  751.6  294.6

 Overall Ratings
 Maximum Average Geomean Harmean Minimum

  2565.9   740.9   460.5   285.7    48.4
 
 ********************************************************

 Livermore Loops Benchmark 32 Bit Version

 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 MFLOPS for 24 loops

 1619.1 1187.1  717.9 1068.3  244.9  606.3 1815.6 1727.1 1907.6  670.2  200.4  549.6
  169.4  317.6  737.7  654.7  684.2 1452.0  455.0  762.5 1031.2  406.1  590.3  219.2

 Overall Ratings
 Maximum Average Geomean Harmean Minimum

  1907.6   798.3   640.1   501.3   162.3

 ********************************************************

 Livermore Loops Benchmark 64 Bit Version

 Via 64 Bit Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64

 MFLOPS for 24 loops

 1927.4 1118.3 1096.7 1054.2  252.2  320.3 2284.1 2099.2 1756.2  632.4  183.6  731.0
  173.1  306.3  552.9  732.4  922.6 1441.3  500.8  881.3  328.6  351.8  758.0  397.8

 Overall Ratings
 Maximum Average Geomean Harmean Minimum

  2284.1   843.7   660.7   509.1   165.7

 ********************************************************

 Livermore Loops Benchmark 64 Bit Version - 2009 Compilation

 Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64

 MFLOPS for 24 loops
 1954.0 1151.1 1094.2  948.6  249.3  605.2 2067.6 2022.1 1783.5  655.1  170.8  731.2
  195.8  305.8  512.1  723.4  902.1 1332.5  501.6  730.1 1042.7  359.6  736.2  398.3

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  2067.6   846.9   679.2   533.9   170.8

 ######################################################

 Intel(R) Core(TM)2 CPU 6600  @ 2.40GHz Measured 2402 MHz and Vista 64-Bit

 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 MFLOPS for 24 loops

 1960.1 1357.0  788.5 1471.0  341.2  891.9 2526.4 2044.9 2153.0  860.1  265.8 1181.5
  458.5  555.0  444.0 1018.2  824.4 1073.6  505.2  632.3 1235.3  194.7  772.0  278.2

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  2526.4   990.3   803.8   639.0   194.7

 ********************************************************

 Via Microsoft 64 Bit C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64

 MFLOPS for 24 loops
  626.3  835.2  594.5  589.0  341.3  406.3  886.7 1040.6 1098.2  391.0  239.0  398.3
  349.7  397.8  320.7  857.1 1038.9  714.0  639.2  429.5  418.0  227.3  838.1  673.5

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  1175.0   592.9   537.0   484.9   227.2

 ********************************************************
 2009 Compilation

 Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64

 MFLOPS for 24 loops
 1833.8 1221.5 1505.3 1290.3  340.5  858.4 2760.0 2375.6 2183.4  851.3  264.5 1183.2
  508.6  561.6  446.6  909.1 1067.1 1132.1  637.8  665.0 1233.7  362.7  928.6  715.9

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  2798.8  1066.4   893.2   749.0   260.5
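
The five overall ratings can be illustrated with a short C sketch. The names Ratings and overall_ratings are hypothetical and this is not the benchmark's own code. The geometric mean is found here by bisection, purely so that the fragment needs no maths library. The usual approach would be the exponential of the mean logarithm. The published ratings do not appear to be computed from the 24 displayed values alone (for example, the first AMD minimum of 48.4 is lower than any listed loop), so this is not expected to reproduce them exactly.

```c
#include <stddef.h>

typedef struct {
    double maximum, average, geomean, harmean, minimum;
} Ratings;

/* Geometric mean by bisection between lo and hi: it is the value g
   for which the product of all x[i] / g equals 1. */
static double geometric_mean(const double *x, size_t n, double lo, double hi)
{
    for (int iter = 0; iter < 100; iter++) {
        double g = 0.5 * (lo + hi);
        double ratio = 1.0;
        for (size_t i = 0; i < n; i++)
            ratio *= x[i] / g;
        if (ratio > 1.0)
            lo = g;   /* g is too small */
        else
            hi = g;   /* g is too large or exact */
    }
    return 0.5 * (lo + hi);
}

/* Summarise n positive MFLOPS values (n must be at least 1). */
Ratings overall_ratings(const double *x, size_t n)
{
    Ratings r = { x[0], 0.0, 0.0, 0.0, x[0] };
    double sum = 0.0, reciprocal_sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] > r.maximum) r.maximum = x[i];
        if (x[i] < r.minimum) r.minimum = x[i];
        sum += x[i];
        reciprocal_sum += 1.0 / x[i];
    }
    r.average = sum / (double)n;
    r.harmean = (double)n / reciprocal_sum;
    r.geomean = geometric_mean(x, n, r.minimum, r.maximum);
    return r;
}
```

For any set of positive values the harmonic mean is no greater than the geometric mean, which is no greater than the arithmetic average, and that ordering can be seen in every Overall Ratings line above.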



Dhrystone Benchmarks

The Dhrystone tests are integer/fixed point benchmarks measuring performance in Millions of Instructions Per Second relative to the 1977 Digital VAX 11/780, accepted as the first 1 MIPS minicomputer. Dhrystone 1 could easily be over optimised, with some of the code not being executed, and this is probably reflected in the results. Dhrystone 2 was intended to overcome this deficiency.
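
As a sketch of how a VAX MIPS rating is normally derived, the measured Dhrystones per second are divided by the score of the VAX 11/780. The figure of 1757 Dhrystones per second used below is the commonly quoted VAX 11/780 result and is an assumption here, as this page does not state the divisor used. The function name vax_mips is hypothetical.

```c
/* Commonly quoted Dhrystones per second for the VAX 11/780 (assumed). */
#define VAX_11_780_DHRYSTONES_PER_SECOND 1757.0

/* Rating relative to the 1 MIPS VAX 11/780. */
double vax_mips(double dhrystones_per_second)
{
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND;
}
```

On this basis, a system running ten times as many Dhrystones per second as the VAX would be rated at 10 VAX MIPS.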

Three versions are available, one compiled for 32 bit Windows and two for 64 bit systems. One of the latter uses 32 bit integer variables and the other 64 bit integers. Results for 32 bit integers show that 64 bit compilations are up to 56% faster than the 32 bit versions. Much of the gain appears to be due to a different translation of the C source code but, with twice as many registers available for optimisation at 64 bits, there could be some performance improvement from that as well. Regarding 64 bit compilations, both Dhrystone versions were slower using 64 bit integers than with 32 bit integers, by 27% in one case. This might be due to the higher volume of data from cache with 64 bit words but limited trial compilations, with some of the code omitted, were inconclusive. Further 32/64 bit integer comparisons can be found for BusSpdMP.


                          Dhrystone Benchmark Results

  AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz and XP Pro x64


 ********************************************************
 32 Bit Integers - Via MS 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 VAX MIPS rating Dhrystone 1 and 2:   6104.33    3719.73

 ********************************************************
 32 Bit Integers - Via MS C/C++ Compiler Version 14.00.40310.41 for AMD64

 VAX MIPS rating Dhrystone 1 and 2:   8668.31    5213.64

 ********************************************************
 64 Bit Integers - Via MS C/C++ Compiler Version 14.00.40310.41 for AMD64

 VAX MIPS rating Dhrystone 1 and 2:   8548.73    4654.28

 ######################################################

    Intel(R) Core(TM)2 CPU 6600  @ 2.40GHz Measured 2402 MHz and Vista 64-Bit
 

 ********************************************************
 32 Bit Integers - Via MS 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 VAX MIPS rating Dhrystone 1 and 2:   8094.18    5476.09

 ********************************************************
 32 Bit Integers - Via MS C/C++ Compiler Version 14.00.40310.41 for AMD64

 VAX MIPS rating Dhrystone 1 and 2:  12600.13    8549.73

 ********************************************************
 64 Bit Integers - Via MS C/C++ Compiler Version 14.00.40310.41 for AMD64

 VAX MIPS rating Dhrystone 1 and 2:  11725.88    6247.67

 ######################################################
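
The percentage differences quoted in the text can be checked against the results above with a one line calculation. This is a sketch only and the function name percent_change is hypothetical.

```c
/* Percentage by which new_value is faster (positive) or slower
   (negative) than base_value. */
double percent_change(double new_value, double base_value)
{
    return (new_value / base_value - 1.0) * 100.0;
}
```

For the Core 2 Duo Dhrystone 2 results, 8549.73 against 5476.09 gives about 56% for the 64 bit compilation with 32 bit integers, and 6247.67 against 8549.73 gives about 27% slower with 64 bit integers.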

    


Whetstone Benchmark

The Whetstone Benchmark produces an overall rating in terms of Millions of Whetstone Instructions Per Second (MWIPS). The version used also produces speeds in MFLOPS and MOPS for the eight test loops, three with straight floating point, two with intrinsic functions and three with integer type operations. An overall average (Geometric) is produced for the first three and equivalent VAX MIPS for the last three. The single precision version of the benchmark was compiled to use SSE instructions. This is quite a bit faster than the original version but Vax MIPS are over-inflated due to excessive optimisation.

MP Version

The program was modified to use a second thread to execute some of the code and demonstrate the use of two CPUs. The second thread is run at THREAD_PRIORITY_BELOW_NORMAL, which sees little time on a single CPU. With dual CPUs, both threads should demonstrate full speed. One complication is that the compiler refused to produce the same code for the second thread as for the first, so there is some variation in speeds. This MP version was also compiled for 64 bit operation and results are shown below for this, 32 bit MP and 32 bit SSE versions. Floating point speed is similar on both MP versions (and around double that of a single processor) but the 64 bit variety runs one of the integer tests faster by using more registers.


 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz and XP Pro x64


 Whetstone Single Precision SSE benchmark - single CPU version

 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Gmean   MIPS            1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS

    583  12197   2313    655    656    461   51.0   36.3   1988   2210   3305

 ********************************************************************************

 Whetstone Single Precision MP SSE Benchmark Wed Aug 10 12:38:03 2005

 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Gmean   MIPS            1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS

   1164  19030   4506   1310   1308    920    102   69.7   3598   4139   3702
  Thread 1               642    642    452   50.7   34.8   1796   2062   2690
  Thread 2               668    666    467   50.8   34.9   1802   2078   1013

 ********************************************************************************

 Whetstone Single Precision MP SSE Benchmark Fri Aug 05 12:18:12 2005

 Via Microsoft 64 bit C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64

 MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Gmean   MIPS            1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS

   1086  25950   4983   1325   1145    845    151   67.1   3610   4204   9210
  Thread 1               661    572    468   75.2   33.5   1804   2099   8067
  Thread 2               663    573    377   76.0   33.6   1806   2105   1143

 #################################################################################

 Intel(R) Core(TM)2 CPU 6600  @ 2.40GHz Measured 2402 MHz and Vista 64-Bit

 Whetstone Single Precision SSE Benchmark Fri Jul 20 17:06:25 2007

 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Gmean   MIPS            1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS
    728  18421   2419    851    855    530   57.2   29.7   1994   1747  14352

 ********************************************************************************

 Whetstone Single Precision MP SSE Benchmark Fri Jul 20 17:06:08 2007

 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Gmean   MIPS            1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS
   1439  23554   4704   1700   1689   1037    113   58.1   3720   3738   7518
  Thread 1               845    826    517   56.5   28.9   1871   1797   6439
  Thread 2               855    863    520   56.8   29.3   1848   1941   1079

 ********************************************************************************

 Whetstone Single Precision MP SSE Benchmark Fri Jul 20 17:06:45 2007

 Via Microsoft 64 bit C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64

 MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Gmean   MIPS            1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS
   1417  26543   5661   1723   1608   1026    157   77.4   3645   3096  13257
  Thread 1               862    805    530   78.1   38.5   1809   1535  12268
  Thread 2               861    803    496   78.4   39.0   1837   1560    989


BusSpeed MP Benchmark

This MP benchmark uses variable amounts of memory to measure speed via caches and RAM, first as a single thread, then as two threads to demonstrate the impact of two CPUs. It is based on BusSpd2K (BusSpd2K.zip), using integer AND instructions to a single register, streaming data from caches or RAM. The first test reads one word with a 32 word address increment for the next word. That is 128 bytes with 32 bit words and 256 bytes with 64 bit words. The address increment reduces for following tests, down to one word (ReadAll). The last test reads all the data using 16 byte (128 bit) SSE2 instructions. With two threads, each reads all the data, with the total number of passes the same as with one thread. BusSpd2K can produce some faster results as streaming to two registers is used for some tests. Except for SSE2, C compiler code is used for the tests as this is similar to the assembly code in BusSpd2K. Results of benchmarks compiled for 32 and 64 bit systems are shown below. The benchmarks are in DualCore.zip.
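
The strided read described above can be sketched in C as follows. This is an illustration of the access pattern only, not the benchmark's code, which unrolls 64 AND instructions in its inner loop. The function names and_stride32 and stride_bytes are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* AND every inc_words-th 32 bit word of the buffer into one accumulator,
   so each read depends on a single register as in the tests described. */
uint32_t and_stride32(const uint32_t *data, size_t words, size_t inc_words)
{
    uint32_t accumulator = 0xFFFFFFFFu;
    for (size_t i = 0; i < words; i += inc_words)
        accumulator &= data[i];
    return accumulator;
}

/* Distance in bytes between successive reads. */
size_t stride_bytes(size_t inc_words, size_t word_bytes)
{
    return inc_words * word_bytes;
}
```

With inc_words set to 32, stride_bytes gives 128 for 4 byte words and 256 for 8 byte words, as stated above. Setting inc_words to 1 corresponds to the ReadAll case.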

Looking at RAM speed, the system reads data in 64 byte bursts - 16 word address increments at 32 bits and 8 word increments at 64 bits. This is demonstrated by no/little performance gain with larger address increments. Speed at 64 bits will appear to be twice as fast as at 32 bits, as twice as much data is being used out of the burst. Typical burst speed at 32 bits is 319 MB/sec and maximum speed can be assumed to be 16 times this or 5104 MB/sec (maximum theoretical 2 x 3200). In this case, the memory buses appear to be saturated and there is no gain with 2 CPUs. As the address increment is reduced, speed increases to around 3000 MB/sec using one thread or 4700 MB/sec with two threads. These are similar speeds to a BusSpd2K two program test (see DualCore.htm), indicating a performance limitation with a single CPU.
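
The burst arithmetic above can be written out as a small C sketch. The function name burst_bandwidth_estimate is hypothetical, and the 64 byte burst size is taken from the text rather than measured.

```c
/* Estimated data rate if every word of each burst were used, given the
   measured rate when only one word per burst is read. */
double burst_bandwidth_estimate(double measured_mb_per_sec,
                                unsigned burst_bytes, unsigned word_bytes)
{
    unsigned words_per_burst = burst_bytes / word_bytes;
    return measured_mb_per_sec * words_per_burst;
}
```

With 319 MB/sec, 64 byte bursts and 4 byte words this gives 16 x 319 = 5104 MB/sec, against the theoretical 2 x 3200 = 6400 MB/sec. A 64 bit figure of around 640 MB/sec with 8 byte words gives a similar estimate of 5120 MB/sec.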

Results via caches are strange. A sample from 32 bit BusSpd2K is included below to explain possible reasons. Firstly, BusSpd2K uses just MOV instructions for the burst tests. It shows halving of speed from caches from 32 byte (8 word) increments to 64 byte (16 word) and BusSpdMP goes one step further to 32 word address increments. BusSpd2K also shows half speed from L1 cache when ANDing to 1 register instead of 2. With BusSpdMP, the compiler refused to translate code for two registers as hoped for.

Most cache based results do not show expected performance gains on using 2 CPUs. Inner loops of the tests have 64 AND instructions and an outer loop runs this for around 0.5 seconds (a long time and little difference at 0.1 seconds). Maybe the cause is cache flushing with some data coming from RAM.

The above comments relate to the tests on the PC with an AMD CPU, using Windows XP Pro x64. Results for a Core 2 Duo PC, using 64-Bit Windows Vista, are given later. This has faster RAM, a larger and faster L2 cache and faster operation on SSE2 instructions.


 AMD Athlon 64(tm) X2 Dual Core Processor 4200+ Measured 2211 MHz, XP Pro x64

 ##############################################################################
 
 Old BusSpd2K Performance Test MBytes/Second

         16wds  8wds

          MovI  MovI  MovI  MovI  MovI  MovI  AndI  AndI  MovM  MovM
  Memory  Reg2  Reg2  Reg2  Reg2  Reg1  Reg2  Reg1  Reg2  Reg1  Reg8
  KBytes Inc64 Inc32 Inc16  Inc8  Inc4  Inc4  Inc4  Inc4  Inc8  Inc8

      4   8070 15711 16498 17247 16538 16763  8670 16454 34291 34254
      8   8437 16391 16544 17044 16838 17064  8765 16787 34148 35264

    128    639  1281  2437  4400  7782  7780  6539  6694  8882  8684
    256    651  1285  2411  4418  7786  7776  6448  6688  8936  8718

  65536    315   609  1009  1478  2789  2792  2656  2842  2940  2940
 131072    315   610  1007  1457  2793  2791  2704  2803  2940  2941

 #####################################################################
      MP Bus Speed Test 32 bit Version 1.2 Thu Apr 23 12:59:52 2009
 
 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
        SSE2 Assembled with Microsoft ml.exe Version 6.15.8803

                  Part 1 - Single Thread MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     8150     8479    10384    10088     9866     9976    17421
       24     8394     8599    10477    10098    10183    10109    17484
       96      745      659     1245     2371     4908     6372     8930
      384      355      311      568      889     1443     2791     2967
      768      358      310      564      887     1432     2781     2946
     1536      360      310      565      887     1436     2788     2961
    16380      352      313      561      877     1384     2745     2910
   131070      351      314      562      877     1415     2739     2917

               Part 2 - Two Threads Total MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     9245    10399    14752    16214    17382    18566    34846
       24    11382    13886    18134    18714    19568    19658    34652
       96     1475     1314     2474     4705     9725    12685    17789
      384      320      329      666     1303     2368     4809     4740
      768      318      329      664     1302     2365     4793     4728
     1536      318      328      665     1304     2372     4812     4743
    16380      319      331      665     1291     2334     4729     4683
   131070      320      330      661     1289     2332     4727     4690

 ##############################################################################

      MP Bus Speed Test 64 bit/int32 Ver 1.2 Thu Apr 23 13:01:16 2009
 
 Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
       SSE2 Assembled with Microsoft ml64.exe Version 8.00.40310.39

                  Part 1 - Single Thread MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     7427     8061     9776     9563     9098     9402    17526
       24     7532     8114    10196     9910     9484     9525    17496
       96      741      671     1253     2345     4902     6576     8791
      384      359      309      544      843     1465     2549     2969
      768      358      307      543      840     1453     2583     2962
     1536      360      307      543      841     1463     2615     2958
    16380      353      310      542      838     1437     2644     2929
   131070      349      309      540      832     1431     2598     2865

               Part 2 - Two Threads Total MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     9591    10169    14026    15814    16444    17487    34972
       24    10886    13336    17636    18394    18089    18606    34922
       96     1479     1341     2493     4652     9736    13097    17679
      384      320      330      667     1280     2396     4349     4750
      768      320      330      667     1280     2398     4362     4766
     1536      319      331      666     1279     2393     4315     4746
    16380      322      335      668     1271     2371     4255     4719
   131070      321      334      662     1259     2343     4246     4736

 ##############################################################################

      MP Bus Speed Test 64 bit Version 1.2 Thu Apr 23 13:02:40 2009
 
 Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
       SSE2 Assembled with Microsoft ml64.exe Version 8.00.40310.39

                  Part 1 - Single Thread MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6    14612    15968    16147    17828    17760    17492    17477
       24    14970    16215    16258    17791    17719    17820    17508
       96     1944     1952     1344     2410     4543     9668     8787
      384      655      720      592      996     1513     2933     2970
      768      656      720      592      992     1506     2897     2957
     1536      653      719      590      993     1505     2898     2960
    16380      643      705      591      986     1478     2873     2929
   131070      640      702      593      985     1478     2871     2927

               Part 2 - Two Threads Total MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6    11775    14939    14931    17563    19524    21683    34827
       24    11304    17001    18712    22069    22764    23612    34809
       96     3113     3073     2556     4672     8255    12516    17556
      384      557      632      645     1268     2228     4064     4718
      768      560      631      645     1271     2242     4072     4721
     1536      561      630      643     1233     2242     4102     4752
    16380      564      633      643     1266     2232     4086     4724
   131070      560      634      646     1264     2228     4080     4721

 ##############################################################################

 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit

 ##############################################################################

      MP Bus Speed Test 32 bit Version 1.2 Thu Apr 23 12:53:45 2009
 
 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
        SSE2 Assembled with Microsoft ml.exe Version 6.15.8803

                  Part 1 - Single Thread MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     7076     8930     8889     9164     9247     9220    37147
       24     8406     8585     8741     8940     9014     9104    37713
       96     2027     2017     3222     4518     6614     7915    19004
      384     2023     2018     3239     4503     6661     7956    19038
      768     2001     2015     3226     4487     6632     7917    19102
     1536     1950     1983     3191     4412     6491     7830    18595
    16380      316      380      783     1411     2631     4798     5634
   131070      312      382      778     1423     2567     4868     5678

               Part 2 - Two Threads Total MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     7347    11414    13477    15485    16414    17004    64860
       24    12643    14519    15499    15765    15959    16291    65950
       96     3137     2809     5169     7291    11891    14047    30072
      384     3150     2849     5077     7639    10778    14685    29339
      768     3015     2980     4988     7379    11758    13607    30887
     1536     2969     2725     5036     7056    11213    13271    29849
    16380      315      417      851     1739     3087     5743     6971
   131070      313      416      877     1757     2967     5693     6919

 ##############################################################################

      MP Bus Speed Test 64 bit/int32 Ver 1.2 Thu Apr 23 12:55:49 2009
 
 Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
       SSE2 Assembled with Microsoft ml64.exe Version 8.00.40310.39

                  Part 1 - Single Thread MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     7088     8064     8353     8685     8728     8719    37226
       24     7313     8121     8153     8432     8543     8598    37539
       96     1765     1965     3177     4507     6462     7816    18899
      384     2039     2022     3232     4518     6457     7850    18962
      768     2034     2008     3218     4501     6464     7819    19034
     1536     1931     1991     3178     4427     6306     7708    18523
    16380      316      380      789     1427     2654     4876     5731
   131070      317      370      787     1398     2610     4867     5669

               Part 2 - Two Threads Total MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     6576    10532    11969    13477    15141    14897    70169
       24    10735    13041    14113    14257    15900    14838    70711
       96     3008     2944     4909     7646    11280    14548    29482
      384     2904     2997     4994     7335    11754    13745    30437
      768     3028     2779     5125     7195    11704    13391    30549
     1536     2829     2769     4793     7126    11158    13621    29143
    16380      316      427      867     1705     3062     5694     7036
   131070      314      423      845     1736     3014     5520     7051

 ##############################################################################

      MP Bus Speed Test 64 bit Version 1.2 Thu Apr 23 12:57:50 2009
 
 Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
       SSE2 Assembled with Microsoft ml64.exe Version 8.00.40310.39

                  Part 1 - Single Thread MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6    14142    16327    16605    17323    17381    17440    36730
       24    14978    16822    16506    17300    17310    17271    37790
       96     4076     4190     3988     6449     9504    13632    18914
      384     3995     4149     4022     6425     9549    13593    19051
      768     3977     4152     4011     6438     9555    13584    18967
     1536     3918     3977     3954     6318     9282    13358    18674
    16380      594      625      771     1554     2882     5150     5696
   131070      574      631      762     1578     2861     5119     5666

               Part 2 - Two Threads Total MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     8740    13942    15815    17691    19101    20269    66705
       24     9484    15989    16697    20145    20505    19950    68468
       96     5225     5630     5557     9618    13355    17128    28572
      384     5271     5631     5510     9893    12940    17222    29617
      768     5369     5760     5544     9899    13147    17200    30917
     1536     4938     5635     5543     9374    12449    16582    28732
    16380      583      625      821     1673     3166     5148     6988
   131070      600      624      821     1664     3154     5267     6901


Another version of the 64 bit benchmark was produced. This runs only the single thread test, with command line options to select the memory size used, running time and log file name, so that more than one copy can be run at the same time. Results are shown below for two copies of the program running concurrently to test L1 cache, L2 cache and RAM. Total speed from caches is seen to double, unlike the same tests using two threads. Results are for the AMD based PC.


   Kbytes    Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

    6 Prog 1     5706    11458    11868    17738    17742    17452    17431
    6 Prog 2     5731    11349    11882    17839    17803    17505    17473

   96 Prog 1     1926     1498     1347     2449     4495     9619     8796
   96 Prog 2     1936     1505     1349     2455     4506     9642     8813

 1536 Prog 1      295      319      328      642     1174     2366     2444
 1536 Prog 2      299      328      347      694     1240     2334     2809
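
As a rough check on the doubling, the speeds of the two programs in the table above can be added and compared with the earlier single thread results for the same PC. The following C sketch is illustrative only; the function names are hypothetical and are not part of the benchmark.

```c
#include <assert.h>

/* Total MBytes/second delivered by two copies of the program run together. */
double combined_mbps(double prog1_mbps, double prog2_mbps)
{
    return prog1_mbps + prog2_mbps;
}

/* Ratio of the combined speed to that of one program (or thread) run alone. */
double scaling_ratio(double combined, double single_mbps)
{
    assert(single_mbps > 0.0);
    return combined / single_mbps;
}
```

For example, at 6 KB the ReadAll speeds of 17452 and 17505 MB/second total 34957, against 17492 for the single thread 64 bit run, a ratio of about 2.0. The two thread test on the same PC gave 21683 MB/second at that size.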


BusSpdMP 32/64 bit Integers - CPU Speed Comparison

In addition to the 64 bit test, another version has been produced. This one, BusMP64Int32 in DualCore.zip, is compiled as a 64 bit program but uses 32 bit integers instead of 64 bit integers. Results in MB/second are included above. These speeds can be converted to Millions of Instructions Per Second (MIPS) by dividing by four for 32 bit integers or by eight for 64 bit integers. The table below shows MIPS derived from the ReadAll data.

The program inner loop is run many times and executes 64 instructions, and it is unlikely that the additional registers, available for optimising 64 bit programs, will have any effect. Unlike Dhrystone, the 32 bit compiler produces faster speeds than the 64 bit version with 32 bit numbers. This might be due to the simpler instruction format of "and edi [eax-212]" compared with "and ecx [rax+rdx-488]". The slowest speeds are when using 64 bit integers, where twice as much data would need to be transferred for the same MIPS. The worst case is from RAM, where CPU execution speed is halved at 64 bits. Performance gains from using two CPUs are also worse at 64 bits.
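
The conversion from MB/second to MIPS, and the gain from two CPUs, can be sketched in C as follows. This is a minimal illustration of the arithmetic only; the function names are hypothetical and are not taken from the benchmark source.

```c
#include <assert.h>

/* Convert MBytes/second to approximate MIPS, assuming one instruction
   per integer read: 4 bytes for 32 bit integers, 8 bytes for 64 bit. */
double mbps_to_mips(double mbytes_per_sec, int bytes_per_integer)
{
    assert(bytes_per_integer == 4 || bytes_per_integer == 8);
    return mbytes_per_sec / bytes_per_integer;
}

/* Gain with two CPUs: total two thread MIPS over single thread MIPS. */
double two_cpu_gain(double two_thread_mips, double one_thread_mips)
{
    assert(one_thread_mips > 0.0);
    return two_thread_mips / one_thread_mips;
}
```

Using the Core 2 Duo ReadAll results at 6 KB, the 32 bit program gives 9220 / 4 = 2305 MIPS with one thread and 17004 / 4 = 4251 MIPS with two, a gain of about 1.84. The 64 bit program with 64 bit integers gives 17440 / 8 = 2180 MIPS and 20269 / 8, or about 2534 MIPS, a gain of about 1.16.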


              Core 2 Duo 2.4 GHz Vista 64      Athlon 64 x2 2.2 GHz XP x64

           1 Thread ReadAll MIPS

     Kbytes     32/32     64/32     64/64        32/32     64/32     64/64

          6      2305      2180      2180         2494      2351      2187
         24      2276      2150      2159         2527      2381      2228
         96      1979      1954      1704         1593      1644      1209
        384      1989      1963      1699          698       637       367
        768      1979      1955      1698          695       646       362
       1536      1958      1927      1670          697       654       362
      16380      1200      1219       644          686       661       359
     131070      1217      1217       640          685       650       359

           2 Threads ReadAll MIPS

          6      4251      3724      2534         4642      4372      2710
         24      4073      3710      2494         4915      4652      2952
         96      3512      3637      2141         3171      3274      1565
        384      3671      3436      2153         1202      1087       508
        768      3402      3348      2150         1198      1091       509
       1536      3318      3405      2073         1203      1079       513
      16380      1436      1424       644         1182      1064       511
     131070      1423      1380       658         1182      1062       510

           Gain With 2 CPUs

          6      1.84      1.71      1.16         1.86      1.86      1.24
         24      1.79      1.73      1.16         1.94      1.95      1.33
         96      1.77      1.86      1.26         1.99      1.99      1.29
        384      1.85      1.75      1.27         1.72      1.71      1.39
        768      1.72      1.71      1.27         1.72      1.69      1.41
       1536      1.69      1.77      1.24         1.73      1.65      1.42
      16380      1.20      1.17      1.00         1.72      1.61      1.42
     131070      1.17      1.13      1.03         1.73      1.63      1.42
    


RandMP Benchmark

This MP benchmark uses variable amounts of memory to measure speed via caches and RAM, first as a single thread, then as two threads to demonstrate the impact of two CPUs. It is based on RandMem (RandMem.zip) with serial and random read and read/write tests. Serial and random tests use the same code, via indexing, to read and write 4 byte words, e.g. a sequence such as tot = tot & xi[xi[i+ 0]] | xi[xi[i+ 2]] & --- for reading and xi[xi[i+ 0]] = xi[xi[i+ 2]]; for read/write. The inner loops have more than 600 CPU instructions. RandMP64 and RandMP32 versions, to run via Win64 and Win32, can be found in DualCore.zip.
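
The idea of using the same indexed code for serial and random access can be sketched in C as follows. This is not the RandMem source: the function names, the loop stride, the shortened expressions and the use of rand() for shuffling are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Serial case: xi[i] == i, so xi[xi[i]] steps through memory in order. */
void fill_serial(unsigned int *xi, size_t n)
{
    size_t i;
    assert(xi != NULL);
    for (i = 0; i < n; i++)
        xi[i] = (unsigned int)i;
}

/* Random case: shuffle the same index values (Fisher-Yates) so that
   xi[xi[i]] jumps around memory. rand() is used for brevity only. */
void fill_random(unsigned int *xi, size_t n, unsigned int seed)
{
    size_t i;
    fill_serial(xi, n);
    srand(seed);
    for (i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;
        unsigned int tmp = xi[i - 1];
        xi[i - 1] = xi[j];
        xi[j] = tmp;
    }
}

/* Read test: a shortened form of the sequence quoted in the text. */
unsigned int read_indexed(const unsigned int *xi, size_t n)
{
    unsigned int tot = 0;
    size_t i;
    assert(xi != NULL);
    for (i = 0; i + 4 <= n; i += 4)
        tot = (tot & xi[xi[i + 0]]) | xi[xi[i + 2]];
    return tot;
}

/* Read/write test: stored values are copies of existing entries,
   so every value remains a valid index (< n). */
void write_indexed(unsigned int *xi, size_t n)
{
    size_t i;
    assert(xi != NULL);
    for (i = 0; i + 4 <= n; i += 4)
        xi[xi[i + 0]] = xi[xi[i + 2]];
}
```

Only the contents of xi differ between the serial and random cases, so any difference in measured speed comes from the memory access pattern rather than from the instructions executed.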

The benchmark has four tests, Serial Read (RD), Serial Read/Write (RW), Random Read and Random Read/Write. With two threads, each has its own code but both use the same data, with the second thread starting at the half way point. Each has the same number of repeat passes, so variations in the time taken are reflected in the relative speeds of the two threads.
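
The sharing of one data array, with the second thread starting half way through, might be arranged along the following lines. This C sketch shows only the index arithmetic, not thread creation; the function names are hypothetical and the wrap-around at the end of the array is an assumption, not something confirmed by the benchmark source.

```c
#include <assert.h>
#include <stddef.h>

/* Starting word for a thread: thread 0 starts at word 0 and,
   with two threads, thread 1 starts at the half way point. */
size_t thread_start(size_t thread, size_t n_threads, size_t n_words)
{
    assert(n_threads > 0 && thread < n_threads);
    return (n_words / n_threads) * thread;
}

/* One pass over all n_words, beginning at start and wrapping round
   at the end of the array (the wrap-around is assumed here). */
unsigned int pass_from(const unsigned int *xi, size_t n_words, size_t start)
{
    unsigned int tot = 0;
    size_t k;
    assert(start < n_words);
    for (k = 0; k < n_words; k++)
        tot |= xi[(start + k) % n_words];
    return tot;
}
```

With 1536 words and two threads, the second thread would start at word 768. Both threads touch every word on each pass, which is why the read/write tests can lead to contention for the same cached data.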

Below are example results of the 32 bit version on a single CPU using Windows XP and of the 64 bit version on a dual core CPU via Windows XP x64 (the 32 bit version produces very similar results). Using one thread, RW speed is slower than RD, and speed reduces further with larger data sizes using random access. Running two threads on a single CPU produces about the same total speed as the single thread. With two CPUs, the total speed of read only tests is mainly around double that of a single thread, but speed via caches with read/write can be worse than for a single thread (or single CPU).

Looking at dual core results, with Serial RW and Random RW at 6 KB, the CPU is executing at around 1360 Million Instructions Per Second (MIPS) or 0.62 MIPS/MHz with a single thread. With two threads, each CPU runs at 340 MIPS (0.15 MIPS/MHz) with Serial RW and 154 MIPS (0.07 MIPS/MHz) with Random RW. This can probably be put down to cache contents being invalidated to maintain data coherency when both CPUs write to the same data. Modifying the benchmark, so that each thread accesses its own data array, enables RW cache tests to run at 1360 MIPS on each CPU.
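
The MIPS and MIPS/MHz figures quoted above follow from the results tables below, using the divisor of 3.2 given under those tables and the measured clock speed of 2211 MHz. A minimal C sketch of the arithmetic, with hypothetical function names, is:

```c
#include <assert.h>

/* The results tables suggest dividing MBytes/second by 3.2
   to obtain approximate MIPS for RandMP. */
#define RANDMP_MBYTES_PER_MIPS 3.2

double randmp_mips(double mbytes_per_sec)
{
    return mbytes_per_sec / RANDMP_MBYTES_PER_MIPS;
}

/* Instructions per clock cycle, expressed as MIPS per MHz. */
double mips_per_mhz(double mips, double mhz)
{
    assert(mhz > 0.0);
    return mips / mhz;
}
```

For example, Serial RW at 6 KB with one thread is 4346 MB/second, or about 1358 MIPS and roughly 0.61 MIPS/MHz at 2211 MHz, in line with the rounded figures above. With two threads, 1090 MB/second gives about 340 MIPS and 494 MB/second gives about 154 MIPS.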

The above comments relate to results on the PC with an AMD CPU, using Windows XP Pro x64. Later results are for a Core 2 Duo system using 64-Bit Windows Vista. RAM on this system is nearly twice as fast, but some of the tests show speeds up to 4 times faster. Measured L1 cache speeds are much faster on the Read/Write tests, as are those via L2 cache.


           AMD Athlon(tm) XP 2600+ Measured 2088 MHz

 #####################################################################
  RandMP Write/Read Test 32 bit Version 1.0 Sat Aug 27 19:33:14 2005
 
 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86


               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
 1 Thread
 Serial RD     7773    7748    3616     895     896     889     891     892
 Serial RW     3655    3657    2193     660     663     661     663     658
 Random RD     7527    7599    2165     628     313     240     192      57
 Random RW     3686    3693    2034     439     190     141     116      44

 2 Threads
 Serial RD1    4510    4522    2043     444     448     466     447     534
 Serial RD2    3911    3906    1813     443     444     442     442     442

 Serial RW1    1890    2133    1153     346     328     348     349     340
 Serial RW2    1832    1828    1097     327     342     328     328     326

 Random RD1    4429    4297    1134     311     169     115     103      31
 Random RD2    3781    3803    1067     302     151     116      92      28

 Random RW1    1928    1941    1050     219      95      75      61      24
 Random RW2    1837    1849    1012     220      92      71      58      22

           For approximate speed in MIPS divide MBytes/Second by 3.2


 AMD Athlon 64 X2 Dual Core Processor 4200+ Measured 2211 MHz, XP Pro x64

 #####################################################################
  RandMP Write/Read Test 64 bit Version 1.0 Sat Aug 27 19:17:58 2005
 
 Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64

               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
 1 Thread
 Serial RD     8552    8518    5115    5132    2369    2353    2344    2305
 Serial RW     4346    4340    2702    2697    1349    1352    1354    1351
 Random RD     8176    8244    3733    1620     872     389     255     170
 Random RW     4384    4332    2865    1483     563     236     161     136

 2 Threads
 Serial RD1    8374    8532    5064    5010    2075    2096    2021    2026
 Serial RD2    8532    8394    5176    5108    2111    2062    2049    2054

 Serial RW1    1090    1172    1110    1096    1041     867     864     866
 Serial RW2    1083    1136    1089    1076    1049     866     855     824

 Random RD1    8147    8024    3683    1638     485     193     126     100
 Random RD2    8154    8158    3701    1637     485     195     125     101

 Random RW1     494     489     448     406     352     152      86      75
 Random RW2     495     490     449     406     343     152      87      75

           For approximate speed in MIPS divide MBytes/Second by 3.2

 
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit

 #####################################################################
  RandMP Write/Read Test 64 bit Version 1.0 Sat Aug 27 19:17:58 2005

               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
 1 Thread
 Serial RD     8742    9128    7498    7468    7486    7429    4417    4391
 Serial RW     8428    9332    7665    7663    7662    7165    2442    2397
 Random RD     8918    9404    4244    3304    3183    2790     638     458
 Random RW     8014    8523    3390    2752    2656    2462     418     289

 2 Threads
 Serial RD1    8435    9094    7334    7336    7365    7238    4024    2817
 Serial RD2    8460    8943    7183    7168    7201    7159    3962    2764

 Serial RW1    2007    2181    6931    6995    6984    6738    1643    1521
 Serial RW2    2010    2174    6789    6801    6806    6651    1568    1433

 Random RD1    8576    9392    3530    2695    2604    2292     450     443
 Random RD2    8598    9180    3478    2666    2553    2256     455     443

 Random RW1     730     759    1409    1984    1991    1923     282     292
 Random RW2     733     759    1398    1955    1961    1897     277     292


Roy Longbottom September 2009

The new Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection