PC Benchmarks - For 64 Bit Windows - Roy Longbottom - Including Core 2 Duo and Athlon 64 X2

PC Benchmarks For 64 Bit Windows

Index
Conversion	System ID	Maximum CPU Speed	Maximum MP Speed
Classic Benchmarks	Linpack	Livermore Loops	Dhrystone
Whetstone MP	BusSpeed MP	Rand MP	More 64 Bit Tests

Main Page	Other Results	Download Win64.zip	Download DualCore.zip

This page was set up as 770 pixels wide and accommodates preformatted text <PRE> results tables. Some browsers
produce monospaced font of an unexpected size but this might be adjustable via browser Preferences.

Conversion

Most of my benchmarks have been converted to run as 64 bit programs and have been tested via Windows XP Pro x64 and 64 bit Windows Vista. The first step was to download Microsoft Platform SDK for Windows Server 2003 SP1. This includes a 64 bit compiler (cl), assembler (ml64) and linker. These can be used via the command line or in a .BAT file and the package can be installed using Win32 or Win64. For Windows programs, .RC files can be converted (rc) to .RES files and the latter to .OBJ (cvtres). Library names used are the same as Win32, like GDI32.LIB.

The 64 bit compiler does not accept asm type inline assembler functions, so these have to be converted to MASM format, but headers and .INC files are different from the 32 bit varieties. Programs compiled for 64 bit working cannot use the old x87 floating point instructions nor MMX instructions. The former have to be converted to SSE1/2/3 instructions. MMX instruction names are the same as some provided in SSE2, but memory addresses have to be changed to suit 128 bit registers instead of 64 bits. 32 bit instructions can still be used, including CPUID and RDTSC. The only complication appears to be that push/pop should refer to a 64 bit register (push rdx instead of push edx). There appear to be complications in passing parameters to assembly code, but I have avoided these by using global variables.

The SDK includes a 32 bit compiler that checks for 64 bit compatibility and has options to use SSE or SSE2 instructions for floating point. In some cases, this produces identical code to the 64 bit version. In other cases, it restricts the number of registers used, for 32 bit compatibility. 32 bit MASM type assembly requires a separate assembler, such as the one that comes with Microsoft Visual C++ 6.0 Pro. In order to compare 64 versus 32 bit speeds, some of the benchmarks have also been compiled using the SDK 32 bit compiler.

The C/C++ and Assembler source codes for these benchmarks are available in NewSource.zip. The original versions can be obtained via the Main Page.


More 64 Bit Benchmarks

Windows, DirectDraw, OpenGL and Image Processing benchmarks have also been converted to run at 64 bits and a DirectX 9 benchmark has also been produced. See 64 Bit Graphics Tests.htm. Download benchmarks and C/C++ source codes via Video64.zip. Then, there are benchmarks for disks, CD/DVD drives, networks and peripherals in More64bit.zip, with results in 64 Bit Disk Tests.htm.

The latest conversions, including source code, are also in More64bit.zip. These are three versions of my Fast Fourier Transform benchmarks (see also FFTGraf.zip), SSE/SSE2 benchmark and burn-in/reliability tests (see also SSE3Dnow.zip) and BusSpd2K burn-in/reliability tests (see also BusSpd2K.zip). The latter burn-in tests have been modified to demonstrate paging speeds more quickly. See Paging.htm for results via 64-Bit Vista and XP Pro x64.


Other Results

Results of 64 bit tests, descriptions and some comparisons are included in results reports for 32 bit versions.

Whetstone Results.htm	Linpack Results.htm
Livermore Loops Results.htm	Dhrystone Results.htm
WhatCPU Results.htm	BusSpd2K Results.htm
SSE3Dnow Results.htm	Randmem Results.htm
FFTgraf Results.htm	BMPspeed Results.htm
64 Bit Graphics Tests.htm	BurnIn64.htm
DualCore.htm	DiskGraf Results.htm
CDDVDSpd Results.htm	VideoWin Results.htm
DirectDraw Results.htm	Direct3D Results.htm
OpenGL Results.htm


System ID

Each benchmark includes a new system identification test. This is limited because Intel appear to make significant changes with each new CPU, and it is now much too complicated to identify dual CPUs with HT enabled, or cache sizes, on a range of CPUs. Windows functions also lag considerably behind hardware capabilities. The following shows details provided by the 64 bit programs, then differences at 32 bits. Note that, on AMD and Intel CPUs with 64 bit working, info.wProcessorArchitecture from GetSystemInfo(&info) indicates PROCESSOR_ARCHITECTURE_AMD64. With 32 bit operation, PROCESSOR_ARCHITECTURE_INTEL is supplied.

 AMD Windows XP Pro x64

 CPUID and RDTSC Assembly Code
 CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00020FB1
 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz
 Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow,
 Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
 AMD64 processor architecture, 2 CPUs
 Windows NT Version 5.2, build 3790, Service Pack 1
 Memory 1024 MB, Free 656 MB
 User Virtual Space 8388608 MB, Free 8388557 MB

 Intel Windows Vista 64-Bit

 CPUID and RDTSC Assembly Code
 CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
 Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
 AMD64 processor architecture, 2 CPUs
 Windows NT Version 6.0, build 6000,
 Memory 4094 MB, Free 3207 MB
 User Virtual Space 8388608 MB, Free 8388547 MB

Differences Win32 and Win64 at 32 bits


 Intel processor architecture, 2 CPUs 
 User Virtual Space 4096 MB, Free 4047 MB - Win64
 User Virtual Space 2048 MB, Free 2022 MB - Win32
 Memory 4095 MB, Free 3103 MB             - Win64
 Memory less than 3.5 GB                  - Win32

The C/C++ and Assembler source codes for these utilities are available in NewSource.zip.
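
As an illustration of the architecture check described above, the following is a minimal sketch and not the code from NewSource.zip. The helper names architecture_name and describe_system are hypothetical. GetSystemInfo, SYSTEM_INFO and the PROCESSOR_ARCHITECTURE constants are the Windows ones named in the text; the two constants are repeated with their SDK values only so that the sketch also compiles on other systems.

```c
#include <stdio.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Values as defined in the Windows SDK headers, repeated here only so
   that the sketch compiles on systems without windows.h */
#define PROCESSOR_ARCHITECTURE_INTEL 0
#define PROCESSOR_ARCHITECTURE_AMD64 9
#endif

/* Hypothetical helper: convert a wProcessorArchitecture code to report text */
const char *architecture_name(unsigned code)
{
    switch (code) {
    case PROCESSOR_ARCHITECTURE_AMD64: return "AMD64";
    case PROCESSOR_ARCHITECTURE_INTEL: return "Intel";
    default:                           return "Unknown";
    }
}

/* Hypothetical helper: write a line such as
   "AMD64 processor architecture, 2 CPUs" into buf.
   Returns 1 when Windows supplied the details, 0 otherwise. */
int describe_system(char *buf, size_t size)
{
#ifdef _WIN32
    SYSTEM_INFO info;
    GetSystemInfo(&info);
    snprintf(buf, size, "%s processor architecture, %lu CPUs",
             architecture_name(info.wProcessorArchitecture),
             (unsigned long)info.dwNumberOfProcessors);
    return 1;
#else
    snprintf(buf, size, "Not Windows, %u bit pointers",
             (unsigned)(8 * sizeof(void *)));
    return 0;
#endif
}
```

Calling describe_system(text, sizeof text) and printing text should give the architecture line shown in the reports. A 32 bit program that needs to know it is running under Win64 could call GetNativeSystemInfo instead, which is an alternative to the method described here.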


Maximum CPU Speed

This benchmark, CPUID64 (in Win64.zip), is based on the original in Whatcpu.zip. The latter executes a long series of assembler coded add instructions to 1, 2, 3 and 4 registers to identify maximum speeds of integer, floating point and MMX instructions. The 64 bit version has the same 32 bit integer test and an identical one using 64 bit integers. SSE/SSE2 32/64 bit floating point tests are the same. As indicated above, normal floating point and MMX instructions are invalid under Win64. Instead of MMX, SSE2 32 bit and 64 bit integer add speeds are measured. A revised 32 bit version is included in Win64.zip to show SSE2 integer speeds.

In the following example of 64 bit results, 64 bit integer Millions of Instructions Per Second (MIPS) are similar to 32 bit speeds. That would be expected, as 32 bit operations use half of the real register size. With more pipelines, 64 bit normal integer MIPS can be faster than those using integer SSE2 instructions. As usual, 64 bit floating point MFLOPS (Millions of FLoating point Operations Per Second) run at half speed compared with 32 bits (2 words versus 4 words in 128 bit registers). This is also the case with AMD on 32/64 bit SSE2 integers, but not with this Intel CPU, where 32 bit SSE2 integer speeds are three to four times those at 64 bits.

 CPU ID and Speed Test 64 bit Version - Windows XP Pro x64
 Assembled with Microsoft ml64.exe Version 8.00.40310.39

 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz

 Speeds adding to       1 Register 2 Registers 3 Registers 4 Registers

 32 bit Integer MIPS          2430        4864        5650        6080
 64 bit Integer MIPS          2430        4864        6356        6485
 32 bit SSE2 Int MIPS         4421        8895        8844        8844
 64 bit SSE2 Int MIPS         2211        4419        4447        4422
 32 bit SSE MFLOPS            2214        4421        4421        4434
 64 bit SSE2 MFLOPS           1105        2210        2217        2217


 CPU ID and Speed Test 64 bit Version - Windows Vista 64-Bit

 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz

 Speeds adding to       1 Register 2 Registers 3 Registers 4 Registers

 32 bit Integer MIPS          2609        4081        5261        7044
 64 bit Integer MIPS          2613        4177        5225        7044
 32 bit SSE2 Int MIPS         9488       14638       17542       17466
 64 bit SSE2 Int MIPS         2401        4575        4585        4575
 32 bit SSE MFLOPS            3201        6405        9607        9607
 64 bit SSE2 MFLOPS           1601        3202        4804        4804
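
The 2 words versus 4 words argument above can be checked with a little arithmetic. This is only a sketch with hypothetical function names, using the 1 Register SSE2 integer speeds from the tables as examples.

```c
/* Number of data words that fit in one SIMD register,
   for example 128 bit registers holding 32 or 64 bit words */
unsigned words_per_register(unsigned register_bits, unsigned word_bits)
{
    return register_bits / word_bits;
}

/* Ratio of two measured speeds, such as 32 bit MIPS over 64 bit MIPS */
double speed_ratio(double speed_32_bit, double speed_64_bit)
{
    return speed_32_bit / speed_64_bit;
}
```

With 128 bit registers, words_per_register gives 4 at 32 bits and 2 at 64 bits, so a ratio near 2 is the expected case. speed_ratio(4421, 2211) for the AMD CPU is close to 2, whereas speed_ratio(9488, 2401) for the Intel CPU is nearer 4.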

Download Win64.zip

Maximum MP Speed

This benchmark, CPUIDMP64 (in DualCore.zip), uses some of the instruction sequences from CPUID64. First, an integer test and an SSE floating point test are run separately. They are then run as two threads of equal priority, where both should run at full speed with 2 CPUs. Finally, four threads are run, with one FP test at normal priority plus another FP test and two integer tests at lower priority. With 2 CPUs, the first FP test should run at full speed and the others at the whim of the OS. A 32 bit version is included in DualCore.zip, using the same 32 bit instructions. Results of 64 bit and 32 bit tests can be expected to be the same, except possibly for sharing with 4 threads.

When run on a single CPU, the floating point and integer tests are likely to run at half speed with two threads. With four threads, the lower priority tests might obtain a small amount of time.

 CPU ID and MP Speed Test 64 bit Version - Windows XP Pro x64
 Assembled with Microsoft ml64.exe Version 8.00.40310.39

 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz

 Speed adding to registers     Pass 1    Pass 2    Pass 3

 Separate Tests
 32 bit SSE MFLOPS               4411      4411      4415
 32 bit Integer MIPS             6068      6070      6070

 Two Threads Equal Priority
 32 bit SSE MFLOPS               4405      4409      4408
 32 bit Integer MIPS             6067      6053      5992

 Four Threads, First Normal Priority, Others Normal - 1
 32 bit SSE MFLOPS               4401      4411      4410
 32 bit Integer MIPS             2903      2053      2898
 32 bit SSE MFLOPS                  0      1433         0
 32 bit Integer MIPS             3454      2227      3455


 CPU ID and MP Speed Test 64 bit Version - Windows Vista 64-Bit

 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz

 Speed adding to registers     Pass 1    Pass 2    Pass 3

 Separate Tests
 32 bit SSE MFLOPS               9582      9595      9600
 32 bit Integer MIPS             6934      6936      6950

 Two Threads Equal Priority
 32 bit SSE MFLOPS               9501      9600      9600
 32 bit Integer MIPS             7002      7006      7013

 Four Threads, First Normal Priority, Others Normal - 1
 32 bit SSE MFLOPS               9592      9575      9576
 32 bit Integer MIPS             3447      3414      3329
 32 bit SSE MFLOPS               4844         0         0
 32 bit Integer MIPS                0      3337      3366

Download DualCore.zip

Classic Benchmarks

The Classic Benchmarks are the first programs that set standards of performance for computers. Details are available from Classic.htm and benchmark programs and results obtained via BenchNT.zip. The Linpack, Livermore Loops and Whetstone Benchmarks have been compiled for 64 bit systems and for 32 bit PCs using automatic compilation with SSE or SSE2 instructions. Dhrystone Benchmarks are now included. The benchmarks and sample results can be obtained from Win64.zip or DualCore.zip and source codes from NewSource.zip.

Linpack and Livermore Loops Benchmarks

Linpack and Livermore Loops benchmarks use double precision floating point, so are compiled with SSE2 instructions. The compilers are not as efficient as they could be, producing instructions using one 64 bit word in the 128 bit registers, rather than two for Single Instruction Multiple Data (SIMD) operation. The original 64 bit Linpack results on Core 2 Duo were disappointing, but this was corrected in 2009 by using a later version of the compiler.

 Linpack Benchmark Results

 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz and XP Pro x64

      Original       SSE2 Win32       SSE2 Win64

    838 MFLOPS      1014 MFLOPS      1044 MFLOPS
                   2009 Compilation  1091 MFLOPS

 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit

      Original       SSE2 Win32       SSE2 Win64

   1315 MFLOPS      1480 MFLOPS       823 MFLOPS
                   2009 Compilation  1602 MFLOPS


There are 24 Livermore Loops, with the performance of each measured in MFLOPS. Overall averages are also produced, the Geometric Mean being the official average quoted. Following are results for the original Watcom version, 32 bits with SSE2 and 64 bits with SSE2. The 64 bit compilation can use up to 16 registers to speed up processing. However, some 32 bit SSE2 results are faster, as are a few from the original Watcom version. The Intel Core 2 Duo results, compiled for 64 bits, are more frequently slower than at 32 bits, but this was again corrected in a 2009 recompilation. See below for Whetstone Benchmark.

 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz and XP Pro x64

 ********************************************************

 Livermore Loops Benchmark Original Optimised via C/C++

 MFLOPS for 24 loops
  2032.4  1312.4   345.7  1031.6   275.6   334.9
  2565.9  2288.0  2337.2   121.9   183.7   550.5
    49.8   131.6   393.5   350.4   217.7  1474.2
   309.3   290.9   612.8   458.5   751.6   294.6

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  2565.9   740.9   460.5   285.7    48.4

 ********************************************************

 Livermore Loops Benchmark 32 Bit Version
 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 MFLOPS for 24 loops
  1619.1  1187.1   717.9  1068.3   244.9   606.3
  1815.6  1727.1  1907.6   670.2   200.4   549.6
   169.4   317.6   737.7   654.7   684.2  1452.0
   455.0   762.5  1031.2   406.1   590.3   219.2

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  1907.6   798.3   640.1   501.3   162.3

 ********************************************************

 Livermore Loops Benchmark 64 Bit Version
 Via 64 Bit Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64

 MFLOPS for 24 loops
  1927.4  1118.3  1096.7  1054.2   252.2   320.3
  2284.1  2099.2  1756.2   632.4   183.6   731.0
   173.1   306.3   552.9   732.4   922.6  1441.3
   500.8   881.3   328.6   351.8   758.0   397.8

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  2284.1   843.7   660.7   509.1   165.7

 ********************************************************

 Livermore Loops Benchmark 64 Bit Version - 2009 Compilation
 Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64

 MFLOPS for 24 loops
  1954.0  1151.1  1094.2   948.6   249.3   605.2
  2067.6  2022.1  1783.5   655.1   170.8   731.2
   195.8   305.8   512.1   723.4   902.1  1332.5
   501.6   730.1  1042.7   359.6   736.2   398.3

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  2067.6   846.9   679.2   533.9   170.8

 ######################################################

 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit

 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 MFLOPS for 24 loops
  1960.1  1357.0   788.5  1471.0   341.2   891.9
  2526.4  2044.9  2153.0   860.1   265.8  1181.5
   458.5   555.0   444.0  1018.2   824.4  1073.6
   505.2   632.3  1235.3   194.7   772.0   278.2

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  2526.4   990.3   803.8   639.0   194.7

 ********************************************************

 Via Microsoft 64 Bit C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64

 MFLOPS for 24 loops
   626.3   835.2   594.5   589.0   341.3   406.3
   886.7  1040.6  1098.2   391.0   239.0   398.3
   349.7   397.8   320.7   857.1  1038.9   714.0
   639.2   429.5   418.0   227.3   838.1   673.5

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  1175.0   592.9   537.0   484.9   227.2

 ********************************************************

 2009 Compilation
 Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64

 MFLOPS for 24 loops
  1833.8  1221.5  1505.3  1290.3   340.5   858.4
  2760.0  2375.6  2183.4   851.3   264.5  1183.2
   508.6   561.6   446.6   909.1  1067.1  1132.1
   637.8   665.0  1233.7   362.7   928.6   715.9

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  2798.8  1066.4   893.2   749.0   260.5
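
For readers unfamiliar with the five Overall Ratings, the following sketch shows how such figures can be derived from a set of loop speeds. It is an illustration with hypothetical names (Ratings, overall_ratings), not the benchmark's own code, and the ratings printed above need not be reproduced exactly by the loop speeds listed with them. The geometric mean is found here by bisection purely to keep the sketch free of the maths library.

```c
#include <stddef.h>

typedef struct {
    double maximum, average, geomean, harmean, minimum;
} Ratings;

/* x multiplied by itself n times (n is a small loop count) */
static double int_power(double x, size_t n)
{
    double result = 1.0;
    while (n-- > 0) result *= x;
    return result;
}

/* Hypothetical helper: summarise n positive MFLOPS values (n >= 1) */
Ratings overall_ratings(const double *mflops, size_t n)
{
    Ratings r;
    double sum = 0.0, reciprocal_sum = 0.0, product = 1.0;
    double low, high;
    size_t i;
    int step;

    r.maximum = r.minimum = mflops[0];
    for (i = 0; i < n; i++) {
        if (mflops[i] > r.maximum) r.maximum = mflops[i];
        if (mflops[i] < r.minimum) r.minimum = mflops[i];
        sum += mflops[i];
        reciprocal_sum += 1.0 / mflops[i];
        product *= mflops[i];
    }
    r.average = sum / (double)n;             /* arithmetic mean */
    r.harmean = (double)n / reciprocal_sum;  /* harmonic mean   */

    /* Geometric mean is the nth root of the product. It always lies
       between the minimum and maximum, so bisect that interval. */
    low = r.minimum;
    high = r.maximum;
    for (step = 0; step < 100; step++) {
        double mid = 0.5 * (low + high);
        if (int_power(mid, n) < product) low = mid; else high = mid;
    }
    r.geomean = 0.5 * (low + high);
    return r;
}
```

The harmonic mean is never greater than the geometric mean, which is never greater than the arithmetic mean, in line with the Average, Geomean and Harmean ordering in every results block above.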


Dhrystone Benchmarks

The Dhrystone tests are integer/fixed point benchmarks measuring performance in Millions of Instructions Per Second relative to the 1977 Digital Vax 11/780, accepted as the first 1 MIPS minicomputer. Dhrystone 1 could easily be over optimised, where some of the code is not executed, and this is probably reflected in the results. Dhrystone 2 was intended to overcome this deficiency.

Three versions are available, one compiled for 32 bit Windows and two for 64 bit systems. One of the latter uses 32 bit integer variables and the other 64 bit integers. Results for 32 bit integers show that 64 bit compilations are up to 56% faster than the 32 bit versions. Much of the gain appears to be due to a different translation of the C source code but, with twice as many registers available for optimisation at 64 bits, there could be some performance improvement from that. Regarding 64 bit compilations, the versions using 64 bit integers were both slower than with 32 bit integers, by 27% in one case. This might be due to the higher volume of data from cache with 64 bit words, but limited compilations were inconclusive when some of the code was omitted. Further 32/64 bit integer comparisons can be found for BusSpdMP.

 Dhrystone Benchmark Results

 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz and XP Pro x64

 ********************************************************
 32 Bit Integers - Via MS 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 VAX MIPS rating Dhrystone 1 and 2:    6104.33    3719.73

 ********************************************************
 32 Bit Integers - Via MS C/C++ Compiler Version 14.00.40310.41 for AMD64

 VAX MIPS rating Dhrystone 1 and 2:    8668.31    5213.64

 ********************************************************
 64 Bit Integers - Via MS C/C++ Compiler Version 14.00.40310.41 for AMD64

 VAX MIPS rating Dhrystone 1 and 2:    8548.73    4654.28

 ######################################################

 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit

 ********************************************************
 32 Bit Integers - Via MS 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 VAX MIPS rating Dhrystone 1 and 2:    8094.18    5476.09

 ********************************************************
 32 Bit Integers - Via MS C/C++ Compiler Version 14.00.40310.41 for AMD64

 VAX MIPS rating Dhrystone 1 and 2:   12600.13    8549.73

 ********************************************************
 64 Bit Integers - Via MS C/C++ Compiler Version 14.00.40310.41 for AMD64

 VAX MIPS rating Dhrystone 1 and 2:   11725.88    6247.67

 ######################################################
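
The percentage differences quoted for Dhrystone can be checked from the VAX MIPS ratings. The function below is a hypothetical helper written for this illustration only.

```c
/* Percentage by which other differs from baseline
   (positive means faster, negative means slower) */
double percent_change(double baseline, double other)
{
    return (other / baseline - 1.0) * 100.0;
}
```

Using the Core 2 Duo Dhrystone 2 ratings, percent_change(5476.09, 8549.73) is a little over 56 for the 64 bit compilation against 32 bits, and percent_change(8549.73, 6247.67) is about minus 27 for 64 bit integers against 32 bit integers.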


Whetstone Benchmark

The Whetstone Benchmark produces an overall rating in terms of Millions of Whetstone Instructions Per Second (MWIPS). The version used also produces speeds in MFLOPS and MOPS for the eight test loops, three with straight floating point, two with intrinsic functions and three with integer type operations. An overall average (Geometric) is produced for the first three and equivalent VAX MIPS for the last three. The single precision version of the benchmark was compiled to use SSE instructions. This is quite a bit faster than the original version but Vax MIPS are over-inflated due to excessive optimisation.

MP Version

The program was modified to use a second thread to execute some of the code and demonstrate the use of two CPUs. The second thread is run at THREAD_PRIORITY_BELOW_NORMAL, which sees little time on a single CPU. With dual CPUs, both threads should demonstrate full speed. One complication is that the compiler would not produce the same code for the copy used by the second thread, so there is some variation in speeds. This MP version was also compiled for 64 bit operation and results are shown below for this, 32 bit MP and 32 bit SSE versions. Floating point speed is similar on both MP versions (and around double that of a single processor), but the 64 bit variety runs one of the integer tests faster by using more registers.

 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz and XP Pro x64

 Whetstone Single Precision SSE benchmark - single CPU version
 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

  MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
   Gmean   MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

     583  12197   2313    655    656    461   51.0   36.3   1988   2210   3305

 ********************************************************************************

 Whetstone Single Precision MP SSE Benchmark Wed Aug 10 12:38:03 2005
 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

  MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
   Gmean   MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

    1164  19030   4506   1310   1308    920    102   69.7   3598   4139   3702
 Thread 1                 642    642    452   50.7   34.8   1796   2062   2690
 Thread 2                 668    666    467   50.8   34.9   1802   2078   1013

 ********************************************************************************

 Whetstone Single Precision MP SSE Benchmark Fri Aug 05 12:18:12 2005
 Via Microsoft 64 bit C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64

  MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
   Gmean   MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

    1086  25950   4983   1325   1145    845    151   67.1   3610   4204   9210
 Thread 1                 661    572    468   75.2   33.5   1804   2099   8067
 Thread 2                 663    573    377   76.0   33.6   1806   2105   1143

 #################################################################################

 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit

 Whetstone Single Precision SSE Benchmark Fri Jul 20 17:06:25 2007
 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

  MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
   Gmean   MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

     728  18421   2419    851    855    530   57.2   29.7   1994   1747  14352

 ********************************************************************************

 Whetstone Single Precision MP SSE Benchmark Fri Jul 20 17:06:08 2007
 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

  MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
   Gmean   MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

    1439  23554   4704   1700   1689   1037    113   58.1   3720   3738   7518
 Thread 1                 845    826    517   56.5   28.9   1871   1797   6439
 Thread 2                 855    863    520   56.8   29.3   1848   1941   1079

 ********************************************************************************

 Whetstone Single Precision MP SSE Benchmark Fri Jul 20 17:06:45 2007
 Via Microsoft 64 bit C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64

  MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
   Gmean   MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

    1417  26543   5661   1723   1608   1026    157   77.4   3645   3096  13257
 Thread 1                 862    805    530   78.1   38.5   1809   1535  12268
 Thread 2                 861    803    496   78.4   39.0   1837   1560    989


BusSpeed MP Benchmark

This MP benchmark uses variable amounts of memory to measure speed via caches and RAM, first as a single thread, then as two threads to demonstrate the impact of two CPUs. It is based on BusSpd2K (BusSpd2K.zip), using integer AND instructions to a single register, streaming data from caches or RAM. The first test reads one word with a 32 word address increment for the next word. That is 128 bytes with 32 bit words and 256 bytes with 64 bit words. The address increment reduces for following tests, down to one word (ReadAll). The last test reads all the data as 16 byte SSE2 words. With two threads, each reads all the data, with the total number of passes the same as with one thread. BusSpd2K can produce some faster results, as streaming to two registers is used for some tests. Except for SSE2, C compiler code is used for the tests, as this is similar to the assembly code in BusSpd2K. Results of benchmarks compiled for 32 and 64 bit systems are shown below. The benchmarks are in DualCore.zip.
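
The reading pattern described above can be sketched as follows. This is a simplified illustration with hypothetical function names, not the BusSpdMP source, which unrolls the inner loop into 64 AND instructions and times repeated passes.

```c
#include <stddef.h>
#include <stdint.h>

/* AND one word out of every inc_words into a single accumulator,
   as in the Inc32wds to ReadAll tests (inc_words = 32 down to 1) */
uint32_t and_with_increment(const uint32_t *data, size_t words,
                            size_t inc_words)
{
    uint32_t acc = 0xFFFFFFFFu;
    size_t i;
    for (i = 0; i < words; i += inc_words)
        acc &= data[i];
    return acc;
}

/* Address increment in bytes for a given word size */
size_t increment_bytes(size_t inc_words, size_t word_bytes)
{
    return inc_words * word_bytes;
}
```

For the first test, increment_bytes(32, 4) gives the 128 bytes quoted for 32 bit words and increment_bytes(32, 8) the 256 bytes for 64 bit words. Returning the accumulated value gives the compiler a reason to keep the reads in the generated code.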

Looking at RAM speed, the system reads data in 64 byte bursts, equivalent to 16 word address increments at 32 bits and 8 word increments at 64 bits. This is demonstrated by little or no performance gain with larger address increments. Speed at 64 bits will appear to be twice as fast as at 32 bits, as twice as much data is being used out of the burst. Typical burst speed at 32 bits is 319 MB/sec and maximum speed can be assumed to be 16 times this, or 5104 MB/sec (maximum theoretical 2 x 3200). In this case, the memory buses appear to be saturated and there is no gain with 2 CPUs. As the address increment is reduced, speed increases to around 3000 MB/sec using one thread or 4700 MB/sec with two threads. These are similar speeds to a BusSpd2K two program test (see DualCore.htm), indicating a performance limitation with a single CPU.
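
The burst arithmetic above can be written out as a short sketch. The function names are hypothetical and the 64 byte burst size is the one assumed in the text.

```c
/* Number of words delivered in one memory burst */
unsigned words_per_burst(unsigned burst_bytes, unsigned word_bytes)
{
    return burst_bytes / word_bytes;
}

/* Estimated full burst speed in MB/sec, given the speed measured
   when only one word out of each burst is used */
unsigned estimated_burst_mb_per_sec(unsigned one_word_mb_per_sec,
                                    unsigned burst_bytes,
                                    unsigned word_bytes)
{
    return one_word_mb_per_sec * words_per_burst(burst_bytes, word_bytes);
}
```

With 64 byte bursts, words_per_burst gives 16 at 32 bits and 8 at 64 bits, and estimated_burst_mb_per_sec(319, 64, 4) gives the 5104 MB/sec quoted.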

Results via caches are strange. A sample from 32 bit BusSpd2K is included below to explain possible reasons. Firstly, BusSpd2K uses just MOV instructions for the burst tests. It shows halving of speed from caches from 32 byte (8 word) increments to 64 byte (16 word) increments, and BusSpdMP goes one step further, to 32 word address increments. BusSpd2K also shows half speed from L1 cache when ANDing to 1 register instead of 2. With BusSpdMP, the compiler would not translate the code to use two registers, as had been hoped for.

Most cache based results do not show expected performance gains on using 2 CPUs. Inner loops of the tests have 64 AND instructions and an outer loop runs this for around 0.5 seconds (a long time, and there is little difference at 0.1 seconds). Maybe the cause is cache flushing, with some data coming from RAM.

The above comments relate to the tests on the PC with an AMD CPU, using Windows XP Pro x64. Results for a Core 2 Duo PC, using 64-Bit Windows Vista, are given after these. This PC has faster RAM, larger and faster L2 cache and faster operation on SSE2 instructions.

 AMD Athlon 64(tm) X2 Dual Core Processor 4200+ Measured 2211 MHz, XP Pro x64

 ##############################################################################

 Old BusSpd2K Performance Test

 MBytes/Second

          16wds   8wds
           MovI   MovI   MovI   MovI   MovI   MovI   AndI   AndI   MovM   MovM
 Memory    Reg2   Reg2   Reg2   Reg2   Reg1   Reg2   Reg1   Reg2   Reg1   Reg8
 KBytes   Inc64  Inc32  Inc16   Inc8   Inc4   Inc4   Inc4   Inc4   Inc8   Inc8

      4    8070  15711  16498  17247  16538  16763   8670  16454  34291  34254
      8    8437  16391  16544  17044  16838  17064   8765  16787  34148  35264
    128     639   1281   2437   4400   7782   7780   6539   6694   8882   8684
    256     651   1285   2411   4418   7786   7776   6448   6688   8936   8718
  65536     315    609   1009   1478   2789   2792   2656   2842   2940   2940
 131072     315    610   1007   1457   2793   2791   2704   2803   2940   2941

 #####################################################################

 MP Bus Speed Test 32 bit Version 1.2 Thu Apr 23 12:59:52 2009
 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
 SSE2 Assembled with Microsoft ml.exe Version 6.15.8803

 Part 1 - Single Thread MBytes/Second

 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

      6     8150     8479    10384    10088     9866     9976    17421
     24     8394     8599    10477    10098    10183    10109    17484
     96      745      659     1245     2371     4908     6372     8930
    384      355      311      568      889     1443     2791     2967
    768      358      310      564      887     1432     2781     2946
   1536      360      310      565      887     1436     2788     2961
  16380      352      313      561      877     1384     2745     2910
 131070      351      314      562      877     1415     2739     2917

 Part 2 - Two Threads Total MBytes/Second

 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

      6     9245    10399    14752    16214    17382    18566    34846
     24    11382    13886    18134    18714    19568    19658    34652
     96     1475     1314     2474     4705     9725    12685    17789
    384      320      329      666     1303     2368     4809     4740
    768      318      329      664     1302     2365     4793     4728
   1536      318      328      665     1304     2372     4812     4743
  16380      319      331      665     1291     2334     4729     4683
 131070      320      330      661     1289     2332     4727     4690

 ##############################################################################

 MP Bus Speed Test 64 bit/int32 Ver 1.2 Thu Apr 23 13:01:16 2009
 Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
 SSE2 Assembled with Microsoft ml64.exe Version 8.00.40310.39

 Part 1 - Single Thread MBytes/Second

 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

      6     7427     8061     9776     9563     9098     9402    17526
     24     7532     8114    10196     9910     9484     9525    17496
     96      741      671     1253     2345     4902     6576     8791
    384      359      309      544      843     1465     2549     2969
    768      358      307      543      840     1453     2583     2962
   1536      360      307      543      841     1463     2615     2958
  16380      353      310      542      838     1437     2644     2929
 131070      349      309      540      832     1431     2598     2865

 Part 2 - Two Threads Total MBytes/Second

 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

      6     9591    10169    14026    15814    16444    17487    34972
     24    10886    13336    17636    18394    18089    18606    34922
     96     1479     1341     2493     4652     9736    13097    17679
    384      320      330      667     1280     2396     4349     4750
    768      320      330      667     1280     2398     4362     4766
   1536      319      331      666     1279     2393     4315     4746
  16380      322      335      668     1271     2371     4255     4719
 131070      321      334      662     1259     2343     4246     4736

 ##############################################################################

 MP Bus Speed Test 64 bit Version 1.2 Thu Apr 23 13:02:40 2009
 Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
 SSE2 Assembled with Microsoft ml64.exe Version 8.00.40310.39

 Part 1 - Single Thread MBytes/Second

 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

      6    14612    15968    16147    17828    17760    17492    17477
     24    14970    16215    16258    17791    17719    17820    17508
     96     1944     1952     1344     2410     4543     9668     8787
    384      655      720      592      996     1513     2933     2970
    768      656      720      592      992     1506     2897     2957
   1536      653      719      590      993     1505     2898     2960
  16380      643      705      591      986     1478     2873     2929
 131070      640      702      593      985     1478     2871     2927

 Part 2 - Two Threads Total MBytes/Second

 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

      6    11775    14939    14931    17563    19524    21683    34827
     24    11304    17001    18712    22069    22764    23612    34809
     96     3113     3073     2556     4672     8255    12516    17556
    384      557      632      645     1268     2228     4064     4718
    768      560      631      645     1271     2242     4072     4721
   1536      561      630      643     1233     2242     4102     4752
  16380      564      633      643     1266     2232     4086     4724
 131070      560      634      646     1264     2228     4080     4721

 ##############################################################################

 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit

 ##############################################################################

 MP Bus Speed Test 32 bit Version 1.2 Thu Apr 23 12:53:45 2009
 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
 SSE2 Assembled with Microsoft ml.exe Version 6.15.8803

 Part 1 - Single Thread MBytes/Second

 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

      6     7076     8930     8889     9164     9247     9220    37147
     24     8406     8585     8741     8940     9014     9104    37713
     96     2027     2017     3222     4518     6614     7915    19004
    384     2023     2018     3239     4503     6661     7956    19038
    768     2001     2015     3226     4487     6632     7917    19102
   1536     1950     1983     3191     4412     6491     7830    18595
  16380      316      380      783     1411     2631     4798     5634
 131070      312      382      778     1423     2567     4868     5678

 Part 2 - Two Threads Total MBytes/Second

 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

      6     7347    11414    13477    15485    16414    17004    64860
     24    12643    14519    15499    15765    15959    16291    65950
     96     3137     2809     5169     7291    11891    14047    30072
    384     3150     2849     5077     7639    10778    14685    29339
    768     3015     2980     4988     7379    11758    13607    30887
   1536     2969     2725     5036     7056    11213    13271    29849
  16380      315      417      851     1739     3087     5743     6971
 131070      313      416      877     1757     2967     5693     6919

 ##############################################################################

 MP Bus Speed Test 64 bit/int32 Ver 1.2 Thu Apr 23 12:55:49 2009
 Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
 SSE2 Assembled with Microsoft ml64.exe Version 8.00.40310.39

 Part 1 - Single Thread MBytes/Second

 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

      6     7088     8064     8353     8685     8728     8719    37226
     24     7313     8121     8153     8432     8543     8598    37539
     96     1765     1965     3177     4507     6462     7816    18899
    384     2039     2022     3232     4518     6457     7850    18962
    768     2034     2008     3218     4501     6464     7819    19034
   1536     1931     1991     3178     4427     6306     7708    18523
  16380      316      380      789     1427     2654     4876     5731
 131070      317      370      787     1398     2610     4867     5669

 Part 2 - Two Threads Total MBytes/Second

 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

      6     6576    10532    11969    13477    15141    14897    70169
     24    10735    13041    14113    14257    15900    14838    70711
     96     3008     2944     4909     7646    11280    14548    29482
    384     2904     2997     4994     7335    11754    13745    30437
    768     3028     2779     5125     7195    11704    13391    30549
   1536     2829     2769     4793     7126    11158    13621    29143
  16380      316      427      867     1705     3062     5694     7036
 131070      314      423      845     1736     3014     5520     7051

 ##############################################################################

 MP Bus Speed Test 64 bit Version 1.2 Thu Apr 23 12:57:50 2009
 Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
 SSE2 Assembled with Microsoft ml64.exe Version 8.00.40310.39

 Part 1 - Single Thread MBytes/Second

 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

      6    14142    16327    16605    17323    17381    17440    36730
     24    14978    16822    16506    17300    17310    17271    37790
     96     4076     4190     3988     6449     9504    13632    18914
    384     3995     4149     4022     6425     9549    13593    19051
    768     3977     4152     4011     6438     9555    13584    18967
   1536     3918     3977     3954     6318     9282    13358    18674
  16380      594      625      771     1554     2882     5150     5696
 131070      574      631      762     1578     2861     5119     5666

 Part 2 - Two Threads Total MBytes/Second

 Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

      6     8740    13942    15815    17691    19101    20269    66705
     24     9484    15989    16697    20145    20505    19950    68468
     96     5225     5630     5557     9618    13355    17128    28572
    384     5271     5631     5510     9893    12940    17222    29617
    768     5369     5760     5544     9899    13147    17200    30917
   1536     4938     5635     5543     9374    12449    16582    28732
  16380      583      625      821     1673     3166     5148     6988
 131070      600      624      821     1664     3154     5267     6901


Another version of the 64 bit benchmark was produced. This just uses the single thread test, with command line options to select the memory size used, the running time and the log file name. More than one copy can then be run at the same time. Results are shown below for two programs running concurrently to test L1 cache, L2 cache and RAM. Speed from caches is seen to double, unlike the same tests using two threads. Results are for the AMD based PC.

  Kbytes         Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

       6  Prog 1     5706    11458    11868    17738    17742    17452    17431
       6  Prog 2     5731    11349    11882    17839    17803    17505    17473
      96  Prog 1     1926     1498     1347     2449     4495     9619     8796
      96  Prog 2     1936     1505     1349     2455     4506     9642     8813
    1536  Prog 1      295      319      328      642     1174     2366     2444
    1536  Prog 2      299      328      347      694     1240     2334     2809


BusSpdMP 32/64 bit Integers - CPU Speed Comparison

In addition to the 64 bit test, another version has been produced. This one, BusMP64Int32 in DualCore.zip, is compiled for 64 bit working but using 32 bit integers instead of 64 bits. Results in MB/second are included above. These speeds can be converted to Millions of Instructions Per Second (MIPS) by dividing by four for 32 bit integers or by eight for those at 64 bits. The following are for the ReadAll data.

The program inner loop is run many times and executes 64 instructions, so it is unlikely that the additional registers available for optimising 64 bit programs will have any effect. Unlike Dhrystone, the 32 bit compiler produces faster speeds than the 64 bit version with 32 bit numbers. This might be due to the simpler instruction format of and edi, [eax-212] compared with and ecx, [rax+rdx-488]. The slowest speeds are when using 64 bit integers, where twice as much data needs to be transferred for the same MIPS. The worst case is from RAM, where CPU execution speed is halved at 64 bits. Performance gains from using two CPUs are also worse at 64 bits.

             Core 2 Duo 2.4 GHz         Athlon 64 x2 2.2 GHz
                  Vista 64                     XP x64

                     1 Thread ReadAll MIPS

  Kbytes   32/32   64/32   64/64       32/32   64/32   64/64

       6    2305    2180    2180        2494    2351    2187
      24    2276    2150    2159        2527    2381    2228
      96    1979    1954    1704        1593    1644    1209
     384    1989    1963    1699         698     637     367
     768    1979    1955    1698         695     646     362
    1536    1958    1927    1670         697     654     362
   16380    1200    1219     644         686     661     359
  131070    1217    1217     640         685     650     359

                     2 Threads ReadAll MIPS

       6    4251    3724    2534        4642    4372    2710
      24    4073    3710    2494        4915    4652    2952
      96    3512    3637    2141        3171    3274    1565
     384    3671    3436    2153        1202    1087     508
     768    3402    3348    2150        1198    1091     509
    1536    3318    3405    2073        1203    1079     513
   16380    1436    1424     644        1182    1064     511
  131070    1423    1380     658        1182    1062     510

                       Gain With 2 CPUs

       6    1.84    1.71    1.16        1.86    1.86    1.24
      24    1.79    1.73    1.16        1.94    1.95    1.33
      96    1.77    1.86    1.26        1.99    1.99    1.29
     384    1.85    1.75    1.27        1.72    1.71    1.39
     768    1.72    1.71    1.27        1.72    1.69    1.41
    1536    1.69    1.77    1.24        1.73    1.65    1.42
   16380    1.20    1.17    1.00        1.72    1.61    1.42
  131070    1.17    1.13    1.03        1.73    1.63    1.42
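
As a minimal sketch, the arithmetic behind the MIPS and gain figures above could be written in C as follows. The function names are assumptions for illustration and are not taken from the benchmark source. The divisor is the integer size in bytes, four for 32 bit integers or eight for 64 bit.

```c
/* Sketch only: these names are illustrative, not from the benchmark source. */

/* Convert measured MBytes/second to MIPS by dividing by the integer
   size in bytes (4 for 32 bit integers, 8 for 64 bit integers). */
double mbps_to_mips(double mbytes_per_second, int bytes_per_integer)
{
    return mbytes_per_second / (double)bytes_per_integer;
}

/* Gain with two CPUs: total two-thread MIPS over single-thread MIPS. */
double two_cpu_gain(double mips_two_threads, double mips_one_thread)
{
    return mips_two_threads / mips_one_thread;
}
```

For the Core 2 Duo 32 bit ReadAll result at 6 KB, mbps_to_mips(9220.0, 4) gives 2305 and two_cpu_gain(4251.0, 2305.0) gives about 1.84, in line with the figures quoted.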


RandMP Benchmark

This MP benchmark uses variable amounts of memory to measure speed via caches and RAM, first as a single thread, then as two threads to demonstrate the impact of two CPUs. It is based on RandMem (RandMem.zip) with serial and random read and read/write tests. Serial and random tests use the same code, indexing to read and write 4 byte words, e.g. a sequence such as tot = tot & xi[xi[i+ 0]] | xi[xi[i+ 2]] & --- for reading and xi[xi[i+ 0]] = xi[xi[i+ 2]]; for read/write. The inner loops have more than 600 CPU instructions. RandMP64 and RandMP32 versions, to run via Win64 and Win32, can be found in DualCore.zip.
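
The indexing scheme described above might be sketched in C along the following lines. This is not the benchmark code. The helper names, the shuffle used to produce random indices and the step of four words per loop pass are all assumptions for illustration, and the real inner loops are unrolled much further.

```c
#include <stddef.h>

/* Hypothetical helpers: names and loop shape are assumptions, not RandMP source. */

/* Serial access: each word holds its own index, so xi[xi[i]] walks memory in order. */
void fill_serial(unsigned int *xi, size_t n)
{
    for (size_t i = 0; i < n; i++)
        xi[i] = (unsigned int)i;
}

/* Random access: the same indices, shuffled (Fisher-Yates driven by a small LCG). */
void fill_random(unsigned int *xi, size_t n, unsigned int seed)
{
    fill_serial(xi, n);
    for (size_t i = n; i > 1; i--) {
        seed = seed * 1664525u + 1013904223u;
        size_t j = (size_t)(seed >> 8) % i;
        unsigned int t = xi[i - 1];
        xi[i - 1] = xi[j];
        xi[j] = t;
    }
}

/* Read test: the same code serves serial and random data; n must be a multiple of 4. */
unsigned int read_pass(const unsigned int *xi, size_t n)
{
    unsigned int tot = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i += 4)
        tot = (tot & xi[xi[i + 0]]) | xi[xi[i + 2]];
    return tot;
}

/* Read/write test: only existing entries are copied, so every word stays a valid index. */
void read_write_pass(unsigned int *xi, size_t n)
{
    for (size_t i = 0; i < n; i += 4)
        xi[xi[i + 0]] = xi[xi[i + 2]];
}
```

Only the contents of the index array differ between the serial and random cases, which is why one piece of code can serve both tests.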

The benchmark has four tests: Serial Read (RD), Serial Read/Write (RW), Random Read and Random Read/Write. With two threads, each has its own code and uses the same data, but the second thread starts at the halfway point. Each has the same number of repeat passes, so variations in the time taken are reflected in the relative speeds of the two threads.

Below are example results of the 32 bit version on a single CPU using Windows XP and of the 64 bit version on a dual core CPU via Windows XP x64 (the 32 bit version produces very similar results). Using one thread, RW speed is slower than RD, and speed reduces further with random access as the data size increases. Running two threads on a single CPU produces the same sort of total speed as the single thread. With two CPUs, the speed of read only is mainly around double that of a single thread, but speed via caches with read/write can be worse than for a single thread (or single CPU).

Looking at dual core results, with Serial RW and Random RW at 6 KB, the CPU is executing at around 1360 Million Instructions Per Second (MIPS) or 0.62 MIPS/MHz with a single thread. With two threads, each CPU runs at 340 MIPS (0.15 MIPS/MHz) with Serial RW and 154 MIPS (0.07 MIPS/MHz) with Random RW. This can be put down to Windows flushing caches to maintain data coherency. Modifying the benchmark, so that each thread accesses its own data array, enables RW cache tests to run at 1360 MIPS on each CPU.
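
The MIPS and MIPS/MHz figures quoted above can be reproduced with a short sketch such as the one below. The function names are assumptions for illustration. The divisor of 3.2 is the approximate conversion factor given in the results listing, and 2211 MHz is the measured clock speed of the dual core PC.

```c
/* Illustrative only: these names are not from the benchmark source. */

/* Approximate MIPS from a RandMP MBytes/second figure, using the
   divisor of 3.2 quoted in the results listing. */
double randmp_mips(double mbytes_per_second)
{
    return mbytes_per_second / 3.2;
}

/* Execution rate relative to clock speed, as MIPS per MHz. */
double mips_per_mhz(double mips, double measured_mhz)
{
    return mips / measured_mhz;
}
```

For example, randmp_mips(4346.0) gives about 1358, or around 1360 MIPS for single thread Serial RW at 6 KB, and mips_per_mhz(1360.0, 2211.0) gives about 0.62. With two threads, randmp_mips(1090.0) gives about 340 for Serial RW and randmp_mips(494.0) gives about 154 for Random RW.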

The above comments relate to results on the PC with an AMD CPU and using Windows XP Pro x64. Later results are for a Core 2 Duo system using 64-Bit Windows Vista. RAM on this system is nearly twice as fast, but the tests show speeds up to 4 times faster. Measured L1 cache speeds are much faster on the Read/Write tests, as are those via L2 cache.

  AMD Athlon(tm) XP 2600+ Measured 2088 MHz

 #####################################################################

  RandMP Write/Read Test 32 bit Version 1.0 Sat Aug 27 19:33:14 2005
  Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

              ------------------ MBytes Per Second At --------------------
                6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
  1 Thread
  Serial RD     7773    7748    3616     895     896     889     891     892
  Serial RW     3655    3657    2193     660     663     661     663     658
  Random RD     7527    7599    2165     628     313     240     192      57
  Random RW     3686    3693    2034     439     190     141     116      44

  2 Threads
  Serial RD1    4510    4522    2043     444     448     466     447     534
  Serial RD2    3911    3906    1813     443     444     442     442     442
  Serial RW1    1890    2133    1153     346     328     348     349     340
  Serial RW2    1832    1828    1097     327     342     328     328     326
  Random RD1    4429    4297    1134     311     169     115     103      31
  Random RD2    3781    3803    1067     302     151     116      92      28
  Random RW1    1928    1941    1050     219      95      75      61      24
  Random RW2    1837    1849    1012     220      92      71      58      22

  For approximate speed in MIPS divide MBytes/Second by 3.2


  AMD Athlon 64 X2 Dual Core Processor 4200+ Measured 2211 MHz, XP Pro x64

 #####################################################################

  RandMP Write/Read Test 64 bit Version 1.0 Sat Aug 27 19:17:58 2005
  Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64

              ------------------ MBytes Per Second At --------------------
                6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
  1 Thread
  Serial RD     8552    8518    5115    5132    2369    2353    2344    2305
  Serial RW     4346    4340    2702    2697    1349    1352    1354    1351
  Random RD     8176    8244    3733    1620     872     389     255     170
  Random RW     4384    4332    2865    1483     563     236     161     136

  2 Threads
  Serial RD1    8374    8532    5064    5010    2075    2096    2021    2026
  Serial RD2    8532    8394    5176    5108    2111    2062    2049    2054
  Serial RW1    1090    1172    1110    1096    1041     867     864     866
  Serial RW2    1083    1136    1089    1076    1049     866     855     824
  Random RD1    8147    8024    3683    1638     485     193     126     100
  Random RD2    8154    8158    3701    1637     485     195     125     101
  Random RW1     494     489     448     406     352     152      86      75
  Random RW2     495     490     449     406     343     152      87      75

  For approximate speed in MIPS divide MBytes/Second by 3.2


  Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit

 #####################################################################

  RandMP Write/Read Test 64 bit Version 1.0 Sat Aug 27 19:17:58 2005

              ------------------ MBytes Per Second At --------------------
                6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
  1 Thread
  Serial RD     8742    9128    7498    7468    7486    7429    4417    4391
  Serial RW     8428    9332    7665    7663    7662    7165    2442    2397
  Random RD     8918    9404    4244    3304    3183    2790     638     458
  Random RW     8014    8523    3390    2752    2656    2462     418     289

  2 Threads
  Serial RD1    8435    9094    7334    7336    7365    7238    4024    2817
  Serial RD2    8460    8943    7183    7168    7201    7159    3962    2764
  Serial RW1    2007    2181    6931    6995    6984    6738    1643    1521
  Serial RW2    2010    2174    6789    6801    6806    6651    1568    1433
  Random RD1    8576    9392    3530    2695    2604    2292     450     443
  Random RD2    8598    9180    3478    2666    2553    2256     455     443
  Random RW1     730     759    1409    1984    1991    1923     282     292
  Random RW2     733     759    1398    1955    1961    1897     277     292


Roy Longbottom September 2009

The new Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection