Conversion
Most of my benchmarks have been converted to run as 64 bit programs and have been tested via Windows XP Pro x64 and 64 bit Windows Vista. The first step was to download Microsoft Platform SDK for Windows Server 2003 SP1. This includes a 64 bit compiler (cl), assembler (ml64) and linker. These can be used via the command line or in a .BAT file and the package can be installed using Win32 or Win64. For Windows programs, .RC files can be converted (rc) to .RES files and the latter to .OBJ (cvtres). Library names used are the same as Win32, like GDI32.LIB.
The compiler does not accept asm type inline assembler functions, so these have to be converted to MASM format, and the headers and .INC files differ from the 32 bit varieties. Under Win64, the old x87 floating point instructions and MMX instructions cannot be used. The former have to be converted to SSE1/2/3 instructions. MMX instruction names are the same as some provided in SSE2, but memory addresses have to be changed to suit 128 bit registers instead of 64 bit ones. 32 bit instructions can still be used, including CPUID and RDTSC. The only complication appears to be that push/pop should refer to a 64 bit register (push rdx instead of push edx). There appear to be complications in passing parameters to assembly code, but I have avoided these by using global variables.
The SDK includes a 32 bit compiler that checks for 64 bit compatibility and has options to use SSE or SSE2 instructions for floating point. In some cases, this produces identical code to the 64 bit version. In other cases, it restricts the number of registers used, for 32 bit compatibility. MASM type assembly requires an assembler, such as the one that comes with Microsoft Visual C++ 6.0 Pro. In order to compare 64 bit versus 32 bit speeds, some of the benchmarks have also been compiled using the SDK 32 bit compiler.
The C/C++ and Assembler source codes for these benchmarks are available in NewSource.zip.
The original versions can be obtained via the Main Page.
More 64 Bit Benchmarks
Windows, DirectDraw, OpenGL and Image Processing benchmarks have also been converted to run at 64 bits, and a DirectX 9 benchmark has also been produced. See 64 Bit Graphics Tests.htm. Download benchmarks and C/C++ source codes via Video64.zip.
Then, there are benchmarks for disks, CD/DVD drives, networks and peripherals in More64bit.zip with results in 64 Bit Disk Tests.htm.
The latest conversions, including source code, are also in More64bit.zip. These are three versions of my Fast Fourier Transform benchmarks (see also FFTGraf.zip), SSE/SSE2 benchmark and burn-in/reliability tests (see also SSE3Dnow.zip) and BusSpd2K burn-in/reliability tests (see also BusSpd2K.zip).
The latter burn-in tests have been modified to demonstrate paging speeds more quickly. See Paging.htm for results via 64-Bit Vista and XP Pro x64.
Other Results
Results of 64 bit tests, descriptions and some comparisons are included in results reports for 32 bit versions.
System ID
Each benchmark includes a new system identification test. This is limited because Intel appear to make significant changes with each new CPU (it is now much too complicated to identify dual CPUs with HT enabled, or cache sizes, on a range of CPUs). Windows functions also lag considerably behind hardware capabilities. The following shows details provided by the 64 bit programs, then differences at 32 bits.
Note that on AMD and Intel CPUs, with 64 bit working, info.wProcessorArchitecture from GetSystemInfo(&info) indicates PROCESSOR_ARCHITECTURE_AMD64. With 32 bit operation PROCESSOR_ARCHITECTURE_INTEL is supplied.
AMD Windows XP Pro x64
CPUID and RDTSC Assembly Code
CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00020FB1
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow,
Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
AMD64 processor architecture, 2 CPUs
Windows NT Version 5.2, build 3790, Service Pack 1
Memory 1024 MB, Free 656 MB
User Virtual Space 8388608 MB, Free 8388557 MB
Intel Windows Vista 64-Bit
CPUID and RDTSC Assembly Code
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
AMD64 processor architecture, 2 CPUs
Windows NT Version 6.0, build 6000,
Memory 4094 MB, Free 3207 MB
User Virtual Space 8388608 MB, Free 8388547 MB
Differences Win32 and Win64 at 32 bits
Intel processor architecture, 2 CPUs
User Virtual Space 4096 MB, Free 4047 MB - Win64
User Virtual Space 2048 MB, Free 2022 MB - Win32
Memory 4095 MB, Free 3103 MB - Win64
Memory less than 3.5 GB - Win32
The C/C++ and Assembler source codes for these utilities are available in NewSource.zip.
Maximum CPU Speed
This benchmark CPUID64 (In Win64.zip) is based on the original in Whatcpu.zip. The latter executes a long series of assembler coded add instructions to 1, 2, 3 and 4 registers to identify maximum speeds of integer, floating point and MMX instructions.
The 64 bit version has the same 32 bit integer test and an identical one using 64 bit mode. SSE/SSE2 32/64 bit floating point tests are the same. As indicated above, normal floating point and MMX instructions are invalid under Win64. Instead of MMX, SSE2 32 bit and 64 bit add speeds are measured.
A revised 32 bit version is included in Win64.zip to show SSE2 integer speeds.
In the following example of 64 bit results, 64 bit integer Millions of Instructions Per Second (MIPS) are similar to 32 bit speeds. That would be expected, as 32 bit registers use half of the real register size. With more pipelines, 64 bit normal integer MIPS can be faster than using integer SSE2 instructions. As usual, 64 bit floating point MFLOPS (Millions of FLoating point Operations Per Second) run at half speed compared with 32 bits (2 words versus 4 words in 128 bit registers). This is also the case with AMD on 32/64 bit SSE2 integers, but not with this Intel CPU, where the 32 bit SSE2 integer speed is nearer four times that at 64 bits.
CPU ID and Speed Test 64 bit Version - Windows XP Pro x64
Assembled with Microsoft ml64.exe Version 8.00.40310.39
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz
Speeds adding to 1 Register 2 Registers 3 Registers 4 Registers
32 bit Integer MIPS 2430 4864 5650 6080
64 bit Integer MIPS 2430 4864 6356 6485
32 bit SSE2 Int MIPS 4421 8895 8844 8844
64 bit SSE2 Int MIPS 2211 4419 4447 4422
32 bit SSE MFLOPS 2214 4421 4421 4434
64 bit SSE2 MFLOPS 1105 2210 2217 2217
CPU ID and Speed Test 64 bit Version - Windows Vista 64-Bit
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
Speeds adding to 1 Register 2 Registers 3 Registers 4 Registers
32 bit Integer MIPS 2609 4081 5261 7044
64 bit Integer MIPS 2613 4177 5225 7044
32 bit SSE2 Int MIPS 9488 14638 17542 17466
64 bit SSE2 Int MIPS 2401 4575 4585 4575
32 bit SSE MFLOPS 3201 6405 9607 9607
64 bit SSE2 MFLOPS 1601 3202 4804 4804
Download Win64.zip
Maximum MP Speed
This benchmark CPUIDMP64 (in DualCore.zip) uses some of the instruction sequences from CPUID64. First, an integer test and an SSE floating point test are run separately. They are then run as two threads of equal priority, where both should run at full speed with 2 CPUs. Finally, four threads are run: one FP test at normal priority, then another FP test and two integer tests at lower priority. With 2 CPUs, the first FP test should run at full speed and the others at the whim of the OS. A 32 bit version, using the same 32 bit instructions, is included in DualCore.zip. Results of 64 bit and 32 bit tests can be expected to be the same, except possibly for sharing with 4 threads.
When run on a single CPU, the floating point and integer tests are likely to run at half speed with two threads. With four threads, the lower priority tests might obtain a small amount of time.
CPU ID and MP Speed Test 64 bit Version - Windows XP Pro x64
Assembled with Microsoft ml64.exe Version 8.00.40310.39
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz
Speed adding to registers Pass 1 Pass 2 Pass 3
Separate Tests
32 bit SSE MFLOPS 4411 4411 4415
32 bit Integer MIPS 6068 6070 6070
Two Threads Equal Priority
32 bit SSE MFLOPS 4405 4409 4408
32 bit Integer MIPS 6067 6053 5992
Four Threads, First Normal Priority, Others Normal - 1
32 bit SSE MFLOPS 4401 4411 4410
32 bit Integer MIPS 2903 2053 2898
32 bit SSE MFLOPS 0 1433 0
32 bit Integer MIPS 3454 2227 3455
CPU ID and MP Speed Test 64 bit Version - Windows Vista 64-Bit
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
Speed adding to registers Pass 1 Pass 2 Pass 3
Separate Tests
32 bit SSE MFLOPS 9582 9595 9600
32 bit Integer MIPS 6934 6936 6950
Two Threads Equal Priority
32 bit SSE MFLOPS 9501 9600 9600
32 bit Integer MIPS 7002 7006 7013
Four Threads, First Normal Priority, Others Normal - 1
32 bit SSE MFLOPS 9592 9575 9576
32 bit Integer MIPS 3447 3414 3329
32 bit SSE MFLOPS 4844 0 0
32 bit Integer MIPS 0 3337 3366
Download DualCore.zip
Classic Benchmarks
The Classic Benchmarks are the first programs that set standards of performance for computers. Details are available from Classic.htm and benchmark programs and results obtained via BenchNT.zip. The Linpack, Livermore Loops and Whetstone Benchmarks have been compiled for 64 bit systems and for 32 bit PCs using automatic compilation with SSE or SSE2 instructions.
Dhrystone Benchmarks are now included.
The benchmarks and sample results can be obtained from Win64.zip or DualCore.zip and source codes from NewSource.zip.
Linpack and Livermore Loops Benchmarks
Linpack and Livermore Loops benchmarks use double precision floating point, so are compiled with SSE2 instructions. The compilers are not as efficient as they could be, producing instructions using one 64 bit word in the 128 bit registers, rather than two for Single Instruction Multiple Data (SIMD) operation.
The original 64 bit Linpack results on Core 2 Duo were disappointing, but this was corrected in 2009 by using a later version of the compiler.
Linpack Benchmark Results
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+
Measured 2211 MHz and XP Pro x64
Original SSE2 Win32 SSE2 Win64
838 MFLOPS 1014 MFLOPS 1044 MFLOPS
2009 Compilation, SSE2 Win64 1091 MFLOPS
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Measured 2402 MHz and Vista 64-Bit
Original SSE2 Win32 SSE2 Win64
1315 MFLOPS 1480 MFLOPS 823 MFLOPS
2009 Compilation, SSE2 Win64 1602 MFLOPS
There are 24 Livermore Loops, with performance of each measured in MFLOPS, plus overall average results, the Geometric Mean being the official average quoted. Following are results for the original Watcom version, 32 bits with SSE2 and 64 bits with SSE2. The 64 bit compilation can use up to 16 registers to speed up processing. However, some 32 bit SSE2 results are faster, as are a few from the original Watcom version. The Intel Core 2 Duo results, compiled for 64 bits, are more frequently slower than at 32 bits, again corrected in a 2009 recompilation. See below for Whetstone Benchmark.
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz and XP Pro x64
********************************************************
Livermore Loops Benchmark Original Optimised via C/C++
MFLOPS for 24 loops
2032.4 1312.4 345.7 1031.6 275.6 334.9 2565.9 2288.0 2337.2 121.9 183.7 550.5
49.8 131.6 393.5 350.4 217.7 1474.2 309.3 290.9 612.8 458.5 751.6 294.6
Overall Ratings
Maximum Average Geomean Harmean Minimum
2565.9 740.9 460.5 285.7 48.4
********************************************************
Livermore Loops Benchmark 32 Bit Version
Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
MFLOPS for 24 loops
1619.1 1187.1 717.9 1068.3 244.9 606.3 1815.6 1727.1 1907.6 670.2 200.4 549.6
169.4 317.6 737.7 654.7 684.2 1452.0 455.0 762.5 1031.2 406.1 590.3 219.2
Overall Ratings
Maximum Average Geomean Harmean Minimum
1907.6 798.3 640.1 501.3 162.3
********************************************************
Livermore Loops Benchmark 64 Bit Version
Via 64 Bit Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
MFLOPS for 24 loops
1927.4 1118.3 1096.7 1054.2 252.2 320.3 2284.1 2099.2 1756.2 632.4 183.6 731.0
173.1 306.3 552.9 732.4 922.6 1441.3 500.8 881.3 328.6 351.8 758.0 397.8
Overall Ratings
Maximum Average Geomean Harmean Minimum
2284.1 843.7 660.7 509.1 165.7
********************************************************
Livermore Loops Benchmark 64 Bit Version - 2009 Compilation
Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64
MFLOPS for 24 loops
1954.0 1151.1 1094.2 948.6 249.3 605.2 2067.6 2022.1 1783.5 655.1 170.8 731.2
195.8 305.8 512.1 723.4 902.1 1332.5 501.6 730.1 1042.7 359.6 736.2 398.3
Overall Ratings
Maximum Average Geomean Harmean Minimum
2067.6 846.9 679.2 533.9 170.8
######################################################
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit
Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
MFLOPS for 24 loops
1960.1 1357.0 788.5 1471.0 341.2 891.9 2526.4 2044.9 2153.0 860.1 265.8 1181.5
458.5 555.0 444.0 1018.2 824.4 1073.6 505.2 632.3 1235.3 194.7 772.0 278.2
Overall Ratings
Maximum Average Geomean Harmean Minimum
2526.4 990.3 803.8 639.0 194.7
********************************************************
Via Microsoft 64 Bit C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
MFLOPS for 24 loops
626.3 835.2 594.5 589.0 341.3 406.3 886.7 1040.6 1098.2 391.0 239.0 398.3
349.7 397.8 320.7 857.1 1038.9 714.0 639.2 429.5 418.0 227.3 838.1 673.5
Overall Ratings
Maximum Average Geomean Harmean Minimum
1175.0 592.9 537.0 484.9 227.2
********************************************************
2009 Compilation
Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64
MFLOPS for 24 loops
1833.8 1221.5 1505.3 1290.3 340.5 858.4 2760.0 2375.6 2183.4 851.3 264.5 1183.2
508.6 561.6 446.6 909.1 1067.1 1132.1 637.8 665.0 1233.7 362.7 928.6 715.9
Overall Ratings
Maximum Average Geomean Harmean Minimum
2798.8 1066.4 893.2 749.0 260.5
Dhrystone Benchmarks
The Dhrystone tests are integer/fixed point benchmarks measuring performance in Millions of Instructions Per Second relative to the 1977 Digital Vax 11/780, accepted as the first 1 MIPS minicomputer. Dhrystone 1 could easily be over optimised, where some of the code is not executed, and this is probably reflected in the results. Dhrystone 2 was intended to overcome this deficiency.
Three versions are available, one compiled for 32 bit Windows and two for the 64 bit systems. One of the latter uses 32 bit integer variables and the other 64 bit integers.
Results for 32 bit integers show that 64 bit compilations are up to 56% faster than the 32 bit versions. Much of the gain appears to be due to a different translation of the C source code but, with twice as many registers available for optimisation at 64 bits, there could be some performance improvement from that source. Regarding 64 bit compilations, the versions using 64 bit integers were both slower than with 32 bit integers, by 27% in one case. This might be due to the higher volume of data from cache with 64 bit words, but limited compilations were inconclusive when some of the code was omitted.
Further 32/64 bit integer comparisons can be found for BusSpdMP.
Dhrystone Benchmark Results
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz and XP Pro x64
********************************************************
32 Bit Integers - Via MS 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
VAX MIPS rating Dhrystone 1 and 2: 6104.33 3719.73
********************************************************
32 Bit Integers - Via MS C/C++ Compiler Version 14.00.40310.41 for AMD64
VAX MIPS rating Dhrystone 1 and 2: 8668.31 5213.64
********************************************************
64 Bit Integers - Via MS C/C++ Compiler Version 14.00.40310.41 for AMD64
VAX MIPS rating Dhrystone 1 and 2: 8548.73 4654.28
######################################################
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit
********************************************************
32 Bit Integers - Via MS 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
VAX MIPS rating Dhrystone 1 and 2: 8094.18 5476.09
********************************************************
32 Bit Integers - Via MS C/C++ Compiler Version 14.00.40310.41 for AMD64
VAX MIPS rating Dhrystone 1 and 2: 12600.13 8549.73
********************************************************
64 Bit Integers - Via MS C/C++ Compiler Version 14.00.40310.41 for AMD64
VAX MIPS rating Dhrystone 1 and 2: 11725.88 6247.67
######################################################
Whetstone Benchmark
The Whetstone Benchmark produces an overall rating in terms of Millions of Whetstone Instructions Per Second (MWIPS). The version used also produces speeds in MFLOPS and MOPS for the eight test loops, three with straight floating point, two with intrinsic functions and three with integer type operations. An overall average (Geometric) is produced for the first three and equivalent VAX MIPS for the last three. The single precision version of the benchmark was compiled to use SSE instructions.
This is quite a bit faster than the original version but Vax MIPS are over-inflated due to excessive optimisation.
MP Version
The program was modified to use a second thread to execute some of the code and demonstrate the use of two CPUs. The second thread is run at THREAD_PRIORITY_BELOW_NORMAL, which receives little time on a single CPU. With dual CPUs, both threads should demonstrate full speed. One complication is that the compiler refused to produce the same code for the version used by the second thread, so there is some variation in speeds.
This MP version was also compiled for 64 bit operation and results are shown below for this, 32 bit MP and 32 bit SSE versions. Floating point speed is similar on both MP versions (and around double that of a single processor) but the 64 bit variety runs one of the integer tests faster by using more registers.
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz and XP Pro x64
Whetstone Single Precision SSE benchmark - single CPU version
Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
MFLOPS      Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
Gmean      MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
583       12197   2313    655    656    461   51.0   36.3   1988   2210   3305
********************************************************************************
Whetstone Single Precision MP SSE Benchmark Wed Aug 10 12:38:03 2005
Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
MFLOPS      Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
Gmean      MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
1164      19030   4506   1310   1308    920    102   69.7   3598   4139   3702
Thread 1                  642    642    452   50.7   34.8   1796   2062   2690
Thread 2                  668    666    467   50.8   34.9   1802   2078   1013
********************************************************************************
Whetstone Single Precision MP SSE Benchmark Fri Aug 05 12:18:12 2005
Via Microsoft 64 bit C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
MFLOPS      Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
Gmean      MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
1086      25950   4983   1325   1145    845    151   67.1   3610   4204   9210
Thread 1                  661    572    468   75.2   33.5   1804   2099   8067
Thread 2                  663    573    377   76.0   33.6   1806   2105   1143
#################################################################################
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit
Whetstone Single Precision SSE Benchmark Fri Jul 20 17:06:25 2007
Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
MFLOPS      Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
Gmean      MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
728       18421   2419    851    855    530   57.2   29.7   1994   1747  14352
********************************************************************************
Whetstone Single Precision MP SSE Benchmark Fri Jul 20 17:06:08 2007
Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
MFLOPS      Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
Gmean      MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
1439      23554   4704   1700   1689   1037    113   58.1   3720   3738   7518
Thread 1                  845    826    517   56.5   28.9   1871   1797   6439
Thread 2                  855    863    520   56.8   29.3   1848   1941   1079
********************************************************************************
Whetstone Single Precision MP SSE Benchmark Fri Jul 20 17:06:45 2007
Via Microsoft 64 bit C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
MFLOPS      Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
Gmean      MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
1417      26543   5661   1723   1608   1026    157   77.4   3645   3096  13257
Thread 1                  862    805    530   78.1   38.5   1809   1535  12268
Thread 2                  861    803    496   78.4   39.0   1837   1560    989
BusSpeed MP Benchmark
This MP benchmark uses variable amounts of memory to measure speed via caches and RAM, first as a single thread, then as two threads to demonstrate the impact of two CPUs. It is based on BusSpd2K (BusSpd2K.zip), using integer AND instructions to a single register, streaming data from caches or RAM. The first test reads one word with a 32 word address increment for the next word. That is 128 bytes with 32 bit words and 256 bytes with 64 bit words. The address increment reduces for following tests to one word (ReadAll). The last test reads all the data using 16 byte SSE2 instructions. With two threads, each reads all the data, with the total number of passes the same as with one thread. BusSpd2K can produce some faster results, as streaming to two registers is used for some tests. Except for SSE2, C compiler code is used for the tests, as this is similar to assembly code in BusSpd2K. Results of benchmarks compiled for 32 and 64 bit systems are shown below. The benchmarks are in DualCore.zip.
Looking at RAM speed, the system reads data in 64 byte bursts - 16 word address increments at 32 bits and 8 word increments at 64 bits. This is demonstrated by no/little performance gain with larger address increments. Speed at 64 bits will appear to be twice as fast as 32 bits as twice as much data is being used out of the burst. Typical burst speed at 32 bits is 319 MB/sec and maximum speed can be assumed to be 16 times this or 5104 MB/sec (maximum theoretical 2 x 3200). In this case, the memory buses appear to be saturated and there is no gain with 2 CPUs. As the address increment is reduced speed increases to around 3000 MB/sec using one thread or 4700 MB/sec with two threads. These are similar speeds to a BusSpd2K two program test (see DualCore.htm), indicating a performance limitation with a single CPU.
Results via caches are strange. A sample from 32 bit BusSpd2K is included below to explain possible reasons. Firstly, BusSpd2K uses just MOV instructions for the burst tests. It shows halving of speed from caches from 32 byte (8 word) increments to 64 byte (16 word) and BusSpdMP goes one step further to 32 word address increments. BusSpd2K also shows half speed from L1 cache when ANDing to 1 register instead of 2. With BusMP, the compiler refused to translate code for two registers as hoped for.
Most cache based results do not show expected performance gains on using 2 CPUs. Inner loops of the tests have 64 AND instructions and an outer loop runs this for around 0.5 seconds (a long time, and there is little difference at 0.1 seconds). Maybe the cause is cache flushing with some data coming from RAM.
The above comments relate to the tests on the PC with an AMD CPU and using Windows XP Pro x64. Results for a Core 2 Duo PC, using 64-Bit Windows Vista, are given later. This has faster RAM, larger and faster L2 cache and faster operation on SSE2 instructions.
AMD Athlon 64(tm) X2 Dual Core Processor 4200+ Measured 2211 MHz, XP Pro x64
##############################################################################
Old BusSpd2K Performance Test MBytes/Second
Inc64 = 16 words, Inc32 = 8 words
MovI MovI MovI MovI MovI MovI AndI AndI MovM MovM
Memory Reg2 Reg2 Reg2 Reg2 Reg1 Reg2 Reg1 Reg2 Reg1 Reg8
KBytes Inc64 Inc32 Inc16 Inc8 Inc4 Inc4 Inc4 Inc4 Inc8 Inc8
4 8070 15711 16498 17247 16538 16763 8670 16454 34291 34254
8 8437 16391 16544 17044 16838 17064 8765 16787 34148 35264
128 639 1281 2437 4400 7782 7780 6539 6694 8882 8684
256 651 1285 2411 4418 7786 7776 6448 6688 8936 8718
65536 315 609 1009 1478 2789 2792 2656 2842 2940 2940
131072 315 610 1007 1457 2793 2791 2704 2803 2940 2941
#####################################################################
MP Bus Speed Test 32 bit Version 1.2 Thu Apr 23 12:59:52 2009
Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
SSE2 Assembled with Microsoft ml.exe Version 6.15.8803
Part 1 - Single Thread MBytes/Second
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 8150 8479 10384 10088 9866 9976 17421
24 8394 8599 10477 10098 10183 10109 17484
96 745 659 1245 2371 4908 6372 8930
384 355 311 568 889 1443 2791 2967
768 358 310 564 887 1432 2781 2946
1536 360 310 565 887 1436 2788 2961
16380 352 313 561 877 1384 2745 2910
131070 351 314 562 877 1415 2739 2917
Part 2 - Two Threads Total MBytes/Second
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 9245 10399 14752 16214 17382 18566 34846
24 11382 13886 18134 18714 19568 19658 34652
96 1475 1314 2474 4705 9725 12685 17789
384 320 329 666 1303 2368 4809 4740
768 318 329 664 1302 2365 4793 4728
1536 318 328 665 1304 2372 4812 4743
16380 319 331 665 1291 2334 4729 4683
131070 320 330 661 1289 2332 4727 4690
##############################################################################
MP Bus Speed Test 64 bit/int32 Ver 1.2 Thu Apr 23 13:01:16 2009
Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
SSE2 Assembled with Microsoft ml64.exe Version 8.00.40310.39
Part 1 - Single Thread MBytes/Second
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 7427 8061 9776 9563 9098 9402 17526
24 7532 8114 10196 9910 9484 9525 17496
96 741 671 1253 2345 4902 6576 8791
384 359 309 544 843 1465 2549 2969
768 358 307 543 840 1453 2583 2962
1536 360 307 543 841 1463 2615 2958
16380 353 310 542 838 1437 2644 2929
131070 349 309 540 832 1431 2598 2865
Part 2 - Two Threads Total MBytes/Second
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 9591 10169 14026 15814 16444 17487 34972
24 10886 13336 17636 18394 18089 18606 34922
96 1479 1341 2493 4652 9736 13097 17679
384 320 330 667 1280 2396 4349 4750
768 320 330 667 1280 2398 4362 4766
1536 319 331 666 1279 2393 4315 4746
16380 322 335 668 1271 2371 4255 4719
131070 321 334 662 1259 2343 4246 4736
##############################################################################
MP Bus Speed Test 64 bit Version 1.2 Thu Apr 23 13:02:40 2009
Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
SSE2 Assembled with Microsoft ml64.exe Version 8.00.40310.39
Part 1 - Single Thread MBytes/Second
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 14612 15968 16147 17828 17760 17492 17477
24 14970 16215 16258 17791 17719 17820 17508
96 1944 1952 1344 2410 4543 9668 8787
384 655 720 592 996 1513 2933 2970
768 656 720 592 992 1506 2897 2957
1536 653 719 590 993 1505 2898 2960
16380 643 705 591 986 1478 2873 2929
131070 640 702 593 985 1478 2871 2927
Part 2 - Two Threads Total MBytes/Second
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 11775 14939 14931 17563 19524 21683 34827
24 11304 17001 18712 22069 22764 23612 34809
96 3113 3073 2556 4672 8255 12516 17556
384 557 632 645 1268 2228 4064 4718
768 560 631 645 1271 2242 4072 4721
1536 561 630 643 1233 2242 4102 4752
16380 564 633 643 1266 2232 4086 4724
131070 560 634 646 1264 2228 4080 4721
##############################################################################
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit
##############################################################################
MP Bus Speed Test 32 bit Version 1.2 Thu Apr 23 12:53:45 2009
Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
SSE2 Assembled with Microsoft ml.exe Version 6.15.8803
Part 1 - Single Thread MBytes/Second
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 7076 8930 8889 9164 9247 9220 37147
24 8406 8585 8741 8940 9014 9104 37713
96 2027 2017 3222 4518 6614 7915 19004
384 2023 2018 3239 4503 6661 7956 19038
768 2001 2015 3226 4487 6632 7917 19102
1536 1950 1983 3191 4412 6491 7830 18595
16380 316 380 783 1411 2631 4798 5634
131070 312 382 778 1423 2567 4868 5678
Part 2 - Two Threads Total MBytes/Second
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 7347 11414 13477 15485 16414 17004 64860
24 12643 14519 15499 15765 15959 16291 65950
96 3137 2809 5169 7291 11891 14047 30072
384 3150 2849 5077 7639 10778 14685 29339
768 3015 2980 4988 7379 11758 13607 30887
1536 2969 2725 5036 7056 11213 13271 29849
16380 315 417 851 1739 3087 5743 6971
131070 313 416 877 1757 2967 5693 6919
##############################################################################
MP Bus Speed Test 64 bit/int32 Ver 1.2 Thu Apr 23 12:55:49 2009
Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
SSE2 Assembled with Microsoft ml64.exe Version 8.00.40310.39
Part 1 - Single Thread MBytes/Second
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 7088 8064 8353 8685 8728 8719 37226
24 7313 8121 8153 8432 8543 8598 37539
96 1765 1965 3177 4507 6462 7816 18899
384 2039 2022 3232 4518 6457 7850 18962
768 2034 2008 3218 4501 6464 7819 19034
1536 1931 1991 3178 4427 6306 7708 18523
16380 316 380 789 1427 2654 4876 5731
131070 317 370 787 1398 2610 4867 5669
Part 2 - Two Threads Total MBytes/Second
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 6576 10532 11969 13477 15141 14897 70169
24 10735 13041 14113 14257 15900 14838 70711
96 3008 2944 4909 7646 11280 14548 29482
384 2904 2997 4994 7335 11754 13745 30437
768 3028 2779 5125 7195 11704 13391 30549
1536 2829 2769 4793 7126 11158 13621 29143
16380 316 427 867 1705 3062 5694 7036
131070 314 423 845 1736 3014 5520 7051
##############################################################################
MP Bus Speed Test 64 bit Version 1.2 Thu Apr 23 12:57:50 2009
Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
SSE2 Assembled with Microsoft ml64.exe Version 8.00.40310.39
Part 1 - Single Thread MBytes/Second
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 14142 16327 16605 17323 17381 17440 36730
24 14978 16822 16506 17300 17310 17271 37790
96 4076 4190 3988 6449 9504 13632 18914
384 3995 4149 4022 6425 9549 13593 19051
768 3977 4152 4011 6438 9555 13584 18967
1536 3918 3977 3954 6318 9282 13358 18674
16380 594 625 771 1554 2882 5150 5696
131070 574 631 762 1578 2861 5119 5666
Part 2 - Two Threads Total MBytes/Second
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 8740 13942 15815 17691 19101 20269 66705
24 9484 15989 16697 20145 20505 19950 68468
96 5225 5630 5557 9618 13355 17128 28572
384 5271 5631 5510 9893 12940 17222 29617
768 5369 5760 5544 9899 13147 17200 30917
1536 4938 5635 5543 9374 12449 16582 28732
16380 583 625 821 1673 3166 5148 6988
131070 600 624 821 1664 3154 5267 6901
Another version of the 64 bit benchmark was produced. This just uses the single thread
test, with command line options to select memory size used, running time and log file name.
More than one version can then be run at the same time. Results are shown below for two
programs running concurrently to test L1 cache, L2 cache and RAM. Speed from caches is
seen to double, unlike the same tests using two threads. Results are for the AMD based PC.
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 Prog 1 5706 11458 11868 17738 17742 17452 17431
6 Prog 2 5731 11349 11882 17839 17803 17505 17473
96 Prog 1 1926 1498 1347 2449 4495 9619 8796
96 Prog 2 1936 1505 1349 2455 4506 9642 8813
1536 Prog 1 295 319 328 642 1174 2366 2444
1536 Prog 2 299 328 347 694 1240 2334 2809
BusSpdMP 32/64 bit Integers - CPU Speed Comparison
In addition to the 64 bit test, another version has been produced. This one, BusMP64Int32 in DualCore.zip, is compiled as a 64 bit program but uses 32 bit integers instead of 64 bit ones.
Results in MB/second are included above. These speeds can be converted to Millions of Instructions Per Second (MIPS) by dividing by four for 32 bit integers or by eight for 64 bit integers. The following MIPS figures are for the ReadAll test.
The program inner loop is run many times and executes 64 instructions, and it is unlikely that the additional registers, available for optimising 64 bit programs, will have any effect. Unlike Dhrystone, the 32 bit compiler produces faster speeds than the 64 bit version with 32 bit numbers. This might be due to the simpler instruction format of "and edi, [eax-212]" compared with "and ecx, [rax+rdx-488]".
The slowest speeds are when using 64 bit integers, where twice as much data needs to be transferred for the same MIPS. The worst case is from RAM, where CPU execution speed is halved at 64 bits.
Performance gains from using two CPUs are also worse at 64 bits.
Column headings are compiler bits/integer bits
First three columns Core 2 Duo 2.4 GHz Vista 64, last three columns Athlon 64 x2 2.2 GHz XP x64
1 Thread ReadAll MIPS
Kbytes 32/32 64/32 64/64 32/32 64/32 64/64
6 2305 2180 2180 2494 2351 2187
24 2276 2150 2159 2527 2381 2228
96 1979 1954 1704 1593 1644 1209
384 1989 1963 1699 698 637 367
768 1979 1955 1698 695 646 362
1536 1958 1927 1670 697 654 362
16380 1200 1219 644 686 661 359
131070 1217 1217 640 685 650 359
2 Threads ReadAll MIPS
6 4251 3724 2534 4642 4372 2710
24 4073 3710 2494 4915 4652 2952
96 3512 3637 2141 3171 3274 1565
384 3671 3436 2153 1202 1087 508
768 3402 3348 2150 1198 1091 509
1536 3318 3405 2073 1203 1079 513
16380 1436 1424 644 1182 1064 511
131070 1423 1380 658 1182 1062 510
Gain With 2 CPUs
6 1.84 1.71 1.16 1.86 1.86 1.24
24 1.79 1.73 1.16 1.94 1.95 1.33
96 1.77 1.86 1.26 1.99 1.99 1.29
384 1.85 1.75 1.27 1.72 1.71 1.39
768 1.72 1.71 1.27 1.72 1.69 1.41
1536 1.69 1.77 1.24 1.73 1.65 1.42
16380 1.20 1.17 1.00 1.72 1.61 1.42
131070 1.17 1.13 1.03 1.73 1.63 1.42
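The MIPS and gain figures in the table above can be reproduced from the MBytes/Second results. The following is a minimal sketch, with hypothetical function names, that applies the stated divisors (four for 32 bit integers, eight for 64 bit integers) to the 6 KB ReadAll results from Part 1 and Part 2 of the 64 bit log shown earlier.

```python
def mips_from_mbytes(mbytes_per_sec, integer_bits):
    """Divide by the integer size in bytes: 4 at 32 bits, 8 at 64 bits."""
    return mbytes_per_sec / (integer_bits // 8)


def gain(two_thread_rate, one_thread_rate):
    """Ratio of the two thread total to the one thread rate."""
    return two_thread_rate / one_thread_rate


if __name__ == "__main__":
    one = mips_from_mbytes(17440, 64)  # Part 1 ReadAll at 6 KB
    two = mips_from_mbytes(20269, 64)  # Part 2 ReadAll at 6 KB
    print(round(one), round(two), round(gain(two, one), 2))
```

These values agree with the Core 2 Duo 64/64 entries at 6 KB in the table: 2180 MIPS with one thread, 2534 MIPS with two threads and a gain of 1.16.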
RandMP Benchmark
This MP benchmark uses variable amounts of memory to measure speed via caches and RAM, first as a single thread, then as two threads to demonstrate the impact of two CPUs. It is based on RandMem (RandMem.zip) with serial and random read and read/write tests. Serial and random tests use the same code via indexing to read and write 4 byte words, e.g. a sequence such as tot = tot & xi[xi[i+ 0]] | xi[xi[i+ 2]] & --- for reading and xi[xi[i+ 0]] = xi[xi[i+ 2]]; for read/write. The inner loops have more than 600 CPU instructions. RandMP64 and RandMP32 versions, to run via Win64 and Win32, can be found in DualCore.zip.
The benchmark has four tests, Serial Read (RD), Serial Read/Write (RW), Random Read and Random Read/Write. With two threads, each has its own code and both use the same data, but the second thread starts at the half way point. Each has the same number of repeat passes, so variations in the time taken are reflected in the relative speeds of the two threads.
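The indexing scheme described above can be pictured with a short sketch. This is not the benchmark source, which is C with heavily unrolled loops. The function names, the shuffle used to build random indices and the wrap round for a thread starting at the half way point are all assumptions made for illustration.

```python
import random


def make_index_array(words, randomise, seed=1):
    """Build an array whose entries are indices into the same array."""
    xi = list(range(words))  # serial order: xi[i] == i
    if randomise:
        # Random order (assumed method): a shuffled permutation of the indices.
        random.Random(seed).shuffle(xi)
    return xi


def read_pass(xi, start=0):
    """Read every word via double indexing, beginning at `start`.

    Returns the ORed total and the order in which words were fetched.
    """
    n = len(xi)
    tot = 0
    order = []
    for k in range(n):
        i = (start + k) % n  # assumed wrap round for the second thread
        tot |= xi[xi[i]]     # the same statement serves serial and random tests
        order.append(xi[i])  # index of the word fetched by the outer lookup
    return tot, order


if __name__ == "__main__":
    serial = make_index_array(8, randomise=False)
    print(read_pass(serial)[1])           # thread 1: ascending order
    print(read_pass(serial, start=4)[1])  # thread 2: starts half way
    print(read_pass(make_index_array(8, randomise=True))[1])
```

With serial indices the fetches walk through memory in order, while shuffled indices scatter them across the array, which is why the same code can measure both access patterns.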
Below are example results of the 32 bit version on a single CPU using Windows XP and of the 64 bit version on a dual core CPU via Windows XP x64 (the 32 bit version produces very similar results). Using one thread, RW speed is slower than RD, and speed falls further with random access as data size increases. Running two threads on a single CPU produces the same sort of total speed as the single thread. With two CPUs, the total speed of read only tests is mainly around double that of a single thread, but speed via caches with read/write can be worse than for a single thread (or single CPU).
Looking at dual core results, with Serial RW and Random RW at 6 KB, the CPU is executing at around 1360 Million Instructions Per Second (MIPS) or 0.62 MIPS/MHz with a single thread. With two threads, each CPU runs at 340 MIPS (0.15 MIPS/MHz) with Serial RW and 154 MIPS (0.07 MIPS/MHz) with Random RW. This can be put down to cached data being flushed or invalidated to maintain data coherency when both CPUs write to the same data.
Modifying the benchmark, so that each thread accesses its own data array, enables RW cache tests to run at 1360 MIPS on each CPU.
The above comments relate to results on the PC with an AMD CPU and using Windows XP Pro x64.
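The MIPS and MIPS/MHz figures quoted above follow from the log output. The following is a minimal sketch, with hypothetical function names, that uses the approximate divisor of 3.2 printed by the benchmark and the measured 2211 MHz clock of the Athlon 64 X2, with 6 KB rates taken from the 64 bit results below.

```python
BYTES_PER_INSTRUCTION = 3.2  # approximate divisor printed in the benchmark log


def mips(mbytes_per_sec):
    """Approximate MIPS from MBytes/Second."""
    return mbytes_per_sec / BYTES_PER_INSTRUCTION


def mips_per_mhz(mbytes_per_sec, cpu_mhz):
    """Approximate instructions executed per clock cycle."""
    return mips(mbytes_per_sec) / cpu_mhz


if __name__ == "__main__":
    mhz = 2211  # measured clock of the Athlon 64 X2 4200+
    for label, rate in [("Serial RW, 1 thread", 4346),
                        ("Serial RW1, 2 threads", 1090),
                        ("Random RW1, 2 threads", 494)]:
        print(f"{label:22s} {mips(rate):6.0f} MIPS "
              f"{mips_per_mhz(rate, mhz):.2f} MIPS/MHz")
```

The results are approximate, so they land within a few MIPS of the rounded figures in the text.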
Later results are for a Core 2 Duo system using 64-Bit Windows Vista. RAM on this system is nearly twice as fast, but some tests show speeds up to 4 times faster.
Measured L1 cache speeds are much faster on the Read/Write tests, as are speeds via L2 cache.
AMD Athlon(tm) XP 2600+ Measured 2088 MHz
#####################################################################
RandMP Write/Read Test 32 bit Version 1.0 Sat Aug 27 19:33:14 2005
Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
------------------ MBytes Per Second At --------------------
6 KB 24 KB 96 KB 384 KB 768 KB 1536 KB 12 MB 96 MB
1 Thread
Serial RD 7773 7748 3616 895 896 889 891 892
Serial RW 3655 3657 2193 660 663 661 663 658
Random RD 7527 7599 2165 628 313 240 192 57
Random RW 3686 3693 2034 439 190 141 116 44
2 Threads
Serial RD1 4510 4522 2043 444 448 466 447 534
Serial RD2 3911 3906 1813 443 444 442 442 442
Serial RW1 1890 2133 1153 346 328 348 349 340
Serial RW2 1832 1828 1097 327 342 328 328 326
Random RD1 4429 4297 1134 311 169 115 103 31
Random RD2 3781 3803 1067 302 151 116 92 28
Random RW1 1928 1941 1050 219 95 75 61 24
Random RW2 1837 1849 1012 220 92 71 58 22
For approximate speed in MIPS divide MBytes/Second by 3.2
AMD Athlon 64 X2 Dual Core Processor 4200+ Measured 2211 MHz, XP Pro x64
#####################################################################
RandMP Write/Read Test 64 bit Version 1.0 Sat Aug 27 19:17:58 2005
Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
------------------ MBytes Per Second At --------------------
6 KB 24 KB 96 KB 384 KB 768 KB 1536 KB 12 MB 96 MB
1 Thread
Serial RD 8552 8518 5115 5132 2369 2353 2344 2305
Serial RW 4346 4340 2702 2697 1349 1352 1354 1351
Random RD 8176 8244 3733 1620 872 389 255 170
Random RW 4384 4332 2865 1483 563 236 161 136
2 Threads
Serial RD1 8374 8532 5064 5010 2075 2096 2021 2026
Serial RD2 8532 8394 5176 5108 2111 2062 2049 2054
Serial RW1 1090 1172 1110 1096 1041 867 864 866
Serial RW2 1083 1136 1089 1076 1049 866 855 824
Random RD1 8147 8024 3683 1638 485 193 126 100
Random RD2 8154 8158 3701 1637 485 195 125 101
Random RW1 494 489 448 406 352 152 86 75
Random RW2 495 490 449 406 343 152 87 75
For approximate speed in MIPS divide MBytes/Second by 3.2
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz and Vista 64-Bit
#####################################################################
RandMP Write/Read Test 64 bit Version 1.0 Sat Aug 27 19:17:58 2005
------------------ MBytes Per Second At --------------------
6 KB 24 KB 96 KB 384 KB 768 KB 1536 KB 12 MB 96 MB
1 Thread
Serial RD 8742 9128 7498 7468 7486 7429 4417 4391
Serial RW 8428 9332 7665 7663 7662 7165 2442 2397
Random RD 8918 9404 4244 3304 3183 2790 638 458
Random RW 8014 8523 3390 2752 2656 2462 418 289
2 Threads
Serial RD1 8435 9094 7334 7336 7365 7238 4024 2817
Serial RD2 8460 8943 7183 7168 7201 7159 3962 2764
Serial RW1 2007 2181 6931 6995 6984 6738 1643 1521
Serial RW2 2010 2174 6789 6801 6806 6651 1568 1433
Random RD1 8576 9392 3530 2695 2604 2292 450 443
Random RD2 8598 9180 3478 2666 2553 2256 455 443
Random RW1 730 759 1409 1984 1991 1923 282 292
Random RW2 733 759 1398 1955 1961 1897 277 292
Roy Longbottom September 2009
The new Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection