Title

64-Bit Vista and Core 2 Duo Benchmarks


Index

Introduction
Image Processing - 64 Bit Windows slow on large images due to paging
Identification - Identifying 64-Bit Vista and Core 2 Duo
Paging - Program can allocate more memory with 64-Bit Vista
Paging - Vista paging more efficient than Windows XP
Paging - Accessing data 500 times slower than using RAM and 8 times slower than normal disk I/O
Dual Core CPU - Efficient use
Dual Core CPU - Slow on data streaming to 64 bit integer registers
Dual Core CPU - Windows slow on writing/reading shared data
Disk Drives - Partitioned disk C: drive slower than D:
Graphics - 3D slower using Aero
Floating Point - 64 bit compilations slow on Core 2 Duo

Introduction

All of the programs in The PC Benchmark Collection were run on a new PC with a Core 2 Duo CPU using 64-Bit Vista. A surprising number of performance issues were raised but these are mainly related to Core 2 Duo, the compilers used, Windows in general and 64 bit working, rather than just Vista. The system tested and compilers/assemblers used were:


  Core 2 Duo 2400 MHz, Asus P5B motherboard, 4 GB 800 MHz DDR2 RAM,
  Seagate ST3400633AS SATA-300 disk, 16 MB buffer, 7200 RPM,  
  GeForce 8600 GT graphics, Vista Home Premium 64-Bit

  Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
  Microsoft ml64.exe Version 8.00.40310.39
  Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
  Microsoft ml.exe Version 6.15.8803

Results were compared with those from a dual core Athlon 64 using Windows XP Pro x64 and other PCs with Windows XP and 2000.


Paging and Virtual Memory

Besides the results files indicated below see Paging.htm. This includes comparisons, further details and Performance Monitor graphs of disk activity, showing significant differences between Windows XP and Vista.

Image Processing - The BMPSpeed benchmark measures performance using image files increasing in size from 0.5 MB to 512 MB. Initially (until Memory Remapping was set in the BIOS), only 3 GB of the 4 GB RAM was seen, and slow speeds due to paging were reported with the larger images. This follows the worst ever results, obtained using 64 bit XP with 1 GB RAM. See BMPSpeed Results.htm.

The benchmark uses fast BitBlt copying when permitted and a slower byte based method when not. With 32 bit Windows, the former was used up to a maximum image size of 64 MB. Using 64 bit Windows, no limit was seen. A CreateDIBitmap function is used for BitBlt and this uses memory space outside user virtual memory space, at 32 bits per pixel instead of 24. The result is that maximum memory demands are increased from 1.1 GB to 2.3 GB. Of less significance, 64 bit Windows also shows an increase in user memory requirements.

Configuration Statistics - All the latest benchmarks provide the following system identification details. Variations for the AMD CPU, 32 bit versions of Windows and applications compiled for 32 bits are also shown. The only way to identify 64 bit Vista by programming appears to be via the GetSystemInfo flag PROCESSOR_ARCHITECTURE_AMD64, with 32 bit varieties identified via PROCESSOR_ARCHITECTURE_INTEL. It appears that 32 bit applications running via 64 bit Windows can use 4 GB of user virtual space (UVS), compared with 2 GB under 32 bit versions.



  CPUID and RDTSC Assembly Code
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
  Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz Measured 2402 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
  Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
  AMD64 processor architecture, 2 CPUs           [64 Bit Windows]
  Windows NT  Version 6.0, build 6000,           [Vista]
  Memory 4095 MB, Free 3088 MB
  User Virtual Space 8388608 MB, Free 8388542 MB [64 Bit Windows, 64 bit application]

  CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00020FB1
  AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz
  Windows NT  Version 5.2, build 3790            [XP Pro x64]

  Intel processor architecture, 2 CPUs           [32 Bit Windows] 
  Intel processor architecture, 1 CPU 
  User Virtual Space 4096 MB, Free 4035 MB       [64 Bit Windows, 32 bit application]
  User Virtual Space 2048 MB, Free 2024 MB       [32 Bit Windows, 32 bit application]
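
As a quick check on the User Virtual Space figures above, the reported sizes are exact powers of two. The short Python sketch below converts an address width in bits to MB. The function name is illustrative only, and the widths of 43, 32 and 31 bits are inferred from the reported MB values rather than taken from the benchmark source.

```python
# Convert an address-space width in bits to megabytes (1 MB = 2**20 bytes).
def space_mb(address_bits):
    return 2 ** address_bits // 2 ** 20

# Widths are inferred from the reported figures, not from the benchmark source.
cases = [
    ("64 bit Windows, 64 bit application", 43),
    ("64 bit Windows, 32 bit application", 32),
    ("32 bit Windows, 32 bit application", 31),
]
for label, bits in cases:
    mb = space_mb(bits)
    print(f"{label}: {mb} MB = {mb / 1024:g} GB")
```

The first case gives 8388608 MB, or 8192 GB, which is the figure reported for a 64 bit application.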



Paging Benchmarks - Paging speed in terms of MB/second can be measured using BusSpd2K and IntBurn64 Burn-in/Reliability benchmarks. These have six Write/Read and six read only tests. With 4 GB RAM and paging, test running time was greater than an hour. The benchmarks have been modified to include a Paging Test option that uses a single write/read test. Results are in BusSpd2K Results.htm.

Running with increasing (or decreasing) memory demands reveals the maximum that can be allocated for a single data array:

  • 32 bit Windows (UVM 2 GB) - 1,200,000 KB with XP and 1,500,000 KB via Windows 2000
  • 64-Bit Vista and 64 bit program (UVM 8192 GB) - 7,900,000 KB, XP x64 (1 GB RAM) - between 5,000,000 KB and 6,000,000 KB
  • 64 bit Windows and 32 bit program (UVM 4 GB) - 2,000,000 KB.

Tests were run on four PCs using Windows XP, 2000, x64 and 64-Bit Vista. All had disk drives that could write and read large files at around 50 MB/second. Best (RAM speed) and worst case results are below. Worst is using XP x64 with 1 GB RAM, where, with similar demands, it can be slower than the PCs with 512 MB. In all cases, writing/reading data with paging can be much slower than using normal disk output/input. There is also an issue regarding speed relative to that from RAM with more advanced memory technology.

Using Vista showed that speed tended to improve more frequently and to a greater extent, as memory demands increased. The others showed more of a general decline. Performance Monitor logging revealed differences between XP x64 and Vista. The former appeared to consistently write using 64 KB data transfers and read at 4 KB. With Vista there were periods of reading at 8, 16, 32 and 64 KB and writing was mainly at around 1000 KB. Disk benchmarks show that writing at 64 KB block size is not necessarily the fastest and reading at 4 KB is usually much slower. Graphs of Performance Monitor logs are included in Paging.htm, along with many more speed measurements.



  Windows         RAM    Maximum  Minimum  Allocated     Running
                   MB     MB/sec  MB/sec      KB        Seconds

  64 Bit Benchmark
  64-Bit Vista    4096     3393     10     5,000,000     1024
                                    21     7,900,000      770
  XP x64          1024     2040      6     5,000,000     1707
                                     6     2,000,000      683
 
  32 Bit Benchmark
  64-Bit Vista    4096     3390   2139     2,000,000        2
  XP x64          1024     2041      6     2,000,000      683
                                     8     1,400,000      358
  XP               512      532     13     1,200,000      189  
  2000             512      970     15     1,500,000      205
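
The Seconds column appears to be consistent with one write pass plus one read pass over the allocated data at the minimum MB/second, taking 1 KB as 1024 bytes and 1 MB as 1,000,000 bytes. That relationship is inferred from the table, not stated by the benchmark, so the following Python sketch (with a hypothetical function name) is only a plausibility check. It also shows the slowdown relative to RAM speed and to the 50 MB/second disk speed quoted earlier.

```python
# Estimated running time for one write pass plus one read pass.
# Assumptions (inferred from the table, not from the benchmark source):
# 1 KB = 1024 bytes, 1 MB = 1,000,000 bytes, two passes over the data.
def paging_seconds(kbytes, mb_per_sec, passes=2):
    megabytes = kbytes * 1024 / 1_000_000
    return passes * megabytes / mb_per_sec

rows = [  # (system, KB allocated, minimum MB/sec, seconds reported)
    ("64-Bit Vista, 64 bit", 5_000_000, 10, 1024),
    ("64-Bit Vista, 64 bit", 7_900_000, 21, 770),
    ("XP x64, 64 bit", 5_000_000, 6, 1707),
    ("XP, 32 bit", 1_200_000, 13, 189),
]
for name, kb, speed, reported in rows:
    estimate = paging_seconds(kb, speed)
    print(f"{name}: estimated {estimate:.0f} s, reported {reported} s")

# Worst case (XP x64): paging at 6 MB/sec against 2040 MB/sec from RAM
# and around 50 MB/sec for normal large file disk transfers.
print(f"RAM / paging  = {2040 / 6:.0f} times")
print(f"disk / paging = {50 / 6:.1f} times")
```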




Dual Core Benchmarks

The four multi-threaded benchmarks, with 32 bit and 64 bit varieties, were run to demonstrate dual core CPU performance. Multi-tasking tests were also run using two copies of BusSpd2K Reliability Tests and IntBurn64. See DualCore.htm.

CPUIDMP and CPUIDMP64 - The benchmark uses an integer test and a floating point test. They are first executed separately, then together in two threads of equal priority, and finally with two of each type, where three of the four threads run at a lower priority. With a dual CPU, performance with two threads should be similar to that of the stand alone runs. The total speed of four threads might show some variation from the latter. Results are in WhatCPU Results.htm. There were no surprises here.

Whets32MP and Whets64MP - The Whetstone Benchmark has various routines that execute floating point and integer instructions. In the MP version, the benchmark is run in the main thread, with another copy of each routine in a low priority second thread, which should mainly run at the same speed with two CPUs. Results are in Whetstone Results.htm. Again, there were no surprises.

Multi-Tasking Tests - The 32 bit BusSpd2K Reliability Test and 64 bit IntBurn64 were run separately, then with two copies concurrently, to demonstrate dual core performance using data in caches and RAM. The former uses 64 bit MMX instructions and the latter 64 bit integers. They have write/read and read only tests, where both showed similar gains using two CPUs. Example reading speeds in MB/second are below - from BusSpeed2K Results.htm. These show significant performance differences between the two systems. Total throughput via caches is seen to nearly double using two CPUs, which is surprising for the Core 2 Duo shared L2 cache. There is also a gain on using RAM, more notable with the AMD system.



       Core 2 Duo 2400 MHz, 800 MHz RAM         Athlon 64 X2 2210 MHz, 400 MHz RAM
       64-Bit Vista                             Windows XP x64

       L1            L2            RAM           L1            L2            RAM
Mode   32     64     32     64     32     64     32     64     32     64     32     64
CPUs
  1  15794  16206  13084  13048   5433   5408  20913  22257   9112  10102   2872   3009
  2  31401  32248  25111  25033   6066   6019  41503  44389  18023  19957   4706   4838
  %    199    199    192    192    112    111    198    199    198    198    164    161
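
The % row is the combined throughput of two copies expressed as a percentage of the single copy speed. A minimal Python sketch, using the 32 bit mode figures from the table (the function name is illustrative only):

```python
# Two-copy throughput as a percentage of single-copy throughput (the "%" row).
def two_cpu_percent(one_cpu_mbs, two_cpu_mbs):
    return round(100 * two_cpu_mbs / one_cpu_mbs)

# 32 bit mode reading speeds in MB/second, copied from the table above.
core2 = {"L1": (15794, 31401), "L2": (13084, 25111), "RAM": (5433, 6066)}
athlon = {"L1": (20913, 41503), "L2": (9112, 18023), "RAM": (2872, 4706)}

for name, results in (("Core 2 Duo", core2), ("Athlon 64 X2", athlon)):
    for level, (one, two) in results.items():
        print(f"{name} {level}: {two_cpu_percent(one, two)}%")
```

Values near 200% indicate that each copy ran at close to stand alone speed. The RAM figures of 112% and 164% show where the two CPUs compete for memory bandwidth.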



BusMP and BusMP64 Two Thread Tests - These run a series of tests to measure performance via caches and RAM, firstly as a single thread and secondly using two threads accessing different data arrays. The tests are based on those in BusSpd2K and results are in BusSpeed2K Results.htm. To indicate bus burst reading speeds, the tests start with reading one word with address increments of 32 words (128 bytes at 32 bits and 256 bytes at 64 bits), then with decreasing increments until all data is read. The last test uses 128 bit SSE2 instructions. The data is read using a sequence of 64 AND instructions to one CPU register, repeated numerous times without programmed interference.
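
The access pattern can be sketched in Python as below. This is a stand-in for the assembly code loop, not the benchmark source: the function names and array size are hypothetical, and halving the increment on each pass is an assumption (the text only says the increments decrease).

```python
# Sketch of the strided read pattern (Python stand-in for the assembly loop;
# halving the increment on each pass is an assumption).
def strided_and(words, increment):
    """AND every increment-th word into one accumulator, like one CPU register."""
    result = 0xFFFFFFFFFFFFFFFF
    for index in range(0, len(words), increment):
        result &= words[index]
    return result

def increment_bytes(increment_words, word_bits):
    """Address increment in bytes for a given word size."""
    return increment_words * word_bits // 8

data = [0xFFFFFFFF] * 4096   # hypothetical array of 4096 words
increment = 32
while increment >= 1:
    strided_and(data, increment)
    words_read = len(range(0, len(data), increment))
    print(f"increment {increment:2d} words: "
          f"{increment_bytes(increment, 32):3d} bytes at 32 bits, "
          f"{increment_bytes(increment, 64):3d} bytes at 64 bits, "
          f"{words_read} words read")
    increment //= 2
```

The first pass reproduces the 128 byte and 256 byte increments quoted above, and the final pass reads every word.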

The first observations are that, using a single CPU, the time used for each test and throughput is approximately the same using one and two threads. This is certainly not the case using two CPUs. Two copies of BusSpd2K were also run simultaneously on a Core 2 Duo CPU and showed no performance degradation.

Example L1 cache MB/second results below are for address increments of 64 bytes, reading all data (32 or 64 bit integers) and 128 bit SSE2. Note that running the 32 bit version, with two CPUs via 32 bit Windows, produces the same variations. Except for Athlon SSE2, the single thread tests execute instructions at around 1 per clock cycle (e.g. 2400 MIPS at 2400 MHz) for 32, 64 and 128 bit data (divide MB/sec by 4, 8 and 16) and are unlikely to run faster using a single register (see WhatCPU Results.htm).
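
The divide by 4, 8 and 16 rule can be checked with a few lines of Python. The speeds are single thread values from the results table and the clock rates are the measured 2402 MHz and 2211 MHz from the configuration details. The function name is illustrative only.

```python
# Instructions per clock cycle from a data streaming speed in MB/second.
def instructions_per_clock(mb_per_sec, operand_bits, cpu_mhz):
    mips = mb_per_sec / (operand_bits // 8)   # one instruction per operand read
    return mips / cpu_mhz

samples = [  # (test, MB/sec, operand bits, measured CPU MHz)
    ("Core 2 Duo, 32 bit read all", 9263, 32, 2402),
    ("Core 2 Duo, 64 bit read all", 17449, 64, 2402),
    ("Core 2 Duo, SSE2", 37310, 128, 2402),
    ("Athlon 64, SSE2", 17391, 128, 2211),
]
for test, speed, bits, mhz in samples:
    ipc = instructions_per_clock(speed, bits, mhz)
    print(f"{test}: {ipc:.2f} instructions per clock")
```

The Core 2 Duo cases come out at a little under 1 instruction per clock, and the Athlon SSE2 case at about half that.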

The speed of the two thread tests can vary quite a bit and throughput via two CPUs can be even less than from one processor when data streaming 64 bit integers. Part of the difference is that the two threads do not finish at the same time. Other measurements (program debug option) show that this does not usually seem to affect the tests with large address increments but adjusting for it on the Core 2 Duo Read All results could increase the performance gain from 117% to 140%. It does seem that Windows is interfering with the data flow and that timing is more critical using 64 bit integers.



       Core 2 Duo 2400 MHz, 64-Bit Vista        Athlon 64 X2 2210 MHz, XP x64

    32 Bit               64 Bit               32 Bit               64 Bit
    Inc64B  RdAll   SSE2 Inc64B  RdAll   SSE2 Inc64B  RdAll   SSE2 Inc64B  RdAll   SSE2
CPUs
  1   8999   9263  37310  16736  17449  37143   8462   9958  17391  16078  17443  17358
  2  11341  17116  66635  15512  20354  71247  10367  18515  34466  13348  21414  34617
  %    126    185    179     93    117    192    123    186    198     83    123    199



RandMP32 and RandMP64 Two Thread Tests - The program uses the same code for serial and random use via a complex indexing structure and comprises Read (RD) and Read/Write (RW) tests. They are run to use data from L1 cache, L2 cache and RAM, firstly as a single thread and secondly using two threads. The 64 bit compilation uses the same 32 bit integer arrays as the 32 bit version and resultant speeds are generally the same. For this benchmark, the two threads share the same data array but one starts half way through. Examples of MB/second results from Randmem Results.htm are below.

The main observations are the reduction in throughput when writing and reading cache based data in two CPUs, particularly for random access. Here, Windows will be updating data in RAM to maintain integrity (a cache killer). The impact of the Core 2 Duo shared L2 cache is demonstrated, with a lower percentage increase on reading but a higher one with writing and reading, plus some gains on RAM based data.



       Core 2 Duo 2400 MHz, 800 MHz RAM          Athlon 64 X2 2210 MHz, 400 MHz RAM
       64-Bit Vista                              Windows XP x64

    Serial

       L1 Cache      L2 Cache      RAM           L1 Cache      L2 Cache      RAM
CPUs   RD     RW     RD     RW     RD     RW     RD     RW     RD     RW     RD     RW

  1   8742   8428   7498   7665   4417   2442   8552   4346   5115   2702   2344   1354
  2  16895   4017  14517  13720   7986   3211  16906   2173  10240   2199   4070   1719
  %    193     48    194    179    181    131    198     50    200     81    174    127

    Random

  1   8918   8014   4244   3390    638    418   8176   4384   3733   2865    255    161
  2  17174   1463   7008   2807    905    559  16301    989   7384    897    251    173
  %    193     18    165     83    142    134    199     23    198     31     98    107



Disks

The disk on the Core 2 Duo/Vista PC is partitioned at 200 GB + 200 GB. On running the DiskGraf and CDDVDSpd benchmarks, it was anticipated that the C: drive would be faster than D:. This was shown to be true for large files with large block sizes but small files and those with small block sizes were slower. CDDVDSpd writes and reads one large file of selected size and 520 small files that occupy the same amount of space. Below are some results for small files, including average milliseconds per file. For more details see 64 Bit Disk Tests.htm, DiskGraf Results.htm and CDDVDSpd Results.htm.

Tests were run using real data via copying and pasting a folder containing downloaded HTML documents with lots of tiny GIF files - 18.1 MB, 28.4 MB on disk, 3636 files in 277 folders. Example copying times were 41 seconds on C: and 15 seconds on D:.
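
To put the copying times on the same footing as the benchmark results, they can be converted to milliseconds per file and effective MB/second. A small Python sketch using the figures quoted above (function names are illustrative only, and the 18.1 MB data size is used rather than the 28.4 MB occupied on disk):

```python
# Average time per file and effective data rate for the folder copy test.
def per_file_ms(seconds, files):
    return 1000 * seconds / files

def throughput_mbs(megabytes, seconds):
    return megabytes / seconds

FILES = 3636       # files in the copied folder
DATA_MB = 18.1     # data size (28.4 MB is occupied on disk)

for drive, seconds in (("C:", 41), ("D:", 15)):
    print(f"{drive} {per_file_ms(seconds, FILES):.1f} ms per file, "
          f"{throughput_mbs(DATA_MB, seconds):.2f} MB/second")
```

This gives around 11 milliseconds per file on C: and 4 milliseconds on D:, with the C: partition taking nearly three times as long.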

After the above, the same tests were run on a Pentium 4 PC using Windows XP and the same effects were observed but performance differences were not as great.



       C: Partition                D: Partition
      
                      Per File                     Per File
  KB  Write   Read  Write   Read   Write   Read  Write   Read
       MB/s   MB/s  msecs  msecs    MB/s   MB/s  msecs  msecs

   2   0.55   0.50    3.6    4.0    2.26   8.30    0.9    0.2
   4   1.14   1.12    3.5    3.6    7.00  14.71    0.6    0.3
   8   1.86   1.69    4.3    4.7   14.46  26.89    0.6    0.3



Graphics

My old Windows, DirectDraw and OpenGL benchmarks were converted to run at 64 bits and to compile with a more modern 32 bit compiler. A new DirectX 9 benchmark was also produced, as the Direct3D functions used were no longer supported. The benchmarks were run via 64-Bit Vista on a Core 2 Duo CPU, following earlier tests using Windows XP Pro x64 and a dual core Athlon 64 CPU. The Vista tests included running them on Aero and Classic Desktops.

The only real problem was that the programmed WaitForVerticalBlank (VSYNC) failed to synchronise to the monitor refresh Hz on the old DirectDraw and D3D benchmarks, program refresh speed being 1.5 to 3 times faster than it should be. Performance via 64 bit and new 32 bit versions was mainly the same. Some different speeds were obtained with Aero and Classic Desktops in the background. Occasionally, Aero was faster but the DirectX 9 tests showed that Classic Desktop results produced an average speed gain of 12%. This might be influenced by the benchmark running in windowed mode, and not as a dedicated full screen application.

Below are results for the 64 bit DirectX 9 benchmark. For further details see - Direct3D Results.htm, DirectDraw Results.htm, OpenGL Results.htm and VideoWin Results.htm.



   DirectX 9    ..................... Frames Per Second ......................
                Shaded WireEgg     500 Texture  Colour Texture   Pixel  Vertex
   Resolution      Egg   Vsync   Cubes  Tunnel Objects Objects Shader2 Shader2

   Aero
   640  480 32  3244.0    60.0    95.7   953.3  1336.8   877.8   868.7   894.8 
   800  600 32  2479.9    60.0    74.1   673.5  1108.4   642.1   635.6   834.7 
  1024  768 32  1514.2    60.0    60.9   501.2   845.5   475.5   472.4   673.6 
  1280 1024 32   848.7    60.0    42.6   323.2   545.2   291.2   290.9   431.3 
  
  Classic
   640  480 32  3596.4    60.0   111.2  1086.0  1532.4   996.2   963.7   958.5 
   800  600 32  2760.9    60.0    89.2   780.7  1209.4   749.0   737.4   931.2 
  1024  768 32  1832.7    60.0    67.6   529.7   836.5   495.2   503.0   690.4 
  1280 1024 32  1131.0    60.0    48.9   356.8   571.0   327.1   326.1   461.8 
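
One way to arrive at a figure close to the quoted average gain is to take the Classic/Aero ratio for each test and resolution and average the results. The Python sketch below does this with the frames per second values from the table, leaving out the VSYNC column, which is fixed at 60.0. The averaging method is an assumption, as the text does not say how the 12% was calculated.

```python
# Frames per second from the table above, VSYNC column omitted.
# Columns: Shaded Egg, 500 Cubes, Texture Tunnel, Colour Objects,
#          Texture Objects, Pixel Shader2, Vertex Shader2.
aero = [
    [3244.0, 95.7, 953.3, 1336.8, 877.8, 868.7, 894.8],
    [2479.9, 74.1, 673.5, 1108.4, 642.1, 635.6, 834.7],
    [1514.2, 60.9, 501.2, 845.5, 475.5, 472.4, 673.6],
    [848.7, 42.6, 323.2, 545.2, 291.2, 290.9, 431.3],
]
classic = [
    [3596.4, 111.2, 1086.0, 1532.4, 996.2, 963.7, 958.5],
    [2760.9, 89.2, 780.7, 1209.4, 749.0, 737.4, 931.2],
    [1832.7, 67.6, 529.7, 836.5, 495.2, 503.0, 690.4],
    [1131.0, 48.9, 356.8, 571.0, 327.1, 326.1, 461.8],
]

def average_gain_percent(base, other):
    """Mean of other/base over all table cells, as a percentage gain."""
    ratios = [o / b for base_row, other_row in zip(base, other)
              for b, o in zip(base_row, other_row)]
    return 100 * (sum(ratios) / len(ratios) - 1)

gain = average_gain_percent(aero, classic)
print(f"Classic Desktop average gain over Aero: {gain:.1f}%")
```

The only cell where Aero is faster is Colour Objects at 1024 x 768.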



Floating Point

My original benchmarks, that measure CPU floating point performance, were converted and compiled for 64 bit working. These have to use SSE and SSE2 floating point instead of the old x87 instructions. Other versions were produced for 32 bit working, with options set to use SSE and SSE2. They were first produced to run on an Athlon 64 x2 CPU via Windows XP Pro x64 and all ran successfully on a Core 2 Duo processor using 64-Bit Vista.

When using assembly code, the processing speed of SSE and SSE2 floating point instructions is shown to be much faster than the old x87 variety. Unfortunately, the compiler used did not implement Single Instruction Multiple Data (SIMD) instructions properly, only using one variable in registers - Single Instruction Single Data (SISD). The result is that maximum speed could be expected to be reduced by two times on 64 bit double precision SSE2 operation and by four times using 32 bit single precision SSE instructions. In some cases, this leads to programs using the old x87 floating point instructions being faster than SSE/SSE2 varieties.
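
The two times and four times factors follow from the number of values that fit in a 128 bit SSE register. As a minimal Python illustration (the names are hypothetical and this is a theoretical ceiling, not a measured result):

```python
SSE_REGISTER_BITS = 128   # width of an xmm register

# Number of floating point values processed by one SIMD instruction.
def simd_lanes(element_bits):
    return SSE_REGISTER_BITS // element_bits

for name, bits in (("double precision (SSE2)", 64), ("single precision (SSE)", 32)):
    print(f"{name}: {simd_lanes(bits)} values per instruction, "
          f"so SISD code can lose up to {simd_lanes(bits)} times in speed")
```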

The major surprise was that Core 2 Duo demonstrated particularly slow performance on some 64 bit compilations that produce SSE2 instructions, where the Athlon 64 could be up to twice as fast. These slow results were from the original 2006 versions but these were corrected using a later compiler in 2009. Below are results for the Linpack and Livermore Loops benchmarks - see Linpack Results.htm and Livermore Loops Results.htm.

Other benchmarks using floating point generally show superior Core 2 Duo performance - see WhatCPU Results.htm, FFTGraf Results.htm, Whetstone Results.htm, SSE3DNow Results.htm and MemSpd2K Results.htm.



                     Linpack Benchmark - Results in MFLOPS

                                 64 Bit       32 Bit      Original

  Core 2 Duo 2400 MHz, Vista       823         1480         1315
  Core 2 Duo 2009 compilation     1602

  Athlon 64 2210 MHz, XP x64      1044         1014          838
  Athlon 64 2009 compilation      1091

 
              Livermore Loops Benchmark - Results in MFLOPS

                     64 Bit                32 Bit               Original
                Max   Mean    Min     Max   Mean    Min     Max   Mean    Min

  Core 2 Duo   1175    537    227    2526    804    195    2236    539     52
  2009 comp    2799    893    261

  Athlon 64    2284    661    166    1908    641    162    2566    461     48
  2009 comp    2068    679    171



SSE3DNow and MemSpeed (latest MemSpd2K) carry out the same floating point data streaming instructions to measure performance via caches and RAM. The former uses SIMD assembly code and the latter was compiled using the original x87 instructions. This was recompiled at 64 bits and 32 bits, using SSE/SSE2 options. Results are below, showing slow performance on Core 2 Duo at 64 bits.

The disassembled code was examined and the main difference was that the movsd load instruction was used with the 32 bit compilation and movlpd at 64 bits (as with Linpack and Livermore Loops benchmarks). Assembly code using SSE2 instructions was constructed for the first double precision test, to prove that movlpd produced slower results - see below. A later search found Intel Documentation, confirming that there are complications concerning use of the same register following movlpd, which can cause pipeline stalls. This was also corrected in the 2009 recompilation using the movsdx instruction.



                                 Maximum Speeds in MFLOPS

                  Core 2 Duo 2400 MHz, Vista        Athlon 64 2210 MHz, XP x64

                s=s+x[m]*y[m]   x[m]=x[m]+y[m]     s=s+x[m]*y[m]   x[m]=x[m]+y[m] 
                  Dble   Sngl     Dble   Sngl        Dble   Sngl     Dble   Sngl

 Assembled SIMD   3166   6347     2340   4692        1999   3998     1011   2140

 32 bit SISD      1053   1059     1173   1147         673    727      970    878

 64 bit SISD       761   1275      398    943         891    918      808    735
 2009 compile     1260   1270     1172   1188         976    978      976    979

 x87 FPU          1591   1593     1180   1177        1047   1100      853   1085


                          Assembly Code Experiments

 Code used by 32 bit compiler                Code used by 64 bit compiler
 Slower due to unnecessary save              Slow due to movlpd

 lp:                                         lp:
 movsd   xmm0, QWORD PTR [rax+rdi]           movlpd  xmm1, QWORD PTR [rdi+rax*8]
 mulsd   xmm0, QWORD PTR [rcx+rdi]           movlpd  xmm0, QWORD PTR [r8+rax*8+8]
 addsd   xmm0, QWORD PTR [rbx]               add     rax, 4
 movsd   xmm1, QWORD PTR [rax+rdi+8]         cmp     rax, rcx
 mulsd   xmm1, QWORD PTR [rcx+rdi+8]         mulsd   xmm0, QWORD PTR [rdi+rax*8-24]
 addsd   xmm0, xmm1                          mulsd   xmm1, QWORD PTR [r8+rax*8-32]
 movsd   xmm1, QWORD PTR [rax+rdi+16]        addsd   xmm1, xmm2
 mulsd   xmm1, QWORD PTR [rcx+rdi+16]        movsd   xmm2, xmm1
 addsd   xmm0, xmm1                          addsd   xmm2, xmm0
 movsd   xmm1, QWORD PTR [rax+rdi+24]        movlpd  xmm0, QWORD PTR [r8+rax*8-16]
 mulsd   xmm1, QWORD PTR [rcx+rdi+24]        mulsd   xmm0, QWORD PTR [rdi+rax*8-16]
 add     rdi, 32                             addsd   xmm2, xmm0
 sub     rdx, 4                              movlpd  xmm0, QWORD PTR [r8+rax*8-8]
 addsd   xmm0, xmm1                          mulsd   xmm0, QWORD PTR [rdi+rax*8-8]
 movsd   QWORD PTR [rbx], xmm0               addsd   xmm2, xmm0
 jg      lp                                  jl      lp

 2400 MHz Core 2 Duo 1065 MFLOPS             2400 MHz Core 2 Duo  767 MFLOPS
 2210 MHz Athlon 64   696 MFLOPS             2210 MHz Athlon 64   958 MFLOPS

 Modified 64 bit code using movsd            Modified 64 bit code using movlpd
 Faster using movsd                          Fastest using movlpd to 4 registers

 lp:                                         lp:
 movsd   xmm1, QWORD PTR [rdi+rax*8]         movlpd  xmm0, QWORD PTR [rax+rdi]
 movsd   xmm0, QWORD PTR [r8+rax*8+8]        movlpd  xmm1, QWORD PTR [rax+rdi+8]
 add     rax, 4                              movlpd  xmm2, QWORD PTR [rax+rdi+16]
 cmp     rax, rcx                            movlpd  xmm3, QWORD PTR [rax+rdi+24]
 mulsd   xmm0, QWORD PTR [rdi+rax*8-24]      mulsd   xmm0, QWORD PTR [rcx+rdi]
 mulsd   xmm1, QWORD PTR [r8+rax*8-32]       mulsd   xmm1, QWORD PTR [rcx+rdi+8]
 addsd   xmm1, xmm2                          mulsd   xmm2, QWORD PTR [rcx+rdi+16]
 movsd   xmm2, xmm1                          mulsd   xmm3, QWORD PTR [rcx+rdi+24]
 addsd   xmm2, xmm0                          addsd   xmm4, xmm0
 movsd   xmm0, QWORD PTR [r8+rax*8-16]       addsd   xmm4, xmm1
 mulsd   xmm0, QWORD PTR [rdi+rax*8-16]      add     rdi, 32
 addsd   xmm2, xmm0                          sub     rdx, 4
 movsd   xmm0, QWORD PTR [r8+rax*8-8]        addsd   xmm4, xmm2
 mulsd   xmm0, QWORD PTR [rdi+rax*8-8]       addsd   xmm4, xmm3
 addsd   xmm2, xmm0                          jg      lp
 jl      lp

 2400 MHz Core 2 Duo 1267 MFLOPS             2400 MHz Core 2 Duo 1448 MFLOPS
 2210 MHz Athlon 64   964 MFLOPS             2210 MHz Athlon 64  1085 MFLOPS



Roy Longbottom September 2009

The new Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection