Document:

 FFTGraf.htm

 Creation Date:

 12 May 2001

 Revision Date:

 19 October 2002

 Title:

FFTGraf Benchmark

 Keywords:

PC Benchmark FFT CPU Cache RAM Performance Graph

 Abstract:



The program runs code for single and double precision Fast Fourier Transforms (FFTs) of size 1024 to 1048576, producing a graph of results. Fourier Transforms are scientific calculations often associated with analysing radio signals.
Version 3.0 using SSE, SSE2 and 3DNow instructions.

 Contributor:

 Roy_Longbottom@compuserve.com


 Source:

 Compuserve PC Hardware Forum, Library 10 Benchmarks


 Note:



The program is still under test and should be treated as beta test software but it has been run via Win95, Win98, Win98SE, WinME, WinNT4, Win2000 and WinXP.


Scott Taylor's and Roy Longbottom's FFTGraf Benchmark

Copyright (C) Scott Taylor and Roy Longbottom 2001, 2002

FFTGraf benchmark uses optimised FFT code written by Scott Taylor. It produces a graph of results besides text output in file FFTGraf.txt. The text file can be read to reproduce graphs on the screen.

The data for FFTs is not loaded or stored from sequential memory addresses for much of the time. As the hardware loads this data in 32 or 64 byte bursts, much of it is redundant, resulting in slow performance. AMD and Intel P4 CPUs suffer most as their bursts are 64 bytes.

Version 2 of the program produces significant speed improvements mainly by making more efficient use of caches and using all burst data. Scott’s new code divides the data from RAM into L2 cache sized segments, leading to a 2 x improvement on large FFTs. Roy’s new code unrolls the critical loop to make effective use of burst data read into caches, leading to up to a further 2 x improvement or more with PCs that use 64 byte bursts. Roy has also supplied assembly code for the main calculations which helps a little. See table at end for comparative results and file Compare2.txt.

Version 3 includes additional code from Roy to use SSE or 3DNow instructions for single precision calculations and SSE2 instructions for double precision calculations, when these are provided by the CPU. Pentium III and AMD XP CPUs have SSE and Pentium 4 has SSE and SSE2. AMD CPUs have 3DNow.

The program calculates 1024 to 1048576 sized FFTs using both single and double precision numbers. Each can be calculated up to 12 times (default 5). Running time is calculated in milliseconds, minimum value being shown on the graph. Graphs are plotted using milliseconds per K size for each run. Although ms/K increases as FFTs become larger, a distinct difference is usual when data is too large for a cache. The ms/K scale is self calibrating and increases as the tests run.

Version 1 memory demands in bytes were up to 16 times the FFT size on single precision and 28 times on double precision. These also apply to Versiona 2 and 3 up to size 8192. Above this, they can be up to 28 times on SP and 52 times on DP - maximum 52 MB. Data spanned in the critical timing loop is 8 or 16 times FFT size at 8192 and below and 16 x SP or 32 x DP above 8192.

Date/time of testing, CPU ID and speed, RAM size, and version of Windows are also shown on the graph and log.

On starting, the maximum FFT size is reduced for 32 MB or below RAM sizes. This is to avoid excessive running time due to paging to/from disk.

The program has a compare facility, allowing two sets of data to be loaded from text results files to produce a graph of relative speeds. Multiple copies of the program can also be loaded from log files to display different graphs. Sample log files for various CPUs to use are in the Results folder. The scale of the graphs can be changed to enable visual comparisons on the same scale.

The original or re-scaled graphs can be saved as BMP files via a Save As menu option.


On starting the buttons shown on the sample graph can be clicked. Variables - (or Parameters) see panel below. Run - runs the tests according to the selected variables and parameters. Exit - ends the program.

These functions are repeated in the Options menu, also Change Scale to change scale for comparison purposes, View Graph to redisplay last results, Load Graph to load from log files, Load Two to load and compare data from log files and Save Graph to save current display to BMP file.


Variables or Parameters

Passes each FFT can be increased to possibly show more variance.

Maximum seconds per pass is to provide a warning when the next test is likely to exceed this time.

Minimum FFT Size and Maximum FFT Size can be selected from 1, 2, 4, 8 --- to 1024 K where K = 1024.

Maximum msecs/K Scale can be used to compare graphs using the same scale. Values in the drop down list are 10, 9, 8 --- to 1; 0.9, 0.8 --- 0.1; down to 0.009, 0.008 --- 0,0.001. The setting reverts to N/A on running new tests.

Text File allows the default log file name to be changed.



Example Graph

AMD Duron at 950 MHz, PC133 CAS 2 RAM, Abit KT7-RAID mainboard, Windows 2000. Single Precision graphs are red and Double Precision graphs are blue.



Example Comparison Graph

This shows a comparison of results of an AMD 1330 MHz Thunderbird with PC2100 DDR SDRAM and a 910 MHz PIII with PC133 SDRAM. As AMD CPUs load 64 byte cache lines, compared with 32 on Intel PIII, the former tends to be slower when data is in RAM. On the other hand, the AMD CPU has a larger L1 cache and is often faster in carrying out floating point calculations.



Example Log File

Following is a sample log file, initially produced as FFTGraf.txt. Multiple logs can be included in the same file and the graphs viewed via the menu option. Other sample logs are in the Results folder with details of configurations in file ReadMe.txt.



Versions 3, 2 and a revised Version 1 logs show Scott's MagSq'd[n/16], Peak Noise and Average Noise accuracy checks for FFTs at 1024K. This shows that the two versions produce the same numeric results.

Checks SP 9.999891e-001 3.338028e-006 1.043382e-011
Checks DP 1.000000e+000 1.133294e-023 1.428096e-028

Version 3 single precision numeric checks are slightly different, probably as the SSE/3DNow calculations are in 32 bit floating point units as opposed to normal SP which is executed in the same unit as double precision.

Checks SP 9.999890e-001 3.338029e-006 1.043487e-011
Checks DP 1.000000e+000 1.133294e-023 1.428096e-028

Comparative Results: Double Precision FFT milliseconds

Following are sample results in milliseconds from Scott’s optimised and original code. Also shown are results for FFT and TFFTDP obtained from ftp://ftp.nosc.mil/pub/aburto/. All C programs were compiled using Watcom Version 11 and run on a Duron 950 MHz with 133 MHz bus/RAM via Windows 2000. Memory bytes used is up to 28 times FFT size or 52 times with Version 2.

Scott’s original optimised version uses the same floating point machine code instructions as the non-optimised program but results in data being accessed more sequentially. The nosc.mil programs are more complicated and, probably, the number of variables used does not suit the standard PC floating point register arrangement. So, of these examples, Scott’s code comes out fastest at smaller FFT sizes, with data in L1 and L2 caches. The last example indicates superiority to Version 1 when accessing RAM, but this is surpassed by Version 2.

As the Duron only has single precision 3DNow instructions, Version 3 DP speeds are the same as Version 2. The second table shows Version 3 (SSE2), Version 2 and Opt results from a 1900 MHz Pentium 4 with PC 133 RAM. Improvements obtained by using SSE2 (and SSE, 3DNow) are not that great as there is a significant pre-calculation overhead, reading/writing randomly from/to memory. Also, during the main calculations, reading and writing to two different areas of memory is not good for data streaming.

Size Version 2 Opt NoOpt TFFTDP FFT
1024 0.101 0.135 0.205 0.349 0.320
2048 0.209 0.296 0.472 0.724 0.723
4096 0.677 0.739 1.177 1.534 1.567
8192 4.220 5.868 11.141 6.462 3.914
16384 13.000 35.565 68.549 42.965 14.019
32768 29.800 82.500 158.278 114.669 63.087
65536 66.900 189.696 373.473 268.995 146.617
131072 150.000 491.680 970.348 672.734 325.168
262144 332.000 1113.715 2190.291 1531.683 858.985

Size Version 3 Version 2 Opt
1024 0.089 0.096 0.149
2048 0.190 0.231 0.332
4096 0.403 0.508 0.747
8192 0.962 1.410 1.870
16384 5.110 5.910 13.200
32768 14.500 16.600 92.000
65536 34.500 38.700 213.00
131072 73.500 89.800 463.000
262144 156.000 194.000 1006.291