Document: |
FFTGraf.htm |
Creation Date: |
12 May 2001 |
Revision Date: |
19 October 2002 |
Title: |
FFTGraf Benchmark |
Keywords: |
PC Benchmark FFT CPU Cache RAM Performance Graph |
Abstract: |
The program runs code for single and double precision Fast Fourier Transforms (FFTs) of size 1024 to 1048576, producing a graph of results. Fourier Transforms are scientific calculations often associated with analysing radio signals.
|
Contributor: |
|
Source: |
|
Note: |
The program is still under test and should be treated as beta software, but it has been run under Win95, Win98, Win98SE, WinME, WinNT4, Win2000 and WinXP. |
|
On starting, the buttons shown on the sample graph can be clicked.
Variables - (or Parameters) see panel below.
Run - runs the tests according to the selected variables and parameters.
Exit - ends the program.
These functions are repeated in the Options menu, which also offers Change Scale to change the scale for comparison purposes, View Graph to redisplay the last results, Load Graph to load results from log files, Load Two to load and compare data from log files, and Save Graph to save the current display to a BMP file. |
Passes each FFT - can be increased to possibly show more variance.
Maximum seconds per pass - gives a warning when the next test is likely to exceed this time.
Minimum FFT Size and Maximum FFT Size - can be selected from 1, 2, 4, 8 --- to 1024 K, where K = 1024.
Maximum msecs/K Scale - can be used to compare graphs on the same scale. Values in the drop-down list are 10, 9, 8 --- to 1; 0.9, 0.8 --- to 0.1; down to 0.009, 0.008 --- to 0.001. The setting reverts to N/A on running new tests.
Text File - allows the default log file name to be changed. |
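The size and scale selections described above can be sketched in a few lines of Python. This is only an illustration, not the program's own code: the function names are hypothetical, the 0.09 to 0.01 decade of scale values is inferred from the words "down to", and reading msecs/K as milliseconds per 1024 FFT points is an assumption.

```python
K = 1024  # the text defines K = 1024


def fft_sizes(min_k=1, max_k=1024):
    """Power-of-two FFT sizes from min_k K to max_k K inclusive."""
    sizes = []
    k = min_k
    while k <= max_k:
        sizes.append(k * K)
        k *= 2
    return sizes


def scale_values():
    """Scale list: 10, then 9..1 in each decade down to 0.001.

    Assumption: the 0.09..0.01 decade exists; the text only says "down to".
    """
    values = [10.0]
    for exponent in (0, -1, -2, -3):
        for digit in range(9, 0, -1):
            values.append(round(digit * 10.0 ** exponent, 3))
    return values


def msecs_per_k(milliseconds, fft_size):
    """Assumed meaning of msecs/K: milliseconds per 1024 FFT points."""
    return milliseconds / (fft_size / K)


if __name__ == "__main__":
    print(fft_sizes())
    print(scale_values())
    # 332 ms at size 262144 is taken from the Version 2 results table.
    print(msecs_per_k(332.0, 262144))
```

Normalising by FFT size in this way is what would let graphs for very different sizes share one vertical scale.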
Example Graph
AMD Duron at 950 MHz, PC133 CAS 2 RAM, Abit KT7-RAID mainboard, Windows 2000. Single Precision graphs are red and Double Precision graphs are blue.
Example Comparison Graph
This shows a comparison of results from an AMD 1330 MHz Thunderbird with PC2100 DDR SDRAM and a 910 MHz PIII with PC133 SDRAM. As AMD CPUs load 64 byte cache lines, compared with 32 bytes on the Intel PIII, the former tend to be slower when data is in RAM. On the other hand, the AMD CPU has a larger L1 cache and is often faster in carrying out floating point calculations.
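One possible reading of the cache line remark is that, when data is not accessed sequentially, every miss brings in a whole line, so a 64 byte line moves twice as much data from RAM as a 32 byte line for the same useful element. The sketch below is a deliberately simplified model built on that assumption: the function name is hypothetical, and it ignores prefetching, bus width and cache capacity.

```python
def bytes_fetched(n_elements, element_bytes, stride_elements, line_bytes):
    """Bytes loaded from RAM if each distinct cache line touched is read once.

    Simplified model (an assumption, not a measurement): no prefetch,
    no evictions, data starts at address 0.
    """
    lines = set()
    for i in range(n_elements):
        first = i * stride_elements * element_bytes
        last = first + element_bytes - 1
        for line in range(first // line_bytes, last // line_bytes + 1):
            lines.add(line)
    return len(lines) * line_bytes


if __name__ == "__main__":
    # 1024 doubles (8 bytes each), read with a stride of 8 elements.
    for line_bytes in (32, 64):
        print(line_bytes, bytes_fetched(1024, 8, 8, line_bytes))
    # The same 1024 doubles read sequentially.
    for line_bytes in (32, 64):
        print(line_bytes, bytes_fetched(1024, 8, 1, line_bytes))
```

In this model the strided case transfers twice as many bytes with 64 byte lines, while the sequential case transfers the same amount with either line size.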
Example Log File
Following is a sample log file, initially produced as FFTGraf.txt. Multiple logs can be included in the same file and the graphs viewed via the menu option. Other sample logs are in the Results folder with details of configurations in file ReadMe.txt.
Logs from Versions 3 and 2, and from a revised Version 1, show Scott's MagSq'd[n/16], Peak Noise and Average Noise accuracy checks for FFTs at 1024K. These show that the two versions produce the same numeric results.
Checks SP 9.999891e-001 3.338028e-006 1.043382e-011
Version 3 single precision numeric checks are slightly different, probably because the SSE/3DNow calculations are carried out in 32 bit floating point units, whereas normal SP is executed in the same unit as double precision.
Checks SP 9.999890e-001 3.338029e-006 1.043487e-011
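To put a number on "slightly different", the two Checks lines quoted above can be parsed and compared. This is a minimal sketch: the function names are hypothetical, and it assumes the three values appear in the order MagSq'd[n/16], Peak Noise, Average Noise, as listed in the text.

```python
def parse_checks(line):
    """Split a 'Checks <precision> <v1> <v2> <v3>' line into its parts."""
    fields = line.split()
    if len(fields) != 5 or fields[0] != "Checks":
        raise ValueError("not a Checks line: %r" % line)
    return fields[1], tuple(float(x) for x in fields[2:])


def relative_differences(a, b):
    """Relative difference of each pair of values."""
    return tuple(abs(x - y) / max(abs(x), abs(y)) for x, y in zip(a, b))


# The two lines exactly as quoted in the text.
FIRST = "Checks SP 9.999891e-001 3.338028e-006 1.043382e-011"
VERSION_3 = "Checks SP 9.999890e-001 3.338029e-006 1.043487e-011"

if __name__ == "__main__":
    _, first_values = parse_checks(FIRST)
    _, v3_values = parse_checks(VERSION_3)
    # Assumed field order, taken from the sentence describing the checks.
    names = ("MagSq'd[n/16]", "Peak Noise", "Average Noise")
    for name, diff in zip(names, relative_differences(first_values, v3_values)):
        print("%-14s relative difference %.3e" % (name, diff))
```

The largest relative difference is in the last value, which is itself many orders of magnitude smaller than the first.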
Comparative Results: Double Precision FFT milliseconds
Following are sample results in milliseconds from Scott’s optimised and original code. Also shown are results for FFT and TFFTDP obtained from ftp://ftp.nosc.mil/pub/aburto/. All C programs were compiled using Watcom Version 11 and run on a Duron 950 MHz with 133 MHz bus/RAM via Windows 2000. Memory used, in bytes, is up to 28 times the FFT size, or 52 times with Version 2.
Scott’s original optimised version uses the same floating point machine code instructions as the non-optimised program but results in data being accessed more sequentially. The nosc.mil programs are more complicated and, probably, the number of variables used does not suit the standard PC floating point register arrangement. So, of these examples, Scott’s code comes out fastest at smaller FFT sizes, with data in L1 and L2 caches. The last example (FFT) is faster than Version 1 when accessing RAM, but is surpassed by Version 2. As the Duron only has single precision 3DNow instructions, Version 3 DP speeds are the same as Version 2.
The second table shows Version 3 (SSE2), Version 2 and Opt results from a 1900 MHz Pentium 4 with PC133 RAM. Improvements obtained by using SSE2 (and SSE, 3DNow) are not that great, as there is a significant pre-calculation overhead, reading/writing randomly from/to memory. Also, during the main calculations, reading and writing to two different areas of memory is not good for data streaming.
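The memory figures quoted above (up to 28 bytes per FFT point, or 52 with Version 2) give a quick upper bound on the working set at each size. The sketch below uses hypothetical function names, and the 128 KB cache capacity is an assumed value chosen for illustration, not a figure stated in the text.

```python
K = 1024

# Bytes per FFT point, from the text: "up to 28 times FFT size or 52 times
# with Version 2". These are upper bounds.
BYTES_PER_POINT = 28
BYTES_PER_POINT_VERSION_2 = 52

# Assumption for illustration only: total cache capacity in bytes.
ASSUMED_CACHE_BYTES = 128 * K


def memory_bytes(fft_size, version2=False):
    """Upper bound on memory used for one FFT of the given size."""
    factor = BYTES_PER_POINT_VERSION_2 if version2 else BYTES_PER_POINT
    return fft_size * factor


def largest_size_fitting(sizes, cache_bytes, version2=False):
    """Largest FFT size whose upper-bound footprint fits in cache_bytes."""
    fitting = [n for n in sizes if memory_bytes(n, version2) <= cache_bytes]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    sizes = [K * 2 ** i for i in range(11)]  # 1 K to 1024 K
    for n in sizes:
        print(n, memory_bytes(n), memory_bytes(n, version2=True))
    print(largest_size_fitting(sizes, ASSUMED_CACHE_BYTES))
```

With the assumed 128 KB, the largest size whose 28 bytes per point footprint still fits is 4096, which is at least consistent with the jump in Opt times between 4096 and 8192 in the Duron results.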
|
Duron 950 MHz, Windows 2000 (milliseconds)
Size | Version 2 | Opt | NoOpt | TFFTDP | FFT |
1024 | 0.101 | 0.135 | 0.205 | 0.349 | 0.320 |
2048 | 0.209 | 0.296 | 0.472 | 0.724 | 0.723 |
4096 | 0.677 | 0.739 | 1.177 | 1.534 | 1.567 |
8192 | 4.220 | 5.868 | 11.141 | 6.462 | 3.914 |
16384 | 13.000 | 35.565 | 68.549 | 42.965 | 14.019 |
32768 | 29.800 | 82.500 | 158.278 | 114.669 | 63.087 |
65536 | 66.900 | 189.696 | 373.473 | 268.995 | 146.617 |
131072 | 150.000 | 491.680 | 970.348 | 672.734 | 325.168 |
262144 | 332.000 | 1113.715 | 2190.291 | 1531.683 | 858.985 |
|
Pentium 4 1900 MHz, PC133 RAM (milliseconds)
Size | Version 3 | Version 2 | Opt |
1024 | 0.089 | 0.096 | 0.149 |
2048 | 0.190 | 0.231 | 0.332 |
4096 | 0.403 | 0.508 | 0.747 |
8192 | 0.962 | 1.410 | 1.870 |
16384 | 5.110 | 5.910 | 13.200 |
32768 | 14.500 | 16.600 | 92.000 |
65536 | 34.500 | 38.700 | 213.000 |
131072 | 73.500 | 89.800 | 463.000 |
262144 | 156.000 | 194.000 | 1006.291 |
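The speed ratios implied by the two tables can be worked out directly. The sketch below copies the Version 2 and Opt columns for the Duron and the Version 3 and Version 2 columns for the Pentium 4 from the tables above; the variable and function names are hypothetical.

```python
SIZES = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144]

# Duron 950 MHz, milliseconds (first table).
DURON_VERSION_2 = [0.101, 0.209, 0.677, 4.220, 13.000, 29.800, 66.900,
                   150.000, 332.000]
DURON_OPT = [0.135, 0.296, 0.739, 5.868, 35.565, 82.500, 189.696,
             491.680, 1113.715]

# Pentium 4 1900 MHz, milliseconds (second table).
P4_VERSION_3 = [0.089, 0.190, 0.403, 0.962, 5.110, 14.500, 34.500,
                73.500, 156.000]
P4_VERSION_2 = [0.096, 0.231, 0.508, 1.410, 5.910, 16.600, 38.700,
                89.800, 194.000]


def ratios(slower, faster):
    """Element-wise time ratio; a value above 1 means 'faster' wins."""
    return [s / f for s, f in zip(slower, faster)]


if __name__ == "__main__":
    duron = ratios(DURON_OPT, DURON_VERSION_2)
    p4 = ratios(P4_VERSION_2, P4_VERSION_3)
    print("Size     Duron Opt/V2   P4 V2/V3")
    for n, d, p in zip(SIZES, duron, p4):
        print("%-8d %-14.2f %.2f" % (n, d, p))
```

On these figures the Version 3 (SSE2) gain over Version 2 on the Pentium 4 stays below a factor of 1.5 at every size, whereas Version 2 is more than three times faster than Opt on the Duron at the largest size shown.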