General

Roy Longbottom’s PC Benchmark Collection comprises numerous FREE benchmarks and reliability testing programs for processors, caches, memory, buses, disks, flash drives, graphics, local area networks and the Internet. The original ones run under DOS and later ones under all varieties of Windows. Most have also been converted to run under Linux on PCs, and many for ARM CPUs via Android and Raspbian. For the latter, details, download links for benchmark execution files and source code, and results are provided in Raspberry Pi Benchmarks.htm, with downloads for the multithreading program codes in Raspberry_Pi_MP_Benchmarks.zip.

The ARM benchmarks use the same multithreading programming code as my Linux MultiThreading Benchmarks. Each of these programs obtains the same configuration details as the Android versions but, unlike with Android, results are saved in a text log file as well as being displayed. An example of the Raspberry Pi details obtained is shown below. When more than one CPU core is provided, separate details are normally shown for each one, labelled Processor 0, Processor 1 etc. At the end of each benchmark, any appropriate additional information can be entered from the keyboard.

The C program codes used for the RPi were also compiled on a Linux based PC, the only change being for the version name (to Linux/Intel from Linux/ARM). This Intel version is included in the zip file. Results below include those from this version on an Intel Atom CPU and a quad core AMD Phenom processor.

Results are now included for Raspberry Pi 2, which has a quad core ARM V7 processor. The original benchmarks were run along with revised versions (MP-xxxxPiA7), compiled with gcc 4.8 to use advanced hardware features identified in cpuinfo details. The new benchmarks are included in the zip file. An example of the compile command that uses the new features is shown below. This also includes -funsafe-math-optimizations, which can produce incorrect results. For these benchmarks, it leads to acceptable minor rounding differences.

2016 - The latest benchmarks were run on Raspberry Pi 3 Model B, which includes a quad core Broadcom BCM2837 system-on-chip running at 1200 MHz, each core having a 32 KB L1 cache. There is a shared 512 KB L2 cache and 1 GB RAM. The CPU is an ARM Cortex-A53, capable of 64 bit working, but presently only 32 bit operation is supported. A different graphics driver had to be installed to run a new OpenGL GLUT benchmark. In certain cases, benchmarks were rerun with this driver disabled, as it could lead to degraded CPU performance.
CPU, Cache and RAM MFLOPS Benchmarks
This benchmark executes the same functions as my CUDA and OpenMP performance tests. Details and results of these can be found in Linux CUDA MFLOPS.htm and OpenMP Speeds.htm.
The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 operations per input data word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words. The program can be run using between 1 and 64 threads and checks for consistent numeric results, primarily to show that all calculations are carried out. Each thread carries out the same calculations but accesses a different segment of the data.
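The threading scheme described above can be sketched in C with POSIX threads. This is a minimal illustration and not the benchmark's source code: the names segment_t, triad8 and run_threads are hypothetical, the constants a to f are arbitrary values assumed for the example, and only the 8 operations per word form is shown.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64

/* Hypothetical description of the work given to one thread. */
typedef struct {
    float *x;       /* start of this thread's segment of the array */
    size_t words;   /* number of 4 byte words in the segment */
    int passes;     /* repeat count, used to lengthen running time */
} segment_t;

/* Constants assumed for this sketch only. */
static const float a = 0.000020f, b = 0.999980f,
                   c = 0.000011f, d = 1.000011f,
                   e = 0.000012f, f = 0.999992f;

/* 8 operations per word: 3 adds, 3 multiplies, 1 subtract, 1 add. */
static void *triad8(void *arg)
{
    segment_t *s = arg;
    for (int p = 0; p < s->passes; p++)
        for (size_t i = 0; i < s->words; i++)
            s->x[i] = (s->x[i] + a) * b - (s->x[i] + c) * d + (s->x[i] + e) * f;
    return NULL;
}

/* Split x[0..words) into one segment per thread and process them in
   parallel. Returns 0 on success, -1 if a thread could not be created. */
int run_threads(float *x, size_t words, int threads, int passes)
{
    pthread_t tid[MAX_THREADS];
    segment_t seg[MAX_THREADS];
    assert(threads >= 1 && threads <= MAX_THREADS);

    size_t chunk = words / (size_t)threads;
    int started = 0, rc = 0;
    for (int t = 0; t < threads; t++) {
        seg[t].x = x + (size_t)t * chunk;
        /* The last thread also takes any remainder words. */
        seg[t].words = (t == threads - 1) ? words - (size_t)t * chunk : chunk;
        seg[t].passes = passes;
        if (pthread_create(&tid[t], NULL, triad8, &seg[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return rc;
}
```

Because every word receives the same sequence of operations whichever thread handles it, running the same data with 1 thread and with 4 threads should give matching numeric results, which is the kind of consistency the benchmark checks for. With gcc, a build along the lines of gcc -O2 -pthread would normally be needed.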
|
MP-MFLOPS Linux/ARM v1.0 Sat Jul 27 17:41:13 2013
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 43 33 31 191 170 161
2T 44 42 31 192 174 160
4T 44 43 31 192 176 159
8T 43 51 31 192 184 160
Results x 100000
1T 86735 98519 99984 79897 97638 99975
2T 86735 98519 99984 79897 97638 99975
4T 86735 98519 99984 79897 97638 99975
8T 86735 98519 99984 79897 97638 99975
End of test Sat Jul 27 17:42:00 2013
Later 76406 97075 99969 66015 95363 99951
Neon 76406 97075 99969 66008 95367 99951
DP 76384 97072 99969 66065 95370 99951
|
Although these Raspberry Pi MFLOPS speeds are quite impressive, they are nowhere near the claimed maximum capabilities. For example, the RPi 3 single precision maximum is 38.4 GFLOPS, at 32 MFLOPS/MHz, with double precision at a quarter of that speed. The first maximum RPi 3 NEON-VFP speeds measured were 6.03 GFLOPS SP and 2.3 GFLOPS DP, the former at 5.0 MFLOPS/MHz.
On the other hand, the same source code, compiled for Intel CPUs with GCC, obtained 23 out of 32 MFLOPS/MHz with SSE instructions and 45.6 out of 64 MFLOPS/MHz with AVX 1 options. This was on a Core i7 CPU. See Linux MP-MFLOPS Benchmarks.
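The MFLOPS/MHz figures quoted above are simple ratios of speed to clock frequency. The following C fragment is a small sketch of that arithmetic, with hypothetical function names, using the RPi 3 values from the text (1200 MHz, 32 MFLOPS/MHz claimed, 6.03 GFLOPS measured).

```c
#include <assert.h>

/* Theoretical peak in MFLOPS, given the clock in MHz and the claimed
   floating point operations per clock cycle summed over all cores. */
double peak_mflops(double cpu_mhz, double flops_per_cycle_all_cores)
{
    return cpu_mhz * flops_per_cycle_all_cores;
}

/* Measured MFLOPS expressed per MHz of CPU clock. */
double mflops_per_mhz(double mflops, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return mflops / cpu_mhz;
}
```

For the RPi 3, peak_mflops(1200.0, 32.0) gives 38400 MFLOPS, or 38.4 GFLOPS, and mflops_per_mhz(6030.0, 1200.0) gives about 5.0, so the measured maximum is roughly 16% of the claimed capability.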
Some of the instructions generated by the compiler, for the Raspberry Pi, are shown below, with some explanation.
Below are performance results for the RPi, at normal MHz settings and with maximum overclocking. Speed improvements, due to the latter, are approximately proportional to differences in CPU and SDRAM MHz. Results from the Android version, running on a four core CPU, are also provided. This shows speed gains of up to four times that for a single core but, in this case, needs eight threads to do it.
The Atom has Hyperthreading that allows more than one thread to run at the same time on a single core CPU. Results indicate that performance is mainly dependent on CPU speed, whereas there is some degradation due to cache and RAM speeds on the other systems. The quad core AMD CPU speed using L2 cache can be faster than with data from L1 cache, probably due to some conflict on storing results of calculations. Note that Intel numeric results are slightly different to those from ARM CPUs.
Comparing MP-MFLOPS speeds between the old RPi and Raspberry Pi 2 (1 core vs 4 cores), all at 1000 MHz, shows performance increases of 8 to 10 times at 2 operations per word and 6 to 7 times with the higher instruction count. This benchmark only uses single precision floating point arithmetic. As for other benchmarks, the new V7A compilation produced essentially the same MFLOPS speeds as the original MP-MFLOPS program. Running time on the RPi 2 was rather short, so the run pass parameter was increased for a longer run. As expected, this led to slightly different numeric results (see below).
The program was recompiled, including the -funsafe-math-optimizations parameter to force the use of NEON instructions, as MP-NeonMFLOPS. The V7A2 NEON entry below shows results at 1000 MHz, achieving a peak performance of around 3 GFLOPS. With the lower processing per word tests, average speed gains of 24 times were demonstrated, for cache based data, compared with the original RPi. The revised compilation included fused multiply accumulate instructions, where slightly different answers can be produced (see @@@@@ below).
An earlier Android benchmark executes the same functions as MP-MFLOPS, but using NEON intrinsic functions. This was also converted for the RPi 2 - see MP-NeonMFLOPS, where results are almost the same as the NEON compiled version.
Raspberry Pi 3 - With a CPU MHz 1.33 times that of the Raspberry Pi 2, some MP-MFLOPS benchmark speed gains were not as high as that ratio. That was the case with MP-MFLOPSPiA7 at 2 operations per word, but it averaged 55% faster at 32 operations per word. MP-MFLOPSPiNeon was much better, with average performance ratios of 1.34 and 2.30 for the two sets of tests. It was also much faster than MP-MFLOPSPiA7, being more than twice as fast with 32 calculations per word and up to 4.66 times faster with cached data at 2 operations per word (see assembly code below).
MP-MFLOPSDP is the double precision compilation of MP-MFLOPSPiA7. It compiled to the same instructions as the single precision version, but in the form applicable to 64 bit registers. Speeds were effectively the same, except that some tests were slower with RAM based data.
pi@raspberrypi ~/benchmarks/mpmflops $ ./MP-MFLOPS
V7A ./MP-MFLOPSPiA7
*****************************************************
Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz
MP-MFLOPS Linux/ARM v1.0 Sat Jul 27 17:41:13 2013
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 43 33 31 191 170 161
2T 44 42 31 192 174 160
4T 44 43 31 192 176 159
8T 43 51 31 192 184 160
Results x 100000
1T 86735 98519 99984 79897 97638 99975
End of test Sat Jul 27 17:42:00 2013
############################ RPi OC ##############################
Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, over_voltage=6
MP-MFLOPS Linux/ARM v1.0 Sat Jul 27 18:45:14 2013
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 49 58 45 278 255 237
2T 72 60 46 278 262 240
4T 72 62 46 279 265 239
8T 72 70 46 279 225 234
Results x 100000
1T 86735 98519 99984 79897 97638 99975
End of test Sat Jul 27 18:45:46 2013
########################### RPi 2 ###############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-MFLOPS Linux/ARM v1.0 Mon Mar 2 17:14:57 2015
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 149 147 130 410 395 387
2T 298 291 254 820 803 782
4T 526 409 393 1519 1622 1456
8T 494 486 335 1581 1518 1436
Results x 100000
1T 86735 98519 99984 79897 97638 99975
End of test Mon Mar 2 17:15:07 2015
######################### RPi 2 OC ############################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-MFLOPS Linux/ARM v1.0 Wed Mar 4 10:24:36 2015
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 136 161 144 452 451 433
2T 328 322 282 904 900 862
4T 593 546 449 1739 1790 1711
8T 543 537 437 1588 1679 1578
Results x 100000
1T 86735 98519 99984 79897 97638 99975
End of test Wed Mar 4 10:24:45 2015
######################### RPi 2 V7A ############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-MFLOPS Linux/ARM V7A v1.0 Sun Mar 15 12:50:06 2015
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 113 158 138 438 437 417
2T 321 315 266 884 873 831
4T 424 611 343 1706 1731 1629
8T 560 512 332 1579 1622 1520
Results x 100000
1T 86735 98519 99984 79897 97639 99975
End of test Sun Mar 15 12:50:15 2015
##################### V7A2 Increased Passes ####################
######################### RPi 2 V7A2 ###########################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-MFLOPS Linux/ARM V7A v1.0 Fri Mar 20 16:59:26 2015
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 158 158 136 424 435 414
2T 322 314 264 875 868 824
4T 528 533 394 1731 1744 1612
8T 549 505 392 1639 1629 1518
Results x 100000
1T 76406 97075 99969 66015 95363 99951
End of test Fri Mar 20 16:59:44 2015
################## RPi 2 V7A2 Compiled NEON ####################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-MFLOPS Compiled NEON v1.0 Tue Aug 16 11:18:32 2016
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 357 451 337 690 688 657
2T 885 769 426 1355 1354 1315
4T 1320 1747 382 2700 2721 2552
8T 1391 1405 381 2548 2653 2446
Results x 100000
1T 76406 97075 99969 66008 95367 99951
End of test Tue Aug 16 11:18:43 2016
######################## RPi 2 V7A2 OC #########################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-MFLOPS Linux/ARM V7A v1.0 Fri Mar 20 17:17:44 2015
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 138 177 158 488 487 454
2T 359 352 308 976 968 922
4T 552 627 465 1939 1906 1760
8T 554 586 453 1763 1830 1779
Results x 100000
1T 76406 97075 99969 66015 95363 99951
End of test Fri Mar 20 17:18:00 2015
################ RPi 2 V7A2 Compiled NEON OC ###################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-MFLOPS Compiled NEON v1.0 Fri Mar 20 17:18:25 2015
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 369 504 386 769 766 736
2T 1052 947 486 1521 1513 1440
4T 2052 2023 470 3040 2917 2854
8T 1764 1920 459 2860 2883 2597
Results x 100000
1T 76406 97075 99969 66008 95367 99951
@@@@@ @@@@@
End of test Fri Mar 20 17:18:35 2015
######################### RPi 3 V7A2 ###########################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-MFLOPS Linux/ARM V7A v1.0 Mon Aug 15 19:07:03 2016
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 168 182 171 691 693 684
2T 364 358 329 1382 1381 1358
4T 408 484 401 2451 2561 2436
8T 609 554 420 2531 2425 2315
Results x 100000
1T 76406 97075 99969 66015 95363 99951
End of test Mon Aug 15 19:07:15 2016
########## RPi 3 V7A2 New OpenGL GLUT Driver Disabled ##########
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-MFLOPS Linux/ARM V7A v1.0 Tue Aug 30 14:16:59 2016
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 159 181 178 690 692 685
2T 342 364 353 1384 1386 1368
4T 466 501 456 2451 2473 2633
8T 581 643 479 2618 2502 2550
Results x 100000
1T 76406 97075 99969 66015 95363 99951
End of test Tue Aug 30 14:17:11 2016
################# RPi 3 V7A2 Double Precision ##################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-MFLOPS Double Precision v1.0 Wed Sep 7 17:07:12 2016
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 143 182 171 678 680 674
2T 343 361 240 1360 1360 1335
4T 441 712 240 2232 2208 2185
8T 406 593 241 2345 2315 2272
Results x 100000
1T 76384 97072 99969 66065 95370 99951
End of test Wed Sep 7 17:07:18 2016
################## RPi 3 V7A2 Compiled NEON ####################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-MFLOPS Compiled NEON v1.0 Mon Aug 15 19:09:46 2016
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 419 782 437 1672 1660 1637
2T 1324 1529 442 3331 3308 3212
4T 1903 1574 439 5040 6073 5738
8T 1613 2204 433 5543 5780 5445
Results x 100000
1T 76406 97075 99969 66008 95367 99951
End of test Mon Aug 15 19:09:52 2016
####### RPi 3 V7A2 NEON New OpenGL GLUT Driver Disabled ########
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-MFLOPS Compiled NEON v1.0 Tue Aug 30 14:18:13 2016
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 488 774 485 1674 1652 1644
2T 1438 1503 488 3341 3299 3262
4T 1984 1703 472 5045 5125 5256
8T 1567 2098 470 5527 5400 5021
Results x 100000
1T 76406 97075 99969 66008 95367 99951
End of test Tue Aug 30 14:18:18 2016
########################## Other ###############################
P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4
Android MP-MFLOPS v7 Benchmark V1.0 23-Dec-2012 14.12
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 208 188 172 588 675 643
2T 392 375 302 1323 1342 1311
4T 472 439 321 1824 1758 1645
8T 619 608 381 2666 2537 2645
Total Elapsed Time 6.7 seconds
*****************************************************
Intel Atom 1.66 GHz, Linux Ubuntu 10.10
MP-MFLOPS Linux/Intel v1.0 Sat Jul 27 18:18:15 2013
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 207 206 201 406 404 404
2T 303 363 354 799 793 783
4T 330 367 357 798 795 788
8T 321 366 354 796 793 788
Results x 100000
1T 86723 98518 99984 79927 97642 99975
End of test Sat Jul 27 18:18:26 2013
*****************************************************
Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04
MP-MFLOPS Linux/Intel v1.0 Tue Jul 30 15:12:02 2013
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 1987 1977 1585 3827 3826 3732
2T 3835 3975 2527 7631 7654 7442
4T 6723 7873 2932 10822 14463 13728
8T 5890 7659 5497 10300 14452 14006
Results x 100000
1T 86723 98518 99984 79927 97642 99975
End of test Tue Jul 30 15:12:03 2013
|
The Whetstone programs, initially used in 1972, were the first general purpose benchmarks that set industry standards of computer system performance. Details and performance of early to modern systems can be found in Whetstone Benchmark History And Results and Results On PCs. The overall performance rating is in Millions of Whetstone Instructions Per Second (MWIPS). Later, it was found necessary to measure the speed of the eight different test functions used, to demonstrate that compilers were not over optimising and to allow code tweaks to avoid this situation. The additional measurements are in terms of Millions of Operations Per Second (MOPS) or MFLOPS for straight floating point calculations. As the design authority, nominated by the original author, I have to say that versions that do not provide these separate measurements cannot be taken as valid.
This multithreading benchmark runs using 1, 2, 4 and 8 threads, executing multiple copies of the same program. An initial calibration, using a single thread, determines the number of passes needed for an overall execution time of 5 seconds. Then all threads are run using the same pass count, running time being extended when there are more threads than CPUs. The same calculations are carried out on each thread but using dedicated variables. The numeric results of calculations are noted for the first thread with others checked for the same values and an error message displayed if they are inconsistent.
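The calibration step described above can be sketched as follows. This is an assumed approach for illustration, not the benchmark's own code: the pass count is doubled until a single threaded run is long enough to time reliably, then scaled to the target running time. The names now_seconds, one_pass and calibrate_passes are hypothetical, and one_pass is only a stand-in for the real test functions.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Wall clock seconds from a monotonic clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for one pass of the test functions. */
static volatile double sink;
static void one_pass(void)
{
    double x = 1.0;
    for (int i = 0; i < 100000; i++)
        x = (x + 0.5) * 0.5;
    sink = x;   /* keep the result live so the loop is not removed */
}

/* Double the pass count until a timed single threaded run lasts at
   least min_seconds, then scale it to the target running time. */
long calibrate_passes(double target_seconds, double min_seconds)
{
    assert(target_seconds > 0.0 && min_seconds > 0.0);
    long passes = 1;
    double elapsed;
    for (;;) {
        double start = now_seconds();
        for (long p = 0; p < passes; p++)
            one_pass();
        elapsed = now_seconds() - start;
        if (elapsed >= min_seconds)
            break;
        passes *= 2;
    }
    long scaled = (long)((double)passes * target_seconds / elapsed);
    return scaled > 0 ? scaled : 1;
}
```

A call such as calibrate_passes(5.0, 0.1) would return the pass count to give every thread. With the same count used for all threads, elapsed time stays near the target while there is a core per thread and roughly doubles at 8 threads on 4 cores, as the Overall Seconds lines in the logs below indicate.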
Displayed speeds are in the order that tests are run but are sorted for logged results, as shown below.
Relative performance due to overclocking is similar to MP-MFLOPS, an exception being fixed point calculations, where the particular compiler might have optimised the code too much, and many more passes could be needed to produce consistent speeds. The Galaxy SIII and AMD Phenom are also more inclined to achieve a four times performance gain with quad cores. The Atom hyperthreading shows improved throughput with multiple threads on all tests.
Running the original benchmark on the Raspberry Pi 2 shows a performance increase of between 1.3 and 2.1 times, on the different tests, on a single core at 1000 MHz. With multithreading, this leads to a 10.2 times increase on MFLOPS, 5.8 times on functions and 7.5 times on integer MOPS. The revised PiA7 compilation generates slower code on some tests but the three quoted ratios increase to 12.8, 6.1 and 8.2 times.
Raspberry Pi 3 Overall MWIPS ratings are 1.37 times RPi 2 speeds, with ratios for other tests in the range 1.19 to 1.79, except the last copy test average of 2.73.
pi@raspberrypi ~/benchmarks/mpwhetss $ ./MP-WHETS
V7A ./MP-WHETSPiA7
*****************************************************
Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz
MP-Whetstone Benchmark Linux/ARM v1.0 Sat Jul 27 17:44:25 2013
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 243.6 91.9 90.9 84.4 4.9 2.7 2143.6 471.5 114.9
2T 256.8 61.2 90.0 82.3 5.6 2.7 2201.7 496.9 120.4
4T 258.5 74.5 96.0 84.2 5.6 2.7 2272.7 501.5 118.7
8T 258.5 84.2 95.2 85.1 5.6 2.7 2774.6 522.9 106.0
Overall Seconds 3.26 1T, 6.34 2T, 12.57 4T, 25.80 8T
############################ RPi OC ##############################
Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, over_voltage=6
MP-Whetstone Benchmark Linux/ARM v1.0 Sat Jul 27 18:41:42 2013
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 373.6 136.0 135.3 122.5 8.2 3.8 3271.4 705.5 180.1
2T 375.2 122.5 133.4 118.4 8.2 3.9 3247.6 733.4 180.0
4T 377.0 132.2 139.3 122.3 8.2 3.9 6267.6 737.9 172.3
8T 377.0 135.1 140.6 123.2 8.2 3.9 4585.5 749.7 162.5
Overall Seconds 3.52 1T, 7.07 2T, 14.23 4T, 28.83 8T
########################### RPi 2 ###############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-Whetstone Benchmark Linux/ARM v1.0 Tue Mar 3 16:37:24 2015
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 515.4 250.2 250.0 223.0 10.0 5.1 4421.6 891.1 334.5
2T 1035.2 500.9 501.9 447.7 20.0 10.2 8878.8 1789.0 671.3
4T 2063.4 960.6 996.0 893.6 39.9 20.5 17560.2 3559.3 1334.9
8T 2140.9 1192.4 1325.4 992.3 40.3 21.2 24312.0 3968.1 1379.2
Overall Seconds 4.98 1T, 4.98 2T, 5.06 4T, 10.11 8T
######################### RPi 2 OC #############################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-Whetstone Benchmark Linux/ARM v1.0 Wed Mar 4 10:34:01 2015
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 577.6 280.2 280.1 249.7 11.2 5.7 4961.5 998.2 374.7
2T 1155.5 560.1 560.1 499.4 22.4 11.4 9915.1 1995.8 749.3
4T 2290.3 1080.3 1110.8 994.2 44.4 22.8 13471.3 3642.0 1491.6
8T 2405.6 1506.5 1490.2 1103.8 45.9 23.5 28234.7 5151.7 1552.5
Overall Seconds 4.74 1T, 4.74 2T, 4.84 4T, 9.82 8T
######################### RPi 2 V7A #############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-Whetstone Benchmark Linux/ARM V7A v1.0 Tue Mar 3 16:44:08 2015
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 527.8 361.8 362.9 184.2 10.0 5.6 3316.1 889.2 445.2
2T 1056.9 724.9 729.2 368.6 20.0 11.2 6638.7 1779.1 891.6
4T 2119.0 1381.1 1454.5 739.2 40.1 22.5 13301.0 3571.3 1788.4
8T 2195.2 1912.9 1849.8 805.7 40.8 23.1 17643.5 4808.5 1893.6
Overall Seconds 4.70 1T, 4.70 2T, 4.75 4T, 9.56 8T
######################## RPi 2 V7A OC ###########################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-Whetstone Benchmark Linux/ARM V7A v1.0 Tue Mar 3 17:54:38 2015
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 593.5 409.5 409.3 206.7 11.3 6.3 3729.2 998.2 499.6
2T 1181.3 814.4 801.0 411.7 22.4 12.5 7423.4 1988.4 994.7
4T 2351.2 1486.6 1527.9 813.0 44.7 25.0 14800.0 3825.0 1989.5
8T 2452.9 2199.5 2099.1 890.8 45.3 26.1 21104.2 5439.9 2084.2
Overall Seconds 4.98 1T, 5.03 2T, 5.10 4T, 10.26 8T
######################### RPi 3 V7A #############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-Whetstone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:34:21 2016
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 723.1 517.2 517.0 254.9 12.1 8.8 5853.9 1181.8 1189.8
2T 1464.7 960.5 1025.1 511.3 24.1 18.5 11899.0 2381.2 2385.7
4T 2902.3 1696.4 1867.3 1013.4 47.8 36.8 19754.6 4541.3 4687.1
8T 3004.0 2747.8 2569.0 1066.4 48.6 38.0 25502.9 6075.2 5610.8
Overall Seconds 4.77 1T, 4.74 2T, 4.88 4T, 9.76 8T
########################## Other #################################
P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4
Android MP-Whetstone Benchmark V1.0 23-Dec-2012 14.36
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 1206.4 266.3 269.4 310.1 30.1 17.6 522.8 551.8 597.9
2T 2411.7 520.5 530.0 619.1 60.0 35.1 1026.4 1359.2 1195.9
4T 4719.0 874.2 881.7 1231.1 119.1 69.6 2072.8 2779.4 2369.0
8T 4676.4 1227.1 1105.1 1182.4 120.0 63.2 2254.4 2821.8 2299.5
Overall Seconds 4.84 1T, 4.82 2T, 5.14 4T, 10.25 8T
*****************************************************
Intel Atom 1.66 GHz, Linux Ubuntu 10.10
MP-Whetstone Benchmark Linux/Intel v1.0 Sat Jul 27 18:08:28 2013
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 704.4 329.0 328.9 280.8 17.1 7.6 763.7 1248.6 117.7
2T 1203.2 562.1 613.8 484.0 30.4 13.4 997.5 1688.6 176.8
4T 1203.8 605.2 618.4 477.3 30.4 13.4 993.0 1688.2 178.0
8T 1206.9 608.1 619.8 486.1 30.3 13.4 1008.3 1702.2 177.9
Overall Seconds 4.99 1T, 6.28 2T, 12.48 4T, 24.93 8T
*****************************************************
Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04
MP-Whetstone Benchmark Linux/Intel v1.0 Tue Jul 30 15:11:23 2013
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 2662.8 924.4 926.2 695.0 64.3 34.7 3088.3 2258.4 620.7
2T 5304.1 1850.8 1850.9 1387.8 128.4 69.1 6210.0 4507.3 1200.1
4T 10582.7 3551.7 3668.0 2771.6 256.7 138.0 12173.2 8966.3 2399.4
8T 10637.9 3758.1 3754.6 2772.7 257.2 138.5 12389.8 9104.7 2441.1
Overall Seconds 4.90 1T, 4.98 2T, 5.04 4T, 9.94 8T
|
The Dhrystone "C" benchmark provides a measure of integer performance (no floating point instructions). It became the key standard benchmark from 1984. Speed was originally measured in Dhrystones per second. This was later changed to VAX MIPS by dividing Dhrystones per second by 1757, the DEC VAX 11/780 result, the latter being regarded as the first 1 MIPS minicomputer. Details and results from Windows and Linux based PCs can be found in Dhrystone Results.htm.
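The VAX MIPS conversion described above is a single division. A minimal C sketch, with a hypothetical function and macro name, is:

```c
#include <assert.h>

/* Dhrystones per second of the DEC VAX 11/780, taken as 1 MIPS. */
#define VAX_11_780_DHRYSTONES_PER_SEC 1757.0

/* Convert a Dhrystones per second score into a VAX MIPS rating. */
double vax_mips(double dhrystones_per_second)
{
    assert(dhrystones_per_second >= 0.0);
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SEC;
}
```

As a check against the logged results, vax_mips(1650351.0) is 939.3, agreeing with the rating of 939 reported for a single thread on the 700 MHz Raspberry Pi.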
This multithreading benchmark runs using 1, 2, 4 and 8 threads, executing multiple copies of the same program. An initial calibration, using a single thread, determines the number of passes needed for an overall execution time of 1 second. Then all threads are run using the same pass count, running time being extended when there are more threads than CPUs. The same calculations are carried out on each thread.
Some variables can be used by all threads and it might be foreseen that this could cause the program to crash. Data arrays have been moved so that different RAM will be allocated for each thread. One of the locations is used to count the number of passes used by each thread and these are checked for consistency.
MP-Dhrystone Benchmark Linux/ARM v1.0
Fri Jul 26 12:25:24 2013
Using 1, 2, 4 and 8 Threads
Threads Dhrys/sec VAX MIPS
1 1650351 939
2 1547631 881
4 1594706 908
8 1619087 922
Internal pass count correct all threads
End of test Fri Jul 26 12:25:40 2013
|
RasPi performance improvement due to overclocking is again proportional to CPU MHz. The Android quad core phone shows limited performance gains of up to 2.63 times, possibly a shared data effect. Again, the Atom shows performance gains due to Hyperthreading.
Worst comparisons are on the Phenom PC, where performance using two threads is a lot slower than one thread, probably due to a conflict in updating results.
These Raspberry Pi 2 results also show multithreading performance degradations, through handling the shared data, with wide variations in measured speed. A 1000 MHz single core produces a 40% improvement in performance, compared with RPi 1 at the same frequency, with no gain via the PiA7 recompilation.
Raspberry Pi 3 performance using a single thread, at 1.43 times faster than model 2, is only a little better than the CPU MHz ratio of 1.33. It appears to perform much better using more threads, at 3.49 times faster.
pi@raspberrypi ~/benchmarks/mphry $ ./MP-DHRY
V7A ./MP-DHRYPiA7
*****************************************************
Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz
MP-Dhrystone Benchmark Linux/ARM v1.0 Sat Jul 27 17:38:10 2013
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.97 2.07 4.01 7.91
Dhrystones per Second 1650351 1547631 1594706 1619087
VAX MIPS rating 939 881 908 922
Internal pass count correct all threads
End of test Sat Jul 27 17:38:26 2013
########################### RPi OC ##############################
Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, over_voltage=6
MP-Dhrystone Benchmark Linux/ARM v1.0 Sat Jul 27 18:48:23 2013
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.67 1.38 2.72 5.41
Dhrystones per Second 2388323 2324087 2354828 2364828
VAX MIPS rating 1359 1323 1340 1346
Internal pass count correct all threads
End of test Sat Jul 27 18:48:34 2013
########################### RPi 2 ###############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-Dhrystone Benchmark Linux/ARM v1.0 Mon Mar 2 17:12:58 2015
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.67 3.45 7.33 14.56
Dhrystones per Second 2985075 1159420 1091405 1098901
VAX MIPS rating 1699 660 621 625
Internal pass count correct all threads
End of test Mon Mar 2 17:13:06 2015
######################### RPi 2 OC #############################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-Dhrystone Benchmark Linux/ARM v1.0 Wed Mar 4 12:04:06 2015
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.96 2.22 9.92 19.82
Dhrystones per Second 3333333 2882883 1290323 1291625
VAX MIPS rating 1897 1641 734 735
Internal pass count correct all threads
End of test Wed Mar 4 12:04:17 2015
######################### RPi 2 V7A #############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Tue Mar 3 15:53:27 2015
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.54 0.68 2.21 2.95
Dhrystones per Second 2956666 4706235 2895209 4339729
VAX MIPS rating 1683 2679 1648 2470
Internal pass count correct all threads
End of test Tue Mar 3 15:53:34 2015
####################### RPi 2 V7A OC ###########################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Tue Mar 3 17:41:09 2015
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.97 1.18 2.51 5.02
Dhrystones per Second 3286275 5439640 5096932 5094843
VAX MIPS rating 1870 3096 2901 2900
Internal pass count correct all threads
End of test Tue Mar 3 17:41:20 2015
######################### RPi 3 V7A #############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:47:57 2016
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.95 1.12 1.59 3.04
Dhrystones per Second 4229473 7124952 10091677 10523432
VAX MIPS rating 2407 4055 5744 5989
Internal pass count correct all threads
End of test Mon Aug 15 19:48:04 2016
########################## Other #################################
P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4
Android MP-Dhrystone 2 Benchmark V1.0 23-Dec-2012 14.47
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.50 0.57 1.07 1.53
Dhrystones per Second 3187471 5583906 5972050 8389079
VAX MIPS rating 1814 3178 3399 4775
Internal pass count correct all threads
Total Elapsed Time 4.2 seconds
*****************************************************
Intel Atom 1.66 GHz, Linux Ubuntu 10.10
MP-Dhrystone Benchmark Linux/Intel v1.0 Sat Jul 27 17:59:01 2013
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.87 1.37 2.78 5.47
Dhrystones per Second 4624003 5836935 5756209 5845862
VAX MIPS rating 2632 3322 3276 3327
Internal pass count correct all threads
End of test Sat Jul 27 17:59:12 2013
*****************************************************
Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04
MP-Dhrystone Benchmark Linux/Intel v1.0 Tue Jul 30 15:10:47 2013
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.58 2.63 6.91 13.86
Dhrystones per Second 13854293 6080597 4628248 4618862
VAX MIPS rating 7885 3461 2634 2629
Internal pass count correct all threads
End of test Tue Jul 30 15:11:11 2013
|
This uses the same calculations as my original BusSpeed2K Benchmark, the link providing data and results, including Windows and Linux MP varieties.
Data is read using AND instructions at a range of data sizes covering caches and RAM.
The program starts by reading words with 32 word address increments, then reduces the increment to eventually read all words sequentially. Speed reductions of around 50% at each higher increment suggest reading in bursts over the bus. This is normal for reading from RAM and is sometimes found reading cached data.
In this case, only 12.3 KB, 123 KB and 12.3 MB memory sizes are used via 1, 2, 4 and 8 threads.
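The strided reading described above can be sketched in C as below. This is an illustrative approximation rather than the benchmark's code: the function names are hypothetical, and the MB/second calculation assumes that only the bytes actually read are counted, which is consistent with speeds roughly halving at each higher increment but is not confirmed by the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read every inc-th 32 bit word of data[0..words) and AND the values
   together. Returning the result stops the compiler discarding the reads. */
uint32_t and_read(const uint32_t *data, size_t words, size_t inc)
{
    assert(inc >= 1);
    uint32_t acc = 0xFFFFFFFFu;
    for (size_t i = 0; i < words; i += inc)
        acc &= data[i];
    return acc;
}

/* MB/second for a timed run, counting only the words actually read
   (an assumption made for this sketch). */
double mb_per_second(size_t words, size_t inc, int passes, double seconds)
{
    assert(inc >= 1 && seconds > 0.0);
    size_t words_read = (words + inc - 1) / inc;
    return (double)words_read * 4.0 * (double)passes / seconds / 1.0e6;
}
```

Calling and_read with inc values of 32, 16, 8, 4, 2 and 1 corresponds to the Inc32 to RdAll columns in the results below.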
Speeds using L1 cache and large address increments can be unpredictable and not show performance gains using multiple cores. Some of this might be due to high overheads compared with actual execution time. Note that RasPi L2 cache speeds are relatively slow, compared with those from L1.
Ignoring the wildly variable burst reading comparisons, with CPUs at 1000 MHz, single core Raspberry Pi 2 performance, via L1 cache, showed no improvement. There were significant gains at 122.9 KB, L2 cache test. The PiA7 compilation produced some performance increases at 122.9 KB and double speed via RAM. Bottom line four thread comparison gains, against single core RPi 1, were 4.1 times from L1 cache, 17.5 times via L2 cache and 11.7 times with RAM based data, at 3.35 GB/second.
The exaggerated performance is valid, where all threads read the same data but, as the 512 KB L2 cache is shared between all cores, measured speed does not reflect RAM speed. In order to demonstrate more realistic memory speeds, a second version, MP-BusSpd2PiA7, was produced, where each thread starts reading from a different address (see RPi 2 V7A 2 and RPi 2 V7A 2 OC below). This produced fairly constant RAM speeds using multiple threads.
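The idea of staggered starting addresses can be sketched as follows. How MP-BusSpd2PiA7 actually chooses its starting points is not described here, so the even spacing used by the hypothetical start_word helper, and the wrap round in and_read_from, are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: first word index for thread t of n_threads,
   spacing the threads evenly through an array of n_words words. */
size_t start_word(size_t t, size_t n_threads, size_t n_words)
{
    assert(n_threads > 0 && t < n_threads);
    return (n_words / n_threads) * t;
}

/* AND-read all n_words once, beginning at start and wrapping round,
   so each thread covers the same data but in a different order. */
uint32_t and_read_from(const uint32_t *data, size_t n_words, size_t start)
{
    assert(n_words > 0 && start < n_words);
    uint32_t acc = 0xFFFFFFFFu;
    size_t i = start;
    for (size_t k = 0; k < n_words; k++) {
        acc &= data[i];
        if (++i == n_words)
            i = 0;
    }
    return acc;
}
```

With four threads and RAM sized data, each thread would then be working in a different quarter of the array at any moment, so one thread is less likely to find data already brought into the shared L2 cache by another.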
Raspberry Pi 3 - Results for the latest MP-BusSpd are shown below. Compared with default RPi 2 performance, the best RAM speed improvements matched the difference in memory bus speed. Cache speed improvements were around 1.9 times, compared with the CPU MHz ratio of 1.33.
The benchmark was rerun with the new graphics driver disabled, as the driver tended to degrade memory performance.
pi@raspberrypi ~/benchmarks/mpbusspd $ ./MP-BusSpd
V7A ./MP-BusSpdPiA7
*****************************************************
Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz
MP-BusSpd Linux/ARM v1.0 Sat Jul 27 17:32:12 2013
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 550 1200 1194 1258 1273 1179
2T 469 1207 1206 1261 1266 1217
4T 585 1207 1225 1252 1243 1249
8T 1046 1184 1208 1237 1236 1245
122.9 1T 22 46 45 55 107 218
2T 22 46 43 55 105 224
4T 21 46 43 55 106 224
8T 22 45 44 54 92 217
12288 1T 32 32 33 42 85 182
2T 15 33 30 41 80 165
4T 14 18 31 42 82 175
8T 15 28 32 43 81 178
End of test Sat Jul 27 17:32:29 2013
########################### RPi OC ##############################
Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts
MP-BusSpd Linux/ARM v1.0 Sat Jul 27 18:50:22 2013
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 1246 1751 1794 1768 1847 1859
2T 1591 1756 1741 1830 1837 1760
4T 1162 1709 1784 1830 1802 1840
8T 1574 1732 1739 1774 1817 1820
122.9 1T 90 90 84 106 198 415
2T 65 88 86 106 204 418
4T 90 88 86 103 196 403
8T 89 88 82 103 192 407
12288 1T 49 49 50 71 138 293
2T 37 49 50 71 138 295
4T 45 50 49 69 129 288
8T 30 48 50 70 135 291
End of test Sat Jul 27 18:50:35 2013
########################### RPi 2 ###############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-BusSpd Linux/ARM v1.0 Mon Mar 2 17:09:03 2015
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 1050 1681 1685 1729 1732 1734
2T 2955 3232 3324 3375 3425 3370
4T 5045 6417 6600 6753 6795 6868
8T 5053 5285 6087 5814 5845 6346
122.9 1T 383 391 695 1173 1493 1324
2T 712 738 1382 2324 2960 2652
4T 728 787 1593 3053 5693 4697
8T 774 771 1575 3192 4622 4704
12288 1T 71 76 151 295 635 349
2T 134 152 300 583 1242 691
4T 146 164 272 755 1415 1366
8T 137 77 240 421 930 1129
End of test Mon Mar 2 17:09:16 2015
######################### RPi 2 OC #############################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-BusSpd Linux/ARM v1.0 Wed Mar 4 12:40:34 2015
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 1032 1869 1884 1919 1923 1927
2T 3302 3606 3704 3794 3819 3821
4T 5833 7058 7356 7517 7348 7618
8T 5534 5699 6209 6517 6572 6674
122.9 1T 425 431 768 1285 1650 1469
2T 809 815 1540 2583 3306 2944
4T 824 875 1768 3651 6262 5809
8T 858 822 1709 3464 5574 4615
12288 1T 96 110 218 424 914 447
2T 193 219 436 785 1702 877
4T 165 246 457 754 2236 1738
8T 111 131 283 623 1348 1474
End of test Wed Mar 4 12:40:47 2015
######################### RPi 2 V7A #############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-BusSpd Linux/ARM V7A v1.0 Tue Mar 3 16:08:07 2015
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 1571 1662 1670 1174 1698 1725
2T 3072 3266 3362 3379 3416 3443
4T 5077 6562 6582 6719 6771 6847
8T 5318 5731 6009 5939 5820 5535
122.9 1T 376 396 702 1192 1558 1624
2T 710 738 1388 2359 3111 3228
4T 708 779 1618 3238 5729 6383
8T 692 761 1612 2970 5056 5648
12288 1T 69 82 163 292 629 1251
2T 138 160 329 579 1247 2380
4T 217 175 364 485 1135 2582
8T 106 100 210 585 871 1817
End of test Tue Mar 3 16:08:21 2015
####################### RPi 2 V7A OC ###########################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-BusSpd Linux/ARM V7A v1.0 Tue Mar 3 17:14:10 2015
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 1029 1851 1875 1903 1912 1909
2T 3413 3616 3736 3799 3813 3818
4T 6821 7300 5957 7493 7523 7621
8T 5668 5894 6455 6372 6508 7495
122.9 1T 433 442 782 1305 1738 1789
2T 810 813 1542 2588 3429 3574
4T 818 887 1780 3584 6552 7071
8T 839 854 1629 3284 5229 6202
12288 1T 92 116 228 407 854 1286
2T 184 230 450 699 1619 2531
4T 236 253 564 1492 2178 3356
8T 156 164 258 699 1065 3018
End of test Tue Mar 3 17:14:23 2015
######################## RPi 2 V7A 2 ############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-BusSpd ARM V7A v2 Fri Mar 6 17:29:14 2015
MB/Second Reading Data, 1, 2, 4 and 8 Threads
Staggered starting addresses to avoid caching
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 961 1602 1638 1668 1733 2227
2T 2644 3012 3154 3240 3393 4138
4T 4024 5503 6027 6389 6705 8153
8T 2780 3979 4777 5031 6028 6376
122.9 1T 356 389 688 1185 1541 2050
2T 706 731 1373 2343 3070 4065
4T 743 800 1595 3198 5894 7872
8T 750 775 1566 2928 5406 7139
12288 1T 66 71 159 281 628 1147
2T 87 87 177 311 697 1256
4T 84 98 191 297 700 1186
8T 103 93 177 294 742 1147
End of test Fri Mar 6 17:29:26 2015
###################### RPi 2 V7A 2 OC ##########################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-BusSpd ARM V7A v2 Fri Mar 6 17:35:56 2015
MB/Second Reading Data, 1, 2, 4 and 8 Threads
Staggered starting addresses to avoid caching
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 952 1772 1817 1862 1921 2466
2T 2958 3387 3552 3654 3826 4832
4T 4448 6110 6708 7037 7344 9078
8T 3358 4852 5570 5684 6631 7153
122.9 1T 435 436 787 1318 1718 2285
2T 813 816 1534 2610 3426 4527
4T 821 864 1780 3536 6523 8823
8T 813 812 1607 3307 5750 8159
12288 1T 94 104 229 406 904 1648
2T 141 141 289 454 1165 1785
4T 143 148 256 407 1000 1584
8T 148 133 250 485 1062 1531
End of test Fri Mar 6 17:36:08 2015
######################## RPi 3 V7A 2 ############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-BusSpd ARM V7A v2 Sun Jul 24 09:26:21 2016
MB/Second Reading Data, 1, 2, 4 and 8 Threads
Staggered starting addresses to avoid caching
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 3011 3715 3792 4080 4400 4149
2T 5391 6873 7125 7827 8466 8124
4T 8622 11926 13488 15276 16419 13422
8T 4922 7930 9659 11732 13307 11995
122.9 1T 565 563 1070 1792 2830 3865
2T 886 901 1762 3225 5402 7584
4T 901 921 1863 3727 7185 13816
8T 874 919 1762 3712 6269 9242
12288 1T 120 125 244 420 968 1926
2T 126 128 246 537 1000 2184
4T 110 118 231 443 990 1824
8T 120 137 262 517 1043 2124
End of test Sun Jul 24 09:26:33 2016
########## RPi 3 V7A 2 New OpenGL GLUT Driver Disabled ##########
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-BusSpd ARM V7A v2 Tue Aug 30 13:45:43 2016
MB/Second Reading Data, 1, 2, 4 and 8 Threads
Staggered starting addresses to avoid caching
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 1565 3749 3718 4078 4385 4160
2T 5041 6829 7066 7813 8584 7839
4T 5480 11958 13330 15256 16863 15614
8T 6006 8477 8873 7777 8918 8315
122.9 1T 566 566 1062 1822 2831 3907
2T 899 906 1742 2395 5433 7638
4T 907 935 1876 3757 7241 13871
8T 863 919 1789 3491 6411 9403
12288 1T 130 136 263 513 1047 2080
2T 185 138 276 554 1108 2149
4T 131 137 269 536 1169 2383
8T 125 133 224 513 1038 2142
End of test Tue Aug 30 13:45:55 2016
########################## Other #################################
P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4
Android MP-BusSpd v7 Benchmark V1.0 23-Dec-2012 14.42
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 3452 3697 4088 4122 3860 4183
2T 6616 7251 8016 8179 8307 8191
4T 8108 7430 10052 8511 8305 8404
8T 8729 10701 11687 12938 15297 15116
122.9 1T 747 762 746 966 992 1401
2T 1132 1161 1155 1554 1873 2668
4T 1127 1133 1137 2193 2987 4614
8T 1134 1145 1133 2210 3153 4231
12288 1T 82 89 200 376 739 1184
2T 204 179 407 797 1449 2205
4T 399 359 334 1227 1183 4038
8T 134 123 226 502 1378 3718
Total Elapsed Time 13.4 seconds
*****************************************************
Intel Atom 1.66 GHz, Linux Ubuntu 10.10
MP-BusSpd Linux/Intel v1.0 Sat Jul 27 18:03:37 2013
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 5512 6061 6219 6362 6359 6388
2T 5866 6412 6556 6638 6595 6659
4T 6157 6445 6551 6607 6605 6639
8T 6139 6424 6510 6611 6303 6070
122.9 1T 513 417 787 1476 2518 3945
2T 586 696 1316 2347 3655 4741
4T 625 686 1334 2270 3614 4736
8T 615 720 1255 2273 3635 4777
12288 1T 135 261 522 1034 1966 3280
2T 128 261 567 1146 2250 4535
4T 118 277 562 1183 2300 4454
8T 122 250 549 1122 2225 4413
End of test Sat Jul 27 18:03:49 2013
*****************************************************
Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04
MP-BusSpd Linux/Intel v1.0 Tue Jul 30 15:13:37 2013
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 10273 13905 14184 13640 13542 13651
2T 7599 14053 19451 22479 24301 25801
4T 7743 15110 29846 44953 48783 51672
8T 7613 15116 29805 44501 48082 51027
122.9 1T 1494 1496 2987 6001 11037 12857
2T 2980 2987 5952 11900 21852 25515
4T 5344 5967 11735 23781 43699 50429
8T 5947 5903 11661 23333 42953 50569
12288 1T 459 466 922 1878 3117 5236
2T 741 773 1452 2648 4731 8370
4T 839 887 1814 4006 7923 14909
8T 903 933 1921 4244 7997 14597
End of test Tue Jul 30 15:13:49 2013
|
RandMem benchmark carries out four tests at increasing data sizes to produce data transfer speeds in MBytes Per Second from caches and memory. Serial and random address selections are employed, using the same program structure, with read and read/write tests using 32 bit integers. The main purpose is to demonstrate how much slower performance can be through using random access. Here, speed can be considerably influenced by reading and writing in bursts, where much of the data is not used, and by the size of preceding caches. Details and results for Windows and Linux versions can be found in RandMem Results.htm. This benchmark uses data from the same array for all threads, but starting at different points. Results of the Serial Reading tests are checked for the same result on all threads.
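A minimal sketch of how serial and random address selection can share one program structure: the read loop fetches data through an index array, and only the contents of that array change between tests. The function names, the Fisher-Yates shuffle and the small pseudo-random generator are assumptions for illustration and are not taken from the RandMem source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill idx with 0..n-1, giving serial address selection. */
void fill_serial(uint32_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = (uint32_t)i;
}

/* Shuffle idx in place (Fisher-Yates), using a simple linear
   congruential generator so that the random order is repeatable. */
void shuffle(uint32_t *idx, size_t n, uint32_t seed)
{
    for (size_t i = n; i > 1; i--) {
        seed = seed * 1664525u + 1013904223u;
        size_t j = (size_t)(seed >> 8) % i;
        uint32_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* Same loop for both tests: only the contents of idx differ. */
uint32_t read_indexed(const uint32_t *data, const uint32_t *idx, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        assert(idx[i] < n);   /* bounds check for the sketch only */
        sum += data[idx[i]];
    }
    return sum;
}
```

Both orders read exactly the same words, so the totals must agree. Any difference in speed then comes from the access pattern alone.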
The original Windows version produces extremely slow speeds with read/write tests, particularly with random access. Later Linux varieties included Mutex, or mutual exclusion, functions to avoid the updating conflict by only allowing one thread at a time to access common data. This can still lead to multiple threads being slower than one but, with random access, there can be a significant improvement compared with untethered multiple thread speeds, except when accessing RAM (see linux%20multithreading%20benchmarks.htm). This and the Android benchmarks also use Mutex and some speeds continue to be unpredictable.
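The Mutex arrangement can be sketched with POSIX threads as below. This is an illustration under stated assumptions: the shared_t structure and function names are hypothetical, and holding the lock for a whole read/write pass is only one possible granularity, which may not be what the benchmark does.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 8

/* Shared array and the mutex that serialises updates to it. */
typedef struct {
    uint32_t *data;
    size_t n_words;
    pthread_mutex_t lock;
} shared_t;

/* Each thread carries out one read/write pass while holding the
   mutex, so only one thread at a time updates the common data. */
static void *rdwr_pass(void *arg)
{
    shared_t *s = arg;
    pthread_mutex_lock(&s->lock);
    for (size_t i = 0; i < s->n_words; i++)
        s->data[i] += 1;
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* Run n_threads read/write passes; returns 0 on success, -1 if a
   thread could not be created. */
int run_rdwr(uint32_t *data, size_t n_words, int n_threads)
{
    assert(n_threads > 0 && n_threads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    shared_t s = { data, n_words };
    int created = 0, rc = 0;

    pthread_mutex_init(&s.lock, NULL);
    for (; created < n_threads; created++) {
        if (pthread_create(&tid[created], NULL, rdwr_pass, &s) != 0) {
            rc = -1;
            break;
        }
    }
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);
    pthread_mutex_destroy(&s.lock);
    return rc;
}
```

Compile with gcc and the -pthread option. Because the passes are serialised by the lock, four threads in this sketch cannot complete the updates any faster than one, which fits the lack of multithreading gains on read/write tests.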
The revised PiA7 compilation made little difference to Raspberry Pi 2 MP-RandMem results. Multithreading did not provide any performance improvement with read/write tests (Mutex effect) but RPi 2 gains were around 1.6 times using L1 cache and 4 to 6 times from L2 cache, with RAM at 6 times for serial access and 1.6 times with random access. RPi 2 single thread read only L1 cache tests showed no performance increase, with more than three times gain from L2 cache and a 70% improvement from RAM. Serial Read/Random Read, quad core versus RPi 1 single core performance ratios were about 4/4 times for L1 cache, 15/9 times with L2 cache and 5.7/4.3 times from RAM.
The results show that there is no gain in using multiple threads on systems with multiple cores, at least for cache based data, due to Mutex effects (but this is better than being much slower - see above). As could be anticipated, random access is slow, compared with serial reading and writing, when burst transfers are involved. Note the similarities with BusSpeed above.
Raspberry Pi 3 results are provided with and without the new graphics driver, as the driver degraded memory performance. RPi 2 comparisons, without the driver, are included below. Some were no better than the 1.33 CPU MHz increase, but RAM speed improvements were significant.
pi@raspberrypi ~/benchmarks/mprandmem$ ./MP-RandMem
V7A ./MP-RandMemPiA7
*****************************************************
Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz
MP-RandMem Linux/ARM v1.0 Mon Jul 29 16:23:20 2013
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 1564 1347 1605 1343
2T 1576 1338 1584 1217
4T 1550 1297 1544 1324
8T 1500 1303 1489 1183
122.9 1T 236 202 112 99
2T 234 201 111 110
4T 232 201 110 96
8T 226 200 109 99
12288 1T 170 135 23 26
2T 170 134 22 26
4T 169 132 23 25
8T 123 105 23 26
No Errors Found
End of test Mon Jul 29 16:24:19 2013
########################### RPi OC ##############################
Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts
MP-RandMem Linux/ARM v1.0 Mon Jul 29 17:05:25 2013
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 2306 1938 2312 1937
2T 2281 1900 2288 1933
4T 2238 1918 2240 1918
8T 2171 1890 2178 1880
122.9 1T 460 369 205 194
2T 448 371 202 195
4T 441 371 204 194
8T 427 367 202 193
12288 1T 270 198 36 42
2T 270 198 36 42
4T 270 198 36 42
8T 270 198 36 42
No Errors Found
End of test Mon Jul 29 17:06:16 2013
########################### RPi 2 ###############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-RandMem Linux/ARM v1.0 Tue Mar 3 16:13:52 2015
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 2256 2857 2257 2858
2T 4480 2847 4480 2849
4T 8738 2795 8759 2808
8T 8032 2772 8439 2794
122.9 1T 1624 1483 628 682
2T 3208 1467 1183 683
4T 6203 1457 1673 681
8T 5793 1385 1670 689
12288 1T 359 940 55 57
2T 670 941 105 57
4T 1180 936 126 57
8T 1161 938 127 56
No Errors Found
End of test Tue Mar 3 16:14:38 2015
######################### RPi 2 OC #############################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-RandMem Linux/ARM v1.0 Wed Mar 4 12:53:39 2015
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 2493 3157 2509 3176
2T 4979 3161 4979 3153
4T 9357 3144 9699 3125
8T 8706 3104 8633 3085
122.9 1T 1796 2152 701 761
2T 3577 2142 1331 762
4T 6916 2151 1870 766
8T 6421 2135 1823 765
12288 1T 461 1233 68 70
2T 862 1218 129 69
4T 1561 1210 159 69
8T 1514 1202 162 69
No Errors Found
End of test Wed Mar 4 12:54:25 2015
######################### RPi 2 V7A #############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-RandMem Linux/ARM V7A v1.0 Tue Mar 3 16:28:30 2015
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 1967 2784 1968 2793
2T 3711 2787 3740 2788
4T 7019 2730 7313 2762
8T 6783 2410 6881 2704
122.9 1T 1413 1489 532 681
2T 2788 1470 1013 679
4T 5393 1485 1593 681
8T 5207 1448 1587 686
12288 1T 357 950 45 57
2T 697 946 89 57
4T 1212 930 126 56
8T 1157 938 123 57
No Errors Found
End of test Tue Mar 3 16:29:18 2015
####################### RPi 2 V7A OC ###########################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-RandMem Linux/ARM V7A v1.0 Tue Mar 3 17:51:16 2015
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 2209 3129 2209 3130
2T 4177 3101 4156 3098
4T 8289 3056 8292 3059
8T 7641 3009 7574 2990
122.9 1T 1584 2121 592 754
2T 3109 2105 1127 758
4T 5983 2114 1715 759
8T 5669 2118 1682 764
12288 1T 453 1219 55 69
2T 841 1217 109 69
4T 1535 1209 157 69
8T 1523 1189 154 68
No Errors Found
End of test Tue Mar 3 17:52:01 2015
######################### RPi 3 V7A #############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-RandMem Linux/ARM V7A v1.0 Mon Aug 15 19:37:27 2016
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 2907 3773 2917 3790
2T 5480 3768 5187 3775
4T 11198 3679 10960 3712
8T 10094 3697 10038 3685
122.9 1T 2673 3340 686 892
2T 5031 3386 1251 888
4T 9398 3378 2002 890
8T 9291 3370 1916 886
12288 1T 1896 899 50 64
2T 2535 900 98 65
4T 2878 896 137 64
8T 2631 897 130 65
No Errors Found
End of test Mon Aug 15 19:38:14 2016
########## RPi 3 V7A 2 New OpenGL GLUT Driver Disabled ##########
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-RandMem Linux/ARM V7A v1.0 Tue Aug 30 14:13:08 2016
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 2930 3791 2918 3791
2T 5571 3766 5194 3776
4T 11196 3722 11205 3722
8T 10063 3685 10051 3702
122.9 1T 2675 3398 681 893
2T 5124 3387 1256 886
4T 10041 3387 1916 891
8T 9593 3367 1952 890
12288 1T 2120 979 54 71
2T 3255 980 107 71
4T 3346 979 138 70
8T 2226 979 143 71
No Errors Found
End of test Tue Aug 30 14:13:54 2016
RPi3/RPi2 Average
L1 cache 1.53 1.36 1.47 1.35
L2 cache 1.86 2.29 1.24 1.31
RAM 4.46 1.04 1.17 1.25
########################## Other #################################
P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4
Android MP-RndMem v7 Benchmark V1.0 23-Dec-2012 14.40
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.29 1T 2043 2028 2066 2027
2T 6788 3058 6835 3346
4T 6251 3104 6478 3376
8T 6635 3244 5408 3242
122.9 1T 1365 1392 1150 1151
2T 2415 1386 1927 1159
4T 2495 1374 1870 1117
8T 2470 1352 1772 1013
12288 1T 581 351 71 77
2T 1674 934 143 96
4T 1675 882 143 95
8T 1838 939 142 96
Total Elapsed Time 5.5 seconds
*****************************************************
Intel Atom 1.66 GHz, Linux Ubuntu 10.10
MP-RandMem Linux/Intel v1.0 Mon Jul 29 17:14:26 2013
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 4207 5242 4206 5244
2T 6219 5159 5770 5118
4T 6155 5158 6206 5149
8T 5765 5019 6088 4956
122.9 1T 3084 3455 789 1077
2T 4692 3451 1230 1078
4T 4753 3408 1246 1076
8T 4689 3400 1243 1045
12288 1T 1291 1339 57 88
2T 3008 1323 108 88
4T 3043 1336 105 88
8T 3092 1329 108 87
No Errors Found
End of test Mon Jul 29 17:15:12 2013
*****************************************************
Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04
MP-RandMem Linux/Intel v1.0 Tue Jul 30 15:15:12 2013
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 14913 11834 14229 11681
2T 24219 11686 23129 11552
4T 35885 11566 33095 11443
8T 29820 11596 29206 11518
122.9 1T 10936 10580 5543 4835
2T 20167 10563 9942 4814
4T 38266 10522 18061 4845
8T 37272 10437 17753 4813
12288 1T 3858 3864 655 559
2T 6280 3866 1137 558
4T 10752 3859 1920 558
8T 11107 3827 1924 558
No Errors Found
End of test Tue Jul 30 15:15:53 2013
|
These benchmarks use the same calculations as the original MP_MFLOPS benchmark for Linux, where MP-MFLOPS above uses a cut down version, originally implemented for use on Android devices. The OpenMP-MFLOPS benchmark uses the simplest OpenMP directive, #pragma omp parallel for, before the for loops where parallelisation might be expected, and a -fopenmp compile parameter. Then, notOpenMP-MFLOPS is the same program, compiled without that parameter.
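A minimal sketch of the directive in use is shown here. The add_mul function and its formula are assumptions, chosen only to give two floating point operations per word, and are not copied from the benchmark source.

```c
#include <assert.h>
#include <stddef.h>

/* Two floating point operations per word: one add, one multiply.
   Compiled with -fopenmp, the iterations are shared among the cores.
   Without it, the pragma is ignored and the loop runs on one core. */
void add_mul(float *x, size_t n, float a, float b)
{
    assert(x != NULL);
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        x[i] = (x[i] + a) * b;
}
```

The same source file therefore serves for both varieties, for example gcc -O3 -fopenmp for the OpenMP build and gcc -O3 alone for the version without it.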
The default memory sizes used, starting at 400 KB, are much larger than MP-MFLOPS, as is the number of repeat passes. However, these benchmarks have run time parameters, shown below, that can change these. In fact, test runs show that performance is mainly dependent on the number of operations per word, and notOpenMP-MFLOPS speed is almost the same as 1 Thread MP-MFLOPS results.
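The MFLOPS column in the logs appears to follow directly from the other columns, as words times operations per word times repeat passes, divided by seconds and by one million. The helper below is a sketch of that arithmetic, with a hypothetical function name, and is not the benchmark's own code.

```c
#include <assert.h>

/* MFLOPS = words x operations per word x repeat passes
            / seconds / 1,000,000 */
double mflops(double words, double ops_per_word, double passes,
              double seconds)
{
    assert(seconds > 0.0);
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

For example, the first RPi 2 V7A line has 100000 words, 2 operations per word, 2500 passes and 0.676731 seconds, giving about 738.8, in line with the 739 MFLOPS reported.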
Besides notOpenMP-MFLOPS, results below include OpenMP-MFLOPS set to run on a single core, where speeds at 32 operations per word were nearly twice as fast with notOpenMP-MFLOPS. Examination of the assembly code generated for this particular test function shows that the OpenMP version has 67 instructions and the notOpenMP version 346, the latter clearly with more options to suit data size. Both use vfma fused floating-point multiply accumulate instructions, where there are rounding complications. Note the different numeric results at 32 operations per word. At least the default OpenMP-MFLOPS benchmark shows speed gains of up to 3.9 times those using a single core.
Raspberry Pi 3 results, using the same parameters as MP-MFLOPS, have been included. Those for notOpenMP-MFLOPS at 2 and 32 Ops/Word were almost identical to the same tests for MP-MFLOPSPiNeon but, unlike the latter, little gain was produced with multithreading. MP performance appeared to be improved somewhat by increasing the pass count. Compared with Raspberry Pi 2, notOpenMP-MFLOPS was around twice as fast at 8 and 32 operations per word, but less so with 2 operations. Then, it sometimes appeared to be slower with OpenMP-MFLOPS. Poor performance could well be associated with running time and overheads.
######################## Run Time Parameters #############################
For the same as latest MP-MFLOPS: ./OpenMP-MFLOPS Words 3200, Repeats 10000
############################## RPi 2 V7A #################################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
OpenMP MFLOPS Benchmark 1 Sat Mar 7 15:44:05 2015
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.676731 739 0.929538 Yes
Data in & out 1000000 2 250 1.365332 366 0.992550 Yes
Data in & out 10000000 2 25 1.308170 382 0.999250 Yes
Data in & out 100000 8 2500 1.076658 1858 0.957126 Yes
Data in & out 1000000 8 250 1.390932 1438 0.995524 Yes
Data in & out 10000000 8 25 1.356837 1474 0.999550 Yes
Data in & out 100000 32 2500 5.561007 1439 0.890232 Yes
Data in & out 1000000 32 250 5.843752 1369 0.988068 Yes
Data in & out 10000000 32 25 5.791580 1381 0.998785 Yes
End of test Sat Mar 7 15:44:30 2015
***************** taskset 0x00000001 ./OpenMP-MFLOPS 1 Core *****************
OpenMP MFLOPS Benchmark 1 Mon Mar 9 11:56:40 2015
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 1.592769 314 0.929538 Yes
Data in & out 1000000 2 250 1.928113 259 0.992550 Yes
Data in & out 10000000 2 25 1.917868 261 0.999250 Yes
Data in & out 100000 8 2500 4.049420 494 0.957126 Yes
Data in & out 1000000 8 250 4.766354 420 0.995524 Yes
Data in & out 10000000 8 25 4.757556 420 0.999550 Yes
Data in & out 100000 32 2500 21.886468 366 0.890232 Yes
Data in & out 1000000 32 250 22.745527 352 0.988068 Yes
Data in & out 10000000 32 25 22.726837 352 0.998785 Yes
End of test Mon Mar 9 11:58:07 2015
-----------------------------------------------------------------------------
Not OpenMP MFLOPS Benchmark 1 Sat Mar 7 15:41:17 2015
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 1.256587 398 0.929538 Yes
Data in & out 1000000 2 250 1.470944 340 0.992550 Yes
Data in & out 10000000 2 25 1.467244 341 0.999250 Yes
Data in & out 100000 8 2500 2.574641 777 0.957126 Yes
Data in & out 1000000 8 250 3.241242 617 0.995524 Yes
Data in & out 10000000 8 25 3.226519 620 0.999550 Yes
Data in & out 100000 32 2500 11.566683 692 0.890268 Yes
Data in & out 1000000 32 250 12.312695 650 0.988078 Yes
Data in & out 10000000 32 25 12.309223 650 0.998806 Yes
End of test Sat Mar 7 15:42:07 2015
####################### RPi 2 V7A OC ###########################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
OpenMP MFLOPS Benchmark 1 Sat Mar 7 19:21:01 2015
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.502595 995 0.929538 Yes
Data in & out 1000000 2 250 1.061047 471 0.992550 Yes
Data in & out 10000000 2 25 1.027811 486 0.999250 Yes
Data in & out 100000 8 2500 0.962144 2079 0.957126 Yes
Data in & out 1000000 8 250 1.202937 1663 0.995524 Yes
Data in & out 10000000 8 25 1.158232 1727 0.999550 Yes
Data in & out 100000 32 2500 4.947005 1617 0.890232 Yes
Data in & out 1000000 32 250 5.147261 1554 0.988068 Yes
Data in & out 10000000 32 25 5.111022 1565 0.998785 Yes
End of test Sat Mar 7 19:21:23 2015
Not OpenMP MFLOPS Benchmark 1 Sat Mar 7 19:19:54 2015
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 1.085229 461 0.929538 Yes
Data in & out 1000000 2 250 1.314159 380 0.992550 Yes
Data in & out 10000000 2 25 1.307451 382 0.999250 Yes
Data in & out 100000 8 2500 2.323887 861 0.957126 Yes
Data in & out 1000000 8 250 2.859657 699 0.995524 Yes
Data in & out 10000000 8 25 2.851960 701 0.999550 Yes
Data in & out 100000 32 2500 10.461870 765 0.890268 Yes
Data in & out 1000000 32 250 11.074036 722 0.988078 Yes
Data in & out 10000000 32 25 11.070011 723 0.998806 Yes
End of test Sat Mar 7 19:20:39 2015
######################### RPi 3 V7A #############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
OpenMP MFLOPS Benchmark 1 Sat Jul 30 13:01:12 2016
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.363631 1375 0.929538 Yes
Data in & out 1000000 2 250 1.133716 441 0.992550 Yes
Data in & out 10000000 2 25 1.150107 435 0.999250 Yes
Data in & out 100000 8 2500 0.432833 4621 0.957126 Yes
Data in & out 1000000 8 250 1.177219 1699 0.995524 Yes
Data in & out 10000000 8 25 1.151536 1737 0.999550 Yes
Data in & out 100000 32 2500 3.845114 2081 0.890232 Yes
Data in & out 1000000 32 250 3.754590 2131 0.988068 Yes
Data in & out 10000000 32 25 3.737356 2141 0.998785 Yes
End of test Sat Jul 30 13:01:29 2016
-----------------------------------------------------------------------------
Not OpenMP MFLOPS Benchmark 1 Mon Aug 15 19:23:03 2016
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.697952 716 0.929538 Yes
Data in & out 1000000 2 250 1.160158 431 0.992550 Yes
Data in & out 10000000 2 25 1.140070 439 0.999250 Yes
Data in & out 100000 8 2500 1.178477 1697 0.957126 Yes
Data in & out 1000000 8 250 1.442497 1386 0.995524 Yes
Data in & out 10000000 8 25 1.428921 1400 0.999550 Yes
Data in & out 100000 32 2500 5.060230 1581 0.890268 Yes
Data in & out 1000000 32 250 5.203246 1538 0.988078 Yes
Data in & out 10000000 32 25 5.203889 1537 0.998806 Yes
End of test Mon Aug 15 19:23:26 2016
######################### RPi 3 V7A #############################
Run with parameters ./OpenMP-MFLOPS Words 3200, Repeats 10000
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
OpenMP MFLOPS Benchmark 1 Thu Sep 15 18:13:47 2016
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 3200 2 10000 0.138179 463 0.764063 Yes
Data in & out 32000 2 1000 0.091516 699 0.970753 Yes
Data in & out 320000 2 100 0.193833 330 0.997008 Yes
Data in & out 3200 8 10000 0.148140 1728 0.850919 Yes
Data in & out 32000 8 1000 0.120691 2121 0.982347 Yes
Data in & out 320000 8 100 0.429023 597 0.998205 Yes
Data in & out 3200 32 10000 0.514128 1992 0.660291 Yes
Data in & out 32000 32 1000 0.703450 1456 0.953632 Yes
Data in & out 320000 32 100 1.067654 959 0.995180 Yes
End of test Thu Sep 15 18:13:50 2016
-----------------------------------------------------------------------------
Not OpenMP MFLOPS Benchmark 1 Thu Sep 15 18:14:47 2016
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 3200 2 10000 0.152466 420 0.764063 Yes
Data in & out 32000 2 1000 0.081762 783 0.970753 Yes
Data in & out 320000 2 100 0.134984 474 0.997008 Yes
Data in & out 3200 8 10000 0.147960 1730 0.850919 Yes
Data in & out 32000 8 1000 0.148731 1721 0.982347 Yes
Data in & out 320000 8 100 0.168795 1517 0.998205 Yes
Data in & out 3200 32 10000 0.644568 1589 0.660158 Yes
Data in & out 32000 32 1000 0.649362 1577 0.953663 Yes
Data in & out 320000 32 100 0.663790 1543 0.995240 Yes
End of test Thu Sep 15 18:14:50 2016
|
The MemSpeed benchmark measures data reading speeds in MegaBytes per second, carrying out calculations on arrays of cache and RAM data. Calculations are as shown in the results’ headings. As with the OpenMP-MFLOPS benchmark, OpenMP-MemSpeed uses the simplest OpenMP directive (#pragma omp parallel) before the test loops. Full results are below for the RPi 2 running at 900 and 1000 MHz. The compile command is also shown.
With OpenMP-MemSpeed Version 1, the declaration to use OpenMP was before an inner loop, leading to possible performance degradation due to overheads. For Version 2, or OpenMP-MemSpeed2, the directive was moved to an outer loop. For the following results, this was run, along with a test to use one CPU core, via the command taskset 0x00000001 ./OpenMP-MemSpeed2. Then, another compilation (NotOpenMP-MemSpeed2) was produced without the -fopenmp compile option, to use a single core without OMP overheads. All three versions are in Raspberry_Pi_MP_Benchmarks.zip.
Raspberry Pi 3 - Results are below, along with RPi3/RPi2 average performance ratios, plus those for Raspberry Pi 3 OpenMP/NotOpenMP and NotOpenMP/1 core OpenMP. Some RPi3/RPi2 comparisons were close to the 1.33 CPU MHz ratio, but most were higher, particularly on RAM speed, at up to 4.68 times, and on all integer arithmetic tests, with all MP ratios between 4.08 and 5.62. RPi 3 multiprocessing gains were disappointing on integer operations but mainly over 3.5 times for cache based floating point and over 3.0 from RAM. Using one thread, the benchmark produced wide variations compared with the unthreaded code, mainly worse, as expected, but some were better. It could be assumed that different instructions were generated.
######################### RPi 2 V7A #############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
gcc memSpeedOMP.c cpuidc.c -lrt -lc -lm -O3 -mcpu=cortex-a7
-mfloat-abi=hard -mfpu=neon-vfpv4 -funsafe-math-optimizations
-fopenmp -o OpenMP-MemSpeed
Memory Reading Speed Test OpenMP Version 1 by Roy Longbottom
Start of test Sat Mar 7 19:12:39 2015
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 589 759 843 925 871 896 517 491 490
8 1487 1056 1367 1707 1161 1472 971 876 876
16 2357 1595 1941 2852 2186 2348 1737 1422 1355
32 2565 2045 2669 4329 2876 3161 2945 2246 2235
64 3964 2294 3080 5634 3497 3962 4224 2936 2934
128 2420 2317 3096 5661 3478 3928 1831 3416 3425
256 2884 2150 2838 4411 3179 3578 1184 1392 1357
512 1837 1731 2155 3061 2327 2557 1064 1217 1218
1024 650 990 1106 1254 1134 1162 1050 1055 937
2048 793 833 907 1010 935 889 851 676 825
4096 792 705 864 1004 871 953 767 771 748
8192 760 829 881 1009 935 961 761 736 766
16384 839 810 873 1004 934 961 765 772 762
32768 850 829 906 1005 725 953 770 776 777
65536 951 838 894 1022 928 963 772 779 779
131072 949 835 867 1010 937 950 774 786 788
End of test Sat Mar 7 19:13:10 2015
####################### RPi 2 V7A OC ###########################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
Memory Reading Speed Test OpenMP Version 1 by Roy Longbottom
Start of test Sat Mar 7 19:17:09 2015
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 669 855 948 1081 980 1004 578 548 496
8 1595 1339 1549 1948 1643 1399 1099 915 994
16 2467 1859 2292 3238 2464 2686 1966 1538 1660
32 3429 2302 3012 4932 3330 3720 3324 2527 2332
64 4190 2585 3520 6499 3969 4520 4881 3357 3366
128 4327 2670 3656 6991 4134 4789 4914 3094 3827
256 4185 2392 3524 6035 3994 4607 1710 2719 2713
512 2757 2119 2329 4008 2944 3250 1587 1726 1717
1024 1393 1161 1303 1493 1350 1408 1488 1476 1465
2048 903 996 1086 1207 1113 921 1083 1093 1086
4096 632 911 1058 1177 1094 1122 969 995 998
8192 1141 988 1074 1198 1113 1112 985 994 998
16384 825 980 1070 1184 950 1131 1014 1022 1015
32768 1111 994 1083 1209 1117 1155 994 957 1007
65536 1161 982 1084 1223 1112 1163 996 999 997
131072 1134 986 1083 1212 1098 1135 1005 1017 1022
End of test Sat Mar 7 19:17:40 2015
#################################################################
Version 2
########################## RPi 2 ################################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom
Start of test Mon Sep 5 21:29:03 2016
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 3259 2499 271 3261 2383 286 1333 2099 432
8 2854 2507 256 3160 2594 305 1235 2116 445
16 3329 2507 256 3331 3098 270 1173 1547 446
32 3210 2509 264 3328 3026 267 1155 1452 433
64 3461 1889 249 5869 3399 250 1128 2024 317
128 3215 2229 257 5719 3672 262 1117 1123 293
256 3896 2387 250 5677 3647 257 1119 1132 301
512 2521 1527 217 2718 2258 230 1115 1112 282
1024 931 871 185 1408 1254 182 1092 1094 258
2048 863 777 212 1217 1203 198 1095 1088 275
4096 846 724 159 962 885 168 1092 1078 251
8192 824 779 234 1151 1191 200 1090 1070 266
16384 791 701 362 961 1223 335 1078 1057 334
32768 845 641 398 930 973 391 913 1066 300
65536 312 256 331 359 306 338 956 1069 301
131072 312 255 332 360 306 338 994 812 356
End of test Mon Sep 5 21:29:41 2016
####################### RPi 2 1 Core ############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom
Start of test Mon Sep 5 21:30:34 2016
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 857 674 676 2199 1096 658 2297 1172 489
8 1234 677 676 2269 1131 690 2351 1179 490
16 1236 677 677 2273 1138 694 2362 1183 490
32 1225 673 674 2258 1132 691 2354 1181 489
64 1056 616 623 1732 950 638 1428 1093 471
128 968 605 614 1660 947 626 1242 1127 476
256 910 602 611 1635 947 626 1191 1131 475
512 705 499 529 1242 743 515 1119 954 438
1024 347 282 350 434 359 357 803 785 339
2048 309 256 326 359 305 333 814 744 299
4096 304 251 324 353 302 331 856 785 313
8192 304 252 322 352 300 331 879 839 331
16384 305 251 324 352 300 331 891 864 342
32768 308 251 325 354 301 332 859 773 313
65536 309 253 325 355 293 331 836 737 302
131072 309 253 326 355 302 332 838 713 295
End of test Mon Sep 5 21:31:10 2016
####################### RPi 2 Not OMP ###########################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
Memory Reading Speed Test Not OpenMP Version 2 by Roy Longbottom
Start of test Mon Sep 5 21:31:37 2016
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 823 1666 2560 2001 2172 2564 1985 1564 1565
8 1262 1671 2571 2024 2181 2570 2018 1573 1571
16 1261 1670 2576 2023 2179 2570 2023 1575 1575
32 1064 1275 1705 1552 1569 1724 1468 1325 1272
64 971 1292 1714 1493 1577 1721 1616 1296 1297
128 995 1319 1767 1539 1626 1718 1464 1317 1318
256 912 1294 1722 1494 1580 1714 1209 1376 1374
512 655 885 1091 977 1025 1078 1108 932 948
1024 364 408 451 422 439 450 863 510 511
2048 309 334 356 343 352 360 914 562 557
4096 305 304 350 338 345 350 930 607 604
8192 306 331 356 340 349 356 922 608 609
16384 311 332 358 343 349 358 917 609 609
32768 313 333 355 344 349 359 925 609 607
65536 313 333 359 343 351 358 926 611 612
131072 314 331 359 345 351 359 927 609 612
End of test Mon Sep 5 21:32:12 2016
######################### RPi 3 V7A #############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Memory Reading Speed Test OpenMP Version 1 by Roy Longbottom
Start of test Mon Aug 15 19:29:18 2016
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 749 1064 1302 1407 1236 1266 775 745 745
8 379 1597 2200 2473 1968 2433 1481 1364 1378
16 3180 2126 3319 3928 2901 3866 2718 2364 2372
32 4244 2546 4492 5565 3733 5495 4685 3700 3713
64 4930 2772 5252 7185 3845 6693 6947 4959 4960
128 3699 2924 5785 8169 4592 7349 5553 6047 6009
256 5553 2970 5939 8340 4585 7720 9048 6657 6653
512 5167 2854 5537 7555 4116 7009 6125 5464 5288
1024 903 1436 1456 1329 1461 1456 1585 1609 1600
2048 950 1164 1155 1186 1171 914 1043 1036 1024
4096 974 1148 1039 1174 1164 1162 919 923 928
8192 920 1131 1158 1168 1163 1093 938 936 945
16384 919 838 948 1169 1165 990 931 940 946
32768 1166 1159 1168 1171 1167 1168 923 926 916
65536 1156 1146 1167 1170 1163 1147 928 939 931
131072 1163 1151 1148 1171 1075 1092 934 915 957
End of test Mon Aug 15 19:29:47 2016
########## RPi 3 V7A 2 New OpenGL GLUT Driver Disabled ##########
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Memory Reading Speed Test OpenMP Version 1 by Roy Longbottom
Start of test Tue Aug 30 14:03:24 2016
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 565 1059 1293 1403 1228 1381 773 740 743
8 433 1590 2185 2480 1989 886 1458 1349 1367
16 274 2118 3212 3987 2882 3846 2678 2335 2334
32 4234 2547 4489 5786 3723 5476 4645 3685 3690
64 3613 2791 5328 7263 3959 6777 7146 5065 5075
128 1349 2889 5624 6927 4090 7274 9530 5908 5923
256 3597 2960 5877 8177 4676 7637 8693 6697 6725
512 4140 2985 3621 8556 4784 7931 7867 6723 6768
1024 1534 1547 1631 1646 1629 1634 1872 1848 1852
2048 1274 1270 1274 1267 1106 1274 1106 1108 1090
4096 675 1263 1270 1277 1265 1266 1025 1031 1028
8192 1271 1256 1281 1280 1263 1265 996 994 959
16384 1281 1277 1289 1288 1102 1278 986 971 976
32768 1285 1283 1287 1301 1286 1291 977 966 986
65536 1291 1285 1291 1292 1291 1289 988 982 986
131072 1293 1283 1293 1298 1295 1287 970 979 998
End of test Tue Aug 30 14:03:53 2016
#################################################################
Version 2
########################## RPi 3 ################################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom
Start of test Mon Sep 5 14:27:38 2016
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 5518 2990 1309 8808 4732 1455 15426 7656 1244
8 5414 3115 1322 10150 5068 1470 14323 8301 1254
16 5503 3143 1270 10255 5154 1378 16743 8043 1221
32 5507 3145 1344 10142 5089 1458 16572 7732 1206
64 5033 2999 1257 9230 4867 1419 16012 7869 1228
128 5255 3041 1258 9372 5014 1365 9452 8192 1252
256 5266 3093 1282 9401 5006 1372 8418 7864 1313
512 4494 2765 1358 7248 4482 1332 5748 5460 1410
1024 3810 2683 1078 4425 3668 1155 1753 1732 1265
2048 2008 1425 1098 2274 2214 980 1086 1094 1333
4096 3972 2413 1075 4628 3672 945 1058 1057 839
8192 1597 2435 920 3671 3649 1199 1059 1067 1043
16384 3838 1624 1867 4440 1550 1108 1065 1076 1166
32768 1658 2273 1695 4227 1876 1054 1066 1039 921
65536 3657 1247 1286 4839 3801 1308 1053 1046 1133
131072 990 655 810 1260 932 826 1129 1083 619
End of test Mon Sep 5 14:28:08 2016
####################### RPi 3 1 Core ############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom
Start of test Mon Sep 5 14:30:31 2016
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 775 789 994 2578 1309 1027 4087 2337 654
8 1551 793 1003 2620 1313 1029 4176 2361 656
16 1553 793 1003 2626 1314 1029 4209 2372 657
32 1512 782 982 2501 1282 1009 4146 2338 647
64 1464 770 961 2379 1242 982 3976 2183 636
128 1476 773 963 2406 1253 990 3837 2160 639
256 1478 773 964 2389 1256 982 3867 2208 639
512 1401 748 926 2204 1202 958 3342 2119 636
1024 1082 663 798 1347 979 814 1759 1634 616
2048 968 651 776 1193 923 791 1272 1215 604
4096 962 645 779 1171 909 812 1253 1247 615
8192 977 654 807 1233 925 820 1240 1245 619
16384 1016 653 794 1226 920 818 1223 1231 617
32768 1018 656 815 1263 930 806 1175 1176 615
65536 1026 658 816 1270 935 829 971 988 614
131072 1030 660 818 1269 938 830 866 870 608
End of test Mon Sep 5 14:30:57 2016
####################### RPi 3 Not OMP ###########################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Memory Reading Speed Test Not OpenMP Version 2 by Roy Longbottom
Start of test Mon Sep 5 14:28:22 2016
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 785 2536 3789 2360 3448 3787 2670 2693 2692
8 1594 2547 3812 2389 3465 3812 2715 2716 2716
16 1595 2551 3824 2392 3477 3823 2727 2728 2728
32 1556 2435 3564 2300 3272 3565 2730 2722 2723
64 1513 2314 3330 2189 3091 3327 2599 2435 2435
128 1516 2312 3357 2188 3118 3353 2635 2569 2569
256 1521 2316 3381 2187 3130 3384 2676 2618 2617
512 1419 2034 2765 1977 2674 2835 2593 2481 2524
1024 1113 1379 1544 1348 1521 1543 1691 1583 1586
2048 995 1203 1282 1193 1277 1257 1263 1231 1232
4096 992 1196 1248 1178 1252 1259 1203 1176 1166
8192 1041 1237 1290 1213 1298 1291 927 943 954
16384 1052 1262 1311 1229 1252 1303 874 866 867
32768 1053 1271 1317 1239 1325 1303 995 987 991
65536 1057 1281 1323 1245 1343 1316 920 920 918
131072 1057 1283 1323 1184 1350 1327 856 849 840
End of test Mon Sep 5 14:28:50 2016
#################################################################
Comparisons
#################################################################
########################### RPi3/RPi2 ###########################
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
Area Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
RPi3/RPi2 OpenMP-MemSpeed2
L1 1.75 1.25 5.07 3.11 1.76 5.11 13.37 4.71 2.78
L2 1.70 1.64 5.38 1.85 1.62 5.62 7.43 4.80 4.46
RAM 3.73 2.94 4.68 4.32 2.90 4.05 1.03 0.99 3.73
RPi3/RPi2 NotOpenMP-MemSpeed2
L1 1.32 1.63 1.63 1.26 1.72 1.63 1.48 1.83 1.85
L2 1.82 1.99 2.13 1.67 2.17 2.16 1.95 2.15 2.15
RAM 3.33 3.79 3.64 3.56 3.70 3.61 1.12 1.70 1.70
###################### RPi3 OpenMP/NotOpenMP ####################
L1 3.46 1.25 0.35 4.31 1.50 0.38 5.83 2.95 0.45
L2 3.37 1.41 0.43 4.01 1.70 0.46 3.39 2.66 0.55
RAM 2.70 1.53 1.02 3.30 2.16 0.85 1.03 1.04 1.05
################## RPi3 NotOpenMP/1 core OpenMP #################
L1 1.03 3.18 3.75 0.91 2.61 3.65 0.65 1.15 4.17
L2 1.03 2.78 3.12 0.92 2.28 3.06 0.73 1.13 3.71
RAM 1.04 1.90 1.62 0.99 1.40 1.59 0.87 0.86 1.66
|
This benchmark executes the same functions as MP-MFLOPS and comes in two versions. One uses NEON intrinsic functions; the other is MP-MFLOPS compiled with directives to use NEON. The two obtain similar performance, as reflected in the results below, where the first set is for MP-MFLOPS with compiled NEON instructions. Numeric results differ slightly between the two versions, due to rounding, at the places identified by @@@@@.
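As a rough illustration of the difference between the two versions, the following sketch shows a 2 operations per word calculation, x[i] = (x[i] + a) * b, written with NEON intrinsics and with a plain C fallback that a compiler may vectorise itself. The expression is inferred from the add and multiply pair in the disassembly further down; the function name add_mul_2ops and its parameters are hypothetical and are not taken from the benchmark source.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical kernel: 2 operations per word, x[i] = (x[i] + a) * b.
   On ARM with NEON enabled, four words are processed per iteration;
   any remaining words (and all words on other CPUs) use plain C. */
void add_mul_2ops(float *x, size_t n, float a, float b)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t va = vdupq_n_f32(a);       /* a in all four lanes */
    float32x4_t vb = vdupq_n_f32(b);       /* b in all four lanes */
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(x + i);  /* one quad word load  */
        v = vaddq_f32(v, va);              /* four adds           */
        v = vmulq_f32(v, vb);              /* four multiplies     */
        vst1q_f32(x + i, v);               /* one quad word store */
    }
#endif
    for (; i < n; i++)
        x[i] = (x[i] + a) * b;
}
```

Each pass over n words counts as 2 x n floating point operations when converting run time to MFLOPS.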
Raspberry Pi 3 average performance gains over RPi 2 were 1.34 and 2.30 for the two sets of tests (2 and 32 operations per word) with the compiled version, and effectively the same for the program with intrinsic functions (see above).
As written, the 32 Operations Per Word arithmetic statements are in a loop with one load and one store, but the compiler generated numerous additional load instructions, producing code similar to that of MP-MFLOPSPiNeon (see assembly code below).
It could probably have been anticipated that there were insufficient registers for all the variables.
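The register shortage can be illustrated with a sketch of a 32 operations per word calculation. The form used here, eleven (x + constant) * constant terms combined with alternating signs, is inferred from the disassembled code and is not copied from the benchmark source; the array c and the function names are hypothetical. It needs 22 constants, which fit in the 32 single precision registers (s0 to s31) but not in the 16 quad word NEON registers (q0 to q15) of a 32 bit ARM v7 CPU once each constant occupies a full four word vector.

```c
#include <stddef.h>

enum { TERMS = 11 };   /* number of (x + c) * c terms, an assumption */

/* Hypothetical kernel: c[2k] is the added constant and c[2k+1] the
   multiplier of term k. Terms are combined as +t0 -t1 +t2 ... +t10. */
void add_mul_32ops(float *x, size_t n, const float c[2 * TERMS])
{
    for (size_t i = 0; i < n; i++) {
        float v = x[i];                              /* the one load   */
        float sum = (v + c[0]) * c[1];               /* add + multiply */
        for (int k = 1; k < TERMS; k++) {
            float term = (v + c[2 * k]) * c[2 * k + 1];
            sum = (k & 1) ? sum - term : sum + term; /* combine        */
        }
        x[i] = sum;                                  /* the one store  */
    }
}

/* 11 adds + 11 multiplies + 10 combining adds/subtracts = 32 */
int ops_per_word(void)     { return TERMS + TERMS + (TERMS - 1); }

/* 22 constants must stay live across the whole loop */
int constants_needed(void) { return 2 * TERMS; }
```

With 22 loop constants plus working values against 16 quad word registers, some constants have to be reloaded from the stack on every pass, which would be consistent with the extra load instructions described above.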
################## RPi 2 V7A2 Compiled NEON ####################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-MFLOPS Compiled NEON v1.0 Fri Mar 20 17:01:47 2015
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 361 446 329 692 678 647
2T 887 841 430 1371 1358 1300
4T 1596 1141 381 2719 2725 2482
8T 1542 1502 384 2604 2701 2460
Results x 100000
1T 76406 97075 99969 66008 95367 99951
End of test Fri Mar 20 17:01:58 2015
################## RPi 2 NEON Intrinsics #######################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
MP-MFLOPS NEON Intrinsics v1.0 Fri Mar 20 17:07:09 2015
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 249 347 268 709 706 679
2T 635 667 411 1403 1386 1323
4T 919 1342 377 2783 2798 2623
8T 1076 1341 380 2589 2476 2409
Results x 100000
1T 76406 97075 99969 66014 95363 99951
@@@@@ @@@@@
End of test Fri Mar 20 17:07:20 2015
################ RPi 2 NEON Intrinsics OC #######################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
MP-MFLOPS NEON Intrinsics v1.0 Fri Mar 20 17:19:01 2015
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 309 386 308 788 785 758
2T 778 745 500 1554 1546 1483
4T 1048 1461 468 3097 3072 2931
8T 1377 1253 465 2780 2781 2689
Results x 100000
1T 76406 97075 99969 66014 95363 99951
End of test Fri Mar 20 17:19:11 2015
################## RPi 3 V7A2 Compiled NEON ####################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-MFLOPS Compiled NEON v1.0 Mon Aug 15 19:09:46 2016
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 419 782 437 1672 1660 1637
2T 1324 1529 442 3331 3308 3212
4T 1903 1574 439 5040 6073 5738
8T 1613 2204 433 5543 5780 5445
Results x 100000
1T 76406 97075 99969 66008 95367 99951
End of test Mon Aug 15 19:09:52 2016
################## RPi 3 NEON Intrinsics #######################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-MFLOPS NEON Intrinsics v1.0 Mon Aug 15 19:41:37 2016
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 347 583 427 1706 1703 1657
2T 1080 1157 438 3397 3398 3226
4T 979 1430 437 6265 6128 5464
8T 1218 1351 436 5507 5766 5426
Results x 100000
1T 76406 97075 99969 66014 95363 99951
End of test Mon Aug 15 19:41:42 2016
|
As indicated in Raspberry Pi Benchmarks.htm, the original Linpack benchmark operates on double precision floating point 100x100 matrices (N = 100). This version uses mainly the same C programming code as the single precision floating point NEON compilation. It is run on 100x100, 500x500 and 1000x1000 matrices using 0, 1, 2 and 4 separate threads. The 0 thread procedures are identical to those in the single core 100 x 100 NEON compilation, using intrinsic functions.
The code differences were slight changes to allow a higher level of parallelism. The initial 100x100 Linpack benchmark is only of use for measuring performance of single processor systems. The one for shared memory multiple processor systems is a 1000x1000 variety. The programming code for this is the same as 100x100, except users are allowed to use their own linear equation solver.
Unlike the NEON MP MFLOPS benchmark, which carries out the same multiply/add calculations, this program can run much slower using multiple threads. This is due to the overhead of creating and closing threads too frequently. At 100x100, around 0.67 million floating point calculations are executed in daxpy, the critical function. With the present equations, threads have to be created 99 times (unless someone can do better and change more things). At 100x100, data size is 40 KB, so it is L2 cache based. With larger matrices, performance becomes more dependent on RAM, but multi-threading overheads have less influence.
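The figures quoted above can be reproduced with some simple arithmetic. The sketch below assumes the usual leading order estimate of 2N³/3 floating point operations for the elimination phase, four byte single precision words and one round of thread creation per elimination step (N - 1 steps); these assumptions and the function names are illustrative and are not taken from the benchmark source.

```c
#include <assert.h>

/* Approximate floating point operations in the daxpy calls of an
   N x N factorisation (leading order term only, an assumption). */
double daxpy_flops(int n)
{
    assert(n > 1);
    return 2.0 * n * n * n / 3.0;
}

/* Bytes occupied by the N x N matrix itself. */
long matrix_bytes(int n, int bytes_per_word)
{
    return (long)n * n * bytes_per_word;
}

/* Assumed number of elimination steps, each needing new threads. */
int elimination_steps(int n)
{
    return n - 1;
}

/* Average useful work available per round of thread creation. */
double flops_per_step(int n)
{
    return daxpy_flops(n) / elimination_steps(n);
}
```

For N = 100 this gives about 0.67 million operations, 40,000 bytes and 99 steps, so each round of thread creation has only around 6,700 operations of useful work on average. At N = 1000 the same estimate gives roughly 100 times more work per step, in line with the overheads having less influence on larger matrices.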
Without threading at N=100, as shown below, speed is a little faster than single core Linpack NEON MFLOPS, due to improved coding, but not as fast as MP-NeonMFLOPS, which has less variety in accessing data. Performance is worse at N=500 and 1000, where data is mainly from RAM.
The benchmark checks that the numeric results produced using threads are identical to those without threading. As expected, these are not the same using different matrix sizes, and the N=100 results are the same as those from linpackPiNEONi, the single core version.
Raspberry Pi 3 - At N=100, average speed was 1.73 times that of an RPi 2, with 1.52 to 1.59 times using the larger matrices. These can be compared with a CPU MHz ratio of 1.33.
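One way to arrive at these ratios is to divide each Raspberry Pi 3 MFLOPS figure by the corresponding RPi 2 figure (900 MHz results) and average over the four thread settings. The sketch below does this with the values from the logs that follow; the averaging method is an assumption that happens to reproduce the quoted numbers, and the names are hypothetical.

```c
#include <assert.h>

enum { SIZES = 3, CONFIGS = 4 };  /* N = 100, 500, 1000; None, 1, 2, 4 threads */

/* MFLOPS copied from the RPi 2 (900 MHz) and RPi 3 result logs */
static const double rpi2_mflops[SIZES][CONFIGS] = {
    {323.06,  66.59,  64.76,  64.64},
    {276.52, 216.62, 215.69, 216.28},
    {235.25, 221.69, 222.63, 223.98}
};
static const double rpi3_mflops[SIZES][CONFIGS] = {
    {538.46, 116.24, 113.61, 113.47},
    {467.73, 335.53, 338.61, 338.97},
    {363.87, 336.10, 336.72, 336.22}
};

/* Mean of the RPi 3 / RPi 2 speed ratios for one matrix size. */
double mean_speed_ratio(int size_index)
{
    assert(size_index >= 0 && size_index < SIZES);
    double sum = 0.0;
    for (int c = 0; c < CONFIGS; c++)
        sum += rpi3_mflops[size_index][c] / rpi2_mflops[size_index][c];
    return sum / CONFIGS;
}
```

Calling mean_speed_ratio with 0, 1 and 2 gives approximately 1.73, 1.59 and 1.52 for N = 100, 500 and 1000.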
######################### RPi 2 NEON ############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
linpackPiNEONi MFLOPS 300
MP-NeonMFLOPS MFLOPS 347 at 128 KB
######################### RPi 2 NEON ############################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
Linpack Single Precision MultiThreaded Benchmark
Using NEON Intrinsics, Sun Mar 22 15:37:56 2015
MFLOPS 0 to 4 Threads, N 100, 500, 1000
Threads None 1 2 4
N 100 323.06 66.59 64.76 64.64
N 500 276.52 216.62 215.69 216.28
N 1000 235.25 221.69 222.63 223.98
NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1
N 100 500 1000
NR 2.17 5.42 9.50
RE 5.16722466e-05 6.46698638e-04 2.26586126e-03
MA 1.19209290e-07 1.19209290e-07 1.19209290e-07
X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
XN -5.06639481e-06 -4.70876694e-06 1.41978264e-04
Thread
0 - 4 Same Results Same Results Same Results
####################### RPi 2 NEON OC ###########################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
Linpack Single Precision MultiThreaded Benchmark
Using NEON Intrinsics, Sun Mar 22 15:47:04 2015
MFLOPS 0 to 4 Threads, N 100, 500, 1000
Threads None 1 2 4
N 100 362.42 74.74 74.63 75.16
N 500 326.00 259.13 257.42 258.82
N 1000 280.61 262.30 262.31 262.38
NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1
N 100 500 1000
NR 2.17 5.42 9.50
RE 5.16722466e-05 6.46698638e-04 2.26586126e-03
MA 1.19209290e-07 1.19209290e-07 1.19209290e-07
X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
XN -5.06639481e-06 -4.70876694e-06 1.41978264e-04
Thread
0 - 4 Same Results Same Results Same Results
######################### RPi 3 NEON ############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Linpack Single Precision MultiThreaded Benchmark
Using NEON Intrinsics, Mon Aug 15 19:44:30 2016
MFLOPS 0 to 4 Threads, N 100, 500, 1000
Threads None 1 2 4
N 100 538.46 116.24 113.61 113.47
N 500 467.73 335.53 338.61 338.97
N 1000 363.87 336.10 336.72 336.22
NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1
N 100 500 1000
NR 2.17 5.42 9.50
RE 5.16722466e-05 6.46698638e-04 2.26586126e-03
MA 1.19209290e-07 1.19209290e-07 1.19209290e-07
X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
XN -5.06639481e-06 -4.70876694e-06 1.41978264e-04
Thread
0 - 4 Same Results Same Results Same Results
|
Below are examples of disassembled code for MP-MFLOPS and the NEON variety. The first uses 32 bit single precision floating point registers. With 32 arithmetic calculations per word, at least, use is made of the advanced instructions VFMA and VFMS (Vector Fused Multiply Accumulate or Subtract). Ten of these execute 20 of the 32 floating point operations, the other twelve coming from conventional add and multiply instructions.
The NEON compilation uses the same VFMA and VFMS instructions, but with 128 bit quad words for SIMD operation, so that the 10 VFMAs (or VFMSs) in the loop execute 80 floating point operations. Four word vectors are also used for adds and multiplies. This produces up to 1.7 GFLOPS per core on a Raspberry Pi 3, which is not very good against a maximum of 9.6 GFLOPS. Part of the reason is the excessive number of load instructions, probably due to an insufficient number of registers. With compiler generated unrolling, disassembled code can show many more sets of calculations, to cover situations where data is too small for the whole unrolled loop.
The notOpenMP-MFLOPS benchmark (OpenMP-MFLOPS but not using OMP threads) has an extra test with 8 operations per word. As shown below, the inner loop is unrolled by the compiler to produce 32 calculations via quad word vectors, but at not much more than 1.7 GFLOPS. Manually unrolling the loop to 16 x 8 calculations did not lead to further unrolling by the compiler. With four times more calculations in the loop, a maximum of just over 3 GFLOPS could be demonstrated, still a long way from 9.6.
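The 9.6 GFLOPS maximum can be reproduced by assuming that one four word fused multiply and add (two operations per word) is completed on every clock cycle at 1200 MHz. That issue rate is an assumption used here for illustration, not a figure confirmed in this document, and the function names are hypothetical.

```c
#include <assert.h>

/* Theoretical single precision peak for one core, in MFLOPS.
   lanes and ops_per_fma are assumptions about the hardware. */
double peak_mflops(double cpu_mhz, int lanes, int ops_per_fma)
{
    assert(cpu_mhz > 0.0 && lanes > 0 && ops_per_fma > 0);
    return cpu_mhz * lanes * ops_per_fma;
}

/* Measured speed as a fraction of the theoretical peak. */
double fraction_of_peak(double measured_mflops, double peak)
{
    assert(peak > 0.0);
    return measured_mflops / peak;
}
```

With peak_mflops(1200.0, 4, 2) giving 9600 MFLOPS, the measured 1.7 GFLOPS is about 18% of that figure and the 3 GFLOPS from the manually unrolled loop is about 31%.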
MP-MFLOPSPiA7, MP-MFLOPSPiNeon
2 Operations Per Word 2 Operations Per Word
.L27: .L83:
flds s15, [r1] vld1.64 {d16-d17}, [lr:64]
fadds s15, s0, s15 add r4, r4, #1
fmuls s15, s15, s1 add lr, lr, #16
fstmias r1!, {s15} cmp r2, r4
cmp r1, r0 add r3, r3, #16
bne .L27 vadd.f32 q8, q8, q10
vmul.f32 q8, q8, q9
vstr d16, [r3, #-16]
vstr d17, [r3, #-8]
bhi .L83
32 Operations Per Word 32 Operations Per Word
.L21: .L61:
flds s23, [r1] vld1.64 {d18-d19}, [lr:64]
fadds s16, s23, s2 vldr d16, [sp, #64]
fadds s24, s23, s0 vldr d17, [sp, #72]
fadds s31, s23, s4 vldr d14, [sp, #80]
fadds s30, s23, s6 vldr d15, [sp, #88]
fnmuls s16, s3, s16 vadd.f32 q8, q9, q8
fadds s29, s23, s8 vld1.64 {d20-d21}, [sp:64]
fadds s28, s23, s10 vmul.f32 q8, q8, q7
fadds s27, s23, s12 vadd.f32 q10, q9, q10
vfma.f32 s16, s24, s1 vldr d14, [sp, #16]
fadds s26, s23, s14 vldr d15, [sp, #24]
fadds s25, s23, s17 vldr d22, [sp, #144]
fadds s24, s23, s19 vldr d23, [sp, #152]
fadds s23, s23, s21 vfma.f32 q8, q10, q7
vfma.f32 s16, s31, s5 vldr d20, [sp, #128]
vfms.f32 s16, s30, s7 vldr d21, [sp, #136]
vfma.f32 s16, s29, s9 vldr d14, [sp, #192]
vfms.f32 s16, s28, s11 vldr d15, [sp, #200]
vfma.f32 s16, s27, s13 vadd.f32 q10, q9, q10
vfms.f32 s16, s26, s15 vadd.f32 q7, q9, q7
vfma.f32 s16, s25, s18 vfma.f32 q8, q10, q11
vfms.f32 s16, s24, s20 vldr d22, [sp, #208]
vfma.f32 s16, s23, s22 vldr d23, [sp, #216]
fstmias r1!, {s16} vadd.f32 q10, q9, q15
cmp r1, r0 add r4, r4, #1
bne .L21 add lr, lr, #16
cmp r2, r4
add r3, r3, #16
NotOpenMP vfma.f32 q8, q7, q11
vldr d22, [sp, #256]
8 Operations Per Word vldr d23, [sp, #264]
vadd.f32 q7, q9, q11
.L31: vldr d22, [sp, #240]
vld1.64 {d18-d19}, [lr:64] vldr d23, [sp, #248]
add r4, r4, #1 vfma.f32 q8, q10, q14
add lr, lr, #16 vldr d20, [sp, #32]
cmp r2, r4 vldr d21, [sp, #40]
add r3, r3, #16 vadd.f32 q10, q9, q10
vadd.f32 q8, q9, q12 vfma.f32 q8, q7, q11
vadd.f32 q10, q9, q3 vldr d22, [sp, #96]
vmul.f32 q8, q8, q11 vldr d23, [sp, #104]
vadd.f32 q9, q9, q14 vadd.f32 q7, q9, q11
vfma.f32 q8, q10, q15 vldr d22, [sp, #48]
vfms.f32 q8, q9, q13 vldr d23, [sp, #56]
vstr d16, [r3, #-16] vfms.f32 q8, q10, q11
vstr d17, [r3, #-8] vldr d22, [sp, #112]
bhi .L31 vldr d23, [sp, #120]
vldr d20, [sp, #160]
vldr d21, [sp, #168]
vadd.f32 q10, q9, q10
vfms.f32 q8, q7, q11
vldr d22, [sp, #224]
vldr d23, [sp, #232]
vadd.f32 q7, q9, q11
vldr d22, [sp, #176]
vldr d23, [sp, #184]
vadd.f32 q9, q9, q13
vfms.f32 q8, q10, q11
vfms.f32 q8, q7, q6
vfms.f32 q8, q9, q12
vstr d16, [r3, #-16]
vstr d17, [r3, #-8]
bhi .L61
|
Roy Longbottom September 2016