Roy Longbottom Contents
Below is a list of links to a large number of my reports, under the following headings, along with separate summaries of each. See the ## note in the Introduction about viewing PDF files from ResearchGate. If HTM files are not available from the links provided here, you might find them on the Wayback Machine for roylongbottom.org.uk or by this Wayback Archive link.
MultiThreading Benchmarks Files
The Collection comprises a large number of free benchmarks and stress testing programs, with no advertisements, that run under Windows, Linux and Android, using PCs, Raspberry Pi or other single board computers, phones and tablets. Besides details of the programs and extensive results, the following reports also provide links to compressed files containing the source codes and execution files. Except for Android apps, no formal installation process is needed: simply extract from the compressed file and run by a command or click. The programs are aimed at identifying best and worst performance characteristics, not a single overall rating. They are mainly calibrated to run for noticeable times, with results displayed on an ongoing basis and saved in text log files.
The Historic Data Reports included provide performance ratings of computers released from 1954 to more modern times, most including cost besides the year of manufacture. They are based on benchmark results and information collected by myself and my colleagues, engineers working for the UK Government’s Central Computer Agency, formed in 1957.
This New Home Page provides links to more than 40 reports, mainly in both HTM and PDF format, along with brief summaries. Using current browsers, it uses automatic word wrap so that all the text can be seen in a PC window and on mobile phone or tablet screens. Also, HTM reports can be manually stretched and moved side to side. The identified reports contain wide tables of numeric data, not exactly mobile phone friendly, but they can also be stretched and moved sideways. PDF files might need to be downloaded to view the detail ##. The PDF files are provided from ResearchGate.
## To download PDF files from ResearchGate select Download from "More v" top line options.
About Roy Longbottom
Celebrating 50 years of computer benchmarking and stress testing 1972 to 2022 - From 1972 to 2022 I produced and ran computer benchmarking and stress testing programs. The Whetstone Benchmark, for which I became the design authority, also covered exactly the same time span.
Stress Tests - I wrote a series of programs to use during acceptance trials of computers purchased by the UK Government. From 1972, these were used on many hundreds of acceptance trials up to 1990. I personally supervised trials of the first range of supercomputers, including CDC 7600 and Cray 1, where, out of five such systems, my programs led to three failed first trials. Then, over the years, I produced stress tests to run via Windows, Linux and other Operating Systems, covering PCs, Android based devices and Raspberry Pi systems. In 2019 (aged 84), I was recruited as a voluntary member of the Raspberry Pi pre-release Alpha testing team, my 2022 contribution being for the Raspberry Pi Pico W.
Whetstone Benchmarks - This was produced by my colleague Harold Curnow, who passed over responsibility to me later. In 1972, I included it in the acceptance trials suite of programs. I introduced timing and output format changes, aimed at verifying final numeric calculations and identifying unexpected performance attributes. I produced a vector processing version for supercomputers. Then came new varieties for the same range of technology quoted for stress tests.
Original Main Page (see About Roy) -
Historic Data Summary Computer Speeds From Instruction Mixes pre-1960 to 1971 - 190 Gibson and ADP instruction mix results from 18 manufacturers Headings - Manufacturer, Model, Word Size bits, Memory Max, Memory Cycle Time, Gibson Mix KIPS, ADP Mix KIPS, Intro Year
Computer Speed Claims 1980 to 1996 -
Headings - No. of CPUs, OS/CPU chip, MHz, MIPS, MAX MFLOPS, Type, Year, Cost GBP
PC CPUID 1994 to 2013, plus Measured Maximum Speeds Via Assembler Code -
Sections Features Codes, Model Codes, More than 80 sets of results from 80486 to Core i7 and Phenom Headings - Model, MHz, MIPS and MFLOPS using 1, 2, 3 and 4 registers, 32 bit and 64 bit. Operations normal, MMX, SSE, SSE2, AVX, 3DNow, SP, DP, 1, 2, 4, and 8 threads,
PC CPU Specifications 1994 to 2014, plus Measured MIPS and MFLOPS per MHz -
Intel and AMD CPU Characteristics - 28 pages, Model, CPUs, Cores, MHz from to, KB L1 L2 L3 caches, HT and RAM MHz, CPUID Measured MIPS and MFLOPS per MHz - 80486 to Core i7 and Phenom, 8 pages derived from benchmarks CPUID, BusSpeed, RandMem, Classics (Whetstone, Dhrystone, Linpack, Livermore Loops), SSE3DNow, FFTGGraf, some covering CPU, caches and RAM.
Whetstone Benchmark History and Results 1973 to 2014 -
Headings - System, CPU, MHz, MWIPS, MFLOPS, VAX MIPS, DP MWIPS, Language, Opt, Cost $K, Intro Date Plus PCs - 75 results, MWIPS from 22 CPUs using 12 different interpreters and compilers, MP MWIPS 1 to 8 cores on 5 systems, %MWIPS/MHz efficiency (between 0.03 and 311)
Cray 1 Supercomputer Performance Comparisons With Home Computers Phones and Tablets -
Topics - My background and benchmarks, Main tests on Cray 1, Raspberry Pi 1 to 4 and 400, Android phones and tablets, Windows and Linux based PCs, SIMD considerations, other supercomputers, Performance Summary Cray 1, PC AVX 512, Android phone, Raspberry Pi 400, Error reports.
Classic Benchmarks Summary
These have an initial calibration to run individual test functions for a noticeable finite time, with results displayed as the programs progress. The benchmarks measure performance of single CPUs, which tends to be proportional to MHz, particularly at a given level of technology. For PCs, they cover processors from 80386 to Core i7. Following the latter, CPU MHz has not increased sufficiently to pursue further results. But some are included in the Historic Data section Cray 1 report, for a 2021 Intel 11th Generation CPU that has advanced vector processing type functions. For PCs, first versions included compilations with and without optimisation, with some from other compilers. The bulk of PC results are from DOS, OS/2, Windows and Linux varieties. Limited ones are provided for Android and Raspberry Pi devices, where many more up to date performance details are covered in other reports.
From 1972 Whetstone -
Besides C/C++ compilations, results are included from Fortran, Java, Basic and Visual Basic versions. There are 21 pages covering around 670 sets of results, each with 10 entries, over 17 categories (including SP, DP, 1 core, MP, Opt, No Opt, 16 bit, 32 bit, 64 bit, different Operating Systems, different programming languages and compilers, different manufacturers). The largest group is for original C compilations, for 76 CPUs of 1991 to 2017 vintage.
From 1984 Dhrystone -
There are 149 sets of results containing between 1 and 4 DMIPS ratings, covering the same range of appropriate categories and vintage as those for the Whetstone Benchmarks. For PCs, there are 75 results, each containing DMIPS for versions Dhrystone 1 and 2, produced by optimised and non-optimised compilations.
From 1979 Linpack 100 -
The largest batch is for the original double precision Linpack 100 benchmark, running on PCs and comprising 80 optimised and 80 non-optimised MFLOPS measurements.
From 1970 Livermore Loops -
The main performance ratings are three variations of average MFLOPS, with minimum and maximum, for each of 203 results. In turn, the MFLOPS scores of the 24 selected loops are provided for most. There are 59 sets of ratings for the 1991 to 2017 vintage CPUs, for both optimised and non-optimised benchmark compilations.
Memory Benchmarks Summary
Windows and Linux CPU, Cache and RAM PC Benchmarks - For all of the Windows memory benchmarks, results are provided covering more than 20 years from 80386 or 80486 CPUs to Core i7 and AMD equivalents, with separate tables providing sample performance measurements, mainly in MBytes per second, for RAM and each variety of cache. Examples of full output are shown for all benchmarks. For some, calculations are carried out by assembly code, others by C/C++ compilations. There are also both 32 bit and 64 bit varieties. Linux results cover the same areas for three of the later processors, concentrating on providing comparisons between 32 bit and 64 bit working, with some including the use of more advanced SIMD operation.
MemSpeed - carries out three different sets of single and double precision floating point and integer calculations via two data arrays. Two versions are available, the first one, originally written to run under DOS, being based on assembly code.
BusSpeed - The benchmark is intended to demonstrate maximum data transfer speeds from buses and caches. On the latest PCs, use of multiple cores appears to be required to achieve this goal. The program starts by reading one word, with a large address increment for the next one, the increment being reduced by a half for following measurements, until all data is read. This identifies where data is read in bursts and provides a means of estimating bus and maximum RAM (or cache) speed.
RandMem - Serial and random address selections are employed by this benchmark, using the same complex integer based indexing, with read and read/write tests for 32 bit integers and 64 bit floating point numbers. The main purpose is to show the difference between serial and random data transfer speed, where that for the latter is considerably reduced by burst reading or writing, in turn affected by data size.
SSEfpu - This carries out floating point calculations, similar to MemSpeed, to compare data transfer speeds, and associated MFLOPS, between two at a time SSE2 double precision, four at a time SSE and single word calculations.
FFT Benchmarks - Three versions were produced, the first being the original C code, the second with further optimised assembly language and the third using SSE SIMD instructions. The benchmarks run code for single and double precision Fast Fourier Transforms of size 1024 to 1048576 (1K to 1024K), each one being run a number of times to identify variance, with results in milliseconds.
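The BusSpeed access pattern described above (a large address increment that is halved for each following measurement until every word is read) can be illustrated with a minimal Python sketch. The function names, the starting increment of 32 and the 64 word array are hypothetical choices for illustration, not details of the actual benchmark, which is not written in Python.

```python
def halving_strides(start_increment):
    """Return the address increments: start large, halve until every word is read."""
    strides = []
    inc = start_increment
    while inc >= 1:
        strides.append(inc)
        inc //= 2
    return strides


def read_pass(data, increment):
    """Read one word every `increment` words; return (words_read, checksum)."""
    total = 0
    count = 0
    for i in range(0, len(data), increment):
        total += data[i]
        count += 1
    return count, total


if __name__ == "__main__":
    words = list(range(64))          # stand-in for the benchmark's data array
    for inc in halving_strides(32):  # 32 is an illustrative starting increment
        n, _ = read_pass(words, inc)
        print(f"increment {inc:2d}: {n} words read")
```

In a real measurement each pass would be timed, so that the change in speed between increments indicates where data starts to arrive in bursts.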
MultiThreading Benchmarks Summary
Windows and Linux MultiThreading Benchmarks - These benchmarks execute the same code as the original, designed to exercise a single CPU, but implementing multithreading to use up to all available cores. For most, multithreading levels are controlled by the program, with others using OpenMP and QPAR to automatically generate parallelism. This report concentrates on showing variations in performance of a quad core, 8 thread CPU, with links to other reports covering many different processors. With up to 8 columns of results, details are provided for each thread count from 1 to 8, compiled for 32 bit and 64 bit working, via Windows and Linux.
Whetstone MP Benchmark - is mainly dependent on floating point speed but with some independently timed integer test functions. Each thread executes shared code using mainly L1 cache based independent variables, leading to performance being proportional to the number of cores, or higher with hyperthreading.
Assembly Code Arithmetic - This executes integer and SSE floating point add instructions via independent threads.
BusSpeed MP Benchmark - provides read only access to data in caches and RAM. It is intended to demonstrate bus operation and speed where data is transferred in bursts and maximum data transfer speed. In the original Windows version, each thread read all the data, starting at the same point. This had to be modified for Linux, due to excessive impact of caching.
RandMem MP Benchmark - The program uses the same code for serial and random access via a complex indexing structure and comprises Read and Read/Write tests, covering data from caches and RAM. This benchmark uses data from the same array for all threads, but starting at different points.
MP MFLOPS Benchmark - The benchmark carries out calculations of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word, via caches and RAM. Each thread deals with separate segments of the data, via shared code, fully demonstrating multithreading speed gains. Performance is highly dependent on the ability of the compiler available at production time, particularly using SIMD options.
OpenMP MFLOPS Benchmark - The benchmark carries out the same calculations as MP MFLOPS Benchmark, essentially using the same code, without any OpenMP code requirements, but with critical loops preceded by a simple “go parallel” directive.
QPAR MFLOPS Benchmark - QPAR is a Microsoft alternative to OpenMP.
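The MP MFLOPS arrangement (shared code, with each thread updating its own segment of the data) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and constants are hypothetical, the formula shown is the 8 operation case, and CPython threads will not show the speed gains of the compiled benchmark because of the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

OPS_PER_WORD = 8  # 3 adds + 3 multiplies + 1 subtract + 1 add in the formula


def formula(x, a, b, c, d, e, f):
    """Apply the quoted calculation to one value."""
    return (x + a) * b - (x + c) * d + (x + e) * f


def update_segment(data, start, stop, consts):
    """Update data[start:stop] in place; each thread owns one segment."""
    for i in range(start, stop):
        data[i] = formula(data[i], *consts)


def run_threads(data, threads, consts):
    """Split the array into contiguous segments, one per thread."""
    n = len(data)
    bounds = [(t * n // threads, (t + 1) * n // threads) for t in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(update_segment, data, lo, hi, consts)
                   for lo, hi in bounds]
        for fut in futures:
            fut.result()  # re-raise any worker exception


def mflops(words, passes, seconds):
    """Millions of floating point operations per second for a timed run."""
    return words * OPS_PER_WORD * passes / seconds / 1e6
```

For example, one million words processed for 10 passes in 2 seconds would be reported by `mflops` as 40 MFLOPS.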
Graphics Benchmarks Summary
Windows and Linux Graphics Benchmarks - Reports on the following contain numerous results and links to download variations, plus benchmarks and source codes. Here, the main results are for a Core i7 CPU, some with comparisons with older computers and different Operating Systems, at 32 bits or 64 bits, covering a range of monitor screen resolutions.
Windows Drawing Benchmarks - draws different shapes, copies blocks of image data, colours an area, and pokes pixels, with performance measured in Millions of Pixels Per Second and Frames Per Second.
Windows DirectDraw Benchmarks - uses DirectDraw functions to copy image data and to colour fill an area, with performance measured in Millions of Pixels Per Second and Frames Per Second.
Windows Direct3D Benchmark - uses Direct3D functions operating on wireframe, coloured and textured moving objects, with performance measured in Frames Per Second, replaced with Direct3D9 Benchmarks for 32 bit and 64 bit versions.
Windows Direct3D9 Benchmarks - with similar wireframe, coloured and textured objects plus use of Pixel and Vertex Shaders, with performance measured in Frames Per Second.
Windows OpenGL Benchmarks - Coloured and textured moving objects, again, and a complex wireframe and textured real simulation of a kitchen, with performance measured in Frames Per Second.
Windows BMPSpeed Benchmarks - This is a system test, with graphics activity, comprising writing and reading small to enlarged images, scrolling and rotating them, with time in seconds and milliseconds plus MB/second for scrolling.
JavaDraw Benchmark - for running via Windows, Linux and Android, starting with a simple scene with added complexity for subsequent tests, with performance measured in Frames Per Second. There is also an on-line version of this benchmark, executed via a downloaded HTML document.
Linux OpenGL Benchmarks - This is a similar, but enhanced, version of the Windows OpenGL program. Approval was given to Canonical to include this benchmark in the testing framework for the Unity desktop.
Linux SDL BMPSpeed Benchmarks - carrying out the same functional tests as Windows BMPSpeed, but written using Simple DirectMedia Layer functions.
Stress Testing - The benchmarks have run time parameters to include them in a stress testing exercise, including which section to run and running time.
Input/Output Benchmarks Summary
Windows, Linux and Android Data Storage Device Benchmarks - Again, the general htm file provides examples of performance, with the detail provided in the following main files. These also include links to download programs and source codes.
DiskGRAF is a full Windows application that measures speeds of serial writing and reading, then for cached and random access activity. Results are logged in a text file and, for serial operation, shown graphically. The main report includes 4 tables of results, each containing more than 70 sets of performance and CPU utilisation results of disk drives, covering 1994 to 2014 vintage PCs. Other results are for CD and DVD writers, flash, firewire and network drives.
CDDVDSpd is another full Windows application that measures writing and reading times/speeds of a large file and 520 small files, on most types of mass storage devices. Results are provided for the same period as DiskGRAF, with 67 devices covered from floppy disks to 7200 RPM disks, then SSD, SD, USB and firewire drives, plus those accessed via WiFi and LAN networks.
DriveSpeed is a command line driven program, with variations for Windows and Linux, having parameters for path/device to use and large file sizes. For the latter, a number of files are written and read. Then there is a test for cached data, followed by one handling random access. Finally, a large number of differently sized small files are written and read. The identified main file has a number of tables, one covering 15 disk drives, variously using Linux or Windows, NTFS, FAT or Ext formatting, main or USB drives. Then there are 12 similar entries for flash drives, 14 for a revised benchmark with random access, 8 for 2014 drives and 3 covering 2016 Windows tablets.
LANspeed is a variant of DriveSpeed that enables running on a selected network drive. A Windows executable version is also available and is run from a Windows based PC by clicking on the file resident on a remote computer. This has 12 sets of LAN and WiFi results accessing PCs, desktops and a netbook, using 32 bit and 64 bit compilations.
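The large file part of a DriveSpeed style measurement (write a file, read it back, report MB/second) might be sketched as below. This is a hedged illustration only: `time_large_file`, its parameters and the byte pattern are hypothetical, and the read pass here is likely to be served from the operating system's file cache rather than the device, which the real programs take steps to avoid.

```python
import os
import tempfile
import time


def time_large_file(directory, size_mb, block_kb=1024):
    """Write then read one file; return (write MB/s, read MB/s, data_ok)."""
    block = bytes(range(256)) * (block_kb * 1024 // 256)  # known data pattern
    blocks = size_mb * 1024 // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push data to the device, not just the cache
        write_secs = time.perf_counter() - start

        ok = True
        start = time.perf_counter()
        with open(path, "rb") as f:
            for _ in range(blocks):
                if f.read(len(block)) != block:  # check data on reading
                    ok = False
        read_secs = time.perf_counter() - start
    finally:
        os.remove(path)
    mb = blocks * len(block) / (1024 * 1024)
    return mb / max(write_secs, 1e-9), mb / max(read_secs, 1e-9), ok
```

A call such as `time_large_file(tempfile.gettempdir(), 8)` would write and read an 8 MB file in the system's temporary directory; pointing `directory` at a USB or network path changes the device under test.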
Stress Testing Programs Summary
DOS, Windows and Linux Stress Testing Programs - The first stress tests, in this collection, were based on programs that I wrote for acceptance trials of computers purchased by the UK Government. See Celebrating 50 years of computer benchmarking and stress testing. Initial requirements were that running times and data volumes should be controllable and results of calculations should be checked for correctness or consistency, with a clear indication provided of any errors or absolute minimum output, if needed for manual checking. Checking written data, on reading, or results of integer calculations, presented no problems. For floating point, either a simple integer sumcheck was produced or a series of calculations arranged to obtain a theoretical value of 1.0, that would be multiplied by results from repeating the calculations, to generate a final answer close to 1.0.
The program used for stressing input/output writes a number of files, filled with blocks of different data patterns. Reading is carried out, one block at a time, with the target file selected on a random basis. Finally each block, from one file, is read repetitively, intended to be from a disk’s buffer. File sizes and running times can be specified.
The file, accessible here, has the following sections. Each provides further links to detailed reports and for benchmark downloads, also sample log files produced by the programs.
DOS and Windows PC CPU Tests - CPU benchmarks CPR4DOS.EXE, FPtest.exe - includes sample results from 1997 and 2017, plus example sumchecks on different CPUs.
DOS and Windows PC Drive Tests - CDK1DOS.EXE, DiskTest.exe - with program data patterns, plus 1997 and 2017 logs.
Livermore Loops Benchmark EXE files - Modified for extended running time and for checking results. In its original form, it was found to produce the wrong results of numeric calculations on an overclocked PC.
BusSpd2k.exe Full Windows app - Stress test added to benchmarking options, particularly to select data size to test caches or RAM. The program uses a variety of different data patterns. Examples of data comparison failures are provided, believed to be from an overclocked PC.
IntBurn64.exe Full 64 bit Windows app - Same program as BusSpd2k stress test.
Windows Multiprocessor Integer Stress Tests - Identifies files covering MP tests using multiple copies of other stress tests. Example of performance provided, using 1, 4 and 8 copies on a quad core/8 thread PC.
Windows Floating Point Stress Tests - SSE3DSoak.exe and SSEburn64.exe use assembly code SSE, SSE2 or 3DNow Single Instruction Multiple Data (SIMD) floating point instructions to soak test the CPU, Cache or RAM. Includes temperature graph over 8 minutes, running 4 copies of the program.
Windows Graphics Stress Tests - CUDA MFLOPS, VideoD3D9_64, VideoD3D9_32 - These graphics benchmarks have parameters to specify running time and which test procedure to use. The report, directly accessible here, includes results of 10 minute tests that ran at constant speeds on a particular PC. The CUDA test identified a graphics processor temperature increase of 30°C.
Linux PC CPU Tests - lloops, lloops_64, intburn32, intburn64, burninsse32 and burninsse64 - (same as Windows programs) - These new 32/64 bit command line driven benchmarks were the forerunners of my later test programs, avoiding the overcomplex Windows procedures. The more detailed summary report identifies excessive CPU temperatures and the result of cleaning the heatsink. These tests caused a laptop to overheat to the point of failure and, for the first time, identified the effects of system induced CPU MHz changes.
Linux PC Drive Tests - drivestress32, drivestress64 - (same as Windows program)
Linux Graphics Stress Tests - cudamflops32SP, cudamflops64SP - (same as Windows programs), videogl32, videogl64 (OpenGL) - Report includes samples of performance and CPU/GPU temperatures, running seven copies of the CPU tests along with the OpenGL program.
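The floating point checking idea described earlier in this summary (calculations arranged to give a theoretical 1.0, multiplied together over repeated passes so that the final answer should stay close to 1.0) can be illustrated with a minimal Python sketch. The particular expression, input values and tolerance are hypothetical; the calculations used in the actual stress tests are not given here.

```python
def unity_step(x, y):
    """A calculation whose exact mathematical result is 1.0."""
    return ((x + y) * (x - y)) / (x * x - y * y)


def stress_pass(repeats, x=1.1, y=0.3):
    """Multiply a running result by each repeat's value; expect about 1.0."""
    result = 1.0
    for _ in range(repeats):
        result *= unity_step(x, y)
    return result


def check(result, tolerance=1e-6):
    """Report whether the final answer is acceptably close to 1.0."""
    return abs(result - 1.0) <= tolerance


if __name__ == "__main__":
    final = stress_pass(100_000)
    print("final answer", final, "PASS" if check(final) else "FAIL")
```

Small rounding differences are expected, so the check uses a tolerance rather than exact equality; a hardware fault that corrupts any one pass would push the product well away from 1.0.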
Raspberry Pi Benchmarks and Stress Tests Summary 1 These benchmarks were compiled to run on ARM processors and are essentially same as the latest programs produced to run on Intel CPUs, via Windows and Linux. ARM versions were also included to suit newer technology, for both 32 bit and 64 bit working. In many cases, detailed descriptions of the benchmarks are included.
Raspberry Pi, Pi 2 and Pi 3 32 Bit and 64 Bit Benchmarks and Stress Tests -
Next we have Java Whetstone, JavaDraw, OpenGL ES and the cross platform OpenGL GLUT results, along with screenshots. DriveSpeed measurements are included for all processors, using main SD cards, USB drives and various formatting options, then many covering LanSpeed data transfers, including at 64 bits. Finally, there are examples of stress tests that highlight identified problems. The first is for the single core Pi 1, where running a CPU test and an OpenGL one led to failures using the CPU overclocking option. The second problem is the Pi 3 system crashing, running my new OpenGL GLUT benchmark, where a new version of the Operating System provided a fix. The main considerations are temperature effects on the Pi 3 at 64 bits, using all four CPU cores, with several tables identifying excessive temperatures producing CPU MHz throttling. Then there are some that show slow single core performance using default power settings. Lastly, results demonstrate less throttling on installing a CPU heatsink, then full speed after installing the system board in a special metal case.
Raspberry Pi OpenElec Benchmarks -
Raspberry Pi 1, 2, 3 Multithreading Benchmarks -
Raspberry Pi 2 and 3 Stress Tests -
Raspberry Pi 3B 32 bit and 64 bit Benchmarks and Stress Tests -
For most benchmarks, results of using both 32 bit and 64 bit working are provided, generally showing performance gains for the latter. Problems encountered were 64 bit Linux Gentoo handling drive input/output in a non-standard way and peculiarities running LAN and WiFi benchmarks. A new program was produced for stress testing, measuring CPU MHz, voltage and temperature. This demonstrated that the 3B+ CPU MHz reduced from 1400 to 1200 when the temperature reached 70°C, with further throttling at 80°C. Core voltage also reduced. Integer and floating point stress tests were run at both 32 bits and 64 bits. With no heatsink and a plastic case, all reached the 70°C threshold, and 80°C with the former. The latter 64 bit code benefited from using NEON SIMD vector instructions (disassembled examples provided). Using a special metal case, with three 15 minute CPU stressing programs and an OpenGL one, most tests recorded 1400 (sample not average) MHz, with the odd reduction to 1200 and up to 6% average performance reduction.
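As a rough illustration of sampling CPU MHz and temperature during a stress test, the following Python sketch reads two standard Linux sysfs files. It is an assumption that these paths exist on the system under test, it does not read core voltage, and it is not the monitoring program described above. The reported frequency may also not reflect every form of firmware level throttling.

```python
from pathlib import Path

# Assumed standard Linux sysfs locations; they may be absent on some systems.
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
FREQ_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")


def parse_temp_c(text):
    """sysfs reports temperature in millidegrees Celsius."""
    return int(text.strip()) / 1000.0


def parse_mhz(text):
    """sysfs reports frequency in kHz."""
    return int(text.strip()) / 1000.0


def read_value(path, parser):
    """Return the parsed value, or None if the file is missing or unreadable."""
    try:
        return parser(path.read_text())
    except (OSError, ValueError):
        return None


def sample():
    """Return one (MHz, degrees C) sample; either entry may be None."""
    return read_value(FREQ_PATH, parse_mhz), read_value(TEMP_PATH, parse_temp_c)


def is_throttled(mhz, full_mhz):
    """True when a sampled clock is below the expected full speed."""
    return mhz is not None and mhz < full_mhz
```

Calling `sample()` at fixed intervals while a test runs gives the kind of minute by minute MHz and temperature record discussed in these reports; `is_throttled(1200.0, 1400.0)` is true for the 3B+ reduction quoted above.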
Raspberry Pi 3B and 3B+ High Performance Linpack and Error Tests -
This report covers running an existing version of HPL that uses BLAS (Basic Linear Algebra Subprograms) and another with ATLAS (Automatically Tuned Linear Algebra Software) that I built for 32 bit operation. Numerous tests were run on a Pi 3B and 3B+ housed in that special metal case, covering data sizes between 8 and 512 Mbytes using 1, 2 and 4 CPU cores. Bottom line achievements were successful runs on the 3B+, at all sizes, but with performance degradation due to reducing CPU MHz at a temperature of 60°C. The 3B suffered from failures due to apparently wrong sumchecks, system crashes, fatal error indications when using an older operating system, and crashes with 4 cores using 512 MB. My floating point stress tests were also run, producing numerous wrong numeric results and system crashes on the Pi 3B, but not on the Pi 3B+. These tests provide minute by minute changes in performance, CPU MHz and temperature.
Raspberry Pi Benchmarks and Stress Tests Summary 2
Raspberry Pi 4B 32 Bit Benchmarks - Following the last reports, I was recruited (aged 84) as a voluntary member of the Raspberry Pi pre-release Alpha Testing Team. This represents my first effort, which was endorsed by Eben Upton, the CEO, and praised by Gordon Hollingworth, Chief Product Officer, in this Twitter topic. The then ARM V6 and V7 Classic, Memory, Multithreading, Java and OpenGL Benchmarks were run on the Pi 4B for comparison with Pi 3B+ results. Those written in C/C++ were reproduced using the later GCC 8 compiler and run on both computers for further comparisons. Compared with a 1.07 times increase in CPU MHz, the Classics overall scores increased between 1.87 and 4.70 times. For other CPU speed dependent benchmarks, floating point improvements were often between 4 and 6 times. Numerous results and comparisons are provided, too many for a quick survey. For example, 300 comparisons are provided for GCC 8 Memory benchmarks, that cover data from caches and RAM. There, average and maximum ratios were 2.33 and 4.9 times, with 6% noticeably slower.
Some of the Multithreading benchmarks, run here, are intended to demonstrate that this form of programming can produce slow and inconsistent performance. These cover 1, 2, 4 and 8 threads, where best case examples show gains nearly proportional to the thread count, of up to 4 times. Pi 4B/3B+ performance improvements were similar to those for Memory benchmarks. Oracle Java was used to run the Whetstone benchmark and a drawing program, providing Pi 4B/3B+ average gains of 3.43 times over 14 test procedures (range 1 to 18). OpenJDK was also tried on the Pi 4, producing some much faster drawing speeds. My OpenGL benchmark demonstrated average speed gains of 1.82 times, comprising 6 tests at 4 window sizes.
Input/Output benchmarks Pi 4B/3B+ performance comparisons - 2.4 and 5 GHz WiFi speeds were similar. LANs were 1 Gbps vs 100 Mbps, with 4B large file data transfer speeds 3 to 4 times faster. USB was USB3 vs USB2, where example Pi 4B large files were around 3 times faster on writing and 4.2 times on reading but, on small files, 4B was similar on reading but 27 times slower on writing.
Further Alpha Test activity is covered in a Stress Testing report, where those for floating point and integer programs now have benchmarking options that measure performance over the full range of data sizes and test complexity, using between 1 and 32 threads, with Pi 4B integers up to 1.9 times faster and floating point 2.6 times, at 20.8 GFLOPS.
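Comparison figures of the kind quoted above (average, minimum and maximum speed ratios, plus the percentage of tests that ran slower) could be derived from paired results along the following lines. This is a sketch under assumptions: the function name is hypothetical, the arithmetic mean is assumed for the average, and "slower" is taken as any ratio below 1.0, which may differ from the criteria used in the reports.

```python
def ratio_summary(new_scores, old_scores, slower_below=1.0):
    """Summarise new/old speed ratios for paired benchmark results."""
    ratios = [n / o for n, o in zip(new_scores, old_scores)]
    slower = sum(1 for r in ratios if r < slower_below)
    return {
        "average": sum(ratios) / len(ratios),
        "minimum": min(ratios),
        "maximum": max(ratios),
        "percent_slower": 100.0 * slower / len(ratios),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    pi4 = [200.0, 300.0, 90.0, 400.0]
    pi3 = [100.0, 100.0, 100.0, 100.0]
    print(ratio_summary(pi4, pi3))
```

Scores are assumed to be "higher is better" values such as MB/second or MFLOPS; for times in seconds the ratio would need to be inverted.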
Raspberry Pi 4B Stress Tests Including High Performance Linpack -
Initially, program descriptions, example cold state results output and available run time parameters are provided. The first tests were without cooling, for five minutes, to identify weakest links. Five MP-IntStress tests were run via 1, 4 or 8 threads using caches and RAM, showing CPU MHz throttling starting at 80°C, the main offender being when using the shared L2 cache, with a performance degradation of 43%. Then videogl32, by itself, ran with constant CPU MHz and frames per second.
The remaining CPU stress tests were all run for 15 minutes, each with a number of runs involving (some of) no cooling, heatsink, case fan or Raspberry Pi PoE HAT fan, using 8 threads and 1280 KB (>L2 cache size). Full details of results are provided, with some graphs to show variations by time or CPU MHz throttling variations. CPU stressing tests with fans all effectively ran continuously at full speed with CPU temperatures less than 70°C. With no fans, MP-IntStress, MP-FPUStress and two variations of MP-FPUStressDP indicated CPU temperatures up to 86°C with 44% performance degradations and CPU MHz occasionally at half speed, 750 MHz.
High Performance Linpack was run with parameters to use four memory demands between 128 MB and 3.2 GB, each without and with fans, at 3.2 GB achieving 6.2 GFLOPS at 87°C without and 10.8 GFLOPS at 71°C with. A 10 second sampling graph indicates spikes where CPU speed reduced to 600 MHz.
CPU + OpenGL - Three copies of liverloopsPiA7R plus the most CPU dependent videogl32 test were run (1) with and (2) without a cooling fan and (3) without a fan using dual monitors (2 x pixels), each for around 16 minutes. Test 1 recorded continuous maximum CPU MHz and OpenGL FPS. Without the fan, both tests recorded temperatures of 82°C within 30 seconds, with CPU MHz, Loops MFLOPS and OpenGL FPS approaching half speed. The six OpenGL test functions were run using both a single monitor and in dual mode, without liverloopsPiA7R.
Those depending on the pixel count ran at half FPS on the dual, but the CPU speed dependent ones, slower to start with, suffered from a further reduction of between 20% and 30%. Input/Output Stress Tests - For these, three copies of burnindrive2 were run, accessing the main drive, a USB 3 stick and a remote PC via a 1 Gbps LAN, along with MP-IntStress using four threads, for 15 minutes without a fan. No errors were detected. Following 80°C being reached, after 2 minutes, CPU MHz throttling came into play. Stand alone speeds are provided, showing that LAN data transfers continued at this rate throughout. The CPU program ran at 58% of maximum MB/second, at the start, falling to 45%. Drive speeds varied but were up to 10% slower than maximum. Performance monitors showed near 100% CPU utilisation of four cores, LAN speed, as measured by the program, at around 33 MB/second with total for drives at up to 80 MB/second.
Raspberry Pi 4B 64 Bit Benchmarks and Stress Tests -
More than 1000 performance comparisons are included. At 64 bits, Pi 4/3B+ average gains were 2.62 times in the range of 0.70 to 16.8. 64 bit/32 bit ratios were 1.28 times, in the range 0.31 to 4.90, and GCC 9/6 near similar to the latter.
Stress Tests - Maximum speeds of MP-IntStress, MP-FPUStress and MP-FPUStressDP, with short running times, were around 40% faster using the 64 bit versions, now 28.7 GB/second, 26.7 GFLOPS and 13.2 GFLOPS. A series of 10 minute runs of these, without a cooling fan, produced the same order of CPU MHz throttling as at 32 bits.
High Performance Linpack - Fan cooled performance similar to that of the 32 bit version was indicated, at 10.4 DP GFLOPS. Other runs demonstrated the no fan performance variability and different, but valid, sumchecks.
15 minute tests, comprising 64 bit OpenGL and 3 x Livermore Loops programs, were run with and without fan cooling. Performance was much better than that at 32 bits, but running in an improved environment.
Input/Output Stress Tests - A wide variety of these were run, mainly to establish that all was well using a 64 bit Operating System.
Errors? - During all of these tests, other than the High Performance Linpack sumcheck issue, no data comparison failures were detected nor any system crashes, in spite of CPU speed sometimes reducing to 600 MHz. DriveSpeed64 file handling operated in a different way and required a new approach in order to avoid unrequired data caching.
Raspberry Pi 4 CPU MHz Throttling Performance Effects -
Using the latter, performance of benchmarks, with short running times, can indicate worst case CPU speed throttling effects. The bcmstat performance monitor was run to obtain CPU utilisation of each of the four cores and other details. In case you are unaware, %total indicates average over 4 cores. Video Playback - These tests were run using BBC iPlayer with data transfers via LAN. Unlike with WiFi connection, no buffering was indicated using both MHz settings but, at 600 MHz, pixel dimension quality was worse viewing complex images, then the same with plain backgrounds. The bcmstat monitor indicated that all four cores were heavily utilised, at an average of 81% each at 600 MHz. OpenGL Benchmark - Performance was the same or worse, at 600 MHz, depending whether graphics or CPU speed was the limiting factor and nearly proportional to CPU MHz. However, this was not reflected in CPU utilisation ratios, possibly due to lack of multithreading or graphics processor time. Main Drive Benchmark - Writing and reading large files, average data transfer speeds were similar at both MHz settings, as was CPU utilisation, equivalent to a little more than 100% of one CPU core. Then this was nearly all recorded as waiting for I/O. LAN Benchmark - Again transferring large files, as for the drive benchmark. Gigabit speeds were demonstrated at the higher MHz, some 25% faster than at 600 MHz. CPU utilisation differences were similar, but influenced by waiting time for I/O and serving interrupt requests. LAN Plus CPU Benchmarks - Using the same LAN benchmark plus a single threaded processor test, network speeds were the same as before but the CPU benchmark performance was proportional to MHz settings. This time, the latter increased equivalent average CPU utilisation by around 25% per core, as might be expected. Copying 1 GB Files From Pi 4 USB 3 Drive Via LAN To Windows PC - Copying speed MB/second performance degradation at 600 MHz was 40%, compared with 60% in MHz. 
CPU utilisation and data transfer speeds were lower than those for the LAN benchmark. Core Utilisation Variations - These are from bcmstat, using 1 second sampling, showing details of variations for the file copying and video playback tests.
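The percentage comparisons in this summary can be reproduced with simple arithmetic. The following is a minimal sketch, not code from the report: the function names are hypothetical, and the full speed clock is not stated here but inferred from the quoted 600 MHz setting and the 60% MHz reduction.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from 'before' to 'after'."""
    return 100.0 * (before - after) / before


def implied_full_mhz(throttled_mhz: float, mhz_reduction_percent: float) -> float:
    """Full speed clock implied by a throttled clock and a quoted percentage reduction."""
    return throttled_mhz / (1.0 - mhz_reduction_percent / 100.0)


# Figures quoted in the summary: 600 MHz throttled, 60% reduction in MHz,
# 40% reduction in file copying MB/second.
full_mhz = implied_full_mhz(600.0, 60.0)
mhz_drop = percent_reduction(full_mhz, 600.0)
copy_drop = 40.0

# A ratio below 1 means the copy speed fell less than the clock did,
# suggesting the copy was only partly limited by CPU speed.
sensitivity = copy_drop / mhz_drop
print(f"Implied full speed: {full_mhz:.0f} MHz, sensitivity to MHz: {sensitivity:.2f}")
```

A sensitivity near 1 would indicate a task that scales directly with CPU MHz, as reported for the single threaded processor test, while a value near 0 matches the drive benchmark, where speeds were similar at both settings.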
Benchmarking Raspberry Pi 4 Running From Power Over Ethernet -
Hardware required for PoE is a unit that injects power on to an ethernet cable, at up to 50 volts, and another to extract it at a remote destination, converted to 5 volts. The latter can be a Raspberry Pi PoE HAT, that includes a Pi system fan, or a separate unit. The main cables used were combinations of three (30+10+8) for 48 metres CAT 6 and (30+10+10) for 50 metres, the last one being an unlabeled thin one. Programs run involve large and very small files, using the LanSpeed benchmark and Burnindrive stress test, then CPU stress tests that were known to consume the most power. The report includes detailed results logs that can be subject to different interpretations. LAN Benchmarks - The first LanSpeed run, from the Pi 4, was via a short CAT 6 cable, to determine maximum speeds. This was followed by using the two long cables running from normal power and PoE. Subject to wide variability, the CAT 6 cable essentially demonstrated 1000 Mbps speeds but, including the thin cable, only 100 Mbps was possible. Using the latter with PoE failed to read the 2000 MB large files and, with the all CAT 6 cable and PoE, 100 Mbps performance was only possible using much smaller files. LAN Stress Tests - These were each run for 21 minutes, transmitting numerous different data patterns with random switching between files. Performance is also measured, but was slower than from benchmarks, due to the time required to compare the data with expected values. Tests using PoE and the long CAT 6 cable were completed successfully with no errors detected. Using the long cable, with the thin section, failed to run properly at lengths of 50 and 40 metres, but ran without errors at 18 metres. PoE Voltage Tests - CPU stress tests, that had been identified as those with the highest current demands, were run for 10 minutes. The system was fan cooled, then CPU MHz, voltage and temperatures were monitored. 
Five MP-FPUStress and five MP-IntStress tests were carried out, covering normal power, PoE, long thick and thin cables, HAT and external power connections. Effectively, long term voltages, temperatures and performance measurements were constant throughout. PoE CPU Stress Tests Plus USB 3 Drives - The CPU stress tests were repeated with USB 3 disk or Flash drives connected, both active and non-active. Intermittent system crashes occurred, in most cases. Results from successful runs are provided but with no indications of unacceptable behaviour. One Wire PoE WiFi Only - A series of tests were carried out using PoE, with the Ethernet cable unplugged at the Pi 4 end, with WiFi communications active. Screen shots of all are provided. 1. LanSpeed to Windows 7 by clicking and PuTTY, long CAT 6 cable.
Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks -
Classic Benchmarks - All showed 64 bit average performance gains in the range of 11% to 81%, the highest where the new vector instructions were compiled. Memory Benchmarks - 64 bit and 32 bit speeds from RAM were the same, as were around half of CPU dependent routines, with the other half averaging nearly 30% faster at 64 bits. Multithreading Benchmarks - There were twelve, including some intended to show that they were unsuitable for multithreading operation. Five measured floating point performance, where the average 64 bit gain was 39%, demonstrating a maximum of 25.9 single precision GFLOPS and 12.7 at double precision. Comparisons from most others were irrelevant. Drive and Network Benchmarks - 32 bit and 64 bit performance was generally the same. But 32 bit file sizes were limited to 2 GB minus 1, whereas 3 x 12 GB could be exercised at 64 bits. There were a number of caching issues at 64 bits. Java and OpenGL Benchmarks - 64 bit Java CPU speed, Java drawing and OpenGL benchmarks were run, with different window settings, including using dual monitors. 32 bit versions were slightly faster on some test functions. 64/32 bit ratios were between 0.7 and 1.3 with OpenGL. Measured Usable RAM Capacity - 3.43 out of 4 GB and 7.9 out of 8 GB at 64 bits, but just under 2 GB at 32 bits. High Performance Linpack Benchmark - Maximum performance was similar at 32 and 64 bits, at around 11.25 double precision GFLOPS, with 8 GB RAM, and 10.8 GFLOPS, using 4 GB. Without an active cooling fan, the latest improvements in thermal management led to significantly increased performance when working in this state. Other 8 GB Stress Tests - These included using 7.2 GB RAM with swapping, exercising a 40 GB file and demonstrating less performance degradation caused by CPU MHz throttling. Other Tests - These included using Power over Ethernet (PoE) and playing TV programmes.
Raspberry Pi 400 PC Benchmarks and Stress Tests -
Tests included all the usual Classic, Memory, Multithreading, Input/Output, OpenGL Benchmarks and Stress Tests. Full details of the programs' logged results and comparisons are provided for Pi 400 at 32 and 64 bits and a Pi 4B at 32 bits. CPU and RAM Benchmarks - The first group of 18 benchmarks measure various aspects of CPU performance, including accessing multiple CPU cores. At 32 bits, the Pi 400 generally provides the expected 20% improvement in performance, where CPU time dominates, but little difference with RAM speed limitations. Average performance was superior using 64 bit operation, but too variable to be conclusive. The compiler version used was identified as a potentially significant issue. Input/Output Benchmarks - These were runs of OpenGL on single and dual monitors, LanSpeed at 1 Gbps, WiFi at 2.4 and 5 GHz, and DriveSpeed on a range of SD cards, USB 3 flash and hard drives, some with different formats. The 32 bit drive benchmark identified associated file size limitations but no other real issues. The 64 bit version demonstrated handling larger files, but the long established method of avoiding caching produced various failures. USB Booting - This new feature was tried using SD cards, flash drives and a hard drive. These tests were completed satisfactorily, with a few complications, the major one being a restriction in a drive’s partition size. High Performance Linpack Benchmark - A number of runs were carried out to demonstrate consistent performance and temperature control using the novel heat spreader. The higher CPU MHz speed now produces a rating of 11.7 DP GFLOPS with 4 GB RAM. The 64 bit version, again, produced different numeric sumchecks that were accepted as valid by the program. Stress Test Benchmarks - MP-FPUStress+, MP-IntStress+ - Using multiple threads and data sizes, performance was generally 20% faster than the Pi 4B, using cache based data. 
Maximum 64 bit GFLOPS were SP 28.0, DP 16.0 and Integer GB/second 34.2, all clearly using advanced SIMD instructions and 4 CPU cores. Compilations for 32 bit operation were somewhat slower. That 64 bit GB/second rating was from an earlier version of the benchmark, the later one being much slower, not benefiting from the usual compilation parameters. CPU Stress Tests - Fifteen 30 minute tests were run covering floating point SP and DP and integer, 32 bit and 64 bit operation, exercising the Pi 400, along with a fan and fanless Pi 4B. Full details of measured CPU and PMIC (power chip) temperatures are provided and summaries of voltages, plus performance and CPU MHz range. One Pi 400 test was run outside at ambient temperature greater than 40°C. Overall, Pi 400 cooling and performance advantages were demonstrated. System Stress Tests - Six programs were run at the same time for 15 minutes, exercising integer and floating point hardware, all RAM space, OpenGL and drive data transfers, whilst monitoring environment and system utilisation. These were run at 32 and 64 bits on the Pi 400 and fan controlled Pi 4B. There were no excessive CPU temperatures and no data comparison errors. TV Tests - BBC iPlayer programmes were viewed, using the Pi 400, for at least seven hours each, via TV at 32 bits and a PC monitor at 64 bits, with external bluetooth speaker sound for the latter. There were a few peculiarities for consideration, but no interruptions to service.
Raspberry Pi Benchmarks and Stress Tests Summary 3
Raspberry Pi Pico, Pi 4 and Pi 400 Python and C Basic Beginners Bit Banging Benchmarks -
The Pico is a microcontroller with many advanced options, such as DMA, ADC, UART, I2C and PWM. Beginners in this area might be initially interested in exploiting general purpose input/output. This report covers measuring bit banging performance, switching LEDs on and off at various frequencies, using C and (new to me) Python programs, accessing Pico and Raspberry Pi 4B and 400 computers. Additionally, some general purpose CPU benchmarks were run. Bit Banging Tests - Details of different wiring and program code are provided for using Python and C, on a Pi computer and Pico. The full tests are carried out executing on then off cycles to 1 and 13 output pins, specifying 6 sleep time delays between 100 milliseconds and 1 microsecond. These are repeated between 100 and 10000000 times, where theoretical running time is 20 seconds for all steps. The 6 tests are repeated, without sleeping, then with sleeping and no output. These enable execution and sleep overheads to be calculated, along with maximum possible cycles per second performance. Full details of results are provided, amounting to more than 200 measurements, including Pi 400 running at 1800 and 600 MHz and different sleep timers. Also included are details of power consumption and a program that validates data transfer speeds on a monitoring Pi computer. Maximum Bit Banging Speed (no sleeping) - Measured cycles per second were much slower at 13 outputs, but converting to bits per second could produce similar performance to that from one output. Maximum Pi 400 speed, via C, was around 66 Mbps at 1800 MHz and, proportional to MHz, 22 Mbps at 600 MHz. The Pico, at 125 MHz, achieved 42 to 52 Mbps, indicating less dependency on MHz. In all cases, the C code produced maximum speeds more than 500 times faster than from Python programs. 
Sleep Timer Overheads (no output) - These lead to lower than possible speeds when changing the run time parameter for higher cycles per second. However, a desired speed can often be obtained by experimenting with lower sleep time settings. Performance also depends on availability of a timer with minimum overheads. The only one observed that produced linear increases in speed, over the range specified, was one using C on the Pico. All tests indicated the desired speed with parameters for 100 milliseconds sleep and within 10% at 1 millisecond, but it was out of reach with lower time settings. Overall Performance (output + sleeping) - Compared with sleep only speeds, the best C program indicated the same 100% timing accuracy with one output, but not quite with 13 and at 1 microsecond sleep time. Those using Python still obtained the same minimum speed but gradually became less accurate than the sleep only tests, by up to a further 27%. CPU Benchmarks - I compiled my Whetstone, Dhrystone and MemSpeed C/C++ benchmarks to run on the Pico. The benchmarks, source codes and necessary make procedures are made available to download. Performance, of course, is much slower than from recent Pi computers, particularly using floating point calculations. MemSpeed, with calculations, demonstrated 48 Mbps with floating point and at least 760 Mbps with integers, the latter much faster than needed for the bit banging benchmarks run here.
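The bit banging test schedule described above can be sketched as follows. This is an illustration only, not the report's code: it assumes that each on then off cycle contains two sleeps, that the six delays are powers of ten between 100 milliseconds and 1 microsecond, and that the names used here (TARGET_US, SLEEP_US, repeat_count, cycles_per_second) are hypothetical. Those assumptions reproduce the quoted repeat counts of 100 to 10000000 for a theoretical 20 second step.

```python
TARGET_US = 20_000_000  # theoretical running time per step: 20 seconds, in microseconds

# Assumed sleep delays in microseconds: 100 ms down to 1 microsecond in powers of ten.
SLEEP_US = [100_000, 10_000, 1_000, 100, 10, 1]


def repeat_count(sleep_us: int, target_us: int = TARGET_US) -> int:
    """Cycles needed to fill the target time, assuming one sleep after 'on' and one after 'off'."""
    return target_us // (2 * sleep_us)


def cycles_per_second(repeats: int, measured_seconds: float) -> float:
    """Achieved speed from a measured run, for comparison with the theoretical speed."""
    return repeats / measured_seconds


schedule = [(sleep_us, repeat_count(sleep_us)) for sleep_us in SLEEP_US]

for sleep_us, repeats in schedule:
    theoretical = cycles_per_second(repeats, TARGET_US / 1_000_000)
    print(f"sleep {sleep_us:>7} us  repeats {repeats:>9}  theoretical {theoretical:>9.0f} cycles/second")
```

Comparing the theoretical cycles per second with the figure from a measured running time gives the combined execution and sleep overhead for each step.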
Raspberry Pi Pico W Basic Beginners Bit Banging, CPU and WiFi Benchmarks -
Bit Banging Tests (see above) - These include precompiled versions, where performance was the same as on the original Pico, and others run via various MicroPython interpreters. For the latter, the embedded MicroPython and later releases of the original version, provided an increase of nearly four times in maximum bit switching speeds, but still much slower than from compiled programs. CPU Benchmarks (see above) - Performance of the precompiled Dhrystone and MemSpeed benchmarks was the same as before, with Whetstone slightly different and involving a variation in output format, now the same as from Pi CPUs. In addition, Python Pystone benchmark was run. This was produced by the original author of the Dhrystone program, allowing approximate benefits of compilation to be calculated and indicating that the existing C version was 175 times faster. WiFi Tests - These were carried out between Pico W, Raspberry Pi 4, with 2.4 and 5 GHz WiFi, and a PC, all on the same internal network. The first tests were via the iPerf benchmark, providing estimates of the maximum achievable bandwidth of the network. Calculations included the effects of 5 GHz or 2.4 GHz WiFi, CPU MHz impact, and network packet sizes used. The Pico performance was relatively very slow, but nearly 10M bits per second might be adequate for Pico W applications, dealing with this sort of activity. Note, that these tests were with a Pico W Client sending data to a remote Server. At this time Pico W iPerf Server operation was attempted, established connectivity but failed to transfer data. Ping Tests - The next tests were via compiled ping Windows and Linux utilities, accessing the Pico W. These can be used to confirm that a Pico W is up and running, connected to the network and provide guidance on the likely performance of dealing with small sized data transfers. At the time of testing, pinging the Pico was extremely slow, but a temporary fix was obtained to allow tests to continue. 
Data transfer speed, whilst pinging, was calculated, achieving the same sort of levels as iPerf. The last detailed ping tests were carried out using a recommended Python program available from GitHub, and started by satisfactorily pinging up to 4096 bytes, from Pico W to a Raspberry Pi and a PC. Windows and Linux network monitoring utilities were used to confirm reception. This Python utility has a parameter to control speed of operation. Using maximum speed, ping produced false reports, with the remote performance monitor recording much greater data volumes. Other monitoring indicated that the Pico W was overloaded and could not cope.
Raspberry Pi 5 Benchmarks and Stress Tests -
Benchmarks - Besides detailed results, Pi5/Pi4 performance comparisons are provided using older gcc8 compiled versions, also the latter with new varieties from gcc12, included in the new 64 bit Operating System software. Single Core CPU Tests - comprising varieties of Whetstone, Dhrystone, Linpack 100 and Livermore Loops Classic Benchmarks. Pi 5 gains were between 2.14 and 4.65 times from 182 measurements. Single Core Memory Benchmarks - measuring performance using data from caches and RAM. More than 250 Pi5/Pi4 comparisons are provided from five benchmarks, indicating a Pi 5 average gain of 3.1 times and maximum of 13.3 times. Pi 5 new compilation average gain was 2.6 times and maximum 10 times. High gains were due to improved caching and SIMD vector processing operations. MultiThreading Benchmarks - These 8 benchmarks execute the same calculations using 1, 2, 4 and 8 threads. From 150 plus comparisons Pi5/Pi4 average/maximum gains were 3.4/18.2 times, with 1.2/5.6 times for Pi 5 gcc12/gcc8 compilations. The reasons for the high gains were improved caching and SIMD as above. Miscellaneous - average Pi5/Pi4 performance gains for a series of tests were Java Whetstones 2.47 times, JavaDraw 1.98 times and OpenGL 4.0 times for 6 tests at 4 screen resolutions. Input/Output Benchmarks - These measure performance of large files, small files and random access with numerous performance measurements of Gbps LAN, WiFi, large files with 64 bit OS, main SD and USB 3 FAT and Ext disk drives and 11 main and USB boot drives. Also included are booting times, main and USB volts and amps power usage. The first test result indicated that the Pi 5 was typically 50% faster than the Pi 4 handling large files on a high speed USB 3 flash drive. Drive Stress Test - This writes four large files with data comprising numerous binary data patterns, reads them randomly for a specified time, then repetitively reads each different data block for a time. 
Eleven 15 minute tests were successfully run on the Pi 5 comprising LAN, WiFi, OS SD, 3 USB 3 flash drives and 5 disk drive partitions, plus 2 network tests from a Pi 400. Disk Drive Errors and System Crashes - (Power supply issues) - Two out of three tests using 2 disk drives caused crashes, one with both on a USB 3 hub, due to exceeding the 900 mA USB 3 port specification. The next crash was with one drive via hub, one direct USB and a CPU stress test, leading to measured main power supply exceeding the 3 amps specification. This led to reading the wrong file and data comparison failures. Two disks on different USB 3 ports ran successfully. CPU Stress Tests - Initial 3 floating point and 3 integer tests were run without fan cooling, each for 15 minutes, using 1, 2 and 4 threads, whilst recording performance, CPU MHz, volts and temperatures. All suffered from MHz throttling at temperatures up to 90°C, with measured performance deterioration less than 50%, still faster than a fan cooled Pi 4. I acquired a 4 amps power supply and repeated the test that crashed at 3 amps, this time with no failures. INTitHOT New Integer Stress - This read only test produced the hottest and fastest effects, through executing continuous SIMD AND instructions. On the Pi 5, fastest, via L1 cache sized data, obtained 240 GB/second or Terabit speed of 1.92 Tbps. Via L2 cache, maximum speed was 168 GB/second with higher power consumption and temperature. The Pi 5 was around 4.6 times faster than a Pi 4 using 1 or 2 threads, and much greater at 4 threads where the Pi 4 was unbelievably slow. System Stress Tests - These were run for 30 minutes using the 4 amps power supply and included INTitHOT, disk drive and OpenGL stress tests. Initial tests ran successfully at near maximum speed with the fan but reached a CPU temperature of 91.7°C with a 40% reduction in CPU and graphics performance without the fan. The next ones included floating point and network stress tests. 
The no fan test ran successfully with the usual high temperature and degraded performance but, with the fan, crashed with disk drive errors again. Then a low USB voltage was recorded. Other Tests and Comparisons - Tests were carried out involving Firefox, Bluetooth sound and YouTube videos. Next is Pi-5 The Vector Processor, with examples and comparing performance with 1978 to 1991 supercomputers, then Comparisons with PCs from 1991 to 2021. Results for the latter indicate that the Raspberry Pi 5 can be assumed to be 194 times faster than the Cray 1 supercomputer. New 5 Amps Power Supply and Active Cooler - Graphs of temperature increases with time are provided for initial CPU only stress tests, followed by others using the new items, now all much less than the CPU MHz throttling level. Hottest was not the floating point test but the one using integer calculations with L2 cache based data. Next was a repeat of the Heavy System Stress Tests. This ran successfully twice. It was then repeated with the 4 amps power supply and failed as before but at a much lower CPU temperature, then ran without any issues at a second attempt. The strange measured power volts and amps probably indicate a marginal condition, compared to the 5 amps measurements. Solid State Hard Drive - Following an earlier disastrous attempt, I repeated the last system stress test powered with 4 and 5 amps supplies on the Pi 5, providing similar performance. Then I ran the drive benchmarks where average large file write/reading speeds were around 360/400 MB/second, faster than the old hard drive. A surprise was that the measured USB current was the relatively high 640 mA.
Android Benchmarks and Stress Tests Summary
Most reports for these Android programs provide direct access to download and install the Apps or a folder to reside on an SD card. Installation normally requires approval, in Settings, to take onboard non-Market applications. 
Links to download Project Files, containing source codes, are also provided. These are arranged to run under Eclipse Integrated Development Environment (Some were later converted by Android Studio, but not included in the collection). The Apps usually have three buttons, Run, Info and Save or Email. Originally, default for the latter was to Email the results to me. Now, with the latest versions of Android, multiple choices are provided, like Gmail, Bluetooth, Message or Drive. The Apps require Java code to communicate with Android but, in this case, C/C++ programs are also used, mainly to produce faster performance. Note that downloaded Apps might not operate correctly using later or earlier versions of Android than those shown here, nor with alternative hardware.
2013 Android Benchmarks2 -
Classic Benchmarks - Whetstone and Linpack Java versions were run, along with those using C/C++ compiled native ARM code, the latter being more than twice as fast on the later CPUs. Results for Linpack include programs using five varieties of compiling options, with the fastest system producing between 29 and 1335 MFLOPS. Optimised and non-optimised versions of the Dhrystone benchmarks were run, the former typically being around twice as fast and maximum performance achieving VAX MIPS (AKA DMIPS) per MHz of 2.17. MFLOPS of all 24 Livermore Loops are provided, one demonstrating more than 1 GFLOPS, and many shown to be faster than the Cray 1 supercomputer, for which this benchmark was originally used. Memory Benchmarks - MemSpeed, BusSpeed and RandMem benchmarks were run, each measuring cache and RAM data transfer speeds at 10 different capacity demands between 16 KB and 65 MB. With 180 results for each system tested, much can be learned. With MemSpeed best GB/second were up to 9.4 L1 cache, 6.4 L2 cache and 1.6 RAM, compared with worst at 0.69, 0.15 and 0.15 respectively. Performance was similar on other benchmarks, except for random access, where best case from RAM was 0.1 GB/second. DriveSpeed Benchmark - This is not easy to use as the drive path normally has to be typed in and can be difficult to identify. There are sometimes caching issues, where a file is written but a reboot is needed in order to ensure that the drive is read and not data cached in RAM. See the report for details. Example results are provided for main and external SD cards and USB 2/3 drives, some of which are identified as running particularly slowly. CPU MHz Monitor - This demonstrated where MHz varied, using power and energy saving settings and on battery power. On-Line Benchmarks - Image Loading Times - These procedures do not work anymore.
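The VAX MIPS per MHz rating quoted above follows the widely used Dhrystone convention of dividing Dhrystones per second by 1757, the score attributed to the VAX 11/780. The sketch below illustrates that calculation under those terms. The function names and the sample score and clock speed are hypothetical and are not taken from the report.

```python
VAX_11_780_DHRYSTONES_PER_SECOND = 1757  # conventional 1 MIPS reference score


def vax_mips(dhrystones_per_second: float) -> float:
    """Convert a raw Dhrystone score to VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND


def vax_mips_per_mhz(dhrystones_per_second: float, cpu_mhz: float) -> float:
    """Normalise DMIPS by clock speed so CPUs at different MHz can be compared."""
    return vax_mips(dhrystones_per_second) / cpu_mhz


# Hypothetical example: a score of 3,514,000 Dhrystones per second at 1000 MHz.
example_dmips = vax_mips(3_514_000)
example_ratio = vax_mips_per_mhz(3_514_000, 1000)
print(f"{example_dmips:.0f} DMIPS, {example_ratio:.2f} DMIPS/MHz")
```

The same per MHz normalisation applies to the MFLOPS/MHz and MB/second per MHz ratios used in the later Android reports.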
2015 Android Graphics Benchmark Apps -
JavaOpenGL1 - This measures frames per second (FPS) of WireFrame, Shaded, Shaded+ and Textured displays at three different pixel densities. All sorts of complications were identified. Here, best score for the test with the heaviest loading was around 8 FPS. Then, for lightest, performance was limited by a system forced 60 FPS. JavaDraw - Five tests draw on a background of continuously changing colour shades with ever increasing drawing content, again measuring FPS. Slowest system produced performance ratings between 4 and 12 FPS, and fastest 6 to 60 FPS. Battery Test - The program runs the second most demanding JavaDraw test, with CPU MHz displayed, along with FPS and running time in minutes. Five systems were tested for between 4 and 6 hours. Some ran with little variation in MHz samples or FPS, most eventually turning off due to the lack of power. Others had higher variation in MHz or peculiar behaviour.
2016 Android MultiThreading Benchmark Apps -
Fastest Floating Point with own segment of shared data. - MP-MFLOPS benchmark 1 thread 1.2 GFLOPS, 4 threads 4.2 GFLOPS. Example Fast Data Transfer speed with own segment of shared data. - MP-BusSpeed benchmark L1 cache 1 thread 6.0 GB/sec, 4 threads 23.7 GB/sec, RAM 1 thread 2.7 GB/sec, 4 threads 9.1 GB/sec. Best Performance each thread with own data - Whetstone benchmark 1 thread 1877 MWIPS, 4 threads 7426 MWIPS. Worst MultiThreading Performance - Write/Read test MP-RandMem benchmark with no MP gain at around 3.5 GB/sec using 1, 2, 4 and 8 threads. Limited MultiThreading Performance - MP-Dhrystone benchmark with some shared data - 1584, 2749, 3836 DMIPS 1, 2, 4 threads.
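The multithreading gains behind these figures can be checked with a short calculation of speedup and per thread efficiency. This is a minimal sketch using the 1 thread and 4 thread results quoted above. The dictionary layout and function names are hypothetical.

```python
# (1 thread result, 4 thread result) as quoted in the summary above.
RESULTS = {
    "MP-MFLOPS GFLOPS": (1.2, 4.2),
    "MP-BusSpeed L1 GB/sec": (6.0, 23.7),
    "MP-BusSpeed RAM GB/sec": (2.7, 9.1),
    "Whetstone MWIPS": (1877, 7426),
    "MP-Dhrystone DMIPS": (1584, 3836),
}


def speedup(one_thread: float, many_threads: float) -> float:
    """How many times faster the multithreaded run was."""
    return many_threads / one_thread


def efficiency(one_thread: float, many_threads: float, threads: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup(one_thread, many_threads) / threads


for name, (one, four) in RESULTS.items():
    print(f"{name:<24} speedup {speedup(one, four):4.2f}  efficiency {efficiency(one, four, 4):4.0%}")
```

On these numbers the Whetstone and L1 cache tests come close to the ideal four times gain, while MP-Dhrystone, with some shared data, reaches only about 2.4 times.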
2016 Android NEON Benchmark Apps -
NEON MP-MFLOPS Benchmark - Compared with the best case performance quoted above for MP-MFLOPS, results were more than twice as fast, on the same system, at 2.9 GFLOPS using 1 thread and 11.6 GFLOPS with 4 threads. NeonSpeed Benchmark - This covers the same single precision floating point and integer calculations and memory demands as MemSpeed, providing normal and NEON speeds for comparative purposes, the latter indicating up to 2.3 times improvement on the system quoted for MemSpeed. NEON-Linpack Benchmark - A table is provided with 17 sets of results covering 5 variations of compiler and options used, running the single core programs. Here, the NEON version was typically twice as fast as the normal single precision benchmark. NEON-Linpack Benchmark-MP - The Linpack benchmark is completely unsuitable for multithreading operation, using my usual method. This one is run accessing three different sized data matrices, using normal operation then 1, 2 and 4 separate threads. Example performance, with the normal N=100 parameter, was 1498 MFLOPS, without threading and around 61 MFLOPS using 1, 2 and 4 threads.
2016 Android Benchmarks32 -
Results from both Android 4 and 5 are provided for some systems, indicating improvement in Java performance, in one case. Tests were run using Android on Intel CPUs for comparison purposes, also some running Windows and Linux versions of the benchmarks.
2016 Android Native ARM + Intel Benchmarks -
These results are included below in 2018 Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM, Intel and MIPS.
2016 Android 64 Bit Benchmarks -
Again, these results are included below in 2018 Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM, Intel and MIPS.
2018 Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM, Intel and MIPS -
Full details of results are provided from a wide range of systems from a 2013 Cortex v7-A9 to 2017. Many are under the headings of compilation versions Original ARM, ARM/Intel 32 Bit, ARM/Intel 64 Bit, Intel/Windows 32 Bit, and Intel/Windows 64 Bit, and a few also from Android Java and Windows Java. Android variations were from versions 4, 5, 6 and 7. One tablet had options to boot for Android or Windows 10. Numerous calculations are included, the main ones being mentioned below. Results, using 64 Bit Android, were only available for one device. Comparisons for these are in later publications. Classic Benchmarks - Many of the new ARM 32 bit compilations produced similar performance to the original but the odd ones were faster. Intel results include some using Atom models and one using a high end Core i7 CPU. In most cases, the latter produced far superior speeds but would not be competitive on cost/performance grounds. Memory Benchmarks - All results include samples of MB/second per MHz calculations, with separate ratios using L1 cache, L2 cache and RAM. Regarding ARM processors, raw RAM BusSpeed results were up to 6.3 times faster than the 2013 Cortex v7-A9, and 5.0 times on a MB/second per MHz basis. For the latter, L2 cache improvements were up to 3.1 times. L1 cache and other benchmark MB/second per MHz results were much lower, with wide variations. Two variations of my Fast Fourier benchmarks were included, involving single and double precision calculations at 11 different FFT sizes. MultiThreading Benchmarks - All identify performance using 1, 2, 4 and 8 threads. In some, the expected 4 thread gain is not always demonstrated. Multiple runs are required to establish that this is normal behaviour. ARM MP-Classic Benchmarks - Some Whetstone compilations produce slow performance on two critical tests that use functions such as COS and EXP, depending on the default libraries used. Best 4 thread MWIPS scores have improved to reach 7491 and maximum GFLOPS to 2.5. 
ARM MP-Memory Benchmarks - BusSpeed demonstrates that RAM throughput can benefit from multithreading, now up to 8 GB/second. RandMem continues to produce no performance gain with read/write tests, good improvement on serial reading but some disappointing results on random access. ARM MP-MFLOPS Benchmarks - Over the period and devices considered here, MFLOPS per MHz improved from 3.5 to 5.4 to produce 11.6 GFLOPS (also quoted above). This is when using NEON SIMD instructions. Note that a PC with an Intel Core processor is shown to reach 23.7 MFLOPS per MHz. ARM OpenGL and Java Drawing Benchmarks - For OpenGL, the 2013 V7-A9 obtained 7.10 FPS on the heaviest test at screen size 1280 x 720 pixels. Best was 17.6 FPS at 2048 x 1440 pixels. With Java Draw, V7-A9 reached 3.81 FPS and best shown was 6.72 at 1290 x 1032. CPU MHz Benchmark and Battery Test - Example results of the MHz benchmark and battery test are provided. Later, these were found to be no longer applicable and alternatives produced. New CPU Stress Tests - These comprise MP-FPU-Stress.apk for floating point operation and MP-Int-Stress.apk using integer calculations. They both have a benchmarking mode that provides use of between 1 and 8 (FPU) or 32 (Int) threads, with data sizes for L1 cache, L2 cache or RAM and 2, 8 and 32 arithmetic operations per word, for floating point. These variables can be set for stress testing, besides running time in minutes, and up to 32 threads can be selected in both cases. Then, results are displayed after each pass, where pass count depends on CPU speed. For FPU and Integer stress tests, 8 sets of results are provided for 15 minute tests on 8 different systems or battery use, with 8 threads, accessing L2 cache sized data. The FPU test used 32 operations per data word. In both cases, recorded performance of three tests ended with a speed reduced by more than 40%. Next, 8 thread FPU and Integer tests were run at the same time, with each running at half speed at the start. 
Finally, these tests were repeated, each mode using 32 threads. After a while, performance decreased but in an unpredictable manner. No data sumcheck errors were reported, but results were occasionally lost due to system crashes or a flat battery. See report for other unacceptable behaviour.
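The per MHz ratios quoted in this summary can be cross-checked with simple arithmetic. The following is a minimal Python sketch, not taken from the reports. The function names are illustrative, and it assumes that the raw and per MHz BusSpeed gains refer to the same pair of systems, and that the 5.4 MFLOPS per MHz and 11.6 GFLOPS figures come from the same run.

```python
def per_mhz(rate: float, mhz: float) -> float:
    """Normalise a throughput figure (MB/second or MFLOPS) by CPU clock in MHz."""
    if mhz <= 0:
        raise ValueError("mhz must be positive")
    return rate / mhz


def implied_clock_ratio(raw_gain: float, per_mhz_gain: float) -> float:
    """Clock ratio implied when a raw gain and a per MHz gain describe the same two systems."""
    return raw_gain / per_mhz_gain


def implied_mhz(rate: float, rate_per_mhz: float) -> float:
    """Clock speed implied by a throughput figure and its per MHz ratio."""
    return rate / rate_per_mhz


# RAM BusSpeed: 6.3 times raw gain, 5.0 times on a per MHz basis (from the summary).
clock_ratio = implied_clock_ratio(6.3, 5.0)

# MP-MFLOPS: 11.6 GFLOPS (11600 MFLOPS) at 5.4 MFLOPS per MHz (from the summary).
mhz = implied_mhz(11_600.0, 5.4)

print(f"Implied clock ratio: {clock_ratio:.2f}")
print(f"Implied clock speed: {mhz:.0f} MHz")
```

Under those assumptions, the newer system would be clocked about a quarter faster than the 2013 device, with the rest of the RAM gain coming from improvements per clock cycle.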
2018 Updated Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM and Intel -
ARM Classic Benchmarks - As for most other benchmarks considered here, sample results are provided for the earlier and 4A8 compilations at both 32 bits and 64 bits. For the Whetstone benchmark, calculated MFLOPS/MHz ratios were little different across the range, but those for MWIPS/MHz improved at 64 bit working, due to faster MOPS/MHz speeds using COS and EXP functions. Dhrystone 64 bit DMIPS/MHz ratings were much higher, best at 5.87 (some would suggest over optimisation, again). ARM Linpack speeds have improved at 64 bits and with the latest V8 CPU at 4A8 32 bits, probably due to the use of advanced SIMD operation, with best here at 1.38 GFLOPS or 0.59 MFLOPS/MHz. Livermore Loops speeds also improved in line with those from Linpack, with best maximum at 2.6 GFLOPS at 1.0 MFLOPS/MHz and average 0.45 MFLOPS/MHz.
ARM Memory Benchmarks - Calculated MemSpeed maximum Single and Double Precision (SP and DP) MFLOPS/MHz ratios were faster at 64 bits but not so much using 4A8 compilations. One phone, using the 4A8 version, was much slower than when running the earlier version. Using the latest technology, ARM 64 bit SIMD vector operation was demonstrated (but limited by lack of complexity), where maximum calculated speeds were 3.9 GFLOPS DP and 6.8 GFLOPS SP. These are normally the same with scalar operation. 64 bit SIMD compiled operation was also demonstrated using NeonSpeed for SP floating point and integer calculations, by producing the same speeds as those from using NEON intrinsic functions. Here, best was 21 to 25 GB/second CPU data transfer speed and 2.5 to 3 GB/second from RAM. There was nothing particularly outstanding in results from the other memory benchmarks.
ARM MultiThreading Benchmarks - There were limited 4A8 and 64 bit performance gains running MP-Whetstones, ignoring those from alternative libraries. A new PC provided a new best 4 thread speed of 11762 MWIPS. MP-Dhrystone continued to demonstrate poor MP performance.
Best MP-BusSpeed RAM result is now 14.5 GB/second. MP-RandMem provided some gains and some losses using newer versions.
ARM MP-MFLOPS Benchmarks - Both 64 bit and 4A8 compilations provided performance gains. Best at 4 threads is now 42.0 GFLOPS, clearly using SIMD, including fused multiply and add operations, with single core 5.27 SP MFLOPS/MHz or 18.27 using 4 threads. NEON-MFLOPS-MP also obtained the same level of performance, NEON intrinsic functions being converted to 64 bit vector instructions.
ARM OpenGL and Java Drawing Benchmarks - These Java programs ran successfully under Android 8, as did all other programs run under this OS for this report.
CPU Stress Tests - See previous summary. In this case, MHz measurements of each core were recorded. Results from eight different 10 minute stress tests are provided, covering six tablets or phones, with one running at constant speed and others with unpredictable reductions. A near best example I have is for a 10 minute 8 thread integer test, running on a CPU with 8 cores, with rated speeds of 4 at 2350 MHz and 4 at 1900 MHz. This ran with all cores being utilised throughout, mainly at the specified speed, with final average MHz reduced by 9%. Worst case was a floating point test, using 8 cores, where only six were active after the first 20 seconds. The CPU comprised 4 Cortex A53 cores rated at 1500 MHz and 4 Cortex A57 at 2000 MHz. Highest measured were 1330 and 1555 MHz, ending at 384 and 960 after 10 minutes. Measured MFLOPS reduced from 17955 to 6713.
Battery Test - An example of running the integer stress test on a phone with a near flat battery is provided, starting using 8 cores, running at 1517 or 1210 MHz, reducing to four cores after a minute. This was followed by four cores mainly running at the lower speed, then at 998 MHz until the end of a ten minute test. On restarting, the test continued to run at that speed until the phone died, after a short time. Measured GB/second reduced from 37.0 to 14.3.
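The start and end figures quoted for the worst case stress test and the battery test can be expressed as percentage reductions. The following is a minimal Python sketch using only numbers from this summary. The function name and labels are illustrative, and the two MHz pairs are taken in the order quoted without assuming which core type each belongs to.

```python
def percent_drop(start: float, end: float) -> float:
    """Percentage reduction from a starting value to an ending value."""
    if start <= 0:
        raise ValueError("start must be positive")
    return 100.0 * (start - end) / start


# (start, end) pairs as quoted in the summary above.
cases = {
    "Worst case FPU test, MFLOPS": (17955, 6713),
    "Worst case FPU test, MHz (first figure)": (1330, 384),
    "Worst case FPU test, MHz (second figure)": (1555, 960),
    "Battery test, GB/second": (37.0, 14.3),
}

drops = {name: round(percent_drop(start, end), 1) for name, (start, end) in cases.items()}

for name, drop in drops.items():
    print(f"{name}: {drop}% reduction")
```

Both throughput figures fall by a little over 60%, which is in line with the loss of active cores combined with lower clock speeds described above.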
2020 Android 9 Benchmarks and Stress Tests On 32 Bit and 64 Bit CPUs -
Classic Benchmarks - Comparing measured speed/MHz, one Cortex-A73 appeared to be slower than the other at 64 bits. Later, it was found that the CPU was running at 1805 MHz, and not the claimed 2000 MHz. Allowing for this, performance from Android 8 and 9 could be assumed to be the same. Whetstone results were similar between 32 bit and 64 bit versions. On the other benchmarks, the latter was up to twice as fast.
Memory Benchmarks - As these benchmarks access data covering caches and RAM, performance levels can indicate which cache is used. These are labelled in BusSpeed results and are shown to be different on CPUs labelled as Cortex A73, where maximum RAM speed was 9.1 GB/second. MemSpeed maximum GFLOPS were 3.11 DP and 8.57 SP. There were wide variations in random access performance during RandMem.
MultiThreading Benchmarks - Most devices had 8 cores, in 2 groups of four, running at different MHz. Best 8 thread MP-Whetstone score was 20501 MWIPS, 6.4 times that for 1 thread. MP-Dhrystone continued to have unacceptable MP performance. Fastest MP-BusSpeed RAM result was 14.5 GB/second, 1.8 times faster than 1 thread. An even faster result was indicated for MP-RandMem at 20.9 GB/second from RAM, but this is probably affected by the 3 MB L2 cache size.
Single Precision MP-MFLOPS Benchmarks - As reported earlier, highest 4 thread speed obtained was 42.0 GFLOPS, 3.5 times faster than 1 thread. Here, I also point out that performance using 8 threads was lower, at 36.7 GFLOPS, on that device.
OpenGL and Java Drawing Benchmarks - These ran successfully on Android 9, as did DriveSpeed, but with the usual data caching problems.
CPU Stress Tests - These were carried out on my tablets and phones, mainly with 8 cores, two of which have the same CPU running under Android 9, one at 32 bits and the other at 64 bits.
First we have examples of one minute 6 thread floating point and integer tests, with results alongside measured MHz of the 8 cores, sampled every 5 seconds, showing reducing performance. The main observation is the unpredictable variation in core MHz speeds. Next are examples of stress test benchmarks, covering Android 5, 7 and 9 at 32 bits and 64 bits, identifying identical floating point sumchecks and error free integer calculations. This is followed by a 100 second integer test, with 1 second MHz samples of the 8 cores plus reducing MB/second and associated MHz reductions.
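For readers who want to collect similar per core MHz samples on their own hardware, the following is a minimal Python sketch. It is not the method used by the stress test apps, which is not described in this summary. It assumes a Linux-based system that exposes the standard cpufreq `scaling_cur_freq` files (values in kHz) under `/sys/devices/system/cpu`, which some devices restrict or omit. The function names and parameters are illustrative.

```python
import glob
import os
import re
import time


def read_core_mhz(cpu_root: str = "/sys/devices/system/cpu") -> dict:
    """Return {core number: current MHz} from cpufreq scaling_cur_freq files."""
    readings = {}
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_cur_freq")
    for path in glob.glob(pattern):
        match = re.search(r"cpu(\d+)", os.path.relpath(path, cpu_root))
        if not match:
            continue
        try:
            with open(path) as handle:
                khz = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # Core offline, file unreadable or content not numeric.
        readings[int(match.group(1))] = khz / 1000.0
    return dict(sorted(readings.items()))


def sample_core_mhz(samples: int, interval_seconds: float,
                    cpu_root: str = "/sys/devices/system/cpu") -> list:
    """Take a number of readings, pausing interval_seconds between them."""
    results = []
    for index in range(samples):
        results.append(read_core_mhz(cpu_root))
        if index < samples - 1:
            time.sleep(interval_seconds)
    return results
```

For example, `sample_core_mhz(12, 5.0)` would take twelve readings at 5 second intervals, roughly matching the one minute tests described above. Cores that are offline at the time of a reading are simply missing from that sample.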
Finally there are full details of 16 stress tests, covering the identical CPUs running at 32 bits or 64 bits, 4 and 8 threads, floating point and integer operation, 5 minute and 15 minute durations. These identify variations in 64 bit/32 bit and 8 thread/4 thread performance ratios.
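The multithreading gains quoted in this summary can be turned into scaling efficiencies and implied single thread speeds. The following is a minimal Python sketch using only figures from the summary. The function names are illustrative, and the single thread values are derived from the rounded ratios, so they are approximate rather than measured.

```python
def parallel_efficiency(speedup: float, threads: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return speedup / threads


def single_thread_estimate(multi_thread_result: float, speedup: float) -> float:
    """Single thread speed implied by a multi thread result and its quoted gain."""
    return multi_thread_result / speedup


# MP-Whetstone: 20501 MWIPS at 8 threads, 6.4 times the 1 thread score.
whetstone_efficiency = parallel_efficiency(6.4, 8)
whetstone_one_thread = single_thread_estimate(20501, 6.4)

# MP-MFLOPS: 42.0 GFLOPS at 4 threads, 3.5 times 1 thread, with 36.7 GFLOPS at 8 threads.
mflops_efficiency = parallel_efficiency(3.5, 4)
mflops_one_thread = single_thread_estimate(42.0, 3.5)
eight_vs_four = 36.7 / 42.0

# Classic Benchmarks: claimed 2000 MHz against a measured 1805 MHz.
clock_correction = 2000 / 1805

print(f"MP-Whetstone efficiency at 8 threads: {whetstone_efficiency:.3f}")
print(f"MP-MFLOPS efficiency at 4 threads: {mflops_efficiency:.3f}")
print(f"Implied 1 thread speeds: {whetstone_one_thread:.0f} MWIPS, {mflops_one_thread:.1f} GFLOPS")
print(f"8 thread / 4 thread MFLOPS ratio: {eight_vs_four:.3f}")
print(f"Per MHz correction factor: {clock_correction:.3f}")
```

The correction factor shows why the slower Cortex-A73 looked worse on a per MHz basis: dividing by the claimed 2000 MHz rather than the measured 1805 MHz understates speed per MHz by about 11%.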
2021 Android 10 and 11 Benchmarks and ARM big.LITTLE Architecture Issues -
64 Bit Classic Single Core Benchmarks - With MHz of the fastest cores being the same, Kryo performance gains greater than 1.0 indicate improved internal architecture (as claimed for the A76). There are 35 performance measurements in this group, mainly floating point, where ratio gains over the earlier phone were average 1.83, minimum 1.26 and maximum 2.38.
64 Bit Memory Benchmarks - There are 220 MB/second scores from the four main memory benchmarks. Average Kryo gains were 2.25 from L1 cache, 1.98 from L2 cache, 2.12 from L3 cache vs RAM and 1.51 from RAM vs RAM.
64 Bit MultiThreading Benchmarks - These compare the Kryo 2+6 core CPU with the older 4+4 CPU.
MP-Classic Benchmarks - As indicated before, two of these were produced to demonstrate unsuitability for multithreading operation. They are MP-Dhrystone and MP-Linpack. The latter is no longer run, as execution time can be greater than 5 minutes. MP-Whetstone reflects the mismatch in big.LITTLE CPU MHz operation, where the Kryo MWIPS gains for 1, 2, 4 and 8 threads were 1.47, 1.61, 1.28 and 1.04. Sample MHz measurements of the 8 cores are provided. MP-BusSpeed results indicate that the MHz mismatch can lead to the older CPU being much faster using 4 and 8 threads. MP-RandMem also has similar lower speeds, but Kryo random write/read access can be more than four times faster because of the large L3 cache.
MP-MFLOPS Benchmarks - The Kryo achieved up to 12178 MFLOPS using one thread, or around 6 MFLOPS/MHz, clearly demonstrating SIMD operation with fused multiply and add instructions, followed by 23674 using 2 threads. Then there were the disappointing mismatch results of 26173 and 35686 at 4 and 8 threads. Performance gains over the other CPU were between 1.01 and 3.45 times. NEON-MFLOPS-MP results were similar.
Java Benchmarks - These comprise OpenGL, Draw, Whetstone and Linpack, which all ran successfully under Android 11 and faster with the Kryo CPU.
CPU Stress Test Benchmarks - Examination of the detail can identify unexpected performance, like faster speeds using 16 threads on an 8 core CPU, as this leads to execution in a lower level cache. The floating point stress test is essentially the same as MP-MFLOPS, but with more run time options. The example integer stress test used up to 32 threads, where the fastest speeds were demonstrated, in this case 49046 MB/second using Kryo, not much faster than the older CPU, due to the CPU MHz mismatch.
CPU Stress Tests - Results from many 15 minute 8 thread tests are provided. The first are 30 second samples of CPU MHz on all 8 cores and measured performance of both CPUs being considered. The Integer Test MHz sampling indicated that the Kryo had 2 cores running at 2035 MHz and 6 at 1805 MHz, producing between 56 and 57 GB/second over 30 seconds. The older CPU had variable MHz readings with average reducing from 1989 to 1404 and performance from 52 to 40 GB/second. The Kryo Floating Point Test indicated constant average samples of 1862.5 MHz and 37 to 38 GFLOPS. The other CPU came with average MHz reducing from 1989 to 1504 and GFLOPS from 31 to 25. The final results are for a series of 15 minute tests using 2, 4 and 32 threads without MHz recording. Average Kryo/older Integer Test performance ratios varied from 0.95 to 1.65 at the start and 1.12 to 1.63 at the end, with Floating Point Tests starting between 1.11 and 2.50 and ending between 1.17 and 2.7. The lowest ratios are on using 4 threads.
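The Kryo figures in this summary are consistent with each other, as a short calculation shows. The following is a minimal Python sketch using only numbers quoted above. The function name is illustrative, and the MFLOPS/MHz calculation assumes that the single thread MP-MFLOPS run executed on a core clocked at 2035 MHz, which the summary does not state explicitly.

```python
def mean_mhz(clusters: list) -> float:
    """Average clock over all cores, given (core count, MHz) pairs per cluster."""
    total_cores = sum(cores for cores, _ in clusters)
    if total_cores == 0:
        raise ValueError("at least one core is required")
    return sum(cores * mhz for cores, mhz in clusters) / total_cores


# Integer stress test sampling: 2 cores at 2035 MHz and 6 at 1805 MHz.
kryo_average_mhz = mean_mhz([(2, 2035), (6, 1805)])

# MP-MFLOPS results by thread count, as quoted in the summary.
mflops_by_threads = {1: 12178, 2: 23674, 4: 26173, 8: 35686}
scaling = {threads: round(mflops / mflops_by_threads[1], 2)
           for threads, mflops in mflops_by_threads.items()}

# Assumption: the 1 thread run used a 2035 MHz core.
mflops_per_mhz = mflops_by_threads[1] / 2035

print(f"Average MHz over 8 cores: {kryo_average_mhz}")
print(f"MFLOPS per MHz, 1 thread: {mflops_per_mhz:.2f}")
for threads, gain in scaling.items():
    print(f"{threads} threads: {gain} times the 1 thread speed")
```

The weighted average matches the constant 1862.5 MHz reported for the floating point test, and the gain falls well short of the thread count beyond 2 threads, which is the 2+6 core mismatch referred to above.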