# Experimentally measuring FLOPs and energy consumption [Joule] on standard hardware

## Option 1: Libpfm4 guide

Following http://www.bnikolic.co.uk/blog/hpc-howto-measure-flops.html

Execute the following in this directory:

    git clone git://perfmon2.git.sourceforge.net/gitroot/perfmon2/libpfm4
    cd libpfm4
    make
    cd examples
    ./showevtinfo -h  # lists options
    ./showevtinfo -L  # shows measures
    ./showevtinfo FP_ARITH | grep Code   # show relevant codes

The guide then suggests something like

    ./showevtinfo FP_ARITH | grep Code                           
    Code     : 0xc7
    ...
    
And then calling something like

    perf stat -e rc7 ./a.out
    
on the binary to be tested. That did not work for me at all.
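One possible reason, which is an assumption and was not verified on this machine, is that `perf`'s raw event syntax on Intel CPUs expects the unit mask in front of the event code (`r<umask><event>`), so the bare code `0xc7` selects no sub-event. The sketch below builds such a string. The helper name `perf_raw_event` is hypothetical, and the unit mask `0x01` for scalar double precision is taken from Intel's event tables rather than from the `showevtinfo` output above.

```python
def perf_raw_event(event_code: int, umask: int) -> str:
    """Build a perf raw event string of the form r<umask><event> (hex)."""
    return f"r{umask:02x}{event_code:02x}"


# 0xc7 is the FP_ARITH code reported by showevtinfo.
# umask 0x01 (scalar double) is an assumption, see the text above.
raw = perf_raw_event(0xC7, 0x01)
print(f"perf stat -e {raw} ./a.out")
```

Check the `Umask` lines that `./showevtinfo FP_ARITH` prints on your machine before relying on the value used here.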

   
## Option 2: Using perf only

I had more success with

    perf list | grep fp_arith_
    
And then calling

    $ perf stat -e fp_arith_inst_retired.scalar_double ./a.out
    Number of total runs: 10000
    Number of total steps: 19990000
    Total duration: 1006540209 nanoseconds

    Time per run to t=1: 100654 nanoseconds
    Time per step: 50 nanoseconds

    Performance counter stats for './a.out':

        1.600.000.000      fp_arith_inst_retired.scalar_double:u                                   

        1,008534898 seconds time elapsed

        1,007365000 seconds user
        0,000000000 seconds sys


Yay! This tells me my little benchmark runs roughly 80 FLOPs per step
(1.6e9 FLOPs / 19990000 steps). A step takes 50ns, so the computer performs
about 1600 MFLOP/sec overall, that is, 1.6 GFLOP/sec or on the order of
1 GFLOP/sec.
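The arithmetic behind these numbers can be reproduced with a few lines of Python. This is only a sketch that copies the values from the `perf` output above. Like the text, it assumes that every counted `scalar_double` instruction is one FLOP.

```python
# Numbers copied from the perf output above.
flop_count = 1_600_000_000     # fp_arith_inst_retired.scalar_double
total_steps = 19_990_000       # "Number of total steps"
duration_ns = 1_006_540_209    # "Total duration"

flops_per_step = flop_count / total_steps
ns_per_step = duration_ns / total_steps
# FLOP per nanosecond is numerically the same as GFLOP per second.
gflop_per_s = flop_count / duration_ns

print(f"{flops_per_step:.1f} FLOPs per step")
print(f"{ns_per_step:.1f} ns per step")
print(f"{gflop_per_s:.2f} GFLOP/sec")
```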

## Option 3: Using likwid

Likwid is a performance measurement tool developed for HPC to benchmark
hardware usage and energy consumption. See the instructions for obtaining
the code at

* https://github.com/RRZE-HPC/likwid
* https://github.com/RRZE-HPC/likwid/wiki
* https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr#basic-usage-wrapper-mode

For GCC 10, I had to clone the repository, because the latest release (5.0.1)
did not work. Running `sudo make install` is also mandatory.

We can then benchmark the code simply by running

    likwid-perfctr -C S0:1 -g ENERGY ./a.out
    likwid-perfctr -C S0:1 -g FLOPS_DP ./a.out

On my Core i7-8565 Intel Kabylake laptop, this gives me an overall power
consumption of

    +----------------------+-----------+
    |        Metric        |   Core 1  |
    +----------------------+-----------+
    |  Runtime (RDTSC) [s] |    1.9633 |
    | Runtime unhalted [s] |    3.1611 |
    |      Clock [MHz]     | 3213.6181 |
    |          CPI         |    0.7390 |
    |    Temperature [C]   |        67 |
    |      Energy [J]      |   21.6309 |
    |       Power [W]      |   11.0179 |
    +----------------------+-----------+

For the FLOPs, I get


    +----------------------+-----------+
    |        Metric        |   Core 1  |
    +----------------------+-----------+
    |  Runtime (RDTSC) [s] |    1.9491 |
    | Runtime unhalted [s] |    3.1616 |
    |      Clock [MHz]     | 3237.2705 |
    |          CPI         |    0.7391 |
    |     DP [MFLOP/s]     |  820.8928 |
    |   AVX DP [MFLOP/s]   |         0 |
    |   Packed [MUOPS/s]   |         0 |
    |   Scalar [MUOPS/s]   |  820.8928 |
    |  Vectorization ratio |         0 |
    +----------------------+-----------+
    
The overall performance is close to 1 GFLOP/sec. That is a factor of 2 smaller
than the `perf` output, but of the same order of magnitude.

# Analysis of output

The following text is superseded by the flops.tex LaTeX document:

## Energy efficiency compared to analog computers

With 1 GFLOP/sec and 11 W power consumption, the single-core non-vectorized
performance of my laptop is 1 GFLOP/sec / (11 J/sec) =~ 100 MFLOP/J.
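As a cross-check, the same estimate can be computed from the unrounded likwid numbers. The sketch below copies the values from the two tables above. Note that the ENERGY and FLOPS_DP groups came from two separate runs, so combining them is itself an approximation.

```python
# Values copied from the likwid-perfctr tables above.
energy_j = 21.6309           # ENERGY group: Energy [J]
runtime_s = 1.9633           # ENERGY group: Runtime (RDTSC) [s]
power_w = 11.0179            # ENERGY group: Power [W]
dp_mflop_per_s = 820.8928    # FLOPS_DP group: DP [MFLOP/s]

# Consistency check: power should be energy divided by runtime.
power_check_w = energy_j / runtime_s

# Efficiency from the measured rate, and from the rounded 1 GFLOP/sec, 11 W.
measured_mflop_per_j = dp_mflop_per_s / power_w
rounded_mflop_per_j = 1000.0 / 11.0

print(f"power check: {power_check_w:.2f} W")
print(f"measured:    {measured_mflop_per_j:.1f} MFLOP/J")
print(f"rounded:     {rounded_mflop_per_j:.1f} MFLOP/J")
```

Both values round to the order of 100 MFLOP/J used in the text.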

## Comparison with analog computer

The digital computer runs to t=1 in 30us (Euler) up to 60us (RK4).
The analog computer approaches t=1 in 1ms (k_0=1000) up to 1s (k_0=1).
That means the digital computer is roughly a factor of 20 or more faster than
the analog computer (1ms / 60us =~ 17 for RK4, 1ms / 30us =~ 33 for Euler).

The digital computer approaches roughly 1 GFLOP/sec.

We ignore the reduced exactness of the analog computer (which is about a
quarter of the resolution of a double floating point number) for the time
being. Since the analog computer is about a factor of 20 slower on the same
problem, we could then assign it 1 GFLOP/sec / 20 = 50 MFLOP/sec.
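The speed comparison and the resulting equivalent FLOP rate can be written out as a small sketch. All times are copied from the text above. The factor of 20 is the rounded value used in the text, not a measured quantity.

```python
# Times to reach t=1, copied from the text above (in seconds).
digital_s = {"Euler": 30e-6, "RK4": 60e-6}
analog_s = {"k_0=1000": 1e-3, "k_0=1": 1.0}

# Speed-up of the digital over the analog computer for every pairing.
speedups = {
    (d, a): analog_s[a] / digital_s[d]
    for d in digital_s
    for a in analog_s
}
for (d, a), factor in speedups.items():
    print(f"{d} vs {a}: digital is {factor:,.0f}x faster")

# Equivalent analog rate, using the rounded numbers from the text.
digital_flop_per_s = 1e9      # roughly 1 GFLOP/sec
assumed_speed_factor = 20     # rounded factor used in the text
analog_flop_per_s = digital_flop_per_s / assumed_speed_factor
print(f"{analog_flop_per_s / 1e6:.0f} MFLOP/sec")
```

The smallest speed-up (RK4 against k_0=1000) is about 17, which the text rounds to 20.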

The energy consumption of the Model-1 computer is on the order of 10-100mW.
This results in an energy efficiency of

    50 MFLOP/sec / (50mJ/sec) =~ 1 GFLOP/J
  
The overall result has a large uncertainty (expected to be on the order of
at least 500% of the final result), since the analog
result is less accurate and the power estimate is very rough.
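To get a feeling for how much the power estimate alone moves the result, the quotient can be evaluated in SI units over the whole 10-100mW range. This is a sketch under the assumptions stated above (50 MFLOP/sec assigned to the analog computer). The labels `low`, `mid` and `high` are only illustrative.

```python
analog_flop_per_s = 50e6    # 50 MFLOP/sec, as assigned above

# Power range of the Model-1 in watts: 10 mW, 50 mW, 100 mW.
power_w = {"low": 10e-3, "mid": 50e-3, "high": 100e-3}

# FLOP/sec divided by J/sec (= W) gives FLOP per joule.
flop_per_joule = {label: analog_flop_per_s / p for label, p in power_w.items()}

for label, value in flop_per_joule.items():
    mw = power_w[label] * 1e3
    print(f"{label}: {mw:.0f} mW -> {value / 1e9:.1f} GFLOP/J")
```

With 50mW the quotient is about 1e9 FLOP/J, and the power range alone spans a factor of 10 around that value.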

