Vectorized statistics using SIMD primitives

Introduction

is a C++17 library of computationally efficient methods for calculating sample statistics (mean, variance, covariance, correlation).

the implementation builds upon the SIMD abstraction layer provided by the EVE [1]
it uses a data-parallel Youngs and Cramer [2] algorithm for numerically stable computations of sums and sums-of-squares.
the results from independent data partitions are combined with the approach by Schubert and Gertz [3].
the methods are validated for correctness against statistical methods from the GNU Scientific Library [4].

Usage

To use this library you simply need to copy the contents of the include folder inside your project, and then #include <vstat.hpp>. Defining VSTAT_NAMESPACE before inclusion will allow you to set a custom namespace for the library.

Two convenience methods are provided for batch data:

univariate::accumulate for univariate statistics (mean, variance, standard deviation)
bivariate::accumulate for bivariate statistics (covariance, correlation)

The methods return a statistics object which contains all the stat values. For example:

std::vector<float> values{ 1.0, 2.0, 3.0, 4.0 };
std::vector<float> weights{ 2.0, 4.0, 6.0, 8.0 };
 
// unweighted data
auto stats = univariate::accumulate<float>(values.begin(), values.end());
std::cout << "stats:\n" << stats << "\n";
 
count:                  4
sum:                    10
ssr:                    5
mean:                   2.5
variance:               1.25
sample variance:        1.66667
 
// weighted data
auto stats = univariate::accumulate<float>(values.begin(), values.end(), weights.begin());
std::cout << "stats:\n" << stats << "\n";
 
count:                  20
sum:                    60
ssr:                    20
mean:                   3
variance:               1
sample variance:        1.05263

Besides iterators, it is also possible to provide raw pointers:

float x[] = { 1., 1., 2., 6. };
float y[] = { 2., 4., 3., 1. };
size_t n = std::size(x);
 
auto stats = bivariate::accumulate<float>(x, y, n);
std::cout << "stats:\n" << stats << "\n";
 
// results
count:                  4
sum_x:                  10
ssr_x:                  17
mean_x:                 2.5
variance_x:             4.25
sample variance_x:      5.66667
sum_y:                  10
ssr_y:                  5
mean_y:                 2.5
variance_y:             1.25
sample variance_y:      1.66667
correlation:            -0.759257
covariance:             -1.75
sample covariance:      -2.33333

It is also possible to use projections to aggregate stats over object properties:

struct Foo {
    float value;
};
 
Foo foos[] = { {1}, {3}, {5}, {2}, {8} };
auto stats = univariate::accumulate<float>(foos, std::size(foos), [](auto const& foo) { return foo.value; });
std::cout << "stats:\n" << stats << "\n";
 
// results
count:                  5
sum:                    19
ssr:                    30.8
mean:                   3.8
variance:               6.16
sample variance:        7.7
 
struct Foo {
    float value;
};
 
struct Bar {
    int value;
};
 
Foo foos[] = { {1}, {3}, {5}, {2}, {8} };
Bar bars[] = { {3}, {2}, {1}, {4}, {11} };
 
auto stats = bivariate::accumulate<float>(foos, bars, std::size(foos), [](auto const& foo) { return foo.value; },
                                                                       [](auto const& bar) { return bar.value; });
std::cout << "stats:\n" << stats << "\n";
 
// results
count:                  5
sum_x:                  19
ssr_x:                  30.8
mean_x:                 3.8
variance_x:             6.16
sample variance_x:      7.7
sum_y:                  21
ssr_y:                  62.8
mean_y:                 4.2
variance_y:             12.56
sample variance_y:      15.7
correlation:            0.686676
covariance:             6.04
sample covariance:      7.55

The methods above accept a batch of data and calculate relevant statistics. If the data is streaming, then one can also use accumulators. The accumulator is a lower-level object that is able to perform calculations online as new data arrives:

univariate_accumulator<float> acc;
acc(1.0);
acc(2.0);
acc(3.0);
acc(4.0);
auto stats = univariate_statistics(acc);
std::cout << "stats:\n" << stats << "\n";
 
Count:                  4
Sum:                    10
Sum of squares:         5
Mean:                   2.5
Variance:               1.25
Sample variance:        1.66667

The template parameter tells the accumulator how to represent data internally.

if a scalar type is provided (e.g. float or double), the accumulator will perform all operations with scalars (i.e., no SIMD).
if a SIMD-type is provided (e.g., eve::wide) then the accumulator will perform data-parallel operations

This allows the user to combine accumulators, for example using a SIMD-enabled accumulator to process the bulk of the data and a scalar accumulator for the left-over points.

Available statistics

univariate
struct univariate_statistics {

double count;

double sum;

double ssr;

double mean;

double variance;

double sample_variance;

};
bivariate
struct bivariate_statistics {

double count;

double sum_x;

double sum_y;

double ssr_x;

double ssr_y;

double sum_xy;

double mean_x;

double mean_y;

double variance_x;

double variance_y;

double sample_variance_x;

double sample_variance_y;

double correlation;

double covariance;

double sample_covariance;

};

Benchmarks

The following libraries have been used for performance comparison in the univariate (variance) and bivariate (covariance) case:

Methodology

we generate 1M values uniformly distributed between [-1, 1] and save them into a double and a float array
increase the data size in 100k increments and benchmark the performance for each method using nanobench

Notes

we did not use MKL as a backend for numpy and gsl (expect MKL performance to be higher)
linasm methods for variance and covariance require precomputed array means, so means computation is factored into the benchmarks
hardware: Ryzen 9 5950X

Acknowledgements

[1] Expressive Vector Engine

[2] Youngs and Cramer - Some Results Relevant to Choice of Sum and Sum-of-Product Algorithms

[3] Schubert and Gertz - Numerically stable parallel computation of (co-)variance

[4] GNU Scientific Library