Vectorized statistics using SIMD primitives
Introduction
is a C++17 library of computationally efficient methods for calculating sample statistics (mean, variance, covariance, correlation).
- the implementation builds upon the SIMD abstraction layer provided by the EVE [1]
- it uses a data-parallel Youngs and Cramer [2] algorithm for numerically stable computations of sums and sums-of-squares.
- the results from independent data partitions are combined with the approach by Schubert and Gertz [3].
- the methods are validated for correctness against statistical methods from the GNU Scientific Library [4].
Usage
To use this library you simply need to copy the contents of the include
folder inside your project, and then #include <vstat.hpp>
. Defining VSTAT_NAMESPACE
before inclusion will allow you to set a custom namespace for the library.
Two convenience methods are provided for batch data:
univariate::accumulate
for univariate statistics (mean, variance, standard deviation)
bivariate::accumulate
for bivariate statistics (covariance, correlation)
The methods return a statistics
object which contains all the stat values. For example:
std::vector<float> values{ 1.0, 2.0, 3.0, 4.0 };
std::vector<float> weights{ 2.0, 4.0, 6.0, 8.0 };
auto stats = univariate::accumulate<float>(values.begin(), values.end());
std::cout << "stats:\n" << stats << "\n";
count: 4
sum: 10
ssr: 5
mean: 2.5
variance: 1.25
sample variance: 1.66667
auto stats = univariate::accumulate<float>(values.begin(), values.end(), weights.begin());
std::cout << "stats:\n" << stats << "\n";
count: 20
sum: 60
ssr: 20
mean: 3
variance: 1
sample variance: 1.05263
Besides iterators, it is also possible to provide raw pointers:
float x[] = { 1., 1., 2., 6. };
float y[] = { 2., 4., 3., 1. };
size_t n = std::size(x);
auto stats = bivariate::accumulate<float>(x, y, n);
std::cout << "stats:\n" << stats << "\n";
count: 4
sum_x: 10
ssr_x: 17
mean_x: 2.5
variance_x: 4.25
sample variance_x: 5.66667
sum_y: 10
ssr_y: 5
mean_y: 2.5
variance_y: 1.25
sample variance_y: 1.66667
correlation: -0.759257
covariance: -1.75
sample covariance: -2.33333
It is also possible to use projections to aggregate stats over object properties:
struct Foo {
float value;
};
Foo foos[] = { {1}, {3}, {5}, {2}, {8} };
auto stats = univariate::accumulate<float>(foos, std::size(foos), [](auto const& foo) { return foo.value; });
std::cout << "stats:\n" << stats << "\n";
count: 5
sum: 19
ssr: 30.8
mean: 3.8
variance: 6.16
sample variance: 7.7
struct Foo {
float value;
};
struct Bar {
int value;
};
Foo foos[] = { {1}, {3}, {5}, {2}, {8} };
Bar bars[] = { {3}, {2}, {1}, {4}, {11} };
auto stats = bivariate::accumulate<float>(foos, bars, std::size(foos), [](auto const& foo) { return foo.value; },
[](auto const& bar) { return bar.value; });
std::cout << "stats:\n" << stats << "\n";
count: 5
sum_x: 19
ssr_x: 30.8
mean_x: 3.8
variance_x: 6.16
sample variance_x: 7.7
sum_y: 21
ssr_y: 62.8
mean_y: 4.2
variance_y: 12.56
sample variance_y: 15.7
correlation: 0.686676
covariance: 6.04
sample covariance: 7.55
The methods above accept a batch of data and calculate relevant statistics. If the data is streaming, then one can also use accumulators. The accumulator is a lower-level object that is able to perform calculations online as new data arrives:
univariate_accumulator<float> acc;
acc(1.0);
acc(2.0);
acc(3.0);
acc(4.0);
auto stats = univariate_statistics(acc);
std::cout << "stats:\n" << stats << "\n";
Count: 4
Sum: 10
Sum of squares: 5
Mean: 2.5
Variance: 1.25
Sample variance: 1.66667
The template parameter tells the accumulator how to represent data internally.
- if a scalar type is provided (e.g.
float
or double
), the accumulator will perform all operations with scalars (i.e., no SIMD).
- if a SIMD-type is provided (e.g.,
eve::wide
) then the accumulator will perform data-parallel operations
This allows the user to combine accumulators, for example using a SIMD-enabled accumulator to process the bulk of the data and a scalar accumulator for the left-over points.
Available statistics
- univariate ```cpp struct univariate_statistics { double count; double sum; double ssr; double mean; double variance; double sample_variance; }; ```
- bivariate ```cpp struct bivariate_statistics { double count; double sum_x; double sum_y; double ssr_x; double ssr_y; double sum_xy; double mean_x; double mean_y; double variance_x; double variance_y; double sample_variance_x; double sample_variance_y; double correlation; double covariance; double sample_covariance; }; ```
Benchmarks
The following libraries have been used for performance comparison in the univariate (variance) and bivariate (covariance) case:
Methodology
- we generate 1M values uniformly distributed between [-1, 1] and save them into a
double
and a float
array
- increase the data size in 100k increments and benchmark the performance for each method using nanobench
Notes
- we did not use MKL as a backend for numpy and gsl (expect MKL performance to be higher)
- linasm methods for variance and covariance require precomputed array means, so means computation is factored into the benchmarks
- hardware: Ryzen 9 5950X
Acknowledgements
[1] Expressive Vector Engine
[2] Youngs and Cramer - Some Results Relevant to Choice of Sum and Sum-of-Product Algorithms
[3] Schubert and Gertz - Numerically stable parallel computation of (co-)variance
[4] GNU Scientific Library