Simple distributed standard deviation computation – demo entry
How to compute population variance for a distributed dataset by aggregating only local counts, sums, and sums of squares.
(Re)setting blog and checking that code and math render OK, using a simple problem I recently worked a bit on. FWIW, goal was simply to instead of computing the global mean and sending that to nodes, just doing it in one go.
Math
We have a distributed dataset made of multiple local datasets:
The global mean is:
The population variance is:
But we do not have all of in one single place. However:
Proof:
Now compute the two pieces separately over the distributed dataset.
So each partial node only needs to send three numbers:
That is:
where:
Then the central aggregator computes:
Finally:
Code
/// Computes local partial statistics for standard deviation
fn std_partial_stats(values: &[f64]) -> (f64, f64, usize) {
// Sum of squares
let mut s = 0.0;
// Total sum
let mut t = 0.0;
// Count
let mut n = 0;
for &x in values {
s += x * x;
t += x;
n += 1;
}
(s, t, n)
}
/// Aggregates local partial statistics into a standard deviation
fn aggregate_std(partials: &[(f64, f64, usize)]) -> f64 {
// Global sum of squares
let mut global_s = 0.0;
// Global sum
let mut global_t = 0.0;
// Global count
let mut global_n = 0;
for &(s, t, n) in partials {
global_s += s;
global_t += t;
global_n += n;
}
let m_2 = global_s / global_n as f64;
let m_1 = global_t / global_n as f64;
// sqrt(E[D^2] - E[D]^2)
(m_2 - m_1 * m_1).sqrt()
}