Variance of Superset Using Only Mean and Variance of Subsets
I’ve been working on a tool to perform an ANOVA test in a distributed system: each node has a rollup of some internal statistics, but it needs to compare its values with the other nodes’.
Given that I already had a rollup of the statistics, I really don’t want to re-expand them for obvious reasons:
- The rollups are typically over 10-minute periods and we record metrics every 10 seconds, so re-expanding means we’d be sending something ~60x larger than necessary.
- Re-expanding the metrics means we’d have to execute a range lookup over the timeseries data, as opposed to a point lookup. Not catastrophic, but more work than it needs to be.
- The ANOVA test will have at least 3 groups, which means we’d incur all of these penalties at least 2x.
However, I was slightly wedged on this problem because the ANOVA calculation requires the variance of the entire set, i.e. the superset. I tried wiggling around with this algebraically for a while and gave up. Fortunately, searching Stack Exchange’s statistics site proved fruitful: I found the following equation:
variance(superset) = ((k - 1) / (g*k - 1)) * (sum(subset variances) + variance(subset means) * (k*(g-1)/(k-1)))
Here:
- We assume each group has the same number of measurements
- `k`: the number of measurements in each group
- `g`: the number of groups
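
To make the formula concrete, here’s a minimal sketch in Python of how a node could combine the per-group rollups. The `superset_variance` helper and the example data are my own for illustration; it assumes each rollup exposes the group’s mean and sample (n−1) variance, and it checks the result against simply re-expanding the measurements:

```python
from statistics import mean, variance  # sample variance (n - 1 denominator)

def superset_variance(group_means, group_variances, k):
    """Combine equal-sized group rollups into the superset's sample variance.

    group_means:     per-group means (one rollup per node)
    group_variances: per-group sample variances
    k:               number of measurements in each group
    """
    g = len(group_means)
    # Sample variance of the group means (g - 1 denominator).
    var_of_means = variance(group_means)
    return ((k - 1) / (g * k - 1)) * (
        sum(group_variances) + var_of_means * (k * (g - 1) / (k - 1))
    )

# Sanity check against expanding the raw measurements (made-up example data):
groups = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
means = [mean(grp) for grp in groups]
variances = [variance(grp) for grp in groups]
flat = [x for grp in groups for x in grp]
assert abs(superset_variance(means, variances, k=3) - variance(flat)) < 1e-9
```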
Using this formula to determine the superset’s variance, the nodes only need to share their rollups with one another to perform the entire ANOVA test (as well as post-hoc analysis like Tukey’s HSD).
Adapted from this Stack Exchange post