Variance of Superset Using Only Mean and Variance of Subsets

I’ve been working on a tool to perform an ANOVA test in a distributed system. The idea is that each node holds a rollup of some internal statistics, but needs to compare its values with those of other nodes.

Given that I already have a rollup of the statistics, I really don’t want to re-expand them, for obvious reasons.

I was slightly wedged on this problem, though, because the ANOVA calculation requires the variance of the entire set, i.e. the superset. I tried wiggling around with this algebraically for a while and gave up. Searching Stack Exchange’s statistics site proved more fruitful, and I found the following equation:

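variance(superset) = ((k - 1) / (g*k - 1)) * (sum(subset variances) + variance(subset means) * (k*(g - 1) / (k - 1)))

Here g is the number of subsets and k is the number of observations in each subset. The formula assumes every subset has the same size k, and that all of the variances are sample variances (divided by n - 1).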

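To make the equation concrete, here is a minimal Python sketch of it, checked against the variance computed directly from the raw observations. The function name `superset_variance`, its parameters, and the toy data are my own for illustration and are not from the Stack Exchange post. It assumes equal-sized subsets and sample variances.

```python
import statistics


def superset_variance(subset_means, subset_variances, k):
    """Sample variance of the pooled data from g equal-sized subsets.

    subset_means and subset_variances hold one entry per subset, where the
    variances are sample variances (n - 1 denominator). k is the number of
    observations in each subset, assumed identical for every subset.
    """
    g = len(subset_means)
    if g != len(subset_variances):
        raise ValueError("need one mean and one variance per subset")
    if g < 2 or k < 2:
        raise ValueError("need at least 2 subsets of at least 2 observations")

    # Sample variance of the g subset means (n - 1 denominator).
    variance_of_means = statistics.variance(subset_means)

    return ((k - 1) / (g * k - 1)) * (
        sum(subset_variances) + variance_of_means * (k * (g - 1) / (k - 1))
    )


# Toy data: three hypothetical nodes with three observations each.
nodes = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [5.0, 5.0, 8.0]]

# Each node only needs to share its rollup: a mean and a variance.
means = [statistics.mean(obs) for obs in nodes]
variances = [statistics.variance(obs) for obs in nodes]

combined = superset_variance(means, variances, k=3)

# For comparison only: the variance of the re-expanded superset.
direct = statistics.variance([x for obs in nodes for x in obs])
```

With this toy data, `combined` and `direct` agree up to floating-point rounding. If the subsets differ in size, this form of the equation no longer applies as written.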

Using this formula to determine the superset’s variance, we only need to share the nodes’ rollups with one another to perform our entire ANOVA test (as well as post-hoc analysis like Tukey).

Adapted from this Stack Exchange post

