这次我比较懒,不想从头开始分析为什么要有 ANOVA。我只想简单说一下 ANOVA 是怎么计算的。我终于也成了我最讨厌的那种人:直讲是什么,不讲为什么。因为要把「为什么」弄清楚真的太累了。当然,我希望自己只是暂时偷个懒,之后会反过来说一下为什么这么算。
One-way ANOVA #
假设我们现在有三组数据:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
a = [35, 46, 78]
b = [22, 48, 98, 77]
c = [90, 65]
group_dic = {'A': a, "B": b, "C": c}
group_dic
{'A': [35, 46, 78], 'B': [22, 48, 98, 77], 'C': [90, 65]}
def calc(group_dic):
res = []
for group, data in group_dic.items():
mu = np.mean(data)
n = len(data)
var = np.var(data, ddof = 1)
res.append({
'treatment': group,
'sample_mean': mu,
'sample_variance': var,
'sample_size': n
})
return res
res = calc(group_dic)
res = pd.DataFrame(res)
res
treatment | sample_mean | sample_variance | sample_size | |
---|---|---|---|---|
0 | A | 53.00 | 499.000000 | 3 |
1 | B | 61.25 | 1104.916667 | 4 |
2 | C | 77.50 | 312.500000 | 2 |
n = res.sample_size.sum()
k = len(group_dic)
n, k
(9, 3)
我们把 $\bar{X}_{Total}$
定义为总平均值,其计算方法为:
grand_mean = np.sum(
res['sample_mean'] * res['sample_size'])/n
grand_mean
62.111111111111114
组间方差我们记为 SSTrt,计算方法为:
sstrt = np.sum(
res['sample_size']*(res['sample_mean'] - grand_mean)**2)
sstrt
725.6388888888889
组内方差我们记为 SSErr,计算方法为:
sserr = np.sum(
(res.sample_size -1)*res.sample_variance
)
sserr
4625.25
涉及到自由度:
- Total 的自由度为
n-1
- SSTrt 的自由度
dftrt
为k-1
- SSErr 的自由度
dferr
为n-k
我们把 sstrt/dftrt
记为 MSTrt
。类似的,把 sserr/dferr
记为 MSErr
。计算如下:
dftot = n-1
dftrt = k-1
dferr = n-k
mstrt = sstrt/dftrt
mserr = sserr/dferr
我们把 sstrt + sserr
记为 sstot
:
sstot = sstrt + sserr
统计量 F 的计算方法很简单:
$$F = \frac{\text{MSTrt}}{\text{MSErr}}$$
f_value = mstrt/mserr
f_value
0.4706592436444877
我们来把整个结果列出来:
source = ['Treatment', 'Error', 'Total']
d_f = [dftrt, dferr, dftot]
ss = [sstrt, sserr, sstot]
ms = [mstrt, mserr, None]
f_stats = [f_value, None, None]
aov_res = pd.DataFrame({
'Source': source,
'Degrees of Freedom': d_f,
'Sum of Squares': ss,
'Mean Square': ms,
'F-Statistic': f_stats
})
aov_res
Source | Degrees of Freedom | Sum of Squares | Mean Square | F-Statistic | |
---|---|---|---|---|---|
0 | Treatment | 2 | 725.638889 | 362.819444 | 0.470659 |
1 | Error | 6 | 4625.250000 | 770.875000 | NaN |
2 | Total | 8 | 5350.888889 | NaN | NaN |
p 值的计算方法为
from scipy.stats import f
p_value = f.sf(f_value, dftrt, dferr)
p_value
0.6458443083352733
验证 #
我们来验证一下我们的计算是正确的:
from scipy.stats import f_oneway
f_stat, p_value = f_oneway(a, b, c)
print(f"F-statistic: {f_stat}")
print(f"p-value: {p_value}")
F-statistic: 0.4706592436444877
p-value: 0.6458443083352733
假设 #
要使用 One-Way ANOVA 需要满足三个假设;
- 各组所代表总体之方差相等
- 各组所代表总体为正态分布
- 各组独立
最后一次修改于 2024-11-15