ANOVA

郝鸿涛 / 2024-11-15

这次我比较懒,不想从头开始分析为什么要有 ANOVA。我只想简单说一下 ANOVA 是怎么计算的。我终于也成了我最讨厌的那种人:只讲是什么,不讲为什么。因为要把「为什么」弄清楚真的太累了。当然,我希望自己只是暂时偷个懒,之后会反过来说一下为什么这么算。

One-way ANOVA #

假设我们现在有三组数据:

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
a = [35, 46, 78]
b = [22, 48, 98, 77]
c = [90, 65]
group_dic = {'A': a, "B": b, "C": c}
group_dic
{'A': [35, 46, 78], 'B': [22, 48, 98, 77], 'C': [90, 65]}
def calc(group_dic):
    res = []
    for group, data in group_dic.items():
        mu = np.mean(data)
        n = len(data)
        var = np.var(data, ddof = 1)
        res.append({
            'treatment': group,
            'sample_mean': mu,
            'sample_variance': var,
            'sample_size': n
        })
    return res
res = calc(group_dic)
res = pd.DataFrame(res)
res

treatment sample_mean sample_variance sample_size
0 A 53.00 499.000000 3
1 B 61.25 1104.916667 4
2 C 77.50 312.500000 2
n = res.sample_size.sum()
k = len(group_dic)
n, k

(9, 3)
我们把 $\bar{X}_{Total}$ 定义为总平均值,其计算方法为:

grand_mean = np.sum(
    res['sample_mean'] * res['sample_size'])/n
grand_mean

62.111111111111114
组间方差我们记为 SSTrt,计算方法为:

sstrt = np.sum(
    res['sample_size']*(res['sample_mean'] - grand_mean)**2)
sstrt

725.6388888888889
组内方差我们记为 SSErr,计算方法为:

sserr = np.sum(
    (res.sample_size -1)*res.sample_variance
)
sserr

4625.25
涉及到自由度:

我们把 sstrt/dftrt 记为 MSTrt。类似的,把 sserr/dferr 记为 MSErr。计算如下:

dftot = n-1
dftrt = k-1
dferr = n-k
mstrt = sstrt/dftrt 
mserr = sserr/dferr 

我们把 sstrt + sserr 记为 sstot

sstot = sstrt + sserr 

统计量 F 的计算方法很简单:

$$F = \frac{\text{MSTrt}}{\text{MSErr}}$$

f_value = mstrt/mserr
f_value

0.4706592436444877
我们来把整个结果列出来:

source = ['Treatment', 'Error', 'Total']
d_f = [dftrt, dferr, dftot]
ss = [sstrt, sserr, sstot]
ms = [mstrt, mserr, None]
f_stats = [f_value, None, None]
aov_res = pd.DataFrame({
    'Source': source,
    'Degrees of Freedom': d_f,
    'Sum of Squares': ss,
    'Mean Square': ms,
    'F-Statistic': f_stats
})
aov_res

Source Degrees of Freedom Sum of Squares Mean Square F-Statistic
0 Treatment 2 725.638889 362.819444 0.470659
1 Error 6 4625.250000 770.875000 NaN
2 Total 8 5350.888889 NaN NaN

p 值的计算方法为

from scipy.stats import f 
p_value = f.sf(f_value, dftrt, dferr)
p_value
0.6458443083352733

验证 #

我们来验证一下我们的计算是正确的:

from scipy.stats import f_oneway
f_stat, p_value = f_oneway(a, b, c)

print(f"F-statistic: {f_stat}")
print(f"p-value: {p_value}")
F-statistic: 0.4706592436444877 p-value: 0.6458443083352733

假设 #

要使用 One-Way ANOVA 需要满足三个假设;

  1. 各组所代表总体之方差相等
  2. 各组所代表总体为正态分布
  3. 各组独立
#统计

最后一次修改于 2024-12-07