$$s^2=\frac{\sum_{i=1}^n (x_i-\mu)^2}{n-1}$$
or
$$s^2=\frac{\sum_{i=1}^k (M_i-\mu)^2f_i}{n-1}$$
Sample standard deviation:
$$s=\sqrt{\frac{\sum_{i=1}^n (x_i-\mu)^2}{n-1}}$$
or
$$s=\sqrt{\frac{\sum_{i=1}^k (M_i-\mu)^2f_i}{n-1}}$$
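Using $\mu$ here is incorrect. Although $\mu$ can denote a mean in statistics, it stands for the population mean. Strictly speaking, the sample mean is usually only an approximation of the population mean, so the two must be distinguished, and $\bar x$ is commonly used for the sample mean. The corrected formulas are: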
Sample variance:
$$s^2=\frac{\sum_{i=1}^n (x_i-\bar x)^2}{n-1}$$
or
$$s^2=\frac{\sum_{i=1}^k (M_i-\bar x)^2f_i}{n-1}$$
Sample standard deviation:
$$s=\sqrt{\frac{\sum_{i=1}^n (x_i-\bar x)^2}{n-1}}$$
or
$$s=\sqrt{\frac{\sum_{i=1}^k (M_i-\bar x)^2f_i}{n-1}}$$
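To make the $n-1$ formulas concrete, here is a minimal Python sketch of both forms: one for raw observations $x_i$ and one for grouped data with class midpoints $M_i$ and frequencies $f_i$. The function names and the data are made up for illustration; the result is cross-checked against the standard library's `statistics.variance`, which also divides by $n-1$.

```python
import math
import statistics


def sample_variance(xs):
    """Sample variance: sum((x_i - x_bar)^2) / (n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(xs) / n  # sample mean, not the population mean mu
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def grouped_sample_variance(midpoints, freqs):
    """Sample variance from class midpoints M_i and frequencies f_i."""
    n = sum(freqs)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(m * f for m, f in zip(midpoints, freqs)) / n
    return sum((m - x_bar) ** 2 * f for m, f in zip(midpoints, freqs)) / (n - 1)


# Hypothetical data, chosen only so the arithmetic is easy to follow.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s2 = sample_variance(data)
s = math.sqrt(s2)  # sample standard deviation

# The same data expressed as (value, frequency) pairs.
s2_grouped = grouped_sample_variance([2.0, 4.0, 5.0, 7.0, 9.0], [1, 3, 2, 1, 1])

print(s2, s, s2_grouped, statistics.variance(data))
```

Here the grouped result matches the raw one only because each "midpoint" is an actual data value; with real class intervals the grouped formula is an approximation.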
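3 Computing the OR value
Since my current research is mainly in health geography, I recently ran into a basic question about computing the OR value. OR stands for odds ratio, a technical term from public health. Here is the definition given in the Encyclopedia of Public Health.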
The odds ratio (OR) provides a measure of the strength of relationship between two variables, most commonly an exposure and a dichotomous outcome. It is most commonly used in a case control study where it is defined as “the ratio of the odds of being exposed in the group with the outcome to the odds of being exposed in the group without the outcome.”
This concept can be extended to a situation with multiple levels of exposure (e.g., low, moderate, or high exposure to an environmental contaminant). One exposure level is assigned as the “reference” level. For each of the remaining exposure levels, one divides the odds of that exposure level in the outcome positive group (compared with the reference level) by the odds of that exposure level in the outcome negative group.
The OR ranges in value from 0 to infinity. Values close to 1.0 indicate no relationship between the exposure and the outcome. Values less than 1.0 suggest a protective effect, while values greater than 1.0 suggest a causative or adverse effect of exposure.
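As a rough illustration of the multi-level definition quoted above, the sketch below computes the OR of each exposure level against a reference level. The function name `odds_ratios` and all counts are hypothetical, made up only to show the arithmetic.

```python
def odds_ratios(cases, controls, reference):
    """OR of each exposure level versus the reference level.

    cases / controls map an exposure level to its count in the
    outcome-positive / outcome-negative group.
    """
    result = {}
    for level in cases:
        if level == reference:
            continue
        # Odds of this level (compared with the reference) in each group.
        odds_in_cases = cases[level] / cases[reference]
        odds_in_controls = controls[level] / controls[reference]
        result[level] = odds_in_cases / odds_in_controls
    return result


# Hypothetical case-control counts by exposure level.
cases = {"low": 20, "moderate": 30, "high": 50}
controls = {"low": 40, "moderate": 30, "high": 25}

ors = odds_ratios(cases, controls, reference="low")
print(ors)
```

With these invented counts both ORs are greater than 1.0, which under the definition above would suggest an adverse effect of exposure.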
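We then notice that the scatter plot shows a linear trend. Suppose we analyze it with linear regression, i.e. we assume:
$$pr(death)=\beta_0+\beta_1(age)$$
Couldn't we simply fit this? The trouble is that $y$ here, $pr(death)$, is a probability, so its range must lie within $(0,1)$. The right-hand side, however, can take any real value (depending on $\beta_0$, $\beta_1$, and $age$), so even after the linear equation is fitted, its predictions can fall outside that range. Logistic regression was proposed to solve this problem. It starts by defining the logit function: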
$$logit(p)=\log\left(\frac{p}{1-p}\right)$$
$$p=pr(death)$$
In practical terms, the logit function is the logarithm of the odds that the event occurs. The model then becomes:
$$\log\left(\frac{p}{1-p}\right)=\beta_0+\beta_1(age)$$
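A small Python sketch may help show why the logit form fixes the range problem. All coefficients below are hypothetical, not fitted to any data: the linear model's predictions leave $(0,1)$ at extreme ages, while inverting the logit always gives a value strictly between 0 and 1.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of logit: maps a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, for illustration only.
lin_b0, lin_b1 = -0.2, 0.01   # linear model: pr(death) = b0 + b1 * age
log_b0, log_b1 = -6.0, 0.08   # logistic model: logit(p) = b0 + b1 * age

ages = [10, 40, 70, 100, 130]
linear_preds = [lin_b0 + lin_b1 * age for age in ages]
logistic_preds = [inv_logit(log_b0 + log_b1 * age) for age in ages]

for age, lp, gp in zip(ages, linear_preds, logistic_preds):
    print(f"age={age:3d}  linear={lp:6.3f}  logistic={gp:6.3f}")
```

The linear column dips below 0 at age 10 and exceeds 1 at age 130, whereas the logistic column stays inside $(0,1)$ for every age.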
$$logit(p)=\beta_0+\beta_1(I_{type})$$
The 2x2 contingency table is then:

| | Lived | Died |
|---|---|---|
| elective admission | a | b |
| emergency admission | c | d |
Here $I_{type}=0$ denotes elective admission and $I_{type}=1$ denotes emergency admission, so we can read off the corresponding values of $y$: the $logit(p)$ for elective admission is $\beta_0$, and the $logit(p)$ for emergency admission is $\beta_0+\beta_1$. From the definition of the logit function we then get the following expressions:
Odds for elective admission (with $p=pr(death)$, the odds are deaths over survivors):
$$odds_{ele}=\frac{p}{1-p}=\frac{b}{a}=e^{\beta_0}$$
Odds for emergency admission:
$$odds_{eme}=\frac{p}{1-p}=\frac{d}{c}=e^{\beta_0+\beta_1}$$
The OR value can therefore be computed as:
$$OR=\frac{ad}{bc}=odds_{eme}/odds_{ele}=\frac{d}{c}/\frac{b}{a}=e^{\beta_0+\beta_1}/e^{\beta_0}=e^{\beta_0+\beta_1-\beta_0}=e^{\beta_1}$$
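The identity $OR=ad/bc=e^{\beta_1}$ can be checked numerically. The sketch below takes $p=pr(death)$, so each group's odds are deaths over survivors, and uses made-up counts for $a$, $b$, $c$, $d$. Because a model with one binary predictor has as many parameters as groups, its maximum-likelihood estimates can be written directly from the group log-odds; the code also confirms that the likelihood score equations vanish at those values rather than running an iterative fit.

```python
import math

# Hypothetical counts laid out as in the 2x2 table.
a, b = 180, 20   # elective admission: lived, died
c, d = 120, 80   # emergency admission: lived, died

# Group log-odds of death give the coefficients directly.
beta0 = math.log(b / a)            # logit(p) when I_type = 0
beta1 = math.log(d / c) - beta0    # change in logit(p) when I_type = 1


def p_death(i_type):
    """Model probability of death for a given admission type (0 or 1)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * i_type)))


# Score equations of the logistic log-likelihood: sum((y - p) * x) = 0.
score_beta0 = (b - (a + b) * p_death(0)) + (d - (c + d) * p_death(1))
score_beta1 = d - (c + d) * p_death(1)

or_table = (a * d) / (b * c)   # OR straight from the table
or_model = math.exp(beta1)     # OR from the regression coefficient

print(or_table, or_model, score_beta0, score_beta1)
```

Both routes give an OR of 6 for these counts, and the scores are zero up to floating-point error, so $e^{\beta_1}$ from the regression agrees with the hand calculation from the table.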