Probabilistic Graphical Models
Directed vs. undirected
A probabilistic graphical model uses a graph to describe the dependence relations among a set of random variables. A model described by a directed (acyclic) graph is called a Bayesian network; a model described by an undirected graph is called a Markov network.
The intuitive difference between directed and undirected models is "causality"; the essential difference is that the two model different independence relations. For example, from the independence-map perspective: a directed model cannot represent an undirected cyclic dependency, and an undirected model cannot represent a directed V-structure.
Undirected graphical models

$$\Pr(s) = \frac{1}{Z} \prod_{C} \phi_C(s_C)$$
Directed graphical models

$$\Pr(s) = \prod_i \Pr(s_i \vert \mathrm{parent}(s_i))$$
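As a minimal illustration of the directed factorization, the sketch below evaluates the joint probability of a hypothetical 3-node Rain → Sprinkler, Rain → Wet network as a product of conditional probability tables (the network and all numbers are made up):

```python
# Minimal sketch of Pr(s) = prod_i Pr(s_i | parent(s_i)) on a made-up
# 3-node network: Rain -> Sprinkler, Rain -> Wet.

p_rain = {True: 0.2, False: 0.8}                   # Pr(Rain)
p_sprinkler = {True: {True: 0.01, False: 0.99},    # Pr(Sprinkler | Rain=True)
               False: {True: 0.4, False: 0.6}}     # Pr(Sprinkler | Rain=False)
p_wet = {(True, True): 0.99, (True, False): 0.9,   # Pr(Wet=True | Rain, Sprinkler)
         (False, True): 0.9, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    """Product of each node's conditional probability given its parents."""
    pw = p_wet[(rain, sprinkler)]
    return (p_rain[rain]
            * p_sprinkler[rain][sprinkler]
            * (pw if wet else 1 - pw))

# Each factor is a normalized conditional, so the joint sums to 1
# without any partition function.
total = sum(joint(r, s, w)
            for r in (True, False) for s in (True, False) for w in (True, False))
```

Because every factor is a normalized conditional distribution, the product normalizes automatically; this is the key contrast with the undirected case, which needs the constant Z.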
Probabilistic models
Parameter estimation
Parameter distribution

$$\Pr(\theta \vert S) \propto \Pr(\theta) \Pr(S \vert \theta)$$
Maximum a posteriori

$$\arg \max_{\theta} \Pr(\theta \vert S) = \arg \max_{\theta} \Pr(\theta) \Pr(S \vert \theta)$$
Maximum likelihood (maximum a posteriori with a uniform prior assumed on the parameters)

$$\arg \max_{\theta} \Pr(\theta \vert S) = \arg \max_{\theta} \Pr(S \vert \theta)$$
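A concrete special case: for a Bernoulli parameter with a Beta prior, both estimators have closed forms (the sample and the Beta(2, 2) prior below are made up for illustration):

```python
# Sketch: ML vs MAP estimation of a Bernoulli parameter theta from a sample S.
# The data and the Beta(a, b) prior are made-up illustration values.
S = [1, 1, 0, 1, 0, 1, 1, 1]   # 6 heads, 2 tails

n_heads = sum(S)
n = len(S)

# MLE: argmax_theta Pr(S | theta) = heads / n
theta_mle = n_heads / n

# MAP with a Beta(a, b) prior: argmax_theta Pr(theta) Pr(S | theta)
#   = (heads + a - 1) / (n + a + b - 2)
a, b = 2.0, 2.0
theta_map = (n_heads + a - 1) / (n + a + b - 2)
```

With the uniform prior Beta(1, 1), the MAP formula reduces to the MLE, matching the statement above.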
Latent-variable estimation
Maximum likelihood

$$\arg \max_{\theta} \Pr(\theta \vert S) = \arg \max_{\theta} \Pr(S \vert \theta) = \arg \max_{\theta} \sum_{Z} \Pr(S,Z \vert \theta)$$
Likelihood lower bound
From the weighted arithmetic-geometric mean inequality:

$$\sum_{Z} \Pr(S,Z \vert \theta) = \sum_{Z} \Pr(Z \vert \eta) \frac{\Pr(S,Z \vert \theta)}{\Pr(Z \vert \eta)} \geqslant \prod_{Z} \left( \frac{\Pr(S,Z \vert \theta)}{\Pr(Z \vert \eta)} \right)^{\Pr(Z \vert \eta)}$$
From Jensen's inequality:

$$\log \sum_{Z} \Pr(S,Z \vert \theta) = \log \sum_{Z} \Pr(Z \vert \eta) \frac{\Pr(S,Z \vert \theta)}{\Pr(Z \vert \eta)} \geqslant \sum_{Z} \Pr(Z \vert \eta) \log \frac{\Pr(S,Z \vert \theta)}{\Pr(Z \vert \eta)}$$
$$\begin{aligned} \sum_{Z} \Pr(Z \vert \eta) \log \frac{\Pr(S,Z \vert \theta)}{\Pr(Z \vert \eta)} &= \sum_{Z} \Pr(Z \vert \eta) \log \Pr(S,Z \vert \theta) - \sum_{Z} \Pr(Z \vert \eta) \log \Pr(Z \vert \eta) \\ &= \sum_{Z} \Pr(Z \vert \eta) \log \Pr(S,Z \vert \theta) + \mathrm{Entropy}(\Pr(Z \vert \eta)) \end{aligned}$$
Gap of the lower bound

$$\begin{aligned} \sum_{Z} \Pr(Z \vert \eta) \log \frac{\Pr(S,Z \vert \theta)}{\Pr(Z \vert \eta)} &= \sum_{Z} \Pr(Z \vert \eta) \log \frac{\Pr(Z \vert S,\theta) \Pr(S \vert \theta)}{\Pr(Z \vert \eta)} \\ &= \log \Pr(S \vert \theta) - \mathrm{KL}(\Pr(Z \vert \eta) \Vert \Pr(Z \vert S,\theta)) \end{aligned}$$
Identity

$$\log \Pr(S \vert \theta) = \sum_{Z} \Pr(Z \vert \eta) \log \Pr(S,Z \vert \theta) + \mathrm{Entropy}(\Pr(Z \vert \eta)) + \mathrm{KL}(\Pr(Z \vert \eta) \Vert \Pr(Z \vert S,\theta))$$
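The decomposition of the log-likelihood into expected complete-data log-likelihood, entropy, and the gap term $\mathrm{KL}(\Pr(Z \vert \eta) \Vert \Pr(Z \vert S,\theta))$ can be checked numerically; the sketch below uses a made-up two-state latent variable:

```python
import math

# Numeric check of
#   log Pr(S|theta) = sum_Z Pr(Z|eta) log Pr(S,Z|theta)
#                     + Entropy(Pr(Z|eta)) + KL(Pr(Z|eta) || Pr(Z|S,theta))
# for a made-up two-state latent variable Z.
joint = {0: 0.12, 1: 0.28}                 # Pr(S, Z=z | theta) at the observed S
q = {0: 0.5, 1: 0.5}                       # Pr(Z | eta), chosen arbitrarily

p_S = sum(joint.values())                  # Pr(S | theta), marginalizing out Z
post = {z: joint[z] / p_S for z in joint}  # Pr(Z | S, theta)

expected = sum(q[z] * math.log(joint[z]) for z in joint)
entropy = -sum(q[z] * math.log(q[z]) for z in q)
kl = sum(q[z] * math.log(q[z] / post[z]) for z in q)

lhs = math.log(p_S)
rhs = expected + entropy + kl
```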
EM algorithm

$$\theta^{t+1} = \arg \max_{\theta} \sum_{Z} \Pr(Z \vert S,\theta^t) \log \Pr(S,Z \vert \theta)$$
- E step: fix the parameters $\theta$, optimize the latent-distribution parameters $\eta$. The optimization is unconstrained, so the gap $\mathrm{KL}(\Pr(Z \vert \eta) \Vert \Pr(Z \vert S,\theta))$ vanishes at $\Pr(Z \vert \eta) = \Pr(Z \vert S,\theta)$.
- M step: fix the latent-distribution parameters $\eta$, optimize the parameters $\theta$, dropping $\mathrm{Entropy}(\Pr(Z \vert \eta))$ (constant in $\theta$) and the gap term $\mathrm{KL}(\Pr(Z \vert \eta) \Vert \Pr(Z \vert S,\theta))$ (nonnegative, and zero at the current parameters after the E step).
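The two steps can be sketched on the classic two-coin mixture: each observation is the number of heads in 10 tosses of one of two coins of unknown bias, and which coin was tossed is the latent variable Z (all numbers below are made up for illustration):

```python
import math

# Sketch of EM for a mixture of two biased coins (made-up data). For
# simplicity the two coins are assumed equally likely a priori, so only
# the biases theta are estimated.
S = [5, 9, 8, 4, 7]   # heads out of 10 tosses per observation
n_tosses = 10

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

theta = [0.6, 0.5]    # initial guesses for the two coin biases
for _ in range(20):
    # E step: responsibilities Pr(Z = c | S_k, theta)
    resp = []
    for k in S:
        w = [binom_pmf(k, n_tosses, p) for p in theta]
        resp.append([wi / sum(w) for wi in w])
    # M step: maximize the expected complete-data log-likelihood
    theta = [
        sum(r[c] * k for r, k in zip(resp, S))
        / sum(r[c] * n_tosses for r in resp)
        for c in (0, 1)
    ]
```

Each iteration soft-assigns every observation to both coins (E step), then re-estimates each bias as a responsibility-weighted head frequency (M step); the observed-data likelihood never decreases.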
Variational inference

$$\arg \min_{\zeta} \mathrm{KL}(\Pr(Y \vert \zeta) \Vert \Pr(Y \vert X))$$
Approximate $\Pr(Y \vert X)$ with $\Pr(Y \vert \zeta)$. Gradient-based and similar methods can be used.
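A minimal sketch of the gradient approach, assuming a made-up 3-state target and a softmax parametrization of $\Pr(Y \vert \zeta)$:

```python
import numpy as np

# Sketch: fit Pr(Y | zeta) to a fixed target Pr(Y | X) by gradient descent on
# KL(Pr(Y | zeta) || Pr(Y | X)). The 3-state target and the softmax
# parametrization are made-up assumptions for illustration.
p = np.array([0.5, 0.3, 0.2])   # target Pr(Y | X)
zeta = np.zeros(3)              # logits parametrizing Pr(Y | zeta) via softmax

def q(zeta):
    e = np.exp(zeta - zeta.max())
    return e / e.sum()

def kl(q_, p_):
    return float(np.sum(q_ * np.log(q_ / p_)))

lr = 0.5
for _ in range(500):
    q_ = q(zeta)
    # analytic gradient of the KL wrt the logits: q * (log(q / p) - KL)
    g = q_ * (np.log(q_ / p) - kl(q_, p))
    zeta -= lr * g
```

Here the gradient has a closed form because everything is a small discrete distribution; in general one would use stochastic gradients of the evidence lower bound instead.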
Example application 1: BM and RBM
BM: Boltzmann machine
A BM is a binary neural network under an energy-minimization framework.
$$\mathrm{Eng}(s) = \sum_{i < j} \mathrm{Eng}(e_{ij}) + \sum_{i} \mathrm{Eng}(v_i) = - \sum_{i < j} \alpha_{ij} s_i s_j - \sum_{i} \beta_i s_i$$
A BM is essentially an undirected graphical model with latent variables: each edge and each vertex forms a clique of its own.
$$\Pr(s) = \frac{1}{Z} \prod_{i < j} e^{\alpha_{ij} s_i s_j} \prod_{i} e^{\beta_i s_i}$$
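For a small network this distribution can be checked by brute force; the sketch below enumerates all $2^3$ states of a made-up 3-unit BM and normalizes $e^{-\mathrm{Eng}(s)}$:

```python
import itertools
import math

# Sketch: the Boltzmann distribution Pr(s) = exp(-Eng(s)) / Z for a made-up
# 3-unit binary network, with the partition function Z computed by brute force.
n = 3
alpha = {(0, 1): 1.0, (0, 2): -0.5, (1, 2): 0.3}  # edge weights alpha_ij, i < j
beta = [0.2, -0.1, 0.0]                           # unit biases beta_i

def energy(s):
    """Eng(s) = -sum_{i<j} alpha_ij s_i s_j - sum_i beta_i s_i."""
    pair = sum(a * s[i] * s[j] for (i, j), a in alpha.items())
    unit = sum(b * si for b, si in zip(beta, s))
    return -pair - unit

states = list(itertools.product([0, 1], repeat=n))
Z = sum(math.exp(-energy(s)) for s in states)     # partition function

def pr(s):
    return math.exp(-energy(s)) / Z
```

Z is exactly the sum that grows as $2^n$ and makes exact computation intractable for larger networks.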
RBM: restricted Boltzmann machine
$$\mathrm{Eng}(h,v) = - \sum_{i}^{n} \sum_{j}^{m} \alpha_{ij} h_i v_j - \sum_{i}^{n} \beta_i h_i - \sum_{j}^{m} \gamma_j v_j = - h^T A v - h^T B - v^T \Gamma$$
$$\Pr(h,v) = \frac{\displaystyle e^{h^T A v + h^T B + v^T \Gamma}}{\displaystyle \iint e^{h^T A v + h^T B + v^T \Gamma} \,\mathrm{d} h \,\mathrm{d} v}$$
Maximum likelihood

$$\arg \max_{\theta} \log \prod_{k}^{K} \Pr(v^{(k)})$$
$$\begin{aligned} & \log \prod_{k}^{K} \Pr(v^{(k)}) \\ =& \sum_{k}^{K} \log \Pr(v^{(k)}) \\ =& \sum_{k}^{K} \log \int \Pr(v^{(k)},h) \,\mathrm{d} h \\ =& \sum_{k}^{K} \log \frac{\displaystyle \int e^{h^T A v^{(k)} + h^T B + (v^{(k)})^T \Gamma} \,\mathrm{d} h}{\displaystyle \iint e^{h^T A v + h^T B + v^T \Gamma} \,\mathrm{d} h \,\mathrm{d} v} \\ =& \sum_{k}^{K} \left( \log \int e^{h^T A v^{(k)} + h^T B + (v^{(k)})^T \Gamma} \,\mathrm{d} h \right) - K \log \iint e^{h^T A v + h^T B + v^T \Gamma} \,\mathrm{d} h \,\mathrm{d} v \end{aligned}$$
$$\begin{aligned} & \frac{\mathrm{d}}{\mathrm{d} \theta} \log \prod_{k}^{K} \Pr(v^{(k)}) \\ =& \frac{\mathrm{d}}{\mathrm{d} \theta} \left[ \sum_{k}^{K} \left( \log \int e^{h^T A v^{(k)} + h^T B + (v^{(k)})^T \Gamma} \,\mathrm{d} h \right) - K \log \iint e^{h^T A v + h^T B + v^T \Gamma} \,\mathrm{d} h \,\mathrm{d} v \right] \\ =& \sum_{k}^{K} \frac{\displaystyle \int e^{h^T A v^{(k)} + h^T B + (v^{(k)})^T \Gamma} \frac{\mathrm{d}}{\mathrm{d} \theta} \left( h^T A v^{(k)} + h^T B + (v^{(k)})^T \Gamma \right) \mathrm{d} h}{\displaystyle \int e^{h^T A v^{(k)} + h^T B + (v^{(k)})^T \Gamma} \,\mathrm{d} h} - K \frac{\displaystyle \iint e^{h^T A v + h^T B + v^T \Gamma} \frac{\mathrm{d}}{\mathrm{d} \theta} \left( h^T A v + h^T B + v^T \Gamma \right) \mathrm{d} h \,\mathrm{d} v}{\displaystyle \iint e^{h^T A v + h^T B + v^T \Gamma} \,\mathrm{d} h \,\mathrm{d} v} \\ =& \sum_{k}^{K} \int \Pr(h \vert v^{(k)}) \frac{\mathrm{d}}{\mathrm{d} \theta} \left( h^T A v^{(k)} + h^T B + (v^{(k)})^T \Gamma \right) \mathrm{d} h - K \iint \Pr(h,v) \frac{\mathrm{d}}{\mathrm{d} \theta} \left( h^T A v + h^T B + v^T \Gamma \right) \mathrm{d} h \,\mathrm{d} v \\ =& \sum_{k}^{K} \mathop{\mathbb{E}}\limits_{h \sim \Pr(h \vert v^{(k)})} \left[ h (v^{(k)})^T \mathrm{d} A + h \,\mathrm{d} B + v^{(k)} \,\mathrm{d} \Gamma \right] - K \mathop{\mathbb{E}}\limits_{h,v \sim \Pr(h,v)} \left[ h v^T \mathrm{d} A + h \,\mathrm{d} B + v \,\mathrm{d} \Gamma \right] \end{aligned}$$
Contrastive divergence (CD)
Consider the differential of the likelihood for the k-th sample.
The cost of computing the expectations by exact integration is unacceptable. (For example, in the original binary network, the sums have $2^{n+m}$ terms.)
- From the integration viewpoint, integration normalizes and marginalizes the probability; simplify it in the spirit of the mean value theorem.
- From the expectation viewpoint, the expectation is a representative value of the random variable; simplify it by taking samples as representatives.
$$\begin{aligned} & \int \Pr(h \vert v) \frac{\mathrm{d}}{\mathrm{d} \theta} \left( h^T A v + h^T B + v^T \Gamma \right) \mathrm{d} h - \iint \Pr(h,v) \frac{\mathrm{d}}{\mathrm{d} \theta} \left( h^T A v + h^T B + v^T \Gamma \right) \mathrm{d} h \,\mathrm{d} v \\ =& \mathop{\mathbb{E}}\limits_{h \sim \Pr(h \vert v)} \left[ h v^T \mathrm{d} A + h \,\mathrm{d} B + v \,\mathrm{d} \Gamma \right] - \mathop{\mathbb{E}}\limits_{h,v \sim \Pr(h,v)} \left[ h v^T \mathrm{d} A + h \,\mathrm{d} B + v \,\mathrm{d} \Gamma \right] \\ \approx& \left( h v^T \mathrm{d} A + h \,\mathrm{d} B + v \,\mathrm{d} \Gamma \right) - \left( h' v'^T \mathrm{d} A + h' \,\mathrm{d} B + v' \,\mathrm{d} \Gamma \right) \end{aligned}$$
The expression above uses a single sample; it generalizes to multiple samples. Sampling is done by Gibbs sampling.
$$\begin{aligned} \Pr(h \vert v) &= \prod_{i}^{n} \Pr(h_i \vert v) \\ \Pr(v \vert h) &= \prod_{j}^{m} \Pr(v_j \vert h) \end{aligned}$$
This simplification is much like negative sampling.
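One CD-1 update under these factorized conditionals can be sketched as follows, assuming the standard sigmoid form $\Pr(h_i = 1 \vert v) = \sigma(A v + B)_i$ for binary units (the sizes, weights, and training pattern are made up):

```python
import numpy as np

# Sketch of one CD-1 update for a binary RBM. Assumes sigmoid conditionals
# Pr(h_i = 1 | v) = sigma(A v + B)_i and Pr(v_j = 1 | h) = sigma(A^T h + G)_j.
rng = np.random.default_rng(0)
n, m = 4, 6                       # hidden units, visible units
A = rng.normal(0, 0.1, (n, m))    # weights alpha_ij
B = np.zeros(n)                   # hidden biases beta
G = np.zeros(m)                   # visible biases gamma

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cd1_step(v, lr=0.1):
    """One contrastive-divergence update from a single visible sample v."""
    global A, B, G
    ph = sigmoid(A @ v + B)                   # Pr(h | v), positive phase
    h = (rng.random(n) < ph).astype(float)    # Gibbs sample h ~ Pr(h | v)
    pv = sigmoid(A.T @ h + G)                 # Pr(v | h)
    v1 = (rng.random(m) < pv).astype(float)   # Gibbs sample v' ~ Pr(v | h)
    ph1 = sigmoid(A @ v1 + B)                 # Pr(h | v'), negative phase
    # gradient estimate: positive phase minus negative phase
    A += lr * (np.outer(ph, v) - np.outer(ph1, v1))
    B += lr * (ph - ph1)
    G += lr * (v - v1)

v0 = np.array([1., 0., 1., 1., 0., 0.])
for _ in range(100):
    cd1_step(v0)
```

The positive phase uses the data-clamped conditional $\Pr(h \vert v)$; the negative phase replaces the intractable model expectation with a single Gibbs step, exactly the one-sample approximation above.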
Example application 2: estimating the network structure of a graphical model
$$\Pr(G \vert S) \propto \sum_{\theta} \Pr(S \vert \theta,G) \Pr(\theta \vert G) \Pr(G)$$
where $\Pr(\theta \vert G)$ is the prior over the parameters given the network structure. (Different network structures give differently structured parameters.)
The network structure is discrete, and $\Pr(G)$ is usually taken to be uniform.
$$\Pr(G \vert S) \propto \sum_{\theta} \Pr(S \vert \theta,G) \Pr(\theta \vert G)$$
Restriction: directed graphical models over discrete random variables with finitely many values.
The network structure can then be searched greedily (e.g., hill climbing, beam search, branch and bound, genetic algorithms, ...).
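A sketch of hill climbing over structures for 3 binary variables, scoring each candidate DAG by its maximized log-likelihood under MLE parameters (the dataset is made up; a practical score such as BIC would also penalize the number of parameters, since maximum likelihood alone never favors removing edges):

```python
import itertools
import math

# Made-up fully observed binary data; variables 0 and 1 are perfectly correlated.
data = [(0, 0, 0), (1, 1, 1), (1, 1, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
n_vars = 3

def is_dag(edges):
    """Kahn-style check: repeatedly peel off a node with no incoming edge."""
    nodes, es = set(range(n_vars)), set(edges)
    while nodes:
        src = next((u for u in nodes if not any(b == u for _, b in es)), None)
        if src is None:
            return False           # every remaining node has an incoming edge
        nodes.remove(src)
        es = {(a, b) for a, b in es if a != src}
    return True

def loglik(edges):
    """Maximized log-likelihood: MLE conditional table for each variable."""
    ll = 0.0
    for child in range(n_vars):
        parents = tuple(sorted(a for a, b in edges if b == child))
        counts = {}
        for row in data:
            key = tuple(row[p] for p in parents)
            counts.setdefault(key, [0, 0])[row[child]] += 1
        for row in data:
            c0, c1 = counts[tuple(row[p] for p in parents)]
            p1 = c1 / (c0 + c1)
            ll += math.log(p1 if row[child] else 1 - p1)
    return ll

# Hill climbing: toggle one directed edge at a time while scores improve.
edges, score = frozenset(), loglik(frozenset())
improved = True
while improved:
    improved = False
    for e in itertools.permutations(range(n_vars), 2):
        cand = edges ^ {e}
        if is_dag(cand) and loglik(cand) > score + 1e-9:
            edges, score, improved = cand, loglik(cand), True
            break
```

On this toy data the search adds an edge between the perfectly correlated variables 0 and 1, then stops once no single-edge toggle improves the score.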