Notations
- input $v$
- output $r$
- weight parameter $W \in \mathbb{R}^{d \times m}$
- activation function $a$
- mask: $m$ for a vector and $M$ for a matrix
Dropout
- Randomly set the activations of each layer to zero with probability $1-p$:
  $$r = m \circ a(Wv), \quad m_j \sim \text{Bernoulli}(p).$$
- Since many activation functions have the property that $a(0) = 0$, we have
  $$r = a(m \circ Wv).$$
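A minimal NumPy sketch of the two equivalent forms above, assuming a ReLU activation (all sizes and variable names here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, p = 4, 3, 0.5                 # output size, input size, keep probability
W = rng.standard_normal((d, k))     # weight matrix W
v = rng.standard_normal(k)          # input vector v
relu = lambda x: np.maximum(x, 0)   # ReLU satisfies a(0) = 0

m = rng.binomial(1, p, size=d)      # m_j ~ Bernoulli(p)
r1 = m * relu(W @ v)                # r = m ∘ a(Wv)
r2 = relu(m * (W @ v))              # r = a(m ∘ Wv); equal because a(0) = 0
assert np.allclose(r1, r2)
```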
DropConnect
- Randomly set the weights of each layer to zero with probability $1-p$:
  $$r = a((M \circ W)v), \quad M_{ij} \sim \text{Bernoulli}(p).$$
- Each $M_{ij}$ is drawn independently for each example during training. The memory requirement for the masks $M$ therefore grows with the size of each mini-batch, so the implementation needs to be carefully designed (a single-example forward pass is sketched below).
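A minimal NumPy sketch of the DropConnect forward pass for one example (again with illustrative sizes and a ReLU activation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, p = 4, 3, 0.5
W = rng.standard_normal((d, k))
v = rng.standard_normal(k)
relu = lambda x: np.maximum(x, 0)

# One mask entry per weight; in a batched setting a fresh M is drawn for
# every example, which is where the memory cost comes from.
M = rng.binomial(1, p, size=W.shape)   # M_ij ~ Bernoulli(p)
r = relu((M * W) @ v)                  # r = a((M ∘ W) v)
```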
- overall model: $f(x;\theta,M)$, where $\theta = \{W_g, W, W_s\}$
  $$\begin{aligned} o = \mathbb{E}_M[f(x;\theta,M)] &= \sum_M p(M)\, f(x;\theta,M)\\ &= \frac{1}{|M|}\sum_M s\big(a((M \circ W)v);\, W_s\big) \quad \text{if } p = 0.5 \end{aligned}$$
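For intuition, a toy brute-force evaluation of this mask average (feasible only for a tiny $W$, since there are $2^{d \times m}$ masks; taking $s$ to be a softmax layer with weights `W_s` is an assumption of this sketch):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, k, n_classes = 2, 2, 3
W = rng.standard_normal((d, k))
W_s = rng.standard_normal((n_classes, d))   # hypothetical weights of the layer s
v = rng.standard_normal(k)
relu = lambda x: np.maximum(x, 0)
softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()

# For p = 0.5 every mask is equally likely, so the expectation over M is a
# plain average over all 2^(d*k) binary masks.
masks = (np.array(bits).reshape(d, k)
         for bits in itertools.product([0, 1], repeat=d * k))
o = np.mean([softmax(W_s @ relu((M * W) @ v)) for M in masks], axis=0)
```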
- inference (test stage)
$$\begin{aligned} r &= \frac{1}{|M|} \sum_M a\big((M \circ W)v\big)\\ &\approx \frac{1}{Z} \sum_{z=1}^Z r_z \approx \frac{1}{Z} \sum_{z=1}^Z a(u_z), \end{aligned}$$
where $u_z \sim \mathcal{N}\big(pWv,\; p(1-p)(W \circ W)(v \circ v)\big)$ and $Z$ denotes the number of random samples drawn from the Gaussian distribution.
Idea: approximate each weighted sum of Bernoulli random variables, $u = (M \circ W)v$, by a Gaussian random variable; this is partially justified by the central limit theorem.
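A minimal NumPy sketch of this sampling-based approximation at test time (illustrative sizes; `Z` is the number of Gaussian samples):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, p, Z = 4, 3, 0.5, 1000
W = rng.standard_normal((d, k))
v = rng.standard_normal(k)
relu = lambda x: np.maximum(x, 0)

mean = p * (W @ v)                       # E[(M ∘ W)v] = p·Wv
var = p * (1 - p) * ((W * W) @ (v * v))  # Var[(M ∘ W)v], per output unit
u = mean + np.sqrt(var) * rng.standard_normal((Z, d))  # u_z ~ N(mean, var)
r = relu(u).mean(axis=0)                 # r ≈ (1/Z) Σ_z a(u_z)
```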
Limitations:
Both techniques are suitable for fully connected layers only.