Suppose the input is $x$, the parameter matrix is $W$, the model output is $f^W(x)$, and the ground truth is $y$.
3.1 Regression task
For a regression task, given the model output $f^W(x)$, the probability of $y$ is modeled as a Gaussian likelihood, where $\sigma$ is the model's observation-noise parameter, representing the amount of noise in the output data.
Substituting $f^W(x)$ and $y$ into the Gaussian density gives:

p(y \mid f^W(x)) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y-f^W(x))^2}{2\sigma^2}}
Taking the logarithm of both sides yields the log-likelihood:

\log p(y \mid f^W(x)) \propto -\frac{(y-f^W(x))^2}{2\sigma^2} - \log\sigma
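To make the role of $\sigma$ concrete, the snippet below (an illustrative sketch, not from the referenced papers; the residual values are made up) evaluates the negative log-likelihood $\frac{(y-f)^2}{2\sigma^2}+\log\sigma$ for a fixed residual at several noise levels: a larger $\sigma$ down-weights the squared error but pays a $\log\sigma$ penalty.

```python
import math

def gaussian_nll(y, f, sigma):
    # Negative log-likelihood of y under N(f, sigma^2), dropping the
    # constant log(sqrt(2*pi)) term, as in the derivation above.
    return (y - f) ** 2 / (2 * sigma ** 2) + math.log(sigma)

# Fixed residual of 2.0; sweep the observation noise.
for sigma in (0.5, 1.0, 2.0):
    print(sigma, round(gaussian_nll(3.0, 1.0, sigma), 3))
```

This trade-off between the $1/\sigma^2$ weighting and the $\log\sigma$ penalty is exactly what the multi-task loss below exploits to balance tasks automatically.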
3.2 Classification task

Similarly, for a classification task, $f^W(x)$ and $y$ are substituted into a softmax scaled by $\sigma^2$, giving the log-likelihood:

\log p(y=c \mid f^W(x), \sigma) = \frac{1}{\sigma^2}f_c^W(x) - \log\sum_{c'}e^{\frac{1}{\sigma^2}f_{c'}^W(x)}

where $c$ is the true class and $f_c^W(x)$ is the $c$-th entry of the output vector $f^W(x)$.
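The scaled softmax can be read as a Boltzmann distribution with temperature $\sigma^2$: raising $\sigma^2$ flattens the class probabilities, i.e. expresses more uncertainty. A minimal sketch (the logit values are made up for illustration):

```python
import math

def scaled_softmax(logits, sigma):
    # softmax(logits / sigma^2): the temperature sigma^2 controls how
    # peaked the class distribution is.
    scaled = [z / sigma ** 2 for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = scaled_softmax(logits, sigma=1.0)  # ordinary softmax
flat = scaled_softmax(logits, sigma=2.0)   # higher observation noise
```

With $\sigma=1$ this reduces to the ordinary softmax; with $\sigma=2$ the same logits produce a visibly flatter distribution.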
3.3 Multi-task
For a multi-task likelihood mixing regression and classification, suppose $y_1$, ..., $y_k$ are the true outputs of the individual tasks. Assuming the tasks are conditionally independent given the shared output, the joint likelihood factorizes as:

p(y_1, \ldots, y_k \mid f^W(x)) = p(y_1 \mid f^W(x)) \cdots p(y_k \mid f^W(x))
Denote the multi-task loss (for one regression task and one classification task) by $L(W,\sigma_1,\sigma_2)$. Then:

L(W,\sigma_1,\sigma_2) = -\log N(y_1; f^W(x), \sigma_1^2)\cdot \mathrm{softmax}(y_2=c; f^W(x), \sigma_2)

= \frac{1}{2\sigma_1^2}\left\|y_1 - f^W(x)\right\|^2 + \log\sigma_1 - \log p(y_2=c \mid f^W(x), \sigma_2)

= \frac{1}{2\sigma_1^2}\left\|y_1 - f^W(x)\right\|^2 + \log\sigma_1 - \frac{1}{\sigma_2^2}f_c^W(x) + \log\sum_{c'}e^{\frac{1}{\sigma_2^2}f_{c'}^W(x)}

Adding and subtracting $\frac{1}{\sigma_2^2}\log\sum_{c'}e^{f_{c'}^W(x)}$:

= \frac{1}{2\sigma_1^2}\left\|y_1 - f^W(x)\right\|^2 + \log\sigma_1 + \frac{1}{\sigma_2^2}\log\sum_{c'}e^{f_{c'}^W(x)} - \frac{1}{\sigma_2^2}f_c^W(x) + \log\sum_{c'}e^{\frac{1}{\sigma_2^2}f_{c'}^W(x)} - \frac{1}{\sigma_2^2}\log\sum_{c'}e^{f_{c'}^W(x)}

= \frac{1}{2\sigma_1^2}\left\|y_1 - f^W(x)\right\|^2 + \log\sigma_1 - \frac{1}{\sigma_2^2}\log\,\mathrm{softmax}(y_2, f^W(x)) + \log\sum_{c'}e^{\frac{1}{\sigma_2^2}f_{c'}^W(x)} - \frac{1}{\sigma_2^2}\log\sum_{c'}e^{f_{c'}^W(x)}

= \frac{1}{2\sigma_1^2}\left\|y_1 - f^W(x)\right\|^2 + \log\sigma_1 - \frac{1}{\sigma_2^2}\log\,\mathrm{softmax}(y_2, f^W(x)) + \log\frac{\sum_{c'}e^{\frac{1}{\sigma_2^2}f_{c'}^W(x)}}{\left(\sum_{c'}e^{f_{c'}^W(x)}\right)^{\frac{1}{\sigma_2^2}}}

As $\sigma_2 \to 1$ we have the simplifying approximation

\frac{1}{\sigma_2}\sum_{c'}e^{\frac{1}{\sigma_2^2}f_{c'}^W(x)} \approx \left(\sum_{c'}e^{f_{c'}^W(x)}\right)^{\frac{1}{\sigma_2^2}}

(which becomes an equality at $\sigma_2 = 1$), so the last term satisfies

\log\frac{\sum_{c'}e^{\frac{1}{\sigma_2^2}f_{c'}^W(x)}}{\left(\sum_{c'}e^{f_{c'}^W(x)}\right)^{\frac{1}{\sigma_2^2}}} \approx \log\sigma_2

and therefore:

L(W,\sigma_1,\sigma_2) \approx \frac{1}{2\sigma_1^2}\left\|y_1 - f^W(x)\right\|^2 + \log\sigma_1 - \frac{1}{\sigma_2^2}\log\,\mathrm{softmax}(y_2, f^W(x)) + \log\sigma_2
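How good this approximation is near $\sigma_2 = 1$ is easy to check numerically. The short script below (an illustrative sketch; the logit values are made up) compares the log-ratio term against $\log\sigma_2$:

```python
import math

def log_ratio(logits, sigma2):
    # log[ sum_c' exp(f_c'/sigma2^2) / (sum_c' exp(f_c'))^(1/sigma2^2) ]
    inv = 1.0 / sigma2 ** 2
    lhs = math.log(sum(math.exp(inv * f) for f in logits))
    rhs = inv * math.log(sum(math.exp(f) for f in logits))
    return lhs - rhs

logits = [2.0, 1.0, 0.1]
for sigma2 in (1.0, 1.05, 1.2):
    print(sigma2, round(log_ratio(logits, sigma2), 4), round(math.log(sigma2), 4))
```

At $\sigma_2 = 1$ both quantities are exactly $0$; as $\sigma_2$ moves away from $1$ the two values drift apart, which is why the final loss is only an approximation.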
Let $L_1(W)=\left\|y_1 - f^W(x)\right\|^2$ denote the regression loss and $L_2(W)=-\log\,\mathrm{softmax}(y_2, f^W(x))$ the classification (cross-entropy) loss. The multi-task loss then becomes:
L(W,\sigma_1,\sigma_2) \approx \frac{1}{2\sigma_1^2}L_1(W) + \log\sigma_1 + \frac{1}{\sigma_2^2}L_2(W) + \log\sigma_2
Absorbing the constant factor on each term (which only rescales the individual task losses), this is commonly written as:

L(W,\sigma_1,\sigma_2) \approx \frac{1}{\sigma_1^2}L_1(W) + \frac{1}{\sigma_2^2}L_2(W) + 2\log\sigma_1 + 2\log\sigma_2
Furthermore, one usually sets $s_i = \log\sigma_i^2$ (so that $e^{-s_i} = 1/\sigma_i^2$ and $2\log\sigma_i = s_i$), which simplifies the loss to:

L(W,\sigma_1,\sigma_2) \approx e^{-s_1}L_1(W) + e^{-s_2}L_2(W) + s_1 + s_2

Learning $s_i$ instead of $\sigma_i$ is also numerically more stable, since $e^{-s_i}$ is always positive and avoids division by zero as $\sigma_i \to 0$.
A PyTorch implementation is as follows:

class DynamicWeightedLoss(nn.Module):
    def __init__(self, num=2):
        super(DynamicWeightedLoss, self).__init__()
        # params[i] plays the role of s_i = log(sigma_i^2),
        # learned jointly with the model weights.
        params = torch.ones(num, requires_grad=True)
        self.params = torch.nn.Parameter(params)

    def forward(self, *x):
        loss_sum = 0
        for i, loss in enumerate(x):
            # exp(-s_i) * L_i + s_i; the regularizer must stay in the
            # same statement, or it silently becomes a no-op.
            loss_sum += torch.exp(-self.params[i]) * loss + self.params[i]
        return loss_sum
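A minimal usage sketch (the two task losses are dummy scalars standing in for real regression and classification losses, and the class is repeated here so the example is self-contained): since the weighting parameters are registered via `nn.Parameter`, they must be handed to the optimizer together with the model's weights.

```python
import torch
import torch.nn as nn

# DynamicWeightedLoss as defined above, in compact form.
class DynamicWeightedLoss(nn.Module):
    def __init__(self, num=2):
        super().__init__()
        # params[i] is s_i = log(sigma_i^2).
        self.params = nn.Parameter(torch.ones(num))

    def forward(self, *losses):
        return sum(torch.exp(-self.params[i]) * l + self.params[i]
                   for i, l in enumerate(losses))

weighter = DynamicWeightedLoss(num=2)
l1 = torch.tensor(0.8)  # stand-in regression loss
l2 = torch.tensor(1.5)  # stand-in classification loss
total = weighter(l1, l2)

# In training, optimize s_i jointly with the model, e.g.:
# optimizer = torch.optim.Adam(list(model.parameters()) + list(weighter.parameters()))
total.backward()
```

After `backward()`, `weighter.params.grad` is populated, so a single optimizer step updates both the model and the per-task uncertainties.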
Paper [2] refines the regularization term of the loss from paper [1]; its loss is:

L(W,\sigma_1,\sigma_2) = \frac{1}{2\sigma_1^2}L_1(W) + \frac{1}{2\sigma_2^2}L_2(W) + \log(\sigma_1^2+1) + \log(\sigma_2^2+1)

Its PyTorch implementation is as follows:
class DynamicWeightedLoss(nn.Module):
    def __init__(self, num=2):
        super(DynamicWeightedLoss, self).__init__()
        # params[i] plays the role of sigma_i, learned with the model weights.
        params = torch.ones(num, requires_grad=True)
        self.params = torch.nn.Parameter(params)

    def forward(self, *x):
        loss_sum = 0
        for i, loss in enumerate(x):
            # 1/(2*sigma_i^2) * L_i + log(1 + sigma_i^2), kept in one
            # statement so the regularizer is not silently dropped.
            loss_sum += (0.5 / (self.params[i] ** 2) * loss
                         + torch.log(1 + self.params[i] ** 2))
        return loss_sum
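The point of the modified regularizer is that $\log(\sigma^2+1) \ge 0$ for every $\sigma$, whereas $\log\sigma$ from [1] is negative for $\sigma < 1$ and unbounded below as $\sigma \to 0$. A quick numeric comparison (an illustrative sketch):

```python
import math

def reg_v1(sigma):
    # regularizer from paper [1]: can be driven to -infinity
    return math.log(sigma)

def reg_v2(sigma):
    # regularizer from paper [2]: always non-negative
    return math.log(sigma ** 2 + 1)

for sigma in (0.1, 1.0, 10.0):
    print(sigma, round(reg_v1(sigma), 3), round(reg_v2(sigma), 3))
```

With non-negative task losses, the version from [2] therefore keeps the total loss bounded below by zero, which prevents the optimizer from "cheating" by inflating a task's uncertainty.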
[1] Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
[2] Auxiliary Tasks in Multi-task Learning
[3] What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
[4] An Overview of Multi-Task Learning in Deep Neural Networks
[5] https://zhuanlan.zhihu.com/p/269492239
[6] Multi-Task Learning as Multi-Objective Optimization
[7] MMOE
[8] SNR