For a very simple classification problem where I have a target vector [0,0,0,...,0] and a prediction vector [0,0.1,0.2,...,1], would cross-entropy loss converge better/faster, or would MSE loss?
When I plot them, it seems to me that MSE loss has a lower error margin. Why would that be?
Or, for example, when I have the target as [1,1,1,...,1] I get the following:
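For reference, here is a minimal sketch of how I compare the two losses on the all-zeros example (assuming NumPy; the element-wise formulas are the standard binary cross-entropy and squared error):

```python
import numpy as np

target = np.zeros(11)              # [0, 0, ..., 0]
pred = np.linspace(0.0, 1.0, 11)   # [0, 0.1, ..., 1]
eps = 1e-12                        # avoid log(0)

# Element-wise binary cross-entropy: -t*ln(p) - (1-t)*ln(1-p)
ce = -(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))

# Element-wise squared error: (t - p)^2
se = (target - pred) ** 2

print(ce.mean(), se.mean())        # mean cross-entropy vs. mean squared error
```

Swapping in target = np.ones(11) gives the second case.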
One of the assumptions of linear regression is multivariate normality. From this it follows that the target variable is normally distributed (more on the assumptions of linear regression can be found here https://www.statisticssolutions.com/assumptions-of-linear-regression/ and here http://r-statistics.co/Assumptions-of-Linear-Regression.html).
The Gaussian distribution (normal distribution) https://en.wikipedia.org/wiki/Normal_distribution with mean $\mu$ and variance $\sigma^2$ is given by

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Often in machine learning we deal with distributions with mean 0 and variance 1 (or we transform our data to have mean 0 and variance 1). In this case the normal distribution will be

$$\mathcal{N}(x \mid 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

This is called the standard normal distribution.
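As a small sketch (assuming NumPy; the function name is mine), the density above can be evaluated directly:

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(x | mu, sigma2); the defaults give the standard normal."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

print(gaussian_pdf(0.0))           # ~0.3989, the peak of the standard normal
print(gaussian_pdf(2.0, mu=2.0))   # same value: the peak just moves with the mean
```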
For a normal-distribution model with weight parameter $\mathbf{w}$ and precision (inverse variance) parameter $\beta$, the probability of observing a single target $t$ given input $x$ is expressed by the following equation

$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\left(t \mid y(x, \mathbf{w}), \beta^{-1}\right)$$

where $y(x, \mathbf{w})$ is the mean of the distribution and is calculated by the model (for linear regression, $y(x, \mathbf{w}) = \mathbf{w}^{T}\mathbf{x}$).
Now the probability of the target vector $\mathbf{t} = (t_1, \ldots, t_N)$ given the inputs $\mathbf{x} = (x_1, \ldots, x_N)$ can be expressed by

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\right)$$

Taking the natural logarithm of the left and right terms yields

$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \left\{y(x_n, \mathbf{w}) - t_n\right\}^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$$
where $\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)$ is the log likelihood of the normal model. Often training a model involves optimizing the likelihood function with respect to $\mathbf{w}$. Maximizing the likelihood with respect to $\mathbf{w}$ is therefore equivalent to the following (constant terms with respect to $\mathbf{w}$ can be omitted):

$$\arg\max_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \arg\min_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N} \left\{y(x_n, \mathbf{w}) - t_n\right\}^2$$

For training the model, omitting the constant doesn't affect the convergence. This is called the squared error, and taking the mean yields the mean squared error:

$$E(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \left\{y(x_n, \mathbf{w}) - t_n\right\}^2$$
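To make the equivalence concrete, here is a minimal numerical sketch (assuming NumPy and SciPy; the variable names are mine): the negative Gaussian log likelihood differs from $\frac{\beta}{2}\sum_n \{y(x_n,\mathbf{w}) - t_n\}^2$ only by a constant that does not depend on $\mathbf{w}$, so minimizing one minimizes the other.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
t = rng.normal(size=100)            # targets t_n
y = rng.normal(size=100)            # model outputs y(x_n, w)
beta = 2.0                          # precision (inverse variance)
N = len(t)

# Negative log likelihood of the targets under N(y, 1/beta)
nll = -norm.logpdf(t, loc=y, scale=1.0 / np.sqrt(beta)).sum()

# beta/2 * (sum of squared errors) plus the w-independent constant
sse_term = 0.5 * beta * np.sum((y - t) ** 2)
const = 0.5 * N * (np.log(2 * np.pi) - np.log(beta))

print(np.isclose(nll, sse_term + const))   # True: minimizing the NLL minimizes the squared error
```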
Cross-entropy
Before discussing the more general cross-entropy function, I will first explain a specific type of cross-entropy - binary cross-entropy.
Binary cross-entropy
The assumption behind binary cross-entropy is that the probability distribution of the target variable is drawn from a Bernoulli distribution. According to Wikipedia,
the Bernoulli distribution is the discrete probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 - p.
The probability of a Bernoulli-distributed random variable is given by

$$P(t) = \begin{cases} p & \text{if } t = 1 \\ 1 - p & \text{if } t = 0 \end{cases}$$

where $t \in \{0, 1\}$ and $p$ is the probability of success.
This can be simply written as

$$P(t) = p^{t}(1 - p)^{1 - t}$$

Taking the negative natural logarithm of both sides yields

$$-\ln P(t) = -t \ln p - (1 - t)\ln(1 - p)$$
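As a small numerical sketch (assuming NumPy; the array values are arbitrary), the negative log Bernoulli likelihood of each target is exactly the per-sample binary cross-entropy:

```python
import numpy as np

t = np.array([1, 0, 1, 1, 0])             # binary targets
p = np.array([0.9, 0.2, 0.6, 0.8, 0.1])   # predicted probabilities of t = 1

# Negative log likelihood of each target under Bernoulli(p)
nll = -np.log(np.where(t == 1, p, 1 - p))

# Binary cross-entropy per sample: -t*ln(p) - (1-t)*ln(1-p)
bce = -(t * np.log(p) + (1 - t) * np.log(1 - p))

print(np.allclose(nll, bce))   # True
print(bce.mean())              # the usual mean binary cross-entropy loss
```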