For a very simple classification problem where I have a target vector [0,0,0,...,0] and a prediction vector [0,0.1,0.2,...,1], would cross-entropy loss converge better/faster, or would MSE loss?
When I plot them, it seems to me that MSE loss has a lower error margin. Why would that be?
Or, for example, when I have the target as [1,1,1,...,1] I get the following:
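For reference, here is a minimal sketch of how I compare the two losses on the all-zeros example (assuming NumPy; the element-wise formulas are the standard binary cross-entropy and squared error):

```python
import numpy as np

target = np.zeros(11)              # [0, 0, ..., 0]
pred = np.linspace(0.0, 1.0, 11)   # [0, 0.1, ..., 1]
eps = 1e-12                        # avoid log(0)

# Element-wise binary cross-entropy: -t*ln(p) - (1-t)*ln(1-p)
ce = -(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))

# Element-wise squared error: (t - p)^2
se = (target - pred) ** 2

print(ce.mean(), se.mean())        # mean cross-entropy vs. mean squared error
```

Swapping in target = np.ones(11) gives the second case.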
One of the assumptions of linear regression is multivariate normality. From this it follows that the target variable is normally distributed (more on the assumptions of linear regression can be found here https://www.statisticssolutions.com/assumptions-of-linear-regression/ and here http://r-statistics.co/Assumptions-of-Linear-Regression.html).
The Gaussian distribution (normal distribution) https://en.wikipedia.org/wiki/Normal_distribution with mean $\mu$ and variance $\sigma^2$ is given by

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Often in machine learning we deal with distributions with mean 0 and variance 1 (or we transform our data to have mean 0 and variance 1). In this case the normal distribution will be

$$\mathcal{N}(x \mid 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

This is called the standard normal distribution.
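As a small sketch (assuming NumPy; the function name is mine), the density above can be evaluated directly:

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(x | mu, sigma2); the defaults give the standard normal."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

print(gaussian_pdf(0.0))           # ~0.3989, the peak of the standard normal
print(gaussian_pdf(2.0, mu=2.0))   # same value: the peak just moves with the mean
```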
For a normal-distribution model with weight parameter $\mathbf{w}$ and precision (inverse variance) parameter $\beta$, the probability of observing a single target $t$ given input $x$ is expressed by the following equation

$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\left(t \mid y(x, \mathbf{w}), \beta^{-1}\right)$$

where $y(x, \mathbf{w})$ is the mean of the distribution and is calculated by the model (for linear regression, $y(x, \mathbf{w}) = \mathbf{w}^{T}\mathbf{x}$).
Now the probability of the target vector $\mathbf{t} = (t_1, \ldots, t_N)$ given the inputs $\mathbf{x} = (x_1, \ldots, x_N)$ can be expressed by

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\right)$$

Taking the natural logarithm of the left and right terms yields

$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \left\{y(x_n, \mathbf{w}) - t_n\right\}^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$$
where $\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)$ is the log likelihood of the normal model. Often training a model involves optimizing the likelihood function with respect to $\mathbf{w}$. Maximizing the likelihood with respect to $\mathbf{w}$ is therefore equivalent to the following (constant terms with respect to $\mathbf{w}$ can be omitted):

$$\arg\max_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \arg\min_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N} \left\{y(x_n, \mathbf{w}) - t_n\right\}^2$$

For training the model, omitting the constant doesn't affect the convergence. This is called the squared error, and taking the mean yields the mean squared error:

$$E(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \left\{y(x_n, \mathbf{w}) - t_n\right\}^2$$
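To make the equivalence concrete, here is a minimal numerical sketch (assuming NumPy and SciPy; the variable names are mine): the negative Gaussian log likelihood differs from $\frac{\beta}{2}\sum_n \{y(x_n,\mathbf{w}) - t_n\}^2$ only by a constant that does not depend on $\mathbf{w}$, so minimizing one minimizes the other.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
t = rng.normal(size=100)            # targets t_n
y = rng.normal(size=100)            # model outputs y(x_n, w)
beta = 2.0                          # precision (inverse variance)
N = len(t)

# Negative log likelihood of the targets under N(y, 1/beta)
nll = -norm.logpdf(t, loc=y, scale=1.0 / np.sqrt(beta)).sum()

# beta/2 * (sum of squared errors) plus the w-independent constant
sse_term = 0.5 * beta * np.sum((y - t) ** 2)
const = 0.5 * N * (np.log(2 * np.pi) - np.log(beta))

print(np.isclose(nll, sse_term + const))   # True: minimizing the NLL minimizes the squared error
```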
Cross-entropy
Before discussing the more general cross-entropy function, I will first explain a specific type of cross-entropy - binary cross-entropy.
Binary cross-entropy
The assumption behind binary cross-entropy is that the probability distribution of the target variable is drawn from a Bernoulli distribution. According to Wikipedia,
the Bernoulli distribution is the discrete probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 - p.
The probability of a Bernoulli-distributed random variable is given by

$$P(t) = \begin{cases} p & \text{if } t = 1 \\ 1 - p & \text{if } t = 0 \end{cases}$$

where $t \in \{0, 1\}$ and $p$ is the probability of success.
This can be simply written as

$$P(t) = p^{t}(1 - p)^{1 - t}$$

Taking the negative natural logarithm of both sides yields

$$-\ln P(t) = -t \ln p - (1 - t)\ln(1 - p)$$
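As a small numerical sketch (assuming NumPy; the array values are arbitrary), the negative log Bernoulli likelihood of each target is exactly the per-sample binary cross-entropy:

```python
import numpy as np

t = np.array([1, 0, 1, 1, 0])             # binary targets
p = np.array([0.9, 0.2, 0.6, 0.8, 0.1])   # predicted probabilities of t = 1

# Negative log likelihood of each target under Bernoulli(p)
nll = -np.log(np.where(t == 1, p, 1 - p))

# Binary cross-entropy per sample: -t*ln(p) - (1-t)*ln(1-p)
bce = -(t * np.log(p) + (1 - t) * np.log(1 - p))

print(np.allclose(nll, bce))   # True
print(bce.mean())              # the usual mean binary cross-entropy loss
```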