3.1 Basic Form
An example x is described by d attributes: $x = (x_1; x_2; \ldots; x_d)$, where $x_j$ is the value of x on the j-th attribute.
A linear model tries to learn a prediction function that is a linear combination of the attributes:
$$f(x) = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$
or, in vector form, $f(x) = w^{T}x + b$ with $w = (w_1; w_2; \ldots; w_d)$.
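As a quick sanity check, here is a minimal NumPy sketch of the vector form; the weights, bias, and input below are made-up values for illustration:

```python
import numpy as np

w = np.array([0.2, -0.5, 1.0])  # hypothetical weights for d = 3 attributes
b = 0.1                         # hypothetical bias
x = np.array([1.0, 2.0, 3.0])   # one example with d = 3 attribute values

f_x = w @ x + b                 # f(x) = w^T x + b
print(f_x)                      # 0.2 - 1.0 + 3.0 + 0.1 = 2.3
```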
3.2 Linear Regression
3.2.1 Univariate Linear Regression (for regression)
Given a data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$.
Consider the simplest case first, where the input has a single attribute. Linear regression then tries to learn
$$f(x_i) = w x_i + b, \quad \text{such that} \quad f(x_i) \simeq y_i.$$
How do we determine w and b? By minimizing the mean squared error:
$$(w^*, b^*) = \arg\min_{(w,b)} \sum_{i=1}^{m} \big(f(x_i) - y_i\big)^2 = \arg\min_{(w,b)} \sum_{i=1}^{m} \big(y_i - w x_i - b\big)^2$$
Solving the model by minimizing the mean squared error is called the least squares method: it finds the line that minimizes the sum of Euclidean distances from all samples to the line.
Least squares parameter estimation is the process of solving for the w and b that minimize this error:
$$E(w, b) = \sum_{i=1}^{m} (y_i - w x_i - b)^2$$
$$\frac{\partial E(w,b)}{\partial w} = 2\left( w\sum_{i=1}^{m} x_i^2 - \sum_{i=1}^{m} (y_i - b)\, x_i \right) \quad (3.5)$$
$$\frac{\partial E(w,b)}{\partial b} = 2\left( mb - \sum_{i=1}^{m} (y_i - w x_i) \right) \quad (3.6)$$
E(w, b) is a convex function of w and b, so the optimal w and b are obtained where both partial derivatives are zero. A function with a U-shaped curve is typically convex. For a function over the reals, convexity can be checked via the second derivative: if the second derivative is ≥ 0 on an interval, the function is convex on it; if the second derivative is strictly > 0 throughout the interval, the function is strictly convex.
Setting (3.5) and (3.6) to zero yields the closed-form solution for the optimal w and b:
$$w = \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\left(\sum_{i=1}^{m} x_i\right)^2}, \qquad b = \frac{1}{m}\sum_{i=1}^{m} (y_i - w x_i)$$
where $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$ is the mean of the inputs.
![](https://img-blog.csdnimg.cn/eb575121458340d9b13c86f4cb79c435.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2F5YXlhMTEy,size_20,color_FFFFFF,t_70,g_se,x_16)
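A minimal NumPy sketch of this closed-form solution on synthetic data; the generating line y = 2x + 1 and the noise level are made up for illustration:

```python
import numpy as np

# Synthetic 1-D data: y = 2x + 1 plus Gaussian noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)

m = len(x)
x_bar = x.mean()

# Closed-form least squares estimates from the equations above
w = np.sum(y * (x - x_bar)) / (np.sum(x**2) - (np.sum(x))**2 / m)
b = np.mean(y - w * x)

print(f"w = {w:.3f}, b = {b:.3f}")  # should land close to 2 and 1
```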
3.2.2 Multivariate Linear Regression (for regression)
Multivariate linear regression tries to learn $f(x_i) = w^T x_i + b$, such that $f(x_i) \simeq y_i$.
For convenience, absorb w and b into $\hat{w} = (w; b)$, write the data set as an $m \times (d+1)$ matrix X whose i-th row is $(x_i^T, 1)$, and let $y = (y_1; y_2; \ldots; y_m)$. Then, corresponding to E(w, b) in the univariate model,
$$\hat{w}^* = \arg\min_{\hat{w}} (y - X\hat{w})^T (y - X\hat{w}) \quad (3.9)$$
Taking the derivative of $E_{\hat{w}} = (y - X\hat{w})^T (y - X\hat{w})$ with respect to $\hat{w}$ and setting it to zero gives the closed-form solution for the optimal $\hat{w}$:
$$\frac{\partial E_{\hat{w}}}{\partial \hat{w}} = 2X^T (X\hat{w} - y) \quad (3.10)$$
When $X^T X$ is a full-rank matrix or a positive definite matrix, setting (3.10) to zero gives
$$\hat{w}^* = (X^T X)^{-1} X^T y$$
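A corresponding sketch for the multivariate case; the synthetic data are made up, and using np.linalg.lstsq (numerically stabler than forming the inverse explicitly) is my choice rather than the book's:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 200, 5
X_raw = rng.normal(size=(m, d))
true_w = np.arange(1, d + 1, dtype=float)   # made-up ground truth
y = X_raw @ true_w + 3.0 + rng.normal(0, 0.1, size=m)

# Absorb the bias b: append a column of ones so w_hat = (w; b)
X = np.hstack([X_raw, np.ones((m, 1))])

# Solves the same least squares problem as (X^T X)^{-1} X^T y
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("w:", w_hat[:-1].round(2), "b:", round(w_hat[-1], 2))
```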
In real-world tasks, however, $X^T X$ is often not full-rank: tasks frequently involve a large number of variables, possibly more than the number of samples, so X has more columns than rows. In that case there are multiple solutions $\hat{w}$, all of which minimize the mean squared error.
Which solution to output is then determined by the inductive bias of the learning algorithm; a common practice is to introduce regularization, as sketched below.
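As one concrete (assumed) choice of regularization, here is a minimal ridge-regression sketch: adding λI to $X^T X$ makes the matrix invertible even when it is rank-deficient, so a unique solution is picked out:

```python
import numpy as np

def ridge_closed_form(X, y, lam=1e-2):
    """Ridge regression: w_hat = (X^T X + lam * I)^{-1} X^T y.
    The lam * I term makes the matrix invertible even if X^T X
    is not full-rank (e.g. more columns than rows)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Usage: more variables (10) than samples (5), so X^T X is singular
rng = np.random.default_rng(5)
X = rng.normal(size=(5, 10))
y = rng.normal(size=5)
print(ridge_closed_form(X, y).shape)  # (10,)
```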
Variants that make the model prediction approximate a transformation of y: suppose we believe the output label varies on an exponential scale; then we can take the logarithm of the output label as the target that linear regression fits. This is log-linear regression:
$$\ln y = w^T x + b$$
which is actually equivalent to $y = e^{w^T x + b}$. Although on the surface this is still linear regression, it is in essence learning a nonlinear mapping from the input space to the output space, as shown below:
![](https://img-blog.csdnimg.cn/4afdbd28841641a3a6059434c479c77c.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2F5YXlhMTEy,size_13,color_FFFFFF,t_70,g_se,x_16)
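A minimal sketch of log-linear regression on synthetic data: fit ordinary least squares to ln y; the generating parameters 1.5 and 0.5 are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 2, size=(100, 1))
# y varies on an exponential scale: y = exp(1.5 x + 0.5) with noise
y = np.exp(1.5 * x[:, 0] + 0.5) * np.exp(rng.normal(0, 0.05, 100))

# Linear regression on ln y: ln y = w x + b
X = np.hstack([x, np.ones((100, 1))])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print("w:", coef[0].round(2), "b:", coef[1].round(2))  # near 1.5 and 0.5
```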
3.3 Logistic Regression (for classification)
From regression to classification: find a monotone differentiable function that links the true label y of a classification task with the prediction of the linear regression model; in other words, find a function that converts a real value into a 0/1 value.
From the unit-step function to the logistic function:
$$y = \frac{1}{1 + e^{-z}} \quad (3.17)$$
(In the book's figure: red is the unit-step function; black is the logistic function.)
The logistic function is a kind of sigmoid function: it converts a real value z into a y value close to 0 or 1, and it changes steeply near z = 0. Note that its gradient is largest at z = 0 (where it equals 0.25) and approaches zero as |z| grows, so when it appears in a classification objective, vanishing gradients can occur in the saturated regions during backpropagation.
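A small sketch that evaluates the logistic function and its derivative σ(z)(1 − σ(z)), showing where the gradient is large and where it vanishes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # derivative of the logistic function

for z in [0.0, 2.0, 10.0]:
    print(f"z={z:5.1f}  sigma={sigmoid(z):.4f}  grad={sigmoid_grad(z):.6f}")
# grad is 0.25 at z = 0 and nearly 0 at z = 10 (the saturated region)
```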
Substituting $z = w^T x + b$ gives $y = \frac{1}{1 + e^{-(w^T x + b)}}$. Rearranging and taking the logarithm of both sides yields
$$\ln \frac{y}{1-y} = w^T x + b \quad (3.19)$$
which, in essence, makes the prediction of the linear regression model approximate the log odds (logit) of the true label.
Regard y as the probability that sample x is a positive example; then 1 − y is the probability that it is a negative example. The ratio $\frac{y}{1-y}$ is called the odds and reflects the relative likelihood of x being positive; its logarithm $\ln \frac{y}{1-y}$ is called the log odds (logit). For example, y = 0.8 gives odds 0.8/0.2 = 4 and a logit of ln 4 ≈ 1.39.
Although "regression" appears in its name, this is actually a classification method. Its advantages:
1. It models the class probability directly and needs no prior assumption about the data distribution, avoiding the problems caused by an inaccurate distributional assumption;
2. It predicts not just a class label but an approximate probability;
3. Its objective is a convex function that is differentiable to any order, which makes it easy to optimize.
Rewriting (3.19) with y treated as $p(y=1 \mid x)$ gives
$$\ln \frac{p(y=1 \mid x)}{p(y=0 \mid x)} = w^T x + b \quad (3.22)$$
$$p(y=1 \mid x) = \frac{e^{w^T x + b}}{1 + e^{w^T x + b}}, \qquad p(y=0 \mid x) = \frac{1}{1 + e^{w^T x + b}}$$
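A tiny sketch evaluating these two posteriors; the parameters w and b here are hypothetical values, not fitted ones:

```python
import numpy as np

def posterior_pos(x, w, b):
    """p(y=1|x) = e^{w^T x + b} / (1 + e^{w^T x + b}), i.e. sigmoid(w^T x + b)."""
    z = w @ x + b
    return np.exp(z) / (1.0 + np.exp(z))

w, b = np.array([1.0, -2.0]), 0.5   # hypothetical parameters
x = np.array([0.3, 0.1])
p1 = posterior_pos(x, w, b)
print(p1, 1.0 - p1)   # p(y=1|x) and p(y=0|x) sum to 1
```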
Use maximum likelihood estimation to estimate w and b. The log-likelihood is
$$\ell(w, b) = \sum_{i=1}^{m} \ln p(y_i \mid x_i;\, w, b) \quad (3.25)$$
$$p(y_i \mid x_i;\, w, b) = y_i\, p(x_i;\, w, b) + (1 - y_i)\,\big(1 - p(x_i;\, w, b)\big) \quad (3.26)$$
where $p(x_i;\, w, b)$ denotes $p(y=1 \mid x_i;\, w, b)$.
That is, we want each sample's probability of belonging to its true label to be as large as possible. Substituting (3.26) into (3.25) and using $y_i \in \{0, 1\}$, maximizing (3.25) is equivalent to minimizing the negative log-likelihood
$$\ell(w, b) = -\sum_{i=1}^{m} \Big[ y_i \ln p(x_i;\, w, b) + (1 - y_i) \ln\big(1 - p(x_i;\, w, b)\big) \Big]$$
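A sketch that minimizes this negative log-likelihood by plain batch gradient descent, which is one standard choice; the learning rate, iteration count, and toy data below are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Minimize the negative log-likelihood above by batch gradient descent.
    X: (m, d) inputs, y: (m,) labels in {0, 1}."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)   # p(y=1 | x_i) for every sample
        grad_z = p - y           # d(NLL)/dz_i for the logistic loss
        w -= lr * X.T @ grad_z / m
        b -= lr * grad_z.mean()
    return w, b

# Toy usage on roughly linearly separable data
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.5 > 0).astype(float)
w, b = fit_logistic(X, y)
print(w.round(2), round(b, 2))
```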
3.4 Linear Discriminant Analysis (LDA)
In binary classification it is also known as Fisher discriminant analysis.
At training time, given a set of training examples, LDA tries to project the examples onto a line such that the projections of same-class examples are as close together as possible and the projections of different-class examples are as far apart as possible. At test time, a new sample is projected onto the same line, and its class is determined by the position of its projection. A two-dimensional illustration:
![](https://img-blog.csdnimg.cn/7d807be6ce324bdf8529876ea9af38da.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2F5YXlhMTEy,size_20,color_FFFFFF,t_70,g_se,x_16)
Given a data set $D = \{(x_i, y_i)\}_{i=1}^{m}$ with $y_i \in \{0, 1\}$, let $X_i$, $\mu_i$, $\Sigma_i$ denote the set of examples, the mean vector, and the covariance matrix of the i-th class, respectively.
When the data are projected onto the line with direction w, the projections of the two class centers are $w^T \mu_0$ and $w^T \mu_1$ (both scalars).
When all sample points are projected onto the line, the covariances of the projections of the two classes are $w^T \Sigma_0 w$ and $w^T \Sigma_1 w$ (both scalars).
To bring the projections of same-class examples as close together as possible, make the covariance of same-class projections as small as possible, i.e. make $w^T \Sigma_0 w + w^T \Sigma_1 w$ as small as possible.
To push the projections of different-class examples as far apart as possible, make the distance between the projected class centers as large as possible, i.e. make $\|w^T \mu_0 - w^T \mu_1\|_2^2$ as large as possible. Considering both at once gives the LDA maximization objective:
$$J = \frac{\|w^T \mu_0 - w^T \mu_1\|_2^2}{w^T \Sigma_0 w + w^T \Sigma_1 w} = \frac{w^T (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T w}{w^T (\Sigma_0 + \Sigma_1) w}$$
Within-class scatter matrix:
$$S_w = \Sigma_0 + \Sigma_1 = \sum_{x \in X_0} (x - \mu_0)(x - \mu_0)^T + \sum_{x \in X_1} (x - \mu_1)(x - \mu_1)^T$$
Between-class scatter matrix:
$$S_b = (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T$$
In terms of these, the objective becomes $J = \frac{w^T S_b w}{w^T S_w w}$, the generalized Rayleigh quotient of $S_b$ and $S_w$.
How is w determined?
Since both the numerator and the denominator of J are quadratic in w, the solution depends only on the direction of w, not its magnitude. We can therefore fix the denominator of the LDA objective to 1, making maximizing J equivalent to
$$\min_{w} \; -w^T S_b w \qquad \text{s.t.} \quad w^T S_w w = 1 \quad (3.36)$$
By the method of Lagrange multipliers, (3.36) is equivalent to
$$S_b w = \lambda S_w w$$
where λ is the Lagrange multiplier. Because $S_b w$ always points in the direction of $\mu_0 - \mu_1$, we can take $S_b w = \lambda(\mu_0 - \mu_1)$ and obtain
$$w = S_w^{-1}(\mu_0 - \mu_1)$$
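A minimal sketch of this closed-form LDA direction on synthetic two-class data; the class means and the function name lda_direction are illustrative choices:

```python
import numpy as np

def lda_direction(X0, X1):
    """Two-class LDA: w = S_w^{-1} (mu_0 - mu_1), from the solution above."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter S_w: sum over classes of (x - mu)(x - mu)^T
    S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    return np.linalg.solve(S_w, mu0 - mu1)

rng = np.random.default_rng(4)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))
w = lda_direction(X0, X1)
print(w / np.linalg.norm(w))   # unit projection direction
```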
3.5 Multiclass Learning
3.6 The Class-Imbalance Problem