Gradient descent executes the following formula:

\begin{align*} \text{repeat }&\text{until convergence:} \; \lbrace \newline \; w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{1} \newline b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \tag{2} \newline \rbrace \end{align*}
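One hand-computed step makes the update rule concrete (the toy data, starting point, and learning rate below are assumed for illustration):

```python
import numpy as np

# toy dataset y = 2x with two examples (values assumed for illustration)
x = np.array([1.0, 2.0])
y = np.array([2.0, 4.0])

w, b = 0.0, 0.0   # starting point
alpha = 0.1       # learning rate

# derivative terms: the errors f(x) - y are [-2, -4],
# so dJ/dw = ((-2)*1 + (-4)*2)/2 = -5 and dJ/db = (-2 + -4)/2 = -3
m = x.shape[0]
err = (w * x + b) - y
dj_dw = np.sum(err * x) / m
dj_db = np.sum(err) / m

# one simultaneous update, equations (1) and (2)
w, b = w - alpha * dj_dw, b - alpha * dj_db
print(w, b)  # w = 0.5, b ≈ 0.3
```

Note that w and b are updated simultaneously, from derivative values computed at the same old point.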
where the derivative terms are:

\begin{align} \frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \tag{3} \\ \frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{4} \end{align}
Computing the cost executes the formula:

J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{5}
Gradient descent repeatedly executes the two steps above.
Code implementation
import math, copy
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
from lab_utils_uni import plt_house_x, plt_contour_wgrad, plt_divergence, plt_gradients

# cost function
def compute_cost(x, y, w, b):
    m = x.shape[0]
    cost = 0
    for i in range(m):
        f_wb = w * x[i] + b
        cost = cost + (f_wb - y[i])**2
    total_cost = 1 / (2 * m) * cost
    return total_cost
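A quick sanity check of compute_cost (restated below so the snippet runs on its own; the sample numbers are invented): a model that fits the data exactly has zero cost.

```python
import numpy as np

def compute_cost(x, y, w, b):
    # squared-error cost, equation (5)
    m = x.shape[0]
    cost = 0
    for i in range(m):
        f_wb = w * x[i] + b
        cost = cost + (f_wb - y[i])**2
    return 1 / (2 * m) * cost

x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])   # invented targets: y = 200x + 100

print(compute_cost(x, y, w=200, b=100))  # exact fit -> 0.0
print(compute_cost(x, y, w=200, b=0))    # every prediction low by 100 -> 100**2/2 = 5000.0
```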
# compute the derivative terms for both parameters
def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression
    Args:
      x (ndarray (m,)): Data, m examples
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters
    Returns:
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b
    """
    # number of training examples
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0
    for i in range(m):
        f_wb = w * x[i] + b
        dj_dw_i = (f_wb - y[i]) * x[i]
        dj_db_i = f_wb - y[i]
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    return dj_dw, dj_db
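A sanity check of compute_gradient (again restated for a standalone run, with invented data): at parameter values that reproduce the targets exactly, both derivatives vanish.

```python
import numpy as np

def compute_gradient(x, y, w, b):
    # derivative terms, equations (3) and (4)
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0
    for i in range(m):
        f_wb = w * x[i] + b
        dj_dw += (f_wb - y[i]) * x[i]
        dj_db += f_wb - y[i]
    return dj_dw / m, dj_db / m

x = np.array([1.0, 2.0])
y = np.array([2.0, 4.0])   # invented data on the line y = 2x

print(compute_gradient(x, y, w=2.0, b=0.0))  # at the minimum: (0.0, 0.0)
print(compute_gradient(x, y, w=0.0, b=0.0))  # far from it: (-5.0, -3.0)
```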
# gradient descent
def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking
    num_iters gradient steps with learning rate alpha
    Args:
      x (ndarray (m,))  : Data, m examples
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float)     : learning rate
      num_iters (int)   : number of iterations to run gradient descent
      cost_function     : function to call to produce cost
      gradient_function : function to call to produce gradient
    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (list): History of cost values
      p_history (list): History of parameters [w,b]
    """
    w = copy.deepcopy(w_in)  # avoid modifying global w_in
    # lists to store the history of cost values and parameter pairs for the return value
    J_history = []
    p_history = []
    b = b_in
    w = w_in
    for i in range(num_iters):
        # calculate the gradient using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w, b)
        # update the parameters using equations (1) and (2) above
        b = b - alpha * dj_db
        w = w - alpha * dj_dw
        # save cost J at each iteration
        if i < 100000:  # prevent resource exhaustion
            J_history.append(cost_function(x, y, w, b))
            p_history.append([w, b])
        # print cost at 10 intervals, or every iteration if num_iters < 10
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e} ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")
    return w, b, J_history, p_history  # return w, b and the histories for graphing
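Putting the three pieces together end to end (compact restatements so this runs standalone; the two-point dataset and hyperparameters are invented for illustration, not prescribed values):

```python
import numpy as np

def compute_cost(x, y, w, b):
    return np.sum((w * x + b - y)**2) / (2 * x.shape[0])

def compute_gradient(x, y, w, b):
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)

def gradient_descent(x, y, w, b, alpha, num_iters):
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b

x = np.array([1.0, 2.0])       # invented sizes
y = np.array([300.0, 500.0])   # invented prices; the exact fit is w=200, b=100

w, b = gradient_descent(x, y, 0.0, 0.0, alpha=1.0e-2, num_iters=10000)
print(f"w = {w:.2f}, b = {b:.2f}")  # approaches w=200, b=100
```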
Notes
In the gradient descent formula, the derivative terms are applied simultaneously: compute both before updating either parameter.
The initial values of w and b may be far from the optimal values; gradient descent still converges.
The learning rate α must be appropriate: too large a value causes divergence.
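The learning-rate warning is easy to demonstrate on a toy dataset (values assumed): a modest α drives the cost down, while an overly large one makes it explode.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([2.0, 4.0])

def cost(w, b):
    return np.sum((w * x + b - y)**2) / (2 * x.shape[0])

def step(w, b, alpha):
    err = w * x + b - y
    return w - alpha * np.mean(err * x), b - alpha * np.mean(err)

results = {}
for alpha in (0.1, 2.0):   # modest vs. far-too-large learning rate
    w, b = 0.0, 0.0
    start = cost(w, b)
    for _ in range(10):
        w, b = step(w, b, alpha)
    results[alpha] = (start, cost(w, b))
    print(f"alpha={alpha}: cost {start:.2f} -> {results[alpha][1]:.3g}")
```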
Gradient descent code for multiple linear regression
Data matrix
Sample data
Suppose we have $m$ examples and $n$ features; the data can be represented by the following $(m, n)$ matrix $\mathbf{X}$:

\mathbf{X} = \begin{pmatrix} x^{(0)}_0 & x^{(0)}_1 & \cdots & x^{(0)}_{n-1} \\ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_{n-1} \\ \cdots \\ x^{(m-1)}_0 & x^{(m-1)}_1 & \cdots & x^{(m-1)}_{n-1} \end{pmatrix}

where:
$\mathbf{x}^{(i)}$ denotes one example: $\mathbf{x}^{(i)}=(x_{0}^{(i)}, x_{1}^{(i)}, \cdots, x_{n-1}^{(i)})$
$x_{j}^{(i)}$ denotes element $j$ of example $i$. The superscript is the example index; the subscript is the element index.
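These indexing conventions map directly onto NumPy arrays (the matrix below is made up): X[i] is example $\mathbf{x}^{(i)}$ as a vector, and X[i, j] is element $x_j^{(i)}$.

```python
import numpy as np

# (m, n) = (3, 2): three examples with two features each (made-up values)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

print(X.shape)   # (3, 2)
print(X[1])      # example x^(1): [3. 4.]
print(X[1, 0])   # element x_0^(1): 3.0
```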
Parameter vector
$\mathbf{w}$ is a vector with $n$ elements:
each element of $\mathbf{w}$ has one associated feature
it is written as a column vector

\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \cdots\\ w_{n-1} \end{pmatrix}

$b$ is a scalar parameter. The initial value supplied for the parameters is likewise a vector $\mathbf{w}$.
Computing the linear regression $f(\mathbf{x})$
Without vector notation:

f_{\mathbf{w},b}(\mathbf{x}) = w_0 x_0 + w_1 x_1 + ... + w_{n-1} x_{n-1} + b \tag{1}

With vector notation:

f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \tag{2}

where $\cdot$ denotes the vector dot product.
Methods
Loop method, corresponding to equation (1):
def predict_single_loop(x, w, b):
    """
    single predict using linear regression
    Args:
      x (ndarray): Shape (n,) example with multiple features
      w (ndarray): Shape (n,) model parameters
      b (scalar) : model parameter
    Returns:
      p (scalar) : prediction
    """
    n = x.shape[0]
    p = 0
    for i in range(n):
        p_i = x[i] * w[i]
        p = p + p_i
    p = p + b
    return p
Dot-product method, corresponding to equation (2):
def predict(x, w, b):
    """
    single predict using linear regression
    Args:
      x (ndarray): Shape (n,) example with multiple features
      w (ndarray): Shape (n,) model parameters
      b (scalar) : model parameter
    Returns:
      p (scalar) : prediction
    """
    p = np.dot(x, w) + b
    return p
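The two methods are interchangeable; a quick check with invented values (both functions restated so the snippet stands alone):

```python
import numpy as np

def predict_single_loop(x, w, b):
    # equation (1): accumulate the products term by term
    p = 0
    for i in range(x.shape[0]):
        p = p + x[i] * w[i]
    return p + b

def predict(x, w, b):
    # equation (2): vector dot product
    return np.dot(x, w) + b

x = np.array([1.0, 2.0, 3.0])    # invented example
w = np.array([0.5, 1.0, -1.0])   # invented parameters
b = 4.0

print(predict_single_loop(x, w, b), predict(x, w, b))  # 3.5 3.5
```

For large n, the np.dot version is also much faster, since NumPy uses optimized vectorized routines.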
Computing the cost function $J(\mathbf{w},b)$
The expression for $J(\mathbf{w},b)$ is:
J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 \tag{3}

where, in vector notation:

f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b \tag{4}
Method
def compute_cost(X, y, w, b):
    """
    compute cost
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
    Returns:
      cost (scalar) : cost
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b      # (n,)·(n,) = scalar (see np.dot)
        cost = cost + (f_wb_i - y[i])**2  # scalar
    cost = cost / (2 * m)                 # scalar
    return cost
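A sanity check on invented data (the function restated for a standalone run):

```python
import numpy as np

def compute_cost(X, y, w, b):
    # equation (3) with the vectorized model of equation (4)
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b
        cost = cost + (f_wb_i - y[i])**2
    return cost / (2 * m)

# invented data generated by y = x0 + 2*x1 + 3 exactly
X = np.array([[1.0, 1.0],
              [2.0, 0.0]])
y = np.array([6.0, 5.0])
w = np.array([1.0, 2.0])

print(compute_cost(X, y, w, b=3.0))  # exact fit -> 0.0
print(compute_cost(X, y, w, b=0.0))  # each prediction low by 3 -> 9/2 = 4.5
```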
Gradient descent
Parameter updates:

\begin{align*} \text{repeat }&\text{until convergence:} \; \lbrace \newline\; & w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \; & \text{for j = 0..n-1} \tag{5} \newline &b\ \ = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \newline \rbrace \end{align*}

The derivative terms:
\begin{align} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} \tag{6} \\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{7} \end{align}

where
$m$ is the number of examples in the dataset, $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model's prediction, and $y^{(i)}$ is the target value.
def compute_gradient(X, y, w, b):
    """
    Computes the gradient for linear regression
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w.
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b.
    """
    m, n = X.shape  # (number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    return dj_db, dj_dw
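A check on invented data (function restated to stand alone): at the parameters that generated the targets, every error term and hence the whole gradient is zero.

```python
import numpy as np

def compute_gradient(X, y, w, b):
    # equations (6) and (7)
    m, n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.0
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    return dj_db / m, dj_dw / m

X = np.array([[1.0, 1.0],
              [2.0, 0.0],
              [0.0, 3.0]])
w_true = np.array([1.0, 2.0])
y = X @ w_true + 3.0   # targets generated from w=[1, 2], b=3

dj_db, dj_dw = compute_gradient(X, y, w_true, 3.0)
print(dj_db, dj_dw)  # 0.0 [0. 0.]
```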
Implementing gradient descent
def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    """
    Performs batch gradient descent to learn w and b. Updates w and b by taking
    num_iters gradient steps with learning rate alpha
    Args:
      X (ndarray (m,n))   : Data, m examples with n features
      y (ndarray (m,))    : target values
      w_in (ndarray (n,)) : initial model parameters
      b_in (scalar)       : initial model parameter
      cost_function       : function to compute cost
      gradient_function   : function to compute the gradient
      alpha (float)       : learning rate
      num_iters (int)     : number of iterations to run gradient descent
    Returns:
      w (ndarray (n,)) : Updated values of parameters
      b (scalar)       : Updated value of parameter
      J_history (list) : History of cost values
    """
    # an array to store cost J at each iteration, primarily for graphing later
    J_history = []
    w = copy.deepcopy(w_in)  # avoid modifying global w within function
    b = b_in
    for i in range(num_iters):
        # calculate the gradient and update the parameters
        dj_db, dj_dw = gradient_function(X, y, w, b)
        # update the parameters using w, b, alpha and the gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        # save cost J at each iteration
        if i < 100000:  # prevent resource exhaustion
            J_history.append(cost_function(X, y, w, b))
        # print cost at 10 intervals, or every iteration if num_iters < 10
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f} ")
    return w, b, J_history  # return final w, b and the J history for graphing
Feature engineering
Concept: use domain knowledge and the existing features to create new features, typically by transforming or combining them, so that the algorithm can make accurate predictions more easily.
Example: a plot of land has the features length ($x_1$) and width ($x_2$), giving the regression model $f(x)=w_1 x_1 + w_2 x_2 + b$. But length and width can be combined into area ($x_3=x_1 x_2$), and the augmented regression model becomes $f(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$, which predicts the target value more accurately.
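A minimal sketch of the land example in code (all numbers invented): append the engineered area column to the feature matrix.

```python
import numpy as np

# invented plots of land: columns are length x1 and width x2
X = np.array([[10.0, 5.0],
              [20.0, 3.0],
              [15.0, 8.0]])

# engineered feature: area x3 = x1 * x2
area = X[:, 0] * X[:, 1]
X_aug = np.c_[X, area]   # shape (3, 3): columns [x1, x2, x3]

print(X_aug[:, 2])  # [ 50.  60. 120.]
```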
Polynomial regression
Applying feature engineering to the inputs yields a polynomial regression model, for example the curve model $f(\mathbf{x})=w_1 x_1 + w_2 x_2^2 + w_3 x_3^3 + b$. In essence, the best features are linear with respect to the target: if the target follows a quadratic, fitting it with the squared feature makes the plot of target against that feature a straight line. One way to see this is to treat each higher-order term of the polynomial as a single unit and then read the whole model as ordinary linear regression.
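The "treat the higher-order term as a unit" idea, sketched with a made-up quadratic target: after engineering the feature x², an ordinary linear least-squares fit recovers the relationship exactly.

```python
import numpy as np

x = np.arange(0.0, 20.0, 1.0)
y = x**2                     # quadratic target

# engineered feature: x squared; the model is linear in this new feature
feat = x**2
A = np.c_[feat, np.ones_like(feat)]   # design matrix [feature, intercept]
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

print(round(w, 6), round(b, 6))  # w ≈ 1, b ≈ 0: a straight line in x^2
```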
Z-score normalization
Formula (applied per feature column $j$):

x^{(i)}_j = \frac{x^{(i)}_j - \mu_j}{\sigma_j}

Code implementation:
# z-score normalization
def zscore_normalize_features(X):
    """
    computes X, z-score normalized by column
    Args:
      X (ndarray): Shape (m,n) input data, m examples, n features
    Returns:
      X_norm (ndarray): Shape (m,n) input normalized by column
      mu (ndarray)    : Shape (n,) mean of each feature
      sigma (ndarray) : Shape (n,) standard deviation of each feature
    """
    # find the mean of each column/feature
    mu = np.mean(X, axis=0)       # mu will have shape (n,)
    # find the standard deviation of each column/feature
    sigma = np.std(X, axis=0)     # sigma will have shape (n,)
    # element-wise: subtract the column mean from each example, divide by the column std
    X_norm = (X - mu) / sigma
    return (X_norm, mu, sigma)
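Verifying the result on a made-up matrix (function restated so the snippet stands alone): each normalized column should have mean ≈ 0 and standard deviation ≈ 1.

```python
import numpy as np

def zscore_normalize_features(X):
    mu = np.mean(X, axis=0)     # per-column mean, shape (n,)
    sigma = np.std(X, axis=0)   # per-column std, shape (n,)
    return (X - mu) / sigma, mu, sigma

# made-up features with very different scales
X = np.array([[100.0, 1.0],
              [200.0, 3.0],
              [300.0, 5.0]])

X_norm, mu, sigma = zscore_normalize_features(X)
print(mu)                        # [200.   3.]
print(np.mean(X_norm, axis=0))   # ~[0. 0.]
print(np.std(X_norm, axis=0))    # [1. 1.]
```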