Table of Contents
- 1. Introduction
- 2. Implementation
- 2.1 Visualizing the data
- 2.2 The sigmoid function
- 2.3 The cost function (costFunction)
- 2.4 Data preparation
- 2.5 The gradient function
- 2.5.1 Gradient at the initial parameters (all zeros)
- 2.5.2 Finding optimal parameters with SciPy's truncated Newton (TNC) solver
- 2.5.3 Evaluating the cost
- 2.6 Measuring accuracy
- 3. Full logistic regression implementation
- 4. Regularized logistic regression (reducing overfitting, improving generalization)
- 4.1 The regularized cost function
- 5. Full regularized implementation
1. Introduction
Today I worked through the logistic regression part of Andrew Ng's machine learning course. Logistic regression is a classification algorithm, and its key ingredient is the sigmoid function (also known as the logistic function).
This post walks through my implementation. The data comes from the course's dataset (student admission scores with a 0/1 label). Suppose you are a university administrator who must decide, based on an applicant's scores on two exams, whether to admit them. You have a training set of past applicants: for each one, the two exam scores and the final admission decision. Logistic regression can then be trained on this set to predict the outcome for new applicants.
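For reference, the model used throughout is standard logistic regression: the hypothesis is $h_\theta(x) = g(\theta^T x)$, where $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid, and a sample is classified as positive when $h_\theta(x) \ge 0.5$.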
2. Implementation
2.1 Visualizing the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# load the training data: two exam scores per applicant plus a 0/1 admission label
path = r'C:\Users\Administrator\Desktop\logisticRegression_1.txt'
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
data.head()

# plot the two classes with distinct markers
positive = data[data['Admitted'].isin([1])]
negative = data[data['Admitted'].isin([0])]
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x', label='Not Admitted')
ax.legend()
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')
plt.show()
2.2 The sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# quick sanity check: plot the sigmoid over [-10, 10)
nums = np.arange(-10, 10, step=1)
fig, ax = plt.subplots(figsize=(8,5))
ax.plot(nums, sigmoid(nums), 'r')
plt.show()
2.3 The cost function (costFunction)
def cost(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    # cross-entropy: -y*log(h) - (1-y)*log(1-h), averaged over the m samples
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / (len(X))
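This implements the standard cross-entropy cost, averaged over the $m$ training samples:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\big(h_\theta(x^{(i)})\big) - \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \right]$$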
2.4 Data preparation
# prepend a column of ones so that theta0 acts as the intercept term
data.insert(0, 'Ones', 1)
cols = data.shape[1]
X = data.iloc[:,0:cols-1]     # every column except the label
y = data.iloc[:,cols-1:cols]  # the 'Admitted' label
X = np.array(X.values)
y = np.array(y.values)
theta = np.zeros(3)
Evaluating the cost at the initial parameters:
cost(theta, X, y)
0.6931471805599453
This is exactly ln 2: with theta = 0, the model outputs h(x) = 0.5 for every sample, so the cross-entropy reduces to log 2, which makes it a convenient sanity check.
2.5 The gradient function
Note: this function does not actually perform gradient descent; it only computes the gradient, i.e. one step's worth of partial derivatives. The search for the cost-minimizing parameters is delegated to an optimizer, just like Octave's fminunc() function; in Python, SciPy's optimize module plays the same role.
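Concretely, the partial derivative computed for each parameter $\theta_j$ is

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}$$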
def gradient(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)
    error = sigmoid(X * theta.T) - y
    # one partial derivative per parameter: (1/m) * sum(error * x_j)
    for i in range(parameters):
        term = np.multiply(error, X[:,i])
        grad[i] = np.sum(term) / len(X)
    return grad
I have not fully digested this part yet; I still need to study the underlying optimization algorithms and will come back to it later!
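In the meantime, a fully vectorized equivalent may make the loop easier to follow (a sketch of my own; `gradient_vec` is not part of the course code, and it assumes X and y are the plain NumPy arrays prepared in section 2.4):

def gradient_vec(theta, X, y):
    # grad = X^T (sigmoid(X @ theta) - y) / m, identical to the loop above
    error = sigmoid(X @ theta) - y.ravel()
    return X.T @ error / len(X)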
2.5.1 Gradient at the initial parameters (all zeros):
gradient(theta, X, y)
array([ -0.1 , -12.00921659, -11.26284221])
2.5.2 Finding optimal parameters with SciPy's truncated Newton (TNC) solver:
import scipy.optimize as opt
# fmin_tnc returns (optimized parameters, number of function evaluations, return code)
result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
result
(array([-25.16131863, 0.20623159, 0.20147149]), 36, 0)
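Newer SciPy versions steer users toward the unified interface instead of fmin_tnc; as far as I can tell, the equivalent call would be (a sketch, not part of the original exercise):

res = opt.minimize(fun=cost, x0=theta, jac=gradient, args=(X, y), method='TNC')
res.x  # the optimized parameters, matching result[0] above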
2.5.3 Evaluating the cost:
cost(result[0], X, y)
0.20349770158947458 (much better than the 0.693 computed earlier, so the optimizer did its job)
2.6 Measuring accuracy
def predict(theta, X):
    # classify as admitted when the predicted probability reaches 0.5
    probability = sigmoid(X * theta.T)
    return [1 if x >= 0.5 else 0 for x in probability]

theta_min = np.matrix(result[0])
predictions = predict(theta_min, X)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]
# fraction of correct predictions as a percentage; the original snippet used
# `% len(correct)` (modulo) here, which only gave the right answer because
# this dataset happens to contain exactly 100 samples
accuracy = sum(map(int, correct)) / len(correct) * 100
print('accuracy = {0}%'.format(accuracy))
accuracy = 89.0% (a good result, though it is measured on the training set, so it may overstate performance on unseen data)
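Since the hypothesis is linear in the two scores, the decision boundary theta0 + theta1*x1 + theta2*x2 = 0 is a straight line, and it is worth drawing over the scatter plot (a sketch reusing the variables defined earlier):

# solve theta0 + theta1*x1 + theta2*x2 = 0 for x2 and plot the resulting line
coef = result[0]
plot_x = np.linspace(30, 100, 100)
plot_y = -(coef[0] + coef[1] * plot_x) / coef[2]
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x', label='Not Admitted')
ax.plot(plot_x, plot_y, 'g', label='Decision boundary')
ax.legend()
plt.show()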
3. Full logistic regression implementation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from collections import OrderedDict
def inputData():
    # header=None: the file contains raw values only, with no header row
    data = pd.read_csv('C:\\Users\\Administrator\\Desktop\\logisticRegression_1.txt',
                       header=None, dtype={0: float, 1: float, 2: int})
    # prepend the intercept column of ones
    data.insert(0, "one", [1 for i in range(0, data.shape[0])])
    X = data.iloc[:, 0:3].values
    y = data.iloc[:, 3].values
    y = y.reshape(y.shape[0], 1)
    return X, y
def showData(X, y):
    for i in range(0, X.shape[0]):
        if (y[i, 0] == 1):
            plt.scatter(X[i, 1], X[i, 2], marker='o', s=50, c='b', label='Admitted')
        elif (y[i, 0] == 0):
            plt.scatter(X[i, 1], X[i, 2], marker='x', s=50, c='r', label='Not admitted')
    plt.xticks(np.arange(30, 110, 10))
    plt.yticks(np.arange(30, 110, 10))
    plt.xlabel('Exam 1 score')
    plt.ylabel('Exam 2 score')
    # deduplicate legend entries (each scatter call repeats its label)
    handles, labels = plt.gca().get_legend_handles_labels()
    by_label = OrderedDict(zip(labels, handles))
    plt.legend(by_label.values(), by_label.keys())
    plt.show()
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def showCostsJ(X, y, theta, m):
    # cross-entropy cost; the 1e-6 inside each log guards against log(0)
    costsJ = ((y * np.log(sigmoid(X@theta) + 1e-6)) + ((1 - y) * np.log(1 - sigmoid(X@theta) + 1e-6))).sum() / (-m)
    return costsJ
def gradientDescent(X, y, theta, m, alpha, iterations):
    # batch gradient descent; note the usual 1/m factor is not written out
    # explicitly here, so the small alpha effectively absorbs it
    for i in range(0, iterations):
        ys = sigmoid(X@theta) - y
        # compute all updates before assigning, so the parameters change simultaneously
        temp0 = theta[0][0] - alpha * (ys * (X[:, 0].reshape(X.shape[0], 1))).sum()
        temp1 = theta[1][0] - alpha * (ys * (X[:, 1].reshape(X.shape[0], 1))).sum()
        temp2 = theta[2][0] - alpha * (ys * (X[:, 2].reshape(X.shape[0], 1))).sum()
        theta[0][0] = temp0
        theta[1][0] = temp1
        theta[2][0] = temp2
    return theta
def evaluateLogisticRegression(X, y, theta):
    for i in range(0, X.shape[0]):
        if (y[i, 0] == 1):
            plt.scatter(X[i, 1], X[i, 2], marker='o', s=50, c='b', label='Admitted')
        elif (y[i, 0] == 0):
            plt.scatter(X[i, 1], X[i, 2], marker='x', s=50, c='r', label='Not admitted')
    plt.xticks(np.arange(30, 110, 10))
    plt.yticks(np.arange(30, 110, 10))
    plt.xlabel('Exam 1 score')
    plt.ylabel('Exam 2 score')
    handles, labels = plt.gca().get_legend_handles_labels()
    by_label = OrderedDict(zip(labels, handles))
    plt.legend(by_label.values(), by_label.keys())
    # decision boundary: theta0 + theta1*x + theta2*y = 0, solved for y
    minX = np.min(X[:, 1])
    maxX = np.max(X[:, 1])
    xx = np.linspace(minX, maxX, 100)
    yy = (theta[0][0] + theta[1][0] * xx) / (-theta[2][0])
    plt.plot(xx, yy)
    plt.show()
def judge(X, y, theta):
    # a prediction counts as correct when |y - h(x)| < 0.5, i.e. the
    # predicted probability falls on the correct side of 0.5
    ys = sigmoid(X@theta)
    yanswer = np.abs(y - ys)
    print('accuracy:', (yanswer < 0.5).sum() / y.shape[0] * 100, '%')
X,y = inputData()
theta=np.zeros((3,1))
alpha=0.0002
iterations=200000
theta=gradientDescent(X,y,theta,X.shape[0],alpha,iterations)
judge(X,y,theta)
evaluateLogisticRegression(X,y,theta)
accuracy: 91.91919191919192 %
4. Regularized logistic regression (reducing overfitting, improving generalization)
When a model has many features and fewer training samples than features, it tends to overfit; an overfit model generalizes poorly and cannot be trusted on new data. There are two main remedies:
- 1. Reduce the number of features;
- 2. Regularization.
Regularization keeps all the features but shrinks the size or scale of the parameters, applying a penalty whose strength is controlled by the regularization parameter lambda.
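Formally, the regularized cost adds a penalty term to the cross-entropy, with $\theta_0$ conventionally excluded from the penalty:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log h_\theta(x^{(i)}) - (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$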
(The three figures, omitted here, show underfitting, an appropriate fit, and overfitting, respectively.)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt
path = r'C:\Users\Administrator\Desktop\logisticRegression_2.txt'
data2 = pd.read_csv(path, header=None, names=['Test 1', 'Test 2', 'Accepted'])
data2.head()
positive = data2[data2['Accepted'].isin([1])]
negative = data2[data2['Accepted'].isin([0])]
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(positive['Test 1'], positive['Test 2'], s=50, c='b', marker='o', label='Accepted')
ax.scatter(negative['Test 1'], negative['Test 2'], s=50, c='r', marker='x', label='Rejected')
ax.legend()
ax.set_xlabel('Test 1 Score')
ax.set_ylabel('Test 2 Score')
plt.show()
degree = 5
x1 = data2['Test 1']
x2 = data2['Test 2']
data2.insert(3, 'Ones', 1)
# build polynomial features from the two raw scores
for i in range(1, degree):
    for j in range(0, i):
        data2['F' + str(i) + str(j)] = np.power(x1, i-j) * np.power(x2, j)
# the raw scores are no longer needed once the mapped features exist
data2.drop('Test 1', axis=1, inplace=True)
data2.drop('Test 2', axis=1, inplace=True)
data2.head()
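In other words, the loop maps the two raw scores into the polynomial terms $F_{ij} = x_1^{\,i-j} x_2^{\,j}$ for $1 \le i \le 4$ and $0 \le j < i$, i.e. 10 mapped features plus the column of ones, so that a model linear in the mapped space can express a curved decision boundary.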
4.1 The regularized cost function
Implementation:
def costReg(theta, X, y, learningRate):
    # despite its name, 'learningRate' here is the regularization parameter lambda
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    # penalty over theta[1:] only; the intercept theta0 is never regularized
    reg = (learningRate / (2 * len(X))) * np.sum(np.power(theta[:,1:theta.shape[1]], 2))
    return np.sum(first - second) / len(X) + reg
Following the regularization principle from Andrew Ng's lectures, the gradient splits into two cases, since theta0 must not be regularized:
def gradientReg(theta, X, y, learningRate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)
    error = sigmoid(X * theta.T) - y
    for i in range(parameters):
        term = np.multiply(error, X[:,i])
        if (i == 0):
            # the intercept gets the plain, unregularized gradient
            grad[i] = np.sum(term) / len(X)
        else:
            grad[i] = (np.sum(term) / len(X)) + ((learningRate / len(X)) * theta[0,i])
    return grad
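Both functions can then be handed to the same TNC solver as in section 2.5.2 (a sketch; X2, y2 and theta2 are my own names, extracted from the mapped data2 whose first remaining column is the label):

cols = data2.shape[1]
X2 = np.array(data2.iloc[:, 1:cols].values)  # 'Ones' column plus the mapped features
y2 = np.array(data2.iloc[:, 0:1].values)     # the 'Accepted' label
theta2 = np.zeros(cols - 1)
learningRate = 1  # i.e. lambda, the regularization strength
result2 = opt.fmin_tnc(func=costReg, x0=theta2, fprime=gradientReg, args=(X2, y2, learningRate))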
5. Full regularized implementation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import OrderedDict
def inputData():
    # header=None: the file contains raw values only, with no header row
    data = pd.read_csv('C:\\Users\\Administrator\\Desktop\\logisticRegression_2.txt',
                       header=None, dtype={0: float, 1: float, 2: int})
    data.insert(0, "ones", np.ones((data.shape[0], 1)))
    X = data.iloc[:, 0:3]
    X = X.values
    y = data.iloc[:, 3]
    y = (y.values).reshape(y.shape[0], 1)
    return X, y
def showData(X, y):
    for i in range(0, X.shape[0]):
        if (y[i][0] == 1):
            plt.scatter(X[i, 1], X[i, 2], marker='o', s=50, c='b', label='y=1')
        else:
            plt.scatter(X[i, 1], X[i, 2], marker='x', s=50, c='r', label='y=0')
    plt.xticks(np.arange(-1, 1.5, 0.5))
    plt.yticks(np.arange(-0.8, 1.2, 0.2))
    plt.xlabel('Microchip Test 1')
    plt.ylabel('Microchip Test 2')
    # deduplicate legend entries (each scatter call repeats its label)
    handles, labels = plt.gca().get_legend_handles_labels()
    by_label = OrderedDict(zip(labels, handles))
    plt.legend(by_label.values(), by_label.keys())
    plt.show()
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def computeCostsJ(X, y, theta, lamda, m):
    # regularized cross-entropy; note this version penalizes the full theta
    # vector, including theta[0], a slight deviation from the lecture notes
    hx = sigmoid(X@theta)
    costsJ = -(np.sum(y * np.log(hx) + (1 - y) * np.log(1 - hx))) / m + lamda * np.sum(np.power(theta, 2)) / (2 * m)
    return costsJ
def featureMapping(x1, x2, level):
    # generate every term x1^(i-j) * x2^j for 1 <= i <= level, 0 <= j <= i
    answer = {}
    for i in range(1, level + 1):
        for j in range(0, i + 1):
            answer['F{}{}'.format(i-j, j)] = np.power(x1, i-j) * np.power(x2, j)
    answer = pd.DataFrame(answer)
    answer.insert(0, "ones", np.ones((answer.shape[0], 1)))
    return answer.values
def gradientDescent(X, y, theta, alpha, iterations, m, lamda):
    for i in range(0, iterations + 1):
        hx = sigmoid(X@theta)
        # compute theta0's update first, without the regularization term
        temp0 = theta[0][0] - alpha * np.sum(hx - y) / m
        # regularized update for every component...
        theta = theta - alpha * (X.T@(hx - y) + lamda * theta) / m
        # ...then restore the unregularized update for theta0
        theta[0][0] = temp0
    return theta
def judge(X, y, theta):
    # a prediction counts as correct when the predicted probability
    # falls on the correct side of 0.5
    ys = sigmoid(X@theta)
    yanswer = np.abs(y - ys)
    print('accuracy', (yanswer < 0.5).sum() / y.shape[0] * 100, '%')
def evaluateLogisticRegression(X, y, theta):
    for i in range(0, X.shape[0]):
        if (y[i][0] == 1):
            plt.scatter(X[i, 1], X[i, 2], marker='o', s=50, c='b', label='y=1')
        else:
            plt.scatter(X[i, 1], X[i, 2], marker='x', s=50, c='r', label='y=0')
    plt.xticks(np.arange(-1, 1.5, 0.5))
    plt.yticks(np.arange(-0.8, 1.2, 0.2))
    plt.xlabel('Microchip Test 1')
    plt.ylabel('Microchip Test 2')
    handles, labels = plt.gca().get_legend_handles_labels()
    by_label = OrderedDict(zip(labels, handles))
    plt.legend(by_label.values(), by_label.keys())
    # draw the decision boundary as the zero contour of theta^T * mapped(x)
    x = np.linspace(-1, 1.5, 250)
    xx, yy = np.meshgrid(x, x)
    answerMapping = featureMapping(xx.ravel(), yy.ravel(), 6)
    answer = answerMapping@theta
    answer = answer.reshape(xx.shape)
    plt.contour(xx, yy, answer, 0)
    plt.show()
X,y = inputData()
theta = np.zeros((28, 1))  # one intercept plus 27 mapped features for level 6
iterations=200000
alpha=0.001
lamda=50
mappingX = featureMapping(X[:,1],X[:,2],6)
theta=gradientDescent(mappingX,y,theta,alpha,iterations,X.shape[0],lamda)
judge(mappingX,y,theta)
evaluateLogisticRegression(X,y,theta)
accuracy 65.8119658119658 %
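A lambda of 50 is a heavy penalty for this dataset and most likely under-fits, which would explain the modest accuracy. As a quick follow-up experiment (a sketch; I have not recorded its output here), one could retrain with a milder penalty and compare the two boundaries:

# hypothetical experiment: same setup, but lambda = 1 instead of 50
theta_mild = gradientDescent(mappingX, y, np.zeros((28, 1)), alpha, iterations, X.shape[0], 1)
judge(mappingX, y, theta_mild)
evaluateLogisticRegression(X, y, theta_mild)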