张小染 2018-12-28
Classification is a very common and important variant of machine learning problems. Many machine learning algorithms have been devised to solve classification (discrete rather than continuous) problems. Examples of classification-based predictive analytics problems are:
1. Predicting whether a patient is Diabetic or Non-Diabetic.
2. Predicting whether a piece of text expresses Happiness/Sadness, or is a Compliment/Insult.
3. Recognizing a handwritten digit (0-9).
Problems 1 and 2 are examples of binary classification, where there are only 2 classes: Diabetic/Non-Diabetic and Happy/Sad or Compliment/Insult, respectively. Problem 3, however, has 10 classes, since there are 10 digits (0-9), so it requires multi-class classification.
Among the many machine learning classification algorithms, Logistic Regression is one of the most widely used and popular. It can be used for both binary and multi-class classification problems. In this article, I will explain Logistic Regression, its implementation in Python, and its application on a practical practice dataset.
Going by the name "Logistic", one might guess that a function called the Logistic function is involved in the hypothesis of this machine learning algorithm. The Sigmoid function is, in fact, also known as the Logistic function:
f(x) = 1 / (1 + e^(-x)), where f(x) is the Sigmoid function
Now, moving on to the hypothesis of Logistic Regression:
h(x) = 1 / (1 + e^-(θ_0 + θ_1·x_1 + θ_2·x_2 + ... + θ_n·x_n)), where θ_0, θ_1, θ_2, ..., θ_n are the parameters of Logistic Regression
Let us look at the graph of the Sigmoid function:
Graph of the Sigmoid function
So, the output of the Sigmoid function ranges from 0 to 1. For classification, however, a specific label indicating the class has to be predicted. In that case, a threshold (obviously a value between 0 and 1) needs to be set such that maximum predictive performance is obtained.
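As a quick illustration of this idea (not part of the original walkthrough), the sketch below evaluates the Sigmoid function with NumPy and applies a 0.5 threshold to turn probabilities into class labels; the input values are arbitrary.

import numpy as np

def sigmoid(z):
    # maps any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-6.0, -1.5, 0.0, 1.5, 6.0])  # arbitrary illustrative inputs
probs = sigmoid(z)                         # approximately [0.002, 0.18, 0.5, 0.82, 0.998]
labels = (probs > 0.5).astype(int)         # 0.5 used as the classification threshold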
Of course, wherever there is a hypothesis, a cost function is also involved:

J(θ) = -(1/m) · Σ [ y · log(h(x)) + (1 - y) · log(1 - h(x)) ]   (summed over the m training samples)

This cost function is also called the Binary Cross-Entropy function.
Here, the cost function has to be minimized, so the values of (θ_0, θ_1, θ_2, ..., θ_n) at which it attains its minimum need to be found.
Batch Gradient Descent can be used as the optimization technique to find this minimum.
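For intuition, each Batch Gradient Descent iteration applies the update θ_j := θ_j - (α/m) · Σ (h(x) - y) · x_j to every parameter, where m is the number of training samples and α the learning rate. A vectorized NumPy one-liner equivalent to the parameter loop inside the BGD() implementation shown later (assuming X already contains the intercept column of ones) would be:

# equivalent vectorized form of one gradient-descent step
theta = theta - (alpha / X.shape[0]) * np.matmul(X.transpose(), h - y)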
The implementation of Logistic Regression is done by creating 3 modules.
=> hypothesis(): the function that computes the output of the algorithm's hypothesis, given theta (θ, the list of θ_0, θ_1, θ_2, ..., θ_n), the feature set X and the number of features n. The implementation of hypothesis() is given below:
import numpy as np
from math import exp

def hypothesis(theta, X, n):
    h = np.ones((X.shape[0], 1))
    theta = theta.reshape(1, n+1)
    for i in range(0, X.shape[0]):
        # sigmoid of theta . x for the i-th sample
        h[i] = 1 / (1 + exp(-float(np.matmul(theta, X[i]))))
    h = h.reshape(X.shape[0])
    return h
=> BGD(): here, the Gradient Descent algorithm is implemented. It returns theta (θ, the list of θ_0, θ_1, θ_2, ..., θ_n) at which the minimum is attained, theta_history (containing the theta values at every iteration) and cost (containing the value of the cost function at every iteration), given the initial theta (the list of θ_0, θ_1, θ_2, ..., θ_n), alpha (the learning rate), num_iters (the number of iterations), h (the hypothesis values for all samples), the feature set X, the label set y and the number of features n. The implementation of BGD() is given below:
def BGD(theta, alpha, num_iters, h, X, y, n):
    theta_history = np.ones((num_iters, n+1))
    cost = np.ones(num_iters)
    for i in range(0, num_iters):
        # update of the bias parameter theta_0
        theta[0] = theta[0] - (alpha/X.shape[0]) * sum(h - y)
        # update of theta_1 ... theta_n
        for j in range(1, n+1):
            theta[j] = theta[j] - (alpha/X.shape[0]) * sum((h - y) * X.transpose()[j])
        theta_history[i] = theta
        h = hypothesis(theta, X, n)
        # binary cross-entropy cost at this iteration
        cost[i] = (-1/X.shape[0]) * sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    theta = theta.reshape(1, n+1)
    return theta, theta_history, cost
=> logistic_regression(): the principal function, which takes the feature set X, the label set y, the learning rate alpha and the number of iterations num_iters. The implementation of logistic_regression() is given below:
def logistic_regression(X, y, alpha, num_iters):
    n = X.shape[1]
    # adding the intercept column x_0 = 1 to the feature set...
    one_column = np.ones((X.shape[0], 1))
    X = np.concatenate((one_column, X), axis = 1)
    # initializing the parameter vector...
    theta = np.zeros(n+1)
    # hypothesis calculation....
    h = hypothesis(theta, X, n)
    # returning the optimized parameters by Gradient Descent...
    theta, theta_history, cost = BGD(theta, alpha, num_iters, h, X, y, n)
    return theta, theta_history, cost
The dataset contains the marks obtained by 100 students in 2 exams and a label (0/1) indicating whether the student will be admitted.
Problem statement: "Given the marks obtained in the 2 exams, predict whether a student will be admitted to the university or not, using Logistic Regression."
Reading the data into Numpy arrays:
data = np.loadtxt('dataset.txt', delimiter=',')
X_train = data[:, [0, 1]]  # feature-set
y_train = data[:, 2]       # label-set
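For this to work, dataset.txt is assumed to be a plain comma-separated text file with one student per line: the two exam marks followed by the 0/1 admission label. The line below is a made-up illustration of that layout, not an actual record from the dataset:

82.3,45.1,1   # exam-1 mark, exam-2 mark, admitted (1) / not admitted (0)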
Visualization of the dataset with a Scatter Plot:
x0 = np.ones((np.array([x for x in y_train if x == 0]).shape[0], X_train.shape[1]))
x1 = np.ones((np.array([x for x in y_train if x == 1]).shape[0], X_train.shape[1]))
# x0 and x1 are matrices containing the -ve (label 0) and +ve (label 1)
# examples from the dataset, initialized to 1
k0 = k1 = 0

for i in range(0, y_train.shape[0]):
    if y_train[i] == 0:
        x0[k0] = X_train[i]
        k0 = k0 + 1
    else:
        x1[k1] = X_train[i]
        k1 = k1 + 1

X = [x0, x1]
colors = ["green", "blue"]  # 2 distinct colours for the 2 classes

import matplotlib.pyplot as plt

for x, c in zip(X, colors):
    if c == "green":
        plt.scatter(x[:, 0], x[:, 1], color = c, label = "Not Admitted")
    else:
        plt.scatter(x[:, 0], x[:, 1], color = c, label = "Admitted")
plt.xlabel("Marks obtained in 1st Exam")
plt.ylabel("Marks obtained in 2nd Exam")
plt.legend()
Scatter Plot visualization of the dataset
Running the 3-module Logistic Regression:
# calling the principal function with learning_rate = 0.001 and
# num_iters = 100000
theta, theta_history, cost = logistic_regression(X_train, y_train, 0.001, 100000)
The output of theta looks like:
theta: array([[-4.81180027, 0.04528064, 0.03819149]])
theta (θ) after BGD of Logistic Regression
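With θ in hand, an unseen student can be scored by reusing the hypothesis() function defined above; the two marks below are hypothetical values chosen purely for illustration, and 0.5 is again used as the threshold.

# hypothetical marks of a new student in the 2 exams
new_student = np.array([[55.0, 70.0]])
# prepend the intercept term x_0 = 1, as logistic_regression() does internally
new_student = np.concatenate((np.ones((1, 1)), new_student), axis=1)
prob = hypothesis(theta, new_student, new_student.shape[1] - 1)[0]
admitted = 1 if prob > 0.5 else 0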
The obtained θ can be visualized by incorporating the Decision Boundary (the line separating the 2 classes, based on θ) in the Scatter Plot. The decision boundary is the set of points where θ_0 + θ_1·x_1 + θ_2·x_2 = 0, i.e. where the predicted probability is exactly 0.5:
plot_x = np.array([min(X_train[:, 0]) - 2, max(X_train[:, 0]) + 2])
# theta returned by BGD() has shape (1, n+1), hence the [0, j] indexing
plot_y = (-1/theta[0, 2]) * (theta[0, 1] * plot_x + theta[0, 0])
for x, c in zip(X, colors):
    if c == "green":
        plt.scatter(x[:, 0], x[:, 1], color = c, label = "Not Admitted")
    else:
        plt.scatter(x[:, 0], x[:, 1], color = c, label = "Admitted")
plt.plot(plot_x, plot_y, label = "Decision_Boundary")
plt.legend()
plt.xlabel("Marks obtained in 1st Exam")
plt.ylabel("Marks obtained in 2nd Exam")
The Scatter Plot with the Decision Boundary included looks like:
The gradual reduction of the cost function can be visualized with a Line Plot:
import matplotlib.pyplot as plt
cost = list(cost)
n_iterations = [x for x in range(1, 100001)]
plt.plot(n_iterations, cost)
plt.xlabel('No. of iterations')
plt.ylabel('Cost')
Line-Curve:
Line-Curve representation of cost minimization using BGD for Logistic Regression
Model performance analysis:
In classification, model performance is analyzed using the following metrics: Accuracy, Precision, Recall and F1-Score. First, the predictions on the training set are obtained, using 0.5 as the threshold:
X_train = np.concatenate((np.ones((X_train.shape[0], 1)), X_train), axis = 1)
h = hypothesis(theta, X_train, X_train.shape[1] - 1)
# Taking 0.5 as threshold:
for i in range(0, h.shape[0]):
    if h[i] > 0.5:
        h[i] = 1
    else:
        h[i] = 0
=> Accuracy: the ratio of the number of correctly predicted samples to the total number of samples.
Finding the Accuracy:
k = 0
for i in range(0, h.shape[0]):
    if h[i] == y_train[i]:
        k = k + 1
accuracy = k / y_train.shape[0]
The output of accuracy looks like:
accuracy: 0.91
Output of Accuracy => 91% accuracy
=> Precision: the ratio of the number of correctly predicted positive observations to the total number of predicted positive observations.
Finding the Precision:
tp = fp = 0  # tp -> True Positives, fp -> False Positives
# here, label 0 (Not Admitted) is treated as the positive class
for i in range(0, h.shape[0]):
    if h[i] == y_train[i] == 0:
        tp = tp + 1
    elif h[i] == 0 and y_train[i] == 1:
        fp = fp + 1
precision = tp / (tp + fp)
Output of Precision:
precision: 1.0
=> Recall: the proportion of actual positives that are correctly identified.
Finding the Recall:
fn = 0  # fn -> False Negatives
for i in range(0, h.shape[0]):
    if h[i] == 1 and y_train[i] == 0:
        fn = fn + 1
recall = tp / (tp + fn)
Output of Recall:
recall: 0.775
=> F1-Score: the harmonic mean of Precision and Recall.
Finding the F1-Score:
f1_score = (2 * precision * recall)/(precision + recall)
Output of F1-Score:
f1_score: 0.8732394366197184
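As an optional cross-check (this assumes scikit-learn is installed; it is not used anywhere else in this article), the same four metrics can be computed with sklearn.metrics. Note pos_label=0, because label 0 is treated as the positive class in the hand-written code above.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(accuracy_score(y_train, h))                # should match the accuracy above
print(precision_score(y_train, h, pos_label=0))  # should match the precision above
print(recall_score(y_train, h, pos_label=0))     # should match the recall above
print(f1_score(y_train, h, pos_label=0))         # should match the F1-Score above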
Confusion Matrix:
Structure of the Confusion Matrix
tn = 0  # tn -> True Negatives
for i in range(0, h.shape[0]):
    if h[i] == y_train[i] == 1:
        tn = tn + 1
cm = np.array([[tp, fn], [fp, tn]])

# MODULE FOR CONFUSION MATRIX
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

plt.figure()
# Un-Normalized Confusion Matrix...
plot_confusion_matrix(cm, classes=[0, 1], normalize=False,
                      title='Unnormalized Confusion Matrix')
# Normalized Confusion Matrix...
plot_confusion_matrix(cm, classes=[0, 1], normalize=True,
                      title='Normalized Confusion Matrix')
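Again as an optional cross-check (assuming scikit-learn), sklearn's confusion_matrix should reproduce cm as built above: with the labels sorted as [0, 1], its rows are the true labels and its columns the predicted labels, which is exactly the [[tp, fn], [fp, tn]] layout used here since label 0 plays the role of the positive class.

from sklearn.metrics import confusion_matrix

# rows: true label 0, 1 ; columns: predicted label 0, 1
print(confusion_matrix(y_train, h))  # expected to equal cm from above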
The above example was actually an application of binary classification. To solve a multi-class classification problem, the binary classification strategy can be repeated N times, where N is the number of classes, using the One-vs-All (One-vs-Rest) concept, as sketched below.
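The sketch below is a minimal, illustrative One-vs-All implementation built on the logistic_regression() and hypothesis() functions defined earlier; the default alpha and num_iters values are assumptions, not tuned settings.

import numpy as np

def one_vs_all(X, y, num_classes, alpha=0.001, num_iters=100000):
    # train one binary Logistic Regression classifier per class (class c vs. the rest)
    thetas = []
    for c in range(num_classes):
        y_c = (y == c).astype(float)  # 1 for samples of class c, 0 for all others
        theta_c, _, _ = logistic_regression(X, y_c, alpha, num_iters)
        thetas.append(theta_c)
    return thetas

def predict_one_vs_all(thetas, X):
    # prepend the intercept column, then pick the class with the highest probability
    X1 = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
    probs = np.array([hypothesis(t, X1, X1.shape[1] - 1) for t in thetas])
    return np.argmax(probs, axis=0)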
That is all about Logistic Regression in Python.
" \ \ / /_ | / | _ \ / | / / _ | \ | | | / |. " \ \ / / | || |/| | |) | | | | | | | | | | | | | | _.