Logistic Regression in Python

张小染 2018-12-28

Classification is a very common and important variant of machine learning problems. Many machine learning algorithms have been developed to solve classification (discrete rather than continuous) problems. Examples of classification-based predictive analytics problems are:

  1. Diabetic Retinopathy: given a retinal image, classify the image (eye) as diabetic or non-diabetic.
  2. Sentiment Analysis: given a sentence, analyze its sentiment (e.g., happy/sad, praise/insult, etc.).
  3. Digit Recognition: given an image of a digit, recognize the digit (0-9). This is an example of multi-class classification.

Problems 1 and 2 are examples of binary classification, with only 2 classes each: diabetic/non-diabetic and happy/sad or praise/insult respectively. Problem 3, however, has 10 classes, since there are 10 digits (0-9), so it requires multi-class classification.

Among the many machine learning classification algorithms, Logistic Regression is one of the most widely used and most popular. It can be used for both binary and multi-class classification problems. In this article, I will explain Logistic Regression, its implementation in Python, and its application to a practical practice dataset.

As the name "Logistic" suggests, there is a function called the Logistic function involved in the hypothesis of this machine learning algorithm. The Sigmoid function is also known as the Logistic function.

f(x) = 1 / (1 + e^(-x))

where f(x) is the Sigmoid function.
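As a minimal sketch (sigmoid is a helper name introduced here for illustration, not part of the implementation later in the article), the Sigmoid function can be written in Python as:

import numpy as np

def sigmoid(z):
    # squashes any real-valued input into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))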

Now, moving on to the hypothesis of Logistic Regression:

h_theta(x) = 1 / (1 + e^(-(theta_0 + theta_1*x_1 + theta_2*x_2 + ... + theta_n*x_n)))

where theta_0, theta_1, theta_2, ..., theta_n are the parameters of Logistic Regression.

Let's look at the graph of the Sigmoid function:

[Figure: Graph of the Sigmoid function]

Therefore, the output of the Sigmoid function ranges from 0 to 1. But for classification, a specific label indicating the class has to be predicted. In that case, a threshold (obviously a value between 0 and 1) needs to be set so as to obtain the best predictive performance.
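For example, with a threshold of 0.5 (the value used later in this article), class labels could be derived from the Sigmoid outputs roughly as follows (h here is a made-up array of example outputs):

import numpy as np

h = np.array([0.12, 0.47, 0.63, 0.91])  # example Sigmoid outputs
predictions = (h > 0.5).astype(int)     # -> array([0, 0, 1, 1])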

Of course, where there is a hypothesis, there is also a cost function.

J(theta) = -(1/m) * sum[ y * log(h_theta(x)) + (1 - y) * log(1 - h_theta(x)) ]

where m is the number of samples. This cost function is also known as the Binary Cross-Entropy function.
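A minimal sketch of this cost in Python might look as follows (cost_function is an illustrative helper name; h is assumed to hold the hypothesis outputs and y the 0/1 labels):

import numpy as np

def cost_function(h, y):
    # binary cross-entropy averaged over all m samples
    m = y.shape[0]
    return (-1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))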

Here, the cost function must be minimized. Therefore, the values of (theta_0, theta_1, theta_2, ..., theta_n) at which the cost is minimal need to be found.

Batch Gradient Descent can be used as the optimization technique to find this minimum.
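For intuition, a single Batch Gradient Descent update can be sketched in vectorized form as below (gradient_step is an illustrative helper, not the article's implementation, which follows below; X is assumed to already include a leading column of ones):

import numpy as np

def gradient_step(theta, X, y, alpha):
    # one batch update over all parameters at once
    m = X.shape[0]
    h = 1 / (1 + np.exp(-np.matmul(X, theta)))   # hypothesis for every sample
    gradient = np.matmul(X.T, h - y) / m         # partial derivatives of the cost
    return theta - alpha * gradient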

The implementation of Logistic Regression is done by creating 3 modules.

=> hypothesis(): the function that computes the output of the algorithm's hypothesis, given theta (the list of theta_0, theta_1, theta_2, ..., theta_n), the feature set X, and the number of features n. The implementation of hypothesis() is as follows:

import numpy as np
from math import exp

def hypothesis(theta, X, n):
    # returns the Sigmoid of theta^T x for every sample in X
    h = np.ones((X.shape[0],1))
    theta = theta.reshape(1,n+1)
    for i in range(0,X.shape[0]):
        h[i] = 1 / (1 + exp(-float(np.matmul(theta, X[i]))))
    h = h.reshape(X.shape[0])
    return h
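As a side note, the per-sample loop above can be replaced by a single matrix operation; a hedged sketch (hypothesis_vectorized is an illustrative name, assuming the same inputs as hypothesis()):

import numpy as np

def hypothesis_vectorized(theta, X, n):
    # same result as hypothesis(), computed with one matrix product
    theta = theta.reshape(1, n + 1)
    h = 1 / (1 + np.exp(-np.matmul(X, theta.T)))
    return h.reshape(X.shape[0])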

=> BGD(): here, the Gradient Descent algorithm is implemented. It returns theta (the list of theta_0, theta_1, theta_2, ..., theta_n) at which the cost is minimal, theta_history (containing the theta values at every iteration), and cost (containing the value of the cost function at every iteration), given the initial theta (the list of theta_0, theta_1, theta_2, ..., theta_n), alpha (the learning rate), num_iters (the number of iterations), h (the hypothesis values for all samples), the feature set X, the label set y, and the number of features n. The implementation of BGD() is as follows:

def BGD(theta, alpha, num_iters, h, X, y, n):
    theta_history = np.ones((num_iters,n+1))
    cost = np.ones(num_iters)
    for i in range(0,num_iters):
        theta[0] = theta[0] - (alpha/X.shape[0]) * sum(h - y)
        for j in range(1,n+1):
            theta[j] = theta[j] - (alpha/X.shape[0]) * sum((h - y)
                                                           * X.transpose()[j])
        theta_history[i] = theta
        h = hypothesis(theta, X, n)
        cost[i] = (-1/X.shape[0]) * sum(y*np.log(h) + (1-y)*np.log(1 - h))
    theta = theta.reshape(1,n+1)
    return theta, theta_history, cost

=> logistic_regression(): the principal function, which takes the feature set X, the label set y, the learning rate alpha, and the number of iterations (num_iters). The implementation of logistic_regression() is as follows:

def logistic_regression(X, y, alpha, num_iters):
    n = X.shape[1]
    one_column = np.ones((X.shape[0],1))
    X = np.concatenate((one_column, X), axis = 1)
    # initializing the parameter vector...
    theta = np.zeros(n+1)
    # hypothesis calculation....
    h = hypothesis(theta, X, n)
    # returning the optimized parameters by Gradient Descent...
    theta, theta_history, cost = BGD(theta, alpha, num_iters, h, X, y, n)
    return theta, theta_history, cost

The dataset contains the marks obtained by 100 students in 2 exams and a label (0/1) indicating whether the student will be admitted.

Problem statement: "Given the marks obtained in the 2 exams, predict whether a student will be admitted to the university, using Logistic Regression."

Reading the data into Numpy arrays:

data = np.loadtxt('dataset.txt', delimiter=',')
X_train = data[:,[0,1]] # feature-set
y_train = data[:,2] # label-set
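As a quick sanity check, the shapes of the loaded arrays can be inspected (the expected values follow from the 100-student dataset described above):

print(X_train.shape)  # expected: (100, 2) -> two exam scores per student
print(y_train.shape)  # expected: (100,)   -> one admission label per student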

Visualizing the dataset with a scatter plot:

x0 = np.ones((np.array([x for x in y_train if x == 0]).shape[0],
              X_train.shape[1]))
x1 = np.ones((np.array([x for x in y_train if x == 1]).shape[0],
              X_train.shape[1]))
# x0 and x1 are matrices holding the label-0 (Not Admitted) and
# label-1 (Admitted) examples from the dataset, initialized to 1

k0 = k1 = 0
for i in range(0,y_train.shape[0]):
    if y_train[i] == 0:
        x0[k0] = X_train[i]
        k0 = k0 + 1
    else:
        x1[k1] = X_train[i]
        k1 = k1 + 1

X = [x0, x1]
colors = ["green", "blue"]  # 2 distinct colours for 2 classes

import matplotlib.pyplot as plt
for x, c in zip(X, colors):
    if c == "green":
        plt.scatter(x[:,0], x[:,1], color = c, label = "Not Admitted")
    else:
        plt.scatter(x[:,0], x[:,1], color = c, label = "Admitted")
plt.xlabel("Marks obtained in 1st Exam")
plt.ylabel("Marks obtained in 2nd Exam")
plt.legend()

[Figure: Scatter plot visualization of the dataset]

Running Logistic Regression using the 3 modules:

# calling the principal function with learning_rate = 0.001 and
# num_iters = 100000
theta, theta_history, cost = logistic_regression(X_train, y_train, 0.001, 100000)

The output of theta is as follows:

theta
array([[-4.81180027, 0.04528064, 0.03819149]])

theta (θ) after BGD for Logistic Regression

The obtained theta can be visualized by incorporating the Decision Boundary (the line separating the two classes, derived from theta) into the scatter plot. Setting theta_0 + theta_1*x1 + theta_2*x2 = 0 and solving for x2 gives x2 = -(theta_1*x1 + theta_0) / theta_2, which is exactly what plot_y computes below:

theta = theta.flatten()  # BGD() returns theta with shape (1, n+1); flatten it for easy indexing
plot_x = np.array([min(X_train[:,0]) - 2, max(X_train[:,0]) + 2])
plot_y = (-1/theta[2]) * (theta[1] * plot_x + theta[0])

for x, c in zip(X, colors):
    if c == "green":
        plt.scatter(x[:,0], x[:,1], color = c, label = "Not Admitted")
    else:
        plt.scatter(x[:,0], x[:,1], color = c, label = "Admitted")
plt.plot(plot_x, plot_y, label = "Decision_Boundary")
plt.legend()
plt.xlabel("Marks obtained in 1st Exam")
plt.ylabel("Marks obtained in 2nd Exam")

The scatter plot with the Decision Boundary included looks like this:

[Figure: Scatter plot with the Decision Boundary]

The gradual decrease of the cost function can be visualized with a line plot:

import matplotlib.pyplot as plt
cost = list(cost)
n_iterations = [x for x in range(1,100001)]
plt.plot(n_iterations, cost)
plt.xlabel('No. of iterations')
plt.ylabel('Cost')

Line-Curve:

[Figure: Line curve of cost minimization using BGD for Logistic Regression]

Model Performance Analysis

In classification, model performance analysis is based on the following metrics. Before computing them, the bias column is added back to X_train and the 0.5 threshold is applied to the hypothesis outputs to obtain the predicted labels:

X_train = np.concatenate((np.ones((X_train.shape[0],1)), X_train), axis = 1)
h = hypothesis(theta, X_train, X_train.shape[1] - 1)
# Taking 0.5 as the threshold:
for i in range(0, h.shape[0]):
    if h[i] > 0.5:
        h[i] = 1
    else:
        h[i] = 0

=> Accuracy: the ratio of the number of correctly predicted samples to the total number of samples.

Finding the accuracy:

k = 0
for i in range(0, h.shape[0]):
    if h[i] == y_train[i]:
        k = k + 1
accuracy = k/y_train.shape[0]

The output of accuracy is as follows:

accuracy
0.91

Accuracy output => 91% accuracy

=> Precision: the ratio of the number of correctly predicted positive observations to the total number of predicted positive observations.

Finding the precision:

tp = fp = 0
# tp -> True Positive, fp -> False Positive
# (label 0 is treated as the positive class in this calculation)
for i in range(0, h.shape[0]):
    if h[i] == y_train[i] == 0:
        tp = tp + 1
    elif h[i] == 0 and y_train[i] == 1:
        fp = fp + 1
precision = tp/(tp + fp)

Output of precision:

precision
1.0

=> Recall: the proportion of actual positives that are correctly identified.

Finding the recall:

fn = 0
# fn -> False Negatives
for i in range(0, h.shape[0]):
    if h[i] == 1 and y_train[i] == 0:
        fn = fn + 1
recall = tp/(tp + fn)

Output of recall:

recall
0.775

=> F1-Score: the harmonic mean of precision and recall.

Finding the F1-Score:

f1_score = (2 * precision * recall)/(precision + recall)

Output of F1-Score:

f1_score
0.8732394366197184
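For a cross-check, the same metrics could also be computed with scikit-learn, assuming it is installed; note that the hand-written metrics above treat label 0 as the positive class, so pos_label=0 is passed to reproduce them. The printed values should agree with the hand-computed ones above.

import sklearn.metrics as skm  # module import avoids clashing with the f1_score variable above

print(skm.accuracy_score(y_train, h))                # accuracy
print(skm.precision_score(y_train, h, pos_label=0))  # precision with 0 as the positive class
print(skm.recall_score(y_train, h, pos_label=0))     # recall with 0 as the positive class
print(skm.f1_score(y_train, h, pos_label=0))         # F1-Score with 0 as the positive class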

Confusion Matrix:

[Figure: Structure of the Confusion Matrix]

tn = 0
# tn -> True Negative
for i in range(0, h.shape[0]):
    if h[i] == y_train[i] == 1:
        tn = tn + 1

cm = np.array([[tp, fn], [fp, tn]])

# MODULE FOR CONFUSION MATRIX

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

plt.figure()
# Un-Normalized Confusion Matrix...
plot_confusion_matrix(cm, classes=[0,1], normalize=False,
                      title='Unnormalized Confusion Matrix')
plt.figure()  # new figure so the normalized matrix does not overwrite the first plot
# Normalized Confusion Matrix...
plot_confusion_matrix(cm, classes=[0,1], normalize=True,
                      title='Normalized Confusion Matrix')

[Figure: Unnormalized and Normalized Confusion Matrices]

The example above is actually an application of binary classification. To solve a multi-class classification problem, the binary classification strategy can simply be repeated N times, where N is the number of classes, using the One-vs-All concept, as sketched below.
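A hedged sketch of the One-vs-All idea, reusing the logistic_regression() defined earlier (one_vs_all is an illustrative helper name, not part of the original walkthrough). At prediction time, the class whose classifier gives the highest hypothesis value is chosen.

import numpy as np

def one_vs_all(X, y, num_classes, alpha, num_iters):
    # train one binary classifier per class: the current class is treated
    # as label 1 and every other class as label 0
    all_theta = []
    for c in range(num_classes):
        y_binary = (y == c).astype(int)
        theta, _, _ = logistic_regression(X, y_binary, alpha, num_iters)
        all_theta.append(theta)
    return all_theta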

That is all about Logistic Regression in Python.
