张小染 2018-12-28
Classification is a very common and important variant of machine learning problems. Many machine learning algorithms have been devised to solve classification (discrete rather than continuous) problems. Examples of classification-based predictive analytics problems are:
1. Predicting whether a patient is Diabetic or Non-Diabetic.
2. Predicting whether a piece of text expresses Happiness/Sadness, or is a Compliment/Insult.
3. Recognizing a handwritten digit (0-9).
Problems 1 and 2 are examples of binary classification, where there are only 2 classes: Diabetic/Non-Diabetic and Happy/Sad or Compliment/Insult, respectively. Problem 3, however, has 10 classes, since there are 10 digits (0-9), so it requires multi-class classification.
Among the many machine learning classification algorithms, Logistic Regression is one of the most widely used and popular. It can be used for both binary and multi-class classification problems. In this article, I will explain Logistic Regression, its implementation in Python, and its application on a practical practice dataset.
Going by the name "Logistic", one might guess that a function called the Logistic function is involved in the hypothesis of this machine learning algorithm. The Sigmoid function is, in fact, also known as the Logistic function:
f(x) = 1 / (1 + e^(-x)), where f(x) is the Sigmoid function
Now, moving on to the hypothesis of Logistic Regression:
h(x) = 1 / (1 + e^-(θ_0 + θ_1·x_1 + θ_2·x_2 + ... + θ_n·x_n)), where θ_0, θ_1, θ_2, ..., θ_n are the parameters of Logistic Regression
Let us look at the graph of the Sigmoid function:
Graph of the Sigmoid function
So, the output of the Sigmoid function ranges from 0 to 1. For classification, however, a specific label indicating the class has to be predicted. In that case, a threshold (obviously a value between 0 and 1) needs to be set such that maximum predictive performance is obtained.
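As a quick illustration of this idea (not part of the original walkthrough), the sketch below evaluates the Sigmoid function with NumPy and applies a 0.5 threshold to turn probabilities into class labels; the input values are arbitrary.

import numpy as np

def sigmoid(z):
    # maps any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-6.0, -1.5, 0.0, 1.5, 6.0])  # arbitrary illustrative inputs
probs = sigmoid(z)                         # approximately [0.002, 0.18, 0.5, 0.82, 0.998]
labels = (probs > 0.5).astype(int)         # 0.5 used as the classification threshold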
Of course, wherever there is a hypothesis, a cost function is also involved:

J(θ) = -(1/m) · Σ [ y · log(h(x)) + (1 - y) · log(1 - h(x)) ]   (summed over the m training samples)

This cost function is also called the Binary Cross-Entropy function.
Here, the cost function has to be minimized, so the values of (θ_0, θ_1, θ_2, ..., θ_n) at which it attains its minimum need to be found.
Batch Gradient Descent can be used as the optimization technique to find this minimum.
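For intuition, each Batch Gradient Descent iteration applies the update θ_j := θ_j - (α/m) · Σ (h(x) - y) · x_j to every parameter, where m is the number of training samples and α the learning rate. A vectorized NumPy one-liner equivalent to the parameter loop inside the BGD() implementation shown later (assuming X already contains the intercept column of ones) would be:

# equivalent vectorized form of one gradient-descent step
theta = theta - (alpha / X.shape[0]) * np.matmul(X.transpose(), h - y)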
The implementation of Logistic Regression is done by creating 3 modules.
=> hypothesis(): the function that computes the output of the algorithm's hypothesis, given theta (θ, the list of θ_0, θ_1, θ_2, ..., θ_n), the feature set X and the number of features n. The implementation of hypothesis() is given below:
import numpy as np
from math import exp

def hypothesis(theta, X, n):
    h = np.ones((X.shape[0], 1))
    theta = theta.reshape(1, n+1)
    for i in range(0, X.shape[0]):
        # sigmoid of theta . x for the i-th sample
        h[i] = 1 / (1 + exp(-float(np.matmul(theta, X[i]))))
    h = h.reshape(X.shape[0])
    return h
=> BGD(): here, the Gradient Descent algorithm is implemented. It returns theta (θ, the list of θ_0, θ_1, θ_2, ..., θ_n) at which the minimum is attained, theta_history (containing the theta values at every iteration) and cost (containing the value of the cost function at every iteration), given the initial theta (the list of θ_0, θ_1, θ_2, ..., θ_n), alpha (the learning rate), num_iters (the number of iterations), h (the hypothesis values for all samples), the feature set X, the label set y and the number of features n. The implementation of BGD() is given below:
def BGD(theta, alpha, num_iters, h, X, y, n):
    theta_history = np.ones((num_iters, n+1))
    cost = np.ones(num_iters)
    for i in range(0, num_iters):
        # update of the bias parameter theta_0
        theta[0] = theta[0] - (alpha/X.shape[0]) * sum(h - y)
        # update of theta_1 ... theta_n
        for j in range(1, n+1):
            theta[j] = theta[j] - (alpha/X.shape[0]) * sum((h - y) * X.transpose()[j])
        theta_history[i] = theta
        h = hypothesis(theta, X, n)
        # binary cross-entropy cost at this iteration
        cost[i] = (-1/X.shape[0]) * sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    theta = theta.reshape(1, n+1)
    return theta, theta_history, cost
=> logistic_regression(): the principal function, which takes the feature set X, the label set y, the learning rate alpha and the number of iterations num_iters. The implementation of logistic_regression() is given below:
def logistic_regression(X, y, alpha, num_iters):
    n = X.shape[1]
    # adding the intercept column x_0 = 1 to the feature set...
    one_column = np.ones((X.shape[0], 1))
    X = np.concatenate((one_column, X), axis = 1)
    # initializing the parameter vector...
    theta = np.zeros(n+1)
    # hypothesis calculation....
    h = hypothesis(theta, X, n)
    # returning the optimized parameters by Gradient Descent...
    theta, theta_history, cost = BGD(theta, alpha, num_iters, h, X, y, n)
    return theta, theta_history, cost
The dataset contains the marks obtained by 100 students in 2 exams and a label (0/1) indicating whether the student will be admitted.
Problem statement: "Given the marks obtained in the 2 exams, predict whether a student will be admitted to the university or not, using Logistic Regression."
Reading the data into Numpy arrays:
data = np.loadtxt('dataset.txt', delimiter=',')
X_train = data[:, [0, 1]]  # feature-set
y_train = data[:, 2]       # label-set
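For this to work, dataset.txt is assumed to be a plain comma-separated text file with one student per line: the two exam marks followed by the 0/1 admission label. The line below is a made-up illustration of that layout, not an actual record from the dataset:

82.3,45.1,1   # exam-1 mark, exam-2 mark, admitted (1) / not admitted (0)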
Visualization of the dataset with a Scatter Plot:
x0 = np.ones((np.array([x for x in y_train if x == 0]).shape[0], X_train.shape[1]))
x1 = np.ones((np.array([x for x in y_train if x == 1]).shape[0], X_train.shape[1]))
# x0 and x1 are matrices containing the -ve (label 0) and +ve (label 1)
# examples from the dataset, initialized to 1
k0 = k1 = 0

for i in range(0, y_train.shape[0]):
    if y_train[i] == 0:
        x0[k0] = X_train[i]
        k0 = k0 + 1
    else:
        x1[k1] = X_train[i]
        k1 = k1 + 1

X = [x0, x1]
colors = ["green", "blue"]  # 2 distinct colours for the 2 classes

import matplotlib.pyplot as plt

for x, c in zip(X, colors):
    if c == "green":
        plt.scatter(x[:, 0], x[:, 1], color = c, label = "Not Admitted")
    else:
        plt.scatter(x[:, 0], x[:, 1], color = c, label = "Admitted")
plt.xlabel("Marks obtained in 1st Exam")
plt.ylabel("Marks obtained in 2nd Exam")
plt.legend()
Scatter Plot visualization of the dataset
Running the 3-module Logistic Regression:
# calling the principal function with learning_rate = 0.001 and
# num_iters = 100000
theta, theta_history, cost = logistic_regression(X_train, y_train, 0.001, 100000)
The output of theta looks like:
theta: array([[-4.81180027, 0.04528064, 0.03819149]])
theta (θ) after BGD of Logistic Regression
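With θ in hand, an unseen student can be scored by reusing the hypothesis() function defined above; the two marks below are hypothetical values chosen purely for illustration, and 0.5 is again used as the threshold.

# hypothetical marks of a new student in the 2 exams
new_student = np.array([[55.0, 70.0]])
# prepend the intercept term x_0 = 1, as logistic_regression() does internally
new_student = np.concatenate((np.ones((1, 1)), new_student), axis=1)
prob = hypothesis(theta, new_student, new_student.shape[1] - 1)[0]
admitted = 1 if prob > 0.5 else 0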
The obtained θ can be visualized by incorporating the Decision Boundary (the line separating the 2 classes, based on θ) in the Scatter Plot. The decision boundary is the set of points where θ_0 + θ_1·x_1 + θ_2·x_2 = 0, i.e. where the predicted probability is exactly 0.5:
plot_x = np.array([min(X_train[:, 0]) - 2, max(X_train[:, 0]) + 2])
# theta returned by BGD() has shape (1, n+1), hence the [0, j] indexing
plot_y = (-1/theta[0, 2]) * (theta[0, 1] * plot_x + theta[0, 0])
for x, c in zip(X, colors):
    if c == "green":
        plt.scatter(x[:, 0], x[:, 1], color = c, label = "Not Admitted")
    else:
        plt.scatter(x[:, 0], x[:, 1], color = c, label = "Admitted")
plt.plot(plot_x, plot_y, label = "Decision_Boundary")
plt.legend()
plt.xlabel("Marks obtained in 1st Exam")
plt.ylabel("Marks obtained in 2nd Exam")
The Scatter Plot with the Decision Boundary included looks like:
The gradual reduction of the cost function can be visualized with a Line Plot:
import matplotlib.pyplot as plt
cost = list(cost)
n_iterations = [x for x in range(1, 100001)]
plt.plot(n_iterations, cost)
plt.xlabel('No. of iterations')
plt.ylabel('Cost')
Line-Curve:
Line-Curve representation of cost minimization using BGD for Logistic Regression
Model performance analysis:
In classification, model performance is analyzed using the following metrics: Accuracy, Precision, Recall and F1-Score. First, the predictions on the training set are obtained, using 0.5 as the threshold:
X_train = np.concatenate((np.ones((X_train.shape[0], 1)), X_train), axis = 1)
h = hypothesis(theta, X_train, X_train.shape[1] - 1)
# Taking 0.5 as threshold:
for i in range(0, h.shape[0]):
    if h[i] > 0.5:
        h[i] = 1
    else:
        h[i] = 0
=> Accuracy: the ratio of the number of correctly predicted samples to the total number of samples.
Finding the Accuracy:
k = 0
for i in range(0, h.shape[0]):
    if h[i] == y_train[i]:
        k = k + 1
accuracy = k / y_train.shape[0]
The output of accuracy looks like:
accuracy: 0.91
Output of Accuracy => 91% accuracy
=> Precision: the ratio of the number of correctly predicted positive observations to the total number of predicted positive observations.
Finding the Precision:
tp = fp = 0  # tp -> True Positives, fp -> False Positives
# here, label 0 (Not Admitted) is treated as the positive class
for i in range(0, h.shape[0]):
    if h[i] == y_train[i] == 0:
        tp = tp + 1
    elif h[i] == 0 and y_train[i] == 1:
        fp = fp + 1
precision = tp / (tp + fp)
Output of Precision:
precision: 1.0
=> Recall: the proportion of actual positives that are correctly identified.
Finding the Recall:
fn = 0  # fn -> False Negatives
for i in range(0, h.shape[0]):
    if h[i] == 1 and y_train[i] == 0:
        fn = fn + 1
recall = tp / (tp + fn)
Output of Recall:
recall: 0.775
=> F1-Score: the harmonic mean of Precision and Recall.
Finding the F1-Score:
f1_score = (2 * precision * recall)/(precision + recall)
Output of F1-Score:
f1_score: 0.8732394366197184
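As an optional cross-check (this assumes scikit-learn is installed; it is not used anywhere else in this article), the same four metrics can be computed with sklearn.metrics. Note pos_label=0, because label 0 is treated as the positive class in the hand-written code above.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(accuracy_score(y_train, h))                # should match the accuracy above
print(precision_score(y_train, h, pos_label=0))  # should match the precision above
print(recall_score(y_train, h, pos_label=0))     # should match the recall above
print(f1_score(y_train, h, pos_label=0))         # should match the F1-Score above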
Confusion Matrix:
Structure of the Confusion Matrix
tn = 0  # tn -> True Negatives
for i in range(0, h.shape[0]):
    if h[i] == y_train[i] == 1:
        tn = tn + 1
cm = np.array([[tp, fn], [fp, tn]])

# MODULE FOR CONFUSION MATRIX
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

plt.figure()
# Un-Normalized Confusion Matrix...
plot_confusion_matrix(cm, classes=[0, 1], normalize=False,
                      title='Unnormalized Confusion Matrix')
# Normalized Confusion Matrix...
plot_confusion_matrix(cm, classes=[0, 1], normalize=True,
                      title='Normalized Confusion Matrix')
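Again as an optional cross-check (assuming scikit-learn), sklearn's confusion_matrix should reproduce cm as built above: with the labels sorted as [0, 1], its rows are the true labels and its columns the predicted labels, which is exactly the [[tp, fn], [fp, tn]] layout used here since label 0 plays the role of the positive class.

from sklearn.metrics import confusion_matrix

# rows: true label 0, 1 ; columns: predicted label 0, 1
print(confusion_matrix(y_train, h))  # expected to equal cm from above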
The above example was actually an application of binary classification. To solve a multi-class classification problem, the binary classification strategy can be repeated N times, where N is the number of classes, using the One-vs-All (One-vs-Rest) concept, as sketched below.
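The sketch below is a minimal, illustrative One-vs-All implementation built on the logistic_regression() and hypothesis() functions defined earlier; the default alpha and num_iters values are assumptions, not tuned settings.

import numpy as np

def one_vs_all(X, y, num_classes, alpha=0.001, num_iters=100000):
    # train one binary Logistic Regression classifier per class (class c vs. the rest)
    thetas = []
    for c in range(num_classes):
        y_c = (y == c).astype(float)  # 1 for samples of class c, 0 for all others
        theta_c, _, _ = logistic_regression(X, y_c, alpha, num_iters)
        thetas.append(theta_c)
    return thetas

def predict_one_vs_all(thetas, X):
    # prepend the intercept column, then pick the class with the highest probability
    X1 = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
    probs = np.array([hypothesis(t, X1, X1.shape[1] - 1) for t in thetas])
    return np.argmax(probs, axis=0)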
That is all about Logistic Regression in Python.
" \ \ / /_ | / | _ \ / | / / _ | \ | | | / |. " \ \ / / | || |/| | |) | | | | | | | | | | | | | | _.