拉风小宇 2018-08-30
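In statistics, the logistic model (or logit model) is a statistical model usually applied to a binary dependent variable. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model. More formally, a logistic model is one in which the log-odds of an event is a linear combination of independent (predictor) variables. The two possible values of the dependent variable are usually labeled "0" and "1" and represent outcomes such as pass/fail, win/lose, alive/dead, or healthy/sick. The binary logistic regression model can be generalized to a dependent variable with more than two values: categorical outputs with more than two values are modeled by multinomial logistic regression and, if the categories are ordered, by ordinal logistic regression, for example the proportional odds ordinal logistic model.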
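Logistic regression was developed by the statistician David Cox in 1958. The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (independent) variables, also called features. The model itself only models the probability of the output given the input and does not perform statistical classification (it is not a classifier), although it can be used to build one, for example by choosing a cutoff value and assigning inputs whose probability is above the cutoff to one class and those below it to the other. Unlike linear least squares, the coefficients are generally not computed by a closed-form expression.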
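Because the coefficients have no closed-form solution, they are usually found iteratively. The following is a minimal sketch, assuming plain gradient ascent on the log-likelihood with a single predictor; the toy data, the function name fit_logistic, the learning rate, and the iteration count are all illustrative choices rather than anything from the original post, and libraries such as scikit-learn use more refined solvers.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, lr=0.1, n_iter=5000):
    """Estimate intercept b0 and slope b1 by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = sigmoid(b0 + b1 * x)       # current predicted probabilities
        error = y - p                  # per-sample gradient w.r.t. the linear score
        b0 += lr * error.mean()        # step for the intercept
        b1 += lr * (error * x).mean()  # step for the slope
    return b0, b1

# Hypothetical toy data: larger x makes outcome 1 more likely (classes overlap on purpose)
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1])

b0, b1 = fit_logistic(x, y)
probs = sigmoid(b0 + b1 * x)
labels = (probs >= 0.5).astype(int)    # turn probabilities into classes with a 0.5 cutoff
```

The last line is the "cutoff" step described above: the fitted model only produces probabilities, and the classification comes from the threshold applied afterwards.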
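An example: did a person perform an action (say, accept an offer or reply to an email)? Intuitively, we can see that there is some correlation between the customers who accept the offer and those who reject it. The best way to approach this problem is to predict the probability of each of the two events.
From the chart we can see that the outcome takes two values, 0 and 1:
If we apply the sigmoid function to the linear regression function and solve for y, we obtain the logistic regression function, shown in the green box: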
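Since the figure is not reproduced here, the step can be written out explicitly. This is a sketch for a single predictor x, with the intercept b_0 and slope b_1 named here purely for illustration: substituting the linear regression output into the sigmoid and solving for the linear term gives the log-odds form of logistic regression.

```latex
\begin{align*}
y &= b_0 + b_1 x && \text{(linear regression)} \\
p &= \frac{1}{1 + e^{-y}} && \text{(sigmoid)} \\
\ln\!\left(\frac{p}{1 - p}\right) &= b_0 + b_1 x && \text{(logistic regression)}
\end{align*}
```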
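The slope line of linear regression corresponds to the best-fit line through the dataset. We can use the same approach to predict probabilities. Let us take four random values, project them onto the logistic curve, and read off the fitted values:
Now, if we project them to the left, onto the probability axis, we obtain the associated probabilities, and we can set an arbitrary probability threshold to estimate our outcome: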
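The projection just described can be sketched in a few lines. The coefficients b0 and b1, the four input values, and the 0.5 cutoff below are hypothetical numbers chosen for illustration; they are not read from the article's figure.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients and four arbitrary inputs
b0, b1 = -4.0, 0.1
x_values = [20, 35, 45, 60]
cutoff = 0.5  # arbitrary probability threshold

# Fitted values: where each input lands on the logistic curve
probabilities = [sigmoid(b0 + b1 * x) for x in x_values]
# Predicted outcomes: 1 if the probability reaches the cutoff, otherwise 0
predictions = [1 if p >= cutoff else 0 for p in probabilities]

for x, p, y_hat in zip(x_values, probabilities, predictions):
    print(f"x={x}: probability={p:.3f}, prediction={y_hat}")
```

Moving the cutoff up or down changes which inputs are labeled 1 without changing the fitted probabilities themselves.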
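To implement logistic regression (LR) in Python, we import the linear_model module from sklearn, use the LogisticRegression class to create an object that will be our classifier, and fit it to the training set: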
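# Data Preprocessing
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
# (this step is missing from the original listing; test_size = 0.25 and random_state = 0 are assumed values)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)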
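In the feature-scaling step, the scaler is fitted on the training set and then reused unchanged on the test set. A small sketch with made-up numbers may make this concrete; the arrays and variable names below are illustrative and unrelated to Data.csv.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for two features on very different scales
X_train_demo = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0]])
X_test_demo = np.array([[30.0, 80000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learns mean and std from the training data
X_test_scaled = scaler.transform(X_test_demo)        # reuses the training mean and std

# The same transformation written out by hand
manual_test_scaled = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)
```

Calling fit_transform on the test set instead would rescale it with its own statistics, so the classifier would see test inputs on a different scale from the one it was trained on.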
Next we introduce a new variable, y_pred, the vector of predictions, and apply the predict method. The Python code is as follows:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
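The predict method returns class labels, while the underlying model estimates probabilities. As a hedged illustration on synthetic data (the arrays below are generated for this sketch and do not come from Data.csv), the labels from predict should match what a manual 0.5 cutoff on predict_proba produces for a binary problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for scaled training data with two features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + X_demo[:, 1] + noise > 0).astype(int)

clf = LogisticRegression(random_state=0)
clf.fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)[:, 1]          # estimated probability of class 1
labels_from_cutoff = (proba > 0.5).astype(int)   # manual classification with a 0.5 cutoff
labels_from_predict = clf.predict(X_demo)        # labels returned by predict()
```

If a different trade-off between the two kinds of error is wanted, the cutoff applied to proba can be changed while the fitted model stays the same.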
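The next step is to build the confusion matrix: we import the confusion_matrix function from sklearn.metrics to check the accuracy of our predictions. The Python implementation is as follows: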
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
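To show how the resulting matrix is read, here is a small sketch with made-up labels (y_true_demo and y_pred_demo are illustrative, not results from the article's dataset). In scikit-learn's layout, rows correspond to the true classes and columns to the predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 1, 0, 1, 1, 0])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
# For a binary problem, ravel() yields the cells in the order
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm_demo.ravel()
accuracy = (tn + tp) / cm_demo.sum()   # share of correct predictions
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total number of observations.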
The final step is to visualize the training set results and the test set results using the ListedColormap class:
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
# Build a fine grid covering the (scaled) feature space
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
# Colour each grid point by its predicted class to show the decision regions
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
# Build a fine grid covering the (scaled) feature space
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
# Colour each grid point by its predicted class to show the decision regions
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()