tracy 2018-05-19
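Cross-validation evaluates a predictive model by partitioning the original sample into a training set used to fit the model and a test set used to evaluate it.
Cross-validation in sklearn is very helpful for choosing the right model and model parameters. With it, we can see directly how different models or parameters affect the accuracy of the results.
We will use the famous "iris" dataset and the KNN classifier.
Basically, the accuracy reported by knn.score() comes from a single train/test split of the dataset.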
# We are going to use the famous dataset 'iris' with the KNN Classifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# load dataset
iris = load_iris()
X = iris.data
y = iris.target
# split into train and test datasets, using random_state=4
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
# build KNN model and choose n_neighbors = 5
knn = KNeighborsClassifier(n_neighbors = 5)
# train the model
knn.fit(X_train, y_train)
# get the predict value from X_test
y_pred = knn.predict(X_test)
# print the score
print('accuracy: ', knn.score(X_test, y_test))
# accuracy: 0.973684210526
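A single split gives one number, and that number depends on which rows happen to land in the test set. The following is a minimal sketch of that effect, assuming the same iris data and KNN settings as above; the seeds 0 to 9 and the name `split_scores` are arbitrary choices made for illustration, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Repeat the single train/test split with several seeds (seeds chosen arbitrarily)
split_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    # accuracy on this particular test split
    split_scores.append(knn.score(X_test, y_test))

print('min accuracy: ', min(split_scores))
print('max accuracy: ', max(split_scores))
```

If the minimum and maximum differ, the score from any one split is only a rough estimate, which is the motivation for averaging over several folds.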
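In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Each subsample is used once as the test set while the remaining k-1 subsamples form the training set, which yields k scores.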
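# import cross_val_score for k-fold cross-validation
# (the old sklearn.cross_validation module has been removed; use sklearn.model_selection)
from sklearn.model_selection import cross_val_score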
# use the same model as before
knn = KNeighborsClassifier(n_neighbors = 5)
# X, y are automatically divided into 5 folds; scoring is still accuracy
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
# print all 5 times scores
print(scores)
# [ 0.96666667 1. 0.93333333 0.96666667 1. ]
# then average the five scores to get a more reliable accuracy estimate
print(scores.mean())
# 0.973333333333
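To make the mechanics of the folds concrete, here is a rough sketch that runs the five folds by hand. It assumes that, for a classifier and an integer `cv`, `cross_val_score` uses stratified folds without shuffling, so `StratifiedKFold(n_splits=5)` is used to mirror it; the names `manual_scores` and `library_scores` are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Assumption: mirrors the default splitter used for classifiers with cv=5
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    # fit on four folds, score on the held-out fold
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    manual_scores.append(knn.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# the one-line version, for comparison
library_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X, y, cv=5, scoring='accuracy')

print(manual_scores)
print(manual_scores.mean())
```

Each sample ends up in the test set exactly once, so the mean summarizes performance over the whole dataset rather than over one arbitrary split.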
We can try different numbers of neighbors to see which K works best.
import matplotlib.pyplot as plt
# Jupyter/IPython only: remove the next line when running as a plain Python script
%matplotlib inline
# choose k from 1 to 30
k_range = range(1, 31)
k_scores = []
# loop over the candidate k values and record the mean cross-validated accuracy for each
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    k_scores.append(scores.mean())
# plot to see clearly
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()
# We can see that the best K lies between 6 and 13; beyond 13 the accuracy drops because of underfitting.
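Reading the best K off a plot can be imprecise, so the same search can also report it numerically. This is only a sketch under the same settings as the loop above (iris, `cv=5`, accuracy); `best_score` and `best_ks` are hypothetical names, and several K values may tie for the top score.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k_range = list(range(1, 31))
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    k_scores.append(scores.mean())

# highest mean accuracy, and every k that reaches it (ties are possible)
best_score = max(k_scores)
best_ks = [k for k, s in zip(k_range, k_scores) if np.isclose(s, best_score)]

print('best mean accuracy: ', best_score)
print('k values reaching it: ', best_ks)
```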
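# repeat the search, this time scoring with mean squared error instead of accuracy
import matplotlib.pyplot as plt
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 'neg_mean_squared_error' returns negated MSE values, so negate them to get the loss
    loss = -cross_val_score(knn, X, y, cv=5, scoring='neg_mean_squared_error')
    k_scores.append(loss.mean())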
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated MSE')
plt.show()
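# Because this plot shows MSE, we look for the minimum, which falls between 6 and 13. This is consistent with the accuracy-based result (#2).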
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print("Number of mislabeled points : %d" % (iris.target != y_pred).sum())
# Number of mislabeled points : 6
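# Above, we used a simple count of mislabeled points to derive a score: 6 mislabeled / 150 total, or 144 correct / 150 total = 0.96 (obviously we want this to be as close to 1 as possible).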
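The arithmetic above is the definition of accuracy, so it can be cross-checked against `metrics.accuracy_score`. A small sketch, assuming the same GaussianNB model scored on the same data it was trained on, as in the example; `manual_accuracy` and `library_accuracy` are names introduced here for illustration.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)

# accuracy by hand: correct predictions divided by total predictions
n_total = len(iris.target)
n_wrong = int((iris.target != y_pred).sum())
manual_accuracy = (n_total - n_wrong) / n_total

# the same quantity from the library
library_accuracy = metrics.accuracy_score(iris.target, y_pred)

print('manual accuracy: ', manual_accuracy)
print('accuracy_score: ', library_accuracy)
```

Because the model is evaluated on its own training data here, this number is likely to be more optimistic than a cross-validated score.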
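We can score binary classification by plotting the receiver operating characteristic (ROC) curve and computing the area under the curve (AUC). Again, our goal is an AUC as close to 1 as possible.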
# Finding the false positive and true positive rates where the positive label is 2.
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(iris.target, y_pred, pos_label=2)
# compute the area under the ROC curve and print it
print('AUC: ', metrics.auc(fpr, tpr))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.show()
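An ROC curve built from hard class labels has very few points, because there are only a few distinct thresholds to sweep. One possible variant, offered as a sketch rather than as the method used above, feeds predicted probabilities instead: class 2 is treated as the positive class against the rest, and its score is taken from `predict_proba`. The names `y_true`, `y_score` and `roc_auc` are introduced for this illustration.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
gnb = GaussianNB().fit(iris.data, iris.target)

# one-vs-rest: class 2 is positive (1), classes 0 and 1 are negative (0)
y_true = (iris.target == 2).astype(int)
# column 2 of predict_proba holds the predicted probability of class 2
y_score = gnb.predict_proba(iris.data)[:, 2]

# with continuous scores, roc_curve can sweep many thresholds
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
roc_auc = metrics.auc(fpr, tpr)

print('AUC: ', roc_auc)
```

The resulting `fpr` and `tpr` arrays can be passed to `plt.plot` exactly as in the example above. As before, the model is scored on its own training data, so the AUC is likely optimistic.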