2018-07-07
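Multi-class text classification is one of the most common applications of NLP and machine learning. There are several ways to approach the problem, and the performance of any machine learning algorithm depends on the quality of the data. LinearSVC is one of the algorithms that performs well across a range of NLP text classification tasks. However, if you need a probability distribution over all classes, LinearSVC in scikit-learn does not provide a method like predict_proba.
LinearSVC does provide a decision_function method, which predicts confidence scores for samples. The confidence score for a sample is proportional to the signed distance of that sample to the hyperplane.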
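To make the difference between confidence scores and probabilities concrete, here is a minimal sketch on made-up 2-D points (the data and variable names are illustrative assumptions, not part of the complaints dataset). It checks that decision_function returns the raw linear scores and that a fitted LinearSVC has no predict_proba method.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 2-D data with three well-separated classes (illustrative values only).
X = np.array([[0.0, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9],
              [0.0, 5.0], [0.1, 5.2]])
y = np.array([0, 0, 1, 1, 2, 2])

svc = LinearSVC(random_state=0).fit(X, y)

# One score per sample and class: shape (n_samples, n_classes).
scores = svc.decision_function(X)

# The same numbers computed by hand from the fitted weights.
manual = X @ svc.coef_.T + svc.intercept_

print(scores.shape)
print(np.allclose(scores, manual))
print(hasattr(svc, "predict_proba"))  # LinearSVC exposes no probability method
print(scores.sum(axis=1))             # raw margins, not normalized to sum to 1
```

The scores are unbounded margins, so they can rank classes but cannot be read as probabilities without a calibration step.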
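In this post I will demonstrate how to use the CalibratedClassifierCV class from scikit-learn so that the predicted output includes a probability distribution over all classes. The accompanying Jupyter notebook is on GitHub. I will use the consumer financial complaints dataset from data.gov.
Outputting a probability distribution over all classes for predictions made with the LinearSVC classifier in scikit-learn.
The first step is to explore the dataset (https://catalog.data.gov/dataset/consumer-complaint-database). We will look at the number of classes available in the dataset and the total number of rows. We will use Pandas, a popular Python library, to load the data and get an overview of what it looks like.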
import pandas as pd
consumer_complaints_df = pd.read_csv("Consumer_Complaints.csv")
consumer_complaints_df.head()
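Data overview
We will use only two columns for this task. The "Product" column will serve as the class, and the "Consumer complaint narrative" column as the feature. We will train the classifier on the feature to predict the class. At prediction time the input is therefore a consumer's complaint narrative and the output is a probability distribution over products. The most probable product is the prediction, and its probability indicates how confident that prediction is.
Print the list of unique classes in the dataset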
consumer_complaints_df['Product'].unique()
array(['Mortgage', 'Credit reporting', 'Consumer Loan', 'Credit card',
'Debt collection', 'Student loan', 'Bank account or service',
'Other financial service', 'Prepaid card', 'Money transfers',
'Checking or savings account',
'Credit reporting, credit repair services, or other personal consumer reports',
'Payday loan', 'Money transfer, virtual currency, or money service',
'Credit card or prepaid card', 'Vehicle loan or lease',
'Payday loan, title loan, or personal loan', 'Virtual currency'], dtype=object)
The next step is to drop any rows whose complaint narrative is null
consumer_complaints_filtered_df = consumer_complaints_df[pd.notnull(consumer_complaints_df['Consumer complaint narrative'])]
consumer_complaints_filtered_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 299516 entries, 1 to 1072068
Data columns (total 18 columns):
Date received 299516 non-null object
Product 299516 non-null object
Sub-product 247333 non-null object
Issue 299516 non-null object
Sub-issue 197380 non-null object
Consumer complaint narrative 299516 non-null object
Company public response 145114 non-null object
Company 299516 non-null object
State 298406 non-null object
ZIP code 296979 non-null object
Tags 51351 non-null object
Consumer consent provided? 299516 non-null object
Submitted via 299516 non-null object
Date sent to company 299516 non-null object
Company response to consumer 299514 non-null object
Timely response? 299516 non-null object
Consumer disputed? 164125 non-null object
Complaint ID 299516 non-null int64
dtypes: int64(1), object(17)
memory usage: 43.4+ MB
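As a small aside, the boolean-mask filter used above can also be written with dropna. The sketch below uses a made-up three-row frame (the rows are illustrative assumptions) to show that both forms keep the same rows and preserve the original index, which is why the index in the info() output above is not contiguous.

```python
import pandas as pd

# Tiny stand-in for the complaints table (made-up rows).
toy_df = pd.DataFrame({
    "Product": ["Mortgage", "Credit card", "Student loan"],
    "Consumer complaint narrative": [
        "Payment was misapplied",
        None,
        "Servicer lost my paperwork",
    ],
})

# Boolean-mask form, as in the post.
by_mask = toy_df[pd.notnull(toy_df["Consumer complaint narrative"])]

# Equivalent dropna form restricted to the narrative column.
by_dropna = toy_df.dropna(subset=["Consumer complaint narrative"])

print(by_mask.equals(by_dropna))
print(list(by_dropna.index))  # original row labels are kept
```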
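A very important step in any ML task is to visualize the class distribution. The class distribution has a strong effect on the performance of the trained algorithm. Classes with noticeably fewer records may not reach a good average accuracy.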
import matplotlib.pyplot as plt
df = consumer_complaints_filtered_df[['Product','Consumer complaint narrative']]
# Pass figsize to the pandas plot call; a separate plt.figure() would stay empty.
df.groupby('Product').count().plot.bar(ylim=0, figsize=(10,6))
plt.show()
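Alongside the bar chart, the same imbalance can be read off as numbers. This is a minimal sketch on a made-up label column (the counts are illustrative assumptions, not the real dataset's), using value_counts to get per-class counts, shares, and a simple largest-to-smallest ratio.

```python
import pandas as pd

# Made-up label column with a deliberately skewed distribution.
toy_products = pd.Series(
    ["Mortgage"] * 6 + ["Debt collection"] * 3 + ["Virtual currency"] * 1,
    name="Product",
)

counts = toy_products.value_counts()                # rows per class
shares = toy_products.value_counts(normalize=True)  # fraction per class

# Ratio of the largest class to the smallest one.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print(imbalance_ratio)
```

A large ratio is a hint that the rare classes may need more data, resampling, or class weights before their scores can be trusted.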
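Now we start with preprocessing. Classification for NLP tasks always includes this important preprocessing step. The mathematics behind a classifier works only on numbers, so we have to convert our input into numeric form. scikit-learn has several classes available for text preprocessing.
We will use CountVectorizer and TfidfTransformer. CountVectorizer converts a collection of text documents into a matrix of token counts. TfidfTransformer transforms a count matrix into a normalized tf or tf-idf representation. You can read more about both on the scikit-learn website.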
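The two steps can be seen on a toy corpus. The sketch below uses three made-up sentences (an assumption for illustration) and also checks that TfidfVectorizer, which combines both steps in one class, yields the same matrix with default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Three made-up documents standing in for complaint narratives.
docs = [
    "credit report error",
    "mortgage payment late",
    "credit card late fee",
]

# Step 1: token counts, one row per document and one column per token.
count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs)

# Step 2: reweight the counts by tf-idf (rows are L2-normalized by default).
tfidf = TfidfTransformer().fit_transform(counts)

# Both steps in a single estimator.
one_step = TfidfVectorizer().fit_transform(docs)

print(sorted(count_vect.vocabulary_))
print(counts.shape)
print(np.allclose(tfidf.toarray(), one_step.toarray()))
```

Keeping the two steps separate, as this post does, makes the intermediate count matrix available for inspection; the single-class form is shorter when that is not needed.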
Similarly, to convert the text labels, or classes, into numeric form we will use LabelEncoder. It encodes labels with values between 0 and n_classes-1.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
labels = df['Product']
text = df['Consumer complaint narrative']
X_train, X_test, y_train, y_test = train_test_split(text, labels, random_state=0, test_size=0.3)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_transformed = tf_transformer.transform(X_train_counts)
X_test_counts = count_vect.transform(X_test)
X_test_transformed = tf_transformer.transform(X_test_counts)
labels = LabelEncoder()
y_train_labels_fit = labels.fit(y_train)
y_train_lables_trf = labels.transform(y_train)
print(labels.classes_)
['Bank account or service' 'Checking or savings account' 'Consumer Loan'
'Credit card' 'Credit card or prepaid card' 'Credit reporting'
'Credit reporting, credit repair services, or other personal consumer reports'
'Debt collection' 'Money transfer, virtual currency, or money service'
'Money transfers' 'Mortgage' 'Other financial service' 'Payday loan'
'Payday loan, title loan, or personal loan' 'Prepaid card' 'Student loan'
'Vehicle loan or lease' 'Virtual currency']
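The position of each name in the classes_ array printed above is the integer code the encoder assigns to it. Here is a minimal sketch with four made-up labels (illustrative only) showing the sorted order, the resulting codes, and the way back from codes to names.

```python
from sklearn.preprocessing import LabelEncoder

# A few made-up labels; the real run uses the Product column.
toy_labels = ["Mortgage", "Credit card", "Mortgage", "Student loan"]

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_labels)

# classes_ is sorted, and each code is an index into it.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps integer codes back to the original names.
decoded = encoder.inverse_transform([2, 0])
print(list(decoded))
```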
Finally, we train a LinearSVC classifier and use CalibratedClassifierCV to obtain probabilities for all classes.
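import numpy as np  # needed for np.mean in the accuracy print below
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
# Fit the linear SVM first; cv="prefit" tells the calibrator to reuse it as-is.
linear_svc = LinearSVC()
clf = linear_svc.fit(X_train_transformed,y_train_lables_trf)
# Note: scikit-learn 1.2 and later name this parameter estimator instead of base_estimator.
calibrated_svc = CalibratedClassifierCV(base_estimator=linear_svc,
                                        cv="prefit")
# This fit only learns the mapping from decision scores to probabilities.
calibrated_svc.fit(X_train_transformed,y_train_lables_trf)
predicted = calibrated_svc.predict(X_test_transformed)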
to_predict = ["I have outdated information on my credit report that I have previously disputed that has yet to be removed this information is more then seven years old and does not meet credit reporting requirements"]
p_count = count_vect.transform(to_predict)
p_tfidf = tf_transformer.transform(p_count)
print('Average accuracy on test set={}'.format(np.mean(predicted == labels.transform(y_test))))
print('Predicted probabilities of demo input string are')
print(calibrated_svc.predict_proba(p_tfidf))
Average accuracy on test set=0.73637527127
Predicted probabilities of demo input string are
[[ 4.66096051e-04 7.61305759e-06 2.42386129e-03 8.39870195e-04
9.63384564e-04 7.67200317e-01 2.07382738e-01 1.73294053e-02
3.91417748e-07 3.76878086e-06 2.40907318e-03 3.80234243e-10
1.16823419e-05 1.43313864e-05 6.93519524e-06 8.95787556e-04
3.78217257e-05 6.92201968e-06]]
Demo prediction
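The probability row is easier to read once each value is paired with its class name. The sketch below copies the class order and the probabilities from the outputs printed above and lists the three most likely products; the names class_names and demo_proba are illustrative, and in the notebook the same pairing would come from labels.classes_ and the predict_proba result.

```python
import numpy as np

# Class order as printed by labels.classes_ above.
class_names = [
    "Bank account or service", "Checking or savings account", "Consumer Loan",
    "Credit card", "Credit card or prepaid card", "Credit reporting",
    "Credit reporting, credit repair services, or other personal consumer reports",
    "Debt collection", "Money transfer, virtual currency, or money service",
    "Money transfers", "Mortgage", "Other financial service", "Payday loan",
    "Payday loan, title loan, or personal loan", "Prepaid card", "Student loan",
    "Vehicle loan or lease", "Virtual currency",
]

# Probabilities copied from the demo output above.
demo_proba = np.array([
    4.66096051e-04, 7.61305759e-06, 2.42386129e-03, 8.39870195e-04,
    9.63384564e-04, 7.67200317e-01, 2.07382738e-01, 1.73294053e-02,
    3.91417748e-07, 3.76878086e-06, 2.40907318e-03, 3.80234243e-10,
    1.16823419e-05, 1.43313864e-05, 6.93519524e-06, 8.95787556e-04,
    3.78217257e-05, 6.92201968e-06,
])

# Indices of the three largest probabilities, highest first.
top3_idx = np.argsort(demo_proba)[::-1][:3]
top3 = [(class_names[i], float(demo_proba[i])) for i in top3_idx]

for name, p in top3:
    print("{:.4f}  {}".format(p, name))
```

For this demo string, almost all of the probability mass falls on the two credit-reporting classes, which matches the content of the complaint.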
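I am sure the average accuracy could be improved with further fine-tuning or preprocessing. However, the purpose of this post is to demonstrate using CalibratedClassifierCV to obtain a probability for every class in the predicted output.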
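One possible refinement, offered as a sketch rather than as what the notebook does: the scikit-learn documentation advises that calibration data be disjoint from the data used to fit the classifier. Passing an integer cv lets CalibratedClassifierCV handle that split internally. The example below runs on synthetic data from make_classification (an assumption for illustration), and the estimator is passed positionally so the snippet does not depend on the base_estimator/estimator parameter name.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 3-class problem standing in for the tf-idf features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# With cv=3, each fold's LinearSVC is calibrated on data it was not fitted on.
calibrated = CalibratedClassifierCV(LinearSVC(random_state=0), cv=3)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])

print(proba.shape)
print(proba.sum(axis=1))  # each row is a distribution over the classes
```

On the full complaints dataset this fits the SVM once per fold, so it costs more training time than the prefit approach used in this post.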