pandazjd 2018-06-15
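Is it possible to predict a wine's rating from its description using machine learning?
Some people call this sentiment analysis or text analysis. Let's get started!
Dataset (https://www.kaggle.com/zynicide/wine-reviews): a collection of 130k wines (with their ratings, descriptions and prices, to name a few fields) from WineMag.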
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Plot / Graph stuffs
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
Dataset files:
winemag-data-130k-v2.csv
winemag-data-130k-v2.json
winemag-data_first150k.csv
With any dataset, I have learned to start by removing duplicates and NaN values (null stuff):
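# Assumption: the load step is not shown in the original; the index range printed further down (up to 150929) suggests this file
init_data = pd.read_csv("../input/winemag-data_first150k.csv")
# NOTE: duplicated(..., keep=False) flags every row whose description occurs more than once, so this mask keeps the repeated reviews (the counts below come from it)
# init_data.drop_duplicates('description') would drop the repeats instead
parsed_data = init_data[init_data.duplicated('description', keep=False)]
print("Length of dataframe after duplicates are removed:", len(parsed_data))
parsed_data = parsed_data.dropna(subset=['description', 'points'])  # dropna returns a new frame, so assign it back
print("Length of dataframe after NaNs are removed:", len(parsed_data))
parsed_data.head()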
Length of dataframe after duplicates are removed: 92393
Length of dataframe after NaNs are removed: 92393
We are left with 92k wine reviews, which is plenty!
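To make the cleaning step easier to reason about, here is a small sketch of how the pandas calls involved behave on a toy frame. The data and the variable names (toy, repeated_only, deduplicated, cleaned) are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy data (hypothetical): one repeated description and one missing one
toy = pd.DataFrame({
    "description": ["ripe cherry", "ripe cherry", "dry and crisp", None],
    "points": [90, 90, 85, 88],
})

# keep=False flags every row whose description occurs more than once,
# so indexing with this mask selects the repeated rows only
repeated_only = toy[toy.duplicated("description", keep=False)]

# drop_duplicates keeps the first occurrence of each description
deduplicated = toy.drop_duplicates("description")

# dropna returns a new frame; without the assignment nothing changes
cleaned = deduplicated.dropna(subset=["description", "points"])

print(len(repeated_only), len(deduplicated), len(cleaned))
```

The point of the comparison is that the mask and drop_duplicates select very different rows, so it is worth checking which one a pipeline actually uses.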
Let's take a look at our data, "description" vs "points":
dp = parsed_data[['description','points']]
dp.info()
dp.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 92393 entries, 25 to 150929
Data columns (total 2 columns):
description 92393 non-null object
points 92393 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.1+ MB
Now let's look at the distribution of our dataset. In our case, that is the number of wines per point score:
fig, ax = plt.subplots(figsize=(30,10))
plt.xticks(fontsize=20) # X Ticks
plt.yticks(fontsize=20) # Y Ticks
ax.set_title('Number of wines per points', fontweight="bold", size=25) # Title
ax.set_ylabel('Number of wines', fontsize = 25) # Y label
ax.set_xlabel('Points', fontsize = 25) # X label
dp.groupby(['points']).count()['description'].plot(ax=ax, kind='bar')
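A lot of wines are rated between 83 and 93 points. This also matches the market (there are not many truly great wines). If we wanted a better model, we could try to gather more reviews at the under-represented point levels to even out the distribution.
Let's look at description length vs points: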
dp = dp.assign(description_length = dp['description'].apply(len))
dp.info()
dp.head()
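As an interesting side note, reading through some of the data, I noticed that the better the wine, the longer the description. It makes sense that people are eager to leave longer reviews for wines they really care about, though I had not expected the effect to be significant in the data: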
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(x='points', y='description_length', data=dp)
plt.xticks(fontsize=20) # X Ticks
plt.yticks(fontsize=20) # Y Ticks
ax.set_title('Description Length per Points', fontweight="bold", size=25) # Title
ax.set_ylabel('Description Length', fontsize = 25) # Y label
ax.set_xlabel('Points', fontsize = 25) # X label
plt.show()
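Let's try to simplify the model by using 5 different values:
1 -> points 80 to 83 (below average)
2 -> points 84 to 87 (average)
3 -> points 88 to 91 (good wine)
4 -> points 92 to 95 (very good wine)
5 -> points 96 to 100 (excellent wine)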
# Transform method taking points as param
def transform_points_simplified(points):
    if points < 84:
        return 1
    elif points >= 84 and points < 88:
        return 2
    elif points >= 88 and points < 92:
        return 3
    elif points >= 92 and points < 96:
        return 4
    else:
        return 5
#Applying transform method and assigning result to new column "points_simplified"
dp = dp.assign(points_simplified = dp['points'].apply(transform_points_simplified))
dp.head()
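As a side note, the same bucketing can be written as a table lookup, which is easier to adjust if the class boundaries change. This is only a sketch: the names THRESHOLDS and bucket_points are hypothetical and do not appear in the original notebook.

```python
from bisect import bisect_right

# Lower bounds of classes 2 to 5; anything below 84 falls into class 1
THRESHOLDS = [84, 88, 92, 96]

def bucket_points(points):
    # bisect_right counts how many thresholds are <= points
    return bisect_right(THRESHOLDS, points) + 1

print([bucket_points(p) for p in (80, 84, 87, 88, 92, 96, 100)])
```

It should return the same class as transform_points_simplified for any score, since each threshold is an inclusive lower bound.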
How many wines are there per simplified score? This is the new distribution of the data:
fig, ax = plt.subplots(figsize=(30,10))
plt.xticks(fontsize=20) # X Ticks
plt.yticks(fontsize=20) # Y Ticks
ax.set_title('Number of wines per points', fontweight="bold", size=25) # Title
ax.set_ylabel('Number of wines', fontsize = 25) # Y label
ax.set_xlabel('Points', fontsize = 25) # X label
dp.groupby(['points_simplified']).count()['description'].plot(ax=ax, kind='bar')
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(x='points_simplified', y='description_length', data=dp)
plt.xticks(fontsize=20) # X Ticks
plt.yticks(fontsize=20) # Y Ticks
ax.set_title('Description Length per Points', fontweight="bold", size=25) # Title
ax.set_ylabel('Description Length', fontsize = 25) # Y label
ax.set_xlabel('Points', fontsize = 25) # X label
plt.show()
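One of the simplest ways to classify text with ML is the so-called bag-of-words approach, or vectorization.
Basically, you want to represent your text in a vector space, with weights attached (number of occurrences, etc.), so that your classification algorithm can interpret it.
Several vectorization algorithms are available; the best known (as far as I know) are:
- CountVectorizer: simply counts the words, as the name says
- TF-IDF Vectorizer: the weight increases in proportion to the count, but is offset by how often the word appears across the whole corpus. This is known as IDF (inverse document frequency). It lets the vectorizer down-weight frequent words such as "the", "a", etc.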
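To make the difference between the two weightings concrete, here is a minimal pure-Python sketch on a made-up three-sentence corpus. It uses the textbook formula idf = log(N / df); scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default and tokenizes differently, so its numbers will not match these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), split on whitespace for simplicity
docs = [
    "the wine is fruity and the finish is long",
    "the wine is flat",
    "a long fruity finish",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Bag of words: one count per vocabulary word per document
counts = []
for tokens in tokenized:
    word_counts = Counter(tokens)
    counts.append([word_counts[word] for word in vocabulary])

# Document frequency: number of documents containing each word
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocabulary}

def idf(word):
    # Textbook inverse document frequency; common words get a small value
    return math.log(len(docs) / doc_freq[word])

tfidf = [[count * idf(word) for word, count in zip(vocabulary, row)] for row in counts]

the_col = vocabulary.index("the")
flat_col = vocabulary.index("flat")
print("counts:", counts[1][the_col], counts[1][flat_col])
print("tf-idf:", round(tfidf[1][the_col], 3), round(tfidf[1][flat_col], 3))
```

In the second sentence "the" and "flat" both occur once, so their counts are equal, but "flat" gets the higher TF-IDF weight because it appears in fewer documents.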
X = dp['description']
y = dp['points_simplified']
vectorizer = CountVectorizer()
vectorizer.fit(X)
#print(vectorizer.vocabulary_)
Let's vectorize X using the fitted vectorizer:
X = vectorizer.transform(X)
print('Shape of Sparse Matrix: ', X.shape)
print('Amount of Non-Zero occurrences: ', X.nnz)
# Percentage of non-zero values
density = (100.0 * X.nnz / (X.shape[0] * X.shape[1]))
print('Density: {}'.format((density)))
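In machine learning, this is the final part, where you train and test the model.
You train the model on a percentage of your dataset, then test its accuracy by comparing its predictions against the rest of the dataset.
For this experiment, 90% of the dataset is used for training (about 83k wines) and the remaining 10% for testing (about 9k wines).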
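The idea of a holdout split can be sketched in a few lines of plain Python. The helper name holdout_split is hypothetical, and scikit-learn's train_test_split shuffles and rounds in its own way, so this is an illustration of the concept rather than a reimplementation.

```python
import random

def holdout_split(items, test_fraction=0.1, seed=101):
    # Shuffle a copy so the caller's data is left untouched
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # First n_test items become the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(range(1000))
print(len(train), len(test))
```

Every item ends up in exactly one of the two sets, which is what makes the test score an estimate of performance on unseen data.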
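The classifier we will use is RandomForestClassifier (RFC), because it's cool and works well in a lot of cases (> <).
That said, RFC is not as efficient (memory and CPU) as some other classifiers, but I have always found it very effective on small datasets.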
# Training the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
# Testing the model
predictions = rfc.predict(X_test)
print(classification_report(y_test, predictions))
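Result: with CountVectorizer, precision, recall and f1 all come out at 97%.
That is super good! One caveat: the earlier filter, duplicated('description', keep=False), keeps only reviews whose description appears more than once, so copies of the same text can land in both the training and the test set, which likely inflates this score.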
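For readers unfamiliar with the three metrics, here is a small sketch that computes them by hand for one class at a time. The labels are made up and the helper name per_class_scores is hypothetical; classification_report prints the same per-class quantities, along with support and averages.

```python
def per_class_scores(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (hypothetical) using the simplified 1 to 5 classes
y_true = [2, 2, 3, 3, 3, 4]
y_pred = [2, 3, 3, 3, 2, 4]

for label in sorted(set(y_true)):
    precision, recall, f1 = per_class_scores(y_true, y_pred, label)
    print(label, round(precision, 2), round(recall, 2), round(f1, 2))
```

Precision asks how many of the wines predicted as a class really belong to it, recall asks how many wines of that class were found, and f1 is the harmonic mean of the two.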