This is a demonstration of how to count-vectorize real text data with scikit-learn.
Today, we'll look at one of the most basic ways to represent text data numerically: one-hot encoding (also called count vectorization). The idea is simple.
We create a vector with one dimension for each word in the vocabulary. If the text contains a given vocabulary word, we put a 1 in that dimension, and every time we encounter that word again we increment the count; a word that never appears stays at 0.
Used on real text data, these vectors become very large, but they give us an exact record of which words each text contains. Unfortunately, they carry no semantic or relational information, and that's fine, because capturing such information is not the point of this technique.
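Before reaching for a library, it helps to see the idea by hand. A minimal sketch in plain Python (the two-sentence corpus is made up purely for illustration):
corpus = ["the cat sat", "the cat sat on the mat"]
# Build the vocabulary: one dimension per unique word.
vocab = sorted({word for text in corpus for word in text.split()})
index = {word: i for i, word in enumerate(vocab)}
# Count the occurrences of each vocabulary word in each text.
vectors = []
for text in corpus:
    v = [0] * len(vocab)
    for word in text.split():
        v[index[word]] += 1
    vectors.append(v)
print(vocab)    # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]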
Today, we'll use scikit-learn's implementation, which also takes care of tokenization and lowercasing for us.
Here is a basic Python example of using count vectorization to get vectors:
from sklearn.feature_extraction.text import CountVectorizer
# To create a Count Vectorizer, we simply need to instantiate one.
# There are special parameters we can set here when making the vectorizer, but
# for the most basic example, it is not needed.
vectorizer = CountVectorizer()
# For our text, we are going to take some text from our previous blog post
# about count vectorization
sample_text = ["One of the most basic ways we can numerically represent words "
"is through the one-hot encoding method (also sometimes called "
"count vectorizing)."]
# To build the vocabulary, we simply need to call fit on the text
# data that we wish to vectorize
vectorizer.fit(sample_text)
# Now, we can inspect how our vectorizer vectorized the text
# This will print out a list of words used, and their index in the vectors
print('Vocabulary: ')
print(vectorizer.vocabulary_)
# If we would like to actually create a vector, we can do so by passing the
# text into the vectorizer to get back counts
vector = vectorizer.transform(sample_text)
# Our final vector:
print('Full vector: ')
print(vector.toarray())
# Or if we wanted to get the vector for one word:
print('Hot vector: ')
print(vectorizer.transform(['hot']).toarray())
# Or if we wanted to get multiple vectors at once to build matrices
print('Hot and one: ')
print(vectorizer.transform(['hot', 'one']).toarray())
# We could also do the whole thing at once with the fit_transform method:
print('One swoop:')
new_text = ['Today is the day that I do the thing today, today']
new_vectorizer = CountVectorizer()
print(new_vectorizer.fit_transform(new_text).toarray())
Output:
Vocabulary:
{'one': 12, 'of': 11, 'the': 15, 'most': 9, 'basic': 1, 'ways': 18, 'we': 19,
'can': 3, 'numerically': 10, 'represent': 13, 'words': 20, 'is': 7,
'through': 16, 'hot': 6, 'encoding': 5, 'method': 8, 'also': 0,
'sometimes': 14, 'called': 2, 'count': 4, 'vectorizing': 17}
Full vector:
[[1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1]]
Hot vector:
[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
Hot and one:
[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]]
One swoop:
[[1 1 1 1 2 1 3]]
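One detail worth noting before we move on: transform returns a SciPy sparse matrix rather than a plain array, since with a realistic vocabulary almost every entry is 0. That is why the examples above call toarray() before printing. A quick check, reusing the vectorizer fitted above ('zebra' is just an arbitrary out-of-vocabulary word):
sparse_vector = vectorizer.transform(sample_text)
print(type(sparse_vector))  # a SciPy CSR sparse matrix
print(sparse_vector.nnz)    # only the non-zero entries are stored
# Words never seen during fit are silently dropped:
print(vectorizer.transform(['zebra']).toarray())  # all zeros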
So let's use it on some real data! We'll look at the 20 Newsgroups dataset that ships with scikit-learn.
The Python code is as follows:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
# Create our vectorizer
vectorizer = CountVectorizer()
# Let's fetch all the possible text data
newsgroups_data = fetch_20newsgroups()
# Why not inspect a sample of the text data?
print('Sample 0: ')
print(newsgroups_data.data[0])
print()
# Fit the vectorizer on all of the text data
vectorizer.fit(newsgroups_data.data)
# Let's look at the vocabulary:
print('Vocabulary: ')
print(vectorizer.vocabulary_)
print()
# Converting our first sample into a vector
v0 = vectorizer.transform([newsgroups_data.data[0]]).toarray()[0]
print('Sample 0 (vectorized): ')
print(v0)
print()
# It's too big to even see...
# What's the length?
print('Sample 0 (vectorized) length: ')
print(len(v0))
print()
# How many words does it have?
print('Sample 0 (vectorized) sum: ')
print(np.sum(v0))
print()
# What if we wanted to go back to the source?
print('To the source:')
print(vectorizer.inverse_transform([v0]))  # wrap in a list: one row, 2-D
print()
# So all this data has a lot of extra garbage... Why not strip it away?
newsgroups_data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
# Why not inspect a sample of the text data?
print('Sample 0: ')
print(newsgroups_data.data[0])
print()
# Refit the vectorizer on the stripped text data
vectorizer.fit(newsgroups_data.data)
# Let's look at the vocabulary:
print('Vocabulary: ')
print(vectorizer.vocabulary_)
print()
# Converting our first sample into a vector
v0 = vectorizer.transform([newsgroups_data.data[0]]).toarray()[0]
print('Sample 0 (vectorized): ')
print(v0)
print()
# It's too big to even see...
# What's the length?
print('Sample 0 (vectorized) length: ')
print(len(v0))
print()
# How many words does it have?
print('Sample 0 (vectorized) sum: ')
print(np.sum(v0))
print()
# What if we wanted to go back to the source?
print('To the source:')
print(vectorizer.inverse_transform([v0]))  # wrap in a list: one row, 2-D
print()
Output:
Sample 0:
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----
Vocabulary:
{'from': 56979, 'lerxst': 75358, 'wam': 123162, 'umd': 118280, 'edu': 50527,
'where': 124031, 'my': 85354, 'thing': 114688, 'subject': 111322,
'what': 123984, 'car': 37780, 'is': 68532, 'this': 114731, 'nntp': 87620,
'posting': 95162, 'host': 64095, 'rac3': 98949, 'organization': 90379,
'university': 118983, 'of': 89362, 'maryland': 79666,
'college': 40998, ... } (Abbreviated...)
Sample 0 (vectorized):
[0 0 0 ... 0 0 0]
Sample 0 (vectorized) length:
130107
Sample 0 (vectorized) sum:
122
To the source:
[array(['15', '60s', '70s', 'addition', 'all', 'anyone', 'be', 'body',
'bricklin', 'brought', 'bumper', 'by', 'called', 'can', 'car',
'college', 'could', 'day', 'door', 'doors', 'early', 'edu',
'engine', 'enlighten', 'from', 'front', 'funky', 'have', 'history',
'host', 'if', 'il', 'in', 'info', 'is', 'it', 'know', 'late',
'lerxst', 'lines', 'looked', 'looking', 'made', 'mail', 'maryland',
'me', 'model', 'my', 'name', 'neighborhood', 'nntp', 'of', 'on',
'or', 'organization', 'other', 'out', 'park', 'please', 'posting',
'production', 'rac3', 'really', 'rest', 'saw', 'separate', 'small',
'specs', 'sports', 'subject', 'tellme', 'thanks', 'the', 'there',
'thing', 'this', 'to', 'umd', 'university', 'wam', 'was', 'were',
'what', 'whatever', 'where', 'wondering', 'years', 'you', 'your'],
dtype='<U180')]
Sample 0:
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
Vocabulary:
{'was': 95844, 'wondering': 97181, 'if': 48754, 'anyone': 18915, 'out': 68847,
'there': 88638, 'could': 30074, 'enlighten': 37335, 'me': 60560, 'on': 68080,
'this': 88767, 'car': 25775, 'saw': 80623, 'the': 88532, 'other': 68781,
'day': 31990, 'it': 51326, 'door': 34809, 'sports': 84538, 'looked': 57390,
'to': 89360, 'be': 21987, 'from': 41715, 'late': 55746, '60s': 9843,
'early': 35974, '70s': 11174, 'called': 25492, 'bricklin': 24160, 'doors': 34810,
'were': 96247, 'really': 76471, ... } (Abbreviated...)
Sample 0 (vectorized):
[0 0 0 ... 0 0 0]
Sample 0 (vectorized) length:
101631
Sample 0 (vectorized) sum:
85
To the source:
[array(['60s', '70s', 'addition', 'all', 'anyone', 'be', 'body',
'bricklin', 'bumper', 'called', 'can', 'car', 'could', 'day',
'door', 'doors', 'early', 'engine', 'enlighten', 'from', 'front',
'funky', 'have', 'history', 'if', 'in', 'info', 'is', 'it', 'know',
'late', 'looked', 'looking', 'made', 'mail', 'me', 'model', 'name',
'of', 'on', 'or', 'other', 'out', 'please', 'production', 'really',
'rest', 'saw', 'separate', 'small', 'specs', 'sports', 'tellme',
'the', 'there', 'this', 'to', 'was', 'were', 'whatever', 'where',
'wondering', 'years', 'you'], dtype='<U81')]
We now know how to vectorize text by counts, but what can we actually do with this information?
For one, we can run all kinds of analyses: we can look at word frequencies, strip out stop words, or try clustering. Now that we have a numerical representation of the text data, there is a lot we can do that we couldn't before!
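As a small taste of that, CountVectorizer can do some of this cleanup for us. A quick sketch (the min_df and max_df values are arbitrary choices for illustration, not recommendations):
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
newsgroups_data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
# Drop common English stop words, plus any word that appears in fewer
# than 5 documents or in more than half of all documents.
trimmed_vectorizer = CountVectorizer(stop_words='english', min_df=5, max_df=0.5)
trimmed_vectorizer.fit(newsgroups_data.data)
print(len(trimmed_vectorizer.vocabulary_))  # far smaller than the 101631 above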
Let's get more concrete. We've been working with text from the 20 Newsgroups dataset.
The 20 Newsgroups dataset is split into 20 different categories. Why not use our vectorization to try to classify this data?
The Python code is as follows:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
# Create our vectorizer
vectorizer = CountVectorizer()
# All data
newsgroups_train = fetch_20newsgroups(subset='train',
remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test',
remove=('headers', 'footers', 'quotes'))
# Get the training vectors
vectors = vectorizer.fit_transform(newsgroups_train.data)
# Build the classifier
clf = MultinomialNB(alpha=.01)
# Train the classifier
clf.fit(vectors, newsgroups_train.target)
# Get the test vectors
vectors_test = vectorizer.transform(newsgroups_test.data)
# Predict and score the vectors
pred = clf.predict(vectors_test)
acc_score = metrics.accuracy_score(newsgroups_test.target, pred)
f1_score = metrics.f1_score(newsgroups_test.target, pred, average='macro')
print('Total accuracy classification score: {}'.format(acc_score))
print('Total F1 classification score: {}'.format(f1_score))
Output:
Total accuracy classification score: 0.6460435475305364
Total F1 classification score: 0.6203806145034193
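Roughly 65% accuracy across 20 classes, from nothing but raw word counts. To see which categories the classifier finds easy or hard, we could add a per-class breakdown to the script above; a one-line sketch using scikit-learn's classification_report:
# Per-category precision, recall, and F1, continuing from the script above.
print(metrics.classification_report(newsgroups_test.target, pred,
                                    target_names=newsgroups_test.target_names))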