K-Means Clustering: An Introduction to Machine Learning Algorithms, with a Python Implementation

stevenkwong 2018-06-10


In machine learning we are not always given an objective to optimize, nor are we always given target labels for the input data points. Problems without targets or labels are known in the AI field as unsupervised learning problems. In an unsupervised learning problem we try to model the latent structure of the data. Clustering is one such unsupervised learning problem: we try to group similar data points together based on their underlying structure. K-means is one of the most common clustering algorithms, where K denotes the number of clusters we want to partition the data points into.

K-Means Pseudocode

## K-Means Clustering

1. Choose the number of clusters (K) and obtain the data points.

2. Place the centroids c_1, c_2, ..., c_k at random positions.

3. Repeat steps 4 and 5 until convergence, or until a fixed number of iterations has elapsed.

4. For each data point x_i:

- find the nearest centroid among c_1, c_2, ..., c_k

- assign the point to that cluster

5. For each cluster j = 1..k:

- new centroid = mean of all points assigned to that cluster (written out as a formula below)

6. End
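To make step 5 concrete (the notation here is a standard convention, not something from the original pseudocode): if $S_j$ denotes the set of points currently assigned to cluster $j$, each iteration moves the centroid to

$$c_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i$$

and the alternating assign/update steps monotonically reduce the within-cluster sum of squared distances

$$J = \sum_{j=1}^{K} \sum_{x_i \in S_j} \lVert x_i - c_j \rVert^2,$$

which is also the quantity that the elbow method below plots against K.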

The simulation below gives a better feel for how the K-means algorithm works.

[Figure: simulation of the K-means algorithm]

How to choose K?

In some cases we do not know the number of clusters beforehand. How, then, do we choose a value for K? One approach is the elbow method: run the clustering with different numbers of clusters and plot the total distance from the points to their cluster centroids against K. The resulting plot looks like the one below.

[Figure: Elbow method]

From the plot above we can infer that the curve bends at k = 4. Even though the within-cluster distance keeps decreasing beyond 4, each extra cluster costs more computation for less improvement, much like the law of diminishing returns. We therefore choose 4 as the optimal K.
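To see the elbow method in code, here is a minimal sketch using scikit-learn's KMeans, whose inertia_ attribute is exactly the within-cluster sum of squares; the make_blobs data is just a stand-in for whatever feature matrix you are clustering:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

## Synthetic stand-in data with 4 true clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

## Fit KMeans for a range of K and record the within-cluster sum of squares
wcss = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

## The elbow is where the curve stops dropping steeply
plt.plot(ks, wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Within-cluster sum of squares')
plt.show()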

Applications of the K-means clustering algorithm

  • Behavioral segmentation
  • Anomaly detection
  • Social network analysis
  • Market segmentation

Let's write some code

We will build our algorithm on the Iris dataset. Even though the Iris dataset comes with labels, we will discard them and cluster the data using the feature points alone. We know there are 3 classes ('Iris-virginica', 'Iris-setosa', 'Iris-versicolor'), so in our case k = 3.

import pandas as pd
import numpy as np
from sklearn.utils import shuffle

## Load the Iris dataset
df = pd.read_csv('/Users/rohith/Documents/Datasets/Iris_dataset/iris.csv')

## Store the target value
classes = df['Species']

## Drop the Id and Species columns from the data
df = df.drop(['Id','Species'],axis=1)

## Convert the dataframe into a list and then into a numpy array
data = df.values.tolist()
data = np.array(data)

## Shuffle the data and classes together
data,classes = shuffle(data,classes)

## The first 135 points are used for training and the rest for testing
train_data = data[:135]
test_data = data[135:]

We load the dataset and drop the target values. We convert the feature points into a numpy array and split it into training and test data.
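If you don't have the CSV at that path, the same dataset ships with scikit-learn; a small sketch of an equivalent loading step (note that load_iris has no Id column to drop, and its labels are integers rather than species names):

from sklearn.datasets import load_iris
from sklearn.utils import shuffle

## The 4 feature columns and the integer class labels
iris = load_iris()
data, classes = shuffle(iris.data, iris.target)
train_data = data[:135]
test_data = data[135:]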

## K-Means Algorithm
import random
import numpy as np

## Randomly place the centroids of the three clusters
c1 = [float(np.random.randint(4,8)),float(np.random.randint(1,5)),
      float(np.random.randint(1,7)),float(np.random.randint(0,3))]
c2 = [float(np.random.randint(4,8)),float(np.random.randint(1,5)),
      float(np.random.randint(1,7)),float(np.random.randint(0,3))]
c3 = [float(np.random.randint(4,8)),float(np.random.randint(1,5)),
      float(np.random.randint(1,7)),float(np.random.randint(0,3))]

## Initialize the number of iterations you want to run
epochs = 1
while(epochs <= 100):
    cluster_1 = []
    cluster_2 = []
    cluster_3 = []
    for point in train_data:
        ## Find the Euclidean distance between the point and each centroid
        dis_point_c1 = ((c1[0]-point[0])**2 + (c1[1]-point[1])**2 +
                        (c1[2]-point[2])**2 + (c1[3]-point[3])**2)**0.5
        dis_point_c2 = ((c2[0]-point[0])**2 + (c2[1]-point[1])**2 +
                        (c2[2]-point[2])**2 + (c2[3]-point[3])**2)**0.5
        dis_point_c3 = ((c3[0]-point[0])**2 + (c3[1]-point[1])**2 +
                        (c3[2]-point[2])**2 + (c3[3]-point[3])**2)**0.5
        distances = [dis_point_c1,dis_point_c2,dis_point_c3]
        ## Find the closest centroid and assign the point to that cluster
        pos = distances.index(min(distances))
        if(pos == 0):
            cluster_1.append(point)
        elif(pos == 1):
            cluster_2.append(point)
        else:
            cluster_3.append(point)
    ## Store the old centroid values to compare against the new ones
    prev_c1 = c1
    prev_c2 = c2
    prev_c3 = c3
    cluster_1 = np.array(cluster_1)
    cluster_2 = np.array(cluster_2)
    cluster_3 = np.array(cluster_3)
    ## Take the mean of all points within a cluster as the new centroid
    if(len(cluster_1) != 0):
        c1 = [sum(cluster_1[:,0])/float(len(cluster_1)),
              sum(cluster_1[:,1])/float(len(cluster_1)),
              sum(cluster_1[:,2])/float(len(cluster_1)),
              sum(cluster_1[:,3])/float(len(cluster_1))]
    if(len(cluster_2) != 0):
        c2 = [sum(cluster_2[:,0])/float(len(cluster_2)),
              sum(cluster_2[:,1])/float(len(cluster_2)),
              sum(cluster_2[:,2])/float(len(cluster_2)),
              sum(cluster_2[:,3])/float(len(cluster_2))]
    if(len(cluster_3) != 0):
        c3 = [sum(cluster_3[:,0])/float(len(cluster_3)),
              sum(cluster_3[:,1])/float(len(cluster_3)),
              sum(cluster_3[:,2])/float(len(cluster_3)),
              sum(cluster_3[:,3])/float(len(cluster_3))]
    ## If the centroid values haven't changed, the algorithm has converged
    if(prev_c1 == c1 and prev_c2 == c2 and prev_c3 == c3):
        print("Converged")
        break
    print(epochs)
    epochs += 1

We implement the pseudocode shown above, and the algorithm converges after 6 iterations (the exact count depends on the random initialization). We can now feed in a test data point, find its closest centroid, and assign the point to the corresponding cluster.

pred = []
for point in test_data:
    ## Find the distance between the test data point and each centroid
    dis_point_c1 = ((c1[0]-point[0])**2 + (c1[1]-point[1])**2 +
                    (c1[2]-point[2])**2 + (c1[3]-point[3])**2)**0.5
    dis_point_c2 = ((c2[0]-point[0])**2 + (c2[1]-point[1])**2 +
                    (c2[2]-point[2])**2 + (c2[3]-point[3])**2)**0.5
    dis_point_c3 = ((c3[0]-point[0])**2 + (c3[1]-point[1])**2 +
                    (c3[2]-point[2])**2 + (c3[3]-point[3])**2)**0.5
    ## Find the cluster the point is closest to and append it to pred
    distances = [dis_point_c1,dis_point_c2,dis_point_c3]
    pos = distances.index(min(distances))
    pred.append(pos)

## Print the predictions
print(pred)
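As an aside, the per-coordinate distance arithmetic above can be collapsed with numpy broadcasting; a minimal vectorized sketch of the same assignment step (assuming the centroids are stacked into a single array):

import numpy as np

## Stack the three centroids into a (3, 4) array
centroids = np.array([c1, c2, c3])

## For each test point, compute the distance to all centroids at once
## and record the index of the nearest one
pred = [int(np.argmin(np.linalg.norm(centroids - point, axis=1)))
        for point in test_data]
print(pred)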

Once again, the scikit-learn library saves us from writing so many lines of code, by providing an object-level abstraction that we can use to run the algorithm.

from sklearn.cluster import KMeans

clf = KMeans(n_clusters = 3)
clf.fit(train_data)
pred = clf.predict(test_data)
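After fitting, the estimator exposes everything we computed by hand; for instance (the attribute names are scikit-learn's own, and the print calls are just for illustration):

## Learned centroids, playing the role of c1, c2, c3 above
print(clf.cluster_centers_)

## Cluster index assigned to each training point
print(clf.labels_)

## Within-cluster sum of squares, the quantity the elbow method plots
print(clf.inertia_)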

Conclusion

K-means is the gateway to clustering techniques, and it is the simplest of them. As you will have noticed, there is no loss function to take gradients of, so no partial derivatives are needed and the complicated math falls away. K-means is an easy-to-implement, handy algorithm.
