Implementing K-Means Clustering from Scratch

kunlong00 2018-09-15

The purpose of this article is not to explain K-Means clustering in detail, but to walk through an implementation without using Scikit-learn.

K-Means is one of the most popular and simple unsupervised learning algorithms in machine learning, used for clustering.

The hyperparameter 'K' in K-Means refers to the number of clusters.

K-Means is a centroid-based clustering scheme.

I will use the following tiny dataset to explain the algorithm step by step.

[Figure: scatter plot of the example dataset (6 data points, 2 features)]

Import the required Python modules:

import pandas as pd
import numpy as np
import random
import math
import matplotlib.pyplot as plt

Create a DataFrame with pandas:

d = {'X1':[1,1.5,5,3,4,3], 'X2':[1,1.5,5,4,4,3.5]}
df = pd.DataFrame(data=d)
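Before clustering, the six points can be visualized with matplotlib (already imported above). This is a minimal sketch, not part of the author's code; the output filename `dataset.png` is an arbitrary choice:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

d = {'X1': [1, 1.5, 5, 3, 4, 3], 'X2': [1, 1.5, 5, 4, 4, 3.5]}
df = pd.DataFrame(data=d)

# Scatter plot of the 6 data points over their 2 features
plt.scatter(df['X1'], df['X2'])
plt.xlabel('X1')
plt.ylabel('X2')
plt.savefig('dataset.png')  # 'dataset.png' is a hypothetical filename
```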

Algorithm steps

1. Randomly pick 'K' points from the data and call them centroids. The Python code is as follows:

def calRandomCentroids(k, df):
    centroids = []
    for i in range(k):
        rand = random.randint(0, len(df) - 1)
        randVal = tuple(df.loc[rand].values)
        # Re-draw if this row was already picked, so the k centroids are distinct
        while randVal in centroids:
            rand = random.randint(0, len(df) - 1)
            randVal = tuple(df.loc[rand].values)
        centroids.append(randVal)
    return centroids
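As an aside, `random.sample` draws k distinct indices in a single call, so the re-draw loop above can be avoided entirely (a sketch, not the author's code):

```python
import random
import pandas as pd

df = pd.DataFrame({'X1': [1, 1.5, 5, 3, 4, 3], 'X2': [1, 1.5, 5, 4, 4, 3.5]})

# random.sample guarantees k distinct row indices, so no duplicate check is needed
idx = random.sample(range(len(df)), 3)
centroids = [tuple(df.loc[i].values) for i in idx]
print(centroids)
```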

2. Compute the distance between each data point and the randomly picked centroids.

3. Assign each data point to the cluster whose centroid is nearest among all centroids.

def calDist(a, b):
    # Euclidean distance between two points
    return math.sqrt(sum((np.array(a) - np.array(b)) ** 2))

def makeClusters(k, df, centroids):
    clusters = {}
    for tup in centroids:
        clusters[tup] = []
    for i in range(len(df)):
        # Distance from point i to every centroid, keyed by distance
        pointDists = {}
        for tup in centroids:
            dist = calDist(tuple(df.loc[i].values), tup)
            pointDists[dist] = tup
        ncp = pointDists.get(min(pointDists))  # nearest centroid
        clusters[ncp].append(i)  # store the row index of the point
    return clusters
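For example, with centroids (1, 1) and (5, 5), the point (1.5, 1.5) lands in the cluster of (1, 1), since its Euclidean distance is about 0.71 versus about 4.95. A minimal sketch of the assignment rule:

```python
import math
import numpy as np

def calDist(a, b):
    # Euclidean distance between two points
    return math.sqrt(sum((np.array(a) - np.array(b)) ** 2))

centroids = [(1.0, 1.0), (5.0, 5.0)]
point = (1.5, 1.5)

# Step 3: assign the point to the centroid with the smallest distance
nearest = min(centroids, key=lambda c: calDist(point, c))
print(nearest)  # (1.0, 1.0)
```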

4. Recompute each cluster's centroid using the formula:

new centroid of cluster Ci = (sum of all points in Ci) / (number of points in Ci)

The Python code is as follows:

def calNewCentroids(clusters):
    # Note: reads the module-level DataFrame df defined above
    newcentroids = []
    for k in clusters:
        sumc = 0
        for l in range(len(clusters[k])):
            sumc += df.loc[clusters[k][l]]
        cent = sumc / len(clusters[k])  # mean of the cluster's points
        newcentroids.append(tuple(cent))
    return newcentroids

Note: a newly computed centroid may or may not coincide with one of the data points.
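As a quick check of the recomputation formula: if a cluster contains the points (1, 1) and (1.5, 1.5), its new centroid is ((1 + 1.5) / 2, (1 + 1.5) / 2) = (1.25, 1.25). In NumPy:

```python
import numpy as np

# Two points assigned to one cluster (from the example dataset)
cluster_points = np.array([[1.0, 1.0], [1.5, 1.5]])

# Step 4: new centroid = component-wise mean of the cluster's points
new_centroid = cluster_points.mean(axis=0)
print(new_centroid)  # [1.25 1.25]
```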

5. Repeat steps 3 and 4 until convergence (meaning the new centroid values differ little from the old ones).

The Python implementation is as follows:

def checkConvergence(k, oldcentroids, newcentroids):
    result = []
    for i in range(k):
        rs = calDist(oldcentroids[i], newcentroids[i])
        result.append(rs)
    print("convergence result is {}".format(result))
    # Converged when every centroid has moved by at most 0.5
    count = 0
    for i in range(len(result)):
        if result[i] <= 0.5:
            count = count + 1
    return count == len(result)

Here I take K = 3 and call the kMeans() function:

def kMeans(k, df):
    # Step 1: random initial centroids
    centroids = calRandomCentroids(k, df)
    print("random centroids are {}".format(centroids))
    oldcentroids = centroids

    # Steps 2-3: first assignment of points to centroids
    clusters = makeClusters(k, df, oldcentroids)
    print("first iter clusters are {}".format(clusters))

    # Step 4: recompute centroids
    newcentroids = calNewCentroids(clusters)
    print("new centroids are {}".format(newcentroids))

    res = checkConvergence(k, oldcentroids, newcentroids)
    print(res)

    # Step 5: repeat until the centroids stop moving
    while not res:
        oldcentroids = newcentroids
        clusters = makeClusters(k, df, oldcentroids)
        print("further iter clusters are {}".format(clusters))
        newcentroids = calNewCentroids(clusters)
        res = checkConvergence(k, oldcentroids, newcentroids)
        print(res)

    print("Final clusterings are {}".format(clusters))


kMeans(3, df)

The output of the above program is shown below.

[Figure: screenshot of the program output]
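The five steps above can also be condensed into a compact, vectorized NumPy version (a sketch under my own naming, not the author's code; it uses a fixed seed and a stricter tolerance than the 0.5 threshold used above):

```python
import numpy as np

def kmeans_np(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct rows as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Steps 2-3: distance from every point to every centroid, then
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute centroids as cluster means (keep the old
        # centroid if a cluster happens to be empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < 1e-6:
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])
labels, centroids = kmeans_np(X, 3)
print(labels)
print(centroids)
```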
