kunlong00 2018-09-15
The goal of this article is not to explain K-Means clustering in machine learning in detail, but to walk through the implementation details without using Scikit-learn.
K-Means is one of the popular and simple unsupervised machine learning algorithms used for clustering.
The hyperparameter 'K' in K-Means refers to the number of clusters.
K-Means is a centroid-based clustering scheme.
I will use the following tiny dataset to explain the algorithm step by step.

6 data points, 2 features
Import the required Python modules:
import pandas as pd
import numpy as np
import random
import math
import matplotlib.pyplot as plt
Create a data frame with pandas; the Python code is as follows:
d = {'X1':[1,1.5,5,3,4,3], 'X2':[1,1.5,5,4,4,3.5]}
df = pd.DataFrame(data=d)
Algorithm steps
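Since matplotlib is already imported, the six points can be visualized with a quick scatter plot before clustering (a minimal sketch; the `Agg` backend is my addition so it runs headless, and `plt.show()` would be used in an interactive session):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this sketch runs anywhere
import matplotlib.pyplot as plt

d = {'X1': [1, 1.5, 5, 3, 4, 3], 'X2': [1, 1.5, 5, 4, 4, 3.5]}
df = pd.DataFrame(data=d)

# scatter the six points to eyeball the likely clusters
plt.scatter(df['X1'], df['X2'])
plt.xlabel('X1')
plt.ylabel('X2')
# plt.show()  # in an interactive session
print(df.shape)  # -> (6, 2)
```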
1. Randomly pick 'K' points from the data and call them centroids. The Python code is as follows:
def calRandomCentroids(k, df):
    centroids = []
    for i in range(k):
        rand = random.randint(0, len(df) - 1)
        randVal = tuple(df.loc[rand].values)
        # re-draw until we get a point that is not already a centroid
        while randVal in centroids:
            rand = random.randint(0, len(df) - 1)
            randVal = tuple(df.loc[rand].values)
        centroids.append(randVal)
    return centroids
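The same random pick can be written more compactly with `random.sample`, which draws k distinct row indices in a single call (a sketch, not the author's code; `calRandomCentroidsAlt` is a name I made up):

```python
import random
import pandas as pd

d = {'X1': [1, 1.5, 5, 3, 4, 3], 'X2': [1, 1.5, 5, 4, 4, 3.5]}
df = pd.DataFrame(data=d)

def calRandomCentroidsAlt(k, df):
    # sample k distinct row indices, then turn those rows into tuples
    idx = random.sample(range(len(df)), k)
    return [tuple(df.loc[i].values) for i in idx]

centroids = calRandomCentroidsAlt(3, df)
print(centroids)  # 3 distinct (X1, X2) tuples drawn from the data
```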
2. Compute the distance between each data point and every one of the randomly picked centroids.
3. Assign each data point to the cluster of the centroid with the smallest distance among all centroids.
def calDist(a, b):
    return math.sqrt(sum((np.array(a) - np.array(b)) ** 2))
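The helper above is the plain Euclidean distance; for reference, `np.linalg.norm` of the difference vector computes the same quantity (a sketch, not part of the author's code):

```python
import math
import numpy as np

def calDist(a, b):
    return math.sqrt(sum((np.array(a) - np.array(b)) ** 2))

# np.linalg.norm of the difference vector is the same Euclidean distance
a, b = (1, 1), (4, 5)
print(calDist(a, b))                              # -> 5.0
print(np.linalg.norm(np.array(a) - np.array(b)))  # -> 5.0
```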
def makeClusters(k, df, centroids):
    clusters = {}
    for tup in centroids:
        clusters[tup] = []
    for i in range(len(df)):
        pointDists = {}
        for tup in centroids:
            dist = calDist(tuple(df.loc[i].values), tup)
            pointDists[dist] = tup
        ncp = pointDists.get(min(pointDists))  # nearest centroid point
        clusters[ncp].append(i)  # store the row index of the data point
    return clusters
4. Recompute the new centroids using the formula
new centroid of cluster Ci = (sum of all points in cluster Ci) / (number of points in cluster Ci)
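For example, if a cluster contains the points (1, 1), (1.5, 1.5) and (3, 3.5), its new centroid is the coordinate-wise mean, which `np.mean` along axis 0 computes directly (a sketch illustrating the formula):

```python
import numpy as np

pts = np.array([[1, 1], [1.5, 1.5], [3, 3.5]])
# sum of points / number of points, per coordinate
centroid = pts.mean(axis=0)
print(centroid)  # X1: (1 + 1.5 + 3) / 3, X2: (1 + 1.5 + 3.5) / 3
```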
The Python code is as follows:
def calNewCentroids(clusters):
    newcentroids = []
    for k in clusters:
        sumc = 0
        for l in range(len(clusters[k])):
            sumc += df.loc[clusters[k][l]]  # df is the global data frame
        cent = sumc / len(clusters[k])
        newcentroids.append(tuple(cent))
    return newcentroids
Note: the newly computed centroid points may or may not coincide with actual data points.
5. Repeat steps 3 and 4 until convergence (meaning the new centroid values no longer differ much from the old ones).
The Python implementation is as follows:
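The same convergence test can also be sketched in vectorized form; the 0.5 tolerance matches the threshold used in the implementation below, and the centroid values here are made up purely for illustration:

```python
import numpy as np

old = np.array([[1.0, 1.0], [4.0, 4.0], [3.0, 3.5]])     # hypothetical old centroids
new = np.array([[1.25, 1.25], [4.5, 4.5], [3.0, 3.75]])  # hypothetical new centroids

moves = np.linalg.norm(new - old, axis=1)  # how far each centroid moved
converged = bool((moves <= 0.5).all())     # did every centroid move at most 0.5?
print(moves, converged)
```

Here the second centroid moved by about 0.71 > 0.5, so this pair would not count as converged yet.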
def checkConvergence(k, oldcentroids, newcentroids):
    result = []
    for i in range(k):
        rs = calDist(oldcentroids[i], newcentroids[i])
        result.append(rs)
    print("convergence result is {}".format(result))
    count = 0
    for i in range(len(result)):
        if result[i] <= 0.5:  # tolerance: a centroid may move by at most 0.5
            count = count + 1
    return count == len(result)
Here I take K = 3 and call the function kMeans(). The Python code is as follows:
def kMeans(k, df):
    centroids = calRandomCentroids(k, df)
    print("random centroids are {}".format(centroids))
    oldcentroids = centroids
    clusters = makeClusters(k, df, oldcentroids)
    print("first iter clusters are {}".format(clusters))
    newcentroids = calNewCentroids(clusters)
    print("new centroids are {}".format(newcentroids))
    res = checkConvergence(k, oldcentroids, newcentroids)
    print(res)
    while res == False:
        oldcentroids = newcentroids
        clusters = makeClusters(k, df, oldcentroids)
        print("further iter clusters are {}".format(clusters))
        newcentroids = calNewCentroids(clusters)
        res = checkConvergence(k, oldcentroids, newcentroids)
        print(res)
    print("Final clusterings are {}".format(clusters))
kMeans(3, df)
The output of the above program is shown below.
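As a closing sketch, the whole loop above can also be written with NumPy broadcasting. To keep it deterministic, the initial centroids are fixed to rows 0, 2 and 3 (my choice, for reproducibility) instead of random picks:

```python
import numpy as np

X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]], dtype=float)
centroids = X[[0, 2, 3]]  # fixed initial centroids: rows 0, 2 and 3

for _ in range(10):  # a handful of iterations is plenty for six points
    # distance of every point to every centroid via broadcasting: shape (6, k)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # nearest-centroid assignment per point
    newcentroids = np.array([X[labels == j].mean(axis=0)
                             for j in range(len(centroids))])
    if np.allclose(newcentroids, centroids):  # centroids stopped moving
        break
    centroids = newcentroids

print(labels)     # cluster index of each of the six points
print(centroids)  # final centroid coordinates
```

With this initialization no cluster ever becomes empty on these six points; a robust version would guard against an empty cluster before taking the mean.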
