kunlong00 2018-09-15
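The aim of this article is not to explain K-Means clustering in detail, but to walk through an implementation that does not use scikit-learn.
K-Means is one of the most popular and simplest unsupervised learning algorithms for clustering.
The hyperparameter 'K' in K-Means is the number of clusters.
K-Means is a centroid-based clustering scheme: each cluster is represented by the mean of the points assigned to it.
I will use the following tiny data set to explain the algorithm step by step.
6 data points, 2 features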
Import the required Python modules:

    import pandas as pd
    import numpy as np
    import random
    import math
    import matplotlib.pyplot as plt
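Create the data frame with pandas. The Python code is as follows:

    # Each row is one data point; X1 and X2 are its two features.
    d = {'X1': [1, 1.5, 5, 3, 4, 3],
         'X2': [1, 1.5, 5, 4, 4, 3.5]}
    df = pd.DataFrame(data=d)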
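Algorithm steps
1. Randomly select K points from the data and call them the centroids. The Python code is as follows:

    def calRandomCentroids(k, df):
        # Pick k distinct data points at random as the initial centroids.
        # Note: this never terminates if k exceeds the number of distinct points.
        centroids = []
        while len(centroids) < k:
            rand = random.randint(0, len(df) - 1)
            randVal = tuple(df.loc[rand].values)
            # Skip points already chosen so that no two centroids coincide.
            if randVal not in centroids:
                centroids.append(randVal)
        return centroids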
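As an alternative to the retry loop, the same "K distinct random points" idea can be sketched with pandas' own sampling. The helper name `sample_centroids` and the `seed` parameter are my own illustrative choices, not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({'X1': [1, 1.5, 5, 3, 4, 3],
                   'X2': [1, 1.5, 5, 4, 4, 3.5]})

def sample_centroids(k, df, seed=None):
    """Hypothetical helper: pick k distinct rows of df as initial centroids."""
    # drop_duplicates() guarantees distinct points; sample() draws without
    # replacement and raises ValueError if k exceeds the rows available.
    rows = df.drop_duplicates().sample(n=k, random_state=seed)
    return [tuple(float(v) for v in row) for row in rows.to_numpy()]

centroids = sample_centroids(3, df, seed=0)
print(centroids)
```

Passing a fixed `seed` makes a run repeatable, which helps when comparing results between runs.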
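2. Compute the distance between every data point and each of the randomly picked centroids.
3. Assign each data point to the cluster whose centroid is nearest.

    def calDist(a, b):
        # Euclidean distance between two points.
        return math.sqrt(sum((np.array(a) - np.array(b)) ** 2))

    def makeClusters(k, df, centroids):
        # Map each centroid to the list of row indices assigned to it.
        clusters = {tup: [] for tup in centroids}
        for i in range(len(df)):
            point = tuple(df.loc[i].values)
            # Nearest centroid; on a tie, the first centroid in the list wins.
            ncp = min(centroids, key=lambda tup: calDist(point, tup))
            clusters[ncp].append(i)
        return clusters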
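To make steps 2 and 3 concrete, here is a small sketch that computes all point-to-centroid distances at once with NumPy broadcasting. The three starting centroids are an assumption I picked for illustration; a real run chooses them at random.

```python
import numpy as np

points = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])
centroids = np.array([[1, 1], [5, 5], [3, 4]])  # assumed starting centroids

# diffs[i, j] is the vector from centroid j to point i, shape (6, 3, 2).
diffs = points[:, None, :] - centroids[None, :, :]
# dists[i, j] is the Euclidean distance from point i to centroid j.
dists = np.sqrt((diffs ** 2).sum(axis=2))
# Each point is assigned to the index of its nearest centroid.
labels = dists.argmin(axis=1)
print(labels.tolist())  # [0, 0, 1, 2, 2, 2]
```

With these centroids, the first two points go to cluster 0, the point (5, 5) forms cluster 1 on its own, and the remaining three points go to cluster 2.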
4. Recompute each centroid with the formula
new centroid of Ci = (sum of all points in cluster Ci) / (number of points in cluster Ci)
The Python code is as follows:

    def calNewCentroids(clusters):
        # Uses the global data frame df defined above.
        newcentroids = []
        for k in clusters:
            # Column-wise mean of the rows assigned to this cluster
            # (an empty cluster would yield NaN values).
            cent = df.loc[clusters[k]].mean()
            newcentroids.append(tuple(cent))
        return newcentroids
Note: a newly computed centroid may or may not coincide with one of the data points.
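The centroid update can also be sketched as a pandas group-by. The `labels` list below is an assumed assignment of the six points to three clusters, used only to show the arithmetic.

```python
import pandas as pd

df = pd.DataFrame({'X1': [1, 1.5, 5, 3, 4, 3],
                   'X2': [1, 1.5, 5, 4, 4, 3.5]})
labels = [0, 0, 1, 2, 2, 2]  # assumed cluster index for each row

# Mean of every feature within each cluster = the new centroid.
new_centroids = df.assign(cluster=labels).groupby('cluster').mean()
print(new_centroids)
```

Cluster 0 contains (1, 1) and (1.5, 1.5), so its centroid is (1.25, 1.25), which is not one of the data points. Cluster 1 contains only (5, 5), so its centroid is that data point itself.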
5. Repeat steps 3 and 4 until convergence, meaning that the new centroids no longer differ much from the old ones.
The Python implementation is as follows:

    def checkConvergence(k, oldcentroids, newcentroids):
        # Distance each centroid moved during the last update.
        result = [calDist(oldcentroids[i], newcentroids[i]) for i in range(k)]
        print("convergence result is {}".format(result))
        # Converged when no centroid moved more than 0.5,
        # a loose threshold chosen for this tiny data set.
        return all(shift <= 0.5 for shift in result)
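How strict the convergence test is depends entirely on the threshold. The sketch below uses a hypothetical helper `has_converged` with an adjustable tolerance `tol` (both names are mine) and compares the 0.5 threshold used above with a much tighter one on an assumed pair of old and new centroids.

```python
import numpy as np

def has_converged(old, new, tol=1e-4):
    """Hypothetical helper: True when no centroid moved farther than tol."""
    old = np.asarray(old, dtype=float)
    new = np.asarray(new, dtype=float)
    # Euclidean shift of each centroid between two iterations.
    shifts = np.linalg.norm(new - old, axis=1)
    return bool((shifts <= tol).all())

old = [(1, 1), (5, 5), (3, 4)]                      # assumed old centroids
new = [(1.25, 1.25), (5, 5), (10 / 3, 11.5 / 3)]    # assumed new centroids

print(has_converged(old, new, tol=0.5))  # True
print(has_converged(old, new))           # False
```

The largest shift here is about 0.37, so a threshold of 0.5 already reports convergence, while a tolerance of 1e-4 asks for another iteration.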
Here I take K=3 and call the function kMeans(). The Python code is as follows:

    def kMeans(k, df):
        # Step 1: random initial centroids.
        centroids = calRandomCentroids(k, df)
        print("random centroids are {}".format(centroids))
        oldcentroids = centroids
        # Steps 2-4: first assignment and centroid update.
        clusters = makeClusters(k, df, oldcentroids)
        print("first iter clusters are {}".format(clusters))
        newcentroids = calNewCentroids(clusters)
        print("new centroids are {}".format(newcentroids))
        res = checkConvergence(k, oldcentroids, newcentroids)
        print(res)
        # Step 5: repeat until the centroids stop moving.
        while not res:
            oldcentroids = newcentroids
            clusters = makeClusters(k, df, oldcentroids)
            print("further iter clusters are {}".format(clusters))
            newcentroids = calNewCentroids(clusters)
            res = checkConvergence(k, oldcentroids, newcentroids)
            print(res)
        print("Final clusterings are {}".format(clusters))

    kMeans(3, df)
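For comparison, the whole loop can be sketched compactly in NumPy. This is not the implementation above: the function name `kmeans_numpy`, the `tol` and `max_iter` parameters, and the fixed starting centroids are all assumptions I made so that the run is repeatable.

```python
import numpy as np

def kmeans_numpy(points, centroids, tol=1e-4, max_iter=100):
    """Hypothetical compact K-Means loop starting from given centroids."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster; an empty cluster keeps its centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        # Stop once the largest centroid shift is within the tolerance.
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift <= tol:
            break
    return labels, centroids

points = [[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]]
start = [[1, 1], [5, 5], [3, 4]]  # assumed starting centroids
labels, centroids = kmeans_numpy(points, start)
print(labels.tolist())  # [0, 0, 1, 2, 2, 2]
print(centroids)
```

From this starting point, the assignment does not change after the first update, so the loop stops on the second iteration. A different random start can lead to a different final clustering.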
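The output of the program above is shown below. Because the initial centroids are chosen at random, the exact output varies from run to run.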