Cluster Analysis

Date: 2017-03-13 22:00:51


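Linear regression and logistic regression are both supervised learning methods. Cluster analysis is a form of unsupervised learning: it explores the structure of a data set that has no labels. For example, it can identify communities in social network data, or cuisines in a collection of recipes. This section introduces the K-means clustering algorithm.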

1. K-means

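k is a hyperparameter that specifies how many clusters to form. K-means works by repeatedly moving the centroid of each cluster so as to minimize the cost function:

$$J = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $\mu_k$ is the centroid of cluster k, $C_k$ is the set of samples assigned to cluster k, and $x_i$ is the i-th sample.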
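To make the "repeatedly move the centroids" step concrete, here is a minimal NumPy sketch of the usual two-step iteration (assign each sample to its nearest centroid, then move each centroid to the mean of its samples). The function names `kmeans_cost` and `lloyd_kmeans`, the random initialization, and the fixed iteration count are assumptions made for illustration; this is not how scikit-learn implements it. The cost used here is the sum of squared distances from each sample to its assigned centroid.

```python
import numpy as np

def kmeans_cost(X, centroids, labels):
    """Sum of squared distances from each sample to its assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Plain two-step K-means iteration; a teaching sketch only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random (one common initialization).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    costs = []
    for _ in range(n_iter):
        # Assignment step: each sample goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its samples.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                centroids[j] = members.mean(axis=0)
        costs.append(kmeans_cost(X, centroids, labels))
    return centroids, labels, costs

# Two square blobs of 10 points each, similar to the experiment in section 2.
rng = np.random.default_rng(42)
X = np.vstack([rng.uniform(0.5, 1.5, (10, 2)), rng.uniform(3.5, 4.5, (10, 2))])
centroids, labels, costs = lloyd_kmeans(X, k=2)
print(centroids)
print(costs[:5])
```

The recorded cost never increases from one iteration to the next, because neither step can raise it. The iteration can still stop at a poor local minimum when the starting centroids are unlucky, which is one reason library implementations try several initializations.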

2. Experiment

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Generate two 2x10 matrices of uniformly distributed random values
cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
cluster2 = np.random.uniform(3.5, 4.5, (2, 10))
# Concatenate the two matrices side by side into a 2x20 matrix, then transpose
# it into a 20x2 matrix whose rows are (x, y) points
X = np.hstack((cluster1, cluster2)).T

# Plot the sample points as black dots
plt.figure()
plt.axis([0, 5, 0, 5])
plt.grid(True)
plt.plot(X[:, 0], X[:, 1], 'k.')

# Fit K-means with k=2 and mark the two centroids as red circles
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
plt.plot(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 'ro')
plt.show()


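Result:

[Figure: the sample points, with the two centroids found by K-means marked in red]

The plot shows that K-means found two centroids, one for each group of points.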
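Besides the centroids, a fitted model also records which cluster each sample was assigned to, and it can assign new points to the nearest centroid. The sketch below shows this with the `labels_` attribute and the `predict` method of scikit-learn's `KMeans`. The fixed seed, `n_init=10`, `random_state=0`, and the two query points are choices made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two square blobs of 10 points each, one row per (x, y) point.
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(0.5, 1.5, (10, 2)), rng.uniform(3.5, 4.5, (10, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Cluster index (0 or 1) assigned to each of the 20 training samples.
print(kmeans.labels_)

# Assign previously unseen points to the nearest centroid.
new_points = np.array([[1.0, 1.0], [4.0, 4.0]])
print(kmeans.predict(new_points))
```

Which blob gets index 0 and which gets index 1 is arbitrary, so compare labels with each other rather than with a fixed number.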

3. Choosing the optimal k

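It is often unclear in advance whether the data should be grouped into 2, 3, or 4 clusters. We therefore run K-means for a range of k values (1 through 9 in the code below) and compare the resulting cost values. The cost keeps falling as k grows, so the best k is not the one with the lowest cost but the one up to which the cost falls quickly and after which it falls only slowly: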

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

# Generate four 2x10 matrices of uniformly distributed random values
cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
cluster2 = np.random.uniform(1.5, 2.5, (2, 10))
cluster3 = np.random.uniform(2.5, 3.5, (2, 10))
cluster4 = np.random.uniform(3.5, 4.5, (2, 10))
# Concatenate the four matrices side by side into a 2x40 matrix, then transpose
# it into a 40x2 matrix whose rows are (x, y) points
X = np.hstack((cluster1, cluster2, cluster3, cluster4)).T  # shape (40, 2)

K = range(1, 10)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    # Mean distortion: average Euclidean distance from each sample to its
    # nearest centroid (used here as the cost value for this k)
    nearest = np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)
    meandistortions.append(nearest.sum() / X.shape[0])

# Top panel: the sample points; bottom panel: the cost curve over k
fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.grid(True)
ax1.plot(X[:, 0], X[:, 1], 'k.')
ax2.grid(True)
ax2.plot(K, meandistortions, 'bx-')
plt.show()


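[Figure: top panel, the sample points; bottom panel, the cost value plotted against k]

The curve shows that the cost falls as k increases, but each further increase in k lowers it by less. The best k is the point where the steep drop ends and the curve flattens. Here 3 is perhaps a reasonable choice, though you may read the plot differently.

Choosing k this way is called the elbow method: the curve looks like a bent arm, and the elbow marks the chosen k.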
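The code above plots the mean distance to the nearest centroid. If you would rather plot the sum of squared distances, scikit-learn stores that value on a fitted model as `inertia_`. The sketch below collects it for k = 1 through 9 and recomputes it by hand for the last model as a sanity check. The seed, `n_init=10`, and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Four adjacent square blobs of 10 points each, mirroring the data above
# (the seed and the generator are choices made for this sketch).
X = np.vstack([rng.uniform(lo, lo + 1.0, (10, 2)) for lo in (0.5, 1.5, 2.5, 3.5)])

ks = list(range(1, 10))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to nearest centroid

# Recompute the cost by hand for the last model (k=9): squared distance from
# each sample to its nearest centroid, summed over all samples.
sq_dists = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
manual_cost = sq_dists.min(axis=1).sum()

for k, j in zip(ks, inertias):
    print(k, round(j, 3))
print("manual cost for k=9:", round(manual_cost, 3))
```

Plotting `inertias` against `ks` gives an elbow curve of the same general shape, though squaring the distances makes the early drop look steeper.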


Original article: http://www.cnblogs.com/yuzhuwei/p/6545267.html
