码迷,mamicode.com
首页 > 其他好文 > 详细

聚类算法总结

时间:2014-09-13 00:51:24      阅读:420      评论:0      收藏:0      [点我收藏+]

标签:style   blog   http   color   io   os   ar   strong   for   

最近要在spark上做一个聚类的项目,数据量和类的个数都比较大。KMeans效果尚可,但是有点慢,因而重新看了下常用的算法。


kmeans
     attention: init centers (randomize vs kmeans++)

mini-batch kmeans
     loops: random samples; find closest for all; update centers for each

mean shift
     init: get centers by bandwidth
     loops: find neighbours of centers; update centers; dedup

AP cluster
     init: get S; Rik=0, Aik=0
     loops: Rik = Sik - Max_k‘!=k(Aik‘ + Sik‘); Aik = min(0, Rkk + Sum_i‘!=i,k Max(0, Ri‘k)); Akk = Sum_i‘!=k Max(0, Ri’k)
     end: for any i, Max_k Rik + Aik as it’s exemplar

spectral clustering
     steps: similarity matrix S; S=UV; kmeans of U

Ward hierarchical
     init: each sample as center
     loops: merge to minimize RMSE within clusters

DBSCAN
     init: get densest core samples
     loops: get more core samples nearby old samples

其实scikit-learn实现了很多算法,也有现成的数据集可以做做实验。例如:http://scikit-learn.org/stable/modules/clustering.html 上有一些效果图,和算法扩展性的说明。

                                             A comparison of the clustering algorithms in scikit-learn

bubuko.com,布布扣


Method name Parameters Scalability Usecase Geometry (metric used)
K-Means number of clusters Very large n_samples, medium n_clusterswith MiniBatch code General-purpose, even cluster size, flat geometry, not too many clusters Distances between points
Affinity propagation damping, sample preference Not scalable with n_samples Many clusters, uneven cluster size, non-flat geometry Graph distance (e.g. nearest-neighbor graph)
Mean-shift bandwidth Not scalable withn_samples Many clusters, uneven cluster size, non-flat geometry Distances between points
Spectral clustering number of clusters Medium n_samples, small n_clusters Few clusters, even cluster size, non-flat geometry Graph distance (e.g. nearest-neighbor graph)
Ward hierarchical clustering number of clusters Large n_samples andn_clusters Many clusters, possibly connectivity constraints Distances between points
Agglomerative clustering number of clusters, linkage type, distance Large n_samples andn_clusters Many clusters, possibly connectivity constraints, non Euclidean distances Any pairwise distance
DBSCAN neighborhood size Very large n_samples, medium n_clusters Non-flat geometry, uneven cluster sizes Distances between nearest points
Gaussian mixtures many Not scalable Flat geometry, good for density estimation Mahalanobis distances to centers

聚类算法总结

标签:style   blog   http   color   io   os   ar   strong   for   

原文地址:http://blog.csdn.net/lzt1983/article/details/39236107

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!