KMeans and optimization

Date: 2017-09-23 10:44:38

  • random scheme (the naive approach)

input: k, set of n points  

place k centroids at random locations

  • repeat the following operations until convergence

--for each point i:

  1. find the nearest of the k centroids, centroid j (using the distance formula)
  2. put point i into cluster j

--for each cluster j:

  1. compute the mean of every attribute over the points in cluster j; this mean is the new centroid j

(attributes must be numeric, not categorical or ordinal)

  • stop when none of the cluster assignments change, i.e. no point changes its cluster membership
  • time O(iterations*k*n*dimensions); per iteration: O(kn) distance computations; memory O(k+n)
  • nothing can be precached, because the centroids change on every iteration
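
The naive scheme above can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: the function name `naive_kmeans`, the `rng` and `max_iter` parameters, seeding from k distinct data points, and keeping the old centroid for an empty cluster are all assumptions made for illustration.

```python
import numpy as np

def naive_kmeans(points, k, rng=None, max_iter=100):
    """Naive k-means: random seeding, stop when no assignment changes."""
    rng = np.random.default_rng(rng)
    n = points.shape[0]
    # place k centroids at random locations (assumption: k distinct data points)
    centroids = points[rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # for each point i: distance to every centroid, O(k*n) distances
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no point changed its cluster membership
        labels = new_labels
        # for each cluster j: the new centroid is the per-attribute mean
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

Calling `naive_kmeans(points, k=2)` on an (n, d) float array returns the k centroids and one cluster label per point. The distance matrix is recomputed from scratch in every pass, which is the "nothing can be precached" cost noted above.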

 

 

  • optimization

1. k-means++ (uses an adaptive sampling scheme): slow but small error; random selection: extremely fast but large error

2. AFK-MC2: uses a Markov chain to improve on k-means++
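
To make the adaptive sampling in item 1 concrete, here is a rough sketch of k-means++ seeding, in which each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function name `kmeanspp_seed` and the `rng` parameter are assumptions for illustration, and the sketch assumes the points are not all identical (otherwise the probabilities are undefined).

```python
import numpy as np

def kmeanspp_seed(points, k, rng=None):
    """k-means++ seeding via D^2 (adaptive) sampling."""
    rng = np.random.default_rng(rng)
    n = points.shape[0]
    centers = [points[rng.integers(n)]]  # first center: uniform at random
    # squared distance from every point to its nearest chosen center
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        probs = d2 / d2.sum()            # far-away points are more likely
        idx = rng.choice(n, p=probs)
        centers.append(points[idx])
        # one full pass over the data per center: this is why seeding is slow
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)
```

An already chosen point has squared distance 0 and therefore probability 0, so the same point is never picked twice.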

  • AFK-MC2: changes the way seeding is done

paper: https://las.inf.ethz.ch/files/bachem16fast.pdf

The data points are the states of the Markov chain, starting from an initial data point

a further data point is sampled to act as the candidate for the next state

a randomized decision determines whether the Markov chain transitions to the candidate or remains in the old state

repeat, and the last state is returned as the initial cluster center
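
The chain described above might look roughly like the following. Treat this as a sketch under stated assumptions: the proposal distribution `q` (half proportional to the squared distance from the first center, half uniform) and the acceptance test follow my reading of the linked paper, while the names `afkmc2_seed`, `chain_length` and `rng` are made up for illustration. It also assumes the points are not all identical.

```python
import numpy as np

def afkmc2_seed(points, k, chain_length=200, rng=None):
    """Markov-chain approximation of k-means++ seeding (sketch)."""
    rng = np.random.default_rng(rng)
    n = points.shape[0]
    first = points[rng.integers(n)]          # first center: uniform at random
    centers = [first]
    # proposal distribution q, built once with a single pass over the data
    d2_first = np.sum((points - first) ** 2, axis=1)
    q = 0.5 * d2_first / d2_first.sum() + 0.5 / n
    for _ in range(1, k):
        chosen = np.array(centers)
        # initial state and all candidates of this chain are drawn from q
        cand = rng.choice(n, size=chain_length, p=q)
        diff = points[cand][:, None, :] - chosen[None, :, :]
        cand_d2 = (diff ** 2).sum(axis=2).min(axis=1)  # squared distance to nearest center
        state, state_d2 = cand[0], cand_d2[0]
        for j in range(1, chain_length):
            # randomized decision: move to the candidate or stay in the old state
            if cand_d2[j] * q[state] > state_d2 * q[cand[j]] * rng.random():
                state, state_d2 = cand[j], cand_d2[j]
        centers.append(points[state])        # last state becomes the next center
    return np.array(centers)
```

Only `chain_length` distances to the current centers are computed per new center, instead of a full pass over all n points as in k-means++.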

  • code 
  1. Euclidean distance: np.linalg.norm(a - b)
  2. load the data: np.loadtxt(name)
  3. variables:

    parameter: epsilon = 0  # threshold: minimum error used in the stop condition

    history_centroids = []

    record the configuration: num_instances, num_features = dataset.shape

    initialization: prototypes = dataset[np.random.randint(0, num_instances, size=k)]  # the upper bound of randint is exclusive

    np.ndarray of k rows with num_features elements each, holding the previous centroids: prototypes_old = np.zeros(prototypes.shape)

    cluster assignments: belongs_to = np.zeros((num_instances, 1))

  4. iteration:
iteration = 0
norm = dist_method(prototypes, prototypes_old)
while norm > epsilon:

  iteration += 1

  norm = dist_method(prototypes, prototypes_old)  # change between successive iterations, used for the stop test

  for index_instance, instance in enumerate(dataset):
    dist_vec = np.zeros((k, 1))

    for index_prototype, prototype in enumerate(prototypes):

      dist_vec[index_prototype] = dist_method(prototype, instance)

    belongs_to[index_instance, 0] = np.argmin(dist_vec)

  tmp_prototypes = np.zeros((k, num_features))

  for .....(cluster)  # update step, elided in these notes
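
Since the update step is elided in the notes, here is one way the whole loop could be completed so that it runs. It is only a sketch: the update step (per-cluster mean), the `max_iter` guard, the `rng` parameter, starting `norm` at infinity, and seeding with distinct rows are assumptions, not taken from the notes. The variable names follow the notes.

```python
import numpy as np

def dist_method(a, b):
    """Euclidean distance; also measures how far the centroids moved."""
    return np.linalg.norm(a - b)

def kmeans(dataset, k, epsilon=0, rng=None, max_iter=100):
    rng = np.random.default_rng(rng)
    num_instances, num_features = dataset.shape
    # assumption: seed with k distinct rows of the dataset
    prototypes = dataset[rng.choice(num_instances, size=k, replace=False)].astype(float)
    history_centroids = [prototypes]
    belongs_to = np.zeros(num_instances, dtype=int)
    norm = np.inf  # assumption: force at least one pass
    iteration = 0
    while norm > epsilon and iteration < max_iter:
        iteration += 1
        prototypes_old = prototypes
        # assignment step: nearest prototype for every instance
        for index_instance, instance in enumerate(dataset):
            dist_vec = [dist_method(prototype, instance) for prototype in prototypes]
            belongs_to[index_instance] = np.argmin(dist_vec)
        # update step (assumed): each prototype becomes the mean of its members
        tmp_prototypes = prototypes.copy()
        for index in range(k):
            members = dataset[belongs_to == index]
            if len(members) > 0:
                tmp_prototypes[index] = members.mean(axis=0)
        prototypes = tmp_prototypes
        history_centroids.append(prototypes)
        # stop test: how far did the centroids move in this pass?
        norm = dist_method(prototypes, prototypes_old)
    return prototypes, history_centroids, belongs_to
```

With epsilon = 0 the loop ends once a pass leaves every centroid exactly where it was; a small positive epsilon stops earlier at the cost of a slightly less settled result.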

 

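  • scaling with n and k

sampling and approximation approaches: perform poorly, and the clustering gets worse as k grows.

initial centroid selection (smarter seeding); exact acceleration of each iteration, like the 'blacklist', 'Elkan's' and 'Hamerly's' algorithms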

  • blacklist algorithm

build a tree on the data, then iterate over all the centroids and rule some of them out.

setup cost O(n lg n) to build the tree; worst-case computation: O(kn lg n); memory: O(k + n lg n)

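  • Elkan's algorithm

compute the distances between centroids and weigh them against the point-to-centroid distances (via the triangle inequality) to reduce the number of distance computations

no setup cost; worst case: O(k^2 + kn); memory: O(k^2 + kn)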
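
The pruning idea can be illustrated with a single assignment pass. If c is the current center of point x and d(c, c') >= 2 d(x, c), then by the triangle inequality d(x, c') >= d(x, c), so the distance to c' never has to be computed. The sketch below shows only this test; the full algorithm additionally keeps upper and lower bounds per point across iterations, which is not shown. The function name `assign_with_pruning` and the `skipped` counter are illustrative assumptions.

```python
import numpy as np

def assign_with_pruning(points, centroids, labels):
    """Re-assign points, skipping distances ruled out by the triangle inequality."""
    k = len(centroids)
    # centroid-to-centroid distances: the O(k^2) part
    cc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = labels.copy()
    skipped = 0
    for i, x in enumerate(points):
        best = new_labels[i]
        best_dist = np.linalg.norm(x - centroids[best])
        for j in range(k):
            if j == best:
                continue
            if cc[best, j] >= 2 * best_dist:
                skipped += 1  # centroid j cannot be closer than the current best
                continue
            d = np.linalg.norm(x - centroids[j])
            if d < best_dist:
                best, best_dist = j, d
        new_labels[i] = best
    return new_labels, skipped
```

The returned `skipped` count is the number of point-to-centroid distances avoided; it tends to grow as the previous assignment gets closer to the final one.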

  • Dual-Tree k-means with bounded single-iteration runtime

paper: http://www.ratml.org/pub/pdf/2016dual.pdf

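  1. build two trees: query tree T and reference tree Q. T: holds the points of the task instance whose nearest neighbours are being sought. Q: the set the nearest neighbours come from
  2. traverse both at the same time: on visiting a pair (T.node, Q.node), check whether it can be pruned; if so, prune the whole subtree (usable for nearest-neighbour search, kernel density estimation, kernel conditional density estimation, and so on)
  3. space tree: not a space-partitioning tree; nodes are allowed to overlap. It is an undirected, acyclic, rooted simple graph in which:
    1. each node holds a number of points (possibly 0) and is connected to one parent node and a number of child nodes (possibly 0)
    2. there is one root node
    3. every point is contained in at least one tree node
    4. each node has a multi-dimensional convex subset containing all points in that node as well as the convex subsets represented by its children; that is, each node has a bounding shape containing all its descendant points
  4. traverse

visit each pair (a combination of a T node and a Q node) no more than once and compute a score for the combination

if score > bound or the score is infinite, the combination is pruned. Otherwise run the computation between every point of the T node and every point of the Q node, rather than computing a score between all descendant points

once only leaves of the trees remain, call the base case

!!: dual-tree algorithm = space tree + pruning dual-tree traversal + BaseCase() + Score()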
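
As a rough illustration of the recipe "space tree + pruning dual-tree traversal + BaseCase() + Score()", the sketch below runs a dual-tree nearest-neighbour search. It is not the algorithm of the linked paper: it assumes a simple kd-tree whose points sit only in the leaves, a bounding-box distance as the score, and the worst current nearest-neighbour distance under the query node as the bound. All names (`Node`, `score`, `dual_tree_nn`, `leaf_size`) are made up here; `t` and `q` follow the T (query) and Q (reference) naming above.

```python
import numpy as np

class Node:
    """kd-tree style space tree node: a bounding box plus descendant indices."""
    def __init__(self, points, idx, leaf_size):
        self.idx = idx                      # indices of all descendant points
        self.lo = points[idx].min(axis=0)   # bounding box (a convex subset)
        self.hi = points[idx].max(axis=0)
        self.children = []
        if len(idx) > leaf_size:
            dim = np.argmax(self.hi - self.lo)          # split the widest dimension
            order = idx[np.argsort(points[idx, dim])]
            mid = len(order) // 2
            self.children = [Node(points, order[:mid], leaf_size),
                             Node(points, order[mid:], leaf_size)]

def score(t, q):
    """Smallest possible distance between the bounding boxes of two nodes."""
    gap = np.maximum(0.0, np.maximum(t.lo - q.hi, q.lo - t.hi))
    return np.linalg.norm(gap)

def dual_tree_nn(queries, references, leaf_size=4):
    best_dist = np.full(len(queries), np.inf)
    best_idx = np.full(len(queries), -1)
    stats = {"base_cases": 0}

    def base_case(t, q):
        # exact distances between the points of two leaves
        d = np.linalg.norm(queries[t.idx][:, None, :] - references[q.idx][None, :, :], axis=2)
        stats["base_cases"] += d.size
        nearest = d.argmin(axis=1)
        dmin = d[np.arange(len(t.idx)), nearest]
        better = dmin < best_dist[t.idx]
        best_dist[t.idx[better]] = dmin[better]
        best_idx[t.idx[better]] = q.idx[nearest[better]]

    def traverse(t, q):
        # prune: nothing under q can improve any query point under t
        if score(t, q) > best_dist[t.idx].max():
            return
        if not t.children and not q.children:
            base_case(t, q)
            return
        for tc in (t.children or [t]):
            # visit the closer reference child first to tighten the bound early
            for qc in sorted(q.children or [q], key=lambda c: score(tc, c)):
                traverse(tc, qc)

    traverse(Node(queries, np.arange(len(queries)), leaf_size),
             Node(references, np.arange(len(references)), leaf_size))
    return best_idx, best_dist, stats["base_cases"]
```

The third return value counts the point-to-point distances actually computed, which can be compared with `len(queries) * len(references)` to see how much the pruning saved.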

for further details, see the link

 

Original post: http://www.cnblogs.com/yumanman/p/7580049.html
