Liu Yong (刘勇)  Email: lyssym@sina.com
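Introduction
DBSCAN is quite sensitive to its input parameters, the neighborhood radius E and the threshold M, which makes parameter tuning tedious. This article therefore looks at another density-based clustering algorithm, OPTICS (Ordering Points To Identify the Clustering Structure). OPTICS is an improvement on DBSCAN and is far less sensitive to its input parameters. It does not explicitly produce a clustering; it only orders the objects in the data set, yielding an ordered list that carries enough information to extract clusters. In practice, this ordered sequence can be used to analyze how the data is distributed and how the data objects relate to one another.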
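To make "enough information to extract clusters" concrete, here is a minimal sketch of one common way to read clusters off the ordered list, using the core and reachability distances defined later in this article. It is an illustration rather than part of the article's program: the class name `ExtractSketch`, the `threshold` parameter, and the use of positive infinity for an undefined distance are all assumptions made for this example.

```java
import java.util.Arrays;

final class ExtractSketch {
    static final int NOISE = -1;

    /**
     * reach[i] and core[i] are the reachability and core distances of the i-th
     * object in the OPTICS ordering; Double.POSITIVE_INFINITY stands for UNDEFINED
     * (an assumption of this sketch).
     */
    static int[] extract(double[] reach, double[] core, double threshold) {
        int[] labels = new int[reach.length];
        int current = NOISE;
        int next = 0;
        for (int i = 0; i < reach.length; i++) {
            if (reach[i] > threshold) {
                // Not reachable from the objects before it at this threshold.
                if (core[i] <= threshold) {
                    current = next++;      // a core object opens a new cluster
                    labels[i] = current;
                } else {
                    labels[i] = NOISE;     // neither reachable nor core
                }
            } else {
                labels[i] = current;       // continues the current cluster
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double inf = Double.POSITIVE_INFINITY;
        double[] reach = {inf, 0.1, 0.1, inf, 0.1, 0.1, inf};
        double[] core  = {0.1, 0.1, 0.1, 0.1, 0.1, 0.1, inf};
        System.out.println(Arrays.toString(extract(reach, core, 0.3)));
        // prints [0, 0, 0, 1, 1, 1, -1]
    }
}
```

Because the ordering is computed once, different `threshold` values can be tried on the same list without re-running the clustering.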
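Basic Concepts
Because OPTICS is an improvement on DBSCAN, the two share many concepts, such as core object, (directly) density-reachable, and density-connected; see DBSCAN for their definitions. On top of these, this article introduces two further key concepts.
(1) Core distance
In a data set D, for given parameters E and M, the core distance of p is the smallest neighborhood radius that makes p a core object. If fewer than M objects lie within radius E of p, then p is not a core object and its core distance is undefined.
Informally, for the given E and M, the core distance of p is the M-th smallest of the distances from p to its neighbors (or the M-th largest value when a similarity score is used instead of a distance). The measure can be Euclidean distance, cosine similarity, a Word2Vec-based similarity, and so on.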
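The definition above can be sketched in a few lines. This is a minimal illustration, assuming a true distance (smaller means closer) and that p itself is counted among its neighbors; the class name `CoreDistanceSketch` and the `-1` sentinel for an undefined value are choices made for this example (the sentinel mirrors the `-1` used in the source code further down).

```java
import java.util.Arrays;

final class CoreDistanceSketch {
    static final double UNDEFINED = -1.0; // sentinel for "p is not a core object"

    /**
     * distances: distances from p to every object in the data set,
     * including p itself (distance 0).
     */
    static double coreDistance(double[] distances, double eps, int minPts) {
        // Keep only the objects inside the E-neighborhood, nearest first.
        double[] within = Arrays.stream(distances)
                                .filter(d -> d <= eps)
                                .sorted()
                                .toArray();
        // The M-th smallest distance, if the neighborhood is dense enough.
        return within.length >= minPts ? within[minPts - 1] : UNDEFINED;
    }

    public static void main(String[] args) {
        double[] d = {0.0, 0.5, 1.2, 2.0, 3.5};
        System.out.println(coreDistance(d, 2.5, 3)); // prints 1.2
        System.out.println(coreDistance(d, 1.0, 3)); // prints -1.0
    }
}
```

With a similarity measure such as cosine similarity the comparison flips: neighbors are those with similarity at or above the threshold, and the M-th largest value is taken.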
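(2) Reachability distance
In a data set D, for given parameters E and M, the reachability distance of o with respect to p is the larger of the core distance of p and the distance between p and o. It is undefined when p is not a core object.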
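As a small worked example of this definition, the sketch below computes the reachability distance from two already-known numbers. The class name `ReachabilitySketch` and the `-1` sentinel are assumptions made for illustration.

```java
final class ReachabilitySketch {
    static final double UNDEFINED = -1.0; // sentinel for "p is not a core object"

    /** Reachability distance of o with respect to p. */
    static double reachabilityDistance(double coreDistanceOfP, double distancePO) {
        if (coreDistanceOfP == UNDEFINED) {
            return UNDEFINED; // nothing is reachable from a non-core object
        }
        return Math.max(coreDistanceOfP, distancePO);
    }

    public static void main(String[] args) {
        // o lies inside p's core radius: the core distance dominates.
        System.out.println(reachabilityDistance(1.2, 0.5)); // prints 1.2
        // o lies outside p's core radius: the actual distance dominates.
        System.out.println(reachabilityDistance(1.2, 2.0)); // prints 2.0
    }
}
```

Taking the maximum means that all objects inside p's core radius share the same reachability distance, which smooths the values inside a dense region.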
Pseudocode (from Wikipedia):
OPTICS(DB, eps, MinPts)
    for each point p of DB
        p.reachability-distance = UNDEFINED
    for each unprocessed point p of DB
        N = getNeighbors(p, eps)
        mark p as processed
        output p to the ordered list
        if (core-distance(p, eps, MinPts) != UNDEFINED)
            Seeds = empty priority queue
            update(N, p, Seeds, eps, MinPts)
            for each next q in Seeds
                N' = getNeighbors(q, eps)
                mark q as processed
                output q to the ordered list
                if (core-distance(q, eps, MinPts) != UNDEFINED)
                    update(N', q, Seeds, eps, MinPts)

update(N, p, Seeds, eps, MinPts)
    coredist = core-distance(p, eps, MinPts)
    for each o in N
        if (o is not processed)
            new-reach-dist = max(coredist, dist(p,o))
            if (o.reachability-distance == UNDEFINED) // o is not in Seeds
                o.reachability-distance = new-reach-dist
                Seeds.insert(o, new-reach-dist)
            else // o in Seeds, check for improvement
                if (new-reach-dist < o.reachability-distance)
                    o.reachability-distance = new-reach-dist
                    Seeds.move-up(o, new-reach-dist)
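For readers who want to run the pseudocode directly, the following is a minimal self-contained sketch of it on one-dimensional points with absolute difference as the distance. It is not the article's program: the class name `OpticsSketch`, the 1-D data, the use of positive infinity for UNDEFINED, and the remove-then-add emulation of `move-up` are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class OpticsSketch {
    static final double UNDEFINED = Double.POSITIVE_INFINITY;

    final double[] xs;          // the 1-D data points
    final double eps;
    final int minPts;
    final double[] reach;       // reachability distance per point
    final boolean[] processed;
    final List<Integer> order = new ArrayList<>();

    OpticsSketch(double[] xs, double eps, int minPts) {
        this.xs = xs;
        this.eps = eps;
        this.minPts = minPts;
        this.reach = new double[xs.length];
        Arrays.fill(reach, UNDEFINED);
        this.processed = new boolean[xs.length];
    }

    /** Indices of all points within eps of p (p itself included). */
    List<Integer> neighbors(int p) {
        List<Integer> n = new ArrayList<>();
        for (int q = 0; q < xs.length; q++) {
            if (Math.abs(xs[p] - xs[q]) <= eps) n.add(q);
        }
        return n;
    }

    /** The minPts-th smallest neighbor distance, or UNDEFINED. */
    double coreDistance(int p, List<Integer> n) {
        if (n.size() < minPts) return UNDEFINED;
        double[] d = n.stream().mapToDouble(q -> Math.abs(xs[p] - xs[q])).sorted().toArray();
        return d[minPts - 1];
    }

    void update(int p, List<Integer> n, PriorityQueue<Integer> seeds) {
        double core = coreDistance(p, n);
        for (int o : n) {
            if (processed[o]) continue;
            double newReach = Math.max(core, Math.abs(xs[p] - xs[o]));
            // UNDEFINED is +infinity, so this covers both "not in Seeds yet"
            // and "in Seeds, improved".
            if (newReach < reach[o]) {
                seeds.remove(Integer.valueOf(o)); // no-op if o is not queued
                reach[o] = newReach;
                seeds.add(o);                     // re-insert with the new key
            }
        }
    }

    List<Integer> run() {
        Comparator<Integer> byReach = Comparator.comparingDouble(i -> reach[i]);
        for (int p = 0; p < xs.length; p++) {
            if (processed[p]) continue;
            List<Integer> n = neighbors(p);
            processed[p] = true;
            order.add(p);
            if (coreDistance(p, n) == UNDEFINED) continue;
            PriorityQueue<Integer> seeds = new PriorityQueue<>(byReach);
            update(p, n, seeds);
            while (!seeds.isEmpty()) {
                int q = seeds.poll();             // smallest reachability first
                List<Integer> nq = neighbors(q);
                processed[q] = true;
                order.add(q);
                if (coreDistance(q, nq) != UNDEFINED) update(q, nq, seeds);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0};
        OpticsSketch optics = new OpticsSketch(xs, 0.5, 2);
        for (int i : optics.run()) {
            System.out.println(xs[i] + "  reach=" + optics.reach[i]);
        }
    }
}
```

In the printed ordering, the first point of each dense group and the isolated point keep an infinite reachability distance, while the remaining points of a group get small values; those jumps are what mark the cluster boundaries.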
Source code:
import java.util.List;

import com.gta.cosine.ElementDict;

public class DataPoint {
    private List<ElementDict> terms;
    private double initDistance;
    private double coreDistance;
    private double reachableDistance;
    private boolean isVisited;


    public DataPoint(List<ElementDict> terms) {
        this.terms = terms;
        this.initDistance = -1;
        this.coreDistance = -1;
        this.reachableDistance = -1;
        this.isVisited = false;
    }


    public double getCoreDistance() {
        return coreDistance;
    }


    public void setCoreDistance(double coreDistance) {
        this.coreDistance = coreDistance;
    }


    public double getReachableDistance() {
        return reachableDistance;
    }


    public void setReachableDistance(double reachableDistance) {
        this.reachableDistance = reachableDistance;
    }


    public boolean getIsVisitLabel() {
        return isVisited;
    }


    public void setIsVisitLabel(boolean isVisited) {
        this.isVisited = isVisited;
    }


    public double getInitDistance() {
        return initDistance;
    }


    public void setInitDistance(double initDistance) {
        this.initDistance = initDistance;
    }


    public List<ElementDict> getAllElements() {
        return terms;
    }


    public ElementDict getElement(int index) {
        return terms.get(index);
    }


    public boolean equals(DataPoint dp)
    {
        List<ElementDict> ed1 = getAllElements();
        List<ElementDict> ed2 = dp.getAllElements();
        int len = ed1.size();

        if (len != ed2.size())
        {
            return false;
        }

        for (int i = 0; i < len; i++)
        {
            if (!ed1.get(i).equals(ed2.get(i)))
            {
                return false;
            }
        }
        return true;
    }

}
import java.util.Comparator;
import java.util.List;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Queue;
import java.util.PriorityQueue;

import com.gta.cosine.ElementDict;
import com.gta.cosine.TextCosine;

public class OPTICS {
    private double eps;
    private int minPts;
    private TextCosine cosine;
    private List<DataPoint> dataPoints;
    private List<DataPoint> orderList;

    public OPTICS(double eps, int minPts)
    {
        this.eps = eps;
        this.minPts = minPts;
        this.cosine = new TextCosine();
        this.dataPoints = new ArrayList<DataPoint>();
        this.orderList = new ArrayList<DataPoint>();
    }


    // Tokenizes a text and stores it as a data point.
    public void addPoint(String s)
    {
        List<ElementDict> ed = cosine.tokenizer(s);
        dataPoints.add(new DataPoint(ed));
    }


    // Returns the minPts-th largest similarity among the neighbors, or -1 if
    // there are fewer than minPts neighbors (the point is not a core object).
    public double coreDistance(List<DataPoint> neighbors)
    {
        double ret = -1;
        if (neighbors.size() >= minPts)
        {
            Collections.sort(neighbors, new Comparator<DataPoint>() {
                public int compare(DataPoint dp1, DataPoint dp2) {
                    // Descending order: the measure is cosine similarity, so larger means closer.
                    // Double.compare returns 0 for ties, as the Comparator contract requires.
                    return Double.compare(dp2.getInitDistance(), dp1.getInitDistance());
                }
            });

            ret = neighbors.get(minPts-1).getInitDistance();
        }
        return ret;
    }


    public double cosineDistance(DataPoint p, DataPoint q)
    {
        List<ElementDict> vec1 = p.getAllElements();
        List<ElementDict> vec2 = q.getAllElements();
        return cosine.analysisText(vec1, vec2);
    }


    // Collects every point whose similarity to p is at least eps and records
    // that similarity on the neighbor as its initDistance.
    public List<DataPoint> getNeighbors(DataPoint p, List<DataPoint> points)
    {
        List<DataPoint> neighbors = new ArrayList<DataPoint>();
        double countDistance = -1;
        for (DataPoint q : points)
        {
            countDistance = cosineDistance(p, q);
            if (countDistance >= eps)
            {
                q.setInitDistance(countDistance);
                neighbors.add(q);
            }
        }
        return neighbors;
    }


    public void cluster(List<DataPoint> points)
    {
        for (DataPoint point : points)
        {
            if (!point.getIsVisitLabel())
            {
                List<DataPoint> neighbors = getNeighbors(point, points);
                point.setIsVisitLabel(true);
                orderList.add(point);
                double cd = coreDistance(neighbors);
                if (cd != -1)
                {
                    point.setCoreDistance(cd);
                    Queue<DataPoint> seeds = new PriorityQueue<DataPoint>(16, new Comparator<DataPoint>() {
                        public int compare (DataPoint dp1, DataPoint dp2) {
                            // The point with the largest reachable value is polled first.
                            return Double.compare(dp2.getReachableDistance(), dp1.getReachableDistance());
                        }
                    });

                    update(point, neighbors, seeds, orderList);
                    while (!seeds.isEmpty())
                    {
                        DataPoint q = seeds.poll();
                        List<DataPoint> newNeighbors = getNeighbors(q, points);
                        q.setIsVisitLabel(true);
                        orderList.add(q);
                        if (coreDistance(newNeighbors) != -1)
                        {
                            update(q, newNeighbors, seeds, orderList);
                        }
                    }
                }
            }
        }
    }


    public void update(DataPoint p, List<DataPoint> neighbors, Queue<DataPoint> seeds, List<DataPoint> seqList)
    {
        double coreDistance = coreDistance(neighbors);
        for (DataPoint point : neighbors)
        {
            double cosineDistance = cosineDistance(p, point);
            double reachableDistance = coreDistance > cosineDistance ? coreDistance : cosineDistance;
            if (!point.getIsVisitLabel())
            {
                if (point.getReachableDistance() == -1)
                {
                    // Not in seeds yet: set the value and enqueue.
                    point.setReachableDistance(reachableDistance);
                    seeds.add(point);
                }
                else
                {
                    if (point.getReachableDistance() > reachableDistance)
                    {
                        // Already in seeds: remove and re-add so the queue is reordered.
                        if (seeds.remove(point))
                        {
                            point.setReachableDistance(reachableDistance);
                            seeds.add(point);
                        }
                    }
                }
            }
            else
            {
                // The author's modification: a visited point whose reachable distance
                // is still the initial -1 gets a value and is moved from the ordered
                // list back into seeds.
                if (point.getReachableDistance() == -1)
                {
                    point.setReachableDistance(reachableDistance);
                    if (seqList.remove(point))
                    {
                        seeds.add(point);
                    }
                }
            }
        }
    }


    public void showCluster()
    {
        for (DataPoint point : orderList)
        {

            List<ElementDict> ed = point.getAllElements();
            for (ElementDict e : ed)
            {
                System.out.print(e.getTerm() + " ");
            }
            System.out.println();
            System.out.println("core: " + point.getCoreDistance());
            System.out.println("reach: " + point.getReachableDistance());
            System.out.println("***************************************");
        }
    }


    public void analysis()
    {
        cluster(dataPoints);
        showCluster();
    }


    public int IndexOfList(DataPoint o, Queue<DataPoint> points)
    {
        int index = 0;
        for (DataPoint p : points)
        {
            if (o.equals(p))
            {
                break;
            }
            index ++;
        }
        return index;
    }

}
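This article uses cosine similarity as the distance measure; for details, see the post on text similarity in this text mining series (文本挖掘之文本相似度判定). In addition, analysis showed that some objects that have already been visited, for example a border object, still have their core distance at its initial value after processing. Handling them exactly as the pseudocode prescribes gives results that differ somewhat from DBSCAN's. The author therefore modified OPTICS slightly so that the reachable distance of such objects can still be updated and the objects are added to the list. The author believes this change is effective and to some extent necessary; if you have a better solution, please get in touch.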
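The `TextCosine` class used by the program is not shown in this article. As a hypothetical stand-in, the sketch below computes cosine similarity over term-frequency maps; the class name `CosineSketch`, the method names, and the whitespace tokenization are assumptions made for illustration (Chinese text would need a real word segmenter instead).

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSketch {
    /** Counts how often each whitespace-separated term occurs in the text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.trim().split("\\s+")) {
            if (!term.isEmpty()) tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    /** Cosine similarity of two term-frequency vectors, in [0, 1]. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other; // shared term
        }
        for (int v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0.0;           // an empty text matches nothing
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> x = termFrequencies("stock price rises");
        Map<String, Integer> y = termFrequencies("stock price falls");
        Map<String, Integer> z = termFrequencies("weather report");
        System.out.println(cosine(x, y)); // two of three terms shared
        System.out.println(cosine(x, z)); // no shared terms, prints 0.0
    }
}
```

Because cosine similarity grows as texts become more alike, the program's `getNeighbors` keeps the objects whose value is at least `eps`, the opposite of the `<= eps` test used with a true distance.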
Author: 志青云集
Source: http://www.cnblogs.com/lyssym
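Copyright of this article is shared by the author and 博客园 (cnblogs). Reposting is welcome, but unless the author agrees otherwise this notice must be retained and a link to the original must be placed prominently on the article page.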
Original URL: http://www.cnblogs.com/lyssym/p/4950843.html