Tags: select log vector linear err sort items __init__ elf
Principle:
(1) Take an input point A and a dataset data of examples with known classes.
(2) Compute the distance between A and every point in the dataset, normalize the features, sort the distances, and keep the K nearest points.
(3) Let those K points vote; the class with the most votes is the prediction.
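The three steps above can be sketched in a few lines (a minimal illustration with numpy; the function name `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed):

```python
import numpy as np
from collections import Counter

def knn_predict(A, data, labels, k=3):
    # (2) distance from A to every point, then indices of the k nearest
    dists = np.sqrt(((data - A) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    # (3) majority vote among the k neighbours
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

data = np.array([[1, 1.1], [1, 1], [0, 0], [0, 0.1]])
labels = ["A", "A", "B", "B"]
print(knn_predict(np.array([0.9, 1.0]), data, labels, k=3))  # → A
```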
Advantages:
Simple; works for non-linear classification boundaries.
Disadvantages:
Class imbalance skews the vote;
the result is sensitive to the choice of K;
high time and space complexity: all N samples must be stored, O(N) memory, and every query computes a distance to every sample, O(N) time, so it is both memory- and compute-intensive.
Improvements:
Balance the classes in the sample;
a K that is too small is easily swayed by noise, while a large K suppresses noise but blurs the class boundary, so the most suitable approach is to choose K by cross-validation: start from a small k, increase it, and keep the value with the best validation accuracy;
build a kd-tree to reduce the number of distance computations: repeatedly split the space with hyperplanes at the median of the current axis (cycling through the axes) until each subregion contains no further instances.
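The cross-validation recipe for K can be sketched with leave-one-out scoring (a minimal illustration; the helper names `predict` and `best_k` and the toy data are hypothetical):

```python
import numpy as np
from collections import Counter

def predict(x, data, labels, k):
    dists = np.sqrt(((data - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

def best_k(data, labels, k_candidates):
    # leave-one-out: predict each point from the remaining ones,
    # then keep the k with the highest accuracy
    scores = {}
    for k in k_candidates:
        hits = 0
        for i in range(len(data)):
            rest = np.delete(data, i, axis=0)
            rest_labels = labels[:i] + labels[i + 1:]
            hits += predict(data[i], rest, rest_labels, k) == labels[i]
        scores[k] = hits / len(data)
    return max(scores, key=scores.get)

data = np.array([[1, 1.1], [1, 1], [0, 0], [0, 0.1], [0.9, 1.2], [0.1, 0.2]])
labels = ["A", "A", "B", "B", "A", "B"]
print(best_k(data, labels, [1, 3, 5]))
```

Here k = 5 scores poorly because each held-out point is outvoted by the opposite class, while the small values classify every point correctly.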
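The kd-tree construction described above (median splits on cycling axes until a region is empty) can be sketched as follows (a toy illustration; the dict-based node layout and the function name `build_kdtree` are choices made here, not part of the original post):

```python
import numpy as np

def build_kdtree(points, depth=0):
    # split at the median along the axis chosen by depth, cycling
    # through the axes, until each subregion holds no instances
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {
        "point": points[mid],  # the splitting hyperplane passes through this point
        "left":  build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

tree = build_kdtree(np.array([[1, 1.1], [1, 1], [0, 0], [0, 0.1]]))
```

A nearest-neighbour query then descends the tree and prunes subtrees whose splitting plane is farther away than the best distance found so far, which is where the savings over brute-force O(N) scans come from.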
代码:
# coding: utf-8
from numpy import *

class KNN(object):
    def __init__(self, inX, dataStr, k=20):
        group, labels = self.loadData(dataStr)
        if array(inX).shape != array(group[0]).shape:
            raise ValueError("the shape of inX is wrong!")
        self.classify = self.classifyer(inX, group, labels, k)

    def loadData(self, dataStr):
        data = [it.strip().split("\t") for it in dataStr.strip().split("\n")]
        group = [list(map(float, it[:2])) for it in data]  # list() needed: map is lazy in Python 3
        labels = [it[2] for it in data]
        return group, labels

    def getDistance(self, vectorA, vectorB):
        vectorA = array(vectorA); vectorB = array(vectorB)
        if vectorA.shape != vectorB.shape:
            raise ValueError("the shapes of the input arrays differ!")
        return sum((vectorA - vectorB)**2)**0.5  # Euclidean distance

    def classifyer(self, inX, group, labels, k):
        k = min(k, len(labels))
        distances = [self.getDistance(it, inX) for it in group]
        sortedDistIndices = array(distances).argsort().tolist()
        voter = dict()
        for i in range(k):
            voteLabel = labels[sortedDistIndices[i]]
            voter[voteLabel] = voter.get(voteLabel, 0) + 1
        # the class with the MOST votes wins, so take max, not min
        classify = max(voter.items(), key=lambda x: x[1])[0]
        return classify

if __name__ == "__main__":
    data = "1\t1.1\tA\n1\t1\tA\n0\t0\tB\n0\t0.1\tB\n"
    inX = [1, 2]
    a = KNN(inX, data).classify
Original post: http://www.cnblogs.com/fresh-bird/p/6129343.html