A First Look at Classification Algorithms (1): the k-Nearest Neighbors (kNN) Algorithm

Date: 2014-10-11 15:32:15


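Example: someone wants to build a classifier from 1,000 rows of training sample data that sorts the data into three classes (like, so-so, dislike). Each sample has three main features:

A: frequent-flyer miles earned per year

B: percentage of time spent playing video games

C: liters of ice cream consumed per week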


1. Reading the data

from numpy import zeros

filename = 'D://machine_learn//Ch02//datingTestSet2.txt'

def file2matrix(filename):
    with open(filename) as fr:
        lines = fr.readlines()
    numberOfLines = len(lines)                  # number of lines in the file
    returnMat = zeros((numberOfLines, 3))       # feature matrix to return
    classLabelVector = []                       # class labels to return
    index = 0
    for line in lines:
        line = line.strip()
        listFromLine = line.split('\t')
        # row `index` of the matrix = the three feature values on this line
        returnMat[index, :] = [float(v) for v in listFromLine[0:3]]
        classLabelVector.append(int(listFromLine[-1]))   # last column is the class label
        index += 1
    return returnMat, classLabelVector

data, labels = file2matrix(filename)
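The loader above expects every line to hold three tab-separated feature values followed by an integer class label. As a minimal sketch of the same parsing step using only the standard library (the helper name `read_samples` and the two sample rows are made up for illustration and are not taken from datingTestSet2.txt):

```python
import csv
import io

def read_samples(stream):
    """Parse tab-separated rows of three features plus an integer label."""
    features, labels = [], []
    for row in csv.reader(stream, delimiter="\t"):
        if not row:                      # skip blank lines
            continue
        features.append([float(v) for v in row[:3]])
        labels.append(int(row[-1]))
    return features, labels

# Two hypothetical rows in the assumed layout: A, B, C, label
sample_text = "40000\t8.5\t1.0\t3\n15000\t2.0\t0.5\t2\n"
features, labels = read_samples(io.StringIO(sample_text))
print(features)
print(labels)
```

Passing an open file object instead of the `io.StringIO` wrapper should work the same way, assuming the real file follows this layout.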


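2. Normalizing the data: the values of feature A are much larger than those of B and C, so the data must be normalized before the three features can carry truly equal weight.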

from numpy import tile

def autoNorm(dataSet):
    minVals = dataSet.min(0)                       # minimum of each column
    maxVals = dataSet.max(0)                       # maximum of each column
    ranges = maxVals - minVals
    m = dataSet.shape[0]                           # number of rows
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))   # element-wise divide
    return normDataSet, ranges, minVals
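autoNorm applies min-max scaling column by column, newValue = (oldValue - min) / (max - min), so every feature ends up in the range [0, 1]. A small NumPy-free sketch with made-up numbers (the function name `min_max_normalize` and the sample rows are assumptions for illustration; it also assumes no column is constant, since that would divide by zero):

```python
def min_max_normalize(rows):
    """Scale each column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))                    # transpose: one tuple per feature
    mins = [min(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    normalized = [
        [(value - lo) / span for value, lo, span in zip(row, mins, ranges)]
        for row in rows
    ]
    return normalized, ranges, mins

# Hypothetical rows: feature A dwarfs B and C before scaling
rows = [
    [10000.0, 2.0, 0.5],
    [30000.0, 6.0, 1.5],
    [50000.0, 10.0, 1.0],
]
normalized, ranges, mins = min_max_normalize(rows)
for row in normalized:
    print(row)
```

After scaling, a difference of 20,000 miles and a difference of 4 percentage points both map to 0.5 here, which is what lets the distance computation treat the features equally.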


3. Classifying with the kNN algorithm

3.1 A brief outline of the idea behind kNN
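In outline, and as the implementation in 3.2 reflects: compute the distance from the query point to every training sample, keep the k closest samples, and return the most common label among them. Using notation introduced here purely for illustration (x is the query, x^{(i)} the i-th training sample, n the number of features, N_k(x) the indices of the k nearest samples, y_i their labels), the Euclidean distance and the majority vote can be written as:

```latex
d\bigl(x, x^{(i)}\bigr) = \sqrt{\sum_{j=1}^{n} \bigl(x_j - x^{(i)}_j\bigr)^2},
\qquad
\hat{y}(x) = \arg\max_{c} \sum_{i \in N_k(x)} \mathbf{1}\bigl[\,y_i = c\,\bigr]
```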


3.2 Implementing kNN in Python

import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet    # inX minus every training sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5                     # Euclidean distances
    sortedDistIndicies = distances.argsort()           # indices ordered nearest first
    classCount = {}
    for i in range(k):                                 # tally the labels of the k nearest samples
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort classes by vote count, highest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
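To see the voting step in action, here is a self-contained sketch of the same idea using only the standard library (Python 3.8+ for `math.dist`). The function name `knn_classify` and the four toy points are assumptions for illustration, not the post's dataset:

```python
import math
from collections import Counter

def knn_classify(query, samples, labels, k):
    """Return the majority label among the k samples nearest to `query`."""
    # Pair each Euclidean distance with its label, then sort nearest first
    by_distance = sorted(
        (math.dist(query, sample), label)
        for sample, label in zip(samples, labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

samples = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
labels = ["A", "A", "B", "B"]
print(knn_classify([0.0, 0.0], samples, labels, k=3))   # the three nearest are B, B, A
```

With k = 3 the query at the origin picks up two "B" votes and one "A" vote, so the majority label is "B".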


3.3 Applying kNN to the data above and computing the error rate

def datingClassTest():
    hoRatio = 0.50      # hold out 50% of the samples for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')       # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # classify test sample i against the remaining (training) samples with k = 3
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
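The test above holds out the first hoRatio fraction of the rows as test samples and classifies each of them against the remaining rows. A compact, self-contained sketch of that evaluation loop (the names `nearest_label` and `holdout_error_rate`, the use of a 1-nearest-neighbor rule, and the eight toy points are all assumptions for illustration; the post itself uses classify0 with k = 3 on the normalized dating data):

```python
import math

def nearest_label(query, samples, labels):
    """1-nearest-neighbor: label of the training sample closest to `query`."""
    distances = [math.dist(query, s) for s in samples]
    return labels[distances.index(min(distances))]

def holdout_error_rate(samples, labels, ratio):
    """Test on the first `ratio` of the rows, train on the rest."""
    num_test = int(len(samples) * ratio)
    train_x, train_y = samples[num_test:], labels[num_test:]
    errors = sum(
        nearest_label(samples[i], train_x, train_y) != labels[i]
        for i in range(num_test)
    )
    return errors / num_test

# Toy data: the first four rows are held out, the last four are training rows.
# The fourth test row sits next to the class-2 cluster but is labeled 1,
# so it is expected to be misclassified.
samples = [
    [0.1, 0.1], [0.9, 1.0], [0.2, 0.0], [0.8, 0.9],
    [0.0, 0.0], [0.0, 0.2], [1.0, 1.0], [1.0, 0.8],
]
labels = [1, 2, 1, 1, 1, 1, 2, 2]
print(holdout_error_rate(samples, labels, ratio=0.5))
```

One of the four held-out rows is misclassified, so the sketch reports an error rate of 0.25.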


4. Visualizing the classification results

import matplotlib.pyplot as plt
from numpy import array

fig = plt.figure()
ax = fig.add_subplot(111)
#ax.scatter(data[:,0],data[:,1])
ax.set_xlabel('B')
ax.set_ylabel('C')
# plot feature B against feature C; marker size and color both encode the class label
ax.scatter(data[:, 1], data[:, 2], 15.0 * array(labels), array(labels))
# hand-drawn legend: one sample marker per class, with the class name beside it
classes = sorted(set(labels))
ax.scatter([20, 20, 20], [1.8, 1.6, 1.4], 15 * array(classes), classes)
legends = ['dislike', 'smallDoses', 'largeDoses']
ax.text(22, 1.8, '%s' % (legends[0]))
ax.text(22, 1.6, '%s' % (legends[1]))
ax.text(22, 1.4, '%s' % (legends[2]))
plt.show()


Original article: http://www.cnblogs.com/smileqiong/p/4018867.html
