
A First Look at Classification Algorithms (1): the kNN (k-Nearest Neighbors) Algorithm



Example: suppose we want to build a classifier from 1000 rows of training samples, sorting the data into 3 classes (like, so-so, dislike). Each sample has 3 main features:

 A: frequent-flyer miles earned per year

 B: percentage of time spent playing video games

 C: liters of ice cream consumed per week

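Each line of the training file (datingTestSet2.txt) is tab-separated: the three feature values A, B and C, followed by an integer class label (here assumed to be 1 = dislike, 2 = so-so, 3 = like, matching the legend used in section 4). A few illustrative rows (made-up values, not the real data):

40000    8.3    0.95    3
14000    7.1    1.67    2
26000    1.4    0.80    1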

1. Reading the data

from numpy import zeros

filename = 'D:/machine_learn/Ch02/datingTestSet2.txt'

def file2matrix(filename):
    fr = open(filename)
    a = fr.readlines()
    numberOfLines = len(a)                      # number of lines in the file
    returnMat = zeros((numberOfLines, 3))       # feature matrix to return
    classLabelVector = []                       # class labels to return
    index = 0
    for line in a:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3] # row index gets the three feature values
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

data, labels = file2matrix(filename)

2. Normalizing the data: the values of feature A are far larger than those of features B and C, so to give the three features truly equal weight we rescale each feature to [0, 1] with min-max normalization: newValue = (oldValue - min) / (max - min).

from numpy import zeros, shape, tile

def autoNorm(dataSet):
    minVals = dataSet.min(0)                        # column-wise minimum
    maxVals = dataSet.max(0)                        # column-wise maximum
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet/tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals

normMat, ranges, minVals = autoNorm(data)
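As a quick sanity check, here is a minimal sketch (made-up values, not the dating data) showing that autoNorm rescales every column to the [0, 1] range:

from numpy import array

toy = array([[40000.0, 8.3, 0.95],
             [14000.0, 7.1, 1.67],
             [26000.0, 1.4, 0.80]])
normToy, toyRanges, toyMins = autoNorm(toy)
print normToy        # every column now lies between 0 and 1
print toyRanges      # per-column max - min, e.g. 26000.0 for feature A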

3. Classifying with the kNN algorithm

  3.1 The idea behind kNN: to classify a query sample, compute its distance to every training sample, take the k training samples closest to it, and assign the query the class that appears most often among those k neighbors.

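As a minimal illustration of that idea (a toy 2-D example with made-up points, k = 3, not the dating data):

from numpy import array

train = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
trainLabels = ['A', 'A', 'B', 'B']
query = array([0.1, 0.2])

dists = ((train - query)**2).sum(axis=1)**0.5   # Euclidean distance to every training point
nearest = dists.argsort()[:3]                   # indices of the 3 closest points
votes = [trainLabels[i] for i in nearest]
print max(set(votes), key=votes.count)          # majority vote among the neighbors -> 'B'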

  3.2 Implementing kNN in Python

import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # difference between inX and every training sample
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5                      # Euclidean distances
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):                                # vote among the k nearest neighbors
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
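For example, to classify one new person (hypothetical feature values), normalize the query with the same ranges and minVals returned by autoNorm in step 2 and pass it to classify0 together with the normalized training data and the labels from step 1:

from numpy import array

newPerson = array([26000.0, 10.0, 0.5])                 # hypothetical values for A, B, C
result = classify0((newPerson - minVals)/ranges, normMat, labels, 3)
print "predicted class: %d" % result                    # 1 = dislike, 2 = so-so, 3 = like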

  3.3 Applying kNN to the data above and measuring the error rate

def datingClassTest():
    hoRatio = 0.50      # hold out 50% of the data as the test set
    datingDataMat, datingLabels = file2matrix(filename)   # load the data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m,:], datingLabels[numTestVecs:m], 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i])
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print "the total error rate is: %f" % (errorCount/float(numTestVecs))
    print errorCount

datingClassTest()

4. Visualizing the classification results

import matplotlib
import matplotlib.pyplot as plt
from numpy import array

fig = plt.figure()
ax = fig.add_subplot(111)
#ax.scatter(data[:,0], data[:,1])
ax.set_xlabel('B: time spent playing video games (%)')
ax.set_ylabel('C: liters of ice cream consumed per week')
ax.scatter(data[:,1], data[:,2], 15.0*array(labels), array(labels))
ax.scatter([20,20,20], [1.8,1.6,1.4], 15*array(sorted(set(labels))), sorted(set(labels)))
legends = ['dislike', 'smallDoses', 'largeDoses']
ax.text(22, 1.8, '%s' % legends[0])
ax.text(22, 1.6, '%s' % legends[1])
ax.text(22, 1.4, '%s' % legends[2])
plt.show()
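A somewhat cleaner alternative, sketched below, is to plot each class separately so that matplotlib's built-in legend can label the groups instead of hand-placed text (the label-to-name mapping is assumed, as above):

import matplotlib.pyplot as plt
from numpy import array

names = {1: 'dislike', 2: 'smallDoses', 3: 'largeDoses'}   # assumed label-to-name mapping
labelArr = array(labels)
fig = plt.figure()
ax = fig.add_subplot(111)
for lab in sorted(set(labels)):
    mask = labelArr == lab
    ax.scatter(data[mask, 1], data[mask, 2], s=15.0*lab, label=names[lab])
ax.set_xlabel('B: time spent playing video games (%)')
ax.set_ylabel('C: liters of ice cream consumed per week')
ax.legend()
plt.show()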

 


 



Original article: http://www.cnblogs.com/smileqiong/p/4018867.html
