[Playing with Machine Learning in Python] KNN · Code · Part 1
KNN is short for "k Nearest Neighbors", i.e. the nearest-neighbor classifier. The basic idea is simple: for an unknown sample, compute the distance between it and every sample in the training set, select the k samples with the smallest distances, and let the class labels of those k samples vote; the majority class is the prediction for the unknown sample. The key question is which metric to use to measure the distance between samples.
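As a preview of where the code in this post is heading, here is a minimal, self-contained sketch of the "pick the k nearest and vote" step, assuming the distances to all training samples have already been computed. The function name and the toy data are illustrative only and are not part of the original code:

import numpy as np
from collections import Counter

def VoteOnKNearest(distanceValueArray, labelList, k):
    # indices of the k training samples closest to the query
    nearestIndices = np.argsort(distanceValueArray)[:k]
    # majority vote among the labels of those k samples
    nearestLabels = [labelList[i] for i in nearestIndices]
    return Counter(nearestLabels).most_common(1)[0][0]

# toy example: distances from one query to four training samples
distances = np.array([0.9, 0.2, 0.4, 0.8])
labels = ['A', 'B', 'B', 'A']
print(VoteOnKNearest(distances, labels, k=3))   # 'B' (two of the three nearest are 'B')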
1. Reading the sample features and class labels from a text file
''' kNN: k Nearest Neighbors '''
import numpy as np

'''
function: load the feature matrix and the target labels from a txt file (datingTestSet.txt)
input:    the name of the file to read
return:   1. the feature matrix
          2. the target label list
'''
def LoadFeatureMatrixAndLabels(fileInName):
    # load all the samples into memory
    fileIn = open(fileInName, 'r')
    lines = fileIn.readlines()
    # load the feature matrix and label list
    featureMatrix = np.zeros((len(lines), 3), dtype=np.float64)
    labelList = list()
    index = 0
    for line in lines:
        items = line.strip().split('\t')
        # the first three numbers are the input features
        featureMatrix[index, :] = [float(item) for item in items[0:3]]
        # the last column is the label
        labelList.append(items[-1])
        index += 1
    fileIn.close()

    return featureMatrix, labelList
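As a quick sanity check, the loader can be exercised as below. The file name comes from the docstring above, and the layout comment reflects what the parsing code assumes (tab-separated, three numeric features, label last) rather than a documented format:

# assumed layout of each line in the file:
#   feature1<TAB>feature2<TAB>feature3<TAB>label
featureMatrix, labelList = LoadFeatureMatrixAndLabels('datingTestSet.txt')
print(featureMatrix.shape)   # (number of samples, 3)
print(labelList[:5])         # labels of the first five samples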
2. Feature normalization
Feature normalization is an indispensable step for the vast majority of machine learning algorithms. The usual approach is to take the minimum and maximum values of each feature dimension and map the current feature value to a number in [0, 1] relative to them, i.e. newValue = (oldValue - min) / (max - min). If the feature values contain noise, the noise should be removed beforehand.
'''
function: auto-normalizing the feature matrix
          the formula is: newValue = (oldValue - min) / (max - min)
input:    the feature matrix
return:   the normalized feature matrix
'''
def AutoNormalizeFeatureMatrix(featureMatrix):
    # create the normalized feature matrix
    normFeatureMatrix = np.zeros(featureMatrix.shape)
    # normalizing the matrix
    lineNum = featureMatrix.shape[0]
    columnNum = featureMatrix.shape[1]
    for i in range(0, columnNum):
        minValue = featureMatrix[:, i].min()
        maxValue = featureMatrix[:, i].max()
        for j in range(0, lineNum):
            normFeatureMatrix[j, i] = (featureMatrix[j, i] - minValue) / (maxValue - minValue)

    return normFeatureMatrix
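The double loop above is easy to follow, but the same min-max normalization can also be written with NumPy broadcasting. The vectorized version below is a sketch that should produce the same result as AutoNormalizeFeatureMatrix; it is not part of the original post:

def AutoNormalizeFeatureMatrixVectorized(featureMatrix):
    # column-wise minimum and maximum of each feature dimension
    minValues = featureMatrix.min(axis=0)
    maxValues = featureMatrix.max(axis=0)
    # broadcasting applies (x - min) / (max - min) to every row at once
    return (featureMatrix - minValues) / (maxValues - minValues)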
3. Computing the distance between samples
Distance can be measured in many different ways. The code below computes the Euclidean distance between a given sample (its feature vector) and all samples in the training set.
'''
function: calculate the euclidean distance between the feature vector of the
          input sample and the feature matrix of the samples in the training set
input:    1. the input feature vector
          2. the feature matrix
return:   the distance array
'''
def CalcEucDistance(featureVectorIn, featureMatrix):
    # extend the input feature vector into a feature matrix
    lineNum = featureMatrix.shape[0]
    featureMatrixIn = np.tile(featureVectorIn, (lineNum, 1))
    # calculate the Euclidean distance between the two matrices
    diffMatrix = featureMatrixIn - featureMatrix
    sqDiffMatrix = diffMatrix ** 2
    distanceValueArray = sqDiffMatrix.sum(axis=1)
    distanceValueArray = distanceValueArray ** 0.5

    return distanceValueArray
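Putting the three functions together, a query can already be matched against the whole training set. The sketch below (my illustration, reusing the numpy import and the functions defined above) takes the first training sample as a stand-in query, so it is trivially its own nearest neighbour, and stops short of the voting step, which the post defers to its continuation:

featureMatrix, labelList = LoadFeatureMatrixAndLabels('datingTestSet.txt')
normFeatureMatrix = AutoNormalizeFeatureMatrix(featureMatrix)

# use the first (already normalized) sample as a stand-in query
queryVector = normFeatureMatrix[0, :]
distanceValueArray = CalcEucDistance(queryVector, normFeatureMatrix)

# indices of the k closest training samples; the voting step is left
# for the follow-up post announced below
k = 5
nearestIndices = np.argsort(distanceValueArray)[:k]
print([labelList[i] for i in nearestIndices])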
To be continued.
If you reproduce this post, please credit the source: http://blog.csdn.net/xceman1997/article/details/44994001