[Playing with Machine Learning in Python] KNN · Code · Part 1
KNN is short for "k Nearest Neighbors", i.e. the nearest-neighbor classifier. The basic idea is simple: for an unknown sample, compute the distance between it and every sample in the training set, select the k samples with the smallest distances, and let the class labels of those k samples vote; the majority class is the prediction for the unknown sample. The key question is which metric to use to measure the distance between samples.
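As a preview of where the code in this post is heading, here is a minimal, self-contained sketch of the "pick the k nearest and vote" step, assuming the distances to all training samples have already been computed. The function name and the toy data are illustrative only and are not part of the original code:

import numpy as np
from collections import Counter

def VoteOnKNearest(distanceValueArray, labelList, k):
    # indices of the k training samples closest to the query
    nearestIndices = np.argsort(distanceValueArray)[:k]
    # majority vote among the labels of those k samples
    nearestLabels = [labelList[i] for i in nearestIndices]
    return Counter(nearestLabels).most_common(1)[0][0]

# toy example: distances from one query to four training samples
distances = np.array([0.9, 0.2, 0.4, 0.8])
labels = ['A', 'B', 'B', 'A']
print(VoteOnKNearest(distances, labels, k=3))   # 'B' (two of the three nearest are 'B')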
1. Reading the sample features and class labels from a text file
''' kNN: k Nearest Neighbors '''
import numpy as np

'''
function: load the feature matrix and the target labels from a txt file (datingTestSet.txt)
input:    the name of the file to read
return:   1. the feature matrix
          2. the target label list
'''
def LoadFeatureMatrixAndLabels(fileInName):
    # load all the samples into memory
    fileIn = open(fileInName, 'r')
    lines = fileIn.readlines()
    # load the feature matrix and label list
    featureMatrix = np.zeros((len(lines), 3), dtype=np.float64)
    labelList = list()
    index = 0
    for line in lines:
        items = line.strip().split('\t')
        # the first three numbers are the input features
        featureMatrix[index, :] = [float(item) for item in items[0:3]]
        # the last column is the label
        labelList.append(items[-1])
        index += 1
    fileIn.close()

    return featureMatrix, labelList
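As a quick sanity check, the loader can be exercised as below. The file name comes from the docstring above, and the layout comment reflects what the parsing code assumes (tab-separated, three numeric features, label last) rather than a documented format:

# assumed layout of each line in the file:
#   feature1<TAB>feature2<TAB>feature3<TAB>label
featureMatrix, labelList = LoadFeatureMatrixAndLabels('datingTestSet.txt')
print(featureMatrix.shape)   # (number of samples, 3)
print(labelList[:5])         # labels of the first five samples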
2. Feature normalization
Feature normalization is an indispensable step for the vast majority of machine learning algorithms. The usual approach is to take the minimum and maximum values of each feature dimension and map the current feature value to a number in [0, 1] relative to them, i.e. newValue = (oldValue - min) / (max - min). If the feature values contain noise, the noise should be removed beforehand.
'''
function: auto-normalizing the feature matrix
          the formula is: newValue = (oldValue - min) / (max - min)
input:    the feature matrix
return:   the normalized feature matrix
'''
def AutoNormalizeFeatureMatrix(featureMatrix):
    # create the normalized feature matrix
    normFeatureMatrix = np.zeros(featureMatrix.shape)
    # normalizing the matrix
    lineNum = featureMatrix.shape[0]
    columnNum = featureMatrix.shape[1]
    for i in range(0, columnNum):
        minValue = featureMatrix[:, i].min()
        maxValue = featureMatrix[:, i].max()
        for j in range(0, lineNum):
            normFeatureMatrix[j, i] = (featureMatrix[j, i] - minValue) / (maxValue - minValue)

    return normFeatureMatrix
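The double loop above is easy to follow, but the same min-max normalization can also be written with NumPy broadcasting. The vectorized version below is a sketch that should produce the same result as AutoNormalizeFeatureMatrix; it is not part of the original post:

def AutoNormalizeFeatureMatrixVectorized(featureMatrix):
    # column-wise minimum and maximum of each feature dimension
    minValues = featureMatrix.min(axis=0)
    maxValues = featureMatrix.max(axis=0)
    # broadcasting applies (x - min) / (max - min) to every row at once
    return (featureMatrix - minValues) / (maxValues - minValues)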
3. Computing the distance between samples
Distance can be measured in many different ways. The code below computes the Euclidean distance between a given sample (its feature vector) and all samples in the training set.
'''
function: calculate the euclidean distance between the feature vector of the
          input sample and the feature matrix of the samples in the training set
input:    1. the input feature vector
          2. the feature matrix
return:   the distance array
'''
def CalcEucDistance(featureVectorIn, featureMatrix):
    # extend the input feature vector into a feature matrix
    lineNum = featureMatrix.shape[0]
    featureMatrixIn = np.tile(featureVectorIn, (lineNum, 1))
    # calculate the Euclidean distance between the two matrices
    diffMatrix = featureMatrixIn - featureMatrix
    sqDiffMatrix = diffMatrix ** 2
    distanceValueArray = sqDiffMatrix.sum(axis=1)
    distanceValueArray = distanceValueArray ** 0.5

    return distanceValueArray
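Putting the three functions together, a query can already be matched against the whole training set. The sketch below (my illustration, reusing the numpy import and the functions defined above) takes the first training sample as a stand-in query, so it is trivially its own nearest neighbour, and stops short of the voting step, which the post defers to its continuation:

featureMatrix, labelList = LoadFeatureMatrixAndLabels('datingTestSet.txt')
normFeatureMatrix = AutoNormalizeFeatureMatrix(featureMatrix)

# use the first (already normalized) sample as a stand-in query
queryVector = normFeatureMatrix[0, :]
distanceValueArray = CalcEucDistance(queryVector, normFeatureMatrix)

# indices of the k closest training samples; the voting step is left
# for the follow-up post announced below
k = 5
nearestIndices = np.argsort(distanceValueArray)[:k]
print([labelList[i] for i in nearestIndices])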
To be continued.
If you reproduce this post, please credit the source: http://blog.csdn.net/xceman1997/article/details/44994001