KNN is short for "k Nearest Neighbors", that is, the nearest-neighbor classifier. The basic idea: for an unknown sample, compute the distance between that sample and every sample in the training set, take the k samples with the smallest distances, and let the class labels of those k samples vote; the majority class is the predicted class of the unknown sample. The key design choice is which distance metric to use for measuring how close two samples are.
1. Reading sample features and class labels from a text file
'''
kNN: k Nearest Neighbors
'''
import numpy as np
'''
function: load the feature matrix and the target labels from a txt file (datingTestSet.txt)
input: the name of the file to read
return:
1. the feature matrix
2. the target label
'''
def LoadFeatureMatrixAndLabels(fileInName):
    # load all the samples into memory
    fileIn = open(fileInName,'r')
    lines = fileIn.readlines()
    # load the feature matrix and label vector
    featureMatrix = np.zeros((len(lines),3),dtype=np.float64)
    labelList = list()
    index = 0
    for line in lines:
        items = line.strip().split('\t')
        # the first three numbers are the input features
        featureMatrix[index,:] = [float(item) for item in items[0:3]]
        # the last column is the label
        labelList.append(items[-1])
        index += 1
    fileIn.close()
    return featureMatrix, labelList
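For reference, a minimal call sketch (not part of the original post), assuming datingTestSet.txt sits in the working directory with one sample per line, three tab-separated numeric feature values followed by a text label:

# hypothetical usage example, not from the original post
featureMatrix, labelList = LoadFeatureMatrixAndLabels('datingTestSet.txt')
print(featureMatrix.shape)   # (number of samples, 3)
print(labelList[0:3])        # the first few class labels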
2. Feature normalization
Feature normalization is an indispensable step for almost all machine learning algorithms. The usual method is to take the minimum and maximum value of each feature dimension and rescale the current feature value against them, mapping it to a number in [0,1]. If the feature values contain noise, the noise should be removed beforehand.
'''
function: auto-normalizing the feature matrix
the formula is: newValue = (oldValue - min)/(max - min)
input: the feature matrix
return: the normalized feature matrix
'''
def AutoNormalizeFeatureMatrix(featureMatrix):
    # create the normalized feature matrix
    normFeatureMatrix = np.zeros(featureMatrix.shape)
    # normalizing the matrix
    lineNum = featureMatrix.shape[0]
    columnNum = featureMatrix.shape[1]
    for i in range(0,columnNum):
        minValue = featureMatrix[:,i].min()
        maxValue = featureMatrix[:,i].max()
        for j in range(0,lineNum):
            normFeatureMatrix[j,i] = (featureMatrix[j,i] - minValue) / (maxValue - minValue)
    return normFeatureMatrix
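The double loop above mirrors the formula directly; with NumPy broadcasting the same min-max normalization can be written without explicit loops. A minimal equivalent sketch, not from the original post (the function name here is made up):

def AutoNormalizeFeatureMatrixVectorized(featureMatrix):
    # hypothetical vectorized variant: column-wise min/max, broadcast over all rows
    minValues = featureMatrix.min(axis=0)
    maxValues = featureMatrix.max(axis=0)
    return (featureMatrix - minValues) / (maxValues - minValues)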
3. Computing the distance between samples
Distance can be measured in many ways. The code below computes the Euclidean distance between the feature vector of a given sample and those of all samples in the training set.
'''
function: calculate the Euclidean distance between the feature vector of the input sample and
the feature matrix of the samples in the training set
input:
1. the input feature vector
2. the feature matrix
return: the distance array
'''
def CalcEucDistance(featureVectorIn, featureMatrix):
    # extend the input feature vector into a matrix with the same number of rows
    lineNum = featureMatrix.shape[0]
    featureMatrixIn = np.tile(featureVectorIn,(lineNum,1))
    # calculate the Euclidean distance between the two matrices:
    # element-wise difference, square, sum over features, then square root
    diffMatrix = featureMatrixIn - featureMatrix
    sqDiffMatrix = diffMatrix ** 2
    distanceValueArray = sqDiffMatrix.sum(axis=1)
    distanceValueArray = distanceValueArray ** 0.5
    return distanceValueArray
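The remaining step, classifying by voting over the k nearest neighbors as described in the introduction, is left for the next post. A rough sketch of how it might be combined with the functions above; the name Classify and the parameter k are my own, not from the original post:

def Classify(featureVectorIn, featureMatrix, labelList, k):
    # hypothetical sketch of the voting step, not from the original post
    # distances from the input sample to every training sample
    distanceValueArray = CalcEucDistance(featureVectorIn, featureMatrix)
    # indices of the k nearest training samples
    nearestIndices = distanceValueArray.argsort()[0:k]
    # count the votes of the k nearest labels
    classCount = {}
    for index in nearestIndices:
        label = labelList[index]
        classCount[label] = classCount.get(label, 0) + 1
    # the majority class wins
    return max(classCount, key=classCount.get)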
To be continued.
If you reprint this post, please credit the source: http://blog.csdn.net/xceman1997/article/details/44994001