This article uses a naive Bayes classifier to categorize documents.
To extract features from text, the text must first be tokenized. The code below creates the training data directly as tokenized documents. The function has two return values: the training documents and a list holding the class label of each document:
    def loadDataSet():
        # postingList: the document set after word-level tokenization
        postingList = [
            ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
            ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
            ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
            ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
            ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
            ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
        ]
        classVec = [0, 1, 0, 1, 0, 1]  # class labels: 1 = abusive, 0 = not
        return postingList, classVec
Next, build a vocabulary containing every unique word that appears across all documents:
    # Return a list of the unique words appearing across all documents
    def createVocabList(dataSet):
        vocabSet = set()
        for document in dataSet:
            vocabSet = vocabSet | set(document)  # union with this document's words
        return list(vocabSet)  # convert to a list before returning
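The union-of-sets step can be sanity-checked on a couple of tiny documents (a minimal sketch; the two short documents here are made up for illustration):

```python
def createVocabList(dataSet):
    # Union the word sets of all documents to get the unique vocabulary
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

docs = [['my', 'dog', 'has'], ['my', 'cat']]
vocab = createVocabList(docs)
print(sorted(vocab))  # ['cat', 'dog', 'has', 'my']
```

Note that a `set` has no defined order, so the returned list's ordering can vary between runs; sorting is only for display here.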
The next function converts a training document into a feature vector. Its inputs are the vocabulary and a document; its output is a document vector whose elements are 0 or 1, indicating whether each vocabulary word appears in the input document (1 = present, 0 = absent).
    def setOfWords2Vec(vocabList, inputSet):
        # vocabList: the reference vocabulary; inputSet: the document to check
        returnVec = [0] * len(vocabList)  # default 0: no word marked present yet
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1  # mark this word as present
        return returnVec
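For example, with a four-word vocabulary (a hypothetical toy example), the vector marks exactly the vocabulary positions whose words occur in the document, and words outside the vocabulary are ignored:

```python
def setOfWords2Vec(vocabList, inputSet):
    # 1 at position i means vocabList[i] occurs in inputSet
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

vocab = ['cat', 'dog', 'has', 'my']
vec = setOfWords2Vec(vocab, ['my', 'dog', 'runs'])  # 'runs' is out of vocabulary
print(vec)  # [0, 1, 0, 1]
```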
Next is the naive Bayes training function (it uses NumPy, so the file needs `import numpy as np` at the top):
    def trainNB0(trainMatrix, trainCategory):
        numTrainDocs = len(trainMatrix)  # total number of training documents
        numWords = len(trainMatrix[0])   # vocabulary size
        pAbusive = sum(trainCategory) / numTrainDocs  # probability of class 1
        # Laplace smoothing: initialize counts to 1 and denominators to 2
        # so an unseen word never produces a zero probability
        p0Num = np.ones(numWords)
        p1Num = np.ones(numWords)
        p0Denom = 2.0
        p1Denom = 2.0
        for i in range(numTrainDocs):
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]         # class 1: per-word occurrence counts
                p1Denom += sum(trainMatrix[i])  # class 1: total word count
            else:
                p0Num += trainMatrix[i]         # class 0: per-word occurrence counts
                p0Denom += sum(trainMatrix[i])  # class 0: total word count
        # Take logs so classifyNB can sum instead of multiply,
        # which avoids floating-point underflow
        p1Vect = np.log(p1Num / p1Denom)
        p0Vect = np.log(p0Num / p0Denom)
        return p0Vect, p1Vect, pAbusive

    # vec2Classify: the document vector to classify
    # p0Vec: log P(word|0), p1Vec: log P(word|1), pClass1: probability of class 1
    def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
        # Summing log probabilities corresponds to multiplying probabilities
        p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
        p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
        if p1 > p0:
            return 1
        else:
            return 0
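The log transform in trainNB0 is what lets classifyNB add instead of multiply: since log(a*b) = log(a) + log(b), a sum of log probabilities equals the log of the product, and it sidesteps the underflow that multiplying many small per-word probabilities would cause. A small numeric check (the probability values are chosen arbitrarily):

```python
import numpy as np

probs = np.array([0.05, 0.04, 0.125])  # small per-word probabilities
log_of_product = np.log(probs.prod())  # multiply first, then take the log
sum_of_logs = np.log(probs).sum()      # take logs first, then sum
print(np.isclose(log_of_product, sum_of_logs))  # True
```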
The following function tests the classifier:
    def testingNB():
        listOPosts, listClasses = loadDataSet()
        myVocabList = createVocabList(listOPosts)  # the vocabulary
        trainMat = []
        for postInDoc in listOPosts:
            trainMat.append(setOfWords2Vec(myVocabList, postInDoc))
        p0V, p1V, pAb = trainNB0(trainMat, listClasses)
        testEntry = ['love', 'my', 'dalmation']
        thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
        print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
        testEntry2 = ['stupid', 'garbage']
        thisDoc2 = np.array(setOfWords2Vec(myVocabList, testEntry2))
        print(testEntry2, 'classified as:', classifyNB(thisDoc2, p0V, p1V, pAb))
Calling the function produces:
    ['love', 'my', 'dalmation'] classified as: 0
    ['stupid', 'garbage'] classified as: 1
Original post: http://www.cnblogs.com/weimusan/p/7499319.html