
Naive Bayes and Logistic Regression Classification

Date: 2015-05-03 11:59:59

Naive Bayes

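Let p1(x, y) be the probability that the point (x, y) belongs to class 1, and p2(x, y) the probability that it belongs to class 2. Written as conditional probabilities these are p(c1|x, y) and p(c2|x, y), and the decision rule is:

If p(c1|x, y) > p(c2|x, y), the class is 1.

If p(c1|x, y) < p(c2|x, y), the class is 2.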

 

By Bayes' formula:

p(c|x, y) = (p(x, y|c) * p(c)) / p(x, y)

where (x, y) is the feature vector to be classified and c is the class.
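
As a small worked instance of the formula above, the sketch below plugs made-up numbers into Bayes' rule for two classes. The likelihood and prior values and the helper name `posterior` are illustrative assumptions, not values from this article.

```python
def posterior(likelihoods, priors):
    """Apply Bayes' rule: p(c|x, y) = p(x, y|c) * p(c) / p(x, y).

    likelihoods[c] is p(x, y|c) and priors[c] is p(c); p(x, y) is
    obtained by summing p(x, y|c) * p(c) over all classes.
    """
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())          # p(x, y), identical for every class
    return {c: joint[c] / evidence for c in joint}


# Illustrative numbers only (assumed for this example)
likelihoods = {1: 0.2, 2: 0.05}   # p(x, y|c)
priors = {1: 0.5, 2: 0.5}         # p(c)

post = posterior(likelihoods, priors)
predicted = max(post, key=post.get)   # class with the largest posterior
print(post, predicted)
```

Dividing by p(x, y) rescales both classes by the same factor, so the predicted class would be the same if that division were skipped.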

 

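Because p(x, y) has the same value for every class, only p(x, y|c) and p(c) need to be computed.

p(c) is easy to compute from the class labels of the training samples.

For p(x, y|c), first compute the probability of each feature occurring in the training samples of each class. Naive Bayes assumes the features are conditionally independent given the class, so p(x, y|c) = p(x|c) * p(y|c).

For a test sample, build its feature vector and take its dot product with the trained log feature probabilities. Adding log p(c) gives the score of class c, and the class with the larger score wins.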
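
The scoring step described above can be sketched as follows. Under the naive independence assumption, the log of p(x, y|c) is a sum of per-feature log-probabilities, which is exactly a dot product with a 0/1 feature vector. The three-word vocabulary and every probability value here are made up for illustration.

```python
import math

# Hypothetical trained values: log p(word|class) for a 3-word vocabulary
log_p_c0 = [math.log(p) for p in (0.5, 0.3, 0.2)]
log_p_c1 = [math.log(p) for p in (0.1, 0.2, 0.7)]
log_prior_c0 = math.log(0.5)
log_prior_c1 = math.log(0.5)

# 0/1 feature vector of a test sample: words 0 and 2 are present
features = [1, 0, 1]


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Dot product = sum of the log-probabilities of the words that are present
score_c0 = dot(features, log_p_c0) + log_prior_c0
score_c1 = dot(features, log_p_c1) + log_prior_c1

predicted = 1 if score_c1 > score_c0 else 0
print(score_c0, score_c1, predicted)
```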

 

Example: spam filtering

The code below trains on six short posts labelled abusive (1) or not abusive (0):

# -*- coding: utf-8 -*-
"""
Created on Sat May 02 21:52:08 2015

@author: silingxiao
"""
import numpy as np


def loadDataSet():
    """Return six tokenized posts and their labels (1 = abusive, 0 = not)."""
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 not
    return postingList, classVec


def createVocabList(dataSet):
    """Build the vocabulary: every distinct word in the data set."""
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)    # union of the two sets
    return list(vocabSet)


def setOfWords2Vec(vocabList, inputSet):
    """Encode a document as a 0/1 vector over the vocabulary."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec


def trainNB0(trainMatrix, trainCategory):
    """Return log p(word|class 0), log p(word|class 1) and the prior p(class 1)."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = np.sum(trainCategory) / float(numTrainDocs)
    # Counts start at 1 and denominators at 2.0 so that no probability is zero
    p0Num = np.ones(numWords); p1Num = np.ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += np.sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += np.sum(trainMatrix[i])
    # Logs turn the product of probabilities into a sum and avoid underflow
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Compare the log scores of the two classes and return the larger one."""
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)    # element-wise mult
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0


def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))


# The same training steps run at module level, outside testingNB()
postingList, classVec = loadDataSet()
vocabSet = createVocabList(postingList)
trainMat = []
for postinDoc in postingList:
    trainMat.append(setOfWords2Vec(vocabSet, postinDoc))

p0v, p1v, pAb = trainNB0(trainMat, classVec)

testingNB()
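
trainNB0 starts its counts at one and returns log-probabilities. A likely motivation, standard for naive Bayes, is that a word never seen in a class would otherwise get probability zero, and that a product of many small probabilities underflows to zero in floating point. The sketch below illustrates both effects with made-up numbers; the counts and the probability 1e-5 are assumptions for the example.

```python
import math

# 1) Smoothing: a word seen 0 times among 10 words of a class
count, total = 0, 10
unsmoothed = count / total                 # 0.0, so log() would be undefined
smoothed = (count + 1) / (total + 2.0)     # same start values as trainNB0
print(unsmoothed, smoothed)

# 2) Underflow: multiplying 100 small probabilities versus adding their logs
probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p                           # underflows to 0.0 along the way
log_sum = sum(math.log(p) for p in probs)  # stays a finite negative number
print(product, log_sum)
```

Comparing sums of logs gives the same ranking of classes as comparing the products, because the logarithm is monotonically increasing.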

 

The output is as follows:

[Image: screenshot of the program output]

 

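Logistic Regression Classification

Logistic regression is a special case of the generalized linear model and is used mainly for classification.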

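The model uses the sigmoid function as its link function:

sigmoid(z) = 1 / (1 + exp(-z))

It behaves much like a step function, but unlike a step function it is differentiable, which gradient-based fitting requires.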
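
A quick numerical look at the sigmoid shows the step-like behaviour mentioned above: it equals 0.5 at z = 0 and saturates toward 0 and 1 as |z| grows. The sketch also checks the standard derivative identity sigmoid(z) * (1 - sigmoid(z)) against a finite difference. The sample points and the step size are arbitrary choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Closed-form derivative of the sigmoid."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Values far from 0 approach 0 or 1, like a smoothed step
for z in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(z, round(sigmoid(z), 4))

# Central finite difference at an arbitrary point
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(abs(numeric - sigmoid_derivative(z)) < 1e-8)
```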

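The model parameters are computed by gradient ascent or gradient descent:

w := w + alpha * grad f(w)    (gradient ascent, to maximize f)

or

w := w - alpha * grad f(w)    (gradient descent, to minimize f)

where alpha is the step size. In the gradAscent code later in this article, the gradient direction is

grad f(w) = X^T * (y - h),  with h = sigmoid(X * w)

where X is the data matrix and y is the vector of class labels.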
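
To see where the update direction X^T * (y - h), with h = sigmoid(X * w), comes from, the sketch below compares it with a finite-difference gradient of the log-likelihood sum(y * log(h) + (1 - y) * log(1 - h)), the objective that gradient ascent maximizes in standard logistic regression. The four data points, the weights, and the helper names are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(w, X, y):
    """Sum over samples of y*log(h) + (1-y)*log(1-h)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return total


def analytic_gradient(w, X, y):
    """X^T (y - h), the direction used in the weight update."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        h = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        for j, xj in enumerate(xi):
            grad[j] += (yi - h) * xj
    return grad


def numeric_gradient(w, X, y, eps=1e-6):
    """Central finite difference of the log-likelihood, one weight at a time."""
    grad = []
    for j in range(len(w)):
        up = list(w)
        up[j] += eps
        down = list(w)
        down[j] -= eps
        grad.append((log_likelihood(up, X, y) - log_likelihood(down, X, y)) / (2 * eps))
    return grad


# Made-up data: each row is [x0 = 1, x1, x2]
X = [[1.0, 0.5, 1.2], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7], [1.0, -0.4, -1.5]]
y = [1, 0, 1, 0]
w = [0.1, -0.2, 0.3]

print(analytic_gradient(w, X, y))
print(numeric_gradient(w, X, y))
```

The two printed vectors should agree to several decimal places, which is why moving the weights along X^T * (y - h) increases the log-likelihood for a small enough step size.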

 

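Python code to classify two classes of points

The goal is to find a straight line that separates the two classes of points, that is, to solve for the model coefficients. This is done with batch gradient ascent (gradAscent) or with the improved stochastic gradient ascent (stocGradAscent1).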

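Note: the model computes y = w0*x0 + w1*x1 + w2*x2 with x0 = 1. Since sigmoid(0) = 0.5, the decision boundary is y = 0, and solving for x2 in terms of x1 gives x2 = (-w0 - w1*x1) / w2. This is the line that plotBestFit computes.
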
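
As a quick check of the boundary formula x2 = (-w0 - w1*x1) / w2, the sketch below uses hypothetical weights (not the ones learned from testSet.txt) and confirms that points on that line get a sigmoid output of 0.5.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def boundary_x2(w, x1):
    """Solve w0 + w1*x1 + w2*x2 = 0 for x2 (requires w2 != 0)."""
    return (-w[0] - w[1] * x1) / w[2]


w = [4.0, 0.5, -0.6]    # hypothetical weights [w0, w1, w2]

for x1 in (-3.0, 0.0, 3.0):
    x2 = boundary_x2(w, x1)
    prob = sigmoid(w[0] + w[1] * x1 + w[2] * x2)
    print(x1, round(x2, 4), round(prob, 4))
```
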
# -*- coding: utf-8 -*-
"""
Created on Sun May 03 10:22:21 2015

@author: silingxiao
"""
import numpy as np


def loadDataSet():
    """Read testSet.txt: two features and an integer label per line."""
    dataMat = []; labelMat = []
    with open('testSet.txt') as fr:
        for line in fr:
            lineArr = line.strip().split()
            # Prepend x0 = 1.0 so that weights[0] acts as the intercept
            dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
    return dataMat, labelMat


def sigmoid(inx):
    return 1.0 / (1 + np.exp(-inx))


def gradAscent(dataMatIn, classLabels):
    """Batch gradient ascent; returns the weights as an n x 1 array."""
    dataMatrix = np.array(dataMatIn)                   # m x n data matrix
    labelMat = np.array(classLabels).reshape(-1, 1)    # m x 1 column vector
    m, n = dataMatrix.shape
    alpha = 0.001                                      # step size
    maxCycles = 500
    weights = np.ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix @ weights)              # predictions, m x 1
        error = labelMat - h                           # vector subtraction
        weights = weights + alpha * (dataMatrix.T @ error)
    return weights


def plotBestFit(weights):
    """Scatter-plot both classes and draw the line w0 + w1*x1 + w2*x2 = 0."""
    import matplotlib.pyplot as plt
    dataMat, labelMat = loadDataSet()
    dataArr = np.array(dataMat)
    n = dataArr.shape[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = np.arange(-3.0, 3.0, 0.1)
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)    # decision boundary (commented out in the original listing)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()


def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    """Stochastic gradient ascent with a decaying step size."""
    m, n = dataMatrix.shape
    weights = np.ones(n)    # initialize to all ones
    for j in range(numIter):
        dataIndex = list(range(m))    # samples not yet used in this pass
        for i in range(m):
            # alpha decreases with iteration but never reaches 0
            # because of the constant term
            alpha = 4 / (1.0 + j + i) + 0.0001
            randIndex = int(np.random.uniform(0, len(dataIndex)))
            # Look the sample up through dataIndex so that each sample is
            # used once per pass (the original indexed dataMatrix directly)
            sample = dataIndex[randIndex]
            h = sigmoid(np.sum(dataMatrix[sample] * weights))
            error = classLabels[sample] - h
            weights = weights + alpha * error * dataMatrix[sample]
            del dataIndex[randIndex]
    return weights


def classifyVector(inX, weights):
    prob = sigmoid(np.sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0


def colicTest():
    """Train on horseColicTraining.txt; return the error rate on horseColicTest.txt."""
    trainingSet = []; trainingLabels = []
    with open('horseColicTraining.txt') as frTrain:
        for line in frTrain:
            currLine = line.strip().split('\t')
            lineArr = [float(currLine[i]) for i in range(21)]    # 21 features
            trainingSet.append(lineArr)
            trainingLabels.append(float(currLine[21]))           # label
    trainWeights = stocGradAscent1(np.array(trainingSet), trainingLabels, 1000)
    errorCount = 0; numTestVec = 0.0
    with open('horseColicTest.txt') as frTest:
        for line in frTest:
            numTestVec += 1.0
            currLine = line.strip().split('\t')
            lineArr = [float(currLine[i]) for i in range(21)]
            if int(classifyVector(np.array(lineArr), trainWeights)) != int(float(currLine[21])):
                errorCount += 1
    errorRate = float(errorCount) / numTestVec
    print("the error rate of this test is: %f" % errorRate)
    return errorRate


def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f" % (numTests, errorSum / float(numTests)))


if __name__ == '__main__':
    dataMat, labelMat = loadDataSet()
    weights = gradAscent(dataMat, labelMat)
    plotBestFit(weights)
    # multiTest()

 

The output is as follows:

[Figure: plot produced by plotBestFit]

 


Original article: http://www.cnblogs.com/hdu-2010/p/4473485.html
