Singular Value Decomposition (SVD): Implementation and Applications

Date: 2015-04-24 19:11:22

Tags: svd, recommender systems

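SVD is a powerful tool for extracting information. With it we can represent the original data set using a much smaller one, which in effect removes noise and redundant information.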
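To make the "smaller data set" idea concrete, here is a minimal sketch, assuming NumPy is available. It reuses the small rating matrix returned by loadExData in the code further down; the helper name truncate is mine, chosen only for illustration. It rebuilds the matrix from its k largest singular values and measures how much is lost.

```python
import numpy as np

# Same values as loadExData() in the code further down.
data = np.array([[4, 4, 0, 2, 2],
                 [4, 0, 0, 3, 3],
                 [4, 0, 0, 1, 1],
                 [1, 1, 1, 2, 0],
                 [2, 2, 2, 0, 0],
                 [1, 1, 1, 0, 0],
                 [5, 5, 5, 0, 0]], dtype=float)

# Reduced SVD: data == U @ diag(sigma) @ VT, with sigma sorted in descending order.
U, sigma, VT = np.linalg.svd(data, full_matrices=False)

def truncate(k):
    """Rebuild the matrix from its k largest singular values (illustrative helper)."""
    return U[:, :k] @ np.diag(sigma[:k]) @ VT[:k, :]

# Frobenius-norm reconstruction error when keeping all values, or only three.
full_error = np.linalg.norm(data - truncate(len(sigma)))
rank3_error = np.linalg.norm(data - truncate(3))

print("singular values:", sigma)
print("error with all singular values:", full_error)
print("error with 3 singular values:", rank3_error)
```

The squared rank-3 error equals the sum of the squared singular values that were dropped, so inspecting sigma is a reasonable way to decide how many values to keep.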

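Latent Semantic Indexing

One of the earliest applications of SVD was information retrieval, where the method is known as latent semantic indexing (LSI). In LSI, a matrix is built from documents and terms. Applying SVD to this matrix produces a set of singular values; each one, together with its singular vectors, corresponds to a concept or topic in the documents. This property makes document search more efficient.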
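A toy illustration of this idea, assuming NumPy: the term-document matrix below is made up, and the names cosine and docs_k are mine. Documents 1 and 2 share no term, so their similarity on raw counts is zero, but both overlap with document 0. After projecting onto the top two concepts they point in the same direction, while a document from the other topic stays orthogonal.

```python
import numpy as np

# Made-up term-document matrix (rows: terms, columns: documents).
terms = ["svd", "matrix", "movie", "rating"]
A = np.array([[1, 1, 0, 0, 0, 0],   # svd
              [1, 0, 1, 0, 0, 0],   # matrix
              [0, 0, 0, 1, 1, 0],   # movie
              [0, 0, 0, 1, 0, 1]],  # rating
             dtype=float)

def cosine(a, b):
    """Plain cosine similarity between two 1-D vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

U, sigma, VT = np.linalg.svd(A, full_matrices=False)

k = 2  # number of concepts to keep
# Project every document onto the top-k left singular vectors;
# each row of docs_k is one document in the k-dimensional concept space.
docs_k = (U[:, :k].T @ A).T

raw_sim = cosine(A[:, 1], A[:, 2])               # documents 1 and 2, raw term counts
latent_sim = cosine(docs_k[1], docs_k[2])        # same pair, concept space
cross_topic_sim = cosine(docs_k[1], docs_k[4])   # document 1 vs. a "movie" document

print(raw_sim, latent_sim, cross_topic_sim)
```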

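Recommender Systems

Another application of SVD is recommender systems. A simple recommender computes the similarity between items or between users. A more advanced approach uses SVD to build a topic space from the data and then computes similarity in that space.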

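What follows is a Python implementation of the SVD approach in a simple recommender system.

It relies on two commonly used functions: Python's built-in sorted and NumPy's nonzero.

sorted is a Python built-in. The implementation uses it to sort a list of tuples, as follows: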

>>> student_tuples = [
    ('john', 'A', 15),
    ('jane', 'B', 12),
    ('dave', 'B', 10),
]
>>> sorted(student_tuples, key=lambda student: student[2])   # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

Reference link: the sorted function
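The recommend function further down uses sorted slightly differently: it sorts by the second field with itemgetter, in descending order, and keeps the first N results. A small sketch with made-up (item, score) pairs:

```python
from operator import itemgetter

# Hypothetical (item index, estimated score) pairs, in the shape recommend() builds.
item_scores = [(0, 2.5), (3, 4.1), (5, 3.3), (7, 1.0)]

# Sort by the score (field 1), highest first, and keep the top 3.
top3 = sorted(item_scores, key=itemgetter(1), reverse=True)[:3]
print(top3)  # [(3, 4.1), (5, 3.3), (0, 2.5)]
```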

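nonzero returns the indices of the nonzero elements, as one index array per dimension. For a two-dimensional array it looks like this: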

>>> a = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a > 3
array([[False, False, False],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)
>>> np.nonzero(a > 3)
(array([1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2]))

Reference link: the nonzero function
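In the recommender code, nonzero is combined with logical_and to find the users who rated both of two items, and with a comparison against 0 to find the items a user has not rated. A minimal sketch on a made-up rating matrix, using plain ndarrays rather than np.matrix (so slicing a row or column gives a 1-D array and nonzero returns a single index array):

```python
import numpy as np

# Made-up ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([[4, 0, 2],
                    [0, 3, 1],
                    [5, 1, 0],
                    [2, 2, 4]])

col_a, col_b = 0, 1  # the two items to compare

# Boolean mask of users who rated both items, then their row indices.
both_rated = np.logical_and(ratings[:, col_a] > 0, ratings[:, col_b] > 0)
overlap = np.nonzero(both_rated)[0]
print(overlap)  # [2 3]

# Items that user 1 has not rated yet.
unrated = np.nonzero(ratings[1] == 0)[0]
print(unrated)  # [0]
```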

# encoding=utf8
import numpy as np
from numpy import *
from numpy import linalg as la
from operator import itemgetter

def loadExData():
    return [[4,4,0,2,2],
            [4,0,0,3,3],
            [4,0,0,1,1],
            [1,1,1,2,0],
            [2,2,2,0,0],
            [1,1,1,0,0],
            [5,5,5,0,0]]

def loadExData2():
    return[[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
           [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
           [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
           [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
           [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
           [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
           [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
           [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
           [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
           [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
           [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]
 
def ReconstructSigma(Sigma):
    return np.mat([[Sigma[0],0,0],[0,Sigma[1],0],[0,0,Sigma[2]]])

def ReconstructData(U,Sigma,VT):
    return U[:,:3]*Sigma*VT[:3,:]

# Similarity measures
def eulidSim(inA,inB):
    return 1.0/(1.0 + la.norm(inA - inB))  # inputs are column vectors; la.norm gives the Euclidean distance between them

def pearsSim(inA,inB):
    if(len(inA)<3): return 1.0
    return 0.5 + 0.5*np.corrcoef(inA, inB, rowvar=0)[0][1]  # corrcoef returns a matrix; take row 0, column 1

def cosSim(inA,inB):
    num = float(inA.T * inB)
    denom = la.norm(inA) * la.norm(inB)
    return 0.5 + 0.5 * (num/denom)
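'''
standEst estimates the rating a user would give an item.
It uses item-based similarity, i.e. similarity between columns:
the column of the item to estimate is compared with every other column,
using only the rows (users) where both columns are nonzero.
The similarities then weight the user's known ratings to estimate the missing one.
'''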
def standEst(dataMat,user,simMeas,item):
    n = np.shape(dataMat)[1]
    simTotal = 0.0 ; ratSimTotal = 0.0
    for j in range(n):
        userRating = dataMat[user,j]
        if(userRating == 0): continue
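        overLap = nonzero(logical_and(dataMat[:,item].A > 0,dataMat[:,j].A >0))[0]  # row indices of users who rated both items
        '''
        nonzero returns a tuple of index arrays (see the np.nonzero(a > 3) example):
        the first holds row indices, the second holds column indices.
        '''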
        if(len(overLap)) == 0 :similarity = 0
        else:
            similarity = simMeas(dataMat[overLap,item],dataMat[overLap,j])
        print('the %d and %d similarity is : %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0: return 0
    else : return ratSimTotal/simTotal

def recommend(dataMat,user ,N = 3,simMeas= cosSim,estMethod = standEst):
    unratedItems = nonzero(dataMat[user,:].A == 0)[1]  # .A converts the matrix to an ndarray; [1] takes the column indices
    '''
    >>> a = np.array([[1,2,3],[4,5,6],[7,8,9]])
    >>> a > 3
    array([[False, False, False],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)
    >>> np.nonzero(a > 3)
    (array([1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2]))
    '''
    if len(unratedItems) == 0: return 'you rated everything'
    itemScores = []
    for item in unratedItems:
        estimatedScore = estMethod(dataMat,user,simMeas,item)
        itemScores.append((item,estimatedScore))
    return sorted(itemScores,key=itemgetter(1),reverse = True)[:N]

def svdEst(dataMat , user, simMeas,item):
    n = shape(dataMat)[1]
    simTotal = 0.0 ; ratSimTotal = 0.0
    U,Sigma,VT = la.svd(dataMat)
    Sig4 = mat(eye(4) * Sigma[:4])  # keep the four largest singular values
    xformedItems = dataMat.T * U[:,:4] * Sig4.I  # map every item into the 4-dimensional space
    print(xformedItems)
    for j in range(n):
        userRating = dataMat[user,j]
        if userRating == 0 or j==item: continue
        similarity = simMeas(xformedItems[item,:].T, xformedItems[j,:].T)
        print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0: return 0
    else: return ratSimTotal/simTotal
    
if __name__=="__main__":
    '''
    # Check intermediate results
    Data = loadExData()
    MatData = np.mat(Data)
    U,Sigma,VT = np.linalg.svd(Data)
    print(Sigma)
    Sigma = ReconstructSigma(Sigma)
    print(Sigma)
    print(ReconstructData(U, Sigma, VT))
    print(eulidSim(MatData[:,0], MatData[:,4]))
    print(cosSim(MatData[:,0], MatData[:,4]))
    print(pearsSim(MatData[:,0], MatData[:,0]))
    '''
    Data = loadExData()
    dataMat = np.mat(Data)
    dataMat2 = mat(loadExData2())
    print(dataMat2)
    print(recommend(dataMat2, 1, estMethod=svdEst))

For implementation details, see Machine Learning in Action.
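The listing above is written for Python 2 and np.matrix. As a rough sketch only, assuming Python 3 and plain NumPy arrays, the SVD-based estimate could look like the following. The names cos_sim, item_factors and svd_estimate are mine, not from the listing or the book, and one design choice differs: the SVD is computed once per recommend call instead of once per unrated item as in svdEst.

```python
import numpy as np

# Same values as loadExData2() above; rows are users, columns are items.
ratings = np.array([
    [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
    [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
    [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
    [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
    [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
    [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
    [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
    [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
    [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
    [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]], dtype=float)

def cos_sim(a, b):
    """Cosine similarity rescaled from [-1, 1] to [0, 1], as in cosSim."""
    return 0.5 + 0.5 * float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def item_factors(data, k=4):
    """Map every item (column) to a k-dimensional vector via truncated SVD."""
    U, sigma, _ = np.linalg.svd(data, full_matrices=False)
    # Same algebra as dataMat.T * U[:,:k] * Sig_k.I in svdEst.
    return data.T @ U[:, :k] / sigma[:k]

def svd_estimate(data, factors, user, item, sim=cos_sim):
    """Similarity-weighted average of the ratings the user has already given."""
    sim_total = weighted_total = 0.0
    for j in range(data.shape[1]):
        rating = data[user, j]
        if rating == 0 or j == item:
            continue
        s = sim(factors[item], factors[j])
        sim_total += s
        weighted_total += s * rating
    return weighted_total / sim_total if sim_total else 0.0

def recommend(data, user, n=3, k=4):
    """Return the n unrated items with the highest estimated rating."""
    factors = item_factors(data, k)  # computed once, not once per item
    unrated = np.nonzero(data[user] == 0)[0]
    scores = [(int(item), svd_estimate(data, factors, user, item))
              for item in unrated]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

print(recommend(ratings, user=1))
```

Because the item vectors come from the same decomposition, each row of item_factors is simply the corresponding row of V truncated to k columns, which is one way to sanity-check the projection.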

Original article: http://blog.csdn.net/huruzun/article/details/45248997
