Cluster Analysis of User Viewing Habits

Date: 2015-05-19 16:53:09

Tags: user count, Nanjing, where, sample, characteristics

A data mining test case


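User viewing habits differ by hour of the day and by day of the week. The goal here is to cluster IPTV users according to how long they watch in each hour of the day.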

Test sample:

Users in the Nanjing area who watched IPTV on June 6, 2013 (a Thursday, not a holiday)

Number of users: 269,745

Data preparation:

1. Create a temporary table

select s_userid, s_hour, s_timelen into tmp_user_hour_len from tst_fct_d20130606_4 where s_city_id = 1

2. Generate the target table

-- one output row per user: sum() with group by folds the per-hour rows together
select s_userid,
       sum(case when s_hour = '00' then s_timelen else 0 end) as hour00,
       sum(case when s_hour = '01' then s_timelen else 0 end) as hour01,
       sum(case when s_hour = '02' then s_timelen else 0 end) as hour02,
       sum(case when s_hour = '03' then s_timelen else 0 end) as hour03,
       sum(case when s_hour = '04' then s_timelen else 0 end) as hour04,
       sum(case when s_hour = '05' then s_timelen else 0 end) as hour05,
       sum(case when s_hour = '06' then s_timelen else 0 end) as hour06,
       sum(case when s_hour = '07' then s_timelen else 0 end) as hour07,
       sum(case when s_hour = '08' then s_timelen else 0 end) as hour08,
       sum(case when s_hour = '09' then s_timelen else 0 end) as hour09,
       sum(case when s_hour = '10' then s_timelen else 0 end) as hour10,
       sum(case when s_hour = '11' then s_timelen else 0 end) as hour11,
       sum(case when s_hour = '12' then s_timelen else 0 end) as hour12,
       sum(case when s_hour = '13' then s_timelen else 0 end) as hour13,
       sum(case when s_hour = '14' then s_timelen else 0 end) as hour14,
       sum(case when s_hour = '15' then s_timelen else 0 end) as hour15,
       sum(case when s_hour = '16' then s_timelen else 0 end) as hour16,
       sum(case when s_hour = '17' then s_timelen else 0 end) as hour17,
       sum(case when s_hour = '18' then s_timelen else 0 end) as hour18,
       sum(case when s_hour = '19' then s_timelen else 0 end) as hour19,
       sum(case when s_hour = '20' then s_timelen else 0 end) as hour20,
       sum(case when s_hour = '21' then s_timelen else 0 end) as hour21,
       sum(case when s_hour = '22' then s_timelen else 0 end) as hour22,
       sum(case when s_hour = '23' then s_timelen else 0 end) as hour23
into user_hour_len_nj_20130606
from tmp_user_hour_len
group by s_userid
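The reshaping that step 2 aims for (one row per user, 24 hourly columns) can also be sketched outside the database. The snippet below is only an illustration in Python: it assumes the temporary table holds one (s_userid, s_hour, s_timelen) row per user and hour, and the function name and sample rows are made up for the example.

```python
from collections import defaultdict

def pivot_hours(rows):
    """Turn (user_id, hour, timelen) rows into one 24-value list per user."""
    table = defaultdict(lambda: [0] * 24)
    for user_id, hour, timelen in rows:
        # s_hour is assumed to be a two-character string such as '00' or '21'
        table[user_id][int(hour)] += timelen
    return dict(table)

# Made-up sample rows (user IDs and values are illustrative only)
rows = [
    ("u1", "21", 31),
    ("u1", "22", 12),
    ("u2", "12", 26),
    ("u2", "13", 59),
]
wide = pivot_hours(rows)
print(wide["u1"])
```

Each user ends up with exactly 24 values, with 0 for hours in which nothing was watched, which is the shape the clustering step expects.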

3. On server 211, export the table to a local file

bcp user_hour_len_nj_20130606 out user_hour_len_nj_20130606.txt -UXXX -PXXX -SXXX -c -t '|' -r '\n'

4. Take the first 200 instances for testing

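Analysis method:

Cluster analysis with the k-means algorithm

Data source format:

Attribute set:

The attribute set describes the 24 hourly slots, in the following format (real may also be written as numeric):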

@relation cluster

@attribute H00 real

@attribute H01 real

@attribute H02 real

@attribute H03 real

@attribute H04 real

@attribute H05 real

@attribute H06 real

@attribute H07 real

@attribute H08 real

@attribute H09 real

@attribute H10 real

@attribute H11 real

@attribute H12 real

@attribute H13 real

@attribute H14 real

@attribute H15 real

@attribute H16 real

@attribute H17 real

@attribute H18 real

@attribute H19 real

@attribute H20 real

@attribute H21 real

@attribute H22 real

@attribute H23 real

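Data set:

The data set contains each user's viewing duration in each of the 24 hours, one row per user, in the following format: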

@data

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,12,0

0,0,0,0,0,0,0,0,0,0,0,0,26,59,16,0,0,0,50,55,56,58,59,10

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,59,59,18,0

57,35,0,0,0,0,20,0,0,0,0,0,0,0,15,59,59,59,59,59,59,58,54,35

.....
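Going from the bcp export to an ARFF file like the one above can be scripted. The following is a rough sketch, not the procedure the author used: it assumes each exported line has the form userid|hour00|...|hour23, drops the user ID, and keeps the first 200 rows. The function name and the sample line are hypothetical.

```python
def export_to_arff(lines, limit=200, relation="cluster"):
    """Convert 'userid|h00|...|h23' lines into ARFF text, dropping the user ID."""
    out = [f"@relation {relation}"]
    out += [f"@attribute H{h:02d} real" for h in range(24)]
    out.append("@data")
    kept = 0
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) != 25:      # skip blank or malformed lines
            continue
        out.append(",".join(fields[1:]))
        kept += 1
        if kept == limit:          # stop after the first `limit` instances
            break
    return "\n".join(out) + "\n"

# One made-up export line: a user ID followed by 24 hourly values
sample = ["1001|" + "|".join(["0"] * 21 + ["31", "12", "0"])]
arff = export_to_arff(sample)
print(arff)
```

Because the user ID is dropped, the row order is the only link back to the user, so the same order has to be kept if cluster labels are later joined to IDs.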

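Test procedure:

Open the Weka Explorer and use Open file to load the feature file (e.g. example_cluster_ID_H24_200.arff). Switch to the Cluster tab, choose the SimpleKMeans algorithm, and select Euclidean distance as the distance function. Set maxIterations=500, numClusters=5 (3 or 4 also work), and seed=10, then click Start.

With numClusters=5, the results are as follows: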

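1)

This part of the output lists, for each cluster, the number of instances assigned to it and their percentage of the whole sample set.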

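2)

Number of iterations: 7

Within cluster sum of squared errors: 228.6644541918032

"Within cluster sum of squared errors" measures the distance of instances to their cluster centers. For a fixed number of clusters, a smaller value means tighter clusters; the value also drops as the number of clusters grows, so it should only be compared between runs with the same numClusters. With the number of clusters fixed, changing seed changes the initial centers and therefore this value, so it is worth trying several seeds and keeping the run with the smallest error.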
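The effect of the seed on the within-cluster error can be illustrated with a small k-means written in plain Python. This is a minimal sketch, not Weka's implementation: it uses unnormalized Euclidean distance and picks random instances as initial centers, whereas Weka's EuclideanDistance normalizes attribute ranges by default, so the numbers will not match Weka's output. The four points are the sample rows shown in the data set section.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, max_iter=500):
    """Plain k-means; returns (assignments, within-cluster sum of squared errors)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centers
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:   # converged: no point changed cluster
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:            # keep the old center if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assign))
    return assign, sse

points = [
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,12,0],
    [0,0,0,0,0,0,0,0,0,0,0,0,26,59,16,0,0,0,50,55,56,58,59,10],
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,59,59,18,0],
    [57,35,0,0,0,0,20,0,0,0,0,0,0,0,15,59,59,59,59,59,59,58,54,35],
]

# Try several seeds with the same k and keep the run with the smallest error
runs = {seed: kmeans(points, k=2, seed=seed) for seed in range(10)}
best_seed = min(runs, key=lambda s: runs[s][1])
print(best_seed, runs[best_seed])
```

On the real 200-instance file the same loop would simply read the rows from the ARFF data section instead of the inline list.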

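Parameters:

The SimpleKMeans parameter window offers the following options:

displayStdDevs: whether to display the standard deviations of numeric attributes and the counts of nominal attributes.
distanceFunction: the distance function used to compare instances (default: weka.core.EuclideanDistance). SimpleKMeans supports Euclidean and Manhattan distance.
dontReplaceMissingValues: whether to skip the global replacement of missing values with the mean/mode.
maxIterations: the maximum number of iterations.
numClusters: the number of clusters.
preserveInstancesOrder: whether to preserve the order of the instances.
seed: the random seed.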

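Questions:

1. How do I find out which ID was assigned to which cluster?

A: For the training samples, right-click the clustering result and choose "Visualize cluster assignments". In the window that opens, click Save to write an ARFF file. In that file, the last attribute of each sample ("@attribute Cluster") gives the cluster the sample was assigned to.

In addition, the first value on each line is the index of the training sample.

Part of such a file (save_file_ID2Class.arff) is shown below as an example: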

----------------------------------------------------------------------------------------------------------------

@attribute H22 numeric

@attribute H23 numeric

@attribute Cluster {cluster0,cluster1,cluster2,cluster3}

@data

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,12,0,cluster1

1,0,0,0,0,0,0,0,0,0,0,0,0,26,59,16,0,0,0,50,55,56,58,59,10,cluster2

2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,59,59,18,0,cluster2

3,57,35,0,0,0,0,20,0,0,0,0,0,0,0,15,59,59,59,59,59,59,58,54,35,cluster3

----------------------------------------------------------------------------------------------------------------
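Once the assignments are saved, the file can be read back programmatically. The sketch below assumes the layout shown in the excerpt (instance index first, cluster label last); the function name is hypothetical and the embedded text is just the four sample lines from the excerpt.

```python
from collections import Counter

def read_assignments(arff_text):
    """Map instance index -> cluster label from a saved cluster-assignment ARFF."""
    result = {}
    in_data = False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and ARFF comments
            continue
        if line.lower() == "@data":
            in_data = True
            continue
        if in_data:
            fields = line.split(",")
            result[int(fields[0])] = fields[-1]
    return result

saved = """@attribute Cluster {cluster0,cluster1,cluster2,cluster3}
@data
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,12,0,cluster1
1,0,0,0,0,0,0,0,0,0,0,0,0,26,59,16,0,0,0,50,55,56,58,59,10,cluster2
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,59,59,18,0,cluster2
3,57,35,0,0,0,0,20,0,0,0,0,0,0,0,15,59,59,59,59,59,59,58,54,35,cluster3
"""
assignments = read_assignments(saved)
sizes = Counter(assignments.values())   # number of instances per cluster
print(assignments)
print(sizes)
```

Since the index follows the order of the input file, joining it with the user IDs in the exported table (kept in the same order) gives the cluster for each user.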

This article comes from the "用户流失统计" (User Churn Statistics) blog. Please do not repost.


Original article: http://9309062.blog.51cto.com/9299062/1652804
