Data Mining Study Notes: Association Rule Mining

Date: 2014-10-02 16:33:23

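I. Concepts

Association rule mining: discovering interesting, frequently occurring patterns, associations, and correlations among itemsets in large volumes of data, such as transactional databases and relational databases.

Interestingness measures for association rules: support and confidence.

k-itemset: a set containing k items.

Frequency of an itemset: the number of transactions that contain the itemset.

Frequent itemset: an itemset whose frequency is at least the minimum support multiplied by the total number of transactions.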
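The definitions above can be made concrete with a small Python sketch. The transactions are made-up data, the function names are chosen only for this illustration, and the confidence formula used is the usual one, support count of A ∪ B divided by support count of A.

```python
# Made-up transaction database, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support_count(itemset, transactions):
    """Frequency of an itemset: number of transactions that contain it."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def is_frequent(itemset, transactions, min_support):
    """True if the frequency reaches min_support * number of transactions."""
    return support_count(itemset, transactions) >= min_support * len(transactions)


def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = support_count(A ∪ B) / support_count(A)."""
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / support_count(antecedent, transactions)


print(support_count({"diapers", "beer"}, transactions))     # 3
print(is_frequent({"diapers", "beer"}, transactions, 0.5))  # True
print(confidence({"diapers"}, {"beer"}, transactions))      # 0.75
```

Here {diapers, beer} appears in 3 of 5 transactions, so it is frequent at a minimum support of 0.5, while {bread, beer} (2 of 5) is not.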

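II. Classification of association rule mining

1. By the type of values handled in the rules: Boolean association rules, quantitative association rules

2. By the data dimensions involved in the rules: single-dimensional association rules, multidimensional association rules

3. By the levels of abstraction involved in the rules: single-level association rules, multilevel association rules

4. By extensions to association mining: mining maximal frequent patterns, mining closed frequent itemsets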

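III. The association rule mining process in large databases

1. Find all frequent itemsets. Most of the computation is spent in this step.

2. Generate strong association rules from the frequent itemsets, that is, rules that satisfy both the minimum support and the minimum confidence.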
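Step 2 can be sketched as follows, assuming step 1 has already produced a table of support counts. The function name `generate_rules` and the input layout (a dict from `frozenset` to count) are assumptions made for this example, not something prescribed by the notes.

```python
from itertools import combinations


def generate_rules(support_counts, min_conf):
    """Return (antecedent, consequent, confidence) for every strong rule.

    support_counts maps each frequent itemset (a frozenset) to its support
    count. Every non-empty subset of a frequent itemset is assumed to be
    present as a key, which holds when the table comes from Apriori.
    """
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                conf = count / support_counts[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules


# Made-up support counts, for illustration only.
support_counts = {
    frozenset({"beer"}): 3,
    frozenset({"diapers"}): 4,
    frozenset({"beer", "diapers"}): 3,
}
for antecedent, consequent, conf in generate_rules(support_counts, min_conf=0.8):
    print(sorted(antecedent), "=>", sorted(consequent), conf)
# ['beer'] => ['diapers'] 1.0
```

The rule diapers => beer has confidence 3/4 = 0.75 and is dropped at a minimum confidence of 0.8. The minimum support condition is already met because only frequent itemsets are used as input.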

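IV. An algorithm for finding frequent itemsets: the Apriori algorithm

The Apriori algorithm uses prior knowledge of frequent itemsets and an iterative, level-wise search, in which k-itemsets are used to explore (k+1)-itemsets, to exhaustively find all frequent itemsets in the dataset.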

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.

Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
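A quick way to see the Apriori property at work is to check it on a toy dataset. The sketch below uses made-up transactions and an assumed minimum support count of 2; the helper names are illustrative.

```python
from itertools import combinations

# Made-up transactions, for illustration only.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"a", "b"},
    {"b", "c"},
    {"a", "d"},
]


def support_count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def nonempty_proper_subsets(itemset):
    """All non-empty subsets of itemset other than itemset itself."""
    items = sorted(itemset)
    return [set(c) for r in range(1, len(items)) for c in combinations(items, r)]


min_sup = 2  # minimum support count (an assumed threshold)
itemset = {"a", "b", "c"}

# {a, b, c} is frequent, so each of its subsets must be frequent as well.
itemset_is_frequent = support_count(itemset) >= min_sup
subsets_are_frequent = all(
    support_count(s) >= min_sup for s in nonempty_proper_subsets(itemset)
)
print(itemset_is_frequent, subsets_are_frequent)  # True True
```

The property holds because any transaction that contains an itemset also contains each of its subsets, so a subset's support count can never be smaller than that of the itemset.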

Steps of the Apriori algorithm:

1. The join step: to find L_k, a set of candidate k-itemsets, denoted C_k, is generated by joining L_{k-1} with itself.

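Two members l₁ and l₂ of L_{k-1} (with the items inside each itemset kept in sorted order) can be joined, written $l_1 \bowtie l_2$, on the condition that $(l_1[1] = l_2[1]) \land (l_1[2] = l_2[2]) \land \dots \land (l_1[k-2] = l_2[k-2]) \land (l_1[k-1] < l_2[k-1])$.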

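The frequent itemsets in C_k make up L_k.

2. The prune step: use the Apriori property to reduce the amount of computation. A candidate in C_k that has any infrequent (k-1)-subset cannot be frequent, so it is removed before its support is counted.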
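The two steps can be sketched in Python as below. This is a minimal illustration, assuming each itemset is stored as a tuple of items in sorted order; the function name follows the `apriori_gen` procedure of the notes, but the data layout is a choice made for this example.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Build the candidate k-itemsets C_k from the frequent (k-1)-itemsets.

    Each itemset is a tuple of items in sorted order.
    """
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i, l1 in enumerate(prev):
        for l2 in prev[i + 1:]:
            # Join step: the first k-2 items agree and the last item of l1
            # precedes the last item of l2.
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # Prune step: every (k-1)-subset of c must be frequent.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates


# Made-up frequent 2-itemsets, for illustration only.
L2 = [("a", "b"), ("a", "c"), ("a", "e"), ("b", "c"), ("b", "d"), ("b", "e")]
print(apriori_gen(L2))  # [('a', 'b', 'c'), ('a', 'b', 'e')]
```

In this example the join produces six candidates, and four of them are pruned: ('a', 'c', 'e') and ('b', 'c', 'e') because ('c', 'e') is not frequent, ('b', 'c', 'd') because ('c', 'd') is not, and ('b', 'd', 'e') because ('d', 'e') is not.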

Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach based on candidate generation.

Input:

D, a database of transactions;

min_sup, the minimum support count threshold.

Output: L, frequent itemsets in D.

Method:

    L₁ = find_frequent_1-itemsets(D);
    for (k = 2; L_{k-1} ≠ ∅; k++) {
        C_k = apriori_gen(L_{k-1});
        for each transaction t ∈ D {    // scan D to count candidates
            C_t = subset(C_k, t);       // candidates contained in t
            for each candidate c ∈ C_t
                c.count++;
        }
        L_k = {c ∈ C_k | c.count ≥ min_sup}
    }
    return L = ∪_k L_k;

procedure apriori_gen(L_{k-1}: frequent (k-1)-itemsets)
    for each itemset l₁ ∈ L_{k-1}
        for each itemset l₂ ∈ L_{k-1}
            if (l₁[1] = l₂[1]) ∧ (l₁[2] = l₂[2]) ∧ ... ∧ (l₁[k-2] = l₂[k-2]) ∧ (l₁[k-1] < l₂[k-1]) then {
                c = l₁ ⋈ l₂;    // join step: generate candidates
                if has_infrequent_subset(c, L_{k-1}) then
                    delete c;    // prune step: remove unfruitful candidate
                else add c to C_k;
            }
    return C_k;

procedure has_infrequent_subset(c: candidate k-itemset; L_{k-1}: frequent (k-1)-itemsets);    // use prior knowledge
    for each (k-1)-subset s of c
        if s ∉ L_{k-1} then
            return TRUE;
    return FALSE;
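The pseudocode above can be turned into a short runnable Python sketch. It is one possible rendering, not a tuned implementation: itemsets are stored as sorted tuples, `min_sup` is an absolute support count as in the pseudocode, and candidate counting simply tests every candidate against every transaction instead of using a dedicated `subset` structure.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L_1: count single items in one scan of D.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_sup}
    frequent = dict(level)

    while level:  # loop while L_{k-1} is not empty
        prev = sorted(level)
        k = len(prev[0]) + 1
        # Join step: merge itemsets that share their first k-2 items.
        candidates = [
            l1 + (l2[-1],)
            for i, l1 in enumerate(prev)
            for l2 in prev[i + 1:]
            if l1[:-1] == l2[:-1]
        ]
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = [
            c for c in candidates
            if all(s in level for s in combinations(c, k - 1))
        ]
        # Scan D once to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
    return frequent


# Made-up transactions, for illustration only.
D = [
    {"a", "b", "e"}, {"b", "d"}, {"b", "c"}, {"a", "b", "d"}, {"a", "c"},
    {"b", "c"}, {"a", "c"}, {"a", "b", "c", "e"}, {"a", "b", "c"},
]
result = apriori(D, min_sup=2)
print(len(result))              # 13
print(result[("a", "b", "c")])  # 2
```

With a minimum support count of 2 this toy database yields 5 frequent 1-itemsets, 6 frequent 2-itemsets, and 2 frequent 3-itemsets. The single 4-item candidate is pruned, so the loop stops after the fourth level.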

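Drawbacks of the Apriori algorithm:

1. It scans the data many times;

2. It generates a large number of candidate itemsets;

3. Counting the support of the candidates is laborious.

Ways to address them:

1. Reduce the number of scans;

2. Shrink the candidate set;

3. Improve the way support is counted.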

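Method 1: Hash-based technique

Each itemset is mapped by a hash function into one of the buckets of a hash table. Comparing each bucket's count with the minimum support count then eliminates some itemsets early: an itemset that falls in a bucket whose count is below the minimum support count cannot be frequent.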
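A minimal sketch of the idea, applied to 2-itemsets, is given below. The hash function (CRC32 of the joined item names), the number of buckets, and the function names are all arbitrary choices made for this illustration, and items are assumed to be strings.

```python
import zlib
from itertools import combinations


def bucket_index(itemset, num_buckets):
    """Map an itemset to a bucket; crc32 is an arbitrary choice for this sketch."""
    key = "|".join(sorted(itemset)).encode("utf-8")
    return zlib.crc32(key) % num_buckets


def hash_filter_pairs(transactions, min_sup, num_buckets=7):
    """Return the 2-itemsets whose bucket count reaches min_sup.

    A bucket count is an upper bound on the support count of every itemset
    hashed into it, so filtering on it never discards a frequent itemset.
    """
    buckets = [0] * num_buckets
    seen = set()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[bucket_index(pair, num_buckets)] += 1
            seen.add(pair)
    return {p for p in seen if buckets[bucket_index(p, num_buckets)] >= min_sup}


# Made-up transactions, for illustration only.
D = [
    {"a", "b", "e"}, {"b", "d"}, {"b", "c"}, {"a", "b", "d"}, {"a", "c"},
    {"b", "c"}, {"a", "c"}, {"a", "b", "c", "e"}, {"a", "b", "c"},
]
survivors = hash_filter_pairs(D, min_sup=2)
print(sorted(survivors))
```

The surviving pairs still have to be counted exactly, since several itemsets can share a bucket; the filter only shrinks the candidate set.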

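Method 2: Transaction reduction

A transaction that contains no frequent k-itemset cannot contain any frequent (k+1)-itemset. Such transactions can therefore be marked or removed from further consideration.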
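As a small sketch of this idea, the function below keeps only the transactions that contain at least one frequent k-itemset, so later scans have less data to read. The function name and the sample data are made up for the example.

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    """Keep only transactions that contain at least one frequent k-itemset."""
    frequent = [frozenset(s) for s in frequent_k_itemsets]
    return [t for t in transactions if any(s <= set(t) for s in frequent)]


# Made-up data: {b, c} is assumed to be the only frequent 2-itemset.
transactions = [{"a", "b", "c"}, {"a", "d"}, {"b", "c"}, {"d", "e"}]
reduced = reduce_transactions(transactions, [{"b", "c"}])
print(len(reduced))  # 2
```

Only the first and third transactions remain, and they are the only ones that could still contribute to the count of a 3-itemset.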

Method 3: Partitioning

Method 4: Sampling

Method 5: Dynamic itemset counting

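The main cost of the Apriori algorithm is generating a large number of candidate itemsets. The FP-tree algorithm can find frequent patterns without generating candidates.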
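The notes do not describe the FP-tree itself, so the following is only a rough sketch of how such a tree is commonly built, under the usual two-scan scheme: count the items first, then insert each transaction with its frequent items ordered by descending count. The class and function names are illustrative, and the mining phase that extracts patterns from the tree is not shown.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item, a count, and child nodes."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_sup):
    """Build an FP-tree from transactions; min_sup is a support count."""
    # First scan: count items and keep only the frequent ones.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: n for i, n in item_counts.items() if n >= min_sup}
    root = FPNode(None, None)
    # Second scan: insert each transaction as a path, sharing common prefixes.
    for t in transactions:
        items = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),  # descending count, ties by name
        )
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root


# Made-up transactions, for illustration only.
D = [
    {"a", "b", "e"}, {"b", "d"}, {"b", "c"}, {"a", "b", "d"}, {"a", "c"},
    {"b", "c"}, {"a", "c"}, {"a", "b", "c", "e"}, {"a", "b", "c"},
]
tree = build_fp_tree(D, min_sup=2)
print(tree.children["b"].count)  # 7
```

Because transactions that share a prefix share a path, the whole database is compressed into the tree after two scans, and no candidate itemsets are enumerated while building it.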

Original article: http://www.cnblogs.com/lookdown/p/4002818.html