Efficient Pattern Mining Methods
@(Pattern Discovery in Data Mining)
This article introduces several efficient algorithms for pattern mining. It uses the Apriori idea as the overall framework and focuses on the FP-Growth algorithm.
The Downward Closure Property of Frequent Patterns
Property
The downward closure (also called “Apriori”) property of frequent patterns:
- If {beer, diaper, nuts} is frequent, so is {beer, diaper}
- Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
- Apriori: Any subset of a frequent itemset must be frequent
Efficient mining methodology
- If any subset of an itemset S is infrequent, then there is no chance for S to be frequent, so there is no need to even consider S (this is an efficient way to prune)
Principle
Apriori pruning principle: If there is any itemset which is infrequent, its superset should not even be generated! (Agrawal & Srikant @VLDB'94, Mannila, et al. @KDD'94)
Scalable Mining Methods
- Level-wise, join-based approach: Apriori (Agrawal & Srikant @VLDB'94)
- Vertical data format approach: Eclat (Zaki, Parthasarathy, Ogihara, Li @KDD’97)
- Frequent pattern projection and growth: FPgrowth (Han, Pei, Yin @SIGMOD’00)
The Apriori Algorithm
Outline of Apriori (level-wise, candidate generation and test)
- Initially, scan DB once to get frequent 1-itemset
- Repeat
- Generate length-(k+1) candidate itemsets from length-k frequent itemsets
- Test the candidates against DB to find frequent (k+1)-itemsets
- Set k := k + 1
- Until no frequent or candidate set can be generated
- Return all the frequent itemsets derived
Pseudocode
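A minimal, self-contained Python sketch of this level-wise loop, assuming transactions are collections of items and min_sup is an absolute count (all names here are my own):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: candidate generation and test with pruning."""
    transactions = [frozenset(t) for t in transactions]
    # Scan DB once to get frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # Generate length-(k+1) candidates from length-k frequent itemsets
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Test the candidates against the DB
        cand_counts = dict.fromkeys(Ck, 0)
        for t in transactions:
            for c in Ck:
                if c <= t:
                    cand_counts[c] += 1
        Lk = {c for c, n in cand_counts.items() if n >= min_sup}
        frequent.update({c: cand_counts[c] for c in Lk})
        k += 1
    return frequent
```

For instance, `apriori([{'beer', 'diaper', 'nuts'}, {'beer', 'diaper'}, {'beer'}], 2)` returns {beer}: 3, {diaper}: 2, and {beer, diaper}: 2.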
Tricks
joining & pruning
Here, for each candidate produced by the join step of an iteration, check whether every one of its (k-1)-subsets is frequent, i.e., appears in the set of frequent (k-1)-itemsets; this is the pruning step shown in the figure above. A sketch follows.
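A sketch of the join and prune steps, with itemsets represented as sorted tuples (apriori-gen is the name of this routine in the Apriori paper; the implementation here is my own sketch):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate length-k candidates from L_prev, the frequent
    (k-1)-itemsets, each stored as a sorted tuple."""
    prev = sorted(L_prev)
    Ck = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            # Join step: merge two (k-1)-itemsets sharing the first k-2 items
            if a[:k - 2] == b[:k - 2]:
                cand = tuple(sorted(set(a) | set(b)))
                # Prune step: keep the candidate only if every (k-1)-subset
                # is frequent, i.e., appears in L_prev
                if all(s in L_prev for s in combinations(cand, k - 1)):
                    Ck.add(cand)
    return Ck
```

For example, `apriori_gen({('A','B'), ('A','C'), ('B','C')}, 3)` joins ('A','B') and ('A','C') into ('A','B','C'); since all three 2-subsets are frequent, the candidate survives the prune.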
Extensions or Improvements of Apriori
- Reduce passes of transaction database scans
- Partitioning (e.g., Savasere, et al., 1995)
- Dynamic itemset counting (Brin, et al., 1997; Sergey Brin later co-founded Google)
- Shrink the number of candidates
- Hashing (e.g., DHP: Park, et al., 1995)
- Pruning by support lower bounding (e.g., Bayardo 1998)
- Sampling (e.g., Toivonen, 1996)
- Exploring special data structures
- Tree projection (Aggarwal, et al., 2001)
- H-miner (Pei, et al., 2001)
- Hypercube decomposition (e.g., LCM: Uno, et al., 2004)
FPGrowth: A Frequent Pattern-Growth Approach
- Construct an FP-tree, then generate frequent patterns from it by fast recursive mining
What is an FP-tree? How is it constructed?
- Count the frequency of each single item
- Sort the items within each transaction by descending frequency
- Insert the sorted transactions into a prefix-tree-like structure, the FP-tree, where each node represents an item
The resulting FP-tree is shown in the figure; a code sketch of the construction follows.
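A minimal Python sketch of the two-pass construction, assuming transactions are iterables of hashable items (FPNode and build_fp_tree are hypothetical names of my own):

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Two passes: count single items, then insert sorted transactions."""
    # Pass 1: frequency of each single item; drop infrequent ones
    freq = {i: c for i, c in
            Counter(i for t in transactions for i in t).items()
            if c >= min_sup}
    root = FPNode(None, None)
    header = defaultdict(list)   # item -> its node-link list (dashed links)
    # Pass 2: insert each transaction with items in descending frequency
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header
```

Sharing prefixes this way is what compresses the database: transactions with the same frequency-sorted prefix reuse the same path, only bumping its counts.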
- Generating frequent itemsets
Compute them recursively, with divide and conquer.
Procedure (assume min_sup = 2, taking the suffix e as the example):
1) Obtain the prefix-path subtree for e (a code sketch of steps 1-4 follows the procedure).
2) Compute the support of e to decide whether e is a frequent item: traverse the node-link list of the e nodes (connected by dashed lines) and sum their counts, which gives sup(e) = 3 > 2, so we continue with the steps below.
3) Since e is frequent, find all frequent itemsets ending in e; in other words, split the problem and recurse.
For this we first need the conditional FP-tree (CFPT) of e.
4) Building the conditional FP-tree:
The result is shown in the figure. Steps:
1 - Update the support counts in e's prefix-path subtree, so that they reflect only transactions containing e.
2 - Remove the nodes for e itself.
3 - Remove infrequent nodes: by the counting method described above, c and d meet the minimum support condition (while b does not and is dropped). At this point we have the conditional FP-tree for e.
5) Using the CFPT of e obtained above, find all frequent itemsets ending in de, ce, and ae (be is not considered because b has already been removed). This just invokes the divide-and-conquer procedure recursively: for de, find the prefix-path subtree for de inside e's CFPT, ……, and obtain the CFPT of de.
For example: e → de → ade, e → ce, e → ae.
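Steps 1)-5) can be condensed into the following sketch, which reuses FPNode, build_fp_tree, and the header table from the construction sketch above (prefix_paths, support, conditional_fp_tree, and fpgrowth are my own names):

```python
from collections import Counter, defaultdict

def prefix_paths(header, item):
    """Step 1: collect the prefix-path subtree of `item` as
    (path, count) pairs, i.e., its conditional pattern base."""
    paths = []
    for node in header[item]:            # follow the dashed node links
        path, cur = [], node.parent
        while cur is not None and cur.item is not None:
            path.append(cur.item)
            cur = cur.parent
        paths.append((list(reversed(path)), node.count))
    return paths

def support(header, item):
    """Step 2: sum the counts along the node-link chain of `item`."""
    return sum(node.count for node in header[item])

def conditional_fp_tree(paths, min_sup):
    """Steps 3-4: build the conditional FP-tree from the pattern base.
    Removing the suffix nodes (construction step 2) is implicit here,
    since prefix paths never include the suffix item itself."""
    freq = Counter()                     # updated supports along the paths
    for items, count in paths:
        for i in items:
            freq[i] += count
    kept = {i for i, c in freq.items() if c >= min_sup}  # prune infrequent
    root = FPNode(None, None)
    cond_header = defaultdict(list)
    for items, count in paths:
        node = root
        for item in (i for i in items if i in kept):
            if item in node.children:
                node.children[item].count += count
            else:
                child = FPNode(item, node)
                child.count = count
                node.children[item] = child
                cond_header[item].append(child)
            node = node.children[item]
    return root, cond_header

def fpgrowth(header, suffix, min_sup, results):
    """Step 5: for each frequent item, emit item+suffix, then recurse
    on its conditional FP-tree."""
    for item in list(header):
        sup = support(header, item)
        if sup >= min_sup:
            pattern = (item,) + suffix
            results[pattern] = sup
            _, cond_header = conditional_fp_tree(
                prefix_paths(header, item), min_sup)
            fpgrowth(cond_header, pattern, min_sup, results)
```

In the running example with min_sup = 2, the driver finds sup(e) = 3 for the pattern (e,) and then recurses into the de, ce, and ae branches, matching steps 2)-5) above.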
Discussion: when the prefix-path subtree consists of a single branch, all frequent patterns can be generated in one pass, as the sketch after this list shows. For example:
* The subtree selected by the red box here is m-conditional; the single-branch tree is {}-f3-c3-a3.
* The process is recursive: node a produces the second tree {}-f3-c3, and node c produces the tree {}-f3. Node c of the second tree then produces the last tree {}-f3, and node f cannot produce any new trees.
* The first tree is m-conditional and produces the combinations fm, cm, am.
* The second tree is am-conditional and produces fam, cam.
* The third tree is cm-conditional and produces fcm.
* The last tree produces fcam.
* So we obtain the union: fm, cm, am, fam, fcm, cam, fcam.
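For the single-branch case, the recursion can indeed be replaced by directly enumerating every combination of the path items with the suffix, as in this sketch (single_path_patterns is my own name):

```python
from itertools import combinations

def single_path_patterns(path_items, suffix):
    """For a single-path conditional FP-tree, every non-empty subset of
    the path combined with the suffix is frequent (one pass, no recursion).
    path_items: e.g. ['f', 'c', 'a'] for the m-conditional tree."""
    patterns = []
    for r in range(1, len(path_items) + 1):
        for combo in combinations(path_items, r):
            patterns.append(tuple(combo) + (suffix,))
    return patterns
```

Calling `single_path_patterns(['f', 'c', 'a'], 'm')` returns exactly the seven patterns listed in the union above: fm, cm, am, fcm, fam, cam, fcam.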
Course exercise:
Here, parallel projection is space-hungry, because it splits the work by each of the different X-conditional databases that must be computed, but it is faster; partition projection instead splits along the tree branches, which is a "partition" in the true sense.
Mining Closed Patterns