Compared with machine learning, the Apriori association-rule algorithm belongs more to the data-mining tradition.
1) Calling Weka's Apriori association-rule algorithm from a test class, as follows:
import java.io.File;

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

try {
    // 1. Load the ARFF data set
    File file = new File("F:\\tools/lib/data/contact-lenses.arff");
    ArffLoader loader = new ArffLoader();
    loader.setFile(file);
    Instances m_instances = loader.getDataSet();

    // 2. Discretize the attributes (Apriori works on nominal attributes)
    Discretize discretize = new Discretize();
    discretize.setInputFormat(m_instances);
    m_instances = Filter.useFilter(m_instances, discretize);

    // 3. Build the association-rule model and print the itemsets/rules
    Apriori apriori = new Apriori();
    apriori.buildAssociations(m_instances);
    System.out.println(apriori.toString());
} catch (Exception e) {
    e.printStackTrace();
}
Steps
1 Read the data file and load the sample set instances (ArffLoader)
2 Discretize the attributes with the Discretize filter
3 Build the Apriori association-rule model
4 Print the large (frequent) itemsets and the association rules
2) When the Apriori associator is created, the resetOptions() method is called to set the default parameters:
public void resetOptions() {
    m_removeMissingCols = false;
    m_verbose = false;
    m_delta = 0.05;
    m_minMetric = 0.90;
    m_numRules = 10;
    m_lowerBoundMinSupport = 0.1;
    m_upperBoundMinSupport = 1.0;
    m_significanceLevel = -1;
    m_outputItemSets = false;
    m_car = false;
    m_classIndex = -1;
}
A detailed explanation of these parameters is given in Remark 1 at the end of this post.
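If the defaults do not fit your data, they can be overridden through the corresponding setters before mining. A minimal sketch, assuming the standard Weka option setters (the values are arbitrary examples, not recommendations):

Apriori apriori = new Apriori();
apriori.setNumRules(20);                // m_numRules: ask for up to 20 rules
apriori.setLowerBoundMinSupport(0.05);  // m_lowerBoundMinSupport
apriori.setUpperBoundMinSupport(1.0);   // m_upperBoundMinSupport
apriori.setDelta(0.05);                 // m_delta: support decrement per cycle
apriori.setMinMetric(0.8);              // m_minMetric: e.g. minimum confidence
apriori.setOutputItemSets(true);        // m_outputItemSets: also print the itemsets
apriori.buildAssociations(m_instances);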
3) A walk-through of the buildAssociations method; the source is as follows:
public void buildAssociations(Instances instances) throws Exception {
    double[] confidences, supports;
    int[] indices;
    FastVector[] sortedRuleSet;
    int necSupport = 0;

    instances = new Instances(instances);

    if (m_removeMissingCols) {
      instances = removeMissingColumns(instances);
    }
    if (m_car && m_metricType != CONFIDENCE)
      throw new Exception("For CAR-Mining metric type has to be confidence!");

    // only set class index if CAR is requested
    if (m_car) {
      if (m_classIndex == -1) {
        instances.setClassIndex(instances.numAttributes() - 1);
      } else if (m_classIndex <= instances.numAttributes() && m_classIndex > 0) {
        instances.setClassIndex(m_classIndex - 1);
      } else {
        throw new Exception("Invalid class index.");
      }
    }

    // can associator handle the data?
    getCapabilities().testWithFail(instances);

    m_cycles = 0;

    // make sure that the lower bound is equal to at least one instance
    double lowerBoundMinSupportToUse =
        (m_lowerBoundMinSupport * instances.numInstances() < 1.0)
            ? 1.0 / instances.numInstances()
            : m_lowerBoundMinSupport;

    if (m_car) {
      // m_instances does not contain the class attribute
      m_instances = LabeledItemSet.divide(instances, false);
      // m_onlyClass contains only the class attribute
      m_onlyClass = LabeledItemSet.divide(instances, true);
    } else
      m_instances = instances;

    if (m_car && m_numRules == Integer.MAX_VALUE) {
      // Set desired minimum support
      m_minSupport = lowerBoundMinSupportToUse;
    } else {
      // Decrease minimum support until desired number of rules found.
      m_minSupport = m_upperBoundMinSupport - m_delta;
      m_minSupport = (m_minSupport < lowerBoundMinSupportToUse)
          ? lowerBoundMinSupportToUse
          : m_minSupport;
    }

    do {
      // Reserve space for variables
      m_Ls = new FastVector();
      m_hashtables = new FastVector();
      m_allTheRules = new FastVector[6];
      m_allTheRules[0] = new FastVector();
      m_allTheRules[1] = new FastVector();
      m_allTheRules[2] = new FastVector();
      if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
        m_allTheRules[3] = new FastVector();
        m_allTheRules[4] = new FastVector();
        m_allTheRules[5] = new FastVector();
      }
      sortedRuleSet = new FastVector[6];
      sortedRuleSet[0] = new FastVector();
      sortedRuleSet[1] = new FastVector();
      sortedRuleSet[2] = new FastVector();
      if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
        sortedRuleSet[3] = new FastVector();
        sortedRuleSet[4] = new FastVector();
        sortedRuleSet[5] = new FastVector();
      }
      if (!m_car) {
        // Find large itemsets and rules
        findLargeItemSets();
        if (m_significanceLevel != -1 || m_metricType != CONFIDENCE)
          findRulesBruteForce();
        else
          findRulesQuickly();
      } else {
        findLargeCarItemSets();
        findCarRulesQuickly();
      }

      // prune rules for upper bound min support
      if (m_upperBoundMinSupport < 1.0) {
        pruneRulesForUpperBoundSupport();
      }

      int j = m_allTheRules[2].size() - 1;
      supports = new double[m_allTheRules[2].size()];
      for (int i = 0; i < (j + 1); i++)
        supports[j - i] =
            ((double) ((ItemSet) m_allTheRules[1].elementAt(j - i)).support()) * (-1);
      indices = Utils.stableSort(supports);
      for (int i = 0; i < (j + 1); i++) {
        sortedRuleSet[0].addElement(m_allTheRules[0].elementAt(indices[j - i]));
        sortedRuleSet[1].addElement(m_allTheRules[1].elementAt(indices[j - i]));
        sortedRuleSet[2].addElement(m_allTheRules[2].elementAt(indices[j - i]));
        if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
          sortedRuleSet[3].addElement(m_allTheRules[3].elementAt(indices[j - i]));
          sortedRuleSet[4].addElement(m_allTheRules[4].elementAt(indices[j - i]));
          sortedRuleSet[5].addElement(m_allTheRules[5].elementAt(indices[j - i]));
        }
      }

      // Sort rules according to their confidence
      m_allTheRules[0].removeAllElements();
      m_allTheRules[1].removeAllElements();
      m_allTheRules[2].removeAllElements();
      if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
        m_allTheRules[3].removeAllElements();
        m_allTheRules[4].removeAllElements();
        m_allTheRules[5].removeAllElements();
      }
      confidences = new double[sortedRuleSet[2].size()];
      int sortType = 2 + m_metricType;
      for (int i = 0; i < sortedRuleSet[2].size(); i++)
        confidences[i] = ((Double) sortedRuleSet[sortType].elementAt(i)).doubleValue();
      indices = Utils.stableSort(confidences);
      for (int i = sortedRuleSet[0].size() - 1;
           (i >= (sortedRuleSet[0].size() - m_numRules)) && (i >= 0); i--) {
        m_allTheRules[0].addElement(sortedRuleSet[0].elementAt(indices[i]));
        m_allTheRules[1].addElement(sortedRuleSet[1].elementAt(indices[i]));
        m_allTheRules[2].addElement(sortedRuleSet[2].elementAt(indices[i]));
        if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
          m_allTheRules[3].addElement(sortedRuleSet[3].elementAt(indices[i]));
          m_allTheRules[4].addElement(sortedRuleSet[4].elementAt(indices[i]));
          m_allTheRules[5].addElement(sortedRuleSet[5].elementAt(indices[i]));
        }
      }

      if (m_verbose) {
        if (m_Ls.size() > 1) {
          System.out.println(toString());
        }
      }

      if (m_minSupport == lowerBoundMinSupportToUse
          || m_minSupport - m_delta > lowerBoundMinSupportToUse)
        m_minSupport -= m_delta;
      else
        m_minSupport = lowerBoundMinSupportToUse;

      necSupport = Math.round((float) ((m_minSupport * m_instances.numInstances()) + 0.5));

      m_cycles++;
    } while ((m_allTheRules[0].size() < m_numRules)
        && (Utils.grOrEq(m_minSupport, lowerBoundMinSupportToUse))
        /* (necSupport >= lowerBoundNumInstancesSupport) */
        /* (Utils.grOrEq(m_minSupport, m_lowerBoundMinSupport)) */
        && (necSupport >= 1));
    m_minSupport += m_delta;
}
Analysis of the main steps:
1 removeMissingColumns deletes the columns that consist entirely of missing values
2 If m_car is true, the data are divided: m_car means mining class association rules (rules whose consequent is the class label), so LabeledItemSet.divide splits the instances into the non-class attributes (m_instances) and the class attribute alone (m_onlyClass)
3 findLargeItemSets finds the large (frequent) itemsets; its source is given below
4 findRulesQuickly finds all association rules from those itemsets
5 pruneRulesForUpperBoundSupport removes the rules whose support exceeds the upper bound on minimum support
6 The rule set is sorted by confidence
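With the default parameters this do-while loop also explains the figures in the console output of Remark 2: the loop starts at m_minSupport = m_upperBoundMinSupport - m_delta = 1.0 - 0.05 = 0.95 and subtracts m_delta = 0.05 per cycle, so cycle n runs at 1.0 - n * 0.05 and cycle 16 runs at 1.0 - 16 * 0.05 = 0.2, which matches "Minimum support: 0.2" and "Number of cycles performed: 16" there.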
4) The source of findLargeItemSets, which finds the large (frequent) itemsets:
private void findLargeItemSets() throws Exception {
    FastVector kMinusOneSets, kSets;
    Hashtable hashtable;
    int necSupport, necMaxSupport, i = 0;

    // Find large itemsets

    // minimum support
    necSupport = (int) (m_minSupport * m_instances.numInstances() + 0.5);
    necMaxSupport = (int) (m_upperBoundMinSupport * m_instances.numInstances() + 0.5);

    kSets = AprioriItemSet.singletons(m_instances);
    AprioriItemSet.upDateCounters(kSets, m_instances);
    kSets = AprioriItemSet.deleteItemSets(kSets, necSupport, m_instances.numInstances());
    if (kSets.size() == 0)
      return;
    do {
      m_Ls.addElement(kSets);
      kMinusOneSets = kSets;
      kSets = AprioriItemSet.mergeAllItemSets(kMinusOneSets, i, m_instances.numInstances());
      hashtable = AprioriItemSet.getHashtable(kMinusOneSets, kMinusOneSets.size());
      m_hashtables.addElement(hashtable);
      kSets = AprioriItemSet.pruneItemSets(kSets, hashtable);
      AprioriItemSet.upDateCounters(kSets, m_instances);
      kSets = AprioriItemSet.deleteItemSets(kSets, necSupport, m_instances.numInstances());
      i++;
    } while (kSets.size() > 0);
}
Main steps:
1 AprioriItemSet.singletons converts the header information of the data set into a set of single-item sets; the attribute values are ordered lexicographically
2 upDateCounters counts the support of the candidate itemsets over the instances
3 AprioriItemSet.deleteItemSets removes the itemsets whose support falls outside the required range
4 mergeAllItemSets (source below) repeatedly generates the candidate k-itemsets from the frequent (k-1)-itemsets, and deleteItemSets again removes the candidates whose support is out of range; a self-contained sketch of this generate/prune/count cycle is given after this list
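To see the same level-wise cycle without Weka's internal classes, here is a minimal, self-contained sketch in plain Java (the MiniApriori class, the string-set transaction representation, and minSupportCount are illustrative assumptions of mine, not Weka code):

import java.util.*;

public class MiniApriori {

  // Level-wise frequent-itemset mining: generate candidates from L(k),
  // prune with the Apriori property, count support, keep the frequent ones.
  public static List<Map<Set<String>, Integer>> findLargeItemSets(
      List<Set<String>> transactions, int minSupportCount) {
    List<Map<Set<String>, Integer>> levels = new ArrayList<>();

    // L(1): count every single item
    Map<Set<String>, Integer> current = new HashMap<>();
    for (Set<String> t : transactions)
      for (String item : t)
        current.merge(Set.of(item), 1, Integer::sum);
    current.entrySet().removeIf(e -> e.getValue() < minSupportCount);

    while (!current.isEmpty()) {
      levels.add(current);
      final Map<Set<String>, Integer> prev = current;

      // Candidate generation: union of two frequent k-itemsets differing in one item
      Set<Set<String>> candidates = new HashSet<>();
      for (Set<String> a : prev.keySet())
        for (Set<String> b : prev.keySet()) {
          Set<String> union = new TreeSet<>(a);
          union.addAll(b);
          if (union.size() == a.size() + 1)
            candidates.add(union);
        }

      // Prune: every k-subset of a candidate must itself be frequent
      candidates.removeIf(c -> c.stream().anyMatch(item -> {
        Set<String> subset = new TreeSet<>(c);
        subset.remove(item);
        return !prev.containsKey(subset);
      }));

      // Count support and discard infrequent candidates
      Map<Set<String>, Integer> next = new HashMap<>();
      for (Set<String> t : transactions)
        for (Set<String> c : candidates)
          if (t.containsAll(c))
            next.merge(c, 1, Integer::sum);
      next.entrySet().removeIf(e -> e.getValue() < minSupportCount);
      current = next;
    }
    return levels;
  }

  public static void main(String[] args) {
    List<Set<String>> transactions = List.of(
        Set.of("milk", "bread"),
        Set.of("milk", "bread", "butter"),
        Set.of("bread", "butter"),
        Set.of("milk", "bread", "butter"));
    // With minSupportCount = 2, {milk, bread, butter} survives as a large 3-itemset
    System.out.println(findLargeItemSets(transactions, 2));
  }
}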
5) The source of mergeAllItemSets, which generates the candidate k-itemsets from the (k-1)-itemsets:
public static FastVector mergeAllItemSets(FastVector itemSets, int size, int totalTrans) {
    FastVector newVector = new FastVector();
    ItemSet result;
    int numFound, k;

    for (int i = 0; i < itemSets.size(); i++) {
      ItemSet first = (ItemSet) itemSets.elementAt(i);
      out: for (int j = i + 1; j < itemSets.size(); j++) {
        ItemSet second = (ItemSet) itemSets.elementAt(j);
        result = new AprioriItemSet(totalTrans);
        result.m_items = new int[first.m_items.length];

        // Find and copy common prefix of size 'size'
        numFound = 0;
        k = 0;
        while (numFound < size) {
          if (first.m_items[k] == second.m_items[k]) {
            if (first.m_items[k] != -1)
              numFound++;
            result.m_items[k] = first.m_items[k];
          } else
            break out;
          k++;
        }

        // Check difference
        while (k < first.m_items.length) {
          if ((first.m_items[k] != -1) && (second.m_items[k] != -1))
            break;
          else {
            if (first.m_items[k] != -1)
              result.m_items[k] = first.m_items[k];
            else
              result.m_items[k] = second.m_items[k];
          }
          k++;
        }
        if (k == first.m_items.length) {
          result.m_counter = 0;
          newVector.addElement(result);
        }
      }
    }
    return newVector;
}
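A concrete trace of the merge (with made-up values, using Weka's encoding where m_items[a] holds the value of attribute a and -1 means "attribute not in the set"): merging first = [1, 0, -1] with second = [1, -1, 2] and size = 1, the prefix loop matches the shared item at attribute 0; the difference loop then copies 0 from first and 2 from second and reaches the end of the array, so the candidate [1, 0, 2] is added with its counter reset to 0. If instead both sets held values for the same attribute beyond the prefix (say second = [1, 2, -1]), the difference check would break early and no candidate would be produced.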
After the large itemsets are found, generateRules is called to produce the association rules.
6) The source of generateRules, which produces the association rules:
public FastVector[] generateRules(double minConfidence, FastVector hashtables,
    int numItemsInSet) {
    FastVector premises = new FastVector(), consequences = new FastVector(),
        conf = new FastVector();
    FastVector[] rules = new FastVector[3], moreResults;
    AprioriItemSet premise, consequence;
    Hashtable hashtable = (Hashtable) hashtables.elementAt(numItemsInSet - 2);

    // Generate all rules with one item in the consequence.
    for (int i = 0; i < m_items.length; i++)
      if (m_items[i] != -1) {
        premise = new AprioriItemSet(m_totalTransactions);
        consequence = new AprioriItemSet(m_totalTransactions);
        premise.m_items = new int[m_items.length];
        consequence.m_items = new int[m_items.length];
        consequence.m_counter = m_counter;

        for (int j = 0; j < m_items.length; j++)
          consequence.m_items[j] = -1;
        System.arraycopy(m_items, 0, premise.m_items, 0, m_items.length);
        premise.m_items[i] = -1;

        consequence.m_items[i] = m_items[i];
        premise.m_counter = ((Integer) hashtable.get(premise)).intValue();
        premises.addElement(premise);
        consequences.addElement(consequence);
        conf.addElement(new Double(confidenceForRule(premise, consequence)));
      }
    rules[0] = premises;
    rules[1] = consequences;
    rules[2] = conf;
    pruneRules(rules, minConfidence);

    // Generate all the other rules
    moreResults = moreComplexRules(rules, numItemsInSet, 1, minConfidence, hashtables);
    if (moreResults != null)
      for (int i = 0; i < moreResults[0].size(); i++) {
        rules[0].addElement(moreResults[0].elementAt(i));
        rules[1].addElement(moreResults[1].elementAt(i));
        rules[2].addElement(moreResults[2].elementAt(i));
      }
    return rules;
}
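The confidence computed here is conf(premise ⇒ consequence) = support(premise ∪ consequence) / support(premise); the premise count is looked up in the hashtable of smaller itemsets built during findLargeItemSets. As a worked example taken from the console output in Remark 2: rule 1 has premise tear-prod-rate=reduced with count 12 and joint count 12, so conf = 12/12 = 1.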
A few remarks of my own
1) If you do not want items with value 0 to appear in the output, you can set them to the missing value "?": missing values are not counted as items, so they take no part in rule generation (see the sketch after this list);
2) Sorting the rules by confidence matters for associative classifiers; if you only want to extract the rules themselves, the sorting step is not needed.
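A minimal sketch of remark 1 (this preprocessing loop is my own illustration of one way to do it, not something the Apriori class provides):

// Hypothetical preprocessing: treat the nominal label "0" as a missing value
// so that such items never appear in the mined rules.
for (int i = 0; i < m_instances.numInstances(); i++) {
    for (int j = 0; j < m_instances.numAttributes(); j++) {
        if (m_instances.attribute(j).isNominal()
                && !m_instances.instance(i).isMissing(j)
                && "0".equals(m_instances.instance(i).stringValue(j))) {
            m_instances.instance(i).setMissing(j);
        }
    }
}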
Remarks
1) Detailed explanation of the Apriori parameters in Weka
1. car — If set to true, class association rules are mined instead of general association rules, i.e. only rules that involve the class label are kept.
2. classIndex — Index of the class attribute. If set to -1, the last attribute is treated as the class attribute.
3. delta — The step by which minimum support is decreased in each iteration, until the lower bound is reached or the required number of rules has been generated.
4. lowerBoundMinSupport — Lower bound on the minimum support.
5. metricType — The metric used to rank the rules: confidence (class association rules can only be mined with confidence), lift, leverage, or conviction. Besides confidence, Weka provides several confidence-like measures of how strongly a rule is associated:
   a) Lift: P(A,B) / (P(A)P(B)). Lift = 1 means A and B are independent; the larger the value (> 1), the less likely it is that A and B occur in the same basket by chance, i.e. the stronger the association.
   b) Leverage: P(A,B) - P(A)P(B). Leverage = 0 means A and B are independent; the larger the value, the closer the relationship between A and B.
   c) Conviction: P(A)P(!B) / P(A,!B), where !B means B does not occur. Conviction also measures the independence of A and B; from its relation to lift (negate B, substitute into the lift formula, and take the reciprocal) one can see that the larger the value, the stronger the association.
6. minMetric — Minimum value of the chosen metric.
7. numRules — The number of rules to find.
8. outputItemSets — If set to true, the itemsets are included in the output.
9. removeAllMissingCols — Remove columns that consist entirely of missing values.
10. significanceLevel — Significance level of the significance test (confidence metric only).
11. upperBoundMinSupport — Upper bound on the minimum support; the iteration starts from this value and decreases.
12. verbose — If set to true, the algorithm runs in verbose mode.
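To make the three formulas concrete, a small self-contained sketch (plain Java; the helper names are my own, and the probabilities would in practice be estimated as itemset counts divided by the number of instances):

// pAB = P(A,B), pA = P(A), pB = P(B), pAnotB = P(A,!B)
static double lift(double pAB, double pA, double pB)     { return pAB / (pA * pB); }
static double leverage(double pAB, double pA, double pB) { return pAB - pA * pB; }
static double conviction(double pA, double pB, double pAnotB) {
    return pA * (1 - pB) / pAnotB;  // P(A)P(!B) / P(A,!B)
}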
2) Console output
Apriori
=======

Minimum support: 0.2 (5 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 16

Generated sets of large itemsets:

Size of set of large itemsets L(1): 11
Size of set of large itemsets L(2): 21
Size of set of large itemsets L(3): 6

Best rules found:

 1. tear-prod-rate=reduced 12 ==> contact-lenses=none 12    conf:(1)
 2. spectacle-prescrip=myope tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 3. spectacle-prescrip=hypermetrope tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 4. astigmatism=no tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 5. astigmatism=yes tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 6. contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
 7. contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
 8. tear-prod-rate=normal contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
 9. astigmatism=no contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
10. contact-lenses=soft 5 ==> astigmatism=no tear-prod-rate=normal 5    conf:(1)
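A quick sanity check of these figures: the contact-lenses data set contains 24 instances, so the final minimum support of 0.2 corresponds to (int)(0.2 * 24 + 0.5) = 5 instances, exactly as printed; and rule 1 reads as "the premise tear-prod-rate=reduced covers 12 instances, and all 12 of them also satisfy contact-lenses=none", giving confidence 12/12 = 1.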
Please credit the source when reposting: http://www.cnblogs.com/rongyux/
Original article: http://www.cnblogs.com/rongyux/p/5384184.html