Technologies for Mining Frequent Patterns in Large Databases
Jiawei Han, Intelligent Database Systems Research Lab., Simon Fraser University, Canada
http://www.cs.sfu.ca/han

Tutorial Outline
- What is frequent pattern mining?
- Frequent pattern mining algorithms: Apriori and its variations
- A multi-dimensional view of frequent pattern mining
- Constraint-based frequent pattern mining
- Recent progress on efficient mining methods:
  - Mining frequent patterns without candidate generation
  - CLOSET: efficient mining of frequent closed itemsets
  - FreeSpan: towards efficient sequential pattern mining

Part I: What Is Frequent Pattern Mining?
- What is a frequent pattern?
- Why frequent pattern mining?
- Challenges in frequent pattern mining

What Is Frequent Pattern Mining?
- What is a frequent pattern? A pattern (a set of items, a sequence, etc.) whose items occur together frequently in a database [AIS93].
- Frequent patterns are an important form of regularity:
  - What products were often purchased together? Beers and diapers!
  - What are the consequences of a hurricane?
  - What is the next target after buying a PC?

Application Examples
- Market basket analysis:
  - Maintenance agreements: what should the store do to boost maintenance agreement sales?
  - Home electronics: what other products should the store stock up on if it has a sale on home electronics?
- Attached mailing in direct marketing.
- Detecting "ping-pong"ing of patients: transaction = patient; item = doctor/clinic visited by a patient; support of a rule = number of common patients.

Frequent Pattern Mining: A Cornerstone in Data Mining
- Association analysis: basket data analysis, cross-marketing, catalog design, loss-leader analysis, text database analysis.
- Correlation or causality analysis.
- Clustering.
- Classification: association-based classification analysis.
- Sequential pattern analysis: Web log sequences, DNA analysis, etc.
- Partial periodicity, cyclic/temporal associations.

Association Rule Mining
- Given: a database of customer transactions, where each transaction is a list of items (purchased by a customer in one visit).
- Find: all rules that correlate the presence of one set of items with that of another set of items.
  - Example: 98% of people who purchase tires and auto accessories also get automotive services done.
  - A rule may have any number of items in its antecedent or consequent.
  - It is possible to specify constraints on rules (e.g., find only rules involving home laundry appliances).

Basic Concepts
- Rule form: "A => B [support s, confidence c]".
- Support: usefulness of discovered rules.
- Confidence: certainty of the detected association.
- Rules that satisfy both min_sup and min_conf are called strong.
- Examples:
  - buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
  - age(x, "30-34") ^ income(x, "42K-48K") => buys(x, "high-resolution TV") [2%, 60%]
  - major(x, "CS") ^ takes(x, "DB") => grade(x, "A") [1%, 75%]

Rule Measures: Support and Confidence
- Find all rules X ^ Y => Z with minimum support and confidence:
  - Support s: the probability that a transaction contains {X, Y, Z}.
  - Confidence c: the conditional probability that a transaction containing {X, Y} also contains Z.
- With minimum support 50% and minimum confidence 50%, we have A => C (50%, 66.6%) and C => A (50%, 100%).
(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both.)

Part II: Frequent Pattern Mining Methods: Apriori and Its Variations
- The Apriori algorithm
- Improvements of Apriori
- Incremental, parallel, and distributed methods
- Different measures in association mining

An Influential Mining Methodology: The Apriori Algorithm
- Proposed by Agrawal & Srikant in 1994; a similar level-wise algorithm was proposed by Mannila et al. in 1994.
- Major idea: a subset of a frequent itemset must be frequent. E.g., if {beer, diaper, nuts} is frequent, {beer, diaper} must be; if any subset is infrequent, its supersets cannot be frequent!
- A powerful, scalable candidate-set pruning technique: it reduces the number of candidate k-itemsets dramatically.

Mining Association Rules: Example
- Min. support 50%, min. confidence 50%.
- For the rule A => C: support = support({A, C}) = 50%; confidence = support({A, C}) / support({A}) = 66.6%.
- The Apriori principle: any subset of a frequent itemset must be frequent.

Procedure of Mining Association Rules
1. Find the frequent itemsets, i.e., the sets of items that have minimum support (Apriori):
   - A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent.
   - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
2. Use the frequent itemsets to generate association rules.

The Apriori Algorithm
- Join step: C_k is generated by joining L_{k-1} with itself.
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, and hence should be removed.
- (C_k: candidate itemsets of size k; L_k: frequent itemsets of size k.)

Apriori Pseudocode
    L_1 = {frequent items};
    for (k = 1; L_k != {}; k++) do begin
        C_{k+1} = candidates generated from L_k;
        for each transaction t in the database do
            increment the count of all candidates in C_{k+1} that are contained in t;
        L_{k+1} = candidates in C_{k+1} with min_support;
    end
    return union over k of L_k;

The Apriori Algorithm: Example
(Figure: an example database D is scanned repeatedly, producing candidate sets C1, C2, C3 and frequent itemsets L1, L2, L3.)

How to Generate Candidates?
- Suppose the items in L_{k-1} are listed in an order.
- Step 1: self-joining L_{k-1}:
    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
- Step 2: pruning:
    forall itemsets c in C_k do
        forall (k-1)-subsets s of c do
            if (s is not in L_{k-1}) then delete c from C_k

How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be very large.
  - One transaction may contain many candidates.
- Method:
  - Candidate itemsets are stored in a hash-tree.
  - A leaf node of the hash-tree contains a list of itemsets and counts.
  - An interior node contains a hash table.
  - A subset function finds all the candidates contained in a transaction.

Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace.
- Pruning: acde is removed because ade is not in L3.
- C4 = {abcd}

Example: Counting Supports of Candidates
(Figure: hash-tree lookup for transaction 1 2 3 5 6, recursively split into 1 + 2 3 5 6, 1 2 + 3 5 6, and 1 3 + 5 6.)

Generating Strong Association Rules
- confidence(A => B) = Prob(B | A) = support(A ∪ B) / support(A)
- Example, with L3 = {2, 3, 5}:
  - 2 ^ 3 => 5, confidence = 2/2 = 100%
  - 2 ^ 5 => 3, confidence = 2/3 = 67%
  - 3 ^ 5 => 2, confidence = 2/2 = 100%
  - 2 => 3 ^ 5, confidence = 2/3 = 67%
  - 3 => 2 ^ 5, confidence = 2/3 = 67%
  - 5 => 3 ^ 2, confidence = 2/3 = 67%

Efficient Implementation of Apriori in SQL
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98.
- With implementations based on pure SQL-92, it is impossible to get good performance out of pure SQL-based approaches alone.
- Making use of object-relational extensions such as UDFs, BLOBs, and table functions yields orders-of-magnitude improvements.

Improvements of Apriori
- General ideas:
  - Scan the transaction database in as few passes as possible.
  - Reduce the number of candidates.
  - Facilitate the support counting of candidates.

DIC: Reduce the Number of Scans
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97.
- Basic idea: count the itemsets at the boundary in a dynamic fashion, starting to count a candidate as soon as all of its subsets are known to be frequent, at checkpoints during a scan, instead of waiting for the next full database pass.
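The support and confidence figures quoted above (A => C at 50%, 66.6%; C => A at 50%, 100%) can be reproduced with a few lines of code. A minimal Python sketch; the four-transaction database below is an assumption chosen to be consistent with those numbers, since the original example table does not survive in this text:

```python
# Hypothetical 4-transaction database, chosen so that
# support({A, C}) = 50%, conf(A => C) = 66.6%, conf(C => A) = 100%.
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """conf(A => C) = support(A union C) / support(A)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"A", "C"}, db))        # 0.5
print(confidence({"A"}, {"C"}, db))   # 0.666...
print(confidence({"C"}, {"A"}, db))   # 1.0
```

A rule is then "strong" exactly when `support` meets min_sup and `confidence` meets min_conf.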
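The level-wise pseudocode above maps almost line-for-line onto runnable code. A minimal Python sketch using absolute support counts; the function and variable names are mine, not from the tutorial:

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise Apriori. db: list of set-valued transactions;
    min_sup: minimum absolute support count.
    Returns {frequent itemset (as a sorted tuple): support count}."""
    def frequent(candidates):
        # one database scan: count each candidate, keep those meeting min_sup
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if set(c) <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    # L1: frequent 1-itemsets
    L = frequent([(i,) for i in sorted({i for t in db for i in t})])
    result = dict(L)
    k = 1
    while L:
        candidates = set()
        # join step: pairs of frequent k-itemsets sharing their first k-1 items
        for p, q in combinations(sorted(L), 2):
            if p[:-1] == q[:-1]:
                cand = p[:-1] + tuple(sorted((p[-1], q[-1])))
                # prune step: every k-subset of the candidate must be frequent
                if all(s in L for s in combinations(cand, k)):
                    candidates.add(cand)
        L = frequent(candidates)
        result.update(L)
        k += 1
    return result
```

On a database consistent with the strong-rule figures later in this tutorial, e.g. [{1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}] with min_sup = 2, this returns {2, 3, 5} as a frequent 3-itemset with support 2.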
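The hash-tree's job is to find, for each transaction, all stored candidates that the transaction contains. A simplified Python sketch that keeps the same subset-counting logic but uses a flat hash table as a stand-in for the hash-tree (an assumption for brevity, not the slide's actual data structure):

```python
from itertools import combinations

def count_supports(db, candidates, k):
    """Count each candidate k-itemset by enumerating every transaction's
    k-subsets and probing a hash table (a flat stand-in for the
    hash-tree described above)."""
    counts = {tuple(sorted(c)): 0 for c in candidates}
    for t in db:
        for s in combinations(sorted(t), k):  # all k-subsets of t
            if s in counts:
                counts[s] += 1
    return counts
```

Probing by the transaction's own subsets, rather than scanning the whole candidate list per transaction, is what makes a hashed structure pay off when one transaction contains many candidates.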
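The join-and-prune example above (L3 = {abc, abd, acd, ace, bcd} yielding C4 = {abcd}) can be checked mechanically. A small Python sketch of the two steps:

```python
from itertools import combinations

def gen_candidates(Lk):
    """Apriori candidate generation: self-join the frequent k-itemsets,
    then prune any candidate with an infrequent k-subset."""
    Lk = {tuple(sorted(s)) for s in Lk}
    k = len(next(iter(Lk)))
    Ck1 = set()
    for p in Lk:
        for q in Lk:
            # join: equal first k-1 items, last item of p before last of q
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # prune: keep cand only if all its k-subsets are frequent
                if all(s in Lk for s in combinations(cand, k)):
                    Ck1.add(cand)
    return Ck1

L3 = {"abc", "abd", "acd", "ace", "bcd"}
print(gen_candidates(L3))  # abcd survives; acde is pruned (ade not in L3)
```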