Data Mining and Model Choice in Supervised Learning

Gilbert Saporta
Chaire de Statistique Appliquée & CEDRIC, CNAM, 292 rue Saint Martin, F-75003 Paris
gilbert.saporta@cnam.fr http:/am.fr/saporta
Beijing, 2008

Outline
What is data mining?
Association rule discovery
Statistical models
Predictive modelling
A scoring case study
Discussion

1. What is data mining?

Data mining is a new field at the frontiers of statistics and information technologies (database management, artificial intelligence, machine learning, etc.) which aims at discovering structures and patterns in large data sets.

1.1 Definitions

U. M. Fayyad, G. Piatetsky-Shapiro: "Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."
D. J. Hand: "I shall define Data Mining as the discovery of interesting, unexpected, or valuable structures in large data sets."

The metaphor of Data Mining means that there are treasures (or nuggets) hidden under mountains of data, which may be discovered by specific tools.
Data Mining is concerned with data which were collected for another purpose: it is a secondary analysis of databases that are collected Not Primarily For Analysis, but for the management of individual cases (Kardaun, T. Alanko, 1998).
Data Mining is not concerned with efficient methods for collecting data such as surveys and experimental designs (Hand, 2000).

The idea of discovering facts from data is as old as Statistics, which "is the science of learning from data" (J. Kettenring, former ASA president).
In the 60s: Exploratory Data Analysis (Tukey, Benzécri).
"Data analysis is a tool for extracting the diamond of truth from the mud of data." (J. P. Benzécri, 1973)
What is new? Is it a revolution?

1.2 Data Mining started from:

An evolution of DBMS towards Decision Support Systems using a Data Warehouse.
Storage of huge data sets: credit card transactions, phone calls, supermarket bills. Gigabytes and terabytes of data are collected automatically.
Marketing operations: CRM (customer relationship management).
Research in Artificial Intelligence, machine learning, and KDD (Knowledge Discovery in Data Bases).

1.3 Goals and tools

Data Mining is a secondary analysis of data collected for another purpose (e.g. management).
Data Mining aims at finding structures of two kinds: models and patterns.

Patterns
A pattern is a characteristic structure exhibited by a small number of points: a small subgroup of customers with a high commercial value or, conversely, a high risk.
Tools: cluster analysis, visualisation by dimension reduction (PCA, CA, etc.), association rules.
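Among the tools just listed, association rules lend themselves to a small numeric illustration. The sketch below, in Python, computes the usual support, confidence and lift of a single rule on five made-up baskets. The data and the function name `rule_metrics` are illustrative assumptions and do not come from the talk.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent.

    `transactions` is a list of sets of items (hypothetical toy data here).
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Count transactions containing A, C, and both A and C.
    n_a = sum(antecedent <= t for t in transactions)
    n_c = sum(consequent <= t for t in transactions)
    n_ac = sum((antecedent | consequent) <= t for t in transactions)
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_ac / n            # estimate of P(A and C)
    confidence = n_ac / n_a       # estimate of P(C given A)
    lift = confidence / (n_c / n) # P(C given A) / P(C)
    return support, confidence, lift


# Made-up baskets, for illustration only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread"},
]
support, confidence, lift = rule_metrics(baskets, {"bread", "butter"}, {"milk"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the consequent is more frequent among transactions containing the antecedent than in the data as a whole. A lift close to 1 suggests the rule carries little information, even when its confidence is high.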
Models
Building models is a major activity for statisticians, econometricians, and other scientists. A model is a global summary of relationships between variables, which both helps to understand phenomena and allows predictions.
DM is not concerned with estimation and tests of prespecified models, but with discovering models through an algorithmic search process exploring linear and non-linear models, explicit or not: neural networks, decision trees, Support Vector Machines, logistic regression, graphical models, etc.
In DM, models do not come from a theory, but from data exploration.

Process or tools?
DM often appears as a collection of tools, usually presented in one package, in such a way that several techniques may be compared on the same data set.
But DM is a process, not only tools:
Data → (preprocessing) → Information → (analysis) → Knowledge

2. Association rule discovery, or market basket analysis

Illustration with a real industrial example at the Peugeot-Citroen car manufacturing company (Ph.D. of Marie Plasse).

ASSOCIATION RULES MINING
"90% of transactions that purchase bread and butter also purchase milk" (Agrawal et al., 1993)
{bread, butter} → {milk}
Antecedent (itemset A) → consequent (itemset C), where $A \cap C = \emptyset$
Supp = 30%: 30% of transactions contain bread + butter + milk
Conf = 90%: 90% of transactions that contain bread + butter also contain milk

Support: $P(A \cap C)$
Confidence: $P(C \mid A)$
Thresholds $s_0$ and $c_0$
Interesting result only if $P(C \mid A)$ is much larger than $P(C)$, or $P(C \mid \text{not } A)$ is low.
Lift: $\dfrac{P(C \mid A)}{P(C)} = \dfrac{P(A \cap C)}{P(A)\,P(C)}$

MOTIVATION
Motivation: decision-making aid. Always searching for a higher quality level, the car manufacturer can take advantage of knowledge of associations between attributes.
Industrial data: a set of vehicles described by a large set of binary flags.
Our work: we are looking for patterns in data (association discovery).

DATA FEATURES
Data size: more than 80 000 vehicles (transactions), 4 months of manufacturing, more than 3000 attributes (items).
[Figure: percentage of vehicles by count of attributes owned per vehicle]

OUTPUT: ASSOCIATION RULES
Aims: reduce the number of rules, reduce the size of rules.
A first reduction is obtained by manual grouping.

COMBINING CLUSTER ANALYSIS AND ASSOCIATION RULES
10-cluster partition obtained with hierarchical clustering and the Russell-Rao coefficient.
Cluster 2 is atypical and produces many complex rules.
Mining association rules inside each cluster except the atypical cluster:
The number of rules to analyse has significantly decreased.
The output rules are simpler to analyse.
Clustering has detected an atypical cluster of attributes to treat separately.

3. Statistical models

About statistical models
Unsupervised case: a representation of a probabilisable real world: $X$ r.v. $\in$ parametric family $f(x; \theta)$
Supervised case: response $Y = \phi(X) + \varepsilon$
Different goals
Unsupervised: good fit with parsimony
Supervised: accurate predictions

3.1 Model choice and penalized likelihood

The likelihood principle (Fisher, 1920)
Sample of n iid observations: $L(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)$
The best model is the one which maximizes the likelihood, i.e. the probability of having observed the data. ML estimation, etc.

Overfitting risk
Likelihood increases with the number of parameters.
Variable selection: a particular case of model selection.
Need for parsimony: Occam's razor.

William of Ockham (1285-1348), from Wikipedia
An English Franciscan friar and scholastic philosopher. He was summoned to Avignon in 1324 by Pope John XXII on accusation of heresy, and spent four years there in effect under house arrest.
William of Ockham inspired, in U. Eco's The Name of the Rose, the monastic detective William of Baskerville, who uses logic in a similar manner.
Occam's razor states that the explanation of any phenomenon should make as few assumptions as possible, eliminating, or "shaving off", those that make no difference in the observable predictions of the explanatory hypothesis or theory.
lex parsimoniae: entia non sunt multiplicanda praeter necessitatem, or: entities should not be multiplied beyond necessity.

Penalized likelihood
Nested (?) family of parametric models, with k parameters: trade-off between the fit and the complexity.
Akaike: $\mathrm{AIC} = -2 \ln(L) + 2k$
Schwarz: $\mathrm{BIC} = -2 \ln(L) + k \ln(n)$
Choose the model which minimizes AIC or BIC.

3.2 AIC and BIC: different theories

AIC: approximation of the Kullback-Leibler divergence between the true model and the best choice inside the family.
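To make the penalized-likelihood criteria concrete, here is a minimal Python sketch that compares two nested Bernoulli models with AIC = -2 ln(L) + 2k and BIC = -2 ln(L) + k ln(n). The counts, the two candidate models and all helper names are assumptions chosen for illustration; nothing here reproduces an example from the talk.

```python
import math


def bernoulli_loglik(successes, trials, p):
    """Log-likelihood of `successes` out of `trials` iid Bernoulli(p) draws."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def aic(loglik, k):
    """Akaike criterion: -2 ln(L) + 2k."""
    return -2 * loglik + 2 * k


def bic(loglik, k, n):
    """Schwarz criterion: -2 ln(L) + k ln(n)."""
    return -2 * loglik + k * math.log(n)


# Hypothetical data: (successes, trials) observed in two groups.
groups = [(30, 100), (42, 100)]
n = sum(trials for _, trials in groups)

# Model 1 (k = 1): one common probability; the ML estimate is the pooled frequency.
p_pooled = sum(s for s, _ in groups) / n
loglik_1 = sum(bernoulli_loglik(s, t, p_pooled) for s, t in groups)

# Model 2 (k = 2): one probability per group; the ML estimate is each group frequency.
loglik_2 = sum(bernoulli_loglik(s, t, s / t) for s, t in groups)

scores = {
    "common p": (aic(loglik_1, 1), bic(loglik_1, 1, n)),
    "one p per group": (aic(loglik_2, 2), bic(loglik_2, 2, n)),
}
best_aic = min(scores, key=lambda m: scores[m][0])
best_bic = min(scores, key=lambda m: scores[m][1])
print("AIC picks:", best_aic)
print("BIC picks:", best_bic)
```

The larger model always reaches a likelihood at least as high as the smaller one, which is the overfitting risk noted earlier. With these made-up counts the two criteria disagree: AIC picks the two-parameter model and BIC the one-parameter model, because the BIC penalty ln(200) per parameter exceeds the AIC penalty of 2.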