    Chapter 26- Data Mining.ppt

Chapter 26: Data Mining
(Some slides courtesy of Rich Caruana, Cornell University)
Ramakrishnan and Gehrke. Database Management Systems, 3rd Edition.

Definition
  • Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
  • Example pattern (Census Bureau data): If (relationship = husband), then (gender = male). 99.6%

Definition (Cont.)
  • Valid: The patterns hold in general.
  • Novel: We did not know the pattern beforehand.
  • Useful: We can devise actions from the patterns.
  • Understandable: We can interpret and comprehend the patterns.
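
The 99.6% attached to the example pattern on the Definition slide can be read as a rule confidence: of the records that match the "if" part, the fraction that also match the "then" part. The sketch below is illustrative only and assumes that reading. The function name rule_confidence and the toy records are hypothetical and are not Census Bureau data.

```python
def rule_confidence(records, antecedent, consequent):
    """Fraction of records satisfying `antecedent` that also satisfy `consequent`."""
    matching = [r for r in records if antecedent(r)]
    if not matching:
        return None  # the rule never fires, so its confidence is undefined
    return sum(1 for r in matching if consequent(r)) / len(matching)


# Hypothetical toy records, made up for illustration.
records = [
    {"relationship": "husband", "gender": "male"},
    {"relationship": "husband", "gender": "male"},
    {"relationship": "husband", "gender": "male"},
    {"relationship": "husband", "gender": "female"},
    {"relationship": "wife", "gender": "female"},
]

confidence = rule_confidence(
    records,
    antecedent=lambda r: r["relationship"] == "husband",
    consequent=lambda r: r["gender"] == "male",
)
print(confidence)  # 0.75 (3 of the 4 "husband" records are "male")
```

A rule that holds for nearly every matching record can still fail the "novel" and "useful" tests, which is why the definition lists those criteria separately from validity.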

Why Use Data Mining Today?
  • Human analysis skills are inadequate:
      • Volume and dimensionality of the data
      • High data growth rate
  • Availability of:
      • Data storage
      • Computational power
      • Off-the-shelf software
      • Expertise

An Abundance of Data
  • Supermarket scanners, POS data
  • Preferred customer cards
  • Credit card transactions
  • Direct mail response
  • Call center records
  • ATM machines
  • Demographic data
  • Sensor networks
  • Cameras
  • Web server logs
  • Customer web site trails

Commercial Support
  • Many data mining tools http:/
  • Database systems with data mining support
  • Visualization tools
  • Data mining process support
  • Consultants

Why Use Data Mining Today?
  • Competitive pressure! “The secret of success is to know something that nobody else knows.” (Aristotle Onassis)
  • Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)
  • Personalization
  • CRM
  • The real-time enterprise
  • Security, homeland defense

Types of Data
  • Relational data and transactional data
  • Spatial and temporal data, spatio-temporal observations
  • Time-series data
  • Text
  • Voice
  • Images, video
  • Mixtures of data
  • Sequence data
  • Features from processing other data sources

The Knowledge Discovery Process
Steps:
  1. Identify business problem
  2. Data mining
  3. Action
  4. Evaluation and measurement
  5. Deployment and integration into business processes

Data Mining Step in Detail
  2.1 Data preprocessing
      • Data selection: Identify target datasets and relevant fields
      • Data transformation
          • Data cleaning
          • Combine related data sources
          • Create common units
          • Generate new fields
          • Sampling
  2.2 Data mining model construction
  2.3 Model evaluation

Data Selection
  • Data sources are expensive:
      • Obtaining data
      • Loading data into the database
      • Maintaining data
  • Most fields are not useful:
      • Names
      • Addresses
      • Code numbers

Data Cleaning
  • Missing data
      • Unknown demographic data
      • Impute missing values when possible
  • Incorrect data
      • Hand-typed default values (e.g., 1900 for dates)
  • Misplaced fields
      • Data does not always match documentation
  • Missing relationships
      • Foreign keys missing or dangling

Combining Data Sources
  • Enterprise data is typically stored in many heterogeneous systems
  • Keys to join systems may or may not be present
  • Heuristics must be used when keys are missing:
      • Time-based matching
      • Situation-based matching

Create Common Units
  • Data exists at different granularity levels:
      • Customers
      • Transactions
      • Products
  • Data mining requires a common granularity level (often called a case)
  • Mining usually occurs at "customer" or similar granularity

Generate New Fields
  • Raw data fields may not be useful by themselves
  • Simple transformations can improve mining results dramatically:
      • Customer start date → customer tenure
      • Recency, Frequency, Monetary values
  • Fields at the wrong granularity level must be aggregated

Sampling
  • Most real datasets are too large to mine directly (> 200 million cases)
  • Apply random sampling to reduce data size and improve error estimation
  • Always sample at analysis granularity (case/"customer"), never at transaction granularity

Target Formats
  • Denormalized table
      • One row per case/customer
      • One column per field

Target Formats
  • Star schema
      [Figure: star schema with tables Transactions, Customers, Products, Services]
  • Must join/roll up to customer level before mining
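
The roll-up to customer level described above can be sketched as a small aggregation that turns transaction rows into one row per customer, with Recency, Frequency, and Monetary fields as the generated columns. This is a minimal illustration under assumptions. The field names (customer_id, date, amount), the function name roll_up_to_customer, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def roll_up_to_customer(transactions, as_of):
    """Aggregate transaction rows into one row per customer (the 'case')."""
    grouped = defaultdict(list)
    for t in transactions:
        grouped[t["customer_id"]].append(t)

    table = {}
    for customer_id, rows in grouped.items():
        last_purchase = max(r["date"] for r in rows)
        table[customer_id] = {
            "recency_days": (as_of - last_purchase).days,  # days since last purchase
            "frequency": len(rows),                        # number of transactions
            "monetary": sum(r["amount"] for r in rows),    # total amount spent
        }
    return table


# Hypothetical transaction-level data.
transactions = [
    {"customer_id": "c1", "date": date(2024, 1, 5), "amount": 20.0},
    {"customer_id": "c1", "date": date(2024, 1, 20), "amount": 35.0},
    {"customer_id": "c2", "date": date(2024, 1, 10), "amount": 12.5},
]

customers = roll_up_to_customer(transactions, as_of=date(2024, 1, 31))
print(customers["c1"])  # {'recency_days': 11, 'frequency': 2, 'monetary': 55.0}
```

Sampling would then be applied to the keys of this customer-level table, not to the transaction rows, in line with the rule to sample at analysis granularity.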

Data Transformation Example
  • Client: major health insurer
  • Business problem: determine when the web is effective at deflecting call volume
  • Data sources:
      • Call center records
      • Web data
      • Claims
      • Customer and provider database

Data Transformation Example
  • Cleaning required:
      • Dirty reason codes in call center records
      • Missing customer IDs in some web records
      • No session information in web records
      • Incorrect date fields in claims
      • Missing values in customer and provider records
      • Some customer records missing entirely
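
Two of the cleaning problems listed above, incorrect date fields and missing values, can be handled with very small transformations. The sketch below is not the project's actual cleaning code. It assumes that a placeholder date of 1900-01-01 marks an unknown date, and it uses median imputation as one possible choice. The field names and records are hypothetical.

```python
from datetime import date
from statistics import median

# Assumption: hand-typed placeholder dates look like 1900-01-01.
SENTINEL_DATE = date(1900, 1, 1)


def mark_sentinel_dates_missing(rows, field):
    """Return copies of the rows with placeholder dates replaced by None."""
    return [
        {**row, field: None if row[field] == SENTINEL_DATE else row[field]}
        for row in rows
    ]


def impute_median(rows, field):
    """Return copies of the rows with missing numeric values set to the median."""
    known = [row[field] for row in rows if row[field] is not None]
    if not known:
        return [dict(row) for row in rows]  # nothing to impute from
    fill = median(known)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]


# Hypothetical records.
claims = [
    {"claim_id": 1, "service_date": date(2023, 5, 2)},
    {"claim_id": 2, "service_date": date(1900, 1, 1)},
]
members = [
    {"id": "a", "age": 34},
    {"id": "b", "age": None},
    {"id": "c", "age": 50},
    {"id": "d", "age": 41},
]

cleaned_claims = mark_sentinel_dates_missing(claims, "service_date")
imputed_members = impute_median(members, "age")
print(cleaned_claims[1]["service_date"])  # None
print(imputed_members[1]["age"])          # 41
```

Both helpers return new rows instead of editing in place, so the raw extract stays available for auditing what the cleaning step changed.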

Data Transformation Example
  • Combining data sources:
      • Systems use different keys. Mappings were provided, but not all rows joined properly.
      • Web data was difficult to match due to missing customer IDs on certain rows.
      • Call center rows incorrectly combined portions of different calls.

Data Transformation Example
  • Creating common units:
      • Symptom: a combined reason code that could be applied to both web and call data
      • Interaction: a unit of work in servicing a customer, comparable between web and call
      • Rollup to customer granularity

Data Transformation Example
  • New fields:
      • Followup call: was a web interaction followed by a call on a similar topic within a given timeframe?
      • Repeat call: did a customer call more than once about the same topic?
      • Web adoption rate: to what degree did a customer or group use the web?

Data Transformation Example
  • Implementation took six man-months
      • Two full-time employees working for three months
      • Time extended due to changes in problem definition and delays in obtaining data
  • Transformations take time
      • One week to run all transformations on a full dataset (200GB)
      • Transformation run needed to be monitored continuously

What is a Data Mining Model?
  • A data mining model is a description of a specific aspect of a dataset. It produces output values for an assigned set of input values.
  • Examples:
      • Linear regression model
      • Classification model
      • Clustering

Data Mining Models (Contd.)
A data mining model can be described at two levels:
  • Functional level: Describes the model in terms of its intended usage. Examples: classification, clustering.
  • Representational level: Specific representation of a model. Examples: log-linear model, classification tree, nearest neighbor method.
  • Black-box models versus transparent models
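
The distinction between the functional and the representational level can be illustrated with two models that serve the same function (classification of one numeric input) but use different representations. The sketch below is an assumption-laden illustration. The class names, the threshold of 30, and the training pairs are hypothetical and do not come from the slides.

```python
class ThresholdRule:
    """Transparent representation: one readable rule on a single numeric input."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, x):
        return "yes" if x >= self.threshold else "no"


class NearestNeighbor:
    """Stores the training sample and copies the label of the closest stored point."""

    def __init__(self, training):
        self.training = list(training)

    def predict(self, x):
        _, label = min(self.training, key=lambda pair: abs(pair[0] - x))
        return label


# Hypothetical (input, label) pairs.
training = [(22, "no"), (25, "no"), (31, "yes"), (40, "yes")]

# Same functional level (both are classifiers), different representational level.
models = [ThresholdRule(threshold=30), NearestNeighbor(training)]
print([m.predict(27) for m in models])  # ['no', 'no']
print([m.predict(35) for m in models])  # ['yes', 'yes']
```

The threshold rule can be read directly, while the nearest neighbor model can only be understood by inspecting its stored sample, which hints at the transparent versus black-box contrast.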

Types of Variables
  • Numerical: Domain is ordered and can be represented on the real line (e.g., age, income)
  • Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race)
  • Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)

Data Mining Techniques
  • Supervised learning
      • Classification and regression
  • Unsupervised learning
      • Clustering and association rules
  • Dependency modeling
  • Outlier and deviation detection
  • Trend analysis and change detection

Supervised Learning
  • F(x): true function (usually not known)
  • D: training sample drawn from F(x)

Supervised Learning
  • F(x): true function (usually not known)
  • D: training sample (x, F(x))

      x                                                                        | F(x)
      57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0      | 0
      78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0     | 1
      69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0    | 0
      18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0      | 0
      54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0    | 1

  • G(x): model learned from D

      71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0       | ?

  • Goal: E[(F(x) - G(x))^2] is small (near zero) for future samples

Supervised Learning
  • Well-defined goal: Learn G(x) that is a good approximation to F(x) from training sample D
  • Well-defined error metrics: Accuracy, RMSE, ROC, ...
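
Two of the error metrics named above, accuracy and RMSE, are simple enough to sketch directly. The true labels in the first call are the five F(x) values from the training sample shown on the slide. The predicted labels and the numeric values in the second call are hypothetical, made up only to exercise the functions.

```python
import math


def accuracy(true_labels, predicted_labels):
    """Fraction of cases where the model output G(x) equals the true label F(x)."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


def rmse(true_values, predicted_values):
    """Square root of the mean of (F(x) - G(x))^2 over the sample."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))


true_labels = [0, 1, 0, 0, 1]       # F(x) column of the slide's training sample
predicted_labels = [0, 1, 1, 0, 1]  # hypothetical G(x) outputs

print(accuracy(true_labels, predicted_labels))  # 0.8
print(rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]))   # sqrt(5/3), roughly 1.29
```

Because the goal is stated for future samples, these metrics are only meaningful estimates when computed on records that were not used to learn G(x).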

Supervised vs. Unsupervised Learning
  • Supervised
      • y = F(x): true function
      • D: labeled training set
      • D: (x_i, F(x_i))
      • Learn: G(x): model trained to predict labels D
      • Goal: E[(F(x) - G(x))^2] ≈ 0
      • Well-defined criteria: Accuracy, RMSE, ...
  • Unsupervised
      • Generator: true model
      • D: unlabeled data sample
      • D: x_i
      • Learn: ?
      • Goal: ?
      • Well-defined criteria: ?

Classification Example
  • Example training database
      • Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
      • Age is ordered, Car-type is a categorical attribute
      • Class label indicates whether the person bought the product
      • Dependent attribute is categorical

Regression Example
  • Example training database
      • Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
      • Spent indicates how much the person spent during a recent visit to the web site
      • Dependent attribute is numerical

Types of Variables (Review)
  • Numerical: Domain is ordered and can be represented on the real line (e.g., age, income)
  • Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race)
  • Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)

Goals and Requirements
  • Goals:
      • To produce an accurate classifier/regression function
      • To understand the structure of the problem
  • Requirements on the model:
      • High accuracy
      • Understandable by humans, interpretable
      • Fast construction for very large training databases

Different Types of Classifiers
  • Decision trees
  • Simple Bayesian models
  • Nearest neighbor methods
  • Logistic regression
  • Neural networks
  • Linear discriminant analysis (LDA)
  • Quadratic discriminant analysis (QDA)
  • Density estimation methods

Decision Trees
  • A decision tree T encodes d (a classifier or regression function) in the form of a tree.
  • A node t in T without children is called a leaf node. Otherwise t is called an internal node.

What are Decision Trees?
  [Figure: decision tree with splits on Age (at 30) and Car Type (Minivan vs. Sports, Truck) and leaves labeled YES, NO, YES]

Internal Nodes
  • Each internal node has an associated splitting predicate. Most common are binary predicates.
  • Example predicates: Age [...] 0

Leaf Nodes
Consider leaf node t:
  • Classification problem: Node t is labeled with one class label c in dom(C)
  • Regression problem: Two choices
      • Piecewise constant model: t is labeled with a constant y in dom(Y).
      • Piecewise linear model: t is labeled with a linear model Y = y_t + \sum_i a_i X_i
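
The internal node and leaf node definitions above map naturally onto a small data structure: internal nodes carry a binary splitting predicate, and leaves carry a class label. The sketch below is only an illustration. The tree figure did not survive extraction, so the layout used here (age < 30 leads to a Car Type test, Minivan gives YES, other car types give NO, age >= 30 gives YES) is an assumption, as are the names Leaf, Internal, and classify.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # class label c in dom(C)


@dataclass
class Internal:
    predicate: Callable[[dict], bool]  # binary splitting predicate
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Internal]


def classify(node, record):
    """Follow splitting predicates from the root until a leaf is reached."""
    while isinstance(node, Internal):
        node = node.if_true if node.predicate(record) else node.if_false
    return node.label


# Assumed tree layout, for illustration only.
tree = Internal(
    predicate=lambda r: r["age"] < 30,
    if_true=Internal(
        predicate=lambda r: r["car_type"] == "Minivan",
        if_true=Leaf("YES"),
        if_false=Leaf("NO"),
    ),
    if_false=Leaf("YES"),
)

print(classify(tree, {"age": 25, "car_type": "Truck"}))    # NO
print(classify(tree, {"age": 25, "car_type": "Minivan"}))  # YES
print(classify(tree, {"age": 45, "car_type": "Sports"}))   # YES
```

For a regression tree with the piecewise constant model, the same structure would work with a numeric constant stored in the leaf in place of the class label.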

Example
  • Encoded classifier: If (age [...] = 30) Then NO
  [Figure: decision tree with splits on Age (at 30) and Car Type (Minivan vs. Sports, Truck) and leaves labeled YES, NO, YES]

Decision Tree Construction
Top-down tree construction schema:
  • Examine training database and find best splitting predicate for the root node
  • Partition training database
  • Recurse on each child node

Top-Down Tree Construction
  BuildTree(Node t, Training database D, Split Selection Method S)
  (1) Apply S to D to find splitting criterion
  (2) if (t is not a leaf node)
  (3)     Create children nodes of t
  (4)     Partition D into children partitions
  (5)     Recurse on each partition
  (6) endif

Decision Tree Construction
Three algorithmic components:
  • Split selection (CART, C4.5, QUEST, CHAID, CRUISE, ...)
  • Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping)
  • Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)

Split Selection Method
  • Numerical or ordered attributes: Find a split point that separates the (two) classes
  [Figure legend: class markers for Yes and No]

Split Selection Method (Contd.)
  • Categorical attributes: How to group?
  [Figure legend: markers for Sport, Truck, Minivan]
  • Candidate groupings:
      • (Sport, Truck) - (Minivan)
      • (Sport) - (Truck, Minivan)
      • (Sport, Minivan) - (Truck)

Pruning Method
  • For a tree T, the misclassification rate R(T,P) and the mean-squared error rate R(T,P) depend on P, but not on D.
  • The goal is to do well on records randomly drawn from P, not to do well on the records in D.
  • If the tree is too large, it overfits D and does not model P.
  • The pruning method selects the tree of the right size.

Data Access Method
  • Recent development: Very large training databases, both in-memory and on secondary storage
  • Goal: Fast, efficient, and scalable decision tree construction, using the complete training database.
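
The top-down schema above (select a split, partition, recurse) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any of the named algorithms. It handles numerical attributes only, uses the Gini index as the split selection criterion, and uses a depth limit as a direct stopping rule in place of a real pruning method. The function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, attributes):
    """Split selection: (attribute, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for attr in attributes:
        values = sorted({row[attr] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[attr] < threshold]
            right = [y for row, y in zip(rows, labels) if row[attr] >= threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:  # keep only splits that reduce impurity
                best, best_score = (attr, threshold), score
    return best


def build_tree(rows, labels, attributes, max_depth=3):
    """Top-down construction: find a split, partition the data, recurse."""
    split = best_split(rows, labels, attributes) if max_depth > 0 else None
    if split is None:  # pure node, no useful split, or depth limit reached
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    attr, threshold = split
    left = [(row, y) for row, y in zip(rows, labels) if row[attr] < threshold]
    right = [(row, y) for row, y in zip(rows, labels) if row[attr] >= threshold]
    return {
        "attr": attr,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [y for _, y in left], attributes, max_depth - 1),
        "right": build_tree([r for r, _ in right], [y for _, y in right], attributes, max_depth - 1),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["attr"]] < tree["threshold"] else tree["right"]
    return tree["leaf"]


# Hypothetical training database with one numerical predictor.
rows = [{"age": a} for a in (20, 25, 28, 35, 40, 50)]
labels = ["NO", "NO", "NO", "YES", "YES", "YES"]

tree = build_tree(rows, labels, attributes=["age"])
print(tree["attr"], tree["threshold"])  # age 31.5
print(predict(tree, {"age": 22}), predict(tree, {"age": 45}))  # NO YES
```

This sketch holds the whole training database in memory. The data access methods listed above exist precisely because that assumption breaks down for very large training databases.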

