Chapter 26- Data Mining.ppt
Chapter 26: Data Mining
(Some slides courtesy of Rich Caruana, Cornell University)
Ramakrishnan and Gehrke. Database Management Systems, 3rd Edition.

Definition
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6%

Definition (Cont.)
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the patterns.
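
The 99.6% attached to the example pattern reads most naturally as the share of records matching the "if" part for which the "then" part also holds. The slide does not name the measure, so that reading is an assumption. A minimal Python sketch of the calculation follows; the records are made-up toy data, not Census Bureau data, and the function name `rule_stats` is hypothetical.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = fraction of all records that satisfy both sides
    confidence = fraction of antecedent matches that also satisfy the consequent
    """
    def satisfies(record, condition):
        return all(record.get(field) == value for field, value in condition.items())

    matches = [r for r in records if satisfies(r, antecedent)]
    hits = [r for r in matches if satisfies(r, consequent)]
    support = len(hits) / len(records) if records else 0.0
    confidence = len(hits) / len(matches) if matches else 0.0
    return support, confidence


# Toy records (assumed for illustration); the third row mimics a data-entry error.
RECORDS = [
    {"relationship": "husband", "gender": "male"},
    {"relationship": "husband", "gender": "male"},
    {"relationship": "husband", "gender": "female"},
    {"relationship": "wife", "gender": "female"},
    {"relationship": "child", "gender": "male"},
]

support, confidence = rule_stats(
    RECORDS, {"relationship": "husband"}, {"gender": "male"}
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A rule that is "valid" in the sense of the slide should keep a similar confidence on data that was not used to find it, which is why the measure is usually checked on a held-out sample as well.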

Why Use Data Mining Today?
Human analysis skills are inadequate:
  Volume and dimensionality of the data
  High data growth rate
Availability of:
  Data
  Storage
  Computational power
  Off-the-shelf software
  Expertise

An Abundance of Data
Supermarket scanners, POS data
Preferred customer cards
Credit card transactions
Direct mail response
Call center records
ATM machines
Demographic data
Sensor networks
Cameras
Web server logs
Customer web site trails

Commercial Support
Many data mining tools http:/
Database systems with data mining support
Visualization tools
Data mining process support
Consultants

Why Use Data Mining Today?
Competitive pressure!
"The secret of success is to know something that nobody else knows." Aristotle Onassis
Competition on service, not only on price (Banks, phone companies, hotel chains, rental car companies)
Personalization
CRM
The real-time enterprise
Security, homeland defense

Types of Data
Relational data and transactional data
Spatial and temporal data, spatio-temporal observations
Time-series data
Text
Voice
Images, video
Mixtures of data
Sequence data
Features from processing other data sources

The Knowledge Discovery Process
Steps:
1. Identify business problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into business processes

Data Mining Step in Detail
2.1 Data preprocessing
  Data selection: Identify target datasets and relevant fields
  Data transformation
    Data cleaning
    Combine related data sources
    Create common units
    Generate new fields
    Sampling
2.2 Data mining model construction
2.3 Model evaluation

Data Selection
Data Sources are Expensive
  Obtaining Data
  Loading Data into Database
  Maintaining Data
Most Fields are not useful
  Names
  Addresses
  Code Numbers

Data Cleaning
Missing Data
  Unknown demographic data
  Impute missing values when possible
Incorrect Data
  Hand-typed default values (e.g. 1900 for dates)
Misplaced Fields
  Data does not always match documentation
Missing Relationships
  Foreign keys missing or dangling

Combining Data Sources
Enterprise Data typically stored in many heterogeneous systems
Keys to join systems may or may not be present
Heuristics must be used when keys are missing
  Time-based matching
  Situation-based matching

Create Common Units
Data exists at different Granularity Levels
  Customers
  Transactions
  Products
Data Mining requires a common Granularity Level (often called a Case)
Mining usually occurs at "customer" or similar granularity

Generate New Fields
Raw data fields may not be useful by themselves
Simple transformations can improve mining results dramatically:
  Customer start date
  Customer tenure
  Recency, Frequency, Monetary values
Fields at wrong granularity level must be aggregated

Sampling
Most real datasets are too large to mine directly (200 million cases)
Apply random sampling to reduce data size and improve error estimation
Always sample at analysis granularity (case/"customer"), never at transaction granularity.

Target Formats
Denormalized Table
  One row per case/customer
  One column per field

Target Formats
Star Schema
  Transactions, Customers, Products, Services
  Must join/roll-up to Customer level before mining
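
The preprocessing slides above (cleaning default dates, generating tenure and Recency/Frequency/Monetary fields, rolling transactions up to one row per customer, and sampling at customer granularity) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transaction rows, the start dates, the `AS_OF` reference date, and the helper names `build_cases` and `sample_cases` are all hypothetical, and treating tenure as "days since start date" is one plausible reading of the slide.

```python
import random
from collections import defaultdict
from datetime import date

# Hypothetical transaction rows: (customer_id, purchase_date, amount).
TRANSACTIONS = [
    ("c1", date(2003, 1, 5), 20.0),
    ("c1", date(2003, 3, 9), 35.0),
    ("c2", date(2002, 11, 2), 80.0),
    ("c3", date(2003, 2, 14), 12.5),
    ("c3", date(2003, 2, 20), 7.5),
    ("c3", date(2003, 4, 1), 10.0),
]
# Hypothetical start dates; 1900-01-01 stands in for a hand-typed default value.
START_DATES = {
    "c1": date(2001, 6, 1),
    "c2": date(1900, 1, 1),
    "c3": date(2002, 9, 15),
}
DEFAULT_DATE = date(1900, 1, 1)
AS_OF = date(2003, 5, 1)  # assumed analysis date


def build_cases(transactions, start_dates, as_of):
    """Roll transactions up to one case (dict of fields) per customer."""
    grouped = defaultdict(list)
    for customer_id, day, amount in transactions:
        grouped[customer_id].append((day, amount))

    cases = {}
    for customer_id, rows in grouped.items():
        start = start_dates.get(customer_id)
        if start == DEFAULT_DATE:
            start = None  # cleaning: a default date is really a missing value
        cases[customer_id] = {
            "tenure_days": (as_of - start).days if start else None,
            "recency_days": (as_of - max(day for day, _ in rows)).days,
            "frequency": len(rows),
            "monetary": sum(amount for _, amount in rows),
        }
    return cases


def sample_cases(cases, k, seed=0):
    """Draw a random sample of k customers (cases), not of transactions."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(cases), k)
    return {customer_id: cases[customer_id] for customer_id in chosen}


cases = build_cases(TRANSACTIONS, START_DATES, AS_OF)
sample = sample_cases(cases, k=2)
for customer_id, fields in sorted(sample.items()):
    print(customer_id, fields)
```

Because the sample is drawn over customer IDs after the roll-up, every sampled case keeps its complete transaction history. Sampling raw transactions instead would leave each customer with a partial history and distort fields such as frequency and monetary value.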

Data Transformation Example
Client: major health insurer
Business Problem: determine when the web is effective at deflecting call volume
Data Sources
  Call center records
  Web data
  Claims
  Customer and Provider database

Data Transformation Example
Cleaning Required
  Dirty reason codes in call center records
  Missing customer IDs in some web records
  No session information in web records
  Incorrect date fields in claims
  Missing values in customer and provider records
  Some customer records missing entirely

Data Transformation Example
Combining Data Sources
  Systems use different keys. Mappings were provided, but not all rows joined properly.
  Web data difficult to match due to missing customer IDs on certain rows.
  Call center rows incorrectly combined portions of different calls.

Data Transformation Example
Creating Common Units
  Symptom: a combined reason code that could be applied to both web and call data
  Interaction: a unit of work in servicing a customer comparable between web and call
  Rollup to customer granularity

Data Transformation Example
New Fields
  Followup call: was a web interaction followed by a call on a similar topic within a given timeframe?
  Repeat call: did a customer call more than once about the same topic?
  Web adoption rate: to what degree did a customer or group use the web?

Data Transformation Example
Implementation took six man-months
  Two full-time employees working for three months
  Time extended due to changes in problem definition and delays in obtaining data
Transformations take time
  One week to run all transformations on a full dataset (200GB)
  Transformation run needed to be monitored continuously

What is a Data Mining Model?
A data mining model is a des
