August 12, 2007.ppt
Slide 0
August 12, 2007
KDD-07 Invited Innovation Talk
Research
Usama Fayyad, Ph.D.
Chief Data Officer & Executive VP, Yahoo! Inc.

Slide 1: Thanks and Gratitude
  My family: my wife Kristina and my 4 kids; my parents and my sisters
  My academic roots: The University of Michigan, Ann Arbor; my Ph.D. committee, including Ramasamy Uthurusamy (then at GM Research Labs); grad student colleagues (Jie Cheng); internships at GM Research and at NASA's JPL
  My Mentors and Collaborators: Caltech Astronomy (G. Djorgovski, Nick Weir), Pietro Perona and M.C. Burl
  JPL-NASA Colleagues: Padhraic Smyth, Rich Doyle, Steve Chien, Paul Stolorz, Peter Cheeseman, David Atkinson, many others
  Microsoft Colleagues: Decision Theory Group, Surajit Chaudhuri, Jim Gray, Paul Bradley, Bassel Ojjeh, Nick Besbeas, Heikki Mannila, Rick Rashid, many others
  Fellows in KDD: Gregory Piatetsky-Shapiro, Daryl Pregibon, Christos Faloutsos, Geoff Webb, Bob Grossman, Jiawei Han, Eric Tsui, Tharam Dillon, Chengqi Zhang, many, many colleagues
  My Business Partners: Bassel Ojjeh, Nick Besbeas, many VCs, many advisers and strategic clients including Microsoft SQL Server and sales teams
  My Yahoo! Colleagues: Zod Nazem, Jerry Yang, David Filo, Yahoo! exec team, Prabhakar Raghavan, Pavel Berkhin, Nick Weir, Hunter Madsen, Nitin Sharma, Raghu Ramakrishnan, Y! Research folks, many at Yahoo SDS and current and previous Yahoo! employees

Slide 2: Personal Observations of a Data Mining Disciple
A Data Miner's Story: Getting to Know the Grand Challenges
Research
Usama Fayyad, Ph.D.
Chief Data Officer & Executive VP, Yahoo! Inc.

Slide 3: Overview
  The setting
  Why is data mining a must?
  Why is data mining not happening?
  A Data Miner's Story
  Grand Challenges: Pragmatic
  Grand Challenges: Technical
  Some case studies
  Concluding Remarks

Slide 4: The data gap
  The Machinery Moves on:
    Moore's law: processing “capacity” doubles every 18 months (CPU, cache, memory)
    Its more aggressive cousin: disk storage “capacity” doubles every 9 months
  The Demand is exploding:
    Every business is an eBusiness
    Scientific Instruments and Moore's law
    Government
    The Internet: the ubiquity of the Web
  The Talent Shortage

Slide 5: What is Data Mining?
  Finding interesting structure in data
    Structure: refers to statistical patterns, predictive models, hidden relationships
    Interesting: ?
  Examples of tasks addressed by Data Mining
    Predictive Modeling (classification, regression)
    Segmentation (Data Clustering)
    Affinity (Summarization): relations between fields, associations, visualization

Slide 6: Beyond Data Analysis
  Scaling analysis to large databases
    How to deal with data without having to move it out?
    Are there abstract primitive accesses to the data, in database systems, that can provide mining algorithms with the information to drive the search for patterns?
    How do we minimize, or sometimes even avoid, having to scan the large database in its entirety?
  Automated search
    Enumerate and create numerous hypotheses
    Fast search
    Useful data reductions
  More emphasis on understandable models
  Finding patterns and models that are “interesting” or “novel” to users
  Scaling to high-dimensional data and models

Slide 7: Data Mining and Databases
  Many interesting analysis queries are difficult to state precisely
  Examples:
    Which records represent fraudulent transactions?
    Which households are likely to prefer a Ford over a Toyota?
    Who's a good credit risk in my customer DB?
  Yet the database contains the information:
    good/bad customer, profitability
    did/did not respond to mailout/survey/...

Slide 8: Data Mining Grand Vision
  What's New?
  What's Interesting?
  Predict for me

Slide 9: The myths
  Companies have built up some large and impressive data warehouses
  Data mining is pervasive nowadays
  Large corporations know how to do it
  There are tools and applications that discover valuable information in enterprise databases

Slide 10: The truths
  Data is a shambles; most data mining efforts end up not benefiting from existing data infrastructure
  Corporations care a lot about data, and are obsessed with customer behavior and understanding it
  They talk a lot about it
  An extremely small number of businesses are successfully mining data
  The successful efforts are “one-off”, “lucky strikes”

Slide 11: Current state of Databases
Ancient Egypt
  Data navigation, exploration, & exploitation technology is fairly primitive:
    we know how to build massive data stores
    we do not know how to exploit them
    we do the book-keeping really well (OLTP)
  Inadequate basic understanding of navigation/systems
  Many large data stores are write-only (= data tomb)

Slide 12: A Data Miner's Story
  Started out in pure research
    Professional student
    Math and algorithms

Slide 13: Researcher view
  Algorithms and Theory, Database, Systems

Slide 14: Practitioner view
  Systems and integration, Database, Algorithms, Customer

Slide 15: Business view
  Systems, Database, Algorithms, Customer, $s

Slide 16: A Data Miner's Story
  Started out in pure research
  At NASA-JPL
    Did basic research and applied techniques to Science Data Analysis problems
    Worked with top scientists in several fields: astronomy, planetary geology, atmospherics, space science, remote sensing imagery
    Great results, strong group, lots of funding, high demand
  So why move to Microsoft Research?

Slide 17: Example: Cataloging Sky Objects

Data Mining Based Solution
  94% accuracy in recognizing sky objects
  Speed up catalog generation by one to two orders of magnitude (unrealistic to perform manually)
  Classify objects that are at least one magnitude fainter than catalogs to date
  Tripled the “data yield”
  Generate sky catalogs with much richer content, on the order of billions of objects: 2x10^7 galaxies, 2x10^8 stars, 10^5 quasars
  Discovered new quasars 40 times more efficiently

Slide 20: A Data Miner's Story
  Started out in pure research
  At NASA-JPL
  At Microsoft Research
    Basic research in algorithms and scalability
    Began to worry about building products and integrating with the database server
    Two groups established: research and product
  So why move out to a start-up?

Slide 21: Working with Large Databases
  One scan (or less) of the database
    terminate early if appropriate
  Work within the confines of a given limited RAM buffer
    Cluster a Gigabyte or Terabyte in, say, 10 or 100 Megabytes of RAM
  “Anytime” algorithm
    best answer always handy
  Pause/resume enabled, incremental
  Operate on a forward-only cursor over a view (essentially a data stream)

Slides 22-23: Business Challenges
  Business Challenges: Conversion, Retention, Acquisition, Loyalty, Average Order
  Business Results Gap
  Technologies / Technical Tools
  Business users are unable to apply the power of existing data mining tools to achieve results

Slide 24: Evolving Data Mining
  Evolution on the technical front:
    New algorithms
    Embedded applications
    Make the analyst's life easier
  Evolution on the usability front:
    New metaphors
    Vertical applications embedding
    Used by the business user
  In both cases, success means invisibility

Slide 25: Grand Challenges
  Pragmatic: Achieving integration and invisibility
  Research/Technical: Solving so
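
The constraints listed on the “Working with Large Databases” slide (one scan, a small fixed memory budget, an answer that is always available, pause/resume, a forward-only cursor) can be illustrated with a minimal sketch. The slide does not name an algorithm, so this uses sequential (online) k-means as a stand-in. It is not the method from the talk. The names `StreamingKMeans`, `fit_stream` and `fake_cursor` are hypothetical and exist only for this illustration.

```python
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple

_END = object()  # sentinel marking an exhausted cursor


class StreamingKMeans:
    """Sequential k-means: the state is k centroids and k counts, whatever the data size."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.centroids: List[List[float]] = []  # current best answer ("anytime")
        self.counts: List[int] = []

    def update(self, point: Sequence[float]) -> None:
        # Seed the model with the first k points seen.
        if len(self.centroids) < self.k:
            self.centroids.append(list(point))
            self.counts.append(1)
            return
        # Pick the nearest centroid by squared Euclidean distance.
        j = min(
            range(self.k),
            key=lambda i: sum((c - x) ** 2 for c, x in zip(self.centroids[i], point)),
        )
        # Shift that centroid so it remains the running mean of its points.
        self.counts[j] += 1
        step = 1.0 / self.counts[j]
        self.centroids[j] = [c + step * (x - c) for c, x in zip(self.centroids[j], point)]

    def fit_stream(self, rows: Iterable[Sequence[float]], max_rows: Optional[int] = None) -> int:
        """Read rows once, front to back; stop after max_rows if given.

        Pass the same iterator again to resume where the previous call stopped.
        """
        it = iter(rows)
        seen = 0
        while max_rows is None or seen < max_rows:
            row = next(it, _END)
            if row is _END:
                break
            self.update(row)
            seen += 1
        return seen


def fake_cursor(n: int) -> Iterator[Tuple[float, float]]:
    """Hypothetical stand-in for a forward-only database cursor."""
    for i in range(n):
        yield (0.0, 0.0) if i % 2 == 0 else (10.0, 10.0)


model = StreamingKMeans(k=2)
cursor = fake_cursor(1000)
model.fit_stream(cursor, max_rows=100)  # pause here; model.centroids is already usable
model.fit_stream(cursor)                # resume from the cursor's current position
```

Memory use depends only on `k` and the row width, never on the number of rows. The price is that each row influences the result only once and in arrival order, so the outcome depends on how the data is ordered.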
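
The doubling periods quoted on “The data gap” slide (18 months for processing, 9 months for disk storage) imply a gap that itself grows exponentially. The sketch below works through that arithmetic. It assumes both trends are pure exponentials with exactly those periods, which is a simplification, and the function names are hypothetical.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)


def data_gap(months: float,
             processing_doubling: float = 18.0,
             storage_doubling: float = 9.0) -> float:
    """Ratio of storage growth to processing growth over the same time span."""
    return growth_factor(months, storage_doubling) / growth_factor(months, processing_doubling)


three_years = 36.0
print(growth_factor(three_years, 18.0))  # processing capacity multiplier
print(growth_factor(three_years, 9.0))   # storage capacity multiplier
print(data_gap(three_years))             # how much further storage has pulled ahead
```

Under these assumptions the ratio simplifies to 2^(t/18): stored data outgrows the capacity to process it by a further factor of two every 18 months.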
