    August 12, 2007.ppt


    Slide 0: KDD-07 Invited Innovation Talk
      August 12, 2007
      Research
      Usama Fayyad, Ph.D., Chief Data Officer & Executive VP, Yahoo! Inc.

    Slide 1: Thanks and Gratitude
      • My family: my wife Kristina and my 4 kids; my parents and my sisters
      • My academic roots: The University of Michigan, Ann Arbor; my Ph.D. committee, including Ramasamy Uthurusamy (then at GM Research Labs); grad student colleagues (Jie Cheng); internships at GM Research and at NASA's JPL
      • My mentors and collaborators: Caltech Astronomy (G. Djorgovski, Nick Weir), Pietro Perona and M.C. Burl
      • JPL/NASA colleagues: Padhraic Smyth, Rich Doyle, Steve Chien, Paul Stolorz, Peter Cheeseman, David Atkinson, many others
      • Microsoft colleagues: Decision Theory Group, Surajit Chaudhuri, Jim Gray, Paul Bradley, Bassel Ojjeh, Nick Besbeas, Heikki Mannila, Rick Rashid, many others
      • Fellows in KDD: Gregory Piatetsky-Shapiro, Daryl Pregibon, Christos Faloutsos, Geoff Webb, Bob Grossman, Jiawei Han, Eric Tsui, Tharam Dillon, Chengqi Zhang, many, many colleagues
      • My business partners: Bassel Ojjeh, Nick Besbeas, many VCs, many advisers and strategic clients including Microsoft SQL Server and sales teams
      • My Yahoo! colleagues: Zod Nazem, Jerry Yang, David Filo, Yahoo! exec team, Prabhakar Raghavan, Pavel Berkhin, Nick Weir, Hunter Madsen, Nitin Sharma, Raghu Ramakrishnan, Y! Research folks, many at Yahoo SDS and current and previous Yahoo! employees

    Slide 2: Personal Observations of a Data Mining Disciple
      A Data Miner's Story: Getting to Know the Grand Challenges
      Research
      Usama Fayyad, Ph.D., Chief Data Officer & Executive VP, Yahoo! Inc.

    Slide 3: Overview
      • The setting
      • Why data mining is a must?
      • Why data mining is not happening?
      • A Data Miner's Story
      • Grand Challenges: Pragmatic
      • Grand Challenges: Technical
      • Some case studies
      • Concluding Remarks

    Slide 4: The data gap
      • The Machinery Moves on:

        • Moore's law: processing “capacity” doubles every 18 months: CPU, cache, memory
        • Its more aggressive cousin: disk storage “capacity” doubles every 9 months
      • The Demand is exploding:
        • Every business is an eBusiness
        • Scientific instruments and Moore's law
        • Government
        • The Internet: the ubiquity of the Web
      • The Talent Shortage

    Slide 5: What is Data Mining?
      • Finding interesting structure in data
        • Structure: refers to statistical patterns, predictive models, hidden relationships
        • Interesting: ?
      • Examples of tasks addressed by Data Mining
        • Predictive Modeling (classification, regression)
        • Segmentation (Data Clustering)
        • Affinity (Summarization): relations between fields, associations, visualization

    Slide 6: Beyond Data Analysis
      • Scaling analysis to large databases
        • How to deal with data without having to move it out?
        • Are there abstract primitive accesses to the data, in database systems, that can provide mining algorithms with the information to drive the search for patterns?
        • How do we minimize, or sometimes even avoid, having to scan the large database in its entirety?
      • Automated search
        • Enumerate and create numerous hypotheses
        • Fast search
        • Useful data reductions
      • More emphasis on understandable models
      • Finding patterns and models that are “interesting” or “novel” to users
      • Scaling to high-dimensional data and models

    Slide (unnumbered): Data Mining and Databases
      • Many interesting analysis queries are difficult to state precisely
      • Examples:
        • Which records represent fraudulent transactions?
        • Which households are likely to prefer a Ford over a Toyota?
        • Who's a good credit risk in my customer DB?
      • Yet the database contains the information: good/bad customer, profitability, did/did not respond to mailout/survey/...

    Slide 8: Data Mining Grand Vision
      [Diagram labels: What's New?; What's Interesting?; Predict for me]

    Slide 9: The myths
      • Companies have built up some large and impressive data warehouses
      • Data mining is pervasive nowadays
      • Large corporations know how to do it
      • There are tools and applications that discover valuable information in enterprise databases

    Slide 10: The truths
      • Data is a shambles; most data mining efforts end up not benefiting from existing data infrastructure
      • Corporations care a lot about data, and are obsessed with customer behavior and understanding it
      • They talk a lot about it
      • An extremely small number of businesses are successfully mining data
      • The successful efforts are “one-off”, “lucky strikes”

    Slide 11: Current state of Databases
      • Data navigation, exploration, & exploitation technology is fairly primitive:
        • we know how to build massive data stores
        • we do not know how to exploit them
        • we do the book-keeping really well (OLTP)
      • Inadequate basic understanding of navigation/systems
        • many large data stores are write-only (= data tomb)
      • Ancient Egypt

    Slide 12: A Data Miner's Story
      • Started out in pure research
        • Professional student
        • Math and algorithms

    Slide 13: Researcher view
      [Diagram labels: Algorithms and Theory; Database; Systems]

    Slide 14: Practitioner view
      [Diagram labels: Systems and integration; Database; Algorithms; Customer]

    Slide 15: Business view
      [Diagram labels: Systems; Database; Algorithms; Customer; $]

    Slide 16: A Data Miner's Story
      • Started out in pure research
      • At NASA-JPL
        • Did basic research and applied techniques to Science Data Analysis problems
        • Worked with top scientists in several fields: astronomy, planetary geology, atmospherics, space science, remote sensing imagery
        • Great results, strong group, lots of funding, high demand
      • So why move to Microsoft Research?

    Slide 17: Example: Cataloging Sky Objects

    Slide (unnumbered): Data Mining Based Solution
      • 94% accuracy in recognizing sky objects
      • Speed up catalog generation by one to two orders of magnitude (unrealistic to perform manually)
      • Classify objects that are at least one magnitude fainter than catalogs to date
      • Tripled the “data yield”
      • Generate sky catalogs with much richer content: on the order of billions of objects: 2×10^7 galaxies, 2×10^8 stars, 10^5 quasars
      • Discovered new quasars 40 times more efficiently

    Slide 20: A Data Miner's Story
      • Started out in pure research
      • At NASA-JPL
      • At Microsoft Research
        • Basic research in algorithms and scalability
        • Began to worry about building products and integrating with database server
        • Two groups established: research and product
      • So why move out to a start-up?

    Slide 21: Working with Large Databases
      • One scan (or less) of the database
        • Terminate early if appropriate
      • Work within confines of a given limited RAM buffer
        • Cluster a Gigabyte or Terabyte in, say, 10 or 100 Megabytes of RAM
      • “Anytime” algorithm
        • Best answer always handy
      • Pause/resume enabled, incremental
      • Operate on forward-only cursor over a view (essentially a data stream)

    Slide 22: Business Challenges
      • Conversion
      • Retention
      • Acquisition
      • Loyalty

      • Average Order
      • Technologies
      • Business Results Gap
      • Business Challenges: Conversion, Retention, Acquisition, Loyalty, Average Order
      • Technical Tools
      • Business users are unable to apply the power of existing data mining tools to achieve results

    Slide 23: Business Challenges
      • Conversion, Retention, Acquisition, Loyalty, Average Order
      • Technologies
      • Business Results Gap
      • Business Challenges: Conversion, Retention, Acquisition, Loyalty, Average Order
      • Technical Tools
      • Business users are unable to apply the power of existing data mining tools to achieve results

    Slide 24: Evolving Data Mining
      • Evolution on the technical front:
        • New algorithms
        • Embedded applications
        • Make the analyst's life easier
      • Evolution on the usability front:
        • New metaphors
        • Vertical applications embedding
        • Used by the business user
      • In both cases, success means invisibility

    Slide 25: Grand Challenges
      • Pragmatic: Achieving integration and invisibility
      • Research/Technical: Solving some serious unaddressed problems

    Slide 26: Pragmatic Grand Challenge 1
      • Where is the data?
        • There is a glut of stored data
        • Very little of that data is ready for mining
        • Data warehousing has proven that it will not solve the problem for us
      • Solution: integration with operational systems
        • Take a serious database approach to solving the storage management problem

    Slide 27: digiMine Background
      • Started as a venture capital-funded company: digiMine, Inc., in March 2000
      • Built, operated and hosted data warehouses with built-in data mining apps
      • Headquartered in Bellevue, Washington
      • $45 million in funding: Mayfield, Mohr Davidow, American Express, Deutsche Bank
      • Grew to over 120 employees
      • 50+ patents in technology and processes
      • Both technology and services

    Slide 28: Sample Customers

    Slide 29: A Data Miner's Story
      • Started out in pure research
      • At NASA-JPL
      • At Microsoft Research
      • At digiMine
        • Lots of VC funding, great team, great press coverage, and fast moving great customers
      • So why move to DMX Group?

    Slide 30: Why DMX Group?
      • At digiMine, we grew a large “Professional Services” organization
      • We learned a lot from these engagements
      • VC-funded companies cannot do much consulting
      • A fork in the road appeared
        • digiMine re-focused on a market vertical: behavioral targeting for media and publishers
        • Renamed to Revenue Science, Inc.
        • Formed DMX Group, which was eventually acquired by Yahoo!

    Slide 31: DMX Group Mission
      • Make enterprise data a working asset in the enterprise:
        • Data strategy for the business
        • Implementation of Business Intelligence and data mining capabilities
      • Business issues around data
        • What is possible?
        • How to expose it to business users
        • How to train people and change processes
      • Integration with operational systems

    Slide 32: Data Strategy
      • How can your data influence your revenues?
      • How do you optimize operations based on data?
      • How do you increase customer retention based on data?
      • How do you utilize enterprise data assets to spot new opportunities:
        • Cross-sell to existing customers
        • Grow new markets
        • Avoid problems such as fraud, abuse, churn, etc.?

    Slide 33: A Data Miner's Story
      • Started out in pure research
      • At NASA-JPL
      • At Microsoft Research
      • At digiMine/Revenue Science Inc.
      • At DMX Group

    Slide 34: Pragmatic Grand Challenge 2
      • Embedding within Operational Systems
        • We all worry about algorithms, they are fascinating
        • Most of us know that data mining in practice is mostly data prep work
        • Go where the data is when the data does not come to you
      • But how much of the problem is “data mining”?
      • Facts:
        • The effort in embedding an application is huge, and often not discussed
        • Without it, all the algorithms are useless

    Slide 35: Churn Modelling and Prediction
      Case Study: Wireless Telco
      Research

    Slide 36: Modeling Process
      [Diagram labels: Customer Interaction Base; SMS, WAP, CDR, Billing; steps 1 to 6; Risk; Value]

    Slide 37: LTV and Its Application
      • A customer's lifetime value (LTV) is the net value that a customer brings in to a business by the end of their service, i.e. their profit contribution.
      • LTV allows decisions for individual customers that optimize the return on investment (ROI). Examples:
        • Aggressive retention programs, such as equipment upgrade and contract renewal, for high LTV
        • Differentiated customer care treatment for reactivations by customers with low LTV

    Slide 38: What is Required?
      • Detailed data
        • Integration of CDR, WIG, SMS, Billing
        • Maintained at detailed level
      • Integrated data mining
        • Algorithms tuned to model thousands of variables and millions of rows
        • Accurate forecasts
      • System robustness
        • Massively scalable back end system
        • Flexible architecture to create new variables quickly and easily
      • Collaborative Service Model
        • Service model which guarantees success
        • Combined IQ Model to optimize science and business knowledge
        • Low cost to create and maintain models

    Slide 39: Map Segments to Actions
      [Diagram labels. Axes: Churn Probability; Forecasted LTV. Scale labels: High, Low, Negative. Segments: Nurture / Maintain; Aggressively Defend; Cautiously Defend; Grow Margin; Change Bad Behavior; Let them go. Actions: Equipment Upgrade; Feature Add; Contract Renewal; Save Program; Elite Program; Loyalty Programs; Feature Use; Plan Migration; Cost Reducing Programs]

    Slide 40: Cost Rules Applied
      • Cost Rules are introduced to define scoring
      • For example:
        • Network System Usage Cost
        • Mobile to Land Connections Costs
        • Technical Operations/Support Costs
        • Long Distance Costs
        • Inter-Carrier/International subsidy costs
        • Roaming Costs
        • Bad Debt Allocation
        • Many others

    Slide 41: Cost Rules for a Bank?
      • Cost Rules are introduced to define value
      • For example:
        • Deposit Value
        • Product mix
        • Average daily balance
        • Monthly service fees
        • Technical operations/Support costs
        • Branch/teller usage
        • Late payment/Overdraft history
        • Interest rate
        • Contract term
        • Credit Score
        • Employment history/Income

    Slide 42: Pragmatic Grand Challenge 3
      • Integrating domain knowledge
        • Data mining algorithms are knowledge free
        • There is no notion of “common sense reasoning”
        • Do we have to solve an AI-hard problem?
      • Robust and deep domain knowledge utilization
      • Solution:
        • Very deep and very narrow integration
        • Ability to “model” business strategy
        • Reasoning capability just evolves (cf. chess players)

    Slide 43: Cross-Sell / Up-Sell Example

    Slide 44: Pragmatic Grand Challenge 4
      • Managing and maintaining models
        • When was the last time you thought about the lifetime of a mining model?
        • What happens when a model is changed?
        • Have you tried to merge the results of two different clustering models over time?
        • How many “data droppings” (aka temp files, quick transformations, quick fixes) do you generate in an analysis session?
      • A framework for managing, updating, and retiring mining models
      • Solution: use techniques that have been invented for this: databases, systems management, software engineering, etc.

    Slide 45: Pragmatic Grand Challenge 5
      • Effectiveness Measurement
        • How do we measure honestly the effectiveness of a model in a context?
        • Return on Investment (ROI) measurement
        • Evaluation in the context of the application
      • A framework and methodology for measurement and evaluation
        • Build the measurement method as part of the design of the model
        • An engineering recipe for measurements, and a set of metrics

    Slide 46: Technical Challenges
      Research

    Slide 47: Technical Challenges
      0. Public benchmark data sets

        • As a field we have failed to define a common data collection
        • Very difficult to judge research and systems advances
        • Not an easy task, but not impossible
        • A mix of synthetic (but realistic) data sets and real data sets

    Slide 48: Technical Challenges
      1. How does the data grow?
        • A theory for how large data sets get to be large
        • Definitely not IID sampling from a static distribution
        • Inappropriateness of a “single-population” model
      2. Complexity/understandability tradeoff
        • Explaining how, when and why a model works
        • Explaining when a model fails
        • A “Tuning Dial” for reducing the complex into the understandable

    Slide 49: Technical Challenges
      3. Interestingness
        • What is an “interesting” pattern or summary?
        • How do you measure “novelty”?
        • What is “unusual”?
        • When is it worthy of attention?
        • Is it low probability events? High summarization ability? Outliers? Good fits? Bad fits?

    Slide 50: Technical Challenges
      4. Scalability
        • Beyond just dealing with a large data set:
          • Principled feature reduction: what is the SVD equivalent?
          • Graceful degradation with dimensionality
        • Uncovering graphical structure in data
          • Communities, relations, link analysis
        • Dealing with multiple data types:
          • Structured, sparse, dense, text, images, video, audio, sequence data, etc.
          • I have yet to see an algorithm that deals with more than one type.
        • Integration with DBMS
          • Appropriate sampling
          • Appropriate operator abstractions
        • Taking care of “minor details”
          • Initialization?
          • Determining k

    Slide 51: Technical Challenges
      5. A theory for what we do
        • What are the fundamental abstractions?
        • What are the basic operations?
        • What are the basic components of an algorithm?
        • What is it that we are optimizing?
        • What is hard? What is doable? Why?
        • What is a “data summary”?
        • When are two attributes “similar”? Can you measure efficiently?
        • How do we extract the right representation?
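
The constraints listed on slide 21 ("Working with Large Databases") can be made concrete with a small sketch: one forward-only pass, memory that does not grow with the number of rows, a usable answer at any moment, and pause/resume. The code below uses plain sequential k-means as a stand-in technique; it is not claimed to be the algorithm the talk refers to. The names OnlineKMeans, row_stream and budget are hypothetical, and seeding the centroids with the first k rows is a simplifying assumption.

```python
import random
from typing import Iterator, List, Optional, Sequence, Tuple


class OnlineKMeans:
    """Sequential k-means: memory is O(k * d), independent of the number of rows."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.centroids: List[List[float]] = []  # current best answer ("anytime")
        self.counts: List[int] = []             # rows absorbed by each centroid
        self.rows_seen = 0

    def update(self, row: Sequence[float]) -> None:
        """Absorb one row, then discard it; nothing is buffered."""
        self.rows_seen += 1
        if len(self.centroids) < self.k:
            # Simplistic initialization (an assumption): seed with the first k rows.
            self.centroids.append(list(row))
            self.counts.append(1)
            return
        nearest = min(
            range(self.k),
            key=lambda i: sum((c - x) ** 2 for c, x in zip(self.centroids[i], row)),
        )
        self.counts[nearest] += 1
        step = 1.0 / self.counts[nearest]
        # Running-mean update: the centroid stays the mean of its assigned rows.
        self.centroids[nearest] = [
            c + step * (x - c) for c, x in zip(self.centroids[nearest], row)
        ]

    def consume(self, cursor: Iterator[Sequence[float]],
                budget: Optional[int] = None) -> None:
        """Read at most `budget` rows from a forward-only cursor, then pause."""
        taken = 0
        while budget is None or taken < budget:
            row = next(cursor, None)
            if row is None:  # cursor exhausted
                break
            self.update(row)
            taken += 1


def row_stream(n_rows: int, seed: int = 0) -> Iterator[Tuple[float, float]]:
    """Hypothetical stand-in for a database cursor: two well-separated 2-D blobs."""
    rng = random.Random(seed)
    for i in range(n_rows):
        cx, cy = (0.0, 0.0) if i % 2 == 0 else (10.0, 10.0)
        yield (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))


cursor = row_stream(4000, seed=1)
model = OnlineKMeans(k=2)
model.consume(cursor, budget=1000)  # pause after 1000 rows; centroids are already usable
model.consume(cursor)               # resume on the same cursor until it is exhausted
print(model.rows_seen, sorted(model.centroids))
```

Because each row is folded into a running mean and then dropped, the working set is only the k centroids and their counts, which is the sense in which a large table can be clustered inside a small fixed buffer. The result depends on row order and on the initialization, which is one of the "minor details" slide 50 raises.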

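
Slide 37 defines lifetime value (LTV) as a customer's net profit contribution by the end of their service, and slide 40 lists cost rules that feed the scoring. One way to read the two together is sketched below, under loudly stated assumptions: a constant monthly margin, a constant monthly churn probability, no discounting, and a fixed horizon. The function name expected_ltv, the cost keys and every number are illustrative, not taken from the case study.

```python
from typing import Mapping


def expected_ltv(
    monthly_revenue: float,
    monthly_costs: Mapping[str, float],
    monthly_churn_prob: float,
    horizon_months: int = 36,
) -> float:
    """Sum of expected monthly margins while the customer is still active."""
    # Cost rules: every allocated cost is subtracted from revenue to get the margin.
    margin = monthly_revenue - sum(monthly_costs.values())
    survival = 1.0  # probability the customer is still active in the current month
    total = 0.0
    for _ in range(horizon_months):
        total += survival * margin
        survival *= 1.0 - monthly_churn_prob
    return total


# Illustrative cost allocation; the keys echo items on slide 40, the values are made up.
costs = {"network_usage": 12.0, "roaming": 3.0, "bad_debt_allocation": 5.0}
low_risk = expected_ltv(50.0, costs, monthly_churn_prob=0.01)
high_risk = expected_ltv(50.0, costs, monthly_churn_prob=0.10)
print(round(low_risk, 2), round(high_risk, 2))
```

With the same revenue and costs, the higher churn probability yields the lower forecast, and a customer whose allocated costs exceed revenue gets a negative LTV, which matches the "Negative" label on the Forecasted LTV axis of slide 39.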

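
The "data gap" on slide 4 follows from the two doubling periods it quotes: 18 months for processing capacity and 9 months for disk storage capacity. The short calculation below works out how fast the ratio between the two widens. It assumes both rates hold exactly and continuously, which is an idealization; the function names are hypothetical.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Capacity multiple after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)


def storage_to_processing_gap(
    months: float,
    processing_doubling: float = 18.0,  # from slide 4
    storage_doubling: float = 9.0,      # from slide 4
) -> float:
    """How many times faster storage has grown than processing."""
    return growth_factor(months, storage_doubling) / growth_factor(
        months, processing_doubling
    )


for years in (3, 6):
    months = 12 * years
    print(
        years,
        growth_factor(months, 18.0),        # processing: 4.0, then 16.0
        growth_factor(months, 9.0),         # storage: 16.0, then 256.0
        storage_to_processing_gap(months),  # gap: 4.0, then 16.0
    )
```

Under these assumptions the gap itself doubles every 18 months: after three years there is four times as much stored data per unit of processing capacity, and after six years sixteen times as much.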