    Text Web Mining.ppt


    Text & Web Mining (10/6/2018)

    Structured Data
      • So far we have focused on mining from structured data:
          Attribute      Value
          Outlook        Sunny
          Temperature    Hot
          Windy          Yes
          Humidity       High
          Play           Yes
      • Most data mining involves such data

    Complex Data Types
      Increased importance of complex data:
      • Spatial data: includes geographic data and medical and satellite images
      • Multimedia data: images, audio, and video
      • Time-series data: for example, banking data and stock exchange data
      • Text data: word descriptions for objects
      • World Wide Web: highly unstructured text and multimedia data

    Text Databases
      Many text databases exist in practice:
      • News articles
      • Research papers
      • Books
      • Digital libraries
      • E-mail messages
      • Web pages
      They are growing rapidly in size and importance.

    Semi-Structured Data
      Text databases are often semi-structured. Example:
      • Title
      • Author
      • Publication_Date
      • Length
      • Category
      • Abstract
      • Content

    Handling Text Data
      • Modeling semi-structured data
      • Information Retrieval (IR) from unstructured documents
      • Text mining
          - Compare documents
          - Rank importance and relevance
          - Find patterns or trends across documents

    Information Retrieval
      • IR locates relevant documents
          - Key words
          - Similar documents
      • IR systems
          - On-line library catalogs
          - On-line document management systems

    Performance Measure
      Two basic measures
      [Figure: diagram of all documents, showing the retrieved documents, the relevant documents, and their overlap (relevant and retrieved)]

    Retrieval Methods
      • Keyword-based IR
          - E.g., "data and mining"
          - Synonymy problem: a document may talk about "knowledge discovery" instead
          - Polysemy problem: "mining" can mean different things
      • Similarity-based IR
          - Set of common keywords
          - Return the degree of relevance
          - Problem: what is the similarity of "data mining" and "data analysis"?

    Modeling a Document
      • Set of n documents and m terms
      • Each document is a vector v in R^m
      • The j-th coordinate of v measures the association of the j-th term [equation not preserved]
      • Here r is the number of occurrences of the j-th term and R is the number of occurrences of any term.

    Frequency Matrix
      [Figure: frequency matrix, not preserved]
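The document model above can be illustrated with a small sketch. The defining formula is missing from the extracted slide text, so this example assumes the simplest reading of the definitions of r and R: the j-th coordinate is the relative frequency r / R, with R counted over the m chosen terms only. The function names, the sample documents, and the row-per-document layout of the matrix are all hypothetical choices made for illustration.

```python
from collections import Counter


def term_vector(tokens, terms):
    """Relative-frequency vector of one document over the chosen terms.

    Assumption: coordinate j is r / R, where r counts the j-th term and
    R counts all occurrences of the chosen terms in the document.
    """
    counts = Counter(t for t in tokens if t in terms)
    total = sum(counts.values())  # R
    if total == 0:
        return [0.0] * len(terms)
    return [counts[t] / total for t in terms]  # r / R for each term


def frequency_matrix(docs, terms):
    """One row per document, one column per term."""
    return [term_vector(doc.lower().split(), terms) for doc in docs]


# Hypothetical vocabulary and documents, not taken from the slides.
terms = ["association", "mining", "data"]
docs = [
    "mining association mining idaho",
    "association mining data mining data",
]
matrix = frequency_matrix(docs, terms)
for row in matrix:
    print([round(x, 3) for x in row])
```

With n documents and m terms the matrix has n x m entries, most of them zero for a realistic vocabulary, so it grows quickly as documents and terms are added.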

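    Similarity Measures
      Cosine measure:
        sim(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}
      The numerator is the dot product and the denominator is the product of the norms of the vectors.

    Example
      • Google search for "association mining"
      • Two of the documents retrieved:
          - Idaho Mining Association: mining in Idaho (doc 1)
          - Scalable Algorithms for Association Mining (doc 2)
      • Using only the two terms [document vectors not preserved]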
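The cosine measure named on the Similarity Measures slide can be sketched in a few lines. The document vectors from the original example are not available, so the counts below are hypothetical and serve only to show the calculation; the function name `cosine` is likewise an illustrative choice.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two term vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical counts over the two terms ("association", "mining").
doc1 = [2, 1]
doc2 = [1, 1]
two_term_similarity = cosine(doc1, doc2)

# The same documents with a hypothetical third term, "data", added.
doc1_with_data = [2, 1, 0]
doc2_with_data = [1, 1, 3]
three_term_similarity = cosine(doc1_with_data, doc2_with_data)

print(round(two_term_similarity, 3), round(three_term_similarity, 3))
```

With these made-up counts the two-term vectors are nearly parallel (about 0.95), while adding a term that only one document uses lowers the similarity to about 0.40. This suggests how sensitive the measure can be to the choice of terms.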

    New Model
      • Add the term "data" to the document model

    Frequency Matrix
      • Will quickly become large
      • Singular value decomposition can be used to reduce it

    Association Analysis
      • Collect sets of keywords frequently used together and find associations among them
      • Apply any association rule algorithm to a database in the format {document_id, a_set_of_keywords}

    Document Classification
      • Need already classified documents as training set
      • Induce a classification model
      • Any difference from before? A set of keywords associated with a document has no fixed set of attributes or dimensions

    Association-Based Classification
      • Classify documents based on associated, frequently occurring text patterns
      • Extract keywords and terms with IR and simple association analysis
      • Create a concept hierarchy of terms
      • Classify training documents into class hierarchies
      • Use association mining to discover associated terms to distinguish one class from another

    Remember Generalized Association Rules
      Taxonomy:
        Clothes
          Outerwear
            Jackets
            Ski Pants
          Shirts
        Footwear (ancestor of shoes and hiking boots)
          Shoes
          Hiking Boots
      Generalized association rule X ⇒ Y, where no item in Y is an ancestor of an item in X

    Classifiers
      • Let X be a set of terms
      • Let Anc(X) be those terms and their ancestor terms
      • Consider a rule X ⇒ C and document d
      • If X ⊆ Anc(d) then X ⇒ C covers d
      • A rule that covers d may be used to classify d (but only one can be used)

    Procedure
      • Step 1: Generate all generalized association rules X ⇒ C, where X is a set of terms and C is a class, that satisfy minimum support
      • Step 2: Rank the rules according to some rule ranking criterion
      • Step 3: Select rules from the list

    Web Mining
      The World Wide Web may have more opportunities for data mining than any other area. However, there are serious challenges:
      • It is too huge
      • Complexity of Web pages is greater than any traditional text document collection
      • It is highly dynamic
      • It has a broad diversity of users
      • Only a tiny portion of the information is truly useful

    Search Engines → Web Mining
      • Current technology: search engines
          - Keyword-based indices
          - Too many relevant pages
          - Synonymy and polysemy problems
      • More challenging: web mining
          - Web content mining
          - Web structure mining
          - Web usage mining

    Web Content Mining

    Example: Classification of Web Documents
      • Assign a class to each document based on predefined topic categories
      • E.g., use Yahoo!'s taxonomy and associated documents for training
      • Keyword-based document classification
      • Keyword-based association analysis

    Web Structure Mining

    Authoritative Web Pages
      • High quality relevant Web pages are termed authoritative
      • Explore linkages (hyperlinks)
      • Linking a Web page can be considered an endorsement of that page
      • Those pages that are linked frequently are considered authoritative
      • (This has its roots in IR methods based on journal citations)

    Structure via Hubs
      • A hub is a set of Web pages containing collections of links to authorities
      • There is a wide variety of hubs:
          - Simple list of recommended links on a person's home page
          - Professional resource lists on commercial sites

    HITS
      Hyperlink-Induced Topic Search (HITS)
      • Form a root set of pages using the query terms in an index-based search (200 pages)
      • Expand into a base set by including all pages the root set links to (1000-5000 pages)
      • Go into an iterative process to determine hubs and authorities

    Calculating Weights
      • Authority weight [equation not preserved]
      • Hub weight [equation not preserved]
      • Page p is pointed to by page q

    Adjacency Matrix
      • Let's number the pages 1, 2, ..., n
      • The adjacency matrix is defined by [equation not preserved]
      • By writing the authority and hub weights as vectors we have [equation not preserved]

    Recursive Calculations
      • We now have [equation not preserved]
      • By linear algebra theory this converges to the principal eigenvectors of the two matrices

    Output
      The HITS algorithm finally outputs
      • Short list of pages with high hub weights
      • Short list of pages with high authority weights
      Have not accounted for context

    Applications
      • The Clever Project at IBM's Almaden Labs
          - Developed the HITS algorithm
      • Google
          - Developed at Stanford
          - Uses algorithms similar to HITS (PageRank)
          - On-line version

    Web Usage Mining

    Complex Data Types Summary
      Emerging areas of mining complex data types:
      • Text mining can be done quite effectively, especially if the documents are semi-structured
      • Web mining is more difficult due to lack of such structure
          - Data includes text documents, hypertext documents, link structure, and logs
      • Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification
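The iterative step of HITS described in the Calculating Weights and Recursive Calculations slides can be sketched as follows. The slide formulas are not available in this text, so the sketch assumes the standard formulation of the algorithm: the authority weight of page p is the sum of the hub weights of the pages q that point to p, and the hub weight of p is the sum of the authority weights of the pages p points to, with both vectors rescaled to unit length after each pass. The function names, the iteration count, and the four-page link graph are hypothetical.

```python
import math


def normalize(weights):
    """Scale a weight vector to unit Euclidean length (no-op if all zero)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else weights


def hits(links, n, iterations=50):
    """Run the hub/authority iteration on pages numbered 0..n-1.

    links is a list of (q, p) pairs meaning "page q links to page p".
    Returns (authority_weights, hub_weights).
    """
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # Authority of p: sum of hub weights of the pages q pointing to p.
        new_auth = [0.0] * n
        for q, p in links:
            new_auth[p] += hub[q]
        # Hub of q: sum of authority weights of the pages q points to.
        new_hub = [0.0] * n
        for q, p in links:
            new_hub[q] += new_auth[p]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Hypothetical base set: pages 0 and 1 link out, pages 2 and 3 are linked to.
links = [(0, 2), (0, 3), (1, 2)]
auth, hub = hits(links, n=4)
print("authority:", [round(a, 3) for a in auth])
print("hub:      ", [round(h, 3) for h in hub])
```

In this toy graph page 2 receives links from both linking pages and ends up with the highest authority weight, while page 0, which links to both targets, ends up with the highest hub weight. A fixed number of passes is used here for simplicity; a convergence check on the change between passes would be a natural alternative.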

