Text Web Mining.ppt
Slide 1: Text & Web Mining (10/6/2018)

Slide 2: Structured Data
So far we have focused on mining from structured data:
  Attribute    Value
  Outlook      Sunny
  Temperature  Hot
  Windy        Yes
  Humidity     High
  Play         Yes
Most data mining involves such data.

Slide 3: Complex Data Types
Increased importance of complex data:
- Spatial data: includes geographic data and medical & satellite images
- Multimedia data: images, audio, & video
- Time-series data: for example, banking data and stock exchange data
- Text data: word descriptions for objects
- World-Wide-Web: highly unstructured text and multimedia data

Slide 4: Text Databases
Many text databases exist in practice:
- News articles
- Research papers
- Books
- Digital libraries
- E-mail messages
- Web pages
Growing rapidly in size and importance.

Slide 5: Semi-Structured Data
Text databases are often semi-structured.
Example: Title, Author, Publication_Date, Length, Category, Abstract, Content

Slide 6: Handling Text Data
- Modeling semi-structured data
- Information Retrieval (IR) from unstructured documents
- Text mining
  - Compare documents
  - Rank importance & relevance
  - Find patterns or trends across documents

Slide 7: Information Retrieval
IR locates relevant documents:
- Key words
- Similar documents
IR systems:
- On-line library catalogs
- On-line document management systems

Slide 8: Performance Measure
Two basic measures
[Figure: Venn diagram over all documents, showing the retrieved documents, the relevant documents, and their overlap (relevant & retrieved)]

Slide 9: Retrieval Methods
Keyword-based IR
- E.g., "data and mining"
- Synonymy problem: a document may talk about "knowledge discovery" instead
- Polysemy problem: "mining" can mean different things
Similarity-based IR
- Set of common keywords
- Return the degree of relevance
- Problem: what is the similarity of "data mining" and "data analysis"?
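
The Performance Measure slide does not name its two measures in the recovered text; the diagram labels (retrieved, relevant, relevant & retrieved) match the standard pair, precision and recall, so the sketch below assumes those are meant. The function name and the document ids are hypothetical and serve only as illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids.

    precision = |relevant & retrieved| / |retrieved|
    recall    = |relevant & retrieved| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # the "relevant & retrieved" overlap
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: 4 documents retrieved, 5 relevant overall, 3 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 8, 9})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

A system that returns every document reaches full recall at poor precision, which is why the two measures are usually reported together.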
Slide 10: Modeling a Document
- Set of n documents and m terms
- Each document is a vector v in $\mathbb{R}^m$
- The j-th coordinate of v measures the association of the j-th term with the document
[Formula not recovered]
Here r is the number of occurrences of the j-th term and R is the number of occurrences of any term.

Slide 11: Frequency Matrix
[Table not recovered]

Slide 12: Similarity Measures
Cosine measure:
$$\mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$$
where $v_1 \cdot v_2$ is the dot product and $\|v_1\|$, $\|v_2\|$ are the norms of the vectors.

Slide 13: Example
- Google search for "association mining"
- Two of the documents retrieved:
  - Idaho Mining Association: mining in Idaho (doc 1)
  - Scalable Algorithms for Association mining (doc 2)
- Using only the two terms
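
The document model and the cosine measure can be tried on the two search results from the example. This is a minimal sketch under two assumptions that are not in the slides: each coordinate is taken to be the relative frequency r/R (the slide's own formula was not recovered), and the counts come from the result titles only, not the full documents. The helper names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text, terms):
    """Map a text to a vector with one coordinate per term.

    Assumption: coordinate j is r / R, where r counts occurrences of term j
    and R counts all word occurrences in the text.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    return [counts[t] / total if total else 0.0 for t in terms]


def cosine(v1, v2):
    """Cosine measure: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # a document containing none of the terms matches nothing
    return dot / (norm1 * norm2)


terms = ["association", "mining"]
doc1 = "Idaho Mining Association: mining in Idaho"
doc2 = "Scalable Algorithms for Association mining"

v1 = term_vector(doc1, terms)  # association: 1/6, mining: 2/6
v2 = term_vector(doc2, terms)  # association: 1/5, mining: 1/5
print(round(cosine(v1, v2), 3))  # 0.949
```

With only these two terms the unrelated documents look nearly identical, which illustrates why a richer term set is needed to tell them apart.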
Slide 14: New Model
Add the term "data" to the document model

Slide 15: Frequency Matrix
- Will quickly become large
- Singular value decomposition can be used to reduce it

Slide 16: Association Analysis
- Collect sets of keywords frequently used together and find associations among them
- Apply any association rule algorithm to a database in the format (document_id, a_set_of_keywords)

Slide 17: Document Classification
- Need already classified documents as training set
- Induce a classification model
Any difference from before?
- A set of keywords associated with a document has no fixed set of attributes or dimensions

Slide 18: Association-Based Classification
- Classify documents based on associated, frequently occurring text patterns
- Extract keywords and terms with IR and simple association analysis
- Create a concept hierarchy of terms
- Classify training documents into class hierarchies
- Use association …
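
The database format from the Association Analysis slide, (document_id, a_set_of_keywords), can be illustrated with a small sketch. It is a brute-force count of keyword pairs that occur together, which is only the frequent-itemset step for pairs and not a full association rule algorithm such as Apriori. The function name, the documents, and the support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_keyword_pairs(docs, min_support):
    """Count keyword pairs that co-occur in at least `min_support` documents.

    docs maps a document_id to its set of keywords.
    """
    pair_counts = Counter()
    for keywords in docs.values():
        # Sorting gives each unordered pair one canonical tuple form.
        for pair in combinations(sorted(keywords), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical keyword database, one keyword set per document id.
docs = {
    "d1": {"data", "mining", "association"},
    "d2": {"data", "mining", "text"},
    "d3": {"idaho", "mining", "association"},
}

pairs = frequent_keyword_pairs(docs, min_support=2)
print(pairs)  # {('association', 'mining'): 2, ('data', 'mining'): 2}
```

Because each document carries a keyword set of arbitrary size, the input has no fixed attribute list, which is the difference from structured data that the Document Classification slide points out.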
