Introduction to Text Categorization.ppt
Introduction to Text Categorization
Lecturer: Paul Bennett
20-760: Web-Based Information Architectures
July 23, 2002

Lecture Overview
- Motivation for Automatic Text Categorization
- A Text Categorization Example
- Approaches to Automatic Text Categorization (emphasizing the k-Nearest Neighbor approach)
- Evaluation Methodology and Performance Metrics
- Summary

Motivation for Automatic Text Categorization

Shopping on the Web
- Suppose you want to buy a cappuccino maker as a gift:
  - try Google for "cappuccino maker"
  - try Yahoo! Shopping for "cappuccino maker"

Observations
- Broad indexing and speedy search alone are not enough.
- An organizational view of the data is critical for effective retrieval.
- Categorized data are easy for users to browse.
- Category taxonomies have become central to well-known web sites (Yahoo!, Lycos, ...).

Text Categorization Applications
- Web pages organized into category hierarchies
- Journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE, etc.)
- Responses to Census Bureau occupation questions
- Patents archived using the International Patent Classification
- Patient records coded using international insurance categories
- E-mail message filtering
- News events tracked and filtered by topic

Cost of Manual Text Categorization
- Yahoo!: 200 (?) people for manual labeling of Web pages using a hierarchy of 500,000 categories
- MEDLINE (National Library of Medicine): $2 million/year for manual indexing of journal articles using Medical Subject Headings (18,000 categories)
- Mayo Clinic:
  $1.4 million annually for coding patient-record events using the International Classification of Diseases (ICD) for billing insurance companies
- US Census Bureau decennial census (1990: 22 million responses): 232 industry categories and 504 occupation categories; $15 million if fully done by hand

What Does It Take to Compete?
- Suppose you were starting a web search company. What would it take to compete with established engines?
- You need to be able to establish a competing hierarchy fast.
- You will need a relatively cheap solution. (Unless you have investors who want to pay millions of dollars just to get off the ground.)

Why Not a Semi-Automatic Text Categorization Tool?
- Humans can encode knowledge of what constitutes membership in a category.
- This encoding can then be automatically applied by a machine to categorize new examples.
- For example...

Rule-Based Approach to TC
- Text in a Web page:
  "Saeco revolutionized espresso brewing a decade ago by introducing Saeco SuperAutomatic machines, which go from bean to coffee at the touch of a button. The all-new Saeco Vienna Super-Automatic home coffee and cappuccino machine combines top quality with low price!"
- Rules:
  Rule 1. (espresso or coffee or cappuccino) and machine* -> Coffee Maker
  Rule 2. automat* and answering and machine* -> Phone
  Rule ...

Expert System for TC (late 1980s)

Defining Rules by Hand
- Experience has shown this is:
  - too time consuming
  - too difficult
  - prone to inconsistency issues (as the rule set gets large)

Replace Knowledge Engineering with a Statistical Learner
- Knowledge Engineering vs. Statistical Learning
- For the US Census Bureau decennial census (1990): 232 industry categories and 504 occupation categories; $15 million if fully done by hand
- Define classification rules manually: Expert System AIOCS
  - Development time: 192 person-months (2 people, 8 years)
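Hand-written rules like Rule 1 and Rule 2 above can be applied mechanically. A minimal sketch, assuming nothing beyond the two rules quoted on the slide: the prefix wildcard syntax (machine*) is translated into regular expressions, and the function and rule-table names are illustrative, not part of any real system.

```python
import re

# Hand-written categorization rules, after Rule 1 and Rule 2 on the slide.
# Each rule pairs a list of required patterns with the category it assigns.
# A trailing '\w*' plays the role of the slide's prefix wildcard (machine*).
RULES = [
    ([r"espresso|coffee|cappuccino", r"machine\w*"], "Coffee Maker"),
    ([r"automat\w*", r"answering", r"machine\w*"], "Phone"),
]

def classify(text):
    """Return the category of the first rule whose patterns all match."""
    text = text.lower()
    for patterns, category in RULES:
        if all(re.search(r"\b(?:%s)\b" % p, text) for p in patterns):
            return category
    return None  # no rule fired

print(classify("The all-new Saeco Vienna Super-Automatic home coffee "
               "and cappuccino machine combines top quality with low price!"))
# -> Coffee Maker
```

Note how brittle this is: every new product description that phrases things differently needs a new rule, which is exactly the maintenance burden that motivates replacing knowledge engineering with a statistical learner.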
  - Accuracy = 47%
- Learn the classification function: Nearest Neighbor classification (Creecy et al. 1992: 1-NN)
  - Development time: 4 person-months (Thinking Machines)
  - Accuracy = 60%

A Text Categorization Example

Predicting Topics of News Stories
- Given: a collection of example news stories already labeled with a category (topic).
- Task: predict the category for news stories not yet labeled.
- For our example, we'll only get to see the headline of the news story.
- We'll represent categories using colors. (All examples with the same color belong to the same category.)

Our Labeled Examples

What to Predict Before Seeing the Document?
?

Predict with Evidence
"Senate Panel Studies Loan Rate, Set Aside Plans"

The Actual Topic
"Senate Panel Studies Loan Rate, Set Aside Plans"

A Few Practical Details...
- Handling documents with multiple classes

Representing Documents
- Usually, an example is represented as a series of feature-value pairs.
- The features can be arbitrarily abstract (as long as they are easily computable) or very simple.
- For example, the features could be the set of all words, and the values their number of occurrences in a particular document.

Representation
- Headline: "Japan Firm Plans to Sell U.S. Farmland to Japanese"
- Feature vector: Farmland:1 Firm:1 Japan:1 Japanese:1 Plans:1 Sell:1 To:2 U.S.:1

Approaches to Text Categorization (emphasizing the k-Nearest Neighbor approach)

1-Nearest Neighbor
- Looking back at our example: did anyone try to find the most similar labeled item and then just guess the same color?
- This is 1-Nearest Neighbor.

Key Components of Nearest Neighbor
- "Similar" item: we need a functional definition of "similarity" if we want to apply this automatically.
- How many neighbors do we consider?
- Does each neighbor get the same weight?
- All categories in the neighborhood, or the most frequent only?
- How do we make the final decision?

Nearest Neighbor Classification
- Also known as Instance-Based Learning or Lazy Learning
- A well-known approach to pattern recognition
  - introduced by Fix and Hodges (1951)
  - theoretical error-bound analysis by Cover & Hart (1967)
  - applied to text categorization in the early 1990s
- A strong baseline in benchmark evaluations; among the top-performing methods in TC evaluations
- Scalable to large TC applications

1-Nearest Neighbor (graphically)

K-Nearest Neighbor Using a Majority Voting Scheme

K-NN Using a Weighted-Sum Voting Scheme

Category Scoring for Weighted-Sum
- The score for a category is the sum of the similarity scores between
  the point to be classified and all of its k nearest neighbors that belong to the given category.
- To restate:

    score(x, c) = sum over d in kNN(x) of sim(x, d) * I(d, c)

  where x is the new point; c is a class (e.g., black or white); d is a classified point among the k nearest neighbors of x; sim(x, d) is the similarity between x and d; and I(d, c) = 1 iff point d belongs to class c.
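The word-count representation and the weighted-sum scoring described above can be sketched end to end. A minimal sketch: the headlines and the black/white color labels echo the slides, but the tiny labeled set, the choice of cosine similarity over raw term counts, and all function names are illustrative assumptions, not the lecture's exact setup.

```python
import math
from collections import Counter

def bag_of_words(text):
    # Feature-value pairs: each word paired with its occurrence count.
    return Counter(text.lower().replace(",", "").split())

def sim(a, b):
    # Cosine similarity between two sparse count vectors (one common choice).
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(x, labeled, k):
    # score(x, c) = sum of sim(x, d) over the k nearest neighbors d in class c
    neighbors = sorted(labeled, key=lambda dc: sim(x, dc[0]), reverse=True)[:k]
    scores = {}
    for d, c in neighbors:
        scores[c] = scores.get(c, 0.0) + sim(x, d)
    return max(scores, key=scores.get)

# Toy labeled set; "black"/"white" stand in for the slides' color categories.
labeled = [
    (bag_of_words("Senate Panel Studies Loan Rate, Set Aside Plans"), "black"),
    (bag_of_words("Japan Firm Plans to Sell U.S. Farmland to Japanese"), "white"),
    (bag_of_words("Loan Rate Rise Studied by Senate Panel"), "black"),
]
print(knn_classify(bag_of_words("Senate Studies Loan Rate Plans"), labeled, k=3))
# -> black
```

With k = 1 this reduces to the 1-Nearest Neighbor guess from earlier in the lecture; with larger k, weighting each vote by sim(x, d) lets close neighbors count more than distant ones, which is the difference between the majority-voting and weighted-sum schemes.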
