A Whirlwind Tour of Search Engine Design Issues.ppt
《A Whirlwind Tour of Search Engine Design Issues.ppt》由会员分享,可在线阅读,更多相关《A Whirlwind Tour of Search Engine Design Issues.ppt(88页珍藏版)》请在麦多课文档分享上搜索。
1、,A Whirlwind Tour of Search Engine Design Issues,CMSC 498W April 4, 2006 Douglas W. Oard,Information Access Problems,Collection,Information Need,Stable,Stable,Different Each Time,Retrieval,Filtering,Different Each Time,Design Strategies,Foster human-machine synergy Exploit complementary strengths Ac
2、commodate shared weaknessesDivide-and-conquer Divide task into stages with well-defined interfaces Continue dividing until problems are easily solvedCo-design related components Iterative process of joint optimization,Human-Machine Synergy,Machines are good at: Doing simple things accurately and qui
3、ckly Scaling to larger collections in sublinear timePeople are better at: Accurately recognizing what they are looking for Evaluating intangibles such as “quality”Both are pretty bad at: Mapping consistently between words and concepts,Supporting the Search Process,Source Selection,Query Reformulatio
4、n and Relevance Feedback,Source Reselection,Choose,Supporting the Search Process,Source Selection,Taylors Model of Question Formation,Q1 Visceral Need,Q2 Conscious Need,Q3 Formalized Need,Q4 Compromised Need(Query),End-user Search,Intermediated Search,Search Goal,Choose the same documents a human wo
5、uld Without human intervention (less work) Faster than a human could (less time) As accurately as possible (less accuracy) Humans start with an information need Machines start with a query Humans match documents to information needs Machines match document & query representations,Search Component Mo
6、del,Comparison Function,Representation Function,Query Formulation,Human Judgment,Representation Function,Retrieval Status Value,Utility,Query,Information Need,Document,Query Representation,Document Representation,Query Processing,Document Processing,Relevance,Relevance relates a topic and a document
7、 Duplicates are equally relevant, by definition Constant over time and across users Pertinence relates a task and a document Accounts for quality, complexity, language, Utility relates a user and a document Accounts for prior knowledge We seek utility, but relevance is what we get!,Problems With Wor
8、d Matching,Word matching suffers from two problems Synonymy: paper vs. article Homonymy: bank (river) vs. bank (financial) Disambiguation in IR: seek to resolve homonymy Index word senses rather than words Synonymy usually addressed by Thesaurus-based query expansion Latent semantic indexing,Stemmin
9、g,Stem: in IR, a word equivalence class that preserves the main concept. Often obtained by affix-stripping (Porter, 1980) destroy, destroyed, destruction: destr Inexpensive to compute Usually helps IR performance Can make mistakes! (over-/understemming) centennial,century,center: cent acquire,acquir
10、ing,acquired: acquiracquisition: acquis,N-gram Indexing,Consider a Chinese document c1 c2 c3 cnDont segment (you could be wrong!)Instead, treat every character bigram as a term _c1 c2 , c2 c3 , c3 c4 , , cn-1 cnBreak up queries the same way,“Bag of Terms” Representation,Bag = a “set” that can contai
11、n duplicates “The quick brown fox jumped over the lazy dogs back” back, brown, dog, fox, jump, lazy, over, quick, the, theVector = values recorded in any consistent order back, brown, dog, fox, jump, lazy, over, quick, the, the 1 1 1 1 1 1 1 1 2,Bag of Terms Example,The quick brown fox jumped over t
12、he lazy dogs back.,Document 1,Document 2,Now is the time for all good men to come to the aid of their party.,the,quick,brown,fox,over,lazy,dog,back,now,is,time,for,all,good,men,to,come,jump,aid,of,their,party,0,0,1,1,0,1,1,0,1,1,0,0,1,0,1,0,0,1,1,0,0,1,0,0,1,0,0,1,1,0,1,0,1,1,Term,Document 1,Documen
13、t 2,StopwordList,Boolean IR,Strong points Accurate, if you know the right strategies Efficient for the computer Weaknesses Often results in too many documents, or none Users must learn Boolean logic Sometimes finds relationships that dont exist Words can have many meanings Choosing the right words i
14、s sometimes hard,Proximity Operators,More precise versions of AND “NEAR n” allows at most n-1 intervening terms “WITH” requires terms to be adjacent and in order Easy to implement, but less efficient Store a list of positions for each word in each doc Stopwords become very important! Perform normal
15、Boolean computations Treat WITH and NEAR like AND with an extra constraint,Proximity Operator Example,time AND come Doc 2 time (NEAR 2) come Empty quick (NEAR 2) fox Doc 1 quick WITH fox Empty,quick,brown,fox,over,lazy,dog,back,now,time,all,good,men,come,jump,aid,their,party,0,1 (9),Term,1 (13),1 (6
16、),1 (7),1 (8),1 (16),1 (1),1 (2),1 (15),1 (4),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 (5),1 (9),1 (3),1 (4),1 (8),1 (6),1 (10),Doc 1,Doc 2,Advantages of Ranked Retrieval,Closer to the way people think Some documents are better than othersEnriches browsing behavior Decide how far down the list to go as you
17、 read itAllows more flexible queries Long and short queries can produce useful results,Ranked Retrieval Challenges,“Best first” is easy to say but hard to do! The best we can hope for is to approximate itWill the user understand the process? It is hard to use a tool that you dont understandEfficienc
18、y becomes a concern Only a problem for long queries, though,Similarity-Based Queries,Treat the query as if it were a document Create a query bag-of-words Find the similarity of each document Using the coordination measure, for example Rank order the documents by similarity Most similar to the query
19、first Surprisingly, this works pretty well! Especially for very short queries,Counting Terms,Terms tell us about documents If “rabbit” appears a lot, it may be about rabbits Documents tell us about terms “the” is in every document - not discriminating Documents are most likely described well by rare
20、 terms that occur in them frequently Higher “term frequency” is stronger evidence Low “collection frequency” makes it stronger still,The Document Length Effect,Humans look for documents with useful parts But probabilities are computed for the wholeDocument lengths vary in many collections So probabi
21、lity calculations could be inconsistentTwo strategies Adjust probability estimates for document length Divide the documents into equal “passages”,Incorporating Term Frequency,High term frequency is evidence of meaning And high IDF is evidence of term importance Recompute the bag-of-words Compute TF
22、* IDF for every element,TF*IDF Example,4,5,6,3,1,3,1,6,5,3,4,3,7,1,nuclear,fallout,siberia,contaminated,interesting,complicated,information,retrieval,2,1,2,3,2,3,2,4,4,0.50,0.63,0.90,0.13,0.60,0.75,1.51,0.38,0.50,2.11,0.13,1.20,1,2,3,0.60,0.38,0.50,4,0.301,0.125,0.125,0.125,0.602,0.301,0.000,0.602,U
23、nweighted query:contaminated retrieval Result: 2, 3, 1, 4,Weighted query:contaminated(3) retrieval(1) Result: 1, 3, 2, 4,IDF-weighted query:contaminated retrieval Result: 2, 3, 1, 4,Document Length Normalization,Long documents have an unfair advantage They use a lot of terms So they get more matches
24、 than short documents And they use the same words repeatedly So they have much higher term frequencies Normalization seeks to remove these effects Related somehow to maximum term frequency But also sensitive to the of number of terms,“Okapi” Term Weights,TF component,IDF component,Passage Retrieval,
- 1.请仔细阅读文档,确保文档完整性,对于不预览、不比对内容而直接下载带来的问题本站不予受理。
- 2.下载的文档,不会出现我们的网址水印。
- 3、该文档所得收入(下载+内容+预览)归上传者、原创作者;如果您是本文档原作者,请点此认领!既往收益都归您。
下载文档到电脑,查找使用更方便
2000 积分 0人已下载
下载 | 加入VIP,交流精品资源 |
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- AWHIRLWINDTOUROFSEARCHENGINEDESIGNISSUESPPT
