    A Whirlwind Tour of Search Engine Design Issues.ppt

Resource ID: 377910 | File size: 586 KB | Pages: 88 | Format: PPT


A Whirlwind Tour of Search Engine Design Issues
CMSC 498W, April 4, 2006
Douglas W. Oard

Information Access Problems
- Retrieval: the collection is stable; the information need is different each time
- Filtering: the information need is stable; the collection is different each time

Design Strategies
- Foster human-machine synergy
  - Exploit complementary strengths
  - Accommodate shared weaknesses
- Divide-and-conquer
  - Divide the task into stages with well-defined interfaces
  - Continue dividing until problems are easily solved
- Co-design related components
  - Iterative process of joint optimization

Human-Machine Synergy
- Machines are good at:
  - Doing simple things accurately and quickly
  - Scaling to larger collections in sublinear time
- People are better at:
  - Accurately recognizing what they are looking for
  - Evaluating intangibles such as "quality"
- Both are pretty bad at:
  - Mapping consistently between words and concepts

Supporting the Search Process
[Process diagram; only fragments survived extraction: Source Selection, a loop for Query Reformulation and Relevance Feedback, and a loop for Source Reselection]

Taylor's Model of Question Formation
- Q1: Visceral need
- Q2: Conscious need
- Q3: Formalized need
- Q4: Compromised need (the query)
- The diagram contrasts end-user search with intermediated search

Search Goal
- Choose the same documents a human would
  - Without human intervention (less work)
  - Faster than a human could (less time)
  - As accurately as possible (less accuracy)
- Humans start with an information need; machines start with a query
- Humans match documents to information needs; machines match document and query representations

Search Component Model
[Diagram: an information need passes through query formulation to a query, then through query processing to a query representation; a document passes through document processing to a document representation; a comparison function on the two representations yields a retrieval status value, and human judgment of utility closes the loop]

Relevance
- Relevance relates a topic and a document
  - Duplicates are equally relevant, by definition
  - Constant over time and across users
- Pertinence relates a task and a document
  - Accounts for quality, complexity, language, ...
- Utility relates a user and a document
  - Accounts for prior knowledge
- We seek utility, but relevance is what we get!

Problems With Word Matching
- Word matching suffers from two problems
  - Synonymy: paper vs. article
  - Homonymy: bank (river) vs. bank (financial)
- Disambiguation in IR: seek to resolve homonymy
  - Index word senses rather than words
- Synonymy usually addressed by
  - Thesaurus-based query expansion
  - Latent semantic indexing

Stemming
- Stem: in IR, a word equivalence class that preserves the main concept
- Often obtained by affix-stripping (Porter, 1980)
  - destroy, destroyed, destruction → destr
- Inexpensive to compute
- Usually helps IR performance
- Can make mistakes! (over-/understemming)
  - centennial, century, center → cent
  - acquire, acquiring, acquired → acquir; acquisition → acquis

N-gram Indexing
- Consider a Chinese document c1 c2 c3 ... cn
- Don't segment (you could be wrong!)
- Instead, treat every character bigram as a term
  - c1c2, c2c3, c3c4, ..., c(n-1)cn
- Break up queries the same way

"Bag of Terms" Representation
- Bag = a "set" that can contain duplicates
  - "The quick brown fox jumped over the lazy dog's back" → back, brown, dog, fox, jump, lazy, over, quick, the, the
- Vector = values recorded in any consistent order
  - (back, brown, dog, fox, jump, lazy, over, quick, the) → (1, 1, 1, 1, 1, 1, 1, 1, 2)

Bag of Terms Example
- Document 1: "The quick brown fox jumped over the lazy dog's back."
- Document 2: "Now is the time for all good men to come to the aid of their party."
- Term-document incidence (stopword list: the, is, for, to, of):

  Term    Doc 1   Doc 2
  quick     1       0
  brown     1       0
  fox       1       0
  jump      1       0
  over      1       0
  lazy      1       0
  dog       1       0
  back      1       0
  now       0       1
  time      0       1
  all       0       1
  good      0       1
  men       0       1
  come      0       1
  aid       0       1
  their     0       1
  party     0       1

Boolean IR
- Strong points
  - Accurate, if you know the right strategies
  - Efficient for the computer
- Weaknesses
  - Often results in too many documents, or none
  - Users must learn Boolean logic
  - Sometimes finds relationships that don't exist
  - Words can have many meanings
  - Choosing the right words is sometimes hard

Proximity Operators
- More precise versions of AND
  - "NEAR n" allows at most n-1 intervening terms
  - "WITH" requires terms to be adjacent and in order
- Easy to implement, but less efficient
  - Store a list of positions for each word in each doc
  - Stopwords become very important!
- Perform normal Boolean computations
  - Treat WITH and NEAR like AND with an extra constraint
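The operators just described can be sketched over a positional index (term → doc → positions); the function and variable names here are illustrative, not from the slides.

```python
# Sketch of proximity operators over a positional index mapping
# term -> {doc_id: [positions]}. "NEAR n" allows at most n-1
# intervening terms; "WITH" requires adjacency in order.

def build_positional_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def near(index, a, b, n):
    """Docs where a and b occur within n positions of each other."""
    hits = set()
    for doc_id in index.get(a, {}).keys() & index.get(b, {}).keys():
        for pa in index[a][doc_id]:
            for pb in index[b][doc_id]:
                if abs(pa - pb) <= n:
                    hits.add(doc_id)
    return hits

def with_op(index, a, b):
    """Docs where b immediately follows a (adjacent, in order)."""
    hits = set()
    for doc_id in index.get(a, {}).keys() & index.get(b, {}).keys():
        if any(pb - pa == 1 for pa in index[a][doc_id] for pb in index[b][doc_id]):
            hits.add(doc_id)
    return hits

docs = {
    1: "the quick brown fox jumped over the lazy dogs back",
    2: "now is the time for all good men to come to the aid of their party",
}
idx = build_positional_index(docs)
print(near(idx, "quick", "fox", 2))   # {1}: one intervening term
print(with_op(idx, "quick", "fox"))   # set(): not adjacent
```

Note how the stopword "the" still occupies positions; removing it would silently change which terms count as "intervening", which is why stopwords matter for proximity.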

Proximity Operator Example
- time AND come → Doc 2
- time (NEAR 2) come → empty
- quick (NEAR 2) fox → Doc 1
- quick WITH fox → empty
- Positional postings, shown as count (position):

  Term    Doc 1    Doc 2
  quick   1 (2)    0
  brown   1 (3)    0
  fox     1 (4)    0
  jump    1 (5)    0
  over    1 (6)    0
  lazy    1 (8)    0
  dog     1 (9)    0
  back    1 (10)   0
  now     0        1 (1)
  time    0        1 (4)
  all     0        1 (6)
  good    0        1 (7)
  men     0        1 (8)
  come    0        1 (10)
  aid     0        1 (13)
  their   0        1 (15)
  party   0        1 (16)

Advantages of Ranked Retrieval
- Closer to the way people think
  - Some documents are better than others
- Enriches browsing behavior
  - Decide how far down the list to go as you read it

- Allows more flexible queries
  - Long and short queries can produce useful results

Ranked Retrieval Challenges
- "Best first" is easy to say but hard to do!
  - The best we can hope for is to approximate it
- Will the user understand the process?
  - It is hard to use a tool that you don't understand
- Efficiency becomes a concern
  - Only a problem for long queries, though

Similarity-Based Queries
- Treat the query as if it were a document
  - Create a query bag-of-words
- Find the similarity of each document
  - Using the coordination measure, for example
- Rank order the documents by similarity
  - Most similar to the query first
- Surprisingly, this works pretty well!
  - Especially for very short queries
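Similarity-based ranking with the coordination measure (the number of distinct query terms a document contains) can be sketched as follows; the names are illustrative, not from the slides.

```python
# Sketch of similarity-based retrieval using the coordination
# measure: score = number of distinct query terms in the document.

def coordination(query_terms, doc_text):
    return len(set(query_terms) & set(doc_text.lower().split()))

def rank(query, docs):
    """Return doc ids ordered by decreasing coordination score."""
    q = query.lower().split()
    return sorted(docs, key=lambda d: coordination(q, docs[d]), reverse=True)

docs = {
    1: "the quick brown fox jumped over the lazy dogs back",
    2: "now is the time for all good men to come to the aid of their party",
}
print(rank("quick fox time", docs))  # [1, 2]: doc 1 matches two query terms
```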

Counting Terms
- Terms tell us about documents
  - If "rabbit" appears a lot, it may be about rabbits
- Documents tell us about terms
  - "the" is in every document, so it is not discriminating
- Documents are most likely described well by rare terms that occur in them frequently

  - Higher "term frequency" is stronger evidence
  - Low "collection frequency" makes it stronger still

The Document Length Effect
- Humans look for documents with useful parts
  - But probabilities are computed for the whole
- Document lengths vary in many collections
  - So probability calculations could be inconsistent
- Two strategies
  - Adjust probability estimates for document length
  - Divide the documents into equal "passages"

Incorporating Term Frequency
- High term frequency is evidence of meaning
- And high IDF is evidence of term importance
- Recompute the bag-of-words
  - Compute TF * IDF for every element
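A minimal TF*IDF computation over a toy collection, using the common textbook form IDF = log10(N/df); the exact weighting on the original slide may differ, and the toy documents reuse the slide's vocabulary for illustration only.

```python
# Sketch of TF*IDF weighting: weight(term, doc) = tf * log10(N / df).
import math

docs = {
    1: "nuclear fallout contaminated siberia",
    2: "information retrieval information",
}

def tf_idf(docs):
    n = len(docs)
    bags = {d: text.split() for d, text in docs.items()}
    df = {}  # document frequency: in how many docs each term appears
    for terms in bags.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    return {
        d: {t: terms.count(t) * math.log10(n / df[t]) for t in set(terms)}
        for d, terms in bags.items()
    }

weights = tf_idf(docs)
print(round(weights[2]["information"], 3))  # tf=2, df=1 -> 2*log10(2) = 0.602
```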

    22、* IDF for every element,TF*IDF Example,4,5,6,3,1,3,1,6,5,3,4,3,7,1,nuclear,fallout,siberia,contaminated,interesting,complicated,information,retrieval,2,1,2,3,2,3,2,4,4,0.50,0.63,0.90,0.13,0.60,0.75,1.51,0.38,0.50,2.11,0.13,1.20,1,2,3,0.60,0.38,0.50,4,0.301,0.125,0.125,0.125,0.602,0.301,0.000,0.602,U

    23、nweighted query:contaminated retrieval Result: 2, 3, 1, 4,Weighted query:contaminated(3) retrieval(1) Result: 1, 3, 2, 4,IDF-weighted query:contaminated retrieval Result: 2, 3, 1, 4,Document Length Normalization,Long documents have an unfair advantage They use a lot of terms So they get more matches

  - And they use the same words repeatedly, so they have much higher term frequencies
- Normalization seeks to remove these effects
  - Related somehow to maximum term frequency
  - But also sensitive to the number of terms

"Okapi" Term Weights
[Equation residue: a weight with a labeled TF component and IDF component; the formula itself did not survive extraction]
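The equation on the "Okapi" Term Weights slide survived only as the labels "TF component" and "IDF component"; a standard Okapi BM25 weight consistent with those labels is the following textbook reconstruction (the slide's exact parameterization is unknown):

```latex
w_{i,j} =
  \underbrace{\frac{tf_{i,j}}
      {k_1\bigl((1-b) + b\,\tfrac{dl_j}{avdl}\bigr) + tf_{i,j}}}_{\text{TF component}}
  \;\times\;
  \underbrace{\log \frac{N - df_i + 0.5}{df_i + 0.5}}_{\text{IDF component}}
```

Here tf_{i,j} is the frequency of term i in document j, df_i its document frequency, N the collection size, dl_j the document length, avdl the average document length, and k_1, b are tuning constants. Note how the TF component saturates as tf grows and penalizes longer-than-average documents, matching the two preceding slides.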

Passage Retrieval
- Another approach to the long-document problem
  - Break it up into coherent units
- Recognizing topic boundaries is hard
  - But overlapping 300-word passages work fine
- Document rank is best passage rank
  - And passage information can help guide browsing

Summary
- Goal: find documents most similar to the query
- Compute normalized document term weights
  - Some combination of TF, DF, and length
- Optionally, get query term weights from the user
  - Estimate of term importance
- Compute inner product of query and doc vectors
  - Multiply corresponding elements and then add

Another Perspective
- We ask "is this document relevant?"
  - Vector space: we answer "somewhat"

  - Probabilistic: we answer "probably"

Probability Ranking Principle
- Assume binary relevance / document independence
  - Each document is either relevant or it is not
  - Relevance of one doc reveals nothing about another
- Assume the searcher works down a ranked list
  - Seeking some number of relevant documents

- Theorem (provable from the assumptions): documents should be ranked in order of decreasing probability of relevance to the query, P(d relevant-to q)

Some Questions for Indexing
- How long will it take to find a document?
  - Is there any work we can do in advance? If so, how long will that take?
- How big a computer will I need?
  - How much disk space? How much RAM?
- What if more documents arrive?
  - How much of the advance work must be repeated?
  - Will searching become slower?
  - How much more disk space will be needed?

The Indexing Process
[Figure: a term-by-document incidence matrix for eight documents is inverted into an alphabetized inverted file, where each term's entry points to a postings list of the documents that contain it]

The Finished Product
[Figure: the same inverted file and postings lists, with the incidence matrix discarded; the inverted file plus postings are the finished index]
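The inversion step pictured above can be sketched in a few lines; the names are illustrative, not from the slides.

```python
# Sketch of building an inverted index: map each term to a sorted
# postings list of the ids of documents that contain it.

def build_inverted_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(postings) for term, postings in index.items()}

docs = {
    1: "the quick brown fox",
    2: "now is the time",
    3: "the quick red fox",
}
index = build_inverted_index(docs)
print(index["quick"])  # [1, 3]
print(index["the"])    # [1, 2, 3]
```

Boolean AND is then just an intersection of postings lists, which is why the inverted file makes query processing fast.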

How Big Is the Postings File?
- Very compact for Boolean retrieval
  - About 10% of the size of the documents, if an aggressive stopword list is used!
- Not much larger for ranked retrieval
  - Perhaps 20%
- Enormous for proximity operators
  - Sometimes larger than the documents!

Building an Inverted Index
- Simplest solution is a single sorted array
  - Fast lookup using binary search
  - But sorting large files on disk is very slow
  - And adding one document means starting over
- Tree structures allow easy insertion
  - But the worst-case lookup time is linear
- Balanced trees provide the best of both
  - Fast lookup and easy insertion
  - But they require 45% more disk space
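The sorted-array option above can be sketched with the standard-library bisect module; the term list is a toy example.

```python
# Sketch of a sorted-array term dictionary: terms kept in sorted
# order give O(log n) lookup by binary search, but inserting a new
# term shifts the array, which is why adding documents is costly.
import bisect

terms = sorted(["aid", "back", "brown", "fox", "quick", "time"])

def lookup(term):
    """Binary search; True if the term is in the dictionary."""
    i = bisect.bisect_left(terms, term)
    return i < len(terms) and terms[i] == term

print(lookup("fox"))   # True
print(lookup("lazy"))  # False
bisect.insort(terms, "lazy")  # insertion shifts elements: O(n)
print(lookup("lazy"))  # True
```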

How Big Is the Inverted Index?
- Typically smaller than the postings file
  - Depends on the number of terms, not documents
- Eventually, most terms will already be indexed
  - But the postings file will continue to grow
- Postings dominate asymptotic space complexity
  - Linear in the number of documents

Index Compression
- CPUs are much faster than disks
  - A disk can transfer 1,000 bytes in 20 ms
  - The CPU can do 10 million instructions in that time
- Compressing the postings file is a big win
  - Trade decompression time for fewer disk reads
- Key idea: reduce redundancy
  - Trick 1: store relative offsets (some will be the same)
  - Trick 2: use an optimal coding scheme

Summary
- Slow indexing yields fast query processing
  - Key fact: most terms don't appear in most documents
- We use extra disk space to save query time
  - Index space is in addition to document space
  - Time and space complexity must be balanced
- Disk block reads are the critical resource
  - This makes index compression a big win

Problems with "Free Text" Search
- Homonymy
  - Terms may have many unrelated meanings
  - Polysemy (related meanings) is less of a problem
- Synonymy
  - Many ways of saying (nearly) the same thing
- Anaphora
  - Alternate ways of referring to the same thing
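The two tricks on the Index Compression slide above (relative offsets plus a coding scheme) can be sketched as gap encoding with variable-byte codes; this is one common scheme, not necessarily the one the slide had in mind.

```python
# Sketch of postings compression: store gaps between doc ids
# (trick 1), then encode each gap with a variable-byte code
# (trick 2): 7 data bits per byte, high bit set on the last byte.

def to_gaps(postings):
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def vbyte_encode(n):
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # mark the final byte of this number
    return bytes(out)

postings = [4, 10, 11, 300]
gaps = to_gaps(postings)  # [4, 6, 1, 289]
encoded = b"".join(vbyte_encode(g) for g in gaps)
print(len(encoded))  # 5 bytes: the three small gaps take one byte each
```

Small gaps dominate for frequent terms, so most entries fit in a single byte; that is the redundancy being squeezed out.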

Two Ways of Searching
[Diagram fragment: the author writes the document using terms to convey meaning; the rest of the diagram did not survive extraction]

Applications
- When implied concepts must be captured
  - Political action, volunteerism, ...
- When terminology selection is impractical
  - Searching foreign-language materials
- When no words are present
  - Photos without captions, videos without transcripts, ...
- When user needs are easily anticipated
  - Weather reports, yellow pages, ...

Supervised Learning
- Example: kNN classifier
  [The accompanying figure did not survive extraction]

Challenges
- Changing concept inventories
  - Literary warrant and user needs are hard to predict
- Accurate concept indexing is expensive
  - Machines are inaccurate, humans are inconsistent
- Users and indexers may think differently
  - Diverse user populations add to the complexity
- Using thesauri effectively requires training
  - Meta-knowledge and thesaurus-specific expertise
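The kNN classifier named on the Supervised Learning slide can be sketched as: label a new item by majority vote among its k nearest labeled examples. The distance measure, data, and labels below are all illustrative.

```python
# Sketch of a kNN classifier: label a point by majority vote among
# its k nearest labeled examples (Euclidean distance, toy data).
import math
from collections import Counter

def knn_classify(point, examples, k=3):
    """examples: list of ((x, y), label) pairs."""
    nearest = sorted(examples, key=lambda e: math.dist(point, e[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

examples = [
    ((0.0, 0.0), "sports"), ((0.1, 0.2), "sports"), ((0.3, 0.1), "sports"),
    ((5.0, 5.0), "finance"), ((5.2, 4.9), "finance"), ((4.8, 5.1), "finance"),
]
print(knn_classify((0.2, 0.1), examples))  # sports
print(knn_classify((5.1, 5.0), examples))  # finance
```

For text classification the points would be document term vectors rather than 2-D coordinates, but the voting logic is the same.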

Relevance Feedback
[Diagram: a feedback loop: make a profile vector from initial profile terms, make document vectors from new documents, compute similarity to rank-order the documents, let the user select, examine, and assign ratings, and use the ratings to update the user model]

Rocchio Formula
[Worked example, reconstructed from residue: original profile (0, 4, 0, 8, 0, 0), plus positive feedback (1, 2, 4, 0, 0, 1), minus negative feedback (2, 0, 1, 1, 0, 4), gives new profile (-1, 6, 3, 7, 0, -3); further numbers on the slide did not survive extraction]

Implicit Feedback
- Observe user behavior to infer a set of ratings
  - Examine (reading time, scrolling behavior, ...)
  - Retain (bookmark, save, save & annotate, print, ...)
  - Refer to (reply, forward, include link, cut & paste, ...)
- Some measurements are directly useful
  - e.g., use reading time to predict reading time
- Others require some inference
  - Should you treat cut & paste as an endorsement?

Some Observable Behaviors
[Table: observable behaviors organized by behavior category and minimum scope, with examples: read/ignored, saved/deleted, replied to (Stevens, 1993); reading time (Morita; Konstan et al., 1997); hypertext link (Brin & Page, 1998)]

Estimating Authority from Links
[Diagram fragment: hub and authority pages linked by hypertext; only the labels survived extraction]

Collecting Click Streams
- Browsing histories are easily captured
  - Make all links initially point to a central site
  - Encode the desired URL as a parameter
  - Build a time-annotated transition graph for each user
  - Cookies identify users (when they use the same machine)
  - Redirect the browser to the desired page
- Reading time is correlated with interest
  - Can be used to build individual profiles
  - Used to target advertising by ...

[Bar chart: reading time in seconds (scale 0 to 180) by rating (no, low, moderate, high interest) for full-text articles (telecommunications); surviving data labels: 50, 32, 58, 43]

Critical Issues
- Protecting privacy
  - What absolute assurances can we provide?
  - How can we make remaining risks understood?
- Scalable rating servers
  - Is a fully distributed architecture practical?
- Non-cooperative users
  - How can the effect of spamming be limited?

Gaining Access to Observations
- Observe public behavior
  - Hypertext linking, publication, citing, ...
- Policy protection
  - EU: privacy laws
  - US: privacy policies + FTC enforcement
- Architectural assurance of privacy
  - Distributed architecture
  - Model and mitigate privacy risks

Adversarial IR
- Search is user-controlled suppression
  - Everything is known to the search system
  - Goal: avoid showing things the user doesn't want
- Other stakeholders have different goals
  - Authors risk little by wasting your time
  - Marketers hope for serendipitous interest
- Metadata from trusted sources is more reliable
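The equation on the Rocchio Formula slide earlier survived only as a worked numeric example; the standard textbook form of the Rocchio update, given here as a reconstruction rather than a transcription of the slide, is:

```latex
\vec{q}\,' \;=\; \alpha\,\vec{q}
\;+\; \beta\,\frac{1}{|D_r|}\sum_{\vec{d}\in D_r}\vec{d}
\;-\; \gamma\,\frac{1}{|D_n|}\sum_{\vec{d}\in D_n}\vec{d}
```

Here q is the current profile vector, D_r the positively rated documents, D_n the negatively rated ones, and alpha, beta, gamma are weights; the slide's surviving numbers are consistent with simply adding the positive feedback vector and subtracting the negative one.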





