The Search Engine Architecture.ppt
《The Search Engine Architecture.ppt》由会员分享,可在线阅读,更多相关《The Search Engine Architecture.ppt(27页珍藏版)》请在麦多课文档分享上搜索。
1、The Search Engine Architecture,CSCI 572: Information Retrieval and Search Engines Summer 2010,Outline,Introduction Google The PageRank algorithm The Google Architecture Architectural components Architectural interconnections Architectural data structures Evaluation of Google Summary,Problems with se
2、arch engines circa the last decade,Human maintenance Subjective Example: Ranking hits based on $ Automated search engines Quality of result Neglect to take users context into account Searching process High quality results arent always at the top of the list,The Typical Search Engine Process,In what
3、stages is the most time spent?,How to scale to modern times?,Currently Efficient index Petabyte scale storage space Efficient Crawling Cost effectiveness of hardware Future Qualitative context Maintaining localization data Perhaps send indexing to clients Client computers help gather Googles index i
4、n a distributed, decentralized fashion?,Google,The whole idea is to keep up with the growth of the web Design Goals: -Remove Junk Results -Scalable document indicesUse of link structure to improve quality filtering Use as an academic digital library Provide search engine datasets Search engine infra
5、structure and evolution,Google,Archival of information Use of compression Efficient data structures Proprietary file system Leverage of usage data PageRank algorithm Sort of a “lineage” of a source of information Citation graph,PageRank Algorithm,Numerical method to calculate pages importance this a
6、pproach might well be followed by people doing research Page Rank of a page A With damping factor d Where PR(x) = Page Rank of page X Where C(x) = the amount of outgoing links from page x Where T1Tn is the set of pages with incoming links to page A PR(A)=(1-d)+d(PR(T1)/C(T1)+PR(Tn)/C(Tn) Its actuall
7、y a bit more complicated than it first looks For instance, whats PR(T1) and PR(T2) and so on?,PageRank Algorithm,An excellent explanation http:/ Since the PR(A) equation is a probability distribution over all web pages linking to web page A And because of the (1-d) term and the d*(PR.) term The Page
8、Ranks of all the web pages on the web will sum to 1,PageRank: Example,So, where do you start? It turns out that you can effectively “guess” what the PageRanks for the web pages are initially In our example, guess 0 for all of the pages Then you run the PR function to calculate PR for all the web pag
9、es iteratively You do this until The page ranks for each web page stop changing in each iteration They “settle down”,PageRank: Example,PR(a) = 1 - $damp + $damp * PR(c); PR(b) = 1 - $damp + $damp * (PR(a)/2) PR(c) = 1 - $damp + $damp * (PR(a)/2 + PR(b) + PR(d); PR(d) = 1 - $damp;,Below is the iterat
10、ive calculation that we would run,PageRank Algorithm: First 18 iterations,a: 0.00000 b: 0.00000 c: 0.00000 d: 0.00000 a: 0.15000 b: 0.21375 c: 0.39544 d: 0.15000 a: 0.48612 b: 0.35660 c: 0.78721 d: 0.15000 a: 0.81913 b: 0.49813 c: 1.04904 d: 0.15000 a: 1.04169 b: 0.59272 c: 1.22403 d: 0.15000 a: 1.1
11、9042 b: 0.65593 c: 1.34097 d: 0.15000 a: 1.28982 b: 0.69818 c: 1.41912 d: 0.15000 a: 1.35626 b: 0.72641 c: 1.47136 d: 0.15000 a: 1.40065 b: 0.74528 c: 1.50626 d: 0.15000 a: 1.43032 b: 0.75789 c: 1.52959 d: 0.15000 a: 1.45015 b: 0.76632 c: 1.54518 d: 0.15000 a: 1.46341 b: 0.77195 c: 1.55560 d: 0.1500
12、0 a: 1.47226 b: 0.77571 c: 1.56257 d: 0.15000 a: 1.47818 b: 0.77823 c: 1.56722 d: 0.15000 a: 1.48214 b: 0.77991 c: 1.57033 d: 0.15000 a: 1.48478 b: 0.78103 c: 1.57241 d: 0.15000 a: 1.48655 b: 0.78178 c: 1.57380 d: 0.15000 a: 1.48773 b: 0.78228 c: 1.57473 d: 0.15000,PageRank: next 13 iterations,a: 1.
13、48852 b: 0.78262 c: 1.57535 d: 0.15000 a: 1.48904 b: 0.78284 c: 1.57576 d: 0.15000 a: 1.48940 b: 0.78299 c: 1.57604 d: 0.15000 a: 1.48963 b: 0.78309 c: 1.57622 d: 0.15000 a: 1.48979 b: 0.78316 c: 1.57635 d: 0.15000 a: 1.48990 b: 0.78321 c: 1.57643 d: 0.15000 a: 1.48997 b: 0.78324 c: 1.57649 d: 0.150
- 1.请仔细阅读文档,确保文档完整性,对于不预览、不比对内容而直接下载带来的问题本站不予受理。
- 2.下载的文档,不会出现我们的网址水印。
- 3、该文档所得收入(下载+内容+预览)归上传者、原创作者;如果您是本文档原作者,请点此认领!既往收益都归您。
下载文档到电脑,查找使用更方便
2000 积分 0人已下载
下载 | 加入VIP,交流精品资源 |
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- THESEARCHENGINEARCHITECTUREPPT
