Analysis of Large Graphs-TrustRank and WebSpam.ppt
《Analysis of Large Graphs-TrustRank and WebSpam.ppt》由会员分享,可在线阅读,更多相关《Analysis of Large Graphs-TrustRank and WebSpam.ppt(60页珍藏版)》请在麦多课文档分享上搜索。
1、Analysis of Large Graphs: TrustRank and WebSpam,Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http:/www.mmds.org,Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Fee
2、l free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http:/www.mmds.org,Example: PageRank Scores,B 38.4,C 34.3,E 8.1,F 3.9,D 3.9,A 3.3,1.6,1.
3、6,1.6,1.6,1.6,2,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,Random Teleports ( = 0.8),y a = m,1/3 1/3 1/3,0.33 0.20 0.46,0.24 0.20 0.52,0.26 0.18 0.56,7/335/33 21/33,. . .,3,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,
4、1/2 1/2 01/2 0 00 1/2 1,1/3 1/3 1/31/3 1/3 1/31/3 1/3 1/3,y 7/15 7/15 1/15 a 7/15 1/15 1/15 m 1/15 7/15 13/15,0.8,+ 0.2,M,1/NNxN,A,r = A r,Equivalently: = + ,PageRank: The Complete Algorithm,Input: Graph and parameter Directed graph with spider traps and dead ends Parameter Output: PageRank vector S
5、et: 0 = 1 , =1 do: : () = () () = if in-degree of is 0 Now re-insert the leaked PageRank: = + =+ while () (1) ,4,where: = (),If the graph has no dead-ends then the amount of leaked PageRank is 1-. But since we have dead-ends the amount of leaked PageRank may be larger. We have to explicitly account
6、for it by computing S.,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,Some Problems with PageRank,Measures generic popularity of a page Will ignore/miss topic-specific authorities Solution: Topic-Specific PageRank (next) Uses a single measure of importance Other
7、 models of importance Solution: Hubs-and-Authorities Susceptible to Link spam Artificial link topographies created in order to boost page rank Solution: TrustRank,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,5,Topic-Specific PageRank,Topic-Specific PageRank,In
8、stead of generic popularity, can we measure popularity within a topic? Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history” Allows search queries to be answered based on interests of the user Example: Query “Troj
9、an” wants different pages depending on whether you are interested in sports, history and computer security,7,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,Topic-Specific PageRank,Random walker has a small probability of teleporting at any step Teleport can go t
10、o: Standard PageRank: Any page with equal probability To avoid dead-end and spider-trap problems Topic Specific PageRank: A topic-specific set of “relevant” pages (teleport set) Idea: Bias the random walk When walker teleports, she pick a page from a set S S contains only pages that are relevant to
11、the topic E.g., Open Directory (DMOZ) pages for a given topic/query For each teleport set S, we get a different vector rS,8,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,Matrix Formulation,To make this work all we need is to update the teleportation part of the
12、 PageRank formulation: = +()/| if + otherwise A is stochastic! We weighted all pages in the teleport set S equally Could also assign different weights to pages! Compute as for regular PageRank: Multiply by M, then add a vector Maintains sparseness,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Mass
13、ive Datasets, http:/www.mmds.org,9,Example: Topic-Specific PageRank,1,2,3,4,Suppose S = 1, = 0.8,10,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,S=1,2,3,4, =0.8: r=0.13, 0.10, 0.39, 0.36 S=1,2,3 , =0.8: r=0.17, 0.13, 0.38, 0.30 S=1,2 , =0.8: r=0.26, 0.20, 0.29
14、, 0.23 S=1 , =0.8: r=0.29, 0.11, 0.32, 0.26,S=1, =0.90: r=0.17, 0.07, 0.40, 0.36 S=1 , =0.8: r=0.29, 0.11, 0.32, 0.26 S=1, =0.70: r=0.39, 0.14, 0.27, 0.19,Discovering the Topic Vector S,Create different PageRanks for different topics The 16 DMOZ top-level categories: arts, business, sports, Which to
15、pic ranking to use? User can pick from a menu Classify query into a topic Can use the context of the query E.g., query is launched from a web page talking about a known topic History of queries e.g., “basketball” followed by “Jordan” User context, e.g., users bookmarks, ,11,J. Leskovec, A. Rajaraman
16、, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,Application to Measuring Proximity in Graphs,Random Walk with Restarts: S is a single element,Proximity on Graphs,a.k.a.: Relevance, Closeness, Similarity,Tong-Faloutsos, 06,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
17、http:/www.mmds.org,13,Good proximity measure?,Shortest path is not good:No effect of degree-1 nodes (E, F, G)! Multi-faceted relationships,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,14,Good proximity measure?,Network flow is not good:Does not punish long pat
18、hs,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,15,What is good notion of proximity?,Multiple connectionsQuality of connectionDirect & Indirect connectionsLength, Degree, Weight,Tong-Faloutsos, 06,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Dataset
19、s, http:/www.mmds.org,16,SimRank: Idea,SimRank: Random walks from a fixed node on k-partite graphs Setting: k-partite graph with k types of nodes E.g.: Authors, Conferences, Tags Topic Specific PageRank from node u: teleport set S = u Resulting scores measures similarity to node u Problem: Must be d
20、one once for each node u Suitable for sub-Web-scale applications,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,17,Authors,Conferences,Tags,18,SimRank: Example,Conference,Author,Q: What is most relatedconference to ICDM?,J. Leskovec, A. Rajaraman, J. Ullman: Min
21、ing of Massive Datasets, http:/www.mmds.org,A: Topic-Specific PageRank with teleport set S=ICDM,SimRank: Example,19,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,PageRank: Summary,“Normal” PageRank: Teleports uniformly at random to any node All nodes have the s
22、ame probability of surfer landing there: S = 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 Topic-Specific PageRank also known as Personalized PageRank: Teleports to a topic specific set of pages Nodes can have different probabilities of surfer landing there: S = 0.1, 0, 0, 0.2, 0, 0, 0.5, 0, 0, 0
23、.2 Random Walk with Restarts: Topic-Specific PageRank where teleport is always to the same node. S=0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http:/www.mmds.org,20,TrustRank: Combating the Web Spam,What is Web Spam?,Spamming: Any deliberate acti
24、on to boost a web pages position in search engine results, incommensurate with pages real value Spam: Web pages that are the result of spamming This is a very broad definition SEO industry might disagree! SEO = search engine optimizationApproximately 10-15% of web pages are spam,22,J. Leskovec, A. R
- 1.请仔细阅读文档,确保文档完整性,对于不预览、不比对内容而直接下载带来的问题本站不予受理。
- 2.下载的文档,不会出现我们的网址水印。
- 3、该文档所得收入(下载+内容+预览)归上传者、原创作者;如果您是本文档原作者,请点此认领!既往收益都归您。
下载文档到电脑,查找使用更方便
2000 积分 0人已下载
下载 | 加入VIP,交流精品资源 |
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- ANALYSISOFLARGEGRAPHSTRUSTRANKANDWEBSPAMPPT
