Text and Web Search
Text Databases and IR

- Text databases (document databases): large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
- Information retrieval (IR): a field that developed in parallel with database systems, in which information is organized into (a large number of) documents.
- The information retrieval problem: locating relevant documents based on user input, such as keywords or example documents.

Information Retrieval

- Typical IR systems: online library catalogs, online document management systems.
- Information retrieval vs. database systems:
  - Some DB problems are not present in IR, e.g., updates, transaction management, complex objects.
  - Some IR problems are not addressed well in a DBMS, e.g., unstructured documents, approximate search using keywords and relevance.

Basic Measures for Text Retrieval

- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses).
- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.
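In set notation (a standard formulation; the slides give only the prose definitions), with Rel the set of documents relevant to the query and Ret the set of documents retrieved:

```latex
\text{precision} = \frac{|Rel \cap Ret|}{|Ret|}
\qquad
\text{recall} = \frac{|Rel \cap Ret|}{|Rel|}
```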
Information Retrieval Techniques

- Index term (attribute) selection: stop list, word stemming, index-term weighting methods.
- Term-document frequency matrices.
- Information retrieval models: Boolean model, vector model, probabilistic model.

Problem - Motivation

- Given a database of documents, find the documents containing "data" and "retrieval".
- Applications: the Web; law and patent offices; digital libraries; information filtering.
- Types of queries:
  - boolean (data AND retrieval AND NOT ...)
  - additional features (data ADJACENT retrieval)
  - keyword queries (data, retrieval)
- How to search a large collection of documents?
Full-text scanning

- For a single term, the naive approach costs O(N*M) (text of length N, pattern of length M).
- Knuth, Morris and Pratt (1977): build a small FSA from the pattern and visit every text letter only once, by carefully shifting the pattern more than one step at a time.
- Boyer and Moore (1977): preprocess the pattern, compare from right to left, and skip!
- [Figure: the pattern CAB sliding across the text ABRACADABRA, one shift at a time.]
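To make the KMP idea concrete, here is a minimal Python sketch (an illustration of the technique, not the original 1977 formulation): a failure table precomputed from the pattern tells us how far to shift, so the text pointer never moves backwards.

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Return all offsets where pattern occurs in text, in O(N + M) time."""
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it (the "small FSA" of the slide).
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    hits, k = [], 0
    for i, ch in enumerate(text):       # every text letter is visited once
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]             # shift the pattern, not the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):           # full match ending at position i
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

print(kmp_search("ABRACADABRA", "CAB"))    # the slide's example: no match -> []
print(kmp_search("ABRACADABRA", "ABRA"))   # -> [0, 7]
```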
Text - Detailed outline

- text problem
- full-text scanning
- inversion
- signature files
- clustering
- information filtering and LSI

Text Inverted Files

- Q: what is the space overhead? A: mainly, the postings lists.
- How to organize the dictionary? B-trees, hashing, tries, PATRICIA trees, ...
- Stemming, yes or no? (Keep only the root of each word, e.g., "inverted", "inversion" -> "invert".)
- How to handle insertions?
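A minimal sketch of the structure itself, assuming whitespace tokenization and no stemming (the dictionary here is a plain Python dict; a real system would use one of the structures listed above):

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map every term to the sorted postings list of documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "data retrieval systems", 2: "web data mining", 3: "text retrieval"}
idx = build_inverted_index(docs)
print(idx["data"])                                        # -> [1, 2]
# A boolean AND-query is an intersection of postings lists:
print(sorted(set(idx["data"]) & set(idx["retrieval"])))   # -> [1]
```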
Text Inverted Files (postings lists)

- Postings-list lengths follow a Zipf distribution; e.g., the rank-frequency plot of the Bible is roughly a straight line on log(rank) vs. log(freq) axes, with freq ≈ 1 / (rank * ln(1.78 * V)), where V is the vocabulary size.
- Postings lists: Cutting + Pedersen (keep the first 4 postings in the B-tree leaves).
- How to allocate space: [Faloutsos+ 92], geometric progression.
- Compression (Elias codes): [Zobel+] brings the overhead down to 2%!
- Conclusions: inverted files need space overhead (2%-300%), but they are the fastest approach.
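As an illustration of the Elias codes mentioned above (a sketch; systems typically encode the gaps between successive document ids, since small gaps dominate under the Zipf distribution): Elias gamma writes a positive integer n as (b - 1) zeros followed by the b bits of n.

```python
def elias_gamma_encode(n: int) -> str:
    """Encode a positive integer as (b-1) zeros + its b-bit binary form."""
    assert n >= 1
    bits = bin(n)[2:]                    # binary of n, no leading zeros
    return "0" * (len(bits) - 1) + bits

def elias_gamma_decode(code: str) -> int:
    """Decode a single gamma codeword."""
    zeros = len(code) - len(code.lstrip("0"))
    return int(code[zeros:2 * zeros + 1], 2)

gaps = [1, 2, 5, 13]   # doc ids 1, 3, 8, 21, stored as successive gaps
print([elias_gamma_encode(g) for g in gaps])   # -> ['1', '010', '00101', '0001101']
print(elias_gamma_decode("0001101"))           # -> 13
```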
Vector Space Model and Clustering

- Keyword (free-text) queries (vs. Boolean queries).
- Each document becomes a vector (HOW?); each query becomes a vector; search for similar vectors.
- Main idea: each document is a vector of size d, where d is the number of different terms in the database (the vocabulary size).
- [Figure: indexing a document containing "...data..." into a vector with one position per term, from "aaron" to "zoo".]

Document Vectors

- Documents are represented as "bags of words".
- Represented as vectors when used computationally.
- A vector is like an array of floating-point numbers: it has direction and magnitude.
- Each vector holds a place for every term in the collection; therefore, most vectors are sparse.

Document Vectors: one location for each word (rows are document ids A-I; a blank means 0 occurrences):

            nova  galaxy  heat  h'wood  film  role  diet  fur
    A        10     5      3
    B         5    10
    C              10      8      7
    D                             9     10     5
    E                            10     10
    F                             9     10
    G         5     7      9
    H               6     10      2      8
    I                             7      5            1    3

- "Nova" occurs 10 times in text A, "galaxy" occurs 5 times in text A, "heat" occurs 3 times in text A.
- "Hollywood" occurs 7 times in text I, "film" 5 times, "diet" once, and "fur" 3 times.
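To make the sparse-vector picture concrete, a small sketch using the raw counts for texts A and B from the table above (terms a document does not contain are simply zero):

```python
import math

# Raw term counts from the table (blank cells are zero).
A = {"nova": 10, "galaxy": 5, "heat": 3}
B = {"nova": 5, "galaxy": 10}

def cosine(x: dict[str, int], y: dict[str, int]) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(v * y.get(t, 0) for t, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y)

print(round(cosine(A, B), 3))   # -> 0.773: both are astronomy documents
```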
We Can Plot the Vectors

- [Figure: documents plotted in a plane with axes "Star" and "Diet": a doc about astronomy and a doc about movie stars lie near the "Star" axis; a doc about mammal behavior lies near the "Diet" axis.]

Vector Space Model and Clustering

- Then, group nearby vectors together. Q1: how to search the clusters? Q2: how to generate the clusters?
- Two significant contributions: ranked output and relevance feedback.
- Cluster search: visit the (k) closest superclusters; continue recursively. [Figure: a query point among clusters of CS technical reports and MD technical reports.]
- Ranked output: easy! (Report documents in order of distance from the query vector.)
- Relevance feedback (a brilliant idea) [Rocchio, 1973]. How? A: by adding the "good" vectors to the query and subtracting the "bad" ones; see the update formula below.
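The usual textbook form of the Rocchio update (the slide gives only the intuition; alpha, beta, gamma are tunable weights, D_r the documents marked relevant, D_n those marked non-relevant):

```latex
\vec{q}_{\,new} \;=\; \alpha\,\vec{q}
\;+\; \frac{\beta}{|D_r|} \sum_{\vec{d} \in D_r} \vec{d}
\;-\; \frac{\gamma}{|D_n|} \sum_{\vec{d} \in D_n} \vec{d}
```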
Cluster generation

- Problem: given N points in V dimensions, group them.
- Typically k-means or AGNES (agglomerative nesting, a hierarchical method) is used; a sketch follows.
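A minimal k-means sketch (one of the two methods named above; AGNES instead merges the closest clusters bottom-up). The fixed iteration count and function names are illustrative choices, not from the slides:

```python
import random

def squared_dist(p: list[float], q: list[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points: list[list[float]], k: int, iters: int = 20) -> list[int]:
    """Group N points in V dimensions into k clusters; return assignments."""
    centers = [list(p) for p in random.sample(points, k)]   # initial centers
    assign = [0] * len(points)
    for _ in range(iters):
        # Step 1: assign every point to its nearest center.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: squared_dist(p, centers[c]))
        # Step 2: move every center to the mean of its assigned points.
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign
```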
Assigning Weights to Terms

- Options: binary weights; raw term frequency; tf x idf.
- Recall the Zipf distribution: we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole.

Binary Weights

- Only the presence (1) or absence (0) of a term is recorded in the vector.

Raw Term Weights

- The frequency of occurrence of the term in each document is recorded in the vector.

Assigning Weights: tf x idf

- The tf x idf measure combines term frequency (tf) and inverse document frequency (idf); it is a way to deal with the problems of the Zipf distribution.
- Goal: assign a tf x idf weight to each term in each document.
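A classic concrete form of this weight (the slide's own formula panel did not survive extraction, but this is the standard variant) for term k in document i, with N the number of documents in the collection and n_k the number of documents containing term k:

```latex
w_{ik} \;=\; tf_{ik} \cdot idf_k \;=\; tf_{ik} \cdot \log\!\left(\frac{N}{n_k}\right)
```

A term that occurs often in a document but rarely in the collection gets a large weight, which is exactly the behavior asked for above.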
Inverse Document Frequency

- IDF provides high values for rare words and low values for common words.
- Example, for a collection of 10,000 documents (using idf_k = log10(N / n_k), one common convention): a term appearing in a single document gets idf = log10(10000/1) = 4; in 100 documents, idf = 2; in all 10,000 documents, idf = 0.

Similarity Measures for Document Vectors

- Simple matching (coordination-level match)
- Dice's coefficient
- Jaccard's coefficient
- Cosine coefficient
- Overlap coefficient
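The named coefficients in their standard set-based forms, for two documents X and Y viewed as sets of terms (the slide's formula panel was lost in extraction; these are the usual textbook definitions):

```latex
\begin{aligned}
\text{simple matching:} \quad & |X \cap Y| \\
\text{Dice:}            \quad & \frac{2\,|X \cap Y|}{|X| + |Y|} \\
\text{Jaccard:}         \quad & \frac{|X \cap Y|}{|X \cup Y|} \\
\text{cosine:}          \quad & \frac{|X \cap Y|}{\sqrt{|X|}\,\sqrt{|Y|}} \\
\text{overlap:}         \quad & \frac{|X \cap Y|}{\min(|X|,\,|Y|)}
\end{aligned}
```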
tf x idf Normalization

- Normalize the term weights, so that longer documents are not unfairly given more weight.
- To "normalize" usually means to force all values to fall within a certain range, usually between 0 and 1, inclusive.

Vector Space Similarity

- Use the (normalized) weights to compare the documents.
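One common way to carry out this normalization (cosine normalization; the slide's exact variant is not recoverable from the extracted text) divides every weight by the length of the document's vector, so each document becomes a unit vector:

```latex
w'_{ik} \;=\; \frac{tf_{ik}\,\log(N/n_k)}{\sqrt{\sum_{j=1}^{d} \bigl( tf_{ij}\,\log(N/n_j) \bigr)^{2}}}
```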