The riddles of the Sphinx.ppt
《The riddles of the Sphinx.ppt》由会员分享,可在线阅读,更多相关《The riddles of the Sphinx.ppt(44页珍藏版)》请在麦多课文档分享上搜索。
1、The riddles of the Sphinx,Full-text engine anatomy atlas,Who are you?,Sphinx FOSS full-text search engine,Who are you?,Sphinx FOSS full-text search engine Good at playing ball,Who are you?,Sphinx FOSS full-text search engine Good at playing ball Good at not playing ball,Who are you?,Sphinx FOSS full
2、-text search engine Good at playing ball Good at not playing ball Good at passing the ball to a team-mate,Who are you?,Sphinx FOSS full-text search engine Good at playing ball Good at not playing ball Good at passing the ball to a team-mate Good at many other “inferior” games “faceted” search, geose
3、arch, snippet extraction, multi-queries, IO throttling, and 10-20 other interesting directives,What are you here for?,What will not be covered? No entry-level “whats that Sphinx and whats in it for me” overview No long quotes from the documentation No C+ architecture details,What are you here for?,W
4、hat will not be covered? No entry-level “whats that Sphinx and whats in it for me” overview No long quotes from the documentation No C+ architecture details What will be? How does it generally work inside How things can be optimized How things can be parallelized,Chapter 1. Engine insides,Total work
5、flow,Indexing first Searching second,Total workflow,Indexing first Searching secondThere are data sources (what to fetch, where from) There are indexes What data sources to index How to process the incoming text Where to put the results,How indexing works,In two acts, with an intermission Phase 1 co
6、llecting documents Fetch the documents (loop over the sources) Split the documents into words Process the words (morphology, *fixes) Replace the words with their wordids (CRC32/64) Emit a number of temp files,How indexing works,Phase 2 sorting hits Hit (occurrence) is a (docid,wordid,wordpos) record
7、 Input is a number of partially sorted (by wordid) hit lists The incoming lists are merge-sorted Output is essentially a single fully sorted hit list Intermezzo Collect and sort MVA values Sort ordinals Sort extern attributes,Dumb & dumber,The index format is simple Several sorted lists Dictionary (
8、the complete list of wordids) Attributes (only if docinfo=extern) Document lists (for each keyword) Hit lists (for each keyword) Everything is laid out linearly, good for IO,How searching works,For each local index Build a list of candidates (documents that satisfy the full-text query) Filter (the a
9、nalogy is WHERE) Rank (compute the documents relevance values) Sort (the analogy is ORDER BY) Group (the analogy is GROUP BY) Merge the results from all the local indexes,1. Searching cost,Building the candidates list 1 keyword = 1+ IO (document list) Boolean operations on document lists Cost is pro
10、portional () to the lists lengths That is, to the sum of all the keyword frequencies In case of phrase/proximity/etc search, there also will be operations on hit lists approx. 2x IO/CPU,1. Searching cost,Building the candidates list 1 keyword = 1+ IO (document list) Boolean operations on document li
11、sts Cost is proportional () to the lists lengths That is, to the sum of all the keyword frequencies In case of phrase/proximity/etc search, there also are operations on hit lists approx. 2x IO/CPU Bottom line “The Who” are really bad,2. Filtering cost,docinfo=inline Attributes are inlined in the doc
12、ument lists ALL the values are duplicated MANY times! Immediately accessible after disk read docinfo=extern Attributes are stored in a separate list (file) Fully cached in RAM Hashed by docid + binary search Simple loop over all filters Cost number of candidates and filters,3. Ranking cost,Direct de
13、pends on the ranker To account for keyword positions Helps the relevancy But costs extra resources double impact! Cost number of results Most expensive phrase proximity + BM25 Most cheap none (weight=1)Indirect induced in the sorting,4. Sorting cost,Cost number of results Also depends on the sorting
14、 criteria (documents will be supplied in id asc order) Also depends on max_matches The more the max, the worse the server feels 1-10K is acceptable, 100K is way too much 10-20 is not enough (makes little sense),5. Grouping cost,Grouping is internally a kind of sorting Cost affected by the number of
15、results, too Cost affected by max_matches, too Additionally, max_matches setting affects count and distinct precision,Chapter 2. Optimizing things,How to optimize queries,Partitioning the dataChoosing ranking vs. sorting mode Filters vs. keywords Filters vs. manual MTF Multi queries,How to optimize
16、queries,Partitioning the dataChoosing ranking vs. sorting mode Filters vs. keywords Filters vs. manual MTF Multi queriesLast line of defense Three Big Buttons,1. Partitioning the data,Swiss army knife, for different tasks Bound by indexing time? Partition, re-index the recent changes only Bound by f
- 1.请仔细阅读文档,确保文档完整性,对于不预览、不比对内容而直接下载带来的问题本站不予受理。
- 2.下载的文档,不会出现我们的网址水印。
- 3、该文档所得收入(下载+内容+预览)归上传者、原创作者;如果您是本文档原作者,请点此认领!既往收益都归您。
下载文档到电脑,查找使用更方便
2000 积分 0人已下载
下载 | 加入VIP,交流精品资源 |
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- THERIDDLESOFTHESPHINXPPT
