Approximate Query Processing-Taming the TeraBytes!A Tutorial.ppt
《Approximate Query Processing-Taming the TeraBytes!A Tutorial.ppt》由会员分享,可在线阅读,更多相关《Approximate Query Processing-Taming the TeraBytes!A Tutorial.ppt(90页珍藏版)》请在麦多课文档分享上搜索。
1、Approximate Query Processing: Taming the TeraBytes! A Tutorial,Minos Garofalakis and Phillip B. Gibbons Information Sciences Research Center Bell Laboratories http:/www.bell- pbgibbons/,Outline,Intro & Approximate Query Answering Overview Synopses, System architecture, Commercial offerings One-Dimen
2、sional Synopses Histograms, Samples, Wavelets Multi-Dimensional Synopses and Joins Multi-D Histograms, Join synopses, Wavelets Set-Valued Queries Using Histograms, Samples, WaveletsDiscussion & Comparisons Advanced Techniques & Future Directions Dependency-based, Workload-tuned, Streaming data Concl
3、usions,Introduction & Motivation,Exact answers NOT always required DSS applications usually exploratory: early feedback to help identify “interesting” regions Aggregate queries: precision to “last decimal” not needed e.g., “What percentage of the US sales are in NJ?” (display as bar graph) Preview a
4、nswers while waiting. Trial queries Base data can be remote or unavailable: approximate processing using locally-cached data synopses is the only option,SQL Query,Exact Answer,Decision Support Systems (DSS),Long Response Times!,Fast Approximate Answers,Primarily for Aggregate queries Goal is to quic
5、kly report the leading digits of answers In seconds instead of minutes or hours Most useful if can provide error guaranteesE.g., Average salary$59,000 +/- $500 (with 95% confidence) in 10 secondsvs. $59,152.25 in 10 minutesAchieved by answering the query based on samples or other synopses of the dat
6、aSpeed-up obtained because synopses are orders of magnitude smaller than the original data,Approximate Query Answering,Basic Approach 1: Online Query Processing e.g., Control Project HHW97, HH99, HAR00 Sampling at query time Answers continually improve, under user control,Approximate Query Answering
7、,Basic Approach 2: Precomputed Synopses Construct & store synopses prior to query time At query time, use synopses to answer the queryLike estimation in query optimizers, but reported to the user (need higher accuracy) more general queriesNeed to maintain synopses up-to-dateMost work in the area bas
8、ed on the precomputed approach e.g., Sample Views OR92, Olk93, Aqua Project GMP97a, AGP99,etc,The Aqua Architecture,Data Warehouse (e.g., Oracle),SQL Query Q,Network,Q,Result,HTML XML,Warehouse Data Updates,Browser Excel,Picture without Aqua: User poses a query QData Warehouse executes Q and returns
9、 resultWarehouse is periodically updated with new data,The Aqua Architecture,Picture with Aqua: Aqua is middleware, between the user and the warehouseAqua Synopses are stored in the warehouseAqua intercepts the user query and rewrites it to be a query Q on the synopses. Data warehouse returns approx
10、imate answer,Data Warehouse (e.g., Oracle),SQL Query Q,Network,Q,Result (w/ error bounds),HTML XML,Warehouse Data Updates,AQUA Synopses,AQUA Tracker,Browser Excel,GMP97a, AGP99,Online vs. Precomputed,Online: + Continuous refinement of answers (online aggregation) + User control: what to refine, when
11、 to stop + Seeing the query is very helpful for fast approximate results + No maintenance overheads + See HH01 Online Query Processing tutorial for detailsPrecomputed: + Seeing entire data is very helpful (provably & in practice) (But must construct synopses for a family of queries) + Often faster:
12、better access patterns, small synopses can reside in memory or cache + Middleware: Can use with any DBMS, no special index striding + Also effective for remote or streaming data,Commercial DBMS,Oracle, IBM Informix: Sampling operator (online)IBM DB2: “IBM Almaden is working on a prototype version of
13、 DB2 that supports sampling. The user specifies a priori the amount of sampling to be done.”Microsoft SQL Server: “New auto statistics extract statistics e.g., histograms using fast sampling, enabling the Query Optimizer to use the latest information.” The index tuning wizard uses sampling to build
14、statistics.see CN97, CMN98, CN98In summary, not much announced yet,Outline,Intro & Approximate Query Answering Overview One-Dimensional Synopses Histograms: Equi-depth, Compressed, V-optimal, Incremental maintenance, Self-tuning Samples: Basics, Sampling from DBs, Reservoir Sampling Wavelets: 1-D Ha
15、ar-wavelet histogram construction & maintenance Multi-Dimensional Synopses and Joins Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions Conclusions,Histograms,Partition attribute value(s) domain into a set of buckets Issues: How to partition What to store for each bu
16、cket How to estimate an answer using the histogramLong history of use for selectivity estimation within a query optimizer Koo80, PSC84, etc.PIH96 Poo97 introduced a taxonomy, algorithms, etc.,1-D Histograms: Equi-Depth,Goal: Equal number of rows per bucket (B buckets in all)Can construct by first so
17、rting then taking B-1 equally-spaced splitsFaster construction: Sample & take equally-spaced splits in sample Nearly equal buckets Can also use one-pass quantile algorithms (e.g., GK01),Count in bucket,Domain values,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20,Can maintain using one-pass algor
18、ithms (insertions only), orUse a backing sample GMP97b: Maintain a larger sample on disk in support of histogram maintenance Keep histogram bucket counts up-to-date by incrementing on row insertion, decrementing on row deletionMerge adjacent buckets with small countsSplit any bucket with a large cou
19、nt, using the sample to select a split value, i.e, take median of the sample points in bucket range Keeps counts within a factor of 2; for more equal buckets, can recompute from the sample,1-D Histograms: Equi-Depth,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20,Count in bucket,Domain values,1-D
20、 Histograms: Compressed,Create singleton buckets for largest values, equi-depth over the restImprovement over equi-depth since get exact info on largest values, e.g., join estimation in DB2 compares largest values in the relations Construction: Sorting + O(B log B) + one pass; can use sampleMaintena
21、nce: Split & Merge approach as with equi-depth, but must also decide when to create and remove singleton buckets GMP97b,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20,PIH96,1-D Histograms: V-Optimal,IP95 defined V-optimal AVGFi:j = avg freq for values ij SSEi:j = sumk=ij (Fk2 (j-i+1)*AVGFi:j2) F
22、or i=1N, compute Pi = sumk=1i Fk & Qi = sumk=1i Fk2 Then can compute any SSEi:j in constant time Let SSEP(i,k) = min SSE for F1Fi using k buckets Then SSEP(i,k) = minj=1i-1 (SSEP(j,k-1) + SSEj+1:i), i.e.,suffices to consider all possible left boundaries for kth bucket Also gave faster approximation
23、algorithms,Answering Queries: Equi-Depth,Answering queries: select count(*) from R where 4 = R.A = 15 approximate answer: F * |R|/B, where F = number of buckets, including fractions, that overlap the range error guarantee: 2 * |R|/B,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20,4 R.A 15, 0.5 *
24、|R|/6,Answering Queries: Histograms,Answering queries from 1-D histograms (in general): (Implicitly) map the histogram back to an approximate relation, & apply the query to the approximate relationContinuous value mapping SAC79:,Count spread evenly among bucket values,- Uniform spread mapping PIH96:
- 1.请仔细阅读文档,确保文档完整性,对于不预览、不比对内容而直接下载带来的问题本站不予受理。
- 2.下载的文档,不会出现我们的网址水印。
- 3、该文档所得收入(下载+内容+预览)归上传者、原创作者;如果您是本文档原作者,请点此认领!既往收益都归您。
下载文档到电脑,查找使用更方便
2000 积分 0人已下载
下载 | 加入VIP,交流精品资源 |
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- APPROXIMATEQUERYPROCESSINGTAMINGTHETERABYTESATUTORIALPPT

链接地址:http://www.mydoc123.com/p-378530.html