A Quick Introduction to Approximate Query Processing Part-.ppt
《A Quick Introduction to Approximate Query Processing Part-.ppt》由会员分享,可在线阅读,更多相关《A Quick Introduction to Approximate Query Processing Part-.ppt(42页珍藏版)》请在麦多课文档分享上搜索。
1、A Quick Introduction to Approximate Query Processing Part-IV,CS286, Spring2007 Minos Garofalakis,Logistics,Draft CS286 web site is finally up! http:/db.cs.berkeley.edu/cs286sp07/Project list and guidelines being worked on Please email me & Raghu to discuss your own project ideas,Approximate Query Pr
2、ocessing using Data Synopses,How to construct effective data synopses ?,SQL Query,Exact Answer,Decision Support Systems (DSS),Long Response Times!,GB/TB,Relations as Frequency Distributions,8 10 10,30 20 50,25 8 15,salary,age,name,age,salary,sales,sales,One-dimensional distribution,Age (attribute do
3、main values),tuple counts,Three-dimensional distribution,tuple counts,Outline,Intro & Approximate Query Answering Overview Synopses, System architectures, Commercial offerings One-Dimensional Synopses Histograms: Equi-depth, Compressed, V-optimal, Incremental maintenance, Self-tuning Samples: Basics
4、, Sampling from DBs, Reservoir Sampling Wavelets: 1-D Haar-wavelet histogram construction & maintenance Multi-Dimensional Synopses and Joins Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions,Outline,Intro Using Histograms, Samples, WaveletsDiscussion & Comparisons A
5、dvanced Techniques & Future Directions Dependency-based, Streaming data,Two-dimensional Haar Wavelets - Non-standard decomposition,Wavelet Transform Array:,(a+b-c-d)/4,(a+b+c+d)/4,(a-b-c+d)/4,(a-b+c-d)/4,Multi-dimensional Haar Wavelets,Haar decomposition in d dimensions = d-dimensional array of wave
6、let coefficients Coefficient support region = d-dimensional rectangle of cells in the original data array Sign of coefficients contribution can vary along the quadrants of its support,Support regions & signs for the 16 nonstandard 2-dimensional Haar coefficients of a 4X4 data array A,Range-sum Estim
7、ation Using Wavelet Synopses,Coefficient thresholding As in 1-d case, normalizing by appropriate constants and retaining the largest coefficients minimizes the overall L2 error Range-sums: selectivity estimation or OLAP-cube aggregates VW99 (“measure attribute” as count) Only coefficients with suppo
8、rt regions intersecting the query hyper-rectangle can contribute Many contributions can cancel each other CGR00, VW99,Query Range,Contribution to range sum = 0Only nodes on the path to range endpoints can have nonzero contributions (Extends naturally to multi-dimensional range sums),Decomposition Tr
9、ee (1-d),Approximate Query Processing Using Wavelets CGR00,Wavelet Synopses,Approximate Relations,Query Results in Wavelet Domain,Final Approximate Results,Render,Render,Querying in Wavelet Domain,Querying in Relation Domain,Reduce relations into compact wavelet-coefficient synopses,Entire query pro
10、cessing in the compressed (wavelet) domain,Wavelet Query Processing,set of coefficients,set of coefficients,set of coefficients,Each operator (e.g., select, project, join, aggregates, etc.) input: set of wavelet coefficients output: set of wavelet coefficients Finally, rendering step input: set of w
11、avelet coefficients output: (multi)set of tuples,render,Selection - Relational Domain,In relational domain, interested in only those cells inside query range In wavelet domain, interested in only the coefficients that contribute to those cells,Dim. D2,6,3,7,3,3,2,2,4,1,1,8,6,3,Query Range,Dim. D1,Jo
12、int Data Distribution Array,Relation,Selection - Wavelet Domain,-,-,+,+,+,-,-,+,+,-,D2,D1,Query Range,Equi-join - Relational Domain,Relational domain: Join count= 7*3 = (A1-A3)*(B2+B3) Wavelet domain: A1*B2 + A1*B3 - A3*B2 - A3*B3 Consider all pairs of coefficients: (1) check joinability (overlap in
13、 join dimension(s), (2) compute output coefficients,3,Coefficients A1 (+) and A3 (-) contribute to this cell,Coefficients B2 (+), and B3 (+) contribute to this cell,Join along D1,Joint Data Distributionof Relation 1,Joint Data Distr.of Relation 2,Dim. D3,Join Dim.D1,Relation 1,Relation 2,Equi-join -
14、 Wavelet Domain,-,+,D3,D1,-,-,+,+,D2,D1,D1,v1,v2,Wavelet Query Processing,set of coefficients,set of coefficients,set of coefficients,Each operator (e.g., select, project, join, aggregates, etc.) input: set of wavelet coefficients output: set of wavelet coefficients Finally, rendering step input: se
15、t of wavelet coefficients output: (multi)set of tuples,render,Outline,Intro & Approximate Query Answering Overview One-Dimensional Synopses Multi-Dimensional Synopses and Joins Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions Conclusions,Discussion & Comparisons (1
16、),Histograms & Wavelets: Limited by “curse of dimensionality” Rely on data space partitioning in “regions” Ineffective above 5-6 dimensions Value/frequency uniformity assumptions within buckets break down in medium-to-high dimensionalities! Sampling: No such limitations, BUT. Ineffective for ad-hoc
17、relational joins over arbitrary schemas Uniformity property is lost Quality guarantees degrade Effectiveness for set-valued approximate queries is unclear Only (very) small subsets of the answer set are returned (especially, when joins are present),Discussion & Comparisons (2),Histograms & Wavelets:
18、 Compress data by accurately capturing rectangular “regions” in the data space Advantage over sampling for typical, “range-based” relational DB queries BUT, unclear how to effectively handle unordered/non-numeric data sets (no such issues with sampling.) Sampling: Provides strong probabilistic quali
19、ty guarantees (unbiased answers) for individual aggregate queries Histograms & Wavelets: Can guarantee a bound on the overall error (e.g., L2) for the approximation, BUT answers to individual queries can be heavily biased!,No clear winner exists! (Hybrids?),Outline,Intro & Approximate Query Answerin
20、g Overview One-Dimensional Synopses Multi-Dimensional Synopses and Joins Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions Dependency-based Synopses Streaming Data XML Synopses Conclusions,Dependency-based Histogram Synopses DGR01,Extremes in terms of the underlying
21、 correlations!Dependency-Based Histograms: explore space between extremes by explicitly identifying data correlations/independences Build a statistical interaction model on data attributes Based on the model, build a collection of low-dimensional histograms Use this histogram collection to provide a
22、pproximate answersGeneral methodology, also applicable to other synopsis techniques (e.g., wavelets),Dependency-based Histograms,Identify (and exploit) attribute correlation and independence Partial Independence : p(salary, height, weight) = p(salary) * p(height, weight) Conditional Independence :p(
23、salary, age | YPE) = p(salary| YPE) * p(age | YPE) Use forward selection to build a decomposable statistical model BFH75, Lau96 on the attributes A,D are conditionally independent given B,C p(AD|BC) = p(A|BC) * p(D|BC) Joint distribution p(ABCD) = p(ABC) * p(BCD) / p(BC) Build histograms on model cl
- 1.请仔细阅读文档,确保文档完整性,对于不预览、不比对内容而直接下载带来的问题本站不予受理。
- 2.下载的文档,不会出现我们的网址水印。
- 3、该文档所得收入(下载+内容+预览)归上传者、原创作者;如果您是本文档原作者,请点此认领!既往收益都归您。
下载文档到电脑,查找使用更方便
2000 积分 0人已下载
下载 | 加入VIP,交流精品资源 |
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- AQUICKINTRODUCTIONTOAPPROXIMATEQUERYPROCESSINGPARTPPT

链接地址:http://www.mydoc123.com/p-373177.html