A Quick Introduction to Approximate Query ProcessingPart II.ppt
《A Quick Introduction to Approximate Query ProcessingPart II.ppt》由会员分享,可在线阅读,更多相关《A Quick Introduction to Approximate Query ProcessingPart II.ppt(29页珍藏版)》请在麦多课文档分享上搜索。
1、A Quick Introduction to Approximate Query Processing Part II,CS286, Spring2007 Minos Garofalakis,Decision Support Systems,Data Warehousing: Consolidate data from many sources in one large repository. Loading, periodic synchronization of replicas. Semantic integration. OLAP: Complex SQL queries and v
2、iews. Queries based on spreadsheet-style operations and “multidimensional” view of data. Interactive and “online” queries. Data Mining: Exploratory search for interesting trends and anomalies. (Another lecture!),Approximate Query Processing using Data Synopses,How to construct effective data synopse
3、s ?,SQL Query,Exact Answer,Decision Support Systems (DSS),Long Response Times!,GB/TB,Relations as Frequency Distributions,8 10 10,30 20 50,25 8 15,salary,age,name,age,salary,sales,sales,One-dimensional distribution,Age (attribute domain values),tuple counts,Three-dimensional distribution,tuple count
4、s,Outline,Intro & Approximate Query Answering Overview Synopses, System architectures, Commercial offerings One-Dimensional Synopses Histograms: Equi-depth, Compressed, V-optimal, Incremental maintenance, Self-tuning Samples: Basics, Sampling from DBs, Reservoir Sampling Wavelets: 1-D Haar-wavelet h
5、istogram construction & maintenance Multi-Dimensional Synopses and Joins Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions,One-Dimensional Haar Wavelets,Wavelets: mathematical tool for hierarchical decomposition of functions/signals Haar wavelets: simplest wavelet b
6、asis, easy to understand and implement Recursive pairwise averaging and differencing at different resolutions,Resolution Averages Detail Coefficients,2, 2, 0, 2, 3, 5, 4, 4,2, 1, 4, 4,0, -1, -1, 0,-,3,2,1,0,Haar Wavelet Coefficients,Coefficient “Supports”,Hierarchical decomposition structure (a.k.a.
7、 “error tree”),Original data,Wavelet-based Histograms MVW98,Problem: range-query selectivity estimation Key idea: use a compact subset of Haar/linear wavelet coefficients for approximating the data distribution Steps compute (cumulative) data distribution C compute Haar (or linear) wavelet transform
8、 of C coefficient thresholding : only b|C| coefficients can be kept take largest coefficients in absolute normalized value Haar basis: divide coefficients at resolution j by Optimal in terms of the overall Mean Squared (L2) Error Greedy heuristic methods Retain coefficients leading to large error re
9、duction Throw away coefficients that give small increase in error,Using Wavelet-based Histograms,Selectivity estimation: sel(a= X= b) = Cb - Ca-1 C is the (approximate) “reconstructed” cumulative distribution Time: O(minb, logN), where b = size of wavelet synopsis (no. of coefficients), N= size of d
10、omain,At most logN+1 coefficients are needed to reconstruct any C value,Haar Wavelet Coefficients,Reconstruct data values d(i) d(i) = (+/-1) * (coefficient on path) Range sum calculation d(l:h) d(l:h) = simple linear combination of coefficients on paths to l, h Only O(logN) terms,Original data,3 = 2
11、.75 - (-1.25) + 0 + (-1),6 = 4*2.75 + 4*(-1.25),Dynamic Maintenance of Wavelet-based Histograms MVW00,Build Haar-wavelet synopses on the original data distributionKey issues with dynamic wavelet maintenance Change in single distribution value can affect the values of many coefficients (path to the r
12、oot of the decomposition tree),d+,As distribution changes, “most significant” (e.g., largest) coefficients can also change! Important coefficients can become unimportant, and vice-versa,Effect of Distribution Updates,Key observation: for each coefficient c in the Haar decomposition tree c = ( AVG(le
13、ftChildSubtree(c) - AVG(rightChildSubtree(c) ) / 2,Only coefficients on path(d) are affected and each can be updated in constant time,Maintenance Architecture,“Shake up” when log reaches max size: for each insertion at d for each coefficient c on path(d) and in H : update c for each coefficient c on
14、 path(d) and not in H or H: insert c into H with probability proportional to 1/2h, where h is the “height” of c (Probabilistic Counting FM85) Adjust H and H (move largest coefficients to H),m+m top coefficients,m,m,INSERTIONS/ DELETIONS,Problems with Conventional Wavelet Synopses,An example data vec
15、tor and wavelet synopsis (|D|=16, B=8 largest coefficients retained),Original Data Values 127 71 87 31 59 3 43 99 100 42 0 58 30 88 72 130,Wavelet Answers 65 65 65 65 65 65 65 65 100 42 0 58 30 88 72 130,Large variation in answer quality Within the same data set, when synopsis is large, when data va
16、lues are about the same, when actual answers are about the same Heavily-biased approximate answers! Root causes Thresholding for aggregate L2 error metric Independent, greedy thresholding ( large regions without any coefficient!) Heavy bias from dropping coefficients without compensating for loss,Ap
17、proach: Optimize for Maximum-Error Metrics,Key metric for effective approximate answers: Relative error with sanity boundSanity bound “s” to avoid domination by small data valuesTo provide tight error guarantees for all reconstructed data valuesMinimize maximum relative error in the data reconstruct
18、ionAnother option: Minimize maximum absolute errorAlgorithms can be extended to general “distributive” metrics (e.g., average relative error),Minimize,Our Approach: Deterministic Wavelet Thresholding for Maximum Error,Key Idea: Dynamic-Programming formulation that conditions the optimal solution on
19、the error that “enters” the subtree (through the selection of ancestor nodes),Our DP table:Mj, b, S = optimal maximum relative (or, absolute) error in T(j) with space budget of b coefficients (chosen in T(j), assuming subset S of js proper ancestors have already been selected for the synopsis Clearl
- 1.请仔细阅读文档,确保文档完整性,对于不预览、不比对内容而直接下载带来的问题本站不予受理。
- 2.下载的文档,不会出现我们的网址水印。
- 3、该文档所得收入(下载+内容+预览)归上传者、原创作者;如果您是本文档原作者,请点此认领!既往收益都归您。
下载文档到电脑,查找使用更方便
2000 积分 0人已下载
下载 | 加入VIP,交流精品资源 |
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- AQUICKINTRODUCTIONTOAPPROXIMATEQUERYPROCESSINGPARTIIPPT

链接地址:http://www.mydoc123.com/p-377870.html