Big Data Analytics with Parallel Jobs
Ganesh Ananthanarayanan, University of California at Berkeley
http://www.cs.berkeley.edu/~ganesha/

Rising philosophy of data-ism
- Diagnostics and decisions backed by extensive data analytics: "In God we trust. Everybody else bring data to the table."
- Competitive and social benefits
- Dichotomy: ever-increasing data sizes and ever-decreasing latency targets

Parallelization
- Massive parallelization on large clusters
- Data is growing faster than Moore's law
- Handled by computation frameworks (e.g., Dryad)

Computation Frameworks
- A job is a DAG of tasks; outputs of upstream tasks are passed downstream
- A job's input (file) is divided among its tasks; every task reads one block of data
- Computation and storage are co-located: tasks are scheduled for locality (see the sketch below)
[Figure: Task1, Task2, Task3 each reading a Block on the machine whose Slot runs the task]
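The locality scheduling on this slide can be made concrete with a small sketch. This is my illustration, not code from the talk: `schedule_wave`, the machine names, and the slot model are all hypothetical.

```python
# Minimal sketch (not from the talk) of locality-aware task placement:
# each task prefers machines holding a replica of its input block, and
# falls back to any free slot (a remote read) otherwise.

def schedule_wave(tasks, free_slots):
    """tasks: list of (task_id, preferred_machines) pairs.
    free_slots: dict mapping machine -> number of free slots."""
    placement = {}
    for task_id, preferred in tasks:
        # Try data-local placement first.
        machine = next((m for m in preferred if free_slots.get(m, 0) > 0), None)
        if machine is None:
            # Otherwise take any free slot; the task reads its block remotely.
            machine = next((m for m, n in free_slots.items() if n > 0), None)
        if machine is None:
            break  # cluster full: remaining tasks wait for the next wave
        free_slots[machine] -= 1
        placement[task_id] = machine
    return placement

print(schedule_wave([("t1", ["m1"]), ("t2", ["m1"])], {"m1": 1, "m2": 1}))
# {'t1': 'm1', 't2': 'm2'}  (t2 runs non-locally)
```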
Promise of Big Data Analytics
- Goal: efficient execution of jobs, both in completion time and in utilization
- Challenge: scale (data size and parallelism) and performance variations (stragglers)
- Theme: efficient and fault-tolerant execution of parallel jobs on large clusters

IO intensive → Memory
- Tasks are IO intensive
- A task completes faster if its input is read from memory instead of disk: a memory-local task
- Memory prices are falling: 64 GB/machine at Facebook in Aug 2011; 256 GB/machine not uncommon now
Can we move all data to memory?
- There are proposals for making "RAM the new disk" (e.g., RAMCloud)
- But there is a huge discrepancy between storage and memory capacities: roughly 200x more data on disk than available memory

Use memory as a cache
- Production traces from Dryad and Hadoop clusters: thousands of machines, over 1 million jobs in all, spanning 6 months (2011-12)

Will the inputs fit in cache?
- Job input sizes follow a heavy-tailed distribution
- 92% of jobs can fit their inputs in memory

We built a memory cache
- A simple in-memory distributed cache for job input data
- Tasks are scheduled for memory locality
- Simple cache replacement policies: Least Recently Used (LRU) and Least Frequently Used (LFU), sketched below

We built a memory cache: the results
- Replayed the Facebook trace of Hadoop jobs
- Jobs sped up by only 10% with LFU, at a hit-ratio of 67%
- Even Belady's MIN (the optimal offline policy) gives only a 13% improvement, at a hit-ratio of 74%
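A minimal single-node sketch of the two baseline replacement policies named above. The cache in the talk is distributed across machines; this sketch (class and method names are mine) only illustrates how LRU and LFU choose a victim block.

```python
# LRU tracked via OrderedDict recency order; LFU via access counters.
from collections import OrderedDict, Counter

class BlockCache:
    def __init__(self, capacity_blocks, policy="LRU"):
        self.capacity = capacity_blocks
        self.policy = policy
        self.blocks = OrderedDict()   # block_id -> data, in recency order
        self.freq = Counter()         # block_id -> access count

    def get(self, block_id):
        hit = block_id in self.blocks
        if hit:
            self.blocks.move_to_end(block_id)  # refresh recency
            self.freq[block_id] += 1
        return hit

    def put(self, block_id, data):
        if block_id not in self.blocks and len(self.blocks) >= self.capacity:
            if self.policy == "LRU":
                victim, _ = self.blocks.popitem(last=False)  # least recent
            else:  # LFU
                victim = min(self.blocks, key=lambda b: self.freq[b])
                del self.blocks[victim]
            del self.freq[victim]
        self.blocks[block_id] = data
        self.freq[block_id] += 1
```

Both policies decide block by block; as the next slides argue, that is exactly why they help parallel jobs so little.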
How do we make caching really speed up parallel jobs?
- Parallel jobs require a new class of cache replacement algorithms

Parallel Jobs
- Tasks of a small job run simultaneously, in a single wave
[Figure: per-task durations with uncached vs. cached input]
- All-or-Nothing: unless all of a job's inputs are cached, there is no benefit

All-or-Nothing for multi-waved jobs
- Large jobs run their tasks in multiple waves: the number of tasks is larger than the number of slots
- Wave-width: the number of parallel tasks of a job
[Figure, animated build-up across several slides: five slots (slot1-slot5) over time; the completion time shrinks only when the inputs of an entire wave are cached]
- Consequence: cache at the wave-width granularity (modeled in the sketch below)
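A toy model, assuming fixed task durations, of why caching is all-or-nothing at wave granularity: a wave finishes at the pace of its slowest task, so caching 9 of 10 inputs in a wave buys nothing. The function and the numbers are my own illustration.

```python
# Each wave runs wave_width tasks in parallel; the wave's duration is
# that of its slowest task (disk-read time unless the input is cached).

def job_completion_time(n_tasks, wave_width, t_disk, t_mem, cached_tasks):
    """cached_tasks: set of task indices whose input is in memory."""
    total = 0.0
    for wave_start in range(0, n_tasks, wave_width):
        wave = range(wave_start, min(wave_start + wave_width, n_tasks))
        total += max(t_mem if t in cached_tasks else t_disk for t in wave)
    return total

# 50 tasks, wave-width 10: caching 9 of the first wave's 10 inputs buys
# nothing; caching all 10 shrinks that wave from t_disk to t_mem.
print(job_completion_time(50, 10, t_disk=60, t_mem=6, cached_tasks=set(range(9))))   # 300.0
print(job_completion_time(50, 10, t_disk=60, t_mem=6, cached_tasks=set(range(10))))  # 246.0
```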
Cache at the wave-width granularity
- Example: a job with 50 tasks and a wave-width of 10 benefits only from caching complete groups of 10 inputs
- All-or-nothing

How to evict from cache?
- View the cache at the granularity of a job's input (file)
- Evict from incompletely cached waves first: the Sticky Policy (sketched below)
[Figure: eight slots (slot1-slot8) running Job 1 and Job 2 at a 50% hit-ratio. Without the Sticky Policy, both jobs get partial hits and neither speeds up; with it, one job's input stays fully cached and that job completes early.]

Which file should be evicted?
- It depends on the metric to optimize:
  - User-centric metric: completion time of jobs
  - Operator-centric metric: utilization of the cluster
- What are the eviction policies for these metrics?
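A sketch of the Sticky Policy as I read this slide: evict blocks of files whose cached input is already incomplete, so that fully cached files keep their all-or-nothing benefit. How to choose among incomplete files is my assumption here (smallest cached fraction); the talk ranks files by the metrics developed on the next slides.

```python
# Hypothetical sketch: prefer victims from files that are already
# incompletely cached, concentrating evictions on one file rather than
# spreading them across files and breaking every job's full wave.

def pick_victim(files):
    """files: dict name -> {"cached": int, "total": int} block counts."""
    incomplete = {f: v for f, v in files.items()
                  if 0 < v["cached"] < v["total"]}
    pool = incomplete or {f: v for f, v in files.items() if v["cached"] > 0}
    # Assumption: within the pool, take from the least-complete file.
    return min(pool, key=lambda f: pool[f]["cached"] / pool[f]["total"])

cache = {"A": {"cached": 10, "total": 10},   # fully cached: protected
         "B": {"cached": 3,  "total": 10},   # incomplete: evicted from first
         "C": {"cached": 9,  "total": 10}}
print(pick_victim(cache))  # 'B'
```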
Reduction in Completion Time
- Idealized model for a job:
  - Wave-width of the job: W
  - Frequency of access, which predicts future accesses: F
  - Task duration, proportional to the data read: D
  - Speedup factor for cached tasks: μ
- Cost of caching the input: W × D
- Benefit of caching: F × μ × D
- Benefit/cost ratio: ∝ F/W
- For completion time of jobs, rank files by frequency/wave-width (sketched below)

How to estimate W for a job?
[Figure: wave-width (slots) vs. job size]
- Use the size of a file as a proxy for its wave-width: the relative ordering of files remains unchanged
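The benefit/cost ratio F/W translates directly into an eviction order, using file size as the proxy for wave-width W, as the slide above proposes. This is a minimal sketch of that ranking; the function and file names are hypothetical.

```python
# Keep files with the highest F/W; evict those with the lowest first.

def eviction_order(files):
    """files: dict name -> {"freq": accesses, "size": bytes (proxy for W)}.
    Returns cached files ordered first-to-evict -> last-to-evict."""
    score = lambda f: files[f]["freq"] / files[f]["size"]  # ~ F/W
    return sorted(files, key=score)

files = {
    "small_hot.log":  {"freq": 20, "size": 2.0},   # high F/W: keep
    "large_cold.log": {"freq": 2,  "size": 50.0},  # low F/W: evict first
}
print(eviction_order(files))  # ['large_cold.log', 'small_hot.log']
```

Favoring small, frequently accessed files maximizes the number of jobs whose inputs are fully cached, which is what the all-or-nothing property rewards.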
Improvement in Utilization
- Same idealized model: wave-width W, access frequency F, task duration D, speedup factor μ
- Cost of caching the input: W × D