Introduction to Apache Spark
Patrick Wendell - Databricks

What is Spark?
Fast and expressive cluster computing engine, compatible with Apache Hadoop
- Efficient: general execution graphs, in-memory storage
- Usable: rich APIs in Java, Scala, Python; interactive shell
2-5x less code; up to 10x faster on disk, 100x in memory

The Spark Community
+ You!

Today's Talk
- The Spark programming model
- Language and deployment choices
- Example algorithm (PageRank)

Spark Programming Model

Key Concept: RDDs
Resilient Distributed Datasets:
- Collections of objects spread across a cluster, stored in RAM or on disk
- Built through parallel transformations
- Automatically rebuilt on failure
Operations:
- Transformations (e.g. map, filter, groupBy)
- Actions (e.g. count, collect, save)
Write programs in terms of operations on distributed datasets.
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
. . .

(Diagram: the driver ships tasks to workers holding Block 1, Block 2, and Block 3; results return to the driver, and the cached messages partitions live in Cache 1, Cache 2, and Cache 3. Labels: Base RDD, Transformed RDD, Action.)

Full-text search of Wikipedia: 60 GB on 20 EC2 machines, 0.5 sec vs. 20 s for on-disk data.

Scaling Down

Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data.

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

(Lineage: HDFS File -> filter(func = startswith("ERROR")) -> Filtered RDD -> map(func = split("\t")) -> Mapped RDD)
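To make the lineage point concrete, here is a small sketch (not from the slides) that prints the lineage Spark records for the msgs RDD, using PySpark's toDebugString(); the HDFS path is illustrative, and a SparkContext sc is assumed as in the shell examples.

# Assumes an existing SparkContext `sc`; the path is a placeholder.
textFile = sc.textFile("hdfs://namenode:9000/logs/app.log")
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

# The debug string shows the chain textFile -> filter -> map that Spark
# would replay to rebuild any lost partitions of msgs.
print(msgs.toDebugString())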
Programming with RDDs

SparkContext
- Main entry point to Spark functionality
- Available in the shell as the variable sc
- In standalone programs, you'd make your own (see later for details)
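The slides defer standalone setup to later; as a minimal sketch (the app name and local master below are illustrative, not from the talk), creating your own SparkContext in a standalone PySpark program looks roughly like this:

from pyspark import SparkConf, SparkContext

# Build a configuration and create the context yourself, instead of relying
# on the `sc` variable that the interactive shell provides.
conf = SparkConf().setAppName("MyApp").setMaster("local[*]")  # illustrative values
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3])
print(rdd.count())  # => 3

sc.stop()  # release resources when the program finishes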
Creating RDDs

# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

# Use existing Hadoop InputFormat (Java/Scala only)
sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Basic Transformations

nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squares = nums.map(lambda x: x * x)          # => {1, 4, 9}

# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)  # => {4}

# Map each element to zero or more others
nums.flatMap(lambda x: range(x))             # => {0, 0, 1, 0, 1, 2}
# (range(x) is a sequence of numbers 0, 1, ..., x-1)

Basic Actions

nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
nums.collect()  # => [1, 2, 3]

# Return first K elements
nums.take(2)    # => [1, 2]

# Count number of elements
nums.count()    # => 3

# Merge elements with an associative function
nums.reduce(lambda x, y: x + y)  # => 6

# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
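One detail worth keeping in mind alongside these examples (a general Spark behavior, not stated on the slides): transformations are lazy and only describe the computation, while actions trigger it. A minimal sketch:

nums = sc.parallelize([1, 2, 3])

# Transformations: nothing is computed yet, Spark just records the plan.
squares = nums.map(lambda x: x * x)
even = squares.filter(lambda x: x % 2 == 0)

# The action launches the actual distributed job and returns a result.
print(even.collect())  # => [4]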
Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.

Python:
pair = (a, b)
pair[0]  # => a
pair[1]  # => b

Scala:
val pair = (a, b)
pair._1  // => a
pair._2  // => b

Java:
Tuple2 pair = new Tuple2(a, b);
pair._1  // => a
pair._2  // => b

Some Key-Value Operations

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
pets.reduceByKey(lambda x, y: x + y)  # => {(cat, 3), (dog, 1)}
pets.groupByKey()                     # => {(cat, [1, 2]), (dog, [1])}
pets.sortByKey()                      # => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side.
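Because reduceByKey combines values on the map side, it typically shuffles less data than groupByKey followed by a local sum; a small comparison sketch using the pets RDD from the slide above (output order may vary):

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

# reduceByKey: partial sums are computed per map task before the shuffle.
print(pets.reduceByKey(lambda x, y: x + y).collect())                # => [('cat', 3), ('dog', 1)]

# groupByKey: every value crosses the shuffle, then we sum on the reduce side.
print(pets.groupByKey().mapValues(lambda vals: sum(vals)).collect())  # => [('cat', 3), ('dog', 1)]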
Example: Word Count

lines = sc.textFile("hamlet.txt")
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)

Other Key-Value Operations

visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("index.html", "1.3.3.1")])

pageNames = sc.parallelize([("index.html", "Home"),
                            ("about.html", "About")])

visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))

Setting the Level of Parallelism
All the pair RDD operations take an optional second parameter for the number of tasks:

words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
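A quick way to see the effect of that second parameter (a sketch, not from the slides; the words RDD here is an illustrative stand-in) is to check the partition count of the result with getNumPartitions():

words = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])  # illustrative pair RDD

counts = words.reduceByKey(lambda x, y: x + y, 5)
print(counts.getNumPartitions())  # => 5, i.e. the reduce ran as 5 tasks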
Using Local Variables
Any external variables you use in a closure will automatically be shipped to the cluster:

query = sys.stdin.readline()
pages.filter(lambda x: query in x).count()

Some caveats:
- Each task gets a new copy (updates aren't sent back)
- Variable must be Serializable / Pickle-able
- Don't use fields of an outer object (ships all of it!); the sketch below shows the usual workaround
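The last caveat is the easiest to trip over; in this sketch (the Matcher class and its names are hypothetical, not from the talk), the fix is to copy the field into a local variable so that only the value, not the whole object, travels with the closure:

class Matcher(object):
    """Hypothetical driver-side object holding a query plus other large state."""

    def __init__(self, query):
        self.query = query

    def count_matches_bad(self, pages):
        # Referring to self.query captures `self`, so the entire Matcher
        # object is pickled and shipped to every task.
        return pages.filter(lambda x: self.query in x).count()

    def count_matches_good(self, pages):
        # Copy the field into a local variable: only this string is shipped.
        query = self.query
        return pages.filter(lambda x: query in x).count()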
Under The Hood: DAG Scheduler
- General task graphs
- Automatically pipelines functions
- Data locality aware
- Partitioning aware, to avoid shuffles
(Diagram of a task graph; legend: cached partition, RDD.)

More RDD Operators
map, filter, groupBy, sort, ...