Supercomputing in Plain English Instruction Level Parallelism.ppt
Supercomputing in Plain English
Instruction Level Parallelism

Henry Neeman, Director
OU Supercomputing Center for Education & Research
October 1, 2004

Supercomputing in Plain English: ILP, OU Supercomputing Center for Education & Research

Outline

What is Instruction-Level Parallelism?
Scalar Operation
Loops
Pipelining
Loop Performance
Superpipelining
Vectors
A Real Example

Parallelism

Less fish
More fish!

Parallelism means doing multiple things at the same time: you can get more work done in the same time.

What Is ILP?

Instruction-Level Parallelism (ILP) is a set of techniques for executing multiple instructions at the same time within the same CPU.

The problem: the CPU has lots of circuitry, and at any given time, most of it is idle.

The solution: have different parts of the CPU work on different operations at the same time. If the CPU has the ability to work on 10 operations at a time, then the program can run as much as 10 times as fast (although in practice, not quite so much).

Kinds of ILP

Superscalar: perform multiple operations at the same time (e.g., simultaneously perform an add, a multiply and a load).
Pipeline: start performing an operation on one piece of data while finishing the same operation on another piece of data; that is, perform different stages of the same operation on different sets of operands at the same time (like an assembly line).
Superpipeline: a combination of superscalar and pipelining; perform multiple pipelined operations at the same time.
Vector: load multiple pieces of data into special registers and perform the same operation on all of them at the same time.

What's an Instruction?

Memory: e.g., load a value from a specific address in main memory into a specific register, or store a value from a specific register into a specific address in main memory.
Arithmetic: e.g., add two specific registers together and put their sum in a specific register (or subtract, multiply, divide, square root, etc.).
Logical: e.g., determine whether two registers both contain nonzero values ("AND").
Branch: jump from one sequence of instructions to another.
... and so on.

What's a Cycle?

You've heard people talk about having a 2 GHz processor or a 3 GHz processor or whatever. (For example, Henry's laptop has a 1.5 GHz Pentium4.)

Inside every CPU is a little clock that ticks with a fixed frequency. We call each tick of the CPU clock a clock cycle or a cycle. So a 2 GHz processor has 2 billion clock cycles per second.

Typically, a primitive operation (e.g., add, multiply, divide) takes a fixed number of cycles to execute (assuming no pipelining).

What's the Relevance of Cycles?

IBM POWER4 [1]
  Multiply or add: 6 cycles (64 bit floating point)
  Load: 4 cycles from L1 cache, 14 cycles from L2 cache

Intel Pentium4 [2]
  Multiply: 7 cycles (64 bit floating point)
  Add, subtract: 5 cycles (64 bit floating point)
  Divide, square root: 38 cycles (64 bit floating point)
  Tangent: 225-250 cycles (64 bit floating point)

Scalar Operation

DON'T PANIC!

Scalar Operation

  z = a * b + c * d;

How would this statement be executed?

  Load a into register R0
  Load b into R1
  Multiply R2 = R0 * R1
  Load c into R3
  Load d into R4
  Multiply R5 = R3 * R4
  Add R6 = R2 + R5
  Store R6 into z

Does Order Matter?

  z = a * b + c * d;

  Load a into R0
  Load b into R1
  Multiply R2 = R0 * R1
  Load c into R3
  Load d into R4
  Multiply R5 = R3 * R4
  Add R6 = R2 + R5
  Store R6 into z

  Load d into R0
  Load c into R1
  Multiply R2 = R0 * R1
  Load b into R3
  Load a into R4
  Multiply R5 = R3 * R4
  Add R6 = R2 + R5
  Store R6 into z

In the cases where order doesn't matter, we say that the operations are independent of one another.

Superscalar Operation

  z = a * b + c * d;

  Load a into R0 AND load b into R1
  Multiply R2 = R0 * R1 AND load c into R3 AND load d into R4
  Multiply R5 = R3 * R4
  Add R6 = R2 + R5
  Store R6 into z

So, we go from 8 operations down to 5. (Note: there are lots of simplifying assumptions here.)

Loops

Loops Are Good

Most compilers are very good at optimizing loops, and not very good at optimizing other constructs.

  DO index = 1, length
    dst(index) = src1(index) + src2(index)
  END DO

Why?

Why Loops Are Good

Loops are very common in many programs. Also, it's easier to optimize loops than more arbitrary sequences of instructions: when a program does the same thing over and over, it's easier to predict what's likely to happen next. So, hardware vendors have designed their products to be able to execute loops quickly.

DON'T PANIC!
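
The step counts for z = a * b + c * d (8 operations in scalar form, 5 steps in superscalar form) can be reproduced with a small scheduling sketch. This is only an illustrative model, not how any real CPU issues instructions: the operation table, the unit names "mem" and "fp", the per-step limits (2 memory operations and 2 floating point operations) and the function name schedule are all assumptions made for this example.

```python
# Hypothetical model: each operation is (name, functional unit, inputs it waits for).
OPS = [
    ("load a", "mem", []),
    ("load b", "mem", []),
    ("mul R2", "fp", ["load a", "load b"]),
    ("load c", "mem", []),
    ("load d", "mem", []),
    ("mul R5", "fp", ["load c", "load d"]),
    ("add R6", "fp", ["mul R2", "mul R5"]),
    ("store z", "mem", ["add R6"]),
]


def schedule(ops, limits, width):
    """Greedy in-order list scheduling.

    In each step, walk the pending operations in program order and issue
    every one whose inputs were issued in an earlier step, as long as the
    per-unit limit and the total issue width are not exceeded.
    Assumes every operation completes in one step (a big simplification).
    """
    issued_names = set()
    pending = list(ops)
    steps = []
    while pending:
        used = {}
        this_step = []
        for name, unit, deps in pending:
            ready = all(d in issued_names for d in deps)
            has_room = len(this_step) < width and used.get(unit, 0) < limits[unit]
            if ready and has_room:
                used[unit] = used.get(unit, 0) + 1
                this_step.append(name)
        if not this_step:
            raise ValueError("no operation can be issued; check limits and dependencies")
        # Results only become visible to later steps.
        issued_names.update(this_step)
        pending = [op for op in pending if op[0] not in issued_names]
        steps.append(this_step)
    return steps


# Scalar: one operation per step. Superscalar: assumed 2 mem + 2 fp per step.
scalar = schedule(OPS, {"mem": 1, "fp": 1}, width=1)
superscalar = schedule(OPS, {"mem": 2, "fp": 2}, width=8)

if __name__ == "__main__":
    print(len(scalar), "scalar steps")
    for number, names in enumerate(superscalar, start=1):
        print(number, " AND ".join(names))
```

Under these assumptions the superscalar schedule groups the operations the same way as the Superscalar Operation slide. Allowing 4 memory operations per step would shorten it further, which is one of the simplifying assumptions the slide alludes to.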

Superscalar Loops

  DO i = 1, n
    z(i) = a(i)*b(i) + c(i)*d(i)
  END DO

Each of the iterations is completely independent of all of the other iterations; e.g.,
  z(1) = a(1)*b(1) + c(1)*d(1)
has nothing to do with
  z(2) = a(2)*b(2) + c(2)*d(2)
Operations that are independent of each other can be performed in parallel.

Superscalar Loops

  for (i = 0; i < n; i++) {
    z[i] = a[i] * b[i] + c[i] * d[i];
  }

  Load a[i] into R0 AND load b[i] into R1
  Multiply R2 = R0 * R1 AND load c[i] into R3 AND load d[i] into R4
  Multiply R5 = R3 * R4 AND load a[i+1] into R0 AND load b[i+1] into R1
  Add R6 = R2 + R5 AND load c[i+1] into R3 AND load d[i+1] into R4
  Store R6 into z[i] AND multiply R2 = R0 * R1
  etc etc etc

Once this loop is "in flight," each iteration adds only 2 operations to the total, not 8.

Example: IBM POWER4

8-way Superscalar: can execute up to 8 operations at the same time [1]:
  2 integer arithmetic or logical operations, and
  2 floating point arithmetic operations, and
  2 memory access (load or store) operations, and
  1 branch operation, and
  1 conditional operation

Pipelining

Pipelining

Pipelining is like an assembly line or a bucket brigade. An operation consists of multiple stages.
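
The "in flight" claim from the superscalar loop schedule (each extra iteration adds 2 steps rather than 8) can be turned into a rough cost estimate. The sketch below is only arithmetic on the slide's idealized numbers: the function names and the assumption that the first iteration costs 5 steps (the first five lines of the schedule) are illustrative, and real hardware adds latencies and memory stalls that this ignores.

```python
def scalar_steps(n, ops_per_iteration=8):
    """Steps for n iterations when the 8 operations run one at a time."""
    return n * ops_per_iteration


def in_flight_steps(n, first_iteration=5, per_extra_iteration=2):
    """Steps for n iterations of the overlapped schedule.

    Assumes the first result is stored at step 5 and each later
    iteration finishes 2 steps after the one before it.
    """
    if n <= 0:
        return 0
    return first_iteration + per_extra_iteration * (n - 1)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        ratio = scalar_steps(n) / in_flight_steps(n)
        print(n, scalar_steps(n), in_flight_steps(n), round(ratio, 2))
```

For large n the ratio approaches 8 / 2 = 4, so in this idealized model the overlapped loop runs about four times as fast as the scalar one.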