Advanced Computer Architecture.ppt
Advanced Computer Architecture
Chapter 4: Advanced Pipelining
Ioannis Papaefstathiou, CS 590.25, Easter 2003 (thanks to Hennessy & Patterson)
Chap. 4 - Pipelining II

Chapter Overview

4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting More Parallelism
4.7 Studies of ILP

Instruction Level Parallelism

ILP is the observation that many instructions in a program do not depend on one another, which means it is possible to execute those instructions in parallel. This is easier said than done: the issues include building compilers that can analyze the code, and building hardware that is even smarter than that code. This section looks at some of the problems to be solved.

Terminology

Basic block - the set of instructions between entry points and between branches. A basic block has only one entry and one exit; typically it is about 6 instructions long.
Loop-level parallelism - the parallelism that exists within a loop. Such parallelism can cross loop iterations.
Loop unrolling - a transformation by which either the compiler or the hardware exploits the parallelism inherent in the loop.

Pipeline Scheduling and Loop Unrolling

Simple Loop and its Assembler Equivalent

for (i = 1; i <= 1000; i++)
    x[i] = x[i] + s;

Loop: LD   F0,0(R1)   ; F0 = vector element
      ADDD F4,F0,F2   ; add scalar from F2
      SD   0(R1),F4   ; store result
      SUBI R1,R1,#8   ; decrement pointer by 8 bytes (DW)
      BNEZ R1,Loop    ; branch if R1 != zero
      NOP             ; delayed branch slot

This is a clean and simple example!

FP Loop Hazards

Instruction producing result    Instruction using result    Latency (clock cycles)
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0
Integer op                      Integer op                  0

Where are the stalls?

FP Loop Showing Stalls

10 clocks. Can we rewrite the code to minimize the stalls?

1  Loop: LD   F0,0(R1)   ; F0 = vector element
2        stall
3        ADDD F4,F0,F2   ; add scalar in F2
4        stall
5        stall
6        SD   0(R1),F4   ; store result
7        SUBI R1,R1,#8   ; decrement pointer by 8 bytes (DW)
8        stall
9        BNEZ R1,Loop    ; branch if R1 != zero
10       stall           ; delayed branch slot

Scheduled FP Loop Minimizing Stalls

Swap BNEZ and SD by changing the address of SD:

1  Loop: LD   F0,0(R1)
2        SUBI R1,R1,#8
3        ADDD F4,F0,F2
4        stall
5        BNEZ R1,Loop    ; delayed branch
6        SD   8(R1),F4   ; offset altered when moved past SUBI

The remaining stall is there because SD cannot proceed any earlier. Now 6 clocks; next, unroll the loop 4 times to make it faster.
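The stall accounting on the last two slides can be reproduced with a small in-order, single-issue model. The sketch below is my own illustration, not a tool from the course: the instruction class names and the extra integer-to-branch latency (needed to reproduce the stall between SUBI and BNEZ) are assumptions layered on top of the latency table above.

```python
# Latencies from the slide's table, keyed by (producer class, consumer class).
# The ("INT", "BRANCH") entry is an assumption: the branch tests the SUBI
# result one cycle after it is produced, which is what creates the extra stall.
LATENCY = {
    ("LOAD",  "FPALU"):  1,
    ("LOAD",  "STORE"):  0,
    ("FPALU", "FPALU"):  3,
    ("FPALU", "STORE"):  2,
    ("INT",   "INT"):    0,
    ("INT",   "BRANCH"): 1,
}

def schedule(instrs):
    """instrs: list of (name, cls, dest_reg, src_regs).
    Returns the issue cycle of the last instruction (total clocks)."""
    produced = {}   # register -> (issue cycle, producer class)
    cycle = 0
    for name, cls, dest, srcs in instrs:
        earliest = cycle + 1            # in-order, one issue per cycle
        for reg in srcs:
            if reg in produced:         # stall until the operand is ready
                pcycle, pcls = produced[reg]
                earliest = max(earliest, pcycle + 1 + LATENCY.get((pcls, cls), 0))
        cycle = earliest
        if dest:
            produced[dest] = (cycle, cls)
    return cycle

unscheduled = [
    ("LD",   "LOAD",   "F0", ["R1"]),
    ("ADDD", "FPALU",  "F4", ["F0", "F2"]),
    ("SD",   "STORE",  None, ["F4", "R1"]),
    ("SUBI", "INT",    "R1", ["R1"]),
    ("BNEZ", "BRANCH", None, ["R1"]),
    ("NOP",  "INT",    None, []),        # unfilled delay slot
]
scheduled = [
    ("LD",   "LOAD",   "F0", ["R1"]),
    ("SUBI", "INT",    "R1", ["R1"]),
    ("ADDD", "FPALU",  "F4", ["F0", "F2"]),
    ("BNEZ", "BRANCH", None, ["R1"]),
    ("SD",   "STORE",  None, ["F4", "R1"]),  # fills the delay slot
]

print(schedule(unscheduled))  # 10
print(schedule(scheduled))    # 6
```

The model places stalls slightly differently from the slide (it delays the dependent instruction rather than inserting a named stall slot), but the totals match: 10 clocks unscheduled, 6 clocks scheduled.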
Unroll Loop Four Times (straightforward way)

1  Loop: LD   F0,0(R1)
2        stall
3        ADDD F4,F0,F2
4        stall
5        stall
6        SD   0(R1),F4
7        LD   F6,-8(R1)
8        stall
9        ADDD F8,F6,F2
10       stall
11       stall
12       SD   -8(R1),F8
13       LD   F10,-16(R1)
14       stall
15       ADDD F12,F10,F2
16       stall
17       stall
18       SD   -16(R1),F12
19       LD   F14,-24(R1)
20       stall
21       ADDD F16,F14,F2
22       stall
23       stall
24       SD   -24(R1),F16
25       SUBI R1,R1,#32
26       BNEZ R1,LOOP
27       stall
28       NOP

15 issue slots + 4 x (1 + 2) stalls + 1 stall = 28 clock cycles, or 7 per iteration (assumes the number of loop iterations is a multiple of 4). Rewrite the loop to minimize stalls.
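The unrolling transformation itself can be sketched in a higher-level language. Below is a hypothetical Python rendering (not from the slides): local temporaries t0..t3 stand in for the extra registers F4/F8/F12/F16, and, as on the slide, the trip count is assumed to be a multiple of 4.

```python
def saxpy_rolled(x, s):
    """One element per iteration, like the original loop."""
    y = list(x)
    for i in range(len(y)):
        y[i] = y[i] + s
    return y

def saxpy_unrolled4(x, s):
    """Four elements per iteration; distinct temporaries avoid reusing
    one name for four independent computations (register renaming)."""
    y = list(x)
    assert len(y) % 4 == 0          # slide's assumption: trip count divisible by 4
    for i in range(0, len(y), 4):
        t0 = y[i]     + s           # plays the role of F4
        t1 = y[i + 1] + s           # F8
        t2 = y[i + 2] + s           # F12
        t3 = y[i + 3] + s           # F16
        y[i], y[i + 1], y[i + 2], y[i + 3] = t0, t1, t2, t3
    return y

data = [float(i) for i in range(8)]
print(saxpy_unrolled4(data, 2.5) == saxpy_rolled(data, 2.5))  # True
```

The equality check is the point: unrolling changes only the schedule, never the result.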
Unrolled Loop That Minimizes Stalls

What assumptions were made when the code was moved?
- Is it OK to move the store past the SUBI even though SUBI changes the register?
- Is it OK to move the loads before the stores - do we still get the right data?
- When is it safe for the compiler to make such changes?

1  Loop: LD   F0,0(R1)
2        LD   F6,-8(R1)
3        LD   F10,-16(R1)
4        LD   F14,-24(R1)
5        ADDD F4,F0,F2
6        ADDD F8,F6,F2
7        ADDD F12,F10,F2
8        ADDD F16,F14,F2
9        SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP
14       SD   8(R1),F16   ; 8 - 32 = -24

14 clock cycles, or 3.5 per iteration. No stalls!
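Putting the slides' cycle counts side by side shows what each step bought, per loop iteration:

```python
# Cycle counts taken directly from the slides: (total clocks, iterations covered).
versions = {
    "original":                 (10, 1),
    "scheduled":                (6,  1),
    "unrolled x4":              (28, 4),
    "unrolled x4 + scheduled":  (14, 4),
}

for name, (clocks, iters) in versions.items():
    print(f"{name:24s} {clocks / iters:4.1f} clocks/iteration")
```

Scheduling alone takes the loop from 10 to 6 clocks per iteration; unrolling alone reaches 7; only the combination reaches 3.5, roughly a 2.9x speedup over the original.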
Summary of Loop Unrolling Example

- Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount by which to adjust the SD offset.
- Determine that unrolling the loop would be useful by finding that the loop iterations are independent, except for the loop maintenance code.
- Use different registers to avoid the unnecessary constraints that would be forced by reusing the same registers for different computations.
- Eliminate the extra tests and branches and adjust the loop maintenance code.
- Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address.
- Schedule the code, preserving any dependences needed to yield the same result as the original code.

Compiler Perspectives on Code Movement

The compiler is concerned about dependences in the program; it is not concerned with whether a hazard arises on a given pipeline, since that is a hardware property. It tries to schedule code to avoid hazards. It looks for data dependences (RAW, if a hazard for the hardware):
- Instruction i produces a result used by instruction j, or
- Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
If instructions are dependent, they cannot execute in parallel. Dependences are easy to determine for registers (fixed names) but hard for memory: does 100(R4) = 20(R6)? From different loop iterations, does 20(R6) = 20(R6)?

Data Dependencies

1 Loop: LD   F0,0(R1)
2       ADDD F4,F0,F2
3       SUBI R1,R1,#8
4       BNEZ R1,Loop    ; delayed branch
5       SD   8(R1),F4   ; offset altered when moved past SUBI
Where are the data dependencies?

Another kind of dependence is a name dependence: two instructions use the same name (register or memory location) but do not exchange data.
- Anti-dependence (WAR, if a hazard for the hardware): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.
- Output dependence (WAW, if a hazard for the hardware): instruction i and instruction j write the same register or memory location; the ordering of the instructions must be preserved.
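The three hazard classes reduce to set intersections over the registers each instruction reads and writes. The snippet below is a toy check of my own, not a real compiler pass, applied to pairs from the loop above:

```python
def classify(i_writes, i_reads, j_writes, j_reads):
    """Hazard classes between an earlier instruction i and a later
    instruction j, given their register read/write sets."""
    kinds = []
    if i_writes & j_reads:
        kinds.append("RAW")   # true data dependence: j reads what i wrote
    if i_reads & j_writes:
        kinds.append("WAR")   # anti-dependence: j overwrites what i read
    if i_writes & j_writes:
        kinds.append("WAW")   # output dependence: both write the same name
    return kinds

# ADDD F4,F0,F2 then SD 8(R1),F4: the store reads the ADDD result.
print(classify({"F4"}, {"F0", "F2"}, set(), {"F4", "R1"}))  # ['RAW']

# SD 8(R1),F4 then SUBI R1,R1,#8: SUBI rewrites the R1 the store read.
print(classify(set(), {"F4", "R1"}, {"R1"}, {"R1"}))        # ['WAR']
```

Note that a WAR or WAW pair carries no data, which is why renaming (the "use different registers" step in the unrolling summary) can remove it entirely.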