Lecture: Computer Architecture - Chapter 4: Microarchitecture

PDF, 73 pages · Gia Huy · 16/05/2022

  1. COMPUTER ARCHITECTURE. Chapter 4: Microarchitecture. Computer Engineering – CSE – HCMUT
  2. Introduction
      • CPU performance factors
        – Instruction count: determined by ISA and compiler
        – CPI and cycle time: determined by CPU hardware
      • We will examine two MIPS implementations
        – A simplified version: CPI = 1
        – A more realistic pipelined version: CPI ≈ 1
      • Simple subset, shows most aspects
        – Memory reference: lw, sw
        – Arithmetic/logical: add, sub, and, or, slt
        – Control transfer: beq, j
      Computer Architecture (c) Cuong Pham-Quoc/HCMUT
  3. Instruction execution
      1. PC → instruction memory (cache), fetch instruction
      2. Register numbers → register file, read registers
      3. Depending on instruction class
         3.1. Use ALU to calculate
              • Arithmetic result → done
              • Memory address for load/store → 3.2
              • Branch target address → 3.3
         3.2. Access data memory for load/store
         3.3. PC ← target address or PC + 4
  4. Execution stages
      1. Instruction fetch (IF): PC → instruction address
      2. Instruction decode (ID): register operands → register file
      3. Execute (EXE):
         – Load/store: compute a memory address
         – Arithmetic/logical: compute an arithmetic/logical result
      4. Memory access (MEM):
         – Load: read data memory
         – Store: write data memory
      5. Write back (WB):
         – Store a result into the register file
  5. Datapath - controller
      • Datapath: contains information that is operated on by a functional unit
        – Instruction memory: contains instructions (IF)
        – Register file: 32 32-bit registers (ID & WB)
        – ALU: performs arithmetic and logical operations (EXE)
        – Data memory: contains data (MEM)
      • Control signals: used for multiplexer selection or for directing the operation of a functional unit
        – Control unit
        – Multiplexer
  6. Datapath overview
      [Figure: datapath overview]
  7. Multiplexer - MUX
      • S = 0 → C = A
      • S = 1 → C = B
  8. Controller overview
      [Figure: controller overview]
  9. Building datapath
      • Hardware components:
        [Figure: hardware components labeled with their stages: IF, ID, ID & WB, EXE, MEM]
  10. Instruction fetch
      • Main operations: fetching the next instruction from Instruction memory
        – PC → instruction address
        – Instruction memory → instruction (32 bits)
        – PC ← PC + 4 (using the Add component)
      • Results:
        – 32-bit machine instruction (Instruction)
        – Address of the next instruction in PC
  11. Instruction decode
      • Main operations: extract the fields of the machine instruction
        – R-format, branch, and store: rs, rt → Registers → values
        – I-format: rs → Registers → values & immediate → sign-extend
      • Results:
        – Values (32-bit) for the next stage
      • Example: add $s0, $s1, $s2 → 0|17|18|16|0|32
  12. Instruction execution
      • Main operation:
        – Calculate the arithmetic operation / memory address / register comparison
      • Results:
        – Arithmetic result or memory address (ALU result)
        – Comparison result (zero output)
      • R-format and bne & beq: both operands come from Registers
      • I-format (except bne & beq): one operand comes from Registers and one from sign-extend
  13. Memory access
      • Main operations:
        – Load: get data from Data memory
        – Store: write data from Registers to Data memory
      • Results:
        – Load: values of Data memory (Read data)
        – Store: N/A
  14. Write back
      • Main operations:
        – Write values back to Registers (arithmetic/load)
      • Result:
        – N/A
  15. Full datapath
      [Figure: full datapath]
  16. Example
      • Question: Assume that the processor is executing add $s0, $s1, $s2
        – Identify the values of the functional units' inputs/outputs
      • Answer:
        – The machine code is: 000000_10001_10010_10000_00000_100000₂
        – Instruction memory:
          • Instruction address = PC (word address where the above instruction is stored)
          • Instruction = machine code
        – Registers:
          • Read register 1 = 10001₂ ⇒ Read data 1 = content($s1)
          • Read register 2 = 10010₂ ⇒ Read data 2 = content($s2)
          • Write register = 10000₂ & Write data = content($s1) + content($s2)
        – Sign-extend: input = 10000_00000_100000₂
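The machine code in this example can be reproduced with a few shifts. The following is a minimal sketch, assuming the standard MIPS register numbering ($s0 = 16, $s1 = 17, $s2 = 18) and funct = 32 for add; the helper name `encode_r_format` is hypothetical and not part of the lecture.

```python
# Register numbers follow the standard MIPS convention ($s0 = 16, ...).
REGS = {"$s0": 16, "$s1": 17, "$s2": 18}


def encode_r_format(rs, rt, rd, shamt=0, funct=32, opcode=0):
    """Pack the six R-format fields into one 32-bit word (hypothetical helper)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# add $s0, $s1, $s2: rd = $s0, rs = $s1, rt = $s2, funct = 32
word = encode_r_format(rs=REGS["$s1"], rt=REGS["$s2"], rd=REGS["$s0"])

# Split the 32-bit string at the field boundaries 6|5|5|5|5|6.
bits = format(word, "032b")
fields = "_".join([bits[0:6], bits[6:11], bits[11:16],
                   bits[16:21], bits[21:26], bits[26:32]])
print(fields)  # 000000_10001_10010_10000_00000_100000
```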
  17. Building controller
      • Extracting bits from 32-bit instructions
        – Read register 1, Read register 2, and Write register
        – Sign-extend
        – Control block
      • Building a Control block:
        – Handling multiplexers
        – Handling control signals of functional units
          • RegWrite
          • MemRead
          • MemWrite
  18. Extracting bits
      • Registers:
        – Read register 1 ⇐ rs (instruction[25:21])
        – Read register 2 ⇐ rt (instruction[20:16])
        – Write register ⇐ rt/rd → need a multiplexer
      • Sign-extend ⇐ address (instruction[15:0])

      Format     | 31:26    | 25:21 | 20:16 | 15:11 | 10:6  | 5:0
      R-type     | 0        | rs    | rt    | rd    | shamt | funct
      Load/Store | 35 or 43 | rs    | rt    | address (15:0)
      Branch     | 4        | rs    | rt    | address (15:0)
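The bit ranges above map directly onto shift-and-mask operations. Here is a small sketch; the function names `extract_fields` and `sign_extend16` are illustrative assumptions, and the second test word is built by hand for lw $s0, 100($s2) using the standard register numbers ($s0 = 16, $s2 = 18).

```python
def extract_fields(instr):
    """Slice a 32-bit instruction word into the fields used by the datapath."""
    return {
        "opcode": (instr >> 26) & 0x3F,  # instruction[31:26]
        "rs": (instr >> 21) & 0x1F,      # instruction[25:21] -> Read register 1
        "rt": (instr >> 16) & 0x1F,      # instruction[20:16] -> Read register 2
        "rd": (instr >> 11) & 0x1F,      # instruction[15:11]
        "shamt": (instr >> 6) & 0x1F,    # instruction[10:6]
        "funct": instr & 0x3F,           # instruction[5:0]
        "address": instr & 0xFFFF,       # instruction[15:0] -> sign-extend input
    }


def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


add_fields = extract_fields(0x02328020)  # add $s0, $s1, $s2
lw_word = (35 << 26) | (18 << 21) | (16 << 16) | 100  # lw $s0, 100($s2)
lw_fields = extract_fields(lw_word)
```

All fields are extracted for every instruction; the control block then decides which ones are actually used (for example, rd is meaningless for a load).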
  19. Datapath with bit-selection
      [Figure: datapath with bit-selection]
  20. Control block
      • Main function: handling multiplexers and functional units' control signals
        – ALU: two levels of decoding
          • Level 1: 6-bit opcode → 2-bit ALUop
          • Level 2: 2-bit ALUop (+ 6-bit function field) → 4-bit ALU operation
        – Muxes and control signals of functional units (except ALU): 6-bit opcode
      • Predefined ALU operation values

        Operator | ALU Operation
        and      | 0000
        or       | 0001
        add      | 0010
        sub      | 0110
        slt      | 0111

      • ALUop:
        – opcode → add operator → ALUop = 00
        – opcode → sub operator → ALUop = 01
        – opcode → unknown → ALUop = 10
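The two-level ALU decoding can be sketched as two small lookup functions. The ALU operation codes and ALUop values come from the slide; the opcode values (0, 35, 43, 4) come from the instruction formats in this chapter, and the funct codes (32, 34, 36, 37, 42) are the standard MIPS encodings, which the slide does not list. Function names are hypothetical.

```python
# 4-bit ALU operation codes from the slide.
ALU_OPERATION = {"and": "0000", "or": "0001", "add": "0010", "sub": "0110", "slt": "0111"}

# Standard MIPS funct codes (assumption: not listed on the slide).
FUNCT_TO_OPERATOR = {32: "add", 34: "sub", 36: "and", 37: "or", 42: "slt"}


def alu_op_from_opcode(opcode):
    """Level 1: 6-bit opcode -> 2-bit ALUop."""
    if opcode in (35, 43):  # lw, sw: the ALU adds base and offset
        return "00"
    if opcode == 4:         # beq: the ALU subtracts to compare
        return "01"
    if opcode == 0:         # R-format: operator unknown until funct is read
        return "10"
    raise ValueError(f"opcode {opcode} is outside the subset in this sketch")


def alu_control(alu_op, funct):
    """Level 2: 2-bit ALUop (+ 6-bit funct) -> 4-bit ALU operation."""
    if alu_op == "00":
        return ALU_OPERATION["add"]
    if alu_op == "01":
        return ALU_OPERATION["sub"]
    return ALU_OPERATION[FUNCT_TO_OPERATOR[funct]]
```

For example, `alu_control(alu_op_from_opcode(0), 42)` selects the slt operation, while a load ignores the funct bits entirely.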
  21. Full microarchitecture
      [Figure: full single-cycle microarchitecture: PC, instruction memory, registers, ALU, data memory, sign-extend, shift-left-2 and Add units, ALU Control, and Control unit, with control signals PCSrc, Branch, RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, and MemtoReg]
  22. Exercise
      • Question: assume that each register stores a value of two times its number (for instance, $s1 stores 17 × 2 = 34). Please identify:
        1. Which functional units contribute to the processing of the following instructions
        2. Values of inputs/outputs of the functional units when processing the following instructions
        3. Values of control signals when processing the following instructions
        – add $t0, $t1, $t2
        – addi $s0, $s1, 100
        – lw $s0, 100($s2)   # memory word at address 136 stores the value 2021
        – sw $s0, 100($s2)
        – beq $s1, $s0, L1
  23. Unconditional jump instructions
      • Chapter 2: update PC with the concatenation of
        – Top 4 bits of old PC
        – 26-bit jump address
        – 00
        PC = {PC[31:28], address, 00}
      • One more way to update PC ⇒ need a multiplexer & an extra control signal decoded from opcode
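The concatenation {PC[31:28], address, 00} amounts to a mask, a shift, and an OR. A minimal sketch follows; the function name `jump_target` and the sample addresses are made up for illustration.

```python
def jump_target(pc, address26):
    """Build the new PC as {PC[31:28], address, 00}."""
    top4 = pc & 0xF0000000                     # keep the top 4 bits of the old PC
    shifted = (address26 & 0x03FFFFFF) << 2    # 26-bit address followed by 00
    return top4 | shifted


new_pc = jump_target(0x40000008, 0x0000010)
print(hex(new_pc))  # 0x40000040
```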
  24. Datapath with jumps added
      [Figure: datapath with jumps added]
  25. Performance issue
      • Simplified version (CPI = 1): every instruction is executed in exactly one cycle
        – The longest delay determines the clock period
        – What is the longest instruction?
          • Load: instruction memory → register file → ALU → data memory → register file
      • Not feasible to vary the period for different instructions
      • We will improve performance by pipelining
  26. Pipelining analogy
      • Pipelined laundry:
        – Overlapping execution
        – Improving performance (time for the entire group)
      • Four loads
        – Speed-up = 2.3×
        – Not impressive
      • Non-stop (#loads → ∞)
        – Speed-up?
        – Number of stages
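The 2.3× figure can be checked numerically. This sketch assumes four equal-length laundry stages, as in the classic version of the analogy; the slide itself does not state the stage count, so treat that number as an assumption.

```python
def pipeline_speedup(num_tasks, num_stages):
    """Speed-up of an ideal pipeline with equal-length stages.

    Sequential: every task occupies all stages before the next one starts.
    Pipelined: the first task takes num_stages steps, and each later task
    finishes one step after the previous one.
    """
    sequential_steps = num_tasks * num_stages
    pipelined_steps = num_stages + (num_tasks - 1)
    return sequential_steps / pipelined_steps


four_loads = pipeline_speedup(4, 4)        # 16 / 7
many_loads = pipeline_speedup(10**6, 4)    # approaches the number of stages
print(round(four_loads, 1), round(many_loads, 3))
```

With four loads the ratio is 16/7, which rounds to 2.3; as the number of loads grows, the ratio approaches the number of stages.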
  27. MIPS pipeline
      • Five stages, one step per stage
        1. Instruction fetch (IF): PC → instruction address
        2. Instruction decode (ID): register operands → register file
        3. Execute (EXE):
           • Load/store: compute a memory address
           • Arithmetic/logical: compute an arithmetic/logical result
        4. Memory access (MEM):
           • Load: read data memory
           • Store: write data memory
        5. Write back (WB):
           • Store a result into the register file
  28. Pipeline performance
      • Assume time for stages is
        – 100 ps for register read or write (ID & WB)
        – 200 ps for other stages (IF, EXE, & MEM)
      • Compare pipelined datapath with single-cycle datapath

      Instruction | Instr fetch | Register read | ALU op | Memory access | Register write | Total time
      lw          | 200 ps      | 100 ps        | 200 ps | 200 ps        | 100 ps         | 800 ps
      sw          | 200 ps      | 100 ps        | 200 ps | 200 ps        |                | 700 ps
      R-format    | 200 ps      | 100 ps        | 200 ps |               | 100 ps         | 600 ps
      beq         | 200 ps      | 100 ps        | 200 ps |               |                | 500 ps
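The totals in the table, and the two clock periods they imply, can be recomputed from the stage delays. This is only a sketch under the slide's assumptions; the dictionary names are illustrative.

```python
# Stage delays in picoseconds, as assumed on the slide.
STAGE_PS = {"IF": 200, "ID": 100, "EXE": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually uses.
USED_STAGES = {
    "lw": ["IF", "ID", "EXE", "MEM", "WB"],
    "sw": ["IF", "ID", "EXE", "MEM"],
    "R-format": ["IF", "ID", "EXE", "WB"],
    "beq": ["IF", "ID", "EXE"],
}

totals = {name: sum(STAGE_PS[s] for s in stages) for name, stages in USED_STAGES.items()}

# Single-cycle: the slowest instruction sets the clock period.
single_cycle_period = max(totals.values())
# Pipelined: the slowest stage sets the clock period.
pipelined_period = max(STAGE_PS.values())

print(totals, single_cycle_period, pipelined_period)
```

The ratio 800 ps / 200 ps = 4 is below the stage count of 5 because the stages are not balanced.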
  29. Pipeline performance
      [Figure: timing comparison of single-cycle execution (Tc = 800 ps) and pipelined execution (Tp = 200 ps)]
  30. Pipeline speed-up
      $$\text{speed-up} = \frac{\text{time of single-cycle}}{\text{time of pipelined}} = \frac{\text{time between instructions}_{\text{single-cycle}}}{\text{time between instructions}_{\text{pipelined}}}$$
      • If all stages are balanced:
        – speed-up = number of pipe stages
      • If not balanced, speed-up is less
      • Source of speed-up
        – Throughput increased
        – Latency (time for each instruction) does not decrease
          • Sometimes increased
  31. Pipeline datapath
      [Figure: pipeline datapath]
  32. Instructions execution
      [Figure: multi-cycle pipeline diagram]
      • How can we keep data when stages are not balanced?
  33. Wholesale market example
      [Figure: trucks passing four stores in sequence: weight = 100 kg, time = 15 minutes; weight = 1 ton, time = 1 hour; weight = 1 ton, time = 1 hour; weight = 100 kg, time = 15 minutes]
      • Trucks move forward when completed
        – Accident at the Apple store
        – Barriers can help: open every hour (cycle)
  34. Pipeline registers
      • Barriers in wholesale markets ⇔ Registers in digital circuits
      [Figure: datapath divided by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]
  35. Example
      • Question: Given the following MIPS sequence:
          lw  $s0, 20($s1)
          sub $t2, $s2, $s3
          add $t3, $s3, $s4
          lw  $t4, 24($s1)
          add $t5, $s5, $s6
        Assume that the sequence is executed by a 5-stage pipelined MIPS processor
        a) Draw a multi-cycle pipeline diagram for the sequence
        b) Analyze the 5th cycle with the datapath diagram in the previous slide
  36. Multi-cycle pipeline diagram
      [Figure: multi-cycle pipeline diagram of the sequence]
  37. Single-cycle pipeline diagram
      [Figure: pipelined datapath in one cycle, with the instructions in stages IF to WB: add $t5, $s5, $s6; lw $t4, 24($s1); add $t3, $s3, $s4; sub $t2, $s2, $s3; lw $s0, 20($s1)]
      • Anything wrong?
  38. Corrected pipeline datapath
      [Figure: corrected pipelined datapath; the write-register selection (RegDst, choosing between [20:16] and [15:11]) is carried along the pipeline registers]
  39. Example of lw - IF
      [Figure: corrected pipelined datapath with lw in the IF stage]
  40. Example of lw - ID
      [Figure: corrected pipelined datapath with lw in the ID stage]
  41. Example of lw - EXE
      [Figure: corrected pipelined datapath with lw in the EXE stage]
  42. Example of lw - MEM
      [Figure: corrected pipelined datapath with lw in the MEM stage]
  43. Example of lw - WB
      [Figure: corrected pipelined datapath with lw in the WB stage]
  44. Control signals
      • Control signals travel across the pipeline registers
        – IF/ID → Control
        – ID/EX: WB, M, EX (EX signals: RegDst, ALUSrc, ALUOp (2 bit))
        – EX/MEM: WB, M (M signals: MemRead, MemWrite, Branch)
        – MEM/WB: WB (WB signals: MemtoReg, RegWrite)
  45. Exercise
      • Question: Given the following MIPS sequence:
          lw  $s0, 20($s1)
          sub $t2, $s2, $s3
          add $t3, $s3, $s4
          lw  $t4, 24($s1)
          add $t5, $s5, $s6
        Assume that the sequence is executed by a 5-stage pipelined MIPS processor
        a) Identify values of control signals in cycle 5 at the functional units
        b) Identify values of control signals in cycle 5 at the Control block
  46. Full pipeline micro-architecture
      [Figure: full pipelined datapath with the Control unit; control signals are carried through the ID/EX (WB, M, EX), EX/MEM (WB, M), and MEM/WB (WB) pipeline registers]
  47. Hazards
      • Situations that prevent starting the next instruction in the next cycle
      • Structural hazard
        – A required resource is busy
      • Data hazard
        – Need to wait for a previous instruction to complete its data read/write
      • Control hazard
        – Deciding on a control action depends on a previous instruction
  48. Structural hazards
      • Conflict for use of a resource
        – e.g., in the laundry process, B forgot to move clothes from the washing machine to the dryer
      • Should be eliminated entirely
        – Stall the pipe for that cycle
        – May repeat many times
      • MIPS processors have already solved all structural hazards
        – Separate instruction and data memories (caches)
        – Register reads and writes use different ports
  49. Data hazards
      • An instruction depends on completion of data access by a previous instruction
          sub $t0, $t2, $t3
          add $t3, $t0, $t1
      • No issue with the simplified version
      • Problem with the pipeline
  50. Control hazards
      • Branch determines flow of control
        – Fetching the next instruction depends on the branch outcome
      • The pipeline updates the PC in the MEM stage
        – When the next instruction is fetched, the pipeline is still working on the ID stage of the branch
  51. Data hazards solutions
      1. Code rescheduling
         – Done by the compiler (software level)
         – Sometimes cannot find a solution
      2. Delay or stall insertion
         – Done by hardware
         – Can always find a solution
         – Increases execution time & needs extra hardware resources
      3. Forwarding
         – Done by hardware
         – Cannot find a solution in one special case
         – Requires extra hardware resources
  52. Code re-scheduling
      • When do data hazards not happen?
        – When the read occurs after, or in the same cycle as, the write (RAW)
      • Find and swap data-independent instructions
  53. Example
      • Question: given the following sequence of MIPS instructions
        – Identify data hazards and solve them by code re-scheduling

        Original sequence       | Re-scheduled sequence
        1: lw   $t1, 0($t0)     | 1: lw   $t1, 0($t0)
        2: lw   $t2, 4($t0)     | 2: lw   $t2, 4($t0)
        3: add  $t3, $t1, $t2   | 5: lw   $t4, 8($t0)
        4: sw   $t5, 12($t0)    | 7: addi $t6, $t0, 4
        5: lw   $t4, 8($t0)     | 3: add  $t3, $t1, $t2
        6: add  $t5, $t1, $t4   | 4: sw   $t5, 12($t0)
        7: addi $t6, $t0, 4     | 6: add  $t5, $t1, $t4

      • Answer:
        – The third and the sixth instructions (add $t3, $t1, $t2 and add $t5, $t1, $t4) have data hazards
        – Move the two data-independent instructions 5 and 7 to right after the second instruction
  54. Stalls insertion
      • When do data hazards not happen?
        – When the read occurs after, or in the same cycle as, the write (RAW)
      • Delay instructions that use data:
        – ID stage of the instruction using data ≡ WB stage of the instruction producing data (same cycle)
  55. Example
      • Question: given the following MIPS sequence of instructions
          1: lw  $t1, 0($t0)
          2: lw  $t2, 4($t0)
          3: add $t3, $t1, $t2
          4: sw  $t3, 12($t0)
        – Identify data hazards and solve them by the stalls insertion method; how many cycles are needed for the sequence?
      • Answer:
        – Data hazards: (2) - (3) & (3) - (4)
        – Insert two stalls for the 3rd instruction & two stalls for the 4th instruction
        – 12 cycles needed
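The 12-cycle answer can be checked with a small scheduling model. This is a sketch under the assumptions stated in this chapter: a 5-stage pipeline, no forwarding, and a consumer allowed to decode in the same cycle in which its producer writes back. The function name and the `(dest, sources)` tuple encoding are hypothetical.

```python
def cycles_with_stalls(program):
    """Count cycles for an in-order 5-stage pipeline that resolves data
    hazards by stalling only (no forwarding).

    program: list of (dest, sources); dest is None when no register is written.
    """
    last_wb = {}   # register -> WB cycle of its most recent producer
    prev_id = 1    # makes the first instruction decode in cycle 2 (IF in cycle 1)
    wb_cycle = 0
    for dest, sources in program:
        id_cycle = prev_id + 1                   # in-order: one ID per cycle
        for reg in sources:
            # ID may not come before the producer's WB cycle.
            id_cycle = max(id_cycle, last_wb.get(reg, 0))
        wb_cycle = id_cycle + 3                  # ID -> EXE -> MEM -> WB
        if dest is not None:
            last_wb[dest] = wb_cycle
        prev_id = id_cycle
    return wb_cycle


sequence = [
    ("$t1", ["$t0"]),           # 1: lw  $t1, 0($t0)
    ("$t2", ["$t0"]),           # 2: lw  $t2, 4($t0)
    ("$t3", ["$t1", "$t2"]),    # 3: add $t3, $t1, $t2
    (None, ["$t3", "$t0"]),     # 4: sw  $t3, 12($t0)
]
print(cycles_with_stalls(sequence))  # 12
```

Without any dependences, four instructions would finish in 5 + 3 = 8 cycles, so the four stalls account for the difference.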
  56. Example (cont.)
      [Figure: pipeline diagram over CK1–CK12 (stages IM, REG, ALU, DM, REG) for lw $t1, 0($t0); lw $t2, 4($t0); add $t3, $t1, $t2; sw $t3, 12($t0), with two stalls inserted for the add and two stalls for the sw]
  57. Forwarding
      • When can an instruction use data at the soonest?
        – Right after the data is produced (EXE stage & MEM stage)
      • When is data processed?
        – Apply operators: EXE stage
      • Use data right after it is created
  58. Example
      • Question: given the following MIPS sequence
          1: sub $s2, $s1, $s3
          2: and $s7, $s2, $s5
          3: or  $s8, $s6, $s2
          4: add $s0, $s2, $s2
          5: sw  $s5, 100($s2)
        – Analyze data dependencies & identify data hazards
      • Answer:
        – $s2, produced by the 1st instruction, is used by all other instructions
        – Data hazards: 2nd & 3rd instructions
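The answer can be derived mechanically: without forwarding, a consumer is at risk only when it sits one or two instructions after its producer, because at a distance of three its ID stage coincides with the producer's WB stage. A minimal sketch under that assumption, with a hypothetical `(dest, sources)` encoding:

```python
def find_data_hazards(program):
    """Return (producer, consumer, register) triples, numbered from 1,
    for consumers located 1 or 2 instructions after their producer."""
    hazards = []
    for i, (dest, _) in enumerate(program):
        if dest is None:
            continue
        for j in range(i + 1, min(i + 3, len(program))):
            next_dest, next_sources = program[j]
            if dest in next_sources:
                hazards.append((i + 1, j + 1, dest))
            if next_dest == dest:
                break  # a later write hides this producer from further readers
    return hazards


sequence = [
    ("$s2", ["$s1", "$s3"]),   # 1: sub $s2, $s1, $s3
    ("$s7", ["$s2", "$s5"]),   # 2: and $s7, $s2, $s5
    ("$s8", ["$s6", "$s2"]),   # 3: or  $s8, $s6, $s2
    ("$s0", ["$s2", "$s2"]),   # 4: add $s0, $s2, $s2
    (None, ["$s5", "$s2"]),    # 5: sw  $s5, 100($s2)
]
print(find_data_hazards(sequence))  # [(1, 2, '$s2'), (1, 3, '$s2')]
```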
  59. Example (cont.)
      • Answer (cont.):
        [Figure: pipeline diagram of the sequence showing the data dependencies on $s2]
  60. Forwarding
      • Consider the previous sequence of MIPS instructions
        – Data can be forwarded

      Instruction       | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 | CC8 | CC9
      sub $s2,$s1,$s3   | IM  | REG | ALU | DM  | REG |     |     |     |
      and $s7,$s2,$s5   |     | IM  | REG | ALU | DM  | REG |     |     |
      or  $s8,$s6,$s2   |     |     | IM  | REG | ALU | DM  | REG |     |
      add $s0,$s2,$s2   |     |     |     | IM  | REG | ALU | DM  | REG |
      sw  $s5,100($s2)  |     |     |     |     | IM  | REG | ALU | DM  | REG
  61. Load-use data hazards
      • When the instruction producing data is a load, can the data be forwarded?
        – It cannot be forwarded to the next instruction, since the data is produced later (in the MEM stage)
        – Delay:
          • 1 stall with forwarding or 2 stalls without forwarding
  62. Detecting data hazards
      • EXE hazard versus MEM hazard
        – EXE hazard: forward from the EX/MEM register
          • Destination register in the MEM stage ≡ one of the source registers of the instruction in EXE
        – MEM hazard: forward from the MEM/WB register
          • Destination register in the WB stage ≡ one of the source registers of the instruction in EXE
      • Passing register numbers along the pipe
        – ID/EX.rs & ID/EX.rt: first & second sources
        – ID/EX.rd; EX/MEM.rd; MEM/WB.rd: destination
  63. EX hazard conditions
      [Figure: pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB carrying the register numbers ID/EX.rs ([25:21]), ID/EX.rt ([20:16]), EX/MEM.rd, and MEM/WB.rd (selected by RegDst from [20:16] and [15:11])]
      1a. ID/EX.rs = EX/MEM.rd
      1b. ID/EX.rt = EX/MEM.rd (does not occur when an I-format instruction is in the EXE stage)
  64. MEM hazard conditions
      • Consider the following MIPS sequence of instructions
          1: add $s0, $s0, $s1
          2: add $s0, $s0, $s2
          3: add $s0, $s0, $s3
      • Both EX and MEM hazards seem to occur
        – The EX hazard is the correct one (it carries the most recent value)
      • MEM hazard conditions
        2a. (ID/EX.rs = MEM/WB.rd) & (ID/EX.rs != EX/MEM.rd)
        2b. (ID/EX.rt = MEM/WB.rd) & (ID/EX.rt != EX/MEM.rd) (does not occur when an I-format instruction is in the EXE stage)
  65. Datapath with forwarding
      [Figure: pipelined datapath with a Forwarding unit; multiplexers F1 and F2 select the ALU operands (oprd 1, oprd 2); the Forwarding unit takes ID/EX.rs, ID/EX.rt, EX/MEM.rd, MEM/WB.rd, EX/MEM.RegWrite, and MEM/WB.RegWrite as inputs; EX control signals: (1) ALUSrc, (2) RegDst, (3) ALUop]
  66. Forwarding control values

      Mux value (binary) | Source | Explanation
      F1 = 00            | ID/EX  | The first ALU operand comes from the register file
      F1 = 10            | EX/MEM | The first ALU operand is forwarded from the prior ALU result
      F1 = 01            | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result
      F2 = 00            | ID/EX  | The second ALU operand comes from the register file
      F2 = 10            | EX/MEM | The second ALU operand is forwarded from the prior ALU result
      F2 = 01            | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result
  67. Forwarding conditions
      • EX hazard
        – 1a: if (EX/MEM.RegWrite and (EX/MEM.rd ≠ 0) and (EX/MEM.rd = ID/EX.rs)) F1 = 10
        – 1b: if (EX/MEM.RegWrite and (EX/MEM.rd ≠ 0) and (EX/MEM.rd = ID/EX.rt)) F2 = 10
      • MEM hazard
        – 2a: if (MEM/WB.RegWrite and (MEM/WB.rd ≠ 0)
              and not (EX/MEM.RegWrite and (EX/MEM.rd ≠ 0) and (EX/MEM.rd = ID/EX.rs))
              and (MEM/WB.rd = ID/EX.rs)) F1 = 01
        – 2b: if (MEM/WB.RegWrite and (MEM/WB.rd ≠ 0)
              and not (EX/MEM.RegWrite and (EX/MEM.rd ≠ 0) and (EX/MEM.rd = ID/EX.rt))
              and (MEM/WB.rd = ID/EX.rt)) F2 = 01
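Conditions 1a to 2b translate almost literally into code. The sketch below models the forwarding unit as a pure function; the parameter names mirror the pipeline-register fields, while the function name and the string encoding of F1/F2 are illustrative choices, not part of the lecture.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_rd, ex_mem_reg_write,
                    mem_wb_rd, mem_wb_reg_write):
    """Return the mux selections (F1, F2) as 2-bit strings."""

    def select(source_reg):
        ex_hazard = (ex_mem_reg_write and ex_mem_rd != 0
                     and ex_mem_rd == source_reg)
        mem_hazard = (mem_wb_reg_write and mem_wb_rd != 0
                      and mem_wb_rd == source_reg)
        if ex_hazard:
            return "10"   # 1a / 1b: forward from EX/MEM (most recent result)
        if mem_hazard:
            return "01"   # 2a / 2b: forward from MEM/WB only if no EX hazard
        return "00"       # no hazard: operand comes from the register file

    return select(id_ex_rs), select(id_ex_rt)


# Third instruction of: add $s0,$s0,$s1 / add $s0,$s0,$s2 / add $s0,$s0,$s3
# ($s0 = 16, $s3 = 19): both EX/MEM.rd and MEM/WB.rd equal $s0.
f1, f2 = forwarding_unit(id_ex_rs=16, id_ex_rt=19,
                         ex_mem_rd=16, ex_mem_reg_write=True,
                         mem_wb_rd=16, mem_wb_reg_write=True)
print(f1, f2)  # 10 00
```

Testing the EX hazard first is what implements the "and not (EX hazard)" clause in 2a and 2b.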
  68. Detecting load-use data hazards
      • Check when the instruction using the data is decoded in the ID stage
        – The instruction producing the data (load) is in the EXE stage
        – The sooner the better, since a stall must be inserted
      • Load-use hazard when
        – ID/EX.MemRead and ((ID/EX.rt = IF/ID.rs) or (ID/EX.rt = IF/ID.rt))
      • How to stall the pipeline?
        – Force the control signals in the ID/EX register to 0 ⇒ EXE, MEM, and WB do nothing
        – Prevent updating the PC and the IF/ID register
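The detection condition is a single boolean expression. A minimal sketch, with a hypothetical function name and standard MIPS register numbers ($t0 = 8, $t1 = 9, $t2 = 10) used in the example:

```python
def load_use_hazard(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in EXE is a load whose destination (rt)
    is a source register of the instruction currently being decoded."""
    return bool(id_ex_mem_read
                and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


# lw $t2, 4($t0) is in EXE (rt = 10); add $t3, $t1, $t2 is in ID (rs = 9, rt = 10).
must_stall = load_use_hazard(id_ex_mem_read=True, id_ex_rt=10,
                             if_id_rs=9, if_id_rt=10)
print(must_stall)  # True
```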
  69. [Figure: pipelined datapath with the Forwarding unit (inputs ID/EX.rs, ID/EX.rt, EX/MEM.rd, MEM/WB.rd, EX/MEM.RegWrite, MEM/WB.RegWrite; outputs F1, F2) and a Hazard detection unit (inputs ID/EX.MemRead, ID/EX.rt, IF/ID.rs & rt) whose LU-hazard output selects 0 for the control signals entering ID/EX]
  70. Branch hazards solutions
      • Predict outcome of branch
        – Only stall if prediction is wrong
      [Figure: pipeline diagram over CC1–CC9 (stages IM, REG, ALU, DM, REG) for the sequence below]
          beq $s1, $s2, L1
          and $s1, $s2, $s3
          add $t3, $s3, $s4
          sub $s3, $s3, $s4
          L1: lw $t4, 24($s1)
  71. Branch prediction
      • Static branch prediction
        – Based on typical branch behavior
        – Example: loop and if-statement branches
          • Predict backward branches taken
          • Predict forward branches not taken
      • Dynamic branch prediction
        – Hardware measures actual branch behavior
          • e.g., record recent history of each branch
        – Assume future behavior will continue the trend
          • When wrong, stall while re-fetching, and update history
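Both prediction styles can be sketched in a few lines. The static rule below is the one stated on the slide; the dynamic predictor is the simplest possible scheme (remember only the last outcome of each branch), chosen here as an assumption since the slide does not fix a particular design. All names are hypothetical.

```python
def predict_taken_static(branch_pc, target_pc):
    """Static rule: backward branches (loops) taken, forward branches not taken."""
    return target_pc < branch_pc


class LastOutcomePredictor:
    """Dynamic sketch: predict that each branch repeats its last outcome."""

    def __init__(self):
        self.history = {}  # branch address -> last observed outcome

    def predict(self, branch_pc):
        # Assumption: a branch never seen before is predicted not taken.
        return self.history.get(branch_pc, False)

    def update(self, branch_pc, taken):
        self.history[branch_pc] = taken


def count_mispredictions(predictor, branch_pc, outcomes):
    """Replay a list of actual outcomes and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(branch_pc) != taken:
            misses += 1  # the pipeline would stall and re-fetch here
        predictor.update(branch_pc, taken)
    return misses


# A loop branch taken three times, then falling through on exit.
misses = count_mispredictions(LastOutcomePredictor(), 0x400, [True, True, True, False])
print(misses)  # 2
```

The two misses are the first iteration (no history yet) and the loop exit, which is the typical cost of a one-entry history on loop branches.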
  72. Concluding remarks
      • ISA influences design of datapath and control
      • Datapath and control influence design of ISA
      • Pipelining improves instruction throughput using parallelism
        – More instructions completed per second
        – Latency for each instruction not reduced
      • Hazards: structural, data, control
  73. The end