Lecture Notes: Computer Architecture - Chapter 5: Memory Hierarchy


  1. Computer Architecture Chapter 5: Memory Hierarchy Computer Engineering – CSE – HCMUT 1
  2. Contents • Memory hierarchy • Cache organization – Direct mapped – Fully associative – n-way associative • Virtual memory (self-learning) Computer Architecture (c) Cuong Pham-Quoc/HCMUT 2
  3. Motivation example: an analogy. Publishers correspond to secondary memory, a bookshelf to main memory, and the table you read at to the cache. Computer Architecture (c) Cuong Pham-Quoc/HCMUT 3
  4. Principle of Locality • Programs access a small proportion of their address space at any time • Temporal locality – Items accessed recently are likely to be accessed again soon – e.g., instructions in a loop, induction variables • Spatial locality – Items near those accessed recently are likely to be accessed soon – E.g., sequential instruction access, array data Computer Architecture (c) Cuong Pham-Quoc/HCMUT 4
  5. Taking Advantage of Locality • Memory hierarchy • Store everything on disk • Copy recently accessed (and nearby) items from disk to smaller DRAM memory – Main memory • Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory – Cache memory attached to CPU Computer Architecture (c) Cuong Pham-Quoc/HCMUT 5
  6. Memory Hierarchy Levels • Block (aka line): unit of copying – May be multiple words • If accessed data is present in upper level – Hit: access satisfied by upper level • Hit ratio: hits/accesses • If accessed data is absent – Miss: block copied from lower level • Time taken: miss penalty • Miss ratio: misses/accesses = 1 – hit ratio – Then accessed data supplied from upper level Computer Architecture (c) Cuong Pham-Quoc/HCMUT 6
  7. Memory Technology • Static RAM (SRAM) – 0.5ns – 2.5ns, $2000 – $5000 per GB • Dynamic RAM (DRAM) – 50ns – 70ns, $20 – $75 per GB • Flash Memory – 5μs – 50μs, $0.75 - $1 per GB • Magnetic disk – 5ms – 20ms, $0.20 – $2 per GB • Ideal memory – Access time of SRAM – Capacity and cost/GB of disk Computer Architecture (c) Cuong Pham-Quoc/HCMUT 7
  8. Cache Memory • Cache memory – The level of the memory hierarchy closest to the CPU • Given accesses X1, X2, …, Xn−1, Xn: how do we know if the data is present? Where do we look? Computer Architecture (c) Cuong Pham-Quoc/HCMUT 8
  9. Direct Mapped Cache • Location determined by address • Direct mapped: only one choice – (Block address) modulo (#Blocks in cache) – #Blocks is a power of 2, so use the low-order address bits Computer Architecture (c) Cuong Pham-Quoc/HCMUT 9
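     As a minimal sketch, this mapping can be computed in C (the block count is taken from the 8-block example on the following slides):

          #include <stdio.h>

          #define NUM_BLOCKS 8   /* must be a power of 2 */

          int main(void) {
              unsigned block_addr = 22;                 /* word address 22 = 10110 in binary */
              unsigned index = block_addr % NUM_BLOCKS; /* low-order bits: 110 = 6 */
              unsigned tag   = block_addr / NUM_BLOCKS; /* high-order bits: 10 = 2 */
              printf("block %u -> index %u, tag %u\n", block_addr, index, tag);
              return 0;
          }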
  10. Tags and Valid Bits • How do we know which particular block is stored in a cache location? – Store block address as well as the data – Actually, only need the high-order bits – Called the tag • What if there is no data in a location? – Valid bit: 1 = present, 0 = not present – Initially 0 Computer Architecture (c) Cuong Pham-Quoc/HCMUT 10
  11. Cache Example • 8 blocks, 1 word/block, direct mapped • Initial state:

          Index  V  Tag  Data
          000    N
          001    N
          010    N
          011    N
          100    N
          101    N
          110    N
          111    N

      Computer Architecture (c) Cuong Pham-Quoc/HCMUT 11
  12. Cache Example

          Word addr  Binary addr  Hit/miss  Cache block
          22         10 110       Miss      110

          Index  V  Tag  Data
          000    N
          001    N
          010    N
          011    N
          100    N
          101    N
          110    Y  10   Mem[10110]
          111    N

      Computer Architecture (c) Cuong Pham-Quoc/HCMUT 12
  13. Cache Example

          Word addr  Binary addr  Hit/miss  Cache block
          26         11 010       Miss      010

          Index  V  Tag  Data
          000    N
          001    N
          010    Y  11   Mem[11010]
          011    N
          100    N
          101    N
          110    Y  10   Mem[10110]
          111    N

      Computer Architecture (c) Cuong Pham-Quoc/HCMUT 13
  14. Cache Example

          Word addr  Binary addr  Hit/miss  Cache block
          22         10 110       Hit       110
          26         11 010       Hit       010

          Index  V  Tag  Data
          000    N
          001    N
          010    Y  11   Mem[11010]
          011    N
          100    N
          101    N
          110    Y  10   Mem[10110]
          111    N

      Computer Architecture (c) Cuong Pham-Quoc/HCMUT 14
  15. Cache Example

          Word addr  Binary addr  Hit/miss  Cache block
          16         10 000       Miss      000
          3          00 011       Miss      011
          16         10 000       Hit       000

          Index  V  Tag  Data
          000    Y  10   Mem[10000]
          001    N
          010    Y  11   Mem[11010]
          011    Y  00   Mem[00011]
          100    N
          101    N
          110    Y  10   Mem[10110]
          111    N

      Computer Architecture (c) Cuong Pham-Quoc/HCMUT 15
  16. Cache Example

          Word addr  Binary addr  Hit/miss  Cache block
          18         10 010       Miss      010

          Index  V  Tag  Data
          000    Y  10   Mem[10000]
          001    N
          010    Y  10   Mem[10010]
          011    Y  00   Mem[00011]
          100    N
          101    N
          110    Y  10   Mem[10110]
          111    N

      Computer Architecture (c) Cuong Pham-Quoc/HCMUT 16
  17. Address Subdivision Computer Architecture (c) Cuong Pham-Quoc/HCMUT 17
  18. Example: Larger Block Size • 64 blocks, 16 bytes/block – To what block number does byte address 1200 map? • Block address = ⌊1200/16⌋ = 75 • Block number = 75 modulo 64 = 11 • Address fields (32-bit address): Tag = bits 31–10 (22 bits), Index = bits 9–4 (6 bits), Offset = bits 3–0 (4 bits) Computer Architecture (c) Cuong Pham-Quoc/HCMUT 18
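     The same field extraction as a small C sketch (field widths from this slide: 4 offset bits, 6 index bits, 22 tag bits):

          #include <stdio.h>

          int main(void) {
              unsigned addr   = 1200;                /* byte address */
              unsigned offset = addr & 0xF;          /* 4 bits: byte within 16-byte block */
              unsigned index  = (addr >> 4) & 0x3F;  /* 6 bits: block address modulo 64 */
              unsigned tag    = addr >> 10;          /* remaining 22 bits */
              printf("offset=%u index=%u tag=%u\n", offset, index, tag);
              /* Prints offset=0 index=11 tag=1: block 75 maps to cache block 11. */
              return 0;
          }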
  19. Block Size Considerations • Larger blocks should reduce miss rate – Due to spatial locality • But in a fixed-sized cache – Larger blocks ⇒ fewer of them • More competition ⇒ increased miss rate – Larger blocks ⇒ pollution • Larger miss penalty – Can override benefit of reduced miss rate – Early restart and critical-word-first can help Computer Architecture (c) Cuong Pham-Quoc/HCMUT 19
  20. Cache Misses • On cache hit, CPU proceeds normally • On cache miss – Stall the CPU pipeline – Fetch block from next level of hierarchy – Instruction cache miss • Restart instruction fetch – Data cache miss • Complete data access Computer Architecture (c) Cuong Pham-Quoc/HCMUT 20
  21. Write-Through • On data-write hit, could just update the block in cache – But then cache and memory would be inconsistent • Write through: also update memory • But makes writes take longer – e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles • Effective CPI = 1 + 0.1×100 = 11 • Solution: write buffer – Holds data waiting to be written to memory – CPU continues immediately • Only stalls on write if write buffer is already full Computer Architecture (c) Cuong Pham-Quoc/HCMUT 21
  22. Write-Back • Alternative: On data-write hit, just update the block in cache – Keep track of whether each block is dirty • When a dirty block is replaced – Write it back to memory – Can use a write buffer to allow replacing block to be read first Computer Architecture (c) Cuong Pham-Quoc/HCMUT 22
  23. Write Allocation • What should happen on a write miss? • Alternatives for write-through – Allocate on miss: fetch the block – Write around: don’t fetch the block • Since programs often write a whole block before reading it (e.g., initialization) • For write-back – Usually fetch the block Computer Architecture (c) Cuong Pham-Quoc/HCMUT 23
  24. Example: Intrinsity FastMATH • Embedded MIPS processor – 12-stage pipeline – Instruction and data access on each cycle • Split cache: separate I-cache and D-cache – Each 16KB: 256 blocks × 16 words/block – D-cache: write-through or write-back • SPEC2000 miss rates – I-cache: 0.4% – D-cache: 11.4% – Weighted average: 3.2% Computer Architecture (c) Cuong Pham-Quoc/HCMUT 24
  25. Example: Intrinsity FastMATH Computer Architecture (c) Cuong Pham-Quoc/HCMUT 25
  26. Main Memory Supporting Caches • Use DRAMs for main memory – Fixed width (e.g., 1 word) – Connected by fixed-width clocked bus • Bus clock is typically slower than CPU clock • Example cache block read – 1 bus cycle for address transfer – 15 bus cycles per DRAM access – 1 bus cycle per data transfer • For 4-word block, 1-word-wide DRAM – Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles – Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle Computer Architecture (c) Cuong Pham-Quoc/HCMUT 26
  27. Increasing Memory Bandwidth • 4-word-wide memory – Miss penalty = 1 + 15 + 1 = 17 bus cycles – Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle • 4-bank interleaved memory – Miss penalty = 1 + 15 + 4×1 = 20 bus cycles – Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle Computer Architecture (c) Cuong Pham-Quoc/HCMUT 27
  28. Measuring Cache Performance • Components of CPU time – Program execution cycles • Includes cache hit time – Memory stall cycles • Mainly from cache misses • With simplifying assumptions: Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty = (Instructions / Program) × (Misses / Instruction) × Miss penalty Computer Architecture (c) Cuong Pham-Quoc/HCMUT 28
  29. Cache Performance Example • Given – I-cache miss rate = 2% – D-cache miss rate = 4% – Miss penalty = 100 cycles – Base CPI (ideal cache) = 2 – Loads & stores are 36% of instructions • Miss cycles per instruction – I-cache: 0.02 × 100 = 2 – D-cache: 0.36 × 0.04 × 100 = 1.44 • Actual CPI = 2 + 2 + 1.44 = 5.44 – Ideal CPU is 5.44/2 = 2.72 times faster Computer Architecture (c) Cuong Pham-Quoc/HCMUT 29
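     The same calculation as a small C sketch (all parameters from this slide):

          #include <stdio.h>

          int main(void) {
              double base_cpi = 2.0, miss_penalty = 100.0;
              double icache_miss = 0.02, dcache_miss = 0.04, ls_frac = 0.36;

              double i_stall = icache_miss * miss_penalty;            /* 2.00 per instruction */
              double d_stall = ls_frac * dcache_miss * miss_penalty;  /* 1.44 per instruction */
              double cpi = base_cpi + i_stall + d_stall;              /* 5.44 */
              printf("actual CPI = %.2f, ideal CPU is %.2fx faster\n", cpi, cpi / base_cpi);
              return 0;
          }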
  30. Average Access Time • Hit time is also important for performance • Average memory access time (AMAT) – AMAT = Hit time + Miss rate × Miss penalty • Example – CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5% – AMAT = 1 + 0.05 × 20 = 2ns • 2 cycles per instruction Computer Architecture (c) Cuong Pham-Quoc/HCMUT 30
  31. Performance Summary • When CPU performance increased – Miss penalty becomes more significant • Decreasing base CPI – Greater proportion of time spent on memory stalls • Increasing clock rate – Memory stalls account for more CPU cycles • Can’t neglect cache behavior when evaluating system performance Computer Architecture (c) Cuong Pham-Quoc/HCMUT 31
  32. Associative Caches • Fully associative – Allow a given block to go in any cache entry – Requires all entries to be searched at once – Comparator per entry (expensive) • n-way set associative – Each set contains n entries – Block number determines which set • (Block number) modulo (#Sets in cache) – Search all entries in a given set at once – n comparators (less expensive) Computer Architecture (c) Cuong Pham-Quoc/HCMUT 32
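     A minimal C sketch of an n-way set-associative lookup; the geometry (4 sets, 2 ways) is an illustrative assumption, and the loop over ways stands in for the n parallel comparators in hardware:

          #include <stdbool.h>
          #include <stdio.h>

          #define NUM_SETS 4
          #define NUM_WAYS 2

          struct line { bool valid; unsigned tag; };
          static struct line cache[NUM_SETS][NUM_WAYS];

          bool lookup(unsigned block_addr) {
              unsigned set = block_addr % NUM_SETS;  /* block number determines the set */
              unsigned tag = block_addr / NUM_SETS;
              for (int w = 0; w < NUM_WAYS; w++)     /* all ways compared at once in hardware */
                  if (cache[set][w].valid && cache[set][w].tag == tag)
                      return true;
              return false;
          }

          int main(void) {
              cache[0][0] = (struct line){ true, 2 };  /* block 8: set 0, tag 2 */
              printf("block 8: %s\n", lookup(8) ? "hit" : "miss");
              printf("block 6: %s\n", lookup(6) ? "hit" : "miss");
              return 0;
          }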
  33. Associative Cache Example Computer Architecture (c) Cuong Pham-Quoc/HCMUT 33
  34. Spectrum of Associativity • For a cache with 8 entries Computer Architecture (c) Cuong Pham-Quoc/HCMUT 34
  35. Associativity Example • Compare 4-block caches – Direct mapped, 2-way set associative, fully associative – Block access sequence: 0, 8, 0, 6, 8 • Direct mapped:

          Block addr  Cache index  Hit/miss  Cache content after access
          0           0            miss      [0]=Mem[0]
          8           0            miss      [0]=Mem[8]
          0           0            miss      [0]=Mem[0]
          6           2            miss      [0]=Mem[0], [2]=Mem[6]
          8           0            miss      [0]=Mem[8], [2]=Mem[6]

      Computer Architecture (c) Cuong Pham-Quoc/HCMUT 35
  36. Associativity Example • 2-way set associative:

          Block addr  Set  Hit/miss  Set 0 content after access
          0           0    miss      Mem[0]
          8           0    miss      Mem[0], Mem[8]
          0           0    hit       Mem[0], Mem[8]
          6           0    miss      Mem[0], Mem[6]
          8           0    miss      Mem[8], Mem[6]

      • Fully associative:

          Block addr  Hit/miss  Cache content after access
          0           miss      Mem[0]
          8           miss      Mem[0], Mem[8]
          0           hit       Mem[0], Mem[8]
          6           miss      Mem[0], Mem[8], Mem[6]
          8           hit       Mem[0], Mem[8], Mem[6]

      Computer Architecture (c) Cuong Pham-Quoc/HCMUT 36
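     A small C sketch that replays the direct-mapped case above and confirms that all five accesses miss (geometry from the example: 4 blocks, direct mapped):

          #include <stdbool.h>
          #include <stdio.h>

          int main(void) {
              int seq[] = {0, 8, 0, 6, 8};
              struct { bool valid; int tag; } cache[4] = { {false, 0} };
              int hits = 0;
              for (int i = 0; i < 5; i++) {
                  int idx = seq[i] % 4, tag = seq[i] / 4;
                  if (cache[idx].valid && cache[idx].tag == tag)
                      hits++;
                  else {                        /* miss: fetch block, evicting the old one */
                      cache[idx].valid = true;
                      cache[idx].tag = tag;
                  }
              }
              printf("%d hits, %d misses\n", hits, 5 - hits);  /* 0 hits, 5 misses */
              return 0;
          }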
  37. How Much Associativity • Increased associativity decreases miss rate – But with diminishing returns • Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000 – 1-way: 10.3% – 2-way: 8.6% – 4-way: 8.3% – 8-way: 8.1% Computer Architecture (c) Cuong Pham-Quoc/HCMUT 37
  38. Set Associative Cache Organization Computer Architecture (c) Cuong Pham-Quoc/HCMUT 38
  39. Replacement Policy • Direct mapped: no choice • Set associative – Prefer non-valid entry, if there is one – Otherwise, choose among entries in the set • Least-recently used (LRU) – Choose the one unused for the longest time • Simple for 2-way, manageable for 4-way, too hard beyond that • Random – Gives approximately the same performance as LRU for high associativity Computer Architecture (c) Cuong Pham-Quoc/HCMUT 39
  40. Least Recently Used Algorithm • Need to keep track of what was used when – Keep “age bits” for each cache line – Update the age bits of all cache lines whenever a line is used • Updating every line on every access is costly ⇒ Pseudo-LRU: use one bit per cache line Computer Architecture (c) Cuong Pham-Quoc/HCMUT 40
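     A minimal sketch of full LRU with age counters for one hypothetical 4-way set; as noted above, real hardware usually approximates this:

          #include <stdio.h>

          #define WAYS 4
          static unsigned age[WAYS];   /* larger age = used longer ago */

          void touch(int way) {        /* called on every access to the set */
              for (int w = 0; w < WAYS; w++) age[w]++;
              age[way] = 0;
          }

          int victim(void) {           /* way to replace: the largest age */
              int v = 0;
              for (int w = 1; w < WAYS; w++)
                  if (age[w] > age[v]) v = w;
              return v;
          }

          int main(void) {
              touch(0); touch(1); touch(2); touch(3); touch(1);
              printf("evict way %d\n", victim());  /* way 0: least recently used */
              return 0;
          }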
  41. Multilevel Caches • Primary cache attached to CPU – Small, but fast • Level-2 cache services misses from primary cache – Larger, slower, but still faster than main memory • Main memory services L-2 cache misses • Some high-end systems include L-3 cache Computer Architecture (c) Cuong Pham-Quoc/HCMUT 41
  42. Multilevel Cache Example • Given – CPU base CPI = 1, clock rate = 4GHz – Miss rate/instruction = 2% – Main memory access time = 100ns • With just primary cache – Miss penalty = 100ns/0.25ns = 400 cycles – Effective CPI = 1 + 0.02 × 400 = 9 Computer Architecture (c) Cuong Pham-Quoc/HCMUT 42
  43. Example (cont.) • Now add L-2 cache – Access time = 5ns – Global miss rate to main memory = 0.5% • Primary miss with L-2 hit – Penalty = 5ns/0.25ns = 20 cycles • Primary miss with L-2 miss – Extra penalty = 400 cycles • CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4 • Performance ratio = 9/3.4 = 2.6 Computer Architecture (c) Cuong Pham-Quoc/HCMUT 43
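     The two-level calculation as a small C sketch (parameters from the two slides above):

          #include <stdio.h>

          int main(void) {
              double l1_miss = 0.02, l2_hit_pen = 20.0;   /* 5ns / 0.25ns */
              double mem_pen = 400.0, global_miss = 0.005;
              double cpi_l1_only = 1.0 + l1_miss * mem_pen;    /* 9.0 */
              double cpi_with_l2 = 1.0 + l1_miss * l2_hit_pen
                                 + global_miss * mem_pen;      /* 3.4 */
              printf("L1 only: %.1f, with L2: %.1f, ratio %.1fx\n",
                     cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2);
              return 0;
          }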
  44. Multilevel Cache Considerations • Primary cache – Focus on minimal hit time • L-2 cache – Focus on low miss rate to avoid main memory access – Hit time has less overall impact • Results – L-1 cache usually smaller than a single-level cache would be – L-1 block size smaller than L-2 block size Computer Architecture (c) Cuong Pham-Quoc/HCMUT 44
  45. Interactions with Advanced CPUs • Out-of-order CPUs can execute instructions during cache miss – Pending store stays in load/store unit – Dependent instructions wait in reservation stations • Independent instructions continue • Effect of miss depends on program data flow – Much harder to analyse – Use system simulation Computer Architecture (c) Cuong Pham-Quoc/HCMUT 45
  46. Interactions with Software • Misses depend on memory access patterns – Algorithm behavior – Compiler optimization for memory access Computer Architecture (c) Cuong Pham-Quoc/HCMUT 46
  47. Software Optimization via Blocking • Goal: maximize accesses to data before it is replaced • Consider the inner loops of DGEMM:

          for (int i = 0; i < n; ++i)
              for (int j = 0; j < n; ++j) {
                  double cij = C[i+j*n];
                  for (int k = 0; k < n; k++)
                      cij += A[i+k*n] * B[k+j*n];
                  C[i+j*n] = cij;
              }

      Computer Architecture (c) Cuong Pham-Quoc/HCMUT 47
  48. DGEMM Access Pattern • C, A, and B arrays [figure: snapshots of the three arrays, with shading distinguishing older accesses from newer accesses] Computer Architecture (c) Cuong Pham-Quoc/HCMUT 48
  49. Cache Accesses in DGEMM • Cache misses depend on N • For example: – N = 32, each element 8 bytes – The three matrices together occupy 24KB (32×32×8×3) – Core i7 Sandy Bridge L1 D-cache: 32KB • What would happen when N is increased? • What would happen if the matrices were divided into smaller blocks that fit in the cache? Computer Architecture (c) Cuong Pham-Quoc/HCMUT 49
  50. Cache Blocked DGEMM

          #define BLOCKSIZE 32

          void do_block(int n, int si, int sj, int sk,
                        double *A, double *B, double *C) {
              for (int i = si; i < si + BLOCKSIZE; ++i)
                  for (int j = sj; j < sj + BLOCKSIZE; ++j) {
                      double cij = C[i+j*n];            /* cij = C[i][j] */
                      for (int k = sk; k < sk + BLOCKSIZE; k++)
                          cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k] * B[k][j] */
                      C[i+j*n] = cij;                   /* C[i][j] = cij */
                  }
          }

          void dgemm(int n, double *A, double *B, double *C) {
              for (int sj = 0; sj < n; sj += BLOCKSIZE)
                  for (int si = 0; si < n; si += BLOCKSIZE)
                      for (int sk = 0; sk < n; sk += BLOCKSIZE)
                          do_block(n, si, sj, sk, A, B, C);
          }

      Computer Architecture (c) Cuong Pham-Quoc/HCMUT 50
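     A hypothetical driver for the blocked dgemm above (assumed to be in the same file; n is chosen as a multiple of BLOCKSIZE, and C must start zeroed since do_block accumulates into it):

          #include <stdlib.h>
          #include <string.h>

          int main(void) {
              int n = 512;                              /* multiple of BLOCKSIZE */
              double *A = malloc(n * n * sizeof *A);
              double *B = malloc(n * n * sizeof *B);
              double *C = malloc(n * n * sizeof *C);
              for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
              memset(C, 0, n * n * sizeof *C);
              dgemm(n, A, B, C);   /* every element of C should now equal 2.0 * n */
              free(A); free(B); free(C);
              return 0;
          }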
  51. Blocked DGEMM Access Pattern [figure: unoptimized vs. blocked access patterns] Computer Architecture (c) Cuong Pham-Quoc/HCMUT 51
  52. Virtual Memory • Use main memory as a “cache” for secondary (disk) storage – Managed jointly by CPU hardware and the operating system (OS) • Programs share main memory – Each gets a private virtual address space holding its frequently used code and data – Protected from other programs • CPU and OS translate virtual addresses to physical addresses – VM “block” is called a page – VM translation “miss” is called a page fault Computer Architecture (c) Cuong Pham-Quoc/HCMUT 52
  53. Address Translation • Fixed-size pages (e.g., 4K) Computer Architecture (c) Cuong Pham-Quoc/HCMUT 53
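     A minimal C sketch of the split between virtual page number and page offset for 4KB pages (the physical page number itself comes from the page table, described on the next slides):

          #include <stdio.h>

          #define PAGE_BITS 12   /* 4KB = 2^12 bytes */

          int main(void) {
              unsigned va     = 4669;   /* example address from the exercise later on */
              unsigned vpn    = va >> PAGE_BITS;               /* indexes the page table */
              unsigned offset = va & ((1u << PAGE_BITS) - 1);  /* unchanged by translation */
              printf("va %u -> vpn %u, offset %u\n", va, vpn, offset);
              return 0;
          }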
  54. Page Fault Penalty • On page fault, the page must be fetched from disk – Takes millions of clock cycles – Handled by OS code • Try to minimize page fault rate – Fully associative placement – Smart replacement algorithms Computer Architecture (c) Cuong Pham-Quoc/HCMUT 54
  55. Page Tables • Stores placement information – Array of page table entries, indexed by virtual page number – Page table register in CPU points to page table in physical memory • If page is present in memory – PTE stores the physical page number – Plus other status bits (referenced, dirty, …) • If page is not present – PTE can refer to location in swap space on disk Computer Architecture (c) Cuong Pham-Quoc/HCMUT 55
  56. Translation Using a Page Table Computer Architecture (c) Cuong Pham-Quoc/HCMUT 56
  57. Mapping Pages to Storage Computer Architecture (c) Cuong Pham-Quoc/HCMUT 57
  58. Replacement and Writes • To reduce page fault rate, prefer least-recently used (LRU) replacement – Reference bit (aka use bit) in PTE set to 1 on access to page – Periodically cleared to 0 by OS – A page with reference bit = 0 has not been used recently • Disk writes take millions of cycles – Write whole blocks at once, not individual locations – Write-through is impractical – Use write-back – Dirty bit in PTE set when page is written Computer Architecture (c) Cuong Pham-Quoc/HCMUT 58
  59. Fast Translation Using a TLB • Address translation would appear to require extra memory references – One to access the PTE – Then the actual memory access • But access to page tables has good locality – So use a fast cache of PTEs within the CPU – Called a Translation Look-aside Buffer (TLB) – Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate – Misses could be handled by hardware or software Computer Architecture (c) Cuong Pham-Quoc/HCMUT 59
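     A minimal C sketch of a fully associative TLB lookup; the entry count and fields are illustrative assumptions, and the sequential loop stands in for the parallel compare done in hardware:

          #include <stdbool.h>
          #include <stdio.h>

          #define TLB_ENTRIES 16

          struct tlb_entry { bool valid; unsigned vpn; unsigned ppn; };
          static struct tlb_entry tlb[TLB_ENTRIES];

          bool tlb_lookup(unsigned vpn, unsigned *ppn) {
              for (int i = 0; i < TLB_ENTRIES; i++)
                  if (tlb[i].valid && tlb[i].vpn == vpn) {
                      *ppn = tlb[i].ppn;
                      return true;       /* hit: ~0.5-1 cycle */
                  }
              return false;              /* miss: fetch the PTE from the page table */
          }

          int main(void) {
              tlb[0] = (struct tlb_entry){ true, 1, 12 };
              unsigned ppn;
              if (tlb_lookup(1, &ppn))
                  printf("vpn 1 -> ppn %u\n", ppn);
              return 0;
          }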
  60. Fast Translation Using a TLB Computer Architecture (c) Cuong Pham-Quoc/HCMUT 60
  61. TLB and Cache Interaction • If cache tag uses physical address – Need to translate before cache lookup • Alternative: use virtual address tag – Complications due to aliasing • Different virtual addresses for shared physical address Computer Architecture (c) Cuong Pham-Quoc/HCMUT 61
  62. TLB Misses • If page is in memory – Load the PTE from memory and retry – Could be handled in hardware • Can get complex for more complicated page table structures – Or in software • Raise a special exception, with optimized handler • If page is not in memory (page fault) – OS handles fetching the page and updating the page table – Then restart the faulting instruction Computer Architecture (c) Cuong Pham-Quoc/HCMUT 62
  63. TLB Miss Handler • TLB miss indicates – Page present, but PTE not in TLB – Page not present • Must recognize TLB miss before destination register overwritten – Raise exception • Handler copies PTE from memory to TLB – Then restarts instruction – If page not present, page fault will occur Computer Architecture (c) Cuong Pham-Quoc/HCMUT 63
  64. Page Fault Handler • Use faulting virtual address to find PTE • Locate page on disk • Choose page to replace – If dirty, write to disk first • Read page into memory and update page table • Make process runnable again – Restart from faulting instruction Computer Architecture (c) Cuong Pham-Quoc/HCMUT 64
  65. Memory Protection • Different tasks can share parts of their virtual address spaces – But need to protect against errant access – Requires OS assistance • Hardware support for OS protection – Privileged supervisor mode (aka kernel mode) – Privileged instructions – Page tables and other state information only accessible in supervisor mode – System call exception (e.g., syscall in MIPS) Computer Architecture (c) Cuong Pham-Quoc/HCMUT 65
  66. The Memory Hierarchy The BIG Picture • Common principles apply at all levels of the memory hierarchy – Based on notions of caching • At each level in the hierarchy – Block placement – Finding a block – Replacement on a miss – Write policy Computer Architecture (c) Cuong Pham-Quoc/HCMUT 66
  67. Exercise • Given the TLB (fully associative) and the page table (4KB pages) with LRU replacement [initial TLB and page table contents were shown in a figure: each entry holds a valid bit, a tag or virtual page number, and a physical page number or Disk] • If pages must be brought in from disk, increment the next largest page number • Show the final state of the TLB and page table if the virtual address requests are as follows: – 4669, 2227, 13916, 34587, 48870, 12608, 49225, 12948, 49419, 46814, 13975, 40004, 12707, 52236 • The same question but with 16KB pages instead of 4KB • Hint: analyse the virtual address to extract the virtual page number Computer Architecture (c) Cuong Pham-Quoc/HCMUT 67
  68. Solution • Virtual addresses (decimal): 4669, 2227, 13916, 34587, 48870, 12608, 49225 • Binary addresses: 4KB pages ⇒ 12-bit page offset, so the VPN is the address shifted right by 12 bits – 4669 = 1_0010_0011_1101, VPN = 1 – 2227 = 0_1000_1011_0011, VPN = 0 – 13916 = 11_0110_0101_1100, VPN = 3 – 34587 = 1000_0111_0001_1011, VPN = 8 – 48870 = 1011_1110_1110_0110, VPN = 11 – 12608 = 11_0001_0100_0000, VPN = 3 – 49225 = 1100_0000_0100_1001, VPN = 12 Computer Architecture (c) Cuong Pham-Quoc/HCMUT 68
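     A small C sketch that reproduces this VPN extraction for both page sizes asked about in the exercise:

          #include <stdio.h>

          int main(void) {
              unsigned va[] = {4669, 2227, 13916, 34587, 48870, 12608, 49225};
              for (int i = 0; i < 7; i++)
                  printf("va %5u: vpn %2u (4KB pages), vpn %2u (16KB pages)\n",
                         va[i], va[i] >> 12, va[i] >> 14);
              return 0;
          }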
  69. [Figure-only slide: final TLB and page table states for the exercise] Computer Architecture (c) Cuong Pham-Quoc/HCMUT 69
  70. [Figure-only slide: final TLB and page table states, continued] Computer Architecture (c) Cuong Pham-Quoc/HCMUT 70
  71. Block Placement • Determined by associativity – Direct mapped (1-way associative) • One choice for placement – n-way set associative • n choices within a set – Fully associative • Any location • Higher associativity reduces miss rate – Increases complexity, cost, and access time Computer Architecture (c) Cuong Pham-Quoc/HCMUT 71
  72. Finding a Block • Hardware caches – Reduce comparisons to reduce cost • Virtual memory – Full table lookup makes full associativity feasible – Benefit in reduced miss rate

          Associativity           Location method                   Tag comparisons
          Direct mapped           Index                             1
          n-way set associative   Set index, then search entries    n
                                  within the set
          Fully associative       Search all entries                #entries
                                  Full lookup table                 0

      Computer Architecture (c) Cuong Pham-Quoc/HCMUT 72
  73. Replacement • Choice of entry to replace on a miss – Least recently used (LRU) • Complex and costly hardware for high associativity – Pseudo-LRU • Close to LRU, less costly hardware – Random • Close to LRU, easier to implement • Virtual memory – LRU approximation with hardware support Computer Architecture (c) Cuong Pham-Quoc/HCMUT 73
  74. Write Policy • Write-through – Update both upper and lower levels – Simplifies replacement, but may require write buffer • Write-back – Update upper level only – Update lower level when block is replaced – Need to keep more state • Virtual memory – Only write-back is feasible, given disk write latency Computer Architecture (c) Cuong Pham-Quoc/HCMUT 74
  75. Sources of Misses • Compulsory misses (aka cold start misses) – First access to a block • Capacity misses – Due to finite cache size – A replaced block is later accessed again • Conflict misses (aka collision misses) – In a non-fully associative cache – Due to competition for entries in a set – Would not occur in a fully associative cache of the same total size Computer Architecture (c) Cuong Pham-Quoc/HCMUT 75
  76. Concluding Remarks • Fast memories are small, large memories are slow – We really want fast, large memories ☹ – Caching gives this illusion ☺ • Principle of locality – Programs use a small part of their memory space frequently • Memory hierarchy – L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk • Memory system design is critical for multiprocessors Computer Architecture (c) Cuong Pham-Quoc/HCMUT 76