Computer Structure X86 Virtual Memory and TLB Franck Sala Updated by tomer gurevich Slides from Lihu and Adi’s Lecture 1 Computer Structure – VM X86 paging עבור זיכרון עם כתובת בגודל של 32ביט וגודל הדף הוא , 4kbגודל טבלת הדפים הדרושה הוא . 4Mb רוב התהליכים במערכת משתמשים רק במעט זיכרון. התקורה של 4Mbעבור כל תהליך היא לרוב יקרה ומיותרת .עבור רוב התהליכים טבלת תרגום כזאת תהיה רוב הזיכרון שהתהליך צורך. ב X86-ישנן מספר רמות של טבלאות תרגום ,המסודרות במבנה של עץ .אנו מקצים טבלאות תרגום באופן דינאמי , רק כאשר יש צורך ממשי בטבלה. Computer Structure – VM 2 32bit Mode: 4KB / 4MB Page Mapping 2-level hierarchical mapping: Page Directories and Page tables – 4KB aligned PDE Linear Address Space (4K Page) 31 – Present (0 = page fault) 21 DIR 11 TABLE 0 OFFSET – Page size (4KB or 4 MB) CR4.PSE=1 both 4MB & 4KB pages supported Separate TLBs 10 10 12 4K Page data Linear Address Space (4MB Page) 31 21 0 DIR OFFSET 10 22 1K entry Page Table 1K entry Page Directory 4MByte Page PTE data Page Directory 20 PDE 20 20+12=32 (4K aligned) PDE 10 20+12=32 (4K aligned) 3 CR3 (PDBR) CR3 (PDBR) Computer Structure – VM 32bit Mode: PDE and PTE Format 20 bit pointer to a 4K Aligned address Virtual memory – Present – Accessed – Dirty (in PTE only) – Page size (in PDE only) – Global Protection – Writable (R#/W) – User / Supervisor # 2 levels/type only Caching – Page WT – Page Cache Disabled – PAT – PT Attribute Index 3 bits available for OS usage 4 Page Directory Entry (4KB page table) Present Writable User / Supervisor Write-Through Cache Disable Accessed Page Size (0: 4 Kbyte) Global Available for OS Use Page Frame Address 31:12 PP AVAIL G 0 A A C W U W P DT 12 11 - 9 8 7 6 5 4 3 2 1 0 31 Page Table Entry Present Writable User / Supervisor Write-Through Cache Disable Accessed Dirty PAT Global Available for OS Use Page Frame Address 31:12 31 P PP AVAIL G A D A C W U W P T DT 12 11 - 9 8 7 6 5 4 3 2 1 0 Computer Structure – VM 4KB Page Mapping in 64 bit Mode 2003: AMD Opteron… Linear Address Space (4K Page) 63 47 sign ext. 39 38 PML4 9 30 29 21 20 PDP DIR 9 9 12 11 TABLE PML4: Page Map Level 4 PDP: Page Directory Pointer 0 OFFSET 9 4KByte Page 12 data 512 entry Page Directory Pointer Table 512 entry PML4 Table 512 entry Page Directory 512 entry Page Table 64 PTE 64 64 64 M-12 PDE M-12 PDP entry M-12 PML4 entry M-12 40 (4KB aligned) 256 TB of virtual memory (248) 1 TB of physical memory (240) CR3 (PDPTR) 5 Computer Structure – VM 2MB Page Mapping in 64 bit Mode Linear Address Space (2M Page) 63 47 sign ext. 39 38 PML4 9 30 29 21 20 PDP DIR 9 9 512 entry Page Directory Pointer Table 512 entry PML4 Table 0 OFFSET 21 2MByte Page 512 entry Page Directory data PDE M-21 PDP entry M-12 PML4 entry M-12 40 (4KB aligned) CR3 (PDPTR) 6 Computer Structure – VM 1GB Page Mapping in 64 bit Mode Linear Address Space (1G Page) 63 47 sign ext. 39 38 PML4 30 29 PDP 9 9 0 OFFSET 30 1GByte Page 512 entry Page Directory Pointer Table 512 entry PML4 Table data PDP entry M-30 PML4 entry M-12 40 (4KB aligned) CR3 (PDPTR) 7 Computer Structure – VM Question 1 We have a core similar to X86 – – – – 64 bit mode Support Small Pages (PTE) and Large Pages (DIR) Page table size in each hierarchy is the size of a small page Entry size in the Page Table is 16 byte, in all the hierarchies N4 63 sign ext. PML4 N3 N2 PDP DIR N1 0 12 11 TABLE OFFSET What is the size of a small page ? 12 bits in the offset field 212 B = 4KB How many entries are in each Page Table? Page Table size = Page Size = 4KB PTE = 16B 4KB / 16B = 212 / 24 = 28 = 256 entries in each Page Table 8 Computer Structure – VM - 64 bit (large & small) PT size = Page size = 4KB PTE = 16B Page Table: 256 entries Question 1 N4 63 sign ext. PML4 N3 N2 PDP DIR N1 0 12 11 TABLE OFFSET What are the values of N1, N2, N3 and N4 ? Since we have 256 entries in each table, we need 8 bits to address them - Table [19:12] N1 = 19 - DIR [27:20] N2 = 27 - PDP [35:28] N3 = 35 - PML4 [43:36] N4 = 43 What is the size of a large page ? Large pages are pointed by DIR So the large pages offset is 20 bits [19:0] large pages size: 220 = 1MB We can also say: DIR can point to 256 pages of 4KB = 1MB 9 Computer Structure – VM - 64 bit (large & small) PT size = Page size = 4KB Large page size = 1 MB PTE = 16B Page Table: 256 entries Question 1 43 63 sign ext. PML4 35 27 PDP DIR 19 0 12 11 TABLE OFFSET We access a sequence of virtual addresses For each address, what is the minimal number of tables that were added in all the hierarchies ? See next foil in presentation mode… 10 Computer Structure – VM Question 1: sequence of allocations 0 63 43 36 35 28 27 20 19 12 11 sign ext. PML4 PDP DIR TABLE OFFSET 8 bits instead of 9 for example purposes only 8 PDP Table PML4 Table D3 8 8 FF Page Dir 8 Page Table E2 12 B25 349 00000 D3 82 FF E2 B25 00000 D3 82 FF E2 349 0 new table 937 28 71 PML4 PDP DIR PTE offset 3 new tables + 1 page 68 82 4KB Page 00000 D3 82 FF 68 937 0 new table + 1 page CR3 C00 46 49 00000 D3 82 28 49 C00 1 new tables + 1 page B5 54 622 00000 71 46 B5 54 622 3 new tables + 1 page 11 Computer Structure – VM Translation Look aside Buffer (TLB) Page table resides in memory each translation requires an extra Virtual Address memory access TLB caches recently used PTEs TLB Access – speed up translation – typically 128 to 256 entries, 4 to 8 way associative Access Page Table In memory TLB Indexing Virtual page number Tag Offset Set On A TLB miss No TLB Hit ? Yes Physical Addresses – Page Miss Handler (HW PMH) gets PTE from memory 12 Computer Structure – VM Virtual Memory And Cache Virtual Address Access TLB Page Walk: get PTE from No memory Hierarchy STLB Hit ? No TLB Hit ? Yes Access Cache L1 Cache Hit ? No L2 Cache Hit ? No Access Memory Yes Physical Addresses Data TLB access is serial with cache access Page table entries are cached in L1 D$, L2$ and L3 $ as data 13 Computer Structure – VM TLBs The processor saves most recently used PDEs and PTEs in TLBs – Separate TLB for data and instruction caches – Separate TLBs for 4KB and 2/4MB page sizes 14 Computer Structure – VM כאשר ישנה גישה לזיכרון מתרחש התהליך הבא: ראשית ,ניגש ל TLB -המתאים עם ה VPN -המלא. ישנם dTLBו – iTLBאשר מכילים תרגום של מידע וזיכרון בהתאמה. ה TLBs -מחולקים גם לדפים קטנים וגדולים בהתאמה. אם ישנו , TLB HITנשתמש בתרגום שמצאנו . במקרה של TLB MISSנפנה לPMH - Computer Structure – VM 15 ה – PMHמכיל את ה. STLB - לאחר ה TLB MISS -ניגש ל . STLB -ה STLB-מהווה עוד "רמה " של . TLBהוא מכיל יותר PTEגם הוא. נפנה אליו לאחר ה.. TLB MISS - ה STLB -מכיל גם זיכרון של פקודות וגם של מידע .כמו כן ,הוא מכיל גם תרגומים של דפים גדולים. כמו ב TLB -אנו משתמשים ב VPN -כדי לחפש ב. STLB- עבור , STLB HITנשתמש בתרגום שמצאנו עבור , STLB MISSה PMH -יבצע . Page walk Computer Structure – VM 16 בשלב הזה ,ה PMH-חייב לבצע : Page Walkהוא מטייל על הירארכיית טבלאות הדפים החל מהשורש (. ) PML4 כדי לקצר את התהליך ה PMH -שומר cacheעבור הרמות הגבוהות של התרגום. PML4 cache,PDP cache,DIR cache : מדוע אין צורך לשמור ? Table cache מכיוון שה TLB-שומר תרגום מלא של כתובת ,אין צורך בTable - Cacheאלא רק עבור הרמות הגבוהות יותר ,אשר מהוות תרגום חלקי. Computer Structure – VM 17 48 SIGN EXT. 18 39 PML4 30 PDP DIR 0 12 21 TABLE OFFSET cache Accessed with virtual address bits If hits, returns DIR cache [47:21] PDE PDP cache [47:30] PDP entry PML4 cache [47:39] PML4 entry Computer Structure – VM אל כל cacheאנו פונים עם הסיביות ששימשו אותנו גם לפנייה לרמה גבוהה יותר . אם ברמה כלשהיא של ה cache -התרחשה פגיעה .אזי הצלחנו לחסוך (לפחות) את כל הגישות לרמה הנוכחית ולרמות הגבוהות יותר. לדוגמא ,עבור PDP cache hitנמצא PDPEמתאים וכך נחסוך פנייה ל PML4 -ו – . PDPעדיין נצטרך לגשת לזיכרון עבור הרמות הנמוכות יותר. Computer Structure – VM 19 ה PMH-ניגש לכל המטמונים במקביל ובוחר ברמה הנמוכה ביותר עבורה היה . HIT את שארית התרגום נבצע כרגיל ,באמצעות גישה לטבלאות אשר שמורות בזיכרון . נשים לב לכל הפחות נהיה חייבים לגשת ל. PAGE TABLE- את טבלת הדפים המתאימה נחפש בהירארכיית הזיכרון כרגיל :ניגש קודם ל L3 cache , L2 cache , L1 cache-ורק לבסוף ניגש לזיכרון . Computer Structure – VM 20 Caches and Translation Structures Instruction bytes Core On-die L2 L1 data cache L1 Inst. cache L3 Platform Memory translation Inst. TLB Data. TLB PTE PTE Load entry PTE STLB PDE cache PDP cache PML4 cache 21 PMH VA[47:12] PDE entry VA[47:21] PDP entry VA[47:30] Page Walk Logic PML4 entry VA[47:39] Computer Structure – VM Question 2 Processor similar to X86 – 64 bits Pages of 4KB 63 The processor has a TLB sign ext. 43 PML4 35 27 PDP DIR 19 0 12 11 TABLE OFFSET – TLB Hit: we get the translation with no need to access the translation tables – TLB Miss: the processor has to do a Page Walk The hardware that does the Page Walk (PMH) contains a cache for each of the translation tables All Caches and TLB are empty on Reset For the sequence of memory access below, how many accesses are needed for the translations? Address memory access Explanations 0000022334455666H 0000022334455777H 0000022884455777H 22 Computer Structure – VM Question 2 Processor similar to X86 – 64 bits Pages of 4KB 63 The processor has a TLB sign ext. 43 PML4 35 27 PDP DIR 19 0 12 11 TABLE OFFSET – TLB Hit: we get the translation with no need to access the translation tables – TLB Miss: the processor has to do a Page Walk The hardware that does the Page Walk (PMH) contains a cache for each of the translation tables All Caches and TLB are empty on Reset For the sequence of memory access below, how many accesses are needed for the translations? Address memory access Explanations 0000022334455666H 0000022334455777H 0000022884455777H 23 Computer Structure – VM Question 2 Processor similar to X86 – 64 bits Pages of 4KB 63 The processor has a TLB sign ext. 43 PML4 35 27 PDP DIR 19 0 12 11 TABLE OFFSET – TLB Hit: we get the translation with no need to access the translation tables – TLB Miss: the processor has to do a Page Walk The hardware that does the Page Walk (PMH) contains a cache for each of the translation tables All Caches and TLB are empty on Reset For the sequence of memory access below, how many accesses are needed for the translations? Address memory access 0000022334455666H 4 Explanations We need to access the memory for each of the 4 translation tables 0000022334455777H 0000022884455777H 24 Computer Structure – VM Question 2 Processor similar to X86 – 64 bits Pages of 4KB 63 The processor has a TLB sign ext. 43 PML4 35 27 PDP DIR 19 0 12 11 TABLE OFFSET – TLB Hit: we get the translation with no need to access the translation tables – TLB Miss: the processor has to do a Page Walk The hardware that does the Page Walk (PMH) contains a cache for each of the translation tables All Caches and TLB are empty on Reset For the sequence of memory access below, how many accesses are needed for the translations? Address memory access Explanations 0000022334455666H 4 We need to access the memory for each of the 4 translation tables 0000022334455777H 0 Same pages as above TLB hit No memory access 0000022884455777H 25 Computer Structure – VM Question 2 Processor similar to X86 – 64 bits Pages of 4KB 63 The processor has a TLB sign ext. 43 PML4 35 27 PDP DIR 19 0 12 11 TABLE OFFSET – TLB Hit: we get the translation with no need to access the translation tables – TLB Miss: the processor has to do a Page Walk The hardware that does the Page Walk (PMH) contains a cache for each of the translation tables All Caches and TLB are empty on Reset For the sequence of memory access below, how many accesses are needed for the translations? Address memory access Explanations 0000022334455666H 4 We need to access the memory for each of the 4 translation tables 0000022334455777H 0 Same pages as above TLB hit No memory access 0000022884455777H 3 We hit in PML4 cache. Then we miss in the PDP and so we need to access the memory 3 times: PDP, DIR, PTE 26 Computer Structure – VM Question 2 L1 data Cache: 32KB – 2 ways of 64B each How can we access this cache before we get the physical address? 64B 6 bits offset bits [5:0] 32KB = 215 / (2 ways * 26 bytes) = 28 = 256 sets [13:6] 12 bits are not translated: [11:0] we lack 2 bits [13:12] to get the set address So we do a lookup using 2 un-translated bits for the set address Those bits can be different from the PFN obtained after translation, therefore we need to compare the whole PFN to the tag stored in the Cache Tag array 27 Computer Structure – VM Question 2: Read Acces Translated Not Translated VPN Set 13 12 11 47 40 Offset 6 5 0 Tag [40:12] Tag [40:12] Way 0 Way 1 13 12 PFN 40:12 = Tag Match 28 = Tag Match Computer Structure – VM Translated VPN Set 13 12 11 47 Question 2: example Not Translated Offset 6 5 0 Write Virtual Address A VPN Offset VA: 0000 … 0000 0000 0000 0000 Set: [13:6] = 0 PFN Offset PA: 0 … 0000 0000 0000 0000 Tag: 0 Read Virtual Address B VB: 0011 … 0000 0000 0000 0000 Set: [13:6] = 0 29 PB: 0 … 0011 0000 0000 0000 Tag: 3 (0 if we don’t take [13:12]) Computer Structure – VM Question 2: Virtual Alias L1 data Cache: 32KB 2 ways of 64B each What will happen when we access with a given offset the virtual page A and after this, there is an access with the same offset in the virtual page B, which is mapped by the OS to the same physical page as A? Translated Not Translated VPN Set 13 12 11 47 40 Offset 6 5 0 13 12 PFN VPN xxxx01 Physical Addresses Cache PFN zzzz VPN yyyy00 Set[11:6] Not translated Physical Addresses Cache 01.set[11:6] zzzz Max: 64 sets 00.set[11:6] zzzz zzzz - 2 virtual pages map to the same frame xxxx01.set.ofset and yyyy00.set.ofset 01 and 00 are bits 13:12 30 Phys. addressed cache: The data exist only once Virtual addressed cache: The data may exist twice AVOID THIS !!! Computer Structure – VM Question 2: Virtual Alias L1 data Cache: 32KB 2 ways of 64B each What will happen when we access with a given offset the virtual page A and after this, there is an access with the same offset in the virtual page B, which is mapped by the OS to the same physical page as A? Translated VPN Set 13 12 11 47 40 Avoid having the same data twice in the cache xxxx01.set.ofset and yyyy00.set.ofset Check 4 sets when we allocate a new entry and see if the same tag appears If yes, evict the second occurrence of the data (the alias) 31 Not Translated Offset 6 5 0 13 12 PFN Physical Addresses Cache 01.set[11:6] 00.set[11:6] zzzz zzzz Virtual addressed cache: The data may exist twice AVOID THIS !!! Computer Structure – VM Question 2: Snoop L1 data Cache: 32KB 2 ways of 64B each What happens in case of snoop in the cache? Translated Not Translated VPN Set 13 12 11 47 40 Offset 6 5 0 13 12 PFN The cache is snooped with a physical address [40:0] Since the 2 MSB bits of the set address are virtual, a given physical address can map to 4 different sets in the cache (depending on the virtual page that is mapped to it) Physical Addresses Cache 01.set[11:6] 00.set[11:6] zzzz zzzz So we must snoop 4 sets * 2 ways in the cache Virtual addressed cache: The data may exist twice AVOID THIS !!! 32 Computer Structure – VM Question 3 Core similar to X86 in 64 bit mode 55 63 sign ext. PML4 47 35 PDP DIR 23 0 12 11 TABLE OFFSET Supports small pages (pointed by PTE) and large pages (pointed by DIR). Size of an entry in all the different page tables is 8 Bytes PMH Caches at all the levels – 4 entries direct mapped – Access time on hit: 2 cycles – Miss known after 1 cycle PMH caches are accessed at all the levels in parallel In each level, when there is a HIT, the PMH cache provides the relevant entry in the page table in the relevant level In each level, when there is a miss: the core accesses the relevant page table in the main memory. Access time to the main memory is 100 cycles, not including the time needed to get the PMH cache miss. 33 Computer Structure – VM Question 3 55 63 sign ext. PML4 47 35 PDP DIR 23 0 12 11 TABLE OFFSET What is the size of the large pages? The large page is pointed by DIR, therefore, all the bits under are offset inside the large page: 224 = 16 MB How many entries in each Page Table ? • • • • PTE: DIR: PDP: PML4: 34 212 212 212 28 Computer Structure – VM 55 63 sign ext. PML4 47 35 PDP Virtual Addr. DIR Cycles 23 0 12 11 TABLE OFFSET Comment TLB 4entries Direct mapped TLB hit: 2 cycles TLB miss: 1 cycle Memory access: 100 cycle FF81 2345 6789 ABCD FF81 2340 6789 ABCD FF80 2340 6789 ABCD FF81 2340 6709 ABCD FF81 2340 6709 A0CD 35 Computer Structure – VM 55 63 sign ext. PML4 47 35 PDP DIR 23 0 12 11 TABLE OFFSET TLB 4entries Direct mapped TLB hit: 2 cycles TLB miss: 1 cycle Memory access: 100 cycle Virtual Addr. Cycles Comment FF81 2345 6789 ABCD 401 Miss and memory access at each level: 1+ 4 *100 = 401 cycles FF81 2340 6789 ABCD FF80 2340 6789 ABCD FF81 2340 6709 ABCD FF81 2340 6709 A0CD 36 Computer Structure – VM 55 63 sign ext. PML4 47 35 PDP DIR 23 0 12 11 TABLE OFFSET TLB 4entries Direct mapped TLB hit: 2 cycles TLB miss: 1 cycle Memory access: 100 cycle Virtual Addr. Cycles Comment FF81 2345 6789 ABCD 401 Miss and memory access at each level: 1+ 4 *100 = 401 cycles FF81 2340 6789 ABCD 202 (PML4, PDP, DIR, sTLB) = (H,H,M,M) 1cyc PDP TLB read 1 more cycle DIR TLB: Miss 100 cycles PTE TLB: Miss 100 cycles FF80 2340 6789 ABCD FF81 2340 6709 ABCD FF81 2340 6709 A0CD 37 Computer Structure – VM 55 63 sign ext. PML4 47 35 PDP DIR 23 0 12 11 TABLE OFFSET TLB 4entries Direct mapped TLB hit: 2 cycles TLB miss: 1 cycle Memory access: 100 cycle Virtual Addr. Cycles Comment FF81 2345 6789 ABCD 401 Miss and memory access at each level: 1+ 4 *100 = 401 cycles FF81 2340 6789 ABCD 202 (PML4, PDP, DIR, sTLB) = (H,H,M,M) 1cyc PDP TLB read 1 more cycle DIR TLB: Miss 100 cycles PTE TLB: Miss 100 cycles FF80 2340 6789 ABCD 401 PMH cache miss in all the levels: 1 + 4*100 = 401 cycles FF81 2340 6709 ABCD FF81 2340 6709 A0CD 38 Computer Structure – VM 55 63 sign ext. PML4 47 35 PDP DIR 23 0 12 11 TABLE OFFSET TLB 4entries Direct mapped TLB hit: 2 cycles TLB miss: 1 cycle Memory access: 100 cycle Virtual Addr. Cycles Comment FF81 2345 6789 ABCD 401 Miss and memory access at each level: 1+ 4 *100 = 401 cycles FF81 2340 6789 ABCD 202 (PML4, PDP, DIR, sTLB) = (H,H,M,M) 1cyc PDP TLB read 1 more cycle DIR TLB: Miss 100 cycles PTE TLB: Miss 100 cycles FF80 2340 6789 ABCD 401 PMH cache miss in all the levels: 1 + 4*100 = 401 cycles FF81 2340 6709 ABCD 302 (PML4, PDP, DIR, sTLB) = (H,M,M,M) 1 cyc PDP: the entry 234 that was filled for the second access was replaced by the entry that was filled in access 3, as it is in the same set miss 2 + (3 × 100) = 302 FF81 2340 6709 A0CD 39 Computer Structure – VM 55 63 sign ext. PML4 47 35 PDP DIR 23 0 12 11 TABLE OFFSET TLB 4entries Direct mapped TLB hit: 2 cycles TLB miss: 1 cycle Memory access: 100 cycle Virtual Addr. Cycles Comment FF81 2345 6789 ABCD 401 Miss and memory access at each level: 1+ 4 *100 = 401 cycles FF81 2340 6789 ABCD 202 (PML4, PDP, DIR, sTLB) = (H,H,M,M) 1cyc PDP TLB read 1 more cycle DIR TLB: Miss 100 cycles PTE TLB: Miss 100 cycles FF80 2340 6789 ABCD 401 PMH cache miss in all the levels: 1 + 4*100 = 401 cycles FF81 2340 6709 ABCD 302 (PML4, PDP, DIR, sTLB) = (H,M,M,M) 1 cyc PDP: the entry 234 that was filled for the second access was replaced by the entry that was filled in access 3, as it is in the same set miss 2 + (3 × 100) = 302 FF81 2340 6709 A0CD 2 Hit in TLB – no need to go to PMH: 2 cycles 40 Computer Structure – VM Backup Slides 41 Computer Structure – VM Translated not translated 47 12 0 11 Virtual Address Set 13 Offset 6 5 0 VPN 47 13 … 40 : 11 Tag field 40 11 PFN Way 0 Way 3 = Tag Match 42 = Tag Match Computer Structure – VM Translated not translated Virtual Address Set Offset VPN … : Tag field 40 PFN Way 0 Way 3 = Tag Match 43 = Tag Match Computer Structure – VM