Transcript for:
[Lecture 32] Understanding Virtual Memory Concepts

okay so let's jump into virtual memory. We may talk about it more tomorrow, but we're going to start it here. I didn't add many readings for this, as you can see, except that it's actually covered reasonably well in H&H, and I'm going to borrow some slides, so you can find some reading in your book. It's such a fundamental concept that even though books ignore some fundamental concepts, like prefetching, they don't ignore virtual memory.

So the programmer's view of memory looks like this, essentially. If you didn't take this course, memory would always look like this to you: just some magic underneath. In fact, if you never take a course that shows how a processor works, it's all magic. I cannot imagine a world like that: you write a program, some magic executes it, and you're not curious about the magic underneath? That's the reason you have this course in a computer science degree. You're going to become a computer scientist or computer engineer, and hopefully this should not be magic to someone who claims to be a computer scientist. Maybe there should be some other degree for people who want to keep it as magic. That's the whole point of understanding how a computer works.

So this is not magic, as you know; from the programmer's perspective, we do loads and stores to memory. And we know what ideal memory would be. Ideally we do want this magic: zero access time, infinite capacity, zero cost, infinite bandwidth, zero energy, and so on. Today we're going to tackle the infinite capacity problem a little bit, but we're also going to introduce other things, because virtual memory actually serves multiple purposes in today's systems. One is hiding the management of memory, hiding the capacity issues of memory from the programmer; the other is protection, access protection, isolation. These are two completely different problems that are solved in an integrated manner in the virtual memory system.

So remember this slide: the programmer sees virtual memory, and this is an abstraction, because underneath there is some physical memory. The programmer can assume memory is infinite, or large enough. In reality, the physical memory size is much smaller than what the programmer assumes, and things still work. You write programs with large virtual addresses; in reality, that large an address space doesn't exist. You have a 64-bit address, but we don't have 2^64-byte memories. If you have one, please let me know; I'd like to use it. So physical memory is much smaller than what the programmer assumes, and underneath, the system, using some magic, meaning using software and hardware cooperatively, maps virtual addresses to physical addresses, virtual memory to physical memory. We're going to uncover that magic today. The system automatically manages the physical memory space, transparently to the programmer, and this is another reason this was a very successful idea: if you had to name system ideas that have been extremely successful, virtual memory is one of them. But hopefully we'll also see, by doing some critical thinking either later today or tomorrow, that this may not be the best way to go forward. Still, it has been very useful.
So, essentially, the big benefit is that the programmer does not need to know the physical size of memory, nor manage it: a small physical memory can appear as a huge one to the programmer, and life is easier for the programmer. We're going to add more and more here; this is a simplification, because there are other benefits of virtual memory, as we will see. Of course, the downside is that somebody needs to pay the cost of making this work, and that's the system software and the architecture, essentially the microarchitect and the system designer. This is a classic example of the programmer-microarchitect tradeoff: you make the programmer's life easy by making the microarchitect's life hard. Precise exceptions are another example; we retire instructions in order, and we have to pay the cost of doing so. So virtual memory requires indirection and mapping between virtual and physical address spaces. This will become clearer: the programmer sees a virtual address space, there is a physical address space, and some translation and mapping needs to happen from the virtual space to the physical space, because not all of the virtual memory the programmer sees is going to be in physical memory.

Okay, so let's talk about the benefits of this automatic management of memory. The programmer doesn't need to deal with physical addresses, and this has a lot of benefits. Each process has its own virtual address space, which is very large, for practical purposes infinite. You can run out of virtual memory space too, but if you do, maybe you're not doing something right in your memory management; maybe you have a memory leak somewhere. That's possible. And each process has an independent mapping of virtual to physical addresses. This is very important to know: each process, each program, has a different, completely independent view of virtual memory. Processes may share parts of physical memory through their own mappings, but they are independent in the virtual address space. Threads within a process, on the other hand, share the virtual memory space; that's the difference between a process and a thread, which you will also see, I think, when you take systems programming.

So virtual memory, or this automatic management of memory, enables multiple things. It enables code and data to be located anywhere in physical memory, because you have this indirection: you get relocation and flexible placement of data and code. If the programmer directly encoded a physical address, the code would have to be there. Whereas if it is a virtual address and there's a translation mechanism afterwards, the system can load the program into any location in physical memory by just changing the translation. This gives a lot of flexibility to the system; hopefully it's obvious, but you will see more. This also enables isolation, or separation, of code and data from different processes in physical memory. If somebody writes a program X and you write a program Y, these can be isolated from each other, because the system has different virtual-to-physical mappings for each. This is called protection and isolation, and we will see that.
So I should also mention that this is important because it's what enables multiprogramming: multiple programs can be present in the same machine. Otherwise, if you don't have some sort of translation or indirection, it's very difficult to enable. I write a program and I want it to be executed, and I assume some physical addresses; somebody else writes a program, and they assume some physical addresses. What if those physical addresses overlap? How are these two things going to execute if there is no translation mechanism? It's not going to happen, clearly: I do a write to physical address A, somebody else does a write to physical address A, and these are different programs; they should not be writing to the same location, because address A means something different for each of the programs. They are semantically completely independent. But if you don't have a mechanism for remapping those addresses somewhere else, meaning if the programmer gets directly exposed to the physical addresses of the machine, you have that problem: you cannot execute multiple processes at the same time.

And then the other benefit is that code and data sharing between multiple processes also happens through virtual memory. Again, I write program X, somebody else writes program Y, and we want to share data somehow. What do I do, call my friend and say, "I'm writing this data value to physical memory location A, so you'd better read from physical location A"? Is that what you do today? No, you use shared memory constructs, or shared libraries. If you have shared libraries on a machine, that's how different programs can call functions that are shared. These will become clearer, but essentially they are all enabled by automatic management of memory.

So, a system with physical memory only: the CPU generates physical addresses, and this really is the physical memory; load and store instructions generate physical memory addresses. Early computers were like this. Today this is not the case, although you can make x86, for example, work in physical mode; it's called real mode in x86. So if you want, you can make this work. This sort of thing is visible to the BIOS, for example, the very low-level system software that performs the startup of your machine, because it has to access physical memory directly so it can load things like page tables, as we will see. But today, most programmers do not directly manipulate physical addresses.

So, in a system that looks like this, let's assume it's the baseline. Physical memory is of limited size for cost reasons; we've seen that. What if you need more? Should the programmer be concerned about the size of the code and data blocks fitting in physical memory? You write some code, and your code is larger than the physical memory: what do you do? Do you do the hard work of managing the data movement into physical memory yourself? If you start doing that, it becomes very difficult to program the machine. We're going to tackle this problem with virtual memory. Multiple programs may need the physical memory: should the programmer make sure that all processes, different programs, can fit in physical memory? We just said that's not possible, because different people may be writing different programs, and even if the same person is writing all of them, it's a mess to think about. Should the programmer ensure that two processes do not unintentionally or incorrectly use the same portion of physical memory? This was the example I gave you earlier: two different programs writing to location A.

And then there's another interesting problem: your instruction set architecture may be forward-looking; it may have an address space greater than the physical memory size. Today we have 64-bit addresses, and a 64-bit address space with byte addressability is 16 exabytes. How many of you have 16 exabytes of memory? Not even supercomputers have 16 exabytes of memory today, unfortunately. So clearly we have an address space that's much larger than physical memory. How do you deal with that? It's interesting, because the address space is virtual, and that's exactly why it can be 64 bits, but set that aside for now. Basically, if you do not have enough physical memory, you still need the mapping mechanism we're going to talk about.

So the difficulties of direct physical addressing are many. The programmer needs to manage the physical memory space, which is inconvenient and difficult. It used to be this way: people managed their physical memory themselves in the 1950s and 60s, and it's good that you're not in those times, let's say. It's more difficult when you have multiple processes. It's difficult to support code and data relocation, because addresses are directly specified in the program; you cannot relocate data or code easily, or at all, without some indirection or change of address. It's difficult to support multiple processes, especially concurrently, because you need protection and isolation between different processes, and sharing the physical memory space without problems becomes hard to handle. And finally, it's difficult to support actual data and code sharing across processes, because different processes need to reference the same physical address, and we talked about that: how do you communicate that someone is supposed to reference this particular address?

So virtual memory solves all of these problems with a single idea, and the idea is very simple: give each program the illusion of a large address space while having a small physical memory. That's the very high-level idea; clearly the question is how you do that, and we're going to talk about it. The programmer does not worry about managing physical memory, within a process or across processes; essentially, the programmer can assume they have an infinite amount of physical memory. Hardware and software cooperatively and automatically manage the physical memory to provide the illusion we want, and the illusion is maintained for each independent process. I have to emphasize this: virtual memory is not a system-level thing, it's a process-level thing; each process has its own virtual memory.

Okay, so let's talk about the basic mechanism. It's very simple: indirection and mapping of addresses. The address generated by each instruction in a program is a virtual address; it's not directly specifying any physical entity. You can think of it as a name: it's a namespace, a virtual name space. It is definitely not the physical address used to access main memory.
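As a quick sanity check of the 64-bit numbers above, here is a tiny C sketch (my own illustration, not from the lecture materials) that computes the size of a 64-bit byte-addressable address space and how many 4-kilobyte pages it contains:

```c
#include <stdio.h>

/* Sanity-checking the lecture's numbers: a 64-bit byte-addressable
 * address space holds 2^64 bytes = 16 exabytes, far beyond any
 * physical memory we can build today. */
int main(void) {
    /* 2^64 bytes expressed in exabytes (1 EB = 2^60 bytes). */
    double bytes    = (double)(1ULL << 63) * 2.0;   /* 2^64 as a double */
    double exabytes = bytes / (double)(1ULL << 60);
    printf("64-bit address space: %.0f exabytes\n", exabytes);  /* 16 */

    /* With 4 KB (2^12-byte) pages, the space contains 2^(64-12) pages. */
    printf("number of 4 KB pages: 2^%d\n", 64 - 12);            /* 2^52 */
    return 0;
}
```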
That virtual address is going to be translated into the physical address that is used to access main memory. This is also called a linear address in x86; x86 has interesting terms, and this is one place where I don't think they did a good job with terminology, let's say. Then there's an address translation mechanism that maps this address to a physical address, and we're going to talk about that mechanism. That physical address is called a real address in x86, which is actually not bad terminology. And the address translation mechanism can be implemented in hardware and software together. I'm giving you the x86 equivalents because I'm going to show you some examples from x86 here.

Okay, let me give you the conceptual view; this is from the book chapter I assigned to you. I like this high-level view: you have the illusion of a large, separate address space per process. This summarizes what the mapping and indirection look like. In this case you have two processes with independent virtual address spaces, and you can see that each of them has a view of virtual memory that's 256 terabytes, that's 2^48 bytes, essentially a 48-bit byte-addressable address space. The other process also has its own virtual address space. But then there's the physical address space, the actual physical memory, and this processor may have 8 gigabytes of physical memory. So essentially, the programs each have the illusion that they are operating on a 256-terabyte address space, but only individual portions, chunks of the virtual address space, get mapped to physical memory at a given point in time. The virtual address space is divided into virtual pages; we're going to talk more about that. In this case they are 4-kilobyte virtual pages; that's just an example, it could be one megabyte or one gigabyte, as we will see. But here, 4-kilobyte portions of the virtual address space get mapped to the physical address space and are present in physical memory at that point in time.

So there needs to be a table that records this mapping, clearly. For process one, where does virtual page zero get mapped? Does it get mapped to physical page zero? No, it gets mapped to some other physical page, as you can see on the slide. On the other hand, virtual page zero of the other process does get mapped to physical page zero. So somebody needs to manage this, clearly. That's what the indirection and mapping need: you have a virtual address in your program, and by some indirection mechanism you indirectly access the physical address space. That indirection is enabled by mapping the virtual page number to the physical page number through a lookup table that we will call the page table.

Okay, so let's take a look at that page table. Essentially we need to translate the address. Let me go back here (wow, that was far away). As opposed to this earlier picture, where you directly generate physical addresses, we have an indirection, and that indirection is this page table. The CPU, the program, generates virtual addresses, and you consult this lookup table called the page table. The page table basically says: this virtual page, for this process, is in this physical page in physical memory. And if you access something that's not in physical memory, it says: it's on disk somewhere, so I'm going to call the operating system and ask for help. This is called an exception (remember exceptions and interrupts), specifically a page fault exception: the page that the program is referencing is not in physical memory, so I'm going to ask the operating system to help me bring it into physical memory, and we're going to talk about how that's done. The I/O controller needs to get involved in the process and the exception gets handled: the page needs to be moved from the disk to the physical memory, and then the mapping needs to be fixed, such that the page table entry of that page points to the physical memory page where the page was placed from the disk. Makes sense, right?

So this is all done automatically. If it were not done automatically, the programmer would have to do all of this, and there was a time in the world, 50 or 60 years ago, when that was the case; in some systems even 10 or 20 years ago, and maybe it still exists in some small number of cases. So basically, the hardware converts virtual addresses into physical addresses via this lookup table, the page table, but the page table is cooperatively managed by the operating system and the hardware together. We're going to talk about this management. But this makes sense: you have a virtual address space; some pages in that virtual address space are in physical memory, with these indirections or mappings; some pages, because they have not been accessed recently, for example, are on the disk; but you have a mechanism to bring a page from the disk into physical memory. This is why the disk is a backing store of virtual memory, or of physical memory, and this is why physical memory is a cache of the disk from this perspective, from the program's perspective. So essentially, all of the caching principles apply to physical memory as well, as we will see; in fact, I'm going to show you a slide where caching terminology and virtual memory terminology line up almost exactly.

Okay, so let's look at these two different processes; each process has 4 gigabytes of virtual memory here. I'll go through this relatively quickly; the sizes don't matter much, and this is a small system, as you can see, with small physical memory. We divide the virtual memory of each process into virtual pages and divide the physical memory into physical pages, which I will also call physical frames; you may see both terms, but we'll use physical page in general. Then there's a mapping of the virtual pages of process one to the physical pages in physical memory, and ditto for process two, and that mapping is done through the page table.

Clearly, there are many issues in this indirection and mapping. When do you map a virtual address to a physical address? Let's answer that quickly: when the virtual address is first referenced by the program. Nothing has to be in physical memory initially. Well, hopefully the code you're executing is, but even the code works this way: you have a program counter, the program counter is a virtual address, you go through this translation mechanism, get a page fault, and the code gets brought from the disk into physical memory after that page fault; then you can reference that program counter in physical memory. So basically, whenever you touch a virtual address for the first time, that virtual address gets mapped to a physical address through the page table. Of course, you can do prefetching on this as well.
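To make the lookup concrete, here is a minimal single-level translation sketch in C, following the picture above. All the names and sizes (pte_t, PAGE_BITS, handle_page_fault, a 32-bit virtual address) are my own illustrative choices, not from the lecture or any real operating system:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS  12                 /* 4 KB pages                      */
#define NUM_VPAGES (1u << 20)         /* 32-bit VA -> 2^20 virtual pages */

typedef struct {
    bool     valid;                   /* is the page in physical memory? */
    uint32_t ppn;                     /* physical page number, if valid  */
} pte_t;

static pte_t page_table[NUM_VPAGES];  /* one entry per virtual page */

/* OS routine: bring the page from disk into a frame, return the frame. */
extern uint32_t handle_page_fault(uint32_t vpn);

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;             /* virtual page number */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    if (!page_table[vpn].valid) {
        /* Page fault: ask the OS to move the page from the backing
         * store into physical memory and fix the mapping. */
        page_table[vpn].ppn   = handle_page_fault(vpn);
        page_table[vpn].valid = true;
    }
    /* The offset bits pass through unchanged; only the page number maps. */
    return (page_table[vpn].ppn << PAGE_BITS) | offset;
}
```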
Indeed, you can prepopulate this page table ahead of time, both from the operating system perspective and the hardware perspective.

So, what is the mapping granularity? This is important. I showed you 4 kilobytes, for example, but it could be byte granularity, kilobyte, megabyte, gigabyte. Existing systems actually support multiple granularities, with some pain; it's not an easy task, let's say. This granularity determines the size of the tag store, which is the page table: the page table is really a big tag store if you think about it, because physical memory is a cache for the disk. If you have byte granularity, each mapping covers very little, and you don't want that, because the tag store becomes huge and you're also not exploiting locality very well. If your pages are byte-sized, bad; 4 kilobytes, not bad; 16 kilobytes, 64 kilobytes, they exist; megabytes, gigabytes, we're going to talk about these. Basically, the issues are similar to caching: you can think of cache block size as similar to page size.

So where and how do you store the virtual-to-physical mappings? We're going to concern ourselves with this a little bit. This has to be an operating-system-visible data structure. Do you also cache it in hardware? We're going to see that the answer is yes. Do you manage it cooperatively? We're going to see that the answer is yes. So this is actually a great example of hardware-software cooperation; all systems do this today. And what do you do when the physical address space is full, when your physical memory is full? Well, remember, physical memory is a cache: you evict a virtual page that is unlikely to be needed from physical memory and fix the mappings in the page table, because the page table is a tag store. So if you think about physical memory as a cache, the backing store as the next level, and the page table as a tag store, all of this becomes easy, in my opinion, because we've already seen caches.

Okay, so let's go into a little more detail. The virtual address space is divided into pages; the physical address space is divided into frames, just to be a little different, because a frame is actually a physical location, the frame of a page, if you will, though people call those pages too; that's why the slide says "i.e., pages". But let's stay with the frame terminology for some time. A virtual page is mapped to a physical frame if the page is in physical memory; that's what the mapping specifies in the page table for that virtual page. Otherwise, a virtual page is mapped to a location on disk, assuming it's allocated. There's also the memory allocation part that we're not going to discuss as much; I'm going to assume all of memory is allocated. Actually, when you program, you allocate some memory, and that allocation enables you to have a mapping to the disk or to physical memory, but for now ignore that allocation, otherwise it will complicate things; it's another level of allocation that is different from what we're talking about here. Assume the entire virtual address space is allocated. So a virtual page is mapped either to a location on disk or to a physical frame. If you access a virtual page and it's not in memory but on disk, the virtual memory system brings the page into a physical frame and adjusts the mapping; this is called demand paging, and the event of finding the virtual page not in memory is called a page fault, and we're going to see that. And the page table is the table that stores the mapping of virtual pages to physical frames; you've seen an example, and we're going to see more.

So, in other words, physical memory is a cache for pages that are stored on disk. In fact, it's a fully associative cache in modern systems: a virtual page can be mapped to any physical frame. Similar caching issues exist, as we covered earlier: placement, replacement, granularity of management, write policy. We're going to see some of these again, but we're not going to talk about all of them in depth. For example: where and how do you place and find a page in the cache? What page do you remove to make room in the cache, meaning physical memory? Do you have large, small, or uniform-sized pages? And what do you do about writes: when you write to a physical page, do you also write to the copy on disk? People quickly figured out that write-through physical memory is a very bad idea, so today our physical memories are write-back, meaning you have a modified bit in the page table. This is the same as the dirty or modified bit we've seen in caches: for a given virtual page, you have a modified bit; if the program has written to that page, the modified bit is set, and when you have to evict that page from physical memory to make room for some other page coming from the disk, you have to write the page that is marked dirty back to the disk. Makes sense, right? But this is not news for you, because we already covered caching; we just changed where the cache is.

So here is the analogy: a cache block is analogous to a virtual memory page; cache block size is analogous to page size; the block offset is the page offset; a cache miss is essentially a page fault; the index into the cache is the virtual page number; the tag store, the metadata store, is the page table; and the data store is the physical memory. Physical memory is your data store now. I like this slide, as you can see; it simplifies a lot of the concepts.

But let's go into virtual memory a little more. The page size is the mapping granularity of virtual to physical address spaces. It also dictates the amount of data transferred from the hard disk to DRAM at once. The reason a lot of page sizes today are 4 kilobytes is historical: the disks were designed that way; they could bring, say, 512 bytes or 4 kilobytes easily through streaming, and we still carry this legacy of disk-based virtual memory systems. That's why you see 4-kilobyte pages in many systems today; it's unfortunate, but that's how it is. Basically, we're assuming you transfer that whole amount, 4 kilobytes, from the hard disk to DRAM to put a page into physical memory. Hopefully that makes sense. But again, we've seen sub-blocking or sectoring in caching; you could play similar tricks in virtual memory. Of course, you need to be a bit more careful, because now you'd need to change the page table: remember, sub-blocking or sectoring enabled us to bring only a piece of a cache block into the cache. You could do that with physical memory as well, but it requires you to have the valid and dirty bits separately for the different sub-pages in this particular case.

Okay: the page table is the table that stores virtual-to-physical page mappings; it's essentially a lookup table.
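Going back to the write-back point for a second, here is a small eviction sketch with a dirty bit, mirroring the discussion above; the entry layout and the write_page_to_disk() helper are assumptions for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid;   /* page is resident in a physical frame      */
    bool     dirty;   /* set by hardware on a store to the page    */
    uint32_t ppn;     /* physical frame currently holding the page */
} pte_t;

/* Hypothetical OS helper: write one page back to the backing store. */
extern void write_page_to_disk(uint32_t vpn, uint32_t ppn);

/* Evict virtual page 'vpn' to free its frame for another page. */
uint32_t evict_page(pte_t *pt, uint32_t vpn) {
    if (pt[vpn].dirty)                       /* write-back, not write-through */
        write_page_to_disk(vpn, pt[vpn].ppn);
    pt[vpn].valid = false;                   /* mapping now refers to disk    */
    pt[vpn].dirty = false;
    return pt[vpn].ppn;                      /* this frame can now be reused  */
}
```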
The page table is used to translate virtual page addresses to physical frame addresses, and thereby find where the associated data is. And address translation is the process of determining the physical address from the virtual address. Hopefully these definitions are clear; now let's take a look at how it's done.

I've already shown you this picture; now we're going to see some examples. This picture is from your book; you can see the chapter. These are your virtual addresses; there's an address translation mechanism; some of your virtual addresses are in physical memory, and some of them are on the hard drive or SSD, essentially disk. The hope is that most accesses hit in physical memory, but programs see the large capacity of virtual memory, such that the programmer doesn't need to worry about the sizes of data or code. This is essentially the idea of a memory hierarchy all over again, if you think about it: most accesses should hit in the fast memory, and we should get the capacity of the large memory. So I can think of this part as the cache and this part as the large memory.

Okay, so this is what address translation looks like. You take a virtual address, in this case 32 bits, and you chop off the bottom bits, the page offset, because that's the page size; those bottom bits don't change during translation. But the top part, the virtual page number, gets translated to a physical page number, or physical frame number; let's call it the physical page number, since that's what your book uses. Makes sense, right? Just by looking at this, you can tell the size of your physical memory: the physical address has 27 bits, so it's 2^27 bytes, assuming it's byte-addressable, and your virtual address at 32 bits gives 2^32 bytes. So essentially a 19-bit virtual page number gets translated to a 15-bit physical frame number. Everything good? Actually, sorry, the virtual address in this example is 31 bits, not 32.

So let's look at the example. The virtual memory size is two gigabytes in this example, 2^31 bytes; the physical memory size is 2^27 bytes; the page size is 2^12, 4 kilobytes. The organization looks like this: the virtual address is 31 bits, the physical address is 27 bits, and the page offset is 12 bits. So the number of virtual pages you have is 2^19, and the number of physical pages is 2^15, and you can easily derive the virtual page number and physical page number from these; they're easy. This is an example from your book as well: this is what your virtual memory looks like, and this is what your physical memory looks like. The blue items, the blue pages, are the ones present in physical memory, and we're looking at the granularity of virtual pages and physical pages, as you can see. This, for example, shows that virtual page 2 is mapped to physical page 0x7FFF, and clearly physical memory is smaller.

Okay, so how do we translate addresses? The page table has an entry for each virtual page: for every possible virtual page number, you need to have an entry. Each page table entry has a valid bit, which says whether the virtual page is located in physical memory. If it's one, the virtual page is actually located in physical memory; otherwise it's somewhere else, so we need a mechanism for finding where it is. That information can be stored in the page table or in some other data structures in the operating system, which we're not going to get into, but the page must be fetched from the hard disk somehow. I should actually say the backing store, because it can be something else: it could be tape, it could be an SSD. The physical page number is where the virtual page is located in physical memory, assuming the valid bit is one. And then there are also some other bits in the page table entry, like dirty or modified bits, which we talked about, and replacement policy bits: what are you going to replace, and how, similar to cache replacement; we're going to look at that too. And there are also permission and access bits: can you access this page, given that you have some privilege level in this process? We're going to talk about that, because it's separate from address translation; it's what enables virtual memory protection. Okay, so ignore those permission and access bits for now.

So let's look at this example page table, for the example in your book. You have virtual page numbers, and this is your page table, ignoring the access protection and replacement policy bits. You have a valid bit that is set only for those virtual pages that are in physical memory, and the table shows the mapping, as you can see; in this case, you see that four of the virtual pages are present in physical memory.

Okay, let's look at how the address translation is done. Basically, you take the 31-bit virtual address and chop off the page offset, because it doesn't change during translation; that's the page size. You find the virtual page number and use it to index into the page table. Remember, the page table is itself something in memory, so now you need to think a little: you need to index into the page table to get the page table entry, and that page table entry specifies the valid bit and the physical page number. Basically, the page table is located in physical memory at the address specified by the page table base register. Here you really need to think a little, because in order to access memory at a virtual address, you need to access memory first, to get to the page table entry. This is the indirection; this is why load indirect is a nice instruction in LC-3, for example. We're assuming the page table is entirely in memory; we're going to deconstruct that soon, because are you really going to keep this huge thing in physical memory? But basically: there's a page table base register, which holds a physical address; you add to it the virtual page number times the page table entry size, and that gives you the address of the page table entry you're looking for to do the translation. So the page table is indexed with the VPN, the virtual page number, and the page table provides the PPN, the physical page number, which in this case is 0x7FFF, and it's valid. The translation is then just concatenating the physical page number with the page offset; the page offset bits don't change during translation, as expected. Makes sense, right?

Okay, so we're going to look at a bunch of issues here, but first let's go through a relatively simple example. What is the physical address of virtual address 0x5F20? We first need to find the page table entry containing the translation for the corresponding virtual page number. So what is the virtual page number? We're going to figure that out in a moment, and then we look up the page table entry at its address, meaning we do a memory access there. And what is the memory access that you need to do?
Essentially, there's a page table base register; we add to it the virtual page number times the page table entry size, because each page table entry occupies some amount of space. So let's do this for the question I asked earlier: 0x5F20. If you look at it as a 31-bit entity, you'll figure out that the page offset is 0xF20 and the virtual page number is 5. VPN 5 means entry 5 in the page table; we look up entry 5 (counting entries 0, 1, 2, 3, 4, 5), it's valid, meaning the page is in physical memory, and the entry indicates that VPN 5 is in physical page 1. We do the translation, and the physical address is 0x1F20. Makes sense.

Then another question: what is the physical address of virtual address 0x73E0? We do the same thing. The VPN happens to be 7 here, because the bottom 12 bits are the page offset. We look at entry 7 in the page table, and the valid bit is zero, which means that this virtual page is not in physical memory. That means whatever I see in that entry is not useful for translation, though it may be useful for figuring out where the page is. So this is a page fault exception. At this point, exception handling happens; remember the precise exceptions lecture, where we talked about exception handling. You save the architectural state, an exception code is produced, and an exception handler is called. The handler comes in, checks what's going on, and says: this is a page fault exception, so I'd better service this page fault. So you jump to a page fault exception handler that services that page fault; we're going to see that soon.

Okay, so let's look at some of the issues here, starting with page table size. We should tackle this, because it's going to be hairy if we don't. Let's consider a bigger virtual address space, like we have today: 64 bits. Assuming 4-kilobyte pages, the page offset is 12 bits and the virtual page number is 52 bits. Assume that your physical address space is 40 bits; then your page table is responsible for translating 52-bit virtual page numbers to 28-bit physical frame numbers, or physical page numbers. Sounds good, I can deal with that. But how large is this page table, really? Well, 52 bits means 2^52 entries; that's a large number to begin with. Assume, conservatively, that it's only four bytes per entry; in existing PTEs it's actually larger. So this is 2^54 bytes: that's 16 petabytes. Do you have that in your physical memory? I wish I had it; I wouldn't use it on page tables if I did, let me put it that way. And that's just for one process; never forget that. If you're running 100 processes, multiply this by 100; if you want a thousand processes, multiply by a thousand. Basically, this is not realistic. (The process may not be using the entire virtual memory space; allocated versus not allocated comes into play here, but I'm not going to talk about that.)

So the page table is large, actually huge, but at least part of it has to be located in physical memory; we said the page table base register holds a physical memory address. So how do we get rid of locating the entire page table in physical memory? Clearly, we're not going to be able to do that. The solution is more indirection: multi-level, or hierarchical, page tables. We'll see that; we'll do more indirection, and we're not going to get to the page table entry immediately.
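Before we build the multi-level version, here is the single-level worked example from above in code; it's a sketch of the book's example, with the mapping VPN 5 → PPN 1 taken from its page table:

```c
#include <stdio.h>
#include <stdint.h>

/* The book's example: 31-bit virtual addresses, 27-bit physical
 * addresses, 4 KB pages; page table entry 5 is valid with PPN 1. */
int main(void) {
    uint32_t vaddr  = 0x5F20;
    uint32_t vpn    = vaddr >> 12;           /* = 0x5   */
    uint32_t offset = vaddr & 0xFFF;         /* = 0xF20 */

    uint32_t ppn    = 0x1;                   /* from page table entry 5 */
    uint32_t paddr  = (ppn << 12) | offset;  /* concatenate PPN, offset */

    printf("VPN=0x%X offset=0x%X -> PA=0x%X\n", vpn, offset, paddr);
    /* prints: VPN=0x5 offset=0xF20 -> PA=0x1F20 */
    return 0;
}
```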
We're going to go through another level of page table that gives us the page table entry. Okay, so, multi-level page tables: basically, organize the page table in a hierarchical manner, such that only a small first-level page table has to be in physical memory, and that small first-level page table is hopefully 4 or 8 kilobytes. That's the idea. This figure is from your book; this is what it looks like. You chop the address into three pieces: the top part specifies which page table to look up, so it's called the page table number; then there's an offset inside that page table; and then the page offset. So essentially this first field indexes the first-level page table; you can see that it's nine bits. You index using this first-level page table, and it gives you the base register of the page table that you actually need to index. Makes sense, right? So in a two-level scheme, after one access to the first level, you get the base register of the page table you actually need to index, the one that houses the page table entry you're looking for the translation in.

Okay, let's take a look. You take the page table number and index here. Of course, this first-level table needs to be in physical memory, but it is small: 2^9 entries times four bytes, or eight bytes, whatever; it's smaller than or equal to 4 kilobytes, so it fits nicely in your page. The entire virtual memory system is designed around this principle: the first level fits in a physical page and stays in physical memory, so you never get a page fault on your first-level page table. On the other levels of page tables, you actually can get a page fault, which is interesting. But let's not get ahead of ourselves. Essentially, this field is the page table address, telling you which page table you're going to index, and this field is your index into that page table. Each of these page tables is 1K entries, as you can see, but you have many of them; you can calculate how many, I'm not going to do that right now.

So the first-level page table must be in physical memory; only the needed second-level page tables are kept in physical memory. If you haven't touched some part of memory, you don't need to bring in the page tables that do the translation for that part of the virtual address space. This is another beauty of multi-level page tables: it solves the problem of having your physical memory covered with page tables. Only the page tables you have touched are brought into physical memory.

Okay, so here's an example, as you can see. This is the address we have, and this is the page table number, which is zero: we want to find the page table address for page table number 0, the top nine bits, and that gives us the base register of that page table, which happens to be in physical memory in this particular case; well, it's valid, as you can see over here. So you can access it, and the translation is also there; the page itself is in physical memory. So here the top-level lookup succeeds, the page table you're looking for is in physical memory, and then the page you're looking for is also in physical memory. But if you access page table number 1, you can see that it's not in physical memory: the page table you're looking for is not in physical memory, so that's a page fault on the page table itself.
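Here is what that two-level lookup might look like in code; the 9/10/12-bit split matches the 31-bit example above, but the names and the NULL-pointer convention for a missing second-level table are my own. Note how the lookup can fail at either level, which is exactly the page-fault-on-a-page-table case just described:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct { bool valid; uint32_t ppn; } pte_t;

#define L1_BITS    9    /* first level: 2^9 entries, always resident */
#define L2_BITS   10    /* each second-level table: 1K entries       */
#define PAGE_BITS 12    /* 4 KB pages                                */

/* First-level table: points to second-level tables (NULL = not resident). */
extern pte_t *first_level[1 << L1_BITS];

bool translate2(uint32_t vaddr, uint32_t *paddr) {
    uint32_t l1  = vaddr >> (L2_BITS + PAGE_BITS);          /* table number */
    uint32_t l2  = (vaddr >> PAGE_BITS) & ((1u << L2_BITS) - 1);
    uint32_t off = vaddr & ((1u << PAGE_BITS) - 1);

    pte_t *table = first_level[l1];
    if (table == NULL)
        return false;          /* fault: the page table itself isn't resident */
    if (!table[l2].valid)
        return false;          /* fault: the data page isn't resident         */

    *paddr = (table[l2].ppn << PAGE_BITS) | off;
    return true;
}
```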
So the virtual memory system needs to bring the page table you're looking for into physical memory and fix the mapping. I know there's a lot of indirection going on, but you'll get used to it.

So basically, we solved some problems with more indirection, but it cost us something: for an n-level page table, we now need n page table accesses to find the page table entry, assuming all of those page tables are in physical memory. We were doing just one additional access; now we need n accesses. This is going to increase our latency to actually get the translation.

So let me give you an example from the x86 architecture. x86 nicely calls the first level the page directory. CR3 is a physical address: control register 3 is the page directory base register. You add to it the top bits of the linear address, the directory field, and the directory entry gives you the base register of the actual page table. Then you index that page table using another part of the linear address, the virtual address, which gives you the PTE, and then the PTE concatenated with the offset is the physical address, essentially. With small pages, that's what it looks like: this field is 10 bits and this one is 10 bits, as you can see, and everything fits nicely into 4-kilobyte pages if you do it that way. But x86 actually has large pages as well; this is a 4-megabyte page, and large pages use a single level, as you can see: a single-level page table. x86 also has large addresses: this is a 48-bit address, and it has four-level paging. I had to write down what PML4E means because I never remember: page map level 4 entry. Basically, you take the PML4 base register and index into it using the top nine bits; that gives you the address where the next page table starts; you index into that using the next nine bits; that gives you another address, the start of the page directory; you index into that using the next nine bits; and finally you get your page table base from there and index into it using the last nine bits, and you get the page table entry, hopefully. Assuming these entries are all valid, you'll get the physical address after four translations, but you can also get four page faults here, so your latency can be very high if you fault at every level.

I think this is a great place to stop, because the next challenge is going to introduce more caching, let's say. So we'll pick up here tomorrow and finish up virtual memory.

Okay, so let's continue virtual memory and finish it. Well, virtual memory never ends; it's impossible to finish virtual memory, in my opinion, but we'll finish at least the basics and the interesting parts. There's a lot more that's interesting; this is an idea that's been around for 60 years, and in the end we'll spend a little time critically questioning it, because it's probably not the best way of managing memory going into the future, as memories become larger and you have different types of accelerators sharing the memory, etc. You will see some of the reasons, hopefully, within this lecture. Also, you know there's an extra assignment to get easy credit on; we're going to talk about that toward the end of this lecture, and you have some readings, which have not changed that much recently.

Recall that virtual memory is really providing the illusion of a large, separate address space per process, and this is a nice picture that I like using. Essentially, each process has its own virtual address space, which is large, while physical memory is small, and you have these indirection and mapping mechanisms that map the virtual address space of each process to portions of the physical address space. And there's a management system that ensures the programmer doesn't need to deal with managing the virtual address space or the physical memory in any of the programs. This provides a lot of benefits, as we discussed last time; I'm not going to go through all of those benefits again. Some of them may appear during the course of this lecture, but certainly this illusion of a large, separate address space is helpful, and it's maintained using indirection and mapping between virtual and physical address spaces.

We discussed what is needed to enable that indirection and mapping, and that's the page table. The page table is very interesting because it enables address translation; we've seen multiple examples, and I will not go through these again. It exists as part of the operating system, and the hardware is aware of it; basically, this is part of the contract, part of the ISA, essentially. The ISA specifies it, as we will see with some real pictures copied from ISA manuals. And we wanted to solve some issues with this page table. One issue we tackled was the page table size: this page table can be quite big if your virtual address space is large and your page granularity is small. 4 kilobytes is small; even one megabyte is small with a 64-bit virtual address space. So we calculated the size of the page table assuming such a system, saw that it's 16 petabytes, and said it's not possible to keep this in physical memory. We solved this problem by introducing another level of indirection: first-level and second-level, and n-level, multi-level page tables, also called hierarchical page tables, as we discussed. Essentially, you only need to store the first-level page table in physical memory; everything else can be in virtual memory, and you bring the page tables that you need into physical memory whenever you touch that part of the virtual address space and need translations for those parts. Makes sense, right? A page table enables translation: if you don't need a translation in some part of the virtual address space, you don't need that page table to be brought into physical memory. So hopefully you're all on board with this; it sounds good, multi-level, and you can also read your book, because it's in your book, as you can see.

Now, I was also giving you examples of multi-level page tables from the x86 manual; we actually went through this. Essentially, control register 3 is the page directory base register, which is a physical address, and only this page directory table, which is essentially a first-level page table, needs to be in physical memory; everything else can be in virtual memory. But you can see that you need to translate this to get to another page table, and index that using other bits of the linear address, the virtual address, to get to the translation; this is the page table entry, essentially, and that gives you the physical page number that corresponds to the virtual page number.
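To make the four-level walk concrete, here is a sketch in the spirit of the x86-64 figure: each level consumes nine bits of the 48-bit virtual address, and CR3 holds the physical base of the top-level table. The 9-9-9-9-12 split is real x86-64, but the helper functions and the simplified entry format are my assumptions, not the exact ISA encoding:

```c
#include <stdint.h>

extern uint64_t cr3;                        /* physical base of PML4 table */
extern uint64_t read_phys(uint64_t paddr);  /* read one 8-byte entry       */
extern void     page_fault(int level);      /* trap to the OS handler      */

#define PRESENT   1ull                      /* bit 0: entry is valid       */
#define ADDR_MASK 0x000FFFFFFFFFF000ull     /* next-level physical base    */

uint64_t walk(uint64_t vaddr) {
    uint64_t base = cr3;
    /* Levels 4..1: PML4 -> page dir pointer -> page dir -> page table. */
    for (int level = 4; level >= 1; level--) {
        uint64_t index = (vaddr >> (12 + 9 * (level - 1))) & 0x1FF;
        uint64_t entry = read_phys(base + index * 8);   /* 8-byte entries */
        if (!(entry & PRESENT))
            page_fault(level);     /* up to four faults on one translation */
        base = entry & ADDR_MASK;  /* next table base, or the final frame  */
    }
    return base | (vaddr & 0xFFF); /* frame base + untranslated offset     */
}
```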
Okay, so there's nothing magic about this, and you can actually put in numbers; these are real numbers from x86 if you use 32-bit virtual addresses. But x86 is actually a very flexible architecture; it carries a lot of baggage because of its flexibility. You can have larger pages as well, 4-megabyte pages, for example. If you use 4-megabyte pages, then your page table doesn't need to be hierarchical, as you can see; it's only a single level, because your page offset is very large and your directory field, meaning the virtual page number, is actually small. But x86 is even more flexible: you can have 48-bit addresses and a four-level page table walk. We saw this yesterday, and it's not even the worst case, as you will see soon.

I want to make sure this is very clear: the reason virtual memory is a hardware-software construct is that all of this needs to be specified in this book, the ISA, which, as you can see, at some point was 4,830 pages; you will see another version with over 5,000 pages. This is a picture I showed you in lecture eight or so, when we talked about ISAs, and I said virtual memory management is part of the ISA, along with access control mechanisms, priority, privilege, and task and thread management. This is the over-5,000-page version, from a year or so later, and there's a specific portion of this ISA (these are actually multiple volumes) that is about system management: the system programming guide. This is where the system architecture is specified, and the page table is an example of it; this is what the operating system designer needs to know when they write an operating system for this architecture, for example how to manage tasks and threads as well as virtual memory. That particular manual is volume 3, as you can see, and by itself it is 1500 pages. You can see some parts of it: chapter 2 is about system architecture, chapter 3 is about protected-mode memory management, and there's a chapter dedicated to paging; I think it has 50 chapters or so. It's big, basically, and it's specified in the ISA, which is part of the problem: it's a very rigid specification, like the specification of an instruction. This is what the page table looks like; these are the entries, as you can see. If you have 4-megabyte pages, the page directory entry looks like this, and this is what the page table entry for a 4-kilobyte page looks like. This field is the address of the page frame, essentially the physical page number, and these are the addresses of the page tables, as you can see over here. And then there are some other bits that we're going to look at; one of them is the valid bit. One means valid: if it's valid, the translation is valid, and you can either go to the next page table's address or to the physical page number, as you can see. If it's zero, the entry is ignored for translation, and you get a page fault exception; we've seen that, and we're going to see more of it. But the purpose of showing you this here is that it is part of the ISA: somebody put this down, and if you want to write an operating system that does virtual memory management properly on this architecture, you have to build your page table this way. If you build it some other way, you're not obeying the ISA.
And the processor is designed to look at these bits in a particular way; we're going to cache these bits in TLBs, for example, soon, and the processor is designed to do the page table walks in hardware, so you don't have to do them in software. We're not going to talk about that here, but you may see it in future courses. Essentially, if you don't design your software to obey this, your software will not work on the hardware; and if you don't design your hardware to obey this, software will not work on the hardware you designed. That's the interface. It's not an instruction, as you can see: instructions operate on these memory structures.

Okay, so if you go into the specification, you'll see that "present" is the valid bit, essentially; x86 has a nice naming strategy, and I like the "present" name: present, absent. And then there are a bunch of other bits that we will see soon; some of them are for access protection, which we will talk about.

So this is the five-level page table structure that x86 has; there's a five-level translation as well, to increase the address space. You can see that you first need to get this top page table entry; if it's valid, that's good, otherwise you get a page fault, and the page table gets brought in. Then you need to translate the next level, the fourth level; if it's valid, good, if not, you get a page fault there. Then you translate the next one (well, one of these, depending on how large your page is; ignore the one-gigabyte page here and look at this path): you get the address if it's valid, go to the next level, get the address if it's valid, go to the next level, and finally this gives you the physical page number. So if someone asks you how many page faults you can get while trying to get the physical page number of a load, which is a virtual address, it could be a lot: it could be five, actually. And that's just to execute the load; to fetch the load instruction in the first place, the PC is also a virtual address, so you can get five more page faults there. That's ten. This is all assuming you don't have a page table entry straddling the boundary of a page; if you're at the boundary of a page, you might actually get two page faults to service one access. That's something I think doesn't happen in the x86 architecture unless you have serious misalignment issues, but it could happen in other architectures; it used to happen on the VAX, for example. So you can get lots of page faults in this process; that's why you really don't want to traverse this page table every time, and that's why we're going to introduce the next idea very soon.

But just to give you an idea: these are the 4-kilobyte pages in x86, these are the 2-megabyte pages, and these are the 1-gigabyte pages. So you can have huge pages in x86, and even 1-gigabyte pages, as you can see, require two levels of page tables with a 48-bit address. So if your virtual address size increases, your multi-level hierarchy also grows. Makes sense, right? This is one of the reasons it's very difficult to scale the system to very, very large memory sizes; if you have petabytes and petabytes of memory, there's a lot of inefficiency in this translation.
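To tie the entry-format discussion together, here is a small decode sketch for an x86-style 4-kilobyte page table entry. The bit positions follow my reading of the format shown above (present, read/write, user/supervisor, dirty); treat the exact masks as a sketch to be checked against the manual rather than a verified definition:

```c
#include <stdint.h>
#include <stdbool.h>

#define PTE_PRESENT  (1ull << 0)  /* "P":   translation is valid        */
#define PTE_WRITABLE (1ull << 1)  /* "R/W": writes are allowed          */
#define PTE_USER     (1ull << 2)  /* "U/S": user-mode access is allowed */
#define PTE_DIRTY    (1ull << 6)  /* "D":   the page has been written   */
#define PTE_FRAME    0x000FFFFFFFFFF000ull  /* physical frame address   */

bool     pte_present(uint64_t pte) { return (pte & PTE_PRESENT) != 0; }
bool     pte_dirty(uint64_t pte)   { return (pte & PTE_DIRTY)   != 0; }
uint64_t pte_frame(uint64_t pte)   { return  pte & PTE_FRAME;         }
```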
So people are looking into how to do this translation in a much more efficient way. Okay, so we solved the first challenge; I added a little bit more just to give you an idea of what this is really about. We saw that the "page table is large" challenge is handled using multi-level page tables, at the cost of additional indirection, additional latency. Now let's talk about this latency issue. Basically, the second challenge is that whenever you do address translation, each instruction fetch or load or store requires at least two memory accesses — assuming you get the translation whenever you access the page table, and assuming non-hierarchical page tables: one memory access for address translation, where you read the page table, and then one to access the data with the physical address after translation. This is the good case; in the bad case there could be lots of page faults on top, as you saw. With a hierarchical page table, that two goes to maybe four, five, six, seven, depending on how many levels you have. But let's assume it's two; even two is bad, right? You don't want to access memory twice to get the instruction, and you don't want to access memory twice to get the data that the instruction needs; you want to access memory once and get the data immediately. So this is the problem: two memory accesses to service an instruction fetch or load/store greatly degrade execution time, and the number of memory accesses increases with multi-level page tables, as we just discussed — unless we're clever. This is where another part of the hardware comes in: hardware accelerates this translation process using special caches. So we're going to introduce a special cache called the translation lookaside buffer. How many of you have heard of this in the past? Okay, some of you are hearing it for the first time; that's great, and you'll hear more in the future, perhaps. Basically it's a cache: it caches page table entries in a hardware structure in the processor to speed up translation. That's the idea, and all of the principles of a cache apply here. The TLB is essentially a small cache of the most recently used page table entries — in other words, recently used virtual-to-physical translations; it's a PTE cache. It reduces the number of memory accesses required for most instruction fetches and loads and stores to only one TLB access. So it can be very small: you access the structure and you get the translation immediately; you don't need to access memory at all anymore, because you cached the translations there. That sounds good. It's a specialized cache, because you do need to do this translation. The reason it works is the reason why caches work: we have spatial and temporal locality in translations, in page table accesses. And you can imagine why: say you're streaming through memory and you've allocated a large page; clearly that translation is used many, many times. If you're executing lots of instructions in a sequential manner, they all belong to the same page, right? And the next page you touch is at the next virtual address, so if you're going through memory sequentially, you're basically going through the virtual
memory address space sequentially. So the translations have locality as well: spatial locality, and also temporal locality, because you have many instructions accessing the same data in the same page, or instructions in the same page. So this is the reason, basically: memory accesses have temporal and spatial locality, and large page sizes clearly exploit spatial locality better, just like the large block sizes we saw in caches — remember that picture I showed you yesterday; a page is equivalent to a block, essentially. If you have a gigabyte page and you have very good spatial locality, your TLB hits most of the time, right? And consecutive instructions and loads and stores are likely to access the same page, as we already discussed. So the TLB is a cache of page table entries, in other words translations. It's small, it's accessed in a few cycles, and typically it's multi-level: we have a hierarchy of TLBs in existing systems, just as we have a hierarchy of caches. So we have a hierarchy of caches and a hierarchy of TLBs; one of them handles the data, the other one handles the translations. You may say this is a lot of waste, and I agree, that's true, but that's the cost of making life easy for the programmer. Typically at level one you have 16 to 512 entries, usually with high associativity, and you typically get 90 to 99% hit rates — but it all depends on the workload, right? It all depends on the access pattern: if you're accessing memory completely randomly, with no locality in translations, then you get a 0% hit rate in this cache, and it's basically terrible. So hopefully you see the benefit: this reduces the number of memory accesses to only one TLB access. You don't need to access physical memory or the page table at all; the page table resides in physical memory, but you don't need to access it whenever you reuse the translations. Okay, so let's take a look at an example: a two-entry TLB. This is the toy example from your book; it's essentially a two-entry cache. So what does it do? It caches a translation; it caches the page table entry. Here we don't have the access control bits — normally it caches the access control bits as well, but your book simplifies things. We have two entries, and both of them are valid here. The translation looks like this: in each entry you have the virtual page number and the corresponding physical page number, and in the other entry you have another virtual page number and physical page number. So this is caching two page table entries for two different pages, and you know which pages they are. If the processor is trying to access this virtual address, it consults this TLB: it accesses the TLB, and you can see that this is a two-way, or fully associative, cache in this particular case. It searches for the virtual page number it's looking for, which is essentially part of the tag in the TLB, and in this case there's a match on entry 0, or way 0 if you will, and you get the translation: you get the physical page number, concatenated with the offset, which doesn't get translated, as we discussed yesterday, and that gives you the physical address. So instead of doing the page table access, we replace it with a TLB access. If you miss in the TLB — if you don't have the translation in the TLB — then you do the page table access that we saw yesterday, and when you do that page table access, you access the page table entry and you bring it into the TLB.
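Here's a minimal sketch, in C, of the lookup this toy example performs: a fully associative search over the entries, assuming 4-kilobyte pages and ignoring the access control bits, just like the book's example.

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 2
#define PAGE_SHIFT  12

struct tlb_entry {
    bool     valid;
    uint64_t vpn;   /* virtual page number (the tag)    */
    uint64_t ppn;   /* physical page number (the data)  */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Fully associative lookup: compare the VPN against every entry.
   On a hit, the physical address is just the PPN concatenated with
   the untranslated page offset. Returns false on a TLB miss, in
   which case the page table is walked and the entry filled. */
bool tlb_lookup(uint64_t vaddr, uint64_t *paddr_out)
{
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1ULL << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr_out = (tlb[i].ppn << PAGE_SHIFT) | offset;
            return true;   /* TLB hit: no memory access needed */
        }
    }
    return false;          /* TLB miss: walk the page table */
}
```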
That's the idea: just like you would cache data, you're caching page table entries in this case. No magic. Basically everything we discussed in the caching and prefetching lectures applies to TLBs. You can have instruction and data TLBs, for exactly the same reasons we have instruction and data caches. You can have multi-level TLBs, for exactly the same reasons you have multi-level caches. You have associativity and size choices and trade-offs, again for similar reasons. You can apply insertion, promotion, and replacement policies: what do you keep in which TLB, and how do you decide that? You can prefetch into the TLBs: you can detect that you're striding through memory and go walk the page tables to get the translations early, and this can work together with the prefetcher, in fact. And you need to keep TLBs coherent if they're in different processors: if one of them modifies a translation for some reason — because the operating system may be running on it — you need to keep the other processors, which may have cached an older translation, coherent. And you can have shared or private TLBs across cores. So everything we said applies here: it's just a cache, except it's caching something special, translations. But because you've taken this course, you know that there's nothing special; everything is bits in the end. You're caching bits, and you need to make sure those bits are served correctly; we're just assigning meaning to bits, and in this particular case that meaning is a translation. Okay, so clearly we're not going to go into this, but you can watch more detailed lectures that go through an example of how modern processors work; we won't have time to do that, but we'll release that lecture and you can take a look at it. The memory translation hierarchy in modern processors is actually quite complicated. There are also hardware page table walkers: in x86, for example, whenever you need to get a translation, there's a specialized engine that walks the page table to figure out where the translation is. So you don't do that with a program; you do it with a specialized hardware accelerator inside the x86 processor. Sounds fancy, right? And this is one of the first accelerators added to the system, before a lot of other accelerators; it came around the floating-point times, basically: they added the floating-point accelerator and they also added the virtual-memory accelerator, if you will, because virtual memory is so important. Okay, so let's talk about some examples before we end this part of the lecture; actually, I have a lot of interesting things here, so we're not done yet. As I said, virtual memory requires both hardware and software support. The page table is in memory, and it can be cached in special hardware structures, as we have just seen, and the OS and hardware both know the page table organization and structure through the ISA. The hardware component is called the MMU — if you hear this term, it's the memory management unit — and it refers to everything that's in hardware: not just the TLB, but also the page walker, and part of the page fault handler, I should say, as we will see. But it's the job of the software, the operating system, to use the hardware: to populate the page tables and to decide what to replace in physical memory. We're going to talk about some of these things.
The OS also changes the page table base register on a context switch, right? Because when you context switch to some other process, you should really change the translation hierarchy, and that happens by changing, for example, CR3 — the page table base register, or control register 3, in x86 — so that you use the correct page table. And the OS needs to handle page faults and ensure correct virtual-to-physical mappings. Again, a lot of this is done cooperatively between the hardware and software, but it's the responsibility of the operating system, because the operating system is the one executing the page fault handler, for example. And page fault handling is special: it's an exception, an exceptional condition the process gets. Normally, if you remember the exceptions lecture, we said that exceptions are handled at the privilege of the process. This is not the case here: here you escalate privilege so that you can actually handle the exception, because the page tables are very important, as we will see soon. You cannot let the user change them; if you let the user change them, that's a huge security problem. Basically, you get a page fault, the operating system kicks in, and it ensures the page fault is handled in a secure way, so that the mappings are changed without any problem — because if you change a mapping in the wrong way, either the process will access wrong data, or the process may access some other data that it doesn't have permission to access, as we will see. So a page fault is a special exception; the user cannot handle it, or I should say: the context in which it is handled is the system, not the user. Okay, so address translation: we've already seen this, so I'll go through it relatively quickly. This dictates how to obtain the physical address from a virtual address. Page sizes are specified by the ISA, and there can be multiple page sizes, as you can see today; this actually causes issues in how to organize the TLB as well — do we have multiple different TLBs for different page sizes? Now you have a problem, right? If you're interested in this, you can watch the additional lecture; we don't have time to go over it. Basically, there are trade-offs in the page sizes similar to the trade-offs we have in caches, and the page table contains an entry for each virtual page, called the page table entry. Then the question is: what is in a PTE? Let's take a look at that a little bit. We've already said that the page table is the tag store for physical memory, the data store: it provides the mapping between virtual memory and physical memory, and the page table entry is really the tag store entry for a virtual page in memory. So it has a valid bit, as we discussed, which indicates validity, or presence in physical memory. You need tag bits: these are the physical frame number, or physical page number, bits that support translation. You need bits to support replacement, which we're going to talk about soon: what do you replace from physical memory if your physical memory is full? You need to know which pages are not being touched, and this is actually a problem, because we have so many pages in physical memory, as we will see soon. You need a dirty bit to support write-back caching in physical memory — we discussed this already, because write-through is pretty expensive for physical memory. And you need protection bits to enable access control and protection.
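To put those fields in one place, here's a sketch of a 64-bit PTE's metadata as C bit masks; the positions follow the x86-64 layout (present, R/W, U/S, accessed, dirty, NX), but treat it as an illustration rather than the full specification.

```c
#include <stdint.h>

/* Metadata carried by one page table entry (x86-64-style layout). */
#define PTE_VALID    (1ULL << 0)   /* "present": page is in physical memory */
#define PTE_WRITE    (1ULL << 1)   /* protection: writes allowed            */
#define PTE_USER     (1ULL << 2)   /* protection: user-mode access allowed  */
#define PTE_ACCESSED (1ULL << 5)   /* set on access; supports replacement   */
#define PTE_DIRTY    (1ULL << 6)   /* set on write; enables write-back      */
#define PTE_NX       (1ULL << 63)  /* no-execute                            */
#define PTE_PFN_MASK 0x000FFFFFFFFFF000ULL /* physical frame number (tag)   */

/* Extract the physical frame number from an entry. */
static inline uint64_t pte_pfn(uint64_t pte)
{
    return (pte & PTE_PFN_MASK) >> 12;
}
```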
Now, these protection bits have nothing to do with address translation; they're added to this system because the system is capable of doing more than address translation. So this is completely orthogonal to address translation; you could actually use something else for protection, but today virtual memory serves two functions: one is access protection, and the other is address translation, and we'll see this more. So this is my nice handwriting — you'll see better pictures, perhaps — but I've basically already said all of this. Okay, so we recall address translation; we've already seen this, so I'm not going to go through it: this is the general address translation form, you take the virtual page number and translate it to a physical frame number, and you have a separate page table per process. We've covered all of this, so let's move to something we have not fully covered: in address translation, you may get a page fault. So if you think about the MMU: the MMU can be a structure that includes TLBs — and it does include TLBs, actually. You send the virtual address to the MMU, and the MMU somehow responds with a translation. If it's a hit, meaning it either hits in the TLB or the page is in physical memory — meaning there's a valid page table entry for that virtual page — it can be classified as a TLB hit or a TLB miss, as we discussed earlier. In this case there's no page fault, everything is good: you get the translation after either a TLB access, or a TLB access plus multiple memory accesses to get the page table entry. Now let's talk about the page fault; this is something we have not talked about so far, so let's see how it's handled. The processor sends the virtual address to the memory management unit, and it misses in the TLB; in the end, when you try to figure out the translation, the translation doesn't exist: either a page table level that you're looking for isn't present in physical memory, because it's on disk somewhere, or the actual page that you're looking for isn't in physical memory. So you need to handle the page fault. Essentially, the valid bit is zero and a page fault exception is triggered. Once the page fault exception is triggered, the page fault handler kicks in: assuming physical memory is full, it first evicts some dirty page out to the disk, for example, assuming it's going to replace that page; then the handler pages in the new page and updates the page table entry in memory; and then it returns control to the original process, restarting from the faulting instruction that caused the fault. Hopefully that now gets a page hit, and hopefully a TLB hit, because you filled all of those when you handled the fault. So let's take a look at how this is handled a little bit. Basically, this is a miss in physical memory, if you think about it, because physical memory is a cache of the disk. The page table entry indicates that the virtual page is not in memory; an access to such a page triggers the page fault exception, and we invoke the OS exception handler. Other processes can continue executing; this process needs to stop. But now you know about runahead execution, so maybe you can do runahead execution in software too — this is actually a real proposal, a beautiful proposal that showed benefits; I'm not going to go into it. You could speculatively execute the process at the software level; if you think critically, you can apply a lot of these concepts in software as well. Okay, so let's not go into that right now.
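Putting the fault-handling steps just described into one place, here's a high-level sketch; every helper name is a hypothetical placeholder for what a real OS implements, and the point is only the order of the steps.

```c
#include <stdbool.h>
#include <stdint.h>

struct process;  /* opaque: the faulting process's context */

/* Hypothetical OS primitives, named after the steps above. */
extern bool     free_frame_available(void);
extern uint64_t allocate_free_frame(void);
extern uint64_t choose_victim_frame(void);            /* e.g., clock */
extern bool     frame_is_dirty(uint64_t frame);
extern void     write_frame_to_disk(uint64_t frame);
extern void     invalidate_old_mapping(uint64_t frame); /* fix old owner's PTE */
extern void     read_page_from_disk(struct process *p, uint64_t vaddr,
                                    uint64_t frame);    /* done via DMA */
extern void     update_pte(struct process *p, uint64_t vaddr, uint64_t frame);
extern void     restart_faulting_instruction(struct process *p);

void handle_page_fault(struct process *p, uint64_t vaddr)
{
    uint64_t frame;
    if (free_frame_available()) {
        frame = allocate_free_frame();
    } else {
        /* Physical memory is full: evict a victim first. */
        frame = choose_victim_frame();
        if (frame_is_dirty(frame))
            write_frame_to_disk(frame);  /* write-back before reuse */
        invalidate_old_mapping(frame);
    }
    read_page_from_disk(p, vaddr, frame);  /* milliseconds, via DMA */
    update_pte(p, vaddr, frame);           /* mark valid, set the PFN */
    restart_faulting_instruction(p);       /* re-execute the load/store */
}
```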
Basically, the OS has full control over page placement. So what happens is: before the fault, the CPU is accessing this virtual page, which is on disk and not mapped to memory; after the fault is handled, the CPU accesses the same page again, and now the page is mapped into physical memory, the page table is fixed to reflect that, and something has perhaps been kicked out of physical memory. Okay, so how is this serviced? There's a lot involved here; now we're going up to the system level a bit more, and if this course continues in that direction, there's actually a lot at the system level to think about as well. Basically, the processor signals the I/O controller, initiated by the operating system. The operating system tells the I/O controller: I want to read some block from disk — this is the address of the page that I want from the disk — and I want to store it to memory starting at address Y, because that's where I want to place the page; I know the mapping, and I'm going to make sure everything works. Then a disk-to-memory read occurs, and this is usually handled using direct memory access: there's a setup, and then the transfer from disk to memory happens automatically, without involving the processor. This is one of the cases where we don't involve the processor, because there's a lot of data that may be transferred from disk to memory — it could be one gigabyte, as we have seen, with a large page. So this happens under the control of the I/O controller, and eventually it finishes: the controller signals completion and interrupts the processor, saying: operating system, I'm done, you can resume the process that faulted on this page. The operating system fixes all the mappings and resumes the process. Makes sense, right? And you can clearly see that this can take lots and lots of microseconds — actually milliseconds, or more — so you don't want page faults in your system; that's basically the takeaway. Okay, so let's talk about page replacement algorithms. If physical memory is full, you have a problem at hand: you need to replace one physical frame, because you're trying to bring some other page into physical memory. So first, how do you determine that physical memory is full? Usually you have a free list of physical pages; these are operating system data structures — the operating system keeps track of them — and you can potentially cache them in hardware, just like we've been caching other things in hardware. Now, we've done this exercise before: is perfect LRU feasible? We said it's not feasible even for caches, right? So let's look at the size of this "cache". I'm going to take a one-terabyte memory, because this is almost reality today; we're going to have one-terabyte memories soon. One terabyte is — how much is it? — 2^40 bytes, I think; yes: tera, peta, exa; tera is 2^40. And 4-kilobyte pages, small pages, are 2^12 bytes, so essentially we have 2^28 pages. How many possibilities are there for ordering them, if you want to do perfect LRU? You know the answer by heart now: 2^28 factorial. That's a huge number; you could probably figure out what that number is and write it down, but it would be a waste of paper, let's say.
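If you want to check the arithmetic, here's the back-of-the-envelope calculation as a small C program; the Stirling estimate in the comment is my own rough bound on what encoding a perfect LRU ordering would cost.

```c
#include <stdio.h>

/* 1 TB of physical memory with 4 KB pages:
   2^40 / 2^12 = 2^28 frames, and perfect LRU would have to
   distinguish 2^28! possible recency orderings. */
int main(void)
{
    unsigned long long frames = (1ULL << 40) >> 12;  /* = 2^28 */
    printf("frames = %llu\n", frames);               /* 268,435,456 */
    /* Stirling: log2(2^28!) ~ 2^28 * (28 - 1.44) ~ 7.1e9 bits,
       i.e., merely encoding one full LRU ordering takes on the
       order of 900 MB of metadata -- clearly infeasible. */
    return 0;
}
```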
Essentially it's a huge number, so modern systems again use approximations of LRU, for two reasons: one of them is this, and the other is that LRU is not a perfect predictor of the future anyway. So I'll introduce the clock algorithm very quickly, because the algorithms here need to deal with 2^28 pages — how do you figure out which one to evict? Whatever heuristic you come up with is actually not so easy to implement in the end. The clock algorithm is very clever, but there are more sophisticated algorithms that people have developed; usually it's a more sophisticated algorithm in practice. There's a nice algorithm from IBM that was published about 20 years ago now: the adaptive replacement cache. They called it a cache, but it's really managing physical memory — physical memory is a cache, as you know. So what is the clock algorithm, very quickly? It resembles a clock, basically. You have all of these pages in physical memory — imagine 2^28 of them; I'm showing you some small number here, and this is one of the nicer pictures I could find — and the clock algorithm arranges all the pages like this, pointing to one page at a time. Let me say it this way: you keep a circular list of the physical frames, or physical pages, in memory — the OS does this, of course — and you keep a pointer, or hand, to the last examined frame, or last examined physical page, in the list. When a page is accessed, the clock algorithm sets the referenced, or accessed, bit in the PTE for that page, saying: I recently accessed this page. Okay, that's good. Now, when a frame needs to be replaced, the clock algorithm kicks in, and it replaces the first frame whose referenced bit is not set; it finds this frame by traversing the circular list, starting from the pointer and going clockwise. So if you look at this example: we're pointing at frame zero, and its referenced bit — that's the bit shown — is not set, so you replace this one if you need to replace something, and then you advance the hand. If the next frame's bit is not set and you need to replace again, that one gets replaced next; if its bit is set to one, you skip it and go find the first zero. Makes sense? Basically, the goal is to find the first not-recently-accessed physical frame. And during the traversal, if you skip a page that has been recently accessed — one whose bit is set to one — you set that bit to zero, meaning: I've seen this page, it was recently accessed, I'm not going to replace it, but because I've gone past it, I'll clear the bit, so that next time I come around, if it still hasn't been accessed, I will replace it. And of course you're not going to get back to it for a while, assuming you have 2^28 pages to go through with this hand, right? Finally, you set the hand pointer to the next frame in the list. So this is a very rough approximation, and it was used in early Linux kernels; variants of it are still used, frankly. This is clearly an approximation whose goal is to avoid replacing the recently accessed pages.
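Here's a minimal sketch of the clock algorithm just described; a real OS reads and clears the hardware-set accessed bit in the PTE rather than a plain boolean array, and of course has far more than eight frames.

```c
#include <stdbool.h>
#include <stddef.h>

#define NFRAMES 8          /* imagine 2^28 in a real system */

static bool referenced[NFRAMES];  /* mirrors the PTE accessed bits */
static size_t hand;               /* the clock hand: last examined frame */

/* Pick a victim frame: the first one, clockwise from the hand,
   whose referenced bit is not set. */
size_t clock_choose_victim(void)
{
    for (;;) {
        if (!referenced[hand]) {
            size_t victim = hand;
            hand = (hand + 1) % NFRAMES;  /* advance past the victim */
            return victim;
        }
        /* Recently used: give it a second chance -- clear the bit
           and move on. It becomes a candidate next time around. */
        referenced[hand] = false;
        hand = (hand + 1) % NFRAMES;
    }
}
```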
And there's an interesting thing here: there's a referenced bit in the PTE, in the page table entry, and this is also specified by the ISA. The ISA specifies a bit — I will show you it in x86, called the A bit, the accessed bit — and this bit needs to be set. So how do you set it? You set it in the TLB. And how do you propagate it to the page table? When you've set it in the TLB and you later need to replace that entry in the TLB, you need to write it back to the page table. So there are actually really interesting management issues here, and when you write to the page table, you need to do memory accesses, right? So keep this in mind: these things complicate memory management in general. Basically, what we're doing is making life easy for the programmer, but we're adding a lot of memory accesses while doing that, as you can see. Okay, so there are differences between hardware cache replacement and page replacement. Physical memory is a cache for disk, as we've seen, but it's managed by the system software, via the virtual memory subsystem. Page replacement is very similar to cache replacement, and the page table is a tag store, as we've discussed. The differences: first, the required speed of access to a cache versus physical memory. Page replacement can be slow; it doesn't need to be very fast, if you will, because caches require accesses that are as quick as cycles, whereas here we're talking about bringing a page from disk all the way into memory; you can take your sweet time figuring out what to replace, because the transfer is going to take at least microseconds. Second, the number of blocks in a cache versus physical memory: hopefully we've seen that; I just gave you an example with 2^28 blocks, or pages, in physical memory. This makes things complicated: physical memory replacement algorithms are not easy to implement — we already discussed this; it's related to the first point, which I'm not going to repeat. And third, the role of hardware versus software: caches are hardware-managed, because the latencies are low, but page replacement is mostly under the control of software, though, as I said, it's accelerated in hardware. Okay, any questions? I'm going through some of these relatively quickly; there's a lot of complexity in the system, actually, but if you watch the lecture that covers what is done in a real system — I give Intel Skylake as an example — you'll see there's even more complexity than what I'm talking about. That's one of the reasons why we, as researchers, believe this needs to be rethought completely; but of course, rethinking something that has been around for 60 years is not easy, because the infrastructure has been developed for 60 years, and programs have been written this way. If you actually do something completely different, it's not going to be easy to adopt, right? Okay, let's talk about the memory protection aspects of virtual memory — I like this part, because I get to talk about Rowhammer here too, as you will see. This is interesting because memory protection is necessary, right? Multiple programs, or processes, run concurrently; each process has its own page table — never forget that — and each process can use its entire virtual address space without worrying about where other programs are or what they're accessing. This is beautiful, right? A process can only access physical pages mapped in its own page table; it cannot overwrite the memory of other processes, basically. This enables isolation, essentially: you're isolating processes from each other, and it provides protection as well. But protection has other categories, too, because within a process you may be able to read a page but not write it, or read and write a page but not necessarily execute it. So you have different permissions on a per-page basis.
For example, if you don't want your programs to be modified, you can have read access to the code but not write access to the code, right? If you don't want someone to write to some location and then execute that data — which would be interpreted as instructions when you jump to it — you mark it as non-executable. So basically, you can enable access control mechanisms per page this way. Okay, so remember: page tables are per process; we've seen this. Essentially, access protection and control can also happen via virtual memory, and this is called page-level access control, or protection. Not every process is allowed to access every page: for example, you need supervisor-level privilege to access system pages. An example of a system page is a page table; all of those page table structures are system pages, and as a user you should not have any access to them — otherwise you could change your permissions to access anything, right? Or, for example, you may not be able to execute instructions in some pages, because you're not allowed to execute that code. The idea for enabling this is to store access control information on a per-page basis in the process's page table, and to enforce access control at the same time as translation. That's the idea, and it works nicely, because that's exactly when you really need to enforce access control: when you're trying to access a page, you also need to translate it, and assuming you have permission, you can proceed with the translation; if you don't have permission, you don't even proceed with the translation. So, basically, the virtual memory system serves two functions today: one is address translation, to give you the illusion of large physical memory, and the other is access control, to give you memory protection. And this is my nice picture: essentially, when you're doing translation, you also bundle access control with it, and in fact access control takes priority. If you don't have access, forget about the translation; you don't need it. If you're trying to access a page that you don't have the proper permissions for, you get an exception immediately — it's called an access protection exception in some systems. Essentially, you get kicked to the operating system, and the operating system usually stops the program, because it's trying to access something it's not allowed to access. Okay, so you can see in this example that we extend the page table entries with permission bits; this is a very simple example: in addition to the translation information, we have read/write permission bits, and then of course there need to be valid bits, et cetera. Different processes have read and write access, and you can see that they're sharing physical page six over here; it could be a shared library that they're both executing, or some shared data — again, it needs to be set up through the operating system to be able to do that. And they may not be sharing other pages: for example, physical page nine is not accessible by this process. You may have read and write permissions to some of these pages and no write permission to others; basically, it depends on what the ISA specification says you can do.
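As a sketch of how access control bundles with translation — the permission checks come first, so a violation never produces a physical address — assuming x86-style bit positions (present=0, R/W=1, U/S=2, NX=63):

```c
#include <stdbool.h>
#include <stdint.h>

#define P_VALID (1ULL << 0)
#define P_WRITE (1ULL << 1)
#define P_USER  (1ULL << 2)
#define P_NX    (1ULL << 63)
#define P_PFN   0x000FFFFFFFFFF000ULL

enum access { ACC_READ, ACC_WRITE, ACC_EXEC };

/* Returns false to signal an exception: either a page fault or an
   access protection exception, which the OS handler sorts out. */
bool translate_checked(uint64_t pte, enum access acc, bool user_mode,
                       uint64_t vaddr, uint64_t *paddr_out)
{
    if (!(pte & P_VALID))                     return false; /* page fault */
    if (user_mode && !(pte & P_USER))         return false; /* protection */
    if (acc == ACC_WRITE && !(pte & P_WRITE)) return false; /* protection */
    if (acc == ACC_EXEC  && (pte & P_NX))     return false; /* protection */
    *paddr_out = (pte & P_PFN) | (vaddr & 0xFFF);
    return true;
}
```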
So there are privilege levels in x86; again, I don't have time to go over this in detail, but there are these levels that you may see. They're not necessarily the best architecture for protection — this is old, let's say — but just to give you an idea: usually at ring 0 you have the supervisor level, the highest privilege level; the operating system has this privilege; and user applications are usually at the lowest privilege, right? User versus supervisor: supervisor is the kernel, in modern terminology; it used to be called supervisor in the past. Okay, so let's take a look at these page directory and page table entries a little more closely. If you look at a page directory entry in x86 — this is the first-level page table — you get the address of the page table, and you have a bunch of bits over here. So this is the address of the page table, and these are the flags: you can see that one of them is read/write and one of them is user/supervisor. You have the same thing in the PTE as well: read/write and user/supervisor flags. So now you can actually have protection at larger granularities, too: at the granularity of a whole page table, as opposed to a single page, you can specify user versus supervisor privilege, as well as read/write access, because of the multi-level nature, right? This multi-level nature enables multi-level access control as well, and it gets a little hairy, as we will see. Basically, the PDE protects all 1,024 pages in a page table — you can see the bits I mentioned over here: read/write, whether you have read or write permission, and user/supervisor permission — whereas the PTE protects one page at a time; the granularity is smaller there. Okay, and this is the specification of the ISA. Now you can see how messy this gets: depending on what you have in the page directory entry and the page table entry, there's a combined effect that determines whether or not you can access the page, and we can look at that over here. Again, I don't have time to go over it in detail, but basically this is what the ISA specifies, and the operating system needs to obey it, and the hardware needs to obey it, period. If they specified it wrong, too bad — fix the ISA somehow, right? Okay, so any questions on this? I'm not suggesting this is the best architecture for protection, but it's what we have in common systems; ARM has its own protection mechanisms, which are slightly better, but not too much better, if you ask me. This is what we do today; there could be better, stronger protection mechanisms, but they are a lot more costly to implement — let me put it that way.
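To make the combined effect concrete: for user-mode accesses, x86 effectively ANDs the R/W and U/S flags across the levels of the walk, so a page is user-writable only if every level on the path allows it. A sketch (supervisor-mode writes additionally depend on CR0.WP, which I omit):

```c
#include <stdint.h>

#define F_WRITE (1ULL << 1)  /* R/W flag, same position in PDE and PTE */
#define F_USER  (1ULL << 2)  /* U/S flag */

/* Effective user-mode permissions: the AND across the levels.
   A cleared bit at ANY level denies that right for the whole page. */
static inline uint64_t effective_perms(uint64_t pde, uint64_t pte)
{
    return pde & pte & (F_WRITE | F_USER);
}
```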
Now let's talk about Rowhammer. What if your hardware is unreliable, and someone can flip the access protection bits? This shows how flaky some of this access protection is, as we will see, and it also has implications for how important these issues are, in my opinion: a user-level program can gain supervisor-level access — in other words, access to all data on the system — by flipping the access control bit from user to supervisor. That's one way to do it: you somehow flip the access control bit on a page table entry to supervisor, and you gain access to a page table, because that page table entry may be pointing to your own page table, right? Makes sense? I'll give you an example. Can this happen? Well, since you know about Rowhammer, you know that it can: you can predictably induce errors in most DRAM memory chips, and we already talked about this. Let me give you the idea very quickly again, and then I'll show you a security attack that actually takes over the system by taking advantage of these bit flips. So basically, the issue was this: whenever we activate a row, we apply high voltage to it, and whenever we precharge the row — if you remember from the DRAM lectures — we apply low voltage to it. Now, if you keep doing this repeatedly from the memory controller, by accessing memory, you get bit flips in rows that are physically adjacent. Clearly this should not happen; you can say this is bad, data corruption, blah blah, but it's actually a security problem as well. And as we discussed in an earlier lecture, most of the chips are vulnerable — actually, all of the chips are vulnerable today. So let's take a look at an example; it's a user-level program. When we first published the paper, we actually put this code online, and later Google took it, improved it, et cetera, as we will see. What this code does is essentially hammer addresses X and Y: it selects addresses X and Y in some way such that they map to the same bank, and it avoids the caches by flushing X and Y from the cache, and it avoids the row buffer by alternating between different rows in the same bank. Basically, we have to get rid of the caching in the system, right? How do you get rid of the caching? By flushing the cache blocks you're touching in rows X and Y — and you also get rid of the row buffer's caching by opening different rows in the same bank. So what this program does is keep repeatedly accessing rows X and Y. Sounds good, right? You can actually try this program if you want; I'm curious if it still works — there are better programs out there today; this is 2012, remember.
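For reference, here's a C rendering of the hammering loop described above — a sketch, not the exact published code. Picking X and Y so they land in different rows of the same bank is the hard part, and is omitted here.

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */

/* Alternately read two addresses in different rows of the same DRAM
   bank: the reads force row activations, the flushes defeat the CPU
   caches, and alternating rows defeats the row buffer. */
static void hammer(volatile int *x, volatile int *y, long iters)
{
    for (long i = 0; i < iters; i++) {
        (void)*x;                      /* activate X's row          */
        (void)*y;                      /* activate Y's row          */
        _mm_clflush((const void *)x);  /* evict X from the caches   */
        _mm_clflush((const void *)y);  /* evict Y from the caches   */
        _mm_mfence();                  /* order the loads & flushes */
    }
}
```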
So you get bit flips, and these are real bit flips, reproduced in real life. Now let's get back to how this relates to memory. If you look at the paper we wrote in 2014, it says: memory isolation is a key property of a reliable and secure computing system — an access to one memory address should not have unintended side effects on data stored in other addresses. And virtual memory actually ensures this, right? Virtual memory is great because it ensures this; but unfortunately, virtual memory is powerless if you have a bit flip at the underlying levels, like this. So we said in this paper that someone can hijack your computer if they maliciously take advantage of these bit flips, and these folks from Google Project Zero did exactly that: they took the program I showed you, improved it, made it better, and wrote a blog post — I have a copy-paste from that blog post — and they also gave a talk at the Black Hat conference in 2015 whose video you can watch. They said they tested a selection of laptops and replicated the issue we discussed, and they built two attacks: one takes over the Google Native Client sandbox — not so interesting — and the other is an attack on the virtual memory subsystem, as we will see. In their own words: they used Rowhammer-induced bit flips to gain kernel privileges on x86-64 Linux when run as an unprivileged userland process. So they were able to induce bit flips in page table entries and gain write access to the page table of the user-level process itself; and if you gain write access to your own page table, you can change anything you want, right? That's the idea here. So let's take a look at how they did it, because I think you now have the background to understand this, at least at some level. These are their slides. They basically say what we've said earlier: page table entries are dense and trusted; they control access to physical memory; and a bit flip in a page table entry's physical page number can give a process access to a different physical page. The aim of the exploit is to get write access to a page table — write access, I should say, because you cannot do too much with read access; with write access you can do a lot, and it gives access to all of physical memory, as we'll see. And they want to maximize the chance that a bit flip is useful, so they introduce a special technique called page table spraying: they spray the physical memory with page tables, in a clever way, as we will see, and they check for a useful, repeatable bit flip first, so that they can actually exploit it. This is the page table entry layout they exploited: it has 64 bits, and they say that if they target the read/write bit, the chance of a bit flip landing there is low — about 2%, because it's one bit out of 64 — whereas if they target the physical page base address, the physical page number if you will, which is about 20 bits, they get roughly a 31% chance of a useful flip. So it's a probabilistic attack in the end. These are again their slides; you can watch their talk as well. So let's see what they did. This is the virtual address space, this is the physical address space — you should be very familiar with this indirection by now. They ask: what happens when we map a file with read/write permissions? This is what happens, basically: the file gets mapped to memory; it has a virtual address, and the bits of the file are in physical memory; and it happens through an indirection, which is a page table — those red entries, page table entries, map the virtual pages of the file to the physical frames in physical memory. Sounds good, right? Okay, so what happens when we repeatedly map the same file with read/write permissions? You get more page table entries; you get more page tables, essentially, because this is the same file, yes, but each mapping needs its own page table entries, as if different page tables are needed, because you manipulate them through different mappings, if you will. So basically you get many, many page tables in memory, all pointing to the same physical frames over here. PTEs in physical memory help resolve virtual addresses to physical pages — these are their words — and they can fill physical memory with page table entries, so that most of physical memory is page table entries, each of them pointing to pages of the same physical file mapping; all of them point to this green part over here. Now, if a bit in the physical page number of a page table entry flips, the corresponding virtual address points to a wrong physical page, with read/write access, because we opened the file with read/write access. This page table entry tells us: whenever you access this virtual address, you can write to this part of the file that's cached in physical memory. That sounds fine; but if I cause a bit flip with a Rowhammer attack, I can flip bits in the physical page number of this page table entry so that it points somewhere else, and I gain access to some other page over there; and because I sprayed memory with page tables, chances are this wrong page contains a page table itself — which is true,
because you sprayed most of the memory with page tables. And an attacker that can read and write page tables can use that to map any memory, read/write: essentially you now control many page table entries over here, and then you can do whatever you want to the system — you can change the operating system if you want; basically, you become root at this point. Makes sense, right? Okay, so to make this a working probabilistic attack, they say you need to do other things too: allocate a large chunk of memory; figure out which locations are prone to flipping due to Rowhammer; and check whether they fall into the right spot in a PTE to allow the exploit. They don't want to leave it to a 31% chance — they want to actually see where the bit flips are, and you can figure that out; as a user, you can actually learn a lot about memory. Then they return that particular area of memory to the operating system, and the operating system allocates page tables there later, right? And then you exploit that: you force the operating system to reuse that memory for page tables by allocating massive quantities of address space, like we discussed earlier; then you cause a bit flip, using Rowhammer, that shifts the page table entry to point to a page table; and then you abuse the read/write access to all of physical memory — at this point you are root, basically. Clearly, in practice there are many complications, they say. So that's essentially what Rowhammer circumvents, as you can see, and there are many, many attacks that have been developed; this is the first attack that showed this, but people have actually automated some of these attacks today. Okay, so I could talk more about Rowhammer, but we don't have a lot of time. I will tell you that there are a lot of other attacks, and here are some works that introduced both attacks and defenses in 2023; there are whole sessions on Rowhammer in the Security and Privacy conference, for example. So if you're curious, there's a lot being written here; you can read the papers. The solutions that were developed by industry don't work, which we don't have time to talk about; I'll refer you to a lot of lectures on this. It's hard to guarantee Rowhammer-free chips — this is actually a technology scaling problem — and the industry-adopted solutions are actually quite poor; we will talk about that. And industry is also introducing attacks: Google itself introduced this Half-Double Rowhammer attack, which is a little different from the original Rowhammer attack. What they do is this: you have the victim rows over here, and you have an aggressor row that's not immediately adjacent — there's another row in between. They hammer this farther row a lot, and the row in between a little, and they cause bit flips in the victim. This basically shows that the Rowhammer effect is getting worse in physical memory, and it's also interesting for other reasons that we don't have time to talk about — but G is sitting here, and he can tell you a lot about that, because he's been doing a lot of work on Rowhammer and he's one of the authors of this paper, as you can see. Okay, so the good news is that industry is developing solutions to this. Finally — like nine years after the original Rowhammer paper — this is a paper from SK Hynix that was presented at ISSCC this year; they basically say: we're trying to solve Rowhammer, and you can see how they're
trying to solve it; some of these solutions resemble the solutions that we have proposed in the past. Similarly, Samsung has written papers as well, and they modify the DRAM: basically, they make the DRAM more intelligent to try to detect these attacks. This is a paper written by Samsung doing something similar — a much harder paper to read, let me put it that way — but essentially with solutions similar to what we have been discussing. So are we now Rowhammer-free? Let's see: there's more coming up, as you can see; we're going to present this work at ISCA, and I believe there's even more coming. I think it's not going to be easy to get rid of some of these issues, and it's going to be harder and harder to actually prevent these bit flips in a fundamental way. I cannot reveal the paper right now, but talk to me after ISCA. As you know, I talk a lot about Rowhammer, and some of your creative fellow prior students noticed this. So if you're interested, there's a lot to discuss here, including some of the solutions, but I don't have time; I had to put this in the virtual memory lecture, though, because it actually brings together multiple concepts, right? So let me use a few more slides, conclude, and then we're going to take a break. Basically, this opens up some really interesting things: Rowhammer actually opened up a lot of hardware security research. Essentially, if hardware is unreliable, your higher-level security and protection mechanisms may be compromised completely, right? This basically indicates that the roots of security and trust are actually at the very, very low levels: if you don't have security at the very low level, then you're not going to build secure mechanisms at the higher levels. You can do as much as you want at the operating system level; it doesn't matter, because someone will cause a bit flip in some way and circumvent those mechanisms, right? Basically, the root of security is really in the hardware, and in the physics itself, and Rowhammer, Spectre, and Meltdown are recent key examples. I don't cover Spectre and Meltdown, but you actually have the background to think about them, because you know about caches, speculative execution, and branch prediction — all of those. But they're more complicated than Rowhammer, and harder to exploit as well. So then the question becomes: what should we assume the hardware provides? How do we keep the hardware reliable? Should we assume hardware is completely unreliable and build mechanisms for trusted execution on top of it? That's another approach. It's possible, but it comes at very high cost: it requires checking, for example, every time you access a page, whether its integrity has been compromised, and that's expensive — and for that, you need some hardware support as well. So how do you design secure hardware? That alone is not a difficult question; it becomes a difficult question if you actually want high performance, high energy efficiency, convenient programming, and low cost at the same time. Secure hardware and high performance don't go well with each other; secure hardware and low cost don't go well with each other; that's the problem, basically. So there are plenty of exciting and highly relevant research questions over here, and if you're interested, there's more in future courses. Now let me summarize virtual memory; I think this is a good
place to summarize virtual memory and talk about some challenges. So, essentially, virtual memory gives us the illusion of "infinite" capacity — infinite in quotation marks, because you're still limited by the virtual address space size, but hopefully that's large enough. You have a subset of the virtual pages located in physical memory, and a page table maps virtual pages to physical pages; this is address translation. A TLB, and also a page walker, speed up address translation — hardware accelerators, basically, specialized for virtual memory acceleration. Multi-level page tables keep the page table size in check, and using different page tables for different programs, together with access control bits, provides memory protection. This is virtual memory in one slide, let's say. There's more to virtual memory that we will not cover, like how you handle virtualized systems; I'm going to show you another picture that takes the interaction to the next level: if you're running a virtual machine on top of a hypervisor, with a guest operating system inside, there's another level of indirection there, as you will see soon. You can also have alternative page table structures that may be faster, that may be more useful; again, we don't have time to talk about that. In fact, in placement, I ignored an issue: how do you do the mapping from physical pages back to virtual pages? You actually need some inverse mapping as well, to do the replacement, so that you can change the page table entry. The page table provides a forward mapping, virtual to physical; but if you're replacing something in the physical domain, you need the inverse mapping: when you replace a physical page, or physical frame, you need to invalidate the virtual mapping, right? So you have another problem, which we kind of glossed over — we didn't even talk about it — but it exists, and you need to solve it if you're writing an operating system. People have come up with inverted page tables, for example; that's another overhead, as you can see. So that's one of the reasons why we think rethinking virtual memory is actually a good idea — maybe not so easy, though. Now let's take a look at virtual memory in virtualized environments. How many of you use virtual machines? Okay, that's good — many of you; if you submit jobs to the cloud, you may actually be running virtual machines, or on the machines here. So basically, this adds another level of address translation: you have a guest operating system in a virtual machine, you have a host operating system, and you have a CPU. Clearly, we looked at a real machine, a real operating system, with this one address translation; but the guest also has virtual addresses, and they get mapped to its "physical" addresses over here, which are essentially virtual addresses from the perspective of the host. So you have another level of address translation, and existing machines have hardware support for this multi-level translation to speed up virtual machine performance; AMD introduced it in 2006 or so, for example. So basically it's a mess, as you can see — it's not that great — but we keep adding infrastructure to support it, because it's so important.
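A sketch of the extra indirection, with the two helpers standing in for full page table walks (hypothetical names). Note that in real nested paging, every step of the guest's own walk additionally goes through the host's tables, which is why hardware support (e.g., AMD's nested page tables, Intel's EPT) matters so much:

```c
#include <stdbool.h>
#include <stdint.h>

/* The guest's "physical" address is itself a virtual address from
   the host's point of view, so every guest access conceptually
   needs two translations. */
extern bool translate_guest(uint64_t guest_va, uint64_t *guest_pa);
extern bool translate_host (uint64_t guest_pa, uint64_t *host_pa);

bool translate_virtualized(uint64_t guest_va, uint64_t *host_pa_out)
{
    uint64_t guest_pa;
    if (!translate_guest(guest_va, &guest_pa))    /* guest VA -> guest PA */
        return false;
    return translate_host(guest_pa, host_pa_out); /* guest PA -> host PA */
}
```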
Okay. So, regardless of all of its shortcomings, virtual memory is actually one of the most successful examples of architectural support for programmers, of how to partition work between hardware and software, of hardware/software cooperation in the end, and also of the programmer/architect trade-off. The trade-off is made so well that we keep adding complexity to our systems to ensure that programmers don't need to know what's going on underneath, right? So in that sense it's not a good deal for the architect, let's say, but it's a great deal for the programmer. Okay, so going forward: how does virtual memory scale into the future? Basically, it's not going to scale easily. A lot of energy is actually spent on virtual memory address translation — a lot of cost, a lot of performance, et cetera — and we usually ignore this cost; we don't even count it when we talk about it, right? Physical memory sizes are increasing, both local and remote; we have hybrid physical memory systems — DRAM, non-volatile memory, SSDs — and we want to exploit that going into the future; there are many accelerators in the system accessing and addressing physical memory; and we have virtualized systems, as we've discussed. And in the future — actually already — we have near-data accelerators and processing-in-memory systems, which we may get a chance to talk about in the next part of the lecture if we have time; let's see. But basically, scaling this is not easy. Again, we don't have time to go over it, but we are thinking of alternatives to the conventional virtual memory framework; I think this is a very nice direction to think about. I don't have time to talk about it, but this work completely rethinks the virtual memory subsystem: you get rid of the virtual memory abstraction and try to make it more flexible, if you will. It has a lot of downsides, clearly, and a lot of issues that still need to be solved; but virtual memory has been around for 60 years, and new alternatives have been popping up maybe once every 10 years or so, so we need new alternatives to this. If you're interested, you can look at the lectures related to this. You can see that this slide actually lists some of the challenges; for example, the rigid page table structure is actually pretty bad, in my opinion, and we need to fix that somehow — but let's see what happens. There are more lectures on virtual memory that we don't have time for, but as I said, we'll release one lecture if you want to go into more detail about how a real system does virtual memory, and you'll see even more complexity over there. Any questions on virtual memory? Let's see how we're doing on time. Okay — not so terrible, but not so good either — so let's take a break until 28 past, and then we will have a brief conclusion. Now we're going to talk a little bit more about issues in virtual memory, some of which we've already touched on; these are some extra slides for your benefit, and I'm going to go through some of them quickly. We already discussed how large the page table is and how we store and access it; I'm going to give you a bit more from real systems with some additional slides, but we're going to skip some of them. The second question is how we can speed up translation and the access control check; again, we've discussed this — TLBs — but these are very important, which is why I raise these questions. And then there's a third question, which was brought up by one of you in the last lecture, I think: when do we do the translation in relation to the cache access? That's going to be important; I'm going to give you a teaser of it, but we're not going to go into the details — there are some backup slides and some lecture references you can take a look at. There are many other issues in virtual memory that we will not cover in detail; I mentioned some of those while we were
discussing some earlier issues: what happens on a context switch, how can you handle multiple page sizes in an efficient manner, et cetera. We're going to touch on some of this when I give you a real-world example, but not in great detail. So, issue one: how large is the page table, and where do we store it — in hardware, physical memory, virtual memory? I think we've already answered these questions: with multi-level page tables, part of the page table is in physical memory, the rest is in virtual memory, and some of it is not allocated at all. And how can we store it efficiently, without requiring physical memory that can hold all page tables? The idea is multi-level page tables, as we have discussed. And this is the slide I'm showing you for the fourth time, perhaps, because we saw it in the last lecture as well, and we've already discussed the solution. We also discussed how to do the page table access there: there's a page table base register, and there's also a page table limit register, so that you don't exceed the virtual memory you've allocated. If the virtual page number is out of bounds — it exceeds the page table limit register — then the process did not allocate that virtual page, and an access control exception is raised based on that as well; so there are hardware checks. x86 actually has segmentation too, as we discussed in the last lecture, which is a different kind of protection, but it's not as important today; that's why we don't talk about it as much. Okay, we already discussed that the page table base register is part of the process context; it's an architectural register, and it needs to be loaded when the process is context-switched in. If you don't do that, you have a problem — and it's the hardware's job to do it. If it isn't done, the process will be using somebody else's page table base register, and you compromise security and privacy right away, right? So hardware bugs at any level of this process can directly affect your security and privacy, basically. Okay, I've already given you these examples, so I'm going to skip ahead to the page table entries that are more recent: this is x86-64, and you can see that page table entries have grown to 64 bits; they used to be 32 bits over here. Each of these is a page table entry, and they're now 64 bits because we need to be able to address larger and larger virtual and physical memories. You can see that there are different types of page table entries; there's CR3; and there's the possibility of five-level paging in x86-64 — this is the fifth level, which they don't call a directory anymore, of course; this is the first level if you have five levels, and this is the first level if you have four levels. And you can see there are one-gigabyte page frames as well: so there are three page sizes in x86 right now — 1 gigabyte, 2 megabytes, and 4 kilobytes — and you need to choose which page size you want. This is a tough choice, not an easy one; or the system needs to choose for you, and again, it's a tough choice: basically, it depends on the locality characteristics of your translations, right? And you can see that the other flags also exist; there's nothing special I want to point out here, other than that this is the four-level paging example I showed you earlier. We're not going to go through it again, but existing systems have complicated multi-level page tables; this is four-level paging for 4-kilobyte pages, I should say,
This is four-level paging for 2-megabyte pages, and this is four-level paging for 1-gigabyte pages. Clearly, for a 1-gigabyte page you don't actually have four levels; you have two levels here, and you have three levels for 2-megabyte pages. But "four-level paging" is the name of the mechanism they use: it's called four-level paging because the maximum number of levels is four, with the smallest page size. So that's how large a page table is and how complicated it gets.

So how can we speed up the address translation and the checks? We already know about translation lookaside buffers; we've already talked about how to make translation fast. Let's talk about who manages the TLB, meaning: what if you miss in the TLB? The TLB caches recently used translations, but what if you miss? Which TLB entry do you replace? Clearly that's a caching decision. And who handles the TLB miss, hardware or software? There are advantages and disadvantages to each. In many systems today hardware is increasingly handling TLB misses, but there are some systems that handle TLB misses in software; MIPS, for example, handles the TLB in software, and that gives more flexibility to the system. We're going to talk about the trade-offs.

And then, what should be done on a page fault? If you eventually find the PTE and figure out that it's a page fault, then what? Which virtual page do you replace from physical memory? These are different events, and I want to emphasize that a TLB miss is very different from a page fault. A TLB miss says that the translation is not cached in the hardware TLB. A page fault means that the physical page corresponding to this virtual page is not mapped in physical memory; it doesn't exist in physical memory, so you need to bring it in. You may get a TLB miss that doesn't result in a page fault, obviously: you can have a valid page table entry that simply isn't cached. You may also get a page fault that doesn't result in a TLB miss. That also happens, if the page table entry is cached in the TLB but becomes invalid for some reason, and there are reasons for that, for example TLB coherence: when you replace the page that is subject to the translation, you may invalidate that TLB entry. It depends on exactly how you invalidate TLB entries, of course, and on how you implement things, but it's certainly possible; it's more rare, of course.

So who handles the page fault? It's usually software. Hardware can potentially accelerate it, so I should really say it's cooperative, but it's done under the supervision of software: you usually take an exception that leads you to a software handler, and the software handler decides what to replace and how to handle the page fault. Hardware usually handles the TLB miss, but not always.

We already discussed the TLB; this is a more generic explanation, so you can take a look at it. But basically, the TLB is small; it cannot hold all PTEs. Some translation requests will inevitably miss in the TLB, so you must access memory to find the required PTE. This is called walking the page table: you walk all levels of a multi-level page table to get to the page table entry you're looking for, and this takes a long time, as you know; there's a large performance penalty. Better TLB management and prefetching can reduce TLB misses, just like better cache management and prefetching. The sketch below shows how these two events, TLB miss and page fault, relate in the translation flow.
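To pin down the distinction, here is a small control-flow sketch; every helper in it (tlb_lookup, walk_page_table, raise_page_fault, tlb_insert) is hypothetical, named only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers -- not a real API, just stand-ins. */
typedef struct { uint64_t pfn; bool present; } pte_t;
bool  tlb_lookup(uint64_t vpn, pte_t *out);   /* true on a TLB hit   */
pte_t walk_page_table(uint64_t vpn);          /* multi-level walk    */
void  tlb_insert(uint64_t vpn, pte_t pte);    /* cache a translation */
void  raise_page_fault(uint64_t vpn);         /* trap to the OS      */

/* One translation. A TLB miss only means the translation is not
 * cached; a page fault only arises when the walked-to PTE says the
 * page is not mapped in physical memory. */
uint64_t translate(uint64_t vpn)
{
    pte_t pte;
    if (!tlb_lookup(vpn, &pte)) {        /* TLB miss                 */
        pte = walk_page_table(vpn);      /* costly multi-level walk  */
        if (!pte.present) {              /* only now: a page fault   */
            raise_page_fault(vpn);       /* OS brings the page in    */
            pte = walk_page_table(vpn);  /* re-walk; now it's mapped */
        }
        tlb_insert(vpn, pte);
    }
    return pte.pfn;                      /* physical frame number    */
}
```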
But eventually you get a TLB miss. Who handles the TLB miss, hardware or software? Meaning, whenever you get a TLB miss, do you take an exception, or does the hardware handle it? There are multiple approaches.

Approach one: it's hardware-managed, as in most systems today, x86, ARM, et cetera. The hardware does a page walk: it walks through all of the page table levels, fetches the PTE in the end, and inserts it into the TLB. If the TLB is full, the new entry replaces another entry, so the hardware does the replacement too, and all of this is done transparently to the system software. This way you can employ specialized structures and caches, for example page walkers and page walk caches, which we will see in an example soon.

Approach two: it's software-managed. MIPS, which is actually the architecture that you are implementing (except you're not implementing the virtual memory part), does it this way. The hardware raises an exception, and the operating system does the page walk using ordinary instructions; you can clearly construct instructions that access page tables. The operating system fetches the PTE, and the operating system inserts and evicts entries in the TLB. This is a software-managed TLB.

Clearly there are trade-offs to these approaches, and you can imagine some of them. With hardware-managed TLBs there is no exception on a TLB miss; the missing instruction just stalls, other instructions may already be in flight in an out-of-order machine (that's great, independent instructions may continue), there are no extra instructions or data brought into the caches, and you don't context-switch into software. That's the big advantage, and that's why hardware-managed TLBs are commonly used today. The downside is that the page directory and page table organizations are etched into the hardware: the hardware needs to implement them, and if you somehow change them in software, your hardware that manages the TLB is useless. So the OS has little flexibility in deciding the page table and directory organization.

A software-managed TLB of course doesn't have that disadvantage: the OS can define the page table organization independently of the hardware, and it can also employ more sophisticated TLB replacement policies. The downside is performance overhead: you need to raise an exception, flush the pipeline, and execute the exception handler, and extra instructions are brought into the caches. You could handle it in another thread, which may eliminate some of the performance overhead such as the pipeline flush, but there is still the overhead of executing the extra handler instructions. If you handle it in hardware instead, you can have a completely specialized structure, as we will see in an Intel system soon. To make the software-managed case concrete, a sketch of such a refill handler follows.
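As a concrete (but hypothetical) picture of approach two, here is roughly what a MIPS-style refill handler does, written as C rather than assembly; the flat page table layout and the two privileged helpers are assumptions for illustration, not actual MIPS code.

```c
#include <stdint.h>

/* A MIPS-flavored software TLB refill, written as C for readability.
 * The page table layout (a flat linear array here) is the OS's own
 * choice -- that is the whole point. The two helpers stand in for
 * privileged operations (e.g., reading the faulting address,
 * executing a TLB-write instruction) and are hypothetical. */
extern uint64_t page_table[];                   /* OS-defined layout   */
uint64_t read_faulting_vpn(void);               /* which access missed */
void     tlb_write(uint64_t vpn, uint64_t pte); /* install into TLB    */

void tlb_refill_handler(void)
{
    uint64_t vpn = read_faulting_vpn();
    uint64_t pte = page_table[vpn];   /* the "walk" is just a load */
    tlb_write(vpn, pte);
    /* return from the exception; the memory access that missed is
     * re-executed and now hits in the TLB */
}
```

The flexibility is visible right here: only this handler ever interprets page_table, so the OS can reorganize it at will. The cost is equally visible: every refill is an exception plus a pipeline flush.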
The final issue that I'm going to briefly touch on is when we do the address translation in relation to the cache access, and this becomes important especially from the perspective of the L1 cache, because if you need to do the translation before accessing the cache, the translation is now on the critical path of the cache access. So this is a teaser: when do we do the address translation, before or after accessing the L1 cache? In other words, is the cache virtually addressed or physically addressed? This is a fundamental choice at the L1 level; you could potentially move the translation down to the L2 and L3 levels as well, but at some point it stops making sense. And then the difficult question is: what are the issues with a virtually addressed cache?

There are issues. One issue is the synonym problem: two different virtual addresses, from the same process or from different processes, can map to the same physical address, which means the same physical address can be present in multiple locations in the cache if you're using virtual addresses to index it. This can lead to inconsistent data, because the same physical address sits in multiple locations, say different sets, different indices, in the cache, and if one of them is updated, the other needs to be updated or invalidated as well. So you have a coherence or consistency problem inside your own cache if your cache is virtual, because two different virtual addresses can map to the same physical address.

There are other problems too; this is only one of them. There is also the homonym problem, which is the fact that the same virtual address can map to two different physical addresses. Why does this happen? Because the same virtual address can exist in different processes: different processes each have their own virtual address space, and the same virtual address can be mapped to different physical addresses. This becomes a problem when you're managing a cache that is completely virtual. The usual solution to the homonym problem is to introduce address space IDs to distinguish between different processes and address spaces; this one is easier to solve, at some hardware cost.

The synonym problem, again, is that different virtual addresses can map to the same physical address. Why? Different pages can share the same physical frame, within or across processes, just like we saw in the Google example: different virtual addresses mapped to the same file, which is mapped to physical memory. Many different virtual addresses can map to the same physical address, and there can be many reasons: shared libraries, shared data, copy-on-write pages within the same process, and memory-mapped files, which are a form of shared data. So the question becomes very relevant here: do homonyms and synonyms create problems when we have a cache, and is the cache virtually or physically addressed? There are ways of solving the synonym problem as well; I'm not going to go into detail, and I have some backup slides on this, but I would like you to think about it if you're interested.

There are three ways of designing the first-level cache. It could be physical, meaning the cache access happens after the virtual-to-physical address translation. You don't have the homonym and synonym problems, but you have a big latency overhead: to access the cache you need to go through the TLB first, and we discussed that the first-level cache is very latency-sensitive. So in general this is not done. The second option is to access the cache completely with the virtual address and then do the translation. This is completely virtual: too much flexibility and too many homonym and synonym problems, so this is also not done in general. Usually you implement a compromise that looks like this: a virtual-physical cache, or virtually indexed, physically tagged cache, where you access the cache concurrently with the TLB and then do the tag check using the physical address. For this to work, you need to design your cache carefully such that the virtual index doesn't change during translation: you should not index the cache using address bits that change under translation, otherwise two virtual addresses that map to the same physical location could end up in different locations in the cache. A quick way to check this constraint is sketched below.
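The constraint can be stated in one line: every index bit must come from the page offset, which means the cache size divided by the associativity must not exceed the page size. A tiny, self-contained check, with example parameters I picked for illustration:

```c
#include <stdbool.h>
#include <stdio.h>

/* A VIPT cache is trouble-free in its index if every index bit lies
 * within the page offset, i.e. (cache size / associativity) <= page
 * size: the index is then identical in the virtual and the physical
 * address, so translation cannot move a line to another set. */
static bool vipt_safe(unsigned cache_bytes, unsigned ways,
                      unsigned page_bytes)
{
    return cache_bytes / ways <= page_bytes;
}

int main(void)
{
    /* 32 KB, 8-way, 4 KB pages: 32K/8 = 4K <= 4K -> safe. Many L1
     * caches are sized exactly at this boundary for this reason.  */
    printf("32 KB, 8-way: %s\n",
           vipt_safe(32768, 8, 4096) ? "safe" : "unsafe");
    /* 64 KB, 4-way: 16 KB per way > 4 KB page -> some index bits
     * change under translation, so synonyms can land in two sets. */
    printf("64 KB, 4-way: %s\n",
           vipt_safe(65536, 4, 4096) ? "safe" : "unsafe");
    return 0;
}
```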
So if you didn't understand that, that's okay; but if you did understand it, that's good, because it means you've really been following what's going on in virtual memory. Think about what happens in this case: what kind of bits should you use for the index into the cache, and what can and can't you do? The idea is to virtually index the cache but do the tag comparison using physical addresses, because by the end of the concurrent TLB access you will have the physical address. Systems usually employ this sort of cache, virtually indexed and physically tagged, at the first level. The problem is not present at the lower levels of the hierarchy, because at that point you already have the physical address; you do the TLB translation before the second level anyway. See the backup slides for more.

Let me now quickly go over a modern virtual memory system, in Intel Skylake, and then we're going to transition to the epilogue part. Address translation has evolved a lot. Earlier systems had simple address translation: an L1 data TLB and an L1 instruction TLB, the L1 data cache (which is not part of your translation system, but you usually access it concurrently with translation, as we discussed), and then a software page table walk, like in MIPS. Modern address translation is much more complicated. A modern MMU has an L1 instruction TLB and an L2, an L1 data TLB and an L2 TLB, a hardware page table walker, page table walker caches to cache the middle levels of the page table, and multi-level page tables. So there is a lot going on, and then there are of course the caches as well.

The memory management unit, as we discussed, is the hardware responsible for resolving address translation requests; there is usually one MMU per core. The MMU has three key components. First, TLBs; we know what TLBs are. Second, page table walk caches: they offer fast access to the intermediate levels of a multi-level page table. TLBs cache PTEs, not the intermediate levels of the page tables; page walk caches cache the intermediate mappings of the page table hierarchy, essentially to accelerate translation on TLB misses. Third, a hardware page table walker that sequentially does the page table walk: it accesses the different levels of the page table to fetch the required PTE, all in hardware. So you can see there is a lot of hardware here.

Let's take a look at this MMU and tear it apart in an Intel Skylake system, which is reasonably new. The L1 data TLB: even this is complicated. They have three TLBs, one for each page size (4 kilobytes, 2 megabytes, 1 gigabyte), and you can see the entry counts. There is a 4-entry, fully associative 1-gigabyte TLB; it's only four entries, but you can address 4 gigabytes of memory with it. The 2-megabyte TLB has 32 entries, so with all of those you can address only 64 megabytes, and with the 64 entries of the 4-kilobyte TLB you can address only 256 kilobytes, which is too bad. So the 1-gigabyte TLB is actually quite powerful: even though it has the fewest entries, it spans a 4-gigabyte portion of the virtual address space, and of physical memory too, if you think about it from that perspective. Virtual-to-physical mappings are inserted into the corresponding TLB after a TLB miss, clearly, and during a translation request all three L1 TLBs are looked up in parallel, along the lines of the sketch below.
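This little sketch mirrors the organization just described; the entry counts come from the slide, while the struct, the probe helper, and the sequential control flow are illustrative assumptions (the real hardware probes in parallel).

```c
#include <stdbool.h>
#include <stdint.h>

/* One L1 data TLB per page size, each probed with a different slice
 * of the virtual address because each size has a different page
 * offset width. Entry counts follow the Skylake slide; the probe
 * helper and the struct are illustrative assumptions. */
typedef struct { unsigned entries; unsigned offset_bits; } tlb_t;

static tlb_t tlb_4k = { 64, 12 };   /* 64 x 4 KB -> 256 KB reach */
static tlb_t tlb_2m = { 32, 21 };   /* 32 x 2 MB -> 64 MB reach  */
static tlb_t tlb_1g = {  4, 30 };   /*  4 x 1 GB -> 4 GB reach   */

bool tlb_probe(tlb_t *t, uint64_t vpn, uint64_t *pfn); /* hypothetical */

/* Real hardware probes all three structures in parallel; the
 * sequential code below is only for readability. */
bool l1_dtlb_lookup(uint64_t va, uint64_t *pfn)
{
    if (tlb_probe(&tlb_4k, va >> tlb_4k.offset_bits, pfn)) return true;
    if (tlb_probe(&tlb_2m, va >> tlb_2m.offset_bits, pfn)) return true;
    if (tlb_probe(&tlb_1g, va >> tlb_1g.offset_bits, pfn)) return true;
    return false;                 /* L1 TLB miss -> try the L2 TLB */
}
```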
So now you can see that it's complicated, and having multiple page sizes clearly complicates the picture. The Intel folks did not design a single TLB supporting all page sizes, because they thought it was easier to design three TLBs, one per page size. And as this cartoonish picture suggests, you need to index these TLBs with different parts of the virtual address, do the tag match, et cetera, and then decide which one you hit in.

Then there is an L2 unified instruction and data TLB: the L2 unified TLB caches translations for both instructions and data, and it's still private per core. There are two separate L2 TLB structures now, one for 4-kilobyte and 2-megabyte pages and one for 1-gigabyte pages. These are bigger, as you can see, but they decided to consolidate 4-kilobyte and 2-megabyte pages together, so the design of this TLB is a little more complicated. You can also see the access penalties.

The question, then, is how you can support both 4-kilobyte and 2-megabyte pages using a single structure. There isn't much public detail on this, but you can imagine; we actually discussed this when we talked about associativity in time. You index the TLB once using the 4-kilobyte index, and if you miss, you index the TLB again: there are two steps, and in time you index the TLB assuming different page sizes. If you hit in the first step, good, it's a 4-kilobyte page. If you miss in the first step and hit in the second step, it's a 2-megabyte page. If you miss in both, you need to do the page table walk. The general algorithm is to recalculate the index and probe the TLB for all remaining page sizes. We've seen this as associativity in time; if you remember the cache lectures, we called it pseudo-associativity, a poor man's associative cache, if you will. Normally we have associativity in space, but here we have associativity in time: we change the index and index into the structure again and again, and you can do it many times to get very high levels of associativity, or to support any number of page sizes. In this example, we first calculate the index for 4 kilobytes; if we miss, we calculate the index for 2 megabytes (see the sketch below).

So the L2 TLB does this in-time index recalculation. It's simple and practical. The downside is that the TLB hit latency is longer and variable, and identifying an L2 TLB miss is slower, since you need to probe for all page sizes. Potentially you can optimize this by making the lookups parallel, which of course adds hardware cost, or you can predict the page size, that is, predict the probing order. The trade-offs are again similar to associativity in time versus associativity in space.
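Here is the associativity-in-time idea as a hedged sketch; the set count, the probe helper, and the probing order are assumptions, and real designs may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* "Associativity in time": probe one structure repeatedly,
 * recomputing the set index under each page-size assumption.
 * The set count and the probe helper are illustrative. */
#define L2_TLB_SETS 128u

bool l2_probe_set(unsigned set, uint64_t vpn, unsigned page_shift,
                  uint64_t *pfn);                    /* hypothetical */

bool l2_tlb_lookup(uint64_t va, uint64_t *pfn)
{
    /* Probe order: assume a 4 KB page first, then a 2 MB page. */
    static const unsigned page_shift[2] = { 12, 21 };

    for (int i = 0; i < 2; i++) {
        uint64_t vpn = va >> page_shift[i];   /* size-dependent VPN   */
        unsigned set = vpn % L2_TLB_SETS;     /* size-dependent index */
        if (l2_probe_set(set, vpn, page_shift[i], pfn))
            return true;     /* hit on probe i: page size identified */
    }
    return false;  /* missed under every size -> start the page walk */
}
```

The loop is exactly why the hit latency is variable and why declaring a miss takes longest: you only know it's a miss after the last probe.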
Let's talk about the hardware page table walker component. As we said, TLB misses are handled in hardware here, and this is an interesting, complicated component that walks the multi-level page table to avoid expensive context switches and software handling of TLB misses. It has two parts: a state machine that is designed to be aware of the architecture's page table structure, and registers that keep track of outstanding TLB misses. I'm not going to go into all the details, but you can imagine how the state machine walks the page table; we actually saw the slide earlier. This avoids the need for a context switch on a TLB miss, and it also has the ability to overlap TLB misses with useful computation, as we discussed. It supports concurrent TLB misses, because in hardware you can support many misses in flight. Of course, there are downsides: hardware area and power overheads, and the software can no longer change the page table organization and its properties, if you will; it's etched into the hardware.

So how do you do this walk? I'm not going to go through it in detail, but you start with the CR3 register, concatenate the value in CR3 with part of the virtual address, and then you essentially do the walk of the multi-level page table that we discussed earlier. It takes time. But of course there is a benefit, which is why it's implemented in real processors: it allows overlapping many TLB misses with useful computation. If you do software TLB miss handling, you need to context-switch to a TLB miss handler, and then a load that would be a TLB hit cannot be handled, because you're in the middle of a context switch. With a hardware page table walker, while a TLB miss is being handled you can concurrently service other TLB hits, you can keep doing out-of-order execution and get its full benefits, et cetera. So you actually save a lot of cycles with hardware TLB miss handling, which is also called a hardware page table walk. The sketch below renders the walker's state machine in software form.
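This is only a sketch: the present and page-size bit positions match x86-64 conventions, but phys_read64() and everything else is a stand-in for hardware, not any real kernel's code.

```c
#include <stdint.h>

/* A software rendering of the walker's state machine for x86-64
 * style four-level paging. phys_read64() stands in for a physical
 * memory access -- the step that a page walk cache can short-cut. */
#define PTE_PRESENT   (1ULL << 0)
#define PTE_PAGE_SIZE (1ULL << 7)   /* set: entry maps a large page */
#define ADDR_MASK     0x000FFFFFFFFFF000ULL

uint64_t phys_read64(uint64_t paddr);          /* hypothetical */

/* Returns the leaf PTE, or 0 to signal a page fault. */
uint64_t page_walk(uint64_t cr3, uint64_t va)
{
    uint64_t table = cr3 & ADDR_MASK;          /* root table base */
    for (int shift = 39; shift >= 12; shift -= 9) {
        unsigned idx = (va >> shift) & 0x1FF;        /* 9-bit slice */
        uint64_t pte = phys_read64(table + 8 * idx); /* one access  */
        if (!(pte & PTE_PRESENT))
            return 0;                                /* page fault  */
        if (shift == 12 || (pte & PTE_PAGE_SIZE))
            return pte;   /* leaf: 4 KB page, or a 2 MB/1 GB page   */
        table = pte & ADDR_MASK;                     /* next level  */
    }
    return 0;                                        /* unreachable */
}
```

Count the phys_read64 calls: up to four memory accesses for one translation. That is the cost the page walk caches, coming up next, are there to cut.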
Finally, to aid this hardware page table walk, there are page walk caches, as we discussed. These are different from the TLBs: TLBs cache the page table entries, the leaf nodes in the last-level page table, while page walk caches, going back to this picture, cache entries from all of the other, intermediate levels, not the final PTE you're looking for. Why? You want to speed up the hardware page table walk; that's the idea. I already said all of this: these are low-latency structures that provide fast access to the upper page table levels, so that the page table walker does not have to access memory or the cache hierarchy for every step of every page table walk; it can just access these page walk caches. And that's the Intel Skylake MMU for you, in a nutshell, very quickly.

There are other virtual memory designs too. This is the Apple A14, based on some reverse engineering; you can see that the L1 TLB and L2 TLB are larger in the A14, and the default page size is also larger. I'm not going to go through the details, but you can enjoy the slide on your own.

Okay, so we're at the end of virtual memory. To summarize: virtual memory has a lot of benefits. It gives the illusion of infinite capacity, and it also provides protection, which is very important. We already discussed everything here, so I'm not going to go through all of it again; this is a summary of everything we discussed. There are a lot of issues you need to solve along the way, and those issues all exist in real systems and are solved in some way.

But there's more, which we're not going to talk about. How do you handle virtualized systems? That poses a problem, and I'm going to flash a slide related to it. You may have virtual machines and hypervisors running programs and orchestrating systems; now you have another virtualization layer, and that increases the number of virtual memory translations you need to do. Existing systems provide support for that, for example nested page walks and nested TLBs.

There are also alternative page table structures that we did not talk about. Inverted page tables we briefly mentioned: when the operating system wants to figure out what to replace, it stores the reverse mappings from physical pages to virtual pages. Some systems actually use inverted page tables as the page table to begin with, probably a bad idea, but there are trade-offs associated with it, because inverted page tables are quite compact in terms of storage size; they are small. Hashed page tables can enable fast access to these tables, and this may actually be a good idea to revisit. In many systems, which we did not talk about, we just directly index the page table with the virtual address; we didn't really do any sophisticated hashing.

There's more that we don't have time to cover, but I will show you one picture here, which is the virtualized system. You may have a virtual machine, for example with VMware: a guest OS running on a host OS running on the physical CPU. This virtualized environment needs an additional level of address translation, from the guest OS virtual address to the host OS virtual address, and then the host OS virtual address needs to be translated to the CPU's physical address. That requires another layer of translation, if you will, and it leads to a lot more memory accesses; I'm not going to go over it, but it is a lot of memory accesses. Existing hardware, x86 hardware for sure, provides support for directly translating from the guest virtual address to the host physical address while avoiding this level of indirection as much as possible. So that's direct support for virtualized systems.

Okay, so let me conclude the virtual memory part with these parting thoughts, and then we'll have an epilogue for those who wish to stay for the rest of the lecture. Virtual memory is very important because it's one of the most successful examples of multiple things: architectural support for programmers, and how to partition work between hardware and software. It's really hardware-software cooperation, which is similar to what I said earlier, and it's a very successful example of the programmer-architect trade-off: it makes the programmer's life super easy, and it makes the architect's life, both the system architect's and the microarchitect's, harder, as we have seen. But it is one of the most successful examples.

Going forward, though, how does it scale into the future? Not very well, in my opinion, and a lot of papers have been written on this; I'm going to show you some pictures of it too. Basically, we have huge and increasing physical memory sizes, both local and remote; 256 terabytes is starting to sound small for our memory sizes, because data is huge. We have hybrid physical memory systems (DRAM, NVM, SSDs), and we increasingly want to manage everything under a single memory abstraction. There are many accelerators in the system addressing physical memory, and they want to be part of the virtual memory too. And there are virtualized systems, like I showed you earlier: hardware hypervisors, software virtualization, local and remote memories. So how does it scale into the future? The management is quite complex overall, so it's not going to scale very well.

There's one question: in the case of hashing, will the synonym and homonym problems be reduced? Not necessarily. That's a very good question, but not necessarily: you still have a virtual address and you still have a physical address; hashing is just a different way of determining the mapping.
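Since the inverted and hashed page tables came up both in the alternatives and in that last question, here is a minimal sketch under assumed parameters: one entry per physical frame, a made-up hash over the (ASID, VPN) pair, and chaining for collisions. Exactly as the answer above says, the hash only changes how the mapping is found; the virtual and physical addresses, and thus synonyms and homonyms, are still there (the ASID field is what deals with homonyms).

```c
#include <stdint.h>
#include <stddef.h>

/* A sketch of a hashed inverted page table: one entry per physical
 * frame, found by hashing the (ASID, VPN) pair and chaining on
 * collisions. The layout, hash, and sizes are assumptions. */
#define NUM_FRAMES 4096

typedef struct {
    uint16_t asid;    /* address space ID (disambiguates homonyms) */
    uint64_t vpn;     /* virtual page mapped into this frame       */
    int32_t  next;    /* collision chain; -1 terminates            */
} ipt_entry_t;

static ipt_entry_t ipt[NUM_FRAMES];
static int32_t     bucket[NUM_FRAMES];     /* hash bucket heads */

void ipt_init(void)
{
    for (size_t i = 0; i < NUM_FRAMES; i++) {
        bucket[i]   = -1;                  /* all buckets empty */
        ipt[i].next = -1;
    }
}

static size_t hash_va(uint16_t asid, uint64_t vpn)
{
    return (vpn * 0x9E3779B97F4A7C15ULL ^ asid) % NUM_FRAMES;
}

/* Returns the physical frame number, or -1 on a miss (page fault).
 * Storage is proportional to physical frames, not virtual pages --
 * that is the compactness argument made above. */
int32_t ipt_lookup(uint16_t asid, uint64_t vpn)
{
    for (int32_t i = bucket[hash_va(asid, vpn)]; i >= 0; i = ipt[i].next)
        if (ipt[i].asid == asid && ipt[i].vpn == vpn)
            return i;   /* entry index doubles as the frame number */
    return -1;
}
```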
Okay. So if you want to learn more about rethinking virtual memory, we do a lot of that in my research group, and this is one of the papers we wrote last year that rethinks virtual memory from different perspectives; you can also watch a lecture related to it. It tries to solve the rigid page table structure problem, the translation overheads, which we have discussed many times, and heterogeneous memory management. We're not going to cover it, but if you're interested, you can take the future lectures or read the paper. And there's a lot more in virtual memory: if you want to know more about the synonym and homonym problems, we cover those in other lectures, which we don't have time for here, and as you can see, there are multiple solutions to the problem. There are more lectures on virtual memory that I'm not going to go over right now, but hopefully this gave you a reasonably comprehensive overview of the issues in virtual memory, which I think is critical to get exposed to as early as possible in your career, because there are a lot of cool ideas in it. There is a lot of complexity in the end, but it's one of those things that has been successful despite the heavy complexity, because it makes the programmer's life easier.