Transcript for:
[Lecture 15] Understanding SSD Architecture and Management

Are we live? Okay. Is my voice also okay? Okay, good. Good afternoon, everyone, welcome to another lecture, lecture 15 in the computer architecture course. Today, as we promised earlier, we're going to cover flash memory and solid-state drives. Today and tomorrow, actually; both days we're going to cover these topics. Today I'm going to focus mostly on the storage architecture: how an SSD works and how we manage an SSD. To do that I will also provide some background on flash memory, but tomorrow we'll do a deep dive into flash memory to understand its reliability issues and how we can overcome them. Any questions before we start?

We have actually published a lot of papers on this topic. MQSim is one of them, a paper where we designed a simulator; we may have some lectures about MQSim later on. Some of you might have worked with MQSim; I believe you used it in one of our courses. MQSim has been used in a lot of our research, and other people are using it too. It's actually a state-of-the-art simulator for SSDs, which we're going to learn more about. I'm going to cover some of these papers today, but not all of them; for some, like GenStore and MegIS, we'll have dedicated cutting-edge research presentations later. I think John presented a little bit about them last week. So this topic is quite exciting, we have done a lot of work on it, and we always look for interested students: if you want to continue this line of work, just reach out, we have a lot of ongoing projects on this topic.

Okay. I'm going to provide an overview of SSD organization first, then we'll look at address mapping and garbage collection, which are among the most important internal tasks inside the storage device. Then we'll get to I/O scheduling, and if we have time we'll also go into flash memory array parallelism at the end of this lecture.

A modern SSD is a complicated system that consists of multiple cores, hardware controllers, DRAM, and NAND flash memory packages. You can see them, more or less, in this picture: these are NAND packages, four of them on this side of the PCB, and four more NAND packages on the other side. This is the architecture of one of the Samsung devices. We also have LPDDR DRAM, which you can see here, and the SSD controller, which has several cores, three cores in this example, plus some hardware flash controllers. Each flash controller does request handling, ECC, and randomizing, and also has an encryption engine. We're going to learn about all of these modules in this lecture.

Inside the SSD controller there are different components, or different steps, that a request goes through. One of the first is the host interface layer (HIL), whose goal is to implement the interface protocol: SATA, as in the past, or NVMe, which we have today. The job of the host interface layer is to provide an interface so that the host system, say the CPU, can communicate with the storage device. In addition, we have the FTL, the flash translation layer, which has a lot of important duties: data cache management, address translation, garbage collection, wear leveling, refresh, and so on. We're going to learn about the different jobs of the FTL in this lecture.
We also have the flash controller, as we discussed; the flash controllers are part of the SSD controller, and in each flash controller we do ECC and randomizing. Each flash controller manages a NAND flash package, and a NAND flash package can consist of several NAND dies, which we'll also learn about today.

In the DRAM module we keep several pieces of information. One is the host request queue: you need queues to hold incoming requests, and they live in that DRAM. We also have the write buffer in that DRAM, and we'll see the reason for it. And, very importantly, the logical-to-physical mapping, which is the most capacity-hungry component in that DRAM; most of the DRAM is allocated to the logical-to-physical mapping, and we'll come back to it later. There is also some metadata about P/E cycles, program/erase cycles, which is used to implement other operations such as garbage collection, wear leveling, and refresh.

Okay, so now let's walk through this SSD controller to see how it works, and for that we start with the write operation. Assume the host system sends a write request to the storage system; we want to write some pages into the storage. Communication with the host operating system goes through the HIL: this layer receives and returns requests via a certain interface, SATA or NVMe, as I already explained; we'll also learn a little about NVMe at the end of this lecture. The host I/O request includes the request direction, whether it's a read or a write. It also has an offset: you need to give the starting point as a sector address. From the OS, from the file system, the storage is seen as a logical view, like one big array: there is a big array, you have chunks of pages in that big array, and to send a write request the OS only needs to say what the offset in that big array is, that is, from which point it wants to start writing. That's why we need the offset. Then you need to tell the storage the size, how many sectors you want to write. By specifying the start point and the number of sectors, you finalize your request. All of this is typically aligned to 4 KB.
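As a concrete sketch (not from the lecture slides), here is how such a host I/O request could be modeled in Python. The field names are my own, a simplified stand-in for what a SATA/NVMe command carries, not a real driver structure.

```python
from dataclasses import dataclass

SECTOR = 512          # sector size inherited from the HDD era, in bytes
PAGE   = 4096         # the 4 KB unit that modern I/O is aligned to

@dataclass
class HostRequest:
    is_write: bool        # request direction: read or write
    start_sector: int     # offset into the logical "big array" of sectors
    num_sectors: int      # request size, in sectors

    def page_aligned(self) -> bool:
        # Sub-page (sector-granularity) access is allowed but costly
        # for the SSD, so requests are normally 4 KB-aligned.
        return (self.start_sector * SECTOR) % PAGE == 0 \
           and (self.num_sectors * SECTOR) % PAGE == 0

req = HostRequest(is_write=True, start_sector=128, num_sectors=16)
print(req.page_aligned())    # True: an 8 KB write at byte offset 64 KB
```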
That 4 KB unit actually comes from the SSD domain. In the past, with hard disk drives, we used a sector size of 512 bytes, but with SSDs people realized that increasing the page size helps performance and energy efficiency, so they worked hard, also at the OS level, and increased the page size to 4 KB. That doesn't mean we cannot have sub-page access: you can still send a request to the SSD at sub-page, sector-size granularity, but those accesses are costly for the storage device. In addition, and I think we discussed this last Thursday, the usual view is that we have separate memories: main memory, and then storage as the second level of the hierarchy. We also discussed whether we can unify them into one unified memory architecture, and that unified memory can contain different memory devices: flash memory, hard disk drive, DRAM, and also PCM, for example. In that setting you can think of having byte access to the flash; flash memory doesn't have to provide only page-granularity access, it can also be sub-page. Yes, it has overheads, of course, but it's a matter of trade-offs: in such a system you can think about how much performance you gain by providing finer-grained access to your flash memory, what the overheads are, and then make a decision. Any questions about the high level of this write operation? Okay, good.

Once we are done with the HIL, we get to the FTL. First we buffer the data in the write buffer: the data you want to write needs to be buffered, and there are a couple of reasons for that. One is that writes in flash memory are quite costly in terms of latency, and if you paid the write latency every time, you would cause a lot of performance overhead in the system. So most of the time you just write to the write buffer, send the acknowledgment to the host system, and the actual write happens in the background. That's why the write buffer is essential for reducing write latency. It also enables flexible I/O scheduling. For example, consider scheduling between writes and reads: reads are relatively fast, but writes are quite slow, so in general you would always prioritize reads over writes. But once you have a write buffer, some write operations become as fast as reads, so you have more flexibility in the decision; sometimes you can actually schedule writes ahead of reads. That's why flexible I/O scheduling matters here. The write buffer can also help improve lifetime. Recall that all these non-volatile memories (flash memory was also considered emerging in the past; nowadays it's not emerging anymore) have an endurance problem: they can afford only a limited number of writes, or program/erase cycles, which we'll learn about later. If you have a write buffer, some writes are absorbed in the buffer, which can reduce the number of writes that ultimately reach the flash, provided you have good write locality. That said, this is not so likely: the locality you see at the storage level is not that high, and most of the time you don't rewrite the same logical address in the near future. So the lifetime improvement from the write buffer is not that likely, but you can still get some benefit.

The write buffer has issues, though, mainly its limited size, tens of megabytes. The main reason: remember I said that once you write to the write buffer, you usually send the acknowledgment to the host right away. (Some writes can be marked synchronous, meaning the system has to write them to the flash before acknowledging, but for many other writes you just write to the buffer.) That doesn't mean the write buffer is allowed to lose your data: the storage system is considered persistent memory, so it must keep your data. But the write buffer is DRAM, and if there is a sudden power cut, that DRAM can simply lose your data.
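To make the write-buffer behavior concrete, here is a toy sketch showing early acknowledgment and coalescing of rewrites. The class, its eviction policy, and the `program` stand-in are all my own illustration, not the lecture's or any product's actual design.

```python
class WriteBuffer:
    """Toy DRAM write buffer: absorb writes, acknowledge immediately,
    and coalesce rewrites of the same logical page if locality exists."""

    def __init__(self, capacity_pages, program):
        self.capacity = capacity_pages
        self.program  = program      # stand-in for the chip-level program
        self.dirty    = {}           # logical page number -> data

    def write(self, lpn, data):
        # A rewrite just overwrites the DRAM copy: one fewer flash
        # program later (the lifetime benefit mentioned above).
        self.dirty[lpn] = data
        if len(self.dirty) > self.capacity:
            self.flush_one()         # staging happens in the background
        return "ACK"                 # host sees DRAM latency, not flash

    def flush_one(self):
        lpn, data = self.dirty.popitem()
        self.program(lpn, data)

wb = WriteBuffer(capacity_pages=2, program=lambda lpn, data: None)
print(wb.write(3, b"a"), wb.write(3, b"b"))   # second write is coalesced
```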
To avoid that, in enterprise SSDs we have capacitors, and those capacitors hold enough charge so that whenever you have a power cut, they can power the staging operation: writing the dirty data in the write buffer back to the flash memory. That's essential, and it's also why you cannot have a big DRAM buffer: a bigger DRAM buffer needs larger capacitors, and those capacitors are costly in area, in power, and in many other respects. That's why people try to keep the write buffer as small as possible. That being said, there are works that try to build the write buffer out of other, non-volatile technologies, for example phase-change memory; you can check out papers that build a hybrid of phase-change memory and flash technology. There are also commercial products that use the SLC mode of flash as the write buffer: when your flash memory is in TLC mode, storing three bits per cell, it's quite slow, but you can set aside a small flash region used only in SLC mode, and that SLC-mode flash can serve as a write buffer, or cache, in front of your TLC or QLC flash. So people try to increase the effective size of the write buffer with tricks like these.

Another FTL operation is address translation, which we'll cover in depth today. One core piece of functionality is handling out-of-place writes: whenever an update arrives, the SSD does not overwrite the location that was written before; it finds another location and writes there. That's why we call them out-of-place writes; we'll get to the reason. And that's why you always need to keep a mapping table: a table that tells you which physical location each logical address from the OS maps to. This is the main consumer of the DRAM; most of the DRAM space is allocated to this logical-to-physical mapping. The mapping granularity is usually 4 KB, which means we keep about four bytes per 4 KB page, and that's how we get to roughly 0.1% of the SSD capacity. That's why the DRAM capacity in enterprise SSDs is usually around 0.1% of the SSD capacity: if the SSD capacity is 1 TB, the DRAM is around 1 GB, for example. Any questions? (Student question.) Yes, but not all of it, only the parts you have changed; there are regions of flash memory reserved for persisting the mapping itself, and that's your durable copy of it. Very good question. (Student question.) Yes, you map it to another page. That's one of the reasons, but there are other reasons too; we'll get to them in later slides, so keep your question. (Student question.) Yes: if you know the number of pages and blocks you have, you need about four bytes per entry.
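Here is the back-of-the-envelope arithmetic for that 0.1% figure as a tiny script, using the lecture's illustrative numbers (1 TB drive, 4 KB mapping granularity, 4-byte entries).

```python
# Size of the logical-to-physical (L2P) mapping, assuming one 4-byte
# physical page address per 4 KB logical page.
ssd_capacity = 1 << 40                 # 1 TB
page_size    = 4096                    # 4 KB mapping granularity
entry_size   = 4                       # bytes per mapping entry

num_pages    = ssd_capacity // page_size        # 2**28 logical pages
mapping_size = num_pages * entry_size           # 2**30 bytes = 1 GB

print(mapping_size / ssd_capacity)     # ~0.001, i.e. ~0.1% of capacity
```

Note that a 4-byte entry (32 bits) can address 2^32 pages of 4 KB each, i.e. 16 TB, which is why four bytes is a comfortable upper bound for today's capacities.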
Of course, you could be more precise and say that maybe three bytes is enough, but four bytes is the usual estimate. Okay, any other questions? Those were very good questions.

Another operation (here I'm just giving an overview; we'll get to the details later) is garbage collection. Since you don't write an update to the same location, you pick another free location and write there, at some point you will run out of free locations. So you need to reclaim free pages, and that's the job of garbage collection: it selects a victim block, copies all of that block's valid pages to free pages you already have, and then erases the victim block to produce another available free block. We'll see this in a lot more detail; there's also a small sketch of the victim-selection loop after the refresh discussion below.

Wear leveling is another important operation. As we discussed, these memories all have an endurance problem, and if you keep writing to the same location, the device wears out quickly. So wear leveling is an important task: evenly distribute P/E cycles, program/erase cycles, across the NAND flash blocks. There can be many implementations of wear leveling, but at a high level most of them swap hot and cold blocks: if there is a hot block, meaning you write to it a lot, and there are cold blocks, meaning you haven't written to them much, you can just swap them. That's actually quite easy in an SSD implementation, because you already have the mapping indirection: if you change the logical-to-physical mapping, you can easily implement wear leveling. At the main-memory level it's not as easy, because you don't have this logical-to-physical indirection; that's why implementing wear-leveling techniques for PCM as main memory is not that easy.

Another operation is data refresh, which doesn't happen very frequently. You know that DRAM needs to be refreshed very frequently because it loses charge; flash memory also needs refresh (we'll learn more about it tomorrow), but only for pages with long retention ages, for example every month or every year, depending on your technique. There is a trade-off here too: if you refresh your NAND flash pages every week, say, you pay some energy cost, but at the same time you can reduce the required error-correction capability, and that buys you other benefits. So, as usual, there is a trade-off.
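Here is the promised garbage-collection sketch. The greedy "fewest valid pages" policy is one common textbook choice, not necessarily what the lecture's slides or any product uses; the 384 pages/block default matches the TLC arithmetic later in the lecture.

```python
def pick_victim(valid_pages_per_block):
    # Greedy policy: the block with the fewest valid pages costs the
    # least to migrate before its erase. Real FTLs also weigh block
    # age and wear into this choice.
    return min(valid_pages_per_block, key=valid_pages_per_block.get)

def garbage_collect(valid_pages_per_block, free_pages, pages_per_block=384):
    victim = pick_victim(valid_pages_per_block)
    migrated = valid_pages_per_block[victim]   # valid pages copied out...
    free_pages -= migrated                     # ...consuming free pages,
    free_pages += pages_per_block              # then the erase reclaims all
    valid_pages_per_block[victim] = 0
    return victim, free_pages

# Example: block B has only 10 valid pages left, so it is the victim.
counts = {"A": 300, "B": 10, "C": 120}
victim, free_now = garbage_collect(counts, free_pages=50)
print(victim, free_now)    # B, 50 - 10 + 384 = 424
```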
Now, the flash controller. For a write, your data is sent to the flash controller, and the flash controller first scrambles the data: there is a key, and the randomizer unit scrambles your data with it in order to avoid worst-case data patterns. Studies, some of which we'll show tomorrow when we dive deep into flash memory, show that certain data patterns cause a lot of reliability issues. These patterns can cause reliability issues as well as security issues: once such patterns are known, a malicious program can use them to cause reliability problems for your data. So the flash controller scrambles your data with a key, so that hopefully the worst-case data patterns are avoided. That's the job of the randomizer: the data written to your flash memory is different from the data you sent to the device. (A toy scrambler sketch appears at the end of this read walkthrough.) You also apply the error-correcting code, ECC, which can detect and correct errors. As an example, you keep 72 bits per 1 KB of data for error-correction capability, storing additional parity information together with your randomized data: inside the flash memory you write your randomized data plus the ECC parities. After that, you can issue the NAND flash commands, saying: I want to write to this NAND flash package. Any questions? (Student question.) No, it scrambles the data itself, which means that when you want to read it, you need to derandomize it; the randomization algorithm must have that capability. Usually it's just a chain of XORs: you XOR to randomize, and XOR again to derandomize. (Student question.) Say it again? The question is: when you write a page to a block, do you randomize the whole data of that page? Yes, you modify the data using this randomization algorithm. Questions? Okay, good.

Now that we're done with writes, let's see what happens when we read. The host system sends a read request, and at the FTL we first check whether the requested data exists in the write buffer; we might be lucky. Again, that's not the frequent case, because most of the time you don't have that locality, but it can happen that you find your data in the write buffer, and in that case you return the corresponding request immediately with the data. Also, a host read request can involve several pages: when you send a read request to your storage, it can ask for one page or for several, and the HIL sends the acknowledgment to the host with the data once all of it is ready. But if your data is not in that cache, the write buffer, you need to do address translation. Why? Because you know the logical address, but you don't know where the data has been stored. So you consult the logical-to-physical mapping table to get the physical page address (PPA stands for physical page address). Once you know it, you send the read command to the corresponding NAND flash page. After getting the data back from the NAND flash, you perform ECC decoding, and if everything is okay you derandomize the raw data so that your data is ready, and then you send it back to the host.
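Here is that toy scrambler sketch. Real controllers typically use LFSR-based scramblers seeded per page; the hash-derived keystream below is only a self-contained stand-in that shows the key property the lecturer mentioned: XOR is its own inverse, so the same function descrambles.

```python
import hashlib

def scramble(data: bytes, key: bytes, page_addr: int) -> bytes:
    """Toy randomizer: XOR the page with a keystream derived from a key
    and the page address, so repeated host data doesn't create
    worst-case cell patterns on the flash."""
    stream, counter = b"", 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + page_addr.to_bytes(8, "big")
                                 + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))

page = b"\xff" * 32                    # a worst-case-looking pattern
out  = scramble(page, b"secret", 42)
assert scramble(out, b"secret", 42) == page   # descrambling recovers data
```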
There can be cases where ECC decoding fails, which is actually quite frequent, especially when your device is aged; you'll see a lot of these ECC failures. Then the flash controller needs to retry the read. When retrying, we adjust the reference voltage. We'll learn what the reference voltage is, but essentially, remember how PCM works: PCM had a distribution of resistance, and we showed that we can partition that distribution to decode what data was stored. In flash memory, likewise, you have a distribution of threshold voltages, and you can chop that window into several regions, or margins. You apply a reference voltage, and based on it you decide whether your data is a zero or a one, and so on. When a read fails, you adjust that reference voltage, reduce it or increase it, and retry. (See the retry-loop sketch below.) That's one reason aged devices are quite slow: you might need several reads to get one piece of read data. (Student question.) Yes, exactly, the flash controller does that. Let me repeat the question: after all these retries, do you eventually get your data correctly or not? It's possible that you don't, and if that happens your system is basically, well, you may get a blue screen, I'm kidding, but basically your device is failing. The goal of a storage device is that with all this read-retry and all the ECC, you do get your data. There are also larger-scale studies, which I'll show tomorrow, that we did at the server level, at the software level: there you can observe those failures and conclude that a device is not working. Facebook, for example, reports that some of their storage devices are not working, and you can get some idea from that; at that level, it means all the ECC machinery and the retries really didn't work. Most of the time you don't see these errors, because the storage subsystem fixes things. There is also another level, the RAID controller: you can have several storage devices arranged in a RAID, and RAID provides additional parity, so at the software level you also get rid of many errors. But at some point, yes, your device can simply hand you wrong data.
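Here is the read-retry flow as a sketch. The shift schedule and the three injected callables (`sense`, `ecc_decode`, `descramble`) are placeholders for controller operations, assumptions of mine rather than a documented algorithm; real controllers use vendor-specific retry tables.

```python
def read_page(sense, ecc_decode, descramble, ppa):
    """Read-retry sketch: try the default reference voltage first, then
    progressively shifted ones until ECC decoding succeeds."""
    for vref_shift in (0, +1, -1, +2, -2, +3, -3):   # retry steps
        raw = sense(ppa, vref_shift)       # pays tR each time: retries
                                           # are why aged devices read slowly
        ok, data = ecc_decode(raw)         # detect / correct bit errors
        if ok:
            return descramble(data, ppa)   # undo the randomizer last
    raise IOError("uncorrectable page: read-retry steps exhausted")
```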
Okay, now we can get into more detail; we're done with the overview. Let's see what a flash cell is. A flash cell is basically a transistor. You know that each transistor has a source, a drain, and a gate, and depending on the voltage between gate and source, if it's larger than the threshold voltage, the transistor turns on, and otherwise it's off. So you can think of it as a switch. The difference in a flash cell is a special material, called the floating gate in a 2D flash cell, or the charge trap in the 3D structure; here I'm showing the 2D version, the floating gate. This special material can hold electrons in a non-volatile manner. When you want to program the cell, you apply a high program voltage, around 20 V, and this positive voltage pulls electrons, via tunneling, into the floating gate. When you cut that voltage, the electrons remain trapped there; they don't go back to the substrate. To get them back to the substrate, you apply a negative voltage, around minus 20 V, which pushes the electrons back: that's the erase operation of flash memory. So to program, you apply the positive voltage so electrons move into the floating gate; once there are electrons there, you have negative charge in the floating gate, which means you need a higher gate-source voltage to turn the transistor on. In other words, your threshold voltage is now higher: once you program a flash cell, it has a higher threshold voltage. Clear?

(Student question.) Yes, to erase it you apply, for example, minus 20 V. It's not that easy, though, and that's part of the reason program/erase doesn't work perfectly: once you erase, some electrons remain; it's not a perfect process. When you program again you add more electrons, so because of the leftover electrons you now have more negative charge when you erase again. It is not a perfect process, and that's why, after many of these program/erase cycles, at some point your flash memory is dead; you cannot use it reliably. (Sometimes I forget to repeat the questions, sorry.) Yes: one of the main reasons for the endurance problem in flash memory is that the program/erase process is not perfect. Each time you either leave some electrons behind, or, if you apply the positive voltage longer than needed, you may push too many electrons toward the substrate and end up with net positive charge in the gate.

(Student question.) Yes, exactly: to read, you apply a reference voltage between the two threshold voltages, the nominal threshold voltage and the programmed threshold voltage, which is higher. If the flash cell turns on at that reference voltage, it hasn't been programmed, and you can encode that state as one; if it doesn't turn on, it has been programmed, and you encode it as zero, for example. This is of course the SLC mode, just one and zero, but it can also be MLC, which we'll see shortly. Any questions about the basic operation of flash memory? Good. So, to erase, you apply a negative voltage, around minus 20 V, which tunnels electrons back to the substrate, and hopefully you get back to the nominal threshold voltage.

Now some flash cell characteristics. First, multi-leveling: a flash cell can store multiple bits. When you program you inject electrons, and when you erase you eject electrons, and you can control this quite accurately. If you define four levels in the threshold-voltage window, you can encode two bits; it can also be eight or sixteen levels. We even have QLC flash memory on the market now.
We also have retention loss, the same idea as DRAM but not as fast as DRAM: a cell leaks electrons over time. This is the initial state, say; after one year you have some loss, after another year more loss, and at some point you get a retention error. The catch is that P/E cycling, program/erase, not only limits lifetime but also makes retention worse. For example, starting from this initial state, after one year at around 1,000 P/E cycles you get to this charge level, meaning you can still read from the flash memory correctly; but after one year at 10K P/E cycles you get to this. So P/E cycles accelerate how fast your flash memory loses charge. (Student question.) Yes, program/erase: injecting electrons and ejecting electrons. Okay.

So that's the flash cell, and next we have the NAND string: multiple flash cells, 128 for example (there are other counts depending on the product), serially connected. You can see that we serially connect all these flash cells in a NAND string, and the NAND strings connect to a bit line. Assume this cell is your target cell: to read it, you apply the reference voltage to the target cell, and to make sure the other cells do not interfere with the operation, read or write, you apply a high enough voltage, which we call Vpass, to all of them; it's high enough to turn them on whether they are programmed or not. Here, for example, it's around 6 V. Tomorrow we'll learn that Vpass is actually quite high and causes a weak-programming effect: to program flash memory you apply a voltage much higher than 6 V, around 20 V as I showed, but even 6 V can cause some weak programming, and that weak programming causes some reliability overhead for the flash memory. But wait for tomorrow for that.

When we combine these NAND strings side by side in a row, very similar to how we structure a DRAM row, we call it a page, and you see the same terminology, like the wordline. The whole thing, all these bit lines and NAND strings, we call a block. Essentially, the number of pages per block is the number of wordlines times the number of bits per cell, because each of these cells can be SLC or MLC; so when you count how many pages each block stores, you have to consider whether your cells are SLC or MLC. Assuming SLC, each wordline stores one page, and the alpha here is the metadata, the ECC and some other information: a page is 16 KB, and on top of that you store some extra data. That's the wordline, and from this you can see the number of pages you can have in a block. (Student question.) Per block, yes. So that's pages and blocks. Program and erase, as you can imagine, are unidirectional: programming increases the cell threshold voltage, and to decrease the cell threshold voltage you must erase the cell.
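To make the pages-per-block arithmetic concrete, here is a tiny script using illustrative numbers (128 wordlines per block, 16 KB pages); real products vary, so treat these as assumptions rather than datasheet values.

```python
# Pages per block = wordlines_per_block * bits_per_cell, since each
# wordline stores one page per bit level (plus out-of-band metadata).
wordlines_per_block = 128
page_size_kb        = 16

for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4)):
    pages = wordlines_per_block * bits
    print(f"{name}: {pages} pages/block"
          f" = {pages * page_size_kb // 1024} MiB of data per block")
# SLC: 128 pages (2 MiB) ... TLC: 384 pages (6 MiB) ... QLC: 512 (8 MiB)
```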
There is no process that increases the threshold voltage in a program and also decreases it in a program; decreasing requires an erase, and erase is a completely different process: you apply a negative voltage, and you do it at a coarser granularity. The reason is that the erase process is quite slow, so to provide decent bandwidth it's done at the block level. Programming can be done at the page level: reads and programs happen at the page level, erases at the block level. And we know that programming a page cannot change zero cells back to one cells, so you need to erase before you write. That's one of the big issues we have in flash memory. As I said, the erase unit is the block, to increase erase bandwidth, and that makes in-place writes to a page very inefficient. Suppose you wanted an in-place update: there are pages in a block and you want to update one of them. The problem is you cannot just update it; you first need to erase that page in order to write to it, and when you erase that page you also erase all the other pages in the block, because the erase granularity is the block. So before doing that, you need to copy all the valid pages in that block somewhere, to DRAM say, then erase the block, and then program not only your page but also all the other pages you copied out. Now you can see how much time you would spend for every write if you wanted in-place updates. That's why flash memory does out-of-place updates: we hope to keep enough free pages that whenever an update arrives, we just write it somewhere else. This causes other issues, like the garbage collection I mentioned: at some point you run out of free pages and need to reclaim them. But I just wanted to give you some insight into why we do out-of-place writes. (There's a small sketch of such an out-of-place write below.)

(Student question.) So, normally in QLC, when you write, you write whole pages: in QLC each wordline stores four pages, so you need data for all four pages and then you program. You don't program only one bit of a QLC wordline. Is that correct, Rakesh, that for QLC we need to program four pages at the same time? We don't now, but that's exactly the job of the write buffer: whether it's DRAM or SLC, you first write there to collect the data, and once you have four pages ready, you write them. We'll come back to that when we discuss fine-grained mapping.
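Here is the promised out-of-place write sketch, a minimal model of the FTL path: allocate a fresh pre-erased page, program it, redirect the mapping, and leave the old copy as garbage. All names and the `prog` stand-in are my own illustration.

```python
def ftl_write(lpn, data, l2p, free_pages, valid, program):
    """Out-of-place update sketch: never reprogram a page in place."""
    old_ppa = l2p.get(lpn)
    if old_ppa is not None:
        valid.discard(old_ppa)       # the old copy becomes stale (garbage)
    new_ppa = free_pages.pop()       # pool shrinks until GC refills it
    program(new_ppa, data)           # chip-level page program (stand-in)
    l2p[lpn] = new_ppa
    valid.add(new_ppa)

# Two writes to the same logical page land on different physical pages;
# the first physical copy is left behind for garbage collection.
l2p, valid, free = {}, set(), [102, 101, 100]
prog = lambda ppa, data: None        # no-op stand-in for the flash program
ftl_write(7, b"v1", l2p, free, valid, prog)
ftl_write(7, b"v2", l2p, free, valid, prog)
print(l2p[7], valid)                 # 101 {101}  (page 100 is now garbage)
```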
So: a large number of blocks, more than a thousand, share the bit lines in a plane; you can see all these blocks sharing the same bit lines. The question, then, is how we control which block gets to access the bit line. That's done with transistors, isolation transistors, which you have seen a lot in the DRAM works we have done. Here we have these two select transistors per block, on the string select line and its counterpart, and when one NAND string wants to connect to the bit line, you turn on those two transistors. That's how you control which block drives the bit line, and that's the definition of a plane. In a flash die we usually have multiple planes, and the thing is that these planes share the row and column decoders. Because of that, you can do parallel operations across planes as long as the offset and the operation are the same: say you want to do the same operation, a read, across planes, and all the reads are consistent in their offset; then you can do them in parallel. This creates a trade-off that we'll see: multi-plane operations give you better bandwidth, but it's also quite hard to maintain that alignment all the time. Any questions? (Student question: how likely are multi-plane operations?) If you want to make them more likely, you have to control things when you write: at write time, and at garbage collection, you essentially need to dictate multi-plane-friendly layouts. That's one of the things that brings us to the trade-off; sometimes dictating multi-plane operations causes other overheads, and you need to do it intelligently, which is not easy. Exactly: when you control the write path, you need to make sure the planes fill up in the same way, and when you pick victim blocks, the victims should be aligned so you can erase them together. There are a couple of things going on there.

Okay. As we discussed, we have this threshold-voltage distribution in NAND flash memory. Here I'm showing SLC: the nominal threshold state we consider as erased, encoding one, and this is the programmed state. Of course we get distributions, not points, because of process variation: some cells are more easily programmed or erased than others, which is why we end up with these distributions. This is also the way we get to multi-level cells: to store M bits per cell we need 2^M threshold-voltage regions. Here we're showing eight regions, with which you can encode three bits, LSB, CSB, and MSB, meaning you can store three pages on one wordline. As you can see, we have a limited voltage window for the thresholds, which makes each threshold state narrower, and the margins get smaller; that's why we have reliability issues here. The threshold voltage also changes over time: after a while these threshold distributions move along the axis, and not always downward; tomorrow you'll see they can also shift to the right, not only to the left. And it's not just shifting: the distributions also widen as you program.
So at some point these regions start to overlap, and then there is no way to avoid errors: when you apply a reference voltage between them, you will get some errors. Hopefully you can fix those errors with ECC, and that's why ECC is needed: you can apply a lot of techniques to reduce the ECC requirement, but you cannot get rid of ECC completely, because in the end there can still be some collisions that ECC has to handle. Any questions? Whenever I think about such things, I'm amazed our systems work at all. With all these error sources it's really incredible: you just copy data to your flash memory, your SSD, and you never think about such errors. Quite interesting.

Okay, let's get to the basic operations, like page program. We discussed the read operation at the system level, at the storage level; now we want to see how reads and programs work at the flash memory level. Assume your data has been sent to the flash memory and the controller asks the flash memory to write, sending the program command. You have a target page; assume SLC here. To program it, you activate the SSL and GSL select lines, those isolation transistors. The other wordlines are not involved in your operation, so you apply Vpass, which is high enough to turn them on, and to the target wordline you apply the program voltage. And here we have an interesting control knob: the bit lines. When you program, the whole page has been erased, meaning all cells currently hold one. Of the data you want to store, some bits need to become zero and some must stay one, and those you must not program. If you just applied the program voltage, you would make everything zero, and that's not what you want; you want to control which cells get programmed and which don't. There's a very interesting mechanism for this: we can inhibit cells from being programmed by applying VCC to their bit lines. Bit lines connected to ground get programmed, while for cells you don't want to program you apply VCC to the bit line, inhibiting those cells so the program voltage does not affect them.

So this is the threshold-voltage distribution before programming, and after programming you hopefully get a distribution where some cells are programmed and the others are not, depending on your data pattern. Unfortunately, if you apply the full program voltage at once, that 20 V, you instead get a distribution where the cells you wanted to program are scattered across the whole window, which means your programming was not successful. The reason, as I said, is process variation: some cells are hard to program, and their threshold voltage barely changes.
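To make the bit-line inhibit mechanism concrete, here is a one-function sketch of the per-pulse bit-line setup; the voltage values are illustrative placeholders, not from the slides.

```python
def bitline_voltages(page_bits):
    """Bit-line setup for one program pulse (sketch): after erase, every
    cell reads '1'. Cells that must become '0' get a grounded bit line,
    so the program voltage shifts their threshold; cells that must stay
    '1' get VCC on the bit line, which inhibits programming."""
    GND, VCC = 0.0, 3.3            # illustrative voltages
    return [GND if bit == 0 else VCC for bit in page_bits]

print(bitline_voltages([1, 0, 0, 1]))   # -> [3.3, 0.0, 0.0, 3.3]
```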
Other cells are easy to program, so their threshold voltage changes a lot. That's why one-shot programming doesn't work. To overcome this, people came up with an interesting technique called incremental step pulse programming, ISPP: you apply the program voltage step by step. You first apply a program voltage lower than the full program voltage you would need; with that step, some cells get programmed, and you reach this distribution after applying Vpgm0. Then you verify, by reading from your flash memory, which cells are already programmed. For the next cycle you inhibit them: this cell, for example, verifies as programmed, so you apply VCC to its bit line so the next program pulse doesn't affect it anymore. Then you apply Vpgm1, Vpgm2, and so on, and at some point you reach this beautiful tight distribution. That's the ISPP process, very briefly; there are many other details I won't go into now, but you get the whole idea. (A small simulation sketch of this loop follows below.) Any questions? Interesting. Good.

Now let's see how we read. Here, things are a bit easier, I hope. In SLC mode you apply the reference voltage, and for the bit-line control you charge all bit lines to VCC. We'll get to the sensing circuitry today, but essentially you initialize by precharging; the techniques are similar to DRAM, because both are charge-based memories. So you precharge the bit line to VCC and then apply Vref. If the reference voltage does not turn on your flash cell, that means you programmed it: current doesn't flow, and you encode it as zero. If the bit-line current flows well, you encode it as one. The way we detect whether the bit-line current flows or not is the sensing circuitry, which we're going to learn about, but the process is quite simple: if the bit-line current flows well, the switch is on, meaning you haven't programmed the flash cell; if the bit-line current doesn't flow, you have programmed it, and you encode it as zero. Is that clear? Okay, good.
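Here is the ISPP sketch promised above: a toy program-verify-inhibit loop. The "hardness" factor loosely models process variation; the voltages and step size are illustrative, and the physics is deliberately simplified.

```python
def ispp_program(cells, target_vth, v_start=16.0, v_step=0.5, v_max=20.0):
    """ISPP sketch: apply a small program pulse, verify which cells have
    reached the target threshold, inhibit those, and repeat with a
    slightly higher pulse. `cells` maps cell name -> hardness factor
    (hard cells shift less per pulse)."""
    vth = {c: 0.0 for c in cells}           # all cells start erased
    inhibited = set()
    v_pgm = v_start
    while v_pgm <= v_max and len(inhibited) < len(cells):
        for c, hardness in cells.items():
            if c not in inhibited:          # VCC on bit line: untouched
                vth[c] += v_step / hardness # easy cells move faster
        inhibited |= {c for c in cells if vth[c] >= target_vth}  # verify
        v_pgm += v_step                     # next pulse, slightly stronger
    return vth

print(ispp_program({"easy": 1.0, "hard": 2.0}, target_vth=2.0))
# -> both cells converge near 2.0 V instead of scattering across the window
```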
Now let's see how we read a multi-level cell. MLC is the umbrella term, multi-level cell, but this example is actually TLC, triple-level cell: three bits per cell. If you look at the threshold window, the threshold distribution, you have seven reference voltages, which suggests you need to apply multiple reference voltages to get your data. But do we need to apply all seven reference voltages to read, say, the CSB bit in this example? Any thoughts: how many reference voltages do I have to apply to correctly read the CSB, and why? Say it again? Vref3, and Vref1 is also correct, but Vref0, no. And Vref5, yes. You need to check the reference voltages where you see the encoding toggle. Here you can see toggling from P1 to P2, where the CSB goes to zero, meaning Vref1 is important for me because it shows a toggle; Vref3 is also important because you see a toggle there, and Vref5 is also important. The other reference voltages are not important for reading the CSB. How about the LSB, how many reference voltages do I need? Two. Yes, correct.

Okay, so to read the CSB you apply these three reference voltages, and you read. First you apply Vref5, and here is the data you get: some of it you read correctly, this one is correct, this one is correct, but this one is incorrect, because Vref5 alone was not enough to read the CSB. Then you also apply Vref3 and read with it, and then finally Vref1, and you get the correct data. So you have an output from each of these read passes, and you need to combine those outputs somehow to get the final output. With this specific encoding you can get the output with XOR; you can check at home, as homework if you're interested, that XORing these outputs gives the final correct values in this example. (Student question.) Say it again? So: every time you read from MLC flash, you are essentially doing an SLC read at one reference voltage; here you do an SLC read three times and then combine. That's why MLC NAND flash memory requires on-chip XOR logic, and we have actually used that XOR logic to do computation as well, in one of our works, Flash-Cosmos, which we'll present at some point.

And you can see that the bit encoding affects read latency. With this encoding the CSB needs three reference voltages, three read passes; for the LSB we observed we need two, and for the MSB something else. So let's compare two encodings: the top one is exactly what I showed on the previous slide, and the bottom one is completely different. Compare the number of sensings for the LSB: for the top, you already answered, two. For the bottom? Four, exactly: that encoding needs four sensings to read the LSB page. Now compare the MSB: for the top you need two references, but the bottom is the good one, you only apply one reference. Now you can see the trade-off in the encoding, and people have done research on exploiting it intelligently: make the MSB page a cache for the LSB page, for example, because the MSB page is faster to read. So you can treat MSB pages as a cache, or as the home for latency-sensitive pages: if you have information about your application and know which pages are more latency sensitive, you can map them to the MSB, assuming this encoding. So yes, people have done a lot of research here. What exact encoding is used in today's products, we don't know; each company can have its own encoding. Rakesh could share some information, if... and now you cannot see Rakesh anymore. Okay, let's take a break. I guess we're at a good point; until 2:30, then we continue from here.
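As a concrete illustration of the multi-sense XOR read described above, here is a small runnable sketch. The Gray encoding below is my own construction chosen to match the lecture's toggle counts (LSB: 2 senses, CSB: 3 senses at V1/V3/V5, MSB: 2 senses); real vendors' encodings are proprietary.

```python
# Illustrative 2-3-2 Gray encoding for TLC states ER, P1..P7.
# Each string is (LSB, CSB, MSB).
ENCODING = ["111", "011", "001", "000", "010", "110", "100", "101"]

def slc_sense(state, boundary):
    # One SLC-style sense at reference V<boundary>, placed between states
    # <boundary> and <boundary>+1: the cell conducts (reads 1) when its
    # threshold state is at or below the reference.
    return 1 if state <= boundary else 0

def read_bit(state, bit):
    # Sense only at the references where this bit toggles between
    # adjacent states, then XOR the senses plus a fixed constant.
    toggles = [b for b in range(7)
               if ENCODING[b][bit] != ENCODING[b + 1][bit]]
    out = int(ENCODING[0][bit]) ^ (len(toggles) & 1)
    for b in toggles:
        out ^= slc_sense(state, b)
    return out

# Every bit of every state is recovered from 2 or 3 senses, never 7.
assert all(read_bit(s, bit) == int(ENCODING[s][bit])
           for s in range(8) for bit in range(3))
```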
Okay, let's get started again. Now I want to give you a bit more information about how the sensing circuitry in flash memory works. The NAND flash read mechanism consists of three steps: precharge, evaluation, and discharge. In precharge, you precharge the bit line to VCC; there is a transistor here that connects the bit line to the precharge voltage, and you turn it on to bring the bit line to that precharge level. There is also a capacitor here, CSO, which you charge to a high voltage. In evaluation, you apply the reference voltage: you turn off the precharge transistor, apply the reference voltage to your flash cell, and observe whether the bit-line current flows or not. If it flows, we know the cell is erased and you decode a one, and the other way around. After that, in discharge, you make sure the bit line gets completely discharged.

Now the interesting part: the latching circuitry. How do you actually sense whether the current is flowing or not? Here is a simplified version: this is your bit line (we're not showing the NAND string anymore), this is the precharge voltage with its transistor, here is the capacitor you charge in precharge mode, and this is the sensing circuit itself. You have two back-to-back inverters, and the two sides connect to that sense transistor through two separate transistors, M1 and M2. Let's walk through it together; it's actually quite easy, it looks a bit complicated but it's not magic. To initialize, in precharge mode, you activate the precharge transistor and charge CSO to one, so now you have a high voltage here. With a high voltage here, the sense transistor is on, so you have a connection to ground. Of the two transistors M1 and M2, you activate only M1; M2 stays off. Since M1 is on, the node out-prime is connected to ground through that path, so its voltage is zero, and because of the back-to-back inverters, Vout is set to one. And out stays at one, because M2 is not connecting it to ground. Clear? That's the initialization. Any guess how we evaluate? For the evaluation step, you disable the precharge transistor, disable M1, and then activate M2. Correct. Now let's see what happens. You have disabled precharge, and you have applied the reference voltage to the cell. We had charged CSO to one, and CSO controls the sense transistor. If the cell current flows, CSO drains to zero, which means the sense transistor's connection to ground is cut; there is no connection to ground anymore, so turning on M2 does nothing. Out-prime and out keep their voltages, and you read a one. And that's exactly what you should read, because the current flows.
But if the current doesn't flow, CSO keeps its voltage, so the connection to ground is still there; when you activate M2, Vout is pulled to zero, out-prime goes to one, and now you're reading a zero. Simple, right? It's not magic, honestly. Any questions? Interestingly, this circuit has a useful property: if you just swap M1 and M2, you read the inverse. That is, if in the precharge/initialization mode you activate transistor M2 instead of M1, and in evaluation you disable M2 and enable M1, you read the inverted value, meaning you can easily implement a NOT operation inside the flash memory: you store a one but read a zero. You can work through it on your own at home.

Now that you know the read and program processes, as well as the sensing circuitry, let's do some performance evaluation; we'd like to know what the performance of an SSD looks like. To evaluate performance we have different metrics for SSDs. One is latency, or response time: the time delay until the request is returned. The average read latency for a 4 KB page is around 67 microseconds (this data comes from one of the high-end Samsung SSDs), and the average write latency is around 47 microseconds. You can see that writes are actually faster than reads, and the reason is that you are writing to the write buffer; if you had to write to the flash memory itself, the average latency would be in the hundreds of microseconds. Another metric is throughput: the number of requests that can be serviced per unit time. For throughput we count requests and don't really care about request size; the metric for that is IOPS, input/output operations per second, and it has been used a lot for random accesses. Random read throughput is up to 500K IOPS for this SSD, and random write throughput is up to 480K IOPS. Bandwidth is the amount of data that can be accessed per unit time; here you do care about size, because you're measuring the amount of data: sequential read bandwidth is up to 3,500 MB/s, and write bandwidth up to 3,300 MB/s. Compare this with HDD latencies and performance: for a hard disk drive, latency is around 5 to 8 milliseconds (you can see why SSDs are everywhere now, for better performance), throughput for an HDD is around 1,000 IOPS, and bandwidth is around 100 MB/s. Essentially, we observe around two orders of magnitude improvement going from HDD to SSD.

For NAND flash chip performance, we have the chip operation latencies, and there are a couple of parameters to consider. One is tR, the sensing latency: the latency of reading data from the cells into the on-chip page buffer. Per flash chip we have a buffer, again very similar to DRAM with its row buffer; here it's the on-chip page buffer, and the time you spend sensing your data, deciding whether each bit is one or zero and putting it into that page buffer, is tR. We also have tPROG, the latency of programming the cells with the data in the page buffer.
For NAND flash chip performance, we look at the chip operation latencies, and there are a couple of parameters to consider. One of them is tR, the sensing latency: the latency of reading data from the cells into the on-chip page buffer. Per flash chip we have a buffer, very similar to the row buffer we have in DRAM; here it is the on-chip page buffer, and the time you spend sensing the data, deciding whether each cell holds a one or a zero, and putting it into that buffer is tR. We also have tPROG, the latency of programming cells with the data in the page buffer: assuming your data is already sitting in the page buffer of that flash chip, you spend tPROG to have it programmed into the flash array. And tBERS is the erase latency, which applies at the block level. All these values depend on the cell technology (SLC/MLC/TLC), the process node, and the microarchitecture. For the 3D TLC NAND flash memory in the same device I showed on the previous slide, tR is around 100 microseconds, tPROG is around 700 microseconds, and tBERS is around 3 milliseconds.

Another important metric is the I/O rate, the number of bits transferred via a single I/O pin per unit time. Current flash memories have eight I/O pins for data, and each flash chip is connected to a channel that is also 8 bits wide. Each I/O pin provides an I/O rate of about 1 gigabit per second, so the total bandwidth you have is around 1 gigabyte per second; transferring a 16-kilobyte page to the controller therefore costs around 16 microseconds. Just simple math.

Okay, so now let's do some calculation. These are the values for sensing, programming, and erasing; you also see some ranges here, because a read can be 50 to 100 microseconds on average, a program can be 700 microseconds up to 1 millisecond, and erase also varies, from 3 to 5 milliseconds. For a read operation, these are the steps: the flash controller sends a command to the NAND flash chip, then you spend the sensing time tR, then you transfer the data you read to the controller (tDMA), and then you do ECC decoding as well as de-randomization. For the command we can assume the time is negligible, so we don't count it; sensing is around 100 microseconds, tDMA is 16, and ECC decoding is around 20 microseconds in this example. So your read comes to roughly 136 microseconds.

Yes, question: can a read retry happen? It's actually quite likely, especially when your device is aged. There is a very nice report, which I will show tomorrow, where people took an SSD they had bought some years earlier, one that was supposed to provide some amount of bandwidth, tested it again after those years, and observed a dramatic performance drop; I'm not sure of the exact number, I will show it tomorrow. So when you've used your computer for a while and you notice it getting slower, this might be one of the reasons; there can be many other reasons as well, but the SSD also gets slower as you use it. And yes, exactly: tR and tDMA are per page read; if you need to retry, you essentially pay tR and tDMA again for each retry. In this example, though, we assume things are beautiful and you don't need to retry.

So those are the steps for a read, and for a program you do something similar: you first do randomization and ECC encoding, then you send the command and transfer the data to the NAND flash chip, and then you program. You can do the same kind of calculation and get the numbers for writes.
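As a sanity check on that arithmetic, here is a minimal sketch of the chip-level latency breakdown; the parameter values are just the example numbers from the slide, and treating a retry as repeating only tR and tDMA is the simplification stated above:

```python
# A minimal sketch of the chip-level latency arithmetic from the lecture.
# All parameter values are the example numbers quoted above, not a spec.

T_R    = 100   # sensing latency (us): cells -> on-chip page buffer
T_DMA  = 16    # moving a 16 KiB page over 8 pins at ~1 Gbit/s each (us)
T_ECC  = 20    # ECC decoding + de-randomization in the controller (us)
T_PROG = 700   # program latency (us)

def read_latency_us(retries: int = 0) -> float:
    """One page read; each retry repeats sensing and transfer."""
    return (retries + 1) * (T_R + T_DMA) + T_ECC

def program_latency_us() -> float:
    """One page program: encode, transfer to the chip, then program."""
    return T_ECC + T_DMA + T_PROG  # command time treated as negligible

print(read_latency_us())     # 136 us for a clean read
print(read_latency_us(1))    # 252 us if one read retry is needed
print(program_latency_us())  # 736 us
```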
Yes, a question about programming: when you send the program command, you need to send it to a pre-erased block, essentially to pre-erased pages. That page needs to be erased; the whole block was erased earlier, but at some point you have programmed some of the pages in that block. The page you want to write to, though, must still be pre-erased. You erase the whole block, but when you later write into it, not all pages in the block need to be erased anymore, because as time goes on some pages in it get programmed.

Okay, so you see some numbers here, and if you calculate the bandwidth based on these raw numbers, you get quite a low bandwidth. If you compare that against the read and write bandwidth I showed earlier, there is a huge difference, and the reason is that there are many optimizations and advanced commands that SSDs apply. First of all, we have internal parallelism, and we also have write buffers made of DRAM or SRAM, so you can accelerate things with write buffering; there are other optimizations too, some of which we'll see today. With internal parallelism, for example, you can issue several reads at the same time, so your average latency is reduced: even though each read may take around 100 microseconds, if you do several reads concurrently you get a much lower average latency.

So now I want to go through some of these optimizations quickly. One of them is what we do for small writes. The minimum I/O unit in modern file systems is 4 kilobytes, but the page size in most SSDs today, the capacity-optimized or cost-optimized ones, is 16 kilobytes. There are some performance-optimized SSDs that use smaller pages, but usually the SSDs you can buy have larger page sizes, like 16 kilobytes. So if the host sends a request for 4 kilobytes, you have a unit mismatch: when you read from the flash memory you read 16 kilobytes, but you only need 4 kilobytes. That's a problem, right?

There are two optimizations people have developed. One is sub-page sensing; there is a prototype from Micron with a microarchitecture-level optimization that directly reduces tR: when you want to sense only 4 kilobytes, you really do spend less time; you don't need the full sensing time that a 16-kilobyte page would require. There is an actual microarchitectural implementation, and if you're curious you can check the reference to learn more. The other optimization is random data-out: you still sense the full 16 kilobytes into the on-chip page buffer, but you don't spend the time to send the whole 16 kilobytes back to the flash controller; you only send out the data you want, 4 kilobytes for example, and that can be at any offset. This reduces tDMA and also the ECC decoding, because you only spend transfer and decode time on the data you care about. So these are two optimizations for handling the minimum I/O unit of modern file systems.
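A rough sketch of the random-data-out saving, under my assumption that transfer and decode time scale linearly with the bytes actually moved off the chip (sub-page sensing would additionally shrink the tR term):

```python
# A rough sketch of the "random data-out" saving described above.
# Assumption (mine): tDMA and ECC-decode time scale linearly with the
# number of bytes moved off the chip; full-page sensing still happens.

PAGE_KIB   = 16
T_R        = 100  # us, full-page sensing
T_DMA_PAGE = 16   # us, to move the whole 16 KiB page
T_ECC_PAGE = 20   # us, to decode the whole page

def read_us(requested_kib: int, random_data_out: bool) -> float:
    frac = requested_kib / PAGE_KIB if random_data_out else 1.0
    return T_R + frac * T_DMA_PAGE + frac * T_ECC_PAGE

print(read_us(4, random_data_out=False))  # 136.0 us: full-page transfer
print(read_us(4, random_data_out=True))   # 109.0 us: only 4 KiB moved out
```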
Another interesting command is cache read, which is quite good for consecutive reads, in a pipelined manner. This is the process for reading a page A, say: you spend tR, then tDMA, then tECC. For the second read, you have to wait until tR and tDMA of page A are done; once the channel is free you can send the read command for the next page to the flash chip and start its tR, tDMA, and tECC. So in a regular page read you can already overlap tECC with some of the next page's sensing time, which is good, but we want to do better, and that's what the cache read command gives us. When you send the read operation for page A, you also send the read operation for page B as a cache command: you are telling the flash memory, first give me page A, but immediately after that I also need page B. With that, you can overlap more: while the chip reads page A, sensing it and moving it to the page buffer, and then during the data-out of page A, which is tDMA, it can already start reading page B, meaning it starts sensing page B. So you overlap the sensing of page B with the tDMA and the ECC of the previous page. That's the cache command, essentially. Any questions?

Of course, it doesn't come with zero overhead: you need more buffers. You now need two page buffers, so it comes at the price of some hardware cost. This removes tDMA from the critical path, it increases throughput and bandwidth, and it reduces effective latency. As we discussed, it doesn't really reduce the latency of every individual request, but overall, since you have higher bandwidth, your effective latency is lower. But as I said, it isn't free: you need to add more buffers.
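Here is a small timing model of why cache read helps a run of back-to-back page reads; this is my own simplification, reusing the earlier example numbers, where tR = 100 us dominates tDMA + tECC = 36 us, so the pipeline becomes sensing-bound:

```python
# A simplified timing model (my construction) of cache read vs. regular
# back-to-back page reads, using the example numbers from earlier.

T_R, T_DMA, T_ECC = 100, 16, 20  # us

def regular_reads(n: int) -> float:
    # Each read occupies the chip for T_R + T_DMA before the next command;
    # only the ECC of the last page sticks out at the end.
    return n * (T_R + T_DMA) + T_ECC

def cache_reads(n: int) -> float:
    # Sensing of page i+1 hides the transfer and decode of page i
    # completely, since T_R > T_DMA + T_ECC in this example.
    return n * T_R + T_DMA + T_ECC

print(regular_reads(8))  # 948 us for 8 consecutive pages
print(cache_reads(8))    # 836 us: tDMA is off the critical path
```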
Another optimization is multi-plane operations, which we also briefly talked about before, so I'll go over it quickly. As long as your target pages in the different planes are aligned, you can access them simultaneously, and that gives you much better performance: you can overlap the sensing time of all those pages.

Yes, question: how do you actually get these big parallel accesses? Essentially, one way is to come up with large accesses, several accesses together that you send sequentially as one big request to the SSD. If you want the full bandwidth, you need to do it across the whole SSD: one good reason sequential reads are fast is that they can be scattered across channels, so you benefit from multi-channel parallelism. You have channel-level parallelism, die-level parallelism, and plane-level parallelism, and to get the maximum throughput you need to exploit all of them. There may be some reverse engineering involved to figure out the right access pattern, that is, how the FTL maps things; there are papers on that. Rish, do you have anything to add? Usually, when the application requests more than 64 kilobytes at a time, it is considered a sequential request; the application sends the request to the SSD, the SSD splits it into sub-requests and processes them through the FTL. That's what we consider a sequential request. For random requests, you usually have very small requests, around 4 kilobytes. Thanks, thank you.

Okay, so this is multi-plane operation: as long as the offset is the same, you can use it and operate concurrently. With multi-plane operations you overlap the sensing or program latencies, depending on whether you do a multi-plane read or write. Here is an example for program. Previously you programmed 16 kilobytes at a time and spent around 736 microseconds with our earlier calculation, which gives a bandwidth of about 22 megabytes per second. Now, with two planes, you program 32 kilobytes in one multi-plane operation. You can completely overlap the program latency itself, which is 700 microseconds, but you have to add the per-plane parts: twice tDMA and twice tECC. That's why the formula on the slide looks a bit complicated: the 736 microseconds is one program latency plus one tDMA plus one tECC, and then you add another tDMA and another tECC for the second plane. But since you are bottlenecked by the program latency, you get almost twice the throughput with the multi-plane operation.

So the per-operation latency increases, because you transfer more data: with a regular page program you have ECC encoding plus tDMA plus tPROG, but now you spend the number of planes times (ECC encoding plus tDMA), which makes each operation longer. The good thing is that the program latency is overlapped, so the average latency per page is reduced a lot. Any questions? Why do we need that extra term? This 736 is one program latency plus one tDMA plus one tECC; since you are doing a two-plane operation you add another tDMA and another tECC. The formula is honestly a bit misleading; instead of 736, think of it as 700 plus twice tDMA plus twice tECC. Is that clear now? And no, tDMA cannot be parallelized, because you have a shared channel: you have eight I/O pins, and you must send the data over the channel serially. There might be innovations where you encode your signal and send multiple signals on one link; if you know information theory, this is time-division multiplexing, and there has been research on that, but it makes things even more complicated and it's not that easy.

But how do we ensure multi-plane operations actually happen? We're going to learn about the FTL's data placement later today.
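The multi-plane program arithmetic above, written out; same example numbers as before, and the only modelling assumption is that transfer and encoding serialize over the shared channel while the cell programming of all planes fully overlaps:

```python
# Sketch of the multi-plane program arithmetic discussed above, using the
# earlier example numbers (16 KiB pages, T_PROG = 700 us, T_DMA = 16 us,
# T_ECC = 20 us for encoding).

KIB = 1024
T_PROG, T_DMA, T_ECC = 700, 16, 20  # us

def program_time_us(planes: int) -> float:
    # Encoding + transfer happen once per plane over the shared channel;
    # the cell programming of all planes overlaps completely.
    return planes * (T_ECC + T_DMA) + T_PROG

def write_bandwidth_mbps(planes: int, page_kib: int = 16) -> float:
    bytes_written = planes * page_kib * KIB
    return bytes_written / program_time_us(planes)  # bytes/us == MB/s

print(write_bandwidth_mbps(1))  # ~22 MB/s with one plane (736 us)
print(write_bandwidth_mbps(2))  # ~42 MB/s with two planes (772 us)
```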
Okay. Another important optimization concerns what we do for program and erase. Read performance is often more important, because reads are usually on the critical path, and writes can be done asynchronously using buffers, as we discussed. The problem is that we have a significant latency asymmetry: a read is around 100 microseconds, a program is around 700, and an erase is around 5 milliseconds, for example. And if the chip is designed to program all the pages in the same wordline at once, the same thing we discussed in the past, where with TLC you program all three pages at the same time, then that program latency is around 2,100 microseconds. So the worst-case chip-level read latency can be about 50 times longer than the best case: if you are unlucky and your read gets queued behind a program or an erase, you really have to wait a lot. That's what we don't like.

To fix this, people came up with the suspension technique: we suspend an ongoing program or erase operation once a read arrives. Here is an example. You have a program operation going on that takes around 700 microseconds, and a read operation arrives in the middle, but it cannot start because the chip is busy programming. So the program latency is 700 microseconds, and the read also effectively takes around 700 microseconds, even though the read itself could have been done in 100. What you can do instead is suspend the program, immediately service the read, and once the read is done, resume the program. Now the program latency increases to around 800 microseconds, but the read latency stays around 100 microseconds, so you can clearly see the improvement in the average latency. And the improvement matters even more because reads are usually on the critical path of the system.

So the pro is a significant decrease in read latency, but there are cons as well: you need an additional page buffer for the data being programmed, and it complicates I/O scheduling. Until when can we suspend an ongoing program request? I'll show something about this at the end of the lecture; you cannot suspend a program forever, at some point it needs to complete, so there has to be some intelligent control deciding how long suspension is allowed. It also has a negative impact on endurance: people have observed that repeatedly suspending and resuming a program can make endurance worse.

These are a summary of some of the optimizations we have, but there are many others that people have developed. Let's see; I have 240 slides, but it's fine: the last two topics, I/O scheduling and the flash memory array, I can teach quite fast, while this part needs a slower pace so that we all follow the topic. Any questions before we jump to address mapping?
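To see the averaging effect, here is a toy timeline of exactly that scenario; my construction, and it ignores the suspend/resume overhead, which in reality is nonzero and is part of why endurance suffers:

```python
# A toy model (my construction) of the suspend example above: a read
# arrives just after a 700 us program has started. Resume overhead is
# ignored here for simplicity.

T_READ, T_PROG = 100, 700  # us

# Without suspension: the read waits for the whole program to finish.
no_susp = {"program": T_PROG, "read": T_PROG + T_READ}

# With suspension: the read is serviced immediately; the program is
# stretched by the time spent servicing the read.
susp = {"program": T_PROG + T_READ, "read": T_READ}

for name, lat in (("no suspend", no_susp), ("suspend", susp)):
    avg = sum(lat.values()) / len(lat)
    print(f"{name:>10}: program={lat['program']} us, "
          f"read={lat['read']} us, avg={avg:.0f} us")
# no suspend: avg 750 us; suspend: avg 450 us, and the read stays at 100 us
```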
Okay. So we almost discussed this already: in an SSD we have the flash translation layer, also called the SSD firmware, and it provides backward compatibility with a traditional hard drive by hiding the unique characteristics of NAND flash memory. The FTL is responsible for many important SSD management tasks; we're going to discuss address translation plus garbage collection, which is essential for performing out-of-place writes. It also does wear leveling, data refresh, and I/O scheduling. Today we'll see address translation and hopefully I/O scheduling; wear leveling we won't cover, and data refresh we'll cover a little tomorrow, hopefully.

This is a simple SSD architecture. The host's view of the storage at the operating-system level is a flat block device; as an example, say we have 16 blocks, each with a logical block address. At the hardware level we have physical block addresses; assume here we have only one NAND flash chip and only one plane, to keep the example simple. Each page in a physical block is a physical page address. This is terminology you need to get used to; it's sometimes quite confusing, because at the OS, at the host, we call it a block address, but it essentially corresponds to a page address inside the SSD. You need to get used to that; it's unfortunate, I know. You can also see that we have extra space in our storage, and that's what we usually have: if you have an SSD that is one terabyte, the actual space inside is larger than one terabyte, and there are reserved pages used because of the out-of-place writes. This is over-provisioning, which, as we discussed, is quite important for performance and lifetime.

Okay, so now let's see what happens when we get a write request. We have this write request: the logical block address is 0, that's the offset; the size is 1, so we want to write one block; the direction is write, and the value is A. That's the request sent to the host interface layer, and the HIL passes it to the FTL. In the SSD back end we write this value A into block 0, page 0, so things are fine at this moment. Here we are also assuming that the logical block size equals the physical page size; usually the logical block is 4 kilobytes while the physical page is 16 kilobytes, which we'll get to with fine-grained mapping, but for simplicity we consider both to be 4 kilobytes here.

Later on we receive another request: the offset is logical block address 4, but the size is 12, so we want to write 12 blocks, direction write, with these values. Call it a sequential large write. When you look at the SSD back end you might think: this is LBA 4, and physical page 4 is free, so it would be nice to just write from page 4 to page 15, keeping things aligned. But that's not what happens: the SSD just continues writing into the same block, and when block 0 is full it moves on to block 1, block 2, and block 3. Why? One reason is the active block: one block that is ready to write, meaning it is already erased, and you are using it as your write point. We want to keep the number of blocks being written as low as possible; ideally only one block is being written at a time.
This is due to the problem we call the open-block problem. Once you keep a block open, it can cause reliability issues for the pages you have already written in that block; the longer you keep the block open, the more reliability issues you cause for those pages. That's why you want one active block: you write to it, finish it, close it, and move on to another active block. Sometimes the flash controller even writes dummy data to a block just to finish it off and close it, to make sure it doesn't start producing errors. Maybe you have something to add here? The question was: why do we need to write dummy data to close an active block, instead of just leaving the pages empty and moving on to newer blocks? There are disturbance effects that occur while a block is active and still has some wordlines empty. These are not characterized yet in research works, but they are known in commercial SSDs; that's the reason we always fully program the blocks and close them. Thanks.

Another issue is the program sequence constraint: there is a fixed program order within a block, due to cell-to-cell interference. In this example we don't actually hit that issue, because we are writing sequentially, B, C, D, and so on, but you cannot, for example, write to page 2 and then later come back and write to page 1; that can cause problems, so you always keep the order. It's even more of a problem for MLC NAND flash chips, where people have observed real issues; in SLC there are fewer issues, but you still want to obey the program sequence constraint.

So now we have the problem that the logical block address, or logical page address, does not match the physical page address. When we receive a read for logical block address 4, we don't know where it is stored. That's why we keep a mapping table: for logical page address 4, we record that it is stored at physical page address 1. We query this table, get the physical page address, and then we can access the NAND flash.

Okay, now assume we receive an update to logical block address 0: an update with the value A'. What we do here is an out-of-place update: our active block is block 3, the next page there is free, so we just write A' there, then we update the mapping table, and we also mark the previous page invalid. It's important that you mark it invalid. And yes, logical block address and logical page address are the same here, specifically because in our example blocks and pages have the same 4-kilobyte size; if they mismatch you need to be careful, and we'll see that in our next example with fine-grained mapping. And yes, you do need to keep track of this: we're going to see that you need a status table for every block. So you invalidate page 0 in block 0.
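A minimal page-level FTL sketch of exactly this flow; the geometry is a toy, and all the names (l2p, write_point, and so on) are mine, not from any real firmware:

```python
# A minimal page-level FTL sketch of the out-of-place update just shown.
# Toy geometry and names are mine; real FTLs track much more state.

PAGES_PER_BLOCK = 4
NUM_BLOCKS = 5

flash = {}  # (block, page) -> (data, lpa): stand-in for the NAND array

def flash_program(blk, pg, data, lpa):
    flash[(blk, pg)] = (data, lpa)   # lpa kept as per-page metadata

def flash_read(blk, pg):
    return flash[(blk, pg)][0]

l2p = {}                                     # logical page -> (block, page)
status = [["free"] * PAGES_PER_BLOCK for _ in range(NUM_BLOCKS)]
active_block, write_point = 0, 0             # one open block, one write point

def write(lpa: int, data) -> None:
    global active_block, write_point
    if lpa in l2p:                           # out-of-place update:
        old_blk, old_pg = l2p[lpa]
        status[old_blk][old_pg] = "invalid"  # invalidate the old copy
    flash_program(active_block, write_point, data, lpa)
    status[active_block][write_point] = "valid"
    l2p[lpa] = (active_block, write_point)
    write_point += 1
    if write_point == PAGES_PER_BLOCK:       # block full: close it, open next
        active_block, write_point = active_block + 1, 0

def read(lpa: int):
    blk, pg = l2p[lpa]                       # one table lookup, then the read
    return flash_read(blk, pg)

write(0, "A"); write(4, "B"); write(0, "A'")  # update goes out of place
print(read(0), status[0])  # A' ['invalid', 'valid', 'valid', 'free']
```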
Then you get another update; you write it to page 14 and invalidate the previous copy; then another update, and at some point you are running out of free pages. You don't wait until literally every page has been written and then ask, what can I do now? You need to be a bit proactive: once you see that not many free pages remain, you invoke this operation, which is garbage collection.

What does garbage collection do? Essentially, we have wasted space because of these invalid pages, and we want to get rid of them: erase the block and make it ready for future writes. So garbage collection reclaims free pages by erasing invalid pages. The problem is that the erase unit is a block, so we need to pick a victim block, and that victim block may still contain some valid pages. You first have to copy all the valid pages to other free pages, and only then can you erase the block.

Now you can see the performance overhead of garbage collection: it is dominated by the copying, not by the erase operation. The erase is not that much; but for every valid page in the victim, you spend a read plus a program, so you pay (tR + tPROG) times the number of valid pages. Garbage collection also has a lifetime overhead: you now have additional writes, which increase program/erase cycling. This is known as write amplification, a well-known metric for SSDs that people report for their devices. In a hard disk drive there is no such thing: the number of writes you send to the drive is the same as the number of writes that happen internally. But for an SSD you can calculate: I sent, say, 1,000,000 writes to my SSD, but internally it wrote 1,200,000 because of these operations, so I have a write amplification of 1.2. The larger the write amplification, the more your SSD is aging itself through program/erase cycles, and the more performance it loses; you always want to keep write amplification as low as possible.

One policy for garbage collection is the greedy victim selection policy: you erase the block with the largest number of invalid pages. You just search and select the block with the most invalid pages as the victim, and the idea is simple: that block has the fewest valid pages, so hopefully you minimize the copy latency. For this you need to add a status table, kept per physical block: for every page in the block you record whether it is free, valid, or invalid, and with that information you can select the victim. In our example, block 3 has the largest number of invalid pages, so we take it as the victim.
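Continuing the toy FTL from before, greedy victim selection and the write-amplification metric look roughly like this; again a sketch, with the latency constants from earlier:

```python
# Sketch of greedy victim selection plus the write-amplification metric;
# `status` has the same shape as in the FTL sketch above.

def pick_victim(status) -> int:
    """Greedy policy: the block with the most invalid pages."""
    return max(range(len(status)),
               key=lambda b: status[b].count("invalid"))

def gc_copy_cost_us(status, victim: int, t_r=100, t_prog=700) -> float:
    """GC cost is dominated by relocating the victim's valid pages."""
    return status[victim].count("valid") * (t_r + t_prog)

def write_amplification(host_writes: int, gc_copies: int) -> float:
    return (host_writes + gc_copies) / host_writes

print(write_amplification(1_000_000, 200_000))  # 1.2, as in the example
```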
Now you need to copy the valid pages. You have only one valid page here, M: you read physical page address 12, then you program it into one of the three pages still free in the active block; you write M there, and then you update the status: block 3 is now completely invalid, and block 4 now has two valid and two free pages. You also need to update the mapping for this M: in the past, logical page address 15 was mapped to physical page address 12, but now you need to map it to 17. How do you get that information? It's backwards, right? Normally you know the logical page address and you access the mapping table to get the physical page address, but here it's the opposite: you know the physical page address and you want the logical page address. For that, you could keep another table for physical-to-logical mapping, but what people actually do is store it in each physical page's out-of-band area. Remember we showed that every page has some extra bits used for metadata: you can store the logical address, the P2L (physical-to-logical) mapping, as additional bits alongside the page. So when you write to the flash, you also record what the logical address of that data was.

Now that you know block 3 is completely invalid, you can erase it and update its status so that all its pages are free to write. It's also important to say that this erase is what we call a lazy erase: once a block becomes completely invalid, you don't erase it immediately; you wait until you really need it. That's again because of the open-block problem: once you erase a block you make it ready to write, and you only want that when you actually need it, so you delay the erase. We issue garbage collection sooner than we strictly need it, so you have a block you could erase at any time, but you wait until you really need to erase it; that's also important for reliability.

So you can guess that garbage collection has performance issues: high latency because of all these copies. Block sizes are usually large; a block can contain, for example, 576 pages in one of the devices. Even assuming that only 5% of the pages in the victim block are valid, which is not a big fraction, 5% of 576 is about 28 pages. That means you copy 28 pages, and the GC latency becomes greater than 28 × (tR + tPROG) plus tERASE, which in this example is around 27 milliseconds. You can see this is significantly larger than the erase time itself; that's why I'm saying erase is not the dominant factor here, the copying is. And that's why the greedy policy makes sense: you select the block with the fewest valid pages precisely to reduce the copy latency. At the same time, you can also improve GC latency by doing multi-plane operations, which opens up another trade-off.
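Spelled out with the numbers from the slide (and assuming a 3 to 5 millisecond erase, as quoted earlier):

```python
# The GC-latency arithmetic from the example above: a 576-page block in
# which only 5% of the pages are still valid.

T_R, T_PROG, T_ERASE = 100, 700, 3_000   # us (erase can be up to ~5 ms)

pages_per_block = 576
valid = int(0.05 * pages_per_block)      # ~28 valid pages to relocate

copy_us  = valid * (T_R + T_PROG)        # 28 * 800 = 22,400 us of copying
total_us = copy_us + T_ERASE             # ~25 ms here; ~27 ms with a 5 ms erase

print(f"{valid} copies -> {copy_us/1000:.1f} ms of copying, "
      f"{total_us/1000:.1f} ms total; erase is not the dominant term")
```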
You might select a block that does not have the largest number of invalid pages, but you select it because it enables multi-plane operations. So you have a trade-off: multi-plane operations provide better bandwidth, but at the same time you want to minimize the number of copies you do. You have these two parameters to play with, and there may be others as well.

Okay. If the FTL performs garbage collection in an atomic manner, meaning once it starts GC it does not service anything until GC is done, then it can delay a user request for a significantly long time. That was actually the case for traditional SSDs. Today we don't really observe that kind of dramatic slowdown: in the past, you would send requests to your SSD, your throughput would sit at some level, and at some point GC was invoked and performance dropped dramatically. With modern devices you don't see that dramatic drop, because vendors have worked a lot to reduce the impact of garbage collection, and one way is to make it non-atomic, the same suspension idea: you can suspend the GC, service some of the requests, and then continue with the GC.

There are several mitigation techniques. One of them is the TRIM command, which works at the OS level. Previously, when you deleted a file at the file system, the file system did not inform the SSD that the file was gone; the SSD would keep considering those pages valid and copy them around forever. With this command, developed for operating systems like Linux, the file system informs the FTL of the deletion or deallocation of a logical block, and the FTL can then treat those pages as invalid and skip copying them. (When we moved to SSDs, people initially kept using the techniques developed for hard disk drives, the I/O scheduling and all those mechanisms, but at some point they said: I have a new device with different characteristics, so I need new optimizations. A lot of OS research happened there, and TRIM is one of the techniques that came out of it.)

Another way is to do GC in the background: you exploit the SSD's idle time. The question then is how accurately you can assess idle time. One of our students has been working in that direction in his master's thesis and semester project, on how accurately we can assess idle time using learning techniques and the like, but it's not easy, especially if you consider that your SSD may sit in the cloud, in a storage system accessed by many applications. Another issue with background GC is premature GC: you may copy pages that would later have been invalidated by the host. You treated some pages as valid, but they could have been invalidated by a TRIM command if you had done the garbage collection at the right moment.

Another important technique is progressive GC: you divide the GC process into subtasks. For example, instead of copying 28 pages at once, you copy one page, then service user requests, then copy the next, and so on, 28 times. You are no longer doing garbage collection in an atomic way, and this is quite effective at decreasing the tail latency of garbage collection.
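A back-of-the-envelope view of why progressive GC helps the tail; my construction, using the per-copy cost from the earlier example:

```python
# Worst-case queueing behind GC, atomic vs. progressive (my construction).

T_COPY = 800   # us per valid-page copy (T_R + T_PROG from earlier)

def worst_case_wait_us(valid_pages: int, progressive: bool) -> int:
    """Longest time a user request can sit behind GC work."""
    if progressive:
        return T_COPY                 # at most one in-flight copy
    return valid_pages * T_COPY       # the whole atomic GC burst

print(worst_case_wait_us(28, progressive=False))  # 22400 us tail latency
print(worst_case_wait_us(28, progressive=True))   # 800 us
```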
Any questions? Okay, let's take a five-minute break, then we continue with fine-grained mapping at 3:33.

Okay, let's restart. We talked about mapping, and we said that in our example we assumed the same sizes: the logical block size is 4 kilobytes and the physical page size is also 4 kilobytes. But in reality we have an I/O size mismatch: in SSDs the page size is around 16 kilobytes for most devices. There are some performance-optimized SSDs that use smaller page sizes, like 2 or 4 kilobytes, but most SSDs we use today have 16-kilobyte pages. So now we want to see how we handle a small write request. Let's walk through an example of the inefficiency due to the erase-before-write property.

Here is a request: logical block address 4, size 1, direction write, and the data we want to write is A; and here is our logical-page-to-physical-page mapping. First you put the logical block address in binary. Since you know your page is 16 kilobytes, you can use the lowest two bits as the offset; they tell you which 4-kilobyte piece of the physical page you are writing. The remaining bits are the logical page number. Here the offset is 00, and the rest gives logical page number 1. So when we access the logical-to-physical mapping, we go to the entry for logical page address 1, and we write A at that offset within the page. We try to keep the offset the same: since the offset is zero, we also write at offset zero in the physical page, which keeps things simpler. You could write A somewhere else, at offset one for example, but then you would have to record that the original offset was zero and you stored it at offset one. And in the mapping entry you keep the physical page address, which is 0. So far so good, right?

Then we get another request: logical block address 1, size 2, so we want to write two pieces, B and C. Put it in binary: the logical page number is 0 and the offset is 01, and since you are writing two, you use offsets 01 and 10. You write them to the next physical page address, at the same offsets, and you update the mapping. Now, the question is: why write in the middle of the page? We already discussed that: to keep the 4-kilobyte offset. And why not use the unused space in physical page 0? One reason is that you have already allocated that space to logical block addresses 5 to 7: remember we mapped logical block address 4 to that physical page, and when you map 4 you are implicitly also mapping 5, 6, and 7 to the same physical page, because you have to keep all the other offsets there as well. Another reason is that those cells have most probably been programmed already, because of the program sequence order: you write A, then you write B and C, in order.
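The bit split used in that example, written as code; two offset bits because there are four 4-KiB sectors per 16-KiB page:

```python
# The offset/page-number split used above, for 16 KiB physical pages and
# 4 KiB logical blocks (so 4 sectors per page, i.e. 2 offset bits).

SECTORS_PER_PAGE = 4          # 16 KiB page / 4 KiB logical block
OFFSET_BITS = 2               # log2(4)

def split_lba(lba: int) -> tuple[int, int]:
    """Return (logical page number, offset within the page)."""
    return lba >> OFFSET_BITS, lba & (SECTORS_PER_PAGE - 1)

print(split_lba(4))   # (1, 0): LPN 1, offset 0 -- the 'A' write
print(split_lba(1))   # (0, 1): LPN 0, offset 1 -- the 'B' write
print(split_lba(7))   # (1, 3): LPN 1, offset 3 -- the 'D' write later
```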
Later on, you receive another write; we actually have it in the example: logical block address 7, and we want to write D. You can see this is the same situation: it has logical page address 1, with the last 4-kilobyte offset. So you might think of reusing the same mapping entry; ideally you would use that same physical page and just write D there. But that would be out-of-order programming: you programmed physical page address 0, then physical page address 1, and now you want to go back to physical page address 0 for this program, which can cause issues. That is the second reason I already mentioned, the program order constraint. There is also a third reason, data randomization: the cells in the unused space have in fact already been programmed. In principle you could ignore them and not program them, but most likely they have been programmed, because that's the way the flash controller works.

If you handle writes the way we were doing before, with the mapping table granularity per physical page, you can see we end up with space that is unused yet discarded. And I didn't say it yet, but this new write has to go to another physical page address, which means you also need to move A there: you read A, combine it with D, and write both to another physical page, and you update the mapping table. So small writes cause read-modify-writes: you have to read, modify, and then write, which costs program/erase cycles plus additional read operations, leading to performance and lifetime degradation.

The way we handle this is fine-grained mapping plus a page buffer. You add a page buffer, and you keep the mapping at fine granularity, at the logical page address level. When you get the request for logical block address 4, you keep the mapping directly at logical page address 4; you don't deal with any offset in the entry. Now each entry corresponds to 4 kilobytes, not 16 kilobytes; that's why we call it fine-grained mapping. You write the value A into the page buffer; you can update the physical page address now, or you can leave it and update it when you actually flush to the NAND flash memory. Then, for the second write, you again go to the page buffer and put B and C there, and for those two logical page addresses you update the mapping. Each of these fine-grained 4-kilobyte pieces has its own address; this one is slot 0, this one slot 1, so you don't use one address for all of them: you map each one to its own fine-grained physical address. And the third write you receive, for D, works the same way.
You go to logical page address 7 and map it to slot 3. Now that your page buffer is full, A, B, C, D, you just flush it to the NAND flash memory. That's how we handle these small writes: fine-grained mapping together with the page buffer.

Yes, say that again, sorry? You're saying that for D this is an update, the write request is going to the same address? But that request comes from the host, and it carries only a logical block address; we are the ones assigning the physical page address.

Okay, so fine-grained mapping significantly reduces the number of NAND flash operations: we had three writes plus one read in the previous example, and now we have only one write. Of course, it costs a larger mapping table: you now keep a mapping per 4 kilobytes, meaning four bytes per 4-kilobyte page, which is where the roughly 0.1% capacity overhead comes from. You should also consider the durability of written data: page buffers are implemented with volatile memory, SRAM or DRAM, so we need to make sure data is not lost on a sudden power-off, and that's why we need power-loss capacitors. (Are the air measurements okay in this room? I sometimes don't feel great in here; are you all feeling okay? Good.) But despite the non-negligible capacity and durability overheads of fine-grained mapping, it is used extensively in enterprise SSDs. Mobile SSDs might not use it, because they really need to keep cost low, but enterprise SSDs all use this fine-grained mapping.
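A compact sketch of that buffered, fine-grained scheme; the slot numbering and all names are mine:

```python
# Sketch of fine-grained (4 KiB) mapping with a 16 KiB page buffer, as in
# the A/B/C/D example above. Slot numbering and names are mine.

SLOTS = 4                      # 4 KiB slots per 16 KiB physical page
l2p = {}                       # logical 4 KiB page -> (physical page, slot)
page_buffer = []               # pending 4 KiB writes: (lpa, data)
next_ppa = 0

def nand_program(ppa, data):               # stand-in for the real program
    print(f"program PPA {ppa}: {data}")

def write_4k(lpa: int, data) -> None:
    global next_ppa
    page_buffer.append((lpa, data))
    if len(page_buffer) == SLOTS:          # buffer full: one NAND program
        for slot, (l, d) in enumerate(page_buffer):
            l2p[l] = (next_ppa, slot)      # each 4 KiB gets its own entry
        nand_program(next_ppa, [d for _, d in page_buffer])
        page_buffer.clear()
        next_ppa += 1

for lpa, d in [(4, "A"), (1, "B"), (2, "C"), (7, "D")]:
    write_4k(lpa, d)                       # -> one 16 KiB program, not four
```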
Okay, now let's quickly see how we can make sure we benefit from multi-plane operations; that's also important. This is a recap of multi-plane operation, which you already know. To perform as many multi-plane operations as possible, we need to flush a number-of-planes' worth of pages at once, after buffering them: we use the page buffer, wait for more writes, and once it is full we flush it to the NAND flash memory, and then we can benefit from multi-plane operations, and so on.

From this we get the definition of superblock-based block management: you take the blocks at the same position across all the planes and treat them together as one superblock, and you manage at superblock granularity, whenever you want to erase, do garbage collection, whatever. That superblock can even extend to other flash chips: you don't only consider one die, you can take several dies and several chips and make one big superblock and manage that, because if you manage a superblock nicely, you can benefit from the parallelism as much as possible and send requests in parallel.

Let's see what happens if we do garbage collection oblivious to multi-plane operation first, and then the aware version. This is what we already know: to reduce the performance overhead of garbage collection we use the greedy policy and select the block with the largest number of invalid pages. Here is one plane, and after some time these are the numbers of valid and invalid pages per block; this block has the most invalid pages, so we take it as the victim, copy its valid pages to block n-1, and then erase block 2 and make it ready for future writes.

Now consider two planes, applying the same policy independently in each. With the per-plane invalid counts, block 1 in plane 1 has the largest number of invalid pages, and block 2 in plane 0 has the largest number of invalid pages. Now we have the issue in a multi-plane system: the two victim blocks are not aligned, meaning you cannot really benefit from multi-plane operations. When you do the copies, you need single-plane reads here and a single-plane read there: three reads in one plane, one in the other. At block n-1 you can at least do a multi-plane write: you read two pages, put them in the page buffers, and do a multi-plane write into block n-1. But in addition to that, you need two more single-plane reads and two single-plane writes for block 2. So in total we did four page copies, and we could have done it with just two multi-plane reads and two multi-plane writes.

And that's not the only issue: at block n-1, the write point in plane 0 is now here, while the write point in plane 1 is there; they are misaligned, so you cannot benefit from multi-plane operations there in the future either, as long as you don't discard. There is a way: you can discard the two free pages in plane 1 so that the write points realign, and then you can do multi-plane operations in the future. Is that clear? Okay, good.

Now, with superblock-based management, we group the blocks with the same index, the same vertical position in different planes, and we select the superblock with the largest number of invalid pages. Here, superblock 1 has nine invalid pages in total, so that's our victim, and these are the valid pages you need to copy. You can see that block 1 has many valid pages in plane 0, but in plane 1 it has only one valid page; you can check the exact numbers in these equations at home if you want. For some of the copies you do single-plane reads, for some you can do multi-plane reads, but overall you benefit from many multi-plane operations, and at block n-1 you can do multi-plane writes. The pro is that you keep being able to perform multi-plane writes; the con is that, in this specific example, you need to do more read and write operations. So in the end it's not entirely clear which one is better. Question: do we move pages from plane 0 to plane 1? Actually, yes, you're right: you can move things from plane 0 to plane 1, that's true. And exactly: it's not clear which is better at every moment, but you should also consider future accesses: once you maintain superblocks, hopefully in the future you can service more multi-plane operations.
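The two victim-selection policies side by side; the invalid counts below are toy numbers of mine, chosen so the superblock victim has nine invalid pages in total, echoing the slide's example:

```python
# Greedy victim selection at superblock granularity, per the discussion
# above: invalid counts are summed across planes before choosing.

# invalid_pages[plane][block]: toy numbers, not taken from the slide.
invalid_pages = [
    [3, 2, 6, 1],   # plane 0
    [1, 7, 2, 0],   # plane 1
]

def per_plane_victims(counts):
    """Multiplane-oblivious: best block in each plane (may misalign)."""
    return [max(range(len(p)), key=p.__getitem__) for p in counts]

def superblock_victim(counts):
    """Superblock-based: maximize the summed invalid count per index."""
    blocks = range(len(counts[0]))
    return max(blocks, key=lambda b: sum(p[b] for p in counts))

print(per_plane_victims(invalid_pages))  # [2, 1] -> misaligned victims
print(superblock_victim(invalid_pages))  # 1 -> aligned across both planes
```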
As I said, this superblock can be built at the die level or the SSD level, and there are even papers that try to build superblocks that are not strictly vertical; there are many things going on in that line.

Okay, let's see. I'll explain I/O scheduling very quickly and leave the rest for tomorrow; I think I can finish I/O scheduling in six or seven minutes. We discussed the FTL's operations, and I/O scheduling is one of them. For this I'm going to use the FLIN paper that we published at ISCA 2018, on enabling fairness and enhancing performance in modern NVMe SSDs.

Modern SSDs use new storage protocols like NVMe that eliminate the OS software stack. I'll show it better in this picture; all this stuff, back end, front end, I won't go over, but this is essentially what NVMe is. SSDs initially adopted the conventional host-interface protocol, SATA, that we had been using for HDDs. There, each process has an I/O request queue in the OS software stack, the OS combines all those queues, schedules them somehow, and sends the requests to the SSD's hardware dispatch queue. That's what we have with SATA. But people realized this doesn't work well for SSDs: SSDs can provide very high throughput, and this software stack can limit that throughput. To overcome it, people came up with another protocol, NVMe, a high-performance host-interface protocol. The way it works is that the OS software stack is removed completely and the queues are mapped directly into the SSD device. Each process can have a queue inside the SSD, and each process can write to that queue using user-level code, not even OS-level code. The OS needs to be involved only for a process to initialize its queue in the SSD device; once the queue is initialized, the process writes to it from user level, and that makes things very fast. NVMe is implemented on PCI Express, which provides a lot of bandwidth, and NVMe is the way to benefit from that bandwidth. You could implement NVMe on other hardware protocols too; NVMe is a software protocol, not a hardware protocol, but to really benefit from it you want to implement it on PCI Express, or M.2 for mobile systems, for example. Is that clear?

Now you can see that when we map these queues into the SSD device and no one at the OS is doing the scheduling, we get fairness issues: in the past the OS was somehow handling fairness through its I/O scheduling, but now there is no I/O scheduling at the OS level; everything happens in the SSD. At the same time, if you can implement I/O scheduling inside the SSD, you have much more information: you know about the internal parallelism and the internal tasks of the SSD, so you can hopefully make much better decisions than before. So you can see the opportunities and the limitations.
When you move to NVMe, you no longer have any I/O scheduling at the OS level, which can cause issues, but you gain the opportunity to implement I/O scheduling inside the storage device, which can be even better than OS-level I/O scheduling techniques.

We did some experiments on real systems, and we defined metrics for slowdown and unfairness. The terminology is not actually ours: a flow is a series of I/O requests generated by an application. For the slowdown metric, we run a flow alone and measure its response time, and we also run the flow in a shared manner and measure its response time; the slowdown is the shared response time divided by the alone response time, and clearly lower is better. Unfairness is the maximum slowdown divided by the minimum slowdown, and again lower is better; fairness is the inverse of unfairness, so higher is better.

We ran experiments on two workloads, TPC-C and TPC-E, which are transactional database workloads. TPC-C doesn't experience much slowdown when it runs together with TPC-E, but TPC-E experiences a lot of slowdown, and as a result we get very low fairness. So our takeaway was that SSDs do not provide fairness among concurrently running flows. We discussed different reasons that can cause this issue; one source is different I/O intensities: the high intensity of one flow affects the average queueing time of flash transactions. I'm going to conclude here, since we also need to get to the seminar class, and rather than rushing it, let's continue from the sources of unfairness tomorrow. Any questions? Thank you; see you tomorrow.