Transcript for:
[Lecture 27] Advancements in DRAM and SSD Technologies

Hello, can you hear me? You can hear me okay? We're fine, I think. Should I get started? Nik can hear me? Okay, good. All right everyone, welcome to the last computer architecture lecture of this semester. I was warned to add that qualifier, because I was saying it's the last lecture and some people didn't like that. Anyway, we have five research paper presentations to cover today and no time to lose, so I'll get started with Self-Managing DRAM (SMD), a low-cost framework for enabling autonomous and efficient in-DRAM data maintenance operations. This was published at MICRO 2024, just about a month ago.

I'll start with a brief summary. The problem: implementing new in-DRAM maintenance operations, such as better RowHammer protection, implies changes in the DRAM interface and changes in other system components such as the memory controller. Modifying the interface, however, is quite an involved process that typically takes multiple years. Our goal is to ease and accelerate the process of implementing new, efficient in-DRAM maintenance operations. Our key idea is to propose a simple change in the DRAM interface such that the DRAM chip can reject memory accesses that target an under-maintenance region in DRAM. This single change in the interface allows DRAM designers to implement new maintenance mechanisms without any further changes to the interface or to any system components other than the DRAM chips themselves. We implement three in-DRAM maintenance mechanisms in SMD to show that SMD is useful and versatile, and our evaluations show that SMD enables high-performance and energy-efficient maintenance operations at small area cost. All our sources and data are open source at the link on the bottom.

This is the outline of the presentation. I'll start with some background and motivation. Let's see, at a very high level, what the DRAM interface looks like today. You probably already know this: a DRAM chip communicates with the memory controller over the DDRx interface, where the memory controller issues DRAM commands over the unidirectional command bus, and the memory controller and the DRAM chip exchange data over the data bus. Using this interface, the memory controller orchestrates all DRAM operations by sending commands to the DRAM chip, and the DRAM chip executes all DRAM commands without question. Thus, today's DRAM interface is completely controlled by one side.

One important class of DRAM operations are those performed by maintenance mechanisms. You already know that DRAM is subject to many error modes, such as the ones I show here, that necessitate maintenance mechanisms; these mechanisms maintain data integrity. One prominent example is periodic refresh, where the memory controller periodically issues refresh commands to prevent data-retention failures. As manufacturers scale storage density, these error modes cause memory error rates to increase, and the key point I want to make here is that continued process scaling necessitates new, efficient maintenance mechanisms. These mechanisms are defined as part of the DDR, well, DRAM specifications. JEDEC, composed of more than 300 companies, defines the standard, and these companies produce devices that obey the standard. The DRAM standard poses a barrier to entry for new, efficient maintenance mechanisms, because adding new mechanisms or modifying existing ones requires modifying the DRAM specifications and the other system components, such as the memory controller, that obey these specifications.
The key reason for this barrier to entry is that the DRAM interface is rigid: it is defined in a rigid manner. It is also not easy to release a new DRAM standard or specification, because doing so requires a multi-year effort by a JEDEC committee, and thus introducing new maintenance operations takes a long time. Here is just one example: the recently released DDR5 standard introduces three new maintenance techniques that improve bank-level parallelism and performance, and improve system robustness and error tolerance. Introducing these mechanisms was only possible after the eight-year gap between the DDR4 and DDR5 standards. If new maintenance mechanisms could be released more rapidly, these improvements could have arrived much earlier, or at a different time.

Let me reiterate the problem and the goal based on what I have explained so far. To summarize: introducing new maintenance operations or mechanisms takes a long time, and our goal is to ease and accelerate the process of implementing new, efficient in-DRAM maintenance operations.

Before I explain our solution approach, I'll quickly categorize DRAM operations into two classes, access and maintenance, so that we have well-defined terms. Access operations serve memory requests, typically relying on information available only to the memory controller; this can be a load instruction's address, or the data you are storing to the DRAM chip. Maintenance operations maintain DRAM data integrity, typically using information available only to the DRAM chip; that can be an internal refresh row counter, for example, that tracks which rows must be refreshed with the next refresh command. A key observation you can make here is that the DRAM chip has all the information it requires to maintain itself. And a DRAM chip should maintain itself for two other reasons. First, you could implement maintenance mechanisms more easily, and possibly faster, because doing so would not require changes in the DRAM interface. Second, it gives DRAM manufacturers what we call breathing room to perform architectural optimizations without exposing DRAM-internal proprietary information to other parties.

Our solution approach is to enable autonomous maintenance operations in DRAM chips. However, the DRAM interface definition is too rigid to accommodate autonomous in-DRAM maintenance operations. In particular, the DRAM interface is designed in a way that yields all control to the processor, so we're here on the left: the processor, or whatever is on the left side of this scale. Our goal is to make a simple, one-time change to the DRAM interface and enable autonomous maintenance operations. Another way to look at this is that we aim to give breathing room to DRAM chips so that they can perform their operations autonomously. You can, by the way, interrupt me to ask questions. Yes?

[Question from the audience.] That's a good question. The simplified version of the question is: why does it take so long to release a new standard? I'll go back to my slide about that. There are many things involved in the publication of a new standard; it's not just maintenance operations, I want to make that clear first. I don't know exactly which parts of producing a standard take up how much time, but the fact is that
publishing a new standard takes a lot of time. Or maybe not "too much", depending on who you are and what your understanding of a long time is, but it takes five to eight years in this example. Even if you could develop a new maintenance mechanism in, say, two months, you cannot release it in a widespread manner until a standard is released that incorporates your changes. Yes? [Comment from the audience.] Yes, the comment is basically that there is a lot of bureaucracy involved too, so it's not all politics: you cannot just propose a new thing and implement it the next month; you have to convince the 300-plus companies involved.

Okay, going back to where we left things. Now I can talk about Self-Managing DRAM itself. In our work, again, we introduce autonomous in-DRAM maintenance operations. This means the DRAM chip has full control over the maintenance operations it performs, which allows us to add new mechanisms inside the DRAM chip without modifying the standard and without exposing DRAM-internal proprietary information. One problem that autonomous in-DRAM operations introduce is what we call access-maintenance conflicts, which happen when the memory controller accesses a region in memory that is undergoing maintenance. Our solution to this problem is to make a simple change in the DRAM interface that allows the DRAM chip to reject access commands, specifically activate commands, that target under-maintenance regions. In summary, SMD enables autonomous in-DRAM maintenance operations with a simple change in the interface, and it divides the work nicely between the memory controller and the DRAM chip, such that you can add new maintenance mechanisms without modifying the DRAM standard.

Now we will take a deeper look at SMD, starting with SMD's bank organization. To better understand it, we'll look at a high-level depiction of a DRAM chip. The audience in the room knows these basics, but those of you on YouTube, bear with me while I go over them again. A DRAM chip contains multiple banks that share a common interface, and inside a bank there are multiple subarrays, each of which contains multiple rows. Here is what a DRAM bank with SMD looks like. We introduce the concept of lock regions, which define the granularity of maintenance operations. The lock-region bit vector indicates the regions that are currently undergoing maintenance, and the lock controller orchestrates the locking and releasing of regions. With support for SMD in a DRAM bank, we lock a region before starting maintenance operations in that region, so that an activate command targeting that region can be rejected.

Let's see now exactly how we lock and release these regions. In summary, there are four key steps: first, a maintenance operation locks the region it targets; second, the memory controller can access regions that are not locked; third, if the memory controller accesses a locked region, it receives a negative acknowledgement, or NACK for short; and fourth, the maintenance operation releases the lock at the end. We'll take a deeper look, starting with the first step. We have our lock regions on the right (those are DRAM rows, essentially), the controller on the left, and the lock-region bit vector on the bottom. All regions here are currently in the not-locked state, so the bit vector stores all zeros in the initial state. And to preserve our collective sanity, I
will refer to lock regions simply as regions, to reduce the number of times I have to say "lock" in the next few slides. The maintenance operation attempts to lock region zero, and region zero is not locked right now. In this case, the lock controller acquires region zero by setting the corresponding bit in the bit vector, and the maintenance operation can now proceed in region zero. If the memory controller wants to access a different, not-locked region, such as region one here in yellow, it checks the corresponding bit in the bit vector; because region one is not locked, the access can proceed. Had the memory controller tried to access region zero, it would receive a negative acknowledgement, because that region is currently locked. We will discuss how the memory controller deals with negative acknowledgements later in the slides. Once the maintenance operation ends, the lock controller releases the lock by setting the corresponding bit in the bit vector back to zero. It's very simple, and that's how SMD locks and releases regions. Any questions? Yes?

[Question: "Maybe I'm not understanding, can you describe an example, say with bank zero?"] Okay, let me tell you one thing: we reject only activate commands. Does that clear things up? Basically, if you're getting a rejection as the memory controller, it means the bank is precharged; there is no open row from your perspective as the memory controller. You send an activate and you get rejected. No command besides an activate is rejected, and if you sent an activate command, that means the bank was already precharged, with no open row in the bank. Does that clear it up? [Follow-up about which bank is currently open.] Yes, but the memory controller already knows which row in which bank is open.

[Another question.] That's a good one. The question is: why do the lock regions span multiple subarrays? First, they don't have to; it depends on your architecture. We assume an open-bitline architecture, where a sense amplifier has one bitline going to one subarray and the other bitline going to the neighboring subarray, like two terminals. There, you cannot really have one subarray under maintenance while the next subarray is being accessed by the memory controller. So you want groups of at least three subarrays: when you lock one subarray, you want its immediate neighbors to also not be accessible. In a different architecture, you might have a granularity as fine as one subarray per lock region. We also sweep the lock region size, and I'll show you some performance numbers where you can see the effect.

Cool. Now we will look at how we control an SMD chip; again, four steps. Before I go into the individual steps: all access operations work like they would with a regular memory controller in today's DRAM interface, with the exception that activate commands may get rejected. The memory controller needs to retry these rejected commands, and in the meantime it can attempt to access other lock regions that are not undergoing maintenance. The SMD chip and the memory controller together ensure that no memory request is rejected forever. We can discuss that later; I don't remember if I have slides on it, but let me know if you have questions about forward progress at the end of the talk. Now, let's look at how SMD handles rejections.
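To make the four-step walkthrough above concrete, here is a minimal C++ sketch of the lock-region bit vector and lock controller logic. The class, method names, and region count are illustrative assumptions for exposition, not the paper's actual implementation; the real mechanism is a hardware circuit whose decision logic this merely mirrors.

    #include <bitset>
    #include <cstddef>
    #include <iostream>

    constexpr std::size_t kNumRegions = 8;   // lock regions per bank (illustrative)

    class LockController {
      std::bitset<kNumRegions> locked_;      // lock-region bit vector, all zeros initially
     public:
      // Step 1: a maintenance operation locks the region it targets.
      bool acquire(std::size_t region) {
        if (locked_.test(region)) return false;  // region already under maintenance
        locked_.set(region);
        return true;
      }
      // Steps 2 and 3: an activate to an unlocked region proceeds (ACK);
      // an activate to a locked region is rejected (NACK).
      bool activate(std::size_t region) const { return !locked_.test(region); }
      // Step 4: the maintenance operation releases the lock when it finishes.
      void release(std::size_t region) { locked_.reset(region); }
    };

    int main() {
      LockController lc;
      lc.acquire(0);                                          // maintenance locks region 0
      std::cout << (lc.activate(1) ? "ACK" : "NACK") << '\n'; // region 1 is free: ACK
      std::cout << (lc.activate(0) ? "ACK" : "NACK") << '\n'; // region 0 is locked: NACK
      lc.release(0);                                          // maintenance done
      std::cout << (lc.activate(0) ? "ACK" : "NACK") << '\n'; // region 0 accessible: ACK
    }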
First, a quick recap of the activate command, which you already know. We have the DRAM rows; if you want to access a row, you send an activate command to that row. Its function is simply to ready the DRAM row for access. An activate command is typically followed by read and write commands. The memory controller obeys what we call timing parameters, which impose a minimum delay between two commands: for example, there is a timing parameter between activate and read, and another between read and write.

Now let's go back to the activate-command rejections that happen when we have an access-maintenance conflict. This is an example command sequence targeting region zero that leads to such a conflict. What does the memory controller do? It has to retry the rejected activate command sometime in the near future, ideally as soon as possible. The memory controller is already used to working with timing parameters, so we simply introduce a new one, called the retry interval (RI), after which the memory controller retries the rejected activate command. We determine this empirically: 62.5 ns in most of our experiments. While the memory controller waits for the retry interval to pass, it can issue an activate command to another lock region, overlapping the latency of maintenance in one lock region with an access in another. This design component of SMD, which enables maintenance-access parallelization, builds on the basic design proposed in prior work, specifically the SALP (subarray-level parallelism) paper; I don't know if we covered it in the lecture, but probably we did, and you can look at that paper for more details.

Normally I would discuss forward progress here, but I don't have many slides on that. Do you have any questions about forward progress? I can give you the basics. Yes?

[Question.] That's a good one. The question is: when you send an activate and get rejected, how long does it take for you to realize you were rejected? We have an analysis for this in the paper; it is less than 5 ns, we argue. The way we argue it: we look at the latency of data appearing on the data bus right after you send the read command. There's a timing parameter for that in the standard; it should be 2.5 ns, if I remember correctly. We consider that to be the one-way delay from the deepest end of your DRAM chip to the I/O buffer side, and you multiply it by two, because in the worst case the signal has to make one trip deep into the DRAM chip and come back. So this is a sort of worst-case analysis; the latency could be much smaller than 5 ns. The other components in the critical path of this latency are very small, with access latencies on the order of hundredths of a nanosecond, so anything you need to add on that path is very easily accessed. It comes to 5 ns in the worst case.

[Another question.] Sure. You're thinking about the same bank, yes, but a NACK wouldn't affect an activate command going to a different bank anyway.
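As a sanity check on the numbers just discussed, here is the same worst-case arithmetic and retry rule in a small C++ snippet. The 2.5 ns one-way figure and the 62.5 ns retry interval come from the talk; the example issue time is an illustrative assumption.

    #include <iostream>

    int main() {
      // Worst-case NACK detection latency: the one-way path delay from the
      // deepest part of the chip to the I/O side (~2.5 ns, per the standard's
      // read-data timing), taken twice for the round trip.
      const double one_way_ns = 2.5;
      const double nack_latency_ns = 2.0 * one_way_ns;
      std::cout << "worst-case NACK detection: " << nack_latency_ns << " ns\n";

      // Retry rule: a rejected activate is retried one retry interval (RI)
      // later; other, unlocked regions can be activated in the meantime.
      const double retry_interval_ns = 62.5;  // RI used in most SMD experiments
      const double rejected_at_ns = 100.0;    // example issue time of the NACKed ACT
      std::cout << "retry the ACT at t = " << (rejected_at_ns + retry_interval_ns)
                << " ns\n";
    }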
You had another question? Yes, that's a very good one. The question is that software programmers might want predictable performance, and this seems to introduce unpredictability into the DRAM interface. That's correct, and it's one of the main reasons this paper was rejected five or six times. There are two ways to argue about it, I think. First, I will acknowledge that it is correct, for this design at least. But you can have a different design for SMD. The key idea is to really enable autonomous operations inside the DRAM chip, and the way we enabled that makes the DRAM interface a bit unpredictable. However, you can do what today's standards do for refresh, for example: you can say you're only allowed to reject commands in a certain window of time. Say you have a fixed duration of 10 seconds; in 5 of those seconds it's the regular DRAM interface and nothing gets rejected, and in the remaining 5 seconds you can send activate commands, some of which might be rejected, but you can still parallelize things. Something like this already exists in today's interface. And you cannot always predict performance anyway, because you have many other components in your system. If you're designing for predictability, that's a different problem, and SMD can also be designed predictably; but I don't think SMD, the way we designed it, makes the predictability issues of today's systems much worse. You know about caches and all the prediction machinery going on; how do you predict all of that?

[Follow-up: but refreshes are sent periodically, so software could know when they happen.] Yes, you're right: refreshes are sent periodically today, so if you wanted to, you could figure out from software when a refresh is going to happen, because they come at fixed intervals. But at the hardware level, the difference is that the controller knows when it performs that refresh, and it can even delay those refreshes; with SMD, the way we implement it, the memory controller doesn't know whether it's going to get rejected after sending an activate command. Actually, the main implications in this direction are for scheduling techniques. In the end, we don't have a template for a state-of-the-art memory controller scheduler design; we don't have Verilog code for it, and obviously the companies are not going to publish theirs. What could happen, I think (I'm brainstorming a bit right now) is this: say the memory controller receives requests into some buffer, processes them, and creates a schedule of DRAM commands. There's already a created schedule for a set of commands; now, when one of the activates gets rejected, you might actually destroy that schedule, and then how do you efficiently rebuild it? That could be a good direction that builds on SMD, but it's a low-level issue, I think, and it depends on a lot of assumptions you make about your memory controller design.

Okay, I'll continue. No questions about forward progress? Good. Now I'll describe how we use SMD to implement efficient in-DRAM maintenance mechanisms. We describe and evaluate three use cases in the paper to demonstrate SMD's versatility: fixed-rate refresh, deterministic RowHammer protection, and memory scrubbing. The extended version of the conference paper, at this link, evaluates two additional maintenance mechanisms, variable-rate refresh and probabilistic RowHammer protection, and we discuss how SMD can be beneficial for many other applications. We also discuss all these mechanisms in the conference paper, but we don't evaluate them all there. I'll briefly talk about fixed-rate refresh; for the other mechanisms, I'll refer you to the paper. You know why we do refresh: DRAM stores data in leaky capacitors, and we periodically restore the lost
charge by issuing refresh commands from the memory controller. This method of performing refresh has two drawbacks. First, you issue commands to perform the operation, which induces energy overheads and creates contention on the memory bus. Second, during refresh, an entire chip or bank is inaccessible. SMD gets rid of the first drawback by autonomously performing refresh inside the DRAM chip, and of the second, to some extent, by allowing access to lock regions that are not undergoing maintenance. This is what the state machine describing SMD fixed-rate refresh looks like, but I won't cover it in detail; you can look at the paper for more details on the state machine and on these mechanisms. If you have a burning question about the use cases, I can answer it now. Yes?

[Question about memory-controller-based RowHammer mitigation.] That's a good one. You can still choose to implement memory-controller-based mitigation: SMD does not prevent you from having memory-controller-based mitigation techniques, I would say, though that's an opinion. I think it's orthogonal to RowHammer protection, but it enables efficient in-DRAM RowHammer defense mechanisms. You have something to say? No?

So now I will show you the results. We'll start with SMD's hardware implementation and overhead. We require one modification in the DRAM interface, and that's the addition of (sorry, my clicker is not working; interesting, my keyboard is also not working; okay, we'll have to deal with what we have) the negative acknowledgements. What's that? Self-managing? Yes, it's a poor implementation of SMD. Anyway, we describe two options for adding this. The first is to use a signal called ALERT_n, which already exists in the DDR4 and DDR5 interfaces: you add additional meanings to a signal that already runs in the DRAM-to-memory-controller direction, which comes at a small additional hardware cost. The second option is to add a new pin for each rank of DRAM chips, which comes to about 1.6% of processor pin count for a modern many-channel server system. Again, we do this once, to hopefully prevent future modifications to the DRAM interface when implementing new maintenance mechanisms. This is the summary of the DRAM chip modifications introduced by SMD; they are very small. The memory controller also needs some additional storage to track regions that are undergoing maintenance, so that it won't repeatedly try to activate the same region before the retry interval passes.

We use Ramulator to evaluate the performance and energy efficiency of SMD, using this system configuration; you can check the repo for more details. We configure SMD with 60 lock regions in a DRAM bank and 16 subarrays in each lock region, with a retry interval of 62.5 ns. We use 62 single-core and 64 four-core workloads from five major benchmark suites, and we evaluate six configurations. The first is the baseline DDR4 system, which refreshes every 32 milliseconds. Then we have SMD-based fixed-rate refresh; SMD-based deterministic RowHammer protection, which is like SMD with Graphene, if you remember from older lectures; SMD-based memory scrubbing; and SMD-Combined, which combines all three mechanisms. We also have the hypothetical no-refresh configuration that doesn't do any maintenance at all, which in one sense is the best you can have. This plot shows the average speedup, that is, performance, across all single-core workloads
over the baseline system, so one is the baseline, and the higher the bar, the better the performance. We make two observations here: first, all SMD configurations provide substantial speedups, and second, SMD-Combined provides an average speedup comparable to no-refresh, which does not do any maintenance operations. For multicore performance, this plot shows performance in the same terms for four-core workloads. The x-axis shows three classes of workloads, categorized according to their memory intensity, measured as MPKI (misses per kilo-instruction). Let me draw the full plot for you. This is the full plot, and we observe that SMD provides higher speedups as the memory intensity of the workload mix increases. We also show the average DRAM energy across all tested workloads, normalized to the baseline; here the lower the bar the better, and the x-axis has the same mixes. We observe that all SMD configurations provide substantial energy savings, and we find that SMD-Combined provides more than half the energy savings of no-refresh.

To summarize: SMD enables high-performance and energy-efficient maintenance mechanisms and provides benefits comparable to a hypothetical system that does no maintenance operations at all. We attribute this to SMD being able to overlap the latency of a maintenance operation with a useful access operation, and to reduced command bus interference and energy use, since the memory controller does not issue maintenance commands.

There is much more about SMD in the paper that I will not cover in this talk. I already mentioned the first two items. We also evaluate (oh, I didn't talk about this one) a different policy that SMD can apply: when a lock region undergoing maintenance receives an activate command, you can choose to stop maintaining that region immediately and serve the activate command as soon as possible. We evaluate that and show it can improve performance, but by a small amount. We discuss a predictable SMD interface in the paper, which you might be interested in, and we have a sensitivity analysis, which shows that performance improves with an increasing number of lock regions: the smaller your lock regions get, the more opportunity you have to parallelize commands and overlap the latencies of maintenance and access. The benefits increase with decreasing refresh period, because the baseline ends up spending more time refreshing, and the benefits are similar across one-, two-, four-, and eight-core workloads. We also evaluate SMD-based scrubbing against memory-controller-based scrubbing, and we see that SMD induces eight times less overhead at a very high scrubbing rate, because you don't transmit any data over the DDR interface: you can do all the ECC checking inside the DRAM chip.

So, this is the paper. I'll conclude very quickly. We've seen that new maintenance mechanisms require changes to DRAM standards. With SMD, with a simple, single modification to the DRAM interface, we enable autonomous in-DRAM maintenance operations; this way, you can implement new maintenance operations without further changes to the DRAM interface. We showcase three high-performance and energy-efficient use cases for SMD, and our hope is that, moving forward, SMD enables practical adoption of innovative ideas in DRAM design and inspires better ways of partitioning work between the memory controller and
the DRAM chip. An extended version is on arXiv, we have everything open sourced, and that's all for this paper. Do you have any other questions? Yes?

[Question about where the benefits come from.] That's one component, but it's the small component. The biggest performance benefit comes from the fact that you can parallelize accesses with maintenance operations. With SMD, the memory controller does not issue maintenance commands, but the maintenance still has to be done, and it takes time inside the DRAM chip. It's just that by using the idea of subarray-level parallelism, while you're doing maintenance in one part of your bank, you can access the other part. That helps a lot, because with a high-memory-intensity workload, it turns out the workload is not accessing the same subarray all the time.

[Question.] The question is whether the workload mixes include a RowHammer attack pattern. No, we didn't evaluate that. [Follow-up about performance under attack.] Yes, we don't have that; you're right. In the arXiv version, at least, we have results for different thresholds, both more aggressive and less aggressive.

[Question about PRAC.] Sure. So, we've been submitting this work for two or three years now. Yes, the new update to DDR5 introduces PRAC; we have a footnote to cover this concern specifically. It's concurrent work at best. And the fact that PRAC was implemented, I think, unfortunately helped our chances. This is what I believe; this is not science at this point. I think it helped SMD get into this conference, because we could respond to some reviews with "the industry is already going in this direction". We could have published this two years ago, but the process didn't let us, and industry implemented the idea in the meantime; with that, I think, we got in more easily. I don't know if that's really the case, but basically my answer is: it's concurrent work. The idea of using ALERT_n is much earlier, though; it was first introduced in Panopticon in 2019. There it was clear that if you implement something like PRAC, you need extra time to do refresh operations, that extra time needs to come from somewhere, and since the interface is defined in a rigid way, you cannot ask for it. So they proposed to modify the meaning of the ALERT_n signal to ask for more time, for RowHammer specifically. SMD makes this a bit more general: you can use it for other purposes, and it doesn't have to be the ALERT_n signal; you can add a new pin. Do you have other questions? No? Okay, then I'm done, I think. We'll move to the next one real quickly, unless we have YouTube questions. Okay.

Where are my downloads... Is my screen shared? Okay, everything is good, right? Hello everyone, we continue with the next paper presentation, BreakHammer. We also presented this paper last month at MICRO. I had a brief self-introduction prepared, but I'll just breeze through it: I'm a master's student at TOBB ETÜ and a visiting master's student here; my research interests are in computer architecture and memory systems, and the best way to reach me is via email. Today I'm going to present, as I said, BreakHammer: enhancing RowHammer mitigations by carefully throttling suspect threads.

Here's an executive summary of my talk. Let me make the screen bigger, okay. The problem is that DRAM continues to become more vulnerable to RowHammer, and the operations that prevent RowHammer, which we call RowHammer-
preventive actions, are time-consuming and block access to main memory. (I want to use this pointer... oh, okay, let me fix it.) We introduce a new type of exploit where one can mount a memory performance attack by intentionally triggering RowHammer-preventive actions to block access to main memory for extended periods of time. Our goal in this work is to reduce the performance overhead of RowHammer mitigation mechanisms by carefully throttling the triggering of these preventive actions, without compromising system robustness. Our key idea is to throttle threads that frequently trigger RowHammer solutions. To this end, we propose BreakHammer, which observes triggered RowHammer-preventive actions, identifies threads that trigger many of these actions (we call such threads suspect threads), and reduces the memory bandwidth usage of suspect threads. Our evaluations show that BreakHammer significantly reduces the negative effects of RowHammer mitigation mechanisms on performance, energy, and fairness. BreakHammer, and all the data we used to evaluate it, are open source and available through the repository.

Here's the outline of my talk. As in the previous talk, if you have any questions, you can interrupt me at any point.

Here I show a typical computing system with a processor and a DRAM module connected through the DRAM channel. A DRAM module has several DRAM chips, and inside a DRAM chip we have multiple DRAM banks. A DRAM bank contains a two-dimensional array of DRAM cells, each of which stores a single bit of information as electrical charge. DRAM cells are horizontally connected via wordlines and organized as DRAM rows, and they are vertically connected via bitlines to sense amplifiers. DRAM is accessed at row granularity, and today this access is not perfect: it is vulnerable to read disturbance. Here I show a prime example of read disturbance, called RowHammer: repeatedly opening and closing a DRAM row induces bit flips in nearby DRAM cells. In RowHammer terminology, we refer to the repeatedly activated row as the aggressor row, and we refer to the nearby rows that observe bit flips as the victim rows. We also denote the minimum number of activations necessary to induce a bit flip as the RowHammer threshold, or N_RH. Read disturbance is becoming a bigger problem with technology scaling: in the last ten years, we have seen more than two orders of magnitude decrease in the number of activations necessary to induce the first bit flip. Therefore, it is critical to effectively and efficiently prevent these bit flips.

Many ways have been proposed by industry and academia to prevent RowHammer via RowHammer-preventive actions; here I show a non-exhaustive list. In this talk, we will focus on these two, as the state-of-the-art RowHammer mitigation mechanisms adopt these two approaches. The first is preventive refresh, where we preventively refresh the charge of victim rows before their aggressors have been activated enough times to induce a bit flip. The second is row migration, where we move, or migrate, the contents of an aggressor row to a distant row, away from its victims, before it can induce any bit flips.

I'll continue with the motivation of our work. RowHammer-preventive actions are blocking and time-consuming operations: the memory controller cannot access a memory bank under a preventive action. As such, refreshing kilobytes of data can block access to gigabytes of data. We evaluate the performance impact
of blocking preventive actions by evaluating four state-of-the-art RowHammer solutions. Here the x-axis shows the RowHammer threshold, the y-axis shows the weighted speedup normalized to a system with no RowHammer defense, and each plot identifies a different RowHammer mitigation mechanism. Among these mechanisms, the first three, namely Hydra, RFM, and PARA, use preventive refresh as the RowHammer-preventive action, and AQUA uses row migration. From our results, we see that RowHammer mitigation mechanisms incur increasingly large performance overheads as the RowHammer threshold decreases. Even worse, an attacker can intentionally trigger many of these actions to block access to main memory; therefore, preventive actions can be exploited to reduce bandwidth availability.

To reiterate our problem and goal: operations that prevent RowHammer lead to DRAM bandwidth availability issues, as they can frequently block access to main memory. Our goal is to reduce the performance overhead of RowHammer mitigation mechanisms by reducing the number of performed RowHammer-preventive actions, without compromising system robustness. To this end, we propose a new mechanism called BreakHammer. BreakHammer's key idea is to detect and slow down the memory accesses of threads that trigger many RowHammer-preventive actions; in our study, we refer to such threads as suspect threads.

Here I show an overview of BreakHammer. BreakHammer divides the execution timeline into windows of equal length, which we call throttling windows, and inside each throttling window it observes the execution timeline as a series of activations from different threads and RowHammer-preventive actions from RowHammer solutions. BreakHammer has three key stages. First, BreakHammer observes RowHammer-preventive actions and attributes what we call a RowHammer-preventive-action score to each hardware thread. Second, BreakHammer analyzes these scores and identifies suspect threads that unfairly reduce DRAM bandwidth availability. Third, BreakHammer throttles the memory bandwidth usage of these suspect threads.

Let's start with BreakHammer's first operation: observing RowHammer-preventive actions. BreakHammer tracks the number of RowHammer-preventive actions each thread triggers; to do so, it implements a RowHammer-preventive-action score counter. One cannot directly attribute a preventive action to a single thread, because a RowHammer-preventive action is generally caused by a stream of memory requests from many hardware threads. On the bottom, I show an execution timeline where multiple accesses from different threads result in a RowHammer-preventive action. When a preventive action is triggered, BreakHammer partially attributes to each hardware thread a score that is proportional to the number of activations that thread performed before the given action.

I will demonstrate how we can pair BreakHammer with two existing solutions. The first is probabilistic adjacent-row activation, or PARA. I'll briefly describe how PARA works: whenever the memory controller activates a DRAM row, PARA generates a random number and compares it with a threshold; if the generated random number exceeds the threshold, PARA performs a preventive refresh on a neighbor of the activated row. When we pair BreakHammer with PARA, BreakHammer tracks the row activation count of each hardware thread between RowHammer-preventive refreshes, and when a preventive refresh is observed, it attributes to each thread a score proportional to the number of activations they performed.
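Here is a minimal C++ sketch of the PARA pairing just described, assuming simple global per-thread counters. The refresh probability, thread count, and random-number generator are illustrative assumptions, not the paper's code.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <random>

    constexpr std::size_t kThreads = 4;
    std::array<std::uint64_t, kThreads> acts_since_refresh{};  // per-thread ACT counts
    std::array<double, kThreads> score{};                      // preventive-action scores

    std::mt19937 rng{42};
    std::bernoulli_distribution para_fires{0.001};  // PARA refresh probability (assumed)

    void on_activate(std::size_t thread) {
      ++acts_since_refresh[thread];
      if (para_fires(rng)) {  // PARA triggers a preventive refresh on this ACT
        std::uint64_t total = 0;
        for (auto a : acts_since_refresh) total += a;
        // Attribute the action to each thread, proportionally to its activations.
        for (std::size_t t = 0; t < kThreads; ++t)
          score[t] += static_cast<double>(acts_since_refresh[t])
                      / static_cast<double>(total);
        acts_since_refresh.fill(0);  // start counting toward the next refresh
      }
    }

    int main() {
      for (int i = 0; i < 100000; ++i) on_activate(i % 2);  // threads 0 and 1 hammer
      for (std::size_t t = 0; t < kThreads; ++t)
        std::cout << "thread " << t << " score: " << score[t] << '\n';
    }

Running this, threads 0 and 1 accumulate roughly equal scores while the idle threads 2 and 3 stay at zero, which is exactly the signal the next step uses.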
Our second mechanism is the industry solution to RowHammer, which we call Per Row Activation Counting, or PRAC for short. Again, I'll briefly describe how PRAC works. PRAC maintains an activation counter per DRAM row inside the DRAM chip, and whenever one of these counters reaches a critical value, for example close to the RowHammer threshold, it triggers what's called a back-off to request time from the memory controller for refreshes. Once this signal is received, the memory controller provides time by sending specialized in-DRAM refresh commands to the DRAM module. Similarly to PARA, when we combine BreakHammer with PRAC, BreakHammer tracks the activations of each thread between back-offs, and once a back-off is observed, it increments each thread's score proportionally to the number of activations they performed. I think this conveys the key idea of how BreakHammer is combined with existing solutions; I refer you to our paper for more implementations, where we pair BreakHammer with eight state-of-the-art RowHammer solutions.

The second step of BreakHammer is identifying suspect threads. BreakHammer detects threads that trigger too many RowHammer-preventive actions. Here on the left, I show a table of the RowHammer-preventive-action scores of the threads in a system; we can visualize these as bars on the right, where each bar depicts the score of one thread. BreakHammer performs two checks to detect an outlier: first, it has a safeguard, where it marks threads with relatively low preventive-action scores as safe; and second, BreakHammer checks whether any thread's score deviates largely from the average score of all threads.

The third step of BreakHammer is throttling memory bandwidth usage. BreakHammer reduces the memory bandwidth usage of suspect threads by limiting the number of cache miss buffers the thread can allocate in the last-level cache. Cache miss buffers are used to track requests that miss in the last-level cache and are currently being served by main memory. When the system boots, none of the threads have any allocated cache miss buffers, and if they need to, they can allocate all of the cache miss buffers in the system. As the threads execute, they start making requests to main memory and start using these cache miss buffers. When BreakHammer detects a thread as a suspect, it limits the number of cache miss buffers the thread can allocate; if a thread reaches its allocation quota, it is no longer allowed to make new memory requests that miss in the last-level cache, and it must wait for its earlier requests to main memory to be resolved. BreakHammer restores the memory bandwidth usage if a thread stays benign for the full duration of a throttling window. Here I show an example on the bottom part of the slide, where our thread is identified as a suspect in the first throttling window and we throttle its memory bandwidth. In the next throttling window, the thread stays benign and does not get detected; BreakHammer still throttles this thread for this window, but at the end of it, because the thread stayed benign, its memory bandwidth usage is restored, and the thread continues execution without any memory bandwidth quota.
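Here is a minimal C++ sketch of the two-check suspect detection described above. The safeguard and deviation thresholds are illustrative assumptions, and the printed action stands in for limiting the thread's cache-miss-buffer quota.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    std::vector<bool> find_suspects(const std::vector<double>& score) {
      const double safeguard = 10.0;  // scores below this are always safe (assumed)
      const double deviation = 2.0;   // suspect if more than 2x the mean (assumed)
      double mean = 0.0;
      for (double s : score) mean += s;
      mean /= static_cast<double>(score.size());

      std::vector<bool> suspect(score.size(), false);
      for (std::size_t t = 0; t < score.size(); ++t)
        // Check 1: safeguard for low-score threads; check 2: outlier vs. the mean.
        suspect[t] = (score[t] >= safeguard) && (score[t] > deviation * mean);
      return suspect;
    }

    int main() {
      // 15 benign threads and one thread triggering many preventive actions.
      std::vector<double> scores(16, 1.0);
      scores[7] = 120.0;
      auto suspects = find_suspects(scores);
      for (std::size_t t = 0; t < suspects.size(); ++t)
        if (suspects[t])
          std::cout << "thread " << t << " is a suspect: limit its cache miss buffers\n";
    }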
Now I'll continue with the evaluation of BreakHammer. Yes? [Question.] Yes, you're saying benign applications can be identified as suspects, right? That is a possibility: if a thread is an outlier in a given system (say there are 16 threads, 15 of them are basically not triggering RowHammer-preventive actions, and one thread is triggering many of them), that thread will be identified as a suspect. It can happen; I don't have those results in the slides, I guess, but we discuss this in the paper. Also, BreakHammer tracks at a very coarse granularity, I think either per channel or per processor chip, so you can go for more fine-grained tracking and have more data.

[Question about OS-level effects.] In the paper we don't evaluate that, because Ramulator doesn't model it; I'll describe our methodology in a moment, but our evaluation is trace-based and doesn't include an operating system. You can have those types of attacks, but it comes down to how you associate those operations with a thread. For example, you mentioned coherence-protocol evictions: if you define a way to say, for example, that someone making a lot of evictions is victimizing someone else's cache lines, you can make it so that BreakHammer scores that thread. I think you get the key idea: if you can associate an operation with a thread, BreakHammer can score it, because this information is available in the processor chip.

I'll continue with the evaluation. We evaluate the performance and energy of BreakHammer using Ramulator 2.0 integrated with DRAMPower. We evaluate a realistic four-core system with a DDR5 module that has two ranks and 64 banks in total. As I mentioned earlier, we integrate BreakHammer with eight state-of-the-art RowHammer mitigation mechanisms and compare the BreakHammer-paired versions of these mechanisms to their baseline versions. We evaluate four-core workload mixes from SPEC CPU2006, SPEC CPU2017, TPC, MediaBench, and YCSB. We create two types of workloads: 90 mixes where a malicious core mounts a memory performance attack by triggering many RowHammer-preventive actions, and 90 other mixes where all applications are benign.

I'm going to cover the under-attack results in a bit more detail now. We first investigate BreakHammer's effect on preventive-action count as the RowHammer threshold decreases. Here the x-axis is the decreasing RowHammer threshold, and the y-axis is the preventive-action count of a mechanism, normalized to its baseline version at a threshold of 4K. The blue line shows the baseline mechanism with no BreakHammer, and the orange line shows the mechanism with BreakHammer. As an example of how to read this figure: this point on the blue line means that as the RowHammer threshold decreases from 4K to 128, PRAC performs 34 times as many back-offs; and when we combine PRAC with BreakHammer at this threshold, we see an 80% reduction in the number of back-offs performed. Looking at all of our results, we see that BreakHammer significantly reduces the number of preventive actions performed across all mitigation mechanisms.

We then evaluate the effect of the reduced preventive-action count on memory latency. Here the x-axis is the memory latency percentile, and the y-axis is the memory latency of benign applications in ns. The blue line is a system without BreakHammer, and the orange line is a system with BreakHammer; we also have a dashed black line that shows the system with no RowHammer defense. As another example of how to read this figure: this point on the blue line means that 50% of requests are served within 200 ns when the system has PRAC. Across all of our results, we see that BreakHammer significantly reduces memory latency across all mechanisms. Next, we evaluate the impact of reduced memory latency on system performance. Here, the
x-axis is the decreasing RowHammer threshold, the y-axis is the weighted speedup of benign applications normalized to a system with no RowHammer defense, and each bar identifies a mechanism with or without BreakHammer. From our PRAC results, we see that BreakHammer significantly increases system performance, and looking across all mechanisms, we make two key observations: first, as the RowHammer threshold decreases, RowHammer mitigation mechanisms incur increasingly large performance overheads; and second, combining them with BreakHammer significantly increases system performance when there is an attack.

We evaluate DRAM energy as well. The x-axis is the decreasing RowHammer threshold, the y-axis is the DRAM energy normalized to a baseline with no defense, and the bars show systems with and without BreakHammer. From our PRAC results, similar to our performance results, BreakHammer brings significant energy improvements; and across all of our results, as the thresholds decrease, all mechanisms incur more energy overhead, and BreakHammer greatly reduces the energy consumption of these mechanisms.

To summarize our under-attack results: BreakHammer significantly reduces the performance and energy overheads of existing RowHammer mitigation mechanisms when a memory performance attack is present. We attribute these gains to two aspects: first, BreakHammer accurately detects suspect threads, and second, BreakHammer effectively reduces the memory interference caused by suspect threads in the memory hierarchy.

I'll now summarize our no-attack results; we have more detailed results on these in the paper. Across 90 four-core benign workloads, even though it is not BreakHammer's goal to improve benign system performance, we observe that BreakHammer slightly improves memory access latency, system performance, and DRAM energy efficiency.

We have more in our paper: more implementation details, for example resetting BreakHammer's counters, tracking software threads, systems with DMA engines, and systems without caches. We have a security analysis where we provide an upper bound on the overhead an attacker can cause without being detected by BreakHammer, and a second security analysis against multi-threaded attackers that try to circumvent BreakHammer. We also have more performance results, for example unfairness results, sensitivity to memory intensity, comparisons to prior works, and a sensitivity analysis of BreakHammer's parameters. I refer you to the paper for more results and implementation details.

Now I'll quickly conclude my talk. We introduced a new exploit where one can mount a memory performance attack by triggering RowHammer-preventive actions to block access to main memory for extended periods of time. We introduced BreakHammer, which observes triggered RowHammer-preventive actions, identifies threads that trigger many of these actions, and reduces the memory bandwidth usage of those threads. Our results show that when the system is under attack, BreakHammer significantly improves system performance and reduces energy, and when there is no attack, BreakHammer still slightly improves performance and energy. BreakHammer is fully open sourced and artifact evaluated, available through this QR code or the link below. This concludes my talk; I would be happy to answer any questions you may have.

So, do we continue, or are we taking a break? Can you hear me through the microphone? Okay, all right, cool. The third paper today is Sectored DRAM: a practical, energy-efficient, and high-
performance fine-grained DRAM architecture. This was published in ACM TACO; I presented it earlier this year at SRC TECHCON, and in January we will present it at HiPEAC, a conference typically held in Europe.

Anyway, I'll start with a summary. You already know about DRAM: DRAM-based systems suffer from energy wasted by coarse-grained data transfer and coarse-grained DRAM row activation. This is because applications typically use a relatively small fraction of the data they fetch from DRAM. Our goal is to design a new fine-grained, low-cost, and high-throughput DRAM substrate, or architecture, that can mitigate this problem of coarse-grained systems. Our design has two key ideas. First, with small modifications to the memory controller and the DRAM chip, we enable data transfers that are smaller than a cache block, in a variable number of DRAM interface clock cycles (we will get to those later). Second, we enable activating small portions of a DRAM row based on the workload's memory access pattern. Integrating Sectored DRAM into a modern computing system provides significant energy savings and system performance improvement. Sectored DRAM incurs a relatively small DRAM chip area overhead and performs within 11% of a high-performance, relatively high-area-overhead, state-of-the-art fine-grained DRAM architecture, while achieving more energy savings.

This is the outline. I'll skip most of the background on DRAM, since you know it already. Briefly: a DRAM mat is what we have multiple of inside a DRAM subarray, and inside the mat you have the rows of DRAM cells distributed over a two-dimensional array. This is a different, more detailed view of the DRAM subarray, and we will use it to understand how activate commands work. The global wordline spans the whole subarray and controls the local wordline drivers (those are the small gray rectangles that are not very visible), and the local sense amplifiers are connected to a global sense amplifier. The global sense amplifiers are further connected to the I/O interface. To access a DRAM row, you send an activate command. When the DRAM chip receives the activate command, the global and local row decoders enable the wordline of the target row; doing so connects the cells to the bitlines. Then the sense amplifiers sense the deviation in bitline voltage, amplify the voltage, and latch the values; in effect, we copy the data in the row into the sense amplifiers. To read out the data in the sense amplifiers, the memory controller sends a read command, and the column it selects with the read command is transferred outside the DRAM chip over the global sense amplifiers and the I/O interface. That was inside the subarray.

Now let's look at data transfers at the granularity of a DRAM module. Each chip transmits or receives one byte of data over its interface, and typically 64 bytes of data make up a cache block; that's the granularity of communication between the DRAM module, or chips, and, typically, the processor. A DRAM chip transmits or receives multiple bytes of the cache block in a data-transfer burst; the burst length in this example transfer is eight, because each chip receives or transmits eight bytes. A chip transmits one byte in each DRAM interface clock cycle, also referred to as a beat: in the first beat, each chip transmits its corresponding byte; in the next beat, the next eight bytes go out; and so on, until the last eight bytes are transferred.
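To illustrate the beat and burst terminology, here is a small C++ sketch of how a 64-byte cache block maps onto eight chips, each transferring one byte per beat over an eight-beat burst. The mapping shown is purely illustrative.

    #include <array>
    #include <cstddef>
    #include <iostream>

    int main() {
      constexpr std::size_t kChips = 8, kBurstLength = 8;
      std::array<unsigned char, kChips * kBurstLength> block{};  // one 64-byte cache block
      for (std::size_t i = 0; i < block.size(); ++i)
        block[i] = static_cast<unsigned char>(i);

      for (std::size_t beat = 0; beat < kBurstLength; ++beat) {
        std::cout << "beat " << beat << ':';
        for (std::size_t chip = 0; chip < kChips; ++chip)
          // In this beat, chip 'chip' drives byte (beat * kChips + chip) of the block.
          std::cout << ' ' << static_cast<int>(block[beat * kChips + chip]);
        std::cout << '\n';
      }
    }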
Now, a key problem with this coarse-grained DRAM data transfer is that it wastes energy. Even if the processor wants to access only one word, for example the second word in the cache block, the system retrieves the whole cache block around that word, with the hope of exploiting spatial locality. However, in this example the processor only ends up accessing eight of the 64 words before it writes the cache line back to DRAM, and on average, across many workloads, the fraction of used words is relatively low. Coarse-grained activation of DRAM rows also causes energy waste: the DRAM chip activates all mats with every activate command, in order to transfer all the words of a cache block in one data-transfer burst and maintain high throughput. However, activating all mats is not necessary, because the processor does not use all of the words it receives from DRAM, and the unused words lead to wasted activation energy.

We investigate the potential energy savings a hypothetical fine-grained DRAM design could provide, using 41 single-core workloads with the evaluation methodology we will describe later in the slides. This shows the energy benefits of fine-grained DRAM access on the left and fine-grained DRAM activation on the right. We observe that fine-grained DRAM can provide substantial read, write, and activation energy benefits. To better explain: if your DRAM today consumes this much energy (the darker orange bar), a DRAM chip that enables transferring data at word granularity enables something like 30% energy benefits, 27% specifically, and 4% for activation.

However, fine-grained DRAM row activation and transfer are not easy to implement; there are three key challenges. We describe those challenges here, and we also provide a classification of some of the prior works that have a relatively low area overhead to begin with; these works also propose fine-grained DRAM architectures. The first challenge is maintaining high DRAM data-transfer throughput: some of these architectures reduce data-transfer throughput by eight times, which incurs prohibitive performance overheads. The second challenge is area: even though these works try to be area-efficient, and we are trying to achieve the smallest area overhead possible, we find that some of the prior works that maintain high throughput incur large area overheads. The last challenge is to exploit fine granularity both for DRAM access and for DRAM activation. You can observe that no prior work overcomes all three challenges, and our goal is to develop a new fine-grained DRAM architecture that can mitigate the energy waste while overcoming all three.

Now let's see how Sectored DRAM mitigates this energy waste. It has two key design components. First, we find that modern high-performance, high-yield DRAM architectures already split DRAM rows into physically isolated, small, fixed-size mats. Leveraging this observation, the sector activation component enables fine-grained DRAM activation at low cost. Second, we observe that the DRAM I/O circuitry already has a way of selecting an arbitrary byte to be transmitted in one beat of a burst; we slightly modify the byte-selection logic to enable fine-grained DRAM data transfer at low cost, using a mechanism we call variable burst length. Let's look at sector activation in a bit more detail.
Let's look at sector activation in a bit more detail. This is our example subarray. Say we want to activate the first two mats but not the last two in this picture. To do so, we need to isolate the global word line from the local word line drivers. We do this by adding isolation transistors, which we call sector transistors, between the global word line and the local word line drivers. Next, we need a way of controlling the sector transistors, so we add a new control signal, which we call the sector latch, for each mat in a bank. By doing so, we transform what we used to call a mat into what we now call a sector, each of which can be individually activated. A sector is really just a mat that you have individual control over, so that you can activate it on its own; the new term distinguishes what already exists in today's DRAM chips from what Sector DRAM proposes to add. Now let's look at the second component, which I'll describe over an example. Here we have eight sectors in a DRAM bank. With a read command, each sector sends eight bits to a structure called the read FIFO. A relatively simple circuit then selects eight bits to transmit over the chip I/O interface with each beat of the data transfer burst, and this selection is controlled by a burst counter that gets incremented every beat. To enable variable-burst-length data transfers, we simply replace the burst counter with an encoder that selects data sent only by the open, that is, activated, sectors. So if only the first and last sectors were activated, the encoder would select only those two sectors' data. I'll briefly describe how a memory controller can take advantage of Sector DRAM and exercise control over sectors. Integrating Sector DRAM chips into a system does not have to require changes to today's standard interface, which is a nice feature to have. To explain our reasoning: you already know the precharge command, which the memory controller uses to close an open DRAM row, and a precharge command is typically followed by another activate command. There are more than 10 unused bits in the precharge command's encoding in the JEDEC DDR4 standard, and we use eight of these unused bits to specify the sectors that the next activate command will open; in other words, you transmit the sector bits before you send the activate command. So far, do you have any questions? Good. That was sector activation; variable burst length can also be exposed to the memory controller with no additional changes, but that part is rather detailed, so I'll refer you to the paper for the specifics.
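To tie together the read FIFO and encoder description from a moment ago, here is a minimal sketch of the beat-selection idea (my own rendering, not the actual I/O circuit; widths and names are illustrative): the baseline burst counter enumerates all eight sectors, while the modified encoder enumerates only the activated ones.

```python
# Sketch of variable-burst-length beat selection (illustrative, not the real I/O circuit).
# Each sector deposits its data into the read FIFO; the encoder then emits,
# beat by beat, only the entries that belong to activated (open) sectors.

NUM_SECTORS = 8

def burst_beats(sector_mask: list[bool], read_fifo: list[bytes]) -> list[bytes]:
    """Data beats transmitted for one read: closed sectors are skipped entirely."""
    assert len(sector_mask) == len(read_fifo) == NUM_SECTORS
    # Baseline: a burst counter would select read_fifo[0..7] unconditionally.
    # Variable burst length: enumerate only the sectors the last ACT opened.
    return [read_fifo[i] for i in range(NUM_SECTORS) if sector_mask[i]]

fifo = [bytes([i]) for i in range(NUM_SECTORS)]   # stand-in for per-sector data
mask = [True] + [False] * 6 + [True]              # only the first and last sectors open
print(len(burst_beats(mask, fifo)))               # 2 beats instead of 8
```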
I'll now describe how a modern DRAM-based computing system can effectively leverage Sector DRAM. Integrating it effectively into a system has two challenges. First, today we transfer data between components of the memory hierarchy at cache block granularity, but for Sector DRAM to mitigate energy waste we must enable sub-cache-block data transfers. We use sector caches, whereby we extend each cache block with one bit of metadata per word, and each of those new bits indicates whether its corresponding word in the cache block is valid. There is more to this than just sector caches, but implementing sector caches allows us to transmit data at sub-cache-block granularity. To make it more concrete: we don't have this in today's systems, but if you were to implement it in a system we have today, you would get sub-cache-block data transfers, because we now retrieve a word with each memory access, not the whole cache block it belongs to. The second challenge is that missing words in a cache block cause performance overheads, because you need to repeat memory accesses. Our solution is to develop two prediction techniques. The first one exploits the spatial locality across load/store instructions: you look at the word addresses of load/store instructions that follow each other in the load/store queue and accumulate the words they reference into the oldest instruction. The second one is a spatial pattern predictor that we tailor to predicting the useful words in a cache block. I'll describe these next. We call the first one load/store queue (LSQ) lookahead. One load/store instruction typically references one word in main memory, and our key idea is to collect these memory references from younger load/store instructions in the load/store queue and fold the collected references into the oldest one, the one that will probably execute first. The expected outcome is that when a load/store instruction executes, it retrieves all the words in the cache block that will be used by the load/store instructions following it in the queue; thereby we expect to eliminate some of the additional memory accesses caused by sector misses. However, LSQ lookahead suffers from two key drawbacks which, in summary, prevent it from identifying all useful words, so we complement LSQ lookahead with the sector predictor. The key idea of the sector predictor is to maintain cache block signatures that describe which words are likely to be useful, based on current word-use characteristics in the L1 cache; we then reuse these signatures when a cache block misses in the L1 cache, to retrieve all potentially useful words from the lower levels of the memory hierarchy.
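Here is a minimal sketch of the LSQ lookahead idea just described (my own pseudocode-level rendering; the structure and field names are assumptions, not the paper's design): scan up to N younger load/store queue entries and fold the word offsets of same-block accesses into the oldest entry's fetch mask.

```python
# Sketch of LSQ lookahead (illustrative; structure and field names are assumptions).
from dataclasses import dataclass

WORD_BYTES = 8      # bytes per word
BLOCK_BYTES = 64    # bytes per cache block

@dataclass
class LSQEntry:
    addr: int       # byte address referenced by the load/store

def word_of(addr: int) -> int:
    """Word index of this address within its cache block."""
    return (addr % BLOCK_BYTES) // WORD_BYTES

def lookahead_mask(lsq: list[LSQEntry], depth: int) -> int:
    """Word fetch mask for the oldest LSQ entry, widened with the words
    referenced by up to `depth` younger entries to the same cache block."""
    oldest = lsq[0]                                # entry that will issue first
    block = oldest.addr // BLOCK_BYTES
    mask = 1 << word_of(oldest.addr)
    for entry in lsq[1 : 1 + depth]:               # look ahead `depth` younger entries
        if entry.addr // BLOCK_BYTES == block:     # same block -> piggyback its word
            mask |= 1 << word_of(entry.addr)
    return mask

q = [LSQEntry(0x1000), LSQEntry(0x1018), LSQEntry(0x2000), LSQEntry(0x1038)]
print(bin(lookahead_mask(q, depth=16)))            # 0b10001001: words 0, 3, and 7
```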
Now let's see how Sector DRAM performs. We use a variety of simulation frameworks to evaluate Sector DRAM's performance benefits, energy benefits, and area overheads, with the baseline system configuration shown here. We evaluate two Sector DRAM policies: always-on keeps Sector DRAM on all the time, while the dynamic policy turns Sector DRAM on and off based on workload memory intensity. Specifically, we look at the memory request queue occupancy, averaged across a large quantum: if it is higher than a threshold, meaning a lot of memory requests are coming in, we turn Sector DRAM on; if not, we turn it off. It will become clear later why we do this. We compare Sector DRAM against three state-of-the-art fine-grained DRAM mechanisms, namely Half-DRAM, fine-grained activation, and partial row activation, and we run 41 one- to sixteen-core multiprogrammed workloads from major benchmark suites. First, I'll show the power improvements from using fewer sectors. We investigate activate and read power; write power looks very similar to read power, so you can read this as both. The x-axis shows how many sectors you activate or read, and the power consumption is broken down into DRAM array and periphery power; this is the full plot. We make two observations. First, reading or activating one sector greatly reduces power consumption compared to reading or activating eight of them. Second, activate power is dominated by periphery power, which is only marginally affected by the number of sectors you activate, so you cannot reduce activate power as much as you can reduce read power. Then we look at the performance of our prediction techniques. We show different Sector DRAM configurations; the y-axis is the number of last-level cache misses, averaged across all single-core workloads, so each bar is an average across all workloads. Basic has no prediction, it's just Sector DRAM: you retrieve one word and call it a day, so if you didn't retrieve all the words the processor needs from a cache block, you have to access memory again and again. LA-N stands for LSQ lookahead configurations that look ahead N entries in the load/store queue; for example, LA-16 looks ahead only at the next 16 loads and stores in the queue. You'll notice we have an LA-2000: I didn't want to use the word impractical, but you don't have a 2,000-entry load/store queue, so that is almost a limit study. SP-512 is the sector predictor with a history table of 512 entries. Sector DRAM with no prediction significantly increases the last-level cache misses, by almost three times. We see that LSQ lookahead combined with the sector predictor reduces these additional last-level cache misses by a large margin; I think there's still room for improvement, but this is what we have in the paper. The next figure shows the normalized weighted speedup for heavily memory-intensive eight-core workload mixes, for the always-on and dynamic policies I described earlier; we observe that Sector DRAM provides significant speedups for highly memory-intensive workloads. We also show the weighted speedup for heterogeneous, medium-, and low-intensity workload mixes, and here you should be able to see why we have two policies: always-on hurts performance for medium- and low-MPKI workloads. Why? Because you pay for additional sector misses, while the benefits are small. One thing I didn't tell you yet, which is actually an error in my slides that I should fix: the performance benefits come from the fact that, by reducing activate power, you can relax some timing parameters, specifically tFAW, the four-activate window. This timing parameter limits the number of activate commands you can issue to a single DRAM rank; it's 25 or so nanoseconds in the DDR4 standard, meaning that within 25 ns you can send only four activate commands. Now, if your activate commands enable only one sector each, that power constraint effectively disappears and you become bound by other timing constraints. This is where the performance benefits come from: you're not activating all eight sectors at once but, say, two or three (I don't have the exact average), so you can send many activate commands within one tFAW, not just four, and you benefit from a higher rate of activate commands. In effect, we have a tFAW budget of 32 sectors instead of four rows, because there are eight sectors per row: something like 32 sectors per 25 ns. That higher activate rate doesn't help medium- and low-intensity workloads, though, and for them the additional memory accesses hurt performance; but if you can identify these workloads, you can simply turn Sector DRAM off, which negates its energy benefits but avoids the performance overhead.
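Here is the tFAW arithmetic from that explanation as a quick sketch (the ~25 ns value is the talk's ballpark figure; the exact DDR4 tFAW depends on the speed bin):

```python
# Rough tFAW arithmetic (values approximate; DDR4 tFAW varies with speed bin).

TFAW_NS = 25.0          # four-activate window, roughly, per the talk
ACTS_PER_WINDOW = 4     # activate commands allowed per tFAW in the baseline
SECTORS_PER_ROW = 8

# Baseline: 4 full-row ACTs per window = a power budget of 32 sectors per ~25 ns.
budget = ACTS_PER_WINDOW * SECTORS_PER_ROW
print(f"budget: {budget} sectors per {TFAW_NS} ns")

# If each ACT opens only k sectors, the same budget admits more ACTs per window:
for k in (8, 4, 2, 1):
    print(f"{k} sectors/ACT -> {budget // k} ACTs per window")
```

With one-sector activations, the activate-rate ceiling rises from 4 to 32 commands per window, which is where the speedups for memory-intensive workloads come from.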
So basically, you overcome the performance degradation in non-memory-intensive workloads with the dynamic policy: you're as fast as the baseline. If you remember from the previous slide, it also reduces your benefits a little, 17% becomes 14%, but there could be a better way to transition between the always-on and off states. Okay, that was the bulk of the presentation; the rest is still important, but I can proceed faster. This is the system energy consumption across all tested workloads, for different workload categories. I'll just make the observation that Sector DRAM provides significant energy savings when you have core counts larger than two, because with one or two cores the workload is not memory-intensive enough for Sector DRAM to provide tangible benefits. We also compare Sector DRAM to those three state-of-the-art fine-grained DRAM architectures. It greatly improves performance over basic fine-grained DRAM architectures because it does not reduce DRAM throughput. It outperforms partial row activation by 10% because it leverages fine-grained DRAM for both read and write accesses; partial row activation only benefits from fine-grained data transfer and activation for writes, because, well, that is what its designers chose to do, I suppose. Sector DRAM performs within 11% of Half-DRAM; Half-DRAM is a bit unique in that it still reduces activate power and can benefit from a higher rate of activate commands, as Sector DRAM does, but it does not suffer from additional sector misses or memory accesses. In contrast, Sector DRAM provides more energy savings than all other prior works: it outperforms all three techniques because it enables finer-grained data transfer and activation than Half-DRAM, and it reduces background energy consumption, by reducing workload runtimes, relative to fine-grained activation and partial row activation. We also estimate Sector DRAM's processor and DRAM area overheads; to summarize, Sector DRAM's in-DRAM components take up 1.7% of a DRAM chip's area, and the sector caches and the predictor take up 1.2% of an eight-core processor's area, with more details in the paper. We have a microbenchmark performance evaluation study where we show that the more random-access your workload is, the more benefit you get from Sector DRAM; conversely, if you were to craft an adversarial access pattern, something like a strided-access workload, you could really hurt performance with Sector DRAM, inducing up to 33% overhead. We also have sensitivity studies on the number of DRAM channels and on how Sector DRAM performs with prefetching enabled. The results are not surprising: the more channels you have, the better both the baseline and Sector DRAM perform, and with prefetching enabled Sector DRAM improves performance even more, again as you would expect. We include a discussion of how you could provide finer-grained sector support and of compatibility with DRAM error-correcting codes. For finer-granularity sectors, you could, for example, send two precharge commands back to back if you need to transmit more sector bits than the available unused bits, or you could actually change the standard to support finer-grained DRAM natively, but that is more costly. For ECC, we discuss how Sector DRAM is compatible with existing state-of-the-art DRAM ECC; there is no bad news, and you can find those details in the paper. I'll quickly conclude: we designed a fine-grained, low-cost, and high-throughput DRAM substrate to mitigate
the excessive energy consumed by coarse-grained DRAM. With small modifications to the memory controller and the DRAM chip, we enable variable-sized data transfer bursts and the activation of smaller-sized regions in DRAM chips, and we show that an effective system integration of Sector DRAM provides significant energy savings and performance improvements at low DRAM area cost. This work was published in ACM TACO; we have an extended version of the journal paper on arXiv, and everything is open sourced. Any questions about Sector DRAM? Ah, that's a good question. I'm not an expert on the sizes, maybe a hundred or so entries. Yes, we use a 128-entry lookahead size, but even with 16, as long as you complement it with the spatial pattern predictor, you're fine. The difference between these last two bars is very small, but you don't want to drop the lookahead either, and this can hopefully be improved. As an overall experience, Sector DRAM was very difficult to build, because you have so many things to touch: the processor, the DRAM chip, the interface (we didn't touch it much, but you might have to), and on the processor side the core itself and the whole cache subsystem. It simply requires a lot of work to implement. But the benefits of enabling these sub-cache-block data transfers in the system go beyond just enabling Sector DRAM. What I mean by that is, for example, if you have sector caches, meaning you have those additional bits that tell you which words are valid, you can actually power off the unused parts of your caches, which saves a lot of static energy, and that is probably used in some systems already. Yes, we already have that implemented, but that idea comes from partial row activation, one of the prior fine-grained DRAM architectures; it's a relatively low-cost method, you just need sector caches, no prediction. Cool, then we can take a break for, how many minutes, seven? Eleven? We'll be back at 3 p.m.; hopefully you will be back too. 3 p.m. is correct, right? Okay. Okay, it's 3 p.m.
sharp, so I'll continue with the fourth paper presentation. This is probably the last paper presentation I'll give this year, so it's very exciting. This one is called "Read Disturbance in High Bandwidth Memory: A Detailed Experimental Study on HBM2 DRAM Chips," and it was published at the Dependable Systems and Networks (DSN) conference this year, in June. Now, you know what this is: a typical DRAM-based computing system with a CPU and a DRAM module. We will look at a simplified depiction of the chip to understand read disturbance. Read disturbance has significant system-level implications because it breaks memory isolation. A prominent example is RowHammer: when you repeatedly activate a row many times, you can cause bit flips in neighboring DRAM rows. We call the repeatedly activated row the aggressor row and the rows with bit flips the victim rows. A relatively new read disturbance phenomenon is RowPress: RowPress induces bit flips by keeping a row open for a very long time instead of repeatedly activating it, and in extreme cases you can see bit flips with only one activation. Here is a non-exhaustive list of recent works that demonstrate read disturbance bit flips in commodity DRAM chips, on a timeline (not to scale): many LPDDRx and DDRx chips, old and new, are vulnerable to read disturbance. In our work we try to answer this question: what about High Bandwidth Memory? Does read disturbance exist in HBM, and if so, what does the vulnerability look like? Our goal, therefore, is to experimentally characterize how vulnerable modern High Bandwidth Memory chips are to RowHammer and RowPress. We experimentally characterized six HBM2 chips on two different FPGA boards. Here are four key results from the 23 new observations we make. First, the read disturbance vulnerability varies across the 3D-stacked channels in an HBM2 chip and across the DRAM rows in each channel. Second, on average across all tested DRAM rows, we make the interesting observation that you can induce 10 bit flips with an activation count fewer than twice the activation count needed to induce the first one. This is a bit difficult, so I'll take my time to explain it; I don't want to leave you confused: if X hammers are enough to induce the first bit flip in a row, then with 2X hammers you can get 10 bit flips in that row, on average. Third, we show that RowPress is widespread across the tested HBM2 chips. And fourth, we completely uncover the inner workings of an undisclosed on-die RowHammer mitigation technique in one tested HBM2 chip. We hope our observations can inspire and aid future work: they can be used to craft more effective attacks and, hopefully, more efficient defense mechanisms. All data and sources for this work are open sourced as well. Here is the outline of this talk. We start with some background on High Bandwidth Memory. HBM chips are extensively used in important system infrastructures today; for example, several popular machine learning applications rely on HBM chips and systems. A typical HBM DRAM system consists of a compute chip and multiple HBM2 DRAM chips, all integrated into a single package to form the full system. Here is a different view of the same system: the FPGA and the HBM chips reside in the same package and are connected by a silicon interposer, and the memory controller in the FPGA communicates with
the buffer die on the HBM chip over the HBM interface. The HBM chip contains multiple 3D-stacked DRAM dies, each die contains one or multiple channels, and there are wires called through-silicon vias that connect the 3D-stacked channels to the buffer die, so that data can be communicated from the upper layers down to the memory controller. We've already looked at what's inside a channel; it looks like the diagram for a single DRAM chip. The only exception is the concept of pseudo channels, which is unique to HBM2 (maybe they have it in HBM3 also, but it's not very important): as a good abstraction, a pseudo channel is a collection of multiple banks, each bank has multiple subarrays, and each subarray has many rows of DRAM cells. Now let's take a look at a DRAM cell, its leakage, and read disturbance in more detail. Each cell stores data in a fundamentally leaky capacitor, again something we covered earlier. This is a simplified diagram of a cell: the capacitor stores data as charge, an access transistor gates access to the data, and there are a number of leakage paths by which charge can exit the cell. The important thing to note is that the stored data may become corrupted if too much charge leaks. We have a simplified diagram with the capacitor voltage on the y-axis: the voltage decreases over time in an exponential decay, and there is a threshold, Vmin, the dashed line, below which we can no longer guarantee that the cell retains data correctly. As long as the cell's voltage is above this line, the read succeeds; if it drops below, we consider it a retention failure. This is a conceptual description of retention failures, and I will now transition to RowHammer. These are things you know, but I want to cover them for the YouTube audience. To prevent such failures, we issue periodic refresh operations that restore the charge in the cells, and the interval between refresh operations is the refresh window, which determines the duration between two subsequent refreshes of a cell. Now let's take the same diagram in the context of a RowHammer attack. If we issue enough activations to a nearby row, the hypothesis is that we accelerate the charge leakage rate to the point of failure: even though this cell would normally have retained its data, by disturbing a neighboring row we prevent it from doing so. If we then refresh the cell at this point, the refresh operation sees the corrupted data and restores the corrupted state back into the cell.
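As a conceptual aside, the retention picture just described can be sketched with a simple exponential-decay model (the constants here are arbitrary, chosen only for intuition; this is not a calibrated cell model):

```python
# Conceptual DRAM cell decay model (arbitrary constants, for intuition only).
import math

V0, VMIN = 1.0, 0.5       # initial cell voltage and the reliability threshold
REFRESH_WINDOW_MS = 64.0  # every cell is refreshed at least once per window

def time_to_fail_ms(tau_ms: float) -> float:
    """Time for V(t) = V0 * exp(-t / tau) to decay below VMIN."""
    return tau_ms * math.log(V0 / VMIN)

print(time_to_fail_ms(tau_ms=200.0))  # ~138.6 ms: a healthy cell outlives the 64 ms window
print(time_to_fail_ms(tau_ms=60.0))   # ~41.6 ms: a disturbed cell (faster leak) fails first
```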
Now I'll introduce the testing methodology we used in our characterization. We use DRAM Bender, which used to be a DDR3/DDR4 testing infrastructure; with this work we adapted it to work with HBM2 chips as well, and it is open source at the same link. This is a larger photo of one of the testing setups. The unique thing about this setup is that we have a heating pad on top of a heat sink, which we use to stabilize the temperature, and a cooling fan to cool things down a bit; it's a very ad hoc setup, as you can see. An Arduino temperature controller drives both the heating pad and the cooling fan to stabilize the temperature at a given set point. This setup gives us fine-grained control over DRAM commands, and we can issue commands back to back with as little as 1.67 ns latency. For the methodology, I'll say that we carefully reuse the existing read disturbance characterization methodology of prior work: we prevent sources of interference and use the worst-case double-sided RowHammer access pattern. The bottom figure depicts one hammer operation, which is opening and closing the two neighbors of a victim row; we count that as one hammer, so two activations. We measure two metrics in our work. The first is the bit error rate (BER): the fraction of cells in a DRAM row that experience read disturbance bit flips. For this one, we activate the aggressor rows 512,000 times, that is to say, we use 256,000 hammers; the higher this metric, the worse the RowHammer vulnerability of that row. The second metric is the minimum hammer count needed to induce the first bit flip; from now on I'll use HC_first to refer to it, because the full name just takes too long to say. The smaller this metric, the shorter the time needed to induce a bit flip in a DRAM row. We use the data patterns depicted in this table to initialize DRAM rows prior to hammering them. I'll explain how data gets laid out in DRAM rows with an example: for the row-stripe-0 data pattern, you have all zeros in the victim row, you fill the aggressors with all ones, and the seven other rows neighboring each aggressor row get all zeros again. Here is what happens when you use the checkerboard-0 data pattern; you can see how the layout changes. We use these four data patterns, and we determine the worst-case data pattern for each tested DRAM row as the data pattern that causes the smallest HC_first for that row; note that this worst-case data pattern is always one of the four data patterns that we test.
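Here is a sketch of the double-sided hammering test and the two metrics, written against a hypothetical `dram` device handle (the `activate`/`precharge`/`count_bitflips`/`init_data_pattern` calls are placeholders for what DRAM Bender actually issues, and the search in `hc_first` is deliberately coarse):

```python
# Sketch of the double-sided RowHammer test and our two metrics.
# `dram` is a hypothetical device handle; real experiments go through DRAM Bender.

ROW_BITS = 8 * 1024                    # roughly "a row of 8,000 bits" from the talk

def hammer_once(dram, victim: int) -> None:
    """One hammer = opening and closing both neighbors of the victim (2 ACTs)."""
    dram.activate(victim - 1); dram.precharge()
    dram.activate(victim + 1); dram.precharge()

def bit_error_rate(dram, victim: int, hammers: int = 256_000) -> float:
    """BER: fraction of the victim row's cells that flip after `hammers` hammers."""
    dram.init_data_pattern(victim)                 # lay out victim/aggressor pattern
    for _ in range(hammers):
        hammer_once(dram, victim)
    return dram.count_bitflips(victim) / ROW_BITS

def hc_first(dram, victim: int, step: int = 1_000, limit: int = 256_000) -> int | None:
    """HC_first: smallest hammer count (coarse search) that causes the first flip."""
    count = 0
    while count < limit:
        dram.init_data_pattern(victim)             # restore the pattern each attempt
        count += step
        for _ in range(count):
            hammer_once(dram, victim)
        if dram.count_bitflips(victim) > 0:
            return count
    return None                                    # no flip within the hammer budget
```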
Now we move on to the results, the interesting part. We start with the analysis of the spatial variation of read disturbance in the HBM2 chip. There are three key takeaways from this study. First, different 3D-stacked channels exhibit different read disturbance vulnerability. Second, DRAM rows near the ends and in the middle of a DRAM bank experience smaller bit error rates than other rows. Third, the activation count needed to induce the first RowHammer bit flip, HC_first, changes with the data pattern and the physical location of the DRAM row. For the first takeaway: the y-axis shows the bit error rate, and we plot the distribution of BER across the rows in each channel, so each box shows the distribution for one channel. We make two observations. First, there are bit flips in every tested channel; no channel is resilient to RowHammer, which is expected, but it is a simple observation to make. Second, the bit error rate varies across channels, and I highlight groups of two channels with similar-looking BER profiles: you can pair these channels up, as if there is some sort of symmetry going on, and we'll come back to this later. This is a larger version of the same figure covering more data patterns; you can observe from it that the data pattern affects the BER distribution, and we see up to around 247 bit flips in a row of 8,000 bits with 256,000 hammers, which quantifies the worst-case bit error rate in our experiments. This is another version of the same figure showing data from multiple chips; I show it so you can see that our observations are consistent across the tested chips. I won't always show data from all six chips, it's just too much, but whatever I show is consistent across all chips; we made no observation that was specific to a single DRAM chip. Moving on to the second takeaway (and you can interrupt me with questions, by the way): here we again show the bit error rate on the y-axis, but each point is a single DRAM row's bit error rate, and on the x-axis we order the rows by increasing row address, so as we go right we reach the last row in the DRAM bank. Maybe it becomes clearer if I show you this: these are the subarray boundaries that we reverse engineered using RowHammer tests; this is the last subarray in the bank, these are the last 832 rows, and these are the last four subarrays in the whole bank. That was a small cut-out; this is the full plot. We observe that the bit error rate is substantially smaller in the last and the middle subarrays than in all other subarrays; we have some hypotheses for this, which I'll explain later on. We also see that the bit error rate goes up and down as you move within a subarray: it is highest in the middle of a subarray and lowest at either end, and for this one I don't have a very good explanation. This is the same figure for three chips; as I said, whatever we observe is consistent across all tested chips. The third takeaway is about HC_first, shown in a similar figure except that the y-axis is HC_first: you can induce a bit flip in as little as 1.3 milliseconds, with around 14,000 hammers in the worst case, and the HC_first distribution depends heavily on the data pattern and on the physical location within the DRAM chip; again, I mainly wanted to show you the different patterns and how the distributions look across chips. Okay, we can move on to the second analysis if there are no questions; I do have time. This is a fairly new analysis; as far as I know, no prior work has done anything similar, and the results are quite interesting. We look at the hammer count needed to induce up to 10 bit flips: alongside HC_first, we derive nine more metrics, HC_2nd, HC_3rd, and so on up to HC_10th, the hammer counts needed to induce the k-th bit flip. There are two key takeaways. First, on average, fewer than two times HC_first hammers are needed to induce up to 10 bit flips in the same DRAM row; that is what I tried to explain in the executive summary. Second, in general, if it takes many hammers to induce the first bit flip in a row, it likely takes fewer additional hammers to induce more bit flips in that same row. Hopefully these become clearer with the data. Here we show the hammer counts needed to induce up to 10 bit flips, normalized to the hammer count needed to induce the first one.
We show four boxes, one for each tested data pattern, and these boxes show the hammer count needed to induce two bit flips, in multiples of HC_first: on average, we need approximately 1.2 times HC_first hammers to induce two bit flips. And here is the complete plot. We observe that the hammer count needed to induce up to 10 bit flips varies across rows; the whiskers of the boxes show the degree of that variation. There are rows where you can induce 10 bit flips with as few as 1.2x HC_first hammers, but there are also rows where it takes five times HC_first; on average, fewer than 2x HC_first hammers, the green line, are enough to induce 10 bit flips. Now the second takeaway. Oh yes, that's a good point. I think, first, the analysis itself has some value in this sort of project; but you are also interested in this kind of analysis because there are error-correcting codes that can correct the first few bit flips, so it helps you quantify how many more hammers an attacker would need in the presence of such an ECC. Although a better analysis would be at codeword granularity rather than row granularity: within one codeword, how many more hammers do you need to induce the next bit flip? That would be a good complement, to complete the circle when you include the system-level implications. Cool. Here we look at the same data in a different way. The x-axis shows HC_first, and the y-axis is the additional hammer count over HC_first needed to induce the 10th bit flip. Each point in this plot is a DRAM row; this point, for example, represents a row that exhibits its first bit flip after 30,000 hammers and nine more bit flips after 8,000 additional hammers. This is clear, right? And this is the complete plot. Looking at how the x and y axes correlate, we find that increasing HC_first is somehow (I'll say "somehow" because I'm not allowed to use "significantly" or anything stronger) correlated with a decreasing additional hammer count to induce the 10th bit flip. This shows that if a row endures many hammers before its first bit flip, we can likely induce further bit flips in that row with fewer additional hammers, which could be a useful observation for attackers; but we don't close that gap, that is, we don't take this all the way to a concrete system-level implication. We see similar distributions across all tested chips.
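A small sketch of how the HC_k metrics behind these box plots can be derived from a per-row sweep (the sample sweep below is made up; real numbers come from the hammering experiments):

```python
# Deriving HC_k (hammer count to induce the k-th bit flip) from sweep samples
# of (hammer_count, cumulative bit flips) for one row. The sample row is made up.

def hc_k(sweep: list[tuple[int, int]], k: int) -> int | None:
    """First sampled hammer count at which the row shows >= k cumulative flips."""
    for hammers, flips in sorted(sweep):
        if flips >= k:
            return hammers
    return None

row = [(10_000, 0), (14_000, 1), (16_000, 3), (18_000, 6), (24_000, 10)]
hc1 = hc_k(row, 1)
for k in (1, 2, 10):
    print(f"HC_{k} = {hc_k(row, k)} ({hc_k(row, k) / hc1:.2f}x HC_first)")
# This made-up row reaches 10 flips at ~1.7x HC_first, consistent with the <2x average.
```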
Lastly, we look at how read disturbance changes with the amount of time a row remains open: this is the RowPress analysis. As I described earlier, if we activate an aggressor row many times, we can accelerate the charge leakage to the point of failure. RowPress, in contrast, increases the time the row stays open instead of the activation count, and doing so disturbs adjacent rows enough to cause bit flips without a very high activation count. There are two takeaways from this analysis: first, the bit error rate increases with the aggressor row's on-time at a fixed hammer count of 150,000; and second, as we keep the aggressor row open longer, DRAM rows experience bit flips at smaller activation counts. I'll go through these quickly, because none of it is surprising compared to what the RowPress paper already showed; basically, we have quantified the degree of RowPress vulnerability in HBM2 chips, and that is the value of the analysis. Here, as you go right, you keep the rows open longer, and as expected the bit error rate increases at the same hammer count; that is what this figure shows. And this is at even higher aggressor row on-times, tREFI, which is 3.9 microseconds, and 9 tREFI; the significance of these values has to do with how long the standard allows you to keep a row open: you cannot keep it open longer than 9 tREFI. We see that the bit error rate reaches around 50% at an extreme combination of long row-open time and high hammer count. Now, this wouldn't fit in a refresh window, but it is interesting that you can flip such a large fraction of the bits with RowPress combined with RowHammer. And again, our observations are consistent across all chips. Oh yes, that's a good point; the question is whether some of these could be retention errors. We account for that: we do retention time testing first and eliminate all bit locations that could fail due to retention, so these flips are not retention errors; also, the total time, even though it ends up exceeding 64 milliseconds, is not huge. The second takeaway is another expected observation, I would say: as you increase the row on-time, the hammer count needed to induce the first bit flip decreases, and we also see that with one hammer, in our case two activations, many DRAM rows experience bit flips if you keep the row open long enough, again exceeding the refresh window. Yes, that's a good question about variable retention time: the way we account for VRT, at least, is that we test multiple times; we cannot test forever, but it's not that we do a single retention test and call it done. The fraction of cells that fail because of retention, compared to read disturbance errors, is very, very low in our experiments, something like 0.01%, so even if we didn't catch every VRT cell, the conclusions hold. Okay, some hypotheses, and then I'll need to accelerate. We attribute the similarity in bit error rate within groups of two channels to their physical placement inside the DRAM die: we think, not just me, that groups of two channels might share the same DRAM die. You might have four DRAM dies stacked with two channels on each die, and maybe process variation affects each die in a way that makes its channels end up with similar RowHammer bit error rates. No, it's mounted, well, they share the same package: FPGA and HBM in one package. Yes, you pay 5,000 Swiss Francs just to be able to test one HBM chip. Okay. As for the bit error rate changing across rows within a subarray, one explanation we have, well, it's not really an explanation, is that it seems to be affected by the row's distance to the sense amplifiers; that is one hypothesis: maybe the further you get from the sense amplifiers, for some reason, the higher the bit error rate tends to be. And for the bit error rate being very low in two subarrays, we have three different hypotheses in the paper, but the most likely one, I think, is that these
subarrays are at the ends of the bank, so you cannot lay out the data pattern in them the same way you would in other subarrays: the subarrays at either end have bit lines coming in from only one direction (well, there are bit lines on the other side too, because you have to add dummy sense amplifiers, but you cannot really operate those). A better explanation would require more analysis, maybe a physical analysis of an HBM chip, which is not possible, so this is the best we've got. Yes, these are memory-controller-visible row addresses, but each point here is an average of the bit error rate over 32 rows, so that sort of remapping noise would be averaged out. Cool. I normally have one more thing to talk about, so I'll pick a few things to say and move on. As another component of this paper, we also look at on-die mitigation techniques. Let me show this slide very quickly: in the standard you can see that there is a target row refresh (TRR) mode; I'm not talking about that one. We are looking for other on-die mitigation mechanisms that are transparent to the memory controller, because TRR mode is not memory-controller transparent and we do not enable it. So we disabled TRR mode and looked for on-die mitigation mechanisms that take action when a refresh command is sent, and we found one. We repeated the experiments from U-TRR, a prior work on uncovering target row refresh, on this HBM2 chip, and we find that on every 17th periodic refresh operation, even with TRR mode disabled, a mitigation mechanism takes action: it picks one potential aggressor row and refreshes both of the rows adjacent to that aggressor. The manufacturer is undisclosed; it's probably not Samsung, maybe not SK Hynix either, we don't know. The mechanism identifies as aggressor rows the first row that gets activated after a TRR event (that is, after every 17th refresh), and any row that receives more than half of all the activations between two refresh commands: if there are 10 activations in the window between one refresh and the next, and one row receives more than five of them, that row is flagged as an aggressor. There might be other behaviors we could not uncover, and of course we then bypass this mitigation mechanism; but this does not mean your HBM2 chips are vulnerable to RowHammer when you enable TRR mode. We did not enable TRR mode, and maybe with TRR mode enabled you really cannot get bit flips; no one knows.
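To summarize the reverse-engineered behavior in code form, here is a sketch of the aggressor-selection rules we inferred (this is a model of an undisclosed mechanism, reconstructed from black-box experiments; it may well be incomplete, and the class structure is purely illustrative):

```python
# Inferred model of the undisclosed on-die mitigation in one tested HBM2 chip.
# Reverse-engineered behavior, not a vendor specification; possibly incomplete.
from collections import Counter

class InferredMitigation:
    TRR_PERIOD = 17                    # acts on every 17th periodic refresh

    def __init__(self) -> None:
        self.refreshes = 0
        self.first_after_trr: int | None = None
        self.acts: Counter[int] = Counter()   # ACTs per row since the last REF

    def on_activate(self, row: int) -> None:
        if self.first_after_trr is None:
            self.first_after_trr = row # rule 1: first row activated after a TRR event
        self.acts[row] += 1

    def on_refresh(self) -> list[int]:
        """Rows preventively refreshed alongside this REF, per our model."""
        total = sum(self.acts.values())
        # rule 2: a row receiving more than half of all ACTs between two REFs
        majority = {r for r, n in self.acts.items() if n > total / 2}
        self.acts.clear()
        self.refreshes += 1
        if self.refreshes % self.TRR_PERIOD != 0:
            return []                  # mitigation stays idle on ordinary REFs
        aggressors = set(majority)
        if self.first_after_trr is not None:
            aggressors.add(self.first_after_trr)
        self.first_after_trr = None    # start tracking for the next TRR event
        return [r + d for r in aggressors for d in (-1, +1)]  # refresh both neighbors
```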
We have more analysis in the paper: we look at the effect of aging on the bit error rate, and we look at HC_first across test iterations, that is, across time. You know about variable retention time; here we are looking for the existence of variable read disturbance, and we show some evidence for it, meaning that when you sample HC_first of a row you may find X, and the next time you sample it you might find 2X. That is what we show in this paper, very preliminarily; with this evidence we then ran a larger-scale experiment, and we have an upcoming paper that will be presented at HPCA, which I hope to present to at least some of you before then. This variation in HC_first might also explain the system-level attack papers that report a probability of failure: a RowHammer attack with the same number of hammers is not always guaranteed to flip bits, and I think this might explain those observations. We also analyze the effectiveness of error-correcting codes: they do not work well unless you supplement them with RowHammer protection techniques, because you can get more than N bit flips in a codeword, where N is the correction limit of the ECC. We discuss more implications for attacks and mitigations, and more hypotheses to explain our observations, in the paper. We also look at bit error rate variation across rows in each channel, bank, and pseudo channel; I didn't show you that, but you can find it in the paper. I'll skip the conclusion slide and just show you this link, and conclude. Thank you. No questions? Right. Yes: the question is whether there are real system-level attacks on HBM2 chips, and what the difficulties of mounting one are. First, getting the hardware is quite difficult; it's very expensive and not as widespread as DDR4, so you probably won't find it in a shared cloud environment so easily. You do have HBM in GPUs, and HBM3 in recent GPUs, but GPUs have many undocumented, very hidden layers of translation between what you think your code does and what the hardware actually ends up doing, so I'm not aware of any work that showed RowHammer or read disturbance bit flips using GPUs (not to be confused with using the GPU as an accelerator to hammer the main memory of the system; I'm talking about the GPU's dedicated memory). So I would say HBM is not as widespread, it's difficult to find a system with HBM to begin with, and then we're talking about testing it and figuring out the access patterns that would work against it. With HBM3, I think they also have better mitigation mechanisms implemented, so it will be even more difficult to figure things out. Okay, I guess we have around 25 minutes, so let's switch gears and conclude the semester by going to SSDs, in a similar spirit to what we have discussed so far: we look at mitigation techniques for errors and how to reduce the overhead of those techniques. This is a work called "Reducing Solid-State Drive Read Latency by Optimizing Read-Retry," published at ASPLOS 2021. I'll give a brief executive summary. Read retry is a typical error mitigation technique, and it results in long read latencies in modern SSDs, because we frequently need multiple read-retry steps to read an erroneous page. The goal of this work is to reduce the latency of each read-retry operation, and there are two contributions. The first is pipelined read-retry (PR²), which concurrently performs consecutive read-retry steps using the cache read command. The second is adaptive read-retry (AR²), which reduces the read timing parameters to exploit the reliability margin provided by the strong ECC. Both of these techniques come with small implementation overhead and require no changes to NAND flash chips. Quick results: this work reduces the SSD response time by up to 42%, and 26% on average, compared to a high-end SSD that performs regular read-retry, and it performs up to 29% better, and 17% better on average, compared to the state-of-the-art baseline. This is the outline of my talk: we'll quickly look at what read retry is in modern SSDs, then the two contributions, and then I'll wrap up with some results. So, as you know, NAND flash memory is a very
error-prone substrate, and there are various error sources that shift and widen the programmed threshold-voltage (Vth) states. What we show here is the distribution of programmed states in a page: the y-axis shows the number of cells and the x-axis shows the cells' threshold voltage. In an MLC NAND flash memory, where each cell stores two bits, you have four states: an erased state, which is not shown here, and three programmed states, and you use three read reference voltages to identify the MSB and LSB pages. Some error sources: you have retention loss, because the charge does not stay in the charge-trap layer, so charge leaks and the threshold voltage decreases for some cells; and there are also program interference and read disturb, where the threshold voltage increases for some cells, because programming or read disturbance pushes those cells to the right of where they should be. These are the erroneous cells when there is retention loss, for example. So NAND flash memory is highly error-prone, and a main reason is that as you increase the density of the flash cells, the gaps between two programmed states get much smaller. How do we mitigate these problems? We typically use error-correcting codes, storing parity bits for ECC decoding. How does it work? A request arrives at the flash controller to read a page; the flash controller sends a NAND command to the NAND flash chip; the data comes back to the ECC engine; the ECC engine corrects the data; and you get the original data back. If the number of raw bit errors in the page is greater than the correction capability of the ECC engine, you simply cannot retrieve the data; these are uncorrectable errors. So where does read retry come into this picture? What read retry does is shift the read reference voltages in small or large increments and then try to read the page again, so that you reduce the number of erroneous cells compared to the original read reference voltages; you adjust the read reference voltages, and this results in a decreased number of raw bit errors. Now let's look at the performance overhead of read retry. In a typical read you have tR, which is the page sensing; then the DMA, because you need to transfer the data from the flash chip to the SSD controller; and then the ECC decoding. These are the three steps that typically happen in a read operation. If you have errors in the page, say 32 errors, that is still less than the correction capability of the ECC engine, which is, let's say, 72 per 1 kilobyte, as we show in the paper. But if there are more errors, say you read the page and get 232 errors, then you perform read retry with shifted read reference voltages, and the number of errors keeps decreasing as you approach the optimal read reference voltage; at some point you reach a number of errors well below the correction capability of the ECC engine. You perform this whole three-step sequence multiple times, N times, as we show here, and you eventually get the data back without errors.
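A sketch of this baseline read-retry flow (the timing constants are placeholders, apart from the roughly 20 µs tR mentioned later in the talk; `nand` stands in for a hypothetical flash-chip interface):

```python
# Baseline read-retry flow (illustrative; `nand` is a hypothetical chip interface).
T_R, T_DMA, T_ECC = 20.0, 5.0, 3.0    # page sensing / transfer / decode, microseconds
                                      # (tR ~20 us per the talk; the others are made up)
ECC_CAP = 72                          # correctable raw bit errors per 1 KiB codeword

def read_with_retry(nand, page: int, max_steps: int = 32):
    """Shift the read reference voltages step by step until the raw bit errors
    fall within ECC correction capability; returns (data, total latency in us)."""
    latency = 0.0
    for step in range(max_steps):
        nand.set_read_voltages(step)        # step 0: default Vref; then shifted values
        raw = nand.sense_page(page)         # tR: page sensing
        latency += T_R + T_DMA + T_ECC      # every retry repeats the full 3-step sequence
        ok, data = nand.ecc_decode(raw, ECC_CAP)
        if ok:
            return data, latency            # errors now within correction capability
    raise IOError("uncorrectable page")     # never fell within ECC capability
```

Total latency grows as N × (tR + tDMA + tECC), which is exactly the near-linear overhead the paper attacks.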
So read retry increases the overall read latency almost linearly with the number of retry steps. We also did some real device characterization to quantify how much of an overhead this is. Based on characterization of 160 real 3D NAND flash chips, we see the following (there's a lot going on in this figure): on the y-axis we show the number of retry steps, and on the x-axis the retention age, which is the elapsed time since the page was last programmed; we also show data for zero program/erase (P/E) cycles, 1,000 P/E cycles, and 2,000 P/E cycles. The boxes you see here show the probability of each retry-step count. At the start, when you have just programmed the page, with zero retention time, you get a probability of one of doing no read retry at all. With zero P/E cycles and a retention time of six months, the number of retry steps reaches five, with a probability of around 40%; and with 2,000 P/E cycles you see a gradual trend: the number of retry steps increases as you perform more program/erase cycles, reaching up to 25. So the more P/E cycles you perform, and the longer the retention time, the more retry steps you need. Another insight this paper provides is that you perform many read-retry operations even under modest operating conditions: if you look at the first box here, you perform almost three read-retry steps, at a fairly high probability, even at a retention time of only three months. So how do the existing mitigation techniques perform? What they do is try to predict near-optimal Vref values. What you see here is the way the typical operation happens, which is to shift the read reference voltages step by step to find, we call it the valley-finding algorithm, or some people call it minima finding, the region between two states. Some of the state-of-the-art techniques instead try to predict the actual read reference voltage directly and avoid all these intermediate shifting steps. How do they do it? They exploit process similarities between different pages: if one page undergoes read-retry, they record the final read reference voltage and apply it to other pages with similar process characteristics. Some other works use machine learning techniques to predict the read reference voltage. But the bottom line is that these disturbances change the threshold voltage very quickly, and the threshold-voltage shifts are large, so it is very hard to eliminate read retry entirely. That's a bit of background on read retry, a bit of a tongue twister for me, but anyway: any questions so far? Okay, yes: there are some studies that look at these kinds of similarities across different blocks; you also have
some word lines that are more error-prone across all blocks, and there are studies that look at that. There is no single pattern, but there are similarities. Any other questions? I'll quickly move on to the first contribution, which is called pipelined read-retry. The idea of this technique is to perform consecutive retry steps concurrently. This is how the retry operation typically happens: you perform the three steps of the read operation, you see that there are a lot of errors, and then you start the retry operation with a different reference voltage, so each retry starts only after the previous operation has completed. The idea of this work is to leverage the cache read operation. Cache read uses a cache buffer, similar to the page buffer that exists in flash chips, and it overlaps the sensing of the next page with the data transfer of the previous page. What we do here is perform the sequential re-reads of the same page using cache read operations: while the DMA and the ECC decoding of one step are going on, you already start the next tR, the next page sensing. This removes tDMA and tECC from the critical path and saves around 30% of latency for each retry step. Now, there is a problem here: how do we know when to stop issuing the next retry operation? The pipeline keeps going; you perform cache reads again and again until the number of errors drops below the correction capability, but you only know that once the ECC decoding finishes, and by that time you have already started the next read. So what happens is, once decoding succeeds while a speculative read is in flight, you send a reset command to the flash chip to stop that operation. That is typically how we stop the pipelined retry; and that's PR².
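To see where PR²'s savings come from, here is a rough latency model comparing sequential retry with the pipelined version (same placeholder timings as in the earlier sketch; the per-step saving lands near the talk's ~30% only because of the ratio these made-up constants happen to have, and the model assumes sensing dominates transfer plus decode):

```python
# Rough latency model: sequential vs. pipelined read-retry (microseconds, placeholders).
T_R, T_DMA, T_ECC = 20.0, 5.0, 3.0

def sequential(n_steps: int) -> float:
    """Each retry step runs sensing, transfer, and decode back to back."""
    return n_steps * (T_R + T_DMA + T_ECC)

def pipelined(n_steps: int) -> float:
    """Cache read overlaps step i+1's sensing with step i's transfer and decode,
    so tDMA + tECC stay off the critical path except after the final sensing."""
    return n_steps * T_R + (T_DMA + T_ECC)

for n in (1, 5, 10):
    s, p = sequential(n), pipelined(n)
    print(f"{n:2d} steps: {s:5.0f} us -> {p:5.0f} us ({1 - p / s:.0%} saved)")
```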
Now let's move on to adaptive read-retry. Adaptive read-retry starts from the observation that there is a large margin between the correction capability of the ECC engine and the number of raw bit errors at the last step of the read-retry sequence: even though the ECC capability is around 72 errors, at the last step the error count is down to around 23. To leverage this, we explore reducing the read timing parameters for every read-retry step. Every read operation takes, let's say, 20 microseconds, and we try to reduce that read latency; doing so repeatedly across all the retry steps yields a meaningful latency reduction. This shorter tR does not come for free, though: if you reduce the read latency, you get additional raw bit errors, because that is how NAND operates, so we must ensure that the additional errors caused by the shorter tR stay within the ECC margin. There are two necessary conditions for adaptive read-retry to work. The first condition is a large ECC margin in the final retry step, because, as I mentioned, in the final retry step we are almost at the near-optimal read reference voltages, and that is where we need the margin. The second condition concerns reliability: if we reduce the timing parameters, we need to understand how that affects reliability. Based on our characterization, manufacturers set these read timing parameters very conservatively, to cover worst-case process variation and operating conditions. In this paper we do a thorough experimental analysis of all these conditions, and I'll present some quick results now. The goal is to characterize the ECC margin in the final retry step, the reliability impact of reducing the read latency, and the effect of conditions like temperature and retention time. The methodology: we test 160 real NAND chips and more than 10 million pages, using an FPGA-based custom flash controller that can issue basic commands as well as test-mode commands like SET FEATURE and GET FEATURE. Looking at the ECC margin in the final retry step: here we show the number of errors per 1 kilobyte of data, with the retention time in months on the x-axis. Take the worst case of 2,000 P/E cycles: in the final step, even at a retention time of one year, there is a large margin between the ECC correction capability and the number of errors you get. This large margin is present even under worst-case operating conditions, although the margin shrinks as you increase the P/E cycle count and the retention time. Next, we look at the effect of reducing the timing parameters. Here we show the maximum increase in the number of errors at 85 degrees Celsius, and on the x-axis the reduction in page sensing latency: 0%, a 12.5% reduction in tR, and so on. If you take the worst case, 2,000 P/E cycles and a retention time of 12 months, even at a 25% reduction in tR we are still under the safe reduction point of the ECC correction capability; so this study shows that even a 25% reduction is still okay. This also depends on the operating conditions: as you increase the reduction slightly beyond that point, the error count suddenly jumps up. Some takeaways from this device characterization: adaptive read-retry can readily work on state-of-the-art NAND flash chips, and we must properly reduce tR depending on the current operating conditions. When we combine adaptive read-retry with pipelined read-retry, we implement a read timing parameter table, where we store tR values indexed by the current P/E cycle count and retention time, based on offline profiling; for, say, 1.5K P/E cycles and a retention time of 60 days, you get a tR value of 70. When a read fails, we look up this table, get the tR value for the current operating conditions, and use the SET FEATURE operation to set the read timings.
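A minimal sketch of that lookup (the table entries are invented, arranged so that the talk's example, about 1.5K P/E cycles and 60 days of retention mapping to a tR value of 70, falls out; the real table is built by offline device profiling and the result is programmed via SET FEATURE):

```python
# Sketch of the read-timing-parameter table lookup for adaptive read-retry.
# Entries are invented for illustration; the real table comes from offline profiling.
import bisect

PEC_BINS = [500, 1_000, 1_500, 2_000]   # program/erase-cycle bins
RET_BINS = [7, 30, 60, 180, 365]        # retention-age bins, in days

# TR_TABLE[pec][ret] -> tR setting to program via SET FEATURE (invented values,
# except that (1.5K cycles, 60 days) -> 70, the example mentioned in the talk).
TR_TABLE = [
    [60, 62, 65, 68, 72],
    [62, 64, 67, 70, 75],
    [64, 66, 70, 74, 78],
    [66, 70, 74, 78, 82],
]

def lookup_tr(pec: int, retention_days: int) -> int:
    """Pick the tR entry for the current operating condition (clamped to the table)."""
    i = min(bisect.bisect_left(PEC_BINS, pec), len(PEC_BINS) - 1)
    j = min(bisect.bisect_left(RET_BINS, retention_days), len(RET_BINS) - 1)
    return TR_TABLE[i][j]

print(lookup_tr(pec=1_500, retention_days=60))   # -> 70
# On a read failure the controller looks up tR here, issues SET FEATURE with it,
# runs the retry steps, and restores the nominal timing once ECC decoding succeeds.
```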
Okay, before I go to the results, any questions so far? Okay, I'll quickly jump to the results; I have three minutes. This evaluation was done primarily with real device characterization, but we also incorporated these results into the NAND flash models of MQSim, which is a state-of-the-art SSD simulator for those who don't know it. We evaluate on 12 real-world I/O workloads, six from the MSR Cambridge traces and six from YCSB, all of them block I/O traces; two of them are write-dominant and four are read-dominant. As baselines we look at an ideal SSD that has no read-retry, which gives the upper bound on performance; a high-end SSD that performs regular retry operations; and a state-of-the-art technique from MICRO 2019 that predicts the read reference voltage close to its optimal value and thereby reduces the number of retry steps by 70%.

We first show the results against the baseline SSD that performs regular read-retry operations. For the combination of the two techniques, PR² and AR², we show SSD response time on the y-axis and retention time on the x-axis for different traces with different P/E cycles, and lower is better. Compared to the baseline, you get up to a 42% reduction in SSD response time, and 26% on average. This also holds for write-dominant traces: you would assume that only read-dominant traces see a performance improvement, because that's where the ECC errors show up, but even write-dominant workloads, which trigger garbage collection and hence internal reads, still see a good performance improvement.

Compared to the state of the art, again with SSD response time on the y-axis, the state of the art performs much better than the combination of pipelined and adaptive read retry alone. The reason is that the state of the art does prediction, so it can sometimes land on the correct read reference voltage directly and need no, or very few, retry operations, whereas our technique still performs many retry operations, just with reduced timing parameters. However, the state of the art still has a large gap to the ideal policy with no read-retry, and when we combine our contributions with the state of the art, we see a performance improvement of up to 29% over it. That's basically the result of the comparison with the state of the art.

There are more results in the paper: an analysis of the read mechanism in modern SSDs, detailed results from the device characterization, the effect of reducing individual timing parameters versus multiple read timing parameters, the effect of operating temperature, and how to choose the best timing parameters.
We also do a detailed evaluation of PR² and AR² when applied individually, and there is a discussion of future directions for reducing SSD read latency. In summary, retry operations cause long read latency in modern SSDs, and we make two contributions, pipelined read retry and adaptive read retry; with them we achieve up to a 42% response-time reduction compared to a high-end SSD and up to 29% compared to the state-of-the-art baseline. We hope that this idea and the characterization results can inspire more valuable studies going forward. That's my presentation, thank you so much. Any questions on this technique? Okay, then that's it for today; we'll close the lecture now, and I wish you all very happy holidays. Thank you!