Transcript for:
ITIL4 Availability Management Best Practices

Hi, and welcome to this series, where I'm going through the ITIL 4 practices individually. I'm going to give you the main headline points you need to be aware of to support your base knowledge of ITIL 4, along with some overall context and some real-world experience from the organizations I've worked in before. Don't forget to subscribe to my channel and hit the bell so you get the updates. Okay, let's get to it.

Today I'm covering availability management. Availability management is one of the service management practices, and there are 17 practices there. As a reminder, a practice is a set of organizational resources designed for performing work or accomplishing an objective.

Let's cover a couple of points regarding availability management. These are things to think about, to ponder on, maybe some takeaway messages, and this is availability management in the widest sense. Look through the lens of availability management end to end, i.e. holistically. You remember the ITIL guiding principle on that, right? So look at availability end to end. Don't just look at it as the service, or an application, or a database, or whatever it may be. What I mean by end to end is: look at all the various components that make up availability.

Real-world experience: the application may be up and available, but there's a problem with the database, or there's a problem with the network. Technically the application is available; the problem is that when people use it, it doesn't work. The database doesn't return the data, or people can't even connect because the network's not up. Or, even worse, perhaps the application is available and the network is available, in inverted commas, but the performance on that network is so poor that your customers aren't using it. Is that still technically available? Have you defined that? Something to think about.

You may also hear people talk about availability
as uptime. Many customers and many service providers will use service level agreements or key performance indicators to measure the effectiveness of the kit and the service. Perhaps you'll hear talk of five nines: 99.999% available. When companies use cloud services (I'm thinking of AWS, Azure and Google, for example), availability refers to the individual cloud services that are being used to run an application and provide that service.

Architecture is a really key point in availability management; it has a large input into it. If you're designing a new service from scratch, really think about that architecture and make sure it's meeting the requirements. If the service is already in production and operating, are there continual improvement activities that can improve the architecture, which in turn will improve availability?

So let's get to some ITIL 4 elements. ITIL 4 defines availability as the ability of an IT service or other configuration item to perform its agreed function when required. I like to break that down into: it's a service doing what it was agreed it would do, doing it how it was expected to be done, and doing it when it was supposed to be doing it.

Let's think about ITIL a little wider here, in terms of the manual. There's talk of enabling candidates to look at IT service management through an end-to-end operating model. I always like to think of this as looking through the lens of the creation, the delivery and the continual improvement of those products and services. ITIL 4 states the purpose of availability management as ensuring that services deliver agreed levels of availability to meet the needs of customers and users.

Key metrics for availability performance and availability management: I'll go into a bit more detail on these later in this session, but first there's mean time between
failures, abbreviated to MTBF. It's a maintenance metric, used to indicate how long a piece of equipment, a piece of kit or a service operates without any kind of disturbance. It's only applicable to repairable items, by the way; we'll discuss that much later. The larger that number, the better: you obviously want the greatest amount of time between any kind of outage. Then there's mean time to resolution, MTTR. Smaller is the key here; you want that to be as small as possible. It's the time from something going horribly wrong and being unavailable to it being available again, so the smaller the number, the better. And then there's mean time to restore service, MTRS, which measures how quickly a service is restored after a failure.

Availability management activities will typically include negotiating and agreeing achievable targets for availability. Bear in mind that some of that negotiation and agreement activity may be with internal teams, or it may be with external teams, a service integrator or some kind of outsourcing arrangement.

The second activity you would typically expect to see is around designing infrastructure and applications that can deliver the required availability. So, design and architecture: you can design something to be much more resilient in one sense than in another. There are design considerations around infrastructure that will have a huge impact on your availability. Cost is clearly a consideration, because if you're designing something that is fully resilient, or as good as, perhaps you're thinking multiple data centers, maybe two of everything, maybe self-healing, and there's a cost associated with that. So you can design the infrastructure and the applications in a way that delivers the required availability, but have a thought around the fact that there's a
cost associated with doing that.

Another activity is around ensuring that the services and the various components are able to collect the data you need to measure availability. There's a technical element to that, but there's also a people element as well. Think about how you will be measuring availability. Let's assume you've got all the data, but let's make sure we're clear about the period we're measuring: is it business hours, is it 9 to 5, is it 24/7? And going a tiny step back: are we clear that we have defined what availability is? Then, clearly, you need those services and components to collect that data.

Then there's monitoring, analyzing and reporting on availability. Think about tooling. What kind of tooling will you use? Will you use your own tooling or a service provider's? Will it be commercial off-the-shelf, or internally developed? And automation: certainly in monitoring, you want that automated as much as possible. If there is some kind of event, you want it to raise that flag quickly, and you want that reporting into your service management tool. Perhaps there's some kind of metric that needs to be reported back to your customer. And have a think about which lens you're looking through when you're monitoring, analyzing and reporting: the customer's perspective, or your service provider's perspective?

And then there are improvements to availability. Clearly there are interfaces that directly impact availability. I can think of incident management as an example: a priority one incident, a P1. You've spun up a bridge, you've got all the right subject
matter experts on the call, you're diagnosing and you've fixed it. Great. Hopefully you have a process that then rolls into your problem management activity, and there'll be a major incident report and a root cause analysis report. Brilliant. This is all reactive stuff. Make sure it feeds into the availability management practice: this is what we can do to stop that from happening again, which clearly has a knock-on effect on availability.

Be proactive as well around planning improvements. Make sure you're not just being reactive, as in the case I just talked through, but think about the proactive stuff; there's more on that later. Certain components have a defined life: there's a mean time before they are going to fail. So perhaps an availability piece of work could be around looking at some of those components and playing the statistics game: that piece of equipment is x years old, or it has operated for x number of hours, so statistically it's in the period where it may cause an outage and impact our availability. Proactiveness, I suppose, is the key there, along with continual improvement from the general management practices. We're talking about a service management practice here in availability management, but clearly continual improvement, a general management practice, needs to be involved.

Okay, so let's talk about availability of the service generally. It clearly depends on how often a service fails and how quickly it recovers. There are various ways of expressing this, and the metrics you'll come across are mean time between failures and mean time to restore service, MTBF and MTRS. These are key metrics for measuring availability.

Let's look at mean time between failures. The summary is that a larger number is what you're looking for. What it's measuring is the distance in time
between outage number one that impacted availability and outage number two that impacted availability. Hopefully that's a large number. Maybe it's measured in years (it failed in 2020 and it failed in 2021), or maybe it's hours or quarters, or however your organization would measure it. But the key takeaway for mean time between failures is that you want a big number there. The other point, and I referred to it earlier in the presentation, is about being proactive. If something is failing on a regular basis and it's impacting our availability stats, the obvious no-brainer is to think that maybe we need to replace it.

Mean time to resolution, as I mentioned before, is the opposite: you want a smaller number there, the smaller the better. It's measured in time, so that could be seconds, it could be minutes, it could be hours, hopefully not days. How long is it from the start of something happening to it being back again? You may well see that in SLA metrics. For example, a service provider or a service integrator may set out priority one, two, three and four metrics: a priority one we will resolve within three hours 90% of the time, or a P2 within six hours 90% of the time. So you can tie it into those metrics. And then mean time to restore service, as it sounds, is basically how quickly something is restored after a failure.

Working out the mean time between failures is relatively straightforward, and I'll come back to the definition of operational hours. All you need to do is divide the total number of operational hours in a period, whatever that period is, by
the number of failures that occurred in that period. It's usually measured in hours. So you would say an asset was operational for x hours during the last period, and that period could be a year, but it failed y times. Maybe it's been fairly reliable and those are low numbers, or maybe they're higher numbers.

To add complexity: when we talk about operational hours, it's really important that you define them, because that can mask some of the results and the data that somebody's giving you. For example, is it being defined as business hours, and maybe your business hours are 8 a.m. to 6 p.m. Monday to Friday, or is it being measured 24/7? That will have a direct consequence on your SLA data and your reporting.

Stating the obvious: if an item is failing a lot, you need to think about replacing it, mitigating it, putting better resilience in place, updating it, whatever. RCA and problem management are definitely the way to go here. Incident management, problem management, change and continual improvement are all practices that can contribute to improving that and help here.

Mean time between failures is the predicted elapsed time between inherent failures of an electrical item, or it can be a mechanical item, during normal system operation. It's calculated as an average, an arithmetic mean: basically the average time between failures of a system. I've mentioned that you can use it proactively as well, for planning. A device may have a mean time between failures of x, and you can use that data. If you know a component or an item fails every one year, two years, five years, or whatever that period is, you can do a piece of work to look at that and work out whether you should be planning to swap that piece of equipment out.

Mean time to restore service:
so that measures how quickly a service is restored after a failure. For example, a service with an MTRS of four hours will, on average, fully recover from failure in four hours. It doesn't mean it will always be restored in that amount of time; it's an average, and you would usually use incidents and other records to analyze it. Older services were designed with quite a high mean time between failures, so that they wouldn't fail very often, but there's a definite shift towards optimizing design and architecture so that you're minimizing the time to restore and the service ultimately recovers quickly.

Thinking about metrics for a moment, and some real-world experience here: keep in mind the other elements of access to that service. Using that guiding principle of thinking holistically, don't just look at the application. There are lots of other components in this ecosystem: there's a database, there's a server, there are virtual machines, there's network connectivity. So when you agree your metrics, make sure you've got your end-to-end service reporting hat on, because otherwise it will inevitably cause a bit of friction. You'll be in the situation where you're saying the service is available, as in the application is up and operating, or the database is up and operating, but clearly if the network's down there's quite a fundamental problem: it's available, it's just not accessible. So make sure that in your contracts with service providers and service integrators, or with your internal departments, you've thought through whether you're doing end-to-end service reporting.

A slight tangent off that, only a tiny one, is the performance point again. I've seen organizations where the system is technically up: it's available, it's there, it's not down, but it's very, very slow,
like really, really slow. Customers are getting so frustrated that they're thinking, "I can't use it. I press search and I have to go and make a coffee, and it might come back 10 minutes later with some information." Is that available, or is it not available? It's just a question, really. Clearly, in my eyes, that is not available, and the way you can address it is with things like performance and response metrics, to make sure you have a definition of availability.

Let's talk about the ITIL 4 service value chain as well. Like all the other practices in general management, service management and technical management, availability management touches all of the service value chain activities: plan, improve, engage, design and transition, obtain/build, and deliver and support. However, there is a main emphasis on the plan activity: portfolio decisions and goal setting for practices, for products and for your services.

Flipping back into some real-world experience, here are the activities I would encourage you to think about if you're looking to get the best out of availability management. First, architecture. I've mentioned it a few times; it's a key element of availability management and it's so important. If you're starting in a green field and it's a brand new service, lucky you: you've got the opportunity to really look at the resilience, the scaling, the design and the continual improvement, lots of areas in which to build a resilient and available service. So don't underestimate the importance of architecture. It's right at the beginning of the exercise, of course, but really think about how this is going to work. How will it be available across multiple countries, perhaps, or regional zones, or data centers? However you think about it, just make sure
architecture is core to your availability thinking. If a system needs to be 99.999% available, your architecture will be dictated by that, as opposed to somebody saying that as long as it's 95% available, that's fine for us. Those are two different architecture approaches, and there may well be two different costs as well. If you have an existing environment, you would want to look at continual improvement activities on the architecture that can improve availability. Again, you would point back to the general management practices, where you've got architecture management but also that continual improvement element: we are where we are, so how can we make it better, what can we do, and how do we keep that momentum going? So architecture would be my number one. You really need to think about it and make sure it's at the core, both for a new service and for an existing service.

The second point would be end to end. What I mean is: think end to end when it comes to availability. Don't look at a server, don't look at an application, don't look at a particular component. It's important that you take that holistic approach; the guiding principles within ITIL 4 talk about thinking and working holistically. Look at it end to end, from start to finish. A service isn't just an application: it's the server, the associated infrastructure, the network connectivity, the databases, lots of different compute elements. So really think that through when it comes to reporting availability back to your customer or to your internal stakeholders. I mentioned earlier that a system, a product or a service may well be up: maybe the server's up and the application's up, but if the network's down, then from a customer perspective the system is not available. Now, you can
argue semantics there and say, "Ah well, the application is up, the server's up." That doesn't really wash from a customer service perspective. They cannot use the system. Sure, it's the network that's down, but from a customer perspective the service is not available. So that does need some thought. Have you defined that? Have you had those conversations? Perhaps a response rate tolerance is something that needs to be explored.

The other area I can think of that has caught organizations out in their availability reporting is not including maintenance windows, or not agreeing service hours, when the stats are presented, maybe at a monthly service review. There needs to be a maintenance window for a service provider to perform essential maintenance. Depending on your architecture, of course, you may have complete resilience, and it doesn't really matter: you could do the maintenance in the middle of the day and the customer wouldn't know, because you've thought through your architecture and you're operating from a different location, or whatever it may be. If you've got a less well-designed system, where maintenance means outage, you need to factor that in. When are those maintenance windows? I've worked in an organization where it was quite straightforward: the first Sunday of every month was classified as a maintenance window. That was reflected back to the stakeholders and the customers, and everyone was absolutely aware that the first Sunday of the month was the maintenance window. All the service packs, updates, maintenance activities, database work and so on happened on that day, and it was excluded from the reporting.

Service hours is the other point. When you talk about the operational period, when you're working the maths out behind
the scenes to get the availability figure, are you working that out 24/7, or are you working it out 9 to 5, Monday to Friday? Your stats will look very different depending on how you report on those. For the avoidance of doubt, make sure that's clear and that all parties are aware.

Automated recovery procedures are another element I'd encourage you to think about to get the best out of availability: anything that's linked to monitoring. If there's an alert of some kind that says a system is down, perhaps there can be an automated response that corrects it, and it could be something as simple as restarting the service. So if a Windows service, for the sake of argument, stalls, an alert is generated and goes off into the service management tool, and perhaps there's an automated script that says: when this happens, automatically restart service ABC within Windows. Off it goes and restarts that service, and the service is back again. So look for automated recovery procedures and corrective action. At its most practical, if there is an alert, you want to make sure that your service management tool has a ticket created and that it's in the right assignment group for whatever that system or service is.

The final point that I would encourage everyone to do is practice recovery. Look at local failures, a full system DR exercise, failover exercises and resilience testing. All of these things can have a big impact on your availability, so make sure they are tested. Perhaps that might be during a maintenance window, talking about the point of a few moments ago, but either way make sure there are regular activities around failover, disaster recovery and resilience testing so that you can maintain that availability. Certainly in some of the organizations I've consulted at over the last five years or
so, I've been involved with a number of regulatory organizations that need to see evidence that you have indeed performed these activities. They want to see the reports: what was tested, what was found, and how it worked.

So, just a very quick summary and final thought on availability management. A lot of cloud service providers, CSPs, have something like this as a central tenet of the way they deliver their cloud services and their architecture. AWS, for example, has a reliability pillar, which is part of its Well-Architected Framework. The AWS Well-Architected Framework helps architects build secure, high-performing, resilient and efficient infrastructure and applications, and it's built on the pillars of operational excellence, security, reliability, performance efficiency and cost optimization.

So if you are dealing with availability, make sure you're taking a holistic approach, because I think too often people get quite focused on a specific application. It's a guiding principle within the ITIL 4 world to look at things holistically. And the architecture, 100%: you really must focus on that in order to have a well thought-through availability plan.

Okay, that's it. Thank you very much for listening. Please do subscribe and hit the bell to keep getting the updates. Thank you very much.
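
Editor's illustration: the arithmetic described in the session (operational hours divided by the number of failures for MTBF, an average of restore times for MTRS, and an availability figure that depends on the agreed service hours) can be sketched roughly as below. This is a minimal sketch, not something shown in the session: the function names, the sample figures and the "agreed hours minus downtime, over agreed hours" availability formula are assumptions chosen for illustration.

```python
from typing import Sequence


def mtbf_hours(operational_hours: float, failures: int) -> float:
    """Mean time between failures: operational hours in the period / failures in the period."""
    if failures <= 0:
        raise ValueError("MTBF is undefined when no failures occurred in the period")
    return operational_hours / failures


def mtrs_hours(restore_durations_hours: Sequence[float]) -> float:
    """Mean time to restore service: arithmetic mean of the restore times taken from incident records."""
    if not restore_durations_hours:
        raise ValueError("at least one restore duration is needed")
    return sum(restore_durations_hours) / len(restore_durations_hours)


def availability_percent(agreed_service_hours: float, downtime_hours: float) -> float:
    """Availability over the agreed service hours (assumed formula, see lead-in)."""
    if agreed_service_hours <= 0:
        raise ValueError("agreed service hours must be positive")
    return 100.0 * (agreed_service_hours - downtime_hours) / agreed_service_hours


if __name__ == "__main__":
    # Hypothetical asset: operational for a full year (8,760 hours), failed 4 times.
    print("MTBF (hours):", mtbf_hours(8760, 4))

    # Hypothetical restore times for three incidents, in hours.
    print("MTRS (hours):", mtrs_hours([2.0, 4.0, 6.0]))

    # The same 4 hours of downtime measured against two definitions of service hours:
    # 24/7 over a 30-day month, versus 8 a.m. to 6 p.m. on 22 working days.
    hours_24x7 = 30 * 24
    hours_business = 22 * 10
    print("Availability 24/7 (%):", round(availability_percent(hours_24x7, 4), 2))
    print("Availability business hours (%):", round(availability_percent(hours_business, 4), 2))
```

The last two figures use the same downtime but different agreed service hours, which is the point made in the session about defining operational hours before anyone reports a percentage. An agreed maintenance window could be handled the same way, by excluding it from both the service hours and the downtime.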