Hi guys, welcome to my YouTube channel. My name is Saya. Today I'm going to do a video on incident management. A lot of my friends and colleagues have asked me to do a video on incident management and major incident management, so I decided to do one. This is for all of you; I hope it is informative and that you use it to do your best at work.

All right guys, before I start off with what incident management is, I need to talk about what ITIL is in the first place, because incident management comes into one of the life cycles of ITIL. So what is ITIL? As the acronym says, it's the Information Technology Infrastructure Library. Before you learn any of the processes of ITIL, you need to understand what ITIL is. ITIL is a framework that has a set of detailed practices to implement in your business to align it with IT. If I put that into one definition: the Information Technology Infrastructure Library is a set of detailed practices within ITSM, IT service management, which focuses on aligning your IT services with your business needs. I hope you understood what ITIL is: it's a framework which gives you detailed practices to implement within your business, and its focus is to align your IT services with what the business needs.

Now that we have understood what ITIL is, let's go ahead: there are five life cycles of ITIL. They are service strategy, service design, service transition, service operation and continual service improvement. So where does incident management come in? Service operation is where incident management comes into play. So that's one of the
life cycle stages, which has multiple processes within it: event management, incident management, problem management, access management, identity management, and maybe one more process that I don't remember off the top of my head right now. So now we know what ITIL is: it has five life cycles, incident management is part of the lifecycle called service operation, and service operation has multiple processes in it, like event, incident, problem, and identity and access management.

Now let's understand the objective of incident management. If you don't understand the goal or the purpose of anything that you do in your life, there's no point in doing it; that's what I believe. Incident management's goal, or objective, whatever you term it, is to minimize any adverse impact on your business due to unplanned interruptions to your IT services. And what is an incident, in simple terms? An unplanned interruption to an IT service, one that causes a degradation of your service, is called an incident. Just understand this. Let's not go into the book definitions, because that's never going to help us unless you're taking exams. If you take the foundation exam, maybe there are some questions about what an incident is and what a problem is, and even that is more scenario-based these days. So we're clear about that: an unplanned interruption to an IT service is called an incident. Now let's understand the definition of a problem,
because to understand an incident we need to understand a problem as well, a little bit, and we need to understand change as well, because IPC (incident, problem, change) are interlocked within ITIL. I will tell you why and how they are interlocked when we do the incident management process and the major incident management process.

Real quick, let's go back to the incident so that the problem becomes easier for us. An unplanned interruption to an IT service that causes a degradation of your quality of service is called an incident. Now what is a problem? The underlying cause of that incident is called a problem. So an incident is always triggered by a problem. Remember that if the question comes up: how does an incident come into existence? A problem triggers an incident. There are two types, proactive and reactive; let's not get deep into that, because today's video is on incident management. But these definitions are important. The underlying cause of an incident is called a problem, and a repetitive, recurring incident is also called a problem. If you have the same incident being repeated multiple times, you need to take some action on it, which means you need to do problem management to understand the underlying cause of that incident.

So what is a change? Any addition, modification or deletion of an existing CI, a configuration item, within your IT is called a change. Why this comes into play, and how important the interlocks of IPC are, you will understand when I do the major incident management process. OK, so without wasting any time: now we have
understood what ITIL is, what an incident is, what a problem is and what a change is, and we know incident management comes into the service operation lifecycle. In other words, we are good with the background of ITIL and the terminology, so let's understand the incident management process. I hope you're all liking this video; I don't want to make it so monotonous that you don't really follow what I'm saying. As long as you understand these basic terms, you're good. It's really helpful to remember them so that you can apply them.

Like I told you, the objective of incident management is important: to minimize any adverse impact on the business due to an unplanned interruption to, or degradation of, an IT service. That unplanned interruption is called an incident, and minimizing its impact is our objective.

I have already explained what an incident is. So how does an incident get triggered, and how does it come to you as an incident manager? For now I'm going to act as an incident manager; in the next part my role will be major incident manager. An incident can be reported in different ways. One: users can call the service desk and inform them. The other is through event management: there are automation tools where an event is triggered. We have three types of events: informational, warning and exception. Exceptions always turn into an
incident: a ticket is cut if it's an exception alert or an exception event. A warning is there to warn you, and an informational event just lets you know what is happening; for example, "application has been installed" is an informational alert. I don't want to get into examples of event management right now. So how does an incident manager get to know there's an incident? One way is through the service desk: the user reports an issue and you get notified. The other is through the machines or tools, which we call event management. I will take you through that process: how the ticket comes to you, and what you need to do as an incident manager once you get it.

So let's understand the concept of incident management. I hope you can see my screen. One second guys, I'm just trying to make things better for you. This is the workflow for incident management, for when a user calls in saying, "I have such-and-such an issue." Understand this and major incident management is going to be much easier.

First, the user calls the service desk. The service desk agent receives the call and identifies whether it's an incident or a service request. Let's ask: is it a service request? If the answer is yes, they will invoke the request fulfillment process. If the answer is no, the service desk agent is going to log a ticket first. Then he is going to categorize it, and then he's going to prioritize it. Now, what is logging? He takes the issue description and understands what the problem is, what the user is reporting, and what degradation of service is happening right now. Once he logs the ticket, then
he's going to categorize it. Categorization is very important. A lot of people say the category is just whether it's hardware or software. Yes, that's true: the service desk agent categorizes what kind of CI is impacted. But remember, categorization is also very important in prioritizing the ticket, because it tells you about urgency. Let me quickly explain what categorization is here, and also how a service desk agent understands whether it's a service request or an incident.

I've asked a lot of people how they tell an incident from a service request, and they say, "Internet is not working, so that's an incident; my password needs a reset, that's a service request." That is absolutely not correct. Understand the rules first; only then will you be able to tell whether it's an incident. Let me take the same example: "my internet is not working." Can it be an incident? Can it be a service request? According to me, yes, it can be both. And who decides that? The service desk agent will always look back at something called the service catalog, or the user level catalog; remember these words. It defines who should have access to the internet, which issues are considered service requests and which are considered incidents.

For example, the user who calls with "my internet is not working" could be a very new user who doesn't even have access to the internet. Maybe he's trying to use Google or Facebook; his intranet is working but the internet is not. Do you consider that an incident? No, that is a service request, because
he has to raise a request to get internet access. And how does the service desk agent confirm that? He looks into the directory, he looks up the user ID and he looks for a group. That group is defined in the user catalog; if the user is part of that group, he will have internet access. If not, the agent tells the user, "You don't have access; go ahead and raise an access request, and then you will have internet access on your system." That's how it becomes a service request. "Internet is not working" is not always an incident; you need to refer to the user level catalog or the service catalog. For service desk agents it's mostly called the user level catalog; they may be given their own set of definitions and documentation, but that is the terminology used.

So if it's a request, you invoke the request fulfillment process. If it is not, he is going to log a ticket. For the same example, "internet is not working": he looks into the user level catalog and says, OK, this user is in the group, he had access until yesterday, and now it's not working. That's definitely a degradation of service. So he's going to log that ticket with the details, and then he's going to categorize it, again looking at the user level catalog.

Now a little bit on major incidents. Say a server has broken down, the server is down. He looks into the user level catalog, or an inventory that he has, or the SKMS or CMDB, the configuration management database, and he identifies that
particular server, server XYZ, is a gold server where the required availability is 99.99%, and it is down. That is basically urgency. That's how you categorize. This is very important: after logging, people tend to jump directly to prioritization, which is impact into urgency, and miss it. No; urgency plays a role in categorization.

Next we have priority. Priority is based on impact into urgency, and urgency is already defined, at least to an extent, during categorization. What is impact? It can be the business impact, the user impact or the financial impact. What is urgency? How quickly you have to restore the service. Like I said, that server is a gold server whose availability has to be 99.99%, which means it cannot be down; it has to be fixed urgently. That's what impact into urgency is.

Once you define a priority, it can be P1, P2, P3 or P4. P1 is critical, P2 is high, P3 is medium and P4 is low. P5 is mostly planning: something is down because the application is still in the build stage and not yet in production, but there are some alerts while they are bringing the application into production. That's a P5. What counts as a P1, as critical, varies from company to company. It's defined in your SLA agreement; whoever is handling the transition will define all the scope. So once you define the priority, prioritization is complete. Then, say, something comes up and it's become an MI. I call that a major
incident, which basically means a P1 or a P2. Once the priority is set, you need to decide: is it an MI, yes or no? It could be an MI, a major incident. If yes, I give it to the major incident management team, the MIM team. If no, then as a service desk agent I'm going to do the initial diagnosis, and then I'm going to troubleshoot with functional and hierarchical escalation. This is where functional and hierarchical escalation comes into play: while you're troubleshooting, if something that is needed is not happening, you use it. That's if it is not an MI. If it's an MI, you give it to the major incident management team, and that is a different process, which I'm going to come to. I'm going to give you a diagram for major incident management that I'm sure you will remember forever; I'm not exaggerating. It's something I came up with, and I will explain it to you.

Let's talk a little about what happens if it is a major incident. Keep three things at the top of your head. The first is disaster recovery. The second is stakeholders that have to be notified. The third is an RCA. If it's an MI, make sure there is a problem ticket; regardless of whether it's a P1 or a P2, a problem ticket has to be there. If the answer is yes, the major incident management process gets invoked.

Why DR, disaster recovery? P1 is critical. If the data center is down, you need to invoke disaster recovery, which means you need to bring the other data center up, because this data center has facility issues, or there's a natural disaster, or something like that. Stakeholders to be
notified: yes, very important. Any customer or any management would like to know what's going on with their environment; they have to know. For that you need to know your audience: who to inform and who not to. And the third is the RCA. Like I told you already, for any P1 and P2 make sure you have an RCA, and even for P3s if you can. Problem tickets are very important. Like I explained, the underlying cause of an incident is called a problem, and if the incident is being repeated multiple times, you need to know where the problem is coming from.

If it's not an MI, if it's a normal incident, the initial diagnosis takes place, then troubleshooting with hierarchical and functional escalations. You diagnose and fix the issue, then you restore, then you resolve, and you close the ticket with the resolution.

Now what's the difference between restore and resolve? A lot of people say they're the same: the goal of incident management is to restore the service ASAP, so once you restore it, you're done. I have asked my colleagues, and they come up with answers that are partially right. They say restore is when the services are up and back to normal, and resolve is when you validate with the user and put the ticket into the resolved state in your ticketing tool, which means you can stop the clock if it's ticking. That's partially right, but you need to understand the difference. Restore is restoring the service that had the degradation: you've fixed the issue and the service is back. Resolve is restore plus USAT or CSAT, user satisfaction or customer satisfaction, which
means this: the application was down and you brought it back up, perfect; say the application's up time is 8:00 Eastern. That's restore. But what is the all-clear time, the closure time, that you record? That comes only when you resolve, and that is why you do user validation. So restore plus USAT or CSAT is called resolve. It has nothing to do with the ticket state or the logging. Restored or resolved, you always need to go back to user validation; only if the user is fine can you decide the all-clear or closure time. Whether that is the time the service came back up or the time the user validated depends on the account. Sometimes they say it's when the user confirmed, and then that's the closure time; sometimes they say no, it's when the services were restored, so take that as the all-clear time.

So I'm quickly rehashing everything. First, the service desk agent identifies whether it's a service request or an incident by looking at the user level catalog. If it's a service request, he invokes the request fulfillment process. If not, he logs the ticket with the details and categorizes it. Remember, urgency is based on categorization to an extent, and categorization also means looking at the user level catalog. Urgency is categorized here, along with software or hardware and what kind of CI is affected. Then there is priority, which is always calculated as impact into urgency. Impact is business impact, user impact or financial impact. Urgency, like I said, is based on categorization: looking at the user level catalog or inventories, you understand which CI is impacted. Once you identify the priority, you have P1 to P5: P1 is critical, P2 is high, P3 is medium,
P4 is low and P5 is planning. Once you prioritize, you decide whether it's an MI or not. If it's an MI, you give it to the major incident management team, and the major incident manager remembers three things: a check for disaster recovery, a check for which stakeholders are to be notified, and a problem ticket. If it's not an MI, the service desk agent gives it to the incident management team, or resolves it himself if it's just one user impacted. The ticket is initially investigated and troubleshot, and if required you use functional and hierarchical escalations, which are very important in incident management. Once the issue is fixed, you restore and you resolve. Restore is when the services, the impacted CIs, are up and functional. Resolve is when you check with the user and resolve the ticket with the closure time. So resolve is equal to restore plus USAT or CSAT.

This is the workflow of incident management. Do not go by the book definitions alone; you will never understand it that way. Understand this workflow, and it should help you anywhere you go. You may want to note it down, because I don't know how I can put it into the comment section, but you can refer to this diagram while going through the video.

Fantastic. So this is incident management. What did we discuss? What ITIL is; the five life cycles and their names; which lifecycle incident management comes into; the definition, the objective, the purpose or goal, of incident management. We understood the definitions of incident, problem
and change, and the incident management workflow. So far we have covered this. Now, a very important thing, and this is basically for incident managers, or for the service desk person or incident manager who handles this. He needs to understand that when it's not an MI, it will mostly be a P3 or P4, and he still needs to resolve the ticket. There is an impact, but it's medium or very low; you still have to resolve it by troubleshooting, and even here there are functional and hierarchical escalations to use if required. Remember, guys, whenever you hear the word "potential" on P3s and P4s (for P4s you usually wouldn't have it): those potential P3 or P2 tickets are likely to become P1s, they're likely to get into critical situations. For those, make sure you start using functional and hierarchical escalations. Remember, the escalation matrix is the backbone of incident management. If you don't use it, you're not capable of doing incident management; take that straight from me. If you do not escalate when it is really needed, people can question the way you work, and they can question your ability as an incident manager or major incident manager.

So it's pretty much clear so far. Now I'm quickly going to move on to major incident management, and this is very interesting. I'm going to clear the screen; I hope you have understood everything so far, what an incident is, the definitions and all of that. Before I go into major incident management: I've been doing this role for quite some time, about five years, so I've seen different accounts and different infrastructures. I also need to make you understand what
sort of questions you need to ask when you're on a major incident. You cannot be an expert in all the platforms, and by all the platforms I mean Windows, Unix, Linux, mainframes, WebSphere, web servers, network, databases. So a major incident manager is a bridge between the management (account management, vendor management, whoever you're working for) and the platform teams, between the management and the technical teams. That basically means you need good communication and the ability to articulate things on the call, which I will come to a little later; let's understand the process first.

A lot of questions about the ability of a major incident manager come up when he is not talking on a bridge call. Just knowing the process won't help you; how you utilize and implement that process is very important. You really need to take things seriously when you are on an MI: your articulation, your communication, the way you write things, the way you explain things, the way you understand things are all very important. Like I said, a major incident manager is a bridge between the management, the part of the organization that does not know the technical side, and the other set of people, who are very technical. You need to understand the technical teams and put that into a plain English description for the management team. You need to articulate things and write them down in plain English; I'll come to that communication part as well. So it's going to be a slightly longer video, but very informative. I'm sure you will all learn something out of it; I hope you learn
something out of it. So now, major incident management. This is very interesting, and trust me, you will like it. It's something I came up with when I was training my colleagues and the new folks who joined as major incident managers or incident managers from the service desk. I've trained quite a few people, and I learned from them that giving people book definitions, or the process as a plain workflow, does not help them understand. So I came up with a diagram for major incident management. The diagram looks like a butterfly, and I termed it the butterfly effect, which basically means that any small change in the environment can cause a huge impact: a small change somewhere can cause a major impact or a major change somewhere else. Let's come back to that point a little later. Why it is called the butterfly effect I am not sure; it is one theory, not something proven. Anyway, the butterfly diagram will help you understand.

So let me act as a major incident manager right now. An MI has been invoked: like we saw, after prioritization it qualified as an MI and was given to major incident management. And like I said, the escalation matrix is very important; now you will see why. I'm the major incident manager, or business recovery manager; the terminology differs from company to company. The incident manager or coordinator, whatever you want to call him, has been running the call all this while, and he says, "Hey, it's an MI, I want you to jump onto the call." As soon as you jump onto a bridge, or a SWAT call, or a technical bridge, whatever it is called, remember this diagram. I'm going to draw it now. I'm going to understand
the issue first. A butterfly has tentacles, and I take the first one as "understand the issue". Let me change color so that you can see it. As a major incident manager I cannot drive a bridge, or meet our objective of restoring the services ASAP without adverse impact on the business, without understanding the issue. So I take that as my first tentacle.

Now I take one long, straight line as my impact. The reason this impact line is longer is that the impact can be critical, high, medium or low. That's my second one. The third is the teams that are required. As soon as I understand the issue, I understand the impact. It has already been declared an MI, but I still assess the impact, because there are times when incident managers don't understand how to qualify a P1 and just randomly mark something as a P1 even when it is not. There are different examples which I don't want to get into, but sometimes you need to assess the impact again, because that shows you the sense of urgency needed to restore the services.

So you understood the issue, you understood the impact, and now you understand which teams are required, whether it's two teams or three more. Say you need the Windows team: you got them onto the call and you're still engaging the rest of the people. What do you need to do? Remember this. Like I said, the escalation matrix is the backbone of incident management, of the incident lifecycle. So I take the butterfly's backbone as my escalation. This is escalation, right.
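The impact reassessment described above ties back to the earlier rule that priority is impact into urgency. As a minimal sketch only: the four levels (critical, high, medium, low) come from the talk, but the numeric weights, the score thresholds and the name `priority` are assumptions made for illustration. Real thresholds would come from your own SLA, as the talk says they vary from company to company.

```python
from enum import IntEnum


class Level(IntEnum):
    """Levels named in the talk; the numeric weights are an assumption."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def priority(impact: Level, urgency: Level) -> str:
    """Map impact and urgency to a priority label (hypothetical thresholds)."""
    score = impact * urgency  # "impact into urgency", taken here as a product
    if score >= 12:
        return "P1"  # critical
    if score >= 6:
        return "P2"  # high
    if score >= 3:
        return "P3"  # medium
    return "P4"      # low


if __name__ == "__main__":
    # Re-check a ticket that was raised as a P1 against the reassessed impact.
    print(priority(Level.CRITICAL, Level.HIGH))
    print(priority(Level.MEDIUM, Level.LOW))
```

A lookup like this makes it easy to spot a ticket that was marked P1 without actually qualifying for it, which is the situation the reassessment is meant to catch.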
Escalations are of two types: one we will call functional escalation, and the other is hierarchical escalation. I'll label this one FE and this one HE. Let's keep this in mind, because we need the escalation matrix throughout the incident lifecycle.

Once you have identified the issue, understood the impact and engaged the technical teams, you start asking questions. Getting to this point should not take more than five to ten minutes, guys, and ten minutes is already too long. Remember your SLA: P1 SLAs vary from organization to organization, so let's take 90 minutes as the P1 SLA for now. Understanding the issue and impact is very important, but it has to be quick.

Once you have the teams engaged, even if it's one single team on the call while you're still engaging the rest, ask that team to do this. Let me change color so it's easy for you. I'm going to take this point as initial investigation. What is the initial investigation? If it's the Windows team on the call, ask them to do health checks. How is the server looking? What is the CPU utilization? What is the memory utilization? When was the server last rebooted, and how long has it been up? Were there any changes over the weekend, or yesterday, or today? Start them with that.

What is the next part? Once the initial investigation is done, you ask them to troubleshoot. During the initial investigation they have isolated some of the factors that are not involved; for example, change is ruled out: there has been no change on the server
itself, and so on and so forth. Now they start looking into the logs of the servers, so they start troubleshooting. Once they start troubleshooting and investigating, they find something, right? They say, hey, this is the problem that we have determined. All of that leads to planning. Once they identify what the problem is and where it is coming from, they start planning how to resolve it, right? Those two steps have been performed; now, how do we resolve it? They are going to come up with an action plan. This is very important: when you're in the planning stage, always, always ask for plan B. Do not get excited when they say, okay, this is the set of actions that we need to do and we will be able to resolve it by doing so-and-so. Do not get excited with that; always have a backup plan. It is just like a change (we will come to change management in my next video): whenever a change is being performed, the criterion is that there should always be a backup plan. Similarly, in an incident there should not be just a backup plan; there should be a plan A and a plan B, right? If plan A is not working, then you go for plan B. You cannot wait for plan A to be completed and only then start thinking of plan B. In your planning stage you always have to ask for plan A and plan B. These are the tentacles of the butterfly: you need to have plan A (PA) and plan B (PB). Once you have plan A and plan B, once you've determined them, you will execute the plan, right? I'm going to take this point again, the center point, and from here I'm going to call that execute. You're going to execute your plan A or plan B.
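The plan A and plan B rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: execute_plans is a made-up helper, and the two example actions are stand-ins, not real recovery steps. The point it shows is that execution refuses to start until a plan B exists, and that plan B is tried as soon as plan A fails.

```python
from typing import Callable, Optional, Sequence, Tuple

# A plan is a (name, action) pair; the action returns True when it worked.
Plan = Tuple[str, Callable[[], bool]]


def execute_plans(plans: Sequence[Plan]) -> Optional[str]:
    """Try each plan in order and return the name of the first one that works."""
    if len(plans) < 2:
        # Mirrors the rule from the call: never proceed without a plan B.
        raise ValueError("ask for a plan B before executing plan A")
    for name, action in plans:
        try:
            if action():
                return name
        except Exception:
            # A plan that blows up counts as failed; move on to the next one.
            continue
    return None  # every plan failed, so go back to troubleshooting


# Hypothetical example: plan A fails, plan B succeeds.
winner = execute_plans([
    ("plan A: restart the service", lambda: False),
    ("plan B: fail over to the standby", lambda: True),
])
print(winner)
```

Returning the name of the plan that worked also gives you something concrete to put into the summary and the timeline.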
Now, whether plan A is successful or plan B is successful, whichever plan the support team and the rest of the support groups on the call have decided on, you execute that plan. Once the plan is completed, you validate it, right? And before you execute, as a major incident manager I also want to summarize the plan with the teams, so that we are sure of what we are doing on the call. I wouldn't want to keep quiet over there and say the technical teams are doing their job, so they know what they need to do. No. Although I may not know very much of what they are doing technically, I would have the plan of action jotted down. I'm going to write it down, talk about plan A and plan B here, and summarize that plan, right? So: summarize, execute, and validate. Then finally we are back to the closure, right? This is called a closure note: we are coming back and ending up at the center point, and I'm going to call this closure. Remember these two eyes and this escalation matrix: this is your functional escalation and this is your hierarchical escalation. Why did I write FE here and HE here? Functional escalation and hierarchical escalation. How much you need them during the initial investigation depends on how you have set up your infrastructure and your technical platforms, because in 80% of the cases I see that technical support does not respond to you on the call. You say, XYZ, are you there? It's so embarrassing. You're dumbstruck, you don't know what to do, and the customer on the other side is like, you're not fit to handle a call, because they are not even responding to you. That's why, when you're doing the initial investigation for a P0, P1, P2, whatsoever, you start escalating and get the SMEs. Remember this key thing: escalation is everything. Do not wait for an
admin to create a nuisance on the call before you go and escalate. No, start escalating right from the start. The moment you have the teams required, whoever joins, ask them to get the SMEs, the subject matter experts, from their team. L1 to L2 is called functional escalation; that's why I have written functional escalation on one side. It has to happen immediately. On the other side, hierarchical escalation is to let the managers know, so that the functional escalations get completed. Functional escalation is not a separate thing: if you're not getting level 2 support, you need to call the managers. Hierarchical escalation also covers vendors. So get everybody on the call in the first 15 or 20 minutes. I know it's difficult, but you need to push for it. If you want to meet your SLAs, if you want to make sure that you fix the issue within the SLA assigned to your account, make sure you follow the escalation matrix: functional escalation and hierarchical escalation. So this is called the butterfly, right? Now, what is the butterfly effect? I told you: a small change somewhere can cause a major impact somewhere else. The stats say that 75 to 80% of incidents are triggered due to change. Remember, do not get confused: a problem always triggers an incident, which means there is an underlying cause somewhere that triggers the incident. There is a cause, but what triggers the problem? There must be some change happening somewhere: servers are being patched, a new version of an application is being deployed, network switch changes or core switch changes are happening, routers are being changed. Somewhere there is some change, and it's going to have a major impact. Say, for example, in some location they're changing a switch, and it's going to impact everyone across the globe who is working on that application, because
that application's connectivity goes through that switch, right? So that's why I call this the butterfly effect. I hope you remember this diagram; I will just rehash it quickly. The three tentacles are on the top. As a major incident manager, I would understand what the issue is; then I understand the impact, the longer line here, whether it's critical, high, medium, or low, whatsoever; then I get the teams required. The moment I get the teams required, I'm going to use functional escalation to get their SMEs and start the initial investigation. Take the center point of this butterfly as initial investigation and troubleshooting, and from there this line is called troubleshooting and investigation. At the same time, the initial investigation has to be a parallel effort; it cannot be done by just one team. Parallel investigation: if the database team is there, ask them to do their health checks; if Windows is there, ask them to do their health checks; if network is there, ask them to perform the traceroutes, and so on and so forth. If the database people are on your call, ask them to check the databases, the database servers, and the tables in the databases. See what parallel efforts you can take. Do not do one team at a time; it's difficult that way, and you have to have parallel efforts. Like I said, escalation, escalation, escalation is important to get SMEs on the call for P1s and P2s. Then you troubleshoot and investigate with the help of the SMEs, because you have the right set of people now. Remember, having the right set of people is the key to your incident management if you want to fix the issue within your SLAs. Do not wait, like I said, until a nuisance has occurred on the call. A nuisance means not responding, not doing what they need to do, because they do not have the knowledge; not that they are doing it on purpose, not that they're doing it deliberately;
they're doing it because they do not have the skills, so you need to get someone who is capable, right? Remember, there can always be two things on your call: it can be a skill issue or it can be a will issue; maybe the one person who is really skillful is not willing to help you. For all of that, the spinal cord of this butterfly is escalation. You need to have the right set of people, troubleshoot, then plan. In your planning stage, remember you need to have plan A and plan B, and you need to write them down. Once you have written down plan A and plan B, which one is being executed first? You need to summarize all that over here: which plan is being executed, why that plan is being executed, and whether that plan is valid. That's another good point to remember: is plan A valid or not? That reminds me of another point. When you understand the issue, the impact, and the teams required (that's more of the communication part, how you establish yourself on the call), you need to look for workarounds, you need to look for changes in the environment, and you need to look for the last known good configurations of the CIs. So whenever there is an issue, ask for workarounds, and not just any workaround. Like I said, plan of action A should be valid, and just like that, it should be a valid workaround. If somebody says, do this and it'll fix it, ask them if it's valid and whether they have done it before. Do not waste your time implementing something that is not going to fix it. Remember, time is everything you have when the clock is ticking. Ninety minutes is your P1 time. The initial investigation should not take more than ten minutes, and troubleshooting is where you cannot define how much time it's going to take, so this is where you need to push the subject matter experts to work on it. Then planning: the MIM, the major incident manager, is going to lead that planning part; you will jot the plan down while
you're on the call. But don't waste your time sitting and writing things down; focus on your call. Remember the points they are talking about, and ask the technical support groups to put the plan of action into your chat or into an email. If the subject matter expert has decided the plan, I'm sure you will have an admin who can write down the plans as he understands them and send them to you on the chat or whatsoever. Take that and keep it: plan A and plan B. Like I said, plan A and plan B are a must; without them you cannot proceed further. Then summarize which plan you're going to implement first. At this point, also give timelines, because if you don't give timelines for each of these things to happen, there's no way that you can meet the SLAs in the end. Remember, major incident management works two ways. You cannot be rude or demanding with them, and at the same time you cannot be so soft and so nice that you just accept whatever they say. Neither can you be so rude as to tell them, you have to do this; don't use such terms. Whenever you set timelines, ask them, hey, how much time do you think it's going to take? Let them decide, and give them two options. The way I usually do it on my calls is: Intel team, do you need five minutes or ten minutes? Now they will come back with an answer they commit to and say, hey, I need ten minutes. That's a commitment from them. So you're being courteous and at the same time you're getting a commitment. But if you say, I'll give you ten minutes, go ahead and do this, then you waste another minute discussing and arguing about it: ten minutes is not enough, it's a physical server, it takes time, I need to do so-and-so. So ask them, and if they give you a commitment of five to ten minutes, give them that time, right? Tell them: I will not come back to you unless you have
something for me, but I'll come back to you at the ninth or the tenth minute to understand what's happening, right? At the same time, like I said, when you ask them how much time they need, don't be so soft, and I would say don't be so stupid, that when they say, I need 25 minutes, you just accept it. Usually on a P1 that never happens if you have subject matter experts. If they say it's 25 minutes, you need to help them understand the sense of urgency of this issue. You need to articulate the impact and explain to them why it is important to fix the issue as soon as possible, and not give them 25 minutes. Most of the time you will not have such cases, but yes, some admins will not understand; there are some admins who will not even understand your communication. That's why clear communication on the call, with the right and appropriate words, is very important. Don't be too rough, don't be too soft, don't be too demanding; let it come from them and, like I said, let them understand the sense of urgency. That's also why you need to keep summarizing things and assess the impact each time, at an interval. While they're troubleshooting and taking some time to troubleshoot, assess the impact again. What if the issue has resolved on its own? See, this is all about major incident management skills while you're on the call. So many times I have seen on my calls that the issue is resolved without any technical intervention, and the team is still
troubleshooting, and we don't even ask. And the user is not even interested at that time, because he thinks, okay, somebody's troubleshooting, I don't need to be at my desk, so I'm going to take a coffee break and come back. No; that's why you should have escalations. What if the issue is resolved? You need to assess the impact. You need to ask the users: do you still face that issue? Are you still facing that latency? Are you still facing those errors on your screen? And so on and so forth. That is important, right? I don't understand why the complaint about major incident managers is that they do not talk on the call; if that happens, they have not understood the major incident management process. Now let's talk about some key tips and tricks. One of the key tips I told you was: give them two options, five minutes or ten minutes, or two minutes or five minutes, and let it come from them, right? The other thing: people say, you know, the MIM was not even talking on the call. The answer I get from the major incident manager is, hey, the call had already been running, I didn't understand what was going on, and I didn't want to interrupt. Then what's the point of having a major incident manager over there? So the answer to the question, what can I do, is: you need to establish yourself the moment you join the call. You need to say: hey guys, my name is Saya, I'm the major incident manager on the call, I will be driving this call to recovery, can I know what's happening? If somebody stops you at this point, tell them: I will not be able to help you if you do not let me know what's going on; I just joined. People will understand that. And the tone is very important while you're talking, right? So if they don't know you're the major incident
manager, why would they even respond to you? You need to explain to them what your role is over there. Your role is to help them restore the services, gather information, and not get distracted, not drift away from the focus on resolution, because there are multiple teams talking. Tell them: I can help you; go ahead, tell me what's happening and who's doing what. Just quick, like the snap of your fingers (I'm just giving an example): you've got to take things in and understand things quickly, so that you at least now know the issue. Here is the key technique that I follow; this is my tip to you for whenever you join a call and establish yourself. Say, hey, I'm the recovery manager, but do not directly ask them for the issue, or they will be like, who's this again? Ask them: hey guys, this is Saya, I'm the recovery manager on the call. Do you have all the teams that you need on this call? Do you have the right set of teams on the call? They say, yes, we have all of them. Then: okay, can somebody now quickly explain this issue to me? Because if you don't have the right team, I can help you get somebody. Basically you're including them, and at the same time you're making a way in, saying that I'm here to help you, not just to interrupt in a disruptive way, right? So: who do you need? If you have everyone, okay guys, just let me know who else you need, in case you need anyone; I'm right here on the call. Meanwhile, can somebody just give me a heads-up on what's happening? Can somebody bring me up to speed on what's going on over here? That's how you establish yourself. And if you don't establish yourself in the first ten minutes, there's no way that you're going to establish yourself throughout the call, because even if you try to ask a question like: guys, what's
happening, or: guys, what's the plan (and I've seen this with my colleagues and with the other account teams that I have been handling), somebody is just going to come up and say, hey, can you just keep your mouth shut for a few minutes? And, who are you? There are times when they have literally said, can you keep your mouth shut; there are times when they say, who are you, it's a waste of time, you're asking about something that's not important. So you need to know how to establish yourself. Like I said, I will establish myself: guys, this is Saya from the recovery management team, I'm here to help you. Do you have everyone that you need, the right set of people, on the call? They say yes. Okay, fantastic, let me know if you need some more help; I'm here to get it for you. Can somebody just bring me up to speed on what's happening? I'm going to talk like this and make my way in like this, so that they see me as one of the people who are trying to help them, right? And once you establish yourself, do not keep questioning them again and again; let them work. Now that you know what the issue is, focus on the call and on which team is talking and who is talking. If you know who is talking, then you need to jot it down. You need to build your timeline; you need to keep your fingers running on your laptop or desktop keyboard, or write it down on a piece of paper: who is from which team. Another key technique you should take from my video: do not address them by the platform team, for example, Windows team, what are you doing? Unix team? No. It's much easier to call them by name, and that connect will be there for them to respond. For example: hey Billy, can you just explain what's happening from the network team's side? Then he's going to respond to
you the moment you say: Billy, what's happening on the network? Did you just do the traceroute? Did you run traces, and did you find anything else? He's going to talk to you; he's definitely going to talk to you, or at least make an attempt. And if he's not there, you can ask, are you there? You're talking to him by name. You need to know who is talking and from which team. And if you join in the middle of the call: I gave you one instance where major incident managers tell me, hey, I do not know who has joined, the call is already running and I'm joining in between, so I do not know what's happening. You need to ask the incident coordinators or the service desk guys for that information. If that's not happening in your account, it's a process gap and you need to fill it. Whoever is handing over to you should tell you who is on the call and from which team. Without that, it's not going to work, because you would have to ask each one of them on the call, you're from which team, you're from which team? It's stupid; you cannot do that. So, for example, like I said, I'm going to establish myself: do you have everything, do you need any more help, do you have the right set of people? They say, yes, we have. Fantastic, let me know if you need more help from any of the teams; I can get you help. Can somebody explain what's going on here? Somebody talks and explains, and you acknowledge it, be courteous, and say, thank you, understood; and you're from which team? Now is when it makes sense to ask, not directly with, hey, you're from which team, can you explain? No, you can't be so rude, right? So guys, coming back to the major incident management process, remember this is a butterfly diagram, and the butterfly effect: somewhere some changes have happened. The questions that you need to ask on the call are: any workaround, any recent changes, last
known good configuration of the CIs, and any valid workarounds, I would say. These three questions are very important. Also, like I said, in major incident management, if you have to invoke disaster recovery: who are the stakeholders who have to be notified? Like I said, you're a bridge between the stakeholders, who do not know anything technical, and another set of people who only know the technical side. They will talk to you in technical jargon, and you might not be able to understand it either, but you need to find a way of asking them, what do you mean by this? Then you put that into a plain English description and notify your stakeholders. You cannot talk to stakeholders in jargon, like: the command was wrong, so it deleted the script, and the deleted script was so-and-so. No. The simple version would be: so-and-so team ran a command to delete the script; post deletion of the script, the application started working fine. That's how you need to articulate things when you write and when you explain. So this is the difference between the incident management and major incident management workflows. I showed you this butterfly-based incident management. If you remember, the two eyes of the butterfly are functional escalation and hierarchical escalation; the spinal cord is escalation itself, the escalation matrix; the tentacles are the first things the moment you join: issue, impact, and the teams required; and the center point, the belly button of the butterfly I would say, is initial investigation with parallel efforts from all the teams. Then you have troubleshooting, then you have planning. Basically the diagram just looks like a butterfly. Then, in your planning stage, if the SME is deciding the plans, you need to make sure the incident coordinators are jotting down that plan, or, you know, the admins are jotting
that plan down, and you need to listen. You need to know which plan is being executed. Ask for valid workarounds; ask for valid plan As and plan Bs. Ask them, how sure are you? The SMEs should be able to tell you. Once you know that, you need to summarize: this is what we have come up with. Summarize for the VPs and the management who have joined the call. There are some VPs or executives on the call, and this is where they'll ask you what is going on. You say: this is the issue, this is the impact, and this is the current status. Give them everything: the initial diagnosis was done, and the teams have come up with a plan; this is plan A and this is plan B. Keep it simple for them, so they understand that we are on the right track. And explain it to the business users too, in such a way: hey guys, so far we have identified the issue and we have two plans now, plan A and plan B. Don't say that we have a plan; say we have two plans: if this fails, this should work. That's the sense of assurance you're showing them, the surety of fixing the issue. You say: the support teams have come up with two plans, plan A and plan B; hopefully plan A works, but if not, we still have plan B, right? That's the summary part. Once you summarize, then you execute whichever one you're doing. Once you've executed the plan, you need to ask users to validate. This is where resolve comes into play: restoration plus USAT, or CSAT, is called resolve, right? So you validate, then you resolve, and when you close the ticket you go back to the center and make the butterfly complete; that's your closure. All right guys, I hope this major incident management video has helped you. I know it's a one-hour-long video, but this request has been coming to me for a long time from all my colleagues and friends. They say, why don't you do a video for a lot of viewers? So a lot
of people, myself included when I started incident management, looked at some incident management YouTube videos. They talk about the workflows, they show some examples and all that animated stuff, but I didn't get a clear understanding of how I can establish myself. A lot of major incident managers do not know how to establish themselves on the call. So the voice quality and the tone have to be set right. Establish yourself in the first five minutes; if you do not establish yourself in the first 5 to 10 minutes, there's no way that anybody is going to respond to you and respect you while you're on the call. So guys, I hope this information was very useful to you, and I hope you all make use of it. Leave comments if you have any questions; I will respond as and when I get time. This is my first YouTube video, so I hope you all liked it. I'm going to make more videos on different processes within ITIL, and other videos as well. So, like I said, leave comments with any questions that you have, and I should be able to help you with answers. Thank you guys for watching, and please do subscribe to my YouTube channel and like my video. I'm sure I'll come up with more videos like this with more information. Thank you guys, you all have a great day, and thank you so much for watching. Thank you.