Hi there, and thanks so much for clicking on this video. My name is Bradley Knapp and I'm with IBM Cloud. And the topic we're going to discuss today is what is an RCA, or a "Root Cause Analysis".

All right. So to start off with, an RCA is a standard process that you should go through any time within the technology industry that you have what I like to call a customer impacting event, or a serious event where something has gone wrong and it has resulted in serious problems for your customers. It could be a downtime outage. It could be a loss of network connectivity. It could be a loss of electricity. But no matter what the problem is, this RCA process, which is seven steps, is designed to help you not only identify what the problem is, but also figure out how to fix it so that it doesn't ever happen again. So with that in mind, let's jump right in.

Right. So the first step in an RCA, and I know this seems basic but it is the first and most important step, is you must identify what went wrong, right? You have to identify your problem. And that means you have to define your problem. It's not just a matter of figuring out the symptoms. The symptom is "my computing environment stopped being available" or "the database dropped offline". That's a symptom. That's not identifying what the problem is. The first step in that RCA is to figure out what it is that went wrong.

And so, in order to identify what went wrong, your second step, which is really related to the first, is you must collect data. Because the decisions that you're going to make as part of this RCA process have to be based on data. They can't be based on guesses. They can't be based on conjecture. You have to know what went wrong and you have to have the data to back that up. Sometimes you get a little glitch that resolves itself automatically. You're still going to want to run an RCA process on that, and you want to know what caused the glitch.
You can't just trust that it's going to magically fix itself in the future. So you must collect data.

Now, the next step. You have identified your problem, right? You've defined it. You've got your data. You now have to ask why, and asking why is more than just asking the question. You have to make causal connections. So as we're asking why, we have to get really in-depth, right? It can't just be "all right, well, the power went out and as a result, these breakers tripped, and then when the power came back on, the breakers didn't automatically reset". You have to know, why didn't they automatically reset? Is it because you weren't doing your preventative maintenance correctly? Or you were doing your preventative maintenance correctly, but the schedule wasn't right? Or even though you were doing the preventative maintenance and the schedule was right, maybe the equipment just failed. Did it fail in service? Did it fail out of service? Do you need to go back to the manufacturer of that equipment and find out what on earth is going on? Because things did not happen the way that they should have. You have to ask why and you have to make those causal connections.

And remember, one of the biggest goals of the RCA is not only to figure out what you did in order to solve the problem. You also have to figure out how to keep it from happening again. If you don't get to that point, there's no point in doing an RCA. It's just a paperwork exercise. Why even bother with it?

So we've identified what went wrong. We've got our data. We've asked why, and we've made those causal connections between everything that went wrong to figure out what happened. Because remember, in the world of technology, our problems are very, very rarely a single thing. It is almost always a cascading error of some kind that started with something simple and cascaded into something serious. So we've made those causal connections. Now we have to actually figure out what we are going to fix. Right.
We are going to identify those corrections. So we've identified what it is that we're going to fix, and solving it isn't just a matter of figuring out what it is you're going to fix. You've also got to figure out how to keep it from happening again. Right. And so once you've identified your corrections, a huge, huge piece of that is that you need to figure out what defects you found in your data collection stage. So do you need to improve your monitoring? Are you collecting all of the things that you need? Are you also logging it? Did you figure out that you're actually monitoring the things correctly, but you're not storing that data? Because monitoring and logging go hand in hand. There is no point in having active monitoring whose data you don't also save so that you can do this kind of analysis later. Likewise, there's no point in logging data that no one is ever looking at. So they go hand in hand. And that's a big piece of identifying the corrections, identifying what you are going to fix.

All right. So we've found our gaps in monitoring. We've found our gaps in logging. We've figured out what other corrections we need to make. What do we do now? We've got to implement the solution, right? Now is the implementation phase. So, implementation. We have to implement not just the short-term fix that we used to solve the outage that we had, but we also have to implement all of the long-term things, right? Monitoring, logging, other corrections, software defects. Maybe we have a change management problem. You have to implement all of those fixes and you have to get them out there, because there's no point in doing the work to just let it sit, right? If you write up the RCA and you don't implement all of the changes that you need in order to make sure it doesn't happen again, then again, it's a paperwork exercise. It's not worth the time.
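
As a rough illustration of the point that monitoring and logging go hand in hand, here is a minimal Python sketch of a gap check. Everything in it is an assumption made up for illustration: the function name `monitoring_logging_gaps`, the three sets, and the signal names are hypothetical and do not come from any real tool or from the talk itself.

```python
# Hypothetical sketch: all names and signal lists below are invented for illustration.

def monitoring_logging_gaps(monitored, logged, reviewed):
    """Compare what is watched live, what is stored, and what anyone actually reads."""
    return {
        # Watched live, but nothing is saved for later analysis.
        "monitored_but_not_logged": sorted(monitored - logged),
        # Saved, but nobody ever looks at it.
        "logged_but_never_reviewed": sorted(logged - reviewed),
    }

# Assumed inventory, loosely inspired by the power and breaker example.
monitored = {"power_feed", "breaker_state", "network_link"}
logged = {"power_feed", "network_link", "room_temperature"}
reviewed = {"power_feed", "network_link"}

gaps = monitoring_logging_gaps(monitored, logged, reviewed)
for kind, signals in gaps.items():
    print(kind, signals)
```

In this made-up inventory, the breaker state is monitored but never stored, and the room temperature is stored but never reviewed. Each of those would be a candidate for the list of long-term corrections to implement.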
And then the last step, and this is the one that is often the hardest for everyone involved in any kind of a customer impacting event, is communication. So I'm just going to put this up here as "comms". And I'm going to underline it. Actually, I'm going to underline it twice. Communication is so important because once you have figured out what the problem is, you've figured out how you're going to fix it, you've figured out what gaps and defects you have, and you've implemented the fixes for those gaps and defects, you have to keep your stakeholders apprised of what is going on.

And it's hard for us to admit that things went wrong. It's hard for us to admit that there were failures that were our fault, that we have acknowledged and that we're going to fix. And so if we think about a company culture, comms around RCAs are so important because it is acknowledging to your customers, "yes, we're not perfect". We made a mistake, or the vendors that we selected had a problem, or really just about anything that could go wrong will eventually go wrong. But we are reassuring you through our communication in this RCA process that we know that things happen, and we are going to fix it and ensure that it never happens again.

This comms piece is the most important part. You can't just write up a two or three sentence, "Yes, something broke. We have identified fixes. We've implemented them and we'll make sure it doesn't happen again". You need to go one level further than that. You have to restore that trust and restore your customers' confidence in you by acknowledging that you are not perfect and that you're going to fix things so that they don't happen again.

And then you've got to keep that communication going. Once you've reached this implementation phase and you've actually got the fix rolled out, at that point keep up with your customers. If somebody had a CIE, a customer impacting event, six months ago, reach out to them. Be sure that they're still OK.
Be sure that they have accepted the plan that you gave them on how it's never going to happen again, and be sure that they are OK with it and that it's meeting their needs.

So thank you so much. Hopefully this was helpful to you. If you have any questions or comments, please feel free to share them with us below. If you enjoyed this video and you would like to see more like it in the future, please do like the video and subscribe so that we'll know to keep creating for you.