When you're working with your computer at home, you can decide to perform an operating system upgrade or modify something on your network without going through a large set of processes and procedures. But if you're working in a large computing environment, there's probably a formal change control process. This means if you want to make a change to an application or an operating system, you'll need to follow these very specific guidelines in order to make that modification. This lets everyone know what the change is going to be. It allows you to make the change in an environment that's very controlled.
And if there are any problems with making that change, it gives you options for rolling back to a previous configuration. This is usually part of a formal corporate policy, and there are usually well-defined procedures for implementing any type of change in your environment. A common change control process might include planning the change that you'd like to make, estimating any risks associated with making that change on the network, and having a recovery plan if something goes wrong during the implementation. Before making the actual change, we would run tests in the lab and perform a number of simulations to see what the effects might be. Then we would document everything associated with this process and present our request to the change control board.
The change control board can then decide whether to approve the change you'd like to implement and what dates might be available to make that change in the environment. Every time we're presented with some type of problem that needs to be solved, there's a standard flow that we can go through to help troubleshoot that particular issue. This is the troubleshooting process, and it starts with a system or application that may be broken and takes us all the way through to the point where we would fix or resolve whatever the issue might be.
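To make those pieces concrete, here's a minimal sketch of what a change request record might capture before it goes to the board. This isn't tied to any real change management tool, and every field name here is hypothetical.

    from dataclasses import dataclass, field

    # Hypothetical change request record; all field names are illustrative only.
    @dataclass
    class ChangeRequest:
        description: str                 # what will change and why
        risk_estimate: str               # e.g., "low", "medium", "high"
        recovery_plan: str               # how to roll back if something goes wrong
        test_results: list = field(default_factory=list)  # lab tests and simulations
        approved: bool = False           # set by the change control board
        change_window: str = ""          # date/time window assigned by the board

    # Example request for the board to review.
    request = ChangeRequest(
        description="Upgrade the reporting application from v2.1 to v2.2",
        risk_estimate="medium",
        recovery_plan="Restore the pre-change backup image of the application server",
    )
    request.test_results.append("Lab upgrade and simulation completed with no errors")

The board would review a record like this, approve or reject it, and fill in the change window.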
The first step in this troubleshooting process is to collect as much information as possible about the problem. We need to get details about what people may be seeing.
It may be useful to get screenshots of the error occurring or a list of error messages that are provided to the user. Sometimes these problems are associated with a single symptom. But with these complex applications and operating systems, there may be multiple symptoms caused by one or more different types of problems.
Although it's always good to get an email documenting the information that you might need, it's often worth reaching out to the user directly. They can sometimes tell you more over the telephone about the issue than you would ever receive in an email. It might also be good to look at your records to see if anything changed in the environment between the time this was working and the time this problem was reported.
There may have been patches or changes to the application, and all of these problems may have occurred after those patches were put in place. It may seem that there is a lot to take in during this information gathering process, and in some cases, there certainly is. But if you break all of this down into smaller pieces and approach each symptom as its own individual problem, you'll have a better method of stepping through each one of these issues and resolving all of the problems. Before continuing with the rest of this troubleshooting process, this might be a good time to take backups of everything that you have. During this troubleshooting process, we might make changes to an application or an operating system.
And if we have any problems with those changes, it would be good to have a method to roll back to a previous configuration. It might also be good to look at documentation from the change control board to see what may have changed in the environment that the users may not know about. A single change to a router or firewall could have a dramatic effect on how an application performs.
But the only way you would know that change was made is by looking at the documentation from the change control board. And of course, the application or the operating system itself may have a set of log files that can give you a little more insight into the problems that are occurring for this user. Now that we've collected as much information as we can about this particular problem, it's time to start theorizing about what the root cause of this issue might be. We should start by looking at the most obvious causes. Occam's razor tells us that the simplest explanation is often the most likely, which is why you hear technicians ask whether a system is plugged into the wall or whether the network cable is seated securely on the back of your computer.
Those are simple problems to resolve, and if it's a loose cable, you can solve the problem right away and go on to the next issue. But it may be that the problem we're having is not something common or usual.
So we need to think outside the box about what might be causing this particular issue. We might want to make a list of all the possible causes. We might put the more obvious causes at the top of the list and the more unusual root causes lower in that list.
This will give us a starting point. We want to test our theories to see which one of these may be the root cause of the issue. We might also get clues about where the root cause might be by looking at documentation on Google or in any internal knowledge bases. Someone else may have already come across this exact issue, and having documentation from that person may allow us to resolve this particular problem immediately.
We can now test our theory to see if we found a problem or not. If we look at the top of our list, it may say, check the power cord. So we'll check the power cord. If the power cord is already plugged in and it's getting power but we're still having this problem, then we know the problem is not with the power cord.
We can go back to the next thing on our list. The next theory on our list might be to check the network connection. We'll check the network connection and see if that solves the problem.
If it doesn't, we'll go back to our list of theories. If we go through all of our theories and we're still not identifying where the root cause might be, it might be time to bring in a third party or an expert who's worked with these types of problems before. Or it may be that one of the theories on our list ends up resolving the problem.
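As a rough illustration of that step-through, you could picture the theory list as nothing more than an ordered set of checks, most likely first, with an escalation path if every check comes up empty. The theories and test functions below are placeholders for the example, not a real diagnostic library.

    # Placeholder test functions; in practice each would be a hands-on check.
    def check_power_cord():
        return False        # pretend the power cord was already plugged in and fine

    def check_network_cable():
        return True         # pretend reseating the network cable resolved the issue

    # Theories ordered from most obvious to most unusual.
    theories = [
        ("Power cord is unplugged", check_power_cord),
        ("Network cable is loose", check_network_cable),
    ]

    def find_root_cause(theories):
        """Test each theory in order; return the first one that resolves the issue."""
        for name, test in theories:
            print(f"Testing theory: {name}")
            if test():
                return name
        return None         # nothing on the list worked; escalate to an expert

    root_cause = find_root_cause(theories)
    if root_cause is None:
        print("No theory confirmed -- bring in a third party or an expert.")
    else:
        print(f"Confirmed root cause: {root_cause} -- move on to a plan of action.")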
If one of those theories does fix the problem, then we can move to the next step, where we want to create a plan of action. We've now confirmed in our lab that one of those troubleshooting steps resolves the issue. So how do we implement that troubleshooting step in the actual production environment?
To be able to do that, we'll need a plan that not only incorporates the change that we want to make, but also allows us to revert back if we run into any problems. We might want to look at documentation from the operating system vendor or the application vendor to see what their suggestions might be for implementing this particular fix. And once we have information from them, we can create our own plan for implementing that fix in our production environment. Of course, we'll not only need that plan A that steps through the primary fix, but we might also need alternate plans if we happen to run into problems during the implementation. Of course, there should always be a rollback as well.
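One simple way to keep those plans organized is as an ordered structure with the primary plan, any alternate plans, and the rollback steps. The steps listed here are purely illustrative, not from any particular vendor's documentation.

    # Hypothetical plan of action: primary steps, an alternate, and a rollback.
    plan_of_action = {
        "primary": [
            "Apply the vendor-recommended patch to the application server",
            "Restart the application service",
        ],
        "alternate": [
            "Install the patch manually from the vendor archive if the installer fails",
        ],
        "rollback": [
            "Stop the application service",
            "Restore the pre-change backup image",
            "Confirm the application responds on the previous version",
        ],
    }

    # During the change window, work through the primary plan step by step.
    for number, step in enumerate(plan_of_action["primary"], start=1):
        print(f"Primary step {number}: {step}")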
So if you go through these plans and in the middle of the implementation something completely unexpected happens, you can roll everything back to the original configuration. Now the change control board has given us a time and a date to make the change, so we take all of the plans that we've created and implement that change in our environment.
Usually, that change control window gives us a certain amount of time that we can be down, and we want to be sure everything is completed within that window. If it's a very small window, we may have limited time, so we may need to pull in additional resources to help perform multiple functions simultaneously. Once the fix has been implemented, we still don't know if it has actually fixed the problem, and the only way to tell is by performing some tests. Usually, there is a set of tests that you've defined prior to making this change that allows you to test the environment, confirm that it's still working, and confirm that the original problem was resolved. This is also a good opportunity to implement some preventative measures so that this particular problem does not occur in the future.
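Those pre-defined tests could be as simple as a list of checks you run once the change is in place, something like the sketch below. The check functions are stand-ins for whatever would actually exercise your application.

    # Placeholder verification checks; real ones would ping services, load pages,
    # or re-run whatever originally demonstrated the problem.
    def service_responds():
        return True

    def original_error_gone():
        return True

    verification_checks = [
        ("Application service responds", service_responds),
        ("Original error no longer occurs", original_error_gone),
    ]

    def verify_fix(checks):
        """Run every check and report whether the change can be considered verified."""
        all_passed = True
        for name, check in checks:
            passed = check()
            print(f"{'PASS' if passed else 'FAIL'}: {name}")
            all_passed = all_passed and passed
        return all_passed

    if verify_fix(verification_checks):
        print("Change verified -- problem resolved within the change window.")
    else:
        print("Verification failed -- execute the rollback plan.")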
And of course, after all of this is over, we have to be sure that we document everything we did, not only because we need some way to confirm the changes we made, but also because if we run into this problem again later on, we'll have documentation that tells us how we resolved it last time. This documentation might list the symptoms the users were having, list all of the changes we made to resolve those issues, and explain what the results were after implementing those changes.
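As one example of what that record might look like, here's a hypothetical resolution entry captured as plain JSON, ready to attach to a ticket or knowledge base article. The symptoms, changes, and results shown are made up for illustration.

    import json
    from datetime import date

    # Hypothetical resolution record; every value here is an example.
    resolution_record = {
        "date": date.today().isoformat(),
        "symptoms": [
            "Users receive a timeout error when opening the reporting module",
        ],
        "changes_made": [
            "Increased the database connection timeout from 5 to 30 seconds",
            "Restarted the reporting service during the approved change window",
        ],
        "results": "Reporting module loads normally; no timeouts observed after the change",
    }

    # Print as JSON so it can be pasted into a help desk ticket or KB article.
    print(json.dumps(resolution_record, indent=2))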
Most environments will have help desk or knowledge base software that's perfect for storing this documentation so that everyone will have access to this data. So there's our troubleshooting process. We start the process with a system or application that's broken.
We identify and gather information about what the problem is with this system and then create a list or a set of theories on what we think might be causing this issue. We would then step through every single one of these theories to see if it fixes the problem or if we're back to square one. And if it does fix the problem, we can create a plan of action for implementing the fix in our production environment.
Once we get a slot from the change control board, we can then implement the plan. And once it's implemented, we can verify that the entire system is working as expected. At that point, we can document everything that we did from the very beginning so that next time we know exactly the path to take.