Hello all. Today we are going to see what chaos engineering is, how to do chaos engineering with LitmusChaos, and how to use chaos engineering to test the resiliency and reliability of a system. Let me first quickly introduce myself: I am Ruturaj Kadikar, and I'm working as a senior SRE at InfraCloud Technologies.
As you can see, these are some well-known companies, and you might be wondering what they all have in common. Let me answer that: all of these companies have faced major outages in the past few years. Now let us look at some statistics around outages. According to the latest Uptime Institute outage report, these are the ten major outages of 2022 and 2023. Starting with the most recent one, the Federal Aviation Administration faced a major outage due to a configuration issue, which caused all US flights to be grounded. Most flights were cancelled or delayed, and it incurred a major business loss.

Reliability is getting consistently stable results over a period of time without facing any issues.
With reliability you can achieve higher SLAs, and accordingly you can set higher SLOs for your business. Another metric for reliability is MTBF, the mean time between failures: the higher the MTBF, the higher the reliability of your system.

Now there might be a question: why test resilience at all? We have seen that outages happen and have to be resolved, but one can still claim, "I don't face any outages, so why test resilience?" The answer is that it may help you avoid downtime. You may not have faced downtime so far, but the possibility is always there. Testing resiliency gives us a correct understanding of the overall behavior of the system when it is subjected to failures, and by mitigating those failures we eliminate the system's weaknesses, thereby minimizing unexpected issues in production. This will increase the mean time between failures, which will increase the reliability of your system, thereby helping you achieve the agreed SLA. It will also enhance the end-user experience.
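(For reference, these are the standard textbook definitions behind the metrics mentioned above, where MTTR is the mean time to repair; nothing here is specific to Kubernetes or Litmus:)

```latex
\mathrm{MTBF} = \frac{\text{total operational time}}{\text{number of failures}},
\qquad
\text{Availability} \approx \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
```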
Sometimes the system may appear to be working correctly, but intermittent issues can affect the system's responses and degrade the user experience. Lastly, many industries and regulations require that systems have a certain level of resilience and uptime; you can use this testing to ensure your system meets these requirements and to avoid potential legal or regulatory issues.

So next, why test resiliency in Kubernetes? We already know that Kubernetes operates in a high-availability mode, so why go for this testing specifically? With microservices applications hosted on Kubernetes, the underlying architecture can become complex and critical, with a lot of interconnected services; any minor issue can have a domino effect and turn into a disaster. It is a distributed system, with many people working on different pieces that all need to be put together; there may be human errors, or people not following best practices, which can lead to failures. And lastly, Kubernetes is evolving at a rapid pace; there may be incidents where some APIs get deprecated and that goes unnoticed, causing failures in your production.

Now that we have talked about how to test resiliency and what failure domains are, let's dig deeper into the failure domains. We have seen earlier that failure domains are the critical areas which can cause major issues in your system. So what are the failure domains in Kubernetes?

First is the network: some latency, packet loss, or jitter in the network can cause major outages in your system, so this is the first failure domain. Next is pod crashing: we know that a pod sometimes crashes abruptly — it is ephemeral in nature, so after its lifecycle a new pod comes up — or the pod gets stuck in CrashLoopBackOff. There may be init containers that get stuck in their respective processes, which can affect the overall scalability of your system; the pod may not be able to scale properly. There may be issues with image registries, where a particular image cannot be pulled into your Kubernetes cluster from the image repo. There may be issues with processes like the kubelet or the container runtime: sometimes the kubelet, or the container runtime, abruptly stops working on a particular node. There may be issues with the nodes themselves, like abrupt node terminations, or resource saturation on a particular node — saturation in terms of compute or storage, such as disk-full errors — which may cause issues in your system. Then there are load patterns: say you need to test your system with bursty or spiky load patterns, so that you understand how your system behaves when such load is put on it. Lastly, we can categorize configuration or human errors: randomly changed configurations, randomly changed environment variables, and service dependencies — a particular service not being able to resolve a dependency, or one service depending on another service that is not reachable. In this way we can categorize the failure domains inside Kubernetes and create chaos for the system accordingly (a rough sketch of how a few of these can be simulated by hand is shown below).
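To make this concrete, here is a minimal sketch of how a few of these failure domains could be simulated by hand on a throwaway test cluster, using plain kubectl and tc; the pod, node, and namespace names are placeholders, not anything from this talk's demo, and tools like Litmus automate exactly this kind of injection:

```sh
# Abrupt pod crash: delete a pod and watch how quickly a replacement becomes Ready
kubectl delete pod <test-pod> -n <test-namespace>

# Node going away: drain a node and observe how workloads get rescheduled
kubectl drain <test-node> --ignore-daemonsets --delete-emptydir-data

# Network latency: inside a test container (needs NET_ADMIN), add 200 ms of delay
tc qdisc add dev eth0 root netem delay 200ms
# ...and remove it again afterwards
tc qdisc del dev eth0 root
```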
Now, we know that applications are not hosted only in Kubernetes; failure domains can also exist beyond Kubernetes, and the first category there is databases. Many companies host their databases outside of Kubernetes, in the cloud or on-prem.
So, what are the failure domains for databases? First is network partitioning: there might be an issue within your database cluster — say there are three nodes in the cluster and one particular node drops out, creating a network partition.
This may create data inconsistency or a split-brain scenario, and it may have a cascading effect on the entire system: because that node is taken out, recovery of the data becomes very complex, and replication may be impacted. You need to address these issues whenever a network-partitioning incident occurs.

Next is time travel — again somewhat related to the network, but here it is a synchronization issue with NTP. It can create data inconsistency, security vulnerabilities, and event-synchronization issues; due to incorrect timestamps there may be problems with log analysis and debugging, which in turn has legal and compliance implications. You also need to check the impact of latency and packet loss on your databases.

Then there may be issues with access, starting with incorrect credentials.
Can you connect to the database with incorrect credentials? Is there any authorization bypass? What is the effect of expired tokens or expired credentials? There can be many more access issues, around permissions and so on, so you need to categorize them as an access failure domain. There may again be node termination which, as we have seen, can create a network partition — or the issue may be smaller, but you still need to address it and observe how the system behaves. You also need to put different types of load patterns on your databases to ensure the read and write cycles keep working properly.

The next category of failure domains beyond Kubernetes is cloud services. One main issue with the cloud is instance terminations and restarts: there may be random or abrupt instance terminations, and you need to evaluate their impact. Say a new instance comes up and it takes around one or two minutes — is your system able to cope with that short window? There might be huge traffic during that time, and those two minutes can cause you business loss. Next is security group or NACL configuration: there may be accidental human errors while configuring security groups and NACLs, which can cause huge business losses because communication is directly hampered. You can also check load balancers: inject high load onto them and check whether they can cope with it. In an AWS scenario, you can check whether the targets are healthy, and if they are not, what the impact on your system is — or what the impact of draining those targets is. Lastly, you can simulate an AZ-down scenario: considering that your application is hosted in a fully HA mode, spanned across multiple availability zones, you take down one particular availability zone and measure the impact on your system. Ideally there should not be any impact, because your application is hosted in HA, but there may still be some issues, and you need to evaluate them beforehand.

We have now seen the basics and the context around resiliency and reliability, and how chaos engineering plays an important role in testing them. Let's take a step further into chaos engineering and how to actually do it. Let's start with the principles of chaos engineering. The first thing is that you need to hypothesize about the steady state.
On a normal day, with normal traffic, the way your system responds can be considered its steady state. So, once you know your steady state, you need to identify the failure domains — identify where things can go wrong —
and accordingly create chaos scenarios and run those experiments on your system. Then you verify whether your hypothesis and the practical result match; if there are differences, try to mitigate them and improve your system. You can start with a minimal blast radius and slowly increase it, so that whenever an unexpected issue hits production, you know which solution to apply and you can minimize the overall blast radius.

What tools are available for chaos engineering? There is LitmusChaos, Gremlin, Chaos Monkey, Chaos Mesh, and you can also use AWS FIS. All these tools mainly do the work for you: you create the chaos and inject it using them. But for this talk — and personally why I prefer LitmusChaos — the first thing is that it is open source, so anybody can use it easily. You can use it in a centralized or a distributed way. One use case I found very helpful: if there are multiple accounts in your organization and you need to execute chaos in a centralized way — say one central account and multiple spoke accounts — you can do that easily with LitmusChaos, because it has agents which you can deploy in different environments and execute the chaos there. Next, it is flexible, whether that is scoring the chaos scenarios or designing them; it is very flexible and easy to use.
And lastly, the other thing I personally found good is that it has good integration with AWS SSM. What is AWS SSM? In AWS, you can write scripts around whatever functionality you want to execute, create a Systems Manager document around that, and then run that document. Litmus can integrate easily with such a document, and through it you can induce chaos in your AWS accounts as well. So for all the failure domains beyond Kubernetes that we have seen, if you have an AWS account, Litmus can be very useful in those scenarios too.

So, we have seen what chaos engineering is and how to use it to increase the resiliency and reliability of a system. You created experiments around chaos and executed them — what now? Chaos engineering and resiliency testing is a cycle, not a one-time activity; resiliency testing should be periodic in your organization. You can have a resiliency framework built on the points we discussed earlier: define a steady state, form a hypothesis, execute chaos, verify the steady state, look at the difference between what you hypothesized and what you actually got, create reports, mitigate those problems, then define the steady state again with a new vision, create a new hypothesis and new experiments. In this way you can minimize the outages happening in your system by minimizing the unexpected failures in production.
Secondly, you can have resiliency scoring. Say there is a pod-crash chaos: a new pod can be spawned if one particular pod is terminated, so you can score that chaos with minimal points; whichever chaos you introduce that may have a greater blast radius, you score those experiments higher accordingly. You can have game days, where on one particular day you execute chaos and let other teams resolve the issues, so that the system knowledge in your team increases.
You can have periodic resiliency checks and reporting, and you can have resiliency checks in your CD pipelines. Let's say you are shipping a new release.
You can have a chaos experiment integrated with your CD pipeline which runs the chaos against your new release, so you can see beforehand what fails or what the effect of the release is. And lastly, it will improve your observability posture: as we saw earlier, you will be able to gauge the impact of the chaos, and if there is something wrong in your system you can quickly find out what is going on.

Now let's see how to run these chaos experiments with LitmusChaos in practice. Let's go through the setup. I have set up a small EKS cluster in AWS, and I have three main namespaces across which all the applications are divided. The first is the litmus namespace, so let's get the pods in the litmus namespace.
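(For reference, this is roughly how such a stack can be stood up and inspected — a minimal sketch assuming the standard litmuschaos Helm repository; the release name and values here are illustrative, not the exact ones used in this demo:)

```sh
# Add the LitmusChaos Helm repo and install the control plane (Chaos Center)
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install chaos litmuschaos/litmus --namespace litmus --create-namespace

# Check the control-plane pods
kubectl get pods -n litmus
```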
As you can see, Litmus is deployed in the litmus namespace, and I have used the Litmus Helm chart to deploy the Litmus stack. With this Helm chart you deploy the Litmus control plane and the Litmus agent. This agent is deployed on the same cluster as the control plane, hence it is called a self agent; whenever you deploy the agent outside of that cluster, it is called an external agent. The next namespace is the Prometheus stack: since we need to observe whatever chaos we are doing, I have deployed the Prometheus stack, including Grafana, in this namespace. And for test purposes I have used a microservices demo application called Sock Shop, which is deployed here. This is the bare-minimum setup I have done for this demo.

Now let's look at what LitmusChaos looks like. Whenever you log in to LitmusChaos, you will see this UI, which is called the Chaos Center, and here you can see chaos scenarios. A chaos scenario is nothing but several chaos experiments bound together: a scenario can contain one chaos experiment or multiple. So this is a chaos scenario, and this is one experiment which I have executed inside that scenario. Then you can see the delegates: as I mentioned earlier, since the agent is running in the same cluster as the Litmus control plane, it is called a self agent. Then you have ChaosHubs, which hold predefined templates for the chaos experiments you want to execute: there are templates for AWS, for Cassandra, for pod deletion of CoreDNS, some experiments around GCP, and then the generic ones — pod delete, with which you can delete any particular pod or kill any container, increase CPU or memory, or introduce network corruption. So all the failure domains we saw earlier in this talk can be exercised with LitmusChaos; this is the ChaosHub containing all the chaos templates.

Then there is a section for analytics. Litmus provides its own analytics for whatever chaos scenarios you have executed, and you can also see usage statistics — I think there is some issue loading... there we go: how many users there are, what the projects are, how many chaos delegates there are, the total chaos experiment runs, what the chaos scenarios are and when they are scheduled. Let's visit analytics once again. Here you can get all the analytics: the number of runs, schedule stats, how many experiments failed and what the success ratio was. If you open the statistics of one particular scenario, you can see them there: for this experiment I had given 10 points and it passed, and you can see the resiliency score here. As we discussed earlier, you can give each experiment a particular score — a smaller score for pod crashing, a higher resiliency score for CPU or memory chaos — and accordingly you get statistics for those chaos scenarios and experiments.
For this demo we will create one chaos scenario from the ChaosHub and execute it inside the cluster itself, and one chaos scenario we will induce beyond Kubernetes, in the AWS account. Let's see. Whenever you want to schedule or execute a new chaos scenario, you click here on "Schedule chaos scenario". Then you choose the agent — if it is a different cluster, there will be an external agent; you select that particular agent and proceed. We will select a particular experiment from the ChaosHub and name the scenario "memory hog". If you click next, you will see that you need to add the experiment to this scenario, so we add the experiment for memory hogging and take this particular template, the generic pod-memory-hog. The good part is that from here itself you can tune your experiment: where you want to induce the chaos and for which particular pod you want to increase the memory. Let's say I go with the sock-shop namespace and take the catalogue deployment, then click next. If you have any health checks or probes, you can mention them here. Then you can tune the memory consumption and the total chaos duration for which the memory should be increased; for demo purposes I am reducing that, down to maybe 30 seconds. If you want to run the experiment pods on a particular node, you can provide a node selector key-value here, but right now we will not go for that. We just click finish and choose to revert the schedule — whatever pods get scheduled in your cluster for this chaos scenario will be cleaned up after the scenario executes successfully, which is why we tick "revert schedule" here — and click next. In this step you can assign points; let's give eight points to this experiment, then click next, schedule it now, and click finish. Here you can see that the chaos scenario is running; we'll click "show the chaos scenario", and this chaos scenario is in progress.
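For reference, the memory-hog scenario just tuned in the UI corresponds roughly to a ChaosEngine custom resource like the sketch below. The app label, service-account name, and memory value are assumptions based on typical Sock Shop and Litmus defaults, not values read from this demo:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: catalogue-memory-hog
  namespace: sock-shop
spec:
  engineState: active
  appinfo:
    appns: sock-shop
    applabel: "name=catalogue"        # assumed label; adjust to your deployment
    appkind: deployment
  chaosServiceAccount: litmus-admin   # assumed service account
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: MEMORY_CONSUMPTION
              value: "500"            # MiB of memory to consume in the target pod
            - name: TOTAL_CHAOS_DURATION
              value: "30"             # seconds, as reduced for the demo
```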
Meanwhile, let's log in to Grafana and see if the dashboard is in place — we will create a new dashboard. In this dashboard you can see the basic memory statistics for the Sock Shop applications: for catalogue you can see the CPU usage and memory usage, and similarly for payment, user, and front-end, whatever we have plotted here. Let's go back to the status of our chaos scenario. You can see that the scenario first installed the chaos experiments — installing a chaos experiment is nothing but deploying all the custom resources for that particular chaos scenario — and then, in this step, it triggers those custom resources. Here we can see that the pod-memory-hog is in progress, and we can monitor that in catalogue: you can see the memory has increased quite a lot. Since it is a demo application we have not increased the memory too much, but whatever chaos we have induced, we are able to observe it. So the first thing is that you need to have all your observability solutions in place. Practically, if you map it: if memory increases, I should get an alert that memory has increased for this particular pod — so with chaos you also identify the gaps in your observability. You can see the chaos scenario has succeeded and has reverted all the chaos pods as well; if you run kubectl you won't see any new pod here, so all the pods and custom resources created for that experiment are purged completely, and we can see the memory increase here.

Now let's take a look at the next scenario. The scenario is this: I have created a test instance in AWS and I am able to ping that instance. With our experiment we will change the security group, and this ping should no longer succeed. The idea behind this is that you can change configurations and check what the impact is.
Do you have alerts or observability in place? If somebody changes the security group by mistake, is it noticeable to you quickly, so that you can minimize the downtime caused by that change? With this example, let's start our experiment. For this we will use a different approach: I have taken the AWS SSM template and modified it, so let's see how we can use that. Again we are scheduling a new chaos scenario with a new experiment — I think this is a bit slow — but this time we will import a chaos YAML. I have created a YAML; the workflow is inside it, and we will apply it here.
Whenever you have to induce this kind of experiment in AWS, there are mainly two steps. As we have seen, to execute chaos in AWS you have to write SSM documents. What is an SSM document? In Systems Manager you write, in the form of a document, a script for whatever you want to do in the cloud. If you click on Documents here, there are predefined documents provided by AWS; you can refer to those and create your own, which is what we did. So this is my document, test-chaos-through-SSM, through which we will change the security group configuration. For this demo we keep the design basic and minimalistic, and you can see that I am just revoking one ingress rule in the security group — this is what an AWS SSM document looks like. What we do next is put this SSM document inside a ConfigMap and pass it to Litmus: you can see I have created a revoke-security-group ConfigMap — it is just a ConfigMap, and in the data section I have simply pasted my SSM document — and then I have applied this ConfigMap with a kubectl apply command (a rough sketch of such a ConfigMap, and of the ChaosEngine section that consumes it, is shown below).

Once your ConfigMap is in place, you have to design a workflow. I have used the existing template and modified it a bit. As you can see, this is the workflow, and the scenario has three steps, as we saw earlier: it installs the experiment, then our main experiment is executed, and then the chaos is reverted. The changes you have to make are in the workflow's custom resource definitions: here you can see the resources that get created — the chaos engine, the chaos experiments, and the chaos result, which is the resource that surfaces the result into the UI.
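Here is a hedged sketch of those two pieces — the ConfigMap wrapping the SSM document, and the part of the workflow's ChaosEngine that consumes it. The resource names, security-group ID, region, and mount path are illustrative placeholders, and the environment variable names are as documented for the aws-ssm-chaos-by-id experiment on the ChaosHub; double-check them against your Litmus version:

```yaml
# ConfigMap wrapping the SSM document (the data key is just a file name)
apiVersion: v1
kind: ConfigMap
metadata:
  name: litmus-revoke-security-group      # referenced later in the ChaosEngine mount
  namespace: litmus
data:
  revoke-sg.yml: |
    schemaVersion: '2.2'
    description: Revoke one ICMP ingress rule from a security group
    mainSteps:
      - action: aws:runShellScript
        name: revokeIngress
        inputs:
          runCommand:
            # assumes the AWS CLI and suitable IAM permissions on the target instance
            - aws ec2 revoke-security-group-ingress --group-id <sg-id> --protocol icmp --port -1 --cidr 0.0.0.0/0
---
# Relevant portion of the ChaosEngine inside the workflow template
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: revoke-sg-new
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: litmus-admin        # assumed service account
  experiments:
    - name: aws-ssm-chaos-by-id
      spec:
        components:
          env:
            - name: EC2_INSTANCE_ID
              value: "<instance-id>"       # the test instance we ping
            - name: REGION
              value: "<aws-region>"
            - name: DOCUMENT_NAME
              value: "litmus-revoke-sg"    # name the document is registered under
            - name: DOCUMENT_PATH
              value: "/mnt/revoke-sg.yml"  # where the ConfigMap is mounted
            - name: DOCUMENT_TYPE
              value: "Command"
            - name: DOCUMENT_FORMAT
              value: "YAML"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
          configMaps:
            - name: litmus-revoke-security-group
              mountPath: /mnt
          # AWS credentials are typically supplied to the experiment via a
          # Kubernetes secret as well; omitted here for brevity.
```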
In the chaos engine of this workflow I have passed my ConfigMap — you can see litmus-revoke-security-group — and mounted it. The second change I made is the document path: I have given litmus-revoke-sg as the document name and specified its path, and I have specified the instance. Let's revisit the instance once more so we are sure the instance ID we have taken is correct, and paste it here, so that the SSM document runs against this particular instance. And in the last step you can see it deletes the chaos engine from the litmus namespace — the revert-chaos step is the third step of the workflow. So we take this chaos scenario and upload it into the Chaos Center: we select our workflow, click next, rename it to revoke-sg-new, and schedule it now. You can see the code is fine; if there were any issues with the YAML, it would show here that there is a linting problem or similar. Let's click finish. If we go to the chaos scenarios, we can click "show the chaos scenario" and see that it is in progress; it is installing the chaos experiments. We can see here that new pods are being created for that particular chaos experiment — there is a new pod, revoke-sg-new — and if I get the chaos engines, you can see the chaos engine running here; it shows aws-ssm-chaos-by-id because we used that particular template. The experiment is in progress now, so let's check whether it is actually in action. Behind the scenes it is executing this AWS SSM document, and whenever an SSM document is executed it uses Run Command. If we click on Run Command, you can see it either in the commands or in the command history: this is the revoke-sg command, and we can see that it succeeded; you can check the output — it has returned true. We can verify it by checking the security group: go to the security group and check — let me go to the instance first. You can see there is no security group rule that allows ICMP packets, so now let's check whether we are able to ping or not.
You can see that we are not able to ping: the rule that was there has been removed, and the impact is that there is no connectivity right now. This is how you can execute any scenario in your AWS account — maybe instance deletion, or whatever we have discussed; if your databases are hosted in AWS, you can write an SSM document, put it in a ConfigMap, and execute that particular chaos using Litmus. Coming back to our analytics, you can see these are the pods, the run is completed, and you can see the chaos result — this is the latest one; it is awaiting the result. So in this way you can execute any chaos inside an AWS account using an AWS SSM document and LitmusChaos.

That's all for this talk. We have seen what chaos is, what resiliency is, and how to increase the reliability of your system using chaos engineering and resiliency testing.
Why LitmusChaos was useful for our use case: firstly because it is open source, it is flexible, it has a centralized approach for executing chaos across multiple accounts, and mainly it has the integration with AWS SSM documents, through which we executed chaos inside Kubernetes and outside of Kubernetes, in AWS. So that's it. Thank you.