Transcript for:
Chaos Engineering for Software Resilience

Hey everyone, good morning, good afternoon, and good evening from wherever you are. I'm here from Harness, and I'm here to talk to you about building continuous resilience into the software delivery life cycle with chaos engineering. Cloud native development has enabled teams to move quickly, but it has also introduced new ways for software to fail quickly. SREs, QA engineers, and developers need to work together to optimize reliability and resilience and improve developer productivity.

My name is Matt Schillerstrom, and I'm a product marketing manager at Harness. We're a modern software delivery platform for continuous integration, continuous delivery, security testing, feature flags, service reliability management with service level objectives, cloud cost management, and chaos engineering. For the past 20 years I've been helping teams build reliable and resilient systems, across the nuclear power industry, retail and e-commerce, as well as non-profit groups I've been a part of locally in Minnesota. I've enjoyed being a software engineer, a product manager, and a product marketing manager, and I hope you enjoy this presentation.

Why am I here? I'm part of the LitmusChaos open source community, which is an incubating CNCF project, and Harness is also part of the CNCF as a silver sponsor. You may have seen me at KubeCon + CloudNativeCon Detroit, where we had our first ever Chaos Day in October 2022. Please feel free to contact me via email, Twitter, or LinkedIn.

Ultimately, we are here because we are building and making things better. As engineers and leaders we are always seeking to understand and learn how the world works, and I'd argue that building for resilience is, in fact, chaos engineering: this discipline simply allows us to understand how the system works and operates. One of my favorite quotes, which I saw on Twitter, is from Andy Stanley, who has a podcast: "If you don't know why it's working when it's working, you won't know how to fix it when it breaks." As a former IT administrator, that rings true for me. When I had to respond to a brand new issue I didn't understand, it was hard: I had to dig around, and I was stressed and nervous. But when I was able to practice failure and prepare for it, I was more confident, and ultimately fewer failures occurred because I had proactively planned for them.

So why does chaos engineering even exist? There's a nice illustration of this from bytebytego.com. Resilience mechanisms were developed in the code and in the architecture to help a system recover, fail gracefully, or simply display an error message to the user. Not everything has to be perfect, but an error message can tell the user why something failed, or help the IT person solve the problem. Chaos engineering can be used to validate and tune these mechanisms to make sure they work. Another way to phrase "why does chaos engineering exist" is: what failure modes does my system have, what mechanisms am I using to prevent them, and how do I test them? Do I wait for an incident to happen to prove it, or can I test it proactively?

Some common Kubernetes failure modes are worth thinking through; I pulled these from the Kubernetes website: system instability, resource contention, scaling issues, configuration errors, and resource exhaustion. Kubernetes is self-healing to an extent, but the application you put in the container isn't necessarily self-healing, and you have to know how to handle these failures when they happen.
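To make that concrete, here is a minimal sketch of the kind of resilience mechanism chaos experiments are meant to validate: a Kubernetes Deployment with multiple replicas, resource requests and limits, and health probes. The service name, image, ports, and thresholds are illustrative placeholders rather than anything from the talk.

```yaml
# Minimal sketch: a Deployment with the self-healing hooks Kubernetes gives you.
# Names, image, ports, and thresholds are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service          # hypothetical service name
spec:
  replicas: 3                     # more than one replica so a pod loss degrades gracefully
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout
          image: example.com/checkout:1.0.0   # placeholder image
          resources:
            requests:             # guard against resource contention...
              cpu: 250m
              memory: 256Mi
            limits:               # ...and resource exhaustion
              cpu: 500m
              memory: 512Mi
          readinessProbe:         # stop routing traffic to a pod that cannot serve
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
          livenessProbe:          # restart a pod that is wedged
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
```

A pod-delete or resource-hog experiment is then what proves these settings actually produce graceful degradation, rather than assuming they do.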
Now, the chaos engineering experience today is basically like a shopping cart on an e-commerce website: you can click around. This is an example from the LitmusChaos open source project in the CNCF. You can click and say, "I need a fault around Kubernetes, or Cassandra, or Kafka," and there's a button to click and a set of experiments you can pull from. That works well, and when you click in you can see experiments that are easy to understand and self-service. But developers, QA engineers, and SREs can't manually click buttons all the time, so today a chaos experiment might just look like a declarative YAML file. In that file, all we're trying to do is delete a pod from a Kubernetes deployment to see how my application behaves when it restarts or when that pod gets disrupted. Does the user on the other end of the application get an error message, does it gracefully degrade, or is it a seamless transition to the restarted application? (There is a sketch of what such a manifest can look like at the end of this section.)

If I dive into what we're trying to do with continuous resilience, let's break down how reliability and resilience can help development teams: improving resilience across the software delivery life cycle ultimately gives customers an improved experience. Generally speaking, SREs, QA engineers, and developers are a team, but they work siloed; they hand off work to each other, whether through a PR, a test, or an incident. What we're trying to do is this: SREs can leverage chaos engineering after an incident to recreate that incident, see whether and how they can fix it, or perhaps increase the blast radius of that incident in a simulated environment, and then validate that the fix they put in actually resolves the experiment. Then they can shift that learning, that test if you will, left to the QA environment, so the QA team can run the same experiment to see if there's any failure in that environment or whether the configuration has drifted. And the QA person can shift it left again to the developer, so the developer can run that chaos experiment in the CI pipeline or the QA test environment. Now you have protection across the pipeline, and you're not waiting for a random incident that could happen; you can actually avoid that incident.

If I think about this at the business level, innovation in software is a continuous process, and it has to be. It can help us improve multiple aspects of the business, but most importantly the developer and customer experience itself. So let's talk about innovation and achieving reliability and resilience. It's challenging to solve everything at the same time, but this year, in 2023, we need to not only move fast with high velocity, we also need to do it efficiently, at a low cost, and with the reliability and resilience needed for the best customer experience. It's a mouthful, so how do we solve it? Automation in the pipeline is key.
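Here is that sketch. It loosely follows the LitmusChaos ChaosEngine format for a pod-delete experiment; the namespace, labels, service account, and durations are illustrative assumptions, so check them against your own installation before running anything like this.

```yaml
# Sketch of a declarative pod-delete chaos experiment (LitmusChaos-style ChaosEngine).
# Namespace, labels, service account, and durations are illustrative.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-pod-delete
  namespace: checkout                  # hypothetical target namespace
spec:
  engineState: active
  appinfo:
    appns: checkout
    applabel: app=checkout-service     # which deployment's pods to disrupt
    appkind: deployment
  chaosServiceAccount: litmus-admin    # assumes this service account already exists
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # how long to keep injecting the fault (seconds)
              value: "30"
            - name: CHAOS_INTERVAL         # seconds between pod deletions
              value: "10"
            - name: FORCE                  # graceful vs. forced deletion
              value: "false"
```

Run on a schedule or from a pipeline, this answers the question above, whether the user sees an error, a graceful degradation, or nothing at all when a pod goes away, without anyone clicking a button.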
Let's talk a little bit about the cost of software development. Right now there are approximately 27 million software developers globally, with an average salary of a hundred thousand dollars, equivalent to an annual payroll of about 2.7 trillion dollars. That's a lot of money. And if you look at how much time developers spend coding, in a recent survey poll on LinkedIn, 54% said less than three hours a day. That's the equivalent of three hours of wrench time per day where they're making something, creating something, innovating. The rest of the time, what are they doing? There are meetings, there's other toil, there's watching and babysitting the deployment, there's security testing, all of these things: toil that prevents development teams from being productive. Not that you have to code for eight hours a day, but if you can't be creative you can't innovate, and if you're bogged down by all this toil in the deployment process, you're not being as productive.

So look at the math behind this opportunity against that annual payroll of 2.7 trillion dollars. If we can cut developer toil in half, which is doable, then you can look at your effective developer budget increasing, and you can redirect that to development, whether that's being more productive with the same team or hiring more people to build more capabilities. You don't always have to do more with less; you can do more by cutting down this toil. I also want to point out that code isn't always the best way to solve a problem, but if the toil around building a prototype or testing a small unit of code is complicated, a development team won't feel comfortable testing out new ideas. If you're able to quickly write code to solve the problem, test it, prototype it, deliver it to non-production or production quickly, and try it with a few customers, that's the ability of a development team to test, get feedback, and iterate. Ultimately, you need to innovate to increase developer productivity and save costs.

So where can you increase developer productivity? Let's break that down for reliability and resiliency: you can reduce software build time, you can reduce software deployment time, and you can reduce software debug time. Let's dig into that last one a little more. Why do developers spend more time debugging right now? One reason is oversight: there are a million things going on, you test as much as you can and you automate, but you overlook something; that's just human nature. Another is dependencies that have not been tested: it's very normal not to understand everything that goes into and out of your system, especially in these managed service environments, so you can't test everything; sometimes you wait for an incident to uncover that dependency retroactively, when you would rather be proactive. Then there's a lack of understanding of the product architecture: in today's world, with thousands of microservices, can a human understand the map of everything? It's very hard to memorize. In the old days of monolithic applications, sure, but microservices today are challenging. And sometimes the developer's code is running in a new environment; your code should be written in a way that lets the workload move around to different clouds, but sometimes there are dependencies that are intertwined that you just don't know about.
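To put rough numbers on that "cut toil in half" opportunity, here is one illustrative back-of-the-envelope calculation. The split of roughly three productive hours versus five hours of toil in an eight-hour day is my assumption layered on the survey figure above, not a number from the talk.

```latex
% Illustrative only: assumes ~3 productive hours and ~5 hours of toil per 8-hour day.
\begin{align*}
\text{Annual payroll} &\approx 27\ \text{million} \times \$100{,}000 = \$2.7\ \text{trillion} \\
\text{Paid time freed by halving toil} &\approx \tfrac{2.5}{8} \times \$2.7\ \text{trillion} \approx \$0.84\ \text{trillion per year} \\
\text{Productive capacity} &\approx \tfrac{3 + 2.5}{3} \approx 1.8\times \ \text{with the same headcount}
\end{align*}
```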
So software developers are spending a lot more time debugging, and debugging in production is the worst possible experience. Think about responding to an incident: it's very stressful, it's painful for the customers, people are hunting and digging through the problem, and the cost is expensive, because now you have production code that's broken, you have to go back, fix it, and test it, and that's time that could have been spent on new feature development. There's also the lost opportunity cost of that customer; maybe you lost that customer along with the transaction they weren't able to complete. That's where the earlier point about shifting left to QA and shifting left into non-production code for the developer comes in: if you can find these infrastructure failures and application failures earlier on, it's cheaper. The graph shows that if you fix an issue in a QA environment, it's roughly a 10x reduction, so a bug may cost ten dollars instead of a hundred dollars, and if you fix it in the code right away, before you even push it to QA, that can be up to a hundred times cheaper than fixing it in production. These are rough values you can use to show why it's important to test more up front.

Cloud native developers are focusing on the container itself and the consumable APIs it uses, and they experience failure at a rapidly increased pace precisely because we are making it easier to deliver software. Containers are helping developers focus on their application and API and not worry so much about the stack underneath, but that lack of understanding and lack of testing can cause issues across the whole infrastructure stack. Look back at the common Kubernetes failures: resource exhaustion occurs when a Kubernetes cluster runs out of resources such as CPU or memory; configuration errors occur when a cluster is not properly configured; resource contention occurs when multiple components compete for the same resources; and system instability occurs when the cluster is not stable and is regularly crashing or restarting. Chaos experiments that should be automated in the CD pipeline include testing for resource exhaustion, configuration errors, and resource contention. Additionally, you can automate testing for the ability to recover from unexpected events and errors, for the ability to scale up and down as needed, and ultimately for the ability to detect, diagnose, and mitigate security vulnerabilities. (A sketch of what such a pipeline step can look like follows below.)

As developers dig into these problems and debug, they shouldn't have to dig too far to find the issue. You're testing code as fast as possible and shipping code as fast as you can, but not necessarily looking at the overall system, and as that container sits in an application that consumes APIs and resources on the infrastructure, the impact of an outage can extend well beyond that container. So you have to ask yourself: are the containers tested for how they function when these faults occur, and does that testing reveal the deep dependencies in the infrastructure stack?
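Here is that pipeline sketch. It reuses the pod-delete ChaosEngine from earlier and is written as a hypothetical GitHub Actions job purely for concreteness; any CI/CD tool can run the same two kubectl commands after a deploy. The ChaosResult name and verdict field path follow the usual LitmusChaos conventions, but treat them as assumptions to verify against your installed version.

```yaml
# Sketch only: a post-deploy chaos gate. Tool choice, file paths, and resource names are illustrative.
name: chaos-gate
on: [workflow_dispatch]            # or chain it after your deploy job

jobs:
  pod-delete-chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes the runner already has kubectl access to the target cluster
      # (kubeconfig configured in an earlier, omitted step).
      - name: Inject the fault
        run: kubectl apply -f chaos/checkout-pod-delete.yaml   # the ChaosEngine sketched earlier
      - name: Wait for the experiment to finish
        run: sleep 90              # crude; a real pipeline would poll the engine status instead
      - name: Gate on the verdict
        run: |
          VERDICT=$(kubectl -n checkout get chaosresult \
            checkout-pod-delete-pod-delete \
            -o jsonpath='{.status.experimentStatus.verdict}')
          echo "Chaos verdict: $VERDICT"
          test "$VERDICT" = "Pass"   # fail the pipeline if steady state was not maintained
```

The design point is simply that the pipeline fails when steady state is not maintained, so a resilience regression blocks the release instead of surfacing later as an incident.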
If you look at faults in these deep dependencies, the problem is this: the customer-facing application is impacted, the developer jumps in to resolve the issue, and they find out there are multiple dependencies causing it, which ends up increasing the cost of development. So now service resilience is impacted, developers are debugging it, a dependency fault is discovered, and then new resilience issues are discovered as well. This is the case of the 10x and 100x costs in bug fixes. If you look at dependent fault testing and what's required, the way to think it through is: test your code for faults that happen in the code itself, in the APIs it consumes, in external resources, and in the dependent infrastructure. This applies across the container code, the application, the API, resources, and infrastructure, and it means that cloud native developers need fault injection and chaos experimentation.

Revisiting the original use case of chaos engineering: we introduced controlled faults to reduce expensive outages. That's still important, but the practice recommended production chaos testing, it had a very high barrier to entry, and it followed a game day model. Traditional chaos engineering has been more of a reactive approach, sometimes driven by regulations or requirements. The new patterns driven by chaos engineering are the need to increase developer productivity, removing the toil so developers don't have to dig for answers; the need to increase quality in cloud native environments; and the need to guarantee reliability in the move to cloud native. This need leads to the emergence of continuous resilience, which is verifying resilience through automated chaos testing, continuously. All that means is: if you have a known failure mode you need to protect against and you're using a resilience mechanism for it, you can have a chaos test that validates the mechanism still works as expected, whether that's alerting you, triggering a failover, or just showing an error message. If you can do this continuously, you know your system is protected across the pipeline. So again, continuous resilience is chaos engineering across development, QA, pre-prod, and production.

One way we look at this is by measuring it with resilience metrics, because if you can't measure something, you don't know if you're improving it. The resilience score is the average success percentage of the steady state for a given experiment, component, or service. What that means is your system is expected to behave a certain way during a disruption, and you get a score reflecting whether that steady state held: did it change or not, is it good or bad, up or down. If you map that to resilience coverage, that's the number of chaos tests executed divided by the total number of possible chaos tests, times 100.
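Written out as formulas, this is one way to formalize those two metrics; it's my reading of the definitions in the talk rather than a formula from any specification.

```latex
% Resilience score: how well steady state held, averaged over the experiments you ran.
\[
  \text{Resilience score} \;=\; \frac{1}{N}\sum_{i=1}^{N} \bigl(\text{steady-state success \% of experiment } i\bigr)
\]

% Resilience coverage: how much of the possible fault space you actually exercised.
\[
  \text{Resilience coverage} \;=\; \frac{\text{chaos tests executed}}{\text{total possible chaos tests}} \times 100
\]
```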
So if you think about building resilience for a system, maybe you have 10 failure modes you're trying to protect against, which equate to, say, five to ten tests that could be run. As you onboard a new system to get it production ready, maybe you start by running 10 out of 10 to get 100% coverage, or maybe you run five out of ten on every deployment because you only need the most common ones each time, and then once a month or once a quarter you run the other five. Continuous resilience in the developer pipeline is a way to achieve that resilience.

Compare the game day approach with the pipeline approach. With game days, chaos experiments are executed on demand with a lot of preparation; in pipelines, chaos experiments are executed continuously and without much preparation. With the game day model we primarily target SREs as the persona; with chaos in pipelines, all personas are executing the chaos experiments. With game days the adoption barrier is very high because they're manual events, whereas with chaos in pipelines the adoption barrier is much lower, because every time you run a deployment, or at a certain frequency, you're automatically running the tests.

Traditionally, developing chaos experiments has been a challenge: the code is always changing, bandwidth isn't budgeted for creating them, and the responsibility is typically not identified. SREs are usually pulled into the incident and the corresponding action tracking, then they pull in QA or a developer, and from identification to fix, nothing is tracked to completion, so there's no idea how many more experiments to develop or what failure modes are protected against. With continuous resilience, developing chaos experiments is a team sport across the delivery life cycle, and it's treated as an extension of regular tests. A chaos hub or experiment repositories are maintained as code in Git, so you have version control and historical information on how systems were configured. You know exactly how many tests need to be completed because you have the resilience coverage metric, so it's never an unknown when you're talking to leadership about what tests you're running or how they're performing; you can simply show the tests you're running and the trend.

In summary, resilience is a real challenge in modern, cloud native systems because of the nature of the development. Use fault injection and chaos experimentation to get ahead of the resilience challenge, and push chaos experimentation into the organization as a development culture rather than a game day culture.

Thanks for listening today, I appreciate your time. I just wanted to let you know about a community event, Chaos Carnival, happening March 15th and 16th. It's a two-day virtual event that's entirely free, and the CNCF and Linux Foundation are proud sponsors. If you have any questions, you can reach out to me at my Harness email, on Twitter, or on LinkedIn. Thank you very much, and have a great day.