Enhancing Software Resilience with Chaos Engineering

Aug 16, 2024

Building Continuous Resilience in Software Delivery Life Cycle with Chaos Engineering

Introduction

Speaker: Matt Schillersham
Role: Product Marketing Manager at Harness
Focus: Continuous resilience in software delivery through chaos engineering
Experience: 20 years in various industries including nuclear power, retail, e-commerce, and non-profits.
Community Involvement: Part of the Litmus Chaos Open Source Community, sponsor of CNCF.

Why Chaos Engineering?

Definition: Understanding how systems work and react to failures.
Quote: "If you don’t know why it’s working when it’s working, you won’t know how to fix it when it breaks." – Andy Stanley
Purpose: To prepare for failures and improve system recovery and user experience.

Importance of Resilience Mechanisms

Resilience mechanisms are built into code and architecture so that systems recover gracefully from failures.
Chaos engineering validates these mechanisms before incidents occur.

Common Kubernetes Failure Modes

System instability
Resource contention
Scaling issues
Configuration errors
Resource exhaustion
Note: Chaos engineering helps address these issues proactively.

Chaos Engineering Experience

Example: a chaos experiment set up through the LitmusChaos project.
Experiments are currently defined in declarative YAML files that simulate failures so that application behavior can be evaluated.
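
The notes do not reproduce the manifest shown in the talk, so the following is only a rough sketch of what such a declarative experiment might look like. It builds a ChaosEngine-style resource as a Python dictionary and prints it as JSON, which Kubernetes accepts in place of YAML. The field names follow the LitmusChaos ChaosEngine resource as commonly documented and should be checked against the installed version; the namespace, label, resource name, and service account are hypothetical.

```python
import json


def pod_delete_engine(namespace, app_label, duration_s=30, interval_s=10):
    """Build a ChaosEngine-style manifest for a pod-delete experiment.

    All names below (resource name, service account) are illustrative
    assumptions, not values taken from the talk.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "checkout-pod-delete", "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Which workload the fault is aimed at.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": "pod-delete-sa",  # hypothetical account
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            # Tunables are passed as string-valued env vars.
                            "env": [
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                                {"name": "CHAOS_INTERVAL", "value": str(interval_s)},
                            ]
                        }
                    },
                }
            ],
        },
    }


manifest = pod_delete_engine("demo", "app=checkout")
print(json.dumps(manifest, indent=2))
```

Generating the manifest from a function keeps the experiment declarative while letting a pipeline vary the target and duration per environment.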

Continuous Resilience

Goal: Optimize reliability and resilience in software delivery to enhance customer experience.
Collaboration among SREs, QA engineers, and developers is crucial.

Continuous Resilience Breakdown

SREs: Use chaos engineering post-incident to analyze and recreate failures.
QA Engineers: Run chaos experiments in test environments to validate fixes.
Developers: Incorporate chaos testing in CI/CD pipelines to catch issues early.
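
As a rough illustration of the developer role above, a pipeline stage might run a set of chaos experiments and fail the build when any of them fails. This is a minimal sketch, not the tooling described in the talk: `run_chaos_stage`, `exit_code`, and the experiment callables are hypothetical stand-ins for whatever actually triggers and checks an experiment.

```python
from typing import Callable, Dict, List


def run_chaos_stage(experiments: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each experiment and return the names of those that failed.

    An experiment counts as failed if it returns False or raises.
    """
    failed = []
    for name, run in experiments.items():
        try:
            ok = run()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def exit_code(failed: List[str]) -> int:
    """Map the list of failed experiments to a process exit code for CI."""
    return 1 if failed else 0


# Hypothetical experiments; real ones would call a chaos tool and
# verify the application's steady state afterwards.
experiments = {
    "pod-delete": lambda: True,
    "cpu-hog": lambda: False,
}

failed = run_chaos_stage(experiments)
print("failed experiments:", failed)
print("pipeline exit code:", exit_code(failed))
```

A non-zero exit code is what most CI systems use to stop a pipeline, so the same check can block a deployment before the issue reaches production.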

Business Level Perspective

Innovation in software delivery is critical for reliability and resilience.
2023 Focus: High speed, efficiency, low cost, and reliability are essential.

Cost of Software Development

About 27 million developers globally at an average salary of $100,000 (total payroll of roughly $2.7 trillion).
Many developers spend less than 3 hours a day coding because toil eats into their productive time.
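
The payroll figure follows directly from the two numbers above. The short calculation below reproduces it and, under the added assumption of an 8-hour working day (not stated in the talk), estimates how much of that payroll covers time not spent coding.

```python
from fractions import Fraction

developers = 27_000_000
average_salary = 100_000  # USD, figure quoted in the talk

payroll = developers * average_salary

# Assumption: an 8-hour working day. The talk only says
# "less than 3 hours a day coding", so 3/8 is an upper bound.
coding_share = Fraction(3, 8)
non_coding_payroll = payroll * (1 - coding_share)

print(f"total payroll: ${payroll:,}")
print(f"payroll spent on non-coding time (at least): ${int(non_coding_payroll):,}")
```

Under that assumption, at least five eighths of the payroll goes to activities other than coding, which is the budget that reducing toil aims to recover.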

Reducing Developer Toil

Reducing toil frees developer time, which effectively increases development budget and capacity.
Factors preventing productivity:
- Meetings
- Babysitting deployments
- Security testing

Improving Reliability and Resilience

Areas to Reduce Time:
- Software build time
- Software deployment time
- Bug fixing time

Debugging Challenges

Oversights and untested dependencies lengthen debugging time.
Fixing bugs in production is costly; catching them earlier through QA testing is more efficient.

Cloud Native Developers

Building on containers and APIs adds complexity, which increases the likelihood of failures.
Automated chaos testing in pipelines can address common issues:
- Resource exhaustion
- Configuration errors
- System instability

Fault Injection and Chaos Experimentation

Need: Inject controlled faults to verify and improve resilience.
Emphasis: Run chaos tests continuously, rather than as one-off exercises, to maintain system reliability.
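
To make the idea of a controlled fault concrete, here is a small in-process sketch. It wraps a dependency so that its first few calls fail, then checks that a retry-with-fallback mechanism recovers. Every name here (`inject_faults`, `call_with_retry`, `fetch_price`) is hypothetical; chaos tools usually inject faults at the infrastructure level rather than inside the application code.

```python
class InjectedFault(Exception):
    """Raised by the wrapper to simulate a failing dependency."""


def inject_faults(func, failures):
    """Return a wrapper around func whose first `failures` calls raise."""
    remaining = {"n": failures}

    def wrapper(*args, **kwargs):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)

    return wrapper


def call_with_retry(func, attempts=3, fallback=None):
    """Resilience mechanism under test: retry, then fall back."""
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault:
            continue
    return fallback


def fetch_price():
    """Hypothetical dependency call."""
    return 42


# Two injected failures: the third attempt succeeds.
recovered = call_with_retry(inject_faults(fetch_price, failures=2))

# Five injected failures: all three attempts fail, so the fallback is used.
degraded = call_with_retry(inject_faults(fetch_price, failures=5), fallback=-1)

print(recovered, degraded)
```

Because the number of failures is fixed rather than random, the experiment is repeatable, which matters when the same check runs on every pipeline execution.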

Continuous Resilience Metrics

Resilience score: the success rate of chaos experiments.
Coverage: the number of tests executed compared with the total number of possible tests.
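
Taking the two definitions at face value, both metrics reduce to simple ratios. The sketch below expresses them as percentages; the function names and sample numbers are illustrative, and real tools may weight experiments differently.

```python
def resilience_score(passed: int, executed: int) -> float:
    """Percentage of executed chaos experiments that passed."""
    if executed <= 0:
        raise ValueError("at least one experiment must have been executed")
    return 100.0 * passed / executed


def coverage(executed: int, total_possible: int) -> float:
    """Percentage of the possible chaos tests that were actually executed."""
    if total_possible <= 0:
        raise ValueError("total_possible must be positive")
    return 100.0 * executed / total_possible


# Illustrative numbers: 20 of 50 possible tests were run, and 18 passed.
score = resilience_score(passed=18, executed=20)
cov = coverage(executed=20, total_possible=50)
print(f"resilience score: {score:.1f}%  coverage: {cov:.1f}%")
```

Reading the two numbers together matters: a high resilience score means little if coverage is low, because most failure scenarios were never exercised.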

Conclusion

Main Takeaway: Build a culture of continuous chaos experimentation during development rather than relying on reactive game-day exercises.
Community Event: Chaos Carnival, a free virtual event, on March 15 and 16.
Contact: Matt Schillersham via email, Twitter, or LinkedIn for inquiries.
