Enhancing Software Resilience with Chaos Engineering

Aug 16, 2024

Building Continuous Resilience in Software Delivery Life Cycle with Chaos Engineering

Introduction

  • Speaker: Matt Schillersham
  • Role: Product Marketing Manager at Harness
  • Focus: Continuous resilience in software delivery through chaos engineering
  • Experience: 20 years in various industries including nuclear power, retail, e-commerce, and non-profits.
  • Community Involvement: Part of the Litmus Chaos Open Source Community, sponsor of CNCF.

Why Chaos Engineering?

  • Definition: Understanding how systems work and react to failures.
  • Quote: "If you don’t know why it’s working when it’s working, you won’t know how to fix it when it breaks." – Andy Stanley
  • Purpose: To prepare for failures and improve system recovery and user experience.

Importance of Resilience Mechanisms

  • Resilience mechanisms are built into code and architecture so that systems recover gracefully from failures.
  • Chaos engineering validates these mechanisms before incidents occur.
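
As an illustration of validating a resilience mechanism before an incident, the minimal sketch below pairs a retry-with-fallback wrapper with a dependency that is made to fail on purpose. It is not from the talk: `FlakyService`, `call_with_retry`, and all parameter values are hypothetical names chosen for the example.

```python
import time


class FlakyService:
    """Hypothetical dependency that fails a fixed number of times, then recovers."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected failure")
        return "live data"


def call_with_retry(func, retries=3, base_delay=0.01, fallback="cached data"):
    """Resilience mechanism: retry with exponential backoff, then fall back."""
    for attempt in range(retries):
        try:
            return func()
        except ConnectionError:
            time.sleep(base_delay * (2 ** attempt))
    return fallback


# Experiment 1: a transient outage (2 failures) should be absorbed by retries.
transient = FlakyService(failures_before_success=2)
result_transient = call_with_retry(transient.fetch)

# Experiment 2: a sustained outage should degrade gracefully to the fallback.
sustained = FlakyService(failures_before_success=10)
result_sustained = call_with_retry(sustained.fetch)

print(result_transient, "|", result_sustained)
```

The two experiments check different promises of the same mechanism: short outages stay invisible to the caller, and long outages degrade to a fallback instead of an error.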

Common Kubernetes Failure Modes

  • System instability
  • Resource contention
  • Scaling issues
  • Configuration errors
  • Resource exhaustion
  • Note: Chaos engineering helps address these issues proactively.

Chaos Engineering Experience

  • Example of chaos experiment setup via Litmus Chaos project.
  • Experiments are currently defined as declarative YAML files that simulate failures and evaluate how the application behaves.
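
To give a feel for what such a declarative definition contains, here is a hedged sketch that builds a pod-delete experiment as a plain Python dict and prints it as JSON (which `kubectl` accepts alongside YAML, and which avoids a third-party YAML dependency). The field names follow the Litmus `ChaosEngine` resource as commonly documented and should be verified against the docs for the version in use. The resource name, namespace, label, and service account are hypothetical.

```python
import json


def pod_delete_engine(namespace, app_label, duration_seconds=30):
    """Build a ChaosEngine-style manifest as a dict (illustrative values only)."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        # Hypothetical resource name for this example.
        "metadata": {"name": "checkout-chaos", "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Which application the fault targets.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            # Hypothetical service account with permission to delete pods.
            "chaosServiceAccount": "pod-delete-sa",
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                {"name": "TOTAL_CHAOS_DURATION",
                                 "value": str(duration_seconds)},
                                {"name": "CHAOS_INTERVAL", "value": "10"},
                            ]
                        }
                    },
                }
            ],
        },
    }


manifest = pod_delete_engine(namespace="shop", app_label="app=checkout")
print(json.dumps(manifest, indent=2))
```

Keeping the definition declarative means the same file can be reviewed, versioned, and rerun unchanged in every environment.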

Continuous Resilience

  • Goal: Optimize reliability and resilience in software delivery to enhance customer experience.
  • Collaboration among SREs, QA engineers, and developers is crucial.

Continuous Resilience Breakdown

  1. SREs: Use chaos engineering post-incident to analyze and recreate failures.
  2. QA Engineers: Run chaos experiments in test environments to validate fixes.
  3. Developers: Incorporate chaos testing in CI/CD pipelines to catch issues early.
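
One way the developer step could look in practice is a pipeline gate that runs a set of chaos experiments and blocks the release if any of them fails. The sketch below is an assumption about how such a gate might be structured, not the speaker's implementation. `chaos_gate` and the placeholder experiments are hypothetical.

```python
def chaos_gate(experiments):
    """Run each experiment callable and return the names of those that failed.

    An experiment counts as failed if it returns a falsy value or raises.
    """
    failed = []
    for name, run in experiments.items():
        try:
            passed = bool(run())
        except Exception:
            passed = False
        if not passed:
            failed.append(name)
    return failed


# Placeholder experiments: real ones would inject a fault and probe the app.
experiments = {
    "pod-delete": lambda: True,
    "cpu-hog": lambda: True,
}

failed = chaos_gate(experiments)
# A pipeline step would pass this value to sys.exit() to fail the build.
exit_code = 1 if failed else 0
print("failed experiments:", failed, "exit code:", exit_code)
```

Treating a crashed experiment as a failure keeps the gate conservative: a build is only promoted when every experiment positively reports success.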

Business Level Perspective

  • Innovation in software delivery is critical for reliability and resilience.
  • 2023 Focus: High speed, efficiency, low cost, and reliability are essential.

Cost of Software Development

  • About 27 million developers globally at an average salary of $100,000 (total payroll ~$2.7 trillion).
  • Many developers spend less than 3 hours a day coding; the rest of the day is lost to toil.

Reducing Developer Toil

  • Reducing toil frees developer time, which effectively increases the development budget and capacity.
  • Factors preventing productivity:
    • Meetings
    • Babysitting deployments
    • Security testing

Improving Reliability and Resilience

  • Areas to Reduce Time:
    • Software build time
    • Software deployment time
    • Bug fixing time

Debugging Challenges

  • Oversights and untested dependencies lead to longer debugging times.
  • Bugs are costly to fix in production; catching them earlier through QA testing is more efficient.

Cloud Native Developers

  • Building on containers and APIs adds complexity, which makes failures more likely.
  • Automated chaos testing in pipelines can address common issues:
    • Resource exhaustion
    • Configuration errors
    • System instability

Fault Injection and Chaos Experimentation

  • Need: To introduce controlled faults for better resilience.
  • Emphasis on continuous chaos testing to maintain system reliability.
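
What "controlled" can mean in code is sketched below: a wrapper that injects faults at a bounded rate, can be switched off, and uses a seeded random generator so a run is reproducible. This is an illustrative assumption rather than a tool from the talk. `FaultInjector` and its parameters are hypothetical.

```python
import random


class FaultInjector:
    """Hypothetical injector that raises TimeoutError on a fraction of calls."""

    def __init__(self, failure_rate, seed=0, enabled=True):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self.enabled = enabled             # kill switch to stop the experiment
        self._rng = random.Random(seed)    # seeded so runs are reproducible
        self.injected = 0

    def wrap(self, func):
        """Return a version of func that sometimes fails before it runs."""
        def wrapper(*args, **kwargs):
            if self.enabled and self._rng.random() < self.failure_rate:
                self.injected += 1
                raise TimeoutError("injected fault")
            return func(*args, **kwargs)
        return wrapper


injector = FaultInjector(failure_rate=0.3, seed=42)
lookup = injector.wrap(str.upper)

errors = 0
for _ in range(100):
    try:
        lookup("order-123")
    except TimeoutError:
        errors += 1

print(f"{errors} of 100 calls hit an injected fault")
```

The failure rate bounds the blast radius, and the `enabled` flag gives the experiment an immediate stop condition.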

Continuous Resilience Metrics

  • Measure resilience with resilience scores (success rate of experiments) and coverage (number of tests executed vs. total possible tests).
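
The two metrics can be computed as follows. The notes do not give exact formulas, so expressing both as percentages is an assumption, and the function names and sample results are hypothetical.

```python
def resilience_score(results):
    """Percentage of executed experiments that passed."""
    if not results:
        return 0.0
    return 100.0 * sum(results.values()) / len(results)


def coverage(results, total_possible):
    """Percentage of the possible experiments that were actually executed."""
    if total_possible <= 0:
        raise ValueError("total_possible must be positive")
    return 100.0 * len(results) / total_possible


# Hypothetical run: 4 experiments executed out of 10 identified as possible.
results = {
    "pod-delete": True,
    "cpu-hog": True,
    "network-latency": False,
    "disk-fill": True,
}

score = resilience_score(results)            # 3 of 4 passed -> 75.0
cov = coverage(results, total_possible=10)   # 4 of 10 run   -> 40.0
print(f"resilience score: {score:.1f}%, coverage: {cov:.1f}%")
```

Reading the two numbers together matters: a high score over low coverage only says that the few scenarios tested so far are handled well.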

Conclusion

  • Main Takeaway: Build a culture of continuous chaos experimentation during development rather than relying on reactive game days.
  • Community Event: Chaos Carnival, a free virtual event on March 15th and 16th.
  • Contact: Matt Schillersham via email, Twitter, or LinkedIn for inquiries.