Enhancing Software Resilience with Chaos Engineering
Aug 16, 2024
Building Continuous Resilience in Software Delivery Life Cycle with Chaos Engineering
Introduction
Speaker: Matt Schillersham
Role: Product Marketing Manager at Harness
Focus: Continuous resilience in software delivery through chaos engineering
Experience: 20 years in various industries including nuclear power, retail, e-commerce, and non-profits.
Community Involvement: Part of the Litmus Chaos Open Source Community, sponsor of CNCF.
Why Chaos Engineering?
Definition: Understanding how systems work and react to failures.
Quote: "If you don’t know why it’s working when it’s working, you won’t know how to fix it when it breaks." – Andy Stanley
Purpose: To prepare for failures and improve system recovery and user experience.
Importance of Resilience Mechanisms
Developed in code and architecture to facilitate graceful recovery.
Chaos engineering validates these mechanisms before incidents occur.
Common Kubernetes Failure Modes
System instability
Resource contention
Scaling issues
Configuration errors
Resource exhaustion
Note: Chaos engineering helps address these issues proactively.
Chaos Engineering Experience
Example of chaos experiment setup via Litmus Chaos project.
Experiments are currently defined in declarative YAML files that simulate failures and evaluate how the application behaves.
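As a rough illustration of what such a declarative experiment definition can look like, the sketch below builds a Litmus-style ChaosEngine manifest as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. The field layout follows the ChaosEngine resource as it is commonly documented for the pod-delete experiment; the engine name, namespace, label, and service account are placeholders assumed for illustration and are not from the talk. Check the fields against the Litmus version in use.

```python
import json


def build_pod_delete_engine(namespace, app_label, duration_s=30, interval_s=10):
    """Return a ChaosEngine manifest (as a dict) for a pod-delete experiment.

    The engine name and service account below are hypothetical placeholders.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "demo-pod-delete", "namespace": namespace},
        "spec": {
            # "active" asks the operator to run the experiment.
            "engineState": "active",
            # Which workload to target, selected by namespace and label.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": "pod-delete-sa",  # assumed name
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                # Total time (seconds) to keep injecting chaos.
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                                # Pause (seconds) between successive pod deletions.
                                {"name": "CHAOS_INTERVAL", "value": str(interval_s)},
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_pod_delete_engine("demo", "app=nginx")
    print(json.dumps(manifest, indent=2))
```

Generating the manifest from code rather than hand-editing YAML is only one option; it makes it easier to vary duration or target per environment, at the cost of an extra step before applying it.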
Continuous Resilience
Goal: Optimize reliability and resilience in software delivery to enhance customer experience.
Collaboration among SREs, QA engineers, and developers is crucial.
Continuous Resilience Breakdown
SREs: Use chaos engineering post-incident to analyze and recreate failures.
QA Engineers: Run chaos experiments in test environments to validate fixes.
Developers: Incorporate chaos testing in CI/CD pipelines to catch issues early.
Business Level Perspective
Innovation in software delivery is critical for reliability and resilience.
2023 Focus: High speed, efficiency, low cost, and reliability are essential.
Cost of Software Development
27 million developers globally, averaging $100,000 salary (total payroll ~$2.7 trillion).
Many developers spend less than 3 hours a day coding because various forms of toil cut into their productivity.
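The payroll figure above can be checked with a few lines of arithmetic. The sketch below reproduces the ~$2.7 trillion total from the two numbers quoted in the talk and, as a purely illustrative extra, expresses 3 hours of coding as a share of the working day. The 8-hour workday is an assumption made here, not a figure from the talk.

```python
def total_payroll(developers, average_salary):
    """Total annual payroll in dollars."""
    return developers * average_salary


def coding_share(coding_hours, workday_hours=8.0):
    """Fraction of the workday spent coding.

    workday_hours defaults to 8, which is an assumption, not source data.
    """
    return coding_hours / workday_hours


if __name__ == "__main__":
    payroll = total_payroll(27_000_000, 100_000)
    print(f"Total payroll: ${payroll / 1e12:.1f} trillion")
    print(f"Coding share at 3 h/day: {coding_share(3):.1%}")
```

Since the talk says "less than 3 hours", the share computed for exactly 3 hours is an upper bound under the assumed workday length.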
Reducing Developer Toil
Potential to increase developer budget and capabilities by reducing toil.
Factors preventing productivity:
Meetings
Babysitting deployments
Security testing
Improving Reliability and Resilience
Areas to Reduce Time:
Software build time
Software deployment time
Bug fixing time
Debugging Challenges
Oversights and untested dependencies contribute to longer debugging times.
Bugs are costly to fix in production; catching them earlier through QA testing is more efficient.
Cloud Native Developers
Focus on containers and APIs increases failure likelihood due to complexity.
Automated chaos testing in pipelines can address common issues:
Resource exhaustion
Configuration errors
System instability
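One way to picture automated chaos testing in a pipeline is as a gate step that reads the outcome of each experiment and blocks the build when any of them fails. The sketch below is a minimal, tool-agnostic version of that idea; the `ExperimentResult` type, the `gate` function, and the experiment names are hypothetical and do not correspond to any Harness or Litmus API.

```python
import sys
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    """Outcome of one chaos experiment (hypothetical record type)."""
    name: str
    passed: bool


def failed_experiments(results):
    """Return the names of experiments that did not pass."""
    return [r.name for r in results if not r.passed]


def gate(results):
    """Return a process exit code: 0 if every experiment passed, else 1."""
    failures = failed_experiments(results)
    for name in failures:
        print(f"chaos experiment failed: {name}", file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__":
    # Illustrative results only; a real step would load these from the
    # chaos tool's report.
    results = [
        ExperimentResult("pod-memory-hog", passed=True),
        ExperimentResult("bad-config-rollout", passed=False),
        ExperimentResult("pod-delete", passed=True),
    ]
    # A real pipeline step would pass this value to sys.exit().
    print("exit code:", gate(results))
```

Failing on any single experiment is the strictest policy; a team could instead gate on a minimum pass rate if some experiments are known to be flaky.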
Fault Injection and Chaos Experimentation
Need: To introduce controlled faults for better resilience.
Emphasis on continuous chaos testing to maintain system reliability.
Continuous Resilience Metrics
Measure resilience with resilience scores (success rate of experiments) and coverage (number of tests executed vs. total possible tests).
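The two metrics can be sketched directly from the definitions given above: resilience score as the percentage of executed experiments that succeeded, and coverage as the percentage of possible tests that were actually executed. The function names and the sample counts are illustrative assumptions; chaos tools may compute their own scores differently (for example, with weighting).

```python
def resilience_score(passed, executed):
    """Percentage of executed experiments that succeeded."""
    if executed <= 0:
        raise ValueError("executed must be positive")
    if not 0 <= passed <= executed:
        raise ValueError("passed must be between 0 and executed")
    return 100.0 * passed / executed


def coverage(executed, total_possible):
    """Percentage of possible tests that were actually executed."""
    if total_possible <= 0:
        raise ValueError("total_possible must be positive")
    if not 0 <= executed <= total_possible:
        raise ValueError("executed must be between 0 and total_possible")
    return 100.0 * executed / total_possible


if __name__ == "__main__":
    # Hypothetical numbers: 20 of 50 possible tests ran, 18 of them passed.
    print(f"resilience score: {resilience_score(18, 20):.1f}%")
    print(f"coverage: {coverage(20, 50):.1f}%")
```

Reading the two together matters: a high resilience score over low coverage only says that the few experiments that ran went well.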
Conclusion
Main Takeaway: Emphasize a culture of chaos experimentation in development rather than reactive game day approaches.
Community Event: Chaos Carnival on March 15th and 16th, free virtual event.
Contact: Matt Schillersham via email, Twitter, or LinkedIn for inquiries.