Understanding Chaos Engineering for Resilience
Sep 15, 2024
Chaos Engineering Presentation Notes
Introduction
Speaker: Ruturaj Kadikar, Senior SRE at InfraCloud Technologies
Topic: Chaos Engineering and Litmus Chaos
Overview of Chaos Engineering
Definition: Testing the resiliency and reliability of systems through controlled experiments.
Key Stats: Major outages at organizations such as the Federal Aviation Administration have been traced to configuration issues.
Importance of Reliability
Reliability: Consistent, stable results over time.
Metrics:
SLA: Service Level Agreement, determines business expectations.
MTBF: Mean Time Between Failures; higher MTBF correlates with higher reliability.
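To make the MTBF metric concrete, here is a minimal sketch that computes it from a list of outages. The `Outage` type, the 720-hour window, and the sample outages are made up for illustration. MTTR (mean time to repair) and the availability formula are not part of the talk notes; they are included only as a commonly used companion to MTBF.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start_hour: float  # hours since the start of the observation window
    end_hour: float    # hour at which service was restored


def mtbf_hours(window_hours: float, outages: list[Outage]) -> float:
    """Mean Time Between Failures = total uptime / number of failures."""
    if not outages:
        raise ValueError("MTBF is undefined without at least one failure")
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    return (window_hours - downtime) / len(outages)


def mttr_hours(outages: list[Outage]) -> float:
    """Mean Time To Repair = total downtime / number of failures (assumed companion metric)."""
    if not outages:
        raise ValueError("MTTR is undefined without at least one failure")
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    return downtime / len(outages)


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability estimate: MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)


# Hypothetical data: three outages in a 30-day (720-hour) window.
outages = [Outage(100.0, 102.0), Outage(400.0, 401.0), Outage(650.0, 653.0)]
mtbf = mtbf_hours(720.0, outages)   # (720 - 6) / 3 = 238.0 hours
mttr = mttr_hours(outages)          # 6 / 3 = 2.0 hours
print(f"MTBF={mtbf:.1f}h MTTR={mttr:.1f}h availability={availability(mtbf, mttr):.4f}")
```

Fewer failures in the same window raise MTBF directly, which is the sense in which chaos testing that removes failure causes "increases MTBF".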
Why Test Resilience?
Avoids downtimes and mitigates future failures.
Provides understanding of system behavior under failure conditions.
Increases MTBF, improving reliability and end-user experience.
Regulatory Compliance: Some industries require a certain level of resilience and uptime.
Specific Challenges in Kubernetes
Kubernetes can run in high-availability mode, but its complexity and interdependencies can still lead to issues.
Human error and the rapid evolution of Kubernetes can cause failures that go unnoticed.
Failure Domains in Kubernetes
Network Issues: Packet loss, latency, jitter.
Pod Crashes: Abrupt crashes, image pull issues, kubelet/container runtime failures.
Node Issues: Resource saturation, abrupt node termination.
Load Patterns: Testing with bursty or spiky load patterns.
Configuration Errors: Random changes in environment variables, service dependencies.
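The network failure domain listed above (packet loss, latency, jitter) can be imitated at application level to see how calling code copes. The following is only a sketch: `FlakyNetwork`, `PacketLost`, and `fetch_profile` are hypothetical names, and real chaos tools inject these faults at the network layer rather than by wrapping function calls.

```python
import random
import time


class PacketLost(Exception):
    """Raised when the simulated network drops a call."""


class FlakyNetwork:
    """Wraps calls with simulated latency, jitter, and packet loss."""

    def __init__(self, latency_s=0.0, jitter_s=0.0, loss_rate=0.0,
                 seed=None, sleep=time.sleep):
        self.latency_s = latency_s
        self.jitter_s = jitter_s
        self.loss_rate = loss_rate
        self._rng = random.Random(seed)  # seeded so runs are repeatable
        self._sleep = sleep              # injectable so tests need not wait

    def call(self, func, *args, **kwargs):
        # Packet loss: drop the call with probability loss_rate.
        if self._rng.random() < self.loss_rate:
            raise PacketLost("simulated packet loss")
        # Latency plus jitter: base delay with a uniform random offset.
        offset = self._rng.uniform(-self.jitter_s, self.jitter_s)
        self._sleep(max(0.0, self.latency_s + offset))
        return func(*args, **kwargs)


def fetch_profile(user_id: str) -> dict:
    """Hypothetical downstream call used only for the demo."""
    return {"id": user_id}


net = FlakyNetwork(latency_s=0.02, jitter_s=0.01, loss_rate=0.3, seed=7)
outcomes = []
for _ in range(10):
    try:
        outcomes.append(net.call(fetch_profile, "user-1"))
    except PacketLost:
        outcomes.append(None)
print(f"{outcomes.count(None)} of {len(outcomes)} calls were dropped")
```

Running the caller's retry and timeout logic against such a wrapper gives a cheap first look at behavior before injecting real network faults in a cluster.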
Failure Domains Beyond Kubernetes
Databases
Network Partitioning: Can cause data inconsistency, split-brain scenarios.
Time Travel Issues: NTP synchronization problems (clock skew) leading to data inconsistencies.
Access Issues: Incorrect credentials affecting database access.
Cloud Services
Instance Termination: Evaluate impact of abrupt instance terminations.
Security Group Configuration: Changes that disrupt communication between services and lead to business losses.
Load Balancers: Testing under heavy load and draining targets.
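The "time travel" item under Databases is easier to see with a toy example. The sketch below assumes a store that resolves conflicts by last-write-wins on wall-clock timestamps; that conflict rule is an assumption chosen for illustration, not something stated in the talk, and `LastWriteWinsRegister` and `node_clock` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Write:
    value: str
    timestamp: float  # wall-clock time reported by the writing node


class LastWriteWinsRegister:
    """Keeps whichever write carries the highest timestamp."""

    def __init__(self) -> None:
        self.current: Optional[Write] = None

    def apply(self, write: Write) -> None:
        if self.current is None or write.timestamp >= self.current.timestamp:
            self.current = write


def node_clock(true_time: float, skew_s: float) -> float:
    """Time a node believes it is, given its clock skew in seconds."""
    return true_time + skew_s


register = LastWriteWinsRegister()
# Node A's clock runs 5 s fast; node B's clock is accurate.
register.apply(Write("from-A", node_clock(true_time=100.0, skew_s=5.0)))  # stamped 105.0
register.apply(Write("from-B", node_clock(true_time=102.0, skew_s=0.0)))  # stamped 102.0
# B wrote later in real time, yet its write is discarded because of A's skew.
print(register.current.value)  # prints "from-A"
```

A chaos experiment that shifts a node's clock checks whether the real database shows this kind of silent data loss or guards against it.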
Principles of Chaos Engineering
Hypothesize: Understand the steady state of the system.
Identify Failure Domains: Where things can go wrong.
Run Experiments: Verify hypotheses against real-world scenarios.
Mitigation: Implement improvements based on findings.
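The principles above can be sketched as a small experiment loop: check the steady state, inject a fault, check again, and always roll back. This is a generic illustration, not how Litmus implements experiments; `run_experiment`, `ExperimentResult`, and `ToyService` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_during: bool
    steady_after: bool

    @property
    def hypothesis_held(self) -> bool:
        return self.steady_before and self.steady_during and self.steady_after


def run_experiment(steady_state: Callable[[], bool],
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> ExperimentResult:
    # Hypothesize: the system must be in its steady state before any chaos.
    if not steady_state():
        return ExperimentResult(False, False, False)
    # Run the experiment: inject the fault and observe.
    inject()
    try:
        during = steady_state()
    finally:
        rollback()  # always undo the fault, even if the check raises
    after = steady_state()
    return ExperimentResult(True, during, after)


class ToyService:
    """Stand-in for a deployment with a number of healthy replicas."""

    def __init__(self, replicas: int) -> None:
        self.healthy = replicas

    def kill_one(self) -> None:
        self.healthy -= 1

    def restore_one(self) -> None:
        self.healthy += 1


# Hypothesis: the service stays healthy (at least 2 replicas) if one pod dies.
svc = ToyService(replicas=3)
result = run_experiment(lambda: svc.healthy >= 2, svc.kill_one, svc.restore_one)
print("hypothesis held:", result.hypothesis_held)  # prints "hypothesis held: True"
```

When the hypothesis does not hold, the failed check points at what to mitigate, for example running more replicas.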
Tools for Chaos Engineering
Popular tools: Litmus Chaos, Gremlin, Chaos Monkey, Chaos Mesh, AWS FIS.
Litmus Chaos:
Open-source, flexible, good integration with AWS SSM.
Can execute chaos in centralized or distributed environments.
Running Chaos Experiments with Litmus
Setup: Deploy Litmus on an EKS cluster with monitoring via Prometheus.
Chaos Scenarios: Define and execute chaos scenarios (e.g., memory hog).
Observability: Integrate with Prometheus and Grafana to monitor chaos impacts.
Example Chaos Experiments
Memory Hog Experiment
Induces memory pressure on a pod and monitors the impact.
Security Group Change Experiment
Change security group settings using AWS SSM documents.
Evaluate the system's responsiveness and alerts for connectivity issues.
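For the memory hog experiment, a Litmus `ChaosEngine` resource roughly like the one built below is what ties the `pod-memory-hog` experiment to a target workload. This is a sketch, not the manifest used in the talk: the engine name, namespace, label, service account, and values are placeholders, and the field names should be checked against the Litmus version in use. The manifest is emitted as JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def build_memory_hog_engine(name: str, namespace: str, app_label: str,
                            memory_mb: int, duration_s: int,
                            service_account: str = "litmus-admin") -> dict:
    """Build a ChaosEngine manifest for the pod-memory-hog experiment.

    All argument values are placeholders chosen by the caller; the
    service account default is an assumption about the cluster setup.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": service_account,
            "experiments": [
                {
                    "name": "pod-memory-hog",
                    "spec": {
                        "components": {
                            "env": [
                                # Memory to consume in the target pod, in MB.
                                {"name": "MEMORY_CONSUMPTION", "value": str(memory_mb)},
                                # How long the chaos lasts, in seconds.
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                            ]
                        }
                    },
                }
            ],
        },
    }


manifest = build_memory_hog_engine(
    name="memory-hog-demo",      # hypothetical engine name
    namespace="default",
    app_label="app=nginx",       # hypothetical target label
    memory_mb=500,
    duration_s=60,
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from code makes it easy to vary memory pressure and duration across runs while watching the same Prometheus and Grafana dashboards.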
Conclusion
Chaos Engineering is essential for improving system reliability and resilience.
Regular chaos testing should be part of ongoing practices in organizations.
Benefits of using Litmus Chaos include open-source flexibility and centralized chaos execution.