Understanding Chaos Engineering for Resilience

Sep 15, 2024

Chaos Engineering Presentation Notes

Introduction

Speaker: Ruturaj Kadikar, Senior SRE at InfraCloud Technologies
Topic: Chaos Engineering and Litmus Chaos

Overview of Chaos Engineering

Definition: Testing the resiliency and reliability of systems through controlled experiments.
Key Stats: Major outages at organizations such as the Federal Aviation Administration have been attributed to configuration issues.

Importance of Reliability

Reliability: The ability of a system to deliver consistent, stable results over time.
Metrics:
- SLA: Service Level Agreement; formalizes the level of service the business expects and commits to.
- MTBF: Mean Time Between Failures; higher MTBF correlates with higher reliability.
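
As a rough illustration of how MTBF is derived, the sketch below divides total operational time by the number of failures observed in that period. The availability helper and its MTTR (mean time to repair) input are assumptions added for illustration; MTTR is not covered in these notes, and the figures in the example are made up.

```python
from datetime import timedelta


def mtbf(total_uptime: timedelta, failure_count: int) -> timedelta:
    """Mean Time Between Failures: operational time divided by failure count."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return total_uptime / failure_count


def availability(mtbf_value: timedelta, mttr: timedelta) -> float:
    """Steady-state availability, assuming the common MTBF / (MTBF + MTTR) model."""
    return mtbf_value / (mtbf_value + mttr)


if __name__ == "__main__":
    # Hypothetical month: 720 hours of operation with 3 failures.
    month_mtbf = mtbf(timedelta(hours=720), 3)
    print("MTBF:", month_mtbf)
    # Hypothetical repair time of 1 hour per failure.
    print("Availability:", availability(month_mtbf, timedelta(hours=1)))
```

Fewer failures over the same period raise MTBF, which is the sense in which a higher MTBF indicates higher reliability.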

Why Test Resilience?

Avoids downtime and mitigates future failures.
Provides understanding of system behavior under failure conditions.
Increases MTBF, improving reliability and end-user experience.
Regulatory Compliance: Some industries require a certain level of resilience and uptime.

Specific Challenges in Kubernetes

Kubernetes can run in high-availability mode, but its complexity and the interdependencies between components can still lead to issues.
Human error and the rapid evolution of Kubernetes may cause failures that go unnoticed.

Failure Domains in Kubernetes

Network Issues: Packet loss, latency, jitter.
Pod Crashes: Abrupt crashes, image pull issues, kubelet/container runtime failures.
Node Issues: Resource saturation, abrupt node termination.
Load Patterns: Testing with bursty or spiky load patterns.
Configuration Errors: Random changes in environment variables, service dependencies.
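
The network failure modes in this list can be approximated at the application level for a quick local test. The sketch below is an illustration only: it wraps a function call with added latency, jitter, and a drop probability. The `NetworkFault` class and its parameters are hypothetical names, not part of Kubernetes or any chaos tool, and this does not replace network-level fault injection.

```python
import random
import time
from typing import Any, Callable, Optional


class NetworkFault:
    """Simulates latency, jitter, and packet loss around a function call."""

    def __init__(self, latency_s: float = 0.0, jitter_s: float = 0.0,
                 loss_rate: float = 0.0,
                 rng: Optional[random.Random] = None) -> None:
        self.latency_s = latency_s
        self.jitter_s = jitter_s
        self.loss_rate = loss_rate
        # A seeded generator can be passed in to make runs repeatable.
        self.rng = rng or random.Random()

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # Drop the "packet" with probability loss_rate.
        if self.rng.random() < self.loss_rate:
            raise ConnectionError("simulated packet loss")
        # Delay by the base latency plus or minus a random jitter.
        delay = self.latency_s + self.rng.uniform(-self.jitter_s, self.jitter_s)
        time.sleep(max(0.0, delay))
        return fn(*args, **kwargs)


if __name__ == "__main__":
    fault = NetworkFault(latency_s=0.05, jitter_s=0.02, loss_rate=0.3,
                         rng=random.Random(42))
    for attempt in range(5):
        try:
            print(attempt, fault.call(lambda: "ok"))
        except ConnectionError as exc:
            print(attempt, exc)
```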

Failure Domains Beyond Kubernetes

Databases

Network Partitioning: Can cause data inconsistency, split-brain scenarios.
Time Travel Issues: Clock skew caused by NTP synchronization problems, leading to data inconsistencies.
Access Issues: Incorrect credentials affecting database access.

Cloud Services

Instance Termination: Evaluate impact of abrupt instance terminations.
Security Group Configuration: Rule changes can disrupt communication between services and lead to business losses.
Load Balancers: Test behavior under heavy load and while targets are draining.

Principles of Chaos Engineering

Hypothesize: Understand the steady state of the system and state how you expect it to behave under failure.
Identify Failure Domains: Determine where things can go wrong.
Run Experiments: Verify hypotheses against real-world scenarios.
Mitigate: Implement improvements based on findings.
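
The principles above can be sketched as a small experiment loop. This is a minimal illustration under stated assumptions, not how Litmus or any other tool implements it: `run_experiment`, `ExperimentResult`, and the toy replica counter are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_during: bool  # did the steady state hold while the fault was active?
    steady_after: bool   # did the system return to steady state after rollback?

    @property
    def hypothesis_held(self) -> bool:
        return self.steady_during and self.steady_after


def run_experiment(steady_state: Callable[[], bool],
                   inject_fault: Callable[[], None],
                   rollback: Callable[[], None]) -> ExperimentResult:
    # Hypothesize: only start from a known-good steady state.
    if not steady_state():
        raise RuntimeError("system is not in steady state; aborting experiment")
    # Run the experiment: inject the fault and observe the system.
    inject_fault()
    try:
        during = steady_state()
    finally:
        # Always undo the fault, even if the check itself fails.
        rollback()
    return ExperimentResult(steady_during=during, steady_after=steady_state())


if __name__ == "__main__":
    # Toy system: healthy as long as at least 2 of 3 replicas are running.
    system = {"replicas": 3}
    result = run_experiment(
        steady_state=lambda: system["replicas"] >= 2,
        inject_fault=lambda: system.update(replicas=system["replicas"] - 1),
        rollback=lambda: system.update(replicas=3),
    )
    print("Hypothesis held:", result.hypothesis_held)
```

If `hypothesis_held` is false, the finding feeds the mitigation step.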

Tools for Chaos Engineering

Popular tools: Litmus Chaos, Gremlin, Chaos Monkey, Chaos Mesh, AWS FIS.
Litmus Chaos:
- Open-source, flexible, good integration with AWS SSM.
- Can execute chaos in centralized or distributed environments.

Running Chaos Experiments with Litmus

Setup: Deploy Litmus on an EKS cluster and set up monitoring with Prometheus.
Chaos Scenarios: Define and execute chaos scenarios (e.g., memory hog).
Observability: Integrate with Prometheus and Grafana to monitor chaos impacts.

Example Chaos Experiments

Memory Hog Experiment

Induces memory pressure on a pod and monitors the impact.
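
To give a feel for what a memory hog does, the sketch below allocates a fixed amount of memory, holds it for a while, and then releases it. It is a conceptual stand-in only: it is not the Litmus experiment, `hog_memory` is a hypothetical helper, and the sizes are made-up values.

```python
import time


def hog_memory(megabytes: int, hold_seconds: float, chunk_mb: int = 1) -> int:
    """Allocate roughly `megabytes` MB, hold it, then release it.

    Returns the number of bytes that were allocated.
    """
    chunk_bytes = chunk_mb * 1024 * 1024
    target_bytes = megabytes * 1024 * 1024
    blocks = []
    allocated = 0
    try:
        while allocated < target_bytes:
            # Non-zero fill so the memory is actually written, not just reserved.
            blocks.append(b"\x01" * chunk_bytes)
            allocated += chunk_bytes
        # Hold the memory so monitoring has time to observe the pressure.
        time.sleep(hold_seconds)
    finally:
        # Drop the references so the memory can be reclaimed.
        blocks.clear()
    return allocated


if __name__ == "__main__":
    # Hypothetical run: hold about 100 MB for 5 seconds.
    print("Allocated bytes:", hog_memory(megabytes=100, hold_seconds=5))
```

Running something like this inside a container with a memory limit is one way to watch how dashboards and alerts react before using a real chaos tool.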

Security Group Change Experiment

Changes security group settings using AWS SSM documents.
Evaluates how the system responds and whether alerts fire for connectivity issues.
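
One simple way to check connectivity before, during, and after such a change is a TCP probe. The sketch below uses only the Python standard library; `is_reachable` is a hypothetical helper, and the host and port in the example are placeholders, not values from the talk.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder target; replace with the service affected by the experiment.
    print("reachable:", is_reachable("127.0.0.1", 8080))
```

If the probe starts returning False while the security group change is in effect, connectivity alerts would be expected to fire within their configured window.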

Conclusion

Chaos Engineering is essential for improving system reliability and resilience.
Regular chaos testing should be part of ongoing practices in organizations.
Benefits of using Litmus Chaos include open-source flexibility and centralized chaos execution.
