Understanding Chaos Engineering for Resilience

Sep 15, 2024

Chaos Engineering Presentation Notes

Introduction

  • Speaker: Ruturaj Kadikar, Senior SRE at InfraCloud Technologies
  • Topic: Chaos Engineering and Litmus Chaos

Overview of Chaos Engineering

  • Definition: Testing the resiliency and reliability of systems through controlled experiments.
  • Key Stats: Major outages at organizations such as the Federal Aviation Administration have been attributed to configuration issues.

Importance of Reliability

  • Reliability: Producing consistent, stable results over time.
  • Metrics:
    • SLA: Service Level Agreement; sets the service levels the business expects and commits to.
    • MTBF: Mean Time Between Failures; higher MTBF correlates with higher reliability.
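
A minimal sketch of how MTBF is usually computed, added here for illustration. The function names are hypothetical, and MTTR (mean time to repair) is an assumption brought in to show how MTBF relates to availability; the talk notes only mention MTBF.

```python
def mtbf_hours(total_uptime_hours: float, failure_count: int) -> float:
    """Mean Time Between Failures: operating time divided by the number of failures."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    return total_uptime_hours / failure_count


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability, assuming MTTR is the mean time to repair."""
    return mtbf / (mtbf + mttr)


# Example: 720 hours of operation with 3 failures, each repaired in about 1 hour.
mtbf = mtbf_hours(720, 3)
print(mtbf)
print(availability(mtbf, 1.0))
```

For a fixed repair time, a higher MTBF pushes availability closer to 1, which is the sense in which higher MTBF correlates with higher reliability.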

Why Test Resilience?

  • Helps avoid downtime and mitigate future failures.
  • Provides understanding of system behavior under failure conditions.
  • Helps increase MTBF, improving reliability and end-user experience.
  • Regulatory Compliance: Some industries require a certain level of resilience and uptime.

Specific Challenges in Kubernetes

  • Kubernetes can run in high-availability mode, but its complexity and interdependencies can still lead to issues.
  • Human error and the rapid evolution of Kubernetes can introduce failures that go unnoticed.

Failure Domains in Kubernetes

  1. Network Issues: Packet loss, latency, jitter.
  2. Pod Crashes: Abrupt crashes, image pull issues, kubelet/container runtime failures.
  3. Node Issues: Resource saturation, abrupt node termination.
  4. Load Patterns: Testing with bursty or spiky load patterns.
  5. Configuration Errors: Random changes to environment variables or service dependencies.

Failure Domains Beyond Kubernetes

Databases

  • Network Partitioning: Can cause data inconsistency, split-brain scenarios.
  • Time Travel Issues: Clock skew from NTP synchronization problems, leading to data inconsistencies.
  • Access Issues: Incorrect credentials affecting database access.
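
To make the clock-skew point concrete, here is a small sketch of how a skewed clock can cause data inconsistency. It assumes a store that resolves conflicts by "last write wins" on node-local timestamps; that conflict rule and all names below are illustrative assumptions, not something stated in the talk.

```python
from dataclasses import dataclass


@dataclass
class Write:
    value: str
    true_time: float  # when the write really happened, in seconds
    node_skew: float  # clock error of the node that stamped the write

    @property
    def stamped_time(self) -> float:
        # The timestamp the node records is the real time plus its clock error.
        return self.true_time + self.node_skew


def last_write_wins(writes):
    """Pick the winner the way a timestamp-based store would: highest stamped time."""
    return max(writes, key=lambda w: w.stamped_time)


older = Write("v1", true_time=100.0, node_skew=5.0)  # node clock runs 5 s fast
newer = Write("v2", true_time=102.0, node_skew=0.0)  # node clock is accurate

winner = last_write_wins([older, newer])
print(winner.value)  # v1: the older write wins because of the skew
```

An experiment that shifts the clock on one node checks whether the database, or the alerting around it, notices this kind of silent overwrite.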

Cloud Services

  • Instance Termination: Evaluate impact of abrupt instance terminations.
  • Security Group Configuration: Rule changes can disrupt communication between services and lead to business losses.
  • Load Balancers: Test behavior under heavy load and while targets are draining.

Principles of Chaos Engineering

  1. Hypothesize: Understand the steady state of the system and predict how it should behave under failure.
  2. Identify Failure Domains: Map out where things can go wrong.
  3. Run Experiments: Verify hypotheses against real-world failure scenarios.
  4. Mitigation: Implement improvements based on the findings.
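
The steps above can be sketched as a small experiment loop. This is a toy model under stated assumptions: the "system" is an in-memory replica counter, and `run_experiment`, `steady_state`, `kill_one`, and `restore` are hypothetical names used only for illustration.

```python
from typing import Callable


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Run one chaos experiment and report the outcome."""
    if not steady_state():
        return "aborted: system not in steady state before injection"
    inject()
    try:
        held = steady_state()
    finally:
        rollback()  # always undo the fault, even if the check raises
    return "hypothesis held" if held else "hypothesis failed: mitigation needed"


# Toy system: a service with 3 replicas that needs at least 2 to be healthy.
replicas = {"healthy": 3}


def steady_state() -> bool:
    return replicas["healthy"] >= 2


def kill_one() -> None:
    replicas["healthy"] -= 1


def restore() -> None:
    replicas["healthy"] = 3


result = run_experiment(steady_state, kill_one, restore)
print(result)  # hypothesis held
```

Checking the steady state before injecting the fault keeps the experiment from running against a system that is already degraded, and the rollback in `finally` limits the blast radius.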

Tools for Chaos Engineering

  • Popular tools: Litmus Chaos, Gremlin, Chaos Monkey, Chaos Mesh, AWS FIS.
  • Litmus Chaos:
    • Open-source, flexible, good integration with AWS SSM.
    • Can execute chaos in centralized or distributed environments.

Running Chaos Experiments with Litmus

  1. Setup: Deploy Litmus on an EKS cluster with Prometheus monitoring in place.
  2. Chaos Scenarios: Define and execute chaos scenarios (e.g., memory hog).
  3. Observability: Integrate with Prometheus and Grafana to monitor chaos impacts.

Example Chaos Experiments

Memory Hog Experiment

  • Induces memory pressure on a pod and monitors the impact.
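
As a rough illustration, the sketch below builds a Litmus ChaosEngine manifest for the pod-memory-hog experiment and prints it as JSON. The field names follow the ChaosEngine custom resource as commonly documented, but the engine name, namespace, label, and service account are placeholder assumptions, and the schema can differ between Litmus versions, so check it against the installed release.

```python
import json


def memory_hog_engine(namespace: str, app_label: str, memory_mb: int, duration_s: int) -> dict:
    """Build a ChaosEngine manifest for pod-memory-hog (names are placeholders)."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "memory-hog-demo", "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Which workload to target; assumes a Deployment selected by label.
            "appinfo": {"appns": namespace, "applabel": app_label, "appkind": "deployment"},
            # Assumed service account with permissions to run the experiment.
            "chaosServiceAccount": "pod-memory-hog-sa",
            "experiments": [
                {
                    "name": "pod-memory-hog",
                    "spec": {
                        "components": {
                            "env": [
                                # Duration of the chaos in seconds.
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                                # Memory to consume inside the target pod, in MB.
                                {"name": "MEMORY_CONSUMPTION", "value": str(memory_mb)},
                            ]
                        }
                    },
                }
            ],
        },
    }


engine = memory_hog_engine("default", "app=nginx", memory_mb=500, duration_s=60)
print(json.dumps(engine, indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be saved to a file and applied with `kubectl apply -f`, provided the experiment and service account already exist in the cluster.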

Security Group Change Experiment

  • Changes security group settings using AWS SSM documents.
  • Evaluates how the system responds and whether alerts fire for the resulting connectivity issues.

Conclusion

  • Chaos Engineering is essential for improving system reliability and resilience.
  • Regular chaos testing should be part of an organization's ongoing practices.
  • Benefits of using Litmus Chaos include open-source flexibility and centralized chaos execution.