Understanding System Recovery Metrics

Aug 9, 2024

Lecture Notes: System Recovery Metrics and Equipment Maintenance

Key Metrics for System Recovery

1. Recovery Time Objective (RTO)

  • Definition: The target amount of time within which systems must be back up and running at a particular service level after an outage.
  • Example: The maximum time allowed to get a web server operational again.

2. Recovery Point Objective (RPO)

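  • Definition: The minimum amount of data that must be available before the system can be declared recovered. It is commonly expressed as the maximum window of recent data that can be lost.
  • Example: Recovery might require only the last hour of data, or it might require the entire database.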

3. Mean Time to Repair (MTTR)

  • Definition: The average time it takes to repair a system and get it back up and running.
  • Example: A repair might take an hour or 24 hours, depending on the issue.

4. Mean Time Between Failures (MTBF)

  • Definition: A prediction of the average time a system runs between one failure and the next.
  • Example: Solid-state systems generally have a longer MTBF than systems with moving parts.
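
The four metrics above can be illustrated with a small calculation. The sketch below is only an example under stated assumptions: the outage log, the reporting period, and the function names are hypothetical, and MTBF is estimated here from observed history (uptime divided by number of failures) rather than taken from a manufacturer's prediction. The RPO check assumes the objective is expressed as a time window of data.

```python
from datetime import datetime, timedelta

# Hypothetical outage log: (failure start, service restored).
outages = [
    (datetime(2024, 1, 10, 8, 0), datetime(2024, 1, 10, 9, 0)),
    (datetime(2024, 3, 5, 14, 0), datetime(2024, 3, 5, 17, 0)),
    (datetime(2024, 6, 20, 2, 0), datetime(2024, 6, 20, 4, 0)),
]
# Hypothetical reporting period covered by the log.
period_start = datetime(2024, 1, 1)
period_end = datetime(2024, 7, 1)


def total_downtime(outages):
    """Sum of all repair durations in the log."""
    return sum((end - start for start, end in outages), timedelta())


def mttr_hours(outages):
    """Mean time to repair: total downtime / number of failures."""
    return total_downtime(outages).total_seconds() / 3600 / len(outages)


def mtbf_hours(outages, period_start, period_end):
    """Observed mean time between failures: total uptime / number of failures."""
    uptime = (period_end - period_start) - total_downtime(outages)
    return uptime.total_seconds() / 3600 / len(outages)


def meets_rto(outage, rto):
    """True if service was restored within the recovery time objective."""
    start, end = outage
    return (end - start) <= rto


def meets_rpo(last_backup, failure_time, rpo):
    """True if the data written since the last backup fits inside the RPO window."""
    return (failure_time - last_backup) <= rpo


mttr = mttr_hours(outages)
mtbf = mtbf_hours(outages, period_start, period_end)
print(mttr)  # 2.0
print(mtbf)  # 1454.0
# Steady-state availability from the two averages.
print(f"{mtbf / (mtbf + mttr):.3%}")
print([meets_rto(o, timedelta(hours=2)) for o in outages])
```

With a two-hour RTO, only the three-hour outage in this sample log misses the objective. The availability line uses the standard relation MTBF / (MTBF + MTTR).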

Equipment and Configuration Maintenance

Backup and Management

  • Devices: Routers, switches, firewalls, and other infrastructure devices.
  • Configurations: IP address settings, security settings, port configurations, etc.
  • Backup Process: Often automated; configurations can be downloaded across the network and stored on a separate machine.
  • Version Specificity: Configurations may be specific to the firmware version, so a configuration saved under one version may not load correctly under another.
  • Importance of Firmware Backup: Keep backups of both the configurations and the matching firmware versions for quick recovery.
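
One way to keep each configuration tied to the firmware it was saved under is to store that version alongside the backup. The sketch below is a minimal illustration, not a vendor tool: `save_config_backup`, the directory layout, and the metadata fields are all hypothetical, and it assumes the configuration text has already been downloaded from the device by some other means.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_config_backup(backup_dir, device, firmware_version, config_text):
    """Write one config snapshot plus a metadata file naming its firmware version."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    device_dir = Path(backup_dir) / device
    device_dir.mkdir(parents=True, exist_ok=True)

    # The firmware version is part of the file name and of the metadata.
    config_path = device_dir / f"{stamp}_fw{firmware_version}.cfg"
    config_path.write_text(config_text, encoding="utf-8")

    meta = {
        "device": device,
        "firmware_version": firmware_version,
        "saved_at": stamp,
        # Checksum lets a later restore confirm the file was not altered.
        "sha256": hashlib.sha256(config_text.encode("utf-8")).hexdigest(),
    }
    config_path.with_suffix(".json").write_text(
        json.dumps(meta, indent=2), encoding="utf-8"
    )
    return config_path


# Example run against a throwaway directory (device name and version are made up).
with tempfile.TemporaryDirectory() as tmp:
    saved = save_config_backup(tmp, "core-sw1", "15.2.4", "hostname core-sw1\n")
    print(saved.name)
```

In practice the backup directory would live on a separate machine, as the notes describe, and the firmware image itself would be archived next to these files.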

Practical Uses

  • Reverting Firmware: Use the firmware backup to revert to a previous firmware version, then apply the configuration backup that matches that version to restore device functionality.
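
Because configurations can be firmware-specific, a rollback needs the configuration that was saved under the firmware being restored. The sketch below shows one possible selection rule, assuming a simple in-memory catalog of backups; the `ConfigBackup` record, its fields, and `pick_rollback_config` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ConfigBackup:
    device: str
    firmware_version: str
    saved_at: datetime
    path: str


def pick_rollback_config(backups, device, target_firmware):
    """Return the newest backup for `device` saved under `target_firmware`, or None."""
    candidates = [
        b for b in backups
        if b.device == device and b.firmware_version == target_firmware
    ]
    return max(candidates, key=lambda b: b.saved_at, default=None)


# Made-up catalog: two backups under the old firmware, one under the new.
catalog = [
    ConfigBackup("edge-fw1", "9.1", datetime(2024, 5, 1), "edge-fw1/a.cfg"),
    ConfigBackup("edge-fw1", "9.1", datetime(2024, 6, 1), "edge-fw1/b.cfg"),
    ConfigBackup("edge-fw1", "9.2", datetime(2024, 7, 1), "edge-fw1/c.cfg"),
]

choice = pick_rollback_config(catalog, "edge-fw1", "9.1")
print(choice.path)  # edge-fw1/b.cfg
```

Returning `None` when no matching backup exists makes the gap explicit, so the operator can decide whether to adapt a configuration from another version by hand.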