🏗️

Amazon Case Study 2

Nov 30, 2025

Summary

Interview with Lee Atchison on Amazon’s migration from a monolithic application (Obidos) to a service-oriented architecture (Gurupa), and how organizational design, ownership, and risk management drive scalability and availability.

Action Items

  • No explicit dated or owned action items were mentioned in the transcript.

Amazon Monolith to SOA Migration

  • Amazon’s original retail platform centered on a single application called Obidos handling every web transaction.
  • About 100 engineers worked on overlapping parts of Obidos, creating coordination, deployment, and reliability challenges.
  • Deployments were attempted twice a week but frequently failed, so successful releases were rare and development slowed.
  • Obidos funneled all logic into a single location; this centralized design became a scaling and change bottleneck.
  • Amazon realized that treating each feature or service as an independent component would improve flexibility and reduce the blast radius of failures.
  • The migration project from Obidos to SOA was called Gurupa and started in January 2005.
  • Gurupa’s design goal: pluggable modules, independently developed, tested, and deployed, acting as services.

Gurupa Migration Execution

  • Gurupa was Amazon’s distributed system replacing the monolith and supporting both front-end and back-end services.
  • Migration took about two years, with Atchison later leading the coordination team for the move.
  • A major cutover event occurred over a single 24-hour period, migrating country by country.
  • Migration war room setup: metrics displayed, phones to contact other teams, all core participants colocated.
  • Approximately 0.1% of all internet traffic shifted architectures that day, with minimal external visibility.
  • A core objective was avoiding a “New York Times event” (public negative press on Amazon outages or failures).
  • Success was defined as executing the migration without triggering press coverage of a major incident.
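
The country-by-country cutover described above can be pictured as a staged rollout loop. The following is only a minimal sketch under stated assumptions, not Amazon's actual procedure: `CutoverPlan`, the `is_healthy` callback, the rollback-and-halt behavior, and the placeholder country names are all hypothetical and do not come from the interview.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CutoverPlan:
    """Hypothetical staged cutover: route one country at a time to the new stack."""
    countries: List[str]
    # Maps country -> "old" or "new"; every country starts on the old stack.
    routing: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for country in self.countries:
            self.routing.setdefault(country, "old")

    def run(self, is_healthy: Callable[[str], bool]) -> List[str]:
        """Shift countries in order; roll back and halt on the first failure.

        Returns the countries that ended up on the new stack.
        """
        migrated: List[str] = []
        for country in self.countries:
            self.routing[country] = "new"
            if not is_healthy(country):
                # Assumed policy: revert only the failing country, then stop.
                self.routing[country] = "old"
                break
            migrated.append(country)
        return migrated


# Usage with placeholder country names and a fake health check.
plan = CutoverPlan(["country_a", "country_b", "country_c"])
migrated = plan.run(lambda country: country != "country_c")
print(migrated)       # countries now served by the new stack
print(plan.routing)   # final routing table
```

Moving one country at a time keeps the blast radius of any failure limited to the country currently being shifted, which fits the stated goal of a low-visibility transition.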

Migration Overview Table

| Aspect | Monolith (Obidos) | SOA (Gurupa) |
| --- | --- | --- |
| Architecture | Single funnel application for all website transactions | Distributed services with modular, pluggable components |
| Team involvement | ~100 engineers working on one codebase | Multiple teams owning independent services |
| Deployment cadence | Attempted twice weekly, often failing; deployments rare | Independently deployed services (goal of easier, more frequent deployments) |
| Timeline | Pre-2005 primary retail platform | Migration started January 2005; lasted about two years |
| Cutover strategy | N/A (existing monolith) | 24-hour, country-by-country migration, centrally coordinated |
| Risk goal | N/A | Avoid a “New York Times event” by smooth, low-visibility transition |
| Traffic impact | 100% traffic through monolith | ~0.1% of internet traffic shifted during cutover day, largely unnoticed outside |

STOSA (Single Team Oriented Service Architecture)

  • STOSA is Atchison’s organizational model for building scalable, highly available applications.
  • It emphasizes that architecture alone is insufficient; organizational structure must support scalability.
  • Core idea: align services with single, focused teams that have clear ownership responsibilities.
  • Ownership must be spelled out explicitly (the responsibilities and authority it carries), not reduced to a team name attached to a service.
  • Service-level agreements (SLAs) and inter-service SLAs are central to STOSA.
  • STOSA focuses on best practices for how teams interact to maintain availability and scale.
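
One way to picture single-team ownership combined with SLAs is a small service record that names exactly one owning team and the targets it promises to callers. This is a hedged sketch only: the interview does not say how SLAs were defined or enforced at Amazon, so `ServiceSLA`, `Service`, `sla_violations`, the chosen metrics (availability and p99 latency), and the example service and team names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceSLA:
    """Hypothetical targets a service promises to its callers."""
    availability_pct: float  # minimum percentage of successful requests
    p99_latency_ms: float    # maximum 99th-percentile latency


@dataclass(frozen=True)
class Service:
    name: str
    owning_team: str  # exactly one team is responsible for the service
    sla: ServiceSLA


def sla_violations(service: Service,
                   measured_availability_pct: float,
                   measured_p99_ms: float) -> List[str]:
    """Compare measurements against the SLA and report breaches to the owner."""
    violations: List[str] = []
    if measured_availability_pct < service.sla.availability_pct:
        violations.append(
            f"{service.name}: availability {measured_availability_pct}% "
            f"below target {service.sla.availability_pct}% "
            f"(owner: {service.owning_team})"
        )
    if measured_p99_ms > service.sla.p99_latency_ms:
        violations.append(
            f"{service.name}: p99 latency {measured_p99_ms} ms "
            f"above target {service.sla.p99_latency_ms} ms "
            f"(owner: {service.owning_team})"
        )
    return violations


# Usage with made-up names and numbers.
cart = Service("cart", "cart-team", ServiceSLA(availability_pct=99.9,
                                               p99_latency_ms=200.0))
for message in sla_violations(cart, measured_availability_pct=99.5,
                              measured_p99_ms=250.0):
    print(message)
```

Because every breach message carries the single owning team, there is no ambiguity about who responds, which is the point of aligning each service with one team.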

STOSA Key Elements Table

| Element | Description |
| --- | --- |
| Team orientation | Each service aligned with a single, clearly responsible development team |
| Ownership definition | Explicitly define responsibilities and authority for each service |
| SLAs | Formal service-level agreements within and between services |
| Inter-team interaction | Practices for how teams collaborate while preserving clear ownership |
| Goal | Enable scalable, highly available systems through organizational design |

Risk Management and Scalability

  • High availability and scaling require planned risk management, not just technical design choices.
  • Teams should build a risk matrix capturing both known and unknown risks for their applications.
  • Each risk should be assigned severity and priority to support rational planning and trade-offs.
  • The risk matrix becomes documentation of technical debt and the team’s risk plan.
  • An empty risk matrix indicates misunderstanding or lack of reflection, not the absence of risk.
  • The objective is not zero risk but minimizing unknown and unplanned-for risks.
  • Recognizing, organizing, prioritizing, and planning for risks prepares teams for inevitable problems.
  • Better risk planning improves availability and scalability, as many risks manifest under scale.
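
A risk matrix with the attributes mentioned above (description, severity, priority) can be sketched as a small data structure. The notes do not specify the exact fields or scales, so the three-level `Level` scale, the `Risk` record, the `plan_order` helper, and the sample risks below are all assumptions for illustration, not the recommended format.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Level(IntEnum):
    """Assumed three-step scale; the source does not define one."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Risk:
    description: str
    severity: Level  # how bad the impact is if the risk materializes
    priority: Level  # how urgently the team plans to address it


def plan_order(matrix: List[Risk]) -> List[Risk]:
    """Return risks ordered by priority, then severity, highest first.

    An empty matrix is rejected, mirroring the point that it signals
    a lack of reflection rather than an absence of risk.
    """
    if not matrix:
        raise ValueError("empty risk matrix: no risks have been identified yet")
    return sorted(matrix, key=lambda r: (r.priority, r.severity), reverse=True)


# Usage with invented example risks.
matrix = [
    Risk("Single database instance has no failover", Level.HIGH, Level.MEDIUM),
    Risk("Log disk fills up under peak traffic", Level.MEDIUM, Level.HIGH),
    Risk("Deprecated library still in the build", Level.LOW, Level.LOW),
]
for risk in plan_order(matrix):
    print(risk.priority.name, risk.severity.name, risk.description)
```

Keeping severity and priority as separate fields lets a team record that a severe risk is knowingly deferred, which is what turns the matrix into documentation of technical debt.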

Risk Management Table

| Aspect | Description |
| --- | --- |
| Risk matrix purpose | Document known and unknown risks, technical debt, and planning assumptions |
| Attributes tracked | Risk description, severity, priority |
| Good sign | Non-empty matrix showing teams have thought through issues |
| Bad sign | Empty matrix implying denial or lack of understanding |
| Main goal | Reduce unknown/unplanned risks rather than eliminate all risk |
| Impact on scale | Preparedness for risks that surface primarily during scaling |

Decisions

  • Amazon decided to replace the monolith Obidos with a distributed SOA platform named Gurupa.
  • The migration ran for about two years overall, with the main cutover concentrated into a single 24-hour, country-by-country event.
  • Organizational design (STOSA concepts) and explicit ownership were recognized as critical to sustained scalability.
  • Risk planning via a risk matrix was adopted as a key practice for improving availability and scalability.

Open Questions

  • Specific metrics or thresholds Amazon used to declare the Gurupa migration fully complete are not described.
  • Details on how SLAs and inter-service SLAs were defined and enforced within Amazon remain unspecified.
  • The exact structure and fields of the recommended risk matrix are not provided.