Summary
Interview with Lee Atchison on Amazon’s migration from a monolithic application (Obidos) to a service-oriented architecture (Gurupa), and how organizational design, ownership, and risk management drive scalability and availability.
Action Items
- No explicit dated or owned action items were mentioned in the transcript.
Amazon Monolith to SOA Migration
- Amazon’s original retail platform centered on a single application called Obidos handling every web transaction.
- About 100 engineers worked on overlapping parts of Obidos, creating coordination, deployment, and reliability challenges.
- Deployments were attempted twice a week but frequently failed, so successful releases were rare, which slowed development.
- Obidos funneled all logic into a single location; this centralized design became a scaling and change bottleneck.
- Amazon realized treating each feature/service as an independent component would improve flexibility and reduce blast radius.
- The migration project from Obidos to SOA was called Gurupa and started in January 2005.
- Gurupa’s design goal: pluggable modules, independently developed, tested, and deployed, acting as services.
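The pluggable-module goal can be illustrated with a small in-process sketch. Every name here (`Service`, `ServiceRegistry`, `CatalogService`, `CartService`) is hypothetical and does not come from the interview. It only shows independent components behind a shared interface, not Gurupa's actual design.

```python
from typing import Protocol


class Service(Protocol):
    """Interface every pluggable module implements (hypothetical)."""

    name: str

    def handle(self, request: dict) -> dict: ...


class CatalogService:
    name = "catalog"

    def handle(self, request: dict) -> dict:
        return {"service": self.name, "item": request.get("item_id")}


class CartService:
    name = "cart"

    def handle(self, request: dict) -> dict:
        return {"service": self.name, "count": len(request.get("items", []))}


class ServiceRegistry:
    """Routes requests by service name; modules are registered independently."""

    def __init__(self) -> None:
        self._services: dict[str, Service] = {}

    def register(self, service: Service) -> None:
        self._services[service.name] = service

    def dispatch(self, name: str, request: dict) -> dict:
        if name not in self._services:
            raise KeyError(f"no service registered under {name!r}")
        return self._services[name].handle(request)


registry = ServiceRegistry()
registry.register(CatalogService())
registry.register(CartService())
```

Because each class depends only on the shared interface, one module can be replaced or tested without touching the others, which is the property the design goal describes.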
Gurupa Migration Execution
- Gurupa was Amazon’s distributed system replacing the monolith and supporting both front-end and back-end services.
- Migration took about two years, with Atchison later leading the coordination team for the move.
- A major cutover event occurred over a single 24-hour period, migrating country by country.
- Migration war room setup: metrics displayed, phones to contact other teams, all core participants colocated.
- Approximately 0.1% of all internet traffic shifted architectures that day, with minimal external visibility.
- A core objective was avoiding a “New York Times event” (public negative press on Amazon outages or failures).
- Success was defined as executing the migration without triggering press coverage of a major incident.
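A country-by-country cutover can be sketched as a loop that switches one country, checks its metrics, and only then moves on. This is an illustrative assumption, not the procedure Amazon used: the function names, the health check, and the decision to halt on a failed check are all hypothetical.

```python
from typing import Callable


def cut_over(
    countries: list[str],
    switch: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Switch countries one at a time, stopping at the first unhealthy one.

    Returns the countries that were switched and passed their health check.
    """
    migrated: list[str] = []
    for country in countries:
        switch(country)           # route this country's traffic to the new platform
        if not healthy(country):  # check metrics before touching the next country
            break
        migrated.append(country)
    return migrated


# Hypothetical routing table and checks, for illustration only.
routing = {"US": "monolith", "UK": "monolith", "DE": "monolith", "JP": "monolith"}


def switch_to_soa(country: str) -> None:
    routing[country] = "soa"


def looks_healthy(country: str) -> bool:
    return country != "DE"  # pretend DE's metrics regress after the switch


done = cut_over(list(routing), switch_to_soa, looks_healthy)
```

In this run the loop halts at `DE`, so `JP` is never switched and the problem stays contained to one country.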
Migration Overview Table
| Aspect | Monolith (Obidos) | SOA (Gurupa) |
|---|---|---|
| Architecture | Single funnel application for all website transactions | Distributed services with modular, pluggable components |
| Team involvement | ~100 engineers working on one codebase | Multiple teams owning independent services |
| Deployment cadence | Attempted twice weekly, often failing; deployments rare | Independently deployed services (goal of easier, more frequent deployments) |
| Timeline | Pre-2005 primary retail platform | Migration started January 2005; lasted about two years |
| Cutover strategy | N/A (existing monolith) | 24-hour, country-by-country migration, centrally coordinated |
| Risk goal | N/A | Avoid a “New York Times event” by smooth, low-visibility transition |
| Traffic impact | 100% of traffic through the monolith | ~0.1% of internet traffic shifted during cutover day, largely unnoticed outside |
STOSA (Single Team Oriented Service Architecture)
- STOSA is Atchison’s organizational model for building scalable, highly available applications.
- It emphasizes that architecture alone is insufficient; organizational structure must support scalability.
- Core idea: align services with single, focused teams that have clear ownership responsibilities.
- Ownership must be explicitly defined (responsibilities and authority), not just recorded as a team name attached to a service.
- Service-level agreements (SLAs) and inter-service SLAs are central to STOSA.
- STOSA focuses on best practices for how teams interact to maintain availability and scale.
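An inter-service SLA with a named owner can be sketched as a small record plus a check against observed measurements. The transcript does not say how SLAs were defined or enforced, so the fields (`min_availability`, `max_latency_ms`), the thresholds, and the team name below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    """Hypothetical inter-service SLA: who owns it and what it promises."""

    service: str
    owner_team: str
    min_availability: float  # fraction of successful requests, e.g. 0.999
    max_latency_ms: float    # upper bound on the measured latency


def meets_sla(sla: Sla, total: int, failed: int, latency_ms: float) -> bool:
    """Return True when the observed window satisfies both SLA terms."""
    if total == 0:
        return True  # no traffic in the window, so nothing was violated
    availability = (total - failed) / total
    return availability >= sla.min_availability and latency_ms <= sla.max_latency_ms


cart_sla = Sla("cart", "cart-team", min_availability=0.999, max_latency_ms=200.0)
within = meets_sla(cart_sla, total=10_000, failed=5, latency_ms=150.0)
too_many_failures = meets_sla(cart_sla, total=10_000, failed=20, latency_ms=150.0)
```

Tying `owner_team` to the SLA record keeps the promise and the responsible team in one place, in line with the single-team ownership idea.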
STOSA Key Elements Table
| Element | Description |
|---|---|
| Team orientation | Each service aligned with a single, clearly responsible development team |
| Ownership definition | Explicitly define responsibilities and authority for each service |
| SLAs | Formal service-level agreements within and between services |
| Inter-team interaction | Practices for how teams collaborate while preserving clear ownership |
| Goal | Enable scalable, highly available systems through organizational design |
Risk Management and Scalability
- High availability and scaling require planned risk management, not just sound technical design choices.
- Teams should build a risk matrix capturing both known and unknown risks for their applications.
- Each risk should be assigned severity and priority to support rational planning and trade-offs.
- The risk matrix becomes documentation of technical debt and the team’s risk plan.
- An empty risk matrix indicates misunderstanding or lack of reflection, not the absence of risk.
- The objective is not zero risk but minimizing unknown risks and unplanned-for risks.
- Recognizing, organizing, prioritizing, and planning for risks prepares teams for inevitable problems.
- Better risk planning improves availability and scalability, as many risks manifest under scale.
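The risk matrix described above can be sketched as a list of records. The interview names description, severity, and priority as the tracked attributes; the three-level scale, the `plan` field, and the example entries are assumptions added here, since the exact structure is not given.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Risk:
    description: str
    severity: Level  # how bad the impact is if the risk materializes
    priority: Level  # how urgently the team intends to address it
    plan: str = ""   # mitigation plan; empty means unplanned (assumed field)


def review(matrix: list[Risk]) -> list[Risk]:
    """Return risks ordered by priority, then severity, highest first."""
    if not matrix:
        # An empty matrix signals a lack of reflection, not a lack of risk.
        raise ValueError("empty risk matrix: risks have not been examined")
    return sorted(matrix, key=lambda r: (r.priority, r.severity), reverse=True)


def unplanned(matrix: list[Risk]) -> list[Risk]:
    """Risks that are recorded but have no mitigation plan yet."""
    return [r for r in matrix if not r.plan.strip()]


# Hypothetical entries, for illustration only.
matrix = [
    Risk("Single database instance", Level.HIGH, Level.MEDIUM, "Add a replica"),
    Risk("No load test above current peak", Level.MEDIUM, Level.HIGH),
    Risk("Manual certificate renewal", Level.LOW, Level.LOW, "Automate renewal"),
]
ordered = review(matrix)
gaps = unplanned(matrix)
```

Listing the entries without a plan gives the team a direct view of its unplanned-for risks, which are the ones the practice aims to minimize.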
Risk Management Table
| Aspect | Description |
|---|---|
| Risk matrix purpose | Document known and unknown risks, technical debt, and planning assumptions |
| Attributes tracked | Risk description, severity, priority |
| Good sign | Non-empty matrix showing teams have thought through issues |
| Bad sign | Empty matrix implying denial or lack of understanding |
| Main goal | Reduce unknown/unplanned risks rather than eliminate all risk |
| Impact on scale | Preparedness for risks that surface primarily during scaling |
Decisions
- Amazon decided to replace the monolith Obidos with a distributed SOA platform named Gurupa.
- The migration ran for about two years overall, with the final cutover concentrated into a single 24-hour, country-by-country event.
- Organizational design (STOSA concepts) and explicit ownership were recognized as critical to sustained scalability.
- Risk planning via a risk matrix was adopted as a key practice for improving availability and scalability.
Open Questions
- Specific metrics or thresholds Amazon used to declare the Gurupa migration fully complete are not described.
- Details on how SLAs and inter-service SLAs were defined and enforced within Amazon remain unspecified.
- The exact structure and fields of the recommended risk matrix are not provided.