Hello everyone, and thank you for welcoming us into your virtual stream of HashiConf Digital 2020. My name is Maddie, and I'm joined today by my colleague Luca. We are on the Site Reliability Engineering team at Eventbrite, the largest event technology platform.

Today our goal is to give you an authentic picture of how we use the AWS Terraform Landing Zone at Eventbrite to streamline our multi-AWS-account strategy, and of the work we did to take it from source code to production. We hope you leave this talk with a better understanding of what this accelerator is, whether it could be a solution for your team, and an idea of the work involved to get it up and running.

We want to start by looking at the problem space: why is Eventbrite moving to a multi-AWS-account infrastructure, what were our requirements for a solution, and how did we arrive at the AWS Terraform Landing Zone? We think it would then be helpful to give you a zoomed-out picture of the account vending machine in action, through the eyes of a requesting client, before diving into the work we did to adapt the Terraform Landing Zone to fit our needs.

So let's start with a little background on Eventbrite's journey. Eventbrite's story really isn't different from that of many companies that have grown from a small startup with 30 co-located engineers to 300-plus engineers across multiple time zones. As many of you know, with this growth comes a change in engineering culture and architecture. We started with a monolithic codebase and a small ops team keeping the lights on. In the past couple of years we have focused on the very necessary transition to where we are today, highlighted in red. Like many companies, we knew a monolithic architecture was inhibiting our growth as an engineering team, and we started breaking up our monolith into a microservice-oriented architecture to enjoy benefits such as independent deployments, clear ownership, improvements in system reliability,
and a better separation of concerns. We also sped up development by transitioning from an ops team to an SRE organization and offloading some infrastructure ownership to developers.

However, as we looked toward the future, we realized we had still not achieved the developer efficiency we desired, nor resolved common reliability issues. Despite breaking the monolith into microservices, the underlying infrastructure was built in the same AWS account and was too tightly coupled to allow clear ownership, efficient development, and reliable systems.

So how do we next evolve our infrastructure? We decided to move to a multi-AWS-account strategy and decouple our systems into isolated domains. In our definition, domains are fully isolated segments of our business within an event-driven architecture, each with both compute and data layer independence. By segmenting into domains we address two of our goals: having global engineering teams run production systems without the bottleneck of SRE, and increasing system reliability. In a domains world, developers can not only operate production systems, but they no longer need to understand how the entire platform works and can move quicker with fewer dependencies. Domains are also not susceptible to instability encountered in other domains.

So great, we have this shiny, bright future of domains, but how do we get there? Before we can even start thinking about development, we must first build the infrastructure for managing, maintaining, and evolving a multi-AWS-account solution. So what are our requirements for this multi-AWS-account infrastructure?

Our first category of requirements is governance. As mentioned, we want to use AWS accounts to establish the high walls of isolation between domains. Secondly, it's imperative to us that SRE owns the networking components between domains as well as the shared infrastructure. This requirement is also about delivering an easy-to-use solution for our engineers, where they can immediately start developing and some of the harder
aspects of connecting to a shared infrastructure are abstracted away.

With regard to security, control, and compliance, we want to be able to enforce a set of security policies across all accounts, as well as control the services used and actions taken via policies.

Finally, we get to automation. We don't have the bandwidth to set up these domains by hand, and manual configuration also leads to a lack of uniformity and compliance. Thus it is imperative to us that the creation of domains is fully automated, with minimum human toil. It was also essential to our team that everything was done as infrastructure as code. This refers not just to our infrastructure within the AWS platform but also to integrations with third-party providers.

So how did we arrive at the AWS Terraform Landing Zone? As we started our search for an existing multi-AWS-account solution, we first stumbled across AWS Control Tower. Control Tower is a pure AWS service meant to ease governance and security for multi-account organizations. However, as we started a proof of concept, we quickly realized that the amount of manual configuration and, at the time, the lack of integration with our SAML provider, Okta, was a blocker.

But, as is often the case, AWS has more than one solution up its sleeve. We then looked at the automated solution, the AWS Landing Zone. AWS Landing Zone seemed to be a more mature solution than Control Tower; however, it still had similar limitations: it is closed to the AWS ecosystem, has no out-of-the-box integration with Okta, and CloudFormation is the only infrastructure-as-code option.

At this point we were a bit stuck as to where to turn next. Luckily, our friends at HashiCorp knew of a new project that the AWS Professional Services org was working on, called the AWS Terraform Landing Zone, or TLZ. TLZ emerged from various requests within the industry to have a Terraform-based AWS Landing Zone. We were fortunate to be granted early contributor access to the TLZ codebase.

So let's do a quick recap of what
the AWS Terraform Landing Zone accelerator is, for those who did not get the chance to see Brad present last year. The word accelerator is important here, as it's actually not a product but more of an automation pattern developed by the AWS Professional Services org team. The simplest way we can explain it is that you can think of TLZ as three parts.

First, a set of automation code. This accelerator aims to codify security and compliance best practices by orchestrating the provisioning, or what TLZ calls baselining, of not just application AWS accounts but also core accounts used for things like logging, security, shared services, and networking.

Next, the concept of an account vending machine, or AVM. The AVM puts all the automation pieces together into the concept of vending an account. The AVM consists of a DynamoDB table that stores account or request information and triggers a series of about 10 Lambda functions. These Lambda functions automate everything from account creation and baselining to setting up what we call the vended account ecosystem, with integrations like Terraform Enterprise, VCS, and SAML providers.

And finally, Terraform Enterprise or Cloud. In order to make the magic of the AVM happen, TLZ leverages essential components of Terraform Enterprise or Cloud, like Enterprise workspaces, the API, and the private module registry. Without these it would be very hard to achieve seamless automation, or for us to offload infrastructure management to developers in a safe way.

So now that we've given you some background on why Eventbrite is moving to a multi-AWS-account strategy and how we arrived at the choice to use the AWS TLZ accelerator, we want to focus on the journey of taking the TLZ codebase and adapting it to our company's needs. However, before taking a look behind the scenes at these adaptations, we think it would first be helpful to give you a vision of the account vending machine in action, through the eyes of a requesting client, and of what the
resulting vended account ecosystem looks like.

The first step of the process is the account request. One of the first things we added to TLZ was a Terraform-based account request procedure via pull requests. This change allowed for input field validation and for all request history to be checked into version control. We like to think that PRs are the new ticket, right?

I am now a developer, and I want a new account for my domain, which handles event listing pages. I create the pull request that you can see here. Not surprisingly, it's an easy-to-use Terraform module, because I love my SRE team and they build amazing, easy-to-use interfaces. These five fields are all I need to provide. My PR is approved, I merge it, and a Terraform plan is kicked off in Terraform Enterprise. At this point I go and make a coffee.

In the account vending process we have one point of human intervention before all the automation begins: members of the SRE team are the only ones with the privilege to apply changes within the account request Enterprise workspace. So an SRE clicks the button and, voilà, applied successfully. At this point the account vending machine is kicked off.

After five to ten minutes of making coffee, I come back to my desk, and what do I now have? In my SAML provider dashboard, which is Okta, I now see two applications: one for accessing our Terraform Enterprise instance and another for accessing the console of the newly vended AWS account. I also see that a new GitHub repository has been created and assigned to my team to hold infrastructure as code. Now, if I've already vended an application account for this domain, say another dev account or a pre-prod account, this repository will already exist, because it holds the code for infrastructure in all environments. This is an example of a change that we introduced. Originally, each environment for a domain had a separate repository; however, this didn't support our vision for future deployment pipelines within Terraform Enterprise. Thus each team is vended
one repo for their infrastructure and can decide which strategy they'll use to maintain their code across multiple environments, whose states are managed by separate Enterprise workspaces.

Last but not least, as a developer on the event listings domain team, upon accessing my TFE instance I now have access to two workspaces: I have read access to the baseline Enterprise workspace and write access to the infra Enterprise workspace. My infra workspace is linked to the repository I was just vended. Of note, this repository already has the Terraform code to import the remote state of the baseline workspace. The infra workspace is also provisioned at vending time with all the necessary variables to create resources, whether this is infrastructure in AWS or third-party integrations for things like monitoring and error tracking. The core concept of baselining and a baseline workspace is something we will go into in further detail later on.

I'll now pass it over to my colleague Luca. Thank you all for joining HashiConf Digital 2020, and enjoy the rest of your day.

Hi everyone, welcome to HashiConf Digital 2020.
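
As a rough illustration of the pull-request-based account request and the vended ecosystem described in the first part of the talk, the following Python sketch validates a request and lists what a requester would expect to receive. It is not TLZ or Eventbrite code: the five field names, the allowed environments, and every naming pattern are assumptions made for illustration, since the talk does not spell them out. Only the overall shape comes from the talk: five fields, two SAML applications, one repository per domain shared across environments, and read access to the baseline workspace plus write access to the infra workspace.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Assumed set of environments; the talk only mentions dev, pre-prod and production.
ALLOWED_ENVIRONMENTS = {"dev", "preprod", "prod"}


@dataclass(frozen=True)
class AccountRequest:
    """Hypothetical five-field request; the real field names are not given in the talk."""
    domain: str
    environment: str
    team: str
    owner_email: str
    description: str


def validate(request: AccountRequest) -> List[str]:
    """Return a list of validation errors; an empty list means the request is acceptable."""
    errors = []
    if not re.fullmatch(r"[a-z][a-z0-9-]{2,30}", request.domain):
        errors.append("domain must be 3-31 lowercase letters, digits or hyphens")
    if request.environment not in ALLOWED_ENVIRONMENTS:
        errors.append("environment must be one of " + ", ".join(sorted(ALLOWED_ENVIRONMENTS)))
    if not request.team.strip():
        errors.append("team must not be empty")
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", request.owner_email):
        errors.append("owner_email must look like an email address")
    if not request.description.strip():
        errors.append("description must not be empty")
    return errors


def vended_ecosystem(request: AccountRequest) -> Dict[str, object]:
    """Describe what the requester sees after vending (all names are illustrative)."""
    account = f"{request.domain}-{request.environment}"
    return {
        # Two SAML applications: Terraform Enterprise and the new account's console.
        "saml_apps": ["terraform-enterprise", f"aws-console-{account}"],
        # One repository per domain, reused by every environment of that domain.
        "repository": f"{request.domain}-infrastructure",
        # Developers read the baseline workspace and write to the infra workspace.
        "workspaces": {f"{account}-baseline": "read", f"{account}-infra": "write"},
    }
```

With these assumptions, a dev and a prod request for the same domain resolve to the same repository but to separate workspaces, which mirrors the single-repository, multiple-workspace layout described above.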
In the second part of this talk we want to take a closer look at the work we've put into adapting TLZ to our needs, focusing on some specific areas that we feel would be familiar to anyone approaching a similar solution.

The first thing we want to make clear is that TLZ is not meant to be a plug-and-play product but rather an accelerator. What this means is that adopting this solution will require some effort, which in our case meant having two engineers, yours truly here today, working almost full time on this project for three months. It's worth mentioning, though, that we entered this project as early contributors and were able to fork the first version of the TLZ source code, which was still far away from its GA status; the GA version will hopefully be released soon. Still, we would like to show some of the work that we think is necessary to successfully bootstrap TLZ and adapt its functionality to your company's needs.

A lot of work goes into adapting the AVM to meet your own needs. As shown already, we added the account request procedure via pull requests. We also had to build some automation steps that we required but that were missing from the initial codebase, for instance to interact with third-party providers. This was done both by modifying existing Lambdas and by adding some new ones.

TLZ intentionally comes without a core network, in order to give you the possibility to either deploy a new network or attach to an existing one. In our case, as we were moving to an event-driven architecture, we dedicated quite some time to deploying a core transit network.

Services that are shared across vended accounts are another core element of TLZ. We've already named a few services that are essential to bootstrap TLZ, like Terraform Enterprise and the VCS provider, but they are not the only ones, of course. Given that each company has its own stack, the setup of these services is part of the duties of the team bootstrapping and managing TLZ.

Once automation, shared services, and shared infrastructure
are in place, a set of services and policies that baseline the vended account must be defined. This set includes integration with the shared infrastructure and security guardrails, both from AWS guidelines and custom to your needs. Although this set is applied at vending time, it can be dynamically changed at a later stage.

We understand that the concept of baselining can be confusing, so we will dedicate some time to showing you this process in more detail. At Eventbrite we've taken the decision to modify the AVM in order to use a unified baseline for all vended accounts. Although this set must be predefined, we can expect it to be subject to changes. We are going to see in a second how TLZ allows us to apply these changes in a seamless way to existing accounts as well.

But let's first start with a brand new account. Here we can see one that was created by the AVM. First of all, the TLZ codebase comes with a default baseline that covers some of the hardest security challenges within a multi-AWS-account infrastructure. Among these are security guardrails and a backend of AWS services for shipping audit logs to a dedicated core account owned by our security team. These are provided to the account via service control policies at the organizational level and as resources in the baseline workspace.

As mentioned already, networking is an area where work needs to be done, and we specifically took the chance to build an event-driven architecture and attach it to the TLZ core network. This included adding some network resources to the baseline to fit our internal IP address spacing model and to attach the vended account to the underlying shared infrastructure.

We've also polished and adapted access management for the vended accounts by better scoping dedicated roles to our internal structure and defining some policies that are enforced across the whole organization. This is one of the areas where Terraform Enterprise really shines, as access keys can be stored as sensitive values
in the Enterprise workspace, removing the need to create and maintain scoped access tokens throughout the organization. In our case, accounts also required a dedicated internal domain. Furthermore, DNS forwarders towards our legacy infrastructure and the event message bus are also shared across the whole organization and added in the baseline process.

So here we have our baselined vended account. Well, a summarized version of it at least, but we don't want to bore you with this any further. As you can see, in our baseline we have the services required to meet our security compliance standards and to abstract away from our developers the harder parts of a multi-AWS-account infrastructure, like connectivity to the core network and DNS.

Now, as we've mentioned, the same baseline process is used for all accounts. This is done primarily so that account baselines do not drift too far apart from each other, giving us more control when changes need to be applied to either a shared service or a core component. TFE is once again the tool that makes this really easy, as we can point every account's baseline Enterprise workspace to the baseline repository. This is done at vending time.

So here we can see two vended accounts with the same baseline but different workloads. Account number one, a bit old school there, deployed an application load balancer and some EC2 instances, while account number two went for a serverless-based infrastructure.

Now, going back to the baseline process: later down the road we decided to give accounts private certificate management capabilities, and specifically to add these as part of our baseline. By simply applying this change to our unified baseline, the change was made available to all application accounts without impacting the deployed workloads. Needless to say, application accounts vended from this point onward use the latest available baseline, which includes private certificate management.

As an extra, in our codebase we have also built the possibility to have a dedicated baseline for
a single account. This is done by simply pointing an account's baseline Enterprise workspace to a branch within the baseline repository. Just a little bit of Terraform Enterprise magic.

Given what we've shown so far, we hope you saw that we are seeking autonomy with constraints, which is a way to empower teams through the golden path approach. At Eventbrite we define the golden path as the way we handle technology and architectural guidance. The overall goal is to provide teams with agency while also ensuring we have a performant and relatively consistent architecture. The idea of the golden path is not to stifle innovation. Technologies that are on the path have been vetted by us to ensure that they will run well: we understand how and when they fail, and we have the expertise needed to use them, rather than their just being something that we launched and are then stuck with. If we truly have a gap in our technology stack that needs filling, then adding something to the golden path will be a relatively quick process.

In the context of TLZ, some aspects of the golden path are managed via a set of guidance policies defined as code and enforced at three different levels. AWS services and actions can be enabled or disabled via service control policies, which are inherited by all created accounts from the AWS organization root, the so-called master payer, and from the account's organizational unit. Access to services is managed via custom IAM policies attached to IAM roles within an account during the baseline process. And lastly, Sentinel policies are added to both the baseline and infra Enterprise workspaces and used to give a soft warning to dev teams diverting from the golden path.

Now let's take a moment to go back and see how we've done against the requirements that we defined initially. For governance, if you remember, we wanted to use AWS accounts to establish the high walls of isolation between domains, and to have our SRE team own the network components and shared infrastructure. We
have shown today how we've achieved account governance thanks to the account vending machine, while infrastructure governance was not a big part of this presentation, although I've mentioned that we spent quite some time creating a dedicated network as part of our new event-driven infrastructure.

Now, for security, we were required to be able to enforce policies across all accounts to control the services used and actions taken. We've shown how security guardrails are defined in the account baseline, while technology and architectural guidance are defined in the golden path.

And lastly, we wanted this process to be automated with minimum human toil, with infrastructure as code used everywhere. We have shown how this was the case, from the PR account request all the way to the integration between Terraform Enterprise and our VCS.

So, having checked our requirements, it's now time to draw some conclusions. First of all, we can state that, once released, the AWS Terraform Landing Zone will be a valid solution for setting up a multi-AWS-account infrastructure. But it is not, and as far as we know it never will be, a plug-and-play product, but rather an accelerator, which requires lifelong development and adaptation, as we've partially seen today. We also cannot stress enough how important it is to define a golden path for your engineering teams, so that they can fully leverage the flexibility and the tremendous growth speed that a solution like the one we've shown you today offers, without encountering common pitfalls.

But I guess the question that everybody wants to ask us now is: was it worth it? At Eventbrite we use the AWS Terraform Landing Zone to power our new domain infrastructure via a self-service account vending machine, and we can now say that the effort to adapt this new solution was totally worth it, as without it we simply could not have achieved our multi-AWS-account strategy while meeting all of our requirements.

Thank you so much for joining
our talk. We sincerely hope that you've enjoyed it and learned about this new solution that has changed the way we work at Eventbrite. Ciao ciao, and have a good conference.