Transcript for:
Beginner's DevOps Course

This beginner's DevOps course is your first step toward a DevOps engineering role. It is taught by the CEO and co-founder of LayerCI. The goal of this course is for regular developers and regular engineering practitioners to learn fundamental DevOps concepts so that they can move toward a DevOps engineering role. We'll also be talking about DevOps broadly in the introduction, but beyond that we'll primarily be talking about the engineering side of things.

DevOps is a methodology that helps engineering teams build products better by continuously integrating user feedback.

If you google DevOps and look for pictures, you'll often see ones like this. It really helps to understand how DevOps is different from the traditional way of thinking about software development. Back in the day, software was developed much like things would be developed in a factory: the input would be programming, and the output would be a product that you could put on a CD and sell to users.

But since the advent of the internet and of continuously updatable software, it's become really easy to launch things, get user feedback, and integrate that feedback into the current product instead of making a new version of the product. So websites like Facebook continuously upgrade instead of requiring you to buy, you know, a new version of Facebook, unlike old games like SimCity, which would require you to buy a new version. That idea is formalized by DevOps. The sections are: planning, where you take a set of features that you want to build, and you work with your team to make specifications for what those features might look like.

Then you code them. Developers on your team build out these features so that they can be released. And then they're built. For a website, you might take the source code and bundle it into JavaScript that a user's browser can run. For a video game, you might make releases for versions that run on Linux, versions that run on Windows, and versions that run in the browser. You take these built artifacts and you test them. Testing is both automatic and manual: automatic testing is usually colloquially known as continuous integration, and manual testing is colloquially known as quality assurance, or QA.

After it's tested and the stakeholders have all given their feedback, it's released. With continuous deployment strategies, releasing and deploying happen automatically after a change is known to be good. There's a lot of automation that can be done here in larger teams; there are popular tools like Spinnaker by Netflix that we'll talk about in later talks. But the core idea is that you want to take the software and send it to your users in a way that they don't notice if there are problems. So if there's an experimental UI change, you might show it to a few percent of users and get their feedback before you show it broadly. Again, for a company like Facebook, which has billions of users, even if 1% of their users complain, they'll get tens of millions of emails.

Once the release is built, it's deployed. Deploying means it's released to your users: for a website, it would mean it's publicly accessible on the internet; for a CD-ROM, you'd bundle your things onto a CD and distribute that; for a mobile release, you'd build the artifact and submit it to the App Store.
Then the App Store would review it and publish a new update that your users could download.

And then you operate it. Operating is primarily things like scaling (making sure that enough resources exist for the load, adding more servers as required), configuring things, dealing with architectural problems, and monitoring. So as your users use your software, and especially as they submit things, start jobs, and create posts on your forums, you want to make sure that those operations are all healthy.

Then finally, you take all this feedback and you put it back into the planning stage. The planning stage takes all the user feedback, takes all of the things that the operations and deployment teams learned about deploying and scaling the product, and uses that to build out new features, fix bugs, and make new versions of the back end and new versions of the architecture. And then it just continues in a cycle. This is what people mean when they say "our company uses DevOps", or "our company is tech forward", or "our company is digitally transformed". They mean that instead of taking a set of requirements and building one artifact which is then shipped, it's a continuous cycle of taking feedback, you know, in these two-week Scrum cycles usually, and producing software that users actually want to use, and that they've had some say in producing.

DevOps engineering is another common part of DevOps. Beyond the methodology, which is something that maybe the technical leaders and the CEO would care about, there's a subfield called DevOps engineering, and this is usually what engineers mean when they say DevOps, and it's usually what job postings mean when they say DevOps. If a job posting is asking for a DevOps engineer, they're not asking for someone that can plan and code features; they're mostly asking for someone that can build, test, release, deploy, and monitor.

The three pillars of DevOps engineering are pull request automation, deployment automation, and application performance management. We'll get into specifics about those, but the idea is this: pull request automation helps developers build things faster, and helps them understand whether their proposed change is good, faster. Deployment automation helps you deploy your code in a way that users don't complain about. Again, Facebook has lots of deployment automation, because if they just threw their code out into the void every time a developer made a change, there'd be hundreds of millions of complaints. And application performance management is automation around making sure that things are healthy: automatically detecting downtime, automatically waking someone up if, you know, the site goes down overnight, automatically rolling things back if there's a problem. We'll get into the specifics of all of these in future talks.

The first pillar I mentioned, pull request automation, primarily has to do with the developer feedback cycle. Developers share work with each other by proposing these atomic sets of changes called pull requests. By atomic I mean they're full-featured on their own; they don't require other things to run. The idea is that if a developer proposes a pull request, they should expect that the change is good and, as far as they can tell, fulfills some business requirement. And then what they have to do is get through some gates.
In organizations doing pull request automation, the goal is to make sure that developers can tell very quickly whether their change is good or not. For example, if you're working on a website and a developer proposes a change that adds a typo, that's something that can easily be automatically detected. If you set up a typo gate that says no change may go in if it contains a typo, that would be an easy way to make sure that developers get automatic feedback about their changes.

When people say pull requests, you know, as of 2021, they usually mean Git. Git is a technology originally created for Linux development, and it helps developers make these sorts of changes and share them with each other. A pull request is usually reviewed by at least one other programmer in something called a code review, where the other programmer will tell the proposing programmer about code style, architectural problems, scaling problems, and other subjective things that can't easily be automated. But that process of review can also be greatly facilitated by a DevOps technology stack. DevOps automation can help with things like ephemeral environments, linting, and all of the other automations that we'll get into.

After the code review has been done, usually an engineering manager or product manager in charge of the functionality being proposed will give feedback. So if you create a new button on a website, you'd like the designer that designed the button and the product manager that requested the button both to give feedback, because if the button is phrased poorly, if it's placed poorly, if it's not mobile responsive, those are all problems that would require another merge request. It would be great if the original merge request fulfilled all of the requirements the first time it was proposed. So usually non-technical people will give feedback on pull requests as necessary.

So what can a DevOps engineer automate here? You can automate things like automated test running, per-change ephemeral environments, automated security scanning, and notifications to reviewers, getting the right people to review the change at the right time. The end goal of all this automation is that a developer should be able to propose a change and get it merged the same day they propose it. That's a huge organizational benefit, because it means that critical bugs can be very quickly fixed, merged, and deployed without needing a special process. It also means that developers aren't bogged down in bureaucracy: they can propose changes, and once they get through all the gates, the change will be deployed; there aren't additional special gates that they need to discover. For example, if the proper gates and automations have been set up, a developer should be able to change a web page without having to ask everyone in the company whether this web page is used in certain workflows or not. By virtue of passing the tests and passing the QA review, it's assumed that the new change is good. And if a problem does arise, a new gate can be added to the automation so that the same problem doesn't occur in the future.

The second pillar is deployment automation. A famous post from the year 2000 by the founder of Stack Overflow places "Can you make a build in one step?" as the second most important question for a development organization, and things haven't really changed since then.
The efficiency of the build process isn't the only goal of deployment automation, however. Other goals include deployment strategies (I talked about canary deployments, where you want to show a feature to a small subset of users at first) and starting new versions of your application without causing downtime. If you have to shut off your website before upgrading it and then turn on the new version, the visitors that visit the website in the middle of the upgrade will notice downtime. There are clever deployment strategies you can use to avoid that. And finally, rolling back versions in case something goes wrong.

It's easy to overcomplicate deployments. Many companies have complex internal platforms for building and distributing releases. Broadly, success in deployment automation is finding the appropriate deployment tools to fulfill business goals and configuring them. In an ideal world, there should be little to no custom code for deploying, so off-the-shelf solutions like Spinnaker and Harness are wonderful places to start for this sort of thing.

Finally, application performance management. Even the best code can be hamstrung by operational errors. There's a famous case where a user put a bunch of spaces at the end of their post on Stack Overflow and brought down Stack Overflow, which is a very popular developer website, because Stack Overflow hadn't deployed their code in a way that would deal well with a bunch of whitespace, a bunch of space characters at the end of a post. Even with the best code, and even with the simplest things like a message board, it's easy to have faults that make it to production and are only uncovered by users. So application performance management ensures that metrics like how long requests take to be processed and how many servers are being used, all of those key health metrics, are being collected. And if there's a problem, like if all of the requests to the landing page are suddenly taking a long time, the appropriate people can be notified automatically, instead of an engineer discovering on Twitter that their website is down.

Then there's logging. As a program executes, it produces logs, and the logs generally have information about the state of things. It's useful to be able to map logs like "a user visited the website" back to information about that user. What was their IP address? What was their username? What resource did they access, and what resources were used to fulfill that access? So if they had to load something from a database and the database was slow, it's useful to be able to say: the user had a slow experience because their request was fulfilled slowly, and the request was fulfilled slowly because it was fulfilled from the database slowly. Mapping these requests all the way down to their constituent components is very useful.

And monitoring. Again, I mentioned metrics and automatically alerting people, but monitoring is taking the logs and metrics (how slow things are, how much memory is left) and deciding what to do. If there's a bunch of load, you might decide based on the metrics to automatically scale the number of servers, so add more web servers as they're being used. Based on the logs, if there are errors, you might want to automatically file tickets for engineers to look into them. And if there's a downtime, you might want to call someone, the person on call, so that they wake up and take care of the downtime.
They can drop everything; they have a pager, so to speak. And that's alerting. Alerting is when a fault is detected: some trigger has occurred based on the metrics, some number of requests are too slow, things are unhealthy, users are going to notice degraded performance, so someone should be notified or some action should be taken.

A new product shouldn't dive into DevOps engineering all at once. All of what I've talked about are end goals for really large organizations like Netflix and Facebook; developers should add automation as the situation requires it.

So, a new startup with no users building a website: pillars two and three are essentially useless. Outages won't be noticed by anyone; something like a downtime won't be noticed by anyone, so it doesn't necessarily matter. You don't even necessarily need to run automated tests. A useful stack there would be something like Netlify, Vercel, or our own product, where you can get staging environments to collaborate with other developers. That's about as far as you care testing-wise: you just get an environment for every proposed change, and you can play around with it yourself to see, in a manual QA setting, whether it's good or not.

A team building an app for 10 enterprise users: enterprise users are much more sensitive to downtime, so test coverage and business-hours alerting should be priorities. For error collection there are popular tools like Sentry, and Codecov for coverage; for automated test running, there are tools like Bitrise and CircleCI (Bitrise is known for mobile testing); and for alerting there's a famous tool called PagerDuty that keeps track of who should be notified if there's a downtime. During business hours, you might assign someone to be the person that isn't supposed to take any meetings for the day; if there's a downtime, they will drop everything and solve the problem.

And a social media app like Reddit might be using a large combination of things: Sentry for catching errors in the website; Elasticsearch, Logstash, and Kibana as a popular way of collecting and looking at logs; Pingdom to check whether certain pages are taking too long to respond; LaunchDarkly to add feature flags, so you can say whether a feature is enabled for some group of users or not (should the new landing page be shown to users in North America or Europe?); and Terraform to automate the deployment process, so given a set of servers and a set of things that need to run on those servers, Terraform will help you automatically create a plan to ensure that the right things are running in the right places.

The conclusion of all of this is that DevOps engineering is vital for developer teams. Without being cognizant of its three pillars, customers will have a confusing and disappointing experience: things will go down, things won't scale properly, things will be slow. So it's really important to keep the three pillars in mind as you're scaling an engineering organization, or if you're being hired as a DevOps engineer. New products don't need to automate very much; however, as the product matures and gets more users, it becomes more and more important to automate DevOps engineering and to dedicate more resources to it.

Now let's move on to code review automation.
Let's talk about testing, which is going to be really vital baseline information for when we talk about continuous integration and other code review automation topics. Test-driven development is a coding methodology where tests are written before the code is written. And, you know, we're going to explain tests and test-driven development in terms of coffeemakers, so enjoy this picture of a nice coffee maker as we continue.

Test-driven development has been around for a long time; it was popularized in the early 2000s. The idea is simple, but it requires knowledge of how things came to be for it to really make sense.

Historically, common words in software development like quality assurance (QA) and unit test have roots in factories building physical products. If you were running a factory building coffeemakers, you would test that each one worked at varying levels of completion.

Unit tests ensure individual components work on their own. Does the heater work? Does the tank hold water? Integration tests ensure a few components work together: does the heater heat the water in the tank? System (end-to-end) tests ensure everything works together: does the coffee maker brew a cup of coffee? Acceptance tests happen after the product has launched and been sent to customers: are they satisfied with the result? Are they confused by the button layout, or breaking the coffeemaker within their warranty period?

All of these tests have software analogies. It's useful to know which components break in order to diagnose a problem, but it's also useful to know that the whole system is working correctly, because even if every individual component works on its own, if your coffeemaker doesn't heat water with its heater, that's going to be a problem when it comes to making coffee.

That's really the idea of testing. But let's get into test-driven development, the methodology built on top of testing that's become so popular in the past 10 or 20 years. Most developers that aren't using test-driven development have a similar workflow. They'll choose something to work on; based on our idea of DevOps, that would be in the planning phase. They build it, so they'd write code and make a build from that code. And then they test it, so they'd write small scripts that make sure their code is working correctly. If you're making a function that adds two numbers, you might pass it two and two and expect the result to be four, and that would be a good indication that your function was working correctly.

Steps one and three, as it turns out, are very connected. The tests written at the end essentially codify the specification. What is success for building a coffeemaker? It should heat up in five seconds, so write a test for that. It should brew coffee of sufficient strength, so write a test for that, and so on. Test-driven development uses the similarity of steps one and three to flip this process. First, developers choose something to work on, and then they write the tests before writing the code, so they write tests that are currently failing because the specification isn't satisfied yet. And then they write code until all of the specifications they wrote in step two are satisfied. So they might make a testing regimen that would pass if the coffeemaker succeeded, and then build the cheapest coffeemaker which satisfies that testing regimen.
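As a concrete sketch of that flip, here's roughly what the "adds two numbers" example might look like in JavaScript, assuming a test runner like Jest, which the course doesn't actually specify:

  // A minimal test-first sketch (assumes Node.js with the Jest test runner; these are not tools named in the course).
  // Step 1: write the test before the implementation; it fails until add() satisfies the specification.
  test('add() satisfies the specification', () => {
    expect(add(2, 2)).toBe(4); // pass it two and two, expect four
    expect(add(0, 5)).toBe(5);
  });

  // Step 2: write just enough code to make the failing test pass.
  function add(a, b) {
    return a + b;
  }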
And the end result is the same: the software is built, it's tested, and it matches the specifications. But it's significantly easier in a lot of cases to write code if you write the tests first, because you know what you're building, and it forces you to think about which things are important to work on and which things can be put into a later set of changes.

So this was a very quick video to discuss testing. In the next video, we'll talk about continuous integration, which is really the DevOps continuation of this idea. See you there.

So we've talked about testing, where developers write scripts that make sure that their code continues working way off into the future, years after they've written it. That leads us into our discussion of CI, which is really one of the big topics that people talk about in a DevOps context. CI stands for continuous integration. It refers to developers continuously pushing small changes to a central repository numerous times per day, and those changes being verified by automated software that runs the tests the programmers have defined.

We've gone over what tests are, so let's talk about why a company would use CI.

Well, CI is really the first step in automating DevOps. Imagine the very simplest scenario, where a single developer is making a program that'll be used by a small group of users.

That developer makes the original program, releases it, and the project slowly builds traction.

Now imagine that developer hits a critical bug a year later.

They go back to the old code, and they say, like, gee, this is really bad code. I've become a better programmer since a year ago; I don't really understand what's going on here. But that's really how development works. Programmers get better year after year, and they have to read and understand the bad code that they wrote just a year ago. The only way to be confident making changes to that legacy code, which might only be a year old, is to have CI.

CI improves developer speed, because new changes can be made confidently without having to worry about breaking existing functionality, as long as the tests pass. CI also reduces customer churn: problems in the software are much less likely to occur if you have comprehensive tests that run automatically. As long as you get those check marks, you can be reasonably sure that the core features of your application will continue working.

So how would you integrate CI into your development process? First, let's talk about the common branch-based development process that many development teams use. First, developers work on a feature branch. They'll take the files that are most current, the ones shown to customers at a specific point in time, and they'll branch off of them. That is, they'll make a new copy of the files to work on their feature independently of all of the other developers, making changes to the various components. So this feature makes a change to the mobile app and to the website.

Then they'll push that branch back to the repository, which is usually something like GitHub, GitLab, or Bitbucket. That repository will run CI (the CI is configured on the repository side), and it'll run all of the tests that the programmer has defined. Then the results of those tests will be attached to the pull request.
And the pull request is the developer asking to take their code and merge it into the central repository that users will be shown. So you take the feature branch here, and you put it at the end of all of the other commits that are being shown to users. This commit is now the one that will be shown to users next, and the next time there's a deployment, the features that the programmer made will be visible to users.

The best part is that it doesn't cost you anything. Central Git repositories like GitHub, GitLab, and Bitbucket mostly have generous free tiers, even for organizations, minus some security and access control (permissions) features that you might need as you scale up. And CI providers like LayerCI, GitHub Actions, and GitLab pipelines all have generous free tiers as well. LayerCI, you know, is really made for people working on websites, so that's maybe something to consider, but if you're really early on in your project's lifecycle, it doesn't really matter which CI provider you use. If there's one thing to take away from this discussion of CI, it's that CI is a vital tool; it's really the first thing that should be automated in most pull request automation schemes, because it's so easy. Developers should be writing these tests regardless, and if you don't run the tests automatically, slowly people will break things without realizing that they're breaking them, and users will notice those broken things.

Following best practices like feature branches and CI is a really easy way to scale a developer team. With just CI, a developer team can easily scale from one to ten developers. And at some point in there, you'll have to start worrying about other pull request automation topics, like the ones we'll cover in the next section.

We've talked a lot about theory, but let's get practical for a little bit, just to round out our understanding of how these DevOps concepts work. Let's look at what setting up CI looks like for an actual repository.

This is the livechat example. It's an open source version of Slack that's used as a demo repository throughout LayerCI's internal documentation. Let's say that for this open source version of Slack, we'd like to run tests every time a developer proposes changes, so that in the pull requests tab we'd be able to know whether a change was good. In particular, let's say a developer was changing the color of the website.

In the main website, after you log in, the top bar and sidebar are purple. Perhaps the customer has requested that the color be blue instead.

If we asked a developer on our team to make this change, they would go to the necessary design file and edit the color. In this case, there are two colors to change.

If the developer opened this pull request, it'd be very difficult for us to review their change. Without a CI system, all we can see is the file change and the description of the commit. We can see that they've edited main.css and that they've changed these color values, but it's very hard to understand the ramifications of this, and it's especially hard to understand whether this will have negative side effects for existing users, especially for changes that are less trivial than just changing a color.

For this pull request, if I was asked to review it, I would have to pull these changes onto my local developer machine, run the script locally, and then evaluate the changes locally.
Or I could ask the developer to set up a screen sharing session, and they could walk me through the changes. Both of these add a lot of friction to the development process. It'd be better if I could evaluate their changes without needing any involvement at all, entirely through a web interface.

This is where continuous integration helps. Continuous integration allows developers to set up comprehensive tests, so that if something doesn't work anymore after a proposed change, it says so right in the pull request.

Let's close this change for now and look at the repository to understand how to set up CI.

In this repository, one of the services is called Cypress, and it's an end-to-end testing service. It contains several test configurations, and these configurations interact with the page using an automated browser.

For example, this test enters a username and password, logs in, and then ensures that the user is actually logged in. This test goes to the message area, enters a random message, and ensures that the message has actually been submitted, that it's viewable in the chat area.

With enough end-to-end tests, you can be reasonably confident that a chat system like this one continues working.

We'd like to run these tests every time a developer proposes a change. To do so, we'll have to install a plugin into GitHub, set up the server to run after every pull request, and run these tests against the new server.

To do that, let's set up LayerCI. For our use case, it's easy to just install it directly onto our GitHub account. We can now install it onto our GitHub repository. And now it's listed here, which means that we've successfully installed LayerCI onto this repository.

However, nothing will happen yet, because there are no configuration files. We need to set up a configuration file for this repository that will start the whole stack and then run the tests in Cypress as required.

Let's do that now. Because our repository is Docker Compose based, let's use the Docker Compose example as a starting point. Here, we're going to install Docker, which is a containerization technology (we'll talk more about containers versus virtual machines later on in this set of talks), and install Docker Compose, which is a way of running multiple containers at the same time. These concepts will become clearer later on.

We copy the repository files into the test runner, we build all of the services, we start all of the services, and then we deploy the pipeline. Let's skip that line for now; we'll talk about it in the deployment section of this DevOps course. And after all the services are started, let's run the tests. Luckily, I've already set up a script for this, so I can copy my configuration.

So to recap, what this configuration will do is install the necessary software, in this case Docker and Docker Compose, copy the repository files, build all of the microservices, start them all locally within the test runner, and then run our tests against them.

So now that we've installed LayerCI onto our repository, all we have to do is add this configuration, and we'll have set up CI for it. Let's click Add File. We'll name it "Layerfile", which is how LayerCI's configuration files are named; other CI providers will have different file names, of course. We'll copy our configuration, and we'll commit the file.

Now that we've set up CI, we can see that there's a dot next to the commit name, and that dot turns into a checkmark when the tests have passed.
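As an aside, a Cypress end-to-end test like the login one described above might look roughly like this. This is a sketch with made-up selectors, not the demo repository's actual spec:

  // Rough sketch of a Cypress login test (hypothetical selectors; the real spec in the repository will differ).
  describe('login', () => {
    it('logs a user in and shows the chat page', () => {
      cy.visit('/login');                           // open the login page in the automated browser
      cy.get('input[name="username"]').type('demo-user');
      cy.get('input[name="password"]').type('demo-password');
      cy.get('button[type="submit"]').click();
      cy.contains('General').should('be.visible');  // assert the user actually landed in the chat UI
    });
  });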
That checkmark means that every time a developer pushes new code, our source code management tool shows a success metric, namely whether the tests have passed, automatically. Developers won't have to run the tests themselves, and the reviewer won't have to trust that the original developer has actually tested that the change works.

So let's go back to our original proposed change of changing the colors in production from purple to blue. Here, we're going to make our change and reopen the pull request. Because we've configured a CI provider for it, we'll be able to see that the tests are running automatically, directly in the pull request view itself.

Now, when our developer asks us for a review, it'll be much easier for us to tell whether the change has negatively affected our customers' workflows. In particular, because we've configured Cypress and LayerCI to check that logging in and posting messages still work, we'll know that for this change, even though many files might have been changed, the core workflows still work, which gives us a degree of confidence that nothing terribly bad has happened to the code.

So we can look at the file change for our first idea of what the developer has done, and then we can view what the CI is doing. If we open the relevant pipeline, we'll see that the tests are in the process of running: the new version of the application has been built and started within the CI runner, and the tests are running one by one. Here, it's tested that you can post chat messages within our Slack alternative's chat page, that the landing page loads, and that logging in works correctly.

So now, within our pull request view, we'll be able to see a big checkmark, which shows that all of the relevant CI checks have passed. You can even mandate, within GitHub or other source code management platforms, that certain checks must pass, so you can require that all CI checks pass before a change can be merged. That makes sure that developers are never reviewing code that's so obviously broken that it's breaking your tests. And you don't only have to run end-to-end tests here: you can also run linters, unit tests, and other kinds of tests, which we talk about throughout this series of talks. And now that I'm happy with the change, I've reviewed the files, and I see that the CI has passed, I can merge it with a great deal more confidence than if I didn't have this automation in place.

That's it for setting up CI in an applied setting. Let's get back to theory for a little bit.

Continuing on the topic of testing and continuous integration, let's talk about code coverage. Code coverage quantitatively measures how comprehensive the tests for a code base are. You might think that you have enough tests to find all of the common bugs and to really check all of the functionality of your app, but it's hard to put a number on it unless you're measuring code coverage. This is what a code coverage graph looks like from a popular tool. Each of these squares represents a file, and the color represents how much of that file is covered by tests. Bright green means 100% of the file is tested, and bright red means none of the file is tested, so that would be a priority: a file that should either be tested or excluded from the measurement.
So let's say you're taking over an existing code base. It's relatively large, at 100,000 lines of code. Over the years it's been adopted by a couple hundred users, and you're expected to maintain it and add features without harming those users. The first place you look is the unit tests, which we discussed earlier, but they weren't really prioritized by the previous maintainers. There's a mishmash of libraries and naming conventions, and it's kind of hard to tell which tests are testing which files and which files need to be tested. Before you write any new features, you'd like an objective way to measure how sensitive certain parts of the codebase are to being changed. If something has very comprehensive tests, you'll be much less scared to make changes and add features that touch that part of the code than if there's a part of the code that doesn't have tests. This is where code coverage really shines: you've got a complicated code base that has existing users, and you'd like to enforce, in an objective way, that tests are written so that things aren't broken.

So, getting into the first code of this whole series, let's look at this JavaScript function, which I will make bigger. (A rough sketch of a function like it appears at the end of this passage.)

It's a very simple function, if not a bit contrived. It takes a number and defines a few variables. It loops up to that number, pushing strings into a results list, and then every 50 elements it pushes a special string into the results list.

This whole function is 10 lines of code, but not all 10 lines are equal. Really, there are three kinds of lines in a program like this. There are the syntax lines, like these closing braces, that don't actually have any code in them; they're simply syntactic constructs for the programmer's benefit, and it doesn't even make sense to test them. How would you test that a semicolon exists or not? There are logic lines, like this one, which actually have side effects, and by side effects I mean that these lines, if you removed them, would change the behavior of the program. And there are branch lines, like this one, which change the flow of the program. For loops and if statements are constructs that change the order of the commands that run: this if statement, if it evaluates to true, runs this line, and if it doesn't evaluate to true, it doesn't run this line. So to reiterate, the three kinds of lines are the syntactical ones that don't do anything, the actual logic ones that have effects, and the branch ones that change which lines of code execute.

Code coverage is usually defined as line coverage: the ratio of the non-syntax lines which are executed by tests over the total number of non-syntax lines.

So again, consider this test. If you expect the function to work with the input 2, and you manually calculate what the function should return for the input 2, this would be a unit test for your function. But since you're only executing it with the input 2, this if statement, which requires an input of at least 50 to execute its body, wouldn't run that body. So you'd be testing this line, which would execute, this line, which would execute, and this line, which would also execute; you'd be executing five out of six non-syntax lines, and that would be about 83% test coverage. Just the single test gets us most of the way to understanding our function and its problems.

A related concept is called branch coverage.
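The slide itself isn't reproduced in the transcript, but a function matching that description, along with the unit test just discussed and a second test that would also cover the special-string branch, might look roughly like this (an assumed reconstruction, not the course's exact code):

  // Assumed reconstruction of the contrived function being measured (not the course's exact slide).
  function buildResults(n) {
    const results = [];                      // logic line
    for (let i = 0; i < n; i++) {            // branch line: controls which lines run
      results.push('item ' + i);             // logic line
      if (i % 50 === 49) {                   // branch line: only true every 50 elements
        results.push('--- milestone ---');   // logic line: only runs for inputs of at least 50
      }
    }
    return results;                          // logic line
  }

  // The unit test described above: with input 2, the if-branch body never runs,
  // so five of the six non-syntax lines execute (about 83% line coverage).
  test('works for small inputs', () => {
    expect(buildResults(2)).toEqual(['item 0', 'item 1']);
  });

  // A second test with a larger input would also execute the special-string branch,
  // bringing line coverage (and the branch coverage discussed next) to 100%.
  test('pushes a special string every 50 elements', () => {
    expect(buildResults(50)).toContain('--- milestone ---');
  });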
Instead of measuring how many lines of code are executed, branch coverage measures groups of lines. In our example above, there are really three branches: there's the main branch, there's the body of the for loop, and there's a third branch, the body of the if statement. The main body here will always execute; the body of the for loop will only execute if i is less than n, so you need n to be greater than or equal to one for those lines to execute; and this line will only execute once i reaches at least 49.

So branch coverage is how many of these individual branches, out of the three, are executed by a test; you'd like to know how many of all of the branches are tested. This is useful because if this line of code executes, then this other line in the same branch will always execute, so treating them both as individual things that need to be tested doesn't mean as much as taking the bodies of these statements as the things that need to be tested. If you measured our test with branch coverage, you'd see that two of the three branches are exercised during the test.

So when should you care about line coverage and branch coverage? We've already discussed one scenario, where you've inherited an existing code base, but it's important in many different situations. In general, you should measure and optimize for code coverage if any of the following are true.

Your product has users, and those users might leave if they're affected by bugs. In that case it's important to measure code coverage because it lets you work with your team to improve the coverage and reduce the number of bugs.

You're working with developers that aren't immediately trustworthy, like contractors or interns. You're bringing them into your code base and they need to make changes on some fixed timescale, like a summer internship, so they can't immediately become experts in the entire code base, and you'd like them to be able to make changes without worrying too much about things breaking.

Or you're working on a very large code base with many individually testable components. Code coverage analysis can complement test-driven development, which we talked about in the previous talk, to make sure that everyone on the team is generally working on important things, and that the things they make won't break in the future.

It's a common mistake in code review automation to make things too rigid before the product has enough users. If you force developers to get 100% branch coverage, that is, to write two to five unit tests for every function, it's going to make them much slower at developing the features that users will actually notice. Remember that tests are never viewed by users; the only thing users care about is the stability of the system. So if you have an MVP, or a product that doesn't have very many active users yet, it might not be worth it to measure or optimize for branch coverage until those users care a lot about stability.

And when writing unit tests and other types of tests, an important thing to keep in mind is that developers are solidifying the implementations of features that they might have to throw out. If you build a feature and it ends up not being something that your users actually want, it's almost always a better idea to throw out that feature than to keep building on it long term.
But if a developer builds a feature and writes many tests for it to improve that feature's code coverage, they'll be much less likely to throw it out, because they'll feel a sense of ownership, and they'll feel the sunk cost of having built the feature and made it good, so to speak. So it's important not to over-optimize for these things before they're important. It's a subjective call, but really, you'll notice when your users start complaining about stability.

Organizationally, there are some common policies related to code coverage.

The first one is useful when you inherit a code base, and the policy is that code coverage must not decrease. This is one of the easiest ones to automate, and it's especially useful if you're taking over an existing code base, as I mentioned. The idea is that the code coverage ratio should never decrease: if the current code has 75% of its lines tested and your new change introduces 40 lines of code, at least 30 of those lines will need to be tested. Otherwise your change's coverage would be less than 75% (30 out of 40), and you'd be decreasing the average code coverage.

As with most code coverage policies, this will increase stability (there'll be fewer bugs because things will be better tested) at the expense of developer speed. Developers will have to write some complicated tests, and they might spend a lot of time building testing infrastructure, so features will be shipped less quickly if you adopt this sort of policy and enforce it with code review automation.

An unfortunate side effect of this policy is that changes that are harder to test, such as third-party integrations, will be less likely to be worked on by developers. Developers are incentivized by their paycheck and by their manager to ship features quickly, to make many features per Scrum cycle, and if certain features are harder to test because they require internet connectivity or connected third-party APIs, those features will be harder to make and harder to test, so developers will be less likely to make them, regardless of whether they're important to the users or not. So it might be useful to have an exemptions policy in place for things like third-party integrations if your organization decides to go for this "code coverage must not decrease" policy.

Another useful policy is code owners for test files. If you've used code coverage automation to keep code well tested, it's often beneficial to define code owners for the tests themselves. This means that developers can change implementation details without formal review, but, since the tests define what success means for a function or an algorithm, changing the tests for a new implementation would need to be approved by a senior developer or manager.

In GitHub, a CODEOWNERS file for this might contain a line like "*.spec.js @engineering-manager-username": .spec.js is a common naming convention for JavaScript test files, and the @-name is the engineering manager's GitHub username. If there's a CODEOWNERS file that contains this, then the engineering manager will need to approve any change which changes a test, which is probably a good policy.
So if you're working in a large code base, especially with test-driven development, or if you're hiring interns or contractors, or if your users are especially sensitive to bugs and you're afraid that they'll have a bad experience if even small bugs make it to them, it might be in your team's best interest to install a code coverage measurement tool. At the time of writing, the three most common ones in the open source world are Codecov, Coveralls, and Code Climate.

So we've talked about testing, and we've talked about continuous integration, and those are really the initial things that are set up in a DevOps code review automation pipeline. But the problem is that they require the developers to be on board, and of course, developers are probably busy building features and might not necessarily want to write tests or improve test coverage. So let's talk about linting, which is something that approximates testing but doesn't need the developers to spend any time.

Linters are programs that look at a program's source code and find problems automatically. They're a common feature of pull request automation, because they ensure that "obvious" bugs do not make it to production (obvious here in quotes).

As an example of linting, let's again look at a JavaScript program, a very simple one that you should be able to follow even if you don't know JavaScript. (A rough sketch of a program like it appears at the end of this passage.)

It defines a variable, var x equals five, and defines a function, but the code continues right after the opening bracket, which is generally considered bad practice. It uses let for the second variable and gives it the same name as the first one, so this is just confusing and, you know, wouldn't be called good code; a code reviewer would mention it in a code review. Then it says: while x is less than 100, console.log(x), and then it closes the while loop on this line and messes up the indentation (these three lines should be indented for consistency), and then it closes the function. Finally, you should notice that this while loop runs forever: x isn't incremented in the body of the while loop. So just by looking at the code statically, without running it in any environment or even looking at the page in a browser, you can tell that this loop will run forever, and that's probably something the programmer didn't intend.

Much of this feedback could be automated. A set of rules like "don't shadow variables" (never name a variable in an inner scope with the same name as a variable in an outer scope) could be applied to each proposed change, so that human reviewers don't have to waste effort leaving code style comments. Tools that maintain and run such rule sets are called linters.

Relatedly, another class of code review feedback has to do with code style. It's easy for code reviewers to waste time pointing out stylistic choices like tabs versus spaces, or camel case versus pothole case. These discussions bring no value to end users (your customers don't care what case your code is written in), and ultimately they just serve to cause resentment and missed deadlines within engineering teams. If a review takes an extra couple of hours because of comments like this, that's a couple of hours the programmer could have spent on another feature.
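Here's a rough reconstruction of the kind of buggy snippet described above. This is an assumption on my part, since the transcript doesn't reproduce the original example, with the problems a linter would flag marked in comments:

  // Assumed reconstruction of the lint example (not the course's exact code).
  var x = 5;
  function printNumbers() { let x = 10;  // code continues right after the opening bracket: generally considered bad practice;
                                         // it also shadows the outer x, which a "no-shadow" lint rule would catch
    while (x < 100) {
    console.log(x);
    }                                    // the loop body isn't indented consistently,
  }                                      // and x is never incremented, so this loop runs forever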
So engineering organizations should eventually adopt and maintain a global style guide, but in most cases, just starting with something like the Google Style Guide, which is open source and freely available, is a great starting point. These guides often come with linter configurations, which help everything stay stylistically similar, and some programming languages like Python and Go come with their own style guides and automation, like PEP 8 in the case of Python, that make it easy for developers using those languages to stay in a unified style.

An organizational thing you can do for code style is to "nit", which stands for nitpicking. Instead of blocking at the code review stage when there's code-style feedback, it might be better for code reviewers to leave small review comments called nits. So they'd write "nit:" followed by the comment, for example "nit: this shouldn't be styled this way." This is great because it allows the reviewers to merge something with a few pieces of feedback attached, so that hours don't have to be spent on a small piece of refactoring that could be done at a later point in time.

Once the style guide is adopted, it's possible to configure tools to automatically format code to follow the style guide. These tools are called auto-formatters. In the programming language Go, which we use at LayerCI, a command such as the following would use the standard formatter, the one that comes with Go (gofmt), to clean up all of the source files in the repository: we'd use the GNU find command to find the files that have a .go extension and exec gofmt on them, something like find . -name '*.go' -exec gofmt -w {} \; and that takes all of the source files and formats them so that they all pass the style guide.

And of course, if your CI system is running tests automatically every time code is pushed, the code can be automatically linted as well. Programmers shouldn't have to wait for a human reviewer to tell them whether the code is linted and styled appropriately. In most cases, it's cheap and convenient to run linting and formatting automatically with a CI system.

An easy solution is to add another checkmark. To get something set up quickly, it's a good start to make lint act the same as running a unit test in CI: add an X if the code isn't linted properly, and then the developer can very quickly get stylistic feedback without needing to talk to another human or wasting their reviewer's time on this sort of feedback. The CI configuration might look like this: copy the project files, run the linting script, and if the linting script fails, the whole pipeline fails. This approach stops reviewers from nitpicking style; "it passed the linter" is a perfectly reasonable response to an overly zealous code reviewer. So even simple automation like this can improve the development speed of entire teams. It also stops reviewers from having to give style feedback at all: if all of the checks for the code review pass, meaning it passes all of the linters and the commit is stylistically okay, the reviewer might still leave some feedback for future reference, but they shouldn't be blocking commits from getting to production because of small stylistic choices that aren't even in the linter.

A better long-term solution is to set up a commit-back bot, which is a common idea that shows up all over the place in code review automation. In this specific example, it might look like this.
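The configuration on the slide isn't reproduced in the transcript, but as a rough sketch, assuming ESLint and git are available inside the CI runner and written as a Node script rather than any particular CI provider's syntax, the commit-back step might do something like this (the next paragraph walks through the idea):

  // Rough sketch of a lint commit-back step (hypothetical; a real setup would live in your CI provider's own config).
  const { execSync } = require('child_process');

  try {
    execSync('npx eslint .');                              // non-zero exit code if the code isn't linted properly
  } catch (err) {
    const branch = process.env.BRANCH_NAME + '-linted';    // assumed env var; new branch with a "-linted" suffix
    execSync('npx eslint . --fix');                        // apply the linting rules and fix stylistic errors
    execSync('git checkout -b ' + branch);
    execSync('git commit -am "Auto-fix lint errors"');     // an additional commit on top of the developer's change
    execSync('git push origin ' + branch);
    console.error('lint failed: review and merge the ' + branch + ' branch instead');
    process.exit(1);                                       // fail the pipeline for the unlinted version
  }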
So you'd say: if the code is not linted, run ESLint with the --fix flag (ESLint is the standard linter for JavaScript), which goes through all of the source files, applies the linting rules, and fixes any stylistic errors. Then it creates a commit, an additional set of file changes on top of what the developer is proposing, and it creates a new branch with a "-linted" suffix and pushes to that branch. So when a developer pushes unlinted code, the bot automatically creates a commit which lints everything and pushes it to a new branch, so that if the developer's code is known to be good, the reviewer can simply merge the linted branch instead of the developer's original one. And then we fail the pipeline with "lint failed". This means that the unlinted version can't be merged, but the linted one, assuming all the feedback that isn't lint-related has been addressed, could be merged. So we'd have two branches: the one the developer is proposing, and the linted one. The code reviewer would look at the one that wasn't linted and say whether it was good or not, whether the logic of the commit was good, and if it was, then the reviewer in GitHub could merge the linted branch instead of the one they were asked to review. That branch is the same as the original one with an additional commit on top of it.

Some examples of linters for various programming languages: for JavaScript, the standard as of 2021 is ESLint, and TypeScript also now uses ESLint. Python has Pylint and flake8. C++ is much more subjective, but a common choice is Google's cpplint, from the Google style guide mentioned above. Go comes with a formatter called gofmt, which acts somewhat like a linter, although there are additional libraries available for rules beyond that. Java has Checkstyle and FindBugs, maybe older options, but there are a lot of choices for languages like Java. Ruby has RuboCop and Pronto, which we've seen users commonly use. And Java, JavaScript, C#, and many other languages can be linted with SonarQube, which is a popular static analysis framework commonly used at larger enterprises, but which has an open source version that is a good place to start for ten-developer teams that would like to set up static analysis.

Finally, there's a startup called DeepSource that we've talked to, startup to startup, and they're doing all sorts of interesting stuff with static analysis as well. Static analysis is just the practice of looking at source code without running it and finding bugs, so I'd encourage you to look at DeepSource too.

In comparison to most other code automation tools, linters are exceptionally easy to set up. Any team with more than one developer should almost immediately set up a linter to catch obvious bugs like infinite loops; just by looking at the code, the linter can tell you whether there's a common programming error like an infinite loop.

Automatic linting comes standard with many code editors, so it would be wise to teach developers how to configure their editors to use the existing linting rules that your team has set up in the CI automation. That way developers don't have to wait until they push their code to get this feedback; they can get the yellow squigglies directly in their editor.
And for teams working on earlier-stage products, linters can help you avoid writing unit tests at all. Instead of relying on a test suite, you can often rely on static analysis to find common bugs, like the code not compiling at all, infinite loops, or stylistic problems. This helps small teams, before their product has many users, get feedback without needing to lock things in with tests.

Right, so that's it for linting and code style. We'll see you in the next video.

Let's finish up our discussion of code review automation by talking about ephemeral environments, which are really the latest and greatest when it comes to doing code reviews and helping developers get their changes merged.

Ephemeral environments are temporary environments that contain a self-contained version of the entire application, generally for every feature branch. They're often spun up by a Slack bot, or automatically on every commit using DevOps platforms like LayerCI itself or Heroku.

Temporary environments are overtaking traditional CI platforms as the most valuable DevOps code review experience. Because these environments are made on every change, all of the stakeholders, not just developers but the product people and the designers, can review a change without needing to set up a developer environment or ask to screen share with the developer that proposed it.

For a more concrete example, let's say a developer is changing something on a website, so the front end, the back end, or, you know, some component of the website, and they'd like to get feedback on the proposed change. A code reviewer could look at the code, but they might not understand what the visual ramifications of that change are. With ephemeral environments, within the code review view itself, the reviewer would just have to click the button there.

I'll zoom in. Within GitHub, this is what the reviewer would see: they'd see the description and the code change, but also a button to view the ephemeral environment. When they click that button, it wakes up a version of the website with this specific proposed change in it, so that the reviewer can actually take a look at things and see whether the change is working well, visually and workflow-wise.

In general, ephemeral environments sit halfway between development environments and staging environments. At the extreme, staging is entirely replaced by ephemeral environments in something called continuous staging.

As for the benefits of ephemeral environments: the most common reason to adopt an ephemeral environment workflow is that it accelerates the software development lifecycle. Developers can review the results of changes visually, instead of needing to give feedback exclusively on the code change itself. Additionally, developers can share their work with non-technical collaborators such as designers as easily as sharing a link to the proposed version. So you could post a Slack message saying "could you go to this link and give me feedback?" instead of needing to set up a Zoom call and share your screen to get the other person to look at your proposed changes.

The hardest part of setting up ephemeral environments is dealing with state, so dealing with things like databases and microservices. By their nature, ephemeral environments are temporary: they're isolated from production environments and really only last as long as a pull request does.
A reviewer should be able to delete a resource in a review. They should be able to check that, say, deleting a user still works, without fear of that affecting the production environment.

In an early implementation of ephemeral environments, it might make sense to connect the API servers to a staging database with read-only permissions. If you're using AWS, you might have an IAM role that has read-only access to the database. But in that case you wouldn't be able to, for example, sign up for the service, because that would require database writes. The end goal should be to have a fresh copy of the database for every commit, so that every time a developer proposes a change, they get a new database specifically for their environment that they can do whatever they want in.

An ideal ephemeral database has three attributes. It's pre-populated: it contains representative, anonymized data. To pass security audits, PII (personally identifiable information) must be scrubbed from databases used in ephemeral environments. It should be undoable: if data is deleted in the course of a review, it should be easy to reset the database to its original state. This is also crucial for running destructive end-to-end tests, which we'll get into later. And it should be migrated: the database should use the schema currently used in production, because it's not very useful to know whether something works against an old version of the schema. In fact, one of the most common classes of problems uncovered by ephemeral environments is broken or non-performant database migrations.

Another hard problem to solve with ephemeral environments is the lifecycle: when do you create them and when do you destroy them? The classic approach is to tie the lifecycle of an ephemeral environment to the lifecycle of a pull request. When the developer opens a pull request, create an environment for them and keep it running 24/7 until the developer deletes it.

The biggest factor to consider there is cost. If each ephemeral environment costs 10% as much as production and you have 30 open pull requests, you'd be quadrupling your monthly costs. That's an expensive developer tool.

Another approach is to create a ChatOps bot that allows creating new environments for a specific branch with a specific timeout. For example, the user could type a command along the lines of "/pr-bot create" in the GitHub issue description, or do the same thing in Slack, and that would create an environment. This requires the environment to be provisioned at the time it's requested, which can be slow, and it's again hard to tell when to delete these environments.

The best approach is to create an ephemeral environment for every change, similar to the pull request workflow, but to hibernate them once they're provisioned and wake them only when they're needed. There are only a few providers that do this. One is Heroku, which with Heroku Review Apps can turn environments on and off, and the other one is LayerCI: shameless plug, I suppose. As people use the environments in LayerCI, they're woken up, and they're hibernated when they're not in use. You can automate this yourself with memory snapshotting, but it's somewhat involved, so this might be something that's better left to a third party.

And back to that idea of continuous staging: the idea is to merge staging, ephemeral environments, and the CI pipeline altogether. This is, in a sense, what LayerCI itself primarily sells to our users.
As your ephemeral environments become more powerful and easier to create, they approach and overtake many aspects of traditional continuous integration pipelines. If you can set up the website, the back end, and the database, then it's relatively easy to run tests on top of that, because tests are usually much easier to stand up than the entire back end. At its logical conclusion, this concept becomes continuous staging, where CI, CD, and ephemeral environments form a single pipeline in which one common base sets up all of the requirements, and that base then forks off into the unit tests, the server, the review environment, and the linter. Everything comes from that common base.

If you're going to build ephemeral environments yourself, you should probably budget about a month of engineering time. If your production environment has many different microservices and many different databases, it'll be relatively difficult to set up an ephemeral environment flow. Large companies like Facebook have set this up for their internal pull requests, but they have entire teams of infrastructure software engineers to do it. So if you're a smaller company, you might want to stick to a hosted service, again like LayerCI, instead of making it yourself, up to maybe the point where you have about 20 developers.

To avoid having to micromanage starting and stopping environments, it's easiest to use a hosted provider. If you're doing just front-end development, some popular choices are Vercel and Netlify. If you're doing full-stack deployments, really the only choices available right now are LayerCI itself and Heroku Review Apps. There are some options available in source code platforms (GitLab has an environments feature, for example), but it's not truly an ephemeral environments feature. So you should explore all of those options and make an informed decision.

So that was ephemeral environments, and that concludes our discussion of code review automation.

Because pull request automation is such a core part of DevOps engineering, let's do another applied tutorial here. In this example, we'll be setting up ephemeral environments the same way we talked about before, using a hosted platform for the sake of simplicity.

Because we've already set up CI for this repository, we already have our Layerfile, which is our CI configuration. Many CI providers, including LayerCI, Heroku, and others, can set up ephemeral environments: small production-like deployments you can use to evaluate the changes live as a reviewer. Again, let's say we're changing the color, this time back from blue to purple, and we'd like someone to be able to efficiently review our change, not just by looking at the test results but also by looking at the ephemeral environment and doing some manual QA. In this case, it's actually very easy to set up. Let's go to our web microservice.

Let's create a new file. Here, we'll make another Layerfile so that they run in parallel. We'll inherit from our base Layerfile, and we'll say "expose website", which exposes the website running inside the runner itself. LayerCI has this EXPOSE WEBSITE directive, but many other providers have similar functionality that you can set up.

And let's jump right to creating a pull request for it.
So here, our code reviewer would not only see the test results; as you can see, those are here, from the initial Layerfile.

Let's look at the actual graph to understand better what's going on. Here we have our tests running in the main Layerfile. The main Layerfile has, again, built all of the services and started all of the services, and now it's running our Cypress tests the same way it did in the CI chapter.

But after the Cypress tests run, we'll have a second environment, which inherits from the first, and that second environment will have a clickable link that can be used for manual QA.

Let's see that here. The snapshot of the tests is done being taken, which means the ephemeral environment can start being built. And now you can see that it's built: there's a staging server button, and you can connect to it here.

So in our actual pull request, now that all of the tests and CI services have passed, we can click the ephemeral environment button as soon as it appears. We click the main Layerfile details, we click the web ephemeral environment service, and we click "View website". What this does is wake up the pipeline which we initially set up to run our tests, but it forwards the internet-visible link to the web server inside. So here we've created a fresh environment specifically for this change, and we can see that the test has run and sent the message.

Now we can evaluate the change: we can test that creating channels works, for example, and that in the test channel it's still possible to send messages. This means that you don't need 100% test coverage to be able to understand the nuance of a change.

For every pull request, you'll be able to spin up a new environment automatically and then wake that environment up when a review needs to be completed. And now that we're satisfied that the environment works correctly, we can merge the pull request.

From now on, all changes which edit the website can be manually reviewed: the reviewer can check that things work, but a QA team, a designer, or a product manager can also check that the change actually changes what it's supposed to. That's it for ephemeral environments. Let's get back to theory.

Welcome to the DevOps Academy deployments section. In this one, we'll be talking about foundational concepts. When you talk about deploying, you're primarily talking about VMs and about containers, and containers are often also known as Docker. So let's talk about the difference between those two before we talk about deploying anything.

When people talk about DevOps deployments, they're usually talking about deploying to Linux. A large portion of all deployments are to Linux servers, and containers are really only defined in terms of Linux in production, as of right now.

So with all that in mind, let's talk about Linux in the abstract. What Linux really does is take care of four things when you're running programs. It takes care of memory: programs need memory, also known as RAM, to do things, and since you only have a finite amount of it, Linux needs to figure out which programs get which sections of memory, which RAM sticks will have which programs running on them.

Linux also takes care of processors. If you're running two things in parallel, Linux will make sure that the right amount of processor time is dedicated to both.
If you've ever run very computationally intensive tasks on a laptop, you might have noticed that your browser gets laggy. That's because it's not getting enough processor time. So if you're running production workloads, Linux needs to make sure that every program gets its fair share of processor time.

Then there's disk. Linux takes the files of all programs and allocates space on disk for them. You might have multiple disks, you might have both spinning disks and solid-state drives, and you might even have disks shared across networks. Linux takes care of all of that and makes sure the right files are on the right disks and that programs have access to those files.

And finally, there are devices. Beyond disk, memory, and CPU, there are things like GPUs, which you see often for machine learning, and things like network cards, which you use for connecting to the internet. Linux needs to take these individual resources and allocate them to processes. If you have five processes trying to connect to the internet at the same time but only one network card, Linux needs to make sure that the right messages are sent to the right websites upstream and that the responses are sent to the right programs downstream.

In a diagram, this is what that would look like. Here we have three programs, Chrome, Notepad, and Spotify, and they're all running in Linux, assuming you have a Linux server running these three programs. And here you have the four shared resources. Chrome asks for CPU, and Linux will allocate some of the CPU time to Chrome; it'll also allocate some to Notepad and some to Spotify, and similarly for all the other shared resources.

This is great, but there's too much sharing going on. What I mean by that is that programs know about each other. If one program expected a file at a specific path, it could create that file, but another program could delete or read that file. Files can be read across programs, which means those programs can communicate with each other, and that isn't always what you want.

For example, programmers often use different versions of Python, a popular programming language. There are two popular versions in use, Python 2 and Python 3, but they're both called "python". If the file at /usr/bin/python is a Python 2 executable and you try to run a Python 3 program with it, that program will error, because you'd be using the wrong version of Python to run it. However, some of your programs might need Python 2 and some might need Python 3. So there's cross-talk between programs: they're both reading /usr/bin/python, but they expect different files to be there. That's what I mean by programs oversharing. Sometimes they need different versions of a file at the same place, and so you can't really run both programs at the same time.

Similarly, two web servers might both listen on port 80; that's how websites allow you to connect to them. If you're running two web servers that both expect port 80 to be open, the first one will start correctly and the second one will crash, saying that port 80 is already in use.
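As a tiny illustration of that port conflict, here's a sketch in Node/TypeScript that starts two HTTP servers on the same port; the second bind fails with EADDRINUSE. Port 8080 is used instead of 80 only so it can run without root.

```typescript
// port-conflict.ts - two programs in the same namespace can't both bind one port.
import { createServer } from "node:http";

const first = createServer((_req, res) => res.end("first server"));
first.listen(8080, () => console.log("first server: listening on 8080"));

const second = createServer((_req, res) => res.end("second server"));
second.on("error", (err: NodeJS.ErrnoException) => {
  // Without containers, the second bind fails with EADDRINUSE ("address already in use").
  console.error("second server failed:", err.code);
  first.close();
});
second.listen(8080);
```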
These sorts of resource-sharing problems are really where virtual machines and containers shine. They allow you to separate resources like files and ports between programs, so that programs can't step on each other's feet.

If you ran your three programs in containers, the picture would look remarkably similar. Chrome would be running, but it would be running within a container, and that container would be talking to Linux, which would then allocate the container resources, and similarly for the other programs.

This might not make sense yet, so let's talk about what actually happens when you put something in a container like this. What happens when there's a container between a program and Linux? The big change is that each program gets its own version of shared resources like files and network ports. The container running Chrome might create a file at ~/chrome/cache, while the container running Notepad could try to read that file and see that it doesn't exist. They get different copies of all of the system's files, so they can't talk amongst each other or have conflicting Python versions.

Similarly, if you had two web servers running that both expected to be able to open port 80, one would be able to open port 80 in its container, and the other would be able to open port 80 in its own container. You can have two programs both thinking they're the only program listening on port 80, but really they'd be isolated within their own containers.

In Linux, containers work by creating namespaces, which are a Linux feature that groups shared resources together. If you had five processes running together within a Docker container, they'd still be running within Linux itself, but they would not see the other processes, the ones on the main Linux machine. So within the container, if you ran ps aux (which is how you see the running processes on a Linux machine) and counted how many lines of output there were, you might see 10, meaning there are 10 processes visible to you within the container. But within Linux itself, you'd see hundreds of processes running, including the 10 from this container. The containers are sandboxed, or namespaced, into a single group of processes, and those processes can't see the files outside of the container, or the processes outside of the container, or the network ports outside of the container. They only see the ports within their container.

Essentially, what's happening is that a program asks "what are the contents of /usr/bin/python?", in our example from before, and instead of answering truthfully, Linux answers with the contents of another file. The container asks "what is /usr/bin/python?", and Docker, if you were using Docker for your containers, would respond with the contents of a file like /var/lib/docker/overlayfs/1/usr/bin/python, which is a totally separate file in the global system. So each container has its own view of the files.

This little deception allows programs to run in parallel, because Linux responds with different files for each container. One container could have "python" pointing at a Python 2 executable, and another container could have "python" pointing at a Python 3 executable.

So if that's how containers work, then how do VMs work, and how are they different? Well, VMs are very similar to emulators.
If you've ever seen someone running an older video game on a modern computer, they're effectively using a VM.

The idea for containers was to provide a fake Linux. Within the container, programs don't really know they're running inside a container: they see files, but the files are simply pointing at a different place within the real Linux installation. The idea for VMs is to produce fake versions one level below that: fake versions of the CPU, RAM, disk, and devices.

The VM equivalent of Docker is called a hypervisor. It's the program which is in charge of creating the VMs, so when a VM is running something, it corresponds to an instance of the hypervisor within Linux.

The hypervisor might lie to the VM and say there's one drive attached, an SSD with 50 gigabytes of capacity. But when the VM writes to that drive, the writes instead go to a file; they don't go to a real drive. On the host, it might just be an ordinary file. So when the VM writes to its drive, it's actually writing through to this file, which is very similar to the file-mapping deception that the container had. But there are some practical differences.

The first is that VMs are very powerful. You can use them to run other operating systems, such as macOS or Windows, and different hardware configurations; you can emulate a GameCube or an Apple II within a Linux hypervisor. In containers, it's the processes that are being lied to, and they must still be the sort of thing that would run within Linux itself. But in VMs, there's a nested operating system that generally doesn't know it's not talking to real hardware. When that OS writes things to its drive, for example, those writes are sent to a file in Linux instead of a physical drive. So a process writes to the operating system within the VM, that operating system sends the write to what it thinks is a drive, but that "drive" goes through Linux, and Linux actually maps it to a file.

Various benchmarks show that CPUs in VMs are about 10 to 20% slower than in containers. VMs also usually use 50 to 100% more storage, because they need everything an operating system needs duplicated, while containers only need the application files. And finally, VMs use about 200 megabytes more memory for the operating system itself; again, containers don't need a whole operating system, because it's the processes that are being lied to. So VMs use more memory, they're slower, and they need more storage.

Given these performance benefits, it looks like containers are almost always the better choice, and in most cases they are. However, there are a few cases where VMs (virtual machines) are a better choice.

If you run untrusted, user-supplied code, it's difficult to be confident that it can't escape a container. This has gotten better in recent years, but it's long been a contentious point. Virtual machines are much older and much more mature, so if you're running untrusted code, it's usually a good idea to put it within a VM.

If you're running a Windows or macOS script, a script that only runs on another operating system, you'd need to use a VM for similar reasons. Or if you're running an old video game that doesn't run in Linux and you'd like to run it on a Linux computer, you'd need a VM.
And vice versa: if you're on another operating system and you'd like to run a Linux program in Windows, you'd usually have to use a VM to run it.

And finally, you can emulate hardware devices like graphics cards with a VM. If you're testing that your software handles a graphics card correctly, you could emulate the responses it would give and then test that the operating system behaves as expected.

So that's the big difference between VMs and containers, and these are really the two things that you most often deploy. Let's go into actual deployment strategies in the next talk. I'll see you there.

Let's keep talking about deployments. Rolling deployments are one of the most popular deployment strategies, and we'll talk about the pros and cons of different deployment strategies throughout this section. Rolling deployments work by starting a new version of the application, sending traffic to the new version, making sure everything's okay, then shutting off an old version, and repeating that until all instances of the old version have been replaced by instances of the new version. I realize I've said "version" many times, so let's look at pictures that will help illustrate the point.

This is the MERN app. MERN stands for MongoDB, Express.js, React, and Node.js. Here, the user's web browser connects to both the front end and the back end, where the front end is the stuff the user sees and the back end is the set of services that provide connections to the database. If you log in, you're connecting to the back end; if you're just viewing the landing page, you're connecting to the front end.

Let's say your app has enough traffic that users will notice if it goes down for a little while. How would you push a new version of the application without causing downtime? This is where rolling deployments come in. The high-level algorithm for a rolling deployment looks like this: you create an instance of the new version of the back end, say. You wait until it's up, meaning you keep trying to connect to it until you get a satisfactory response. Then you delete an old instance and route its traffic to the new one. If any instances of the old version still exist, you go back to step one and repeat.

In our MERN example, we'd initially see three instances of the initial version and one instance of the new version, and we'd repeat the process until we had three instances of the new version and one instance of the initial version. Here, all of the instances are the back end: we add an instance of the new back end and we turn off an instance of the old back end, and we keep repeating that. As time goes on, the red ones replace the pink ones, and after a few loops of this the only ones remaining are red. We've added a red one, removed a pink one, added a red one, removed a pink one, and red is the newest version of the application.
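Here's a sketch of that loop in TypeScript. The Orchestrator interface is hypothetical; it just names the operations a real platform (Kubernetes, Elastic Beanstalk, and so on) performs for you, so treat this as an illustration of the algorithm rather than a real deployment tool.

```typescript
// rolling-deploy.ts - sketch of the rolling-deployment loop described above.
interface Orchestrator {
  listInstances(version: string): Promise<string[]>;
  startInstance(version: string): Promise<string>; // returns the new instance id
  isHealthy(instanceId: string): Promise<boolean>;
  routeTrafficTo(instanceId: string): Promise<void>;
  stopInstance(instanceId: string): Promise<void>;
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export async function rollingDeploy(o: Orchestrator, oldVersion: string, newVersion: string) {
  // While any instance of the old version still exists, replace one at a time.
  while ((await o.listInstances(oldVersion)).length > 0) {
    const fresh = await o.startInstance(newVersion);        // 1. create an instance of the new version
    while (!(await o.isHealthy(fresh))) await sleep(1000);   // 2. wait until it's up
    await o.routeTrafficTo(fresh);                           // 3. send traffic to the new instance
    const [old] = await o.listInstances(oldVersion);
    await o.stopInstance(old);                               // 4. delete an old instance, then repeat
  }
}
```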
So what are the benefits of rolling deployments over other ways of deploying things? Well, they're well supported: rolling deployments are relatively straightforward to implement, and in most cases they're natively supported by orchestrators. If you've heard of Kubernetes, for example, Kubernetes helps you with this, and AWS Elastic Beanstalk also supports rolling deployments.

They also don't have huge bursts. In another deployment strategy, which we'll talk about, if you had three instances of the back end, you'd need to start six in total to deploy the new version and then turn off the old three. So for the duration of the deployment, the amount of stuff running doubles, which might be difficult if you have a finite number of servers, for example. It's also not uncommon for services like databases to limit the number of connections, so if you had six copies of the back end connecting to the database, that might be too much load on the database and could cause problems.

And rolling deployments are easily reverted. If you notice problems in the course of an upgrade, it's usually easy to reverse a rolling deployment by just going in the opposite direction: removing a red one, adding a pink one. Being able to roll back is an important characteristic of a deployment strategy, because things always go wrong.

The downsides of rolling deployments are that they can be slow to run. If you have 100 replicas and you're replacing one at a time, and each replacement takes 20 seconds, it would take 2,000 seconds to replace all of the instances, which is quite a long time for a deployment. This can be mitigated by increasing the number of instances being turned on and shut off at a time, which is sometimes called a burst limit or a rolling deployment size.

The other problem is API compatibility, which is the biggest problem with rolling deployments. If you add a new version of an API endpoint to your back end and consume it in your front end, then, since you're not switching them both at the same time, you might have version one of your back end serving a request from version two of your front end. That API wouldn't exist yet, so there'd be errors visible to the user for the duration of the deployment. This can be mitigated with complicated routing techniques, but it's generally better to make APIs backwards compatible: make version two of the front end compatible with version one of the back end.

So rolling deployments are relatively simple to understand and generally well supported. If your users mind when there's downtime, it's an excellent first step to deploy using a rolling deployment strategy. The key programming consideration is to ensure that services can consume both the old version and the new version of other services' APIs; if that contract is violated, users might see errors for the duration of the deployment. Let's talk more about deployment strategies, and we'll go into blue-green deployments.

Another deployment strategy people often see is the blue-green deployment. To set up a blue-green deployment, teams need to disambiguate which services will be consistently deployed and which services will be shared across versions of the application. I'll explain a little more of what I mean by that in the next section. A database server would be a shared resource: multiple versions of the app connect to the same server at the same time, and a standard deployment would generally not upgrade or modify the database. In our MERN example, all of the other services are cluster resources, and new versions of them would be deployed on every prod push.

So in a blue-green deployment strategy where you're upgrading the JavaScript of the MERN app, this is what it would look like: there's a blue version and a green version of the application, where each is a fully standalone stack.
But each connects to a shared database, and the database is not part of blue or green; it's a shared resource used by both.

Blue-green deployments are so called because they maintain two separate clusters, one named blue and one named green, by convention. If the current version of the application is deployed to blue, we deploy the new version to green and use it as a staging environment to ensure that the new version of the app works correctly before sending users to it. After we're confident that the new version of the software works correctly, we move production load over from blue to green, and then we repeat the cycle in the opposite direction.

So here, we started with users being sent to blue, which contains version one of the application, while we're verifying that version two works. After we're certain that version two works, we route users to version two. Version one is then unused, so we can shut it off and replace it with version three, make sure that works, and then switch user traffic over to it, over and over again.

On the benefits side, blue-green deployments are conceptually very easy to understand. To set them up, you just have to create two identical production environments and send requests to either one or the other, which is relatively simple with services like Amazon Elastic Load Balancing.

They're also quite powerful. Longer-running tasks, like downloads, can continue running in the old version of the application after traffic is switched over to the new version. If a user has an established connection to green and you've switched everyone else over to blue, that connection can continue finishing whatever it was doing. So if a user is watching a video and you'd like to deliver it to them entirely, which might take minutes, that can keep going even during a prod push. Additionally, blue-green deployments can be extended into many different workflows, which we'll discuss.

There are a few notable drawbacks to blue-green deployments. It's difficult to deploy a hotfix, for example to revert a change, because the old cluster might still be running longer tasks and be unavailable to switch to. Say you have version one of the application, you switch over to version two, and you realize version two is having problems. You might want to push version three very quickly to address those problems, but the version one cluster would be the only place you could deploy version three, and if it's still busy you wouldn't be able to do that. It's also finicky to transfer load between the clusters if resources autoscale (which we'll talk about later on): if load is transferred all at once, the new cluster might not have enough resources allocated to serve the surge of requests, because the requests arrived all at once at peak production load. And finally, if one cluster modifies a shared service, like adding a column to a table in the database, it may affect the other cluster despite that cluster not being the live one.

Here are some common extensions to blue-green deployments. As I mentioned, they're very extensible, and many teams set up advanced workflows around blue-green deployments to improve stability and deployment velocity. The first idea is a natural extension of blue-green deployments, which I call rainbow deployments, although I don't think there's a standard term for them. Instead of only having two clusters, some teams keep an arbitrary number of clusters: blue, green, red, yellow, and so on.
This is useful when you're running very long-running tasks. If you're working on a distributed web scraper and your scraping jobs take days, for example, you might need each cluster to last until its last job is finished, to ensure things keep working as expected. With rainbow deployments, you'd keep around all of the clusters that are still processing tasks. Or if you're doing something like video encoding for long videos, you don't want to shut off the cluster that's in the middle of encoding a long video, because that work would have to be redone. In a rainbow deployment, clusters are only shut off after all of their long-running jobs are done processing.

Some teams rely heavily on manual QA and don't use continuous deployment. They're often building desktop or mobile apps, which need to be published on longer release cycles. If example.com is being routed to the blue cluster, it would be relatively simple to deploy a new version of the application to the green cluster and point new.example.com at it. With this setup, the new version of the app can be tested against the production database, in the very environment that will soon become production. Such tests are often called acceptance tests, because they're happening in production, with production data, and with no privileged access to the code base. For a game, you might have the new release, or the APIs for the new release, available and have your QA testers test it. After the QA testers give the go-ahead, you point the game client at the new version and switch the labels of the two clusters.

Another useful add-on to blue-green deployments, and to deployments in general, is called the canary deployment. If the new version of your app contains subjective changes, such as edits to the UI, it might be ill-advised to push them to all users at once. Facebook has billions of users, so if even 1% of their users complained about a change, that would be an overwhelming amount of feedback. The changes may break users' workflows and need to be modified or rolled back in response to user feedback. In the context of a blue-green deployment, a canary deployment is an extension which routes maybe 5% of user traffic to the new version of the application and checks that those users don't have negative feedback before switching the rest of the users over. So if blue was version one and green was version two, we'd have 95% of traffic going to version one and 5% going to version two. We'd wait to see if anyone on version two complained. If not, we'd route everyone to version two, shut off version one, and put version three in its place.

So blue-green deployments are a powerful and extensible deployment strategy that works well for teams that are deploying a few times per day. The strategy only really starts being problematic in continuous deployment scenarios, where there are many services being deployed many times per day.
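Before moving on, here's a tiny sketch of the traffic split a canary implies: deterministically bucketing users so that roughly 5% land on the new version, and a given user always sees the same version across requests. The hash choice and the 5% threshold are illustrative, not prescribed by any particular platform.

```typescript
// canary-routing.ts - sketch of the 95/5 split described above.
import { createHash } from "node:crypto";

function bucket(userId: string): number {
  // Map the user id to a stable number in [0, 100).
  const digest = createHash("sha256").update(userId).digest();
  return digest.readUInt32BE(0) % 100;
}

export function chooseCluster(userId: string, canaryPercent = 5): "blue" | "green" {
  // "blue" is the current version, "green" is the canary in this example.
  return bucket(userId) < canaryPercent ? "green" : "blue";
}

// Example: chooseCluster("user-42") returns "blue" or "green", and always the same one for that user.
```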
Alright, let's keep talking about deployments, and I'll see you in the next talk.

Continuous deployment can sound daunting, but in many cases it's not actually as difficult as it might seem. Let's take a look back at our current example and how it's deployed. In our README, we've helpfully added this little line, which is how we're currently deploying to production.

If we look at our hosted version, which is hosted at this domain, we can see that the color is still purple, despite having changed the color to blue in the previous video. The reason it's still purple is that we haven't pushed to production; we haven't pushed the new version of the code, which contains the blue color. And oftentimes, requiring human intervention to deploy is simply unfeasible, especially as products scale. So let's run the deployment process manually first, and then let's talk about how to automate it with a continuous deployment system.

Here, we'll use a terminal and simply run the command directly from the README. This developer computer comes with the SSH key required to deploy; otherwise it would be difficult to disseminate the SSH key to all of the developers who need the ability to deploy new versions of code. We can see that it's using Docker Compose to rebuild; we'll talk more about how to set up a Dockerfile later on. Now, if we refresh the page, we can see that the deployment has created a new version of the application, which is blue, so it's picked up the color change that was merged in the previous commit.

For continuous deployment, we'd like this to run on merges to the main branch; we don't want to deploy feature branches before they've been reviewed. For that, we can set up a very simple configuration. We could write this configuration file in any directory, but let's write it in the API directory for now.

We'll create another Layerfile. We'll inherit from the testing Layerfile, to make sure that the deployment runs after tests have passed, and we'll only run the deployment if the branch is the main branch. If it is the main branch, we'd like to set up a secret, use that secret for our SSH key, and then use that SSH key to run the deploy script. Let's do that now. LayerCI gives us the directives we need to expose the SSH key; for now, we're exposing the SSH key which is used to authenticate with the production machine within the CI process itself. One other thing we need to do is make the key file's permissions more restrictive. This is required for SSH, but it might not be required for other deployment processes.

So now that we have our SSH key within the CI server, and these steps run after tests have passed, all we have to do is copy our deploy command and run it as if it were part of the CI process. All this configuration does is wait for tests to pass, check that the branch is the main branch, and then use the SSH key to deploy a new version of the application.

Let's create a new pull request with these changes and see how that looks. We can see that, as before, the ephemeral environment and CI services are being built, but this API service is also being built, API being the directory which contains our continuous deployment configuration. Let's take a look at what the pipeline actually looks like in structure. Here we can see that the application has been built successfully and is being started, just as in the regular CI-without-CD process. So we're running our continuous integration, but not yet our continuous deployment step, in this Layerfile. And we can see the tests are running. As usual, the test process requires starting a fake browser, so this takes around 30 seconds. Then, after the tests pass, we'll see that the deployment process runs. So the tests lead to the second level of the graph.
These levels are usually called build stages in a CI/CD system. Here, we can see that the deployment step was skipped because the branch was not the main branch, which is exactly what we wanted. However, if we merge this pull request, we'll create a new merge commit on the main branch and run the CI process once again. And because this is the main branch, the deployment process itself will run, so let's take a look at what that looks like. We're simply loading the environment to run the command right now.

And here we can see that the deployment is running within CI itself. So instead of needing to run it as an individual developer, you can simply run this SSH command within CI, and this idea of deploying automatically from a CI process is called continuous deployment.

Let's work through that whole process end to end once, just to make sure the deployment automation side of things is clear. Let's change the color again for the main landing page, just so it's visible whether the change gets pushed correctly. Again, we'll change the two colors, and we'll create a new pull request.

Now our reviewer will have a lot of information about whether this change is good or not. The reviewer will be able to see the files changed, so they'll see that all we've done is change a few colors. They'll be able to look at the CI process itself, so they'll see that tests are running: in particular, that the application builds and starts successfully and that tests run against it. And they'll be able to look at an ephemeral environment within minutes of me creating this new change. If they approve the change, it'll be shown to users a short time later. This whole process will only take about a minute, with the longest part being the automated browser tests.

One by one, these steps should become green. Again, this is the base, this is the ephemeral environment, this is the continuous deployment process, and this is the aggregate status, the one that the administrators of the GitHub repository might mark as required, so that the commit can only be merged and shown to users if all of the checks pass. Here we can see that everything has passed. Let's take a look at the ephemeral environment just to double-check that the color is the one we want. In the ephemeral environment, we can see we've changed the color to this rose-ish red; perhaps that's the color that was desired, so we'll say this is correct. And the test has passed and successfully posted a message, so we know the functionality of the application has continued to work after this change.

After we merge it, we'll see the end-to-end test-and-deploy process for this merge commit. If we take a look at that, we'll see that because we merged to the main branch, the deployment is already running. In production, we're creating a production build, and in a short period of time the production server should have the latest version of our application running on it. Here it's restarting the production instance, and the snapshot is being taken. So everything has succeeded; we've successfully pushed, and if we go to our website, it is now the shade of red that we changed it to. That's what an end-to-end CI/CD and ephemeral environment pipeline generally looks like, at a very high level.

Alright, so let's talk more about deployment automation in the next section. We've talked about deployment strategies.
But that's not the only thing in deployments. Deployment strategies help you reduce downtime and deploy in a way that doesn't affect your users, but another key consideration for deployment is making sure that there are enough resources for your containers or VMs, so that if there's a large burst of users, your application doesn't go down.

Let's say you're building a CI system; this hits close to home because, of course, I work at LayerCI, a CI company. Your users push code, you have to spin up runners to run tests against that code, and you see bursts of traffic during your users' business hours and significantly less traffic outside of those hours. For a peak load of 10,000 concurrent runs, you'd need at least 10,000 runners provisioned. However, at night, outside of peak hours, you wouldn't really need all 10,000 runners; most of them would sit idle.

So your usage might look like this, which is also indicative of a lot of applications: your lowest point is maybe 500 runners required, and your highest point is 10,000 runners required. That means you need 20 times more workers at the highest point in the day than at the lowest point.

In an ideal world, you'd be able to create and destroy these runners as necessary. During peak hours you'd create new ones, and off-peak you'd destroy them. That's the idea behind autoscaling.

It's only possible to create and destroy workers like this because of cloud providers. At their enormous scale, it's possible to offer servers cheaply on small, one-hour leases. The most popular technology at the time of this video is AWS EC2 spot instances, which act exactly like cloud-hosted VMs but with large discounts if you provision them for short periods of time. Another popular technology for autoscaling is Kubernetes horizontal pod autoscaling, which sounds daunting, but since many providers offer Kubernetes out of the box, you can assume that if you're using Kubernetes and containers, you'll get autoscaling if you configure it correctly. Just to illustrate: if you're using Microsoft Azure as your cloud provider, there are resources for autoscaling VMs and containers. If you're using AWS, there are again resources for VMs and for containers. And if you're using Google Cloud, there are resources for VMs and containers.

Autoscaling is usually discussed on the timescale of one-hour chunks of work. If you took the concept of autoscaling to its limit, you'd get serverless: resources that are quickly started and used on the timescale of milliseconds, say one to one hundred milliseconds. For example, a web server might not need to exist at all until a visitor requests a page. Instead, it could be spun up specifically for that request, serve the page, and then shut back down. That's exactly the idea behind serverless. It's almost like taking autoscaling, so provisioning resources as they're required, and doing it very quickly on very small time intervals.

Serverless is primarily used for services that are fast to start and stateless. You wouldn't run something like a CI job within a serverless framework, but you might run something like a web server or a notification service. Autoscaling is primarily used for services that are slower to start or require state. You'd likely run a CI job within an autoscaled VM or container, and not within a serverless container.
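The sizing logic behind that runner example is simple enough to sketch. The numbers below (one job per runner, a 500-runner floor, a 10,000-runner ceiling) are illustrative; real autoscalers such as Kubernetes' horizontal pod autoscaler apply the same idea to CPU or queue-length metrics rather than to a hand-written function like this.

```typescript
// autoscale.ts - sketch of the "how many runners do we need right now?" calculation.
export function desiredRunners(
  queuedJobs: number,
  jobsPerRunner = 1,
  min = 500,
  max = 10_000
): number {
  const needed = Math.ceil(queuedJobs / jobsPerRunner);
  // Clamp between the off-peak floor and the peak capacity.
  return Math.min(max, Math.max(min, needed));
}

// Example: at night desiredRunners(300) === 500; at peak desiredRunners(9800) === 9800.
```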
As of 2021, the distinction between the two models is becoming quite blurred. Serverless containers are becoming popular, and they often run for upwards of an hour. Serverless containers act exactly like containers, but they're created and turned off in a serverless manner, that is, in response to a trigger. Within a few years, it's likely that serverless and autoscaling will converge into a single unified interface. I'm excited about that; that's going to be the future of deployment. And that ends our discussion of autoscaling and serverless. I'll see you in the next talk.

Another key concept in deployment automation is service discovery. A database might be at one IP address and port, while the web server is at another, say 10.1.1.2:8080, chosen arbitrarily, and they have to discover each other, because the web server needs to talk to the database. This gets even more complicated as you add more copies of your web server or add entirely new services. Again, let's consider the MERN app from elsewhere in the DevOps Academy series.

You have a web browser, the user themselves visiting your website. It connects to your front end, and it connects to your back end to make API calls, and your back end connects to a database. Here there are three services that need to be discovered. The browser needs to learn that example.com corresponds to the front end and that example.com/api corresponds to the back end, and the back end needs to learn that the database is at, say, 10.1.1.3. So the back end needs to know the IP address and port of the database, and the browser needs to know the IP address and port of the back end and the front end.

In the very simplest configuration, everything is manually configured. The back end and front end are at static IPs and given host names within DNS (the Domain Name System, which is the mapping of names like example.com to IP addresses on the internet), and the back end is configured to connect to MongoDB at a specific host and port. Your DNS configuration (this is the CloudFlare configuration page, which is a DNS provider) would look like this: if the user visits example.com, send them to this IP address, and if they visit api.example.com, send them to that IP address. This is all manually configured; we've just manually put the IP addresses in.

Then, within the back end, we'd read an environment variable. Environment variables are key-value pairs that are easy to set when you're deploying things. So you'd say: connect to the host specified by the environment variable, on port 27017, which is the default MongoDB port. And then when you're starting the back end, you just have to specify the IP address that MongoDB is running at.

This configuration is completely fine for simple products. It's difficult to mess up, it's relatively secure, and it doesn't overcomplicate things; you can go pretty far with a simple configuration like this. Most products could launch an MVP without any service discovery at all.
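Here's roughly what that simple environment-variable configuration looks like in the back end, sketched with the official mongodb Node driver. The variable names are illustrative, not a convention the video prescribes.

```typescript
// db.ts - the simple environment-variable configuration described above.
import { MongoClient } from "mongodb";

const host = process.env.MONGO_HOST ?? "127.0.0.1"; // set manually at deploy time
const port = process.env.MONGO_PORT ?? "27017";     // 27017 is MongoDB's default port

export async function connect(): Promise<MongoClient> {
  const client = new MongoClient(`mongodb://${host}:${port}`);
  await client.connect();
  return client;
}
```

When you start the back end, you'd pass MONGO_HOST=10.1.1.3 (or whatever the database's address is) as part of the deployment.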
But you'll know that you need to start caring about service discovery when you see one of the following. You need zero-downtime deployments: you can't hard-code addresses like this if you want to do rolling deployments, because you can't easily automate where the arrows point to; you can't automatically change the IP addresses with such a simple setup. You have more than a couple of microservices: it's going to get hard to remember where they all are. Or you're deploying to several environments, like a developer environment, a staging environment, ephemeral environments, and a production environment, all with different IP addresses, and it's going to get pretty unwieldy to set the IP addresses all over the place.

Let's focus on zero-downtime deployments, because they're illustrative of the broader problem. Before that, though, let's talk about reverse proxies, which are another crucial system design and DevOps concept. The idea of a zero-downtime deployment is simple: as we've seen, you start a new version of the back end and front end, you wait until they're up, and then you shut off the old version of the back end and front end. This happens in both rolling and blue-green deployments. However, it's difficult to update the IP addresses in DNS itself. If our rolling deployments required changing those values directly in DNS, that wouldn't work very well, for various reasons. In particular, DNS can take a long time to propagate: users in countries other than the United States, for example, might take days to see the new IP address, and they'd still be trying to connect to the old version.

The solution is to add a web server that acts as the gateway to the front end and back end. We'd be able to change where it points to without changing the DNS configuration itself. Web servers like these are called reverse proxies, and they're really crucial for setting up zero-downtime deployments, and for service discovery itself.

Taking our MERN app and adding this level of complexity, the user's web browser would instead connect to the reverse proxy. The user asks the DNS system "where is example.com?", the DNS system responds with the IP address of the reverse proxy, and the reverse proxy takes the user's request and sends it to the appropriate front end or back end, depending on what the user asked to connect to. From there, everything else is the same.

So if you're running a deployment, say a rolling deployment, the proxy can choose which of v1 or v2 of the front end or back end to send the user's request to, and that's done just by changing a configuration file. After your deployment is done, you can turn off version one, and the reverse proxy can route traffic entirely to version two.

A straightforward approach is to store the service IPs in a hash table. Implicitly, in the process we just walked through, we assumed that our reverse proxy would know the IPs of the new versions of our apps, which is exactly the problem statement of service discovery. Instead of needing to manually tell our reverse proxy where the front end and back end live (where is the IP address of version two of our back end?), it would be convenient if we could automate it. When the new versions come online, they can update the values for the "backend" and "frontend" keys with their own IPs in this hash table, and then the reverse proxy can watch for changes to the table and use that for its routing decisions.
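As a sketch of what such a watcher does (this is essentially the job that tools like confd or consul-template already do for you), here's a small script that polls a hypothetical key-value endpoint, rewrites an nginx upstream file, and reloads nginx. The endpoint URL and the file path are made up for illustration.

```typescript
// watch-backend-ip.ts - a stand-in for what a tool like confd does.
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

const KV_URL = "http://127.0.0.1:2379/v1/keys/ips/backend"; // hypothetical key-value endpoint
const CONF_PATH = "/etc/nginx/conf.d/backend-upstream.conf"; // illustrative path

let lastIp = "";

async function poll(): Promise<void> {
  const res = await fetch(KV_URL);
  const ip = (await res.text()).trim();
  if (ip && ip !== lastIp) {
    // Point the "backend" upstream at the new instance and reload nginx.
    writeFileSync(CONF_PATH, `upstream backend {\n  server ${ip};\n}\n`);
    execSync("nginx -s reload");
    lastIp = ip;
    console.log(`backend now routes to ${ip}`);
  }
}

setInterval(() => poll().catch(console.error), 5000);
```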
For a very concrete example, about as close as the original videos get to code, let's look at this nginx configuration. nginx is a very popular reverse proxy, and it's very commonly used at large tech companies. It lets you define where various host names go. If you pointed example.com at the nginx reverse proxy (again, in the picture, this would be nginx), the user thinks they're connecting to your website, but they'd actually be sending the request to nginx, asking for example.com, and nginx would take their request and forward it to your actual front end. So nginx just has to learn where the IP for your front end is, and that's what this configuration does. We're telling nginx directly: take this key from this file, using confd, a tool which reads from a hash table and updates the configuration file, and then send the user there.

So all you need is a key-value store which has a key for the front end whose value is the IP of the current front-end version. Then, to run your rolling deployment, you'd start a new front-end version, check that it was alive, and then just change the key in the hash table to point to the new one. confd would pick up that change, replace this value with the new IP of version two of the front end, and then reload nginx, which changes the arrow to point at the new version of the front end.

That was a lot to take in, so let's back up a bit. All you'd need to do is have your front end, when it starts, set the key in the hash table for the front-end IP to its own IP, and have your back end do the same for the back-end key. That way, when a new version of the application starts, it updates the key in the table, and nginx starts routing users to the new version of the application. This is what proxy passing means in nginx; that's the proxy_pass directive you see. But this is all fairly involved. It's just an illustrative point: if you were to implement this yourself, this is how you'd do it.

The most common thing used in industry is service discovery using DNS itself. We thought of DNS before as the slow protocol that might take days to propagate changes across the network, but you can run DNS locally, and that's the industry standard.

So let's talk about DNS a little bit. The idea of DNS is just to map host names to IPs. When you visit layerci.com, for example, the global DNS system will first map the name layerci.com to its addresses: at the time of this video, two IP addresses which are just arbitrary computers connected to the internet. You can use the dig command on a website to see what those addresses are; it's saying that for the key layerci.com, the values are these two addresses.

Usually when people mention DNS, they mean the global service, so visiting websites on the internet. However, as I mentioned, it's possible to run DNS internally. It would be ideal if, in our nginx configuration, we could just specify http://frontend and have "frontend" resolve to the IP of our front-end service. That way we wouldn't have to change anything except the DNS configuration. And that's exactly how DNS-based service discovery works.
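For illustration, here's what that lookup boils down to in code, using Node's built-in DNS resolver; the internal DNS server address in the comment is hypothetical.

```typescript
// resolve-service.ts - what a DNS-based service discovery lookup boils down to.
// Inside a cluster with internal DNS (CoreDNS, Kubernetes DNS, etc.), the name
// "mongo" below would resolve to whatever IP the platform currently assigns it.
import { Resolver } from "node:dns/promises";

const resolver = new Resolver();
// Optional: point at an internal DNS server instead of the system default.
// resolver.setServers(["10.96.0.10"]); // hypothetical cluster DNS address

export async function lookupService(name: string): Promise<string[]> {
  return resolver.resolve4(name); // returns the current IPv4 addresses for the name
}

// Example: lookupService("mongo").then(console.log);
```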
You configure your services to query a DNS server that you control. So instead of saying "mongodb://" plus an IP address pulled from an environment variable, you just say mongodb://mongo, where "mongo" is a key in the DNS server that you control.

Of course, it's not trivial to deploy your own DNS server. In practice, though there are popular options like CoreDNS, the most likely thing you'd do is use a cloud provider's or Kubernetes' internal solution.

The end result you'd get is something like this: the user's web browser connects to nginx, thinking it's the website. nginx asks the DNS provider, "where is the API right now?", and the cloud provider responds with the IP address; given the deployment in progress, whether blue-green or rolling, this is what we currently want users to reach when they want the API. That address would correspond to version one or version two of the back end, the proxy would forward the request there, the request would be fulfilled, the response would go back to the proxy, and then back to the user.

The conclusion of all of this is that service discovery is tricky, but vitally important as a foundational building block for these deployment strategies, and for deployment automation in general. If you configure service discovery in a manner appropriate for your deployments (DNS-based in a Kubernetes cluster, for example), it makes it significantly easier for developers to have microservices that talk to each other. Instead of a developer having to deal with where MongoDB actually lives, they can simply connect to mongodb://mongo, and you, as the DevOps or platform engineer, can configure where that "mongo" name points so that it's always the right place, the right IP address. By decoupling the application logic from the deployment logic, you'll help the developers on your team build faster, and you'll be able to deploy more easily. So that's it for deployments. Let's go on to the next and final pillar, which is application performance management.

There aren't that many general topics in application performance management, so this section will be a little bit shorter. We'll go into more detail in future sections of the DevOps Academy, but just for this introductory video series, let's talk about two core concepts. The first is log aggregation: a way of collecting and tagging application logs from many different services into a single dashboard that can easily be searched. Log aggregation is one of the first systems that has to be built out in an application performance management setup. Just as a reminder, application performance management is the part of the DevOps lifecycle where things have been built and deployed, and you need to make sure that they keep working, that they have enough resources allocated to them, and that errors aren't being shown to users.

In most production deployments, there are many related events that emit logs across services. At Google, a single search might hit five different services before being returned to the user. If you got unexpected search results, that might mean a logic problem in any of the five services. Log aggregation helps companies like Google diagnose problems in production: they've built a single dashboard where they can map every request to a unique ID.
So if you search for something, your search gets a unique ID, and then every time that search passes through a different service, that service connects that ID to whatever it's currently doing.

This is the essence of a good log aggregation platform: efficiently collect logs from everywhere that emits them, and make them easily searchable in the case of a fault. Again, this is our main app: the user's web browser connects to a front end and a back end, and the back end connects to a database.

If the user told us "the page turned all white and printed an error message", we'd be hard pressed to diagnose the problem with our current stack. The user would need to manually send us the error, and we'd need to match it with the relevant logs in the other three services. Let's take a look at ELK, a popular open source log aggregation stack named after its three components: Elasticsearch, Logstash and Kibana.

If we installed it in our MERN app, we'd get three new services. The user's web browser, again, connects to our front end and back end, the back end connects to Mongo, and all of these services (the browser, the front end, the back end and Mongo) send their logs to Logstash.

The way these three components of ELK (Elasticsearch, Logstash and Kibana) work together is that all of the other services send logs to Logstash. Logstash takes these logs, which are text emitted by the application. For example, when you visit a web page, the web page might log "this visitor accessed this page at this time", and that's an example of a log message. Those logs are sent to Logstash, which extracts things from them. So for a log message like "user did thing at time", it would extract the time, extract the message, extract the user, and include those all as tags. The message becomes an object of tags plus the message, so that you can search it easily. You could say: find all of the requests made by a specific user.

But Logstash doesn't store things itself. It stores them in Elasticsearch, which is an efficient database for querying text. And Elasticsearch exposes the results to Kibana.

Kibana is a web server that connects to Elasticsearch and allows administrators (you as the DevOps person, or other people on your team, like the on-call engineer) to view the logs in production whenever there's a major fault.

So you, as the administrator, would connect to Kibana, and Kibana would query Elasticsearch for logs matching whatever you wanted. You could type "error" into the search bar, Kibana would ask Elasticsearch to find the messages that contain the string "error", and Elasticsearch would return results that had been populated by Logstash, which in turn would have assembled them from all of the other services. If you visited a web page, this might be the sort of log that is emitted, and it might be processed into an object like this: it has a date, in a simple time format that's the same for all messages emitted by all the different services; a service field saying which service submitted the log; and the message, the actual content of the log.

And the processor, Logstash itself, would often be connected to the internet so that JavaScript in the browser can catch errors and send them to Logstash, although there are additional services, like Sentry, that might be better suited for that.
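To sketch what that processing might look like (this is an illustration only, not the course's exact configuration, and the port, pattern and host names are made up), a minimal Logstash pipeline could split a raw line such as "2021-05-04T10:00:00Z frontend user 42 loaded /dashboard" into a timestamp, a service and a message:

input {
  # Services send their raw log lines to this TCP port.
  tcp { port => 5000 }
}

filter {
  # Pull the timestamp, the service name, and the rest of the message out of each line.
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{WORD:service} %{GREEDYDATA:log_message}" }
  }
  # Use the parsed timestamp as the event's official time.
  date { match => ["timestamp", "ISO8601"] }
}

output {
  # Store the tagged object in Elasticsearch so Kibana can search it.
  elasticsearch { hosts => ["http://elasticsearch:9200"] }
}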
How would we use ELK to diagnose a production problem? Well, let's say a user says "I saw error code 1234567 when I tried to do this." With ELK set up, we'd go to Kibana, enter 1234567 in the search bar, and press enter, and that would show us the logs that correspond to it. One of those logs might say "internal server error, returning 1234567". We'd see that the service that emitted that log was the back end, and we'd see what time that log was emitted at. So we could go to that time in the back end's logs, look at the messages above and below it, and get a better picture of what happened for the user's request.

And we'd be able to repeat this process, going to other services, until we found what actually caused the problem for the user.

The final piece of the puzzle is ensuring that logs are only visible to administrators. As logs can contain sensitive information like tokens, it's important that only authenticated users can access them. You wouldn't want to expose Kibana to the internet without some way of authenticating. My favorite way of doing this is to add a reverse proxy like nginx (again, our friend nginx) and then have the auth_request mechanism check that the user is logged in. In our back end, we could add something like this, which simply returns a status: if the user visits example.com/auth_request and they're an admin, it returns a successful status, and if they're not an admin, it returns an unauthorized status. Then we could configure nginx, again as mentioned in previous videos, to have these location blocks: the /private location would connect to /auth, and we could make sure that if this was /logs, for example, the user was logged in. With this auth_request directive, if the user visits /logs and they're not an administrator, they won't be able to access the logs (there's a rough sketch of this configuration at the end of this section). Alternatively, Elasticsearch itself is run by a company called Elastic, and they have a paid version which contains something called X-Pack, which facilitates this as well. So you can go for either a reverse proxy that authenticates users, or the paid version of the application.

As an aside, you can use log aggregation as an extra test. In your CI pipelines, where you want to tell whether code is good or not, you can repurpose your log aggregation stack to ensure that no warnings or errors occur while the tests run. If your end-to-end test looks like this (you start your stack, you start your logging stack, you run your tests with npm run test), you could add an extra step which queries Elasticsearch for logs matching "error", and make sure there are no logs that printed an error. That way, even if all of your tests pass, an error that happened during the run still gets caught, and that error might be important despite the tests passing. So this adds a free extra check to your CI stack.

And there are a few examples of log aggregation platforms. There's Elasticsearch, Logstash and Kibana, which we talked about. Fluentd is another popular open source choice. There's Datadog, which is very commonly used at larger enterprises; it's a hosted offering. There's LogDNA, which is another hosted offering. And the cloud providers also provide logging facilities, like AWS CloudWatch Logs. So log aggregation is a key tool for diagnosing problems in production.
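Here's the promised sketch of that auth_request setup. It's illustrative only: the host names and ports (kibana:5601, backend:8080) are placeholders rather than the exact configuration shown in the video.

server {
    listen 80;
    server_name logs.example.com;

    # Only reachable if the auth subrequest below returns a 2xx status.
    location /logs/ {
        auth_request /auth;
        proxy_pass http://kibana:5601/;
    }

    # Internal-only location: nginx asks our back end whether the user is an admin.
    # The back end returns 200 for admins and 401/403 for everyone else.
    location = /auth {
        internal;
        proxy_pass http://backend:8080/auth_request;
        proxy_pass_request_body off;
        proxy_set_header Content-Length "";
    }
}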
It's relatively simple to install a turnkey solution like ELK or CloudWatch, and it makes diagnosing and triaging problems in production significantly easier. That's it for log aggregation. I'll see you in the next talk.

The last topic we're going to talk about is metric aggregation. Metrics are simply data points that tell you how healthy production is. As you can see on the screen, things like CPU usage, memory usage, disk I/O and file system fullness are all important production metrics that you might care about. If log aggregation is the first tool to set up for production monitoring, metrics monitoring would be the second. They're both indispensable for finding production faults and debugging performance and stability problems.

Log aggregation primarily deals with text; logs are textual, of course. In contrast, metric aggregation deals with numbers: how long did something take, how much memory is being used.

It's frighteningly difficult to understand what's going on in a production system. Netflix, for example, measures 2.5 billion different time series to monitor the health of their production deployments. Successful metric monitoring means being able to automatically notify the necessary teams when something goes wrong in production.

Let's keep looking at open source implementations of DevOps tools, to keep things general. Prometheus, a tool originally deployed at SoundCloud, is one of the most popular metrics servers, and this is what it looks like. It's similarly structured: the inputs are sent to the retrieval component, so things like nodes reporting how much disk they're using to the metrics server, but also how long services are taking, and ELK itself could parse numbers out of logs and send them to Prometheus. Prometheus figures out which services to pull metrics from using service discovery, from the previous video. It then takes those measurements and stores them in a time series database (the equivalent for numbers of what Elasticsearch is for text), and that's stored on the Prometheus server node itself.

And then, finally, there's a front end, so other services can query Prometheus to do things. One thing you might want to do, if there's something terribly wrong, like your website being down, is connect to PagerDuty, or email someone, or send someone a text message with Twilio, to page the on-call engineer and tell them that something is wrong.

But you might also want to query metrics to get a view like this one, and that's what PromQL is used for. Grafana is the view here, that dark view with the graphs, and it's a common way of viewing these time series. But you can make your own, you can build APIs, and there are many other front ends that connect to Prometheus.

The diagram above is daunting, but it's quite similar to the architecture we discussed for log aggregation frameworks. There are four key components, like I mentioned: the time series database that actually stores the measurements, retrieval, the alert manager, and the web UI.

As for the sorts of metrics we collect: there's a lot of subjectivity about which metrics are important, based on what your product does and who your users are. But here are a few ideas for what you'd store in something like Prometheus.

Request fulfillment times. These are very useful for understanding when systems are getting overloaded, or whether a newly pushed change has negatively impacted performance.
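To give a flavor of what querying those request times might look like, here's a hedged PromQL sketch. The metric name http_request_duration_seconds is just a common naming convention that your app may or may not export; the query asks for the 95th-percentile response time over the last five minutes.

# 95th-percentile request duration over the last 5 minutes,
# assuming the app exports a histogram named http_request_duration_seconds.
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

Grafana panels and alerting rules are usually just queries like this one, evaluated on a schedule.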
These times are often parsed out of logs using a regular expression, for example, or taken out of a field in a database.

For a website or REST API, a common request fulfillment time would be time to response. That way, slow web pages can be discovered and identified in production.

A related metric that is very indicative of problems is request counts. If there's a huge spike in requests per second, it's very likely that at least a few production systems will have trouble scaling. Watching request counts can also be used to detect and mitigate attacks like denial of service attacks, which are when attackers send many malicious requests to services in production.

The last common metric across many types of companies is server resources. Here are a few examples. Database size and maximum database size: if you have two terabytes of disk for your database and you're 1.5 terabytes in, you might want to alert someone to increase the amount of disk available for the database, or to delete things that aren't being used. Web server memory: if your web server is taking a lot of requests per second and doing a lot of processing, it might require more memory, and if it runs out of memory it will crash and your users won't be able to access your website anymore.

Network throughput: if you're downloading or uploading many things, you can saturate your network, and that would also cause degraded performance. And a final one is TLS certificate expiry time. The certificates behind the lock icon in the browser, the ones the browser uses to decide whether the connection is secure, are used all over the place internally, and they cause problems if they're not measured and alerted on. For example, Google Voice had an outage in 2021: Google, of all companies, wasn't measuring when one of their TLS certificates would expire, and that caused an outage a few months ago.

Production faults very rarely look like "no users can access anything". There's often a gradual ramp, certain APIs taking longer and longer, and then eventually everything breaks. Quantile analysis is an easy way to pare down production statistics into something actionable. A website might measure how long it takes for users to fully load its landing page, to notice when there's a very obvious production issue. With quantile analysis, you'd split request times into many different buckets: how long did the slowest 1% of requests take? The slowest 5%? The slowest 25%? So if your landing page is slow only for logged-in users, then just by visiting it without logging in you might not notice that the page is very slow. But the logged-in users would show up in the slowest-1%-of-requests bucket, and you'd see that those users are having a degraded experience.

For example, stackoverflow.com itself was once notified of an outage because their landing page was taking a long time to respond to requests, due to a specific post that had been published to Stack Overflow. For metrics analysis, there are many common production tools. There's Prometheus and Grafana, as we mentioned. There's Datadog again, not only for log aggregation but for metrics aggregation as well. There's New Relic, which is, I would say, maybe the old reliable option. And the cloud providers ship their own versions of this: AWS CloudWatch metrics, Google Cloud Monitoring, and Azure Monitor metrics.
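Before wrapping up, here's a hedged sketch of how metrics and the alert manager tie together, using the TLS expiry example from above. It assumes the blackbox exporter's probe_ssl_earliest_cert_expiry metric, which your setup may or may not expose, and the threshold is arbitrary.

# Sketch of a Prometheus alerting rule file; names and thresholds are illustrative.
groups:
  - name: certificates
    rules:
      - alert: TLSCertificateExpiringSoon
        # Fire when a monitored certificate expires within 14 days.
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "A TLS certificate expires in less than 14 days"

The alert manager would then route an alert like this to email, PagerDuty, or wherever the on-call engineer is reachable.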
That's it for application performance management. Thanks for watching.