Hi guys, welcome back to the YouTube channel. For those of you who don't know, I am Johnny Chivers, a data engineer with over 10 years' experience, working primarily Monday to Friday in AWS within the cyber security sector. Today's video is a completely free course on AWS data engineering. It's loosely based upon the AWS Data Engineering Immersion Day, which is a course you can do with AWS if you work in a company that has AWS as its main provider. I personally sat that course about nine months ago in a classroom setting and got great value out of it, so what I've done is take the content of that course and modify it so we can do it in a non-instructor-led environment. That means you can spin up the resources you need at home and then follow along with me. To do this I have made everything completely free on my GitHub, including the slides for the theory, so if you navigate to the GitHub using the link below you'll land on the repo that has everything you need. Right-click and download it as a zip file, open the zip, and you'll have all the course content you'll need as we go along. As I said, the course is completely free, so all I ask for is a like and subscribe. What we'll touch upon in the course today is AWS Kinesis, AWS Database Migration Service and AWS Glue. In terms of requirements, you will need an AWS account, and I will be leaving the free tier inside AWS to complete some of this. In terms of cost, it was five dollars for me to spin everything up and keep it running for five days — I had to keep it up for a longer period because I had to build the course, test the course and record it. If you spin things down after you're done it shouldn't cost you any more, but please check out the pricing structure in your region before you follow along. In the description you'll also find a link to theQuestionBank.io. This is a completely free, community-driven resource I've created where we can submit questions, have them peer reviewed by other users, and have them entered into the bank of questions. Currently it's all AWS questions, so if you're heading towards certification or you just want to test your knowledge of what we're learning today, please head over and sign up — it's completely free and I intend to keep it that way for the duration of the app's life. Okay, with that being said, hopefully you've downloaded the resources from GitHub. I will go over again where these sit in GitHub as we reach the different sections, but it's easiest if you just have them all in one place. Now we're going to jump onto the actual slides, so open them up on one side and you can annotate them as we go along. We're going to do one slide on what data engineering is, and then we're going to jump into the Kinesis theory, so open those slides and join me there — I'll jump on the computer, get the slides open as well, and then we'll go through the theory before jumping onto a practical lab. Okay guys, this is the first set of slides. As I said, you can download them all from that GitHub link down below — the file is just called something like slides.pdf — so feel free to download them and follow along throughout the course, annotate them, do whatever you want with them; it's completely free and completely up to you. The first thing we're going to take a really quick look at is a definition of data engineering, for those of you who are new to data engineering, or maybe those who work in the field of data and want to know the difference between data engineering and data science.
Just a quick definition: data engineering is the process of collecting, analysing and transforming data from numerous sources; the data can be transient or persisted to a repository. It really does cover a whole multitude of sins, and that's what data engineering looks like for me on a daily basis: I'm bringing data in from multiple different sources, it can be structured or unstructured, it can be real time or batch, and I can be doing this through any number of technologies such as Kinesis, Glue or Lambda functions — or even, more traditionally, just doing it on Linux myself and moving the data around by hand, which is not best practice for production, but sometimes you just have to get the data in. So it really is just that process of getting data, moving data, cleansing data, transforming data and prepping it for applications or users downstream. The first thing we're going to look at is AWS data streaming, and that really means Kinesis. We will do a lab on Kinesis after this where we'll touch Kinesis Firehose, the Kinesis Data Generator and Kinesis Data Analytics, but there is a wee bit more to Kinesis, and if you're going for certification it's important you understand it, because it will come up on the exams. It always comes up on the AWS Certified Cloud Practitioner exam, it does come up on the AWS Solutions Architect exam, I got it on the Solutions Architect Professional exam, and it obviously comes up — and I got it — on the AWS Data Analytics Specialty. So it came up on every AWS exam I've sat, and there are a couple of important things to know about Kinesis in terms of its architecture and its throughput rates. With that being said, what is AWS Kinesis? First of all, it's real time, and that's the most important thing to know going forward: when you want a real-time solution, you should look at AWS Kinesis. It ingests data onto your stream in real time, and then you can pull the data off that stream to derive insights, store it in repositories or even transform it in transit. Secondly, it's fully AWS managed, and that's really important. You may have heard of a streaming platform called Kafka before: Kinesis is the AWS-native solution for real-time streaming, while Kafka is an open-source streaming solution. AWS does offer managed Kafka as well (Amazon MSK), but Kinesis is its native streaming service, and because it's native it's fully managed by AWS — AWS looks after all the complex infrastructure underneath it, and it looks after the majority of the code needed to spin it all up, so all we have to do is get on with actually putting data on the stream and taking it off. It removes the entire overhead of the very complex DevOps process that exists for other streaming platforms. And then, most importantly, it's scalable: it can handle a few hundred records or scale out to thousands of records from different sources, depending on how you set it up, and if your stream runs out of capacity you can add more shards or even spin up multiple streams. So it scales from a very small amount of real-time data up to a huge amount. With that being said, there are four flavours, which you can see around me here. The first one is Kinesis Video Streams. I will not be touching this in the demos we're going to do — this is the only time I'll mention it — and it has not come up for me in any of the exams, ever; that's not to say it won't, it just hasn't for me. It's exactly what it says: it's where you stream video in real time.
You can then connect those images to machine learning devices or processes and infer things about, say, users' faces — really interesting stuff, I just haven't personally touched it. The next one, down at the bottom here, is the fundamental one: Kinesis Data Streams. This is the building block and the most involved Kinesis service there is. It's scalable, durable and real time, and it's basically where you put data onto the stream from a producer and pull it off with a consumer — and we are in charge of both putting the data on and pulling the data off in Kinesis Data Streams. The third one, up here beside me, is Kinesis Data Firehose. This is more of a managed service: we're still responsible for putting the data onto the stream, but there are pre-configured, in-built consumers in Kinesis Firehose, so you can automatically stream to the likes of S3, to the likes of Splunk, and even into BI dashboards depending on what you're using. Firehose is a bit more managed and there's less to configure compared with Data Streams — it's looked after by AWS — but what you can stream out to is a smaller subset of services, whereas Data Streams lets you stream out to numerous destinations by using the libraries yourself. And then lastly, on this side of me, there's Kinesis Data Analytics. Data Analytics is the easiest way to process data in real time and infer information about it: you build apps around it using SQL, and with that you can do counts on the data, aggregates on the data, or look for anomalies. We'll actually be doing an anomaly test in the lab: we'll spin up a Kinesis Data Firehose stream with a Kinesis Data Analytics application attached (Firehose can feed Data Analytics), and we'll build an anomaly app that looks for variations in the data we send; if it spots an anomaly it will flag it via email to say it has received a rogue record. So that's the lab we're going to do — Kinesis Data Firehose plus Kinesis Data Analytics. Now, this is a high-level architecture of Kinesis Data Streams, and it's really important to understand it because it will come up on any of the AWS certifications you choose to do. Conceptually it's pretty simple: you have a series of producers over here on the left-hand side, and they can be literally anything — there's the Kinesis Producer Library, there's the AWS SDK, and there are the CLI tools from Amazon as well, so there are a few different ways you can move data from producers through code. There are other ways I've done it personally: we've put an API Gateway in front of our stream with a Lambda function behind it, the Lambda function puts the data onto the stream via an SDK library, and our producers call the API endpoint. That was done for third-party companies who wanted to send us data that we wanted to place on the stream: they'd make an API call from their client machine, server or database, it would send the data to our API, we'd do some authorisation in the middle to make sure it was the correct payload and that both sides knew who they were dealing with, and then our Lambda function would use a library to put it onto the stream. There is an example on this channel of actually building that, so you can go check it out. Once the data is there it sits on the stream itself, and streams are made up of shards.
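To make the producer side concrete, here's a minimal sketch of the SDK route in Python with boto3 — the stream name and event fields are hypothetical, and in the labs the Kinesis Data Generator does the producing for us, so treat this purely as illustration:

```python
import json
import boto3

# boto3 client for Kinesis Data Streams (assumes credentials and region are already configured)
kinesis = boto3.client("kinesis")

def send_click(event: dict) -> None:
    """Put a single record onto a hypothetical 'my-clickstream' stream."""
    kinesis.put_record(
        StreamName="my-clickstream",          # assumption: this stream already exists
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),   # records with the same key land on the same shard
    )

send_click({"user_id": 42, "page": "/checkout", "ts": "2021-01-01T12:00:00Z"})
```

The same put_record call is what the Lambda function behind API Gateway would be making in the pattern I just described.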
A shard is basically the amount of throughput you have, and we'll cover the definition properly in a second. So you have a stream, streams consist of shards, and then at the other end we have consumers. The consumers in this example are EC2 instances, but they could be Lambda functions or other pieces of code you have; what matters is that they have a consumer library on them — EC2 gets this through the SDK, and Lambda functions get it through code and act as a trigger. When data lands on the stream, the consumer is constantly listening or polling, and it pulls the data off, processes it, and pushes it downstream to a repository. Let's run through that one last time: we have our producers producing the data; they push it onto the stream via the SDK, the Kinesis Producer Library or the CLI, for example; the stream is made up of shards inside Kinesis, where we simply add shards to the stream; the shards are then listened to, or polled, by consumers; and when data arrives, a consumer lifts it off the shard, processes it, and can push it downstream to another service or application. It's a pretty simple architecture, and as I said, once we get into Kinesis the only thing we actually have to worry about — and you can see this in other videos on this channel — is how the shards work and how we put data on and pull it off; the infrastructure in this box here is not handled by us, it's looked after by AWS. If you are thinking about going for one of the certifications, there are a couple of important definitions you need to know. You need to know what a producer is — as I said, it's what puts the data onto the stream — and that will come up. The retention period always comes up and makes a good question: the default is 24 hours, so the stream keeps the data for 24 hours, and you can now increase that up to 365 days — a full year. Obviously the longer you hold data on the stream the more expensive it gets, because Kinesis is pay as you go, so how long you persist data on the stream is up to your use case. Then, as I said, the shard is really important. A shard is a uniquely identified sequence of records on the stream, and a stream is composed of one or more shards, as we covered in the previous diagram. Each shard can support up to five transactions per second for reads, up to a maximum total data read rate of 2 MB per second, and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second. When you're calculating how many shards you want, you take those limits, work out how much data you're sending, how much data you want to pull off the stream and how quickly you want to do it, and that tells you the number of shards. That will come up in the exam — not calculating shards per se, but you'll need to know that it's five transactions per second for reads up to a maximum total data rate of 2 MB per second, and up to 1,000 records per second for writes up to a maximum total data rate of 1 MB per second. They will ask you that in the exam; it comes up a lot: what are the throughput rates for Kinesis?
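As a rough worked example of that shard calculation — the workload numbers here are invented for illustration, not taken from the course:

```python
import math

# Hypothetical workload: 2,000 records/sec written, each record ~0.8 KB,
# and we want to read the full stream back out once.
records_per_sec = 2000
record_size_kb = 0.8

write_mb_per_sec = records_per_sec * record_size_kb / 1024   # ~1.56 MB/s

shards_for_write_rate = math.ceil(write_mb_per_sec / 1.0)    # 1 MB/s write limit per shard
shards_for_write_records = math.ceil(records_per_sec / 1000) # 1,000 records/s limit per shard
shards_for_read_rate = math.ceil(write_mb_per_sec / 2.0)     # 2 MB/s read limit per shard

shards_needed = max(shards_for_write_rate, shards_for_write_records, shards_for_read_rate)
print(shards_needed)  # -> 2 shards for this made-up workload
```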
Another important thing is the partition key. You can add a partition key, and it only really matters if you have more than one shard: if you have partition keys of a, b and c and three shards, and you're near the maximum rate, it will put all the a records on one shard, all the b records on another and all the c records on another, and that makes them easier to consume in order. The sequence number is the sequence in which the shard actually receives the data — it's not a guaranteed ordering of what you sent, it's the order in which the records were received and written — and that's important if you're trying to make sure you haven't missed any records or left gaps when you expect the feed to be continuous. And then, as we said, the consumer is the thing that actually gets the records off Kinesis, and it could be numerous things, because there's a consumer library, an SDK and a CLI you can pull data with — so there are different ways to get the data off Kinesis itself. Firehose is a bit easier to understand, because we don't manage any of the shards or the underlying infrastructure: really, all we do is have a data source that puts data onto the Firehose delivery stream; Firehose then looks after everything for us and outputs the data, transformed or not, to a destination — commonly that's an S3 bucket, but there are others, and we'll cover exactly what they are in a minute. A couple of definitions you do need to know. We already know what a data producer is — it's the thing that puts data on — and of course it has to use a library that can write to the stream, which again is the SDK, the CLI, or the Kinesis Producer Library. The record is what you'd expect, but note that a record can be as large as 1,000 KB — you can't exceed the 1,000 KB record size on Kinesis Firehose, so just remember that; it might come up on the exam. This bit is really important: buffer size and buffer interval. The way Kinesis Firehose works is that it buffers data until it reaches a certain volume (the buffer size) or until a time interval expires, and then it writes the data out as a batch — so it's near real time, close to real time. The minimum buffer interval is 60 seconds at the moment, and the minimum buffer size is a couple of megabytes before it will write out, but do check the current values in your region. So Firehose builds data up and then writes it down in batches; in practice, when I'm using it in labs, we have to wait the 60 seconds and it takes a bit of data volume before it writes, so bear that in mind — and for the exams, remember that buffer size and buffer interval control when it writes down. And, as promised, here are the destinations as things currently stand — do check these on the AWS website, because they change and new ones get added all the time: S3, as expected; Redshift, so you can write straight to the warehouse; Elasticsearch, so you can write straight into your Elasticsearch cluster; Splunk; Datadog; Dynatrace (which I don't use); LogicMonitor; MongoDB; New Relic; and Sumo Logic. So there are quite a few places you can write to using Kinesis Firehose. Lastly, there's Kinesis Data Analytics. It's simpler again, because you manage barely anything, and we will be using it in the lab, so I'd advise checking out the link at the bottom if you want to know more. But it's very simple: in the lab we're going to use a Kinesis Firehose stream (sorry, not a plain Kinesis stream), we'll put data onto it, and Kinesis Data Analytics will use SQL code that forms tables over the data we put on in a streaming fashion; it processes the data with the query we write and then outputs the results. So you put data on, you have a query that builds a table out of that data in real time, it looks for whatever you've asked it for, and then it pushes that data downstream as a result set for you.
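Before we move on, here's a minimal sketch of what the producer side of Firehose looks like from code with boto3 — the delivery stream name is hypothetical, and in the lab the Kinesis Data Generator does this part for us:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Assumption: a delivery stream with this name already exists
# (in the lab, the pre-lab CloudFormation stack creates one with its own name).
record = {"user_id": 42, "event": "click", "ts": "2021-01-01T12:00:00Z"}

firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    # Newline-delimiting each JSON record keeps the batched S3 output easy to query later
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```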
We'll be doing an anomaly example where we're searching the data all the time, looking for a record that's not correct; it will spit out that record and then we could do something with it if this were a real-life example. Okay, with that being said, that is the theory side of Kinesis and Kinesis Data Analytics. You will need to know more for an exam — I have left links so you can go and read the white papers — but this is just an introduction, and what we really want to get on with is hands-on labs, so we're actually doing stuff. Those are the building blocks, so let's jump onto the AWS console and spin up a Kinesis Data Analytics application with Kinesis Firehose, and we'll see it in action sending data from the Kinesis Data Generator — join me there. For the Kinesis element of this lab we're actually going to use the Data Engineering Immersion Day materials to help us set up. We're not going to use the rest of its labs for DMS or Glue — I've tried to set those up, and while I could get them running eventually, there's just too much involved for us to do together outside an instructor-led lab — so instead I'm going to use my own examples that you can follow along with, and I'll supply all the data, all the code and all the guidance to set them up. We'll do the same things that are in those labs, just not from that documentation, because there's too much ambiguity in trying to set it up. We do need the Kinesis setup though, so we're going to use the streaming lab — again, the link to this workbook is below the video, so just go there. If we click on the Streaming Analytics lab and then the pre-lab setup, we're going to set up the Kinesis Data Generator, the Firehose and all the different elements in between. To do that, make sure you're signed in to the AWS console (obviously) and click Deploy to AWS; we'll end up back in CloudFormation with a fair bit of information to enter. So, just to repeat: scroll down to Deploy to AWS, click on that link, and it will open up for you. If you do not have access to this guide or you do not see this page, the other way to do it is back in CloudFormation: go to Stacks, hit Create stack, and choose to upload a template. To get the template, go to the AWS GitHub page — the link is as I have it here — and download it as a zip, which pulls all of that content down locally onto your computer; it takes a few seconds. Once downloaded, go back to your CloudFormation stack, choose Upload a template file and then Choose file; you want the file unzipped, so in Downloads I have the zip and the unzipped master copy right there. Unzip it, and you're looking for the Kinesis pre-lab — I'll struggle to find it for a moment — so go into the pre-lab folder and pick the Kinesis real-time clickstream template; not the PDF, sorry, it's the JSON. Hit Open, and that should be you ready to go. But alternatively, as I said, it's easier just to do it through the Deploy to AWS button. Once you're up and running, back on this page — which, I repeat, comes from clicking Deploy to AWS — you need to enter a few things: a username, a password, an email address and an SMS number.
The username and password are what we'll use to sign in to the Kinesis Data Generator. The Kinesis Data Generator is just a tool AWS provides that generates data for Kinesis, which means we don't have to create dummy data ourselves — it's fantastic, I use it all the time in my Kinesis videos and the 101 series — and this stack is going to set most things up for us. The email address is the address we want information pushed to: in the lab, when we get to it, we'll be creating an anomaly clickstream application, and when data in our Kinesis stream looks a bit weird it's going to send us an email to tell us — that's what it's for. The SMS entry does exactly the same thing by text. For the username, I'm just going to call mine admin — make sure you remember what you choose. The password has to be six alphanumeric characters and contain one number, so I'm just going to use password1, and that should do the trick. For the email, enter the address you want the anomaly notification sent to — mine reminds you of johnnychivers.co.uk; please visit the website, where all this information is free, along with all the videos. Then type in the mobile number you want the anomaly notification sent to as well; the email address is the more critical one, so don't worry too much about the mobile if you don't want to enter one, but the email is definitely something you want to provide. Okay, acknowledge everything and click Create stack, and as you can see the stack is off and running. That takes a few minutes — I paused the video, and after about 50 seconds you'll see CREATE_COMPLETE appear here. What's also happened in the background is that you've been sent an email, and that email is to subscribe to the clickstream topic. If you go to your email account you'll have an email — as you can see I've been sent multiple, because I was practising a couple of hours before this video — saying you have chosen to subscribe to a topic. You need to click the Confirm subscription link at the bottom of the email, and that's your subscription confirmed: we're now subscribed to the notification topic the lab will use, so that when an anomaly happens it sends us an email. So again: go into your emails, find the email, click Confirm subscription. Right, with that being said, the next thing we need to do is configure the Kinesis Data Generator. On the stack's Outputs tab, which is at the top, you'll have a URL — the Kinesis Data Generator URL. Click on it and it should take us to the Kinesis Data Generator. The username and password are the ones you typed in at the start of the CloudFormation script; as I said, keep them simple and delete everything once you're done. The most important thing here is to select the region you're operating in — for me it's us-east-1 — and you can see that the CloudFormation template created a delivery stream for us, so the stream field has already been populated, in the sense that the stream was spun up for us in the background. If you want to see that quickly, open up another AWS console tab, make sure you're in the correct region, go to Kinesis and then to your Firehose delivery streams: you should have one stream, and you'll see its name is the pre-lab Firehose delivery stream, which is the same one now being picked up here. The next thing we want to do is set up some templates, so bring your records per second down to one.
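If you'd rather confirm that delivery stream from code than by clicking through the console, here's a quick sketch with boto3 — the region is whatever you deployed the stack into, so swap it for your own:

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")  # assumption: the region you deployed to

# List the delivery streams the pre-lab stack created (names differ per stack)
streams = firehose.list_delivery_streams()["DeliveryStreamNames"]
print(streams)

# Describe the first one to confirm it's ACTIVE before sending data at it
if streams:
    detail = firehose.describe_delivery_stream(DeliveryStreamName=streams[0])
    print(detail["DeliveryStreamDescription"]["DeliveryStreamStatus"])
```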
If we go back to the AWS Immersion Day booklet and scroll down a little, that's where we clicked Deploy to AWS before, and I've taken you through everything that's happened since: this is us subscribing via the email, that was clicking on the Outputs URL, and now we're on the Kinesis Data Generator, which should look familiar, where we entered our username and password. At this stage we have to create the different templates. I do have these templates on my own GitHub if you want to copy and paste them from there; alternatively, it's easier to take them out of the booklet in case my copy changes. So follow along — it's as simple as this. Copy the Schema Discovery Payload name (and again, if you don't have access to the booklet, pause the video and type out what you see on screen), then take the JSON itself, copy and paste it in, and just make sure you don't have a trailing new line — hit backspace if you end up with one like me — then hit Test template, and it should look like this. For template 2, copy and paste "Click Payload" in as the title — that's all that is, a title — and then copy the data, making sure it's the click payload you're copying this time; again, pause your screen and make sure you have exactly what I have, hit Test template, and you'll see that we have data. Template 3 is a payload again: put "Impression Payload" in as the name this time, copy that JSON, paste it in, backspace to make sure it's one line, test the template and hit Close. So you should have three: Schema Discovery Payload, Click Payload and Impression Payload, and that's pretty much it for the configuration of the templates — if you jump back onto the booklet you'll see it shows how it should look. We've already verified the email. I'm not going to verify the SMS subscription — I'm not a big fan of text messages being sent out, just because I don't like giving my number over all the time — but if you want to check it out yourself, follow along; if not, we're pretty much ready to go. I will leave the Kinesis Data Generator tab open — please do the same, move it to the side and leave it there for when we need it. Okay guys, that's me logged into the AWS management console. As I said at the start of the video, I will assume no prior knowledge of AWS — it helps to have a bit of data engineering knowledge, but this is designed for complete beginners and we'll get more advanced as we go along. You will need to have completed the setup phase of the video before this in order to have access to everything we require for these labs. It will cost some money — a few dollars, nothing more — but it's well worth the investment. As I mentioned at the start of the video, I am using the data engineering immersion day PDF for the labs I've picked out to guide us through — links in the description — and it's easier if you have it open alongside so we can compare and contrast as we go along. The first thing we're going to do is one of the labs from streaming data analytics, and the one we're going to do is the real-time clickstream anomaly detection lab. As you can see in the booklet, we're going to use the Kinesis Data Generator we set up in the pre-lab to generate data for us, and it's going to hit the Firehose that's already up and running, which you saw in the pre-lab setup.
We're going to create the analytics application right here; the anomaly Lambda is already up and running for us, and we've already subscribed to the email in the setup phase — so make sure you have completed that pre-lab setup at the start of this video, or none of this will work. The S3 buckets also exist from that setup phase, so if you go to S3 you will see them. Again, reiterating for the fourth or fifth time in this video: it is so important to go through that setup phase, or none of this will work. Okay, I'm going to copy the application name from the guide because it's the simplest thing to do; I'm going to leave the guide open and you can follow along at your leisure and compare and contrast as we go. Assuming no prior knowledge: we're going to use the Kinesis Data Analytics service to start an application, so type in Kinesis, and you'll see that the Firehose delivery stream is still up and running from our pre-setup phase — don't touch that, we need it — and what we're going for this time is Data Analytics. Let's create the application: I copied and pasted the name from the document (you can just type it in if you want), give it a description — the simplest thing is to paste the name again — choose the SQL runtime, and leave everything else as default. So: name, description, SQL — which is Structured Query Language, and we're going to use it to build a real-time application that performs analysis on our data — and Create application. Fantastic: "successfully created application" — it's always good when you see that. Now we need some data. As I mentioned before, our stream was configured for us in the pre-lab phase, so click the button to choose a source. It's a bit annoying, but AWS sometimes opens this page scrolled partway down — no idea why, it just sometimes doesn't open correctly — so scroll up to the top. You want to use the Kinesis Firehose delivery stream option and pick the only one that should be there, which is the one we set up in the pre-lab. Leave the Lambda record pre-processing disabled, and choose the role it can consume with: it's the one named after the CloudFormation stack, the pre-lab one. Let's just run through that again so we know where we are: the source is Firehose, the Firehose delivery stream, Lambda pre-processing disabled, and the permissions role is the one named after the CloudFormation stack — if you used the defaults it's kinesis-prelab followed by CSEKinesisAnalyticsRole, so look for the one containing CSEKinesisAnalyticsRole. Now we need to deal with the schema. Don't click Discover schema — I repeat, do not click Discover schema yet. What we need to do first is jump onto the Kinesis Data Generator, which I hope you still have open from the previous lab. If you don't have it open and aren't logged in, go back to CloudFormation — I'm just going to open it in a new tab — go to Stacks, go to the Kinesis pre-lab stack, go to Outputs, and click on the URL for the Kinesis Data Generator; sign in, and your templates should still be there from the pre-lab. That's where I am: signed in, with the pre-lab templates still showing. The next thing we want to do is select the Schema Discovery Payload and start sending data now — make sure your region is us-east-1 and make sure it's hitting the right stream, though if you haven't touched it since we configured it in the pre-lab, that's what it should already be set to.
If not, watch that part of the video again, set it up, and then join us back here. You can see that we're now sending data to the Kinesis stream, and at this point we can hit Discover schema, because there's data on our stream. As you can see, we successfully discovered the schema; hit Save and continue, and that's our schema up and running. Go back onto the Kinesis Data Generator and stop sending data — we've discovered our schema, so we don't need to keep sending it. So what we've just done is configure a Data Analytics application, set it up to consume data from the Firehose stream that was created in the pre-lab, and sent data to that Firehose stream, which has now been discovered by our application — fantastic. And just to show you where we are in the diagram: we're about here. Scrolling down, we've completed all of this — we've done the source, I've taken you through all of that, and we've discovered the data after sending some records. Scrolling right down, what we now need to do is set up the Kinesis Data Analytics SQL code that will look for anomalies in our data. Sorry for all the scrolling here, but it's best that we get this right first time. Right at the bottom of the lab there's a SQL script in the appendix — as the document currently sits it's section 14, you can see it here, and you can click Open and it will download a SQL file for you that we can use — or alternatively there's a hyperlink in the document, or you can copy and paste the SQL from the bottom of the lab, which is what I'm going to do when the time comes. Back in the Kinesis Data Analytics application, go to the SQL editor — and yes, I want to start the application. The next thing you want to do is delete all the text that's in there: Ctrl+A and then backspace. Now, this is critical, so critical: I want to copy everything that's in this box — or alternatively you can get it from the file we just downloaded — and make sure you're not taking anything else with you; it is literally just the box, and you can use the Copy to clipboard button on the right-hand side. Back in the editor, paste it in (I'm just going to right-click and paste), and once that's in, hit backspace if needed and make sure everything looks good; then click off the box and click back into it — it's one of those weird AWS things where you might have to click out and back in — and then you can hit Save and run SQL. It'll take a few seconds to save and then it will start running; I'll pause the video here, because there's no point watching the screen for 60 seconds, and once it's saved and running we'll pick it back up. Okay, that's us up and running. The source should be SOURCE_SQL_STREAM_001, which is the default, and the real-time analytics tab will take a couple of seconds to populate; you can see there are a couple of different in-application streams that have been created by this code. Don't worry too much about the code itself — it's provided by AWS (scripts like this are typically built around the RANDOM_CUT_FOREST anomaly detection function), and all it's really doing is taking the data we're about to send and randomly flagging an anomaly in it for us. It's a bit contrived, but it works, and it's just creating different outputs and spitting them out. The next thing we need to do is take the Kinesis Data Generator and open two more tabs of it — one, two — and I'm just going to pull these down to sit beside the existing Kinesis Data Generator tab.
I want to open that link, so I just copy and paste it in and hit enter. We need to sign in — again, it's the username and password you used during the configuration stage in the pre-lab, so for me it's admin and then password1; as I say, it's easier to keep them simple so you know how to log in. Okay, so we now have three tabs open. In the first we want the Click Payload — keep it at one record per second and send; on this other screen, make sure you're in the correct region and use the Impression Payload this time — make sure this is at one as well, that's important — and send data. Okay, that's data being sent from both the click payload and the impression payload. What we're going to do now is slightly contrived: we're going to send more data than the application expects so that we create an anomaly. AWS suggests opening five or six tabs and sending the data one record at a time; I have not been able to get that to work, so this is not the best advice I'll give during this lab, because we're about to receive multiple anomaly records — don't let it run for more than about 20 seconds. We'll still get something like 30 emails out of this, but it's the best way to demonstrate that our app is actually working. So click Send data — as I said, we're about to send ourselves a load of emails, but it will show us that the anomaly detection is working — and then we'll run through what we've just covered. I'm going to stop sending now that it's hit 1,000 records — just stop sending. If we go back into the Kinesis Data Analytics application — ah, there's a bit of an issue with that stream; just click off it and back onto it if that happens — then hopefully once it repopulates we should see within a couple of seconds that an anomaly has appeared. Okay, it's still not coming through, so I'm going to go back onto the Kinesis Data Generator and send more data, this time letting it run for about 30 seconds. As I said, this will generate multiple emails and multiple anomalies, but we need to see it in action, so just bear with it — hopefully it will pick up an anomaly... there we go, got one. Stop sending data — just stop sending data right now. Back on the Kinesis Analytics side you can see that we're now really starting to pick them up: these anomalies are being picked up by the Lambda function, and if you go to your email you will start to get the clickstream anomaly notifications sent to you. They can take a few minutes to come through — these ones are from earlier — but you'll see the anomalies start arriving in your inbox. So, back on here, the anomaly has been created. Let's run through what we did, because there was quite a lot there and a lot to take in; I'm going to go back onto the lab document and we'll walk through it. To get this up and running, we used the Kinesis Data Generator to send data to the Firehose instance — so that's Kinesis Firehose receiving the data. We then used Kinesis Data Analytics to read that data and run the query — the SQL code given to us by AWS — which created these in-application streams, and those streams were used to pick up anomalies. Most importantly, the destination SQL stream was used to pick up anomalies via a Lambda function that AWS created for us.
AWS created that Lambda function during the pre-lab phase — you can go and find it in the Lambda console — and when a random anomaly sample appeared, it fired off an email to us, which I just showed you; that email basically tells us that something went wrong, or rather that there was an anomaly inside our application. So again, just to reiterate: the Kinesis Data Generator, into AWS Kinesis Firehose, into the Kinesis Data Analytics application running the SQL code provided to us by AWS, which sent records to a Lambda function when it picked up an anomaly, and then we got an email — simple as that. And that is the real-time clickstream anomaly detection Kinesis lab complete. Delete the application, as we're not going to use it again — I advise you to do this now so you don't forget later: click on it, hit Actions and then Stop (at the very least it will stop for you), and then Actions, Delete once it has stopped, and it will delete itself. And just a reminder: I've created the app called theQuestionBank.io — link in the description — where you can go and submit questions and take quizzes on AWS, completely free, so we can test our knowledge of the things we've learned. The idea is that it stays completely free, so feel free to go and sign up; I do have videos on the channel about how I created the web app, and the more people we get using it and adding questions, the bigger the resource we'll all have to test our AWS knowledge against, which we can use towards AWS certification. So again, check out the link in the description and get quizzing. Okay guys, now we're going to take a look at Database Migration Service, more commonly known as DMS — nobody really refers to it as Database Migration Service in full, we all just call it DMS. So what is DMS? The definition from Amazon itself: AWS Database Migration Service (DMS) helps you migrate databases to AWS quickly and securely; the source database remains fully operational during the migration, minimising downtime to applications that rely on the database; and it can migrate your data to and from most of the widely used commercial and open-source databases. We'll cover exactly what those databases are, but I have used DMS in the past to take an on-premises MySQL instance up into Aurora MySQL, to migrate an on-premises PostgreSQL instance up into Aurora PostgreSQL, and to move Microsoft SQL Server into Microsoft SQL Server on RDS as well. This is the architecture, and it's actually pretty simple: in DMS you have a source and a target, you create a source endpoint and a target endpoint, and you have a replication task that runs in the middle, hosted on a replication instance. This example from AWS, with the link at the bottom, is a bit more involved: you have source 1 through source n — as many sources as you want — and target 1 through target n — as many targets as you want; each source has a source endpoint, the data flows through task 1, task 2, up to task n, and those tasks write out through the target endpoints, so one task goes to target endpoint 1 and another replication task goes out to target endpoint n. Fundamentally, though, it's simple, so just to reiterate: you have a source, you have a target, you create a source endpoint, you create a target endpoint, and you have a replication task that moves the data for you, which sits inside a replication instance — which is just an EC2 instance under the hood.
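To make those moving parts concrete, here's a rough sketch of the same wiring through boto3, assuming a replication instance already exists and using made-up identifiers, hostnames and ARNs — in the lab we'll do all of this through the console wizard, so treat it as illustration only:

```python
import json
import boto3

dms = boto3.client("dms")

# Source endpoint: an Aurora PostgreSQL database (hypothetical connection details)
source = dms.create_endpoint(
    EndpointIdentifier="music-source",
    EndpointType="source",
    EngineName="aurora-postgresql",
    ServerName="music.cluster-xxxx.eu-west-3.rds.amazonaws.com",
    Port=5432,
    Username="postgres",
    Password="REPLACE_ME",
    DatabaseName="music",
)

# Target endpoint: DynamoDB, which needs a service role DMS can assume
target = dms.create_endpoint(
    EndpointIdentifier="music-target-dynamodb",
    EndpointType="target",
    EngineName="dynamodb",
    DynamoDbSettings={"ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/de-dms-dynamo-role"},
)

# Selection rule equivalent to the wizard's "schema %, table %, include"
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-everything",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="de-task-music",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:eu-west-3:123456789012:rep:EXAMPLE",
    MigrationType="full-load",  # the wizard's "migrate existing data" option
    TableMappings=json.dumps(table_mappings),
)
```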
In today's lab we're going to do exactly this: we're going to have a source, which is PostgreSQL; we're going to create a source endpoint; we're going to have a replication task that's just a one-for-one lift and shift; and we're going to place the data into DynamoDB, to show something completely different on the other end. We'll also be creating a replication instance — if it's the first time you've done this, the one I pick should be covered by the free tier; if not, it will cost a few cents at most, provided you spin it back down straight after we're done. If you're ever going for the AWS certifications, or you're sitting a data engineering interview, there are a couple of things you should really know, even just for your own knowledge. The replication instance is an instance managed on EC2, and it hosts the tasks — it's the compute behind them. Endpoints are what the DMS service creates in order to read from your source or write to your target, fully managed by DMS — I'll show you the wizard and how to set them up; it's pretty simple. The replication task is just the task DMS uses to replicate the data from your source to your target, and those tasks run inside the replication instance. Then there's schema and code migration: as you can see, DMS does not perform schema or code conversion — it's just a one-for-one lift and shift, with a bit of renaming if you need it. There is a Schema Conversion Tool if you need help with that, and there are a couple of services and third-party libraries that can convert code for you, but really think of it as moving a table from one database to the same-looking table on the other end, whether that's the same database engine or something slightly different — which is what we're going to do today — a one-for-one schema replication. Then, most importantly, the sources for DMS: there are quite a lot — Oracle, SQL Server, MySQL, MariaDB, PostgreSQL, SAP, Microsoft Azure SQL Database, MongoDB, plus RDS instance databases and databases on S3. Today we'll be spinning up a PostgreSQL RDS instance and then migrating it to DynamoDB. There are more sources than this, obviously, and there's an introductory sources link down the bottom, so I'd advise you to go check it out — it covers a lot. With that being said, there's no point just talking about DMS: at heart it's a pretty simple service that does some very complicated things, which is great. So let's jump onto the console, where we're going to spin up a PostgreSQL RDS instance, get some data into it, create a replication instance, create target and source endpoints, create a replication task, and lift the data from PostgreSQL into DynamoDB — join me there; I'm looking forward to building this example in the lab. Okay guys, that's me logged into the AWS console. The first thing I want to do before setting anything up is change region to somewhere I don't usually work out of, and for the purpose of this I'm going to choose Paris. Now, there's a bit of setup work that needs done so we have some data to use in the DMS lab we're about to do. As I explained earlier in the video, the Data Engineering Immersion Day does have its own data for this, but when I tried to set it up it was a little bit hit or miss on whether it actually worked, and I don't want to send you guys away to a load of AWS links, like we did for Kinesis, and have it not work for you. So at least this way we'll be using the code that I've created, the data that I've created, and an RDS instance that we create together.
It will cost a couple of dollars — hopefully less than one if you take it down straight after this lab — but it's well worth it to get the experience of using DMS, and it means we can work through it together. So the first thing to do is go to RDS, because as I said we need some data and we're going to create it ourselves. Click into RDS once it loads — RDS is the Relational Database Service, for those who are new to AWS; it's where you find the databases — and we're going to use a managed database so we don't have to worry about configuring it or setting it up in detail, we just step through a wizard. Click the Create database button here, or, if your screen looks different, go to Databases down the left-hand side and then click Create database. There are a couple of really important choices here, so just follow along. We want Standard create; we want an Amazon Aurora database, because it works out quite cheap for us; click PostgreSQL compatibility, so we're using Postgres; we'll use a provisioned instance; and leave the engine version on the default, which is 11.9 — that's fine. Click Dev/Test, because we don't want a big instance and we want to keep this cheap. Under the DB cluster identifier, change it to music — m, u, s, i, c. For the database username, leave it as postgres, and now you need a master password: change it to a password you're going to remember — it's totally up to you what it is, but it's really important you get it right, because we're going to connect using it. We then want a Burstable class, because again we don't want to spend too much money, and we'll take the smallest instance there, which is the db.t3.medium — this should only cost us a few dollars at most and it's well worth doing. We don't need a replica; leave it inside the default VPC and the default subnet group; and click Yes for public access — this is really important: yes, make it publicly accessible. Then under VPC security group click Create new VPC security group — make sure you create a new one — and give it a name; I'm going to keep this simple and call it de-dms-sg, as in data-engineering hyphen dms hyphen sg. Scroll down and double-check everything: publicly accessible yes, new VPC security group named de-dms-sg (the data engineering DMS security group), no preference on the availability zone, and click Create database. That's it off and running — both instances will have to go green before the cluster is available, and that can take up to 15 minutes, so just keep hitting refresh until then. The one thing I will add is that in the description there's a link to the client we'll need to actually connect into the database and give it some data, which is pgAdmin. It's completely free: go to the link, click Download, and you'll see different options depending on what you're using — I'm on Mac so I'll use the Mac build, if you're on Windows use Windows, and if you're on Linux use one of the APT or RPM packages. So, while the database is spinning up, click on the operating system appropriate to you, pick the version you want to use, and it will download the installer for you.
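For reference, while you wait on the cluster, the same Aurora PostgreSQL setup could be scripted instead of clicked through — a minimal sketch, with the password, security group ID and identifiers as placeholders you'd swap for your own (the lab itself sticks with the console wizard):

```python
import boto3

rds = boto3.client("rds", region_name="eu-west-3")  # Paris, to match the video

# The cluster holds the database itself...
rds.create_db_cluster(
    DBClusterIdentifier="music",
    Engine="aurora-postgresql",
    EngineVersion="11.9",
    MasterUsername="postgres",
    MasterUserPassword="REPLACE_ME",
    VpcSecurityGroupIds=["sg-0123456789abcdef0"],  # the de-dms-sg group created for the lab
)

# ...and it needs at least one instance to serve connections
rds.create_db_instance(
    DBInstanceIdentifier="music-instance-1",
    DBClusterIdentifier="music",
    DBInstanceClass="db.t3.medium",
    Engine="aurora-postgresql",
    PubliclyAccessible=True,  # required so pgAdmin on your laptop can reach it
)
```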
You can see here that's the one I need, and I can download it there. Once it's downloaded, click through the installer preferences and that should be you set up with the client. Once the database is up and running I'll come back online, show you what it looks like, and then we'll go into pgAdmin together and I'll show you how to connect into the database and get some data ready. As you can see, guys, the status has gone to Available on both of those — just keep hitting refresh until that point, and also watch the CPU: it will spike and then drop once it's ready. The next port of call is pgAdmin, so I'm going to open it via my search bar — just open pgAdmin however you have it, since you downloaded it while the database was creating. It takes a few seconds to start, and then a window will appear. It will ask you for a master password the first time you sign in — that's just asking you to set the password you want to use for pgAdmin itself. The next thing we need to do is add a connection: on the left-hand side go to Servers, then Create, Server, and you'll end up with a box that looks a little bit like this. The two important things we need are the name, which I'm going to call music, and then, on the Connection tab, the host address. Back in the RDS console, the endpoint you want is this one here — copy and paste it in. The default port is correct, the maintenance database is correct and the username is correct, but there's sometimes a little bug where you need to re-enter it, so I always re-type the username — even though it looks correct, sometimes it doesn't work otherwise. The password is the one you gave the RDS instance when we created it — as I said, it's up to you what you use, but don't forget it (and I almost did) — then click Save, and as you can see it has now appeared on the left-hand side. If you expand it, we have the database we started with and an admin database as well; if we open up the postgres one, you can see there's a postgres database, some language objects, and that's pretty much everything that's there currently. The next thing we want to do is right-click and hit Create, Database; we want to call this database music, with postgres as the owner, and Save, and you can see we now have a music database. Double-click it to open it, and you can see there are different things going on in here. Go to Schemas, then public, right-click public and go to Query Tool — so, right-click public and go to Query Tool — and you can see it's now open. Back on my GitHub I have the tables for us — the DMS create-tables script — and you want to copy and paste in that entire file, which I have in the repo down below; you can download it, or a simple copy and paste will suffice.
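If you'd rather load the sample data from code than through pgAdmin, a sketch along these lines would do the same job — note that the column layout of the artist table here is a hypothetical stand-in, so for the lab itself use the actual create-tables script from the repo:

```python
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host="music.cluster-xxxx.eu-west-3.rds.amazonaws.com",  # your cluster endpoint
    dbname="music",
    user="postgres",
    password="REPLACE_ME",
)

with conn, conn.cursor() as cur:
    # Hypothetical shape of the artist table -- the real DDL lives in the repo's script
    cur.execute("""
        CREATE TABLE IF NOT EXISTS artist (
            artist_id SERIAL PRIMARY KEY,
            name      TEXT NOT NULL
        )
    """)
    cur.executemany(
        "INSERT INTO artist (name) VALUES (%s)",
        [("Artist One",), ("Artist Two",), ("Artist Three",)],
    )

conn.close()
```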
If we copy and paste it in and run the script with the play button — all I'm doing here is creating some tables and inserting some data — then back over on the left-hand side, in our little tree down here, we just want to right-click public and hit Refresh. If we go down to Tables you'll see we have a table called artist; right-click it, go to View/Edit Data and then First 100 Rows, and you will see that we have three rows of data. All this is going to be used for in DMS is our source data: we're going to take this data and place it in DynamoDB — that's what we're doing, nothing more complicated than that. What I've done here is just set up some data, so follow along, get to this stage, and then we don't have to worry about PostgreSQL again; it was only to get the data in place. In the AWS Immersion Day you don't have to set up the database yourself, so this is a little bit of a bonus extra: you get to see RDS in action, and as you can see there are some really core steps — making sure it's public, making sure you set up a new security group, making sure that security group is reachable from the internet, and making sure pgAdmin is correctly configured to get there. But once we're there, we have the data and we're ready to go, so join me back in the AWS console and we'll start from the console page and get on with using DMS. Okay, back on the AWS management console, type in DMS — Database Migration Service, the reason we're here. Once you're on it, click Create replication instance, or go to Replication instances down the left-hand side and then click Create replication instance. Give it a name — I'm going to call this de-dms-instance, for data engineering DMS instance — and we don't need to do anything else apart from a wee description, so I'm just going to call it the data engineering course instance. Let's keep it on the t3.medium: you can go smaller, but it will only cost a few cents to keep this up for the duration of the lab and it will work quicker, and I'm always a bit of a fan of making things go quicker in labs. Leave the storage as the default — we only need 50 GB — put it in the only VPC that's there, choose the Dev or test option so we don't need Multi-AZ failover, and publicly accessible is exactly what we need; then hit Create. So: no failover, no extra replication. This will take a few minutes to spin up, so I'm going to pause the video here and wait for the next stage. You can keep going with the video, but you won't be able to test your endpoints once we set them up until the instance is ready, so it's completely up to you: carry on ahead, or pause like me and wait until it's up and running so everything works concurrently as we do it. Okay, and as you can see, after hitting refresh a few times the instance is now available, which is what we needed. The next thing we need to do is create two endpoints: one will be the source, one will be the target. If we look at Endpoints we have nothing here, so click Create endpoint. As I said, we'll start with the source: select an RDS DB instance and choose the music instance — great, it's already pre-populated a lot of things for us. We're going to provide the access information manually: again, it's the password you set when we spun the cluster up and connected through pgAdmin, and the database name is music, because that's what we created inside pgAdmin — as you can see in pgAdmin, we have the music database, and that's where we want it to go.
Under Endpoint settings we just use the wizard, leave everything else as default as it currently sits, and create the endpoint — that's our source endpoint created. Then click the little checkbox beside it, go to Actions, then Test connection, and run the test — so we're using the replication instance we created, and we want to run the test against it. This will take a few minutes the first time, because it has to spin a few things up in the background, so just bear with it and hopefully it will go green and complete successfully; I'll pause the video here and pick it up once it's ready. Okay, that wasn't long at all — about 20 seconds — and you can see it was successful, which means our DMS replication instance can reach our database; that's really important. We now need to set up our destination, so go to Endpoints, Create endpoint, and this time it's a target endpoint — this is where we want to put the data. Make sure you give it an identifier again: I'm going to call this de-target-dynamodb, for data engineering target DynamoDB, because we're going to use DynamoDB as our target. For the target engine, type in "dynamo" and you should get DynamoDB — brilliant. Okay, this bit is really important: it needs a service role to actually talk to DynamoDB, and we haven't created it yet. At the top, type in IAM, go to the IAM service and open it in a new tab — I just right-clicked there and opened a new tab, and it loads as IAM. We want to go down to Roles and then create a new role, which is on the right-hand side: click AWS service and then click on DMS — you can see I actually had to type it into the search box because I could not see the wood for the trees — select DMS and then click Next: Permissions. I'm going to cheat here and give this full AdministratorAccess. This is bad practice, but because we're all following along at home it makes it easier, as we don't have to worry about permissions. If you do this, I advise you to delete the role straight away afterwards, but for now let's just carry on with AdministratorAccess and not worry about the granularity of the access we're giving DMS. Click Next, review, and give the role a name: I'm going to call this de-dms-dynamo-role — and sometimes I would add "delete" on the end of the name as a reminder to delete it; just make sure you go back and remove it. Then type de-dms-dynamo-role into the search, click on it, and you'll see an ARN appear at the top; copy it with the little copy symbol. Back in the Database Migration Service console, paste that in, use the wizard under endpoint settings, leave everything else as default, and click Create endpoint — that's our target now active. Select the target, go to Actions, Test connection, and run the test again; this will take a few seconds, and all it's doing is checking that the endpoint can reach DynamoDB via the role we just created — there's nothing more to it. I'll pause the video here, and hopefully once it goes green and completes we can move on to the next section, which is actually creating the replication task to move the data. Okay, that's gone green and you can see it was successful — literally all it did was check whether the endpoint can reach DynamoDB. So we now have the replication instance, and we have our source endpoint, which points at the RDS PostgreSQL instance we set up with that music data.
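Since the IAM step is the one most people trip over, here's roughly what that role boils down to if you'd rather script it — with AmazonDynamoDBFullAccess swapped in as an assumed, narrower alternative to the AdministratorAccess shortcut used above:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting the DMS service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "dms.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="de-dms-dynamo-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Assumption: DynamoDB full access should be enough for DMS to create and load the
# target tables, and it is at least narrower than AdministratorAccess.
# Delete the role when you're finished with the lab.
iam.attach_role_policy(
    RoleName="de-dms-dynamo-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess",
)

print(role["Role"]["Arn"])  # paste this ARN into the DMS target endpoint
```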
right, so we have the replication instance, we have our source endpoint, which is going to be the rds postgres instance we set up with that music data, and then we have the dynamodb target on the other side. now we need to move some data and create some tables, and to do that we use tasks, or database migration tasks to give them their full title. we want to create a task. give it a name, which is de-task-music, then give it a replication instance, which is the only one we have; our source is going to be music, our target is going to be dynamo, and for the migration type we want to migrate existing data. we're not going to bother with ongoing changes because this is just the starter lab, but what it's going to do is take those three rows we had here and populate them in dynamodb. for the selection i'm actually going to include everything, so it will bring across a couple of extra system tables that we can't see here, but the one we're paying attention to is artist. once we create this, dynamodb should end up with a table called artist, and those three rows should appear in it, because dms will have lifted them from the rds instance in front of us and moved them to dynamodb for us without us having to create anything. we're going to use the wizard, we're going to drop tables on target, limited lob mode is fine, and under table mappings you have to add at least one selection rule: choose enter a schema, leave everything else as default (which is % for everything), set the rule action to include, tick start automatically on create, and create the task. let's just double check that, because i did it really fast: the task identifier is de-task-music (i'd spelt it wrong the first time), leave the next field blank, it's the only replication instance, the source database is the one we created earlier, the target is the dynamodb endpoint, and it's going to create the table for us as the data moves. we want to migrate all the existing data, which is those three rows in music, we're using the wizard, we're dropping tables on target, we don't really care about the lob settings so leave them as default, we've added one selection rule with enter schema, % for everything, and include, and we're not bothering with any transformation rules because we just want the data as it is. with start automatically on create it will start the task as soon as it's created, so let's click create task. once the task is created it will start running, so i'm going to pause the video here, then i'll open the screen again once it starts running to show you what that looks like, and hopefully at that point we'll have the data in dynamodb for ourselves. okay, as you can see it's finished creating the task and it's actually started the replication now, so i'm going to pause the video here until it's complete. there it is running, as you can see, so that's great, and hopefully this will go to load complete once it's done. okay, as you can see it says load complete, and that happened in about one minute.
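for anyone scripting this rather than clicking through the wizard, the task we just built maps onto boto3 roughly as below. it's a hedged sketch: every arn is a placeholder, and the table-mapping json is the "include everything" rule we entered as % / % in the console.

```python
# hedged boto3 sketch of the migration task; all arns are placeholders
import json
import boto3

dms = boto3.client("dms")

# same as the console selection rule: schema %, table %, action include
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-everything",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="de-task-music",
    SourceEndpointArn="arn:aws:dms:REGION:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:REGION:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:REGION:123456789012:rep:INSTANCE",
    MigrationType="full-load",  # migrate existing data, no ongoing changes
    TableMappings=json.dumps(table_mappings),
)

# the console's "start automatically on create", done by hand
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```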
back at the top, what we should see now is actual data in dynamodb. so if we go to dynamodb and we look at tables, you can see there's an exception table, which is where anything would have been loaded if something went wrong, but there's our artist table. if you click on the artist table itself and go to view items on the right hand side, you'll see that our three artists have arrived: mumford and sons, nirvana and the rolling stones, which matches what we have over here, rolling stones, nirvana, mumford and sons. you can tell by the keys that they're the same rows, they're just in a different order here, and in exceptions we have nothing. so that's the introductory guide to dms: we've taken the data from postgresql, which sits over here, we've used the dms replication instance in the middle, as per the lecture at the start, and we've moved that data using the replication instance, tasks and endpoints into dynamodb, where it created an artist table to match the artist table we have in postgres, and we didn't have to do any of the hard lift and shift ourselves. as you can see in this beginner's lesson for the data engineering immersion day, or the data engineering course that i've designed, the harder part is really getting the data ready in the rds instance rather than the actual move, and that's kind of the point of dms: if you have your data ready to go on one end, it's quite simple, as you can see, to configure the endpoints, configure the instance, and then you're up and running and ready to go.
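if you'd like to confirm the migrated rows from code rather than the console, here's a minimal hedged sketch with boto3. the table name artist is assumed from what the console showed after the full load.

```python
# quick programmatic check that the three rows landed in dynamodb
# (table name "artist" assumed from what dms created)
import boto3

table = boto3.resource("dynamodb").Table("artist")
items = table.scan()["Items"]

print(len(items), "items")  # expecting the three artists
for item in items:
    print(item)
```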
then the last thing to remember is, obviously, the link in the description to the question bank dot io, the completely free website that i run where we can do quizzes together if you're learning for your aws certifications, and that includes quizzes on dms. users can also add questions to the question bank, which we then peer review together: you submit a question on the left hand side here with add a question, go through the different drop downs, and your question appears in peer review; once there are questions to be peer reviewed (i don't have any right now in my section) we review them together and they get added into the bank. it's completely free, so we all have a learning resource we can do quizzes from and improve our aws knowledge. that's the end of the aws dms part; let's move on to aws glue, and please feel free to sign up to the question bank if you have time and start adding questions and taking quizzes. okay guys, this is the theory on aws glue before we jump into the lab. we're going to cover some of the core concepts so you're aware of them, and then you'll actually see them in action when we get onto the lab. so, as always, what is aws glue? it's a managed etl service from aws, which means aws handles the extract, transform and load: it's on-demand etl where we go in and write the code and aws performs the compute operations for us as a managed service. it works with the spark and python runtimes, and jobs can be written in either scala or pyspark; i personally use pyspark, but that's a totally personal choice, and it can also run plain python on its python engine. it's a collection of components, so aws glue is actually made up of several different things, and we're going to touch on them all in the lab, so even if they don't quite come home to you now, they will in the lab. as i said, you will need to know all this for the certification process, and these are core concepts that will come up in an aws glue conversation in a data engineering job interview, so it's really important. glue consists of a central metadata repository known as the aws glue data catalog. this is where we create and store things such as databases and tables, tables in turn have columns on them, and it's here that our etl code goes to look up the things we want to work with. so, for example, if i have a customer table inside my customers database, when i'm writing the etl code in glue i will say: go to the glue data catalog, go to the customers database, and get the customer table. all the behind-the-scenes stuff of where that data is actually sitting, which could be s3 or could be a database, what it's structured like, what it looks like, what format it's in, is all handled by the glue data catalog, and in my code i am literally just specifying the database and table, and then the columns i want to select. then it's serverless, which means there are no servers to manage. any on-premise etl tools you may have used in the past, like ssis or one of the informatica offerings, would sit on a server; with glue we don't handle the server. it's on demand: when we run a glue job, which is our etl code, it spins up a container in the background that we know nothing about and runs the etl in that container, so we're only paying for the compute we use at runtime while the job is running, and when the etl job is not running we're not paying for it. that's a massive advantage, because it means our other servers or databases don't need the etl overhead on them; they only need enough to carry out their application function, because our etl code is serverless and handled by aws in terms of infrastructure. so here's a diagram with a lot on it, i won't deny; links are down the bottom as always to go check it out, and there are a couple of fundamental things that we're going to do today. i will show you another diagram when we get into the lab of exactly what we're building, but this is the aws one, so it's good to see it. down the bottom you basically have the etl process: you have a data source, which is where our data sits, you've got a target over here, which is where the data is going, and we have the transform script that we write in the middle to say take my data from the source and move it to the target. as i already said, this is done via a job, and that job is scheduled, or not scheduled, by aws glue; you can also use third-party schedulers, which is very common for orchestration, but aws glue does have its own built-in scheduler. we have the glue data catalog, which contains the metadata information: what our data source looks like, what our data target looks like, and what those columns are, for example, inside this component here. we can populate the data catalog using a thing called a crawler: we simply set up a crawler, it runs over the data and infers things like how many columns are in the data, what the column names are, what the column types are, what the data format is and where the data actually resides, and it does all that automatically and populates the glue data catalog. so by the time we're writing the etl script, we've automatically generated all the source information about the data using the crawler, and we can just use the catalog in the transformation script to do the heavy lifting for us. a couple of other important things to know at this stage: the data catalog consists of databases and tables, jobs are just a way of wrapping up scripts, and, as i already said, you do have a scheduler but you don't always have to use it.
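to make that "i just specify database and table" idea concrete, here's a minimal sketch of what it looks like inside a glue pyspark script. it uses the hypothetical customers database and customer table from the example above, plus a couple of made-up column names, so treat every name here as a placeholder rather than something we've actually created.

```python
# minimal sketch of reading from the glue data catalog inside a glue pyspark job
# ("customers" / "customer" and the column names are hypothetical placeholders)
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# no paths, formats or credentials here: the catalog already knows where the
# data lives, what format it's in, and what the columns are
customer_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="customers",
    table_name="customer",
)

# pick the columns you care about and have a look
customer_dyf.select_fields(["customer_id", "name"]).toDF().show()
```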
so here are the fundamental definitions. these will come up on an exam, so it is important to know them, and again, if you're sitting a data engineering interview and it's your first time in aws, these are core concepts you will need to know. the aws glue data catalog is a persistent metastore in aws glue: it contains table definitions, job definitions and other information to manage your glue environment, so it's just a big metadata repository of tables, locations, column names, jobs and crawlers. easy. we will not be touching classifiers in this example, because the glue crawler will infer the formats for us, but you do have the ability to set up your own classifiers, which let you classify data, whether it's csv, json, avro or xml; it might come up on an exam, so it's good to know, but it's quite an advanced aws glue feature, and it's when you're dealing with your own custom data formats that classifiers really come into their own. a crawler, as i said, is a program that connects to a data store, and we will be setting one up in the lab: it runs over the data and infers things about it, like its schema, its size, its number of records and its data types, and it populates the aws glue data catalog for us, so we can just get on with using the metadata without having to write any of it ourselves. a data store, as it says, is just the repository for the data: today we'll be using s3, but it could be a relational database, or even something proprietary, provided you have a jdbc driver for it; commonly s3 buckets and databases are our data stores. i have done a slide on the glue data catalog on its own. i won't read everything out here, but i think it's important, again, if you're sitting a certification or going for a job, that you know this. it's a persistent metastore, it has audit and governance capabilities, because you can use tags to tell you things about the data, and that's important. you get one catalog per region, which is highly important: it means that if you're running three different environments, say dev, qa and prod, you need three separate aws accounts if you want to stay in the one region the whole way up, and that's something that can be a bit tricky for organisations to get going, but it's well worth having that practice. and then, right here, it consists of a hierarchy of databases and tables, and tables are the metadata that represents your data, so just bear that in mind. we'll be doing this today: we'll create an input and an output database, each will have a table, and that table will be the metadata about data that lies on s3 underneath it, so we're going to see all of this in action.
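as a small peek ahead at what a crawler actually writes into the catalog, here's a hedged boto3 sketch that reads a table definition back out. the input database and customers_csv table names match the lab we're about to build, so it only makes sense once the crawler has run.

```python
# hedged boto3 sketch: reading back what a crawler wrote into the data catalog
# (database "input" / table "customers_csv" come from the lab later in this course)
import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="input", Name="customers_csv")["Table"]

print("location:", table["StorageDescriptor"]["Location"])        # where the data really sits
print("format  :", table["StorageDescriptor"].get("InputFormat"))  # inferred by the crawler
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])                                 # inferred columns and types
```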
with that being said, that's the fundamentals of aws glue and the aws glue data catalog, so let's jump onto the console and get hands-on with a lab, because you'll see all of this in action. i'd advise you to keep the slides open on one side as well, so when i point things out you can see what the definitions are as we go through. join me in the console, where we're going to create our first glue etl job using the aws glue data catalog and the data in s3 that i've provided, which comes from that github link, and then we'll use athena as a bit of a bonus, with its presto engine, to look at that data using sql. okay, before we get started with this in-depth setup and then into aws glue, i have put a little diagram up on the github of what we're actually about to create here, so if we just scroll down, and i'm going to zoom out just a tiny bit to try and get the full diagram on screen. perfect. we're going to create a new s3 bucket, as always, and just to keep things simple we're going to divide it into folders; we're actually going to create three, but two will be for glue, an input and an output. into input we're going to upload a csv file of data that i've provided, and then using aws glue we're going to convert that csv into parquet in the output folder. to do this we will register the csv in the glue data catalog as a table, and we'll register the output as a table as well, in two different databases in the glue data catalog, and then we'll use aws athena to look at the data. so i'm going to set up the s3 bucket, set up the folders we need, upload the data, and then we'll get it registered in the glue data catalog. back on the console, go to the top and type in s3, left click and open it in a new tab just to keep things clean. once it's loaded, click create bucket. bucket names must be unique within aws, and i'm working in the paris region, so i'm going to call this de-johnny-chivers-demo, de for data engineering; i always use my name because it's a fairly unique name and usually turns up a unique bucket name. you need your bucket name to be unique, but you can call it anything you want. leave everything else as default and create the bucket. once created, click into the bucket, and we're at the first layer of our s3 bucket, which is going to act as the data repository for us during the aws glue lab. the first thing we need to do is create a couple of folders: the first one we're going to call input, create folder; the next one we're going to call output, create folder; and the last one we're going to call athena-results, because we need a place to store our athena results when we query the data. so just three folders, athena-results, input and output, simple as that, all created with that create folder button. back onto github, back in the aws data engineering repo, there's the customer csv right there; this is the file we're going to upload. again, as i said in the first lab with kinesis, the easiest thing to do is download the zip, and once downloaded you can open that zip file and have all the data ready. so, with the zip downloaded, we go into input, and what we want to do here is create another folder, to keep things simple, and we're going to call it customers-csv. this is going to act as our table, it's going to be called customers-csv, and you guessed it, this is where the csv file is going to go. so we go upload, add files, and inside the data engineering folder that i have downloaded many times we have customer.csv, and you want to upload it. let me just repeat those steps, because that was a bit fast on my part: we're in the input folder, we created a customers-csv folder there, so input, then customers-csv, we go upload, then add files, then go into the folder that contains the data engineering code we downloaded, pick customer.csv, hit open and hit upload. if you want a bonus check, click into the file, go query with s3 select, choose csv, comma delimited, no compression, and run the sql query; you can see that we have some lines of data. that's just a bonus, so don't worry about it if you don't follow it, it's nothing important.
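if you'd rather script that bucket setup, a hedged boto3 sketch of the same steps looks roughly like this. the bucket name and region are placeholders, since your bucket name has to be globally unique.

```python
# hedged boto3 sketch of the bucket, folders and upload done above in the console
# (bucket name and region are placeholders)
import boto3

bucket = "de-your-name-demo"
s3 = boto3.client("s3", region_name="eu-west-3")

s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "eu-west-3"},  # omit in us-east-1
)

# the "folders" are really just key prefixes in s3
for prefix in ["input/customers-csv/", "output/", "athena-results/"]:
    s3.put_object(Bucket=bucket, Key=prefix)

# upload the csv from the downloaded github zip
s3.upload_file("customer.csv", bucket, "input/customers-csv/customer.csv")
```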
back on s3, the next thing we need to do is get this customer csv, which now sits in customers-csv, into the data catalog, and to do this we use aws glue and a thing called crawlers. we're going to create a crawler, get it to crawl this input directory, and hopefully come up with a customers_csv table for us. a couple of things we want to do straight off the bat: go into databases, and as you can see everything is blank here, because it's the first time i've used the catalog in this region. we need to create two databases, so we go databases, add database, and the first one is input, which represents the input folder, and the second one is output, and that's just so the databases and tables structurally mirror the split in that s3 bucket. if we look back at the diagram quickly: i've essentially created a folder called input and a folder called output, and i've put the csv file in input. in the glue data catalog i've created an input database that has nothing in it and an output database that has nothing in it, but what i'm going to do is register the input data as a table in the input database, and register the output data as a table in the output database when the time comes, using crawlers in the middle to do this for me, so there's no manual part on my end. now that we have the databases set up, the next thing is to create a crawler, which is going to crawl over the data in our s3 location and infer a table from it. so, crawlers on the left hand side, add crawler, and for the crawler name i'm going to call this input-crawler-customers, and we select next. data stores, crawl all folders, s3, then on the include path hit the little folder icon on the right hand side, go down to the bucket we created, go to input, select the customers-csv folder, leave everything else as default and hit next. add another data store? no. for the iam role we're going to create our own, so type iam at the top (which i already have open from testing this earlier) and open it in a new tab. in that new tab we go to roles, create a role, choose aws service, and choose glue, then next: permissions. again, bad practice like last time, but i'm going to give it administrator access, and that's just so we can carry out this lesson without iam permission issues; please make sure you delete the role once we're done. i'm going to call this de-glue-role, de for data engineering and then glue role; call it what you want, but just make sure you delete it at the end, and create the role. let's double check it was created successfully: search for de, and there's the de-glue-role. jump back onto the aws console for glue, choose an existing iam role and click the refresh button (really important you click refresh so it appears), and if we go down to de you'll see the de-glue-role is there, so select it and click next. run on demand is what we want, the database is input, and we don't want to add any table prefixes, since we're putting it in the input database. then we go next and review all the steps, which should be correct: that's the name, that's the bucket and folder we want to crawl, that's the role we created, and we're running on demand, and we click finish.
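for reference, the databases and crawler we just clicked through look roughly like this in boto3. it's a hedged sketch: the bucket path and role name are placeholders matching this lab.

```python
# hedged boto3 sketch of the catalog databases and the input crawler
# (bucket name and role name are placeholders)
import boto3

glue = boto3.client("glue")

# the two catalog databases that mirror the input/output split in the bucket
for name in ["input", "output"]:
    glue.create_database(DatabaseInput={"Name": name})

glue.create_crawler(
    Name="input-crawler-customers",
    Role="de-glue-role",        # the role we just created with glue permissions
    DatabaseName="input",       # tables land in the input database, no prefix
    Targets={"S3Targets": [{"Path": "s3://de-your-name-demo/input/customers-csv/"}]},
)

glue.start_crawler(Name="input-crawler-customers")  # the same as "run on demand"
```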
then we hit the check box beside the crawler and hit run crawler. this will take about a minute, two minutes tops, and hopefully at the end of it we'll see that one table has been added, so i'm going to pause the video here for a minute and we can pick it up once it's done. okay, as you can see that lasted 48 seconds and we have one table added; the table is now in the glue data catalog. let's go in the long way so we know how to do it: we go databases, the input database, tables in input, and you can see that customers_csv is now a table, with all the different columns that are in that original file. if you want to take a quick look at this we can go to athena: type athena, right click and open it in a new tab, and we want to get started, because it's the first time we've been into athena. okay, you need to set up a query results location in athena, and you can see the prompt right here in front of me, so i'm going to click that button and it loads up. we then want to select where the results should go: into the de bucket that we set up at the start and the athena-results folder that i specified as the third folder (the one that's not on the diagram), and click save. as you can see, by default it's picked the awsdatacatalog, and it's got our three databases there, one we didn't create but obviously the two that we did; pick input and there's the customers_csv table. the simplest thing to do is click the menu beside the table and go preview table, and you can see it runs a select on the table with a limit of 10, and then you can see all the different columns that we now have inside our database. so the data is still in csv format; all we've done is register the metadata with the glue data catalog, and now that lets us query it with athena. it's still csv, still in the bucket where we left it; we've just placed a table, a schema, over the top of it that lets us query the data through athena. the next step is to use aws glue to transform this to parquet, which is a better format for querying and for data lakes in general, and a data lake is traditionally what we'd build with an s3 bucket. to do this we're going to use a glue job, take the data from the input side of the s3 bucket, move it to the output side, and transform it to parquet on the way. so join me back on the glue console, and this time what we want is glue etl. i'm going to do this via jobs; you can do it through glue studio, it's completely up to you, but i'm going to do it via jobs. we want to add a new job, and we're going to call this customers-csv-to-parquet. we need an iam role that has the permissions to carry this out, which is the de-glue-role, we're going to use spark, and we're going to leave everything else as default. it wants us to specify a path where it stores the generated code; i'm going to leave this all as default, and it'll go off and create those buckets if they don't exist, so just let it do its thing, it's the simplest option. everything else can stay as default, then click next. okay, it's going to ask us for an input source, which is the customers_csv table, and we click next. we want change schema, and we click next. we want to create tables in our target: the data store is amazon s3, the format we want is parquet, and now it wants to know where we're going to store that data. so quickly back into the s3 bucket, back up to the highest level, and we go to output. we want to create a new folder here and call it customers-parquet, because this is where we're going to store the parquet data, and create that folder.
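before we go back to the job wizard, a quick aside on that athena preview from a minute ago: the same query can be run programmatically. this is a hedged boto3 sketch, with the bucket name a placeholder and the results landing in the athena-results folder we just configured.

```python
# hedged boto3 sketch of the athena preview query against the catalog table
# (bucket name is a placeholder)
import boto3

athena = boto3.client("athena")

run = athena.start_query_execution(
    QueryString='SELECT * FROM "input"."customers_csv" LIMIT 10;',
    QueryExecutionContext={"Database": "input"},
    ResultConfiguration={"OutputLocation": "s3://de-your-name-demo/athena-results/"},
)

# athena is asynchronous: poll get_query_execution until it finishes,
# then fetch the rows with get_query_results
print(run["QueryExecutionId"])
```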
so now you should have your bucket, your output folder, and then a folder where we're going to store the output data, called customers-parquet, with customers-csv on the other side, and that's just so we logically know what's going on. back in the job wizard, go to the target path, go down to the de bucket again, go to output this time, highlight that customers-parquet folder, hit select, and then go next. you can see that it's mapped everything one for one; nothing we want to change, except that we're changing the file format on the way through. we then want save job and edit script. there's nothing else we need to do here, but if you've got a bit of time, read through it (i've also put a rough sketch of what this generated script looks like just after this section). it creates the spark context, which, if you're familiar with spark and hadoop, is just the context that lets us use the parallel processing; it uses glue's own construct called a dynamic frame to lift our data into memory as a table, think of it like a table; it then applies the mapping transformation, which maps customer id to customer id, name to name and so on. we didn't change any names, but it's important to know that's where it happens: it takes each source column and maps it to a target column. it then maps the result into a write dynamic frame, and, you guessed it, that writes out to s3 as the data sink, using the path we specified. that's everything we really need to know, so all we want to do now is save and run the job, and we click run job. at the bottom of the screen it will start producing some logs for us; if we leave this for about three or four minutes, which is roughly one minute of start-up and then the actual processing, we should end up with data on the other side. so i'm going to pause the video here, let this run through, and hopefully once it's done we'll have some output. okay, once it completes, or while it's running and you're in the logs, the easiest thing to do is hit the x button at the top right, and you'll be greeted with the jobs screen. if you then highlight the job, it gives you a better message down the bottom here, and you can see that this one succeeded: just that x on the right hand side, land on this page, highlight the job in the list, and you get a better view of the logs and what's happened. you can see that after a couple of minutes of start-up time and about a minute of execution time it has succeeded. if i quickly go to the s3 bucket, it looks like nothing's there at first, but if we hit refresh, there's our parquet file. fantastic. but if we go to databases, then output, and look at the tables in output, there's no table for that data yet, and, you guessed it, the next thing we need to do is crawl it into a table. so let's add a crawler and call it output-customers, and go next. we want data stores, crawl all folders, and the include path is now that de bucket where we put the new parquet file, which is right here; everything else is default and we click next, next. choose an existing role, and we're looking for that de-glue-role, and we go next. we want to run on demand as usual, and the database is output this time, because it's our output database, then we click next. so: output-customers, the s3 location where the parquet is, the role that we created, and we click finish. we then want to highlight the output-customers crawler and run it; this should take about one minute, and we should see that one table has been successfully added.
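here's that rough sketch of the generated script i mentioned above. it's a hand-written approximation of what glue produces rather than the exact code, and the column names in the mapping are placeholders, because glue generates those from the real customers_csv schema.

```python
# rough sketch of the generated csv -> parquet glue job described above
# (column mappings and the bucket path are placeholders)
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())  # the spark context bit
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# lift the csv into memory as a dynamic frame via the data catalog
datasource = glue_context.create_dynamic_frame.from_catalog(
    database="input", table_name="customers_csv", transformation_ctx="datasource"
)

# one-to-one column mapping: names stay the same, only the storage format changes
applymapping = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("customerid", "long", "customerid", "long"),
        ("firstname", "string", "firstname", "string"),
        ("lastname", "string", "lastname", "string"),
    ],
    transformation_ctx="applymapping",
)

# the data sink: write the frame back out to s3 as parquet
glue_context.write_dynamic_frame.from_options(
    frame=applymapping,
    connection_type="s3",
    connection_options={"path": "s3://de-your-name-demo/output/customers-parquet/"},
    format="parquet",
    transformation_ctx="datasink",
)

job.commit()
```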
i'll pause the video here and then we'll pick it up once it's ready to go. okay, that took 46 seconds, and you can see that one table has been added. so again, if we go databases, then output, then tables in output, which is the long way round, there's customers_parquet, and we have the data in parquet. then if we go back to athena, and i do have it open in a tab at the top, but let's go the full way for good practice: athena, right click, open in a new tab. this time, if we go to the output database, you can see the table is there, so input has customers_csv and output has customers_parquet. if we click the little menu on the table and say preview table, a new query opens against customers_parquet, and that's the data, ready to go. and that's the transformation using glue: we took the data that was sitting in the input section of the bucket here in csv, we registered it in the glue data catalog down here, we used aws glue to extract, transform and load that data into the output section of the bucket and change it to parquet format, and then we're using athena over here, via the data catalog, to read the data and check that it's okay. so that's a kind of in-depth look at aws glue. just as another reminder, the question bank dot io is in the description below, where you can sign up for absolutely free, test your knowledge and try a lot of aws questions. it's not going to go behind a paywall; i'm going to keep it free for as long as i can, and if it ever comes to trying to raise money for it, we'll try to do it open source. as well as that, you can add questions that enter peer review; once they get peer reviewed and accepted or rejected, the accepted ones enter the question bank and we can all do quizzes off it, and the idea is that we build a whole base of aws questions we can use to work towards our qualifications. and again, if you want to test some of the aws glue or athena knowledge we've been covering here, the question bank does have those topics: go start questions, pick aws, type in glue, select aws glue, hit next, and you'll get questions on aws glue. so that's everything for the glue side of this lab and tutorial in this aws data engineering course. join me now back on the video camera and i'll run through a few closing remarks: what we've done, what you need to delete to make sure you don't get charged for anything, and what your next steps would be in data engineering in aws. well, that brings us to the end of the data engineering course, and i hope you enjoyed it. we covered aws kinesis, aws dms, aws glue and a bit of athena for good measure. again, i've made all this information free on github and on my website, www.johnnychivers.co.uk, so please check it out, and like and subscribe to the channel if you haven't already, because it really keeps me going and making this content for free. there's also the question bank dot io, which you can sign up to and test your aws knowledge for free; that's something i own and i want to keep free for as long as possible. and that's pretty much everything tonight guys, so thanks for watching, and until next time, see ya.