Transcript for:
AWS Glue ETL Overview

hello everyone, welcome back to another video on AWS Glue. In this video we're going to talk in detail about AWS Glue ETL. In my previous video I introduced you to the AWS Glue service: what AWS Glue is, some of its use cases, and some of the features that AWS Glue provides. You can check out that video to understand the service in depth. In this video we are going to focus mainly on the ETL functionality of AWS Glue.

So, ETL stands for extract, transform, and load, and this is a very important step in any data engineering or data processing pipeline. You will have data sitting in a data source; you extract that data and apply some transformations to it. Transformations can be things like filtering or cleaning the data, or joining different datasets. Once the transformation is done, you load the data to a data target; this can be a data warehouse, a data lake, or something else. This process of extracting the data, transforming it, and loading it into a target is called the ETL process.

Now, to perform this ETL there are a lot of tools available in the market that you can leverage. AWS Glue also provides one such option, Glue ETL, to do these ETL operations, and we'll explore how to perform ETL using AWS Glue in this video. So I hope I was able to give you a fair idea of what ETL is and where AWS Glue ETL fits into this picture.

These are the topics we are going to cover in this video: we'll see what AWS Glue ETL is and some of the features of Glue ETL, and after understanding these things we'll go ahead and create and run our first Glue ETL job in AWS. So let's get started.

First, what is Glue ETL? It's a fully managed extract, transform, and load service provided by AWS. Let's go one by one. What do we mean by fully managed? Fully managed means that the service
is completely managed by AWS: all the servers, all the software, scaling up and scaling down, all the resources, everything is handled by AWS, so you just have to bring your ETL job and run it. And like we discussed, it's an ETL service, and it is provided by AWS; that is Glue ETL.

Now, what are the features of Glue ETL? First, like we discussed, it's fully managed by AWS. Second, it's serverless: you don't have to create or deploy any servers, or do any software installation on servers; it's completely serverless and managed by AWS. Third, AWS Glue ETL comes with Apache Spark built in. Apache Spark is one of the most widely used data processing frameworks in the big data world. We'll see how it can be used in ETL, but just to mention here: Apache Spark comes built into Glue ETL, and you can leverage the power of Apache Spark in your Glue ETL jobs. The fourth feature is that there are different ways in which you can develop your ETL jobs, as we'll see in the next section of the video. You can develop ETL jobs visually: there are drag-and-drop tools in the UI which you can use to create your job, and Glue automatically generates a script for you. You can also bring your own script, for example a PySpark script that has all the ETL logic; you just bring it to Glue, put the script there, and run your job without having to worry about servers and everything, because it's completely serverless. And you can also develop your Glue jobs using interactive notebooks: for those of you who are familiar with the Jupyter notebook style of writing code, there is also an option to interactively develop your Glue jobs using notebooks. Glue also has
scheduling, orchestration, and monitoring built in, so you can schedule your workflows and orchestrate them; you can add dependencies too, like "after this job, that job needs to be executed", and you can also do monitoring. And Glue ETL enables you to easily connect to different data sources and targets in AWS, and also outside AWS, as we will see. You can connect to various data sources like S3, Redshift, or RDS very easily using Glue, pull your data, and then start writing your ETL jobs.

So these are some of the features of Glue ETL. Don't worry if you are not 100% clear on any one of these points; as we start building Glue jobs hands-on, you will understand what we mean by each of them, and I will be coming back to these points when we are writing our Glue jobs.

Now, with this basic understanding of what AWS Glue ETL is, let's go ahead and create our first ETL job. What we'll be doing in this Glue job is: we have some data in S3, we'll do some transformations on the data, and we'll load it back to S3. In this use case both our source and target are in S3, but like we discussed, the source and target can be anything else; we will explore more examples where we bring the data from a different source and load it to a different target other than S3. For this example, let's take data sitting in S3, transform it using Glue ETL, and load it back into another folder in S3.

I have my sample data uploaded here in this bucket, under the landing zone customers prefix, as a customers CSV file. This is my data; if you want to take a look, this is how it looks. It's a simple customer dataset that we have. So let's transform this data using Glue ETL and then load it back here into this
transform zone. So this will be the output location for our Glue job.

There are two things you need before you get started: you need this data in your S3 bucket, and you will also need an IAM role which Glue can use. I have already created that IAM role; if you have not created it, you can go ahead and create it with these permissions. I have attached AmazonS3FullAccess, AWSGlueConsoleFullAccess, and CloudWatchLogsFullAccess to this role. Of course, you can fine-tune the permissions to give access only to a particular bucket and so on, but for the sake of simplicity I've given S3 full access, so this role will have access to all three buckets, and it will also have Glue console full access. When you create it, create it as a Glue service role.

With these prerequisites, let's get started: let's go to AWS Glue and create an ETL job to read this data and transform it. Like we discussed, when you click on ETL jobs over here, there are various methods in which you can develop your ETL jobs. You can click on Visual ETL, which gives you drag-and-drop tools to create your ETL jobs without needing to write any code; you can interactively develop the code using a notebook; or you can bring your own PySpark script, or any script, that transforms the data. For this example, let's go ahead with Visual ETL; in the coming videos we'll explore both of the other methods as well. Let me click on Visual ETL. This is very useful for people who are not very familiar with coding: you can easily transform your data. So, what is your data source? Even if your data is in S3, you can create a Glue catalog table for the data and bring it in from there, or, if you don't have a Glue catalog table created, we can just select Amazon S3. So now we
have selected the source. So if you see, you select the source, the transformation you want to apply, and your target; this gives us an easy UI-based way of developing our ETL jobs. Let's click on this, and now we need to configure the source. The name is S3; the source type is either an S3 location or a Data Catalog table. If you have created a catalog table you can select that, and I have already created the table, so I could select it as well; but if you don't have the table created, you can just select the S3 location. For now let's select S3 location, click Browse S3, and this is our bucket with the landing zone customers data. Let's select this entire customers prefix; this is my data path. Let's select recursive so that it will read the files in all the subdirectories. My data format is going to be CSV, and the delimiter is comma. I think this is fine.

If you want to preview the data, what you can do is let Glue scan the data and generate a preview for you, so let's do that. If you select a Glue catalog table, you don't need to do this; the preview will already be present here. Select the Glue ETL role that you just created in the previous step and let Glue scan the data; once the scan is done, it will generate a preview for you. Okay, it has scanned the data and generated a preview for us; if you see here, this is our data, how it looked. But I would suggest leveraging the catalog table functionality: create a table in the Glue Data Catalog and then use that as your source, so that you have all the tables and metadata already present and you don't have to scan the data here. Anyway, for this tutorial, let's keep the source as an S3 location and go through it. So now we have the source configured; let's add a transformation. If you see, these are all the available
transformations; there is a transformation for almost anything you can do with the data. So let's select a simple transformation: let's drop one field from the CSV file and then load it to the transform zone. Let's assume I want to drop last name. So what we are doing now is adding this transformation called Drop Fields and dropping that field. Now we have a source and a transformation configured; let's add a target. Go to Targets, and our target is Amazon S3. And one other change we'll make is to load the data in Parquet format instead of CSV. Let's configure this target now: click on the target, S3, and Node parents. What "node parents" means is: what is the previous step that this step depends on? This step obviously depends on the transformation step, so that's why we have linked them. The format is Parquet, the compression type is Snappy, and let's go ahead and select this transform zone as our output. This is where the data will be written, after dropping that field, into the transform zone in Parquet format.

So if you see here, we did not write any code to read the data, to transform the data, or to write the data into the target; we just used the visual drag-and-drop tools provided by AWS Glue to create this. For the data catalog update options, "Do not update the Data Catalog" is enough; if you want to create a catalog table for the target, you can select that, but let's ignore it for now. And save this job as customers drop last name. So this is our ETL job; let's save it.

Now the job is saved, and if you click on Script here, you can see the script generated by Glue automatically, based on the ETL job that you configured in the previous step. If you see here, it is reading the data from
S3 into what is called a DynamicFrame; a DynamicFrame is basically an abstraction that Glue provides. It is reading from S3 in CSV format and configuring all the paths. Once the data is read into a Glue DynamicFrame, it applies this Drop Fields transformation, and after the transformation is applied, it writes the data using the write dynamic frame method into S3 with the format as Parquet. So this is how Glue automatically generates the ETL script for you.

Let's run this job and see what happens. Now it says the job has started; you can click on Runs here to see it. It is running; I think it'll take some time for the job to spin up and start. Another thing to notice here is that you did not spin up any servers, and you did not install anything like Apache Spark or other dependencies; you just came here, created your job, and ran it. That is what we meant when we said that AWS Glue ETL is fully managed by AWS: all the infrastructure and all the dependencies are handled by AWS for you.

Just a couple of things here. If you see this capacity in DPUs, the DPU is the processing unit that Glue provides; you can select the number of DPUs that your job needs, and the worker type, where each worker type also has certain characteristics. This is the Glue version. We are of course running everything on the defaults, but you can configure and modify each of these parameters.

Let's wait for this Glue job to complete and then see the output. It says succeeded, and it took around 1 minute 6 seconds; the startup time was 10 seconds. And of course you can monitor all your logs; if you see, there are all logs, output logs, and error logs. This is how AWS Glue makes it very easy for you to monitor your ETL jobs. So yeah, I hope I was able to convey many of the features of Glue ETL over here. Okay.
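Before checking the output, here is a plain-Python stand-in for the three steps the generated script performs. This is not the script Glue produced: the real one uses a GlueContext, reads into a DynamicFrame with create_dynamic_frame.from_options, applies DropFields.apply, and writes with write_dynamic_frame.from_options, as described above. The column names and sample values here are illustrative, and this sketch writes CSV rather than Snappy-compressed Parquet:

```python
import csv
import io

# Step 1 (extract): read the CSV source into rows.
# In the generated Glue script this is the create_dynamic_frame.from_options call.
source_csv = """customer_id,first_name,last_name,email
1,Asha,Rao,asha@example.com
2,Ben,Smith,ben@example.com
"""
rows = list(csv.DictReader(io.StringIO(source_csv)))

# Step 2 (transform): drop one field, here "last_name".
# In the generated script this is DropFields.apply(frame=..., paths=["last_name"]).
transformed = [{k: v for k, v in row.items() if k != "last_name"} for row in rows]

# Step 3 (load): write the result to the target.
# In the generated script this is write_dynamic_frame.from_options with
# format="parquet" and Snappy compression; plain CSV stands in for it here.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["customer_id", "first_name", "email"])
writer.writeheader()
writer.writerows(transformed)
print(sorted(transformed[0].keys()))
```

The point of the mapping is that each box you drag into the visual editor corresponds to one of these calls in the script Glue generates for you.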
Let's go back to S3 and check our transform zone. Let me refresh this. This is the output Parquet file that our Glue job wrote back into S3, into the path that we had configured in the ETL job. You can now of course read this file and check the output, to verify whether it actually applied those transformations or not.

So yeah, I hope this was a quick tutorial demonstrating how to create your Glue jobs and how to configure the sources and targets using the Visual ETL option. In the next videos I'm going to be showing you how to develop more complex ETL jobs using PySpark scripts. I hope you found this video helpful; do let me know in the comments if you want me to cover any particular topic on AWS Glue, and I'll be very happy to do that. Thank you, and I'll see you in the next video.
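One footnote on the IAM role prerequisite from earlier in the video: creating the role as a Glue service role means its trust policy allows the AWS Glue service to assume it. The three managed policies mentioned above are attached on top of a trust policy that typically looks like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "glue.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```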