Transcript for:
Overview of AWS Glue ETL Service

Hello everyone, welcome to this zero-to-hero course on AWS Glue. In this course we're going to discuss the AWS Glue service in depth, understand its basic concepts, and also do some hands-on labs.

First, let's see the topics we're going to cover. We'll start with what AWS Glue is and what it is used for, then look at the different Glue components and walk through the AWS Glue console to see what's in it. After that we'll do some quick prep for the hands-on labs in the later sections. Then we'll start discussing the main components of AWS Glue: the Data Catalog, which consists of databases, tables, crawlers, and connections, and in the next section, AWS Glue ETL, which consists of Glue jobs and triggers. Along with the theory we'll also do some hands-on work to better understand Glue and see how it works.

So without any delay, let's get started. First, what is AWS Glue and what is it used for? Going by the AWS definition, AWS Glue is a fully managed ETL service. Let's unpack those two parts, "ETL service" and "fully managed", separately. ETL stands for extract, transform, and load: you have data in a source, you extract it, apply some transformations, and load it to a target. This process is called ETL in data engineering pipelines, and Glue provides that ETL functionality. "Fully managed" means that AWS manages the service for you: all the backend infrastructure, the servers, and the software provisioning. You don't need to provision or deploy any servers or install any software; AWS does all of that for you. That is why it's called a fully managed ETL service.

It's not just ETL, though; there's another aspect to Glue as well. The two main features of AWS Glue are the Data Catalog and the Spark ETL engine. So what is the Data Catalog? It's a persistent technical metadata store. What do we mean by that? Your data can live in any of several data stores: S3, RDS, DynamoDB, or other AWS services. The metadata of that data, such as its schema and where it is stored, can be kept in a catalog in AWS Glue. So you can have a centralized metadata repository in AWS Glue for data that is sitting in many different services. Glue can connect to 70 different data sources and lets you manage your data in a centralized Data Catalog. The catalog can then be used to look up the metadata about your data, and also for things like access monitoring. We'll discuss the uses of the Data Catalog in detail in the later sections, but I hope you now have a fair picture of what the Data Catalog is.
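To make the "centralized metadata repository" idea concrete, here is a minimal sketch (not from the video) of reading the Data Catalog with boto3. It assumes configured AWS credentials, and the region is just an example:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

    # Walk every database in this account's Data Catalog and print the
    # metadata tables grouped under it, plus where the actual data lives.
    for db in glue.get_databases()["DatabaseList"]:
        print("database:", db["Name"])
        paginator = glue.get_paginator("get_tables")
        for page in paginator.paginate(DatabaseName=db["Name"]):
            for table in page["TableList"]:
                location = table.get("StorageDescriptor", {}).get("Location", "n/a")
                print("  table:", table["Name"], "->", location)

Note that only metadata comes back; the records themselves never move into the catalog.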
Like we discussed, using the Glue ETL engine you can create, run, and monitor your ETL pipelines. We'll see how to create those ETL jobs in the hands-on sections, but that's what the ETL functionality of Glue means. I hope that gives you a fair picture of the two main functionalities of AWS Glue: the Data Catalog and the ETL engine.

With that idea, let's see what the different Glue components are and discuss each one in detail. There are two main categories of AWS Glue components: the Data Catalog and ETL. Under the Data Catalog we have databases, tables, crawlers, and connections; under ETL we have ETL jobs, triggers, and workflows. These are the main components of AWS Glue, and we'll be discussing each of these concepts in detail in this course.

Let's go to the AWS console and see how the AWS Glue console looks and what components it has. Here I'm logged into my AWS console; let's go to AWS Glue. This is how the Glue console looks. You can see the two main sections we discussed: Data Catalog, and Data Integration and ETL. If you expand Data Catalog, there are databases, tables, and Schema Registries (a newer feature of AWS Glue that we're not going to discuss in this video; it isn't required for a basic understanding of Glue, and we'll cover it in upcoming videos), and then connections, crawlers, and so on. These are the components we'll be looking into under Data Catalog. If you expand the ETL section, there are ETL jobs, Visual ETL, and notebooks, so you can write your code in a notebook or visually create and edit your ETL jobs (we'll see that in the hands-on section), plus interactive sessions, triggers, workflows, and more. So that's the AWS Glue console, its different components, and where to find them.

Now let's discuss each component one by one in detail. First, databases. A database is a component of the Glue Data Catalog, like we discussed: a logical container within the catalog that stores metadata tables. It's just a logical container, meaning there is no physical database as such. You can create tables in Glue, and all those tables can be grouped under a database. I repeat: it's not a physical database, and even the tables are not physical. We just create a metadata table, a table definition, and the data stays sitting in the source itself. These tables contain information about data stored in various sources such as S3, RDS, Redshift, and more, but a table itself doesn't contain the data; the data remains in its original source, and the table just holds the information about it. The grouping of these tables into a logical namespace is called a database.
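We'll create a database through the console in a moment; for reference, here's a hedged sketch of the equivalent boto3 call. The name and description are hypothetical:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # A Glue "database" is only this metadata entry: no servers, no storage.
    glue.create_database(
        DatabaseInput={
            "Name": "project_db",  # hypothetical name, used throughout these sketches
            "Description": "Logical grouping of the tutorial's metadata tables",
        }
    )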
I hope I was able to convey that idea clearly. The Glue database helps organize and manage metadata, making it easier to catalog, search, and query your datasets. By grouping your tables into a database it becomes easier to organize and manage your metadata, and, like we discussed, it's not a physical database, just a logical namespace.

Let's see in the Glue console how to create a database. Before we get to that, we need to do some prep for the hands-on lab. We just need to create two things: an S3 bucket, which will be required for the demo, and an IAM role for demo purposes. We'll create these two and then get started (a scripted sketch of the S3 part of this prep follows at the end of this section).

Let's create the S3 bucket first. Open S3 and create a bucket. I'm going to create all the resources in North Virginia, so make sure you create everything in the same region. I'll call it glue-tutorial-bucket and leave all the default settings. It says that name already exists (bucket names need to be unique across all of AWS), so let's call it glue-tutorial-bucket-1 instead. Now we have a bucket that we'll use for all the demo purposes.

The next thing we need is an IAM role. Click on Roles, then Create role, choose AWS service, and search for Glue; we're going to create a Glue service role. Click Next. For the purposes of this tutorial we'll just give AmazonS3FullAccess, because the role will have to scan S3 data, and we'll also give CloudWatchLogsFullAccess. That should be enough for now; if we see any errors we can come back and add permissions later. Let's call this glue-tutorial-role and click Create role.

Now that the role is ready, let's set up the bucket. We'll create a folder called landing-zone; this is where we'll upload our data. Inside it we'll create another folder called customers, and we can upload the data into that. I'm going to do one more thing here: inside customers I'll create a folder named load_date= followed by the date in dd-mm-yyyy format. I'll tell you why I'm doing this later, but we'll create this folder and then upload our data into it. So let's upload a sample CSV file here. This is our source data; we're going to use it for the demo.
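As promised, here's a sketch of the S3 part of that prep scripted with boto3 (the IAM role is easier to create in the console, so it's omitted). The bucket name, file name, and date value are assumptions; bucket names must be globally unique, so pick your own:

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")
    bucket = "glue-tutorial-bucket-1"  # hypothetical; must be globally unique

    # In us-east-1 (North Virginia), create_bucket takes no LocationConstraint.
    s3.create_bucket(Bucket=bucket)

    # Upload the sample CSV under a load_date=<dd-mm-yyyy> partition prefix;
    # the folder name doubles as a partition column later on. Date is assumed.
    s3.upload_file(
        "customers.csv",  # local sample file, assumed
        bucket,
        "landing-zone/customers/load_date=23-06-2023/customers.csv",
    )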
Now we have everything we need ready, so let's go ahead and create a database in the Glue catalog. In the Glue console, click on Databases, then Add database. Let's call it, say, sample_db (you can call it whatever you want), or let's say project_db. It's just a logical grouping of all the tables; the location is optional. Click Create database. It's very simple: we just created a database.

Next, let's look at tables. I think we already discussed what a table means: it's just the metadata of your data, which is stored in various data sources. AWS Glue tables play a crucial role in organizing, querying, and transforming data by providing a structured way to describe and access the data; we'll see how that works in practice. And again, like we discussed, it's not a physical table, in that the data doesn't move into the table in AWS Glue. The data still sits in your source, which can be S3, RDS, etc., and the table just contains the metadata about your data: where it is stored, what its schema is, and so on.

Let's see how to create a table in the Glue Data Catalog. We're going to create the table under the database we just made, so open that database and click Add table. We're creating a table for the data we just uploaded to S3: the name of the table is going to be customers, the database is project_db, it's a standard AWS Glue table, the source is S3, and the data is sitting in our own account, so we give the S3 path: open the bucket, landing-zone, customers, and select the entire customers folder. The format of the data is CSV and the delimiter is the comma. Click Next. Now you can edit the schema manually: click Add and specify your column names, data types, and so on, and define the table that way. So there are two ways of creating a table: you can manually define the schema and everything yourself, or you can run what is called a Glue crawler. Let's discuss what a Glue crawler is first and then come back and create the table here using one.

A crawler is basically a program that connects to a data source, automatically scans the data in it, determines the schema, and creates metadata tables in the Glue Data Catalog. This picture actually depicts it very accurately: there is data sitting in your data store, and here is your Data Catalog. The crawler is a program that connects to the data store, scans the data sitting there, infers things like its schema, and then uses that schema to create a metadata table in the Glue Data Catalog. That is the functionality of a crawler.

With that understanding, let's now create a table using a crawler. We'll abandon the manual process, because there we'd have to edit the schema by hand. Go back to Databases, click on the database, and click Add tables using crawler. Let's call the crawler customers and click Next. Is your data already mapped to Glue tables? Not yet. The data source is S3: landing-zone/customers. We'll choose to crawl all subfolders, click Add data source, and click Next. Now we need to select the IAM role the crawler will use to scan the data and create the table for us; we'll use the role we created in the previous step. Click Next, then choose the database the table should be created in: the project_db we created. If you want to add a prefix to the table name you can; otherwise the name of the folder, customers in this case, will be used as the table name, so let's keep customers. For frequency, that is, when you want the crawler to run, we'll select On demand, so it runs only when we need it. Review everything and click Create crawler. Now the crawler is created; click Run crawler. If you look here, the crawler is running, so let's wait for it to finish and see if it creates a table.
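While we wait, here's roughly what that console wizard set up, as a hedged boto3 sketch. The role, bucket, and database names are reconstructions of the demo's values:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_crawler(
        Name="customers",
        Role="glue-tutorial-role",  # the IAM role created during prep (name assumed)
        DatabaseName="project_db",  # catalog database the new table lands in
        Targets={"S3Targets": [
            {"Path": "s3://glue-tutorial-bucket-1/landing-zone/customers/"}
        ]},
        # Omitting the Schedule argument makes it an on-demand crawler, as in the demo.
    )
    glue.start_crawler(Name="customers")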
The crawler has run and it's in the Stopping state now; let's click on the crawler and see what happened. It says it has failed, so let's see what the error is. Click View CloudWatch logs to look at the logs for that crawler run. We can see that the role is not authorized to perform Glue actions, so it looks like we need to add Glue permissions to the role as well. Let's go back to IAM, add those permissions, and rerun. We'll give AWSGlueConsoleFullAccess; I think that should work. Add permissions, then go back and rerun the crawler. It's running again, so let's wait for this run to complete and see if it works. It looks like it has completed, so let's click on the crawler again: yes, this time the run is complete. Now let's go back to Tables and see if the table has been created. Yes, we can see the customers table. If we go to Databases and open our database, we'll find the table there. Let's look at it: you can see the column names and data types the crawler inferred. If you're curious what the file looks like, this is it; the crawler inferred the schema of this file and loaded the table definition into the Glue database. So in two steps we've created a database and successfully created a table with S3 data as our source.

Before we go ahead, I actually want to discuss another aspect here. Once you've created this table, what is the point? What is the use of it? One thing is that you're maintaining the metadata of your data, which is great. But what if you want to query this data using SQL? We can make use of Athena for that, so let me quickly demonstrate. You can actually query this data, which is sitting in S3, using SQL, which is cool: you don't have to move the data into a database. In the Athena console, click Query editor. In the query editor you can see the data catalog, the project_db database, and the customers table we just created; you can click Preview table. If you're using Athena for the first time you need to select an S3 location where the query results will be stored: click Edit settings, create a folder called query-results in this same bucket, select it, and click Save. That's it, you can now start querying your data. Let's run the query and see if we get the data. Yes: you're able to run a SQL query on data sitting in S3. This is another cool feature of AWS Glue: once a table is created in the catalog, you can query the data using Athena.

Another thing I want to highlight is the load_date column. This column is not present in the data itself, but it appears here, and if you look, it's a partition column. What that essentially means is that we are partitioning our data by load_date. If you get another customers file, you can add it under a different date; you can create more folders with this format for other dates.
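For reference, the same preview query can also be submitted programmatically; here's a minimal boto3 sketch, with names and paths reconstructed from the demo:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    response = athena.start_query_execution(
        # Add e.g. WHERE load_date = '23-06-2023' to read a single partition.
        QueryString="SELECT * FROM customers LIMIT 10;",
        QueryExecutionContext={"Database": "project_db"},
        ResultConfiguration={
            # The results location we configured in the Athena settings.
            "OutputLocation": "s3://glue-tutorial-bucket-1/query-results/"
        },
    )
    print("query execution id:", response["QueryExecutionId"])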
If you want to query only the customers data from a particular date, you can put that filter in the query (as in the commented line in the sketch above), and Athena will scan only that folder and exclude the rest, which gives you better performance while querying. That is the importance of partitioning your data in S3 and using that partitioning to get better performance when querying it in Athena.

Next, I want to discuss connections. In AWS Glue, a connection is a configuration object that enables Glue to connect to your data stores. Like we discussed, your data can be in any of several data stores: S3, Redshift, RDS, and so on. You need to be able to connect to the data store to get information about the data, scan through it, infer the schema, etc. To connect, you need things like credentials (username and password), the database endpoint, and so on. You can create a configuration object containing all those endpoints, usernames, and passwords and store it in AWS Glue, where it can later be used to connect to that data store. In the Glue console, if you expand the Data Catalog you can click Connections and add a connection. If you click Create connection, you can specify your data source; let's assume it's Redshift: select Redshift, click Next, and then select your Redshift cluster and give the database name, username, and password. You provide all these inputs once and store them, and the next time you connect to that Redshift you don't need to specify them manually. You just reference the connection configuration, and the crawler (or Glue in general) will automatically connect to that store using those credentials. That is the idea of connections in AWS Glue.
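As a sketch of what such a connection object holds, here's a hedged boto3 version; every value below is a placeholder, and in practice you'd prefer Secrets Manager over an inline password:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # A connection is just stored configuration; Glue uses it later to reach
    # the data store without you re-entering the endpoint and credentials.
    glue.create_connection(
        ConnectionInput={
            "Name": "redshift-demo-connection",
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": (
                    "jdbc:redshift://my-cluster.example.us-east-1"
                    ".redshift.amazonaws.com:5439/dev"  # placeholder endpoint
                ),
                "USERNAME": "admin",          # placeholder
                "PASSWORD": "example-only",   # placeholder; use Secrets Manager
            },
        }
    )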
The next important topic is AWS Glue ETL jobs. ETL jobs are the core of the AWS Glue ETL functionality; they're used to transform your data. As this picture depicts, you extract data from a source, transform it in the job, and load it to a target, and the job can be written in Spark (PySpark). Spark is a very powerful tool for data processing in the big-data world, and you can leverage the power of Spark through AWS Glue: you don't need to install Spark or set up any Spark cluster, because AWS Glue is serverless and fully managed. You just use Spark in your ETL job and transform your data. There are two ways to create ETL jobs: you can build a job visually, by specifying your source, your target, and whatever transformations you want to apply (Glue will automatically generate a script for you), or you can bring your own script with your business logic and run it in AWS Glue without having to set up any servers or environment. That is the idea of AWS Glue ETL jobs. Let's see in action how they work and how to create them.

To create a Glue ETL job, go to Data Integration and ETL and click ETL jobs. Like we discussed, you can create a job using the visual tool, use an interactive notebook, or use a script editor where you can write a customized script of your own. Just for the sake of simplicity, let's use Visual ETL. In the visual editor we specify the source, the transforms we want to apply, and the target. Our source is going to be S3, so I select the S3 source. For transforms, you can apply any of these: rename field, filter, conditional logic, anything you want. For simplicity, let's select one transform called Drop Fields, which drops a field. Actually, before configuring that, let's edit the first step. For the data source we can give an S3 location or a Data Catalog table, and since we've created a catalog table, let's select it: project_db, customers. That's going to be our input. Next, let's edit the transformation: we're going to drop a field, so let's say we want to drop this phone field, phone_2. Then we specify a target: we'll write the output to S3. For the node parent (which step the target follows), we select the transform, so that after the transform the data is loaded into the target. For the output format, let's store it as Parquet, choose a compression type, and store it in S3. Before that, back in the bucket, let's create a folder called transformed-zone: landing-zone is our input, transformed-zone is going to be our output. Come back here and select that folder as the output.

Just to summarize what we're doing: we read the data from the S3 location via the catalog table, drop a field, and then store the output in an S3 bucket in Parquet format. A very simple ETL job. Let's name it customers-etl-drop-field and save the job. We need to give an IAM role, so let's select the glue-tutorial-role and see if that works. After that you can configure things like whether it's a Spark job, the Glue version, and the language; we'll leave those as default. You can also play around with parameters like the number of workers, which all depends on your workload. Click Save, and the job is saved.

Now let's look at the script for the job. Based on the visual steps we specified, Glue automatically generated an ETL script for us. If you look at it, it reads the data into a Glue DynamicFrame (basically Glue's abstraction over your data), then calls the DropFields.apply transformation, and then writes the DynamicFrame out to the S3 location in Parquet format.
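The exact generated code depends on your Glue version, but it looks roughly like this reconstruction. Treat it as a sketch rather than the verbatim output; the database, table, field, and path names are the demo's, and the dropped column name is assumed:

    import sys
    from awsglue.transforms import DropFields
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the source through the Data Catalog table into a DynamicFrame,
    # Glue's schema-flexible abstraction over the underlying data.
    source = glueContext.create_dynamic_frame.from_catalog(
        database="project_db", table_name="customers"
    )

    # Drop the phone_2 field (column name assumed from the demo).
    dropped = DropFields.apply(frame=source, paths=["phone_2"])

    # Write the result to the transformed zone as Parquet.
    glueContext.write_dynamic_frame.from_options(
        frame=dropped,
        connection_type="s3",
        connection_options={"path": "s3://glue-tutorial-bucket-1/transformed-zone/"},
        format="parquet",
    )
    job.commit()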
With that, I think we should be good, so let's click Run. The job has started; click on Runs and it says Running, so let's give it some time and see what happens. Now the job has run successfully and says Succeeded, so let's go to the output location and see if it loaded the data. Yes: it has written the data in Parquet format. Let's see if we can view the data with a SQL query. There's some problem with the serialization there, but the Parquet file is stored as the output of this job in the S3 bucket. So that's how you create jobs and run them. We only discussed the visual way of creating a job like this; you can also create a job by supplying your own script, with whatever complicated logic you need in it, and run that job the same way.

The next thing I want to discuss is triggers. You can have triggers for your jobs: triggers are basically the events that can start a job. They can be of two types, event-based triggers or scheduled triggers. Click Add trigger; let's call it customers-etl-trigger. You choose whether it's on demand (if you select On demand, the job is triggered only when you run the trigger yourself) or scheduled: for a scheduled trigger you select the frequency (daily, hourly, whatever) and the minute of the hour; let's pick the first minute. Click Next, choose the target, which is a job, namely the ETL job we created, then click Next and Create. Now we've created this trigger, and you can see it's a scheduled trigger: it will start the target ETL job at one minute past every hour (there's an API sketch of this trigger at the very end of the transcript). That is the concept of triggers. You can also have an event-based trigger, for example on a crawler event: whenever a job succeeds, or whenever a crawler finishes running, you can run a particular job afterwards with an event-based trigger.

I think we've covered pretty much all the basic concepts of Glue. I hope I was able to give you a fair idea of the Data Catalog and the ETL engine of AWS Glue, and to demonstrate some of the concepts in the hands-on lab. Of course, this is not an exhaustive tutorial; Glue has introduced a lot of other features, like the Schema Registry and the Workflows option, and we can explore those in the coming videos. I hope you found this video helpful. If you have any questions, let me know in the comments below, and also let me know if you'd like me to cover any Glue-related topic in depth in my next videos. Thank you, and I'll see you in the next video.
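As a final reference, the scheduled trigger built in the console above corresponds roughly to this boto3 call; a sketch, with the name and cron expression mirroring the demo's settings:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_trigger(
        Name="customers-etl-trigger",
        Type="SCHEDULED",
        Schedule="cron(1 * * * ? *)",  # one minute past every hour
        Actions=[{"JobName": "customers-etl-drop-field"}],
        StartOnCreation=True,  # scheduled triggers otherwise start deactivated
    )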