Transcript for:
Overview of Google Cloud Dataflow

Hello all, welcome to Tech Capture. In our previous video we discussed all the data processing services in Google Cloud: Dataflow, Data Fusion, Dataproc, and Cloud Composer. We also discussed the features of each service and when you should use each one. Now we are going to discuss Google Cloud Dataflow in detail. First we'll have a short introduction to Dataflow, and then we'll jump straight into the demo and create our first Dataflow job in this video.

So first, let's try to understand what Google Cloud Dataflow is and why it is the most popular data processing service in Google Cloud. Dataflow is a unified stream and batch data processing service based on Apache Beam.

Let's start by understanding stream versus batch data processing. I'll give you a simple example: an application that processes banking transactions. In batch processing, at the end of the day you get a file of the daily transactions. Suppose all the transactions from 9:00 a.m. to 5:00 p.m. are collected into a file that becomes available at 6:00 p.m.; you process that file at 6:00 p.m., load it into the data warehouse, and the data is available in the warehouse after 8:00 p.m. That happens every day, so that is batch processing: you load the data in batches, in this case a daily batch.

Now what happens with streaming? In streaming, as soon as a transaction is done it is pushed into something like Pub/Sub or Kafka and processed immediately by a real-time, streaming data processing job. So a transaction made at 9:30 is processed at 9:30, and the data is available in your data warehouse within a few seconds. That is real-time data processing: your data is available in the warehouse for analytics within a very short time. The streaming job runs continuously, keeps accepting data, and as soon as it receives data it processes it and loads it into the data warehouse. That is the difference between streaming and batch processing.

Dataflow is popular because it can handle both streaming and batch data processing. Because of this capability it is used heavily for real-time processing, which makes it ideal for applications that require low-latency insights, meaning the data should be available for analytics within a short period of time. With batch processing you have to wait a whole day for your data to land in the data warehouse for analysis, but with streaming your data is available for analysis almost immediately, because that real-time processing capability is built into a streaming job.

Now let's talk about the key use cases for Dataflow. It is mainly used for real-time stream processing, for example for IoT devices or log processing; for batch processing and large-scale data transformation (we'll see in our demo how to create a data pipeline for data transformation); and for building real-time dashboards and analytics. Now let's see how to create a Dataflow job.
Now we are going to create our Dataflow job, but how can we create it? There are two options: first, Dataflow jobs can be created using a Dataflow template, and second, they can be created using the Dataflow job builder. The job builder option was not there initially; it was recently introduced by Google and lets you create a Dataflow job without code, using just the Dataflow UI. With Dataflow templates you can create a reusable template, or reusable pipeline, and there are two options there: you can create your own custom template, or you can use a Google-provided template. Google provides a lot of pre-built templates for common scenarios, and we'll create our first Dataflow job using a Google-provided template. We could use the job builder as well, where we use a UI for building and running Dataflow pipelines in the Google Cloud console, so we don't need to write any code.

Now let's go to the Google Cloud console and create our first Dataflow job. I'm in my Google Cloud console now; this is a new project and I haven't used Dataflow in it yet, so let me go to Dataflow and start creating my first job. As I said, we have two options: create a job from a template or from the job builder. The job builder wasn't there earlier; it's newly introduced by Google Cloud (you can see in an older screenshot that the option wasn't available). It has some very good features, and we'll discuss the job builder in our next video. For now, we'll just create a job from a template.

We're working on a use case where we load data from a text file, a comma-delimited CSV file of about 1,000 records, from a Google Cloud Storage (GCS) bucket into BigQuery. I have VS Code open here with the employee data: it has 1,000 records, and those 1,000 records are what I'm going to load from the Cloud Storage bucket into BigQuery. So let's see which template we can use; I'll use the "Text Files on Cloud Storage to BigQuery" template. Since I haven't used Dataflow in this project, it will ask me to enable the APIs, and I might initially face some permission errors as well. Enabling the API takes some time. Okay, the API is enabled now, and we have the two options, template and job builder; I'm in the template flow now. I'll give the job the name dataflow-demo-01; in case the job fails we'll create a new one with 02. For templates, there are built-in Google-provided templates: these are for streaming jobs, where there are multiple templates, and these are for batch processing, where we also have multiple templates. I'm going to use the "Text Files on Cloud Storage to BigQuery" template.

Now let's see what details it asks for. It requires an input file, which is our source data; a BigQuery schema; the BigQuery output table; and a temporary location.
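For reference, everything we're about to fill in on this console form can also be passed to the classic "Text Files on Cloud Storage to BigQuery" template from the command line. This is only a sketch: the bucket, file, region, and project names below are illustrative placeholders for what we create in this demo, and the parameter names are from my recollection of the classic template's documentation, so double-check them against the console form before relying on them.

```
gcloud dataflow jobs run dataflow-demo-01 \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates-us-central1/latest/GCS_Text_to_BigQuery \
  --parameters=\
inputFilePattern=gs://dataflow-demo-000/employee_data.csv,\
JSONPath=gs://dataflow-demo-000/BQ.json,\
javascriptTextTransformGcsPath=gs://dataflow-demo-000/UDF.js,\
javascriptTextTransformFunctionName=transform,\
outputTable=MY_PROJECT:dataflow.employee,\
bigQueryLoadingTemporaryDirectory=gs://dataflow-demo-000/temp
```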
So I'll open Cloud Storage first and create a bucket for our demo. There might already be a couple of buckets here, so I'll create a new one and give it a unique name, something like dataflow-demo-000. If the name isn't unique it will say it already exists and the bucket creation will fail. Okay, this name is unique and the bucket is created successfully. Now let me upload my file: I'll go to Upload Files and upload the employee data CSV, which has the 1,000 records. We'll copy the path of this file because we need the gs:// path in our first field. So I'll set the input file path; we can browse to it from here, under dataflow-demo-000, and the field should turn green. Okay, it's green now, so everything is fine.

Now for the target: here we need to give a BigQuery schema, and the schema has to be in a specific JSON format. Let me copy that format; I'll create a new file called BQ.json and check which format the template needs. It needs the JSON format shown in the example, so we copy that format from the example they've given. Now let's check which columns we have in the employee data. I'll copy them one by one: employee ID, first name, last name, department, position, salary, joining date, and country. Those are all the fields we have. Let's check the data types as well by going back to the CSV file: employee ID is an integer, salary is also an integer, and joining date is a date, so let me set those types (ID is integer, salary is integer, joining date is date). Let me check the format once more; we're good. So this is our BQ.json file. Now let's upload this file to the storage bucket as well, because it has to be in a storage bucket. Don't worry, I'll provide all the files to you; I'm doing it from scratch so you understand what we're doing and what each of these files is. Now let's go and select it: BQ.json. This field is also green now.

Next is the BigQuery output table. We've already defined the schema, so we don't need to create the table with all the columns; we'll just create an empty table. I'll open BigQuery, create a dataset, and create our table inside that dataset. So let me go to BigQuery. I have three datasets already; let me create one more and name it dataflow, and we'll use it for all Dataflow-related jobs. Within this dataflow dataset I'll create one table, and I'll create it as an empty table without any schema, because we already defined the schema in the BQ.json file.
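For reference, based on the columns and types described above, the BQ.json file could look something like this. The field names are assumptions (the video doesn't spell out the exact column names), and the console's example shows the exact wrapper format the template expects, so match that; the idea is a single "BigQuery Schema" array listing each column with its type:

```json
{
  "BigQuery Schema": [
    { "name": "employee_id",  "type": "INTEGER" },
    { "name": "first_name",   "type": "STRING" },
    { "name": "last_name",    "type": "STRING" },
    { "name": "department",   "type": "STRING" },
    { "name": "position",     "type": "STRING" },
    { "name": "salary",       "type": "INTEGER" },
    { "name": "joining_date", "type": "DATE" },
    { "name": "country",      "type": "STRING" }
  ]
}
```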
So the dataset is created; we can see the dataflow dataset, and I'll create one empty table inside it. I click Create Table, make it an empty table, and name it employee, since we're creating it for the employee data. Then I click Create Table. Now I have created the dataset and the table, and I'll add them to my job parameters. Here we need the BigQuery output table; once the table is created we can see it in the list, so we'll just browse the list and select it once it shows up. It's still creating, so let's wait a moment... it's created within the dataflow dataset, it's just taking a little time to load here. Let's check whether we can see the table now. Once it has loaded, the employee table is available to select, so we select this empty table; this will be our output table, and the data will be loaded into this employee table.

Now for the temporary directory we can select any bucket, so I'll choose our existing bucket as the temporary location, and for the required parameter I'll select the same location, but we have to add a /temp path, so I'll append /temp here.

One more thing: in the optional parameters I will provide a JavaScript UDF function (not a JSON file). This is required to map the input file fields to the BigQuery table columns. You can see the expected format of the UDF function; I've already created a UDF for our data, so let me take it from Notepad++ and explain what this UDF does in our Dataflow job. I named the file UDF.js. What we're doing here is mapping the input fields to the target table columns. First, we ignore the column names: the CSV file has the column names as its first record, so we say that if the employee ID value is literally the header text, just filter that row out. Then we create an object, map the employee ID to the first value, map each of the remaining fields to their values, and return the output as a JSON string. So basically we're converting CSV data into a JSON string; that's what this JavaScript function does. It splits the comma-separated values and converts them into a JSON string. Let me save this file and upload it to my storage bucket as well. This is the kind of transformation we're doing, because to load the data into BigQuery, this Dataflow job needs the data in JSON format. Now, instead of creating a new UDF I'll use the existing one, select my UDF.js file, and save. The UDF function name should be the name of your function; my function is called transform, so I use transform here.

Now I have filled in all the required fields as well as the JavaScript UDF file, so let me run the job. If it fails for any reason, we'll check the logs and troubleshoot. We might face errors related to permissions because we're doing this for the first time in this project, but if the service account has the Editor role it should not face any issue.
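To make the UDF step concrete, here is a sketch of what a transform function like the one described above could look like. The column order, the property names, and the header text are assumptions based on the employee file we discussed (ID, first name, last name, department, position, salary, joining date, country), so adjust them to your own file and to the names in your BQ.json:

```javascript
/**
 * Sketch of a transform UDF for the "Text Files on Cloud Storage to
 * BigQuery" template. Assumed CSV column order:
 * employee_id,first_name,last_name,department,position,salary,joining_date,country
 */
function transform(line) {
  var values = line.split(',');

  // Skip the header row: if the first value is the literal column label
  // instead of an ID, drop this record. (The video's UDF filters the
  // header the same way; if your template version does not discard
  // null/undefined returns, strip the header from the file instead.)
  if (values[0] == 'employee_id') {
    return;
  }

  var obj = new Object();
  // Numeric columns are converted so they match the INTEGER fields in BQ.json.
  obj.employee_id = parseInt(values[0], 10);
  obj.first_name = values[1];
  obj.last_name = values[2];
  obj.department = values[3];
  obj.position = values[4];
  obj.salary = parseInt(values[5], 10);
  obj.joining_date = values[6];
  obj.country = values[7];

  // The template expects each output element as a JSON string.
  return JSON.stringify(obj);
}
```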
Now, one more thing I wanted to show you: as soon as our Dataflow job starts, it will create worker nodes at the back end, and these worker nodes are nothing but virtual machines. Currently the job is in the starting state, so nothing is shown for current CPU or current workers. What I'll do is open the VM instances page, so you can see the VM instances being created at the back end and understand how a Dataflow job works. Let's go to VM Instances and refresh; currently we have no VM instances, and here the current worker count is also zero. As soon as a worker starts, it is nothing but a new VM instance, so just keep an eye on this. The current worker count is zero now; it will create a worker node, which is your VM instance. Let me refresh... and you can see it created a VM instance at the back end, and your Dataflow job will execute on this VM instance. It created one worker, and in a few seconds it will reflect here. The job is also running, and you can see the job status and the logs here; I'll hide the logs for now, and if there's any error it will show in red. This is the graph view; I usually follow the graph view because it's good for visualizing all the steps. You can see the current resources: 1 vCPU, 3.75 GB of memory, and a 25 GB disk, which is the specification of the VM created in the background. Now you can see one worker node has also been created; let's refresh, and it created the worker node on that same VM instance, so the job should be executing now. If you expand the graph you'll see all the steps; I'll minimize it and pause for a few seconds, because it takes some time to start the instance and execute the job. Now you can see the operations happening: the output table, JSON path, and all the other details and metadata related to the job are shown here.

Now we got an error. Let's go to the error and see what we're facing. I'll hide this for now... "No such function: transform". Okay, let's see: we have a transform function here, and the UDF file is UDF.js. Let's check the job in case we entered anything wrong: we gave UDF.js and the function name transform, which is correct, so why is it giving an error? The function name is correct; let me check whether the file uploaded successfully. It's showing an empty file; I think I didn't save the file. Let me save it and try to upload it again, because it's showing an empty UDF.js when it should be around 1 KB. I'll just overwrite it so it has the latest content. Now I can see it's about 1 KB. So I had uploaded an empty file, which is why it couldn't find the transform function. I'll just clone the job and name it 02, because you cannot use the same name again; I'll validate the rest of the settings and keep them the same. Everything else is fine, so I'll simply run the job, which creates a new instance of the job, and I'll go back and check 02. I hope you now understand how to troubleshoot these kinds of errors.

It will again take some time to get into the running state and show the workers, so we'll fast-forward the video until the graph and the workers are available. We have the graph now, and in a little while we'll see the workers as well. Here we can see two CPUs and two workers, which means it should have created one more VM; let's check.
Yes, it created one more VM, and it is using the two workers to execute the job. So far there is no error; we'll wait a bit more and see whether our job completes successfully or we face an error again. We can see a few of the steps have completed and the last step is running; in the last step 17 out of 20 stages are successful, so let's wait. Now we can see all stages are successful and our complete job is successful. Here it still shows as running, but it takes a few seconds to show the success status.

While it's still in the running state, we can go and validate the data. This is the employee table we created as an empty table, and now we can check whether we have data here. We can see the schema, which is based on the JSON file we created. Let's check the data: I'll run a query. We had 1,000 records in the CSV file, so let me remove the limit... and we have 1,000 records. Let's sort by employee ID so we can validate against the data in our CSV file. In the CSV file the first record was Robert Wilson, and we can see Robert Wilson here, and all the data is loaded; we can see all 1,000 rows. So in this way we loaded our CSV data, 1,000 records, into the BigQuery table, using the BQ.json file and a JavaScript function.

It took longer here because I tried to explain each and every thing from the basics, but usually it doesn't take much time; I just wanted to make sure you understand all the steps and all the files. One more thing: we converted the employee ID from string to integer, and the salary as well, so sometimes you have to play with data types to make them compatible with your BigQuery schema. I hope you now understand how to create a data pipeline using a Dataflow job. Let's check whether it's successful now: yes, it is successful, and we have successfully created our first data pipeline using Dataflow, where we loaded the data from a GCS bucket into BigQuery.
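For completeness, the validation at the end could be done with a query like this in the BigQuery console; the dataset, table, and column names are the assumed ones from this demo:

```sql
SELECT *
FROM `dataflow.employee`  -- assumed dataset.table created in this demo
ORDER BY employee_id;     -- column name as defined in BQ.json
```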