Transcript for:
Tech Capture: Dataflow Job - Cloud Storage to BigQuery

Hello all, welcome back to Tech Capture. As we discussed in our last video, where we got started with Dataflow, today we are going to create a Dataflow job in which a Cloud Storage file is loaded into a BigQuery table using a Dataflow template.

This is the Dataflow UI; when you open Dataflow you land on this overview page. If we go to Jobs, it shows all the jobs (I'll filter it to show only the running ones). When you create a job from a template, there are multiple built-in templates for batch processing, stream processing, and utilities. You can see several streaming templates here, for example loading CDC (change data capture) streams from Spanner into BigQuery, or writing change streams to Cloud Storage. In this video, though, we are going to start with a batch template: "Text Files on Cloud Storage to BigQuery".

Let me show you the source file first. I have a CSV file containing the columns id, name, city, state, country, birth date, password, and last login. Out of these, I want to load only a few columns into the BigQuery table using the Dataflow template, and we'll see how to do that.

First I'll open the job form. I'll give the name as bq-load-demo, choose us-central1 as the regional endpoint, and the template is already selected. We need three things here: a JavaScript UDF file, a JSON file, and our data source file, which is the CSV I just showed you. Let me explain the significance of each. In the "JavaScript UDF path in Cloud Storage" field we provide the path of the JavaScript file; in that file we define a function that maps each column of the source file to the corresponding column of the table. In the JSON file we define the BigQuery schema. If I reload the page and open a fresh form (I had already typed something, which is why the hints weren't showing), it gives us guidelines on what to put in each field. I'll enter the name bq-load-demo again, scroll down, and select "Text Files on Cloud Storage to BigQuery". Now you can read the descriptions: the Cloud Storage path pattern for the JavaScript code containing your user-defined function, the JSON file that defines your BigQuery schema, and the BigQuery output table. So we need a JavaScript file with a user-defined function and a BigQuery schema; the schema is nothing but the column names and data types listed in the JSON file, and then we have to mention the BigQuery output table.

Before that, we have to create a dataset in BigQuery; the table itself can be created at runtime. So I'll go to BigQuery, where I already have multiple datasets, click Create dataset, and name it dataflow. Since I am not choosing any location, it will take the default US location for the dataset. That is the dataflow dataset created.

I already have the code for this UDF, the user-defined function, so let me show it. This is the transform function. It takes input from the CSV file line by line, and since this is a comma-separated file, it splits each line on the comma.
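As a reference, here is a minimal sketch of what the finished udf.js could look like, reflecting the index mapping the walkthrough arrives at below and the target column names name, city, and country; the exact file from the video is only described, not shown in full, so treat this as illustrative:

    function transform(line) {
      // Split the comma-separated line into its fields:
      // 0 = id, 1 = name, 2 = city, 3 = state, 4 = country, ...
      var values = line.split(',');

      // Build an object whose property names match the BigQuery columns.
      var obj = new Object();
      obj.name = values[1];
      obj.city = values[2];
      obj.country = values[4];

      // The template expects each output element to be a JSON string.
      return JSON.stringify(obj);
    }

The index adjustments discussed next are exactly these array positions.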
Each value then maps to a property on the BigQuery object; in the sample code, obj.name was set to values[0], meaning the first field from the CSV, but our first field is the id. We want only name, so I change that to 1; for city, counting 0, 1, 2, I put 2; we skip the columns we don't need; and for country I put 4, because you have to count from zero: this is the zeroth column, then the first, second, third, and fourth. Those are the only columns I want to load. I'll share the code in the description. Let's save this file.

Now let me show you the schema file for BigQuery. Here I have given the column definitions: "name" is simply the name of the column as it should appear in your BigQuery table. This is your first column, this is your second, and the third I'll change to country, since I updated the mapping. I save this as bq.json, and the function file as udf.js.

Next I'll upload these two files, along with the CSV, to a storage bucket. I'll create a new bucket for our demo and call it something like bkdfdemo.00; the name was accepted, so let me create the bucket and upload the three files: bq.json, udf.js, and data.csv. Let me quickly validate data.csv: yes, this is the same file.

Now back to the job configuration; we have all the details that are required. First, the path of the UDF JavaScript file: I open udf.js in the bucket and copy its internal storage path, the one starting with gs://. Then the JSON path; just remember that the gs:// prefix is already given here, so I remove the extra one and the path just ends with bq.json. For the UDF function name, you have to give the name of the function defined in the file, which in our case is transform. For the BigQuery output table, you have to give it in the form shown: the project name first, then a colon, then the dataset, a dot, and the table name. The dataset I created is dataflow and I want the table to be called bq_load, so it becomes project:dataflow.bq_load. We don't need to create the table beforehand; it can be created at runtime based on the column names we mentioned in the schema file. For the input path, that is our data file, so I take its URL from the bucket. Finally, you can choose any temporary location; I'll just use the same bucket and append temp_bq for the temporary directory for BigQuery loading and temp for the temporary directory for files.

Now that we have given all the details, we'll execute the job and see whether it fails for some reason or runs successfully. Meanwhile, let me go back to BigQuery and show you that there is no table as of now; it will only be created once execution starts, so currently the dataflow dataset is empty. Let me click Run Job now.
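For reference, the bq.json schema file for this particular template typically wraps the column list in a "BigQuery Schema" key. A minimal sketch matching the three columns used in this demo might look like the following; verify the exact format against the template's documentation:

    {
      "BigQuery Schema": [
        { "name": "name",    "type": "STRING" },
        { "name": "city",    "type": "STRING" },
        { "name": "country", "type": "STRING" }
      ]
    }

If you prefer the command line to the console form, the same template can also be launched with gcloud. The bucket, project, and table names below are placeholders, and the template path and parameter names should be checked against the current "Text Files on Cloud Storage to BigQuery" documentation:

    gcloud dataflow jobs run bq-load-demo \
      --region us-central1 \
      --gcs-location gs://dataflow-templates-us-central1/latest/GCS_Text_to_BigQuery \
      --parameters javascriptTextTransformGcsPath=gs://YOUR_BUCKET/udf.js,JSONPath=gs://YOUR_BUCKET/bq.json,javascriptTextTransformFunctionName=transform,outputTable=YOUR_PROJECT:dataflow.bq_load,inputFilePattern=gs://YOUR_BUCKET/data.csv,bigQueryLoadingTemporaryDirectory=gs://YOUR_BUCKET/temp_bq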
Just to kick off, or start, the job takes around 30 seconds to a minute. On the right-hand side you can see the status: it shows Starting, then it moves to Running, and from here we can monitor the progress and view the logs. At the moment there are zero worker CPUs; once the job starts, worker nodes are created at runtime and they execute the job. I'll keep this screen open so you can see the change, and now the current vCPU count is 1, which means a node has been created to execute our job. All of this happens in the background, so you don't need to worry about it; the service is serverless for us, we only create the pipeline and it takes care of execution at the back end, and it will even scale based on your job's complexity.

Now everything is running. Here it says "0 of 1 stages succeeded"; if I click on it, it shows the multiple steps, or tasks, inside that single stage, around fourteen of them, such as reading from the source and the batch loads that are currently running. If I expand it, this is the complete execution graph. You don't really need to go into it unless there is a failure, but it helps you troubleshoot at which stage the job failed whenever you have an issue with your Dataflow pipeline.

Now you can see everything is green. Once all stages are complete, you can verify the data in the BigQuery table, so let me reload: everything is completed and green, and the job status will also update within a few seconds. I'll go to BigQuery and validate that the table was created and the data loaded; we need only three columns, name, city, and country, in our BigQuery table. The job status now shows Succeeded. In BigQuery there is no table visible at first; it appears once you reload, and now you can see the bq_load table has been created. Let me preview the table to check whether it has data. Yes, it has data, but it wrongly picked up the email column because the numbering was off; remember, this is zero, this is one. I think I didn't save the file; let me check. Yes, that index still pointed to email, because I forgot to save the updated file before uploading it. That is not a real issue, though: the main point is that it loaded exactly the three columns we specified by index, 1, 2, and 4, so all three columns came through as given.

We can add more statements to the function, for example to ignore the first line (the header row) or to apply further transformations, and the complexity can be increased as needed (a brief sketch follows at the end of this transcript), but to demonstrate the template I've kept this example very simple: loading a plain CSV file from a storage bucket into BigQuery. We'll do one more demo in the next video, where we'll extract data from Cloud Spanner to a Google Cloud Storage bucket. See you in the next video.
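As a closing reference (this extends the sketch shown earlier, not the exact file from the video), here is one way the transform function could be extended to skip the header row and apply a small clean-up. It assumes the template drops an element when the UDF returns nothing, which you should verify against the template's documentation for your version:

    function transform(line) {
      var values = line.split(',');

      // Skip the header row: if the first field is the literal column
      // name "id" instead of a numeric id, return nothing so this line
      // is not written to BigQuery (behavior to verify in the docs).
      if (values[0] === 'id') {
        return;
      }

      var obj = new Object();
      obj.name = values[1];
      obj.city = values[2];
      // Example of an extra transformation: normalize the country value.
      obj.country = values[4].trim().toUpperCase();

      return JSON.stringify(obj);
    }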