Transcript for:
IPL Data Project with Apache Spark and Databricks

In this video we will use the IPL dataset. We will upload that data to Amazon S3 as our data storage, we will use the Databricks environment, we will write PySpark code for the transformations and build our logic, and we will use SQL to analyze our data and build visualizations on top of it. To understand the dataset better we will be finding answers to a lot of different questions, so without wasting time let's get started. Before you move forward, don't forget to hit the like button and subscribe to the channel if you are new here. Let's start this project by understanding the basics of the architecture diagram. We will be using the IPL dataset; the IPL is the Indian Premier League cricket tournament, which is very popular. We will take this dataset and load it onto Amazon S3. Amazon S3 stands for Simple Storage Service and is used for storing any type of data; it is object storage, so you can store anything you want, such as CSV files, Parquet files, audio, or video. Once that is done, we will use the Databricks workspace. First we will use Apache Spark to write the transformation code: we will read the data using Spark code and then apply some business logic on top of it. After that we will do SQL analytics: we will write some SQL queries and join some of the tables to understand the trends in the IPL dataset and try to find some insights. Once we find the insights, we will build the final visualizations on top of them. So this is the complete architecture diagram. It is a very small project, but it will help you understand a lot of things about Databricks, Apache Spark, SQL, reading data, and how to transform it. The core learning of this project is how to write PySpark code. We will mainly focus on the writing side of PySpark and on understanding how Spark code is written; everything else, such as the visualizations and the SQL queries, is an add-on to give you a better understanding of the entire process. You can refer to this architecture diagram any time you want. Now let's understand the basics of Apache Spark first, because I want to give you an understanding of the entire Apache Spark ecosystem. First we have Spark Core: this is the heart of Apache Spark, responsible for executing every piece of code you submit to a Spark application. Don't worry about the details right now, just understand the components. Then we have Spark SQL attached to it: if you want to write SQL queries inside a PySpark application, you can easily do that with Spark SQL. Then we have Spark Streaming, which means that if you want to process real-time data, like what you see on Google Maps or Uber, you can do that too. If you want to run machine learning on large data, that is also possible inside Spark, and if you want to process graph data, like the kind you usually see on social media websites, you can do that in Spark as well. These are the basic components attached to Spark. Now, this is the core architecture of Apache Spark, and I want to give you an understanding of this architecture using the nodes.
First of all, what is Apache Spark? Spark is a unified compute engine and a set of libraries for parallel data processing on computer clusters. What does that mean? Let's say you want to process 1 GB of data: you can easily do that on a single computer, because you don't need a lot of storage or processing power for 1 GB. But what if you want to process petabytes of data, or even a thousand GB? For a thousand GB you need at least one terabyte of storage, and if you have hundreds or thousands of terabytes you won't be able to store that on your computer at all; you need a larger storage system. After storing it you also need to process it, and that data will have billions of rows, so you need some kind of framework that can process it at large scale. Apache Spark is that framework: it divides your data into multiple partitions, processes those partitions on a large cluster of machines, and combines the output at the end. That is the core idea of Apache Spark. It is a unified compute engine, which basically means we have thousands of machines working together, and a set of libraries for parallel data processing: if you have billions of rows, you chunk the data into multiple parts, process them in parallel, and finally combine the output. So the core idea of Apache Spark is to process large-scale data in a distributed way, so that you can process it faster and get the output within seconds. Again, these are the components we already looked at. At the lower level we have the low-level APIs, the RDDs and distributed variables; this is the basic Spark toolkit. Then we have the structured APIs, which is what we will mainly be working with: DataFrames, SQL, and Datasets. On top of that there are advanced analytics packages and Structured Streaming. All of these are the different toolkits available inside Apache Spark. Now let's understand the architecture of Apache Spark. Typically when you think of a computer, you think of one single machine that you use for playing games or watching videos. A single machine does not have enough power and resources to perform computations on huge amounts of information, so we need a cluster of machines. A cluster is basically a group of individual machines combined together: if I have five PCs, I can connect all five of them together and that becomes one cluster, and the entire cluster acts as a single computer. So even though you have five machines, the cluster is treated as one pool of compute resources, and you can use those resources to run your Apache Spark code. But it is not just about getting all of these machines together and telling them to process the data; you need a proper framework that can manage and coordinate work across all of these different machines. A group of machines alone is not powerful; you need a framework to coordinate work across them, and Spark is designed exactly for that. Spark manages and coordinates the execution of tasks across the cluster of computers.
So we might have thousands of machines, but Apache Spark manages and coordinates all of the execution that you send to those machines; it is responsible for that. This is the architecture of an Apache Spark application. As you can see, two things are important: the driver process and the executor processes, plus a cluster manager that manages the driver and the executors. Let's understand these two things. The driver process runs the main function of your Spark application; it is like the manager of the application. Whenever you submit any code, the driver process makes sure that code runs successfully and that all of the outputs and the metadata are taken care of. It maintains information about the Spark application, responds to user input, and analyzes, distributes, and schedules work across the different executors. So the driver process is the manager. Then we have the executor processes: these are the processes on the actual machines that execute your work. When you say "do 2 + 2", an executor process performs that computation and the driver process keeps an eye on it. So executors are responsible for executing code assigned by the driver process and reporting the state of the computation back to the driver. There are many other things related to Apache Spark, but these are the basics: the driver process and the executor processes are the two main things you need to understand to execute this project. Spark supports multiple languages; you can write code in Scala, Java, Python, SQL, and R. Next you need to understand the SparkSession. Whenever you write Apache Spark code, a default variable is created that acts as the entry point to the Spark driver, and it is called spark. You control your Spark application through the driver process via this SparkSession. So when you start writing Spark code you need an entry point, and that entry point is the spark variable; whenever you configure your Spark environment, this is where you start. If you print the spark variable, you will get something like this, which is basically your SparkSession. This is the default one, but you can also create your own SparkSession; don't worry about it, we will do that when we get to the hands-on practice. Just understand that with this spark variable you can do every single thing in Apache Spark: if you want to create a DataFrame or write SQL queries, all of that goes through the SparkSession, so this is very important when you start writing your Spark application code. That was the basic idea of Apache Spark. Now there are a few more concepts you need to understand. The first is the DataFrame. A DataFrame is like a spreadsheet: if you have a basic understanding of Python, it is essentially a row-and-column format where you can store your data in a structured way.
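To make the entry-point idea concrete, here is a minimal sketch of what that looks like in code. In a Databricks notebook the spark variable already exists, so the explicit builder call below is only needed outside that environment; the app name and the tiny example DataFrame are just illustrative, not part of the project.

```python
from pyspark.sql import SparkSession

# In Databricks, `spark` is already provided; building one explicitly is only
# required when running PySpark outside such a managed environment.
spark = SparkSession.builder.appName("entry-point-demo").getOrCreate()
print(spark)  # prints the SparkSession object, as described above

# Everything flows from this object: creating DataFrames, running SQL, and so on.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()
```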
In Spark, this data is stored in a distributed way: one part of the data might be stored on machine A and a second part might be stored on machine B, so all of the information is distributed across machines. That is the first concept you need to understand. The second thing you need to understand is transformations. Whenever you run code in Spark, such as taking a DataFrame called my_range and performing an operation on it, say keeping only the numbers divisible by two, normal Python would give you the output then and there, but Spark will not execute that code immediately. It will wait until you call an action, and there are multiple types of actions available. The basic idea is that as you write transformation blocks, Spark keeps stacking the different transformation logic on top of each other. Transformations are basically the business logic you want to apply: filtering some data, removing null values, filling those null values, adding new columns. These are called transformations. As you write transformation code, Spark builds the entire plan for the execution, and only when you perform an action, such as count or show, does Spark execute the whole graph it has created (there is a small sketch of this right after this paragraph). In the backend it does a lot of different things: it creates a logical plan and a physical plan, but for this project you don't really need to understand that in depth. I have covered all of it in my Apache Spark course, so if you are really interested in learning Spark in depth you can choose to enroll in that, but for this project the basic understanding of the driver process, the executor processes, and lazy evaluation is enough to execute the project, learn the basics of Apache Spark, and complete one good project on the IPL dataset. That is the entire goal of this project. After Spark, we need to understand Databricks. Databricks is basically a platform that provides the Apache Spark environment for you. If you wanted to configure Apache Spark on your local PC, you would have to install the JVM, get all of the packages related to Apache Spark, and configure the environment paths, and only after a lot of errors would you have Spark running on your single machine. If you wanted to do this in your own on-premise data center, say across thousands of machines, it would be a very difficult task, because you would have to think about scaling, storage, networking, and many other components. Instead of doing all that, Databricks is a company that says: don't worry about the backend work, we will take care of it for you. Just click the start button and you will have an Apache Spark environment created for you; focus on writing and executing your code, focus on the business side, and don't worry about the infrastructure side. That is the core idea of Databricks: to give you an environment to execute Apache Spark code. On top of that they also provide many other things, such as the lakehouse architecture, Databricks SQL, a data warehouse, and machine learning tooling.
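Here is the small sketch of lazy evaluation mentioned above. It is not part of the IPL project itself; the my_range name and the modulo-2 filter simply mirror the example described in the explanation.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Transformations: Spark only records the plan here, nothing is executed yet.
my_range = spark.range(1000).toDF("number")
divisible_by_two = my_range.where(col("number") % 2 == 0)

# Action: only now does Spark optimize the plan and actually run the job.
print(divisible_by_two.count())  # 500
```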
They provide multiple things on top of it, but the core idea of Databricks is simply to provide the Apache Spark environment. This much information is enough to execute this entire project, because I assume you already have an understanding of SQL and Python; the reason you are learning Spark is that you already know the basics of SQL, Python, and a little bit of data warehousing. We will focus mainly on Spark, and along the way you will learn a bit about Databricks and Amazon S3, and we will do the final visualization at the end. That is the goal of this project. Now that you have the basic understanding, let's look at the dataset we will be using, and then we will start the execution. This is the dataset available on a website called data.world. It is IPL data up to 2017, so it is not the latest data, but remember, we want to learn Apache Spark; the goal of this project is not to analyze the data and find business-ready insights. The goal is to use whatever IPL dataset is available, and this is a good one because it has five different files for us to work with, so you can use it to learn how to write basic Apache Spark code by yourself. This dataset contains ball-by-ball data for the IPL, covering around 637 matches up to the 2017 season, so it has a lot of information about each and every ball. The original source of the data is also linked, so if you want to get the data from the source you can do that, but this version is already well organized into five CSV files. If you are a cricket fan and you understand the game, you can use this data to build your own project. These are the different files available: the ball-by-ball CSV, and then files with information about the matches, players, player-match combinations, and teams. I will click into each one and go through them one by one. Let's understand the data dictionary first; a data dictionary basically describes the different columns and what they mean. As you can see, we have five CSV files. The first is the ball-by-ball CSV, which contains information about each and every delivery: whenever a ball is bowled in a cricket match, which player hit it, whether it went for one run, four, or six, who bowled it, and at which stadium; a lot of information is attached to every single ball, and you will find it here. We have the match ID for the individual matches; then the over ID, so the 13th, 14th, or 15th over; the ball ID, whether it was the first, second, or third ball of the over; and the innings number, which tells you whether it was the first or the second innings (in the IPL, one team bats first and the other bowls first). Then we have the batting team, the bowling team, the striker's batting position, and extras such as wides and no-balls. Then we have the number of runs scored, which is a very important column, because you can find a lot of insights from the runs scored: it might be zero, one, four, or six runs.
A lot of different information is available about the extra runs and the type of extras: wides, leg byes, byes, no-balls, penalty runs. Just by reading the column names you will understand what type of data each one contains, though you do need a basic understanding of cricket. Then we have the bowler extras and the out type: whether the batsman was out caught, bowled, run out, lbw, retired hurt, stumped, caught and bowled (when the bowler catches the ball off the bat), hit wicket, or obstructing the field; every single detail is recorded. We also have the bowler wicket flag, the date of the match, the season, and the striker and non-striker, which are player IDs telling you which player was on strike and which was at the non-striker's end for each ball. Then there is the bowler, again an ID that can be joined using the player ID, plus the player out, the fielders, and some surrogate keys. All of this information is available in the ball-by-ball data. Next we have the data at the match level: a match primary key and a match ID, the two teams (for example Royal Challengers Bangalore versus another team), the date the match happened, the season year, the city and country, who won the toss, who won the match, the toss decision, the win type (did they win by runs or by wickets), the final outcome, the man of the match, the win margin, and the country ID. Then we have the player information; if you are a cricket fan you will be able to understand this data quite easily. We have a player primary key, and then the player-match table with a player-match key and a match ID. This table stores the information about a single player in a single match: how that player performed, whether they were the wicketkeeper or the captain, everything about that player in that match is stored here. Then there is the plain player table: the player's name, date of birth, batting hand, bowling skill, and country name; each and every player is stored here. Finally we have the team information; there are around twelve teams in this dataset, and we have details about each of them. So just by looking at the data you should get an understanding of what each column means, and you can check the data dictionary for the rest. The schema types are also available, so you can see whether a column is a string, date, or integer. Those are the basics of the dataset. Now, you can download this data and upload it manually to your own Amazon S3 bucket. If you have an AWS account it is pretty simple: all you have to do is create your AWS account and create a bucket; I have shown this so many times on my YouTube channel that it should be clear, so I will move straight to the next step, which is the Spark application. If you don't have an AWS account, don't worry: I have already uploaded all of the CSV data to a bucket that is publicly accessible, so if you want, you can use my bucket and access
the data. If you want to create your own version of the project and have full ownership of it, you can download the data from the website: just click on it, click Download files, and you will get the data on your PC. Once you have it, you can manually upload it to your own bucket, or build a mechanism to upload it, but since this is a pretty static dataset you can simply create the bucket, upload the data directly, and then write your PySpark code against it. If you want to use my S3 bucket instead, that is totally fine, go for it. That is everything you need to understand about the dataset. Again, the goal of this project is writing Apache Spark code, so we will focus heavily on that, on the syntax and the basics. The goal is not to get the business logic perfectly right; there will be transformations we write that might not make complete business sense, but that is not the outcome we are after. What we are after is writing the PySpark code ourselves. Before we move forward, I just want to say that this is a free project and you will learn a lot from it, but it is only an overview; I can't cover every single thing in a one-hour video. If you are interested in learning all of these topics in depth, such as Python, SQL, data warehousing with Snowflake, Apache Spark, and Databricks for data engineers, then you should definitely check out my courses. I have put a lot of hard work into building them; Python and SQL are the foundational courses, the core of the work starts with the data warehouse and Apache Spark courses, and more courses will be added. If you are interested in upskilling your career in data engineering, you will find the link in the description. Now let's continue with the execution. The first thing we need is a Databricks account. All you have to do is search for Databricks on Google, open the Databricks page, and click on Try Databricks. You have to fill in basic information such as first name, last name, email address, and company name; I will put in my basic information and my email address. You can add your company name and a title; if you are working as a data engineer, just put data engineer. The phone number is optional. Add your country name and click Continue. Once you fill in the basic information you will be redirected to this page. Databricks supports three cloud environments, as we saw in the basic architecture: AWS, Azure, and Google Cloud. If you are using Databricks professionally, you select one cloud, usually wherever your existing infrastructure lives; so if my company uses AWS for data pipelines and data storage, I would go with AWS, and the same logic applies to Azure and Google Cloud. But that option is paid, because when you select a cloud provider it actually deploys Databricks into your cloud account using EC2 machines and other resources. What we really need is the free version,
that is, the personal-use option: Get started with the Community Edition. This is completely free; you don't have to pay anything. All you have to do is solve the captcha and submit, and once that is done you will receive an email about activating your account. Activate the Databricks account and you can get started. I already have a Databricks account, so this is what it looks like. In the Community Edition you will see a few things: you can create notebooks, you have the workspace where everything you do in your Databricks account lives, and there is a search feature. Then we have three important sections: Catalog, which is where all of your databases are stored; Compute, where we will create our own compute resource; and Workflows. Workflows are not available in the Community Edition, but if you use Azure Databricks or upgrade your account you can access Workflows and more features. Let's get started. First we will create our compute, because this is where our Apache Spark code will run, so click on Create compute. Give the compute a name; it can be whatever you want, so I will call it IPL project cluster. As you can see, we have zero workers and one driver node. We already covered the basics of Apache Spark: the worker nodes run the executor processes, and then there is the driver process. Our driver has about 15 GB of memory and two cores, and we don't have any worker nodes, because for tutorial purposes the Community Edition is enough; that much RAM is actually quite a lot if you are working with a smaller dataset. If you go with the upgraded version, you have options to choose a minimum and maximum number of workers, and based on your workload it will automatically add new machines so that your Spark code scales. Then we have the runtime; we will go with the default and not touch anything here. If you click on the Spark section you can add environment variables and other settings if you want custom configuration, and then you just click Create compute. This will take some time to create the compute resource. As you can see, you can track everything at the compute level: the configuration is available here, any notebooks attached to it will be tracked here, and the libraries are here, so if you want to install external libraries you can easily do it. Then we have the event log, the Spark UI, and the driver logs; all of these are mainly the configuration side of things that you can track at the compute level. While the compute is being created, we can start working on our notebook. All you have to do is click the Create button and choose Notebook. This is where we will write our code. First I will give the notebook a name, IPL data analysis spark; you can name it whatever you want. Then we have multiple language options: Python, SQL, Scala, and R. We will write our code in
the Python language, which means PySpark, but you can also write SQL, Scala, and R code in Databricks. You have to select your compute, so click on the IPL project cluster; it is currently connecting, and once it is connected you can start writing your Apache Spark code. Everything else is just a basic Jupyter-style notebook environment where you will write all of your Apache Spark code. Now, about the dataset: as I already told you, all of the data is available in the S3 bucket, so we can directly start accessing it and writing our code. This is the publicly accessible bucket I have already published. You can either create your own S3 bucket and make it public and access it directly, or you can use the configuration settings and provide an access key and secret key if you don't want to make it public. But for simplicity, since a lot of people might not have an AWS account or might not want to go through the data-upload stage, you can directly use my bucket, start accessing the data, and focus mainly on the Apache Spark code, so that you don't have to worry about uploading all of this data yourself. We already went through the dataset and understood all of the columns. Now I want to focus on writing the PySpark code, and we will write all of the code ourselves. I have provided all of this code in my GitHub repository, but I highly encourage you to write it yourself so that you get the hands-on practice and the confidence that you can write Apache Spark code on your own. The first thing we want to do is create the Spark session. There is a default Spark session available; whenever you use the spark keyword you will see it, but to follow best practice we will create our own Spark session for our application, and that is very simple. All you have to do is write something like this.
From pyspark.sql we import SparkSession; SparkSession lives inside that package. Then we create the session: spark = SparkSession.builder.appName("IPL data analysis").getOrCreate(). The getOrCreate part means that if this application already exists we reuse it, and if it doesn't we create it. To execute the cell you can hit Shift+Enter or click the run button. Now if you evaluate spark, we have a Spark session available for the app we just created, so we can track the entire session under this application. Once this is done, we can start writing our Apache Spark code, and the first step is to read the data from the bucket. This is where our data is stored, so we want to read it file by file and also do some basic transformations on top of it. The syntax to read the data is pretty simple. Let's say I want to read the ball-by-ball CSV: I will call the result ball_by_ball_df, because I am reading it into a DataFrame, and then I will use the spark variable, our Spark session: spark.read.format("csv"), because CSV is my format, then some options such as option("header", "true") because my first row is the header, and then I provide the path. (I accidentally ran the cell; let me close that.) For the path, I copy the S3 URI; that is the exact path you can use, and this will read the CSV. If I run this, it reads the data from the S3 bucket, which is publicly available so you can access it too, and then I call show(5). Let's wait for it to complete; it is currently running, and you can click here to check the logs. And there we go, I got all of the column values. The show call is the action we talked about: it displays the data, and as you can see we have the match ID, over ID, ball, innings, and so on, with the actual rows printed out. So this is the DataFrame we were able to read, but there is a problem: all of these columns have been read as strings. If you look at the data dictionary, you will see the columns should be integers, booleans, and dates, different types for different columns. To fix this we have two options. The first is to add the option inferSchema set to true: option("inferSchema", "true"). If I run this and then check the data types, Spark will automatically try to work out the type of data stored in the CSV file and assign that type to each column. Let's wait for it, and as you can see the data is now read with proper types generated by Spark, completely automatically; we did not provide anything. But there can still be issues with this approach.
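As a rough sketch, the read step described above looks something like this; the bucket path and file name are placeholders, not the exact ones used in the walkthrough.

```python
# Reading the ball-by-ball CSV from S3; the path below is a placeholder.
ball_by_ball_df = (
    spark.read
    .format("csv")
    .option("header", "true")        # first row contains the column names
    .option("inferSchema", "true")   # let Spark guess the column types
    .load("s3://<your-bucket>/Ball_By_Ball.csv")
)

ball_by_ball_df.show(5)
ball_by_ball_df.printSchema()  # inspect what inferSchema actually decided
```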
For example, columns like retired hurt, lbw, run out, and bowled should be Boolean, but if I check here, they have all been automatically converted to integers. This is the problem: Spark cannot always figure these things out automatically; it treats the column as an integer because it doesn't know whether the value is supposed to be a Boolean, an integer, or a string. For that we need to create our own schema, and this is the best practice in the real world: when you are reading data from any external source, you define your own struct schema so that no matter what data comes in, it goes through a step-by-step check that every column is in the proper data type. You will see what I mean right now; let me close this so we have a clearer picture. Apache Spark has something called StructField for defining a schema, and it looks like this. First we need to import the packages for defining a struct schema, so we add a cell at the top where we import everything: from pyspark.sql.types import StructField, StructType; these are the two things we need to define the structure. We also need the type classes: IntegerType, StringType, BooleanType, DateType, and DecimalType. These are the different types available in Apache Spark, and you use them to assign types to specific columns; you will see what I mean. Run that cell, and then all you have to do is define the struct schema. It looks like this: ball_by_ball_schema = StructType([...]), where inside the brackets we add a list and then define each and every field. For each field we write StructField, and the first column is the match ID, so I copy the name, put it here, and since the match ID is an integer I use IntegerType. The third argument is whether the column is nullable, and I will keep it as True, which means the field can contain null values. We repeat the same thing for the second column, the over ID, which is also an integer, and then the ball ID and the innings number, adding commas between the fields. Then we have the batting team, which is a string value, so for that I use StringType instead of IntegerType. That part is done, but if I scroll down you will see there are many more columns, and so far we have been creating all of these StructFields manually. But now we have AI: instead of doing all of this by hand, you can use AI tools. What I can do is copy this entire thing, paste it into ChatGPT, and say: this is my
DataFrame schema, and I want you to generate a PySpark struct schema using the schema of my DataFrame. You just paste it in, wait a moment, and it automatically generates the entire thing for you. You don't have to do it all by yourself; this is the power of AI, because doing it one by one would take around 10 or 15 minutes of manual copying and pasting. If I use AI instead, it generates all of this code for me, and this is how you increase your productivity: you don't do this kind of thing manually, you learn how to use AI tools in your workflow so you can focus on the important aspects of your code instead of the manual work. So I got the whole schema generated by ChatGPT, and I can copy it and replace my partial version with it. Instead of writing every StructField by hand, I now have the BooleanType, DateType, and IntegerType fields all sorted out for me. Now let's put this at the top: I will cut and paste it above, so first I generate the schema and run that cell, which stores all of the schema information. Then, while reading the DataFrame, I remove the inferSchema option and instead use .schema(), passing the schema name, which is ball_by_ball_schema. If I run this, Spark applies the schema while reading the DataFrame, and as you can see the relevant columns are now Boolean rather than integer. This makes sure that every time we read this data from the external source, if there is any inconsistency in the data, Spark applies this schema on top of it: if there is a mismatch it will give you an error, and if not, you get the full output. That is the power of defining a struct schema. Now I will do show(5), wait for it, and you can see the output generated properly. With that sorted, we have to do the same thing for all of the other DataFrames. This was the ball-by-ball file; next is the match file, which uses the match schema. I highly encourage you to use ChatGPT for this; in my case I have already created all of these schemas so we don't waste time on the manual work. For the match schema it is the same thing: copy the DataFrame schema, paste it into ChatGPT, generate the code, and in the same way we create match_df, where we read the data, provide the schema, set header to true, and specify CSV. We don't have to worry about this manual task; just provide the information and we are done. The same goes for the other files: after match we have the player-match file with its schema, and the player file with the player schema.
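Here is a condensed sketch of the schema pattern just described, showing only a handful of the ball-by-ball columns; the column names and types here are assumptions based on the data dictionary, and the full schemas live in the repository (or can be generated with ChatGPT as described).

```python
from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, BooleanType, DateType
)

# Only a few illustrative columns; the real schema covers every column in the file.
ball_by_ball_schema = StructType([
    StructField("match_id", IntegerType(), True),
    StructField("over_id", IntegerType(), True),
    StructField("ball_id", IntegerType(), True),
    StructField("innings_no", IntegerType(), True),
    StructField("team_batting", StringType(), True),
    StructField("runs_scored", IntegerType(), True),
    StructField("extra_runs", IntegerType(), True),
    StructField("wides", IntegerType(), True),
    StructField("noballs", IntegerType(), True),
    StructField("bowler_wicket", BooleanType(), True),
    StructField("match_date", DateType(), True),
])

# Enforce the schema instead of relying on inferSchema.
ball_by_ball_df = (
    spark.read
    .schema(ball_by_ball_schema)
    .format("csv")
    .option("header", "true")
    .load("s3://<your-bucket>/Ball_By_Ball.csv")
)
```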
The last one is the team file, and I will do the same operation for it. I have the code ready so that we don't waste time doing the same thing again and again, and I have also provided the code in the GitHub repository so you can refer to it. But again, for practice, write all of this code yourself so it gets registered in your brain. If you just look at it from the outside, you see a StructType and some StructFields and it seems simple, but as you write the code yourself you might make mistakes in the list or in the syntax, so at least for practice write it out so you gain the confidence that you can write Apache Spark code on your own. So these are the five DataFrames we were able to read, and now we want to work on the transformation block. Transformations are the core of Apache Spark, because this is the final logic you apply to the dataset: it can be removing null values, combining two columns to create a third, or filtering out some of the data. There are different types of transformation functions, and they generally come under the structured API. I have covered this in detail in my Apache Spark course, so there you can understand everything from the backend, but in general a transformation is the set of logic you apply to a DataFrame to get the final output. Transformations are usually based on business logic; there might be requirements from data scientists or data analysts who want certain columns in a certain format, and that is what transformations really look like: you filter out the information you don't need and only pass forward what you really need. In our case we will do basic transformations just to understand how to write Apache Spark. The goal of this project is not to create a business-friendly outcome that holds up for a real business; the goal is to understand how to write PySpark code, because we are mainly focusing on the technical side of things. This is just a test dataset, and we want to learn Apache Spark; once you get into a company, they will tell you what they need, and you will write the code according to those requirements and generate the output. So we will do some basic transformations, and I have already prepared them. The first one is to filter the data to include only valid deliveries, meaning we want to exclude wides and no-balls from our dataset. For that we write something like this: ball_by_ball_df = ball_by_ball_df.filter(...), and inside the filter we use the col function, with col("wides") == 0, because I don't want wides, and col("noballs") == 0. Let me check that the brackets are all fine. What this does is exclude the wides and the no-balls from the dataset, because we don't want them for our final analysis.
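A minimal sketch of that filter (together with the functions import that the next step talks about); the wides and noballs column names are assumed from the data dictionary.

```python
from pyspark.sql.functions import col, when, sum, avg, row_number

# Keep only valid deliveries: exclude wides and no-balls.
ball_by_ball_df = ball_by_ball_df.filter(
    (col("wides") == 0) & (col("noballs") == 0)
)
```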
Again, this depends on the business requirement, but for now, just to understand the transformation function, we are excluding the wides and the no-balls so we keep only proper deliveries. If I run this, I get an error: col is not defined. We need to import col, and for that there are packages available, so at the top I add: from pyspark.sql.functions import, and then either import all of the functions or just the ones we will use, such as col, when, sum, avg, and row_number. I run that, which imports these functions from the SQL functions package into our notebook, and now if I run the filter again it generates the output, which means we were able to exclude the wides and the no-balls from our DataFrame. But note that Spark did not actually run this code the moment we wrote it; Spark will not execute everything immediately and hand you the output. We already talked about lazy evaluation: the more transformation blocks we write, the more Spark stacks them into a logical plan, and it executes the final code only once you call an action. Spark waits until the very last moment to execute the graph; it creates the logical plan and the physical plan, optimizes the entire transformation block, and then gives you the output. I go into detail about all of this in my Apache Spark course, so if you want to understand Spark in depth with Databricks you can think about enrolling in that. So this is our first transformation, and we can keep writing more transformation blocks on top of it. In the next one we will calculate the total and average runs scored in each match and innings. It is very simple: I create a new DataFrame, total_and_avg_runs = ball_by_ball_df.groupBy("match_id", "innings_no").agg(...), because based on the match ID and the innings I want the total and the average runs scored in each. Inside the aggregation, the first part is the total: the sum of runs_scored, which is a column in our data, with an alias so we can read it properly, total_runs; then a comma, and the average of runs_scored with the alias average_runs. That looks fine: we are grouping by the match ID and the innings number, and then we run it. Once you execute this, you will see that we applied the filter and then created a new DataFrame, which we will use later in the final visualization. Like this, you can write multiple transformation blocks, and I have already written some of them.
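A sketch of that aggregation, assuming the match_id, innings_no, and runs_scored column names:

```python
from pyspark.sql.functions import sum, avg

# Total and average runs scored per match and innings.
total_and_avg_runs = (
    ball_by_ball_df
    .groupBy("match_id", "innings_no")
    .agg(
        sum("runs_scored").alias("total_runs"),
        avg("runs_scored").alias("average_runs"),
    )
)
```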
We can also write a window function; if you want to understand how window functions work in PySpark, in this one we will calculate the running total of runs within each match, over by over. First we create the logic for the window function, the window spec: window_spec = Window.partitionBy("match_id", "innings_no").orderBy("over_id"). That is where I define the logic for my window function. Then I write ball_by_ball_df = ball_by_ball_df.withColumn(...), which is what you use when you want to create a new column; I add the new column running_total_runs and apply my logic, which is the sum of runs_scored over the window spec. Let's run this; we will probably get an error because Window is not defined, and for that we need to import the window package: from pyspark.sql.window import Window. Import that, run it, and then rerun the cell; now the new column is available, and if you scroll to the end you will see the running total there. Similarly, we can do some conditional logic: flag the high-impact balls, meaning either a wicket fell or more than six runs were scored including extras. So ball_by_ball_df = ball_by_ball_df.withColumn("high_impact", ...), because I want to create a new column, and inside it I use when: when the column runs_scored plus the column extra_runs is greater than six, or the column bowler_wicket is equal to True, then True, otherwise False. What this does is take the total of the runs scored plus the extra runs; if that is greater than six, or if any wicket fell on that ball, then either of those conditions being true means we mark the delivery as True, a high-impact ball, and otherwise it is False. It is basic conditional logic; if you have a basic understanding of Python you should be able to follow it. I run it and get an error because of a typo in the runs_scored column name; after fixing that and running it again, we were able to create the new high_impact column on our DataFrame.
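A sketch of the window function and the conditional column described above; again, the column names (over_id, extra_runs, bowler_wicket) are assumptions based on the data dictionary.

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum, when

# Running total of runs within each match and innings, ordered by over.
window_spec = Window.partitionBy("match_id", "innings_no").orderBy("over_id")

ball_by_ball_df = ball_by_ball_df.withColumn(
    "running_total_runs",
    sum("runs_scored").over(window_spec),
)

# Flag high-impact balls: more than six runs (including extras) or a wicket.
ball_by_ball_df = ball_by_ball_df.withColumn(
    "high_impact",
    when(
        (col("runs_scored") + col("extra_runs") > 6) | (col("bowler_wicket") == True),
        True,
    ).otherwise(False),
)
```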
So these are the basic transformations. Again, this depends on the business requirements and on your understanding of the dataset: how well you understand the data and the columns, and how you can structure it to extract more value from it. This is based on my understanding; you can also go into this DataFrame, try to understand all of the columns, and think about what more insights you could extract yourself. This is where a data engineer or analyst really shines, because just by looking at the data you need to start thinking about what you can extract from it. If you understand cricket and you have all of these columns available, you can find the strike rate of a player, understand how different overs in an innings affect the overall performance, or see how winning the toss impacts the overall match; these things count as insights. When you watch a game on TV and see all of the statistics in front of you, someone has to do all of this work in the backend to produce those insights, and that is what we are doing with Apache Spark. Once the transformations are defined, you can call ball_by_ball_df.show(); this is the action, and only once you run the action do all of these transformation blocks actually execute and the output get printed. Until then nothing runs: Apache Spark builds the logical plan and the physical plan in the backend and only then gives you the final output, so it creates this entire graph behind the scenes and you print the result by calling the action. Right now it is running all of the transformation blocks in the backend and then giving us the final output. Again, I go into detail about every single one of these things in my Apache Spark course; it is the most in-depth course you will find on Apache Spark with Databricks, I created all of the notes myself and spent five months building it, so if you really want to understand Apache Spark in depth, that is the place. As you can see, the final output is available here. Now we have to do the same kind of work for the other DataFrames, and I have already written the code so we don't waste time repeating ourselves. This part is for match_df. For match_df I import a few functions, year, month, and dayofmonth, and first I extract the year, month, and day from the match date so that we can analyze data at the year, month, and day level. Then we create some more conditional logic: categorizing the win margin into high, medium, and low. Based on the win margin we create a new column; win_margin is a column available in our data, and if you go to the data dictionary you will see it in the match file. The win margin basically tells you by how much a particular team won the match: maybe they didn't lose a single wicket, or they won by a large number of runs. If you understand cricket, you will understand the win margin; in basic terms it is how comfortably a team won, measured in runs or wickets. It is really important to understand the domain if you want to understand the data, because if you don't understand cricket, this data might not make much sense to you. So: if the win margin is greater than or equal to 100 it means high, if it is between 50 and 100 it means medium, and otherwise it is low; we are just creating that new column. Then we analyze the impact of the toss, which is what we talked about: if the toss winner equals the match winner we write yes, otherwise no, so if a team won both the toss and the match we record yes. All of these are basic transformation blocks where we think about what we can extract from the existing data, and then you do the show at the end.
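A sketch of those match-level transformations, assuming column names such as match_date, win_margin, toss_winner, and match_winner:

```python
from pyspark.sql.functions import year, month, dayofmonth, when, col

# Break the match date out into year / month / day for date-level analysis.
match_df = (
    match_df
    .withColumn("year", year("match_date"))
    .withColumn("month", month("match_date"))
    .withColumn("day", dayofmonth("match_date"))
)

# Categorize the win margin into high / medium / low.
match_df = match_df.withColumn(
    "win_margin_category",
    when(col("win_margin") >= 100, "High")
    .when((col("win_margin") >= 50) & (col("win_margin") < 100), "Medium")
    .otherwise("Low"),
)

# Did winning the toss translate into winning the match?
match_df = match_df.withColumn(
    "toss_match_winner",
    when(col("toss_winner") == col("match_winner"), "Yes").otherwise("No"),
)

match_df.show(2)
```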
I will just show two rows, because I don't want to print all of the data; if you run it, it generates the final outcome. That was for the match data; now I will do the player data. For the player DataFrame we import lower and regexp_replace, which basically means that first we will normalize and clean the player names: if there are any extra characters in the player name that are not needed, we use a regular expression to replace them with an empty value, so any garbage values get cleaned out. Then we handle the missing values for the batting hand and the bowling skill by filling them with "unknown" whenever we don't have a value. Then we categorize players based on the batting hand: if it contains "left", the player is left-handed, otherwise right-handed. Again, based on your own understanding you can create more columns like these. Then I do show(2), which runs the whole transformation and the final action. As you can see, it created the new columns; for example, country name India, player name SC Ganguly, date of birth null, the raw batting hand "Left-hand bat", and the derived batting style. Because "Left-hand bat" is not an ideal string to store, we restructure that column into a clean standardized value, so that we have one consistent type of string in the column and it is much easier to work with later. Then we handle the player-match data: player_match contains information about a player in a particular match. Here we are adding a veteran status based on the player's age: if it is greater than or equal to 35, we mark the player as a veteran, otherwise not. There is also a filter to include only players who actually played in the match, excluding the bench players, but for now let's remove that; we don't really need it and we will keep all of the players. Then we have a dynamic column to calculate the years since debut, which basically means: for the season year when this player appeared, how many years have passed from the current year. We create a new column based on that and then do the final show; I should have limited the show, because as you can see it is printing all of the values in the DataFrame. For the team data, let me check the columns: we have three pieces of information, team_sk, team_id, and team_name, so we don't really have to do any transformation on it. You could do some basic regex cleaning, but the team table only has three columns, so we will not transform it. If you want, you can just sit and explore this data: do a show and try to understand all of the column values; that is an exercise you can do at the start. So we were able to write the basic transformation code using Apache Spark: we read the data, we created the struct types, and based on that we wrote the transformations as per our understanding.
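A sketch of the player and player-match transformations just described; the column names here (player_name, batting_hand, bowling_skill, age_as_on_match, season_year) are assumptions, so adjust them to whatever the actual files use.

```python
from pyspark.sql.functions import (
    lower, regexp_replace, when, col, current_date, year
)

# Normalize player names: lowercase and strip characters we don't want.
player_df = player_df.withColumn(
    "player_name",
    lower(regexp_replace("player_name", "[^a-zA-Z0-9 ]", "")),
)

# Fill missing batting / bowling information with a placeholder value.
player_df = player_df.na.fill({"batting_hand": "unknown", "bowling_skill": "unknown"})

# Derive a clean batting style from the raw batting-hand string.
player_df = player_df.withColumn(
    "batting_style",
    when(lower(col("batting_hand")).contains("left"), "Left-Handed")
    .otherwise("Right-Handed"),
)

# Player-match level: veteran flag and years since debut.
player_match_df = player_match_df.withColumn(
    "veteran_status",
    when(col("age_as_on_match") >= 35, "Veteran").otherwise("Non-Veteran"),
)
player_match_df = player_match_df.withColumn(
    "years_since_debut",
    year(current_date()) - col("season_year"),
)

player_match_df.show(2)
```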
As an assignment, what I will suggest is that you look at this data and think about what more you can extract, and you can also use ChatGPT for that. You can copy the column names and ask ChatGPT to generate some insights or create new columns based on your understanding; you can ask something like "using this data set, can you generate some insights and create new columns?" and it will generate some code for you. But before you ask ChatGPT, I highly encourage you to at least spend some time thinking by yourself: if you understand cricket, what can you extract from this data? I have the information about both teams, so can I find the winning probability of one particular team when they play against another team? You can do that. You can also find insights based on the country name, or from the toss winner; there is so much information available, such as man of the match. You can find out how many times a player has been selected as man of the match, and you can also categorize players: if a particular player has won, say, five or more man of the match awards, you can mark him as a high impact player; if it is between three and five, a medium impact player; and if it is less than two, a low impact player. All of these things you can do by yourself; I just showed you the basic syntax, but this is how you can analyze the data on your own. Now what I want to do is some SQL analysis. I was able to write all of this code and transform the data, so now I want to generate some insights from it: we did the transformation, we cleaned our data, now let's try finding the insights. For that we need to convert these data frames into a SQL table type format, and you can do it something like this: ball_by_ball_df.createOrReplaceGlobalTempView, a syntax available in Apache Spark if you want to turn a data frame into a temporary view so you can write SQL queries against it, and I will just give the view name as ball_by_ball. I will do the same for every data frame: match_df, player_df, player_match_df and team_df each get their own view, named match, player, player_match and team. What we just did is convert each data frame into a SQL table so that we can query it. Now I can start writing the SQL queries, and first of all let's find an insight such as the top scoring batsman per season.
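Before the queries, here is roughly what that view registration might look like; the data frame names are the ones assumed earlier, and this uses the plain temp view form that the video eventually settles on after hitting the global temp view error described below.

```python
# Register each data frame as a temporary view so it can be queried with spark.sql
# (sketch; data frame names are assumptions carried over from the earlier steps).
ball_by_ball_df.createOrReplaceTempView("ball_by_ball")
match_df.createOrReplaceTempView("match")
player_df.createOrReplaceTempView("player")
player_match_df.createOrReplaceTempView("player_match")
team_df.createOrReplaceTempView("team")

# Note: createOrReplaceGlobalTempView would register the view in the global_temp
# database instead, so it would have to be queried as global_temp.ball_by_ball.
```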
I will set this up as top_scoring_batsman_per_season = spark.sql(...): spark.sql is the API where, if you want to write SQL queries, you put them between triple quotes so you can write multi-line code, and then I can just start writing my query. What I really want to find is the top scoring batsman per season, so I can group by the season and find the top scoring batsman. Pretty simple: I select from ball_by_ball first of all, because it has all of the information about every ball, then let me format this properly and join on match, because match has the information about the different match IDs; with match aliased as m and the ball table aliased as b, the condition is b.match_id = m.match_id. Again, you can check the schema to understand this data: in the ball table the match ID is the unique ID of the IPL match, and in the match table I have the information about all of the matches that have happened. What I really want is the top scoring batsman, so I need to go from the ball to the player information in order to get both the player and the match details. Then I join player_match, aliased as pm, first at the match level with m.match_id = pm.match_id, and within that particular match I also need the player who was batting at that time. That is exactly what the striker column in ball_by_ball gives us: because the batsmen exchange ends on the pitch, the striker column tells us which player was facing that particular ball at that moment. So I join the striker with pm.player_id: this player was the striker facing that particular ball, which gives me the player information for every ball. We are going to a very granular level here, from the ball to the match to the player match to the player, and we are finding the join connections between all of them to build the final analytics. Then we join the player information, aliased as p, on p.player_id = pm.player_id, so we are joining directly onto the player. Now, since I want the top scoring batsman per season, I select p.player_name and m.season_year (as I showed you, the match table has the season year), and then the sum of the runs scored as total_runs. I group by those two columns, GROUP BY p.player_name, m.season_year, and I can also order by m.season_year and total_runs in descending order, something like this. Again, if you are not able to follow this particular query, that means you need to understand the SQL syntax first, so check the basic foundation course on Python and SQL, go for the data warehouse course to understand data warehousing, and the Spark course if you want to learn everything about Spark.
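Putting the whole query together, a sketch might look like this; the table names come from the views above, and column names such as runs_scored, striker, match_id and season_year are assumptions based on the IPL schema rather than guaranteed to match the video exactly.

```python
# Top scoring batsman per season (sketch with assumed column names).
top_scoring_batsman_per_season = spark.sql("""
    SELECT
        p.player_name,
        m.season_year,
        SUM(b.runs_scored) AS total_runs
    FROM ball_by_ball b
    JOIN match m
        ON b.match_id = m.match_id
    JOIN player_match pm
        ON m.match_id = pm.match_id
       AND b.striker = pm.player_id
    JOIN player p
        ON p.player_id = pm.player_id
    GROUP BY p.player_name, m.season_year
    ORDER BY m.season_year, total_runs DESC
""")
top_scoring_batsman_per_season.show(5)
```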
So I just showed you the basic example. If I run this, hopefully it runs. Okay, I'm getting the error that the table ball_by_ball cannot be found. Let me execute this again: ball_by_ball is not found, but it should be available in the schema catalog. I think we made a mistake here: the syntax should be createOrReplaceTempView instead of createOrReplaceGlobalTempView; a global temp view is a completely different topic, so I'm just going to replace it with the plain temp view. If I rerun the view creation and then the query, this should work. It is connecting and executing the command. Now we get the error that b.striker cannot be resolved, so let me go back up and check whether we have the striker information in the data frame. If I print the column names with ball_by_ball_df.columns, do we have the striker column? We have striker_sk, and the striker column should also be here, so something must be off; let me check. Yes, the striker information is available, it turns out we simply had a spelling mistake in the column name, the letters were out of order. Now if I run it, it executes, and if I do show(5) I get the top five top scoring batsmen per season. Let's wait for this to run, one second, two seconds, three seconds, four seconds... we have water, we can also hydrate. As you can see we have the player name, the season year and the total runs, and just by looking at the data you can understand it: SE Marsh scored a total of around 614 runs, and we have Gautam Gambhir, Jayasuriya, Watson and Smith for the 2008 season; these were the players who scored the highest runs. Let's do show(30) just to get more visibility on the data and see what we get in the final output. Let's wait. Now we have more rows available: for 2008 we have all of these players, from Marsh down to Tendulkar, who is at around 188. So this is the top scoring batsman per season; you can also filter on top of it, 2008, 2009, all of the seasons are there, and we have data until 2017. That was one piece of the analytics, and you can build more on top of it. I'll just grab the next SQL query, which is the economical bowler in the power play: which bowler was the most economical in the power play? The power play is basically the phase where most of the fielders have to stay inside the circle, only two are allowed outside as far as I know. Again the SQL is pretty simple: we take the player name from the player table, the average runs conceded, and the bowler wickets, the total wickets taken; then we join the ball, player match and player tables where the over ID is less than six, group by the player name with a ball count greater than 120, and order by the result. If I run this, the output will make clear what the query did; if you know basic SQL it is a very simple query. All we are doing is joining the tables on their common columns, then aggregating and counting.
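A sketch of that power play query might look like this; bowler, runs_scored, bowler_wicket and over_id are assumed column names, and bowler_wicket is assumed to be a 0/1 flag per delivery.

```python
# Most economical bowlers in the power play (sketch with assumed column names).
economical_bowlers_powerplay = spark.sql("""
    SELECT
        p.player_name,
        AVG(b.runs_scored) AS avg_runs_per_ball,
        SUM(b.bowler_wicket) AS total_wickets
    FROM ball_by_ball b
    JOIN player_match pm
        ON b.match_id = pm.match_id
       AND b.bowler = pm.player_id
    JOIN player p
        ON pm.player_id = p.player_id
    WHERE b.over_id <= 6          -- power play overs; adjust if over_id is zero-indexed
    GROUP BY p.player_name
    HAVING COUNT(*) > 120         -- only bowlers who bowled a meaningful number of balls
    ORDER BY avg_runs_per_ball
""")
economical_bowlers_powerplay.show()
```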
What we get is the average runs conceded and the total wickets taken by each bowler in those overs. As you can see we have names like Sena Nayak, then Nannes, McGrath, Malinga; this is the average run per ball, so the first bowler conceded the fewest runs, the next one the second fewest, and so on, which gives us the final analytics based on the average runs conceded. We can also change the condition so that the total bowler wickets is equal to one or more; based on the wickets we can modify this query so it gives us every bowler who took at least one wicket. Let me see if we have any wickets here; okay, we don't, but these are the economical bowlers who gave away the fewest runs, and as you go down the list the average keeps increasing. Like this you can find more insights. You can also look at the toss impact on individual matches. Very simple: we get the match, the toss winner, the toss decision and the match winner, so for a particular match and toss we can see that, okay, Royal Challengers won the toss and chose to field, Kolkata won, so the match outcome was a loss: even though they won the toss they lost the match. The Super Kings chose to bat and they won. Like this you can start identifying the impact of the toss on the match. Then we can have a query like average runs in wins: when a particular team won a match, how much impact did a particular player have through the runs they scored in that win? Here we get the player, the average runs scored and the innings played, so we see how much impact each player had on the final outcome of the matches their team won. The joins are pretty simple, the same standard joins we have been doing until now. As you can see we have Rashid Khan with his average runs in wins and the innings played; this output gives the average runs in those winning matches, so this player averaged around six runs, that one around two. So this was about using SQL to get the final analytics on top of the data; this is not an analysis video, I just wanted to give you the basic idea. Now we can also do the final visualization, something like this. Again, I don't want to go into the detail of the visualization code, because that is basic Python and leans more towards data science and data analysis, but I just want to show how you can visualize this data. Using the economical bowlers in the power play result, we convert it into a pandas data frame and then visualize it using matplotlib: we define the figure, the dimensions and the column names. Running it, we get the error that plt is not defined, so let me just add the import, which is basically import matplotlib.pyplot as plt.
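A sketch of that conversion and bar chart might look like this, reusing the economical_bowlers_powerplay data frame assumed above.

```python
import matplotlib.pyplot as plt

# Convert the Spark result to pandas and plot the top 10 most economical bowlers (sketch).
economical_pd = economical_bowlers_powerplay.limit(10).toPandas()

plt.figure(figsize=(12, 6))
plt.bar(economical_pd["player_name"], economical_pd["avg_runs_per_ball"], color="skyblue")
plt.xlabel("Bowler")
plt.ylabel("Average runs per ball")
plt.title("Top 10 most economical bowlers in the power play")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```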
I'll copy this, paste it here, and if I run it now it should plot the whole graph for that analysis, the economical bowlers, as a proper chart. For the bar chart, as you can see, we provide the bowler name, the average runs conceded and the color, and the rest is label information. So we have the average runs per ball, and these are the top 10 economical bowlers in the power play; we have names like SM Harwood and the other bowlers, and you can easily plot all of them just like this. You can also plot the impact of the toss, using the data frame we created for the toss impact on individual matches; again we provide the x and the y, and we get the error that sns is not defined, so we need import seaborn as sns, I think that is the syntax. If I run this, hopefully it works, and as you can see we have the match outcome, won and lost, the number of matches, and who won the toss. On the x-axis we have the teams that won the toss, and the bars show whether they then won the match or not. For the Royal Challengers there is some dummy data we don't have full information for, but from this graph we can see that whenever the Royal Challengers win the toss they are more likely to win the match, while the Mumbai Indians, whenever they win the toss, are more likely to lose. Like this you can read conclusions straight off the graph; for Pune Warriors we have the older data as well, but just from the visualization you can understand the impact of the toss outcome on winning the match. Like this you can create more graphs: this one is the average runs in wins, the average runs scored by a batsman in winning matches. You can also look at the scores by venue: based on the different venues available, I get the information from the SQL query, plot the result as a bar chart, and you can see the whole picture here. As you can see, the stadium at the top has the highest scoring matches, then we have the other association grounds, the Punjab Cricket Association, Green Park, going from the highest scoring to the lowest scoring venues. You can also look at the dismissal types, which is basically how a batsman got out, so let's wait for this one. We have some "not available" data here, but we also see players who got out by a catch, then bowled, run out, keeper catch, lbw, stumped, caught and bowled, hit wicket; all of this tells us how each player was dismissed. Like this we can also look at the team toss winning performance for the different teams; again these are SQL queries that you can go through if you want to understand them. This is the team performance after winning the toss: the Chennai Super Kings perform the best, then the Mumbai Indians and the Kolkata Knight Riders; whenever they win the toss, this is how they perform.
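Going back to the toss chart, a rough sketch of that seaborn count plot might be the following; the data frame name toss_impact_individual_matches and its toss_winner and match_outcome columns are assumptions for illustration.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Count of matches won/lost by the team that won the toss (sketch with assumed names).
toss_pd = toss_impact_individual_matches.toPandas()

plt.figure(figsize=(14, 6))
sns.countplot(data=toss_pd, x="toss_winner", hue="match_outcome")
plt.xlabel("Team that won the toss")
plt.ylabel("Number of matches")
plt.title("Match outcome after winning the toss")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```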
Again, that was the entire thing about Spark. The goal of this video was to give you a quick overview of Apache Spark and how to utilize its functions: I gave you an understanding all the way from getting the data, to reading it, defining it in a proper structured format, transforming the data, doing some basic SQL analysis, and then visualizing that data at the end, so that you have a basic understanding of what is going on in the data. This was not just data engineering: it included everything from data engineering and data science to data analytics, all in one project using Apache Spark. The center of this entire video was Apache Spark and teaching you how to write the different syntax; that is the reason I did not go deep into the plotting or into writing the SQL queries, because all of those things are prerequisites while you are learning Apache Spark. If you are a complete beginner with no background in these topics and you still watched this video until here, then you can definitely check out my courses. We have courses available on Python, SQL, data warehousing using Snowflake, and Apache Spark using Databricks. These are in-depth courses where you will get two projects on Apache Spark: this project on AWS and another project on Azure, with all of these modules covered. On top of that you also get in-depth notes so that you can easily revise each topic. I put a lot of effort into building these courses; they are mainly for people who want to understand a particular topic in depth. So if you want to learn Apache Spark in depth, or Python, or data warehousing, you can enroll in these courses. I will keep making these videos on the YouTube channel, where I show the actual project execution from start to end, and this was one of those projects: I wanted to give you a quick overview of Apache Spark and how we can write the code. So this was everything about this video. I hope you learned something new; I just wanted to give you a good overview of Apache Spark if you are a beginner. If you loved this video and learned something new, don't forget to hit the like button, share this video with someone who might find it useful, let me know what you want to see next as a project, and I'll see you in the next video. Thank you.