In this video we will take some stock market data, use Python to produce that data and put it onto a Kafka cluster. After that we will consume that data and store it on Amazon S3, crawl it to build a Glue catalog, and analyze it using Amazon Athena with SQL. So there is a lot to do here, and without wasting time, let's get started. Hey guys, welcome back. In this video we will be designing a stock market data processing engine using Kafka. I will divide this video into three parts: in the first part we will talk about the prerequisites for this project, in the second part we will understand the basics of Kafka and go through the architecture diagram in detail, and in the third part we will do the actual hands-on work. If you are seeing me for the first time, my name is Darshil, I'm a freelance data engineer, and on this channel we mainly talk about data engineering, productivity and freelancing. If that is something you want to learn, make sure you hit the subscribe button, and like this video if you learn something new from it. Let's start with the basic prerequisites for this project. First, you need a laptop and an internet connection, because we will be doing everything in the browser on a cloud machine. Second, you need Python installed on your PC or laptop and a basic understanding of Python; anything above version 3.5 works fine, so don't worry about that. Lastly, you will need an AWS account, because we will be doing a lot on AWS: storing our data on S3, hosting our Kafka server on an EC2 machine, running a crawler, and querying the data using Amazon Athena. Make sure you have all of these set up; if you don't know how, I will put links in the description, as I already have tutorials on all of these things. For Python, make sure you have Jupyter Notebook installed, because we will be writing our code inside it. Jupyter Notebook is an interactive Python environment, it looks something like this, so you can easily run your code, check the output and move forward. Now, a few things to remember while executing this project. You will face a lot of errors, and when you face an error, go online and try to solve it by yourself first. If you rely on someone else to solve it for you, you will never learn. So if you don't understand a concept or you hit an error, research it yourself first, and if you still can't find the solution, ask me in the comments. Most of the time, when people face an error they either give up or just write it in the comments and wait for someone to reply; that is not how things work in the real world. On the job, when you face errors you will have to solve them yourself. If you don't understand something, you can read blogs, read books or watch more videos. This video will give you the confidence to start your career in data engineering and see real-time streaming in action; I will show you how these things work at the concept level and at the execution level. I hope you have fun executing this project, and without wasting time, let's understand the basics of Kafka and all of the terminology attached to real-time streaming. First, let's understand why we even need
real-time streaming in the first place. You use a lot of real-time applications on a day-to-day basis: Google Maps, Uber, Amazon. All of these applications work in real time. When you order something from Amazon, you get a notification confirming your order within seconds, then another when the order ships, then another when it is on its way to your home; all of these updates reach you in real time, whenever the event is generated it gets sent to you. The same is true for Google Maps or Uber, everything happens in real time. These are basic applications of a real-time streaming engine. So what is Kafka? Apache Kafka is a distributed event store and stream processing platform. What does that really mean? Let's look at the two parts: distributed event store and stream processing. An event is basically something you do online: when you click on something in an application, when you like something, when you share something, all of these are events. A distributed event store means all of those events are stored across different machines on a network. Then we have stream processing. Stream processing is the act of performing continual calculations on a potentially endless and constantly evolving source of data. In simple terms, it is like water flowing in a river: it is continuous, constantly evolving, it keeps moving forward and there is no end to it. In the real world, when you are moving from location A to location B, every step you take generates an event that is sent to the server. One stream example is web analytics: when you use your phone, open an application, click on something, like a tweet, all of these actions get recorded and sent to the server. There are many applications of stream processing. When you make a transaction online and the credit card company detects that it is a fraudulent transaction, it notifies you in real time. Another is real-time e-commerce, as in the case of Amazon. A third is banking: we use UPI every day, and whenever you make a transaction you get the notification in real time. If you received the notification after 30 seconds or even a minute, you would start to worry: you sent the money, but the other person did not receive it and got no notification. All of these things have to happen in real time. So that was the basics of real-time streaming and why we need it; now let's dive deeper into Kafka. Kafka has several components, and one of them is the producer. A producer is simply something that produces data, it is that simple. We have data coming from sensors, from the stock market, from web analytics; whatever you do online, the thing generating that data is a producer, and all of that data is sent to the Kafka server. Then we have the broker, a single member server of a Kafka cluster. A cluster is basically a group of machines, so in the case of EC2, if you have n machines, those form one cluster, and one single node or server is called a broker in Kafka terms. You can call it a server, but Kafka calls it a broker; it is just a fancier name for a single server or node. So one EC2 machine is one node or one broker, and a collection of those brokers is called a cluster. So we have a
producer, then we have multiple brokers inside one cluster. In our architecture we have just one broker, but you can have multiple brokers and build a cluster out of them. Then we have the consumer. It is very simple: you have a producer that produces data, in between you have the Kafka cluster or broker where the data is written, and the consumer consumes that data from the broker and uses it for some kind of activity. So: producer, then the Kafka broker or Kafka cluster, then the consumer. Let's understand all of this in a bit more detail. This is the technical architecture of Kafka, and as you can see there are a lot of components, so let me explain the architecture in a simple manner and then we will get into the technical side of it. Again, we have producers, multiple producers; data coming from multiple sources basically means multiple producers. Then we have the Kafka cluster, and inside the cluster we have multiple brokers, which we already covered. This could be one EC2 machine, a second EC2 machine and a third EC2 machine; we install Kafka on them and that becomes one Kafka cluster. After that we have the consumer that consumes the data, and we can have one consumer or multiple consumers. Multiple producers, multiple servers and multiple consumers: that is what makes Kafka distributed. Now, because all of these pieces are distributed, we need some kind of manager that makes sure everything works properly, and this is where ZooKeeper comes into the picture. If you have a basic understanding of the Hadoop ecosystem, you will know ZooKeeper as a coordination service. In Kafka's case, ZooKeeper makes sure all of the servers are running properly; if one broker dies or fails, it notifies the other brokers about the situation and makes sure the data stays in a proper state. ZooKeeper also manages access control lists and other security-related configuration. It is basically the manager of the whole Kafka cluster; that is all you need to understand here. If you want to understand ZooKeeper in detail, you can read more about it. So how does Kafka use ZooKeeper? Cluster management, failure detection and recovery, and storing access control lists and secrets, as we already mentioned. These are the basics of ZooKeeper: it is an open-source Apache project and a distributed key-value store, so all of the information is stored as keys and values. It maintains configuration information, stores access control lists and secrets, and enables highly reliable distributed coordination. We have multiple servers and brokers working together from different locations, and ZooKeeper makes sure all of those servers are working properly and provides distributed synchronization, so that things like data replication happen correctly. So we have covered the producer, the consumer, ZooKeeper and the Kafka server; now let's understand topics. Topics are basically logical buckets inside a Kafka broker. Let's take the case of Amazon again. You can create topics for multiple things according to your needs: you could create one topic for orders created, so whenever any customer creates a new order, that event gets sent to that topic. You can also create a topic for
returned orders, so whenever someone returns an order, that event gets sent to that particular topic. In this way you create different sections inside the Kafka broker and manage your information properly, and you do that by creating topics: a topic for orders created, orders returned, orders shipped, and all of the different events happening inside your application. The producer sends data into a particular topic, and on the consumer side, if I am only interested in returned orders and don't want to deal with all of the other information, I can subscribe only to that particular topic. So topics are the logical representation inside the Kafka broker. Inside each topic you can define partitions; a topic can have from one partition up to n partitions based on your requirements. A topic consists of one or more partitions, and a partition is a log which provides an ordering guarantee for all of the records contained within it. This is how the data is actually stored: you have data coming in, then you have topic A and topic B, and the data is stored in multiple log files, where each log file is a different partition. You can partition by something like customer ID: again in the case of Amazon, if you have 100 customers you could have 100 different partitions, and all of the information for a particular customer gets stored inside a single partition. Then, if you want the data for one customer, you only look at one file rather than scanning through all of them. That is just one example; you could also partition by date or by whatever your requirements are. So, inside the cluster you have one broker or many brokers, inside a broker you have topics, and inside a topic you have partitions, and all of these things are replicated across different brokers, so if one broker fails you can still access your data from the other brokers. As you can see here, the producer writes partition A's data to broker 0 and partition B's data to broker 2, but all of that data is also being replicated to other brokers: partition A's data is available on broker 1 and broker 2, and the same goes for partition B. So when broker 0 dies, broker 1 takes its place and the data keeps being written to broker 1. That is how it works in basic terms. There are a lot of technicalities behind this, how it all works and how ZooKeeper decides these things, but we don't want to dive deep into every detail because it would take a lot of time; I would have to create an entire course around Kafka that would be 10 to 15 hours long. For now, just understand the basics, move forward, and explore more on your own. So we have data, inside a broker we have topics, and each topic is partitioned using log files. What is a log file? A file in which incoming events are written to the end as they are received: the first entry comes in, then two, three, four, five, and any new event that arrives is appended at the end of the file. That is how the stream stores its data, from the earliest event to the latest, all appended to the log file.
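The video creates its topic later from the command line, but purely as an illustration of topics and partitions, here is a hedged sketch of how a topic with three partitions could be created from Python using the kafka-python admin client; the broker address is a placeholder, and with a single broker the replication factor can only be 1.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Placeholder broker address: replace with your EC2 public IP and port.
admin = KafkaAdminClient(bootstrap_servers="<ec2-public-ip>:9092")

# One topic, three partitions, replication factor 1 (we only have one broker).
topic = NewTopic(name="demo_test", num_partitions=3, replication_factor=1)
admin.create_topics(new_topics=[topic])
```

With more partitions, Kafka can spread a topic's messages across brokers and let several consumers in a group read in parallel; in this tutorial one partition is enough.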
Now, the broker basics; we already discussed this, but let's go over it again in a little more detail. We have a producer that sends messages to the broker; the broker receives and stores the messages; a Kafka cluster can have many brokers, and each broker manages multiple partitions, as you can see in this diagram. For broker replication, which we already talked about, partition 0's data keeps being replicated across multiple brokers, here brokers 102 and 104, and the same goes for the other partitions. So we have covered a lot of terminology; now let's look at how to actually use it. Producer basics: a producer writes data as messages, and you can write your producer in various languages such as C, C++, Java, Python, Go or JavaScript; it's up to you. Consumers consume data from one or more topics, and you can also write them in different programming languages or attach other tools. Those were some of the key terms attached to Apache Kafka. Here is a short glossary: a stream is an unbounded sequence of ordered, immutable data; stream processing, which we already talked about, is continual calculation performed on one or more streams; immutable data simply means data you cannot change. We also covered events, brokers, clusters, nodes, ZooKeeper, data partitioning and more, and this is everything you need to understand to execute this project. I highly suggest you go online and read about all of these things from a technical point of view to understand them better. So, we understand the basics of Kafka; now let's understand our architecture diagram in detail. First we have a dataset; don't worry, I will give you the dataset. We will have a static dataset stored on our machine, then we will write Python code that simulates a real-time stock market application: we will read data from the dataset and produce events that simulate a stock market feed. Then we will add producer code that pushes those events into the Kafka broker, which we will have already installed on our EC2 machine. Then we will write consumer code that consumes that data and puts it onto Amazon S3. Amazon S3 is object storage, so you can store any type of file: images, audio, video, CSV files. Then we will run a Glue crawler, a service on AWS that crawls the schema out of the different files and builds a catalog, the Glue Data Catalog, so you can query directly on top of those files. We will run the crawler once and then keep storing more files, the events coming from the Kafka consumer, so that we can query on top of them directly and see what is happening, and we will do all of that querying in Amazon Athena. You will understand all of this in detail when we actually do the project; for now just understand the big picture, and after that we will dive into all of it. I hope you understood the basics of Kafka; if you did and you learned something new, at least hit the like button, it helps this channel grow and reach more people. Now let's start with the actual execution of the project. I hope you understood the basic concepts of Kafka and you already have all of the prerequisites we talked about, such as Python installed and an AWS account. This is our architecture diagram, which we already went through; we will start the project by installing Kafka onto an EC2
machine, and then test a producer and consumer on Kafka. So let's get started with the execution. Let me go back and open EC2. Once you log into your AWS account, go to EC2, click on Instances, and click Launch instance. Give it whatever name you want, such as kafka-stock-market-project. After that we keep all of the settings at their defaults: Amazon Linux as the image and a t2.micro instance type, which comes under the free tier, so you won't be charged anything. Then click Create new key pair; we will use this key to connect to the EC2 machine from our local computer. Give it whatever name you want, for example kafka-stock-market-project, click Create key pair, and it will download automatically. I already have a key pair on my machine, so I'm going to use that. Once you download your key pair, create a folder and store the key there; as you can see, my kafka-stock-market-project key is already stored on my local machine. Keep everything else as it is, you don't need to change anything, and click Launch instance. You can click through and see that your instance is pending; it takes one to two minutes for the instance to start, and once the status changes to running we can move forward with the actual execution. Now our instance is running, you can see the status here. Click on the instance ID, click Connect, and you can use the command shown there to connect to your EC2 instance over SSH. If you are using Windows, open CMD; if you are using a Mac, open the terminal; then go to the folder where you saved your key. In my case it was in Documents, so cd into the kafka-project folder, and if I run ls you can see the key there; let me zoom in a little. Then you just paste the SSH command, answer yes, and you are connected to the EC2 instance. Now, if you are on a Mac and you are not able to connect because you get a permission error, here is the fix, and this is just for Mac: if you get "permission denied", run chmod 400 followed by your key name, the Kafka key. After that you won't get the permission denied error and you can run the SSH command to connect to the EC2 instance. On Windows you generally won't get this error, but some Mac users do, so I wanted to mention it. Once you are connected to the EC2 instance, we need to run a few commands to install Apache Kafka on the machine and then test a few things, so let's start with that. I have already prepared the list of commands we want to execute. First, we download the required files onto our EC2 machine: paste the wget command, which downloads the compressed Apache Kafka distribution onto your EC2 instance, so we can use those files to run our Kafka server. After Kafka is downloaded, you can check it by running ls. Now
what we need to do is uncompress it. After extracting, if you run ls you can see we have one folder plus the compressed file. Now, as we already discussed, Apache Kafka runs on top of the JVM, the Java Virtual Machine, so we also need Java installed on this machine. If you check with java -version, you won't find anything yet, so run sudo yum install java-1.8.0, which installs Java 1.8 on your EC2 instance. You could also run Kafka on your local machine, even on Windows, because it is standalone and runs anywhere Java is installed. So if you don't want to install Kafka on an EC2 machine, you can use your local computer, but keeping in mind that everyone uses a different OS, I chose an EC2 machine so it is easier for everyone to follow along, and you also get hands-on experience working with the AWS cloud. Now check the Java version and you will see 1.8. After that we need to start two things: first ZooKeeper, and second the Kafka server. That is pretty simple: clear the screen, cd into the Kafka folder we just extracted, and run bin/zookeeper-server-start.sh config/zookeeper.properties; we will talk about that properties file in a few minutes. Hit enter and it starts your ZooKeeper server. Once ZooKeeper is running, open a new terminal window with the basic profile and connect to your EC2 machine again, because in one window we will keep ZooKeeper running and in the second window we will start the Kafka server. The procedure is the same: navigate to the same folder where the key file is stored, the kafka-project folder, copy the SSH command from the EC2 Connect page, hit enter, and you are connected to the EC2 machine. Next, we want to adjust the memory for our Kafka server; as you already know, we are running Kafka on a single small EC2 machine, so we want to assign a specific amount of heap memory to it. If you want to read more about this you can look it up online, but for now just copy the memory command from the command list, paste it here, and it allocates that amount of memory to the Kafka server. After that, start the Kafka server: again cd into the Kafka folder and run bin/kafka-server-start.sh config/server.properties. That starts the Kafka server, so now we have ZooKeeper running and the Kafka server running. But there is a problem: we cannot use this server as it is, and the reason is the address it is advertising. As you can see in the logs, this is the address where our Kafka server is running, and this is the DNS address we would use to access it; we will be connecting to the Kafka server from our local machine to send and receive data. If you look at your EC2 details, this is the private DNS, an address starting with 172.31, and you cannot access a private DNS from
your local computer unless you are in the same network. That is the problem, so we need to change this private address to the public one, so that we can reach the EC2 machine from outside its network. That is pretty simple. First, stop both servers; Ctrl+C (or Ctrl+Z) works, then clear the screen. We will be editing one file, config/server.properties, the same file we passed when starting the server; don't worry, I will provide the full command list in the description so you can follow along. Run sudo nano config/server.properties in one of the terminals and it opens the file. Scroll down to the line advertised.listeners=PLAINTEXT://your.host.name:9092, remove the # in front of it, and replace your.host.name with the public address of your EC2 machine, which in my case is 65.2.168.105. Press Ctrl+X, then Y, then Enter to save. You can open it again with sudo nano to check that the change was saved; we already have the saved version, so the change is in place. Now run the same things again: start the ZooKeeper server, and once ZooKeeper is up, start the Kafka server in the other window. Now our Kafka server should be advertising the public IP address, and if I scroll up a bit in the logs you can see the listener is PLAINTEXT on the public IP with port 9092. So we have ZooKeeper running and the Kafka server running. One more thing: we need to allow access from our local machine. On your EC2 console, click on Security, click on the security group, scroll down and click Edit inbound rules. We need to allow requests from our machine to the EC2 instance. You can allow all traffic from anywhere; I would suggest you choose My IP instead. I don't want to use My IP here because it would reveal my IP address on screen, so I'm going to choose Anywhere, but My IP will work too. Click Save rules, and your local computer now has access to the EC2 machine. Now, what we just did, editing the inbound rules to allow all traffic from anywhere, is not best practice; you should never allow all traffic from anywhere in the world, because it is not secure. Most of the time when you work in a company, a cloud or DevOps engineer will handle the security part, and as a data engineer you are more likely to work on the data side, building pipelines and applications. But since we are installing this purely for learning, I want you to understand these things from scratch, and for the sake of the tutorial we are not worried about the security side; we want to focus on the data engineering side. So we have ZooKeeper running, we have our Kafka server, and we edited the inbound rule so we can access the EC2 instance. Now we need one more window: open a new command prompt (the same process on Windows), go to Documents, then the kafka-project folder, go to the EC2 Connect page, copy the SSH command, paste it and connect to the EC2 machine. So now we have two windows running, one for ZooKeeper and one for the Kafka server.
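Before moving on, it can be worth checking that the broker is actually reachable from your laptop over the public IP. This is just an optional sanity check, not something done in the video; the address below is a placeholder.

```python
from kafka import KafkaProducer
from kafka.errors import NoBrokersAvailable

try:
    # Placeholder address: use your EC2 public IP and the 9092 port we configured.
    producer = KafkaProducer(bootstrap_servers="<ec2-public-ip>:9092",
                             request_timeout_ms=5000)
    print("Broker is reachable from this machine.")
    producer.close()
except NoBrokersAvailable:
    print("Cannot reach the broker: check advertised.listeners and the security group inbound rule.")
```

If this fails, the two usual culprits are exactly the two things we just changed: the advertised.listeners line in server.properties and the security group inbound rule.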
The third window is where we will run a few things: we will create a topic, which we already talked about, then start a producer and a consumer and see Kafka in action. Again, the commands are already provided: cd into the Kafka directory, and if you run ls you will see all of the files; let me clear the screen. Then run the topic creation command, bin/kafka-topics.sh with --create, the topic name, and --bootstrap-server followed by your address; this is where you replace the IP address with your own. In our case, let me copy my public IP and paste it in, and you do this for all three commands: one for creating the topic, a second for the producer, and a third for the consumer. For the topic name, let's say demo_test. Copy the command and paste it here; it takes a moment, you can ignore the warning it prints, it doesn't matter, and then the topic is created. If you see that, you successfully created the topic. Now we need to start the producer: the same thing again, just make sure the topic name matches, demo_test in both places, or whatever name you chose. Copy the console producer command and run it in the same window; as you can see, you now get a prompt where you can type things, and this is our producer. We also need the consumer side, so let me open yet another window (a lot of windows open right now; let me zoom in a bit so you can see). Same process: cd into Documents, then the kafka-project folder, go to EC2 Connect, copy the SSH command, paste it, and I'm connected to my EC2 instance. Clear the screen, copy the console consumer command, make sure you are in the Kafka folder first, and paste it; this starts your consumer, and you can see a lot of things running in the background while it starts up. Now let's start typing: whatever you type on the producer, "hello", you will see it over here on the consumer. Let me adjust and minimize these two windows so you can see it clearly: "hello world". We are producing data on one end and receiving it on the consumer, through the Kafka server. Going back to the slides, what we just did is create the producer (manually for now, but we will put Python code on top of it), then we have the Kafka server, and then the consumer; so the middle part of the architecture is essentially done. And as you can see, this is pretty amazing: you are getting the data in real time; whatever you write here shows up on the consumer immediately. I can write whatever I want and you see all of that data arriving at the consumer in real time. This is just a terminal, so feel free to type whatever text you want, explore it and see what happens, and once you are done with that we can start with the Python code. Congratulations if you made it this far, because now we will use Python code to do all of this. We were able to create our producer and consumer and set up the Kafka server on the EC2 machine; now we will
start writing the code and do all of this programmatically. Instead of manually typing the data, we will push stock market data in real time, consume it, upload it to Amazon S3 and then do the further steps. Open your Jupyter Notebook, which we already talked about: click on the Python 3 kernel, name one notebook KafkaProducer, open another and name it KafkaConsumer. So we have our producer notebook here and our consumer notebook here. Now we need to install one package, which is called kafka-python: run pip install kafka-python. In my case it says the requirement is already satisfied because I have installed it before, but if you are installing it for the first time it will install from scratch. Then you need to import a few packages. I have already written the code so we don't waste time typing it out, but I suggest you write all of this code yourself so you get more hands-on experience. I'll just execute it: we are importing pandas, the KafkaConsumer and KafkaProducer classes from kafka, the time module, dumps from json, and json itself; you will understand why we imported each of these as we go. Once the packages are imported, we create the Kafka producer object; in one notebook we create the Kafka producer and in the other we will create the Kafka consumer. This is pretty straightforward: producer = KafkaProducer(...), which takes the bootstrap server, so we need to provide the right IP here, the same public IP of our EC2 instance, because we are connecting to the EC2 machine from our local machine; that is why we use the public IP. Now you understand why we didn't use the private IP: with the public IP we can connect to the Kafka server from our local machine and send data. It also takes a value serializer, because we need to send the data in a proper format; in our case we will be sending JSON data to the Kafka producer. Run this and the producer object is created. After that we can start sending data: write producer.send() with our topic name, which was demo_test, and set the value to whatever you want, such as a small dictionary like {"hello": "world"}. Run send, and if I open my consumer terminal you will see the message appear. You can send something else, such as a "name" key with your name as the value, and you will see the JSON value arrive there. So we are sending data from our local machine to the Kafka server on EC2 and we are able to consume it; we could do this in a loop and keep sending and consuming. That is the producer side of the Kafka code.
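Here is a minimal sketch of what that producer cell can look like, assuming the placeholder below is replaced with your EC2 public IP and using the demo_test topic from earlier:

```python
from kafka import KafkaProducer
import json

# Placeholder: replace with the public IP of your EC2 instance.
producer = KafkaProducer(
    bootstrap_servers=["<ec2-public-ip>:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # send dicts as JSON bytes
)

producer.send("demo_test", value={"hello": "world"})
producer.flush()  # make sure the message actually leaves the producer's buffer
```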
Now let's do the consumer side, which is also pretty straightforward. We import similar packages; in the consumer notebook you really only need to import KafkaConsumer (and in the producer notebook only KafkaProducer), it's up to you how you organize it. Then we have the consumer code: once the packages are imported, we repeat the same process, KafkaConsumer with the topic name, and here we again provide the public IP address, so let me copy and paste it, plus the value deserializer: on the producer side we serialized the data and encoded it as JSON, and here we simply load it and decode it from JSON. Run this, and once the consumer object is created you can loop over it: for c in consumer: print(c.value). If you run this, it starts consuming, so if I send some data from the producer you will see it here. Let me change the message, say a name and surname, run it, and you see the data; whatever you send, even random keywords, shows up on the consumer. So the thing we did in the terminal we have now replicated in code. What we need to do next is, instead of manually typing the data ourselves, load data from an API, a dataset or anything we want, put it into JSON format, keep producing and sending it in a loop, and have a consumer that keeps receiving that data and writing it to some target location. Look at how systematically we executed this project: first we started with the basics of Kafka and understood them, then we installed Kafka on the EC2 machine and saw how a producer and consumer work in real time, then we started with a small chunk, first connecting to Kafka on the producer side from code, then dealing with the consumer, and we connected the two together and saw the same behavior we had in the terminal. Now we just need to fill in one more step: instead of looking at the whole architecture diagram, we will focus on its first two steps, getting the dataset and simulating it into a real-time feed so that we keep getting data in real time, and after that we will do the rest, the crawling, the catalog and so on. So let's focus on that: you can stop the consumer cell here, and we will start on the producer side. Now, if you have access to a real-time stock market API, which is generally paid, you can use that, but in this case we will simulate the data: we will take an existing stock market dataset, run it in a loop, do some sampling on top of it so we get random rows, and send that data row by row to our Kafka producer, which pushes it to the Kafka server. So in our case the stock market streaming is basically existing stock market data turned into a real-time feed; because not everyone can pay for such an API, I wanted to show you how to create your own real-time data feed to work on the project. I already have a dataset and I will put the link in the description so you can use the same one, but you can use whatever dataset you want, you can download one from Kaggle if you like; it doesn't matter which dataset you use, the project works the same way.
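Before we plug the dataset in, here is a minimal sketch of that consumer cell, under the same assumptions (placeholder public IP, the demo_test topic):

```python
from kafka import KafkaConsumer
import json

# Placeholder: replace with the public IP of your EC2 instance.
consumer = KafkaConsumer(
    "demo_test",
    bootstrap_servers=["<ec2-public-ip>:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),  # decode JSON bytes back to dicts
)

for message in consumer:
    print(message.value)  # blocks and prints each event as it arrives
```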
This is what the dataset looks like: a very small dataset with columns such as index, date, open, high, low, close and volume. We will be using this indexProcessed dataset. First, let's read the data: if you are familiar with pandas you would write df = pd.read_csv("data/indexProcessed.csv"); hit enter, run df.head(), and you can see the data was read easily. This is pretty simple if you know the basics of Python and pandas. After that, I want random rows rather than the data in its original order, so I write df.sample(1), which gives me one random row from the entire data frame; if I keep running it, we keep getting different rows. I just want some randomness in the data; we are not really trying to find insights in it, we are mainly interested in the operational side, how the data moves from one place to another, and that is why we take random samples. Next, we convert this into JSON format, that is, into a dictionary: call .to_dict() and it converts the row into a dictionary-like format, and you should also pass orient="records", so that the index key maps to the actual value, the date key maps to the actual value, and so on. If you don't pass orient, you get a nested structure where "index" contains another key which contains the value; we don't want that, we want a flat JSON structure, so use orient="records", and since the result comes back inside a list, take element zero. That gives us one proper JSON-style dictionary extracted from the data. Now it is simple: put this inside a loop with while True, indent it, and call producer.send() with the topic name, demo_test, passing the sampled stock market dictionary as the value; everything else stays the same. Let me run this, and while it is running let's open the consumer terminal. As you can see, we are getting this data in real time: from one source you are inserting data row by row very fast, and you can see it arrive on the other end. But we sent a lot of data in a very short time and that broke our setup: as you can see, the broker is not available, ZooKeeper went down and our Kafka server went down, because we are running this Kafka server on a very small machine with a single broker and a single partition, so it cannot handle data at that rate. To handle that you would need a larger machine and distributed computing, but we will cover those things in future, more detailed tutorials; for now you got to see it all in action, and when you run the code you can see how the data moves from one source to another. If this happens, if the server breaks on your side, all you have to do is start it again: start ZooKeeper, start your Kafka server, and then start the consumer.
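Putting those pieces together, the producer loop can look roughly like this; the CSV path matches the one used above, the broker address is a placeholder, and the time.sleep(1) line is the delay added a bit later in the video so the tiny single-broker setup isn't overwhelmed:

```python
from kafka import KafkaProducer
import pandas as pd
import json
import time

producer = KafkaProducer(
    bootstrap_servers=["<ec2-public-ip>:9092"],            # placeholder public IP
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

df = pd.read_csv("data/indexProcessed.csv")                 # the static stock market dataset

while True:
    # Pick one random row and turn it into a plain column -> value dictionary.
    stock = df.sample(1).to_dict(orient="records")[0]
    producer.send("demo_test", value=stock)
    time.sleep(1)  # delay added later in the video to avoid crashing the single broker
```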
So again: if your server breaks, start ZooKeeper, start the Kafka server, and start the consumer again. Let me run it once more so you can see the data, and then stop it; we don't want to overload the server again. You can see the same JSON data coming through, which is pretty cool: we get to see how these things happen behind the scenes, just like your live location updating rapidly when you travel from one place to a destination using Google Maps, Uber, Ola or whatever app you use; all of those events go from your location to the server and come back in real time to give you the final result. So we were able to produce some data. Now, if you print on the consumer side in the notebook, you will see the same data there as well. If you want to clear out the producer's pending data, that is simple: call producer.flush(), which pushes out everything sitting in the producer's buffer, and after that the buffer is empty and you can send new data again. So that was mainly the producer side; we have completed those first few steps, and now we need to work on this part of the architecture: Amazon S3, the crawler, Glue, and then seeing real-time data on the Athena side. Let's keep going. Go to your AWS console, click on S3, and create a new bucket. Click Create bucket, and make sure your bucket name is unique across all of AWS; you cannot reuse the exact name I use, so give it a unique name, something like kafka-stock-market-tutorial-youtube, and you can append your own name, which automatically makes it unique. Use the region nearest to your location, keep everything else as it is, and click Create bucket. Let me go into the kafka-stock-market-youtube bucket: this is my S3 bucket. If you don't know what an S3 bucket is, I suggest watching a tutorial (I already have one on my channel), but in short, S3 is object storage: you can store whatever type of file you want, audio, video, CSV files, Avro files, ORC files, anything. So we have the object storage; now we want to upload data to the S3 bucket from our code, and for that there is a package called s3fs: run pip install s3fs. Mine says the requirement is already satisfied, but if you don't have it, it will install. You also need to configure AWS credentials on your local machine, and the way to do that is through IAM. When you created your AWS account you may have received an access key and secret access key, but if you don't have them or you forgot them, go to IAM, click Users, click Add user, give it a name such as an admin account with your name, and select programmatic access (you could also give AWS Management Console access, but in this case let's just keep
it to programmatic access). Click next; actually, let me go back a step, because here you need to attach an existing policy: click AdministratorAccess, then next, next, review, and create the user. This lets you download one CSV file containing the access key ID and the secret access key; make sure you do not leak your access key and secret key, so I won't show mine here. It downloads only once, so save it somewhere safe. After you save it, go to your terminal again; let me open another one, I have so many open right now, this is the sixth. If you don't have the AWS CLI installed, just search for AWS CLI on Google: there are installers for Windows, macOS and Linux, so install the one for your OS. Once it is installed, go to CMD and run aws configure, hit enter, and it asks for your access key ID: copy it from the CSV file and paste it, hit enter, then paste your secret access key from the same file (mine is already configured, as you can see). Then it asks for your default region: whatever region you are using on AWS; on the EC2 side I'm using Mumbai, so ap-south-1 is my region, but yours may be different depending on your location, so use whatever region you are using on AWS. Hit enter, hit enter again, and your AWS setup on the local machine is done. Once you do that, you will be able to send data from your local machine to S3. We already installed s3fs, so write from s3fs import S3FileSystem and run it. Let's comment out the earlier consumer loop for now, we don't want it doing anything yet. First create the object, s3 = S3FileSystem(), and run it. Now, what we really want to do is create a file for every single row and store it in the S3 bucket, but we also want each file to have a unique name, so we will append a number to the end of the file name. This is how it works: we write count, i in enumerate(consumer); enumerate is the Python function that gives you a running count along with i, the actual item from the consumer iterator. If I print count and print i, then run this and send some data through, you will see count 0 and some data; run it again and you see 1 and some data; again and you see 2 and some data. So it iterates, increases the count by one every time new data arrives, and also gives us the value. We can use this to store our data on S3: here you provide your S3 bucket name, so let me copy and paste it, then a slash and the file name, which is where the count value goes into the name, and then i.value, the actual value I want to write, goes into that file object.
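Putting the consumer side together, a minimal sketch of that cell might look like the following; the broker address, bucket name and file prefix are placeholders, and it assumes the aws configure step above has been done so s3fs can pick up your credentials:

```python
from kafka import KafkaConsumer
from s3fs import S3FileSystem
import json

consumer = KafkaConsumer(
    "demo_test",
    bootstrap_servers=["<ec2-public-ip>:9092"],             # placeholder public IP
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

s3 = S3FileSystem()  # uses the credentials configured with `aws configure`

# Write each incoming event to its own JSON file, numbered with the enumerate count.
for count, message in enumerate(consumer):
    with s3.open(f"s3://<your-bucket-name>/stock_market_{count}.json", "w") as f:
        json.dump(message.value, f)
```

One JSON file per event is fine for a tutorial, but it is worth knowing that in production you would normally batch events into larger files before landing them on S3.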
If you understand the basics of file handling in Python, you will be able to do this easily. Let me flush the producer once more, go back, stop the previous cell and start the consumer, then run the producer for about five seconds: one, two, three, four, five, and stop it. Let's stop the consumer too, we don't want to keep it running or our server will struggle again. If I refresh the bucket, I can see data there: a lot of different files, stock market JSON files numbered up past 100, so in about five seconds we produced a lot of data. If I click on one, download it and open it, you can see a single event stored in that file. So this part is done; let me go back to the architecture diagram. We have uploaded data to S3; now we need to build the crawler, the AWS Glue catalog and Athena, so let's do that. If you are finding this video helpful, make sure you hit the like button and subscribe to the channel, it helps a lot; creating this video takes a lot of time, it is around 12:30 am right now and I'm still recording, so hit the like button to support the channel. Let me have some water, and let's move forward. So, what do we want to do? We have data on S3 and we need a crawler. We already covered what a Glue crawler is, but briefly again: the crawler will crawl the schema from our S3 files so we can query directly on top of them using Athena. Go to AWS Glue, click Crawlers, click Create crawler, and give it a unique name such as stock-market-kafka-project. Keep everything as it is and click Next. For "Is your data already mapped to Glue tables", choose Not yet, then Add a data source. The optional fields we can skip; choose "In this account" and select your S3 bucket. Let me open S3, copy the bucket name, paste it, select it, choose it, and add a slash at the end, because we want to run the crawler on the entire S3 bucket. Keep everything else as it is, click Add an S3 data source, and click Next. For the IAM options: in my case I already have an IAM role ready. An IAM role is what gives one service access to another; when we wanted to access S3 and upload data from our local machine, we had to configure the AWS access key and secret key, and in the same way, if a service such as Glue wants to talk to S3, it needs an IAM role that gives Glue access to write data to the S3 bucket or read data from it. If you don't have an IAM role, open IAM in a new tab, click Roles, click Create role, choose AWS service, and for the service select Glue. Click Next, and for now just give it admin access: type admin and you will see AdministratorAccess, which
gives full access to AWS services and resources. Click Next, give it a name such as glue-admin-role, click Create role, and the role is created. I won't do it here because I already have roles created; I'm going to use one I created for a previous YouTube project, so if you have done that one you will remember it. Select the role, keep everything else as it is, and let's create a new database here: call it stock_market_kafka and click Create database; now the stock_market_kafka database is available. Back in the crawler setup, refresh the target database dropdown, choose stock_market_kafka, keep everything else as it is, click Next, everything looks good, and click Finish; this creates our crawler. Once it is ready, select it and click Run; this starts the crawler, so let's wait for it to finish, and once it does we will move to the next step. The crawler is almost finished, as you can see here, and it reports one table change; the run has succeeded. Now let me go to Athena: type Athena in the search bar and open it, and let me close the previous queries. If you refresh, go to AwsDataCatalog and the stock_market_kafka database, you will see one table, named after the kafka-stock-market-tutorial bucket, and if you click Preview you will see the data, around 10 rows right now. If you get an error while querying that Athena cannot find an output location, go to Settings, click Manage, and provide an S3 path where Athena can store its temporary query results. If you don't have a spare bucket, create one just for Athena query results; if you already have one, select it and you are good to go. I already had a temporary bucket for Glue and used that, but if you don't, create an extra bucket for the Athena queries and it will store all of the temporary results there; make sure you create that bucket in the same region or it will not work. Right now we have 10 rows, so let me go back to the architecture diagram: we have uploaded data to S3, created the crawler and the Glue catalog, and we can query on top of it, so we have more or less completed the whole architecture, but we are not seeing real-time data yet; we are not seeing the magic of real time, because we are not sending any data right now and therefore not retrieving anything in real time. So let's send data again, this time with some delay so we don't overload the server: add a delay of one second with time.sleep(1) in the loop (we already imported sleep), run the producer cell so it keeps sending data, and run the consumer cell so it consumes the data and uploads it to S3. If I go to my S3 bucket now, it already has a certain number of objects, and if I keep the loop running, more objects keep arriving.
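As an optional aside that is not part of the video (the queries there are run in the Athena console), the same kind of count query could also be fired from Python with boto3; the table name and results bucket below are placeholders, and the database name matches the one we created above:

```python
import time
import boto3

athena = boto3.client("athena", region_name="ap-south-1")

# Placeholder table name (the crawler derives it from the S3 bucket it scanned)
# and placeholder Athena results bucket.
resp = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM <your_table_name>",
    QueryExecutionContext={"Database": "stock_market_kafka"},
    ResultConfiguration={"OutputLocation": "s3://<your-athena-results-bucket>/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then print the state and the raw result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(state)
print(athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"])
```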
More rows keep showing up in the table as well. Let me go over to Athena and run a count: SELECT COUNT(*) shows 631 rows; wait a moment and run it again and it is 731, then 789, 861, 921. So we are getting data in real time: the producer is sending an event every second, the consumer is picking it up and uploading it to the S3 bucket, all of those files are arriving every second, and we can query that data in real time using Athena. If you keep running the count, it keeps increasing, 1494, 1571, 1576, and so on. If you select all of the data, it scans the whole table; it took around 93 milliseconds here and you can see the full result. I can also run something like SELECT MAX(date): if data with a later date arrives, the result will change automatically, but right now it hasn't. Let me run the count again: now we have around 2,608 rows, and running it once more gives around 2,700. So you get the idea of what is happening: we are sending data in real time, receiving it in real time, and querying it in real time. We have the data here, simulated from an existing dataset using Python, then we send it through the producer, it goes into Kafka, into the consumer, onto S3, through the crawler and Glue, and into Athena. We don't have to run the crawler again, because it has already run and created the catalog, so we can query on top of the new files directly and get real-time results in the Athena table. We discussed all of this at the start of the video, and now you have seen it in action. First of all, congratulations if you made it this far and completed the project, because a lot of people give up in between when they hit one error they can't solve and go looking for the easy way out; you didn't, you completed the project to the end. If you completed the entire project, let me know in the comments by writing "I completed this entire project and I'm excited to learn more"; that way I will know you actually finished it, because not everyone will comment, and if you see that comment from someone else, reply and congratulate them. This was the project as I designed it, but you can modify it according to your requirements. I just gave you a kick-start with Kafka; you can go do more Kafka courses, everything is free on YouTube, and I will create my own courses in the future. For now you already have the tools, so go online, learn more, and modify this architecture to your needs: instead of the simulated feed you can use an actual real-time API and learn how to integrate it with the Kafka producer (you will find plenty of resources online), and on the consumer side, instead of writing the data to Amazon S3, you can push it directly into a Snowflake database, BigQuery, or whatever you want,
or you can visualize that data directly with some graphs. It is up to you how you design this architecture according to your requirements. So congratulations on completing this project; I hope you had fun executing it and that you understood and learned something new, because very few people will actually do this and you are one of them. This is just the start of your data engineering career. I am planning to build more and more different types of projects and build courses around them, so if you want to learn all of these things, just subscribe to my channel. You can also follow me on LinkedIn or Twitter if you want more updates. I hope to see you in the next video. Thank you for watching and have a good day, bye.