Transcript for:
Introduction to Apache Spark Overview

With the increase in the size of data being generated every second, it is important to analyze that data and get business insights in less time. This is where Apache Spark comes in: it can process real-time big data. Keeping this in mind, we have come up with this full course. Before we go ahead with this session, I'd like to inform you that we have launched a completely free platform where you have access to free courses; you can find the details in the description below. Now, a quick glance at the agenda. We'll start off with an introduction to Spark, and we'll also understand its ecosystem. Then we'll understand what RDDs are in Spark and look at some fundamentals with respect to RDDs. After that we'll learn about Spark transformations and actions, and we'll work with different transformations and actions in Spark. Going ahead, we'll work with Spark SQL, following which we'll understand what DataFrames are in Spark and work with DataFrames. Then we'll see how to read data from different sources, and finally we'll work with Spark Streaming. Let's start with the session. We'll get started with Spark; the other points we will discuss later. Spark requires some level of understanding before we actually go to the hands-on side. We will be doing a lot of hands-on, but if I abruptly start with "okay, write a program," you will not understand what you're doing. So that is why there are some slides first — I'm not really a slides person, but there are some topics we have to cover before you can really go to Spark. The first thing you need to understand is: what is Spark, where did we get Spark from, and why are people so excited about Spark? In the previous session, one of the trainees was asking me, "Can I learn Spark?"
"I really want to learn Spark." That's the pull of Spark. I'm not saying I don't want to learn machine learning myself, but he wanted to do it with Spark — he was asking me doubts like, "Can I get started with Spark? I want to do machine learning on Spark" — because everybody's excited about Spark. In all my trainings these days, the majority of the sessions are on Spark. There is a lot of excitement about this tool. Why is Spark so popular? Is it some magic, or is Spark doing something special? It's actually very easy to understand. Somewhere in 2009, there was a project called AMPLab, a research lab at the University of California, Berkeley. The people there were trying to create a new cluster manager. Do you remember YARN? YARN is the resource manager in Hadoop; it came when Hadoop version 2 arrived in 2012, and that is what you see today. But we are talking about 2009, when there was no YARN; we were all running the old Hadoop version, and there were no good cluster managers. So some folks at Berkeley were trying to create a cluster manager very similar to what YARN is today. They called this project Mesos — that is the name they gave it. These guys did some R&D, and at last they created the cluster manager. Even now Mesos is an Apache project, very similar to YARN. So why did YARN become so popular? Because YARN comes by default with Hadoop: if you download Hadoop, you get YARN. If I want to use Mesos, I have to install it separately. But Hadoop will run on top of Mesos too; the only difference is that the YARN layer will be handled by Mesos instead.
Mesos is quite a famous project; even today some folks use it, though not everybody. Mesos was created in 2009 in AMPLab. They created Mesos and said it's a great cluster manager, we're going to use it, and so on. Now, in order to test the power of Mesos, they first ran some MapReduce programs — of course, MapReduce was what we had at the time — and it was running fine. Then they thought: why don't we try a different programming framework? And they created something called Spark. Spark was actually created to test Mesos: Mesos being the cluster manager, and Spark a new programming framework. You could write a program using Spark and it would run on Mesos. The only difference was that Spark was completely in-memory execution — it uses most of your RAM. If you have a cluster manager and there is a lot of RAM being used, you can really test whether the cluster manager is good or not. So they created this Spark programming framework, where if you write a Spark program it will use the majority of your RAM for faster processing. They tested Mesos, and Mesos became a success; everybody was happy. People never bothered much about Spark — they had just created it as a side project. But after a year, around 2010 or 2011, people thought: hey, this Spark is good, because if you write a program in Spark, the first thing you notice is that it runs faster compared with MapReduce. So people thought: why don't we develop Spark a bit more? If people can write a Spark program and it runs faster, that should be great, right? So the development started, and somewhere in 2012, I believe, Spark became an Apache project. They contributed it to Apache and said: we have created a new project, we'll make it open source, and it is a framework like MapReduce — if you write a Spark program instead of a MapReduce program, it will run faster.
It runs faster than MapReduce, so people thought, okay, let's try that. It became an Apache top-level project, and from there everything changed. From 2012 on — actually, Spark is quite new, because it became widely popular only somewhere around 2015, and I started using Spark around 2015 or 2016 myself. Even in 2016 there were very few trainings on Spark; people were excited, but nobody knew much about Spark. We started using it in 2016, and all of a sudden, by the end of December 2016, it became the most active Apache project in history, and in 2017 and 2018 it had the most contributors of any Apache project. So in a span of two to three years it gained a lot of popularity, and now Spark is, like I said, the most popular Apache project so far — Apache has around 300-plus projects, and Spark is number one right now. About versions: originally there was a Spark version 0, then we got Spark version 1, and now we are on Spark version 2 — 2.3 is the latest release. Many companies still run something called Spark 1.6; it is a very, very stable version, and that is the reason they stay on it. After 1.6 there is no 1.7 — after 1.6, the next release is the major version 2.0. And we are on 2.3 right now; if you go to the Spark website it will show you 2.3. Currently 2.3 is not entirely stable, because it's the latest release, but 2.0 through 2.2 are very stable. We are learning Spark 2, not Spark 1, and you're not missing anything, because Spark 2 has improvements plus all the features of Spark 1. In most trainings they will request: please start with Spark 2, don't start with Spark 1 — meaning Spark 1 is essentially sunset, and people want to start with Spark 2.
So people start with Spark 2; the major version right now is 2, and 2.3, I can say, is what we are using. And something else happened that actually made Spark more popular. In 2012, the founders of Spark — they had given it to Apache, and that took Spark a long way, but they were not satisfied with only that. What these guys did was found a company called Databricks. They founded it around Spark: they gave Spark to Apache, so it became open source, obviously, but they also wanted to make money, right? If it is open source, you can't make money directly from it. If you give something to Apache, you are really contributing it to the world — that's the good scenario. If I don't give it to Apache — like Microsoft Windows; Windows is not with Apache, right? — then it's purely money-making, and you don't get any chance to modify anything or make your own contributions. If it is not in Apache, the only changes to a commercial product are made by the company, so the growth will be limited, very simply: you will have your own developers who can modify it, but others will not be allowed. So Spark went to Apache, and now Apache has a Spark release — Apache Spark — that anybody can download; it is 100 percent free, open source. The same Spark is also available with Cloudera, Hortonworks, and MapR — the vendors we know. If they give you a Cloudera cluster, it has Spark installed; again the same Spark, and anybody can use it. But these guys in 2012 founded this company, Databricks, and its sole purpose is to sell Spark, nothing else. This company is special because it was created by the founders of Spark, the people who actually wrote the source code. So if I want only Spark, I can probably go to Databricks, because they're the best at it. But many people go to Cloudera or Hortonworks instead, because chances are very rare that you want only Spark — you usually want other tools as well, like Sqoop.
You want Flume, you want other data-access tools — rarely does anybody want only Sqoop, or only Flume, or only Spark. Now, one of the participants asked this question; I'll get back to it in a moment. Spark is an independent project, which means it does not require Hadoop, but mostly you will see it on top of Hadoop. Spark really is independent: I can download Spark, and I have Spark running on my Windows 10 laptop — it works beautifully, and it doesn't need Hadoop or an HDFS file system. But most people don't deploy it that way. Right now we have a migration happening at GE — I was with them last week before coming here — where they are migrating to Spark, Spark on Hadoop. GE already has a Hadoop cluster, so why would they install Spark separately? Since they already have a Hadoop cluster, they'll install Spark on top of Hadoop as an ecosystem component: it will leverage HDFS for storage and then do the processing. So most of the time you will see Spark on top of Hadoop, but that does not mean it will run only on top of that — you can run it practically anywhere, and I'll show you in the slides where you can run it. So Databricks is popular in that space, and Spark is available almost everywhere. I think we'll run through some slides to give you an idea of the best resources to learn Spark. A couple of things which I know I can share: a lot of people ask me for the best book to learn Spark — and it's never just one book. The best book to learn Spark is something called Spark: The Definitive Guide. But don't buy it to read end to end immediately; don't even touch it at first. Why am I saying this? You can certainly buy it, but keep this in mind: it is written by Matei Zaharia, the founder of Spark — the guy who wrote the source code wrote the book.
The book is pretty much like the source code: you won't understand everything — even I don't understand half of the book — but it is very good. So buy it and keep it; when people ask, you can show them Spark: The Definitive Guide. It is very difficult to understand, but it has accurate information, because the guy who wrote Spark is writing the book. If you read about, say, RDDs in it, he will go to any level of depth to explain them. So it's a bit difficult, and you probably don't need it immediately. Once you complete this Spark class and the fundamentals, then start with the book — it will give you a very good idea. So that is one place to learn. Second is the documentation from Databricks; this is where I learned, actually. Open the Databricks documentation and it will give you a lot of ideas — the basics at least, if you want to learn. So: Learning Spark, and Spark: The Definitive Guide on Amazon — the authors are Matei Zaharia and Bill Chambers, the people behind Spark. This is the 2018 paperback, the latest edition; it covers basic operations and a lot of other things, so try it if you want, after you finish here. You also have, for Hadoop, Hadoop: The Definitive Guide, and there is one more good book for Hadoop, Hadoop in Action, I think — that's also a good one. The Hadoop books are a bit easier to understand, in the sense that the Spark book leans more toward the source-code side — how exactly things work internally — while the Hadoop books are gentler. Now, this is the slide for you to understand if you really want to learn Spark: why Spark is important. Somewhere around 2004, I think that's the year, with Hadoop you got something called MapReduce. You know MapReduce; I don't have to teach it. Until 2015 — say from 2007 until 2015 — here is what happened.
The problem with Hadoop was this: you have MapReduce, fine, no problem, but what MapReduce does is batch processing, and it is slow — it is only batch processing. So people wanted to try different types of workloads on Hadoop. Some of them wanted to write SQL queries on top of Hadoop; that is where you got Hive, then we got Impala, then we got Presto, and here you also have Drill and many more tools — around ten-plus different SQL tools on Hadoop which can explore the data in Hadoop. And some other people wanted to do machine learning. In the earlier days, before Spark, when people wanted to do machine learning they used something called Mahout — you can see Mahout there; it is a machine learning library. But what is the problem? Machine learning is all about iteration. What do you mean by iteration? You take the same data and iterate over it again and again. One of the things MapReduce cannot do well is iteration, because if you write a MapReduce program, the map will run, the reducer will run, and it will push the data to hard disk; for the second iteration you again have to read it back. So when Mahout was doing machine learning, the problem was that it was very, very slow, because internally it uses MapReduce on top of Hadoop. So Mahout came, and it became the machine learning tool. Then for graph processing we got Giraph and Pregel — those are the tools you use for graph processing on Hadoop; traditionally there was no Spark or anything. Then people wanted to do streaming data analysis — real-time data — and that is where you got Storm. So you see the problem, right? Over a period of 8 to 10 years, people kept developing different tools to handle different workloads. So if today I go to a Hadoop system, and I'm very new to Hadoop, and I ask somebody what I should learn, they will be confused, because all these things are there to learn.
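The Mahout point above — that iterative machine learning was slow on MapReduce because every pass round-trips through disk, while Spark keeps the working set in RAM — can be sketched in plain Python. This is a conceptual illustration only; the function names and the "+1 per pass" computation are made up, not any real framework API:

```python
# Conceptual sketch: why iteration is slow MapReduce-style and fast Spark-style.
import json
import os
import tempfile

def mapreduce_style(data, iterations):
    """Each pass writes its result to 'disk' (a temp file) and reads it back."""
    disk_ops = 0
    for _ in range(iterations):
        with tempfile.NamedTemporaryFile("w", delete=False, suffix=".json") as f:
            json.dump(data, f)          # write intermediate result to disk
            path = f.name
        disk_ops += 1
        with open(path) as f:           # next pass must re-read it
            data = [x + 1 for x in json.load(f)]
        disk_ops += 1
        os.unlink(path)
    return data, disk_ops

def spark_style(data, iterations):
    """Cached in memory: no disk round-trips between iterations."""
    for _ in range(iterations):
        data = [x + 1 for x in data]    # the working set stays in RAM
    return data, 0

result_mr, io_mr = mapreduce_style([1, 2, 3], 10)
result_sp, io_sp = spark_style([1, 2, 3], 10)
assert result_mr == result_sp == [11, 12, 13]
print(io_mr, io_sp)   # 20 disk operations versus 0
```

Same answer either way; the difference is the 20 disk operations the MapReduce-style loop pays for ten iterations, which is exactly the cost Mahout paid on every training pass.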
Pig, Hive, Mahout — they'll say you need all of them to start working with Hadoop. And there is another problem: let's say one fine day you start learning Storm. Storm is real-time processing, so you first learn Storm — say the Storm API and its language — then, to get it running on top of Hadoop, you must learn the integration, and only then can you start working. So learning all these different tools became a problem in itself. Then, in 2014, Spark came, and the biggest USP of Spark is that it is a unified processing engine: what these 50-plus tools on Hadoop are doing, Spark alone can do. That is the real advantage of Spark. Many people get confused here, because if you ask anybody why somebody should learn Spark, they will say because it is faster. No — speed is a byproduct of Spark. It is faster, I know, but the real reason people are migrating to Spark is that whatever all these tools are doing, Spark alone can do. So you don't have to learn hundreds of tools: you learn one tool, one language, and you can do everything. You do need some basic idea of what you're going to do, though. For example, if you want to do machine learning, Spark has something called MLlib, a machine learning library that will allow you to do it — but you should know what machine learning is; only then can you use it, otherwise you cannot exploit it. Still, I don't have to install ten tools when one tool, Spark, can do it. So the primary reason people are going for Spark is this general unified engine. In my experience, the second reason is speed — obviously it's faster, and I will tell you why and how it is faster. The third reason is ease of programming. You can get started with Python in Spark — the language can be Python — and whether you want to write SQL queries, streaming, or machine learning, for everything you can use Python in Spark. That is not the case in the traditional world: Pig uses a different language, Storm uses a different language, and so on — every tool uses its own language.
Every tool there used a different language, even for graph processing; in Spark you can use plain Python. So ease of programming is another factor. I have struggled a lot with this myself — I have worked with a couple of these tools as part of projects — and the major problem is that you take some three or four months to learn one tool, and when you come back they will say, go and learn some other tool. With Spark, that gap is gone. That is why every company wants to migrate to Spark: it is easier to write, and faster as well. So I'd say MapReduce is almost obsolete — that's one case in point. Another point: yes, these tools are going to become obsolete, but not overnight. These tools are already in production in many companies, and you cannot one fine day announce that you will remove everything. But yes, it's going that way. MapReduce is going, as I told you. Pig is essentially gone — there is still an Apache project, but nobody is really using Pig; I used to teach a lot of folks Pig, but now Pig is gone. So some of these are almost extinct. Storm is going too, and that's a very interesting case, because Storm was once the king of the industry in streaming data processing. The thing is, Storm is truly 100 percent real time, whereas Spark Streaming is batch — micro-batch. That was the only real competition. Now there is also Flink; you know about Flink? Flink is again the same thing, streaming, but genuinely real time, not micro-batching. Still, what people find is that most use cases can be handled with Spark Streaming; if they really need micro-level, per-event processing, they will keep a Flink installation to handle that. So Storm is practically gone, and Flink is a kind of competitor to Spark Streaming because it is real time. But I am saying that most companies handle most of their use cases with Spark Streaming only, because you rarely get a requirement for truly 100 percent real time — near real time is usually enough.
Spark Streaming is near real time, not 100 percent real time. Streaming data means you want to process real-time data — we will come to that. I gave you an example, right? Say credit card fraud detection: I swipe my credit card, that swipe has to be immediately captured by some system, and it has to say whether it is fraud or not. This is real time. Now, to implement this solution, I have Spark Streaming; in Spark I can do it. But Spark does something called micro-batching, meaning it is not able to capture a single swipe as it happens — it collects, say, one second's worth of data, so I cannot go below one second. If I'm using something like Storm, it can pick up even one swipe in that microsecond, as it arrives. That is the difference. But I'm saying that difference is usually not so important: a one-second delay you can tolerate, and you can still manage that with Spark in most cases. And yes, Spark supports streaming. So this is how these tools are going: the graph processing tools are practically gone, we have Spark for that; Mahout is going, as I told you. Tez is still there, but Tez will be gone, or must be gone soon, because Tez sits in the middle of MapReduce and Spark. What is Tez? It makes your MapReduce faster — that's Tez, basically — and Spark is faster still, so Tez sits in the middle and will mostly be gone in a couple of years. Hortonworks is the company that was promoting Tez, so probably they'll keep it, because they want that market — that is the reason. Okay, let's see what Spark is. In Spark you have scheduling, monitoring, and distributing; I will show you this practically. And I like this diagram a lot, because it shows you the Spark architecture in a very nice way. How does it look to you? Like a solar system, right? Or you could say an atom, probably. The point is: at the core, you have Spark.
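The micro-batching idea above — Spark Streaming collecting roughly one second's worth of events per batch, versus Storm handling each swipe as it arrives — can be sketched in plain Python. The timestamps and swipe names are made up for illustration; this is not the Spark or Storm API:

```python
# Minimal sketch of micro-batching: group timestamped events into fixed windows.
def micro_batches(events, interval=1.0):
    """Group (timestamp, payload) events into consecutive windows of `interval` seconds."""
    batches, current, window_end = [], [], None
    for ts, payload in events:
        if window_end is None:
            window_end = ts + interval
        while ts >= window_end:          # close any finished windows
            batches.append(current)
            current, window_end = [], window_end + interval
        current.append(payload)
    if current:
        batches.append(current)
    return batches

# Simulated card swipes arriving over about 2.5 seconds
swipes = [(0.1, "swipe-A"), (0.4, "swipe-B"), (1.2, "swipe-C"), (2.3, "swipe-D")]
print(micro_batches(swipes))
# [['swipe-A', 'swipe-B'], ['swipe-C'], ['swipe-D']]
```

The fraud check only sees a swipe once its one-second window closes — that is the bounded delay the lecture describes; a per-event engine like Storm or Flink would hand each swipe to the check immediately.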
When you download Spark, you get something called Spark Core, and that is the lowest layer of abstraction in Spark, where you can start programming directly — I will show you what to do there. You also see RAM and a hard disk in the picture, which means Spark can use both RAM and hard disk, and I'll tell you why that is very important. And these are the four languages currently supported in Spark Core programming: Scala, Python, R, and Java. As of 2.0, only these four languages are supported, so you should learn one of them; if you only know Python, that's fine — you can continue with Python. I teach Spark in Scala as well; the source code of Spark is actually written in Scala, which is the reason some folks want to learn Scala along with it. Earlier, in the Spark 1.6 days, if you wrote Python code it was slower — this is one thing you won't face now, but in Spark 1.6, if you wrote your Spark code in Python it would be slow, and Scala code would be roughly twice as fast. It was like that because they hadn't done the optimization work; Java was also slower than Scala — the fastest at that point in time was Scala. Python was slow because certain libraries hadn't been implemented; that support was added later through the optimization layer. But this was only in 1.6, and 1.6 is now rare. In Spark 2, what they have done is introduce a common abstraction layer, so that you can write the code in any of these languages and your program will run at the same speed. It doesn't matter whether you write in Python or Scala or Java — everything will be the same as of now, so we don't have to worry about it. SQL is also there, but SQL is not a language in the same sense — yes, you can also write queries and so on using SQL. Let's go one more orbit out.
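Since Spark Core programming was just mentioned as the layer where you start writing code, here is a toy sketch of its central idea, which the agenda calls transformations and actions: transformations only record a plan, and nothing runs until an action triggers execution. This is illustrative plain Python with a made-up `ToyRDD` class, not the real Spark classes:

```python
# Toy sketch of lazy transformations vs. eager actions (not real Spark).
class ToyRDD:
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []            # the recorded plan; nothing runs yet

    # --- transformations: return a new ToyRDD, no computation happens ---
    def map(self, fn):
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, fn):
        return ToyRDD(self._data, self._ops + [("filter", fn)])

    # --- action: now the recorded plan actually executes ---
    def collect(self):
        out = list(self._data)
        for kind, fn in self._ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

nums = ToyRDD(range(1, 6))
pipeline = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 1)
# Nothing has been computed yet; collect() is the action that runs the plan.
print(pipeline.collect())   # [1, 9, 25]
```

Real Spark does the same thing at cluster scale: building the chain of transformations is cheap, and only an action like `collect()` or `count()` ships work to the executors.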
Now you may be thinking: if Spark is able to do everything, how is it able to do it? These are the libraries available in Spark. When you download Spark, you get Spark Core, where, like with MapReduce, you can write core programs. But maybe I'm not a core programmer — maybe I only know SQL. Then you use something called Spark SQL. Spark SQL is very much like Hive: you can create tables, there is something called a DataFrame you can create, and you can query the data. The good news is that they integrate Spark SQL with Hive by default. So what happens on a Hadoop cluster: if you install Spark and I open Spark SQL, I can read the data from my Hive tables. Previously, when you ran queries in Hive, it was very, very slow; now you can read the same tables in Spark SQL and query them quickly. These days Hive is often just used as the storage layer, and most of the processing is done by Spark SQL, because it is faster. Another library is MLlib — this is what I was saying: your machine learning library, so whatever machine learning you know, you can implement here. I don't think deep learning and such are supported, because for deep learning you would be using TensorFlow. And GraphX is the framework where you can represent your data using a graph. Have you ever done graph processing? No? Why is this important? If you have studied engineering or computer science, you would have learned graph theory, right? I don't know whether it was in your syllabus, but there is something called graph theory. There we have vertices and relations, with no order defined, like this: this vertex will be me, this one will be you, this one will be someone else, and then you draw relations between them — I like you; probably you don't like me, you like someone else — like that. So you have property graphs, where the vertices carry some property, and then you have relations (edges) between them, and you can query that structure.
At one point we worked with a social media company on one of our projects — not like Facebook, they were a startup — and their entire data was like this. Even Facebook: if you download your data from Facebook — it's actually difficult to do directly, but there is a way to get the publicly available data — they will give it to you in a graph format. It doesn't come as JSON or XML; it's a graph format, because Facebook internally uses this graph structure. I think they have their own system, but the representation is in that form. So this graph representation of data is required many times, and Spark can do it with the GraphX library, which can represent your data as a graph. And then there is streaming: streaming data is, as I was saying, real-time data coming in that you want to process in real time and then make decisions on. Some queries, if you write them in graph form, will be much more efficient — the multi-relational queries. For example, this person likes this person, who likes this person: if you want to find such patterns and relations, the graph representation makes it easy. Let me give you another example: the airport data we had. If you have airports and flight data, you represent the airports as vertices — Bangalore, somewhere in the USA, and so on — and then represent all the flights going between them as edges. I can do this using SQL also, but the graph queries will be much, much faster for me, because this structure is naturally a graph, and it is easy to traverse with the graph API. At any point you may have thousands of flights currently flying, and you want to query them and track things; for airport data, always use a graph type of system. SQL will work, but it will be slower when the data volume is really high — even in Spark SQL it will be slow.
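The multi-relational queries described above — "find what the people I'm connected to like" — are natural as graph traversals. Here is a minimal plain-Python sketch using adjacency dictionaries; the users and pages are made up, and this is not the GraphX API:

```python
# Tiny property graph: vertex properties plus directed "follows" edges.
likes = {                 # vertex property: pages each user likes
    "me":   {"cricket"},
    "you":  {"movies", "cricket"},
    "alex": {"travel"},
}
follows = {               # edges: who follows whom
    "me":  ["you", "alex"],
    "you": ["alex"],
}

def friends_likes(user):
    """Union of everything the user's direct connections like — one hop of traversal."""
    out = set()
    for other in follows.get(user, []):
        out |= likes.get(other, set())
    return out

print(sorted(friends_likes("me")))   # ['cricket', 'movies', 'travel']
```

In SQL this one-hop query already needs a join of a follows table against a likes table, and each extra hop adds another join; on a graph representation each hop is just one more step of traversal, which is why airline-network and social-media queries favor graph APIs.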
Even in Spark SQL it will be slow at that scale. So for airport management — how many flights are flying in real time, and so on — they use this graph API a lot. It's really for certain use cases where you have vertices and lots of relations between them; social media is the standard example. Say I want to find out what you would like on Facebook: I have to do a lot of comparisons — how many friends you have, what your friends like, whether you also like it. Traversing such queries in SQL is very difficult, but a graph can be traversed easily: I can just say, from these nodes, find me these types of relations. So social media companies do use graph APIs a lot. Even Twitter: how many people you are following, how many follow you — those relations are all served from graph APIs in Twitter. And if I have a graph with a property, the property can even be an image; I can't directly query the image, but I have the URL, so I can say: find me similar images, or images with this keyword, and so on — you can handle unstructured data that way too. Note that this is not the same as NoSQL — that is different. There is a NoSQL database called Neo4j; Neo4j is a NoSQL database that stores data in graph form. That is a different use case — it's a database for real-time queries — but that product actually uses graphs to store the data. GraphX is different: it comes as part of Spark, while Neo4j is a NoSQL database. I worked on a small project where they were using Neo4j — I didn't learn it deeply, but I have seen that people really do use it. Okay, we're in one more orbit now. Let me ask you: what do you make of this picture? Is there anything special about it? A flamingo. What is the flamingo doing?
You're very clear — exactly: the flamingo is standing on one leg. This is called standalone mode in Spark. I didn't create this slide; it is a Databricks slide, so if you have to blame someone, blame them — a flamingo standing on one leg; they could have been more creative. So these are the different modes in which you can run Spark; don't get confused by the names. The top one is a single PC: that's local mode, where you run Spark locally on one machine. You cannot expand to more than one machine, and you use it only for development and testing purposes — that's understood. The next mode is Mesos: originally Spark was created to run on top of Mesos, as I told you, so if you want, you can still install a Mesos cluster, then install Spark on top of it, and it will run, no problem. Then there is standalone mode. What is standalone mode? I want to install Spark on a cluster, but I don't have YARN, I don't have Mesos, I don't have anything — then Spark gives you its own cluster manager, called Spark Standalone. But this is fairly rare, because in practice it will usually be YARN. And that is Spark on YARN: in most cases what you see is YARN, because most organizations already have a Hadoop cluster with YARN, and it just makes sense to integrate Spark on top of that. Another very important point a lot of people get confused about, so I'm writing it down: Spark has no storage. There is nothing called storage in Spark, because it is an execution engine, like MapReduce. Does MapReduce have any storage of its own? No — it reads the data from Hadoop, processes it, and probably stores it back in Hadoop. Similarly, Spark also has no storage; there is no storage component in Spark. It is just an execution engine, which means you must provide the data from somewhere. If you're running Spark on Hadoop, the data is in HDFS; if you're running Spark on, say, your PC, your PC's file system has the data.
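The run modes above map directly onto the `--master` option of `spark-submit`. These are illustrative command shapes only — the host names, ports, and application file are placeholders, not values from any real cluster:

```shell
# Local mode: single machine, for development and testing (the flamingo)
spark-submit --master local[*] my_app.py

# Standalone mode: Spark's own built-in cluster manager
spark-submit --master spark://master-host:7077 my_app.py

# YARN: the common case when Spark sits on top of a Hadoop cluster
spark-submit --master yarn my_app.py

# Mesos and Kubernetes are also accepted as masters
spark-submit --master mesos://master-host:5050 my_app.py
spark-submit --master k8s://https://k8s-apiserver:6443 my_app.py
```

Whichever master you pick, the application code stays the same — only the cluster manager that schedules the executors changes, which is the point the slide is making.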
Standalone mode or Mesos — whichever; it's up to you to decide. So Mesos — what's the problem? The problem is that integration with some of the other components may be difficult. The real problem is support: all these organizations will go for vendors, right? Say I am an organization, I want Hadoop plus Spark or something — I'll go to either Cloudera or Hortonworks, and if I go to them, they will say: we will give you Hadoop and YARN. There is no Mesos; commercial vendor support is not there. That is one drawback. That means your engineers are installing everything separately, so people don't actually go for that. Apple has a cluster where they run Mesos — I don't know why, but they have a Mesos cluster for some analytics work. Apart from that I have never seen it in my life, because everybody already uses YARN, you see. I have only seen Standalone in-house, but Mesos I have not seen. Okay, if you're interested, some extra knowledge: Spark can also run on Kubernetes. What is Spark on Kubernetes? I think it is production-ready now. You know what Kubernetes is, right — it is a container orchestration tool. And you know what Docker is? So you have something called Docker, right. Docker is a fairly old tool — like four or five years old, I think. What does Docker allow you to do? Do you know what a virtual machine is? You know what a VM is. So on your laptop, if you create a VM, what will happen? Let's say you try installing a Windows VM. The problem is that this VM will use a lot of resources, right? An operating system has to be installed, everything. So if my laptop has 8 GB RAM, I have to give 4 GB RAM only for the VM. What Docker does is allow you to create something called a container. Okay — meaning on one laptop I can run like ten Docker containers, Linux or Windows. A container will use the libraries that are already available on my base operating system, so it will not install a full operating system. That's what I'm saying. So instead of giving 4 GB
RAM to a VM, I can give 1 GB to a Docker container, or half a GB to a Docker container, and if it is Linux it gives much, much better performance. So Docker became a big hit. Because say you are a programmer — you wrote a Java program and you want to test it. You want to test it on Apple, you want to test it on Linux, on Windows — how do you do all of this, right? So you can just launch Docker containers on your PC itself, like ten of them, run your code, see whether it works. Very easy, right? Kubernetes is the next level above Docker. It does something called container orchestration: in a data center you may be running millions of containers, like Docker containers, and you can manage all of them using Kubernetes. If I'm correct it is a Google project, okay — so it does the orchestration. So Docker is like you are running it on one machine, probably ten machines max. But Kubernetes — it becomes a data center manager, actually. So you can give hundreds and thousands of servers to Kubernetes and say: I want these many containers, Docker-like containers, and it will launch them and manage them for you. So Kubernetes can become one of these — instead of YARN or Mesos or the others, Kubernetes can become your resource manager. I can install Spark, and then I can say: I want to run a Spark program. If I say run it there, it will go to Kubernetes, and Kubernetes will launch containers for me to run the Spark code. So in cloud you have multiple options. Either you can directly run — say I go to Amazon, I create some machines myself — or I can use services like EMR, Elastic MapReduce. I can go to Amazon and say: hey Amazon, I need you to create a Hadoop-plus-Spark cluster for me, say 10 machines, 100 machines. In five minutes it will create it and give it to me, and then I can run all my workload. But these are like disposable clusters: once you complete your job, you delete them, or else you are paying money continuously, right? So Kubernetes is on the cloud also — I mean locally it's the same, more or less.
So it is available everywhere. I don't know — I have not extensively gone into Kubernetes, but I think you can search for it, because it is something which is coming up in a big way. I think Google is behind Kubernetes — Kubernetes, however you pronounce it. So this is production-grade container orchestration. The problem with plain Docker and all is that it is very difficult to manage the containers; with Kubernetes you can manage them. I don't know the exact architecture of Kubernetes — like whether they launch their own container or a Docker container — but basically it is all coming to an abstraction now. Previously everything was very clear: for example, you buy a server and use it; you can see the hard disk, right, then you install the operating system. Now what Kubernetes says is: give me a data center, or a group of servers — give me a group of servers like a data center — and I will launch as many resources as you want. You just tell me how many containers you want; I will figure out from where I need to launch them, and how I need to launch them. So like that, it is becoming an abstraction, right? And, probably — "Kubernetes is built upon 15 years of experience of running production workloads at Google." Yeah, so originally this came from Google. Google was already running this, not under this name, under some other name. Even Hadoop came from Google's papers, right? So same like that, they created Kubernetes. I think Spark on Kubernetes is available — that is why I was speaking about this. Spark on Kubernetes — yeah, very much available. The docs say you must have a running Kubernetes cluster, blah blah blah, and Docker. So Kubernetes is using Docker currently, nothing else, blah blah. So it is supported. But it is not in the slide — that's why I'm saying, in the slide you won't see Kubernetes; the slides are a bit older. And if you pay attention to the slide, I can show you one small thing — or you tell me, what is it? Okay, if you pay attention,
somebody is sitting on top of all this — that is ZooKeeper. ZooKeeper, yes — you saw that. So what it technically means is that all these three modes — whether you're running on Standalone or Mesos or YARN — can be highly available using ZooKeeper. ZooKeeper supports high availability: if you're running Spark and one machine crashes, all these states are maintained by ZooKeeper; ZooKeeper is integrated with Spark. ZooKeeper is another Hadoop-ecosystem project; it is used for coordination between the machines, actually. To put it very simply: in a very large cluster, if one machine goes down, then how do you know? I'll give you a simple example, otherwise ZooKeeper might become another big topic, right — and ZooKeeper is mostly an admin thing, so you don't have to worry about it too much. But still: do you remember I told you you can have an active NameNode and a standby NameNode, right? So in a Hadoop cluster you have two NameNodes — this one active, this one standby. Got it? Now what is the idea? Only and only one should be working. If the active crashes, the standby should take over. Who will decide this? That's the question. Say I am connecting to the cluster — how do I know who is active? I can't connect to a random machine, right? So this guy will have an IP address, that guy will have an IP address. One way is that I can keep on pinging: who is active currently? But how do I know who is active — who will tell me, right? There is nobody to tell. So one way is that I keep pinging both the machines to see who is alive, who is not alive. Like that, there are many situations in which you have to know which machine is alive or not. So with ZooKeeper, what happens is very simple. It is a service. The active NameNode will register itself with ZooKeeper, and the standby will also register itself with ZooKeeper. Now you just ask ZooKeeper who is active, and it will tell you: this one is active. If this guy crashes, okay, the standby informs ZooKeeper
that "I am the new active." You can always ask ZooKeeper. So it's a coordination mechanism, okay? ZooKeeper is for coordination and high availability. What the slide means is that Spark supports ZooKeeper — for talking between the machines, and management and all, it can use ZooKeeper if available. So this icon is the ZooKeeper logo — it didn't just appear; that is actually the logo of ZooKeeper you are asking about. I know, because normally people are not much aware of ZooKeeper. No, no — it is for all the services, actually; let me show you. ZooKeeper is for any service to coordinate. One example I gave you: how do you know the NameNode state? Ideally the standby will not automatically come up; ZooKeeper has to inform the standby to come up, if you look at the architecture. Okay — if you implement high availability in Hadoop, what happens if I don't have ZooKeeper? If this guy goes down, I have to run a command to bring this one up, and that is actually a waste of time, right? It should automatically come up. So these two will be connected with ZooKeeper, and there will be a keep-alive between them. So if the active goes down, ZooKeeper will know that this guy is gone, and after that — say a second or two, there is a timer you can configure — it will ask the standby to come up. Okay, and ZooKeeper holds the metadata here. So basically it holds metadata about who is doing what in the cluster. Otherwise, you see, there are a lot of services in a Hadoop cluster that have an active and a standby — not only the NameNode. Your YARN ResourceManager has an active and a standby. If you have HBase, that service — HBase has an HMaster and a standby. So if I go to a Hadoop cluster and start asking everybody "who is the master, who is the slave," it's very difficult. So I go to ZooKeeper, and this guy will have all this knowledge. Even for Spark — in Spark you have an active and standby master, okay, if you're installing Spark independently — and that can be coordinated with ZooKeeper.
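To make the active/standby idea concrete, here is a toy stand-in for the coordination role — plain Python, not the real ZooKeeper API; the `Coordinator` class and its method names are invented for illustration:

```python
class Coordinator:
    """Toy stand-in for ZooKeeper: every master registers itself,
    clients ask who is active, and a crash promotes a standby."""
    def __init__(self):
        self.alive = []           # registration order decides the first active
    def register(self, node):
        self.alive.append(node)
    def who_is_active(self):
        return self.alive[0] if self.alive else None
    def report_crash(self, node):
        if node in self.alive:
            self.alive.remove(node)   # next registered standby becomes active

zk = Coordinator()
zk.register("namenode-1")    # active
zk.register("namenode-2")    # standby
print(zk.who_is_active())    # namenode-1
zk.report_crash("namenode-1")
print(zk.who_is_active())    # namenode-2
```

The real ZooKeeper does this with ephemeral znodes and watches, but the client-facing idea is the same: ask one well-known service "who is active" instead of pinging every machine yourself.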
That is what the slide is saying. Say some machine was just running — who will orchestrate all this? That is management. If the slave machines go down, the driver should know that a machine went down, so processing has stopped. ZooKeeper is internally coordinating all these things. Say YARN has the ResourceManager, right — the ResourceManager is the master in YARN — and for high availability the ResourceManager by default relies on ZooKeeper. So that is the coordination that happens between them. If you run a program, it is YARN allocating the RAM; ZooKeeper doesn't do resource management, it just remembers who is alive, who is not alive — that kind of information. So ZooKeeper by default will be available in most of the clusters. We can move on now — so this orbit is over. Spark can read from almost any file system; that's what the slide says, actually. HDFS — I don't know what this cube icon is, by the way — then S3 from Amazon, the local file system, and some other file systems maybe; any file system, basically. It supports a lot of file systems, some of which we are not even aware of, right? Local file systems — it can read from all these things. Any NoSQL database, any RDBMS — that is, I think, a very cool feature, because, as I will show you, if you have a MySQL table, Spark can directly query from the table. Or if you have MongoDB or any NoSQL database, it can read from the table, process the data, and the output can probably be stored back into that table, right? It also supports Hadoop input formats. Spark Streaming can work with Flume and Kafka — meaning you're doing real-time streaming, right? So the question is: how do you get the data? This is always a challenge. What happens is that normally you will have a Spark cluster running — so that's it, there is a Spark cluster, a Hadoop cluster on which we have Spark installed. So let's say you have a Spark cluster running and I want to get Twitter data. Actually Spark can directly get the Twitter data; there is no problem — it can come directly to the
cluster and you can process it. But the problem is: if one of the machines that is receiving this data crashes, you will lose the data for some time, right, in Spark. So what you can do in this architecture, for high availability, is use either Flume or Kafka. You can ask these guys to get the data for you — Flume or Kafka. Flume has guaranteed delivery of data; I can configure a Flume agent — we will see in the lab when it comes — to get the data. Kafka is a different picture; Kafka is something similar, and you can get the data from there. So even if your machines are not working, the data will land there, right — you're not losing the data. So for that reliability you use Flume or Kafka. There is a different thing you have to be aware of: if you have a Spark cluster — and I'm talking only about Spark Streaming, okay, not normal processing — when you configure Spark Streaming, one of the machines will start working as something called a receiver. Okay? What is this machine's job? To get the data — that's all it cares about. It is a normal data node otherwise. Now, the drawback of Spark Streaming — or not quite a drawback, I should say — is: what if this machine crashes? If this machine crashes, your stream will be lost. Of course it will switch to another machine, but that may take some time, so for, let's say, five or ten seconds you're losing the data, right? You're not storing it anywhere. Got it? So if I want to avoid this — if this machine goes down, my stream will be gone — what can I do? I can say: Kafka, hey, get the data into Kafka. From there I'll get it into Spark. Even if this machine crashes, another machine will come up and the data will still be available. So with Spark Streaming, if you want reliability, you have to use Kafka or Flume; otherwise your stream may not have reliability. Some of you are asking: will some of the tools go extinct? Yes — your Hive, for one. Nobody is running a lot of Hive queries now; everything is Spark SQL queries these days.
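The reason Kafka helps here, in miniature: the broker keeps a durable, offset-addressed log, so a receiver that crashes can replay whatever it had not yet processed. A toy model of that idea — plain Python with invented names, nothing Kafka-specific:

```python
class DurableLog:
    """Toy Kafka-like log: producers append, consumers read from an offset."""
    def __init__(self):
        self.records = []
    def append(self, record):
        self.records.append(record)
    def read_from(self, offset):
        return self.records[offset:]

log = DurableLog()
for i in range(5):
    log.append(f"event-{i}")           # the stream keeps arriving at the broker

committed = 2                          # receiver crashed after processing 2 records
replay = log.read_from(committed)      # on restart, nothing is lost
print(replay)                          # ['event-2', 'event-3', 'event-4']
```

Without the durable log in the middle, the five-to-ten-second receiver failover window described above is simply dropped data; with it, a restarted receiver resumes from its last committed offset.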
Likewise Mahout for machine learning — everybody has migrated to Spark and MLlib. And Storm — most of those workloads are in Spark Streaming now. This slide just compares the different layers: you can run Spark, for example, instead of MapReduce — most of us are using Spark now. The resource manager can be YARN or Mesos; it doesn't matter, any of these. And at the storage level, in Hadoop it is HDFS. This "Tachyon" is not there now — Tachyon was a storage layer; back around 2009, when Mesos originally came, the storage was handled by a system called Tachyon. It has since been renamed Alluxio. The project is still there in the storage layer. Actually, HDFS is about the only distributed storage you normally get. So one common problem in Spark is that you will try to bring data from different places: Spark will be running on Hadoop, but maybe some of your data is in Hadoop, some of it is in an RDBMS, some of it is in Cassandra. If you use plain HDFS it's fine, but this Tachyon — now known as Alluxio — has a caching layer, so it can speed up your processing by caching the data in RAM and on disk. It's not extensively used, because setting it up and all is a mess, actually. Okay, I've seen it once; at that time it was called Tachyon, not Alluxio — I think they renamed it, if my memory serves me well. Tachyon is Alluxio now? I had to look it up. Alluxio — "formerly Tachyon: open source, memory-speed, virtual distributed storage." I don't know how many technologies are dissolving into each other here, actually. So let's say you have on-premise data and you need a storage layer in between, caching the data to make it faster — that is the Alluxio solution. That is about the only use case; otherwise everybody is using HDFS, actually. It works on memory and, at the next level, SSD caching, it looks like. SSD is costly but faster. The only real use case is if you're
processing a huge amount of data. Like, we heard Bank of America — they had tens of petabytes of data to process; there are terabytes of data and it keeps on coming. So storing it on HDFS, the first read will be very slow — any read will be very slow. So they pushed it through this thing, Alluxio, because it has the SSD caching layer, and they read from there, so it's faster, actually. That is the only use case I have seen for Alluxio. And this is another use case of it, right: you have on-premise storage and cloud storage and you're getting data from both — it comes to Alluxio first, and from there you can start your computing. So it is a storage-layer abstraction: you can store data in multiple places, and Alluxio will bring it together in one place, and from there you can start processing. It is like bringing the data into a caching layer and then processing it. Okay — don't worry if you don't know Alluxio; it's perfectly fine, it's not a mandatory component or something. Okay, now can you tell me: what is the drawback of MapReduce, if there is any? What do you think? Spark is replacing MapReduce, right — so what is the drawback of MapReduce? Yes: the map output is persisted, the shuffle output is persisted, and of course the reduce output you have to finally persist, right? That is why MapReduce is very, very slow — that is what is actually written in the slide. And if I have to do iterative processing — I read the data and have to process it ten times — then I have to chain MapReduce jobs together, and that is very difficult; the intermediate reads and writes will always be there, right? That is where Spark becomes very different, because — as the slide says — you can use Spark to chain all these jobs. In Spark there is something called in-memory processing, and this is very confusing for many people. The first thing you need to understand is that in-
memory processing means it will use the RAM if available. It doesn't mean that you always need to give it RAM — it uses RAM if RAM is provided. So let's say you want to process a 10 GB file, and your cluster has 10 GB of free RAM: Spark will read it into the RAM and do all the calculation there, and only the final output will it push onto your hard disk. Intermediate results are not stored on the hard disk. The second point: what if RAM is not available? Then it will start using the hard disk. So step by step it will read whatever data fits into the RAM, process it, then read again and process — like that it has to go; there is no other way, right. Even if you don't have enough RAM, it is still faster than MapReduce, okay, because they designed the core of Spark from the ground up; they did not modify the MapReduce code or something to create Spark. That is why it says here 10 to 100 times faster than typical MapReduce. How in-memory processing makes it faster, we will see. Ah, that "Cassandra" there — that is Cassandra; it just demonstrates that you can store the results in Cassandra, or read from any RDBMS and store there also. These are the distributions and applications which use Spark. So of course Databricks, and the major distributions — Hortonworks, Cloudera — all these guys are shipping Spark as of now, and all these applications can use Spark for processing — mostly BI, 80% of it, right: visualization tools. Previously, if I was using something like — what is it — Pentaho or Power BI, okay, to visualize my data, the tool would fire a query to my Hadoop cluster, Hive would run the query, and I'd visualize the result. Now Spark SQL will run the query, so it's much, much faster. So all the BI tools and the ETL tools now use Spark for moving the data and processing the data, and so on and so forth. Okay — this slide is a bit old. This is the 100-terabyte sort competition from 2014. Every year there is a sorting competition; even you can participate if you want.
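The in-memory point made a moment ago — MapReduce persists every intermediate step to disk, while Spark keeps the chain in RAM and persists only the final output — can be simulated in a few lines of plain Python. This is a toy contrast, not real Spark or MapReduce:

```python
import json
import os
import tempfile

data = list(range(10))
steps = [lambda x: x * 2, lambda x: x + 1, lambda x: x * x]

# "MapReduce style": a disk round-trip after every step
tmp = tempfile.mkdtemp()
current = data
for i, step in enumerate(steps):
    current = [step(x) for x in current]
    path = os.path.join(tmp, f"step-{i}.json")
    with open(path, "w") as f:
        json.dump(current, f)          # persist the intermediate result
    with open(path) as f:
        current = json.load(f)         # re-read it for the next step

# "Spark style": compose the steps in memory, materialize once at the end
result = data
for step in steps:
    result = [step(x) for x in result]

print(current == result)               # True — same answer, far fewer disk trips
```

Both pipelines compute the same answer; the difference is purely how many times the intermediate data touches disk, which is exactly the gap the 10-to-100x claim is about.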
The thing is, they will give you, say, one terabyte of data, and you have to sort it. They will already give you the data format, and the ordering required — whether it is ascending or some other sort, they will tell you. They give you the layout and sample data, and you write an algorithm to sort; it can be a Java program, any program you can write. Whoever sorts the data fastest wins — that is the conclusion. And in 2014, when this ran, Spark did it in 23 minutes; Hadoop MapReduce took 72 minutes. But look at the cluster sizes: Spark was running on 206 machines, MapReduce was on 2,100 machines. That is the difference — roughly one-tenth the machines, and still it was faster, because a lot of RAM was available and everything was in memory, so the sorting went faster and faster. And even for the one-petabyte sort, again Spark became the winner. Okay. Do you know what a driver is? These are some things you need, otherwise you won't understand Spark — so wait. What is a driver? No, not a device driver. Okay, "the thing which drives the program" — that's actually correct, right? But what is this guy? When you're writing a Spark program, the program will have something called a driver, okay, and this is the master of the program. And then you have something called executors — these are the slaves. So let's say I wrote a Spark program: in the Spark program there will definitely be a driver and then executors; without them the Spark program cannot run. I mean, these are logical concepts I'm describing, okay? Now, when you want to run that program, one way is to run it locally. I can say: hey, I wrote a Spark program — run the program in local mode. When I say local, what will happen is that one JVM will be created — a Java virtual machine, okay — and my driver, executor, everything will be inside this. This is the local mode of Spark — one Java virtual machine. Yeah, still a container will be created, similar to that; I mean a JVM is, in general, a container of sorts. Okay, so a
container will be allocated — this is like a YARN container, yeah. So a container will be allocated, and in that, both your driver and executor will run. So local mode is not very efficient, because you get only one container and everything runs inside it. When you're submitting the Spark code — actually, when you have a Python program, right — it will read your logic, convert it, and run it inside a JVM. Everything ends up inside a JVM; without that it cannot run. Even the MapReduce programs we wrote in Java, right — you can also write Python code for MapReduce. If I'm writing a Python MapReduce program, when I run it I submit the program and mention some JAR files, and those JAR files in Hadoop will read your Python code, convert it into a format which will run inside a JVM, and execute it inside a JVM ultimately. Because at the end of the day it is all running on the JVM — and with Spark, which is written in Scala, everything has to be inside a JVM too. So in local mode the problem is you get only one container, and inside that the driver and executor — everything — will run. Okay, so this is only good for testing purposes, right? If you just want to test a Spark program, or you want to learn Spark, you will normally say: hey Spark, run in local mode; I write some code, it just runs. If you're on a cluster, this is where things become more interesting — and a bit confusing also. Okay, so this is my Hadoop cluster, right, and I have four DataNodes. Imagine I have four DataNodes, and imagine I created a Spark program. Now the first question is: when you're running the Spark program, what are you analyzing? Okay, I'm analyzing a file. What is the size of the file? Let's imagine the file is in Hadoop — just a simple use case. The file is here, here, and here — just an example. Say the total size of the file is n GB. So I have an n-GB file in Hadoop; it may be in blocks also — that's immaterial. And I
want to process it using Spark now, and that is on a cluster, right? So when I submit my program to a cluster, I have to ask YARN for executors: how many executors do I want, and what is their capacity? Meaning, I can say: hey YARN, give me one executor, and in that executor I want 20 GB RAM. No problem — an executor will come up. Here is your executor, say on DataNode two — any DataNode — and it has, let's say, 20 GB RAM and some four processor cores or something, and then your entire Spark code will get executed inside it. But normally people will not go this way — that is like one machine executing everything. So possibly what I will do instead is ask: hey YARN, give me, let's say, four executors — just an example, give me four executors. So YARN will give me four executors, right, and for each executor I tell YARN: give me 5 GB. So this is 5 GB, this is 5 GB, this is 5 GB, this is 5 GB — 5 GB RAM each, in memory. The total is 20, but 5 GB each, right? So imagine this. And then there is a driver, right — these are executors. Imagine you have five nodes in this Hadoop cluster, and here my driver is running, okay? So what will happen when you submit the program: on one of the machines the driver will start running, and in the program you have asked YARN: give me four executors, and in each executor I want 5 GB RAM, two processor cores, blah blah blah. So YARN will launch one, two, three, four here, and whatever code you have written, the driver will push into all four machines at the same time, and each machine will process your code. This is how Spark processes on a cluster. But it is not as easy as it looks, because you should know how much memory you need for processing, right, and how many executors you want. And will YARN actually give you those? I cannot say: give me 100 executors, each with 100 GB RAM — YARN will say: I don't have the capacity, I can't give you that. So normally, wherever I have gone for consulting, if you write a Spark program you will discuss with your team.
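For reference, the four-executors-of-5-GB request above maps onto spark-submit flags roughly like this. This is a hedged sketch — `app.py` and the YARN setup are assumed — but the flags themselves (`--num-executors`, `--executor-memory`, `--executor-cores`) are standard Spark-on-YARN options:

```shell
# Ask YARN for 4 executors with 5 GB RAM and 2 cores each (4 x 5 = 20 GB total)
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-memory 5g \
  --executor-cores 2 \
  app.py
```

The same values can also be set as configuration properties (`spark.executor.instances`, `spark.executor.memory`, `spark.executor.cores`) instead of command-line flags.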
I mean, okay: I want to write a Spark program, and in Spark, if you want the full performance, it should be in memory — you need maximum RAM, and that is a challenge in every Hadoop cluster. So you will go to your Hadoop admin or Spark admin and say: I have to run a program, I need 100 GB RAM in the cluster. And he will tell you: in your program, ask for, say, ten executors at 10 GB each — something like that, right? That is what you will configure — the number of executors and how much RAM you want — and then you submit your program. So YARN will launch these things, and the driver will be running on a separate machine, right? And whatever logic you have written, the driver will read it and start pushing it to all the executors, and the output normally comes back to the driver. With whatever output you have in the driver, you can write logic to either store the output in Hadoop, or store it in Cassandra, or wherever you want — that is all possible. And this driver normally will be your ApplicationMaster — not the cluster master, remember; you have a master from YARN, and that is your ApplicationMaster. So a lot of people ask: what if this machine goes down — my driver machine goes down? Obviously, if your driver machine goes down, the processing will be disturbed; the Spark program will crash, okay? In YARN you can configure the ApplicationMaster restart behavior — so it will restart. If this machine goes down, YARN will restart your AM — the ApplicationMaster — on another machine so that it can continue processing. The ApplicationMaster is in constant touch with your ResourceManager: there is an ID, an application ID, under which it records what you have processed, or what execution is going on currently — it knows those things. So your ResourceManager will launch one more ApplicationMaster; that becomes your Spark driver again, and it gets the code and starts running from there. So the driver is the master part of your program, and the executors are where you push your code. Depending on your cluster, you can ask for the number of executors.
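The flow just described — the driver splits the work, ships the same logic to every executor, and the results come back to the driver — can be shown in toy form using Python threads as stand-in "executors". The names here (`run_job`, `num_executors`) are invented for illustration; this is not Spark's API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(data, func, num_executors=4):
    # the "driver": deal the records out into one slice per executor
    slices = [data[i::num_executors] for i in range(num_executors)]
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        # the same function is pushed to every "executor" in parallel
        partials = pool.map(lambda s: [func(x) for x in s], slices)
        # partial results come back to the driver, which combines them
        return [y for part in partials for y in part]

out = run_job(list(range(8)), lambda x: x * x)
print(sorted(out))   # [0, 1, 4, 9, 16, 25, 36, 49]
```

In real Spark the "slices" are partitions, the "executors" are JVM processes on cluster machines, and actions like `collect()` are what bring results back to the driver — but the shape of the computation is the same.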
How much memory you want for each executor, you also decide — that is how it actually runs. And there is the ApplicationMaster restart timer we can configure in YARN: we can set a time — say 35 seconds — to try to restart the ApplicationMaster on the same machine. Sometimes the machine itself will not crash, but your ApplicationMaster will crash — the driver process crashes — so it restarts on the same machine. Can it retry too many times? Maybe the resources are not available, so you configure the timer, and if it doesn't work, it will go to another machine and start from there. Next, you need to understand something called an RDD — that is why "RDD fundamentals" is written here. RDDs are the basic building blocks of Spark. The first thing you need to understand is that in Spark, your data is represented as something called an RDD, right? Whatever data you have, if you want to process it in Spark, the first step is to create something called an RDD. RDD stands for Resilient Distributed Dataset. It is like a variable, or a pointer, you can say, right? These slides will give you some idea about RDDs, and I will also practically show you what an RDD is. So just look at this picture — this picture is very, very good for understanding how Spark works, right? I have four blocks of data in Hadoop — that is what is represented there, actually: HDFS, right. So four blocks of data are there, and I want to process them. Now, that is one file: even though it is divided into four blocks, it's one single file I want to process. And what is this data? Some text data. Imagine each block has lines like: ERROR, then a timestamp, then the message; again WARNING, some timestamp and message — some log file, imagine, right? Four blocks of data. We are assuming they are on four DataNodes, and you want to process them. Now, from here onwards it is going to confuse you — okay, I'm saying it in advance, you will get confused. Okay. So where is your
original data? On hard disk, right — as blocks, in HDFS initially. That you have to keep in your mind. Now imagine I have to write a program to process the data. In the program, the first step I need to do is create an RDD. Okay, so I have to say: create an RDD. An RDD is like a variable, you can say, or a pointer. Here the name of my RDD is logLinesRDD — you can call it Raghu if you want. I just say: hey Spark, create something called logLinesRDD by reading the data from this file — you can give the location of the file. And if I run this, what is going to happen? Imagine in this case I have asked for four executors in my Spark program — I said I want four executors, for some reason. So what will happen: on each DataNode, one executor will be launched — an ideal situation, like I have one block on each DataNode — and I say: hey Spark, create an RDD called logLinesRDD for me. When I hit enter, all this data will be copied into the RAM — the main memory — because RDDs represent your data in memory, assuming RAM is available; ideal conditions. So let me redraw this picture — this is actually a Databricks picture — but if I redraw it: now don't think that the file is always in Hadoop, right? In my typical example I have the file in Hadoop. But there is one small thing here. Let's say you have six nodes in the Hadoop cluster: one, two, three, four, five, six. That is my cluster, and blocks one, two, three, four. Now the big question: you want four executors — which machines will launch the executors? YARN — YARN has no data locality awareness at this level; YARN doesn't know where your data is, right? In MapReduce, data locality is there, because your blocks are residing on particular machines — let's say your blocks are here, right? So these are the four blocks, and I write a MapReduce program. What is going to happen? In the MapReduce framework it is decided that, okay, I need to have four mappers. One more very important point you need to remember:
in MapReduce, if I'm processing this same data, four JVMs will get launched — got it? One here, one here, one here, one here. And this block is read here, this one here, this one here, this one here — that is how your mappers run. Because can you process four blocks using one mapper? Not possible. You will have the number of mappers equal to the number of blocks — or the number of input splits, they say, right? Technically you can set the number of mappers to one, but then what is going to happen? This block will be processed, then the next block will come and be processed, one after another, in one YARN container. So in MapReduce, YARN is launching containers for your mappers — it launches one container for each block, right, and your blocks are, say, 64 or 128 MB. And in the YARN settings you will say: my container size is one GB — I want a one-GB container, right? And what will happen: one GB, one GB, one GB, one GB — four containers will be used. It is very rare that you manually set the number of mappers; you can't really do that usefully, because it will hurt your performance. And also, if you're launching only one container, you can't resize it — it stays a one-GB container. I have never seen anybody use a static number of mappers; in 99.99% of programs you will never touch the number of mappers. You will just say: I want to process this. And really, in the ideal situation — say four blocks — it will launch four mappers, right? Each block is handled by one mapper, your map phase is over, then reduce, and so on. Spark is different. In Spark, one of the core principles is that you can control how many containers you want and what their size is. So I can say: I want four containers in Spark. So now we're talking about Spark, okay? In Spark I can say I want a single container — that's what I call an executor. I can say I want only one executor, and this guy has 8 GB RAM, say — we launched only one — and these four blocks can come inside it and get
processed. Fine. Or I say I don't want one executor, I want to leverage parallelism, so I ask for four executors. But the point is, when you ask for four executors, there is no guarantee they will land where the blocks are; they can come up anywhere in the cluster. In that case each block needs to be copied over the network into the RAM of whichever machine got the executor, so initially there may be some delay getting the data. There are two behaviors here. There is something called Spark dynamic executor allocation; you can enable it or disable it. If you enable it and you launched eight executors but only one is doing work, it will kill the idle seven and only one will remain. If you disable it, it keeps all of them as they are. The property is spark.dynamicAllocation.enabled. Disabled, you are telling YARN: I want eight executors, don't kill anybody, I may use them or I may not, and it will hold them for you. Enabled, you are saying: if an executor isn't getting used, just kill it, I don't want it, which is sensible if your data size is small. The trade-off is that if Spark later needs executors again, you will see them coming back up and pulling the data again, with some delay. But coming back to our RDD discussion: in that example you have four blocks of data on a six-node Hadoop cluster, and you want to read and process it. So you say: yes, I want to create an RDD, and the name of the RDD is log_lines_rdd or whatever; an RDD is just like a pointer. When that code runs, what happens is: you asked for four executors, and let's assume the executors are launched on four nodes, one here, one here, one here, one here, and the blocks are copied into them. Now your data is residing inside the RAM of four machines, four executors. That is called an RDD.
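As a sketch, the dynamic-allocation switch mentioned above is usually set per job at submit time. `spark.dynamicAllocation.enabled` is a real Spark property, but the executor counts and the job file here are just example values; note that on YARN this feature also requires the shuffle service to be enabled.

```shell
# Example values only; my_job.py is a placeholder for your application.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=8 \
  my_job.py
```

With `enabled=false` (the other case in the lecture), YARN holds on to every executor you asked for whether it is busy or idle.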
Because otherwise, how do you refer to it? You need some handle, like a variable; that handle is the RDD. So once the data is available in RAM, you have a representation of your data: this entire log file is now called log_lines_rdd. Also, in Spark there is something called partitions. Right now your data is lying as four blocks, so we say this is a four-partition RDD. Normally, when you read from Hadoop, each block becomes a partition. What if you are not reading from Hadoop, but from your local PC? Then I can mention how many partitions I need. Say on my Windows laptop I have a 1 GB file. If I simply read it, it will come as a single 1 GB partition. But if I want, I can say: hey Spark, read this 1 GB but create four partitions for me. Why does that matter? Each partition can go to an executor, and an executor processes a partition, not a file. So your data always has to be partitioned, and the more partitions you have, the more parallelism you can get. Or say you are reading from Cassandra, a typical example. Cassandra is a NoSQL server; it doesn't have blocks or anything like that — blocks are a Hadoop thing. If I read a table from Cassandra, I will get, say, one million rows. I cannot hand one million rows to a single executor; the processing will be very slow. So I have to have an idea about my data. Say that one million rows is 20 GB or so. When I am creating the RDD, I can say: take this data from Cassandra and divide it into four partitions, and each partition will go to an executor. An executor can manage more than one partition too; minimum one partition, depending. Say I have an executor with 20 GB of memory and a partition size of 1 GB; it can manage 20 partitions.
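The "divide one million Cassandra rows into N partitions" idea is, at its core, just chunking. Here is a rough plain-Python sketch of what a partition count means; this only mimics the concept and is not Spark's actual partitioner.

```python
def partition(rows, num_partitions):
    """Split a list of rows into roughly equal chunks, mimicking what
    happens when you ask Spark for a partition count at read time."""
    size = -(-len(rows) // num_partitions)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

rows = list(range(10))          # stand-in for the one million Cassandra rows
parts = partition(rows, 4)
print([len(p) for p in parts])  # [3, 3, 3, 1]
```

Each chunk is what would be handed to an executor; the real PySpark knob on the read side is, for example, the `minPartitions` argument of `sc.textFile`.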
So data locality is not 100% guaranteed; when you launch an executor, it is simply not guaranteed. Is this clear? Then the question: how do you figure out the number of executors? First of all, that depends on the size of your data. In Hadoop this normally isn't required, because by default a block is a partition. And what is the block size in Hadoop? 128 MB. So say I have a file that is divided into 10 blocks; the size of the file is around 1.2-something GB, and I want to process that file. Now the question is how many executors I need, and that depends on your cluster configuration. I can even process this in a single executor: I ask, hey YARN, give me one executor with, say, 3 GB of RAM, just to be on the safe side. YARN launches one executor with 3 GB of RAM. This file has 10 partitions, 10 blocks, and all 10 partitions will come into that single executor and get processed, one after another. So you can ask for however many executors you want, and the more partitions you can spread across executors, the more processing speed you will get. Same example: if I have four partitions and I launch four executors, each gets exactly one partition, and this will be faster. I can also launch a single executor, copy all four partitions into it, and it will be slower. Now another very important point; these things are all related to resource management, but as a developer you should also understand them. An executor is a container, a JVM. A minimum of one processor core is required to manage one container. Meaning: if this is a dual-core machine, how many executors can you launch on it at most? Two. If you ask for eight executors on it, it doesn't work. So an executor is a JVM, and a JVM has RAM and CPU; the RAM you can set — 1 GB, 10 GB, fine.
But to manage that, ideally, a single JVM gets a minimum of one processor core allocated. Then again the question: if I have 16 GB of RAM for one container, that container will hold a lot of partitions inside it, and maybe one CPU core is not enough to chew through all of them; the concern there is the processing power of your machine. You have to keep all of this in mind when you are launching — I'm speaking from the cluster side. To put it short: in Hadoop you don't have to do most of these things; when you are launching a Spark program, you should know the size of your data and how it should be chunked. The idea is this. Say I have a file of some size X. If the number of partitions is more, that means the number of splits is more. Say I ask for 10 partitions, so I get 10 partitions. Each partition can sit inside an executor, and each executor can process its own data, so you get parallelism. But if it is less — say I want only two partitions — then only two executors are busy, so the parallelism you get is less, and so is the processing speed. The more partitions you have, the more processing power you can bring to bear. So if you are loading, say, a text file, I can say: create an RDD with number of partitions five. Spark will split that text file into five and give the pieces to executors; ideally it wants five executors to process them. If you go to YARN and YARN says "I don't have enough, I'll give you two executors", then two executors get five partitions: one guy gets two and the other gets three, and on each executor the partitions are processed sequentially, so it may be a little slow. If you have enough resources, parallelism really works. Ideally, in the Hadoop world, your data is divided into blocks, each block sits on a data node, and that is one partition. So if I am reading a file from Hadoop and that file
has, let's say, 20 blocks, then 20 partitions will be created. But now that I have 20 partitions, the question is: should I load them into 20 executors? Can I even get 20 executors? Maybe not; maybe I will have only 10 executors. And within each executor, how much RAM do you have, how much CPU do you have? These are not just admin topics; the developer has to make choices here. Say this is your file, and this file is divided into 10 blocks in Hadoop: one, two, three, four, up to 10 blocks. Now, to answer your question: when I read the data into Spark, it automatically does something called partitioning, and each block will become one partition by default; you cannot change that at this default read, and in most cases we keep it. So look: this is one block, this is another block, and so on, so this becomes partition one, partition two, partition three, partition four, and so on — similar to input splits. So you have 10 partitions; now you want to process them, and that is where speed and resources matter. To process, what do you need? You need RAM, you need CPU. So where does each partition go? The blocks are in your hard disk; when I create a partition, the partition is in RAM. And it won't just sit anywhere in RAM; it has to be inside an executor, a JVM. That is where the data has to come, and that is where, generally, the processing starts. Then you can decide: I have 10 partitions, how many executors do I need? Can I ask for 10 executors? Yes, you can. You say: I want 10 executors. So you get executor one over here, up to executor ten, and each partition will go to an executor; partitions always go to executors, and the processing happens inside the executor. Now, for this executor I have asked for 1 GB of RAM and
one processor core. That means this one partition can be managed by this executor; similarly partition ten will go to executor ten, which also has one core and 1 GB. But the point is, YARN should be able to give you all of this. I can do the same thing in another way: I ask for only two executors. This executor has 5 GB, that executor also has 5 GB; this has one processor core, that has one processor core; and blocks B1, B2, B3 up to B5 come here, while B6, B7 up to B10 come there. Spark actually segregates them logically; partitioning is just a way of understanding the different splits of your data. Now suppose you write a program: I say, filter the data. My driver is here, and in my driver I said "filter the data". That filter is your logic, and this logic will be pushed to the executors; the filter is applied there, on each partition — here, here, here. Who runs it? The processor, at runtime. So with one processor core, can I really run this in parallel? With multithreading and so on I can only process so much at a time; otherwise it will process one by one and take some time, depends. You can increase the processor cores: you can say I want four cores for this container when you are launching the executor. I said one core is the default, but I can also say: hey YARN, give me my executor, and this executor requests four cores. But then the problem is: if that is a quad-core data node, only one such executor can run there, because I already took all four cores. If you ask for more, you will not get it. And on that quad-core data node, if you lose that one executor, you cannot launch another one alongside it — the cores are already taken. So in production, how we do it is: once we design the application, we look at the data, where the data is coming from, whether it already comes partitioned, or you have
to partition it yourself. Because if you are reading from Hadoop, it is already partitioned, as blocks; but if you are reading from Cassandra, Cassandra doesn't provide that kind of partitioning, so you get one big chunk. So while reading, we decide: should it be partitioned? Yes, we partition. How many? Ten, say, which means each partition will be roughly this size. Then: how many executors do we need? Then you go to the admin and say: can I launch a Spark job with this much aggregate resource? He says yes, and you launch it. Now, if you have understood this much, I am going to confuse you again. If you ask for an executor — you say, I want an executor, and how much memory? Say 10 GB — by default you get only about 90% of it. Around 10% is allocated for system overhead: it is a container, it has to accept and respond to system calls, communicate with the operating system and with YARN, so it reserves some memory for that. So you asked for 10 GB; effectively you get about 9 GB. And inside that you don't get everything either: it is a JVM, so garbage collection and JVM management also require memory, and that takes roughly another 30-40% — I don't remember the exact number. So ultimately only somewhere around 54-60% of what you asked for is yours. In reality, if you look at a Spark cluster and you ask for a 10 GB container, you will get around 5.4 to 6 GB for your RDD; the rest the system will take. Please keep this in mind, because this is an interview question. So if you ask for 10 GB, will you get 10 GB for your data? No, you will not. From 10 GB it becomes 9 GB, and out of that 9 GB the JVM takes its share for garbage collection and communication, and what drills down to your RDD memory is only around 54-60% of the original 10 GB.
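Those back-of-the-envelope numbers can be put into a small helper. The 10% container overhead and the roughly 60% usable JVM fraction are the rough figures quoted here; real Spark governs this with settings such as `spark.executor.memoryOverhead` and `spark.memory.fraction`, and the exact formula varies by version, so treat this strictly as an approximation.

```python
def usable_rdd_memory_gb(requested_gb,
                         container_overhead=0.10,
                         jvm_usable_fraction=0.60):
    """Rough estimate of how much of a requested executor actually
    holds RDD data, using the lecture's ballpark fractions."""
    heap = requested_gb * (1 - container_overhead)  # after container overhead
    return heap * jvm_usable_fraction               # after JVM bookkeeping

print(round(usable_rdd_memory_gb(10), 2))  # roughly 5.4
```

So a 10 GB request leaves you with roughly 5-6 GB of usable space for partitions, which is exactly the interview gotcha being described.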
So to actually fit partitions, in a 10 GB executor you get roughly 6 GB. Don't think that if you get a 10 GB JVM you can fit 10 GB of partitions in it; no, it doesn't work like that. These things are internal to Spark, and we will really understand them when we run things in production; but if you don't understand this, your calculations will be wrong. And these calculations matter for your interviews as well — that's what I'm saying. If you go for a typical interview and say you know Spark, these are the things people ask. They won't ask "what is the difference between Spark and MapReduce"; anybody can answer that. They ask these kinds of questions: who allocates memory, how much do you actually get, what happens if I do this. Even I am not an expert on Spark; I've been working with it for around three years, and some things even I am still researching. Sometimes you hear "oh, this is possible, I didn't know this", and then you run it and see. You can't learn everything at once, in Spark or anywhere else. Okay, so that was partitions and parallelism. Now if you look at the picture, does it make more sense? You have four blocks, you ask for four containers, four partitions, the ideal case; each is loaded, and that is called an RDD. Now the real question: how do you actually write a Spark program? That is what you want; apart from the RDD and all that, once you get the data you need to analyze it. So how do we analyze the data? In Python, did you learn something called a higher-order function? Normally when you write a Python function you say def, then the function name, and so on, then you write the body, and you reuse the function. But why create a named function if you will call it only once? There is something called an anonymous function, a disposable function. Say I want to create a
function that I will use only once and never again; then I don't have to give it a name or a def. I can create it on the fly; that is called an anonymous function. In Spark programming, what we do is use something called higher-order functions. What is a higher-order function? Say I have a function called abc; I can pass another function to this function. That's a higher-order function: normally you would pass some parameter, some value, but here I have a function and I pass another function into it. So we will be passing anonymous functions, and the lambda you'll see is exactly such an anonymous function. I will show you the code; it will become clearer. Basically, in Spark, once you create an RDD you have the data; now I want to process it. How? You have something called transformations. It is all there on the Spark website, which is good to look at: spark.apache.org, the official website of Spark. If you open the documentation, it will show the latest release, 2.3.0 at this point; the older versions are all listed too — 1.6.3 was the last Spark 1.x version, and now 2.x is the current line. If you click through, you can see all the Spark versions. If you go to the documentation for the latest release and scroll down, you can see the RDD Programming Guide, and if you click on that and scroll a bit, you can see "Resilient Distributed Dataset (RDD)" — this is exactly what we created, at least theoretically.
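Before touching Spark at all, the higher-order-function idea can be seen with plain Python's built-in `filter`, which itself takes a function as an argument; the lambda here is exactly the kind of anonymous, use-once function being described. The sample lines are invented for illustration.

```python
# A higher-order function: filter() receives another function (the lambda)
# as its argument. The lambda has no name and is used exactly once.
lines = ["INFO ok", "ERROR bad", "WARNING meh", "ERROR worse"]

errors = list(filter(lambda line: line.startswith("ERROR"), lines))
print(errors)  # ['ERROR bad', 'ERROR worse']
```

Spark's `map`, `filter`, and `flatMap` transformations follow the same pattern: you hand them a function, usually a lambda, and they apply it across the data.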
So, once you create an RDD — and I will show you how to create it — you have RDD operations. My data is now available in an RDD; what can I do with it? That is where you start writing your functions, anonymous functions, and these are the transformations. This is what you need to understand: these are all transformations you can use in Spark — map, filter, flatMap; there are many, actually. If I want to filter my data, I just call filter. If I call filter, it asks me: what do you want me to filter? Within the brackets I write my expression to filter; that is how you filter your data. Map is another one; it is like a for-each. I call map, and it asks what I want it to do, so I write an expression inside map for what it has to perform. These are all higher-order functions: map, filter, flatMap — all of them. So you do transformations on an RDD: if you apply any of these functions, it creates a new RDD. That is a transformation, and that is how you analyze your data. If I want to filter my data, I call the filter transformation. And RDDs are immutable — a very important point. Once you create an RDD you cannot change it; you can only create another one by applying some logic. You can never edit an RDD; they are immutable. So, going back to my slides: we created log_lines_rdd, fine, we have understood that. Then, say I am interested only in the error messages from this RDD. You see a lot of data in there — INFO, WARNING, ERROR — and I want only the ERROR messages, filtered out. So what can I do? I can call the filter transformation and say: hey Spark, match only the error lines and give them to me — I'll show you how to write the logic. This will produce another RDD,
and I can call it errors_rdd. These are the steps in which you write a Spark program: first you create an RDD; then I want to do a filter, so I call filter on it; it keeps only the error messages, and I store the result as another RDD. Now, somebody was asking: what will happen to the memory — what if there is not enough memory? Say the first RDD fits into memory, and then you call this filter; once it has filtered the records, the original can be gone. It is not required any more, because the next processing starts from the filtered RDD. Or, if you have dynamic allocation enabled in YARN, an idle executor itself will be gone, as I told you. Look at this partition with an executor running on it: this guy will become idle, because Spark cannot predict in advance that there will be no data for it after the filter. So that is one more problem. You have four executors, and the second executor has nothing to process, because the filter removed everything from its partition; there is no ERROR line there, so this guy will be sitting idle. So there is a data node, there is an executor on it, and it has no data and simply sits idle — that becomes a problem. How can I solve it? It's actually very easy in Spark. There is a transformation called coalesce, a very common transformation. When you call coalesce you pass a number, and it will reduce the number of partitions; you can resize. You do have to calculate this, but in this example assume you know the data: looking at it, the second partition is empty, and another partition has only one line. So I decide I want only two partitions, and I tell Spark: take this RDD, apply coalesce, keep all the data, but bring it down to two partitions.
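Here is the log-filtering step again as a plain-list stand-in, with the rough PySpark equivalent in a comment. Note that filtering produces a new collection while the original is untouched, which mirrors RDD immutability; the sample lines are invented for illustration.

```python
# On Spark the equivalent would be roughly:
#   errors_rdd = log_lines_rdd.filter(lambda line: "ERROR" in line)
# The original is never edited -- filtering produces a NEW collection,
# just as a transformation produces a new, immutable RDD.
log_lines = ["INFO start", "ERROR e1", "WARNING w1", "ERROR e2"]
errors = [line for line in log_lines if "ERROR" in line]

print(len(log_lines), len(errors))  # 4 2
```

After the filter, `log_lines` still holds all four lines; `errors` is a separate, derived result, just like `errors_rdd` is derived from `log_lines_rdd`.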
So it will just bring it into two partitions. But you have to test this. Normally what you do is sample it: if you have one terabyte of data, you take a good sample of it, say 100 GB, and you run it once. Then you can understand what happens after the filter — you can even call a collect action after that first step just to look at it. So I learn, say, that I had 10 GB of data and after the filter it is only 5 — half of it is reduced — and I can calculate: if I am loading one terabyte, I don't have to keep that many JVMs busy; after this step I should call coalesce. So there are two such transformations: coalesce and repartition. Coalesce always decreases; there is no way to increase with it. If you want to increase — say you want to go from four partitions to eight, maybe because you want more processing power and you have a few more JVMs free — then you repartition the data: I can say, from four I want eight, repartition with eight in the bracket. And note: even if a partition is empty, Spark will keep it; there will be no data in the partition, but it remembers that the partition exists. YARN cannot actually look inside and see whether you have something there. I told you, initially when you are creating the RDD you mention the partitions, and they get created whether or not data lands in all of them; an empty one just sits there idle, and you have to manage it — coalesce is how you reduce them. Okay, so now you have two partitions, because we just wanted to optimize the code. Now take the other case: I had four and I set the partitions to eight. What repartition will do is
read the whole data, do a full shuffle, and hand it back to you in eight partitions. Now, since you asked: yes, repartition can also be used to decrease the number of partitions, but coalesce is more intelligent. If you are doing a coalesce, what happens is: this message might move here, that message might move there, and the empty partitions just get dropped — there is very minimal data movement. If instead I say repartition(2), it will do it for me, but before doing that it reads all the data and goes for a full shuffle — that is maximum data movement. So ideally, to reduce your number of partitions, use coalesce; to increase, there is only one way, repartition; there is no other way. Now the most surprising point. Let's say you wrote a Spark program with these three lines: create an RDD, filter the RDD, then coalesce — and you run it. Nothing will happen. That is the surprise. You wrote all three lines — read from HDFS, filter the error messages, coalesce — and you submit the job, and literally nothing happens in the cluster, because all of these are lazy. This is called lazy execution: Spark will not start executing unless you call something called an action. These are transformations — you are describing changes to the data, but you are never saying "show me the output". You said repartition the data, filter the data; where are you saying show me the output? You're not. So unless you call an action, where you say "give me the output", nothing will run. To do that you can call an action like collect. Collect is the most common action in Spark: when you call collect, you are telling Spark that you want to see the final output.
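The lazy-execution behaviour can be mimicked with Python generators: building the pipeline below does no work at all, and only the terminal `list()` call, standing in for an action like collect, forces everything to run. This is an analogy, not Spark's actual machinery.

```python
work_done = []  # records every line we actually touch

def read_lines():
    for line in ["INFO a", "ERROR b", "ERROR c"]:
        work_done.append(line)  # proof that data was really read
        yield line

# Building the "pipeline" (our stand-in for transformations) runs nothing:
pipeline = (line for line in read_lines() if "ERROR" in line)
print(len(work_done))        # 0 -- nothing has executed yet

# The "action": only now does the whole pipeline actually run.
result = list(pipeline)
print(result, len(work_done))  # ['ERROR b', 'ERROR c'] 3
```

Exactly like Spark: the filter is only a description until an action demands the output.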
So collect will look at this cleaned RDD. You are asking: give me the output of the clean RDD. Spark will understand: to produce the clean RDD I need the errors RDD, and to produce the errors RDD I have to read the file first. So it goes to your hard disk, reads the data, does all the steps, and then shows the output on your screen — and then the entire pipeline is emptied. Spark never keeps your data in memory once the processing is over. You got the result on your screen, and after that everything is gone: no RDD, no partitions, nothing is there. Only for that split second do all these things happen. So if you call collect here, you will see this output. Now, if I want to figure out whether I should coalesce or repartition, what I can do is go back and save the intermediate result. You can save an RDD: when you say save, it saves as a file, a text file. So I call saveAsTextFile — an action — on the filtered RDD, and then I can actually check the size: my original data is, say, n GB; after the filter it is 5 GB; so I am processing 5 GB, and I can size my resources and write my code accordingly. So collect simply displays the output on the screen, and there is also an action called saveAsTextFile. And I'm not just talking — I'm going to show you this. If it is a text file, it is written partition by partition, and the same input-split logic applies: if a line is divided between two partitions, the record-boundary logic of MapReduce input splits is applied here. Now, his question was: what if I don't want to repeat all of this? A very good question. The problem is: if I run this Spark code, in a split second everything happens, I see the output on my screen, and then no RDD, nothing, is left in memory. This is because of the DAG, the directed acyclic graph: Spark creates a graph of the execution, performs the steps to produce your result, and then everything is gone again.
Run it again, and it loads everything again — that's the issue. So: collect is an action that returns all the elements to the driver program, meaning you see them. You can also say saveAsTextFile; that is an action too. If I call saveAsTextFile on an RDD, whatever data is in the RDD is saved as a text file, and I can see it in Hadoop or wherever you are storing it. Is collect the only action available? No — there is saving to a Cassandra table and many other ways to save the data; saveAsTextFile is ideal just to look at the data. Note that in the initial phase of reading the data, MapReduce and Spark are the same — it is blocks you are reading either way, no difference there. Once the data is available in memory, then the difference comes. Because here, you see: you read, then filter, then maybe coalesce, then something else, and through all of this the data stays in memory; you are not writing anything out, there is no intermediate result being pushed to disk. Once you call collect — and you can chain a hundred functions and then call collect — all that manipulation happens in memory and finally collect displays it; that is where the speed comes from. Now, there are many actions. Say I use both collect and saveAsTextFile — those are two separate actions. I cannot fuse them into one: first collect will run, and then if I say saveAsTextFile, the whole operation has to start again from the beginning. So you will be wondering: is there a way I can improve that? I will show you — probably that was his question. So, about the DAG: the DAG is nothing but a directed acyclic graph, which is a fancy way of saying that Spark records all the steps to be executed in graph form. And I will practically show you the DAG in a running Spark job: you can see the partitions, you can see the DAG, you can see how many executors it is launching — everything is visible. It's not just a theoretical thing. But unless I
talk about it first and then show it, you would be asking how this came about, how the partitions actually came — you would be confused. And the driver collects the data; that is the first step. If you want to look at it once more, I have a very good picture to explain this. Look at this picture: same stuff I was doing — log_lines_rdd, errors_rdd, cleaned_rdd — and finally I am calling an action called count. Count is another action. If you say count, everything will run from the beginning, because it has to. So the count action is called, and what happened? It showed me that there are, say, five lines in the RDD. No surprises. Now I also want to call one more action on the RDD: save to Cassandra. The problem is, when I call this action, the pipeline is empty; Spark has to start from reading the blocks, creating the original RDD, then filtering, everything, and then it can save to Cassandra. And one more thing: after that I am doing one more filter — I want only the "stage 1" messages from this data — and then I say collect, because I just want to see those. But again, what will happen? It will start from scratch all over again. So if you look at these three actions, you can see all three depend on this one RDD; everything is starting from there. That is where you can cache an RDD. It is possible to cache an RDD: when I say caching, whatever data is in that RDD will be kept, along with the RDD. And where can you cache? You can cache in RAM, you can cache on hard disk, and you can also say RAM plus hard disk, with replication even — there are multiple options. So once you cache that RDD: say count — it will count; then say save to Cassandra — it will start from here, because this RDD's data is already cached. But don't think that if you cache, you're caching for a month. Once the Spark program is over, the cache is deleted. Caching is holding the data in memory without
flushing it, for the duration of the program. Normally the collect results are temporary, and caching is valid only until your program completes all its actions; once all the actions are over and you exit, everything is gone. I will show you tomorrow: when you start a Spark application, as a developer I create something called a SparkContext object. That object represents my program, and when my program finishes, I kill that object. If I kill it, everything is gone — even the cache — because the cache is valid only for the lifetime of my currently running program. I am not caching it for a program I will run next month; it is during this execution time that I want to speed things up. So you call, say, errors_rdd.cache(), and then it will cache it — but note, cache is itself lazy, like a transformation. If I say cache, I then have to call an action for the caching to actually happen: I say .cache(), then count; it will count and, while doing so, cache that RDD; then whatever action I call afterwards starts from there, because it is in memory. Normally, once you call an action, everything is deleted from memory — that is why caching makes sense. If you cache it, even after an action runs, Spark will not delete it from memory; it will keep it. See: I created this cleaned RDD without caching and I say count — it will count and then delete everything from memory. The data is in memory only during the execution; after the execution is over, nothing is in memory. So if you want to keep intermediate data in memory, you tell Spark: I will call an action, show me the output, but whatever data this RDD has, don't delete it from your pipeline — keep it, because I want to use it for some other action. That is caching. There is also a method called unpersist: if you use unpersist, it deletes the cached RDD from memory. I'll show you how to cache tomorrow; it's very easy.
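Here is a plain-Python analogy for `rdd.cache()`: without caching, every action recomputes the whole pipeline from the source; caching materializes the intermediate result once so later actions reuse it. Spark's real cache lives in executor memory and dies with the SparkContext; this sketch only mirrors the recomputation-counting idea.

```python
compute_count = 0  # how many times the full pipeline ran

def pipeline():
    """Stand-in for read -> filter; counts every full recomputation."""
    global compute_count
    compute_count += 1
    return [l for l in ["INFO a", "ERROR b", "ERROR c"] if "ERROR" in l]

# Two "actions" with no cache: the pipeline runs twice, from scratch.
count_action = len(pipeline())
first_action = pipeline()[0]
print(compute_count)  # 2

# "Cache" it (materialize once, like cache() plus a first action),
# then run the same two actions against the cached copy.
cached = pipeline()   # third and final recomputation
count_action = len(cached)
first_action = cached[0]
print(compute_count)  # 3
```

Without the cache, a third and fourth action would keep bumping the counter; with it, the counter stays put, which is exactly the saving caching buys you in Spark.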
Once the session is gone, everything is automatically deleted; you can't get it back — you can only create a new one. So basically, you may be thinking: how do I start programming with Spark? I want to write a proper Spark program. The first thing you should learn is how to create an RDD. If you know how to create an RDD, then you can start with the simple transformations like map. Now, there is a flip side to all of this. Initially, when Spark came into the industry, everybody was mad about RDDs and their transformations — map, filter — people were literally dying to write these things. But soon a problem surfaced: if you write your code using these transformations and actions, there is no way Spark can optimize your code. Why am I saying this? Say you write a filter: you're saying filter only the error messages, but unless Spark runs it, it doesn't know what you're talking about. These operations do not carry a strict schema. So when you want to process structured data like a CSV file — I want to read a CSV file, I want columns, I want a schema — plain RDDs are not a good way to do it. You can process it, but Spark internally will not be able to optimize your code. That is where Spark SQL comes into the picture — DataFrames, we call them. Spark has a module called Spark SQL, and Spark SQL is much more powerful at optimizing your Spark code. If you write Spark SQL, you should know how to create a table and query it using SQL, and you'll see it quickly once we program with it. More than that, it is much more optimized than core RDDs, because with RDDs you're passing lambdas, and Spark has no way to understand what a lambda does unless it runs it. I wrote the lambda code — some weird code — and Spark has no way to understand the meaning of that lambda unless it sees the data. But if it is a
table, I can say: filter on this column, and Spark knows what that column is before reading the data. It can even avoid loading that column — it can optimize. So if I write a SQL query that has a filter, a group by, then a join, Spark can read it, apply the schema, and understand what you're talking about. And I'll tell you: if you write a SQL query where you wrote a join and then a filter, Spark will push the filter to run first. That is not possible with RDDs, because Spark doesn't know where you're filtering — there is no strict schema; it's not SQL, it's an anonymous function, so it doesn't know what you're filtering or where. It has to load the full data, do the join, then run whatever you wrote for the filter. On the Spark SQL side, it is much more optimized. So what Spark recommends is: learning RDDs is nice — do some transformations and understand them — but stick with Spark SQL for real work. The core RDD transformations and actions are used much less now, because most people have started writing DataFrame queries, i.e., SQL-like queries. Of course you need to bring structure to the data, I understand, but most data you can get in structured form now. A DataFrame is a distributed collection of data — in fact, a DataFrame is nothing but a Dataset of rows; that is the definition. Usually you won't operate on Datasets much here, because Datasets as such carry typed state. Now, when you have a cluster — I'm connected to the cluster right now — Spark gives you a shell to work in, like the Hive shell if you remember: you typed hive and started typing queries. Very similar to that, Spark gives you something called the spark-shell, and the shell is available for Python, Scala, and R. There is no Java shell — Java as such does not have shell functionality; you write Java using an IDE or something. Now, what if you want a Spark shell in Scala or Python?
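The filter-pushdown point can be made concrete with a toy illustration in plain Python (this is not Spark code — it just counts the intermediate rows each plan touches): joining first and filtering later does the same work on far more rows than filtering first, and Spark SQL's optimizer can do this reordering for you only when the query is declarative rather than an opaque lambda.

```python
# Why filter pushdown matters: join-then-filter touches far more
# intermediate rows than filter-then-join, for the same final result.
orders = [(cust, amt) for cust in range(100) for amt in (10, 20)]  # 200 rows
names  = [(cust, "name-%d" % cust) for cust in range(100)]         # 100 rows

# Plan A: join first, filter later (what an opaque RDD pipeline forces):
joined = [(c, a, n) for (c, a) in orders for (c2, n) in names if c == c2]
late   = [row for row in joined if row[0] == 7]
print(len(joined), len(late))    # 200 intermediate rows -> 2 results

# Plan B: filter first, then join (what the optimizer rewrites it to):
small  = [row for row in orders if row[0] == 7]                    # 2 rows
early  = [(c, a, n) for (c, a) in small for (c2, n) in names if c == c2]
print(len(small), len(early))    # 2 intermediate rows -> same 2 results
```

Both plans return identical rows; only the intermediate work differs.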
I think I'll have to export a path first; let me get the shell started. Let me try the Python shell: pyspark2. As of now, Java 8 is what's supported. I think in this cluster they have disabled Scala shell access — maybe I'd have to export some configuration for that; that's why I'm not able to open the Scala shell. But if you want a Python shell, you type pyspark2 — the command is pyspark2 because this cluster has both Spark 1 and Spark 2 installed. You can try this; we will just do some basics, nothing grand. So this is the PySpark shell. What does it say when starting? It says type help and so on, shows the logging level for Spark, and says the Spark version is 2.x — on the Cloudera lab we're running Spark 2, using Python version 2.7.5. That's okay. "Spark session available as spark" — I will talk later about what this SparkSession is; it just says a SparkSession has been initialized as spark. And this is your Spark shell: from here you can start creating RDDs and writing all the transformations you want. Now, if you type this command — pyspark2, sorry, pyspark2 hyphen hyphen help — I want to show you the options. I'm saying I want to start a Spark shell and I need some help with the options. So what are the options? One is --master URL — we don't have to worry about that as of now. If I scroll down — can you see driver-memory? When you start the Spark shell, a driver will be created, and then you can ask for how many executors you want and how much memory you want for the driver. And here, you see, what is this executor
memory? I told you, right — you can ask for how many executors you want and how much memory each gets; by default I think it is 1 GB. The default here is 1 GB, but you can ask for more executor memory. And if I scroll down — see here, executor-cores: how many processor cores you want per executor. And here, driver-cores: how many cores you want for the driver. And the default number of executors: if you start the Spark shell as a plain interactive shell, you get two executors and one driver, each executor with 1 GB of RAM by default. But using these arguments I can say how many executors I want, what their memory should be, and so on and so forth. Now, another important point — let me try this; just give me one moment, I'll do a plain pyspark2. What you're looking at here is the history server URL, and it is not really useful to us live: once you start a Spark shell and then exit it, it is displayed here. You can see pyspark-shell, launched by me, ran 1.8 minutes, and I exited. But this is not live information, because entries appear only in the history server — that means only after you exit from Spark. For some reason that port number is not working currently; while you're running Spark it should show there. But I can show you something: this session I just started — I haven't done anything yet — and if I go to this application, this is the Spark UI. Here you can see the jobs, and there are no jobs. If I expand this, can you see "executor 1 added", "executor 2 added"? That's because by default you get two executors. And here you can see Executors: there was one driver and two executors — one driver active and two executors, exactly as per the setup. One more important point: when you simply say pyspark2 — if you simply say pyspark2, Spark will start in local mode. That means you're not talking to YARN or
anything. If you want to talk to YARN, you have to say --master yarn — that is the option you have to use whenever you're starting the Spark shell. Otherwise, if you start in local mode, what happens? Driver, executor, everything runs in a single JVM, and the resources are allocated by your operating system, not by YARN — in local mode, YARN has no role. And that is good for learning: if you simply want to learn, say pyspark2 and just start writing your code. Ideally, though, when you're launching the Spark shell you say pyspark2 --master yarn, meaning: I want to register with YARN and launch Spark on it. The request goes to YARN, and YARN allocates containers to you. Now let's try launching it with our own parameters. What were the options again? pyspark2 --help. Let's say I want more executors — say, three executors. What is the option? --num-executors; I will say three. --executor-cores — let that stay at the default. I want more memory, so that is --executor-memory; the default is 1 GB, and I want 2 GB, something like that. So what I can do: I go to my shell, pyspark2, I say the master is yarn, the number of executors is, say, four, and then executor memory — I need 2G. That was easy... hmm, it says "pyspark does not support any application options". Probably we're asking for too much. Let's first see whether we can get three executors; then we'll see whether we can get more memory. So right now I'm saying: launch a Spark shell, the master is yarn — sorry, wrong command; corrected. So now I'm asking for three executors.
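Putting the flags together, the launch commands discussed here look roughly like this. This assumes a CDH-style cluster where the Spark 2 shell is installed as `pyspark2`; the flag names themselves come from `pyspark --help` / `spark-submit`. It is a command-line fragment, not something to run outside such a cluster.

```shell
# Local mode: driver and executors share one JVM; good for learning.
pyspark2

# Cluster mode via YARN, with explicit resource requests:
pyspark2 --master yarn \
         --num-executors 3 \
         --executor-cores 2 \
         --executor-memory 2g

# Then, from inside the shell, verify what you actually got:
#   >>> sc._conf.getAll()
```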
Three executors, and it started. We can see it once we exit — so let me exit from here, and it goes to the history server. Let me refresh. This is what we started, right? Can you see "executor 1, 2, 3 added"? Three executors were given to me, and you can actually go to the Executors tab — can you see? This is the one driver, and these are the executors. And here you see the memory. So how much memory did we ask for? 1 GB was selected, actually. But for RDD storage, right now you're getting 384.1 MB. How does this rule work? Like I told you, roughly half of the memory is taken for execution and overhead, and the remainder is given to you for storage — you'll actually get around 600 MB, but it displays only this much because you are not loading any data as of now. The executor is simply running; there is nothing inside it — no datasets, no RDDs, no data. When we load something, we can actually see how much memory the RDD occupies. So 384 MB has been allocated right now, but there is no data in it, so it shows zero bytes. When you actually load, say, 500 MB of data, it will show around 500 MB occupied, because up to a maximum of roughly 600 MB is what you'll get — around 40% goes to system caches and the JVM anyway. So up to 500 or 600 MB is what you'll see here. Right now you see only a small amount of memory because we're not actually doing anything. And I exited from here. So what if we try to change this: --num-executors 4 and a different --executor-memory, say 3G? There is a threshold value, I think, once it is more than a certain size — let's try this. Ah, it says: "An error occurred while calling ... the Java Spark context:
required executor memory is above the maximum threshold for this cluster." So the cluster administrator has set a limit: you can ask for at most this much. What does it say? 2048 plus 384 MB is above the threshold — so max to max, you can ask for only about 2 GB of memory. Here is what happens when you launch an executor: initially 384 MB is allocated as overhead, and then on top of that come the GBs you requested — that is the extra overhead being allocated. So it is saying the total memory is above what the cluster administrator has set, and you cannot launch. We can try 1.5 G or something like that — I think 1.5 gigabytes should work. Actually, let me exit and check whether the cluster is running fine. Okay. So how do you specify it? See here: size must be specified as bytes, kilobytes, megabytes, gigabytes, terabytes, and so on — so I think you cannot say 1.5g; that will not work. How do you specify it, then? In megabytes: m. And here, in the event timeline, how many executors did we get? 1, 2, 3, 4. And if I go here — it has allocated less memory; yes, now it is showing only some 650 MB, not the complete amount. That's fine, because we don't have any data as of now. But you can clearly see that you're getting four executors here — there are the four executors. Now, what I'm showing you here is not ideal, because I'm showing it from the history server: I start a Spark shell, exit from it, then show you the UI — that is after the fact. You should see it live; if you're running the job, you should see the live UI. It is not working for me — I was trying to get it, and I don't know why. This is running from our cloud lab, so in this example it sits in the cloud lab and you need internet. If you install it on your local system, without internet or anything, it'll run locally — exactly the same, actually.
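The "2048 plus 384" in that error message comes from YARN's per-container accounting: Spark asks YARN for the executor heap plus an overhead which, per the Spark 2.x configuration docs, defaults to max(384 MB, 10% of executor memory). A quick sanity check of the numbers seen here (the helper function is just illustration):

```python
# YARN container size = requested executor memory + memory overhead.
# Spark 2.x default overhead: max(384 MB, 0.10 * executor memory).
def container_mb(executor_mb, overhead_factor=0.10, min_overhead_mb=384):
    overhead = max(min_overhead_mb, int(executor_mb * overhead_factor))
    return executor_mb + overhead

print(container_mb(3072))   # 3072 + 384 = 3456 -> rejected by a ~2 GB cap
print(container_mb(1500))   # 1500 + 384 = 1884 -> fits under the cap
```

So asking for 3G fails on this cluster, while 1500m (note the unit suffix) fits.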
The only difference is the executors and so on will change — however many you ask for; otherwise it's exactly the same. So, the Jobs tab displays anything that is running. Why are you not seeing anything here? Because we don't have any RDD or any action yet; once you have something, you'll see it here. Stages — I will talk about those. Storage will show you if you have cached an RDD: if you cache an RDD, it appears in the Storage section. Environment shows the Java libraries and so on. The Executors tab shows the executors and the driver you have. Jobs is there too: if you run a job, it'll show up — like, if you say counts.collect(), it will display here. When you run an actual job, it displays the DAG and stages: it shows, within the job, how many stages there are and how those stages were created. We will see that tomorrow. So you have Storage, then Environment, then Executors — Environment shows the different libraries that are part of Spark. Now, you might ask me: what if I want to see all this from the shell? So I'm starting pyspark again. There is a command, sc._conf — it's a weird command which can show you all this from the shell. Right now you're seeing it with the UI; what if I don't have the UI? I can do sc._conf.getAll(). Perfect — I remembered the command; I'm happy. It is a weird command; nobody remembers it: sc._conf.getAll(). The output is messy, but you can see things in it. What can you see? The number of executors is in there somewhere, the driver memory — this is driver memory — and you can see the master is yarn, and see here, executor memory — how much has been allocated: 1500. You can also see the number of executors — where do you see it? It's very difficult to read from this; I can't even copy-paste it — it's not even letting me copy.
Ah, see here — now I'm copying it, and I'm getting it. So this command is like a million-dollar command; I'm quite sure almost nobody knows it — even I came to know it only recently. How can you see the Spark configs from the CLI? sc._conf.getAll() — it's a shortcut, actually. Please make a note of this, because sometimes you will start working on a cluster where they will not give you the UI. Then how do you know how many executors you launched, or what happened? You have to look at this. This will give you an idea of the number of executors and everything — live: you launch now, and you see how many are there right now. sc._conf.getAll(). And this is a Python shell, but that command is internally communicating with the Scala side — I mean this sc command; all the other commands are normal Python. Whatever you write here is plain Python — this is your Python shell, and normal Python commands work here too. The only difference is we will use those map and filter transformations. Now, the good news is we will try a few things in the shell; the bad news is you will actually do most of the work in Jupyter, not like this. So, how do you create an RDD? from pyspark import SparkContext, blah blah blah — tomorrow we will look at RDD creation in some more detail. I will see if I can get that UI live for you; then I can show you some more. I'm not able to get it — forget it; we can do something else. I know the URL, but it is not a problem with the URL — this is an AWS cluster. When you launch in YARN mode, there is an Application Master URL; you click on that and ideally it should go and show me the UI, but it resolves to the private IP of the EC2 instance, so it uses the private IP and the page cannot be displayed. They would have to expose it. You want to see that, and I don't know why it isn't available.
That's the problem, right. Now, everything I'm doing in one shot: textFile, flatMap, then map and reduceByKey, and you're storing the word-count result in an RDD called counts. You could also save the result to disk, but I'm keeping it in an RDD so I can process it further. So all I'm doing is: I have an RDD called counts, and the word-count output is stored in that RDD — that is all. Why am I doing it this way? So, what you should do: first, upload the data to Hadoop — the data file should be in HDFS. Second, open this notebook, the one called raghu, as a reference. We will not run the notebook directly; we'll just explore it a bit, but make sure you have it. You are not meant to run this notebook as-is, because there are so many things in it and you will not know what is happening. What I want you to do instead is create one more notebook: simply say File, New Notebook, and choose PySpark — it will create one more notebook for you. Then you can copy-paste from the reference rather than running it directly there. First, copy-paste this part: these are the imports. When you start the PySpark shell, most of these imports are already done for you; some you have to import yourself, but most come preloaded. When you're using the notebook, though, it is like writing your own application, so these are some of the standard imports you need: SparkContext, SQLContext, SparkConf, StorageLevel. Those are required for Spark itself — I'll tell you where each is used. NumPy is for numerical manipulations; similarly, for plotting a graph, we import pyplot from matplotlib. We're importing these libraries just so we can plot a graph. Can you try these two lines in your session? Is it working? If it works, I'll tell you what comes next.
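The one-shot pipeline mentioned above (textFile → flatMap → map → reduceByKey, result stored in `counts`) can be traced step by step in plain Python; each local step below mirrors the Spark transformation named in the comment, so you can see exactly what data each stage produces.

```python
# Plain-Python trace of the Spark word-count pipeline.
lines = ["to be or not to be", "to see or not to see"]   # sc.textFile(...)

words = [w for line in lines for w in line.split()]      # flatMap(lambda l: l.split())
pairs = [(w, 1) for w in words]                          # map(lambda w: (w, 1))

counts = {}                                              # reduceByKey(lambda a, b: a + b)
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["to"], counts["be"], counts["see"])         # 4 2 2
```

In Spark nothing runs until an action such as `counts.collect()` is called; here every step is eager, which is the key behavioral difference.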
Now copy-paste this part from here — it is commented out; there's a hash. Remove the hash and just run it. Using this configuration object, you create your SparkContext: you say "create the SparkContext and use this configuration". Where is this very useful? Like I said, when you're deploying production code, you type all the code in the application, and there you mention where the master is, how many containers you need, and so on. You can also pass these as arguments — it is not always hard-coded: if I don't mention these things in the code, then when submitting it I can pass them as arguments and say num-executors, this many, and so on. But most people actually prefer this style, where they just set it in their code. So here I'm setting the parameters for my configuration of the cluster, and using that configuration I create my sc. Now I have my SparkContext object. In the shell this is done automatically — when you start the shell, sc is already available; the shell has already created it. So now you've got the sc. And what are we doing? We're reading the captain data — ODI captains, whatever you call it. Now look here: do one thing, change the location — I'm reading from /user/.../raghu, so this path will be different for you. Remember, when you're creating an RDD you can specify the number of partitions: here I'm saying read the CSV data with four partitions, and use_unicode=False — this is to avoid getting the "u" prefix character; I tried it, we will see. And then I'm repartitioning to six: this reads the data into four partitions and then repartitions it to six. There's no particular use for this here — I just wanted to show you that it's possible. If I execute this cell, it should work. And how do you know whether it is actually working? Watch the spelling — the name has an underscore, something like captains_odis — then .take.
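What `textFile(..., minPartitions=4)` followed by `repartition(6)` does to the data layout can be pictured with a plain-Python chunking helper. This is a toy model — Spark's actual partitioner distributes records differently and a repartition triggers a shuffle — but the invariant it shows is real: the records are unchanged, only their grouping changes.

```python
# Toy model of partitioning: split records into n roughly equal chunks.
def partition(records, n):
    size, rem = divmod(len(records), n)
    out, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        out.append(records[start:end])
        start = end
    return out

records = list(range(12))
four = partition(records, 4)     # like reading with minPartitions=4
six  = partition(records, 6)     # like .repartition(6): same data, new layout
print([len(chunk) for chunk in four])   # [3, 3, 3, 3]
print([len(chunk) for chunk in six])    # [2, 2, 2, 2, 2, 2]
```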
So you have to run up to this point, and you should be able to see the captain data — just check that you can read the data and get the first 10 lines, to ensure the reading works. So we have the data. But one problem with this data: since Spark does not really understand the schema, it considers each line to be a string. Even though we have numbers in there, like 56 and 28, Spark will think everything is a string; it doesn't understand what properties you have. Now, if I want to do some addition or subtraction, I cannot — even if I extract that part, it'll be a string, so I can't add or multiply. That is where you can apply your own schema: you can actually define a schema for this data. So as of now we just read the data; nothing else has happened. And if you don't want to keep seeing this output, clear it — you can go to the cell menu, you know this, and clear everything. We'll display the data once, then just comment out that line; otherwise it'll run next time as well. This was just to check, so once you know it works, hash it out — comment it — otherwise every time it will run and show the output on the screen. So these are basically what you have: transformations, and actions like collect — those are the things you can actually see. All right. So why are we doing this next bit? We want to apply a schema, right? And this is the actual schema: name, country, career, matches, won, lost, ties, toss. You just put those in a collection called fields — that's all for now. Once you have it, you can import something called a namedtuple. I'm doing this step by step so you understand: from collections, you are importing something called namedtuple. And what is the use of this namedtuple? A namedtuple is something like what we use for pattern matching sometimes: you have a pattern, or a schema, and you want to
match it against existing data. So what can you do? I created a namedtuple here: it has "Captain" and fields — the name of the tuple is Captain, and it contains the fields I defined above, whatever fields I have. I will use this namedtuple shortly, and then you will understand why we created it. For now, just understand that its name is Captain; we just created it. Then, after this, you write your own function. What is this function doing? If you look at it, you'll understand: it's a normal Python function, named parse_records, and it takes a line as input. If you run this function, it takes a line and splits it on commas. You have a CSV file, right? So if I call this function on my CSV file, it will read every line, split it on commas, and the function will return a Captain object — meaning it will take every field and apply the schema you created (name, country, career, matches, and so on). And wherever you want to convert to an integer, you call int() on it here: the first field is a string, the second is a string, career is also a string, and from there onward it is all integers. So if you call this function, it will read your CSV file line by line — I will call it inside a map, so it operates on every line — extract every column, and apply the tuple we created as Captain, so that you have a header of sorts for this data, and each data type is also defined: these stay strings, and here we manually call int() to cast them, because they come in as strings and I want to make them integers. The similar concept in Scala is called a case class. After this I can say captain.name — dot, then whatever is inside — and get the field.
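Here is a self-contained sketch of the namedtuple-plus-parser pattern being described. The field names match the schema listed above; the sample line and its numbers are made up for illustration, so match them to your own CSV.

```python
from collections import namedtuple

# Schema as a list of field names, then a namedtuple built from it.
fields = ["name", "country", "career", "matches", "won", "lost", "ties", "toss"]
Captain = namedtuple("Captain", fields)

def parse_records(line):
    # Split one CSV line on commas; cast the numeric columns to int,
    # otherwise every field would stay a string.
    cols = line.split(",")
    return Captain(cols[0], cols[1], cols[2],
                   int(cols[3]), int(cols[4]), int(cols[5]),
                   int(cols[6]), int(cols[7]))

# In Spark this function would run inside a map: rdd.map(parse_records).
rec = parse_records("Ponting,Australia,1995-2012,230,165,51,2,124")
print(rec.name, rec.won, rec.won > rec.lost)   # Ponting 165 True
```

Because the numeric fields are real ints, comparisons and arithmetic like `rec.won > rec.lost` now work, which is exactly what the raw string lines could not do.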
So this gives you a schema RDD — you're applying a schema to an RDD. Otherwise, the problem is this: when I read a text file — look at what we did in the word-count example — I read a file, it had lines, and I could operate on them line by line: I can say extract this word, or count this word. But I cannot say "give me the seventh column"; that was not possible in our word-count example because there is no seventh column there, just words. Here, once I apply this, the data gets a schema. Spark still processes the whole thing, but at least you can access the fields — that is the point. Otherwise it is not possible: if I don't use a namedtuple, I can't access the individual columns I want by name. From here, I can now say things like: give me only the captains with such-and-such property — give me the name of the captain whose career is such-and-such. I can access that. That is why we're importing namedtuple. If you go a couple of steps further, you will see why we created this Captain object. Now, if I go here — look at this line; this is what you need to understand, so I'm copy-pasting it. I'm saying: this is my RDD — captains_odis or whatever the name is — and on that RDD I do a map transformation with a lambda: for every element, call this parse_records function. So for every line, this entire function is called — which means every line is split on commas, the individual columns are extracted, and to all those columns the Captain object is applied, which means they get this column structure. But even though this is a schema, it is not a 100% optimized schema. What is an optimized schema? If you want a proper schema, like an RDBMS table, you have to go for something called DataFrames — we will see that in Spark SQL. For the time being I just want to represent the data in a columnar format; that is why I'm using this. Okay, so I'll run this.
Okay. And if I do a take on captains — can you see the difference in the data now? Do you understand the difference? These are objects with the schema — with whatever field names you mentioned, like name and country. It is not a table; don't think it will come out like a table. These are row objects, actually: each row has been converted into a Captain object. This is called a schema RDD, because sometimes you want to work with a schema and sometimes you don't; right now I want the schema, and that is why each row became a Captain object. Otherwise it would just give me 10 lines, and in each line I would have "Hussain", but I wouldn't know which field holds what. So here it says name is Hussain, and I can say: I want all the names, or all the countries — like that. The important points to remember here: you have to create a namedtuple with whatever column names you want, and then you apply it to the data. Also — this is very important — on your original data you have to call int() on the numeric fields, otherwise by default everything will be a string; everything being a string is useless. Or float, or whatever you want to convert to — you do the conversion there. One more thing: here we're saying I'm creating a namedtuple, the name of my tuple is Captain, it contains these fields, and I'm storing it as Captain. You can change that variable name and it will still work; I just used the same name for both, which is why it is a bit confusing — the object name here is actually this Captain; that is what we're getting. By the way, PySpark internally treats RDDs of objects specially: if you have an RDD of objects, it's called a PipelinedRDD — I will show you. So now, if you can come to this point — at this point I'll just clear the output, because otherwise it is very messy, and just comment out this line.
Comment it out, because if you don't, it'll run take(20) again every time and mess things up. Now, look at what we're trying to do here. Is this correct data? Yes — look at the tag called matches. Now, the problem is that the schema lives inside every row: you have the schema per value, not per column. If you look at an RDBMS table, you have a column, and the column carries the schema — column first, and the schema hangs off it. Here, it is like associating a tag with every value — that is how the schema is represented. That is why it is not very efficient; I can't even call it a proper schema: on every value I'm adding a tag like "name". I can do some very basic operations, but really, if this is a CSV file, I would create a DataFrame — that's the easiest way to work on it. Now, do you know what this means: records.won greater than records.lost? It means I want all the captains with a history of winning more than losing. Then what do I do? I say captains_more_wins dot map — I just want to print this — mapping to the country, then dot collect. We call collect and see what happens. So can you identify what we've done? You said: I want all the captains whose record shows they won more than they lost — that is what I'm doing here. Then I'm saying: from that data — this is that RDD — I want only their country and matches; I want just those columns. What if I want their name and country instead? I can say rec.name and rec.country — whatever the field name is in your schema. Now you see you're getting the name of the captain and the country — these are the captains, and their countries, with more wins than losses. So, do you feel that, overall, writing Spark code is easy — I mean compared with writing MapReduce code? Probably. Now, this next one is just a word count; I don't want to run it.
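The filter-and-project step just described looks like this on a small local list of Captain records — Spark's `.filter`/`.map`/`.collect` are shown as list comprehensions, and the records are made-up data:

```python
from collections import namedtuple

Captain = namedtuple("Captain", "name country matches won lost")
captains = [
    Captain("A", "AUS", 100, 70, 30),
    Captain("B", "IND", 120, 50, 70),
    Captain("C", "ENG",  80, 45, 35),
]

# rdd.filter(lambda rec: rec.won > rec.lost)
more_wins = [rec for rec in captains if rec.won > rec.lost]

# .map(lambda rec: (rec.name, rec.country)).collect()
result = [(rec.name, rec.country) for rec in more_wins]
print(result)   # [('A', 'AUS'), ('C', 'ENG')]
```

Swapping `rec.country` for any other field name is all it takes to project different columns — which is exactly the benefit of the namedtuple schema over raw lines.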
In the previous example we got the country and the matches, and you just applied reduceByKey — essentially a word count on the matches: it adds up all the matches per country. I don't want to run that one either. You can also do sorting — we have already seen sortByKey, and if you want descending order you can say ascending=False. That is sortByKey; and there is also the sortBy method I've shown you. Okay, but I think this analysis requires all of these steps in that form, so I'll just run them as-is. So what I'm going to do here: I'm copy-pasting, and saying I want all the captains, with the country and the matches only. Then I do a word count on them, and if I do matches_countries dot collect — look at this code and let me know if you have any questions. Now, I've seen this error a few times, so just to give you an idea: when you're creating the notebook, you have to say File, New Notebook, and the interpreter has to be PySpark. Some of you used Python; it doesn't give any useful information or suggestion — just "cannot read, unknown problem". The interpreter has to be PySpark, not Python. Then, what we're doing: once you get the countries, you can sort them — we have already done sorting, so you don't have to worry about that. Next you will see there is an RDD called captains_100. What is captains_100? It contains the captains who have played more than 100 matches. And I'm just using this — let me show you what I'm doing: take the second element, check it is greater than zero. So what does captains_100 contain? Actually, this extra filter is not really required, because it does nothing: it says take the second element and check whether it is greater than zero, but since it always is, it will not filter anything out; it gives you back the same RDD.
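The reduceByKey-then-sort step — sum the matches per country, then sort with `ascending=False` — behaves like this locally (the country/matches pairs are made-up data):

```python
# (country, matches) pairs, as produced by the projection described above.
pairs = [("AUS", 100), ("IND", 120), ("AUS", 80), ("ENG", 90), ("IND", 30)]

# reduceByKey(lambda a, b: a + b): sum the values per key.
totals = {}
for country, m in pairs:
    totals[country] = totals.get(country, 0) + m

# sortBy(lambda kv: kv[1], ascending=False): biggest total first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # [('AUS', 180), ('IND', 150), ('ENG', 90)]
```

`reverse=True` plays the role of `ascending=False`; sorting on `kv[0]` instead of `kv[1]` would be the local analogue of `sortByKey`.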
same artery There is no change in that But I'm using the same Are really name here this captain's 1000 And when I'm using here the same artillery only And then what am I doing I say map I will call a function I will say Give me the name and then watch What is this doing Florida Off Born They were the bad matches So what This will give you statistics of the Captain Light like person days Right on bent If I do so this is their winning probability or whatever you call right because you are dividing work You're taking a Florida on your dividing warn they were ordered by matches So out of how many matches how many were born So it seems Ricky Ponting is work 71 person based almost you can say right outof matches played 71% Chua Sworn Dorn is like 55% days and so on and so forth So you're just analyzing the status ticket status tickety right Then you can also soared them So I says sword by in the ascending order so that we can find who is the successful captain So how do you find a successful captain Map Lordly Right So you're saying that for a B in result Okay up in a happened B should be ableto uh the perilous work See what I was doing So we don't use it normally when you just run till this Okay it becomes an are really So the result is an rdd Okay on You cannot photograph on the artery directly So I deleted this initially Then I didn't think that so But then use their daughter collect The artery will be saved in tow this variable holder result now result will not have an r Really Rather the self will have the output of whatever data that you contain and that you can float You can't die threat over an rdd All right so call in action and then that result has to be saved Then you can plot it But now I think you should be able to see that Yeah All right It blocks I mean it is definitely plotting is very simple It's not broken science but she had I had a confusion Okay so remember this point directly You cannot run any normal plotting our third or an rdd You're to say dot 
Collect So the idea he will be available in a variable bite on variables but in that area But you can't say I want to do whatever I want right now One question We had waas about the stages in spark So why don't we do this together Uh can you open the shell I mean in the web Consort go to the console on copy paste your user name and password right And simply say bye sparkle Launch it in the local Mort vice part and I just want to run a normal work on nothing special Okay But I want to show you something while doing that So I'm just launching the shelf on I have the court here so I will just go here and I'm doing a regular work Um no special stuff right I scroll up Um So I will just read this book are really So here is your book are ready and I will simply say the word count So you will get word count out Put on There is no rocket time Steven I'm not filtering I'm not doing anything I just do it right And now I exit from the shell because I want to show you the u Y If I exit from the shell it is gone on dhe We have the spark histories over So let me access that So here is like history server on This was my session deal Faculty One last topic like somebody was asking what is uh you know stages in spotlight So actually you have something called a job This is split into something called stages and this is split into something called task There are three things a job is something When you which will run when you call a collect say for example you say Do this Do this do this dot Collect That's called a job Let's say you're right Something like this You haven't really called a you say a dart map Something you did You said filter Okay something You're dead And then you said collect Finally you're collecting the artery All right so this entire thing is a job There's a job and spark will fire a job on leave Anu call in action that is for sure Either you count or you collect whenever you call in action What did this it understand or care to create this artery then no map then 
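To make the captains pipeline above concrete without a cluster, here is the same logic with plain Python lists. The sample records (name, matches, won) are invented for illustration; each step is commented with the RDD call it mirrors.

```python
# Plain-Python sketch of the captains analysis; data is made up.
captains = [("Ponting", 230, 164), ("Dhoni", 200, 110), ("Smith", 50, 20)]

# rdd.filter(lambda x: x[1] > 100)  -> keep captains with over 100 matches
captains_100 = [c for c in captains if c[1] > 100]

# rdd.map(lambda x: (x[0], float(x[2]) / x[1]))  -> winning percentage
win_pct = [(name, float(won) / matches) for name, matches, won in captains_100]

# rdd.sortBy(lambda x: x[1])  -> ascending, so the best captain comes last
result = sorted(win_pct, key=lambda x: x[1])

# On a real RDD you must call .collect() before looping or plotting;
# here `result` is already a plain list, so we can iterate directly.
for name, pct in result:
    print(name, round(pct, 2))
```

The key point carries over exactly: only after the collect step do you have an ordinary Python list that matplotlib or a for loop can consume.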
the filter, then show the output. This entire thing is called a job. Now, a job will contain something called stages. So what is a stage, and when does a stage come in? To understand that, know that there are two types of transformations: you have narrow transformations and wide transformations. All the transformations you write, map, filter and all, are either narrow or wide. Examples of narrow transformations: map is narrow, filter is narrow.

So why do you call them narrow transformations? Say for example I have partitions of data: this partition has A and B, this one has C and D. I'm saying do a map: lambda x, for every x convert it to (x, 1); I want to make key-value pairs. If this is my transformation, each partition can apply it independently, meaning A will become (A, 1), B will become (B, 1), and the same on the other partition. These two executors, these two machines, need not talk to each other. There is no dependency, no coordination; that's a narrow transformation. When you write a transformation like map or filter, it can be applied individually to a partition. A becoming (A, 1) has nothing to do with C. This machine takes care of what should happen here, that machine takes care of what should happen there, and there is no data transfer in between. That is a narrow transformation, and examples are map and filter.

So then what is a wide transformation? Examples are things like join and groupBy. There are transformations like join and groupBy, and there is something called groupByKey. Let's say your data is like this: you have two machines, and you have key-value pairs. Here you have (A, 1), again (A, 1), and (B, 1); there you have (B, 1) and again (A, 1). Now I'm saying groupByKey. What will groupByKey do? It'll group all the data based on the key. Now my keys: A is here and also there. Can each machine independently do the grouping? No. This A should travel here, and this B should travel there. That is a wide transformation; it is not narrow.

Now, all the narrow transformations will be executed in one stage. That's what a stage is. Meaning, if I'm running this code, saying do a map, do a filter, do a collect, what will Spark do? It is a job, and it will call for one stage. One stage means all these tasks will be performed sequentially, one by one, on all the independent partitions, and your output will be shown. Now let's say after this filter you call something like groupByKey. What will Spark do? It knows that map and filter can run independently, but for the groupByKey to work, everybody should complete those first; only then can it start, because groupByKey requires the whole data. So Spark will create a stage boundary here. It will ensure everybody completes map and filter; that is stage one. Maybe there are 10 partitions or 100 partitions; it waits for everyone to complete map and filter (because they're narrow and independent, each can go at its own pace), then asks: has everybody finished map and filter? Yes. Now we have to shuffle the data, so it calls for a different stage. This becomes your stage two, and that was stage one, because these two cannot run in one shot: for the grouping to work, the earlier steps must complete first. So whenever you write a wide transformation, Spark will create a new stage.

But in the DAG it is not always visible exactly like that; that is where I was confused. Normally when you do a word count, reduceByKey is a wide transformation. So ideally, if you look at the DAG, it should show that flatMap and map come in one stage, and reduceByKey comes in another stage. There is a special type of RDD called a shuffled RDD, a ShuffledRDD. This
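The stage boundary described above can be simulated with plain Python. The partitioned data is invented; stage 1 applies the narrow operations to each partition independently, and the groupByKey step forces a shuffle, where values for the same key must be gathered from every partition.

```python
# Toy simulation of stage boundaries with made-up data.
from collections import defaultdict

partitions = [["a", "b", "a"], ["b", "a", "c"]]  # two partitions on two "machines"

# Stage 1: narrow ops, no communication between partitions.
stage1 = [[(x, 1) for x in part] for part in partitions]           # like map
stage1 = [[kv for kv in part if kv[0] != "c"] for part in stage1]  # like filter

# Shuffle: records with the same key must travel to the same place.
shuffled = defaultdict(list)
for part in stage1:
    for key, value in part:
        shuffled[key].append(value)

# Stage 2: the groupByKey result, one entry per key.
grouped = {key: values for key, values in shuffled.items()}
print(grouped)
```

Notice that stage 1 touched each partition in isolation, while building `shuffled` required reading every partition: that cross-partition data movement is exactly what makes a wide transformation expensive.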
ShuffledRDD will come exactly in the middle of these two stages, because once stage one is over, Spark has to shuffle the data. It creates an internal shuffled RDD: it shuffles all the data and gives it to the next stage, saying now you can do the grouping. reduceByKey, groupByKey, join: all are wide. A join I cannot do in one partition; I need the partitions, and then I have to merge them to run the join operation. So wide transformations are costly for Spark, because they require the network: the data has to travel. Depending on the cluster and the amount of data you have, if you write a lot of wide transformations it will take a lot of time. So when you write your code, you should try to minimize wide transformations.

Okay, so those are the stages; and what is a task? A task is the map or the filter: tasks are the individual transformations, either a map or a filter or anything, you can say. So first you have a job; the job will have stages; each stage will have tasks, and those tasks are the transformations.

Now, let's say you have three partitions which are running on three machines. What if one machine crashes? It can happen. Internally Spark has something called a lineage graph, meaning when you run a job it remembers: okay, this is the job, read the data, then map, then filter, and so on. And on each partition, the driver knows what is happening. So you have a driver, and the driver is communicating with the executors; the driver is pushing the code: first run map, then filter. The driver is the guy who is controlling all this. So if a machine crashes, the driver will know that this executor is gone. It will launch another executor and reload only that partition; it knows which partition was loaded there, and it asks the new executor to continue. Your program will be a bit slow, but it will execute: it is fault tolerant, and only that partition is redone, because the other two are already running. So the driver knows what partition was loaded there, and to what state it was processed. Probably it had already done the map, but whatever progress it made, can it resume from that point? No, that's not possible: whatever data was there is lost, so it has to start that partition's lineage from the beginning.

One thing I really see when you're creating your own Spark applications: you will write your own SparkContext, and one of the problems that we see in production environments is this. Let's say I'm a developer and I write my own Spark code. I create a SparkContext, read the data, process the data, and submit my program. What will happen? My program will create a driver, it will also get executors, and it will process my data. Once processing is over, the executors will get released; if they're not doing anything, the executors will be gone, but my driver will keep running. This is a common problem in a shared Spark cluster. Say ten developers create Spark programs and submit them. The programs will run, no problem, the analysis will be over, but you created this SparkContext, and if you don't stop it, it will run forever. So you have to stop your SparkContext once your processing is over. When you're writing the code, you will say import SparkConf, import this and that, create the SparkContext, read a text file, map, flatMap, collect, and then you say sc.stop(). Many developers don't know that. So what happens is the driver will run forever in the cluster, eating resources: a driver will take one gigabyte of RAM. The executors will be gone once your processing is over; YARN takes care of that, and if they're sitting idle they get deleted. But your driver will not be gone, because that is what holds the SparkContext object alive. It will unnecessarily keep running, eating resources, and it is not doing anything. So you should stop your SparkContext when launching in a cluster.

Right, like we had that question: where are the errors RDD and the log-lines RDD stored when we read them? So, caching is valid only while your program is running.
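The shutdown discipline described above is just a try/finally around the context. Since we cannot assume a cluster here, the sketch below uses a small stand-in class in place of pyspark.SparkContext; with the real library the shape is identical, and the finally block is what guarantees the driver releases its resources even if the job fails.

```python
# Sketch of the SparkContext lifecycle; FakeSparkContext is a stand-in so
# this runs anywhere. With pyspark you would create SparkContext(conf=...)
# the same way and rely on the finally block to always call sc.stop().
class FakeSparkContext:
    """Stand-in for pyspark.SparkContext, just to show the lifecycle."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

sc = FakeSparkContext()
try:
    pass  # read the file, map, flatMap, collect ... your actual job
finally:
    sc.stop()  # always runs, even if the job above raised

print(sc.stopped)
```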
You are not caching for one month; caching might be for microseconds. The idea of caching is not to store that RDD for a month; you can't do that in any cluster. Normally you cache it so that execution is faster: while your program is running, its execution should be faster, and that is when you cache. So let's say you write a word count program. You read the text file, then flatMap, then map to key-value pairs, then reduceByKey. That is your word count program, and then you say .collect(). If you run this entire code, what happens? It does the word count, and afterwards every RDD is gone. Okay. Now let's say you come back to the intermediate RDD, the point where the data is in key-value form, and you decide: I want to save this data, save it to Cassandra for some reason; I just want to push this data into Cassandra. If you call that action, it has to start from the beginning again, right? So before that, I can cache this RDD, so that once the first action is done, when I call the save it can start from here, because this RDD is cached. That is the use of caching: the intermediate RDD can be cached, so during the execution of your code it makes things faster if you're calling multiple actions. Otherwise there is no use in caching an RDD.

So, I can do the same kind of thing with what is called, in the Spark course, a DataFrame. You have a CSV file, you can read it as a DataFrame in Spark, and then you can write your queries, and it will be much, much faster, because Spark is executing your query.

This is the word count code for Scala. flatMap, right: first you do a flatMap, saying take x and convert it with x.split using space. Then you say map: take x, make (x, 1). Then you say reduceByKey, and there you say (x, y) => x + y, with the rocket symbol. I'm using the same flatMap, map and reduceByKey. In Python you say lambda x: (x, 1); here in Scala you write
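Why the cache matters can be shown with a counter. The data and the "expensive" step below are invented; the counter plays the role of the upstream read-and-transform work that Spark would redo for every action unless the intermediate RDD is cached.

```python
# Counting recomputation: without caching, every action redoes the upstream
# work; with a kept intermediate result, it runs once. Data is made up.
reads = {"count": 0}

def read_and_map():             # stands in for textFile + flatMap + map
    reads["count"] += 1
    return [("a", 1), ("b", 1), ("a", 1)]

# Without caching: two actions (collect, then a save) recompute everything.
_ = read_and_map()              # action 1
_ = read_and_map()              # action 2, e.g. saving to Cassandra
assert reads["count"] == 2

# With "caching": compute once, reuse the kept result for both actions.
reads["count"] = 0
cached = read_and_map()         # like rdd.cache(); materialized on first use
_ = cached                      # action 1 reuses it
_ = cached                      # action 2 reuses it
print(reads["count"])
```

On the cluster the fix is a single `.cache()` call on the intermediate RDD before the first action.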
the rocket symbol (=>) instead of lambda. That is the only difference; technically there is no difference in what it does.

Plotting in Scala, I don't think you can easily do, because plotting requires something like matplotlib and the surrounding tooling, and I have not worked with it extensively, because we have something called Zeppelin that is very extensively used. It's like Jupyter, but Zeppelin is built more for the Hadoop world. So whenever we have to do any visualization, we go for Zeppelin notebooks: same concept as a Jupyter notebook, but Jupyter was not created for Hadoop; it was created for Python and the like, and only later did they add Spark and other interpreters. Zeppelin was created for the Hadoop and Spark world from the start. So you can run Hive queries, for example. If I want to run a Hive query, one option is to open the command line and write the Hive query; another is to open a Zeppelin notebook, copy-paste the query, and it runs for me. This is useful for non-technical people, in the sense of managers and so on. Say you have to show some report to a manager, and you have some thousand rows; you can't write a SQL query and expect them to follow it. So the tool connects over JDBC, pulls the data and the tables, and shows everything, with interesting fields and names, so somebody can look at it and understand the data we have, and do some sort of drag and drop to build the charts.

These are the BI tools, Spotfire and Tableau and the like. There you can filter and say: I want this column and this column from this data, and it can create complex statistics and graphs. It is not exactly like writing a proper join query or something, but if you want to find relations and so on, you can do it. But actually you need training on a BI tool to work with it; you cannot just sit down and say
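The list of readable formats above can be made concrete. Since we cannot assume a Spark installation here, this sketch reads the same kinds of data with the standard library; the comments note the Spark equivalent for each, and the sample contents are invented.

```python
# Reading text, CSV, and JSON with the stdlib; in Spark these would be
# sc.textFile, spark.read.csv(..., header=True), and spark.read.json.
import csv, io, json

txt_data = "hello spark\nhello avro\n"
csv_data = "name,matches\nPonting,230\nDhoni,200\n"
json_data = '{"name": "Ponting", "matches": 230}'

lines = txt_data.splitlines()                       # like sc.textFile(...)
rows = list(csv.DictReader(io.StringIO(csv_data)))  # like spark.read.csv(...)
record = json.loads(json_data)                      # like spark.read.json(...)

print(lines[0], rows[0]["name"], record["matches"])
```

Note one difference the CSV case highlights: a plain text reader gives you strings ("230"), while Spark's structured readers can infer or be given column types.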
JSON. So there are two common formats in which Twitter can give you the data. One is JSON, which is like your key-value pairs; the other is Avro, which is a serialization format. I'll show you the Avro data, and then we can talk so that you can understand it. Avro is an Apache format, an Apache project, actually. First let me show you the data, and I'll speak about it after. I think it should be somewhere here; I downloaded some tweets just to see. There is a folder with tweets. This might look a bit different, because this is Flume bringing the data in, from your Flume example. Let me download the data: Actions, Download, and we got it. If I open this (I asked it to save as a .txt file, and I'm opening it in Notepad++), can you see: avro.schema.

So why are people using Avro? Avro is a format where we can send the metadata along with the data. Normally, if you have a CSV file, you'll have columns, headers and so on; it is structured data. Avro packs the schema in and sends it with the data, so whoever is reading will read the data along with its structure. This is public data; somebody has tweeted this, whatever it is, I don't know; the tweet itself is not from Apache. Avro has, here again, the schema: type record, names. So it attaches a schema to your data and then sends your data. What Twitter sends is this: these are actually the tweets. I had not seen this before; some public tweets are in there. So the point is: Avro files come with a .avro extension, and normally, if you read one like a text file, you cannot make sense of it, because it is a proper binary format. But Spark can read this and present it like a proper row-and-column table, because it has the schema. Spark reads the schema, and then you can say: okay, this is a table, built from the Avro, and present your data in a structured format. You can read it manually if you struggle a bit; I know the schema, so we can read it, that's not a problem, but you cannot process it that way. If somebody wants to process it, it has to be in a proper structured format, very much like JSON: if I'm getting JSON, I can read whatever is inside the JSON, but if somebody wants to process JSON key-value pairs, then you have to parse it into a table. The same goes if you get a .avro file.

Now, why is this very important? In most of the big-data pipelines and platforms, a lot of data will be sent as Avro, because Avro serialization compresses your data, so you get savings. Say I have a very large text file and I want to send it over the network: I can convert it into Avro and send it, and actually save the space. So many systems send data in the form of Avro, and in Spark you can directly read Avro; there is an Avro reader available, and it understands what the Avro contains. And when you're getting sensor data, a very common example, they give it in Avro format. Usually it comes in the form of events: at the sending end, every second some sensor is sending some data. It will be an event; every second you can see an event file landing, and this event file will be in the form of Avro. Now, in the case of sensor data you may not want the schema, because if a sensor is sending the data, the schema will be something like the name of the sensor or similar. If you're OK with that, fine; you can also remove the schema. Probably you're interested only in what the sensor is sending, not the metadata. But you can add the metadata along with the data and send it; that's Avro, and Spark can read Avro. Parquet is a compressed, columnar format, and
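What "the schema travels with the data" means can be sketched with plain JSON structures. A real .avro file is binary (libraries such as fastavro read it), so this is only the idea in miniature; the record layout mimics an Avro schema declaration, and the field names are invented.

```python
# Idea behind Avro: schema and records ship together, so the receiver can
# reconstruct the table structure without out-of-band information.
import json

schema = {
    "type": "record",
    "name": "Tweet",
    "fields": [
        {"name": "user", "type": "string"},
        {"name": "text", "type": "string"},
    ],
}
records = [{"user": "someone", "text": "hello avro"}]

# Serialize schema + data together, as Avro does (here as JSON, not binary).
payload = json.dumps({"schema": schema, "data": records})

# A reader recovers the column structure purely from the payload, which is
# what lets Spark turn an Avro file straight into rows and columns.
decoded = json.loads(payload)
columns = [f["name"] for f in decoded["schema"]["fields"]]
print(columns)
```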
in the Hive class we have seen Parquet. Hive can use Parquet, and Hive can also use ORC, the optimized row columnar format. So Parquet, ORC: all of these Spark can read; in most cases you will be able to read the data.

Now, the second question that people ask: what if you're getting some weird type of data that nobody on Earth is able to understand, and you want to read it in Spark? Then you have to figure out a way; Spark is not going to help you with everything. Again taking the GE use case: their GPS data comes in the form of Avro. They get this sensor data from their train engines, the locomotive engines running in Europe and the US. These engines send their data, and it comes as Avro, but even that data Spark is not able to read, because it is not the normal Avro format. If you give that data to Spark, Spark will say: I'm not able to read this, I don't understand what your data is. So there, they have written a serializer/deserializer to read the data, give it some structure, and feed it into Spark. Sometimes you have to do that; not every data format can be read and processed directly. Also, some organizations have a data ingestion team: a separate team whose job is to take in all the data, make it into a usable format, and then hand it over.

You also have this VXML. Do you know VXML, voice XML? Like at 24/7: there's a company called [24]7 where I went for consulting. It's a BPO in Bangalore, a very, very big call center. These guys are in the call-center business, but they do a lot of analytics. They have call-center voice and chat. All the chat messages they get are text data, so that's fine, but all the calls they also collect, because the call-center business is outsourced to this company. Let's say you have a company and you need customer care: you outsource it to [24]7. Now, they will not analyze the content of your calls; it's possible, but they don't want to, so they're not analyzing what was said. But for every call, if I'm calling a call center, they get the metadata: who called, how much time you spent on the call, which buttons you pressed, one, three, four. Now, this data comes in a format called VXML, voice XML. There's a system which reads this from their telephony and gives it in some sort of XML. That VXML you cannot directly load into Spark; it's not a normal XML with rows that you can just load. There is no VXML reader, since it is generated by these call-center telephony systems, and it is messy. It is semi-structured text data: semi-structured means there is some structure, but Spark is not able to make it out. So in those cases, this company has a team who gets this data, writes the converter, gives it a structure, and feeds it to Spark. The exact format differs for every business; what they wrote, only they know. It is not publicly available; it is their business use case.

So for reading and creating an RDD, you won't have too much worry: most data types you can read directly. Otherwise there will be a team who gives you the data in some form you can understand, so you are able to start working on it. Now, truly unstructured data is a different use case, like audio, video and all. If you're talking about that, it is very difficult to handle, trust me. You don't want to do it, actually, even given the chance; or maybe you do want to, because these are the fancy things. Like machine learning: people think machine learning is very fancy, but when you start learning the statistics and the regression, you get bored. The same people say, I'm doing video analytics, and I
have to laugh, because it's pure mathematics. Like in the movies, they match a face; how? You have a video, and what do they do? They read the video data frame by frame and convert it into binary. You cannot directly load a video into any system that you're using. We had a project where we were analyzing security CCTV data, not the whole data, some specific data, and there are third-party tools which will read your data and convert it into binary, and then you write your logic on that binary data. Why? Because there is no other choice. On the binary data you have to find patterns, whatever it is you want to do; it's not like click, click, click and somebody's face pops up. It doesn't work like that; it is very complicated, actually, and the encoders are very difficult to work with. Even images: a video or an image you convert to binary, and on that binary you work. So that is a whole different area, and it requires a lot of additional knowledge beyond just Spark. But for normal structured and semi-structured data, which is mostly what you get, you can easily handle it in plain Spark.

So this is Avro; we'll see Avro later when we run it, if we manage to get the data. In the Flume class I'll show you how I got this data, this Flume data. You can also get it; it is not a big deal, actually. Getting Twitter data is very easy, not at all difficult.

And I want to give you an idea of what we're going to do today. These are the files I have shared with you. This is the slide deck; that's okay, we don't need the slides for today. Now, there's a file called pyspark word count. It's a text file; you can see here, I just opened it. You need this file to start the session. As of now we need only this pyspark word count file, and we will see the others later.

And what we will do: we will take a different approach, in the sense that we will do everything together, so that you can also understand what I'm doing, and we will go step by step, with a bit of explanation along the way; I will do something and explain at least the basics. Once you complete today's class, you will have a fair understanding of how to create an RDD and how to work with an RDD. Then if I give you an assignment, you can easily do it, or at least run it without any problems. But the basics, how it is working and what is happening, have to be very, very clear. One flip side: I'm still figuring out how to get the Spark UI to work; the live Spark UI is still not available, as of yesterday, but we will figure out a way to look at it, because I need you to see what is happening.

So I will reiterate what we discussed in the last half an hour yesterday, because I was the only one doing it; you were not doing it. The first thing you need to understand is that you can start the Spark shell, and the Spark shell will let you work with Spark interactively. That is one of the ways you can work: using the shell. The second way is that you can write a program. If you know what you need to write, you can take a notepad, write your Python code, save it as a .py file, and then do something called spark-submit. Once you have a program, you say spark-submit on the Python file and it will run for you. That is actually what we do in production: first you explore in the shell, and once you know what you have to write, you write your code and submit it as a file. If it is Scala, you compile and create a JAR file, like with Java, and do spark-submit.

Right, now: remember yesterday I told you that Spark can run in local mode and in YARN mode. By default, when you start the Spark shell, it is in local mode; it is not using YARN. And you can also explicitly tell Spark to start in local mode. So how do you do it? You simply
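The explore-then-submit workflow just described looks like this on the command line; the file name wordcount.py is only an example.

```shell
# Step 1: explore interactively in the shell (local mode by default).
pyspark

# Step 2: once the code works, save it as a .py file and submit it.
spark-submit wordcount.py

# For Scala you would build a JAR first and submit the class instead:
# spark-submit --class MyApp myapp.jar
```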
say pyspark. Okay, then you say hyphen hyphen master (sorry, I always have this confusion with pyspark too: it is --master), and then you say local. And let me do something here: I will type local[3], or whatever number; you can type it too. So you're telling it: hey, start the Spark shell in local mode, because master says local, with some number, call it three. You can all start it along with me.

Now, what is happening here? What is this three? When you're starting in local mode, you can mention the number of processor cores that Spark can use. Three means three cores: it will use three processor cores, because it is one machine only, a local machine. Local mode means you're running on one machine. If you don't mention anything, I think it uses only two cores or so. Now I mentioned three, so that means Spark is running in local mode and it is using three processor cores for execution.

Now, what if you don't know how many processor cores you have? Say you have eight cores or sixteen cores and no clue. (So that started Spark; nothing special happened. This is the cluster, right: you have connected locally on one machine in the cluster.) If you don't know how many cores you have, I'm exiting from here, you can also say local[*]. Star means use the maximum available processor cores. If you have eight, it will use all eight for running Spark, apart from the operating system; obviously your operating system will be using some processor cores, that's separate. However many free processor cores you want, you can mention using this argument: local and then a number.

Quad core, actually; see, it's a bit confusing: there are physical cores and virtual cores, and each physical core may have two. How do you know what processor cores you have? Here, how many do we have? Four; I mean, in my laptop there are four, and if I expand, each one really has two virtual cores, so in total I have eight. If I were running Spark on this machine, it would use a maximum of those cores, because that's all you have. You go to the Device Manager and expand it to see how many you have; I'm just saying, in case you want to look. I think my laptop is an i7, where each processor core internally presents two virtual cores or something, so you can actually see how many cores you have, and that many it can use. But anyway, local mode is not so great, in the sense that you use it for development; in production you're not going to use local mode, you will say master yarn only.

So, you're able to start Spark. Now let's do one thing. We will also try YARN mode, but before we go to YARN mode, let's try to create an RDD. The first thing you need to learn is how to create an RDD. You can create an RDD in two ways, actually. One is from a file: I have a CSV file, I want to read it and create an RDD; you can do that. The second is: I don't have a file, but I have a collection, say a Python list or a dictionary. I have a Python dictionary and I want to convert that into an RDD. That is, sometimes your data will be in a data structure, not in a file. What if I have a dictionary of a million objects and I want to read it as an RDD? I can create an RDD from a dictionary or any data type in Python: it can be a list, a tuple, an array, or any type you have; you can convert it into an RDD. And the first example that I'm going to show you, if you have started the shell, you can actually do this; you can copy-paste it, or type it. You have a range function in Python, you know that: if I say range(1, 100), it will give me the numbers. It's a collection;
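Putting the launch options together, these are the invocations discussed; the core counts are just examples, and the quotes keep the brackets away from the shell's glob expansion.

```shell
pyspark --master 'local[3]'   # local mode, three processor cores
pyspark --master 'local[*]'   # local mode, all available cores
pyspark --master yarn         # what you would use on a real cluster
```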
range(1, 101) means you get 100 numbers. Now, these 100 numbers I want to convert into an RDD. Okay, so how do you do it? You call a function called parallelize. You can see here: parallelize is a function which will convert any existing collection into an RDD. It cannot be used to create an RDD from a file; if you have a CSV file, you cannot say parallelize, it will not work. It is only for Python collections, like dictionaries or sets or lists, anything that you have that you want to convert, you can say parallelize. So right now I have a range of 100 numbers; it is a collection of 100 numbers, actually. And I'm saying: hey Spark, parallelize these 100 numbers into an RDD called a. So a is the RDD now, actually, right? And I'm also saying `sc.`; this is very, very important. So what is this sc? sc stands for Spark context, SparkContext. So this is a Python shell, right? What we started is a Python shell, the pyspark shell. And if I write something here, any code I write, Python will execute it. That is wrong, right? I want Spark to execute it. If I write simple Python code here, Python will run it for me; I don't want Python to run it, I want Spark to run it. So, in order to access the Spark libraries, you can use this object called sc. When you say `sc.` something, you're saying: dear Spark library, go and do whatever I'm saying. So I'm saying: dear Spark, create an RDD from this set of numbers. Yeah, so there is something more, a SparkSession object, that you can use for DataFrames or, what do you say, Spark Streaming. If you want to communicate with those libraries, you can use a SparkSession object. sc is your SparkContext object. In Spark 1.6, before we had SparkSession, we had SQLContext, SparkContext and so on. So remember, this object is required, this sc, guys, whenever you want to communicate with Spark. So now you create an RDD, right? And if I want to manipulate this RDD, if I want to filter it, I don't want
to call sc, because an RDD is an RDD, and anything that I do on an RDD, only Spark can execute; I don't have to say sc every time. But during the creation of the RDD I have to say sc: `sc.parallelize`, and I give the numbers there, range(1, 101), so 100 numbers are now available as an RDD. But remember, Spark is lazy. Nothing is happening now. You type your code, but it will not execute your code, because unless you call an action, Spark is not going to worry about what you're doing. And if you want to call an action, you can simply say `a.collect()`, and Spark runs and shows you the data. So simply, collect is probably the most used action. Try that: if you run a.collect, you can see that a Spark job ran. That's my whole point. A Spark job actually ran when you called the collect: it read the data into memory, printed the data, and now you can see the data. We are okay with that; we're not doing any transformation right now, we're just looking at what is happening. Now, an interesting question: how many partitions does this RDD have? Because I did not read it from Hadoop or anywhere, it's just the numbers 1 to 100, right? So obviously Spark should have partitioned our RDD. How many partitions? I don't know, and even you don't know. So if you want to see that, you can simply say `a.getNumPartitions()`; the method is there, get num partitions. Very interesting: how many partitions do you have? Eight? I have sixteen. Fine, it may be different, okay. Is there anybody who is having sixteen? How did you start the shell? Did you mention master local[*]? So, the number of partitions depends on the number of processor cores. If you started with three cores, you get three partitions. When I started, I told you, I mentioned local[*], and I think my machine is having sixteen cores, so it allocated sixteen cores. Yeah, you may be on a different machine, exactly. So I'm probably connected to a different
machine, maybe different machines, actually. Now, behind the scenes these are optimized. It does not mean that because I'm getting a machine with four or sixteen cores, they are all mine. I'm handling a very small workload, right? So even though cores are allocated to me, if one is sitting idle, some other process may take it. So you're just saying how many partitions this RDD has, and I can prove it. If you want to try this, do this: exit from here. Okay, exit, and say: I want to start the Spark shell with, let's keep a common number, what do you want? Five? All right, I will say five. So I'm starting with five processor cores, and if I create the same RDD, I should get how many partitions? Five partitions, ideally, right? Let's see what's happening. You can also try along with me. And the same idea actually works in the Scala shell too; even if I exit and start the Scala shell, the command you type is similar, just not pyspark. Now if I do getNumPartitions: five. So that is what happens; when you practically do it, these are all confusing. You are in local mode, that means you're on only one machine; it's not a cluster. On that machine I asked for local mode, I asked for five processor cores, and if I do getNumPartitions I see five. Because, by default, Spark is going to partition your data assuming that you have five cores: it will create five partitions so each processor core can process one partition, ideally. We are in local mode, again. Now, this rule is applicable only if you do parallelize. If you read from a text file or something, it doesn't go like this; there is a different rule. Okay, now another question. Can I change this? Now it is creating an RDD with five partitions, right? I want ten partitions, or a hundred partitions, probably. Can I do it? Of course. One thing is you can repartition later, but while creating the RDD itself, can I ask for, say, ten partitions?
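To make the partition arithmetic concrete, here is a tiny pure-Python sketch of what parallelize conceptually does. This is not Spark or the PySpark API; `parallelize_local` is a made-up name, and real Spark distributes the chunks across executors rather than building lists in one process.

```python
def parallelize_local(data, num_partitions):
    # Toy stand-in for sc.parallelize(data, num_partitions):
    # chunk a collection into roughly equal partitions.
    items = list(data)
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions

# Like sc.parallelize(range(1, 101)) in a shell started with five cores:
rdd = parallelize_local(range(1, 101), 5)
print(len(rdd))               # the toy "getNumPartitions": 5
print([len(p) for p in rdd])  # 100 elements spread over the partitions
```

In real Spark the default partition count for parallelize comes from the cores you asked for with `--master local[N]`, which is exactly what the session is demonstrating.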
Your understanding is right: you can simply say comma ten. So while creating the RDD you can mention the number of partitions. This is what we actually do in production. When we write Spark code, be it reading from Hadoop or wherever, you will say read, comma, ten, and that means: I am explicitly asking for ten partitions. It depends on the size of the data and all; you have to make a calculation of how many partitions you want, but you can mention a number, and that many partitions will be there in your RDD. Now, it is not a hard and fast rule, and again, that default rule is applicable only in local mode. In yarn mode you will say: what number of cores, executor memory, driver memory, that you manage. This is only for testing, but this number has a value even in production. Even in production, when you're creating an RDD, you might say: okay, read from a CSV file, the CSV file's size is 10 GB probably, okay, and comma ten you say. So I want ten partitions. So now you have ten partitions of the RDD, and you could have asked for ten executors; each executor will take one partition and then process it. That calculation you have to grow into in your reading. But my point is: you can mention the number of partitions while creating the RDD. And you can also do repartitioning afterwards. So I can say `a.repartition(11)`, okay, and then if I do `a.getNumPartitions()`: still five. Okay, you can't change an RDD, right? If you say a.repartition, it will just return a new RDD and not do anything to a, because it cannot edit an existing RDD. So I have to say `b = a.repartition(11)`, take the repartition and save it as an RDD b. Now b will have the same data, but with 11 partitions. We will launch one more shell later, don't worry. But yes, so you can mention it. Now, very interestingly, you can also say repartition one million. Perfectly possible; it will happily work and create one million partitions, but your data will be available in three or four partitions and the rest will be empty, and that's useless,
actually. So that number you mention, how many partitions you want: there is no control, any number you can mention, actually. It all depends on the resources you have. Okay. And when you want to reduce the number of partitions: `c = b.coalesce(...)`. Can you check the spelling of this thing? I always have trouble with it; I think it is correct. Then `c.getNumPartitions()`. When you want to reduce the number of partitions, you say coalesce. So yesterday there was a confusion on these two things; that's what I want to explain. Let's say this is your data. You have these partitions, okay, three partitions are there, and I have data like this: 1, 2, 3, 4, then 5, 6, then 7, 8, 9, 10. So I have n numbers in three partitions, and I can say coalesce. If I say coalesce to two, what will happen? It'll reduce the partitions, but watch: Spark will do it intelligently. It will copy this five here and this six here, and will delete that partition. This is coalesce. Same setup, 1, 2, 3, 4, then 5 and 6, then 7, 8, 9, 10: I can say repartition to two. This will also give me two partitions, but it is different. It will do a full shuffle: this will go here, this will come here, these go here, those come here. It reshuffles the whole data, and only then reduces. So this increases your network traffic. If you're dealing with terabytes of data and you do repartition to come down, that's not a good idea. Always do coalesce; coalesce will intelligently minimize the data movement. But again, for increasing the number of partitions, there is only repartition; there is no coalesce for that. From three I want to go to six partitions: there is no other way, I have to repartition to six. Coalesce is only for reducing. So coalesce came later, I think. Initially there was only a method called repartition; as an improvement coalesce came, so that when you're reducing the partitions, the data transfer is very little, actually. So always do coalesce, not repartition, for reducing. And now, your next job is to launch your own shell.
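Before moving on, the coalesce-versus-repartition difference just described can be sketched in plain Python. These are made-up helper names, not the PySpark API, and real coalesce chooses which partitions to merge based on locality; the point is only that coalesce moves whole partitions while repartition reshuffles individual elements.

```python
def coalesce_local(partitions, n):
    # Toy coalesce: whole source partitions are merged into targets,
    # so no element ever leaves its original group.
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

def repartition_local(partitions, n):
    # Toy repartition: a full shuffle, every element redistributed one by one.
    flat = [x for part in partitions for x in part]
    out = [[] for _ in range(n)]
    for i, x in enumerate(flat):
        out[i % n].append(x)
    return out

parts = [[1, 2, 3, 4], [5, 6], [7, 8, 9, 10]]
print(coalesce_local(parts, 2))     # partitions kept intact, just merged
print(repartition_local(parts, 2))  # elements scattered across new partitions
```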
Your own shell, in yarn mode, with your own configuration, which means you have to do: sorry, pyspark, `pyspark2 --help`. Type dash dash help and hit Enter. Now, this is before launching. What I want to do is build our launch script, right? So if you want to launch, you will definitely say pyspark, right? Now you want to launch in cluster mode, so what do you say? You say `--master yarn`, right? And now you can give options. What other options? From this help output you can copy-paste whatever things you want to mention. Let's scroll down. Let's say the driver memory we want to mention; I'll just copy this: `--driver-memory`, I want 1g. Okay, one GB on the driver. Plus, executor memory: we want to copy `--executor-memory`. For executor memory, again I could say 1g, but the default is 1g anyway, so let's keep it at 500, or reduce it; probably say 500m, because all of you are launching, right, and the cluster will be busy. So executor memory, that much. Then driver cores I won't mention; let it take it automatically. Executor cores I want, so for the executors I want two cores: `--executor-cores 2`. Okay, and the number of executors: `--num-executors`, I want three. So you can build your script. I mean, we will not run it first; you build it, then we will run it. So I'm saying: hey Spark, launch in yarn mode. You will have a driver and executors, you know. For the driver I need 1 GB of memory; I need three executors; each should have 500 megabytes of memory; and executor cores is two. You can kill it, or if you're already in the shell, say exit and then open-close brackets, `exit()`, and it will exit from the shell. All of you can exit. I think it might be resources; in that case, if I run, let's see if it is running. So all of you can exit; I'll just try if I'm getting it. Okay, so it is not allowing me to mention the arguments. I think the reason might be that all of us are requesting those resources, so it is denying. We have only eight machines, seven machines actually, as worker nodes.
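Assembled in one place, the launch command being built here would look roughly like this. It is a sketch: the `pyspark2` launcher name is specific to this course cluster (plain installs call it `pyspark`), and the memory and core numbers are just the values chosen in the session, not recommendations.

```shell
pyspark2 \
  --master yarn \
  --driver-memory 1g \
  --executor-memory 500m \
  --executor-cores 2 \
  --num-executors 3

# For comparison, the local-mode launches used earlier,
# with N cores or * for all available cores:
#   pyspark2 --master "local[3]"
#   pyspark2 --master "local[*]"
```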
So what is the configuration we get if you launch by default? `sc._conf.getAll()`. So this is a different configuration, right? Let me see what I get if I simply launch saying master is yarn. How many executors do we have? Can you see here? If you don't mention the number of executors, you get two executors. You can also see the RAM and processor cores, I think. All of us are launching, so yarn is denying the resources for us. Yesterday, if you remember, I showed you that we launched and saw that it gives us these many executors; right now it just started two executors, so the default number of executors is two. Okay, this is just to show you that we can launch in yarn mode. Let's do one thing: let's start in local mode again, because if all of us are loading the data and analyzing, probably resources will also run low. In that case, you can simply say pyspark2, okay, and then you can say master, `--master local[2]`; I'll just use two processor cores. Start the shell along with me; I want to show you something. Okay, you can say two or three, doesn't matter; keep it two if you want, so we're on the same page, and let me know whether all of you have started the shell in local mode. So, what I want you to do is that we will create the same RDD with different partitions. I have two processor cores only right now, and if I say parallelize and create the RDD, you know how many partitions will be there, right? So I'll just create the RDD and I will simply say `a.collect()`. Okay, so when I do a.collect, I will get the output. Now what I will do: I will create it with, let's see, ten partitions. Okay, I say ten partitions, and then I will again say a.collect. So, very simple: you create an RDD, and then you're just doing a collect; again you create it, you say ten, I want ten partitions, again you're doing a collect. You're not doing anything else, right? And now what I want you to do is exit from the shell. Okay, so once
you're done with this much, do what? Exit. Now you are out of the shell. Why I want to do this is the Spark UI, right? I'm not able to access it while Spark is running, but if you exit, you can see the UI: it will go to the history server. In the notepad, if you go, there is the history server URL; just copy-paste it. Open a browser, copy, paste it, and you can see, ah, your name, your user name, right? Today, which one? This one. So: read it and collect, again create an RDD with ten partitions and then collect, and then exit from the shell, and then go to the history server. You should be able to see your user name. Where is mine? See, in the UI, what do you see? You see two collects. This is my first collect, there's my second collect. And do you see the number of partitions? First I created it with the default, so it'll be two, then ten. So you can actually see in the UI: ten partitions were here in this job, two partitions were here in this job. And look at the time: four seconds versus milliseconds, meaning if you increase the number of partitions, your job will run faster, ideally. You can actually see these statistics from the UI. The Spark UI should be visible even when you're running normally; for some reason I'm not able to access this UI while we're running, and that is why we're exiting and coming here and checking it. So, all the jobs you run, you can see here. And also, if you click on one of the jobs, I'm clicking on this job, okay, you can see the DAG visualization. Can you see? Now, this is very simple, because it says you ran a parallelize; that is all you have done, you haven't done anything special. But if you do some transformations, all of them will appear here in a graph. And here it says you did a collect. Okay, if you expand, you can see, ah, Spark was doing something; since I did this collect, it says the duration was this much for doing it. There's an event timeline as well; anyway, I'll just show you the 'executor driver added'. A very important point: what it shows in the event timeline,
it says 'executor driver added'. If you're in yarn mode, it'll say executor added and driver added separately; in local mode, driver and executor are running together. I told you, right? That's why it says 'executor driver added' as one thing here. In yarn mode, if I say I want five executors, the executors are launched separately; here, everything is one local process. Stages: we'll talk about stages later. These things we already know; the executor will be only one, because everything is running inside one process. Okay, so that is the UI; we can see our jobs through it. And now we will launch Spark again, local mode only, that's fine, and we will do some analysis. So if you go back to the datasets I shared with you, there's a file called book. Can you see? And if you open this, it's a book, actually. Some book, God only knows what book, okay, some business book; just a lot of text. So we like it, actually. So this is the sample data we will use initially; we will change it later also. I want you to upload this book to HDFS. So go to your Hue; there is a file browser, right? Go to the HDFS file browser and upload it in your home folder. Yeah, so I think I have it in my home folder; you upload it in your home folder as well. You can upload it anywhere, but for this we will keep it there. So what are we going to do? We will read the data from the book into an RDD, which obviously will be distributed, and then we'll do some analysis, whatever possible. So, the first thing you need to do is that you have to read the data. Very simple. If you have the driver, you can connect and you can start working on it; it's all behind-the-scenes connections only, right? It's like how, for example, one way to write Hive queries is that you launch the shell and then you start writing the query; but in production, I told you that story, right, when we discussed Hive: I have a friend who has been working for two years, and he didn't know that he was
using Hive, because he has only one screen in front of him, but behind the scenes it is connected to Spark or Hive. So you don't need to worry about that now. What I want you to do: say `book = sc.textFile(...)`. See how easily they have developed this. You want to read a file into an RDD, so what you do is say sc dot, but you don't say parallelize. Previously you were doing parallelize; now you say textFile. textFile is a method, and it can read any text file, meaning TSV or CSV or any type of text data, and you give the name of the file. Now, what is going to happen? By default, it will look in your home folder in HDFS. So this book is in my home folder; if it is in some other folder, what do you say? /user/blah-blah-blah, whatever the location is. And now this book RDD should ideally contain my data, my book, right? I mean, it should have it, and you can actually verify that by doing a `book.collect()`. In the Scala shell, when you do collect, it'll display only a few lines; this is displaying all the lines. It is bad, actually, because you don't want to see everything; I mean, it's just if you want to look at it. Okay, so then: taking elements from an RDD. How is the RDD created, right? Normally, if you give a text file, every line is an element in a text file. So that is why, here, if I do take(10), I get 10 lines. I think this is one line, right, and then it says something something. Now, my book file does not have a proper structure, so sometimes even a single word is a line; that is why it's coming like this. Now, the most important point you have to understand in Python is: what is this, this bracket at the beginning and the end? When you call collect, the output is always presented in a list. Collect, or take, or anything like that: that is the way the data is represented. So you have to be very careful; I'll tell you why later. But look at this. Even if you do a collect, it will come in a list. All the outputs of the RDD actions you call will be presented in a list to you. That's
fine, we're not bothered right now. So, to get more elements you do take(100); you get 100 lines from the file, up to you to decide. You can also just say book; it clearly says it's an RDD, a PySpark RDD, right? From book you take 10 elements, and let it be a list. Let's keep it as it is as of now. Right now, what you need to understand is: I have an RDD, I have stored it, and I can also say `book.count()`. When you do book.count, it will show you how many elements are there in the RDD. So you can see 926 lines are there, elements are there, right? Count is an action; take is an action; then collect is an action. There are many more actions. You can also do a first: you can say `book.first()`, and first means it will show you the first line from the book. Now, it doesn't matter, actually, whether it is ordered. Because, at the end of the day, if my text file has three partitions and I call a shuffle function, there is no order; everything will be shuffled here and there. Or if I create an RDD with three partitions and I say repartition, the order is gone, right? So ordered data: it is very difficult. When you're transforming, you can maintain an order ('I want to take only this line and this word and count that', that you can maintain), but otherwise there is no order guaranteed. So another thing you need to understand: we're not working with ordered data here. Order doesn't matter here, because if you're looking at a text file, you just have lines. That is all you have, right? So whether you process the first ten lines or the last ten lines, it is similar data; the order is not guaranteed to you. But usually, when you read a text file, it will read from the beginning; that is why you say first and you get the first line. But if you look at Python dictionaries, what is happening in a dictionary? There is no order, right? When you have a dictionary, there is no order. So it depends on the data type.
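Since collect hands you back a plain Python list anyway, the actions used so far have simple list analogues. This is pure Python just to pin down the semantics, not the Spark API, and the sample lines are made up.

```python
lines = ["first line", "second line", "third", "fourth"]  # stand-in for book.collect()

print(len(lines))    # like book.count(): number of elements
print(lines[0])      # like book.first(): the first element
print(lines[:3])     # like book.take(3): the first three, returned as a list
```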
You can't rely on ordering for every data type here, okay. So, another problem: this file is not having any particular schema. Now, I read this text file; some of the data will be numbers, right, and some of them will be words. But I don't have a way to say that this is a number and this is a word, because it is just reading line by line; everything is a string. Everything becomes a string; now even numbers become strings. You do have a way to mention a schema, okay: you can create a class and apply it and say that this is my schema, I want that; it's possible. But by default, when Spark is reading, there is no schema or anything. That is why I said DataFrames are more efficient, because there you have a proper column schema, right? Okay, now let's do something interesting. I think we have read the data, right? Okay, we'll probably create it again: `book = sc.textFile("book.txt")`, you can see. And always just run at least one action to verify that the data is actually there; you can say book dot, uh, count, right? The sense of that is: otherwise you will keep on writing your code, and maybe the source file does not even exist. Right, now what I want to do: I want to take this book file, and I want to do a word count. You remember the word count we did in MapReduce, right? There you have a file, and you just count the words. The same logic I want to apply here, and in MapReduce it was very difficult, if you remember: you had a driver class, then some other things, a mapper, a reducer and all. Let me show you how easy it is to do here in Spark. So, the logic I'm going to follow: now, this data is very sparse, so chances are I won't get the exact count, but the logic I'm going to follow is that first I will take each line and split it using space, because my words are space-separated, right? So you'll have all the words separated. First that. And once you get an individual word, right, I will convert that word into word comma one, like what we did in
the mapper in MapReduce. You remember: you took a word and emitted (word, 1), right? And then I will just group the words together and sum them together, and I'll get the word count, right? So let's do this step by step, so you understand how we're doing it. Let's keep the same variable names, okay; you can copy from here, from this notebook. Let's call it text underscore file. I know that we already created something called book, but let's keep the same thing: I will say `text_file = sc.textFile("book.txt")`. So I'm just reading this book.txt into an RDD, calling it text_file. And then `text_file.take(2)`, just to see whether I have the data. Yes, I do have the data, right? Two lines you can see. So it is working. Now, the first thing that we're going to do, we will do it step by step so you understand. Okay, let me do this and I will explain what I'm doing. I'll say `a =`, and I'm just pasting the code, whatever code I got from the word-count file. You have to paste that `line.split` part, okay, whatever is there. Now, what are we doing here? There is a transformation called a map, and a flat map. There are two transformations: one is called a map, another is called a flatMap. Okay, first let's try the map, then we'll come to the flatMap. So just press up arrow and change this to map. It has flatMap; I'm just changing it to map. So this is what you need to run: I just highlighted this; this code you can get from the word-count file, just paste it, edit this flatMap, make it just map here. I'll explain what it is. So what happened, right? I mean, don't do anything, just run it and keep it as it is. What happened? This map, right: map is a transformation we use in Spark, and what map does is, whatever logic you give to map, it will run on all the elements in your RDD. So whatever logic you want to apply to all the elements in the RDD, you can give it to map. So I'm
saying: take my data, and I'm calling map, right, and I'm saying: hey map, do something for me. And this is an anonymous function, this thing you see, lambda. So with lambda we have anonymous functions; we did it yesterday, right? If you want to create a function where you don't have to give a name for the function and a definition, that is what is called a lambda function. The way you write a lambda function is that you say lambda, okay; then, this `line` can be anything. I just used `line` here; it can be `x` also. I'm saying: take every line, or take every element, okay, and split it using space. I can also write it like this, for example: instead of `line` I can say take `x`; this `line` is just for representation, you see. So map is the higher-order function. What will map do? It will apply your logic to everything. So in our case, we have a book RDD and we have lines, and to all the lines it will apply whatever logic you're giving. What is the logic you want to apply? That you represent as a lambda. In the lambda you say `x`, so here x represents an element of the RDD. So an element is a line, in this case. You're saying: take every line, that is the input, and there is a split function you can use in Python; split it using space. So what will this map do? It will take this logic and apply it to every line; basically, every line is split into words. Right, now another important point: inside this map function you can also call a regular function. Right now I'm calling a lambda; that doesn't mean only lambdas are allowed. I can write a proper Python function and call it inside map, and map will apply that function to every line. What you are expressing is a lambda function, right: you say lambda, and then your logic, whatever it is. So now you have to be careful. Usually you will believe that if I do this, what will happen is that my lines will be taken and split using space, and I'll get words. But that is where the problem comes. I stored this as an RDD called a. If I do a.take
(20), I'm doing a take(20), okay: you got the words. Okay, you got the words, but there is a problem. What is the problem? It is a list inside a list. Here, you see: this is one word it took, okay, and it created a list for that line, actually. So your output is actually in a list; the total output is in a list, I'll confirm that, okay, and within that it has created a list for each line. So each record it is converting into a list. So you're getting a nested structure; there is a nesting that is happening, right? So my idea is that I want to take each line and apply this logic. What is this? Splitting logic, okay. And if you look at this lambda function, okay, not the lambda function itself but the split function in Python: split's output will be represented in a list. If you give it a string, actually, it will come back as a list. That is the reason we already have a list, and then you're having another list inside that. Now, the problem is, if I get it like this, I cannot count the words. You have the words; you've extracted the words, actually; you can see them. But they are in a list, and then another list outside that. That is why you do a flatMap. People don't usually explain this. So, that is why, if I do the same thing with a flatMap: the only thing I'm changing is flatMap, but the difference I'll show you. If I do a.take(20) now, after I did a flatMap, okay, can you see the difference? You see the difference between map and flatMap now? Because if I do a flatMap, it will do the map and then flatten, meaning it will apply this space-splitting to every line and all, the same thing. Map will take your logic and apply it to every element; flatMap will also do the same thing, but it will flatten your structure. You will not get this list inside a list; rather, it is expanded. Now you have individual words in a single list. So many people have this question when they perform a basic word count: why are you doing flatMap, map can do this for you, right? But if you do both the map and the flatMap, you see the difference.
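The nested-versus-flat difference, plus the remaining steps of the word count the session is building toward, can be sketched in plain Python. List comprehensions and Counter stand in for Spark's map, flatMap, and the grouping step; the specific grouping transformation Spark would use here (reduceByKey) is my assumption, since the transcript only says "group and sum".

```python
from collections import Counter

lines = ["to be or", "not to be"]   # stand-in for the book RDD

# map(lambda line: line.split(" ")): one list PER line, a nested structure
mapped = [line.split(" ") for line in lines]
print(mapped)    # [['to', 'be', 'or'], ['not', 'to', 'be']]

# flatMap does the same split but flattens into one single list of words
flat = [word for line in lines for word in line.split(" ")]
print(flat)      # ['to', 'be', 'or', 'not', 'to', 'be']

# A named function works wherever a lambda does, as the session points out:
def to_pair(word):
    return (word, 1)

pairs = [to_pair(w) for w in flat]   # like map(lambda w: (w, 1))

# "group the words together and sum them": Counter plays the grouping role
counts = Counter()
for word, one in pairs:
    counts[word] += one
print(counts["to"], counts["be"], counts["or"])   # 2 2 1
```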
You see the difference and why it is not possible with map alone. So, if you want to flatten the structure, you can use a flatMap; that's my point. If you want to keep the nesting, it's probably fine with map; I don't have any problem with that. But right now my problem is word count, and I don't want this form, I want the flat one; so that's why I used a flatMap. So, you're seeing this lambda thing, right? I can define a lot of things: I can say lambda, take every element, and do something. So it's very easy to represent; this is what we call an anonymous function. And now you understand: if you write something like this, Spark has no way to understand what you're doing unless it runs it. This is your own logic; I can write anything here, right? And unless Spark loads the data and applies this and runs it and gets the output, it doesn't know what you're talking about. And these are all done without touching any schema. I'm not even bothered about whether it is a string or an integer or a column; I don't know any of this, right? So, now we are going to discuss something called Spark SQL, okay, which may be more interesting to you, because in my last class we learned core Spark, right, where we were discussing RDDs, then transformations and actions and all. That is also okay, right, but that's more programming stuff. Whereas when it comes to Spark SQL, like the name suggests, it's SQL. So you can at least do something if you already know how to write SQL queries and all; but not just SQL queries, you can do much more than that with Spark SQL, right? Um, now the first thing that we have to understand is that in core Spark, right, in core Spark, what did you do? You have something called sc. You remember this thing: you will say `sc.`, right, the SparkContext object, right, and you are using this sc to communicate with the Spark library. Then you create an RDD and do blah blah blah, right. And once you read your data, it definitely becomes an RDD. There is nothing else in core Spark; the only thing that
you have is our duty However you read the letter becomes an rdd And once you haven't ideally what you do your transformations and actions like you really that's a map or flat map or something And finally you say collect that's an action So this is the flow in course back what you normally and I know when you go to Spark Sequel I'll explain what this park sequel sparks equal ISS spots model where you can process structure data like Roseanne column type of data right The first thing is this is also a bit confusing Okay In spark version too They have introduced something called a spark session object in spite question too That is we're using spot too right They have introduce something called spark session object Okay using this you can access sparks equal so you will never say a CIA dodge If you say a si dot that Mr talking to court spark and the spark session will be available as a name course park This is again confusing confusing So you will type spark dart something that means you're using sparks equal All right so the name of the object this park by default when you start the show you can see spark create a little show in the shell When you start this partial actually so there's the name of the object will be using to talkto sparks equal Like very whatever you want to do You have to say spark not rewrite or anything and we'll practically see that Now once you have some data right in Sparks equal You create something called a data free data free This is this is the data structure we have in sparks sequence Okay And you may be already aware of data offering in Europe I turn plus very similar like a row column form And that's called a data frame Right So in sparks equal you will create something called a date a frame Once you have a data freeing you can analyze the data You can say select filter particular column which is greater than something All these things you can do This data frame can be registered as a table So a data stream is not a table by the way Data 
A DataFrame is a structure with rows and columns; it is not a table as such. But in Spark I can create a DataFrame and then register that DataFrame as a table. Once it becomes a table, you can write SQL. So the flow will be: say you have a dataset, a CSV file. You read the CSV file and create a DataFrame. Once you have the DataFrame, you register it as a table. Now you have a table, and you start writing SQL on top of it.

This is usually what people do in industry. Very few people write DataFrame queries directly, because to write queries that way you need to know .filter and whatever other methods, whereas SQL is very common. So usually people register the table and then write SQL queries. A DataFrame is a data type — inside Spark it's like an integer or a float, a data structure that you have. It's not a table; it only looks like a table to you. Internally it's not a table, so I cannot run SQL on a DataFrame directly — the query expects a table. But once you register it, the very same structure comes up as a table — there is no difference at all — and you start writing your SQL queries.

For Spark SQL there is very little theory; it's mostly hands-on. So: create a DataFrame, register it as a temporary table or a permanent table, then start writing your SQL queries. Fine, that we understood.

Now some experience from the industry, from my point of view — what is happening in the industry right now. Spark is currently on 2.3.1; that's the latest version of Spark.
Mostly, Spark deployments are on 2.x everywhere, but some are still on Spark 1 — one bank in America is on 1.6.3; I was with them last week. My point in saying this — and it's an important point — is that a lot of companies still run 1.6, because you can't migrate in one day. You can't say "tomorrow I'll go to Spark 2"; it doesn't work like that, a lot of jobs are running right now.

Why am I talking about 2.3.1? Because after version 2.2 of Spark there have been a lot of improvements — Spark 2 itself has received a lot of improvements, and especially in 2.2 and 2.3 they introduced a number of changes. And what the open-source community is saying is: do not use RDDs. You learned RDDs in the last class — we did map transformations and all — but what the Spark folks now say is: if you're writing code, write it as a DataFrame if you can. If you cannot create a DataFrame — maybe your data is of a type where you cannot give it a structure — then fine, go back to RDDs. But if you want performance, always use DataFrames.

There is also something called a Dataset — DataFrame and Dataset. I'll talk about what that is later; I don't want to confuse you. DataFrames you already know; Datasets are another thing.

Once you create a DataFrame: if you're a Python developer, you query it using Python; if you're a Scala developer, you query using Scala; or you register it as a table and write pure SQL — no Scala or Python at all. These approaches are equally efficient, because internally there is something called the Catalyst optimizer in Spark SQL. You can search for "Project Catalyst" in Google and you'll get tons of information. Catalyst is the optimizer of Spark: when you write a SQL query, it is the component that optimizes it internally into a proper plan.
Especially, after 2.2 they introduced something called a CBO — a cost-based optimizer. This is available only from 2.2 onwards, and it's a major improvement. Cost-based optimization means: if you write a SQL query, Spark will create three or four candidate plans for how to run it, and internally check which plan is the best. You might write a filter, then a group-by, then a join, then another filter — ultimately, how much data needs to be loaded and how the query has to run must be decided. So the cost-based optimizer in Spark is available only after 2.2. If you're on 2.2 or any later version, however you write your SQL query, Spark is internally going to optimize it. It doesn't matter whether you write it as a SQL query against a table or directly against a DataFrame — there is no performance difference. There were differences in Spark 1: in Spark 1 those two paths were optimized differently, but in Spark 2, whichever way you write it, it gets optimized. Again, it has to be 2.2 or later for the better performance — that's what we have seen so far, because in Spark 1 it was not so great. I think our class cluster is running 2.2 — I forget the exact version — but that's okay; this is a learning cluster, not a production cluster.

When Spark 2 came out, a lot of people wanted to migrate. Anybody migrating now will go directly to 2.3, because you want the benefits of 2.2 onwards. So the minimum version you should go to is 2.2 or 2.3 for all the performance gains.

On the one side I can use SQL — the structured query language. On the other side, what you write is actually called a language-integrated query: the query is expressed in Scala code or Python code. Both things are available, and people differ on which to use.
If you're a Python expert with good knowledge of Python, you'll tend to prefer DataFrames, because you already know the DataFrame idea and can write the query in Python — maybe you're not so good at SQL. SQL is also a language that requires some level of knowledge; you can't just dash off a SQL query. If you're a seasoned SQL developer, you know how to write your queries, so those people will prefer the SQL route; others will prefer the DataFrame API. I'll show you both.

Whatever you do on a DataFrame is an RDD internally, so RDDs will never cease to exist. Spark is RDDs — but you are getting layers on top of it; a DataFrame is a layer on top. So even when I write SQL, internally it converts to RDDs and RDD transformations and actions — there is nothing else, really. But you don't have any control over that conversion: how it converts, what it does. In a way it's like Hive: you write a Hive query and it converts into a MapReduce job. This is very similar — you write a SQL query, internally it runs on RDDs, but the speed and power are completely different, since you're using Spark.

Last week we had a Big Data meetup, where I was talking to some folks who actually run Spark jobs at other companies and project centers. They shared a very interesting thing with me — I'm not quite sure about it, because I haven't verified it myself. A couple of my friends were saying that in their companies, one common question is: should I write in Python or Scala? I want to write a Spark program — should I use Python or Scala? Here, with Spark SQL, it doesn't matter, because you're essentially writing SQL. But let's say you're writing RDD code. These folks were saying that when you have, for example, a 16-core processor, Python apparently cannot utilize all 16 cores while executing the code, and memory management in Python is comparatively poor. So here is what they were doing.
The point is: if you have data scientists in your company, they will prefer only Python — "I will not learn Scala; I'll write everything in Python." So what these folks' Spark developers were doing was writing the code in Python, then converting it to Scala and executing that. If you can convert it to Scala, it compiles to a JAR, you can run it in a proper JVM, and you normally get better memory management and processor management. And it's not that hard: at the end of the day, if you know a bit of Scala and you have the Python code alongside, converting a transformation is mechanical. In Python you would say map with lambda x: (x, 1); in Scala you would say map(x => (x, 1)). You have to do it manually, but in return you leverage the JVM's power — you get garbage collection and all those properties with the JVM that you don't get in Python.

So even though the community says Python and Scala programs are equally capable, when you run in a very large production environment Python has some smaller disadvantages, and that's when teams convert to Java or Scala and execute it as JVM code. Java is also perfectly fine; these folks were using Scala a lot, so for them it was Scala or Java — because the resource optimization of a JVM you cannot get anywhere else. Whatever other language you use, that power always stays with the JVM. But that does not mean you should not write Python code — that's what I'm saying. You can write your Python code and say "this is my logic", then convert it to whatever format you want and run it; it's the same logic either way. Otherwise you would have to sit down and learn Scala, which ideally isn't required if you're a Python developer.
That's how they came to give an interesting example. Any Spark implementation will have some Scala in it for sure — Scala is always preferred — but that does not mean you should not write Python; you can always convert.

Now, if I'm writing RDD code, the first problem is that the code becomes lengthy. I can't write SQL on an RDD, so for a simple SQL query like a select-and-filter I might have to write a lot of code. The second problem: if I'm writing my own RDD code to read the data, Spark has no way to understand how the data is represented. DataFrames have a schema, so Spark knows what the data looks like. So say you have a CSV file — it's nothing but a CSV file — and I create an RDD out of it, ten rows or whatever, and on that RDD I write a reduceByKey and some weird, complicated transformation. Looking at that, Spark does not know what you're trying to do, so it has to load this entire data into memory, and only then can it process it.

If instead I create a DataFrame out of the same data — I'll show you how to create it — I clearly say I have three columns: this one is an integer, this one is a float, and so on. Now if I write a SQL query in which I've written a filter on one column and selected only two columns, only those two columns get loaded into memory. Spark knows what you're talking about — it has a view of the data. An RDD cannot give you that, since you're writing anonymous functions, and anonymous functions have no structure or schema associated with them. So if you write RDD code, the entire data needs to be loaded into memory — Spark doesn't know what you intend. But if I represent the data as a DataFrame, saying three columns are there, and then filter the table wanting only two of them, then executing the query loads only those two columns into memory.
It will actually skip the rest, because it knows you don't want it. The optimizer can see how your data is laid out in the file, or wherever you have it, so it is much more memory-efficient if you write DataFrame queries. That is why Spark says: use DataFrames. When Spark converts your DataFrame work into RDDs, it is very powerful; when you write your own RDD code, it can actually be very bad. If I have this file and write my own RDD code, I have no way to tell Spark "don't load the full data, load only this part", because I'm writing anonymous functions. But if you write it as a DataFrame, it will understand: "okay, you want only this; I'll generate the code accordingly every time you run it." So DataFrames are much more efficient — that's what Spark says.

In the Spark SQL module, this is what you need to learn: this is a DataFrame — how will you create it? That is essentially all you learn in the Spark SQL module, because once you have a DataFrame, either you write SQL queries against it or you run DataFrame queries directly.

One way you can create a DataFrame is from an RDD: you have an RDD, and you convert that RDD into a DataFrame — I'll show you practically. There can be situations where your data is in an RDD and you want to convert it into a structure like a DataFrame, so you call a function that creates the DataFrame for you. Second, if I have a CSV file, I can directly create a DataFrame from the CSV, since it's a structured file. Third, if I have an XML file, I can create a DataFrame from it. If I have a JSON file, I can create a DataFrame from that. Then if you have a Parquet file — these are all different file formats — I can again create a DataFrame from the Parquet file. Next source: an RDBMS table.
If I have a table in a relational database, I can read it and create a DataFrame — I'll show this practically; we'll read from MySQL. This is very useful sometimes. For example, at GE they have a situation where they use Greenplum — Greenplum is a data warehouse — so their primary data is in Greenplum, and then they have some data in Spark on a Hadoop cluster. Sometimes they want to do a join operation in Spark SQL: one table is already in Spark, but the second table is in Greenplum. So instead of a generic RDBMS here, it can be Greenplum as well — any SQL store. If I have Greenplum, using the driver I can say "read it into a DataFrame", and then I do the join. That means Spark can actually get the data from any SQL store — Oracle, Greenplum, Teradata, anything — on the fly, represent it as a DataFrame, and then you can do whatever you want.

You can also create a DataFrame from Hive — very important. Hive is by default your data warehouse on Hadoop, where you create all your tables; if I have a table in Hive, I can read it and create a DataFrame from Hive too. And any NoSQL database as well. This is the true power of DataFrames: from any source you can get the data in, and you can push it out anywhere — you can read from Cassandra and dump it into an RDD, or read from JSON and save it into Greenplum; all of this is possible. And once you have a DataFrame, you can also save it: I can save it as a table in Hive, write it into a table in an RDBMS, save it as a CSV file, or whatever — the only thing you should be aware of is that the schema and how you save it should match your requirement. Reads and writes from any of these sources are possible, and in the course I will show you how to read from Hive, an RDBMS, Parquet, CSV, XML and JSON. I don't have a NoSQL example right now, but you can read from NoSQL stores too — for example from MongoDB.
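Reading an RDBMS table into a DataFrame looks roughly like this. This is a configuration sketch only: the host, database, table and credentials are placeholders, `spark` is the shell's SparkSession, and the matching JDBC driver jar must be on the classpath for it to actually run.

```python
# Hypothetical connection details -- adjust url, dbtable, user and password
# to your own database; nothing here refers to a real server.
people = (spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://dbhost:3306/sales")
          .option("dbtable", "customers")
          .option("user", "report_user")
          .option("password", "secret")
          .load())
# `people` is now an ordinary DataFrame: register it as a view and join away.
```

The same pattern works for Oracle, Greenplum, Teradata or any store with a JDBC driver — only the url and driver jar change.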
For MongoDB there is a connector; using that you can take a collection and create a DataFrame — very much so. Once you know this, the rest of the story is easy.

In Spark version 1, we had an object called the SQLContext object, and we also had another object called HiveContext. The first was used to talk to Spark SQL; the second was used to talk to Hive from Spark. In the new Spark 2 you have the SparkSession, which we access as the spark object. But the community has retained these older objects for backward compatibility, so even in Spark 2 you can create them and access the data — that's possible. Sometimes you will see this because somebody wrote the code in Spark 1. We had that issue in one place: somebody had written their code in Spark 1, and it used SQLContext, and when they migrated to Spark 2 they didn't want to rewrite the entire thing, so the same code was retained. So if you see sqlContext or hiveContext, those are the objects we used in Spark 1. In Spark 2 there is only the SparkSession — that's all you have — but you might see those objects in older Spark 1 code.

Tungsten is a project, again from Apache Spark. The whole idea of Tungsten is to speed up your Spark processing — be it an RDD or a DataFrame or anything — but the majority of the contribution is on DataFrames, because for plain RDDs no such optimization is possible anyway. DataFrames are the future; that's what Spark says. The Tungsten project is actually used to encode and decode your data — that is, when you store your data, how will it be stored? Say in Spark SQL I created a DataFrame and I just want to store it somewhere: Tungsten has something called encoders, which help to encode the data, store it, and then decode it back. I will probably talk about it in the Datasets section.
The Datasets section will talk about the Tungsten project in more detail. Now, open Jupyter and upload the notebook; the notebook name is dataframe_basics. A very important point — I think last class we struggled with this: ensure your interpreter is PySpark. For some of you it will default to Python, and that's the problem. Python means it will run using plain Python — it won't use Spark at all. You write Python code, it executes using Python libraries; it won't even talk to Spark, and you will get errors like "I don't know what you're talking about", because the interpreter is what actually communicates with whatever is running behind the scenes. So only if you select the PySpark interpreter will Spark be linked in. If I select Python and type 1 + 2, that runs in plain Python — no Spark libraries are involved — so you'll get errors saying "I can't find this library" and so on.

Okay, getting started with DataFrames. The first and most interesting point is JSON data. Many, many applications produce JSON data; JSON is very common. This people.json — it's nothing complicated. If you open people.json it is a very simple JSON file; I'll give you a complicated one later. The beauty of Spark SQL is that there is a reader API for JSON, which means if you give it a JSON file, it can directly read it and create a structure out of it. But again, you have to be careful about what it does: it will read each key as a column name, so it will create a column called name — the value will be Michael — and likewise a column called age. And where a record has no age, you will end up with null values, because there's nothing there; it is automatically inferring your schema. But that's what I wanted: if I have a JSON file, I can directly read it.
You don't have to do anything special. If you scroll down, the first thing we're doing is from pyspark.sql import SparkSession. Remember, SparkSession — that's the entry point to the Spark SQL library. Let me see if this works, because in some of the notebooks the SparkSession could already have been created, and it will throw an error saying "I can't create the Spark session" — that's fine. So here you're saying from pyspark.sql ... — ah, there's an error for me. What this error actually says is that the session is already running, so it cannot create it. That's okay; if you get this error, it's fine. But if you're writing your own program, not this notebook, this is how you start: first you import SparkSession, then there is SparkSession.builder — you can give a name for your application — and then you call the method getOrCreate().

So now my Spark session is already running, and if you want to create a DataFrame — let me just clear this output — it's very easy. All you need to do is say spark.read.json and then give the path. In my case the data is in datasets/people.json. If you want to see the DataFrame, you say df.show(). So this is your DataFrame, and I think you'll be familiar with this from your Python pandas class — the same name-and-column structure.

spark.read is the method: there is spark.read.json, spark.read.csv, spark.read.parquet — you can read directly; you don't need any third-party tools. But when you're loading XML files, you cannot say spark.read.xml — it will not work. Why? Because an XML file has a different structure. Who here works with XML? In XML you have a root tag and a row tag and all those levels to deal with, so we have a third-party package; using that you can read XML. But JSON or CSV you can read directly.
XML, though, has multiple levels. Nested JSON you can also create — developers can create nested structures — but by default XML support is not there; you use the Databricks package to read it and then create the DataFrame. I'll show you.

Make sure you can create this DataFrame. Anybody not able to do it? Two important points here. First, the name of your DataFrame is df in this example, but it can be anything; it doesn't matter. Second, when you use this method, there is automatic schema inference, meaning you are not saying what the schema is: Spark will take a best guess and apply the schema. If you want, you can create your own schema. Let's say I want to read this JSON, but I want this column to be a date, or a string, instead of an integer. That means you can create your own schema and tell Spark: when you read, don't infer the schema — I'll give you the schema; apply it. By default, if you call these methods, it is automatically inferring the schema, and in most cases that is sufficient; it can understand your data.

Our DataFrame looks like a table — a DataFrame is rows and columns. Another important point: each row is called a Row object. It is not a table. One common confusion: a DataFrame is never a table. Each row is an object — internally it is stored as an object. I will talk more about this when we come to Datasets — how each row can be accessed and all those things. But understand: for the column values in a row, what's stored is object, object, object — each row is a Row object.

There's a catch here — maybe some of you will understand it, some may not. Let's see: you try to access this value, but there is no value here — it's null. So what should happen? Ideally, if I access this value, I should get an error — an exception, right?
There is no value here, but the problem is this: a DataFrame only knows that you have three columns and that each column has some data. So let's say you wrote your DataFrame SQL queries, compiled them, and ran them: you will get an exception only while running — you get only a runtime exception. And that is bad. If you had, say, 70 columns of data, and in one column some values are missing, and you wrote a SQL query — if you just type the SQL query, it will say everything is fine; it won't complain at all. But when it actually executes — when it actually reads the whole data — you will get a runtime exception saying there is a problem.

That is where Datasets are efficient. If you create something called a Dataset, it looks exactly the same — there's no visible difference; you'll have columns and rows and everything. But when I write the step in the shell — if I just write the access to this value, even before executing, before I say "run" — it will give me a compile-time error, not a runtime error. It will say: there is no value here; don't compile this code. That is a major difference between a Dataset and a DataFrame. We'll see what a Dataset is later.

Datasets are fairly new; Spark has started using them only recently, and people will slowly migrate. That is because in a Dataset, each row has its own schema attached — it will not just be a generic Row object. You attach a proper schema to every row; internally it encodes and stores it, and it knows what is in each field — it knows there is a null there. So if you write a select on this, before actually running it will say: you have a null — don't compile me; I will throw an error; please correct it. This is just an example: if there's something missing — a null — that you try to access, you end up with an error.
You proved it yourself — that was the question, right? If I can get either a runtime error or a compile-time error, which is better? Compile-time is better, because the internal structure is stored in a Dataset. A DataFrame never stores it: it just assumes those rows are out there; it never knows what is inside a row. So here, if I access the null, I'll get an error — an exception at runtime. And I want to be sure that when I run my code, I don't get an exception. A lot of people don't take care of nulls: they just read the data and build on it without caring whether there's a null or not. Then, when you actually run, you start getting exceptions saying "I can't find the value here or there." That's when you create a Dataset: even before running — at compilation time — it can tell you that these values are null, or that there's a data-type mismatch. Those kinds of exceptions you can handle easily with Datasets. The catch is, I have not seen people actually using Datasets much — it's very new; Datasets came in with Spark 2. It's promising, but people are still using DataFrames only. Probably they're not comfortable yet — I don't know the reason.

Notice the columns come out in more or less alphabetical order. You can switch the order and save it if you want, but the structure comes like this: the JSON actually has name then age, but the columns are printed the other way around. That's where you might want to apply your own schema. You say, "I don't want this order; the first column I want is name." Either you write a query, extract the columns, and create another DataFrame, or you say, "I'll create my schema: the first column I want is this, the second column I want is that," and then it will come out the way you expect. Here it more or less decided on its own; you don't have control over that. Now, once you have a DataFrame, you can just print the schema — there is a command for that.
printSchema shows that there is an age column and a name column: age is long and name is string — those are the data types associated. Here you also have nullable = true; right now the values are nullable, but you can also have nullable = false — you can change it. And you can simply say df.columns; it will just show you what columns you have.

There is also a describe command. describe is very much like describing a table; it tells you what your DataFrame contains — age, name, and so on. When you call describe, it is as good as collecting statistics. In SQL you have table statistics, where you collect information about the table. When you call describe, what happens is that it summarizes whatever you have in the DataFrame, and then if you do df.describe().show(), it shows what it has collected. Look here, for example: for age, the mean is this, the standard deviation is this, the min and the max — wherever possible, it will calculate. This is useful if you are going to run aggregate queries on the DataFrame, because once you describe it, you find useful information about each column. It's like your table-level statistics — like saying "analyze table ... compute statistics" — very similar to that. If you run describe, then for each column these are the summary statistics; if that's interesting to you, you can use it.

In practice we don't use describe that much anymore — the describe method came along long back. In 2.2 and 2.3 you don't have to run this method, because for the data you are storing, it will already be analyzed. For example, say you're storing a gender column. A gender column usually has male, female, and maybe not-applicable — only three values — so you can actually choose to encode that column in a Dataset.
You can say: instead of "male" store a 1, and instead of "female" store a 0. You're actually reducing a lot of data in that column — if I store it as a string, it takes a lot of space. So these kinds of statistics you don't really run describe for these days; these encoding techniques now exist for Datasets, which can store your data as, say, 1 instead of "male" and 0 instead of "female", so the column is much, much smaller.

Are you all able to follow up to the describe part? Yes? Okay. Now, some data types make it easier to infer the schema — tabular formats, like CSV. However, you often have to set the schema yourself if the read method's inference isn't enough. What we're discussing here is: if you are getting a proper JSON file or a CSV file, you can simply read it and create the DataFrame, because you have rows and columns. But what if you do not have a proper structure to your data? Then how do you read it? That's when you apply your own schema to the data. And to do that, we have a data type called a struct. Have you heard about struct anywhere else — in C? In the SQL and RDBMS world, Hive has a struct data type too. In Spark, struct is a data type where you can have multiple values within one struct — you can have multiple types — and then you can say something-dot-something and access it.

For example, say you want to store your address. The address will be house name, then city and street, then state; I can store this entire thing in a struct. Then if I want to access only my city, I can say address.city, and it will give me just the city. Something like that — that is a struct. I have used it in Java, long back, and from that idea they created this struct data type in Spark. So I'm importing StructField, and within that I want StringType and IntegerType — whichever data types I need.
There is a type for each: there's a FloatType, an IntegerType, and so on. I'm importing these so that I can represent my schema. So first we import these data types to represent our data. Next, you create your schema: you say StructField — "age" — I want a column called age, which is integer type, nullable true. I want another column called name, which is string type, again nullable true. So this is your schema — your actual schema. The struct will hold everything in one container-like thing. With each StructField you're saying: these are the fields I want — one called age that is integer type, and one called name that is string type. So each StructField represents one field — one column. StructField is a wrapper-like thing; inside it you say "I want an integer-type column", "I want a string-type column". You have created two columns here; this is the schema you're defining.

Then you create the final structure. You say StructType, and here you pass fields=data_schema — you're saying: use these fields I have defined as my final structure. And then, in the same JSON read — spark.read.json — you say: read the JSON, comma, schema=final_struc. What is going to happen? Your own schema gets applied.

Actually, in Python there are two ways to attach a schema. One is using this StructType. Apart from StructType — do you remember the namedtuple in that captain example? — you can also use a namedtuple. But the drawback, if my memory is correct, is that a namedtuple can have only 22 or 23 columns max in the schema. So if you want a bigger number of columns, you must use StructType — StructType supports as many columns as you want, like a very wide table. In Scala the equivalent is called a case class.
you don't have a name to triple in Scala So if I'm creating a scholar program and I want to attach my own scheme I like the captain object right I don't have a name That turbulence Callon named pupil is only in Python I'll create something all the case class Very similar Similar idea only Right So then you will say attached the schema on Brenda Schema So I'll just run this Yeah So now it iss coming with whatever scheme are that you have attached right You can see the off your Prince Chema again The same scheme I we have given there is no difference but I just want to show you that this is possible to act as your own scheme up and you can actually refer this spark documentation So if you don't believe what I'm saying right so I'll show you in the documentation Older versions to wood or toe There is ah sparks equal data frames and data sets Okay you can scroll down Go to the python court on there is creating data frame score scroll down some very desert And this extractor type Can you see Ah here it is written Right So it stays Um Ah Here to create an artery off to Potter's creator schema represented by struck type Matching the structure Apply The schemer so struck dead can be used in many places here They're creating an rdd then applying the struck type and creating the data thing What we did we created a J soaring with the struck type Wherever you want to play the schemer you can use a struck type Actually I show you there is an example So this is not that this fear using only this thing What is a struck type when you're using a name Their pupils What you need to do is that if you have an r b d from this rdd a can create a date every all right I haven't our beauty I want So right now if you don't have an r d Did he have a Jason fight So from this our deity if I create a data frame either I can say apply a structure type or applied a named people because I really need to hold some scheme if you want so I can create a name people and say that Hey just convert into a 
It creates a DataFrame; I'll show you that. Now, is this column access familiar to you? I don't know whether you've seen it in a Python DataFrame. If you want to access a column, you simply say df.age; that is the column age, and if you call type() on it, it says pyspark.sql.Column, so there is a Column type. You can also select individual columns: you can say df.select("age"), and you always do a show(). When you want to see the output of a DataFrame, you always say show(); it's not collect(). collect() gives you the underlying rows, which is a different thing. So here in the notebook, df.select("age") by itself won't show you anything, but with .show() you can see whatever that DataFrame contains. show() will display the first 20 rows, and the argument to show() controls how many.

These commands are similar to the Python pandas DataFrame commands; very similar. There also you have filter, group, join kinds of operations. Some commands are slightly different, though: pandas doesn't have a show() command, for example, because in pandas you see the result directly. Here everything is stored as a DataFrame, so anything in a DataFrame that you want to see, you say .show() and it prints the first 20 rows. If I simply type df.select("age"), it just gets stored as a temporary DataFrame; I have to say .show() to actually see anything.

You can also check the type; again you will have a DataFrame. And look at this: when you run an operation, you can store the result as a different DataFrame. I can say raghu = df.select("age"); you select a column and store it as something called raghu, and now raghu is itself a DataFrame. If I want, I can say raghu.show(). So any operation you do on a DataFrame results in another DataFrame, and you can save it under a name if you want, like raghu here; if you don't want to save it, that's also fine. That is what we ran here: type(df.select(...)) is actually pyspark.sql.DataFrame, and it has the same show() command, so you can see what is inside it.

Then there are some commands like head(). head() will actually give you a list of Row objects. This is important in one way: if I do head(2), it gives me the first two rows from the DataFrame, and see what they are: Row objects. Your data is actually stored as Row objects; it is not stored like a text file, it's stored as objects, and head() gives you the individual Row objects from the DataFrame. You can also do take(2) and you will get the same thing, no difference at all.

Now, you can also create new columns. For example, I can say df.withColumn("new_age", ...). If I run this, I'm saying df.withColumn("new_age", <expression on df.age>), then .show(). So you are adding a column called new_age. Any idea how these values came into the new_age column? You are saying: I want to create a new column, new_age, which is, what do you say, a reflection of an existing column that you can modify; you can multiply it, add values to it, and so on. This creates another DataFrame, and if you do df.show() afterwards you'll see the difference: your original DataFrame remains unchanged.
Because, very much like RDDs, DataFrames are immutable, this has just created a temporary DataFrame in your session as of now. You can also rename columns if you want: there is a method called df.withColumnRenamed, and you can say withColumnRenamed("age", "new_age"), something like that; you're renaming the column, basically. You can do more complicated operations to create new columns too; here all we're doing is referring to the original column and then multiplying by two, adding one, dividing by two, and in that way you're creating new columns.

So I hope these things are very clear now. The important thing is: how do you use SQL? I will show you another use case where we'll run some queries; using Python, as of now, we have not run any queries. The thing is, this is not a table; so far everything we've been writing against is just a DataFrame. If you want to use SQL, you have to use something called createOrReplaceTempView; you're actually creating a temporary table. You have to say df.createOrReplaceTempView("people"). What this command does is create a temporary table called people, and then if you want to run SQL queries it works like this; watch what I'm doing here. I create a temporary view called people; this name can be anything, and this temporary table is registered with the session. It's a temporary table: if you exit, it's gone. Once you have this table, if you want to run a SQL query, you simply say spark.sql("...") with whatever query you want. But again, I'm storing the result as a DataFrame, and if I do a show() I can see it here. spark is the SparkSession object: you say spark.sql(query) and it runs on the table. But if you exit this shell the table will be gone, because it's a temporary table, not a permanent table. If you want a permanent table, there is a command called saveAsTable; if you use saveAsTable, by default it will go to Hive. Or if you want to dump it into an RDBMS or something, you can create a connection and save the table, and it will go to the RDBMS; that is also possible. I'll show you how to save into a Hive table later. But for now: any SQL query that you want, any regular SQL query, you can run here, and the result is always a DataFrame; that's the essential point. We'll come back to this filter approach and all of that later.

Now, what are we trying to do next? We will create a DataFrame from an RDD; sometimes that is required. There are two ways to do this. One way: I have an RDD, and I convert each row of it into a Row object. These are normal RDD lines, so I can say convert each line into something called a Row object, and once I have the Row objects, I use the method createDataFrame and it creates the DataFrame. That's the first method. The second method: I have the same RDD with lots of lines in it, and I use my StructType; the same StructType we used previously for the JSON data, I can also use on the RDD. I can say: I have an RDD with ten lines; I will create a StructType saying, after splitting on commas, the first column is integer, the second is string, the third is something else; and then I create the DataFrame. So both ways work. In the first approach, each row has to be converted into a Row object, and while creating the Row object you mention the data types: in the Row object, the first column is integer, the second column is so-and-so; then you say createDataFrame. In the second approach I don't create Row objects; I create the StructType schema instead. Normally, if we have simple RDDs, the Row-object way is how you'd do it; if you have nested structures and so on, then StructType will be better, because it can define arbitrarily many columns, including nested ones.
We're going to do both of these now, using the file we have; this is the schema-inference example. So here's what we're doing: read people.txt into an RDD called lines. lines is an RDD, not a DataFrame; remember, sc.textFile gives an RDD. And you can actually look at the data. I think the cluster is a bit slow today. Yes, so lines.take(3) is showing you the top three elements of the RDD: Michael, Justin, Andy; you can sort them too if you want. Okay, so it is working. Are you all able to read the data?

Now look at what I'm doing. I will first create an RDD called parts, and in parts I simply say lines.map: take every line and split it using the comma. So what will my data look like? Each line gets split, so it becomes ["Michael", "29"], ["Justin", "19"], and so on; first I'm just splitting my data on commas. Then I say .map again: take every element and convert it to a Row object. This Row is a Row object, and I'm saying the first part should be a column called name, the second should be age, and age should be an integer, so I wrap it with int(). Now your data has been converted into Row objects. So here you're saying: map the data, take every entry, convert it into a Row object, and the first part becomes name, the second part becomes, what do you say, age. Name and age: both go into the Row. Then if I do people.take(2), you can see that they are Row objects now.

That's one way of doing it. But by itself this is still not useful, because it's just an RDD of Row objects; you can't query it yet. So you say: create a DataFrame from this. Simply add these two lines. When you do df.show(), the columns come out like this, probably in alphabetical order. But that's just the display; the operations don't care about it. For example, if I aggregate on the age column, it is going to work on the actual data; only the display is in this order. That display is specific to Spark SQL.

You say spark.createDataFrame(people). So this is the method: createDataFrame is the method you will use if you want to create a DataFrame from an RDD; this is very important. createDataFrame takes this people RDD and creates the DataFrame. So schemaPeople is your DataFrame now. Can you do schemaPeople.show()? It is not in the slide, but run it. Recap of what we did: we read a text file into an RDD; the question was how to convert that RDD into a DataFrame. One way is this: you split each row into different columns, convert each into a Row object, and then say createDataFrame, passing this people RDD. That is one way. Now, if you have a much more complex data type, then you can use the StructType; there's an example of that too. If you scroll down you can see it, and you can also register the result as a temporary table and then run SQL queries. That part is fine; I'll ask you to run it later.

Now, the programmatic-schema example, for complex nested schemas: look at what I'm doing. Again I'm reading this people.txt, again I split it using the comma, and then I say parts.map with a lambda that just extracts the first column and the second column. So if you look at the data, it is now like Michael, 29: the same data we had. The only difference this time is that you define the schema yourself.
You define the schema as a StructType: I say the schema is a StructType where I have a name column of string type and an age column of string type. Then you do the same thing: spark.createDataFrame, where you pass the RDD and you also pass the schema. So, two ways: either you manually convert to Row objects and then create the DataFrame, or you don't convert to Row objects at all; you define your schema as a StructType and say createDataFrame, giving it the RDD and the schema, and it will attach them.

Now, how many of you are aware of this company called Yelp? You've heard about Zomato, right? That model actually came from Yelp; Yelp was among the first companies to start this whole business of reviewing and ordering from restaurants online. Yelp is not in India; it is available only in the US as of now, and probably that is the reason we may not know much about it. One good thing about Yelp, and maybe it will be good for you in the future as well: Yelp has a lot of data, because a lot of restaurants are registered with it, and they use MongoDB to manage their orders and so on. They allow you to download a lot of their data for free. Every year there is something called the Yelp Dataset Challenge, so you can just go to Google, search "download yelp dataset", and you'll see the Yelp dataset download page. The dataset has around five million reviews, that many businesses, that many pictures; and the whole dataset you can get either in JSON format or in a SQL-compatible format. Don't download it now, okay, because it's several gigabytes of data; but if you are interested in doing your own projects, this dataset is very good, because the data is rich and genuinely huge.

You know that website for all the machine-learning folks, where you get datasets? What is its name? Kaggle. They have a copy of this too, but I'm not sure whether it is the full data. What I did is download a small part of it, and it is already available to you here. If you go in, there is a file called business.json; can you see it? If I open it, this is the file from Yelp; this is what the Yelp data looks like, and you can also use it for your projects, like I said. Every business has a business_id. And if I actually go to yelp.com and look at the businesses: let's say you're searching for, I don't know, breakfast near San Francisco. You'll get these businesses, and the point is, if I open two of them, the first restaurant will have the name of the business, the stars, the location, the photos, and so many other things. If I scroll down there is this "more business info" section where they give additional details, for example: takeout, yes; accepts credit cards, yes; and so on. If I go to another restaurant, these guys are saying accepts Google Pay, accepts Apple Pay. So every restaurant's data is a little different; you don't have one common schema or anything, and this whole thing is in that business.json file. Not the whole dataset: we have around 16 megabytes of it. So you have a business_id, then something called full_address, and look here: this is nested JSON. I have a key called hours; within hours I have Tuesday; within Tuesday I have close. That is nesting, and nesting again. And if I scroll down, see the level of information you have; it's very deep in terms of nesting, and all these fields are there. So what we're going to do is, well, we'll probably use the shell rather than the notebook, so I can just open the console.
I'll copy this and paste it here. We're uploading this business.json into a common folder for you, and I can start the PySpark 2 shell; I think I have already uploaded it, so you can see business.json here. This is my personal folder. I'm running pyspark2 so that we get a good version of Spark. Now, in your datasets folder there is also one more file called yelp; can you see a text file called yelp? Just open it; I'll run this first because I don't want to hold you all up.

The point I want to show you is this: if I simply do spark.read.json on this dataset, Spark reads it. You will get a warning; that's fine. Now if I do df.printSchema(): look at the schema. In less than a second you get the total structure of this data. Look how many columns there are. All it took was less than a second to read it and create a DataFrame; now it's up to you to write SQL queries. Everything is stored as a DataFrame, so once you read it, this variable is nothing but my DataFrame. But if I do df.show(), it may not display nicely, because there are so many columns; I may not see it in a proper format. So what I can do instead is just register it as a temporary view with createOrReplaceTempView. And let's say I want to find out how many restaurants there are in total in my dataset: the query is really fast, because you're looking at around 61,000 records, each with around 70 columns, and the queries run really fast compared with something like Apache Hive. So can you try this along with me? Start your shell like this, pyspark2, then say df = spark.read.json(...) with the path to business.json in the shared data folder, and then you can follow the text file I have given you.

Now, if I actually want to run a slightly complex query, not so complex, but something like this: select state, round(avg(review_count)) as avg_review from business group by state order by avg_review desc limit 5. That's not a trivial query, but it gave me the answer in about three seconds. It depends on the cluster too, because all of us have started a shell each. And one more thing: we started PySpark in local mode; we're not actually running on YARN or anything. But that's okay; the dataset is small, so it will work.

The commands I have written I think you will easily understand. Here you are finding the total number of restaurants, and here, what are you doing? You're grouping by state and counting: select state, count(*) from business group by state order by state; you're calculating it state-wise. So once you've registered it as a temporary table, everything is just SQL.

All right, I'm getting a warning saying "slow ReadProcessor" and something about downstream acknowledgement. This is because it is reading; and you folks may have one more problem, because you're all reading from the same file, the same location. The file is actually on disk, but when you say spark.read.json it brings the data into RAM, so each of you gets a copy in RAM individually. You might also see "connection refused" or "fetch failed: shuffle failed to connect"; the machine stopped responding for me, because all of us are running at once. Are you getting the output, or are you getting errors? You got the answer? Then why am I getting errors... okay, I got the answer too; it just took some time.
It took some time, but I got the answer; everybody will get their answers eventually. So see how Spark is very efficient at handling nested data. Now I have an assignment for you on this data. I want you to write a SQL query. Look at the data: there is the hours data, right? Give me, as the output of a SQL query, all the restaurants which, what do you say, close at nine o'clock on Wednesday. Let's see whether you can do it. You have opening and closing hours; use Wednesday. All the restaurants which close at nine o'clock on Wednesday. It's a select query: you're doing a select ... from ... where. And I want to show you how to access that field: first do a printSchema; you'll see the schema and see where you can access it. If you do df.printSchema(), it will show you the path to it.

And by the way: even though Yelp and similar companies use MongoDB to serve the site, for analysis they may not prefer Mongo. MongoDB is good at providing real-time information: you search for a restaurant, it shows it, that's fine. But if they want to find out, say, which restaurants made the most money over the last ten years, they will read the data into something like Spark and then run the analysis. MongoDB is not an analytical database; it's for real-time serving, like your cab-booking apps. What are those folks running? Also MongoDB, which means even if a million customers book at once they can all get their cabs, but you can't really run analytics on that; the analytics somebody else will run, in another system. The purpose of Mongo there is: so many customers come, they should get served.

So, what did I ask? Even I don't remember... right: how do you access that column? That is my question. Select star from the table where hours.Wednesday.close equals whatever I said, then try .show(). Because this hours column is a struct; within it you have Wednesday, and within that you have close, which is a string. So you access it with dots, starting from hours. You have to say spark.sql, and within quotes: select whatever, where hours.Wednesday.close = the condition I gave. Or better, add a limit; don't run it directly without one. So it will be something like: spark.sql("select ... from business where hours.Wednesday.close = ..."). Just understand how to access nested structures, because that might be required: to get at the nesting, it's dot, dot, dot down the path. Can you do the same without SQL? Very much; you can do the same thing with the DataFrame API, though it will look different; I'll show you how. Logically both are equivalent, but the way you write them differs. When you try it, do a show(5) to look at just five rows.

So use the business.json for some case studies. And I have one more assignment; don't do it right now, okay, but just think about it. I want you to write a SQL query again, where I want to find all the restaurants that take reservations. What's tricky about it? There is this piece: there is a space in the column name. If you directly type the words with the space, will it work? Try it first; if it works without anything special, that's good, but I don't think it will.
No, you can't; it won't work. If it were a single word, like takeout, it would be fine, but here you have to tell the system that there is a space in the column name. In normal SQL, in a regular RDBMS, what do you do? You use an escape character, right, like a backslash. So here too there is an escape character; find out what it is. Not now. Without the escape character you cannot type it, because the parser will not understand whether there is a space or not. Find out what the escape character is and tell me. This dataset is very interesting; you can find a lot of things inside it.

So now, what I want to do next: we'll look at more analysis later, but there is a certain type of file called parquet. Some of you might know it: parquet is a compressed, columnar format, and Hive is excellent at storing parquet files, so sometimes when you get data from Hive it will be parquet data. Question one: how do you read parquet into Spark? Question two: how can I run queries without using SQL? Right now we've been running SQL queries; suppose I don't want SQL, I want to use Python to query the data. This folder will be shared with you right now: names.parquet. If I open it, what is inside? This is how parquet files look: these are the data files, and this is the metadata of the file. If you try to open one of the files, it will throw garbage at you; it isn't meant to be displayed as text. Parquet files, by default, you cannot open in Notepad or anything; you have to use some system to read them, and that's what we're going to do. So I'll just check whether I can read the data first. My data is in a folder, and inside it there is a file called names.parquet.
I open this in a Python shell; again we'll use the shell, and we already have it. So how do you read a parquet file? Very simple: you say spark.read.parquet(path). It gives some warnings; everybody is trying to read at the same time. Then df.show(). One thing to know is that parquet files tend to produce a lot of warnings when the metadata is accessed; that's normal, and it happened when I read it too. So this is the data we have: you have first_name, gender, total, year. What is this data? These are the names of babies born in a given year and how many were born: there were 54,336 Jennifers in 1983. It's a US state dataset. Can you read it? You have to give the location of the folder where the data is; if you can read the parquet, the command is working.

Can you also do df.count()? It's there in the Python file; see how many rows there are. df.count() should give you around 18 lakh, about 1.8 million records. You can also do df.printSchema() and you can see first_name, gender, total, year; those are the columns we have in the data.

Now, what do we want to do? If you want to select specific columns, there's a select command, very much like before: you can simply say df.select("first_name", "gender"), and then if you do first_names.show() you should see only the two columns you selected. The select command is very common, just like in SQL: I want only these columns.

You can also count, and you can count distinct. The names repeat every year, right? Every year there will be somebody with the same name, so I can ask for a distinct count of names; it comes to roughly 93,000 distinct names.

Anyway, let me do one thing: I'll show you a couple of things on the Python-analysis side. Those who have access can run it with me; I think some people's shells are down, but don't worry, the same commands will work, and you have access to the cluster anyway. Now, suppose you want to find something like this from the original DataFrame: the five most popular names of girls born in 1980, the top five female names of 1980. The query looks pretty much like a SQL query; let me copy it. All you're doing is a filter on the DataFrame: year equal to 1980, gender equal to female, order by total descending, and then you select first_name, limit 5. If I run this query and do popular.show(), it shows me the popular names. Look at the query: don't you find it very similar to SQL? You have a filter clause, you have order by; the only difference is that every time you're writing df.column, using the name of your DataFrame. In Scala you can also use the dollar syntax, the $ column interpolation, for projecting a column; I wrote that down somewhere, though it's not here.
running filter queries or order by queries this ordering them in descending order Everything is easy It's like a fight in court here Also you can do the same thing Same thing how I wrote it somewhere I don't have it here You can try it yourself Scala The difference will be This will be three equal the hot this equal toe operator Right So this is not assignment assignment distinctly Quill This is what comparison Operator If you're writing the same Korean Scala you'll need three equal symbols for comparison There is a major difference Apart from that the query will be very similar nor difference adored the light Now let's try to do something interesting Okay So how popular were the top and female names off 2010 back in 1930 So let's say you want to understand this You want to find out the top 10 female names off 1930 on then 2010 and then I just I just want to compare them Hope popular Obviously if I want to do this the first step will be I have to filter the data off all the female names off 2010 then female names off 1930 then I'll go join And then I say whether it is good or bad right now you can also write your own UT of functions in sparks equal You'll certainly find functions like to do that You just have to import this UT If so I'm just saying that from by Spark sequel functions import beauty of UT f or you sir define functions will allow you to write your own function since park rather than using sparks functionality right And why did I import this Because I'm just creating a simple function called Nowhere Here you can see so lower is a function of what it will do It will take any any name whatever you give and just do as daughter over You convert allover lower letters just to show you that it's possible And once you have this what you can do is that I'm creating a date offering called uh 2010 So I'm saying that I want to filter the year is 2010 on Nortel asked There is a column for total 2010 General call it as gender Toto sent in First name is first name 
I will pass the first name through this lower function, convert it to lowercase, and call the column name_2010. So you see, you're using aliasing here. You're creating a new data frame, but you alias every column; you are renaming it. The only difference is that you're calling this lower UDF and filtering on the year 2010. So this creates a data frame called ssa_2010 where you have all the names of people. And we also create one more for 1930. Here I am doing the same stuff; only the year is different, 1930. And once you have both these data frames, look at the join query. The join query also looks pretty sane. In the join, the way I have written it, I put a filter; this filter could have been applied before, but I wrote it here. Filter where gender equals female, and I want to join this with the 1930 frame, where again the filter is female, and I want to match on the name. Then order by total, limit ten: the top ten female names. And if I actually want to run the query, I have to call show on the join. So again, it is lazy evaluation: if you write a join query, Spark will not run it; you do a .show() to fire the execution, and then it will do the join. And we are operating on around 18 lakh rows of data here. Well, actually we filtered it, but the total number of rows we have is around 18 lakh, and still it is able to process the Parquet data. We'll see; I think it is taking a bit of time. Anyway, you write the join query like this. This filter is actually not required; you can filter earlier, as I already did when creating the data frame. In the join, what you're saying is: this is the first data set, ssa_2010, and the second data frame is ssa_1930. From here the column is named total_2010, from there the column is named total_1930, and name is the common column between them. So when you are joining, you join on name.
Doing the join, you say: join using this column, the common one; you say a.join(b) and give the column. That is the join condition: this name column matches that name column, because you want to understand how the name distribution compares. And then what do you do? You order by total in descending order so that the top names come first. You join by name, then you order by the descending total count so that if Mary is the most popular name, it will come at the top. And you say limit ten, because you want only the top ten. Then you're just aliasing; you project whichever columns you want, so you say first_name as name, then the totals. It should show here. Yeah, so these are the aliases we have, and these are the projected columns. So it seems Isabella had some 22,000 people here in 2010, but only 142 back in 1930. And I think Elizabeth is a very common name even across both 2010 and 1930; in both places the counts are almost similar. Apart from that, names that are very rare here can have a large number of people with the same name there. So join queries and all look very similar to SQL. But maybe, if you're more comfortable in SQL, you'd want to write it as SQL. Usually, if I give you this data, I don't think you'd write it like this; I think you would register a table and then say join. That's easier for you. But my point is that you can actually write it this way too if you want. If you want to practice this, here's what we did. First we read the data; this is how you read it in Spark. If you want, you can cache the data, which is optional. Then we selected only the first name, and then we did a count to see the total number of records. And this was the query where we were trying to find the five most popular names for girls born in 1980; it's a simple query. And then we wanted to try joins.
So first we created a UDF for converting to lowercase. Then I found all the people who were born in the year 2010 and all the people who were born in 1930, and here is my join query. So, like I said, these parts are done: you have reading from RDDs, reading from this, and if you want to read from CSV, there are two ways. One is that you can say spark.read.csv and give the CSV file; like JSON, it will read it. Previously we did not have that option; it came later. When I started writing PySpark code long back, we had to use a third-party package from Databricks, spark-csv; that package would help you read CSV. You had to download that package and add it; only then would you be able to read. But now you can directly say spark.read.csv and it is going to read. All right, so now what do we want to do? Why don't we try XML, like I said. Okay, so the problem is, if you want to read an XML file, by default you cannot; there is no spark.read.xml, so far it is not available. If you just Google it, you can go get it: search for Databricks spark-xml. This company called Databricks, founded by the inventors of Spark, has created a Spark XML data source. What is this? It's a package, actually. You can download this package, and using it you can read XML files. The question is, how do you download this package? While starting the Spark shell, you can ask Spark to download it. So I have an example XML file. If you go to the folder here, there is a file; can you see it, employees.xml? This one is an XML file. If I open it, see, it's pure XML: it has what is called a root tag, employees, and this is a row tag, employee. Within that you have employee number, name, address, city, country, and so on. Now, the first step you need to do is upload this file into the datasets folder in HDFS.
If you go back to the folder again, there is a file called "Reading JSON and XML"; open it. It is clearly written there: to read XML files, you need to restart your Spark shell with the below command. It says: spark-shell --master local, and then you're saying --packages with the com.databricks spark-xml package. This means you're starting your Spark shell and asking it to fetch the package. Sorry, it should be pyspark, right, not spark-shell; my mistake, I was writing the other one. So simply say pyspark --master local with the packages option, and you can see it downloading. Can you see: resolving dependencies, resolution report, modules found? It will download this XML module; you can clearly see it. So now your Spark shell is running, but it has also added that XML module. So this is the command; there is a mistake in the notes part, so if you want, you can edit it. Look at my screen: that's the exact command you need to type. So one way is that you download it in the shell. If you're writing a program or something, you can add the package name in the code, and it will be downloaded for you while running. You can also get a jar of this package locally if you want to add it every time; otherwise you would have to download it each time. It is available as a jar file you can add, but this way is the best. Were you able to start the shell in local mode? Great, so it has started. Now, if you want to read the file, this is the command, and it's very simple. How do you read it? You say spark.read.format; there is a read.format method, and you mention which package you're using to read. Then option inferSchema true, which means you want it to automatically understand the schema; the row tag is employee, the root tag is employees; and then .load() with the location. And now if I do a show, do you think that's enough? It depends on the structure of your XML; this file has a specific structure.
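The shell launch with the external package, roughly as shown on screen, would look something like this. The exact package coordinates and version number are an assumption; check Maven Central or the spark-xml README for the version matching your Spark and Scala build.

```shell
# Start PySpark and ask it to download the spark-xml package.
# The version (and Scala suffix) below is an assumption; pick yours.
pyspark --master local \
  --packages com.databricks:spark-xml_2.11:0.4.1
```

Inside the shell you would then read the file with spark.read.format("com.databricks.spark.xml").option("rowTag", "employee").load(path), as described above.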
You give the root tag and the row tag to tell it what is what. If you want more information, it is given here: these are the features you can add. path is the location; then rowTag, where you give your tag; samplingRatio, if you want to do sampling, which we don't want to use; whether you want to exclude any attributes; whether you want to treat empty values as null; the column name for corrupt records; valueTag; charset. So anything you want is here. Now, we used a very simple file, but you can see that it perfectly created a data frame from an XML file. You have a data frame, so I can start querying the data. Try this by yourself. But don't trust everything that I teach; I mean, this is how you read XML, but not every XML can be read like this. This one, by default, has a proper structure, so you can read it; some XML files will be much more complicated than this, and there you have to manually do something to give it a structure, and only then can you read it. We had a use case where we were getting this voice XML. Like, if you are in a foreign country and you call customer care, you press one, press two, and so on; the call-center systems record how much you called and all these things, and that would come in a voice XML format. These telephony systems push it as a structure called voice XML. To read it, quite a bit of work was needed before you could load it; we actually did that. But if you give that file to this package, it just won't know what you're talking about; it has a structure and all, but not every XML can be read like this. So, a lot of applications out there will store or give you data in JSON or XML format through an API if you connect; but there is no rule that it is always JSON or XML, it depends on the application.
Today most of the JavaScript world is built on JSON, but XML is also out there; like I said, we had the voice XML, so some systems produce it. And you are just a consumer: normally, in big data analytics, we have this constraint that we just get the data, so we cannot say "give me only JSON" to the applications that already emit XML. Now, I told you, I think there is one more format called Avro, sorry, I forgot; you can also read from Avro. And no, Avro is not normal text data; it's material I will show you tomorrow. If you open an Avro file in a normal text editor, it may not make sense: you will see a header of sorts with metadata, and then the actual data. I'll show you an Avro file tomorrow. If I have an Avro file, I can simply say there is a package for reading Avro, again from Databricks, and with that package it will read it. Now, in one of our projects, here was the use case. We had train engines from GE, because my major client is GE. They have these train engines, locomotive engines. The train engine produces sensor data, and that sensor data is sent to some cloud someplace, and from there it lands in your Kafka, but it comes in Avro format. There was a name for it; it is called events, I'm thinking, events, not incidents. Every two or three seconds, one file lands in that folder. So the sensor data is continuously streamed, and every second or two a file lands in a folder, actually a Kafka topic. Then you read from Kafka into Spark. But even using this Databricks Avro package, we were not able to read the file. So they wrote a normal Java program which would read the Avro, convert it, apply whatever schema they have, and hand it in a proper tabular format to Spark; then Spark would process it. So most organizations will have a data ingestion team, and they will handle it.
They handle it because, even though I'm saying you can read JSON and you can do all this, if you actually go to production you'll have a number of other problems; not everything will happen automatically like this. So you'll have a data ingestion team; they will take the Avro, convert it into some structure, and say, "Hey Spark, take it." So not everything will work exactly as it is written. And I'm saying events, not incidents: it's called events. Every two seconds an event will happen; whenever there's an event, a file will appear in the Kafka topic, and we were continuously reading it. Avro files will have a .avro extension, like abc.avro. It will come up tomorrow: we will be downloading data from Twitter using Flume, and that data will come as Avro, so you can actually see what Avro is; you don't have to worry, we'll see Avro tomorrow. Normally, when you use Flume to get data from Twitter, there are two format options: one is JSON, and the second is Avro. We are using Avro in the exercise, so you will see that it actually comes in Avro format. Okay, the next thing I want to try is reading from an RDBMS. Here, if I do show databases (don't worry, I'll give you this link), I'll say use retail_db, and if I do show tables, see, there is a database called retail_db which I created, and within it these tables are there, and there is a table called orders. I think there are also order items, right, there is something called order_items. So if I do a select star from order_items, oh, I should add a limit, but I think it should work. Did you all connect? Yes? Right. So you got the data: there are around 172,198 rows in this table. All right, so it's a pretty big table; not so big on the whole, but at least we have some data. Now, what if you want to read this table into Spark? We can do that. If you want to do that, the first thing you need is the MySQL JDBC driver.
So, the JDBC driver: can you check, the driver is in this folder? Can you see this mysql-connector-java something-something file? I want you to upload this using FTP to your Linux machine. Let me show you: upload the driver using FTP. See, I have uploaded it here, can you see? I'm just checking; not this one, I think the name is different, mysql-connector, yeah, this one, not that one; that one is the binary, actually. So I want you to upload this jar file into your home folder using FTP or a similar tool. Because what we will do is start the Spark shell and give this jar file to Spark. Once Spark has started with it, within Spark we will write our connection string, saying: connect to the RDBMS, give me this table, and it will read it. It should; we will see whether it actually reads or not. So everyone, start the Spark shell; nobody has started PySpark yet, right, because you need this jar first. So Spark will start, and within Spark we will write our connection string for connecting to the database: what user name, what password, what connection string; and it will read the table and create a data frame. It should be able to do it. Sqoop is like a data transfer utility; here also you're doing data transfer, but you're not storing anything: you're just bringing the table into Spark as a data frame so that we can fire SQL queries on top of it. Typically this is a very rare use case, actually reading from an RDBMS. In MySQL you can grant privileges for the user, so you have to grant privileges for this Spark user to read the table; the lab user, or whatever the user name is, should have the grants, and only then will it work. That's one more problem. I have not actually used this much, because the use cases are very few, but in one of our customers' cases they had Oracle; they wanted to read it, and at first it didn't work,
because Oracle had a number of other problems with permissions, and initially we would hit errors and it would not read at all. Then he fixed something around the permission issues, and finally we were able to read the data. His requirement was that he had a lot of data already in an Oracle data warehouse that he wanted to bring into Spark. From Oracle you are only reading; you can't transfer it, actually. When he was running join queries he wanted this data to be available, so initially we read it like this; later he used some sort of caching layer on top of Oracle, where the data would already be cached, and Spark would read it from there and then process it. So this method is not so efficient if you have a huge amount of data; usually we read it only once. So, once you have the driver: I didn't tell you what we need to do, right? You have to start the Spark shell like this. Do you have this file called "Reading XML, JSON and JDBC" in the notes? Yeah. So here you say pyspark, then you say --master local, and at this line you will say --driver-class-path and --jars, meaning you're just saying: use this particular jar. So you're saying: start the Spark shell, this is the driver class path, and this is the connector jar that we have, and it starts. Now, if you want to read, let me try it first. So this is the jar you're giving here, and the location you're giving here; there is something called driver class path, and both you have to give; the arguments are different. Let me see if I can read it. Okay, so how do you read? You say spark.read, there is a method, and then you say format("jdbc"); this is very important. And the url option: that is the connection string. Then there is something called dbtable, where you have to mention the database-dot-table you want to read, then the user name and password, and then .load(). Now, ideally, if I do a jdbc_df.show(), we have the data.
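The launch command for this, roughly as described above, would look something like the sketch below. The jar file name and version are assumptions; use whatever mysql-connector-java jar you actually uploaded.

```shell
# Start PySpark with the MySQL JDBC driver on both class paths.
# The jar name/version here is an assumption; match your uploaded file.
pyspark --master local \
  --jars mysql-connector-java-5.1.45-bin.jar \
  --driver-class-path mysql-connector-java-5.1.45-bin.jar
```

Inside the shell you would then run spark.read.format("jdbc") with the url, dbtable, user, and password options (all placeholders for your own connection details) followed by .load(), as shown above.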
So this is coming from MySQL, right? You read from MySQL directly into Spark. See how easy it is. My whole point is, more than the functionality, things are very easy. If I had to do this in some other programming language, I would have to sit and write so much; I mean, Java can do it, but I'm saying the functionality here is very simple: anybody can read and understand what is happening, you don't even have to explain it. So we're actually passing two arguments here: one is the jars option and the other is the driver class path. Really, the jar alone is enough if you're running in local mode, but sometimes, if you give only the jars option, it throws an error saying it can't find the class path, so we give both. And this is a data frame, so I can fire queries on it; I want you to practice by yourself. Don't press me; run it by yourself. And if you're working in Spark 1, this is completely different: you don't have this format("jdbc") and all. They are similar, but the commands are slightly different. There you first create a properties object, a utility object, where you put in the user name, the password, the URL, and pass that object; that's how you read. Only in Spark 2 did they introduce this method. So I don't know, if you start working, maybe you'll be working on a Spark 1.x cluster and you may not see this; here we are running a Spark 2 version, so we showed the Spark 2 way. And if I go to the Spark SQL documentation, can you see the menu: reading from an RDBMS, "JDBC to other databases". Note that dbtable, the JDBC table to be read: anything that is valid in a FROM clause of a SQL query can be used, for example a subquery instead of a table, though I think usually it is a table. And of course I showed you only the basic options: the url is there, dbtable is there, the driver is there.
There are a lot of other things you can set. You can have a partition column; you can have a fetch size, the JDBC fetch size, which determines how many rows to fetch per round trip; these are all performance considerations. You can set truncate, and a lot of other things; we will look at them later. Last but not least, since a lot of you were asking about installing Spark, I thought I would show this, because maybe it's something you can do; some of you have already done it. You can just visit the Databricks website, and there's a button called "Try Databricks"; just click it. Databricks gives you a free trial of their paid platform, but that is for 14 days only, so don't do that one: if you sign up for it, you have to use your credit card and so on. There is a Community Edition here; with the Community Edition you can just get started: give your user name and password, and no credit card is required. Once you sign up (I have already saved my account in my laptop; this is the one), you should be able to log in like this. So once you create your account and sign in, you will get this page. The important feature is that you can just go to Clusters; there is a menu called Clusters, and you can simply click on Create Cluster. So click Create Cluster, give it some name, and it supports up to Spark 2.3.1; there is a Databricks runtime version too, don't worry, the Spark version is 2.3.1. Say you want to go to 2.0.0, you can select something like that, and the Python version: do we want 3? Whichever you want, you can select. Don't make any other changes; just create the cluster, and now the cluster is getting created. This will give you a single machine, on AWS, with about 6 GB of RAM, and it is Spark 2.3.1, and it is very fast; I mean, the cluster creation and all really will not take much time.
You can see the Spark UI from here. If you're running something on this cluster, you can stop it and delete the cluster. I was running some clusters previously; you can see them here. So you have this cluster menu, like here. Now the cluster is running; let me get back to it. The cluster is running, and you can go to the Spark UI; there's a Spark UI, and as of now there is nothing in Spark. Now the question is, how do you work with the cluster? The first thing you can do is upload your data. Just go to this Data section here, and you can simply say Create Table; they store everything in a table format. Let's say we want this JSON file; where is it, people.json? Just drag and drop it here, and this will be the location; just copy this location, this is where the file will be. If you want to work, go to the home page and you can create a notebook: New Notebook, give it some name, the language is Python, Scala or R; we will choose Python, the cluster is the running cluster, and Create. And then df = spark.read.json with that location, and we have the data. So this is good for practice, and it's absolutely free so far. But the point is, if you create this cluster and don't touch it for 120 minutes, they will delete it. They have to save something too: if you're not using it, they will delete it. But I think that's not a big deal; let them delete the cluster, because your code and all will be saved. If you create a notebook like this, it'll be persistent: you can come from any laptop, and whenever you want, you upload your data and start running again. I feel it is good; you can use it for your own exploration. A lot of people wonder how to install Spark and mess with Spark, so this is your answer. One more thing: a good amount of documentation is there. Please, please read through the documentation; that is real documentation, and you can just go through it.
Now, what are we trying to do here? That's what I want to explain. Normally, when you install a Spark cluster, there is an option to integrate Hive and Spark. By default they don't talk to each other, but what we do is this: when we install Spark, we tell Spark that there is Hive. If you remember, in Hive you have something called a metastore, where it remembers all the tables you create. So basically you're telling Spark: Hive is somewhere, go and talk to it, and then that Hive support is available. If you're starting the shell, like we said, pyspark, that PySpark shell has everything configured; you can simply read the tables, you don't have to do any configuration. But let's say you're creating your own application, in a script file or something; there you have to write these statements, and that's why the statements are important. If you go here, in your hive.py, you will create the Spark session. You will say: this is the name of my session, and some configuration you can pass; you have to say enableHiveSupport, and getOrCreate. This you have to do if you're writing your own source, like you're writing your analysis and you want to read some Hive table. So you have the program; didn't we run it, not from the shell but as a Python script? So remember, you can create your own SparkSession (in the old API you created your own SparkContext); you have to create your own Spark session, and while creating it you say enableHiveSupport and getOrCreate. This means that using this Spark session you can talk to Hive. From the notebook and also from the PySpark shell you'll be able to access it, but I'm saying, if you are writing your own source, say you write the source code here and submit it to some cluster and say "run this program," there you have to create your own Spark object. So right now, ideally, if I run this cell, it should tell me whether Hive support is there.
Hive support is there; the next run confirms it. I saw Hive support is enabled. In the shell you don't have to run this part, because the Spark object is already running; it was created when the shell started. You need to set it up only when writing your own program. Then, what you do if you want to create a table: this is how you do it. You will say spark.sql, and then CREATE TABLE IF NOT EXISTS, some table name, and then the column definitions. This create table command is actually a Hive command: through spark.sql, from Spark, you're telling it to create a table in Hive. There is one catch: the table name will go into your default database in Hive, and you cannot use the same table name if it already exists. You can say database-dot-tablename as well; if you take the same name again, it will throw an error saying that the table already exists, that's my point. So in Hive, if you have a database, you can give database.tablename, or you can simply give the table name. So I said CREATE TABLE IF NOT EXISTS with a name, and then the column definitions; that's my schema. And then, if you want to load the data, you simply say LOAD DATA INPATH; in my case the key-value file should be here, so I copy that file, and in your case it is in /data, then you'll see it there. Now, do one more thing; there is one more point. If you load a file into Hive, Hive will cut this file from its location, right? Do you remember that point? If I load this file, it will be removed from the source: the first person who loads it will get it, and the rest of you would get an exception. So I really want you to copy it into your own home folder in HDFS, so that everybody will have a copy. Otherwise, once somebody loads the data, the file will be gone from the common folder and you will not get it again. Copy it to your own home folder, everyone. So here, you see, I can simply say I want to copy this with a hadoop fs copy action.
Where do you want to copy it? To your home folder. Some of you are saying you're not able to see the file. Also, one more point: give a different table name; don't use the same table name. All of you, give a different table name, otherwise it won't work. You can also use database-dot-table if you have a database. How did I do it? I just gave the table name. So I created a table; instead of that, you say CREATE TABLE with your own name, then load it, and you should be able to see the data as a data frame. So once you do this, you can start firing queries on it, and there will be data. You basically created a Hive table and loaded the data into it, all from within Spark. See, we created a table, then I loaded the data; that is also Hive. If I want to query the data, I mention it here in my query: I'm saying SELECT * FROM that table, reading from Hive within Spark. You won't need to do this every time; it is just to show that you can read it. Like that, you can do this with any Hive table. So now I'm saying SELECT * FROM this table; if you have a different table in Hive, try that: select star from some database-dot-table, a table we created, a transaction table, any table you have. Directly like this you can read from Hive; writing is a separate matter. Now, during Spark installation, Spark will not know that there is Hive. The interesting point is this: if you install Spark in a new cluster, and you're dealing with Spark only, there is no Hive it knows about. You start spark-sql, and it will create a metastore separately; spark-sql requires a metastore, so if you haven't told it anything about Hive, it'll create one, and there will be this metastore of its own where it starts storing all the table definitions. But then your problem is that you have one Hive metastore and one Spark metastore, and that is bad, right? So what people ideally do: the Hive configurations are in a file called hive-site.xml; they copy that file, and there is a Spark conf directory, and they paste it there.
If you put it inside that conf directory, Spark will know: okay, this is the Hive configuration, this is the metastore; so if you create a table, it will be visible. If you directly download Spark, not in a cluster, say on your laptop, and you say start, create table, what happens is this: it comes with a small embedded database called Derby for the metastore, and it will dump everything into that. Got it? But that is not very professional, not very efficient. Spark SQL can work independently, but then the problem is that Hive and Spark can't communicate between themselves; I am creating tables in one, but you should be able to read them from the other too. So that is why this configuration has to be there. Another point: I want you to try this. If you run any query, say spark.sql("SELECT * FROM ..."), I can see the data, and this table can be any Hive table you have in Hive. You created some tables, so in this session say spark.sql with any query that you want, then .show(), and it should run, and you see the data. So you're practically accessing Hive tables from Spark, and it gives you the data back as a data frame. So, the Hive configuration is in a file called hive-site.xml. That file contains all the Hive configuration, and a copy of that file has to be placed in the Spark configuration directory. I think I should be able to show you; we have command-line access here, I don't know whether this will work. Yeah, so just a quick thing: if you install Hive, the configuration folder will be /etc/hive/conf; look in there, and you will have a file called hive-site.xml. This is the file; this file contains all the Hive settings. This file should be copied into Spark for integration; only then will Spark know. Otherwise, if Hive sits in another place, how does Spark know where the tables are when you simply say select star from some table? Spark reads this file and then gets the locations. Here you have the metastore settings:
you see the HDFS paths and the Hive metastore host; this is where Hive is storing all that for you, and Spark will know it from here. And this thing, the Thrift service: have you seen it anywhere? It runs as a service, and it allows you to talk to Hive using any language. Let's say you have a Python program, or your own Ruby program, and you want to connect to Hive: your program will say "connect," and it will hit this Thrift service, which can handle a connection from any language; different languages will all be going through it, and Java will use something similar. So in any language, if you write a program and you want to talk to Hive, Thrift will be there in between, handling the connection. Now, where is the Spark conf folder on this cluster? It should be somewhere here, and hive-site.xml should be seen in it. No, they have not copied it here; they have copied it somewhere else, or passed it in some command, or given it to Spark as a property, probably. Normally, one place is the conf folder: if you just drop it in the conf folder, it will work; normally in a cluster we used to do it like that. So Thrift is used to handle connections coming from any language; it has multi-language support. If you're writing Python, you will say: I want to communicate with Hive, this is my Hive server, here is my user name and password, and the request will hit the Thrift server; the Thrift server handles all these connections. It's very common in Hadoop; in a lot of places you have this kind of service. Through this you can read, and we will show writing also; you can read from Hive. Now, you can continue with this, but I want you to exit the shell where you started that JDBC connection, and simply start pyspark again,
I want to show you how it looks in the shell as well. Don't start it with --master local this time — if everyone starts local shells that's fine for practice, but run pyspark against the cluster. I will exit my existing shell first, because otherwise it holds on to resources. So: pyspark --master yarn. Next, I want you to find a CSV file — any CSV file will do — and upload it into HDFS. Let me see whether I have one... yes. I'm using a file called plane.csv; it need not be the same file for you. It wasn't in this folder... it seems to be somewhere under /data — there it is. So copy the file to your own home location — don't use the exact path I'm using — and then put it into HDFS with hdfs dfs -put. Once it's there, you can simply say spark.read.csv(...) — you can see how to read a CSV file; Spark is integrated, and the CSV reads fine. Now, you can actually write this into Hive. This is a DataFrame, and I want to write it as a table in Hive. I have the DataFrame ready, so I simply say df.write.saveAsTable("some_table") — and that's it, it saved it as a table. This is the nice thing about the shell: with everything configured, one call creates the table. And there is something called SaveMode in Spark SQL — it's documented, and worth looking at.
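As a sketch, the read-and-save flow just walked through looks like this in PySpark. Assumptions: the CSV path and table name are placeholders, hive-site.xml has been shared with Spark as described above, and in the pyspark shell the session already exists as `spark` (the builder lines below just make the snippet self-contained for a standalone script):

```python
from pyspark.sql import SparkSession

# In the pyspark shell this session already exists as `spark`.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the CSV previously uploaded to HDFS into a DataFrame.
df = spark.read.csv("/user/me/plane.csv", header=True, inferSchema=True)
df.show(5)

# With the Hive integration in place, this lands as a real Hive table.
df.write.saveAsTable("plane")
spark.sql("SELECT * FROM plane").show(5)
```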
SaveMode lets you define how the data is saved. There are a few options. One is append: if the table is already there, it will append. There is overwrite, which overwrites it; ignore, which skips the write if the table already exists; and the default, errorIfExists, which throws an error if the data is already there. In the Scala API you import org.apache.spark.sql.SaveMode and mention the mode you want; in PySpark you simply pass the mode name as a string to the writer. The modes are listed in the Spark SQL documentation. A second interesting thing: while saving into Hive you can also mention what format it should be stored in. You can say .format("parquet") — if you leave it as plain text it won't compress or do anything clever. And quite interestingly, you can do one more thing with the same DataFrame: write it somewhere as raw files. I said df.write.json(...) — and this gets stored as JSON part files inside a folder, because that is how Spark lays out output. There is df.write.json, df.write.parquet — so conversion from one format to another is easy; you can try it yourself. If you go to that output folder... it should have been created — let me show you. Yes, it created a folder, and inside it are the part files. I'm not sure whether you can set an exact output file name — you give the folder name, and the part file names are generated. So: you read a CSV file into a DataFrame, and from there you said save as JSON; you could equally say df.write.parquet to get Parquet. All these methods are available — just explore how to read and write a DataFrame in different formats.
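The save modes and format conversions above, sketched end to end. Same assumptions as before: a Spark-with-Hive environment, and placeholder paths and table names.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.csv("/user/me/plane.csv", header=True, inferSchema=True)

# Valid modes in the Python API are the strings "append", "overwrite",
# "ignore" and "error"/"errorifexists" (the default).
df.write.mode("overwrite").format("parquet").saveAsTable("plane_parquet")

# Format conversion is just a different writer call on the same DataFrame;
# each call produces a folder of part files, not a single named file.
df.write.mode("overwrite").json("/user/me/plane_json")
df.write.mode("overwrite").parquet("/user/me/plane_pq")
```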
Okay, enough of that — let's move on to Sqoop. Many people say it like the ice-cream "scoop", but the spelling is different: S-Q-O-O-P, not like the ice cream. So, we are done with Spark SQL. Now there is an assignment — not just practice exercises, a proper one to submit: a retail data analysis case study. It is very good; I will share the solution along with the code, so you can run it, try the same things on the retail dataset, and understand how to work with it. So what exactly is Sqoop, and why do you need it? If you install Sqoop, it allows you to bring data from any SQL store. That SQL store can be an RDBMS like MySQL or Oracle, or it can be a data warehouse like Teradata or Greenplum — whatever the store is, it understands SQL; it has databases and tables. Using Sqoop I can go, get the data from there, and dump it into HDFS. So one possibility: you bring the data from that side and dump it directly into Hadoop as files. Second possibility: I already have a Hive table; I can say, read the whole table from Oracle and dump it into my Hive table, and it goes straight in. I can also store it in HBase — that is not commonly used, but it's possible; HBase is not a SQL store, but I can still say, read from MySQL and put it there. So this tool is like a courier, a typical mule: it cannot transform the data along the way — transformation is a different job. It can handle the reverse direction as well. What do I mean by reverse? It can take data from Hadoop and dump it into the RDBMS. So both ways you can transfer the data. It is a data transfer tool; it is not an analysis tool.
You're not doing any analysis — you're just bringing data from point A to point B; that's all Sqoop is. Now, where should you use it? If you have a large ETL setup, people actually use dedicated tools like Talend or Informatica. The ETL developers will be writing Talend jobs; Talend is an ETL tool and it is much, much more capable than Sqoop — Sqoop is a very basic tool compared to Talend, which can do a thousand other things. So a lot of people use a real ETL tool, Talend or Informatica. But then where is the requirement for Sqoop? The biggest challenge is this: you join a company and start working on a big data project. You have some data sitting in, say, the customer RDBMS — customer data, order data, whatever — and you want to bring it in. Definitely the ETL tool could help you bring it, but here is the problem: you go to the ETL guy, and he may never have heard of Hadoop. He might be a Talend developer, but he has never integrated Talend with Hadoop, so he can't handle that job. Wherever I have seen it, the ETL team is a separate team; they may know a bit about Hadoop, but they are not experts in it. And if you go to the Talend guy and say, can you help me bring data from this source, he says, I don't have time to do your job. That is where Sqoop fits: you don't have to take help from anybody. You can write your own Sqoop job and get the data — you become a small-scale ETL developer yourself. That is the actual use case for Sqoop, and on any number of engagements we have used Sqoop for importing and exporting. Note the terms: import means from the RDBMS to Hadoop; export is from Hadoop to the RDBMS — say, sending data from HDFS to MySQL. Those are the two directions.
Now, how does this work? Is it just magic? No. Requirement number one: you need a driver for pulling the data — for MySQL, of course, the MySQL JDBC driver. We have it available in the Sqoop lib folder; without it, Sqoop won't understand what you're talking about. Second point: the data transfer is done using MapReduce. You're thinking MapReduce is extinct — it is not extinct here. Meaning: Sqoop understands, okay, you have some data and it's a table; it can divide your table and launch mappers to copy the data, so multiple mappers run in parallel while copying. From Hadoop's point of view it's a map-only job. The mapper isn't computing anything — it's copy-paste; it's just moving the data — but it is a map job, and you will see map tasks running when you run Sqoop. Anybody who has run Sqoop will have seen that: map 0%, map 100%, and no reducer. So what you actually need to learn here is how the transfer happens — a confusing topic in many trainings, and the questions come up, so let me walk through it. You have a table, and you say: I want to bring this table into Hadoop. You write a Sqoop command, passing the driver, the username and the password to connect to the database, and Sqoop will inspect the metadata of the table. Step number one: it tries to find your primary key — that is first on the agenda, meaning your table should ideally have a primary key. Imagine this column is the primary key — I'm just taking an example. Sqoop automatically detects that column as the primary key, and by default it divides your table into four parts. So if this whole table has, say, a thousand rows, it becomes roughly records 1 to 250, 251 to 500, and so on — four equal slices, four parallel copy jobs.
The split is based on the values of that primary key column — the primary key is used to divide. So now your table is divided into four, and Sqoop launches four mappers: mapper one, two, three, four. Each mapper copies one part and stores it in HDFS, meaning you get four part files, just as you would from a normal MapReduce job. In the ideal condition: Sqoop inspects the table, finds the primary key, uses it to divide the data into four, launches four mappers, each picks one mutually exclusive part of the table and transfers it to your Hadoop destination, and you end up with four files. That is the default behavior. Now the question people always ask when I explain this: why four? Who decided the number should be four? Any idea? You and I don't know; it's not in the documentation. Somebody on Stack Overflow suggested that when Sqoop was first created, a typical machine at that point in time had a four-core processor — one mapper per core, four mappers, the maximum sensible parallelism on one box. That was the thinking when Sqoop was created, and the default has stayed the same ever since, even though hardware has moved on. So now that you have far better machines, you can override it for big tables: there is an argument, --num-mappers, and you can say I want ten, and Sqoop will divide the table into ten. Depending on the size of your table, you can override it — hey Sqoop, divide my table into 10, or 100, or whatever number suits.
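The mechanics just described can be sketched in plain Python — an illustration of the idea only, not Sqoop's actual code: cut the key range returned by the boundary query into near-equal slices, one per mapper, and notice how a low-cardinality split column skews the row counts (this previews the --split-by caveat discussed later).

```python
from collections import Counter

def split_ranges(min_key, max_key, num_mappers=4):
    """Cut [min_key, max_key] into near-equal (low, high) slices, high exclusive."""
    span = max_key - min_key + 1
    size = span / num_mappers
    ranges, lo = [], min_key
    for i in range(num_mappers):
        hi = max_key + 1 if i == num_mappers - 1 else min_key + round(size * (i + 1))
        ranges.append((lo, hi))
        lo = hi
    return ranges

def rows_per_mapper(values, num_mappers=4):
    """Count how many rows land in each mapper's slice of the value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo + 1) / num_mappers
    counts = Counter(min(int((v - lo) // width), num_mappers - 1) for v in values)
    return [counts[i] for i in range(num_mappers)]

# A 1,000-row table keyed 1..1000 becomes four slices of 250 keys each:
print(split_ranges(1, 1000))   # [(1, 251), (251, 501), (501, 751), (751, 1001)]

# Unique, evenly spread keys keep the mappers balanced:
print(rows_per_mapper(list(range(1, 101))))           # [25, 25, 25, 25]

# A low-cardinality column (90% of rows share one value) skews badly:
print(rows_per_mapper([1] * 90 + list(range(2, 12)))) # [92, 3, 3, 2]
```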
You can give whatever number you like — but here is the problem. Say you have a very large table and you ask for a thousand splits: it will happen, but a thousand concurrent sessions will hit your RDBMS. Your RDBMS must be able to handle that load. If you tell Sqoop to divide the table into 1,000 parts, it launches 1,000 mappers, and 1,000 mappers attack your RDBMS at the same time — your RDBMS may suffer. So don't increase the degree of parallelism beyond what your RDBMS can support. We can look at the job output later, and I can show you exactly how many records transferred. If the row count divides evenly, the splits are exactly equal; if you have, say, 1,001 rows, one mapper gets an extra row or two here and there — nothing more than that. Again, I'm talking about the default settings: nothing overridden, and we're assuming a primary key exists. So the first setting you can override is the number of mappers. Now, what if you run a Sqoop import and the table has no primary key? Your command will fail — it will say it is not able to find a primary key, and Sqoop will simply fail your import. So in case your table does not have a primary key, you have to mention a column in your command. Ideally it should be a numeric, indexed column with largely unique values — technically any column can be mentioned, and it will split on it, but if you pick a column with a lot of repetition and few unique values, the division will not be good. Another scenario: a table with no primary key, but with a transaction ID or timestamp column — something that can behave like a primary key. You tell Sqoop: transfer the data; I don't have a primary key, use this transaction column. For all of this, Sqoop runs a boundary query to find the minimum and maximum values in that column.
From those boundary values it decides the split ranges. If the column has largely unique, evenly spread values, the splits come out even; if not, the splits get skewed — one mapper may get hit with the bulk of the rows, say nine hundred out of a thousand, while the others get very little. The typical case I have seen is that you will have a primary key, and then you don't have to bother. But if you don't, you have to say: use this column — --split-by that column. Can I give a string column as the split column? By default, no — if you give a string column, Sqoop will say it cannot split on it; out of the box it only understands how to split numeric ranges. There is an argument for text columns, and I'll show you what to do when we get there. Okay — so that is import. Export works much the same way — you're just sending from here to there — but the difference in export is that you should already have a table in the RDBMS. Sqoop will not create it: you create an empty table, and then you say, export into that table from Hadoop. A couple of more interesting things, for those using Hive. Say I did an import, and then I run the same import again. Hadoop does not understand what a primary key is; constraints cannot be validated — HDFS has no notion of a primary key at all. So if you import, and then import the same data again, it will not say "that is a duplicate row"; it simply doesn't understand that, and you get duplicates of what is already there. Should it copy the same data again, append only new data, or pick up only updated rows? That is the question. We may not be able to cover this hands-on, but I want you to have the idea: how do you handle data that changes? This is a table; I imported the whole thing once. Next time, some values in the table have been modified, and I want to catch only those values. Are you getting the point?
Initially you pull the table, and all the data is available to you. Now changes keep happening on the RDBMS side. The next time you import, you should not pull the full data again — you only want the new or updated records, wherever that is possible. That option is available if you can track changes at the source — say there is a timestamp column you can track — then only the values modified since the last import are brought over and appended or merged; otherwise you pull the full table every time, which you cannot afford. Another problem you will run into is in exports. Say I created a table in the RDBMS and defined a primary key on it. Hive has no idea about primary keys; HDFS has no idea about primary keys. You export the data, and again it splits into four mappers. What happens? Four mappers write into your RDBMS table: the first mapper dumps its data, the second dumps its data, the third dumps its data — and when the fourth mapper inserts, there is a primary key violation, because some value in its slice is already there, and that mapper fails midway. Now you have inconsistent data: 75 percent of the rows landed, the last chunk partially failed, and it was only that one mapper that carried duplicates — but you are left with a half-loaded table. The answer is a staging table. You bring the data into a staging table first: you export into the staging table in your RDBMS, validate the row counts and the data there, and only then copy from staging into the production table; if validation fails, you throw the staging data away and production is untouched. And think about what export is doing in the first place: when you export from Hive, Hive doesn't understand keys — it's just plain files, nothing is marked as a primary key — so Sqoop picks a column positionally to split on, the export runs, and the splits may not even be equal. The mappers each get some data, but the destination has a primary key, so when you hit the destination, some records fail and some get inserted.
That is a very big problem, because you can never tell which ones went in. You export one million records; some of them got inserted, some failed, some maybe updated. So you bring in a staging table: dump into the staging table, validate it there, and then move it into the real table. Also note: you can transfer a table, but something like "join two tables and export the result" — you can express that as a query, but it will all happen at the cost of the RDBMS: the join executes on the RDBMS side and only the result gets transferred, because Sqoop itself has no engine to run a join, and the database people won't love you running heavy joins on their box. There have been enhancements over the years — more databases supported, more connectors, direct modes that were missing before — but apart from that, the basic functionality remains the same. The version we use is Sqoop 1.4.6, and the Sqoop User Guide is the bible: anything that you would ever want to do in Sqoop is in there. Just open it and it has every command you can run, with explanations. I'm suggesting it because later, when you want to do something with Sqoop, this is the best guide — the Sqoop User Guide for 1.4.6. Everything is there: how to connect with the database, selecting the data to import, how to write a query, controlling the parallelism. We will go through some of it hands-on for sure, but for anything extra later, it's there, and it's very easy once you get a feel of it; I don't think you need much else up front. Now let me check how much load the lab can take. In the file that I shared with you, the first Sqoop command — the one you have to learn in any Sqoop project — is how to connect with the RDBMS. Think about it: you are trying to pull data from the RDBMS, so before anything else you should be able to connect to it.
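Going back to the staging-table idea for a moment before the hands-on part: --staging-table and --clear-staging-table are real sqoop export options, and a hedged sketch of the command looks like this. The host, credentials and paths are placeholders, and orders_stage is a table you would pre-create in MySQL with the same schema as orders.

```shell
# orders_stage must already exist in MySQL with the same schema as orders;
# Sqoop loads it first, then moves the rows into orders in one transaction.
sqoop export \
  --connect jdbc:mysql://mysql.example.com:3306/retail_db \
  --username retail_user --password '********' \
  --table orders \
  --staging-table orders_stage \
  --clear-staging-table \
  --export-dir /data/sqoop_import/retail_db/orders
```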
To connect to the RDBMS — that is this command. Just copy it and paste it in your command line. It is sqoop list-databases: that tool simply asks Sqoop to list all the databases, and you pass a --connect argument pointing at where your MySQL is running, plus the username and password. It lists all the databases in MySQL. The first step you always do is make sure you can connect — without that, what is the point of attempting an import or anything else? For our practice I have set up a database called retail_db; it has some tables, and that's what we will work on. The command is running. If this command fails, either the hostname is incorrect, or you don't have permission to list, or the port isn't open — check those things. So first things first in any Sqoop project: sqoop list-databases with --connect; it's the standard opening move in every Sqoop exercise. And obviously you can also check the tables — see, the second command I ran is slightly different: sqoop list-tables, with the connection string now including the database name. If that fails, you cannot connect to the database. Two options worth knowing here. One, there is a password file: if you don't want your password visible on the command line, you can put it in a file and pass --password-file instead of typing --password in the open. Second, there is an options file: I can create a file, say options.txt, and put all of these things in it — the connection string, username, password, everything — and then every time I run, I just point at it with --options-file.
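Those connectivity checks as commands — a sketch with placeholder host and credentials; retail_db is the practice database from the class setup:

```shell
# First, prove you can reach the database server at all.
sqoop list-databases \
  --connect jdbc:mysql://mysql.example.com:3306 \
  --username retail_user --password '********'

# Then list the tables inside the practice database.
sqoop list-tables \
  --connect jdbc:mysql://mysql.example.com:3306/retail_db \
  --username retail_user --password '********'

# Options file: one option or value per line; reuse it in every command.
cat > sqoop-opts.txt <<'EOF'
--connect
jdbc:mysql://mysql.example.com:3306/retail_db
--username
retail_user
--password
********
EOF
sqoop list-tables --options-file sqoop-opts.txt
```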
I'm showing this because that is how it normally works in a real project. So — we have listed the databases and the tables; now there is something called eval. Meaning: say you want to import a table. Before you import it, you want to be sure you have permission, the table exists, and values are there. That is where you run eval. eval is like any other tool: you say sqoop eval with the connection string, and then give a query — I'm saying SELECT * FROM orders LIMIT 10. If this runs and I see the output, it means I have access to the table and can run queries against it — the data is there and everything — before I attempt the import. You run eval to be sure the query works; the output comes straight to your screen, as you can see. It is a safe practice for validating queries. There is a table called order_items; let's try to access it. I simply do a sqoop eval — you can type along with me. What am I saying? Hey Sqoop, evaluate the query I'm giving you; here is my connection string, my username and password, and here is the query. You say --query, give whatever you want, and you get the result. Very interestingly, with eval you can even create a table or insert data — you can run DDL or DML in your RDBMS from Sqoop: CREATE TABLE dummy, INSERT INTO dummy VALUES (1) — because eval just passes a SQL statement through. So eval is a very quick way to check the connection — just like hitting the database from Toad or SQL Workbench or something like that: Sqoop sends your query across for evaluation, to check that everything is working. I don't think multi-statement queries will work — you can try, or check the eval section of the documentation. Yes, here it is — the docs say the purpose of eval is to allow users to quickly run simple SQL queries against a database, with results printed to the console, and that it is provided for evaluation purposes only.
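A sketch of those eval calls; host and credentials are placeholders for the class setup:

```shell
# Sanity-check access and data before attempting an import.
sqoop eval \
  --connect jdbc:mysql://mysql.example.com:3306/retail_db \
  --username retail_user --password '********' \
  --query "SELECT * FROM order_items LIMIT 10"

# eval passes DDL/DML through as well (fine for a quick check, but the
# docs flag eval as an evaluation tool, not for production workflows).
sqoop eval \
  --connect jdbc:mysql://mysql.example.com:3306/retail_db \
  --username retail_user --password '********' \
  --query "INSERT INTO dummy VALUES (1)"
```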
It is not supposed to be used in production workflows, and I don't think you can chain multiple queries through it — it's purely for evaluation. Okay. So here you can see what I was doing with it: a sqoop eval creating a table called dummy, then inserting the value 1 into it. All of that you can do, but we don't need to dwell on it. Now let's try the Sqoop import — this is what you actually came for. When you want to import the data, the command goes like this: sqoop import, then the connection string with the database, the username and password, the table, and then something called a warehouse directory. You actually have two alternatives here: --warehouse-dir, or --target-dir. What is the difference? With --warehouse-dir, Sqoop copies the table and, inside the directory you gave, creates a folder with the same name as the table. With --target-dir, it simply dumps the data into exactly the directory you named — no table-named subfolder is created. So --warehouse-dir is very useful when you have, say, thousands of tables and want to replicate the same structure: you say "here is my warehouse", and each table lands in its own folder under it, named after the table. So let's try our first import. You can check in the RDBMS: this table in retail_db — order_items — actually has a primary key. That means the import should just work without any extra split argument. I'll try first — I don't mind you trying along; everybody runs the import. Okay, it's running... looks like it is working; let me confirm... yes, it is working. I'll explain what happened — let it run, then I'll explain. See, MapReduce is running.
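The two variants as commands — a sketch with the same placeholder host and credentials; retail_db and order_items are the class tables:

```shell
# --warehouse-dir: a subfolder named after the table is created underneath,
# so the files land in /data/sqoop_import/retail_db/order_items/.
sqoop import \
  --connect jdbc:mysql://mysql.example.com:3306/retail_db \
  --username retail_user --password '********' \
  --table order_items \
  --warehouse-dir /data/sqoop_import/retail_db

# --target-dir: the part files land exactly in the directory you name.
sqoop import \
  --connect jdbc:mysql://mysql.example.com:3306/retail_db \
  --username retail_user --password '********' \
  --table order_items \
  --target-dir /data/sqoop_import/order_items_raw
```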
There you see it: map 25, 50, 75, 100 percent — four mappers done, and it retrieved a lakh-odd records. Try the command yourself, then go to the Hadoop location and check whether the data is there; after that I'll explain what happened. By the way, there are arguments you can pass to control things like the field delimiter — comma and so on. If I go to my HDFS file browser: the folder name I had given was the sqoop-import path, and here is retail_db. Look at this — inside the import path it has created a folder called order_items. What did you ask it to do? Import with the warehouse directory pointing at retail_db — that was the folder you gave — and under it, it created a folder with your table name, order_items, and within that you have four files, one per mapper. If you look at it, it works exactly as described — very simple, really: the folder got the table name because you used --warehouse-dir, so Sqoop read the table name, created a folder of that name, and put the part files inside. There are more options; I'll show you what you can change. Now let me try something: what happens if I run the same command again? One exception will come, for sure, and I want you to see that exception. What is it? "Output directory already exists." Why does that exception come? Because of MapReduce: if you remember writing MapReduce programs, the output folder must not already exist — the job creates the folder and writes into it. If you want to get rid of this exception, there is an argument you can pass that tells Sqoop: delete the folder first, then copy — it will remove the existing files and rerun. That argument is there in the command reference. So right now we have the exception that the output directory already exists.
Now, what I want you to do: delete this folder, because we want to run the import again. Go to Hadoop and delete the import folder — everyone who got the exception just now, delete it. Yes — and now I want to show you the difference between --target-dir and --warehouse-dir. Earlier we mentioned --warehouse-dir, pointing at the folder up to retail_db. You can instead say --target-dir, where you give the full folder name yourself, and the files are created directly inside it. I did not show you last time what the command actually does, so I'll show it this time. It's running — it takes a couple of minutes. Has everyone seen the full output? What I want to explain here is what happens when you run the command. This is the Sqoop command I ran, and the first thing it does is print "Running Sqoop version 1.4.6". It then warns that setting your password on the command line is insecure — you can use -P instead if you don't want to expose it. Then comes the MySQL code generation: can you see the SQL statement it runs — a SELECT from order_items with LIMIT 1? It runs that to learn the table's schema; then it says it is writing the job files, it creates them, notes that you're importing from MySQL, and prints "Beginning import of order_items". Right here you can see the statistics: the total byte size of your data, how many splits there are — four — and the split size. It says "number of splits: 4" before submitting the MapReduce job, and then it runs the job. One more thing you can do: go to the Job Browser; this is the job, you can open it, and you'll see it has no reducers — it is a map-only job. So this all worked fine.
So there are four mappers, and you have individual logs for each of them — the ResourceManager UI actually shows these logs quite nicely now. Look at this line: "working on split". What query is an individual mapper running? It says WHERE order_item_id >= 1 AND order_item_id < some forty-three-thousand-odd value — that column is the primary key, and this mapper is fetching only roughly the first 43,000 rows; the next one takes the next 43,000, and so on. It first runs the boundary query to get the minimum and maximum, and then each mapper runs its own range query — you can see in the logs that all four mappers follow the same pattern, and then the output gets written. If I scroll down — one more thing, since some of you asked what happens if the directory already exists: one option is --delete-target-dir. If you pass this argument, it deletes the folder and then copies; previously you were getting the error. Also, in this particular example I'm giving the number of mappers as 1 — this is how you control the mapper count: instead of four I'm saying one, and if I run it I should get a single file. It is running... and this time you did not get an error, because you passed --delete-target-dir: it deleted the folder and repopulated it. Clear? Let me check in HDFS — yes, you have one file, and it stays a single file because the number of mappers is one. When we wrote MapReduce ourselves you could control this in the job; here too you set it yourself — one mapper, so one file. You can delete that again, and if you want, you can also pass an argument saying --append, which means it will append to the existing folder.
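Those two rerun strategies as commands — same placeholder connection details as before:

```shell
# Rerunnable import: wipe the old output first; one mapper -> one part file.
sqoop import \
  --connect jdbc:mysql://mysql.example.com:3306/retail_db \
  --username retail_user --password '********' \
  --table order_items \
  --target-dir /data/sqoop_import/order_items_raw \
  --delete-target-dir \
  --num-mappers 1

# Alternative: keep the folder and add new part files alongside the old
# ones by replacing --delete-target-dir with --append.
```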
This time I'm running the same command but saying --append; append will add new part files to the existing folder. I think all of us are running it now... it's taking a bit of time. Mine is stuck. Everyone else's ran, but — so far it was running fine, right? Why is this stuck?... Ah, okay, there is one mistake: in that command the password is wrong — by mistake I typed something else. So if you copied it as-is, yours may be stuck too. Yes — that is what's different; I had to change it. You can just edit it: copy the command, keep the table, the warehouse directory, the number of mappers, paste it, and change the password to the right one; otherwise, as it stands in the handout, it will stay stuck. That is why it got stuck — so make sure you run it with the corrected setting. And see, it says "appending to directory" — you can clearly see at the end that it is appending to the directory. Now, no matter which project you land in, Sqoop is a must — it is mandatory to know. People will ask you to do something with Sqoop; it's a very common requirement. And it's easy — not rocket science; even the parts you don't know, you can learn on your own, it's not difficult. So — did this run for everyone? If the folder exists and you did not pass --append or --delete-target-dir, it will say the output directory already exists; if you copy and run again it throws that error, and you have two options: either say delete the folder and it gets recreated, or say append. Let's settle one more thing for sure. See here in the documentation: --split-by. If you do not have a primary key in your table, that is when you use --split-by, and the docs even give a performance note: choose a column which is indexed. There are caveats, though.
If there are null values in the --split-by column, the corresponding records will be ignored, and the column need not be unique, but if there are duplicates there can be skew in the data. So this is the situation: you have a table with no primary key, so you use --split-by, and ideally you pick a column that is indexed, has no null values, and has unique values. First, let's just try without it. I have created a table called order_items_nopk; this table has no primary key. It's the same order_items table; I just copied it into another table. I've already done that: CREATE TABLE order_items_nopk AS SELECT *. Now what I'm doing here is a Sqoop import with username, password, and table, and I'm not passing any split argument. If I do this, it should fail, because there is no primary key. So first we will make it fail, then we will see what we can do. I'm just copy-pasting this part; you can see the table name is order_items_nopk. And very clearly it throws an error, can you see? It says: no primary key could be found for the table; please specify one with --split-by, or perform a sequential import with -m 1. Now, a very interesting point. You have a table you want to import, there is no primary key, and say you don't know a suitable column either; then you can just say -m 1. If you do this, what happens? No split. No split means no primary key is needed: one mapper does the whole import. So if I have a table with no primary key, one option is to name a column with --split-by; but what if I don't know one either? Then I say -m 1, which means only one mapper. It won't split the table at all, so there's no need for a primary key; the primary key is only useful for splitting. This will import the whole table in one go with one mapper. If it's a small table, that is fine; if it is a very large table, a single mapper will run into performance issues.
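A sketch of the single-mapper fallback for a table without a primary key; the connection details are placeholders:

```shell
# A plain import of a key-less table fails with:
#   "No primary key could be found for the table. Please specify one
#    with --split-by or perform a sequential import with '-m 1'."
# Forcing a single mapper skips the split step entirely:
sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username retail_user -P \
  --table order_items_nopk \
  --warehouse-dir /user/me/sqoop_import \
  -m 1
```

One mapper means one sequential scan and one output part file, so this is only sensible for small tables.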
The problem with one mapper on a big table is exactly that: a single mapper cannot handle that much data alone. But notice that you get an error first, and that's the important part: when you try directly, you get an error, so you have to mention a --split-by column. And how do you mention it? Very simple: sqoop import with --connect, --username, --password, --table, --warehouse-dir, and then --split-by order_item_order_id. That's the command you're passing; if I copy it, it should just run. Notice the column we used in this command: order_item_order_id. It's a unique column, so ideally this should work without any problem. That is how you use --split-by; try it. Okay, now before the next step, I want you to first look at what I'm going to do. I'm just copying this part: I'm again doing a Sqoop import, this time on the orders table in retail_db, and that table does have a primary key. But I'm still using --split-by, just to demonstrate, and I'm saying --split-by order_status. So order_status is the column I want to split on, and that's a string. Now if I run this, I would expect to end up with an error, because by default Sqoop cannot split on a string. Let's see... actually, this time it worked; in this cluster string splitting is working perfectly, so we used a string column and it split without any problem. But some Sqoop installations cannot split on a string. If you get an error when using a string column, that is when you have to use this argument; can you see it here: org.apache.sqoop.splitter.allow_text_splitter set to true. In your cluster, whoever installed Sqoop may have already set this property to true by default, so it allows strings; in some clusters, string splitting will simply not be allowed,
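The --split-by version of the import described above might look like this; connection details and paths are placeholders:

```shell
# Split the key-less table on an indexed numeric column so the
# default four mappers can share the work:
sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username retail_user -P \
  --table order_items_nopk \
  --warehouse-dir /user/me/sqoop_import \
  --split-by order_item_order_id
```

Sqoop takes MIN and MAX of the split column, divides that range among the mappers, and each mapper imports its slice.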
and when it is not allowed, you pass this argument saying, in effect, allow me to split using a string; otherwise you would start seeing errors. We don't have access to where it is set; it's not in the UI, it's in a Sqoop config file somewhere that we don't have access to, but they must have set it to true. So if your import runs fine, good; if it does not, that is when you need this property. Especially on Hortonworks clusters with default settings, this property is not enabled, which means you have to pass the argument; otherwise Sqoop will say: you're talking about a string column, I cannot split a string. There are a lot of other properties too. You also have something called a boundary query. What do I mean by boundary query? If you want to pass your own query for splitting, you can. Normally Sqoop decides the best way to divide your data; what if you want to decide? Then you pass --boundary-query. So here I am saying SELECT MIN(order_item_id), MAX(order_item_id) with a WHERE order_item_id greater than some value, and based on that condition it will split. I can pass my own query for splitting; normally Sqoop splits your data into equal ranges, but here I am saying I want records only where the order_item_id is more than this value, so it won't bring the whole table, only the data where this condition applies. So if you don't want the full table, you can do that too, and one way of doing it is the boundary query. The second way is that you can pass --query instead of --boundary-query: you simply pass a query, and the result of that query is what gets imported. Both work; the boundary query came first, and then they added another argument
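Two of the variations just mentioned, sketched with placeholder connection details; the WHERE threshold in the boundary query is an arbitrary example value. Note that -D sets a generic Hadoop property, so it has to come immediately after `import`, before the Sqoop-specific flags:

```shell
# Allow splitting on a string column (order_status) on clusters where
# text splitting is disabled by default:
sqoop import \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://dbhost/retail_db \
  --username retail_user -P \
  --table orders \
  --warehouse-dir /user/me/sqoop_import \
  --split-by order_status

# Supply your own min/max query for the split boundaries, which also
# restricts the imported range to rows above the threshold:
sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username retail_user -P \
  --table order_items \
  --warehouse-dir /user/me/sqoop_import \
  --split-by order_item_id \
  --boundary-query 'SELECT MIN(order_item_id), MAX(order_item_id) FROM order_items WHERE order_item_id > 100000'
```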
called --query, and with it you write the query directly. With --boundary-query you write a regular query like the one above; but if you use --query, you have to include $CONDITIONS. It's a keyword you have to use; I will show you what it is in a moment. The example means you are importing all the records, but with the condition that order_item_id is more than some value. I'd request you to try this later. And here is something else I wanted to show you; let me scroll up. Look at this: columns and query. One thing you can do is pass an argument called --columns, which means: I want to copy the table, but only these columns should come across; I don't want the full width. Before we run this, I will just delete the earlier output. The query we had written earlier would bring all the columns; now I don't want that, I want only specific columns. So here you say: these are the columns I want, plus a warehouse directory and number of mappers two; setting the mappers is not required, but you can add it if you want. The only difference here is that you list the columns you want. So ideally the output should contain only those columns: order_item_order_id, order_item_id, and order_item_subtotal; only these three columns should come through. And it's done; if you check here, in the folder under the Sqoop import, only these columns came. We originally had seven or eight columns in the data; now you're getting only three. Later on, you can also import with your own query; have a look here. I'm saying sqoop import --connect, those parts stay the same, and then --query, and in quotes I am writing my query. This one is a two-table join query, and whatever output I get will be transferred, subject to a few conditions that are written in the documentation; the conditions are
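The column-pruned import just walked through can be sketched as follows; connection details and paths are placeholders:

```shell
# Import only three of the table's seven or eight columns.
# The column list must be comma-separated with no spaces.
sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username retail_user -P \
  --table order_items \
  --columns order_item_order_id,order_item_id,order_item_subtotal \
  --warehouse-dir /user/me/sqoop_import \
  --num-mappers 2
```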
written somewhere, yes, here: for --query, we cannot use --warehouse-dir, the query must contain the $CONDITIONS token, you cannot use --columns, and you must use --split-by. The point is, when you write a query, you're saying the output of the query should be transferred. Now what if I'm doing a join on two tables? In that case, how will Sqoop identify where the primary key is? There are two tables, and the output is something new. So whenever you pass a query, you must mention --split-by with whichever column you want to split on. Second, you must always use --target-dir. You cannot use --warehouse-dir, because a warehouse directory is where Sqoop creates a folder named after the table, and here multiple tables are involved, so which name would it use? It would get confused. So you mention the target directory, you use --split-by, and then you pass the query, and the output of that entire query gets dumped into the folder. Whenever you pass a query, $CONDITIONS is a placeholder for Sqoop: you're telling Sqoop, this is where you inject your split condition, so run the query with whatever condition you generate. If you look here, I'm doing a SELECT with a SUM, ordering by revenue, using two tables, and then I say AND $CONDITIONS. And no, $CONDITIONS is not something you define a value for; it is just part of the syntax. What Sqoop does is run this whole query with each mapper's split condition substituted in place of the token. And it need not always sit at this position; $CONDITIONS can be anywhere in the WHERE clause. I can write a query and put AND $CONDITIONS at the end. You're basically saying: take this query and apply the split condition wherever the token appears;
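A free-form query import obeying all four rules just listed might look like this; the join and aliases are illustrative, not the session's exact query, and the single quotes keep the shell from expanding $CONDITIONS:

```shell
sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username retail_user -P \
  --query 'SELECT o.order_id, SUM(oi.order_item_subtotal) AS revenue
           FROM orders o JOIN order_items oi
             ON o.order_id = oi.order_item_order_id
           WHERE $CONDITIONS
           GROUP BY o.order_id' \
  --split-by order_id \
  --target-dir /user/me/order_revenue
```

Note --target-dir instead of --warehouse-dir, no --columns, an explicit --split-by, and the mandatory $CONDITIONS token that Sqoop rewrites into each mapper's range predicate.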
Sqoop applies the condition and then imports. This $CONDITIONS argument has been there from the beginning of Sqoop; why they came up with such a name, and such dollar syntax and all, I don't know, but there must be a reason. And there is one more error you sometimes hit when you run this whole thing, which is why I have added something here: Sqoop will give you an error saying it cannot connect with the database, or some error which you know is not really an error, because your connection details are all correct. That is when you have to pass --driver. The driver corresponds to whichever database you're using; you're using MySQL, so the MySQL JDBC driver class has to be mentioned as a parameter; otherwise you will get unexpected errors. We have seen this in many, many situations; that is why I added it in this command here, though most of the other commands won't have it. For Sqoop to find the driver, the JDBC jar must be in Sqoop's lib folder. How do you check? In Hortonworks clusters, in the Hortonworks sandbox at least, it is under /usr/hdp, in the sqoop lib folder; that's where it is. If I go to /etc/sqoop there is only a conf directory, so that is not the place; there is a lib folder where you have to copy the JDBC driver jar, and only then will this work. You also have a way to handle null values and so on. But let's run this first, just to make sure it works. What happened, why is it throwing an error? "connect: command not found"; okay, the command wasn't complete, sorry; that's why it was throwing an error. Yes, now Sqoop is taking the output of the query, it will run, and you will get the output. You can also configure null values and the delimiter: for example, you can say that a null string, or a null non-string, should be replaced;
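The explicit-driver variant and the null-handling flags just mentioned, sketched with placeholder connection details (the `\N` replacement value is the common Hive convention, shown here as an example):

```shell
sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --driver com.mysql.jdbc.Driver \
  --username retail_user -P \
  --table order_items \
  --warehouse-dir /user/me/sqoop_import \
  --null-string '\\N' \
  --null-non-string '\\N'

# The matching JDBC jar must sit in Sqoop's lib directory, e.g. on a
# Hortonworks cluster something like:
#   /usr/hdp/<version>/sqoop/lib/mysql-connector-java-<version>.jar
```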
replaced with whatever value you give. There is also something called incremental import; I will talk about that a bit later. Next, we have a simple Hive import. If you want to import into Hive, again you will have --connect, username, password, and table, and then you say --hive-import, and then you have to give a Hive database and a Hive table. So can you check what database you have in Hive? Don't use default; you need to replace it with your own database. Do you have a database? Do you remember creating it? I have something called june28. So in a Hive import you have to give the Hive database and the Hive table it will create, and I'm just setting the number of mappers to two. Let me see if this works; it should. Can you try it yourself? Check your own Hive DB; don't copy-paste default, because everybody would be writing into the same place in that case. Now, when you do a Hive import, what actually happens is that your data first goes to HDFS, and from there Sqoop loads it into the Hive warehouse folder, or whatever warehouse location you have. It is a two-stage process; it will not dump directly into Hive. You can see it here: first it loads into HDFS, into a temporary location, and from there it loads it into /user/hive/warehouse; it will be a managed table. Same command to run, actually, and you can see that the table actually got created in Hive. I just want to verify this, so I go to Hue, open the Hive query editor, and look at my databases: june28... yes, june28, and there is an order_items table. Can you see order_items? SELECT * FROM order_items LIMIT a few rows, and I have the data there, back in Hive. But there are a lot of confusions around Hive import. The simple import is easy; however, there is an argument, --create-hive-table I think, which does not behave the way you would expect, and I'll come back to it.
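The two-stage Hive import described above can be sketched like this; the connection details are placeholders, and june28 stands in for your own Hive database:

```shell
# Sqoop stages the data in HDFS first, then loads it into the
# managed Hive table june28.order_items:
sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username retail_user -P \
  --table order_items \
  --hive-import \
  --hive-database june28 \
  --hive-table order_items \
  --num-mappers 2
```

Because of the staging step, the data lands in a temporary HDFS location before being moved under /user/hive/warehouse, which matters later when reruns fail.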
As for export, I wanted you to try it later, not now; I will tell you how. It's very simple: you again do an export, but before you export you have to create an RDBMS table, and you know how to create an RDBMS table; do that and it will export. Now, I remember when we first started using Hive import in Sqoop, it didn't work; some error kept coming, and it took us around one month to understand what was happening. This is a real-life situation. Normally you would say --hive-import, and if you say --hive-import it will import, no problem; that's what everybody does. But there is an argument called --create-hive-table; along with --hive-import you can say --create-hive-table. What do you think this will do? By the look of it, it says it will create a table. It does not do that. Somebody had written a very lengthy script; Sqoop was inside it, it was a job actually, we had Hive, Pig, everything running within it, and somebody had written a Hive import from a table with --create-hive-table. This was failing, and we didn't know why it was failing or even what this argument was; nobody had bothered to check. Then we looked into it. --create-hive-table looks like it will create a table, right? Actually, the use of this argument is that if the table already exists in Hive, the import will fail. That is what it does, yet the argument's name is create-hive-table; I still don't understand it. Even the documentation had spelling mistakes, because I remember reading it. Here it is, sqoop-create-hive-table: "the create-hive-table tool populates a Hive metastore with a definition for a table based on a table previously imported, or one planned to be imported; this effectively performs the --hive-import step of sqoop-import without running the preceding import; if data was already loaded, you can use this tool to finish the pipeline; you can also create Hive tables with this tool; data then
can be important and populated This is what the documentation stays Okay Actually if you run a creator hive hive table along with high diving board what this will do it will check if the table already exist in high Then it will fail Second problem it will fail But before failing it will cooperate Our pleasure the effects So next time you're in another era will come It will say that they're directly already exists and you don't know what is happening So this went for one month our most Because we did We we were even not knowing that it is actually coping Place the office then coping from there We're not border Initially you don't learn these things right So we're getting a Mr Directory There is no directory Right Is copping to highways and erect tree again You'd run again It'll throw it then you d leave something in a gallon and another becomes one Monday Bend like that Then some guy forgot about that This is the problem So this means that table should not exist So then we delete and created a tape lady on Then it was able to create a table and lower the data So some of the scoop arguments are very confusing And don't just copy paste in production and such a unique Amanda Okay just run it once and see if it is working But only on there was a spelling mistake I think they corrected it was exit in Start off exist Ah see if the table exits It's exists There was ah Gerard request for herto correct This That is northern Okay so if you say create higher table you've said then the job will fail If the target high table exist how can you justify this Creator table The job will fail if the table exist There is no matter right with about he'll trip and then the definition should be changed Right If somebody used this what the devil thing or get creative table Right So this this very confusing arguments are there in this thing Anyway I will leave you tow the export export I will ask you to try yourself But was one first thing It is very simple What you need to look A man is 
already here: create a table in the RDBMS, then do a sqoop export, give the source and destination, and it should be able to export. Let me know if it doesn't work and I will figure out a solution for you. Now, another data-transfer utility like Sqoop is Flume. Both of them have a lot of documentation and a lot of configuration parameters, and we don't have to be aware of everything; we'll take a couple of Flume commands and run them so you know how Flume works. So why do you need Flume? Sqoop is purely for bringing in RDBMS-style, structured data; Flume is used for streaming data. If you have any source which generates data continuously, and you want to capture it and send it to Hadoop, that is where you use Flume. Originally Flume was created by Cloudera, and the use case was that a lot of their customers had web servers. On web servers, log files keep getting created, and they wanted to capture those log files the moment they are generated and store them in Hadoop. If you visit Amazon or Flipkart and click on a product or navigate to a department, all of that information is logged, and it keeps on coming. So they needed a tool which could capture this streaming information without losing any of it and send it to Hadoop; that is how Flume originally came about. Later, people added more support, so today Flume can connect with many sources. One is your web log files, for sure; another popular one is Twitter: the common demo is to connect with Twitter and then download tweets using Flume very efficiently, and I would love for you to get that data. And here is what you need to understand: this is the architecture of Flume, and if you understand it, more or less, you can run Flume. You have to create something called a Flume agent, and a Flume agent is nothing but a set of configuration.
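Before going fully into Flume, the export step just described might look like this; the destination table name, directory, and connection details are placeholders, and the table must exist in MySQL first with a matching schema:

```shell
# 1) In MySQL: CREATE TABLE order_items_export (...matching columns...);
# 2) Then push the HDFS part files out into it:
sqoop export \
  --connect jdbc:mysql://dbhost/retail_db \
  --username retail_user -P \
  --table order_items_export \
  --export-dir /user/me/sqoop_import/order_items
```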
That configuration says what you want to collect and from where; you write all of it in a file, and that file defines the Flume agent. Once you create the agent, you say run, and once the Flume agent starts, it starts a JVM that keeps pulling the data from point A to point B. And look, it has created a folder; here is the tweet data. This looks like a text file; it has a .txt extension, but it is not a text file, it is Avro, and I want to show you that. If I download this file and go to the location... great, we got the Avro file. If I open it: this is Avro, a binary format, so you will see the Avro schema at the top. This is the schema: keys, types, names, fields like user location, user screen name, and so on. So it has a header, an embedded schema, and then these are the actual tweets we got; it pretty much looks like key-value pairs. I'm not quite sure whether we got anything interesting; in this form it is very difficult to read, but I can search for something. What were the keywords we were looking for? I don't think there's anything on Modi... wait, it is there somewhere. Can you see? "Said a subdivision of India"... that's Modi, some tweet from somebody. We were also searching for Trump; definitely there will be Trump. There, that's the tweet. And one note: Avro files have a lot of metadata; that is what you see in all these unreadable characters; it's not structured text, but the tweet itself is in there, so you can identify it. So Trump: somebody is saying he should speak about something; there's a Trump tweet. And Netflix; I'm not sure... there is something on Netflix, but it is in some other language; so some tweet about Netflix too.
Right, so now: ideally you cannot process this data as it is, so you have to read the Avro file. Avro files can be read into Spark SQL; Spark has an Avro package, the Databricks one we downloaded as an example package, if you remember. So you start spark-shell with the Avro package, and it should be able to read this format. We will just try, because I have not even tried it on this file. Normally, if you use the Cloudera source, you will get the tweets as JSON, which is much easier to read than Avro; Avro is very difficult to eyeball. But you can see the keywords were actually there; we really are getting tweets, I didn't fake it: Trump, or whatever you were searching for, was right there. So I'll just upload this file into my home folder in HDFS, this Flume data, and then start spark-shell with Spark 2. Let's see if we can read it; I have not tried it before. Something like spark.read with the Avro format; I'm not 100% sure whether we have to pre-process it first, but we will try; spark.read for Avro should work, or we may have to import something. Give me one moment, I'm just checking whether we can read it. Hmm, the system got stuck. I'll show you how to process it, but there is one thing first: I think we have to rename the file so that it has a proper .avro extension.
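The Avro-reading attempt can be sketched for Spark 2 with the Databricks spark-avro package; the package version and the HDFS path here are assumptions, not the session's exact values:

```shell
# Launch the shell with the external Avro reader on the classpath:
spark-shell --packages com.databricks:spark-avro_2.11:4.0.0

# Then, inside the shell:
#   val df = spark.read.format("com.databricks.spark.avro")
#                      .load("/user/me/flume_data/tweets.avro")
#   df.printSchema()
```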
To rename it: no, not the view options, I want to actually rename the file; I just change .txt to .avro. I'm just thinking whether this will work; it has to be Avro, so maybe it will work and maybe it will not, because I directly asked Flume to save it as .txt, and Spark may have a problem with that. Let's try. Okay, now it is throwing an error after reading: the block size is invalid, or something like that; yes, it says an error while looking for metadata. So there is a block-size aspect to how Flume saved the data, and I asked it to save as .txt, which is something I should not have done; so now, even after renaming it to .avro, I'm not able to read it. But there is a way; you can definitely read Avro in Spark. Let me show you one more example instead, with log data; this is much easier than the Flume Twitter example, and I can share it with you to try yourself. Because getting the Twitter data is one thing; the second thing is how you actually catch log files using Flume, and that's a real application. So what I have done in our cluster is install a small utility. Let me just close this window. Can you see a folder called... not here... yes: there is a folder called gen_logs. If I do an ls, there is a script called start_logs.sh; it's a log-generation utility. If I run the script, start_logs.sh, what will happen? Log files will keep getting generated. But the question is: where are they generated? There is a folder called logs; if I go into that folder and do an ls, there is a file called access.log, and if I do a tail on this file, can you see the records getting generated? Yes. So this is streaming data now, and this simulates, see:
add-to-cart events; it simulates how users browse an e-commerce site: categories, "shoes", the login page, departments, "fitness", some product page, department "footwear", categories, checkout; it keeps generating dummy data like this, so you have some data to work with at least. And this is how any e-commerce website's logs actually look: when you click on a product at Amazon, it logs like this; it will log your IP address, the date, the GET request, the URL you are hitting, the HTTP response code, which platform (Windows), which browser; all that data. So what I did now is start this utility, and I will give it to you so you can start it too; it will keep generating this log. Now, how do we catch it? That is exactly the application of Flume; it is designed to catch log files. So let it run; I don't want to disturb it, and I can do a Ctrl+C; that will not kill it, because behind the scenes it keeps running. And I have configured a Flume agent. If I go here, do you see a file called flume.conf? There it is. Look at this particular configuration; this one is easier for you to understand. My source is called datagen, the channel is called memchannel, the sink is called hdfssink. The type of the source is exec; in the last example we were using the Twitter source, and in this example the type is called exec, as defined by Apache. What an exec source allows you to do is run a command and capture its output, and this is the command it runs: tail -F on the log, so it keeps tailing the log file. That's the exec source, and whatever data it gets, it places into the memory channel; that's the memory channel. And in the HDFS sink, first I'm saying: dump the data into a folder called flume_demo. Then there are these roll properties; some of you asked me how you control this: Flume will create files on Hadoop, but when exactly it
should create a file is decided by these properties. You have something called rollInterval, rollSize, and rollCount. Here rollInterval is 120, which means 120 seconds: it waits for two minutes and then rolls a new file. The second property is rollSize; this one is, I guess, 10 MB, so if the file size reaches 10 MB, another file is created. And rollCount is 100: if 100 events have landed in the file, it rolls one more. Whichever of these thresholds is reached first wins, and you can adjust all of them. So sometimes it starts generating logs fast, reaches 100 records quickly, and immediately rolls a new file, because that's the threshold it hit first; or if the log lines are very big, maybe only 50 lines came but it crossed 10 MB or 20 MB, so it rolls on size. Using these properties you configure when files get generated. What if you ran the system for two minutes and it's still under the size limit, with only 50-odd log lines in those two minutes? Then those two conditions were never hit, and after 120 seconds a new file is created anyway; whichever threshold fires first applies. So those are the three thresholds, and you can change them right here: anything you do in Flume configuration is in this one file you edit. Everything is simple: this is the memory channel, this is the sink, and this is the location where the generator is writing, gen_logs/logs/access.log; that is where the logs are generated. It's not coming from a real website; it's just for class purposes; you can get these kinds of generator utilities, because you need some data, and from where else will you get it? It generates random data for you, so you will feel that somebody is streaming real data to you; it emits records continuously, roughly one per second by default.
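Putting the pieces of that agent together, here is a reconstruction of the flume.conf being described. The component names follow the session, but the exact paths and threshold values are assumptions; note that Flume property files do not allow comments after a value on the same line:

```properties
logagent.sources  = datagen
logagent.channels = memchannel
logagent.sinks    = hdfssink

# exec source: run a command and capture its output line by line
logagent.sources.datagen.type = exec
logagent.sources.datagen.command = tail -F /home/me/gen_logs/logs/access.log
logagent.sources.datagen.channels = memchannel

logagent.channels.memchannel.type = memory

logagent.sinks.hdfssink.type = hdfs
logagent.sinks.hdfssink.channel = memchannel
logagent.sinks.hdfssink.hdfs.path = /user/me/flume_demo
logagent.sinks.hdfssink.hdfs.fileType = DataStream
# roll a new file when the FIRST of these thresholds is reached:
# interval in seconds, size in bytes (~10 MB), count in events
logagent.sinks.hdfssink.hdfs.rollInterval = 120
logagent.sinks.hdfssink.hdfs.rollSize = 10485760
logagent.sinks.hdfssink.hdfs.rollCount = 100
```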
So a line comes every second or so, and you will not miss anything; the exec source executes continuously, and tail -F picks up exactly from where it left off last time. Alright, let's actually run this and see if it works; that's one thing you have to do. So how do you run Flume? Last time I did not show you. You say flume-ng, then agent, then the agent name; what was the name of the agent? I think I have it here; just copy-paste from here: logagent, right; it's better to do it like this. So you are saying flume-ng, that's the keyword, Flume Next Generation, then agent with the name logagent, and the configuration file is this flume.conf. If I hit Enter... yes, it says the components and sink have started. Let's see if it is creating the log file. I think it will reach 100 lines first; we have the three preconditions, right: either it crosses 10 MB, or it reaches 100 events, or it waits 120 seconds, and probably the 100 lines will be hit first; even I don't know how frequently it is generating, but you can see that it is running. Now, go to Hadoop, just go into HDFS; what was the folder name we gave? flume_demo, right. So if I look here, there will be a folder called flume_demo somewhere; refresh; flume_demo, can you see the folder? Yes, there is flume_demo, and this is the file it is generating. See, it is a .tmp file; you can see zero bytes. Why is it zero bytes? Flume writes to a temp file, and once a threshold is reached it converts it to the final file. So right now you won't see the data here; either it has to reach the 120 seconds, or the 10 MB, or whatever, and then the temp file is finally converted and pushed here. Or there is another way: I can simply stop the Flume agent, and then whatever data it has collected so far will come here. So let me just go and kill the Flume agent. Yes, I killed it.
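The run-and-watch steps above can be sketched as two commands; agent name and paths are the ones assumed earlier in this walkthrough:

```shell
# -n is the agent name from the config; -f points at the config file
flume-ng agent -n logagent -f flume.conf

# In another terminal, watch the sink directory; the open file shows
# up as a zero-byte .tmp until a roll threshold is hit or the agent
# is stopped:
hdfs dfs -ls /user/me/flume_demo
```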
And the moment I killed it, if I go back here, text files have been created in HDFS; there was a roll as well. If I open any one of them, you have the log files, the same logs of users browsing and whatever we were doing; the same data is available here. So this is how you bring log data into Hadoop. I will do one thing: I will give you this log-generation utility, and you have to copy it into your home folder; I will write instructions for how to start and stop it, and I will also give you this Flume configuration file. So first you start the log utility, change the HDFS path to your own home folder, and automatically the logs will start coming into your folder; see whether you can do it. That is the actual use case of Flume: normally people demo Twitter, but rather than Twitter, Flume is most commonly used for exactly this. If the data is structured, Spark can read it directly; if you want to parse these logs, you'll need a regex. Now, once you have your Twitter credentials, what I want you to do is open your cloud lab web console, because we have to run Flume from the web console. In the files shared with you there is a folder for Sqoop and Flume, and inside that there is a file called flume.conf; can you see it? If you open it, I have left the keys out; can you see this: consumer key, consumer secret, access token, access token secret. Here you have to copy-paste your own credentials; don't use anyone else's. Ideally we would have a real website to tap, but we don't, so we'll simulate an e-commerce site's web log and pull it into Flume; and for the Twitter part, copy-paste your consumer key, consumer secret, access token, and access token secret here, so that it looks like this one on my cloud lab. Yes, so paste your consumer key and consumer secret; don't use mine, nor this one. Then upload that flume.conf file to your Linux machine over FTP, because it has to be in the lab. And you need to make one more change; I'll
tell you what the change is, and I will explain what this configuration means. First make sure yours is copied over, then I'll explain what it is. One more thing: there is an HDFS path here — can you see it? You have to change this to your own user, because right now it points to my faculty user directory. So after "user", change it to your user ID, and then any folder that you want — this is where the tweets will land. You don't have that value in your file yet, I think, right? This value is not there in your file, is it? I'm just thinking — I'll be using the same file. One small confusion: I think I copy pasted it twice. So what I will do — I have this file, right, this is my file. Let me just save it. Then: hdfs dfs -put flume.conf dl_data. So I'm just uploading this flume.conf into the dl_data folder for you to download, because that's my configuration; then I will tell you where you need to change it. Got it? Just go to Hue. If you go to dl_data, you will have the Flume configuration. You see dl_data here — yeah, this file. Now download this file onto the Linux machine — not onto Hadoop; this copy is in Hadoop, right, and you need to get it into Linux. So you say hdfs dfs -get, right, from your home directory. That's what you should do: you are in your home directory, and you say hdfs dfs -get dl_data/flume.conf. Just run this command and it will get the file into your home directory. But don't run Flume yet — we have to make some changes first. So I will explain the configuration, then you can make the changes and then run the thing. For now, all you need to do is this one thing: have you logged on to the web console? Then this is the command — just run this command, that's all. This is my command: log on to the web console and just run it.
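Written out cleanly, the two HDFS file-transfer commands in play are as follows (the `dl_data` folder name is as heard in the recording; substitute your own paths):

```shell
# Instructor side: upload the configuration file into HDFS
hdfs dfs -put flume.conf dl_data/

# Student side, from your Linux home directory: download it from HDFS
hdfs dfs -get dl_data/flume.conf .
```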
You will get this flume.conf file — make sure you have it. Yes — and then change the keys. Change the keys; you know how to save in the editor, right, and just copy paste. So, are you guys able to get the file? I think yes, right. If you got the file, do vi flume.conf and you should see it like this. Now I will briefly explain the configuration, and then I will let you edit it. Okay, so when you configure a Flume agent, you have to give a name for the agent, and that is the first thing — TwitterAgent here, but it can be anything. So all the lines are starting with this TwitterAgent, right. The name has to be the same — whatever name you are giving. And first I am saying: sources is Twitter, channels is MemChannel, and the sink is HDFS. These are again just names I am giving. I am saying that I want to create a source, and the source name is Twitter, okay. And then you see sources.Twitter.type — this is where you mention what type of a source you want to create. Here "Twitter" is just a name — it can be anything, so that doesn't matter — but I am saying the type of my source is org.apache.flume.source.twitter.TwitterSource. That means I want to connect to the Twitter source. And yeah, as you can see, even the names are user-defined; these three lines are just specifying them — this is how you define a Flume configuration file, everything is user-defined, all right. And then what do you do? You say, on the Twitter source: consumer key, consumer secret, access token, access token secret — you define them, right. And you can also mention keywords. If you want, you can change these; I am just using some keywords like Netflix, big data, Trump, Modi and so on. So what is going to happen? It is going to connect with the Twitter API, and then it is going to filter your tweets — otherwise it pulls everything, so the keywords, or hashtags, further filter it. This will give you tweets matching these topics, but it is not guaranteed that we will get exactly matching tweets. The Twitter data is not properly filtered, and it is
impossible to filter it perfectly, because there are millions of tweets. So wherever it finds anything remotely related to, say, Netflix, it will give it to you — ideally, right. And we have also seen that you get lots of tweets which are not matching at all. So how they do this in production is: when you start getting tweets, you get like 10 or 20 million tweets, and then you have to process them. Then you say, okay, in these tweets, look for this hashtag, and then you extract it. So this raw bunch of downloads may not make much sense by itself — it might not exactly have a tweet from Trump or something. It will give you an overview, because the source we are using is the Apache Flume Twitter source. There is another source, Cloudera's Flume Twitter source, and some students use that. If you use that source — which is, as of now, not supported here — you will get the data in JSON format; with this one you will get it in Avro, and I will show you once we get the data. Then, next: these are the sink details. The sink is where you are storing the data — the sink is the destination. I am saying that the file type is DataStream. DataStream means it will create temporary files and then start writing into text files, okay, and the write format is Text. Then batch size, roll size and roll count — these are very important. It says the batch size is 1000, the roll size is zero, and the roll count is 10,000; roll interval is also in the list. I will explain these properties in the next example, where they will make more sense, because tweets are a difficult way to explain these properties. Then you have the memory channel: its type is memory, which means your channel is RAM. Its capacity is one GB, and the transaction capacity is 100 MB — which means it can hold one GB of data, but at a time a maximum 100-unit read or write can happen. So the source can write to the channel, or the sink can read from the channel, at 100 MBps, so to speak.
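For reference, the properties being walked through look like this in flume.conf. This is a sketch based on the widely used TwitterAgent example — your agent, source, channel and sink names can differ, the keys are placeholders — and note that for a memory channel, `capacity` and `transactionCapacity` are actually counted in events, not bytes:

```properties
# Names (all user-chosen): agent = TwitterAgent, source = Twitter,
# channel = MemChannel, sink = HDFS
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Source: Apache Flume's Twitter source, with your own API credentials
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = netflix, bigdata, trump, modi

# Sink: write text files into HDFS, rolling a new file every 10,000 events
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/<your-user>/flume_demo
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# Channel: an in-RAM buffer between source and sink (sizes are in events)
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
```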
Now, those are just random values I put in; in production you would actually look at the size of your data, decide accordingly, and size this agent's server for it. So if you start running this, it will occupy that much space in your RAM, because of this transaction capacity, right — RAM, and whoever is managing that machine, say your system admin, should know — it will reserve that much space in RAM first, and then the source can write into it, at a time 100 — sorry, it's not 100 MBps, it is 100 events. Flume has a concept of an event, and an event is the smallest unit of data you can transfer. In the case of tweets, one tweet is an event; if you are collecting log files, one line is an event. So this means the source can write 100 events at a time, or the sink can read 100 events at a time from the channel. So you have the channel capacity; 100 events can land at a time, and 100 tweets can be read out at a time — that is the capacity you mention. And then you mention that it has to be stored as .txt, blah blah blah, and the HDFS path. So this is like 100 events in a second, 100 tweets in a second — very rarely will you get 100 tweets in a second, and even if you did, you would not be able to store them unless that much throughput comes through, all right. So you decide this value based on how many events are coming and what the capacity of your machine is. The next question is: where will you run Flume? Do you run Flume inside the Hadoop cluster? Flume has nothing to do with Hadoop, to be precise — Flume is an independent project, so you don't run Flume within the Hadoop cluster. The cluster will be there; you install Flume on a separate server, because it needs a lot of resources and connectivity, and that machine will have connectivity to the Hadoop cluster. Say it is getting tweets: it will download them and then push them into HDFS, right. And this path is the location where the tweets will appear. And in the file, I want you to change those four credential values.
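The channel arithmetic just described — a buffer of `capacity` events, moved in transactions of at most `transactionCapacity` events, one tweet or one log line per event — can be sketched as a toy model (an illustration for intuition only, not Flume's real memory channel implementation):

```python
from collections import deque

class MemoryChannel:
    """Toy model of a Flume memory channel: an in-RAM buffer of events,
    written to by the source and drained by the sink in bounded batches."""

    def __init__(self, capacity, transaction_capacity):
        self.capacity = capacity                  # max events buffered in RAM
        self.txn_capacity = transaction_capacity  # max events per put/take batch
        self.buffer = deque()

    def put(self, events):
        """Source side: write one transaction of events into the channel."""
        if len(events) > self.txn_capacity:
            raise ValueError("batch exceeds transactionCapacity")
        if len(self.buffer) + len(events) > self.capacity:
            raise RuntimeError("channel full -- the sink is draining too slowly")
        self.buffer.extend(events)

    def take(self, n):
        """Sink side: read up to one transaction of events from the channel."""
        n = min(n, self.txn_capacity, len(self.buffer))
        return [self.buffer.popleft() for _ in range(n)]

channel = MemoryChannel(capacity=10000, transaction_capacity=100)
channel.put([f"tweet-{i}" for i in range(100)])  # one tweet = one event
batch = channel.take(100)
print(len(batch))  # 100
```

If the source keeps writing while the sink cannot keep up, the buffer fills to `capacity` and further puts fail — which is why the lecture says you size these values from your incoming event rate and the machine's RAM.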
In your configuration, also change this location — it is my faculty user's directory, and here it has to be your user ID. Now, relate it to your project, right — say, for example, you did a project on demonetisation, so you want to get all the tweets related to demonetisation. There should be documentation — I can give it to you; I don't have it right now, but if you search Google for, say, "flume twitter source", it will give you a list of the arguments you can pass. The most important one is keywords, like I mentioned — these are the keywords that you want. So change this, and also change the location; otherwise everybody will start writing into my home folder and you will end up in an error. Actually, yes — without a channel, the Flume agent will not start. And one very important thing, which I think I forgot. Look at this line — where is it, where is the source... here. So this source is Twitter, right. This source is connecting with the memory channel — this line: this is my Twitter source, and it is connecting with the memory channel. And the sink will also connect with the memory channel — see this line. Meaning both source and sink should talk to the channel, right. So these links — see, the name of my sink is HDFS, and that is talking to the channel, MemChannel. And who is putting data into the channel? The Twitter source. And it is very difficult to debug — many of the mistakes you make are here. Okay, you might say this is a very confusing configuration — if you just read it, it is easy to read, but sometimes in these key-value properties you type something wrong and then forget what you typed; you won't understand which property you changed or where you changed it, and it becomes very difficult to understand these properties. So, then: when should a file be created? Like, you are getting tweets — should 1,000 tweets go into a file, or 10,000 tweets? You can control that using multiple properties: either the number of events — like, when this many events come, create a file — or time — if one minute has passed, roll the file over.
So that is what we have written here. Here it doesn't matter — roll size, roll count, roll interval: I have just given some random values, but it will start creating multiple files. So you can say the roll is based either on time, or on the size of the data, or on something else that you want. All right. So if you have this file, I just want to test-run it first, before we go further into the discussion. The command is already here — can you see this file? The command is here, and all you need to do is copy paste it. I will copy paste first to ensure it is running, and then you can also run it. Very good — see, it is now getting the documents. Processed 100, 200 — see, it's downloading the data. It says establishing connection, connection established, receiving status stream, and it is creating the folder. So it is processing — I think it fetches at least 100 documents each time — and it says total documents 1000; again it will start processing. It keeps on streaming the data. So copy paste and run that, and see if you can get the data. Ctrl+C to kill it — if you don't Ctrl+C, it just keeps on running.
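The roll decision just described — start a new file when an event count, a byte size, or a time limit is hit, whichever comes first — can be sketched like this (a simplified illustration; the parameter names mirror the `hdfs.rollSize`, `hdfs.rollCount` and `hdfs.rollInterval` sink properties, where a value of 0 disables that trigger):

```python
def should_roll(bytes_written, events_written, open_seconds,
                roll_size=0, roll_count=10000, roll_interval=120):
    """Decide whether the sink should close the current .tmp file and
    finalize it. Each enabled trigger is checked independently; the
    first one that fires causes a roll."""
    if roll_size and bytes_written >= roll_size:
        return True          # file grew past the size threshold
    if roll_count and events_written >= roll_count:
        return True          # enough events written
    if roll_interval and open_seconds >= roll_interval:
        return True          # file has been open too long
    return False

# With rollSize=0 (disabled), rollCount=10000, rollInterval=120 seconds:
print(should_roll(5_000_000, 500, 30))    # False: no threshold reached yet
print(should_roll(5_000_000, 10000, 30))  # True: 10,000 events written
print(should_roll(1_000, 500, 120))       # True: file open for 120 seconds
```

This matches the zero-byte .tmp file seen earlier in the demo: until one of the enabled triggers fires (or the agent is killed), the data sits in the temporary file, and only on roll is it renamed to its final .txt name.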