Hi everyone, and welcome to this session on the data analytics full course by Intellipaat. This session is conducted by multiple experts who will teach you data analytics from the basics to an advanced level. Before we begin, make sure to subscribe and hit the bell icon so you never miss an update from us. Now, the agenda: we will begin with an introduction to data analysis, explaining what it is, why we need it, and its types. Then we will cover the life cycle of data analysis, followed by how to become a data analyst, the skills required, and the job and career prospects. After that we will learn about pandas and data analysis using pandas, then exploratory data analysis and NumPy, including a quick hands-on demo on creating a NumPy array. We will then look at the difference between a data analyst and a data scientist, and finally cover data analyst interview questions and answers. Without further delay, let's get started.

We have all heard the term data analytics somewhere; it rings a bell every time we hear it. Analytics with respect to data: what exactly are we trying to analyze? The formal definition is simply this: extracting meaningful information from raw data. Think of anything that is not yet useful to you as data. When you perform operations on that data so that it becomes useful to your organization, it becomes information, and that information is what you can actually use. The process of converting raw data into information by analyzing it is what we call data analytics. That is a rough working definition of what happens in industry. To put it more precisely, data analytics is the pursuit of extracting meaning from raw data using specialized computer systems that transform, organize, and model the data, with the goal of drawing conclusions from it. As with the earlier description, the aim is to draw conclusions, identify patterns, and make better use of raw information. Data can sometimes be used in its raw form, but not in every case. For example, say you are looking at a huge data set containing thousands of values; those are just numbers. As a data analyst you may understand them, but suppose you have to explain them to someone who does not know data analytics, say a peer or a superior in a business meeting. Presenting bare numbers will not make a good presentation. Converting that numerical data into graphs, presenting it so that everyone understands it, and showing graphically whatever insights you can generate from the data is also data analytics.
The field of data analytics is growing rapidly, and the reason is market demand. Practically every company, including today's startups, has a requirement for big data. To simplify the definition, big data means an amount of data too large to be handled by a single machine: the data is stretched across multiple nodes and arrives from many sources, so you need a way to take it in, process it, and analyze it. This has been in demand for a while, and over the last few years data analytics has boomed because it gives a straightforward answer to how all of that data can be handled. Naturally, we also need people with the skills to manipulate data and queries, translate the numbers into graphs, and produce insightful analysis; that is the central role of a data analytics professional.

The need for data analytics can be broken into three points. We already know it is on the rise; in fact, it is not that it will soon become an integral part of organizations, it is already an integral part of every organization there is. First, it is a top priority for organizations of every size, from small companies to the largest, because it supports good decision making: when you are unsure which decision to take, looking at the analytics makes the decision easier to take and better validated. Second, it helps discover new revenue. Suppose only a handful of people can make sense of your data because your market is very niche; you will still make revenue, but to reach a broader audience you need to make sure your data can be understood by that broader audience. Third, it decreases operating costs. Converting raw data into information, processing it, and producing good visualizations is a demanding task. Until a few years ago it was largely manual, but today machine learning and data analytics can automate much of it, which speeds up the work and keeps it efficient, with the data being visualized and processed effectively. Time is money for a large organization, so analytics cuts the cost of waiting at every step: waiting in the data pipeline, waiting to process, waiting to publish, waiting to analyze, and waiting to visualize; those delays are effectively reduced to nothing.
So who are data analysts? Data analysts sit at the end of the chain (the chain itself is explained in the next slide). They deliver value to the company by taking in the data that a data scientist or a data engineer hands to them, using it to answer questions, and communicating the results back; those results are then used to take good business decisions. This may include analysis of past trends, prediction of what the future looks like, and much more. The common tasks of a data analyst are data cleaning, performing analysis, and creating visualizations. Data cleaning is straightforward to describe: data starts out as raw information, and to make sense of it we pick out only the information we actually need. Efficiency is the game here; processing data we will never use is a waste of time and resources, so cleaning the data and making sure it is fit for processing is a very important first step and the spearhead of a data analyst's role. Once the data is clean, the second task is to perform good analysis on it and create good visualizations.

The four designations you see on screen are the names a data analyst is also known by: business analyst, operations analyst, business intelligence analyst, and database analyst. Coming to their tasks, the first, which we already covered, is cleaning and organizing raw, unstructured data. This is an often-overlooked part of the job until you get your hands dirty with the data yourself; whether the data is unstructured, semi-structured, or structured will not matter if the data you are processing is of no use at the end. The second task is analyzing the hidden trends in the data: making sense of something you cannot figure out up front just by looking at the data, whether that is a prediction of the future or a pattern in past trends. For instance, you might find a trend showing that your sales could reach a new side of the market over the next five or ten years; nobody told you this, but the data did. Making sense of the data in which such a trend is hidden is again a very important skill of a data analyst.
The third important task is the big-picture view, built on descriptive statistics. We know the company's past trends and how the company is doing now, but summarizing all of that and adding predictions for the next five or ten years gives a big-picture view of what happened, what is happening, and what will happen. Rather than relying on a few simple summary numbers, descriptive statistics let us pick apart each aspect of what is going on: where we are right, where we are wrong, how a weak point can be improved, how we can reach clients better, and so on. The fourth, and probably the most important, task is the creation of dashboards and visualizations. As mentioned in the introduction, we start from raw information, a lot of numbers; if you have to show those numbers in a business meeting, some people will follow them, but most will not. Taking those numbers as input, creating good visualizations, giving them a user interface, and giving your customers, peers, superiors, and meeting attendees a good experience of the data adds up to better business methodology and helps everyone take better data-driven decisions. The same goes for presenting results to clients and to internal teams.

Coming to how data moves around in a firm, there are three roles to walk through. The first person, the spearhead of the data for an organization, is the data engineer: the first to see the data, responsible for bringing it in from various sources. Think of the data you can get from Twitter if you are doing sentiment analysis, data from big data sources, data from your Hadoop nodes, data from inside and outside your own network, and so on. The data engineer handles bringing in the data and making it usable by the organization. Once the data engineer has done his part, the data moves on to the data scientist, who is responsible for turning this raw data into valid information. We said something similar about the data analyst, but the data scientist works with machine learning and deep learning: binary classification, Naive Bayes, and many other concepts a data analyst might not use, and 99 percent of the time the information that comes out is numbers.
After the data scientist has worked his magic on the data, the data analyst steps in. As we have mentioned, this person is responsible for predicting what happens in the future, that is, finding trends, and for presenting all of this information to peers, clients, superiors, and so on. That is the basic flow: the data is first seen by the data engineer, who does his job; it moves to the data scientist, who applies all the fancy algorithms; and it is then pushed to the data analyst, who makes predictions, spots trends, visualizes the data, makes it presentable for everyone to understand, and eventually analyzes it.

With that, we can look at the types of data analytics used in business today. The first type is descriptive analytics. In one phrase, descriptive analytics means summarizing data from a source so that everyone can understand it: picking out good insights from past events and keeping them ready. The question descriptive analytics answers is "what happened in my business?" Most of the time, the output of descriptive analytics is comprehensive and accurate, and the visualizations are very effective. Here is a simple scenario: an e-learning website decides to focus on a trending course based on analytics of the search volume for the course content and the revenue the course generated over the past few months. We live in a world of trends; say data science is booming right now, so any e-learning company will spot that trend, see that data science is sitting in the top tier, and act on it, because these companies are aimed at making their students' lives better. As a firm, that is exactly what we do here at Intellipaat: we take the insights and feedback from our learners for each course, combine them with what is trending and latest in the market, run analytics on all of it, and use the results to put out a better course. You can already see how descriptive analysis helps the business. The next type of data analytics is diagnostic analytics. Descriptive analytics told us what is happening to the business; diagnostic analytics asks why it is happening.
To simplify, diagnostic analytics gives you the ability to drill down to the root cause of why something is behaving the way it is. Drilling into the data to identify the problems present in it, or in the trends it shows, is the vital part of diagnostic analytics; it helps answer why an issue has occurred. One look at the analytic result of the data should be enough to understand the cause behind a problem, if one exists. A simple scenario: consider a company whose sales went down for a month; they were not doing as well as in previous months. To diagnose the problem, you might find that a number of employees were quitting their jobs and therefore not bringing in sales; the people leaving directly impacted the sales the company booked. The sales went down that month because they did not have the salespeople. Finding out why this is happening and hunting for the root cause is essentially a treasure hunt through your data, and that "why" is just as important as knowing what is happening.

The third type is predictive analytics. As the name suggests, the question here is "what will happen in the future based on past trends?" Hunting out historical patterns and using them to predict specific outcomes, with machine learning algorithms, deep learning, and other techniques, is an important niche of data analytics. Predicting future trends and market possibilities from current trends is self-explanatory, and it helps optimize business plans, because the business now has a direction to head in: you are predicting certain aspects of the future and you know which path leads to a better strategy, which gives you an edge over other businesses. A nice example is the Netflix recommendation system. It uses statistical modeling to analyze the content being watched by audiences across the world. Some TV shows are very popular in India, while others are trending in the United States, Australia, or the United Kingdom; making sure the right users in the right geographical area see the right recommendations is extremely important for Netflix as a business, and predictive analytics serves them well there: it predicts which upcoming content and shows will be watched by each class of audience.
For example, someone sitting in the United States may be very interested in a trending Indian television series; Netflix spotting that this person exists and recommending an Indian TV show to them is another good business opportunity. If they nail their recommendation system for everyone, it makes them a better business and gives users a better environment to watch in.

The next type of data analytics is prescriptive analytics. When we think of a prescription we think of a doctor and a medical appointment, and it is something similar here: the question asked is "what should be done?" Applying advanced analytical algorithms to make good recommendations and to set out strategies that help the business is the vital part of prescriptive analytics. It involves breaking complex information down into a simple set of steps, and those steps are like the prescription handed to us when we visit a doctor; they are also precautions, used to head off problems that might occur in the future. Prescriptive analytics also supports predictive analysis, because predicting outcomes eventually helps optimize the business. It typically demands artificial intelligence and big data, so the data analyst stays in close touch with the data scientist, for help with AI, and with the data engineer, for help with big data; these three roles working together make up the business, as described earlier. A common scenario is the prescriptive analytics Google Maps performs. Every day we commute between the office and home, or we want to get out of the city for a vacation as quickly as possible. Google Maps maps out the best possible route considering live traffic conditions, weather, road closures, and more; it considers your distance and traffic constraints, whether it is raining, the best route if you are walking, the best route by car, and now there is a bike mode as well, so a rider on a moped or bike can take a different route from someone in a car. Making that kind of prediction about the journey ahead and handing you a prescription that keeps you from running into problems along the way is a good example of prescriptive analytics.

With that, let's quickly look at the life cycle of data analytics. The life cycle defines the analytics process and the best practices to follow, from the discovery of the data and the project to the project's completion. There are several steps in this life cycle, and the first is business understanding.
Business understanding means understanding the purpose and the requirements that come from the business, from the business's point of view; this is vital to its functioning. It also covers a good introductory plan, a decision plan, and a formal to-do list the business can follow to reach its target. So the first part of the life cycle is understanding what is going on around you. The second is data understanding. This mainly involves collecting the data and processing it in a way that leads to analysis; after the analysis is done, we extract the insights we can actually use. Extracting meaningful insights from the data is the vital step the life cycle covers under data understanding. The third part is data preparation: converting the data from an unstructured form to a structured form. This involves constructing a data set, which is then fed into a model; a machine learning algorithm trains on it, makes sense of it, and performs predictions, and the results are handed to the data analysts to be visualized with Tableau or other business intelligence tools. Data preparation is the phase where the data is actually transformed, and most of the cleansing of the data happens at this stage as well. Then comes modeling. This step matters because it involves selecting the various modeling techniques, applying them, and making sure the parameters and settings are right, so that the raw data being converted into information is within the optimal tolerance for your business use. Once the modeling is done, the techniques applied, and the parameters set, comes evaluation. The model that was built is tested very rigorously against what was defined in the initial stages, and many tests are run on the data as well. Evaluating what you have generated and testing it thoroughly is extremely important; it used to be overlooked, but these days everyone knows the value of evaluating your work. It also involves reviewing the steps needed to construct the model and to test it. The next phase is deployment, the last step in the data analytics life cycle: deploying means sending your model out into the world, which may mean to your team, your peers, or your clients and customers, moving the data from something only you use to something you share.
After deployment you can run further tests; if something is wrong, you can go back to modeling, do more evaluation, and deploy again. The thing to remember is that deployment is the final phase of the data analytics life cycle.

On that note, let's quickly look at the role of analytics in various industries around us. Data analytics has had a major impact on the telecom industry. If you have been watching the prices of calls, messages, and internet packs, they have been coming down, and there was a time when they were exorbitantly high. The telecom industry realized that if prices stay very high to maximize profit, customers eventually stop buying the internet packs, so they dropped prices to see whether it would work, and it has: it has brought the telecom companies better business, and for us as users it has made things more economical. Data analytics also has a huge impact in retail banking, where banks need to know what their customers want. Like telecom, banks have an enormous number of customers, so they need to understand each customer's requirements and find out whether the customers are on board with what the bank is offering or are pushing back against something the bank has rolled out. In the e-commerce industry, analytics drives everything from ad recommendations to the very large sales run by the big names such as Amazon, Flipkart, and Myntra; these companies have to perform extremely heavy analytics on the data they collect, based on the products they sell, the cities in which the products sell, and whether people are unhappy with the price. Analytics has genuinely changed the e-commerce industry. Finally, the industry where data analytics has arguably had the most impact is healthcare: finding out what medication is required in which countries, what amount of medication works for which population, and much more. We could discuss the role of analytics in these industries for days; the same is true of insurance, for example:
which geographical locations require insurance, who the audience is, and so on. As I said, we could keep going, but to stay within the scope of this tutorial let's quickly run through each of these. In the healthcare industry, analytics is used, first, to analyze disease patterns and disease outbreaks: we had outbreaks such as Ebola and H1N1, and keeping track of them improves health surveillance and helps emergency services respond. Then there is the development of better-targeted preventive techniques and of vaccines, and making sure those vaccines actually reach the customers, or in this case the patients. Identifying the consumers, here the patients, who are at the greatest risk also matters, because they may be heading toward adverse health outcomes; welfare programs can then track their health daily, weekly, or monthly and run analytics on every parameter being tracked. Lastly, analytics helps reduce readmissions: if you know what caused an adverse effect, and there are ten patients with very similar symptoms, analysis can filter down to the one common symptom they share, which may be the cause, and that can be mapped for every patient.

Coming to the telecom industry again, telecom companies use predictive analysis to gain the insights they need to make better, faster, and more effective decisions, as in the internet pack example. By learning more about customers, their daily preferences, and their needs, these companies can be more successful in an extremely competitive industry; it is good for their business and good for us as customers, since it brings prices down. Analytics there is used for analytical customer relationship management, fraud reduction, bad debt reduction, price optimization, call center optimization, and more. Looking at data analytics this way, you realize it has a big say in any of these business models. Coming to banking, as already mentioned, analytics is making banks smarter day by day. It helps manage the plethora of challenges a bank faces, from basic reporting all the way to descriptive analytics, which is a must for every single bank.
Banks are even starting on advanced prescriptive analytics, and they have realized that this generates very good insights and produces a strong business impact. So how is data analytics helping the banking industry? It is used to acquire and retain customers, to detect fraud, which is extremely important in banking, to improve risk control, to find new sources of growth for the bank, and to optimize product and portfolio models. Moving from banking to the e-commerce industry: this market has been exploding for the last decade. eBay came up, Flipkart came up, Amazon is taking over everything, and there are Myntra, Nykaa, and many other e-commerce portals today, so performing good analytics on them is vital. Data analytics in e-commerce is used to improve user experience, enhance customer engagement, customize offers and promotions, maintain an effective supply chain, optimize pricing models, minimize the risk of fraud, serve advertisements that help people find products, and power recommendation systems that suggest the next product after the first. If you have just bought an iPhone, you will be recommended a few cases that other people bought along with their iPhones; you might like one and pick it up, so you get a case to protect your phone and the business makes more money out of it. The role of analytics in e-commerce is vital, and people in that industry have known it for a while. In the insurance industry, analytics is likewise used to enhance customer engagement, acquire new customers, retain existing customers and keep them from leaving, prevent and reduce fraud, and prioritize claims; there is medical insurance, health insurance, life insurance, and more, all of which directly affect the user, so taking user feedback, acting on it, and building analytics on top of it is very important.

On that note, we come to our case study, a very famous one: the house prices case study, where all we will be doing is predicting house prices. We will have a set of data which we will use to predict the prices of houses. How can you predict the price of a house? There is a lot to consider: the locality, the amenities, the number of bedrooms, the living space, the number of floors, the number of cars that fit in the garage and the size of the garage, the quality of construction, whether it has a swimming pool or a spa, and so much more.
Listing every factor for this use case would be extremely tough, because each of us has our own judgment about how to value a house; a house is something very personal. There is the material used to build it, the style it was built in, whether it has an elevator, how convenient it is for disabled people, and more. So we will perform exploratory data analysis on this data. Exploratory data analysis is used to find hidden trends in your data; the trends come out as numbers, but since we already know visualization we will use those numbers to visualize the data, step by step. We will also look for correlation in the data. Correlation measures how two variables are linked: how changing one variable changes the other, in other words how tightly the two hang together. The general steps for exploratory analysis are: first visualize the data, find the missing values, and look for correlations; then clean the data and check whether the issues are fixed and whether the data we have is being used fully; then build a model that visualizes the results, giving us residual diagnostics, ROC curves, charts, graphs, and tables. I do not want to overwhelm you with the use case, so to keep it simple we will just perform exploratory data analysis at this stage to find correlations in the data; as we progress with the data set, you will see how beautiful data analytics is.

Let me jump into Google Colab, which is a Jupyter notebook hosted on Google Cloud, where we can perform our data analytics on the use case I just walked you through. We need a couple of files to run the use case, but we really only need one, the training data set file, so let me upload it to Colab; note that your uploaded files are recycled as soon as the runtime is reset. The first step is to load the necessary libraries: we will use pandas to handle the data, and seaborn and matplotlib to plot and visualize it. The plotting style we will use is called "bmh", which stands for Bayesian Methods for Hackers; it is just a graph style that gives nicer-looking plots and helps when analyzing linear data sets.
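Here is a minimal sketch of that setup; the filename "train.csv" is an assumption, since the session only refers to "the training data set file":

```python
# Minimal setup sketch for the house-prices EDA described above.
# Assumption: the uploaded training file is named "train.csv".
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('bmh')                 # matplotlib's "Bayesian Methods for Hackers" style

df_train = pd.read_csv('train.csv')  # the uploaded training data set
print(df_train.shape)                # number of houses x number of features
print(df_train.head())               # a first look at the columns
```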
The next step is to load the necessary files, the important one being the training data set. In it you have the Id of each house, the subclass and the zoning of the property, the lot frontage, the lot area, the street and the alley access, the lot shape and contour, the year the house was built and the year it was remodeled, the roof, what the foundation is made of, the quality and condition of the basement, its exposure and finish type, and so on; just look at how expansive this data set is. It is extremely popular among analysts because it contains a bit of everything and you can do so much with this single data set. Let's look at the information on all the variables present. Continuing with the housing features, there is the heating and the heating quality, whether there is central air conditioning, the condition of the electricals, the square footage of the first and second floors, the low-quality finished square footage, the living area, the number of full and half bathrooms, the kitchen quality, and it goes on. Checking the non-null counts, most columns have values we can work with, but Alley does not: NaN means "not a number", and there are not enough alley details to analyze, so we will not need Alley. Similarly, there is not much fireplace quality data, pool quality is almost entirely missing, the miscellaneous features are very sparse, and fencing is sparse too, against a total of about 1,460 rows. An important part of cleaning the data is deciding how to handle this: we drop the columns that are mostly empty, so that every column we keep covers at least roughly 70 percent of the 1,460 rows and can give reasonably accurate results. Id is also dropped, since a house identifier tells us nothing about price. So Id, Alley, pool quality, fence, and the miscellaneous features are all dropped, and we will not use them in the analysis. Next, to describe the data, I hope you know the concept of a normal distribution and the details we can read off it: we can count the number of data points and compute the mean sale price and the standard deviation; the mean of the distribution sits right at the center, at roughly 180,000.
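Continuing from the setup sketch above, the cleaning and summary step could look roughly like this; the column names (Alley, PoolQC, Fence, MiscFeature, SalePrice) are assumed to follow the Kaggle-style house-prices data set the session appears to use:

```python
# Inspect missingness, drop the mostly-empty columns, and summarize the target.
df_train.info()                                                  # non-null counts per column
print(df_train.isnull().sum().sort_values(ascending=False).head(10))

# Id is just a row identifier; the other columns are missing for most houses.
df_train = df_train.drop(columns=['Id', 'Alley', 'PoolQC', 'Fence', 'MiscFeature'])

print(df_train['SalePrice'].describe())          # count, mean (~180,000), std, quartiles, max
sns.histplot(df_train['SalePrice'], kde=True)    # the skewed sale-price distribution
plt.show()
```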
The standard deviation is the spread around that mean: there are house prices that deviate from it, and the 25th, 50th, and 75th percentiles and the maximum sale price can all be read off the same summary. Even if you have never seen a normal distribution before, you can observe that the curve rises steeply and then trails off with a long tail that runs out to about 800,000. The values roughly from 400,000 up to 800,000 are outliers: they sit far away from the bulk of the distribution, they are not representative with respect to the mean, and they distort any analysis based on the mean or the standard deviation, so we remove them rather than let them skew the results. Next we restrict ourselves to the column types we can actually use; since we are working with numbers, that means the numeric data types, integers and floats. After dropping the columns we are not using, such as Id and Alley, these are the numerical values we will work with: the year built is a number (2003 is a year), overall condition and overall quality are ratings on a scale, and square footage is a number (the second floor of one house is 854 square feet), and so on. Once we confirm these are all numeric we can start the analysis, but before that we plot them, because seeing numbers is one thing and seeing graphs is quite another. So we draw histograms: for the first-floor square footage the mean sits somewhere around 500 to 1,000 square feet; there are also the second-floor square footage, the average number of bedrooms, the basement finish quality, and the garage size measured in how many cars it holds, where the majority of houses have space for two cars; and the year the garage was built, the above-ground living area, and the number of half bathrooms, mostly one. At this point you will notice a lot of values sitting at zero. We could keep the zeros if they mattered, but since we are explaining sale price, what is present matters more than what is absent; we need descriptive values for our analysis to work.
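A sketch of that step, keeping only the numeric columns and drawing one histogram per feature, continuing from the frames defined above:

```python
# Keep only the numeric features and eyeball their distributions.
df_num = df_train.select_dtypes(include=['int64', 'float64'])
print(df_num.dtypes.head())

# One small histogram per numeric feature (bedrooms, garage cars, year built, ...).
df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8)
plt.tight_layout()
plt.show()
```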
For that we build a variable called the golden features list, which will contain all of the features we associate with why the sale price is as high as it is. We compute the correlation of every feature with the sale price and look at the top ten strongly correlated values, sorted in descending order. The first is overall quality: the quality rating given in the data set matters most to how a house is priced. Then comes the living area; the number of cars the garage holds has roughly a 0.64 correlation with the price, the garage area about 0.62, the basement square footage about 0.61, and the year the house was remodeled about 0.50. Looking at a few of the linear relationships, you can see the many zero values I mentioned a moment ago: the ground-floor living area has very few zeros against sale price, but look at the basement surface area and the total basement surface area, with all those points stuck at zero, some of them for houses selling for up to around 600,000 dollars. Those points add no meaning to the linear relationship: a feature value of zero usually means the house simply does not have that feature; most houses have no pool, for example, so the pool area is zero. So we remove the zeros before correlating, which lets us find correlations we can actually work with. Re-running the correlation and sorting in descending order again, overall quality now shows a correlation of almost 0.8 with sale price, the second-floor surface area about 0.67, and so on. The resulting golden features list contains the strongly correlated values we found: year remodeled, year built, total basement square footage, number of full bathrooms, first-floor surface area, garage area, number of garage cars, second-floor surface area, living area, and overall quality, ordered from the weakest to the strongest, which is exactly what exploratory data analysis is meant to surface.
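A compact sketch of that correlation step is below; the 0.5 cut-off and the column names (SalePrice, 2ndFlrSF) are assumptions matching the Kaggle-style data set, not necessarily the exact notebook cells used in the session:

```python
# Correlate every numeric feature with SalePrice and keep the strong ones.
corr = df_num.corr()['SalePrice'].drop('SalePrice').sort_values(ascending=False)
print(corr.head(10))                                   # OverallQual, GrLivArea, ...

golden_features_list = corr[abs(corr) > 0.5].index.tolist()
print(golden_features_list)

# For an individual feature, drop the zero rows ("house lacks the feature")
# before judging how strongly it tracks the price.
feature = '2ndFlrSF'
has_feature = df_num[feature] > 0
print(df_num.loc[has_feature, [feature, 'SalePrice']].corr().iloc[0, 1])
```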
Just by looking at the raw data set you could never figure out why the list price of a house is what it is; once we break it down into simple terms like this, we find that around 80 percent of the story is the overall quality of the house the buyer is considering. If the quality is low, the buyer will not pick the house; if it is high, they will, and that is the biggest single reason the price is set the way it is. That is why the house price case study matters: it is a very nice data set to work with and there is a great deal of analytics you can do with it.

Next I want to discuss the skills of a data analyst. A data analyst is a person who works with data and creates a lot of beautiful visuals. To do this you need the basics of mathematics, because mathematics is the foundation for statistics; once you have the foundation you build a step-by-step career path for yourself, where learning becomes the most important thing, and knowledge of statistics lets you work with many numbers at once. Setting the math aside for a second, you also need an understanding of languages such as Python and the R programming language. At this point my goal is not to overwhelm you: if you do not know statistics, Python, R, or any of the technologies on screen, fear not; stick with me to the end of the video and I will lay out a fast-track path to becoming a professional data analyst. The next point is data wrangling, which is knowing how to work with data: throwing away what is not needed and keeping the important parts you can work with, because if you work with the wrong data your visualizations will be bad. To bring back the food example, too little salt is as bad as too much; or think of preparing noodles: you use the noodles, not the packet, and boiling the packet along with the noodles makes no sense. That is the role data wrangling plays in data cleansing and data processing: you need to understand your data before you work with it. You will also need a little knowledge of big data concepts such as Spark, and tools from that ecosystem such as Pig and Hive. To break it down further, let's talk about each skill. First, analytical skills: data analysts work with large amounts of data all the time, and that data can include facts, numbers, and figures; it can be structured or unstructured; it can be images, audio, video, and much more. They need to go through this data, understand it, and analyze it in order to arrive at some sort of conclusion.
That is a very important skill to have. Then there are communication skills. As I said at the start of this presentation, a data analyst has to present his or her findings to non-technical people, so being able to convey all of it well is extremely important: you will be translating data that would otherwise be meaningless to your audience into understandable documents, good-looking dashboards, and clean reports. Good communication skills ensure you can turn a complex idea into something a lot of people easily understand. That brings us to the next skill set, one of the most important: critical thinking. At the end of the day you will be looking at thousands of numbers, so you need to know where to look and what to look for: what your numbers can do for you, where the trends are, how you get to them, what your process benchmark is, and how you reach the goal you set. Why do all of this? To make sure you have a conclusion to aim at. To give an everyday example: say you go shopping for the food you cook at home; you need vegetables, seasonings, oil for some dishes, and so on, and most of us go out once or twice a month for groceries. To do that you need to understand how much you consume each month, set a target amount, and plan whether you buy weekly, daily, or on whatever schedule suits you. Deciding how you get from nowhere to there is exactly why you formulate conclusions. Communication and critical thinking work together and reinforce each other. And the last thing I want you to contemplate is that data analysis is an art: you work with numbers, graphs, and visualizations, and everything can look its best only if you know how to craft it. If you have any questions about data analysis, head to the comment section.

Now, pandas. Pandas is an open-source Python library used for manipulating one-dimensional and two-dimensional data, and the name is derived from "panel data", a common term for multi-dimensional data sets encountered in statistics and econometrics. Let's look at the types of data structures in pandas: there are one-dimensional and multi-dimensional structures. A one-dimensional data set is held in a Series object.
two-dimensional data structure is known as the pandas DataFrame; if you are working on data with more than two dimensions, pandas historically provided a panel structure for that. So let's properly understand what exactly a Series object is. A Series in pandas is a one-dimensional labeled array which is capable of holding mixed data types like integers, strings, floating point numbers and so on. Now let's understand the DataFrame: a DataFrame is a two-dimensional labeled data structure with columns which can contain data of different types. Here we see a two-dimensional DataFrame where the first and the third columns are of string type and the second column is numerical in nature. So now that we have understood what pandas is and the different data structures in pandas, let's go to Jupyter and start working with it. I'll start by importing the pandas library: I'll type import pandas as pd, where pd is just the alias I am importing pandas under. Next I'll create a Series object from a list. I will name the list data and give it the values 1, 2, 3 and 4. To create a Series object from a list, all we have to do is use the pd.Series function, pass in the list, and store the result in s1. Printing it out, we have successfully created our Series object from the list: we have the four numbers along with the index values, and the indexing starts from zero, so 0, 1, 2 and 3 are the index labels and 1, 2, 3, 4 are the actual values. Now let's see how we can change the index of the Series object. I'll copy the same call and this time use the index parameter and pass a different set of values; say I want the index values to be a, b, c and d, so I pass those in for the index parameter and store the result back into s1. Printing s1 again, we have a Series object where the index values are a, b, c and d. We can also extract individual elements using these index labels: to get the value at index c I type s1 with c inside square brackets, and similarly for index a. If I want the first two elements I slice with a colon and 2 on the right, and to get the last two elements I put -2 on the left of the colon. So we have created a Series object out of a list; next let's create a Series object out of a dictionary.
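To make this concrete, here is a minimal sketch of the same Series operations; the names data and s1 follow the walkthrough, and the exact notebook code in the video may differ slightly.

```python
import pandas as pd

# create a Series from a plain Python list
data = [1, 2, 3, 4]
s1 = pd.Series(data)
print(s1)                # default integer index 0..3

# recreate the Series with custom index labels
s1 = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(s1)

# extract individual elements by label
print(s1['c'])           # 3
print(s1['a'])           # 1

# positional slicing: first two and last two elements
print(s1[:2])            # values at a and b
print(s1[-2:])           # values at c and d
```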
Let me create my dictionary. I'll name it d1, put in curly braces and assign the key-value pairs: a maps to 1, b to 2, c to 3 and d to 4. I have successfully created my dictionary, and printing it out shows those key-value pairs. Again, to create a Series object out of this I type pd.Series, pass in the dictionary d1 and store the result in s2. Printing s2, all of the keys have been assigned as the index and all of the values are the actual values in the Series: a, b, c and d become the index labels, and the value corresponding to each key becomes the corresponding value in the Series. Now say I want to resequence the index values. I copy the same call and use the index parameter again, and instead of a, b, c, d I give the sequence d, c, b, a. Running this and printing s2, we have reversed the sequence of the indices: initially it was a, b, c, d and now it is d, c, b, a. So this is how we can create a Series object out of a dictionary and also change the sequence of its indices. All right, we have worked with Series; now let's see how to create a DataFrame out of a list. I'll write a comment saying creating a DataFrame from a list, then create my list data with the values 1, 2, 3 and 4. To create the DataFrame the syntax is pd.DataFrame, and keep in mind that the D and the F are capital; I pass in the list and store the result in df. Printing it out, we have successfully created a DataFrame out of the list. Now let's also create a DataFrame out of a dictionary. I'll name the dictionary fruit; the first key is fruits and its values are apple, mango, banana and guava, and the second key is count, holding the counts of the fruits, say 10 apples, 20 mangoes, 40 bananas and 30 guavas. I print the dictionary, then create a DataFrame from it with pd.DataFrame, passing in fruit, and store it in fruit_df. Printing fruit_df, the two keys have turned into the column names, fruits is one column and count is the other, and the values become the rows: apple, mango, banana and guava are the row values under fruits, and the corresponding counts sit under count. So this is how we can create a DataFrame out of a dictionary.
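Here is a small sketch of those dictionary-based examples, reusing the names d1, s2, fruit and fruit_df from the walkthrough.

```python
import pandas as pd

# Series from a dictionary: the keys become the index
d1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
s2 = pd.Series(d1)
print(s2)

# resequence the index by passing the index parameter explicitly
s2 = pd.Series(d1, index=['d', 'c', 'b', 'a'])
print(s2)

# DataFrame from a list
data = [1, 2, 3, 4]
df = pd.DataFrame(data)
print(df)

# DataFrame from a dictionary: the keys become the column names
fruit = {'fruits': ['apple', 'mango', 'banana', 'guava'],
         'count':  [10, 20, 40, 30]}
fruit_df = pd.DataFrame(fruit)
print(fruit_df)
```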
Now that we have understood the basics of Series and DataFrames, let's see how to import a data set and do some data manipulation on top of it. I have this customer churn data set with me, so let me import it. To import it I type pd.read_csv and pass in the name of the file, which is customerchurn.csv, and I store the result in a new object named customer_churn. Now let me print the first five rows with customer_churn.head(). This is our customer churn DataFrame and it comprises columns such as customer id, gender, senior citizen, partner and so on. head is a handy function in pandas which gives you the first five rows; similarly, if you want to glance at the first 10 rows you just pass the value 10 to head. An analogous function to head is tail, which gives you the last few rows: customer_churn.tail() returns the last five rows, and tail(10) returns the last ten rows of the customer churn DataFrame. Now let's see how we can extract specific rows and columns from a pandas DataFrame. For this we have the loc and iloc functions; let's start working with iloc. Say I want to extract only the rows from row number 5 up to row number 15 and only the columns from column number 2 to column number 4. I start with the name of the DataFrame, customer_churn, then use iloc with a comma inside the square brackets: whatever is on the left side of the comma denotes the rows and whatever is on the right side denotes the columns. So to select the rows I put 5, a colon and 15 on the left side, which extracts the rows starting from row number 5 (the end of the slice is excluded, so it goes up to row number 14).
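As a quick sketch of loading and inspecting the file (assuming customerchurn.csv sits in the working directory):

```python
import pandas as pd

# load the churn data set; the file name follows the one used in the walkthrough
customer_churn = pd.read_csv('customerchurn.csv')

print(customer_churn.head())     # first five rows
print(customer_churn.head(10))   # first ten rows
print(customer_churn.tail())     # last five rows
print(customer_churn.tail(10))   # last ten rows
```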
Similarly, if I want the columns starting from column number 2 up to (but not including) column number 5, I put 2, a colon and 5 on the right side of the comma. Let's have a glance at the result: since indexing in Python starts from zero, column number 2 is senior citizen, so columns 2, 3 and 4 are senior citizen, partner and dependents. We have extracted column numbers 2, 3 and 4, and the rows starting from row number 5 to row number 14, because the end of the slice is excluded. This is how we can extract specific rows and columns from a DataFrame. Now let's see how we can perform some data manipulation. Say that from this entire DataFrame I want only those records where the gender of the customer is female. I start by typing the name of the DataFrame, customer_churn, then in square brackets I give the name of the column, gender, use the double equals operator, and give the condition that the gender needs to be equal to Female. Running that, we get a bunch of true and false labels: true at record number 0 means the gender of that customer is female, false means it is not, and so on, record by record. Now I cut this condition and paste it back inside the DataFrame's brackets, so that from the customer_churn DataFrame we extract only those records where the condition is true, and I store the result in female_customer. Printing female_customer.head(), we have successfully extracted a subset of the original DataFrame where the gender of the customer is female.
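A sketch of the position-based selection and the filters; the column names gender, tenure and InternetService are assumed to match the customerchurn.csv file, and the last lines preview the two-condition filter that is built next in the walkthrough.

```python
import pandas as pd

customer_churn = pd.read_csv('customerchurn.csv')

# rows 5..14 and columns 2..4 by integer position
subset = customer_churn.iloc[5:15, 2:5]
print(subset.head())

# single-condition filter: only female customers
female_customer = customer_churn[customer_churn['gender'] == 'Female']
print(female_customer.head())

# combined filter: tenure above 50 AND DSL internet service
c_tenure_internet = customer_churn[(customer_churn['tenure'] > 50) &
                                   (customer_churn['InternetService'] == 'DSL')]
print(c_tenure_internet.head())
```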
Now, similarly, let's do a slightly more complicated operation and extract only those records where the tenure of the customer is greater than 50 and the internet service of the customer is equal to DSL. I start with the first condition: the name of the DataFrame, customer_churn, then in square brackets the name of the column, tenure, and the condition that the tenure needs to be greater than 50. I cut this and put it inside parentheses, add the & operator, and give the second condition: the internet service of the customer needs to be equal to DSL, so I type InternetService double equals DSL, with the value inside quotes. Finally I put both conditions inside the DataFrame's brackets, so that from the entire customer_churn DataFrame we extract only those records where the two conditions are satisfied, and I store the result in c_tenure_internet. Printing c_tenure_internet.head(), if you glance at the tenure column you will see the tenure is greater than 50 for all of these rows, 62, 58, 72, 70 and so on, and if you glance at the internet service column all of the values are DSL. So this is how we can perform data manipulation operations on top of a pandas DataFrame, and with that we are done with the practical. Now, exploratory data analysis, or EDA for short, is the process of performing initial investigations on data so as to discover patterns, spot abnormalities or anomalies, and check assumptions with the help of summary statistics and graphical representations. Basically, when we have data on which we want to perform data science, statistical analysis or machine learning modeling, we first need to make sure we understand what the data represents: the shape of the data, the different kinds of things available in it, the different data types, and, if we can visualize it, the relationships between individual columns or features. All of that, visualization included, happens in the step called exploratory data analysis; that is what EDA is all about. EDA allows us to get a better understanding of our data and make important observations on it. To see why, consider an example. If you have a data set you have no understanding of, you do not know what data it contains, what the columns represent, how the columns relate to each other, or which columns are important for the task you are trying to solve, and in cases like this it is very difficult to know how you are going to solve the problem, build the model or perform statistical analysis. So in these situations we perform exploratory data analysis to get a better understanding of the data. It also helps us see whether the data we have needs some sort of boost. For instance, suppose we have a data set that is highly biased towards one outcome: say we are trying to predict whether a particular internship will convert into a job, and we want a predictive model into which we can feed some data about an intern and figure out whether we should offer them a position. It is a very specific use case, but if all we have is data where people did not get converted into full-time employees, our data set is completely biased, and that is something we could easily miss if we do not take the time to analyze our data using exploratory data
analysis. So if that is the problem, then we would have to get more data so that we can balance out both outcomes and teach our model about them, so it has a better understanding of which features lead to which results. Now we come to why EDA: we have understood what it is, but why exactly would you want to do it? Exploratory data analysis is one of the most crucial steps in data science. It gives us insights and statistical measures that are essential for the data scientist, and a good understanding of the EDA process allows us to make important observations and early decisions that help us avoid steps that are not needed. As we discussed in the previous example, if a data set is biased, then building models with multiple algorithms and comparing the accuracy of each one is not going to be a good idea, mainly because the data set itself is not good; if you do not have a data set that is representative of reality, your model is not going to perform well. Exploratory data analysis helps us perform as few steps as possible to get the best result, which is why it is such a crucial step in the data science process. It also helps us define and refine feature and variable selection. It could be that your data set contains 25 different columns, but for what you are trying to predict only three or four of them are enough, because they explain most of the variation in the column you are trying to predict. So which columns should you use, and are there really four, or are there five that explain it better? Exploratory data analysis is important for exactly that: it allows us to select the features that will actually be useful. If we do not use the correct features we end up with a model that is bloated, contains irrelevant data, requires users to fill in inputs that are not needed, and can produce incorrect or highly improbable results, so we need to be careful when choosing the columns on which we model our data. Now, why perform exploratory data analysis with Python? There are many other languages out there, including R; you can even perform exploratory data analysis in JavaScript, Java or C++. So why Python? Exploratory data analysis using Python is very easy. Python is probably one of the easiest languages to get started with, which is why it is so popular in the data science community: many people who perform data science or machine learning operations come from the academic world, they do not have a deep background in programming concepts like variables and classes, and they only need to write code to get the things they want done. This is where Python comes in and fills the gap, because it is so easy to use. Another advantage Python provides is that it has a number of libraries that let us perform tasks related to exploratory data analysis, and when we rely on libraries it is very important for us to make sure that the libraries
we are using are trusted and used by many people, because if it is a library that a single person has created, confidence in it is going to be really low. For example, if it is a data visualization library that very few people have used, you cannot be sure it will give you accurate visualizations, and that is always going to be a bit problematic. In Python, however, there are libraries that have been used by millions of developers and academic users: they learn the basics, import their data, visualize it and perform the tasks they need to perform, and that reliability is really important for our data sets. This is why Python is such a good candidate for exploratory data analysis: it is easy to understand, easy to read, it has a really clear syntax, and it has very good tools and libraries around exploratory data analysis and data science in general, so it is a good experience both for people who are just getting started and for professionals. Python has libraries to perform all kinds of tasks relating to exploratory data analysis, such as visualization and analysis. To visualize data there is not just one library but several, depending on the features you need; there are libraries for data analysis, libraries for numerical manipulation, and libraries for many other kinds of tasks such as creating neural networks, defining their structure and visualizing them, and so on. Some of the most popular libraries for performing EDA are NumPy, pandas and matplotlib. Since I am assuming that many of you are beginners with no prior exposure to these libraries, we will cover what they do at a later point in the presentation; and if you wish to learn more about these libraries, and about data science and exploratory data analysis in general, stick with us till the end of the presentation and we will guide you to some resources that can help you. Now let's take a look at some of the advantages that exploratory data analysis provides. The first advantage is that it allows us to spot missing data: using EDA we can easily find null and missing values, so if certain columns have nulls, or a certain proportion of the data is missing, we can decide to remove that data or fill it in. Removing and filling the data is not strictly part of exploratory data analysis, but spotting the missing data definitely is. EDA also allows us to find the underlying structure of the data. On the surface we can see the number of columns, the columns we have and the data they hold, but underneath the rows, columns and tabular structure (if it is indeed tabular) we want to understand what the underlying structure is: what data types each column holds, whether they are in the correct format to be processed, what the range of values in a column is, and whether columns relate or correlate with each other in
the data set, and if so whether that is helpful or harmful for the task we are trying to perform, and so on. Then there is variable importance. Exploratory data analysis helps us figure out which variables in our data set are the most important ones, meaning which variables have the highest influence on the column we are trying to predict; that way we can drop the rest of the columns and only keep the ones whose influence is above a particular threshold. And finally, data visualization. We can sit and talk about data all day, but one of the issues exploratory data analysis solves is that when we show raw statistical measurements to people, especially people who are not statistically literate, it is really difficult for them to follow: they first have to scan through the data, work out which values are high and which are low, and then understand the implication of each. It is much easier to visualize the data and understand its characteristics that way. In this scenario exploratory data analysis is really helpful: it allows us to build visualizations and explain our data far better, even to people who may not be familiar with the data or with the statistical analysis we have performed. Now let's take a look at the libraries we discussed earlier and their role in exploratory data analysis. I would just like to mention that the three libraries I have picked are some of the most commonly used ones; it is not necessarily the case that these are the only libraries you will ever use, or that they are the best fit for every problem. There may be better alternatives, and to find them you need to look at all the options available. So I recommend that before starting any EDA process you take a look at the tasks you want to perform and the libraries that are available. These three are the most popular ones, and for good reason, so they should serve you well, especially if you are a beginner; but as you get deeper into the exploratory data analysis and data science community, you will find there are sometimes better tools for a particular task, and then it is better to use those. All right, the first one is NumPy. NumPy is a numerical manipulation library for Python. We can perform numerical manipulation in raw Python itself, but plain Python objects carry a lot of information that is not relevant for our purposes: since Python is an object-oriented programming language (it is fine if you do not know what that means), every value stores extra object metadata, and when we try to perform complex mathematical operations those properties come into play and negatively impact performance. In situations like this NumPy is quite useful: firstly it keeps only the relevant data, and secondly it has a lot of built-in functions that allow us to perform these numerical
manipulations: built-in functions for computing sine and cosine values, the fast Fourier transform and many other numerical routines that are difficult to implement by hand are already implemented in NumPy, so you do not have to worry about implementing them, only about using them when needed. Then there is pandas. If you have done data science with Python for any period of time, pandas is a library you must have come across. pandas was created for data analysis, data summarization and many other tasks relating to data. It lets us load data into our Python program using something called a DataFrame, a special object for representing tabular data in pandas, and it gives us a lot of benefits: it allows us to perform complicated tasks on our data such as joining two DataFrames, filtering a DataFrame, manipulating data, dropping data and many other operations that would be very difficult to code by hand. It is a layer of abstraction over these complicated tasks, and it is very efficient as well. Then we come to matplotlib. It is a visualization library that lets us create charts and graphs, and many other visualization libraries, such as seaborn, use matplotlib underneath the hood. So if you wish to become a data scientist and perform exploratory data analysis by visualizing data, matplotlib is a library you will probably need. You can use other libraries too, but matplotlib is a good one to start with because it is easy and it exposes you to some of the fundamental concepts you need when dealing with visualizations, graphs and charts in Python. All right, so we have reached the demo section. For the demo I have already written the code, so that you do not have to follow along and watch the errors I make; I will just explain it. First I import the libraries I need. In the current scenario I did not need to do any raw numerical manipulation because the data is already in usable shape, so I am using pandas, seaborn and matplotlib; seaborn and matplotlib are two different visualization libraries, where seaborn is great for creating summarizations of a data set and matplotlib is great for visualizing individual columns. The next bit of code (let me zoom in so you can see it) tells the Jupyter notebook to render matplotlib output inline: matplotlib can work with multiple backends, and since we are using it with a Jupyter notebook, we tell it that whatever visualizations it produces should be optimized to be rendered inside the notebook. Many people say this is no longer necessary; it used to be, and I like to keep it in just in case. Then I import the data set, and to give you some context, this is a data set I downloaded from Kaggle. I recommend you download it yourself afterwards and play with it; this data set and many others are available on Kaggle for you to perform exploratory data analysis with. You can even do the analysis on Kaggle itself and share the exploratory data analysis notebook you create with other people; Kaggle is a great platform for that.
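A minimal sketch of that setup, assuming the Kaggle House Prices training file has been saved as train.csv (the file name is my assumption, the video never states it):

```python
# notebook-style setup for the EDA demo
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline              # Jupyter-only magic: render figures inside the notebook

df = pd.read_csv('train.csv')   # assumed file name for the Kaggle House Prices data
print(df.shape)                 # (rows, columns), e.g. (1460, 81) for this data set
df.head()                       # first five rows
df.info()                       # column names, non-null counts and data types
```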
Now we take a look at the first five rows of the data. As you can see, the data has 81 columns, which is a lot; even pandas refuses to print all of them because it knows it cannot sensibly show everything. Next we look at the info of the data, which describes the data set and gives information about each column. We have 1460 non-null data points of type int64 in the Id column, and you can scan through the rest: in the Alley column we only have 91 non-null values, which means the majority of that column is missing, and we will deal with that in a moment. The counts are quite uneven, sometimes 1460, sometimes 1452, so we are dealing with a data set that can have a lot of null values; the PoolQC column has only seven values that are not null, the rest are all missing. The data types are mixed as well: some columns are objects, which basically means strings or some other non-numerical data, some are integers, and some are floats. Floats are real numbers with decimal points, so a currency amount could be stored as a float, whereas whole numbers without decimal points are integers, so something like a phone number could be stored as an integer. I do not want you to get bogged down in the specifics of the code; what I am trying to show you is the approach. What we do here is keep only the columns in which at least 30 percent of the values are present: any column where less than 30 percent of the 1460 rows are filled in, we drop, because keeping it could have negative implications. Next we also delete the Id column from our data set, mainly because the Id is just something generated by the database the data was stored in, or assigned by the people who collected the data. It matters for storing the data, but for analysis the Id is just a number assigned to each record and has no significant value. Now we print the list of columns that were dropped: as you can see, Alley, PoolQC, Fence and MiscFeature are the columns that were dropped automatically, and Id is the one we dropped manually even though it contained a lot of data. Now we try to do two things. First, we take a look at the SalePrice column, which we are very interested in because that is the column we are trying to model our whole analysis around.
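Here is a sketch of the missing-value clean-up just described; the 30 percent threshold and the dropped columns follow the narration, but the exact code in the video may differ.

```python
import pandas as pd

df = pd.read_csv('train.csv')                     # assumed file name

# keep only columns where at least 30% of the rows have a value
min_non_null = int(0.30 * len(df))
kept = df.dropna(axis=1, thresh=min_non_null)

dropped_cols = sorted(set(df.columns) - set(kept.columns))
print('dropped for missing data:', dropped_cols)  # Alley, Fence, MiscFeature, PoolQC

# drop the Id column manually: it is only a record identifier
df = kept.drop(columns=['Id'])
```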
So we describe the data: pandas counts the values and gives us the mean, which is around 180,921, along with the standard deviation, the minimum value, the 25th percentile, the 50th percentile (the median), the 75th percentile or third quartile, and the maximum value in the column. This, by the way, is something pandas does for us; we did not have to write loops over the data to figure out the min, max and percentiles. It also tells us the name of the column, SalePrice, and that the summary values are of type float. Next we try to visualize it, and as you can see the majority of our data lies in one region. This is what is known as a histogram or distribution plot, and since seaborn lets us create it very easily, that is what we use; do not worry about the code right now, you can learn to write it later, we are currently trying to understand the process of exploratory data analysis. So even though the maximum value is around 755,000, the majority of the values lie in a much lower band. Now we take a look at the different data types available in the data set: we have int, we have O, which stands for object and basically means something that is not a number, and we have float. The 64 you see is how many bits are used to represent each value, which means the column can hold large values, and you can change it yourself if needed. Next we select only the numerical data. This is done mainly because we want to be sure we are dealing only with numeric columns; data that is not numeric would first have to be converted into a numerical format, and so on, so we are trying to estimate the task ahead of us. We take the columns of type float64 and int64, leaving out the object columns, and look at the head: it has 37 columns, which is a massive drop from the 81 we had earlier, out of which we dropped four or five because they had a lot of null values. This selection is something we do purely to understand what the data is about. Now we look at the distribution of the data using histograms: we create a histogram for every numeric column (again, we did not have to do this by hand) and you can scan through them. Some columns have a lot of data centered around a particular point, others are very spread out, and one contains only zeros, which is fine; we are not going to analyze each of them right now, this is just to show how to generate these visualizations. Then we take a look at correlations. Correlation is basically a number that tells us how much influence one column has on another, and we remove the SalePrice column itself from the list, since that is the column we are predicting.
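A sketch of the summary and distribution steps, under the same assumptions as before; the video may have used slightly different plotting calls.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')                  # assumed file name

print(df['SalePrice'].describe())              # count, mean, std, min, quartiles, max

sns.histplot(df['SalePrice'], kde=True)        # distribution of the target column
plt.show()

# keep only the numeric columns (int64 / float64), leaving object columns out
numeric_df = df.select_dtypes(include=['int64', 'float64'])
print(numeric_df.shape)

# one histogram per numeric column
numeric_df.hist(figsize=(16, 20), bins=30)
plt.show()
```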
From here we basically take all the columns whose correlation with SalePrice is greater than 0.5 and call them the golden feature list: any column with a correlation above 0.5 is treated as having a good correlation for our purposes. We take the absolute value, which means we do not care whether the correlation is negative or positive. Correlation ranges from -1 to +1: a negative correlation means that as the data in one column increases, the data in the column negatively correlated with it decreases, and a positive correlation means that as one column increases, the positively correlated column increases with it. We take those values and sort them. Looking at the correlations, the highest ones we have are overall quality and GrLivArea, at roughly 0.8 and 0.7. Next we take all the numerical columns and generate plots of SalePrice against them, five at a time, to see how the target relates to columns such as MSSubClass, LotFrontage, LotArea and so on, and we do this for every numeric column, five plots at a time. After that comes a bit of code that is a little harder to follow, so I will not walk through it line by line, but essentially we create small temporary data frames, one per feature, containing that feature and SalePrice for the rows where the feature's value is not zero, build a correlation matrix from them, and print it. Now we come to the interesting part: these are the 11 strongly correlated features we are left with. We started with 81 columns and have reduced it down to only 11 columns that have a really high influence on our SalePrice column. To visualize this we create something called a heat map (let me zoom in so you can see it; you can also control the size of the plot when you write the code, I will not do it right now). Some columns have a correlation of around 0.4, some are negative, and the cells are color coded: cooler colors such as blue or purple indicate negative correlations, and brighter colors such as yellow or green indicate positive ones. You can see, for example, a correlation of about 0.83 involving GrLivArea. This lets us understand which columns are highly correlated with each other, whether a column has influence on not just one column but several, and how that affects our purposes.
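A sketch of the correlation filtering and heat map, again assuming the train.csv file and using a 0.5 cutoff as in the narration; the exact feature engineering in the video is more involved.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')
numeric_df = df.select_dtypes(include=['int64', 'float64'])

# correlation of every numeric column with SalePrice, excluding SalePrice itself
corr_with_target = numeric_df.corr()['SalePrice'].drop('SalePrice')

# golden features: absolute correlation above 0.5, sorted
golden_features = corr_with_target[corr_with_target.abs() > 0.5].sort_values()
print(golden_features)

# heat map of the correlations among the selected columns and the target
cols = list(golden_features.index) + ['SalePrice']
plt.figure(figsize=(10, 8))
sns.heatmap(numeric_df[cols].corr(), annot=True, cmap='viridis')
plt.show()
```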
Now, Python has a handy thing called packages. These packages are available for free, and they come with a lot of built-in functions that help you write a great deal of code easily without implementing the detailed algorithms yourself, which is what makes them so helpful. NumPy is one of them: it is used for mathematical and logical operations on arrays, and it supports multi-dimensional arrays. Keep in mind that Python itself does not have a native array type; its sequences are built on lists, so when we go through NumPy indexing, slicing and so on, it will feel very much like working with lists. NumPy supports both 1-D and 2-D arrays. If you are using the Anaconda distribution of Python, NumPy will come pre-installed along with pandas. When we use a package we import it: the syntax is import, then the package name, then the as keyword to assign an alias, so you do not have to type the full name throughout the program; conventionally we write import numpy as np. The np.array function can convert a Python data structure into an array: if I pass it a Python list and a Python tuple, it converts both of them into NumPy arrays, and the outputs start and end with square brackets, which shows they have been converted into arrays. Once a NumPy array is created it is editable; it is no longer an immutable tuple, so we have effectively gone from a non-mutable type to a mutable one. Also, NumPy does not come with a bare-bones Python installation by default: if you are not using Anaconda and are working with a plain Python interface, you need to install it from the command prompt with pip install numpy (in my setup the pip path is not configured, which is why it is not running here, but that command will find and install NumPy). We can have 2-D arrays in NumPy as well: we just pass two lists, and they are combined into a 2-D array. So that is how we create arrays in NumPy. The next question is where NumPy is advantageous compared to a basic Python list or tuple. First, there is the ndarray object: as I have shown you 2-D arrays, NumPy can go up to n dimensions (in the real world we rarely use more than 2-D, or at most 3-D, arrays, but it can go further). Second, an ndarray holds items that are all of the same type: you cannot have one element be an integer and another be a string, they all have to be the same type.
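As a quick sketch of creating arrays the way it is described above:

```python
import numpy as np

a = np.array([1, 2, 3, 4])        # from a Python list
b = np.array((5, 6, 7, 8))        # from a Python tuple, also becomes an ndarray
print(a, b)

a[0] = 10                         # arrays are mutable, unlike the original tuple
print(a)

# a 2-D array: pass a list of lists
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m)
```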
Items can be accessed using a zero-based index: as you know, Python lists also start from index zero, and the same property holds here. For a 2-D NumPy array, indexing with 0 gives you the first row and indexing with 1 gives you the second row; similarly, if you want only the first column, you slice across all the rows and take position 0 of each. I will come back to the 2-D examples in a few minutes; for the 1-D arrays a and b, a[0] is 1, a[1] is 2 and so on, and if you slice with 1:2 it prints only a single element, because the right-hand index of a slice is always excluded, it goes up to but does not include that index. If you slice with just a colon it prints everything, which behaves pretty much like the lists we have in plain Python. Now, NumPy arrays are also beneficial because NumPy stores all the elements it holds in equal-sized blocks of memory; we will see a comparison in the later half of the section that shows where this is more efficient than storing values in a classical list. Next is NumPy array initialization: how do we initialize NumPy arrays with default values, rather than typing out a list of values ourselves? If you are familiar with terms like masking, which comes up a lot in image processing and data processing, you will know we often need arrays filled entirely with zeros or ones. To create a zero-filled array you write np.zeros and pass in the dimensions of the array: (3, 4) means three rows and four columns, so you pass a tuple of row count and column count, and that is how you create an array of zeros in NumPy.
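A short sketch of the indexing, slicing and zero-initialization ideas:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a[0], a[1])        # single elements: 1 and 2
print(a[1:2])            # slice: the right-hand index is excluded, so only [2]
print(a[:])              # everything

print(m[0])              # first row
print(m[1])              # second row
print(m[:, 0])           # first column (all rows, position 0)

z = np.zeros((3, 4))     # 3 rows x 4 columns of zeros
print(z)
```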
Next, if you want an array with a regular interval between the values, you use np.arange. We pass three arguments: the first is the starting number, the second is the ending number, and the last is the interval we want between the values. So from 10 to 25 with a step of 5, the difference between consecutive values will be 5, and, always remember, the right-hand end is excluded, so 25 will not be included and we get three numbers: 10, 15 and 20. That is how this kind of array gets generated in NumPy. Now that arange is done, say we instead want to spread some points along a straight line: think of a 2-D x-y plane, and we want a certain number of points between a value a and a value b. With arange we would have to calculate the interval ourselves from the number of points we want and then pass it in, but we would rather specify it the other way around, and for that we have linspace. It looks almost the same as arange but differs in one way: arange takes a start number, an end number and an interval, and produces the values from the start towards the end based on that interval, whereas linspace takes a start number, an end number and the number of points, so if we ask for points between 5 and 10 split into 6 points, we get exactly 6 evenly spaced values between the start and the end. That is the difference between linspace and arange, and we will be using these functions heavily when we get to data science. Next is how to create an array filled with the same number, which is used much like zeros, for example for carving out masks. For this we use np.full: you pass it a tuple of the dimensions, rows then columns, and then the number to fill the array with, and it gives you an array of that shape full of that number. The dimensions can be anything; any integer numbers can be provided as the shape, and that is how the full function works.
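For reference, a minimal sketch of arange, linspace and full as just described:

```python
import numpy as np

print(np.arange(10, 25, 5))    # start, stop (excluded), step -> [10 15 20]

print(np.linspace(5, 10, 6))   # start, stop, number of points -> 6 evenly spaced values

print(np.full((3, 4), 7))      # a 3 x 4 array filled with the number 7
```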
Next is how to randomize an array, again useful for things like masking. Say we want a 3 by 4 matrix of random numbers: NumPy's random functions will give you random values organized in whatever matrix shape you pass, so the dimensions you give determine how the values are arranged. Now let's look at how we inspect NumPy arrays. Say I create a 2-D array and check its shape: the shape attribute returns a tuple of the row count and the column count, so a.shape returns (rows, columns). What is the use of shape? It is reported as a tuple, but we can also assign new values to it: by setting a.shape we can change the shape of the matrix, and we can read individual elements of the shape tuple just like we access a normal tuple. So with a.shape we can get the tuple of row and column counts, change the shape of the matrix, and access individual entries. Say you want to know how many rows are in a data set so you can loop through all of them with a for loop rather than using map: you use a.shape[0], which gives you the number of rows, while a.shape[1] gives you the number of columns. A question came up here: yes, index 0 of the shape tuple returns the number of rows and index 1 returns the number of columns, and for a 1-D array the shape tuple has only one element, so it cannot have two indexes. There are several more examples of the shape function, but the key thing to remember about reshaping is that the total number of elements has to match: if the array has six elements you cannot reshape it into 3 by 3, because the product of the new dimensions has to equal six. That is how reshaping works, and that is what the shape function is about.
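A sketch of random arrays, shape and reshape along those lines:

```python
import numpy as np

r = np.random.rand(3, 4)         # a 3 x 4 matrix of random values in [0, 1)
print(r)

a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(a.shape)                   # (2, 3): rows, columns
print(a.shape[0], a.shape[1])    # 2 rows, 3 columns

a.shape = (3, 2)                 # reshape in place: 6 elements fit a 3 x 2 array
print(a)

# a.shape = (3, 3) would fail: 6 elements cannot fill a 3 x 3 array
print(a.reshape(6))              # reshape can also flatten back to 1-D
```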
Next is how to get the total number of elements. Say I create an array with np.arange(24) (this is another syntax of arange, where you pass a single number and it places the values from zero up to, but not including, that number) and check its size: it gives 24, because there are 24 elements in the array. That is what the size attribute is for: the total number of elements in an array. Next is the dimension of an array. This is in the same family as shape, but it is a little different: shape returns the exact shape of the array, the number of rows and columns, whereas ndim returns the number of dimensions. If we check ndim on our 3 by 2 array it returns 2, meaning it has two dimensions in total; so remember, ndim is not the shape, it is the count of dimensions. Next is dtype. dtype gives you the type of the elements in the array; a NumPy array cannot be heterogeneous, it always has to be homogeneous, so dtype describes the single type shared by all the elements. Here we have 24 integer values, which is why the dtype comes back as int32 on this machine. Any doubts up to dtype? Now let me show you one more thing: if I create an array with floating point numbers instead, you will see that the dtype is printed as float64.
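A quick sketch of size, ndim and dtype:

```python
import numpy as np

a = np.arange(24)                # 0 .. 23
print(a.size)                    # 24 elements in total

m = np.array([[1, 2], [3, 4], [5, 6]])
print(m.shape)                   # (3, 2)
print(m.ndim)                    # 2: the number of dimensions, not the shape

print(a.dtype)                   # an integer dtype, e.g. int32 or int64 depending on the platform
f = np.array([1.0, 2.5, 3.7])
print(f.dtype)                   # float64
```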
That is how the dtype attribute works. And numpy, as the name suggests, is built for numerical work and linear algebra, so in practice you will not really be keeping string data in numpy arrays; for linear-algebraic operations we need integers and floats, and that is what we will use. Next is numpy maths. We can do summation, division, multiplication and all that kind of thing with numpy. Let's get started with a single array; this is how the numpy sum function works. Just a quick info guys: test your knowledge of data analytics by answering this question. Which of the following is a false statement about data analytics? a) it collects data, b) it looks for patterns, c) it does not organize data, d) it analyzes data. Comment your answer in the comment section below and subscribe to Intellipaat to know the right answer. Now let's continue with the session. np.sum can do a couple of things: it can return the sum of all the elements of one array, or it can combine two arrays, and the same family of functions gives you the matrix-style addition and subtraction logic. So if it takes the two lists [5, 10] and [2, 3] and adds everything up, you get 5 + 10 + 2 + 3 = 20. np.subtract takes two arrays, subtracts them element by element and gives you the result. np.sum also has a few extra flares: if you specify an axis parameter it does the sum row-wise or column-wise. With a = [5, 10] and b = [2, 3], axis=0 gives a column-wise sum, 5 + 2 = 7 and 10 + 3 = 13, while axis=1 gives a row-wise sum, 15 and 5. I fumbled this live for a few minutes; the catch is that the two arrays have to be wrapped together into a single sequence before the axis is given, for example np.sum((a, b), axis=0), rather than passed as separate arguments. Once that was fixed the output came out as expected, so let's move ahead with the other functions.
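Before moving on, a cleaned-up sketch of the sum and subtract demo with the same small arrays; note how both arrays go in as one tuple when an axis is used:

```python
import numpy as np

a = np.array([5, 10])
b = np.array([2, 3])

# Sum of everything: 5 + 10 + 2 + 3 = 20
print(np.sum((a, b)))            # 20

# Pass both arrays as ONE sequence, then choose the axis
print(np.sum((a, b), axis=0))    # [ 7 13]  column-wise: 5+2, 10+3
print(np.sum((a, b), axis=1))    # [15  5]  row-wise:    5+10, 2+3

# Element-wise add / subtract of two arrays
print(np.add(a, b))              # [ 7 13]
print(np.subtract(a, b))         # [3 7]
```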
Subtract we have already discussed, and division works in the same way. At first I said divide would work axis-wise, but no: divide simply takes two arrays and divides the numbers element by element, exactly like the element-wise operations you did in matrix work in class 12 maths, where the first element is combined with the first element of the other array, the second with the second, and so on. Multiplication is the same story: 5 times 2 is 10, 10 times 3 is 30. These are the basic functions we will need for the data science side. Then np.exp(a) gives you e to the power of each element, np.sqrt is the square root, np.sin gives the sine of each element, np.cos the cosine and np.log the log values. These are not used all that often in data-science work, but I am showing them quickly just so you have the basic knowledge of numpy. Next is element-wise comparison: np.equal compares every element of one array with the corresponding element of the other and returns a list of True/False values, so if you change one element so that it matches, the corresponding output flips to True as well. Along with np.equal there is another function that checks the entire array in one go: np.array_equal compares the two arrays at a broader, whole-array level and gives you a single True or False based on the comparison result. Next are the aggregate functions, which always work on a single array. We have already seen sum; whether or not you wrap the input explicitly, np.sum converts it to one array, so you can skip that safely. np.sum(a) gives you the sum of all the elements, min the minimum, max the maximum, mean the mean and std the standard deviation of the elements.
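A compact sketch of the element-wise maths, the comparisons and the aggregates just described, using the same small demo arrays:

```python
import numpy as np

a = np.array([5, 10])
b = np.array([2, 3])

# Element-wise arithmetic, like element-by-element matrix operations
print(np.multiply(a, b))     # [10 30]
print(np.divide(a, b))       # [2.5         3.33333333]

# A few maths helpers (rarely needed day to day, but good to know)
print(np.exp(b))             # e**2, e**3
print(np.sqrt(a))            # square roots
print(np.sin(a), np.cos(a), np.log(a))

# Element-wise comparison vs whole-array comparison
print(np.equal(a, b))                         # [False False]
print(np.array_equal(a, np.array([5, 10])))   # True -- one answer for the whole array

# Aggregate functions reduce a single array to one value each
print(np.sum(a), np.min(a), np.max(a), np.mean(a), np.std(a))
```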
To be precise, the mean is the average of the elements (the middle value is the median, which we will come to), and the standard deviation measures how far the elements spread out from that mean; I will connect my pad next time and show you the formula. The same pattern applies for all of these: sum of all the elements, minimum of all, maximum of all, mean of all, and so on. The correlation coefficient is a machine-learning term, so let's not worry about it now; I will tell you what it is later. Other than that, are all these things clear? What exactly mean, median, mode and standard deviation are I will keep for the next class on scipy, because there we will discuss all the statistical pieces in one go and I will give you a brief on each. I do not have my tab with me today, so I will bring it and show you the formulas and the small details; in the meantime you can go through the basic stats yourself, nothing in great detail, just the basics, and it will help you follow what I cover in the next class. With that assumption, let's move ahead to a new topic: broadcasting. What is broadcasting? What happens if we take two arrays of different dimensions? (The axis parameter does not apply to this kind of case, which is why that error showed up live.) Say we have this pair to work with: the first array is [[1, 2, 3], [4, 5, 6]], a 2 by 3 array, and the second one, [3, 4, 5], has only three numbers, so it is effectively a 1 by 3 array. Now when you add them, if you look carefully you will see that the second row of the big array is also getting added with the same numbers 3, 4, 5: the result is 4, 6, 8 in the first row and 7, 9, 11 in the second. How is this happening? This is the concept of array broadcasting in numpy: the smaller array gets stretched, repeated row by row, until it matches the exact dimensions of the first array, and only then is the element-wise addition possible.
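Here is the broadcasting example written out, with the same 2 by 3 array and three-element row discussed above:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3)
b = np.array([3, 4, 5])        # shape (3,) -- a single row

# b is "stretched" (virtually repeated row-wise) to shape (2, 3), then added
print(a + b)
# [[ 4  6  8]
#  [ 7  9 11]]

# The same broadcasting happens for subtraction, multiplication, and so on
print(a - b)
# [[-2 -2 -2]
#  [ 1  1  1]]
```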
One thing to note is that broadcasting is not only for array addition; it works for every kind of operation between two numpy arrays, so, as you have just seen, the same stretching is done for the subtraction scenario as well. That is the concept of array broadcasting: the smaller numpy array gets inflated to match the dimensions of the other array so that the operation we want to do becomes feasible. Now, indexing and slicing in Python. You have already been through this, so I will go through it a bit quickly. An index in a Python list or string is just an integer position, and it can be counted from the front with positive numbers or from the back with negative numbers. With positive indexing the slice runs from the smaller index to the larger one: for the string 'Monty Python', s[6:10] takes the characters from 'P' up to 'h', because index 10 is excluded. With negative indexing everything runs the other way around, the bigger the negative number the further left it sits, so s[-12:-7] goes from index -12 up to -8 (the right-hand end is always excluded) and prints 'Monty'. That is how indexing works on Python lists, and 6:10-style slicing is exactly that idea. Next is slicing the 2-D array a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]. a[0] prints all the elements of the first row (the first row is index 0, and the first column is index 0 too). a[:1] returns the same values, but notice what it means: in 2-D indexing the part before the comma works on rows and the part after the comma works on columns, so :1 says give me every row before row 1, which is just the first row, and because we sliced rather than indexed, it comes back as a list within a list, [[1, 2, 3]], whereas a[0] simply accesses the first row of the 2-D array and gives [1, 2, 3]. Going back to the slides: a[0, :1] keeps the first row and only the columns before index 1, so it gives just [1]; a[0, 1:] gives all the columns from index 1 to the end of that row, which is [2, 3]; and a[:, 1:] says all rows, columns from index 1 onwards, which is why it gives 2 3, 5 6, 8 9. That is how indexing and slicing work for numpy arrays; we will run through a few examples below.
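A quick sketch of those indexing and slicing examples, first on the 'Monty Python' string and then on the 3 by 3 array:

```python
import numpy as np

s = "Monty Python"
print(s[6:10])      # 'Pyth'  -- index 10 is excluded
print(s[-12:-7])    # 'Monty' -- runs from -12 up to (not including) -7

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(a[0])         # [1 2 3]             -- first row
print(a[:1])        # [[1 2 3]]           -- still 2-D: all rows before row 1
print(a[0, :1])     # [1]                 -- first row, columns before index 1
print(a[0, 1:])     # [2 3]               -- first row, columns from index 1 on
print(a[:, 1:])     # [[2 3] [5 6] [8 9]] -- every row, columns from index 1 on
print(a[2:])        # [[7 8 9]]           -- rows from index 2 onwards
print(a[2:, :3])    # [[7 8 9]]           -- same rows, first three columns
```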
Just a quick info guys: Intellipaat provides an online data analytics course in partnership with IBM and Microsoft; the course link is given in the description below. Now let's continue with the session. Let me explain this one: a[2:] means we want the rows from index 2 onwards, and since the third row is the last row, only that one row gets printed; a[2:, :3] then says go up to the third column of those rows, which here is the whole row, so it returns 7 8 9 as well. Next is array manipulation, which is about stacking and joining arrays. What is the difference between array manipulation and the sum and other operations we have seen? There, we performed some operation on two arrays and output a third array of results; here we will also combine two arrays, but a bit differently: we will not add or subtract anything, we will keep the original values and simply output them arranged together. The first function is concatenate. I tried axis=1 on two 1-D arrays first and it errored out, which makes sense: 1-D arrays have no second axis, so column-wise concatenation is not possible there, but with 2-D arrays it is, so let me create those, it will be easier to understand. With a = [[1, 2, 3], [4, 5, 6]] and b = [[3, 4, 5], [5, 6, 7]], concatenating with axis=1 joins them side by side: the 1 2 3 here and the 3 4 5 there get concatenated into one new, longer row, and 4 5 6 with 5 6 7 forms the other. If we go for axis=0 instead, it is row-wise: the rows of the first matrix and the rows of the second matrix are stacked into one matrix, 1 2 3, 4 5 6, 3 4 5, 5 6 7. Let me print the matrices for you so it is easier to see: the first matrix is 1 2 3 / 4 5 6 and the second is 3 4 5 / 5 6 7; row-wise concatenation just adds the new rows at the bottom of the matrix, and column-wise concatenation joins the columns on, extending each row. That is how concatenation works.
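Here is the concatenation demo as a sketch, using the same two 2 by 3 matrices:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([[3, 4, 5],
              [5, 6, 7]])

# axis=0: the rows of b are appended below the rows of a
print(np.concatenate((a, b), axis=0))
# [[1 2 3]
#  [4 5 6]
#  [3 4 5]
#  [5 6 7]]

# axis=1: the arrays are joined side by side, row by row
print(np.concatenate((a, b), axis=1))
# [[1 2 3 3 4 5]
#  [4 5 6 5 6 7]]

# Plain 1-D arrays have no second axis, so axis=1 would raise an error
x = np.array([1, 2, 3])
y = np.array([3, 4, 5])
print(np.concatenate((x, y)))   # [1 2 3 3 4 5]
```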
Next are vstack and hstack for numpy arrays. As the names suggest, vstack is vertical stacking and hstack is horizontal stacking; stacking means much the same thing as concatenation, with one small difference that we will discuss now. Before going into those, let me show a plain stack example first, because there is another function called stack in numpy. Say I stack a and b: the inputs have to be passed as one tuple, np.stack((a, b), axis=...). For two 1-D arrays, axis=0 stacks them as the rows of a new 2-D array, and axis=1 stacks them as its columns (I printed the two inputs separately so you can see it, and yes, that was the output I was looking for). So for 1-D arrays: hstack joins the two arrays end to end into one longer row, while vstack places one on top of the other so that each input becomes a row. Now, why did my 2-D example look strange? The thing to remember is that np.stack always joins its inputs along a brand-new axis. For 1-D inputs that happily builds a 2-D result, but if you pass in two 2-D arrays it does not merge them into one bigger 2-D matrix; it stacks them into a 3-D array, each input kept whole as its own layer, which is why the output looked like the arrays had just been piled on top of each other. If you want to join arrays of two or more dimensions into one bigger matrix, that is what concatenate, and hstack and vstack, are for. On the contrary, if you look at the hstack call I wrote at the bottom of the screen, on the last line, it took in two 2-D arrays and stacked them properly side by side. That is the difference between plain stack and hstack/vstack. Now, how does concat actually differ from stacking? Let me write it out cleanly, give me a few minutes: print the outputs of concatenate and stack next to each other and compare them.
For hstack, that is horizontal stacking: 1 2 3 and 2 3 4 get joined into the same row, 4 5 6 and 5 6 7 get joined into the next, and the result is a matrix with a different shape, 2 by 6 instead of 2 by 3. If you compare the outputs you will see that np.hstack((a, b)) gives exactly the same thing as np.concatenate((a, b), axis=1), and np.vstack((a, b)) gives exactly the same thing as np.concatenate((a, b), axis=0). In other words, hstack grows the matrix sideways, extending each existing row with the corresponding row of the other array, while vstack grows it downwards, appending whole rows at the bottom of the older matrix with the columns kept aligned, so 2 5, 3 6, 4 7 end up underneath the corresponding columns of the first matrix. For 2-D arrays the horizontal and vertical names of hstack and vstack line up with axis=1 and axis=0 of concatenate, and once you see that, these four operations stop being confusing. Is that clear to everybody, or do I need to explain a bit more? Seeing a mixed response, so let's go over it once more. Horizontal stacking had two arrays to work with, a and b. What did a look like and what did b look like? Let me write them down: 1 2 3 / 4 5 6 and 2 3 4 / 5 6 7, those are the rows. When we stack horizontally, the two first rows, 1 2 3 and 2 3 4, are merged into a single row, and 4 5 6 and 5 6 7 are merged into another single row, and that resulting matrix is returned; it is the same result as concatenating along axis=1. Concatenating along axis=0 instead joins the rows onto the whole matrix: 2 3 4 and 5 6 7 are appended as new rows at the bottom rather than extending the existing rows, which is exactly what vstack gives you. So for hstack and vstack the joining happens along the direction named in the function, and concatenate does the same job once you give it the matching axis; the odd one out is plain np.stack, which always creates a new axis rather than extending an existing one, and so does not work the way you might expect for 2-D inputs. Is this clear now? Next is column_stack, which is much the same thing: column_stack takes 1-D arrays and makes them the columns of a 2-D array, and for 2-D inputs it matches hstack, that is, concatenation along axis=1. That is what column stacking is about.
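To make the stack, hstack, vstack and column_stack comparison concrete, here is a minimal sketch with the same small arrays; remember that plain np.stack always adds a new axis:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([2, 3, 4])

# np.stack always joins along a NEW axis
print(np.stack((x, y), axis=0))   # [[1 2 3]
                                  #  [2 3 4]]  -- inputs become rows
print(np.stack((x, y), axis=1))   # [[1 2]
                                  #  [2 3]
                                  #  [3 4]]    -- inputs become columns

a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([[2, 3, 4],
              [5, 6, 7]])

# For joining 2-D arrays into a bigger 2-D array use hstack / vstack
print(np.hstack((a, b)))          # same as np.concatenate((a, b), axis=1)
print(np.vstack((a, b)))          # same as np.concatenate((a, b), axis=0)

# column_stack turns 1-D arrays into the columns of a 2-D array;
# for 2-D inputs it behaves like hstack
print(np.column_stack((x, y)))    # [[1 2]
                                  #  [2 3]
                                  #  [3 4]]
```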
Just a quick info guys: test your knowledge of data analytics by answering this question. Which of the following methods creates a new array object that looks at the same data? a) view, b) copy, c) paste, d) all of the above. Comment your answer in the comment section below and subscribe to Intellipaat to know the right answer. Now let's continue with the session. In practice the column_stack function is not used that much; we generally use hstack and vstack a lot more than the other ones. If that is clear, we will go into the splitting of arrays, which is a slightly tricky one. The syntax of the split function is np.split, and it takes three parameters: the first is the array, the second is the indices_or_sections argument, which I will explain, and the third is the axis. The first argument can be any numpy array. The second argument has two forms, either an integer or a list: if it is an integer, the array is cut into that many equal sections, and if a list is passed, the array is split at those indices. The axis you already know: 0 works along the rows, 1 along the columns. Now let's go back and work on our array a, the 2 by 3 matrix [[1, 2, 3], [4, 5, 6]]; print a, that will be our input. First I pass an integer and split along axis=0: with 2, the array is split into two sub-arrays, one row each. If I pass 3 it throws an error, and it should, because we have only two rows and an integer split has to produce equal pieces, so 3 does not divide it; to get three pieces we need to split column-wise instead, axis=1, where we have three columns and it happily returns three 2 by 1 pieces. That is the easy and simple part; the trickier part is coming up now. Say I do not pass an integer but a list instead, and see how the output differs. With the list [1, 2], the array is split into three parts: first everything before index 1, that is a[:1], then the slice between the two numbers, a[1:2], and then everything from the last number onwards, a[2:]. So when a list is passed to the indices argument, those three outputs come back, one slice per boundary; since the array is 2 by 3 and we are splitting it at 1 and then at 2, those are the three parts we get.
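Here is a sketch of np.split on the same 2 by 3 array, showing both the integer form and the index-list form:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Integer -> split into that many EQUAL sections along the chosen axis
print(np.split(a, 2, axis=0))   # two 1x3 arrays (2 rows split into 2 parts)
print(np.split(a, 3, axis=1))   # three 2x1 arrays (3 columns split into 3 parts)
# np.split(a, 3, axis=0) would raise an error: 2 rows cannot be split equally into 3

# List -> split AT those indices: a[:1], a[1:2], a[2:]
# (the last piece is empty here, because the array has only 2 rows)
print(np.split(a, [1, 2], axis=0))

# The same list along axis=1 splits the columns: [[1],[4]], [[2],[5]], [[3],[6]]
print(np.split(a, [1, 2], axis=1))
```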
That is how split works along the rows: the list tells it where to cut, and the axis tells it along which direction. Now if we go for column-wise, axis=1, you see something different: everything stays the same except the cutting is done on the columns instead of the rows, so the first output is the column 1 4, then 2 5, then 3 6; those are the splits you get. Every split works like this: the indices break the array into pieces along the chosen axis and the pieces are returned separately. If you split on the rows it is a row-wise split and returns sub-arrays built from the rows, and if you split on the columns, axis=1, it returns sub-arrays built from the column separation; that is the concept of split. Now, what is the advantage of numpy over a list? First, numpy arrays consume less memory. Second, numpy is faster; we will see an example of just how fast a numpy array can be. Third, it is more convenient: with a plain list you did not have all these methods, sum, min, max, mean, median, standard deviation, concatenate, hstack, vstack and everything else we have discussed, you would need to write all of that on your own. That is why numpy arrays are more convenient, and, since someone was asking about pandas: if you already know pandas or have read about it somewhere, pandas is built on top of numpy arrays, the two go hand in hand, so if you understand numpy arrays well, the pandas part will be much easier. Without wasting time, let me go into the example; it shows how numpy is advantageous both in memory space and in access time. Here is the memory-size comparison: we define a list with range(1000), so l has a thousand elements, and we take sys.getsizeof(1), the size of one integer object, and multiply it by len(l). A Python int takes about 28 bytes to store, which is why it prints 28000: storing a thousand elements in a Python list takes roughly 28,000 bytes of element storage, whereas the equivalent numpy array takes up only 4,000 bytes.
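The memory comparison can be reproduced roughly like this; the exact byte counts depend on the Python build and dtype, so treat 28 and 4 bytes per element as typical rather than guaranteed:

```python
import sys
import numpy as np

n = 1000

l = list(range(n))
# Each Python int is a full object (roughly 28 bytes on a 64-bit CPython build)
print(sys.getsizeof(1) * len(l))      # ~28000 bytes of element storage

a = np.arange(n, dtype=np.int32)
# itemsize = bytes per element, size = number of elements
print(a.size * a.itemsize)            # 4000 bytes
```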
Why is there that difference? As you know, a.size is the number of elements in the array, and a.itemsize is the size in bytes that one element occupies in memory, so size times itemsize gives the storage used, which is how we get 4,000 bytes. Why is that so much smaller? Look at how a numpy array is stored: forget the dimension and strides part for a moment and focus on the data pointer; the array has a data head that points directly at a contiguous block of raw values. A Python list is stored differently, as you might have covered in the list basics: the list holds references, memory addresses, and each address points at a full Python object somewhere else, so whenever you access an element you go through the reference to reach the object that actually carries the value. That two-step storage, the pointer plus a complete integer object, is what adds up to roughly 28 bytes per number, against 4 bytes for an int32 sitting directly inside the numpy buffer. One clarification on a question that came up: this has nothing to do with mutability. Tuples being immutable and lists being mutable is a language design choice, not a consequence of this storage layout, and numpy arrays are definitely mutable, we have changed their values in multiple examples. The point here is only that lists store data in two steps, pointer first and object second, while numpy stores the data directly, and that is exactly what the programmatic example showed. Now look at the speed side: we do the same job twice, adding two sequences element by element, first with plain lists and then with numpy arrays. For the lists we have to run a loop to add the pairs, but for numpy we do not need that, we can just write a + b like a clean matrix addition, and then we print the time difference. For the list version it came out around 0.005 seconds, and for numpy it was a tiny fraction of that; in fact when I ran it with only a thousand values the numpy timing was so small it showed up as practically zero on my machine. Even with a thousand values in the list and a lakh of values in the numpy array, numpy still holds its own, and if you put one lakh values in both you will see the real difference, numpy comes out dramatically faster; drop a zero and its time goes right back toward zero. That is how lightning fast numpy arrays are.
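And here is one way to reproduce the timing comparison; absolute numbers will vary by machine, but the list loop should come out clearly slower than the vectorized numpy addition:

```python
import time
import numpy as np

n = 100_000   # one lakh values, as in the demo

l1 = list(range(n))
l2 = list(range(n))
a1 = np.arange(n)
a2 = np.arange(n)

start = time.time()
result_list = [x + y for x, y in zip(l1, l2)]   # element-wise add needs a loop
print("list  :", time.time() - start)

start = time.time()
result_np = a1 + a2                              # one vectorised operation
print("numpy :", time.time() - start)
```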
The reason is that access in a numpy array is direct: it goes straight to the location, fetches the data and does the addition or multiplication or whatever you want to do. That is what is going on inside numpy arrays, and that wraps up the numpy part. Now we will understand the difference between a data analyst and a data scientist. Starting with the data analyst: we already know that data analysts collect data from various sources, which can be primary or secondary, they pre-process the data, clean it, manipulate it and also visualize it, they perform big data analytics on it and they create reports. A data scientist collects and analyzes the data but looks at it from the angle of creating processes that help in building automated models, models that help in prediction for the business; the task of a data scientist is also to analyze and visualize the data and decide which algorithm should be used. That is the major difference between a data analyst and a data scientist. Moving on to the next part, let's look at the skill sets, starting with the data scientist. A data scientist should have knowledge of SQL, the structured query language used for databases; then Tableau or Power BI, which are used for data visualization, along with Microsoft Excel. Coming to programming languages, they should have a comprehensive knowledge of Python as well as R, since both have various libraries and packages that are specifically used in data science, machine learning and artificial intelligence to analyze and visualize data and to create machine learning and other automated models. Then mathematics: a data scientist must have a thorough knowledge of mathematics, especially statistics and probability, linear algebra, calculus and differential equations, because these are used for creating machine learning models. Then data pre-processing, manipulation and visualization, which are the basic tasks of a data scientist, so they should be thorough with them. Then machine learning algorithms: they should be expert at implementing various machine learning algorithms on data sets so they can create predictive models. They should also have knowledge of artificial intelligence, which includes neural networks and deep learning; knowing neural networks and deep learning is very important for a data scientist. Then, obviously, software engineering, to be able to build and work on software, and finally they should be good problem solvers. Moving on to the skills of a data analyst: they should also have knowledge of SQL, then Tableau for data visualization and Microsoft Excel; for programming languages they should know Python or R, which data analysts use to perform various analytics techniques; statistical knowledge is a must, as statistics is used for analyzing the data and applying statistical techniques to understand its insights; and a data analyst should also be a good problem solver, able to analyze business problems, solve them and make effective strategies for increasing the revenue generation of the business.
Then comes the ability to analyze, model and interpret the data; they should also have a basic knowledge of data science and of machine learning algorithms so that they can collaborate with data scientists and ML engineers on building predictive models; and then comes creating dashboards and reports for the organization, so that everyone can understand the trends and patterns in the business data, know the insights and see in which direction the business is actually going. Now let's see the job scope and salary of a data analyst and a data scientist. Starting with the data analyst: according to Forbes, in India there were 5,000-plus jobs available for data analysts in 2020, and in the United States the number of jobs was more than 12,000, so you can see there is a plethora of career opportunities in the field of data analytics. Talking about average salary, in India the average salary of a data analyst is about 7,50,000 rupees per annum, and in the United States it is about 115,500 dollars. Moving on to the data scientist: according to Forbes there were 10,000-plus jobs available in India for data scientists in 2020, and in the United States more than 25,000, so there is a wide scope of opportunity in the data science field as well, with plenty of openings to grab for people who are willing to become data scientists. Talking about average salary, a data scientist earns an average of about 9,35,000 rupees in India and about 132,000 dollars in the United States. So in this part we understood why we need to handle data and the importance of data with the help of the Google Maps example, then we looked into the roles and responsibilities of a data analyst and a data scientist, the skill sets that make each of them, the differences between the two, and finally the job scope and salary of both profiles; that is all you need to know about the difference between a data analyst and a data scientist. Now moving on to the next part, understanding the difference between a data engineer and a data analyst. A data engineer basically builds the architecture, you could say the infrastructure, for the data: how the databases are set up and how the data will be correlated to each other. A data engineer also looks after the maintenance of the databases and handles any errors that occur in them; that is basically the task of a data engineer. A data analyst, on the other hand, collects the data from the database, pre-processes it, cleans it and manipulates it in order to understand and analyze it, finds the trends and patterns in it and visualizes the ups and downs of the business and the market, so that an effective strategy, model or other improvement can be discussed with the other teams and implemented to improve the business. Now the skills of a data engineer and a data analyst, starting with the data engineer. It starts with data integration, how they collect the data and store it in databases; then data warehousing; then Apache Cassandra, which is a free and open-source tool designed to handle large amounts of data across many commodity servers.
Then comes SQL, the structured query language, which is a database language, and Python and Java are a must for a data engineer. Then comes Apache Hadoop, a software library used for the distributed processing of large data sets across clusters, and HBase, Hive and MapReduce, which provide data query and analysis on top of it. Then Stitch Data or Segment, which help developers move data rapidly; after that a data engineer should have knowledge of Unix, Linux or Solaris; and finally they should know ETL tools such as Redshift or Panoply that are used to extract, transform and load the data. Now the skills of a data analyst. Again it starts with SQL, since a data analyst should know the structured query language for database-related tasks; then Tableau, which is used for data visualization and helps in the analysis of the data; then Microsoft Excel; then R or Python, which a data analyst must be experienced with, because they are popular across all industries and are widely used for tasks such as analyzing data or building machine learning and deep learning models, with various libraries and packages that help with those tasks; then statistical knowledge, because statistics is one of the most important components of data analysis and helps in analyzing all sorts of data; then problem-solving skills, since a data analyst must be a good problem solver; then the ability to analyze, model and interpret data; they should also have knowledge of data science and machine learning algorithms so that they can collaborate with data scientists and ML engineers to build predictive models and help them analyze the data; and finally, creating dashboards and reports for the organization, so the organization can understand where the business is actually going. Now the job scope and salary of a data analyst and a data engineer. Starting with the data analyst: according to Forbes, in India there were 5,000-plus jobs available for data analysts in 2020, and in the United States more than 12,000, which is a very big number and a good opportunity for people who love to play with data. Talking about average salary, in India the average salary of a data analyst is about 7,50,000 rupees per annum and in the US about 115,500 dollars per year, which is a very good amount to start your career with. Moving on to the job scope and salary of a data engineer: according to Forbes, in India there were 1,500-plus jobs available for data engineers in 2020, and in the United States more than 5,000, again a big number of opportunities for people who want to become data engineers. Talking about average salary, a data engineer earns an average of about 6,30,500 rupees in India and close to 100,000 dollars per year in the US. So in this part we have looked at who a data analyst is and who a data engineer is, their roles and responsibilities, the differences between the two, why we need to handle data (with the help of an example), and the job scope and salary for both profiles, so you should now have a pretty good picture of both roles. This is all you need to know about both these profiles.
Just a quick info guys: Intellipaat provides an online data analytics course in partnership with IBM and Microsoft; the course link is given in the description below. Now let's continue with the session. Coming to the first of these top data analysis interview questions: what are the key differences between data analysis and data mining? Data analysis involves the procedure of procuring the data, cleaning it and organizing it, and at the end of it all, deriving meaningful insights from it. Data mining is a rather different concept, because in data mining we are mostly concerned with finding patterns in the data, hidden patterns, and seeing how well we can use those hidden patterns to drive insights. Data analysis can be compared to data mining up to a certain extent, but the procedures that go into data mining are quite different, and data analysis is generally used to produce results that have a larger, more direct impact on the audience when you compare it to data mining. Those are the vital differences between data analysis and data mining. Coming to question number two, it says: what is data validation? Data validation is the process where we check whether the data is valid or not, as simple as that. When you are thinking about data and its accuracy, we want to find out whether the data is actually accurate and of high quality as well: first, see whether it is coming from a valid source; second, check how accurate it is; third, check what the quality of the data is. To do this there are multiple processes used to perform data validation, two of the most important being data screening and data verification. Data screening involves making use of a number of models to make sure the data can be verified as accurate and that there are no redundancies, no repetitions, nothing in the data that would harm the models. Data verification deals with scenarios such as redundancy: if there is a redundancy, we evaluate it through multiple steps and decide whether the presence or absence of that particular item in the entire data set is required; if it is required we keep it, and if not it gets removed, as simple as that. So screening the data first and then verifying it are the two processes that have the biggest impact in data validation. Coming to question number three, it states: what is data analysis, in brief? It says in brief, so make sure you do not get carried away with your answer. Data analysis is a very structured process where we work with data continuously, performing activities such as data ingestion, data pre-processing, data cleaning, data transformation and data visualization, and eventually we derive insights
out of those analytics results or the visualizations themselves. Why is this done? We do it to make sure we can drive revenue and analyze the key performance indicators of a business. To begin with, data is collected from multiple sources: it can be an on-site server, an off-premise server, a cloud storage network, wherever it lives. Once the data source is identified we bring the data into the company architecture, and at that point we have to understand that the data is still a raw entity: it needs to be cleaned and processed, any missing values have to be filled, and any entries that are repeated or out of scope need to be removed. After pre-processing we use a couple of models to drive analytics out of it, so that we can create good reports and make sure the output is converted into a format that is understandable by a technical audience and an equivalent non-technical audience as well, because at the end of the day a data analysis result will mostly be viewed by board members and stakeholders, and you will basically be explaining it to a non-technical audience; in that case all of these steps are paramount. Coming to question number four, it states: how do you know if a data model is performing well or not? This question is quite subjective, as you can already tell, because many things are involved in checking whether a data model performs well, but to assess it we can use a few criteria. First, a well-designed model, as you might already know, gives you good predictability: for this input you can expect that output, and it keeps doing so consistently. Second, a good model makes sure that no matter what changes you make to the data or to the pipeline, the results can still be obtained without having to make major changes to the model itself; that is part of what defines a good model. Third, the model needs the ability to scale with the data: for one requirement you might be working with a small amount of data and for another with a huge amount, and the model needs to cope with both. Fourth, the model's workings should be easy to use and understand: if it is complicated for you as a data analyst to use, it literally would not make sense to spend a lot of time and effort on it, whereas having it easily understood by clients and data analysts alike means both the client's requirement specification and the analyst's work with the model will go as smoothly as possible. Those are the points to keep in mind when checking whether a data model is performing well or not. Coming to question number five, it says: explain data cleaning in brief. One thing to know about this question is that it can also be stated as explain data wrangling in brief, because data cleaning is also called data wrangling, and, as the name suggests, data cleaning is about finding ways to clean the data and take it from its
raw entity into usable information; our end goal is to have data of the utmost quality possible. How do we do it? First we try to clean the data by removing the things we do not require, what we call outliers, but to remove them we need to understand what exactly to remove: one particular row, one column, or an entire block of data. Second, it is not just about removal, it is about adding certain things back in as well: we can fill certain data in without causing any other redundancies or obstructing model performance. Third, we can replace existing data with mean or median values, a very important statistical concept we will come back to in the next couple of questions, and doing so in fact makes models noticeably more accurate in some cases (a small sketch of this idea appears just after this question block). And the last thing is that we make use of placeholders for all the empty spaces in the data. Those are the four main steps or procedures involved in data cleaning. After understanding data cleaning, here is question number six: what are some of the problems that a working data analyst might encounter? There are many issues a data analyst can face when working with data, because data itself is an ocean, and working with an ocean is not easy. The first very important one is the accuracy of the model: if there are issues with the data, the biggest problem is going to be accuracy during model development, and if it is low it will cause a lot of errors and mistakes where, at the end of the day, performance falls. The second is the source of the data: if it is not a verified source, your data will be very raw and shabby, which means it will require a lot of cleaning and pre-processing before any analysis even begins, and that takes a lot of time and effort. The third is when you are extracting or merging data from multiple sources, which again causes redundancies in the data, and it is very vital that those redundancies are taken care of. All of these steps go hand in hand: if you miss one of them, say data ingestion or data cleaning, the rest of your model comes tumbling down. Those are some of the major problems data analysts face in a production environment. Moving on to question number seven, it says: what is data profiling? Data profiling, as the name suggests, is a methodology we make use of to make sure we have the right steps in place to analyze all of the entries present in our data. It is not just taking a quick overview of what data is present, but rather taking an in-depth look: what we are trying to do with data profiling is to see the data with high accuracy and understand it better, what the data types of the values are, what the frequency of occurrence is, and a lot of other things as well.
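Going back to the data-cleaning question for a moment, here is one possible pandas sketch of the "replace with mean or median" and "placeholder" ideas; the column name and values are made up purely for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical column with a couple of missing entries
df = pd.DataFrame({"sales": [200.0, 250.0, np.nan, 300.0, np.nan, 275.0]})

# Replace missing values with the mean or the median of the column
df["sales_mean_filled"] = df["sales"].fillna(df["sales"].mean())
df["sales_median_filled"] = df["sales"].fillna(df["sales"].median())

# Or use a simple placeholder for the empty spaces, if a neutral marker is preferred
df["sales_placeholder"] = df["sales"].fillna(-1)

print(df)
```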
When you think about all of this, it is very important that we profile the data. With that, let's check out question number eight: what are the scenarios that could cause a model to be retrained? One very important thing to know about data is that it is never a stagnant entity; data keeps changing as the business changes, so whenever there is an expansion in the business there can be a day-and-night difference in the data the company starts seeing. That poses a huge challenge to data analysts, because we have to assess whether the model itself needs to be retrained to work with the new data. If yes, the model is retrained: the new data is shown to it, business protocols change, whatever the model sees changes, and of course the output can change drastically as well. Considering all of this, the general rule of thumb is that whenever models are retrained, we try to align them as closely as possible with the new business protocols and whatever is now being offered. That is what you need to understand about the scenarios that cause a model to be retrained. Coming to question number nine, it states: what are the prerequisites to become a data analyst? There are many skills required to be a data analyst, and I am sure you will know most of them if you are here at this video: you will require programming skills, with technologies such as JavaScript, ETL and XML coming up often; you need high proficiency working with databases, especially MySQL, MongoDB, Oracle SQL and more; you need the ability, the passion and the interest to collect and eventually analyze data; you need to figure out how to design databases on your own and how to perform data wrangling and data mining; and once you know how to work with data, what comes next is having the experience of working with large data sets in a production environment, which is very, very important. All of these skills are the major prerequisites for becoming a budding data analyst. Coming to question number 10, it states: what are the top tools used to perform data analysis? As you might already know, there are multiple tools that can be used in the field of data analysis, but there are some really popular ones that get a lot of attention: Google Search Operators (also called GSO), RapidMiner, Tableau, OpenRefine and a number of other tools that take center stage when we talk about data analysis, so it is very important that you know these tools. Question number 11
coming to question number 11, it says: what is an outlier? an outlier is basically a value in a data set that lies very far away from the mean of the data, and it hurts you in two ways. first, an outlier can cause model accuracy to fall a lot; with small amounts of data that's probably okay, but with large data sets a few outliers can swing the model accuracy around, and that's a big no. the second thing is storage: storage is not really cheap, and keeping values that sit outside the normal, useful zone of the data (the inliers) is redundant. so it's very vital to know what outliers are and how to remove them. there are also two types of outliers you should know about, univariate outliers and multivariate outliers, and the names pretty much speak for themselves.

moving on to question number 12, it states: how can we deal with problems that arise when the data flows in from a variety of sources? when your data comes from a variety of sources, it will come with a variety of problems, because data is really shabby, as we've been discussing, and it is very important to handle these. to solve them, we first check whether records are the same or similar, which would cause redundancy; if they are similar, we can merge them into a single record and reduce that redundancy. we can also restructure the entire schema to make sure there's good schema integration across the data. and at the end of it, no matter how much cleaning and pre-processing you do, and however much time you take to do it, there will still be an entry, or a couple of entries, in a large data set that will hurt model accuracy if you're not careful. so these are some of the problems that arise and how we can deal with them.

coming to question number 13, it says: what are some of the popular tools used in big data? there are many tools in big data, but a few of them come to mind as soon as we think about it: hadoop, spark, scala, hive and flume. and as you can see, there's a pattern here: all of these tools are by apache. apache is an open-source foundation, and they've been contributing a lot to the world of big data, so it is very important that you either have experience using these tools or at least know what each of them does.

that brings us to the 14th question, which states: what is the use of a pivot table? a pivot table, as you might have heard or used, is in my opinion one of the most important features in microsoft excel, because it gives you a very easy way to view and summarize large data sets without putting much effort into them. most of the operations we do with pivot tables are drag-and-drop operations; they don't require any coding or programming, and no great effort is needed to create reports and make them look good. for all of these uses, pivot tables are very important.
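the same idea carries over outside excel too. here's a tiny pivot-table illustration with pandas, where all of the data is made up just to show the mechanics:

```python
# a small illustration of the pivot-table idea with pandas (all data is invented)
import pandas as pd

sales = pd.DataFrame({
    "region":  ["north", "north", "south", "south", "south"],
    "product": ["pens", "books", "pens", "books", "pens"],
    "revenue": [120, 340, 90, 410, 60],
})

# summarize revenue by region and product, much like a drag-and-drop pivot in excel
summary = sales.pivot_table(index="region", columns="product",
                            values="revenue", aggfunc="sum", fill_value=0)
print(summary)
```

index, columns, values and aggfunc play the same roles as the row, column and value boxes you drag fields into in excel.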
i'm sure if you're at this video you might have used pivot tables in excel yourselves as well. so coming to question number 15, it states: explain the knn imputation method in brief. knn is k-nearest neighbors, and it's one of the most important methods we use to pick out the nearest neighbors of a point in a nodal arrangement and to measure the distance metric between each of them. what we're trying to do with knn imputation is to predict both discrete and continuous attributes that happen to be missing in a data set. how do we do it? a distance function is exactly what we use to find whether two records, or two neighbors, are similar or not, and if they are similar across more than one attribute, we use those neighbors to fill in the value. doing this in a structured way is what we call the knn imputation method.

moving on to question number 16, it says: what are the top apache frameworks used in a distributed computing environment? we're living in a world of distributed computing, where we can make use of the computing power offered by thousands, if not millions, of computers across the globe at a time. when you're sitting on that kind of power, two tools always come to center stage: mapreduce and hadoop. these are the top apache frameworks in the limelight, they can handle large data sets very easily, and in a distributed working environment there are very few tools that outshine what mapreduce and hadoop do.

coming to question number 17, it states: what is hierarchical clustering? this question can also be asked another way: what is hca? hca is hierarchical cluster analysis, also called hierarchical clustering. it's basically an algorithm we use to find the easiest ways to group similar objects; each group of similar objects is called a cluster, hence the name. why do we require hierarchical clustering? because we need to create a set of clusters to show how one group of data differs from another, individually, cluster-wise, or even when they consist of similar entities. so to show the differences that exist between data, we make use of hierarchical clustering.

moving to question number 18, it says: what are the steps involved when working on a data analysis project? this question can be answered in multiple ways, but the foundational concept is the same. there are many steps involved when you're working on an end-to-end project, especially in the world of data analysis. you begin by finding out that a problem exists, or a client brings you a problem statement. once you understand the problem, you source the data. once you get the data, you clean it, pre-process it and explore it to see what the data actually is. once you understand the data, you start creating data models and set up situations where you use the data on a smaller scale to see if you can solve the problem in a simple way. once you do that, you validate it against a bigger data set, or the complete data set, to see if it works, and if it does, you implement it.
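just to make those steps concrete, here's a toy sketch of that flow in python. the csv file, the "target" column and the choice of a plain linear model are all assumptions made only for illustration, not a prescription for a real project:

```python
# toy end-to-end sketch: source, clean, explore, model on a small slice, then validate
# (assumes a purely numeric csv with a "target" column and scikit-learn installed)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("project_data.csv")    # source the data
df = df.dropna()                        # clean / pre-process (very crudely here)
print(df.describe())                    # explore

X, y = df.drop(columns=["target"]), df["target"]
X_small, X_rest, y_small, y_rest = train_test_split(X, y, train_size=0.2, random_state=42)

model = LinearRegression().fit(X_small, y_small)                  # model on a smaller scale first
print("validation r2:", r2_score(y_rest, model.predict(X_rest)))  # then validate on the rest
```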
but it's not just done with the implementation aspect of it: once you implement it, you need to verify whether you have actually solved the problem or not, and that is the last important step. based on how much the interviewer asks you to stress the answer, expand on it but keep it concise; don't get lost in each of these steps, because i could talk about each of them for 20 minutes right now, but that really wouldn't add a lot of value in an interview scenario. so read the room, read the interviewer, understand what he or she is expecting, and deliver the answer that way.

with this we come to the 19th question, which says: can you name some of the statistical methodologies used by data analysts? again, you can name a lot of methodologies, but try to keep your answer concise; four, or at most five, in my opinion should be very useful. we have the markov process, cluster analysis, imputation methods, bayesian methodologies, rank statistics and a lot of other statistical techniques that are very useful, and in many cases foundational, when you're working in data analysis. each of these techniques has its own niche purpose and is very effective at what it does, so when you're asked this question, make sure to mention them.

coming to the next question, the 20th, it states: what is time series analysis, also called tsa for short? time series analysis is used to perform trend analytics: we try to see how the data changes with respect to time. and it's not just how the data changes with time; what we're really analyzing is the presence of data at particular intervals of time, its response and its behavior, and how that actually drives the data set. so if your data does something at a particular point in time and you want to understand why it does that, or what happens because it changes its state or its values at that point in time, then tsa is very effective to use.

moving on to the 21st question, which is actually a follow-up to the 20th, it states: where is tsa used? tsa has a lot of usage across a lot of different domains, so make sure to mention at least three or four, but i have given you more here so that it helps you remember. tsa is used in statistics, in the field of signal processing, in econometrics, weather forecasting, earthquake prediction, astronomy and even applied sciences. there's a very good chance the interviewer will not just ask where it is used but will give you a situation and ask you to think about it and tell whether you can provide a solution there, but these are some of the most important places where time series analysis adds a lot of value, so make sure to mention them.

moving to question number 22, it states: what are some of the properties of clustering algorithms? any clustering algorithm you implement will have a lot of properties, but these are the main ones: it can be flat or hierarchical, it can be iterative, and it can be disjunctive, among others.
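to illustrate the flat versus hierarchical property, here's a small, purely illustrative sketch comparing a flat algorithm (k-means) with a hierarchical one (agglomerative clustering) in scikit-learn, run on random made-up points:

```python
# flat vs hierarchical clustering on random illustrative data
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(1)
points = rng.normal(size=(100, 2))

flat_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(points)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(points)

print(flat_labels[:10])   # flat: every point is assigned one of k cluster ids directly
print(hier_labels[:10])   # hierarchical: clusters are formed by merging points bottom-up
```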
so when you're thinking about clustering algorithms, do talk about these foundational properties: being flat or hierarchical, being iterative so that you arrive at the solution, and being disjunctive, which defines how clustering algorithms effectively split our data into clusters and make use of them.

coming to question number 23, it states: what is collaborative filtering? collaborative filtering is a very important concept; it's an algorithm widely used in the field of data analytics to create recommendation systems. recommendation systems are used to understand the behavioral data of a customer, to understand what the data is doing and why it is behaving the way it is. when you're shopping on an e-commerce site, maybe amazon or flipkart, you will be shown recommendations: "users who bought this also bought this". that is a recommendation given to you, and collaborative filtering is used for exactly this purpose. it uses a lot of data, cookie data, browser history, and there are very complex methodologies, models and algorithms put in place to recommend the best products to you, and that working is achieved through collaborative filtering.

coming to question number 24, it says: what are the types of hypothesis testing used today? there are many types of hypothesis testing, but three main ones take center stage. the first is anova, or analysis of variance, where we perform analysis between the mean values of multiple groups to see how the data is changing. the second is the t-test, which is used when the standard deviation is unknown to us and the sample size is fairly small; we'll have another question discussing the t-test. and the next is the chi-squared test, where we use hypothesis testing to find the association between the categorical variables in a sample and to see how best we can arrive at the goal of our model. so when you're asked about hypothesis testing, make sure to talk about three or even four types of testing and briefly describe what each one does.

coming to question number 25, it says: what are some of the data validation methodologies used in data analysis? there are many data validation techniques, as we've been discussing for a while now, and these four are very important. the first is field-level validation, where we validate every single field where data is present to make sure there are no errors, say because a user entered a date wrongly or some other slip caused the wrong data to land in the wrong place. form-level validation is done when the user works with the form: before the information is saved, if any changes have been made, we need to keep track of them as well. and then the third is data saving validation, which takes place when the file or the database record
is being saved. whenever it's being saved there's a change of state, and to analyze that change we perform this validation. and then we have search criteria validation, where we check whether valid results are returned whenever the user asks for them, whenever the user is searching through the data set, summarizing it or deriving results out of it. so these are some of the very important data validation methodologies that are widely used by data analysts across the globe.

coming to question number 26, it states: what is the k-means algorithm? k-means is a very popular algorithm that we use to cluster data into multiple clusters, where each cluster consists of similar data. the name comes from k being the number of clusters we require: if k is equal to five, our k-means algorithm is going to divide the data set into five different clusters. it's very simple to use, but to achieve really effective results it requires a good amount of experience, and of course you need to know how many clusters you want; since you'll be working with data that is unsupervised in nature, you will not have any labels that guide you and show what the data is. so working with k-means can get finicky at times, but it's really fun and really effective too.

coming to question number 27, it states: what is the difference between the concepts of recall and true positive rate? this is a tricky question, and i'll tell you why: recall and true positive rate are literally the same thing. whenever you write out the formula for either recall or true positive rate, it's the true positives divided by the sum of the true positives and the false negatives, that is tp / (tp + fn). it's a very simple question with a very straightforward answer.

coming to question number 28, it asks: what are the ideal situations in which a t-test or a z-test can be used? we follow one very simple practice that has become pretty much standard: a t-test is used if your sample size is less than 30, and a z-test is used if your sample size exceeds 30. a neat trick to remember this is that t comes before z in the alphabet, so t goes with the smaller sample sizes (less than 30), while z, at the end of the alphabet, goes with sample sizes that exceed that. it's a pretty simple way to remember it in case you have not had any experience working with z-tests and t-tests.

coming to the 29th question, it states: why is naive bayes called naive? this is a tricky question that gets asked a lot just to see whether you actually know the answer. it's called naive because it makes the general assumption that every feature present in your data is equally important and independent of every other feature. thinking that way as a human being is naive, because real data is different: features are sometimes dependent on each other, they will not always be independent, and not all of them are equally important. the model will not understand that; it will simply assume everything is important and everything is independent, which does not hold in a real-world scenario, and that is the reason we call naive bayes naive.
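here's a minimal sketch of that in code, assuming scikit-learn and using its built-in iris data purely for convenience; gaussian naive bayes applies exactly the independence assumption we just talked about:

```python
# minimal naive bayes sketch with scikit-learn (iris is used only as a convenient built-in dataset)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# gaussian naive bayes treats every feature as independent given the class,
# which is the "naive" assumption discussed above
model = GaussianNB().fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```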
coming to the 30th question, it says: what is the simple difference between standardized and unstandardized coefficients? as the name suggests, with standardized coefficients we interpret the results in terms of standard deviation units, while unstandardized coefficients are based on the actual values of the data rather than on standard deviations. so if you're considering standard deviation units, those are standardized coefficients; if you're considering the actual data values, they are unstandardized. a pretty simple question with a pretty straightforward answer.

coming to the 31st question, it says: how are outliers detected? there are many methodologies we use to detect outliers, and in most cases it is important to remove them, but two methods are particularly important: the standard deviation method and the box plot method. in the standard deviation method, a value is considered an outlier if it lies more than three standard deviations away from the mean; we find the mean, we find the standard deviation, and if the value deviates from the mean by more than three standard deviations, it deviates too much from the data and is treated as an outlier. then we have the box plot method, which is the most popular way of finding outliers: here we compute the interquartile range, or iqr, and a value is considered an outlier if it falls more than 1.5 times the iqr below the first quartile or more than 1.5 times the iqr above the third quartile.

coming to question number 32, which states: why is knn preferred when determining missing numbers in data? knn, as we've already discussed, is the k-nearest neighbors algorithm, a very famous one that's used a lot, and we now know why it's called that. but why is it used when you're determining missing numbers? because it has the capability to approximate a missing value based on the values that are present very close to it; it can easily approximate these values, and hence it finds a very important place in the world of determining missing numbers in our data.

coming to question number 33, it states: how can one handle suspicious or missing data in a data set while performing analysis? you can think about this in a lot of ways, because with any sort of data there will be discrepancies in it. to handle these, we create validation reports where we discuss the data in detail to make sure that most of it, at least, is right, and then we escalate the data, or the suspicion about it, to our superiors or to experienced data analysts so they can have a look and take a call. we also try to replace data that might be old or invalid, putting in the latest up-to-date data to see if that fixes it, and we can go on to use the many strategies we have in the world of data analysis: finding the missing values, using approximation to put in values and see what fits best, performing analysis by trial and error, and whatever else is required to handle either missing or suspicious data.
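to tie questions 32 and 33 together, here's a small knn-imputation sketch with scikit-learn; the numbers are invented just to show the mechanics:

```python
# small knn-imputation sketch (the data is made up)
import numpy as np
from sklearn.impute import KNNImputer

data = np.array([
    [25.0, 50000.0],
    [27.0, np.nan],      # missing salary
    [np.nan, 52000.0],   # missing age
    [40.0, 90000.0],
])

# each missing value is filled using its 2 nearest neighbors,
# where "nearest" is decided by a distance function over the known columns
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(data))
```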
coming to the 34th question, it says: what is the simple difference between pca and fa? pca stands for principal component analysis and fa stands for factor analysis. among the many differences between them, one major difference is this: pca tries to explain as much of the total variance in the data as possible through its components, while factor analysis is used to model the covariance, the shared structure that exists between the observed variables, through underlying factors. so principal component analysis deals with total variance and factor analysis deals with covariance, and that is probably the biggest difference between them.

now moving on to the 35th question, it says: how is it beneficial to make use of version control? version control is very important and provides a lot of advantages. one is that it gives you a structured pipeline where you can compare files, identify differences and merge any changes in a structured, convenient way where everything can be tracked; you can track the life cycle of an application build through every stage, be it development, be it testing, be it production, be it verification. it also brings in a good way to establish a collaborative work culture in the company. and when you're thinking about making sure that, with every release of a version, your code stays secure and safe, a version control system will do just that.

coming to the 36th question, it states: what are the future trends in data analysis? if you're asked this question, what the interviewer is trying to do is assess your grip on the subject and how much research you have actually done into the future of data analysis. to answer it, make sure you can state a good number of facts, with respective validation for those facts and for the sources you have used, to add a lot of positivity to your candidacy, because at the end of the day they're trying to understand how in-depth your experience in data analysis is, or how deep your passion for it goes.

just a quick info, guys: test your knowledge of data analytics by answering this question. which of the following methods creates a new array object that looks at the same data: a) view, b) copy, c) paste, d) all of the above? comment your answer in the comment section below and subscribe to intellipaat to know the right answer. now let's continue with the session.

try to explain how data analysis will have a huge impact on businesses today.
in fact, talk about how it is already revolutionizing the world around us, how it will shape future trends, and how it is showing nothing but immense potential. if you have facts, if you have statistics, if you have experience, and if you can talk about new tools, new techniques and new ways that data analysis will help, please do talk about all of these.

coming to question number 37, it states: why are you applying for the data analyst role in our company? with this question the interviewer basically wants you to convince them why you think you're proficient in the subject, how well you can handle data analysis, how well you can work in a production environment, and how good a fit you are as a team player. to answer this well, the first thing you have to do is know the job description in detail; you need to know what they expect of you and whether you can deliver it. once you understand what the company does, what the job description is, what your team does and where you will fit into the organization, you will find the best way to express why you want the role: you might have a passion for data analysis, a genuine interest in building your career there, or a keen interest in working with data and solving problems with it. there can be multiple reasons, and there is no single answer that fits everyone, so this question completely depends on you, but you can use this as a guide to answer it.

coming to the 38th question, it states: can you rate yourself on a scale of one to ten depending on your proficiency in data analysis? this is another question the interviewer will usually ask to see how confident you are sitting in the interview, how much you know about the subject and how spontaneous you can be when questions are thrown at you. the very important thing to understand here is to answer honestly and not be over-optimistic. if you think you're at a seven on a scale of one to ten, please don't say nine, nine and a half or ten, because if you answer that way you might be asked very difficult or advanced questions that are beyond the scope of your expertise, which might put the wrong image in the interviewer's head and cost you the candidature. be as honest as possible, talk about what you've done, and explain why you rate yourself at that particular level.

coming to question number 39, it says: has your college degree helped you with data analysis in any way? talk about the latest program you completed in college; start your answer with that, and talk about whether your college degree actually helped you with data analysis. don't simply say no if you feel you haven't had any experience there, because if you were enrolled in a college degree you will have had access to a lot of data that you could have used to perform some analysis. talk about whether you were keen to do it and whether you did anything with data analysis in the past, and if you have not done any of that, talk about the experience the entire college degree program gave you and how you think it built you
in a way where you can be a good fit for data analysis, for the entire field and for the organization you're applying to; talk about how it moulded you into being the data analytics enthusiast that you are today.

coming to the 40th question: what is your plan after joining this data analyst role? again, do not get carried away by this question; it is very easy to get carried away here and talk for a long time, but keep it as concise as possible, that is my advice to you. start by talking about a plan for how you would begin understanding the data analysis setup that is already in place in the company, how you think you would be a good fit there, and how you would assess whether the existing model might need some improvement or not. but at the end of the day, don't say you're going to barge in and make a lot of changes in a short amount of time. keep your answer structured, keep in mind that you'll be going into a production environment, and it is always valuable if you can highlight how you can make the product, the company or the brand better with you being there; talk about the expertise you bring to the table and how you think it might help the organization.

coming to the 41st question, it states: what are the disadvantages of data analytics? even though the advantages are huge, there are one or two disadvantages you will have to talk about, and it is very important that you know them. first, data analytics can cause a breach of customer privacy and of customer information; you might be looking at customer transactions, purchases and subscriptions, so you will have a lot of access to what a customer is doing, and it's very vital to know where you're getting the data from and to make sure you do not cause any breaches of customer privacy. another important disadvantage, especially for a learner, is that some of these tools are really complex to use and require training; it's not as simple as downloading the tool, installing it and getting to work, because you will need a good amount of structured training to work with it efficiently. and one more is that selecting the right tool in the first place, and then using it effectively, takes a lot of skill, expertise and hands-on experience; it's going to take time, it's going to take some money, and it can eat into the biggest asset of the company, which is the client's time, too. so even though these are not huge disadvantages, privacy is definitely one to consider, the complexity of the tools is important, and the amount of skill required can get overwhelming for beginners.

as a follow-up to that, you might be asked: what skills should a successful data analyst possess? this is basically a descriptive question, because here the interviewer wants to understand how your analytical skills and your thinking skills are. there are many
tools and skills that a data analyst must have. think about programming languages: you will require expertise in python, you will require expertise in r and sas, plus concepts of probability, the details of statistics, regression, correlation and many other skills that a good data analyst should possess. how much experience should he or she have? that depends on how long the person has been in data analysis, but it is vital that you know what skills are required, what skills are used by professional data analysts, and perhaps which experts or well-known data analysts you look up to; understand all of that and answer this question in an effective way.

coming to the 43rd question, it states: why do you think you are the right fit for this data analyst role? if you're being asked this, the interviewer wants to understand, number one, whether you fit the job description, number two, why you think data analysis is for you, and number three, what knowledge you have in data analysis to actually fit into their organization. again, very important: do not get carried away, because you could answer this for 20 or 25 minutes straight; keep it small, concise and detailed, and talk about your passion, your interest and your goals, and how much those goals and passions match the company's structure and its goals. every company has a vision and a mission, and if you can commit to the vision and mission of that company and better either the product they deliver or the brand itself, then of course you will be the right fit for the role. taking a product from one place to a place that is definitely better, and being able to prove it with numbers in the case of data analysis, is how you show you are the right fit for the role.

question number 44 says: can you please talk about your past data analysis work? you might be asked this whether you're a fresher or an experienced candidate. if you're a fresher, do talk about how you have figured out problems that exist around you and performed very simple analysis on them: think about how many guests come to your place and find a structured way of estimating that and why, or think about how many people buy a certain product in your supermarket. there can be many things you might not have officially done on paper, but your brain will have analyzed them, so talk about that. and if you have completed a certification program, a good one, you will have worked on a lot of industry-level projects and use cases, so talk about those. of course, if you're an experienced person attending a data analysis interview, talk about your previous company, what you did for it, what value you added, what problems you solved and how much better the team was when you left compared to when you joined. talk about those insights, talk about your debating skills, talk
about how you can think outside the box, how good you are with your statistical skills, and how much of a team player you were in the past and how that has continued.

so let's talk about a behavioral question now. question number 45 states: can you please explain how you would estimate the number of visitors to the taj mahal in november of 2019? this question does not expect you to state the exact number of visitors the taj mahal received in november 2019; instead, the interviewer is trying to see your thought process, without you making use of a data set or a computer, to understand how you would approach the problem. here's a very good template you can use. first, you would begin by gathering some data: you need to know the population of agra, where the taj mahal is located, how people move in and out of agra, how convenient that is, and a lot more. the next thing you would look at is the number of tourists who come to agra and how popular the taj mahal itself is within agra. with all of this you can work out the average stay of the tourists who come, and you can analyze it in more detail: the age of the people who visit, their gender, their annual income, the number of vacation days they take, the number of bank holidays that exist in india, the number of days the taj mahal is closed and open, and the days on which it records the highest number of visits. once you have gathered all of that data, you go on to perform statistical analysis on it to estimate how many visitors it got. and if you want to perform even more advanced analysis, you can go to the local tourist offices to gather all sorts of data, process it on your own and eventually use it to verify whether your estimate is right or not. this is a classic behavioral question, and it's not just about the taj mahal or that particular date; it can be asked in many different ways, for example how many cars were sold in india in the last month or the last year. whenever you're asked something like this, make sure to take this kind of approach, use the template, and take a minute or two to think about it; there is no hurry when you're answering this question.

coming to question number 46, it says: do you have any experience working in the same industry as ours? as you might already know, a data analyst fits into a lot of domains, be it the field of medicine, be it social media, be it whatever it is: if there's a business, a data analyst is required, as simple as that. so if you have worked in the same industry as the company you're applying to now, do talk about how you used your skills to polish yourself in that industry and what you brought out of it. if you have not worked in the same industry, don't just say no; talk about what you think the skills needed for this industry are and how you're working on yourself to get to that particular spot. it's very important to talk about how you're building yourself towards working in this industry, your interests and your passions, and if you do have the experience, of course talk about it and thoroughly explain all of the skills you might have used, the techniques,
the tools, and how you benefited that past company in the same industry.

coming to question number 47, it says: how good are you at explaining technical content to a non-technical audience with respect to data analysis? your ability to explain technical content to a non-technical audience has to be one of the most important things you do as a data analyst, because your communication skills are very, very vital. it's not just about delivering technical content; it's your ability to have patience, and your ability to break complex things down into simpler things so that you can explain them to an audience who might be your board members, stakeholders or some other non-technical group. if you've had situations where you had to explain technical concepts and break them down for a non-technical audience, make sure to talk about that, and about how well-versed you are and how effective you have been at delivering content before. again, this is a question where i cannot give you one answer that fits everyone, but give it a thought for 15 or 20 minutes and you will find the best way to answer it yourself.

coming to question number 48, it says: what tools do you prefer to use in the various phases of development? this question is asked to check which tools you have used, how you've used them, and how well you think a tool fits a particular task. if you have used any or all of the tools involved in data analysis, talk about them, talk about how comfortable you were using them, and talk about why they are popular; some of the tools we discussed in the earlier questions are very popular, and explaining why will add a lot of value to your candidature.

coming to the 49th question, it says: which step of a data analysis project do you like the most? it's very common to have a liking for one or two tools in the entire process of data analysis. data analysis is a vast field; you just cannot have one particular tool that does everything, so you will use multiple tools, and you will naturally like one particular tool more than the others. when you're thinking of the entire analytics cycle, it is always vital that you do not talk negatively about any of these tools, because those tools are put in place to do their tasks and they have been doing them well. if you do not like one, that's okay, we understand that, but do talk about what the tool does best, how it could improve to make you more inclined towards it, or what steps you think could make it better. and understand this: it is okay to have a favorite tool, and it is okay to put more time into one particular tool, because it's your career, your interest and your liking, and if you think you can be very efficient using that tool effectively, then so be it.

and then, coming to the last question in this set of top data analyst interview questions and answers, it states: have you earned any sort of certification to improve your learning and implementation process? a good certification program, what does it do for you? it's
going to give you thorough knowledge and in-depth expertise from top trainers across the globe, it's going to give you a certificate, it's going to give you exposure to potential hirers; it's going to do a lot for you. so if you are enrolled in a comprehensive program that has the capability to take you from a beginner all the way to an advanced data analyst, it is adding a lot of value to your life. but how should you answer this when asked in an interview? one thing is that if you have a certification, it shows the interviewer that you have an interest in advancing your career in this particular field, and that you have a strong aspiration to learn, because you've put in your time, your effort and your money and dedicated them to the world of data analysis. doing all of that always adds value in the interviewer's eyes. then think about how effective the learning you've put in place really was: many certification programs will give you all the theoretical knowledge that exists, but only the good ones will give you really good industry-level projects and case studies to work with. if you have had the experience of actually cracking those industry-level problems, it means you have the capability to learn things on your own and be a contributor in a production environment, and that in itself portrays a lot of the value of being certified in the domain. if you have more than one certification, go ahead and list what you've done, but instead of stating a lot about what you have done, the most value you can add to your application is to state how you have used the certification. you could have learned a lot of things, but if you have not used them, that will not add any value: what problems have you solved? the entire world of it came into being because we have problems and we realized we could use computers to solve them. so talk about how, to the best of your capability and potential, you have made use of your certification programs to solve the problems that are there, or how you plan to solve them if you haven't done any yet, and talk about how helpful the certification has been to you and your career, and how much value it has added to your life experiences, your learning and your interest in data analysis.

just a quick info, guys: intellipaat provides an online data analytics course in partnership with ibm and microsoft, and the course link is given in the description below. so guys, we have come to the end of the session. if you have any doubts, please put them in the comment section below and we will try to answer them as soon as possible. thanks for watching.