Transcript for:
Introduction à CompTIA Data Plus

okay [Music] add it up [Music] okay [Music] well good morning you guys welcome welcome if this is your first time here uh this is the comptia data plus which is the da00-001 exam um we were doing this a few times last month and then i found out it was better to do it in april that's when the book will come out uh by the way we do have the book so if you have not had a chance to pick up the book let me show you what the book looks like so um the one that i got is the one that's taught by mike chapel which is who is actually teaching this class so um if you want to go to amazon get the book i got the kindle version so um if you want to go and just check out what's going on far as the the book this is the book okay so it is mike chapel wrote it and it is also it is really good i've been reading some some things outside of um the classwork so i'm just gonna pull this up just for those that may not even know what this whole thing is even about or why it matters so for a data heads like me um this is a really beautiful thing that comptia came out with because this is vendor neutral meaning that this is not a a test that is done by let's say apple or microsoft or you know lenovo or some company like that it is gender neutral so comptia um has you know people know them for their a plus their security plus their net plus but now they have a data plus um and so i was really excited when they first announced this course it it as you can see this came out in february of this year and we are just on april 1st today so um there are people that have been studying for this i'm assuming since it first came out but i know the books didn't even come out until april because i was i was kind of looking for them honestly so where we're at right now um so he goes through and just to kind of recap bring us up to where we are in the class work so it'll make sense um so it is a 90 question test you have 90 minutes to finish a 90 minute test okay so you have to get a 675 which he goes over in the beginning of the class work we won't be picking up from there you can watch the live stream on on youtube or you can watch the live stream i did on that before but um so this is pretty exciting let's go ahead because we only have an hour uh that i'm going to study and i'm going to keep this to an hour and it's every other day so today is friday of course saturday we'll take off i'll see you guys again on sunday morning where we can go through um and talk so on the weekends it'll kind of be different because the timing will be different but uh for the weekday this is is pretty easy for me to stick to um so let's get started or other data repository so in data manipulation techniques so that's over here let me stop this for a better so i can be clear on where we are and what we're doing so this is linkedin learning if you are on linkedin linkedin learning acquired um a video um teaching system okay it was lynda.com before lynda.com got purchased they turned it into linkedin learning and so you have an option for linkedin learning i because i am a veteran they give us the um linkedin learning um each year they give it to us free which is an excellent thing if you are a veteran take advantage of the fact that linkedin gives you a premium service each year not just linkedin learning but they give you a premium for your linkedin profile so just make sure you do that if you if you have not if you need to know how reach out to me i'll be happy to get you the information of how to do it i've been doing it for years now so i can i can definitely give you some insight on that so this is the data mining portion of it so this is kind of the domain number two and in the book uh so we'll just pull up the book as well so we can uh kind of go along so the way the book is organized and this is what we will start kind of reading to um is broken down i will see if i can get the table of contents here for you so he is on domain 2 right now and that's kind of the way that he's breaking it down as he's doing it chapters are considered to be domains um so we'll just go ahead and get into it and and start taking our notes and if you guys have any questions or something just ping me um i'm here we will be going on breaks so this is not just one full hour straight through the system will automatically uh go through let you know when to go on break it's going to change you're going to hear thunder rolling through and thunder sounds and all of that and it's going to give you kind of a timer so you can always look and see when is the best time for you to you know come and and be part of it so just want to put that out there and off we go one of the best examples of a derived value is a person's age we could store age in a data set like a database table but that would be very problematic the reason is that we have to update everyone's age each time they celebrated a birthday that's going to be difficult for us to track and it's almost certainly going to cause us some data quality issues the better way to store this type of data is to keep the person's birth date in the database instead birth date is an unchanging fact no matter how old i am my birth date remains the same plus i can take this birthday and calculate someone's age at any point in time so it's better for me to have birth dates stored in the database and keep age as a derived value that i can calculate whenever it's needed so let's take a look at how we might do this in sql in the employee table that we've been using so far we have the birth dates of employees stored let's take a look i'm going to select job titles and birth dates from the employee table now if i want to see someone's age i can calculate it using their birth date actually performing these date calculations can be a little tricky and you don't need to understand all of the syntax that i'm about to type what you should know is that one thing that you didn't point out of course you do not need to know like a secret need to know foreign language for the test you just need to be able to understand what you're doing what they're doing here repeated continuously there's a function in sql server called date diff that allows me to calculate the difference between two dates so let's go ahead and apply that to these two values i'm going to run date diff and i'm going to ask for my result in the number of days of taking the birth date and subtracting the value we get from get date of the current date and let's run that and now i have the person's age in this third column i can actually even go ahead and call it as age and run it again but notice these values are a little odd the ceo's age is 19 255. what's happening here is this is in days because that's what i asked for and if i want to convert this back to years i can just divide it by 365.25 to account for leap years and their age as a floating point number so i see the ceo is 52.7 years old so here this age column is a derived value it's important to understand that i didn't actually put this age column into the database each time i run this query it calculates the person's current age and provides that current age in the results that's one of the great things about a derived value i don't need to keep it updated in the database because it's never stored there it's just derived from the values that are stored there the last type of data manipulation that we'll discuss is combining multiple data sets this is a very common task for data analysts who sometimes need to combine data sets that they find in their own organization with each other or with data sets from external sources we'll talk about three different ways that you might combine data sets appending data merging data and blending data appending data combines two data sets that have the same structure into a single data set we do this by simply adding the rows from one data set to the bottom of another data set for example let's say that we have two csv files containing information about sales orders like the ones you see here one of these tables contains data for january while the other contains data for february if we want to combine these two data sets into a single data set we can simply append the february data set to the bottom of the january data set and then continue with our analysis merging data combines two data sets that have a common variable while appending data to a data set adds rows merging data sets adds columns for example we might have one table of data containing information about the products customers ordered tracked by product id we then might have another data set containing product names and prices that uses product ids as the primary key we can combine these tables by matching up each row in the orders table with the matching row from the products table that merges the two data sets together in the world of databases we call this merging a join of the two tables the last way that we might combine two data sets is by blending them data blending combines data sets that come from different sources it's conceptually very similar to merging or joining data sets but blending combines data sets that weren't designed with the intent to be joined together for example database tables have foreign key relationships between them that define how to join them together this is a formal structure and the join is guaranteed to work because the referential integrity rules of the database enforce this structure in data blending we're bringing together data from different sources where these rules aren't enforced instead we're making a best effort attempt to blend the two data sources what our package makes it easy to work with dates [Music] i'm gonna go string probably wrong but then it has to be date math then whoa [Music] so is it is lubric date i don't know anything about r so database tables that you would like to join together using a foreign key relationship what term best describes this active merging [Music] which one of the following values will appear first if they are sorted in descending order and descending order xavier which one of the following is not considered an aggregate function min max i don't think select is you are working with a data set and want to change the dates of categories that you used for different types of books what term best describes this action i want to change the name of categories that you use i don't think it's filtering though but it is recoding interesting okay that one i need to look up the time recoding so let's we need to kind of start a place to [Music] keep our data documents okay so we definitely need to look up we need to [Music] do more research on recoding [Music] you are working with the data set and need to swap the values in rows with those in columns what action do you need to perform transposing when i do want to look transposing up too because i think i misspelled that okay what else we have when analyzing the values of two vowel two variables you decide to convert both variables so they are on a scale of zero the one what term is this action um that is when he was talking about um when you have things that are within a certain i'm gonna go normalization i think that's the term he used okay you would like to combine the text in two different streams to form a single screen catenation [Music] imagine for a moment that you have a thousand page textbook about databases and data analysis sitting in front of you it's a comprehensive book containing everything that you need to know about databases now let's say that you'd like to learn more about foreign keys and you're not sure where they're discussed in the book how would you go about finding every page that contains the words faring key would you flip through every page of the book and look to see if it contains those words of course not that would take hours of your time instead you turn to the back of the book and use the index to find every location where the book mentions foreign keys the index of a book is super useful because it points you at the locations in the book that contain important terms helping you cross-reference the text database tables also have indexes for essentially the same reason database indexes help us identify the rows in a database table that might match our query results without checking each row individually let's say that we have a database table containing millions of rows about each one of our organization's many customers this table might have a primary key of customer id that's just an integer and several other fields containing information about the customer one of those fields might be the state in the united states where that person lives if we wanted to find out which of our customers live in new york we could check every row in the database to see if the customer in that row lives in new york that would work fine for this small sample of the table because we can quickly check these five rows but what if our database table had millions of rows it's still possible but it's not very efficient and it's going to take a long time if we had an index to our table based upon the state field it would look something like this the index tells us which table rows contain different values for state and then allows us to go directly to those rows without checking each row individually indexes like this one speed up database queries when we execute a database query the database will first check to see if there are indexes for all of the columns mentioned in the query if there are then we have a situation called a covered query the indices cover all the columns required to fulfill the query and the query can then run very quickly in cases where we don't have a covered query then the database does need to check every row to see if it is indeed matching this individual row checking is known as a table scan and it's very computationally expensive when you're troubleshooting database performance issues indices are one of the first places that you should turn if users or applications are executing many queries that are not covered by an index that can dramatically slow down the database adding an index that covers these queries can speed things up significantly now there is however a downside to using indexes indexes speed up the process of retrieving information from the database well that's because it has to be written down the process of inserting information into the database every time you add or modify data in a table the database also needs to update the indexes on that table another way that you can improve your query performance is by working with only a subset of the records contained in a database table there are two major ways that you might do this by limiting the number of rows returned by a query and by using temporary tables limiting the number of rows returned by a query is very useful when you're initially designing a query that's going to be eventually run against a very large table as you test your query you might not actually need to see all the results and sending those results from the database server to your computer might be time consuming in those cases it's helpful to limit the query results to a subset of the rows you can do this by instructing the database to only return a small number of rows top 100 to do this differs from database to database i'll show you an example here in microsoft sql server let's begin by writing a query that pulls all of the rows from the person table when i go ahead and execute this query it takes a while to run it's pulling a very large number of rows the query itself was very simple so the database shouldn't have had to spend too much time executing it yeah but we're in like 16 lot of data almost twenty thousand rows it's a long time for the results to reach me you can see this query took 12 seconds if i just wanted to make sure that my query worked and see what types of data it returns i don't need to see all 19972 rows instead i can tell the database to execute my query and only return the first hundred rows to do that on microsoft sql server i just added to my select query the keyword top and then the number of rows that i'd like to see when i run this it executes very quickly it's super fast and this is going to be much more of a pleasure to troubleshoot temporary tables are another way to optimize query performance if i'm doing some analysis where i'm only going to work with a small subset of a table i can save that subset or the results of any intermediate query into a temporary table that the database will keep for my use that table is only available to me and it's destroyed when i disconnect from the database but it can be a great time saver for example here's a sql query that finds the first names last names and job titles of all employees who earn more than 50 dollars per hour of course i can run this query and see the seven people who match these criteria now if i want to use this data later i can modify my query to place the data into a temporary table the format for creating a temporary table also differs from database to database and there are multiple ways to create one of these tables here in microsoft sql server i can modify my select query to place its results into another table by adding an into clause let's go ahead and put that right after the select clause i'm going to say select into and then i'm going to give my table a name that begins with a hashtag that hashtag makes it a temporary table i'm going to call it high pay now when i run this query the results don't display on the screen because they've been placed into that temporary table i can run queries against that temporary table let's give that a try i'm just going to select star from my temporary table high pay and when i run this query i get back those results the seven rows that i had selected so i just put it in a template i can also do filtering and other things on this table just like it's any other database table let's say i want to see only the employees whose rate is over a hundred dollars an hour when i run this now i get just the one row of those seven rows that match my criteria as with many of the topics we're discussing in this course you don't need to know how to actually create record subsets or temporary tables on the data plus exam because the process differs from database to database you just need to understand how these techniques can help you optimize query performance when we have complex data analysis needs there are often multiple ways that we can construct our queries and multiple ways the database might execute those queries fortunately modern databases have become very efficient at optimizing queries so we can normally write our queries in whatever way is comfortable for us and then the database will optimize them to run more efficiently the actual steps that the database will follow when it executes a query are called a query execution plan this plan tells the database how to efficiently perform the desired action now in the next video i'm going to talk about how reusing query execution plans can optimize performance but first let's talk about how we can view and interpret these plans here i'm connected to a microsoft sql server database and i have two queries already written the first is a simple query that just retrieves a single column of data from a single database table the second is more complex joining together information from multiple tables if i simply execute both of these queries i see the results down at the bottom of the screen now i can also tell the database that i'd like to understand how it plans to execute a query if i highlight the first query i can click this button to display the estimated execution plan for this first query i see that there are only two steps to the process i read this from right to left so first we do an index scan of the table which is the most costly operation i wish he had zoomed in on we simply perform if i repeat this process for the second query this query is more complex and i get a more complex execution plan now i see that there are two seeks performed one for each of the two tables in the query then we perform an inner join to combine two tables before selecting out the results finally if i highlight both queries at the same time and then generate the execution plan i see both plans together importantly i also get a relative measure of how much time each query will take i can see that my simple query is only responsible for 35 of the cost of the combined queries while the more complex query requires 65 of the resources used so the second query is about twice as complex as the first query as a data professional you may want to modify or reuse query execution plans you won't need to know how to do that on the data plus exam but you should understand that it is something that can be done and it's an excellent way to tweak the performance of commonly executed queries parameterization provides us with the opportunity to further improve both the performance and the security of database queries they provide us with the ability to create pre-compiled queries that perform a lot of the heavy lifting the first time that you execute a query and then reuse that work to speed up subsequent query executions there are two major techniques that we can use to parameterize database queries and they differ based upon where the query is stored in a stored procedure the client does not directly send sql code to the database server instead the client sends arguments to the server which then inserts those arguments into a pre-compiled query template that's already stored on the server for this reason stored procedures must be created by a database administrator and then they're updated on the server in a parameterized query the query is maintained by the software developer in the application code and then it's matched with new inputs whenever it's executed both stored procedures and parameterized queries improve query performance because they allow the reuse of query execution plans once the database develops an execution plan for the first execution of the query it then reuses that plan for future executions let's take a look at an example this database has a table called customers that contains contact information for a business's customers now i could write a sql query to show me all the customers located in texas i'll just do select star from the customers table and i want those records where the customer state equals texas and when i run this query i get back the six customers who are located in texas now if i had a web application that allowed me to select the customers who live in a certain state i could send a query like this one from the web application to the database server alternatively i could create a stored procedure that allows me to store most of this query on the server let's go ahead and do that i'm going to create a procedure called sp for stored procedure and then customer state and then i'm going to tell the database when someone calls this stored procedure they're going to provide a character string containing the state of the customers they'd like to retrieve so i'll just put in the syntax for that i'm going to have a state variable that's going to be a character string of unknown bar char and i'm going to create that as variable characters from before down here and the only change i'm going to make is instead of specifying the state i'm just going to use that variable that's passed in when the user executes the stored procedure so i'll run this and that creates the stored procedure so now that my stored procedure is saved on the database i can execute it whenever i like the way i execute it right here in sql code is to use the exec command and then provide the name of my stored procedure sp customer state and then whatever arguments it requires so if i want the customers from texas i can just run this query and there you go if i look instead for customers from california i find seven customers in the state of california so in addition to improving the performance of our queries using a stored procedure like this one also provides some security benefits it helps protect us against a type of attack called a sequel injection attack no matter what input the user provides the user can't alter the underlying sql statement that's already stored in the database all right we have a quiz another one you have uh let's close this down so you have a database where queries are performing slowly investing investigating the results you find that the is performing a time consuming table scan what can what what actions can best improve the query performance uh [Music] you could add an index would be my guess because the index will allow you to have a um it's instead of it having to go through every single entry within the database it can just go to an index so for example if you had all of your customers stored for the state of california in a specific database instead of it having to go out to that you know like every database and scan every database justify california you could have it go through and figure out one of two things you can have it go to the index in the index you would have let's say a number one would be for all of your california customers then it would go to that particular index go to one and know just to go to that base which will make the query a lot quicker you are a database administrator and you create a precompiled query that is stored on a database server what have you created i'ma go query execution plan no let's replay that real quick perimeterization is what we were supposed to go through and for some reason it's planned extremely slowly but okay i guess but the answer is here no okay so you are a database administrator and you create a precompiled query that is stored on a database server whatever you created a stored procedure which is that one where he shows the okay that's where it is he shows the reason why it is a stored procedure is where he shows when he was showing the almost like a schematic of the way that you've put together your queries so you've written out a format of how the queries should be okay make sense thanks for watching my comptia data plus data mining course it's been a pleasure joining you for this journey through the world of data that is my topic completed this course you found it interesting and helpful in preparing for the data plus exam you should now have a solid foundation for the data mining domain of the test if you're following my data plus course series in order you should now turn to the data plus data analysis course otherwise you can move on to any of the other five courses in this series good luck with the rest of your test preparation all right so now we are going to first of all download our our certificate and we are going to um [Music] do one of two things we keep all of these little pieces of everything that we have done um so i am going to do one or two things i am going to download the pdf version of this and i think i am going to create a new folder um specifically for this class um and i need to do this with my other ones um so that i have all right so we've got a folder on the board so we are going to put this here we are also going to um make sure that there is a certification link so we are also keeping track of this on my website so i am going to we're going to post that onto linkedin as well but before we finish we need to go ahead and make sure that this is up on the website so i do have a um i do have a squarespace account and we're gonna go through real quick and just put it there um put it up on the page i'll do it with you guys here um we need to be in a different section because we are not in that section we are in this section so we need to be in the data science world and we need to put the post in the data science world and um not sure let's see what they want us to do so uh we finished and i know some people would be like well why is that even a thing like why why do you care um documentation purposes that's why i care so this will put out um [Music] this tracks that data so if i want to put in i think i'm just going to put in a link here because they gave you a link but i'm not sure um uh so no we are not doing that we are going to get rid of that part and we're going to figure out how to do this much quicker than this because i guess we could just put the um i think the file that they gave us is not a um yeah it's a pdf document so i don't know that we would it would be better to have that in a um [Music] we have that in a form so we should probably put that if they give us the download certificate portion and we do it as a pdf i'm assuming we should be able to just put that in here because it is they just gave us a link and i'm not sure what that link was necessarily for but they did give us a link let me see what they we go here [Music] okay so all this is is a link itself okay got it so we should probably just grab this content because it makes it so much easier to just put the content in there as well as the form which i think is a pretty good idea that we do that i might have to convert this image to [Music] i might have to convert this image which i think i'm going to do i'm just going to convert it from a pdf to a jpeg which i think will be easier for the system to um to be able to really work with is to not have it in a pdf and just have it as a jpeg we'll throw it up onto the website so that the content is there for people who um so let's do this let's go ahead add this image real quick [Music] there it is we created it on the other screen there we go we are going to write the content right here um we need to make sure that that is in the seo part i know if anybody is um [Music] all right so [Music] we want to make sure that we have all of this stuff you always want to make sure that your name and information is there very important um that you do that if you are going through to put content out you want to make sure your name is there it is extremely important that what you put out is there for people to see so that is that is where this whole link portion comes in i think as well is that if we're going to put the source url in there um [Music] we will we will put information actually you know what i think i'm going to keep that clear i'm not going to put that in there but this that's all that you really need to do and outside of just go ahead and publish it and make sure that it goes out so once that's done it'll be out it's on the page we'll go back to our studying here so what we've learned from this is as the content goes through we need to make sure that there is so there are two things that we need to follow up on for learning is what is recoding and what is transposing um so that is a title i did get another one wrong and i didn't write it down at the time but i i probably should do that the next one so we're moving on to the next his next class so we have like two hours left of this course we won't be here for two hours we only have about another 20 minutes or so left in this course we'll have one break here coming up um in a couple of minutes and then we will go through and we will work on the next part in the next class and as you can see um we don't have that much left to go um in the course so we basically have about an hour and then it'll tell us on the next part how long the other portions of the courses will take um but we need to always make sure that we are staying focused on just keeping it simple so that's what we will do we will we will definitely keep everything very simple um in our study group and working i know there is a couple of uh you guys that are um overseas and we're kind of working on this and the time difference is there i get it um so what i was thinking to do is i was considering if we should open up a slack channel and just kind of keep our content in a slack channel i haven't really uh i'm thinking that that's the way i'm going to go is to just open the slack channel that way if people want to come in they want to um you know leave messages and i'm not available at the time they can they know you can always go to the slack channel leave theirs there if there's anything that's going on um they can quickly get answers um by other members of the team but right now i haven't i haven't really uh fleshed that thought process out so i guess i should add that as kind of part of our to do um i use todoist um so you guys don't have to use todoist i just use todoist and i think i'm going to just um keep a portion create a whole entire just project for that um not even just a whole thing but like an actual project i don't think i have one yet let me make sure i don't have one no i do not have a project for that so we're gonna create one and we're just gonna call it um i've got quite a few others as you can see when i was working on some other certification exams um i had class work that i needed to do we're gonna make it a really you know like bright color and that type of thing i don't know if i want to do it in the board yet i i think i'm going to just keep it in the listing side of things um okay so we definitely need to um we need to confirm if we want a slack channel um we also need to research i guess i could have did that in here too i didn't really think about it because i don't really use um the so i need to we need to i need to research [Music] all right so that's going to add that so we need to research recoding what that means [Music] um and we need to look at transposing we definitely need to look up um research what transposing is and kind of have a definition and data set or or so of what's going on with that um again i used to do is you guys don't have to use it it just that's my thing that has really worked quite well for me um to go through but now we're going to proceed forward with the data analysis part i'll wait till after the break so we have a break in about a minute as you can see there's always a timer by the way if you haven't looked there's a timer in the bottom right hand corner on that will constantly go and that's kind of our color of purple uh kind of keeping with that purple theme um you can always look there to kind of keep up with where we are usually the way that i originally set up this class was for a two-hour session so you would have two sets of breaks that are set in uh so what will happen is this will automatically move forward to the breaking time and that you will have your first break um and your first break will usually be about like five or ten minutes gives you an opportunity like right now we're gonna go ahead and stand up so if you have a standing desk let's go ahead stand up and for those of you that don't have a standing desk you'll be looking at my stomach as the dust rises but what we're gonna do um is stand up so we're not just kind of sitting down the next portion of the class we're gonna do uh standing up and we're just gonna we're gonna wait so we're then when we come back we're going to do what we can we'll have about another 15 minutes left so we may get one or two classes left this class will end at 7 00 a.m central standard time so that you guys are aware of that part i appreciate everybody that's shown up to the course we've had a couple of people come into the class the study group and then they either had to leave really quickly or not but it that portion doesn't matter just understand that you can come in study with us this is a safe place to make mistakes i just want to make that portion clear as well so you guys should be getting ready for your break we are going to get i'm just going to check a couple of things make sure that we have if there's any notes i should probably put this out as well where people can publicly so hey welcome back you guys so um let's go ahead and kind of talk about um i want to talk about some things so that it is um it is clear um i don't think you guys can quite see um what's going on so let me at least get my camera back on here so you can see um so we have about five minutes left and so what i wanted to talk about and that five minutes is to really give you guys kind of some insight into um what i think we should really work on um what we should really look at and kind of um [Music] maybe deal with so okay we're gonna put you out on the screen so if you have not purchased the book so right now we are going through my chapel's course on linkedin learning yes what we will do as we proceed forward is we will go through and i think i'm going to map this out so that it's a little bit clearer what i what i think and you guys can chime in like i said i should probably see what does uh what does what does slack do um what we can do with slack so [Music] what is the pricing for slack so well it doesn't cost anything for you to do it we could we could totally create like a slack channel um if that's something that you guys are yeah we could totally create a slack channel and just have the content there so that if someone wanted to go to the slack channel that might be studying that's in this group that you guys might actually want to um want to do give me a second let me change this back out because we need to be in the correct place um i am actually tracking the time so um i use a app called timery and time marie uh is i'm sorry if the app on my mac is called timery but it's actually through toggle and toggle allows you to i'll show you guys that as well um so toggle is a a like for you to track your time so if you're if you're just looking for uh something to time track with and this is the way that i'm gonna time track because it just it works for me um you can you can do it completely for free so um when you get ready to time track it just allows you to be able to go in and say like if you look at the top of the screen here i have a browser that's set up and it gives you is just a click so you go in you set up whatever project that you're working on and bam you can track your time so a couple of things we need to take away from today's course is um we need to we're going to work on researching recoding transcoding at least that's what i'm going to be working on now far as the slack channel is concerned i do think that i would create a slack channel if there was a number of people who actually were looking for ways to communicate maybe once we get further along i could create a slack channel if there's a need but i will leave it on as a to do on our to-do list um but far as the study and portion i'm gonna create a section for um all right so that we're going to move these into that section and then we're going to keep a section for kind of to do so it is seven o'clock so i appreciate you spending time with me taking the time to just chill out and to work through this work and that's what we're going to do sunday i'll see you guys then we are going to go through the next section and right now the next section is on data analysis so i'll talk to you guys on sunday bye