Transcript for:
Overview of Microsoft's Data Mining for SQL Server Analysis Services

everyone in this video we are going to take a look at Microsoft's data mining for sequel server analysis services this demonstration is not exhaustive it's just kind of an overview of how the piece is going to go together and what it means so let's say it's going to start first of all I've set up a demonstration similar to Microsoft's tutorial one that they have on the internet they the tutorial and my demo revolves around using adventureworks data warehouse 2012 so I have included that in my little demonstration here we've got an MDF and LDF file that needs to be attached so once those get attached then we can start working with it I've got this little script file and we're going to sum things up now first of all the scenario is that we have a list of customers that I have bought things from us this is in our database but the deal is is that as we've looked at the data and reports over the over time we've seen that customers buy our products more if they don't have children so apparently these are now adventureworks is expensive bicycles apparently those there'll as show that reporting it showed us that people with no children are more likely to buy very expensive bicycles so whether that's true or not who knows but in this fictitious scenario that is true and so we're going to have to work with that now the other part of scenario is imagine we're opening up a new store we want to send out some advertising to you know people in the area and we want to offer some kind of discount on new products or whatever but we want to target only people who don't have children so we look around to find a mailing list and we find one but it doesn't include the number of children so we are going to have to try to predict the number of children based on other facts and that's where the whole thing starts now you can lots of holes into the the theories of predicting how many children people have based on how many cars they own and whether they're married or whether they own a house or whatever it doesn't really matter that the fact is is that this is just a scenario that will highlight how data mining might work as I said the database has to be set up I've already got it set up so as long as I put it in the right folder and run the code is you're attached and next thing I do is I want to go ahead and build a different database this one is actually database that I call data mining demo and this is going to be the one that were we're working with for a mailing list now I don't want to type out a mailing list so I'm going to cheat a little bit what I'm going to do is and we'll go ahead and steal some code that's giving a seals of data from adventureworks and pretend it's a whole new customer so must steal things that are going to be important such as marital status and gender income total number of children and date now this is our existing customers when I go ahead and put it in their eighteen thousand then I'm going to again come up some fictitious ones this is ten thousand off the same database but I'm not getting their name I'm not getting their older individual customer key I'm making up a new one so we're going to pretend that these are new customers we don't have a name form but we do have information that we can use to kind of predict the number children so that's marital status gender income okay so now this is the data we have in our know not one that's correct I just never focused on that anything about that anyway so this is what we have in the current customer list you can see we have customer key and we have birthdate and various other things this is the potential customers they serve we have four potential customers you can see the numbering systems different sort that one versus this one starts eleven thousand and we can make reports against that of course I mean if I go through and use a group by statement I can start seeing some demographics with that in fact that is the classic way we have in mining data looking at this report I can see that the total number of children home in our current demographic is zero when these combinations appear occur the combinations are let me actually add another one total customer and quite worth it now okay so total children and then total customers I guess I really wanted to sending alright so the most likely scenario is somebody single their mail income is three thousand they're most likely according this to buy one of our bikes basically followed by single females same number and then followed by single females with 40,000 the males with 40,000 and I'm seeing a pattern here so looks like single is definitely more important than gender as far as being one of our customers income oddly enough you would think that this would have a bigger factor but I'm not really seeing that and then it starts you know coming out to married folks with within a certain income bracket but the point is when I said likely to buy a bike that really likely have zero children not likely to buy a bike but the sense of the made-up correlation is that if they have zero children they're more likely to buy a bike and it's kind of what we're looking for well we can get some automatic reports about the patterns in our data without having to write code like this using analysis service data mining that's what it does it gives you simple little reports automatically by going through a wizard putting in some data to to look at telling it what kind of algorithms you want to use for grouping things together and then looking at the pre-made reports and that's kind of cool you can also go through and do some predictive things with it but that's a little bit more hit or miss depending on need how it's garbage in putting garbage data into the system you'll get garbage data out of the system but the reporting pieces is great solid although it's nothing to get too excited about that's that let's go ahead and take a look at that so to do so I'm going to just open up this prima project I have and as you go through you go out to Microsoft's tutorial you go through it to tutorial they'll walk you through the process of getting all set up it's a pretty simple basically you just have to have gone through and set up some kind of connection a data source view I got they have current customers I'm going to use the current customers for my data mining model and later on be able to query the potential customers and then when I set the data mining structure I'll go through and walking through the the wizard choose different like binding models so it's not too bad actually you just go click on here say next some existing relational database or you can call it from a queue API one so now you choose the type of algorithms you want to use a money model as they call it it doesn't really matter much because you can always add on more neural network sounds fun now I said doesn't matter much but that's not quite true because each algorithm is a little better about giving you report data on certain prediction certain correlations so you have to use the actual proper one for like regression reports or clustering reports etc there's more information out there again on Microsoft side but in this case I'm just showing you how that the wizard works so I can click on the current customer because that's what I want to actually have it give them like reports on and later on I'll use that for predictive then I choose what is the key column while I've got name key it helps a lot and then the inputs I'm looking at are like gender marital status yearly income the take truth the birthday would probably be quite handy date but I'm leaving out for right now and then I go through and say like total of your children so I'm trying to pit it predict and then they can be individual values nearly income could be discretized in which case it goes through and groups things together for me which is kind of handy total number children that is a discrete value so single value single value single value this one again it's kind of nice to group together goes where you get a lot but I might go ahead and even add discrete because it's already been blocked off at 20,000 30,000 40,000 anyway I'll go ahead and say next and then we have to figure out how much of them was I eighteen thousand link was eighteen thousand rows or want to use for testing the the data and the testing part is really handy because then it can go through and run through the algorithm gives you its predictions and then test to make sure that how well the predictions worked so we actually know that certain combination has zero children and it comes out that the guess is zero children that's a match the testing is important so leave like that and then just go through and give it some kind of name the data mining model is going to be this is they're all networked so how about in it and then the current customers is a collection of the structures collection model so I'll go ahead and use models this allowing drill stir is kind of handy because you can go through and kind of roll down into an individual record will see what the combination was and you can turn on later well glad to pinch it so now I have this I have to in order to be able to view the reports I have to actually go ahead and publish this out to announce a server and with that done I can now go through and take a look at these reports and sure why does that but it does so in this report I can see that certain combinations are you know favoring one value over another so like the difference between one child and zero children a single value strongly favored parables favor 0 a marital status of married or favoritism one so you can get comparison values here now these reports like I said aren't really all that fancy then you usually get a couple of different reviews this one here is the generic statistical breakdown so those who are statisticians out there they actually get quite excited about this part because now they can really get at how it did its predictions but that's the way it works and then you can you can go through and take a look at what they call lift chart which case it will go through and show you some basic predictive capabilities I should set it to a particular value let's say I'm trying to predict 0 and I'll go through and show me that the random guess model would be you give me 100% of the the answers and I'll be able to to data and I'll get the other predicted you know 100 percent of time with the values but and then the ideal would be that with a very short amount of value of data we'd be able to make a high level predictive predictions and then continue on you can see that our neural network is actually performing certainly better than the random guess but certainly not ideal and what we do is we go through an add-on other data mining models like here click on that and add another one see Microsoft clusterer process again and then after that we do be able to try that one out and see how how that fared in comparison to the neural network now I'm going to this really quickly and like I said there you don't just like randomly have to to guess which one's going to work with some additional study you'll be able to choose between which one's the most likely to work the best for you but this is the basics of how it works and you can see now I'm getting additional reports again there nothing really fancy but they do give you a lot of information for a little effort remember that I was able to to create a simple report and to extract out some information by just using a simple group by this is just yet another way of doing that to give you information about your your current data and from that perspective a this is quite handy I can take a set of values I can pop them into the system very quickly and then give some useful information about the makeup of that data without having to write a whole lot of group by statements in and of itself that's pretty handy the prediction piece again it it only works as well as it's only going to work as well as it can based on whether the algorithm is of use to use to you and of course the data you put in determines what kind of useful information you get out well that's it I'm going to recommend you go out to Microsoft's tutorial I take a look at the basic data mining tutorial this will cover a lot more what goes into it there's also a extra this is just the overview documents but there is also like I said a tutorial that you can do this there's actually a couple but just going through the basic one will give you quite a lot of useful information about how this works and I think it's well worth your time to go ahead and look at that if you're the least bit interested in Microsoft's data mining now you'll notice that the data mining stuff tutorial and documents are pretty old now their sequels over 2014 they've never updated to a newer version I'm not sure if they're planning on doing that soon so not sure with that for roads for the future of analysis of a stat of money my guess is that they're probably working on something a newer structure with the tabular support so they can include it in power bi as well as an Elsa server but I really don't know all I know is that this tutorial does indeed work with sequel Server 2016 so it's still worth going through a panda checked out it does work so you should be able to go ahead and go through it and learn more about Microsoft's data mining tools okay that's it for this video take care