by a computer randomly, but the people refused to answer; they refuse to take part. This is a huge problem in statistics nowadays, and it gets worse every year. It's getting so hard to get data on people. Even the US Census: there are so many people who don't fill it out. We really need that data so we know what's going on, but so many people don't fill it out. So this becomes a real problem. Non-response bias is one that I think even the best statisticians, data miners, and data scientists have a tough time with, because so many people now are worried about identity theft, or they're worried about getting a virus if they click on something. So it's really difficult to get good data.

One place you'll see this is online surveys. If you have a voluntary response survey, where you're putting a survey online somewhere, for every one person who clicks on it you probably had 20 people who saw it and said, "No, I'm not clicking on that." We said that was actually one of the problems with voluntary response samples: such a high volume of people don't answer, and the only people who do answer have a certain characteristic. Usually they're either bored or they feel really, really angry or upset about the topic. So we tend not to get very representative samples. Sometimes that all falls under non-response bias.

But even in a simple random sample, or in the US Census, we're getting a lot of people who are not answering. They're not filling it out, and that's a big problem. Even if you can do a random sample, here's what a real random sample looks like nowadays. You have a computer randomly select a person from the population. You call them up, and they don't answer; they don't want to talk to you. The
computer picks another person, and you call to try to get hold of them. They don't want to talk to you; they don't want to give you data. Then you have the computer randomly select another person, and they don't want to give you data. The computer selects another person, and they don't want to either. Maybe by the fifth or sixth person we finally get someone who will talk to us. That's becoming a huge problem now: maybe one in every five or six people we actually call will give us data.

So this is becoming an issue. I think one of the things statisticians are trying to work out, especially because they know all the advanced mathematics behind this stuff, is how you could account for how much non-response bias you have. Data has so much non-response bias now that it's becoming a serious issue: how much is that affecting our calculations and our understanding of populations?

Okay, one quick word. My students always confuse non-response bias with voluntary response samples. I'm not sure why; they do sound a little bit alike, but they are quite different. Remember, a voluntary response sample means you put a survey out into the world and allowed people to select themselves to be in the data. In other words, they chose you. You didn't choose them, and the computer didn't choose them; they chose themselves to be in your data. That's a voluntary response sample. Non-response bias means the computer randomly selected them, or you selected them, but once you have selected them they don't want to talk to you. That's the big difference.

So in a voluntary response sample, they're selecting themselves to be in your data set, and you're allowing that, which is not a good idea. With non-response bias, you selected them, or the computer randomly selected the person, but once you've selected them they don't want to talk to you. That's a good way to think about it. They are different, but so many of
my stat students always get these two things backwards, maybe because they just sound a lot alike.

All right, let's look at our last one. We saved the best for last, right? What did we say? There are lies, there are really, really bad lies, and then there's statistics. So sometimes ethics comes into this. Let's look at deliberate bias. Now we're talking about really shady, deliberate stuff going on, and it does happen.

Falsifying reports: just a few years ago, one of the biggest pharmaceutical companies in the world was caught falsifying reports. They were supposed to do regular checks of their medicines to make sure each one had the right amount of medicine, so they were supposed to randomly select pills to check. It was determined that they had been falsifying those reports. They hadn't been checking any of the medicine; they were just falsifying reports. That's really bad stuff.

Deleting data is a very common one. I've seen this. Think about a business that's collecting data. Some of the people said they thought the business was really well run, and some people said the business was terribly run. But then, all of a sudden, all the data about people who said the business was terribly run somehow got deleted. "I'm sorry about that, we just deleted that data." And then all the data you have left shows that people love your business. So that's some shady stuff. Now all of a sudden the data is indicating that the business is really well run, because they just deleted everybody who said that the business was poorly run. That becomes a real issue. Again, you can have a random sample or a census, but if somebody deletes all the data that makes their company look bad, the data is no longer really reflective of what the population thinks about your business. So deliberate bias
is kind of the shady stuff: falsifying reports, deleting data, those kinds of things.

Also, conflict of interest. This is why it's often better to use an independent statistics company. First of all, they know what they're doing, and many businesses don't have really good people for this. Maybe it's a small business and they don't have the money to pay for data scientists or statistics people. This is also why every business in the world is now scrambling to get data scientists. If you're taking this class, you're stepping onto that path. Businesses need data scientists to work through data and help them make good decisions.

But you can have a conflict of interest. If you're doing an article about chocolate and the benefits of eating chocolate, maybe that shouldn't be done by a chocolate company. Maybe you should have an independent statistics company doing that experiment or collecting that data. So conflicts of interest can come into play. When you read a study, always think about who paid for that study, or what company did that study. That's something to think about as well, because we want to make sure this stuff isn't happening.

So basically the takeaway from this is: yes, the way you collect data matters. You should have either a census or a random sample. But a census or random sample does not guarantee that the data reflects the population, because you could have one of these other biases going on: something shady, or even just a high percentage of people in the data who were lying about their answers. There can be all kinds of reasons why data does not reflect the population besides just the way you
collect data. That's why we always have to think about this. When we look at a data set, we want to ask, "Is this data unbiased?" That's a really good question. I was talking to a statistician a while back, and I asked that same question: is there any data that's unbiased anymore? And she was saying, "Well, that's a great question, because it's very difficult to get truly unbiased data that doesn't have any of this stuff."

All right, so I'm hoping that helps you. This was our topic on bias. This is Matt Teachout and Intro Stats, and I will see you next time.