Transcript for:
Understanding Bayesian Inference and Statistics

okay, welcome everybody. What I want to do today is give you a conceptual introduction to Bayesian inference and data analysis. By conceptual I mean we're not going to do calculations; my goal is to help you understand the philosophy and an outline of the procedures, so you understand the purpose of doing this. If you take one message home from all this, it is that Bayesian inference is a very humble approach to inference: it's nothing more than counting assumptions. I'm going to show you what that means and how that's connected to scientific models, and that's really all I hope to achieve today. That will take us two hours, I think, and it's still a massive compression, because the truth is doing science is very difficult. But I hope you leave today with something interesting. So, this is probably most people's impression of what statistics means: a flowchart like this. You've got some data. Where did the data come from? Who knows; it fell from the sky, your supervisor gave it to you, you found it on the internet. Anyway, the data arrives. Statistics, as it's normally taught, begins after the data are produced. Then you think about statistics for the first time, and there's some flowchart, and you ask a series of confusing questions and answer them, and you end up with some test that you will find in SPSS's menus, and then you execute this test and it tells you what's true. Now, all of this is wrong, of course, and that's not controversial to say. Statisticians hate this, but this is something that scientists have generated, because it satisfies the demands that scientists have to produce papers. So what I want to do is back out of this and think about what philosophy would help us decide what we should do instead. That's a complicated problem, and a lot of this is embedded in history. So why do we use these two boxes of tests? The reason is because
there are lots of legitimate scientific contexts in which they're useful. These tests that are the workhorses of statistics largely come out of the Fisher and Pearson tradition, which is very much focused on agricultural trials: large sets of replicates, trying to figure out which fertilizer, which wheat variety, which kind of barley you should grow. A lot of this was focused on beer production as well, and so a lot of statistics was worked out on things like the t-test. The t-test was invented to test Guinness beer; that's not a lie. In those circumstances, what do you have? You have large numbers of replicate units, you have experimental control, and things like t-tests are really legitimate and powerful tools. Analysis of variance was invented at Rothamsted Agricultural Research Station, one of the longest continuously running scientific research stations in the world, in Rothamsted, England. R. A. Fisher, the famous biologist and statistician, worked there and had, probably for a single person, the greatest impact on statistical practice; out of that came ANOVA, the t-test, and a lot of those things. But in many other scientific contexts those tools don't make any sense at all. So here's an example. Actually, I should ask: does someone know what this is? Let me guess, you know what it is? Yes, this is Mars. From an observer on earth, if on multiple nights you plot where Mars is (this is a composite photo, over multiple nights, of the position of Mars in the night sky), then from our perspective it does a loop. "Planet" is just the Latin word for wanderer; this is why they're called planets, because they wander, unlike the stars, which stay fixed. They're not really fixed, of course; they just move really slowly. But the planets wander, and explaining this path of motion is one of the famous achievements of scientific practice: trying to understand what really goes on,
in a long history of constructing scientific models to predict this. t-tests don't get you anywhere here. What you need is some scientific model which makes predictions about the path of the object in the sky, and to do that you need some bigger model, a 3D model of the solar system, that you then project onto this visual sphere that encases us. This is a big problem, and if you've studied this at all you know what the solution is; it's complicated. But there's no population of Mars, no population of Marses, if I may invent a plural, that you're going to use to think about it the way a t-test is derived, or an ANOVA. You've got one observation, and the question is: what explains it? It's a completely different framing. So to do scientific inference, the question would be: what's a foundational approach, a philosophy, that can handle both of these contexts, and other contexts too? To continue problematizing this topic a bit: machines can learn now; they can even read. Here's a robot reading the New York Times. Yeah, the International New York Times, which I don't recommend, by the way; this will give you biased judgments. But these robots can read now. Do they understand what they're reading? No, but they can read, they can speak it back, and they can summarize it; they can do lots of things. But this robot is not doing t-tests, and it's not doing other tests; it's learning. And it's learning in a feeble way compared to even an animal like a squirrel or a starling, which regularly learn the fuzzy statistical relationships that they need to survive in their environments, all the time. Those animals aren't doing science; they're not collecting data and running ANOVAs. How are they learning, and what can we learn from those procedures as well? So the goal is to have a foundation for understanding inference in general. What is the theory of learning that can encompass all these things and help us understand what's
different and what's the same about them, and can we use that to then improve how we do science? So the metaphor that I always use in teaching is: instead of thinking about statistical models or computer programs as robots, let's think about them as golems. First, why not robots? Well, robots are good at stuff, except for walking, which they're terrible at. This is the irony: they're good at counting, bad at walking; we're the reverse. "Robot" sounds like a sophisticated piece of technology, but a golem isn't going to sound like that to you. A golem is a monstrous automaton; this is a myth, I think, in Jewish folklore. Take a bunch of clay, say some magic words, and you can make an artificial slave that can defend you and do hard work for you. There are various golem legends, the most famous of which is the Golem of Prague. If you've ever visited Prague, you've seen the little gift shops with golems and things like that; if you haven't visited Prague, do so, and then you'll see the Golem of Prague stuff there. There's this story about a rabbi in the 1500s in Prague who constructed a golem to defend the Jewish population of Prague against anti-semitic violence. In this story, though, he ends up deconstructing it, because the golem is unthinking; it just obeys commands. In lots of stories like this, if you're not very careful with your commands, you end up causing harm. The golem doesn't understand anything at all; it just has its programming and executes it. It's good at some things, it's incredibly powerful, but it has no wisdom. And so in the end he ends up decommissioning it, because, well, part of the story is that you're not supposed to play with the power of creation; it's very, very dangerous. When we do stats, we're playing with the power of creation. We're making these very sophisticated learning algorithms, and if we're not very careful with them and we don't make the effort to understand how they work, then you
can wreck Prague. This is the metaphor. So let me extend this silly analogy to its climax. On the one hand we have the golem. It's made of clay. It's animated by truth; that's literally the word in Hebrew that's written on its brow that brings it to life. It's very powerful, much more powerful than the person who made it, but it's blind to intent, and so it's very easy to misuse. And of course it's fictional; it probably never existed. But it's a very compelling story. Statistical models have a lot of similarity. They're made of silicon, at least in that computers are made of silicon right now; it's a bit weird, but bear with me. They're also animated by truth: we'd like to know the true causes of things in the world, and we want statistical models to help reveal those causes. We hope they're powerful, and if we're good, they will be. They're also blind to their creator's intent, and so statistical models will behave in ways that are unanticipated; the better we understand them, the better able we will be to avoid those unintended consequences. So they're easy to misuse. But I don't want to say that models are false. There's this saying from a famous statistician of the last century, George Box, that all models are wrong but some are useful. I don't like the saying very much, and I get that if you're a beginner it's a very nice saying. I worry about that saying because it almost seems like we'll stop criticizing a model because it's unrealistic, whereas I think we should criticize models when they're unrealistic. Instead I want to say that models are not even false; it's a category error to even debate the truth or falsity of a model. They're constructions. It's sort of like saying, if I go over to my friend's house and he's building a table, and he's using the wrong tool to construct the table, like he's trying to nail it together with a screwdriver or something, it isn't that the screwdriver is
false; it's just the wrong tool. Models are tools, and it doesn't make sense to say that they're wrong or right. I hope that makes sense. This is a complicated thing, and it's orienting you towards the idea of selecting the right model to help you learn. But models are not false things; they're constructions, bits of technology that we build. Hammers are not true or false; screwdrivers are not true or false. You make a choice to use a different tool at a different time. These are bits of technology. So what is Bayesian data analysis? It is Bayesian inference applied to the scientific analysis of data, and it's a very simple and humble approach. All it means is that we're going to use probability, which I'll define as we go, to describe uncertainty, that is, our lack of knowledge and precision; again, I'll define this in a more precise way on the slides to come. One useful way to think about this: if you know basic binary logic, truth tables, constructions of true and false, then Bayesian inference, Bayesian probability theory, just extends that to continuous possibilities. Now things can be somewhat plausible instead of just true or false. We can have something that's 100% plausible, and that's the claim that it's true, or 50% plausible, and so on. It's a logical extension: if you extend ordinary binary logic to continuous possibility, the only thing you can end up with is probability theory. It's the only correct, proper extension of it. It's computationally difficult, even though it's very old. Bayesian inference is older than frequentist inference. It goes back to, well, Gauss. People here have probably heard of Gauss, who used to be on German money; but then you got the euro, and he's gone. I'll show you the ten-mark note later in the lecture. Gauss used Bayesian inference to derive linear regression around 1800; linear regression was originally a Bayesian procedure. It's in his 1809 work on predicting
the return of a comet, which got him really famous when he was like 25 years old or something like that; he was kind of smart. On the continent, German and French mathematicians used probability theory (it wasn't called Bayesian at the time, because it's all there was; there was just probability theory) to do all kinds of scientific inference, a lot of it in astronomy but in other areas as well: ballistics and many other things. And they worked on relatively simple problems, too; they didn't have computers, so they did what they could analytically. For the contemporary sorts of statistical models we hope to fit, you need some fancy technique to avoid doing the mathematics, and this is a technique you may have heard of called Markov chain Monte Carlo. I'll say a little bit about this later. It's just a way to get the computer to do the math for you; that's all it is, just a trick for that. So, I said Bayes is older than frequentism, and in England, around a certain period of time, it became quite controversial, because the frequentist approach was taken up by some very powerful people, including Sir Ronald Fisher, the important theoretical biologist I mentioned on the previous slide, resident at Rothamsted research station. In one of Fisher's very often cited and highly influential books on how to do statistics, in which he talks about ANOVA and so on, all he says in the preface about Bayesian analysis is that it must be wholly rejected. He doesn't explain why. Now, elsewhere he had explained why he thought it should be wholly rejected, but here he just dismisses it out of hand like this. He's also the one responsible, as far as we can tell, for calling it Bayesian; before that it was just called probability theory. He's the one who said it's this "Bayesian" stuff, after a scholar named Bayes. Okay, so what is Bayesian data analysis? It is nothing more than counting the implications of assumptions. You can put it like this: if we count all
the ways the data, by which I mean the observable variables (we take measurements, and something we've observed is an observed variable), could have arisen according to our assumptions, and then we rank those different ways relative to one another, then the assumptions under which the data can arise in more ways are more plausible. (There will be examples of this to come; you're not supposed to understand it from just this slide.) That's all Bayes is. This sounds like an incredibly silly thing, but it turns out to be really, really powerful, and this is why the title of this talk is that Bayesian inference is just counting, because it really is; it's just counting. Now I'm going to give you examples of this: we're going to make some assumptions about the processes that could produce data, and then I'll show you how they're counted. There are going to be no calculations; we're just going to count, and that's all Bayesian inference is. So when you run a fancy Bayesian model on your computer, with the Markov chains and all that, it's just using a very indirect technique to do counting: counting over infinite sets, which sounds terrible. How do you count infinite things? Well, we can do it; this is what calculus is for, and calculus is the way of doing things like that. So let me say this again, and then we'll go into actually running some examples in a few slides. We commit to this view: we've got a series of plausible processes that could have produced the data; for each one, we ask how many different ways it could have produced the data that we see; and then we rank those relative counts of ways, and those are plausibilities. Probability is just plausibility done this way; it's just counting. Everybody agrees with that, by the way, even frequentists: probability theory is just counting. But there are some differences, so I don't want to say much about the
standard frequentist view, because I don't want to pick an argument with it. This is, in my opinion, not the important thing; the important problems with statistical practice are not that people aren't Bayesian. I'll get to what I think the important problems are later. So this isn't the big deal, but it's nice to know what the difference is, because sometimes it's very hard to be a frequentist and much easier to be a Bayesian, and vice versa. The frequentist view is that probability is defined objectively: it is a limiting frequency of an observable variable. You've probably heard this before. The Bayesian view is that a frequency and a probability are different things. Even when a probability has the same value as a frequency, that doesn't mean it is a frequency; it's just a counted relative plausibility, given your assumptions, and it doesn't have to equal the observable thing. Why would this matter? In the frequentist view, uncertainty arises from sampling variation; in the Bayesian view it doesn't. In the Bayesian view, uncertainty is your internal uncertainty, or the uncertainty of the machine: there are still plausible ways that other processes could have produced the data, but it's not because you have multiple samples and variation among them that generates uncertainty. In an agricultural trial you won't notice the difference between these approaches, because they end up numerically almost identical, but there are some cases where they behave very differently. So consider this image. This is Saturn. I manufactured this by taking a real picture of Saturn and blurring it, but this is how Galileo, when he looked through one of those early telescopes, saw Saturn. We know this because he kept notebooks where he drew this, and it's basically a blob with two little blobs on the sides. And he says, this thing Saturn has blobs on it; what could they be? Then he starts sending off letters to all of his
colleagues: I saw some blobs on Saturn; there are orbs on orbs. We now know that there are rings around it. But it's this question: you see an image like this, you've got a primitive telescope, what's the real image? This is an image-resolution problem, like on those crime-scene-investigation sort of TV shows where they do magical stuff with imagery. Those are Bayesian algorithms; things like that exist, and they're Bayesian algorithms. There's no sampling variation here. It doesn't matter how many times Galileo looks through the telescope; he's going to see the same thing. So the frequentist view doesn't get you started on this. You need a model of generative processes that could produce this image: a series of images which, given the scattering of light, would produce this view. Different underlying images would produce very similar views, and so they have different plausibilities, and the Bayesian calculus handles that. It lets you go from a hypothesis about what the true image is to how close to this view it would produce, given some knowledge about how optics works in this case. Okay. So the summary you can take home here is: in the Bayesian view, probability is always epistemological. It's not an objective fact about the world; it's the internal uncertainty of the machine. So it doesn't depend upon sampling variation; it's in the machine, or in the golem. Or, as at the bottom of this slide, think about coin flips. We often say that coin flips are random: you flip a coin and it's just random whether heads or tails lands up. That's a fine description; I'm not going to complain about that. But the coin is not random. The physics is perfectly deterministic; there's no debate about that. A physicist will tell you, well, there's a system, it's got high angular momentum, but it's chaotic in the sense that the quality of measurement you would need at
the initial flick is such that you would have to have incredibly precise measurements to be able to predict what happens. It's just practically impossible to predict a coin flip; that's all it means. But it's not an inherently random process. We have physical models that tell us why we can't predict the coin flip; it's a deterministic system. Does that make sense? Hopefully I'm blowing your minds a little bit here. So one way to say this is: the coins aren't random, we are. We're ignorant of the initial state and the angular velocity and all the other things you would need to plug into the physics model to know whether heads or tails comes up, but the coin is not random at all. The randomness describes our uncertainty, and Bayes takes that view and applies it to everything, because it's true: the world is deterministic by almost all scientific models. There are these debates in quantum physics about whether God plays dice, but there's no experimental result in physics which is not consistent with a deterministic universe. And anyway, we don't live at the quantum scale, so we're not worried about that; we're social scientists, I think, all of us. So this description of uncertainty being in you is accurate for the sort of work that we do. Does this make some sense? You all seem to be following me; you're nodding, at least, which is encouraging. I appreciate it. Okay, let's go through an animated example now where I do the counting, so I can help you see what goes on. This is the simplest example I could come up with. I like to use this metaphor called the garden of forking data. I take this, of course, from the famous short story by Borges. How many of you know it? It's a great story where there are all these branching paths in the story and different things can happen. You make choices, and different futures open up, and you can think about life as a series of branching destinies contingent upon your
past choices. Bayesian analysis is a bit like that. In Bayesian data analysis we're mapping out all those branching possibilities given some assumed truth, and that's what we're going to count. We're going to count all the paths that could lead to the events we have realized. There will be different true states of the world that could produce the state we're in right now, and we're going to count all the branching paths, through different choices, that could do it. Let me show you how this works. The future is branching paths, the data are events, and we want to know how many of the possible paths could produce the events we've realized. So let's think about a simple thought experiment. In statistics we like to draw things out of bags or urns; this is because it's easy to teach with. So let's take it that way. We have a bag, pictured there on the left, and it has four marbles in it. I tell you that, and it's true; it's not a lie. The marbles come in two colors, white and blue, no other colors. So for the four marbles, the possibilities are that each of them is white or blue, and the question is: what are the contents of the bag? We're going to draw marbles with replacement, one at a time, and each time you draw a marble you get some information about the contents of the bag. But you could draw the same marble twice: you put it back, and you can get the same one again. So if you go white, white, it could be the same marble, or it could have been two different white marbles, and you have to deal with all the possible branching paths, given the true state of the bag, that could produce this. We're going to draw this out now, and this is all of Bayesian inference. The hypotheses, the conjectures about what's in the bag, we can list exhaustively, because there are only four marbles: they could all be white, one could be blue, two could be blue, three could be blue, or they could all be blue. Agreed? That has to be it. Easy; science is easy; we're done. Right, now we're going to
draw three marbles. We've drawn them one at a time, putting each back before drawing the next. We draw the first one: it's blue. We put it back in, reach in, draw the second one: it's white. Put it back in, draw the third: it's blue. Given these observations, which of the conjectures is most plausible? We can figure this out using Bayesian inference; that's what we're going to do, using probability theory. So here's what we do. We're going to make the garden of forking data, as I call it. Think about the first marble draw from the bag. Four things could have happened on that first draw, assuming that the bag contains one blue and three white marbles. We're just going to assume that for a second; it's not that we're claiming it's true. We're going to say: if that were the state of the bag, what could have happened on the first draw? There are four paths into the garden of data from that initial pull: one is blue and three are white. These are different paths; there are three different white marbles. They all look the same to you, they're just a white marble, but they're actually different marbles; they're special snowflakes, every single one of them. That's just the first draw. Now we grow the garden, because on top of each of those things that could have happened, on the second draw there are four more things that could happen. This is the garden of forking data; these are the paths into the future. If the first draw had been blue, the second draw is not contingent on it, because we put the marble back. So we could have drawn it again, in which case we get blue the second time, or we could get one of the three white marbles. The same for all the other events that could have happened, because these are independent draws. If they weren't independent, the garden would look different; say we didn't put the marble back, then the garden's paths would get narrower and narrower as we went. Does that make some sense?
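(As an aside: the growing garden is easy to sketch in code. This is my illustration, not part of the lecture; the marble labels are made up so that each marble counts as its own path.)

```python
from itertools import permutations, product

# One conjecture about the bag: one blue marble and three white ones.
# Label the white marbles so each marble is its own "special snowflake" path.
bag = ["blue", "white1", "white2", "white3"]

# Drawing WITH replacement: every draw branches four ways.
garden = list(product(bag, repeat=3))
print(len(garden))    # 4 * 4 * 4 = 64 paths

# Drawing WITHOUT replacement: the garden narrows at each step.
narrower = list(permutations(bag, 3))
print(len(narrower))  # 4 * 3 * 2 = 24 paths
```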
You're just drawing out all the possibilities. Then for the final marble, the last one we've seen so far, the garden gets really big; lots of possibilities. This is everything that could happen. What we're going to do with this, given these data, is eliminate some of these paths, because they're inconsistent with the observations. But before we collected observations, this is what was possible; there are many, many different paths. It's four times four times four possible things that could happen in three draws from the bag, and those are the possible sequences of events, all the paths of branching data. Does that make some sense? I drew this with a computer program, by the way, because doing this by hand would kill me. So let's eliminate some paths. What actually happened is a tiny subset of this. The first thing that happens, down here at the bottom, is we draw a blue marble, so we enter on the left-hand branch into the garden of data. Then we draw one of these three white marbles; we don't know which, but we know it's one of those paths. So there are three paths consistent so far. [In response to a question:] Yes, the data are empirical. The only things that are not empirical are the facts you agree are true, like that there are only four marbles in the bag and they have to be either blue or white. You have to trust me on that; you can't verify it, so you have to make some assumptions. But the data are real, and we're using the evidence, the data, to decide which of the theories is most consistent with it; that's the whole point. We're saying there is a range of hypotheses, they have empirical implications, and that's what we've drawn: the empirical implications of the conjecture that there's one blue marble. We've only done one theory so far, and what we found is that there are three different ways that you could observe this data, that we get a blue, a white, and a blue
marble. If this were the true state of the bag, that the bag has one blue marble in it, there are three ways we could have seen this data, and we're going to compare this to the other possibilities in a second; maybe that'll help answer your question. So is it empirical? Yes. Is it completely empirical? Nothing is. All learning depends upon assumptions; there's no escaping theory. Without theory you can't learn. So this is a theory, it has empirical implications, and we learn because we confront the theory with empirical patterns; there is no other form of learning. [In response to a question:] Yes, you'd have to study it; you have to do some science. You have a theory, and the theory has assumptions in it, and the theory makes implications about how the data should look if the theory were correct. That's what we're counting now: all of the ways that the theory could have produced this observation. Some theories are going to have more ways to produce the observation than others, and that's what's going to differentiate them. So in the Mars case, some theories get closer to the true path across the sky than others; they make better predictions of where it is. So let's think about the other conjectures. Let's list them again: five possibilities for what's inside the bag. You can't look in the bag; you haven't looked. We've just got these three samples, and we're going to count up the ways each of these conjectures can produce the sequence of data we've seen. We just did the second one, actually, and it was three; that's why the three appears in this column. There are three paths through the garden of forking data that are consistent with this possibility, this theory. What about the first one? Yes, we're just going to eliminate the first one, very good, because obviously it's impossible: there are zero paths. We could draw out the whole garden, and it would just be a bunch of white marbles, four paths of white marbles
branching all the way out, all white. But since we've seen blue marbles, it's inconsistent with the facts. That's easy; Bayesian inference is like that, and for much more complicated combinations it all works the same way. Let's do the others. What about number three? Well, we could draw that garden. I'll just do it in this third of the screen, because I'm going to do the other ones in the other parts of the screen in a second. Same idea, though: here's the origin of the garden. This is the one we've drawn before, with one blue marble; I'm just repeating this garden, now crammed into the corner. So we've got one blue and three white at each branching, and going up these three paths, if the bag contains one blue marble and three white marbles, there are three ways we could have seen the data. We can't tell them apart, but there are three ways you could have seen it. Let's compare it to the next one. Imagine there are two blue marbles in the bag and two white marbles, and draw this garden; now I'm going down on the screen. Two blue, two white on each split, the same number of total possible datasets, as it were, that you could see. But now when we count up the ways, you see there are eight ways we could get two blues and one white in three draws from the bag; there are more ways that this conjecture could produce the data. Does that make some sense? Then the next one: three blue marbles, one white marble. Draw the garden the same way, four branches each time, but now there are three blue marbles and one white each time. Count up the number of blue-white-blue paths, and there are nine of them now. So this is the conjecture most consistent with the data we've realized; it has the most ways to produce the actual observation. Does it make sense? Exciting, right? This isn't rocket science. This is all Bayes is: just counting assumptions. That's all probability is: counting assumptions. Amazingly, it works. So, to try to summarize: when we do statistics, we don't draw these things out.
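(Those path counts can be reproduced by brute-force enumeration of the garden. Again, this is my sketch, not part of the lecture; the function name is made up.)

```python
from itertools import product

observed = ("blue", "white", "blue")  # the data we saw

def ways(n_blue, n_total=4):
    """Count garden paths matching the data, under one conjecture."""
    # The bag under this conjecture: n_blue blue marbles, the rest white.
    bag = ["blue"] * n_blue + ["white"] * (n_total - n_blue)
    # Walk every path through the garden; keep those matching the data.
    return sum(path == observed for path in product(bag, repeat=3))

print([ways(n) for n in range(5)])  # [0, 3, 8, 9, 0]
```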
For a tiny data set like this, it's already a huge number of possibilities — there are 64 options. You don't want to start drawing your data this way; you'd need huge pieces of paper, it would be madness. Luckily mathematics compresses all this, and there are simple rules for probability statements that let you do these calculations without having to draw it all out. This is combinatorics — combinations and permutations and things like that — and all that mathematics is for dealing with sequences like this. So we can just use multiplication; in this case that turns out to be the solution. On the first path there are zero ways: on the first marble draw there are zero ways to get a blue marble under the first conjecture, four ways to get a white marble on the second draw, and zero again, so 0 times 4 times 0 is 0 — anything times 0 is 0, it's impossible, this can't be true. Second case, though: we draw a blue marble first, and there's one way according to this conjecture to get that, because there's one blue marble in the bag; there are three ways to get a white marble on the second draw, and one way to get blue again, so there are three paths it goes through. And so on for the others. This is what you actually do when you learn statistics and the laws of probability: you learn this product rule, and the product rule comes from this kind of counting — it comes from counting. You don't have to understand every detail right now; I just want to show you there's a conceptual link, and that all the laws of probability are really just about counting sets. So this is where we get 0, 3, 8, 9 — and I didn't show you all-blue, but I'm sure you can intuit it's also 0, because there's a white marble in the data. Okay. In this view, if you get more data, you can just update. There's not some frozen stage where you have to make the inference now; more data can come in a stream, like it does for a squirrel foraging on the lawn, and it updates its beliefs.
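The product-rule counting just described can be sketched in a few lines of code. The course materials use R, but here is the same counting in Python; the function name and structure are my own illustration, not from the lecture:

```python
# Count the ways each conjecture (0..4 blue marbles in a 4-marble bag)
# could produce the observed draws: blue, white, blue.
data = ["blue", "white", "blue"]

def count_ways(n_blue, draws, bag_size=4):
    # Each draw multiplies in the number of marbles of that color in the bag.
    total = 1
    for d in draws:
        total *= n_blue if d == "blue" else (bag_size - n_blue)
    return total

counts = [count_ways(b, data) for b in range(5)]
print(counts)  # [0, 3, 8, 9, 0]
```

This reproduces the 0, 3, 8, 9 (and 0 for all-blue) from the garden of forking data, with no drawing required.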
That's how learning works in organisms — and in robots, we hope. So now we've got those counts — the previous counts, shown on this slide. Now we draw a fourth marble and it turns out to be blue, and I ask you to update. You don't have to do it all over again; you can just multiply. You don't have to start over at the beginning — you can take the previous counts and multiply them by the number of ways you could have gotten a blue marble under each conjecture. Obviously we can just ignore the all-white and all-blue bags, but it's useful to think about this first column, where we write down the number of ways you could have drawn a blue marble: 0, 1, 2, 3, and 4, which is just the number of blue marbles in each conjecture. Then we multiply those by the previous counts, and that's updating — it gives you the new counts. It's multiplication because you've got these branches in the tree. So it's 0, 3, 16, 27, and 0 now, and the most plausible is the bag with three blue and one white. But you look at these numbers and you're like, well, what do they mean to me? What does 3 mean, what does 16 mean? Nothing, on their own; the only thing that matters is their relative size. The relative sizes contain information about the relative plausibilities of these conjectures, and the relative plausibilities aren't that different yet. You probably don't think it's this one, but it hasn't been eliminated — the all-white bag is ruled out, but (I'm pointing at the slide here, and this is being recorded) there's no evidence that the bag with only one blue marble in it is impossible. It still has some plausibility: you could have seen this data, and it's not even a vanishingly tiny probability — it could easily happen. If I gave each of you a bag with only one blue marble in it and had you draw four marbles, one of you would probably get this result. It's not that impossible.
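The updating step just described is nothing but an elementwise multiplication. A small Python sketch (variable names are mine; the numbers are the slide's):

```python
# Bayesian updating as multiplication: the previous path counts times the
# number of ways each conjecture could yield the newly drawn blue marble.
prior_counts = [0, 3, 8, 9, 0]   # ways to produce the first three draws
ways_blue    = [0, 1, 2, 3, 4]   # blue marbles in each conjectured bag
new_counts = [p * w for p, w in zip(prior_counts, ways_blue)]
print(new_counts)  # [0, 3, 16, 27, 0]
```

No recomputation of the whole garden is needed — the old counts carry everything learned so far.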
But if you were going to bet, you would bet on one of these two — and if you're going to bet on only one of them, I can guess which, unless you're trying to lose money. That's what these counts are for: the relative differences give you some idea of how much you should bet. (Don't bet, by the way — no betting.) Okay. This approach also lets you use other information. If there's any information you can summarize as the relative numbers of ways the data could arise, then you can combine it with the other information — you can use all forms of data in the same estimate. As an example, say someone tells you about the factory that makes these bags. They're like marble gift bags, you know, the kind you'd find in a big gift store where people buy bath bombs and things like that. And in the factory that makes them, the blue marbles are rare — they're more expensive, because they have some fancy blue dye in them, maybe — so they make fewer of them, but every bag contains at least one. I'll give you only that information. Now, can you use it? The answer is of course you can, because it constrains the possibilities — it changes the counts of ways. In particular, this gets summarized by your informant at the factory as the relative production counts of the different kinds of bags. There are no all-white bags, because that would make customers upset, so their process ensures there's always at least one blue one. And there's a ratio of three to two to one of the bags with one, two, or three blue marbles — so out of every six bags, three, half the bags, have one blue marble in them. [Question from the audience.] Yes — sorry, maybe I left something out up here: every bag contains at least one blue and one white marble, because otherwise people are upset — they want a mix of colors. I should have written that up there; thank you, I didn't notice it on my slide. But blue marbles are rarer, in exactly this proportion.
Does that make sense? We could have assumed any numbers here, any information from our spy in the factory — the point is to combine it with what we already know. We've already done all that other calculation, going through the garden of forking data, and here in this column I have what's called prior ways. It's 'prior' because it's our previous calculation, the one we've already done — it came before; that's all 'prior' means. Those were 0, 3, 16, 27, and 0. Now we have the factory count information, and we can just multiply, because these are numbers of ways — this is the multiplication rule in probability, the product rule. So now it's 0, 9, 32, 27, and 0. Now the even bag, if you will — the bag with two blue and two white — is the most plausible; it's pulled into the lead, but there's still barely any difference between those two. So it has made a difference. I show you this because it's often one of the advertised strengths of Bayesian inference: it naturally accommodates different kinds of information in the same calculation, whereas in typical classical procedures this is very difficult to do — very, very difficult. Good — you're at least willing to let me proceed a little bit? This stuff is, by the way, impossible to fully understand on the first encounter; it's complicated. I want you to adopt the attitude that it's like learning a language — it's much easier than that, but it's like learning a language in the sense that you can understand a lot before you're really fluent, and you can make use of it. You take an intro course in French or something and you can go order some food, and embarrass yourself, and things like that — it's very useful to do, and the fact that you're not perfect doesn't mean you can't use it. So you have to be patient with yourself and accept the idea that understanding comes in pieces and then consolidates over time.
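Combining the factory information with the earlier counts is, again, just multiplication of ways. A Python sketch with the slide's numbers (names are mine):

```python
# Combine prior path counts with the factory's production counts.
prior_ways    = [0, 3, 16, 27, 0]  # counts after the four marble draws
factory_count = [0, 3, 2, 1, 0]    # 3:2:1 ratio; no all-white or all-blue bags
combined = [p * f for p, f in zip(prior_ways, factory_count)]
print(combined)  # [0, 9, 32, 27, 0]
```

The two-blue bag now takes the lead, exactly as described — the extra information simply multiplies into the same counting machinery.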
That's just how scientific skills develop. Okay, so let's convert these to plausibilities, because the absolute counts don't actually have meaning — only the relative values do. And if you insisted on keeping the actual combinatoric counts for a data set, they'd be in the billions very quickly; you can see how fast these numbers grow. With a data set of only four observations you've already got numbers approaching 100. If you had a data set with tens of thousands of points — not that unusual these days in science — you'd have counts in the billions, trillions: huge, huge numbers, and you don't want to write those down. So probability theory is normalized, as we say: we take the counts and divide each one by the sum of all of them, and that's called normalization. Now all the numbers sum to 1, and the maximum possible value is 1. Those normalized relative counts are called plausibilities, or probabilities. So, walking through our example again: the table at the bottom left lists the different possible contents of the bag, and then I have this column with a p at the top, which is just a label, a name for each conjecture — it's the proportion of the bag that is blue marbles. That's what p means. No, it's not the p-value; there are going to be no p-values in this lecture — p-values are blasphemy in Bayes. p is the proportion of the bag that is blue marbles: for the first conjecture it's 0, for the second it's 0.25 — a quarter of the bag is blue — then half, three quarters, and all of it. It's just a label, just a number that describes the hypothesis. Then there are the counts we've gotten — the ways to produce the data; these are just the prior counts from the beginning of the example. If we sum up the ways-to-produce-the-data column and divide each number by that sum, we get the numbers on the far right.
Those are the plausibilities — and they are probabilities. Probabilities are just normalized counts of the ways the data could happen according to each theory. Now, the way in statistics you would usually frame this is that p is a parameter we'd like to estimate: what proportion of the bag is blue marbles? There's a range of possibilities it could be — that's an assumption we built into the analysis, the bounds on that number, what values it could theoretically take — and then we use the data to estimate that value. So we're estimating p, the proportion of the bag that's blue. That's the usual way statistics is phrased, but really you just have a range of conjectures. It could be an infinite number of conjectures, because you could entertain every value of p between 0 and 1, continuously. That's impossible with a bag that only has four marbles in it, of course — but if it were a giant, infinite bag of marbles that you reached into and pulled things out of — I know, an infinite bag is impossible, but in math it's very easy — then p could be anything between 0 and 1. It could be 0.17; it could be 0.17137. Those are all possibilities, and you're just trying to estimate as precisely as possible what it is. That's the usual thing you call a parameter estimate — or an effect size, if you're doing a comparison between treatments. Same idea: there's a range of hypotheses about what the difference between the treatments is, those are conjectures, and given each conjecture, how many ways could it have produced the data you've seen? That's the way all stats is done in this framework. Does it make sense so far? So far there's no difference from the frequentist view, except that we got to use prior information — that's the only thing that's different. Frequentist probability is the same business; it's just counting. Okay — how many of you have used R at all? All of you? Super, fantastic.
It's the world's best calculator, right? Okay. If you want to do this kind of calculation in R, you can just make a variable called ways and give it the values 3, 8, 9 — I left off the two zeros because we know what's going to happen with them, but you could put them in and it won't make any difference. You then divide ways by the sum of ways, which I show in the code at the bottom here — you can do weird stuff like this in R — and that converts them to probabilities, and you get the probabilities here. Makes sense? Most of the time, the probability functions in R are continuously renormalizing so that you stay in this probability space — that's what probability functions do, so we don't get these exploding counts. All right, let me try to summarize a little. Plausibility is labeled 'probability' — in applied probability, at least, that's the usage. It's just a set of non-negative real numbers that sum to one; that's all probability is, and it comes from normalized counts. Probability theory is just a convenient way to count big sets of things and think about their relative amounts — it's shortcuts for counting. That's why I asserted at the start that Bayesian inference is just counting. That's all it is — but it's counting really complicated spaces, and we need computers to do it, because people are bad at counting and computers are good at it. People are good at walking; computers are really bad at it — you make a robot walk up stairs, you win an award, while any one-year-old baby can walk up stairs. It's a nice symbiotic relationship between us and computers: they're good at things we're not. That's why we made them — it's no accident they're good at things we're not; that's what we built them to be good at. Does this make sense? It just has to make some sense for now. Questions? One of them, two of them — yeah.
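The R snippet described there is two lines: `ways <- c(3, 8, 9)` and then `ways / sum(ways)`. Here is the same normalization sketched in Python, for anyone following along without R:

```python
# Normalize path counts into probabilities: divide each count by the total.
ways = [3, 8, 9]
probabilities = [w / sum(ways) for w in ways]
print(probabilities)  # [0.15, 0.4, 0.45]
```

Only the relative sizes survive normalization, which is the point: probabilities are just rescaled counts.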
So the question is: do I just calculate these plausibilities without the data? No, no, no. You've got the data, and now you're asking what the calculation means. It means: having seen the data, what proportion of children help in the population? It's like you've got a bag of children, in this metaphor — some number of them are helpers and some of them are not. You draw a child out, you observe their behavior, and you write it down. We've observed three kids: the first one helped, the second one didn't, and the third one helped, and now we're going to estimate the proportion of children in the bag who are helpers. Yes, of course we're doing statistics on the data — but, and we'll get to this later, you can also plan the experiment using the theory. In fact you should, because you want to know whether the experiment can discriminate what's actually going on; that's how you decide how many children you need to draw out of the bag. Yeah — that's a good question. Other questions? All right. Okay: building a model. Let's think about this in a constructive way, and I'll give you another animated example of how this works — one that looks more like a conventional statistical analysis, because we don't usually draw gardens of forking data. In applied statistics there's a narrative we use most of the time, but not always. The first part of it is that we design a model, which is a story of how the data could arise. Maybe we don't have the data yet, but we anticipate its form; we know the possibilities. We've asserted that kids either help or don't help, just as marbles come in blue and white. Maybe we haven't done the experiment yet, but we know the structure of it — we know what the data will look like. Or maybe we already have the data: maybe we downloaded it, a colleague gave it to us, somebody quit and handed us their PhD dissertation. These things happen.
And now we're going to analyze the data and get a publication. So the first thing is to use your scientific knowledge of the topic, of the discipline, to tell a story about how the data could arise. There will be multiple stories, and the question is how to use the data to distinguish among them. That data story helps you design the statistical model: the statistical model should embody the assumptions of the data-generating process — what's the causal hypothesis for how these data get produced? That sounds fine in the abstract; in specific examples you have to think hard, and you have to know your scientific discipline. This is often why statisticians are not very useful for scientists — and they say so themselves; I know my colleagues are listening, but it's true — because the field of statistics has to design techniques ignorant of the application. So it produces this vague, horoscope kind of advice: in general you want to do this and that. But in a particular scientific domain, given your scientific expertise, you should use that expertise in designing the analysis, because the statistical model should embody scientific assumptions — your causal hypotheses about how the data are born. That's why we call these things data-generating models. I like to say: tell me a data story. Tell me a story about how these marbles came to be, and you'd say, well, there was a bag — this weird guy came in, gave us a bag, said there are some blue marbles and white marbles in it, and then he drew three, put them on the table, and left. That's how the data came to be. And in that story you've got enough to build a model, because you saw the weird guy draw the marbles out and put each one back in, so you know he was sampling with replacement — and other facts like that. Makes sense? More seriously, you might have some developmental hypothesis about child behavior — there are scientific theories about how that behavior develops — and those can be more or less compatible with the evidence.
And that's what we do. Then you condition on the data — this is the counting of the ways through the paths, but now the data definitely exists. Updating means you take your original plausibilities, by which you rank the hypotheses — maybe you think they're all equally plausible — which you listed prior to seeing the data, and you condition on the data: you constrain your beliefs based on what's now impossible given the data you've seen, and what's less plausible, given that there are fewer ways for certain conjectures to produce the data than for others. Makes sense? That's what we just did, in an animated sense. And then you evaluate the model, which we haven't done yet: you critique it. You think, okay, there's a certain conjecture that's most consistent with the data, but it's still pretty terrible — we're still not predicting the events very well, we still don't understand why some kids help and some don't. What additional scientific assumptions do we need to make, and then test with data, to make better predictions about how kids develop? There's nothing in the blue-marble example to do better with, but in real science there typically is. The best model may still be terrible, and that's why this third stage is necessary. Remember the Golem of Prague: it's still a golem, and it's not false or true — it's just a good tool or not. There's an evaluation stage where you realize that hammering the table together with a screwdriver is a bad idea, and then you try to find another tool. Models can fail in really spectacular ways, and Bayesian updating won't tell you that — it'll just update. It takes the model as given and tells you the number of ways that model could produce the data, but it won't evaluate on its own whether the model is any good. That's up to you as a scientist; you have to step out of the mathematics at some point and do criticism.
Let's do another example, closer to the sort of estimation problems you might see. Let's think about a globe. When I teach this I have this inflatable globe in my office — if you ever visit me, you'll see it — and I use it for teaching: I throw it at audiences. No one has ever been injured, I assure you. I throw it into the audience, and someone panics, usually, and it bounces off their head, and then the person behind them catches it. Then I say: okay, you holding the globe — tell me where your right index finger is. Is it over water or over land? And they'll say, nervously, water — what do I win? And I say nothing, throw it to somebody else. And we throw it around, counting: water, land, water, water, water, land, and so on, for some number of tosses. In this example, nine tosses into the audience: first a water, then a land, then three waters, a land, a water, a land, a water. We stop and write that down — it's data. We're going to use it to estimate the proportion of the globe that's covered in water. Now, you may know the answer to the question of what proportion of the Earth is covered in water — but notice we're asking what proportion of this globe is covered in water, and those are not the same question. That's part of the lesson; this is how measurement works. There are other possible events: there's an air valve on the top of this globe, and that's neither water nor land, and other things like that — sometimes people land on a coastline and aren't sure how to answer me — and you have to make assumptions. Those issues aside, that's measurement, and those issues are different in every field. Let's leave them aside and assume the globe is a good representation — and it is a pretty good representation — of the geography of the Earth. How should we construct a statistical estimate, from this sequence of data, of the proportion of the globe covered in water?
Do any of you know the true proportion? Yeah, that's right — it's almost exactly, well, a little over 70 percent; we'll say 70. It's about 70 percent water — this is mostly a water world, viewed from space. A whole half of the globe is the Pacific Ocean, which is almost entirely water. So let's go through the three-part construction sequence I had two slides ago and think about how we build an analysis from this. You with me? The first stage is design: we're going to tell the data story. The data story, again: weird guy comes in, throws a globe into the audience. And from that story you get facts that help you design the statistical model. The relevant facts in this case: the different samples are independent, or at least approximately independent, of one another. When you throw it to somebody else, the thing spins chaotically, it's caught in some awkward way, it bounces off a couple of people before it's caught — these are not correlated samples; you don't get water, water, water just because the first one was water. They get all mixed up. And then: assuming some true proportion of water p — again, I'm using p just to mean a proportion — when you toss the globe, the chance any individual toss comes up water is that value. That's an assumption about how the tossing process works. I'll say it again: assuming there is some true proportion of water, which we don't know yet but will label p, for proportion, then when anybody catches the globe, the chance their index finger is over water is p, on any individual trial. So now we're going to have a sequence, and we can say what the probability of the whole sequence is. Does that make sense? You've done stats problems like this, yes? Very good. And yes — we will indeed have an infinite number of conjectures, and it turns out that's no harder than a finite number of conjectures, because math.
I'll show you how we do this. There's an infinite number of possibilities, and we can rank them all — that's what's great; your computer can do this, no problem. There's no calculus for you: you have an infinite set, but you can still rank all of its members. We're going to do calculus, but you won't even notice — it'll happen, and you'll be like, wow, I just did calculus, and you'll feel fresh for the rest of the day. That's a great question, because this is the trick: infinite sets are actually often easier in mathematics than finite sets. There's lots of awkward stuff that comes from finite sets — I won't go down that road, but there are lots of fun things about it. Okay, so here's the data story: you toss the globe, there's a probability p of water and a probability 1 minus p of land, and there are no other events. That's an assumption we're asserting, but probably we have consensus on it; we might debate these assumptions later. And each toss is assumed to be independent. You could make a model where they're not independent — that's called a serial autocorrelation model, where adjacent tosses have some additional similarity, and we make models like that all the time — but this is supposed to be a simple example, so we'll leave that out. Now, you can translate this data story into a series of probability statements, which are just ways to count the paths through the garden of forking data — but now for an infinite number of conjectures. It's like a bag with an infinite number of possible contents, but the steps are the same; there's a mathematical procedure for compressing all that, called probability theory, the laws of probability. We're not going to go into the details here, but I think Robert sent you a link to a PDF of my book, and this is chapter 2 of the book, I think.
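The independence assumption in that data story translates directly into code: the probability of a whole toss sequence, given a candidate proportion p, is just a product over the tosses. A hedged Python sketch (my own illustration, not from the book's R code):

```python
# Probability of a sequence of globe tosses, assuming independent tosses
# with water probability p — the product rule from the data story.
def sequence_prob(p, tosses):
    prob = 1.0
    for t in tosses:
        prob *= p if t == "W" else (1.0 - p)
    return prob

tosses = list("WLWWWLWLW")   # the nine tosses from the example
print(sequence_prob(0.7, tosses))   # p^6 * (1-p)^3 for 6 waters, 3 lands
```

Ranking every candidate p by this quantity is exactly the counting-of-ways idea, carried over to an infinite set of conjectures.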
In the book I go through it all and show you the R code to do this stuff and how the construction works, but I leave that out of the lecture, because the concept is what really matters — I want you to come away with a conceptual understanding of what's going on. Okay. So updating now — the conditioning part. This is the counting part I showed you for the bags of marbles; it's often called Bayesian updating. It's 'updating' because you've got some initial set of plausibilities: the machine needs an initial information state about how plausible the different conjectures are. You can make them all equally plausible, or you could have pre-existing scientific knowledge that some of them are silly. In this case that would apply: you know the Earth is more than half water, because you went to school. You know that's a fact, so any hypothesis of less than one half is already ruled out for you, and you could start there. To keep this simple we're not going to use that knowledge — we'll just make them all equal — but the procedure works exactly the same regardless. So you've got some prior information state, which is called the prior — amazing — and updating turns that into the posterior. These just mean before and after. Before and after what? The data — that's all. Makes sense? Now, it turns out that, because of the way Bayesian updating works, there are many models where the time ordering doesn't matter: you'll end up with the same belief no matter what order the data arrived in. That's true in this case as well — the sequence of tosses could be reshuffled and you'd end up with the same inference at the end, because the order doesn't matter. There are models where that's not true, where you have autocorrelation in the tosses and the order carries information, but not here — and all of that comes from the data story. Okay. So you've programmed your golem; it has a prior.
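The order-invariance claim is easy to check numerically. A sketch under the lecture's independence assumption (the grid and function are my own illustration): shuffle the tosses and the final counts come out the same.

```python
import math
import random

# With independent tosses, the accumulated counts are a product of the same
# factors regardless of order, so shuffling the data changes nothing.
def posterior_counts(tosses, grid):
    counts = [1.0] * len(grid)                       # flat prior over the grid
    for t in tosses:
        counts = [c * (p if t == "W" else 1.0 - p)   # multiply in each toss
                  for c, p in zip(counts, grid)]
    return counts

grid = [i / 10 for i in range(11)]                   # candidate proportions p
tosses = list("WLWWWLWLW")
shuffled = tosses[:]
random.shuffle(shuffled)

a = posterior_counts(tosses, grid)
b = posterior_counts(shuffled, grid)
print(all(math.isclose(x, y) for x, y in zip(a, b)))  # True
```

An autocorrelation model would break this symmetry, which is exactly the point made in the lecture.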
The prior information state here is this: there's an infinite set of conjectures, labeled p — all the real numbers between 0 and 1 — and we need to assign each of them some prior plausibility. Again, in math this is easy. You assign a distribution on that set p where they're all the same, and that's a uniform prior, which amounts to the statement that they're all equally plausible. Or you could set it so that everything below one half is impossible — assign it 0 — and everything above it some other number. It doesn't matter which number, because it's all relative; assign them all 1 and it works. Then you condition — you do the updating. Again, if you look in chapter 2 of the book, I show you the code and we walk through it; conceptually it's just counting paths through the garden of forking data, but done with combinatorics. There's a formula, described in the text, that takes the data story from the globe tossing and produces a formula you've probably seen before: the binomial sampling formula. It's the coin-tossing formula — you've seen it. And then you get a posterior: a new confidence in each value of p, conditional on the data. Makes sense? I know — very exciting. Let's see this in cartoon form; it's a little easier to understand that way. The infinite set of possibilities is on the horizontal axis at the bottom — everything from zero to one, and we'll consider them all. Now we're going to define a prior, and I'm just going to make them all the same — that's this dashed horizontal line, which says that before our golem has seen a single toss of the globe, it's programmed to believe every possible proportion of water is equally plausible. It's a dumb golem; you're smarter than it — you went to school, the golem didn't. Then we update: we see the first toss. What I'm going to do in the next series of images is go one toss at a time.
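The two priors just mentioned — uniform, or zero below one half — can be written down concretely on a grid of candidate values. A sketch (the grid size and names are my own, for illustration):

```python
# Two possible priors over candidate proportions p, on a discrete grid.
grid = [i / 20 for i in range(21)]                    # p = 0.0, 0.05, ..., 1.0
uniform_prior = [1.0 for p in grid]                   # all equally plausible
step_prior    = [0.0 if p < 0.5 else 1.0 for p in grid]  # "went to school"
# Only relative sizes matter, so the constant 1 is as good as any number:
# after normalization, [1, 1, ...] and [7, 7, ...] give the same distribution.
```

The updating machinery is identical for either choice; the step prior just zeroes out the hypotheses your background knowledge has already ruled out.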
We'll let the machine update, and you'll see how this line changes into a curve that represents different plausibility rankings for the different values of p. All of that comes from probability theory — it's completely deterministic given the data. So first we see one W. I've put the whole sample at the top of the image — W, L, W, W, and so on — with everything grayed out except the first one, because the machine has only seen that first water so far. This is N equals 1: the sample size, N equals 1. The prior was the horizontal dashed line; the posterior is now this diagonal solid line. Why? It seems weird, but this is the only mathematically legitimate posterior you can reach — it comes from the laws of probability. And when you think about it: we've seen one water, and that increases the plausibility that there's more water, so we put more plausibility up at the high end. We've eliminated only one hypothesis, which is zero — there's zero probability of p equals zero now; that's the only thing that's been eliminated. But the low values are less plausible, because there are fewer ways — fewer paths through the garden of forking data — to get a water if water is scarce. If water is common, it's very easy to see water, and that's why the plausibility has tilted this way. There's an exact calculation that forces this to be a straight line — that won't stay true for long, but it's true right now, with one observation and given that prior. With a different prior it wouldn't be a straight line; it would be different. Okay, now I'm going to put this up here and we're going to do all of the tosses — there will be nine of them — and I'll show you how the machine sees the data. Second toss: now we see a land, and we get a symmetric hill. There's just as much evidence for water as for land, and you started out with equal prior plausibility for every proportion.
So you get a perfectly symmetrical plausibility curve. The most plausible state of the world for this golem right now is that half of the globe is covered in water — but notice it's not very sure of that; there's lots of plausibility across a wide range of proportions on both sides. Then we get another water, the third toss, and the hill shifts right. In each of these frames the previous posterior becomes the prior — I make it a dashed line — and the solid line is the new posterior; you can see that in the animation. So we get another water and the curve shifts over, because it's now more plausible that there's more water — but it's still very vague; the golem is not very confident about any of the possibilities. And there's no significant result — we're not doing significance testing here; there's no null hypothesis. Next three: N equals 4, N equals 5, N equals 6. We get another water, it moves to the right again, and notice the curve is getting taller. Why? Because it's concentrating — getting tight around a narrower range of values. There's the same amount of area under the hill in every one of these pictures, the same volume under the curve; it's just getting concentrated into a narrower region, so the curve has to get taller. Makes sense? So the sample size in a Bayesian analysis doesn't have the special character it often has in a frequentist analysis, because it's already embodied in the shape of the posterior distribution. The posterior summarizes everything you've learned from the data, including the sample size, so you don't need to do anything extra with it — no calculating degrees of freedom or any of the other stuff you do in a frequentist analysis. It's already there in the shape of the posterior. And then with the fifth toss we see another water, and again the curve shifts up and to the right.
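The whole animation can be sketched as a small grid-approximation loop: update one toss at a time and renormalize, just as the lecture's probability functions do. This is my own Python version of the idea, not the book's R code:

```python
# Grid approximation of the globe-tossing posterior, one toss at a time,
# starting from a uniform prior over candidate proportions p.
grid = [i / 100 for i in range(101)]          # p = 0.00, 0.01, ..., 1.00
posterior = [1.0] * len(grid)                 # flat prior

for t in "WLWWWLWLW":                          # the nine globe tosses
    # Multiply each candidate's plausibility by its ways to produce the toss,
    posterior = [post * (p if t == "W" else 1.0 - p)
                 for post, p in zip(posterior, grid)]
    # then renormalize so the plausibilities keep summing to one.
    total = sum(posterior)
    posterior = [post / total for post in posterior]

best = grid[posterior.index(max(posterior))]
print(best)  # → 0.67, near the true proportion of about 0.7
```

With six waters in nine tosses and a flat prior, the posterior peaks near 6/9; with more data the hill narrows around the true value, exactly the jiggling-and-concentrating behavior described above.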
But then on the sixth toss we get another land, and you get a course correction, a jerk back to the left, in a very specific way. Last three: I think you've got the theme, you know how this works. In the last three, what you'll notice is that the same jiggling is going on, but the jiggles are getting smaller and smaller, because each new observation contributes relatively less information. The machine is getting more and more sure about what's going on, and so it's not as influenced by each individual data point anymore; this is the sample size effect that you're familiar with from many statistical procedures. So by N equals 9 there's still a lot of vagueness; you'd need more data before you could publish this. But it's clustering around higher values, around point seven or so; sorry, I'm touching my slide, people won't be able to watch this later, but it's pointing at about point seven, and there's lots of plausibility there. There's a homework problem I often assign when I teach this class for credit, where they take this example and see how many tosses of the globe you need to get it to narrowly contract around the true value. The answer is pretty big, actually, but it depends upon how much precision you need. I mean, we didn't really know the proportion of the earth that was covered in water until we had satellites; it's a pretty hard problem to get it exactly right, but you can get close to point seven with a small amount of data. Okay, that's the conditioning step. A quick point to summarize: in this particular case, the order of the data is irrelevant. We could have presented the W's and L's to the machine in any order and we'd end up with the same final picture at the end, but that's only because the data story that we told assumes that the tosses are independent of one another. I've said this before, but that's an assumption we made.
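To make the conditioning step concrete, here's a minimal sketch of this update in Python. The grid size is an arbitrary choice, and the toss sequence is the one from the walkthrough (W L W W W L W L W, six waters in nine tosses); this is just an illustration of the counting, not the exact code from the lecture.

```python
import numpy as np

# Grid of candidate values for p, the proportion of water.
p_grid = np.linspace(0, 1, 1001)
posterior = np.ones_like(p_grid)       # flat prior: every p equally plausible
posterior /= posterior.sum()

tosses = "WLWWWLWLW"                   # the nine globe tosses from the walkthrough
for obs in tosses:
    likelihood = p_grid if obs == "W" else (1 - p_grid)
    posterior = posterior * likelihood # the previous posterior acts as the prior
    posterior /= posterior.sum()       # renormalize so plausibilities sum to 1

print(p_grid[np.argmax(posterior)])    # most plausible value, near 6/9
```

Feeding the tosses in any order gives the same final curve, which is exactly the order-independence point, and it only holds because the likelihood treats the tosses as independent.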
The other thing I want you to draw away from this is the dance with the dashed lines and the solid lines: every posterior becomes the prior for the next observation. Now, when you do a Bayesian data analysis, you don't have to present the data one observation at a time to the model; you just dump it all in and the machine does it all. But you could present them one at a time, and it works. An important realization from this is that there is no minimum sample size required for a Bayesian analysis. It's just that when you have very little data, you will not have any confidence with which to distinguish the hypotheses. You can have one observation as the minimum for a Bayesian analysis; it's just that you will draw almost no conclusions, because you'll basically get the prior back. Well, it depends upon the model and the data, actually: sometimes one observation can eliminate a hypothesis, and then it's really good. This is famous, right, with Einstein and gravity and light, things like that; one observation can sometimes be very decisive, if you have a good model. Makes sense? Okay, evaluation. It's hard to say anything generally useful about how you evaluate models, because it really depends upon the domain, the scientific question, and the background debates in the scientific field. But it's important to say that you're not done when you do this. The Bayesian machinery is what I call a small world phenomenon: it's perfectly logical, but it's about epistemology, and you have to check it against the facts. The model may have unexpected behavior. The theory that comes out on top, the best relative to the others, could still be terrible, and it takes your scientific judgment to figure this out. So this is what I call, on this slide, supervision. Bayesian inference just answers a logical question; these golems just answer the questions you ask them. The question could be bad, and often you don't realize it's a bad question until you get the answer.
It's like in folktales, right: you find the genie, you ask it a question, and you're like, oh damn, that was a bad question, and now I will suffer. This is how it works when you're programming computers: you often realize it was the wrong question when you get the answer, and that's what the supervision step is for. From the golem's perspective, it did its job and you should pay it now, thank you; it did exactly what you asked, but you asked it that question. We might also ask whether it malfunctioned; sometimes that happens as well. But often, even in the absence of malfunction, does its answer make any sense to you, given what you wanted an answer to? The answer will make sense relative to the internal state of the machine, but it needs to make sense to us to be of any use in science. So then you look back and ask whether your question made sense, and that leads to redesign of the model, because that's how you ask a different question: you redesign the model, and so on. And since you all are actively engaged in science, you could tell stories of this, of realizing models were terrible right after you ran the experiment. This happens to all of us; it's part of the manic-depressive cycle of investigation. You think you've got it, and then you don't, and then you've got it, and then you don't, and it's just normal, and it never stops, I'm happy to report. Okay, so if we're really gonna build these models, let me get a little closer to how statistical models are presented in scientific journals. There's this script by which we go from the hypothesis to attaching it to the data. If we're gonna actually build a statistical model, an arbitrary statistical model based upon our scientific assumptions, the first thing you do is list all the variables. What are variables? We're going to come back to the globe tossing example on the next few slides and I'll show you what the variables are in that case; I think you probably have a guess. Then you define the generative relations among those variables, that is, if you knew any one of them, how does that help you know the others? Those are the generative relations among variables, and those generative relations depend upon the science; that's where they come from, statistics does not determine them. But you've got to write them in statistical language. Then there's this question mark part, which is where you do the estimation and stuff, and then you profit: you publish the paper. It's an old meme, but maybe somebody gets it. Yeah? Okay, one person, pretty good; all my memes are old, there are no fresh memes here. Okay, a question: yes, there are problems with this, because typically there will be a huge number of statistical models that instantiate any one hypothesis. Hypotheses are not specific enough to tell you all of the steps that generate data. Usually a scientific field will have some vague hypothesis, like something increases with something else; when you make a statistical model, you've got to say exactly what that increase looks like, and the background theory often doesn't say. You give me an example: you've got a scientific hypothesis that something happens, that something causes something else, and those hypotheses nearly always are vague; they don't specify mathematical functions. When you explain your scientific hypotheses to your colleagues, do you write down equations? Probably not. Maybe you do, but then you have a model, and it's not the same as the hypothesis, because there will be an infinite number of models which are consistent with the hypothesis. An infinite number. This is famous in biology: say you get a big distribution of the frequencies of alleles in a population of organisms, and you have some hypothesis, for example, how much does selection explain the distribution of alleles in the species?
There are literally an infinite number of specific models consistent with the hypothesis that selection matters, and also an infinite number of models consistent with the hypothesis that selection doesn't matter. This is a famous debate; in chapter one of my book there's a whole section about it that will give you some more background. The norm in science, I believe, is that hypotheses are a giant bag of potential statistical models, each of which must specify exactly the functional relationships among variables, and we don't usually engage with that level of detail until we do the statistics, or, if you're a theorist, when you do the theory. So in cognitive psychology there'll be a huge range of cognitive models which are all instantiations of the same background hypothesis, like reinforcement learning; there are a thousand models of reinforcement learning, and I'm probably underestimating. But they all say that learning is reinforcement, which is a hypothesis, and then you get into sub-hypotheses, and for each of those, again, there'll be a bunch of mathematical models which are consistent with it. Does that help? I think it's common for people to identify the word hypothesis with model, but in the pragmatics of scientific communication, hypotheses as they're written down are almost never models; they're vague causal relationships that must then be specified before you can predict data. Okay, once you've got the model, you input a prior. I say joint prior because it must cover all those variables; you need prior plausibilities for the values all the variables could take. And then you show the model the data, and it deduces the posterior. Okay, in the case of the globe tossing example, the joint model looks like this. This is what statistical models look like in statistics, whether you're Bayesian or not. So we've got some variables here, and you can guess what they are. W is the count of water that we have observed: it's a number between 0 and 9.
Before we do the experiment, we know it'll be a number between 0 and 9. Why? Because our data story tells us it has to be: it can take no other values, because it's a count of the number of times someone looked where their finger was and said water, and that's between 0 and 9 because we're gonna toss the globe 9 times. Makes sense? So that's a variable. When we see it, after we've done the experiment, we call it data, but it's just a variable that could be observed or not; it depends upon the experiment whether it is observed. N is also data: it's how many times you toss the globe. If you did a different experiment it would be a number other than 9, but it's an input into the model; the model doesn't know it, you have to tell it what it is. P is the variable that is the thing we're trying to estimate. Its true value is what we'd like to know; we don't know it and we can't observe it, so we have to infer it from other things. And this is typical in the sciences: there's some question about a value in nature, and it can't be directly observed, so we have to measure it indirectly. In fact, most of the time in science you have to do statistics to measure things. This is what effect sizes are: we do comparisons that let you estimate differences and effects, the consequences of causal interventions. Those can't be directly observed, because causation is epistemological. Everybody's read their Kant, yeah? Immanuel Kant. Causation is a belief, an assumption, always; causation can never be observed in nature. It arises from scientific assumptions which constrain the possibilities. So P is this true state of the world we'd like to estimate, but we do know things about it: we know it's a number between zero and one, because it's a proportion. So we still know something about it. And then you have to say: we're going to observe W, we're going to observe N, and we're not going to observe P, so we need to assign the machine some initial plausibilities. These are the conjectures about the possible values, and in this case we make them uniform between zero and one. A uniform distribution is exactly as it sounds: a flat line. This is not required; it could be highly non-uniform if you had good reason to make it so. This is just notation, but no matter how complicated the stats model, there'll be a series of statements like this which just define the generative story. So W is generated by a binomial sampling sequence, binomial meaning N trials where the probability of succeeding on each is P; this is coin flipping. And P has uniform plausibility, before we see the data, between 0 and 1. After we see the data, the distribution of P will change, and that's the information we get: it will no longer be uniform. So Bayesian models, very importantly, are generative, which means they can be run in both directions. What does that mean? If you run a generative model forward, it's a simulation: you don't have data, but you can ask the machine, if I ran this experiment, what would the data look like? That's a generative model. Bayesian models are always generative; non-Bayesian models might be generative. This is one of the nice things about the Bayesian approach: you can do power analysis, you can do experimental design and study design, because given the assumptions of the generative process, you can produce data that lets you imagine what the experiment would look like, and whether it could distinguish the different hypotheses. That's what I call running the model forward, as time flows; that's the reason I chose 'forward', you go forward in time to the data. Now the backwards direction: the data are here, and we'd like to go backwards to the process that produced them; now we run the model in reverse.
In the forward direction, you input the parameters: you would choose a value of P, run the model forward, and it would produce some W's. In the reverse direction, you don't input the P; you input the W and the N and run it backwards. This is Bayesian updating, and then you get a distribution for P. It won't tell you the exact value, but it gives you the relative plausibilities, conditional on the data. These are the two directions, and the reverse direction is statistical inference. In the physical sciences they often call this the inverse problem: it's inverse in time. If you want to infer a process, that's an inverse problem; if you want to generate data, that's doing the experiment, the forward problem. Causal processes have implications going forward in time, and if we have hypotheses about those implications, we can use them to make inferences about which causal process produced the data. These are the two directions; does this make sense? This is very standard terminology, and it's how scientific inference works in all fields, under all philosophies. There's nothing Bayesian about this, except that Bayesian models are always generative and not all models are. There are lots of statistical models, for example in economics, which are not generative: they don't predict distributions of data at all, even though they make useful estimates. Okay, so here's the joint model again. If you want to think about what a forward simulation means for this model: you set N to some sample size, say N equals 9 to keep it simple; then we sample from the prior distribution of P, taking, say, 10,000 values between 0 and 1 from the uniform distribution, so P is now a big bag of values present in proportion to the prior plausibilities; and then for each of them we sample a W using our binomial, and we end up with 10,000 W's, and I just use a table to summarize them.
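That forward (prior predictive) simulation can be sketched in a few lines of Python; the 10,000 draws match the number mentioned in the lecture, while the seed is an arbitrary choice of mine.

```python
import numpy as np

rng = np.random.default_rng(2025)

N = 9                                # nine tosses per imagined experiment
n_sims = 10_000
p = rng.uniform(0, 1, size=n_sims)   # sample values of p from the uniform prior
W = rng.binomial(N, p)               # for each p, simulate a count of water

# Tabulate the simulated counts of water, 0 through 9.
counts = np.bincount(W, minlength=N + 1)
for w, c in enumerate(counts):
    print(w, c)
```

With the flat prior, each count from 0 to 9 shows up in roughly equal proportion, about a thousand times each; swap the uniform draw for a single fixed value of p and the table concentrates around N times p instead.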
As you'd imagine, since all the values of P are equally plausible to start with, over these 10,000 imaginary experiments you get the whole smear of counts of W between 0 and 9, because this is what the model says: they're all equally plausible, so if you ask it to simulate, it's going to return all of this stuff in about equal proportion. If you change that initial distribution of P, you get differences. You can do this after you update the model as well: you can repeat the simulation, and that tells you what the model now expects, and this is how you generate predictions from statistical models. Once you've trained the posterior, built it with real data, if you wanted to predict the next event, this is the procedure to do it. Makes sense? It's always the same. Okay, so, running this thing in reverse in the computer: the exact calculations can be done a lot of different ways, and I'm not going to step through these, but I wanted to give you an idea of four of them that are common, although there are many others. No matter how you do it, it's still Bayesian inference; the exact calculations don't matter. Some of them are only approximations, but they're very useful approximations. Usually when you start an introductory Bayesian course, you'd see an analytical approach. I don't do that, because I think it's almost useless to learn the analytical approach: it's good for conceptual understanding, but you can almost never use it in any reasonably complicated model, because it's just too complicated; no mathematician on the planet can do the integrals that would be required to do the updating. But the computer can do it numerically, and that's what we do instead. One way you can do it numerically is this technique called grid approximation, where you segment all your hypotheses up into a grid, and then for each of them you just do the counting. It's an approximation because there's not actually an infinite number of hypotheses, there's a finite number on a grid, but if you make the grid really tight, it's good enough, at least for science. And in the second chapter of my book I show you how to do the grid approximation for the water example, so you can see how it works. It's an easy set of calculations, it's just counting, but you can't do it for any reasonably complicated model, because the number of things you'd have to evaluate and count explodes combinatorially and you just can't do it; you have to finish your dissertation at some point, right, so you can't just keep computing. Then there's this other approximation, the quadratic approximation, sometimes called the Laplace approximation, which says that the posterior distribution will be approximately Gaussian in shape, normal in shape. It's an approximation, and it's often really accurate. Most of non-Bayesian statistics makes a very similar assumption: it asserts that the uncertainty around an estimator will be Gaussian. It's an assumption, and it's often not true. If you do the Gaussian approximation in the Bayesian context, you get estimates that are very similar to typical maximum likelihood results in non-Bayesian statistics; it's a very similar sort of procedure. Most of the Bayesian work that people do, though, uses this Markov chain Monte Carlo approximation, which makes no assumption about the shape of the posterior: you can get back anything that's logical. But it takes longer to run. Other people write these algorithms for you, and you just push buttons now and it goes, but you do have to learn how to supervise these things, and the whole second half of my book is learning how to do that, how to be a responsible Markov chain Monte Carlo operator, because these things are jet engines and you don't want to stand in front of the wrong end of them. They're perfectly safe mainstream tools; people use them all the time; there's nothing exotic about them at all, though in psychology maybe they're a little bit exotic.
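To give a flavor of how Markov chain Monte Carlo does this counting, here's a toy Metropolis sampler for the globe model (six waters in nine tosses, flat prior). The proposal width, chain length, and warm-up here are arbitrary choices of mine, and real work would use ready-made software such as Stan rather than a hand-rolled sampler.

```python
import numpy as np

rng = np.random.default_rng(11)

W_obs, N = 6, 9                     # six waters observed in nine tosses

def log_posterior(p):
    # log prior (uniform on [0,1]) plus log binomial likelihood, up to a constant
    if p <= 0 or p >= 1:
        return -np.inf
    return W_obs * np.log(p) + (N - W_obs) * np.log(1 - p)

samples = []
p = 0.5                             # start the chain somewhere reasonable
for _ in range(20_000):
    proposal = p + rng.normal(0, 0.1)   # jiggle the current value
    # accept moves with probability given by the ratio of plausibilities
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(p):
        p = proposal
    samples.append(p)

samples = np.array(samples[2_000:])     # drop warm-up
print(samples.mean())                   # close to 7/11, the exact posterior mean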
But in biology, every grad student learns how to use Markov chains; it's just ordinary, there's nothing bizarre about them at all, just a way to do counting. Okay, one thing about predictive checks before I shift into the last half-hour of this. When we evaluate the model, again, it's hard to give advice, because good advice depends upon knowing your background, the problems you're worried about, the costs of particular kinds of mistakes; all of that comes into how you check the predictions of your model and whether it works. But you really want to do this forward simulation after you've updated to the posterior, to see what the model now thinks about the world, and you might then realize it has ridiculous beliefs, and then that leads you to revise the analysis. So I often call these predictive checks. This tradition comes from this handsome fellow here in uniform: this is Edwin Jaynes. He was an American naval officer, as you can probably guess, but also a physicist who made many important contributions to Bayesian statistics during his career, and he was very big on the idea that you have to respect models and check them, trust but verify, and this is his predictive checking style. It's not like a significance test exactly, but it has the same spirit: you're checking whether the model is consistent with the evidence, with the data. That's what a predictive check is like, except you're checking the model: is the model we've estimated from the data consistent with the data? This is something that's not often done in non-Bayesian statistics, but it's a standard part of the workflow in Bayesian statistics: given what the model has learned, does it make any sense at all? But there's no universal best way to do it; you have to use your judgment, and there's no way to justify any threshold like 5%. Of course, there's no way to justify that in non-Bayesian statistics either. Five percent: why is the threshold for statistical significance 5%? Well, because there was a bony fish that crawled out of the ocean in the Devonian and had five rays on its fins. That's the true story, right? Five is a cognitive attractor, and this is why Fisher settled on 5%. He was trying to stop agricultural scientists from destroying the food supply of England, and he picked five. But it's just that the bony fish had five rays, and that's why mammals have five bones in their feet and hands, and that's all it is. If that fish had had seven, we'd have a 7% threshold; well, we'd have a base-14 counting system instead of a base-10 one, and we'd still call it five. But anyway, let's not go off on my science fiction short story. Yes, this is a deep conversation, and it's related: they're not always the same thing. So alpha values: Neyman-Pearson versus Fisher is a whole different theory of significance testing, and I won't open up that can right now, but when you talk about alpha and type 1 error rates and calibrating error rates, that's Neyman and Pearson, who were rivals of Fisher. Well, they hated one another; it's a miracle they didn't have a duel with pistols. Fisher's significance testing is different, but, for the sake of not doing too much harm, it's about the same principle, the alpha value being 5%. Fisherian and Neyman-Pearson statistics disagree, though: for Fisher, p-values are continuous; in Neyman-Pearson there's just a threshold, and the absolute value of the p-value isn't informative. Well, you have errors of different types, and you can count them. But no, we're not doing significance testing here, because it's a bad idea. It's not that we can't predict things; it's because rejecting a null hypothesis is a useless ritual, and that's why we're not doing it. We want to build a substantive scientific model, and that's what this procedure is about: we have substantive conjectures, and we're trying to estimate which of those causal processes produces the data. Significance testing doesn't tell you what caused the data; it tells you what didn't.
We're gonna talk about causation later, but people expect significance testing to tell you what causes what, and it absolutely will not; it just measures the strength of an association. And this criticism, by the way, is not Bayesian: there are lots of frequentist statisticians who don't like significance testing. In fact, the American Statistical Association basically publishes a joint manifesto every year saying, please, everybody, stop doing this, but scientists don't listen. I know some of you have seen these statements, but it's like Groundhog Day: every year we have to do this over and over again, statisticians telling scientists that what you're doing with significance testing is illogical and you should stop doing it. It's not a Bayesian critique; it's absolutely not a Bayesian critique. But you asked about 5%: yes, 5% is arbitrary, and there's no good reason to adopt it. Okay, let's come back to Mars, everybody's favorite planet after Earth; you like Earth more? Mars is great. Okay, so this is an important story, because there's no sampling variation: you've got this path, and you need a model that can predict it, but you can't use the frequentist device of saying that our uncertainty arises from variation across trials; that doesn't work here at all. The Bayesian formalism works fine: you've got some initial expectations, you update them with data, and your predictions get tighter and tighter, if the model can be trained correctly. Or you can completely reject models, because in the posterior check they can't possibly predict the path. That's how scientific inference worked before Kepler, right? Then we figured out why you get this loopy loop in the sky: because we're orbiting the Sun and so is Mars, and there's this parallax effect from our relative motions which makes Mars look like it's going backwards. It's not actually going backwards; the planets are all going around in the same direction, but at different speeds.
Before that, there was this clever fellow Claudius Ptolemy, actually lots of clever people like him, he just published one of the most famous compendiums, who had a bunch of very successful mathematical models to predict the positions of the planets in the sky. They're completely unrealistic, but they work incredibly well. This is the Ptolemaic, or geocentric, model of the solar system; you're all familiar with this. It's a really cool thing: it works, it works perfectly; there's no problem using it to predict where Mars will be, it makes perfect predictions, absolutely no problem at all. And when Copernicus said, let's put the Sun at the middle, his model made no better predictions than Ptolemy's. They were empirically indistinguishable; they predicted exactly the same data. This is why people didn't think much of Copernicus: it's just an arbitrarily different model with the Sun at the middle, and you just upset the Pope, thanks. Anyway, in chapter four of my book I recount some of that history, and I talk about how we can distinguish these models, and also at the beginning of chapter seven. The relevant thing today is: how did they do it? Well, they have these orbits of orbits, these things called epicycles. The geocentric system is such a great model that the Earth isn't even at the middle: it's offset, and the other planets are orbiting some point between the Earth and an imaginary point in space called the equant, and it's just the music of the spheres. But it works; it makes perfect predictions, because it was trained on the observations. They had the data, and they found a set of mathematical functions which could almost perfectly predict the positions of Mars and the other planets, but through this crazy system. And this system, it turns out, in modern language would be something called a Fourier series, written here on the left of the slide. A Fourier series is a way of approximating any periodic function by decomposing it into a series of circles, and you can approximate any continuous repetitive path, like an orbit, by embedding circles in circles. This is called a Fourier transform; it's a workhorse of applied mathematics, and you can do lots of really cool things with it. But it's not a claim about causation, it's just description: a mathematical technique, a really good one, for approximating, to any arbitrary precision, the path of a thing. And it's amazing that Ptolemy discovered this. He didn't know it was a Fourier series, but he discovered it; they did the trigonometry, which the Greeks invented, remember, and it's a real achievement. I say this because people make fun of geocentrism, but none of us are sophisticated enough to make this model; it's a real scientific achievement, a really amazing thing. And it also demonstrates that statistical models don't contain any information about causation in them at all; they're just prediction engines. Machine learning is about prediction; it's not about causes, and that's the big distinction. Now, if you want to use this model as a model of the solar system to do some work, you're going to do some causal intervention in the solar system. What would that mean? Like, you launch a probe to Mars and you want it to land on Mars: this is a bad model, you're gonna miss Mars, it's not gonna get anywhere near it. So it matters when you do an intervention, but in the absence of an intervention, the predictive model without any causes in it is fine, and it makes sense to use it. And this is why inferring causes is an additional step: you don't get it from doing statistics. The field of statistics is not a field about causal inference; it's a field about describing associations among variables, and the causes are up to us as scientists. We bring the causal information in; it comes from the scientific background, and we design experiments that can distinguish causes, but all of that is an interpretation laid on top of the model.
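As an aside, the epicycle-as-Fourier-series idea is easy to demonstrate. The sketch below is my own example, not the one on the lecture slide: it approximates a deliberately non-circular periodic path, a square wave, by summing sine terms, and the approximation error shrinks as you stack on more "circles".

```python
import numpy as np

# Approximate a non-circular periodic path (a square wave) by summing
# sine terms, the way epicycles stack circles on circles.
t = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
target = np.sign(np.sin(t))

def fourier_approx(n_terms):
    # Partial Fourier sum for the square wave: only odd harmonics survive.
    out = np.zeros_like(t)
    for k in range(1, 2 * n_terms, 2):
        out += (4 / (np.pi * k)) * np.sin(k * t)
    return out

for n in (1, 5, 50):
    err = np.mean(np.abs(target - fourier_approx(n)))
    print(n, round(err, 3))
```

More terms, better fit: a purely descriptive engine that predicts the path to any precision you like, with no causal content whatsoever, which is exactly the point about Ptolemy's model.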
The model itself will be perfectly happy either way, always. Okay, so all these sorts of models, like regression, are essentially geocentric. I say in my book that regression is the geocentric model of statistics: it's just a big descriptive engine for making predictions, but there's nothing about cause in regression; none of the symbols mean cause. If you want them to mean the causal effect of a variable, you need additional assumptions outside the regression model, and I'm going to show you some of those in the next slides. But this is a very important point that I think gets left out. Causal inference is a big topic in applied statistics, it absolutely is, but statistical models don't have it in them. Okay, so now I've got to get Gauss in here. Gauss invented linear regression to describe the motions of planets; that's actually what he did with it. And these are very powerful machines; I'm not trying to say they're bad. These linear golems can do lots of really sophisticated things, and they're all based on the Gaussian distribution. Gauss didn't name it after himself; he just said, here's a distribution of errors, and people later named it after him. So I wanted to give you a quick glimpse into why we use the normal distribution so much as an error distribution in nature, and where that assumption comes from. And I want to be clear that it's an ignorant assumption: it's not a claim about how things are going to turn out, it's a claim that however things turn out, this calibrates our uncertainty in the right way. Normal distributions arise spontaneously through aggregating processes in nature all the time; they're completely unremarkable, which is why we use them so much, but they represent states of ignorance, a usefully informed state of ignorance. Let me give you a quick example; I can't spend much time on this. This slide, by the way, is Gauss's 1809 derivation of linear regression; it's Bayesian, even though the word Bayes doesn't appear here.
Of course, everybody was Bayesian at the time; it's probability theory, inverse probability is what they called it. Okay, so why normal? Let me give you an example. Football: imagine we all go out to a football pitch and line up on the midfield line, and each of you has a coin in your pocket; all of us have euros in our pockets, because we live in Germany, you need a euro, it's part of your survival kit. You flip it, and if it comes up heads, the proud eagle or whatever, let's call it eagle, you take a step to the left, and if it comes up the other side, you take a step to the right. We repeat this process multiple times, and our positions are going to drift: on the first flip some of us get eagle and some of us don't, and then we drift a little bit more, and our distances from the midfield line scatter more. Now, we do this some number of times, we measure our positions, and the question is: what's the distribution of the distances from the midfield line? And I assert that after a few coin tosses they will be approximately normal. They have to be, because the process of your steps, the distance you move, is adding up fluctuations generated by the coin tosses, and in nature, when you add up fluctuations, you end up with normal distributions. That's it. There's a deep information-theoretic explanation for why that has to be so, which I'm not going to repeat here, but it's in my book, so I'll just point you there if you're curious; this has been known for a really long time. So here's an animated version of this story: the football pitch, and we start out at the midfield position, zero, there on the top. After four coin tosses, I summarize the distribution of distances for, say, the 100 people we've put on the field, at the bottom, and you'll see it's looking kind of Gaussian already, but the tails aren't thick enough; after eight, now it's pretty Gaussian.
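The football-pitch story takes only a few lines to simulate; the number of people and the number of flips below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

n_people, n_flips = 100_000, 16
# Each coin flip moves a person one step left (-1) or right (+1).
steps = rng.choice([-1, 1], size=(n_people, n_flips))
positions = steps.sum(axis=1)   # signed distance from the midfield line

# Adding up independent fluctuations gives a bell shape: mean near 0,
# standard deviation near sqrt(16) = 4, as the central limit theorem says.
print(positions.mean(), positions.std())
```

A histogram of `positions` is the bell curve from the animation, and it emerges no matter what the individual step distribution looks like, which is the whole point: the Gaussian is what addition of fluctuations produces.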
As I show there, this happens spontaneously all the time it just comes from adding fluctuations when you add fluctuations together in natural systems they dampen one another so the most plausible values are in the middle because the most likely thing that happens is the fluctuations cancel one another and you end up at the middle again and huge numbers of natural processes produce things like this so here's a quick movie maybe it's too bright to see people make physical machines to simulate this this is a big board with pegs and they pour a bunch of marbles down it and they collect in the bottom in a Gaussian distribution because they bounce to the left and the right and those deviations add like the coin flips on the football pitch and they stack up in these approximately Gaussian forms so if we're ignorant about the exact path that the marble takes but we want to guess where it's going to be we should guess a position that is more plausibly in the middle and falls off in plausibility in exactly this particular form it makes some sense that's why we use the Gaussian it's not a claim about where things will be exactly it's a calibration of our ignorance about where things will be but it's an educated guess okay let me go a little bit through this because I want to get to causal inference so linear models they're geocentric they're incredibly powerful and useful when we teach Bayesian statistics we don't tend to teach all the little special cases of regression we just call everything regression or a linear model so there's a huge range of specialized procedures and tests like t-tests single regression multiple regression ANOVA ANCOVA MANOVA are these familiar at all right they're all the same thing they're all linear models and so I just like to teach the linear model because then you get the full power right and you can mix and match and do what you need given your scientific purpose.
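One way to see that these special cases are all the same machine: a two-sample t-test is just a regression on a 0/1 dummy variable. A quick sketch with made-up groups (not data from the lecture):

```python
import numpy as np

rng = np.random.default_rng(6)

# Two groups with different means: the classic t-test setting.
g0 = rng.normal(10.0, 2.0, size=200)
g1 = rng.normal(12.0, 2.0, size=200)

y = np.concatenate([g0, g1])
x = np.concatenate([np.zeros(200), np.ones(200)])  # dummy: which group

# Fit the linear model y = a + b*x by least squares.
b, a = np.polyfit(x, y, 1)

# The slope b equals the difference of the group means,
# which is exactly the quantity a t-test examines.
print(b, g1.mean() - g0.mean())
```

The same trick extends to ANOVA (several dummies) and ANCOVA (dummies plus continuous predictors), which is why learning the linear model covers all of them.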
It's about learning the framework instead of the individual little tools yeah and it's this construction perspective and I just wanted to show you what linear regression means in the Bayesian perspective the idea is that there are an infinite number of plausible lines to connect two variables together and we want to rank them all in relative plausibility and that's what Bayesian regression does there's an infinite number of plausible lines before you see the data so here in the upper left of this slide we introduce the first ten data points these data points are people and we're estimating the slope that connects weight to height in an adult population and you know it's positive yeah so when we see ten individuals I'm sampling I think a hundred lines from the posterior distribution here now the posterior distribution contains lines lines have a slope and an intercept so you can describe where the line is and they're very scattered because the model is still not sure it's only seen ten people it's like yeah this is the best I can do right now so all of these lines are plausible we give it ten more it starts to contract because now a lot of those lines are less plausible and it starts to contract on the data after 50 you can see it's tighter but it's more uncertain on the ends because of the way lines pivot after 100 it's getting quite confident over here after 350 the plausible lines are all tightly bunched up around a single best line right which is the line you get in a frequentist analysis that best fit line is in the middle the uncertainty in the Bayesian analysis is this bowtie shape that's all the lines around it that are roughly equally plausible to one another does that make some sense and so depending upon the model the posterior distribution can contain really complicated functional shapes and what the posterior distribution is doing is ranking them all relative to one another how plausible are they given the data.
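A rough sketch of what "sampling lines from the posterior" means, using a grid approximation with flat priors and a known error scale (all numbers invented; the lecture's actual model and data differ):

```python
import numpy as np

rng = np.random.default_rng(2)

# Fake people: height roughly linear in weight, plus Gaussian noise.
true_a, true_b, sigma = 100.0, 1.5, 5.0
weight = rng.uniform(40, 80, size=50)
height = true_a + true_b * weight + rng.normal(0, sigma, size=50)

# Grid-approximate the posterior over intercept a and slope b.
a_grid = np.linspace(60, 140, 200)
b_grid = np.linspace(0.5, 2.5, 200)
A, B = np.meshgrid(a_grid, b_grid)
mu = A[..., None] + B[..., None] * weight            # predicted heights
logpost = (-0.5 * ((height - mu) / sigma) ** 2).sum(axis=-1)
post = np.exp(logpost - logpost.max())
post /= post.sum()

# Draw 100 plausible (intercept, slope) pairs, like the lines on the slide.
idx = rng.choice(post.size, size=100, p=post.ravel())
lines = list(zip(A.ravel()[idx], B.ravel()[idx]))
print(np.mean([b for _, b in lines]))   # concentrates near the true slope
```

With only ten points the sampled lines scatter widely; rerunning the same code with growing n produces the tightening bowtie from the slides.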
But it could be really complicated it could be a bunch of curves all kinds of things quick example you can do curvilinear things so this is a March temperature trend a historical temperature trend for Japan on the top from the year 1900 to the year 2000 yeah and on the bottom is the first day of the cherry blossoms this is recorded every year because culturally this is very important in Japan people have picnics and all kinds of stuff in the cherry blossom season and so we want to estimate the trend of the cherry blossom date so we can compare it to the temperature trend there's a linear model in here actually that's used to construct this fluctuating trend but it's highly uncertain you can see this gray region I've drawn this is the plausible center part of the posterior trend at each year but you'll notice it's wide because the data are finite and they're highly variable right so this is still a posterior estimate but it's a highly wiggly thing that's the technical term in statistics this is high wiggliness okay I think that's the most I can usefully say about that okay regression right it's geocentric it doesn't have causation in it and this is where we need to supervise it as a golem it's an oracle but it doesn't have your interests in mind you have to be very careful what questions you ask it it will answer your question correctly but the answer may be nearly useless and mislead you so one of the things that we worry about in regression models and this would include ANOVAs and everything else when you start putting predictor variables into them and treatments and factors and whatever it is is confounding what do you need to control for and there's this tradition that evolved where people think well let's be safe let's put everything in and is that a good idea so think about this you know scientific papers where people say we controlled for age and
socioeconomic status and gender and a bunch of other stuff right and then you'll say oh well those controls must make it safe now this must be a causal effect this is a bad idea this is what I call causal salad causal salad is okay I want to make a causal inference so I've got all these variables let me toss them together I'm gonna do some statistics and then I'll call the resulting parameter values causal this does not work we can prove logically mathematically that this is bad now I just want to give you some conceptual examples to take away this is not a uniquely Bayesian thing at all everything I'm going to say here applies equally to all perspectives on doing statistics it's a problem for everybody it's the pre-statistical problem that governs everything else it's the most important thing the biggest problem in statistical practice is not whether you're Bayesian or frequentist it's that people are never taught how to legitimately claim causation and they just run models and claim the parameter estimates are causal so adding variables can create confounds just as well as it can remove them I want to show you a case where it removes them and then show you a case where it creates them and then I'll let you go home and be happy yeah okay an example I think you'll intuit and this is a way to introduce you to causal diagrams what you're seeing here is something called a directed acyclic graph don't worry about what that means it's just a terrible name for a causal diagram this is causal because the arrows indicate causation there are three variables here age height and a math score for students and say you're interested in the hypothesis of whether taller people are better at math that's why the question mark is there we don't know that but we want to estimate the causal force here so this is like structural equation modeling which I know some of you have certainly seen structural equation modeling comes from the same origin it comes
from a biologist named Sewall Wright age influences height and math ability yeah we're pretty sure and so as a result age is a confound if you wanted to estimate the causal influence of height on math ability age gets in the way of this because it's going to create a correlation between height and math ability even in the absence of any causal relationship between them I assert that taller people are not inherently better at math but on average in a population of children height and math ability will be correlated because as they grow and they study it creates association but that association is not causal yeah no one wants to argue that point with me okay good see I choose examples that are intuitive right so age is what we normally call a confound it's something that causes both of the other two things which means if we just measure the association between height and math it's confounded by age and we have to somehow remove the age effect to estimate the causal effect of height and this is where adding something to a regression model works and I want to show you why it works and give you an idea because then I'll also be able to explain why this doesn't always work so what we say is that math is independent of height conditional on age and we have to say what conditioning means conditioning means for each age we throw away all the other kids we just look at all the kids with the same age and then we assess the association between height and math scores that's what controlling for age means it means stratifying on the variable stratify the population by each unique value of age or if they're not all the same age you look at similar ages yeah that's what conditioning on it means and that's what you do when you add a variable to the right-hand side of a regression or an ANOVA you're conditioning you're stratifying the other estimates by that thing it's a way of conditioning so if I simulate this population and I did we've got
I think this is like a thousand kids or something I forget how many it is five thousand kids with math score age and height the way to read these this is called a pairs plot the bottom axis is what's below it so this is age on the horizontal axis in the top middle plot and the vertical axis is math score so their math scores are improving as they age yeah cuz they're learning yeah education works right and this graph in the upper right is math score on the vertical and height on the horizontal so taller people are better at math but does height cause better math scores of course not I know because I simulated these data right age causes better math scores and age is causing height but that generates a correlation between the two that's the confound effect and also age and height are correlated right because people grow well kids grow right I'm done growing I'm shrinking but that said what we're going to do is stratify by age so we take this plot up here and we look within each age cloud at the correlation between height and math ability that's what conditioning means and I can show you we just look at four different ages here age equals seven eight nine and ten scatterplot between height and math and you see there's no pattern once we condition on age that's why stratifying is helpful for making causal estimates but you have to intuit what might be a confound and put it in the model this is what leads to causal salad right you see this example and you think oh I should always do this I should always put everything in the model because then I'm going to remove all the confounds no let me show you the opposite problem now and I'll leave you with the happy story of the opposite problem let's think of a different kind of causal relationship think about a light like one of the lamps in this room that's on.
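Before the light example, here is a quick sketch of the age/height/math simulation just described (my numbers, not the ones behind the lecture's figures):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Age causes BOTH height and math score; height does not cause math.
age = rng.integers(6, 11, size=n)                  # ages 6..10
height = 80 + 6 * age + rng.normal(0, 5, size=n)   # cm, grows with age
math = 10 * age + rng.normal(0, 8, size=n)         # improves with age

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

# Raw association: confounded by age, height "predicts" math.
print(corr(height, math))

# Condition on age: stratify and look within each age group.
within = [corr(height[age == a], math[age == a]) for a in range(6, 11)]
print(within)   # all near zero once age is held fixed
```

The raw correlation is strongly positive, yet within every age stratum it vanishes, which is exactly the stratification story from the scatterplots.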
It's caused by two things at least approximately the presence of electricity in the building which I have labeled here power there's power flowing into this building which makes it possible for the lights to turn on and there is a switch on the wall both of these things have to be on as it were for the light to turn on that's why there's an arrow from each in this causal diagram pointing to light so it makes sense why the diagram is like this power doesn't cause the switch right the switch can have any state on or off regardless of the power but if the power's out the light can't be on right and whatever the power is you can flip the switch the switch doesn't cause the power and the power doesn't cause the switch there are no arrows between them they don't cause one another at all make sense right this is not a circuit diagram of the room that's not what it is it's about what causes what and the power doesn't move the switch on the wall and moving the switch on the wall doesn't make power flow into the building it just makes it flow to the light yeah makes sense now think about it if you wanted to know the state of the light you need to know the value of both of these variables that's what these causal diagrams mean light is a function of the presence of electricity and the state of the switch on the wall that controls the light that's what a causal diagram means and that's all it means perfect common sense and so when the light is on you know something about these variables because you're a person right so if the light is on you know both of those things are true there's power and the switch is on yeah and this is a sense in which causally speaking the switch is independent of the power they have nothing to do with one another if the power's on that doesn't constrain the lights to be on you can get up and turn them off yeah I'm always telling my son this you can turn the lights off that would be fine
but the switch is still there and you can still turn it on yes of course you can you can move it it moves on the wall right it pivots yeah in my house it just won't do anything no the lamp is still up there we haven't disassembled the room the lamp is on the ceiling the switch is on the wall right the switch variable means the state of the switch is it flipped up on or off whether that causes the light to turn on depends upon the presence of power in the city yeah that's all the diagram says right so in a population of buildings there's no causal relationship between the power and the switch they're independent of one another right but here's the point I want to make once you know the state of the light they're not independent they're causally independent still but they're statistically associated and this is why it's bad to just throw things into a regression model so let me walk through this let's think about why knowing the light gives you information about the other variables it's because it's jointly caused by them so if the light is on and the power's on what's the switch on got it right it's easy this is science causal inference right you have inferred this the power doesn't cause the switch though but they're statistically associated in the population as a consequence of this so once you condition on the light knowing power tells you the position of the switch that's the statistical association in a statistical model conditioning on the light creates a statistical association between these two it means that power gives you information about the switch once you learn whether the light is on right but it's purely statistical if you don't know the causal diagram you won't know that this is a confound you won't know that it's not a causal relationship and this is why just adding variables like light to a model can screw up your inference.
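The light/power/switch story can be checked with a toy simulation (probabilities invented): power and switch are independent coin flips, the light is on only when both are on, and conditioning on the light manufactures an association between causally unrelated variables:

```python
import random

rng = random.Random(4)
n = 100_000

# Power and switch are causally independent coin flips.
power = [rng.random() < 0.5 for _ in range(n)]
switch = [rng.random() < 0.5 for _ in range(n)]
# The light is a collider: on only when BOTH parents are on.
light = [p and s for p, s in zip(power, switch)]

def p_switch_on(power_state, light_state):
    """P(switch on | power, light), estimated from the simulation."""
    rows = [s for p, s, l in zip(power, switch, light)
            if p == power_state and l == light_state]
    return sum(rows) / len(rows)

# Condition on "light is off": now power predicts the switch.
print(p_switch_on(True, False))    # power on, light off: switch must be off
print(p_switch_on(False, False))   # power off: light says nothing, ~0.5
```

Unconditionally the switch is on about half the time no matter what the power does; only after conditioning on the collider does power become informative about the switch.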
Let me do one more example if the light's off and the power is on then what's the switch off right so there's a statistical association you can predict the state of the switch if you know the other two even though there's no causal relationship between the power and the switch yeah so light is not the cause of the switch but if you add it to the model it'll tell you that power predicts the switch right which misleads you now of course you would never do this in this case if you understand electricity and the switch on the wall and all that but if you just start doing causal salad with variables you can easily be tricked this is a special effect in statistics known as collider bias the light is called a collider because two arrows enter it they collide on the variable let me show you another example that isn't totally silly this is a real example from the literature on happiness which is a very fun literature right people study how to be happy so there's a big literature on it right I wanna know how to be happy if the science can help me that would be great so this is a famous example say you're interested in the relationship between age and happiness you want to know are older people sadder or happier right probably nonlinear there's probably some curvilinear relationship right it's plausible that marriage is a confound right because that affects people's happiness and we're not going to say which direction it might affect things right so be careful what I say here this is being recorded suppose the true causal relationship is that age doesn't influence happiness at all this is a thought experiment you're going to regress happiness on age but you're asking should I control for marriage because the relationship might be different among married and non-married people sounds reasonable but if the true state of the world is that there's no relationship between age and happiness people are
born at a certain happiness and stay that way their whole life or they drift around but it has nothing to do with age but age and happiness affect marriage why would that be true I think this is very plausibly true every year you live you have another opportunity to get married so older people are more likely to be married than younger people age causes marriage it does I know it sounds weird but from a variable's perspective it does yeah and there's also divorce and remarriage and all these other dynamic processes that are fun but you know leaving that out of the model for the moment happiness also causes marriage people don't want to marry sad people so all other things being equal happiness is a predictor of being married in my book I think it's chapter six I wrote a simulation of this and I show you that if you do a regression where you're predicting happiness with both of these you end up concluding that people get sadder with age and this is because marriage is a collider there's a finding-out effect happiness is independent of age that's an assumption of the simulation but once you condition on marriage it's not it's dependent on age conditional on being married because of this finding-out effect so again let's think about it like a light switch if I know that someone's married and I know that they're young then they're probably happier than average so I find out happiness by knowing age and vice versa if someone's married and they're really old they're probably less happy on average and if there are friends in the audience I'm sorry but it's not causal friends this is just a statistical artifact of conditioning on a collider which is a very bad thing to do how do you know if something is a collider you need theory it's not in the model it's outside the statistical model okay our time is coming to an end so let me try to summarize this a bit this is why not just add everything so remember causal salad is just my playful term for this very common procedure in the applied sciences
or rather in applied statistics of just conditioning on everything that you've measured in hopes of avoiding being confounded that works for benign sorts of confounds like age creating a spurious correlation between height and math ability it will not work for colliders colliders exist this is a real threat there are many famous examples of artifact relationships because people conditioned on colliders it happens all the time it's not even exotic you get all sorts of other things that I haven't given you examples for although there are examples of these in my book things like conditioning on post-treatment variables if you run experiments you're not safe from this so a post-treatment variable is something that happens after the treatment is applied and sometimes people put these as controls in their models this is very dangerous you can end up concluding that the treatment has no effect when it actually does why because the treatment is mediated through a post-treatment variable and so if you put the post-treatment variable in it explains away the treatment after you know the post-treatment variable you don't learn anything extra from the treatment it's a very risky thing to do to condition on post-treatment variables sometimes it's safe though it depends upon the details of the causal model likewise there are pre-treatment variables things that happen before the treatment these can be colliders and then you don't want to put them in the model either but you might want to condition on them because they might be confounds yeah you having fun yet there are whole books about this again there are like two chapters in my book which are all about these problems question from the audience is it always a bias sometimes it's a bias sometimes it's not it's a bias when you condition on it and you're not aware of what's implied then it creates a bias in the estimate of the causal effect this is not Bayesian this is standard terminology in the field of
causal inference yeah it's neither Bayesian nor not although most causal inference stuff is Bayesian but it doesn't have to be this is independent of that there is good news you can make causal inferences in observational systems that's good because that's mostly what I do I'm an anthropologist right we go off to the field for years learn foreign languages get parasites that's what we do professionally and I could tell you stories but our systems are observational when we do experiments what we're really doing is just measuring things right they're just ways to do measurements controlled measurements but they're not really treatments being applied at random and we still want to do causal inference though so we need strong theory to do it and we use these causal diagrams to govern our inferences and there's a lot of resources on how to do this so I'm going to leave you with just a couple of suggestions obviously there's my own book which you have a copy of you have a PDF of and there's a bunch of causal diagrams in it to introduce you to this in the context of doing estimation as well there's also this great book Causal Inference in Statistics: A Primer by Pearl Glymour and Jewell which is a very gentle introduction to causal inference the pre-statistical considerations we've covered at the end of this presentation today it's meant to be introductory it just uses very basic statistics which I think all of you know linear models t-tests those sorts of things cross tabulations and a bunch of examples of how you can end up being confounded and when you can decide how to remove the confounding or whether it's possible even to do so so let's just return to the headline here on this slide both of these things Bayesian inference and causal inference are really just counting the implications of assumptions and that's all we've got in scientific inference is this dance between making assumptions seeing how well they predict reality and then adjusting the assumptions
we need some middle ground to count up the implications of assumptions so we can do comparisons against reality and that middle ground is probability theory and both of these frameworks are just applied probability theory okay thank you for your indulgence and I hope some of this was useful