Transcript for:
Understanding Conditional Independence Concepts

So what should we do if the variables are not independent? Let's go through an example. Consider a sister and a brother, and say we want to reason about their ethnicity. What is the probability that the sister is white, given that we know the brother is white? Well, it's quite likely; say 90 percent. But what is the probability that the sister is white given that her brother is black? It's much less likely; say 10 percent, knowing that they have the same parents. From this we can conclude that S and B are not independent: if they were independent, the conditional probability of S given B would be the same for both values of B; S would not depend on the value of B, whereas here it does.

Now assume that we also know their parents. Say the parent of the sister is white. Then the probability of S being white, given that the parents are white, is very high, say 95 percent. And now the effect of knowing the color of the brother does not matter anymore: the probability of the sister being white, knowing the parents are white and additionally knowing the brother is white, is still 95 percent, and knowing instead that the brother is black, it is still 95 percent. What matters is really knowing the cause of the sister's skin color, which is the parents; knowing the brother does not matter here. From this we conclude that S is conditionally independent of B given the parents, and we write it in this form: S independent of B given the parents.

Now let's get back to our covid example. Here we consider fever and the PCR test result. If someone has a fever, we can guess their test result, and vice versa. For example, the probability of someone having a fever given that they had a negative PCR test might be 20 percent, and if we know they had a positive PCR test it increases, because they most likely had covid, and it is likely that they exhibit some symptoms; say 70 percent. You see that the probability of F = 1 does depend on P: the two values are not the same, so F and P are not independent. We can do the same for P: whether or not the person has a fever affects the probability of a positive or negative PCR test result. So the conclusion is that the two variables are not independent.

But we don't stop here. Suppose that we additionally know another variable, say that the person is infected with covid, C = 1. Then they have a fever and test positive with high probability: say the probability of having a fever given that the person is infected with covid is 0.6, and the probability of a positive test result given that the person has covid is 0.9. More importantly, knowing about the fever no longer increases our knowledge about the test result, and vice versa. The probability of having a fever, conditioned on the fact that we know the person has covid, is still 0.6, even if we additionally know the person's PCR test result. The whole role of a PCR test result is to indicate, to reveal, whether the person has covid or not; we already know the person has covid, so the test result doesn't change the probability! We can easily say that regardless of the value of P being equal to zero or one (here I'm using zero and one instead of negative and positive, just to make it easier to read), the probability of F = 1, having a fever, knowing that the person has covid, will be 0.6, as before. Roughly the same holds for the PCR test result: if we know the person has covid, then most likely we will see a positive test result, and it does not matter whether the person has a fever or not (symptomatic or asymptomatic, the test is supposed to work in either case).
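To make this concrete, here is a minimal Python sketch of exactly this check. The 0.6 and 0.9 are the numbers from above; the prior P(C = 1) = 0.1 and the C = 0 column for the test are assumptions I picked just to complete the joint.

```python
from itertools import product

# Joint over (C, F, P). The 0.6 and 0.9 are the lecture's numbers; the
# prior P(C=1) = 0.1 and P(P=1 | C=0) = 0.05 are illustrative assumptions.
p_c1 = 0.1                         # assumed P(C = 1)
p_f1_given_c = {0: 0.2, 1: 0.6}    # P(F = 1 | C)
p_p1_given_c = {0: 0.05, 1: 0.9}   # P(P = 1 | C)

# Build the joint so that F and P interact only through C; by
# construction F is independent of P given C, and we verify it below.
joint = {}
for c, f, p in product([0, 1], repeat=3):
    pc = p_c1 if c == 1 else 1 - p_c1
    pf = p_f1_given_c[c] if f == 1 else 1 - p_f1_given_c[c]
    pp = p_p1_given_c[c] if p == 1 else 1 - p_p1_given_c[c]
    joint[(c, f, p)] = pc * pf * pp

def cond(event, given):
    """P(event | given), where both are predicates on (c, f, p) tuples."""
    den = sum(pr for k, pr in joint.items() if given(k))
    num = sum(pr for k, pr in joint.items() if event(k) and given(k))
    return num / den

# Without conditioning on C, fever and test result are dependent:
print(cond(lambda k: k[1] == 1, lambda k: k[2] == 0))  # ~0.20
print(cond(lambda k: k[1] == 1, lambda k: k[2] == 1))  # ~0.47

# Once C = 1 is known, the PCR result changes nothing:
print(cond(lambda k: k[1] == 1, lambda k: k[0] == 1))                # 0.6
print(cond(lambda k: k[1] == 1, lambda k: k[0] == 1 and k[2] == 0))  # 0.6
print(cond(lambda k: k[1] == 1, lambda k: k[0] == 1 and k[2] == 1))  # 0.6
```

The first two numbers differ, so F and P are dependent; the last three agree, which is exactly the "knowing P no longer matters once we know C = 1" point.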
So what can we conclude from here? Can we say that F is conditionally independent of P given C? No, we should be careful: so far we have only shown independence conditioned on C being equal to 1. What do I mean by that? It means that just knowing the person has covid, we can conclude that F and P are independent: for example, F = 1 is independent of the value of P, whether P = 0 or P = 1. And by the way, the opposite case, F = 0 conditioned on the same things, follows easily, because the two probabilities have to add up to 1. If I want to conclude that F and P are independent given C, I need to reach the same conclusion from the same equations for C equal to 0 as well, and that's what we aim for here, in this slide.

Now, if we know that the person does not have covid, is not infected, then they have a fever or test positive only with low probability. The probability of the person having a fever, knowing that the person does not have covid, will be low, and the probability of a positive test result, knowing that the person is not infected with covid, is also low. Now you may ask: how do we know the person is not infected? Just for simplicity, assume that we somehow know, say from a very accurate test, even more accurate than the PCR test.

We can further conclude that the probability of having a fever, knowing that the person does not have covid, no longer depends on the PCR test result. Whether P equals zero or one doesn't matter; it's the same as just knowing the person does not have covid, and it's low: 0.2. And the same conclusion holds for the PCR test result: knowing that the person is not infected is sufficient, and it doesn't matter anymore whether the person has a fever or not. So we can conclude that F and P are independent given C equal to 0. Putting this together with the previous conclusion for C equal to 1, we can conclude that F is independent of P given C: fever is conditionally independent of the test result given covid. Of course this does not always hold for all choices of variables; only partial independence can happen, where we have only one of the two cases. But here we reach the conclusion that it holds in general.
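Continuing the sketch above, a small helper can check this exhaustively, for both values of C at once:

```python
def fever_indep_pcr_given_covid():
    """Check P(F=1 | C=c, P=p) == P(F=1 | C=c) for every pair (c, p).
    Checking F = 1 is enough for a binary F: the F = 0 case follows
    because the two probabilities add up to 1."""
    for c in (0, 1):
        base = cond(lambda k: k[1] == 1, lambda k: k[0] == c)
        for p in (0, 1):
            withp = cond(lambda k: k[1] == 1,
                         lambda k: k[0] == c and k[2] == p)
            if abs(withp - base) > 1e-12:
                return False
    return True

print(fever_indep_pcr_given_covid())  # True: holds for c = 0 AND c = 1
```

The function only returns True because the check passes for c = 0 and for c = 1; that is the "both need to hold" point made above.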
Good! So let's now try to formalize things. Consider sets of random variables X, Y, Z. Here X is a set; it can include several variables, and so can Y and Z. We say that X is conditionally independent of Y given Z in a distribution P (we didn't talk much about the distribution P; it was implicit in the previous slides, and here we are formalizing it, so we make it explicit), and we denote it with this notation: P models X independent of Y conditioned on Z. This holds if P(X | Y, Z) equals P(X | Z). If we want to be precise, we need to write down the values: P(X = x | Y = y, Z = z) = P(X = x | Z = z). Of course, for simplicity we almost always omit these "equal to" parts. Remember that the capital letter is the variable of interest, the random variable itself, and the small letter is a value the random variable takes, so the notation emphasizes that this condition should hold for all values of x, y, and z. In particular, it should hold for every value z, like what we saw here with C equal to 0 and C equal to 1: both needed to hold in order to conclude that they are independent.

Now just a few side notes. The variables in the set Z are said to be observed variables. If Z is empty, we are not conditioning on anything, and we can go back to our originally defined notion of independence: we just say X and Y are mutually statistically, or probabilistically, independent. We can also say that they are marginally independent, and we just write P(X | Y) = P(X). The set of all probabilistic independencies in P we capture by the set I(P): P stands for the distribution, I stands for independencies. So I(P) includes all such conditional independencies that hold in the distribution. For example, in the previous case, where we had F, C, and P, I(P) would include the conditional independence of F and P given C.

And we have a proposition here: X and Y are independent given Z if it holds that P(X, Y | Z) is equal to P(X | Z) times P(Y | Z). You see it's a straightforward generalization of the non-conditional case, where we just had P(X, Y) equal to P(X) times P(Y); what is added here is just that we condition everything on Z. So it's easy to remember, it's also intuitive, and you can try to prove this proposition.

Now, how can conditional independence help? We just saw how independence can help, but what about conditional independence? Let's go back to our example: covid, fever, PCR test result, where we saw a conditional independence. If we write the joint probability distribution of F, P, and C using conditional probability, we can write it as P(F | P, C) times P(P, C). If P and C were independent, we could easily write the second part as P(P) times P(C), but that doesn't hold. What holds instead is that F is independent of P given C; it is included in the set of independencies. So we know that fever and the test result are independent conditioned on covid, which we just showed. How can we use this? Well, by definition, the first part, P(F | P, C), can be reduced to P(F | C). Don't get bothered by the notation with X, Y, Z; this is exactly the definition applied here, and we just got rid of P, because F is independent of it given C. So what happens to the number of parameters? We have 2 for the conditional probability P(F | C), and 4 minus 1, that is 3, for P(P, C); so 2 plus 3 equals 5. Originally we had seven.
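As a sanity check on the counting, here are a couple of tiny helpers (my own, not from the lecture) that count free parameters for full tables of discrete variables:

```python
import math

def joint_params(arities):
    """Free parameters of a full joint table over the given variables."""
    return math.prod(arities) - 1

def cpd_params(arity, parent_arities):
    """Free parameters of P(X | parents): (|X| - 1) per parent configuration."""
    return (arity - 1) * math.prod(parent_arities)

# Full joint over the binary F, P, C:
print(joint_params([2, 2, 2]))                    # 7

# Factorized as P(F | C) * P(P, C), using F independent of P given C:
print(cpd_params(2, [2]) + joint_params([2, 2]))  # 2 + 3 = 5
```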
And again, the importance of conditional independence: conditional independence can reduce the number of parameters.

A little exercise here: show that the following factorization results in the same number of parameters. Here we factorized according to F: we said the joint is equal to P(F | P, C) times P(P, C). If we had done it differently and conditioned on C, say P(F, P | C) times P(C), try to show that we would still have the same number of parameters. This should be intuitively correct, because we are using the same independence, and so we don't expect fewer or more parameters.

So far we have covered examples; now, how do we generalize the idea? Step one is that we need to factorize the joint distribution into CPDs. We just saw it in the previous slide; we always do this part. And how do we do it? Here is where the chain rule can help us, the chain rule for random variables. It says that the joint probability distribution P(x1, ..., xn) can be factorized as follows (I'll read from right to left, it's easier): it's P(x1), times P(x2 | x1), times P(x3 | x2, x1), all the way to P(xn | xn-1, ..., x1). You can write this in the compact form of a product of terms P(xi | xi-1, ..., x1). This you can prove using the definition of conditional probability. Just note that sometimes in the literature P(x1, ..., xn) is written as P(xn, ..., x1), to resemble the order of the factorization, which may also help to remember it; of course the two are equal, just as P(A, B) is equal to P(B, A).

A little exercise here, before we proceed to the next step: for binary-valued random variables, compute the number of required parameters using the chain rule. We know that we need 2 to the power of n, minus 1, parameters to model this joint distribution directly. Show that if we factorize using the chain rule, the number of parameters does not reduce. I already kind of gave away the answer, and here's the hint. And then, in the next exercise, you are asked to generalize it to general discrete random variables, where x1 can take alpha_1 values, up to xn taking alpha_n values. Again, intuitively this is expected, because the chain rule holds for general joint probability distributions, and we haven't yet done any reduction in these CPD (conditional probability distribution) terms, so we don't expect any reduction in the number of parameters. So this was step one: factorize.
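If you want to check your answer to this exercise, here is a small counting sketch (my own helper, assuming each CPD is stored as a full table):

```python
import math

def chain_rule_params(arities):
    """Total parameters when P(x1, ..., xn) is factorized by the chain
    rule: term i is P(xi | xi-1, ..., x1), contributing
    (arity_i - 1) * arity_{i-1} * ... * arity_1 free parameters."""
    return sum((a - 1) * math.prod(arities[:i])
               for i, a in enumerate(arities))

# Binary case: the chain rule alone still gives 2**n - 1, no reduction.
print(chain_rule_params([2, 2, 2, 2]), 2**4 - 1)               # 15 15

# General discrete case with arities alpha_i: still prod(alpha_i) - 1.
print(chain_rule_params([3, 2, 4]), math.prod([3, 2, 4]) - 1)  # 23 23
```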
Step two: we need to simplify some of these CPD terms. Based on what? Based on conditional independencies. How? Here is a little example. Consider the joint distribution P(x1, ..., x4). Using the chain rule, we factorize it as P(x4 | x3, x2, x1), times P(x3 | x2, x1), and so on down to P(x1). Imagine that we have a conditional independence where x4 is independent of x1 and x2 given x3. That obviously means that I can just omit x2 and x1 from this conditional distribution, and so the number of required parameters for this CPD reduces from 2 times 2 times 2 (8) to only 2.

Now the joint distribution becomes the simplified version where x1 and x2 are omitted from that term, and the total number of parameters reduces from 15 (originally I had 2 to the power of 4, minus 1) to 2 plus 4 plus 2 plus 1, which is 9.

We go on with a little exercise. Here we can see that if a conditional independence holds in P, the joint can be factorized accordingly, so that the corresponding CPD term is simplified. Whatever independence you give me, what I need is a term of the form P(a single variable | a bunch of other variables), where that variable is independent of some of the conditioning variables. Such a term should appear in the factorization: I need the first variable conditioned on all the other variables, and then I can simplify it. And I can always make sure that this term appears in the chain rule by reordering the variables, so that the desired term shows up. So a single conditional independence can always be made to appear in one of the CPD terms of the factorization, and then I can reduce the number of parameters. Now the exercise asks: what can be said about the inverse statement? Meaning that if I see a factorization where, compared to the chain rule, one of the CPDs is reduced, can I conclude that there is a conditional independence?

Good! Now I want to continue with the example. Imagine that in addition to this conditional independence, we have an additional one: x1 being independent of x2 given x3. So it's no longer about x4; here it's all about x1, x2, x3. Then we will have that P(x1 | x2, x3) is just equal to P(x1 | x3); x2 is omitted. But we cannot apply this to the previous factorization, the one we ended up with last time, x4 given x3 times the remaining terms. Why can't we use it? Because I don't have the term P(x1 | x2, x3) there. The only place where x1 appears on the left side of the conditioning notation is P(x1), and there it is not conditioned on anything. So what should we do? Well, we can reorder the variables and then use the chain rule again, so that this part is rewritten. The first part is fine: originally I had x4 conditioned on the other variables, and that stays. But then I need x1 conditioned on x2 and x3. The other terms I don't care about: they can be P(x3 | x2) times P(x2), or P(x2 | x3) times P(x3). But I need the term P(x1 | x2, x3), and I need the first one as well, so that I can still use the previous conditional independence. If I do this, then I will simplify the first term as in the previous slide, but I will also reduce the x1 term, and I will end up with a simplified version where two of the CPDs are simplified.

Note that if I want to write the joint down using this order, it will be P(x2, x3, x1, x4). Of course this doesn't matter; you can just write P(x1, x2, x3, x4). It's just to help with the ordering of the CPD terms.
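Here is the whole progression as a counting sketch. The helper and the final count of 7 for the reordered factorization are mine; the 15 and 9 are the numbers from the lecture.

```python
def factorization_params(cpds, arity=2):
    """cpds: list of (variable, parents) pairs, every variable binary by
    default. Total free parameters of the product of these CPDs."""
    return sum((arity - 1) * arity ** len(parents) for _, parents in cpds)

# Plain chain rule over x1..x4: 8 + 4 + 2 + 1 = 15
chain = [("x4", ["x3", "x2", "x1"]), ("x3", ["x2", "x1"]),
         ("x2", ["x1"]), ("x1", [])]
print(factorization_params(chain))   # 15

# Using x4 independent of {x1, x2} given x3: 2 + 4 + 2 + 1 = 9
step1 = [("x4", ["x3"]), ("x3", ["x2", "x1"]), ("x2", ["x1"]), ("x1", [])]
print(factorization_params(step1))   # 9

# Reordered as (x2, x3, x1, x4) so x1 independent of x2 given x3 helps too:
step2 = [("x4", ["x3"]), ("x1", ["x3"]), ("x3", ["x2"]), ("x2", [])]
print(factorization_params(step2))   # 2 + 2 + 2 + 1 = 7 (my count)
```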
Now, what if we additionally have x3 independent of x2 given x1? Before, we had x1, conditioned on x3, being independent of x2; here it is x3 that is independent of x2, conditioned on x1. Now it is impossible to simplify this factorization further, if we want to use this independence as well; you can try yourself. At least it is not straightforward, because the previous independence requires a term with x1 conditioned on x2 and x3, while this one requires a term with x3 conditioned on x2 and x1. So if we want to use our previous approach, we certainly cannot handle both of them in the same factorization. And the conclusion here, which we will later make very concrete and rigorous, is that not all independencies can appear in a factorization.

Now let's see it in practice, back to our covid example with 12 variables. So what should we do? Should we just find the conditional independencies and factorize the joint distribution accordingly? Well, yes, that is the idea, but a couple of questions remain. First, how do we find these conditional independencies, I(P)? We need to find them from the data set. I mean, in all the previous examples I just said "assume that we are given this conditional independence, and this one", but how can we really find them? The answer is that we will get back to this in the chapter on structure learning; for now, let's just say that there are statistical tests for doing this. Second, even if we have I(P), how can we systematically factorize the joint distribution accordingly? Which of these conditional independencies do we include, and which not? And according to what order? We just saw that we may not be able to include all of them. A visualization tool can help here, like a graph, and this is the whole idea of Bayesian networks, which we will get back to in the next video.

Just to summarize: the goal was to find, from data, the joint probability distribution. Here's the data, and we wanted to get the joint probability distribution. I want to make it more concrete by saying that the goal is to find the correct factorization of the joint probability distribution. What do we mean by correct? Let's put that aside for now, but let's just say that each factorization will result in a certain number of parameters. The whole purpose of Bayesian networks is this factorization; that's what's happening behind the scenes. Each factorization can result in a different number of parameters, and we want to find the one that is correct. And how can we do that? Well, first of all, we need to find I(P), which conditional independencies hold. For example, if we perform those tests I mentioned, imagine that we find that I(P) contains C conditionally independent of D given M, and also M independent of D. Then how can we factorize this? So the next question is: if we know I(P), what should we do? We will cover this in the next video...
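Just as a teaser for that next video, here is how this closing example could work out with the same counting helper (factorization_params from the sketch above). The variable ordering (M, D, C) is my own choice, not the lecture's.

```python
# Chain rule in the order (M, D, C):
#   P(C, D, M) = P(C | D, M) * P(D | M) * P(M)
# C independent of D given M simplifies P(C | D, M) to P(C | M);
# M independent of D          simplifies P(D | M)   to P(D).
factorized = [("C", ["M"]), ("D", []), ("M", [])]
print(factorization_params(factorized))  # 2 + 1 + 1 = 4, down from 7
```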