So what should we do if the variables are not independent? Let's go through an example: consider a sister and a brother, and say we want to find their ethnicity. What is the probability that the sister is white, given that we know the brother is white? Well, it's quite likely; say 90 percent. But what is the probability that the sister is white given that her brother is black? It's much less likely; say 10 percent, since we know they are from the same parents. From this we can conclude that s and b are not independent: if they were independent, the two conditional probabilities of s given b would be equal, i.e., s would not depend on the value of b, whereas here it does. Now assume that we know their parents. Okay? So then,
say we know that the parents of the sister are white. Now imagine that the probability of s being white, given that the parents are white, is very high, like 95 percent. Okay? From here we can see that knowing the color of the brother no longer matters: the probability of the sister being white, knowing that the parents are white, is still 95 percent if we additionally know that the brother is white, and still 95 percent if we know that the brother is black. What really matters is knowing the cause of the sister's skin color, which is the parents'; knowing the brother adds nothing. From this we can conclude that s is conditionally independent of b given the parents, and we write it in this form: s independent of b, given the parents.
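In symbols, with the illustrative numbers from above:

```latex
P(s=\text{white} \mid b=\text{white}) = 0.9
\;\neq\;
P(s=\text{white} \mid b=\text{black}) = 0.1
\quad\Rightarrow\quad s \not\perp b ,

P(s=\text{white} \mid \text{parents}=\text{white},\, b) = 0.95
\;\text{ for both values of } b
\quad\Rightarrow\quad (s \perp b \mid \text{parents}) .
```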
Now let's get back to our COVID example. Here we consider fever (f) and the PCR test result (p). If someone has a fever, we can guess their test result, and vice versa. For example, say the probability of someone having a fever, given that we know they had a negative PCR test, is 20 percent; if instead we know they had a positive PCR test, it increases, because they most likely had COVID and are likely to exhibit some symptoms, so say 70 percent. You see that the probability of f equal to 1 does depend on p: the two values are not the same, so f and p are not independent. We can do the same for p: whether or not the person has a fever affects the probability of a positive or negative PCR test result. So the conclusion is that the two variables are not independent.
But we don't stop here. Suppose we additionally know another variable: say we know the person is infected by COVID, c equal to 1. Then they have a fever and test positive with high probability. So the probability of having a fever, given that we know the person is infected with COVID, will be, say, 0.6, and the probability of a positive test result, given that the person has COVID, will be, say, 0.9. But more importantly, what happens here is that knowing about the fever no longer adds to our knowledge about the test result, and vice versa. The probability of having a fever, conditioned on knowing the person has COVID, is still 0.6, even if we additionally know the PCR test result. Okay? Because the whole role of a PCR test result is to indicate, to reveal, whether the person has COVID or not; now we already know the person has COVID, so it doesn't change the probability! So regardless of the value of p being 0 or 1 (here I'm using 0 and 1 instead of false and true, just to make it easier to read), the probability of f equal to 1, of having a fever, knowing that the person has COVID, will be 0.6, as before. Okay?
Now, roughly, we can say the same for the PCR test result: if we know the person has COVID, then most likely we will see a positive test result, and this does not depend on whether the person has a fever or not (whether the person is symptomatic or asymptomatic, the test is supposed to work in either case). Okay? So what can we conclude from here?
Can we say that f is conditionally independent of p given c? No, we should be careful: so far we have only shown conditional independence for c equal to 1. What do I mean by that? It means that once we know the person has COVID, we can conclude that f and p are independent. Okay? For example, f equal to 1 is independent of whether p equals 0 or 1. (And by the way, I can just as easily write down the complement, f equal to 0 conditioned on the same events, because the two probabilities must add up to 1.) If I want to conclude that f and p are independent given c, I need to reach the same conclusion from the same equations for c equal to 0 as well, and that's what we aim for in this slide. Now, if we know that the person does not have
COVID, is not infected, then they have a fever, and test positive, with low probability. Okay? So the probability of the person having a fever, knowing that the person does not have COVID, will be low, and the probability of a positive test result, knowing that the person is not infected with COVID, is also low. Now, you may ask: how do we know they are not infected? Just for simplicity, assume that we somehow know; say through a very accurate test, even more accurate than the PCR test. Okay? Now, we can further argue that the probability of having a fever, knowing that the person does not have COVID, no longer depends on the PCR test result: whether p equals 0 or 1 doesn't matter; it's the same as just knowing the person does not have COVID, and it's low, say 0.2. And we can conclude the same for the PCR test result: knowing that the person is not infected is sufficient to determine the probability; it no longer matters whether the person has a fever or not. Okay? So we can conclude that f and p are independent given c equal to 0. Now, putting this together with the previous conclusion for c equal to 1,
we can conclude that f is independent of p given c: fever is conditionally independent of the test result given COVID. Okay? Of course, this does not always hold: for some choices of variables only a partial independence may hold (say, only for c equal to 1), but here we reached the conclusion that it holds in general.
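As a quick numerical sanity check (my own sketch, not from the lecture, with made-up numbers in the spirit of the example), we can build a joint distribution over (f, p, c) that factorizes this way and verify the definition:

```python
import numpy as np

# hypothetical numbers, loosely following the lecture's example
p_c = np.array([0.9, 0.1])             # P(c=0), P(c=1)
p_f_given_c = np.array([[0.8, 0.2],    # P(f=0|c=0), P(f=1|c=0)
                        [0.4, 0.6]])   # P(f=0|c=1), P(f=1|c=1)
p_p_given_c = np.array([[0.95, 0.05],  # P(p=0|c=0), P(p=1|c=0)
                        [0.10, 0.90]]) # P(p=0|c=1), P(p=1|c=1)

# joint[f, p, c] = P(f|c) P(p|c) P(c), so f is independent of p given c
# by construction
joint = np.einsum('cf,cp,c->fpc', p_f_given_c, p_p_given_c, p_c)

# P(f | p, c) should equal P(f | c) for every value of f, p, and c
p_f_given_pc = joint / joint.sum(axis=0, keepdims=True)
for f in (0, 1):
    for p in (0, 1):
        for c in (0, 1):
            assert np.isclose(p_f_given_pc[f, p, c], p_f_given_c[c, f])
print("f is independent of p given c, for both c = 0 and c = 1")
```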
Good! So let's now try to formalize things. Consider sets of random variables x, y, and z; here x is a set, so it can include several variables, and so can y and z. We say that x is conditionally independent of y given z, in a distribution p (we didn't talk much about the distribution p before; it was implicit in the previous slides, and here we're formalizing things, so we make it explicit), denoted "p models x independent of y given z", if p of x given y and z equals p of x given z. If we want to be precise, we need to write down the values: remember that a capital letter is the random variable itself, the variable of interest, while the small letter is a value that the random variable takes. This emphasizes that the condition should hold for all values of x, y, and z; of course, for simplicity, we almost always omit the "y equal to y" and "z equal to z" parts. In particular, the equality must hold for every value of z and y; like what we saw earlier, both c equal to 0 and c equal to 1 need to hold, in order to conclude that they are independent.
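Written out with values, the definition reads:

```latex
P \models (X \perp Y \mid Z)
\quad\text{iff}\quad
P(X = x \mid Y = y, Z = z) \;=\; P(X = x \mid Z = z)
\quad\text{for all values } x, y, z .
```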
Now, just a few side notes. The variables in the set z are said to be observed variables. If z is empty, we don't condition on anything, and we are back to our original definition of independence: we just say that x and y are mutually statistically, or probabilistically, independent. We can also say that they are marginally independent, and we write p of x given y equals p of x. The set of all independencies that hold in p we denote by I(P): p stands for the distribution, I for independencies. Okay? So I(P) includes all the conditional independencies that hold in the distribution. For example, in the previous case, where we had f, c, and p, I(P) would include the conditional independence: f independent of p given c.
And we have a proposition here: x and y are independent given z if (in fact, if and only if) p of x and y conditioned on z is equal to p of x given z times p of y given z. You see, it's a straightforward generalization of the unconditional case, where we had p of x and y equal to p of x times p of y; what is added is just that we condition everything on z. So it's easy to remember, and it's also intuitive.
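A one-line proof sketch, using the chain rule for conditional probabilities and then the definition above:

```latex
P(X, Y \mid Z)
= P(X \mid Y, Z)\, P(Y \mid Z)
= P(X \mid Z)\, P(Y \mid Z) .
```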
You can try to fill in the details and prove this proposition. Now, how can conditional independence help? We just saw how independence can help, but what about conditional independence? Let's go back to our
example: COVID (c), fever (f), PCR test result (p). We saw a conditional independence there. Now, if we write the joint probability distribution of f, p, and c using conditional probability, we can write it as p of f given p and c, times p of p and c. If p and c were independent, we could easily write this second part as p of p times p of c; but that doesn't hold. What holds instead is that f is independent of p given c. Okay? It is included in the set of independencies: fever and the test result are independent conditioned on COVID, which we just showed. How can we use this? Well, by definition, the first part, p of f conditioned on p and c, can be reduced to p of f given c (don't get bothered by the notation with x, y, z: this is exactly the definition, and we just got rid of p, because f is independent of it). Okay? So what happens to the number of parameters? For the conditional probability p of f given c we have 2 parameters (one per value of c), and for p of p and c we have 4 minus 1, which is 3. So 2 plus 3 equals 5, while originally we had 7. And this again shows the importance of conditional independence: conditional independence can reduce the number of parameters.
A little exercise here: show that the following factorization results in the same number of parameters. Above, we factorized according to f: we said the joint equals the probability of f conditioned on p and c, times the probability of p and c. Now suppose we had done it differently and conditioned on c, say p of f and p conditioned on c, times p of c. Continue from there, and try to show that we would still end up with the same number of parameters. This should be intuitively correct: we are using the same independence, so we don't expect fewer or more parameters.
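A small counting helper (my own sketch, not from the lecture) makes both counts easy to check; each factor p(targets | given) over binary variables contributes (2^|targets| − 1) · 2^|given| free parameters:

```python
# number of free parameters of a factorization over binary variables,
# where each factor P(targets | given) is listed as (targets, given)
def n_params(factors):
    return sum((2 ** len(targets) - 1) * 2 ** len(given)
               for targets, given in factors)

# P(f | p,c) * P(p,c):  4 + 3 = 7 parameters (no independence used)
print(n_params([(["f"], ["p", "c"]), (["p", "c"], [])]))        # 7
# P(f | c) * P(p,c):    2 + 3 = 5, using (f independent of p given c)
print(n_params([(["f"], ["c"]), (["p", "c"], [])]))             # 5
# the exercise's factorization, P(f,p | c) * P(c): 3*2 + 1 = 7,
# and with the independence, P(f|c) P(p|c) P(c):   2 + 2 + 1 = 5
print(n_params([(["f", "p"], ["c"]), (["c"], [])]))             # 7
print(n_params([(["f"], ["c"]), (["p"], ["c"]), (["c"], [])]))  # 5
```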
Okay... so far we have covered examples; now, how do we generalize the idea? Step one: we need to factorize the joint distribution into CPDs, right? We just saw this in the previous slide; we always do it. And how do we do this? Here is where the chain
rule for random variables can help us. It says that p of x1 to xn, this joint probability distribution, can be factorized as follows (I'll read it from right to left, it's easier): probability of x1, times p of x2 given x1, times p of x3 given x2 and x1, all the way to p of xn given xn minus 1 down to x1. You can also write it in a compact form: the product over i of p of xi conditioned on xi minus 1 down to x1. This you can prove using the definition of conditional probability.
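Written out:

```latex
P(x_1,\dots,x_n)
= P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_2, x_1) \cdots P(x_n \mid x_{n-1},\dots,x_1)
= \prod_{i=1}^{n} P(x_i \mid x_{i-1},\dots,x_1) .
```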
Just note that sometimes in the literature this p of x1 to xn is written as p of xn down to x1, to resemble the order of the factorization, which starts from x1 and goes to xn; it may also make the rule easier to remember. But of course the two are equal: it doesn't matter, p of a and b is equal to p of b and a. Okay? A little exercise
here, before we proceed to the next step: for binary-valued random variables, compute the number of required parameters using the chain rule. We know that for n binary variables we need 2 to the power of n, minus 1, parameters to model the joint distribution; show that if we instead model it via the chain rule, the number of parameters does not reduce.
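A sketch of the count (essentially the hint written out): the i-th CPD, p of xi given xi minus 1 down to x1, needs one free parameter per assignment of its i − 1 conditioning variables, so in total:

```latex
\sum_{i=1}^{n} (2-1)\cdot 2^{\,i-1}
= \sum_{i=1}^{n} 2^{\,i-1}
= 2^{n} - 1 .
```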
I already kind of gave the answer above, and the slide has a hint. In the next exercise, you're asked to generalize this to general discrete random variables, where x1 can take alpha_1 values, up to xn taking alpha_n values. Again, intuitively this is expected, right? Because the chain rule holds for
general joint probability distributions, and we haven't yet done any reduction in the CPD (conditional probability distribution) terms, so we don't expect any reduction in the number of parameters.
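For the general discrete case, the same telescoping argument gives (a sketch, with alpha_i values for xi):

```latex
\sum_{i=1}^{n} (\alpha_i - 1) \prod_{j<i} \alpha_j
= \sum_{i=1}^{n} \Big( \prod_{j \le i} \alpha_j \;-\; \prod_{j<i} \alpha_j \Big)
= \prod_{j=1}^{n} \alpha_j \;-\; 1 ,
```

which is exactly the number of parameters of the unfactorized joint.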
So this was step one: factorize. Step two: we need to simplify some of the CPD terms, the conditional probability distributions. Based on what? Based on conditional independencies. How? Here is a little example: consider the joint distribution p of x1 to x4. Using the chain rule, we factorize it into p of x4 given x3 down to x1, times p of x3 given x2 and x1, and so on, down to p of x1. Imagine that we have a conditional independence
where x4 is independent of x1 and x2 given x3. Okay? That means I can simply omit x2 and x1 from the first conditional distribution: p of x4 given x3, x2, x1 becomes p of x4 given x3. This means the number of required parameters for this CPD reduces from 2 times 2 times 2 (8) to only 2. The joint distribution then becomes the simplified version where x1 and x2 are omitted from that term, and the total number of parameters reduces from 15 (originally 2 to the power of 4, minus 1) to 2 plus 4 plus 2 plus 1, which is 9. Okay?
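Side by side, with the parameter counts written under the braces:

```latex
P(x_1,\dots,x_4)
= \underbrace{P(x_4 \mid x_3, x_2, x_1)}_{8}\,
  \underbrace{P(x_3 \mid x_2, x_1)}_{4}\,
  \underbrace{P(x_2 \mid x_1)}_{2}\,
  \underbrace{P(x_1)}_{1}
= \underbrace{P(x_4 \mid x_3)}_{2}\,
  P(x_3 \mid x_2, x_1)\, P(x_2 \mid x_1)\, P(x_1) ,
```

using (x4 independent of x1, x2 given x3): 15 parameters before, 9 after.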
We go on with a little exercise. Here we saw that if a conditional independence holds for p, the distribution can be factorized accordingly, so that the corresponding CPD term is simplified. Okay? Whatever single independence you give me, say a single variable being independent of some variables given some others, what I need is for the factorization to contain the term p of that variable conditioned on all the other variables; then I can simplify it. And I can always make sure that this term appears in the chain rule by reordering: I reorder the variables so that the desired term appears in the factorization. So a single conditional independence can always be made to appear in the CPD terms of the factorization, and then I can reduce the number of parameters. Now the exercise asks: what can be said about the inverse statement? Meaning, if I see a factorization in which, compared to the chain rule, one of the CPDs is reduced, can I conclude that a conditional independence holds? Okay... good!
Now I want to continue with the example. Imagine that, in addition to this conditional independence, we have an additional one: x1 being independent of x2 given x3. So it's no longer about x4; this one is all about x1, x2, x3. Then we have that p of x1 given x2, x3 is just equal to p of x1 given x3 (x2 is omitted). Okay? But we cannot apply this to the previous factorization, the one we ended up with, with x4 given x3 and the remaining terms. Why can't we use it? Because I don't have the term p of x1 given x2, x3 there: the only place where x1 appears on the left side of the conditioning bar is the term p of x1, and there it is not conditioned on anything. So what should we do? Well, we can reorder the variables and use the chain rule again, so that this part is rewritten. I'm fine with the first part: originally I had x4 conditioned on all the other variables, and I keep that term so that I can still use the previous conditional independence. But now I also need x1 conditioned on x2 and x3. The remaining terms I don't care about: it can be p of x3 given x2 times p of x2, or p of x2 given x3 times p of x3. If I do this, then I can simplify the first term as in the previous slide, but I can also reduce the new one, and I end up with a simplified version where two of the CPDs are simplified.
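Concretely, one ordering that works:

```latex
P(x_1,\dots,x_4)
= P(x_4 \mid x_1, x_2, x_3)\, P(x_1 \mid x_2, x_3)\, P(x_3 \mid x_2)\, P(x_2)
\;\longrightarrow\;
P(x_4 \mid x_3)\, P(x_1 \mid x_3)\, P(x_3 \mid x_2)\, P(x_2) ,
```

using (x4 independent of x1, x2 given x3) on the first term and (x1 independent of x2 given x3) on the second.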
Note that if I write the joint according to this ordering, it reads p of x2, x3, x1, x4. Of course, this doesn't matter: you can just write x1, x2, x3, x4; the ordering is only there to help organize the CPD terms. Now, what if we additionally have x3
independent of x2 conditioned on x1? Okay... before, we had x1 independent of x2 given x3; now it is x3 independent of x2 given x1. It now becomes impossible to further simplify this factorization, or at least it's not straightforward; you can try it yourself. The reason: the second independence requires a term p of x1 conditioned on x2 and x3, while this new one requires a term p of x3 conditioned on x2 and x1, and in any single chain-rule ordering only one of x1 and x3 can come after the other, so only one of the two terms can appear. So if we want to use our previous approach, we certainly cannot handle both of them in the same factorization. Okay... and the conclusion here, which we will later make very concrete and rigorous, is that not all independencies can appear in a factorization. Now let's see it in practice: back to
our COVID example, with 12 variables. So, what should we do? Should we just find the conditional independencies and factorize the joint distribution accordingly? Well... yes! That is the idea, but a couple of questions remain. First, how do we find these conditional independencies, I(P)? We need to find them from the data set. In all the previous examples I just said: assume that we are given this conditional independence, and this one; but how can we really find them? The answer is that we will get back to this in the chapter on structure learning; for now, let's just say that there are statistical tests for doing this.
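As a teaser (my own illustration; the proper tests come in the structure learning chapter), a naive check on binary data could just compare empirical conditional frequencies, here whether p of f given p and c stays the same across the values of p:

```python
import numpy as np

def looks_cond_independent(data, f, p, c, tol=0.05):
    """Naive check on binary data (one row per sample) of whether column f
    is independent of column p given column c: compares the empirical
    P(f=1 | p, c) across the values of p, for each value of c. Assumes
    every (p, c) combination actually occurs in the data; a real test
    (e.g. chi-squared) would replace the crude tolerance."""
    for cv in (0, 1):
        rows = data[data[:, c] == cv]
        freqs = [rows[rows[:, p] == pv][:, f].mean() for pv in (0, 1)]
        if abs(freqs[0] - freqs[1]) > tol:
            return False
    return True
```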
Second, even if we have this I(P), how can we systematically factorize the joint distribution accordingly? Okay? Like, which of these conditional independencies should we include and which not, and according to what order? We just saw that we may not be able to include all of them. And, you know, a visualization tool can help here: a graph. This is the whole idea of Bayesian networks, which we will get back to in the next video. Just to summarize: the goal was to
find, from data, the joint probability distribution: here's the data, and we wanted to get the joint probability distribution. I want to make this more concrete: the goal is to find the correct factorization of the joint probability distribution. What do we mean by correct? Let's put that aside for now; let's just say that each factorization results in a certain number of parameters. Okay? The whole purpose of Bayesian nets is this factorization; that's what's happening behind the scenes. Each factorization can result in a different number of parameters, and we want to find the one that is correct. How can we do that? Well, first of all, we need to find I(P): which conditional independencies hold. For example, if we perform those statistical tests I mentioned, imagine we find that I(P) contains c conditionally independent of d given m, and also m independent of d. Then how can we factorize accordingly? So the next question is: if we know I(P), what should we do? We will cover this in the next video...