Hello and welcome to the first tutorial in the Introduction to Machine Learning course. My name is Prios, and I am one of the teaching assistants for this course. In this tutorial we will be looking at some of the basics of probability theory.

Before we start, let's discuss the objective of this tutorial. The aim here is not to teach the concepts of probability theory in any great detail. Instead, we will provide a high-level overview of the concepts that will be encountered later on in the course. The idea is that for those of you who have done a course in probability theory, or are otherwise familiar with the content, this tutorial should act as a refresher. For others who may find some of the concepts unfamiliar, we recommend that you go back and review those concepts from, say, an introductory textbook or any other resource, so that when you encounter them later in the course you are comfortable with them.

To start this tutorial, we will look at the definitions of some of the fundamental concepts. The first one to consider is the sample space. The set of all possible outcomes of an experiment is called the sample space and is denoted by capital omega (Ω). Individual elements are denoted by small omega (ω) and are termed elementary outcomes. Let's consider some examples. In the first example, the experiment consists of rolling an ordinary die. The sample space here is the set of numbers from 1 to 6, and each individual element represents one of the six possible outcomes of rolling the die. Note that in this example the sample space is finite. In the second example, the experiment consists of tossing a coin repeatedly until a specified condition is observed; here we are looking to observe five consecutive heads before terminating the experiment. The sample space here is countably infinite, and the individual elements are represented using sequences of H's and T's, where H and T stand for heads and tails respectively. In the final example, the experiment consists of measuring the speed of a vehicle with infinite precision. Assuming that vehicle speeds can be negative, the sample space is the set of real numbers. Here we observe that the sample space can be uncountable.

The next concept we look at is that of an event. An event is any collection of possible outcomes of an experiment, that is, any subset of the sample space. The reason events are important to us is that, in general, when we conduct an experiment we are not really interested in the elementary outcomes themselves; rather, we are more interested in some subsets of the elementary outcomes. For example, on rolling a die we might be interested in whether the outcome was even or odd. So on a specific roll of the die, let's say we observed that the outcome was odd. In this scenario, whether the outcome was actually a one, a three, or a five is not as important to us as the fact that it was odd.

Since we are considering sets in terms of sample spaces and events, we will quickly go through the basic set theory notation. As usual, capital letters indicate sets and small letters indicate set elements. We first look at the subset relation: if for all x, x ∈ A implies x ∈ B, then we say that A is a subset of B, or A is contained in B. Two sets A and B are said to be equal if both A ⊆ B and B ⊆ A hold. The union of two sets A and B gives rise to a new set which contains the elements of both A and B. Similarly, the intersection of two sets gives rise to a new set which contains only those elements which are common to both A and B.
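As a quick illustration (my own sketch, not part of the original lecture), here is a minimal Python snippet that represents the die-rolling sample space and a couple of events as sets, and exercises the set operations just described; the variable names are assumptions chosen for readability.

```python
# Sample space for rolling an ordinary die
omega = {1, 2, 3, 4, 5, 6}

# Events are subsets of the sample space
even = {2, 4, 6}
odd = {1, 3, 5}

# Basic set operations
print(even | odd)     # union        -> {1, 2, 3, 4, 5, 6}
print(even & odd)     # intersection -> set(): even and odd are disjoint
print(omega - even)   # complement of 'even' relative to the sample space -> {1, 3, 5}
print(even <= omega)  # subset check -> True
```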
Finally, the complement of a set A is the set which contains all elements of the universal set except for the elements present in A; in our case, the universal set is the sample space. This slide lists the different properties of set operations, such as commutativity, associativity, and distributivity, which you should all be familiar with. It also lists De Morgan's laws, which can be very useful. According to De Morgan's laws, the complement of A ∪ B equals A complement ∩ B complement, and similarly the complement of A ∩ B equals A complement ∪ B complement. The De Morgan's laws presented here are for two sets; they can easily be extended to more than two sets.

Coming back to events, two events A and B are said to be disjoint, or mutually exclusive, if the intersection of the two sets is empty. Extending this concept to multiple sets, we say that a sequence of events A1, A2, A3, and so on is pairwise disjoint if Ai ∩ Aj = ∅ for all i ≠ j. In the example below, if each of the letters represents an event, then the events A through E are pairwise disjoint, since the intersection of any pair is empty. If events A1, A2, A3, and so on are pairwise disjoint and the union of the sequence of events gives the sample space, then the collection A1, A2, ... is said to form a partition of the sample space Ω. This is illustrated in the figure below.

Next we come to the concept of a sigma algebra. Given a sample space Ω, a sigma algebra is a collection F of subsets of the sample space with the following properties: the null set is an element of F; if A is an element of F, then A complement is also an element of F; and if Ai is an element of F for every i belonging to the natural numbers, then the union of the Ai for i = 1 to infinity is also an element of F. A set A that belongs to F is called an F-measurable set; this is what we naturally understand as an event. Going back to the third property, what it essentially says is that if there are a number of events which belong to the sigma algebra, then the countable union of these events also belongs to the sigma algebra.

Let us consider an example. Consider Ω = {1, 2, 3}; this is our sample space. With this sample space we can construct a number of different sigma algebras. The first sigma algebra, F1, is the power set of the sample space, so all possible events are present in F1. However, if we look at F2, there are only two events: the null set and the sample space itself. You should verify that for both F1 and F2 all three properties listed above are satisfied.

Now that we know what a sigma algebra is, let us try to understand how this concept is useful. First of all, for any Ω, countable or uncountable, the power set is always a sigma algebra. For example, for the sample space comprising the two elements {H, T}, a feasible sigma algebra is the power set. This is not the only feasible sigma algebra, as we have seen in the previous example, but the power set will always give you a feasible sigma algebra. However, if Ω is uncountable, then probabilities cannot be assigned to every subset in the power set. This is the crucial point, and it is why we need the concept of sigma algebras. So just to recap: if the sample space is finite or countable, then we can more or less ignore the concept of a sigma algebra, because in such a scenario we can consider all possible events, that is, the power set of the sample space, and meaningfully assign probabilities to each of these events.
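To make the definition concrete, here is a small sketch (not from the lecture) that checks the three sigma-algebra properties for the two collections F1 and F2 from the example above. Because the sample space is finite, checking closure under pairwise unions is enough to cover the countable-union property; the helper names are my own.

```python
from itertools import chain, combinations

omega = frozenset({1, 2, 3})

def power_set(s):
    """All subsets of s, each as a frozenset."""
    s = list(s)
    return {frozenset(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))}

def is_sigma_algebra(F, omega):
    # Property 1: the null set belongs to F
    if frozenset() not in F:
        return False
    # Property 2: closed under complementation
    if any(omega - A not in F for A in F):
        return False
    # Property 3: closed under unions (finite unions suffice for a finite omega)
    return all(A | B in F for A in F for B in F)

F1 = power_set(omega)              # the power set of the sample space
F2 = {frozenset(), omega}          # just the null set and the sample space
print(is_sigma_algebra(F1, omega))  # True
print(is_sigma_algebra(F2, omega))  # True
print(is_sigma_algebra({frozenset(), frozenset({1}), omega}, omega))  # False: missing {2, 3}
```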
However, this cannot be done when the sample space is uncountable: if Ω is uncountable, then probabilities cannot be assigned to every subset in 2^Ω. This is where the concept of a sigma algebra shows its use. When we have an experiment in which the sample space is uncountable, for example when the sample space is the set of real numbers, we have to identify the events which are of importance to us and use these, along with the three properties listed in the previous slide, to construct a sigma algebra. Probabilities are then assigned to the collection of sets in the sigma algebra.

Next we look at the important concepts of a probability measure and a probability space. A probability measure P on a sample space Ω with sigma algebra F is a function from F to the closed interval [0, 1] satisfying the following properties: P(∅) = 0, P(Ω) = 1, and if A1, A2, ... is a collection of pairwise disjoint members of F, then the probability of the union of all such members equals the sum of their individual probabilities. Note that this holds because the sequence A1, A2, ... is pairwise disjoint. The triple (Ω, F, P), comprising a sample space Ω, a sigma algebra F of subsets of Ω, and a probability measure P defined on (Ω, F), is called a probability space. For every probability problem that we come across, there exists a probability space comprising the triple (Ω, F, P). Even though we may not always explicitly take this probability space into consideration when we solve a problem, it should always remain in the back of our heads.

Let us now look at an example where we do consider the probability space involved in the problem. Consider the simple experiment of rolling an ordinary die, in which we want to identify whether the outcome is a prime number or not. The first thing to consider is the sample space. Since there are only six possible outcomes in our experiment, the sample space consists of the numbers 1 to 6. Next we look at the sigma algebra. Since the sample space is finite, we might as well consider all possible events, that is, the power set of the sample space. However, the problem dictates that we are only interested in two possible events, namely whether the outcome is prime or not. Thus, restricting ourselves to these two events, we can construct a simpler sigma algebra. Here we have two events which correspond to the events we are interested in, and the remaining two events follow from the properties which the sigma algebra has to satisfy. Please verify that the sigma algebra listed here does actually satisfy the three properties discussed above. The final component is the probability measure, which assigns a value between 0 and 1, that is, a probability value, to each member of the sigma algebra. The values listed here assume that the die is fair, in the sense that the probability of each face is 1/6.
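Here is a minimal sketch of that probability space (my own illustration, not from the lecture): the sigma algebra generated by the event "the outcome is prime", together with the probability measure induced by assigning 1/6 to each face of a fair die.

```python
from fractions import Fraction

omega = frozenset({1, 2, 3, 4, 5, 6})
prime = frozenset({2, 3, 5})

# Sigma algebra generated by the single event 'prime'
F = {frozenset(), prime, omega - prime, omega}

# Probability measure induced by a fair die (each face has probability 1/6)
P = {A: Fraction(len(A), 6) for A in F}

for A in sorted(F, key=len):
    print(sorted(A), P[A])
# []                  0
# [2, 3, 5]           1/2
# [1, 4, 6]           1/2
# [1, 2, 3, 4, 5, 6]  1
```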
Having covered some of the very basics of probability, in the next few slides we look at some rules which allow us to estimate probability values. The first is known as Bonferroni's inequality. According to this inequality, P(A ∩ B) ≥ P(A) + P(B) − 1; the general form of the inequality is also listed. What this inequality allows us to do is give a lower bound on the intersection probability, which is useful when the intersection probability is hard to calculate. However, if you look at the right-hand side of the inequality, you should observe that this result is only meaningful when the individual probabilities are sufficiently large. For example, if P(A) and P(B) are both very small, then the minus-one term dominates and the bound is not informative.

According to Boole's inequality, for any sets A1, A2, and so on, the probability of the union of these sets is always less than or equal to the sum of their individual probabilities. Clearly, this gives us a useful upper bound on the probability of the union of events. Notice that equality holds when the sets are pairwise disjoint.

Next we look at conditional probability. Given two events A and B, if P(B) > 0, then the conditional probability that A occurs given that B occurs is defined to be P(A | B) = P(A ∩ B) / P(B). Essentially, since event B has occurred, it becomes the new sample space, and the probability of A is modified accordingly. Conditional probabilities are very useful when reasoning, in the sense that once we have observed some event, our beliefs or predictions about related events can be updated or improved.

Let us try working out a problem in which conditional probabilities are used. A coin is tossed twice. What is the probability that both tosses result in heads, given that at least one of the tosses resulted in heads? Go ahead and pause the video here and try working out the problem yourself. From the question it is clear that there are four elementary outcomes: both tosses came up heads, both came up tails, the first came up heads while the second came up tails, and the other way around. Since we are assuming that the coin is fair, each of the elementary outcomes has the same probability of occurrence, equal to 1/4. Now we are interested in the probability that both tosses come up heads given that at least one resulted in heads. Applying the conditional probability formula, we have P(A | B) = P(A ∩ B) / P(B). Simplifying the intersection in the numerator and plugging in the probabilities of the elementary outcomes gives the result 1/3. Note that in the denominator the events are mutually exclusive, so the probability of their union equals the sum of the individual probabilities. As an exercise, try solving the same problem with the modification that we observe only the first toss coming up heads; that is, we want the probability that both tosses resulted in heads given that the first toss resulted in heads. Does this change the answer?
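The worked example can be checked by brute-force enumeration of the four elementary outcomes. The following sketch (an assumption of mine about how one might verify it, not part of the lecture) computes the conditional probability directly from the definition and also confirms the answer to the exercise variant.

```python
from fractions import Fraction
from itertools import product

# Enumerate the four equally likely elementary outcomes of two fair coin tosses
outcomes = list(product("HT", repeat=2))          # [('H','H'), ('H','T'), ('T','H'), ('T','T')]

both_heads = {o for o in outcomes if o == ("H", "H")}
at_least_one_head = {o for o in outcomes if "H" in o}
first_is_head = {o for o in outcomes if o[0] == "H"}

def prob(event):
    return Fraction(len(event), len(outcomes))

def cond_prob(a, b):
    """P(A | B) = P(A ∩ B) / P(B)."""
    return prob(a & b) / prob(b)

print(cond_prob(both_heads, at_least_one_head))   # 1/3
print(cond_prob(both_heads, first_is_head))       # 1/2 -- the exercise variant
```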
Next we come to a very important theorem called Bayes' theorem, or Bayes' rule. We start with the equation for conditional probability, P(A | B) = P(A ∩ B) / P(B). Rearranging, we have P(A ∩ B) = P(A | B) P(B). Now, if instead of starting with P(A | B) we had started with P(B | A), we would have got P(A ∩ B) = P(B | A) P(A). Equating the two right-hand sides gives P(A | B) P(B) = P(B | A) P(A), and taking P(B) to the right-hand side we get P(A | B) = P(B | A) P(A) / P(B). This is what is known as Bayes' rule. What it essentially says is that if I want to find the probability of A given that B happened, I can use P(B | A) along with knowledge of P(A) and P(B) to get this value. As you will see, this is a very important formula. Here we again present Bayes' rule in an expanded form, where A1, A2, and so on form a partition of the sample space. As mentioned, Bayes' rule is important in that it allows us to compute the conditional probability P(A | B) from the inverse conditional probability P(B | A).

Let us look at a problem in which Bayes' rule is applicable. To answer a multiple-choice question, a student may either know the answer or may guess it. Assume that with probability p the student knows the answer to a question, and with probability q the student guesses the right answer to a question she does not know. What is the probability that, for a question the student answers correctly, she actually knew the answer? Again, pause the video here and try solving the problem yourself. Okay, let us first define K as the event that the student knows the answer and C as the event that the student answers the question correctly. From the question we can gather the following information: the probability that the student knows the answer is p, hence the probability that she does not know it is 1 − p; the probability that she answers correctly given that she knows the answer is 1, because if she knows the answer she will definitely answer correctly; and finally, the probability that she answers correctly given that she guesses, that is, given that she does not know the answer, is q. We are interested in the probability that the student knew the answer given that she answered correctly. Applying Bayes' rule, we have P(K | C) = P(C | K) P(K) / P(C). The probability of answering correctly in the denominator can be expanded to cover the two situations: the probability of answering correctly given that the student knows the answer, and the probability of answering correctly given that she does not. Using the values gathered from the question, we arrive at the answer p / (p + q(1 − p)). Note that Bayes' rule is essential to solve this problem: while from the question itself we have a handle on P(C | K), there is no direct way to arrive at the value of P(K | C).

Two events A and B are said to be independent if P(A ∩ B) = P(A) P(B). More generally, a family of events Ai, where i ranges over some index set, is called independent if the probability of the intersection of any finite subset of the events equals the product of the probabilities of those events. Essentially, what we are saying is that if you have a family of events Ai, then independence holds only if this condition holds for every such subset. From this it should be clear that pairwise independence does not imply independence; that is, pairwise independence is a weaker condition.
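As a quick numerical check of the Bayes' rule example (my own sketch with assumed values of p and q, not figures from the lecture), the function below evaluates P(K | C) = P(C | K) P(K) / [P(C | K) P(K) + P(C | not K) P(not K)].

```python
from fractions import Fraction

def prob_knew_given_correct(p, q):
    """P(K | C) via Bayes' rule, with P(C | K) = 1 and P(C | not K) = q."""
    return (1 * p) / (1 * p + q * (1 - p))

# Assumed example values: the student knows 60% of questions and
# guesses correctly with probability 1/4 otherwise.
p, q = Fraction(3, 5), Fraction(1, 4)
print(prob_knew_given_correct(p, q))   # 6/7, i.e. p / (p + q*(1 - p))
```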
Extending the notion of independence of events, we can also consider conditional independence. Let A, B, and C be three events with P(C) strictly greater than zero. The events A and B are called conditionally independent given C if P(A ∩ B | C) = P(A | C) P(B | C). This condition is very similar in form to the previous condition for independence of events. Equivalently, the events A and B are conditionally independent given C if P(A | B ∩ C) = P(A | C). This latter condition is quite informative: it says that the probability of A calculated after knowing the occurrence of event C is the same as the probability of A calculated after having knowledge of the occurrence of both events B and C. Thus, observing the occurrence or non-occurrence of B does not provide any extra information, and we can conclude that the events A and B are conditionally independent given C.

Let us consider an example. Assume that admission into the M.Tech programs at IIT Madras and IIT Bombay is based solely on a candidate's GATE score. Then the probability of admission into IIT Madras, given knowledge of the candidate's admission status at IIT Bombay as well as the candidate's GATE score, is the same as the probability calculated simply knowing the candidate's GATE score. Thus, knowing the status of the candidate's admission into IIT Bombay does not provide any extra information. Since the condition is satisfied, we can say that admission into the program at IIT Madras and admission into the program at IIT Bombay are conditionally independent events given knowledge of the candidate's GATE score.
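To see the definition in action numerically, here is a small sketch (a toy distribution I constructed for illustration, not the admissions example itself) over three binary events. The joint distribution is built as P(a, b, c) = P(c) P(a | c) P(b | c), so A and B are conditionally independent given C by construction, even though they are not unconditionally independent.

```python
from fractions import Fraction
from itertools import product

# Assumed conditional probabilities for the toy distribution
P_c = {1: Fraction(1, 2), 0: Fraction(1, 2)}
P_a_given_c = {1: Fraction(4, 5), 0: Fraction(1, 5)}   # P(A=1 | C=c)
P_b_given_c = {1: Fraction(3, 4), 0: Fraction(1, 4)}   # P(B=1 | C=c)

# Joint distribution P(a, b, c) = P(c) * P(a | c) * P(b | c)
joint = {(a, b, c): P_c[c]
                    * (P_a_given_c[c] if a else 1 - P_a_given_c[c])
                    * (P_b_given_c[c] if b else 1 - P_b_given_c[c])
         for a, b, c in product((0, 1), repeat=3)}

def prob(pred):
    """Probability of the event described by the predicate pred(a, b, c)."""
    return sum(p for outcome, p in joint.items() if pred(*outcome))

# Check P(A ∩ B | C) == P(A | C) * P(B | C) for C = 1
p_c = prob(lambda a, b, c: c)
lhs = prob(lambda a, b, c: a and b and c) / p_c
rhs = (prob(lambda a, b, c: a and c) / p_c) * (prob(lambda a, b, c: b and c) / p_c)
print(lhs == rhs)   # True: conditionally independent given C

# A and B are not unconditionally independent in this distribution
print(prob(lambda a, b, c: a and b) == prob(lambda a, b, c: a) * prob(lambda a, b, c: b))  # False
```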