Transcript for:
Basics of Probability for Machine Learning

Hello and welcome to the first tutorial in the Introduction to Machine Learning course. My name is Priyatosh. I am one of the teaching assistants for this course. In this tutorial, we will be looking at some of the basics of probability theory. Before we start, let us discuss the objectives of this tutorial.

The aim here is not to teach the concepts of probability theory in any great detail. Instead, we will just be providing a high-level overview of the concepts that will be encountered later on in the course. The idea is that for those of you who have done a course in probability theory, or are otherwise familiar with the content, this tutorial should act as a refresher. For others who may find some of the concepts unfamiliar, we recommend that you go back and prepare those concepts from, say, an introductory textbook or any other resource, so that when you encounter them later on in the course, you will be comfortable with them. Okay, to start this tutorial we will look at the definitions of some of the fundamental concepts.

The first one to consider is that of the sample space. The set of all possible outcomes of an experiment is called the sample space and is denoted by capital omega. Individual elements are denoted by small omega and are termed elementary outcomes. Let us consider some examples.

In the first example, the experiment consists of rolling an ordinary die. The sample space here is the set of numbers between one and six. Each individual element here represents one of the six possible outcomes of rolling a die.

Note that in this example the sample space is finite. In the second example, the experiment consists of tossing a coin repeatedly until a specified condition is observed. Here we are looking to observe five consecutive heads before terminating the experiment.

The sample space here is countably infinite. The individual elements are represented using sequences of H's and T's, where H and T stand for heads and tails respectively. In the final example, the experiment consists of measuring the speed of a vehicle with infinite precision. Assuming that vehicle speeds can be negative, the sample space is clearly the set of real numbers.

Here we observe that the sample space can be uncountable. The next concept we look at is that of an event. An event is any collection of possible outcomes of an experiment, that is, any subset of the sample space. The reason why events are important to us is that, in general, when we conduct an experiment, we are not really that interested in the elementary outcomes themselves; rather, we are more interested in some subsets of the elementary outcomes. For example, on rolling a die we might be interested in observing whether the outcome was even or odd.

So, for example, on a specific roll of the die, let us say we observe that the outcome was odd. In this scenario, whether the outcome was actually a 1 or a 3 or a 5 is not as important to us as the fact that it was odd.

Since we are considering sets in terms of sample spaces and events, we will quickly go through the basic set theory notation. As usual, capital letters denote sets and small letters denote set elements. We first look at the subset relation.

For all x, if x being an element of A implies x is an element of B, then we say that A is a subset of B, or A is contained in B. Two sets A and B are said to be equal if both A subset of B and B subset of A hold. The union of two sets A and B gives rise to a new set which contains all elements that are in A or in B. Similarly, the intersection of two sets gives rise to a new set which contains only those elements which are common to both A and B. Finally, the complement of a set A is the set which contains all elements in the universal set except for the elements present in A.

In our case the universal set is essentially the sample space. This slide lists out the different properties of set operations, such as commutativity, associativity and distributivity, which you should all be familiar with. It also lists out De Morgan's laws, which can be very useful. According to De Morgan's laws, the complement of the set A union B equals A complement intersection B complement. Similarly, the complement of the set A intersection B equals A complement union B complement.
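Both laws are easy to check numerically with Python's built-in sets. In the sketch below, the universal set is a hypothetical six-element sample space, and A and B are arbitrary illustrative subsets:

```python
# A quick sanity check of De Morgan's laws using Python's built-in sets.
# The universal set here is a hypothetical sample space {1, ..., 6}.
omega = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3}
B = {2, 4, 6}

def complement(S):
    # complement relative to the universal set (the sample space)
    return omega - S

# (A union B) complement == A complement intersection B complement
assert complement(A | B) == complement(A) & complement(B)
# (A intersection B) complement == A complement union B complement
assert complement(A & B) == complement(A) | complement(B)
print("De Morgan's laws hold for this example")
```

Trying other choices of A and B will, of course, never produce a counterexample, since the laws hold for all sets.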

The De Morgan's laws presented here are for two sets; they can easily be extended to more than two sets. Turning to events, two events A and B are said to be disjoint or mutually exclusive if the intersection of the two sets is empty. Extending this concept to multiple sets, we say that a sequence of events A1, A2, A3 and so on is pairwise disjoint if Ai intersection Aj is empty for all i not equal to j. In the example below, if each of the letters represents an event, then the sequence of events A through E is pairwise disjoint, since the intersection of any pair is empty.

If events A1, A2, A3 and so on are pairwise disjoint and the union of the sequence of events gives rise to the sample space, then the collection A1, A2 and so on is said to form a partition of the sample space Ω. This is illustrated in the figure below. Next, we come to the concept of a sigma algebra. Given a sample space omega, a sigma algebra is a collection F of subsets of the sample space with the following properties.

First, the null set is an element of F. Second, if A is an element of F, then A complement is also an element of F. Third, if Ai is an element of F for every i belonging to the natural numbers, then the union over i from 1 to infinity of the Ai is also an element of F. A set A that belongs to F is called an F-measurable set; this is what we naturally understand as an event. So, going back to the third property, what it essentially says is that if there are a number of events which belong to the sigma algebra, then the countable union of these events also belongs to the sigma algebra. Let us consider an example. Suppose omega equals {1, 2, 3}. This is our sample space. With this sample space, we can construct a number of different sigma algebras.

Here, the first sigma algebra F1 is the power set of the sample space, so all possible events are present in it. However, if we look at F2, in this case there are only two events: the null set and the sample space itself. You should verify that for both F1 and F2 all three properties listed above are satisfied. Now that we know what a sigma algebra is, let us try to understand how this concept is useful. First of all, for any omega, countable or uncountable, the power set is always a sigma algebra. For example, for the sample space comprising the two elements {H, T}, a feasible sigma algebra is the power set. This is not the only feasible sigma algebra, as we have seen in the previous example, but the power set will always give you a feasible sigma algebra. However, if omega is uncountable, then probabilities cannot be assigned to every subset in the power set. This is the crucial point, and it is why we need the concept of a sigma algebra.
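For a finite sample space, the verification suggested above can be automated. The sketch below checks the three properties for F1 and F2 from the example; note that for a finite collection, closure under pairwise unions is enough to cover the countable-union property:

```python
from itertools import chain, combinations

def powerset(omega):
    """All subsets of omega, as frozensets."""
    s = list(omega)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))}

def is_sigma_algebra(F, omega):
    """Check the three defining properties on a finite collection F."""
    omega = frozenset(omega)
    if frozenset() not in F:                    # property 1: null set in F
        return False
    if any(omega - A not in F for A in F):      # property 2: closed under complement
        return False
    # property 3: closed under unions (pairwise suffices for a finite F)
    return all(A | B in F for A in F for B in F)

omega = {1, 2, 3}
F1 = powerset(omega)                    # the power set: all possible events
F2 = {frozenset(), frozenset(omega)}    # the trivial sigma algebra
print(is_sigma_algebra(F1, omega), is_sigma_algebra(F2, omega))  # True True
```

A collection such as {null set, {1}, omega} would fail the check, since the complement {2, 3} of {1} is missing.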

So, just to recap: if the sample space is finite or countable, then we can more or less ignore the concept of a sigma algebra, because in such a scenario we can consider all possible events, that is, the power set of the sample space, and meaningfully assign probabilities to each of these events. However, this cannot be done when the sample space is uncountable; that is, if omega is uncountable, then probabilities cannot be assigned to every subset of omega. This is where the concept of a sigma algebra shows its use. When we have an experiment in which the sample space is uncountable, for example, let's say the sample space is the set of real numbers.

In such a scenario, we have to identify the events which are of importance to us and use these, along with the three properties listed in the previous slide, to construct a sigma algebra. Probabilities will then be assigned to the collection of sets in the sigma algebra. Next, we look at the important concepts of probability measure and probability space. A probability measure P on a sample space omega with sigma algebra F is a function from F to the closed interval [0, 1] which satisfies the following properties: the probability of the null set equals 0, the probability of omega equals 1, and if A1, A2 and so on is a collection of pairwise disjoint members of F, then the probability of their union equals the sum of their individual probabilities. Note that this holds because the sequence A1, A2 and so on is pairwise disjoint. The triple (omega, F, P), comprising a sample space omega, a sigma algebra F of subsets of omega, and the probability measure P defined on (omega, F), is called a probability space. For every probability problem that we come across, there exists a probability space comprising this triple.

Even though we may not always expressly take this probability space into consideration when we solve a problem, it should always remain in the back of our minds. Let us now look at an example where we do consider the probability space involved in the problem.

Consider a simple experiment of rolling an ordinary die in which we want to identify whether the outcome is a prime number or not. The first thing to consider is the sample space.

Since there are only six possible outcomes in our experiment, the sample space here consists of the numbers one to six. Next we look at the sigma algebra. Note that since the sample space is finite, you might as well consider all possible events, that is, the power set of the sample space.

However, note that the problem dictates that we are only interested in two possible events, that is, whether the outcome is prime or not. Thus, restricting ourselves to these two events, we can construct a simpler sigma algebra. Here we have two events which correspond to the events we are interested in, and the remaining two events follow from the properties which the sigma algebra has to satisfy. Please verify that the sigma algebra listed here does actually satisfy the three properties that we discussed above.

The final component is the probability measure. The probability measure assigns a value between 0 and 1, that is, a probability value, to each of the members of the sigma algebra. The values listed here assume that the die is fair, in the sense that the probability of each face equals 1/6. Having covered some of the very basics of probability, in the next few slides we look at some rules which allow us to estimate probability values. The first is known as Bonferroni's inequality. According to this inequality, the probability of A intersection B is greater than or equal to the probability of A plus the probability of B minus 1. The general form of this inequality is also listed. What this inequality allows us to do is give a lower bound on the intersection probability.
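The full probability space for the die example can be written out explicitly. The sketch below constructs the simpler sigma algebra built from the two events of interest and assigns fair-die probabilities, using exact fractions to avoid floating-point noise:

```python
from fractions import Fraction

# The probability space for the fair-die "prime or not" example:
# omega = {1,...,6}, with the simpler sigma algebra built from the
# two events of interest (prime / not prime) plus the null set and omega.
omega = frozenset({1, 2, 3, 4, 5, 6})
prime = frozenset({2, 3, 5})
not_prime = omega - prime               # {1, 4, 6}

F = {frozenset(), prime, not_prime, omega}

# For a fair die each face has probability 1/6, so P(A) = |A| / 6.
P = {A: Fraction(len(A), 6) for A in F}

assert P[frozenset()] == 0 and P[omega] == 1
assert P[prime] + P[not_prime] == 1     # the two events partition omega
print(P[prime])  # 1/2
```

The two asserted lines are exactly the probability-measure properties from the definition: the null set gets 0, omega gets 1, and the disjoint events prime and not-prime have probabilities summing to P(omega).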

This is useful when the intersection probability is hard to calculate. However, if you look at the right hand side of the inequality, you should observe that this result is only meaningful when the individual probabilities are sufficiently large. For example, if the probability of A and the probability of B are both very small, then the minus one term dominates and the result does not tell us much. According to Boole's inequality, for any sets A1, A2 and so on, the probability of the union of these sets is always less than or equal to the sum of their individual probabilities. Clearly this gives us a useful upper bound for the probability of the union of events. Notice that equality will only hold when the sets are pairwise disjoint.
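Both bounds are easy to check on a concrete case. In the sketch below, the events A (even outcome) and B (outcome at least 3) on a fair die are illustrative choices, not from the slides:

```python
from fractions import Fraction

# Checking Bonferroni's and Boole's inequalities on a fair-die example.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}        # even outcomes
B = {3, 4, 5, 6}     # outcomes of at least 3

def P(S):
    # fair die: every outcome equally likely, so P(S) = |S| / 6
    return Fraction(len(S), len(omega))

# Bonferroni: P(A intersection B) >= P(A) + P(B) - 1
assert P(A & B) >= P(A) + P(B) - 1
# Boole: P(A union B) <= P(A) + P(B)
assert P(A | B) <= P(A) + P(B)
print(P(A & B), P(A) + P(B) - 1)  # 1/3 1/6
```

Here the true intersection probability is 1/3 while the Bonferroni bound gives 1/6, illustrating that the bound is valid but not tight; shrinking A and B would quickly drive the right hand side below zero, where the bound becomes vacuous.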

Next, we look at conditional probability. Given two events A and B, if the probability of B is greater than 0, then the conditional probability that A occurs given that B occurs is defined to be: probability of A given B equals probability of A intersection B divided by probability of B. Essentially, since event B has occurred, it becomes the new sample space, and the probability of A is modified accordingly. Conditional probabilities are very useful when reasoning, in the sense that once we have observed some event, our beliefs or predictions about related events can be updated or improved. Let us try working out a problem in which conditional probabilities are used.

A fair coin is tossed twice. What is the probability that both tosses result in heads, given that at least one of the tosses resulted in heads? Go ahead and pause the video here and try working out the problem yourself. From the question, it is clear that there are four elementary outcomes: both tosses came up heads, both came up tails, the first came up heads while the second came up tails, and the other way around.

Since we are assuming that the coin is fair, each of the elementary outcomes has the same probability of occurrence, equal to 1/4. Now, we are interested in the probability that both tosses come up heads given that at least one resulted in heads. Applying the conditional probability formula, we have probability of A given B equals probability of A intersection B divided by probability of B. Simplifying the intersection in the numerator, we get the next step.

Now, we can apply the probability values of the elementary outcomes to get the result of 1/3. Note that the events in the denominator are mutually exclusive; thus, the probability of the union of these three events equals the sum of the individual probabilities. As an exercise, try to solve the same problem with the modification that we observe only the first toss coming up heads.

That is, we want the probability that both tosses resulted in heads given that the first toss resulted in heads. Does this change the problem? Next, we come to a very important theorem called Bayes' theorem, or Bayes' rule.
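Once you have attempted the exercise, one way to check both answers is to enumerate the four equally likely outcomes directly; since each has probability 1/4, conditional probabilities reduce to counting:

```python
from fractions import Fraction
from itertools import product

# Brute-force check of the worked example: P(both heads | at least one head).
# All four outcomes are equally likely (prob 1/4), so we can just count.
outcomes = list(product("HT", repeat=2))      # HH, HT, TH, TT

at_least_one_head = [o for o in outcomes if "H" in o]
both_heads = [o for o in at_least_one_head if o == ("H", "H")]
print(Fraction(len(both_heads), len(at_least_one_head)))  # 1/3

# The suggested exercise: condition on the *first* toss being heads instead.
first_head = [o for o in outcomes if o[0] == "H"]
print(Fraction(sum(o == ("H", "H") for o in first_head), len(first_head)))  # 1/2
```

The two conditioning events have different sizes (three outcomes versus two), which is exactly why the two answers differ.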

We start with the equation for conditional probability: probability of A given B equals probability of A intersection B divided by probability of B. Rearranging, we have probability of A intersection B equals probability of A given B times probability of B. Now, if instead of starting with probability of A given B we had started with probability of B given A, we would have got probability of A intersection B equals probability of B given A times probability of A.

These two right hand sides can be equated to get: probability of A given B times probability of B equals probability of B given A times probability of A. Now, taking the probability of B to the right hand side, we get probability of A given B equals probability of B given A times probability of A divided by probability of B. This is what is known as Bayes' rule. Note that what it essentially says is: if I want to find the probability of A given that B happened, I can use the information of probability of B given A, along with knowledge of probability of A and probability of B, to get this value.

As you will see, this is a very important formula. Here we again present Bayes' rule in an expanded form, where A1, A2 and so on form a partition of the sample space. As mentioned, Bayes' rule is important in that it allows us to compute the conditional probability of A given B from the inverse conditional probability of B given A.

Let us look at a problem in which Bayes' rule is applicable. To answer a multiple choice question, a student may either know the answer or may guess it. Assume that with probability P the student knows the answer to a question, and with probability Q the student guesses the right answer to a question she does not know. What is the probability that, for a question the student answers correctly, she actually knew the answer?

Again, pause the video here and try solving the problem yourself. Okay, let us assume that K is the event that the student knows the question and C is the event that the student answers the question correctly. Now, from the question we can gather the following information. The probability that the student knows the question is P; hence the probability that the student does not know the question equals 1 minus P. The probability that the student answers the question correctly given that she knows the question equals 1, because if she knows the question she will definitely answer it correctly.

Finally, the probability that the student answers the question correctly given that she makes a guess, that is, she does not know the question, is Q. We are interested in the probability of the student knowing the question given that she has answered it correctly. Applying Bayes' rule, we have probability of K given C equals probability of C given K times probability of K divided by probability of C. The probability of answering the question correctly in the denominator can be expanded to cover the two situations: the probability of answering correctly given that the student knows the question, and the probability of answering correctly given that she does not.

Now, using the values which we have gathered from the question, we arrive at the answer of P divided by P plus Q times 1 minus P. Note here that Bayes' rule is essential to solve this problem because, while from the question itself we have a handle on the value of probability of C given K, there is no direct way to arrive at the value of probability of K given C.
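The derivation above can be packaged as a small function. The numeric values plugged in below are purely illustrative (the problem statement leaves P and Q symbolic):

```python
from fractions import Fraction

def p_knows_given_correct(p, q):
    """Bayes' rule for the student example:
    P(K|C) = P(C|K) P(K) / (P(C|K) P(K) + P(C|not K) P(not K)),
    with P(C|K) = 1 and P(C|not K) = q."""
    return (1 * p) / (1 * p + q * (1 - p))

# e.g. suppose the student knows 60% of questions and a guess is right 25% of the time
print(p_knows_given_correct(Fraction(3, 5), Fraction(1, 4)))  # 6/7
```

Notice that a correct answer is strong evidence of knowledge here: 60% prior knowledge becomes a 6/7 posterior, because guessing correctly is comparatively unlikely.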

Two events A and B are said to be independent if the probability of A intersection B equals the probability of A times the probability of B. More generally, a family of events Ai, where i ranges over some index set, is called independent if the probability of the intersection of any subset of the events Ai equals the product of the probabilities of those events. Essentially, what we are saying is that if you have a family of events Ai, then independence holds only if this condition holds for every subset of those events. From this it should be clear that pairwise independence does not imply independence; that is, pairwise independence is a weaker condition. Extending the notion of independence of events, we can also consider conditional independence.
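The gap between pairwise independence and full independence can be seen in a classic example (not from the slides): toss two fair coins and consider A = first toss is heads, B = second toss is heads, C = both tosses agree. The sketch below verifies that every pair is independent while the triple is not:

```python
from fractions import Fraction
from itertools import product

# Pairwise independence without independence: two fair coin tosses.
outcomes = list(product("HT", repeat=2))      # four equally likely outcomes

def P(event):
    # probability of an event given as a predicate on outcomes
    return Fraction(sum(event(o) for o in outcomes), len(outcomes))

A = lambda o: o[0] == "H"          # first toss heads
B = lambda o: o[1] == "H"          # second toss heads
C = lambda o: o[0] == o[1]         # both tosses agree

both = lambda e1, e2: (lambda o: e1(o) and e2(o))
all3 = lambda o: A(o) and B(o) and C(o)

# Each pair is independent ...
assert P(both(A, B)) == P(A) * P(B)
assert P(both(A, C)) == P(A) * P(C)
assert P(both(B, C)) == P(B) * P(C)
# ... but the triple is not: knowing any two of the events determines the third.
print(P(all3), P(A) * P(B) * P(C))  # 1/4 1/8
```

Intuitively, any two of these events determine the third (if the first toss is heads and the tosses agree, the second must be heads), so the family cannot be independent even though each pair is.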

Let A, B and C be three events with the probability of C strictly greater than zero. The events A and B are called conditionally independent given C if the probability of A intersection B given C equals the probability of A given C times the probability of B given C. This condition is very similar in form to the previous condition for independence of events.

Equivalently, the events A and B are conditionally independent given C if the probability of A given B intersection C equals the probability of A given C. This latter condition is quite informative. What it says is that the probability of A calculated after knowing the occurrence of event C is the same as the probability of A calculated after having knowledge of the occurrence of both events B and C. Thus, observing the occurrence or non-occurrence of B does not provide any extra information, and we can conclude that the events A and B are conditionally independent given C.

Let us consider an example. Assume that admission into the M.Tech program at IIT Madras and IIT Bombay is based solely on a candidate's GATE score. Then the probability of admission into IIT Madras, given knowledge of the candidate's admission status at IIT Bombay as well as the candidate's GATE score, is the same as the probability calculated simply knowing the candidate's GATE score. Thus, knowing the status of the candidate's admission into IIT Bombay does not provide any extra information. Hence, since the condition is satisfied, we can say that admission into the program at IIT Madras and admission into the program at IIT Bombay are independent events given knowledge of the candidate's GATE score.
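A small numerical model makes the admissions example concrete. All the numbers below are made up for illustration; the point is that the two admissions are conditionally independent given the score by construction, yet they are not independent marginally, because one admission carries information about the score and hence about the other:

```python
from fractions import Fraction

# A toy version of the admissions example (all numbers are illustrative).
# The GATE score G is "high" or "low"; given G, the two institutes admit
# independently with G-dependent probabilities, so
# P(M and B | G) = P(M | G) * P(B | G) holds for each value of G.
p_g = {"high": Fraction(1, 2), "low": Fraction(1, 2)}   # P(G)
p_m = {"high": Fraction(4, 5), "low": Fraction(1, 5)}   # P(Madras admit | G)
p_b = {"high": Fraction(3, 4), "low": Fraction(1, 4)}   # P(Bombay admit | G)

# Marginal and joint probabilities, obtained by summing over the score G.
P_M = sum(p_g[g] * p_m[g] for g in p_g)
P_B = sum(p_g[g] * p_b[g] for g in p_g)
P_MB = sum(p_g[g] * p_m[g] * p_b[g] for g in p_g)

# Conditionally independent given G, yet *not* independent marginally:
# an admission at one institute hints at the score, and so at the other.
print(P_MB, P_M * P_B)  # 13/40 1/4
assert P_MB != P_M * P_B
```

This is a common pattern: conditioning on a shared cause (here, the GATE score) renders its effects independent, while marginally the effects remain correlated.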