[Music] Yes, you can understand entropy... StatQuest! Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about entropy for data science, and it's going to be clearly explained. Note: this StatQuest assumes that you are already familiar with the main ideas of expected values. If not, check out the Quest.

Entropy is used for a lot of things in data science. For example, entropy can be used to build classification trees, which are used to classify things. Entropy is also the basis of something called mutual information, which quantifies the relationship between two things. And entropy is the basis of relative entropy, aka the Kullback-Leibler distance, and cross entropy, which show up all over the place, including fancy dimension reduction algorithms like t-SNE and UMAP. What these three things have in common is that they all use entropy, or something derived from it, to quantify similarities and differences. So let's learn how entropy quantifies similarities and differences.

However, in order to talk about entropy, first we have to understand surprise. So let's talk about chickens. Imagine we had two types of chickens, orange and blue, and instead of just letting them randomly roam all over the screen, our friend StatSquatch chased them around until they were organized into three separate areas: A, B, and C. Now, if StatSquatch just randomly picked up a chicken in area A, then, because there are six orange chickens and only one blue chicken, there is a higher probability that they will pick up an orange chicken. And since there is a higher probability of picking up an orange chicken, it would not be very surprising if they did. In contrast, if StatSquatch picked up the blue chicken from area A, we would be relatively surprised. Area B has a lot more blue chickens than orange, and because there is now a higher probability of picking up a blue chicken, we would not be very surprised if that happened, and because there is a relatively low probability of picking the orange chicken, that would be relatively surprising. Lastly, area C has an equal number of orange and blue chickens, so regardless of what color chicken we pick up, we would be equally surprised.

Combined, these areas tell us that surprise is, in some way, inversely related to probability. In other words, when the probability of picking up a blue chicken is low, the surprise is high, and when the probability of picking up a blue chicken is high, the surprise is low. BAM! Now we have a general intuition of how probability is related to surprise.

Now let's talk about how to calculate surprise. Because we know there is a type of inverse relationship between probability and surprise, it's tempting to just use the inverse of the probability to calculate surprise, because when we plot the inverse, we see that the closer the probability is to zero, the larger the y-axis value. However, there's at least one problem with just using the inverse of the probability to calculate surprise. To get a better sense of this problem, let's talk about the surprise associated with flipping a coin. Imagine we had a terrible coin, and every time we flipped it we got heads. Blah blah blah blah... ugh, flipping this coin is super boring. Hey StatSquatch, how surprised would you be if the next flip gave us heads? "I would not be surprised at all." So, when the probability of getting heads is 1, we want the surprise for getting heads to be 0. However, when we take the inverse of the probability of getting heads, we get 1 instead of what we want, 0, and this is one reason why we can't just use the inverse of the probability to calculate surprise.
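Here's a minimal Python sketch (not from the video) of the problem just described: the inverse of the probability grows as the probability shrinks, which is the behavior we want, but at a probability of 1 it gives 1 instead of the 0 surprise we want.

```python
# Why 1/p alone doesn't work as a measure of surprise.
for p in [0.1, 0.25, 0.5, 0.9, 1.0]:
    print(f"p = {p:4}   1/p = {1 / p:.2f}")

# As p shrinks, 1/p grows (good), but at p = 1.0 we get 1/p = 1.00,
# not the 0 surprise we want for an event that always happens.
```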
So, instead of just using the inverse of the probability to calculate surprise, we use the log of the inverse of the probability: Surprise = log(1/p). Now, since the probability of getting heads is 1, and thus we will always get heads and it will never surprise us, the surprise for heads is 0. In contrast, since the probability of getting tails is 0, and thus we'll never get tails, it doesn't make sense to quantify the surprise of something that will never happen. When we plug in 0 for the probability and use the properties of logs to turn the division into subtraction, the second term is the log of 0, and because the log of 0 is undefined, the whole thing is undefined. And this result is okay, because we're talking about the surprise associated with something that never happens. Like the inverse of the probability, the log of the inverse of the probability gives us a nice curve, and the closer the probability gets to 0, the more surprise we get, but now the curve says there is no surprise when the probability is 1. So, surprise is the log of the inverse of the probability. BAM! Note: when calculating surprise for two outputs, in this case heads and tails, it is customary to use log base 2 for the calculations.

Now that we know what surprise is, let's imagine that our coin gets heads 90% of the time and tails 10% of the time, and let's calculate the surprise for getting heads and for getting tails. As expected, because getting tails is much rarer than getting heads, the surprise for tails is much larger. Now let's flip this coin three times, and we get heads, heads, and tails. The probability of getting two heads and one tail is 0.9 times 0.9 for the two heads, times 0.1 for the tail. And if we want to know exactly how surprising it is to get two heads and one tail, we can plug this probability into the equation for surprise, use the properties of logs to convert the division into subtraction and the multiplication into addition, and then plug and chug, and we get 3.62. But, more importantly, we see that the total surprise for a sequence of coin tosses is just the sum of the surprises for each individual toss. In other words, the surprise for getting one heads is 0.15, and since we got two heads, we add 0.15 two times, plus 3.32 for the one tail, to get the total surprise for getting two heads and one tail. MEDIUM BAM!

Now, because this diagram takes up a lot of space, let's summarize the information in a table: the first row in the table tells us the probability of getting heads or tails, and the second row tells us the associated surprise. Now, if we wanted to estimate the total surprise after flipping the coin 100 times, we approximate how many times we will get heads by multiplying the probability we will get heads, 0.9, by 100, and we estimate the total surprise from getting heads by multiplying by 0.15. So this term represents how much surprise we expect from getting heads in 100 coin flips. Likewise, we can approximate how many times we will get tails by multiplying the probability we will get tails, 0.1, by 100, and we estimate the total surprise from getting tails by multiplying by 3.32. So the second term represents how much surprise we expect from getting tails in 100 coin flips. Now we can add the two terms together to find the total surprise, and we get 46.7.
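Assuming we follow the transcript and use log base 2, here's a small sketch of the surprise calculations above; the exact decimals differ slightly from the rounded values used in the video.

```python
from math import log2

def surprise(p):
    """Surprise = log2(1/p); only defined for probabilities greater than 0."""
    return log2(1 / p)

p_heads, p_tails = 0.9, 0.1
print(surprise(p_heads))  # ~0.15  (heads is likely, so not very surprising)
print(surprise(p_tails))  # ~3.32  (tails is rare, so much more surprising)

# The total surprise for the sequence heads, heads, tails is just the sum
# of the individual surprises...
print(surprise(p_heads) + surprise(p_heads) + surprise(p_tails))  # ~3.62

# ...which matches plugging the probability of the whole sequence into surprise().
print(surprise(p_heads * p_heads * p_tails))  # ~3.62

# Estimated total surprise for 100 flips of this coin:
# (expected number of heads) * (surprise per heads), plus the same for tails.
total = 100 * p_heads * surprise(p_heads) + 100 * p_tails * surprise(p_tails)
print(total)  # ~46.9; using the rounded surprises 0.15 and 3.32 gives the 46.7 above
```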
Hey, StatSquatch is back! "Okay, I see that we just estimated the surprise for 100 coin flips, but aren't we supposed to be talking about entropy?" Funny you should ask. If we divide everything by the number of coin tosses, 100, we get the average amount of surprise per coin toss, 0.47. So, on average, we expect the surprise to be 0.47 every time we flip the coin, and that is the entropy of the coin: the surprise we expect every time we flip the coin. DOUBLE BAM! In fancy statistics notation, we say that entropy is the expected value of the surprise. Anyway, since we are multiplying each probability by the number of coin tosses, 100, and also dividing by the number of coin tosses, 100, all of the values that represent the number of coin tosses cancel out, and we are left with the probability that the surprise for heads will occur times its surprise, plus the probability that the surprise for tails will occur times its surprise. Thus the entropy, 0.47, represents the surprise we would expect per coin toss if we flipped this coin a bunch of times. And yes, expecting surprise sounds silly, but it's not the silliest thing I've heard.

Note: we can rewrite entropy just like an expected value, using fancy sigma notation, where x represents a specific value for surprise, which we multiply by the probability of observing that specific value for surprise. So, for the first term, getting heads, the specific value for surprise is 0.15 and the probability of observing that surprise is 0.9, so we multiply those values together. Then the sigma tells us to add that term to the term for tails. Either way we do the math, we get 0.47. Now, personally, once I saw that entropy was just the average surprise that we could expect, entropy went from something that I had to memorize to something I could derive, because now we can plug the equation for surprise in for x, the specific value, and we can plug in the probability, and we end up with an equation for entropy. BAM!

Unfortunately, even though this equation is made from two relatively easy to interpret terms, the surprise times the probability of the surprise, this isn't the standard form of the equation for entropy that you'll see out in the wild. First we have to swap the order of the two terms, then we use the properties of logs to convert the fraction into subtraction (and the log of 1 is 0), then we multiply both terms in the difference by the probability, and lastly we pull the minus sign out of the summation, and we end up with the equation for entropy that Claude Shannon first published in 1948: Entropy = -Σ p(x) log(p(x)). SMALL BAM! That said, even though this is the original version, and the one you'll usually see, I prefer the other version, since it is easily derived from surprise and it is easier to see what is going on.

Now, going back to the original example, we can calculate the entropy of the chickens. Let's calculate the entropy for area A. Because six of the seven chickens are orange, we plug in 6/7 for the probability, then we add a term for the one blue chicken by plugging in 1/7 for the probability. Now we just do the math and get 0.59. Note: even though the surprise associated with picking up an orange chicken is much smaller than the surprise for picking up a blue chicken, there is a much higher probability that we will pick up an orange chicken, thus the total entropy, 0.59, is much closer to the surprise associated with orange chickens than blue chickens. Likewise, we can calculate the entropy for area B, only this time the probability of randomly picking up an orange chicken is 1/11 and the probability of picking up a blue chicken is 10/11, and the entropy is 0.44. In this case, the surprise for picking up an orange chicken is relatively high, but the probability of it happening is so low that the total entropy is much closer to the surprise associated with picking up a blue chicken.
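To make the derivation concrete, here's a minimal sketch of entropy as expected surprise, applied to the biased coin and to chicken areas A and B using the probabilities above; the function name is just for illustration.

```python
from math import log2

def entropy(probabilities):
    """Entropy = expected surprise = sum over outcomes of p * log2(1/p).
    Outcomes with probability 0 never happen, so they contribute nothing."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

# The biased coin: heads 90% of the time, tails 10% of the time.
print(entropy([0.9, 0.1]))     # ~0.47, the expected surprise per flip

# Area A: 6 orange chickens and 1 blue chicken.
print(entropy([6/7, 1/7]))     # ~0.59

# Area B: 1 orange chicken and 10 blue chickens.
print(entropy([1/11, 10/11]))  # ~0.44
```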
We also see that the entropy value, the expected surprise, is lower for area B than for area A. This makes sense, because area B has a higher probability of picking a chicken with a lower surprise. Lastly, the entropy for area C is 1, and that makes the entropy for area C the highest we have calculated so far. In this case, even though the surprise for orange and blue chickens is relatively moderate, 1, we always get the same relatively moderate surprise every time we pick up a chicken, and it is never outweighed by a smaller value for surprise like we saw earlier for areas A and B. As a result, we can use entropy to quantify the similarity or difference in the number of orange and blue chickens in each area: entropy is highest when we have the same number of both types of chickens, and as we increase the difference in the number of orange and blue chickens, we lower the entropy. TRIPLE BAM!

P.S. The next time you want to surprise someone, just whisper "the log of the inverse of the probability". BAM!

Now it's time for some Shameless Self-Promotion. If you want to review statistics and machine learning offline, check out the StatQuest study guides at statquest.org; there's something for everyone. Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs or a t-shirt or a hoodie, or just donate; the links are in the description below. All right, until next time, Quest on!