Suppose there is a game with the
probability of winning of 1/3, and if you win, you will get 1.8 times the
amount of money that you risked on the play, and if you lose you lose the entire wager. Then we can calculate the
expected value of the payout by summing over all possible
events of the probability of each event times the payout of each event, and once we calculate the expected value for
this game, it comes out to a negative value. You have no edge at all in this game,
so you may get lucky first few times but there is no long-term winning strategy. So you will inevitably donate all your
money to the house if you keep playing. Now let's consider a second game where if you
win you get paid the amount of money you risk. Or in other words, you double your betting amount. Suppose that you have a hundred
percent chance of winning this game. Then it doesn't quite matter what happens
if you lose since you are guaranteed to win. And you are expected to double your
money each time you play the game. Then what is your long-term winning strategy? Well you essentially have a money printer, so you might as well put your
entire life savings on each play. Even better you can maximize your
profit by playing on a margin, meaning you should borrow the biggest
amount of money that you possibly can and put all your assets including your kidney
and liver as a collateral for the margin. We looked at two extreme examples:
one where you have no edge at all, and there is no long-term winning strategy, and another where you are
guaranteed to win each time and your winning strategy is
to bet everything each time. So the real interesting problem is
when your expected payout is positive, but you are not guaranteed to win. Let's take a look at one such example: Suppose you buy some S&P500 Index Fund, which is a fund that takes
top 500 American companies and averages them by their company's value. And let's say you implement a strategy to
sell at a profit when the price doubles and cut your losses when the
price drops by 25 percent. Taking the historical data starting from
2003 and back testing this strategy, I calculated that the probability
of win is roughly 59 percent. Since we are expected to win more, and each win
pays much more than how much we lose each loss, The expected payout per play
obviously has to be a positive value. This game definitely is worth playing,
but how much should you risk each play? One possible strategy commonly used by the
members of the subreddit WallStreetBets is called food stamps or Lambo, which is a strategy to risk everything in
order to maximize your potential profit. So for this strategy, we can maximize the
expected value by maximizing the value of r. So you take 4x leverage to
risk everything you own. Naively that is the optimal strategy since you
are maximizing the expected average profit. You may get lucky first few times, but
if you continue to all-in each play, you will eventually lose everything. The very act of maximizing expected
payout actually leads to ruin. So there must be a better
long-term strategy right? Let's start by defining the problem. Suppose there is a game where you know the
probability of winning, which we will call p, and q, the probability of
losing, which equals 1 minus p. The game is repeated n times, and each play, you will risk a fixed
percentage r of your portfolio. and each time you win you will gain tr, and each time you lose you will lose sr. And let's put on some auxiliary conditions. The amount you can lose is that
most your entire portfolio, So you cannot have a negative balance, and you are not guaranteed to win. And we want n to be sufficiently large since we
are looking for a long-term sustainable strategy. So you are expected to win some and lose some, but still end up profitable in the long run. Then what fraction of your
portfolio should you risk each play to maximize the long-term average profit? You may have noticed how we are
restricted to binary scenarios, but in the real world, there are many
games with more than two outcomes. For example the Powerball lottery
has nine different ways you can win, each with different payout. And you can even have a game
with a continuum of outcomes that depend on some continuous distribution. As interesting as those scenarios sound, they are too difficult to cover on a single video, so for this video we will
focus only on binary outcomes. Before we start solving the problem, I want to start with the answer,
so we can work our way towards it. So the optimal risk we should take
per play is p, the probability of win over s, the percentage loss per loss minus q, the probability of loss over t, the percentage gained per win. The amount you should risk grows
as the probability of win goes up, as the first term is linearly
proportional with respect to p, and the optimal risk also goes up
as percentage gain per win grows, since the second term is negative
inversely proportional with respect to t. This intuitively makes sense as an investment is considered better if you win
more, and each time you win you win more money. Similarly the optimal risk decreases as the probability of loss and the
amount of loss per loss goes up. This formula is known as the Kelly Criterion. It's best to look at some
extreme cases of the formula. So if we have a perfectly balanced
game with equal chance of win and loss, and equal gain and loss, the optimal risk is nothing, meaning this isn't a game worth playing. If we instead have a game where
you're less likely to win, and your gain per win is less than your losses, then your optimal r value comes out to negative. What this means is that you don't
want to be playing the game, but you want to be sitting on the other side and taking a short position being the casino or the insurance
company stealing money from you. Now what if the game is
extremely good in your favor? Then your r value can come out
to something bigger than one, meaning you should invest not
only your entire portfolio, but take a leverage and
invest with borrowed money. Notice how I put quotation
marks around the word average. The first question I want to
ask is: what even is an average? There is a generalized
mathematical definition of the word as mathematicians love to abstractify everything. But for now, we will just think of it
as some measure of central tendency or a good representative element of a dataset. And the average that we are all very
familiar with is the arithmetic mean, or sometimes just called THE
mean when the context is clear. And we compute it by adding up
all the numbers in the data set and dividing by how many numbers there are. There is another kind of mean
called the quadratic mean, or sometimes called root mean square and we get it by taking the arithmetic
mean of all the squares of numbers then taking the square root. Another common mean that we
use is called the harmonic mean and we get it by taking the
arithmetic mean of all the reciprocals then taking the reciprocal. And using similar idea, we can extend the
definition to work for all real numbers, except for zero. And even for zero, we can just take the limit as alpha
goes to zero to define the zero mean which is known as the geometric mean. As alpha grows bigger, more
weight is placed on bigger values, and as alpha gets smaller, more
weight is placed on smaller values. So if we take the limit as
alpha goes to negative infinity, we get the smallest value of the dataset, and if we take the limit to the positive infinity, we get the maximum value. Now when is it appropriate to use
these different types of means? Imagine one person having three dollars
and another person having five dollars. If Karl Marx saw this, he would
want to equally distribute wealth. Then the most appropriate average
would be the arithmetic mean. Now consider a second situation
where you are a successful day trader and tripled your money the first year and 5x'ed money the second year, then what is your average growth per year? It is equivalent to roughly
3.873x'ing your money each year. So when we have a compound growth,
and want to calculate average growth, geometric mean is the most appropriate. Now consider the next situation where person A can finish a task in three days, and person B can finish a task in five days. Then what is the average number of days
it takes a person to finish the task? If both of them are working
on the task concurrently, we can add the rates together and we can divide by 2 to get the average rate. So it means that person A can
finish 1/3 of a job per day, and person B is working at 1/5 of a job per day. Then it is equivalent as if each
person was finishing 4/15 per day. So on average it takes 15/4 days to finish a task, and this is an appropriate
situation to take the harmonic mean. Now consider the next scenario where there are two chambers filled
with the same homogeneous ideal gas and each chamber contains the same amount of gas. Now suppose the mean speed of the
molecules in the left chamber is 3, and the right chamber is 5. So the right chamber is a bit hotter. Now, once you connect the two chambers together, they will eventually reach a thermal equilibrium, and mean speed of the molecules on both sides
of the chamber will eventually balance out to a number between 3 and 5. So what is this average speed after mixing? Well we know that the temperature
on the left and the right will eventually reach an arithmetic mean, and since temperature is, roughly speaking,
the average kinetic energy of the molecules, it is proportional to the
square of the mean speed. And for those of you who are curious
what the constant of proportionality is, it is π times the mass of the air molecule, all over 8 times the Boltzmann constant. So now to find the average mean
velocity after the thermal equilibrium, we would have to take the
root mean square of 3 and 5. We looked at multiple different scenarios
to use different kinds of means. Notice how as alpha grows bigger, the mean takes more weight of
the bigger value which is 5. So from these four examples, the harmonic mean is the smallest and
the quadratic mean is the biggest. There are two more averages
that are very commonly used. How much does an average American make per year? and since there are ridiculous
outliers like billionaires which would skew the arithmetic
mean too far to the right, and zero income people which
would make geometric mean zero. Instead of any of the generalized means, we typically take the median in practice which is the middle number. And it is a pretty good representative
of how much an average person makes. And another commonly used average is the mode. So when we say an average
American owns a smartphone, we probably mean most Americans. And the majority is a pretty good
representative of an average person. Now we have a solid notion of what an average is, so we are ready to take the first step. Each time we win, we gain tr, so it is same as multiplying
(1+tr) to how much money we have. This essentially is the same as 30% gain
really meaning 130% the original amount. And we can come up with something
similar for each time we lose. Suppose we win k times out of n games, then we will lose (n-k) games. I will call this big R for the total return. And we would want to multiply
this to the initial capital to see how much money we have after n games. And for this expression, I want to use
the convention that 0^0 is equal to 1, which sounds like a nonsense. But suppose you all-in each play, and if you get lucky, you have multiplied
your money n times by the factor of (1+tr), and if you get unlucky just once,
then you lost all your money. So to make sense out of this situation, we will choose to define that 0^0
is equal to 1 for this expression. Of course we can treat this
rigorously with limits, but we just want a model that works. Now we are expected to win about np games, where p is the probability of winning one game. And we are to lose about nq games. So this is the average total return right? Earlier, we have talked about all
the different types of averages, and which one is this one? Well to really make sense out of this, we need to introduce a new
toolkit called random variables. Consider a scenario where you
toss an unfair coin 3 times. There are eight different events possible, and we can calculate the probability
of each event by multiplication since each throw is independent. Then we can come up with the random
variable for the total number of heads so X of the event HHT would be 2. This example illustrates
what a random variable is. It simply is a function that takes an
event and assigns a numerical value. And by doing so, we turn
the raw non-numerical data, such as sequence of heads and tails, into numerical data and it
is much easier to work with. Well that is the more modern interpretation, but before the rise of set
theory and measure theory, mathematicians naively thought of it as a variable
that can take on the values of a random event. and for this scenario, X can take
on the values 0, 1, 2, and 3. And in either perspective, we can
calculate the probability of X = 2 by adding probabilities of all
possible ways of two heads showing up. And we could do this similar for 3. But what's the probability of 7
head showing up out of 3 toss? That's just impossible, so the probability is 0. Now we can plot the probability for
each values of the random variable, and we call this the probability mass function. This is one very specific example
of a class of random variables that we call binomial random variable, denoted B, with two parameter
n for the number of trials, and p for the probability of winning one trial. And we can just write B when the context is clear. and since B represents the number of wins, it can take on the values from 0 to n. How do we calculate the probability
of getting two wins out of four trials if the probability of winning is 1/3? well there are six different
ways you can win twice, and each of these events have a
probability of (1/3)^2 × (2/3)^2, and we can get 6 by taking 4 choose 2. So doing the very same thing in general, the probability of B = k is n choose
k × p^k × q^(n-k) where q is (1 - p). Next I want to come up with the notion of
the arithmetic mean for random variables and how it compares with the arithmetic
mean of a statistical dataset So let's say there are 10000
independent experiments, each consisting of ten dice throws, and for each experiment, we want to
count the number of times six shows up. We can create a histogram to tally up the results and we can see that it is fairly common
to see 6 showing up one or two times, but getting 10 sixes in a row
is practically impossible. The probability model that represents
one instance of the experiment is the binomial distribution of 10 trials
and the probability of winning equaling 1/6. And if we plot the probability mass function, the shape is essentially
identical to the histogram because as long as the number
of experiments is large enough, the number of experiments where 3 sixes came
up out of the total number of experiments should approximately equal the probability
of 3 sixes showing up in a single experiment. And this holds for any other values as well. So we can define the expected value, which is
the probability equivalent of the arithmetic mean by summing over the product of
the value of the random variable times the probability of getting that value. And as long as the number of
experiments are large enough, the arithmetic mean approaches the expected value. Now, what is the expected value
of the binomial distribution? Well this summation is fairly tricky to compute, so for those of you that are interested, I will leave a link in the
description below for the derivation. So for now, let's take a small
leap of faith and say it equals np. And it is a reasonable answer
that agrees with our intuition since if we toss a coin 100
times, we expect about 50 heads, and if we toss the dice 60
times, we expect about 10 ones. It should be clear now that we
need to construct a random variable that depends on the binomial
distribution for our problem. So now we introduce how transformation
of random variables work. Suppose that X is a random variable that
can take on the four values -1, 0, 1 and 2, with each value having some probability. Then Y defined as 1/X can take on
the reciprocals of the values of X, and the probabilities of
each value of Y is same as the probabilities of getting the corresponding X, so as an example, if we are trying to
find the probability of Y equaling 1/2, first substitute Y for 1/X, then solve for X. Notice how the plot of the probability
mass function of Y is out of order, so we just have to rearrange it
to make it more intuitive to read. Now what if the transformation is not one-to-one? For example consider the square transformation, then Y can only take on three values
since both ±1 squared equals 1. So if we are trying to calculate
the probability of Y equaling 1, Y is equal to X^2, and there are
two possible values of X^2 = 1, so we have to add the two probabilities. so we have to redraw the PMF (probability
mass function) by merging the two values. This essentially is how transformations
of all discrete random variables work. Now how do we find the expected value of Y? We can just use the definition
of the expected value of Y, which is the sum over all values of Y times
the probability of getting that value. We can instead just sum over
all possible values of X^2, times the probability of getting each value of X. And during the summation, values that were not one-to-one
will appropriately add together. This essentially illustrates the
proof of the discrete version of the law of the unconscious statistician, which is often called LOTUS, which is its acronym, and it is one of the most important
theorems in probability theory. This theorem is significant since we can find the expected value of
a transformed distribution without knowing anything about the transform distribution. Now we are ready to formally
define R, the total return, as a random variable that depends
on the binomial distribution. And this makes sense since B is a random
variable that takes on the number of wins. And we want to repeatedly
multiply (1 + tr) for each win, and (1 - sr) for each loss. Let's try taking the expected
value and see what happens. And we can use a LOTUS to
evaluate the expected value. At this point the summation looked too nasty, so I evaluated it using Wolfram Mathematica. So kudos to you if you can
try and simplify it by hand. At this point, it's probably good to
stop and think about what this means. In the very beginning, we informally introduced the random variable
G to denote the net gain per single play, and we also calculated the expected
value of net change per single play. so we can make the substitution for
the expected value of the total return. This expression intuitively makes sense as if we have a net change of E(G) per play, then we are on average
multiplying by (1+E(G)) per play. And this expression is maximized if we
take the little r, which is the risk, to be as big as possible. It's not too hard to see that maximizing
the expected value is not the best strategy. But to provide a concrete example, suppose
there is a game where you flip a fair coin, and if you flip head you quadruple your risk,
and if you flip tail you lose your wager. And if we take the strategy to maximize
the expected value of the total return, The only way we can win is
to consecutively flip heads, and if we flip just one tail
we lose all of our money. As the number of trials get larger, your chance of becoming rich
approaches zero exponentially fast. And you are almost certainly
guaranteed to lose everything. Maximizing the expected value
was not the best choice, so what if we maximize the median instead? For continuous distribution, the
definition of median is obvious, which is the value that splits
the probability into two halves. But we need a more general definition that
works for discrete random variable as well. So the definition that we will
use for median is any real value m such that the probability
of X ≤ m is at least half, and probability X ≥ m is at least half. So if we look at this binomial distribution X, the total probability less than
or equal to 2 is more than half, and the probability greater than or
equal to 2 is also more than half. So for this distribution,
the median is equal to 2. Let's take a look at another
binomial distribution, which happens to be symmetric and bimodal. And the total probability less than or equal to 2 and greater than or equal to 2 is at least half. So 2 is the median of this distribution. And by the same reasoning, 3 is a median as well. And if we take any number in
between 2 and 3 such as 2.5, that satisfies the definition of median too. So for this case, the median is the
set of closed interval from 2 to 3. Then what is the median of
a binomial distribution? It is the expected value
rounded to the nearest integer. Well that was a lie, we don't actually
have a close formula to find the median in terms of n and p, but we
do know a few things about it. And one thing for sure is that median could either take on np rounded up or down, or maybe both at the same time. This isn't something we could prove quickly, but I want to provide an
intuition for big values of n since we are interested in
long-term winning strategy. So for large values of n, the binomial distribution starts to
look more and more like the bell curve, so the mean and the median should
roughly be equal to each other. But since B can only take on
integer values, so does the median. For rest of the video we
will just approximate that the median of the binomial distribution
is roughly equal to the mean. Now how do we compute the median
of a transformed distribution? For example, if we have some distribution X, and if we apply a power to it, the probability mass function
essentially looks the same since there was no rearrangement
in the ordering of each bars. So the median would be represented by
the same bar on the left and the right. So the median of the transform is
the same as transform of the median. Now what if we apply a monotonically
decreasing transformation like 2^-X? Then the ordering completely flips,
but the middle is still the middle. So the median commutes with the
transformation for this case as well. What if we have a transformation
that is not strictly increasing, like the integer part of X divided by 3. I marked where the original
median was with the blue arrow, and since this function is not one-to-one, we would have to merge the bars together. Since 4 was the original median,
everything up to 4 is at least half. So everything up to 5 is at least half as well. And everything down to 4 is at least half, so everything down to 3 is at least half as well. so 1, which is the transform of 4, is
the median of the new distribution. So we can just find the median of a transform by taking the transform of the median as long as g is a monotonic function, whether it's increasing or
decreasing, strict or not. So can we find the median of the total return? We can write the transformation
as a single exponent function. And if the base of the exponent is
greater than one, it is increasing, if it's equal to 1, then it is constant, and if it's in between 0 and
1, then it is decreasing. If the base is 0 or negative we run
into some issues with the monotonicity, so if we restrict both the numerator
(1+tr) and the denominator (1-sr) to be positive, we get a restriction
of allowed range for r. Intuitively, 1/s is the maximum
possible risk that we can take, which is equivalent to all-in, but what does negative 1/t signify? That actually is the maximum
risk that we can possibly take if we are taking the short position. So now we have a reasonable range for r, and within this range, the
transformation is monotonic. So we can compute the median of big R by taking the transformation of the
median of the binomial distribution, then median can be approximated by np, so after some manipulation, we arrive at this expression. And we can think of this as a
single variable function of the risk since other variables are fixed parameters. So this allows us to do some single
variable differential calculus. 1/s and -1/t are the roots of this function, so if we graph this function
with respect to the risk, it looks something like an upside
down parabola with a single maximum. I normalized the graph so that the
peak is always at the same height since it is really what we want to know about. So if we vary the value of p,
the probability of winning, the location of the maximum changes. But if we change the value of n,
the shape of the graph changes, but where the maximum is does not. And this illustrates one of
the key techniques in calculus: instead of finding the maximum
of the original function, we can instead find the maximum
of a transformed function as long as the transform is monotonic. and we can take natural log as well
which is the monotonic function since it splits products into sums, and it makes calculus so much easier. Now let's take the derivative
which turns logs into reciprocals, and we have to pull out the
coefficient by chain rule. and we set it equal to zero in
order to find that single maximum. At this point all the heavy lifting is done, so I will leave it to you to solve for r, and get that r that maximizes the
median is indeed the Kelly's formula. We computed the optimal risk for mean and median, so we might as well try the
mode since those three are the averages you learn in
elementary school statistics. And unlike median, mode is fairly
intuitive to define for random variables. It simply is the value that
gives the biggest probability, and there can be multiple modes. Then what is the mode of a binomial distribution? Unlike the median, there is a
close formula in terms of n and p. This is a bit of overkill, and just like median, we will say that the mode is an integer
value close to the expected value, and we can approximate it
using the expected value. Then what is the mode of a
transformed random variable? If we simply take a discrete random
variable and permute the values, then the biggest bar is still preserved. But if we merge some bars, what was originally the biggest
may not be the new biggest. So one condition where mode commutes
with g is that g is one-to-one. By the way, this only holds
for discrete random variables and fails for continuous
random variables in general. So if you can come up with an example,
leave it in the comments below. Now what if we stack all the bars into one? That is, the transformation
is a constant function? Then every single value, including
what was originally the mode, transformed into the new mode. So mode commutes with g if
g is a constant function. I'm sure there are more interesting examples, but these two conditions are
the ones we will be using. Just like how we did for median, the exponential function is
either one-to-one or constant as long as the base of the
exponent is greater than zero. And we found the appropriate conditions. So mode of the transform is transform of the mode. So it will simplify to exact same
value that median came out to. So once we find the r that maximizes the mode, we will get Kelly's formula once again. If we plot the probability mass
function of the binomial distribution, it looks roughly normal. But if we transform it to the total return, the ordering is preserved
but the shape gets skewed. But if we instead plot this on a log scale, the shape becomes normal again. By the way the choice of the base being 5 was completely arbitrary in
terms of the shape that it gives, it just happened to give me the best picture. So this raises suspicion that there is an exponential or multiplicative
behavior lurking in the background. So we should try taking the geometric mean. Just like how we defined the
arithmetic mean for a random variable, we can generalize this for any mean. But things get pretty tricky
if we send alpha to 0. And for the case of sample geometric mean, the limit approach the product of
each data, then taking the n-th root. Now we will take a natural log on both sides since the power drops as the
coefficient and product splits into sum. And once we exponentiate both sides, we have an alternative formula for geometric mean that does not have product in it. In fact this looks much closer
to the generalized formula above compared to the formula that involves the product. So it could be thought of as the
missing link between the general form and the product form. In any case, we will use
this formula in a similar way to define the geometric
mean for a random variable. Let's first compute the
expected value of the log of R, which can be evaluated using LOTUS. Then we can drop the powers and split up the sum, and by the linearity of summation, we can split up the sum into
three separate summation. and pulling out things that does not
involve k outside of the summation. Notice how the circled part literally is the expected value of the binomial distribution. And the one in the blue circle
is sum of each probability, which should add to one. Now we can factor out like terms,
and make some substitutions, and this should look awfully familiar once we take the exponent
to find the geometric mean. It comes out to exactly same
expression as the median and the mode, and once again, we can derive
the Kelly's formula from here. I want to wrap it up, with a final remark that provides
an interpretation of the formula. Since the geometric mean is supposed
to be the multiplicative average, we can take the n-th root to find
the average net gain per single play. If we win, we multiply by
(1+tr) to the principal capital, and if we lose we multiply by (1-sr), so we can think of this as the
average multiplier per single play. And if we instead find the
geometric mean of the total return for the binomial random variable of one trial, we get this exact value as well.