Transcript for:
Exploring Floating Point Number Representation

numbers. what even are they? and how did we manage to trick computers into using them? these are very important questions. one of the most common ways computers represent numbers internally is a system called “floating point”. if you’re like me, the only time you’ve heard people really talk about floating point is to explain what’s going on when the system does something wrong. floating point is why point one plus point two isn’t always point three. it’s the system with “negative zero” and “Not a Number”. however, I think the specifics of the standard way computers implement floating point, both on a technical level and on a philosophical level, are very interesting, and are worth examining on their own. I’m jan Misali, and let’s walk through the process of how you could have invented floating point yourself, if you really wanted to. okay, so first of all, you all know what bits are, right? a bit is an abstract unit of information. it can be implemented in any number of ways, but the only important thing is that a bit can be in two different states. “on” and “off”, “true” and “false”, “one” and “zero”, or literally anything else. let’s stick with ones and zeros, since we’ll be using these bits to represent numbers. here’s a symbolic representation of a string of eight bits, known as a “byte”. the most common way to interpret this as a number is, well, to interpret it as a number. not in decimal, of course, but in binary. every position within the byte is associated with a specific power of two. all you need to do is multiply each bit by the value of its position and add them all together. using this system, one byte can represent any integer from 0, represented as eight zero bits, all the way up to 255, represented as eight one bits. this particular implementation is called an “eight-bit unsigned integer”. it’s “unsigned” because there’s no way to represent negative numbers. this format is very limited. one byte is not a lot of information, so when modern computers store numbers they tend to use more than just eight bits. let’s bump this up to a much more sensible thirty-two bit unsigned integer, with a whopping four bytes of information. with 32 bits, you can represent a much wider range of numbers, with the maximum value being over four billion. that’s pretty big, and there aren’t that many situations where you’d really need a number bigger than that. but, there aren’t exactly no situations where you need a number bigger than that. the total global population, for example, is above this limit. we’d need thirty-three bits to represent a number that big. but, okay, when you’re dealing with numbers in the billions, do you really need this many bits to represent them? like, let’s say we did have a 33-bit format, and we wanted to use it to represent the current global population. what would the bit on the end there, that one extra bit we added, even be doing? to give it a value would be to say that the total number of humans alive today is either even or odd, which is kinda absurd, isn’t it? we can’t possibly know the global population to that level of precision, and even if we did, it’s a number that changes so frequently that the value of the last handful of bits in the number would become out of date almost immediately. surely, there must be a better way to use all these bits, right? like, maybe we could make these numbers a little less precise, and in exchange have them represent a much wider range of values! not just really big numbers, but really small numbers too.
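here’s a quick Python sketch of that “multiply each bit by the value of its position and add them all together” idea, before moving on. the bits_to_unsigned name is made up purely for illustration; it isn’t from any particular library.

```python
def bits_to_unsigned(bits):
    """interpret a string of '0' and '1' characters as an unsigned binary integer."""
    value = 0
    for bit in bits:
        # each step shifts everything one place to the left (doubling it)
        # and then adds the new bit in the ones place
        value = value * 2 + int(bit)
    return value

print(bits_to_unsigned("00000000"))   # 0, the smallest eight-bit value
print(bits_to_unsigned("11111111"))   # 255, the largest eight-bit value
print(bits_to_unsigned("1" * 32))     # 4294967295, the largest 32-bit value
print(bits_to_unsigned("1" * 33))     # 8589934591, the kind of number 33 bits gets you
```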
this is the problem that floating point exists to solve, and to understand how it pulls it off, we’ll need to get right to the point. this point, right here, placed directly in the middle of the 32-bit number. I guess you might want to call this a “decimal point”, but technically it’s not a decimal point, because we’re not using decimal, are we? so, now that this binary point is here, we can just interpret the number as usual, just with the orders of magnitude for the bits adjusted, in the way you’d expect. compared to a 32-bit unsigned integer, the range of values this can represent is way smaller, but importantly, all those bits after the binary point mean there’s a whole continuum of real numbers you can represent between all the integers. like, hey, you can write “one and a half” with this! of course, this point isn’t really part of the number, at least, not from the computer’s perspective. it’s only written here for our benefit. remember, in the world of computers, everything is zeros and ones. there aren’t any extra symbols to spare to use as a point like this. in order to get the computer to know that there’s a point here, it needs to be told ahead of time that it should interpret this particular sequence of four bytes as a number with a point in the middle. that is to say, the location of the point is part of the format itself. since this type of format always puts the point in the same place, this is called “fixed point”. and this is where we get to the first important idea behind the floating point format, and it’s the idea where it gets its name from. you’re thinking it too, right? like, if we’re being limited by how the computer has to always assume that the point is in the same place, what if we just tell the computer where to put the point? what if instead of being fixed in one spot, the point was free to float around, and be placed anywhere in the number? so, let’s try it! with 32 bits to work with, there are 31 positions between those bits where the binary point could go. with just five bits, we could store a number that tells the computer which of those spots to put the point. simple! okay, problem, we are no longer representing this number with 32 bits. these extra five bits aren’t like, free. this has gone from a 32-bit number to a 37-bit number. alright then, so just take the first five bits from our number and use those to indicate where the point should go, then use the other 27 bits for the actual digits of the number. the “mantissa”, as it’s called. in fact, now that there’s only twenty-seven bits in the mantissa, these five bits for where the point should go allow us to do something very clever. with 27 bits, there are 26 places between those bits where you could put the point. but since we’re indexing where to put the point using a five bit number, that number has 32 possible values, which is six more than we actually need. that means that we can use those extra values to place the binary point outside of the mantissa, with three extra positions to the left, and three extra positions to the right. and, while we’re at it, why not go even further? just remove a few more of those mantissa bits and give them to the point-index, and now the first byte of the number is used for the position of the binary point, with the remaining three bytes for the number’s digits. you’ve probably noticed the tradeoff here. 
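before getting to that tradeoff, here’s a rough Python sketch of the point-index format as described so far. the exact meaning of the index (how many mantissa digits sit to the left of the binary point) is just one way to pin the idea down, and the decode_point_index name is made up for illustration.

```python
def decode_point_index(index, mantissa_bits):
    """decode a toy 'point-index' number.

    `index` is how many mantissa digits sit to the left of the binary point;
    an index that's negative or bigger than the number of digits places the
    point outside the mantissa entirely, padding with zeros.
    """
    digits = int(mantissa_bits, 2)
    # moving the point one spot left halves the value and one spot right
    # doubles it, so the whole thing is digits * 2^(index - number of digits)
    return digits * 2.0 ** (index - len(mantissa_bits))

# "one and a half" is 1.1 in binary: mantissa 11 with the point after the first digit
print(decode_point_index(1, "11"))   # 1.5
print(decode_point_index(2, "11"))   # 3.0, same digits with the point one spot right
print(decode_point_index(5, "11"))   # 24.0, the point placed past the end of the mantissa
```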
the more bits you use for the position of the point, the wider the range of numbers you can represent ends up being, but at the same time you lose a lot of precision. if you want to get really wild, you could imagine using some extra bits to tell the computer where exactly the point-index ends and where the mantissa begins, in some sort of floating floating point system, but let’s not get carried away here. this rather simple implementation of floating point still has a couple of problems, which we’ll need to solve before we have something with the same capabilities as the standard floating point format. we’ll start with the easier one. this format currently doesn’t have any way to write negative numbers. there are a few different ways to store negative numbers, but thankfully, the way they work in the floating point standard is the easiest method to explain. you just take one bit and use it as the “sign bit”. if the sign bit is zero, the number is positive, and if it’s one, it’s negative. and that’s it. now, you might be thinking, “say, why would you use zero for positive and one for negative? if zero is off and one is on, or if zero is false and one is true, shouldn’t it be the other way around?”, and that’s a really good question. [beat] the other problem with our simplified floating point model is that there’s a lot of redundancy. as in, there can be multiple different ways to represent the same number. for example, if you want to represent one and a half, 1.1 in binary, you can put the one-one anywhere you want within the mantissa, and then say that the point is between the two ones. the worst case is with the number zero, where not only can the point go literally anywhere, but since positive zero and negative zero are the same thing, the entire first nine bits are all completely useless, so there’s over five hundred ways to represent this one number. this is a problem. what we’d like is for any number you can store in this format to have one single correct representation, so that way we can get the most out of these 32 bits. it would be really nice if all four billion ish values represented different numbers, cleanly distributed across the whole range. the solution comes from a different, similar thing: scientific notation. scientific notation is a lot like floating point. in fact, you could say that it is a floating point system. or, more commonly, you’ll hear people say that floating point is a type of scientific notation. the point is that they both achieve the same thing, so we can look at how standard scientific notation works to resolve the ambiguity problem. in normal scientific notation, the position of the decimal point (it really is a decimal point this time) isn’t indicated with an index like what we’ve been doing here. instead, there’s an exponent. there’s some power of ten that you multiply by the mantissa, and that’s how you work out where the decimal point goes. this does the same exact thing we’ve been doing, but it’s in decimal, and it feels more like real math. crucially, in scientific notation, the way you write the mantissa always includes a decimal point after the first digit. this means there is a single canonical way to represent a given number, which solves the exact problem we have in our floating point implementation. so, let’s do it! we’ll just take the eight point-index bits and turn them into an exponent. we’re in binary, not decimal, so this exponent is a power of two, not a power of ten, but it works the same way. 
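as a quick sketch of that “one canonical representation” idea, here’s what it looks like in Python, borrowing the shape of scientific notation. math.frexp is a real standard-library function; the normalize helper and its output format are made up for illustration.

```python
import math

def normalize(x):
    """split a non-zero value into (sign, exponent, mantissa) with 1 <= mantissa < 2."""
    sign = -1 if x < 0 else 1
    mantissa, exponent = math.frexp(abs(x))   # frexp gives a mantissa in [0.5, 1)
    return sign, exponent - 1, mantissa * 2   # shift by one so it reads "1.xxxx..."

print(normalize(1.5))      # (1, 0, 1.5): 1.1 in binary, point right after the leading 1
print(normalize(6.0))      # (1, 2, 1.5): same mantissa, the exponent does the scaling
print(normalize(-0.375))   # (-1, -2, 1.5): the sign is stored separately
```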
then we just insert a binary point after the first of our 23 mantissa bits and we’re good to go! well, almost. there’s now the issue of how to interpret the exponent. this isn’t really that different from the way it worked before, and you can totally still think of it as being an index for where the point is supposed to go if you want, but it will still be helpful to go over the underlying math. as before, these eight bits are most naturally interpreted as an integer value from 0 to 255, but if we do that, we don’t have any way to represent numbers between zero and one. we’ll need to use negative numbers again. so, simple enough. we just do the same thing we did last time. use the first bit as a sign bit, and the other seven as a seven-bit number from 0 to 127, so now we can have exponents from negative 127 to positive 127. cool! but it isn’t cool, because guess what? negative zero is back. that’s a whole extra value for this exponent that isn’t doing anything. so, new plan! don’t bother with the sign of the exponent, just store it as a value from 0 to 255, but then subtract 128 from it, to adjust it so now its range is from minus 128 to positive 127. or, maybe it could go from minus 127 to positive 128? that’s the thing, we’re gonna have an extra one no matter what we do, so unless we can find something useful to do with an extra unused exponent, this is going to need to be unbalanced. let’s come back to this later. for now, let’s make another little adjustment to this. there’s one fun quirk that comes from doing all of this in binary and not decimal. to convert a normal decimal number into standard scientific notation, where do you move the decimal point? well, you put it after the first digit, right? except the first digit isn’t really the first digit. there’s all these zeros that go before it. this is more obvious when the number is less than one, because we actually write all those zeros in that case. so, the decimal point is specifically moved after the first non-zero digit. that’s really important. going back to binary, we naturally do this same thing. given some binary number, if we want to prepare its mantissa so it can be represented as a floating point number, we just need to move the binary point to go after the first digit that isn’t a zero. did you catch that? the first non-zero digit of a number written in binary. in decimal, the first non-zero digit could be anything from one to nine, but in binary, there’s only one digit that isn’t zero. it’s one! in other words, every binary number starts with one. this means there isn’t any reason to bother storing this bit in the first place! since we know it’s always one, we can effectively use our 23 bits to contain twenty-four bits of a mantissa! a whole extra bit of information, completely for free. this also solves our earlier problem of having too many ways to write zero, because now it’s completely impossible to write zero at all. oops! I lied just then, about how every number in binary starts with one. this is almost always true, and in fact it’s true about exactly 100% of real numbers written in binary, but there’s exactly one exception. so for those keeping track, we now have two problems left to solve. we can’t write zero anymore, and we have an extra value for the exponent that makes the system awkward and unbalanced. are you thinking what I’m thinking? exactly, right? we have one thing we aren’t sure how to write, and we have one thing we can write that we aren’t sure how to interpret.
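hold those two problems in mind for a moment. here’s a small Python sketch of the decoding rule built up so far for ordinary “normal” values, with the biased exponent and the implicit leading one. a bias of 127, the value the real standard ends up settling on (more on that shortly), is assumed here so the result can be checked against an actual single-precision bit pattern via the struct module; the decode_normal name is made up for illustration.

```python
import struct

def decode_normal(bit_string, bias=127):
    """decode a 32-bit pattern (a string of 32 '0'/'1' characters) as a normal number."""
    sign = -1 if bit_string[0] == "1" else 1
    exponent_field = int(bit_string[1:9], 2)   # stored as a plain value from 0 to 255
    mantissa_field = int(bit_string[9:], 2)    # the 23 mantissa bits that actually get stored
    # the leading 1 is never stored, so put it back: the mantissa is really 1.xxxxx...
    mantissa = 1 + mantissa_field / 2**23
    return sign * mantissa * 2.0 ** (exponent_field - bias)

# the actual single-precision bit pattern of 3.0, via the struct module
bits = format(struct.unpack(">I", struct.pack(">f", 3.0))[0], "032b")
print(bits)                  # 01000000010000000000000000000000
print(decode_normal(bits))   # 3.0
```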
so why not just throw those together and use the extra exponent to write zero? excellent! so, all we gotta do is set the exponent’s “bias” so its range is from minus 128 to positive 127, then reserve the negative 128 exponent to be used for zero! well, hmm. now we’re back at having too many ways to write zero. like, sure, there is one obvious canonical way to do it, having all the bits in the mantissa be zero, but it would be nice to put those bits to use so we’re not wasting all these potential numbers. once again, the solution to this problem comes by trying to solve a separate problem that we’ve ignored up until now. in its current form, our model floating point system represents all non-zero numbers with the same level of precision, with 23 bits after the binary point. but there’s a really weird problem that comes from doing it this way. let’s say you have two really small numbers, and you want to calculate the difference between them. as it stands, the difference between these numbers is too small to be represented within our system! currently, we have no choice but to round down to zero. but what if there was another way? after all, the way we came up with just now to write zero means that “zero” can actually have all sorts of values. maybe there’s some way we can use those extra ways to write zero to instead represent these numbers, numbers that are too small to be represented normally. “subnormal numbers”, you could call them. here’s the solution. when the exponent is at its minimum value, which right now we’re saying is negative 128, you actually interpret it as being an exponent of minus 127, as though the exponent were one higher, but then you take the leading digit, the one that is supposed to always be one, and replace it with a zero instead. this solution works great. these subnormal numbers mean you can have numbers orders of magnitude smaller than the exponent on its own can specify, with the only caveat being that they get increasingly less precise the closer you get to zero. and so, we now have a fully functional floating point format. good job! with 32 bits, it can represent really small numbers and really large numbers. mm. except, there’s one thing that’s bugging me here, and by me I mean the creators of the IEEE floating point standard that I’m pretending to be designing myself. but before I get to that, let’s talk about the philosophy of floating point for a second. I’ll admit it’s kinda weird to think of a number format as having a philosophy, but with floating point numbers there absolutely is one. floating point arithmetic is all about this compromise between precision and being able to use a wide range of numbers. sure, you can write the number one quadrillion, no problem, but it’s not exactly one quadrillion. floating point numbers aren’t exactly “real numbers”, in the mathematical sense. they represent ranges of “real numbers”. three doesn’t mean exactly three, it means “any number which is closer to three than it is to any other floating point number”. there is an uncountably infinite number of real numbers that could be represented as “three” in floating point. but of course, with 23 or 24 bits of the mantissa, that range is very tight. the next floating point number after three is like 3.0000002. for most practical purposes, this level of precision is enough, and when it’s not, you can just upgrade from 32 bits to 64 bits. this is why it’s not really that much of a problem that we still have negative zero. 
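as a quick sketch before getting back to negative zero, here’s the subnormal rule in Python, following the running model above: an 8-bit exponent stored with a bias of 128, the minimum exponent reserved, and subnormals read as “exponent one higher, leading digit zero”. the decode_model name is made up for illustration, and the example reuses the “difference of two really small numbers” situation described above.

```python
def decode_model(sign_bit, exponent_field, mantissa_field):
    """decode our model format: 1 sign bit, an 8-bit exponent (0..255), a 23-bit mantissa."""
    sign = -1 if sign_bit else 1
    if exponent_field == 0:
        # subnormal: act as though the exponent were one higher (minus 127 here),
        # but with a leading 0 instead of the usual implicit leading 1
        return sign * (mantissa_field / 2**23) * 2.0 ** -127
    # normal: implicit leading 1, exponent stored with a bias of 128
    return sign * (1 + mantissa_field / 2**23) * 2.0 ** (exponent_field - 128)

# two really small normal numbers that differ only in their last mantissa bit...
a = decode_model(0, 1, 1)
b = decode_model(0, 1, 0)
print(a - b)                             # tiny, but not zero
# ...and their difference is itself representable, as the smallest subnormal
print(a - b == decode_model(0, 0, 1))    # True
```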
zero doesn’t mean zero, it means “any number which is too small to be distinguished from zero”. thanks to the subnormal numbers in this system, that range is extremely small, but it’s still an uncountably infinite set of numbers, and not one single exact value. so, distinguishing between positive and negative in this case, while somewhat silly, can still be a meaningful thing to want to do. and this is where the one final problem we need to solve to turn our model into the actual IEEE single-precision floating point standard comes in. see, while having the exponent go from negative to positive 127 with one extra value for the subnormals sounds balanced, it’s super not. these subnormals can get really close to zero. but on top of that, zero itself is there to get all the numbers that are too close to zero to fit in these subnormals. this is nice for small numbers, but what the heck do we do with big numbers? right now, the biggest number this system can represent is a little bit smaller than two to the power 128. that’s a really big number, but if floating point numbers are supposed to be used for ranges of real numbers, and they are, this big boy has the unfortunate burden of needing to stand in for infinity. infinity isn’t a real number. it’s a concept. but we’re not dealing with real numbers! this is floating point, baby! all of these numbers represent uncountably infinitely many real values, and the range from this number slightly under two to the 128 all the way up to infinity is just as valid as any other range. oh, and don’t forget negative infinity too, on the other end! so like, it would be nice if instead of this somewhat arbitrary not-quite-power-of-two being this system’s infinity, there was a different, more clearly marked-as-special value. after all, we used a whole exponent just for numbers too small to be represented normally, so what if we did the same thing on the other end for numbers too large to be represented normally? so, okay, back to the exponents. let’s bump that bias back up so it’s from minus 127 to positive 128 again, except the subnormals are really minus 126. those little guys already cover a whole bunch of orders of magnitude so they can manage. great, so now we can use the 128 exponent for infinity! except this is the same problem we had before, when we used a whole exponent just for zero. there’s two to the 23 values up here, they can’t all be infinite. maybe there’s some way to have supernormals, as a counterpart to the subnormals from before, so you can have increasingly larger numbers that get less precise until eventually they all round up to infinity. that would be cool, but that’s not how it works, at least not in the standard implementation. instead, infinity is represented with the maximum exponent and an all-zero mantissa, and all other values here in the two-to-the-128 zone are special numbers called “Not a Number”. one of the most interesting differences between floating point numbers and real numbers is the way that zero behaves. it’s not actually zero, remember, it’s just the way you represent any number too small to be distinguished from zero within the constraints of the format. so, in floating point, one divided by zero isn’t undefined. it’s infinity! this is only possible because “zero” means “a number too small to write normally” and “infinity” means “a number too big to write normally”. within the realm of floating point arithmetic, dividing by zero and getting infinity makes perfect sense.
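here’s one last Python sketch, this time just peeking at real single-precision bit patterns through the struct module: infinity is the all-ones exponent with an all-zero mantissa, NaN is the all-ones exponent with anything else, and the largest finite value really is a little bit smaller than two to the 128. note that plain Python raises an error on 1.0 / 0.0 rather than returning infinity, which is why the bit patterns are shown directly here; the float32_bits helper is made up for illustration.

```python
import math
import struct

def float32_bits(x):
    """return the 32-bit single-precision pattern of x as a string of bits."""
    return format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")

print(float32_bits(math.inf))    # 01111111100000000000000000000000: max exponent, zero mantissa
print(float32_bits(-math.inf))   # 11111111100000000000000000000000: same, with the sign bit set
print(float32_bits(math.nan))    # typically 01111111110000000000000000000000: max exponent,
                                 # non-zero mantissa (the exact mantissa bits can vary)

# the largest finite single-precision value: exponent 127, every mantissa bit set
largest = (2 - 2**-23) * 2.0**127
print(largest)                   # about 3.4e38, a little bit smaller than 2**128
print(float32_bits(largest))     # 01111111011111111111111111111111
```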
and, naturally, if you divide one by negative zero, you get negative infinity. but even though several typically undefined things like this are allowed, there are still things you can’t do. or at least, you can do them, but there isn’t really any way to interpret the result as a range of real numbers. this is the purpose of Not a Number, or “NaN”. most things you can do that get you NaN as a result are funky operations involving weird numbers like zero, infinity, and NaN itself. here’s a fun one. what’s zero multiplied by infinity? zero times anything should be zero, but infinity times anything should be infinity. and really since zero is in practice some very small but positive number and infinity is some very large but finite number, if we knew their exact values the answer could be literally anything, so it’s Not a Number. you also get Not a Number if the answer is defined but can’t be approximated as a range of real numbers, like if you take the square root of a negative number. so, just like how zero doesn’t really mean “zero” and infinity doesn’t really mean “infinity” and three doesn’t really mean “three”, “Not a Number” doesn’t really mean that something isn’t a number. it just means that a question that was asked cannot be given a meaningful answer within the restrictions of the format. this is all very cool and interesting, but what’s the point of having so many NaN values? well, since NaN is the result of performing an invalid operation, those 23 mantissa bits (plus the one sign bit) can be repurposed to indicate which invalid operation was performed, so that way you can see exactly what went wrong. exactly how this works is uh, weirdly unspecified by the standard. like, the standard has suggestions for what kinds of things you should do, but the exact way you do this is, quote, “left to the implementer’s discretion”. and with that, we have now successfully fully reinvented the IEEE standard for single-precision floating point numbers. it’s a little more complex than other ways you can represent numbers, but not that much more. I don’t really know why I like floating point numbers so much. I think the main reason is that from looking at this system, I can really feel the compromises its designers had to make. there were probably long arguments about exactly where to put the divide between the exponent and the mantissa, what range of values the exponent should have. and that whole thing with infinity and NaN is such a hack, right? like, they didn’t need to do that. like, if they wanted to have error detection be implemented into the representation itself, they could have just repurposed negative zero and have that be “Not a Number”. the other thing is that this whole way of thinking about math, where numbers are these ranges of values with limited precision, that’s so different from how I’m used to thinking about numbers. in floating point arithmetic, a quadrillion plus one is still a quadrillion, because with numbers that big, “one” is basically the same as zero. and that’s so weird! like, it makes sense when you’re dealing with real-world data, but it’s so strange to have this version of math where everything is an estimation, and you basically always need to assume that your numbers might be slightly wrong. I’ve been jan Misali, and much like infinity minus infinity, I too am not a number.