Hey friends, welcome to the YouTube channel ALL ABOUT ELECTRONICS. So in this video, we will learn about the IEEE standard which is used for the floating point numbers. So in the previous video, we have seen how to normalize any binary number and how to represent it in the floating point format.
And after that, we have also briefly seen how this floating point number is stored in the memory. So, we have seen that while storing this floating point number, 1 bit is reserved for the sign bit and a few bits are reserved for the exponent. And then, the remaining bits are reserved for the mantissa.
Now for storing this floating point number, a certain standard has been defined. So this standard defines in how many bits this floating point number will be stored, and out of the total number of reserved bits, how many bits will be used for the exponent as well as the mantissa. And apart from that, it also defines in which format the mantissa and the exponent will be stored. Because if you see this exponent, then it can be either positive or negative.
That means this exponent should be stored in such a way that both positive as well as negative numbers get covered. So this IEEE 754 is one such IEEE standard which is commonly used for storing the floating point numbers. Now in this IEEE standard, depending on how many bits are used for storing the floating point number,
we have a total of 5 different formats. For example, in the half precision format, 16 bits are used for storing the floating point number. Similarly, when the floating point number is stored in 32 bits, then that format is known as the single precision format.
Likewise, in the double precision format, 64 bits are used for storing the floating point number. And likewise, this floating point number can also be represented in 128 or 256 bits. So, out of these 5 different formats, the single precision and the double precision formats are the commonly used ones. And in this video also, we will talk about these two formats.
So first, let's see the single precision format. So in this single precision format, the floating point number is stored in 32 bits. So out of the 32 bits, 1 bit is reserved to indicate the sign of the number. So if this bit is 0, then it means that the number is positive. And for the negative numbers, this sign bit will be equal to 1. After that, the next 8 bits are reserved to store the exponent value.
And then, the remaining 23 bits are used to store the mantissa. So we know that when we normalize a binary number, the significant digit just before the binary point will always be 1. Therefore, this 1 is not stored and only the fractional part is stored. That means in the single precision format,
once the binary number is normalized, the first 23 bits of the fractional part will be stored in the mantissa. But now let us see how the exponent part is stored. So as you can see, in these 8 bits, we can represent any unsigned integer between 0 and 255. But here, this exponent part can be either positive or negative. So the question is how to represent the negative values of the exponent. So we know that there are different ways of representing the negative numbers.
Like it can be represented in the 2's complement form, or it can be represented in the sign magnitude form. And similarly, it can also be represented in the bias representation. So here, in this IEEE 754 standard, the exponent value is represented using the bias representation. So first, let us understand what the bias representation is and why the exponent is stored in this format. So in this bias representation, a fixed offset or bias is added to the number in such a way that the negative number becomes positive.
For example, if we have an n-bit number, then the value of the bias should be equal to 2 to the power (n minus 1), minus 1, where n represents the number of bits. For example, for 8 bits, if you calculate the value of the bias, then that will come out as 127. Now in this 8-bit bias representation, the bias is added in such a way that the negative numbers will get shifted towards the positive side. And for the 8 bits, if we see the range of the stored exponent, then it will be from 0 to 255. That means if we want to see the actual range of the exponent which can be represented in the 8 bits, then we need to subtract this bias.
So after subtracting the bias, if we see the actual range, then it will be from minus 127 to plus 128. So, this table shows the actual numbers as well as the value of each number after adding the bias. And as you can see, the last column shows the corresponding bias representation. Now in this IEEE single precision format, the exponent values of all zeros and all ones are reserved for special purposes.
That means if we see the available range for the exponent, then it will be from minus 126 to plus 127. Now the advantage of this bias representation is that the numbers from negative to positive change in a specific order. That means here we have a continuity. And as you can see, the numbers are changing in the ascending order.
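Just to make this concrete, here is a minimal Python sketch of the bias calculation. The exponent value of minus 6 and the variable names are only hypothetical examples for illustration, not values from the video:

```python
# Bias for an n-bit exponent field: 2**(n - 1) - 1.
n = 8                              # exponent bits in single precision
bias = 2**(n - 1) - 1              # 127
actual_exponent = -6               # hypothetical exponent value
stored_exponent = actual_exponent + bias
print(bias)                        # 127
print(stored_exponent)             # 121
print(f"{stored_exponent:08b}")    # 01111001, always a non-negative pattern
```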
On the other hand, if you see the other representations which are used for the negative numbers, that is the 2's complement and the sign magnitude forms, then that is not the case. That means as we go from the negative to the positive numbers, there is no continuity in the representation. Moreover, in the sign magnitude representation, we have two different representations for zero.
So, if we use either of these two representations for the exponent, then there is a discontinuity in the number representation. And because of that, comparing two floating point numbers becomes difficult. So first let's understand how we compare two floating point numbers.
So during the comparison, first we compare the sign bit. So if the sign bits of the two numbers are equal, then we will compare the exponent part. And if that is also equal, then we will compare the mantissa part.
And from that, we can get to know which number is greater than the other one. Now for the exponent value, if there is a continuity in the number representation, then comparing the two numbers becomes much easier. For example, let's say we want to compare these two floating point numbers.
So here, just by comparing the exponents of the two numbers, we can say that the number 2 is greater than the number 1. That means when we want to compare two floating point numbers, this bias representation is very useful. And that is why the exponent is represented in this biased representation.
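To see why this ordering property matters, here is a small Python sketch. It treats the 32-bit single precision patterns of two positive numbers as plain unsigned integers; the values 6.5 and 104.0 and the helper name single_bits are just hypothetical examples. Because the biased exponent is never negative and sits above the mantissa bits, the integer comparison gives the same answer as the floating point comparison for positive numbers:

```python
import struct

def single_bits(x):
    """Raw 32-bit single precision pattern of x, as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

a, b = 6.5, 104.0                       # hypothetical positive numbers
print(single_bits(a) < single_bits(b))  # True, by plain integer comparison
print(a < b)                            # True, the same ordering
```

Alright, so now let's take a couple of examples, and let's see that if a number is stored in this IEEE single precision format, then how to find its actual value. So let's say this is a 32-bit number which is stored in this single precision format.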
So we know that in this single precision format, the MSB is the sign bit. Then the next 8 bits represent the exponent value. And the remaining 23 bits represent the mantissa value.
So here, since the MSB is 0, we can say that the given number is a positive number. So now, let's find the actual value of this exponent. So here, the value of the exponent is equal to 1000 0101. And in decimal, that is equivalent to 133. So here, this value which is stored in the exponent field is along with the bias.
So if we want to know the actual value of this exponent, then we need to subtract the bias from this 133. And we know that for this single precision format, the value of the bias is equal to 127. That means if we see the actual value of the exponent for the given number, then that is equal to 6. That means here, the exponential part is equal to 2 to the power 6. Now we know that in the normalized binary form, before this mantissa, there will be a binary point, and the digit before the binary point will always be 1. So here in this fractional part, we can remove all the trailing zeros. That means this will be the significand of the given number.
And along with the exponent value, this is the normalized binary number. So now, let us convert this normalized binary number into the true binary number. So here, since the exponent is equal to 6, we will shift this radix point, that is the binary point, towards the right side by 6 bits.
That means now in the true binary format, this number is equal to 1001111. And in decimal, that is equal to 79. So we can say that the given 32-bit number in the single precision format corresponds to 79. So similarly, let's take another example. So here, for the given 32-bit floating point number, let's find out the equivalent decimal number. So as you know, in this 32-bit format, the MSB represents the sign bit. And here, since it is 1, the given number is a negative number.
So now, let's find out the actual value of the exponent. So here, the exponent is equal to 10000011. And in decimal, that corresponds to 131. So like I said, while storing the value of the exponent, the bias of 127 is added to the actual exponent. So now, from this exponent value, if we subtract the bias, then we will get the actual value of the exponent. So here, that is equal to 4. So, we can say that here the exponential part is equal to 2 to the power 4. So, similarly, now let's see the mantissa part.
So, here this is our mantissa part. And we know that in the normalized binary form, just before this mantissa, there is a binary point. And the digit before this binary point will always be 1. So, here in this fractional part, we can remove all the trailing zeros.
That means now this will be our significand. And along with the exponent, this is our normalized binary number corresponding to this 32-bit number. So here, since the exponent is equal to 4, we will shift this radix point, or binary point, towards the right side by 4 bits. And after shifting this binary point, this will be our binary number. That is 11100.11.
So now if we just see the integer part, then this 11100 in decimal corresponds to 28. And similarly, this 0.11 in decimal corresponds to 0.75. That means the overall number will be equal to 28.75. But here, since the sign bit is equal to 1, the given number is negative.
So we can say that for the given 32-bit number, the equivalent decimal number is equal to minus 28.75. So in this way, if we have been given any 32-bit number in the single precision format, then we can easily find the equivalent decimal number.
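Both of these worked examples can be cross-checked with a short Python sketch. The decoder below is only a minimal illustration written from the rules above (the helper name decode_single is hypothetical), and it handles normalized numbers only, ignoring the reserved all-0s and all-1s exponent patterns:

```python
import struct

def decode_single(pattern):
    """Decode a normalized IEEE 754 single precision bit pattern by hand."""
    sign     = (pattern >> 31) & 0x1         # 1 sign bit
    exponent = (pattern >> 23) & 0xFF        # 8 exponent bits, bias of 127
    fraction = pattern & 0x7FFFFF            # 23 mantissa bits
    value = (1 + fraction / 2**23) * 2**(exponent - 127)  # implicit leading 1
    return -value if sign else value

# The two numbers decoded above:
print(decode_single(0b0_10000101_00111100000000000000000))  # 79.0
print(decode_single(0b1_10000011_11001100000000000000000))  # -28.75
```

So now, let us see the other way around, and let us find out how to represent any decimal number in this IEEE 32-bit format.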
So let's say, we want to represent 12.625 in this IEEE format. And for that, first of all let us find the equivalent binary number. So here, this 12 in binary corresponds to 1100. And this 0.625 corresponds to 0.101. That means if we see the equivalent binary number, then that is equal to 1100.101.
So now as a second step, let's normalize this binary number and write it in the floating point representation. So in a normalized number, we should have only one significant digit before the binary point, and we know that it should be equal to 1. So here for that, we need to shift this binary point towards the left side by 3 bits. And since we are shifting the binary point towards the left side, the exponent value will increase. So in this case, since we are shifting it by 3 bits, the value of the exponent will increase by 3. And after shifting, this will be our normalized binary number.
So now, let's see how to represent this normalized binary number in the 32-bit format. So here, since the number is positive, this sign bit will remain 0. Now here, we know that the 1 just before the binary point is not stored in this 32-bit format, and only the fractional part is stored.
So here, this fractional part is 100101. So first, let's copy this and write it in the mantissa part. And after that, let's fill the next 17 bits with 0s. So in this way, we got the 23 bits of the mantissa. So now, the only remaining part is the exponent. So here as you can see, the exponent is equal to 3. So here, before storing this exponent value, first we need to add the bias. That means here, the stored value of the exponent will be 3 plus 127, that is equal to 130. And in binary, that is equal to 1000 0010. So in this way, we got the sign, the exponent, and the mantissa parts of this 32-bit number.
So typically, these 32 or 64-bit long numbers are written in the hex format, that is the hexadecimal format. So here, to find the equivalent hexadecimal number for the given 32-bit number, let us make groups of 4 bits.
So here, the first group will be equal to 0100. Then the next group will be equal to 0001. Similarly, the next group is equal to 0100. And likewise, the next group will be equal to 1010. And after that, we will have 4 groups of zeros. So here, this 0100 corresponds to 4 in hexadecimal. Likewise, this 0001 corresponds to 1. Similarly, this 0100 corresponds to 4, while 1010 corresponds to A.
And then, we will have the four zeros. So, we can say that for the given decimal number, the equivalent 32-bit number in the IEEE format is equal to 414A0000. And as you can see over here, this number is shown in the hex format.
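This encoding can also be double-checked with a minimal Python sketch. The standard struct module rounds a number to single precision with the ">f" format, and reading those same 4 bytes back as an unsigned integer exposes the hex pattern:

```python
import struct

# Round 12.625 to single precision and view the 4 bytes as one integer.
pattern = struct.unpack(">I", struct.pack(">f", 12.625))[0]
print(f"{pattern:08X}")   # 414A0000
```

Alright, so now let's see the largest and the smallest numbers that can be represented in this single precision format.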
So in this IEEE single precision format, the largest value of the exponent will be equal to 127, while the minimum value will be equal to minus 126. So, with the largest and the smallest values of the exponent, if we see the floating point number in a normalized form, then this is how it will look. So, in this format, for the largest number, all the bits in the mantissa part should be equal to 1. And similarly, for the smallest number, this mantissa part should be equal to 0. So, if you see the significand of this largest number, then it is slightly less than 2, and it can be given as 2 minus 2 to the power minus 23. And further, it will get multiplied by 2 to the power 127. So, if we calculate the value of this term, then it is roughly equal to 3.4 times 10 to the power 38. And similarly, for the smallest number, the mantissa part is equal to 0. That means the smallest representable normalized number is equal to 2 to the power minus 126, and that is roughly equal to 1.1 times 10 to the power minus 38. That means in this IEEE single precision format, if you see the largest and the smallest numbers, then they are in the order of 10 to the power 38 and 10 to the power minus 38 respectively.

On the other hand, if you see the 32-bit fixed point representation, then for the signed integers, the maximum positive number is equal to 2 to the power 31 minus 1, which is roughly equal to 2.1 times 10 to the power 9. So if we compare this fixed point with the floating point numbers with the same number of bits, then the floating point number covers a much larger range.
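Assuming the layout described above, these two limits can be reproduced in a small Python sketch by building the bit patterns directly: a stored exponent of 254 (actual 127) with an all-1s mantissa for the largest number, and a stored exponent of 1 (actual minus 126) with an all-0s mantissa for the smallest normalized one. The helper name from_bits is just for illustration:

```python
import struct

def from_bits(pattern):
    """Interpret a 32-bit pattern as a single precision number."""
    return struct.unpack(">f", struct.pack(">I", pattern))[0]

print(from_bits(0x7F7FFFFF))   # 3.4028234663852886e+38, (2 - 2**-23) * 2**127
print(from_bits(0x00800000))   # 1.1754943508222875e-38, 2**-126
```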
Now the question arises: how does this floating point number cover a greater range? So the thing is, the floating point number covers the greater range, but at the cost of precision. So for example, in this 32-bit floating point representation, 23 bits are reserved for the mantissa. That means in the 32 bits, the precision that we can achieve is up to 23 bits.
Or in decimal, that is equivalent to about 7 significant digits. So if you want to represent any number beyond the 7-digit precision, then it cannot be represented accurately in this 32-bit format. For example, if you want to represent this number, then in 32 bits, it cannot be represented accurately. Because if you normalize this number, then it will be in this form. That means here, after the decimal point, we will have 9 significant digits.
But like I said, in this 32-bit floating point format, we can only represent up to 7 significant digits. So here, the last two digits will get rounded to the nearest representable value and then it will be stored in the 32-bit format. That means in the floating point numbers, we are achieving the greater range, but at the cost of precision.
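Here is a quick Python sketch of this rounding. The 9-significant-digit value 1.23456789 is only a hypothetical example, not the number from the video; packing it into single precision and unpacking it again shows what actually gets stored:

```python
import struct

def to_single(x):
    """Round x to the nearest single precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

print(to_single(1.23456789))   # 1.2345678806304932, only ~7-8 digits survive
```

Now the thing is, with the 32 bits, the total number of distinctly representable numbers is equal to 2 to the power 32. So, in this floating point number, or specifically in this single precision format, all these distinctly representable numbers are spread over the entire range. So in these 32 bits, these are the smallest representable non-zero numbers in the normalized form.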
That means here, we cannot represent any normalized number smaller than these two numbers. Similarly, these are the largest positive and the negative numbers which can be represented in this 32-bit format. And beyond this range, we cannot represent any number.
So as you can see, in this floating point number, the numbers are spread over the entire range. That means here, they are not distributed uniformly. On the other hand, in the fixed point representation, all the numbers are distributed uniformly. That means the spacing between the numbers is equal.
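This non-uniform spacing is easy to see in a small Python sketch: adding 1 to the raw bit pattern gives the next representable single precision number, and the gap grows with the magnitude. The helper name next_up is just for illustration, and it assumes a positive, finite input:

```python
import struct

def next_up(x):
    """Next representable single precision value above a positive x."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits + 1))[0]

print(next_up(1.0) - 1.0)         # ~1.19e-07, the gap near 1
print(next_up(1024.0) - 1024.0)   # ~1.22e-04, a much larger gap near 1024
```

In the fixed point representation, by contrast, this gap stays the same at every magnitude.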
And that is why, this fixed point number covers the smaller range. So in short, the floating point number covers the greater range at the cost of precision. And if we want more precision, then we can go for the double precision format. So in this double precision format, the floating point number is stored in the 64 bits. So out of the 64 bits, 1 bit is reserved for the sign bit, while the next 11 bits represent the exponent.
And then the remaining 52 bits are reserved for the mantissa part. So here, since 11 bits are reserved for the exponent, the value of the bias will be equal to 2 to the power 10, minus 1. That means for the double precision format, the value of the bias will be equal to 1023. So once again, in this 11-bit exponent, the all-0s and all-1s values are reserved for special purposes. So excluding that, if you see the range of this stored exponent, then it will be between 1 and 2046. So here, to get the actual value of the exponent, we need to subtract the bias from this value.
That means in this 64-bit format, the maximum value of the exponent which we can represent is equal to 1023, that is, the exponential part can be as large as 2 to the power 1023. And similarly, the minimum value of the exponent is equal to minus 1022. So correspondingly, if we see the smallest representable normalized number, then that is around 2.2 times 10 to the power minus 308. And similarly, the largest representable number is around 1.79 times 10 to the power 308. So as you can see, this double precision format covers a huge range. And here it also provides better precision, because here the mantissa has 52 bits.
Or in decimal, that is equivalent to about 16 decimal digits. That means if we require the greater precision, then we can go for this double precision format.
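These double precision limits are easy to confirm in a short Python sketch, since CPython's float type is an IEEE 754 double and the standard sys.float_info reports its limits directly:

```python
import sys

print(sys.float_info.max)   # 1.7976931348623157e+308
print(sys.float_info.min)   # 2.2250738585072014e-308, smallest normalized
print(sys.float_info.dig)   # 15 decimal digits are always preserved
```

So that is all about the IEEE single precision and the double precision formats. Now so far, we have seen that in this floating point format, the exponent values of all zeros and all ones are reserved for special purposes.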
So, in the next video, we will see that when the value of the exponent is equal to all zeros or all ones, what it signifies and how to interpret it. But I hope in this video, you understood the basics of this IEEE 754 standard and how the floating point numbers are stored in this IEEE standard. So if you have any question or suggestion, then do let me know here in the comment section below. If you like this video, hit the like button and subscribe to the channel for more such videos.