Transcript for:
Understanding Floating Point Numbers

Hey friends, welcome to the YouTube channel ALL ABOUT ELECTRONICS. So in this video, we will learn about floating point numbers. And we will see how very large numbers, like the mass of the planets or Avogadro's number, and similarly very small numbers, like the mass of an atom or Planck's constant, are stored in computers. During the video, we will also see the difference between fixed point numbers and floating point numbers. And with this comparison, we will understand the importance of floating point numbers in digital systems.

So first, let us understand what fixed point numbers are. In our day-to-day life, we deal with integers as well as real numbers. Now, when these numbers are represented in the fixed point representation, the position of the radix point, or the decimal point, remains fixed. So all integers are examples of fixed point numbers. For the integers, there is no fractional part, or in other words, the fractional part is equal to zero. So by default, the position of the decimal point is at the end of the least significant digit. And since there is no fractional part, typically we do not write this decimal point. But we can say that there is a decimal point on the right-hand side of the least significant digit, and the position of this decimal point will also remain fixed. Similarly, for the real numbers, the position of the decimal point is just before the fractional part. For example, 11.75 is a real number, where 11 is the integer part and the 75 just after the decimal point is the fractional part. So when these real numbers are represented in the fixed point representation, the position of this decimal point remains fixed.

Now, in any digital system, these numbers are stored in a binary format using a certain number of bits. Let's say in one digital system, these numbers are stored in a 10-bit format. Now, the issue with the fixed point representation is that, with the given number of bits, the range of numbers that we can represent is very limited. So if we take the case of the integers, and specifically the unsigned integers, then in the 10-bit format, we can represent any number between 0 and 1023. On the other hand, for the signed integers, the MSB is reserved for the sign bit. So using the 10 bits, we can represent any number between -512 and +511. That means using the 10 bits, the range of numbers that we can represent is very limited. So here, the range basically refers to the difference between the smallest and the largest number. By increasing the number of bits, we can increase this range. But still, if we want to represent very large numbers like 10 to the power 24 or 25, for example the mass of the earth, then we need around 80 bits or more.

And the issue of the range becomes even more prominent with the real numbers. So when we are dealing with the real numbers, we always come across the decimal point, or in general, the radix point. The digits on the left of the decimal point represent the integer part, and the digits on the right represent the fractional part. So to store such numbers in a binary format in the computers, some bits are reserved for the integer part and some bits are reserved for the fractional part. So let's say, once again, these real numbers are stored in a 10-bit format.
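To make these 10-bit ranges concrete, here is a small Python sketch (not from the video; the function name and the 10**24 example are just illustrations of the values discussed above) that computes the unsigned and signed ranges for a given bit width:

```python
import math

def integer_ranges(bits):
    """Return (unsigned_range, signed_range) for a given bit width.

    Unsigned:                 0 .. 2**bits - 1
    Signed (two's complement): -2**(bits-1) .. 2**(bits-1) - 1
    """
    unsigned = (0, 2 ** bits - 1)
    signed = (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return unsigned, signed

# For the 10-bit example from the video:
print(integer_ranges(10))                 # ((0, 1023), (-512, 511))

# Bits needed just to hold a magnitude around 10**24 (mass of the earth in kg):
print(math.ceil(math.log2(10 ** 24)))     # 80
```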
And out of the 10 bits, 6 bits are reserved for the integer part and 4 bits are reserved for the fractional part. Now, when we store these numbers in a binary format, there is no provision for storing this binary point explicitly. But here we have different sections for the integer as well as the fractional part, and accordingly, each bit will have its place value. So here, just after the 2 to the power 0 place, we will assume that there is a binary point. Now, out of the 10 bits, if we reserve 6 bits for the integer part, then for the unsigned numbers, we can represent any number between 0 and 63. And for the fractional part, the maximum value that we can represent is equal to 0.9375, and the minimum non-zero value is equal to 0.0625. That means in this 10-bit format, if we want to represent any real number, then the minimum non-zero number that we can represent is equal to 0.0625, while the maximum number is equal to 63.9375.

That means, in general, in this fixed point representation, the location of the radix point is fixed, and once we decide it, it will not change. So in a 10-bit fixed point representation, once we freeze this specific format, like 6 bits for the integer and 4 bits for the fraction, then we cannot represent any fractional part smaller than this 0.0625. For example, if we want to represent 22.0125 or 35.0025, then we cannot represent them exactly in this 10-bit fixed point representation. So if we want to represent such smaller fractions, then we need to assign more bits to the fractional part, like 5 or 6 bits. Of course, by doing so, we can certainly increase the precision, but now our range will get compromised. For example, now we have only 4 bits for the integer part, and in these 4 bits, we can represent any number between 0 and 15. That means in this fixed point representation, once the location of the radix point is fixed, our range and precision are also fixed.

But in the floating point representation, it is possible to change the location of this radix point, or the binary point, dynamically. For example, for a given number of bits, let's say 10 bits, if we want more range, then we can shift this binary point towards the right. Or, if some application requires more precision, then it is possible to shift the radix point towards the left. That means using the floating point representation, it is possible to represent very large numbers, like the distance between the planets or the mass of the planets, as well as very small numbers, like the mass of an atom. So this floating point representation provides both good range as well as good precision.

So now let us see how to represent these floating point numbers. The representation of a floating point number is very similar to how we represent decimal numbers in scientific notation. In scientific notation, the radix point, or the decimal point, is set in such a way that we have only one significant digit before the decimal point. For the integers, by default, the radix point or the decimal point is on the right-hand side of the least significant digit. So here, to represent this number in scientific notation, the decimal point is shifted to the left-hand side by 5 decimal places, and that is why here the exponent is equal to 5. So as you can see over here, we have only one significant digit before the decimal point.
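As an illustration of the 6-integer-bit, 4-fraction-bit format described above, here is a small Python sketch (my own example, not from the video) showing how values are quantized to steps of 0.0625 in such a fixed point format:

```python
def to_fixed_point(value, int_bits=6, frac_bits=4):
    """Quantize a non-negative real number into an unsigned fixed point code
    with int_bits integer bits and frac_bits fractional bits.

    The stored code is a plain integer; the binary point is implied,
    frac_bits positions from the right.
    """
    scale = 2 ** frac_bits                       # 16 for 4 fractional bits
    max_code = 2 ** (int_bits + frac_bits) - 1   # 1023 for a 10-bit format
    code = round(value * scale)
    if code < 0 or code > max_code:
        raise OverflowError("value out of range for this format")
    return code

def from_fixed_point(code, frac_bits=4):
    return code / 2 ** frac_bits

print(from_fixed_point(to_fixed_point(11.75)))    # 11.75   (exactly representable)
print(from_fixed_point(to_fixed_point(22.0125)))  # 22.0    (0.0125 is lost: the step is 0.0625)
print(from_fixed_point(to_fixed_point(63.9375)))  # 63.9375 (largest value in this format)
```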
But if the same number is represented like this, then that is not scientific notation. Because if you see over here, the digit before the decimal point is 0, and in scientific notation, it has to be non-zero. Similarly, if you take this number, then in scientific notation, this is how it can be represented. Here, for the scientific notation, the decimal point is shifted to the right by 3 decimal places, and that is why over here, in the exponential term, we have this 10 to the power minus 3. So in scientific notation, we have a total of 2 components, that is the significand and the exponent. Here, in the second representation, if you see the significand, then that is equal to 4.345, and similarly, the exponent is equal to minus 3. And here, of course, since we are representing decimal numbers, the base of the exponent is equal to 10. So in scientific notation, we are normalizing the numbers so that we have only one significant digit before the decimal point. And because of this normalization, it is possible to represent all the numbers in a uniform fashion. For example, if we take the case of this number, then the same number can also be represented like this. And of course, the value of the number will still remain the same. But as you can see, all these representations are different. So that is why it is good to have a uniform representation for each number. So in general, we can say that in scientific notation, this is how a decimal number is represented, where this D represents a decimal digit.

This floating point representation is very similar, and here, this B represents a binary digit. So if you see this floating point representation, then it consists of three parts, that is the sign, the fraction, and the exponent part. And here, the base of the exponent is equal to 2. So in this representation also, the binary numbers are first normalized into this format. In scientific notation, we have seen that we must have only one significant digit before the decimal point. Now, in the case of binary, we have only two digits, that is 1 and 0. And therefore, in binary, the only possible significant digit is 1. That means in this floating point representation, the significant digit just before the binary point will always remain 1. So we can say that this is the general representation for a floating point number.

So now let's see how to normalize any binary number and how to represent it in the floating point representation. Let's say this is our binary number and we want to represent it in the normalized form. So for that, we need to shift this binary point in such a way that just before the binary point, the significant digit is equal to 1. That means here, we need to shift the binary point to the left by 2 bits, and that is why over here, this exponent is equal to 2. That means whenever we shift the radix point to the left by one bit position, the exponent will increase by 1. So here, since the radix point is shifted to the left side by 2 bits, the exponent will increase by 2. Similarly, when the radix point is shifted to the right by one bit position, the exponent will decrease by 1. For example, if we have this number, then to represent it in a normalized form, we need to shift the binary point to the right side by 2 bits, and that is why here the exponent will decrease by 2. Or in other words, here this exponent is equal to minus 2.
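The exact binary numbers shown on screen are not in the transcript, so here is a minimal Python sketch of the same normalization idea with assumed example values: keep shifting the binary point until exactly one 1 sits before it, adjusting the exponent accordingly.

```python
def normalize_binary(value):
    """Normalize a positive number into the form 1.fff... * 2**exponent,
    i.e. shift the binary point until exactly one '1' is before it.

    Returns (significand, exponent) with 1.0 <= significand < 2.0.
    """
    significand, exponent = value, 0
    while significand >= 2.0:    # binary point moves left  -> exponent goes up
        significand /= 2.0
        exponent += 1
    while significand < 1.0:     # binary point moves right -> exponent goes down
        significand *= 2.0
        exponent -= 1
    return significand, exponent

# 101.11 in binary is 5.75 in decimal -> 1.0111 * 2**2
print(normalize_binary(5.75))     # (1.4375, 2)

# 0.0111 in binary is 0.4375 in decimal -> 1.11 * 2**-2
print(normalize_binary(0.4375))   # (1.75, -2)
```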
So these two representations are in the normalized form. So in this way, we can normalize any binary number and represent it in the floating point form. Now, let's see how this floating point number is actually stored in the memory. While storing, one bit is reserved for the sign bit. That means while this number is stored, the MSB will represent the sign bit. So if this bit is 0, then it means that the number is positive, and whenever this bit is equal to 1, then it indicates that the number is negative. After the sign bit, a few bits are reserved for storing the exponent value, and then the remaining bits are reserved for storing the fractional part. Now, if you see this significand, the integer part of the significand will always remain 1. And therefore, this 1 is not stored; instead, only the fractional part is stored. This fractional part is also referred to as the mantissa or the significand. That means while storing this floating point number, we have a total of 3 parts, that is the sign, the exponent, and the mantissa.

Now, to store this floating point number, a certain standard has been defined, like how many bits will be reserved for the exponent as well as the mantissa part, and similarly, how to store the mantissa and the exponent. Because this exponent part, if you see, can be positive or negative. That means we need to decide how to store this exponent part. So to store such numbers, a common standard has been defined, and one such commonly used standard is the IEEE 754. So in the next video, we will see the format of this IEEE standard, and we will understand how, as per this standard, the floating point numbers are stored. But I hope in this video, you understood the difference between the fixed point numbers and the floating point numbers, and how, using this floating point representation, it is possible to represent very large numbers or very small numbers with good precision. So if you have any question or suggestion, then do let me know here in the comment section below. If you like this video, hit the like button and subscribe to the channel for more such videos.
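As a small supplementary sketch of the sign, exponent, and mantissa split described above, here is one way to pull those three fields out of a 32-bit IEEE 754 single precision value in Python. The 1 | 8 | 23 bit widths and the exponent bias of 127 belong to the single precision format that the next video covers; they are stated here only so the example runs.

```python
import struct

def ieee754_fields(x):
    """Split a number, packed as a 32-bit IEEE 754 single, into its
    sign, exponent, and mantissa (fraction) bit fields.

    Layout: 1 sign bit | 8 exponent bits | 23 mantissa bits.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # raw 32-bit pattern
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF       # stored with a bias of 127
    mantissa = bits & 0x7FFFFF           # the leading 1 is implicit, not stored
    return sign, exponent, mantissa

# 11.75 = 1.01111 * 2**3, so the stored exponent field is 3 + 127 = 130
print(ieee754_fields(11.75))     # (0, 130, 3932160)

# -0.4375 = -1.11 * 2**-2, so the stored exponent field is -2 + 127 = 125
print(ieee754_fields(-0.4375))   # (1, 125, 6291456)
```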