Understanding the Representation of Real Numbers

Oct 30, 2024

Real Number Representation in Binary

Fixed Point vs. Floating Point

  • Fixed Point Representation:

    • Fixes the binary point's position within a register.
    • Limits the range and precision of real numbers.
  • Floating Point Representation:

    • The most common way to encode real numbers in binary.
    • Uses binary scientific notation:
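      • Mantissa (significand): Holds the significant digits and determines the precision.
      • Exponent: Scales the mantissa by a power of two and determines the range.
    • Uses a separate sign bit (sign-magnitude) for negative values, rather than two's complement.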
    • Historically, the format varied from one computer to another.
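
To make the contrast concrete, here is a minimal sketch in Python. The 8-fractional-bit fixed-point layout and the names `FRACTION_BITS`, `to_fixed`, and `from_fixed` are assumptions chosen for illustration; they do not come from these notes.

```python
import math

# Hypothetical fixed-point layout: 8 bits after the binary point.
FRACTION_BITS = 8
SCALE = 1 << FRACTION_BITS  # 2**8 = 256


def to_fixed(x: float) -> int:
    """Encode x as a scaled integer; the binary point never moves."""
    return round(x * SCALE)


def from_fixed(raw: int) -> float:
    """Decode a scaled integer back into a real value."""
    return raw / SCALE


raw = to_fixed(19.59375)
print(raw, from_fixed(raw))       # 5016 19.59375
print(from_fixed(1))              # smallest step: 0.00390625

# Floating point: the value is split into a mantissa and an exponent,
# so that value == mantissa * 2**exponent.
mantissa, exponent = math.frexp(19.59375)
print(mantissa, exponent)         # 0.6123046875 5
```

Note that `math.frexp` returns a mantissa in the interval [0.5, 1), which differs from the IEEE 754 convention of a mantissa in [1, 2); the idea of splitting a value into digits and a scale factor is the same.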

IEEE 754 Standard

  • Established in 1985 to standardize floating point representation.

  • 32-bit Single Precision:

    • Structure: 1 sign bit, 8-bit exponent, 23-bit mantissa.
    • Effective precision of 24 bits, because the leading 1 of a normalized mantissa is implicit and not stored.
  • 64-bit Double Precision:

    • Structure: 1 sign bit, 11-bit exponent, 52-bit mantissa.
    • Effective precision of 53 bits (52 stored bits plus an implicit leading 1).
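
The field layouts above can be inspected directly. The following is a small sketch using Python's standard `struct` module; the helper names `float32_fields` and `float64_fields` are hypothetical.

```python
import struct


def float32_fields(x: float) -> tuple:
    """Return (sign, exponent, mantissa) of x stored as a 32-bit single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, still biased
    mantissa = bits & 0x7FFFFF        # 23 bits, leading 1 not stored
    return sign, exponent, mantissa


def float64_fields(x: float) -> tuple:
    """Return (sign, exponent, mantissa) of x stored as a 64-bit double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, still biased
    mantissa = bits & ((1 << 52) - 1)     # 52 bits, leading 1 not stored
    return sign, exponent, mantissa


print(float32_fields(1.0))    # (0, 127, 0)
print(float64_fields(1.0))    # (0, 1023, 0)
print(float32_fields(-2.5))   # (1, 128, 2097152)
```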

Conversion Example: Converting to IEEE 754

  • Example: Convert 19.59375 to IEEE 754 Single Precision:
    1. Determine Sign Bit: Positive number, so sign bit = 0.
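    2. Convert to Binary:
      • Use repeated division by 2 for the whole part: 19 = 10011.
      • Use repeated multiplication by 2 for the fractional part: 0.59375 = 0.10011.
    3. Normalize Binary: Move the binary point until a single 1 remains to its left; the number of places moved is the exponent. 10011.10011 = 1.001110011 × 2^4.
    4. Calculate Biased Exponent: Add the bias of 127 to the exponent. 4 + 127 = 131 = 10000011.
    5. Drop Leading 1 from Mantissa: 001110011 remains.
    6. Pad the Mantissa with Trailing Zeros to 23 Bits: 00111001100000000000000.
      • Result: 0 10000011 00111001100000000000000 (hex 419CC000).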
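
The manual steps can be sketched in Python as follows. This is an illustration only: the function name `to_ieee754_single` is hypothetical, and the sketch assumes a magnitude of at least 1 that fits the mantissa exactly (extra bits are truncated, not rounded).

```python
import struct


def to_ieee754_single(value: float) -> str:
    """Follow the manual conversion steps; returns 'sign exponent mantissa'."""
    if abs(value) < 1:
        raise ValueError("this sketch only handles magnitudes >= 1")

    # Step 1: sign bit.
    sign = 0 if value >= 0 else 1
    value = abs(value)
    whole = int(value)
    frac = value - whole

    # Step 2a: whole part by repeated division by 2.
    whole_bits = ""
    while whole > 0:
        whole_bits = str(whole % 2) + whole_bits
        whole //= 2

    # Step 2b: fractional part by repeated multiplication by 2.
    frac_bits = ""
    while frac > 0 and len(frac_bits) < 32:
        frac *= 2
        bit = int(frac)
        frac_bits += str(bit)
        frac -= bit

    # Step 3: normalize; the point moves left past all but the first bit.
    exponent = len(whole_bits) - 1

    # Step 4: add the bias.
    biased = exponent + 127

    # Steps 5 and 6: drop the leading 1, then pad with zeros to 23 bits.
    mantissa = (whole_bits[1:] + frac_bits)[:23].ljust(23, "0")
    return f"{sign} {biased:08b} {mantissa}"


print(to_ieee754_single(19.59375))
# 0 10000011 00111001100000000000000

# Cross-check against the machine encoding.
print(struct.pack(">f", 19.59375).hex())   # 419cc000
```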

Biased Exponent Encoding

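  • Stores signed exponents as unsigned integers by adding a fixed bias, so both positive and negative exponents can be represented without a separate exponent sign.
  • Example: A 4-bit exponent field has 16 stored values (0 to 15); with a bias of 2^(4-1) - 1 = 7, these correspond to true exponents -7 to 8.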
  • Bias for 8-bit exponent (single precision): 127.
  • Bias for 11-bit exponent (double precision): 1023.
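
Both biases follow the IEEE 754 pattern 2^(k-1) - 1 for a k-bit exponent field. A short sketch, with hypothetical helper names:

```python
def exponent_bias(exponent_bits: int) -> int:
    """Bias for a k-bit exponent field: 2**(k-1) - 1."""
    return (1 << (exponent_bits - 1)) - 1


def encode_exponent(true_exponent: int, exponent_bits: int) -> int:
    """Stored (biased) exponent = true exponent + bias."""
    return true_exponent + exponent_bias(exponent_bits)


def decode_exponent(stored: int, exponent_bits: int) -> int:
    """True exponent = stored (biased) exponent - bias."""
    return stored - exponent_bias(exponent_bits)


print(exponent_bias(8))          # 127
print(exponent_bias(11))         # 1023
print(encode_exponent(4, 8))     # 131
print(decode_exponent(131, 8))   # 4
```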

Floating Point Arithmetic

  • Rounding:
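    • The mantissa of a result must fit into the available storage (23 bits for single precision, 52 for double), so any extra bits are rounded off.
    • By default, IEEE 754 rounds to the nearest representable value; a tie goes to the value whose last mantissa bit is 0 (round half to even).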
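
Rounding can be observed by forcing a Python float (a double) through single-precision storage. This is a minimal sketch; `round_to_float32` is a hypothetical helper, and it assumes the platform converts with the default round-to-nearest mode.

```python
import struct


def round_to_float32(x: float) -> float:
    """Store x as a 32-bit single and read it back as a double."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


# 2**24 + 1 needs 25 significant bits, one more than single precision has.
# It lies exactly halfway between two representable neighbours.
print(round_to_float32(16777217.0))   # 16777216.0

# 0.1 has no finite binary expansion, so single and double round differently.
print(round_to_float32(0.1) == 0.1)   # False
```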

Special Cases in IEEE 754

  • Infinity: Exponent all ones, mantissa all zeros; the sign bit selects +∞ or -∞.
  • NaN (Not a Number): Exponent all ones, mantissa not all zeros.
  • Zero: Exponent all zeros, mantissa all zeros; the sign bit gives +0 and -0.
  • Subnormal Numbers: Exponent all zeros, mantissa not all zeros; there is no implicit leading 1.
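
These four rules amount to a small decision table over the exponent and mantissa fields. A sketch for single precision, with hypothetical function names:

```python
import struct


def float32_bits(x: float) -> int:
    """Raw 32-bit pattern of x stored as an IEEE 754 single."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def classify_float32(bits: int) -> str:
    """Classify a 32-bit pattern using the exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0xFF:                       # exponent all ones
        return "infinity" if mantissa == 0 else "nan"
    if exponent == 0:                          # exponent all zeros
        return "zero" if mantissa == 0 else "subnormal"
    return "normal"


print(classify_float32(float32_bits(float("inf"))))   # infinity
print(classify_float32(float32_bits(float("nan"))))   # nan
print(classify_float32(float32_bits(-0.0)))           # zero
print(classify_float32(0x00000001))                   # subnormal
print(classify_float32(float32_bits(19.59375)))       # normal
```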

Summary

  • Floating point representation covers a much wider range than fixed point for the same number of bits, with precision that scales with the magnitude of the value.
  • IEEE 754 standardizes the format, simplifying calculations and comparisons.