Understanding IEEE Floating Point Standards

Sep 15, 2024

IEEE Standard for Floating Point Numbers

Introduction

  • Discussion about IEEE standard for floating point numbers
  • Previous video: normalization of binary numbers, floating point representation, memory storage.
  • Floating point numbers have:
    • 1 bit for sign
    • A few bits for the exponent
    • The remaining bits for the mantissa
  • IEEE 754 standard: defines storage of floating point numbers.

IEEE 754 Formats

  • Half Precision: 16 bits
  • Single Precision: 32 bits
  • Double Precision: 64 bits
  • Quad Precision: 128 bits
  • Octuple Precision: 256 bits
  • Commonly used formats: Single and Double Precision

Single Precision Format

  • 32-bit Structure:
    • 1 bit: Sign
    • 8 bits: Exponent
    • 23 bits: Mantissa
  • Normalization:
    • The binary number is normalized to the form 1.f × 2^e. The leading 1 is implicit, so only the fractional part f is stored in the mantissa.
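
As a rough illustration of the 1 + 8 + 23 layout above, the following Python sketch splits a value into its three fields. The helper name float32_fields is made up for this example; the only library used is the standard struct module, which rounds the value to single precision when packing it.

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, stored_exponent, mantissa) for x rounded to single precision."""
    # Pack as a big-endian 32-bit float, then reread the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, still biased
    mantissa = bits & 0x7FFFFF       # 23 bits, fractional part only
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(1.5)
print(sign, exponent, f"{mantissa:023b}")
# 0 127 10000000000000000000000
```

Here 1.5 is 1.1 in binary times 2^0, so the stored fraction is a single 1 followed by zeros, and the leading 1 before the binary point does not appear anywhere in the 32 bits.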

Exponent Storage

  • 8-bit Exponent: the field holds an unsigned integer from 0 to 255
  • Exponent can be positive or negative:
    • Stored using bias representation
    • Bias = 2^(n-1) - 1 (for n = number of bits)
    • Example: 8 bits, bias = 127
  • Range after subtracting bias: -127 to +128
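  • Bias representation keeps the stored exponent unsigned: it increases steadily from the most negative true exponent to the most positive one, with no separate sign bit for the exponent.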
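
The bias formula can be checked with a few lines of Python. This is only a sketch; the function names bias and stored_exponent are chosen for illustration.

```python
def bias(n_bits: int) -> int:
    """Bias for an exponent field of n_bits bits: 2^(n-1) - 1."""
    return 2 ** (n_bits - 1) - 1

def stored_exponent(true_exponent: int, n_bits: int = 8) -> int:
    """Value written into the exponent field for a given true exponent."""
    return true_exponent + bias(n_bits)

print(bias(8))                        # 127
print(0 - bias(8), 255 - bias(8))     # -127 128
print(stored_exponent(3))             # 130
print(stored_exponent(-3))            # 124
```

Note that a negative true exponent such as -3 still maps to a non-negative stored value (124), and a larger true exponent always gives a larger stored value.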

Special Exponent Values

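  • The all-zeros and all-ones exponent patterns (stored values 0 and 255) are reserved for special purposes.
  • Available range for normalized numbers: -126 to +127 (stored values 1 to 254)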
  • Continuity and ease of comparison via bias representation.
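
A short sketch of how the available range follows from dropping the two reserved patterns. The function name normal_exponent_range is hypothetical.

```python
def normal_exponent_range(n_bits: int = 8) -> tuple[int, int]:
    """True exponent range once all-zeros and all-ones are set aside."""
    bias = 2 ** (n_bits - 1) - 1
    lowest_stored = 1                  # 0 (all zeros) is reserved
    highest_stored = 2 ** n_bits - 2   # 2^n - 1 (all ones) is reserved
    return lowest_stored - bias, highest_stored - bias

print(normal_exponent_range(8))   # (-126, 127)
```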

Comparing Floating Point Numbers

  • Steps:
    1. Compare sign bits
    2. Compare exponents
    3. Compare mantissas
  • Bias representation aids comparison: a larger stored exponent always means a larger magnitude, so the exponent fields can be compared as plain unsigned integers.
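
The three steps can be sketched in Python as follows. This is an illustrative sketch, not how hardware does it: the names float32_bits and compare_float32 are made up, only finite inputs are assumed, and two details not listed in the steps are handled explicitly (+0 and -0 compare equal, and for two negative numbers the one with the larger magnitude is the smaller value).

```python
import struct

def float32_bits(x: float) -> int:
    """Bit pattern of x after rounding to single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def compare_float32(a: float, b: float) -> int:
    """Return -1, 0, or 1 for a < b, a == b, a > b (finite inputs only)."""
    bits_a, bits_b = float32_bits(a), float32_bits(b)
    sign_a, sign_b = bits_a >> 31, bits_b >> 31
    # Exponent and mantissa fields together are the low 31 bits.
    mag_a, mag_b = bits_a & 0x7FFFFFFF, bits_b & 0x7FFFFFFF
    if mag_a == 0 and mag_b == 0:
        return 0                       # +0 and -0 are equal
    # Step 1: compare sign bits (sign 1 means negative).
    if sign_a != sign_b:
        return -1 if sign_a else 1
    # Steps 2 and 3: compare exponents first, then mantissas.
    key_a = (mag_a >> 23, mag_a & 0x7FFFFF)
    key_b = (mag_b >> 23, mag_b & 0x7FFFFF)
    if key_a == key_b:
        return 0
    a_has_larger_magnitude = key_a > key_b
    if sign_a:                         # both negative: order is reversed
        return -1 if a_has_larger_magnitude else 1
    return 1 if a_has_larger_magnitude else -1

print(compare_float32(1.5, 2.25))     # -1
print(compare_float32(-1.5, -2.25))   # 1
```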

Example Calculations

  • Converting IEEE 754 format to decimal:
    • Use the sign bit, the exponent value (after subtracting the bias), and the mantissa
    • Restore the implicit leading 1 to get 1.f, shift the binary point by the true exponent to get the plain binary number, then convert that to decimal
  • Example conversions illustrate the process.
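
A minimal decoding sketch for single precision, assuming a normalized number. The function name decode_float32 and the sample bit pattern are chosen for this illustration, and the reserved exponent values are deliberately rejected rather than interpreted.

```python
def decode_float32(bits: int) -> float:
    """Decode a 32-bit pattern holding a normalized single precision number."""
    sign = bits >> 31
    stored_exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if stored_exponent in (0, 255):
        raise ValueError("reserved exponent pattern, not handled in this sketch")
    true_exponent = stored_exponent - 127    # subtract the bias
    significand = 1 + mantissa / 2**23       # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** true_exponent

# 1 10000001 01100000000000000000000
# sign = 1, stored exponent = 129 (true exponent 2), fraction = .011
# value = -(1.011 in binary) x 2^2 = -(101.1 in binary) = -5.5
print(decode_float32(0xC0B00000))   # -5.5
```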

Decimal to IEEE 754 Format

  • Steps: convert the decimal number to binary, normalize it to 1.f × 2^e, add the bias to the exponent, and store the sign, biased exponent, and fraction.
  • Example: Converting 12.625 into IEEE format.
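
Worked by hand, 12.625 is 1100.101 in binary, which normalizes to 1.100101 × 2^3. The stored exponent is 3 + 127 = 130 (10000010) and the stored fraction is 100101 followed by zeros. The sketch below follows the same steps. The name encode_float32 is hypothetical, and the code makes simplifying assumptions: it only accepts nonzero finite values in the normalized range, and it truncates extra fraction bits instead of rounding as real hardware would.

```python
import math

def encode_float32(value: float) -> int:
    """Encode a nonzero finite value as a 32-bit pattern (truncating sketch)."""
    if value == 0 or not math.isfinite(value):
        raise ValueError("zero, infinity and NaN need the reserved exponents")
    sign = 1 if value < 0 else 0
    magnitude = abs(value)
    exponent = 0
    # Normalize: scale by powers of two until 1 <= magnitude < 2.
    while magnitude >= 2:
        magnitude /= 2
        exponent += 1
    while magnitude < 1:
        magnitude *= 2
        exponent -= 1
    if not -126 <= exponent <= 127:
        raise ValueError("exponent outside the normalized range")
    fraction = int((magnitude - 1) * 2**23)   # drop the leading 1, keep 23 bits
    return (sign << 31) | ((exponent + 127) << 23) | fraction

bits = encode_float32(12.625)
print(f"{bits:032b}")    # 01000001010010100000000000000000
print(f"{bits:#010x}")   # 0x414a0000
```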

Range and Precision

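  • Largest and smallest normalized numbers in single precision format:
    • Max exponent: 127, giving a largest value of about 3.4 × 10^38
    • Min exponent: -126, giving a smallest positive normalized value of about 1.18 × 10^-38
  • Precision is limited by the stored mantissa bits: 23 stored bits plus the implicit leading 1 give 24 significant bits, or about 7 decimal digits.
  • Fixed vs floating point representation: for the same number of bits, floating point covers a far greater range, but the gap between neighboring values grows with the exponent, so large numbers are stored less precisely.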
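
The limits above can be reproduced with a short sketch. The helper to_float32 is made up for this example; it rounds a Python float (which is double precision) to single precision by packing and unpacking it with the standard struct module.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest single precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Largest normalized value: all 23 fraction bits set, true exponent 127.
largest = (2 - 2**-23) * 2.0**127
# Smallest positive normalized value: fraction 0, true exponent -126.
smallest_normal = 2.0**-126
print(largest)           # about 3.4e38
print(smallest_normal)   # about 1.18e-38

# With 24 significant bits, integers above 2^24 can no longer all be stored.
print(to_float32(16777216.0))   # 16777216.0
print(to_float32(16777217.0))   # 16777216.0
```

The last line shows the precision limit: 16777217 is 2^24 + 1, which needs 25 significant bits, so it is rounded to a neighboring representable value.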

Double Precision Format

  • 64-bit Structure:
    • 1 bit: Sign
    • 11 bits: Exponent
    • 52 bits: Mantissa
  • Bias: 1023
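  • Greater range and precision than single precision: normalized exponents run from -1022 to +1023, and 52 stored bits plus the implicit leading 1 give 53 significant bits, or about 15 to 16 decimal digits.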
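
The same field extraction works for the 64-bit layout, with wider masks and a bias of 1023. As before, this is only a sketch and the name float64_fields is hypothetical.

```python
import struct

def float64_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, stored_exponent, mantissa) for a double precision value."""
    # Pack as a big-endian 64-bit float, then reread the 8 bytes as an unsigned int.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, still biased
    mantissa = bits & ((1 << 52) - 1)  # 52 bits, fractional part only
    return sign, exponent, mantissa

sign, exponent, mantissa = float64_fields(12.625)
print(sign, exponent, exponent - 1023, hex(mantissa))
# 0 1026 3 0x9400000000000
```

For 12.625 the true exponent is 3 in both formats; only the bias (1023 instead of 127) and the number of trailing zero bits in the fraction differ.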

Conclusion

  • The IEEE 754 standard defines a fixed structure (sign, biased exponent, mantissa) for storing floating point numbers.
  • Double precision offers greater precision than single.
  • Special cases for all-zero and all-one exponents to be explored in future content.