IEEE 754

Free reference guide: IEEE 754

About IEEE 754

The IEEE 754 Reference is a searchable quick-reference for the IEEE 754-2008 floating-point arithmetic standard. It covers six practical categories — Single Precision, Double Precision, Special Values, Rounding, Comparison, and Conversion — giving software engineers, computer science students, numerical computing researchers, and embedded developers the exact bit-level details and code examples they need to understand, debug, and correctly implement floating-point arithmetic.

Programmers use this reference to look up the exact bit layout of float (1 sign + 8 exponent + 23 mantissa bits, bias 127) and double (1 sign + 11 exponent + 52 mantissa bits, bias 1023), the hex representations of canonical values like 1.0 (0x3F800000), the reason 0.1 cannot be represented exactly (0x3DCCCCCD is the closest float approximation), and JavaScript's MAX_SAFE_INTEGER limit (2^53 - 1 = 9007199254740991). The reference also covers machine epsilon, denormalized numbers for gradual underflow, and the distinction between quiet NaN and signaling NaN.

The Comparison and Conversion categories address practical programming pitfalls: why == comparisons fail for floating-point numbers and how to use epsilon-based or ULP-based comparisons instead, how to implement Kahan compensated summation to reduce accumulated floating-point error from O(n*eps) to O(eps), how to convert decimal numbers to IEEE 754 bit patterns step by step, and how to reinterpret float bits as integers using memcpy or struct.pack for type punning.

Key Features

  • Single precision (float, 32-bit): sign + exponent (bias 127) + mantissa layout, representable range, examples for 1.0, -0.5, and 0.1
  • Double precision (double, 64-bit): sign + exponent (bias 1023) + mantissa layout, range, machine epsilon (2^-52), MAX_SAFE_INTEGER
  • Special values: +0 and -0 (equal under ==, different under 1/x), +/-Infinity, NaN (quiet vs signaling), denormalized numbers
  • Five IEEE 754 rounding modes: round-to-nearest-even (banker's), toward +Infinity (ceil), toward -Infinity (floor), toward zero (trunc), ties-away
  • Floating-point comparison: epsilon-based equality, relative error comparison, totalOrder for NaN ordering, ULP gap calculation
  • Kahan compensated summation algorithm to reduce cumulative addition error from O(n*eps) to O(eps)
  • Decimal to IEEE 754 conversion: step-by-step sign, integer, fractional parts, normalization, biased exponent calculation
  • Type punning: memcpy-based bit reinterpretation in C, struct.pack/unpack in Python, hexadecimal float literals (0x1.8p1)
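As a sketch of the last bullet, Python's struct module can reinterpret a float's bytes as an integer — the equivalent of memcpy-based type punning in C — and float.fromhex parses hexadecimal float literals:

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret a 32-bit float's bytes as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

print(hex(float_to_bits(1.0)))    # 0x3f800000 — the canonical 1.0 pattern
print(float.fromhex("0x1.8p1"))   # 3.0 — hex literal: 1.5 * 2^1
```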

Frequently Asked Questions

Why does 0.1 + 0.2 not equal 0.3 in floating-point arithmetic?

0.1 and 0.2 cannot be represented exactly in binary floating-point. The nearest single-precision float to 0.1 is 0x3DCCCCCD (about 0.100000001490116); in double precision the error is smaller but still nonzero. When the two values are added, the accumulated rounding error makes the result differ from the nearest representable value of 0.3 — in doubles, 0.1 + 0.2 evaluates to 0.30000000000000004. This is a fundamental property of binary floating-point, not a bug. To compare for equality, use Math.abs(a - b) < epsilon instead of a === b.
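The same behavior appears in any language that uses IEEE 754 doubles; a quick Python sketch:

```python
a = 0.1 + 0.2
print(a)          # 0.30000000000000004
print(a == 0.3)   # False

# epsilon-based equality instead of ==
epsilon = 1e-9
print(abs(a - 0.3) < epsilon)   # True
```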

What is the difference between +0 and -0 in IEEE 754?

+0 and -0 are distinct bit patterns: +0 has all bits zero, while -0 has the sign bit set and all other bits zero. However, +0 === -0 evaluates to true in comparisons. The difference only becomes observable in division: 1 / +0 = +Infinity and 1 / -0 = -Infinity, and in functions like Math.atan2 that are defined differently for the two zeros. Most code never needs to distinguish them, but numerical analysis libraries sometimes rely on the signed-zero behavior.
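A small Python sketch of signed zero — note that Python raises ZeroDivisionError on float division by zero rather than returning Infinity, so the sign bit is observed here through math.copysign and math.atan2 instead:

```python
import math

pos, neg = 0.0, -0.0
print(pos == neg)                  # True: == cannot distinguish them
print(math.copysign(1.0, neg))     # -1.0: the sign bit is still there
print(math.atan2(0.0, -1.0))       # 3.141592653589793 (+pi)
print(math.atan2(-0.0, -1.0))      # -3.141592653589793 (-pi)
```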

What is NaN and why does NaN != NaN?

NaN (Not a Number) is the result of undefined floating-point operations like 0/0, sqrt(-1), or Infinity - Infinity. In IEEE 754, NaN is defined to be unordered relative to every value including itself — so NaN < 1, NaN > 1, and NaN == NaN are all false. This is intentional: it forces explicit checks using isnan() in C or Number.isNaN() in JavaScript. There are two kinds: quiet NaN (qNaN) propagates silently through calculations, while signaling NaN (sNaN) triggers a floating-point exception.
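The unordered behavior and the explicit check, sketched in Python:

```python
import math

nan = float("nan")            # a quiet NaN
print(nan == nan)             # False: NaN is unordered even against itself
print(nan < 1.0, nan > 1.0)   # False False
print(math.isnan(nan))        # True: the explicit check
```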

What is machine epsilon and why does it matter?

Machine epsilon is the gap between 1.0 and the next larger representable number. For float it is 2^-23 (about 1.19e-7) and for double it is 2^-52 (about 2.22e-16). It defines the relative precision of the format — any value computed with relative error smaller than epsilon is indistinguishable from the true result. Use it as the tolerance in relative epsilon comparisons: if abs(a - b) <= epsilon * max(abs(a), abs(b)), the values are "equal within floating-point precision".
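A Python sketch of both properties — the gap at 1.0 and a relative comparison (the 8x tolerance factor is an illustrative choice, not part of the standard):

```python
import sys

eps = sys.float_info.epsilon    # 2**-52 for double
print(eps)                      # 2.220446049250313e-16
print(1.0 + eps == 1.0)         # False: eps is the gap at 1.0
print(1.0 + eps / 2 == 1.0)     # True: rounds back down to 1.0

def approx_equal(a, b, tol=8 * eps):
    # relative comparison scaled by the larger magnitude
    return abs(a - b) <= tol * max(abs(a), abs(b))

print(approx_equal(0.1 + 0.2, 0.3))   # True
```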

What are the five IEEE 754 rounding modes?

Round-to-nearest-even (the default, also called banker's rounding) rounds to the nearest representable value and, on an exact midpoint, picks the neighbor whose least significant bit is even — minimizing statistical bias. Round toward +Infinity (ceiling) always rounds up. Round toward -Infinity (floor) always rounds down. Round toward zero (truncation) discards the fractional part. Round-to-nearest-ties-away, added in IEEE 754-2008, rounds midpoints away from zero, matching the familiar school rounding rule.
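Python's decimal module exposes rounding modes that mirror this set; a sketch of how each mode treats an exact midpoint (decimal, not binary, but the tie-breaking behavior is the same):

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_CEILING,
                     ROUND_FLOOR, ROUND_DOWN, ROUND_HALF_UP)

x = Decimal("2.5")   # an exact midpoint between 2 and 3
for name, mode in [("nearest-even", ROUND_HALF_EVEN),
                   ("toward +inf ", ROUND_CEILING),
                   ("toward -inf ", ROUND_FLOOR),
                   ("toward zero ", ROUND_DOWN),
                   ("ties-away   ", ROUND_HALF_UP)]:
    print(name, x.quantize(Decimal("1"), rounding=mode))
# nearest-even 2, toward +inf 3, toward -inf 2, toward zero 2, ties-away 3
```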

What is the ULP and how do I use it for floating-point comparison?

ULP (Unit in the Last Place) is the gap between a floating-point number and the next representable number. For 1.0 as a float, ULP = 2^-23 ≈ 1.19e-7. For large values the ULP is proportionally larger: ULP(1e6) as float ≈ 0.0625, meaning the next float after 1000000.0 is 1000000.0625. ULP-based comparison checks if two floats are within N ULPs of each other by comparing their integer bit representations — more robust than epsilon-based comparison for values of vastly different magnitudes.
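A Python sketch of the bit-comparison trick (math.ulp requires Python 3.9+; the sign-magnitude adjustment needed for negative values is omitted here):

```python
import math
import struct

print(math.ulp(1.0))   # 2.220446049250313e-16 — the double ULP at 1.0

def ulp_diff(a, b):
    """Representable doubles between a and b (assumes both positive)."""
    ia = struct.unpack(">q", struct.pack(">d", a))[0]
    ib = struct.unpack(">q", struct.pack(">d", b))[0]
    return abs(ia - ib)

print(ulp_diff(0.1 + 0.2, 0.3))   # 1: they are adjacent doubles
```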

How does Kahan compensated summation reduce floating-point error?

Naive summation of N numbers accumulates rounding error proportional to N * machine_epsilon * sum (O(n*eps)). Kahan summation keeps a running "compensation" variable c that tracks the lost low-order bits from each addition. Before each addition, the next term is corrected by subtracting c. The compensation captures the rounding error of the previous step and adds it back to the next term. The result has error O(eps) regardless of N — essential when summing large arrays of floating-point values.
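A minimal Python sketch of the algorithm described above:

```python
def kahan_sum(values):
    s = 0.0   # running sum
    c = 0.0   # compensation: the low-order bits lost so far
    for x in values:
        y = x - c         # apply the correction from the previous step
        t = s + y         # big + small: low-order bits of y can be lost
        c = (t - s) - y   # recover exactly what was lost
        s = t
    return s

vals = [0.1] * 10
print(sum(vals))         # 0.9999999999999999 (naive)
print(kahan_sum(vals))   # compensated: prints 1.0 on typical IEEE doubles
```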

How do I convert a decimal number to its IEEE 754 float bit representation?

For a number like -6.75: (1) determine the sign bit (1 for negative). (2) Convert the magnitude to binary: 6 = 110(2), 0.75 = 0.11(2), so 6.75 = 110.11(2). (3) Normalize to 1.M form: 1.1011 x 2^2. (4) Compute biased exponent: 2 + 127 = 129 = 10000001(2). (5) Mantissa bits are 10110...0 (23 bits). (6) Assemble: 1 10000001 10110000000000000000000 = 0xC0D80000.
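The worked result can be checked with Python's struct module, a sketch of the type-punning approach mentioned earlier:

```python
import struct

bits = struct.unpack(">I", struct.pack(">f", -6.75))[0]
print(hex(bits))   # 0xc0d80000 — matches the hand conversion

sign     = bits >> 31            # 1   (negative)
exponent = (bits >> 23) & 0xFF   # 129 = 2 + bias 127
mantissa = bits & 0x7FFFFF       # 0b10110000000000000000000
print(sign, exponent, bin(mantissa))
```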