# CSCI 1300, Section 100: Floating point numbers

by Daniel von Dincklage
edited by Mike Mozer

```
The representation of a floating point number is something like:

[+/-] [Exponent] [normalized Mantissa]

On typical systems (IEEE 754), a C++ float has 1 sign bit, 8 exponent bits,
and 23 mantissa bits (32 bits total). A C++ double has 1 sign bit, 11 exponent
bits, and 52 mantissa bits (64 bits total).

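The exponent is stored with an offset (bias) of half the representable range
minus 1: for a float the 8 exponent bits cover a range of 256, so the bias is
127, and the actual exponent is the stored value minus 127. So 0111 1111 (127)
in the exponent field means 2^0, 1000 0000 (128) means 2^1, and 0000 0001 (1)
means 2^-126.  (Note that the exponent is base 2!)  The all-zeros and all-ones
exponent patterns are reserved for special values: zero and very small
(denormalized) numbers, and infinity and NaN.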

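The mantissa is stored as a sum of fractions with power-of-two denominators,
one term per bit, where each bit b is either 0 or 1:

 b0       b1       b2             bn
---   +  ---   +  ---  +   ... + ---
2^0      2^1      2^2            2^n

To encode .125 (1/8), the mantissa would thus be 0001 0000 000 (only the
1/2^3 bit is set).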

However, the mantissa is normalized, which means that it is shifted to the
left until the first "1" disappears (this leading 1 is implied rather than
stored), making the representation of .125 "0000 0000 000". Of course, this
shift is reflected in the exponent: .125 is 1.0 * 2^-3, so rather than
storing exponent 0, one would store exponent -3, making the stored
exponent 127 - 3 = 124.

So, 5.01 would be in this form:

  1*2^2 + 0*2^1 + 1*2^0      [= 4 + 1 = 5]
+ 0*2^-1 + 0*2^-2 + 0*2^-3 + 0*2^-4 + 0*2^-5 + 0*2^-6
+ 1*2^-7                     [= 0.0078125; here we are still missing 0.0021875]
+ 0*2^-8
+ 1*2^-9                     [= 1.953125e-3; we are still missing 0.000234375]
+ 0*2^-10 + 0*2^-11 + 0*2^-12
+ 1*2^-13                    [= 1.220703125e-4]
+ ....

This continues until the bits run out, because 5.01 cannot be represented
exactly with the 23 mantissa bits we have.

As the number is stored in powers of two, small whole numbers
can be represented without loss of precision, as can fractions like
1/2, 3/4, 5/16.  However, values that are natural in a base 10
system, such as .01, cannot be exactly represented with finite
bits, leading to rounding errors.

Finite precision of floating point numbers leads to nonintuitive
errors.  For example, .51*100. may lead to a value slightly
larger than 51, while .51*1000. may lead to a value slightly
less than 510.

When using floating point, it is a good rule to never
check for exact equality, e.g., "if (f1 == f2)". Instead, test whether
the two values differ by less than some small tolerance.
```
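
The sign / exponent / mantissa layout described above can be inspected directly. The following is only a minimal sketch: it assumes that `float` is a 32-bit IEEE 754 value (true on typical systems, but not guaranteed by C++), and the names `FloatFields`, `decompose`, and `unbiasedExponent` are made up for this illustration.

```cpp
#include <cstdint>
#include <cstring>

// The three fields of a float, ASSUMING a 32-bit IEEE 754 layout.
struct FloatFields {
    std::uint32_t sign;      // 1 bit: 0 for +, 1 for -
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t mantissa;  // 23 bits, without the implied leading 1
};

// Split a float into its raw fields (hypothetical helper).
FloatFields decompose(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy the raw bit pattern
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Actual power-of-two exponent: stored value minus the bias of 127.
int unbiasedExponent(const FloatFields& f) {
    return static_cast<int>(f.exponent) - 127;
}
```

For example, `decompose(0.125f)` gives a stored exponent of 124 and a mantissa of 0, matching the normalized form of .125 worked out above.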
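
The power-of-two expansion of 5.01 shown above can also be recovered from the stored bits. This is a rough sketch under the same assumption of a 32-bit IEEE 754 `float`. It only handles normalized values (not zero, denormalized numbers, infinity, or NaN), and `powersOfTwo` is a hypothetical helper name.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Returns every exponent e such that 2^e is part of the stored value,
// largest first. ASSUMES a normalized 32-bit IEEE 754 float.
std::vector<int> powersOfTwo(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy the raw bit pattern

    int exponent = static_cast<int>((bits >> 23) & 0xFFu) - 127;  // remove bias
    std::uint32_t mantissa = bits & 0x7FFFFFu;

    std::vector<int> powers{exponent};  // the implied leading 1
    for (int i = 1; i <= 23; ++i) {
        // Mantissa bit i (counted from the left) is worth 2^(exponent - i).
        if (mantissa & (1u << (23 - i))) {
            powers.push_back(exponent - i);
        }
    }
    return powers;
}
```

For `5.01f` the list starts with 2, 0, -7, -9, -13, which are the terms with a 1 coefficient in the expansion above. The list stops after 23 mantissa bits, which is why the stored value is only close to 5.01.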
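
One common alternative to `if (f1 == f2)` is to compare against a tolerance. The sketch below is one possible way to do this, not the only one. The function name `nearlyEqual` and the default tolerances are assumptions chosen for illustration, and suitable tolerances depend on the application.

```cpp
#include <algorithm>
#include <cmath>

// Returns true if a and b are "close enough" (hypothetical helper).
// relTol scales with the size of the inputs; absTol covers values near zero.
// Both defaults are illustrative assumptions, not recommended constants.
bool nearlyEqual(double a, double b,
                 double relTol = 1e-9, double absTol = 1e-12) {
    double diff = std::fabs(a - b);
    double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(absTol, relTol * scale);
}
```

With this helper, a check such as `nearlyEqual(.51 * 100., 51.)` succeeds even if the product is off from 51 by a rounding error, while clearly different values such as 1.0 and 1.001 still compare as unequal.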