Floating point numbers

edited by Mike Mozer

The representation of a floating point number is something like:

    [+/-] [Exponent] [normalized Mantissa]

For C++ float variables, the exponent is 8 bits and the mantissa 23 bits. For C++ double variables (IEEE 754), the exponent is 11 bits and the mantissa 52 bits.

The exponent is stored with an offset (bias) equal to half of the representable range minus 1. For a float, the 8 exponent bits cover a range of 256 values, so the offset is 127, and the true exponent is the stored value minus 127. So 0000 0000 in the exponent field would nominally be 2^-127, 0111 1111 (127) is 2^0, and 1000 0000 (128) is 2^1. (Note that the exponent is base 2!) In IEEE 754 the all-zero and all-one exponent patterns are actually reserved for special values such as zero, infinity, and NaN.

The mantissa is stored as a sum of negative powers of two, where each bit b_i is 0 or 1:

    b_0     b_1     b_2           b_n
    ---  +  ---  +  ---  + ... +  ---
    2^0     2^1     2^2           2^n

To encode .125 (1/8), the mantissa would thus be 0001 0000 000. However, the mantissa is normalized, which means that it is shifted to the left until the first "1" disappears (the leading 1 is implied rather than stored), making the stored mantissa for .125 "0000 0000 000". Of course, this shift is reflected in the exponent: rather than storing exponent 0, one stores exponent -3, making the stored exponent 127 - 3 = 124.

So, 5.01 would be built up in this form:

      1*2^2 + 0*2^1 + 1*2^0                = 4 + 0 + 1
    + 0*2^-1 + 0*2^-2 + 0*2^-3 + 0*2^-4 + 0*2^-5 + 0*2^-6
    + 1*2^-7                               = 0.0078125        [still missing 0.0021875]
    + 0*2^-8
    + 1*2^-9                               = 1.953125e-3      [still missing 0.000234375]
    + 0*2^-10 + 0*2^-11 + 0*2^-12
    + 1*2^-13                              = 1.220703125e-4
    + ....

This continues until the bits run out, because the number cannot be represented exactly with the 23 bits we have.

As the number is stored in powers of two, small whole numbers can be represented without loss of precision, as can fractions like 1/2, 3/4, 5/16. However, values that are natural in a base 10 system, such as .01, cannot be represented exactly with a finite number of bits, which leads to rounding errors.

The finite precision of floating point numbers leads to nonintuitive errors. For example, .51*100. may lead to a value slightly larger than 51, while .51*1000. may lead to a value slightly less than 510.
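
The sign, exponent, and mantissa fields described above can be inspected directly in C++. The following is only a minimal sketch, assuming that float is a 32-bit IEEE 754 value on the machine at hand; the names FloatFields, decompose, and unbiasedExponent are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper type: the three bit fields of a 32-bit float.
struct FloatFields {
    std::uint32_t sign;      // 1 bit: 0 = positive, 1 = negative
    std::uint32_t exponent;  // 8 bits, stored with an offset of 127
    std::uint32_t mantissa;  // 23 bits, the leading 1 is not stored
};

// Split a float into its raw bit fields.
// Assumption: float is an IEEE 754 single precision value.
inline FloatFields decompose(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    static_assert(std::numeric_limits<float>::is_iec559, "expects IEEE 754 floats");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // copy the bytes into an integer
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// True exponent = stored exponent minus the offset of 127.
// Only meaningful for normalized values, so the reserved patterns are rejected.
inline int unbiasedExponent(const FloatFields& p) {
    assert(p.exponent != 0 && p.exponent != 255);
    return static_cast<int>(p.exponent) - 127;
}
```

For instance, decompose(0.125f) gives sign 0, stored exponent 124, and an all-zero mantissa field, which matches the .125 example above. For 5.01f the mantissa field is nonzero, because the fractional part has to be cut off after 23 bits.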
When using floating point, it is a good rule to never check for equality, e.g., "if (f1 == f2)".
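
One common alternative to an equality check is to test whether two values are close enough. The sketch below is one possible way to do this, not the only one; the function name nearlyEqual and the default tolerances are assumptions chosen for illustration, and suitable tolerances depend on the computation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: treat a and b as equal when they differ by at most
// relTol times the larger magnitude, or by at most absTol (useful near zero).
// The default tolerances are illustrative assumptions, not recommended values.
inline bool nearlyEqual(double a, double b,
                        double relTol = 1e-9, double absTol = 1e-12) {
    assert(relTol >= 0.0 && absTol >= 0.0);
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(absTol, relTol * scale);
}
```

With this helper, one would write "if (nearlyEqual(f1, f2))" instead of "if (f1 == f2)". The relative tolerance scales with the size of the operands, while the absolute tolerance covers comparisons against values at or near zero, where a purely relative test would almost always fail.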