Floating Point Representation of Rational Numbers

We have already seen how computers represent both positive and negative numbers using binary. In order to represent real numbers (which include fractions, or rational numbers) computers often use floating point representation. Floating point means that the decimal point (or binary point, in the case of binary numbers) can "float", or be moved anywhere in relation to the significant digits of a number. For decimal numbers this is analogous to scientific notation. For example, the number 1234.567, which is in fixed point notation, would be written in scientific notation as 1.234567 * 10^3.

An 8-digit base 10 representation of the number 1234.567 = 1.234567 * 10^3 would be 31234567. The 8-digit base 10 representation can be broken up into two parts. The first digit is the exponent, in this case 3. The following 7 digits are the significant digits, or mantissa. Notice that this representation assumes that the decimal point for the mantissa is after the first digit.
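The encoding just described can be sketched in code. The helper names below are illustrative, not part of any standard library; they pack a number into the one-digit-exponent, seven-digit-mantissa format above, assuming the value is at least 1 and below 10^10.

```python
# Sketch of the 8-digit base-10 format described above: one exponent
# digit followed by seven mantissa digits, with the decimal point
# assumed after the first mantissa digit. Function names are illustrative.

def encode(value):
    """Encode a number in [1, 10^10) as one exponent digit + 7 mantissa digits."""
    exponent = 0
    while value >= 10:               # normalize so the mantissa is d.dddddd
        value /= 10
        exponent += 1
    mantissa = round(value * 10**6)  # keep 7 significant digits
    return f"{exponent}{mantissa:07d}"

def decode(digits):
    """Recover the numeric value from the 8-digit string."""
    exponent = int(digits[0])
    mantissa = int(digits[1:]) / 10**6  # decimal point after the first digit
    return mantissa * 10**exponent

print(encode(1234.567))    # "31234567"
print(decode("31234567"))  # approximately 1234.567
```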

The advantage of representing numbers using floating point is that it is possible to represent a much larger range of numbers. Fixed point notation, when limited to a certain number of digits, must always assume that the decimal point is in the same place. For example, if an 8-digit base 10 fixed point representation assumes that there are four digits to the left and four digits to the right of the decimal point, then the largest number that could be represented is 99999999, i.e. 9999.9999. With the floating point representation used above, the largest number that could be represented is 99999999, i.e. 9.999999 * 10^9 = 9,999,999,000.

A problem with floating point representation is that not every number up to the maximum can be represented. For example, 9,999,999,000 - 1 = 9,999,998,999 = 9.999998999 * 10^9. But the mantissa in this case has more than 7 digits, so it would have to be rounded to 9.999999 * 10^9 in order to be stored in an 8-digit base 10 floating point number. This leads to the peculiar behavior where 9,999,999,000 - 1 = 9,999,999,000.
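Python's decimal module can mimic this behavior by limiting its context to 7 significant digits, which reproduces the rounding described above:

```python
from decimal import Decimal, getcontext

# Limit the context to 7 significant digits to stand in for the
# 7-digit mantissa of the 8-digit base-10 format described above.
getcontext().prec = 7

big = Decimal("9999999000")
print(big - 1)         # 9.999999E+9 -- the subtracted 1 is lost to rounding
print(big - 1 == big)  # True
```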

Floating point arithmetic is similar to scientific notation arithmetic. For example, in order to add 1.234567 * 10^3 and 7.654321 * 10^2, the operand with the smaller exponent is rewritten with the same exponent as the other operand and the mantissas are added. So, 1.234567 * 10^3 + 7.654321 * 10^2 is equivalent to 1.234567 * 10^3 + 0.7654321 * 10^3, which is:

	 1.2345670 * 10^3
	+0.7654321 * 10^3
	 ----------------
	 1.9999991 * 10^3
Notice that the mantissa of the answer is 8 digits, one more than the 7-digit mantissa allows, so the actual sum, 1999.9991, would be rounded to 1999.999. In this case the answer is off by just one ten-thousandth. But there are other cases where stranger behavior can occur. For example, when two numbers of very different size are added:
	 1.234567 * 10^2   =>    0.0000001234567 * 10^9
	+7.654321 * 10^9   =>   +7.6543210000000 * 10^9
	 --------------          ------------------
	                         7.6543211234567 * 10^9
When 7.6543211234567 * 10^9 is rounded to account for the 7-digit mantissa, it becomes 7.654321 * 10^9, which is one of the operands.
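Both additions can be reproduced with Python's decimal module, again limiting the context to 7 significant digits; 1.234567 * 10^2 is used as the small operand so that the result rounds back exactly to the large one:

```python
from decimal import Decimal, getcontext

getcontext().prec = 7  # mimic the 7-digit mantissa

# First sum: the exact result 1999.9991 has 8 significant digits,
# so it is rounded to 1999.999.
print(Decimal("1.234567E3") + Decimal("7.654321E2"))  # 1999.999

# Adding a small number to a much larger one: the small operand
# disappears entirely in the rounding.
small = Decimal("1.234567E2")
huge = Decimal("7.654321E9")
print(small + huge == huge)  # True
```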

Floating point multiplication is also very similar to scientific notation multiplication: mantissas are multiplied while exponents are added. For example:

	 1.234567 * 10^3
	*7.654321 * 10^3
	 ----------------
	 9.449772114007 * 10^6
Which, when rounded to account for the 7-digit mantissa, is 9.449772 * 10^6.
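The same 7-significant-digit decimal context reproduces the multiplication example:

```python
from decimal import Decimal, getcontext

getcontext().prec = 7  # mimic the 7-digit mantissa

# The exact product is 9449772.114007; with a 7-digit mantissa it
# is rounded to 9449772.
product = Decimal("1.234567E3") * Decimal("7.654321E3")
print(product)  # 9449772
```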

Rational numbers are represented in binary in the same way as decimal numbers. For example, the 8-digit fixed point decimal number 0003.7500 is equivalent to 3*10^0 + 7*10^-1 + 5*10^-2 = 3.0 + 0.7 + 0.05. This same number in 8-bit fixed point binary would be 0011.1100, which is equivalent to 2^1 + 2^0 + 2^-1 + 2^-2 = 3.75. This same number in 8-bit binary floating point notation would be 00111110. The first three digits are the exponent and the last five are the mantissa. So, in binary scientific notation it is equivalent to 1.1110 * 2^1 = 1.875 * 2 = 3.75.
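Decoding this toy 8-bit format (3 exponent bits, then 5 mantissa bits with the binary point after the first mantissa bit) can be sketched as follows; the function name is illustrative:

```python
# Decode the 8-bit floating point layout described above:
# 3 exponent bits followed by 5 mantissa bits, with the binary
# point assumed after the first mantissa bit.

def decode_8bit(bits):
    exponent = bits >> 5        # top 3 bits
    mantissa = bits & 0b11111   # bottom 5 bits
    return (mantissa / 2**4) * 2**exponent  # binary point after first bit

print(decode_8bit(0b00111110))  # 3.75
```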

Final note: because rational numbers in binary are sums of powers of 2, only numbers whose fractional part can be written with a denominator that is a power of two can be represented exactly. For example, 3.75 = 2^1 + 2^0 + 2^-1 + 2^-2, a finite sum of powers of 2. But 0.1 = 2^-4 + 2^-5 + 2^-8 + 2^-9 + ..., an infinite sum that cannot be represented with a finite number of bits, so it would be rounded. This is why, when you are writing code and enter a literal double value, it will often print as a long sequence of digits that are close to the number you entered, but not exactly the same.
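This rounding is easy to observe in any language with IEEE doubles; for example, in Python:

```python
# 0.1 has no finite binary representation, so the stored double is
# only the nearest representable value.
print(f"{0.1:.20f}")     # 0.10000000000000000555
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004
```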

Many computers use the IEEE 754 floating point format to represent floating point numbers. Numbers are converted to binary scientific notation with the following numbers of bits used to represent the sign, the exponent, and the mantissa: single precision uses 32 bits (1 sign bit, 8 exponent bits, 23 mantissa bits), and double precision uses 64 bits (1 sign bit, 11 exponent bits, 52 mantissa bits).
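As a sketch of the double precision layout (1 sign bit, 11 exponent bits biased by 1023, and 52 mantissa bits with an implicit leading 1), the raw bits of a double can be unpacked to show the three fields:

```python
import struct

# Pack 3.75 as a big-endian IEEE 754 double and pull the fields apart.
bits = int.from_bytes(struct.pack(">d", 3.75), "big")
sign = bits >> 63
exponent = ((bits >> 52) & 0x7FF) - 1023  # remove the 1023 bias
mantissa = bits & ((1 << 52) - 1)         # 52 fraction bits

print(sign)                                  # 0
print(exponent)                              # 1
print(1 + mantissa / 2**52)                  # 1.875, i.e. binary 1.111
print((1 + mantissa / 2**52) * 2**exponent)  # 3.75
```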

Practice Exercises

1. Convert each of the following numbers from fixed point to scientific notation.
  1. 0.843
  2. 345.099
  3. 45386
2. Convert each of the following numbers from fixed point to 8-digit base 10 (where the exponent is one digit and the mantissa is 7 digits).
  1. 0.843
  2. 345.099
  3. 45386
3. Compute the following:
  1. 3.439 * 10^3 + 1.48278 * 10^2
  2. 8.762 * 10^7 * 2.08 * 10^2
4. Convert each of the following from 8-bit fixed point binary to decimal.
  1. 0100.0010
  2. 0101.0101
5. Convert each of the following from 8-bit fixed point binary to binary scientific notation.
  1. 0100.0010
  2. 0101.0101
6. Convert each of the following from 8-bit fixed point binary to 8-bit floating point binary (where the exponent is 3 bits and the mantissa is 5 bits).
  1. 0100.0010
  2. 0101.0101
7. Use the information about the number of bits used in IEEE Double Precision to compute the following.
  1. The range of numbers that can be represented (that is the range of the exponents) in double precision. Compare your answer to the value given for type double in Java in the chart on page 46.
  2. The number of significant digits in double precision (how many digits the mantissa can be). Compare to the number for type double.