CPSC/MATH 402 - Investigating Floating Point
Representation and Arithmetic
Log into the Linux system to do this exercise.
This handout and some programs are on-line --
go to http://cs.roanoke.edu/Spring2012/CPSC402A and
follow the link.
- In section 1.2 we saw that the maximum relative error in representing
a number x by fl(x) in a base 10 system that uses k-digit chopping
is 10-k+1 (page 22) and in a base 10 system that uses
k-digit rounding is 0.5 * 10-k+1 (page 22, proved in
problem #24 on page 31). The proof of these facts generalizes
to any base b that uses k base b digits to get the maximum roundoff error of
b-k+1 if chopping is used to represent a
real number and 1/2 * b-k+1 if rounding is used. This number
is often called machine precision or machine epsilon.
The IEEE Standard Floating Point number system, the system used by
the lab computers, is 52-bit base 2 for double precision (the default in
languages such as Java), and
23-bit base 2 for single precision.
- Calculate machine precision for each of the following:
Single Precision (Chop) ______________________________________
Single Precision (Round) ______________________________________
Double Precision (Chop) ______________________________________
Double Precision (Round) ______________________________________
- Based on your answers how many significant decimal digits
(base 10) are there for each of the above?
- The file Precision.java
is a Java program that
determines approximations of machine precision
(machine epsilon) for type float
and type double using two different techniques.
One is by computing the quantity
| 3 * (4/3 - 1) - 1 |
and the other is the alternate characterization of
machine epsilon as the smallest positive
number epsilon such that fl(1 + epsilon) > 1.
Save the program to your directory (you can do this with
the File/Save As feature in the browser).
- Now compile and run the program.
- Compare the output of the program to your calculations.
- Try the 4/3 approximation above on your calculator. What
does it give as an approximation of machine precision? What does that
say about the number of significant digits stored in your calculator?
- Explain why the approximation (using 4/3) works.
- The program Increment.java illustrates the
hazards of testing for equality of two floating point numbers. Save
it to your directory and read the documentation and the code
to see what it does.
- Compile and run the program with the following input: (a) n = 8;
(b) n = 64; (c) n = 1024; (d) n = 10. What happened and why was the behavior
different for different input?
- Change the loop control to total > epsilon so the program
stops when the value is within machine precision of 0. (NOTE: epsilon
is already computed -- it is an approximation to machine epsilon).
Rerun the program for the input above plus other values. Does it
work correctly?
- Smallest Postive IEEE Floating Point Number
- Recall that the exponent range for IEEE double precision is from -1022
to +1023 (an 11 bit exponent) and for single precision is from -126 to 128.
(Double precision is discussed on pages 18 & 19.)
In theory what is the smallest positive IEEE floating point
number in:
Single Precision __________________________________________
Double Precision __________________________________________
- The program UFL.java (underflow level)
computes the smallest
positive power of 2 in both single precision (using type float) and
double precision (using type double) by starting at 1 and dividing
by 2 until 0 is reached (remember that when underflow is reached the
computer treats the number as 0). The program also prints the constants
MIN_VALUE in the Double and Float wrapper classes.
Save the program, compile, and run it.
- How do the results of the program compare to the theoretical
results? There is an explanation for the discrepancy - you'll
examine it in homework.
- Largest Positive IEEE Floating Point Number
- In theory, what is the largest positive IEEE floating point number
in:
Single Precision _____________________________________________________
Double Precision _____________________________________________________
- The program OFL.java (Overflow Level)
just prints out the constants MAX_VALUE in the Double and Float
wrapper classes. Save, compile and run the program. Compare the answers
to your calculations.
- Add a loop to the program after the values currently printed to
compute and print double precision powers of 2 from 2^0 to 2^1040
(print both the exponent and the power of 2).
Compile and run the program. What happens when overflow occurs?
- Did overflow occur at the power of 2 expected? Explain.
- The program Poly.java computes values of
the polynomial p(x) = (x - 1)^6 in two different ways for values of x
near 1 (specifically it starts at x = 0.995 and goes in increments of
0.0001 up to 1.005.
-
Save the program, compile it and run it. Study the output.
Notice the difference in answers for the two different
calculations. Also notice the computed value of (x - 1)^6 for x = 1.
- The problem at x = 1 is from accumulated roundoff in computing the
value of x - each new x is computed from the preceding x so roundoff
error builds up. An alternate approach to generating n equally
spaced points between numbers a and b is to compute the increment
h = (b - a)/n, then the points are x = a + i*h for i = 0, 1, 2, ..., n.
Modify the code in Poly.java to compute the values of x in this
way. Compile and run the program and compare the results.
- The expanded form of the polynomial did not fare so well in the
above calculations. In general, a polynomial should not be evaluated
in expanded form but not all polynomials have nice factored forms.
The recommended method of evaluation (in the absence of a nice
factorization) is a nested evaluation.
Add code to the program to evaluate the polynomial using
nested evaluation (assign
to poly3) and print out the result (as a fourth column).
- Compare the results of the three methods of calculating the
value of the polynomial. NOTE: One way to get a printout of
output to examine is to create a logfile (or script - the file will
contain everything that appears on your xterm from the time
you start the script until you stop it). To start
a logfile type:
script
or
script filename
at the command prompt. In the first case a script file named
typescript is created; in the second the name of the file
is whatever you specify.
Run your program - the output going to your screen is also going to
the logfile. Type exit to close the script.
- Printing Recall that the command to print a file is
enscript -2rG filename
You can use this to print output or your programs.
HOMEWORK DUE Monday, February 13, 2012:
- Floating Point Representation and Computer Arithmetic
- Complete the lab!
- Would the approximation of machine epsilon using 4/3 work in a
machine that used base 3? Explain your answer.
- Explain why the other technique also gives machine precision.
- The UFL.java program produced smaller values of the underflow
level than theory predicts. That is because the IEEE Floating Point
System uses a concept called subnormals or gradual underflow.
Subnormals are non-normalized numbers - they have the smallest
exponent but the mantissa is not normalized. For example in the
base 10 floating point system with 3-digits and a smallest
exponent of -1 the smallest number in theory
is 0.100 x 10^-1. But with subnormals numbers such as 0.010 x 10^-1
(which equals 10^-2), 0.058 x 10^-1, and 0.001 x 10^-1 (= 10^-3)
can be represented.
Calculate the smallest single precision and smallest double
precision positive floating point numbers in IEEE Floating Point
System with subnormals. (Show how you are doing this!)
Do your results agree with the program's results?
- Why did the input 10 cause problems in the Increment.java
program but 8, 64, and 1024 worked as expected?
- Discuss the behavior for the three
different methods for computing (x - 1)^6 for x near 1.
Compare the behavior of the
methods for x really close to 1 versus a bit further away. That is,
for which values of x is each method most accurate versus for which
is each least accurate.
- Exploring Root Finding
- Complete the Root Finding Lab.
- For each of the functions in Newton1 and Newton2 find an
interesting starting point and use the graphs to draw the
the approximation lines and the first
3 iterates to show what is going on (a different graph for each
starting point).
- Similarly find interesting pairs of starting points for
the Secant method for each of the two functions and draw
the approximation lines and the first 3 iterates to
show what is going on.
- The asymptotic error constant (see the definition of order of
convergence on page 79) for Newton's method on a function
with a zero of order m is 1 - 1/m. So if m = 3 this would
be 2/3. Describe what this means in terms of the error at
each iteration AND examine the output from the Newton3 program
to see if that is the behavior you observe (give concrete
examples - you can print out the iterates if you wish).
- Do problem #10 on page 86. (You should have implemented
this technique in the lab and you should have already done
the problem in homework!)
- You should have already implemented the False Position Method
(though some of you didn't think you did it right so try again).
Modify that program AND write a program to do the Bisection
Method then use the programs (plus programs to do
Newton's method and the Secant method) to do problems #17 & #18
on page 76). Recall that Math.tan is the tangent function in
Java and Math.PI is pi.
- Do #19 on page 76
- Do #26 on page 77. Set up the equation and use a program
to solve it.
- Hand in:
Copies of your lab handouts, your work on problems not
in the handout, a copy of your program that implements
the bisection algorithm and a copy of your program that
implements the False Position method
and a logfile (script) showing the output.