CPSC/MATH 402 - Investigating Floating Point Representation and Arithmetic

Log into the Linux system to do this exercise. This handout and some programs are on-line -- go to http://cs.roanoke.edu/Spring2012/CPSC402A and follow the link.

  1. In section 1.2 we saw that the maximum relative error in representing a number x by fl(x) in a base 10 system that uses k-digit chopping is 10^(-k+1) (page 22) and in a base 10 system that uses k-digit rounding is 0.5 * 10^(-k+1) (page 22, proved in problem #24 on page 31). The proof of these facts generalizes to any base b that uses k base b digits: the maximum roundoff error is b^(-k+1) if chopping is used to represent a real number and (1/2) * b^(-k+1) if rounding is used. This number is often called machine precision or machine epsilon. The IEEE Standard Floating Point number system, the system used by the lab computers, is base 2 with 53 significant bits (52 stored fraction bits plus an implicit leading 1) for double precision (the default in languages such as Java), and 24 significant bits (23 stored) for single precision.
    1. Calculate machine precision for each of the following:
      Single Precision (Chop)  ______________________________________
      Single Precision (Round) ______________________________________
      Double Precision (Chop)  ______________________________________
      Double Precision (Round) ______________________________________
    2. Based on your answers how many significant decimal digits (base 10) are there for each of the above?

    3. The file Precision.java is a Java program that determines approximations of machine precision (machine epsilon) for type float and type double using two different techniques. One is by computing the quantity
          | 3 * (4/3 - 1) - 1 |
      and the other is the alternate characterization of machine epsilon as the smallest positive number epsilon such that fl(1 + epsilon) > 1.
      Save the program to your directory (you can do this with the File/Save As feature in the browser).
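      The provided Precision.java is the real program; purely as an illustration of what the two techniques look like for type double, a minimal sketch might be (the class and variable names here are assumptions, not the ones in the provided file):

```java
// Illustrative sketch only -- the lab's Precision.java is the real program.
public class EpsilonSketch {
    public static void main(String[] args) {
        // Technique 1: |3 * (4/3 - 1) - 1|.  In base 2 the value 4/3 has an
        // infinite expansion, so the rounding error it carries surfaces as
        // machine epsilon.
        double estimate1 = Math.abs(3.0 * (4.0 / 3.0 - 1.0) - 1.0);

        // Technique 2: halve eps while fl(1 + eps/2) > 1; the surviving eps
        // is the smallest power of 2 with fl(1 + eps) > 1.
        double eps = 1.0;
        while (1.0 + eps / 2.0 > 1.0) {
            eps /= 2.0;
        }

        System.out.println("4/3 estimate:     " + estimate1);
        System.out.println("halving estimate: " + eps);
    }
}
```

      Compare both printed values with your hand calculations from part 1.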

    4. Now compile and run the program.
    5. Compare the output of the program to your calculations.

    6. Try the 4/3 approximation above on your calculator. What does it give as an approximation of machine precision? What does that say about the number of significant digits stored in your calculator?

    7. Explain why the approximation (using 4/3) works.

  2. The program Increment.java illustrates the hazards of testing for equality of two floating point numbers. Save it to your directory and read the documentation and the code to see what it does.
    1. Compile and run the program with the following input: (a) n = 8; (b) n = 64; (c) n = 1024; (d) n = 10. What happened and why was the behavior different for different input?

    2. Change the loop control to total > epsilon so the program stops when the value is within machine precision of 0. (NOTE: epsilon is already computed -- it is an approximation to machine epsilon). Rerun the program for the input above plus other values. Does it work correctly?
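      If it helps to see the shape of the change, here is a hedged outline of such a loop; the actual structure and the names total, n, and epsilon in Increment.java may differ, so treat this only as a sketch:

```java
// Outline of the modified loop; the real program reads n as input, and the
// names total, n, and epsilon are assumptions about Increment.java's code.
public class IncrementSketch {
    public static void main(String[] args) {
        double n = 10.0;
        double epsilon = Math.ulp(1.0);   // stand-in for the computed epsilon
        double total = 1.0;
        int steps = 0;
        while (total > epsilon) {         // was (in effect): total != 0.0
            total -= 1.0 / n;
            steps++;
        }
        System.out.println("stopped after " + steps + " steps, total = " + total);
    }
}
```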

  3. Smallest Positive IEEE Floating Point Number
    1. Recall that the exponent range for IEEE double precision is from -1022 to +1023 (an 11 bit exponent) and for single precision is from -126 to +127 (an 8 bit exponent). (Double precision is discussed on pages 18 & 19.) In theory what is the smallest positive IEEE floating point number in:
      Single Precision       __________________________________________
      Double Precision       __________________________________________
    2. The program UFL.java (underflow level) computes the smallest positive power of 2 in both single precision (using type float) and double precision (using type double) by starting at 1 and dividing by 2 until 0 is reached (remember that when underflow is reached the computer treats the number as 0). The program also prints the constants MIN_VALUE in the Double and Float wrapper classes. Save the program, compile, and run it.
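      The core of that halving loop, shown here for type double only, might look roughly like this (UFL.java itself also handles float and prints the MIN_VALUE constants):

```java
// Sketch of the halving loop for double; UFL.java does this for float too.
public class UFLSketch {
    public static void main(String[] args) {
        double x = 1.0;
        double smallest = x;
        while (x > 0.0) {     // once x underflows, the computer treats it as 0
            smallest = x;
            x /= 2.0;
        }
        System.out.println("smallest positive power of 2: " + smallest);
        System.out.println("Double.MIN_VALUE:             " + Double.MIN_VALUE);
    }
}
```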

    3. How do the results of the program compare to the theoretical results? There is an explanation for the discrepancy - you'll examine it in homework.



  4. Largest Positive IEEE Floating Point Number
    1. In theory, what is the largest positive IEEE floating point number in:
      Single Precision      _____________________________________________________
      Double Precision      _____________________________________________________
    2. The program OFL.java (Overflow Level) just prints out the constants MAX_VALUE in the Double and Float wrapper classes. Save, compile and run the program. Compare the answers to your calculations.

    3. Add a loop to the program after the values currently printed to compute and print double precision powers of 2 from 2^0 to 2^1040 (print both the exponent and the power of 2). Compile and run the program. What happens when overflow occurs?
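      One possible shape for the added loop (the exponent bound 1040 comes from the instructions above; adapt it to fit where OFL.java's printing ends):

```java
// Possible shape of the added loop; prints i and 2^i for i = 0..1040.
public class OFLLoopSketch {
    public static void main(String[] args) {
        double power = 1.0;               // 2^0
        for (int i = 0; i <= 1040; i++) {
            System.out.println(i + "  " + power);
            power *= 2.0;                 // overflow eventually produces Infinity
        }
    }
}
```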

    4. Did overflow occur at the power of 2 expected? Explain.



  5. The program Poly.java computes values of the polynomial p(x) = (x - 1)^6 in two different ways for values of x near 1 (specifically, it starts at x = 0.995 and goes in increments of 0.0001 up to 1.005).
    1. Save the program, compile it and run it. Study the output. Notice the difference in answers for the two different calculations. Also notice the computed value of (x - 1)^6 for x = 1.

    2. The problem at x = 1 comes from accumulated roundoff in computing the values of x: each new x is computed from the preceding x, so roundoff error builds up. An alternate approach to generating n equally spaced points between numbers a and b is to compute the increment h = (b - a)/n; the points are then x = a + i*h for i = 0, 1, 2, ..., n. Modify the code in Poly.java to compute the values of x in this way. Compile and run the program and compare the results.
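      A minimal sketch of that approach, using the a, b, n, h names from the description above (the loop body stands in for Poly.java's actual evaluation code):

```java
// Generate n+1 equally spaced points as x = a + i*h instead of accumulating.
public class SpacingSketch {
    public static void main(String[] args) {
        double a = 0.995, b = 1.005;
        int n = 100;                      // 0.0001 increments over [a, b]
        double h = (b - a) / n;
        for (int i = 0; i <= n; i++) {
            double x = a + i * h;         // computed fresh; error does not build up
            System.out.println(x);        // ... evaluate the polynomial at x here ...
        }
    }
}
```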

    3. The expanded form of the polynomial did not fare so well in the above calculations. In general, a polynomial should not be evaluated in expanded form but not all polynomials have nice factored forms. The recommended method of evaluation (in the absence of a nice factorization) is a nested evaluation. Add code to the program to evaluate the polynomial using nested evaluation (assign to poly3) and print out the result (as a fourth column).
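      For reference, nested (Horner) evaluation of the expanded form x^6 - 6x^5 + 15x^4 - 20x^3 + 15x^2 - 6x + 1 looks like this; the method name poly3 follows the assignment's naming, but how it slots into Poly.java is up to you:

```java
// Nested (Horner) evaluation of (x - 1)^6 in expanded form:
// x^6 - 6x^5 + 15x^4 - 20x^3 + 15x^2 - 6x + 1
public class HornerSketch {
    static double poly3(double x) {
        return (((((x - 6) * x + 15) * x - 20) * x + 15) * x - 6) * x + 1;
    }

    public static void main(String[] args) {
        System.out.println(poly3(0.995));  // compare against the other two columns
    }
}
```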

    4. Compare the results of the three methods of calculating the value of the polynomial. NOTE: One way to get a printout of output to examine is to create a logfile (or script - the file will contain everything that appears on your xterm from the time you start the script until you stop it). To start a logfile type:
         script filename
      at the command prompt (the filename is optional). If you type script with no filename, a file named typescript is created; if you supply a filename, the output goes to the file you specify.

      Run your program - the output going to your screen is also going to the logfile. Type exit to close the script.

  6. Printing. Recall that the command to print a file is
         enscript -2rG filename
    You can use this to print output or your programs.

HOMEWORK DUE Monday, February 13, 2012:

  1. Floating Point Representation and Computer Arithmetic
    1. Complete the lab!
    2. Would the approximation of machine epsilon using 4/3 work in a machine that used base 3? Explain your answer.
    3. Explain why the other technique also gives machine precision.
    4. The UFL.java program produced smaller values of the underflow level than theory predicts. That is because the IEEE Floating Point System uses a concept called subnormals or gradual underflow. Subnormals are non-normalized numbers: they have the smallest exponent but the mantissa is not normalized. For example, in a base 10 floating point system with 3 digits and a smallest exponent of -1, the smallest number in theory is 0.100 x 10^-1. But with subnormals, numbers such as 0.010 x 10^-1 (which equals 10^-2), 0.058 x 10^-1, and 0.001 x 10^-1 (= 10^-3) can be represented. Calculate the smallest single precision and smallest double precision positive floating point numbers in the IEEE Floating Point System with subnormals. (Show how you are doing this!) Do your results agree with the program's results?
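      Java exposes both boundaries directly, which gives a quick sanity check on your hand calculation (MIN_NORMAL is the smallest normalized positive value, MIN_VALUE the smallest subnormal):

```java
// Smallest normalized vs. smallest subnormal values in both precisions.
public class SubnormalBounds {
    public static void main(String[] args) {
        System.out.println("float  MIN_NORMAL: " + Float.MIN_NORMAL);
        System.out.println("float  MIN_VALUE:  " + Float.MIN_VALUE);
        System.out.println("double MIN_NORMAL: " + Double.MIN_NORMAL);
        System.out.println("double MIN_VALUE:  " + Double.MIN_VALUE);
    }
}
```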
    5. Why did the input 10 cause problems in the Increment.java program but 8, 64, and 1024 worked as expected?
    6. Discuss the behavior for the three different methods for computing (x - 1)^6 for x near 1. Compare the behavior of the methods for x really close to 1 versus a bit further away. That is, for which values of x is each method most accurate versus for which is each least accurate.

  2. Exploring Root Finding
    1. Complete the Root Finding Lab.
    2. For each of the functions in Newton1 and Newton2 find an interesting starting point and use the graphs to draw the approximation lines and the first 3 iterates to show what is going on (a different graph for each starting point).
    3. Similarly find interesting pairs of starting points for the Secant method for each of the two functions and draw the approximation lines and the first 3 iterates to show what is going on.
    4. The asymptotic error constant (see the definition of order of convergence on page 79) for Newton's method on a function with a zero of order m is 1 - 1/m. So if m = 3 this would be 2/3. Describe what this means in terms of the error at each iteration AND examine the output from the Newton3 program to see if that is the behavior you observe (give concrete examples - you can print out the iterates if you wish).
    5. Do problem #10 on page 86. (You should have implemented this technique in the lab and you should have already done the problem in homework!)
    6. You should have already implemented the False Position Method (though some of you didn't think you did it right, so try again). Modify that program AND write a program to do the Bisection Method, then use the programs (plus programs to do Newton's method and the Secant method) to do problems #17 & #18 on page 76. Recall that Math.tan is the tangent function in Java and Math.PI is pi.
    7. Do #19 on page 76
    8. Do #26 on page 77. Set up the equation and use a program to solve it.

  3. Hand in: Copies of your lab handouts, your work on problems not in the handout, a copy of your program that implements the bisection algorithm and a copy of your program that implements the False Position method and a logfile (script) showing the output.