Computer Laboratory

Course pages 2012–13

Floating-Point Computation

Principal lecturer: Dr David Greaves
Taken by: Part IA CST, Part IA NST, Part I PPS
Past exam questions
Information for supervisors (contact lecturer for access permission)

No. of lectures: 6
Suggested hours of supervisions: 2
This course is useful for the Part II courses Advanced Graphics and Digital Signal Processing.


This course has two aims: firstly to provide an introduction to (IEEE) floating-point data representation and arithmetic; and secondly to show how naïve implementations of obvious mathematics can go badly wrong. An overall implicit aim is to encourage caution when using any floating-point value produced by a computer program.


  • Integer and floating-point representation and arithmetic. Signed and unsigned integers and fixed-point; arithmetic, saturating arithmetic. IEEE 754/854 floating point (32 and 64 bit); zeros, infinities, NaN. Brief mention of IEEE 754r. What numbers are exactly representable in bases 2 and 10. Accuracy in terms of significant figures. Floating-point arithmetic is non-associative, and mathematical equivalences fail. Nonsensical results, e.g. sin(1e40); counting in floating point.
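The non-associativity and inexact representation mentioned above can be seen directly; a minimal Python sketch with values chosen purely for illustration:

```python
# Sketch (Python, illustrative values): associativity fails, and some
# decimal fractions have no exact binary representation.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)        # 1.0
print(a + (b + c))        # 0.0: the 1.0 is absorbed, since b + c rounds to -1e16
print(0.1 + 0.2 == 0.3)   # False: none of these values is exact in base 2
```

The second sum loses the 1.0 entirely because the gap between adjacent doubles near 1e16 is 2, so the same three numbers give different answers depending on bracketing.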

  • IEEE floating-point arithmetic. Floating-point arithmetic and the IEEE requirements. Why the IEEE standard has endured. Overflow, underflow, progressive loss of significance. Rounding modes. The difficulty of achieving IEEE-quality accuracy in libraries. The accuracy promises made by the java.lang.Math trigonometric functions.
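The overflow, gradual underflow and NaN behaviour listed above can be sketched briefly in Python (illustrative values, not course material):

```python
import math

# Sketch (Python, illustrative values): IEEE 754 doubles at the extremes.
x = 1e308 * 10
print(x)                    # inf: overflow saturates at infinity
print(x - x)                # nan: inf - inf has no meaningful value

y = 1e-308                  # already below the smallest normal (~2.2e-308)
while y > 0.0:              # underflow is gradual: y walks down the subnormals
    last = y
    y /= 2.0
print(last)                 # 5e-324, the smallest positive subnormal
print(float('nan') == float('nan'))   # False: NaN compares unequal to itself
```

The loop demonstrates progressive loss of significance: each halving in the subnormal range discards a bit of precision before the value finally underflows to zero.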

  • How floating-point computations diverge from real-number calculations. Absolute error, relative error, machine epsilon, unit in the last place (ulp). Finite computation: solving a quadratic. Summing a finite series. Rounding (round-off) and truncation (discretisation) error. Numerical differentiation; determining a good step size.
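The quadratic example can be sketched as follows; the coefficients are made up for illustration and are not taken from the course:

```python
import math
import sys

# Sketch (Python, made-up coefficients): catastrophic cancellation when
# solving x^2 - 1e8 x + 1 = 0 with the textbook formula, and the usual fix.
a, b, c = 1.0, -1e8, 1.0
d = math.sqrt(b*b - 4.0*a*c)
x1 = (-b + d) / (2.0*a)         # safe: -b and d have the same sign
x2_naive = (-b - d) / (2.0*a)   # -b and d cancel almost completely
x2_stable = c / (a * x1)        # from the identity x1 * x2 = c / a
print(x1)                       # ~1e8
print(x2_naive, x2_stable)      # the naive root is wrong in its 1st digit
print(sys.float_info.epsilon)   # machine epsilon for doubles, ~2.22e-16
```

The small root is about 1e-8; the naive formula returns roughly 7.45e-9 because the subtraction leaves only rounding noise, while the rearranged formula is accurate to full precision.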

  • Iteration and when to stop. Unbounded computation may produce unbounded errors. Solving equations by iteration, and comparing successive iterates to decide when to terminate. Newton’s method. Idea of order of convergence. Why summing a Taylor series is problematic (loss of all precision, range reduction, non-examinable hint at economisation).
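Newton's method with a relative-tolerance stopping test might be sketched like this (the function, tolerance and iteration cap are illustrative choices, not prescribed by the course):

```python
# Sketch (Python): Newton's method for sqrt(2) as a root of f(x) = x^2 - 2,
# stopping when successive iterates agree to a relative tolerance rather
# than demanding exact equality, and never iterating unboundedly.
def newton_sqrt2(x=1.0, tol=1e-15, max_iter=50):
    for _ in range(max_iter):
        x_new = x - (x*x - 2.0) / (2.0*x)     # x - f(x) / f'(x)
        if abs(x_new - x) <= tol * abs(x_new):
            return x_new                      # iterates have stopped moving
        x = x_new
    return x                                  # give up after max_iter steps

print(newton_sqrt2())   # 1.4142135623730951
```

Quadratic convergence roughly doubles the number of correct digits per step, so full double precision is reached in about six iterations from x = 1.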

  • Ill-conditioned or chaotic problems. Effect of changes of a few ulp in the inputs. Conditioning number when amenable to mathematical analysis; Monte-Carlo exploration when not.
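A Monte-Carlo probe of conditioning can be sketched as below (requires Python 3.9+ for math.nextafter; the inputs a, b and the helper jiggle are invented for illustration):

```python
import math
import random

# Sketch (Python 3.9+): probing conditioning empirically on a nearly
# cancelling subtraction. Inputs and helper names are illustrative only.
a, b = 1.0 + 1e-13, 1.0
exact = a - b                              # tiny difference of near-equal inputs

# Deterministic probe: move a by one ulp and watch the answer shift.
nudged = math.nextafter(a, math.inf) - b
print(abs(nudged - exact) / exact)         # ~2e-3: a 1-ulp input change moves
                                           # the 3rd significant figure

# Monte-Carlo probe: random few-ulp perturbations of both inputs.
def jiggle(x, steps=3):
    for _ in range(steps):
        x = math.nextafter(x, random.choice((-math.inf, math.inf)))
    return x

results = [jiggle(a) - jiggle(b) for _ in range(1000)]
print(min(results), max(results))          # the spread exposes the ill-conditioning
```

When no analytic condition number is available, this kind of input-perturbation experiment gives an empirical estimate of how sensitive the output is.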

  • Other approaches and their problems. Adaptive methods. Arbitrary-precision floating point, adaptive floating point, interval arithmetic. Discussion of the problems of exact real arithmetic. Remark on the x86 implementations of IEEE arithmetic, and compiler “optimisations”.
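Two standard-library facilities in Python can illustrate the arbitrary-precision and exact-arithmetic alternatives mentioned above (a sketch, not part of the course material):

```python
from decimal import Decimal, getcontext
from fractions import Fraction

# Sketch (Python standard library): alternatives to binary doubles.
getcontext().prec = 50                    # 50 significant decimal digits
print(Decimal(1) / Decimal(7))            # arbitrary-precision decimal result

# Exact rational arithmetic has no representation error at all:
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True
print(0.1 + 0.2 == 0.3)                   # False with binary doubles
```

Both come at a cost: precision must still be chosen in advance for Decimal, and Fraction denominators can grow without bound, which is one face of the exact-real-arithmetic problem discussed in the lecture.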


At the end of the course students should

  • be able to convert simple decimal numbers to and from IEEE floating-point format, and to perform IEEE arithmetic on them;

  • be able to identify problems with floating-point implementations of simple mathematical problems;

  • know when a problem is likely to yield incorrect solutions no matter how it is processed numerically;

  • know to use a professionally written package whenever possible (and still to treat claims of accuracy with suspicion).
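The first objective, conversion between decimal and IEEE format, is intended as a pencil-and-paper skill; a Python sketch using struct shows what the target bit pattern looks like (the value -0.15625 is an arbitrary example):

```python
import struct

# Sketch (Python): the machine's view of the IEEE 754 double format.
# Example value: -0.15625 = -1.01_2 x 2^-3.
def double_bits(x):
    (n,) = struct.unpack('>Q', struct.pack('>d', x))
    return f'{n:064b}'                    # 1 sign + 11 exponent + 52 fraction bits

bits = double_bits(-0.15625)
sign, exponent, mantissa = bits[0], bits[1:12], bits[12:]
print(sign)                     # '1' (negative)
print(int(exponent, 2))         # 1020 = 1023 - 3 (biased exponent)
print(mantissa)                 # '01' then 50 zeros (fraction, hidden bit dropped)
```

Reading the fields back: sign 1, unbiased exponent 1020 - 1023 = -3, significand 1.01 in binary, which reassembles to -1.25 x 2^-3 = -0.15625.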

Recommended reading

Overton, M.L. (2001). Numerical computing with IEEE floating point arithmetic. SIAM.

Further reading (goes far beyond the course)

Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, vol. 23, pp. 5-48.