Course pages 2012–13

# Floating-Point Computation

**Principal lecturer:** Dr David Greaves**Taken by:** Part IA CST, Part IA NST, Part I PPS**Past exam questions****Information for supervisors** (contact lecturer for access permission)

No. of lectures: 6

Suggested hours of supervisions: 2

This course is useful for the Part II courses Advanced Graphics and Digital Signal Processing.

## Aims

This course has two aims: firstly to provide an introduction to (IEEE) floating-point data representation and arithmetic; and secondly to show, how naïve implementations of obvious mathematics can go badly wrong. An overall implicit aim is to encourage caution when using any floating-point value produced by a computer program.

## Lectures

**Integer and floating-point representation and arithmetic.**Signed and unsigned integers and fixed-point; arithmetic, saturating arithmetic. IEEE 754/854 floating point (32 and 64 bit); zeros, infinities, NaN. Brief mention of IEEE 754r. What numbers are exactly representable in bases 2 and 10. Accuracy in terms of significant figures. Floating point arithmetic is non-associative, and mathematical equivalences fail. Nonsensical results, e.g.`sin(1e40)`, counting in floating point.**IEEE Floating-point arithmetic.**Floating point arithmetic, and the IEEE requirements. Why the IEEE standard has endured. Overflow, underflow, progressive loss of significance. Rounding modes. Difficulty in obtaining IEEE-quality in libraries. The java.lang.Math trigonometric library promises.**How floating-point computations diverge from real-number calculations.**Absolute Error, Relative Error, Machine epsilon, Unit in Last Place (ulp). Finite computation: solving a quadratic. Summing a finite series. Rounding (round-off) and truncation (discretisation) error. Numerical differentiation; determining a good step size.**Iteration and when to stop.**Unbounded computation may produce unbounded errors. Solving equations by iteration and comparison to terminate it. Newton’s method. Idea of order of convergence. Why summing a Taylor series is problematic (loss of all precision, range reduction, non-examinable hint at economisation).**Ill-conditioned or chaotic problems.**Effect of changes of a few ulp in the inputs. Conditioning number when amenable to mathematical analysis; Monte-Carlo exploration when not.**Other approaches and their problems**Adaptive methods. Arbitrary precision floating point, adaptive floating point, interval arithmetic. Discussion on the problems of exact real arithmetic. Remark on the x86 implementations of IEEE arithmetic, and compiler “optimisations”.

## Objectives

At the end of the course students should

- be able to convert simple decimal numbers to and from IEEE floating-point format, and to perform IEEE arithmetic on them;
- be able to identify problems with floating-point implementations of simple mathematical problems;
- know when a problem is likely to yield incorrect solutions no matter how it is processed numerically;
- know to use a professionally-written package whenever possible (and still to treat claims of accuracy with suspicion).

## Recommended reading

Overton, M.L. (2001). *Numerical computing with IEEE floating point arithmetic*. SIAM.

Further reading - goes far beyond the course

Goldberg, D. (1991). *What every computer scientist should know about floating-point arithmetic*. ACM Computing Surveys, vol. 23, pp. 5-48.