University of Cambridge, Computer Laboratory

Numerical Methods Learners' Guide

The course's official syllabus is on the web page NumMethods.

This document expands each syllabus line to a few sentences and recaps the course narrative.

2017/18: This guide is being updated just before or after each lecture.

Overview

This course is a general introduction to the mechanics of numerical computation. We look at some commonly-used algorithms and understand their performance in terms of run time and accuracy. The central part is an introduction to traditional numerical analysis, which is mostly about the growth of worst-case error under various conditions. The early part is a very deep dive into IEEE floating-point representation, so that the sources of rounding error and the true cost of operations are properly understood. The final section, concerning FDTD simulation in scalar and matrix form, introduces further useful material that builds on linear simultaneous equation solving to solve integral equations numerically.

Demos

Link: DEMO FOLDER COMMENTARY.

Contents

Topics not covered in 2017/18

Topic(s) that are not examinable this year include:

The additional commentaries for each demo program will be here soon.


Part 1 - What You Know

We use the term 'complete underflow' to refer to the situation where a small number is added to or subtracted from a large one and where, owing to the precision in use, the value of the large one is unaffected by the operation.

We also illustrated 'loss of significance'. Wikipedia: Loss of significance is an undesirable effect in calculations using finite-precision arithmetic. It occurs when an operation on two numbers increases relative error substantially more than it increases absolute error, for example in subtracting two nearly equal numbers (known as catastrophic cancellation).

It is important to understand the basic costs and mechanics of low-level operations. Why does division by a constant take longer than multiplying by its reciprocal? Why is it slower to store numbers in XML, JSON or ASCII compared with binary? What errors might arise?

Division is an expensive operation that should generally be replaced with a faster approach when possible. For instance, rather than frequently dividing by numbers read from a table, pre-compute the reciprocals and use multiplication instead. Or even do the whole computation with the summation of logarithms.

Note that division in binary is a little easier than in higher bases since 'it either goes or it won't go' compared with pen-and-paper division in base 10 where one has to guess how many times it goes and then multiply by the guess to find out.

Multiplication too has traditionally been an expensive operation. But in the last decade, with smaller-and-smaller transistors being available, only low-performance embedded computers, found in devices such as toys and IoT gadgets, are likely to be slower at multiplying than at adding, and then only by a small factor of 3 to 5.

Converting numbers between human-readable decimal and internal binary form is an expensive operation. And storing human-readable numbers in ASCII or UTF-8 uses more memory: A single-precision number might print as 12 characters needing 13 bytes with an additional comma delimiter, but the binary version is only 4 bytes.

Multiplication: mpx1 shows simple long multiplication, base 2, taking n iterations to multiply n-bit numbers. mpx2 is the same code re-factored to look like Booth's algorithm. Booth uses n/2 iterations and is a base (or radix) 4 implementation. Simple hardware implementations of these algorithms would use one processor clock cycle per iteration. Today's processors might typically multiply using long multiplication base 65536 (16 bits at a time). If your hardware clock frequency is 2 GHz and a 64-bit multiply takes four clock cycles (because it is base 65536), then the multiply will take 2 nanoseconds.
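
For concreteness, here is a minimal C sketch of the base-2 long multiplication that mpx1 illustrates (the function name and types are my own, not taken from the demo code): one shift and one conditional add per multiplier bit, i.e. n iterations for n-bit operands.

  #include <stdint.h>
  #include <stdio.h>

  /* Base-2 long multiplication: one conditional add and one shift per bit,
     so 32 iterations for 32-bit operands.  A radix-4 (Booth-style) version
     would retire two multiplier bits per iteration instead.                */
  uint64_t mpx1_sketch(uint32_t x, uint32_t y)
  {
      uint64_t acc = 0, addend = x;
      for (int i = 0; i < 32; i++) {
          if (y & 1) acc += addend;   /* 'it either goes or it won't go' */
          addend <<= 1;               /* shift the partial product up    */
          y >>= 1;                    /* next multiplier bit             */
      }
      return acc;
  }

  int main(void)
  {
      printf("%llu\n", (unsigned long long)mpx1_sketch(1234567, 89101));
      return 0;
  }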

The naive and Vedic long multiply code was not mentioned in 16/17.

Base Conversions

Reading in a number in base 10 is essentially a polynomial evaluation where the digits are the coefficients and the argument is the base. We multiply as we go, taking advantage of the fact that the computer has multiply facilities in hardware that operate in the base we want the output to be in (binary).
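
A minimal sketch of this (my own code, not the lectured listing): the running total is multiplied by the base for each new digit, i.e. Horner-style polynomial evaluation.

  #include <stdio.h>

  /* Read an unsigned decimal string: multiply the running total by ten
     and add each digit.  The hardware multiply works in binary, which
     is exactly the base we want the result stored in.                  */
  unsigned long parse_decimal(const char *s)
  {
      unsigned long v = 0;
      while (*s >= '0' && *s <= '9')
          v = v * 10 + (unsigned long)(*s++ - '0');
      return v;
  }

  int main(void)
  {
      printf("%lu\n", parse_decimal("31415926"));
      return 0;
  }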

Writing out a number in base 10, for human consumption, conversely, requires long division since we do not have base 10 hardware division facilities in modern computers. (Early IBM machines supported it for the COBOL language.)

Because division is a costly operation, output in base 10 is quite CPU-intensive and should be avoided if the numbers are only to be read in again by a computer.
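
Conversely, a sketch of integer output (again my own illustration of the principle only): repeated division by ten peels off the digits least-significant first, so they must be buffered and emitted in reverse.

  #include <stdio.h>

  /* Convert an unsigned integer to decimal ASCII.  Each digit costs a
     division (or a multiply by a pre-computed reciprocal); digits come
     out least-significant first and are reversed for printing.         */
  void print_decimal(unsigned long v)
  {
      char buf[24];
      int n = 0;
      do {
          buf[n++] = (char)('0' + v % 10);
          v /= 10;
      } while (v != 0);
      while (n > 0) putchar(buf[--n]);
      putchar('\n');
  }

  int main(void)
  {
      print_decimal(31415926UL);
      return 0;
  }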

Floating-point output, in human-readable form as printable ASCII characters, is especially expensive. We observe that printing out a floating-point number that is an integer, or the integer part of a small (i.e. less than 2^mantissa-precision) floating-point number, is not too difficult since the binary encoding of the integer part is manifest in the mantissa (with a little binary shifting). Our approach, therefore, for all other numbers and precisions, is to multiply or divide the mantissa by powers of ten until it contains a bit pattern that can be considered a suitable unsigned binary integer. We then print this out using, essentially, our previous integer binary-to-ASCII method. We keep track of how many powers of 10 of scaling were applied and insert a decimal point in the ASCII digit string at the appropriate place. Note: we do not really divide by powers of 10; we are good computer scientists, so we multiply by pre-computed reciprocals.

Output also typically involves beautification, such as suppressing leading zeros, avoiding scientific notation if the decimal place lies somewhere in the mantissa, and printing NaN and Inf in the special cases that we cover in part 2.

The large listing in C in the projection pack is provided just to emphasise the complexity and no knowledge of C is expected. You can try the ML and/or Java alternatives in the demos folder. You are expected to be able to do long division and integer-to-ASCII conversion (and to understand that these are the same) but getting all of the details of float-to-ASCII correct under exam conditions would not be asked.

Also, pay attention to the 'floor' operation, and its relations, which in many languages are tacitly invoked when mixing types in expressions. These also involve the mechanics of base conversion, and hence are expensive and should be used sparingly.

Floating Point Output Worked Example

The details of an example like this one are worked through in the lecture once the single-precision IEEE format has been presented in detail.

Let's look at the output routine operation for 3/256. This is a good starting example since it clearly only has two bits in its mantissa and one will be hidden. In decimal this is exactly 0.01171875. If we want to print this to 3 significant figures, we will ultimately want to apply the integer output routine to 117. Note that 117 is 111_0101 in binary.

Notation: We put an underscore in our long numbers every four digits to make them readable, but the underscore has no meaning, like the comma used every 3 digits in English prose. We put 0x in front of a digit string to denote it as hex (hexadecimal).

The input number is 2^(-7) + 2^(-8), so written as fixed-point binary this is 0.0000_0011.

As floating-point, we have a mantissa of 3 and an exponent of -7, which is 120 in excess-127 format, or 0x78. Packing this into fixed-point fields (hiding the msb of the mantissa) we have 0x3c40_0000. This would be the input to our output routine.

To output as three significant figures we will need to multiply or divide by powers of ten until 100 <= m < 1000. Hence we achieve 117.1875 by multiplying by 10^4. We knew to multiply up and not down from our initial number being less than the lower bound, 100.

117.1875 in single precision is 0x42ea_6000.

Our computer has single-precision floating-point arithmetic hardware. Using the exponential chop in the presented code, where the powers of 10 used are themselves powers of 2, a single multiply by 10^(2^2) = 10^4 was needed.

Finally, we need to extract the bit field 111_0101 from this result and print it. The first 1 is the hidden bit and the 11_0101 is clearly extractable with a shift and mask when we write out the hex number in binary:

  0x42ea_6000 =   0 1 0 0   0 0 1 0   1 1 1 0   1 0 1 0   0 1 1 0   0 0 0 0   0 0 0 0   0 0 0 0
                  s e e e   e e e e   e m m m   m m m m   m m m m   m m m m   m m m m   m m m m
                                        ~ ~ ~   ~ ~ ~                   

The amount to shift right by, 17, is given by simple linear arithmetic on the current exponent (133), the excess (127) and the mantissa width (23): 23 - (133 - 127) = 17.
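
This extraction can be checked mechanically. A small sketch (my own, purely to verify the worked example) unpacks 117.1875 and recovers the integer 117 using the shift described above:

  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  int main(void)
  {
      float f = 117.1875f;
      uint32_t w;
      memcpy(&w, &f, sizeof w);                      /* view the IEEE bit pattern   */
      printf("packed word    = 0x%08x\n", (unsigned)w);        /* expect 0x42ea6000 */

      unsigned exponent = (w >> 23) & 0xff;                    /* expect 133        */
      uint32_t mantissa = (w & 0x7fffff) | 0x800000;           /* re-insert hidden 1 */
      uint32_t intpart  = mantissa >> (23 - (exponent - 127)); /* shift right by 17 */
      printf("exponent field = %u\n", exponent);
      printf("integer part   = %u\n", (unsigned)intpart);      /* expect 117        */
      return 0;
  }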

Part 2 - IEEE Standard

You need to memorise the single-precision floating point bit fields as lectured. Exam questions will often rely on this detail. You will not be expected to remember the precise conventions for storing NaN, infinities and zero. De-normal numbers may be ignored, although their definition is intrinsically obvious if you think about it. You should also note that double-precision has a wider exponent range and more than double the precision in the mantissa.

We will use the abbreviation 'ulp' to stand for 'unit in the last place' denoting the significance of incrementing the binary (or other base) representation by one.

We note that there are a few weirdnesses in the standard, such as the two zeros that are defined to be equal for comparison.

Rather than throw an exception on divide-by-zero, log -0.00000001 or arcsin 1.00000001, that would potentially halt a massive computation, discarding all results everywhere in huge arrays, we prefer a taint approach, replacing the result with the special NaN value that propagates to the final result but which may remain localised in its poisonous effect.

Part 3 - Floating Point Operations

This section of the printed slides is quite verbose, so there's little to add here, but we can repeat the key points.

The four arithmetic operations are defined such that we treat the arguments as exact and then round to the nearest representable result. Always rounding the mid-way case upwards would be biased, so we need to vary the tie-break following some unbiased rule: round-to-even is chosen. Other rounding modes are needed in rare cases, like the interval arithmetic discussed briefly later.

Despite the standard, we note that compilers sometimes compute intermediate results to a higher precision than that of the operator's arguments or the required result. Sometimes, on the other hand, they do not. This means that an expression such as a+b*c can have a different result depending on whether the product was rounded before the summation.
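
The following small experiment (my own, not from the demo folder) shows the effect: the same expression gives two different single-precision answers depending on whether the product b*c is rounded to single precision before the addition or kept in a wider format until the end.

  #include <stdio.h>

  int main(void)
  {
      float a = 1.0e8f, b = 1.0e8f, c = 1.0f + 1.0e-7f;

      float  prod_f = b * c;                    /* product rounded to single first  */
      float  r1 = a + prod_f;

      double prod_d = (double)b * (double)c;    /* product kept in double precision */
      float  r2 = (float)(a + prod_d);          /* one rounding, at the very end    */

      /* On a typical IEEE machine these differ by one ulp at this magnitude. */
      printf("rounded early: %.1f\nrounded late:  %.1f\n", r1, r2);
      return 0;
  }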

The standard libraries are not normally obliged to be engineered such that they always return the nearest representable answer based on their argument being exact. The arguments are very rarely exact in practice, having been produced by an earlier part of the computation and suffered rounding (and loss of precision/underflow, truncation and input quantisation). So we generally specify that a library must give a result that can be out by one ulp provided it remains semi-monotonic.

Semi-monotonic means that the first derivative retains its correct local polarity. For instance, if the output should go down when the input goes up at this point (e.g. the first quadrant of cosine), then it does go down, or stays the same.

The standard library contains all the standard functions written out in old books of mathematical tables or supported by a scientific calculator: trigs, logs, roots, hyperbolics and their inverses.

Error Accumulation Models

We always aim for an expected error of zero; in other words, we are unbiased.

But we might consider that the expected absolute value of error builds up randomly: unfortunately it is often systematic.

The input quantization error arises from digitising the real world which is continuous. For example, CD-quality audio uses 16 bits per sample of air pressure: hence a unit in the last place (ulp) is 2^-16 of the peak-to-peak pressure variation available to represent the sound.

We generally need to consider worst-case error accumulation, where absolute errors sum under addition and subtraction and relative errors sum under multiplication and division. In detailed analysis, when we work through a computation, we commonly need to convert between relative and absolute error each time the output from one class of operator is an input to the other. If we know the expected magnitude of the numbers, we can convert between relative and absolute error easily. Failing that, we can assume the current magnitude is approximately unity: i.e. the relative error is the absolute error. For constants in the program code, we must look to see whether they are precisely-representable, or whether they too carry an error.

To show systematic amplification of the input quantization error, there are three plots: the first shows an unbiased random walk in 1-D with fixed step size, A, of unity. This shows an expected mean deviation of 0 and a mean absolute deviation of roughly A.sqrt(N) after N steps.

The second shows the absolute error (relative will be the same shape) in the value of single-precision variable p in the loop

  p = 1.0; for (i = 0; i < 100; i++) p = p * x;

which was plotted by subtracting the values of p from those from another copy of the loop that used double precision. This too shows a random walk of similar nature, with the (variable) step size, A, averaging roughly macheps/2.

The second plot used various values of x hence multiple, yet deterministic, traces. All values of x were precisely-representable as single-precision numbers.

The third plot shows the same as the second but for various values of x that had to be quantised to single precision before starting. This produces systematic growth of the error that dominates the random walk effect (look at the ordinate scale). The relative error accumulates linearly, as given by the earlier rule of relative errors adding under multiplication.
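
A runnable version of the experiment behind these plots might look like the following sketch (my own reconstruction, not the lectured demo): the same product loop is run in single and double precision, using both the quantised and the unquantised value of x as references.

  #include <stdio.h>

  int main(void)
  {
      float  x  = 1.1f;        /* 1.1 is not exactly representable in binary        */
      double xq = x;           /* the quantised value actually used in single       */
      double xt = 1.1;         /* (nearly) the true value                           */
      float  p  = 1.0f;
      double pq = 1.0, pt = 1.0;

      for (int i = 1; i <= 100; i++) {
          p  = p  * x;         /* the single-precision loop under test              */
          pq = pq * xq;        /* reference 1: same quantised x (second plot)       */
          pt = pt * xt;        /* reference 2: unquantised x (third plot)           */
          printf("%3d  rounding-only error %+.3e   with input quantisation %+.3e\n",
                 i, ((double)p - pq) / pq, ((double)p - pt) / pt);
      }
      return 0;
  }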

The takeaway is that local random walks do happen but are commonly negligible and any systematic error propagation more-properly needs our attention.

We define macheps, and it is worth memorising its order of magnitude for single and double precision.


Part 4 - Simple Maths, Simple Programs

We need to be aware of the structure of error forwarding in our code: does an error persist through to the output or will it be lost as the computation progresses?

Even an ultra simple program, like our example of implementing the quadratic equation root formula, can go wrong. As programmers we need to consider all corners of the operational space and generally add in 'IF' statements to redirect some corners to alternative implementations or error reports.
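
As a sketch of the kind of defensive rewriting required (a standard technique, not necessarily the exact fix presented in lectures): when b*b is much larger than 4ac, one of the two roots from the textbook formula suffers catastrophic cancellation, which can be avoided by recovering that root from the product of the roots instead.

  #include <stdio.h>
  #include <math.h>

  /* Roots of a*x^2 + b*x + c = 0, assuming real roots and a != 0.
     The root whose numerator would subtract two nearly equal
     quantities is obtained from r1*r2 = c/a instead.             */
  void quad_roots(double a, double b, double c, double *r1, double *r2)
  {
      double d = sqrt(b * b - 4.0 * a * c);
      double q = (b >= 0.0) ? -0.5 * (b + d) : -0.5 * (b - d);
      *r1 = q / a;          /* the well-conditioned root            */
      *r2 = c / q;          /* recovered without cancellation       */
  }

  int main(void)
  {
      double r1, r2;
      quad_roots(1.0, 1.0e8, 1.0, &r1, &r2);    /* roots near -1e8 and -1e-8 */
      printf("%.17g %.17g\n", r1, r2);
      return 0;
  }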

Note that floating-point addition and subtraction are not generally associative: better accuracy can be achieved by controlling the association, e.g. perhaps by sorting the summands first.

Remember you do not have to naively use the built-in numerical facilities in all circumstances. Sometimes it is better to perform long arithmetic using multiple words in some base of your own choosing. Libraries are normally available. Kahan summation is a neat little example, where two words are used with very little management overhead, to get nearly double the precision over one word (details not examinable).
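
A minimal sketch of Kahan (compensated) summation, for interest only since the details are not examinable: a second word carries the low-order part that would otherwise be lost in each addition.

  #include <stdio.h>

  /* Kahan compensated summation: 'comp' accumulates the low-order bits
     lost when each term is added to the running sum.                   */
  double kahan_sum(const double *a, int n)
  {
      double sum = 0.0, comp = 0.0;
      for (int i = 0; i < n; i++) {
          double y = a[i] - comp;       /* corrected term                   */
          double t = sum + y;           /* big + small: low bits of y lost  */
          comp = (t - sum) - y;         /* recover them (algebraically 0)   */
          sum = t;
      }
      return sum;
  }

  int main(void)
  {
      static double a[1000000];
      for (int i = 0; i < 1000000; i++) a[i] = 0.1;
      double naive = 0.0;
      for (int i = 0; i < 1000000; i++) naive += a[i];
      printf("naive: %.15f\nkahan: %.15f\n", naive, kahan_sum(a, 1000000));
      return 0;
  }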


Part 5 - Infinite and Limiting Computations

We cannot compute forever. We have to stop. When we stop early, with respect to the underlying mathematics, we introduce a new type of error: the truncation error.

The truncation error arises where we approximate the mathematical 'take to the limit' operation with a value that occurs before we get to the limit.

Truncation of a Limiting Computation

In the differentiation examples we truncated using a finite, non-zero value of h: we did not take h all the way to zero.

When we truncate in this way, we avoid the quantisation and rounding errors arising with small values of h. Small values of h in the differentiation example lead to nearly identical values of f(x) being subtracted, thereby greatly losing precision.

So if one form of error goes down as we increase h, and another goes up, our total error is likely to be minimised for the value of h for which the errors are equal. We can deduce the cross-over point for general continuous functions using Taylor's approximation and assuming that the value of f(x) and of all its derivatives is approximately unity. If we have better information regarding f(x) and its derivatives, then we might use that to more accurately determine the best h.
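
The trade-off is easy to see experimentally. A sketch (my own illustration, with the usual assumption that f and its derivatives are of order unity): sweeping h downwards, the error of the one-sided difference first falls with h (truncation) and then rises again (cancellation and rounding), with the minimum near the square root of macheps.

  #include <stdio.h>
  #include <math.h>

  double f(double x) { return sin(x); }

  int main(void)
  {
      double x = 1.0, exact = cos(1.0);
      /* The error first falls (truncation, roughly proportional to h) and
         then rises again (rounding, roughly macheps/h); the minimum sits
         near sqrt(macheps), i.e. around 1e-8 for double precision.        */
      for (double h = 1e-1; h > 1e-15; h /= 10.0) {
          double approx = (f(x + h) - f(x)) / h;
          printf("h = %8.1e   error = %12.5e\n", h, fabs(approx - exact));
      }
      return 0;
  }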

Note that the square root of macheps is about 10^-8 for double precision and the cube root is LARGER at 10^-5.

We can consider the way in which our truncation error varies according to the value of h to determine the, so-called, order of the method. For instance, in the second-lectured method for differentiation, the truncation error is proportional to h^2 and so we say it is a second-order method.

Truncation of an Infinite Iteration

In the summation of an infinite series, we can commonly truncate when the terms that we are combining become sufficiently small, such that our accumulation so far is sufficiently accurate for our application. We might consider modifying our method, knowing that we are going to truncate it, so that the truncation has better properties, such as being smaller or more evenly spread over use cases (Chebychev later).

The rate at which the truncation error gets smaller, as a function of how many steps we make, determines the so-called order of the method. If we see a pattern in the relative or absolute truncation error from one step to the next, we can potentially estimate the number of iterations needed to solve a problem in advance of starting it. Also, we may get insight into whether the error might get larger as we iterate, which is either a disaster or else a hint that we should seek to reverse the iteration such that the errors progress in the opposite direction.

Our golden ratio iteration multiplied the absolute error by 3 each time when we got it backwards, but divided it by 3 each time, making it ever smaller, when run in the direction that works. We need iterations that scale the error by a factor that lies between -1 and +1: this makes it smaller. Negative ratios are fine; they just mean the error sign alternates each time, giving a zigzag plot.

The binary chop iteration reduces the bound on the error by half on each step. The relative error overall follows a geometric series. This means gaining one more accurate bit per iteration. Information theory tells us that if we are just using the output from a single boolean predicate per iteration, then one bit is all we can gain on average. This is first-order convergence.
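
A sketch of the binary chop (bisection) on a root-finding problem, assuming only that the function changes sign across the starting interval (my own code, not the lectured demo):

  #include <stdio.h>
  #include <math.h>

  /* Bisection: find x in [lo,hi] with g(x)=0, given that g(lo) and g(hi)
     have opposite signs.  The error bound halves every iteration:
     first-order convergence, one extra accurate bit per step.            */
  double bisect(double (*g)(double), double lo, double hi, int iters)
  {
      for (int i = 0; i < iters; i++) {
          double mid = 0.5 * (lo + hi);
          if ((g(lo) < 0.0) == (g(mid) < 0.0)) lo = mid; else hi = mid;
      }
      return 0.5 * (lo + hi);
  }

  double g(double x) { return x * x - 2.0; }    /* root is sqrt(2) */

  int main(void)
  {
      printf("%.15f (cf. %.15f)\n", bisect(g, 1.0, 2.0, 50), sqrt(2.0));
      return 0;
  }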

The Newton Raphson iteration squares the relative error on each iteration. Squaring something less than unity makes it smaller (which is good!). Such squaring doubles the number of accurate digits each iteration, meaning the required number of iterations is the logarithm of the number of significant digits we desire. (Q. Does the squaring mean it is fatal if we start with a relative error greater than 1? A. Our analysis, which simply ignored the higher-order terms in the Taylor expansion, is no longer valid when the error is larger than 1.)

Newton Raphson has second order convergence: the number of iterations needed is the log of number of digits needed which is the log of the log of the denominator in an expression of accuracy.
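
A sketch of Newton Raphson for the square root (the classic example; my own code): each iteration roughly doubles the number of correct digits, so a handful of iterations suffice from a reasonable starting guess.

  #include <stdio.h>
  #include <math.h>

  /* Newton Raphson on g(x) = x*x - a gives the update x' = (x + a/x) / 2.
     The relative error is roughly squared each iteration, so the number
     of correct digits roughly doubles each time.                          */
  double nr_sqrt(double a)
  {
      double x = a > 1.0 ? a : 1.0;     /* crude starting guess, a > 0 assumed */
      for (;;) {
          double xn = 0.5 * (x + a / x);
          if (xn >= x) return x;        /* no further improvement: converged   */
          x = xn;
      }
  }

  int main(void)
  {
      printf("%.15f (cf. %.15f)\n", nr_sqrt(2.0), sqrt(2.0));
      return 0;
  }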

Summing an Infinite Series

Some series converge in maths for all argument values but numerically behave badly with large input values. We should avoid those large values.

Range reduction identities for sine and cosine are obvious from geometry, by subtracting multiples of 2 Pi and subtracting from Pi while swapping one for the other. For the second octant we can interchange sine and cosine and use the first octant. Hence we only need to evaluate them in the first octant (zero to 45 degrees).

For the exponential, in lectures, we looked at taking the argument modulo the log of 2, so we only had to apply the series to the remainder, with the integer quotient becoming an offset to the resulting exponent. Rounding down ensured the series is only applied to positive values, eliminating problems of monotonicity. The biggest number the series is then applied to is log 2, which is less than unity for all bases greater than 2.
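
A sketch of this range reduction (my own illustration): split x as k.ln2 + r with 0 <= r < ln2, sum a short Taylor series for e^r, and put k back as a scaling by 2^k.

  #include <stdio.h>
  #include <math.h>

  /* exp(x) = 2^k * exp(r) where x = k*ln2 + r and 0 <= r < ln2.
     Rounding k downwards keeps r non-negative and below ln 2, so the
     Taylor series is only ever summed for a small positive argument.  */
  double exp_sketch(double x)
  {
      const double ln2 = 0.69314718055994530942;
      double k = floor(x / ln2);
      double r = x - k * ln2;
      double term = 1.0, sum = 1.0;
      for (int n = 1; n <= 20; n++) {       /* 20 terms is ample for r < 0.7 */
          term *= r / n;
          sum += term;
      }
      return ldexp(sum, (int)k);            /* scale by 2^k */
  }

  int main(void)
  {
      printf("%.15g (cf. %.15g)\n", exp_sketch(5.3), exp(5.3));
      return 0;
  }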

For other problems, range reduction might be better towards unity. For instance, for finding square roots, the Taylor expansion of (1+x)^0.5 works well, provided we keep x small, meaning the argument to square root is close to 1. That expansion does not converge if x is greater than 1, so root 2 is the highest we can directly countenance. A good range reduction approach for square root is a matter of dividing or multiplying by factors of 4, which is a simple shift by even powers of two, and the answer needs to be corrected by shifting by half the pre-shift amount, which is also a simple power of two.

Part 5b - Approximate Methods

Quadrature

We cannot sum to infinity in finite time, so truncation is applied, introducing an error.

Sometimes we need to apply a numerical method to the data generated by a function that is too complex to analyse or is just stored in a table. Integrating is the premier example. Numerical integration is called quadrature. At school we may have looked at Simpson's Rule, which puts a parabolic spline through each grouping of three adjacent points and we can find the area under the splined approximation using algebra.

Simpson's Rule has order of at least 4 (halving h divides the truncation error by at least 16). There have been many books and papers that give better-and-better quadrature techniques. As Computerists we should know they exist and we should normally use a pre-existing library (e.g. matlab) rather than coding our own.
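
As a concrete sketch (my own code; the formula would in any case be supplied if needed in an exam question): composite Simpson's Rule fits a parabola through each group of three adjacent samples, giving the familiar 1, 4, 2, 4, ..., 4, 1 weighting.

  #include <stdio.h>
  #include <math.h>

  /* Composite Simpson's Rule over [a,b] with n strips (n must be even):
     integral ~= h/3 * ( f0 + 4 f1 + 2 f2 + 4 f3 + ... + 4 f(n-1) + fn ). */
  double simpson(double (*f)(double), double a, double b, int n)
  {
      double h = (b - a) / n;
      double s = f(a) + f(b);
      for (int i = 1; i < n; i++)
          s += (i % 2 ? 4.0 : 2.0) * f(a + i * h);
      return s * h / 3.0;
  }

  int main(void)
  {
      const double pi = 3.14159265358979323846;
      /* Halving h should divide the truncation error by roughly 16. */
      printf("n=8:   %.12f\n", simpson(sin, 0.0, pi, 8));
      printf("n=16:  %.12f\n", simpson(sin, 0.0, pi, 16));
      printf("exact: 2\n");
      return 0;
  }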

The demo (being added to the demos git repo) showed that rather few strips (the reciprocal of the fifth root of macheps) is the best number in general. It also showed that two strips are enough for a quadratic function, hence exposing clearly the rounding error growth that degrades the result beyond the optimum number (25 to 35 for single precision). The interesting part of the plot is the extreme right-hand side: all power laws form a straight line on a log-log plot and the order of the algorithm sets the gradient. The rest of the plot shows rounding noise, which has square-root growth.

Simpson's Rule Demonstration of Order

Simpson's Rule formula will be given if needed in an exercise or examination question.

Splines And So On

The next part of this section, up until we resume talking about range reduction in the Taylor series for sine, is non-examinable this year (17/18), but the important takeaway is the principle of expressing a function as a weighted linear combination of some basis 'vectors'. I put the word 'vectors' in quotes since we are talking about scalars in reality, but everybody is familiar with expressing a point in k-D space as a weighted sum of k basis vectors.

The Lagrange basis set of size 2n+1 is the 'obvious' set of functions that are unity at some given integer in the range -n to +n and zero at the other integers. We can then directly construct a spline from tabulated values as the weighted sum of these basis functions.

A plot of the Lagrange reconstruction reveals that the error is non-uniformly spread. Can we do better? Yes, we do not have to use evenly-spaced integers as our basis points, instead we use a non-uniform pattern, such as that attributed to/invented by Chebychev.

Sometimes we cannot afford the time or energy to compute an accurate result and are happy with a coarser answer. This may hold especially on low-power embedded systems, such as a smoke and motion sensor in a ceiling tile.

For graphics applications we rarely need more than about 12 bits of accuracy - which is half single precision. In a graphical computer game we need speed. Where a function is monotonic and we are comparing values of its output we can compare its input instead!

Cordic

The Cordic algorithm can perform a number of scientific computations using fixed-point integer arithmetic. The most common examples are converting between rectangular and polar co-ordinates and back. The straightforward computation of sine and cosine was all that was lectured in recent years. Since we are working with fixed-point fractions, we might adopt some convention on the location of the implied binary point, or we might just assume some fixed denominator, that is not necessarily a power of 2, and use run-time values for the numerator.

Cordic computes sine and cosine of an angle by rotating the unit vector from the horizontal axis to the desired angle in a number of well-chosen steps. Each step has the property that its corresponding rotation matrix is almost trivial to apply (i.e. to multiply by). The rotations are taken from a precomputed table of angles whose tangents are fractional powers of 2 and the only decision at run time is whether to add or subtract the next value from the current angle as we reduce the current angle in absolute terms towards zero. We use rotations that are more than strict rotations in that they also extend the length of the vector by a known amount. If we always use exactly the same number of rotations, we can compute offline the total amount of length extension that happens and factor that into either our fixed-point denominator or the length of the vector we start with. That factoring is also offline.
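
A minimal sketch of the rotation-mode Cordic described above (my own code, not the lectured demo): the angle table and the length-correction factor K are precomputed here at start-up (using the maths library for convenience); the iterations themselves use only shifts and adds on 16.16 fixed-point values. Arithmetic right shift of negative values is assumed, as on essentially all current machines.

  #include <stdio.h>
  #include <math.h>

  #define ITERS 16
  #define FRAC  16                               /* 16.16 fixed point */
  #define F(x)  ((int)lround((x) * (1 << FRAC)))

  void cordic_sincos(double angle, double *s, double *c)
  {
      int atan_tab[ITERS];
      double k = 1.0;
      for (int i = 0; i < ITERS; i++) {
          atan_tab[i] = F(atan(ldexp(1.0, -i)));      /* angles with tan = 2^-i   */
          k *= 1.0 / sqrt(1.0 + ldexp(1.0, -2 * i));  /* total length correction  */
      }
      int x = F(k), y = 0, z = F(angle);              /* start on the x-axis, pre-scaled by K */
      for (int i = 0; i < ITERS; i++) {
          int xs = x >> i, ys = y >> i;               /* shifts replace multiplies */
          if (z >= 0) { x -= ys; y += xs; z -= atan_tab[i]; }
          else        { x += ys; y -= xs; z += atan_tab[i]; }
      }
      *c = x / (double)(1 << FRAC);
      *s = y / (double)(1 << FRAC);
  }

  int main(void)
  {
      double s, c;
      cordic_sincos(0.5, &s, &c);
      printf("%f %f   (library: %f %f)\n", s, c, sin(0.5), cos(0.5));
      return 0;
  }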


Part 6 - Illness

The lectured matrix inversion example shows that nasty things can happen even when the input numbers look innocent. This matrix problem has a graphical projection interpretation that we can readily visualise: it is an information-losing projection that will be tough to invert.

In general, we need other approaches to spotting a poorly-behaving problem. It may typically just be ill-conditioned for some regions of its domain (input space).

Some functions or programs may simply have high partial derivatives between an input and an output. These are worth knowing about and we should be aware of our input quantisation error or sample noise on such an input, since this could dominate the output error. In other words, our output is very sensitive to this input and so we should be careful when collecting it from the outside world. But a high gain is not an indication that the computation itself will be unstable. A high gain does not amplify the relative error, as shown in the 10^22 example in the notes.

We can measure the error amplification using random input perturbations. This is a generalisation of the numerical approach to finding the derivative explored earlier.

Or, if we have the inverse function, we can check backwards stability. A backwards-stable function is 'round-trippable', with its inverse giving a good approximation of the input from the output.

Or if we have a cheap way of finding whether the answer is correct, we should apply that.

To understand the stability of the computation itself, we can look at the change in relative error introduced by our program: this enables us to speak of its condition number at that locality.

Where our method uses some step size, h, we may need to use different values for different parts of the domain, perhaps dynamically adjusted as in the SPICE algorithm at the end of this course.

Some functions are chaotic and remain poorly-behaved however much love we give them: "one flap of a sea gull's wings would be enough to alter the course of the weather forever".


Part 7 - Solving Systems Of Simultaneous Equations

One way to solve a system of linear simultaneous equations is to multiply the right-hand side by the inverse of the coefficient matrix. But this is not often a good approach since it involves more computation than L/U decomposition and can be less stable.

Gaussian elimination converts the coefficient matrix into upper-triangular form. This has cubic cost. We aim to make it well behaved by good pivot selection. Keeping track of the modifications made to the right-hand sides gives us a corresponding lower-triangular matrix. (The lower matrix is mathematically formed by a product of matrices that are identity matrices with a set of non-zero values beneath one of the ones, but forming their product ends up never multiplying any of these together, so computationally we can just insert them with store instructions, as in the demo code.) Multiplying the pair together gives us back the original coefficient matrix.

After L/U decomposition, we can solve for any right-hand-side vector in quadratic time. If we solve for the set of right-hand sides that are the columns of the identity matrix, the result is the inverse of the coefficient matrix, and this is a good technique for matrix inversion should we need it. (Cramer's rule has exponential cost, or quartic if you borrow a little help from L/U decomposition.)
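
A minimal sketch of the idea (my own code, with no pivoting, so it is for illustration only; real code would select pivots as described above): Gaussian elimination to L/U in cubic time, then one forward and one backward substitution per right-hand side in quadratic time.

  #include <stdio.h>

  #define N 3

  /* Decompose a into L and U in place (L has an implicit unit diagonal and
     occupies the space below it).  No pivoting: for illustration only.     */
  void lu_decompose(double a[N][N])
  {
      for (int k = 0; k < N; k++)
          for (int i = k + 1; i < N; i++) {
              a[i][k] /= a[k][k];                 /* multiplier, stored as L    */
              for (int j = k + 1; j < N; j++)
                  a[i][j] -= a[i][k] * a[k][j];   /* eliminate below the pivot  */
          }
  }

  /* Solve L U x = b by forward then backward substitution (quadratic cost). */
  void lu_solve(double a[N][N], const double b[N], double x[N])
  {
      double y[N];
      for (int i = 0; i < N; i++) {
          y[i] = b[i];
          for (int j = 0; j < i; j++) y[i] -= a[i][j] * y[j];
      }
      for (int i = N - 1; i >= 0; i--) {
          x[i] = y[i];
          for (int j = i + 1; j < N; j++) x[i] -= a[i][j] * x[j];
          x[i] /= a[i][i];
      }
  }

  int main(void)
  {
      double a[N][N] = {{4, 1, 2}, {1, 3, 0}, {2, 0, 5}};
      double b[N] = {7, 4, 7}, x[N];
      lu_decompose(a);
      lu_solve(a, b, x);
      printf("%g %g %g\n", x[0], x[1], x[2]);   /* expect 1 1 1 */
      return 0;
  }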

Commonly our coefficient matrix has special symmetries that mean short-cut techniques can be used. If it is transpose-symmetric and positive definite (not lectured), then we can use Cholesky which has lower cost (while still cubic). Our circuit flow matrices (later) have these properties.

L/D/U decomposition gives L and U with ones on their leading diagonals using an auxiliary diagonal matrix D and is helpful as a variant of Cholesky without square roots.

Not all systems have unique answers: the equations might be linearly-dependent or in short supply, in which case there are multiple possible answers (we may select the best answer using a regression fit). If there are too many equations there may be no solutions.

When we have too many simultaneous equations (rows in the matrix expression thereof), we might want to choose a best-fitting answer, such as the one with the least sum squared difference between the r.h.s. yielded and the target. This is multi-dimensional linear regression.

When we have too few equations, the solution space is a line, plane or volume ...

Simulated annealing searches for low cost (or high benefit), as returned by a scalar metric over a multi-dimensional space. To try to avoid false minima, it starts with large jumps in the search space and with only marginal preference to retaining improvements. As the so-called temperature falls, we make smaller jumps. Below a certain temperature we only accept improvements: this is the so-called quench phase.


Alternatives to Floating Point

We have spoken of fixed and floating-point representations, but these are not the only ways of computing.

Unum Posits (not examinable) are a very new floating-point encoding (2017) that is better than the IEEE standard in a number of ways. Any number of bits can be used to store a posit. All bit settings of the word have a meaning. Word size can be extended by just adding trailing zeros. Fields change size dynamically: e.g. the mantissa has no bits when the exponent is large. They have greater range and precision for the same number of bits as IEEE. Hardware implementations as part of mainstream processor instruction sets may well appear in the future. (Meanwhile, FPGA implementations are being explored: LINK.)

Interval systems model each value as a pair of values with the idea that the true answer lies between these values. All of the standard operators are overloaded to apply to such pairs, but where program control flow is predicated on an interval, a resolution function must typically be applied to prevent exponential growth in the number of possibilities represented.

Multi-digit arithmetic is just a generalisation of the school 'long' forms of arithmetic. Infinite precision, perhaps based on lazy lists, is commonly used as well. For instance, there is the dotnet BigInteger library or the standard gnu mpfr demo in the projection slide pack.

Symbolic systems exist that hold expressions as formulae rather than numbers and apply most operators using distributive laws, with simple identities applied to minimise the growth in complexity. For instance, the cubing of a square will become a raising to the sixth power. That little example also shows that they will perform computations where the results are easy to represent precisely, such as the product of two natural numbers.

We did not cover these systems in lectures in any detail, but, apart from Posits, you are expected to know enough about what they are to deduce simple answers regarding how they might behave. In a Tripos Exam, if this is ever to be examined, we would provide sufficient explanation of the type of system being considered.


Simulations

To simulate the real world, which is analogue, on a digital computer, we commonly need to quantise in three different ways:

  • the values of properties (like fill depth in a bucket),
  • an elemental physical quantisation (like dividing an area into patches of finite size), over which we assume the properties are constant, and
  • a time-domain quantisation, over which we assume rates of change of properties are constant.

    The dart-throwing Monte Carlo simulation converges poorly, obeying simple statistics. The relative error in a single throw is roughly unity. Assuming an unbiased simulation, after N throws the r.m.s. noise will be roughly square-root N. Antithetic improvements were briefly mentioned but are not examinable.

    FDTD quantisation of the continuum introduces errors just like the truncation errors in the numerical differentiation example earlier. There are multiple quantisation parameters in general, such as the time step and the elemental length. Adaptive time steps and patch sizes are commonly needed for efficient computations that retain accuracy near sharp edges.

    The values that need to be retained from one time step to another form the state vector of the system. In a finite-element simulation, we can consider there to be a state vector for each element or else the whole lot is the state vector.

    The ways that errors introduced by these approximations to the continuum compound varies greatly according to the nature of interaction between components. Commonly behaviour is good: these truncation errors can either tend to cancel out or at least not accumulate systematically. This arises from the inherent stability of many of the systems we explore, arising from their good design (by man or nature). For instance, the error might change sign every step. But each situation needs to be analysed individually.

    Stencils

    Most commonly, a FDTD simulation is making a numerical approximation to the result of a set of integral equations. This models the dynamic evolution of the system.

    The forwards stencil is the most obvious approach for such a problem and works well for many systems in practice. See 'The Forward Stencil (Euler's Method) for ODEs'. In the last part of the course, we will place the forward stencil in flow matrix form under the title 'time-varying components'.

    The forward stencil uses the rate values at the end of one timestep as though constant throughout the next timestep. It is also known as the forward difference method and Euler's Method. The new value of a state vector element is its current value plus its rate of change multiplied by the time step size. So it is nothing more than numerical integration by the Riemann Sums method (see the quadrature slide, where this is also labelled as the mid-point rule, but that is a confusing name in this context).

    The simple forwards stencil serves well when high accuracy is not needed, as in computer games and so on.

    The backward stencil uses the rate values at the end of the current timestep as though constant in this current timestep. When we can work out the backwards stencil (by analysis or iteration) then we could potentially just use it, but this might be no better than the forward stencil. Better is to take an average of the forward and backward approaches, giving the Crank-Nicolson method.

    Forward Stencil Stability and Accuracy

    We will consider here just the stability and accuracy when modelling exponential curves. Curves that have zero second derivative will be completely accurate and are called straight lines.

    The forward stencil essentially takes only the linear term from a Taylor expansion and so the error introduced at each step is proportional to the step size squared. This was illustrated in the lecture. However, as we iterate in the time domain, to cover a given modelled elapsed time, we need a number of steps inversely proportional to the step size. Hence, when errors are accumulating systematically (in theory, and not lectured), the overall error is proportional to the step size, meaning Euler's method is a first-order system: halving the step size doubles the accuracy. But this is only in the worst case, and most practical systems are much better behaved.

    With the forward stencil, an exponential decay simulates as a geometric decay. The geometric decay is nothing more than a sampled exponential function, but it is the wrong exponential: it has a slightly faster decay rate, so the time domain has been slightly dilated, but we have a constant relative error in rate of time.

    The backwards stencil is wrong in the opposite direction. Crank-Nicolson cancels out these two errors, giving a second-order method (details not lectured). Actual numbers were presented in the lectures.

    Consider the forward stencil applied to the exponential decay dy/dt = -a.y: each step multiplies the state by (1-ah), so after n steps we have y_n = (1-ah)^n y_0.

    As n gets larger, this clearly does not converge if |1-ah| > 1. Hence the forward stencil will fail to converge for the exponential decay if the step size is larger than 2/a. This is again an unimportant result, since there is no accuracy left well before we consider using such a large step size. Nonetheless, it seems traditional to mention the proof.
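
    A sketch comparing the stencils on this decay (my own illustration, using the closed forms each stencil takes for this particular equation): the forward step multiplies the state by (1 - ah), the backward step by 1/(1 + ah), and the Crank-Nicolson step by (1 - ah/2)/(1 + ah/2).

      #include <stdio.h>
      #include <math.h>

      int main(void)
      {
          double a = 1.0, h = 0.1, t_end = 1.0;
          double yf = 1.0, yb = 1.0, yc = 1.0;   /* forward, backward, Crank-Nicolson */
          int steps = (int)(t_end / h + 0.5);

          for (int i = 0; i < steps; i++) {
              yf *= 1.0 - a * h;                 /* rate taken at the start of the step */
              yb /= 1.0 + a * h;                 /* rate taken at the end of the step   */
              yc *= (1.0 - 0.5 * a * h) / (1.0 + 0.5 * a * h);   /* average of the two  */
          }
          double exact = exp(-a * t_end);
          printf("exact %.6f  forward %.6f  backward %.6f  crank-nicolson %.6f\n",
                 exact, yf, yb, yc);
          return 0;
      }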

    Better Techniques

    There is a huge literature of techniques. Many are embedded in standard libraries. Complete lecture courses and human lifetimes can be devoted to this one subject.

    Sometimes the forward stencil will fail. An example is so-called Stiff Equations, but their precise definition does not exist and such situations are beyond the scope of this course anyway. Iterations such as that of Gear, flashed up in one slide and non-examinable, are the way forward for those with interest. You might also like to explore the Runge Kutta techniques.


    Part 10 - Flow Networks

    Commonly we wish to simulate large, continuous systems that have been spatially-quantised and matrix form is naturally appropriate. Other times we have complex systems that are discrete in reality, such as traffic, water or electrical systems, consisting of paths between nodes with fluid flow along the paths.

    Flow networks have a steady-state equilibrium given by the solution of simultaneous equations. Where all links are linear, we do not need to iterate to find the steady state. But where some links or nodes change over time, as with valves being gradually closed or buckets changing in fill depth, we embed our flow matrix solver inside an FDTD simulation.

    Traffic in a road network or packet flow rates on the Internet are also systems that can be simulated using fluid flow. Examples of capacitor-like components are then, respectively, car parks and queues in buffers.

    To be specific in lectures, we speak mainly of flow networks where the fluid is electricity or water. It is introduced at a constant rate at one or more sources and flows down to a sink at the lowest level which we say has zero potential. The current generator, in the water analogy, can be thought of as a pump that takes fluid out of an infinite sump inside the drain and delivers it at a constant rate somewhere near the top, and however high this needs to be, the perfect pump is always able to deliver that rate. In the electrical analogy, as well as current generators, we also typically model perfect voltage generators that deliver any amount of current at a set height (or voltage). These perfect generators can model real generators by including a resistance with them called the internal resistance of the generator.

    Our links are resistive by default: the amount of flow is given by their fixed conductance multiplied by the potential difference between their ends, and flow is from the higher end towards the lower end. This differs from the links considered in the min-cut, max-flow networks mentioned in the Algorithms course, which are considered to have zero resistance (infinite conductance) until their flow meets their saturation level, at which point their resistance rises as needed to limit the flow to that value. In this course we illustrate such non-linear behaviour using the pressure limiting valve and the electrical diode.

    We can write an ad hoc piece of code for every network we wish to model (such as the one for the two tanks in the demos folder). Better is to have a systematic modelling approach that can read in a network (e.g. via a GUI) composed of components (from a standard library).

    We phrase the conservation of flow equations in matrix form where the solution is a column vector of all the potentials excluding an arbitrary one that we nominate as ground zero. Each component has a characteristic template in the matrix: for instance a conductance appears at four places, unless one end is connected to our ground, when it appears only once.
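
    As a sketch of this template (often called a 'stamp'; the code and names are my own, not the demo program): node 0 is the ground reference and gets no row or column in the matrix, so a conductance with one end grounded contributes only a single entry.

      #include <stdio.h>

      #define NODES 3                  /* non-ground nodes, numbered 1..NODES */
      static double G[NODES][NODES];   /* conductance (coefficient) matrix    */
      static double I[NODES];          /* right-hand side of injected flows   */

      /* A conductance g between nodes a and b (0 means ground) appears at up
         to four places: +g on the two diagonals, -g at the two off-diagonals. */
      void stamp_conductance(int a, int b, double g)
      {
          if (a) G[a-1][a-1] += g;
          if (b) G[b-1][b-1] += g;
          if (a && b) { G[a-1][b-1] -= g; G[b-1][a-1] -= g; }
      }

      /* A flow (current) source driving 'flow' out of node a and into node b. */
      void stamp_source(int a, int b, double flow)
      {
          if (a) I[a-1] -= flow;
          if (b) I[b-1] += flow;
      }

      int main(void)
      {
          stamp_source(0, 1, 1.0);          /* pump 1 unit from ground into node 1 */
          stamp_conductance(1, 2, 2.0);
          stamp_conductance(2, 3, 2.0);
          stamp_conductance(3, 0, 1.0);
          /* G v = I would now be handed to the L/U solver of Part 7. */
          for (int r = 0; r < NODES; r++)
              printf("%6.2f %6.2f %6.2f | %6.2f\n", G[r][0], G[r][1], G[r][2], I[r]);
          return 0;
      }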

    Steady-State Operating Point

    The steady-state equilibrium is found, for linear links, by simply solving the linear simultaneous equations given by the conservation of flow at each node (Kirchhoff's current law in electricity). The matrix of coefficients will be symmetric and positive-definite (not lectured) and hence Cholesky's approach works. Also, typically, the number of nodes and links will be roughly the same, to a factor of 2 or 3 perhaps, so our coefficient matrix is very sparse when there are hundreds of nodes or more. Hence techniques for sparse equation sets (not lectured) are also commonly applied.

    The steady state may be perturbed by external influences, which normally take the form of a link being blocked or unblocked by the operation of a switch or tap (or lock gate, or a road being closed by the police, or whatever). However, the flow generators may also be 'non-stationary' in their rates: a photoelectric panel, for example, will have a cyclic variation in the flow it generates each day as the sun rises and sets.

    The equations must be solved again after any such variation in the system parameters.

    Non-linear Components

    Our nodal analysis can directly model flow sources and conductances. All other components must be phrased in terms of these two fundamentals. Here we will replace a non-linear component with a phantom pair consisting of a flow generator and a conductance. In the final section we replace a tank with a similar pair; only the way the values are chosen is different.

    The value of phantom conductance used is the instantaneous derivative of the flow w.r.t. the potential. The value of the phantom flow is set so that the correct potential difference is developed over the component, or, put another way, such that the tangent we have modelled with the instantaneous derivative no longer passes through the origin, but is offset correctly according to the properties of the component being modelled.

    Where a link between nodes (or pipe or channel) has resistance that varies according to its flow, it is non-linear. Where there are relatively few of these overall, we insert our linear equation solver inside an iteration framework and allow it to converge on the operating point over some small number of iterations. The implementation in the demos folder uses a simple substitution iteration (as shown for the Golden Ratio in earlier lectures) but commonly a Newton-Raphson iteration can be applied if we have an analytic formula for the non-linearity (such as the diode formula) and thereby readily have the formula for its derivative.

    The iteration we need starts with some potential difference across the non-linear component (typically the value found in a previous timestep), this gives us a conductance. We solve the equations and get the new flow through the component. Does this flow develop a potential difference that is close enough to what we started with? If not, iterate.

    Reservoirs - Time-varying Components

    Fluid systems commonly contain tanks that fill and empty over time. The instantaneous fill depths, as always, are part of our system state vector. The electrical capacitor can be thought of as a prismatic water bucket whose sectional area is the electrical capacitance and whose fill depth is the potential difference between its terminals. For all such reservoirs, the depth after a time step is given by quadrature (a numerical integration). We typically use the forward Euler method, which assumes the flow is constant at its start value throughout the time step. Accordingly, the contents after a time step are the contents before the time step plus the flow rate into the container multiplied by the time step duration. Hence the complete system of links is wrapped up inside an FDTD simulation.

    The reservoir/tank/capacitor is replaced with a phantom conductance and flow generator pair. The value of phantom conductance used is again the instantaneous derivative of the flow w.r.t. the potential. The value of the phantom flow is set so that the correct potential difference is developed over the component, corresponding to its current state (from the state vector).

    Adaptive Time Step

    Where we are iterating within a time step owing to the presence of non-linear components, we can typically compromise a little in accuracy by truncating that iteration a little early and instead using a smaller time step for the outer FDTD process. When modelling real-world systems, where component tolerances are about one percent (as is typical for the conductance of an electrical resistor on a circuit board or of pipe diameters in an oil refinery), our answer only needs about 3 significant figures of accuracy. Overall, we will use an adaptive control mechanism that tracks how well our non-linear iterations have converged (which is readily apparent from back substitution in the simple link models or by counting iterations) and uses a smaller time step if accuracy or convergence appears to be going down, or goes back to a checkpoint and re-runs with a smaller time step if things have gone wrong.

    (We have just studied several parts of the basic operation of the SPICE simulator that is universally used for analogue integrated circuit design. I hope you can see how to use it for other systems, such as the world economy or the struts under a drilling platform in the North Sea.)


    Part 11 - Additional Topics

    Additional Topics will not be lectured, more than likely.


    END.


    Questions Answered

    Booth Multiplication

    Q. In page 22 of the Numerical Methods handout, I think there is a mistake in the function mpx2. I believe that if the carry parameter is initially set to 0, then it will never become non-zero, so the function is identical to mpx1. This is because (x mod 2) is either going to be 0 or 1, so with carry = 0, n will be either 0 or 1. (n = 0 or 1) Then carry’ will always be 0 because of the case statement, so on the next loop the new carry will still be 0. I’ve run a test of this to be sure, and it’s clear that carry is never 1.

    A. That's correct. The mpx2 example is the same as mpx1 but the code is arranged in a style to look like the immediately-following Booth example.

    Summing Many Numbers

    Q. Why should we first combine together numbers of opposite sign and similar magnitude? Surely this leads to lots of cancellation?

    A. Yes, it does lead to cancellation and hence loss of significance. But at least the resulting number will be small and not cause severe underflow in other parts of the accumulation. Ultimately, you simply have to keep more precision using Kahan's method applied to double precision or an off-the-shelf, high-precision library, as demonstrated under alternative techniques.

    Semi-Monotonic

    Q. What precisely is the definition of a semi-monotonic approximation?

    A. The definition of semi-monotonic I gave was that the derivative only changes polarity when it is supposed to. This would be not at all for the log and exp functions and every 180 degrees for sine and cosine. Changes in polarity owing to rounding or truncation errors at other places are banned. (I notice that the Wikipedia page for monotonicity does not define the term but the second Google answer, from Oracle's/Sun's original Java documentation, says: whenever the mathematical function is non-decreasing, so is the floating-point approximation, likewise, whenever the mathematical function is non-increasing, so is the floating-point approximation. Not all approximations that have 1 ulp accuracy will automatically meet the monotonicity requirements.)

    Q. Which definition of macheps is correct?

    A. It is hard to argue whether the one I give is "right" per se, but it does appear in the standard and it corresponds to the idea of quantisation step size used in other fields, such as pixels and ADC resolution in audio. The widely-used alternative has macheps half the size. The Wikipedia page currently (May 2018) acknowledges that there are two definitions in wide use but its own presentation of which is which is confused in places!

    Q. I am a first-year student taking the Computer Science paper 1 course. Considering that we did not cover the following algorithms in detail -- Long Multiplication, Standard Long Division, Fixed-Latency Long Division and the Numeric Base (Integer Input, Integer Output, Floating Point Input, Floating Point Output) -- I was wondering whether these are examinable and are we expected to know these algorithms for the exam? (The algorithms are at the end of the first part of the lecture notes.)

    A. The variable-latency long division is exactly what you learned at school when taught to divide, so you should know that one for sure. The naive and Vedic long multiply code was skipped and is listed above as non-examinable. You do not have to know the details of the fixed-latency method of division. Integer input is pretty much trivial - you should be able to write that in any programming language where you have some proficiency. The floating-point input is an exercise you should do for your supervisor. The integer output and floating-point output are variations on the variable-latency divide function and you should understand the principles and cost. Reproducing runnable code is not called for. Overall, we note that the input methods are essentially polynomial evaluation where the digits are the coefficients of the powers of the base. And that the respective output routine is a division to get back the digits.


    END (C) 2017-18 DJ Greaves.