University of Cambridge, Computer Laboratory

Numerical Methods Learners' Guide

The course's official syllabus is on the web page NumMethods.

This document expands each syllabus line to a few sentences and recaps the course narrative.

2017/18: This guide is being updated just before or after each lecture.

Overview

This course is a general introduction to the mechanics of numerical computation. We look at some commonly-used algorithms and understand their performance in terms of run time and accuracy. The central part is an introduction to traditional numerical analysis, which is mostly about the growth of worst-case error under various conditions. The early part is a very deep dive into IEEE floating-point representation, so that the sources of rounding error and the true cost of operations are properly understood. The final section, concerning FDTD simulation in scalar and matrix form, introduces further useful material that builds on linear simultaneous equation solving to solve integral equations numerically.

Demos

Link: DEMO FOLDER COMMENTARY.

Contents

Topics not covered in 2017/18

Topic(s) that are not examinable this year include:

The additional commentaries for each demo program will be here soon.


Part 1 - What You Know

We use the term 'complete underflow' to refer to the situation where a small number is added to or subtracted from a large one and where, owing to the precision in use, the value of the large one is unaffected by the operation.

We also illustrated 'loss of significance'. Wikipedia: Loss of significance is an undesirable effect in calculations using finite-precision arithmetic. It occurs when an operation on two numbers increases relative error substantially more than it increases absolute error, for example in subtracting two nearly equal numbers (known as catastrophic cancellation).

It is important to understand the basic costs and mechanics of low-level operations. Why does division by a constant take longer than multiplying by its reciprocal? Why is it slower to store numbers in XML, JSON or ASCII compared with binary? What errors might arise?

Division is an expensive operation that should generally be replaced with a faster approach when possible. For instance, rather than frequently dividing by numbers read from a table, pre-compute the reciprocals and use multiplication instead. Or even do the whole computation with the summation of logarithms.

Note that division in binary is a little easier than in higher bases since 'it either goes or it won't go' compared with pen-and-paper division in base 10 where one has to guess how many times it goes and then multiply by the guess to find out.

Multiplication too has traditionally been an expensive operation. But in the last decade, with smaller-and-smaller transistors being available, only low-performance embedded computers, found in devices such as toys and IoT gadgets, are likely to be slower at multiplying than at adding, and then only by a small factor of 3 to 5.

Converting numbers between human-readable decimal and internal binary form is an expensive operation. And storing human-readable numbers in ASCII or UTF-8 uses more memory: A single-precision number might print as 12 characters needing 13 bytes with an additional comma delimiter, but the binary version is only 4 bytes.

Multiplication: mpx1 shows simple long multiplication, base 2, taking n iterations to multiply n-bit numbers. mpx2 is the same code re-factored to look like Booth's algorithm. Booth uses n/2 iterations and is a base (or radix) 4 implementation. Simple hardware implementations of these algorithms would use one processor clock cycle per iteration. Today's processors might typically multiply using long multiplication base 65536 (16 bits at a time). If your hardware clock frequency is 2 GHz and a 64-bit multiply takes four clock cycles (because it is base 65536), then the multiply will take 2 nanoseconds.
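
For concreteness, here is a minimal C sketch of the base-2 long multiplication that mpx1 illustrates (the function name and types are my own, not taken from the demo code): one shift and one conditional add per multiplier bit, i.e. n iterations for n-bit operands.

  #include <stdint.h>
  #include <stdio.h>

  /* Base-2 long multiplication: one conditional add and one shift per bit,
     so 32 iterations for 32-bit operands.  A radix-4 (Booth-style) version
     would retire two multiplier bits per iteration instead.                */
  uint64_t mpx1_sketch(uint32_t x, uint32_t y)
  {
      uint64_t acc = 0, addend = x;
      for (int i = 0; i < 32; i++) {
          if (y & 1) acc += addend;   /* 'it either goes or it won't go' */
          addend <<= 1;               /* shift the partial product up    */
          y >>= 1;                    /* next multiplier bit             */
      }
      return acc;
  }

  int main(void)
  {
      printf("%llu\n", (unsigned long long)mpx1_sketch(1234567, 89101));
      return 0;
  }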

The naive and Vedic long multiply code was not mentioned in 16/17.

Base Conversions

Reading in a number in base 10 is essentially a polynomial evaluation where the digits are the coefficients and the argument is the base. We multiply as we go, taking advantage of the fact that the computer has multiply facilities in hardware that operate in the base we want the output to be in (binary).
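
A minimal sketch of this (my own code, not the lectured listing): the running total is multiplied by the base for each new digit, i.e. Horner-style polynomial evaluation.

  #include <stdio.h>

  /* Read an unsigned decimal string: multiply the running total by ten
     and add each digit.  The hardware multiply works in binary, which
     is exactly the base we want the result stored in.                  */
  unsigned long parse_decimal(const char *s)
  {
      unsigned long v = 0;
      while (*s >= '0' && *s <= '9')
          v = v * 10 + (unsigned long)(*s++ - '0');
      return v;
  }

  int main(void)
  {
      printf("%lu\n", parse_decimal("31415926"));
      return 0;
  }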

Writing out a number in base 10, for human consumption, conversely, requires long division since we do not have base 10 hardware division facilities in modern computers. (Early IBM machines supported it for the COBOL language.)

Because division is a costly operation, output in base 10 is quite CPU-intensive and should be avoided if the numbers are only to be read in again by a computer.
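
Conversely, a sketch of integer output (again my own illustration of the principle only): repeated division by ten peels off the digits least-significant first, so they must be buffered and emitted in reverse.

  #include <stdio.h>

  /* Convert an unsigned integer to decimal ASCII.  Each digit costs a
     division (or a multiply by a pre-computed reciprocal); digits come
     out least-significant first and are reversed for printing.         */
  void print_decimal(unsigned long v)
  {
      char buf[24];
      int n = 0;
      do {
          buf[n++] = (char)('0' + v % 10);
          v /= 10;
      } while (v != 0);
      while (n > 0) putchar(buf[--n]);
      putchar('\n');
  }

  int main(void)
  {
      print_decimal(31415926UL);
      return 0;
  }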

Floating-point output, in human-readable form as printable ASCII characters, is especially expensive. We observe that printing out a floating-point number that is an integer, or the integer part of a small (i.e. less than 2^mantissa-precision) floating-point number, is not too difficult since the binary encoding of the integer part is manifest in the mantissa (with a little binary shifting). Our approach, therefore, for all other numbers and precisions, is to multiply or divide the mantissa by powers of ten until it contains a bit pattern that can be considered a suitable unsigned binary integer. We then print this out using, essentially, our previous integer binary-to-ASCII method. We keep track of how many powers of 10 of scaling were applied and insert a decimal point in the ASCII digit string at the appropriate place. Note: we do not really divide by powers of 10; we are good computer scientists, so we multiply by pre-computed reciprocals.

Output also typically involves beautification, such as suppressing leading zeros, avoiding scientific notation if the decimal place lies somewhere in the mantissa, and printing NaN and Inf in the special cases that we cover in part 2.

The large listing in C in the projection pack is provided just to emphasise the complexity and no knowledge of C is expected. You can try the ML and/or Java alternatives in the demos folder. You are expected to be able to do long division and integer-to-ASCII conversion (and to understand that these are the same) but getting all of the details of float-to-ASCII correct under exam conditions would not be asked.

Also, pay attention to the 'floor' operation, and its relations, which in many languages are tacitly invoked when mixing types in expressions. These also involve the mechanics of base conversion, and hence are expensive and should be used sparingly.

Floating Point Output Worked Example

The details of an example like this one are worked through in the lecture once the single-precision IEEE format has been presented in detail.

Let's look at the output routine operation for 3/256. This is a good starting example since it clearly only has two bits in its mantissa and one will be hidden. In decimal this is exactly 0.01171875. If we want to print this to 3 significant figures, we will ultimately want to apply the integer output routine to 117. Note that 117 is 111_0101 in binary.

Notation: We put an underscore in our long numbers every four digits to make them readable, but the underscore has no meaning, like the comma used every 3 digits in English prose. We put 0x in front of a digit string to denote it as hex (hexadecimal).

The input number is 2^(-7) + 2^(-8), so written as fixed-point binary this is 0.0000_0011.

As floating-point, we have a mantissa of 3 and an exponent of -7, which is 120 in excess-127 format, or 0x78. Packing this into fixed-point fields (hiding the msb of the mantissa) we have 0x3c40_0000. This would be the input to our output routine.

To output as three significant figures we will need to multiply or divide by powers of ten until 100 <= m < 1000. Hence we achieve 117.1875 by multiplying by 10^4. We knew to multiply up and not down from our initial number being less than the lower bound, 100.

117.1875 in single precision is 0x42ea_6000.

Our computer has single-precision floating-point arithmetic hardware. Using the exponential chop in the presented code, where the powers of 10 used are themselves powers of 2, a single multiply by 10^(2^2) = 10^4 was needed.

Finally, we need to extract the bit field 111_0101 from this result and print it. The first 1 is the hidden bit and the 11_0101 is clearly extractable with a shift and mask when we write out the hex number in binary:

  0x42ea_6000 =   0 1 0 0   0 0 1 0   1 1 1 0   1 0 1 0   0 1 1 0   0 0 0 0   0 0 0 0   0 0 0 0
                  s e e e   e e e e   e m m m   m m m m   m m m m   m m m m   m m m m   m m m m
                                        ~ ~ ~   ~ ~ ~                   

The amount to shift right by, 17, is given by simple linear arithmetic on the current exponent (133), the excess (127) and the mantissa width (23): 23 - (133 - 127) = 17.
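
This extraction can be checked mechanically. A small sketch (my own, purely to verify the worked example) unpacks 117.1875 and recovers the integer 117 using the shift described above:

  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  int main(void)
  {
      float f = 117.1875f;
      uint32_t w;
      memcpy(&w, &f, sizeof w);                      /* view the IEEE bit pattern   */
      printf("packed word    = 0x%08x\n", (unsigned)w);        /* expect 0x42ea6000 */

      unsigned exponent = (w >> 23) & 0xff;                    /* expect 133        */
      uint32_t mantissa = (w & 0x7fffff) | 0x800000;           /* re-insert hidden 1 */
      uint32_t intpart  = mantissa >> (23 - (exponent - 127)); /* shift right by 17 */
      printf("exponent field = %u\n", exponent);
      printf("integer part   = %u\n", (unsigned)intpart);      /* expect 117        */
      return 0;
  }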

Part 2 - IEEE Standard

You need to memorise the single-precision floating point bit fields as lectured. Exam questions will often rely on this detail. You will not be expected to remember the precise conventions for storing NaN, infinities and zero. De-normal numbers may be ignored, although their definition is intrinsically obvious if you think about it. You should also note that double-precision has a wider exponent range and more than double the precision in the mantissa.

We will use the abbreviation 'ulp' to stand for 'unit in the last place' denoting the significance of incrementing the binary (or other base) representation by one.

We note that there are a few weirdnesses in the standard, such as the two zeros that are defined to be equal for comparison.

Rather than throw an exception on divide-by-zero, log -0.00000001 or arcsin 1.00000001, that would potentially halt a massive computation, discarding all results everywhere in huge arrays, we prefer a taint approach, replacing the result with the special NaN value that propagates to the final result but which may remain localised in its poisonous effect.

Part 3 - Floating Point Operations

This section of the printed slides is quite verbose, so there's little to add here, but we can repeat the key points.

The four arithmetic operations are defined such that we treat the arguments as exact and then round to the nearest representable result. Always rounding the mid-way case upwards would be biased, so we need to vary the tie-break following some unbiased rule: round-to-even is chosen. Other rounding modes are needed in rare cases, like the interval arithmetic discussed briefly later.

Despite the standard, we note that compilers sometimes compute intermediate results to a higher precision than that of the operator's arguments or the required result. Sometimes, on the other hand, they do not. This means that an expression such as a+b*c can have a different result depending on whether the product was rounded before the summation.
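
The following small experiment (my own, not from the demo folder) shows the effect: the same expression gives two different single-precision answers depending on whether the product b*c is rounded to single precision before the addition or kept in a wider format until the end.

  #include <stdio.h>

  int main(void)
  {
      float a = 1.0e8f, b = 1.0e8f, c = 1.0f + 1.0e-7f;

      float  prod_f = b * c;                    /* product rounded to single first  */
      float  r1 = a + prod_f;

      double prod_d = (double)b * (double)c;    /* product kept in double precision */
      float  r2 = (float)(a + prod_d);          /* one rounding, at the very end    */

      /* On a typical IEEE machine these differ by one ulp at this magnitude. */
      printf("rounded early: %.1f\nrounded late:  %.1f\n", r1, r2);
      return 0;
  }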

The standard libraries are not normally obliged to be engineered such that they always return the nearest representable answer based on their argument being exact. The arguments are very rarely exact in practice, having been produced by an earlier part of the computation and suffered rounding (and loss of precision/underflow, truncation and input quantisation). So we generally specify that a library must give a result that can be out by one ulp provided it remains semi-monotonic.

Semi-monotonic means that the first derivative retains its correct local polarity. For instance, if the output should go down when the input goes up at this point (e.g. the first quadrant of cosine), then it does go down, or stays the same.

The standard library contains all the standard functions written out in old books of mathematical tables or supported by a scientific calculator: trigs, logs, roots, hyperbolics and their inverses.

Error Accumulation Models

We always aim for an expected error of zero; in other words, we are unbiased.

But we might consider that the expected absolute value of error builds up randomly: unfortunately it is often systematic.

The input quantization error arises from digitising the real world which is continuous. For example, CD-quality audio uses 16 bits per sample of air pressure: hence a unit in the last place (ulp) is 2^-16 of the peak-to-peak pressure variation available to represent the sound.

We generally need to consider worst-case error accumulation, where absolute errors sum under addition and subtraction and relative errors sum under multiplication and division. In detailed analysis, when we work through a computation, we commonly need to convert between relative and absolute error each time the output from one class of operator is an input to the other. If we know the expected magnitude of the numbers, we can convert between relative and absolute error easily. Failing that, we can assume the current magnitude is approximately unity: i.e. the relative error is the absolute error. For constants in the program code, we must look to see whether they are precisely-representable, or whether they too carry an error.

To show systematic amplification of the input quantization error, there are three plots: the first shows an unbiased random walk in 1-D with fixed step size, A, of unity. This shows an expected mean deviation of 0 and a mean absolute deviation of roughly A.sqrt(N) after N steps.

The second shows the absolute error (relative will be the same shape) in the value of single-precision variable p in the loop

  p = 1.0; for (i = 0; i < 100; i++) p = p * x;

which was plotted by subtracting the values of p from those from another copy of the loop that used double precision. This too shows a random walk of similar nature, with the (variable) step size, A, averaging roughly macheps/2.

The second plot used various values of x hence multiple, yet deterministic, traces. All values of x were precisely-representable as single-precision numbers.

The third plot shows the same as the second but for various values of x that had to be quantised to single precision before starting. This produces systematic growth of the error that dominates the random walk effect (look at the ordinate scale). The relative error accumulates linearly, as given by the earlier rule of relative errors adding under multiplication.
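
A runnable version of the experiment behind these plots might look like the following sketch (my own reconstruction, not the lectured demo): the same product loop is run in single and double precision, using both the quantised and the unquantised value of x as references.

  #include <stdio.h>

  int main(void)
  {
      float  x  = 1.1f;        /* 1.1 is not exactly representable in binary        */
      double xq = x;           /* the quantised value actually used in single       */
      double xt = 1.1;         /* (nearly) the true value                           */
      float  p  = 1.0f;
      double pq = 1.0, pt = 1.0;

      for (int i = 1; i <= 100; i++) {
          p  = p  * x;         /* the single-precision loop under test              */
          pq = pq * xq;        /* reference 1: same quantised x (second plot)       */
          pt = pt * xt;        /* reference 2: unquantised x (third plot)           */
          printf("%3d  rounding-only error %+.3e   with input quantisation %+.3e\n",
                 i, ((double)p - pq) / pq, ((double)p - pt) / pt);
      }
      return 0;
  }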

The takeaway is that local random walks do happen but are commonly negligible and any systematic error propagation more-properly needs our attention.

We define macheps, and it is worth memorising its order of magnitude for single and double precision.


Part 4 - Simple Maths, Simple Programs

We need to be aware of the structure of error forwarding in our code: does an error persist through to the output or will it be lost as the computation progresses?

Even an ultra simple program, like our example of implementing the quadratic equation root formula, can go wrong. As programmers we need to consider all corners of the operational space and generally add in 'IF' statements to redirect some corners to alternative implementations or error reports.
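
As a sketch of the kind of defensive rewriting required (a standard technique, not necessarily the exact fix presented in lectures): when b*b is much larger than 4ac, one of the two roots from the textbook formula suffers catastrophic cancellation, which can be avoided by recovering that root from the product of the roots instead.

  #include <stdio.h>
  #include <math.h>

  /* Roots of a*x^2 + b*x + c = 0, assuming real roots and a != 0.
     The root whose numerator would subtract two nearly equal
     quantities is obtained from r1*r2 = c/a instead.             */
  void quad_roots(double a, double b, double c, double *r1, double *r2)
  {
      double d = sqrt(b * b - 4.0 * a * c);
      double q = (b >= 0.0) ? -0.5 * (b + d) : -0.5 * (b - d);
      *r1 = q / a;          /* the well-conditioned root            */
      *r2 = c / q;          /* recovered without cancellation       */
  }

  int main(void)
  {
      double r1, r2;
      quad_roots(1.0, 1.0e8, 1.0, &r1, &r2);    /* roots near -1e8 and -1e-8 */
      printf("%.17g %.17g\n", r1, r2);
      return 0;
  }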

Note that floating-point addition and subtraction are not generally associative: better accuracy can be achieved by controlling the association, e.g. perhaps by sorting the summands first.

Remember you do not have to naively use the built-in numerical facilities in all circumstances. Sometimes it is better to perform long arithmetic using multiple words in some base of your own choosing. Libraries are normally available. Kahan summation is a neat little example, where two words are used with very little management overhead, to get nearly double the precision over one word (details not examinable).
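
A minimal sketch of Kahan (compensated) summation, for interest only since the details are not examinable: a second word carries the low-order part that would otherwise be lost in each addition.

  #include <stdio.h>

  /* Kahan compensated summation: 'comp' accumulates the low-order bits
     lost when each term is added to the running sum.                   */
  double kahan_sum(const double *a, int n)
  {
      double sum = 0.0, comp = 0.0;
      for (int i = 0; i < n; i++) {
          double y = a[i] - comp;       /* corrected term                   */
          double t = sum + y;           /* big + small: low bits of y lost  */
          comp = (t - sum) - y;         /* recover them (algebraically 0)   */
          sum = t;
      }
      return sum;
  }

  int main(void)
  {
      static double a[1000000];
      for (int i = 0; i < 1000000; i++) a[i] = 0.1;
      double naive = 0.0;
      for (int i = 0; i < 1000000; i++) naive += a[i];
      printf("naive: %.15f\nkahan: %.15f\n", naive, kahan_sum(a, 1000000));
      return 0;
  }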


Part 5 - Infinite and Limiting Computations

We cannot compute forever. We have to stop. When we stop early, with respect to the underlying mathematics, we introduce a new type of error: the truncation error.

The truncation error arises where we approximate the mathematical 'take to the limit' operation with a value that occurs before we get to the limit.

Truncation of a Limiting Computation

In the differentiation examples we truncated using a finite, non-zero value of h: we did not take h all the way to zero.

When we truncate in this way, we avoid the quantisation and rounding errors arising with small values of h. Small values of h in the differentiation example lead to nearly identical values of f(x) being subtracted, thereby greatly losing precision.

So if one form of error goes down as we increase h, and another goes up, our total error is likely to be minimised for the value of h for which the errors are equal. We can deduce the cross-over point for general continuous functions using Taylor's approximation and assuming that the value of f(x) and of all its derivatives is approximately unity. If we have better information regarding f(x) and its derivatives, then we might use that to more accurately determine the best h.
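
The trade-off is easy to see experimentally. A sketch (my own illustration, with the usual assumption that f and its derivatives are of order unity): sweeping h downwards, the error of the one-sided difference first falls with h (truncation) and then rises again (cancellation and rounding), with the minimum near the square root of macheps.

  #include <stdio.h>
  #include <math.h>

  double f(double x) { return sin(x); }

  int main(void)
  {
      double x = 1.0, exact = cos(1.0);
      /* The error first falls (truncation, roughly proportional to h) and
         then rises again (rounding, roughly macheps/h); the minimum sits
         near sqrt(macheps), i.e. around 1e-8 for double precision.        */
      for (double h = 1e-1; h > 1e-15; h /= 10.0) {
          double approx = (f(x + h) - f(x)) / h;
          printf("h = %8.1e   error = %12.5e\n", h, fabs(approx - exact));
      }
      return 0;
  }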

Note that the square root of macheps is about 10^-8 for double precision and the cube root is LARGER at 10^-5.

We can consider the way in which our truncation error varies according to the value of h to determine the, so-called, order of the method. For instance, in the second-lectured method for differentiation, the truncation error is proportional to h^2 and so we say it is a second-order method.

Truncation of an Infinite Iteration

In the summation of an infinite series, we can commonly truncate when the terms that we are combining become sufficiently small, such that our accumulation so far is sufficiently accurate for our application. We might consider modifying our method, knowing that we are going to truncate it, so that the truncation has better properties, such as being smaller or more evenly spread over use cases (Chebychev later).

The rate at which the truncation error gets smaller, as a function of how many steps we make, determines the so-called order of the method. If we see a pattern in the relative or absolute truncation error from one step to the next, we can potentially estimate the number of iterations needed to solve a problem in advance of starting it. Also, we may get insight into whether the error might get larger as we iterate, which is either a disaster or else a hint that we should seek to reverse the iteration such that the errors progress in the opposite direction.

Our golden ratio iteration multiplied the absolute error by 3 each time when we got it backwards, but divided it by 3 each time, making it ever smaller, when run in the direction that works. We need iterations that scale the error by a factor that lies between -1 and +1: this makes it smaller. Negative ratios are fine; they just mean the error sign alternates each time, giving a zigzag plot.

The binary chop iteration reduces the bound on the error by half on each step. The relative error overall follows a geometric series. This means gaining one more accurate bit per iteration. Information theory tells us that if we are just using the output from a single boolean predicate per iteration, then one bit is all we can gain on average. This is first-order convergence.
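
A sketch of the binary chop (bisection) on a root-finding problem, assuming only that the function changes sign across the starting interval (my own code, not the lectured demo):

  #include <stdio.h>
  #include <math.h>

  /* Bisection: find x in [lo,hi] with g(x)=0, given that g(lo) and g(hi)
     have opposite signs.  The error bound halves every iteration:
     first-order convergence, one extra accurate bit per step.            */
  double bisect(double (*g)(double), double lo, double hi, int iters)
  {
      for (int i = 0; i < iters; i++) {
          double mid = 0.5 * (lo + hi);
          if ((g(lo) < 0.0) == (g(mid) < 0.0)) lo = mid; else hi = mid;
      }
      return 0.5 * (lo + hi);
  }

  double g(double x) { return x * x - 2.0; }    /* root is sqrt(2) */

  int main(void)
  {
      printf("%.15f (cf. %.15f)\n", bisect(g, 1.0, 2.0, 50), sqrt(2.0));
      return 0;
  }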

The Newton Raphson iteration squares the relative error on each iteration. Squaring something less than unity makes it smaller (which is good!). Such squaring doubles the number of accurate digits each iteration, meaning the required number of iterations is the logarithm of the number of significant digits we desire. (Q. Does the squaring mean it is fatal if we start with a relative error greater than 1? A. Our analysis, which simply ignored the higher-order terms in the Taylor expansion, is no longer valid when the error is larger than 1.)

Newton Raphson has second order convergence: the number of iterations needed is the log of number of digits needed which is the log of the log of the denominator in an expression of accuracy.
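
A sketch of Newton Raphson for the square root (the classic example; my own code): each iteration roughly doubles the number of correct digits, so a handful of iterations suffice from a reasonable starting guess.

  #include <stdio.h>
  #include <math.h>

  /* Newton Raphson on g(x) = x*x - a gives the update x' = (x + a/x) / 2.
     The relative error is roughly squared each iteration, so the number
     of correct digits roughly doubles each time.                          */
  double nr_sqrt(double a)
  {
      double x = a > 1.0 ? a : 1.0;     /* crude starting guess, a > 0 assumed */
      for (;;) {
          double xn = 0.5 * (x + a / x);
          if (xn >= x) return x;        /* no further improvement: converged   */
          x = xn;
      }
  }

  int main(void)
  {
      printf("%.15f (cf. %.15f)\n", nr_sqrt(2.0), sqrt(2.0));
      return 0;
  }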

Summing an Infinite Series

Some series converge in maths for all argument values but numerically behave badly with large input values. We should avoid those large values.

Range reduction identities for sine and cosine are obvious from geometry, by subtracting multiples of 2 Pi and subtracting from Pi while swapping one for the other. For the second octant we can interchange sine and cosine and use the first octant. Hence we only need to evaluate them in the first octant (zero to 45 degrees).

For the exponential, in lectures, we looked at taking the argument modulo the log of 2, so we only had to apply the series to the remainder, with the integer quotient becoming an offset to the resulting exponent. Rounding down ensured the series is only applied to positive values, eliminating problems of monotonicity. The biggest number the series is then applied to is log 2, which is less than unity for all bases greater than 2.
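
A sketch of this range reduction (my own illustration): split x as k.ln2 + r with 0 <= r < ln2, sum a short Taylor series for e^r, and put k back as a scaling by 2^k.

  #include <stdio.h>
  #include <math.h>

  /* exp(x) = 2^k * exp(r) where x = k*ln2 + r and 0 <= r < ln2.
     Rounding k downwards keeps r non-negative and below ln 2, so the
     Taylor series is only ever summed for a small positive argument.  */
  double exp_sketch(double x)
  {
      const double ln2 = 0.69314718055994530942;
      double k = floor(x / ln2);
      double r = x - k * ln2;
      double term = 1.0, sum = 1.0;
      for (int n = 1; n <= 20; n++) {       /* 20 terms is ample for r < 0.7 */
          term *= r / n;
          sum += term;
      }
      return ldexp(sum, (int)k);            /* scale by 2^k */
  }

  int main(void)
  {
      printf("%.15g (cf. %.15g)\n", exp_sketch(5.3), exp(5.3));
      return 0;
  }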

For other problems, range reduction might be better towards unity. For instance, for finding square roots, the Taylor expansion of (1+x)^0.5 works well, provided we keep x small, meaning the argument to square root is close to 1. That expansion does not converge if x is greater than 1, so root 2 is the highest we can directly countenance. A good range reduction approach for square root is a matter of dividing or multiplying by factors of 4, which is a simple shift by even powers of two, and the answer needs to be corrected by shifting by half the pre-shift amount, which is also a simple power of two.

Part 5b - Approximate Methods

Quadrature

We cannot sum to infinity in finite time, so truncation is applied, introducing an error.

Sometimes we need to apply a numerical method to the data generated by a function that is too complex to analyse or is just stored in a table. Integrating is the premier example. Numerical integration is called quadrature. At school we may have looked at Simpson's Rule, which puts a parabolic spline through each grouping of three adjacent points and we can find the area under the splined approximation using algebra.

Simpson's Rule has order of at least 4 (halving h divides the truncation error by at least 16). There have been many books and papers that give better-and-better quadrature techniques. As Computerists we should know they exist and we should normally use a pre-existing library (e.g. matlab) rather than coding our own.
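
As a concrete sketch (my own code; the formula would in any case be supplied if needed in an exam question): composite Simpson's Rule fits a parabola through each group of three adjacent samples, giving the familiar 1, 4, 2, 4, ..., 4, 1 weighting.

  #include <stdio.h>
  #include <math.h>

  /* Composite Simpson's Rule over [a,b] with n strips (n must be even):
     integral ~= h/3 * ( f0 + 4 f1 + 2 f2 + 4 f3 + ... + 4 f(n-1) + fn ). */
  double simpson(double (*f)(double), double a, double b, int n)
  {
      double h = (b - a) / n;
      double s = f(a) + f(b);
      for (int i = 1; i < n; i++)
          s += (i % 2 ? 4.0 : 2.0) * f(a + i * h);
      return s * h / 3.0;
  }

  int main(void)
  {
      const double pi = 3.14159265358979323846;
      /* Halving h should divide the truncation error by roughly 16. */
      printf("n=8:   %.12f\n", simpson(sin, 0.0, pi, 8));
      printf("n=16:  %.12f\n", simpson(sin, 0.0, pi, 16));
      printf("exact: 2\n");
      return 0;
  }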

The demo (being added to the demos git repo) showed that rather few strips (the reciprocal of the fifth root of macheps) is the best number in general. It also showed that two strips are enough for a quadratic function, hence exposing clearly the rounding error growth that degrades the result beyond the optimum number (25 to 35 for single precision). The interesting part of the plot is the extreme right-hand side: all power laws form a straight line on a log-log plot and the order of the algorithm sets the gradient. The rest of the plot shows rounding noise, which has square-root growth.

Simpson's Rule Demonstration of Order

Simpson's Rule formula will be given if needed in an exercise or examination question.

Splines And So On

The next part of this section, up until we resume talking about range reduction in the Taylor series for sine, is non-examinable this year (17/18), but the important takeaway is the principle of expressing a function as a weighted linear combination of some basis 'vectors'. I put the word 'vectors' in quotes since we are talking about scalars in reality, but everybody is familiar with expressing a point in k-D space as a weighted sum of k basis vectors.

The Lagrange basis set of size 2n+1 is the 'obvious' set of functions that are unity at some given integer in the range -n to +n and zero at the other integers. We can then directly construct a spline from tabulated values as the weighted sum of these basis functions.

A plot of the Lagrange reconstruction reveals that the error is non-uniformly spread. Can we do better? Yes, we do not have to use evenly-spaced integers as our basis points, instead we use a non-uniform pattern, such as that attributed to/invented by Chebychev.

Sometimes we cannot afford the time or energy to compute an accurate result and are happy with a coarser answer. This may hold especially on low-power embedded systems, such as a smoke and motion sensor in a ceiling tile.

For graphics applications we rarely need more than about 12 bits of accuracy - which is half single precision. In a graphical computer game we need speed. Where a function is monotonic and we are comparing values of its output we can compare its input instead!

Cordic

The Cordic algorithm can perform a number of scientific computations using fixed-point integer arithmetic. The most common examples are converting between rectangular and polar co-ordinates and back. The straightforward computation of sine and cosine was all that was lectured in recent years. Since we are working with fixed-point fractions, we might adopt some convention on the location of the implied binary point, or we might just assume some fixed denominator, that is not necessarily a power of 2, and use run-time values for the numerator.

Cordic computes sine and cosine of an angle by rotating the unit vector from the horizontal axis to the desired angle in a number of well-chosen steps. Each step has the property that its corresponding rotation matrix is almost trivial to apply (i.e. to multiply by). The rotations are taken from a precomputed table of angles whose tangents are fractional powers of 2 and the only decision at run time is whether to add or subtract the next value from the current angle as we reduce the current angle in absolute terms towards zero. We use rotations that are more than strict rotations in that they also extend the length of the vector by a known amount. If we always use exactly the same number of rotations, we can compute offline the total amount of length extension that happens and factor that into either our fixed-point denominator or the length of the vector we start with. That factoring is also offline.
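
A minimal sketch of the rotation-mode Cordic described above (my own code, not the lectured demo): the angle table and the length-correction factor K are precomputed here at start-up (using the maths library for convenience); the iterations themselves use only shifts and adds on 16.16 fixed-point values. Arithmetic right shift of negative values is assumed, as on essentially all current machines.

  #include <stdio.h>
  #include <math.h>

  #define ITERS 16
  #define FRAC  16                               /* 16.16 fixed point */
  #define F(x)  ((int)lround((x) * (1 << FRAC)))

  void cordic_sincos(double angle, double *s, double *c)
  {
      int atan_tab[ITERS];
      double k = 1.0;
      for (int i = 0; i < ITERS; i++) {
          atan_tab[i] = F(atan(ldexp(1.0, -i)));      /* angles with tan = 2^-i   */
          k *= 1.0 / sqrt(1.0 + ldexp(1.0, -2 * i));  /* total length correction  */
      }
      int x = F(k), y = 0, z = F(angle);              /* start on the x-axis, pre-scaled by K */
      for (int i = 0; i < ITERS; i++) {
          int xs = x >> i, ys = y >> i;               /* shifts replace multiplies */
          if (z >= 0) { x -= ys; y += xs; z -= atan_tab[i]; }
          else        { x += ys; y -= xs; z += atan_tab[i]; }
      }
      *c = x / (double)(1 << FRAC);
      *s = y / (double)(1 << FRAC);
  }

  int main(void)
  {
      double s, c;
      cordic_sincos(0.5, &s, &c);
      printf("%f %f   (library: %f %f)\n", s, c, sin(0.5), cos(0.5));
      return 0;
  }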


Part 6 - Illness

The lectured matrix inversion example shows that nasty things can happen even when the input numbers look innocent. This matrix problem has a graphical projection interpretation that we can readily visualise: it is an information-losing projection that will be tough to invert.

In general, we need other approaches to spotting a poorly-behaving problem. It may typically just be ill-conditioned for some regions of its domain (input space).

Some functions or programs may simply have high partial derivatives between an input and an output. These are worth knowing about and we should be aware of our input quantisation error or sample noise on such an input, since this could dominate the output error. In other words, our output is very sensitive to this input and so we should be careful when collecting it from the outside world. But a high gain is not an indication that the computation itself will be unstable. A high gain does not amplify the relative error, as shown in the 10^22 example in the notes.

We can measure the error amplification using random input perturbations. This is a generalisation of the numerical approach to finding the derivative explored earlier.

Or, if we have the inverse function, we can check backwards stability. A backwards-stable function is 'round-trippable', with its inverse giving a good approximation of the input from the output.

Or if we have a cheap way of finding whether the answer is correct, we should apply that.

To understand the stability of the computation itself, we can look at the change in relative error introduced by our program: this enables us to speak of its condition number at that locality.

Where our method uses some step size, h, we may need to use different values for different parts of the domain, perhaps dynamically adjusted as in the SPICE algorithm at the end of this course.

Some functions are chaotic and remain poorly-behaved however much love we give them: "one flap of a sea gull's wings would be enough to alter the course of the weather forever".


Part 7 - Solving Systems Of Simultaneous Equations

One way to solve a system of linear simultaneous equations is to multiply the right-hand side by the inverse of the coefficient matrix. But this is not often a good approach since it involves more computation than L/U decomposition and can be less stable.

Gaussian elimination converts the coefficient matrix into upper-triangular form. This has cubic cost. We aim to make it well behaved by good pivot selection. Keeping track of the modifications made to the right-hand sides gives us a corresponding lower-triangular matrix. (The lower matrix is mathematically formed by a product of matrices that are identity matrices with a set of non-zero values beneath one of the ones, but forming their product ends up never multiplying any of these together, so computationally we can just insert them with store instructions, as in the demo code.) Multiplying the pair together gives us back the original coefficient matrix.

After L/U decomposition, we can solve for any right-hand-side vector in quadratic time. If we solve for the set of right-hand sides that are the columns of the identity matrix, the result is the inverse of the coefficient matrix, and this is a good technique for matrix inversion should we need it. (Cramer's rule has exponential cost, or quartic if you borrow a little help from L/U decomposition.)
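
A minimal sketch of the idea (my own code, with no pivoting, so it is for illustration only; real code would select pivots as described above): Gaussian elimination to L/U in cubic time, then one forward and one backward substitution per right-hand side in quadratic time.

  #include <stdio.h>

  #define N 3

  /* Decompose a into L and U in place (L has an implicit unit diagonal and
     occupies the space below it).  No pivoting: for illustration only.     */
  void lu_decompose(double a[N][N])
  {
      for (int k = 0; k < N; k++)
          for (int i = k + 1; i < N; i++) {
              a[i][k] /= a[k][k];                 /* multiplier, stored as L    */
              for (int j = k + 1; j < N; j++)
                  a[i][j] -= a[i][k] * a[k][j];   /* eliminate below the pivot  */
          }
  }

  /* Solve L U x = b by forward then backward substitution (quadratic cost). */
  void lu_solve(double a[N][N], const double b[N], double x[N])
  {
      double y[N];
      for (int i = 0; i < N; i++) {
          y[i] = b[i];
          for (int j = 0; j < i; j++) y[i] -= a[i][j] * y[j];
      }
      for (int i = N - 1; i >= 0; i--) {
          x[i] = y[i];
          for (int j = i + 1; j < N; j++) x[i] -= a[i][j] * x[j];
          x[i] /= a[i][i];
      }
  }

  int main(void)
  {
      double a[N][N] = {{4, 1, 2}, {1, 3, 0}, {2, 0, 5}};
      double b[N] = {7, 4, 7}, x[N];
      lu_decompose(a);
      lu_solve(a, b, x);
      printf("%g %g %g\n", x[0], x[1], x[2]);   /* expect 1 1 1 */
      return 0;
  }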

Commonly our coefficient matrix has special symmetries that mean short-cut techniques can be used. If it is transpose-symmetric and positive definite (not lectured), then we can use Cholesky which has lower cost (while still cubic). Our circuit flow matrices (later) have these properties.

L/D/U decomposition gives L and U with ones on their leading diagonals using an auxiliary diagonal matrix D and is helpful as a variant of Cholesky without square roots.

Not all systems have unique answers: the equations might be linearly-dependent or in short supply, in which case there are multiple possible answers (we may select the best answer using a regression fit). If there are too many equations there may be no solutions.

When we have too many simultaneous equations (rows in the matrix expression thereof), we might want to choose a best-fitting answer, such as the one with the least sum squared difference between the r.h.s. yielded and the target. This is multi-dimensional linear regression.

When we have too few equations, the solution space is a line, plane or volume ...

Simulated annealing searches for low cost (or high benefit), as returned by a scalar metric over a multi-dimensional space. To try to avoid false minima, it starts with large jumps in the search space and with only marginal preference to retaining improvements. As the so-called temperature falls, we make smaller jumps. Below a certain temperature we only accept improvements: this is the so-called quench phase.


Alternatives to Floating Point

We have spoken of fixed and floating-point representations, but these are not the only ways of computing.

Unum Posits (not examinable) are a very new floating-point encoding (2017) that is better than the IEEE standard in a number of ways. Any number of bits can be used to store a posit. All bit settings of the word have a meaning. Word size can be extended by just adding trailing zeros. Fields change size dynamically: e.g. the mantissa has no bits when the exponent is large. They have greater range and precision for the same number of bits as IEEE. Hardware implementations as part of mainstream processor instruction sets may well appear in the future. (Meanwhile, FPGA implementations are being explored: LINK.)

Interval systems model each value as a pair of values with the idea that the true answer lies between these values. All of the standard operators are overloaded to apply to such pairs, but where program control flow is predicated on an interval, a resolution function must typically be applied to prevent exponential growth in the number of possibilities represented.

Multi-digit arithmetic is just a generalisation of the school 'long' forms of arithmetic. Infinite precision, perhaps based on lazy lists, is commonly used as well. For instance, there is the dotnet BigInteger library or the standard gnu mpfr demo in the projection slide pack.

Symbolic systems exist that hold expressions as formulae rather than numbers and apply most operators using distributive laws, with simple identities applied to minimise the growth in complexity. For instance, the cubing of a square will become a raising to the sixth power. That little example also shows that they will perform computations where the results are easy to represent precisely, such as the product of two natural numbers.

We did not cover these systems in lectures in any detail, but, apart from Posits, you are expected to know enough about what they are to deduce simple answers regarding how they might behave. In a Tripos Exam, if this is ever to be examined, we would provide sufficient explanation of the type of system being considered.


Simulations

To simulate the real world, which is analogue, on a digital computer, we commonly need to quantise in three different ways:

  • the values of properties (like fill depth in a bucket),
  • an elemental physical quantisation (like dividing an area into patches of finite size), over which we assume the properties are constant, and
  • a time-domain quantisation, over which we assume rates of change of properties are constant.

    The dart-throwing Monte Carlo simulation converges poorly, obeying simple statistics. The relative error in a single throw is roughly unity. Assuming an unbiased simulation, after N throws the r.m.s. noise will be roughly square-root N. Antithetic improvements were briefly mentioned but are not examinable.

    FDTD quantisation of the continuum introduces errors just like the truncation errors in the numerical differentiation example earlier. There are multiple quantisation parameters in general, such as the time step and the elemental length. Adaptive time steps and patch sizes are commonly needed for efficient computations that retain accuracy near sharp edges.

    The values that need to be retained from one time step to another form the state vector of the system. In a finite-element simulation, we can consider there to be a state vector for each element or else the whole lot is the state vector.

    The ways that errors introduced by these approximations to the continuum compound varies greatly according to the nature of interaction between components. Commonly behaviour is good: these truncation errors can either tend to cancel out or at least not accumulate systematically. This arises from the inherent stability of many of the systems we explore, arising from their good design (by man or nature). For instance, the error might change sign every step. But each situation needs to be analysed individually.

    Stencils

    Most commonly, a FDTD simulation is making a numerical approximation to the result of a set of integral equations. This models the dynamic evolution of the system.

    The forwards stencil is the most obvious approach for such a problem and works well for many systems in practice. See 'The Forward Stencil (Euler's Method) for ODEs'. In the last part of the course, we will place the forward stencil in flow matrix form under the title 'time-varying components'.

    The forward stencil uses the rate values at the end of one timestep as though constant throughout the next timestep. It is also known as the forward difference method and Euler's Method. The new value of a state vector element is its current value plus its rate of change multiplied by the time step size. So it is nothing more than numerical integration by the Riemann Sums method (see the quadrature slide, where this is also labelled as the mid-point rule, but that is a confusing name in this context).

    The simple forwards stencil serves well when high accuracy is not needed, as in computer games and so on.

    The backward stencil uses the rate values at the end of the current timestep as though constant in this current timestep. When we can work out the backwards stencil (by analysis or iteration) then we could potentially just use it, but this might be no better than the forward stencil. Better is to take an average of the forward and backward approaches, giving the Crank-Nicolson method.

    Forward Stencil Stability and Accuracy

    We will consider here just the stability and accuracy when modelling exponential curves. Curves that have zero second derivative will be completely accurate and are called straight lines.

    The forward stencil essentially takes only the linear term from a Taylor expansion and so the error introduced at each step is proportional to the step size squared. This was illustrated in the lecture. However, as we iterate in the time domain, to cover a given modelled elapsed time, we need a number of steps inversely proportional to the step size. Hence, when errors are accumulating systematically (in theory, and not lectured), the overall error is proportional to the step size, meaning Euler's method is a first-order system: halving the step size doubles the accuracy. But this is only in the worst case, and most practical systems are much better behaved.

    With the forward stencil, an exponential decay simulates as a geometric decay. The geometric decay is nothing more than a sampled exponential function, but it is the wrong exponential: it has a slightly faster decay rate, so the time domain has been slightly dilated, but we have a constant relative error in rate of time.

    The backwards stencil is wrong in the opposite direction. Crank-Nicolson cancels out these two errors, giving a second-order method (details not lectured). Actual numbers were presented in the lectures.

    Consider the forward stencil applied to the exponential decay dy/dt = -a.y: each step multiplies the state by (1-ah), so after n steps we have y_n = (1-ah)^n y_0.

    As n gets larger, this clearly does not converge if |1-ah| > 1. Hence the forward stencil will fail to converge for the exponential decay if the step size is larger than 2/a. This is again an unimportant result, since there is no accuracy left well before we consider using such a large step size. Nonetheless, it seems traditional to mention the proof.
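
    A sketch comparing the stencils on this decay (my own illustration, using the closed forms each stencil takes for this particular equation): the forward step multiplies the state by (1 - ah), the backward step by 1/(1 + ah), and the Crank-Nicolson step by (1 - ah/2)/(1 + ah/2).

      #include <stdio.h>
      #include <math.h>

      int main(void)
      {
          double a = 1.0, h = 0.1, t_end = 1.0;
          double yf = 1.0, yb = 1.0, yc = 1.0;   /* forward, backward, Crank-Nicolson */
          int steps = (int)(t_end / h + 0.5);

          for (int i = 0; i < steps; i++) {
              yf *= 1.0 - a * h;                 /* rate taken at the start of the step */
              yb /= 1.0 + a * h;                 /* rate taken at the end of the step   */
              yc *= (1.0 - 0.5 * a * h) / (1.0 + 0.5 * a * h);   /* average of the two  */
          }
          double exact = exp(-a * t_end);
          printf("exact %.6f  forward %.6f  backward %.6f  crank-nicolson %.6f\n",
                 exact, yf, yb, yc);
          return 0;
      }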

    Better Techniques

    There is a huge literature of techniques. Many are embedded in standard libraries. Complete lecture courses and human lifetimes can be devoted to this one subject.

    Sometimes the forward stencil will fail. An example is so-called Stiff Equations, but their precise definition does not exist and such situations are beyond the scope of this course anyway. Iterations such as that of Gear, flashed up in one slide and non-examinable, are the way forward for those with interest. You might also like to explore the Runge Kutta techniques.


    Part 10 - Flow Networks

    Commonly we wish to simulate large, continuous systems that have been spatially-quantised and matrix form is naturally appropriate. Other times we have complex systems that are discrete in reality, such as traffic, water or electrical systems, consisting of paths between nodes with fluid flow along the paths.

    Flow networks have a steady-state equilibrium given by the solution of simultaneous equations. Where all links are linear, we do not need to iterate to find the steady state. But where some links or nodes change over time, as with valves being gradually closed or buckets changing in fill depth, we embed our flow matrix solver inside an FDTD simulation.

    Traffic in a road network or packet flow rates on the Internet are also systems that can be simulated using fluid flow. Examples of capacitor-like components are then, respectively, car parks and queues in buffers.

    To be specific in lectures, we speak mainly of flow networks where the fluid is electricity or water. It is introduced at a constant rate at one or more sources and flows down to a sink at the lowest level which we say has zero potential. The current generator, in the water analogy, can be thought of as a pump that takes fluid out of an infinite sump inside the drain and delivers it at a constant rate somewhere near the top, and however high this needs to be, the perfect pump is always able to deliver that rate. In the electrical analogy, as well as current generators, we also typically model perfect voltage generators that deliver any amount of current at a set height (or voltage). These perfect generators can model real generators by including a resistance with them called the internal resistance of the generator.

    Our links are resistive by default: the amount of flow is given by their fixed conductance multiplied by the potential difference between their ends, and flow is from the higher end towards the lower end. This differs from the links considered in the min-cut, max-flow networks mentioned in the Algorithms course, which are considered to have zero resistance (infinite conductance) until their flow meets their saturation level, at which point their resistance rises as needed to limit the flow to that value. In this course we illustrate such non-linear behaviour using the pressure limiting valve and the electrical diode.

    We can write an ad hoc piece of code for every network we wish to model (such as the one for the two tanks in the demos folder). Better is to have a systematic modelling approach that can read in a network (e.g. via a GUI) composed of components (from a standard library).

    We phrase the conservation of flow equations in matrix form where the solution is a column vector of all the potentials excluding an arbitrary one that we nominate as ground zero. Each component has a characteristic template in the matrix: for instance a conductance appears at four places, unless one end is connected to our ground, when it appears only once.
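
    As a sketch of this template (often called a 'stamp'; the code and names are my own, not the demo program): node 0 is the ground reference and gets no row or column in the matrix, so a conductance with one end grounded contributes only a single entry.

      #include <stdio.h>

      #define NODES 3                  /* non-ground nodes, numbered 1..NODES */
      static double G[NODES][NODES];   /* conductance (coefficient) matrix    */
      static double I[NODES];          /* right-hand side of injected flows   */

      /* A conductance g between nodes a and b (0 means ground) appears at up
         to four places: +g on the two diagonals, -g at the two off-diagonals. */
      void stamp_conductance(int a, int b, double g)
      {
          if (a) G[a-1][a-1] += g;
          if (b) G[b-1][b-1] += g;
          if (a && b) { G[a-1][b-1] -= g; G[b-1][a-1] -= g; }
      }

      /* A flow (current) source driving 'flow' out of node a and into node b. */
      void stamp_source(int a, int b, double flow)
      {
          if (a) I[a-1] -= flow;
          if (b) I[b-1] += flow;
      }

      int main(void)
      {
          stamp_source(0, 1, 1.0);          /* pump 1 unit from ground into node 1 */
          stamp_conductance(1, 2, 2.0);
          stamp_conductance(2, 3, 2.0);
          stamp_conductance(3, 0, 1.0);
          /* G v = I would now be handed to the L/U solver of Part 7. */
          for (int r = 0; r < NODES; r++)
              printf("%6.2f %6.2f %6.2f | %6.2f\n", G[r][0], G[r][1], G[r][2], I[r]);
          return 0;
      }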

    Steady-State Operating Point

    The steady-state equilibrium is found, for linear links, by simply solving the linear simultaneous equations given by the conservation of flow at each node (Kirchhoff's current law in electricity). The matrix of coefficients will be symmetric and positive-definite (not lectured) and hence Cholesky's approach works. Also, typically, the number of nodes and links will be roughly the same, to a factor of 2 or 3 perhaps, so our coefficient matrix is very sparse when there are hundreds of nodes or more. Hence techniques for sparse equation sets (not lectured) are also commonly applied.

    The steady state may be perturbed by external influences, which normally take the form of a link being blocked or unblocked by the operation of a switch or tap (or lock gate, or a road being closed by the police, or whatever). However, the flow generators may also be 'non-stationary' in their rates: a photoelectric panel, for example, will have a cyclic variation in the flow it generates each day as the sun rises and sets.

    The equations must be solved again after any such variation in the system parameters.

    Non-linear Components

    Our nodal analysis can directly model flow sources and conductances. All other components must be phrased in terms of these two fundamentals. Here we will replace a non-linear component with a phantom pair consisting of a flow generator and a conductance. In the final section we replace a tank with a similar pair; only the way the values are chosen is different.

    The value of phantom conductance used is the instantaneous derivative of the flow w.r.t. the potential. The value of the phantom flow is set so that the correct potential difference is developed over the component, or, put another way, such that the tangent we have modelled with the instantaneous derivative no longer passes through the origin, but is offset correctly according to the properties of the component being modelled.

    Where a link between nodes (or pipe or channel) has resistance that varies according to its flow, it is non-linear. Where there are relatively few of these overall, we insert our linear equation solver inside an iteration framework and allow it to converge on the operating point over some small number of iterations. The implementation in the demos folder uses a simple substitution iteration (as shown for the Golden Ratio in earlier lectures) but commonly a Newton-Raphson iteration can be applied if we have an analytic formula for the non-linearity (such as the diode formula) and thereby readily have the formula for its derivative.

    The iteration we need starts with some potential difference across the non-linear component (typically the value found in a previous timestep), this gives us a conductance. We solve the equations and get the new flow through the component. Does this flow develop a potential difference that is close enough to what we started with? If not, iterate.

    Reservoirs - Time-varying Components

    Fluid systems commonly contain tanks that fill and empty over time. The instantaneous fill depths, as always, are part of our system state vector. The electrical capacitor can be thought of as a prismatic water bucket whose sectional area is the electrical capacitance and whose fill depth is the potential difference between its terminals. For all such reservoirs, the depth after a time step is given by quadrature (a numerical integration). We typically use the forward Euler method, which assumes the flow is constant at its start value throughout the time step. Accordingly, the contents after a time step are the contents before the time step plus the flow rate into the container multiplied by the time step duration. Hence the complete system of links is wrapped up inside an FDTD simulation.

    The reservoir/tank/capacitor is replaced with a phantom conductance and flow generator pair. The value of phantom conductance used is again the instantaneous derivative of the flow w.r.t. the potential. The value of the phantom flow is set so that the correct potential difference is developed over the component, corresponding to its current state (from the state vector).

    Adaptive Time Step

    Where we are iterating within a time step owing to the presence of non-linear components, we can typically compromise a little in accuracy by truncating that iteration a little early and instead using a smaller time step for the outer FDTD process. When modelling real-world systems, where component tolerances are about one percent (as is typical for the conductance of an electrical resistor on a circuit board or of pipe diameters in an oil refinery), our answer only needs about 3 significant figures of accuracy. Overall, we will use an adaptive control mechanism that tracks how well our non-linear iterations have converged (which is readily apparent from back substitution in the simple link models or by counting iterations) and uses a smaller time step if accuracy or convergence appears to be going down, or goes back to a checkpoint and re-runs with a smaller time step if things have gone wrong.

    (We have just studied several parts of the basic operation of the SPICE simulator that is universally used for analogue integrated circuit design. I hope you can see how to use it for other systems, such as the world economy or the struts under a drilling platform in the North Sea.)


    Part 11 - Additional Topics

    Additional Topics will not be lectured, more than likely.


    END.


    Questions Answered

    Booth Multiplication

    Q. In page 22 of the Numerical Methods handout, I think there is a mistake in the function mpx2. I believe that if the carry parameter is initially set to 0, then it will never become non-zero, so the function is identical to mpx1. This is because (x mod 2) is either going to be 0 or 1, so with carry = 0, n will be either 0 or 1. (n = 0 or 1) Then carry’ will always be 0 because of the case statement, so on the next loop the new carry will still be 0. I’ve run a test of this to be sure, and it’s clear that carry is never 1.

    A. That's correct. The mpx2 example is the same as mpx1 but the code is arranged in a style to look like the immediately-following Booth example.

    Summing Many Numbers

    Q. Why should we first combine together numbers of opposite sign and similar magnitude? Surely this leads to lots of cancellation?

    A. Yes, it does lead to cancellation and hence loss of significance. But at least the resulting number will be small and not cause severe underflow in other parts of the accumulation. Ultimately, you simply have to keep more precision using Kahan's method applied to double precision or an off-the-shelf, high-precision library, as demonstrated under alternative techniques.

    Semi-Monotonic

    Q. What precisely is the definition of a semi-monotonic approximation?

    A. The definition of semi-monotonic I gave was that the derivative only changes polarity when it is supposed to. This would be not at all for the log and exp functions and every 180 degrees for sine and cosine. Changes in polarity owing to rounding or truncation errors at other places are banned. (I notice that the Wikipedia page for monotonicity does not define the term but the second Google answer, from Oracle's/Sun's original Java documentation, says: whenever the mathematical function is non-decreasing, so is the floating-point approximation, likewise, whenever the mathematical function is non-increasing, so is the floating-point approximation. Not all approximations that have 1 ulp accuracy will automatically meet the monotonicity requirements.)

    Q. Which definition of macheps is correct?

    A. It is hard to argue whether the one I give is "right" per se, but it does appear in the standard and it corresponds to the idea of quantisation step size used in other fields, such as pixels and ADC resolution in audio. The widely-used alternative has macheps half the size. The Wikipedia page currently (May 2018) acknowledges that there are two definitions in wide use but its own presentation of which is which is confused in places!

    Q. I am a first-year student taking the Computer Science paper 1 course. Considering that we did not cover the following algorithms in detail -- Long Multiplication, Standard Long Division, Fixed-Latency Long Division and the Numeric Base (Integer Input, Integer Output, Floating Point Input, Floating Point Output) -- I was wondering whether these are examinable and are we expected to know these algorithms for the exam? (The algorithms are at the end of the first part of the lecture notes.)

    A. The variable-latency long division is exactly what you learned at school when taught to divide, so you should know that one for sure. The naive and Vedic long multiply code was skipped and is listed above as non-examinable. You do not have to know the details of the fixed-latency method of division. Integer input is pretty much trivial - you should be able to write that in any programming language where you have some proficiency. The floating-point input is an exercise you should do for your supervisor. The integer output and floating-point output are variations on the variable-latency divide function and you should understand the principles and cost. Reproducing runnable code is not called for. Overall, we note that the input methods are essentially polynomial evaluation where the digits are the coefficients of the powers of the base. And that the respective output routine is a division to get back the digits.


    END (C) 2017-18 DJ Greaves.