This course examines the evolution of high-performance computers and
processors and discusses the difficulties associated with making
objective performance comparisons and maintaining code compatibility.
The IBM System 360 is used as a reference architecture. Microprocessor
evolution from 8 bits through to 64 bits is presented, along with important
digressions to low-power, dataflow and VLIW architectures, since these
techniques now underlie mainstream processor implementation.
Detailed features of a number of popular Instruction Set Architectures
are compared and contrasted, with particular attention to their
effects on implementation and hence performance. The course addresses
micro-architecture implementation issues, examining how Instruction
Level Parallelism can be exploited through deep pipelining and
super-scalar techniques such as out-of-order execution. Issues in
memory hierarchy design are explored, along with their impact on code
optimisation. Multi-processor cluster interconnect, on chip and off
chip, is briefly examined.
Lectures
Instruction set architectures.
ISA history and compatibility, illustrated with
IBM 360 and notable 8-, 16-, 32- and 64-bit microprocessors.
Review of stack/accumulator/GPR instruction sets
in terms of byte sex (endianness), load-store versus
register-memory, addressing modes, sub-word and
unaligned memory access support. [3 lectures]
Comparing architectures.
Moore's Law. System versus chip performance.
Performance metrics: MIPS, MHz, FLOPS, SPEC.
Power. Price. Compatibility. [2 lectures]
Advanced pipelining.
The CPU performance equation. Structural hazards: long latency
instructions. Data hazards: result forwarding and delayed loads.
Control hazards: branch prediction, trace caches and avoiding branches.
Exceptions. [3 lectures]
Beyond super-scalar. The limits of ILP. Alternative
architectures: VLIW processors and custom VLIW synthesis, TriMedia,
SMT, SCMP. [2 lectures]
Memory hierarchy. Cache
configurations. Latency versus bandwidth. Re-ordering and
coherence. Programming for caches.
[2 lectures]
Multi-processor systems. Multi-core devices,
multi-processor cache coherency. Interconnects
for NUMA, message passing clusters and network on chip:
OCP, ARM AXI. Models for weak memory ordering. [2 lectures]
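As a rough illustration of the CPU performance equation listed under
Advanced pipelining, the sketch below computes execution time as
instruction count x cycles per instruction x clock period. The function
name and every figure are hypothetical, chosen only to show how a deeper
pipeline might raise CPI yet still win overall through a faster clock.

```python
def cpu_time(instruction_count, cpi, clock_hz):
    """Execution time in seconds: instructions * CPI / clock rate."""
    return instruction_count * cpi / clock_hz


# Hypothetical baseline: 1e9 instructions, CPI 2.0, 500 MHz clock.
base = cpu_time(1_000_000_000, 2.0, 500e6)

# Hypothetical deeper pipeline: more stalls (CPI 2.5), but an 800 MHz clock.
deeper = cpu_time(1_000_000_000, 2.5, 800e6)

# Speedup of the deeper pipeline over the baseline for this workload.
speedup = base / deeper

print(f"baseline {base:.3f} s, deeper {deeper:.3f} s, speedup {speedup:.2f}x")
```

With these assumed numbers the deeper pipeline finishes in 3.125 s against
4 s, a speedup of 1.28, even though each instruction takes more cycles.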
Objectives
At the end of the course students should
appreciate the balance between implementation and architecture
in determining performance
understand how quantitative analysis led to the convergence
towards RISC-like designs
comprehend the issues associated with deeply-pipelined designs
understand the operation of processors supporting out-of-order
execution
be able to describe the difficulties associated with building
wide-issue machines, and have a basic understanding of the
alternatives to Instruction Level Parallelism
appreciate the tradeoffs made by architects in the design of
memory hierarchies, and be able to optimise algorithms for memory
hierarchy performance
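One way to picture the last objective is a toy cache model. The sketch
below is not part of the course material: it assumes a hypothetical
direct-mapped cache (64 lines of 64 bytes) and a 256 x 256 matrix of
8-byte elements stored row by row, then counts misses for row-order and
column-order traversals.

```python
def count_misses(addresses, num_lines=64, line_bytes=64):
    """Count misses in a hypothetical direct-mapped cache."""
    resident = [None] * num_lines      # block number held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes     # which memory block is touched
        index = block % num_lines      # the one line that block can occupy
        if resident[index] != block:
            resident[index] = block    # fetch the block, evicting the old one
            misses += 1
    return misses


def row_order(rows, cols, elem_bytes=8):
    """Byte addresses when walking a row-major matrix row by row."""
    return [(r * cols + c) * elem_bytes
            for r in range(rows) for c in range(cols)]


def column_order(rows, cols, elem_bytes=8):
    """Byte addresses when walking the same matrix column by column."""
    return [(r * cols + c) * elem_bytes
            for c in range(cols) for r in range(rows)]


ROWS, COLS = 256, 256
row_misses = count_misses(row_order(ROWS, COLS))
col_misses = count_misses(column_order(ROWS, COLS))
print(row_misses, col_misses)
```

Under these assumptions the row-order walk misses once per 64-byte line
(8192 misses), while the column-order walk misses on every access (65536
misses), because its 2048-byte stride keeps evicting lines before they
are reused.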
Recommended reading
Hennessy, J. & Patterson, D. (2002). Computer architecture: a
quantitative approach. Morgan Kaufmann (3rd ed.) ISBN 1-55860-724-2.
(2nd edition, 1996, is also good.)
Further reading and reference:
Johnson, M. (1991). Superscalar microprocessor design. Prentice Hall.
Markstein, P. (2000). IA-64 and elementary functions. Prentice Hall.
Tanenbaum, A.S. (1990). Structured computer organization. Prentice Hall (2nd ed.).
Van Someren, A. & Atack, C. (1994). The ARM RISC chip: a programmer's guide. Addison-Wesley.
Sites, R.L. (ed.) (1992). Alpha architecture reference manual. Digital Press.
Kane, G. & Heinrich, J. (1992). MIPS RISC architecture. Prentice Hall.
The CPU Info Center http://infopad.eecs.berkeley.edu/CIC/tech/