Parallel Computing Notes


Flynn's Classification (1966)

Broad classification of parallel computing systems based on number of instruction and data streams

  • SISD: Single Instruction, Single Data - the conventional uniprocessor.

  • SIMD: Single Instruction, Multiple Data - distributed-memory SIMD (MPP, DAP, CM-1&2, Maspar); shared-memory SIMD (STARAN, vector computers).

  • MIMD: Multiple Instruction, Multiple Data - message-passing machines (Transputers, nCube, CM-5); non-cache-coherent SMPs (BBN Butterfly, T3D); cache-coherent SMPs (Sequent, Sun Starfire, SGI Origin).

  • MISD: Multiple Instruction, Single Data - no commercial examples.

    Source: MIT CSAIL.


    Cache Consistency

  • Bus-Based Snooping

  • Makes use of the broadcast nature of a bus to maintain consistency of views

  • MSI/MESI/MOESI Protocols

  • Caches are associatively looked up on the FSB side as well as on the CPU side (snooping)

  • Bus has an INVALIDATE signal that any node can drive to abort a cycle, allowing a dirty cache line to be written out before the cycle is restarted.

  • On MOESI-style systems, a bus cycle can be serviced from another cache.

  • Can have multiple busses for more bandwidth, but associative snooping overhead prevents scaling.

  • Example: the Sun Starfire (UE10000) has four busses connecting 16 blades, each with 4 CPUs and local memory.

    MSI CC Protocol

    M - Modified: The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state.

    S - Shared: Indicates that this cache line may be stored in other caches of the machine.

    I - Invalid: Indicates that this cache line is invalid.

    MESI CC Protocol

    The MESI protocol (also known as the Illinois protocol) is a widely used cache coherency and memory coherence protocol. Intel introduced it in the Pentium processor to "support the more efficient write-back cache in addition to the write-through cache previously used by the Intel 486 processor".

    Every cache line is marked with one of the four following states (coded in two additional bits):

    M - Modified.

    E - Exclusive: The cache line is present only in the current cache, but is clean; it matches main memory.

    S - Shared.

    I - Invalid.

    A cache may satisfy a read from any state except Invalid. An Invalid line must be fetched (to the Shared or Exclusive states) to satisfy a read.

    A write may only be performed if the cache line is in the Modified or Exclusive state. If it is in the Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast operation.

    A cache may discard a non-Modified line at any time, changing to the Invalid state. A Modified line must be written back first.

    A cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all of the other CPUs in the system) of the corresponding main memory location and insert the data that it holds. This is typically done by forcing the read to back off (i.e. to abort the memory bus transaction), then writing the data to main memory and changing the cache line to the Shared state.

    A cache that holds a line in the Shared state must also snoop all invalidate broadcasts from other CPUs, and discard the line (by moving it into Invalid state) on a match.

    A cache that holds a line in the Exclusive state must also snoop all read transactions from all other CPUs, and move the line to Shared state on a match.

    The Modified and Exclusive states are always precise: i.e. they match the true cacheline ownership situation in the system. The Shared state may be imprecise: if another CPU discards a Shared line, and this CPU becomes the sole owner of that cacheline, the line will not be promoted to Exclusive state (because broadcasting all cacheline replacements from all CPUs is not practical over a broadcast snoop bus).

    In that sense the Exclusive state is an opportunistic optimization: if the CPU happens to have it right, moving into the Modified state needs no memory bus transactions - if the CPU has it wrong, there's an extra bus transaction, but cache coherency is still preserved.
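
    The per-line state machine described above can be captured in a small transition function. This is a minimal sketch for one line in one cache; the enum, event, and function names are illustrative, not from any real design.

```c
#include <assert.h>

/* A minimal sketch of the MESI state machine described above, for one
   cache line in one cache. All names here are illustrative. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef enum {
    LOCAL_READ,     /* this CPU reads the line */
    LOCAL_WRITE,    /* this CPU writes the line */
    BUS_READ,       /* another CPU's read is snooped on the bus */
    BUS_INVALIDATE  /* another CPU broadcasts an invalidate before writing */
} mesi_event;

/* 'others_have_copy' models the shared signal that tells a reading CPU
   whether any other cache currently holds the line. */
mesi_state mesi_next(mesi_state s, mesi_event e, int others_have_copy)
{
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                     /* read miss: fetch to S or E */
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;                             /* read hit: state unchanged */
    case LOCAL_WRITE:
        /* From S, all other copies are invalidated (broadcast) first;
           from I, a read-for-ownership fetch happens first. */
        return MODIFIED;
    case BUS_READ:
        if (s == MODIFIED || s == EXCLUSIVE)  /* M also writes back first */
            return SHARED;
        return s;
    case BUS_INVALIDATE:
        return INVALID;
    }
    return s;
}
```

    Note that the opportunistic E optimisation shows up directly: a read miss with no other sharers lands in EXCLUSIVE, so a later LOCAL_WRITE needs no bus transaction.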

    MOESI Protocol

    MOESI is a full cache coherency protocol that encompasses all of the possible states commonly used in other protocols. Each cache line is in one of five states:

  • Modified.

  • Owned: This cache is one of several with a valid copy of the cache line, but has the exclusive right to make changes to it. It must broadcast those changes to all other caches sharing the line.

  • Exclusive.

  • Shared: This line is one of several copies in the system. This cache does not have permission to modify the copy.

  • Invalid.

    This is a more elaborate version of the simpler MESI protocol, which avoids the need to write modifications back to main memory when another processor tries to read it. Instead, the Owned state allows a processor to retain the right to modify a shared cache line by promising to share any writes it performs with the other caches.

    MOESI is beneficial when the communication latency and bandwidth between two CPUs is significantly better than to main memory. Multi-core CPUs with per-core L2 caches are an example of that.

    Wikipedia

    Taken from Wikipedia Feb 2006.

    CC Network

  • A solution that "pipelines the bus".

  • Example: Scalable Coherent Interface

  • Slotted ring instead of a bus

  • Still a broadcast medium, so snooping is possible

  • Extensions to torus structures - scaling again causes a push for directory protocols.

  • Today, being re-invented as networks on a chip.

    Directory Protocols

  • Basis of many ccNUMA systems.

  • Keep sharing status with the memory instead of just in the caches.

  • Each block/line of memory has a directory entry storing which nodes it currently resides on.
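
    A directory entry of the kind just described can be sketched as a state plus a bit vector of sharers. This is a hypothetical layout (all field and function names are invented for illustration), assuming at most 64 nodes.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative directory entry for the scheme above: per memory line,
   a state plus a bit vector of the nodes holding a copy. */
enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };

struct dir_entry {
    enum dir_state state;
    uint64_t sharers;  /* bit i set => node i holds a copy (up to 64 nodes) */
};

/* A node fetches the line for reading: record it as a sharer. */
void dir_add_sharer(struct dir_entry *d, int node)
{
    d->sharers |= (uint64_t)1 << node;
    d->state = DIR_SHARED;
}

/* A node requests write ownership: invalidations are sent to every other
   sharer listed in the vector, after which only the owner remains. */
void dir_make_exclusive(struct dir_entry *d, int node)
{
    d->sharers = (uint64_t)1 << node;
    d->state = DIR_EXCLUSIVE;
}
```

    The point of the design is that invalidations go only to the nodes named in the sharer vector, so no bus-wide broadcast (and hence no snoop on every node) is needed.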


    Traditional Switch Structures

  • Flits sent between nodes

  • Clos, Benes, Delta and Crossbar structures from telephone switching systems

  • Blocking/nonblocking

  • Hypercubes mostly used

    A tesseract is a 4-D hypercube.

  • Wormhole and Manhattan routing to avoid blocking

    Supercomputer Interconnect

    SGI Numalink

    Example Supercomputer

  • Connection Machine(s)

  • CM-1 was SIMD research project.

  • CM-5 was MIMD, designed to scale to 16,384 SPARC microprocessors.

  • NCSA's CM-5 had 512 nodes, gigabytes of memory, and a 140-gigabyte parallel disk storage system called the Scalable Disk Array (SDA).

  • Both Cray and Thinking Machines saw software costs and revenues soon exceed those for hardware.


    SoC Bus Protocols

  • ARM AHB high performance bus - fixed handshake timing

  • Open Cores Protocol (OCP) - pipelineable bus protocol

  • ARM AXI - switchable bus protocol with asynchronous send/receive using transaction id.

    SAN: Storage Area Network

  • FDDI (old), Hyperchannel, Fibre Channel

  • Data migration and backup does not need to pass through a host.

  • Recently Firewire on PCs

  • Commercial offering: Connectrix.

  • New technologies such as WDM fibre may be used.

    Cluster Computing

  • Collection of identical computers on a fast LAN

  • Batch or interactive use/mix.

  • Control station maintains Job Queue and collects results

  • Load balancing is provided

  • Dynamic migration is possible - requires 'mobile-ready' app code

  • Parallel Virtual Machine (PVM) libraries allow application-level pseudo-supercomputing.

  • Synchronisation Primitives - next lecture.

  • Examples: Beowulf, Condor, Xenoservers

    www.answers.com/topic/parallel-computing


    Weak Memory Ordering Notes

    Comp Arch 2005/6 DJG


    Sequential Consistency

    • Assume we have cache consistency, but is this enough?

    • Sequential consistency -
      • All operations happen in the order specified by the code
      • All CPUs' views of these operations are consistent with a global ordering of the combined operations

    • Expensive to maintain on modern CPUs

    • Tantamount to an implicit memory barrier (fence) instruction between each memory operation (load or store).

    • Actually, need to wait for all possible invalidations or exceptions from a previous instruction to have settled before starting the next.
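
    The cost of sequential consistency is easiest to see with the classic store-buffering litmus test. The sketch below (the helper name sb_allows_00 is invented) enumerates every sequentially consistent interleaving of two 2-op threads and shows the outcome r0 == r1 == 0 never occurs under SC; a CPU with store buffers can produce exactly that outcome, which is what fences rule out.

```c
#include <assert.h>

/* Enumerate all sequentially consistent interleavings of the classic
   store-buffering litmus test:
       CPU0: x = 1; r0 = y;      CPU1: y = 1; r1 = x;
   Returns 1 if any SC interleaving yields r0 == r1 == 0. */
int sb_allows_00(void)
{
    int found = 0;
    /* choose the two slots (i < j) out of four that CPU0's ops occupy;
       per-CPU program order is preserved, giving all 6 interleavings */
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            int x = 0, y = 0, r0 = -1, r1 = -1;
            int c0 = 0, c1 = 0;  /* next op index for each CPU */
            for (int slot = 0; slot < 4; slot++) {
                if (slot == i || slot == j) {
                    if (c0++ == 0) x = 1; else r0 = y;  /* CPU0's ops */
                } else {
                    if (c1++ == 0) y = 1; else r1 = x;  /* CPU1's ops */
                }
            }
            if (r0 == 0 && r1 == 0)
                found = 1;
        }
    }
    return found;
}
```

    Under SC, r0 == 0 forces CPU0's read before CPU1's store, and r1 == 0 forces the symmetric ordering, which together form a cycle; hence the function returns 0.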

    Single CPU Guarantees

    1. A given CPU always perceives its own memory operations as occurring in program order. That is, memory-reordering issues arise only when a CPU is observing other CPUs' memory operations.

    2. An operation is reordered with a store only if the operation accesses a different location from the store.

    3. Aligned simple loads and stores are atomic.


    Self-Modifying Code

  • Not supported since I/D cache separation.

  • Needed for JIT and OS loader.

  • Flushing the whole I-cache gets expensive if done frequently.

  • Deliberate aliasing techniques were used for very small, known caches

  • Better to use an OS primitive that maps to partial flush/invalidate instructions.
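
    A sketch of such a primitive, using the real GCC/Clang builtin __builtin___clear_cache (the wrapper name publish_code is invented): after a JIT or loader writes instructions through the D-cache, only the affected range of the I-cache is synchronised, rather than the whole I-cache being flushed.

```c
#include <assert.h>
#include <string.h>

/* Publish freshly generated code: write it through the D-cache, then
   synchronise the I-cache for just that address range. On x86 the builtin
   is a no-op (coherent I/D caches); on ARM and others it emits the needed
   partial flush/invalidate instructions or system call. */
void publish_code(char *buf, const char *code, unsigned len)
{
    memcpy(buf, code, len);                   /* write via the D-cache */
    __builtin___clear_cache(buf, buf + len);  /* sync I-cache for this range */
}
```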

    Possible Consistency Models

    Three (at least) main models should be considered:

  • 1. Sequential Consistency

  • 2. Processor consistency (W to R ordering): Writes from a given processor are always seen by others as in that processor's order but stores by different processors may not be seen in true order.

  • 3. Weak Consistency:

    • All operations on synchronisation variables are sequentially consistent,
    • No synchronisation variables may be accessed until data operations are complete,
    • No data variables may be accessed until synchronisation operations are complete.

    A write is 'complete' when no subsequent read can find a different value.

    (A read is 'complete' when no subsequent write can change the value read.)

  • Do we need cache consistency between barriers ?

    OO Race Problems

  • Java has the synchronized keyword

  • C/C++ have the volatile keyword

  • Processor 1 executes its code first. Its cache happens to flush the modification of Foo.bar back to main memory, but none of the other writes.

  • Then, processor 2 executes its code, it loads its cache only with the new value of Foo.bar from main memory (finding old values in the cache for all other memory locations).

  • Will processor 2 see garbage for integer x? What if it were an object handle?
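
    The fix for this race can be sketched with C11 atomics (all names here are invented stand-ins for the example above): the writer publishes the data, then sets the flag with release ordering; a reader that observes the flag with acquire ordering is guaranteed to observe the data too, so x can never be garbage.

```c
#include <assert.h>
#include <stdatomic.h>

static int data;         /* plays the role of integer x */
static atomic_int ready; /* plays the role of Foo.bar */

void publish(void)
{
    data = 42;                                              /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release); /* write fence */
}

int consume(void)
{
    /* acquire load pairs with the release store above */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;            /* spin until the flag becomes visible */
    return data;     /* guaranteed to be 42, never stale */
}
```

    The release/acquire pair plays exactly the role of the write and read memory barriers discussed in the next section.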

    Memory Barrier Instructions

  • CPUs differ greatly in the reordering they introduce and in the memory barrier (fence) instructions they offer.

  • Linux library offers a clean, portable interface.

  • Linux-kernel synchronization primitives contain any needed memory barriers, which is a good reason to use these primitives.


    Using a Barrier (slide from MIT)


    Linux SMP Primitives

    Loads and/or stores preceding the memory barrier are committed to memory before any loads/stores following the memory barrier.

  • smp_mb(): "memory barrier" that orders both loads and stores.

  • smp_rmb(): "read memory barrier" that orders only loads.

  • smp_wmb(): "write memory barrier" that orders only stores.

  • smp_read_barrier_depends(): forces subsequent operations that depend on prior operations to be ordered. This primitive is a no-op on all platforms except Alpha.

    Linux Journal articles: 1, 2. PowerPC: eieio instruction (Enforce In-Order Execution of I/O)


    Atomic Instructions

  • Uniproc: Test and Set Instruction.

  • Uniproc: Compare and Swap Instruction.

  • Uniproc: Load and increment Instruction.

  • SMP: These need to bypass the cache completely and become an atomic DRAM operation.

  • SMP: Two-stage solutions using load-linked and store-conditional.

  • Try to use 'lock-free' data structures and don't spin.
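
    A minimal test-and-set spinlock can be sketched with C11's atomic_flag, a portable stand-in for the test-and-set instruction above (the type and function names are illustrative).

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_flag locked; } spinlock;

void spin_lock(spinlock *l)
{
    /* test-and-set atomically sets the flag and returns the old value;
       spin while another CPU already holds the lock */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;
}

void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

    This is the spinning style the bullet above warns against under contention; lock-free structures or blocking primitives are usually preferable.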

  • Joint Compilation of Firmware+Processor to Hardware

    Under Construction Feb 06: www.cl.cam.ac.uk/Research/SRG/HAN/Lambda/paper.html