Parallel Computing Notes


Flynn's Classification (1966)

Broad classification of parallel computing systems based on number of instruction and data streams

  • SISD: Single Instruction, Single Data - the conventional uniprocessor.

  • SIMD: Single Instruction, Multiple Data - distributed-memory SIMD (MPP, DAP, CM-1&2, Maspar); shared-memory SIMD (STARAN, vector computers).

  • MIMD: Multiple Instruction, Multiple Data - message-passing machines (Transputers, nCube, CM-5); non-cache-coherent SMPs (BBN Butterfly, T3D); cache-coherent SMPs (Sequent, Sun Starfire, SGI Origin).

  • MISD: Multiple Instruction, Single Data - no commercial examples.

    Source: MIT CSAIL.


    Cache Consistency

  • Bus-Based Snooping

  • Makes use of the broadcast nature of a bus to maintain consistency of views

  • MSI/MESI/MOESI Protocols

  • Caches are associatively looked up on the FSB side as well as on the CPU side (snooping)

  • Bus has an INVALIDATE signal that any node can drive to abort a cycle, allowing a dirty cache line to be written out before the cycle is restarted.

  • On MOESI-style systems, a bus cycle can be serviced from another cache.

  • Can have multiple busses for more bandwidth, but associative snooping overhead prevents scaling.

  • Example: the Sun Starfire (UE10000) has four busses connecting 16 blades, each with 4 CPUs and local memory.

    MSI CC Protocol

    M - Modified: The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state.

    S - Shared: Indicates that this cache line may be stored in other caches of the machine.

    I - Invalid: Indicates that this cache line is invalid.

    MESI CC Protocol

    The MESI protocol (also known as the Illinois protocol) is a widely used cache coherency and memory coherence protocol. Intel introduced it in the Pentium processor to "support the more efficient write-back cache in addition to the write-through cache previously used by the Intel 486 processor".

    Every cache line is marked with one of the four following states (coded in two additional bits):

    M - Modified.

    E - Exclusive: The cache line is present only in the current cache, but is clean; it matches main memory.

    S - Shared.

    I - Invalid.

    A cache may satisfy a read from any state except Invalid. An Invalid line must be fetched (to the Shared or Exclusive states) to satisfy a read.

    A write may only be performed if the cache line is in the Modified or Exclusive state. If it is in the Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast operation.

    A cache may discard a non-Modified line at any time, changing to the Invalid state. A Modified line must be written back first.

    A cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all of the other CPUs in the system) of the corresponding main memory location and insert the data that it holds. This is typically done by forcing the read to back off (i.e. to abort the memory bus transaction), then writing the data to main memory and changing the cache line to the Shared state.

    A cache that holds a line in the Shared state must also snoop all invalidate broadcasts from other CPUs, and discard the line (by moving it into Invalid state) on a match.

    A cache that holds a line in the Exclusive state must also snoop all read transactions from all other CPUs, and move the line to Shared state on a match.

    The Modified and Exclusive states are always precise: i.e. they match the true cacheline ownership situation in the system. The Shared state may be imprecise: if another CPU discards a Shared line, and this CPU becomes the sole owner of that cacheline, the line will not be promoted to Exclusive state (because broadcasting all cacheline replacements from all CPUs is not practical over a broadcast snoop bus).

    In that sense the Exclusive state is an opportunistic optimization: if the CPU happens to have it right, moving into the Modified state needs no memory bus transactions - if the CPU has it wrong, there's an extra bus transaction, but cache coherency is still preserved.
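
    The per-line state machine described above can be captured in a small transition function. This is a minimal sketch for one line in one cache; the enum, event, and function names are illustrative, not from any real design.

```c
#include <assert.h>

/* A minimal sketch of the MESI state machine described above, for one
   cache line in one cache. All names here are illustrative. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef enum {
    LOCAL_READ,     /* this CPU reads the line */
    LOCAL_WRITE,    /* this CPU writes the line */
    BUS_READ,       /* another CPU's read is snooped on the bus */
    BUS_INVALIDATE  /* another CPU broadcasts an invalidate before writing */
} mesi_event;

/* 'others_have_copy' models the shared signal that tells a reading CPU
   whether any other cache currently holds the line. */
mesi_state mesi_next(mesi_state s, mesi_event e, int others_have_copy)
{
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                     /* read miss: fetch to S or E */
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;                             /* read hit: state unchanged */
    case LOCAL_WRITE:
        /* From S, all other copies are invalidated (broadcast) first;
           from I, a read-for-ownership fetch happens first. */
        return MODIFIED;
    case BUS_READ:
        if (s == MODIFIED || s == EXCLUSIVE)  /* M also writes back first */
            return SHARED;
        return s;
    case BUS_INVALIDATE:
        return INVALID;
    }
    return s;
}
```

    Note that the opportunistic E optimisation shows up directly: a read miss with no other sharers lands in EXCLUSIVE, so a later LOCAL_WRITE needs no bus transaction.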

    MOESI Protocol

    MOESI is a full cache coherency protocol that encompasses all of the possible states commonly used in other protocols. Each cache line is in one of five states:

  • Modified.

  • Owned: This cache is one of several with a valid copy of the cache line, but has the exclusive right to make changes to it. It must broadcast those changes to all other caches sharing the line.

  • Exclusive.

  • Shared: This line is one of several copies in the system. This cache does not have permission to modify the copy.

  • Invalid.

    This is a more elaborate version of the simpler MESI protocol, which avoids the need to write modifications back to main memory when another processor tries to read it. Instead, the Owned state allows a processor to retain the right to modify a shared cache line by promising to share any writes it performs with the other caches.

    MOESI is beneficial when the communication latency and bandwidth between two CPUs is significantly better than to main memory. Multi-core CPUs with per-core L2 caches are an example of that.

    Wikipedia

    Taken from Wikipedia Feb 2006.

    CC Network

  • A solution that "pipelines the bus".

  • Example: Scalable Coherent Interface

  • Slotted ring instead of a bus

  • Still a broadcast medium, so snooping is possible

  • Extensions to torus structures - scaling again causes a push for directory protocols.

  • Today, being re-invented as networks on a chip.

    Directory Protocols

  • Basis of many ccNUMA systems.

  • Keep sharing status with the memory instead of just in the caches.

  • Each block/line of memory has a directory entry storing which nodes it currently resides on.
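
    A directory entry of the kind just described can be sketched as a state plus a bit vector of sharers. This is a hypothetical layout (all field and function names are invented for illustration), assuming at most 64 nodes.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative directory entry for the scheme above: per memory line,
   a state plus a bit vector of the nodes holding a copy. */
enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };

struct dir_entry {
    enum dir_state state;
    uint64_t sharers;  /* bit i set => node i holds a copy (up to 64 nodes) */
};

/* A node fetches the line for reading: record it as a sharer. */
void dir_add_sharer(struct dir_entry *d, int node)
{
    d->sharers |= (uint64_t)1 << node;
    d->state = DIR_SHARED;
}

/* A node requests write ownership: invalidations are sent to every other
   sharer listed in the vector, after which only the owner remains. */
void dir_make_exclusive(struct dir_entry *d, int node)
{
    d->sharers = (uint64_t)1 << node;
    d->state = DIR_EXCLUSIVE;
}
```

    The point of the design is that invalidations go only to the nodes named in the sharer vector, so no bus-wide broadcast (and hence no snoop on every node) is needed.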


    Traditional Switch Structures

  • Flits sent between nodes

  • Clos, Benes, Delta and Crossbar structures from telephone switching systems

  • Blocking/nonblocking

  • Hypercubes mostly used

    A tesseract is a 4-D hypercube.

  • Wormhole and Manhattan routing to avoid blocking

    Supercomputer Interconnect

    SGI Numalink

    Example Supercomputer

  • Connection Machine(s)

  • CM-1 was SIMD research project.

  • CM-5 was MIMD, designed to scale to 16,384 SPARC microprocessors.

  • NCSA's CM-5 had 512 nodes, gigabytes of memory, and a 140-gigabyte parallel disk storage system called the Scalable Disk Array (SDA).

  • Both Cray and Thinking Machines saw software costs and revenues soon exceed those for hardware.


    SoC Bus Protocols

  • ARM AHB high performance bus - fixed handshake timing

  • Open Cores Protocol (OCP) - pipelineable bus protocol

  • ARM AXI - switchable bus protocol with asynchronous send/receive using transaction id.

    SAN: Storage Area Network

  • FDDI (old), Hyperchannel, Fibre Channel

  • Data migration and backup does not need to pass through a host.

  • Recently Firewire on PCs

  • Commercial offering: Connectrix.

  • New technologies such as WDM fibre may be used.

    Cluster Computing

  • Collection of identical computers on a fast LAN

  • Batch or interactive use/mix.

  • Control station maintains Job Queue and collects results

  • Load balancing is provided

  • Dynamic migration is possible - requires 'mobile-ready' app code

  • Parallel Virtual Machine (PVM) libraries allow application-level pseudo-supercomputing.

  • Synchronisation Primitives - next lecture.

  • Examples: Beowulf, Condor, Xenoservers

    www.answers.com/topic/parallel-computing


    Weak Memory Ordering Notes

    Comp Arch 2005/6 DJG


    Sequential Consistency

    • Assume we have cache consistency, but is this enough?

    • Sequential consistency -
      • All operations happen in the order specified by the code
      • All CPUs' views of these operations are consistent with a global ordering of the combined operations

    • Expensive to maintain on modern CPUs

    • Tantamount to an implicit memory barrier (fence) instruction between each memory operation (load or store).

    • Actually, need to wait for all possible invalidations or exceptions from a previous instruction to have settled before starting the next.
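
    The cost of sequential consistency is easiest to see with the classic store-buffering litmus test. The sketch below (the helper name sb_allows_00 is invented) enumerates every sequentially consistent interleaving of two 2-op threads and shows the outcome r0 == r1 == 0 never occurs under SC; a CPU with store buffers can produce exactly that outcome, which is what fences rule out.

```c
#include <assert.h>

/* Enumerate all sequentially consistent interleavings of the classic
   store-buffering litmus test:
       CPU0: x = 1; r0 = y;      CPU1: y = 1; r1 = x;
   Returns 1 if any SC interleaving yields r0 == r1 == 0. */
int sb_allows_00(void)
{
    int found = 0;
    /* choose the two slots (i < j) out of four that CPU0's ops occupy;
       per-CPU program order is preserved, giving all 6 interleavings */
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            int x = 0, y = 0, r0 = -1, r1 = -1;
            int c0 = 0, c1 = 0;  /* next op index for each CPU */
            for (int slot = 0; slot < 4; slot++) {
                if (slot == i || slot == j) {
                    if (c0++ == 0) x = 1; else r0 = y;  /* CPU0's ops */
                } else {
                    if (c1++ == 0) y = 1; else r1 = x;  /* CPU1's ops */
                }
            }
            if (r0 == 0 && r1 == 0)
                found = 1;
        }
    }
    return found;
}
```

    Under SC, r0 == 0 forces CPU0's read before CPU1's store, and r1 == 0 forces the symmetric ordering, which together form a cycle; hence the function returns 0.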

    Single CPU Guarantees

    1. A given CPU always perceives its own memory operations as occurring in program order. That is, memory-reordering issues arise only when a CPU is observing other CPUs' memory operations.

    2. An operation is reordered with a store only if the operation accesses a different location from the store.

    3. Aligned simple loads and stores are atomic.


    Self-Modifying Code

  • Not supported since I/D cache separation.

  • Needed for JIT and OS loader.

  • Flushing the whole I-cache gets expensive if done frequently.

  • Deliberate aliasing techniques were used for very small, known caches

  • Better to use an OS primitive that maps to partial flush/invalidate instructions.
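
    A sketch of such a primitive, using the real GCC/Clang builtin __builtin___clear_cache (the wrapper name publish_code is invented): after a JIT or loader writes instructions through the D-cache, only the affected range of the I-cache is synchronised, rather than the whole I-cache being flushed.

```c
#include <assert.h>
#include <string.h>

/* Publish freshly generated code: write it through the D-cache, then
   synchronise the I-cache for just that address range. On x86 the builtin
   is a no-op (coherent I/D caches); on ARM and others it emits the needed
   partial flush/invalidate instructions or system call. */
void publish_code(char *buf, const char *code, unsigned len)
{
    memcpy(buf, code, len);                   /* write via the D-cache */
    __builtin___clear_cache(buf, buf + len);  /* sync I-cache for this range */
}
```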

    Possible Consistency Models

    Three (at least) main models should be considered:

  • 1. Sequential Consistency

  • 2. Processor consistency (W to R ordering): Writes from a given processor are always seen by others as in that processor's order but stores by different processors may not be seen in true order.

  • 3. Weak Consistency:

    • All operations on synchronisation variables are sequentially consistent,
    • No synchronisation variables may be accessed until data operations are complete,
    • No data variables may be accessed until synchronisation operations are complete.

    A write is 'complete' when no subsequent read can find a different value.

    (A read is 'complete' when no subsequent write can change the value read.)

  • Do we need cache consistency between barriers ?

    OO Race Problems

  • Java has the synchronized keyword

  • C/C++ have the volatile keyword

  • Processor 1 executes its code first. Its cache happens to flush the modification of Foo.bar back to main memory, but none of the other writes.

  • Then, processor 2 executes its code, it loads its cache only with the new value of Foo.bar from main memory (finding old values in the cache for all other memory locations).

  • Will processor 2 see garbage for integer x? What if it were an object handle?
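
    The fix for this race can be sketched with C11 atomics (all names here are invented stand-ins for the example above): the writer publishes the data, then sets the flag with release ordering; a reader that observes the flag with acquire ordering is guaranteed to observe the data too, so x can never be garbage.

```c
#include <assert.h>
#include <stdatomic.h>

static int data;         /* plays the role of integer x */
static atomic_int ready; /* plays the role of Foo.bar */

void publish(void)
{
    data = 42;                                              /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release); /* write fence */
}

int consume(void)
{
    /* acquire load pairs with the release store above */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;            /* spin until the flag becomes visible */
    return data;     /* guaranteed to be 42, never stale */
}
```

    The release/acquire pair plays exactly the role of the write and read memory barriers discussed in the next section.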

    Memory Barrier Instructions

  • CPUs differ greatly in the reordering they introduce and in the memory barrier (fence) instructions they offer.

  • Linux library offers a clean, portable interface.

  • Linux-kernel synchronization primitives contain any needed memory barriers, which is a good reason to use these primitives.


    Using a Barrier (slide from MIT)


    Linux SMP Primitives

    Loads and/or stores preceding the memory barrier are committed to memory before any loads/stores following the memory barrier.

  • smp_mb(): "memory barrier" that orders both loads and stores.

  • smp_rmb(): "read memory barrier" that orders only loads.

  • smp_wmb(): "write memory barrier" that orders only stores.

  • smp_read_barrier_depends(): forces subsequent operations that depend on prior operations to be ordered. This primitive is a no-op on all platforms except Alpha.

    Linux Journal articles: 1, 2. PowerPC: eieio instruction (Enforce In-Order Execution of I/O)


    Atomic Instructions

  • Uniproc: Test and Set Instruction.

  • Uniproc: Compare and Swap Instruction.

  • Uniproc: Load and increment Instruction.

  • SMP: These need to bypass the cache completely and become an atomic DRAM operation.

  • SMP: Two-stage solutions using load-linked and store-conditional.

  • Try to use 'lock-free' data structures and don't spin.
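
    A minimal test-and-set spinlock can be sketched with C11's atomic_flag, a portable stand-in for the test-and-set instruction above (the type and function names are illustrative).

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_flag locked; } spinlock;

void spin_lock(spinlock *l)
{
    /* test-and-set atomically sets the flag and returns the old value;
       spin while another CPU already holds the lock */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;
}

void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

    This is the spinning style the bullet above warns against under contention; lock-free structures or blocking primitives are usually preferable.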

  • Joint Compilation of Firmware+Processor to Hardware

    Under Construction Feb 06: www.cl.cam.ac.uk/Research/SRG/HAN/Lambda/paper.html