Parallel Computing Notes


Flynn's Classification (1966)

A broad classification of parallel computing systems based on the number of instruction and data streams

  • SISD: Single Instruction, Single Data - the conventional uniprocessor

  • SIMD: Single Instruction, Multiple Data - distributed-memory SIMD (MPP, DAP, CM-1&2, MasPar); shared-memory SIMD (STARAN, vector computers)

  • MIMD: Multiple Instruction, Multiple Data - message-passing machines (Transputers, nCube, CM-5); non-cache-coherent SMPs (BBN Butterfly, T3D); cache-coherent SMPs (Sequent, Sun Starfire, SGI Origin). A small C sketch contrasting the SIMD and MIMD styles follows below.

  • MISD: Multiple Instruction, Single Data - no commercial examples.

    Source: MIT CSAIL.
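
    A minimal C sketch of the SIMD/MIMD distinction: the loop below is the kind of code a vectorising compiler maps onto one instruction stream applied across many data elements, while the threads are independent instruction streams. Function and thread names are illustrative only.

      #include <pthread.h>
      #include <stdio.h>

      /* SIMD flavour: one instruction stream, many data elements.
         A vectorising compiler can map this loop onto SIMD hardware
         (one add/multiply applied to several elements at once). */
      void saxpy(int n, float a, const float *x, float *y) {
          for (int i = 0; i < n; i++)
              y[i] = a * x[i] + y[i];
      }

      /* MIMD flavour: each thread is an independent instruction stream
         working on its own data. */
      void *worker(void *arg) {
          int id = *(int *)arg;
          printf("thread %d running its own instruction stream\n", id);
          return NULL;
      }

      int main(void) {
          float x[4] = {1, 2, 3, 4}, y[4] = {0};
          saxpy(4, 2.0f, x, y);              /* SIMD-style data parallelism */
          printf("y[3] = %f\n", y[3]);

          pthread_t t[2];
          int ids[2] = {0, 1};
          for (int i = 0; i < 2; i++)
              pthread_create(&t[i], NULL, worker, &ids[i]);
          for (int i = 0; i < 2; i++)
              pthread_join(t[i], NULL);      /* MIMD-style task parallelism */
          return 0;
      }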


    Cache Consistency

  • Bus Based Snooping

  • Makes use of the broadcast nature of a bus to maintain a consistent view of memory across caches

  • MSI/MESI/MOESI protocols (a minimal MESI state-machine sketch follows this list)

  • Cache tags must be looked up associatively from the front-side bus (FSB) as well as from the CPU side, to support snooping

  • The bus has an INVALIDATE signal that any node can drive to abort a cycle, allowing a dirty cache line to be written back before the cycle is restarted.

  • On MOESI-style systems, a bus cycle can be serviced directly from another cache (cache-to-cache transfer).

  • Multiple busses can be used for more bandwidth, but the overhead of associative snooping on every bus prevents scaling.

  • Example: the Sun Starfire (UE10000) has four busses connecting 16 system boards, each with 4 CPUs and local memory.
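
    A minimal sketch of the MESI transitions named above, written as a plain C state machine for one line in one cache. The event names and the 'shared' signal are simplified assumptions; real implementations also handle write-backs, upgrades and bus races.

      #include <stdio.h>

      /* MESI states for one cache line in one cache. */
      typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

      /* Events seen by that cache: local processor accesses, and
         bus transactions snooped from other caches. */
      typedef enum {
          PR_READ, PR_WRITE,   /* from the local CPU           */
          BUS_RD, BUS_RDX      /* snooped: remote read / write */
      } event_t;

      /* 'shared' says whether another cache asserted the shared
         line on the bus during our read miss (simplified). */
      mesi_t next_state(mesi_t s, event_t e, int shared) {
          switch (e) {
          case PR_READ:
              if (s == INVALID)            /* read miss: BusRd issued   */
                  return shared ? SHARED : EXCLUSIVE;
              return s;                    /* hit in S/E/M: no change   */
          case PR_WRITE:
              return MODIFIED;             /* I: BusRdX; S: invalidate others;
                                              E: silent upgrade; M: hit */
          case BUS_RD:                     /* another cache reads the line */
              if (s == MODIFIED || s == EXCLUSIVE)
                  return SHARED;           /* M also supplies / writes back data */
              return s;
          case BUS_RDX:                    /* another cache wants to write */
              return INVALID;              /* M writes the dirty line back first */
          }
          return s;
      }

      int main(void) {
          mesi_t s = INVALID;
          s = next_state(s, PR_READ, 0);   /* -> EXCLUSIVE */
          s = next_state(s, PR_WRITE, 0);  /* -> MODIFIED  */
          s = next_state(s, BUS_RD, 0);    /* -> SHARED    */
          printf("final state: %d\n", s);
          return 0;
      }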

    CC Network

  • A solution that "pipelines the bus".

  • Example: Scalable Coherent Interface (SCI)

  • Slotted ring instead of a bus (see the sketch after this list)

  • Still a broadcast medium, so snooping is still possible

  • Extensions to torus structures - scaling again pushes towards directory protocols.

  • Today, being re-invented as networks on a chip.
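
    A toy C sketch of the "pipelines the bus" idea: fixed slots circulate around the ring, so several transfers can be in flight at once while every node still sees (can snoop) each slot as it passes. Slot counts and packet fields are illustrative, not SCI-accurate.

      #include <stdio.h>

      #define NODES 4            /* nodes on the ring (illustrative) */
      #define SLOTS NODES        /* one slot per link segment        */

      typedef struct {
          int valid;             /* slot carries a packet?           */
          int src, dst;          /* packet header                    */
      } slot_t;

      int main(void) {
          slot_t ring[SLOTS] = {0};

          /* Node 0 injects a packet for node 2 into the empty slot
             currently passing it (slot 0 starts at node 0). */
          ring[0] = (slot_t){1, 0, 2};

          for (int cycle = 0; cycle < 2 * SLOTS; cycle++) {
              /* Each cycle every node snoops the slot in front of it;
                 the destination removes (consumes) the packet. */
              for (int n = 0; n < NODES; n++) {
                  int s = (n - cycle % SLOTS + SLOTS) % SLOTS;  /* slot at node n */
                  if (ring[s].valid && ring[s].dst == n) {
                      printf("cycle %d: node %d received packet from node %d\n",
                             cycle, n, ring[s].src);
                      ring[s].valid = 0;                        /* free the slot  */
                  }
              }
              /* Ring advance is modelled implicitly by the cycle index. */
          }
          return 0;
      }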

    Directory Protocols

  • Basis of many ccNUMA systems.

  • Keep sharing status with the memory instead of just in the caches.

  • Each block/line of memory has a directory entry storing which nodes currently hold a copy (see the sketch below).
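
    A minimal sketch, in C, of a directory entry with a sharer bit-vector and the usual textbook actions on read and write misses; the field names and message helpers are assumptions.

      #include <stdint.h>
      #include <stdio.h>

      #define MAX_NODES 64

      /* One directory entry per memory block, kept at the block's home node. */
      typedef enum { UNCACHED, SHARED_ST, EXCLUSIVE_ST } dstate_t;

      typedef struct {
          dstate_t state;
          uint64_t sharers;      /* bit i set => node i holds a copy */
      } dir_entry_t;

      /* Stand-ins for real interconnect messages (assumptions). */
      static void send_invalidate(int node)  { printf("  invalidate -> node %d\n", node); }
      static void fetch_from_owner(int node) { printf("  fetch/downgrade <- node %d\n", node); }

      /* Read miss from 'node': make memory up to date, add a sharer. */
      void read_miss(dir_entry_t *d, int node) {
          if (d->state == EXCLUSIVE_ST) {
              /* One node owns a dirty copy: pull data back, demote it to sharer. */
              for (int i = 0; i < MAX_NODES; i++)
                  if (d->sharers & (1ULL << i)) fetch_from_owner(i);
          }
          d->sharers |= 1ULL << node;
          d->state = SHARED_ST;
      }

      /* Write miss from 'node': invalidate every other copy, grant ownership. */
      void write_miss(dir_entry_t *d, int node) {
          for (int i = 0; i < MAX_NODES; i++)
              if ((d->sharers & (1ULL << i)) && i != node) send_invalidate(i);
          d->sharers = 1ULL << node;
          d->state = EXCLUSIVE_ST;
      }

      int main(void) {
          dir_entry_t d = { UNCACHED, 0 };
          read_miss(&d, 1);      /* node 1 reads  */
          read_miss(&d, 2);      /* node 2 reads  */
          write_miss(&d, 2);     /* node 2 writes: node 1 gets invalidated */
          printf("state=%d sharers=%llx\n", d.state, (unsigned long long)d.sharers);
          return 0;
      }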


    Traditional Switch Structures

  • Flits (flow-control units) are sent between nodes

  • Clos, Benes, Delta, and crossbar structures, originally from telephone switching systems

  • Blocking/nonblocking

  • Hypercubes mostly used

    A tesseract is a 4-D hypercube.

  • Wormhole and Manhattan routing to avoid blocking (a dimension-order routing sketch for a hypercube follows this list)
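
    A small C sketch of dimension-order (e-cube) routing on a hypercube, the deterministic scheme commonly paired with wormhole switching; node addresses are the usual binary labels.

      #include <stdio.h>

      /* Dimension-order (e-cube) routing on a d-dimensional hypercube:
         at each hop, flip the lowest-order bit in which the current
         node's address still differs from the destination's. */
      void ecube_route(unsigned src, unsigned dst, int dims) {
          unsigned cur = src;
          printf("%u", cur);
          while (cur != dst) {
              unsigned diff = cur ^ dst;          /* bits still to correct   */
              for (int b = 0; b < dims; b++) {
                  if (diff & (1u << b)) {         /* lowest differing dim    */
                      cur ^= 1u << b;             /* traverse that dimension */
                      break;
                  }
              }
              printf(" -> %u", cur);
          }
          printf("\n");
      }

      int main(void) {
          /* A tesseract (4-D hypercube) has 16 nodes, addresses 0..15. */
          ecube_route(0x5, 0xA, 4);   /* 0101 -> 1010: four hops */
          return 0;
      }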

    Supercomputer Interconnect

    SGI NUMAlink

    Example Supercomputer

  • Connection Machine(s)

  • CM-1 was a SIMD research project.

  • CM-5 was MIMD, using up to 65,000 SPARC microprocessors.

  • NCSA's CM-5 had 512 nodes, gigabytes of memory, and a 140-gigabyte parallel disk storage system called the Scalable Disk Array (SDA).

  • Both Cray and Thinking Machines saw software costs and revenues soon exceed those for hardware.

    SoC Bus Protocols

  • ARM AHB (Advanced High-performance Bus) - fixed handshake timing

  • Open Core Protocol (OCP) - a pipelineable bus protocol

  • ARM AXI - a switchable bus protocol with asynchronous send/receive matched by transaction IDs (see the toy model after this list).
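
    A toy model (not signal-level AXI) of the transaction-ID idea: requests and responses are decoupled, may complete out of order, and are matched back to the original request by ID. All names here are illustrative.

      #include <stdio.h>

      #define MAX_OUTSTANDING 8

      /* Pending-request table, indexed by transaction ID. */
      typedef struct { int valid; unsigned addr; } pending_t;

      static pending_t pending[MAX_OUTSTANDING];

      void issue_read(int id, unsigned addr) {
          pending[id] = (pending_t){1, addr};
          printf("issue  id=%d addr=0x%x\n", id, addr);
      }

      void response(int id, unsigned data) {
          if (pending[id].valid) {
              printf("retire id=%d addr=0x%x data=0x%x\n",
                     id, pending[id].addr, data);
              pending[id].valid = 0;
          }
      }

      int main(void) {
          issue_read(0, 0x1000);
          issue_read(1, 0x2000);
          issue_read(2, 0x3000);
          /* Responses come back out of order; the ID ties each one
             back to the request that caused it. */
          response(2, 0xCAFE);
          response(0, 0xBEEF);
          response(1, 0xF00D);
          return 0;
      }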

    SAN: Storage Area Network

  • FDDI (old), Hyperchannel, Fibre Channel

  • Data migration and backup does not need to pass through a host.

  • More recently, FireWire on PCs

  • Commercial offering: Connectrix.

  • New technologies such as WDM fibre may be used.

    Cluster Computing

  • Collection of identical computers on a fast LAN

  • Batch or interactive use/mix.

  • A control station maintains the job queue and collects results

  • Load balancing is provided

  • Dynamic migration is possible - requires 'mobile-ready' app code

  • Parallel Virtual Machine (PVM) libraries allow application-level pseudo supercomputing (a master/worker sketch follows this list).

  • Synchronisation Primitives - next lecture.

  • Examples: Beowulf, Condor, Xenoservers
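
    A minimal PVM 3 style master/worker sketch of application-level pseudo supercomputing. The executable name "worker", the message tags and the task count are assumptions, and error handling is omitted.

      #include <stdio.h>
      #include <pvm3.h>

      #define NTASKS     4
      #define TAG_WORK   1
      #define TAG_RESULT 2

      /* Master: spawn workers across the cluster, hand out one integer
         each, collect the squared results. (Sketch only.) */
      int main(void) {
          int tids[NTASKS];
          int spawned = pvm_spawn("worker", NULL, PvmTaskDefault, "", NTASKS, tids);

          for (int i = 0; i < spawned; i++) {
              int work = i + 1;
              pvm_initsend(PvmDataDefault);
              pvm_pkint(&work, 1, 1);
              pvm_send(tids[i], TAG_WORK);
          }
          for (int i = 0; i < spawned; i++) {
              int result;
              pvm_recv(-1, TAG_RESULT);        /* any worker, result tag */
              pvm_upkint(&result, 1, 1);
              printf("got result %d\n", result);
          }
          pvm_exit();
          return 0;
      }

      /* The matching worker (built as a separate executable "worker"):
           int work, res, parent = pvm_parent();
           pvm_recv(parent, TAG_WORK);  pvm_upkint(&work, 1, 1);
           res = work * work;
           pvm_initsend(PvmDataDefault); pvm_pkint(&res, 1, 1);
           pvm_send(parent, TAG_RESULT); pvm_exit();                      */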

    www.answers.com/topic/parallel-computing