At 09:12 on Saturday morning, a new switch in our data centre failed in such a way that it bridged together all of its connections to the rest of the network. Since switches have multiple uplinks for resilience, this created a loop in the network.
This loop meant that network traffic was duplicated endlessly, a broadcast storm. The resulting flood of traffic starved the CPUs of the switches and routers that received it, causing a cascading failure of the entire network.
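To see why a single loop is so destructive, consider how switches flood frames with unknown or broadcast destinations: each switch re-sends the frame out of every port except the one it arrived on. With no loop, copies die out at the edge; with a loop, they circulate and multiply. The following sketch uses a purely hypothetical four-switch full-mesh topology (not our actual network) to show the copy count growing without bound:

```python
from collections import deque

# Hypothetical topology: four switches in a full mesh, no spanning tree,
# so every switch has multiple paths back to every other switch.
links = {
    "sw1": ["sw2", "sw3", "sw4"],
    "sw2": ["sw1", "sw3", "sw4"],
    "sw3": ["sw1", "sw2", "sw4"],
    "sw4": ["sw1", "sw2", "sw3"],
}

def flood(start, hops):
    """Count frame copies in flight after `hops` rounds of flooding.

    Each switch forwards every frame out of all ports except the one it
    arrived on -- the standard flooding behaviour that a loop turns into
    a storm.
    """
    frames = deque([(start, None)])  # (current switch, switch it came from)
    for _ in range(hops):
        next_frames = deque()
        for sw, came_from in frames:
            for neighbour in links[sw]:
                if neighbour != came_from:
                    next_frames.append((neighbour, sw))
        frames = next_frames
    return len(frames)

print(flood("sw1", 1), flood("sw1", 2), flood("sw1", 5))
```

In this toy mesh the copy count doubles on every hop after the first, which is why the storm saturates links and CPUs within seconds rather than degrading gradually.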
During this outage many systems failed as a consequence of the network failure; in particular, our virtual machine pool lost connectivity with its disks. The traffic flood also caused some devices to crash or reboot.
After attempts to mitigate the problem remotely failed, emergency work was undertaken on-site during the afternoon and evening to locate and isolate the faulty device by iteratively partitioning the network, allowing the rest of the network to recover, and then fixing devices stuck in a broken state. Intermittent network disruption continued into the late evening, due to suspected internal state corruption in the core router (gatwick), until each of its CPUs had been rebooted.
Services hosted on virtual machines were gradually brought back into service during the rest of the weekend, starting with the web and email servers early Saturday evening.