Between the evening of Sunday 30th May and the early morning of Tuesday 1st June, the performance of the NetApp filer “elmer” was so badly degraded as to be effectively unusable. The problems affected various services, including the main lab web server.
It is a little difficult to work out exactly what caused what. There was an incident on Saturday 29th May when a number of virtual machines hosted on Xen Enterprise servers “wedged”. This is a symptom which has been seen before, but we have no idea what causes it. RF and MAJ, working from home, restarted the machines we observed to be broken, but it is possible that we missed some.
All seemed well until Sunday evening, when more widespread problems with the filer were reported. At first sight the symptoms looked like overload, but it was not possible to diagnose remotely. MAJ investigated on Monday morning, and found the filer load to be abnormally low. Despite this, performance was unusably slow. In addition, several of the internal management commands on the filer were hanging in completely unexpected ways. A filer reboot was tried, but this made no difference to the performance, and with the benefit of hindsight it may have made some problems worse.
At this point MAJ and GT decided that the best course of action was to call NetApp for support. The call was accepted at priority 1, which means that we get dedicated assistance 24×7 until the server is usable again. MAJ was then on the phone to NetApp for 6 hours 57 minutes (one continuous phone call) while the support team attempted to diagnose the problem. This involved using the Cisco Webex service, which proved very effective in allowing the support team to share the workstation session. A few blind alleys were followed, but eventually they decided that they needed to analyse a core dump. This was duly taken, and (with not a little difficulty) transferred to NetApp for analysis.
The phone call was eventually ended when the support call had to be passed on to the next shift. Herein lies a problem: NetApp will happily let a priority 1 call “follow the sun”, but we have no such luxury. Fortunately we were able to agree that since it would take time to analyse the dump, MAJ could desert the phone and go home.
At about 22:00 NetApp called MAJ at home with a diagnosis: the filer performance was being degraded because of high LDAP server latency. It appears that this symptom has been seen before, but is not well diagnosed within the operating system. There were some subtle clues in the instrumentation, but the initial support team did not spot them.
It was too late in the day to take any immediate action, but guessing that PB was likely to be first on scene on Tuesday morning, MAJ sent him a text message with the diagnosis. The LDAP servers were duly dealt with early on Tuesday morning, and filer performance returned to normal. There were then a number of consequential matters to deal with.
It is obviously a concern that a relatively minor problem had such a devastating effect, and that it was not diagnosed more quickly. The filer is configured to use two different LDAP servers, and also has a fallback of using local files containing a slightly out of date copy of the same information. It appears that this fallback does not work as well as might be hoped. The precise failure mode of the LDAP servers is not entirely understood, but it is possible that they were responding slowly rather than failing outright, and so were never sufficiently dead to invoke the fallback.
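This "slow but not dead" failure mode is notoriously hard to catch with the usual is-it-up checks, because a connect-level probe succeeds even when the server is too slow to be useful. As a rough sketch (not the filer's actual health-check mechanism, and the host/port names here are purely illustrative), one could monitor LDAP server latency independently and alert on it before clients start to suffer:

```python
import socket
import time

def probe_tcp_latency(host, port, timeout=2.0):
    """Measure TCP connect latency to host:port in seconds.

    Returns None if the connection is refused or times out. Note that a
    server which accepts connections but then answers queries slowly will
    still pass this check quickly -- the very failure mode suspected here,
    which is why a latency *threshold* matters, not just reachability.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

if __name__ == "__main__":
    # Hypothetical LDAP servers; 389 is the standard LDAP port.
    for server in ("ldap1.example.org", "ldap2.example.org"):
        latency = probe_tcp_latency(server, 389)
        if latency is None:
            print(f"{server}: DOWN")
        elif latency > 0.5:
            print(f"{server}: SLOW ({latency:.3f}s)")
        else:
            print(f"{server}: ok ({latency:.3f}s)")
```

A fuller check would also time an actual LDAP query (a bind or a simple search), since query latency, not connect latency, is what degraded the filer.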
The filer is now operating normally. The downtime is regretted; fortunately it happened on a relatively quiet day. This should perhaps serve as a reminder that whilst we may aim for 100% uptime, it can never be guaranteed.