Since the morning of Monday 30th January, we have had a problem with the offsite backup system; a number of filesystems, including the one containing user home directories, are not being copied to the offsite server.
The problem began when I was persuaded, somewhat against my better judgement, to delete a recent filer snapshot in which some confidential material had been inadvertently captured. A completely unforeseen side effect of this was that the remote backups stopped working, complaining about the missing snapshot. I believed that snapshots which were important for the operation of the mechanism were protected from deletion, but this appears not to be the case.
I have so far been unable to get the backups running again. I have a case open with NetApp and am awaiting advice. It is entirely possible that I will be told that there is no easy fix and that the backups will have to be re-initialised; this will present some difficulty as we have insufficient remote disc capacity to do that without sacrificing existing archives.
For the avoidance of doubt, no existing backups have been lost. Local snapshots are still being taken, though in some cases they appear under different names from usual because I am using a different mechanism. However, if there is a catastrophic failure of the local filer (which would in any case be a major disruption), we could only restore the affected filesystems to their state on Monday 30th January, rather than meeting the 1 hour target. There will also be a gap in the long term backups; the size of this will depend on how long it takes to resolve the problem.
If you want to know whether a particular filesystem is affected by the problem, look in the “.snapshot” directory. If you see a mixture of names of the form “hourly.N” AND “sv_hourly.N”, then the files are on the affected volume. If you only see the names with the “sv_” prefix, the remote copies are working normally.
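For anyone who would rather script that check than inspect the listing by eye, the rule above can be sketched roughly as follows. This is only an illustration: the function names and the "unknown" result are my own assumptions, "N" is assumed to be a decimal number, and any listing that matches neither description is reported as unknown rather than guessed at.

```python
import os
import re

# Name forms quoted in the notice; N is assumed to be a decimal number.
PLAIN_HOURLY = re.compile(r"^hourly\.\d+$")
SV_HOURLY = re.compile(r"^sv_hourly\.\d+$")


def classify_snapshot_names(names):
    """Classify a .snapshot listing as 'affected', 'normal' or 'unknown'.

    'affected': both hourly.N and sv_hourly.N names are present.
    'normal':   at least one name is present and every name starts with sv_.
    'unknown':  anything else (this category is an assumption of the sketch).
    """
    names = list(names)
    has_plain = any(PLAIN_HOURLY.match(n) for n in names)
    has_sv = any(SV_HOURLY.match(n) for n in names)
    if has_plain and has_sv:
        return "affected"
    if names and all(n.startswith("sv_") for n in names):
        return "normal"
    return "unknown"


def check_directory(path):
    """Hypothetical helper: classify the .snapshot directory found under path."""
    return classify_snapshot_names(os.listdir(os.path.join(path, ".snapshot")))
```

Calling check_directory() on a directory within the filesystem of interest returns one of the three strings; an "unknown" result means the listing should be checked by hand.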