There are three classes of failure in a distributed transaction system:
- The client can fail
- The transaction service (including the communications channel) can fail
- The stable storage can fail
(Stable storage refers to storage, such as disk or tape, whose contents persist through a power outage. It still has failure modes such as physical damage, but these are normally much rarer than power loss.)
If the client fails, it will be either during the transaction or after
commitment. In the first case, the service simply undoes the transaction.
In the second case, the committed changes stand and nothing needs to be undone.
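
As a rough illustration of how a service might undo an uncommitted transaction when its client fails, here is a minimal in-memory sketch using an undo log. The class name TransactionService and its methods (begin, write, commit, abort, client_failed) are hypothetical names chosen for this example, not part of any particular system, and the sketch ignores stable storage and isolation between concurrent transactions.

```python
class TransactionService:
    """Toy in-memory service keeping one undo log per open transaction."""

    _MISSING = object()  # sentinel: the key did not exist before the write

    def __init__(self):
        self.data = {}       # current values, including uncommitted writes
        self.undo_logs = {}  # txn_id -> list of (key, previous value)

    def begin(self, txn_id):
        self.undo_logs[txn_id] = []

    def write(self, txn_id, key, value):
        # Record the old value before overwriting so the write can be undone.
        old = self.data.get(key, self._MISSING)
        self.undo_logs[txn_id].append((key, old))
        self.data[key] = value

    def commit(self, txn_id):
        # After commitment the changes stand; the undo log is discarded.
        del self.undo_logs[txn_id]

    def abort(self, txn_id):
        # Replay the undo log in reverse to restore the prior state.
        for key, old in reversed(self.undo_logs.pop(txn_id)):
            if old is self._MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old

    def client_failed(self, txn_id):
        # Failure during the transaction: undo it.
        # Failure after commitment: no log remains, so nothing is undone.
        if txn_id in self.undo_logs:
            self.abort(txn_id)
```

Calling client_failed for a transaction that is still open restores every value it touched, while calling it for a committed transaction leaves the data unchanged.
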
If the transaction service fails, the client and server can either wait for
recovery and retry, or each independently assume that the transaction failed
and wait to issue aborts/undos.
If the stable storage fails, the enterprise should acquire more reliable
hardware by spending more money!
Cases one and two involve recovery mechanisms. We describe techniques
for these next.