This post is about silent data loss in replicated systems (state-machine replication a la ZooKeeper) due to the disk state being wiped out. The disk state is crucial in such systems to guarantee that replicated data isn’t lost in the case a server crashes and recovers. The replication protocol in ZooKeeper assumes that servers can crash and recover, and the ensemble can make progress as long as f + 1 members are up and running, where f is a bound on the number of faulty servers determined by the size of the ensemble. If the ensemble has 5 servers, then the ensemble can make progress as long as 3 are up (f = 2). It additionally assumes that the disk state before a server crash is there during recovery. This post is pointing out that this isn’t always the case, and it is an issue to be aware of. Note that this isn’t unique to ZooKeeper, but I’ll focus on ZooKeeper because I know it well.
The disk state can be wiped out for various reasons like faulty drives, misconfiguration, starting a server replica in a different machine virtual or not. Losing disk state might be silent at times for due to software, hardware, or operator errors. This issue has come up occasionally in the community, but to my knowledge hasn’t been documented or discussed extensively.
Here is the problem that losing disk state really induces. Say we have an ensemble of ZooKeeper servers comprising 3 servers: A B C. A leads and B follows; C is dormant. Now say that A and B commit a transaction, say 0x1 – “create znode /foo”. Recall from ZooKeeper 101 that before committing a transaction, a quorum of servers needs to persist it to disk by writing to the transaction log.
It all looks good until say A crashes and its disk state is gone. Now A wakes up and it has no memory of committing 0x1, so it thinks it is starting from scratch. A next forms a quorum with C, and C hasn’t heard anything about 0x1. A and C elect A to lead and declare the empty state to be the new state. In the meanwhile, B is clueless waiting to talk to other servers to form a quorum. When it does, it learns that the new state doesn’t contain 0x1. Transaction 0x1 is now lost forever…
This figure illustrates the execution I just described:
Getting around it: Increase replication
Fundamentally, this is a problem with the assumption that servers only crash in the case they are faulty. Losing this state looks more like an arbitrary fault (or a Byzantine fault) and could be handled as such. If instead of using simple majorities as quorums (more than half of the replicas) we use super majorities (more than 2/3 of the replicas), then we can guarantee that we have enough replicas. To see why it works, convince yourself that the smallest intersection of any two simple majorities has one server, while for super majorities we have always f + 1 in any intersection.
Having a single server in the intersection between quorums causes the problem above because the single server can crash, lose its disk state due to a faulty disk, and recover memoryless. Having f+1 servers in the intersection guarantees that as long as no more than f servers get a faulty disk simultaneously, any intersection between two quorums will always contain at least one server that hasn’t lost its disk state and the ensemble can recover the state it has previously replicated through that server in the interesection.
Using super majorities has one important drawback, which is requiring 3f + 1 servers to tolerate f faulty servers. To tolerate 2 faulty servers, we now need 7 servers rather than 5. Consequently, using 7 servers with super majorities, we can only tolerate 2 servers crashing rather than 3 servers as we have with simple majorities. That’s not great, so can we deal with this problem of silent data loss without having to resort to super-majorities?
Second cut: let the operator decide
An even simpler approach is to not restart servers automatically and perform a sanity check over the disk data before restarting the server. In the case the server does lose its disk state, it could be removed and reintroduced to the ensemble via reconfiguration. The reconfiguration takes care of rebuilding the state onto the server recovering so that it can start making progress with the rest of the ensemble.
Note that for the reconfiguration to work the ensemble must have enough replicas up to be able to make progress without the faulty server. In the example above, A can’t be reintroduced if B doesn’t start talking to C again. Getting B back might require an operator to intervene. But, what if B has also lost disk state? In this case, we have violated our f bound, which is f = 1 for the above example, and the ensemble is stuck, it can’t reconfigure A and B back in.
Third cut: using a token
The assumption that no more than f servers can be faulty and recovering simultaneously is obviously fine, but we have asked the question of whether we could automate even further by enabling progress when we have too many faulty servers, even if some data is lost in some rare cases.
One way is to use a token, a database identifier. Such identifiers have been used extensively in commercial databases like Oracle. Oracle has a create database command that generates a dbid. We would like the token generation to be automated instead of relying a command issued, however.
As with Oracle databases, the token represents an instance of the ZooKeeper database, and a server can only join an ensemble in the case it has the same token as everyone else. The servers consequently need to agree on a common token. Servers need to verify the token upon a handshake with the leader, possibly even during leader election, to make sure they refer to the same instance. They should also enable clients to determine if they are talking to the right incarnation.
What makes the token agreement problem interesting is that we need to reach agreement to reach agreement: Zab gives us agreement but we need to agree upon a token to tolerate disk failures. Instead of trying to solve the problem recursively, we simplify the solution by fixing a token distributor and requiring unanimity. A chosen server, not elected dynamically, is responsible for distributing new tokens, and it does it in a 2PC manner, requiring all other servers to acknowledge and commit.
As long as a majority of servers keeps the same token there is no need to generate a new one. In the case any server loses its disk, it recovers by adopting the token provided by a majority. If no majority holds a token, then the token distributor does its job and generates a new one. Given that it requires unanimity, the token generator should wait until a majority explicitly declares not to have a token. Otherwise, because of timing, it could end up generating new tokens and instantiating the database unnecessarily.
An invariant of the system is that the token a message of the replication protocol carries always matches the token of the receiver, except for the case in which it doesn’t have a token and it is recovering. Servers not bearing a token shouldn’t participate in the protocol.
Back to the initial example, how does this token scheme solve the problem? When A comes back up and connects to C, it won’t have a token so A and C won’t be able to form a quorum and get going with processing requests. Server A needs to wait for B to say what its token is so that it can make progress. Now that A knows that there is a quorum with the same token, it can adopt the token by pulling a copy of the state of B. Why B? Because B has the latest state (latest epoch/zxid).
Let’s be upfront about the weak points of this scheme. First, it isn’t great to have a fixed distributor, but trying to elect one would complicate things and would essentially be a “solving consensus to solve consensus” kind of direction that wouldn’t work. Second, unanimity leaves no room for masking crashes. If any process is unavailable, then the token distribution protocol can’t terminate as described. Both assumptions are OK in the case new tokens aren’t distributed often, though. In fact, it really shouldn’t happen often because a new token represents a new instance of the database, which possibly means lost data. We might be able to copy some data from the previous instance, but there is no guarantee that it will contain everything that has been committed.
The history of the database id (dbid) in ZooKeeper actually dates back to the early days of the project . We haven’t finished incorporating the dbid although the field is present in the persistent state of ZooKeeper. We recently started talking about this feature in the context of virtualization because servers might be restarted in a different host without carrying the disk state, in which case we wanted to avoid the problem mentioned here. Reconfiguration is also a topic I’ll leave for a future post. It affects the unanimity assumption because the set of servers could change dynamically.
Reconfiguration already exists in the 3.5 branch of ZooKeeper, so the option of reconfiguring a server in is viable. I also don’t see a strong reason for not automating this process of checking the disk state and reconfiguring if necessary, it sounds doable. The token scheme provides an additional notch of automation, but it will take some effort to implement. The use of super majorities was supposed to be simple with quorum verifiers, but when I tried to implement a verifier for super majorities, I ran into all sort of problems because of the dependency on simple majorities across the code. It sounds like the initial effort to abstract away the type of quorum system we use wasn’t really continued. I haven’t put much effort into getting this one to work, but it shouldn’t be too hard to fix it. I expect it to require some refactoring.
Acknowledgments: The token distribution scheme mentioned here has been discussed and developed with Johannes Klein and Satish Thatte. Thanks to Camille Fournier, Martin Kleppman, Patrick Hunt, and Bill Bridge for comments on a draft version of this post. It has changed quite a bit due to the comments.