It’s deployed and works fine until one day:

The primary TM decides to commit and then pauses. The backup TM thinks the primary failed and it decides to take over. The backup TM broadcasts an Abort message. The primary TM resumes and broadcasts a Commit message. Some RMs abort and others commit.

Find fault-tolerant distributed algorithms is hard. They’re easy to get wrong, and hard to find errors by testing, We should get the algorithm right before we code. Writing and checking a TLA+ spec is the best way I know to do that.

Untitled

Untitled

Untitled

Untitled

CHOOSE v ∈ S : P

// Only when there's exactly one v in S satisfying P .