The original paper is a bit confusing; this offers a more concrete explanation of the implementation details.
In the paper, a view is a set of members plus a leader (called the primary). Each event within a view has a unique logical timestamp. A viewstamp is the pair of a view id and a timestamp, and it is used to detect information lost across view changes (sketched after the list below).
- uses the "primary copy technique", where everything is computed on a primary node, and replicated for the purpose of failover
- a view change occurs if the primary is separated from the cluster; the view change assigns a new primary
- the paper has different protocols for replication depending on state: one for when membership and leadership are stable, another for when they are not
- during a view change, some actions from the old view might be lost in the new view, and we use viewstamps to decide whether to commit or abort an action
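Viewstamps are just ordered pairs, so "did we lose something" reduces to comparing tuples. A minimal sketch, assuming integer view ids and timestamps (the class and field names are mine, not the paper's):

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Viewstamp:
    """Ordered pair (view_id, timestamp); compared by view first, then timestamp."""
    view_id: int
    timestamp: int

# An event assigned viewstamp (2, 7) is not reflected in state whose latest
# viewstamp is (2, 5), so it may have been lost in a view change.
known = Viewstamp(view_id=2, timestamp=7)
surviving = Viewstamp(view_id=2, timestamp=5)
assert known > surviving
```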
protocol overview
It differs from Paxos in that it is a replication protocol, not a consensus protocol
- a primary replica handles ordering of client requests, all other replicas are backups
- failure is handled by allowing the primary to change, changing through a sequence of views
- so we need the view change protocol
- the goal is that the next view should reflect all client operations that were executed in an earlier view, so the current view cannot execute an operation without ensuring that enough backups will retain it
- so the primary must wait until f+1 replicas (counting itself) have acknowledged the request
- each request therefore needs a quorum of f+1 of the 2f+1 replicas (sketched below)
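To make the quorum arithmetic concrete, here is a small illustration of my own (not code from the paper): with 2f+1 replicas, any two sets of f+1 replicas overlap, so an operation acknowledged by f+1 replicas is seen by any later view formed from f+1 replicas.

```python
def quorum_size(num_replicas: int) -> int:
    """For n = 2f + 1 replicas, a quorum is f + 1, i.e. a majority."""
    assert num_replicas % 2 == 1, "groups are sized 2f + 1"
    f = (num_replicas - 1) // 2
    return f + 1

# 5 replicas tolerate f = 2 failures; every request needs 3 acknowledgements,
# and any two sets of 3 out of 5 replicas share at least one member.
assert quorum_size(5) == 3
```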
details
Identity of the primary is encoded in the viewstamp (the totally ordered view number part)
- primary is chosen by round robin, ordering replicas by IP address (sketched after this list)
- clients have a viewstamped replication proxy
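The round-robin rule can be stated in one line; this sketch assumes the view number simply indexes into the sorted configuration (the function name and example addresses are mine):

```python
def primary_for_view(view_number: int, configuration: list[str]) -> str:
    """configuration is the sorted list of 2f + 1 replica IP addresses."""
    return configuration[view_number % len(configuration)]

config = sorted(["10.0.0.3", "10.0.0.1", "10.0.0.2"])
assert primary_for_view(0, config) == "10.0.0.1"
assert primary_for_view(1, config) == "10.0.0.2"  # next view, next replica
```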
without failure
Every message sent from one replica to another contains the current view number
- if the sender is behind (smaller view number), the message is dropped
- if the sender is ahead, the receiver catches up via state transfer before processing the message (see the sketch below)
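A hedged sketch of that gate on incoming messages (names are mine; the real protocol attaches the view number to every message and acts per message type):

```python
def on_message(my_view: int, msg_view: int) -> str:
    """Compare the sender's view number against ours and decide what to do."""
    if msg_view < my_view:
        return "drop"            # sender is behind; its message is stale
    if msg_view > my_view:
        return "state-transfer"  # we are behind; catch up before processing
    return "process"

assert on_message(my_view=3, msg_view=2) == "drop"
assert on_message(my_view=3, msg_view=4) == "state-transfer"
assert on_message(my_view=3, msg_view=3) == "process"
```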
Each replica holds some state:
- configuration is a sorted list of 2f + 1 IP addresses
- replica number is the index of the replica itself
- log contains the ordered list of requests that have been received
- client-table tracks the most recent request from each client, which is needed to ensure each request is executed at most once (a duplicate just gets the cached reply)
- current status {normal, view-change, recovering}
As part of the state, each replica also keeps a few numbers:
- view-number is the totally ordered identifier for the current view, incremented through view changes
- op-number is assigned to the most recently received request
- commit-number is the op-number of the most recently committed request (all of the replica state above is collected in the sketch below)
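Collecting the replica state above into one sketch (field names follow the notes; the defaults and the is_primary helper are my assumptions):

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    NORMAL = "normal"
    VIEW_CHANGE = "view-change"
    RECOVERING = "recovering"

@dataclass
class Replica:
    configuration: list[str]           # sorted list of 2f + 1 replica IP addresses
    replica_number: int                # this replica's index into configuration
    view_number: int = 0               # current view, bumped by each view change
    status: Status = Status.NORMAL
    op_number: int = 0                 # assigned to the most recently received request
    commit_number: int = 0             # op-number of the most recently committed request
    log: list = field(default_factory=list)           # ordered list of received requests
    client_table: dict = field(default_factory=dict)  # client-id -> (request-number, reply)

    @property
    def is_primary(self) -> bool:
        return self.replica_number == self.view_number % len(self.configuration)

r = Replica(configuration=sorted(["10.0.0.1", "10.0.0.2", "10.0.0.3"]), replica_number=0)
assert r.is_primary  # replica 0 is the primary of view 0
```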
- client sends its request to the primary (the whole normal-case flow is sketched after this list)
- primary receives the request and looks up the client in its client table. If the request number is strictly larger than the last one from that client, it is processed.
- primary increments the op-number (a logical timestamp), appends the request to its log, and updates the client table. It then starts its two-phase exchange with the other replicas by sending a PREPARE message
- backups receive the PREPARE and buffer the request (e.g. in a priority queue); once all preceding requests are in their log, they append the new request and finally reply PREPAREOK to the primary
- primary waits until it has received f PREPAREOKs, and can then advance the commit-number to the op-number of the committed request, hand the operation to the service, and send a reply to the client
- backups learn of the commit through the next PREPARE (or, if the primary is idle, through a periodic commit message after a timeout); they first finish all preceding operations (possibly via state transfer) and then pass the operation up to the service layer without notifying the client (the primary has already done this)
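A compressed, single-process sketch of the primary's bookkeeping in the normal case (all names are mine, messaging and the service call are stubbed out): assign an op-number, wait for f PREPAREOKs, then commit and reply.

```python
from dataclasses import dataclass, field

@dataclass
class Primary:
    f: int                    # failures tolerated; the group has 2f + 1 replicas
    view_number: int = 0
    op_number: int = 0
    commit_number: int = 0
    log: list = field(default_factory=list)
    client_table: dict = field(default_factory=dict)  # client-id -> last request-number seen
    prepare_oks: dict = field(default_factory=dict)   # op-number -> set of backup ids

    def on_request(self, client_id: str, request_number: int, operation: str):
        # Drop duplicates and old requests using the client table.
        if request_number <= self.client_table.get(client_id, 0):
            return None
        self.client_table[client_id] = request_number
        self.op_number += 1
        self.log.append((self.op_number, client_id, request_number, operation))
        self.prepare_oks[self.op_number] = set()
        # The real primary would now broadcast this PREPARE to the backups.
        return ("PREPARE", self.view_number, operation, self.op_number, self.commit_number)

    def on_prepare_ok(self, backup_id: int, op_number: int):
        self.prepare_oks[op_number].add(backup_id)
        if len(self.prepare_oks[op_number]) >= self.f and op_number > self.commit_number:
            self.commit_number = op_number
            # The real primary would now execute the operation and reply to the client.
            return ("REPLY", self.view_number, op_number)
        return None

p = Primary(f=1)                        # 3 replicas in total
p.on_request("client-A", 1, "put x 1")  # assigns op-number 1, would send PREPARE
p.on_prepare_ok(backup_id=1, op_number=1)
assert p.commit_number == 1             # f = 1 PREPAREOK is enough to commit
```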
changing views on failure
The primary should regularly be sending PREPARE messages (or, when idle, COMMIT messages) to the replicas, so a backup that stops hearing from it suspects failure. Because the primary rotates round robin, there is no leader election: every replica already knows which replica becomes the new primary (sketched below).
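A sketch of how a backup could notice a dead primary and compute who leads the next view (the timeout value and names are mine; the actual view change exchanges more messages than shown):

```python
import time

def next_primary(view_number: int, configuration: list[str]) -> str:
    """Round robin means every replica computes the new primary locally: no election."""
    return configuration[(view_number + 1) % len(configuration)]

def primary_seems_dead(last_heard_from_primary: float, timeout_s: float = 1.0) -> bool:
    """A backup that has seen no PREPARE or COMMIT for a while suspects the primary."""
    return time.monotonic() - last_heard_from_primary > timeout_s

config = sorted(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
if primary_seems_dead(last_heard_from_primary=time.monotonic() - 5.0):
    # The backup would advance its view number and start the view change protocol.
    print("new primary would be", next_primary(view_number=0, configuration=config))
```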
Paper's architecture
- a cohort is a single logical entity, made up of several underlying replica modules under one primary module
- in their architecture, any running module is replicated: one replica is the primary, which processes the two-phase commits, and the others are simply backups and endpoints
- a view is a set of cohorts (i.e. a set of sets of replicas), and one cohort acts as the primary
- viewstamps let you state guarantees that the cohort's state reflects only events that occur logically at or before the viewstamp
- lifetime of a system is divided into epochs, which are subdivided into views
- changes to epochs are administrative, changes to views are due to failure
- there are four protocols for the algorithm: normal operation, view changes, recovery for crashed replicas and reconfiguration for epoch changes
- ensures availability as long as no more than f replicas fail, by using a protocol that never needs responses from more than f+1 of the 2f+1 replicas, so the failed ones need not be contacted