月知录

跳转到正文(jump to main content)

Paxos Made Simple译注版

引言

最近在重读Paxos Made Simple,相比以前又多了一些理解,就打算记录下来,顺便把理解标注一下,希望对其他人理解论文也能有些帮助。
文中会是一段英文一段中文的编排方式,便于直接对照原文阅读。翻译部分在力求准确贴近原文意思的前提下会尽量白话一点。注解会在译文里,写在中括号内。
论文中的一些名词,例如proposer/acceptor等,在译文中以英文形式出现。
开始吧!

概述(Introduction)

The Paxos algorithm for implementing a fault-tolerant distributed system has been regarded as difficult to understand, perhaps because the original presentation was Greek to many readers[5]. In fact, it is among the simplest and most obvious of distributed algorithms. At its heart is a consensus algorithm – the "synod" algorithm of [5]. The next section shows that this consensus algorithm follows almost unavoidably from the properties we want it to satify. The last section explains the complete Paxos algorithm, which is obtained by the straightforward application of consensus to the state machine approach for building a distributed system–an approach that should be well-known, since it is the subject of what is probably the most often-cited article on the theory of distributed systems[4].

一直以来,很多人都认为用于实现可容错的分布式系统的Paxos算法难以理解,可能是因为原论文是希腊语[5]写成的。其实它是最简单直观的分布式算法之一。Paxos的核心是一个共识算法,也就是[5]中提到的“synod”算法。下一节会看到,这个共识算法很自然地脱胎于我们希望它满足的特性。最后一节通过将共识算法直接应用于状态机来构建一个分布式系统的方式(这种方式可以说广为人知,因为它是分布式理论这一领域里可能最常被引用的文章[4]的主题)解释了完整的Paxos算法。

共识算法(The Concensus Algorithm)

问题(The problem)

Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value. The safety requirements for consensus are:

  • Only a value that has been proposed may be chosen,
  • Only a single value is chosen, and
  • A process never learns that a value has been chosen unless it actually has been.

假设有一些进程,它们会提交一些值。一致性算法保证这些值中只有一个会被选中。如果一个值都没有提出,也就不会有值被选中。如果一个值被选中了,这些进程也应该知道被选中的是哪个。共识需要满足如下安全要求:

  • 只有被提交的值才能被选中
  • 只有一个值被选中
  • 值真的被选中之后,进程才能获悉有值被选中

译注:safety —— 不好的事情一定不会发生。

We won’t try to specify precise liveness requirements. However, the goal is to ensure that some proposed value is eventually chosen and, if a value has been chosen, then a process can eventually learn the value.
We let the three roles in the consensus algorithm be performed by three classes of agents: proposers, acceptors, and learners. In an implementation, a single process may act as more than one agent, but the mapping from agents to processes does not concern us here.

我们不会尝试给出共识算法需要满足的活性(liveness)要求。但是,目标是要确保提交的某个值最终会被选中,而且如果有值被选中,进程最终能够知道被选中的值。

共识算法中的三个角色分别由三种不同的代理来扮演:proposer,acceptor,以及learner。在算法实现中,单个进程可以充当多个代理,但在这里我们不关注代理和进程的对应关系。

Assume that agents can communicate with one another by sending messages. We use the customary asynchronous, non-Byzantine model, in which:

  • Agents operate at arbitrary speed, may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be remembered by an agent that has failed and restarted.
  • Messages can take arbitrarily long to be delivered, can be duplicated, and can be lost, but they are not corrupted.

假定代理相互之间通过消息通讯。我们使用常用的异步、非拜占庭模型:

  • 代理运行速度不定,可能停机,可能会重启。因为所有的代理都可能在某个值被选中之后发生故障后重启,所以发生故障后重启的代理必须要能记住某些信息,否则共识就无法达成。
  • 消息传递需要的时间长短不定,消息可能重复,可能丢失,但内容不会被篡改。

选择一个值(Choosing a Value)

The easiest way to choose a value is to have a single acceptor agent. A proposer sends a proposal to the acceptor, who chooses the first proposed value that it receives. Although simple, this solution is unsatisfactory because the failure of the acceptor makes any further progress impossible.

最简单的选择值的方式是使用单个acceptor。Proposer向这个acceptor发送提案,acceptor选择收到的第一个提议的值。这种方式尽管简单,却不适合,因为acceptor发生故障就会导致系统无法继续运行了。

So, let’s try another way of choosing a value. Instead of a single acceptor, let’s use multiple acceptor agents. A proposer sends a proposed value to a set of acceptors. An acceptor may accept the proposed value. The value is chosen when a large enough set of acceptors have accepted it. How large is large enough? To ensure that only a single value is chosen, we can let a large enough set consist of any majority of the agents. Because any two majorities have at least one acceptor in common, this works if an acceptor can accept at most one value. (There is an obvious generalization of a majority that has been observed in numerous papers, apparently starting with [3].)

所以,我们来尝试另一种选择值的方式。我们使用多个acceptor而不是一个 。 Proposer向多个proposer发送建议值。 Acceptor可能会接受这个建议值。如果有足够多的acceptor接受了某个值,这个值就被选中了。多少才算足够多呢?要保证只有一个值被选中,足够多需要包含一半以上的acceptor(即多数集)。因为两个多数集中至少有一个acceptor是共有的,如果一个acceptor最多只能接受一个值,这种方式就是可行的。(在大量的论文中都有关于多数集的泛化【TODO 确认论文中是泛化吗?】,看起来是从[3]开始。)

In the absence of failure or message loss, we want a value to be chosen even if only one value is proposed by a single proposer. This suggests the requirement:
P1. An acceptor must accept the first proposal that it receives.

在没有故障或消息丢失的情况下,即便只有单个proposer提议了一个值,我们也希望这个值能被选中。这就要求:
P1. Acceptor必须接受它收到的第一个提案。

But this requirement raises a problem. Several values could be proposed by different proposers at about the same time, leading to a situation in which every acceptor has accepted a value, but no single value is accepted by a majority of them. Even with just two proposed values, if each is accepted by about half the acceptors, failure of a single acceptor could make it impossible to learn which of the values was chosen.

但是这个要求带来一个问题。不同的 proposer 可能会在同一时间提议多个值,导致每个 acceptor 接受了一个值,但是没有一个值被 acceptor 的多数集接受。即便只有两个值,如果每个都被约半数的 acceptor 接受,单个 acceptor 出现故障就会导致无法确定哪个值被选中了。

P1 and the requirement that a value is chosen only when it is accepted by a majority of acceptors imply that an acceptor must be allowed to accept more than one proposal. We keep track of the different proposals that an acceptor may accept by assigning a (natural) number to each proposal, so a proposal consists of a proposal number and a value. To prevent confusion, we require that different proposals have different numbers. How this is achieved depends on the implementation, so for now we just assume it. A value is chosen when a single proposal with that value has been accepted by a majority of the acceptors. In that case, we say that the proposal (as well as its value) has been chosen.

条件P1以及值只有在被acceptor的多数集接受时才算被选中意味着一个acceptor得能接受多个提案。通过给每个提案一个编号,我们可以记录一个acceptor接受了哪些提案,因此每个提案包含提案编号和值。为了防止出现混乱,我们要求不同提案的编号要不一样。怎么做取决于实现,我们现在只是假定是这样的。当带有某个值的提案被acceptor的多数集接受之后,这个值也就被选中了。这时我们就说这个提案(以及它的值)被选中了。

We can allow multiple proposals to be chosen, but we must guarantee that all chosen proposals have the same value. By induction on the proposal number, it suffices to guarantee:
P2. If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v .

我们可以允许多个提案被选中,但必须保证所有被选中的提案都有相同的值。通过对提案编号使用归纳法,只要保证下面的条件即可:
P2.如果一个值为 v 的提案被选中,那么被选中的、编号更高的所有提案的值是v。

Since numbers are totally ordered, condition P2 guarantees the crucial safety property that only a single value is chosen.

To be chosen, a proposal must be accepted by at least one acceptor. So, we can satisfy P2 by satisfying:
P2a . If a proposal with value v is chosen, then every higher-numbered proposal accepted by any acceptor has value v.

因为编号是全序的,P2确保了至关重要的安全属性——即只会有一个值被选中。

提案要被选中,至少得有一个acceptor接受了它。因此,满足了下面的条件也就满足了P2:
P2a. 如果一个值为v的提案被选中了,那么被任一acceptor接受的、编号更高的所有提案的值是v。

We still maintain P1 to ensure that some proposal is chosen. Because communication is asynchronous, a proposal could be chosen with some particular acceptor c never having received any proposal. Suppose a new proposer “wakes up” and issues a higher-numbered proposal with a different value. P1 requires c to accept this proposal, violating P2a . Maintaining both P1 and P2a requires strengthening P2a to:
P2b. If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v .

Since a proposal must be issued by a proposer before it can be accepted by an acceptor, P2b implies P2a , which in turn implies P2.

我们还是要满足P1,以保证某个提案被选中。因为通讯是异步的,可能会出现一个提案被选中了,但是某个accetor(记为c)却从没收到过任何提案。假设某个新的proposer“醒过来”后发起了一个编号更大、值不一样的提案。P1要求c接受这个提案,这就违反了P2a。同时满足P1和P2a需要将P2a加强为:
P2b. 如果一个值为 v 的提案被选中了,那么每个proposer发起的、编号更大的所有提案的值是v。

To discover how to satisfy P2b, let’s consider how we would prove that it holds. We would assume that some proposal with number m and value v is chosen and show that any proposal issued with number n > m also has value v . We would make the proof easier by using induction on n, so we can prove that proposal number n has value v under the additional assumption that every proposal issued with a number in m . . (n − 1) has value v , where i . . j denotes the set of numbers from i through j . For the proposal numbered m to be chosen, there must be some set C consisting of a majority of acceptors such that every acceptor in C accepted it. Combining this with the induction assumption, the hypothesis that m is chosen implies:
Every acceptor in C has accepted a proposal with number in m..(n − 1), and every proposal with number in m..(n − 1) accepted by any acceptor has value v .

为了找到如何满足P2b的方式,我们来考虑下怎么证明它成立。我们会假设某个编号为m值、为v的提案被选中,然后展示被 发起 的、编号为n(n>m)的任何提案的值也是v。我们会对n使用归纳法来简化证明,所以我们会证明在附加条件——所有被发起的、且编号在m..(n-1)之中的提案的值都是v——下,编号为n的提案的值也是v,其中i..j表示从i到j的数字的集合。编号为m的提案如果被选中了,肯定会有acceptor的一个多数集C,C中的每个acceptor都接受了m提案。将这个结论和归纳假设结合起来,假设m被选中就意味着:

C中的每个acceptor接受了编号在m..(n-1)之中的一个提案,并且被任意acceptor接受的、编号在m..(n-1)之间的每个提案的值是v。

译注:C中的每个acceptor至少接受了m提案,更重要的是,编号在m..(n-1)中的提案要被接受,至少要跟acceptor的多数集发起申请,而这个多数集肯定跟C有交集,因此上面前半部分成立;后半部分由归纳假设——所有被发起的、且编号在m..(n-1)之中的提案的值都是v——即可得到

Since any set S consisting of a majority of acceptors contains at least one member of C , we can conclude that a proposal numbered n has value v by ensuring that the following invariant is maintained:
P2c. For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either (a) no acceptor in S has accepted any proposal numbered less than n, or (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S.

We can therefore satisfy P2b by maintaining the invariance of P2c.

因为acceptor的任意多数集S都包含C的至少一个成员,只要保持下面的不变式满足,我们就可以说编号为n的提案的值是v:

P2c. 对于任意的v和n,如果编号为n、值为v的提案被提出,那么有包含acceptor多数集的集合S,满足如下条件之一:(a) S中的所有acceptor都没有接受过编号小于n的提案; (b) S中的acceptor接受过的编号仅次于n的提案的值是v。

保持不变式P2c成立就可以满足P2b
译注:满足(a)说明系统还没有接受过任何提案;否则,假设接受过的提案是m,接受m的acceptor多数集必定跟S有交集,矛盾了。

To maintain the invariance of P2c, a proposer that wants to issue a proposal numbered n must learn the highest-numbered proposal with number less than n, if any, that has been or will be accepted by each acceptor in some majority of acceptors. Learning about proposals already accepted is easy enough; predicting future acceptances is hard. Instead of trying to predict the future, the proposer controls it by extracting a promise that there won’t be any such acceptances. In other words, the proposer requests that the acceptors not accept any more proposals numbered less than n. This leads to the following algorithm for issuing proposals.

为了保持不变式P2c成立,想要发起编号为n的提案的proposer必须要知道编号仅次于n且 已经 或者 将要 被acceptor的多数集接受的提案(如果提案存在的话)。获悉已经被接受的提案是容易的;预测哪些提案将要被接受就难了。Proposer不会尝试预测未来,而是通过得到一个承诺(承诺不会接受小于n的提案)来控制未来。换句话说,proposer会请求acceptor不要接受任何编号小于n的提案。由此得到的发起提案的算法如下:

  1. A proposer chooses a new proposal number n and sends a request to each member of some set of acceptors, asking it to respond with:
    (a) A promise never again to accept a proposal numbered less than n, and
    (b) The proposal with the highest number less than n that it has accepted, if any.

I will call such a request a prepare request with number n.

  1. If the proposer receives the requested responses from a majority of the acceptors, then it can issue a proposal with number n and value v , where v is the value of the highest-numbered proposal among the responses, or is any value selected by the proposer if the responders reported no proposals.

A proposer issues a proposal by sending, to some set of acceptors, a request that the proposal be accepted. (This need not be the same set of acceptors that responded to the initial requests.) Let’s call this an accept request.

  1. Proposer选择一个新的提案编号n,然后向某些acceptor发送请求,要求它们响应如下:
    a. 承诺绝不会在接受编号比n小的提案
    b. 如果接受过编号比n小的提案,将其中编号最大的返回
    我将这个请求称为 编号为n的 prepare 请求
  2. 如果proposer从acceptor的多数集收到了请求的响应,它就会提出一个编号为n、值为v的提案,其中v是步骤1中返回的编号最高的提案的值,如果步骤1没有返回任何值,proposer就自己选择一个。
    Proposer提出提案就是向acceptor的一个集合发送接受提案的请求。(这个集合不用非得是步骤1中响应过prepare请求的acceptor集合) 这就是 accept 请求。

This describes a proposer’s algorithm. What about an acceptor? It can receive two kinds of requests from proposers: prepare requests and accept requests. An acceptor can ignore any request without compromising safety. So, we need to say only when it is allowed to respond to a request. It can always respond to a prepare request. It can respond to an accept request, accepting the proposal, iff it has not promised not to. In other words:
P1a. An acceptor can accept a proposal numbered n iff it has not responded to a prepare request having a number greater than n.

Observe that P1a subsumes P1.

上面描述了proposer的算法。Acceptor的算法呢?它会从proposer接收到两种请求:prepare请求和accept请求。Acceptor可以忽略任意的请求而不影响安全性。因此我们只需要说明什么时候acceptor可以响应请求。它可以响应所有的prepare请求。对于accept请求,只要acceptor没有承诺过不响应,它就可以响应请求,接受提案。换句话说:
P1a. 当且仅当acceptor没有响应过编号大于n的prepare请求时,它才能接受编号为n的提案。

We now have a complete algorithm for choosing a value that satisfies the required safety properties—assuming unique proposal numbers. The final algorithm is obtained by making one small optimization.

现在我们有了完整的、满足安全性要求(假设提案编号唯一)的选择值的算法。再做一个小的优化之后,就可以得到最终的算法。

Suppose an acceptor receives a prepare request numbered n, but it has already responded to a prepare request numbered greater than n, thereby promising not to accept any new proposal numbered n. There is then no reason for the acceptor to respond to the new prepare request, since it will not accept the proposal numbered n that the proposer wants to issue. So we have the acceptor ignore such a prepare request. We also have it ignore a prepare request for a proposal it has already accepted.

假设acceptor收到了编号为n的prepare请求,但它已经响应过编号大于n的prepare请求,也就是承诺过不会再接受编号为n的请求。因此这个acceptor不响应编号n的请求毫无问题,因为它不会接受这个提案。因此可以让acceptor忽略这种prepare请求。对于已经接受过的提案,它也可以忽略相关的prepare请求。
译注:理论上直接忽略没有问题,但是实现时,这么直接忽略感觉并不好,因为需要proposer超时等待,所以acceptor直接返回了Nack更好。

With this optimization, an acceptor needs to remember only the highestnumbered proposal that it has ever accepted and the number of the highestnumbered prepare request to which it has responded. Because P2c must be kept invariant regardless of failures, an acceptor must remember this information even if it fails and then restarts. Note that the proposer can always abandon a proposal and forget all about it—as long as it never tries to issue another proposal with the same number.

有了这个优化,acceptor只需要记住它接受过的编号最大的提案,以及响应过的编号最大的prepare请求的编号。因为不管是不是发生故障,P2c的不变式必须保持,所以acceptor必须记下这些信息,即便它出故障后重启了。需要注意的是,proposer可以放弃提案,不记录任何信息——只要它不会尝试发起另一个同样编号的提案就行。

Putting the actions of the proposer and acceptor together, we see that the algorithm operates in the following two phases.

  • Phase 1.
    a. A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
    b. If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.
  • Phase 2.
    a. If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v , where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
    b. If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.

将proposer和acceptor的动作组合起来,我们看到算法分两个步骤执行的:

  1. 步骤1
    a. Proposer选择一个提案编号,向acceptor的多数集发送编号为n的prepare请求。
    b. 如果acceptor收到的prepare请求的编号比它曾响应过的所有prepare请求的编号都大,它就响应这个请求,承诺不会再接受编号比n小的任何提案,并且如果接受过提案,则将编号最大的那个提案返回。
  2. 步骤2
    a. 如果proposer从acceptor的多数集收到了针对它的编号n的prepare请求的响应,它就会为编号为n、值为v的提案向这些acceptor发送accept请求,其中v是prepare响应中编号最大的提案的值,或者(如果prepare响应中没有提案)任意值。
    b. 如果acceptor收到了针对编号n的提案的accept请求,除非它响应过编号大于n的prepare请求,它就会接受这个提案。

A proposer can make multiple proposals, so long as it follows the algorithm for each one. It can abandon a proposal in the middle of the protocol at any time. (Correctness is maintained, even though requests and/or responses for the proposal may arrive at their destinations long after the proposal was abandoned.) It is probably a good idea to abandon a proposal if some proposer has begun trying to issue a higher-numbered one. Therefore, if an acceptor ignores a prepare or accept request because it has already received a prepare request with a higher number, then it should probably inform the proposer, who should then abandon its proposal. This is a performance optimization that does not affect correctness.

一个proposer可以发起多个提案,只要它处理每个提案时都遵循算法。它可以在处理过程中随时放弃某个提案。(不会影响正确性,即便是请求和/或响应在提案被放弃之后很久才被收到)当有其他proposer已经尝试发起编号更高的提案时,放弃当前的提案也不失为一个好主意。因此,如果acceptor因为收到了编号更高的prepare请求而忽略了别的prepare或者accept请求,它应该通知proposer,然后proposer应该放弃它的提案。这是个性能优化,不会影响正确性。

获知被选定的值(Learning a Chosen Value)

To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors. The obvious algorithm is to have each acceptor, whenever it accepts a proposal, respond to all learners, sending them the proposal. This allows learners to find out about a chosen value as soon as possible, but it requires each acceptor to respond to each learner—a number of responses equal to the product of the number of acceptors and the number of learners.

要获悉某个值被选中了,learner必须了解到提案被acceptor的多数集接受了。一个显而易见的算法是让每个acceptor在接受某个提案之后就通知所有learner,把提案发送给它们。这样learner可以尽早地发现被选中的值,但要求每个acceptor要通知到每个learner——消息数是acceptor数和learner数的乘积。

The assumption of non-Byzantine failures makes it easy for one learner to find out from another learner that a value has been accepted. We can have the acceptors respond with their acceptances to a distinguished learner, which in turn informs the other learners when a value has been chosen. This approach requires an extra round for all the learners to discover the chosen value. It is also less reliable, since the distinguished learner could fail. But it requires a number of responses equal only to the sum of the number of acceptors and the number of learners.

非拜占庭故障假设使得一个learner容易从另一个learner得知某个值被接受了。我们可以让acceptor接受某个提案之后通知一个learner代表,当某个值被选中之后,再由这个learner通知其他learner。使用这种方式,其他learner需要一轮额外的通信来发现被选中的值。这种方式可靠性也更差,因为learner代表可能会发生故障。但需要的消息数只是acceptor数和learner数之和。

More generally, the acceptors could respond with their acceptances to some set of distinguished learners, each of which can then inform all the learners when a value has been chosen. Using a larger set of distinguished learners provides greater reliability at the cost of greater communication complexity.

Because of message loss, a value could be chosen with no learner ever finding out. The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.

再延伸一下,acceptor可以在接受某个提案之后通知多个learner代表,每个learner代表可以在值被选中之后再通知所有的learner。Learner代表数越多,可靠性越高,代价就是通讯复杂度更高。

由于消息会丢失,可能出现某个值被选中,但所有learner都不知道的情形。Learner可以向acceptor询问它们接受了什么提案,但是如果有某个acceptor发生故障,可能就无法知道某个提案是否被acceptor的多数集接受了。这种情况下,只有当有新的提案被选中了,learner才能知道什么值被选中。如果learner需要知道某个值是否被选中,它可以让某个proposer使用上述算法发起一个提案。

进展(Progress)

译注:progress —— 好的事情最终一定会发生。在Paxos,也就是最终会选中某个值,所有learner也能知道哪个值被选中了。

It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen. Proposer p completes phase 1 for a proposal number n1. Another proposer q then completes phase 1 for a proposal number n2 > n1. Proposer p’s phase 2 accept requests for a proposal numbered n1 are ignored because the acceptors have all promised not to accept any new proposal numbered less than n2. So, proposer p then begins and completes phase 1 for a new proposal number n3 > n2, causing the second phase 2 accept requests of proposer q to be ignored. And so on.

容易构造这么一个场景,其中两个proposer都不断发起一系列编号逐渐增加的提案,但没有一个提案被选中。Proposer p完成了提案编号n1的阶段1。然后另一个proposer q完成了提案编号n2(n2>n1)的阶段1。Proposer p的针对n1提案的accept请求被忽略了,因为acceptor都已经承诺不会接受任何编号小于n2的新提案。因此proposer p就会使用新的提案编号n3(n3>n2)完成阶段1,导致proposer q的accept请求被忽略。如此循环往复。

To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals. If the distinguished proposer can communicate successfully with a majority of acceptors, and if it uses a proposal with number greater than any already used, then it will succeed in issuing a proposal that is accepted. By abandoning a proposal and trying again if it learns about some request with a higher proposal number, the distinguished proposer will eventually choose a high enough proposal number.

为了保证算法能有进展,必须选出一个proposer代表,只由它来尝试发起提案。如果proposer代表可以跟acceptor多数集正常通信,并且它使用的提案编号比以往用过的都大,那它就能成功发起会被接受的提案。如果proposer代表发现有更大编号的提案出现后放弃当前提案并重试,它最终会选出一个足够大的提案编号。

If enough of the system (proposer, acceptors, and communication network) is working properly, liveness can therefore be achieved by electing a single distinguished proposer. The famous result of Fischer, Lynch, and Patterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time—for example, by using timeouts. However, safety is ensured regardless of the success or failure of the election.

当系统有足够多的部分(proposer,acceptor们以及通信网络)能够正常运行,通过选出单个proposer代表,liveness就可以保证。Fischer,Lynch和Patternson[1]的著名研究结果表明,可靠的选举proposer的算法必须使用随机或者真实时间——例如,使用超时。但是,不管选举成功还是失败,安全性都能保证。

实现(The implementation)

The Paxos algorithm [5] assumes a network of processes. In its consensus algorithm, each process plays the role of proposer, acceptor, and learner. The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner. The Paxos consensus algorithm is precisely the one described above, where requests and responses are sent as ordinary messages. (Response messages are tagged with the corresponding proposal number to prevent confusion.) Stable storage, preserved during failures, is used to maintain the information that the acceptor must remember. An acceptor records its intended response in stable storage before actually sending the response.

Paxos算法[5]假定了一个由进程组成的网络。在Paxos的共识算法中,每个进程充当proposer,acceptor和learner的角色。算法会选出一个leader,作为proposer代表和learner代表。Paxos算法恰如上面描述的那样,请求和响应都是以普通的消息发送(响应消息中会带上相应的提案编号以防出现混乱)。稳定存储(故障发生时能够保留数据)用来保存acceptor必须要记住的信息。Acceptor在返回响应数据之前就要把数据保存到稳定存储中。

All that remains is to describe the mechanism for guaranteeing that no two proposals are ever issued with the same number. Different proposers choose their numbers from disjoint sets of numbers, so two different proposers never issue a proposal with the same number. Each proposer remembers (in stable storage) the highest-numbered proposal it has tried to issue, and begins phase 1 with a higher proposal number than any it has already used.

剩下的就是描述如何保证两个提案不会有相同编号的机制。不同的proposer从不相交集合中选择编号,因此两个不同的proposer绝不会发起编号相同的提案。每个proposer(在稳定存储中)记住它曾尝试发起的最大编号的提案,用从未使用过的更大的提案编号开始阶段1。

实现一个状态机(Implementing a State Machine)

A simple way to implement a distributed system is as a collection of clients that issue commands to a central server. The server can be described as a deterministic state machine that performs client commands in some sequence. The state machine has a current state; it performs a step by taking as input a command and producing an output and a new state. For example, the clients of a distributed banking system might be tellers, and the state-machine state might consist of the account balances of all users. A withdrawal would be performed by executing a state machine command that decreases an account’s balance if and only if the balance is greater than the amount withdrawn, producing as output the old and new balances.

一种简单的实现分布式系统的方式是由许多客户端向中心服务器发送命令。服务器可以看作是确定状态机,能够按某个顺序执行客户端命令。状态机有一个当前状态;它每次读入一个命令作为输入,得到输出结果和新的状态。例如,分布式银行系统的客户端可能是出纳员,状态机的状态可能由所有用户的账户余额组成。提款操作就是执行一个状态机命令,在账户余额大于要提款的金额时减少账户余额,并返回提款前、后的账户余额。

An implementation that uses a single central server fails if that server fails. We therefore instead use a collection of servers, each one independently implementing the state machine. Because the state machine is deterministic, all the servers will produce the same sequences of states and outputs if they all execute the same sequence of commands. A client issuing a command can then use the output generated for it by any server.

使用单个中心服务器的系统,服务器出现故障,系统就无法正常运行。因此我们使用一组服务器,每台服务器独立实现状态机。因为状态机是确定的,如果所有状态机执行相同的命令序列,它们就会生成同样的状态和输出序列。发起命令的客户端就可以使用任一个服务器执行命令后的输出。

To guarantee that all servers execute the same sequence of state machine commands, we implement a sequence of separate instances of the Paxos consensus algorithm, the value chosen by the i th instance being the i th state machine command in the sequence. Each server plays all the roles (proposer, acceptor, and learner) in each instance of the algorithm. For now, I assume that the set of servers is fixed, so all instances of the consensus algorithm use the same sets of agents.

为了保证所有服务器能以相同的顺序执行状态机命令,我们会运行一系列相互独立的实例,每个实例都执行Paxos共识算法,第i个实例选中的值就是第i个状态机命令。在每个实例中,每个server都会充当所有的角色(proposer,acceptor以及learner)。现在我会假定使用一组固定的服务器,因此共识算法的所有实例都使用同样的代理。

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm. Clients send commands to the leader, who decides where in the sequence each command should appear. If the leader decides that a certain client command should be the 135th command, it tries to have that command chosen as the value of the 135th instance of the consensus algorithm. It will usually succeed. It might fail because of failures, or because another server also believes itself to be the leader and has a different idea of what the 135th command should be. But the consensus algorithm ensures that at most one command can be chosen as the 135th one.

在常规操作中,某台服务器会被选为leader,它在共识算法的所有实例中作为proposer代表(只有它可以发起提案)。客户端发送命令到leader,由leader来决定这个命令的次序。如果leader决定某个客户端命令排在第135位,它会尝试让这个命令在共识算法的第135个实例中被选中。一般来说它会成功。如果出现了故障,或者有另一个server自认为是leader并在第135个实例中发起了另一个不同的提案,leader的提案就可能不被选中。但是共识算法会保证在第135实例中,只有一个命令会被选中。

Key to the efficiency of this approach is that, in the Paxos consensus algorithm, the value to be proposed is not chosen until phase 2. Recall that, after completing phase 1 of the proposer’s algorithm, either the value to be proposed is determined or else the proposer is free to propose any value.

I will now describe how the Paxos state machine implementation works during normal operation. Later, I will discuss what can go wrong. I consider what happens when the previous leader has just failed and a new leader has been selected. (System startup is a special case in which no commands have yet been proposed.)

这种方法高效的关键在于,在Paxos共识算法中,值要到阶段2才会被选出来。回想影响,proposer算法的阶段1完成之后,要么会(根据acceptor返回)确定要提交的值,要么由proposer自由选择要提交的值。

下面我会描述在常规操作中,Paxos状态机如何工作。然后我会讨论哪儿可能出错。我会探讨在前一个leader发生故障、而新的leader被选出来之后会发生什么(系统启动时比较特殊,此时没有提交任何命令)。

The new leader, being a learner in all instances of the consensus algorithm, should know most of the commands that have already been chosen. Suppose it knows commands 1–134, 138, and 139—that is, the values chosen in instances 1–134, 138, and 139 of the consensus algorithm. (We will see later how such a gap in the command sequence could arise.) It then executes phase 1 of instances 135–137 and of all instances greater than 139. (I describe below how this is done.) Suppose that the outcome of these executions determine the value to be proposed in instances 135 and 140, but leaves the proposed value unconstrained in all other instances. The leader then executes phase 2 for instances 135 and 140, thereby choosing commands 135 and 140.

新的leader(在所有实例中都是learner)应该直到大部分已被选中的命令。假设它直到了命令1-134,138以及139,也就是在相应共识算法实例中选中的值(后面我们会看到命令序列中的空隙是怎么产生的)。然后leader会执行135-137实例以及139之后实例的阶段1(后面我会描述是如何做的)。假设这些执行后的输出决定了在135和140实例中要提交的值,但不会影响所有其他实例中被提交的值。然后leader会执行135/140实例的阶段2,选出相应的命令。

The leader, as well as any other server that learns all the commands the leader knows, can now execute commands 1–135. However, it can’t execute commands 138–140, which it also knows, because commands 136 and 137 have yet to be chosen. The leader could take the next two commands requested by clients to be commands 136 and 137. Instead, we let it fill the gap immediately by proposing, as commands 136 and 137, a special “noop” command that leaves the state unchanged. (It does this by executing phase 2 of instances 136 and 137 of the consensus algorithm.) Once these no-op commands have been chosen, commands 138–140 can be executed.

Leader以及其它从leader获知所有命令的服务器就可以执行命令1-135了。但是因为命令136/137还没有选出来,leader还没法执行138-140(虽然它已经知道对应的命令是什么)。Leader本可以把下两个由客户端提交的命令作为136/137的命令。相反我们会提交特殊的“noop”命令(这个命令不会修改状态机的状态)来作为136/137命令,抹去命令空洞(Leader执行136/137实例的阶段2来完成这件事)。一旦这些no-op被选中,命令138-140就可以执行了。

Commands 1–140 have now been chosen. The leader has also completed phase 1 for all instances greater than 140 of the consensus algorithm, and it is free to propose any value in phase 2 of those instances. It assigns command number 141 to the next command requested by a client, proposing it as the value in phase 2 of instance 141 of the consensus algorithm. It proposes the next client command it receives as command 142, and so on.

此时命令1-140已经选出来了。Leader也已经完成了140之后所有实例的阶段1,它可以自由选择在这些实例的阶段2中要提交的值。它将命令编号141赋给客户端提交的下一个命令,提交此命令为实例141的阶段2的值。它会将收到的下一个客户端命令作为142号命令,如此持续运行。

The leader can propose command 142 before it learns that its proposed command 141 has been chosen. It’s possible for all the messages it sent in proposing command 141 to be lost, and for command 142 to be chosen before any other server has learned what the leader proposed as command 141. When the leader fails to receive the expected response to its phase 2 messages in instance 141, it will retransmit those messages. If all goes well, its proposed command will be chosen. However, it could fail first, leaving a gap in the sequence of chosen commands. In general, suppose a leader can get α commands ahead—that is, it can propose commands i + 1 through i +α after commands 1 through i are chosen. A gap of up to α − 1 commands could then arise.

在获悉已提交的命令141被选中之前,leader就可以提交142号命令了。有可能它为了提交141命令而发出的所有消息都丢失了,也可能142号命令都已经被选中了,却没有任何一个服务器收到过141命令相关的消息。如果在实例141的阶段2,leader没有收到预期的响应,它就会重发这些消息。如果一切正常,它提交的命令就会被选中。但是,它可能先失败,导致在命令序列中出现空隙。一般来说,假设leader可以提前处理α 个命令——也就是说,在命令1-i被选中之后,它就可以提交命令i+1 - i+α ——可能产生最大α - 1个命令的空隙。

A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm—in the scenario above, for instances 135–137 and all instances greater than 139. Using the same proposal number for all instances, it can do this by sending a single reasonably short message to the other servers. In phase 1, an acceptor responds with more than a simple OK only if it has already received a phase 2 message from some proposer. (In the scenario, this was the case only for instances 135 and 140.) Thus, a server (acting as acceptor) can respond for all instances with a single reasonably short message. Executing these infinitely many instances of phase 1 therefore poses no problem.

新选出的leader可以为无限多个实例执行阶段1 —— 上面的场景中的135-137以及大于139的所有实例。为这些实例使用同样的提案编号,leader只需要向所有其他服务器发送一条较短的消息就可以了。在阶段1,除非acceptor已经收到了其他proposer的阶段2消息,否则它返回成功就可以了(在这个场景里,只有实例135/140会出现这种情况 【TODO】)。因此,服务器(作为accetpor)可以为所有这些实例只返回一条较短的响应消息。所以,执行无限多实例的阶段1不会带来什么问题。

Since failure of the leader and election of a new one should be rare events, the effective cost of executing a state machine command—that is, of achieving consensus on the command/value—is the cost of executing only phase 2 of the consensus algorithm. It can be shown that phase 2 of the Paxos consensus algorithm has the minimum possible cost of any algorithm for reaching agreement in the presence of faults [2]. Hence, the Paxos algorithm is essentially optimal.

因Leader发生故障而选一个新的leader很少见,执行一个状态机命令的实际成本——即对命令/值达成共识——就是执行算法的阶段2的成本。在可能出现故障的情况下,所有的共识算法中,Paxos算法的阶段2有尽可能低的成本[2]。因此,根本上来说Paxos是最好的。

This discussion of the normal operation of the system assumes that there is always a single leader, except for a brief period between the failure of the current leader and the election of a new one. In abnormal circumstances, the leader election might fail. If no server is acting as leader, then no new commands will be proposed. If multiple servers think they are leaders, then they can all propose values in the same instance of the consensus algorithm, which could prevent any value from being chosen. However, safety is preserved—two different servers will never disagree on the value chosen as the i th state machine command. Election of a single leader is needed only to ensure progress.

上述对于系统常规操作的讨论假设总是只有一个leader,除非是在当前leader发生故障之后、新leader选出来之前的这段时间里。在异常情形下,leader选举可能失败。如果没有leader,也就不会有命令被提交。如果有多个服务器自认为是leader,在同一个实例中它们都可以提交值,这会导致没有值被选中。但是安全性是可以保证的——两个不同的服务器绝不会对哪个值被选为第i个状态机命令有异议。选一个leader出来是为了保证process。

If the set of servers can change, then there must be some way of determining what servers implement what instances of the consensus algorithm. The easiest way to do this is through the state machine itself. The current set of servers can be made part of the state and can be changed with ordinary state-machine commands. We can allow a leader to get α commands ahead by letting the set of servers that execute instance i + α of the consensus algorithm be specified by the state after execution of the i th state machine command. This permits a simple implementation of an arbitrarily sophisticated reconfiguration algorithm.

如果服务器不固定,必须要有某种方法来确定哪些服务器实现了哪些实例。最简单的方法是通过状态机自身。当前的服务器集合可以作为状态的一部分,并且可以通过普通的状态机命令改变。如果执行完第i个状态机命令后的状态指定了哪些服务器执行实例i+α ,leader就可以提前执行α 个命令【TODO】。这样不管多复杂的服务器重配置算法都可以很容易实现。

参考文献

[1] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.

[2] Idit Keidar and Sergio Rajsbaum. On the cost of fault-tolerant consensus when there are no faults—a tutorial. TechnicalReport MIT-LCS-TR-821, Laboratory for Computer Science, Massachusetts Institute Technology, Cambridge, MA, 02139, May 2001. also published in SIGACT News 32(2) (June 2001).

[3] Leslie Lamport. The implementation of reliable distributed multiprocess systems. Computer Networks, 2:95–114, 1978.

[4] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, July 1978.

[5] Leslie Lamport. The part-time parliament. ACM Transactions on Com- puter Systems, 16(2):133–169, May 1998.

译注参考文献