Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.
Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.
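As a rough illustration of the log layout described above, the following Python sketch models an entry as a term plus a command and addresses entries by a 1-based index. The names `LogEntry` and `entry_at`, and the sample commands, are assumptions made for this example; they are not taken from the Raft specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LogEntry:
    term: int      # term in which the leader received the command
    command: str   # opaque state machine command


def entry_at(log: List[LogEntry], index: int) -> Optional[LogEntry]:
    """Return the entry at a 1-based log index, or None if no such entry exists."""
    if 1 <= index <= len(log):
        return log[index - 1]
    return None


# Hypothetical log: list position i holds the entry with log index i + 1.
log = [LogEntry(1, "x<-3"), LogEntry(1, "y<-1"), LogEntry(2, "y<-9")]
third = entry_at(log, 3)
missing = entry_at(log, 4)
```

Storing the term next to each command is what later lets two servers compare their logs by looking only at an (index, term) pair.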
The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (e.g., entry 7 in Figure 6). This also commits all preceding entries in the leader’s log, including entries created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes, and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).
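The commitment rule can be sketched as follows. This is a minimal illustration under assumptions: `match_index` is a hypothetical list holding, for every server including the leader, the highest log index known to be stored on that server, and the sketch deliberately leaves out the extra restriction that Section 5.4 places on entries from earlier terms.

```python
def majority_replicated_index(match_index):
    """Highest log index stored on a majority of the servers.

    match_index has one element per server (leader included).
    """
    ranked = sorted(match_index, reverse=True)
    majority = len(match_index) // 2 + 1
    # The majority-th largest value is stored on at least `majority` servers.
    return ranked[majority - 1]


def apply_committed(log, last_applied, commit_index, apply):
    """Apply entries in log order up to commit_index; return the new last_applied."""
    while last_applied < commit_index:
        last_applied += 1
        apply(log[last_applied - 1])   # log indices are 1-based
    return last_applied


# Hypothetical five-server cluster: three servers hold entry 7.
commit_index = majority_replicated_index([7, 7, 7, 5, 2])

# A follower that learns commit_index == 2 applies entries 1 and 2, in order.
applied = []
last_applied = apply_committed(["a", "b", "c"], 0, 2, applied.append)
```

Because the value is computed over whole log prefixes, committing an index also commits every entry before it, which matches the statement that preceding entries are committed as well.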
We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system’s behavior and make it more predictable, but it is an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:
If two entries in different logs have the same index and term, then they store the same command.
If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.
The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.
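The follower-side check described above might look like the sketch below. It assumes the follower's log is represented only by its list of term numbers, and that a preceding index of 0 stands for "the new entries start at the beginning of the log"; both conventions, like the function name, are choices made for this example.

```python
def passes_consistency_check(follower_terms, prev_index, prev_term):
    """Return True if the follower holds an entry at prev_index with term prev_term.

    follower_terms[i] is the term of the entry at 1-based log index i + 1.
    """
    if prev_index == 0:
        # Base case of the induction: nothing precedes the new entries.
        return True
    if prev_index > len(follower_terms):
        # The follower's log is too short to contain the preceding entry.
        return False
    return follower_terms[prev_index - 1] == prev_term


follower_terms = [1, 1, 2]   # hypothetical follower log, terms only
```

If the check passes at `prev_index`, the Log Matching Property says that everything up to `prev_index` already agrees with the leader, so appending the new entries keeps the two logs identical through the last new entry.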
During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers’ logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.
In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show that this is safe when coupled with one more restriction.
To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.
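The backtracking procedure can be simulated in a few lines. The sketch below is a single-process illustration, not a networked implementation: logs are lists of hypothetical `(term, command)` tuples, `append_entries` and `sync_follower` are names invented here, and the term sequences are made up for the example.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Follower side: run the consistency check, then repair and extend the log."""
    if prev_index > 0:
        if prev_index > len(log) or log[prev_index - 1][0] != prev_term:
            return False                      # consistency check failed
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset       # 1-based index of this entry
        if index <= len(log):
            if log[index - 1][0] == entry[0]:
                continue                      # entry is already present
            del log[index - 1:]               # conflict: drop it and all that follow
        log.append(entry)
    return True


def sync_follower(leader_log, follower_log):
    """Leader side: decrement next_index until AppendEntries succeeds.

    Returns the number of RPCs that were needed.
    """
    next_index = len(leader_log) + 1          # index just after the leader's last entry
    rpcs = 0
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        rpcs += 1
        if append_entries(follower_log, prev_index, prev_term,
                          leader_log[next_index - 1:]):
            return rpcs
        next_index -= 1                       # rejected: back up one entry and retry


leader = [(t, f"L{i}") for i, t in enumerate([1, 1, 1, 4, 4, 5, 5, 6, 6, 6], 1)]
# The follower shares the first three entries, then diverges in terms 2 and 3.
follower = leader[:3] + [(t, f"F{i}") for i, t in enumerate([2, 2, 2, 3, 3, 3, 3, 3], 4)]
rpc_count = sync_follower(leader, follower)
```

In this run the first seven RPCs are rejected and the eighth, sent with `next_index` equal to 4, succeeds, after which the follower's log equals the leader's. The loop always terminates because a preceding index of 0 passes the check unconditionally.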
If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC will be required for each term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.
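One possible reading of this optimization is sketched below. The reply format `(success, conflict_term, first_index)` is an assumption of this example, as is the choice to return the index just past the follower's last entry when its log is too short to contain the preceding entry; the text above does not specify either detail. Logs are lists of hypothetical `(term, command)` tuples.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Follower side; on rejection, also report the conflicting term and the
    first index the follower stores for that term."""
    if prev_index > len(log):
        return False, None, len(log) + 1      # assumption: log too short
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        conflict_term = log[prev_index - 1][0]
        first = prev_index
        while first > 1 and log[first - 2][0] == conflict_term:
            first -= 1                        # walk back to the start of that term
        return False, conflict_term, first
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset
        if index <= len(log):
            if log[index - 1][0] == entry[0]:
                continue
            del log[index - 1:]
        log.append(entry)
    return True, None, None


def sync_follower(leader_log, follower_log):
    """Leader side: jump next_index back a whole term per rejection."""
    next_index = len(leader_log) + 1
    rpcs = 0
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        rpcs += 1
        ok, _conflict_term, first_index = append_entries(
            follower_log, prev_index, prev_term, leader_log[prev_index:])
        if ok:
            return rpcs
        next_index = first_index              # bypass every entry of the conflicting term


leader = [(t, f"L{i}") for i, t in enumerate([1, 1, 1, 4, 4, 5, 5, 6, 6, 6], 1)]
follower = leader[:3] + [(t, f"F{i}") for i, t in enumerate([2, 2, 2, 3, 3, 3, 3, 3], 4)]
rpc_count = sync_follower(leader, follower)
```

With these sample logs the follower has conflicting entries in two terms, so the leader sees two rejections and succeeds on the third RPC; backing up one index per rejection over the same logs would take eight RPCs. This sketch uses only the returned index, and the returned index is always smaller than the rejected `next_index`, so the loop still terminates.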
With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).
This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact performance.
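Once a leader is elected, it starts serving client requests. Each client request carries a command to be executed by the replicated state machines. The leader adds the command to its log as a new entry, then sends AppendEntries RPCs in parallel to the other servers to replicate the entry. Once the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result to the client. If followers crash or run slowly, or if the network drops packets, the leader keeps resending AppendEntries RPCs indefinitely (even after it has replied to the client) until every follower has stored all log entries.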
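Figure 6 shows how logs are organized. Each log entry holds a state machine command together with the term in which the leader received it. The term stored in each entry is used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index marking its position in the log.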
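The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by every available state machine. A log entry is committed once the leader that created it has replicated it on a majority of the servers (for example, entry 7 in Figure 6). This also commits all earlier entries in the leader’s log, including those created by previous leaders. Section 5.4 discusses some subtleties of applying this rule after a leader change and shows that this definition of commitment is safe. The leader tracks the highest index it knows to be committed and includes that index in future AppendEntries RPCs (heartbeats included), so that the other servers eventually learn it. Once a follower learns that an entry is committed, it applies the entry to its local state machine (in log order).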
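We designed Raft’s log mechanism to keep the logs on different servers highly consistent. This simplifies the system’s behavior and makes it more predictable, and it is also an important part of ensuring safety. Raft maintains the following properties, which together make up the Log Matching Property shown in Figure 3: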
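If two entries in different logs have the same index and term, then they store the same command.
If two entries in different logs have the same index and term, then all preceding entries in the two logs are identical.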
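The first property follows from the fact that a leader creates at most one entry at a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower cannot find an entry with the same index and term in its log, it rejects the new entries. The consistency check works as an induction step: the initially empty logs satisfy the Log Matching Property, and the check preserves the property each time a log is extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own up through the new entries.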
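During normal operation, the logs of the leader and the followers stay consistent, so the AppendEntries consistency check never fails. A leader crash, however, can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can accumulate over a series of leader and follower crashes. Figure 7 shows the ways in which a follower’s log may differ from that of a new leader. A follower may be missing entries that the leader has, it may have extra entries that the leader lacks, or both. Missing and extra entries in a log may span multiple terms.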
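In Raft, the leader resolves inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in a follower’s log are overwritten with entries from the leader’s log. Section 5.4 will show that this is safe once one more restriction is added.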
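To bring a follower’s log into agreement with its own, the leader must find the latest log entry on which the two logs agree, delete every entry in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of this happens in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry it will send to that follower. When a leader first comes to power, it initializes every nextIndex to the index just after the last entry in its log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the consistency check fails on the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex reaches a point where the leader and follower logs match. When that happens, AppendEntries succeeds, which removes any conflicting entries from the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it stays that way for the rest of the term.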
Original article: http://my.oschina.net/daidetian/blog/488778