Tags: Distributed Systems
A. Practical Byzantine Fault Tolerance
1. What's its checkpoint?
We will refer to the states produced by the execution of these requests as checkpoints, and a checkpoint with a proof is a stable checkpoint. When the replication code invokes the make_checkpoint upcall, snfsd sets all the copy-on-write bits and creates a (volatile) checkpoint record containing the current sequence number, a list of blocks, and the digest of the current state. snfsd computes the digest of a checkpoint state as part of the make_checkpoint upcall. Although checkpoints are only taken occasionally, it is important to compute the state digest incrementally because the state may be large. We can avoid sending the entire checkpoint by partitioning the state and stamping each partition with the sequence number of the last request that modified it. To bring a replica up to date, it is only necessary to send it the partitions where it is out of date, rather than the whole checkpoint. A request that has executed tentatively may abort if there is a view change and it is replaced by a null request. In this case the replica reverts its state to the last stable checkpoint in the new-view message or to its last checkpointed state (whichever has the higher sequence number).
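To make the incremental-digest and partitioned-state idea concrete, here is a minimal Python sketch (my own illustration, not the paper's snfsd code): the state is split into partitions, each stamped with the sequence number of the last request that modified it, per-partition digests are combined into the checkpoint digest, and only out-of-date partitions need to be sent to a lagging replica.

```python
import hashlib

class PartitionedState:
    """Toy model of a checkpointed state split into fixed partitions.

    Each partition keeps the sequence number of the last request that
    modified it, so a lagging replica only needs the partitions whose
    last_modified is newer than its own checkpoint.
    """

    def __init__(self, num_partitions=4):
        self.partitions = [b"" for _ in range(num_partitions)]
        self.last_modified = [0] * num_partitions
        # Per-partition digests let us update the state digest
        # incrementally instead of rehashing the whole state.
        self.digests = [hashlib.sha256(b"").digest() for _ in range(num_partitions)]

    def apply(self, seqno, partition, data):
        """Apply a request's effect to one partition."""
        self.partitions[partition] = data
        self.last_modified[partition] = seqno
        self.digests[partition] = hashlib.sha256(data).digest()

    def checkpoint_digest(self, seqno):
        """Digest of the whole state: hash over the per-partition digests."""
        h = hashlib.sha256(seqno.to_bytes(8, "big"))
        for d in self.digests:
            h.update(d)
        return h.hexdigest()

    def partitions_to_send(self, peer_seqno):
        """Partitions that a replica checkpointed at peer_seqno is missing."""
        return [i for i, m in enumerate(self.last_modified) if m > peer_seqno]


state = PartitionedState()
state.apply(seqno=1, partition=0, data=b"file-a v1")
state.apply(seqno=2, partition=3, data=b"file-b v1")
print(state.checkpoint_digest(seqno=2))
# A replica that is stable up to seqno 1 only needs partition 3.
print(state.partitions_to_send(peer_seqno=1))  # -> [3]
```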
2. About its advantages and limitations
Advantages:
(a) To tolerate Byzantine faults, the paper proposes an algorithm that works in an asynchronous system like the Internet. Previous systems, such as Rampart and SecureRing, rely on the synchrony assumption for correctness, which is dangerous in the presence of malicious attacks.
(b) It also uses an efficient authentication scheme based on message authentication codes during normal operation; public-key cryptography, which was cited as the major latency and throughput bottleneck in Rampart, is used only when there are faults (see the MAC sketch after this list).
(c) The paper uses the approach to implement a real service: a Byzantine-fault-tolerant distributed file system that supports the NFS protocol. The system is only 3% slower than the standard NFS daemon in the Digital Unix kernel during normal-case operation.
(d) It provides experimental results that quantify the cost of the replication technique.
(e) Applying the read-only optimization to lookup improves the performance of BFS significantly and reduces the overhead relative to BFS-nr to 20%.
(f) In systems that rely on failure detection to exclude replicas, if the primary has actually failed, the group is unable to process client requests until the detection delay has expired. This algorithm is not vulnerable to that problem because it never needs to exclude replicas from the group.
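A rough sketch of the MAC-based authenticator mentioned in (b), assuming pairwise shared keys between nodes (the key setup and message layout are invented for illustration, not the paper's wire format): the sender computes one MAC per replica instead of a single public-key signature, and each replica verifies only its own entry.

```python
import hmac, hashlib

# Hypothetical pairwise session keys: shared_keys[i] is the key this
# node shares with replica i (key establishment is out of scope here).
shared_keys = {0: b"key-with-replica-0", 1: b"key-with-replica-1",
               2: b"key-with-replica-2", 3: b"key-with-replica-3"}

def make_authenticator(message: bytes) -> dict:
    """One MAC per replica; far cheaper than a public-key signature."""
    return {i: hmac.new(k, message, hashlib.sha256).digest()
            for i, k in shared_keys.items()}

def verify(message: bytes, authenticator: dict, my_id: int) -> bool:
    """Replica my_id checks only its own entry in the authenticator."""
    expected = hmac.new(shared_keys[my_id], message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, authenticator.get(my_id, b""))

msg = b"<PRE-PREPARE, v=0, n=7, digest>"
auth = make_authenticator(msg)
print(verify(msg, auth, my_id=2))  # True
```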
Limitations:
(a) The algorithm does not address the problem of fault-tolerant privacy: a faulty replica may leak information to an attacker. The authors plan to investigate secret-sharing schemes to solve this problem in the future.
(b) The paper assumes that a client waits for one request to complete before sending the next one, but clients could be allowed to make asynchronous requests while still preserving ordering constraints on them.
(c) The overhead introduced by the replication library is due to extra computation and communication, such as executing cryptographic operations and an extra message round-trip.
(d) There is still much work to do on improving the system. One problem of special interest is reducing the amount of resources required to implement the algorithm. The number of replicas can be reduced by using f replicas as witnesses that are involved in the protocol only when some full replica fails. The authors also believe it is possible to reduce the number of copies of the state to f+1, but the details remain to be worked out.
(e) The approach cannot mask a software error that occurs at all replicas. However, it can mask errors that occur independently at different replicas, including nondeterministic software errors.
3. Others about the paper
It introduces the three-phase protocol for normal-case operation. The protocol aims to totally order requests. Here is my question: does the three-phase protocol itself tolerate Byzantine faults, or does it only totally order requests?
Another question is about view changes: how is the new primary selected? The paper says that when the primary p of view v+1 receives 2f valid view-change messages for view v+1 from other replicas, it multicasts a new-view message to all other replicas. Is this related to the question above?
My third question is that I still don't fully understand the difference between how the algorithm behaves in a synchronous environment and in an asynchronous environment.
Maybe I couldn't rebuild the system just from the paper, but there are some approaches worth learning from it. For example, for checkpoints, we can compute the state digest incrementally and avoid sending the entire checkpoint by partitioning the state.
B. Paxos Made Practical
1. What’s its checkpoint?
Since view numbers are monotonically increasing, the combination of view-id and timestamp, which the paper calls a viewstamp, determines the execution order of all requests over time. When the primary receives a request from a client, it sends a message to the backups that includes a field called committed. This field specifies a viewstamp below which the server has executed all requests and sent their results back to clients. These committed operations never need to be rolled back and can therefore be executed at backups. If the primary's reply to the client gets lost and the primary subsequently fails, the cohort can re-execute the request after a reboot. In short, the checkpoint records which requests have been executed, so that after a failure we don't have to re-execute from the beginning. My question: does the checkpoint record the whole request, or just one operation of the request?
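A small Python sketch of the viewstamp ordering described above (class and field names are mine, not the paper's code): comparing (view_id, timestamp) pairs lexicographically gives the execution order, and a committed watermark tells a backup which logged operations can safely be executed.

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Viewstamp:
    """(view_id, timestamp): lexicographic order == execution order."""
    view_id: int
    timestamp: int

class Backup:
    def __init__(self):
        self.log = {}                        # viewstamp -> request
        self.committed = Viewstamp(0, 0)     # everything <= this has been executed

    def on_replicate(self, vs: Viewstamp, request, committed: Viewstamp):
        self.log[vs] = request
        # Operations at or below the primary's committed watermark will
        # never be rolled back, so the backup can execute them now.
        self.committed = max(self.committed, committed)
        for stamp in sorted(s for s in self.log if s <= self.committed):
            execute(self.log.pop(stamp))

def execute(request):
    print("executing", request)

b = Backup()
b.on_replicate(Viewstamp(1, 1), "write x", committed=Viewstamp(0, 0))
b.on_replicate(Viewstamp(1, 2), "write y", committed=Viewstamp(1, 1))  # executes "write x"
```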
2. About its advantages and limitations
Advantages:
(a) I think it builds a clear view-change protocol. It considers a crashed cohort that fails to respond to messages as well as a new cohort that may wish to join the system. It is a multi-step process, like Paxos, that involves first proposing a new view-id and then proposing the new view.
(b) The primary can temporarily respond to read-only requests without involving backups if a majority of backups promise not to form a new view for 60 seconds, which makes the replication protocol more efficient (see the lease sketch after this list).
(c) It employs a third machine, called a witness, that ordinarily sits idle without executing requests but can participate in the consensus protocol to allow one of the other two replicas to form a view after a failure or network partition.
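A sketch of the lease idea in (b), with simplified timing and message details of my own: the primary answers read-only requests locally only while it holds unexpired promises from a majority of backups not to form a new view for 60 seconds.

```python
import time

LEASE_SECONDS = 60  # backups promise not to form a new view for this long

class Primary:
    def __init__(self, num_backups):
        self.num_backups = num_backups
        self.promises = {}  # backup_id -> expiry time of its promise

    def on_promise(self, backup_id):
        self.promises[backup_id] = time.monotonic() + LEASE_SECONDS

    def holds_lease(self):
        now = time.monotonic()
        live = sum(1 for exp in self.promises.values() if exp > now)
        # A majority of backups must still be inside their promise window.
        return live >= self.num_backups // 2 + 1

    def read(self, query, state):
        if self.holds_lease():
            return state.get(query)  # answer locally, no backup round-trip
        raise RuntimeError("lease expired; run the read through the normal protocol")

p = Primary(num_backups=2)
p.on_promise(0)
p.on_promise(1)
print(p.read("x", {"x": 42}))  # 42, served from the primary alone
```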
Limitations:
(a) In Practical Byzantine Fault Tolerance, the system uses view changes only to select a new primary, never to select a different set of replicas to form the new view, whereas Paxos Made Practical uses view changes to form a new set of cohorts. So I think the latter could learn from the former to reduce overhead.
(b) I believe it realizes consensus and view changes, but it doesn't include a performance evaluation.
3. Others about the paper
In fact, the paper aims to solve two limitations of Viewstamped Replication: (i) it doesn't show how to replicate a simple system; (ii) it assumes that the set of possible cohorts is fixed over time. The paper addresses both, making Paxos practical and easy to understand.
C. Rex vs. Eve
Rex and Eve both target multi-core servers. The Rex paper often refers to Eve, so here I want to put them together and let them have a trial of strength with each other.
1. What’s its checkpoint?
Eve is an execute-verify architecture and executes every batch of requests concurrently, so its checkpoint should be different from that of traditional SMR protocols. To achieve efficient state comparison and fine-grained checkpointing and rollback, Eve stores the state using a copy-on-write Merkle tree, whose root is a concise representation of the entire state.
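A toy sketch of the Merkle-tree idea (the hashing scheme only, not Eve's copy-on-write implementation): the state is split into chunks, leaves are chunk hashes, and the root is a concise fingerprint, so two replicas can compare their entire states by comparing one hash.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(chunks):
    """Root hash over a list of state chunks."""
    level = [h(c) for c in chunks] or [h(b"")]
    while len(level) > 1:
        if len(level) % 2:                       # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

state_a = [b"account:1=100", b"account:2=50", b"log:...", b"meta"]
state_b = [b"account:1=100", b"account:2=51", b"log:...", b"meta"]  # diverged
print(merkle_root(state_a) == merkle_root(state_a))  # True: identical states agree
print(merkle_root(state_a) == merkle_root(state_b))  # False: divergence shows at the root
```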
Rex resorts to a general checkpointing framework to alleviate this burden. It does not have the primary checkpoint periodically during its execution. Checkpointing cannot be done on a state where a request has not been processed completely, because Rex does not have sufficient information for a replica to continue processing an incomplete request when restarting from that checkpoint. When Rex decides to create a checkpoint, the primary sets the checkpoint flag, so that all threads pause before taking on any new request. When a checkpoint is available on a replica, any committed trace before the cut points of that checkpoint is no longer needed and can be garbage collected.
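A rough sketch of the pausing behavior described above, using my own simplified threading structure rather than Rex's framework: once a checkpoint is requested, threads stop picking up new requests, and the snapshot is taken only after in-flight requests finish, so it never captures a half-processed request.

```python
import threading

class CheckpointableState:
    """Checkpoints are taken only between requests, never mid-request."""

    def __init__(self):
        self._cv = threading.Condition()
        self._checkpoint_pending = False
        self._active_requests = 0
        self.data = {"counter": 0}

    def process(self, amount):
        with self._cv:
            # Pause before starting a new request while a checkpoint is pending.
            while self._checkpoint_pending:
                self._cv.wait()
            self._active_requests += 1
        try:
            self.data["counter"] += amount       # the request itself
        finally:
            with self._cv:
                self._active_requests -= 1
                self._cv.notify_all()

    def checkpoint(self):
        with self._cv:
            self._checkpoint_pending = True
            # Wait for in-flight requests to finish; none may be half-processed.
            while self._active_requests:
                self._cv.wait()
            snapshot = dict(self.data)
            self._checkpoint_pending = False
            self._cv.notify_all()
            return snapshot

s = CheckpointableState()
threads = [threading.Thread(target=s.process, args=(1,)) for _ in range(8)]
for t in threads:
    t.start()
print(s.checkpoint())                            # consistent snapshot
for t in threads:
    t.join()
```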
2. About its advantages and limitations
Eve Advantages:
(a) Eve achieves a speedup of 6.5x over sequential execution with 16 execution threads.
(b) From Figure 8 in the paper, I think Eve has a good ability to recover from failure.
3. Others about the paper
Does multithreaded execution have any relation to Byzantine faults?
Eve realizes deterministic parallelism with a mixer, which partitions the set of requests into non-conflicting batches so that Eve can execute these batches concurrently without conflicts. Rex, by contrast, is an execute-agree-follow model. It first lets the primary execute requests concurrently and freely, recording the nondeterministic decisions it makes into a trace during execution. The other machines then agree on the trace to ensure consensus, and secondary replicas execute requests concurrently according to the trace. I think Eve and Rex both identify the nondeterministic decisions and conflicts among requests and then let replicas agree on them to ensure consensus. My question is: do their approaches to capturing nondeterministic decisions have the same overhead? Which one is better?
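A toy sketch of the mixer idea, assuming each request declares the keys it touches (a simplification of my own, not Eve's actual conflict detection): requests are greedily grouped into batches whose key sets are disjoint, so each batch can execute concurrently without conflicts.

```python
def mix(requests):
    """Greedy partition of requests into batches with disjoint key sets.

    requests: list of (name, set_of_keys_touched)
    returns:  list of batches; requests within a batch never share a key.
    """
    batches = []   # each entry: (set of keys used, list of request names)
    for name, keys in requests:
        for used, batch in batches:
            if used.isdisjoint(keys):       # no conflict with this batch
                used |= keys
                batch.append(name)
                break
        else:
            batches.append((set(keys), [name]))
    return [batch for _, batch in batches]

reqs = [("r1", {"x"}), ("r2", {"y"}), ("r3", {"x", "z"}), ("r4", {"w"})]
print(mix(reqs))   # [['r1', 'r2', 'r4'], ['r3']] -- r3 conflicts with r1 on 'x'
```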
About Rex: since unnecessary causal edges can be removed to reduce overhead, if a client issues three requests and the execution stage shows that they are independent, there is no causal edge among them and they can be executed concurrently. My question is whether we can know they are independent without executing them, like a preprocessing step, so that the execute stage could be omitted.
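A small sketch in the spirit of the causal-edge discussion above (my own simplification, not Rex's instrumentation): from a trace of which request touched which resource, an edge is added only between requests that share a resource, so independent requests end up with no edge and can be replayed concurrently.

```python
def causal_edges(trace):
    """Derive causal edges from a recorded trace of resource accesses.

    trace: list of (request_id, resource) events in primary execution order.
    An edge (a -> b) is added only when b touches a resource last touched
    by a different request a; independent requests get no edge and may be
    replayed concurrently on the secondaries.
    """
    last_toucher = {}
    edges = set()
    for req, resource in trace:
        prev = last_toucher.get(resource)
        if prev is not None and prev != req:
            edges.add((prev, req))
        last_toucher[resource] = req
    return edges

# Three requests: r1 and r2 are independent; r3 touches what r1 touched.
trace = [("r1", "lock:x"), ("r2", "lock:y"), ("r3", "lock:x")]
print(causal_edges(trace))   # {('r1', 'r3')} -- no edge between r1 and r2
```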
Reading Papers about Distributed Replication
Original post: http://blog.csdn.net/bluecloudmatrix/article/details/39448679