Failure Detectors

Formal Models

Events and Links

Safety and liveness of distributed protocols

Reliable Broadcast Primitives

Causal Broadcast

Shared Memory/Data Consistency

Impossibility of Consensus

Consensus Algorithms

Probabilistic Consensus

Fast Consensus

Non-blocking Atomic Commit

Terminating Reliable Broadcast (TRB)

Byzantine Fault Tolerance

Consensus

Paxos

Replicated State Machines and Reconfiguration

CS 294-91 Distributed Computing UC Berkeley Winter 2013

This course provides basic theoretical and practical foundations of distributed systems. Students learn about system models, safety and liveness of protocols, different failure models, reliable group communication abstractions, and more. It uses a textbook, supplemented by lectures based on research papers.

In the past decade, ever more applications and services that previously ran on local PCs have moved to the Internet, running in data centers and accessible through the Web. This puts distributed systems at the center of many of today's application architectures. Distributed systems (or distributed computing) concerns systems in which many nodes (machines) solve a common problem, using message passing over a network that connects those nodes. The aim of this course is to establish familiarity with the basic theoretical and practical foundations of distributed systems.

Distributed computing is challenging due to two fundamental problems: (i) partial failures, and (ii) asynchrony. Partial failure means that parts of the system (network or machines) can be faulty while the rest of the system is still expected to function correctly. Asynchrony arises from the variance in the time it takes to send messages between computers and in the operating speeds of different computers. It is therefore desirable for the system to function correctly even though events happen asynchronously.
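To make the interaction between these two challenges concrete, here is a minimal sketch of a timeout-based failure detector. It is an illustration only, not an algorithm taken from the course material: the class `TimeoutFailureDetector`, its methods, the node names, and the 2-second timeout are all hypothetical. Time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TimeoutFailureDetector:
    """Hypothetical detector: suspects a node whose last heartbeat is too old."""

    timeout: float
    last_heartbeat: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the local arrival time of the latest heartbeat from `node`.
        self.last_heartbeat[node] = now

    def suspected(self, now: float) -> Set[str]:
        # A node is suspected if its last heartbeat is older than the timeout.
        return {
            node
            for node, seen in self.last_heartbeat.items()
            if now - seen > self.timeout
        }


detector = TimeoutFailureDetector(timeout=2.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)

# "a" keeps sending on time; "b" is alive, but its next message is delayed.
detector.heartbeat("a", now=1.5)
suspects_at_3 = detector.suspected(now=3.0)   # "b" is suspected, although it has not crashed

# The delayed heartbeat from "b" finally arrives, so the suspicion is revised.
detector.heartbeat("b", now=3.2)
suspects_after = detector.suspected(now=3.4)  # nobody is suspected any more
```

The sketch shows why asynchrony complicates handling partial failures: from the observer's point of view, a slow node and a crashed node look the same until a late message arrives, so any fixed timeout can produce false suspicions.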

Over the years, many recurring problems have been studied with respect to the two aforementioned challenges. Furthermore, many abstractions have been proposed that simplify dealing with these two challenges when building distributed systems. In this course we will study many of these problems and abstractions, including the following:

- Models of distributed systems
- Safety and liveness of distributed protocols
- Different failure models for distributed systems (fail-stop, fail-noisy, Byzantine)
- Reliable group communication abstractions (reliable, atomic, etc.)
- Shared memory and consistency models (linearizable, regular, etc.)
- Failure detectors, their relationships, and their implementation in real systems
- Impossibility of Consensus
- Consensus and Paxos
- Replicated State Machines and Reconfiguration
- Byzantine Fault Tolerance

We will loosely follow the textbook below, supplemented by additional lectures based on research papers:

[Introduction to Reliable and Secure Distributed Programming](http://www.amazon.com/Introduction-Reliable-Secure-Distributed-Programming/dp/3642152597), C. Cachin, R. Guerraoui, L. Rodrigues, Springer, 2011.
