Fault tolerance is the ability of a system to continue operating properly in the event of component failure. It allows for graceful degradation, where the system can still operate at a reduced level, and can be achieved through anticipating exceptional conditions and duplication. Reversion to a safe mode may also be necessary if the consequences of a system failure are catastrophic.
Carnegie Mellon University
Fall 2020
A course offering both theoretical understanding and practical experience in distributed systems. Key themes include concurrency, scheduling, network communication, and security. Real-world protocols and paradigms like distributed filesystems, RPC, MapReduce are studied. Course utilizes C and Go programming languages.
No concepts data
+ 34 more concepts