Synchronization and Fault Tolerance Techniques in Concurrent Shared Memory Systems
Date
Authors
ORCID
Journal Title
Journal ISSN
Volume Title
Publisher
item.page.doi
Abstract
Mutual exclusion is one of the most commonly used techniques to handle contention in concurrent systems. Traditionally, mutual exclusion algorithms have been designed under the assumption that a process does not fail while acquiring/releasing a lock or while executing its critical section. However, failures do occur in real life, potentially leaving the lock in an inconsistent state. This gives rise to the problem of recoverable mutual exclusion (RME) that involves designing a mutual exclusion (ME) algorithm that can tolerate failures, while maintaining safety and liveness properties. With the recent development of NVRAM (non-volatile random-access memory) technologies, there is renewed interest in the RME problem. The NVRAM technology is a combination of the low latency of traditional random-access memory with the high persistence of disk storage media. NVRAMs can be used to provide near-instantaneous recovery to many problems including the RME problem. This work describes techniques for designing efficient algorithms to solve the RME problem under two different failure models, independent failure model and system-wide failure model, depending on whether processes fail independently or simultaneously. Additionally, especially for systems with low memory capacity, this work describes fault-tolerant techniques for reclaiming memory, in case there is no built-in support for garbage collection. The primary measure of an RME algorithm is its performance. Performance of any ME algorithm, including an RME algorithm, is measured by the number of remote memory references (RMRs) made by a process—for acquiring and releasing a lock as well as recovering the lock structure after a failure. Loosely speaking, it represents the number of expensive shared memory instructions. In this work, two models of RMR computation are considered: (a) the CC model, and (b) the DSM model. The results mentioned in this work are applicable to both of these computation models. For the independent failure model, this work presents a framework that transforms any algorithm that solves the RME problem into an algorithm whose performance (in terms of RMRs) can simultaneously adapt to (a) the number of processes competing for the lock, as well as (b) the number of failures that have occurred in the recent past, while maintaining the correctness and performance properties of the underlying RME algorithm. Assume that, for n processes, the RMR complexity of the underlying RME algorithm is R(n). Then, this framework yields an RME algorithm for which the RMR complexity is given by O(min{c, ¨ √ F + 1, R(n)}), where ¨c denotes the point contention (number of active processes) and F denotes the number of failures in the recent past. The system-wide failure model is a special case of the independent failure model that assumes that failures only occur simultaneously. For example, a power outage is a real life example of such a failure. This model makes a stronger assumption than just multiple independent failures. This assumption is leveraged with enhanced RME algorithms presented under this model. For the system-wide failure model, this work presents optimal RME algorithms (and related transformations) whose worst-case performance yield a O(1) RMR complexity. The fault-tolerant memory reclamation algorithm provides novel techniques to bound the worst-case space complexity of RME algorithms. The techniques used are general enough that they may also be employed to bound the space complexity of other RME algorithms. Its RMR complexity is merely an additive factor of O(1).