Reducing the Complexity of Fault-Tolerant Behavioral Hardware Accelerators
Abstract
Continuous technology scaling has made it possible to integrate a large number of different hardware components on the same integrated circuit (IC). These complex ICs are typically called Systems-on-Chip (SoCs). Area, power, and performance have traditionally been the most important design metrics, but for many safety-critical applications reliability is equally important. Fault tolerance can therefore no longer be a second-class citizen and must be considered early in the design process of these complex ICs. Because these SoCs are heterogeneous, no single fault-tolerance solution fits all of their components. Dedicated solutions have been proposed for the embedded processor, the memory, the different interfaces, and the dedicated hardware accelerators. In the processor case, for example, program execution relies on control flow instructions that determine which section of code is executed at run time. A single event upset (SEU) can alter the execution order of the program. Thus, in this thesis we study how transient errors corrupt program control flow and present a methodology to detect such corruption at the software level. This is done by inserting additional control flow instructions directly into the assembly code after a static control flow analysis is performed. Moreover, one key differentiating element between SoCs is the set of hardware accelerators they contain. Most other components in an SoC are off-the-shelf modules, so the main differentiator among SoC offerings is typically the mix of hardware accelerators they include. Due to the long design cycles of these complex systems, these accelerators are now often designed at the behavioral level, and High-Level Synthesis (HLS) is used to generate the Register Transfer Level (RTL) code of the accelerator. It is therefore imperative to introduce low-overhead fault-tolerance techniques for accelerators described at the behavioral level.
This thesis presents different techniques to reduce the overhead associated with traditional N-modular redundancy techniques for these accelerators.
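For readers unfamiliar with N-modular redundancy, the following is a minimal software sketch of the baseline idea, not the method developed in the thesis. The kernel `accelerator`, its faulty variant, and the helper names `majority_vote` and `run_nmr` are hypothetical and chosen only for illustration. The sketch runs N copies of the same computation and takes a majority vote on their outputs, which also shows where the overhead comes from: N copies of the logic plus a voter.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(outputs: Sequence[int]) -> int:
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def run_nmr(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def accelerator(x: int) -> int:
    # Hypothetical stand-in for an accelerator kernel (assumption).
    return 3 * x + 1


def faulty_accelerator(x: int) -> int:
    # Same kernel with one output bit flipped, mimicking a transient fault.
    return accelerator(x) ^ 0b100


if __name__ == "__main__":
    # Triple modular redundancy (N = 3): one faulty replica is outvoted.
    print(run_nmr([accelerator, accelerator, faulty_accelerator], 5))
```

With N = 3, a fault in any single replica is masked by the other two; if every replica disagrees, the voter in this sketch raises an error instead of returning a value.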