Fault Tolerance in Hardware Accelerators: Detection and Mitigation








In the age of self-driving cars and space exploration, fault tolerance has become a first-order design metric. It is therefore vital to incorporate fault tolerance coherently into the Very Large Scale Integration (VLSI) design process. This is especially true for state-of-the-art, complex heterogeneous Systems-on-Chip (SoCs), which typically contain a variety of dedicated hardware accelerators. These SoCs have to be taped out on ever shorter schedules while their complexity keeps increasing. This is driving designers to finally embrace C-based VLSI design, known as High-Level Synthesis (HLS). HLS has been shown to significantly reduce design and verification time compared to low-level hardware description languages. Moreover, one significant advantage of raising the level of abstraction is that C-based VLSI design allows a variety of micro-architectures with different trade-offs to be generated from the same untimed behavioral description. Fault tolerance at the hardware level has so far been based mainly on N-modular redundant (NMR) systems, such as duplication and Triple Modular Redundancy (TMR), in which the hardware channels are identical. In this work, we exploit the ability of HLS to generate micro-architectures with different characteristics from the same behavioral description in order to generate fault-tolerant systems automatically. In particular, we first propose an automated framework that, given a single behavioral description, generates a set of NMR systems with different area and performance trade-offs by choosing different mixes of micro-architectures. Secondly, we leverage this ability to generate redundant systems that minimize common mode failures (CMFs). A CMF occurs when a single fault affects multiple modules in the redundant system at the same time. It has been shown that adding diversity to the hardware channels can make the system more tolerant to this type of fault.
We also leverage machine learning to estimate diversity through fast and efficient predictive methods, significantly speeding up the generation of redundant systems. We further investigate whether a previously reported design diversity metric, the Diversity Metric based on circuit Path analysis (DIMP), or an RT-level fault-injection-based method can achieve results similar to those of diversity calculation based on gate-netlist fault injection. Lastly, we propose a low-cost, universal fault-recovery/repair method that uses supervised machine learning techniques to ameliorate the effect of permanent faults in hardware accelerators that can tolerate inexact outputs.
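
As a minimal illustration of the TMR and common mode failure concepts mentioned in the abstract, the sketch below models a bitwise 2-of-3 majority voter in Python. It is not the framework proposed in this work: the function names (`majority_vote`, `inject_stuck_at_1`), the 8-bit example value, and the stuck-at-1 fault model are all assumptions chosen for illustration only.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant channel outputs."""
    return (a & b) | (a & c) | (b & c)


def inject_stuck_at_1(value: int, bit: int) -> int:
    """Hypothetical fault model: force one output bit to 1."""
    return value | (1 << bit)


# Fault-free output that all three channels should produce (assumed value).
golden = 0b1010_0101

# Single fault: only one channel is corrupted, so the voter masks it.
single_fault = majority_vote(inject_stuck_at_1(golden, 1), golden, golden)

# Common mode failure: the same fault hits two identical channels at once,
# so the faulty value wins the vote and the error reaches the output.
common_mode = majority_vote(
    inject_stuck_at_1(golden, 1),
    inject_stuck_at_1(golden, 1),
    golden,
)

print(f"golden       = {golden:#010b}")
print(f"single fault = {single_fault:#010b}")
print(f"common mode  = {common_mode:#010b}")
```

In this toy model the voter recovers the fault-free value when a single channel fails, but not when two identical channels fail in the same way. That is the situation that diverse hardware channels are intended to make less likely.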



Fault tolerance (Engineering), Machine learning, Integrated circuits -- Very large scale integration, Systems on a chip, Hardware