Fault Tolerance in Hardware Accelerators: Detection and Mitigation

dc.contributor.ORCID: 0000-0003-1317-3755 (Taher, FN)
dc.contributor.advisor: Schaefer, Benjamin Carrion
dc.creator: Taher, Farah Naz
dc.date.accessioned: 2020-12-09T21:00:31Z
dc.date.available: 2020-12-09T21:00:31Z
dc.date.created: 2019-12
dc.date.issued: 2019-12
dc.date.submitted: December 2019
dc.date.updated: 2020-12-09T21:00:32Z
dc.description.abstract: In the age of self-driving cars and space adventures, fault tolerance has become a first-order design metric. Thus, it is vital to incorporate fault tolerance coherently into the Very Large Scale Integrated (VLSI) design process. This is especially the case in state-of-the-art complex heterogeneous Systems-on-Chip (SoCs), which typically contain a variety of dedicated hardware accelerators. These SoCs have to be taped out in shorter and shorter periods while their complexity keeps increasing. This is driving designers to finally embrace C-based VLSI design, known as High-Level Synthesis (HLS). HLS has been shown to significantly reduce design and verification time compared to the use of low-level hardware description languages. Moreover, one significant advantage of raising the level of abstraction is that C-based VLSI design allows a variety of micro-architectures with different trade-offs to be generated from the same untimed behavioral description. Fault tolerance at the hardware level has so far been based mainly on building N-modular redundant (NMR) systems, such as duplication and Triple Modular Redundancy (TMR), where the hardware channels are identical. In this work, we exploit HLS's ability to generate micro-architectures with different characteristics from the same behavioral description in order to generate fault-tolerant systems automatically. In particular, we first propose an automated framework that, given a single behavioral description, generates a set of NMR systems with different area and performance trade-offs by choosing different mixes of micro-architectures. Secondly, we leverage this ability to generate redundant systems that minimize common mode failures (CMFs). A CMF occurs when a single fault affects multiple modules in the redundant system at the same time. It has been shown that adding diversity in the hardware channels can make the system more tolerant to these types of faults.
We also leverage machine learning to estimate the diversity through fast and efficient predictive methods, thereby significantly speeding up the redundant system generation. We further investigate whether a previously reported design diversity metric, the Diversity Metric based on circuit Path analysis (DIMP), or an RT-level fault-injection-based method can achieve results similar to those of the diversity calculation based on gate-netlist fault injection. Lastly, we propose a low-cost, universal fault-recovery/repair method that uses supervised machine learning techniques to ameliorate the effect of permanent fault(s) in hardware accelerators that can tolerate inexact outputs.
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/10735.1/9085
dc.language.iso: en
dc.subject: Fault tolerance (Engineering)
dc.subject: Machine learning
dc.subject: Integrated circuits -- Very large scale integration
dc.subject: Systems on a chip
dc.subject: Hardware
dc.title: Fault Tolerance in Hardware Accelerators: Detection and Mitigation
dc.type: Dissertation
dc.type.material: text
thesis.degree.department: Electrical Engineering
thesis.degree.grantor: The University of Texas at Dallas
thesis.degree.level: Doctoral
thesis.degree.name: PHD

Files

Original bundle

Name: ETD-5608-013D-262383.72.pdf
Size: 3.12 MB
Format: Adobe Portable Document Format
Description: Dissertation
