Department of Computer Science and Technology

Technical reports

A Performance-efficient and practical processor error recovery framework

Jyothish Soman

January 2019, 91 pages

This technical report is based on a dissertation submitted July 2017 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Wolfson College.

DOI: 10.48456/tr-931

Abstract

Continued reduction in the size of a transistor has affected the reliability of processors built using them. This is primarily due to factors such as inaccuracies while manufacturing, as well as non-ideal operating conditions, causing transistors to slow down consistently, eventually leading to permanent breakdown and erroneous operation of the processor. Permanent transistor breakdown, or faults, can occur at any point in time in the processor’s lifetime. Errors are the discrepancies in the output of faulty circuits. This dissertation shows that the components containing faults can continue operating if the errors caused by them are within certain bounds. Further, the lifetime of a processor can be increased by adding supportive structures that start working once the processor develops these hard errors.

This dissertation has three major contributions, namely REPAIR, FaultSim and PreFix. REPAIR is a fault tolerant system with minimal changes to the processor design. It uses an external Instruction Re-execution Unit (IRU) to perform operations, which the faulty processor might have erroneously executed. Instructions that are found to use faulty hardware are then re-executed on the IRU. REPAIR shows that the performance overhead of such targeted re-execution is low for a limited number of faults.

FaultSim is a fast fault-simulator capable of simulating large circuits at the transistor level. It is developed in this dissertation to understand the effect of faults on different circuits. It performs digital logic based simulations, trading off analogue accuracy with speed, while still being able to support most fault models. A 32-bit addition takes under 15 micro-seconds, while simulating more than 1500 transistors. It can also be integrated into an architectural simulator, which added a performance overhead of 10 to 26 percent to a simulation. The results obtained show that single faults cause an error in an adder in less than 10 percent of the inputs.

PreFix brings together the fault models created using FaultSim and the design directions found using REPAIR. PreFix performs re-execution of instructions on a remote core, which pick up instructions to execute using a global instruction buffer. Error prediction and detection are used to reduce the number of re-executed instructions. PreFix has an area overhead of 3.5 percent in the setup used, and the performance overhead is within 5 percent of a fault-free case. This dissertation shows that faults in processors can be tolerated without explicitly switching off any component, and minimal redundancy is sufficient to achieve the same.

Full text

PDF (1.7 MB)

BibTeX record

@TechReport{UCAM-CL-TR-931,
  author =	 {Soman, Jyothish},
  title = 	 {{A Performance-efficient and practical processor error
         	   recovery framework}},
  year = 	 2019,
  month = 	 jan,
  url = 	 {https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-931.pdf},
  institution =  {University of Cambridge, Computer Laboratory},
  doi = 	 {10.48456/tr-931},
  number = 	 {UCAM-CL-TR-931}
}