Research: Transient Error Detection and Recovery in Modern Processors


Transient errors in present-day microprocessors can have various effects: they can alter the control flow of a program, change the system state, or corrupt data stored in memory. Furthermore, if the system performs no run-time checking, an erroneous output may go undetected and be used as if it were correct. Many present-day information systems are high-availability systems used in fields where failures can be catastrophic. Examples include biomedical, aerospace, and banking applications, where events such as spontaneous reboots or incorrect results cannot be tolerated. Hence, run-time error correction and/or redundancy techniques are necessary to overcome the effects of transient errors. We are working on the detection and mitigation of transient errors in various structures inside the processor core. The structures under consideration are the processor pipeline, the reorder buffer, and the instruction queue. At present, detection and recovery techniques are being implemented at the architectural simulator level.
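As a rough illustration of redundancy-based detection and recovery, the sketch below models a transient fault as a single bit flip in the result of an addition, detects it by executing the operation twice and comparing the results, and recovers by re-executing. This is a generic textbook scheme, not the specific technique under development; the function names (flip_bit, alu_add, checked_add), the 32-bit width, and the retry limit are all assumptions made for the example.

```python
WIDTH = 32                 # assumed register width for the illustration
MASK = (1 << WIDTH) - 1


def flip_bit(value, bit):
    """Model a transient fault by inverting one bit of a value."""
    return (value ^ (1 << bit)) & MASK


def alu_add(a, b, fault_bit=None):
    """Add two values; optionally corrupt the result with a single bit flip."""
    result = (a + b) & MASK
    if fault_bit is not None:
        result = flip_bit(result, fault_bit)
    return result


def checked_add(a, b, fault_bit=None, max_retries=3):
    """Execute the addition twice and compare the two results.

    The injected fault hits only the first copy of the first attempt,
    which models a transient (non-recurring) error. On a mismatch the
    operation is re-executed, up to max_retries times.
    Returns (result, number_of_retries).
    """
    first = alu_add(a, b, fault_bit)
    second = alu_add(a, b)
    retries = 0
    while first != second:
        if retries == max_retries:
            raise RuntimeError("persistent mismatch: fault is not transient")
        retries += 1
        # Recovery: discard both results and re-execute fault-free.
        first = alu_add(a, b)
        second = alu_add(a, b)
    return first, retries


clean = checked_add(5, 7)                  # no fault injected
recovered = checked_add(5, 7, fault_bit=3)  # fault detected, one re-execution
```

Comparing two copies only detects a mismatch; it cannot tell which copy is wrong, which is why recovery here re-executes both rather than picking one.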



Research: Modeling Dependability in High-Performance Systems


Present-day high-performance systems consist of multiple cores serving several jobs simultaneously. Their configurations comprise multiple layers of hardware and software. A typical configuration is shown below:







Each system consists of multiple boards, and each board contains multiple cores. An emulation of the operating system (OS) runs on each board, while various software utilities run on the cores. Each hardware or software component can fail independently, and automated or manual repair starts as soon as a failure occurs. The increased use of these systems in critical applications has given rise to reliability and availability concerns. Reliability is the probability of failure-free operation throughout a given time interval, whereas availability is the probability that the system is operational at a given instant of time. Steady-state availability is the long-term probability that the system is available. Another key concept for these systems is performability, which quantifies how well a system performs in the presence of faults over a given period of time. We are developing Markov models of the availability and performability of such systems that take into account the failure and repair of both hardware and software components. Additionally, we have developed a Generalized Software Availability Model by deriving closed-form expressions for the transition probabilities in the Markov model.
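To make the distinction between reliability and steady-state availability concrete, here is a minimal sketch of a three-state continuous-time Markov model with states UP, HW_DOWN, and SW_DOWN. It assumes exponentially distributed failure and repair times, a single repairable unit, and no failures while the unit is down. The rates and function names are hypothetical, and this is not the Generalized Software Availability Model mentioned above.

```python
import math


def steady_state(lam_hw, mu_hw, lam_sw, mu_sw):
    """Steady-state probabilities of the assumed 3-state model.

    Balance equations:
        pi_UP * lam_hw = pi_HW_DOWN * mu_hw
        pi_UP * lam_sw = pi_SW_DOWN * mu_sw
        pi_UP + pi_HW_DOWN + pi_SW_DOWN = 1
    """
    rho_hw = lam_hw / mu_hw   # hardware failure-to-repair rate ratio
    rho_sw = lam_sw / mu_sw   # software failure-to-repair rate ratio
    pi_up = 1.0 / (1.0 + rho_hw + rho_sw)
    return {
        "UP": pi_up,
        "HW_DOWN": pi_up * rho_hw,
        "SW_DOWN": pi_up * rho_sw,
    }


def reliability(t, lam_hw, lam_sw):
    """Probability of no failure of either kind during [0, t]."""
    return math.exp(-(lam_hw + lam_sw) * t)


# Hypothetical rates, in events per hour.
LAM_HW = 1.0 / 10000.0   # one hardware failure per 10,000 hours
MU_HW = 1.0 / 4.0        # hardware repair takes 4 hours on average
LAM_SW = 1.0 / 1000.0    # one software failure per 1,000 hours
MU_SW = 2.0              # software restart takes 30 minutes on average

pi = steady_state(LAM_HW, MU_HW, LAM_SW, MU_SW)
availability = pi["UP"]                          # long-run fraction of time up
one_week = reliability(168.0, LAM_HW, LAM_SW)    # no failure in 168 hours

print(f"steady-state availability: {availability:.6f}")
print(f"one-week reliability:      {one_week:.6f}")
```

With fast repairs, availability can remain close to 1 even when the reliability over a long interval is noticeably lower, because availability credits the system for returning to service after each failure.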