Today, computer systems are used in safety-critical areas such as the military, aviation, intensive health care, industrial control and space exploration. All of these areas demand the highest possible reliability of functional operation. However, the impact of ionized particles and radiation on current semiconductor hardware inevitably leads to faults in the system. Such phenomena are expected to be observed much more often in the future due to the ongoing miniaturisation of hardware structures.
In this book we tackle the question of how system software should be designed to cope with such faults, and which fault tolerance features it should provide to achieve the highest reliability. We also show how the system software interacts with the hardware to tolerate these faults.
First, we analyse and further develop the theory of fault tolerance to understand the different ways of increasing the reliability of a system. Ultimately, the key is to use redundancy in all its forms. We revise and further develop the generalized algorithm of fault tolerance (GAFT), with its three main processes (hardware checking, preparation for recovery, and the recovery procedure), as our approach to the design of fault-tolerant systems. For each of the three processes, we analyse the requirements and properties theoretically and give possible implementation scenarios.
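As a rough illustration of how checking, preparation for recovery and recovery can fit together, the following minimal sketch runs one computation step under a checkpoint-and-retry cycle. It is not the book's definition of GAFT: the names `run_with_recovery`, `step`, `check` and `max_retries` are hypothetical, the check is done in software rather than hardware, and a simple rollback stands in for the recovery procedure.

```python
import copy


def run_with_recovery(state, step, check, max_retries=3):
    """Run one step under a check / prepare / recover cycle (illustrative only).

    state: mutable dict describing the computation (assumed representation).
    step:  hypothetical function that updates `state` in place and may corrupt it.
    check: hypothetical function returning True if `state` looks fault-free.
    """
    for _ in range(max_retries + 1):
        # Preparation for recovery: keep a redundant copy of the state.
        checkpoint = copy.deepcopy(state)

        # Normal operation; a transient fault may corrupt `state` here.
        step(state)

        # Checking: accept the result only if no fault is detected.
        if check(state):
            return state

        # Recovery: discard the corrupted state and retry from the checkpoint.
        state = checkpoint

    raise RuntimeError("fault persisted after all retries")


# Usage: the first execution of the step is hit by a simulated transient fault.
calls = {"n": 0}


def flaky_increment(s):
    calls["n"] += 1
    s["x"] += 1
    if calls["n"] == 1:
        s["x"] = -999  # simulated corruption


def non_negative(s):
    return s["x"] >= 0


result = run_with_recovery({"x": 0}, flaky_increment, non_negative)
```

The sketch relies on two kinds of redundancy: information redundancy (the checkpoint copy) and time redundancy (repeating the step). A fault that recurs on every attempt exhausts the retries and is reported instead of being masked.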
Based on the theoretical results, we derive an Oberon-based programming language with direct support for the three processes of GAFT.
In the last part of this book, we analyse a simulator-based proof-of-concept implementation of a novel fault-tolerant processor architecture (ERRIC) and its newly developed runtime system in terms of both features and performance.