Over recent years, software developers have been evaluating the benefits of both service-oriented architecture and software fault tolerance techniques based on design diversity. An introduction to software engineering and fault tolerance. A structured definition of hardware- and software-fault-tolerant architectures is presented. Software fault tolerance techniques are designed to allow a system to tolerate software faults that remain in the system after its development. Assessment of data diversity methods for software fault tolerance. VMware vSphere 6 Fault Tolerance is a branded, continuous data availability architecture that exactly replicates a VMware virtual machine on an ... Compounding the problems in building correct software is the difficulty of assessing the correctness of software for highly complex systems. To complement design diversity in the quest for fault-tolerant software, there exist several data diversity techniques that are similar to those used in the design diversity approach.
Systematic and design diversity software techniques for ... Several software fault tolerance schemes, as proposed in [46, 47, 48, 49, 50], are based on software design diversity in order to tolerate software design bugs. However, software reliability focuses on design perfection rather than manufacturing perfection, as traditional hardware reliability does. If design fault detection is required, design diversity in the software has to be used, too. Software fault tolerance techniques rely on design redundancy to tolerate residual ... Fault tolerance is a function of computing systems that serves to ... Despite continuing improvements in fault prevention techniques, faults remain in every complex software system. Most system designers go to great lengths to limit the impact of a hardware failure on system performance. Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults.
Fault-tolerant design: article about fault-tolerant design. The two best-known methods of building fault-tolerant software are N-version programming [3] and recovery blocks [7]. There can be either a hardware fault or a software fault, which disturbs the ... Software fault tolerance, Carnegie Mellon University. SFT III is a feature providing fault tolerance in Intel-based PC network servers running Novell's NetWare operating system. Design diversity is a solution to software fault tolerance only so far as it is possible to ... Many see fault tolerance to design faults as a low-quality solution, compared to the more desirable goal of fault-free software. Jul 2016. References: [1] A. Avizienis, The methodology of N-version programming, Software Fault Tolerance, vol. ... Index terms: design diversity, fault tolerance, multiple computation, N-version programming, N-version software, software reliability, tolerance of design faults. Introduction: The transfer of the concepts of fault tolerance to computer software, which is discussed in this paper, began about 20 years after the first systematic discussion of fault ... Software fault tolerance by design diversity, CUHK CSE. Software fault tolerance, Professur für Systems Engineering. A characteristic of software fault tolerance techniques is that they can, in principle, be applied at any level in a software system.
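The recovery block scheme named above can be illustrated with a minimal Python sketch. It assumes the alternates are pure functions, so no explicit state rollback is needed between attempts. The names recovery_block, primary_isqrt, backup_isqrt and accept are hypothetical and chosen only for this illustration; they do not come from any of the cited sources.

```python
import math
from typing import Callable, Sequence


class RecoveryBlockFailure(Exception):
    """Raised when no alternate passes the acceptance test."""


def recovery_block(x: int,
                   alternates: Sequence[Callable[[int], int]],
                   acceptance_test: Callable[[int, int], bool]) -> int:
    """Try each alternate in order and return the first accepted result."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crashing alternate counts as a failed attempt
        if acceptance_test(x, result):
            return result
    raise RecoveryBlockFailure("all alternates were rejected")


def primary_isqrt(n: int) -> int:
    # Deliberately faulty primary (illustrative assumption): off by one.
    return int(n ** 0.5) + 1


def backup_isqrt(n: int) -> int:
    # Independently written alternate based on the standard library.
    return math.isqrt(n)


def accept(n: int, r: int) -> bool:
    # Acceptance test: r is the integer square root of n.
    return r * r <= n < (r + 1) * (r + 1)


print(recovery_block(10, [primary_isqrt, backup_isqrt], accept))  # prints 3
```

The acceptance test is the single point that decides correctness, so in this sketch it is kept much simpler than the alternates it judges.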
However, it is more unusual to find that strategies for fault tolerance have been included in a system for coping with design faults, although such strategies are becoming increasingly common in systems with high reliability requirements. Designing fault-tolerant SOA based on design diversity, SpringerLink. That is, it should compensate for the faults and continue to ... Software fault tolerance, CMU ECE, Carnegie Mellon University. This course has been developed by the Centre for Software Reliability with funding from the Engineering and Physical Sciences Research Council (grant number 00711eng95) as part of their ... The root cause of software design errors is the complexity of the systems. To tolerate faults, both of these techniques rely on design diversity, i.e. ... The purpose is to prevent catastrophic failure that could result from a single point of failure. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure ... The term essentially refers to a system's ability to allow for failures or malfunctions, and this ability may be provided by software, hardware, or a combination of both. Fault tolerance techniques for distributed systems (IBM developerWorks); Understanding fault-tolerant distributed systems (ACM); Software-controlled fault tolerance (ACM); Byzantine fault tolerance (Wikipedia); Fault-tolerant design (Wikipedia); Fault tolerance (Wikipedia). ACM requires membership. In fact there exist sophisticated computing systems, designed for environments requiring near-continuous service, which contain ad hoc checks and checkpointing facilities that provide a measure of tolerance against some software errors as well as hardware failures [11].
Software fault tolerance: audits, rollback, exception handling. Buy only what you need: a wide range of configurable, fault-tolerant, multi-function I/O modules to suit most applications. Software fault tolerance is basically concerned with design faults in the computer system. SFT III allows two servers to mirror each other so that one server is always available in case the other one fails. This is a system that aims to provide more than one instance of the same system and to switch to the other mirror in the event that a system fails. Fault tolerance also resolves potential service interruptions related to software or logic errors. In this book we cover several techniques for building reliable systems from unreliable parts. They include the recovery block scheme (RBS), consensus recovery block programming, N-version programming (NVP), N self-checking programming (NSCP), and data diversity. Dec 06, 2018: Fault tolerance is the way in which an operating system (OS) responds to a hardware or software failure. Data diversity as a complementary software fault tolerance strategy to design diversity was ... Fault tolerance through automated diversity in the management. This is really surprising because hardware components have much higher reliability than the software that runs over them. Design diversity has been used for many years now as a means of achieving a degree of fault tolerance in software-based systems.
CiteSeerX: Software fault tolerance by design diversity. Abstract: Nowadays the reliability of software is often the main goal in the software development process. When a fault occurs, these techniques provide mechanisms to ... Software fault tolerance techniques are employed during the procurement, or development, of the software. With the advent of computers, N-version software diversity has been proposed [Avi77] as a means of dealing with the uncertainties of design faults in a computer. Fault tolerance white papers: fault-tolerance, fault ...
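As a rough illustration of N-version software diversity, the Python sketch below runs several independently written versions of the same function and accepts a result only if a strict majority of the versions agree on it. It is a simplified, sequential sketch under stated assumptions: results must be hashable and comparable by exact equality, and the names n_version_execute, NoMajorityError and the three sample versions are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


class NoMajorityError(Exception):
    """Raised when no result is backed by a strict majority of versions."""


def n_version_execute(versions: Sequence[Callable[[int], Hashable]],
                      x: int) -> Hashable:
    """Run every version on x and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a crashed version simply casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        # The majority is measured against all versions, not only survivors.
        if votes * 2 > len(versions):
            return value
    raise NoMajorityError("versions did not reach a majority")


def sum_closed_form(n: int) -> int:
    return n * (n + 1) // 2


def sum_loop(n: int) -> int:
    return sum(range(1, n + 1))


def sum_off_by_one(n: int) -> int:
    # Deliberately faulty version (illustrative assumption).
    return sum(range(n))


print(n_version_execute([sum_closed_form, sum_loop, sum_off_by_one], 10))  # prints 55
```

The vote masks the single faulty version here; it offers no protection if a majority of versions share the same design fault.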
So the goal of the system designer is to ensure that the probability of system failure is acceptably small. Fault tolerance is defined as how to provide, by redundancy, service ... Definition and analysis of hardware- and software-fault ... The data so obtained will be used to evaluate the meaning of design diversity and the architecture of future fault-tolerant computers. Software fault tolerance is a necessary part of a system with high reliability.
If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can cause total breakdown. Checkpoint and restart using data diversity with input re... Therefore, it is reasonable to deal with the remaining software faults (bugs) at runtime to increase the overall reliability. Software fault tolerance methods are discussed, resulting in definitions for soft and solid faults. Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components. Design fault tolerance by means of design diversity is a concept that traces back to the very early age of informatics. Architecture and software fault-tolerant technology. A fault-tolerant system is designed from the ground up for reliability by building multiples of all critical components, such as CPUs, memories, disks, and power supplies, into the same computer. Review of software design diversity: 1 Introduction, 2 N ... Fault tolerance is a quality of a computer system that gracefully handles the failure of component hardware or software.
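Checkpoint and restart, mentioned above, can be sketched in a few lines of Python. This is a minimal in-memory illustration, assuming the state is a plain dictionary that can be deep-copied and that the fault is transient, so that a retry from the saved checkpoint can succeed. The names run_with_checkpoint and make_flaky_transfer are hypothetical.

```python
import copy
from typing import Callable


def run_with_checkpoint(state: dict,
                        step: Callable[[dict], None],
                        max_retries: int = 3) -> dict:
    """Run step on a copy of state, restarting from the checkpoint on failure."""
    checkpoint = copy.deepcopy(state)  # saved before any work is done
    for _ in range(max_retries + 1):
        working = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            step(working)
            return working
        except Exception:
            continue  # discard the partially updated copy and retry
    raise RuntimeError("step failed after all retries")


def make_flaky_transfer() -> Callable[[dict], None]:
    """Build a step that fails once, midway through its update."""
    calls = {"n": 0}

    def transfer(state: dict) -> None:
        state["a"] -= 10
        calls["n"] += 1
        if calls["n"] == 1:
            raise RuntimeError("transient fault")  # leaves state half-updated
        state["b"] += 10

    return transfer


print(run_with_checkpoint({"a": 100, "b": 0}, make_flaky_transfer()))
```

Because each attempt starts from a fresh copy of the checkpoint, the half-applied update from the failed first attempt never reaches the caller. A fault that recurs on every attempt would exhaust the retries instead.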
This chapter concentrates on software fault tolerance based on design diversity. We suggest the combined use of so-called systematic diversity and design diversity in a time-redundant system instead of the structurally redundant duplex system. Software designers or system integrators who want an introduction to the problems found in designing for fault tolerance and to the range of design solutions. The adoption of software fault tolerance techniques based on design diversity has been advocated as a means of coping with residual software ... As more and more complex systems get designed and built, especially safety-critical systems, software fault tolerance and the next generation of hardware fault tolerance will need to evolve to be able to solve the design fault problem. Avizienis et al.: Another important goal of DEDIX is the evaluation of specification methods. A structured definition of hardware- and software-fault-tolerant architectures is presented. Sc High Integrity Systems, University of Applied Sciences, Frankfurt am Main 2. The importance of implementing a fault tolerance system. Design-diverse software fault tolerance techniques: recovery blocks. Fault-tolerant software architecture, Stack Overflow. Techniques for fault tolerance: Fault tolerance is the ability of a system to continue operating despite the failure of a limited subset of its hardware or software. Fault tolerance is the realization that we will have faults in our system (hardware and/or software) and that we have to design the system in such a way that it will be tolerant of those faults.
Fault-tolerant software has the ability to satisfy requirements despite failures. A soft software fault has a negligible likelihood of recurrence and is recoverable, whereas a solid software fault is recurrent under normal operations. Nov 06, 2010: ... develop fault-tolerant software by the implementation of fault tolerance techniques share, in general, the following characteristics. Hardware-implemented fault tolerance design reduces operating system size, minimises systems software, and increases processing speed, offering the end user the safest and simplest design. How a fault tolerance system makes our lives easier. Software fault tolerance using data diversity attention. The versions are used as alternatives with a separate means of ... Therefore fault tolerance is achieved by using diversity in the data space. The last section provides the reader with an overview of some real ... Software fault tolerance is an immature area of research. It is impossible to reduce the probability of a fault to zero.
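Diversity in the data space can be sketched as a retry block: the same program is re-run on a re-expressed input that should yield an equivalent result. The Python example below is a hedged illustration, assuming a periodic function so that shifting the input by one full period is a valid re-expression. The names retry_block, fragile_sin, shift_one_period and in_range are hypothetical, and the fault in fragile_sin is injected on purpose.

```python
import math
from typing import Callable, Sequence


def retry_block(program: Callable[[float], float],
                x: float,
                re_expressions: Sequence[Callable[[float], float]],
                acceptance_test: Callable[[float], bool]) -> float:
    """Run program on x, then on re-expressed inputs, until a result is accepted."""
    candidates = [lambda v: v, *re_expressions]  # the original input goes first
    for re_express in candidates:
        try:
            result = program(re_express(x))
        except Exception:
            continue  # this input triggered the fault, try the next one
        if acceptance_test(result):
            return result
    raise RuntimeError("no re-expressed input produced an acceptable result")


def fragile_sin(x: float) -> float:
    # Injected fault (illustrative assumption): fails only for x == 0.0.
    if x == 0.0:
        raise ZeroDivisionError("fault triggered by this particular input")
    return math.sin(x)


def shift_one_period(x: float) -> float:
    # sin is periodic, so x + 2*pi is an equivalent input up to rounding.
    return x + 2.0 * math.pi


def in_range(y: float) -> bool:
    return math.isfinite(y) and -1.0 <= y <= 1.0


value = retry_block(fragile_sin, 0.0, [shift_one_period], in_range)
print(abs(value) < 1e-9)
```

The re-expressed input gives an approximately equal answer, not an identical one, so callers of this sketch would need to tolerate small numerical differences.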
A system can be described as fault tolerant if it continues to operate satisfactorily in the presence of one or more system failure conditions. Software fault tolerance: During the development of software, it is infeasible to find all its bugs, which can reach as far back as the design phase. Implement a software fault tolerance scheme (distributed or concurrent) as a library or framework for a programming language of your choice, or study a specific software fault tolerance scheme, middleware, or application using software fault tolerance, e.g. ... Dec 29, 2016: Fault-tolerant systems are designed to provide availability even when anticipating both intended and unintended service disruption. Design fault tolerance by means of design diversity is a concept that traces back to the very early age of informatics. From a different point of view, any emphasis on providing fault tolerance for design faults is, in this author's experience, a radical change from the common attitudes of many practitioners and researchers alike. To handle faults gracefully, some computer systems have two or more ...