Chapter 2- Reliability and Fault Tolerance.ppt
Real-Time Systems and Programming Languages
Alan Burns and Andy Wellings

Chapter 2: Reliability and Fault Tolerance

Aims
- To understand the factors which affect the reliability of a system and introduce how software design faults can be tolerated
- To introduce:
  - Safety and Dependability
  - Reliability, failure and faults
  - Failure modes
  - Fault prevention and fault tolerance
  - N-Version programming
  - Dynamic Redundancy

Scope
Four sources of faults which can result in system failure:
- Inadequate specification (not covered)
- Design errors in software (covered now)
- Processor failure (not covered)
- Interference on the communication subsystem (not covered)

Safety and Reliability
- Safety: freedom from those conditions that can cause death, injury, occupational illness, damage to (or loss of) equipment (or property), or environmental harm
- By this definition, most systems which have an element of risk associated with their
  use are unsafe
- Reliability: a measure of the success with which a system conforms to some authoritative specification of its behaviour
- Safety is the probability that conditions that can lead to mishaps do not occur, whether or not the intended function is performed

Safety
- E.g., measures which increase the likelihood of a weapon firing when required may well increase the possibility of its accidental detonation
- In many ways, the only safe airplane is one that never takes off; however, it is not very reliable
- As with reliability, to ensure the safety requirements of an embedded system, system safety analysis must be performed throughout all stages of its life cycle development

Aspects of Dependability
[Figure: aspects of Dependability; surviving labels: Available (Readiness for Usage)]

Dependability Terminology
[Figure: Dependability terminology diagram]

Reliability, Failure and Faults
- The reliability of a system is a measure of the success with which it conforms
  to an authoritative specification of its behaviour
- When the behaviour of a system deviates from that which is specified for it, this is called a failure
- Failures result from unexpected problems internal to the system that eventually manifest themselves in the system's external behaviour
- These problems are called errors, and their mechanical or algorithmic causes are termed faults
- Systems are composed of components which are themselves systems: hence failure -> fault -> error -> failure -> fault

Fault Types
- A transient fault starts at a particular time, remains in the system for some period and then disappears
  - E.g. hardware components which have an adverse reaction to radioactivity
  - Many faults in communication systems are transient
- Permanent faults remain in the system until they are repaired; e.g., a broken wire or a software design error
- Intermittent faults are transient faults that occur from
  time to time
  - E.g. a hardware component that is heat sensitive: it works for a time, stops working, cools down and then starts to work again

Software Faults
- Called bugs
  - Bohrbugs: reproducible and identifiable
  - Heisenbugs: only active under rare conditions, e.g. race conditions
- Software doesn't deteriorate with age: it is either correct or incorrect, but
  - Faults can remain dormant for long periods
  - Usually related to resource usage, e.g. memory leaks

Failure Modes
[Figure: failure mode classification]
- Failure mode: Value domain, Timing domain, Arbitrary (Fail uncontrolled)
- Value domain: Constraint error, Value error
- Timing domain: Early, Omission, Late
- Fail silent, Fail stop, Fail controlled

Approaches to Achieving Reliable Systems
- Fault prevention attempts to eliminate any possibility of faults creeping into a system before it goes operational
- Fault tolerance enables a system to continue functioning even in the presence of faults
- Both approaches attempt to produce systems which have well-defined failure modes

Fault Prevention
- Two stages: fault avoidance and fault removal
- Fault avoidance attempts to limit the introduction of faults during system construction by:
  - use of the most reliable components within the given cost and performance constraints
  - use of thoroughly-refined techniques for interconnection of components and assembly of subsystems
  - packaging the hardware to screen out expected forms of interference
  - rigorous, if not formal, specification of requirements
  - use of proven design methodologies
  - use of languages with facilities for data abstraction and modularity
  - use of software engineering environments to help manipulate software components and thereby manage complexity

Fault Removal
- Design errors (hardware and software) will exist
- Fault removal: procedures for finding and removing the causes of errors; e.g. design reviews, program verification, code
  inspections and system testing
- System testing can never be exhaustive and remove all potential faults:
  - A test can only be used to show the presence of faults, not their absence
  - It is sometimes impossible to test under realistic conditions
  - Most tests are done with the system in simulation mode, and it is difficult to guarantee that the simulation is accurate
  - Requirements errors during the system's development may not manifest themselves until the system goes operational

Failure of Fault Prevention Approach
- In spite of all the testing and verification techniques, hardware components will fail; the
  fault prevention approach will therefore be unsuccessful when either the frequency or the duration of repairs is unacceptable, or the system is inaccessible for maintenance and repair activities
- An extreme example of the latter is the crewless spacecraft Voyager (currently 10 billion miles from the sun!)
- The alternative is Fault Tolerance

Levels of Fault Tolerance
- Full Fault Tolerance: the system continues to operate in the presence of faults, albeit for a limited period, with no significant loss of functionality or performance
- Graceful Degradation (fail soft): the system continues to operate in the presence of errors, accepting a partial degradation of functionality or performance during recovery or repair
- Fail Safe: the system maintains its integrity while accepting a temporary halt in its operation
- The level required will depend on the application
- Most safety-critical systems require full fault tolerance; however, in practice many settle for graceful degradation

Graceful Degradation in an ATC System
[Figure: graceful degradation in an ATC system; surviving label: "Full functionality within required response times"]

Redundancy
- All fault-tolerant techniques rely on extra elements introduced into the system to detect and recover from faults
- Components are
  redundant as they are not required in a perfect system
- Often called protective redundancy
- Aim: minimise redundancy while maximising reliability, subject to the cost and size constraints of the system
- Warning: the added components inevitably increase the complexity of the overall system
  - This itself can lead to less reliable systems
  - E.g., first launch of the space shuttle
- It is advisable to separate out the fault-tolerant components from the rest of the system

Hardware Fault Tolerance
- Two types: static (or masking) and dynamic redundancy
- Static: redundant components are used inside a system to hide the effects of faults; e.g. Triple Modular Redundancy (TMR)
  - TMR: 3 identical subcomponents and majority voting circuits; the outputs are compared and if one differs from the other two, that output is masked out
  - Assumes the fault is not common (such as a design error) but is either transient or due to component deterioration
  - To mask faults from more than one component requires NMR
- Dynamic: redundancy supplied inside a component which indicates that the output is in error; provides an error detection facility; recovery must be provided by another component
  - E.g. communications checksums and memory p
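
Illustration: TMR majority voting
The static (masking) redundancy described under Hardware Fault Tolerance can be sketched in software. The following is a minimal Python sketch, not taken from the slides: the function name `tmr_vote`, its return convention, and the use of plain Python values to stand in for subcomponent outputs are all assumptions made for illustration. Real TMR voting is typically done by hardware circuits.

```python
from typing import Hashable, Optional, Tuple


def tmr_vote(a: Hashable, b: Hashable, c: Hashable) -> Tuple[Hashable, Optional[int]]:
    """Majority-vote the outputs of three replicated subcomponents.

    Hypothetical helper for illustration only.
    Returns (voted_value, masked_index), where masked_index is the
    position (0, 1 or 2) of the output that disagreed and was masked
    out, or None when all three outputs agree.
    Raises ValueError when no two outputs agree: a single voter cannot
    mask faults in more than one subcomponent.
    """
    if a == b == c:
        return a, None
    if a == b:
        return a, 2  # c differs from the other two and is masked out
    if a == c:
        return a, 1  # b is masked out
    if b == c:
        return b, 0  # a is masked out
    raise ValueError("no majority: more than one output is faulty")


if __name__ == "__main__":
    print(tmr_vote(42, 42, 42))  # all agree
    print(tmr_vote(42, 7, 42))   # the second output is masked out
```

The no-majority case raises an error rather than guessing, which mirrors the point above that masking faults from more than one component requires NMR. The sketch also shares the stated assumption of TMR: if all three replicas carry the same design error, they agree on a wrong value and the voter cannot detect it.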
