Lessons Learned Entry: 1778

Lesson Info:
- Lesson Number: 1778
- Lesson Date: 2007-03-06
- Submitting Organization: JPL
- Submitted by: Martin Ratliff
- POC Name: John Waters, Henry Stone
- POC Email: John.J.Waters@jpl.nasa.gov, Henry.W.Stone@jpl.nasa.gov
- POC Phone: 818-354-1709 (John Waters), 818-354-9051 (Henry Stone)

Subject: Anomalous Flight Conditions May Trigger Common-Mode Failures in Highly Redundant Systems

Abstract: After launch, MRO was found to be susceptible to a solar flare event during the critical aerobraking phase of the mission that could corrupt the multiply redundant identical file systems in the Command [...] if that image is also found to be corrupted, the spacecraft switches to the redundant string CMIC in search of a pristine FS image.

[Figure 1: color block diagram of the architecture described in the caption, showing two major blocks positioned side by side and connected by a line.]

Figure 1. Each MRO C&DH string has a single CMIC, which in turn has a single 2 MB SRAM piece part that is logically partitioned into 8 segments. The redundant File Systems are stored in SRAM Segments 2 and 4 on each CMIC. Real-time checkpoint and fault protection data are stored in Segments 3 and 6. The RAD750 flight computer accesses its CMIC via a cPCI bus and accesses the other computer's CMIC via a synchronous serial interface (SSI) cable.

After MRO launch, spacecraft telemetry revealed some correctable CMIC memory errors (Reference (1)) that prompted analysis of MRO radiation exposure. This post-launch analysis identified a significant risk that solar energetic particle (SEP) events associated with a solar flare later in the mission could corrupt all four FS images (Reference (2)). That is, because particle strikes are already highly likely to cause single event upsets (SEUs) in a given block of memory under nominal conditions, a large solar flare is highly likely to produce many more strikes that damage multiple blocks of memory. Since the radiation flux would also increase the likelihood of a computer reboot, FS corruption during the Mars aerobraking maneuver could have posed an unacceptable risk to the mission.

Radiation analysis (Reference (3)) also revealed a 7 percent probability that a large flare will occur and that the resulting SEUs will impact all four FS data images during the 2-year MRO primary mission. In addition, the analysis showed a 14 percent probability of data corruption (due to entry into a period of increased solar activity) during an extended mission. This analysis showed that for large solar flares, the probability of SEUs is so high that even an increase in redundancy to 16 FSs would not bring a significant risk reduction during the primary mission. This is because the probability of all images being corrupted is driven mainly by the probability of encountering a large flare.

Provided by IHS. Not for resale. No reproduction or networking permitted without license from IHS.

Shielding is somewhat effective against solar protons, and the proton flux decreases with distance from the sun, but shielding does little to block high-energy particles, which pose an SEU risk throughout a planetary mission. The MRO mission risk has been mitigated by uplinked changes to flight software, and by
ground procedures such as daily monitoring of solar activity forecasts. The mission risk was also limited by the minimal sunspot activity (solar minimum) during the MRO aerobraking maneuvers.

Reference(s):
(1) “CMIC Memory CRC Error,” Problem/Failure Report No. Z87764, Jet Propulsion Laboratory, November 8, 2005.
(2) M. Ratliff, “Probability of Corrupting Four MRO CMIC File System Images in One Day,” Jet Propulsion Laboratory IOM No. 5132-06-013, February 17, 2006.
(3) M. Ratliff, “Probability of SEUs in MRO CMIC SRAM File System,” Jet Propulsion Laboratory IOM No. 5132-06-100,
December 7, 2006.
(4) “Flight System Fault Tolerance, Redundancy, and Cross Strapping (Draft),” Jet Propulsion Laboratory Guideline No. DocID 60492, February 7, 2007.

Lesson(s) Learned:
1. Highly redundant spacecraft designs that feature foolproof fault recovery schemes are still vulnerable to common-mode failures unless the range of anomalous conditions during all operational phases is fully understood.
2. Had the radiation analysis been performed during MRO C&DH subsystem development, detection of CMIC sensitivity to radiation and subsequent implementation of Error Detection and Correction (EDAC) would have eliminated the risk of a mission-critical common-mode failure.

Recommendation(s):
1. When fault tolerance is attained by means of multiple, redundant, physical copies, evaluate the full range of anomalous conditions that could trigger common-mode failures during a critical mission phase. Use Reference (4) as a guide to addressing design concerns through redundancy, including mitigation of risks from mission environments that have not previously been experienced.
2. Perform analyses of device sensitivity to space environmental extremes sufficiently early to permit implementation of design countermeasures.

Evidence of Recurrence Control Effectiveness: JPL will reference this lesson learned as additional rationale and guidance supporting Paragraph 4.4.3.4 (Information System Design: On-Board Storage Protection from Single Event Upsets) in the JPL standard Design, Verification/Validation and Operations Principles for Flight Systems (Design Principles), JPL Document D-17868, Rev. 3, December 11, 2006.

Documents Related to Lesson: N/A

Mission Directorate(s):
- Exploration Systems
- Science
- Space Operations

Additional Key Phrase(s):
- Systems Engineering and Analysis.Engineering design and project processes and standards
- Engineering Design (Phase C/D).Entry Systems
- Engineering Design (Phase C/D).Orbiting Vehicles
- Engineering Design (Phase C/D).Robotics
- Engineering Design (Phase C/D).Spacecraft and Spacecraft Instruments
- Mission Operations and Ground Support Systems.
- Safety and Mission Assurance.Product Assurance
- Safety and Mission Assurance.Reliability
- Additional Categories.
- Additional Categories.Environment

Additional Info:
- Project: Mars Reconnaissance Orbiter

Approval Info:
- Approval Date: 2007-04-20
- Approval Name: ghenderson
- Approval Organization: HQ
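
Illustrative sketch: The entry's observation that raising redundancy to 16 FSs would not significantly reduce risk, because the probability of corrupting all images is driven mainly by the probability of encountering a large flare, can be illustrated with a simple two-state model. The model and every number below are assumptions made for illustration only; they are not taken from References (2) or (3). The function name `p_all_images_corrupted` and the inputs `P_FLARE`, `P_IMAGE_FLARE`, and `P_IMAGE_QUIET` are hypothetical.

```python
def p_all_images_corrupted(n_images, p_flare, p_image_flare, p_image_quiet):
    """Probability that every one of n_images redundant copies is corrupted.

    Two-state model (an illustrative assumption): either a large flare occurs
    during the mission (probability p_flare) or it does not. Within each state
    the copies are treated as being corrupted independently.
    """
    with_flare = p_flare * p_image_flare ** n_images
    without_flare = (1.0 - p_flare) * p_image_quiet ** n_images
    return with_flare + without_flare


# Hypothetical inputs, NOT values from the JPL radiation analyses.
P_FLARE = 0.07         # assumed chance of a large flare during the mission
P_IMAGE_FLARE = 0.99   # assumed chance one image is corrupted during such a flare
P_IMAGE_QUIET = 0.01   # assumed chance one image is corrupted otherwise

if __name__ == "__main__":
    for n in (1, 2, 4, 16):
        p = p_all_images_corrupted(n, P_FLARE, P_IMAGE_FLARE, P_IMAGE_QUIET)
        print(f"{n:2d} images: P(all corrupted) = {p:.4f}")
```

Under these assumed inputs, going from 4 to 16 copies changes the result by less than one percentage point. The flare term dominates, and the quiet-condition term, which is where independent redundancy helps, is already negligible at 4 copies. This is the common-mode pattern the lesson describes: copies that all fail under the same anomalous condition do not behave as independent safeguards.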