Lessons Learned Entry: 1465

Lesson Info:
- Lesson Number: 1465
- Lesson Date: 2003-08-31
- Submitting Organization: JSC
- Submitted by: David Lengyel

Subject: Accident Investigations/Web-based Mishap Investigation Support System (MISS) Hardware and Software Infrastructure

Abstract: To reduce service outages, a Mishap Investigation Support System requires a hardware and software infrastructure that is fully tested and ready to support the full range of perceived and anticipated operational needs.

Description of Driving Event: Several service outages occurred early in the implementation:
- Expiration of a software license associated with the Neoterris box (February 14 - 48 hours).
- Failure of a motherboard on the Dell application server (March 5 - 2 hours).
- Unknown cause (March 13 - 3 hours).
- GRC firewall outage (March 18 - 1.5 hours).
- Inadvertent cable disconnect (April 2 - 15 minutes).

A separate hardware issue involved large-file load time from certain web-browser applications. Considering all of the above, from February 13 to May 13, 2003, total system downtime was approximately 90 hours. This figure includes 24 hours for scheduled maintenance and upgrades and a "down time penalty" to account for less than optimum performance. It represents an operational availability of approximately 96%.

Lesson(s) Learned: As a result of the problems encountered, lessons were learned quickly and changes were implemented just as quickly. Only limited time was available for system testing, and only under limited conditions. The systems engineering life cycle was compressed into a virtual real-time requirements analysis/design/test/implementation. Lessons include the need for traditional multi-functional, end-to-end testing of all system components over the full range of operational scenarios, including different internet routing and NASA firewall configurations. Again, the need exists to have a system ready to support the full range of perceived and anticipated operational needs based on the CAIB experience (see #1).

Recommendation(s):
- Implement multiple web servers in parallel to maximize system availability for critical web-based applications. Duplicate, to the extent possible, a commercial implementation configuration to ensure the greatest possible system reliability.
- Implement the application software on a separate server from the database server. This approach allows routine large-file backup activity to take place on the database server without slowing down the web server.

Evidence of Recurrence Control Effectiveness: TBD NASA Response

Documents Related to Lesson: Agency Contingency Action Plan for Space Flight Operations

Mission Directorate(s):
- Space Operations
- Exploration Systems

Additional Key Phrase(s):
- Accident Investigation
- Administration/Organization
- Computers
- Configuration Management
- Information Technology/Systems
- NASA Standards
- Policy & Planning
- Safety & Mission Assurance
- Security

Additional Info:

Approval Info:
- Approval Date: 2004-06-16
- Approval Name: Ronald Montague
- Approval Organization: JSC
- Approval Phone Number: 281-483-8576

Provided by IHS. Not for resale. No reproduction or networking permitted without license from IHS.
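
The downtime and availability figures in the Description of Driving Event can be checked with a short calculation. The sketch below is illustrative only and is not part of the original record. It assumes the reporting window runs from midnight February 13 to midnight May 13, 2003, and that the five itemized outages are the complete list of unscheduled outages, so the remainder of the stated 90 hours is read as the "down time penalty". The function names, and the independence assumption in `parallel_availability`, are hypothetical choices made for this example.

```python
from datetime import datetime

# Reporting window stated in the entry (midnight boundaries are an assumption).
PERIOD_START = datetime(2003, 2, 13)
PERIOD_END = datetime(2003, 5, 13)

# Unscheduled outages itemized in the entry, in hours.
UNSCHEDULED_OUTAGES_H = {
    "software license expiration (Feb 14)": 48.0,
    "application server motherboard failure (Mar 5)": 2.0,
    "unknown cause (Mar 13)": 3.0,
    "GRC firewall outage (Mar 18)": 1.5,
    "inadvertent cable disconnect (Apr 2)": 0.25,
}
SCHEDULED_MAINTENANCE_H = 24.0  # stated in the entry
TOTAL_DOWNTIME_H = 90.0         # stated in the entry as approximate


def period_hours(start: datetime, end: datetime) -> float:
    """Length of the reporting window in hours."""
    return (end - start).total_seconds() / 3600.0


def availability(downtime_h: float, total_h: float) -> float:
    """Operational availability as the fraction of the window that was up."""
    return 1.0 - downtime_h / total_h


def parallel_availability(single: float, n: int) -> float:
    """Availability of n redundant servers, ASSUMING independent failures.

    This is a textbook simplification, not a claim from the entry. Shared
    causes such as a firewall outage or an expired license would take down
    every server at once and are not covered by this model.
    """
    return 1.0 - (1.0 - single) ** n


total_h = period_hours(PERIOD_START, PERIOD_END)
itemized_h = sum(UNSCHEDULED_OUTAGES_H.values())
# Remainder of the stated total, interpreted here as the "down time penalty".
implied_penalty_h = TOTAL_DOWNTIME_H - SCHEDULED_MAINTENANCE_H - itemized_h
observed = availability(TOTAL_DOWNTIME_H, total_h)

print(f"window length:        {total_h:.0f} h")
print(f"itemized outages:     {itemized_h:.2f} h")
print(f"implied penalty:      {implied_penalty_h:.2f} h")
print(f"availability:         {observed:.1%}")
print(f"two servers (indep.): {parallel_availability(observed, 2):.2%}")
```

Under these assumptions the window is 89 days (2136 hours), and 90 hours of downtime gives an availability a little under 96%, consistent with the "approximately 96%" in the entry. The two-server figure only indicates the direction of the first recommendation: several of the recorded outages had shared causes that redundancy in the web tier alone would not have removed.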