Better Software - June 2008 - (Page 37)

“To err is human, to forgive is divine.”
– Alexander Pope

“Insanity: doing the same thing over and over again and expecting different results.”
– Albert Einstein

These two statements summarize the underlying philosophy of root cause analysis (RCA). We must accept the fact that we make mistakes in product development, but we must also accept the responsibility for not repeating those mistakes. Absent any proactive steps to prevent mistakes, we won’t be very successful in preventing them. And absent any proactive steps to understand our mistakes, we will repeat them, thus meeting Einstein’s definition of insanity.

RCA has many definitions. To some, it is identifying the cause of an application failure and fixing it. To others, it is identifying the reasons for making the initial error or mistake that led to the failure. This confusion is caused in part by the differences between analyzing and preventing physical failures and analyzing and preventing errors in the intellectual activity of developing software.

Industrial accidents or system failures often have an unknown physical cause until investigated, and the RCA focuses on identifying the mechanical, chemical, electrical, or physical cause of the failure. Typical causes include mechanical stress, overheating, and fatigue. Prevention usually requires a change to one or more physical components of the system. On occasion, these failures are traced back to design or management decisions.

When we trace software failures back to the initial mistake, we most often find that they result from an error in our thought processes or intellectual activities. In software development, the common view of RCA is the identification of the conditions or events that caused a person or team to make an initial error that later manifested as a defect or failure.

To further understand this difference, let’s consider some well-known failures. There have been two space shuttle disasters.
In both cases, the manifestation of physical failure had been observed multiple times in earlier flights. In both cases, management decisions did not correctly assess the risk of catastrophic failure. These disasters occurred through a combination of physical faults and decision-making faults. The TWA Flight 800 disaster was caused by multiple faults: an explosive air-fuel mixture in a fuel tank and an ignition source. The older model F-15 fighters are failing due to stress fractures. These two cases can be attributed to identifiable chemical or mechanical causes that were not identified or understood in the design of these systems. In all four cases, RCA could stop at the “how to fix it” point or, as in the shuttle case, go beyond the physical cause to the underlying decision process.

However, when we consider software, we have a situation where trivial oversights or typos can cause expensive failures. A missing hyphen caused a $500 million Mariner probe to Venus to malfunction. A one-character error in the AT&T 5ESS system caused a $1 billion failure in the 800 phone system. On the other hand, one-character errors in software are often discovered in reviews, in testing, or even in use, with insignificant failure costs. The lack of a relationship between the significance of the error and the cost of the failure makes software RCA fundamentally different. The same trivial error can lead to failures whose costs differ by orders of magnitude. Or, looking at this from another viewpoint, identifying the root cause of a costly failure does not guarantee that you will eliminate costly failures elsewhere due to similar causes.

In all of these cases, the programmer made an error in creating the work; in other words, the intellectual thought process failed. In software work, there is no physical trail of evidence to lead us to the error. In addition, we have the problem of the frequency of errors.
Any sizeable system will have hundreds of errors waiting for the right set of conditions to cause a failure.

Let’s establish some definitions. The first three are adapted from Software Metrics by Fenton and Pfleeger:

• Human error: a mistake by a human in developing a software work product
• Fault: the encoding of the human error into the software; note that one error may result in multiple faults
• Failure: the manifestation of the fault in the execution of the software
• Defect: often used to mean any or all of the above, but in my opinion it’s best used to refer to faults or failures discovered during reviews or testing

Software failures and defects are discovered in use or by testing. At times, a failure requires extensive analysis to pinpoint the fault in the code. This problem solving is sometimes referred to as RCA. From the perspective of the operational staff or users, this definition works, since they do not want the failure to occur again in the existing product; they want it fixed.

From a development perspective, we want to prevent errors due to this root cause from recurring in future products or development activities. This means we must look for the underlying situation that allowed us, or even encouraged us, to make the error. This can be much more difficult than identifying a physical cause. It also means that to fully understand how the error was made, we need information from the person who made the mistake. Others may guess at the cause, but in most cases only the person who made the error can provide insight into the intellectual process that initiated it.

Common Problems with RCA

I have seen companies try to apply RCA to production failures in an attempt to improve operational reliability over the

www.StickyMinds.com  JUNE 2008  BETTER SOFTWARE  37