Reliability Assessment: A Guide to Aligning Expectations, Practices, and Performance. Daniel Daley
you will never achieve the desired improvement.
In order to achieve the desired improvement and to harvest the full inherent reliability, it is important to clearly recognize the source of failures.
In addition to mis-operation, it is possible to cause failures or allow failures to occur because of inadequate maintenance or inspection. Let’s look at a few simple examples.
The “Path to Failure” is a series of causes and effects that ultimately lead to a failure. At the very beginning of the path is a Systemic Cause that creates a trap for some unsuspecting individual. The next step is a Human Cause leading to a Physical Cause and finally setting up a Failure Mechanism and, ultimately, a defect that will result in a failure. (The following diagram shows a cause-effect flow in which each effect sequentially becomes the cause of the following effect.)
A Failure Mechanism is a form of deterioration that ultimately produces a defect. For instance, for any mechanical device, the only possible failure mechanisms are corrosion, erosion, fatigue, or overload. Let’s take corrosion as an example. If a corrosion circuit exists (cathode – anode – electrolyte), there will be visible signs. First, it should be possible to see two dissimilar metals being joined by a liquid electrolyte, or the products of corrosion (rust) should be evident. If operators, craftspersons, and inspectors are keeping their eyes open, they should be able to recognize this failure mechanism at work. If this failure mechanism is allowed to go on working long enough to result in a defect and a failure, it is not the fault of the device. It is the fault of the humans who operate, maintain, or inspect the device. In order to harvest all the inherent reliability, people need to:
• Know what they are looking for (e.g., understand failure mechanisms)
• Be placed by design and discipline in a position where deterioration or defects are evident (e.g., follow organized rounds in a disciplined manner)
• Keep their eyes open
Taken one step further, after a failure mechanism has been at work for a period of time, a defect will form. But the presence of a defect does not automatically result in a failure. Often nature “throws the dice” for some period of time after a defect has formed but before a failure occurs. By this I mean that several circumstances may need to be present to result in a failure. For example, corrosion may weaken a pipe, but the piping system may also have to experience unusual but not unexpected pressure increases before a failure will occur. This aspect of “forgiving nature” or a grace period between defect and failure provides another opportunity to prevent a failure. But, as with the case of active failure mechanisms, people need to play an active role in finding and removing defects.
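The grace period between defect and failure can be put in rough numerical terms. The following Python sketch is an illustration only and is not taken from the text. It assumes that a defect already exists, that a triggering circumstance (such as an unusual pressure increase) occurs independently in each month with a fixed probability, and that the defect is found and removed at the next inspection. The function name, parameter names, and the 5 percent figure are all hypothetical.

```python
def prob_failure_before_inspection(p_trigger_per_month: float,
                                   months_until_inspection: int) -> float:
    """Chance that a trigger arrives before the next inspection removes the defect.

    Assumptions (illustrative, not from the source text):
      - a defect is already present,
      - each month a triggering event occurs independently with the same probability,
      - the defect is found and removed at the next inspection.
    """
    if not 0.0 <= p_trigger_per_month <= 1.0:
        raise ValueError("p_trigger_per_month must be between 0 and 1")
    if months_until_inspection < 0:
        raise ValueError("months_until_inspection must not be negative")
    # Probability that no trigger occurs in any of the months before inspection.
    p_no_trigger = (1.0 - p_trigger_per_month) ** months_until_inspection
    return 1.0 - p_no_trigger


if __name__ == "__main__":
    # Hypothetical 5% monthly chance of a pressure excursion.
    for months in (1, 6, 12):
        risk = prob_failure_before_inspection(0.05, months)
        print(f"inspection in {months:2d} months: failure risk {risk:.3f}")
```

Under these assumptions, the longer a defect goes unnoticed, the more often nature gets to throw the dice, which is why shorter and more disciplined inspection rounds make better use of the grace period.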
Well-designed programs for operations, maintenance, and inspection are among the keys to harvesting all the inherent reliability of a system. Poorly designed programs leave systems operating below the level that their inherent reliability makes possible.
Maintaining or Improving Inherent Reliability during Modification and Renewal
There are two distinctly different paradigms surrounding the aging of systems and equipment. One paradigm is captured by this description of an aging system: “This plant is unreliable because it is getting old.” The other paradigm is the complete opposite: “We have been working with this unit for a long time, so we have worked out all the bugs and know how to stay ahead of the problems.” In the first case, aging is used as an excuse for poor reliability. The equipment is managing the personnel. In the second case, aging is used as a reason why reliability is good. The personnel are managing the equipment.
In addition to the short-term or day-to-day concerns affecting reliability, there are long-term concerns. For instance, most units go through some form of modernization, expansion, or renewal process during their life. These events are often used as opportunities to enhance reliability. Sometimes, however, the reliability after the event is worse than before.
One form of renewal is an overhaul or, for a complete plant, a turnaround. One philosophy espoused by those with a short-term point of view is to perform the absolute minimum amount of work during those events. Another viewpoint is to limit the work to the amount needed to fulfill requirements. If requirements call for reliable service for the next specified number of years, then the work scope will be designed to deliver that result.
A simple example that compares the minimum amount of work to the amount of work needed to provide reliable service for a specific period is the overhaul of a diesel engine. It may be possible to address immediate concerns and return the engine to service (albeit for a limited period) by replacing piston rings, fuel injectors, and connecting rod bearings. This approach may even provide an engine that is usable for quite some time, depending on the condition of other parts. Yet, if you want to ensure the engine provides the same reliable life as a new engine, it is necessary to perform a careful tear-down, evaluating the condition and remaining life of each and every part. Components that have been worn beyond the point that they can provide the desired life must be replaced.
Other events that occur in the life of many plants and systems are a modification in service or an expansion. During these events, it is possible that current inherent reliability will be retained; it may also be enhanced or even reduced. As in the situation described in the fictional account above, it is not uncommon to see equipment that once provided a source of redundancy used instead as a source of additional capacity. In the example, a redundant electrical feeder was used as a source of power for new loads. It is not uncommon to see spare pumps placed in parallel service with primary pumps to increase throughput.
In some cases, this modification will reduce reliability simply by eliminating redundancy. In other cases, as with parallel pumps, in addition to the loss of redundancy, both pumps may actually wear faster because they are working against one another.
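The effect of giving up redundancy can be sketched with the standard formulas for independent components. The Python example below is an illustration under stated assumptions, not the author's calculation. It assumes two identical pumps, each with the same probability of running through the period, failing independently of one another, with perfect switchover to the spare. The function names and the 0.9 figure are hypothetical.

```python
def reliability_with_spare(r: float) -> float:
    """One pump runs and the other is a spare: service is lost only if both fail."""
    return 1.0 - (1.0 - r) ** 2


def reliability_both_required(r: float) -> float:
    """Both pumps are needed for the new throughput: service is lost if either fails."""
    return r ** 2


if __name__ == "__main__":
    r = 0.9  # hypothetical reliability of a single pump over the period
    print(f"single pump:        {r:.2f}")
    print(f"primary plus spare: {reliability_with_spare(r):.2f}")
    print(f"both in service:    {reliability_both_required(r):.2f}")
```

With a single-pump reliability of 0.9, the spare arrangement works out to about 0.99, while the arrangement that needs both pumps drops to about 0.81. The sketch holds the single-pump reliability constant; if the pumps wear faster when working against one another, as noted above, that value would itself fall and the gap would widen.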
During the development of new facilities, we apply Design-For-Reliability techniques to ensure that the completed product is reliable. We can apply those same techniques during the design of modifications to ensure that the modified facility has an inherent reliability equal to or greater than it had before the change.
The fictional account provided at the beginning of this chapter paints a fairly gloomy picture of how the reliability engineer’s data is received by members of plant management. In some cases, I am sure it is an exaggeration; in others, it is fairly accurate. Think for a moment about issues in your personal life where you have built a set of expectations only to have them dashed by more accurate or realistic information. For many people, reliability is an abstract characteristic that is based more on luck and good intentions than it is on physical realities and solid analysis. For those individuals, it is often painful news when they learn that their systems and equipment are not reliable and that many of the elements contributing to the poor reliability were results of their own choices.
In order to minimize the negative impact of this discovery, it is best if learning “what you have a right to expect” is accomplished as part of a proactive exercise. This exercise should be done quite separately from any event resulting from poor reliability. Finding out that you have some opportunities for improvement feels a lot better when you are doing it on your own than when a catastrophic event has occurred and you are being forced to do so by your boss or his boss.
Independent third parties have little ownership for the programs that have been installed but are ineffective. They are also more likely to tell the complete and undistorted truth than someone who is dependent on the people receiving the report for pay increases and promotional opportunities. Another problem with using someone from inside your current organization is that each and every group has made some contribution to good or poor reliability. As a result, every employee within a plant can be