Maintaining Mission Critical Systems in a 24/7 Environment. Peter M. Curtis
performed in the system acceptance test procedures used during the acceptance phase and use the original data for trending any changes in the system. Reliability Assurance Testing should be performed after the vendor has provided Preventive Maintenance (PM). The reason for testing after the vendor's PM routine is that the vendor has just interacted with a commissioned system and disassembled portions of it. In some cases, the vendor also installs updated software or control boards. The system must then be recertified through Reliability Assurance Testing as worthy of carrying critical load. Remember that vendor-provided PM does not measure performance or track system degradation, so without a Reliability Assurance Testing program, the quality control process is compromised.
Before the facility goes on‐line, it is crucial to resolve all potential equipment problems (Technology, Operations, etc.). This is the construction team’s sole opportunity to integrate and commission all the systems, due to the facility’s 24/7 mission critical status. At this point in the project, all systems installed were tested at the factory and witnessed by a competent Commissioning Authority (CxA) familiar with the equipment processes and procedures.
Once the equipment is delivered, set in place, and wired, it is time for the second phase of certified testing and integration. The importance of this phase is to verify and certify that all components work together and to fine‐tune, calibrate, and integrate the systems. There is a tremendous amount of preparation in this phase. The facilities engineer must work with the factory, field engineers, and independent test consultants to coordinate testing and calibration. Critical circuit breakers must be tested and calibrated prior to placing any critical electrical load on them. When all the tests are completed, the facilities engineer must compile the certified test reports, which will establish a benchmark for all future testing. The last phase is to train the staff on each major piece of equipment and prepare for the transition to operations.
Many decisions regarding how and when to service a facility’s mission critical electrical/mechanical equipment are going to be subjective. The objective is easy: a high level of safety and reliability from the equipment, components, and systems. But discovering the most cost-effective and practical methods required to accomplish this can be challenging. Network with colleagues, consult knowledgeable sources, and review industry and professional standards and best practices before choosing the approach best suited to your maintenance goals. Also, keep in mind that the individuals performing the testing and service should have the best training and experience available. You depend on their conscientiousness and decision-making ability to avoid potential problems with perhaps the most crucial equipment in your building. Most importantly, learn from your experiences and those of others. Maintenance programs should improve continuously. If a scheduled procedure has not previously identified a problem, consider adjusting the schedule accordingly. Examine your maintenance programs on a regular basis and make appropriate adjustments.
Acceptance and maintenance testing are pointless unless the test results are evaluated and compared to standards and to previous test reports that have established benchmarks. It is imperative to recognize failing equipment and to take appropriate action as soon as possible. Common practice in this industry is for technicians to perform maintenance without reviewing prior work tickets and records. This approach defeats the value of benchmarking and trending and must be improved. With prior records reviewed, mission critical facility engineers can keep objectives in perspective and depend on their options when faced with a real emergency.
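The comparison of test results against an established benchmark can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method. The `Reading` record, the `trend_report` helper, the 20% alert limit, and the sample values are all hypothetical and do not come from any standard or real test report. Actual acceptance limits should be taken from the applicable standards and the manufacturer's data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Reading:
    """One measured value from a certified test report (hypothetical schema)."""
    test_date: str   # ISO date of the test, e.g. "2021-01-18"
    value: float     # measured value, e.g. contact resistance in micro-ohms


def percent_change(benchmark: float, current: float) -> float:
    """Return the signed percent change of current relative to the benchmark."""
    if benchmark == 0:
        raise ValueError("benchmark must be non-zero")
    return (current - benchmark) / benchmark * 100.0


def trend_report(benchmark: Reading, history: list[Reading],
                 limit_pct: float) -> list[tuple[str, float, bool]]:
    """Compare each later reading with the acceptance benchmark.

    Returns (test_date, percent change, exceeds_limit) per reading, oldest
    first. limit_pct is an assumed alert threshold chosen by the reader.
    """
    rows = []
    for reading in sorted(history, key=lambda r: r.test_date):
        change = percent_change(benchmark.value, reading.value)
        rows.append((reading.test_date, round(change, 1), abs(change) > limit_pct))
    return rows


# Illustrative numbers only; not taken from a real test report.
acceptance = Reading("2020-01-15", 50.0)
later = [
    Reading("2022-01-20", 56.0),
    Reading("2021-01-18", 51.0),
    Reading("2023-01-19", 64.0),
]
report = trend_report(acceptance, later, limit_pct=20.0)
for row in report:
    print(row)
```

Each row shows how far a reading has drifted from the acceptance benchmark, so gradual degradation is visible before a single test fails outright. Such a comparison is only possible if the prior records are retained and reviewed before each maintenance visit.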
The importance of taking every opportunity to perform preventive maintenance thoroughly and completely, especially in mission critical facilities, cannot be stressed enough. If maintenance is not done properly, the next opportunity will come at a much higher price: downtime, lost business, lost potential clients, and the safety issues that arise when technicians rush to fix a maintenance problem. So do it correctly ahead of time and avoid shortcuts, because it will be very difficult to do it again.
1.6 Documentation and Human Factor
The mission critical industry’s focus on physical infrastructure enhancements descends from the early stages of the trade when all efforts were placed solely in design and construction techniques to enhance mission critical equipment.
Twenty-five years ago, the technology supporting mission critical loads was simple. There was little sophistication in the electrical load profile; at that time, the industry was in its infancy. Over time, data centers have grown from a few mainframes supporting minimal software applications to server farms that can occupy 100,000 ft² or more, with Google and Microsoft being prime examples.
As more processing power was required to sustain our global economy, the electrical and mechanical systems supporting the critical load became increasingly complex. With businesses relying on this infrastructure, more capital dollars were invested to improve the uptime of their lines of business. Today, billions of dollars are invested at the enterprise level in the infrastructure that supports the business's 24/7 applications; the major investments are normally in design, equipment procurement, and project management. Few capital dollars are invested in documentation, change management, education/training, or operations and maintenance. The initial capital investment is just the tip of the iceberg (Figure 1.1).
Figure 1.1 Hidden Costs of Operations
Figure 1.2 Typical screenshot of SmartWALK™ dashboard
(Courtesy of PMC Group One, LLC.)
Years ago, most organizations relied heavily on their workforce to retain much of the information regarding the mission critical systems. A large body of personnel had a similar level of expertise, and they remained with their company for decades. Therefore, little emphasis was placed on maintaining a living document for the critical infrastructure. Tables 1.4 to 1.6 identify questions with regard to managing the loss of personnel, documentation, and managing during a critical event.
Table 1.4 Managing Loss of Critical Personnel
The Issues: Employee Turnover, Retirement, Sick Leave or Vacation |
Was knowledge lost? |
Where is existing documentation? |
How are new employees trained? |
What risks are faced during the transition? |
Table 1.5 Documentation Issues
The Issue: Traditional documentation systems are inconsistent, inaccessible, and unstructured. |
How is information shared? |
Is system data readily available? |
Where is the documentation? |
How are revisions approved and made available to all users? |
Table 1.6 Managing During Critical Events
The Threats: Fires, Natural Disasters, Blackouts, |