Maintaining Mission Critical Systems in a 24/7 Environment. Peter M. Curtis
Figure 1.3 SmartWALK™ mobile screenshot
(Courtesy of PMC Group One, LLC)
Figure 1.4 Screenshot of SmartTEAM® Learning Management System
(Courtesy of PMC Group One, LLC)
The mission critical industry can no longer manage its critical systems as it did twenty‐five years ago. The requirements are very different today: the sophisticated nature of data center infrastructure requires constant refreshing and updating of documentation. One way to achieve this is to build into each capital project a living document system that provides the level of granularity necessary to operate a mission critical infrastructure. Doing so helps keep the living document current each time a project is completed or a milestone is reached. Accurate information is the first level of support, giving first responders the intelligence they need to make informed decisions during critical events. It also acts as a succession plan as employees retire and new employees are hired, reducing risk and improving the learning curve for new hires. Remember that more than 50% of all downtime can be traced to human error.
Human error as a cause of hazard scenarios must be identified, and the factors that influence human errors must be considered. Human error is a given and will arise in all stages of the process. It is vital that the factors influencing the likelihood of errors be identified and assessed to determine if improvements in the human factors design of a process are needed. Surprisingly, human factors are perhaps the most poorly understood aspect of process safety and reliability management.
Balancing system design with the training of operating staff in a cost‐effective manner is essential to critical infrastructure planning. When designing a mission critical facility, the level of complexity and the ease of maintainability are major concerns. When there is a problem, the Facilities Manager (FM) is under enormous pressure to isolate the faulty system while maintaining data center loads and other critical loads. The FM does not have the time to go through complex switching procedures during a critical event. A recipe for human error exists when systems are complex, especially if key system operators are not immediately available, or if the Emergency Action Procedures (EAPs) and Standard Operating Procedures (SOPs) are not at hand or have not been reviewed and updated periodically. A relatively simple electrical system design allows quicker and easier troubleshooting during this critical time.
To further complicate the problem, equipment manufacturers and service providers are challenged to find and retain the industry’s top technicians within their own companies. As 24/7 operations become more prevalent, the available talent pool will continue to diminish. As a result, response times could increase from the current standard of four hours to a much longer and less tolerable timeframe. The growing imbalance between the supply of and demand for highly qualified mission critical technicians only further substantiates the need for a simplified, easily accessible, and well‐documented design.
When designing a mission critical facility, a budgeting and auditing plan should be established. Each year substantial amounts of money are spent on building infrastructure, but inadequate capital is allocated to sustain that critical environment through the use of proper documentation, education, and training.
1.7 Education and Training
Technology has been progressing faster than Moore’s Law. Although the mission critical industry has attained high technological standards, most of today’s financial resources remain allocated to planning, engineering, equipment procurement, project management, and continued research and development. Unfortunately, little attention is given to the actual management of these systems. As equipment reliability increases, a larger percentage of downtime results from the actions of personnel who were not properly trained or who do not have access to accurate data during crisis events. The diversity among mission critical systems severely hinders people’s ability to fully understand and master all necessary equipment and relevant information.
In the past, a greater percentage of people were hands‐on, and it was natural for many families to make their own home and auto repairs just out of necessity. In doing so they became mechanically inclined and attained an understanding of how systems operate. This experience gave a number of today’s mission critical professionals a set of skills to build upon.
Today’s “Nintendo generation” is gaining a slightly different set of skills through computers, software, and video games. They are gaining valuable experience with IT systems and will have a solid foundation to continue to develop more advanced IT skills. The next step is to create a strong succession plan that teaches them how critical infrastructure operates and connects their already abundant IT knowledge to engineering. Then, existing professionals can show them how to apply that knowledge in the field.
The best strategy may be to start training successors as early as possible so, upon retirement of current staff, someone is trained with the necessary experience to take on operational responsibilities. New college programs that include internships should be developed and made attractive for young engineers. These programs need to show real career path options and align with corporate needs.
It is time to invest in our future so that the people who will be running the critical infrastructure of our country have the skill sets needed to meet and exceed our current standards. We need to constantly evolve and improve as professionals or risk becoming extinct. If this is not addressed in a timely and proper manner, we jeopardize the foundation of how our everyday business is run and our e‐commerce is generated. Imagine what would happen if, due to inadequate training, no one fully understood how to operate and maintain our critical infrastructure by the time all the experience‐hardened experts retire.
With that being said, certified training programs should be developed by industry and instituted, so there are established standards and best practices. It is only through education and training that we can guarantee facility employees are knowledgeable about all equipment and processes.
1.8 Corporate Knowledge Transfer – the Means to Securing Tomorrow’s Critical Infrastructure
We are at a crucial crossroads in our digital transformation as our data centers, supporting our critical infrastructure and global information flow, are expected to double in the next decade due to evolving technologies – from machine learning to AI, IoT to Edge. Today’s facility engineers play an essential role in the institutional changes regarding the safety and security of critical infrastructures that impact the lives of billions of people and trillions of transactions worldwide – from emails and movie streams to financial transactions, transportation, healthcare, and nearly every aspect of our digitized lives. Our aging population of facility engineers holds the key to the digitized critical facility operations of tomorrow. With a large percentage of the world’s information flow processed across data centers, there is an urgency to ensure that the next generation of mission‐critical engineers is well informed and highly trained, as well as proficient in the corporate knowledge of their predecessors. To prevent a knowledge gap that will most certainly jeopardize the safety and security of our future, it is imperative that retiring industry leaders capture and transfer their corporate knowledge of managing efficient business processes – some of which may not be found in an operator’s manual. They must take an active role in coaching, training, and mentoring the younger generation of engineers who will be managing tomorrow’s digitized facilities. As described