Data Lakes For Dummies. Alan R. Simon
did to consolidate and organize data for your annual strategic planning can become part of your overall data lake.
Eliminating Future Stand-Alone Data Marts
Even after getting your data mart proliferation under control as part of your data lake efforts, beware: History can easily repeat itself!
Make no mistake about it: Even though you’re now in the data lake era rather than the earlier data warehouse era, business organizations will still likely want to create their own smaller-scale data marts for their specific analytics needs.
Your data lake gives you a carrot-and-stick, one-two punch to help prevent the proliferation of future data marts.
First the stick, and then the carrot.
Establishing a blockade
Your company’s top leadership needs to help you establish a blockade against new data marts springing into existence. Your chief information officer (CIO) needs to make this policy crystal clear, in concert with their counterparts on the business side: the chief operating officer (COO), chief financial officer (CFO), and others in your company’s executive ranks.
Ideally, even your chief executive officer (CEO) should sign a declaration that another round of data mart proliferation won’t be tolerated.
Should a “no proliferation” edict be written in stone? Probably not. Some departments within your company will inevitably come up with some unique, time-is-of-the-essence analytical need that is better met through a stand-alone data mart than through the data lake.
However, the proponents of a new data mart should be required to prove their case and have their data mart project approved as an exception to the “no proliferation” rule. They need to declare the following:
What the business imperative is for building a new stand-alone data mart (for example, to address some sort of business crisis or to take advantage of a market opportunity that must be addressed immediately)
Why their analytical needs can’t be met using the data lake in the same time frame that it would take to build their new data mart
Whether their planned data mart will be used only for a short period of time and be retired or if it will subsequently be incorporated into the data lake
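The three declarations above could be captured as a simple structured request for an exception board to review. The following is only a minimal sketch: the class name, field names, and helper function are hypothetical and chosen for illustration, not taken from any particular governance tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataMartExceptionRequest:
    """Hypothetical record of the three declarations a data mart proponent makes."""
    # Why a stand-alone data mart is needed right now
    business_imperative: str
    # Why the data lake can't meet the need in the same time frame
    why_data_lake_falls_short: str
    # Retire the data mart after short-term use, or fold it into the data lake
    retirement_or_integration_plan: str


def missing_declarations(request: DataMartExceptionRequest) -> List[str]:
    """Return the names of any declarations that were left blank."""
    return [name for name, value in vars(request).items() if not value.strip()]
```

A request with a blank declaration would be sent back to its proponents before the board spends any time on it.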
Providing a path of least resistance
Business users around your organization build new stand-alone data marts because that’s what they’ve done for a long, long time. They realize that the best way to bring data-driven insights into the way they do business is to take charge of their own fate and build an end-to-end solution. Old habits are extremely difficult to break!
Beyond a blockade on new data mart development, your data lake can give these business users a path of least resistance. Make it easier for them to go to the data lake for the data they need instead of doing everything on their own.
Suppose that a new chief people officer (CPO) is hired to lead your company’s HR organization. Jan, the new CPO, is a big believer in applying super-advanced analytics, such as machine learning and artificial intelligence, to numerous HR functions: employee evaluations, salary adjustments and promotions, succession planning, and more.
Jan appoints an analytics team within HR and tells them that, within the next three months, they need to have some initial machine learning models built in time for the semiannual employee evaluation cycle. Raul, the analytics team leader, has been with your company for 15 years and has built several HR-specific data marts in the past for similar needs.
Raul assigns two of the team members, Julia and Dhiraj, to analyze the HR data in Workday (a cloud-based HR and financial management system) to figure out what data needs to be brought into the machine learning model. Raul also assigns another team member, Tamara, to start designing an Amazon Redshift database to store the HR data and support the machine learning algorithms.
Not so fast, Raul!
Raul submits his budget request for the new HR employee incentive evaluation and involvement operations (EIEIO) data mart and is surprised to learn that he needs to present his business case to the company’s new data mart exception board. Raul starts preparing his PowerPoint slides and comes across item number 2: “State why your analytical needs cannot be met through existing data lake content.”
“Hmm … a data lake,” Raul thinks. “I wonder if the data we need is already in there?”
Sure enough, Raul goes browsing through the data lake catalog and finds that the data lake already has a ton of HR data from Workday that is regularly refreshed. He asks Julia and Dhiraj to match up the work that they’ve done so far with what the data lake catalog shows. Within two hours, they report back with the fantastic news: “Everything we need is in the data lake already!”
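The comparison that Julia and Dhiraj perform amounts to checking a list of required data elements against what the catalog already lists. Here is a minimal sketch of that idea, assuming the catalog can be exported as a plain list of element names. Real data lake catalogs expose their own interfaces, and the function and field names below are invented for illustration.

```python
from typing import Iterable, Set, Tuple


def catalog_coverage(required: Iterable[str], catalog: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Split required data elements into those found in the catalog and those missing."""
    # Normalize names so that "Employee_ID" and "employee_id " compare as equal.
    required_set = {name.strip().lower() for name in required}
    catalog_set = {name.strip().lower() for name in catalog}
    return required_set & catalog_set, required_set - catalog_set
```

An empty "missing" set corresponds to the team's report that everything they need is already in the data lake. Anything left in that set is what an exception request would have to justify.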
A well-constructed data lake offers business users a path of least resistance when it comes to gathering the data they need for analytics. Raul’s team will still need to build the machine learning models to produce the analytics that Jan, your CPO, wants to apply to the next evaluation cycle. But they no longer need to proceed with analytics on a business-as-usual basis, constantly acquiring and storing the same data over and over in different data marts.
Over time, as familiarity with the data lake spreads throughout your organization, fewer unnecessary data mart requests such as Raul’s will need to be redirected back to the data lake. Raul wasn’t deliberately trying to do everything on his own; he just wasn’t familiar enough with what the data lake provided, not only to HR but to your company as a whole.
Establishing a Migration Path for Your Data Warehouses
Data warehousing has been on the scene since around 1990, which means that thousands of enterprise-wide data warehouses have been built and deployed over the years. In fact, looking back at the B-52 analogy earlier in this chapter, you can think of a data warehouse as the equivalent of a propeller-driven airplane that preceded the jet aircraft era, which, of course, makes the data lake the equivalent of that technology-leaping jet.
Some ultramodern, large-scale enterprise data warehouses have been built in the past several years, using relatively new technologies such as the SAP HANA in-memory database management system. Many others, however, were built on older relational databases and are still chugging along. They still work okay, for the most part. But in this new era of data lakes, it’s time to decide what to do about the old-timers.
Sending a faithful data warehouse off to a well-deserved retirement
If your data warehouse is really showing its age, your best bet is to hold a nice retirement party in the company cafeteria with cake and ice cream for everyone and with a few speeches about how wonderfully the data warehouse has served the company’s enterprise-wide reporting and business intelligence mission over the years. (Okay, you can probably skip the cake and ice cream, as well as the cafeteria party itself.)
Then you can do the same thing for your data warehouse that you do for any of your creaky, brittle data marts. Build a new set of data feeds from your source applications and systems into the data lake. Then within your data lake, rebuild the data models that your data warehouse used to support business intelligence and reporting alongside machine learning and other advanced analytics (see Figure 2-6).