Data Lakes For Dummies. Alan R. Simon
it came to analytics and data — between your IT organization and the business users who are supposed to be their customers. Not good!
The data lake presents your organization with an opportunity for a fresh start. You can apply many of the best practices and also the painful lessons from 30-plus years of data warehousing to your data lake efforts and avoid repeating the mistakes and shortcomings of the past. As your data lake gets built, no matter if you’re on the IT side or the business side of your company, you can help rebuild that essential trust, especially when it comes to all-important analytics and the resulting data-driven insights.
Reducing Existing Stand-Alone Data Marts
You really can’t argue with the original concept of an enterprise data warehouse! Figure 2-1 illustrates the basic idea of a single home for most or all of the data needed to support a broad range of analytics across the entire enterprise.
Sounds like a great idea, right?
FIGURE 2-1: The vision of an enterprise data warehouse.
Dealing with the data fragmentation problem
A lofty vision is one thing; reality is often something else. Figure 2-2 illustrates how almost every organization’s idea of centralized, enterprise-scale data warehousing eventually surrendered to a landscape littered with numerous stand-alone, nonintegrated data marts.
Okay, so maybe the idea of “Do your own thing, and build your own data mart” got out of control. Now that you can see what a mess that approach created, why not just retire those data marts and fold them into your enterprise data warehouse that’s probably underutilized?
A collection of independent data marts is almost always hampered by a lack of common master data (for example, a “customer” may mean one thing to sales and something different to your marketing team), different software packages and technologies across the data marts, and other challenges. Taken together, these challenges make it almost impossible to consolidate separate, independent data marts back into a single data warehouse. Most organizations instead throw their hands up in the air and say that they’re following a federated data warehouse approach. You “create” a federated data warehouse by simply declaring that some or all of your data marts are part of a “federation” that, taken as a whole, is sort of like a data warehouse. “Um … yeah, that’s our story, and we’re sticking to it. It’s magic!” (Not really … and not all that valuable from an enterprise-wide perspective.)
FIGURE 2-2: The reality of numerous stand-alone data marts.
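To see why a lack of common master data makes consolidation so painful, consider this minimal Python sketch. Everything in it is a made-up assumption for illustration: the company names, and the idea that sales counts only accounts with a closed order while marketing counts anyone it has contacted.

```python
# Hypothetical example: two data marts that both claim to hold "customers."
# The definitions below are illustrative assumptions, not taken from any real system.

# Sales data mart: a "customer" is an account with at least one closed order.
sales_customers = {"acme", "globex"}

# Marketing data mart: a "customer" is anyone who has received a campaign.
marketing_customers = {"acme", "globex", "initech", "umbrella"}

# The same question ("How many customers do we have?") gets two answers.
sales_count = len(sales_customers)
marketing_count = len(marketing_customers)

# Records that marketing calls customers but sales does not.
disputed = marketing_customers - sales_customers

print(f"Sales says {sales_count}, marketing says {marketing_count}")
print(f"Disputed records: {sorted(disputed)}")
```

Until someone decides which definition wins (or defines both under different names), simply merging the two data marts only moves the disagreement into one place.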
Decision point: Retire, isolate, or incorporate?
What should you do about your proliferation of data marts now that your organization is building a data lake? The short answer: Get rid of the data marts … or at least most of them!
You have three main options for how to deal with your proliferation of independent data marts as part of your data lake initiative:
Retire some or all of the data marts, and replace them with data lake functionality.
Isolate some of the data marts, and leave them in place alongside your new data lake.
Incorporate some of your data marts as components of your data lake.
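If it helps to think of the three options as a simple triage, here is a rough Python sketch. The `DataMart` class, its two yes/no fields, and the order of the checks are all assumptions made for illustration. Your real decision will weigh cost, politics, and timing, too.

```python
from dataclasses import dataclass


@dataclass
class DataMart:
    """Hypothetical description of an existing data mart."""
    name: str
    meets_user_needs: bool         # assumption: does it still serve its users well?
    feeds_well_architected: bool   # assumption: are its data feeds worth reusing?


def triage(mart: DataMart) -> str:
    """Return 'retire', 'isolate', or 'incorporate' using a simple rule of thumb."""
    if not mart.meets_user_needs:
        # Coming up short even for its own users: replace it with the data lake.
        return "retire"
    if mart.feeds_well_architected:
        # Works well and its feeds are reusable: bring it into the data lake.
        return "incorporate"
    # Works well but isn't worth reworking yet: leave it alongside the data lake.
    return "isolate"


marts = [
    DataMart("hr_reporting", meets_user_needs=False, feeds_well_architected=False),
    DataMart("strategic_planning", meets_user_needs=True, feeds_well_architected=True),
    DataMart("regional_sales", meets_user_needs=True, feeds_well_architected=False),
]

decisions = {mart.name: triage(mart) for mart in marts}
print(decisions)
```

The rule of thumb checks retirement first because, in most organizations, most data marts land there.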
Data mart retirement
If your existing data marts are creaking and groaning and are now coming up short even for the analytical needs of their respective users, here’s a great idea: Get rid of them!
Figure 2-3 shows how your new data lake gives you the perfect opportunity to not only get your data mart proliferation under control, but also upgrade your overall analytics.
FIGURE 2-3: Using a data lake to retire data marts.
Chances are, most of your data marts, especially those that have been around for a while, support descriptive analytics (basic business intelligence functions such as drilling deeper into summarized data to gain additional insights from lower levels of your data). But what about advanced analytics such as machine learning, data mining, and other artificial intelligence–enabled techniques? Probably not so much!
So, why keep those aging data marts around? Redirect the data feeds from your source systems into your new data lake, and rebuild your analytics for accounting, your human resources (HR) organization, sales and marketing, and other parts of your enterprise within the data lake environment.
Data mart isolation
What if one of your existing data marts is an absolute work of genius? Suppose that three or four years ago, your company built a data mart to support your annual strategic planning cycle. Your strategic planning data mart has data feeds from numerous applications and systems around your enterprise. Do you really want to reinvent the wheel just because you’re now building a data lake?
Great news: You don’t have to throw away your data mart baby along with the data lake water! (Okay, maybe not the best metaphor, but you get the idea.)
Figure 2-4 shows how you can leave that strategic planning data mart in place alongside the new data lake. You’re essentially isolating that data mart from the new epicenter of your enterprise analytics. True, some data feeds will be duplicated between the strategic planning data mart and the data lake. But that’s okay! And over time, maybe you’ll decide to incorporate the strategic planning data mart into the data lake itself.
FIGURE 2-4: Leaving a data mart intact and alongside your data lake.
Data mart incorporation
The primary difference between isolating an existing data mart (refer to Figure 2-4) and incorporating that data mart into the data lake (see Figure 2-5) is that you eliminate the duplicate data feeds between the two.
FIGURE 2-5: Incorporating a data mart into your data lake.
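The difference in duplicate data feeds between isolation and incorporation is easy to count. This Python sketch uses made-up source system names (`erp`, `crm`, `hr`, and `web_logs` are assumptions) and treats each feed as one extraction pipeline that somebody has to build and maintain.

```python
# Hypothetical source systems feeding each environment.
mart_feeds = {"erp", "crm", "hr"}          # strategic planning data mart
lake_feeds = {"erp", "crm", "web_logs"}    # new data lake

# Isolation: the data mart and the data lake each pull from their own sources,
# so any source they share is extracted twice.
duplicated = mart_feeds & lake_feeds
pipelines_when_isolated = len(mart_feeds) + len(lake_feeds)

# Incorporation: every source feeds the data lake once, and the data mart
# is served from inside the data lake.
pipelines_when_incorporated = len(mart_feeds | lake_feeds)

print(f"Duplicated feeds under isolation: {sorted(duplicated)}")
print(f"Pipelines: {pipelines_when_isolated} isolated, "
      f"{pipelines_when_incorporated} incorporated")
```

The savings equal the number of shared sources, which is why incorporation pays off most for data marts that overlap heavily with the data lake.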
Suppose your data feeds for your strategic planning data mart are exceptionally well architected. Why not move them over to bring data into the data lake? Chances are, other analytical needs for accounting, finance, HR, marketing, and other organizations and functions within your enterprise can also leverage that