Data Lakes For Dummies. Alan R. Simon
Managing Equal Opportunity Data
Your data lake’s big data foundation presents you with an opportunity that, not too long ago, was out of reach for most organizations. You can store, manage, and analyze all three types of data — structured, unstructured, and semi-structured — within a single environment, and without having to jump through hoops to do so!
Many of the business questions you ask of your data will only require structured data. Suppose you work in the supply chain organization within your company. You’ll definitely want your data lake to provide insight into the following:
Which of your strategic suppliers has the best combination of on-time component production and low problem rates?
Which third-party logistics firms have the best — or worst — on-time shipping performance?
What’s the percentage of product spoilage among all internal and third-party warehouses during the past six months?
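To make the first of these questions concrete, here is a minimal sketch of how a supplier ranking could be computed from structured data. Everything in it is an assumption made for illustration: the `SupplierStats` record, the `rank_suppliers` function, the 2 percent problem-rate cutoff, and the sample figures are all hypothetical and don't come from any particular data lake product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SupplierStats:
    """Hypothetical structured record: one row per supplier."""
    name: str
    orders: int      # components ordered in the period
    on_time: int     # orders delivered on time
    defective: int   # orders with a reported problem


def rank_suppliers(
    stats: List[SupplierStats], max_defect_rate: float = 0.02
) -> List[Tuple[str, float, float]]:
    """Keep suppliers at or below the problem-rate cutoff, best on-time rate first."""
    qualified = []
    for s in stats:
        if s.orders == 0:
            continue  # no activity, so no rate can be computed
        on_time_rate = s.on_time / s.orders
        defect_rate = s.defective / s.orders
        if defect_rate <= max_defect_rate:
            qualified.append((s.name, on_time_rate, defect_rate))
    # Sort by on-time rate (descending), then by problem rate (ascending).
    return sorted(qualified, key=lambda row: (-row[1], row[2]))


# Made-up sample data, purely for illustration.
SAMPLE = [
    SupplierStats("Acme Components", orders=200, on_time=190, defective=2),
    SupplierStats("Borealis Parts", orders=150, on_time=147, defective=6),
    SupplierStats("Cobalt Supply", orders=100, on_time=92, defective=1),
]

if __name__ == "__main__":
    for name, on_time_rate, defect_rate in rank_suppliers(SAMPLE):
        print(f"{name}: {on_time_rate:.0%} on time, {defect_rate:.1%} problems")
```

In this sample, the supplier with the best on-time rate drops out because its problem rate is above the cutoff, which is the kind of trade-off the question is really asking about.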
Other critical business analytics may involve unstructured or semi-structured data. You’ll want to know the following:
What percentage of tweets from your customers represent a positive sentiment about your product quality? Negative sentiment?
What “hot spots” are showing up in blogs, tweets, and other social media posts, as well as YouTube videos, that could signal profitability and market share problems for you down the road?
Your reports show a dramatic increase in breakage in Warehouse #2. You have surveillance cameras in all your facilities. Is there anything that shows up on video that could indicate one or more root causes for this breakage that you can address through procedural changes?
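The sentiment question is a good example of working with semi-structured data. The sketch below assumes that social media posts land in the data lake as JSON records, one per line, and that some upstream step has already attached a `sentiment` label to most of them. The field names, the sample posts, and the `sentiment_share` function are hypothetical and exist only to illustrate the idea.

```python
import json
from typing import Dict

# Made-up JSON Lines sample. Note that the records don't share a fixed
# set of fields, which is what makes the data semi-structured.
RAW_POSTS = """\
{"source": "twitter", "text": "Love the new model", "sentiment": "positive"}
{"source": "blog", "text": "Arrived broken again", "sentiment": "negative", "tags": ["quality"]}
{"source": "twitter", "text": "Works fine", "sentiment": "positive"}
{"source": "youtube", "text": "Unboxing video"}
"""


def sentiment_share(raw_lines: str) -> Dict[str, float]:
    """Return the percentage of labeled posts carrying each sentiment label."""
    counts: Dict[str, int] = {}
    for line in raw_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        post = json.loads(line)
        label = post.get("sentiment")
        if label is None:
            continue  # unlabeled posts are left out of the percentages
        counts[label] = counts.get(label, 0) + 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    for label, pct in sentiment_share(RAW_POSTS).items():
        print(f"{label}: {pct:.1f}%")
```

Leaving unlabeled posts out of the denominator is a design choice for this sketch. You may instead prefer to report them as a separate "unknown" share.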
Your data lake gives you one-stop shopping for structured, unstructured, and semi-structured data in a logically centralized, cohesive environment.
BACK TO THE FUTURE, PART 2
In the first edition of Data Warehousing For Dummies (Wiley), back in 1996, I included a chapter about the future directions of data warehousing. One of the forecasts I made was that the first-generation data warehousing of that time would eventually evolve into what I called “multimedia data warehousing” and would include not only structured data but also video and audio content. I made this prediction on the basis that “not all of the business questions we need to ask out of a data warehouse will come from numbers, dates, and character strings; sometimes we need information from images and other multimedia content as well.”
Guess what? You can think of a data lake as the modern incarnation of that “multimedia data warehouse” that I wrote about more than a quarter-century ago. It’s here!
Building Today’s — and Tomorrow’s — Enterprise Analytical Data Environment
Building an all-new analytical data environment around big data technology sounds like a great idea, right? You may be worried, though, that your organization could invest a ton of money over the next couple of years, only to find that your data lake has been made obsolete by an entirely new generation of technology.
In other words, can your data lake be not just today’s but also tomorrow’s go-to platform for more and more analytical data and data-driven insights? Absolutely!
Constructing a bionic data environment
Maybe you’ve heard of a B-52. No, not a member of the American new wave music group (so don’t start singing “Love Shack”) but rather the U.S. Air Force plane.
The B-52 first flew in 1952 and entered service in 1955. The normal life span for an Air Force plane is around 28 years before it’s shuffled off to retirement, which means that B-52s should’ve gone out of service in the early 1980s. Instead, the B-52 will eventually be retired sometime in the 2050s. That’s about a hundred years, an entire century!
However, under the skin, a B-52 today bears only a slight resemblance to one made in the ’50s or ’60s. Sure, if you were to put one of the original B-52s side by side with one of today’s planes, the two aircraft would look nearly identical. But the engines, the avionics, the flight controls … pretty much every major subsystem has been significantly upgraded or replaced in each operational B-52 at least a couple of times over the years.
Better yet, a B-52 isn’t just some old plane that you may see flying at an airshow but that otherwise doesn’t have much purpose due to the passage of time. Not only is the B-52 still a viable, operational plane, but its mission has continually expanded over the years thanks to new technologies and capabilities.
In fact, you can think of a B-52 as sort of a bionic airplane. Its components and subsystems have been — and will continue to be — swapped out and substantially upgraded on a regular basis, giving the plane a planned life span of almost four times the normal longevity of the typical Air Force plane. Talk about an awe-inspiring feat of engineering!
However, all those enhancements and modifications to the B-52 happened gradually over time, not all at once. Plus, the changes were all carefully planned and implemented with longevity and continued viability top of mind.
Your data lake should follow the same model: a “bionic” enterprise-scale analytical data environment that regularly incorporates new and improved technologies, both to replace older ones and to enhance overall function. You almost certainly won’t get an entire century’s usage out of a data lake that you build today, but if you do a good job with your planning and implementation, 10 or even 20 years of value from your data lake is certainly achievable.
More important, your data lake won’t be just another aging system hanging around long past when it should’ve been retired. You almost certainly have plenty of those antiquated systems stashed in your company’s overall IT portfolio. That’s why the B-52 is the perfect analogy for the data lake, with a “bionic” approach to regularly replacing major subsystems helping to keep your data lake viable for years to come.
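One way to picture the “bionic” approach in software terms is a data lake whose subsystems sit behind stable interfaces so that each one can be swapped out without disturbing the rest. The sketch below is only an illustration of that idea. The `StorageLayer` contract, the `InMemoryStorage` stand-in, and the `DataLake` class are all hypothetical, and a real migration between storage technologies would involve far more than copying objects in a loop.

```python
from typing import Dict, Iterable, Protocol


class StorageLayer(Protocol):
    """Hypothetical contract that every storage subsystem must honor."""

    def write(self, key: str, payload: bytes) -> None: ...
    def read(self, key: str) -> bytes: ...
    def keys(self) -> Iterable[str]: ...


class InMemoryStorage:
    """Stand-in for a real storage technology, labeled by generation."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._objects: Dict[str, bytes] = {}

    def write(self, key: str, payload: bytes) -> None:
        self._objects[key] = payload

    def read(self, key: str) -> bytes:
        return self._objects[key]

    def keys(self) -> Iterable[str]:
        return list(self._objects)


class DataLake:
    """The 'airframe': callers use ingest/fetch and never see the storage."""

    def __init__(self, storage: StorageLayer) -> None:
        self._storage = storage

    def ingest(self, key: str, payload: bytes) -> None:
        self._storage.write(key, payload)

    def fetch(self, key: str) -> bytes:
        return self._storage.read(key)

    def swap_storage(self, replacement: StorageLayer) -> None:
        """Copy existing objects to the new subsystem, then switch over."""
        for key in self._storage.keys():
            replacement.write(key, self._storage.read(key))
        self._storage = replacement


if __name__ == "__main__":
    lake = DataLake(InMemoryStorage("generation-1"))
    lake.ingest("orders/2021.csv", b"id,total\n1,9.99\n")
    lake.swap_storage(InMemoryStorage("generation-2"))
    print(lake.fetch("orders/2021.csv"))  # same data, new subsystem
```

The point of the design is that code calling `ingest` and `fetch` doesn't change when the storage underneath it does, much as a B-52 pilot still flies the same airframe after an engine upgrade.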
Strengthening the analytics relationship between IT and the business
If a tree falls in a forest, but nobody is around to hear it fall, does it make a sound?
Or how about this one: If you build a system to support analytics across your organization and load it with tons of data, but nobody really uses it, does your organization really have analytical data?
Don’t worry, you didn’t go back in time to a college philosophy class — you won’t be graded on your responses to either of these questions.
You can think of a data warehouse as a direct ancestor of a data lake. Data warehousing came onto the scene around 1990, and it has been the primary go-to approach for enterprise analytics in the decades since.
Far too many of today’s data warehouses are like that tree falling in a forest. The IT side of your company originally set out to build an enterprise-wide home for analytical data that would support reporting, business intelligence, data visualization, and other analytical needs from every corner of your organization.
Alas, that data warehouse, like so many others, came up short. Maybe the data warehouse doesn’t contain certain sets of data that are needed for critical analytics. Perhaps the data warehouse contents aren’t properly organized and structured and are difficult to access with the business intelligence tools available. Whatever the reason may be, your organization’s business users finally said, “To heck with it!” and built their own smaller-scale data marts to satisfy their own departmental or functional analytical needs.
Along