In my previous blog, I talked about some of the pros and cons of data lake and data warehouse solutions. While for some organizations it might make sense to choose one or the other, what we see is that the data lake and the data warehouse really work better together.
We can use the data lake early in an architecture because it’s optimized for ingesting data in Big Data scenarios, and we can use it to give the data science layer access to that data. However, when it comes to end users and self-service, there’s still a lot of value in a highly optimized query layer. We benefit from keeping that analytical data model in a data warehouse: we get the best query performance, and we make it easy for users to understand how to slice and dice the data.
We typically put a data warehouse on top of the data lake, and then our reporting can come out of that data warehouse.
What’s the Best Configuration?
The image above shows a single data lake feeding a single data warehouse, but there are lots of options. The image below looks rather more complex. We might find there is some delineation we want to make based on where data is stored, and thus we want a couple of data lakes. We can also have a collection of data marts or data warehouses that serve different audiences. All the options that existed before data lakes still exist today; we just have an additional layer now to better meet the needs of that Big Data scenario.
We’re also seeing some changing functionality. I covered the individual strengths and weaknesses of data lakes and data warehouses in the previous blog of this series, but the primary difference is that one has a schema-on-read and the other has a schema-on-write. Today, we’re starting to see some of that relational database technology coming into the data lake or sitting on top of it.
Generally, there’s still enough differentiation that we see the data warehouse and data lake remain separate layers. However, we’re seeing advancements on both sides as each tries to enter that middle ground of being best at both.
Examples of New Hybrid Options
Data Lake Hybrids: Databricks Delta
One example is Azure Databricks and its Databricks Delta implementation. Databricks Delta starts to give us more of that transaction management. Other high-value features include data versioning, so we can track point-in-time values, and more efficient upserts. With it, we’re starting to see relational-style functionality layered on top of the data lake.
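As a rough sketch of what those features look like in practice, assuming a hypothetical Delta table named `events` and an incoming staging table named `updates` (both names are illustrative, not from the original), an upsert and a point-in-time query can be expressed in Spark SQL:

```sql
-- Upsert: merge staged changes into the Delta table in one transaction.
MERGE INTO events AS target
USING updates AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Data versioning ("time travel"): query the table as of an earlier
-- version number or timestamp.
SELECT * FROM events VERSION AS OF 12;
SELECT * FROM events TIMESTAMP AS OF '2019-01-01';
```

Because the merge runs as a single ACID transaction over files in the data lake, readers never see a half-applied batch, which is exactly the kind of guarantee that previously required a relational database.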
Common Data Model: Power BI Dataflows
Another example is the Common Data Model, part of an open data initiative Microsoft is pursuing to put more structure around a data lake so that it lends itself to end-user analysis.
Power BI Dataflows lets us ingest data at a self-service or enterprise level and land that data in a data lake while still keeping some schema and meaning around it. That preserves an understanding of the relationships between entities and makes it easier to build data visualizations on top of the lake.
Again, this may not meet everybody’s performance needs (a data warehouse may still make sense), but we’re seeing some opportunities in that middle ground of hybrid solutions.
Does it make sense to have a data lake, a data warehouse, or both? It ultimately depends on your business requirements.
A data lake can add a lot of value to your organization if you:
- Work with Big Data and need to handle the volume, velocity, or variety of that data.
- Have data science or machine learning/AI roles that need to do broader exploration of data that may not yet have a known analytical value.
- Value speed over accuracy (meaning you prioritize the ability to analyze data quickly over a more formal IT extract, transform and load process).
The data lake can offer a lot of value. But, from an end-user perspective, the data warehouse is still a cornerstone of a good data analytics solution.
The data warehouse is generally your cornerstone because it allows you to:
- Provide the single source of truth that businesses expect in most cases.
- Ensure you have a quality, analytical data model that lends itself to slicing and dicing the data.
- Guarantee that users have accurate data in self-service scenarios built off that model.
Both solutions offer plenty of value to your organization, and they will generally work better together. In my next blog, I’m going to look at a common question about the data swamp that can happen with a data lake (and how to avoid that situation).