There’s a lot of discussion around data lakes and data warehouses. When organizations are evaluating these two technologies, there is one big question: “Where should you store your data to do the type of analysis you need and get the most out of it?” While it may be tempting to choose only the newer data lake format, or the tried-and-true data warehouse, ultimately both systems offer both pros and cons. We are going to explore some of the big ones in this blog.
Pros of the Traditional Data Warehouse
While the traditional data warehouse has started to see some Big Data challenges (from social media and marketing tools) and the growing demands of data science, there are still a lot of benefits to this solution.
One of the core benefits of the data warehouse is that it’s a well-established, proven solution. The SQL server stack is a widely available, mature toolset that has been around for a long time. If you need additional resources or a team internally to support your data storage system, it’s easy to find resources that are familiar with those tools. Although new functionality can create issues, for the most part a data warehouse is well-proven and can run optimally with very little maintenance.
A data warehouse also delivers very good performance – partially due to its underlying structure. Since a data warehouse has a schema-on-write, SQL servers (or other data warehouses) already understand its underlying structure. Therefore, its query engine can understand how to work with that data to quickly return query results.
The usability of a data warehouse is due to the analytical model. Users may not be familiar with how to get data out of the source systems where the data originates. Often, those structures can be extremely complex and normalized. Hence the data warehouse is an opportunity to have a more denormalized model that makes it easier for users to understand how to slice and dice that data. The data warehouse is flexible in that it can work well on-premises or in the cloud. Whether you’re using a SQL server stack on-prem, or exploring an Azure SQL database / data warehouse in the cloud, any of these data warehouse options can support this type of approach very well.
Cons of the Traditional Data Warehouse
While data warehouses have significant advantages, they also have some significant drawbacks in this era of Big Data and high data science demand. Some of the cons of the data warehouse are:
Storage cost is one of the key modern drawbacks of a data warehouse. Database resources are more expensive than those of a data lake due to features. The budget-conscious company (that is, most companies) must ask if it needs to store large volumes of data in an Azure SQL database or data warehouse (which may have a higher storage cost than a data lake).
With a data warehouse, you have to wait for each business process component to be built to get value from that data. Although there’s a lot of value in having an analytical model, it also takes time to have your development team support getting data into that model, typically through the development of ETL processes. If you have data that’s not in your data warehouse today, and users want to report on it, then there can be additional latency to get that data into the warehouse.
Cost and warehouse model can restrict capturing data of unknown value, which can limit exploration. You might only pull certain amounts of data where you had known reporting requirements. There may be a whole other set of data that you’re not bringing in that users might want to explore.
Not Designed for Big Data
Of course, the Big Data challenge is one of the major sticking points when it comes to the data warehouse. Its schema-on-write is not really optimized for that variety, velocity and volume of data. Disruptors like hybrid source systems (cloud and on-prem), and high data volume, variety and velocity coming from big data scenarios can limit the effectiveness of your data warehouse.
Pros of the Data Lake
The data lake solves some of the pros and cons of the traditional data warehouse, but it also offers drawbacks of its own. First, let’s look at its pros.
Volume, Variety and Velocity
The real strength of the data lake is that it does a good job addressing some of those Big Data challenges like volume, variety, and velocity. It’s infinitely scalable, handles structured or unstructured data, and is designed for rapid data ingestion. It can also support Internet of Things scenarios. Because it’s schema-on-read, we don’t have to understand the format of the data until we’re ready to read it back out. We can write that data very quickly before we have that understanding and without risk of write errors.
Low Cost Storage
A data lake is also low cost (relatively speaking). That allows us to not worry as much about the types of data that we’re storing. We may not know the analytical value around our data yet, but with data lake storage, we can inexpensively hang on to it for some future point where we might be able to find value for it.
Landing the data in the data lake makes it easy for us to open copies of that data (or subsets of that data) to different user groups. Whether that’s for self-service or data science, we can control access to that data.
Cons of the Data Lake
While these are some big positives, the data lake also has some significant cons.
There are on-premises data lake solutions (Hadoop is a very common one). However, installing a data lake solution on-prem can be much more complex, whereas spinning off a data lake in the cloud is very simple. This may be considered a negative if it does not align with your infrastructure strategy.
Skillset Learning Curve
The data lake often comes with a new set of tools and services that need to be understood (and it’s a bit of a learning curve). That requires some additional investment, either from recruiting to get the right team members or doing internal professional development to understand those new tools.
When moving from a data warehouse to a data lake (or just adding a data lake to your existing system), there’s going to be a transition period. If you already have an existing data warehouse, how does the data lake fit into that? Do you rework some of that for the new needs and functionality, or not?
Investment in Best Practices and Processes
A data lake also requires significant investment not just from a skillset perspective, but also in understanding how that impacts your best practices. In addition to being able to do the work, you need to understand what your organizational standards will be around doing that work.
Not Optimized for Query Performance
Although we can very easily land data in the data lake, in order to query that data back out, the data lake doesn’t have the same underlying query engine that a data warehouse does. Those queries may not perform to the level you need.
Which Option Is Right for You?
Often, we see that the data lake and the data warehouse really work better together.