A data warehouse project consists of a number of core parts that must be put in place to deliver good analytics reporting to your users. At a high level, these parts are data modeling, data mapping, ETL (Extract, Transform and Load) processes, data validation, possibly the creation of a cube, and the creation of front-end reporting and dashboards. Depending on the size of the project, these tasks can take a fair amount of time to implement.
Often the most time-consuming piece of a project is the ETL processing: the work that moves data out of the source system, cleanses and transforms it according to business rules, and loads it into the data warehouse. ETL processes can be implemented in a variety of ways. For example, they can be driven primarily by SSIS (SQL Server Integration Services) packages, primarily by stored procedures, or by a combination of the two. The tools available within Microsoft SQL Server are quite capable of implementing a solid ETL solution. While these tools can be used right out of the box, doing so has drawbacks. A typical data warehouse solution has an ETL process for each dimension and each fact table being populated. If each process is built without the bigger picture in mind, a significant amount of time can be spent writing similar code for each one. In larger projects this really adds up.
Instead of using the tools right out of the box, it can be useful to step back and think about the goals of a solid ETL process. It should have low impact on the source system, it should perform well, and it should provide logging and data validation, among other industry best practices. These features and best practices can be brought together into a reusable ETL framework. The core features and benefits that an ETL framework should include are listed below. Since ETL code is typically very similar from dimension to dimension and fact to fact, a good practice is to develop a set of templates that include these core features and cover common data warehouse ETL requirements such as handling slowly changing dimensions.
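To make the slowly-changing-dimension requirement concrete, here is a minimal sketch (in Python, with hypothetical names) of the Type 2 logic such a template would capture: new members are inserted, changed members get the old version expired and a new current version inserted, and unchanged members are left alone.

```python
from datetime import date

def apply_scd2(existing, incoming, key, tracked_cols, today=None):
    """Apply Type 2 slowly changing dimension logic.

    existing: dict mapping business key -> current dimension row (a dict)
    incoming: list of source rows (dicts)
    Returns (expired_rows, new_rows) to be applied to the dimension table.
    """
    today = today or date.today()
    expired, new = [], []
    for row in incoming:
        current = existing.get(row[key])
        if current is None:
            # Brand-new member: insert as the current version.
            new.append({**row, "valid_from": today, "valid_to": None})
        elif any(current[c] != row[c] for c in tracked_cols):
            # A tracked attribute changed: expire the old version,
            # then insert the new current version.
            expired.append({**current, "valid_to": today})
            new.append({**row, "valid_from": today, "valid_to": None})
        # Unchanged rows require no action.
    return expired, new
```

In a real framework this same logic would live once, in the template, rather than being rewritten for every dimension.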
To really save development time, a code generator can be added that uses the templates to generate the actual code from your data warehouse design. This lets you avoid a great deal of repetitive work, produce consistent, easy-to-maintain code, and reduce the potential for mistakes. Once the code is generated, you only have to modify the areas specific to the dimension or fact at hand; for example, you will still write the source system extract query yourself, per the business requirements. A framework like this takes time to develop, or money to purchase, but the time saved building and maintaining ETL solutions should pay for itself quickly.
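The generation idea can be sketched very simply: a template with placeholders, filled in from metadata describing each table. Everything below is illustrative (the `dw`/`stg` schemas, `DimCustomer`, and the column names are assumptions, not a real framework's API).

```python
# Toy metadata-driven generator: render a T-SQL MERGE for one dimension.
MERGE_TEMPLATE = """\
MERGE INTO dw.{table} AS tgt
USING stg.{table} AS src
    ON tgt.{key} = src.{key}
WHEN MATCHED AND ({change_test}) THEN
    UPDATE SET {update_list}
WHEN NOT MATCHED BY TARGET THEN
    INSERT ({insert_cols}) VALUES ({insert_vals});"""

def generate_merge(table, key, columns):
    """Fill the template from metadata: table name, business key, columns."""
    return MERGE_TEMPLATE.format(
        table=table,
        key=key,
        change_test=" OR ".join(f"tgt.{c} <> src.{c}" for c in columns),
        update_list=", ".join(f"tgt.{c} = src.{c}" for c in columns),
        insert_cols=", ".join([key] + columns),
        insert_vals=", ".join(f"src.{c}" for c in [key] + columns),
    )

print(generate_merge("DimCustomer", "CustomerKey", ["Name", "City"]))
```

A production generator would of course read its metadata from the warehouse schema and configuration rather than hard-coded arguments, but the principle is the same: one template, many consistent ETL processes.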
These features and benefits are often found in a solid ETL framework, whether it is custom built or purchased.
ETL Framework Features
- Low Source System Impact
- High Performance (Set-Based)
- Data Validation
- Built-In Error Handling
- Industry Best Practices Incorporated
- Metadata-Driven Code Generation
- Unit Testing Capabilities for Development
- Quick to Develop
- Consistent Code
- Easy to Maintain
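As one example of how features like built-in error handling and logging can be centralized rather than rebuilt per process, a framework might wrap every generated ETL step in a shared harness. This is only a sketch with hypothetical names, not the code of any particular framework:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def etl_step(func):
    """Give every ETL step the same logging and error handling,
    so individual processes never re-implement it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        log.info("starting step %s", func.__name__)
        try:
            rows = func(*args, **kwargs)
        except Exception:
            # Log the failure with a traceback, then re-raise so the
            # orchestrator can stop or retry the load.
            log.exception("step %s failed", func.__name__)
            raise
        log.info("step %s loaded %d rows in %.2fs",
                 func.__name__, rows, time.perf_counter() - start)
        return rows
    return wrapper

@etl_step
def load_dim_customer():
    # Placeholder for the generated load logic; returns a row count.
    return 42
```

Every step then reports timings and failures identically, which is what makes a fleet of generated ETL processes easy to monitor and maintain.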
We built a custom ETL framework that has matured over many years and now includes all the features listed above, which allows us to develop reliable solutions much more quickly than starting from scratch each time. It also lets our developers easily maintain solutions they were not involved in building, because every solution shares the same ETL architecture. That consistency is especially strong because most of the code is generated from the data warehouse schema, along with configuration defined at the time of code generation. This leaves very little manual work and avoids the repetitive, time-consuming, and error-prone development of coding each ETL process by hand.
While I don’t have the luxury of using a framework on every project I work on, it is clear that you can save large amounts of time by not rebuilding and retesting these features in each new solution. There isn’t always budget to build all these features while implementing a project; having a framework that is tested and proven beforehand can dramatically speed up delivery and increase the reliability of the solution.
Before starting your next data warehouse project, strongly consider a solid ETL framework. It will be worth the time and money in the long run.