Slowly Changing Dimensions in SSIS
One of the most critical pieces of any data warehouse is how you handle dimensions. Most dimension tables are modeled differently than fact tables because dimension records change more slowly than fact records. During a daily load, you may only have a single column that changes on one dimension record, but you will undoubtedly have to insert many new records into the fact tables.
Due to the slowly changing nature of the data in a dimension table, we handle the processing of these tables quite differently. The basic process is to compare the new incoming data with the existing data, update only the records that actually changed, and insert only the new records that don’t appear in the existing set. The process sounds simple enough, and any good ETL tool should be have components in place that help the developer implement that process as painlessly as possible, but alas, this is not the case.
Issues with the SCD Transform
My favorite ETL tool is SSIS (SQL Server Integration Services), but unfortunately, the SCD transform is SSIS’s least attractive feature. The SCD transform is supposed to be the component an SSIS developer would use to handle the slowly changing dimension process, but it has pretty poor performance and cannot be changed after you set it up in a package.
For the most part, I don’t worry about performance when it comes to dimensions because those tables tend to be rather small (less than 10,000 records). But given the processing power we have at our disposal today, updating that table shouldn’t take any more than a few seconds. The complexity to an SCD transform is where it has to determine if the same record exists in both the new data and the existing data. To do this, the component needs to compare the business keys (not surrogate keys). That comparison can take up quite a few cycles if the business keys have character values and if there are several columns to compare. The SCD component in SSIS doesn’t seem to provide any sort of efficiency mechanisms for speeding up this part of the process other than sorting your inputs and adding indexes on the two tables.
Making a change to a dimension table in a data warehouse is inevitable. Business requirements changes, source systems change, and so on. Any time you have a change to a table, you will have to modify the ETL that updates that table. If you use the SCD Component in SSIS, you are going to have to clear out the entire data flow and start over from scratch. In my opinion, this is unacceptable. This is the real reason that I started looking for alternatives.
Alternatives to Handling a Slowly Changing Dimension
SQL Merge Statement
Personally, I think the best alternative is the SQL Merge statement. Introduced with SQL Server 2008, this new statement allows you to combine an update and insert in the same statement. The downside to using this approach is that it uses the database server to process the join between the two tables. The upside is that, once you get the hang of the syntax, you can write a merge statement very quickly. Any ETL tool would be hard-pressed to perform a merge faster than the actual database can.
If you are a particularly savvy developer, there are ways that you can automate these merge statements. You can look in the INFORMATION_SCHEMA views to get column names and data types, or you could use a wizard to help you walk through the steps of creating such a statement.
The benefit of using SSIS as your ETL tool is that you can offload much of the hardware requirements of ETL onto a server other than the database server. If you want to continue taking advantage of that feature, you can use the Lookup component in SSIS to perform the Business Key join between the new and existing dimension tables instead of the SCD component. Lookups are fast and can be optimized by using a lookup cache. If you’re just dealing with a Type 1 SCD table, then you only need two lookups. The first lookup checks the business key and directs records to be inserted (business key not found) or updated (business key found). Then, you would need another lookup before the update to check the columns on the dimensions to be updated to see if any records actually changed. You can use lookups to handle Type 2 SCDs as well, but you would need a third lookup.
SSIS Dimension Merge SCD Component
Another great alternative is to use the SSIS Dimension Merge SCD Component which can be found out on CodePlex. It is a third-party component that you will have to install, and the free version is no longer supported, but it works very well. Since it is no longer supported, the original developer is not pushing out a version for 2012, but you can find versions that work with SSIS 2012 by other developers. The component was bought by Pragmatic Works, and they have a paid version that is a bit more dependable than the original and also works with SSIS 2012.
I have used this Dimension Merge SCD Component in several data warehouses now, and I really like it. It is very flexible, handling both SCD Type 1 and Type 2 easily. It uses a very fast hashing algorithm to handle the join between the two tables and also has support for a special member input (which I haven’t used but will in my next data warehouse). If your table schema changes, you simply open up the component and refresh the metadata.
If you would like to see exactly how to use this component, check out this video taken from the CodePlex site.
There seems to be a lot of talk in the world of Business Intelligence development about “best practices”. But in the end, there is not one “best” way to do things. Instead, each method must be examined and scrutinized by its advantages and disadvantages. You should select the method that best fits your particular situation. If you have some room in the budget, I would recommend buying the Dimension Merge SCD component from Pragmatic Works; if you’re low on hardware, I would recommend using the lookup method that we were stuck with before SSIS 2005 (lookups). But if you want a solution that will always work and always perform well, I would recommend the SQL Merge statement.
Thanks for reading. Please email me if you have any questions, and look for my next post.
Brett Neuman has been developing business intelligence strategies for more than eight years at several fortune 500 companies. He has implemented data warehouse solutions and analytics strategies for many verticals including Health Care, Finance, Retail and Marketing. He has a strong background in application development, data warehouse architecture, data modeling, ETL, OLAP, reporting and dashboarding utilizing both the Business Objects and the Microsoft suite of Business Intelligence platforms. Brett joined StatSlice for the opportunity to empower organizations to make more effective business decisions through the availability of accurate, mission-critical information.
You can subscribe to our RSS feed.