Building slowly changing dimensions from salesforce.com data

An interesting problem arose recently. A colleague was trying to build a type 2 slowly changing dimension (SCD2) using data from the case history object in salesforce.com.

The primary problem is that the case history is an audit trail for selected case object fields, so it does not record the full state of the case record at each transaction. The first entry in the case history for any new record shows only that the record was created; there is no further metadata about the state of the record when it was inserted. The only way to calculate the initial state is to start from the current state of the record and replay the case history records in reverse.
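
As a rough illustration, here is a minimal Python sketch of that reverse replay. It assumes the current record and its case history rows have already been extracted into plain dicts, that the history rows carry the Field, OldValue and CreatedDate attributes the API exposes, and that the creation entry appears with Field set to 'created'; the function name is mine.

```python
def initial_state(current_state, history_rows):
    """Rewind a case to the state it had when it was created.

    current_state: dict of field name -> current value for one case.
    history_rows: that case's CaseHistory rows, each a dict carrying
    at least Field, OldValue and CreatedDate.
    """
    state = dict(current_state)
    # Walk the audit trail newest-first; each row's OldValue rewinds
    # one tracked field by one step.
    for row in sorted(history_rows, key=lambda r: r['CreatedDate'], reverse=True):
        if row['Field'] == 'created':
            continue  # the bare "record was created" marker carries no field values
        state[row['Field']] = row['OldValue']
    return state
```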

Usually SCD2 is used when taking regular snapshots of a source dataset, and the record start and end timestamps are derived from the execution time of the extraction process. In this case the need was to drive the record timestamps from the creation date-time fields of the case history records, so that users of the data could see every change in the correct order. This had the effect of setting the grain of the time dimension to the second, because that was the finest resolution of the timestamps on the case history data.
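
To make the timestamp handling concrete, here is a minimal sketch of that interval chaining, assuming the case history creation date-times have already been parsed into Python datetime values; the names and the far-future sentinel are illustrative conventions, not anything prescribed by the source data.

```python
from datetime import datetime

FAR_FUTURE = datetime(9999, 12, 31)  # conventional sentinel for the open version

def chain_versions(change_times):
    """Chain change times into SCD2 validity intervals.

    change_times: datetimes of the transactions seen for one record.
    Returns (valid_from, valid_to) pairs, one per version, with each
    version ending exactly where the next begins and the current
    version left open-ended.
    """
    # Truncate to the second, the finest grain the source supports, and
    # de-duplicate: one transaction can emit several history rows.
    times = sorted({t.replace(microsecond=0) for t in change_times})
    return list(zip(times, times[1:] + [FAR_FUTURE]))
```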

That would not be easy to build, though, because the ETL process would have to create SCD2 records for each case history record, using the creation date-time of each case history row as both the end timestamp of one SCD2 record and the start timestamp of the next. These records would be bookended by the last seen version of the case record and the most recent version. In effect the process would have to use a normal incremental load pattern to copy over the current end state of each record as a new SCD2 record, and then back-fill between this and the previous version all the intervening transactions found in the case history.
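
Here is a sketch of that pattern for a single case, under the same assumptions about the shape of the history rows as above. Everything in it, from the valid_from/valid_to naming to the sentinel end date, is illustrative rather than a finished implementation:

```python
FAR_FUTURE = '9999-12-31T23:59:59'  # sentinel for the open-ended version

def backfill_case(last_version, history_rows, current_record):
    """Build the new SCD2 rows for one case in one incremental load.

    last_version: the field values of the open dimension row from the
    previous run, or None if the case has not been seen before.
    history_rows: CaseHistory rows created since the previous run, as
    dicts with Field, NewValue and CreatedDate keys.
    current_record: the case record as extracted now.

    Returns dimension rows oldest-first; the caller closes the
    previously open row at the first new valid_from.
    """
    state = dict(last_version or {})
    # One transaction can change several fields, producing several
    # history rows with the same CreatedDate; group them so that each
    # transaction yields exactly one new version.
    grouped = {}
    for row in history_rows:
        if row['Field'] != 'created':  # skip the bare creation marker
            grouped.setdefault(row['CreatedDate'], []).append(row)
    versions = []
    for ts in sorted(grouped):  # ISO timestamps sort lexically
        for row in grouped[ts]:
            state[row['Field']] = row['NewValue']
        if versions:
            versions[-1]['valid_to'] = ts  # close the previous version
        versions.append(dict(state, valid_from=ts, valid_to=FAR_FUTURE))
    if versions:
        # The newest version is the record's current end state; copying
        # the full record over also fills fields the history never tracked.
        versions[-1].update(current_record)
    return versions
```

A brand-new case whose history contains only the creation marker produces no rows here; that path is covered by the normal incremental copy of the current end state.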

We haven’t attempted to implement it this way yet. It would be preferable if we had some way of tracking changes to entire records in salesforce.com. That would probably require an Apex trigger to copy the entire row into a staging object within salesforce.com before our ETL process could extract the data. The obvious consequences of such an approach, though, would be a massive increase in the data storage required by the app and the performance cost of running a trigger that copies entire rows on every transaction. There may be other problems with this approach too.

The recent addition of Lightning components to salesforce.com might offer an alternative pattern for this and will be something to investigate further. The Streaming API might be another avenue worth exploring.
