Implementing a Custom Change Data Capture System Part 1: Introduction

Bruce Hendry is a consultant in Slalom’s Technology Solutions group in Dallas, TX. Bruce is passionate about the Microsoft Business Intelligence stack and the .NET Framework and enjoys technology.

Coming up with a title proved to be one of the most difficult parts of beginning this post. "Implementing a Custom Change Data Capture System" doesn't really 'capture' what should be said. I recently found myself in a situation with a client where it was necessary to quickly implement update processes between a legacy transactional application and a new transactional system. "That's nothing new," you say, and you are right, except that it was late in the game; there was already an extract-transform-load (ETL) solution in place that was not architected to account for updates, because updates were never in scope…until that day; and (here was the nut of the challenge) nothing reliable existed in the legacy application to capture changed data.

Our team designed and implemented a viable solution. I decided to document it in a series of postings for a few different reasons:

Firstly, I think it is viable, and I want to share it in the hopes that it might help some folks down the road.

Secondly, I am sure it has lots of room for improvement, and I am curious to hear feedback, suggestions, etc. Trust me, you won't hurt my feelings. I don't claim to be an expert; I think good solutions are the end result of exactly this kind of process. This solution may get blown out of the water, proven irrelevant by the existence of a better solution, or evolve based on input from bright minds; any of those outcomes is okay with me.

Lastly, I need something to blog about. This would be way too long to fit into one post, hence the series. I'm not sure exactly how it will break down in terms of number of articles, but I will try to segment them logically. I will keep the Table of Contents updated so interested parties can jump directly to the content they deem relevant. This first post won't have much technical value; I want to describe what we were confronted with and how we came up with the solution. At this juncture, the series seems like it should go something like this, but I am sure it will play out differently.

  • Part 1: Introduction: Gives an overview of the problem and how the team arrived at the solution I’m about to document, including the requirements and limitations that influenced our thinking
  • Part 2: Implementation of the solution, diving in to the details
  • Part 3: Summary and Closing Comments

Problems and Solutions
Without further ado, let's lay the groundwork for this story. Please note that I have taken liberties with irrelevant details to protect client data and to simplify the telling of the solution.

We'll call the legacy application and its database Apollo (think beginning, starts with an A), and we'll call the new application and its database Challenger (think successor, C comes after A). We'll call the ETL solution…wait…ready for it…the ETL solution. The original plan (before it was known updates would be required) was for Challenger to come online in phases, with each phase migrating the entirety of an existing customer's data. The ETL solution had been built with the understanding that there would be no updates to Apollo's data for a given customer once that customer's data had been migrated. A customer's data would be cut over on a specific date, and users would only access Apollo for reference purposes after that date. The ETL solution dealt solely with a complete migration (per customer) and was concerned only with loading new data into Challenger's database. Updates had never been considered and were not accounted for whatsoever in the existing ETL processes.

The ETL solution leveraged reference tables in a reference database to maintain the linkage between Apollo and Challenger. There were no functional applications or reports built around this reference data; it was just there in case it was needed in the future, e.g. a user wanting to find an entity in Challenger known only by its Apollo business key. The structure of these reference tables, using a fictitious Car entity as an example, was as follows:

ReferenceCar Table

Column Name       Data Type
---------------   ---------
ReferenceCarID    int
ApolloCarID       int
ChallengerCarID   int

So data in the ReferenceCar table looked like this:

ReferenceCarID   ApolloCarID   ChallengerCarID
--------------   -----------   ---------------
100              9123          111456
101              9124          111457
102              9125          111458
103              9126          111459
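
To make the role of this reference data concrete, here is a minimal sketch of the kind of lookup it supports, using the fictitious tables above (the schema and object names are illustrative, not the client's actual objects):

    -- Translate an Apollo business key into its Challenger counterpart
    -- via the ReferenceCar bridge table described above.
    SELECT rc.ChallengerCarID
    FROM dbo.ReferenceCar AS rc
    WHERE rc.ApolloCarID = 9124;  -- returns 111457 for the sample data above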

Continuing on with this example, the Apollo Car table looked something like this:

CarID (Apollo)   Make       Model      Year   Color
--------------   --------   --------   ----   ------
9123             Datsun     B210       1974   Brown
9124             Acura      MDX        2004   Black
9125             Jeep       Wrangler   2000   Yellow
9126             Infiniti   Q56        2008   Black

For the sake of this article, let’s say the Challenger Car table was nearly identical to Apollo’s, with the exception of a DateLastUpdated column:

CarID (Challenger)   Make       Model      Year   Color    DateLastUpdated
------------------   --------   --------   ----   ------   -------------------
111456               Datsun     B210       1974   Brown    12/10/11 1:45:10 PM
111457               Acura      MDX        2004   Black    12/10/11 1:45:10 PM
111458               Jeep       Wrangler   2000   Yellow   12/10/11 1:45:10 PM
111459               Infiniti   Q56        2008   Black    12/10/11 1:45:10 PM

The important thing to note here is that the Apollo table was devoid of any column like DateLastUpdated or ModifiedDateTime that reliably indicated something in a row had changed. Not long after I joined the team, the mandate came down that users would still need the ability to edit data in Apollo even after a customer's data had been migrated, and these post-cutover edits in Apollo would also need to be propagated to Challenger. We were under the gun to come up with a solution quickly, so the ETL team gathered at the whiteboard to consider our plight and propose potential solutions. At the end of the meeting, we had noted some requirements, some limitations, and three proposed solutions.

Requirements/Assumptions

  • Updates to Apollo should flow to Challenger on a nightly basis.
  • Once an updateable entity had been edited in Challenger, subsequent updates to it in Apollo should be ignored.
  • Providing an audit trail of updates was essential.

Limitations

  • Apollo had no reliable means to tell when data changed.
  • Apollo was on SQL Server 2005, so the Change Data Capture (CDC) feature introduced in SQL Server 2008 was not an option.
  • Apollo was a heavily used production system, and any risk of operational disruption would not be well received by the client; in other words, changing the existing Apollo database schema was going to be a hard sell.

Proposed Solutions

  1. Save a snapshot of the Apollo database and use it for comparison on a nightly basis to determine changes to the Car table.
  2. Add a DateLastUpdated datetime column to the necessary tables and add triggers to these tables that would refresh the DateLastUpdated columns whenever changes were made to a row.
  3. Use CHECKSUM or HASHBYTES to determine if any changes were made to a row in the Car table.

Deciding on the Solution
We nixed the first option pretty quickly. Its advantage was that it would not require any changes to the Apollo system. Beyond that, it seemed clunky: the overhead involved in backing up and restoring snapshots, on top of having to write the comparison code to determine what changed, was discouraging.

The second option was my suggestion and the one I was most comfortable using. I had used this same method with SQL Server (prior to 2008) for many years. It has the advantage of being “update agnostic”, i.e. it isn’t dependent upon any external application code; it is all contained in the database. It doesn’t care if a column was updated by a web service, the primary operational application, or Joe User plunking around in SQL Server Management Studio (SSMS). The trigger handles refreshing the trusty DateLastUpdated column and can even be tuned to only respond to a subset of columns on the table. Unfortunately, I was used to having some influence over the source databases and their operational systems. In this case, I had none. The big, glaring disadvantage to this solution was the fact that it could impact the production system. Granted, adding a column on the end of a table and hanging a trigger off of it can be incredibly benign (if one knows what one is doing and is familiar with the applications accessing the database), but there are no guarantees. Who is to know if some devil-may-care developer has a SELECT * query bound to a component that might break with the new column, etc., etc.? We had to look hard at our third option.
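
For illustration only, here is a minimal sketch of what option 2 might have looked like against the fictitious Apollo Car table. We never implemented this, so the trigger name and details are assumptions, but it shows how self-contained the approach is:

    -- Option 2 sketch (not implemented): add a tracking column and a trigger.
    ALTER TABLE dbo.Car ADD DateLastUpdated DATETIME NULL;
    GO

    CREATE TRIGGER dbo.trg_Car_DateLastUpdated
    ON dbo.Car
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Refresh the tracking column only for the rows that were touched;
        -- the UPDATE() function could further limit this to specific columns.
        UPDATE c
        SET c.DateLastUpdated = GETDATE()
        FROM dbo.Car AS c
        INNER JOIN inserted AS i ON i.CarID = c.CarID;
    END
    GO

Because the trigger lives in the database itself, it captures edits from any source, which is exactly the "update agnostic" quality described above.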

I had never used HASHBYTES in my SQL programming. I was familiar with checksums from the application development world, but had not used them in SQL either. The proposed solution was that HASHBYTES or CHECKSUM could be used to monitor changes in Apollo: using one of these functions, a subset of data from Apollo could generate a value that is stored and then compared at the next interval, e.g. the next day's HASHBYTES or CHECKSUM value is compared with the previous day's value, and if a change is detected, Challenger is updated. This was new territory for me, and it took me a little while to get my head around the paradigm shift. We decided to go with option 3, build a proof of concept, and get approval from the stakeholders.
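
To give a flavor of that paradigm before the deep dive in the next post, here is a heavily simplified sketch of the idea using the fictitious Car table. The real implementation is more involved, and the history table and column choices shown here are assumptions:

    -- Step 1: compute a per-row hash over the Apollo columns we care about.
    -- Note: HASHBYTES accepts at most 8000 bytes of input on SQL Server 2005,
    -- so very wide rows may call for CHECKSUM or multiple hashes instead.
    SELECT
        c.CarID,
        HASHBYTES('SHA1',
            ISNULL(c.Make,  '') + '|' +
            ISNULL(c.Model, '') + '|' +
            CAST(c.[Year] AS VARCHAR(4)) + '|' +
            ISNULL(c.Color, '')) AS RowHash
    INTO #CarHashToday
    FROM dbo.Car AS c;

    -- Step 2: compare against the hashes saved on the previous run
    -- (dbo.CarHashPrevious is a hypothetical table holding yesterday's values).
    -- Any mismatch means the Apollo row changed and Challenger needs an update.
    SELECT t.CarID
    FROM #CarHashToday AS t
    INNER JOIN dbo.CarHashPrevious AS p ON p.CarID = t.CarID
    WHERE p.RowHash <> t.RowHash;

The appeal of this approach was that it required no changes at all to Apollo's schema; the stored hash values could live entirely outside Apollo, for example in the reference database.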

Coming Soon
In the next post, I will dive into how we implemented the solution, extending the examples above. I will provide sample code and a downloadable sample project that should contain everything needed to see it in action. If you have any questions or input in the meantime, please feel free to leave a comment or drop me a line at bruceh@slalom.com.
