Implementing a Custom Change Data Capture System—Part Two

Bruce Hendry is a consultant in Slalom’s Technology Solutions group in Dallas, TX. Bruce is passionate about the Microsoft Business Intelligence stack and the .NET Framework and enjoys technology.

Introduction

This post continues from my first post, Implementing a Custom Change Data Capture System. It may be helpful to review Part One before diving into Part Two.

Part One described how our team was challenged with late arriving requirement changes for an existing Extract Transform Load (ETL) solution. The new requirements dictated that changes in the source transactional system propagate to the destination transactional system. The source system had no inherent reliable means of tracking changes, and altering the source system was not an option. Part One covered the team’s proposed solutions and briefly described the one we selected, using HASHBYTES as a means to monitor and respond to changes in the source system.

Part Two will conclude the series and provide the details of the solution our team implemented. The end of this post contains a link to downloadable content that can be used to demonstrate the custom change data capture system and follow along.

Implementation

Time was of the essence, so we quickly set forth on our proof of concept. We discovered in short order that CHECKSUM was unlikely to work for us because of some known issues. Those issues are beyond the scope of this article, but as of this writing the comments on the link above touch on them.

We were a little discouraged, because CHECKSUM has the advantage of returning a value for all columns of a given table easily. Fortunately, concatenating table columns for HASHBYTES did not prove too cumbersome, and it allowed us to monitor only a subset of table columns for changes.

At a high-level, here is the solution I am about to describe in detail (I lifted and modified this nifty flow diagram out of PowerPoint from my esteemed colleague, Frederick Chun):

Update Process

  1. Compare Last Hash Value: Apollo’s Vehicle entity will be monitored for changes to the data by comparing the last HASHBYTES value.
  2. Stage Updates: Any changed data will be staged in an interim table for processing.
  3. Exclude Challenger Updated Items: If the Vehicle entity has been edited by a user directly in the Challenger system since the last Apollo update, it will be excluded from the current Apollo update and all future Apollo updates. This was a business requirement that basically said: if a user updates a particular entity in Challenger, e.g. CarID 123XYZ, then the user has indicated that CarID 123XYZ should no longer receive updates from Apollo, so the changes made in Challenger are not overwritten.
  4. Update Challenger: Challenger data will be overwritten with the Apollo changes that have occurred since the last update from Apollo.
  5. Archive Update History: An audit trail of all updates will be persisted into an archive table.

What I liked most about this approach is that we had an existing, unalterable legacy system with no reliable means of tracking changes to its data, yet with the addition of two columns to a table in another database (SourceHashValue, SourceHashDateTime) we were able to implement a full-fledged, reliable change data capture system. The legacy system needed no awareness of that other database.

Recall the ReferenceCar table from Part One. This table lived in the ETL team's reference database and served a single purpose: it provided the surrogate key mapping between the cars in Apollo and the cars in Challenger.

ReferenceCar Table

The ReferenceCar table was augmented with the following columns:

  1. SourceHashValue varbinary(8000) – This persisted the HASHBYTES hash value for the Apollo Car data from one update to the next.
  2. SourceHashDateTime datetime – This stored the timestamp of the last update from Apollo to Challenger, i.e. the timestamp of when the hash value in SourceHashValue was generated.

Updated ReferenceCar Table
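To make that concrete, here is a rough sketch of the kind of ALTER TABLE statement involved. This is not the DDL from the demo script, just an illustration using the column definitions above:

-- Illustrative only: add the change-tracking columns to the existing mapping table.
-- The real reference.ReferenceCar definition ships in 1_HashbytesDemoDatabaseSetup.sql.
ALTER TABLE reference.ReferenceCar
    ADD SourceHashValue varbinary(8000) NULL,
        SourceHashDateTime datetime NULL;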

And let’s take a look at the source Car table from Apollo and the destination Car table from Challenger.

Apollo Car Table (Source)

Challenger Car Table (Destination)

The 3 tables above provide the framework to determine if something changed in the source system that should be propagated to the destination system. When the Challenger Car table is initially loaded with data from the Apollo Car table using SQL Server Integration Services (SSIS), HASHBYTES is used to return a hash of the columns that will be tracked for changes, in this example, Make, Model, Year and Color.
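For the sake of illustration, a query along these lines produces that hash. The algorithm ('SHA1'), the '|' delimiter, and the varchar cast of Year are my choices here, not necessarily those used in the demo package:

-- Illustrative sketch: hash the tracked columns for each Apollo car.
-- HASHBYTES takes a single string, so the tracked columns are concatenated;
-- the delimiter keeps different value combinations from colliding.
SELECT
    c.CarID,
    HASHBYTES('SHA1',
        ISNULL(c.Make, '')  + '|' +
        ISNULL(c.Model, '') + '|' +
        ISNULL(CAST(c.[Year] AS varchar(4)), '') + '|' +
        ISNULL(c.Color, '')) AS SourceHashValue
FROM apollo.Car AS c;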

This hash is stored in the reference.ReferenceCar table's SourceHashValue column, along with the time stamp of when it was taken. The ReferenceCar table also contains the mapping between Apollo and Challenger, Apollo's CarID and Challenger's CarID respectively (review the Updated ReferenceCar Table above). Once the hash is stored with a time stamp, the next time the SSIS process runs the hash value can be used to determine if anything changed. The time stamp can be compared against Challenger's LastModifiedDateTime to filter out items that were updated by users in Challenger in the interim and to permanently flag those items as "not updateable". Lastly, a record of everything that happened is stored in an archive table.

Recall our requirements from Part One:

  • Updates to Apollo should flow to Challenger on a nightly basis
  • Once an updateable entity had been edited in Challenger, updates from Apollo should be ignored
  • Providing an audit trail of updates was essential

The process described above handles these requirements. In the next section we will walk through the process from start to finish.

Walkthrough

Everything necessary to complete the walkthrough can be downloaded from this post. If you want to follow along, you will need the following SQL scripts (tested on SQL Server 2008 R2):

  • 1_HashbytesDemoDatabaseSetup.sql
  • 2_HashbytesDemoDataUpdates.sql
  • 3_HashbytesDemoReviewUpdateResults.sql

and the SSIS project (compatible with Visual Studio/BIDS 2008) that contains the following files:

  • HashbytesDemo.sln
  • HashbytesDemo.dtproj
  • HashbytesDemo.dtsx

Download these items and unzip the SSIS solution, but don't open it yet; we'll need to set up the database first to avoid validation issues when opening the SSIS project.

Database Setup

Open and run the 1_HashbytesDemoDatabaseSetup.sql script on the server where you plan to run the demo. This will:

  1. Create the HashbytesDemo database
  2. Create schemas (to represent different databases for simplicity)
    1. apollo
    2. challenger
    3. reference
  3. Create tables
    1. apollo.Car
    2. challenger.Car
    3. reference.ReferenceCar
    4. reference.StageCar
    5. reference.StageCarArchive
  4. Add 100 rows of test data to apollo.Car
  5. Create the stored procedure reference.uspApolloCarUpdatedSource

After you have run the script, the database should look like this in SQL Server Management Studio’s (SSMS) Object Explorer:

The apollo.Car table should have 100 rows of data. The other tables are empty, for now. Feel free to borrow the following SQL to view the data in the tables you just created.

use HashbytesDemo
go
select * from apollo.Car
select * from challenger.Car
select * from reference.ReferenceCar
select * from reference.StageCar
select * from reference.StageCarArchive

Video Example

Note that the IDENTITY columns in apollo.Car and challenger.Car are seeded with different values to allow for easier visual identification of the mapping between the 2 systems in reference.ReferenceCar. This is strictly to aid in the demonstration.

Project Setup

The SSIS solution assumes that the HashbytesDemo database is installed on your local instance of SQL Server with the Server Name localhost. If your setup is different, you will need to update the localhost.HashbytesDemo Connection Manager in the project.

Once you are up and running with the project open, you should be looking at a screen that resembles the following:

Initial Load

Assuming your Connection Manager is set up correctly, you should be able to run the SSIS project by right-clicking the package in Solution Explorer and choosing Execute Package.

You should see the 100 rows in apollo.Car flow through to challenger.Car. This is the initial migration and all the car data in Apollo is synchronized with the car data in Challenger.

Video Example

Running the sql to view the data in apollo.Car, challenger.Car and reference.ReferenceCar should verify the data is synchronized between both systems.

Video Example

Emulating Updates

The second SQL script, 2_HashbytesDemoDataUpdates.sql, emulates user updates in both the source and the target. Updates in the source (Apollo) should flow through to the destination (Challenger), unless Challenger was updated in the interim by a user or some other process.

Open and run the 2_HashbytesDemoDataUpdates.sql script on the server where you are running the demo. This will:

  1. Update a subset of data in apollo.Car to emulate user data changes in the source
  2. Update a subset of data in challenger.Car to emulate conflicting user updates in the destination
  3. Display Apollo and Challenger Car data side by side to allow you to compare the data before running the SSIS process

Video Example

Looking at the data, we see the random data elements that were updated. Notice the rows whose ChallengerMake was updated; they have more recent time stamps in ChallengerLastUpdatedTime. These are of interest because they are the ones that will be flagged as "not updateable". We will see all the other Apollo data updates make their way through to Challenger.

Running the Update Process

We are going to execute the HashbytesDemo.dtsx package again. This time, we will not see any new data loaded, just updates being propagated from Apollo to Challenger.

Right-click the HashbytesDemo.dtsx package and choose Execute Package. After the package has completed, take a moment to examine the row counts as they moved through the data flow tasks (DFTs).

Video Example

The Initial Apollo to Challenger Car Data Load DFT shows that of the 100 rows found in the Apollo Car Source, none made it through to the Challenger destination. This is correct, as there was no new data.

The Load Reference Data DFT shows the same, as it should always match its predecessor.

The next 2 DFTs are the ones to watch, as they are responsible for correctly propagating the updates. Let's look at Stage Pending Apollo Car Updates first.

Notice the flow of data through the tasks. The Apollo Car Updated Source correctly captured 20 updated rows. (Please note that the demo may not always generate exactly 20 rows.) You can review the stored procedure reference.uspApolloCarUpdatedSource if you want to know more about how it works. In a nutshell, it compares reference.ReferenceCar.SourceHashValue with the current hash created by running HASHBYTES against the data in Apollo to determine whether the data has changed.
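The actual procedure ships with the download; a simplified sketch of the comparison it performs might look like the following. The ApolloCarID mapping column name and the hash expression are assumptions on my part:

-- Simplified sketch of the change detection, not the actual procedure body:
-- a car qualifies as updated when its current hash differs from the last stored hash.
SELECT a.CarID
FROM apollo.Car AS a
JOIN reference.ReferenceCar AS r
    ON r.ApolloCarID = a.CarID          -- assumed mapping column name
WHERE r.SourceHashValue <>
      HASHBYTES('SHA1',
          ISNULL(a.Make, '')  + '|' +
          ISNULL(a.Model, '') + '|' +
          ISNULL(CAST(a.[Year] AS varchar(4)), '') + '|' +
          ISNULL(a.Color, ''));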

The Lookup Challenger Car Data task looks up the corresponding data in Challenger to use for comparing the LastModifiedDateTime and for storing the “old” (prior to update) values in the audit trail.

The Challenger Updates Exist Conditional Split divides the data into 2 paths, updateable and not updateable. You can see that 15 rows were identified as updateable, i.e. they had no conflicting updates in Challenger. 5 rows were flagged as "not updateable", i.e. their Challenger LastModifiedDateTime values were more recent than the reference.ReferenceCar.SourceHashDateTime. This indicates that a user (or some other process) updated the same Car in Challenger since it was last updated with Apollo's data. Because of the requirements, these CarIDs will be flagged as permanently "not updateable" with changes from Apollo. If you look at the end of the reference.uspApolloCarUpdatedSource stored procedure, you will see where it checks the reference.StageCarArchive table for permanently flagged items.
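Expressed as T-SQL rather than an SSIS expression, the rule behind that split is roughly the following. The ChallengerCarID mapping column name is an assumption; the rest comes from the tables described above:

-- Sketch of the classification the Conditional Split applies to each staged row:
-- "not updateable" (0) when Challenger was modified after the stored hash was taken.
SELECT c.CarID AS ChallengerCarID,
       CASE WHEN c.LastModifiedDateTime > r.SourceHashDateTime
            THEN 0 ELSE 1
       END AS ChallengerUpdateable
FROM reference.ReferenceCar AS r
JOIN challenger.Car AS c
    ON c.CarID = r.ChallengerCarID;     -- assumed mapping column name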

The Challenger Updateable Derived Columns and Challenger Not Updateable Derived Columns tasks create 2 new derived columns, one for the ChallengerUpdateable flag, the other for a text description to make the audit table, reference.StageCarArchive, more readable.

Lastly, the Update Staging Destination and Update Staging Destination (Not Updateable) tasks simply insert the data into the reference.StageCar table. 15 items are updateable, 5 are not.

Next, let’s look at the Update Challenger Cars from Stage DFT.

It pulls the staged data directly from reference.StageCar and correctly finds 20 rows. The following Updateable Conditional Split simply excludes the “not updateable” rows. The Data Conversion task is not in use and can be ignored. If there were data conversions required for the destination, that is where they would go. The next 3 OLE DB Command tasks do the work.

Update Challenger Car Command performs the update to the challenger.Car table. Update StageCar ChallengerUpdateDateTime updates the StageCar table with the time stamp of the updates to Challenger. Update ReferenceCar SourceHashValue updates the reference.ReferenceCar table's SourceHashValue with the new HASHBYTES hash so that the reference.ReferenceCar table is in sync with the current value in the Apollo system. (To give context, this hash value will be used as the basis of comparison the next time the process runs. If relevant data is updated in apollo.Car in the meantime, the hash in reference.ReferenceCar won't match and the item will be selected for updates.)
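For reference, the statements behind those three tasks have roughly the shape below. In the package they are parameterized OLE DB commands with ? placeholders; here they are written as plain T-SQL with variables, and the ApolloCarID column names and data types are my assumptions:

-- Illustrative only: the shape of the work the three OLE DB Command tasks perform per row.
DECLARE @ApolloCarID int, @ChallengerCarID int,
        @Make varchar(50), @Model varchar(50), @Year int, @Color varchar(50),
        @NewHash varbinary(8000);

-- Update Challenger Car Command: push the Apollo values into challenger.Car
UPDATE challenger.Car
SET Make = @Make, Model = @Model, [Year] = @Year, Color = @Color
WHERE CarID = @ChallengerCarID;

-- Update StageCar ChallengerUpdateDateTime: record when Challenger was updated
UPDATE reference.StageCar
SET ChallengerUpdateDateTime = GETDATE()
WHERE ApolloCarID = @ApolloCarID;       -- assumed key column

-- Update ReferenceCar SourceHashValue: keep the stored hash in sync with Apollo
UPDATE reference.ReferenceCar
SET SourceHashValue = @NewHash, SourceHashDateTime = GETDATE()
WHERE ApolloCarID = @ApolloCarID;       -- assumed key column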

The last 2 tasks in the Control Flow, Archive Staging Data and Truncate Reference StageCar, simply copy the data from reference.StageCar and insert it "as is" into reference.StageCarArchive; StageCar is then truncated, since the data is preserved in StageCarArchive. The archive table serves 2 purposes: it provides an audit trail, and it persists the CarIDs in Challenger that are permanently "not updateable". Recall the last bit of our WHERE clause on the Apollo Car Updated Source, which enforces that permanent exclusion.
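That clause lives in reference.uspApolloCarUpdatedSource; conceptually it filters out any car that has ever been archived as "not updateable", along the lines of this sketch (the column names are assumptions):

-- Sketch of the permanent exclusion: skip any car ever archived as "not updateable".
SELECT a.CarID
FROM apollo.Car AS a
WHERE NOT EXISTS (
    SELECT 1
    FROM reference.StageCarArchive AS arc
    WHERE arc.ApolloCarID = a.CarID         -- assumed mapping column name
      AND arc.ChallengerUpdateable = 0      -- flagged as permanently "not updateable"
);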

Take a moment to run the 3_HashbytesDemoReviewUpdateResults.sql script to review the results in reference.StageCarArchive. (The screenshot below is clipped width-wise for display purposes.)

Notice the data with NULL ChallengerUpdateDateTime values. These are the ones that were flagged as “not updateable”. The side by side columns display the current value as it exists in challenger.Car, the old value (its value in Apollo and Challenger before the user updated the data in Apollo), and the new value (its value in Apollo AFTER it was updated by the user). Keep in mind the new value will not have flowed through to the current value if it was not updateable.

Summary

This concludes the series. I decided to knock it all out in 2 parts since it took me much longer than I anticipated to get back to Part 2 and finish it up. The demo can easily be restarted by simply running the 1_HashbytesDemoDatabaseSetup.sql script again. At a high level, the steps are these:

  1. Run 1_HashbytesDemoDatabaseSetup.sql
  2. Execute the HashbytesDemo.dtsx package (this will complete the initial load)
  3. Run 2_HashbytesDemoDataUpdates.sql
  4. Review the pending updates using the output from 2_HashbytesDemoDataUpdates.sql
  5. Execute the HashbytesDemo.dtsx package (this will process the updates)
  6. Run 3_HashbytesDemoReviewUpdateResults.sql to review the results

This is a demo project and does not represent production code. Shortcuts were taken for expediency. For example, the OLE DB Command objects in the Update Challenger Cars From Stage DFT are parameterized in such a way that the data is processed row by row; a set-based approach would yield much better performance.

As the saying goes, hindsight is 20/20. I am fortunate to work with an amazing and talented team at Slalom and on my current project, and since I started this series on change data capture I have learned a few things from my colleagues about cleaner, simpler ways to solve the problem outlined here, most notably by leveraging the MERGE statement in tandem with HASHBYTES. While MERGE is a more T-SQL-intensive solution, the approach in this post relies more heavily on the built-in SSIS components and therefore has merit in that regard. At the least, I hope it provides a useful conceptual reference when solving similar problems.
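To give a flavor of that alternative, here is a rough, illustrative sketch of MERGE driven by a HASHBYTES comparison. It is not code from the project, and the mapping column names and hash expression are assumptions:

-- Rough sketch of the MERGE + HASHBYTES alternative (illustrative only).
MERGE challenger.Car AS tgt
USING (
    SELECT r.ChallengerCarID,               -- assumed mapping column name
           a.Make, a.Model, a.[Year], a.Color,
           r.SourceHashValue,
           HASHBYTES('SHA1',
               ISNULL(a.Make, '')  + '|' +
               ISNULL(a.Model, '') + '|' +
               ISNULL(CAST(a.[Year] AS varchar(4)), '') + '|' +
               ISNULL(a.Color, '')) AS NewHash
    FROM apollo.Car AS a
    JOIN reference.ReferenceCar AS r
        ON r.ApolloCarID = a.CarID          -- assumed mapping column name
) AS src
    ON tgt.CarID = src.ChallengerCarID
WHEN MATCHED AND src.NewHash <> src.SourceHashValue THEN
    UPDATE SET Make = src.Make, Model = src.Model,
               [Year] = src.[Year], Color = src.Color;

The "not updateable" exclusion and the hash bookkeeping in reference.ReferenceCar would still need to be layered on top of a statement like this.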

Feel free to follow-up with questions or comments.

Download

This zip file contains the sql scripts and SSIS files necessary to run the demo described in this post. Unzip and open the SSIS solution file. Open the sql scripts in SSMS and run when indicated in the walkthrough.
