Thinking About Data

“Though the mills of God grind slowly, yet they grind exceeding small…” Henry Wadsworth Longfellow translated this famous line from a German poem written by Friedrich von Logau in 1654.  Imagine if Longfellow had worked as a data architect in today’s Information Technology industry.  Perhaps he would have written this now famous line as follows: “Though the databases of God grind slowly, yet they grind exceedingly small.”

This is often how I feel when I begin investigating a database to diagnose performance problems, or to start documenting the schema and constructing ETL processes to populate a reporting database or data warehouse.  As data modelers, data architects and database developers (all of whom I will collectively refer to as database people for the remainder of this article), we are taught to think about data relationally and dimensionally. Relationally, we normalize data for production OLTP databases, organizing it in such a way as to minimize redundancy and dependency.

Dimensionally, we design data warehouses to have facts and dimensions that are easily understandable and provide valuable decision support to various business entities.  High-quality, reliable data that is easy to query and consume is the goal of these design patterns.  During my twenty-year career, however, I have discovered that well-designed database schemas and data models are the rare exception, not the rule.  And there are a couple of common themes underpinning the bad designs I encounter.

Poorly designed databases are often the result of designers who, instead of thinking about how data will flow through their databases and, more importantly, which people and internal business entities will want to consume that data, simply view databases as a convenient resting place for data.  This erroneous view frequently stems from a lack of formal training in database normalization, relational algebra, dimensional modeling and data object modeling: skills that I believe are essential for anyone serious about enterprise-level database design.  The database schema is foundational to almost every business system, so it is imperative to involve skilled database people early in the design process.  Failure to do this may result in flawed database schemas which will suffer from one or more of the following issues:

  • Lack of extensibility
  • Difficulty of maintenance
  • Lack of scalability
  • Difficulty of querying
  • A high degree of data anomalies that render the data unreliable
  • Performance problems

Having worked with many strange database designs (I could probably write an entire book on the subject), I want to briefly mention some of the more commonly encountered database design errors.  I will classify these errors into two general groups: design errors in production OLTP databases, and design errors in databases intended for reporting.

Production OLTP databases

  1. A completely flattened, un-normalized schema. Databases designed this way will have all sorts of issues when placed under production load: performance and scalability problems, for example. I often hear this line from developers, “Well it worked fine in QA with a small amount of data, but as soon as we deployed to production and threw serious traffic at it, all sorts of performance problems emerged.”  Flattened schemas like this frequently lack declarative referential integrity, which leads to data anomalies. Getting reliable reporting from this environment is difficult at best.
  2. A highly normalized schema, possibly over-normalized for the context, but lacking a sufficient, useful payload.  Every part of the data model has been taken to its furthest Normal Form and data integrity is well enforced through Primary Key/Foreign Key relationships. But getting useful data requires both excessive table joins and deriving or calculating needed values at query time. For example, I worked on the invoicing portion of an application (the invoicing code was written in Perl) in which customer product usage was calculated during the invoicing process.  The final usage totals were printed on the paper invoices, but not stored anywhere in the database.  Finance needed this information, both current and twelve months of history, and to get it I had to recreate large parts of the invoicing process in SQL: a herculean task.  When I asked the developers why customer usage invoicing totals were not stored in the database, they responded, “I guess we never thought anyone would want that data.”

Reporting databases intended to serve internal business entities

  1. Attempts to build internal reporting functions directly against production OLTP databases.  A discussion of OLTP optimization techniques vs. OLAP optimization techniques is beyond the scope of this article. But suffice it to say that attempts to run I/O intensive reporting against a production OLTP system which is not optimized for reporting will cause tremendous performance problems.
  2. Building out a secondary physical database server and placing an exact copy of a production OLTP database on it as a “reporting instance” database.  Doing this will certainly remove any chance of overwhelming the production server.  But it will not provide a database optimized for reporting.
  3. Adding a few somewhat haphazard aggregation tables to the “reporting instance” database mentioned above.  This may temporarily reduce query times for reports relying on aggregated data, but it is not a long-term substitute for a properly designed dimensional reporting model.

Data models are often given short shrift because the original developers, being inexperienced with relational and dimensional databases, do not think correctly about data. This flawed thinking about data frequently results in poor database designs which may perform badly and contain unreliable data that is difficult to query.  I want to leave you with a specific example of this point by briefly relating a recent client experience of mine.

My client at the time had recently purchased a start-up company whose product was a complex ad serving application/engine.  The SQL Server databases foundational to the application were suffering severe performance problems which rippled through the entire system and resulted in a less-than-stellar customer experience.

At my client’s behest I executed a major system review and quickly ascertained two primary issues: a data architecture that limited scalability, and incorrect table indexing that was a direct result of that architecture. I kept their developers and program manager involved in my solution development process, and after a successful deployment that solved their performance issues, the program manager made a key statement to me. She said, “Wow, I never understood the value and importance that a person with strong data architecture and DBA skills could bring to a project like this.  You can be certain that on all future projects of this magnitude I will insist on, and budget for, a person with your skillset to be involved at the outset to ensure we avoid these types of database issues.”

Every IT, Program and Project Manager would do well to heed her advice.  Consider spending some time with your recruiting department to find an experienced data architect with a successful track record at the enterprise level.  It will be time and money well-spent.

Implementing a Custom Change Data Capture System—Part Two

Bruce Hendry is a consultant in Slalom’s Technology Solutions group in Dallas, TX. Bruce is passionate about the Microsoft Business Intelligence stack and the .NET Framework and enjoys technology.


This posting continues from my first blog post, Implementing a Custom Change Data Capture System. It may be helpful to review Part One before diving into Part Two.

Part One described how our team was challenged with late arriving requirement changes for an existing Extract Transform Load (ETL) solution. The new requirements dictated that changes in the source transactional system propagate to the destination transactional system. The source system had no inherent reliable means of tracking changes, and altering the source system was not an option. Part One covered the team’s proposed solutions and briefly described the one we selected, using HASHBYTES as a means to monitor and respond to changes in the source system.

Part Two will conclude the series and provide the details of the solution our team implemented. The end of this post contains a link to downloadable content that can be used to demonstrate the custom change data capture system and follow along.


Time was of the essence, so we quickly set forth on our proof of concept. We discovered in short order that CHECKSUM was not likely to work for us, as it had some known issues. The issues are beyond the scope of this article, but as of this writing the comments on the link above touch on them.

We were a little discouraged, because CHECKSUM has the advantage of easily returning a value for all columns of a given table. Fortunately, concatenating table columns for HASHBYTES did not prove too cumbersome, and it allowed us to monitor only a subset of table columns for changes.
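To make the trade-off concrete, here is a hedged sketch of the two approaches against the demo's apollo.Car table (the Make, Model, Year and Color columns come from the example later in this post; the '|' delimiter and ISNULL handling are my own conventions, not the shipped code):

```sql
-- CHECKSUM is effortless (it can take every column via *), but its 32-bit
-- result collides easily, so changes can go undetected.
SELECT CarID, CHECKSUM(*) AS RowChecksum
FROM apollo.Car;

-- HASHBYTES needs explicit concatenation, but that lets us hash only the
-- columns we actually monitor, and SHA1 is far less collision-prone.
SELECT CarID,
       HASHBYTES('SHA1',
           ISNULL(Make, '') + '|' + ISNULL(Model, '') + '|' +
           ISNULL(CAST([Year] AS varchar(4)), '') + '|' +
           ISNULL(Color, '')) AS RowHash
FROM apollo.Car;
```

The delimiter matters: without one, ('ab', 'c') and ('a', 'bc') would concatenate to the same string and produce the same hash.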

At a high-level, here is the solution I am about to describe in detail (I lifted and modified this nifty flow diagram out of PowerPoint from my esteemed colleague, Frederick Chun):

Update Process

  1. Compare Last Hash Value: Apollo’s Vehicle entity will be monitored for changes to the data by comparing the last HASHBYTES value.
  2. Stage Updates: Any changed data will be staged in an interim table for processing.
  3. Exclude Challenger Updated Items: If the Vehicle entity has been edited by a user directly from the Challenger system since the last Apollo update, it will be excluded from the current Apollo update and any future Apollo updates. This was a business requirement that basically said, if a user uses Challenger to update a particular entity, e.g. CarID: 123XYZ, then the user has indicated that CarID 123XYZ will no longer be updated in Apollo and updates in Challenger should not be overwritten.
  4. Update Challenger: Challenger data will be overwritten with the Apollo updates that have occurred since the last Apollo update.
  5. Archive Update History: An audit trail of all updates will be persisted into an archive table.

The thing I liked most about this approach is this: we had an existing, unalterable legacy system that contained no reliable means of tracking changes to the data. With the addition of two columns to a table in another database (SourceHashValue, SourceHashDateTime), we were able to implement a full-fledged and reliable change data capture system. The legacy system needed no awareness of this other database.

Recall the ReferenceCar table from Part One. This table lived in the ETL team’s reference database and served a single purpose: it provided the surrogate key mapping between the cars in Apollo and the cars in Challenger.

ReferenceCar Table

The ReferenceCar table was augmented with the following columns:

  1. SourceHashValue varbinary(8000) – This persisted the HASHBYTES hash value for the Apollo Car data from one update to the next.
  2. SourceHashDateTime datetime – This stored the timestamp of the last update from Apollo to Challenger, i.e. the timestamp of when the hash value in SourceHashValue was generated.
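The schema change itself is tiny. A sketch of what it might have looked like (the real script in the download may differ in detail):

```sql
-- Two nullable columns are enough to bolt change tracking onto the
-- existing mapping table; the legacy database is untouched.
ALTER TABLE reference.ReferenceCar
    ADD SourceHashValue    varbinary(8000) NULL,
        SourceHashDateTime datetime        NULL;
```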

Updated ReferenceCar Table

And let’s take a look at the source Car table from Apollo and the destination Car table from Challenger.

Apollo Car Table (Source)

Challenger Car Table (Destination)

The 3 tables above provide the framework to determine if something changed in the source system that should be propagated to the destination system. When the Challenger Car table is initially loaded with data from the Apollo Car table using SQL Server Integration Services (SSIS), HASHBYTES is used to return a hash of the columns that will be tracked for changes, in this example, Make, Model, Year and Color.

This hash is stored in the reference.ReferenceCar table’s SourceHashValue column, along with the time stamp of when it was taken. The ReferenceCar table also contains the mapping between Apollo and Challenger, Apollo’s CarID and Challenger’s CarID, respectively (review the Updated ReferenceCar Table above). Once the hash is stored with a time stamp, the next time the SSIS process runs the hash value can be used to determine if anything changed. The time stamp can be used to compare against Challenger’s LastModifiedDateTime to filter out items that were updated by users in Challenger in the interim, and permanently flag those items as “not updateable”. Lastly, a record of everything that happened is stored in an archive table.

Recall our requirements from Part One:

  • Updates to Apollo should flow to Challenger on a nightly basis
  • Once the updateable entity was edited in Challenger, updates in Apollo should be ignored
  • Providing an audit trail of updates was essential

The process described above handles these requirements. In the next section we will walk through the process from start to finish.


Everything necessary to complete the walkthrough can be downloaded from this post. If you want to follow along, you will need the following SQL scripts (tested on SQL Server 2008 R2):

  • 1_HashbytesDemoDatabaseSetup.sql
  • 2_HashbytesDemoDataUpdates.sql
  • 3_HashbytesDemoReviewUpdateResults.sql

and the SSIS project (compatible with Visual Studio/BIDS 2008) that contains the following files:

  • HashbytesDemo.sln
  • HashbytesDemo.dtproj
  • HashbytesDemo.dtsx

Download these items and unzip the SSIS solution, but don’t open it yet; we’ll need to set up the database first to avoid validation issues when opening the SSIS project.

Database Setup

Open and run the 1_HashbytesDemoDatabaseSetup.sql script on the server where you plan to run the demo. This will:

  1. Create the HashbytesDemo database
  2. Create schemas (to represent different databases for simplicity)
    1. apollo
    2. challenger
    3. reference
  3. Create tables
    1. apollo.Car
    2. challenger.Car
    3. reference.ReferenceCar
    4. reference.StageCar
    5. reference.StageCarArchive
  4. Add 100 rows of test data into apollo.Car
  5. Create the stored procedure uspApolloCarUpdatedSource

After you have run the script, the database should look like this in SQL Server Management Studio’s (SSMS) Object Explorer:

The apollo.Car table should have 100 rows of data. The other tables are empty, for now. Feel free to borrow the following SQL to view the data in the tables you just created.

USE HashbytesDemo;

SELECT * FROM apollo.Car;
SELECT * FROM challenger.Car;
SELECT * FROM reference.ReferenceCar;
SELECT * FROM reference.StageCar;
SELECT * FROM reference.StageCarArchive;

Video Example

Note that the IDENTITY columns in apollo.Car and challenger.Car are seeded with different values to allow for easier visual identification of the mapping between the 2 systems in reference.ReferenceCar. This is strictly to aid in the demonstration.
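The two table definitions might look something like this (the exact seed values and column types here are illustrative, not copied from the download):

```sql
-- Illustrative only: different IDENTITY seeds make the surrogate-key
-- mapping easy to eyeball in reference.ReferenceCar.
CREATE TABLE apollo.Car
(
    CarID int IDENTITY(1000, 1) PRIMARY KEY,
    Make  varchar(50), Model varchar(50), [Year] int, Color varchar(30)
);

CREATE TABLE challenger.Car
(
    CarID int IDENTITY(5000, 1) PRIMARY KEY,
    Make  varchar(50), Model varchar(50), [Year] int, Color varchar(30),
    LastModifiedDateTime datetime
);
```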

Project Setup

The SSIS solution assumes that the HashbytesDemo database is installed on your local instance of SQL Server with the Server Name localhost. If your setup is different, you will need to update the localhost.HashbytesDemo Connection Manager in the project.

Once you are up and running with the project open, you should be looking at a screen that resembles the following:

Initial Load

Assuming your Connection Manager is set up correctly, you should be able to run the SSIS project by right-clicking the package in Solution Explorer and choosing Execute Package.

You should see the 100 rows in apollo.Car flow through to challenger.Car. This is the initial migration and all the car data in Apollo is synchronized with the car data in Challenger.

Video Example

Running the SQL to view the data in apollo.Car, challenger.Car and reference.ReferenceCar should verify the data is synchronized between both systems.

Video Example

Emulating Updates

The second SQL script, 2_HashbytesDemoDataUpdates.sql, will emulate user updates in both the source and target. Updates in the source (Apollo) should flow through to the destination (Challenger), unless Challenger was updated in the interim by a user or some other process.

Open and run the 2_HashbytesDemoDataUpdates.sql script on the server where you are running the demo. This will:

  1. Update a subset of data in apollo.Car to emulate user data changes in the source
  2. Update a subset of data in challenger.Car to emulate conflicting user updates in the destination
  3. Display Apollo and Challenger Car data side by side to allow you to compare the data before running the SSIS process

Video Example

Looking at the data, we see the random data elements that were updated. Notice that the ones that were updated in ChallengerMake have more recent time stamps in ChallengerLastUpdatedTime. These are of interest because they are the ones that will be flagged as “not updateable”. We will see all the other Apollo data updates make their way through to Challenger.

Running the Update Process

We are going to execute the HashbytesDemo.dtsx package again. This time, we will not see any new data loaded, just updates being propagated from Apollo to Challenger.

Right-click the HashbytesDemo.dtsx package and choose Execute Package. After the package has completed, take a moment to examine the row counts as they moved through the data flow tasks (DFTs).

Video Example

The Initial Apollo to Challenger Car Data Load DFT shows that of the 100 rows found in the Apollo Car Source, none made it through to the Challenger destination. This is correct, as there was no new data.

The Load Reference Data DFT shows the same, as it should always match its predecessor.

The next 2 DFTs are the ones to watch, as they are the ones that should have propagated the updates correctly. Let’s look at Stage Pending Apollo Car Updates first.

Notice the flow of data through the tasks. The Apollo Car Updated Source correctly captured 20 updated rows. (Please note that the demo may not always generate exactly 20 rows.) You can review the stored procedure reference.uspApolloCarUpdatedSource if you want to know more about how it worked. In a nutshell, it compared the reference.ReferenceCar.SourceHashValue with the current hash created from HASHBYTES against the data in Apollo to determine if the data were different.
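A condensed, hypothetical skeleton of what the heart of that procedure does (the working version ships in 1_HashbytesDemoDatabaseSetup.sql; the ApolloCarID column name and the '|'-delimited hash expression are assumptions):

```sql
-- Rows qualify as changed when the freshly computed hash of the monitored
-- columns no longer matches the hash persisted on the previous run.
SELECT a.CarID, a.Make, a.Model, a.[Year], a.Color,
       HASHBYTES('SHA1',
           ISNULL(a.Make, '') + '|' + ISNULL(a.Model, '') + '|' +
           ISNULL(CAST(a.[Year] AS varchar(4)), '') + '|' +
           ISNULL(a.Color, '')) AS NewSourceHashValue  -- carried forward to refresh ReferenceCar
FROM apollo.Car AS a
JOIN reference.ReferenceCar AS rc
    ON rc.ApolloCarID = a.CarID
WHERE rc.SourceHashValue <>
      HASHBYTES('SHA1',
          ISNULL(a.Make, '') + '|' + ISNULL(a.Model, '') + '|' +
          ISNULL(CAST(a.[Year] AS varchar(4)), '') + '|' +
          ISNULL(a.Color, ''));
```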

The Lookup Challenger Car Data task looks up the corresponding data in Challenger to use for comparing the LastModifiedDateTime and for storing the “old” (prior to update) values in the audit trail.

The Challenger Updates Exist Conditional Split divides the data into 2 paths, updateable and not updateable. You can see that 15 rows were identified as updateable, i.e. they had no conflicting updates in Challenger. 5 rows were flagged as “not updateable”, i.e. their Challenger LastModifiedDateTime values were more recent than the reference.ReferenceCar.SourceHashDateTime. This indicates that a user (or some other process) updated the same Car in Challenger since it was last updated with Apollo’s data. Because of the requirements, these CarIDs will be flagged as permanently “not updateable” with changes from Apollo. If you look at the end of the reference.uspApolloCarUpdatedSource stored procedure, you will see where it checks the reference.StageCarArchive table for permanently flagged items.

The Challenger Updateable Derived Columns and Challenger Not Updateable Derived Columns tasks create 2 new derived columns, one for the ChallengerUpdateable flag, the other for a text description to make the audit table, reference.StageCarArchive, more readable.

Lastly, the Update Staging Destination and Update Staging Destination (Not Updateable) tasks simply insert the data into the reference.StageCar table. 15 items are updateable, 5 are not.

Next, let’s look at the Update Challenger Cars from Stage DFT.

It pulls the staged data directly from reference.StageCar and correctly finds 20 rows. The following Updateable Conditional Split simply excludes the “not updateable” rows. The Data Conversion task is not in use and can be ignored. If there were data conversions required for the destination, that is where they would go. The next 3 OLE DB Command tasks do the work.

Update Challenger Car Command performs the update to the challenger.Car table. Update StageCar ChallengerUpdateDateTime updates the StageCar table with the time stamp of the updates to Challenger. Update ReferenceCar SourceHashValue updates the reference.ReferenceCar table SourceHashValue with the new HASHBYTES hash so that the reference.ReferenceCar table is in sync with the current value in the Apollo system. (For context, this hash value will be used as the basis of comparison the next time the process runs. If relevant data is updated in apollo.Car in the meantime, the hash in reference.ReferenceCar won’t match and the item will be selected for updates.)
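For context, the three statements might look roughly like this inside the OLE DB Command editors, with SSIS binding each ? parameter from the data flow row by row (the column names are assumptions based on the tables above, not the shipped package):

```sql
-- Update Challenger Car Command: push the new Apollo values to Challenger.
UPDATE challenger.Car
SET Make = ?, Model = ?, [Year] = ?, Color = ?
WHERE CarID = ?;

-- Update StageCar ChallengerUpdateDateTime: record when Challenger was updated.
UPDATE reference.StageCar
SET ChallengerUpdateDateTime = GETDATE()
WHERE ChallengerCarID = ?;

-- Update ReferenceCar SourceHashValue: persist the new hash for the next run.
UPDATE reference.ReferenceCar
SET SourceHashValue = ?, SourceHashDateTime = GETDATE()
WHERE ChallengerCarID = ?;
```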

The last 2 tasks in the Control Flow, Archive Staging Data and Truncate Reference StageCar, simply copy the data from reference.StageCar and insert it “as is” into reference.StageCarArchive. Then StageCar is truncated, since the data is preserved in StageCarArchive. The archive table serves 2 purposes: it provides an audit trail, and it persists the CarIDs in Challenger that are permanently “not updateable”. Recall the last bit of our WHERE clause on the Apollo Car Updated Source:
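A hedged reconstruction of the idea behind that clause (the ChallengerUpdateable flag and ApolloCarID column are assumed names, and this is a fragment, not a complete query):

```sql
-- Fragment only: appended to the change-detection query so that any CarID
-- ever archived as "not updateable" is excluded from this and all future runs.
AND a.CarID NOT IN
    (SELECT sca.ApolloCarID
     FROM reference.StageCarArchive AS sca
     WHERE sca.ChallengerUpdateable = 0)
```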

Take a moment to run the 3_HashbytesDemoReviewUpdateResults.sql script to review the results in reference.StageCarArchive. (The screenshot below is clipped width-wise for display purposes.)

Notice the data with NULL ChallengerUpdateDateTime values. These are the ones that were flagged as “not updateable”. The side by side columns display the current value as it exists in challenger.Car, the old value (its value in Apollo and Challenger before the user updated the data in Apollo), and the new value (its value in Apollo AFTER it was updated by the user). Keep in mind the new value will not have flowed through to the current value if it was not updateable.


This concludes the series. I decided to knock it all out in 2 parts since it took me much longer than I anticipated to get back to Part 2 and finish it up. The demo can easily be restarted by simply running the 1_HashbytesDemoDatabaseSetup.sql script again. At a high level, the steps are these:

  1. Run 1_HashbytesDemoDatabaseSetup.sql
  2. Execute the HashbytesDemo.dtsx package (this will complete the initial load)
  3. Run 2_HashbytesDemoDataUpdates.sql
  4. Review the pending updates using the output from 2_HashbytesDemoDataUpdates.sql
  5. Execute the HashbytesDemo.dtsx package (this will process the updates)
  6. Run 3_HashbytesDemoReviewUpdateResults.sql to review the results

This is a demo project and does not represent production code. Shortcuts were made for expediency. For example, the OLE DB Command objects in the Update Challenger Cars From Stage DFT are parameterized in such a way that the data is processed row by row. They would yield much better performance were they set-based.
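As a sketch of what that set-based alternative could look like, a single Execute SQL Task might replace the per-row commands (the New* column names and the ChallengerUpdateable flag are assumptions):

```sql
-- One statement applies every updateable staged row at once instead of
-- issuing an UPDATE per row from the data flow.
UPDATE c
SET c.Make   = s.NewMake,
    c.Model  = s.NewModel,
    c.[Year] = s.NewYear,
    c.Color  = s.NewColor
FROM challenger.Car AS c
JOIN reference.StageCar AS s
    ON s.ChallengerCarID = c.CarID
WHERE s.ChallengerUpdateable = 1;
```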

As the saying goes, hindsight is 20/20. I am fortunate to work with an amazing and talented team at Slalom on my current project, and it shows. Since I started this posting about change data capture, I have learned a few things from my colleagues about cleaner, simpler ways to solve the problem outlined in this series, most notably by leveraging the MERGE statement in tandem with HASHBYTES. While MERGE is a more T-SQL-intensive solution, the content in this posting relies more heavily on the built-in SSIS components, and therefore has merit in that regard. At the least, I hope it provides some conceptual relevance when trying to solve similar problems.
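For the curious, a rough, hypothetical sketch of the MERGE-plus-HASHBYTES shape (table and column names reuse the demo; a real implementation would also handle inserts, time stamps, the "not updateable" rule and the audit trail):

```sql
MERGE challenger.Car AS tgt
USING (
    SELECT rc.ChallengerCarID, a.Make, a.Model, a.[Year], a.Color,
           rc.SourceHashValue,
           HASHBYTES('SHA1',
               ISNULL(a.Make, '') + '|' + ISNULL(a.Model, '') + '|' +
               ISNULL(CAST(a.[Year] AS varchar(4)), '') + '|' +
               ISNULL(a.Color, '')) AS NewHash
    FROM apollo.Car AS a
    JOIN reference.ReferenceCar AS rc
        ON rc.ApolloCarID = a.CarID
) AS src
    ON tgt.CarID = src.ChallengerCarID
-- Only rows whose hash changed since the last run are touched.
WHEN MATCHED AND src.NewHash <> src.SourceHashValue THEN
    UPDATE SET Make = src.Make, Model = src.Model,
               [Year] = src.[Year], Color = src.Color;
```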

Feel free to follow-up with questions or comments.


This zip file contains the SQL scripts and SSIS files necessary to run the demo described in this post. Unzip it and open the SSIS solution file. Open the SQL scripts in SSMS and run them when indicated in the walkthrough.

Increase Dynamics CRM Online Scribe Data Migration Performance

Slalom Consultant Jayson Goldinger

Jayson Goldinger is a Slalom Consultant based in Denver, CO, and a member of the Slalom National CRM practice. His work revolves around Microsoft Dynamics CRM including configuration, customization, report development, and integration. He is also recognized as a Scribe MVP.

For those of you who have done data migration or integration with Dynamics CRM Online, you know that performance can be a challenge. I faced that challenge on a recent project.

I had approximately 52,000 account records with associated contacts to import. I also needed to check (a Seek step) against CRM to see whether each record was already in CRM. I was using Scribe Insight with Excel as my source and Dynamics CRM Online as my target, via the Dynamics CRM adapter. On my first attempt the job ran for 9 hours and imported only about 15,000 records. At that rate it would have taken over 30 hours to complete, which is not acceptable. It also invites other issues, such as the connection being dropped after some period of time, forcing me to restart the integration. Some of my colleagues have seen Dynamics CRM connections dropped around the 8-hour mark.

The question then becomes how to combat this issue. Continue reading and I’ll explain how I was able to increase performance from about 30 source records per minute to an average of 200 source records per minute. I say source records because in the job I am discussing, each source record may create one or two records.

Implementing a Custom Change Data Capture System Part 1: Introduction


Coming up with a title proved to be one of the most difficult parts of beginning this posting. Implementing a Custom Change Data Capture System doesn’t really ‘capture’ what should be said. I recently found myself in a situation with a client where it was necessary to quickly implement update processes between a legacy transactional application and a new transactional system. “That’s nothing new,” you say, and you are right, except that it was late in the game; there was already an extract-transform-load (ETL) solution in place that was not architected to account for updates, because updates were never in scope until that day; and (here was the nut of the challenge) nothing reliable existed in the legacy application to capture changed data.
