An Ordinary Day with the Client

Slalom Consultant David Van De Sompele

Slalom Consultant David Van De Sompele's expertise includes performance tuning, production DBA activities, data modelling, ETL processes and reporting. He's based in Slalom's Seattle office.

If a day with the client is so ordinary, what is there to write about?  That is a fair question, and the answer to it is the topic of this article.

Dictionary.com defines the word ordinary as common, mediocre, and unexceptional. But if that is the value you are delivering to your client, then you should consider a career other than consulting. As consultants we specialize in various technologies and methodologies.  Blending this expertise with creativity and innovation, we help companies solve complex business and technological problems.  Over time this becomes ordinary to us – it’s what we do – but this ordinary effort from the consultant should look extraordinary from the client’s perspective.

On rare occasions you may be presented with a particularly challenging situation: uncharted waters, or an unpredictable fluke for which no one has the answer.  And as a senior consultant you must be willing to dig deep and bring the entire breadth of your professional experience to bear in order to solve the problem.  To better illustrate this I’ll relate a recent experience I had.  It was one of those situations in which everything seemed to be going well until a very strange situation occurred and I was asked to jump in and solve the issue.

(Here’s where it gets technical—feel free to skip down to the last two paragraphs if you don’t want to dive deep into one particular night in the life of a technology consultant!)

My client asked me to assist with planning and implementing a major SQL Server upgrade.  Prior to the upgrade the production environment contained a mix of three SQL Server 2000/2005 servers running Transactional Replication.  These database servers are responsible for processing millions of mouse clicks and advertising impressions every day.  In addition, they provide cogent statistical reporting to both internal and external clients. It is a very complex system, and when it is offline there is a significant customer impact.  The project called for upgrading all three servers to SQL Server 2008 R2 using a phased approach.  For the first phase of the production upgrade, a senior system admin was going to do the work and have me observe and provide guidance as necessary.  This was to ensure that the client could gain hands-on experience with SQL Server.  Specifics of the pre-upgrade architecture were as follows:

  • SQL1:  Running SQL Server 2000; the Publisher for replication.  It receives millions of mouse clicks and ad impressions per day.
  • SQL2:  Running SQL Server 2000; it subscribes to SQL1’s publications and every night runs large data aggregation jobs in order to provide reporting and statistics to internal and external customers.
  • SQL3:  Running SQL Server 2005; distribution for the replication topology is this server’s sole task.

SQL Server replication requires the Distributor to be running a version of SQL Server greater than or equal to the version running on the Publisher, making the Distributor our critical path and the only server undergoing an upgrade for this first phase.  The Distributor was running on newer hardware, so we were only upgrading the database software.  To minimize downtime this was to be an in-place upgrade, and customers were notified of a planned three-hour maintenance window. Normally I am not a fan of in-place upgrades, but according to Microsoft an in-place upgrade was perfectly acceptable, and the SQL Server 2008 R2 installer contains an automated option to upgrade an existing server from previous versions.  The client had the luxury of testing the in-place upgrade in a QA environment that mirrored production, and the entire process worked seamlessly. In fact, it seemed almost too easy, but it gave us the confidence to proceed with the Distributor upgrade in production.

The initial production upgrade proceeded without incident.  Replication restarted, and the snapshot replication agents ran correctly and began applying their snapshots at the Subscriber.  Snapshot replication is used to provide the initial data set for transactional and merge replication.  Replication maintains a snapshot folder on the Distributor for each publication.  When a publication’s snapshot agent runs, the data and schema for each article in the publication are copied to the appropriate snapshot folder, from which they may be applied to the Subscriber database.  Due to the size of the databases and the age of the Subscriber hardware, a couple of the snapshots required about 90 minutes to apply.  I used this opportunity to head home.  During my commute I received a phone call from the sys admin who had performed the upgrade.  He said that two of the subscriptions had thrown errors and failed to apply correctly, and he didn’t fully understand the error messages.  When I arrived at home and came online remotely, here is what I found:
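For readers unfamiliar with the mechanics, a publication's snapshot agent can be started manually from the Publisher.  A minimal sketch, using a hypothetical publication name for illustration:

```sql
-- Run at the Publisher, in the publication database.
-- N'AdClicks' is a hypothetical publication name.
EXEC sp_startpublication_snapshot @publication = N'AdClicks';

-- The agent writes the snapshot files (.bcp data files plus the
-- .sch/.idx/.pre schema scripts) into a date-and-time-stamped
-- subfolder under the publication's snapshot folder on the
-- Distributor, from which the distribution agent applies them
-- to the Subscriber.
```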

  • Two of the subscription agents, each subscribed to a publication with one large (300+ million row) table, had thrown errors.
  • A third subscription agent, subscribed to a publication with about 24 small tables, was running, but undistributed transactions were backing up in the queue, meaning they were not being delivered to the Subscriber.
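One way to quantify that backlog is to ask the distribution database how many commands are still pending for a subscription.  A sketch, assuming a push subscription; all of the database, publication, and server names below are hypothetical placeholders:

```sql
-- Run at the Distributor, against the distribution database.
USE distribution;
EXEC sp_replmonitorsubscriptionpendingcmds
    @publisher_db      = N'AdServing',    -- publication database
    @publication       = N'SmallTables',  -- the backed-up publication
    @subscriber        = N'SQL2',         -- subscriber server
    @subscriber_db     = N'Reporting',    -- subscription database
    @subscription_type = 0;               -- 0 = push subscription
```

The result set reports the number of undistributed commands and an estimate of the time required to deliver them.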

The subscription agents for both of the large table publications had thrown slightly cryptic errors indicating problems with the snapshot application at the Subscriber.  Further investigation revealed that both snapshots had applied and upon completion, immediately entered a transaction rollback.  A transaction rollback involving 300+ million rows takes several hours to complete as it must undo every command that was executed during the snapshot application. The small table publication was exhibiting a different behavior.  It had entered what I would define as a “runaway snapshot” mode.  The snapshot applied to the subscriber and upon completion would immediately start applying again.  This action would continue until I manually halted the subscription agent.  At this point it was 6:00 PM and the reporting system had been offline for the allotted three hours. I emailed the project manager, who was understandably concerned, with a quick summary of the issues and my next steps.

All of the errors occurred during snapshot applications on the Subscriber, so I decided to begin my troubleshooting there.  The first step was to delete all three subscriptions and then recreate them.  Due to the size of the tables involved with two of the publications, I decided to subscribe to just one of them plus the small table publication.  The result:  Still broken.  The small table subscription resumed its “runaway snapshot” mode.  And the first large table subscription finished applying and then went into immediate rollback — a two-hour process to completion.  It is important to note that I could not disrupt the rollback without taking drastic measures and risking serious damage to the database and its data. An exhaustive Google search during this time provided no help for the problem at hand.  It was now 9:00 PM and the reporting system had been offline for 6 hours. The nightly data aggregation job was scheduled to run on the Subscriber at 9:30 PM, which meant we had to disable that job until the problem at hand was solved.  I emailed the project manager with another update and included my next steps in resolving the issue.
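Dropping and recreating a push subscription is done at the Publisher.  A minimal sketch of that first troubleshooting step, with hypothetical publication and server names:

```sql
-- Run at the Publisher, in the publication database.
-- Drop the existing subscription for all articles.
EXEC sp_dropsubscription
    @publication = N'LargeTable1',
    @article     = N'all',
    @subscriber  = N'SQL2';

-- Recreate it; @sync_type = 'automatic' tells replication to
-- initialize the Subscriber from a snapshot.
EXEC sp_addsubscription
    @publication       = N'LargeTable1',
    @subscriber        = N'SQL2',
    @destination_db    = N'Reporting',
    @subscription_type = N'Push',
    @sync_type         = N'automatic';
```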

Rebuilding the subscriptions did not resolve the issue, so rebuilding the three publications seemed the next logical step.  I started by again deleting the subscriptions, then I moved over to the Publisher, deleted the three publications, and rebuilt them from scratch.  Then I went back to the Subscriber and subscribed to one of the large table publications and the small tables publication.  And for good measure I had the Distributor create new snapshots.  The result:  Still broken, same symptoms and another two-hour wait for the rollback to complete.  Another Google search plus a thorough search of MSDN and Technet ensued, but with no useful information found.  It was now 12:30 AM and the reporting system had been offline for 9.5 hours. The project manager looped me into a conference call with the resident Windows sys admin to discuss the situation.  They were extremely concerned, but ready to call it a night, feeling that we should get some sleep and begin anew early the next morning.  I informed them I would press on as I felt this issue could not wait.

I hung up the phone and started thinking through some basic questions and answers:

Q: Why are only these three publications having problems?
A: There are eight other publications replicating to the Subscriber without incident.

Q: What do I know?
A: Dropping and recreating the publications and subscriptions did not solve the issue.

Q: What haven’t I thought of?
A: If the problem is not with the Publisher or Subscriber, then perhaps it could be something with the Distributor.

The answer to my final question would ultimately lead me to the solution.  The problem seemed to be with the snapshots themselves, almost as if they were getting corrupted during their creation on the Distributor.  I logged on to the Distributor, opened the snapshot folder, and found the most current subfolders for the three problem publications.  Each publication has a snapshot folder, and when the publication’s snapshot agent runs, a new sub-folder named with the current date and time is created.  I opened the schema files (those with .idx, .pre and .sch extensions), which contain all of the commands to drop and recreate the table(s) and indexes.  I didn’t see anything obviously incorrect or corrupt.  I decided to try creating one final new set of snapshots, which would force the creation of new snapshot subfolders.  The snapshots completed and began their application at the Subscriber.  Result:  Still broken, same symptoms.  It was now 3:00 AM and my mind was going numb.  I decided to get a few hours of sleep, but was still pretty sure the problem had to do with the snapshots themselves.  I awoke three hours later with what would be the solution.

Creating a new snapshot and subfolder within the publication’s master snapshot folder did not solve the problem, so perhaps I needed to take the process one step higher and recreate the publications with new names which would force the creation of new master snapshot folders.  Could it really be that simple?  I deleted one of the large table publications, recreated it with a new name and subscribed to it with an immediate sync snapshot, forcing the snapshot agent to run.  Upon agent completion the data applied to the Subscriber.  The last index build completed and immediately the subscription began receiving replicated data.  Problem solved!
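The fix described above can be sketched in T-SQL.  This is a sketch only, with hypothetical publication, article, and server names; the key point is that creating the publication under a new name forces a new master snapshot folder on the Distributor:

```sql
-- Run at the Publisher, in the publication database.
-- Drop the problem publication entirely.
EXEC sp_droppublication @publication = N'LargeTable1';

-- Recreate it under a NEW name; a new name means a new master
-- snapshot folder on the Distributor.
EXEC sp_addpublication
    @publication    = N'LargeTable1_v2',
    @status         = N'active',
    @repl_freq      = N'continuous',  -- transactional replication
    @immediate_sync = N'true';        -- keep a snapshot available for sync

-- Re-add the snapshot agent job and the article.
EXEC sp_addpublication_snapshot @publication = N'LargeTable1_v2';
EXEC sp_addarticle
    @publication   = N'LargeTable1_v2',
    @article       = N'Impressions',
    @source_object = N'Impressions';

-- Subscribe and initialize from a snapshot.
EXEC sp_addsubscription
    @publication       = N'LargeTable1_v2',
    @subscriber        = N'SQL2',
    @destination_db    = N'Reporting',
    @subscription_type = N'Push',
    @sync_type         = N'automatic';

-- Force the snapshot agent to run immediately.
EXEC sp_startpublication_snapshot @publication = N'LargeTable1_v2';
```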

I excitedly called the project manager, waking her from a dead sleep, to tell her I had a solution in place; the relief in her voice was noticeable. I quickly applied the same fix to the other two problematic publications, and within a few hours the entire system was back up and running.

From the client’s perspective this was an extraordinary effort by a consultant who was originally asked to act in an observational role only so the client could learn from the upgrade experience.  When the situation quickly spiraled beyond the client’s experience, I was ready to jump in and own the problem until a solution was found.

From this consultant’s perspective this was all in a day’s work; just an ordinary day with the client.
