SQL Server 2014: features aimed at scale


Brian Franklin

Are your clients running SQL Server 2008, or perhaps even some lingering instances of SQL Server 2005? Are they migrating to SQL Server 2012? Most likely, you're seeing your clients using a blend of versions. Luckily, there's good news on the horizon: the release of SQL Server 2014 is right around the corner, bringing some interesting new features and smart enhancements to the table. The SQL Server 2014 Community Technology Preview (CTP) 2 was released to the public last October and is the first and only production-ready build prior to release to manufacturing (RTM).

When Structured Query Language (SQL) Becomes Spaghetti Query Language


David Van De Sompele

Avoiding the Overuse and Misuse of Dynamic SQL

Spaghetti code is a derogatory term for poorly written source code that has a tangled, complex control structure and is prone to errors that are often difficult to find and correct. Anybody who has worked in an IT organization is likely familiar with the term. Perhaps you've even had the misfortune of twisting and winding through someone's spaghetti in an attempt to document, troubleshoot, or modify it. If so, then you've lived the nightmare and will empathize with me in this article. Spaghetti code can be written in just about any programming language, and SQL is no exception. For this post I will focus on what I consider to be one of the most egregious types of SQL spaghetti: Dynamic SQL.

Implementing a Custom Change Data Capture System—Part Two

Bruce Hendry is a consultant in Slalom’s Technology Solutions group in Dallas, TX. Bruce is passionate about the Microsoft Business Intelligence stack and the .NET Framework and enjoys technology.

Introduction

This posting continues my first post, Implementing a Custom Change Data Capture System. It may be helpful to review Part One before diving into Part Two.

Part One described how our team was challenged with late arriving requirement changes for an existing Extract Transform Load (ETL) solution. The new requirements dictated that changes in the source transactional system propagate to the destination transactional system. The source system had no inherent reliable means of tracking changes, and altering the source system was not an option. Part One covered the team’s proposed solutions and briefly described the one we selected, using HASHBYTES as a means to monitor and respond to changes in the source system.

Part Two will conclude the series and provide the details of the solution our team implemented. The end of this post contains a link to downloadable content that can be used to demonstrate the custom change data capture system and follow along.

Implementation

Time was of the essence, so we quickly set forth on our proof of concept. We discovered in short order that CHECKSUM was not likely to work for us as it had some known issues. The issues are beyond the scope of this article, but as of this writing the comments on the link above touch on them.

We were a little discouraged, because CHECKSUM has the advantage of returning a value for all columns of a given table easily. Fortunately, concatenating table columns for HASHBYTES did not prove too cumbersome, and it allowed us to monitor only a subset of table columns for changes.
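
To make the trade-off concrete, here is a minimal sketch of the two approaches. The column names come from the demo schema described later in the walkthrough (apollo.Car with Make, Model, Year, and Color, with Year assumed to be an integer); the SHA1 algorithm, the '|' delimiter, and the NULL handling are my own assumptions, not necessarily what the project used.

-- CHECKSUM covers every column with no effort, but its 32-bit result is
-- comparatively collision-prone.
SELECT CarID, CHECKSUM(*) AS RowChecksum
FROM apollo.Car;

-- HASHBYTES requires an explicit concatenation, which also lets us hash only
-- the subset of columns we actually want to monitor.
SELECT CarID,
       HASHBYTES('SHA1',
           ISNULL(Make,  '') + '|' +
           ISNULL(Model, '') + '|' +
           ISNULL(CAST([Year] AS varchar(4)), '') + '|' +
           ISNULL(Color, '')) AS SourceHashValue
FROM apollo.Car;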

At a high level, here is the solution I am about to describe in detail (I lifted and modified this nifty flow diagram out of PowerPoint from my esteemed colleague, Frederick Chun):

Update Process

  1. Compare Last Hash Value: Apollo’s Vehicle entity will be monitored for changes to the data by comparing the last HASHBYTES value.
  2. Stage Updates: Any changed data will be staged in an interim table for processing.
  3. Exclude Challenger Updated Items: If the Vehicle entity has been edited by a user directly in the Challenger system since the last Apollo update, it will be excluded from the current Apollo update and any future Apollo updates. This was a business requirement that basically said: if a user uses Challenger to update a particular entity, e.g. CarID 123XYZ, then the user has indicated that CarID 123XYZ should no longer receive updates from Apollo, and the Challenger updates should not be overwritten.
  4. Update Challenger: Challenger data will be overwritten with the Apollo updates that have occurred since the last Apollo update.
  5. Archive Update History: An audit trail of all updates will be persisted into an archive table.

The thing I liked most about this approach is this—we had an existing, unalterable legacy system that contained no reliable means of tracking changes to the data. With the addition of two columns to a table in another database (SourceHashValue, SourceHashDateTime), we were able to implement a full-fledged and reliable change data capture system. The legacy system had no need of any awareness about this other database.

Recall the ReferenceCar table from Part One. This table lived in the ETL team’s reference database and served a single purpose; it provided the surrogate key mapping between the cars in Apollo and the cars in Challenger.

ReferenceCar Table

The ReferenceCar table was augmented with the following columns (a sketch of the change follows the list):

  1. SourceHashValue varbinary(8000) – This persisted the HASHBYTES hash value for the Apollo Car data from one update to the next.
  2. SourceHashDateTime datetime – This stored the timestamp of the last update from Apollo to Challenger, i.e. the timestamp of when the hash value in SourceHashValue was generated.
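
A minimal sketch of that change, assuming the table lives in the reference schema as described in the walkthrough (purely illustrative; the demo setup script simply creates the table with these columns from the start):

ALTER TABLE reference.ReferenceCar
    ADD SourceHashValue    varbinary(8000) NULL,
        SourceHashDateTime datetime        NULL;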

Updated ReferenceCar Table

And let’s take a look at the source Car table from Apollo and the destination Car table from Challenger.

Apollo Car Table (Source)

Challenger Car Table (Destination)

The 3 tables above provide the framework to determine if something changed in the source system that should be propagated to the destination system. When the Challenger Car table is initially loaded with data from the Apollo Car table using SQL Server Integration Services (SSIS), HASHBYTES is used to return a hash of the columns that will be tracked for changes, in this example, Make, Model, Year and Color.

This hash is stored in the reference.ReferenceCar table's SourceHashValue column, along with the time stamp of when it was taken. The ReferenceCar table also contains the mapping between Apollo and Challenger, Apollo's CarID and Challenger's CarID, respectively (review the Updated ReferenceCar Table above). Once the hash is stored with a time stamp, the next time the SSIS process runs, the hash value can be used to determine if anything changed. The time stamp can be compared against Challenger's LastModifiedDateTime to filter out items that were updated by users in Challenger in the interim and to permanently flag those items as "not updateable". Lastly, a record of everything that happened is stored in an archive table.
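
To make that concrete, here is an illustrative row as it might land in ReferenceCar after the initial load. The mapping column names (ApolloCarID, ChallengerCarID), the sample values, and the SHA1/'|' hashing choices are assumptions for illustration only; the actual load is performed by the SSIS package.

-- Purely illustrative: one mapping row as it might look after the initial load.
INSERT INTO reference.ReferenceCar
       (ApolloCarID, ChallengerCarID, SourceHashValue, SourceHashDateTime)
VALUES (1, 1000,
        HASHBYTES('SHA1', 'Ford' + '|' + 'Mustang' + '|' + '1969' + '|' + 'Blue'),
        GETDATE());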

Recall our requirements from Part One:

  • Updates to Apollo should flow to Challenger on a nightly basis
  • Once the updateable entity was edited in Challenger, updates in Apollo should be ignored
  • Providing an audit trail of updates was essential

The process described above handles these requirements. In the next section we will walk through the process from start to finish.

Walkthrough

Everything necessary to complete the walkthrough can be downloaded from this post. If you want to follow along, you will need the following SQL scripts (tested on SQL Server 2008 R2):

  • 1_HashbytesDemoDatabaseSetup.sql
  • 2_HashbytesDemoDataUpdates.sql
  • 3_HashbytesDemoReviewUpdateResults.sql

and the SSIS project (compatible with Visual Studio/BIDS 2008) that contains the following files:

  • HashbytesDemo.sln
  • HashbytesDemo.dtproj
  • HashbytesDemo.dtsx

Download these items and unzip the SSIS solution, but don't open it yet; we'll need to set up the database first to avoid validation issues when opening the SSIS project.

Database Setup

Open and run the 1_HashbytesDemoDatabaseSetup.sql script on the server where you plan to run the demo. This will:

  1. Create the HashbytesDemo database
  2. Create schemas (to represent different databases for simplicity)
    1. apollo
    2. challenger
    3. reference
  3. Create tables
    1. apollo.Car
    2. challenger.Car
    3. reference.ReferenceCar
    4. reference.StageCar
    5. reference.StageCarArchive
  4. Add 100 rows of test data into apollo.Car
  5. Create the stored procedure uspApolloCarUpdatedSource

After you have run the script, the database should look like this in SQL Server Management Studio’s (SSMS) Object Explorer:

The apollo.Car table should have 100 rows of data; the other tables are empty for now. Feel free to borrow the following SQL to view the data in the tables you just created.

use HashbytesDemo
go
select * from apollo.Car
select * from challenger.Car
select * from reference.ReferenceCar
select * from reference.StageCar
select * from reference.StageCarArchive

Video Example

Note that the IDENTITY columns in apollo.Car and challenger.Car are seeded with different values to allow for easier visual identification of the mapping between the 2 systems in reference.ReferenceCar. This is strictly to aid in the demonstration.

Project Setup

The SSIS solution assumes that the HashbytesDemo database is installed on your local instance of SQL Server with the Server Name localhost. If your setup is different, you will need to update the localhost.HashbytesDemo Connection Manager in the project.

Once you are up and running with the project open, you should be looking at a screen that resembles the following:

Initial Load

Assuming your Connection Manager is set up correctly, you should be able to run the SSIS project by right-clicking the package in Solution Explorer and choosing Execute Package.

You should see the 100 rows in apollo.Car flow through to challenger.Car. This is the initial migration and all the car data in Apollo is synchronized with the car data in Challenger.

Video Example

Running the SQL to view the data in apollo.Car, challenger.Car, and reference.ReferenceCar should verify that the data is synchronized between the two systems.

Video Example

Emulating Updates

The second SQL script, 2_HashbytesDemoDataUpdates.sql, emulates user updates in both the source and the target. Updates in the source (Apollo) should flow through to the destination (Challenger), unless Challenger was updated in the interim by a user or some other process.

Open and run the 2_HashbytesDemoDataUpdates.sql script on the server where you are running the demo (the sketch after the list shows the kind of updates involved). This will:

  1. Update a subset of data in apollo.Car to emulate user data changes in the source
  2. Update a subset of data in challenger.Car to emulate conflicting user updates in the destination
  3. Display Apollo and Challenger Car data side by side to allow you to compare the data before running the SSIS process
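
For orientation only, the emulated updates amount to something along these lines. This is not the actual script, and the ReferenceCar column names used in the join are assumptions:

-- Emulate user edits in the source: change an arbitrary subset of Apollo cars.
UPDATE apollo.Car
SET    Color = 'Silver'
WHERE  CarID % 5 = 0;

-- Emulate conflicting user edits in the destination for a few of those same
-- cars, stamping LastModifiedDateTime so the package can detect the conflict.
UPDATE c
SET    c.Make = 'Edited in Challenger',
       c.LastModifiedDateTime = GETDATE()
FROM   challenger.Car AS c
JOIN   reference.ReferenceCar AS r ON r.ChallengerCarID = c.CarID
WHERE  r.ApolloCarID % 20 = 0;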

Video Example

Looking at the data, we see the random data elements that were updated. Notice that the rows whose ChallengerMake was updated have more recent time stamps in ChallengerLastUpdatedTime. These are the rows of interest, because they are the ones that will be flagged as "not updateable". We will see all the other Apollo data updates make their way through to Challenger.

Running the Update Process

We are going to execute the HashbytesDemo.dtsx package again. This time, we will not see any new data loaded, just updates being propagated from Apollo to Challenger.

Right-click the HashbytesDemo.dtsx package and choose Execute Package. After the package has completed, take a moment to examine the row counts as they moved through the data flow tasks (DFTs).

Video Example

The Initial Apollo to Challenger Car Data Load DFT shows that of the 100 rows found in the Apollo Car Source, none made it through to the Challenger destination. This is correct, as there was no new data.

The Load Reference Data DFT shows the same, as it should always match its predecessor.

The next two DFTs are the ones to watch, as they are responsible for correctly propagating the updates. Let's look at Stage Pending Apollo Car Updates first.

Notice the flow of data through the tasks. The Apollo Car Updated Source correctly captured 20 updated rows. (Please note that the demo may not always generate exactly 20 rows.) You can review the stored procedure reference.uspApolloCarUpdatedSource if you want to know more about how it works. In a nutshell, it compares reference.ReferenceCar.SourceHashValue with a current HASHBYTES hash computed from the Apollo data to determine whether the data changed.

The Lookup Challenger Car Data task looks up the corresponding data in Challenger to use for comparing the LastModifiedDateTime and for storing the “old” (prior to update) values in the audit trail.

The Challenger Updates Exist Conditional Split divides the data into two paths, updateable and not updateable. You can see that 15 rows were identified as updateable, i.e. they had no conflicting updates in Challenger. 5 rows were flagged as "not updateable", i.e. their Challenger LastModifiedDateTime values were more recent than the reference.ReferenceCar.SourceHashDateTime. This indicates that a user (or some other process) updated the same Car in Challenger since it was last updated with Apollo's data. Because of the requirements, these CarIDs will be flagged as permanently "not updateable" with changes from Apollo. If you look at the end of the reference.uspApolloCarUpdatedSource stored procedure, you will see where it checks the reference.StageCarArchive table for permanently flagged items.

The Challenger Updateable Derived Columns and Challenger Not Updateable Derived Columns tasks create 2 new derived columns, one for the ChallengerUpdateable flag, the other for a text description to make the audit table, reference.StageCarArchive, more readable.

Lastly, the Update Staging Destination and Update Staging Destination (Not Updateable) tasks simply insert the data into the reference.StageCar table: 15 items are updateable, 5 are not.

Next, let’s look at the Update Challenger Cars from Stage DFT.

It pulls the staged data directly from reference.StageCar and correctly finds 20 rows. The following Updateable Conditional Split simply excludes the “not updateable” rows. The Data Conversion task is not in use and can be ignored. If there were data conversions required for the destination, that is where they would go. The next 3 OLE DB Command tasks do the work.

Update Challenger Car Command performs the update to the challenger.Car table. Update StageCar ChallengerUpdateDateTime updates the StageCar table with the time stamp of the updates to Challenger. Update ReferenceCar SourceHashValue updates the reference.ReferenceCar table's SourceHashValue with the new HASHBYTES hash so that reference.ReferenceCar stays in sync with the current values in the Apollo system. (For context, this hash value will be used as the basis of comparison the next time the process runs. If relevant data is updated in apollo.Car in the meantime, the hash in reference.ReferenceCar won't match and the item will be selected for updates.)

The last two tasks in the Control Flow, Archive Staging Data and Truncate Reference StageCar, simply copy the data from reference.StageCar and insert it "as is" into reference.StageCarArchive. Then StageCar is truncated, since the data is preserved in StageCarArchive. The archive table serves two purposes: it provides an audit trail, and it persists the CarIDs in Challenger that are permanently "not updateable". Recall that the last bit of the WHERE clause on the Apollo Car Updated Source excludes those permanently flagged items; a reconstructed sketch of that source query follows.
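
This is only a reconstruction of the idea behind reference.uspApolloCarUpdatedSource, not its actual text; the ReferenceCar and StageCarArchive column names, the SHA1/'|' hashing choices, and the ChallengerUpdateable = 0 flag encoding are all assumptions:

SELECT a.CarID AS ApolloCarID,
       r.ChallengerCarID,
       h.CurrentHashValue
FROM   apollo.Car AS a
JOIN   reference.ReferenceCar AS r ON r.ApolloCarID = a.CarID
CROSS APPLY (SELECT HASHBYTES('SHA1',
                 ISNULL(a.Make,  '') + '|' +
                 ISNULL(a.Model, '') + '|' +
                 ISNULL(CAST(a.[Year] AS varchar(4)), '') + '|' +
                 ISNULL(a.Color, '')) AS CurrentHashValue) AS h
WHERE  -- the Apollo data changed since the stored hash was taken...
       r.SourceHashValue <> h.CurrentHashValue
       -- ...and the car was never flagged as permanently "not updateable"
       AND NOT EXISTS (SELECT 1
                       FROM reference.StageCarArchive AS arc
                       WHERE arc.ApolloCarID = a.CarID
                         AND arc.ChallengerUpdateable = 0);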

Take a moment to run the 3_HashbytesDemoReviewUpdateResults.sql script to review the results in reference.StageCarArchive.

Notice the data with NULL ChallengerUpdateDateTime values. These are the ones that were flagged as “not updateable”. The side by side columns display the current value as it exists in challenger.Car, the old value (its value in Apollo and Challenger before the user updated the data in Apollo), and the new value (its value in Apollo AFTER it was updated by the user). Keep in mind the new value will not have flowed through to the current value if it was not updateable.

Summary

This concludes the series. I decided to knock it all out in two parts, since it took me much longer than I anticipated to get back to Part Two and finish it up. The demo can be restarted at any time by running the 1_HashbytesDemoDatabaseSetup.sql script again. At a high level, the steps are these:

  1. Run 1_HashbytesDemoDatabaseSetup.sql
  2. Execute the HashbytesDemo.dtsx package (this will complete the initial load)
  3. Run 2_HashbytesDemoDataUpdates.sql
  4. Review the pending updates using the output from 2_HashbytesDemoDataUpdates.sql
  5. Execute the HashbytesDemo.dtsx package (this will process the updates)
  6. Run 3_HashbytesDemoReviewUpdateResults.sql to review the results

This is a demo project and does not represent production code. Shortcuts were taken for expediency. For example, the OLE DB Command objects in the Update Challenger Cars From Stage DFT are parameterized in such a way that the data is processed row by row; they would yield much better performance if the work were done in a set-based fashion, as sketched below.
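
For instance, the row-by-row OLE DB Commands could be replaced with set-based statements run from an Execute SQL Task, along these lines. The StageCar column names (NewMake, NewModel, and so on) and the ChallengerUpdateable flag encoding are assumptions based on the walkthrough, not the demo's actual schema:

-- Apply all pending, updateable changes to Challenger in one statement.
UPDATE c
SET    c.Make   = s.NewMake,
       c.Model  = s.NewModel,
       c.[Year] = s.NewYear,
       c.Color  = s.NewColor
FROM   challenger.Car AS c
JOIN   reference.StageCar AS s ON s.ChallengerCarID = c.CarID
WHERE  s.ChallengerUpdateable = 1;

-- Bring the stored hashes in ReferenceCar up to date in one statement.
UPDATE r
SET    r.SourceHashValue    = s.SourceHashValue,
       r.SourceHashDateTime = GETDATE()
FROM   reference.ReferenceCar AS r
JOIN   reference.StageCar AS s ON s.ApolloCarID = r.ApolloCarID
WHERE  s.ChallengerUpdateable = 1;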

As the saying goes, hindsight is 20/20. Being fortunate enough to work with an amazing and talented team at Slalom on my current project yields great results. Since I started this posting about change data capture, I have learned a few things from my colleagues about cleaner, simpler ways to solve the problem outlined in this series, most notably by leveraging the MERGE statement in tandem with HASHBYTES. While MERGE is a more T-SQL-intensive solution, the content in this posting relies more heavily on the built-in SSIS components and therefore has merit in that regard. At the least, I hope it provides some conceptual relevance when trying to solve similar problems.
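
For reference, the general shape of that MERGE-plus-HASHBYTES pattern looks something like the sketch below. This is my own minimal illustration, not the approach my colleagues implemented; the column names, the SHA1/'|' hashing choices, and the omission of the "not updateable" exclusion are all simplifications:

MERGE challenger.Car AS tgt
USING (SELECT r.ChallengerCarID, a.Make, a.Model, a.[Year], a.Color,
              HASHBYTES('SHA1',
                  ISNULL(a.Make,  '') + '|' +
                  ISNULL(a.Model, '') + '|' +
                  ISNULL(CAST(a.[Year] AS varchar(4)), '') + '|' +
                  ISNULL(a.Color, '')) AS NewHash
       FROM apollo.Car AS a
       JOIN reference.ReferenceCar AS r ON r.ApolloCarID = a.CarID) AS src
   ON tgt.CarID = src.ChallengerCarID
WHEN MATCHED AND HASHBYTES('SHA1',
                  ISNULL(tgt.Make,  '') + '|' +
                  ISNULL(tgt.Model, '') + '|' +
                  ISNULL(CAST(tgt.[Year] AS varchar(4)), '') + '|' +
                  ISNULL(tgt.Color, '')) <> src.NewHash THEN
    UPDATE SET tgt.Make   = src.Make,
               tgt.Model  = src.Model,
               tgt.[Year] = src.[Year],
               tgt.Color  = src.Color;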

Feel free to follow up with questions or comments.

Download

This zip file contains the SQL scripts and SSIS files necessary to run the demo described in this post. Unzip the archive and open the SSIS solution file, then open the SQL scripts in SSMS and run them when indicated in the walkthrough.

Bug Workarounds and Adding Value for Your Client


Slalom Consultant David Van De Sompele's expertise includes performance tuning, production DBA activities, data modelling, ETL processes and reporting. He's based in Slalom's Seattle office.

Discovering bugs in enterprise software can be a frustrating experience, especially when those discoveries derail the completion of a time-sensitive project or application. Finding timely and effective workarounds to such issues is one way we can add value for our clients. I recently had just such an experience while building an ETL process with SQL Server 2005 Integration Services (SSIS). My client's requirements for the project were straightforward: get a set of files from an FTP server and load them into some tables in a reporting database. The files were in .CSV (comma-separated values) format, with each field separated by a comma and text values qualified with double quotes. The process was to be automated, running daily at 5:00 AM.

After developing the initial package I was ready to run it in debug mode to ensure all of the SSIS components were working correctly. After reading about half-way through the first file (roughly 10,000 records), the Flat File Connection Manager threw an error stating that there were more fields in the file than I had defined in the Connection Manager. My initial thought was that perhaps there was a corrupted row of data within the file. So I opened the file in a text editor and went to find the offending line. Listed below is the last row that loaded correctly followed by the row responsible for the error. I have added the column names to make this more readable and I have changed the actual data values to protect the client’s sensitive data:

PartnerID, PartnerName, AdName, ConnectionFee, ConnectionDate
123, "BNC Company", "Car Repo Print 1", 2.50, 11/20/2011
456, "ABC Company", "Ad one, Main Menu", 3.23, 11/21/2011

Looking carefully at the AdName column in the second row, you will notice it contains an embedded comma. The text qualifier is a double quote, so anything enclosed within should be treated as text by the import mechanism. But what I discovered was that the Integration Services flat file import mechanism would not correctly parse any line of the file if the file's field delimiter (a comma in my case) was embedded within a text field; the text qualifier was ignored and the embedded comma was treated as another field break. A quick search on Google revealed that this is a known and acknowledged issue with SQL Server 2005 Integration Services.

I decided to test this issue against other Microsoft products with an import feature to see if they exhibited the same behavior:

MS Excel 2007: Correctly imported and parsed the file. Recognized the double-quote text qualifier and had no issue with the embedded comma.

SQL Server 2000: Correctly imported and parsed the file. Recognized the double-quote text qualifier and had no issue with the embedded comma.

SQL Server 2008: I was unable to test SQL Server 2008 because I had no instance available, but from what I read on various developer message boards the bug also exists in that product; I can neither confirm nor deny this.

SQL Server 2008 R2: Correctly imported and parsed the file. Recognized the double-quote text qualifier and had no issue with the embedded comma. So it appears the bug has been fixed in this release.

Clearly I had stumbled upon a bug that had thwarted my development effort and I needed to find a workaround that would support the automation requirements for this ETL project. There are many possible workarounds to this bug, but I am only going to discuss the three that seemed most tenable for my current situation.

Upgrade to SQL 2008 R2–In the long run this was my preferred option, but it was not feasible for the following reasons. My client needed this ETL process to be in production within three days, and upgrading a SQL Server can be a major undertaking; the server on which this SSIS package would run is no exception, housing eight large reporting databases and hundreds of old SQL DTS packages that are slowly being rewritten in Integration Services. Any upgrade would certainly require far more than three days to complete. In addition, an upgrade was not within the corporate hardware/software budget for the current year.

Use the MS Excel 2007 Data Connector–The source files for the import originate from an outside entity and are created on a non-Windows system. Asking the creator of these files to convert them to an Excel format was not possible, and manually opening each file in Excel and saving it as .xlsx was not an option because complete automation was a requirement for this project.

Custom Parse the file outside of the Flat-file Data Connector–This ended up being the best option available. There are many ways to parse a comma-delimited file, and rather than list and explore all of those methods, I am going to explain the method that worked best for me given the project parameters.

My goal was to use the Flat File Connection Manager to import the file but bypass its parsing feature so I could use my own custom parser in its place. Integration Services includes a component that is perfect for the job: the Script Component, found in the Data Flow Transformations section of the SSIS Toolbox. Using this component, you can intercept the output data from a Data Flow Source, manipulate it with custom code, and then pass it as output to the next component. To summarize, I needed to accomplish three main tasks to achieve success:

  1. Modify properties of the Flat File Connection Manager so it will import the file, but bypass any attempt at parsing each incoming row.
  2. Configure properties of the Script Component to receive the data from the Flat File Connection Manager as an input and define the output columns to which I will assign elements parsed from each input row.
  3. Write the custom code responsible for parsing each line.

Modifying the Flat File Connection Manager to bypass incoming row parsing is easily achievable. Start by pointing it to the location of your text file. On the lower half of the Connection Manager's General page you can configure the format of the incoming file via a drop-down menu; your choices are Delimited, Fixed width, and Ragged right. Choose the Ragged right option. On the Columns menu do not select any column markers; we care only about getting the entire row. In the Misc section of the Advanced menu you will need to define a few attributes of your incoming row. Give it a name (I chose ImportRow for mine). The column delimiter will most likely be a line feed. Make sure the data type is a string [DT_STR] and set the Output Column Width property to 8000.

Now that you have defined the Connection Manager, add a new Flat File Source component to your Data Flow and point it to the Flat File Connection Manager. Open the Flat File Source Editor and click on the Columns menu; you will see the column name you defined in the Connection Manager. When the package executes, SSIS will open the file and read each row as one large 8000-character field, ignoring all commas, quotes, and other characters. Before writing custom code to parse the incoming data we must customize some of the Script Component properties.

Customizing the Script Component properties for this solution is highly involved and can become somewhat tedious if your incoming data has a lot of columns, so pack some patience for this part. When you add a new Script Component to the Data Flow workspace, SSIS displays a pop-up menu and you must select one of three options. Since our script is going to operate on data from an input and provide an output, select the Transformation radio button. Click the Flat File Source you just defined, grab the green arrow, and connect it to the Script Component. Doing this automatically configures the input source for the component. Our next task is to define the Outputs and Output Columns.

We are going to define two Outputs for the Script Component: one to receive rows that parse correctly and another to act as an error handler for rows that do not. With the Script Transformation Editor open, select the Inputs and Outputs menu. There will already be one existing output, Output 0, defined, but it has no columns. Click on Output 0 and then go to the properties menu. Set the Exclusion Group property to a value of 1; this is important, as it tells the component that you want to filter rows. Also consider renaming the output to something meaningful, such as ParsedRows. Now we need to define the specific columns for this output. Because the incoming data is an 8000-character text string, it is best to parse each value into an output column of the same data type: string. You can use the Data Conversion object later on to explicitly convert values to other data types as necessary. The key point is to understand how many columns you expect to parse from each row and to assign them to corresponding output columns of sufficient length. To add columns to your output, expand it, select the Output Columns folder, and then click the Add Column button. In the properties section you can assign a name, data type, and length to each column. Using the example data provided earlier in this post, I added five columns, gave them names identical to those in the example, and ensured they were strings of sufficient length for the incoming values. Upon completion of this step we will add another Output to handle rows that do not parse correctly.

Click the Add Output button to add your error handler and give it an appropriate name; I named mine ErrorRow. Set the Exclusion Group property value to 1 and ensure the SynchronousInputID is set to the name of the Input. We will add two columns to this output: one to capture the line number and one to capture the actual line of data that caused the error. I named my output columns ErrorLineNum and ParsedErrorLine, respectively. Define ErrorLineNum as a 4-byte integer and ParsedErrorLine as an 8000-character string. Now that all of the inputs and outputs have been configured, it is time to write some code that will parse the values out of each input row and assign them to the output columns.

My custom parsing code consists of a VB.NET script that uses a regular expression to read each row and split it on every comma that is not enclosed within the double-quote text qualifier. If you have never heard of regular expressions, or have only heard them mentioned in passing, here is a quick primer. According to Wikipedia, a regular expression (often referred to as regex) is a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. Almost every major programming language contains a regex implementation, and the .NET languages are no exception. Complex regular expressions often resemble hieroglyphics (perhaps ancient civilizations were the inventors of regex) and may leave you feeling overwhelmed and asking, "How long will this take to learn?" Fortunately, there are a lot of free and readily available helper applications you can download that will assist you in writing what you need for the task at hand. I will list some helpful regex websites at the end of this article.

With the Script Transformation editor open, select the Script menu and then click on the Design Script button. The designer will present you with the following generic script:

' Microsoft SQL Server Integration Services user script component
' This is your new script component in Microsoft Visual Basic .NET
' ScriptMain is the entrypoint class for script components

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        '
        ' Add your code here
        '
    End Sub

End Class

We will modify this script to accomplish our parsing task. Since we are going to use Regular Expressions in our code we need to make the Regex namespace available. To do that, add the following line of code in the Imports section:

Imports System.Text.RegularExpressions

For the remainder of the custom code, I will simply list it and provide inline comments:

Public Class ScriptMain
    Inherits UserComponent

    ' Create and initialize a variable to track the number of incoming rows.
    ' This is helpful for troubleshooting errors.
    Dim rowNum As Integer = 0

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

        ' Count the incoming rows.
        rowNum += 1

        ' Use the Regex.Split method to split the input string (the ImportRow
        ' column defined in the flat file source) into an array of substrings
        ' at each comma that is not enclosed within the double-quote text
        ' qualifier.
        Dim parseCol As String() = _
            Regex.Split(Row.ImportRow, ",(?!([^""]*""[^""]*"")*[^""]*""[^""]*$)")

        ' Each row is expected to contain 5 columns. Any more or less than
        ' that will be treated as an error.
        If parseCol.Length <> 5 Then
            ' Capture the problematic input row and the rowNum count.
            Row.ParsedErrorLine = Row.ImportRow
            Row.ErrorLineNum = rowNum
            ' Send the row in error to the ErrorRow output.
            Row.DirectRowToErrorRow()
        Else
            ' Assign the array values to the output columns defined earlier
            ' in the Script Component.
            Row.PartnerID = parseCol(0)
            Row.PartnerName = parseCol(1)
            Row.AdName = parseCol(2)
            Row.ConnectionFee = parseCol(3)
            Row.ConnectionDate = parseCol(4)
            ' Send correctly parsed rows to the ParsedRows output.
            Row.DirectRowToParsedRows()
        End If

    End Sub

End Class

As you can see, this code is pretty basic. The actual parsing occurs by using the Regex.Split method with the regular expression ,(?!([^"]*"[^"]*")*[^"]*"[^"]*$) to split each incoming row of the file on every comma not enclosed within double quotes. (In the VB string literal above, each double quote is doubled to escape it.) I used a couple of free helper applications and the Regular Expression Reference Guide to develop this regular expression and get it working properly:

  1. Expresso Regular Expression Development Tool by Ultrapico which can be found here: http://www.ultrapico.com/Expresso.htm
  2. Rad Software Regular Expression Designer which can be found here: http://www.radsoftware.com.au/regexdesigner/
  3. The Reference Guide can be found here: http://www.regular-expressions.info/reference.html

Once your code is complete, save your changes in the Script designer and exit back to your Data Flow workspace. Click on the Script Component and you will notice there are now two green arrows corresponding to the names of the Outputs you defined earlier. I directed the ErrorRow output to an OLE DB destination so any error rows were written to an error table, and I directed the ParsedRows output to a Data Conversion component. What you do with your correctly parsed rows will depend on the specifics of your project. And that reminds me of a final point I want to make.

This solution may seem like a lot of extra work for the relatively simple task of importing and parsing a text file. But looking beyond that task, the architecture and methods employed in this solution should serve as a model for how to apply custom code to incoming data. The Script Component is an excellent tool for manipulating data that has complex transformation requirements exceeding the capabilities of the other SSIS Data Flow Transformation tool set objects.

Discovering bugs in enterprise software can be a frustrating experience, but investing some time and creative thought can yield a solution that may prove useful for far more than parsing rows of data. Thinking about bug workarounds this way can help you develop your SSIS repertoire, which you can extend to your client projects, adding significant value.


Implementing a Custom Change Data Capture System Part 1: Introduction

Bruce Hendry is a consultant in Slalom’s Technology Solutions group in Dallas, TX. Bruce is passionate about the Microsoft Business Intelligence stack and the .NET Framework and enjoys technology.

Coming up with a title proved to be one of the most difficult parts of beginning this posting. Implementing a Custom Change Data Capture System doesn't really 'capture' what should be said. I recently found myself in a situation with a client where it was necessary to quickly implement update processes between a legacy transactional application and a new transactional system. "That's nothing new," you say–and you are right–except that it was late in the game, there was already an extract-transform-load (ETL) solution in place that was not architected to account for updates because they were never in scope…until that day, and (here was the nut of the challenge) nothing reliable existed in the legacy application to capture changed data.

Dynamics CRM 2011 Report Development using FetchXML


Jayson Goldinger is a Slalom Consultant based in Denver, CO, and a member of the Slalom National CRM practice. His work revolves around Microsoft Dynamics CRM including configuration, customization, report development, and integration. He is also recognized as a Scribe MVP.

The inability to have custom reports was one of the biggest drawbacks of Dynamics CRM Online when it was first released. Fortunately, Microsoft has addressed this issue in the release of Dynamics CRM 2011. However, there is one small catch: you have to use FetchXML (a proprietary query language used in Microsoft Dynamics CRM) for your SSRS report query.

In my opinion, one of the most powerful features of reporting in Dynamics CRM is the ability to use Advanced Find to "pre-filter" the data sent to the report. As a report writer of many years, I was always frustrated by the moving target of parameters that a user might request; when you get more than two or three, the report becomes difficult for the end user to use. By taking advantage of "pre-filtering" with Dynamics CRM and Advanced Find, the parameter options become almost infinite, and as a developer I only need to worry about one thing: setting up the option of "pre-filtering" in my report. It may seem like a difficult task, but it is actually quite simple. I'll walk through the steps.

An Ordinary Day with the Client


Slalom Consultant David Van De Sompele's expertise includes performance tuning, production DBA activities, data modelling, ETL processes and reporting. He's based in Slalom's Seattle office.

If a day with the client is so ordinary, what is there to write about? That is a fair question, and the answer to it is the topic of this article.

Dictionary.com defines the word ordinary as common, mediocre, and unexceptional. But if that is the value you are delivering to your client, then you should consider a career other than consulting. As consultants we specialize in various technologies and methodologies. Blending this expertise with creativity and innovation, we help companies solve complex business and technological problems. Over time this becomes ordinary to us – it's what we do – but this ordinary effort from the consultant should look like an extraordinary effort from the client's perspective.

On rare occasions you may be presented with a particularly challenging situation: uncharted waters, or an unpredictable fluke for which no one has the answer. As a senior consultant you must be willing to dig deep and bring the entire breadth of your professional experience to bear in order to solve the problem. To better illustrate this, I'll relate a recent experience I had. It was one of those situations in which everything seemed to be going well until something very strange occurred and I was asked to jump in and solve the issue.

Using DAPI to Secure MOSS Data


Slalom Consultant Derek Martin is an accomplished Microsoft systems developer and integrator, experienced in developing and deploying SharePoint and CRM solutions, integrating line of business applications, and leveraging existing infrastructure investments.

I saw this very interesting post from Dr. Z's blog, located here, about using the Windows Data Protection API inside of SharePoint to secure data. It is a little thick for me since I'm not truly a developer (yet), but anytime I can see a real-world example with SharePoint AND an example of using the Enterprise Library patterns in the real world, it is worth a mention.

The basic idea is that you can use the application pool identity of your MOSS farm and call into the DAPI to do the heavy lifting of securing data in the MOSS database. The warning is well heeded, of course, about not doing this if you are using a local machine account (Local System or Network Service) to secure your data, because if you open up another WFE, that account would be different and you'd just get gunk back.

The question I have is this: would this model still be useful when using transparent encryption on the SQL Server (available in R2 but not 2008 RTM, I believe), or is the author getting at something more fundamental? Anyway, good example code!
