Slalom Consultant David Van De Sompele's expertise includes performance tuning, production DBA activities, data modelling, ETL processes and reporting. He's based in Slalom's Seattle office.
Discovering bugs in Enterprise software can be a frustrating experience, especially when these discoveries derail the completion of a time sensitive project or application. Finding timely and effective workarounds to such issues is one way that we can add value for our clients. I recently had just such an experience while building an ETL process with SQL Server 2005 Integration Services (SSIS). My client’s requirements for the project were straight-forward: Get a set of files from an FTP server and load them to some tables in a reporting database. The files were in .CSV format which simply means comma separated values; each field is separated by a comma. Text values were qualified with double quotes. The process was to be automated, running daily at 5:00 AM.
After developing the initial package I was ready to run it in debug mode to ensure all of the SSIS components were working correctly. After reading about half-way through the first file (roughly 10,000 records), the Flat File Connection Manager threw an error stating that there were more fields in the file than I had defined in the Connection Manager. My initial thought was that perhaps there was a corrupted row of data within the file. So I opened the file in a text editor and went to find the offending line. Listed below is the last row that loaded correctly followed by the row responsible for the error. I have added the column names to make this more readable and I have changed the actual data values to protect the client’s sensitive data:
||“Car Repo Print 1”,
||“Ad one, Main Menu”,
Looking carefully at the AdName column in the second row you will notice it contains an embedded comma. The text qualifier is double quotes so anything enclosed within should be treated as text by the import mechanism. But what I discovered was that the Integration Services flat file import mechanism would not correctly parse any line of the file if the file’s field delimiter (a comma in my case) was embedded within a text field; the text qualifier was ignored and the embedded comma treated as another field break. A quick search on Google revealed that this is a known and acknowledged issue with SQL Server 2005 Integration Services.
I decided to test this issue against other Microsoft products with an import feature to see if they exhibited the same behavior:
MS Excel 2007: Correctly imported and parsed the file. Recognized the double-quote text qualifier and had no issue with the embedded comma.
SQL Server 2000: Correctly imported and parsed the file. Recognized the double-quote text qualifier and had no issue with the embedded comma.
SQL Server 2008: I was unable to test SQL Server 2008 because I had no instance available, but from what I read on various developer message boards the bug also exists in that product; I can neither confirm nor deny this.
SQL Server 2008 R2: Correctly imported and parsed the file. Recognized the double-quote text qualifier and had no issue with the embedded comma. So it appears the bug has been fixed in this release.
Clearly I had stumbled upon a bug that had thwarted my development effort and I needed to find a workaround that would support the automation requirements for this ETL project. There are many possible workarounds to this bug, but I am only going to discuss the three that seemed most tenable for my current situation.
Upgrade to SQL 2008 R2–In the long run this was my preferred option, but it was not feasible for the following reasons. My client needed this ETL process to be in production within three days. Upgrading a SQL Server can be a major undertaking and the server on which this SSIS package would run is no exception. The server houses eight large reporting databases and hundreds of old SQL DTS packages that are slowly being rewritten in Integration Services. Any upgrade would certainly require far more than three days to completion. In addition, an upgrade was not within the corporate hardware/software budget for the current year.
Use the MS Excel 2007 Data Connector–The sources files for the import originate from an outside entity and are created on a non-Windows system. Asking the creator of these files to convert them to an Excel format was not possible. Manually opening each file in Excel and saving it in .xlsx was not possible as complete automation was a requirement for this project.
Custom Parse the file outside of the Flat-file Data Connector–This ended up being the best option available. There are many ways to parse a comma delimited file and rather than list and explore all of those methods, I am going to explain the method that worked best for me given the project parameters.
My goal was to use the flat file connection manager to import the file, but bypass its parsing feature so I could use my own custom parser in its place. Integration Services includes a component that is perfect for the job; the Script Component found in the Data Flow Transformations section of the SSIS Toolkit. Using this component, you can intercept the output data from a Data Flow Source, manipulate it with custom code and then pass it as data output to the next component. So to summarize, I needed to accomplish three main tasks to achieve success:
- Modify properties of the Flat File Connection Manager so it will import the file, but bypass any attempt at parsing each incoming row.
- Configure properties of the Script Component to receive the data from the Flat File Connection Manager as an input and define the output columns to which I will assign elements parsed from each input row.
- Write the custom code responsible for parsing each line.
Modifying the Flat File Connection Manager to bypass incoming row parsing is easily achievable. Start by pointing it to the location of your text file. On the lower half of the Connection Manager’s General page you can configure the format of the incoming file via a drop down menu. Your choices are Delimited, Fixed width, and Ragged right. Choose the Ragged right option. On the Columns menu do not select any column markers; we care only about getting the entire row. In the Misc section of the Advanced menu you will need to define a few attributes of your incoming row. Give it a name. I chose ImportRow for mine. The column delimiter will mostly likely be a line feed. Make sure the data type is a string [DT_STR] and set the Output Column Width property to 8000. Now that you have defined the Connection Manager, define a new Flat File Data Flow source component in the data sources tab and point it to your Flat File Connection Manager. Open the Flat File Source Editor and click on the Columns menu. You will see the column name you defined in the connection manager. When the package executes, SSIS will open the file and read each row as one large 8000 character field, ignoring all commas, quotes and other characters. Before writing custom code to parse the incoming data we must customize some of the Script Component properties.
Customizing the Script Component properties for this solution is highly involved and can become somewhat tedious if your incoming data has a lot of columns, so pack some patience for this part. When you add a new Script Component to the Data Flow workspace SSIS will display a pop-up menu and you must select one of three options. Since our script is going to operate on data from an input and provide an output we want to select the Transformation radio button. Click the Flat File Source you just defined and grab the green arrow and connect it to the Script Component. Doing this will automatically configure the input source for the component. Our next task is to define the Outputs and Output Columns.
We are going to define two Outputs for the Script Component, one to receive rows that parse correctly and another to act as an error handler for rows that do not parse correctly. With the Script Transformation editor open, select the Inputs and Outputs menu. There will already be one existing output, Output 0, defined but it has no columns. Click on Output 0 and then go to the properties menu. Set the Exclusion Group property to a value of 1. This is important as it tells the component that you want to filter rows. Also consider renaming the output to something meaningful such as ParsedRows. Now we need to define the specific columns for this output. Because the incoming data is an 8000 character text string, it is best to parse each value into an output column of the same data type; string. You can use the Data Conversion object later on to explicitly convert values to other data types as necessary. The key point here is to understand how many columns you expect to parse from each row and that you assign them to corresponding column outputs of sufficient length. To add columns to your output, expand it and select the Output Columns folder and then click the Add Column button. In the properties section you can assign a name, data type and length to each column. Using the example data provided earlier in this post, I added five columns and assigned them names identical to the data example and ensured they were strings of sufficient length for the incoming values. Upon completion of this step we will add another Output to handle rows that do not parse correctly.
Click the Add Output button to add your error handler and give it an appropriate name. I named mine ErrorRow. Set the Exclusion Group property value to 1 and ensure your SynchronousInputId is set to the name of the Input. We will add two columns to this output. One to capture the line number and one to capture the actual line of data that caused the error. I named my output columns ErrorLineNum and ParsedErrorLine respectively. Define the ErrorLineNum column as a 4-byte integer and define the ParsedErrorLine as an 8000 character string. Now that all of the inputs and outputs have been configured it is time to write some code that will parse out the values from each input row and assign them to the output columns.
My custom parsing code will consist of a VB.NET script that uses a Regular Expression to read each row and split it on every comma that is not enclosed within the double quote text qualifier. If you have never heard of Regular Expressions, or only heard them mentioned in passing here is quick primer. According to Wikipedia, a regular expression (often referred to as Regex) is a concise and flexible means for matching strings of text, such as particular characters, words or patterns of characters. Almost every major programming language contains a Regex implementation and the .NET languages are no exception to this. Complex Regular Expressions often resemble hieroglyphics-perhaps ancient civilizations were the inventors of Regex-and may leave you feeling overwhelmed and asking, “How long will this take to learn?” Fortunately there are a lot of free and readily available helper applications you can download that will assist you in writing what you need to achieve the task at hand. I will list some helpful Regex websites at the end of this article.
With the Script Transformation editor open, select the Script menu and then click on the Design Script button. The designer will present you with the following generic script:
‘ Microsoft SQL Server Integration Services user script component
‘ This is your new script component in Microsoft Visual Basic .NET
‘ ScriptMain is the entrypoint class for script components
Public Class ScriptMain
Public Overrides Sub Input0_ProcessInputRow (ByVal Row As Input0Buffer)
‘ Add your code here
We will modify this script to accomplish our parsing task. Since we are going to use Regular Expressions in our code we need to make the Regex namespace available. To do that, add the following line of code in the Imports section:
For the remainder of the custom code, I will simply list it and provide inline comments:
Public Class ScriptMain
‘Create and initialize variable to track the number of rows
Dim rowNum As Int = 0
Public Overrides Sub Input0_ProcessInputRow (ByVal Row As Input0Buffer)
‘We will use this to count the incoming rows. This is helpful for troubleshooting errors.
rowNum += 1
‘Use the regex.split method to split the input string (InputRow as defined in the flat file source) into an array of matching substrings at the positions defined by the regular expression match.
AsString ()=Regex.Split(Row.InputRow, “,(?!([^""]*”"[^""]*”")*[^""]*”"[^""]*$)”)
‘I expect each row to contain 5 columns. Any more or less than that will throw an error.
If parseCol.Length <> 5 Then
‘Grab the problematic input row and rowNum count.
Row.ParsedErrorLine = Row.InputRow
Row.ErrorLineNum = rowNum
‘Send the row in error to the ErrorRow output.
‘Assign the array values to the output columns you defined earlier in the script component..
Row.PartnerId = parseCol (0)
Row.PartnerName = parseCol (1)
Row.AdName = parseCol (2)
Row.ConnectionFee = parseCol (3)
Row.ConnectionDate = parseCol (4)
‘Send correctly parsed rows to the ParsedRows output
As you can see, this code is pretty basic. The actual parsing occurs by using the Regex.Split method with the Regular Expression: ,(?!([^""]*”"[^""]*”")*[^""]*”"[^""]*$) to split the incoming row of the file on every comma not enclosed within double quotes. I used a couple of free helper applications and the Regular Expression Reference Guide to develop this regular expression and get it working properly:
- Expresso Regular Expression Development Tool by Ultrapico which can be found here: http://www.ultrapico.com/Expresso.htm
- Rad Software Regular Expression Designer which can be found here: http://www.radsoftware.com.au/regexdesigner/
- The Reference Guide can be found here: http://www.regular-expressions.info/reference.html
Once your code is completed save your changes in the Script designer and exit back to your Data Flow workspace. Click on the Script component and you will notice there are now two green arrows corresponding to the names of the Outputs you defined earlier. I directed the ErrorRow output to an OLE DB source so any error rows were written to an error table. I directed the ParsedRows output to Data Conversion component. What you do with your correctly parsed rows will depend on the specifics of your project. And that reminds me of a final point I want to make.
This solution may seem like a lot of extra work for the relatively simple task of importing and parsing a text file. But looking beyond that task, the architecture and methods employed in this solution should serve as a model for how to apply custom code to incoming data. The Script Component is an excellent tool for manipulating data that has complex transformation requirements exceeding the capabilities of the other SSIS Data Flow Transformation tool set objects.
Discovering bugs in Enterprise software can be a frustrating experience, but investing some time and creative thought can yield a solution which may in fact be useful for far more than parsing rows of data. Thinking about bug workarounds this way can help you develop your SSIS repertoire which you can extend to your client projects, adding significant value.
Subscribe to follow new Business Intelligence posts