When Structured Query Language (SQL) Becomes Spaghetti Query Language


David Van De Sompele

Avoiding the Overuse and Misuse of Dynamic SQL

Spaghetti code is a derogatory term for poorly written source code that has a tangled, complex control structure and is prone to errors that are difficult to find, correct, and modify. Anybody who has worked in an IT organization is likely familiar with the term. Perhaps you’ve even had the misfortune of twisting and winding through someone’s spaghetti in an attempt to document, troubleshoot, or modify it. If so, then you’ve lived the nightmare and will empathize with me in this article. Spaghetti code can be written in just about any programming language, and SQL is no exception. For this post I will focus on what I consider to be one of the most egregious types of SQL spaghetti: dynamic SQL.
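To make the contrast concrete, here is a minimal, hypothetical T-SQL sketch; the table and column names are invented for illustration and are not from any real system. The first half shows the string-concatenation style of dynamic SQL that tends to sprawl into spaghetti (and invites SQL injection); the second half shows a parameterized alternative using sp_executesql that stays readable and plan-cache friendly.

-- Spaghetti-prone: the query is assembled by string concatenation, so every
-- new optional filter adds another branch of IF/ELSE string surgery.
DECLARE @city varchar(50);
SET @city = 'Seattle';
DECLARE @sql nvarchar(max);
SET @sql = N'SELECT CustomerID, CustomerName FROM dbo.Customer WHERE 1 = 1';
IF @city IS NOT NULL
    SET @sql = @sql + N' AND City = ''' + @city + N'''';
EXEC (@sql);

-- Less tangled: one parameterized statement executed with sp_executesql.
EXEC sp_executesql
    N'SELECT CustomerID, CustomerName
      FROM dbo.Customer
      WHERE (@city IS NULL OR City = @city)',
    N'@city varchar(50)',
    @city = @city;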

Thinking About Data

“Though the mills of God grind slowly, yet they grind exceeding small…” Henry Wadsworth Longfellow translated this famous line from a German poem written by Friedrich von Logau in 1654.  Imagine if Longfellow worked as a data architect in today’s Information Technology industry.  Perhaps he would have written this now famous line as follows: “Though the databases of God grind slowly, yet they grind exceedingly small.”

This is often how I feel when I begin investigating a database to diagnose performance problems, or when I start documenting a schema and constructing ETL processes to populate a reporting database or data warehouse. As data modelers, data architects, and database developers (all of whom I will collectively refer to as database people for the remainder of this article), we are taught to think about data relationally and dimensionally. Relationally, we normalize data for production OLTP databases, organizing it in such a way as to minimize redundancy and dependency.
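As a minimal sketch of what that looks like in practice (the order-tracking tables below are hypothetical and invented purely for illustration), compare a flattened design that repeats customer attributes on every order row with a normalized pair of tables joined through a foreign key:

-- Flattened: customer attributes are repeated on every order row, so a change
-- of customer city must be applied to many rows (a classic update anomaly).
CREATE TABLE dbo.OrderFlat (
    OrderID      int           NOT NULL PRIMARY KEY,
    CustomerName varchar(100)  NOT NULL,
    CustomerCity varchar(50)   NOT NULL,
    OrderDate    date          NOT NULL,
    OrderTotal   decimal(10,2) NOT NULL
);

-- Normalized: customer attributes live in exactly one place and each order
-- simply references its customer, removing the redundancy and the dependency.
CREATE TABLE dbo.Customer (
    CustomerID   int          NOT NULL PRIMARY KEY,
    CustomerName varchar(100) NOT NULL,
    CustomerCity varchar(50)  NOT NULL
);

CREATE TABLE dbo.CustomerOrder (
    OrderID    int           NOT NULL PRIMARY KEY,
    CustomerID int           NOT NULL REFERENCES dbo.Customer (CustomerID),
    OrderDate  date          NOT NULL,
    OrderTotal decimal(10,2) NOT NULL
);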

Dimensionally, we design data warehouses to have facts and dimensions that are easily understandable and provide valuable decision support to various business entities. High-quality, reliable data that is easy to query and consume is the goal of these design patterns. During my twenty-year career, however, I have discovered that well-designed database schemas and data models are the rare exception, not the rule. And there are a couple of common themes underpinning the bad designs I encounter.

Poorly designed databases are often the result of designers who, instead of thinking about how data will flow through their databases and, more importantly, which people and internal business entities will want to consume that data, simply view databases as a convenient resting place for data. This erroneous view frequently stems from a lack of formal training in database normalization, relational algebra, dimensional modeling, and data object modeling: skills that I believe are essential for anyone serious about enterprise-level database design. The database schema is foundational to almost every business system, so it is imperative to involve skilled database people early in the design process. Failure to do this may result in flawed database schemas that suffer from one or more of the following issues:

  • Lack of extensibility
  • Difficult to maintain
  • Lack of scalability
  • Difficult to query
  • A high degree of data anomalies that render the data unreliable
  • Performance problems

Having worked with a lot of strange database designs (I could probably write an entire book on the subject), I want to briefly mention some of the more commonly encountered database design errors. I am going to classify these errors into two general groups: design errors in production OLTP databases, and design errors in databases intended for reporting.

Production OLTP databases

  1. A completely flattened, un-normalized schema. Databases designed this way will have all sorts of issues when placed under production load, performance and scalability problems among them. I often hear this line from developers: “Well, it worked fine in QA with a small amount of data, but as soon as we deployed to production and threw serious traffic at it, all sorts of performance problems emerged.” Flattened schemas like this frequently lack declarative referential integrity, which leads to data anomalies. Getting reliable reporting from this environment is difficult at best.
  2. A highly normalized schema, possibly over-normalized for the context, but lacking a sufficient or useful payload. Every part of the data model has been taken to its furthest normal form and data integrity is well enforced through primary key/foreign key relationships, but getting useful data requires both excessive table joins and deriving or calculating needed values at query time. For example, I worked on the invoicing portion of an application (the invoicing code was written in Perl) in which customer product usage was calculated during the invoicing process. The final usage totals were printed on the paper invoices, but not stored anywhere in the database. Finance needed this information, both current and twelve months of history, and to get it I had to recreate large parts of the invoicing process in SQL: a herculean task. When I asked the developers why customer usage invoicing totals were not stored in the database, they responded, “I guess we never thought anyone would want that data.” A simple table for persisting those derived totals could have prevented all of this; see the sketch after this list.
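As a hedged illustration of the missing piece in that second example (the table and column names below are hypothetical; they are not taken from the application I described), persisting the derived totals at invoicing time usually requires nothing more than one narrow table, after which twelve months of history becomes a routine query rather than a reconstruction of the invoicing engine.

-- Hypothetical table for persisting usage totals at the moment they are calculated.
CREATE TABLE dbo.InvoiceUsageTotal (
    InvoiceID     int           NOT NULL,
    CustomerID    int           NOT NULL,
    BillingMonth  date          NOT NULL,  -- first day of the billed month
    UsageQuantity decimal(18,4) NOT NULL,
    UsageAmount   decimal(18,2) NOT NULL,
    CONSTRAINT PK_InvoiceUsageTotal PRIMARY KEY (InvoiceID, CustomerID)
);

-- Finance's request then reduces to a simple aggregate over the last twelve months.
SELECT CustomerID, BillingMonth, SUM(UsageAmount) AS TotalUsageAmount
FROM dbo.InvoiceUsageTotal
WHERE BillingMonth >= DATEADD(MONTH, -12, GETDATE())
GROUP BY CustomerID, BillingMonth;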

Reporting databases intended to serve internal business entities

  1. Attempts to build internal reporting functions directly against production OLTP databases. A discussion of OLTP versus OLAP optimization techniques is beyond the scope of this article, but suffice it to say that running I/O-intensive reporting against a production OLTP system that is not optimized for reporting will cause tremendous performance problems.
  2. Building out a secondary physical database server and placing an exact copy of a production OLTP database on it as a “reporting instance” database.  Doing this will certainly remove any chance of overwhelming the production server.  But it will not provide a database optimized for reporting.
  3. Adding a few somewhat haphazard aggregation tables to the “reporting instance” database mentioned above. This may temporarily reduce query times for reports relying on aggregated data, but it is not a long-term substitute for a properly designed dimensional reporting model; a minimal sketch of such a model follows this list.
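For readers who have not seen one, here is a minimal, hypothetical star-schema sketch (all table and column names are invented for illustration): a single fact table of measurable events surrounded by small, descriptive dimension tables, which is what the haphazard aggregation tables above are a poor stand-in for.

-- Hypothetical star schema: one fact table plus descriptive dimensions.
CREATE TABLE dbo.DimDate (
    DateKey       int      NOT NULL PRIMARY KEY,  -- e.g. 20111121
    CalendarDate  date     NOT NULL,
    CalendarYear  smallint NOT NULL,
    CalendarMonth tinyint  NOT NULL
);

CREATE TABLE dbo.DimProduct (
    ProductKey  int          NOT NULL PRIMARY KEY,
    ProductName varchar(100) NOT NULL,
    Category    varchar(50)  NOT NULL
);

CREATE TABLE dbo.FactSales (
    DateKey     int           NOT NULL REFERENCES dbo.DimDate (DateKey),
    ProductKey  int           NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    SalesAmount decimal(18,2) NOT NULL,
    SalesUnits  int           NOT NULL
);

-- Reporting then reduces to simple joins and aggregations across the star.
SELECT d.CalendarYear, p.Category, SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimDate AS d ON d.DateKey = f.DateKey
JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
GROUP BY d.CalendarYear, p.Category;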

Data models are often given short shrift because the original developers, being inexperienced with relational and dimensional databases, do not think correctly about data. This error in thinking frequently results in poor database designs that perform badly and contain unreliable data that is difficult to query. I want to leave you with a specific example of this point by briefly relating a recent client experience of mine.

My client at the time had recently purchased a start-up company whose product was a complex ad serving application/engine.  The SQL Server databases foundational to the application were suffering severe performance problems which rippled through the entire system and resulted in a less-than-stellar customer experience.

At my client’s behest I executed a major system review and quickly ascertained two primary issues: a data architecture that limited scalability, and incorrect table indexing that was a direct result of that architecture. I kept their developers and program manager involved throughout my solution development process, and after a successful deployment that solved their performance issues, the program manager made a key statement to me. She said, “Wow, I never understood the value and importance that a person with strong data architecture and DBA skills could bring to a project like this. You can be certain that on all future projects of this magnitude I will insist on, and budget for, a person with your skill set to be involved at the outset to ensure we avoid these types of database issues.”

Every IT, Program and Project Manager would do well to heed her advice.  Consider spending some time with your recruiting department to find an experienced data architect with a successful track record at the enterprise level.  It will be time and money well-spent.

Bug Workarounds and Adding Value for Your Client


Slalom Consultant David Van De Sompele's expertise includes performance tuning, production DBA activities, data modelling, ETL processes and reporting. He's based in Slalom's Seattle office.

Discovering bugs in enterprise software can be a frustrating experience, especially when these discoveries derail the completion of a time-sensitive project or application. Finding timely and effective workarounds to such issues is one way that we can add value for our clients. I recently had just such an experience while building an ETL process with SQL Server 2005 Integration Services (SSIS). My client’s requirements for the project were straightforward: get a set of files from an FTP server and load them into some tables in a reporting database. The files were in .CSV (comma-separated values) format, meaning each field is separated by a comma, and text values were qualified with double quotes. The process was to be automated, running daily at 5:00 AM.

After developing the initial package I was ready to run it in debug mode to ensure all of the SSIS components were working correctly. After reading about halfway through the first file (roughly 10,000 records), the Flat File Connection Manager threw an error stating that there were more fields in the file than I had defined in the Connection Manager. My initial thought was that perhaps there was a corrupted row of data within the file, so I opened the file in a text editor and went to find the offending line. Listed below is the last row that loaded correctly, followed by the row responsible for the error. I have added the column names to make this more readable and changed the actual data values to protect the client’s sensitive data:

PartnerID PartnerName AdName ConnectionFee ConnectionDate
123, "BNC Company", "Car Repo Print 1", 2.50, 11/20/2011
456, "ABC Company", "Ad one, Main Menu", 3.23, 11/21/2011

Looking carefully at the AdName column in the second row, you will notice that it contains an embedded comma. The text qualifier is a double quote, so anything enclosed within the quotes should be treated as text by the import mechanism. What I discovered, however, was that the Integration Services flat file import mechanism would not correctly parse any line in which the file’s field delimiter (a comma in my case) was embedded within a text field; the text qualifier was ignored and the embedded comma treated as another field break. A quick search on Google revealed that this is a known and acknowledged issue with SQL Server 2005 Integration Services.

I decided to test this issue against other Microsoft products with an import feature to see if they exhibited the same behavior:

MS Excel 2007: Correctly imported and parsed the file. Recognized the double-quote text qualifier and had no issue with the embedded comma.

SQL Server 2000: Correctly imported and parsed the file. Recognized the double-quote text qualifier and had no issue with the embedded comma.

SQL Server 2008: I was unable to test SQL Server 2008 because I had no instance available, but from what I read on various developer message boards the bug also exists in that product; I can neither confirm nor deny this.

SQL Server 2008 R2: Correctly imported and parsed the file. Recognized the double-quote text qualifier and had no issue with the embedded comma. So it appears the bug has been fixed in this release.

Clearly I had stumbled upon a bug that had thwarted my development effort and I needed to find a workaround that would support the automation requirements for this ETL project. There are many possible workarounds to this bug, but I am only going to discuss the three that seemed most tenable for my current situation.

Upgrade to SQL Server 2008 R2: In the long run this was my preferred option, but it was not feasible for several reasons. My client needed this ETL process to be in production within three days, and upgrading a SQL Server can be a major undertaking; the server on which this SSIS package would run houses eight large reporting databases and hundreds of old SQL DTS packages that are slowly being rewritten in Integration Services, so any upgrade would certainly require far more than three days to complete. In addition, an upgrade was not within the corporate hardware/software budget for the current year.

Use the MS Excel 2007 Data Connector: The source files for the import originate from an outside entity and are created on a non-Windows system. Asking the creator of these files to convert them to an Excel format was not possible, and manually opening each file in Excel and saving it as .xlsx was ruled out because complete automation was a requirement for this project.

Custom-parse the file outside of the Flat File Connection Manager’s own parser: This ended up being the best option available. There are many ways to parse a comma-delimited file; rather than list and explore all of them, I am going to explain the method that worked best for me given the project parameters.

My goal was to use the Flat File Connection Manager to import the file but bypass its parsing feature so I could use my own custom parser in its place. Integration Services includes a component that is perfect for the job: the Script Component, found in the Data Flow Transformations section of the Toolbox. Using this component, you can intercept the output data from a Data Flow source, manipulate it with custom code, and then pass it as output to the next component. To summarize, I needed to accomplish three main tasks to achieve success:

  1. Modify properties of the Flat File Connection Manager so it will import the file, but bypass any attempt at parsing each incoming row.
  2. Configure properties of the Script Component to receive the data from the Flat File Connection Manager as an input and define the output columns to which I will assign elements parsed from each input row.
  3. Write the custom code responsible for parsing each line.

Modifying the Flat File Connection Manager to bypass incoming row parsing is easily achievable. Start by pointing it to the location of your text file. On the lower half of the Connection Manager’s General page you can configure the format of the incoming file via a drop-down menu; your choices are Delimited, Fixed width, and Ragged right. Choose Ragged right. On the Columns page do not select any column markers; we care only about getting the entire row. In the Misc section of the Advanced page you will need to define a few attributes of your incoming row: give it a name (I chose ImportRow for mine), set the column delimiter (most likely a line feed), make sure the data type is string [DT_STR], and set the Output Column Width property to 8000. Now that you have defined the Connection Manager, add a new Flat File Source component to your Data Flow and point it at the Flat File Connection Manager. Open the Flat File Source Editor and click on the Columns menu; you will see the column name you defined in the Connection Manager. When the package executes, SSIS will open the file and read each row as one large 8000-character field, ignoring all commas, quotes, and other characters. Before writing custom code to parse the incoming data we must customize some of the Script Component properties.

Customizing the Script Component properties for this solution is fairly involved and can become somewhat tedious if your incoming data has a lot of columns, so pack some patience for this part. When you add a new Script Component to the Data Flow workspace, SSIS will display a pop-up menu from which you must select one of three options; since our script is going to operate on data from an input and provide an output, select the Transformation radio button. Click the Flat File Source you just defined, grab the green arrow, and connect it to the Script Component. Doing this will automatically configure the input source for the component. Our next task is to define the outputs and output columns.

We are going to define two outputs for the Script Component: one to receive rows that parse correctly and another to act as an error handler for rows that do not. With the Script Transformation Editor open, select the Inputs and Outputs menu. There will already be one output, Output 0, defined, but it has no columns. Click on Output 0 and then go to the properties menu. Set the ExclusionGroup property to a value of 1; this is important, as it tells the component that you want to filter rows. Also consider renaming the output to something meaningful, such as ParsedRows. Now we need to define the specific columns for this output. Because the incoming data is an 8000-character text string, it is best to parse each value into an output column of the same data type: string. You can use a Data Conversion component later to explicitly convert values to other data types as necessary. The key point here is to understand how many columns you expect to parse from each row and to assign them to corresponding output columns of sufficient length. To add columns to your output, expand it, select the Output Columns folder, and then click the Add Column button. In the properties section you can assign a name, data type, and length to each column. Using the example data provided earlier in this post, I added five columns, gave them names identical to those in the data example, and ensured they were strings of sufficient length for the incoming values. Upon completion of this step we will add another output to handle rows that do not parse correctly.

Click the Add Output button to add your error handler and give it an appropriate name; I named mine ErrorRow. Set its ExclusionGroup property to 1 and ensure its SynchronousInputID is set to the same input as the first output. We will add two columns to this output: one to capture the line number and one to capture the actual line of data that caused the error. I named my output columns ErrorLineNum and ParsedErrorLine, respectively. Define ErrorLineNum as a four-byte signed integer [DT_I4] and ParsedErrorLine as an 8000-character string. Now that all of the inputs and outputs have been configured, it is time to write some code that will parse the values from each input row and assign them to the output columns.
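Before we get to that code, one small, hedged aside: if, like me, you eventually land the ErrorRow output in a SQL Server error table through an OLE DB destination (as described near the end of this post), a minimal sketch of a matching table, with a name of my own invention, could look like this; its two columns simply mirror the output columns just defined.

-- Hypothetical landing table for rows the Script Component cannot parse.
CREATE TABLE dbo.ImportParseError (
    ErrorLineNum    int           NOT NULL,  -- line number within the source file
    ParsedErrorLine varchar(8000) NOT NULL   -- the raw, unparsed line of text
);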

My custom parsing code consists of a VB.NET script that uses a regular expression to read each row and split it on every comma that is not enclosed within the double-quote text qualifier. If you have never heard of regular expressions, or have only heard them mentioned in passing, here is a quick primer. According to Wikipedia, a regular expression (often referred to as a regex) is a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. Almost every major programming language contains a regex implementation, and the .NET languages are no exception. Complex regular expressions often resemble hieroglyphics (perhaps ancient civilizations were the inventors of regex) and may leave you feeling overwhelmed and asking, “How long will this take to learn?” Fortunately there are a number of free and readily available helper applications you can download that will assist you in writing what you need for the task at hand. I will list some helpful regex websites at the end of this article.

With the Script Transformation editor open, select the Script menu and then click on the Design Script button. The designer will present you with the following generic script:

' Microsoft SQL Server Integration Services user script component
' This is your new script component in Microsoft Visual Basic .NET
' ScriptMain is the entrypoint class for script components

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        '
        ' Add your code here
        '
    End Sub

End Class

We will modify this script to accomplish our parsing task. Since we are going to use Regular Expressions in our code we need to make the Regex namespace available. To do that, add the following line of code in the Imports section:

Imports System.Text.RegularExpressions

For the remainder of the custom code, I will simply list it and provide inline comments:

Public Class ScriptMain
    Inherits UserComponent

    'Create and initialize a variable to track the number of rows.
    Dim rowNum As Integer = 0

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

        'Count the incoming rows. This is helpful for troubleshooting errors.
        rowNum += 1

        'Use the Regex.Split method to split the input string (ImportRow, the single
        'column defined in the Flat File Connection Manager) into an array of substrings
        'at the positions defined by the regular expression match: every comma that is
        'not enclosed within double quotes.
        Dim parseCol As String() = Regex.Split(Row.ImportRow, ",(?!([^""]*""[^""]*"")*[^""]*""[^""]*$)")

        'I expect each row to contain 5 columns. Any more or less than that is an error.
        If parseCol.Length <> 5 Then
            'Grab the problematic input row and the rowNum count.
            Row.ParsedErrorLine = Row.ImportRow
            Row.ErrorLineNum = rowNum
            'Send the row in error to the ErrorRow output.
            Row.DirectRowToErrorRow()
        Else
            'Assign the array values to the output columns defined earlier in the Script Component.
            Row.PartnerID = parseCol(0)
            Row.PartnerName = parseCol(1)
            Row.AdName = parseCol(2)
            Row.ConnectionFee = parseCol(3)
            Row.ConnectionDate = parseCol(4)
            'Send correctly parsed rows to the ParsedRows output.
            Row.DirectRowToParsedRows()
        End If

    End Sub

End Class

As you can see, this code is pretty basic. The actual parsing is done by the Regex.Split method with the regular expression ,(?!([^""]*""[^""]*"")*[^""]*""[^""]*$) (shown here in its VB.NET string-literal form, with each embedded quote doubled), which splits the incoming row on every comma not enclosed within double quotes. I used a couple of free helper applications and the Regular Expression Reference Guide to develop this regular expression and get it working properly:

  1. Expresso Regular Expression Development Tool by Ultrapico which can be found here: http://www.ultrapico.com/Expresso.htm
  2. Rad Software Regular Expression Designer which can be found here: http://www.radsoftware.com.au/regexdesigner/
  3. The Reference Guide can be found here: http://www.regular-expressions.info/reference.html

Once your code is complete, save your changes in the script designer and exit back to your Data Flow workspace. Click on the Script Component and you will notice there are now two green arrows corresponding to the names of the outputs you defined earlier. I directed the ErrorRow output to an OLE DB destination so that any error rows were written to an error table, and I directed the ParsedRows output to a Data Conversion component. What you do with your correctly parsed rows will depend on the specifics of your project. And that reminds me of a final point I want to make.

This solution may seem like a lot of extra work for the relatively simple task of importing and parsing a text file. But looking beyond that task, the architecture and methods employed in this solution should serve as a model for how to apply custom code to incoming data. The Script Component is an excellent tool for manipulating data that has complex transformation requirements exceeding the capabilities of the other SSIS Data Flow Transformation tool set objects.

Discovering bugs in Enterprise software can be a frustrating experience, but investing some time and creative thought can yield a solution which may in fact be useful for far more than parsing rows of data. Thinking about bug workarounds this way can help you develop your SSIS repertoire which you can extend to your client projects, adding significant value.


An Ordinary Day with the Client


Slalom Consultant David Van De Sompele's expertise includes performance tuning, production DBA activities, data modelling, ETL processes and reporting. He's based in Slalom's Seattle office.

If a day with the client is so ordinary, what is there to write about?  That is a fair question and the answer to it is the topic of this article.

Dictionary.com defines the word ordinary as: common, mediocre and unexceptional. But if that is the value you are delivering to your client then you should consider a career other than consulting. As consultants we specialize in various technologies and methodologies.  Blending this expertise with creativity and innovation, we help companies solve complex business and technological problems.  Over time this becomes ordinary to us – it’s what we do – but this ordinary effort from the consultant should look like an extraordinary effort from the client’s perspective.

On rare occasions you may be presented with a particularly challenging situation: uncharted waters, or an unpredictable fluke for which no one has the answer. As a senior consultant you must be willing to dig deep and bring the entire breadth of your professional experience to bear in order to solve the problem. To better illustrate this I’ll relate a recent experience of mine, one of those situations in which everything seemed to be going well until a very strange problem occurred and I was asked to jump in and solve it.

Connecting to a PostgreSQL Instance Using SQL Server Integration Services


Slalom Consultant David Van De Sompele's expertise includes performance tuning, production DBA activities, data modelling, ETL processes and reporting. He's based in Slalom's Seattle office.

About two years ago my current client tasked me with building a SQL Server reporting instance, complete with a set of SSIS (SQL Server Integration Services) ETL processes that would connect to a production PostgreSQL database and retrieve the previous day’s data. To be honest, this was the first time in my eighteen-year career that I had seen a PostgreSQL database. And as I was about to discover, connecting to it with Integration Services for ETL was not a straightforward task and, if my Google searches were any barometer, not a very common one.

PostgreSQL is an open source relational database management system. Its progenitor, Ingres, was developed in the mid-1970s and continues today as another open source RDBMS. While completing a business degree in 1990 I cut my RDBMS teeth using Ingres on an HP Vectra 286; my rather plebeian 8GB iPhone 3G would run circles around that machine. Visiting the PostgreSQL web site you will find the following quote on the home page: “The world’s most advanced open source database.” In many respects I can agree with that statement. But outside connectivity is one area of PostgreSQL that, in this author’s eyes, has fallen woefully short. That is not to say it is impossible to make a…

SQL Server Replication, Push vs. Pull. What is the difference?


Slalom Consultant David Van De Sompele's expertise includes performance tuning, production DBA activities, data modelling, ETL processes and reporting. He's based in Slalom's Seattle office.

Replication is a very useful method of copying data from production systems to standby servers, reporting servers, and downstream data relay points.  Unfortunately, it is an often misunderstood and/or overlooked technology, even among experienced DBAs.  In my professional career, I have often seen replication misused in an attempt to achieve a goal for which it was not intended.  When it is determined that replication is the correct solution (making that determination will be the subject of a future article), you must install and configure three basic components (these are extremely simplified definitions):

  1. Publisher:  This is the source database or databases containing the information you want to replicate.
  2. Distributor:  This is the database and set of jobs responsible for queuing replicated data from the publisher.
  3. Subscriber:  This is the destination database or databases for data coming from the publisher.

Push and Pull, as named in the title of this article, are the two methods available for…
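Since the remainder of the post is not reproduced here, the following is only a hedged, minimal sketch of how the two methods are typically registered with the replication stored procedures; the database, publication, and server names are invented, only the most common parameters are shown, and the full signatures should be checked against Books Online.

-- At the Publisher: create a transactional publication and add an article to it.
EXEC sp_addpublication
    @publication = N'SalesPub',
    @status = N'active';

EXEC sp_addarticle
    @publication = N'SalesPub',
    @article = N'Customer',
    @source_object = N'Customer';

-- Push subscription: the Distribution Agent runs at the Distributor and pushes
-- changes out to the Subscriber on a schedule controlled from the publishing side.
EXEC sp_addsubscription
    @publication = N'SalesPub',
    @subscriber = N'REPORTSRV',
    @destination_db = N'SalesReporting',
    @subscription_type = N'push';

-- Pull subscription: still registered at the Publisher, but the Distribution Agent
-- runs at the Subscriber, which pulls changes on its own schedule. (The Subscriber
-- side also needs sp_addpullsubscription run against the destination database.)
EXEC sp_addsubscription
    @publication = N'SalesPub',
    @subscriber = N'REPORTSRV',
    @destination_db = N'SalesReporting',
    @subscription_type = N'pull';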
