Thinking About Data

“Though the mills of God grind slowly, yet they grind exceeding small…” Henry Wadsworth Longfellow translated this famous line from a German poem written by Friedrich von Logau in 1654.  Imagine if Longfellow worked as a data architect in today’s Information Technology industry.  Perhaps he would have written this now-famous line as follows: “Though the databases of God grind slowly, yet they grind exceedingly small.”

This is often how I feel when I begin investigating a database to diagnose performance problems, or to start documenting the schema and constructing an ETL process to populate a reporting database or data warehouse.  As data modelers, data architects and database developers (all of whom I will collectively refer to as database people for the remainder of this article), we are taught to think about data relationally and dimensionally. Relationally, we normalize data for production OLTP databases, organizing it in such a way as to minimize redundancy and dependency.

Dimensionally, we design data warehouses with facts and dimensions that are easily understandable and provide valuable decision support to various business entities.  High-quality, reliable data that is easy to query and consume is the goal of these design patterns.  During my twenty-year career, however, I have discovered that well-designed database schemas and data models are the exception, not the rule.  And there are a couple of common themes underpinning the bad designs I encounter.

Poorly designed databases are often the result of designers who, instead of thinking about how data will flow through their databases and, more importantly, which people and internal business entities will want to consume that data, simply view databases as a convenient resting place for data.  This erroneous view frequently stems from a lack of formal training in database normalization, relational algebra, dimensional modeling and data object modeling, skills that I believe are essential for anyone serious about enterprise-level database design.  The database schema is foundational to almost every business system, so it is imperative to involve skilled database people early in the design process.  Failure to do this may result in flawed database schemas that suffer from one or more of the following issues:

  • Lack of extensibility
  • Difficult to maintain
  • Lack of scalability
  • Difficult to query
  • A high number of data anomalies that render the data unreliable
  • Performance problems

Having worked with a lot of weird and strange database designs (I could probably write an entire book on the subject), I want to briefly mention some of the more commonly encountered database design errors.  I am going to classify these errors into two general groups: design errors in production OLTP databases and design errors in databases intended for reporting.

Production OLTP databases

  1. A completely flattened, un-normalized schema. Databases designed this way will have all sorts of issues, performance and scalability for example, when placed under production load. I often hear this line from developers: “Well, it worked fine in QA with a small amount of data, but as soon as we deployed to production and threw serious traffic at it, all sorts of performance problems emerged.”  Flattened schemas like this frequently lack declarative referential integrity, which leads to data anomalies. Getting reliable reporting from this environment is difficult at best.
  2. A highly normalized schema, possibly over-normalized for the context, but lacking a sufficient or useful payload.  Every part of the data model has been taken to its furthest normal form and data integrity is well enforced through primary key/foreign key relationships. But getting useful data requires both excessive table joins and deriving or calculating needed values at query time. For example, I worked on the invoicing portion of an application (the invoicing code was written in Perl) in which customer product usage was calculated during the invoicing process.  The final usage totals were printed on the paper invoices, but not stored anywhere in the database.  Finance needed this information, both current and twelve months of history, and to get it I had to recreate large parts of the invoicing process in SQL, a herculean task.  When I asked the developers why customer usage invoicing totals were not stored in the database, they responded, “I guess we never thought anyone would want that data.”
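
Both failure modes in the list above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely for convenience, not because the systems described here ran on it, and every table, column and function name in it (customer, usage_event, invoice, total_usage_units, create_invoice, usage_history) is hypothetical. The schema declares foreign keys so the database itself rejects orphaned rows, and the invoicing step stores the usage total it calculates rather than discarding it once the invoice is printed.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not taken from any real system.
SCHEMA = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE usage_event (
    usage_event_id INTEGER PRIMARY KEY,
    customer_id    INTEGER NOT NULL REFERENCES customer(customer_id),
    usage_month    TEXT NOT NULL,       -- e.g. '2024-01'
    units          INTEGER NOT NULL
);
CREATE TABLE invoice (
    invoice_id        INTEGER PRIMARY KEY,
    customer_id       INTEGER NOT NULL REFERENCES customer(customer_id),
    usage_month       TEXT NOT NULL,
    total_usage_units INTEGER NOT NULL, -- calculated total, kept for later reporting
    UNIQUE (customer_id, usage_month)
);
"""


def open_db():
    """Create an in-memory database with foreign key enforcement turned on."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this is enabled on the connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def create_invoice(conn, customer_id, usage_month):
    """Calculate a month's usage and persist the total on the invoice row."""
    total = conn.execute(
        "SELECT COALESCE(SUM(units), 0) FROM usage_event "
        "WHERE customer_id = ? AND usage_month = ?",
        (customer_id, usage_month),
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO invoice (customer_id, usage_month, total_usage_units) "
        "VALUES (?, ?, ?)",
        (customer_id, usage_month, total),
    )
    conn.commit()
    return total


def usage_history(conn, customer_id):
    """Return stored (month, total) pairs without re-running invoicing logic."""
    return conn.execute(
        "SELECT usage_month, total_usage_units FROM invoice "
        "WHERE customer_id = ? ORDER BY usage_month",
        (customer_id,),
    ).fetchall()
```

With the total persisted, a request such as twelve months of usage history becomes a single query against the invoice table instead of a reimplementation of the invoicing code.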

Reporting databases intended to serve internal business entities

  1. Attempts to build internal reporting functions directly against production OLTP databases.  A discussion of OLTP optimization techniques vs. OLAP optimization techniques is beyond the scope of this article. But suffice it to say that attempts to run I/O intensive reporting against a production OLTP system which is not optimized for reporting will cause tremendous performance problems.
  2. Building out a secondary physical database server and placing an exact copy of a production OLTP database on it as a “reporting instance” database.  Doing this will certainly remove any chance of overwhelming the production server.  But it will not provide a database optimized for reporting.
  3. Adding a few somewhat haphazard aggregation tables to the “reporting instance” database mentioned above.  This may temporarily reduce query times for reports relying on aggregated data, but it is not a long-term substitute for a properly designed dimensional reporting model.
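
For contrast with the stopgaps listed above, here is a rough sketch of what a small dimensional (star schema) model might look like. It again uses Python's built-in sqlite3 module only for illustration, and all names (dim_date, dim_product, fact_sales, revenue_by_category_and_month) are assumptions made up for this example. The point is the shape: one narrow fact table of measurements, surrounded by descriptive dimension tables that reports can slice by.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to two dimension tables.
STAR_SCHEMA = """
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,   -- e.g. 20240115
    year     INTEGER NOT NULL,
    month    INTEGER NOT NULL
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category     TEXT NOT NULL
);
CREATE TABLE fact_sales (
    date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
    product_key  INTEGER NOT NULL REFERENCES dim_product(product_key),
    quantity     INTEGER NOT NULL,
    amount_cents INTEGER NOT NULL
);
"""


def build_reporting_db():
    """Create an empty in-memory reporting database with the star schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(STAR_SCHEMA)
    return conn


def revenue_by_category_and_month(conn):
    """Aggregate the fact table by attributes drawn from both dimensions."""
    return conn.execute(
        """
        SELECT d.year, d.month, p.category, SUM(f.amount_cents) AS revenue_cents
        FROM fact_sales AS f
        JOIN dim_date    AS d ON d.date_key = f.date_key
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY d.year, d.month, p.category
        ORDER BY d.year, d.month, p.category
        """
    ).fetchall()
```

Every report in this shape follows the same pattern: join the fact table to the dimensions it needs, then group by the attributes of interest. That predictability is what a handful of ad hoc aggregation tables does not provide.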

Data models are often given short shrift because the original developers, being inexperienced with relational and dimensional databases, do not think correctly about data. This flawed thinking frequently results in database designs that perform poorly and contain unreliable data that is difficult to query.  I want to leave you with a specific example of this point by briefly relating a somewhat recent client experience of mine.

My client at the time had recently purchased a start-up company whose product was a complex ad serving application/engine.  The SQL Server databases foundational to the application were suffering severe performance problems which rippled through the entire system and resulted in a less-than-stellar customer experience.

At my client’s behest I executed a major system review and quickly ascertained two primary issues: a data architecture that limited scalability, and incorrect table indexing that was a direct result of the architecture. I kept their developers and program manager involved in my solution development process and, after a successful deployment that solved their performance issues, the program manager made a key statement to me. She said, “Wow, I never understood the value and importance that a person with strong data architecture and DBA skills could bring to a project like this.  You can be certain that on all future projects of this magnitude I will insist on, and budget for, a person with your skillset to be involved at the outset to ensure we avoid these types of database issues.”

Every IT, program and project manager would do well to heed her advice.  Consider spending some time with your recruiting department to find an experienced data architect with a successful track record at the enterprise level.  It will be time and money well spent.

About David Van De Sompele
Slalom Consultant David Van De Sompele's expertise includes performance tuning, production DBA activities, data modelling, ETL processes and reporting. He's based in Slalom's Seattle office.

One Response to Thinking About Data

  1. I totally agree with your post, especially the last two paragraphs. It seems like strong data modeling and architecture skills are missing from a conspicuous number of development efforts. More often than not, this leaves a long lasting impact on the overall value, maintainability and scalability of what was built. Here is a related post I think you will like, which I think defines the “data architect” skill very well:
