Business today runs much like it has for millennia. Historically, most businesses have been run more by gut feel than any actual metric, other than gross and/or net profit. Sometimes even the net profit picture was difficult to determine. For example, in the first half of the last century, Ford Motor Company famously determined the price for products by weighing invoices on a scale. The invoices were large stacks of paper with hundreds of pages, so going through them in detail wasn’t practical.
“15 pounds? Must be about $200,000.”
With accuracy like that, what room is there for improvement?
But seriously, the problem was not that companies didn’t have lots of data. They did—every one of the Ford invoices may have had line items for each part on a vehicle. But it wasn’t in a format that could be easily digested or comprehended.
Fast forward to today: with the advent of high-powered computers and advanced databases, all the data that companies previously captured in antiquated forms is now captured digitally. That means it is accessible. And even more (so much more!) is being captured every day, including data that was previously lost.
With the exploitation of corporate data proving so valuable, data warehouses have become very popular. One might think that creating and maintaining a successful data warehouse is a standard IT process, and to be sure, many successful data warehouses exist and function out “in the wild.” However, one mistake IT management often makes is to think of a data warehouse as more or less the same as any other database application. The truth is they are not the same, and grave mistakes can be made by not recognizing the differences.
Many, if not most, database applications share data with other systems. Some feed data downstream; some receive data from upstream. But with a major data warehouse, the data sharing increases by orders of magnitude. Many data warehouses have hundreds of input data feeds, each requiring a means of validation and a way to back out erroneous data once it has been loaded.
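One common way to make a feed "back-out-able" is to tag every loaded row with a batch identifier, so a bad feed can be removed as a unit. A minimal sketch, using SQLite and invented table and column names purely for illustration:

```python
import sqlite3

# Illustrative staging table; in a real warehouse the batch id would also
# drive downstream reprocessing, not just deletion.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_sales (batch_id INTEGER, region TEXT, amount REAL)")

def load_feed(conn, batch_id, rows):
    """Load one feed's rows, all tagged with the same batch id."""
    conn.executemany(
        "INSERT INTO staging_sales VALUES (?, ?, ?)",
        [(batch_id, region, amount) for region, amount in rows],
    )

def back_out_feed(conn, batch_id):
    """Remove every row from an erroneous feed in one step."""
    conn.execute("DELETE FROM staging_sales WHERE batch_id = ?", (batch_id,))

load_feed(conn, 1, [("east", 100.0), ("west", 250.0)])
load_feed(conn, 2, [("east", -999999.0)])   # a feed found to be erroneous
back_out_feed(conn, 2)
remaining = conn.execute("SELECT COUNT(*) FROM staging_sales").fetchone()[0]
# only the good feed's rows survive
```

The key design point is that the back-out is a single operation keyed on the batch, rather than a hunt for individual bad rows after they have been mixed into the warehouse.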
For a more modestly sized information system, the job of data validation lies with the user community. Not so with a large data warehouse. Usually there is too much data and it has gone through too many aggregation steps for the user to really validate data quality. Therefore this job lies in the hands of the maintenance community. This fact can be easily missed by IT management that is constantly trying to cut cost and views maintenance as pretty much superfluous. Maintenance activities can be constrained with smaller IT systems but they absolutely cannot with a data warehouse. The entire purpose of a data warehouse is to capture large amounts of data on regular cycles and aggregate that data into a meaningful structure.
The data that feeds into a data warehouse is in a raw form. To create meaningful information for end users, the data warehouse must clean, aggregate, and normalize this data so that data from multiple disjoint sources makes sense at the top level. This processing can be extremely intensive and adds a lot of complexity to maintenance (including considerations for processing time, data access rates, disaster recovery, data quality, etc.). Most of these concerns would be moderate at best for a standard IT system.
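The clean/normalize/aggregate steps can be sketched in miniature. Here two hypothetical feeds use different field names and units; both the schemas and the values are invented for illustration, not taken from any real system:

```python
from collections import defaultdict

feed_a = [{"region": "EAST", "sales_usd": "1200.50"},
          {"region": "WEST", "sales_usd": "n/a"}]        # dirty value
feed_b = [{"rgn": "east", "sales_cents": 75000}]          # different schema and units

def clean_a(rows):
    # Normalize keys to lowercase; drop (or route to an error queue) rows
    # whose amount does not parse.
    for r in rows:
        try:
            yield r["region"].lower(), float(r["sales_usd"])
        except ValueError:
            pass

def clean_b(rows):
    # Normalize cents to dollars so both feeds use the same unit.
    for r in rows:
        yield r["rgn"].lower(), r["sales_cents"] / 100.0

# Aggregate the cleaned, normalized rows into top-level totals.
totals = defaultdict(float)
for region, amount in list(clean_a(feed_a)) + list(clean_b(feed_b)):
    totals[region] += amount
```

Even this toy version shows why maintenance is hard: the dropped "n/a" row silently changes the top-level number, which is exactly the kind of effect the maintenance team must be able to trace.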
Data warehouses can get big—much, much bigger than standard IT systems. This is an issue not only from the storage standpoint but also for disaster recovery. Most databases use some form of log shipping to perform disaster recovery. But a large data warehouse with all of its input data and processing can easily generate hundreds of gigabytes of archive/redo logs in a single day. The job of capturing, moving, storing and applying that much log data requires significant attention.
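A quick back-of-envelope calculation shows why this deserves attention. Both figures below are illustrative assumptions, not measurements:

```python
# Assumed daily archive/redo volume and assumed WAN bandwidth to the
# disaster-recovery site -- both hypothetical.
daily_log_gb = 300
wan_mbps = 200

daily_log_bits = daily_log_gb * 8 * 1024**3
transfer_hours = daily_log_bits / (wan_mbps * 10**6) / 3600
# At these assumed rates, shipping one day's logs takes several hours
# of sustained transfer -- before any time spent applying them.
```

If the nightly load window is also a few hours, log shipping alone can consume a large fraction of it, which is why capture, transfer, and apply all need explicit capacity planning.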
For most IT systems, maintenance consists of programming required to fix bugs or to make small enhancements. This applies to data warehouses as well, but they also require maintenance of data. Data feeds must be constantly watched and validated, bad data must be backed out and reprocessed, and sometimes special data must be entered directly by hand. As mentioned above, data validation is usually handled by the user community. But with a data warehouse, given the massive data feeds and aggregation, the maintenance team must have an explicit understanding of how base-level data is validated and how it is processed into display data.
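"Watching" a feed usually means automated sanity checks that run before a feed is promoted. A minimal sketch of one such check; the 50% threshold is an invented example, and real checks would also cover totals, nulls, and referential integrity:

```python
def feed_looks_sane(today_rows, yesterday_rows, tolerance=0.5):
    """Flag a feed whose row count swings more than `tolerance` (here 50%)
    from the previous cycle -- a common symptom of an upstream fault."""
    if yesterday_rows == 0:
        return today_rows == 0
    change = abs(today_rows - yesterday_rows) / yesterday_rows
    return change <= tolerance

assert feed_looks_sane(10500, 10000)       # ~5% swing: proceed
assert not feed_looks_sane(2000, 10000)    # 80% drop: hold the feed
```

Checks like this are exactly the kind of maintenance artifact that encodes the team's understanding of how base-level data behaves, which is why that understanding cannot live only with end users.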
Standard IT systems usually have a specific audience in the end-user community. In contrast, data warehouses are usually targeted at a user group of senior-level business process and technology experts. Often, those are the users who understand how to make sense of the aggregated numbers.
However, there are those who feel that regular end users may be able to benefit from greater access to a data warehouse. As these large systems gain acceptance in the broader user community, even greater attention to maintenance will be required. Put simply, a data warehouse does TONS more processing than a standard IT system. The ability to guarantee data accuracy through this processing cannot be ignored.
All in all, the above points are not meant to provide an in-depth analysis of data warehouses; they are meant to point out the underlying differences between these systems and standard IT applications. Without a real understanding of the required maintenance for data warehouses, IT management (ever pressured for cost savings) may make mistaken assumptions about how to support these very large and complex systems.