A key issue for most data marts and data warehouses is Load Performance. How can we quickly and efficently load the latest data into
the data mart so that it is available for reporting and analysis?
A data mart or warehouse is a set of data extracted from other systems. For many
organizations the most recent data has the most value. Although many data marts are refreshed only nightly a high percentage of our projects
have also involved data being refreshed more frequently during the day.
Even when data is only loaded nightly the night is often not long enough. We've developed a number of techniques to ensure that data can be loaded
quickly and efficiently
- A simple idea that can yield great results is to clearly delineate between Extraction, Transformation and Loading (ETL.) By being
careful to Extract complete tables from the source systems without complex queries we minimize the load on the source system and get the
data extracted as quickly as possible to a staging database. Then we can do the more complex Transformation tasks on the data warehouse
server without impacting the source systems.
- For Microsoft SQL Server data mart careful fact table design with a surrogate primary key can ensure that inserts during the Load
phase are always at the physical end of the table.
This improves the performance of the insert operation by reducing table fragmentation.
- One way to improve load performance is to only load the data that has to be loaded. We've encountered very few source systems
that have truly reliable change detection. This puts the data mart design team in the awkward situation of having to reload all
the fact rows every night - even for historical facts. We've developed two techniques to address this: 1)detecting changed rows from
the source system via checksum based change detection or 2)establishing a historical window with the business and only updating fact
rows that fall within that window (e.g. the last year.)
- A technique that is closely related to only loading that data that has to be loaded is to partition the fact table so that only
a subset (a partition) of the table is touched by insert activity. This is now a feature of SQL Server 2008 but we started carefully
doing manually partitioned fact tables with SQL Server 2000. By being careful with the table keys we can create a situation where
query performance is also improved because the query optimizer only has to access partitions of the table to satisfy a query.
It is easy to overlook the fact that merely having current data is not enough - the consumers of the information need to
know how up to date the information is.
We recommend that the Extraction, Transformation and Loading (ETL) process that pulls data on a scheduled basis
from source systems into the data mart also provide an updated table of date and time values to show how up to date the data is. This is then
easy for a report developer to include on the report.