ETL,
or Extract, Transform and Load, eases the combination of heterogeneous sources
into a unified central repository. Usually this repository is a data warehouse
or data mart that supports enterprise business intelligence.
Extract - read data from multiple
source systems into a single format. This process extracts the data from each
native system and saves it to one target location. The source data may come in
any number of forms: relational databases, flat files, or document
repositories. Usually the goal is to extract the entire source system data
unmodified, though certain checks and filters may be performed here to ensure
the data meets an expected layout or to selectively remove data (e.g.
potentially confidential information).
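The extract step above can be sketched in a few lines of Python. This is a minimal illustration, not a production extractor: the function names, the dict-of-sources layout, and the `_source_system` tag are all assumptions made for the example, and real pipelines would add incremental extraction and error handling.

```python
import csv
import json

def extract_csv(path):
    """Read a CSV source into a list of plain dicts (the common format)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_json(path):
    """Read a JSON-lines source into the same list-of-dicts format."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def extract_all(sources):
    """Pull every source into one staging list, tagging each row with
    its origin so later steps can cross-reference the systems."""
    staged = []
    for name, (reader, path) in sources.items():
        for row in reader(path):
            row["_source_system"] = name
            staged.append(row)
    return staged
```

The key point the sketch shows is that extraction normalizes only the *container* (everything becomes a row dict), not the *content* - that is the transform step's job.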
Transform - in this step, the data
from the various systems is made consistent and linked. Some of the key
operations here are:
Standardization - data is mapped to
a consistent set of lookup values (e.g. "US", "USA", "United States", and
blank/null all mapped to the standard ISO country code)
Cleansing - perform validity checks
and either remove or modify problem data
Surrogate keys - new key values
applied to similar data from different source systems prevent future key
collisions and provide a cross-reference across these systems
Transposing - organizes data to
optimize reporting. Many source systems are optimized for transactional
performance but the warehouse will be primarily used for reporting. Often this
involves denormalizing and re-organizing into a dimensional model.
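The first three transform operations above can be sketched together. This is a simplified illustration under stated assumptions: the `COUNTRY_LOOKUP` table mirrors the country-code example from the text (including mapping blank/null to a default code), the field names are hypothetical, and rows failing the validity check are simply dropped.

```python
import itertools

# Standardization: hypothetical lookup mapping raw country variants
# onto one ISO 3166-1 alpha-2 code, as in the example above.
COUNTRY_LOOKUP = {"us": "US", "usa": "US", "united states": "US", "": "US"}

_next_key = itertools.count(1)
_key_xref = {}  # (source_system, natural_key) -> surrogate key

def surrogate_key(system, natural_key):
    """Assign one warehouse-wide key per source row; the cross-reference
    dict guarantees the same source row always maps to the same key."""
    return _key_xref.setdefault((system, natural_key), next(_next_key))

def transform(row):
    """Standardize, cleanse, and key a single staged row."""
    country = COUNTRY_LOOKUP.get((row.get("country") or "").strip().lower())
    if country is None:
        return None  # cleansing: remove rows with an unrecognized country
    return {
        "customer_key": surrogate_key(row["_source_system"], row["id"]),
        "name": row["name"].strip(),
        "country": country,
    }
```

Note how two rows with the same natural key but different source systems receive different surrogate keys, which is exactly the collision the text warns about.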
Load - the transformed data is now
written to the warehouse or mart. The load process usually preserves prior
data; in some instances existing warehouse data is never removed, only marked
as inactive. This provides a full audit trail and supports historical
reporting.
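The never-delete load described above can be sketched as a soft-delete merge. This is a toy in-memory version of the idea (real warehouses would use SQL MERGE or a slowly-changing-dimension load); the dict-of-lists table layout and field names are assumptions for the example.

```python
import datetime

def load(warehouse, rows):
    """Merge transformed rows into the warehouse, never deleting: a
    replaced row is marked inactive rather than removed, preserving
    full history for auditing and historical reporting."""
    today = datetime.date.today().isoformat()
    for row in rows:
        versions = warehouse.setdefault(row["customer_key"], [])
        if versions and versions[-1]["record"] == row:
            continue  # unchanged since the last load; nothing to do
        if versions:
            versions[-1]["active"] = False  # retire the prior version
            versions[-1]["end_date"] = today
        versions.append(
            {"record": row, "active": True, "start_date": today, "end_date": None}
        )
    return warehouse
```

A report on current state filters on `active`, while a historical report uses the `start_date`/`end_date` ranges, which is what makes the audit trail useful.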