Over the past decades, the concept of data warehousing has been spread out to everywhere in the business world. Organizations have been practicing hard on achieving successful data warehouse architectures. Lessons have been learnt from those who have succeeded, as well as those who have not. Several key developments in the data warehousing industry denote the past and present of this discipline. Specifically, starting with very few industry vendors in the 80s such as Teradata, many IT vendors like Microsoft, IBM and Oracles are extending their database management systems (DBMSs) to have sufficient support on data warehouses. The early years’ data warehousing theory and engineering practices have been well recorded in the publications of Inmon and Kimball.
Data warehouse architecture is a portfolio of perspectives on how different architecture pieces of a data warehouse system are connected and interacting with each other. It reflects how the academic research and industry development influence the data warehousing practices of different enterprises. For example, from a computing infrastructure perspective, data warehouse architecture has gone from past mainframe analytics to client/middleware/server environment, and now to service-oriented computing as well as the cloud computing concepts. With the rapid growth of information volume and more requirements arriving from the business side, many IT organizations of large business enterprises are facing the challenge of building an enterprise-wide data warehouse that integrates and manages various types of information that comes from different corners of the enterprises and provides the solid information for business analysis in a timely manner. Successful data warehouse architecture must be able to ensure the processing efficiency, the information correctness, and propagation of metadata while managing over terabytes of data with a daily growth of over gigabytes.
As in the past decade, practices of data warehouse architecture have been focused on addressing classical issues such as the data integration needs, the data quality and metadata control, the data modeling requirements and the performance acceptance from both the data management and the analytical sides. Specifically, the data extraction, transformation, and loading (ETL) process has to manage large volume of data with an efficient manner by allowing easy and fast scaling up/out hardware configurations. Extraction of metadata and reconciliation of data quality requirements must also be fulfilled through the data integration process in order to enable the data lineage across of the whole data lifecycle in the warehouse. An enterprise-wide data model provides unified, consolidated view of the data which enables a consistent, logical representation of business data across different functional areas of a whole enterprise. As the data management side of data warehouse is focused on loading the data in an efficient manner while the analytical users are more interested in retrieving data in a fast and agile way, data warehouse architecture has to enable an easy way of finding the balance of both sides.
Built upon the past decade’s research explorations, data warehouse software vendors are instantiating tools and engineering practices on these classical architecture topics. While vendors are rolling out more and more parallel-processing database and ETL engines, enterprise-wide metadata and data quality tools, and eagerly extending their center of excellence with vast amount of data warehousing practices, both the data warehouse industry and academic worlds are facing new challenges when novel concepts such as SOA, web 2.0, big data and cloud computing are spreading over the whole IT community.
The rest of this article describes different architectures that exist or has existed in modern organizations.
Point to Point Integration
The following picture depicts the component-perspective of a typical type of data warehouse architecture. The figure describes a pattern where different operational systems are connected to different data marts which are then used by ad hoc queries or reporting applications. These operational systems are normally understood as systems that capture the transactions of a line of business and they should be thought of as outside the data warehouse. A data mart is a relatively small repository that contains a subset of the organization data. Data marts are normally created to serve requirements from specific business areas.
In this “point-to-point integration” architecture, end users normally access data in the data marts for certain analytical or reporting purposes. This architecture pattern allows different data marts to be directly connected to required source of data and thus enable the “fast time-to-market” requirements from the business side. However, the amount of connections between operational systems and data marts are becoming overwhelming over the time. It is impossible to achieve a unified view of data in the whole organization.
Production Data Warehouse
The following picture depicts an architecture where all the operational systems deliver their production data into a single repository and data marts and different applications and users can use these production data for purposes like analysis and reporting. Here the “production data warehouse” contains production data from different operational systems. There is normally no transformation to the production data after it is loaded into the production data warehouse.
In this “production data warehouse” architecture, the management of data at different operational systems is eased by putting the data into a single repository. It becomes easier to get an overview of all kinds of data at an organization. Since the data delivery from operational system to the production data warehouse can be ensured through “service level agreement (SLA)” and there are no further operations over the data after they are loaded into the data warehouse, the efficiency for loading the data at data marts, business intelligence (BI) applications or for different users to directly query the data is quite optimal. On the other hand, since there is no data transformation in the data warehouse, there can be quite a big amount of data redundancy and inconsistency within the warehouse. Specifically, when two or more operational systems contain the same area of data, such as customer information, these data will all be loaded into different tables in the production data warehouse without any further conformations and standardizations.
By the way, the “Production Data Warehouse” often has a synonym “Operation Data Store” or ODS. These two concepts actually have differences. The former often contains historical data that are saved in the production systems, while the latter contains only a snapshot of the production database.
Enterprise Bus Architecture
The next picture shows the typical data warehouse bus architecture defined by Kimball. This architecture is generally divided into 4 layers. The first layer contains all the operational systems. These systems deliver production data to the next layer according to the SLAs. The second layer, called “data staging area,” contains data deliveries from different source systems that are temporally kept in its original format or in database, such as flat files or relational tables. The data staging area involves data extraction, transformation, and loading (ETL) operations to conform data from different sources into the target format in the next layers. This layer also contains data cleansing and standardization processes to ensure the quality of data. In Inmon’s books, the data staging area often contains a so-called “operational data store (ODS)” where a snapshot of recent operational data is kept and accessed by special applications or users for fast reporting and analytics. The third layer in this architecture is called “data presentation area” where data is organized, stored and made available for different usage purposes. As the data staging area is not open to any users, the data presentation layer is typically accessed by users. This layer is often made by a series of integrated data marts. The concept of “star-schema” and dimensional modeling provides the soil of building integrated data marts by conformed dimensions and facts. As often mentioned by Inmon, it may be necessary to consider building an integrated “Enterprise data warehouse (EDW)” before the layer of dimensional models to hold all the different enterprise data in a single model and serve other data marts with data from this EDW. Having an EDW can be seen as a deviation to the enterprise bus architecture. In many real world cases of the bus architecture, one of these data marts can contain most areas of data and serve the role of primary data warehouse (similar to the purpose of the EDW). Other data marts are still connected to the data staging area in this case but are more focused on specific business areas. The “data access tools” is the fourth layer of this architecture. All the different BI tools are contained in this layer and these tools access the data in the data presentation layer for certain purpose of usage.
In this architecture, the conformed dimensions and facts provide a ground of understanding all the enterprise data in a general sense. The data staging area ensures that the quality is controlled and redundancy of data is minimized. However, since the architecture involves 4 layers, the time from when the data is loaded from operational systems to when the end-user is able to generate report based on the latest data is much longer than the previous two architectural patterns. In addition, as being practiced by many different industries, to use a star-schema (or, a “snow-flake” schema) to model and keep a large organization data in a single, multidimensional model has not been very successful. In fact, many organizations are struggling between the loading time of the batch jobs and the query performance and the end-user side.
The following picture depicts the enterprise data warehouse architecture pattern that have been very popular in the recent years. In this architecture, the data staging area transforms the data and applies data quality processes before the data from operational systems is changed into an integrated data model in the enterprise data warehouse. The enterprise data warehouse normally utilizes a relational data model to keep the different business data in the whole enterprise in an integrated manner. The enterprise data warehouse either feeds data to data marts which are built upon star-schema models or directly provide data to data access tools. The metadata repository is a special tool that integrates metadata from different layers, components of the architecture in a single place in order to provide a single, unified view of all the metadata in the whole architecture.
The hub-and-spoke architecture are proposed to solve a few major issues with the previous architecture patterns. First, for the issue of “one version of the truth” of data, the integrated data model at the enterprise data warehouse holds the most detailed enterprise data in a generic and versatile way so that correction, re-calculation of data will use this layer as the ground. Second, the metadata repository must include a clear metadata catalogue and the metadata can be extracted from different sources in time. Third, in the data staging area, the ETL process and data cleansing as well as standardization operations are pre-defined according to data quality requirements so that these operations can be applied to the data from operational systems and only qualified data can be loaded into the enterprise data warehouse. Fourth, the database platform as well as the ETL tool are tuned in order to meet the performance requirement. The direct link from the enterprise data warehouse to data access tool also provides a “short-cut” when data marts are unavailable or unnecessary to be placed between these two layers. Fifth, the data marts are designed in an “easy to understand” format in order for users to access and use. The star-schema design is proven to be quite intuitive for business end-users.
With these classical architectures, there have been quite a few trends in the implementation of data warehouses in recent years.
First, in the start of implementation towards an enterprise-level data warehouse, an organization is typically more interested in establishing a “Production Data Warehouse” or the “Data Staging Area” layer of the hub-and-spoke architecture. This step will make a basic data layer available for business end-users to quickly get “hands-on” with data and start the explorations.
Second, even when a star-schema dimensional model is available, sometimes together with a cube or likely-structure, the business end-users can still be very interested in accessing data that lives in the “production data warehouse” or the “Data Staging Area.” The means that IT department must establish and maintain a “mash-up” architecture with certain metadata and documentation tools.
Third, establishing conformed dimensions seems to be too much challenging. More organizations are not going after confirmed dimensions but dimensions that are “conformed” to a smaller scope.
Fourth, the enterprise data warehouse should rather contain a hybrid data model, aka a blending of certain normalization and dimensional models. The role of communicating data with the business end-users are placed at the dimensional models close to the end-users instead of the hybrid data model inside the enterprise data warehouse.
Fifth, reference data, such as customer ratings, industry classifications, product types, should be modelled into more centralized entities, something like “classification” or “segments.” Although one can have several (but not more) top generalized entity to hold different “types” of reference data, it is still very important to keep all the small “types” data centralized. And the management of these reference data should be done through a master data tool.
Sixth, all the technical metadata in the data warehouse lifecycle, such as the ETL jobs, batches, and executions, must also be modelled and utilized in the data warehouse system.
As mentioned in the article about data warehouse development model (link), the reference architecture for a data warehouse is in fact one of the first crucial decision to make before any implementations. It is not always good to keep using the most-recommended or standard architecture proposed in the industry. It is more advisable to choose the architecture that fit an organization’s internal structure, resources, and even political mindsets.