Documentation in the Data Warehousing Process

Documentation is an essential part of a data warehouse project. In the starting phase of a data warehouse implementation, the results of the requirement analysis and the data source analysis must be captured in proper documentation. After the data model has been defined and established in the database, mapping specifications are normally documented to guide the actual implementation in the ETL tool. Once the ETL jobs are implemented and the batch flow is deployed, documentation of the ETL flow, from the source systems to the target deliveries, supports the maintenance and monitoring of the daily data warehouse workload. In fact, when a proper tool is not available, the analysis process itself is carried out in documents, which makes it all the more sensible to use decent templates for these analytical activities.

This article introduces a few key activities in a data warehouse environment and the documentation that corresponds to each activity. Standard templates are also provided for each document.

1. Data gap analysis

The development of new data in a data warehouse is normally driven by one or more requests from the business side. A business analyst or data analyst should examine the request(s), identify the actual data items that are mentioned or implied, and find the corresponding source data for those items. This is called “data gap analysis”: finding the gap between what the business needs and what is actually available. The power of data gap analysis lies in the fact that the IT side can help the business side understand whether a request can be fulfilled, to what extent, and at approximately what cost.
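As an illustration, the gap list can be produced mechanically once the requested data items and the source system catalogs are written down. Below is a minimal sketch in Python; the item names, source systems and columns are hypothetical examples, not taken from any real system.

    # Hypothetical: data items requested by the business.
    requested_items = ["customer_name", "customer_segment", "monthly_revenue", "churn_score"]

    # Hypothetical: what the source systems actually offer (system -> available columns).
    source_catalog = {
        "CRM": ["customer_name", "customer_segment", "signup_date"],
        "Billing": ["monthly_revenue", "invoice_count"],
    }

    # Invert the catalog so each column points to the system that can deliver it.
    available = {col: system for system, cols in source_catalog.items() for col in cols}

    for item in requested_items:
        if item in available:
            print(f"{item:20s} -> available in {available[item]}")
        else:
            print(f"{item:20s} -> GAP: no known source; needs clarification or new sourcing")

In practice the same comparison is usually kept in a spreadsheet, with one row per requested item and columns for the source system, source table and source column.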

2. Data quality analysis

Continuing from the data gap analysis, the source data mapped to the business request must be profiled and checked for data quality issues. One can normally add a few extra columns to the output of the data gap analysis, for example: the expected data type of the source column, the typical data range, the typical data pattern (such as “[999]-(999)-9999”), and so on. Such data quality requirements have a direct influence on the data modeling and ETL design activities.
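A small profiling script can capture exactly the attributes mentioned above. The following is a minimal sketch, assuming the source column has already been extracted into a Python list; the sample values and the phone-number pattern are hypothetical.

    import re

    # Hypothetical sample of a source column to be profiled.
    values = ["[555]-(123)-4567", "[555]-(987)-6543", "N/A", "[555]-(111)-2222"]

    # The expected pattern "[999]-(999)-9999" expressed as a regular expression.
    pattern = re.compile(r"^\[\d{3}\]-\(\d{3}\)-\d{4}$")

    non_null = [v for v in values if v not in (None, "", "N/A")]
    matching = [v for v in non_null if pattern.match(v)]

    print(f"row count         : {len(values)}")
    print(f"null/placeholder  : {len(values) - len(non_null)}")
    print(f"pattern conformity: {len(matching)}/{len(non_null)}")

Numbers like these go directly into the data quality columns of the analysis sheet.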

3. Data requirement analysis

After the data quality check, one should proceed to an initial data model grouping and ETL planning phase. Again, one can add more columns to the output of the data gap and data quality analyses to record the initial grouping of each source column (i.e., which data area or table the column should be modeled in) and the ETL estimate (how long the implementation of the column will take). One should also note the relevant transformation requirements for the column. That is why this phase is called “data requirement analysis”.
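Taken together, the three phases simply widen the same analysis sheet, one row per source column. The sketch below shows what such a row could hold; the field names and the sample values are illustrative, not a standard.

    from dataclasses import dataclass

    @dataclass
    class RequirementRow:
        # From the data gap analysis
        business_item: str        # what the business asked for
        source_system: str        # where it was found (or "GAP")
        source_column: str
        # From the data quality analysis
        expected_type: str
        expected_pattern: str     # e.g. "[999]-(999)-9999"
        # From the data requirement analysis
        target_area: str          # initial model grouping (data area or table)
        transformation: str       # required transformation rule
        etl_estimate_days: float  # implementation estimate

    row = RequirementRow("customer phone", "CRM", "PHONE_NR",
                         "varchar(16)", "[999]-(999)-9999",
                         "Customer dimension", "strip formatting, keep digits", 1.5)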

All the output from the three phases above is captured in a document called the “data requirement document”. The data requirement document gives the specifications used to define the data deliveries (and possibly the delivery agreement between the source system and the BI/DW department), the data model design, the mapping specification and the ETL design. A template of the data requirement document provided by Detalload.dk is available at this link (DataRequirement_template).

4. Mockup reports

In addition to the data requirement document, another typical requirement document produced during this phase is the “mockup report”: a fake report or dashboard design. Such mock-ups are normally created in Excel or PowerPoint, or even sketched by hand on paper.

5. Data delivery description

The data delivery description gives a technical specification of what must be delivered from the source system to the BI/DW department. Such a specification normally contains a very detailed list of all columns as well as the technical parameters, such as the delivery type (full delivery/delta delivery), the delivery frequency (weekly/daily/monthly) and so on. Besides using a long Word document for this purpose (which I find quite fitting, since the delivery description is treated much like a legal agreement), many organizations have tried to implement a system to replace the Word-based way of documenting deliveries; a SharePoint list is a very common starting point. A template of the data delivery description document is available at this link (DataDeliveryDescription_template).
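When organizations move away from Word, the delivery description typically becomes a structured record, for example an item in a SharePoint list or a row in a small metadata database. A minimal sketch of such a record follows; the field names and values are typical examples rather than a fixed standard.

    import json

    # Hypothetical delivery description record for one source delivery.
    delivery = {
        "source_system": "Billing",
        "delivery_name": "INVOICE_LINES",
        "delivery_type": "delta",     # full or delta delivery
        "frequency": "daily",         # weekly / daily / monthly
        "file_format": "csv",
        "columns": [
            {"name": "invoice_id",   "type": "bigint",        "nullable": False},
            {"name": "amount",       "type": "decimal(12,2)", "nullable": False},
            {"name": "booking_date", "type": "date",          "nullable": False},
        ],
    }

    # Serializing the record makes it easy to exchange or archive.
    print(json.dumps(delivery, indent=2))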

6. Data model

The software industry provides many powerful data modeling tools for implementing a data model with decent documentation. For example, CA ERwin Data Modeler is an industry-leading tool that offers extensive documentation capabilities alongside its cutting-edge modeling functionality. This article will not describe the data modeling activities in detail.

For each logical or physical model in Erwin, one should normally write descriptions, define types and links, and apply domains in order to create a decent data model. After the data model is created and deployed in an actual database (such as SQL Server), the next step is to make the data mappings.

7. Mapping specifications

There are many standards for mapping documentation; a popular example can be found at the Kimball Group (http://www.kimballgroup.com). A few rules should be kept in mind when using and maintaining mapping documentation.

  1. Excel can be a good place to maintain mapping documentation, but for complex rules and transformations it is better to attach a Word file with sufficient descriptions and examples.
  2. For a typical data warehouse system with multiple layers, mappings should be documented per layer: one mapping file per target table (or per source-plus-target pair) per layer, plus an overall document that links all the individual mappings (a minimal sketch of such a mapping entry follows this list).
  3. In recent years, more and more ETL and data integration vendors have started to provide decent metadata management and data lineage tools alongside their ETL tools; examples include Informatica, SAP Data Services and IBM DataStage. The maintenance of mapping information can therefore be based on these metadata tools instead of the time-consuming documentation process, in which case the mapping document serves more or less as a task specification for ETL developers. This, if anything, underlines the importance of mapping specifications: the software industry has put considerable effort into replacing them with tools, but with hardly any real success so far (as of 2016-03-03).
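To make rule 2 concrete, here is a minimal sketch of what the entries in one mapping file (staging layer to data warehouse layer, one target table) could look like; the table names and transformation rules are hypothetical.

    # One mapping file: staging -> data warehouse layer, target table DIM_CUSTOMER.
    mapping_stg_to_dw = [
        {
            "source_table": "STG_CUSTOMER",
            "source_column": "PHONE_NR",
            "target_table": "DIM_CUSTOMER",
            "target_column": "PHONE_NUMBER",
            "transformation": "strip non-digits; NULL if fewer than 10 digits",
        },
        {
            "source_table": "STG_CUSTOMER",
            "source_column": "CUST_NAME",
            "target_table": "DIM_CUSTOMER",
            "target_column": "CUSTOMER_NAME",
            "transformation": "trim whitespace; proper case",
        },
    ]

The overall documentation then simply links one such file per target table per layer.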

A template for the mapping specification is available at this link (MappingDocument_template).

8. ETL documentation

With the data models and ETL specifications ready, the next step in the data warehouse development process is to develop the ETL jobs and to test and deploy them in the different environments. The documentation of the ETL job design is typically made inside the ETL design panel (unless a pure coding tool is used for ETL design), where developers can add annotations to the flow diagram. Most modern ETL tools can generate ETL documentation automatically, since all the ETL job metadata is already available inside the system. Functionalities such as data lineage and impact analysis are generally available to all developers for bug-hunting and for planning changes.
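Under the hood, lineage and impact analysis boil down to a graph walk over the mapping metadata. The following is a minimal sketch, assuming the mappings have been exported as source-to-target edges; the table names are hypothetical.

    from collections import defaultdict, deque

    # Hypothetical source -> target edges extracted from the ETL metadata.
    edges = [
        ("STG_CUSTOMER", "DIM_CUSTOMER"),
        ("STG_INVOICE", "FACT_SALES"),
        ("DIM_CUSTOMER", "FACT_SALES"),
        ("FACT_SALES", "RPT_REVENUE_DASHBOARD"),
    ]

    downstream = defaultdict(list)
    for src, tgt in edges:
        downstream[src].append(tgt)

    def impact(table):
        """Everything affected, directly or indirectly, by a change to `table`."""
        seen, queue = set(), deque([table])
        while queue:
            for nxt in downstream[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    # Contains DIM_CUSTOMER, FACT_SALES and RPT_REVENUE_DASHBOARD.
    print(impact("STG_CUSTOMER"))

Reversing the edges gives data lineage (where a target column comes from) instead of impact analysis.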

9. Business glossary and dimensional model documentation

When the ETL jobs are completed, deployed and executed in the PRODUCTION environment, the data becomes generally available for business users to explore. At this point, documentation of the metadata at a business level is essential. A key element here is a business glossary that captures the key columns and tables of the dimensional models. The business glossary can be a simple Word/Excel document or a flexible (but also somewhat complex) SharePoint wiki site. In the SAP BI world, such documentation is normally captured in the universe documentation and made available to users through the SAP BI portal.

10. BI front-end documentation

Documenting a BI front-end application, such as a customer performance dashboard based on PerformancePoint, is necessary both for the initial training of end users and for the maintenance activities at the BI/DW department. On the other hand, with the popularity of visual analytics, dashboard and report design has become increasingly easy and flexible, and the need for detailed documentation correspondingly smaller.

This article has presented the key documents for the main activities in the development and maintenance of a BI/DW system. Although the software industry has made great efforts to reduce the time spent producing Excel/Word documentation in the BI process, quite a few key documents in this process still seem irreplaceable by any existing tool. A good direction, as seen in quite a few large organizations, is to use SharePoint to reduce the time consumed in maintaining these documents, and to set more rules to ensure their quality. This article has provided links to templates for the data requirement document, the data delivery description and the mapping specification.