Data warehouse is a collective effort of integrating and transforming enterprise data to meet analytical and reporting requirements of certain business domains. Many organizations invest in Data Warehouse projects to create a hub of enterprise data and expect the Data Warehouse system to continue running and evolving over the time until the technology trend pushes this system to the next generation.
Compared to the traditional enterprise applications, Data Warehouse systems demand much more change and new “add-on” after the first version. In fact, the overall purpose of creating a Data Warehouse is to make a system framework that can evolve over the time and continuously deliver new data and new information to end-users.
However, to our knowledge, many organizations only put focus on creating the warehouses through an initial project and neglecting the great potential of implementing well-functioning maintenance processes afterwards. One of the main challenges we have seen is how to make a smoothly-running deployment process that add values to Data Warehouses continuously and incrementally.
The goal of this blog is to describe a functioning and “proven-by-practices” approach of making incremental deployment to Data Warehouse systems. Through the time from the summer of 2016 to the end of 2017, Steen and I had the chance to implement and run the Data Warehouse deployment process for a large financial institution in Denmark. Our experiences have shown that, with a practical planning methodology, a structured process enforced through disciplines, the deployment and release plans can be designed ahead of time and fulfilled on time. In other words, our process made it easy and smooth to deployment changes to Data Warehouse system while still satisfying the quality and testing requirements all the time.
The Data Warehouse system we worked on is based on Microsoft SQL Server, SSIS, SSAS, SSRS, MDS and SharePoint. We believe the deployment process and discipline is applicable to other technology platforms.
The special feature of Data Warehouse and BI system is that it is a “hybrid” solution involving multiple technologies and architectural layers. A change to a Data Warehouse system can imply changes to different tools, like a “ripple-effect”. Therefore, the following challenges must be recognized when designing an increment deployment process for Data Warehouse systems.
- To implement one change will possibly mean to apply changes to multiple layers of the Data Warehouse, which may involve multiple technologies and tools. Thus, the implementation should not only add changes to relevant layers, but also orchestrate the changes through the layers. For example, adding a column “BalancedAmount” means to add the changes to the Data Warehouse layers such as “Extract,” “Archive,” “Fact” in the Data Warehouse, and to add the changes to data mart tables, and probably also to Analysis Services Cubes as well as the SSRS reports.
- Testing of the change requires test cases through different layers. In addition, the existing data and design should always be checked through a comprehensive regression test process to ensure that new changes will not harm the existing program. Testing is in fact the key activity to ensure that the deployment process is expanding the Data Warehouse system through an incremental way.
- The development of changes to Data Warehouse systems should be running through a continuous process. While the deployment of one release starts, part of the team will work on making new changes to the next release. Therefore, keeping the continuity of development while still testing and deploying the current release is a challenge to teams with multiple parallel development tracks.
Before describing our deployment approach, it is important to align the understanding of Data Warehouse architectures. As illustrated below, a typical Data Warehouse and BI solution includes layers consisting of Source Systems, Data Warehouse and the analytical and reporting front-end.
- Source systems can be of various types, such as classical database systems, SharePoint list, Web Services, CSV and XML files, the master data input from Master Data Services, message queue and so on. The challenge behind the variety is that the ways to extract data from these systems can be very different for each one, which implies the challenges of implementation and maintenance.
- The Data Warehouse contains several important data layers before data is sent to the end-user tools. The “Extract” layer keeps the data read from source systems. The “Archive” layer keeps track a history of the data. The “DW” layer is basically a layer to define conformed dimensions and unified fact tables. Finally all data is read into the “DMSA” (Data Mart Staging Area) layer.
- The front-end contains different target systems or applications where the end-users interact with. In most our experiences, these can be data marts, file deliveries (for example, a SAS dataset), cube, and so on.
Given this architecture, which is very typical across many organizations, changes to a Data Warehouse and BI solution can come from any of the architectural elements. In a standard Data Warehouse maintenance group, these changes should be implemented and deployed through standard release processes.
In our experience, the typical changes that occur to a Data Warehouse and BI system can be of the following types.
- Adding/changing columns/tables to an existing delivery. Examples can be like “add a column to the CurrencyExchangeRate delivery File” or “Change the column GeographicalRegionCode from varchar(8) to varchar(32)”
- Adding a new delivery from an existing source or a set of deliveries from a completely new source system
- Adding/changing front-end element, such as adding a new BI report based on existing data or adding a new cube based on a new data delivery
- Removing an existing delivery or table or column
- Modifying elements of an existing data flow to reflect a new business logic, such as “Change the value of column NetPresentValueRegionA to either use the source column or, if the column contains NULL, use the value from column PresentValueLocal instead”.
Besides these normal changes to Data Warehouse systems, there can also be fundamental changes which require adding/changing an architectural layer or an architectural behavior. An example can be “adding a layer called BusinessLogic between the Archive and the DW layer” or “Changing the way data from Extract layer are saved to the Archive layer.” In such situations, a special deployment must be planned with extra time-windows for development. Our deployment approach has also been tested in these situations and the observation is that the time allocated for such extraordinary delivery must be at least doubled to ensure the quality level of the deployment.
Given these common understandings, the Data Warehouse can receive various requirements regarding how (and how often) the business teams want the releases and changes to be implemented to the Data Warehouse. Just to align the expectations while staying at an acknowledgeable level of ITIL, we end up with the following common requirements for the deployment process.
- The goal of deploying and releasing changes to the Data Warehouse and BI system is staying at a frequency of 1 release per 4-6 working weeks.
- Deployment activities in PreProduction and Production environments can only be done by the Operation Team. Deployment guidelines are created and tested by the Development Team in the Development and Test environments.
- All releases and deployment must be gone through both regression test and user acceptance test processes.
- Hotfixes can only happen within 2-3-day-window after a release is deployed to the Production environment. In other days, hotfixes are not allowed in principle.
With these requirements, Steen and I and the rest of the Data Warehouse/BI team have been able to keep continuous deliveries and releases for over 12 months. Although almost half of the team were re-staffed after an organizational change and a totally new IT Operational and Development process has been implemented during the period, our pace of delivery is kept within the expected frequency and we have managed to live up to all the testing and operational requirements listed above.
Although the trends of automation, “fast time-to-market” and “ease of use” in the Data Warehouse and BI industry have attracted a lot of attention, we found it necessary to use a manual process of deployment, given the limitation of tools and requirement of process and roles by the organizations that we work with. We believe that a fully automated deployment tool or process for existing Data Warehouse and BI systems is never proven to be possible in scenarios where compliance, regulatory and risk-centric requirements are compulsory, especially in the financial industry. This kind of automation will only be possible in systems designed from day 1 with this in mind.
As depicted in the following diagram, the deployment process takes 4 major milestones with different essential steps.
- The Release Planning phase is used to obtain agreement on a realistic list set of deliveries which is also technically tangible for the next release. Note that not all items in this list will be sent to Production in the process since we may choose to drop one or a few release items in case the promised time-window has to be closed in time. Once the release items are defined, relevant development must be completed as soon as possible with unit test passed and checked by the lead developers. A “pre-integration” test may sometime happen which means to run a full data warehouse job using test data in the Development environment. Once the “pre-integration” test is completed with success, a snapshot of the current code repository is taken and frozen. From this time on, the developers can work on tasks related to the next releases, although they may be recalled for issues found after deployment to the next environments.
- The “Integration” Test is executed in the Test environment to make sure that changes to different part of the architecture will work together. Test data from the Production environment are first loaded and anonymized in order to be used for the testing process. Another very important task here is to create the deployment guideline, which is a user manual for the Operations Team. During the integration test process, end-users are sometimes invited for checking certain testcases. Once all parties agree on the completion of the test, the code changes are finalized and frozen.
- With the deployment guideline document, the Operations Team deploys the changes to the PreProduction environment and implement regression test. End-users check the output from the regression test and make other acceptance tests to validate all changes relevant to the release. Note that the Production data is always loaded to the PreProduction environment for the regression test in order to ensure the complete coverage of the test. When all test cases are executed and accepted, the Operation Team will deploy the changes to Production.
- Right after the deployment activities are completed in the Production environment, there will always be a period of a few days of Hypercare where hotfixes or even system “roll-back” is possible if found necessary. The business end-users often run certain checks in the Production environment to make a final validation after a deployment process is just completed. Once all changes are finally accepted, the current release window is closed and followed by a retrospective activity to find out how to improve the next release cycle.
In the whole deployment process, there are two essential elements that define the success and failure.
- The Deployment Guideline document is a complete, detailed MS Word document which describes each step of the deployment activity with concrete screen examples. The textual description in this document must be strictly followed by the Operation Team in order to complete the task.
- The Regression Test is a process using a special program to compare both metadata and data of the database before and after the deployment happens. We started this step using an in-house tool and switched to an external tool (RedGate SQL Data Compare) for this purpose. This process is actually deciding “go” and “no go” for the final deployment.
The deployment process is running in an Agile manner. There have been cases where we went back from the PreProduction to Development to finalize code changes due to unexpected complexities. There has also been one case where we rolled back after the Production deployment due to errors that we were not able to detect in Regression test.
Having been able to run the same deployment and release process for over 12 months, we were able to move from early versions the Data Warehouse and BI system to the level where most business end-users start using and benefiting from and requesting data from this platform.
We summarize our experiences into the following points.
What has been good
- Using a standard deployment methodology makes it easy and straightforward for all parties (business stakeholders, end-users, developers, Operations Team) to understand where we are, how we have been and what we want to achieve next.
- The manual deployment process is never perfect but it works very well.
- Our Data Warehouse batch job is designed to be parameter-driven, which made it easier during the deployment, testing and Hypercare phases.
- The regression test is vital. Raising the quality of the regression test process will enhance the quality of the deployment to a large extent.
- It is always a good idea to stay close to the Operations Team.
What should be improved
- Manual deployment has a high risk of mistakes caused during the manual process. A testing and validation process or program should be created to minimize the risk.
- Creating the deployment guideline is time-consuming task and MS Word is definitely not the perfect media. A more automated software will streamline the deployment process.