Speed Up Data Warehouse Development with Automation Techniques

From the summer of 2015 to the summer of 2017, Kenneth and I worked on the same Data Warehouse and BI platform for a large financial institution in Denmark. Together with our team of developers, we developed and applied an automation technique that makes it simple and fast to produce ETL jobs and database change scripts. We have witnessed the immediate benefits of this technique and have come to see it as potentially "the next wave" in the world of Data Warehouse and BI. This blog covers our experiences with, thoughts on, and demos of the automation technique.
1.    Background

After more than 20 years of development, the theory and practice of the data warehouse and BI industry have evolved to the stage where most organizations have quite a few data warehouse solutions maintained by different development and operations teams. Many more types of data sources and much larger volumes of data are loaded into the data warehouse on a daily basis. The number of source systems feeding a data warehouse solution has grown from at most 3-5 to at least 20-40, and the number of source tables loaded from a single source system has grown from at most 8-10 to at least 800-1000. Given the cost of keeping a stable IT organization for the data warehouse solution and the growing complexity of the ETL process, maintaining and developing a data warehouse solution has become a critical challenge.

Looking inside the actual ETL process of these data warehouse solutions, the ETL jobs designed for the different processing steps of a classical data warehouse architecture are largely similar. For example, all extractions of tables from a source system follow nearly identical patterns, all loading of facts and dimensions from the data warehouse layer to a data mart or a cube is very similar, and all transformations of data into the fact tables of a specific business domain follow the same pattern. With this degree of similarity in ETL job design, a large part of the ETL development effort is simply "copy and paste" of most of the code, followed by renaming parameters and properties.

Besides the avoidable cost of such redundant work, repeating it by hand also carries the risk of mistakes, since it is a process with human intervention. One might reasonably expect two consultants following the same architectural patterns to produce almost identical ETL jobs, but in practice this is rarely the case.
 
With the buy-in from the management group, Kenneth led the design and implementation of Data Warehouse automation in our team. Through the implementation process, all members of our development team recognized the benefit, and the concept of automation has since been treated as a standard in our team.
 
Our Data Warehouse solution uses Microsoft SQL Server as the database engine, SQL Server Integration Services (SSIS) as the ETL tool, and BIML from Varigence, together with BIDS Helper, as the medium for generating the SSIS packages. The data warehouse is built on an early version of the Platon data warehouse framework.
 
There are a few clear and immediate benefits of using the automation technique:
  • First, the "repeating" work of building similar SSIS packages is reduced to a few simple configuration steps. This also means that the time and cost of building these packages is dramatically reduced and can, in management-friendly terms, be easily estimated.
  • Second, it is easier to align the design patterns and implementation details of technical elements with the intended technical architecture. Since jobs are created through the automation process, every job design is preconfigured according to an agreed standard.
  • Third, most technical errors caused by manual development work are avoided. If 20 ETL jobs are created using the same automation process, the chance that an error slips into just one of these ETL jobs is very small.
During the two years from 2015 to 2017, our team achieved quite a few milestones with the automation technique:
  • An automation process was established that generates the ETL packages, the DDL scripts for the corresponding database objects, and the metadata registration scripts required by the Platon Data Warehouse framework.
  • The ETL jobs that extract data from several source system databases are fully automated. In my estimate, the automation of this step has cut the development time by about 80% compared to implementing the same type of ETL packages in the traditional way.
  • The ETL jobs that move data from the Data Warehouse layer to the Data Mart layer are fully automated. This part of the automation has reduced development time by about 95%.
In addition to the above, our team has also tried to use automation for the data transformation layer, which transforms data from the source format to the dimensional data model in the data warehouse. Our experience shows that fully automating such transformations is still not feasible, since every transformation has its own unique requirements and implementation logic. However, it has been very useful to use the automation functionality to create a draft SSIS job, so that a developer can start from an ETL package that is at least 20% complete. This saves quite a bit of development time: the developer can focus on the complex logic instead of spending time on the small, repetitive details required for every ETL package, such as parameter definition and mapping, standard start and end tasks, etc.
2.    Automation of Data Warehouse Development
The market for Data Warehouse automation platforms has been growing actively over the past 5 years. WhereScape, TimeXtender, Effektor and quite a few others have emerged. Since the appearance of BIML, generating ETL packages with BIML has been a technology trend in many Microsoft SQL Server-driven Data Warehouse solutions. With full respect to these Data Warehouse automation products, our team uses a well-developed Data Warehouse framework and applies the automation techniques on top of it. The decision on software tooling was made before most of us joined the team, but we believe the automation techniques have helped us get the most out of what we have.
 
The key concept in our implementation of the automation technique is to find and define the "template" job. A template job is a representative ETL job or database script that will later be repeatedly "copied and modified" into other jobs or scripts that are very similar. After creating the template job, the key step is to identify the parameters that must be changed when auto-generating the other jobs and scripts.
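To make the template idea concrete, here is a minimal sketch in C#. The BIML fragment, the {{token}} placeholders and the parameter names are purely illustrative and not Mefa's actual template; the connections the fragment refers to are assumed to be declared elsewhere in the BIML project.

```csharp
// Minimal sketch of the template idea. The {{token}} placeholders and the
// tiny BIML fragment are purely illustrative - this is not Mefa's actual
// template, and the connections referenced by ConnectionName are assumed
// to be declared elsewhere in the BIML project.
using System;
using System.Collections.Generic;

static class TemplateDemo
{
    // A representative extract package, reduced to a small BIML fragment.
    // Everything that differs between jobs is marked as a {{parameter}}.
    const string ExtractTemplate = @"
<Biml xmlns=""http://schemas.varigence.com/biml.xsd"">
  <Packages>
    <Package Name=""Extract_{{SourceSystem}}_{{SourceTable}}"" ConstraintMode=""Linear"">
      <Tasks>
        <Dataflow Name=""DFT Load {{SourceTable}}"">
          <Transformations>
            <OleDbSource Name=""SRC {{SourceTable}}"" ConnectionName=""{{SourceSystem}}"">
              <DirectInput>SELECT * FROM {{SourceSchema}}.{{SourceTable}}</DirectInput>
            </OleDbSource>
            <OleDbDestination Name=""DST {{SourceTable}}"" ConnectionName=""Archive"">
              <ExternalTableOutput Table=""[{{SourceSystem}}].[{{SourceTable}}]"" />
            </OleDbDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>";

    // Replace every {{token}} in the template with its concrete value.
    static string ApplyTemplate(string template, IDictionary<string, string> parameters)
    {
        foreach (var p in parameters)
            template = template.Replace("{{" + p.Key + "}}", p.Value);
        return template;
    }

    static void Main()
    {
        string biml = ApplyTemplate(ExtractTemplate, new Dictionary<string, string>
        {
            { "SourceSystem", "Finance" },
            { "SourceSchema", "dbo" },
            { "SourceTable", "Customer" }
        });

        Console.WriteLine(biml); // ready to be added to BIDS Helper / BimlExpress
    }
}
```

The point is simply that everything that varies between jobs is reduced to a small, named set of parameters, while the rest of the job design is frozen in the template.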
 
In our team, Kenneth created a Windows application that lets users create BIML and DDL scripts based on the identified template jobs and scripts. The output from this program is used as input to BIDS Helper (now called BimlExpress) to generate new SSIS packages. The application also generates the DDL scripts that create SQL Server objects and make the relevant configuration changes for the generated SSIS packages. These are the primary steps that involve the automation technique. The following table compares the automation process with the classical manual process for developing Data Warehouse jobs.
 
[Table: the automation process compared with the classical manual process]

Given the process and steps above, we will showcase two examples of how automation is used in the development of Data Warehouse jobs.

Example 1: The interface for generating the BIML scripts

The following picture is a screenshot of Mefa, an application developed by Kenneth. We use Mefa to generate the BIML scripts for the SSIS extract packages in our team.
The screenshot shows the BIML generated by the application. The design and output of Mefa are tailored to the meta-driven framework of the data warehouse solution, in our case the Platon data warehouse framework, but Mefa can also be adapted to other frameworks.
The Mefa application needs to connect to the SQL Server where the meta-driven data warehouse framework is running. A user can choose the source system for which the extract packages will be generated and the layout of the SSIS packages, such as "1 package per table" or "1 package for the entire system".
[Screenshot: the Mefa interface and the generated BIML]
Developing a tool like Mefa is quite feasible for teams with more than 3 developers. Mefa is developed in C# in MS Visual Studio 2013 with .NET Framework 4.5. The application contains about 3,000 lines of code and requires a fair amount of C# knowledge to develop. However, a similar tool (without a GUI) could just as well be developed in T-SQL/stored procedures.
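As a rough illustration of what the core of such a tool does, the following C# sketch reads the table list for a chosen source system from the framework's metadata and writes one generated .biml file per table. The metadata table meta.SourceTable and its columns are hypothetical stand-ins, not the actual Platon framework schema.

```csharp
// Rough illustration of the core generation loop in a Mefa-like tool.
// The metadata objects (meta.SourceTable and its columns) are hypothetical
// stand-ins for the framework's real metadata, not the Platon schema.
using System.Collections.Generic;
using System.Data.SqlClient;
using System.IO;

static class ExtractGenerator
{
    sealed class SourceTable
    {
        public string Schema;
        public string Name;
    }

    // Read the list of tables registered for one source system.
    static IEnumerable<SourceTable> GetTables(string connectionString, string sourceSystem)
    {
        const string sql =
            "SELECT SchemaName, TableName FROM meta.SourceTable WHERE SourceSystem = @system";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@system", sourceSystem);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    yield return new SourceTable { Schema = reader.GetString(0), Name = reader.GetString(1) };
        }
    }

    // "1 package per table" layout: apply the template once per table and write
    // one .biml file each. The "1 package for the entire system" layout would
    // instead use a template that emits all data flows inside a single package.
    public static void GenerateOnePackagePerTable(string connectionString, string sourceSystem,
                                                  string templateText, string outputFolder)
    {
        foreach (var table in GetTables(connectionString, sourceSystem))
        {
            string biml = templateText
                .Replace("{{SourceSystem}}", sourceSystem)
                .Replace("{{SourceSchema}}", table.Schema)
                .Replace("{{SourceTable}}", table.Name);

            string fileName = "Extract_" + sourceSystem + "_" + table.Name + ".biml";
            File.WriteAllText(Path.Combine(outputFolder, fileName), biml);
        }
    }
}
```

The same loop works without a GUI; a T-SQL variant would simply drive the string replacement from a stored procedure instead of a C# method.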

For working with BIML and making the best use of automation, we believe that a tool like Mefa is a best practice for data warehouse solutions.

Example 2: Generating DDL scripts and meta registration scripts

In many meta-driven data warehouse frameworks, quite a few SQL scripts have to be generated whenever an SSIS package is added.
The following picture is a screenshot of an automatically generated script for meta registration.
Using the GUI of Mefa, it is possible to connect to a source and explore its data structures (for now MSSQL and CSV files are supported, but the application could be extended to support other sources such as XML, JSON and other databases). In this screen, the user chooses the tables for which meta registration is needed. The archive type for each table must be set, and it is also possible to choose a column to use for delta load with a high watermark.
[Screenshot: selecting tables and archive settings for meta registration in Mefa]
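To show roughly how such a registration script can be produced, here is a C# sketch. The meta.TableRegistration table and its columns are hypothetical placeholders for the framework's real registration objects, but they illustrate how the chosen archive type and high-watermark column end up in the generated script.

```csharp
// Sketch of how the meta registration script could be built from the user's
// choices in the GUI. The meta.TableRegistration table and its columns are
// hypothetical placeholders for the framework's real registration objects.
using System.Text;

class TableRegistration
{
    public string SourceSystem;
    public string TableName;
    public string ArchiveType;          // e.g. "Full" or "Delta"
    public string HighWatermarkColumn;  // null when no delta load is used
}

static class MetaScriptGenerator
{
    // Emit one INSERT statement per selected table.
    public static string BuildScript(params TableRegistration[] tables)
    {
        var sql = new StringBuilder();
        foreach (var t in tables)
        {
            sql.AppendFormat(
                "INSERT INTO meta.TableRegistration (SourceSystem, TableName, ArchiveType, HighWatermarkColumn)\n" +
                "VALUES ('{0}', '{1}', '{2}', {3});\n",
                t.SourceSystem, t.TableName, t.ArchiveType,
                t.HighWatermarkColumn == null ? "NULL" : "'" + t.HighWatermarkColumn + "'");
        }
        return sql.ToString();
    }
}

// Example usage:
// MetaScriptGenerator.BuildScript(new TableRegistration
// {
//     SourceSystem = "Finance", TableName = "Customer",
//     ArchiveType = "Delta", HighWatermarkColumn = "ModifiedDate"
// });
```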
For both the BIML and the SQL scripts we have seen above, note that none of the generated code is hard-coded in the C# part of the project. Both scripts are based on templates that can be maintained without C# knowledge and without rebuilding the application. It is therefore easy to adapt the application to other frameworks and other SSIS layouts.
 
In our implementation, the BIML code is based on BIDS Helper 2014 and is manually added to the SSIS project. The generated BIML contains no BimlScript code; all variables are handled in Mefa. This ensures a simple BIML file which:
  • is easy to compare across versions,
  • makes it easy to see what the output of the file will be, and
  • always generates the same output.

Since the BIML integration with project parameters has certain issues, the Mefa implementation instead uses an MSSQL alias in each environment that points to the framework database. All other parameters are then read from the framework.
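A minimal sketch of this approach, assuming an alias named DWFRAMEWORK and a hypothetical meta.Parameter table, could look as follows. The same connection string is used in every environment; only the alias mapping changes.

```csharp
// Minimal sketch of the alias-based configuration approach. "DWFRAMEWORK" is
// an assumed SQL Server alias name and meta.Parameter an assumed parameter
// table; the point is that the connection string is identical in every
// environment and only the alias mapping (set in SQL Server Configuration
// Manager) decides which server the framework is read from.
using System.Collections.Generic;
using System.Data.SqlClient;

static class FrameworkParameters
{
    const string ConnectionString =
        "Data Source=DWFRAMEWORK;Initial Catalog=DwFramework;Integrated Security=SSPI;";

    // Read all remaining parameters from the framework database.
    public static Dictionary<string, string> ReadAll()
    {
        var parameters = new Dictionary<string, string>();
        using (var conn = new SqlConnection(ConnectionString))
        using (var cmd = new SqlCommand("SELECT Name, Value FROM meta.Parameter", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    parameters[reader.GetString(0)] = reader.GetString(1);
        }
        return parameters;
    }
}
```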

Regarding Deployment, Operation and Maintenance

Until now the team has only used the automation technique for the development of Data Warehouse jobs and database objects. Once the code changes are tested, we follow a classical deployment process for moving the code to the Test, PreProduction and Production environments (see this blog for a description of our experiences with the deployment process). For the ETL jobs and SQL scripts generated using the automation techniques, we have made the following observations during deployment, operation and maintenance.

 
  • First, when a completely new source system is added and all jobs and SQL scripts are generated through automation, all relevant connection information is normally saved as metadata registered in the database. In rare cases, the deployment process can still run into a few configuration problems, such as defining the server proxy or setting default values for parameters in the metadata. Once these problems are solved, further issues rarely occur, and subsequent deployments of changes to these jobs and scripts typically contain no errors.
  • Second, maintenance and operational management of jobs and scripts created through automation are fairly easy compared to jobs and scripts developed manually by different developers. Since all the relevant jobs and scripts are generated from the same template, it is easy to identify whether an error lies in the hardware or system configuration or in a specific table.
  • In our observation, for a batch of 20 ETL jobs generated from the same template, it rarely happens that only one job fails while all the others run successfully. Execution of generated jobs typically ends as either "succeed for all" or "succeed for none", since all the code is aligned. In the cases where an error occurs in only one of the 20 generated jobs (of the same template), the error comes either from the data source/target or from the metadata configuration specific to that job. The scope of debugging is then dramatically reduced in terms of where to look and what to do. With this in mind, debugging and operational management of these jobs and scripts become fairly easy and straightforward.

3.    Retrospective
We have used automation in the development of ETL jobs and database scripts. In our observation, the automation technique fits our needs in terms of reducing development cost, aligning job design with the architecture, and simplifying the deployment, maintenance and operational management processes. The automation technique has a solid business case in our team, since the source systems we work with typically contain well over 10 deliveries; the benefit of automation is very clear when one job is created as a template and the remaining nine are generated in a single step. When a source system is a well-known DBMS such as SQL Server, Oracle or Postgres, it is easy to develop the template job and generate the rest.

We believe that other teams facing similar types of source systems will also benefit from using the automation technique. Using the BIML tooling and having knowledge of BIML are, however, prerequisites.

Having seen the benefit of automating the development of ETL jobs and database scripts, we would, in hindsight, have considered spending more time on automating the Data Warehouse deployment and testing processes. So far we have not seen any fully automated, continuous deployment and testing technology for the Data Warehouse scenario. Given the cost and risk involved in the deployment and testing processes, there is a clear business case for automation technology that can solve these challenges.

However, the automation technique still faces some challenges.

 
First of all, automation is a technique that can be used in many scenarios when implementing Data Warehouse and BI solutions, but there are always complex transformation rules that are not viable to implement through automation. In our experience, the ETL jobs transforming data from a historical layer (often called the "archive" layer) to the dimension and fact tables often involve complex rules that make it impossible to automate the generation of those ETL jobs. In other words, automation will not be a 100% out-of-the-box solution for creating Data Warehouse and BI applications.
 
Second, technology barriers can limit the possibilities of using automation techniques. Our team uses the complete Microsoft BI stack, which is a perfect fit since BIML is required for the automation technique. If the ETL tool does not support code generation through XML files, or has no open standard or available functionality for code generation, it is not possible to apply automation.
 

Finally, our current use of automation stops at generating the automation script, which in turn generates the ETL jobs and database scripts. There is no available tool that can create and maintain traceability and lineage between the automation script and the generated code. Although such a tool may not be necessary at the moment, it will be a "nice to have" feature in the future, when the automation technique is applied in most organizations.

We summarize our learnings with the following Do’s and Don’ts

Do’s
  1.  Start with the simple packages, where the ETL is a 1:1 process (this is typically the source extraction process and the process of copying data to a data mart).
  2.  Choose areas with a lot of trivial development.
  3.  Choose areas with complex but repetitive code.
  4.  Make sure your code is easily extendable, so new features can be added without starting over.
  5.  Get everyone in the team on board.
Don’ts:
  1.  Make the BimlScript code too complex. Using too many nested loops and if statements makes the code hard to read.
  2.  Make everything into a static design pattern defined by the BIML code.
  3.  Make manual changes to a BIML-generated package. You want to be able to regenerate the package when you change something in the script, without having to re-apply manual changes.