After more than 20 years of development, the theory and practice of the data warehouse and BI industry have evolved to a stage where most organizations run quite a few data warehouse solutions maintained by different development and operational teams. Many more types of data sources, and a much bigger volume of data, are loaded into the data warehouse on a daily basis. The number of source systems feeding a data warehouse solution has grown from at most 3 to 5 to at least 20 to 40, and the number of source tables loaded from one source system has grown from at most 8 to 10 to at least 800 to 1000. Given the cost of keeping a stable IT organization for the data warehouse solution against the growing complexity of the ETL process, maintaining and developing a data warehouse solution has become a critical challenge.
Looking inside the actual ETL process of these data warehouse solutions, the ETL jobs designed for the different processing steps share a high degree of similarity, following the classical data warehouse architecture. For example, all extractions of tables from a source system follow a near-identical pattern, all loading of fact and dimension tables from the data warehouse layer to a data mart or a cube is very similar, and all transformations of data into the fact tables of a specific business domain follow the same pattern. With such similarity in ETL job design, a big part of the ETL development effort is simply to “copy and paste” most of the code and rename the parameters and properties.
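This copy-and-paste pattern is essentially template instantiation, which is exactly what a generator can do instead. A minimal sketch in Python (the table names and the SQL template are illustrative, not our actual job design):

```python
# Minimal sketch of template-driven ETL job generation: one shared pattern,
# many instantiations. The template and table names are illustrative only.

EXTRACT_TEMPLATE = (
    "TRUNCATE TABLE stage.{table};\n"
    "INSERT INTO stage.{table} SELECT * FROM source.{table};"
)

def generate_extract_job(table: str) -> str:
    """Instantiate the shared extraction pattern for one source table."""
    return EXTRACT_TEMPLATE.format(table=table)

# Instead of copy-pasting and renaming by hand, generate all jobs in a loop.
jobs = {t: generate_extract_job(t) for t in ["Customer", "Order", "Invoice"]}
```

The point is that renaming "parameters and properties" by hand becomes a loop over metadata.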
There are a few clear and immediate benefits of using the automation technology:
- First, the “repeating” work of building similar SSIS packages is reduced to a few simple configuration steps. This also means that the time and cost of building these packages are dramatically reduced and can, in management-friendly terms, be easily estimated.
- Second, it is easier to align the design patterns and implementation details of technical elements with the expected technical architecture. Since the jobs are created through an automation process, every job design is preconfigured based on an agreed standard.
- Third, most technical errors caused by manual development work are avoided. If 20 ETL jobs are created through the same automation process, the chance that an error appears in just one of them is very small.
In the two years from 2015 to 2017, our team achieved quite a few milestones using the automation technique:
- An automation process was established that generates the ETL packages, the DDL scripts for the corresponding database elements, and the metadata registration scripts required by the Platon Data Warehouse framework.
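To illustrate how one metadata definition can drive both the DDL and the registration script, here is a minimal, hypothetical sketch; the column list and the `meta.RegisterTable` procedure are invented for illustration and are not the Platon framework's actual API:

```python
# Hypothetical sketch: a single metadata entry drives both the DDL script
# and the metadata-registration script, so the two never drift apart.
# The schema, columns and registration procedure are illustrative.

table_meta = {
    "schema": "dwh",
    "name": "DimCustomer",
    "columns": [("CustomerKey", "INT NOT NULL"), ("Name", "NVARCHAR(100)")],
}

def generate_ddl(meta: dict) -> str:
    """Render a CREATE TABLE statement from the metadata entry."""
    cols = ",\n  ".join(f"[{c}] {t}" for c, t in meta["columns"])
    return f"CREATE TABLE [{meta['schema']}].[{meta['name']}] (\n  {cols}\n);"

def generate_registration(meta: dict) -> str:
    """Render an (illustrative) framework registration call for the table."""
    return (f"EXEC meta.RegisterTable @Schema = '{meta['schema']}', "
            f"@Table = '{meta['name']}';")

ddl = generate_ddl(table_meta)
registration = generate_registration(table_meta)
```

Because both scripts come from the same metadata, a change to the table definition is made in one place only.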
- The ETL jobs that extract data from several source system databases are fully automated. In my estimate, automating this step has reduced the development time by 80% compared with implementing the same type of ETL packages the traditional way.
- The ETL jobs that move data from the Data Warehouse layer to the Data Mart layer are fully automated. This part of the automation has reduced the development time by 95%.
- In addition, our team has tried to use automation for the Data Transformation layer, which transforms data from the source format into the dimensional data model of the data warehouse. Our experience shows that fully automating such transformations is still not feasible, since every transformation has its own unique requirements and implementation logic. However, it has proven very useful to use the automation functionality to create a draft SSIS job, so that a developer starts with an ETL package that is at least 20% complete. This saves quite a bit of development time: the developer can focus on the complex logic instead of spending time on the small, redundant details required for every ETL package, such as parameter definition and mapping, standard start and end jobs, etc.
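The idea of a generated draft can be sketched as follows; the step names (`meta.JobStart`/`meta.JobEnd`) are illustrative placeholders for the framework's standard start and end jobs, not real procedures:

```python
# Sketch of generating a draft transformation job: the boilerplate
# (parameter mapping, standard start and end steps) is generated, and the
# developer fills in only the transformation-specific part.

def generate_draft_job(job_name: str) -> list:
    """Return the skeleton steps for a transformation job."""
    return [
        f"-- job: {job_name}",
        "-- parameter mapping (generated)",
        f"EXEC meta.JobStart @Job = '{job_name}';",  # standard start step
        "-- TODO: transformation logic goes here",   # developer fills in
        f"EXEC meta.JobEnd @Job = '{job_name}';",    # standard end step
    ]

draft = generate_draft_job("TransformFactSales")
```

Only the `TODO` step requires manual work; everything around it is regenerated consistently for every job.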
The market for Data Warehouse automation platforms has been growing actively in the past 5 years: WhereScape, TimeXtender, Effektor and quite a few others have emerged. Since the appearance of BIML, generating ETL packages with BIML has been a technology trend in many Microsoft SQL Server-driven Data Warehouse solutions. With full respect for these Data Warehouse automation products, our team uses a well-developed Data Warehouse framework and applies the automation techniques on top of it. The decision on software tooling was made before most of us joined the team, but we believe the automation techniques have helped us get the most out of what we have.
Given the process and steps above, we will showcase two examples of how automation is used in the development of Data Warehouse jobs.
Example 1: The interface for generating the BIML scripts
The screenshot shows the BIML generated by this application. The design and output of Mefa are tailored to the metadata-driven framework of the data warehouse solution; in our case, the Platon data warehouse framework. Mefa can also be adapted to other frameworks.
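As a rough sketch of what such a generator emits, the following Python builds a skeleton BIML document from a package name; this is a bare skeleton for illustration, not the actual output of Mefa:

```python
# Sketch of a Mefa-like generator: emit a minimal BIML document from
# metadata. Real output would contain connections, tasks and mappings.
import xml.etree.ElementTree as ET

def generate_biml(package_name: str) -> str:
    """Build a skeleton BIML document containing one named package."""
    biml = ET.Element("Biml", xmlns="http://schemas.varigence.com/biml.xsd")
    packages = ET.SubElement(biml, "Packages")
    ET.SubElement(packages, "Package", Name=package_name,
                  ConstraintMode="Linear")
    return ET.tostring(biml, encoding="unicode")

xml_out = generate_biml("Extract_Customer")
```

The generator's job is to fill such a skeleton with connection, table and mapping details read from the framework's metadata.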
For working with BIML and making the best use of automation, we believe that a tool like Mefa represents best practice for data warehouse solutions.
Example 2: Generating DDL scripts and metadata registration scripts
The generated script:
- is easy to compare across versions,
- makes it easy to figure out what the output of the file will be, and
- always generates the same output.
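The “always generates the same output” property is worth enforcing deliberately in the generator. A small sketch of the idea, with illustrative table names:

```python
# Sketch of deterministic script generation: iterate the metadata in a
# stable, sorted order so regenerating the script yields byte-identical
# output that diffs cleanly across versions. Table names are illustrative.

def generate_script(tables: set) -> str:
    # sorted() makes the output independent of the set's iteration order
    return "\n".join(f"SELECT * FROM source.{t};" for t in sorted(tables))

a = generate_script({"Order", "Customer"})
b = generate_script({"Customer", "Order"})  # same metadata, different order
```

With a stable ordering, a version-control diff of two generated scripts shows only real metadata changes, never incidental reordering.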
Since the BIML integration with project parameters has certain issues, the Mefa implementation uses an MSSQL alias in the different environments that points to the framework. All other parameters are then read from the framework.
Regarding Deployment, Operation and Maintenance
Until now, the team has only used the automation technique for the development of Data Warehouse jobs and database objects. When code changes are tested, we follow a classical deployment process for moving the code to the Test, PreProduction and Production environments (see this blog for a description of our experiences with the deployment process). For the ETL jobs and SQL scripts generated with the automation techniques, we have made the following observations about the deployment, operation and maintenance process.
- First, when a completely new source system is added and all jobs and SQL scripts are generated through automation, all relevant connection information is normally saved in metadata registered in the database. In rare cases, the deployment process may still run into a few configuration problems, such as defining a server proxy or setting default values for parameters in the metadata. Once these problems are solved, issues rarely occur again; further deployments of changes to these jobs and scripts typically contain no errors.
- Second, maintenance and operation management of jobs and scripts created through automation is fairly easy compared to jobs and scripts developed manually by different developers. Since all relevant jobs and scripts are generated from the same template, it is easy to identify whether an error lies in the hardware or system configuration or in a specific table.
- In our observation, for a batch of 20 ETL jobs generated from the same template, it rarely happens that only one job fails while all the others run successfully. For generated jobs, execution typically ends as “succeed for all” or “succeed for none”, since all the code is aligned. In the cases where an error does occur in only one of the 20 generated jobs (of the same template), it comes either from the data source/target or from the metadata configuration specific to that job. The scope of debugging is then dramatically reduced in terms of where to look and what to do. With this in mind, debugging and operation management of these jobs and scripts becomes fairly easy and straightforward.
We used automation in the development of ETL jobs and database scripts. In our observation, the automation technique fits our needs in terms of reducing development cost, aligning job design with the architecture, and simplifying the deployment, maintenance and operational management process. The automation technique has a solid business case in our team, since the source systems we work with typically contain well over 10 deliveries, which makes the benefit very clear: one job is created as a template, while the remaining nine are generated in a single step. Normally, when a source system runs on a known DBMS such as SQL Server, Oracle or Postgres, it is easy to develop the template job and generate the rest.
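The business case can be sketched with some back-of-the-envelope arithmetic; the hour figures below are illustrative assumptions, not our measured numbers:

```python
# Rough sketch of the business case: developing one template job costs the
# full effort, while each generated job costs only a configuration step.
# All hour figures are illustrative assumptions.
manual_hours_per_job = 8      # assumed effort to hand-build one ETL job
config_hours_per_job = 0.5    # assumed effort to configure one generated job
deliveries = 10

manual_total = manual_hours_per_job * deliveries
automated_total = manual_hours_per_job + config_hours_per_job * (deliveries - 1)
saving = 1 - automated_total / manual_total
```

Under these assumptions the saving is already above 80% at 10 deliveries, and it grows with every additional delivery of the same shape.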
We believe that other teams facing similar types of source systems will also benefit from using the automation technique. The use of a BIML tool and knowledge of BIML are, however, compulsory.
Seeing the benefits of automating the development of ETL jobs and database scripts, we would consider spending more time on automating the Data Warehouse deployment and testing process. Until now, we have not seen any fully automated, continuous deployment and testing technology for the Data Warehouse scenario. The cost and risk involved in the deployment and testing processes make for a clear business case if an automation technology can be found to solve these challenges.
However, the automation technique is still facing challenges.
Our current approach is to write an automation script that generates the ETL jobs and database scripts. There is no available tool that can create and maintain the traceability and lineage between the automation script and the generated code. Although such a tool may not be necessary at the moment, it will be a “nice to have” feature in the future, when the automation technique is applied in most organizations.
We summarize our learnings with the following Do's and Don'ts.
Do:
- Start with the simple packages, where the ETL is a 1:1 process. (This is typically the Source Extraction process and the process of copying data to a Data Mart.)
- Choose areas with a lot of trivial development.
- Choose areas with complex but repetitive code.
- Make sure your code is easily extendable, so new features can be added without starting over.
- Get everyone in the team on board.
Don't:
- Make the BimlScript code too complex. Using too many nested loops and if statements makes the code hard to read.
- Make everything into a static design pattern defined by the BIML code.
- Make manual changes to a BIML-generated package. You want to be able to regenerate the package when you change something in the script, without having to reapply manual changes.