Organizations running data warehouse and business intelligence (BI) initiatives understand the importance of the extract, transform, and load (ETL) process. ETL tools provide an effective approach to enterprise data integration and enforce a standard way of developing and maintaining ETL code. With the growing complexity of enterprise applications, ever-increasing data volumes, and the demand for stable and timely delivery of critical data to business users, ETL batch flows at many organizations have become so complex that BI teams face the challenge of maintaining a large number of ETL programs while ensuring the effectiveness and correctness of the flows. In many situations, the standard functionality of ETL software falls short. Here are two challenges I have met:
- When more than 1,000 tables must be extracted from a source system, developing and maintaining each ETL job becomes a highly complex task.
- For a source system with a large set of tables that must be shipped to the data warehouse, ETL job maintenance becomes even more complex when the source system changes frequently, e.g., columns are added or removed, columns or keys are moved to other tables, tables are renamed, column data types or lengths change, or an existing table is removed.
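To make the kinds of changes listed above concrete, here is a minimal, hypothetical sketch (not MAIME's actual code) of detecting source schema changes by diffing two metadata snapshots, such as those one could take from SQL Server's INFORMATION_SCHEMA.COLUMNS view. The table names, snapshot structure, and change labels are illustrative assumptions:

```python
def diff_schemas(old, new):
    """Compare two schema snapshots of the form {table: {column: data_type}}
    and return a sorted list of detected change events."""
    changes = []
    # Tables that disappeared or appeared between snapshots
    for table in old.keys() - new.keys():
        changes.append(("table_removed", table, None))
    for table in new.keys() - old.keys():
        changes.append(("table_added", table, None))
    # For tables present in both snapshots, compare column sets and types
    for table in old.keys() & new.keys():
        old_cols, new_cols = old[table], new[table]
        for col in old_cols.keys() - new_cols.keys():
            changes.append(("column_removed", table, col))
        for col in new_cols.keys() - old_cols.keys():
            changes.append(("column_added", table, col))
        for col in old_cols.keys() & new_cols.keys():
            if old_cols[col] != new_cols[col]:
                changes.append(("column_type_changed", table, col))
    return sorted(changes)

# Example: a column was widened, a column was added, and a table was added
old = {"Customer": {"Id": "int", "Name": "varchar(50)"}}
new = {"Customer": {"Id": "int", "Name": "varchar(100)",
                    "Email": "varchar(100)"},
       "Orders": {"Id": "int"}}
print(diff_schemas(old, new))
```

A real implementation would of course read the snapshots from the source database's catalog rather than from in-memory dictionaries, but the diffing logic stays the same.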
When changes need to be made in one ETL job, a typical process is that a developer from the BI team checks out the program from the code repository, implements the changes, and tests them in the development environment; all changes are then bundled and deployed to the TEST and QA environments before they finally land in the PRODUCTION environment. The key challenge is that, in a large data warehouse environment with more than 1,000 jobs, the development and delivery time and cost soon become an intractable bottleneck.
In joint work with Christian Thomsen and a group of students (Darius, Philipp, Frederik, Jacob, Søren, and Michael) at Aalborg University in the first half of 2016, we explored the concept of building a self-adaptive ETL system to address these challenges.
Self-adaptive means that a software system modifies its behavior in response to changes in its operating environment, such as end-user input or external hardware devices. The student project at Aalborg built a prototype called “Maintenance Manager for ETL (MAIME).” This tool detects schema changes in source systems and semi-automatically adjusts the affected ETL jobs. With Microsoft SQL Server and Integration Services (SSIS) as the target ETL platform, MAIME was shown to successfully repair ETL processes in response to source schema changes. For the transformations implemented in SSIS, resolving source schema changes with MAIME was compared to making the modifications manually in SQL Server Data Tools. The evaluation shows that MAIME is on average 9.8 times faster and requires 9.5 times less input from the user.
There are three key components in MAIME: the EDS Change Manager, the Maintenance Manager, and the graphical user interface (GUI). The EDS Change Manager is responsible for capturing metadata changes in the source system schema. The core logic of MAIME resides in the Maintenance Manager, where a graph-altering algorithm modifies the ETL job flows in response to the source schema changes detected by the EDS Change Manager. In the user interface, an administrator can specify whether changes such as addition, deletion, and renaming in different SSIS components (such as Conditional Split, Aggregate, and Sort) should be blocked, propagated, or prompted (to the administrator at runtime). There are also three advanced configuration options that give the administrator more flexibility. For example, the option “Use global blocking semantics” indicates whether the graph alteration algorithm should terminate (i.e., stop making changes to the SSIS job) if the policy for any vertex in the graph is set to Block.
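The block/propagate/prompt policies and the global blocking semantics can be sketched roughly as follows. This is an illustrative model in my own naming, not the actual MAIME API; the vertex names and the `ask_admin` callback are assumptions for the example:

```python
from enum import Enum

class Policy(Enum):
    BLOCK = "block"          # do not alter this vertex (or abort entirely)
    PROPAGATE = "propagate"  # apply the change automatically
    PROMPT = "prompt"        # ask the administrator at runtime

def apply_change(vertices, change, ask_admin, global_blocking=False):
    """Resolve `change` on the affected vertices according to each vertex's
    policy. With global blocking semantics, a single BLOCK vertex aborts
    the whole alteration instead of being skipped individually."""
    applied = []
    for name, policy in vertices:
        if policy is Policy.BLOCK:
            if global_blocking:
                return []  # terminate: make no changes to the job at all
            continue       # skip only this vertex
        if policy is Policy.PROMPT and not ask_admin(name, change):
            continue       # administrator declined the change
        applied.append(name)  # PROPAGATE, or a confirmed PROMPT
    return applied

vertices = [("Aggregate", Policy.PROPAGATE),
            ("Sort", Policy.PROMPT),
            ("ConditionalSplit", Policy.BLOCK)]
change = ("column_removed", "Customer", "Name")
# Auto-approve all prompts for this example:
print(apply_change(vertices, change, ask_admin=lambda v, c: True))
```

With the default per-vertex semantics, the blocked Conditional Split is simply skipped; switching `global_blocking=True` makes the same Block policy abort the entire alteration, which mirrors the “Use global blocking semantics” option described above.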
The following figure demonstrates how a data flow in SSIS is changed by MAIME.
Before the changes
After the changes by MAIME
A prototype of the MAIME system was completed in the summer of 2016. The evaluation shows the potential of MAIME for reducing maintenance workload and delivering changes to data warehouse systems in a timely manner. We would like to share the findings and contributions with anyone who is interested; please feel free to contact me or Christian Thomsen for discussion and inspiration.