The proliferation of cloud technology has led many organizations to move their computation and storage demands to the cloud. Vendors like Amazon, Microsoft, Google, and Alibaba are investing heavily to expand their offerings and footholds in this market. As the related technologies and processes mature across sectors, moving to and working with the cloud has become the new normal.
This article discusses the lessons and challenges of creating a hybrid data integration solution that migrates data from traditional on-prem environments to the cloud. The experience comes from a large international project that has been developing a global-scale web application over the past two years.
Most hybrid data solutions that integrate the cloud with traditional on-prem data centers face the following challenges.
1. Data from heterogeneous sources must be retrieved in different patterns and loaded to the cloud according to planned schedules.
2. Both the data and the transport process must be highly secured.
3. The volume of data being transported should be optimized to reduce bandwidth requirements as well as the risk of potential issues.
4. Data flows should be automated as much as possible, with a minimal number of manual trigger points.
5. Transaction control, or ACID-level control, should be applied to the process wherever possible to ensure data consistency.
Given these challenges and the general scenario of migrating and transporting on-prem data to the cloud, several typical patterns emerge.
Pattern 1: On-prem Hub to Cloud Hub
One common approach is to collect the on-premises data from different sources into one central database and load it from this on-premises central database into a central database in the cloud. Applications on the cloud then retrieve data from the central cloud database.
As illustrated in the picture below, the benefits of this architectural pattern are quite clear.
- By centralizing the necessary data from on-premises systems, the amount of data and the frequency of synchronization are centrally managed and optimized, reducing the overhead of sending redundant data and keeping network bandwidth usage to a minimum.
- As an extension of the previous point, since centralization is established at both the on-premises and cloud databases, it enables centralized data validation, enrichment, encryption, and other activities. For example, a data masking feature required by GDPR can run against the on-premises databases before the data goes over the network.
- The effort to set up a network security zone for the data synchronization is optimized. Sending on-premises data to the cloud requires the ability to move the data through a safe "transfer zone" where no service principal or user account from the cloud can modify on-premises items. Setting up such network security protocols is not a trivial effort. Centralizing the data transfer needs maximizes the use of the established network security zone and therefore avoids the cost of repeating the same exercise.
As is the nature of centralized designs, the challenges are also clear.
- This architecture only suits batch-driven processes. Since data lands in databases through I/O processes at least twice, the time delay is not negligible.
- Because of the network security zone, the process of collecting on-premises data into the central on-premises database and the process of loading data from the on-prem database to the cloud databases cannot be controlled by a single job scheduling tool. The schedulers on the on-premises and cloud sides cannot manage each other except by using time to arrange the job runs. In that case, one simply has to hope that all jobs complete on time.
- Since all data uploading jobs are bundled together to optimize network traffic, without very careful job and flow design a failure in one specific job or table can delay the delivery of all other tables. When end users and cloud applications need data at different update frequencies, fulfilling the various scheduling needs while keeping the databases and jobs stable and available becomes a complicated and sometimes unsolvable task.
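One common way to keep the transported volume small in this hub-to-hub pattern is watermark-based incremental extraction: each sync sends only the rows changed since the last run. The following is a minimal sketch of that idea; the table name, columns, and in-memory SQLite database are hypothetical stand-ins for the central on-premises hub, not details from the project described here.

```python
import sqlite3

def extract_delta(conn, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload, modified_at FROM orders "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only if new rows were found.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Demo against an in-memory database standing in for the on-prem hub.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, payload TEXT, modified_at INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", 100), (2, "b", 150), (3, "c", 200)],
)

rows, wm = extract_delta(conn, last_watermark=100)  # only rows after t=100
```

Only the delta crosses the network security zone; the persisted watermark is the state the next scheduled run starts from.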
Pattern 2: On-prem Hub to Multiple Cloud Endpoints
The centralized cloud data hub can bring extra complexity when multiple cloud applications have different requirements on the size, frequency, and transformation logic of the data. In the microservices and API-driven model, different pieces of application logic are grouped into small units, each independent of the others. If the data feeds for all these independent microservices come from the same centralized cloud data hub, their availability and consistency are challenged. The centralized cloud data hub is often seen as a natural place to "integrate" and "centralize" shared business logic. However, this takes away the freedom of the individual microservices; an alternative way to "integrate" is to let the microservices implement the "shared" business logic and expose it as a shared API. The data layer then remains a simple ETL layer where the "T" is kept minimal or removed entirely.
The following picture illustrates this pattern.
This pattern is batch-oriented. The role of the on-premises central database is to collect and stage all data from the various on-premises source systems without any further integration. Based on the different data requirements of the different cloud applications and microservices, separate data flows are executed through the network security zone to send data to the corresponding cloud data stores. A cloud data store here can be a SQL database, blob storage, data lake storage, a document database, or any other kind of NoSQL store.
Benefits of this pattern are listed here.
- The complexity of integration in the database and in the data pipeline from on-premises to the cloud is dramatically reduced. The data layer remains a simple transportation and storage layer before data reaches the cloud data stores of each individual cloud application and microservice.
- The application logic and integration requirements live in the implementation of the microservices. This fits the microservices architecture very well. In fact, compared to database and data pipeline development, the tools for implementing microservices offer more flexibility, such as complex transformations at a granular level, automated testing, full CI/CD support, and so on.
- It provides flexibility for the different schedules and data needs of each microservice. Inter-dependencies between data pipelines are avoided.
- Since no integration or transformation happens in the database and data pipelines, changes to existing data flows, such as adding or removing columns or altering data types, can be implemented through versioning. This ensures maximum stability of the existing data and data pipelines, as well as of the subsequent microservices and cloud applications.
Drawbacks of this pattern are also quite obvious.
- To serve different microservices, the data pipelines from on-premises to the cloud have to send the same dataset multiple times to various cloud data stores. The network bandwidth is challenged.
- Consistency of data cannot be validated directly in the database, since integration happens only at the API and microservice level. In operational scenarios where some pipelines fail while most others succeed, the corresponding APIs will serve inconsistent data.
- Implementing delta data loads in the data pipelines is challenging, since different microservices demand different datasets or subsets of datasets.
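The versioning benefit mentioned above can be sketched as publishing each schema change as a new dataset version alongside the old one, so existing microservices keep reading v1 while new consumers adopt v2. This is a hypothetical illustration; the dictionary stands in for versioned folders in a blob container or data lake, and the dataset name and columns are invented.

```python
import json

store = {}  # stands in for one blob/data-lake path per dataset version

def publish(dataset, version, rows):
    """Write a dataset snapshot under a versioned path."""
    store[f"{dataset}/v{version}"] = json.dumps(rows)

def read(dataset, version):
    """Read the snapshot a consumer has pinned itself to."""
    return json.loads(store[f"{dataset}/v{version}"])

# v1 keeps the original columns; v2 adds a column without breaking v1 readers.
publish("customers", 1, [{"id": 1, "name": "Ann"}])
publish("customers", 2, [{"id": 1, "name": "Ann", "segment": "retail"}])
```

Each microservice pins the version it reads, so a schema change never forces a coordinated deployment across all consumers.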
Pattern 3: Event-driven Integration
In modern enterprise solutions, the use of enterprise service buses and message queues has become popular and effective. Sending messages as events to cloud applications is considered the most common way to achieve real-time or near-real-time data integration between on-premises systems and cloud endpoints.
A typical architecture for such event-driven integration is depicted below. An "event generator" runs alongside the source system and pumps event data into a message queue system. An "orchestrator" application then takes the messages from the message queue, passes through a security proxy, and sends the data to different event stores in the cloud. Data in the event stores is then consumed by the various cloud applications.
An event generator can be a change data capture (CDC) application, which is typically installed alongside database systems or data stores. Another typical source of messages for the message queue is an enterprise service bus. In the network security zone, the security proxy acts as a firewall that examines all network messages. The orchestrator pulls the data from the message queue in order to populate the event stores.
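The orchestrator's routing role described above can be sketched as a small loop that drains the queue and dispatches each message to the matching event store. This is a minimal, hypothetical sketch: the in-process `queue.Queue` stands in for the on-prem message queue, the lists stand in for cloud event stores, and the `topic` routing key is an invented convention.

```python
import queue

message_queue = queue.Queue()                      # stand-in for the MQ
event_stores = {"orders": [], "inventory": []}     # stand-ins for event stores

def orchestrate(q, stores):
    """Drain the queue, routing each message to the store for its topic."""
    while not q.empty():
        msg = q.get()
        stores[msg["topic"]].append(msg["body"])

message_queue.put({"topic": "orders", "body": {"id": 1}})
message_queue.put({"topic": "inventory", "body": {"sku": "X", "qty": 3}})
orchestrate(message_queue, event_stores)
```

In a real deployment this loop would run continuously behind the security proxy, and a failed dispatch would leave the message on the queue for retry rather than dropping it.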
Using an event-driven architecture brings several benefits.
- Integration of data is close to real time, apart from the short latency introduced by network traffic.
- Data storage and processing in the intermediate layers of the integration process are kept to a minimum. In standard configurations, events are kept for a certain retention period and then discarded. In a data-reload scenario, historical events can be replayed from a given point in time through the event store. An alternative is to save the consumed events as simple files in a cloud storage service and pick them up later through a batch process.
- All data integration and transformation logic is kept in the implementation of the cloud applications and microservices. This reduces the complexity of the flows from the source systems to the event stores.
Despite these promising traits, the event-driven architecture also comes with challenges and risks.
- As the event transportation process passes through several layers of the architecture, the resilience and robustness of the underlying infrastructure are often challenged. While losing 2% of the data may be acceptable in many big data and data science scenarios, the event-driven architecture is often used for real-time BI and for data integration supporting online systems built as web apps. In the online web app scenario, losing 2% of the data means a serious quality loss.
- Achieving a speedy and precise recovery from system downtime or infrastructure disasters is often a difficult design challenge when implementing the event-driven architecture. Almost all current event-based software provides retention and replay functionality to support reloading messages. In an enterprise environment, however, the on-premises message queue and the cloud event store may be owned and managed by separate organizational units or partially external vendors, such as outsourced operations organizations. Recovering from system downtime then depends on Service Level Agreements and, very often, manual operations. Again, for scenarios like big data and data science, the delay and potential loss may be acceptable, but this is a show-stopper for online applications. Creating an event-driven architecture for an online integration scenario therefore requires also implementing an automated recovery and reload process to mitigate the risk of incorrect or unavailable data whenever past events must be replayed.
- As experienced in our current setup, an event-driven architecture works well for online data integration at small event volumes. However, when the volume grows beyond roughly one million events per day and the integration of the event data involves complex transformations such as aggregations, multi-level joins, and filters, the design of the program structure and the supporting infrastructure is challenged. One has to use external data structures such as in-memory data stores to optimize the transformation process.
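The in-memory optimization mentioned in the last point amounts to keeping a running aggregate keyed by the join key and updating it per event, rather than re-joining and re-aggregating the full history on every message. A minimal, hypothetical sketch (the `defaultdict` stands in for an in-memory store such as Redis, and the event fields are invented):

```python
from collections import defaultdict

running_totals = defaultdict(float)  # stand-in for an in-memory data store

def on_event(event):
    """Update the aggregate incrementally; O(1) per event, no re-scan."""
    running_totals[event["account"]] += event["amount"]

for e in [{"account": "A", "amount": 10.0},
          {"account": "B", "amount": 5.0},
          {"account": "A", "amount": 2.5}]:
    on_event(e)
```

The trade-off is that this state itself must now survive restarts, which loops back to the recovery-and-replay concerns above.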
Implementation and Observations
We have implemented all of the above architectures in practice.
Our cloud implementations used many Azure components:
- Azure SQL as the relational database system
- Azure CosmosDB as the cloud data store
- Redis as the cloud data store
- Azure Blob Storage as the cloud data store
- Azure Data Factory as the cloud data movement and scheduling tool
- Azure Databricks as the cloud data transformation tool
- Azure Event Hubs as the event store
- Logic App as the event flow trigger
- WebJob as part of data flow
While developing and operating this evolving platform over the past two years, we experienced the following challenges, which apply to all of these patterns.
1. Quality of Data Transportation
Although the quality of the data itself receives plenty of attention, the most important quality aspect of the integration process is in fact the quality of transporting the data from the source to the endpoints at the web applications and microservices. Because of the nature of these architectural patterns, data is copied and transformed multiple times before it reaches the endpoint for the UI, so ensuring and measuring data loss, bugs, and inconsistency poses a real challenge. For example, the copy activity in Azure Data Factory can be configured to "filter out" unmatched rows; how those dropped rows can be analyzed or reprocessed, and how to measure the validity of a pipeline run when rows are dropped, is left to the specific design.
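One design that keeps dropped rows measurable rather than silently filtered is to route each failing row to a reject store with a reason, so the pipeline can alert on the counts and the rows can later be reprocessed. This is a hypothetical sketch of that idea, not an Azure Data Factory feature; the validation rule and field names are invented.

```python
def copy_with_rejects(rows, required_fields):
    """Split rows into loadable ones and rejects with a recorded reason."""
    loaded, rejected = [], []
    for row in rows:
        missing = [f for f in required_fields if row.get(f) is None]
        if missing:
            rejected.append({"row": row, "reason": f"missing {missing}"})
        else:
            loaded.append(row)
    return loaded, rejected

rows = [{"id": 1, "price": 9.5}, {"id": 2, "price": None}]
loaded, rejected = copy_with_rejects(rows, required_fields=["id", "price"])
```

Comparing `len(loaded) + len(rejected)` against the source row count gives a simple per-run transport-quality metric.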
2. Transaction Completeness
As a data flow in such an integration process passes through multiple endpoints, existing tools do not provide a distributed transaction mechanism to maintain the transactional completeness of the data. When a failure occurs at one point in the data pipeline, what has already happened, such as an executed stored procedure in the target database, cannot be automatically rolled back. Where a relational database system is involved, that part of the pipeline can use the database's transaction mechanism to protect the operations inside the database. But where the source or target is a data lake or blob storage, there is no native support for transaction control.
Although transaction control is not necessarily needed in every scenario, the design of each data pipeline has to include a way to handle exceptions and protect the completeness of the data.
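In the absence of distributed transactions, one common way to handle such exceptions is compensating actions: each pipeline step registers an "undo" that is executed, in reverse order, when a later step fails. A minimal, hypothetical sketch of the idea (the step functions are invented placeholders for real pipeline activities):

```python
def run_pipeline(steps):
    """Each step is a (do, undo) pair; on failure, undo completed steps in reverse."""
    done = []
    try:
        for do, undo in steps:
            do()
            done.append(undo)
    except Exception:
        for undo in reversed(done):
            undo()  # compensating action for an already-committed step
        raise

log = []

def stage():
    log.append("staged")       # e.g. copy files into a staging container

def unstage():
    log.append("unstaged")     # e.g. delete the staged files again

def failing_load():
    raise RuntimeError("load failed")  # simulate a failure mid-pipeline

failed = False
try:
    run_pipeline([(stage, unstage), (failing_load, lambda: None)])
except RuntimeError:
    failed = True
```

Compensations are not true rollbacks (another consumer may have read the staged data in between), so they protect completeness, not isolation.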
3. Use of Serverless
Many cloud vendors offer serverless compute capabilities. One must understand that the serverless model often comes with the cost of initializing or resizing its compute cluster when the required capacity changes. Consider, for example, a batch job bulk-inserting data into an Azure SQL database that runs on the serverless model. When no batch job is running, the database scales down to its minimum configuration. When the batch job starts, the serverless infrastructure needs a "warm-up" period to add capacity to the database before it can satisfy the job's needs.
When pipelines run frequently, the use of a serverless service must be tested and evaluated, as it may be outperformed by a provisioned service model.
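During that warm-up window, the first calls after an idle period may fail or time out while capacity is being added, so batch jobs typically retry with exponential backoff instead of failing outright. A minimal sketch of that pattern; the flaky function below merely simulates a database that is still resuming, and the exception type is an illustrative choice.

```python
import time

def call_with_retry(fn, attempts=4, base_delay_s=0.01):
    """Retry fn with exponential backoff on connection errors."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay_s * (2 ** attempt))

calls = {"n": 0}

def flaky_insert():
    calls["n"] += 1
    if calls["n"] < 3:                       # simulate the warm-up window
        raise ConnectionError("database resuming")
    return "inserted"

result = call_with_retry(flaky_insert)
```

The backoff budget should cover the service's worst-case resume time, which is exactly what the evaluation mentioned above needs to measure.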
4. Job Scheduling
Running pipelines whose endpoints cannot all be controlled by a single scheduler is challenging and sometimes unpredictable. In the first pattern, where the central on-premises database must be integrated with the central cloud database, a scheduler on either side cannot control both halves of the data pipeline because of the firewall in the network security zone.
In such cases, one option is to use time to connect the on-premises schedule with the schedule of the cloud data pipelines. However, this is a fragile design, since the completion time of the upstream flow can easily change under different conditions.
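One alternative to purely time-based coupling is a sentinel marker: the upstream job writes a "done" marker when it finishes, and the downstream pipeline polls for that marker before starting, instead of trusting a fixed clock time. A hypothetical sketch, with an in-memory set standing in for, say, a marker file in shared blob storage; the marker naming convention is invented.

```python
import time

markers = set()  # stand-in for sentinel files in a shared storage location

def wait_for_marker(name, poll_fn, timeout_s=1.0, interval_s=0.01):
    """Poll until the marker appears or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if poll_fn(name):
            return True
        time.sleep(interval_s)
    return False

markers.add("orders_2024-01-01_done")  # upstream job signals completion
ready = wait_for_marker("orders_2024-01-01_done", lambda n: n in markers)
```

This still cannot make the two schedulers manage each other, but it turns "hope the upstream finished" into an explicit, testable readiness check with a bounded wait.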
5. Satisfying the CI/CD
In the concept of continuous integration and continuous deployment (CI/CD), one expects to change the code and let the CI/CD process automatically test and deploy the changes. This concept is severely challenged under a hybrid integration architecture. Data pipelines are developed at different endpoints of the data flow, often with different tools. When data stores are involved, even more database or storage types may be in play. The CI/CD pipeline must therefore include automated processes for all the tools and systems involved and ensure the correct build and deployment order.
In our experience, a completely automated process is hard to achieve when three or more tools and repositories are involved in the whole pipeline.
6. Achieving Global Consistency
In all three patterns, the work of transforming, merging, and integrating data from various sources happens at the far end of the data flow. When the same dataset is delivered to two different cloud APIs for certain business calculations, a failure in one of the two delivery pipelines can cause inconsistency between the results presented by the two APIs. The risk of such scenarios grows rapidly once the number of data pipelines exceeds 100 and the business logic behind each API involves more than three tables shared with other APIs.
In the event-driven architecture, achieving consistency of data across all APIs is even more challenging. To compensate for latency and the complexity of the computational logic, a typical pattern is to partition the data and distribute the calculation of events; when a distributed computation fails or is delayed, inconsistency in the output naturally follows. Although applying the concept of transaction control to event processing may not be the ultimate solution, functionality for achieving global consistency is absent from most existing tools.
Expectation of Future Integration Patterns
Looking back at what we have achieved and learned, we can foresee quite a few future steps for large organizations.
1. In the enterprise IT world, the strategy is to adopt or migrate to the cloud. Traditional database-centric integration is gradually being replaced by event-driven and NoSQL-based integrations. Instead of database connections and FTP servers, more and more enterprise applications consume and process events. NoSQL data stores are widely used as a beneficial replacement for traditional databases and data warehouse appliances. More technologies and solutions are emerging to support events and NoSQL data stores.
2. Beyond the growing need to include data mining activities in BI data flows, the integration of data for online systems increasingly receives requirements from business stakeholders to embed data mining logic in real-time data transformations. A simple example is using forecast models to predict what comes next in a given scenario. When such "predictions" are updated in real time, business users gain the competitive advantage of being the first and fastest in the market to know what is coming. In the online gambling sector, there are plenty of such examples.
3. Cloud-driven architectures enable more connected, API-driven, real-time online systems. End users and business sponsors request fresh and stable data while defining ever more complex processing logic for the backend components. As we have experienced, getting a stable SLA across the separate parties involved in a hybrid data integration architecture requires coordinating different units of a large organization. Scenarios such as incident management and disaster recovery are yet to be explored and defined. Traditional ITIL principles must be refined to cover on-premises/cloud hybrid architectures.
To summarize, data integration in cloud or cloud/on-premises hybrid applications has confronted traditional approaches, built on database replication, batch jobs, and change data capture tools, with problems they cannot solve. Architectures that favor event and stream processing deliver faster and more convenient solutions for cloud and API-driven environments. Given that the well-known and widely used tools for event processing are only a handful, more mature and stable commercial event processing tools are yet to come to help solve the current wave of data integration challenges.