Batch Scope and Data Replication with MuleSoft
In a previous article, we introduced the Big Compass approach to API-Led data synchronization and replication using MuleSoft. To fully understand “what’s Batch got to do with it,” we have to look to the other half of the data problem – replication.
When considering MuleSoft for a data integration solution, it’s easy to see how well it suits data synchronization and to assume it is a poor fit for data replication. However, the Batch Scope in Mule 4 provides the parallelism and streaming tools necessary to handle data replication on large data sets.
Batch Scope of Mule 4
The Batch Scope of Mule 4 allows you to split your payload into pieces and process multiple pieces in parallel. Mule 4 calls each of these pieces a “record,” and each record will proceed independently through the scope.
The Batch Scope consists of one or more Batch Steps, each of which can contain multiple event processors, followed by an optional On Complete phase. Behind the scenes, there is also a Load phase in which Mule transparently creates these records and a queue to hold them. The Mule engine then uses up to 16 threads in parallel to process blocks of records through the Batch Steps and return them to the queue.
Finally, the On Complete phase can be used to gather metrics about the number of successful records and the errors that were encountered.
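The phases described above can be modeled in a few lines of plain Python. This is only an illustrative sketch of the pattern, not Mule’s implementation or API: the names run_batch, block_size, and max_threads are hypothetical, the sketch runs one step at a time across all blocks for simplicity, and it keeps its records in memory rather than in a persistent queue.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(payload, steps, block_size=100, max_threads=16):
    """Toy model of a batch job: load, process blocks in parallel, summarize."""
    # Load phase: split the payload into records and group them into blocks.
    records = [{"value": item, "error": None} for item in payload]
    blocks = [records[i:i + block_size] for i in range(0, len(records), block_size)]

    def process_block(block, step):
        # Each record moves through the step independently; a failure marks
        # only that record and does not stop the rest of the block.
        for record in block:
            if record["error"] is None:
                try:
                    record["value"] = step(record["value"])
                except Exception as exc:
                    record["error"] = str(exc)

    # Process phase: worker threads pick up blocks and run them through each step.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for step in steps:
            list(pool.map(lambda block: process_block(block, step), blocks))

    # On Complete phase: report how many records succeeded and failed.
    failed = [r for r in records if r["error"] is not None]
    return {
        "total": len(records),
        "successful": len(records) - len(failed),
        "failed": len(failed),
    }


def reject_negative(n):
    # Hypothetical step that fails for some records.
    if n < 0:
        raise ValueError("negative value")
    return n


summary = run_batch([1, -2, 3, 4], steps=[reject_negative, lambda n: n * 10], block_size=2)
print(summary)  # {'total': 4, 'successful': 3, 'failed': 1}
```

The record that fails in the first step is skipped by the second step, while the other three records continue, which is the independence property the Batch Scope relies on.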
It’s easy to see the potential of this feature.
To illustrate the value, let’s take the example of loading 3 million records into Salesforce. Each record is about 1KB of data, a 0.1 vCore worker has 500MB of memory, and we’ll use the maximum concurrency allowed: 16 threads. As an upper bound, the system can fit about 31,000 records per thread in memory at once: 500MB ÷ 16 threads ≈ 31MB per thread, and 31MB ÷ 1KB per record ≈ 31,000 records.
Doing this means we should set our block size to roughly 30,000 to minimize the number of load operations without overfilling the memory and preventing Mule from running all the threads at once.
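The sizing rule is plain division, so it is easy to check in code. The sketch below assumes that all of the worker’s memory is available to hold records, which overstates what a real runtime can spare, and it uses decimal units. The helper name records_per_thread is hypothetical, and the 1MB case is included only to show how strongly the result depends on record size.

```python
def records_per_thread(memory_bytes, threads, record_bytes):
    """Upper bound on records each thread can hold if memory is split evenly."""
    return memory_bytes // threads // record_bytes


MB = 1_000_000
KB = 1_000

# 500 MB of worker memory shared by 16 threads, with 1 KB records.
small = records_per_thread(500 * MB, 16, 1 * KB)
# The same worker with hypothetical 1 MB records.
large = records_per_thread(500 * MB, 16, 1 * MB)
print(small, large)  # 31250 31
```

In practice the block size should sit below this bound, since the runtime and the rest of the application also need memory.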
This would give us up to a 16x performance increase over inserting all of these records sequentially, assuming the target system can keep up with 16 parallel writers. The Batch Scope allows you to set this up with a few simple settings, solving most of your problems in one stroke. The takeaway is that MuleSoft is not limited to data synchronization; it can also be used for data replication.
World Class Data Replication Solutions
Beyond just having the capability to perform data replication at scale, we believe that MuleSoft’s Batch Scope enables an enterprise to develop a world-class solution for data replication.
Our conclusion is based on the following factors.
First, the Batch Scope abstracts away many of the toughest parts of parallelism, including concurrent access to objects and thread-safe behavior. The Mule runtime engine takes care of these issues, which allows your enterprise to focus on building your solution rather than on the frustrating debugging process of trying to replicate edge-case race conditions. Of course, parallel programming will always have some level of difficulty, but using the Batch Scope lets you skip past most of the hurdles, enabling teams of almost any experience level to take advantage of the benefits of parallelism.
Another significant advantage of the Mule 4 Batch Scope is its support for streaming data. Streaming eliminates a common pitfall of data replication: too much data and not enough memory. Of course, data streaming has some limitations. There is a small performance impact, and you can only access your payload sequentially rather than randomly. However, we believe the benefits of using MuleSoft’s Batch processing capabilities outweigh these limitations.
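The trade-off between memory use and sequential access can be seen in a small Python sketch. It is an analogy built on generators, not Mule’s streaming implementation, and the names stream_records and in_blocks are hypothetical.

```python
from itertools import islice


def stream_records(count):
    """Yield records one at a time instead of building the full list in memory."""
    for record_id in range(count):
        yield {"id": record_id}


def in_blocks(records, block_size):
    """Group a sequential stream into blocks without reading ahead of need."""
    iterator = iter(records)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


# Only one block of records is materialized at a time.
block_lengths = [len(block) for block in in_blocks(stream_records(250), block_size=100)]
print(block_lengths)  # [100, 100, 50]
```

Because the stream can only be read front to back, a lookup such as “give me record 200” is not possible without consuming the 200 records before it, which is the sequential-access limitation mentioned above.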
Finally, several key Anypoint Connectors integrate with the Batch Scope. For example, the widely used Database, NetSuite, and Salesforce connectors all allow additional control over their use in the Batch Scope. They can be used with a batch aggregator to insert a large set of records. If some records fail, rather than failing the whole group, the connectors will insert as many records as possible, track the records that were successfully accepted, and log any failures for notification. This further simplifies the task of building common parallel solutions with the Batch Scope.
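The partial-success behavior can be sketched as follows. This is a plain Python illustration under stated assumptions, not the connectors’ API: bulk_insert stands in for a hypothetical bulk call that returns one result per record, and the record fields and validation rule are invented for the example.

```python
def bulk_insert(records, accept):
    """Hypothetical bulk call that returns one result per record, in order."""
    return [{"record": r, "success": accept(r)} for r in records]


def insert_in_groups(records, group_size, accept):
    """Send records in fixed-size groups and track each record's outcome."""
    inserted, failed = [], []
    for start in range(0, len(records), group_size):
        group = records[start:start + group_size]
        for result in bulk_insert(group, accept):
            # A rejected record is tracked; the rest of its group still lands.
            (inserted if result["success"] else failed).append(result["record"])
    return inserted, failed


# Invented sample data: every fourth record is missing its name.
records = [{"id": i, "name": "" if i % 4 == 0 else f"provider-{i}"} for i in range(10)]
inserted, failed = insert_in_groups(records, group_size=5, accept=lambda r: bool(r["name"]))
print(len(inserted), [r["id"] for r in failed])  # 7 [0, 4, 8]
```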
At Big Compass, we’ve used MuleSoft’s Batch Scope successfully for several clients to meet their data replication needs. At one state government client, we used batch jobs for two separate solutions. The first solution synced health care provider information from their legacy source system to Salesforce. We were then able to leverage this first solution’s design pattern for a solution to replicate COVID testing data between a source database and a reporting system.
Not surprisingly, we were able to leverage an API-led connectivity approach for data synchronization and data replication solutions using MuleSoft’s batch processing capabilities.
What’s the Catch with Batch?
It is important to be aware of the limitations of the Batch Scope, and of MuleSoft in general. The three primary limitations we’ve encountered are transaction handling, data storage, and orchestration.
In this context, a transaction is a set of operations that you decide must fully succeed. If any operation in a transaction fails, the entire transaction is rolled back. To prevent race conditions, the Batch Scope confines each transaction to a single record within a single Batch Step. You cannot create a transaction that includes multiple records, and you cannot create a transaction that spans multiple Batch Steps (though it can span multiple processors within one Batch Step).
There are a couple of ways to deal with this limitation. The simplest is to combine everything that needs to be part of a transaction into one Batch Step. This may create long-running Batch Steps, but there are always trade-offs, and you still get the benefits of some parallelism. If transactions that span multiple steps, or that span your entire set of incoming data, are necessary, you have two choices.
One option is to switch to sequential processing, which will give you more control over exactly when data is inserted and how to roll it back. Alternatively, to keep the advantages of the Batch Scope, use it to extract, transform, and load your data as normal and then add an optional rollback at the end. Use the On Complete phase to determine whether any records failed and a rollback is needed.
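The load-then-roll-back option can be sketched in plain Python. This models the decision logic only and is not Mule code: replicate_with_rollback, load, and undo are hypothetical names, the set stands in for the destination system, and the undo step is a compensating action rather than a true database transaction.

```python
def replicate_with_rollback(records, load, undo):
    """Load every record, then undo the successful loads if any record failed."""
    loaded, failed = [], []
    for record in records:
        try:
            load(record)
            loaded.append(record)
        except Exception:
            failed.append(record)

    # Equivalent of the On Complete check: decide once, after all records ran.
    if failed:
        for record in reversed(loaded):
            undo(record)
        return {"rolled_back": True, "failed": len(failed)}
    return {"rolled_back": False, "failed": 0}


target = set()  # Stand-in for the destination system.


def load(record):
    if record is None:
        raise ValueError("empty record")
    target.add(record)


result = replicate_with_rollback(["a", "b", None, "c"], load, target.discard)
print(result, sorted(target))  # {'rolled_back': True, 'failed': 1} []
```

This approach only works when every load has a reliable inverse, so it suits inserts into a staging area better than updates that overwrite existing data.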
The other two constraints on MuleSoft go together. MuleSoft is not a stateful business process engine that can orchestrate long-running processes, and Mule does not have a database to store your data. Those constraints are by design as MuleSoft is intended to power robust APIs to expose and connect your other systems.
It may be obvious advice, but don’t expect to make MuleSoft your central data store. Instead, you should be using it to build an API in front of your central data store, making it easily accessible.
For orchestration, some data replication patterns require specific jobs to run before other jobs or require synchronization jobs to be paused while the replication occurs. There isn’t a way to do this from inside Mule – you can’t check one job's status from within another. The way we handle this sort of requirement has been to use an external data store. For example, add a table to your database where Mule can log job status and completion and check that database when a new job triggers to make sure it won’t interfere with something else.
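A job-status table of this kind can be sketched with Python’s built-in sqlite3 module. The table name job_status, its columns, the job names, and the helper functions are all assumptions made for illustration. A production version would also need to make the check and the insert atomic, for example with a database lock, so that two jobs cannot start at the same moment.

```python
import sqlite3


def try_start_job(conn, job_name, blocked_by):
    """Record the job as RUNNING unless a conflicting job is still running."""
    placeholders = ",".join("?" for _ in blocked_by)
    running = conn.execute(
        f"SELECT COUNT(*) FROM job_status "
        f"WHERE status = 'RUNNING' AND job_name IN ({placeholders})",
        list(blocked_by),
    ).fetchone()[0]
    if running:
        return False
    conn.execute(
        "INSERT OR REPLACE INTO job_status (job_name, status) VALUES (?, 'RUNNING')",
        (job_name,),
    )
    conn.commit()
    return True


def finish_job(conn, job_name):
    """Mark the job as COMPLETED so that waiting jobs may proceed."""
    conn.execute(
        "UPDATE job_status SET status = 'COMPLETED' WHERE job_name = ?", (job_name,)
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_status (job_name TEXT PRIMARY KEY, status TEXT NOT NULL)")

started_replication = try_start_job(conn, "replication", blocked_by=["sync"])
blocked_sync = try_start_job(conn, "sync", blocked_by=["replication"])
finish_job(conn, "replication")
started_sync = try_start_job(conn, "sync", blocked_by=["replication"])
print(started_replication, blocked_sync, started_sync)  # True False True
```

The synchronization job is refused while replication is running and allowed once replication has been marked complete, which is the pause-and-resume behavior described above.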
These limitations can be handled with proper design. MuleSoft’s power in building event-driven synchronization systems has always been apparent, but now you also understand how it can be used for highly parallel data replication.
If you’re wondering how you can use MuleSoft to build a solution to meet your data synchronization and replication needs, reach out to Big Compass. We’d love to help you solve that problem.