4 MuleSoft Design Considerations for High-Performing and Scalable Large Data Set Solutions
The need to process large amounts of data remains a challenge that MuleSoft developers and architects continually face. There is no shortage of use cases for large data processing, including financial transactions and log data. We recently helped a client with a project involving COVID case data - a use case where the data processing was significant and expected to grow over time.
To get these projects done right, MuleSoft architects and developers need to understand both the capabilities and constraints of the platform. They also need to grasp the best ways to design these kinds of applications and what MuleSoft features and functionality can help them achieve the project's goals.
The Default Approach to Large Data Set Processing with MuleSoft
The default approach used to address this problem involved three parts:
- Synchronously call system APIs from the process layer using an HTTP request.
- The system API retrieves the data using HTTP Request, then transforms the retrieved data and sends the response back to the process layer.
- Retrieve messages from Anypoint MQ and call the system API for each message to insert records into the database.
More often than not, though, that approach fails. Why?
- The process layer frequently times out when the system API has to retrieve a large data set.
- The system API can run out of memory while retrieving and transforming large data sets. It's often not possible or even advisable to send such large responses back to the process layer using HTTP.
- Retrieving messages from Anypoint MQ as fast as possible and calling the system API for each message can cause HTTP connectivity errors.
It doesn't need to be this way, however. MuleSoft has several features that can be used to create a more robust solution.
Using MuleSoft for Large Data Sets
Certain design considerations should be adhered to in cases like this to achieve an optimal solution using MuleSoft. These follow the best practices for API-led architecture, including respecting each of the API layers' roles, responsibilities, and functionality.
Scalability - The ability to scale the application tomorrow, and well into the future, should be considered when designing the application. With our recent COVID data project, we knew the amount of data would increase over time, so the application needed to accommodate that growth without requiring a rewrite of the services.
Reusability - The system APIs should be used both to retrieve data and to insert data into the database. Compartmentalizing this functionality will make the services reusable as demand expands.
Reliability - Few applications can afford to lose data across the process. Services must be built to maintain and validate transaction integrity and gracefully recover any lost data.
When planning an application with large data loads, it makes sense to weigh Mule's batch processing against an Anypoint MQ-based solution. Batch processing has certain drawbacks for a large data set use case. It may not allow you to have separate system APIs to retrieve data and insert it into the database, and it increases application complexity while limiting scalability. Therefore, batch processing would break the API-led design principles we've established for this type of application.
On the other hand, MQ allows more control over message processing than batch processing does. It's a straightforward solution that adheres nicely to the API-led best practices that are our driving principles.
Challenges and MuleSoft Solutions
With these elements in mind, our design was for a solution that emphasized the use of several MuleSoft features, including:
- HTTP Request and Response streaming
- Transform processor streaming functionality
- Division of large data sets into smaller batches
- Anypoint MQ's reliability, loose coupling, and throttling capabilities
- Asynchronous processing in the system API
This approach also allowed us to solve a variety of problems.
Because the data sets being retrieved were so large, we used HTTP response streaming to correct the system API timeout issues.
In the HTTP Request operation, set the output MIME type to streaming using:
outputMimeType = "application/csv; streaming=true"
For example:
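A rough sketch of what that HTTP Request operation can look like in the Mule XML configuration (the config name and path here are illustrative placeholders, not values from the actual project):

<!-- Sketch only: outputMimeType tells Mule to treat the response body
     as streamed CSV rather than loading it fully into memory. -->
<http:request method="GET" doc:name="Get case data"
              config-ref="System_API_HTTP_Request_config"
              path="/covid/cases"
              outputMimeType="application/csv; streaming=true" />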

Large data transforms caused the system APIs to run out of memory, so instead we used the streaming functionality in the Transform Message processor.
In the Transform Message processor, add the following annotation to the DataWeave script header, before the output declaration:
@StreamCapable()
For example:
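A minimal sketch of a Transform Message script using the annotation (the mapping and field names are illustrative placeholders; the point is the annotation's position in the header):

%dw 2.0
@StreamCapable()
input payload application/csv
output application/csv
---
// Field names below are illustrative placeholders
payload map (row) -> {
    caseId: row.case_id,
    county: row.county,
    reportDate: row.report_date
}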

Memory issues in the system APIs when publishing records to MQ were caused by the Anypoint MQ message size limit of 10 MB. Instead, we published batches of 100 records using the For Each processor.
For example:
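A sketch of that pattern in the flow that publishes to MQ (the queue name, config name, and the JSON serialization step are illustrative placeholders):

<!-- Sketch only: split the record collection into batches of 100 and
     publish each batch as its own Anypoint MQ message, keeping each
     message well under the 10 MB limit. -->
<foreach doc:name="For each batch of 100" collection="#[payload]" batchSize="100">
    <!-- Inside the scope, payload is the current batch of up to 100 records -->
    <ee:transform doc:name="Batch to JSON">
        <ee:message>
            <ee:set-payload><![CDATA[%dw 2.0
output application/json
---
payload]]></ee:set-payload>
        </ee:message>
    </ee:transform>
    <anypoint-mq:publish doc:name="Publish batch"
                         config-ref="Anypoint_MQ_Config"
                         destination="case-records-queue" />
</foreach>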

Process API timeouts when calling the system APIs were solved by using asynchronous behavior in the system APIs: wrap all of the processors in an Async scope inside the Try scope.
For example:
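A sketch of the system API flow structure (the listener path, flow names, and config names are illustrative placeholders, and the actual insert logic is represented by a flow reference):

<flow name="sys-api-insert-cases-flow">
    <http:listener doc:name="POST /cases" config-ref="sys_api_http_listener_config" path="/cases" />
    <try doc:name="Try">
        <!-- All of the long-running processors sit inside the Async scope,
             so the HTTP response returns without waiting for the insert -->
        <async doc:name="Async">
            <flow-ref doc:name="Insert records" name="insert-case-records-subflow" />
        </async>
        <error-handler>
            <on-error-continue doc:name="On Error Continue">
                <logger level="ERROR" message="#['Insert flow failed: ' ++ (error.description default '')]" />
            </on-error-continue>
        </error-handler>
    </try>
    <set-payload doc:name="Acknowledge caller"
                 value='#[output application/json --- {status: "accepted"}]' />
</flow>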

HTTP connectivity errors in the process API when calling the database system APIs were fixed with MQ subscriber throttling. In the MQ Subscriber, use a subscriber type of 'Polling' with a fetch size of 10 messages every second. This limits the number of messages processed at once and, therefore, the number of open connections.
For example:
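A sketch of the Subscriber configuration (the queue and config names are illustrative placeholders, and the exact XML attribute names for the polling settings can differ between Anypoint MQ connector versions; in Anypoint Studio this corresponds to Subscriber Type = Polling with a Fetch Size of 10, polling roughly once per second):

<anypoint-mq:subscriber doc:name="Subscriber"
                        config-ref="Anypoint_MQ_Config"
                        destination="case-records-queue">
    <anypoint-mq:subscriber-type>
        <!-- Pull at most 10 messages about once per second instead of
             prefetching as fast as possible -->
        <anypoint-mq:polling fetchSize="10" pollingTime="1000" />
    </anypoint-mq:subscriber-type>
</anypoint-mq:subscriber>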

Ultimately, this solution was successful for four reasons.
- MuleSoft's streaming capabilities avoid memory issues within the applications.
- Anypoint MQ or another queue-based mechanism eliminates passing large data sets over HTTP.
- HTTP timeouts are avoided thanks to the asynchronous processing in the system APIs.
- The number of active connections is better controlled using the MQ subscriber's message throttling.
Conclusion
Applications that need to process large amounts of data are more likely to increase than decrease over time. So, it's essential to understand the problems inherent in this use case and design a solution that accounts for the potential barriers while also addressing the need for good-quality services that meet API best practices. MuleSoft, thankfully, provides us with several helpful and flexible features that can be leveraged to create a scalable, reusable, and reliable solution.
Have questions or want to know more about how MuleSoft can be used to help read, manage, transform, and handle large data sets? Big Compass can help. Our team of experienced MuleSoft developers and architects is well-versed in Mule's functionality and can help you design and engineer an application that meets your needs.