Take Your MuleSoft DevOps Beyond Your CI/CD Pipeline
Despite the title of this piece, we're not here to define DevOps or even delve into the benefits of moving to it. We're also not discussing Continuous Integration/Continuous Delivery (CI/CD) or defining it.
The reality is that these concepts have been well defined and explained in numerous other excellent references, and Big Compass and others have outlined the benefits. For instance, MuleSoft introduced the Mule Maven plugin with Mule 4 and provides first-class support for the Maven Project Object Model (POM) file. There are numerous tutorials on using the Maven plugin to create a CI/CD pipeline.
And we're certainly not going to be talking about Continuous Deployment, a practice we've come to realize many of our customers will never be able to fully automate.
So what's left? It's time to focus on proactivity. Now, you might argue that, by definition, CI/CD is proactive, letting you perform incremental builds, run automated tests, and allowing you to spot defects before they are problems. And you'd be right. Numerous case studies have shown the business benefits of a good CI/CD process.
However, we have a case study to share that shows why it's time to make DevOps even more proactive.
Case Study: The Outage
One of our customers recently experienced an outage with their cloud-based content management system. They have a MuleSoft API that lets application consumers interact with that system.
When the outage happened, guess how the Ops team found out? They were alerted to the problem by emails with subject lines like "MuleSoft isn't working," issue escalations, texts blowing up the Ops team's mobile phones with hundreds of messages, and even desperate phone calls.
An old-fashioned alerting system, to be sure - but surely the response was modern? Unfortunately, no. Instead, the Ops team found themselves trying to diagnose the problem with the tried-and-true methods: downloading and parsing megabytes of log files, pulling countless people onto web meetings, distributing outage notifications, and involving vendors. It's a common scenario, and one that almost all of us have experienced.
Worse, this isn't a rarity. We've witnessed customers appearing on the late local news due to an IT Ops issue. But this raises the question: Why are some of our DevOps practices, like CI/CD, proactive, while some of the most visible and critical parts of the practice - in particular, the Ops part - are reactive?
There should be a better way - and fortunately, there is. Allow us to introduce the idea of Proactive Ops in DevOps, leveraging Anypoint Platform's capabilities.
What are the Benefits of Concentrating on Proactive DevOps?
Proactive DevOps - what we're calling DevPOps - has a lot of advantages over more traditional DevOps practices. With DevPOps, you:
- Are, at a minimum, the second to know about potentially major business disruptions.
- Experience less downtime for the business, which benefits your top-line revenue, your bottom line, or both.
- Respond to issues more efficiently, which also makes those responses less costly.
- Provide a better user experience, which in turn builds confidence in your infrastructure.
Proactive DevOps Use Cases
We've seen the results of DevPOps benefit several of our existing clients. One particular client was able to reap the benefits from just two incidents.
As part of a program to implement proactive ops at one of our customers, we enabled Custom Application Notification alerts using MuleSoft's CloudHub Connector. The alerts were configured to automatically generate emails to the Ops team members providing 24x7x365 support.
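In a Mule flow this is typically done with the CloudHub Connector; the same idea can be sketched against CloudHub's Notifications REST API. Below is a minimal Python sketch - the endpoint path, payload field names, and the `orders-sapi` domain are illustrative assumptions that should be verified against current Anypoint Platform documentation:

```python
import json
import urllib.request

# Assumed base URL for Anypoint Platform's REST APIs.
ANYPOINT_BASE = "https://anypoint.mulesoft.com"

def build_notification(domain: str, message: str, priority: str = "WARNING") -> dict:
    """Build the JSON payload for a CloudHub custom notification.

    The field names (domain, message, priority) follow the CloudHub
    Notifications API as we understand it; confirm them against the
    current Anypoint documentation before relying on this.
    """
    return {"domain": domain, "message": message, "priority": priority}

def post_notification(token: str, payload: dict) -> None:
    """POST the notification to CloudHub. A 'custom application
    notification' email alert in Runtime Manager then fans it out
    to the on-call Ops team."""
    req = urllib.request.Request(
        f"{ANYPOINT_BASE}/cloudhub/api/notifications",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses
```

The key design point is that the Mule application raises the notification the moment it detects a backend problem, rather than waiting for end users to report it.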
The first incident of note that happened after enabling the alerts was caused by a MuleSoft System API that began having connectivity issues. The API interacts with a custom application that provides end-user services and runs on a platform hosted by the customer.
The alerts informed the Ops team that the platform was experiencing degraded performance. This, in turn, allowed the Ops team to proactively notify all of the end users about the issue, even those who didn't consume the MuleSoft API. This had a significant positive impact on the internal customers, some of whom were required to provide regular and timely reports to government agencies based on information available in the custom application. With the proactive notification, these teams could find other ways to meet their compliance obligations.
The second incident revolved around rejected requests to the client's custom email provider. The email service was integrated with a MuleSoft System API, which began alerting the Ops team to rejected requests. This allowed Big Compass’ customer to start working with their end user's email provider right away to identify and triage the issue and quickly determine a resolution.
This customer saw the benefits of DevPOps in both of these cases. The alerts:
- Made the Ops team the second to know about the issues, and in some cases the first.
- Minimized time to resolution, permitting the client to meet SLAs and protect both their top and bottom line.
- Engaged the right team at the right time for resolution.
- Provided opportunities to communicate and build trust, which helped with user retention.
Common Mistakes in Proactive DevOps
DevPOps isn't an automatic panacea. It requires thought and planning to be most effective. Mistakes are possible without careful consideration.
Failure to correlate SLAs to KPIs
MuleSoft's guidance on measuring platform benefits includes correlating your service-level agreements (SLAs) for monitoring and response to identified KPIs. Example KPIs include increased platform availability and reliability.
Not including strategies in development practices
These strategies should consist of alerts, audits, and logging, and they must be part of your ongoing development practices for DevPOps to be effective. If they aren't, you'll miss opportunities to identify issues proactively.
Failing to define escalation processes
In the middle of an incident is no time to be figuring out what should be done, who should be notified, and how. Even if the problem isn't a high-priority incident, it will likely have downstream impacts on development and Sprint velocity. Plan, as much as possible, for how this will be handled.
Not considering communication
This goes hand in hand with your escalation process planning. Who should receive alerts? When should items be escalated, and to whom? When should you notify users, and how will you do that? And what should be said? This will help with both automated alerting and managing incident communications.
Proactive DevOps Best Practices
To support the proactive nature of DevPOps, best practices should combine several ops functions: alerts, log forwarding, and MuleSoft monitoring of your implementation. CloudHub users will have different access levels to features like logging and monitoring, so while these are the best practices Big Compass recommends, your actual implementation may vary depending on the deployment type you're using.
System APIs should have a health-check endpoint to verify that the API can communicate with its Enterprise Information System (EIS). Access to the health-check endpoint must be tightly restricted, which API Manager, Dedicated Load Balancers (DLBs), or both can help enforce.
Process and/or Experience APIs should also implement a health-check endpoint if they interact with an EIS or topic/queue. Similar to the System API, access to this health-check endpoint must be tightly restricted.
All APIs should have a “ping” endpoint, which simply indicates if the API is “alive.” This capability can be implemented as part of the health-check endpoint if necessary.
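The distinction between the two endpoints can be sketched in a language-agnostic way. In Mule these would be flows behind an HTTP Listener, but the logic is the same; the function names and response shapes below are illustrative assumptions, not a MuleSoft convention:

```python
def ping() -> dict:
    """Lightweight liveness check: the API process is up.
    Deliberately makes no downstream calls, so it stays cheap
    and can be polled frequently."""
    return {"status": "alive"}

def health_check(check_eis) -> dict:
    """Deep health check. `check_eis` is a caller-supplied callable
    that attempts a cheap round trip to the EIS (e.g. a metadata
    query) and returns True/False. Injecting it keeps the check
    testable and keeps EIS-specific details out of this function."""
    try:
        eis_ok = bool(check_eis())
    except Exception:
        # Any connectivity error counts as a degraded backend.
        eis_ok = False
    return {"status": "healthy" if eis_ok else "degraded", "eis": eis_ok}
```

A monitoring tool can poll `ping` aggressively and `health_check` at a slower interval, since the latter generates real traffic against the EIS.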
Finally, a custom alert should be created to notify the Ops team of a potential connectivity issue with an EIS. As previously described, this alert can be implemented as an API or through a Custom Application Notification alert.
The following CloudHub alerts should always be implemented:
- CPU usage
- Memory usage
- Worker not responding
- Deployment failed
The amount of log storage and retention you have depends on your subscription level. For instance, customizable log data retention and locality are only available to Anypoint Platform Titanium subscribers, as is distributed log management and search. If you do not have a Titanium subscription, be aware that CloudHub logs are limited to 100 MB per app, per worker, or up to 30 days, whichever limit is reached first. Once the limit is reached, the oldest log information is irretrievably deleted. Archiving these logs by forwarding them to Splunk or another system, and using cloud storage such as Amazon S3, will allow you to review and analyze the information over a much longer period.
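As a sketch of the forwarding idea, the snippet below shapes a log line as a Splunk HTTP Event Collector (HEC) event and posts it. The HEC URL, token handling, and `sourcetype` value are illustrative assumptions; in practice, CloudHub's built-in log-forwarding integrations or a Splunk add-on would usually handle this for you:

```python
import json
import urllib.request

def build_hec_event(message: str, app: str, worker: str) -> dict:
    """Shape one CloudHub log line as a Splunk HEC event.
    The sourcetype and field names here are placeholders chosen
    for illustration, not an official MuleSoft schema."""
    return {
        "event": message,
        "sourcetype": "mulesoft:cloudhub",  # placeholder sourcetype
        "fields": {"app": app, "worker": worker},
    }

def forward_to_splunk(hec_url: str, hec_token: str, event: dict) -> None:
    """POST one event to a Splunk HEC endpoint, e.g.
    https://splunk.example.com:8088/services/collector/event
    (hostname is hypothetical)."""
    req = urllib.request.Request(
        hec_url,
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {hec_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses
```

Once the events land in Splunk (or S3 via a similar shipper), retention becomes a function of your own storage policy rather than CloudHub's limits.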
Assessing Your DevPOps Journey
Once you begin to implement logging, alerting, and so forth, how will you know whether you're making progress toward proactive MuleSoft DevOps? Primarily, your root cause analysis (RCA) reports for outages will provide this information.
RCA requirements have both objective and subjective measurements. Ideally, as you implement your DevPOps practice, you'll be moving the needle on your RCA metrics.
MuleSoft can be an integral part of your DevPOps processes, and the services available through CloudHub can enable you to have a more proactive approach to your operations. Depending on your CloudHub subscription, many of the features to help you achieve DevPOps are already available.
Big Compass can also help your ops become more efficient and proactive. Contact Big Compass to learn more about taking your MuleSoft DevOps beyond your CI/CD pipeline.