Application availability and continuous operation come up more and more often in conversations with customers. Currently, Datacom Server runs as a single instance, and this article offers some considerations and options for the case where the running Server application fails. The same considerations apply when the Server runs on a different LPAR from the Datacom/DB MUF and that LPAR fails.
z/OS
In thinking about both Disaster Recovery/Business Continuance strategies and operational goals with Datacom Server (SRV), we believe the best place to start is to run multiple SRV instances. This is conceptually similar to the Shadow MUF (SM) approach, but with some advantages. While SM supports only one active instance at a time, a multi-SRV setup allows for simultaneous activity and offers greater flexibility.
There are essentially two models to consider:
In the active-active model, you are not limited to two instances; you could run three, four, or more to distribute the workload. For example, if you normally have 50 users on one SRV, you might run two or more SRVs configured for 30–40 users each, providing both load balancing and failover resilience.
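When sizing an active-active layout, it can help to check whether the instances that remain after a failure could still carry the full user count. The sketch below is illustrative arithmetic only: the function names and the idea of a fixed per-instance user capacity are assumptions made for this example, not Datacom Server parameters.

```python
def surviving_capacity(instances: int, capacity_per_instance: int, failed: int = 1) -> int:
    """Total user capacity left after `failed` instances go down."""
    if failed < 0 or failed > instances:
        raise ValueError("failed must be between 0 and the number of instances")
    return (instances - failed) * capacity_per_instance


def can_absorb(users: int, instances: int, capacity_per_instance: int, failed: int = 1) -> bool:
    """True if the surviving instances can carry all `users`."""
    return surviving_capacity(instances, capacity_per_instance, failed) >= users


# 50 users on two SRVs sized for 40 each: one survivor offers 40 seats.
print(can_absorb(users=50, instances=2, capacity_per_instance=40))
# 50 users on three SRVs sized for 30 each: two survivors offer 60 seats.
print(can_absorb(users=50, instances=3, capacity_per_instance=30))
```

Under these assumptions, adding a third instance or raising per-instance capacity is what lets the remaining SRVs take over the whole workload rather than only part of it.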
If you were using a network load balancer (NLB) and an SRV instance failed, you could create an automated process to:
One benefit of using load balancers is that a single NLB can manage many services for many customers: multiple SRV instances for Prod, Test, QA, and so on, as well as many other IP-based services. You may already have one or more in place at your firm. More importantly, this option does not require application changes, because traffic is routed at Layer 4 (TCP).
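To make the Layer 4 idea concrete, the following minimal sketch shows the kind of TCP reachability check a load balancer might use to decide whether an SRV instance should keep receiving traffic. It is not how any particular NLB product is implemented; the host names and port in the usage comment are hypothetical, and the check confirms only that the port accepts connections, not that the Server behind it is working correctly.

```python
import socket
from typing import Iterable, List, Tuple


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def healthy_endpoints(endpoints: Iterable[Tuple[str, int]], timeout: float = 2.0) -> List[Tuple[str, int]]:
    """Keep only the endpoints that currently accept TCP connections."""
    return [(host, port) for host, port in endpoints if tcp_probe(host, port, timeout)]


# Hypothetical usage: host names and port are placeholders, not real defaults.
# pool = healthy_endpoints([("srv-a.example", 5000), ("srv-b.example", 5000)])
```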
Alternatively, you could build your own connection manager that holds multiple SRV connection strings and routes requests accordingly. However, this approach is labor-intensive and error-prone, and it lacks the real-time adaptability and optimization a load balancer provides.
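For comparison, a home-grown connection manager might look roughly like the sketch below. It is a simplified illustration under stated assumptions: the class name, the endpoint strings, and the injected `connect` callable (standing in for whatever driver call your application uses to reach SRV) are all hypothetical. It rotates through the configured endpoints and skips any that fail to connect, but it does nothing about health monitoring, connection pooling, or in-flight work, which is part of why this route takes more effort than it first appears.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when no configured SRV endpoint accepts a connection."""


class SrvConnectionManager:
    """Round-robin over SRV endpoints, skipping ones that fail to connect."""

    def __init__(self, endpoints: Sequence[str], connect: Callable[[str], T]) -> None:
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints: List[str] = list(endpoints)
        self._connect = connect  # hypothetical stand-in for the real driver call
        self._next = 0           # index of the endpoint to try first next time

    def get_connection(self) -> T:
        errors: List[Tuple[str, Exception]] = []
        total = len(self._endpoints)
        for offset in range(total):
            index = (self._next + offset) % total
            endpoint = self._endpoints[index]
            try:
                connection = self._connect(endpoint)
            except Exception as exc:  # driver error types are unknown here
                errors.append((endpoint, exc))
                continue
            # Start from the following endpoint next time to spread the load.
            self._next = (index + 1) % total
            return connection
        raise AllEndpointsFailed(errors)
```

A caller would construct the manager once with its list of connection strings and call `get_connection()` whenever it needs a new session; what to do with work that was in flight when a connection dropped remains the application's decision.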
In Shadow MUF processing, failover is partially automatic when the Primary MUF fails. The SM detects the primary failure and completes its startup, but the applications must then reconnect. (Datacom Server and Datacom CICS Services are both applications that use the MUF, and they handle that reconnection.) Datacom Server, like CICS or a batch job, does not inherently know how to handle complex business logic such as multi-table updates or custom rollback conditions. Although the MUF will back out in-flight work to a sync point, after a failure and restart it is up to each application to determine how to resume processing. SRV would need extensive business-rule definitions to handle an outage automatically.
Therefore, rather than trying to replicate Shadow MUF behavior, the better approach is to clarify the broader availability goals and choose a solution accordingly. For example:
Q: A single SRV is a single point of failure. What’s the solution?
These are key trade-offs, and there may be more questions and configurations to explore. Remember, too, that you would need to pilot any such configuration in a non-production environment before considering it for production.
Although the discussion and possible solutions here are hypothetical, this article should provide a good starting point for exploring the options available to you.