Some Considerations for Establishing a Datacom Server Failover Process

Article ID: 397408


Products

Datacom, Datacom/DB, Datacom/Server

Issue/Introduction

One topic that comes up more and more frequently among customers is application availability and continuous operation. The Datacom Server process currently runs as a single instance, and this article offers some considerations about the options available if the currently operating Server application fails. The same considerations apply if the LPAR on which the Server runs fails, in configurations where the Server runs on a different LPAR from the Datacom/DB MUF.

Environment

z/OS

Resolution

In thinking about both Disaster Recovery/Business Continuance strategies and operational goals with Datacom Server (SRV), we believe the best place to start is to run multiple SRV instances. This is conceptually similar to the Shadow MUF (SM) approach, but with some advantages. While SM supports only one active instance at a time, a multi-SRV setup allows for simultaneous activity and offers greater flexibility.

There are essentially two models to consider:

  1. Failover Model: One SRV instance is active, with a second activated only if the first fails — similar to SM behavior.
  2. Active-Active Model: Multiple SRV instances are active across different LPARs. This is the more robust and scalable option, and the one we’ll focus on here.

In the active-active model, you’re not limited to two instances—you could run three, four, or more to distribute workload. For example, if you normally have 50 users on one SRV, you might configure two or more SRVs with 30–40 users each, providing both load balancing and failover resilience.

Workload Distribution Options

  • Manual Assignment: Applications direct users to specific SRV instances (e.g., SRV1 or SRV2). This can improve throughput but requires application-side routing.
  • Network Load Balancer (NLB): A more scalable option. An NLB routes traffic to available SRV instances based on defined criteria. Such tools are widely used for distributing load across multiple services (we use several of them here for customer file transfers to Broadcom support servers). A brief sketch of both options, from the application's point of view, follows this list.
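
From the application's point of view, the difference between the two options can be illustrated with a minimal sketch like the one below, using generic JDBC calls. The URL format, host names, and ports shown are placeholder assumptions, not actual Datacom Server connection values.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.SQLException;

  // Placeholder endpoints: the URL format, hosts, and ports are assumptions,
  // not actual Datacom Server connection strings.
  public class EndpointChoice {

      // Manual assignment: the application itself targets a specific SRV instance.
      static final String SRV1_URL = "jdbc:placeholder://lpar1.example.com:9876/PROD";
      static final String SRV2_URL = "jdbc:placeholder://lpar2.example.com:9876/PROD";

      // NLB option: the application always targets one virtual address, and the
      // load balancer decides which SRV instance actually serves the connection.
      static final String NLB_URL = "jdbc:placeholder://srv-vip.example.com:9876/PROD";

      // Application-side routing: some rule (user group, region, workload) decides
      // which SRV this request is sent to.
      static Connection connectDirect(boolean useSrv1) throws SQLException {
          return DriverManager.getConnection(useSrv1 ? SRV1_URL : SRV2_URL);
      }

      // With an NLB, no routing logic is needed in the application at all.
      static Connection connectViaNlb() throws SQLException {
          return DriverManager.getConnection(NLB_URL);
      }
  }

Either way, the SQL the application issues is unchanged; only where the connection is opened differs.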

If you were using an NLB and then had an SRV failure, you could create an automated process (sketched after this list) to:

  1. Remove the SRV instance from the load balancer (if necessary)
  2. Send alerts
  3. Restart the SRV (note: every SRV start is a Cold start)
  4. Reintegrate the SRV with the load balancer once it’s back online (if necessary)
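
A minimal sketch of what that automation might look like is shown below, assuming the SRV listens on a known TCP port. The host, port, and the four hook methods (removeFromLoadBalancer, sendAlert, restartSrv, addToLoadBalancer) are hypothetical stand-ins for site-specific tooling such as NLB administration interfaces, alerting products, and operator start commands; they are not Datacom or NLB APIs.

  import java.io.IOException;
  import java.net.InetSocketAddress;
  import java.net.Socket;

  // Skeleton failover monitor. Host, port, and the hook methods are hypothetical
  // placeholders for site-specific automation.
  public class SrvMonitor {

      // Simple TCP probe: can we open a connection to the SRV listener?
      static boolean srvIsUp(String host, int port) {
          try (Socket s = new Socket()) {
              s.connect(new InetSocketAddress(host, port), 5000);
              return true;
          } catch (IOException e) {
              return false;
          }
      }

      public static void main(String[] args) throws InterruptedException {
          String host = "lpar1.example.com"; // placeholder
          int port = 9876;                   // placeholder

          while (true) {
              if (!srvIsUp(host, port)) {
                  removeFromLoadBalancer(host);             // step 1 (if necessary)
                  sendAlert("SRV on " + host + " is down"); // step 2
                  restartSrv(host);                         // step 3 (every SRV start is a Cold start)
                  if (srvIsUp(host, port)) {
                      addToLoadBalancer(host);              // step 4 (if necessary)
                  }
              }
              Thread.sleep(30000); // probe interval
          }
      }

      // Site-specific hooks, stubs only.
      static void removeFromLoadBalancer(String host) { /* NLB admin call */ }
      static void sendAlert(String message)           { /* paging or ticketing */ }
      static void restartSrv(String host)             { /* operator START command */ }
      static void addToLoadBalancer(String host)      { /* NLB admin call */ }
  }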

One benefit of using load balancers is that a single NLB can manage many services for many customers—multiple SRV instances for Prod, Test, QA, etc., as well as many other IP-based services—and it’s possible you already have one or more in place at your firm. More importantly, this option does not require application changes, as traffic is routed at Layer 4 (TCP).

Alternatively, you could build your own connection manager that holds multiple SRV connection strings and routes the requests accordingly. But this is labor-intensive, prone to error, and lacks the real-time adaptability and optimization a load balancer provides.
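
For comparison, a bare-bones version of such a connection manager might look like the sketch below: it simply holds a list of SRV connection strings (placeholders here) and tries them in order until one connects. Health tracking, weighting, and draining, which an NLB provides out of the box, would all have to be built and maintained by hand.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.SQLException;
  import java.util.List;

  // "Roll your own" connection manager: try each SRV connection string in turn
  // and return the first connection that succeeds. The strings themselves are
  // placeholders, not actual Datacom Server values.
  public class SrvConnectionManager {

      private final List<String> srvUrls;

      public SrvConnectionManager(List<String> srvUrls) {
          this.srvUrls = srvUrls;
      }

      public Connection getConnection() throws SQLException {
          SQLException last = null;
          for (String url : srvUrls) {
              try {
                  return DriverManager.getConnection(url);
              } catch (SQLException e) {
                  last = e; // this SRV is unreachable; fall through to the next one
              }
          }
          throw last != null ? last : new SQLException("No SRV connection strings configured");
      }
  }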

Processing Considerations

In Shadow MUF processing, failover is partially automatic when the Primary MUF fails: the SM detects the failure and completes its startup, but applications must then reconnect on their own (Datacom Server and Datacom CICS Services are both applications that use the MUF, and both handle that reconnection). Datacom Server, like CICS or a batch job, does not inherently know how to handle complex business logic such as multi-table updates or custom rollback conditions. Although the MUF backs out in-flight work to a sync point, after a failure and restart it is up to each application to determine how to resume processing. SRV would need extensive business-rule definitions to deal with an outage automatically.
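
As a simplified illustration of the kind of application-side logic this implies, the sketch below retries a unit of work after a lost connection, on the assumption that the work can safely be redone from its last sync point. The ConnectionSupplier and UnitOfWork interfaces are hypothetical stand-ins for application code; what "redo" actually means is business logic the application itself has to define.

  import java.sql.Connection;
  import java.sql.SQLException;

  // Simplified reconnect-and-resume pattern. After a failure the MUF has backed
  // out in-flight work to the last sync point; the application decides how to
  // redo the unit of work. ConnectionSupplier and UnitOfWork are hypothetical
  // interfaces standing in for application code.
  public class ResumeAfterFailure {

      interface ConnectionSupplier { Connection get() throws SQLException; }
      interface UnitOfWork { void run(Connection c) throws SQLException; }

      static void runWithRetry(ConnectionSupplier connections, UnitOfWork work,
                               int maxAttempts) throws SQLException {
          SQLException last = null;
          for (int attempt = 1; attempt <= maxAttempts; attempt++) {
              try (Connection c = connections.get()) {
                  c.setAutoCommit(false);
                  work.run(c);  // multi-table updates, custom rollback rules, etc.
                  c.commit();   // sync point
                  return;
              } catch (SQLException e) {
                  last = e;     // connection lost or work failed; retry from the sync point
              }
          }
          throw last != null ? last : new SQLException("maxAttempts must be at least 1");
      }
  }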

Finding the Real Question and Answer

Therefore, rather than trying to replicate Shadow MUF behavior, the better approach is to clarify the broader availability goals and choose a solution accordingly. For example:

Q: A single SRV is a single point of failure. What’s the solution?

  1. Run a second SRV on another LPAR. But then:
    1. If it runs concurrently, it needs a new name, IP, Port, and connection string—your app must handle that.
    2. If it starts only on failure, why not just restart the original?
  2. Running both SRVs simultaneously helps performance, but:
    1. What application changes are needed to manage multiple connections?
    2. How do we keep the connection logic updated as SRVs come online/offline?
    3. Is this added complexity worth more than a simple SRV restart?
  3. What about using an NLB to handle traffic across multiple SRVs?
    1. How does the NLB detect if a Server instance is down? (NLBs may already support heartbeat monitoring.)

These are key trade-offs, and there may be more questions and configurations to explore. Remember, too, that you would need to pilot any such configuration in a non-production environment before considering it for production.

Additional Information

While the discussion and possible solutions here are hypothetical, this article should provide a good starting point for exploring the options available to you.