The topic of "Disaster Recovery" (DR) is broad, deep, and typically specific to each site.
As such, it is only possible to give general advice within a knowledge document.
It is recommended that you fully engage with the Disaster Recovery process in order to have an effective plan.
This may require bringing in outside expertise to assist with the planning.
The question "What are the steps to perform Disaster Recovery for CA Service Desk Manager?" makes some false assumptions, which are important to draw out:
1) There exists a single set of steps which is applicable to all sites.
2) The process is simple to implement.
Neither is true.
The first is incorrect because sites vary greatly in their implementation, environment, use cases, resources available, financial and business risks involved, and likely many other factors.
The second is incorrect simply because restoring enterprise-level software is complex.
Be aware of what "Disaster Recovery" refers to as a general topic.
Establish why you need a disaster recovery plan for your site.
- Do you already have satisfactory recovery processes, and simply need to document them to "Tick a box" for government regulation?
- Do you have a way to recover the system?
Establish what you need out of the disaster recovery plan.
- Do you mean simply that service should continue uninterrupted if one process stops on one server?
- Or a whole server being removed from the environment?
- Or do you mean full disaster recovery, with continuous backups and an off-site recovery data centre?
It is important to consider why you want a disaster recovery plan before beginning, as it can be complex and expensive.
Equally, it can be more complex and expensive if you do NOT have a disaster recovery plan.
Full disaster recovery typically involves much more than "a nightly backup."
- What happens if the building burns down? Are the backups offsite?
- What happens if key staff can't be located? Are other people trained, or is there documentation?
- How much data can the business afford to lose? A day of transactions? Nothing?
- A rotating, offsite stored, nightly backup of files and database is better than nothing. This should be an absolute minimum.
- Consult with site stakeholders as to what specifically is required.
- Consult with experts if needed. Many sites do not have the expertise necessary to perform full disaster recovery. It is better to engage for an initial consultation, and be informed, than it is to implement a disaster recovery plan which fails.
- The CA Service Desk Manager "Swing Box" method can be adapted to copy an implementation from one location to another. You do not need to "upgrade", but many of the other steps are what is needed to stand up a similar environment: taking copies of customisations, settings, Attachments, Knowledge Documents, the database, application files and so on. See "CA ITSM 17.1 Swing Box Method" and "CA ITSM 17.1 Environment Promotion". See also the architecture diagram "CA Service Desk Manager Site Architecture Plan with Disaster Recovery", which includes a disaster recovery implementation.
- Fully document your disaster recovery plan. Make sure that it is stored offsite, and that appropriate people can access it if the usual administrators are not available.
- Test your disaster recovery plan. Make sure that it works! Caution - make sure that there is no overlap with a current running production system. You do not want to have double notifications going to clients, for example.
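The questions above about acceptable data loss and rotating backups can be turned into a simple automated check. The following is a minimal sketch only, not part of CA Service Desk Manager: the function names, the 24-hour tolerance and the retention count are all assumptions chosen for illustration, and the backup timestamps would need to come from your own backup tooling.

```python
from datetime import datetime, timedelta


def backup_is_fresh(last_backup, now, max_data_loss):
    """Return True if the newest backup is within the tolerated data-loss window.

    max_data_loss is a business decision (hypothetical here), for example
    timedelta(hours=24) if losing one day of transactions is acceptable.
    """
    return now - last_backup <= max_data_loss


def backups_to_prune(backup_times, keep):
    """Return the backups (oldest first) that fall outside the newest `keep`.

    This models a simple rotation: keep the most recent `keep` backups
    and retire the rest. Real rotation schemes are often more elaborate.
    """
    ordered = sorted(backup_times)
    return ordered[:-keep] if keep > 0 else ordered


if __name__ == "__main__":
    # Hypothetical nightly backups taken at 23:00 on five consecutive days.
    nightly = [datetime(2024, 1, day, 23, 0) for day in range(1, 6)]
    now = datetime(2024, 1, 6, 9, 0)

    print("Fresh enough:", backup_is_fresh(max(nightly), now, timedelta(hours=24)))
    print("Candidates to retire:", backups_to_prune(nightly, keep=3))
```

A check like this only confirms that a recent backup exists. It does not prove that the backup can be restored, which still needs to be tested.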
Whilst having a disaster recovery plan is essential, it is even better if it never needs to be implemented for real.
Take all steps necessary to make your environment resilient, to prevent the system becoming unavailable.
This will involve business decisions about how long a system can be unavailable and how much data loss is acceptable, weighed against the cost of prevention and recovery.
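As a rough illustration of that trade-off, the arithmetic can be sketched as below. This is an assumed, simplified model for discussion with stakeholders only: the function names and every figure are hypothetical, and real estimates should come from your own business.

```python
def expected_annual_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Estimate the yearly cost of downtime as frequency x duration x hourly cost."""
    return outages_per_year * hours_per_outage * cost_per_hour


def prevention_pays_off(current_cost, cost_with_prevention, annual_prevention_cost):
    """Return True if the yearly saving in outage cost exceeds the yearly spend."""
    return (current_cost - cost_with_prevention) > annual_prevention_cost


if __name__ == "__main__":
    # Hypothetical figures: two 8-hour outages a year at 1000 per hour today,
    # reduced to two 1-hour outages after a resilience investment.
    today = expected_annual_outage_cost(2, 8, 1000)
    improved = expected_annual_outage_cost(2, 1, 1000)
    print("Worth a 10000 yearly spend:", prevention_pays_off(today, improved, 10000))
```

A model this simple ignores reputational damage, regulatory penalties and data loss, so treat it as a starting point for the conversation rather than a decision tool.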
Some advice on making your environment more resilient.
- Document the environment. Include all servers, databases, applications, file stores and the business use that they support.
- Identify single points of failure. In a CA Service Desk Manager "Conventional Configuration", the primary application server is a single point of failure. So are the database and the file stores (Knowledge Documents, Attachments and unprocessed email).
- Identify ways to protect single points of failure. Databases can be clustered. RAID systems can protect hard drives. CA ITSM Advanced Availability can provide multiple Application Servers. Load balancers can identify unresponsive servers. Network backups can protect files, and so on.
- Run periodic health checks to ensure the system is stable. Check logs to identify serious issues. Check load to make sure that no component is overloaded. Remember that systems grow and change over time - a system which was fit for purpose two years ago may now need reconfiguring or additional resources.
- Ensure that backups can be recovered from. As a basic check, test recovery by restoring a production backup to a development environment.
- Ensure that backups and disaster recovery documentation are moved offsite, and are available to those who need them in an emergency.
- Monitor to know when the system is down. This may be automated checks of a URL response, a ping to relevant servers, a database test probe, an alert from a load balancer, a test Web Services query, monitoring of log file messages and more. Automated checks of core components can give advance warning of developing issues, and can typically alert operators more quickly when an issue does occur.
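The automated checks described in the last point can be sketched with nothing more than the Python standard library. This is a minimal illustration under stated assumptions, not a CA-supplied tool: the function names are invented here, and the URL, host name and port in the example are hypothetical placeholders to be replaced with your own values.

```python
import socket
import urllib.error
import urllib.request


def url_responds(url, timeout=5.0):
    """Return True if an HTTP request to `url` succeeds with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, refused connections, timeouts and malformed URLs.
        return False


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(checks):
    """Run named zero-argument checks and return the names of those that failed.

    A check that raises an exception is treated as a failure.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Hypothetical endpoints: substitute your own web URL and database host/port.
    checks = {
        "web interface": lambda: url_responds("http://sdm.example.com:8080/"),
        "database port": lambda: port_open("db.example.com", 1433),
    }
    for name in run_checks(checks):
        print("ALERT: check failed:", name)
```

In practice a script like this would be run on a schedule from a machine outside the monitored environment, with failures routed to whatever alerting channel operators already watch.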
It is essential to implement a disaster recovery plan at a scale appropriate to your business needs.
It is better to have something now than a fully formed, complex, but unrealised plan. So make network and database backups an absolute priority if these do not already exist.
The topic of Disaster Recovery is deep and broad. It is more than a "tick the box" exercise, and will help to ensure the stability and viability of the business.
CA Service Desk Manager (CA ITSM) offers many features that help with resilience, such as Advanced Availability configurations, and the Swing Box and other documentation. However, these alone do not offer full disaster recovery. They add protection that makes a system run more effectively, and make it less likely to encounter common faults such as single-point-of-failure outages. But they are not a substitute for a full disaster recovery plan, which must be fully planned and resourced to be effective.