Alexander Sosedko - IT Architect - Disaster Recovery

Working in IT for more than two decades, I have witnessed many situations in which an IT system was in great danger. The implementation of an application, its runtime environment, and the business requirements for its availability all influence how the application should be protected against unforeseen circumstances.

Over my career I have implemented numerous IT projects with different requirements for High Availability and Disaster Recovery. A reporting system on the intranet is certainly important, but the control of power stations matters more, as do IT systems in health care, for example. Even in banking, different systems require different levels of attention during a disaster.

Do you have a spare tire in your car? Do you need it? What happens if you need it and you don't have it? Your answer is probably: it depends on the situation. If you are driving in your hometown, not far from a garage, and there is someone who can help you with the damaged tire, it is not a big issue. But imagine you are on holiday with your family, driving a thousand kilometres away from home, or abroad. And then it happens: the tire is damaged, and you don't have a spare. Even if you find someone to help, the probability that a tire of the right size is available is low. So, what now? You don't have your car, and your holiday is ruined. At that moment you realise that spending some money on a spare tire would have saved the holiday and spared your family the terrible mood. What a complete disaster...

Unfortunately, a topic as obviously important as a Disaster Recovery concept is often forgotten, moved to the end of the backlog, or left without the attention it needs.

Depending on the importance of the system, a disaster can not only ruin your business but even cause a more dangerous catastrophe. Large or small, your business is one disaster away from utter ruin. A fire, a flood, a grid outage, or any number of other disruptive events can take your critical functions out of service.

There are many examples of serious consequences for businesses. Here are some of them:

Video game studio Facepunch lost all of its servers for European Union players when a fire destroyed the datacenter hosting them.
Messaging service WhatsApp lost millions of users to a competitor in 2014 when a router issue caused hours of downtime.
A power outage hit Delta Air Lines, downing systems and grounding flights for several hours. The company lost $150 million and even faced a congressional inquiry!

Disaster recovery is often confused with data backup. Both are essential, but they operate at different scales. A backup works at a smaller scale and is focused on keeping copies of data in case of a loss. Disaster recovery is about the big picture. It asks questions like, “If something destroys our entire data center, how do we keep our business running?”

Disaster recovery consists of IT technologies and best practices designed to prevent or minimize data loss and business disruption resulting from catastrophic events—everything from equipment failures and localized power outages to cyberattacks, civil emergencies, criminal or military attacks, and natural disasters[i].

Disaster recovery starts with the architecture design and should be completely described in the Disaster Recovery Plan. Far in advance of any disruptive event, your team needs to know exactly how it will respond. Therefore, the business requirements should be thoroughly analysed and understood during the design. Even so, it can still happen during deployment that the Disaster Recovery requirements are not implemented as they should be. In many cases the functional requirements for Disaster Recovery can be omitted without any effect on the rest of the application's functionality (until a disaster happens).

Disaster Recovery Analysis

Nowadays an enterprise application landscape consists of many applications at different stages of their lifecycle. To identify the required Disaster Recovery implementation, many important aspects should be considered: the application architecture, the application placement (in the cloud or on-premises), the business requirements, and so on.

The business requirements for an existing software product can change over time. An application can be migrated to the cloud, or from one cloud provider to another. This means that the Disaster Recovery approach for an application should be analysed to find the gaps between the expectations and the current implementation.

To perform a Disaster Recovery analysis and develop proposals, I use the following approach with my customers:

Business view. Analyse business-specific information to understand the purpose of the application and to find out what the business needs in terms of Disaster Recovery.
Analyse the Application Architecture. During this step all application components should be identified, and their roles should be determined.
Review the As-Is implementation state of the application. This is a very important step, as it requires gathering real information about the application. Information from the application documentation is useful but not sufficient here, as it may differ from the physical implementation. For this, the Public Cloud portal, configuration scripts, and “Read Only” access to the resources and components are the right tools.
Gap Analysis. Find the gaps between the business requirements and expectations regarding Disaster Recovery and the Disaster Recovery capability of the current implementation.
Solution proposals. This is the final step. Many solutions may be possible that lead to the desired Disaster Recovery implementation and satisfy the business expectations. In my analysis I create a few options for the implementation. These can differ in running costs, implementation costs, maintenance, RTO, RPO, etc. The decision on which solution to implement is based on the conclusions of the different stakeholders and/or decision-making bodies.
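
The gap analysis step can be illustrated with a small Python sketch. It is only a minimal illustration and not part of any real tool: the `Objectives` class, the `find_gaps` function, and the numbers in the example are hypothetical, and a real analysis covers far more than RPO and RTO.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Objectives:
    """Recovery objectives in minutes (hypothetical structure for illustration)."""
    rpo_minutes: int  # maximum tolerable data loss
    rto_minutes: int  # maximum tolerable time to restore the service


def find_gaps(required: Objectives, actual: Objectives) -> List[str]:
    """Describe every objective that the current implementation does not meet."""
    gaps = []
    if actual.rpo_minutes > required.rpo_minutes:
        gaps.append(
            f"RPO gap: required {required.rpo_minutes} min, "
            f"current {actual.rpo_minutes} min"
        )
    if actual.rto_minutes > required.rto_minutes:
        gaps.append(
            f"RTO gap: required {required.rto_minutes} min, "
            f"current {actual.rto_minutes} min"
        )
    return gaps


# Illustrative numbers only: the business expects 30 min RPO and 4 h RTO,
# while the as-is review found a nightly backup and a 3 h restore time.
required = Objectives(rpo_minutes=30, rto_minutes=4 * 60)
actual = Objectives(rpo_minutes=24 * 60, rto_minutes=3 * 60)

for gap in find_gaps(required, actual):
    print(gap)
```

In this made-up case only the RPO is reported as a gap, because the nightly backup cannot meet a 30-minute data loss target, while the restore time already satisfies the RTO.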

These days there are many ways to implement Disaster Recovery. All well-known Public Cloud providers offer a set of tools and options for building the required Disaster Recovery solution. The task is to choose the right one for you. My recommendations are based on best practices for various application deployments with varied resiliency requirements. Typically, customers classify their applications into categories or tiers based on their resiliency requirements. An example of such categories:


Category	Application SLA	RPO	RTO
Category 1	99%	24 hours	72 hours
Category 2	99.95%	4 hours	8 hours
Category 3	99.99%	30 minutes	4 hours
Category 4	99.99%	5 minutes	1 hour

For each of these categories different solutions can be implemented.
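
To make the example categories more tangible, the following Python sketch encodes the table and does two simple calculations. It is a sketch under stated assumptions: the names `Category`, `allowed_downtime_hours`, and `select_category` are hypothetical, the SLA is interpreted as availability over a 365-day year, and the categories are assumed to be ordered from least to most demanding.

```python
from typing import NamedTuple, Optional


class Category(NamedTuple):
    name: str
    sla_percent: float
    rpo_minutes: int
    rto_minutes: int


# Values taken from the example table, ordered from least to most demanding.
CATEGORIES = [
    Category("Category 1", 99.0, 24 * 60, 72 * 60),
    Category("Category 2", 99.95, 4 * 60, 8 * 60),
    Category("Category 3", 99.99, 30, 4 * 60),
    Category("Category 4", 99.99, 5, 60),
]

HOURS_PER_YEAR = 365 * 24  # assumption: SLA measured over a 365-day year


def allowed_downtime_hours(sla_percent: float) -> float:
    """Downtime per year that still satisfies the given availability SLA."""
    return (100.0 - sla_percent) / 100.0 * HOURS_PER_YEAR


def select_category(rpo_minutes: int, rto_minutes: int) -> Optional[Category]:
    """Return the least demanding category that meets both objectives, if any."""
    for category in CATEGORIES:
        if (category.rpo_minutes <= rpo_minutes
                and category.rto_minutes <= rto_minutes):
            return category
    return None


for category in CATEGORIES:
    hours = allowed_downtime_hours(category.sla_percent)
    print(f"{category.name}: up to {hours:.1f} h of downtime per year")

# Example: the business tolerates 1 h of data loss and 5 h of outage.
match = select_category(rpo_minutes=60, rto_minutes=5 * 60)
print(match.name if match else "No category is strict enough")
```

Choosing the least demanding category that still meets the requirements reflects the trade-off described above: stricter categories usually cost more to implement and to run.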

Conclusion

Disaster recovery is easy to ignore, but ignoring it could seriously damage your business. If you don’t have any backup or recovery plan in place, start right now.

[i] https://www.ibm.com/cloud/learn/disaster-recovery-introduction

Disaster Recovery - why care?