GestaltIT, High Availability

How do you define high availability and disaster recovery?

A while back I was on a call with someone who asked me the difference between high availability (HA) and disaster recovery (DR), saying that there are so many different solutions out there and that a lot of people seem to use the terminology but are unable to explain anything more about these two descriptions. So, here’s an attempt to demystify things.

First of all, let’s take a look at the individual terms:

High Availability:

According to Wikipedia, you can define availability in the following ways:

The degree to which a system, subsystem, or equipment is operable and in a committable state at the start of a mission, when the mission is called for at an unknown, i.e., a random, time. Simply put, availability is the proportion of time a system is in a functioning condition.

The ratio of (a) the total time a functional unit is capable of being used during a given interval to (b) the length of the interval.

Most online dictionaries have a similar definition of availability. When we talk about HA, we imply that we want to increase the proportion of time your system is in a functioning condition.
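To make the ratio definition concrete, here is a minimal sketch in Python. The function name `availability` and the example numbers are my own assumptions for illustration; they do not come from any standard or tool.

```python
def availability(usable_hours: float, interval_hours: float) -> float:
    """Return availability as the ratio of usable time to interval length."""
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    if not 0 <= usable_hours <= interval_hours:
        raise ValueError("usable_hours must lie between 0 and interval_hours")
    return usable_hours / interval_hours


# Hypothetical example: a 30-day month (720 hours) with 4 hours of outage.
month_hours = 30 * 24
print(f"{availability(month_hours - 4, month_hours):.4%}")
```

Note that the sketch only counts hours. What qualifies as a "usable" hour is exactly the part you still have to define for yourself.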

Going by the above, you will also notice that there is no fixed definition of availability. Simply put, you need to put your own definition in place when talking about HA. You need to define what HA means in your environment. I’ve had customers that needed HA and defined it as the system having a certain amount of uptime, which is one way to measure it.

On the other hand, you would be hard pressed to call your system available if you could work with it but the data you were working with was corrupted, say because one of your power users made an error during a copy job and wrote an older data set to the wrong spot. The system itself would be available. You can log on to it, you can work with it, but the output you get will be wrong.

To me, such a scenario would mean that your system isn’t available. After all, it’s not about everything being online. It’s about using a system in the way you would expect it to work. But when you ask most people in IT about availability, the first thing you will likely hear is something related to uptime or downtime. So, my tip to you is once again:

Define what “available” means to you and your organization/customer!

Disaster Recovery:

Let’s do the same thing as before and turn to some general definitions. Wikipedia defines disaster the following way:

A disaster is a perceived tragedy, being either a natural calamity or man-made catastrophe. It is a hazard which has come to fruition. A hazard, in turn, is a situation which poses a level of threat to life, health, property, or that may deleteriously affect society or an environment.

And recovery is defined the following way (when it comes to health):

Healing, or Cure, the process of recovering from an injury or illness.

So, in a nutshell, this is about getting back on your feet once a disaster strikes. Again, it’s important to define what you would call a disaster, but at least there seems to be some sort of common understanding that anything that gets you back up and running after an entire site goes down usually falls under the label of a DR solution.

It all boils down to definitions!

When you talk to other companies or vendors about HA and/or DR, you will soon notice that most have a different understanding of what HA and DR are. Your main focus should be to have a clear definition for yourself. Try to find out the importance and value of your solution and base your requirements on that. Ask yourself simple questions such as:

  • What is the maximum downtime I can cope with before I need to start working again? 8 hours per year? 1 hour per year? 4 hours per month? What are my RPO and RTO?
  • How do I handle planned maintenance? Can I bring everything down or do I need to distribute my maintenance across independent entities?
  • Can I afford the loss of any data at all? Can I afford the partial loss of data?
  • What if we see a city-wide power outage? Do I need a failover site, or are all my users in the same spot and won’t be able to work anyway?
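The downtime budgets in the first question translate directly into availability percentages, which is often how vendors will quote them back to you. The following is a rough sketch, assuming a 365-day year of 8760 hours. The helper name `availability_from_downtime` is hypothetical, and the sketch does not distinguish planned from unplanned downtime.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def availability_from_downtime(downtime_hours_per_year: float) -> float:
    """Convert a yearly downtime budget into an availability fraction."""
    if not 0 <= downtime_hours_per_year <= HOURS_PER_YEAR:
        raise ValueError("downtime must lie between 0 and the hours in a year")
    return 1 - downtime_hours_per_year / HOURS_PER_YEAR


# The three budgets mentioned in the list above, expressed per year.
budgets = {
    "8 hours per year": 8,
    "1 hour per year": 1,
    "4 hours per month": 4 * 12,
}

for label, hours in budgets.items():
    print(f"{label}: {availability_from_downtime(hours):.3%}")
```

Seen this way, 4 hours per month is a far looser requirement than 8 hours per year, which is worth knowing before you start comparing price tags.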

Questions like these will help you realize that not everything you have running has the same value. Your development system with 6000 people working on it worldwide might need better protection than your production system that is only being used by 500 people spread through the Baltic region.

Or in short:

Knowing what kind of protection you need is key. The fact is that neither HA nor DR solutions come cheap. If you need the certainty that your solution is available and able to recover from a disaster, you will notice that the price tag quickly skyrockets. That is another reason to make sure you know exactly what kind of protection you need, and creating that definition is the most important starting point. Once you have your own definitions, communicate them and the resulting requirements so that all parties are on the same page. It should make your life a little easier in the end.

6 thoughts on “How do you define high availability and disaster recovery?”

  1. Hi Baas,
    I keep running into your same problem of various term confusion. I have always thought of availability as the percentage of time in a given time span that a given server is operational and accessible to users.
    In recent years, availability expectations have risen sharply, but thank God so has technology. But nothing is perfect, ergo, disaster recovery.
    Disaster recovery, specific to servers, is the time and process it takes to get the data and applications running on the above mentioned server back in operation to the users. That gets into virtualization and remote replication, which I will let you talk about.
    Does that sound right to you? What do you suggest to people on calculating the cost of high availability, or instead, the cost of downtime?

  2. Hi Brindey,

    first of all I would like to thank you for your comment.

    In regards to your definition of uptime. Since you work for Stratus and you provide HA services, your definition of uptime is bound to be different from mine. You talk about “availability as the percentage of time in a given time span that a given server is operational and accessible to users”.

    To me that is irrelevant for my uptime, since I am not willing to reduce my uptime to the availability of a server. I personally define uptime as the ability to work with my system as a whole in an expected way.

    Just going by the definition you gave would mean that my server might be online and reachable and is “up”. But what if the application, or part of the application, is not reachable because something inside of the application was corrupted? What if I am running a large database that switched to admin mode because a table was corrupted? Would you consider this as a system being available? According to the definition you gave it would be up since my server is online and accessible to my users (unless you meant the entire environment when you wrote server). According to my customer it wouldn’t be available since he is unable to work with the database in a way he would expect to in a “normal” situation.

    As I wrote before, it all boils down to defining what your requirements are, and finding out how valuable the availability of your data is to you. Tools like risk assessments, dependency mapping and specifying your RPO and RTO should give you an idea on how valuable a specific environment is to your specific setup/ecosystem, and will allow you to set a limit on what you want to spend on HA and DR.

    On a side note, I would talk about the availability of a system and not the downtime. Downtime usually only covers unplanned outages and not things like planned maintenance, although planned maintenance could have an effect on the availability of a system.

    Cheers,
    Bas

  3. Hi Bas,
    You are right. When one thinks of uptime, one must consider ‘beyond the box’ and look at what really matters, which is the availability of the application (and the application’s intact data) to the user. That’s going to include the network, the server, and the app itself, among other things. So yes, please replace ‘server’ in my statement above with ‘application’.

    In the case of your specific example, I’d say that the prevention of application-driven corruption is the key of any HA system. There are various ways to go about that such as database replication and enhanced detection in the case of transient or hardware errors. If we’re talking about a buggy application, then something was clearly missed in quality assurance testing and you need to focus on recovering as quickly as possible to get the user back up and running.

    To your other point, I do think that discussion of downtime should include planned as well as unplanned outages. For certain customers, planned maintenance time can translate to real dollars lost. Folks in the HA space (Stratus included) provide the ability for live server upgrades in an effort to minimize planned downtime.

    Thanks for the conversation!
    Brindey
