GestaltIT, High Availability

How do you define high availability and disaster recovery?

A while back I was on a call with someone who asked me the difference between high availability (HA) and disaster recovery (DR), saying that there are so many different solutions out there, and that a lot of people seem to use the terminology but are unable to explain what either term actually means. So, here’s an attempt to demystify things.

First of all, let’s take a look at the individual terms:

High Availability:

According to Wikipedia, you can define availability in the following ways:

The degree to which a system, subsystem, or equipment is operable and in a committable state at the start of a mission, when the mission is called for at an unknown, i.e., a random, time. Simply put, availability is the proportion of time a system is in a functioning condition.

The ratio of (a) the total time a functional unit is capable of being used during a given interval to (b) the length of the interval.

And most online dictionaries have a similar definition of availability. When we talk about HA, we mean that we want to increase the proportion of time the system is in a functioning condition.

Going by the above, you will also notice that there is no fixed definition of availability. Simply put, you need to put your own definition in place when talking about HA: define what HA means in your environment. I’ve had customers who needed HA and defined it as the system having a certain amount of uptime, which is one way to measure it.
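To make the ratio definition above concrete, here is a minimal sketch of the arithmetic behind uptime targets. The function names are mine; the "nines" figures themselves are just the ratio definition applied to a year of 8,760 hours:

```python
def availability(uptime_hours: float, interval_hours: float) -> float:
    """Availability as the ratio of usable time to the whole interval."""
    return uptime_hours / interval_hours

def max_downtime_per_year(availability_pct: float) -> float:
    """Allowed downtime in hours per year for a given availability target."""
    hours_per_year = 365 * 24  # 8760
    return hours_per_year * (1 - availability_pct / 100)

# "Three nines" (99.9%) still allows almost nine hours of downtime a year;
# adding one more nine shrinks that to well under an hour.
print(round(max_downtime_per_year(99.9), 2))   # 8.76
print(round(max_downtime_per_year(99.99), 2))  # 0.88
```

Notice how steeply the budget tightens with each extra nine, which is one reason the price tag of HA rises so quickly.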

On the other hand, imagine that you can work with your system, but the data you are working with is corrupted because one of your power users made an error during a copy job and wrote an older data set to the wrong spot. The system is, in itself, available. You can log on to it, you can work with it, but the output you are going to get will be wrong.

To me, such a scenario would mean that your system isn’t available. After all, it’s not about everything being online. It’s about using a system in the way you would expect it to work. But when you ask most people in IT about availability, the first thing you will likely hear is something related to uptime or downtime. So, my tip to you is once again:

Define what “available” means to you and your organization/customer!

Disaster Recovery:

Let’s do the same thing as before and turn to some general definitions. Wikipedia defines disaster the following way:

A disaster is a perceived tragedy, being either a natural calamity or a man-made catastrophe. It is a hazard which has come to fruition. A hazard, in turn, is a situation which poses a level of threat to life, health, or property, or that may deleteriously affect society or an environment.

And recovery is defined the following way (when it comes to health):

Healing, or Cure, the process of recovering from an injury or illness.

So, in a nutshell, this is about getting back on your feet once disaster strikes. Again, it’s important to define what you would call a disaster, but there at least seems to be some common understanding that anything that gets you back up and running after an entire site goes down usually falls under the label of a DR solution.

It all boils down to definitions!

When you talk to other companies or vendors about HA and/or DR, you will soon notice that most have a different understanding of what HA and DR are. Your main focus should be to have a clear definition for yourself. Try to find out the importance and value of your solution and base your requirements on that. Ask yourself simple questions, for example:

  • What is the maximum downtime I can cope with before I need to start working again? 8 hours per year? 1 hour per year? 4 hours per month? What are my RPO (recovery point objective) and RTO (recovery time objective)?
  • How do I handle planned maintenance? Can I bring everything down or do I need to distribute my maintenance across independent entities?
  • Can I afford the loss of any data at all? Can I afford the partial loss of data?
  • What if we see a city-wide power outage? Do I need a failover site, or are all my users in the same spot and won’t be able to work anyway?

Questions like these will help you realize that not everything you have running has the same value. Your development system with 6,000 people working on it worldwide might need better protection than your production system that is only used by 500 people spread across the Baltic region.

Or in short:

Knowing what kind of protection you need is key. The fact is that HA and DR solutions never come cheap. If you need the certainty that your solution is both available and able to recover from a disaster, you will notice that the price tag quickly skyrockets. Which is all the more reason to make sure that you know exactly what kind of protection you need, and creating that definition is the most important starting point. Once you have your own definition, communicate those definitions and requirements so that all parties are on the same page. It should make your life a little easier in the end.

Clustering, High Availability

HA Clustering: KISS (and make up)

I like HA-clustering. I like to think that it is actually one of my specialties, and that I’m fairly good at it.

When I tried to explain what a cluster is, I came up with a very simple explanation that gives an idea of what a cluster can be without all the technical stuff. Just to give you this example:
Try to think of a car manufacturer that has sites in two locations. Both are capable of building cars, but only one site is active at a time. Now you as a customer want to be able to communicate with this company no matter where they are working from. The way to do so would be a P.O. box. The active site just picks up the mail from this box and corresponds with you.
Say one site burns down: the other takes over and corresponds with you using the same P.O. box, and to you as a customer the “failover” to the other site is not noticeable.

I know this doesn’t cover all aspects, but it is a very effective way to describe the very basics of a cluster. Anybody can imagine a P.O. box and someone driving to pick up the mail from that box.
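The P.O. box analogy maps directly onto an active/passive cluster: a stable address the clients use, and a failover decision behind it. A toy sketch of that idea (the class names are mine, and real cluster software adds health probes, fencing, and resource scripts on top of this skeleton):

```python
class Site:
    """One of the two 'car plants': can do the work if it is healthy."""
    def __init__(self, name: str):
        self.name = name
        self.healthy = True

    def handle(self, order: str) -> str:
        return f"{self.name} builds: {order}"

class POBox:
    """The stable address customers use. It forwards each order to
    whichever site is currently able to work, like the P.O. box in
    the analogy (or a virtual IP in a real cluster)."""
    def __init__(self, primary: Site, standby: Site):
        self.primary, self.standby = primary, standby

    def submit(self, order: str) -> str:
        site = self.primary if self.primary.healthy else self.standby
        return site.handle(order)

plant_a, plant_b = Site("Plant A"), Site("Plant B")
box = POBox(plant_a, plant_b)
print(box.submit("red coupe"))   # Plant A builds: red coupe
plant_a.healthy = False          # Plant A "burns down"
print(box.submit("blue sedan"))  # Plant B builds: blue sedan
```

The customer only ever talks to `box`; the failover is invisible from the outside, which is the whole point.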

Now, at the company where I work, we tend to use three main products for our clustering needs: Microsoft Cluster Service for our Windows platforms, a custom-built product called PMC (very basic: two nodes with manual failover), and EMC Autostart. All offer basic failover of shared resources, and usually some means to stop and start things like databases and applications.

All of the people here seem to give the same answer when you ask them about high availability: “install a cluster” seems to be the common denominator. But when you ask them what high availability means to them, you get all sorts of replies, ranging from “never down” or “100% reachable” to “guaranteed fast response times” or even cloning the runtime instance to other machines.

All are, in my opinion, valid responses, but there is one thing that I have learned over the past few years: the more complex the demands, the more your environment’s stability depends on keeping your design and implementation as simple as possible. Or in short: “KISS”.

Examples of popular requirements are “I want to monitor the response time of my database query”, or “The SAPgui interpretation time should be under $X”. Very much like the uncertainty principle, as soon as we start to measure the response times of the database, we also have an impact on those response times. And the more complex the demands are, the more you need to take into account, and the higher the costs are going to be. Sun has a nice image displaying cost in relation to complexity, and it is the kind of image you will see everywhere when searching for HA-clustering.

My advice? Try to keep it down to a minimum.
Rely on your hardware redundancy. You can use the N+1 principle there and usually save quite a bit. Also, make sure that the people who are working on the cluster know what they are doing. Most errors I’ve seen here start with either poorly defined monitors, too many monitors, or plain user error (PEBKAC).
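Where the N+1 savings come from is easy to show with a back-of-the-envelope count: N+1 buys one spare unit on top of what the load needs, while full duplication (2N) buys a complete second set. A minimal sketch (the function is illustrative, and real sizing also has to consider shared failure domains):

```python
def units_needed(load_units: int, scheme: str) -> int:
    """How many identical units to buy: N sized for the load, plus redundancy."""
    if scheme == "N+1":   # survive any single unit failing
        return load_units + 1
    if scheme == "2N":    # a fully duplicated second set
        return 2 * load_units
    raise ValueError(f"unknown redundancy scheme: {scheme}")

# Four power supplies carry the load; N+1 buys 5, full duplication buys 8.
print(units_needed(4, "N+1"))  # 5
print(units_needed(4, "2N"))   # 8
```

The gap widens as N grows, which is why N+1 at the hardware layer is often the cheapest protection you can get.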

In short, a cluster is always complex and tailored toward the application you are trying to make highly available. Keep the design as simple as you can and gather people around you with knowledge of the application, so you can define a good set of working guidelines and monitors. All in all, a case of “KISS”.