HA Clustering: KISS (and make up)

I like HA-clustering. I like to think that it is actually one of my specialties, and that I’m fairly good at it.

When I tried to explain what a cluster is, I came up with a very simple explanation that gives an idea of what a cluster can be without all the technical stuff. It goes like this:
Try to think of a car manufacturer that has sites in two locations. Both are capable of building cars, but only one site is active at a time. Now you as a customer want to be able to communicate with this company no matter where they are working from. The way to do so would be a P.O. box. The active site just picks up the mail from this box and corresponds with you.
Say one site burns down: the other takes over and keeps corresponding with you through the same P.O. box, so to you as a customer the “failover” to the other site is not noticeable.

I know this doesn’t cover all aspects, but it is a very effective way to describe the very basics of a cluster. Anybody can imagine a P.O. box and someone driving to pick up the mail from that box.
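
For those who prefer code over car factories, here is a minimal sketch of the same analogy. The names (Site, PoBox, send, failover) are made up for this illustration and do not come from any of the cluster products mentioned here; a real cluster would move a virtual IP address or a cluster name rather than a Python attribute.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """One of the two locations (a cluster node). Hypothetical name."""
    name: str
    healthy: bool = True


class PoBox:
    """The shared address customers write to (think: virtual IP or cluster name)."""

    def __init__(self, primary: Site, standby: Site) -> None:
        self.sites = [primary, standby]
        self.active = primary  # only one site works the box at a time

    def failover(self) -> None:
        """Hand the box to the first healthy site that is not currently active."""
        for site in self.sites:
            if site is not self.active and site.healthy:
                self.active = site
                return
        raise RuntimeError("no healthy site left to take over")

    def send(self, letter: str) -> str:
        """Deliver a letter; the customer only ever names the box, never a site."""
        if not self.active.healthy:
            self.failover()
        return f"{self.active.name} answered: {letter}"


box = PoBox(Site("site-a"), Site("site-b"))
print(box.send("order 1"))   # handled by site-a
box.active.healthy = False   # site-a burns down
print(box.send("order 2"))   # handled by site-b, same address for the customer
```

The customer code never changes: it keeps calling send() on the same box, which is the whole point of the P.O. box.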

Now, at the company where I work we tend to use three main products for our clustering needs: the Microsoft Cluster Service for our Windows platforms, a custom-built product called PMC (very basic, two nodes with manual failover), and EMC Autostart. All offer basic failover of shared resources, and usually some means to stop and start things like databases and applications.

Almost everyone here gives the same answer when you ask them about high availability: “Install a cluster” seems to be the common denominator. But when you ask them what high availability actually means to them, you get all sorts of replies, ranging from “never down” or “100% reachable” to “guaranteed fast response times” or even the cloning of the runtime instance to other machines.

All are (in my opinion) valid responses, but there is one thing that I have learned over the past few years: The more complex the demands, the more stable your environment will be if you keep your design and implementation as simple as possible. Or in short: “KISS”.

Examples of popular requirements are “I want to monitor the response time of my database query”, or “The SAPgui interpretation time should be under $X”. Very much like with the uncertainty principle, as soon as we start to measure the response times of the database, we are also going to have an impact on these response times. And the more complex the demands are, the more you need to take into account and the higher the costs are going to be. Sun has a nice image displaying this, and it is a general image you will see when you are searching for HA-clustering.

My advice? Try to keep it down to a minimum.
Rely on your hardware redundancy. You can use the N+1 principle there and usually save quite a bit. Also, make sure that the people who are working on the cluster know what they are doing. Most of the errors I’ve seen here started with poorly defined monitors, too many monitors, or user error (PEBKAC).
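
To give an idea of what I would call a simple, well-defined monitor, here is a rough sketch. It does one cheap check and only asks for a failover after a number of failures in a row, so a single hiccup does not move the whole application. The Monitor class, the poll method and the max_failures threshold are my own illustration under those assumptions, not how any particular cluster product implements its monitors.

```python
from typing import Callable


class Monitor:
    """One cheap yes/no check plus a consecutive-failure threshold (illustrative)."""

    def __init__(self, check: Callable[[], bool], max_failures: int = 3) -> None:
        self.check = check                # e.g. "does the database accept a connection?"
        self.max_failures = max_failures  # assumed threshold, tune per application
        self.failures = 0

    def poll(self) -> bool:
        """Run one check; return True when a failover should be triggered."""
        try:
            ok = self.check()
        except Exception:
            ok = False  # a check that blows up counts as a failed check
        # Any success resets the counter, so only failures in a row add up.
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.max_failures


# Simulated check results: one hiccup, a recovery, then a real outage.
results = iter([True, False, True, False, False, False])
monitor = Monitor(lambda: next(results))
decisions = [monitor.poll() for _ in range(6)]
print(decisions)  # [False, False, False, False, False, True]
```

One monitor like this per resource is usually easier to reason about than a dozen clever ones that each have their own idea of when the application is “down”.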

In short, a cluster is always complex and tailored toward the application you are trying to make highly available. Keep the design as simple as you can and gather people around you with knowledge of the application so you can define a good set of working guidelines and monitors. All in all, a case of “KISS”.