Clustering, GestaltIT, Nutanix, Storage, VMware, vSphere

Nutanix – What do you mean: “You are not a storage company”…?

Image copyright of the Davis Museum
“You are a black guy, you must be great at dancing and basketball”. “You’re a blonde? Let me explain that joke to you once more”.

Stereotypes. We all know them, and we all apply them in some form or another. We take a quick look and put things in boxes, each drawer with its own label and contents to keep the stereotypes apart. But what if it doesn’t work that way?

Since I joined Nutanix, I’ve been in several customer and partner meetings. Some of the people I’ve met got the idea right away: we are doing something new. Others filed us away in a familiar box or drawer. “You are a storage company” is one of the classic pieces of feedback. Or, “So you do virtual desktop infrastructure?”.

But there’s more to it. We combine commodity hardware with our own software and sell that as a solution. And while virtual desktops are a great use case, we can also run things like Splunk, Hadoop and classic server virtualization workloads.

And while we offer the benefits of a shared storage approach for running workloads, we’re not a storage company. We use the features of shared storage to make your life easier. Each node performs its operations on local storage, but the “Nutanix Distributed File System”, or NDFS, creates an abstraction layer that offers many of the shared storage benefits. An example would be a shared container for my virtual machines that is accessible to all of the hosts, enabling features like live migration between hosts.

While that works out really well for our customers, and it gives you the impression that you have a SAN or NAS under the hood, Nutanix’s main point is not to replace your SAN or NAS. We want to offer you a “Virtual Computing Platform”: a way to make your life easier when installing, configuring and deploying virtualized workloads and solutions.

That works great, and we’ve received very positive feedback. There seems to be a slight disconnect, though, and it begins when people start asking questions like:

What do you mean: “You are not a storage company”…?

A fair question by all means, but the simple answer is: No, we are not.

A simple example that has come up lately is the following: how do I share disk space from your file system directly into a virtual machine? While there is a way to export the storage directly into a VM (for example via NFS), this bypasses some of the concepts we rely on. By default, we mount a datastore using the NFS IP address 192.168.5.1, which runs over a virtual switch that has no uplinks. Since that traffic stays within the same vSwitch, we can work at blazing speeds that are not limited by the speed of the physical NIC.
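Just to make that a bit more tangible, here is a hypothetical pyVmomi sketch of what mounting such a container over the internal 192.168.5.1 address looks like at the vSphere API level. The container and datastore names are made up, and in practice you would let the Nutanix tooling handle this for you.

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholders: vCenter address, credentials, container and datastore names are made up.
si = SmartConnect(host="vcenter.lab.local", user="administrator", pwd="secret")
content = si.RetrieveContent()

# Every ESXi host mounts the same container over its own local Controller VM,
# reachable on the uplink-less internal vSwitch as 192.168.5.1.
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
for host in view.view:
    spec = vim.host.NasVolume.Specification(
        remoteHost="192.168.5.1",   # CVM address on the internal vSwitch
        remotePath="/container1",   # hypothetical Nutanix container
        localPath="container1",     # datastore name as it will show up in vCenter
        accessMode="readWrite",
        type="NFS",
    )
    host.configManager.datastoreSystem.CreateNasDatastore(spec)

Disconnect(si)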

If I were to mount the NFS share from a virtual machine (or a different host), I could use the external IP of the Controller VM. The problem is that the external IPs differ between Controller VMs, so if you were to migrate your NFS client VM to a different host, all of that traffic would go over the regular network. Also, if the Controller VM you are connected to as an NFS server goes offline, your NFS share becomes inaccessible.

The thing is, the Nutanix block is designed to work this way. It offers great flexibility when it comes to running virtualized workloads, but it is not a 100% distributed storage system, and we never intended it to be one.

It then boils down to design. Is there a way around this? Certainly.

If you want to create a distributed CIFS file share, take a look at solutions like Microsoft’s DFS. You can run multiple VMs inside a container/datastore and simply pass the disk space of those VMs through. Need more space? Add more VMs on a different node, add capacity, and off you go. And if you run out of space on your cluster? Add another Nutanix node, get a VM up and running, and follow the same procedure.

That way, you are actually using the distributed nature of our Virtual Computing Platform and running your storage services in a distributed manner. GlusterFS could be a way to achieve the same thing with NFS on Linux.

And like I said, if this sounds like we are not a storage company? You are absolutely right, we are not. So you might want to categorize us under a different label, put us in a different box, or create an entirely new stereotype. 😉

Clustering, EMC, Storage, Virtualization, VMware, VPLEX, vSphere

VMware HA demo using vMSC with EMC VPLEX Metro

That’s a mouthful of abbreviations for a title, isn’t it?

So, let me give you some background info. VMware introduced something called the vSphere Metro Storage Cluster, and Duncan Epping talks about this feature here.

What vMSC allows us to do is create a regular stretched vSphere cluster, but now also stretch the storage between the two sites. This can be done in two ways (to quote from Duncan’s article):

I want to briefly explain the concept of a metro / stretched cluster, which can be carved up in to two different type of solutions. The first solution is where a synchronous copy of your datastore is available on the other site, this mirror copy will be read-only. In other words there is a read-write copy in Datacenter-A and a read-only copy in Datacenter-B. This means that your VMs in Datacenter-B located on this datastore will do I/O on Datacenter-A since the read-write copy of the datastore is in Datacenter-A. The second solution is which EMC calls “write anywhere”. In this case VMs always write locally. The key point here is that each of the LUNs / datastores has a “preferred site” defined, this is also sometimes referred to as “site bias”. In other words, if anything happens to the link in between then the storage system on the preferred site for a given datastore will be the only one left who can read-write access it.

The last scenario described here can obviously cause some issues. EMC tried to address this by introducing an “independent 3rd party” in the form of the VPLEX Witness. Some documentation states that this witness should run in a third site; at the very least, I would recommend running it in a separate failure domain.

In essence, we have created the following setup:

© VMware

Awesome stuff, because we can do things that weren’t quite possible before. Since VPLEX is one of EMC’s key storage virtualization solutions that allows active/active disk access, we can perform a vMotion between the two sites, and due to the nature of VPLEX the underlying disks follow along in a sort of storage vMotion. All of that without having to shut down the VM. Pretty neat!

Now, as Chad describes here, a new disk connectivity state was introduced with vSphere 5, called “Permanent Device Loss” or PDL. This was a great feature for communicating to your infrastructure that a target was intentionally removed: you could unmount the disk and remove the paths to your target in a proper way.

It also turned out to be useful for signaling an unexpected loss of your target, for example when your cluster ends up in a partitioned state. The problem was that a PDL state and VMware HA didn’t work so well together. When a device dropped into this state, HA didn’t “kill” your VM; the virtual machine would usually keep responding to pings, but that was about it.

Then along came vSphere 5.0 Update 1, which allows us to set a flag on each of the hosts inside our cluster, plus an advanced option on our HA cluster. Now we can actually have HA terminate the VMs that hit a PDL state and restart them on the hosts in our cluster that still have access to their datastores in their respective preferred sites.
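For those who like to script this, here is a minimal sketch of the cluster-side part using pyVmomi. The flag names are the ones from the vSphere 5.0 Update 1 vMSC guidance as I remember them (disk.terminateVMOnPDLDefault in /etc/vmware/settings on every host, and the HA advanced option das.maskCleanShutdownEnabled); the vCenter address, credentials and cluster name are placeholders, so treat it as an illustration rather than a copy-paste recipe.

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholders: vCenter address, credentials and cluster name are made up.
si = SmartConnect(host="vcenter.lab.local", user="administrator", pwd="secret")
content = si.RetrieveContent()

# Locate the stretched cluster by name.
cluster = None
for dc in content.rootFolder.childEntity:
    if not isinstance(dc, vim.Datacenter):
        continue
    for entity in dc.hostFolder.childEntity:
        if isinstance(entity, vim.ClusterComputeResource) and entity.name == "MetroCluster":
            cluster = entity

# Add the HA advanced option so HA will still restart VMs that were
# terminated because of a PDL condition.
spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        option=[vim.OptionValue(key="das.maskCleanShutdownEnabled", value="True")]
    )
)
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

# The host-side part is a single line in /etc/vmware/settings on every ESXi host:
#   disk.terminateVMOnPDLDefault = True

Disconnect(si)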

I’ve created a short (ok, 8 minutes) video that shows exactly this scenario. You’ll get a quick view of the VPLEX setup. You’ll see the Brocade switches being switched from the normal full zoneset to a zoneset that disables the inter-switch links between the two VPLEX clusters. And you’ll see the settings inside my vSphere lab setup, along with the behavior of the hosts and virtual machines.

Since I’m quite new to creating videos like this, I hope the output is acceptable, and the video is clear enough. If you have any questions, feedback or would like to see more, please leave me a comment and I’ll see what I can do. 🙂


Just a quick modification to my post: it wasn’t actually VM-HA (or VM monitoring) responding to the PDL event, but HA terminating the VM when running into the PDL state, as Duncan pointed out to me on Twitter. Sorry for any confusion I may have caused!

Clustering, High Availability

HA Clustering: KISS (and make up)

I like HA-clustering. I like to think that it is actually one of my specialties, and that I’m fairly good at it.

When I tried to explain what a cluster is, I came up with a very simple explanation that gives an idea of what a cluster can be without all the technical stuff. Just to give you this example:
Try to think of a car manufacturer that has sites in two locations. Both are capable of building cars, but only one site is active at a time. Now you as a customer want to be able to communicate with this company no matter where they are working from. The way to do so would be a P.O. box. The active site just picks up the mail from this box and corresponds with you.
Say one site burns down: the other takes over and corresponds with you using the same P.O. box, and to you as a customer the “failover” to the other site is not noticeable.

I know this doesn’t cover all aspects, but it is a very effective way to describe the very basics of a cluster. Anybody can imagine a P.O. box and someone driving over to pick up the mail from that box.

Now, at the company where I work we tend to use three main products for our clustering needs: Microsoft Cluster Service for our Windows platforms, a custom-built product called PMC (very basic: two nodes with manual failover), and EMC AutoStart. All offer basic failover of shared resources and usually some means to stop and start things like databases and applications.

Everyone here seems to give the same answer when you ask them about high availability: “install a cluster” is the common denominator. But when you ask them what they actually expect from high availability, you get all sorts of replies, ranging from “never down” or “100% reachable” to “guaranteed fast response times” or even cloning the runtime instance to other machines.

All of these are, in my opinion, valid responses, but there is one thing I have learned over the past few years: however complex the demands, your environment will be more stable if you keep your design and implementation as simple as possible. Or in short: “KISS”.

Examples of popular requirements are “I want to monitor the response time of my database query” or “the SAPgui interpretation time should be under $X”. Much like with the uncertainty principle, as soon as we start measuring the response times of the database, we also have an impact on those response times. And the more complex the demands are, the more you need to take into account and the higher the costs will be. Sun has a nice image displaying cost in relation to complexity, and it is the kind of picture you will see everywhere when searching for HA-clustering.

My advice? Try to keep it down to a minimum.
Rely on your hardware redundancy; the N+1 principle usually saves quite a bit there. Also, make sure that the people working on the cluster know what they are doing. Most errors I’ve seen here start with poorly defined monitors, too many monitors, or plain user error (PEBKAC).
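To give an idea of what “simple” can look like for a monitor, here is a small, purely illustrative sketch: one TCP probe, one timeout, one failure threshold, nothing clever. The host name and port are placeholders, and a real setup would plug this kind of logic into whatever monitoring framework the cluster product provides.

import socket
import time

HOST = "db-vip.example.local"   # hypothetical clustered service address
PORT = 5432                     # hypothetical database port
TIMEOUT = 5                     # seconds before a single probe counts as failed
MAX_FAILURES = 3                # consecutive failures before declaring the resource down
INTERVAL = 30                   # seconds between probes

def probe(host, port, timeout):
    # One simple check: can we open a TCP connection within the timeout?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

failures = 0
while True:
    if probe(HOST, PORT, TIMEOUT):
        failures = 0
    else:
        failures += 1
        if failures >= MAX_FAILURES:
            print("Resource down, signal the cluster to fail over")
            break
    time.sleep(INTERVAL)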

In short, a cluster is always complex and tailored to the application you are trying to make highly available. Keep the design as simple as you can, and gather people around you with knowledge of the application so you can define a good set of working guidelines and monitors. All in all, a case of “KISS”.