Nutanix, Performance, SAP, Uncategorized

RDMA on Nutanix AHV and the discomfort of heterogeneous environments

While setting up our new environment for SAP HANA validation work, I spent some time in the data center and ran into a few caveats that I figured I would share.

To set the stage, I am working with a Lenovo HX Nutanix cluster. The cluster consists of two HX-7820 appliances with 4x Intel 8180M CPUs, 3 TB of RAM, NVMe, SSD and, among other things, two Mellanox CX-4 dual-port NICs. The other two appliances are HX-7821 systems with pretty much the same configuration, except that these have 6 TB of RAM. The idea is to give this cluster as much performance as we can, and to do that we decided to switch on Remote Direct Memory Access, or RDMA for short.

Now, switching on RDMA isn’t that hard. Nutanix added support for RDMA with AOS version 5.5, and in keeping with our “one-click” mantra, it is as simple as going into the Prism web interface, clicking the gear icon, going to “Network Configuration”, and from the “Internal Interface” tab enabling RDMA and entering the subnet and VLAN you want to use, as well as the priority number. On the switch side, you don’t need anything extremely complicated. On our Mellanox switch we did the following (note that you’d normally need to disable flow control on each port, but this is the default on Mellanox switches):

interface vlan 4000
dcb priority-flow-control enable force
dcb priority-flow-control priority 3 enable
interface ethernet 1/29/1 dcb priority-flow-control mode on force
interface ethernet 1/29/2 dcb priority-flow-control mode on force
interface ethernet 1/29/3 dcb priority-flow-control mode on force
interface ethernet 1/29/4 dcb priority-flow-control mode on force

With all of that in place, you would normally expect to see a small progress bar and that is it. RDMA set up and working.

Except that it wasn’t quite as easy in our scenario…

You see, one of the current caveats is that when you image a Nutanix host with AHV, we pass through the entire PCI device, in this case the NIC, to the controller VM (cVM). The benefit is that the cVM now has exclusive access to the PCI device. The issue is that we currently do not pass through individual ports, which isn’t ideal in the case of a NIC that has multiple ports. Add to that the fact that we don’t give you a choice of which port to use for RDMA, and the situation becomes slightly muddied.

So, first off: we essentially do nothing more than check whether we have an RDMA-capable NIC, and we pass through the first one that we find during the imaging process. In a normal situation, this will always be the RDMA-capable NIC in the PCI slot with the lowest slot number, and it will normally be the first port on that NIC. Meaning that if you have, for example, a non-RDMA-capable Intel NIC in PCI slot 4, and two dual-port RDMA-capable cards in slots 5 and 6, your designated RDMA interface is going to be the first port on the card in slot 5.
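
If you want to see for yourself which RDMA-capable NICs are present and in which order they sit on the PCI bus, a quick sketch from the AHV host would be the following; this assumes pciutils is available and that your cards show up as Mellanox devices:

# List Mellanox adapters in PCI bus/slot order on the AHV host;
# the first RDMA-capable device found is the one that gets passed through.
lspci | grep -i mellanox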

Since you might want to see which MAC address is being used, you can check from the cVM by running the ifconfig command against the rdma0 interface. Note that this interface exists by default but isn’t online, so it will not show up if you just run ifconfig without parameters:

ifconfig rdma0
rdma0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::ee0d:9aff:fed9:1322  prefixlen 64  scopeid 0x20
        ether ec:0d:9a:d9:13:22  txqueuelen 1000  (Ethernet)
        RX packets 71951  bytes 6204115 (5.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 477  bytes 79033 (77.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
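
If you prefer not to rely on ifconfig, the same MAC address can also be read directly from sysfs or with iproute2 on the cVM; a minimal sketch, assuming the interface is still named rdma0:

cat /sys/class/net/rdma0/address   # prints just the MAC, matching the "ether" line above
ip link show rdma0                 # shows state and the link/ether (MAC) even without an IP configured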

To double-check whether you have the correct interface connected to your switch ports, my tip would be to access the lights-out management interface (IMM, iLO, iDRAC, IPMI, etc.) and check your PCI devices from there. These will usually tell you the MAC address of the interfaces on the various PCI devices. Make sure you double-check that you are connected to the right physical NIC and switch port.

The next topic that might come up is the fact that we automatically disable c-states on the AHV host as part of enabling RDMA. This is all done in the background and normally happens without any intervention. In our case, since we added a couple of new nodes to the cluster, the BIOS settings were not the same across the cluster. The result was that on the HX-7820 nodes, the AHV hosts had the following file available, containing a value of “1”:

/sys/devices/system/cpu/cpu*/cpuidle/state[3-4]/disable

Due to the different BIOS settings on the HX-7821 hosts that we added, this file and the cpuidle (sub-)directories didn’t exist on those hosts. So while the RDMA script tried to disable c-states 3 and 4 on all hosts, this was only successful on two of the four nodes in the cluster. Upon comparing the BIOS settings, we noticed some deviations in the available settings due to version differences, as well as differences in how some settings were delivered to us (MWAIT, for example). After modifying the settings to match the other systems, the directories were available and we could apply the c-state changes to all systems.
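
If you want to verify this yourself, the sketch below shows how you could check and, if needed, manually disable those deeper c-states on an AHV host. It assumes the BIOS actually exposes the cpuidle entries (which, as described above, was not the case on the newly added nodes until the settings were fixed) and that you run it as root:

# Show which c-state each cpuidle entry maps to on cpu0
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name

# Check whether states 3 and 4 are currently disabled (1) or still enabled (0)
cat /sys/devices/system/cpu/cpu*/cpuidle/state[3-4]/disable | sort | uniq -c

# Disable states 3 and 4 on all CPUs, roughly what the RDMA enablement script does for you
for f in /sys/devices/system/cpu/cpu*/cpuidle/state[3-4]/disable; do
  echo 1 > "$f"
done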

We obviously have some work to do to add more resiliency and flexibility to the way we enable RDMA, and it doesn’t hurt to have an operational procedure that ensures settings are identical on all systems before going online with them. Still, I want to emphasize one thing:

One click on the Nutanix platform works beautifully when all systems are the same.

There are, however, quite a few caveats that come into play when you work with a heterogeneous environment:

  • Double-check your settings at the BIOS level. Make them as uniform as you can, but be aware that certain settings or options might not even be available or configurable anymore.
  • Plan your physical layout. Try not to mix hosts with different numbers of adapters.
  • Create a physical design that shows the people doing the cabling what to plug in where, to ensure consistency.
  • You can’t always avoid making changes to a production system, but if at all possible, have a similar smaller cluster for the purpose of quality assurance.
  • If you are working in a setup with a variety of systems, things will hopefully work as designed, but they might not. Log tickets where possible, and provide info that goes a bit further than “it doesn’t work”. 😉

Oh, and one more thing: plan extra time. The “quick” change of cables and enabling of RDMA ended up taking four hours in the data center, and that is with me being pretty familiar with all of this. If you are new to this, take your time to work through it if at all possible, rather than doing it on the fly and running into issues when you are supposed to be going live. 🙂

GestaltIT, Performance, Storage, VAAI, Virtualization, VMware, vSphere

What is VAAI, and how does it add spice to my life as a VMware admin?

EMC EBC Cork
I spent some days in Cork, Ireland this week presenting to a customer. Besides the fact that I’m now almost two months into my new job and loving every part of it, there is one part of the job that is extremely cool.

I get to talk to customers about very cool, new technology that can help them get their job done! And while it’s in the heart of every techno-loving geek to get caught up in bits and bytes, I’ve noticed one thing very quickly: the technology is usually not what limits the customer from doing new things.

Everybody knows about that last part. Sometimes you will actually run into a problem where some new piece of kit is wreaking havoc and we can’t seem to put our finger on what the problem is. But most of the time we get caught up in entirely different problems altogether: things like processes, certifications (think ISO, SOX, ITIL), compliance, security, or something as “simple” as people who don’t want to learn something new or feel threatened because their role might be changing.

And this is where technology comes in again. I had the chance to talk about several things with this customer, but one of the key points was that technology should help make my life easier. One of the cool new things that can actually help in that area was a topic in my presentation.

Some VMware admins already know about this technology, and I would say that most of the folks who read blogs have already heard about it in some form. But when talking to people at conventions or in customer briefings, I get to introduce folks over and over to a technology called VAAI (vStorage API for Array Integration), so I want to explain in this blog post what it is and how it might be able to help you.

So where does it come from?

Well, you might think that it is something new, and you would be wrong. VAAI was introduced as part of the vStorage API during VMworld 2008, even though the VAAI functionality was only released to customers with the vSphere 4.1 update (4.1 Enterprise and Enterprise Plus). But VAAI isn’t the entire vStorage API, since that consists of a family of APIs:

  • vStorage API for Site Recovery Manager
  • vStorage API for Data Protection
  • vStorage API for Multipathing
  • vStorage API for Array Integration

Now, the “only API” that was added with the update from vSphere 4.0 to vSphere 4.1 was the last one, VAAI. I haven’t seen any roadmaps yet that contain more info about future vStorage APIs, but personally I would expect to see even more functionality coming in the future.

And how does VAAI make my life easier?

If you read back a couple of lines, you will notice that I said that technology should make my life easier. Well, with VAAI this is actually the case. Basically, VAAI allows you to offload operations on data to something that was made to do just that: the array. And it does so at the level of the ESX storage stack.

As an admin, you don’t want your ESX(i) machines to be busy copying blocks or creating clones. You don’t want your network clogged up with Storage vMotion traffic. You want your host to be busy with compute operations and the management of your memory, and that’s about it. You want as much headroom as you can get on your machine, because that allows you to leverage virtualization more effectively!

So, this is where VAAI comes in. Using the API that was created by VMware, you can now use a set of SCSI commands:

  • ATS: This command provides hardware-assisted locking, meaning that you no longer have to lock an entire LUN but can lock just the blocks that are allocated to the VMDK. This can be of benefit, for example, when you have multiple machines on the same datastore and would like to create a clone.
  • XCOPY: This one is also called “full copy” and is used to copy data and/or create clones, avoiding all of that data being sent back and forth to your host. After all, why would your host need the data if everything is already stored on the array?
  • WRITE SAME: This one is also known as “bulk zero” and comes in handy when you create a VM. The array takes care of writing zeroes on your thin and thick VMDKs, and helps out at creation time for eager zeroed thick (EZT) guests.

Sounds great, but how do I notice this in reality?

Well, I’ve seen several scenarios where, for example during a Storage vMotion, you would see a reduction in CPU utilization of 20% or even more. In the other scenarios you should normally also see a reduction in the time it takes to complete an operation, and in the resources allocated to perform that operation (usually CPU).

Does that mean that VAAI always reduces my CPU usage? Well, in a sense: yes. You won’t always notice a CPU reduction, but one of the key criteria is that with VAAI enabled, all of the SCSI operations mentioned above should always complete faster than without VAAI. That means that even when you don’t see a reduction in CPU usage (which is often the case), you will see that, because the operations are faster, you get your CPU power back more quickly.

Ok, so what do I need, how do I enable it, and what are the caveats?

Let’s start off with the caveats, because some of these are easy to overlook. Hardware offloading will not be used in the following cases:

  • The source and destination VMFS volumes have different block sizes
  • The source file type is RDM and the destination file type is non-RDM (regular file)
  • The source VMDK type is eagerzeroedthick and the destination VMDK type is thin
  • The source or destination VMDK is any sort of sparse or hosted format
  • The logical address and/or transfer length in the requested operation are not aligned to the minimum alignment required by the storage device (all datastores created with the vSphere Client are aligned automatically)
  • The VMFS has multiple LUNs/extents and they are all on different arrays

Or short and simple: “Make sure your source and target are the same”.

The key criteria to use VAAI are vSphere 4.1 and an array that supports VAAI. If you have those two prerequisites in place, you should be good to go. And if you want to be certain you are leveraging VAAI, check these things:

  • In the vSphere Client inventory panel, select the host
  • Click the Configuration tab, and click Advanced Settings under Software
  • Check that these options are set to 1 (enabled):
    • DataMover/HardwareAcceleratedMove
    • DataMover/HardwareAcceleratedInit
    • VMFS3/HardwareAcceleratedLocking

Note that these are enabled by default. And if you need more info, please make sure that you check out VMware knowledge base article 1021976.
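
If you would rather verify this from the ESX(i) command line than through the vSphere Client, you can query the same advanced settings with esxcfg-advcfg (a value of 1 means the option is enabled); this sketch assumes you have console or SSH access to the host:

esxcfg-advcfg -g /DataMover/HardwareAcceleratedMove
esxcfg-advcfg -g /DataMover/HardwareAcceleratedInit
esxcfg-advcfg -g /VMFS3/HardwareAcceleratedLocking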

Also, one last word on this. I really feel that this is a technology that will make your life as a VMware admin easier, so talk to your storage admins (if that person isn’t you in the first place) or your storage vendor and ask if their arrays support VAAI. If not, ask them when they will support it. Not because it’s cool technology, but because it’s cool technology that makes your job easier.

And, if you have any questions or remarks, please hit me up in the comments. I would love to hear your opinions on this.

Update: 2010-11-30
VMware guru and Yellow Bricks mastermind Duncan Epping was kind enough to point me to a post of his from earlier this week that goes into more detail on some of the upcoming features. Make sure you check it out right here.

GestaltIT, Performance, Storage, Tiering

“Storage tiering is dying.” But purple unicorns exist.

Chris Mellor over at the Register put an interview online with NetApp CEO Tom Georgens.

To quote from the Register piece:

He is dismissive of multi-level tiering, saying: “The simple fact of the matter is, tiering is a way to manage migration of data between Fibre Channel-based systems and serial ATA based systems.”

He goes further: “Frankly I think the entire concept of tiering is dying.”

Now, for those who are not familiar with the concept of tiering: it is basically moving data between faster and slower media in the background. Classically, tiering is something that every organization is already doing. You consider the value of the information, and based on that you decide whether this data should be instantly accessible on your more expensive hardware. Even at home you will see that as the value of data decreases, you archive it to media with a different performance profile, such as a USB archiving disk, or by burning it to a DVD.

For companies, the more interesting part of tiering comes with automation. To put it simply, you want your data to be available on a fast drive when you need it, and it can remain on slower drives when you don’t require it at that moment. Each vendor has its own specific implementation of how it tiers storage, but you will find this kind of technology coming from almost any vendor.

Apparently, NetApp has a different definition of tiering, since according to their CEO, tiering is limited to the “migration of data between Fibre Channel-based systems and serial ATA based systems”. And this is where I heartily disagree with him. I purposely picked the example of home users who are also using different tiers, and it’s no different for any of the storage vendors.

The major difference? They remove the layer of Fibre Channel drives between the flash and SATA drives. They still tier their data to the medium that fits best. They will try to do that automatically (and hopefully succeed in doing so), but they just don’t call it tiering anymore.

As with all vendors, NetApp is also trying to remove the Fibre Channel drive layer, and I am convinced that this will be possible as soon as the price of flash drives becomes comparable to that of regular Fibre Channel drives, and the automated tiering reaches the point where any actions performed are transparent to the connected system.

But if NetApp doesn’t want to call it tiering, that’s fine by me; I just hope they don’t honestly expect customers to fall for it. The rest of the world will continue to call it tiering, while they try to sell you a purple unicorn that moves data around between disk types as if by magic.

DMX, EMC, Enginuity, Performance, Storage, Symmetrix

The thing about metas, SRDF/S and performance

It’s not very common knowledge, but there is actually a link between the I/O performance you see on your server and the number of meta members you configured when using SRDF/S.

I do a lot of stuff in our company and I tend to get pulled into performance escalations, usually because I know my way around most modern operating systems and know a bit about storage and about our applications and databases. The problems usually boil down to a common set of issues, and perhaps one day I will post a catalog of common performance troubleshooting tips here, but I wanted to use this post to write about something that was new to me and that I thought might be of use to you.

We have a customer with a large installation on Linux that was seeing performance issues in their average dialog response time. Now, for those who don’t know what a dialog response time is: it is the time it takes an SAP system to display a screen of information, process any data entered or requested there against the database, and output the next screen with the requested information. It doesn’t include any time needed for network traffic or the time taken up by the front-end systems.

The strange thing was that the database reported fairly good response times and an excellent cache hit ratio, but also reported that the waits it did see were produced by the disks it used. When we looked at the Symmetrix box behind it, we could not see any heavy usage on the disks, and it reported to be mostly “picking its nose”.

After a long time we got the suggestion that perhaps the SRDF/S mirroring was to blame for this delay. We decided to change to an RDF mode called “Adaptive Copy Write Pending” (ACWP), and did indeed see a performance improvement, even though the database and the storage box didn’t seem to show the same improvement that was seen in the dialog response time.

Then, someone asked a fairly simple question:

“How many meta members do you use for your LUNs?”

Now, the first thought with a question like that usually goes along the lines of the number of spindles, short stroking and similar things. Until he said that the number of meta members also influences performance when using SRDF/S. And that’s where it gets interesting, so I’m going to try and explain why.

To do that, let’s first take a closer look at how SRDF works. SRDF/S usually gives you longer write response times. This is because you write to the first storage box, everything is copied over to the second box, an acknowledgement comes back from the second box, and only then is the write acknowledged to the host. You have to take things like propagation delay and RDF write times into account.

Now, you also need to consider that when you are using synchronous mode, you can only have one outstanding write I/O per hyper. That means that if your meta consists of 4 hyper volumes, you get 4 outstanding write I/Os. If you create your meta out of more hyper volumes, you also increase the maximum number of outstanding write I/Os, which translates into higher sustained write rates if your workload is spread evenly.

So, let’s say for example you have a host that is doing 8 KB write I/Os to a meta consisting of 2 hypers. The remote site is about 12 miles away and you have a write service time of 2 ms. Since there are 1,000 ms in one second, each hyper can do roughly 500 write IOPS: 1,000 ms / 2 ms = 500.

Now, with 2 hypers in your meta you would roughly have around 8 MB/sec:
2 (hypers) x 500 IOPS x 8 KB.

And you can also see that if we increase the number of hypers, we also increase that maximum value. This is mostly true for random writes; the behavior will be slightly different for sequential loads, since those use a stripe size of 960 KB. And don’t forget that this is a cache-to-cache value, since we are talking about the data being transferred between the Symmetrixes. We won’t receive a write commit until we get a write acknowledgement from the second storage array.
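
To get a feel for how the ceiling scales with the number of meta members, here is a small back-of-the-envelope sketch in shell; the 2 ms service time and 8 KB write size are simply the assumptions from the example above, so substitute your own measurements:

#!/bin/bash
# Rough SRDF/S random-write ceiling per meta: one outstanding write per hyper,
# so the maximum write rate scales with the number of hyper members.
svc_ms=2   # assumed RDF write service time in milliseconds
io_kb=8    # assumed write I/O size in KB
for hypers in 2 4 8 16; do
  iops=$(( 1000 / svc_ms * hypers ))
  echo "$hypers hypers: ~$iops write IOPS, ~$(( iops * io_kb )) KB/s"
done
# With 2 hypers: ~1000 IOPS and ~8000 KB/s, the roughly 8 MB/sec from the example above.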

So, what we will be doing next are two things. We will be increasing the number of hypers for the metas that our customer is using, and besides that we will be upgrading our Enginuity level, since we expect slightly different caching behavior.

I’ll try to update this post once we have changed the values, just to give you a feel for the difference it made (or perhaps did not make), and I hope this information is useful for anyone facing similar problems.