Problem with the EMC Isilon Storage Replication Adapter

9 01 2012

A lot of folks out there use VMware vCenter SRM to create and manage disaster recovery scenarios for their virtualized environments.

Besides having a button to click to fail over (parts of) your environment to a different site, it has one benefit: It forces you to think about your systems. You need to consider which systems are vital to your infrastructure, and you need to be aware of dependencies that you may have in your environment. There are numerous other things that SRM can help with, but that’s not what I wanted to highlight here.

A couple of days ago, I was at the VMware office in Munich, helping to set up an SRM 5.0 demo that would serve as a hands-on lab for people interested in SRM. The base of this SRM installation is a virtualized Isilon cluster that makes it easy to provision storage and offers replication between sites (my colleague Nick Weaver has recorded a quick video overview).

While setting up the Isilon SRA, which you can download from the VMware website, I ran into a problem. When you download and extract the actual SRA, you’ll get a bunch of PDF files and two executables. One is the installer for the actual storage replication adapter. It’s called “EMCIsilonSRASetup_1_0.exe”, and you need a current Java development kit to get that one running, but it should install correctly.

The second file is called “IsilonReplicationHelperSetup.exe”, and this is used to configure the SRA before using it in SRM. Now, when starting this helper, both Jase McCarty and I have seen errors that refer to a missing Java class (com.izforge.izpack.installer.Installer) belonging to IzPack, the program that was used to create the installer. After extracting the actual executable, it seemed like some classes/libraries were missing from it.

I’ve been in touch with Isilon support after running into the error, and after checking with them, they gave me an MD5 hash of a working copy of the IsilonReplicationHelperSetup.exe, which is:
416535bc1c7d7f133037af04b5502e3b
However, the MD5 for the executable that I got was:
4342E880A99EE2ED6DA1205F1018233D
Which obviously is different. The MD5 of the downloaded file and the MD5 that VMware shows for the actual zip that contains the SRA matched up though.
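If you want to run the same kind of check on your own download, a minimal Python sketch could look like the one below. It is not part of the SRA tooling: the function names are my own, and you would pass in whatever path your extracted file lives at. The comparison ignores case, because the two hashes quoted in this post were printed in different cases.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_md5(left, right):
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return left.strip().lower() == right.strip().lower()


# The two hashes quoted in this post: the copy that support confirmed as
# working, and the copy extracted from the download.
WORKING_COPY = "416535bc1c7d7f133037af04b5502e3b"
DOWNLOADED_COPY = "4342E880A99EE2ED6DA1205F1018233D"

print(same_md5(WORKING_COPY, DOWNLOADED_COPY))  # False
```

To check a real file, you would call `same_md5(md5_of_file(path), WORKING_COPY)` with the path to your own copy of the helper.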

So, I’m putting this post out there as a word of warning. It seems like one of the Isilon SRA files on the VMware website is non-functional. Should anybody out there see this, make sure to contact Isilon support and reference case 00169080, which is my case number.

I’m still working with the Isilon support to see what the next steps are going to be, and I’m sure this is going to be resolved soon enough, but I wanted to put this information out there for you in the meantime, to avoid people having to go through the same process as I did. It might save some folks a bit of time. And I’ll make sure I update this post when I get a solution from the Isilon support team.

Update – January 16th 2012:

While I’m still working with the Isilon support group to get everything sorted out, I did get a version of the IsilonReplicationHelperSetup Java archive that seems to be working. Now, I’m sharing this with you all while we try to get things resolved, and to get the working download on the VMware site, but I need to add a large disclaimer:


This file is not officially supported by EMC and/or Isilon, and while this file worked for me, your mileage may vary, and I would recommend that you do not use this file in a production environment! The file might work in a test environment, but please refrain from using it in production. Use the official files from the VMware download site, or create a case with Isilon and/or VMware support!

Now, to help you verify this file, the MD5 for the Java archive is:
FFAC907E70FD0BFC73076793B9D5FCB4
and you can get the file here.


Update – February 10th 2012:

VMware has updated the Isilon SRA file, and the new MD5 for version 1.0 (released 01/18/2012) currently is:
d8b8408ab259d64ee3f5a83486e2a25e
This actually contains the working files, so you should be all set. :)





VMAX VSA: IT’S ALIVE!!!!!!!!!!!!!!!!

31 08 2011

So folks, here’s a shameless copy of a blog post from one of the guys on my team. Dave was just brilliant and actually created a virtual storage appliance of the EMC VMAX. I think that’s downright awesome, and I wanted to help him get attention for what he did, so I asked him if I could copy his blog post, which is what you will find here:


As the title suggests there is indeed a Symmetrix VMAX VSA. I have been working on this project since shortly after EMC World. As I look back through my emails, I received the code on 6/3/11 and I have been working on it in almost all of my free time since then.

Now finally it will make its public debut this week at VMworld 2011 as part of the EMC Interactive Demo booth on the show floor. As part of its grand unveiling I thought I would tell you a little about what makes it work.

Now to make a few things clear up front: this is a science project and I cannot distribute it, but it does “work”. As part of the lab (I will publish the guide) the student actually provisions an iSCSI disk from the VSA to an ESXi 5.0 host.

One of the first things I noticed with the code when trying to virtualize it: it’s HUGE. There are 2 parts to the VSA.

1. The Service Processor (SP). In a physical VMAX this is the 1U server that is racked in the system bay. It has a special image of Windows XP and contains all of the proprietary software used to manage a VMAX. If you own a VMAX this is what you will see EMC field service personnel using when they come to work on your system. This is NOT accessible by an end user as it requires special RSA credentials that change weekly (one reason we can’t distribute it). Its specs are 2 vCPUs, 2GB of RAM and about 10GB of disk space.

2. Enginuity. This is the Operating Environment of the Symmetrix. For the purposes of this VSA it runs in a SuSE Enterprise Linux 11 VM. One of the big deals with the VMAX was that Enginuity was ported from a PowerPC CPU to an Intel x86 based architecture. Without this change this VSA would never exist. Now this VM is big, so big as a matter of fact that I had to use an RC build of vSphere 5 in order to even get it to work. I was finally able to scale it down a bit, but at one point it was using 32 vCPUs, 92GB of RAM and about 250GB of disk space.

Obviously one of the challenges for using this in a lab is that I needed it to use fewer resources. In the beginning this VMAX was a Single Engine model, which means it had 16 “slices” running. Each director has 4 DA (backend) directors, and 4 FA (front end) directors. I quickly found this was the biggest reason I needed so much memory and CPU. After working with one developer, Chakib, who totally rocks by the way, we were able to scale this down to 1 FA and 1 DA per director. One interesting side note: when I was going down this path I asked Chakib what kind of VM he was using to test this. His reply was, “I am not using this in a VM, I have a physical Linux box with 200GB of RAM”. So I clearly had some work to do. But in its current state it uses 8 vCPUs and “ONLY” 48GB of RAM. Which is still pretty darn big, but a lot better than it was when we started.

The networking requirements are pretty simple. The SP needs 1 public NIC so that we can use its management tools, and 2 internal NICs which are used for internal communication to the directors. In our case that’s the Linux VM. The Linux VM needed the 2 internal NICs and 1 NIC to present an iSCSI target on. Then we put our ESXi host’s VMkernel NIC on the same vSwitch so it can use the iSCSI target provided by the VSA.

So that’s all great you say, but what actually works? That’s a good question.

What works is using Standard Devices, and very small ones today. One of the things I was told when I was given the code was that this WON’T and CAN’T do any I/O. Which obviously proved to be a bit of an issue. Chakib really worked his butt off to get me something that does I/O. So this is not like the Celerra UBER VSA by @lynxbat, where you can run a VM off of it. We hope we can do that one day. Thin Pools work to the extent you can create them, and put devices in a pool, but when you present it to a host it will not work. This kept me from using the VSI SPM plugin for vSphere as part of my lab, but hey, we always have next year! The really neat part to me is that the internal tools (SymmWin) that run on the SP fully work. It’s like having an actual VMAX, but without all the fuss of getting a few 50A power drops. As an ex-customer this to me is the coolest part: I got to put on my own BIN files and use Inlines (an internal tool used to directly talk to the hardware). As a total nerd this thing is a dream come true.

So what’s next?

Well a lot of that depends on YOU! Since this is a total science project we need to show those in Symmetrix Engineering this is worth putting their time and money into. I need everyone here at VMworld this week to come try this thing, give me feedback, leave comments here, and if you aren’t at the show, express your desire for us to continue working on it. If no one is interested this will ultimately die on the vine. Please fill out this form so we can show how many of you all would like to see this project continue.

I have to give special thanks to Chad Sakac (@sakacc) and Chris Horn (@horn_Chris) for getting me involved in this project and letting me run with it, and for all of the support they gave me during this process.

Here is a link to the lab guide being used this week at VMworld. Take a look and let me know what you think!

VMAX Lab Guide

Big thanks to Matt Cowger (@mcowger), Scott Lowe (@scott_lowe), and Tee Glasgow (@teeglasgow) for their help with the lab guide. Also to Rick Scherer (@rick_vmwaretips) for the blog help





What’s new in EMC Clariion CX4 FLARE 30

20 10 2010

CLARiiON CX4 UltraFlex I/O module - Copyright: EMC Corporation.

A little while back, EMC released a new version of its CLARiiON Fibre Logic Array Runtime Environment, or in short “FLARE” operating environment. This release brings us to version 04.30 and again has some enhancements that might interest you, so once more here’s a short overview of what this update packs:

Let’s start off with some basics. Along with this update you will find updated firmware versions for the following:

    Enclosure: DAE2		- FRUMon: 5.10
    Enclosure: DAE2-ATA	- FRUMon: 1.99
    Enclosure: DAE2P	- FRUMon: 6.71
    Enclosure: DAE3P	- FRUMon: 7.81

Major changes:

  • With version 04.30.000.5.507 you get support for FCoE. Prerequisite is using a 10 Gigabit Ethernet I/O module on CX4-120, CX4-240, CX4-480, and CX4-960 arrays.
  • SATA EFD support.
  • Following that point, you can now use Fibre Channel EFD and SATA EFD in the same DAE.
  • And, you can now also mix Fibre Channel and SATA EFDs in the same RAID group.
  • VMware vStorage API support in form of “vStorage full copy acceleration” (basically the array takes care of copying all the blocks, instead of sending everything to and from the application) and in form of “Compare and Swap” (an enhancement to the LUN locking mechanism).
  • Rebuild avoidance. This feature will change the routing of I/O to the storage processor that still has access to all the drives in the RAID group. You do need write caching to be enabled if you want to be able to use this feature.
  • Virtual provisioning, basically EMC’s name for thin provisioning on the array.

There are some nice features in there, but for me personally the virtual provisioning, the FCoE support and the vStorage API support are the main ones.

One thing that caught my eye was in the section called limitations for FLARE version 04.30.000.5.507. In the release notes you will find the following statement:

Host attach support – Supported host attached systems are limited to the following operating systems: Windows, VMWare, and Linux

Which would mean that you have a problem when you are using something else like Solaris or HP-UX. I’m trying to get some confirmation, and I’ll update this post as soon as I have more info.

Update

The statement has changed in the meantime:

Host attach support – Supported hosts that can be attached over an FCoE connection are limited to the following operating systems: Windows, VMWare, and Linux

Which means that this is just related to FCoE connected hosts.


After some feedback on Twitter from, among others, Andrew Sharrock, I thought it might be wise to say a few words about the Virtual Provisioning feature.

In short, Virtual Provisioning was already introduced with FLARE 28. The problem was that at the time, you could only use the feature with thin pools. With this update, you also get support for a newer version of the feature. Things that were added are:

  • Thick LUNs
  • LUN expand and shrink
  • Tiering preference (storage allocation from pools with mixed drives and different performance characteristics)
  • Per-tier tracking support of pool usage
  • RAID 1/0 support for pools
  • Increased limits for drive usage in pools




It’s all about change and passion

28 06 2010

Some of you who read the title of this post will already have a hunch what this is all about. Heraclitus seems to be the person who first stated:

Nothing endures but change.

And I can only agree with that. I remember reading a post from Nick Weaver about an important change in his professional life, and I love this quote:

By taking this position I am intentionally moving myself from the top man on the totem pole to the lowest man on the rung.

And I think that most people who have read Nick’s blog know that this wasn’t entirely the truth, especially when looking at what he was able to do until now.

Well, Nick can rest assured now. There’s actually one person on the team that is “lower on the rung”. That person would be me.

Time for a change!

I am joining EMC and taking on the role of vSpecialist, or as my new contract says “Technical Consultant VCE”.

I am also going to be leaving my comfort zone and leave a team of people behind that have been great to work with. I have been working at SAP for seven years now, and the choice to leave wasn’t easy. I was lucky enough to have worked with a multitude of technologies in an environment that was high paced and stressful, but very rewarding, and I want to thank all of my colleagues for making the journey interesting! Even so, it’s time for me to make a change.

I was lucky enough to get to know several people who already work in a similar role, and if there’s one thing that distinguishes them in my mind, then it would be the passion they have for their job. This was actually the main reason for me to make the switch to EMC. It’s not about making big bucks, it’s not about being a mindless drone in the Evil Machine Company or drinking the Kool-Aid, it’s about getting a chance to work with people that share a passion and are experts at what they do. It’s about the chance to prove myself and perhaps one day join their ranks as an expert.

So, while I wrap things up here at SAP, if all goes well I will be joining the vSpecialist team on October 1st, and hopefully you will bear with me while I find my way going through this change, and I do hope you drop by every now and then to read some new posts from me.

See you on the other side!





EMC VPLEX – Introduction and link overview

12 05 2010

I’m currently visiting the Boston area because I’m attending EMC World. One of the bigger introductions made here yesterday was actually a new appliance called the VPLEX. In short, the VPLEX is all about virtualizing the access to your block based storage.

Let me give you a quick overview of what I mean by virtualized access to block based storage. With VPLEX, you can take (almost) any block based storage device on a local and remote site, and allow active reads and writes on both sides. It’s an active/active setup that allows you to access any storage device via any port when you need to.

You can get two versions right now, the VPLEX Local and the VPLEX Metro. Two other versions, the VPLEX Geo and the VPLEX Global, are planned for early next year. And since there is so much information that can be found online about the VPLEX, I figured I’d create a post here that will help me find the links when I return, and also give you one spot that can help you find the info you need.

An overview with links to more information on the EMC VPLEX:

Official links / EMC company bloggers / VMware company bloggers

Blogs and media coverage:

Now, if I missed one or more links, please just send me a tweet or leave a comment and I will make sure that the link is added to this post.





My take on the stack wars

26 04 2010

As some of you might have read, the stack wars have started. One of the bigger coalitions announced in November 2009 was that between VMware, Cisco and EMC, aptly named VCE. Hitachi Data Systems announced something similar and partnered up with Microsoft, but left everyone puzzled about the partner that will be providing the networking technology in its stack. Companies like IBM have been able to provide customers with a complete solution stack for some time now, and IBM will be sure to tell its customers that they did so and offered the management tools in the form of anything branded Tivoli. To me, IBM’s main weakness is not so much the stack that they offer, as the sheer number of solutions and the lack of one tool to manage it all, let alone getting an overview of all possible combinations.

So, what is this thing called the stack?

Actually the stack is just that, a stack. A stack of what you say? A stack of solutions, bound together by one or more management tools, offered to you as a happy meal that allows you to run the desired workloads on this stack. Or to put things more simply and quote from the Gestalt IT stack wars post:

  • Standard hardware configurations are specified for ease of purchasing and support
  • The hardware stack includes blade servers, integrated I/O technology, Ethernet networking for connectivity, and SAN or NAS storage
  • Unifying software is included to manage the hardware components in one interface
  • A joint services organization is available to help in selection, architecture, and deployment
  • Higher-level software, from the virtualization hypervisor through application platforms, will be included as well

Until now, we have usually seen a standardized form of hardware, including storage and connectivity. Vendors mix that up with one or multiple management tools and tend to target some form of virtualization. Finally a service offering is included to allow the customer to get service and support from one source.

This strategy has its advantages.

Compatibility is one of my favorite ones. You no longer need to work through compatibility guides that are 1400 pages long and will burn you for installing a firmware version that was just one digit off and is now no longer supported in combination with one of your favorite storage arrays. You no longer have to juggle different release notes from your business warehouse provider, your hardware provider, your storage and network provider, your operating system and tomorrow’s weather forecast. Trying to find the lowest common denominator through all of this is still something magical. It’s actually a form of dark magic that usually means working long hours to find out if your configuration is even supported by all the vendors you are dealing with.

This is no longer the case with these stacks. Usually they are purpose or workload built and you have one central source where you get your support from. This source will tell you that you need at least firmware version X.Y on these parts to be eligible for support and you are pretty much set after that. And because you are working with a federated solution and received management tools for the entire stack, your admins can pretty much manage everything from this one console or GUI and be done with it. Or, if you don’t want to do that, you can use the service offering and have it done for you.

So far so good, right?

Yes, but things get more complicated from here on. For one, there is one major problem, and that is flexibility. One of the bigger concerns came up during the Gestalt IT tech field day vBlock session at Cisco. With the vBlock, I have a fixed configuration and it will run smoothly and within certain performance boundaries as long as I stick to the specifications. In the case of a vBlock this was a quite obvious example, where if I add more RAM to a server blade than is specified, I no longer have a vBlock and basically no longer have those advantages previously stated.

Solution stacks force me to think about the future. I might be an Oracle shop now as far as my database goes. And Oracle will run fine on my newly purchased stack. But what if I want to switch to Microsoft SQL Server in 3 years, because Mr. Ellison decided that he needs a new yacht and I no longer want to use Oracle? Is my stack also certified to run a different SQL server, or am I no longer within my stack boundaries and have I lost my single service source or the guaranteed workload it could hold?

What about updates for features that are important to me as a single customer? Or what about the fact that these solution stacks work great for new landscapes, or in a highly homogeneous environment? But what about those other Cisco switches that I would love to manage from the tools that are offered within my vBlock, but are outside of the vBlock scope, even if they are the same models?

What about something as simple as a “stack lock-in”? I don’t really have a vendor lock-in since only very few companies have the option of offering everything first hand. Microsoft doesn’t make server blades, Cisco doesn’t make SAN storage and that list goes on and on. But with my choice of stack, I am now locked in to a set of vendors, and I certainly have some tools to migrate into that stack, but migrating out is an entirely different story.

The trend is the stack, it’s as simple as that. But for how long?

We can see the trend clearly. Every vendor seems to be working on a stack offering. I’m still missing Fujitsu as a big hardware vendor in this area, but I am absolutely certain we will see something coming from them. Smaller companies will probably offer part of their portfolio under some sort of OEM license or perhaps features will just be re-branded. And if they are successful enough, they will most likely be swallowed by the bigger vendors at some point.

But as with everything in IT, this is just a trend. Anyone who has been in the business longer than me can probably confirm this. We started out with centralized systems, then moved towards a de-centralized environment. Now we are on the move again, centralizing everything.

I’m actually much more interested to see how long this trend will continue. I am certain that we will be seeing some more companies offer a complete solution stack, or joining in coalitions to offer said stack. I still think that Oracle was one of the first that pointed in this direction, but they were not the first to offer the complete stack.

So, how do you think this is going to continue? Do you agree with us? What companies do you think are likely to be swallowed, or will we see more coalitions from smaller companies? What are your takes on the advantages and disadvantages?

I’m curious to hear your take on this so let me know. I’m looking forward to what you have to say!





Shorts: How to check the FLARE version of your CLARiiON?

1 04 2010

I decided to introduce something new on my blog. It’s something I’ve decided to call “shorts”. In these shorts I will try to pick some fairly simple and common questions that come up from the searches to my blog and try to give a short descriptive answer to help you out.

So, in this short:

How to check the FLARE version of your CLARiiON?

There are two simple ways to check the release of your FLARE operating environment.

  1. Use the Navisphere GUI and right click on the array icon inside Navisphere. Select Properties from the menu and go to the “software” tab. This will give you an overview of all licensed software that is enabled on your array. Should you be in engineering mode, you will find all the software that was pre-loaded on the array, but only those items that have a dash/minus sign in front of them are enabled. In that list of items you should find something like this:
    FLARE-Operating-Environment 03.26.010.5.016
  2. You can also use the navicli or naviseccli to enter the command “navicli ndu -list -isactive” and get a list of all active software on your array. The entry for your FLARE version would look similar to this:
    Name of the software package:        FLARE-Operating-Environment
    Revision of the software package:    03.26.010.5.016
    Commit Required:                     NO
    Revert Possible:                     NO
    Active State:                        YES
    Required packages:                   FA_MIB 260, AnalyzerProvider 260, RPSplitterEngine 260, MVAEngine 260, OpenSANCopy 260, MirrorView 260, SnapView 260, EMCRemoteNG 260, SANCopyProvider 260, SnapViewProvider 260, SnapCloneProvider 260, MirrorProvider 260, CLIProvider 260, APMProvider 260, APMUI 260, AnalyzerUI 260, MirrorViewUI 260, SANCopyUI 260, SnapViewUI 260, ManagementUI 260, ManagementServer 260, Navisphere 260, Base 263
    Is installation completed:           YES
    Is this System Software:             NO
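If you save that output to a file or capture it in a script, pulling out the FLARE revision takes only a few lines. The Python sketch below parses text in the “label: value” layout shown above. The function names are my own, and the `flare_release` helper rests on an assumption: that the second field of the revision string is the FLARE release number, which is how the version numbers on this blog read (04.30 for FLARE 30).

```python
def parse_ndu_list(text):
    """Split 'label:   value' lines into one dict per software package."""
    packages = []
    current = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        label, value = line.split(":", 1)
        label, value = label.strip(), value.strip()
        # A new package block starts with its name line.
        if label == "Name of the software package" and current:
            packages.append(current)
            current = {}
        current[label] = value
    if current:
        packages.append(current)
    return packages


def flare_revision(text):
    """Return the revision string of the FLARE package, or None if absent."""
    for package in parse_ndu_list(text):
        if package.get("Name of the software package") == "FLARE-Operating-Environment":
            return package.get("Revision of the software package")
    return None


def flare_release(revision):
    """Assumption: the second dotted field is the FLARE release number."""
    return int(revision.split(".")[1])


SAMPLE = """\
Name of the software package:        FLARE-Operating-Environment
Revision of the software package:    03.26.010.5.016
Commit Required:                     NO
Active State:                        YES
"""

revision = flare_revision(SAMPLE)
print(revision)                 # 03.26.010.5.016
print(flare_release(revision))  # 26
```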

As you can see, finding out which version of FLARE you have is actually quite simple. Good luck, and let me know if this works for you.





The Asymmetrical Logical Unit Access (ALUA) mode on CLARiiON

3 02 2010

I’ve noticed that I have been getting a lot of search engine hits relating to the various features, specifications and problems on the EMC CLARiiON array. One of the searches was related to a feature that has been around for a bit. It was actually introduced in 2001, but in order to give a full explanation I’m just going to start at the beginning.

The beginning is actually somewhere in 1979 when the founder of Seagate Technology, Alan Shugart, created the Shugart Associates Systems Interface (SASI). This was the early predecessor of SCSI and had a very rudimentary set of capabilities. Only a few commands were supported and speeds were limited to 1.5 Mb/s. In 1981, Shugart Associates was able to convince the NCR corporation to team up and thereby convince ANSI to set up a technical committee to standardize the interface. This was realized in 1982 and known as the “X3T9.2 technical committee” and resulted in the name being changed to SCSI.

The committee published their first interface standard in 1986, but would go on to become the group known now as “International Committee for Information Technology Standards” or INCITS, which is actually responsible for many of the standards used by storage devices such as T10 (SCSI), T11 (Fibre Channel) and T13 (ATA).

Now, in July 2001 the second revision of the SCSI Primary Commands (SPC-2) was published, and this included a feature called Asymmetrical Logical Unit Access mode or in short ALUA mode, and some changes were made in the newer revisions of the primary command set.

Are you with me so far? Good.

On Logical Unit Numbers

Since you came here to read this article I will just assume that I don’t have to explain the concept of a LUN. But what I might need to explain is that it’s common to have multiple connections to a LUN in environments that are concerned with the availability of their disks. Depending on the fabric and the number of Fibre Channel cards you have connected, you can have multiple paths to the same LUN. And if you have multiple paths you might as well use them, right? It’s no good having the additional bandwidth lying around and then not using it.

Since you have multiple paths to the same disk, you need a tool that will somehow merge these paths and tell your operating system that this is the same disk. This tool might even help you achieve a higher throughput since it can balance the reads and writes over all of the paths.

As you might already have guessed there are multiple implementations of this, usually called Multipathing I/O, MPIO or just plainly Multipath, and you will be able to find a solution natively or as an additional piece of software for most modern operating systems.

What might be less obvious is that the connections to these LUNs don’t have to behave in the same way. Depending on what you are connecting to, you have several states for that connection. Or to draw the analogy to the CX4, some paths are active and some paths are passive.

Normally a path to a CLARiiON is considered active when we are connected to the storage processor that is currently serving you the LUN. CLARiiON arrays are so called “active/passive” arrays, meaning that only one storage processor is in charge of a LUN, and the secondary storage processor is just waiting for a signal to take over the ownership in case of a failure. The array will normally receive a signal that tells it to switch from one storage processor to the other one. This routine is called a “trespass” and happens so fast that you usually don’t really notice such a failover.

When we go back to the host, the connection state will be shown as active for the connection that is routed to the active storage processor, and something like “standby” or “passive” for the connection that goes to the storage processor that is not serving you that LUN. Also, since you have multiple connections, it’s not unlikely that the different paths also differ in other properties. Things like bandwidth (you may have added a faster HBA later) or latency can be different. Due to these characteristics, the target ports might need to indicate how efficient a path is. And if a failure should occur, the link status might change, causing a path to go offline.

You can check the status of a path to a LUN by asking the port on the storage array, the so called “target port”. For example, you can check the access characteristics of a path by sending the following SCSI command:

  • REPORT TARGET PORT GROUPS (RTPG)

Similar commands exist to actually set the state of a target port.

So where does ALUA come in?

What the ALUA interface does is allow an initiator (your server or the HBA in your server) to discover target port groups. Simply put, a group of ports that provide a common failover behavior for your LUN(s). By using the SCSI INQUIRY response, we find out to what standard the LUN adheres, if the LUN provides symmetric or asymmetric access, and if the LUN uses explicit or implicit failover.
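To make the INQUIRY part a bit more concrete: as far as I understand the SPC-3 layout (this detail is not spelled out anywhere in this post, so treat it as my assumption), the implicit/explicit information sits in the two-bit TPGS field, bits 5 and 4 of byte 5 of the standard INQUIRY data. A minimal Python sketch of decoding that field follows. The function name and the sample bytes are made up for illustration.

```python
# Meaning of the two-bit TPGS (Target Port Group Support) field,
# as I understand it from SPC-3.
TPGS_MODES = {
    0b00: "no ALUA support",
    0b01: "implicit ALUA only",
    0b10: "explicit ALUA only",
    0b11: "implicit and explicit ALUA",
}


def tpgs_mode(inquiry_data):
    """Decode the TPGS field (byte 5, bits 5-4) of standard INQUIRY data."""
    if len(inquiry_data) < 6:
        raise ValueError("standard INQUIRY data needs at least 6 bytes")
    tpgs = (inquiry_data[5] >> 4) & 0b11
    return TPGS_MODES[tpgs]


# Hypothetical first six bytes of an INQUIRY response with TPGS set to 01b.
sample = bytes([0x00, 0x00, 0x05, 0x02, 0x1F, 0x10])
print(tpgs_mode(sample))  # implicit ALUA only
```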

To put it more simply, ALUA allows me to reach my LUN via the active and the inactive storage processor. Oversimplified, this just means that all traffic that is directed to the non-active storage processor will be routed internally to the active storage processor.

On a CLARiiON that is using ALUA mode this will result in the host seeing paths that are in an optimal state, and paths that are in a non-optimal state. The optimal path is the path to the active storage processor; it is ready to perform I/O and will give you the best performance. The non-optimal path is also ready to perform I/O but won’t give you the best performance since you are taking a detour.
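From the host side, that boils down to a simple preference rule: use an optimal path if one is online, and only take the detour if none is. Here is a small Python sketch of that rule. It is an illustration of the idea only, not how PowerPath or any native multipathing driver is actually implemented, and the `Path` class and path names are made up.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    optimized: bool      # True if the path leads to the processor that owns the LUN
    online: bool = True


def pick_path(paths):
    """Prefer an online optimal path; fall back to an online non-optimal one."""
    online = [p for p in paths if p.online]
    if not online:
        return None
    # sorted() is stable, so optimal paths come first in their original order.
    return sorted(online, key=lambda p: not p.optimized)[0]


paths = [Path("hba0-spb", optimized=False), Path("hba0-spa", optimized=True)]
print(pick_path(paths).name)  # hba0-spa

# If the optimal path goes offline, I/O takes the detour instead.
paths[1].online = False
print(pick_path(paths).name)  # hba0-spb
```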

The ALUA mode is available on CX-3 and CX-4, but the results you get can vary between both arrays. For example, if you want to use ALUA with your vSphere installation you will need to use the CX-4 with FLARE 26 or newer and change the failover mode to “4”. Once you have changed the failover mode you will see a slightly different trespass behavior, since you can now either manually initiate a trespass (explicit) or the array itself can perform a trespass once it notices that the non-optimal path has received at least 128,000 more I/Os than the optimal path (implicit).
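The implicit rule is easy to express as a comparison. The Python sketch below only captures the threshold described in this post. How the array actually counts I/Os internally (per LUN, over which time window, and so on) is not something I can tell you, so the function and its parameters are purely illustrative.

```python
# Threshold quoted in this post for an implicit trespass.
IMPLICIT_TRESPASS_THRESHOLD = 128_000


def should_trespass(optimal_ios, non_optimal_ios,
                    threshold=IMPLICIT_TRESPASS_THRESHOLD):
    """True once the non-optimal path leads the optimal path by the threshold."""
    return non_optimal_ios - optimal_ios >= threshold


print(should_trespass(optimal_ios=10_000, non_optimal_ios=50_000))   # False
print(should_trespass(optimal_ios=10_000, non_optimal_ios=138_000))  # True
```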

Depending on which software you use – PowerPath or for example the native solution – you will find that ALUA is supported or not. You can take a look at Primus ID: emc187614 in Powerlink to get more details on supported configurations. Please note that you need a valid Powerlink account to access that Primus entry.





Downloading the EMC CLARiiON CX / Navisphere simulator

22 01 2010

I just wanted to write a really short post to share this tip with you. A lot of people seem to stumble on this site while they are looking to do some tests. Now, as always, you will most likely not have a full-on storage array sitting around that is just waiting to be a guinea pig while serving your production environment.

A partial solution is to test things in a simulator. For people who want to test things on their Cisco switches there is an open source “Internetwork Operating System” or IOS simulator that gives you a taste of the real thing. Admittedly it’s not the same as having a full environment, but it might just help you in testing a scenario or routine that you have in mind.

Now, you will find that there is also a simulator for the CLARiiON environment that is called the “Navisphere simulator” and a CX simulator. Problem is that the simulator can’t be downloaded with any old Powerlink account. Partners and employees can use a simple download in Powerlink (Home => Products => Software E-O => Navisphere Management Suite => Demos), but if you don’t fall under that category you will have a hard time actually finding a download.

Normally, to get the simulator you would need to order some CLARiiON training. The Navisphere and CX simulators are actually packaged with the Foundations course, and you can also find them in one of the video instructor-led trainings. The problem is that you or your boss will pay quite a bit for said trainings, and this is not great if you just want to perform a quick test.

Now for my tip… Buy the “Information Storage and Management” book (ISBN-13: 978-0-470-29421-5 / ISBN-10: 0-470-29421-3) from your favorite book supplier. Besides being a good read, it also allows you to register on a special site created for the book, where you can find some learning aids that include the Navisphere simulator and the CX simulator. You can find the book starting around $40, and there’s also a version available for the Kindle if you are into e-books. You don’t need any special information to register the book on the EMC site, so it’s quite a quick way to get the simulators and check if you can actually simulate the scenario you have in mind.





The thing about metas, SRDF/S and performance

8 01 2010

It’s not very common knowledge, but there is actually a link between the I/O performance you see on your server and the number of metas you configured when using SRDF/S.

I do a lot of stuff in our company and I tend to get pulled into performance escalations, usually because I know my way around most modern operating systems and I know a bit about storage and about our applications and databases. Usually the problems all boil down to a common set of issues, and perhaps one day I will post a catalog of common performance troubleshooting tips here, but I wanted to use this post to write about something that was new to me and that I thought might be of use to you.

We have a customer with a large installation on Linux that was seeing performance issues in his average dialog response time. Now, for those who don’t know what a dialog response time is, it is the time it takes an SAP system to display a screen of information, process any data entered or requested there by the database and output the next screen with the requested information. It doesn’t include any time needed for network traffic or the time taken up by the front-end systems.

The strange thing was that the database reported fairly good response times and an excellent cache hit ratio, but also reported that any waits were produced by the disks it used. When we looked at the Symmetrix box behind it we could not see any heavy usage on the disks, and it appeared to be mostly “picking its nose”.

After a long time we got the suggestion that perhaps the SRDF/S mirroring was to blame for this delay. We decided to change to an RDF mode called “Adaptive Copy Write Pending” or ACWP and did indeed see a performance improvement, even though the database and storage box didn’t seem to show the same improvement that was seen in the dialog response time.

Then, someone asked a fairly simple question:

“How many meta members do you use for your LUNs?”

Now, the first thought with a question like that is usually along the lines of the number of spindles, short stroking and similar stuff. Until he said that the number of meta members also influences the performance when using SRDF/S. And that’s where it gets interesting, and I’m going to try and explain why.

To do that, let’s first take a closer look at how SRDF works. SRDF/S usually gives you longer write response times. This is because you write to the first storage box, copy everything over to the second box, receive an acknowledgment from the second box and only then respond back to say that the write was ok. You have to take things like propagation delay and RDF write times into account.
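To get a feeling for the propagation delay part, here is a rough sketch. It assumes light travels through fiber at roughly 5 microseconds per kilometer and that a write needs a single round trip over the link; a real SRDF link may well need more than one. The function names are mine, and the 12 miles is the distance used in the example further down.

```python
KM_PER_MILE = 1.609344
FIBER_DELAY_US_PER_KM = 5.0  # assumption: light covers ~200,000 km/s in glass


def round_trip_ms(distance_miles, round_trips=1):
    """Propagation delay in ms for the given number of link round trips."""
    one_way_us = distance_miles * KM_PER_MILE * FIBER_DELAY_US_PER_KM
    return 2 * one_way_us * round_trips / 1000.0


def sync_write_ms(local_ms, remote_ms, distance_miles, round_trips=1):
    """Local write + link propagation + remote write, all in milliseconds."""
    return local_ms + round_trip_ms(distance_miles, round_trips) + remote_ms


print(round(round_trip_ms(12), 3))  # 0.193
```

So at this distance the pure propagation delay is only a fraction of a millisecond per round trip; the rest of the write service time comes from the two arrays and the RDF link itself.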

Now, you also need to consider that when you are using the synchronous mode, you can only have 1 outstanding write I/O per hyper. That means that if your meta consists of 4 hyper volumes you get 4 outstanding write I/Os. If you create your meta out of more hyper volumes, you also increase the maximum number of outstanding write I/Os, which allows higher sustained write rates if your workload is spread evenly.

So, let’s say for example you have a host that is doing 8 KB write I/Os to a meta consisting of 2 hypers. The remote site is about 12 miles away and you have a write service time of 2 ms. Since there are 1000 ms in one second, each hyper can do roughly 500 IOPS: you divide the 1000 ms by the service time of 2 ms, so 1000 ms / 2 ms = 500.

Now, with 2 hypers in your meta you would have roughly 8 MB/sec:
2 (hypers) x 500 IOPS x 8 KB = 8,000 KB/sec.

And you can also see that if we increase the number of hypers, we also increase the maximum value. This is mostly true for random writes, and the behavior will be slightly different for sequential loads since these use a stripe size of 960 KB. And don’t forget that this is a cache-to-cache value, since we are talking about the data being transferred between the Symmetrixes. We won’t receive a write commit until we get a write acknowledgment from the second storage array.
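The back-of-the-envelope numbers above are easy to play with in a few lines of Python. This is just the same arithmetic wrapped in two helper functions (the names are mine), assuming one outstanding write per hyper and a perfectly even random write workload:

```python
def iops_per_hyper(service_time_ms):
    """One outstanding write per hyper: 1000 ms divided by the service time."""
    return 1000.0 / service_time_ms


def max_write_kb_per_sec(hypers, service_time_ms, io_size_kb):
    """Upper bound on synchronous write throughput for a meta, in KB/s."""
    return hypers * iops_per_hyper(service_time_ms) * io_size_kb


for hypers in (2, 4, 8, 16):
    kb_s = max_write_kb_per_sec(hypers, service_time_ms=2, io_size_kb=8)
    print(f"{hypers:2d} hypers: {kb_s / 1000:.0f} MB/s")
```

With the 2 ms service time and 8 KB writes from the example, this gives 8 MB/s for 2 hypers and doubles every time you double the number of hypers. Treat it as a ceiling, not as a prediction of what a real workload will reach.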

So, what we will be doing next are two things. We will be increasing the number of hypers for the metas that our customer is using. Besides that we will also be upgrading our Enginuity since we expect a slightly different caching behavior.

I’ll try to update this post once we have changed the values, just to give you a feel for the difference it made (or perhaps did not make), and I hope this information is useful for anyone facing similar problems.







