Clariion, CX3, EMC, GestaltIT, Storage

The Asymmetrical Logical Unit Access (ALUA) mode on CLARiiON

I’ve noticed that I have been getting a lot of search engine hits relating to the various features, specifications and problems on the EMC CLARiiON array. One of the searches was related to a feature that has been around for a bit. It was actually introduced in 2001, but in order to give a full explanation I’m just going to start at the beginning.

Detour

The beginning is actually somewhere in 1979 when the founder of Seagate Technology, Alan Shugart, created the Shugart Associates Systems Interface (SASI). This was the early predecessor of SCSI and had a very rudimentary set of capabilities. Only a few commands were supported and speeds were limited to 1.5 MB/s. In 1981, Shugart Associates was able to convince the NCR corporation to team up, and together they convinced ANSI to set up a technical committee to standardize the interface. This committee was formed in 1982 as the “X3T9.2 technical committee”, and the interface was renamed SCSI.

The committee published its first interface standard in 1986, and would go on to become the group now known as the “International Committee for Information Technology Standards” or INCITS, which is responsible for many of the standards used by storage devices, such as T10 (SCSI), T11 (Fibre Channel) and T13 (ATA).

Now, in July 2001 the second revision of the SCSI Primary Commands (SPC-2) was published, and this included a feature called Asymmetrical Logical Unit Access mode or in short ALUA mode, and some changes were made in the newer revisions of the primary command set.

Are you with me so far? Good.

On Logical Unit Numbers

Since you came here to read this article I will just assume that I don’t have to explain the concept of a LUN. But what I might need to explain is that it’s common to have multiple connections to a LUN in environments that are concerned with the availability of their disks. Depending on the fabric and the number of Fibre Channel cards you have connected, you can have multiple paths to the same LUN. And if you have multiple paths you might as well use them, right? It’s no good having the additional bandwidth lying around and then not using it.

Since you have multiple paths to the same disk, you need a tool that will somehow merge these paths and tell your operating system that this is the same disk. This tool might even help you achieve a higher throughput since it can balance the reads and writes over all of the paths.

As you might already have guessed there are multiple implementations of this, usually called Multipathing I/O, MPIO or just plainly Multipath, and you will be able to find a solution natively or as an additional piece of software for most modern operating systems.

What might be less obvious is that the connections to these LUNs don’t have to behave in the same way. Depending on what you are connecting to, a connection can be in one of several states. Or to draw the analogy to the CX4, some paths are active and some paths are passive.

Normally a path to a CLARiiON is considered active when we are connected to the service processor that is currently serving you the LUN. CLARiiON arrays are so called “active/passive” arrays, meaning that only one service processor is in charge of a LUN, and the secondary service processor is just waiting for a signal to take over the ownership in case of a failure. The array will normally receive a signal that tells it to switch from one service processor to the other one. This routine is called a “trespass” and happens so fast that you usually don’t really notice such a failover.

When we go back to the host, the connection state will be shown as active for the connection that is routed to the active service processor, and something like “standby” or “passive” for the connection that goes to the service processor that is not serving you that LUN. Also, since you have multiple connections, it’s not unlikely that the paths differ in other properties as well. Things like bandwidth (you may have added a faster HBA later) or latency can be different. Because of these characteristics, the target ports might need to indicate how efficient a path is. And if a failure should occur, the link status might change, causing a path to go offline.

You can check the status of a path to a LUN by asking the port on the storage array, the so-called “target port”. For example, you can check the access characteristics of a path by sending the following SCSI command:

  • REPORT TARGET PORT GROUPS (RTPG)

Similar commands exist to actually set the state of a target port.
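To give a feel for what comes back from RTPG, here is a minimal Python sketch that decodes the response buffer. The byte layout (a 4-byte length header, then 8-byte target port group descriptors, each followed by 4-byte target port descriptors) and the state codes are my reading of SPC-3, so treat them as assumptions and check them against the standard. The sample buffer is made up for illustration, and sending the actual command to a device is not shown.

```python
import struct

# Asymmetric access state codes as I understand them from SPC-3
# (assumption: verify against the standard before relying on them).
ACCESS_STATES = {
    0x0: "active/optimized",
    0x1: "active/non-optimized",
    0x2: "standby",
    0x3: "unavailable",
    0xF: "transitioning",
}


def parse_rtpg(data: bytes):
    """Decode a raw RTPG response into a list of target port groups."""
    (length,) = struct.unpack_from(">I", data, 0)  # bytes after the header
    end = min(4 + length, len(data))
    groups = []
    offset = 4
    while offset + 8 <= end:
        state_byte = data[offset]
        (group_id,) = struct.unpack_from(">H", data, offset + 2)
        port_count = data[offset + 7]
        ports = []
        for i in range(port_count):
            port_off = offset + 8 + 4 * i
            if port_off + 4 > end:
                break  # truncated buffer, stop reading ports
            (rel_port,) = struct.unpack_from(">H", data, port_off + 2)
            ports.append(rel_port)
        groups.append({
            "group": group_id,
            "preferred": bool(state_byte & 0x80),  # PREF bit
            "state": ACCESS_STATES.get(state_byte & 0x0F, "reserved"),
            "ports": ports,
        })
        offset += 8 + 4 * port_count
    return groups


# Made-up response: two port groups with one target port each.
SAMPLE = (
    struct.pack(">I", 24)
    + bytes([0x80, 0x03, 0x00, 0x01, 0, 0, 0, 1]) + bytes([0, 0, 0, 1])
    + bytes([0x01, 0x03, 0x00, 0x02, 0, 0, 0, 1]) + bytes([0, 0, 0, 2])
)

for group in parse_rtpg(SAMPLE):
    print(group)
```

In this made-up sample, group 1 would be the ports on the storage processor that owns the LUN and group 2 the ports on its peer.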

So where does ALUA come in?

What the ALUA interface does is allow an initiator (your server or the HBA in your server) to discover target port groups. Simply put, a group of ports that provide a common failover behavior for your LUN(s). By using the SCSI INQUIRY response, we find out to what standard the LUN adheres, if the LUN provides symmetric or asymmetric access, and if the LUN uses explicit or implicit failover.

To put it more simply, ALUA allows me to reach my LUN via the active and the inactive service processor. Oversimplified this just means that all traffic that is directed to the non-active service processor will be routed internally to the active service processor.

On a CLARiiON that is using ALUA mode this will result in the host seeing paths that are in an optimal state, and paths that are in a non-optimal state. The optimal path is the path to the active storage processor; it is ready to perform I/O and will give you the best performance. The non-optimal path is also ready to perform I/O, but won’t give you the best performance since you are taking a detour.
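From the host side, the practical consequence is a simple preference order. The sketch below is a hypothetical model of how a multipath layer could choose between paths; the `Path` class, the state names and `pick_paths` are made up for illustration and do not correspond to PowerPath or any native MPIO implementation.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    state: str  # "optimized", "non-optimized" or "offline" (hypothetical labels)


def pick_paths(paths):
    """Return the paths I/O should use: optimized ones if any, else the detour."""
    optimized = [p for p in paths if p.state == "optimized"]
    if optimized:
        return optimized
    # No optimized path left: fall back to the non-optimized paths, which
    # still work but are routed internally to the owning storage processor.
    return [p for p in paths if p.state == "non-optimized"]


paths = [
    Path("hba0:spa0", "optimized"),
    Path("hba1:spa1", "offline"),
    Path("hba0:spb0", "non-optimized"),
]
print([p.name for p in pick_paths(paths)])
```

If the optimized path in this example went offline as well, the function would hand back the non-optimized path instead of failing the I/O.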

The ALUA mode is available on CX-3 and CX-4, but the results you get can vary between both arrays. For example if you want to use ALUA with your vSphere installation you will need to use the CX-4 with FLARE 26 or newer and change the failover mode to “4”. Once you have changed the failover mode you will see a slightly different trespass behavior since you can now either manually initiate a trespass (explicit) or the array itself can perform a trespass once it’s noticed that the non-optimal path has received 128,000 or more I/Os than the optimal path (implicit).
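The implicit trespass rule can be modelled in a few lines. This is only a sketch of the behavior described above, not the FLARE implementation: the class, the storage processor names and the counter reset after a trespass are assumptions made for illustration. Only the 128,000 threshold comes from the text.

```python
class LunOwnership:
    """Toy model of explicit and implicit trespass for a single LUN."""

    THRESHOLD = 128_000  # I/O difference that triggers an implicit trespass

    def __init__(self, owner="SPA", peer="SPB"):
        self.owner = owner
        self.peer = peer
        self.optimal_ios = 0      # I/Os received via the owning SP
        self.non_optimal_ios = 0  # I/Os received via the peer SP

    def record_io(self, sp, count=1):
        """Count I/Os per path type and trespass implicitly when needed."""
        if sp == self.owner:
            self.optimal_ios += count
        else:
            self.non_optimal_ios += count
        if self.non_optimal_ios - self.optimal_ios >= self.THRESHOLD:
            self.trespass()

    def trespass(self):
        """Explicit trespass: swap ownership (counter reset is an assumption)."""
        self.owner, self.peer = self.peer, self.owner
        self.optimal_ios = 0
        self.non_optimal_ios = 0


lun = LunOwnership()
lun.record_io("SPB", 127_999)
print(lun.owner)  # still owned by the original SP
lun.record_io("SPB", 1)
print(lun.owner)  # ownership has moved to the peer
```

Calling `trespass()` directly corresponds to the explicit, manually initiated case.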

Depending on which software you use – PowerPath or for example the native solution – you will find that ALUA is supported or not. You can take a look at Primus ID: emc187614 in Powerlink to get more details on supported configurations. Please note that you need a valid Powerlink account to access that Primus entry.

Clariion, EMC, Storage

Downloading the EMC CLARiiON CX / Navisphere simulator

I just wanted to write a really short post to share this tip with you. A lot of people seem to stumble on this site while they are looking to do some tests. Now, as always, you will most likely not have a full-on storage array sitting around that is just waiting to be a guinea pig while serving your production environment.

A partial solution is to test things in a simulator. For people who want to test things on their Cisco switches there is an open source “Internetwork Operating System” or IOS simulator that gives you a taste of the real thing. Admittedly it’s not the same as having a full environment, but it might just help you in testing a scenario or routine that you have in mind.

Now, you will find that there is also a simulator for the CLARiiON environment that is called the “Navisphere simulator” and a CX simulator. Problem is that the simulator can’t be downloaded with any old Powerlink account. Partners and employees can use a simple download in Powerlink ( Home => Products => Software E-O => Navisphere Management Suite => Demos) , but if you don’t fall under that category you will have a hard time actually finding a download.

Normally to get the simulator you would need to order some CLARiiON training. The Navisphere and CX simulators are actually packaged with the Foundations course and you can also find them in one of their video instructor led trainings. The problem is that you or your boss will pay quite a bit for said trainings, and this is not great if you just want to perform a quick test.

Now for my tip… Buy the “Information and Storage Management” book (ISBN-13: 978-0-470-29421-5 / ISBN-10: 0-470-29421-3) from your favorite book supplier. Beside it being a good read it also allows you to register on a special site created for the book where you can actually find some learning aids that also include the Navisphere simulator and the CX simulator. You can find the book starting around $40 and there’s also a version available for the Kindle if you are in to e-books. You don’t need any special information to register the book on the EMC site so it’s quite a quick way to get the simulators and check if you can actually simulate the scenario you have in mind.

DMX, EMC, Enginuity, Performance, Storage, Symmetrix

The thing about metas, SRDF/S and performance

It’s not very common knowledge, but there is actually a link between the I/O performance you see on your server and the number of metas you configured when using SRDF/S.

I do a lot of stuff in our company and I tend to get pulled in to performance escalations. Usually because of the fact that I know my way around most modern operating systems, I know a bit about storage and about our applications and databases. Usually the problems all boil down to a common set of issues, and perhaps one day I will post a catalog of common performance troubleshooting tips here, but I wanted to use this post to write about something that was new to me and I thought it might be of use to you.

We have a customer with a large installation on Linux that was seeing performance issues in his average dialog response time. Now, for those who don’t know what a dialog response time is, it is the time it takes an SAP system to display a screen of information, process any data entered or requested there by the database and output the next screen with the requested information. It doesn’t include any time needed for network traffic or the time taken up by the front-end systems.

The strange thing was that the database reported fairly good response times and an excellent cache hit ratio, but also reported that any waits were produced by the disks it used. When we looked at the Symmetrix box behind it we could not see any heavy usage on the disks, and it appeared to be mostly “picking its nose”.

After a long time we got the suggestion that perhaps the SRDF/S mirroring was to blame for this delay. We decided to change to an RDF mode called “Adaptive Copy Write Pending” or ACWP and did indeed see a performance improvement, even though the database and storage box didn’t seem to show the same improvement that was seen in the dialog response time.

Then, someone asked a fairly simple question:

“How many meta members do you use for your LUNs?”

Now, the first thought with a question like that is usually along the lines of the number of spindles, short stroking and similar stuff. Until he said that the number of meta members also influences the performance when using SRDF/S. And that’s where it gets interesting, and I’m going to try and explain why.

To do that let’s first take a closer look at how SRDF works. SRDF/S usually gives you longer write response times. This is because you write to the first storage box, copy everything over to the second box, receive an acknowledgement from the second box and only then respond back to say that the write was ok. You have to take things like propagation delay and RDF write times into account.

Now, you also need to consider that when you are using the synchronous mode, you can only have 1 outstanding write I/O per hyper. That means that if your meta consists of 4 hyper volumes you get 4 outstanding write I/Os. If you create your meta out of more hyper volumes you also increase the maximum number of outstanding write I/Os, which allows higher sustained write rates if your workload is spread evenly.

So, let’s say for example you have a host that is doing 8 KB write I/Os to a meta consisting of 2 hypers. The remote site is about 12 miles away and you have a write service time of 2 ms. Since there are 1000 ms in one second, each hyper can do roughly 500 IOPS: you divide the 1000 ms by the service time of 2 ms, so 1000 ms / 2 ms = 500.

Now, with 2 hypers in your meta you would have roughly 8 MB/sec:
2 (hypers) x 500 IOPS x 8 KB = 8,000 KB/sec.

And you can also see that if we increase the number of hypers, we also increase the maximum value. This is mostly true for random writes, and the behavior will be slightly different for sequential loads since these use a stripe size of 960 KB. And don’t forget that this is a cache to cache value since we are talking about the data being transferred between the Symmetrixes. We won’t receive a write commit until we get a write acknowledge from the second storage array.
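The arithmetic above is easy to play with if you put it in a small function. This is just a back-of-the-envelope sketch in Python under the same assumptions as the example (one outstanding write per hyper, an evenly spread random write load and a fixed service time); the function name and parameters are made up for illustration.

```python
def srdf_s_write_ceiling(hypers, service_time_ms, io_size_kb):
    """Estimate the maximum write IOPS and KB/sec for a meta under SRDF/S.

    Assumes one outstanding write I/O per hyper and a workload that is
    spread evenly over all hypers in the meta.
    """
    iops_per_hyper = 1000 / service_time_ms  # 1000 ms in one second
    total_iops = hypers * iops_per_hyper
    kb_per_sec = total_iops * io_size_kb
    return total_iops, kb_per_sec


# The example from the text, followed by larger metas for comparison.
for hypers in (2, 4, 8):
    iops, kb = srdf_s_write_ceiling(hypers, service_time_ms=2, io_size_kb=8)
    print(f"{hypers} hypers: {iops:.0f} IOPS, {kb:.0f} KB/sec")
```

Doubling the number of hypers doubles the ceiling, which is the whole point of the question about meta members.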

So, what we will be doing next are two things. We will be increasing the number of hypers for the metas that our customer is using. Besides that we will also be upgrading our Enginuity since we expect a slightly different caching behavior.

I’ll try to update this post once we have changed the values, just to give you a feel for the difference it made (or perhaps did not make), and I hope this information is useful for anyone facing similar problems.

Clariion, CX4, EMC, FLARE

What’s new in EMC Clariion CX4 FLARE 29

CLARiiON CX4 UltraFlex I/O module - Copyright: EMC Corporation.

Along with the release of FAST, EMC also released a new version of its CLARiiON Fibre Logic Array Runtime Environment, or in short “FLARE” operating environment. This release brings us to version 04.29 and offers some interesting enhancements, so I thought I’d give you an overview of what’s in there:

Let’s start off with some basics. Along with this update you will find updated firmware versions for the following:

    Enclosure: DAE2		- FRUMon: 5.10
    Enclosure: DAE2-ATA	- FRUMon: 1.99
    Enclosure: DAE2P	- FRUMon: 6.69
    Enclosure: DAE3P	- FRUMon: 7.79

Major changes:

  • VLAN tagging for 1Gb/s and 10Gb/s iSCSI interfaces.
  • Support for 10Gb/s dual port optical I/O modules.
  • Spin down support for storage system and/or RAID group. Once enabled drives spin down automatically if no user or system I/O has been recognized for 30 minutes. These SATA drives support spin down:
    • 00548797
    • 00548853
    • 00548829
  • Shrinking of FLARE LUNs and meta LUNs. Note that this is only supported on Windows hosts that are capable of shrinking logical disks.
  • Upgrade of UltraFlex I/O modules with an increased performance, more specifically 8Gb FC and 10Gb iSCSI. Note that only an upgrade is supported, a downgrade from for example 8Gb FC to 4Gb FC will not work.
  • Rebuild logging is now supported on RAID6 LUNs, which means that a drive that may have been issuing timeouts will have its I/O logged, and only the pending writes are rebuilt.
  • The maximum number of LUNs per storage group has been increased from 256 for all CX4 models with FLARE 28 to the following:
    • CX4-120 – 512
    • CX4-240 – 512
    • CX4-480 – 1024
    • CX4-960 – 1024

You can find an overview with the supported connectivity options and front-end and back-end ports right here.

EMC, FAST, GestaltIT, Storage

EMC’s FAST, take 1. Action!

As you might have read in my earlier blog post, EMC has announced the release of the first version of their product called “Fully Automated Storage Tiering” or in short “FAST”.

Now, to describe the purpose of this technology in a very simple form, we are talking about the performance of your storage and some logic that will help you put those things that need performance on the fastest bits available in your storage environment.

And that’s about as far as we can go with the concept of simple. Why? Because if this technology is to add value, you need it to be really clever. You would almost need it to be a bit of a mind reader if you will. You will want it to know what your application is going to do, and you will want to know where it does that on the most granular level of your storage, namely the blocks on the disks. Or more simply, you don’t want it to react, you want it to behave proactively.

So, let’s start with some mixed news:

  • FAST v1 is available on Symmetrix V-Max, Clariion CX4 and Celerra NS

  • As some of you will notice these three platforms have something in common. EMC tried to get rid of custom ASICs in favor of commodity x86 based hardware for as much as they could. In the new V-Max you will only find one custom ASIC, which resides on the Virtual Matrix Interface controller and is responsible for the coordination of local and remote memory access.

    This swap to x86/x64 and a 64 bit architecture was done on all three mentioned platforms. On its own this is a good thing, but it would also be a good explanation why EMC as of now is not supporting older arrays. EMC is bound to get requests for this new technology for their older arrays like the CX3 or the DMX4. There are two likely options there:

      1: It’s not going to happen.

      Porting the code to a different hardware platform is a pain. The logic behind it is still the same, but the question is, up to where would you backport it? DMX3? DMX2? Where would you draw the line? Combine that with the fact that not all the newer features are available on the older machines and you can probably imagine that it would be easier to just not make these features available on older arrays.

      2: They are almost done and will release it sooner than anyone thought.

      EMC has a lot of developers. Chances are they were also working on FAST for the other platforms and will be releasing it in the not too far future.

    Since we will be seeing arrays being removed from the product purchase portfolio, my money is on option number one. You won’t have the option of buying a DMX3 within the next half-year. And you can also replace half a year with 1.5 year for the DMX4. Sure, you can get extended support which will add four or five years to the life cycle of your array, but implementing new features for arrays which will not be sold anymore in the near future? I find that sort of unlikely.

  • FAST v1 will only work on a LUN level.

  • As explained before, normally your application won’t be updating the data on the entire LUN. Usually you have a few so-called “hot zones”, which are just blocks of data that are being accessed by reads and/or writes more frequently. An excellent graphical example of this fact is something called a “heat map”. This heat map is created by an (unfortunately) internal EMC application called SymmMerge, but fortunately fellow blogger Barry Burke, a.k.a. “The storage anarchist”, allowed me to use some images from his blog.

    So, this would be the situation in a fairly common environment:

    D604D646-5ADE-48F1-8BA5-358D78A5F8C1.jpg

    Note that in this image we are talking about actual disks, but the image will also work if we just simply replace the word “drives” with “blocks”. The green blocks are doing fine, but the red and orange blocks are the ones that are being accessed a lot.

    The ideal solution would normally be to put the red and orange blocks on a faster medium. EMC would normally tell you that the ideal medium for these kinds of blocks would be EFDs or “Enterprise Flash Drives”. And you could put the green blocks on a medium that does not need to deliver quite as much performance or the same response times, such as regular Fibre Channel drives or perhaps even cheaper SATA drives for bulk storage. Each class of drive (EFD, FCD, SATA) is called a tier, hence the term “Tiering”.

    After a redistribution your image would look something like this, where all blocks would be on a storage class that suits their individual performance needs:

    F19A5D7F-0D02-4C96-B8D1-85DA8AC79D1C.jpg

    Now, probably one of the biggest pain points for a lot of people is that this version of FAST is not capable of doing this on a block level. Version 1 is only capable of moving data on to a different tier on a LUN level. But your database/CRM/BW/etc. normally does not read and/or write to the entire LUN.

  • The value of policies.

  • So with this version of FAST you actually put a lot more data on a faster tier than you would actually need to. On the other hand EMC stated that the key value for FAST version 1 is not so much in the fact that you move your LUNs to different tiers, but in the fact that you can set user policies to have the system do this for you. It takes some of the effort involved and handles things for you.

    Now, you can create up to 256 different tiers, and the current version allows you to define tiers based on RAID levels, the speed of a drive and the drive type. It should be noted that the tier definitions will differ when using dynamic or static tiering. Currently disk size and rotational speed are not considered when you create a dynamic tier, so a dynamic tier may contain disks of differing performance characteristics, but a tweet from Barry Burke stated that FAST is actually aware of the RPMs, and knows the latency impacts of contention and utilization. Or at least “it will be in the future”.

    Now, you can create a policy for a storage group, which is basically a group of disks that are managed as a set, and have that policy associate a storage group with up to three tiers, depending on the tiers you actually have in place. Now, combine that with setting limits for the percentage of capacity that is on a single tier and you will see that you could for example say that you want 80% of your capacity to reside on SATA disks and 20% on EFDs.

    FAST will now apply your policy and, depending on the choice you made, automatically move the data around across those tiers or give you a recommendation on what would be a sensible choice. It can even relocate your data to a different RAID type on the other tier, and your SymmDev ID, your WWN, your SCSI ID and all external LUN references will remain unchanged during the move. If you have replication set up, that stays active as well.

    Now, since this is all stuff that might have a performance impact if done during peak loads on your box, the default is that all the moves are performed as lowest priority jobs during time slots or windows that you as the end-user can define. Just keep in mind that you are limited to 32 concurrent moves and to a maximum of 200 moves per day.
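    To make the policy limits a bit more tangible, here is a small Python sketch. It is purely a toy model of the numbers mentioned above (up to three tiers per policy, 32 concurrent moves, 200 moves per day) and has nothing to do with how FAST is actually configured or implemented. The class and function names are hypothetical, and the rule that the limits must add up to at least 100% is my own assumption.

```python
from dataclasses import dataclass

MAX_TIERS_PER_POLICY = 3   # a policy associates a storage group with up to three tiers
MAX_CONCURRENT_MOVES = 32  # limit on concurrent moves
MAX_MOVES_PER_DAY = 200    # limit on moves per day


@dataclass
class Policy:
    storage_group: str
    tier_limits: dict  # tier name -> maximum percentage of capacity on that tier

    def validate(self):
        if not 1 <= len(self.tier_limits) <= MAX_TIERS_PER_POLICY:
            raise ValueError("a policy needs between one and three tiers")
        for tier, pct in self.tier_limits.items():
            if not 0 < pct <= 100:
                raise ValueError(f"limit for {tier} must be between 1 and 100")
        # Assumption: the limits have to leave room for all of the capacity.
        if sum(self.tier_limits.values()) < 100:
            raise ValueError("tier limits must add up to at least 100%")


def plan_moves(candidates, moves_done_today=0):
    """Split candidate moves into batches that respect both limits."""
    budget = max(0, MAX_MOVES_PER_DAY - moves_done_today)
    allowed = candidates[:budget]
    return [
        allowed[i:i + MAX_CONCURRENT_MOVES]
        for i in range(0, len(allowed), MAX_CONCURRENT_MOVES)
    ]


policy = Policy("sap_prod", {"SATA": 80, "EFD": 20})
policy.validate()
batches = plan_moves([f"lun{n}" for n in range(250)])
print(len(batches), sum(len(b) for b in batches))
```

    With 250 candidate moves, only the first 200 fit in a day, and they end up in batches of at most 32.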

  • What will it cost me?

  • Prices start at $5,000 USD for the entry-level systems, and will set you back $22,000 USD for the Symmetrix V-Max. But that is the starting price, and the unbundled price. You could also consider a bundle called the “Symmetrix FAST Suite” that includes the optimizer and priority control & cache partitioning. All to be delivered as a “firmware upgrade” for your array.

  • So do we need to wait for FAST v2?

  • Well, I’ve got mixed feelings on that point. I can see how this first version can add some value to your environment, but that will depend on your environment. People who only use one tier might not have as much value, and adding the cost of new disks in to the equation will not make it any easier. Especially when we take the release of FAST v2 into account that is “planned for GA in 2nd Half 2010” and will also provide support for thinly or virtual provisioned LUNs and be able to move stuff around at the block level.

    I know there is value in this release for some customers that are actually using the V-Max. The automated tiering can at least help you meet a certain service level, but that added value is highly dependent on your environment. Personally, I’d probably wait for the release of version 2 if possible. On the other hand, EMC needs to gain traction first and they were always open about the fact that they would release two versions of FAST, and stated that version 1 would not have all the features they wanted, and that the rest of the features were planned for version 2. I have somewhat of a hard time with some of the analysts who are now complaining that FAST v1 is exactly what EMC said it would be. Did they just ignore previous statements?

  • To sum it all up

  • It’s the same story as usual. Every storage vendor seems to agree on the fact that automated storage tiering is a good thing for their customers. Some have different opinions whether or not the array should be the key in this automation, because you are at risk of the array making a wrong decision.

    EMC started off their journey with some steps towards automated tiering, but they delivered just that, the first steps toward an automated tiering vision. If we took the price tag out of the equation, I am almost positive I’d recommend version 2 to any possible customers. For version 1, the answer is not that simple. You need to check your environment and see if this feature makes sense for you, or adds value to your setup.

    Besides FAST we’ve also seen some cool new features being introduced with the new “firmwares” that were released for the various arrays, such as thin zero reclaim and dedupe enhancements. Look for coming posts that will go into more detail on the new FLARE 29 and Enginuity 5874.

    EMC, FAST, Storage

    EMC announced its Fully Automated Storage Tiering (FAST) – Links

    Ok, so the day before yesterday (December 8th to be exact) EMC launched a new feature for three of their storage arrays, the Symmetrix V-Max, the Clariion CX4 and the Celerra NS. It’s a feature called FAST which stands for “Fully Automated Storage Tiering”, and is basically a fancy way of saying that they will move your data to slower or faster disks based on your requirements. With this first version of FAST you still need to manually set up some values for this movement, but EMC is already working on the next version called FAST v2 that will automate this movement.

    According to Network Computing, “prices start at $5,000 for entry-level systems and $22,000 for Symmetrix”.

    Now, right now I’m stuck in whitepapers, blogs and tweets on the subject, and I plan to write a longer posting on FAST and how it works. For now I’ll just try to make this a collection of links with information on FAST by both EMC and other sources. This is just a quick reference for me, but you might also find it useful. So here goes.

    EMC and EMC employees:

    That’s just a start with official info. Now for some links which are not that official but worth a read:

    Since we also need some documentation and this is sort of mixed here’s a link for starters:

    I’m sure there will be more and I will try to update this post when I find new info. You can also simply send me a tweet and I’ll make sure to add any links on this page.