It’s not very common knowledge, but there is actually a link between the I/O performance you see on your server and the number of metas you configured when using SRDF/S.
I do a lot of stuff in our company and I tend to get pulled in to performance escalations. Usually because of the fact that I know my way around most modern operating systems, I know a bit about storage and about our applications and databases. Usually the problems all boil down to a common set of issues, and perhaps one day I will post a catalog of common performance troubleshooting tips here, but I wanted to use this post to write about something that was new to me and I thought it might be of use to you.
We have a customer with a large installation on Linux that was seeing performance issues in his average dialog response time. Now, for those who don’t know what a dialog response time is, it is the time it takes an SAP system to display a screen of information, process any data entered or requested there by the database and output the next screen with the requested information. It doesn’t include any time needed for network traffic of the time taken up by the front-end systems.
The strange thing was that the database reported fairly good response times, an excellent cache hit ratio but also reported that any waits were produced by the disks it used. When we looked at the Symmetrix box behind it we could not see any heavy usage on the disks, and it reported to be mostly “picking it’s nose”.
After a long time we got the suggestion that perhaps the SRDF/S mirroring was to blame for this delay. We decided to change to an RDF mode called “Adaptive Copy Write Pending” or ACWP and did indeed see a performance improvement, even though the database and storage box didn’t seem to show the same improvement that was seen in the dialog response time.
Then, someone asked a fairly simple questions:
“How many meta members do you use for your LUNs?”
Now, the first thought with a question like that is usually along the line of the number of spindles, short stroking and similar stuff. Until he said that the number of meta members also influences the performance when using SRDF/S. And that’s where it get’s interesting and I’m going to try and explain why this is so interesting.
To do that let’s first take a closer look at how SRDF works. SRDF/S usually gives you longer write response times. This because you write to the first storage box, copy everything over to the second box, receive an acknowledge from the second box and then respond back to say that the write was ok. You have to take things like propagation delay and RDF write times into account.
Now, you also need to consider that when you are using the synchronous mode, you can only have 1 outstanding write I/O per hyper. That means that if your meta consists of 4 hyper volumes you get 4 outstanding write I/Os. If you create your meta out of more hyper volumes you also increase the maximum number of outstanding write I/Os or higher sustained write rates if your workload is spread evenly.
So, lets say for example you have a host that is doing 8 Kb write I/O’s to a meta consisting of 2 hypers. The Remote site is about 12 miles away and you have a write service time of 2 ms. Since you have a 1000 ms in one second each hyper can do roughly 500 IOPS since you would need to divide the 1000 ms by the servie time of 2 ms: 1000 ms/2 ms = 500
Now, with 2 hypers in your meta you would roughly have around 8 MB/sec:
2 (hypers) x 500 IOPS x 8 KB.
And you can also see that if we increase the number of hypers, we also increase the maximum value. This is mostly true for random writes, and the behavior will be slightly different for sequential loads since these use a stripe size of 960 KB. And don’t forget that this is a cache to cache value since we are talking about the data being transferred between the Symmetrixes. We won’t receive a write commit until we get a write acknowledge from the second storage array.
So, what we will be doing next are two things. We will be increasing the number of hypers for the metas that our customer is using. Besides that we will also be upgrading our Enginuity since we expect a slightly different caching behavior.
I’ll try to see if I can update this post when we changed the values just to give you a feel on the difference it made (or perhaps did not make) and I hope this information is useful for anyone facing similar problems.