
The RAM per CPU wall

There is one thing that keeps coming up lately in our office. We install a lot of systems virtually, and have over 8000 virtual servers installed.

Now, as anyone who has installed SAP will probably tell you, our software can do a lot of things, but one thing it really swallows up is RAM. We seem to be doing a bit better on the central instance side, but have moved a lot of the “RAM hunger” elsewhere. This is especially true for our new CRM 7.0, where the application servers need a lot more memory compared to older versions.

Now, these kinds of application servers are ideal candidates for virtual machines. With the new vSphere release you can actually give a VM up to 255 GB of RAM, and for larger CRM implementations you might actually find yourself approaching that limit for your application servers. No problem at all, but you will face an entirely new problem, one that has been creeping up on us and that has left most people unaware.

Since there’s no real name for it, I just call it the “RAM per CPU wall”, or perhaps even the “RAM per core wall”.

So, what about this wall?

Well, it used to be that computational power was quite expensive, and the same could be said about storage. Times have changed, and right now we see that processing power is not that expensive anymore. We have seen a slow-moving shift toward the x86 instruction set and its x64 extension. With the development of the multi-core processor we are seeing applications being optimized to run in parallel on these chips, and with solutions like vSphere we are using more and more of those features. Examples in vSphere would be “Hyperthreaded Core Sharing” or a setting like “cpuid.coresPerSocket”.
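
To make that last setting a bit more tangible, here is a minimal Python sketch of the arithmetic behind it: cpuid.coresPerSocket simply splits a VM’s vCPUs into virtual sockets. The helper name and the 8-vCPU example are illustrative assumptions on my part, not VMware tooling; the vSphere documentation is the authoritative reference.

```python
# Illustrative only: how the "cpuid.coresPerSocket" advanced setting maps a
# VM's vCPU count onto virtual sockets. The helper and numbers are made up
# for this example; consult the vSphere docs for the real behaviour.

def vmx_core_setting(num_vcpus: int, cores_per_socket: int) -> str:
    """Return the .vmx-style line for the topology and describe it."""
    if num_vcpus % cores_per_socket != 0:
        raise ValueError("vCPU count must be a multiple of cores per socket")
    sockets = num_vcpus // cores_per_socket
    print(f"{num_vcpus} vCPUs -> {sockets} virtual socket(s) "
          f"of {cores_per_socket} core(s) each")
    return f'cpuid.coresPerSocket = "{cores_per_socket}"'

# Example: present an 8-vCPU VM as 2 sockets with 4 cores each.
print(vmx_core_setting(8, 4))
```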

It’s great that we can tweak how our VM platform and our VMs handle multiple cores, and I am certain that we will be seeing a lot more options in the future. Why am I so sure of that? Because we will be seeing octo-core processors in the not too distant future (I’m betting somewhere around the end of Q2 2010). Just think about it: 8 cores on one CPU.

Now, combine that with an educated guess that we will be seeing 8-socket mainboards and servers, and we are talking about something like the HP DL780 or DL785 with a total of 64 cores. Quite the number, right? Now just imagine the amount of RAM you can put in such a server.

Right?

The current HP DL785 G5 has a total of 32 cores and a limit of 512 GB of RAM, assuming you are willing to pay the exorbitant memory price tag for such a configuration. I’m assuming support for 128 GB RAM modules will take a while longer, so with 64 cores and 512 GB of RAM you end up with 8 GB of RAM per CPU core.
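
For what it’s worth, here is the back-of-the-envelope arithmetic behind that 8 GB figure as a small sketch, using only the numbers from this post:

```python
# Back-of-the-envelope RAM-per-core figures, using the numbers from the post.

configs = [
    ("HP DL785 G5 today", 32, 512),                 # 32 cores, 512 GB
    ("Hypothetical 8-socket octo-core", 64, 512),   # 64 cores, still 512 GB
]

for name, cores, ram_gb in configs:
    print(f"{name}: {cores} cores, {ram_gb} GB -> {ram_gb / cores:.0f} GB of RAM per core")
```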

It might just be my point of view, but that’s not that much per core.

Now I can hear you saying that for smaller systems 8 GB of RAM per core is just fine, and I hear you loud and clear. But that’s not the typical configuration used in larger environments. If you take the CRM example I gave before, you can load up a bunch of smaller application servers and set your hopes on overcommitting, but that will only get you so far.
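
To see how quickly you hit that wall, here is a rough Python sketch of packing large application-server VMs onto such a host. The VM size and the overcommit ratio are illustrative assumptions on my part, not SAP sizing guidance.

```python
# Rough sketch: memory, not CPU, becomes the packing limit for big VMs.
# All numbers are illustrative assumptions, not sizing recommendations.

host_cores = 64
host_ram_gb = 512
mem_overcommit = 1.25      # assume a modest memory overcommit

vm_vcpus = 4
vm_ram_gb = 64             # a fairly large CRM application server instance

vms_by_cpu = host_cores // vm_vcpus
vms_by_ram = int(host_ram_gb * mem_overcommit) // vm_ram_gb

print(f"CPU would allow {vms_by_cpu} VMs, RAM (with overcommit) allows {vms_by_ram} VMs")
print(f"Effective limit: {min(vms_by_cpu, vms_by_ram)} VMs -> RAM hits the wall first")
```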

And the trend is not changing anytime soon. We will be seeing more cores per CPU, while the prices and capacity limits of large amounts of RAM are not keeping up, and I doubt they will in the future.

So, will we continue to see larger systems being set up on physical hardware? Actually, I honestly think so. For service providers and large environments, things need to change if they want to get a big benefit out of provisioning larger machines. If things don’t change, they are facing the “RAM per CPU wall” and will be stuck provisioning smaller instances.

All in all I can only recommend digging down and asking your customers for their requirements. Check the sizing recommendations and try to get some hands-on information, or talk to people who have implemented these solutions to see how the application handles memory. Make the decision on a per-use-case basis for larger environments. Virtualization is great, but make sure that the recommendation and implementation suit the needs of your customer, even if that means installing on a physical platform.

6 thoughts on “The RAM per CPU wall”

  1. Hi,

    actually, I don’t think servers will get larger and larger. Power draw and density are already a problem in many datacenters. With increasing core counts, optimizations at the gate level will not scale as fast as the number of cores does.
    More importantly, newer systems have a lower “getting-things-done per Watt” ratio than specialized “wimpy nodes” (e.g. project FAWN). With roughly 8 times more output at the same power draw, there is a cost argument for running such lower-powered, more effective systems (again, the datacenter being the constraint, etc.).
    James Hamilton has published several highly recommended blog posts with all of this in mind: http://perspectives.mvdirona.com/2009/11/30/2010TheYearOfMicroSliceServers.aspx

    If all this is true, it has a lot of interesting implications. Our software landscape has to change fundamentally, and so does the networking layer. Software needs to become more distributed, concurrent and parallelized, it has to treat failures as the norm rather than a special case, etc.
    [I will stop here].

    Looking forward to your replies,
    Martin

  2. Martin,

    let me start by thanking you for the reply and a different perspective on all of this.

    I agree that power draw and density are major problems (and let’s not forget cooling). On the other hand, we have seen a big dip in power consumption on a per-core basis. Newer CPUs have features such as Cool’n’Quiet, but with consolidation and virtualization initiatives these have become largely useless, since we keep overcommitting until our CPUs are near the 80% utilization mark (sometimes even over).

    Basic problem: As soon as we provide more power we use it. Technologies like virtual machines or hardware/software partitioning (zones and such) only make the problem worse.

    And yes, wimpy nodes are known to me, and they can work great. Take the example of the US Army buying PS3s for their Cell architecture and implementing special software for a custom purpose: a great idea and a great implementation, but not generic enough for most use cases. Custom development would drive costs up in a major way.

    And I again agree when you say that there are fundamental changes we need to make, but this is a general performance/utilization problem where the technology provided simply does not grow as fast as the demand.

    Also looking forward to your response,
    Bas

    1. Hello Bas,

      [don’t know how to quote things here, so I will just use the plain old method]

      I think so, too: as soon as you provide more of resource X to spend, a need will be generated to use X up completely.
      You are also right that CPU (power) efficiency is getting better and better. But as soon as we pack more cores into a box, we still end up with a higher and higher power and compute density. To actually make use of all this (assuming we can, which we currently cannot), we need yet-to-be-written software that will make use of the incredible 500k IOPS PCIe SSDs.
      My understanding is that the more we increase the core count, the more we will see that we simply cannot make use of it anyway, and that we need a fundamental change in our approach.

      There was a famous paper recently (about the Barrelfish OS) that basically stated that with a 16+ core system, shared memory will not scale any more. Modelling the hardware as distinct CPU entities that communicate by passing messages over a shared bus needs fewer CPU cycles for inter-core communication than using shared memory at all. This is quite an important and also shocking result, as it shows our scale-up-by-increasing-core-counts approach is doomed.
      Basically, we can either choose to waste more and more cycles as we get more and more cores with newer generations of CPUs (this is somewhat like an inversion of Moore’s law!), or we start solving the problem by scaling our systems down and distributing the load across more effective low-power entities (a toy sketch of this contention effect follows at the end of this comment).

      Regarding the “wimpy nodes” approach: correct, not all of the systems currently being tested will be able to run, say, Windows 7. But there is some hope: ARM is said to be building an Atom equivalent with a hardware virtualization instruction set, paired with at least double the performance and a power draw of <5 W for a 2-core system. There are rumours of Dell shipping these systems to some of their large-scale hosting customers (guess you read this at El Reg, too) who run standard LAMP VMs on them.
      This might be a viable approach for the future. From a space and power-consumption perspective these systems are basically an intelligent hard drive: powerful enough for the task, and that’s it. It is quite a neat approach and quite a fundamental one.

      To sum it up: I think we have a software and software-design crisis more than a hardware crisis. We should seriously start thinking about scaling down our per-node performance and scaling up our networking gear accordingly.
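
      To make the contention point above a bit more concrete, here is a toy Python sketch (only an analogy, not a reproduction of the Barrelfish measurements; the worker and iteration counts are arbitrary): several workers hammering one lock-protected shared counter versus each worker keeping a private count that is merged at the end.

      ```python
      # Toy contrast: one shared, lock-protected counter vs. per-worker
      # partial counts that are merged at the end (a crude stand-in for
      # message passing). Numbers are arbitrary; this is not a benchmark.
      import time
      from multiprocessing import Process, Queue, Value

      ITERATIONS = 100_000
      WORKERS = 4

      def shared_worker(counter):
          # Every increment takes the shared counter's lock, so all workers
          # contend for the same synchronization point.
          for _ in range(ITERATIONS):
              with counter.get_lock():
                  counter.value += 1

      def partitioned_worker(queue):
          # Work stays local; only the final partial result is communicated.
          local = 0
          for _ in range(ITERATIONS):
              local += 1
          queue.put(local)

      def timed_run(target, args):
          procs = [Process(target=target, args=args) for _ in range(WORKERS)]
          start = time.perf_counter()
          for p in procs:
              p.start()
          for p in procs:
              p.join()
          return time.perf_counter() - start

      if __name__ == "__main__":
          counter = Value("i", 0)
          t_shared = timed_run(shared_worker, (counter,))
          print(f"shared counter  : total={counter.value}  {t_shared:.2f}s")

          queue = Queue()
          t_parts = timed_run(partitioned_worker, (queue,))
          total = sum(queue.get() for _ in range(WORKERS))
          print(f"partial counters: total={total}  {t_parts:.2f}s")
      ```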
