Too Much Available Memory Can Spoil Performance
I like to think I know a lot about SQL Server, but every now and then something comes along and knocks that silly notion out of my head.
One of my long-time clients called me up the other day. CPU usage had gone through the roof and was now hovering near 100%. Performance was awful.
This particular server has computing resources to burn: several dozen fast processors and a full terabyte of memory. Normally, their server can handle 4000 concurrent connections without even breathing hard.
I jumped on my client’s server to find the cause of the high CPU. I checked all the usual suspects but came up with nothing. It took a bunch of research and finally a call to Microsoft to find out what was going on.
The problem turned out to be too much memory. I have to admit that I did not know there could be such a thing until I read this Microsoft white paper.
How Can a Terabyte of Memory Hurt You?
The amount of available memory is an important input to execution plan creation. Having a terabyte or more of memory available to SQL Server is a relatively new phenomenon in the Windows world, and the optimizer's default behavior does not handle it well.
The default optimizer algorithms create efficient execution plans when a normal amount of memory is available. When the amount of available memory is huge, those same algorithms can lead the optimizer to make very bad choices.
When the optimizer is facing terabytes of available memory, it acts like an unemployed dishwasher who has just won the lottery. Dizzied by the riches, it spends memory foolishly. This unwise behavior can result in very slow performance and high CPU use in some scenarios.
There are a couple of options to fix this. The first workaround Microsoft suggests is to lower the Max Server Memory allocation. But why have all that memory if SQL Server can’t use it? I don’t consider this a viable workaround.
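(For the record, capping memory is a quick change if you do want to try it. Something like the following would do it; the 512 GB figure below is only an illustration, not a recommendation.)

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    -- Cap SQL Server's memory; 524288 MB (512 GB) is just an example value
    EXEC sp_configure 'max server memory (MB)', 524288;
    RECONFIGURE;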
The other, more palatable option is to run SQL Server with trace flag 2335 (T2335) enabled. This trace flag makes the optimizer assume a much smaller amount of available memory when it compiles plans. That, in turn, results in better plans and better performance in an environment with huge amounts of memory.
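If you are not familiar with trace flags, here is a rough sketch of the usual ways to turn one on. The DBCC approach takes effect immediately but does not survive a restart; adding -T2335 as a startup parameter makes it stick across restarts, and the QUERYTRACEON query hint can scope the flag to a single statement.

    DBCC TRACEON (2335, -1);   -- the -1 enables the flag globally, for all sessions
    DBCC TRACESTATUS (2335);   -- confirm the flag is active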
Implementing this trace flag in my client’s environment dropped CPU usage from the high nineties to between 20 and 30 percent. Good performance was restored.
It is still a mystery why this cropped up so suddenly. The terabyte of memory had been installed almost a year earlier. According to the developer and the DBA, nothing had been changed recently and there had been no recent increase in the load on the server.
It would have been nice to find out how this problem was triggered, but we could not identify the cause with certainty. My best guess is that a statistics refresh might have caused a recompilation of some procedures and the resulting execution plans were affected by this bug in the optimizer.
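If you ever want to test a hunch like that, the last-updated dates on statistics are easy to check. A minimal sketch, run in the affected database:

    -- When was each statistics object last updated? Dates near the time of
    -- the slowdown would support the recompilation theory.
    SELECT OBJECT_NAME(s.object_id) AS table_name,
           s.name AS statistics_name,
           STATS_DATE(s.object_id, s.stats_id) AS last_updated
    FROM sys.stats AS s
    WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
    ORDER BY last_updated DESC;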
If you are having this same problem, be sure to test this trace flag thoroughly before applying it in a production environment*.
*Full Disclosure
I feel like such a hypocrite when I write that. Who am I kidding? We were so anxious to get 4000 users back to work that the trace flag was implemented in production immediately. Who has a terabyte of RAM in their test environment anyway?