The most difficult part of publishing a monthly newsletter is to find a new and interesting subject to write about every month. While browsing the archive of all my monthly articles searching for inspiration, I was surprised to find that I have never written an article about database statistics. The realization stunned me because data statistics are the basis for all query optimization.
What Are Statistics and Why Do I Need Them?
Statistics are the foundation that indexes are built upon. Without accurate statistics, indexes simply don’t work and good execution plans will not be created.
Statistics, as you might expect, are numbers. These particular numbers describe the distribution of data in a column of a table or indexed view in a form that the optimizer can use. There is a whole lot more to it than that of course, but if you are only concerned about using stats optimally, we don’t need to go into the details.
Statistics (hereafter called stats) are used by the query optimizer to find an acceptable execution plan for queries referencing specific columns. With good stats, the optimizer can make informed decisions about these sorts of things:
- In what order should I join the tables?
- What kind of join should be used?
- What index should I use?
- Should I use a seek or a scan of that index?
- And countless other decisions
Correct decisions in these areas are dependent on accurate stats. Good stats can make the difference between a query that runs in less than a second or one that takes an hour to complete. Really!
Where do Stats Come From?
When you create an index, a set of statistics for that index is also created. It will have the same name as the index. Stats created in this way are called, not surprisingly, Index stats to differentiate them from column stats.
Column stats can be created automatically or manually. Automatic creation depends on the database setting Auto Create Statistics being enabled. This is the default and in most cases it should remain enabled.
With Auto Create Statistics enabled, when you query a column that is not a leading column in an index, SQL Server will create stats on that column. For example:
- SELECT * from tblPresidents WHERE MiddleName = ‘Delano’
Assuming that MiddleName is not the first column of any existing index, SQL Server will create stats for that column. This will allow the optimizer to (hopefully) come up with a reasonably good execution plan.
These automatically created stats will have a name of this form: _WA_00003_xxxxx
Not that it matters much but the components of the name are WA for the state of Washington where the original developers of SQL Server lived and worked, then a number that corresponds to the column id. The last bit is the hash of the object id of the table.
Manual Creation
Even if you have Auto Create Statistics enabled, in some circumstances you may still want to manually create stats for a column or set of columns. You can do that using the CREATE STATISTICS command. You can even create stats with a WHERE clause in the definition. Filtered stats can be useful with very large tables, but the details of creating filtered stats are not covered here.
Remember though, when we get to the update of stats, that multi column stats or filtered stats are not automatically updated. Updating has to be done manually.
Updating Stats: Why and How
When data changes on a table, indexes are updated automatically. However the accompanying stats are not. As more and more data changes, the stats get less and less accurate. The quality of the execution plans created from stale stats is poor. This is one reason why it is a very good idea to keep the database option Auto Update Statistics in the default state – Enabled.
Auto Update
With Auto Update Statistics enabled, SQL Server will monitor the data modifications in the column data. When the number of modifications reaches a specific point (500+20% of the number of rows in the table) an automatic update will occur. Automatic update does not initiate a full scan. It samples a percentage of the rows, reading only some internally calculated fraction of the rows. Normally, this is accurate enough and it doesn’t have as big a performance impact as a full scan. If you feel you need a full scan, you will have to schedule a job to do that. You may well have to do that for tables where the data is highly skewed or highly volatile.
Some of you, especially those with billion row tables to manage, might find that the default threshold for a stats update is way too high. If so, you can create your own stats update job and schedule it as necessary. In SQL 2012 there is also a trace flag (2371 ) that will reduce this threshold. This flag was created specifically for huge tables. I have no experience with this trace flag but you might want to try it
After an automatic update, execution plans that were compiled using the old stats are invalidated and will be recompiled at first use.
Index Rebuild
Whenever an index or all indexes on the table are rebuilt, the stats are updated with a full scan and all plans that use those stats are invalidated. However, if the indexes are only reorganized, no stats update occurs and no plans are invalidated.
The message here is that you don’t have to put both an index rebuild task and an update stats with full scan in the same maintenance job (I see this done all the time).
Sp_updatestats
The problem with using this procedure is that it only updates stats that it thinks are stale, no matter what you want it to do. I don’t use it.
What about Auto Update Statistics Asynchronously?
You may want to implement this variation on Auto Update Statistics depending on the specifics of your SQL installation. It is not enabled by default but you will find the setting next to Auto Update Statistics in the options tab of the database properties.
The normal update stats process goes like this: When a query is executed the optimizer examines the condition of the underlying stats. If they need to be updated, the sql engine updates them and creates an execution plan based on the fresh stats. It invalidates any other plans that use the old statistics. Then it executes the query using the new stats and a new execution plan.
It can take a significant amount of time to do all this and the delay may annoy users. To avoid this, the asynchronous option executes the query right away, using the old stats and the old execution plan. It then updates the stats and invalidates all previous plans that used them. The tradeoff is that the query can run without waiting but it will use the stale stats one last time. Based on the individual conditions of your database and applications, you can take your choice.
How Can I Tell if My Stats are OK?
For a quick look at a specific query, open Management Studio and open a query window. Click the button Actual Query Plan. After execution of the query open the query plan tab in the results pane. Hover over a node for a seek or a scan. Note the number of rows in Estimated Number of Rows (this number is derived from the stats) and compare it to the Actual Number of Rows. If the numbers are very different you probably have a problem with that set of stats. Since we are dealing with sampling and estimation, these numbers do not have to match exactly but they should be in the same ballpark.
For a wider view, you can run SQL Profiler to capture the execution plan of currently executing data. Start a trace with these events
- SQL:sqlbatch starting
- SQL:Sqlbatch completed
- RPC: starting
- RPC completed
- Performance:Showplan xml statistics profile (or showplan xml)
If you run this on a busy server it should give you a lot of query plans to search for discrepancies between actual and estimated data.
You can also query the stats metadata for a deeper look. Here are some examples, but a detailed explanation of the metadata views is not within the scope of this article.
- USE AdventureWorks2008R2;
- select * from sys.stats
- where object_id = 1938105945
- and stats_id = 3
Or you can try:
- exec sp_helpstats ‘humanresources.Employee’, ‘ALL’
There is another view of statistics in SQL Server 2012 SP1. If you are on this version, this DMF is the easiest way to find the current number of modifications made to the column since stats were last updated. This function takes the object_id of the table and the stats_id as parameters. It also returns other useful stats information.
- select * from sys.dm_db_stats_properties (898102240, 1)
If you are not on that version, then it is more difficult to get this information. Send me an email and I will send you a query that is too long to put in this newsletter. It will return the number of modifications as well as other stats data.
Conclusion
This article has been a high level look at stats. There is plenty more to know about the subject but this article was only intended to give a simple explanation of stats, what they are and how best you can use them to maintain good performance for your database. I hope we have given you some useful information.
In most cases, you will want to retain the default (Enabled) settings for Auto Create Statistics and Auto Update Statistics. There may be rare cases where you might want to change that, but only do that if you understand the implications of the change. Hopefully, this article has given you some background for making that decision.
You may run into situations where it would be good to create additional stats. If you do so, remember that manually created multi column stats are not automatically updated and filtered stats are not automatically updated as frequently as other column stats. Be prepared to create your own stats update jobs.