I was surprised by the response to last month’s article introducing a practical methodology for query optimization. We received more complimentary emails about it than about any other article we have published, and we have never gained so many new subscribers as after the June issue was forwarded to friends and colleagues by our current readers.
Needless to say, we are going to continue exploring this same topic. This month we will examine covering indexes and how they can be used to optimize a stubborn query.
Our Example
Again we will use the AdventureWorks sample database from Microsoft. You can download the database from here if you do not already have it. If you have not already done so, we strongly recommend reading last month’s article before beginning this one.
The Covering Index
The query we’ve chosen to optimize gives us the opportunity to explore a special kind of index, the covering index. A covering index is one that contains all the columns referenced in the query for a single table. We will demonstrate just how useful this can be.
Here is our example query:
- SET STATISTICS IO ON;
- GO
- USE AdventureWorks;
- SELECT ss.Name, COUNT(*) AS OrderCount, SUM(sd.LineTotal) AS TotalPurchases
- FROM Sales.SalesOrderDetail sd
- JOIN Sales.SalesOrderHeader sh ON sd.SalesOrderID = sh.SalesOrderID
- JOIN Sales.Customer sc ON sh.CustomerID = sc.CustomerID
- JOIN Production.Product p ON sd.ProductID = p.ProductID
- LEFT JOIN Sales.Store ss ON sc.CustomerID = ss.CustomerID
- WHERE ss.Name IS NOT NULL
- -- AND sd.SpecialOfferID = 13
- GROUP BY ss.Name
- ORDER BY SUM(sd.LineTotal) DESC;
Don’t worry about the commented line in this query. Leave it commented out for now. We will uncomment it in a later example.
This query joins several tables, but almost all of the cost is in the full scan of the largest table, SalesOrderDetail, at a cost of 1175 page reads. The query returns 633 rows.
(We find this information in the Messages tab of the query window in Management Studio after setting STATISTICS IO ON. If you are unfamiliar with this setting, you might want to look at last month’s article before proceeding further.)
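With the setting on, the Messages tab shows one line of I/O counters per table. The output looks something like the line below; the exact counters vary with your SQL Server version and cache state, so treat this as an illustrative sketch rather than what you will see verbatim:
- Table 'SalesOrderDetail'. Scan count 1, logical reads 1175, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.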
When we examine the query, there seems to be no way to reduce the cost of the table scan through indexing. In this query, SalesOrderDetail is joined on the SalesOrderID and ProductID columns, but those columns are already indexed. The only other place an index might improve things is on a column in the WHERE clause, but there are no search arguments in the WHERE clause that apply to this table; the only search argument is on the Sales.Store table. There is no obvious column we can index to improve performance on SalesOrderDetail.
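(If you want to confirm for yourself which indexes already exist on the table, sp_helpindex gives a quick listing. This is just a convenience check, not part of the optimization itself.)
- EXEC sp_helpindex 'Sales.SalesOrderDetail';
- GO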
But notice that there are only three columns from SalesOrderDetail referenced in this query. They are SalesOrderID, ProductID and LineTotal. If we create an index that includes all those columns it is said to “cover the query”. That means that the query engine can get all the data it needs from the index without touching the table at all. This can lead to performance improvements that range from moderate to absolutely amazing.
- CREATE NONCLUSTERED INDEX nc_cover ON Sales.SalesOrderDetail (ProductID) INCLUDE (LineTotal);
Notice that we use the INCLUDE clause to add the LineTotal column to this index. INCLUDE is a feature of the CREATE INDEX statement introduced in SQL Server 2005. It allows us to include non-key columns without incurring the substantial cost of adding more key values to the index. Columns that have no value for filtering, joining, or ordering results can be added to the index at lower cost than naming them as keys in the traditional CREATE INDEX syntax.
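To make the distinction concrete, here is a sketch of the traditional alternative; the index name nc_key_demo is invented for illustration, and you would not want both indexes on a real table:
- -- Hypothetical contrast: LineTotal as a key column is sorted and carried
- -- at every level of the index tree, making the index larger to store and
- -- more expensive to maintain.
- CREATE NONCLUSTERED INDEX nc_key_demo ON Sales.SalesOrderDetail (ProductID, LineTotal);
- -- In nc_cover above, LineTotal lives only at the leaf level instead.
Since this query never filters or sorts on LineTotal, the INCLUDE form gives us coverage without paying for key ordering we would never use.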
You might also notice that we have not mentioned SalesOrderID in this CREATE INDEX statement. How can this be a covering index if we leave out a column that is used in the query? The answer is that we have not left it out.
In a table having a clustered index, all nonclustered indexes include the clustering key. SalesOrderID is in the clustering key, therefore it is automatically added to any nonclustered index we create. There is no need to explicitly name it.
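One quick way to see this for yourself, offered as a sketch rather than a required step: force the query engine onto the new index with a table hint and examine the execution plan.
- SELECT SalesOrderID, ProductID, LineTotal
- FROM Sales.SalesOrderDetail WITH (INDEX (nc_cover));
The plan shows a scan of nc_cover with no Key Lookup operator, even though SalesOrderID never appears in our CREATE INDEX statement.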
When we create this index, the cost of accessing SalesOrderDetail drops from 1175 data pages to 486 pages. This is not an earthshaking improvement, but it does illustrate the principle we are discussing.
In this case there was no filter available to limit the amount of data read from SalesOrderDetail so the optimizer was forced to scan the entire index. But this index contains far fewer pages than the entire table so we got a moderate performance benefit from our new index. In other cases we might see much more impressive results.
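You can compare the sizes directly. The sketch below uses sys.dm_db_index_physical_stats, available from SQL Server 2005 (it requires VIEW DATABASE STATE permission), to list the page count of each index on the table, including the clustered index that is the table itself:
- SELECT i.name AS index_name, ps.page_count
- FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('Sales.SalesOrderDetail'), NULL, NULL, 'LIMITED') AS ps
- JOIN sys.indexes AS i ON i.object_id = ps.object_id AND i.index_id = ps.index_id;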
Let’s look at another aspect of this problem. Uncomment the line that filters on SpecialOfferID and execute the query again. This time it returns only 101 rows because of the filter. We might expect that the query would be less expensive. However, we would be wrong. It reads 1240 pages from SalesOrderDetail.
The reason is that the covering index does not include SpecialOfferID and therefore no longer covers the query, so the optimizer ignores it. The additional logical reads come from the way the new execution plan must aggregate the order count and the sum of LineTotal.
There are two practical solutions to this problem. We could create an index on SpecialOfferID or we could add SpecialOfferID to the covering index.
In the real world we would probably create the new index on SpecialOfferID and drop the covering index. However, for purposes of this article we are going to drop our covering index and then re-create it, adding SpecialOfferID to it.
- DROP INDEX nc_cover ON Sales.SalesOrderDetail;
- GO
- CREATE NONCLUSTERED INDEX nc_cover ON Sales.SalesOrderDetail (SpecialOfferID, ProductID) INCLUDE (LineTotal);
- GO
When we add this column, we make it the first column in the index because SpecialOfferID is the column this query filters on, and that filter is very selective. If we put it in the second position, the engine would have to navigate the index by ProductID first; since ProductID is not a filter in this query, no efficient seek on SpecialOfferID would be possible.
As a general rule, you want to make the most selective search argument the first column in a multi-column index.
When we execute the query again, we see that it takes only 6 page reads on SalesOrderDetail. A couple of things have changed to cause this dramatic reduction in page reads.
1. We now have an indexed search argument that has turned the full index scan into a very efficient index seek operation.
2. The index contains all the columns required by the query so there are no additional page reads necessary to retrieve the row data from the table.
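For reference, the simpler real-world alternative mentioned above, a plain index on SpecialOfferID without the covering columns, would look something like this sketch (the index name is ours):
- CREATE NONCLUSTERED INDEX nc_specialofferid ON Sales.SalesOrderDetail (SpecialOfferID);
That index narrows the candidate rows quickly, but the engine must still fetch LineTotal from the base table, so it cannot match the 6-page figure the covering index achieves.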
In the Real World
There are significant differences between how we proceeded here and how we would handle this particular query in the real world. We have contrived an example in a sample database with small tables, and it is difficult to manufacture a serious performance problem in such a database. However, it illustrates the concepts we are discussing.
In the real world, we probably would not be optimizing this query. The tables are small and execution is fast. We certainly would not use a four-column covering index to solve a problem this insignificant.
When optimizing, it is necessary to make a judgment about the importance of optimizing the query in relation to the cost of maintaining a relatively large index. This is especially true of covering indexes.
Make sure the cost of the solution does not outweigh the benefits of optimization. Covering indexes can work magic sometimes but you must use them with restraint.
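One way to put numbers behind that judgment, sketched here for SQL Server 2005 and later: the sys.dm_db_index_usage_stats view counts reads and writes per index since the server last started.
- SELECT i.name AS index_name, s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
- FROM sys.dm_db_index_usage_stats AS s
- JOIN sys.indexes AS i ON i.object_id = s.object_id AND i.index_id = s.index_id
- WHERE s.database_id = DB_ID() AND s.object_id = OBJECT_ID('Sales.SalesOrderDetail');
An index that accumulates many user_updates but few seeks or scans is costing more to maintain than it gives back.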
Covering Indexes and the Database Tuning Advisor
The worst problem with the Database Engine Tuning Advisor (DTA) in SQL Server is that it cannot exercise restraint. The second worst problem is that it has only a rudimentary ability to identify which indexes are actually necessary. It makes up for these shortcomings by suggesting a huge number of covering indexes with a ridiculous number of columns in the INCLUDE clause. Often these very expensive indexes are created to optimize trivial queries that need no optimization.
In an environment where those responsible for the database know nothing about indexes, the DTA can be useful. But it is not a tool that should be used by a database professional.