When I examine a SQL Server environment, I very often find that the indexing seems to have been designed under false assumptions about how indexes actually work.
There is a lot of misinformation floating around the web, so I thought that I would address some of the most common myths I encounter.
Myth 1: The Primary Key index must be clustered and unique.
Truth: The index supporting the Primary Key must be unique, but not necessarily clustered.
Explanation: There are often much better places for the clustered indexes than on the PK. A unique integer column is not always the best place for the clustered index from a performance point of view.
Clustered indexes are very good at returning a large range of records efficiently. You don’t often (if ever) query for a range of arbitrary ID numbers. A non clustered index can return an identity value as efficiently as can a clustered index. Why waste your only clustered index on a column that can be served by a non clustered index?
While the optimal place for the clustered index is determined by how the table is accessed, we often find that a sequentially increasing date field is a good place for it. Range searches on dates are common ( where invoicedate between ‘November 1, 2010’ and ‘November 30, 2010’). And sequentially increasing date fields do not incur page splitting or rapid fragmentation.
Myth 2: Stay away from clustered indexes because they cause page splitting and fragmentation.
Truth: There is no way you can get proper performance from most database applications without using clustered indexes.
Explanation: It is true that in some cases clustered indexes will cause page splitting and rapid fragmentation but that is not a reason to avoid them altogether. Rather it is a reason to learn how to use them effectively without causing page-splitting and fragmentation. Here are a couple of tips:
Never place a clustered index on a column with a GUID data type. A GUID is a 16 byte randomly generated value. Its randomness will promote a high level of page splitting and fragmentation of the table. Since the clustered index value of a row becomes part of every nonclustered index created on the table, this 16 byte value added to every index row will bloat the size of your table and slow down performance.
Try to find a naturally sequentially-increasing column ( a date, for example) that is used to filter ranges of data. This can make an important performance difference in many queries.
If you cannot find a column or set of columns where the clustered index would provide a performance boost, place the Primary Key on an IDENTITY column using a clustered index. This will physically organize your table on disk without incurring page splits or rapid fragmentation.
Myth 3: Multiple Column indexes can be used to search on any of the fields contained in the index.
Truth: In a multi-column index, only the first column is truly indexed.
Explanation: If you take a minute and think about how a multi-column index must work, you will realize that the values in the additional columns are grouped by the order of the first column and are not sequential, To find values in the second column without having a filter on the first column you would have to scan the entire table.
For example, let’s assume this index exists on the ORDERDETAIL table:
- CREATE INDEX IX_TEST ON ORDERDETAIL(CUSTOMERID, PRODUCTID)
The following query can use this index to perform a very efficient seek:
- SELECT * FROM ORDERDETAIL
- WHERE CUSTOMERID = 1234
- AND PRODUCTID = 34
This query might use a few more page reads but it will also be executed efficiently with an index seek:
- SELECT * FROM ORDERDETAIL
- WHERE CUSTOMERID = 1234
However, the following query will cause a table scan.
- SELECT * FROM ORDERDETAIL
- WHERE PRODUCTID = 34
Even though PRODUCTID is included in the index, the index will not be used for searches on PRODUCTID alone.
This does not mean that there is no benefit in multi-column indexes. They are useful in many ways, particularly in multi-column joins and in queries that filter on multiple columns from the same table (as in our first example).
Multiply Selectivity Using Multi-Column Indexes
Multiple column indexes allow you to combine the selectivity of two or more columns.
Often a query will have several conditions in a WHERE clause, but none of them alone are selective enough for the query optimizer to use them. The result will be a scan of the table.
However, by using two or more non-selective columns in the index you might increase the selectivity of the index to the degree that it can be used effectively.
For example, your business process dictates that an order should not be closed before it is shipped. You have a process to search for orders that may have fallen through the cracks and not been shipped.
- SELECT ORDERID
- FROM ORDERS
- WHERE status = ‘closed’
- AND shipped = ‘no’
It is likely that you have very many closed orders in your database, and you might have very many unshipped items. Individual indexes on either of those columns may not be selective enough to be used by the optimizer.
However, it is not likely that you have a lot of closed orders that have not been shipped, since they would be exceptions to the business rule. An index combining both of these columns might be very selective and could turn a table scan into a fast and efficient index seek.
Covering Indexes
Covering indexes are a special type of multi-column index and they can sometimes work magic on a poorly performing query. Covering indexes contain every column referenced from a specific table by a specific query in any part of the query.
Since all necessary data is contained in the index, the query never has to access the table itself. This is a bit like having another clustered index on the table and the performance improvement can be amazing.
It is important to use restraint in creating covering indexes or you will over-index your table. Here are some criteria to consider:
- The query or queries to be optimized should be a serious enough performance problem to justify the existence of what might be a big index.
- The number and size of the columns required should be relatively small.
- The table should be relatively large.
Myth 4: Create Indexes in descending order if that is how they will be accessed.
Truth: The query engine can read an index backwards as easily as it can read it forward. There is seldom any reason to specify the order for a single column index.
Explanation: In multi-column indexes there are times when it might make sense to sort columns in different orders. For example if the results of queries are ordered or grouped on a set of ascending and descending columns. A clustered index on those columns using the same sort order might improve performance. However, that applies only to clustered indexes.
In my experience, it is extremely rare that you would use a clustered index to support the ORDER BY clause of a query which normally has a minimal cost. There are usually much better places for clustered indexes.
Conclusion
There are a lot of myths in the SQL world. We have looked at a few of the myths involving indexes that can result in performance problems on your SQL Server.
The web is a great source of information, but it can be difficult to sort out truth from myth. It is always a good idea to confirm any questionable bit of information you read on the web with other knowledgeable persons.