For the past few months in this newsletter we have been discussing performance optimization using plain vanilla SQL Server indexes. This month we will still be discussing indexes, but indexes of a new and different type. We will also discuss a form of data storage that has important implications for improving performance. All these features were introduced in SQL Server 2008 and are available in all editions.
Our aim in this article is to make you aware of these new features in SQL Server and to give you a real-world example of how they can be used effectively. We will not discuss the details of implementation of these new objects but we will give a link to a very thorough discussion of the implementation details as well as other important details.
Sparse Columns
Sparse columns are primarily meant for columns that hold a high percentage of null values. Properly used, they can reduce the size of certain tables dramatically and can significantly increase the performance of queries against those tables.
For example, a column of the type CHAR(12) requires 12 bytes of storage even when that column is null. But if the column is declared as type SPARSE, the storage space required for null values is almost nothing. The space required for null values of other data types varies somewhat but space savings can be significant across all data types so long as there is a high percentage of null values in the column.
The tradeoff is that the storage of non-null values in a sparse column is slightly higher than it would be in a column that is not declared as SPARSE. The potential benefit of a sparse column depends on the data type and the percentage of rows that will be null in that column. To save you the calculation, Microsoft has created a table that will tell you, for any given data type, what percent of rows in the column must be null in order to achieve a 40% reduction in storage space.
About Sparse Columns
40% in an arbitrary number Microsoft has chosen, but as a practical matter it is probably not worth employing sparse columns unless you can achieve a space reduction of 20% or more. However, in many circumstances you might achieve a reduction of 80% or more.
The Parts Table
Our example is the Parts table in an auto parts database. The fundamental problem with a parts table is that different parts have different attributes and very few parts share the same attributes e.g. A rear view mirror might have a size, finish, etc. A wheel might have a diameter, number of lug holes, etc.
A truly relational parts table would contain thousands of columns to cover all the different attributes for all parts. Almost all of the columns in each row would be null. Many of those columns would need to be indexed. The indexes would be bloated by all the null values and force the table to take up much more storage space. This is an unwieldy solution that has forced many database designers to opt for a different approach.
One possibility is to create a table for each set of parts that share the same attributes. However, this also is unwieldy and it is a developer’s nightmare to create a program that can use a database like this.
Another option is the attribute\value table where rows become columns and columns become rows. That sort of table would at a minimum have a PartID column, an attribute name or ID, and the value of that attribute. In practice there are often many more columns than that.
For example here are two rows describing an attribute for a mirror and a wheel:
Part ID | Attribute | Value |
123 | Color | Silver |
456 | Diameter | 18 |
This is a compact way to hold this data without wasting as much storage space as the relational table. However, it can have performance issues. It is often necessary to pivot this column data into row data (for example to display a part with all its attributes as a single row). Sometimes it is done using the PIVOT operator but often I have seen this done using a cursor. The pivot operator is better than the cursor, but neither method is nearly as fast as an index seek on a relational table.
With the advent of sparse columns and filtered indexes, the relational table approach becomes more tenable. It is possible to build a table with as many as 30,000 sparse columns that has a minimal storage cost and performs well. When you add filtered indexes to that solution you can have a single high performance relational table containing all of your parts with all of their attributes.
Filtered Indexes
A filtered index is, as the name implies, an index with a WHERE clause. Filtered indexes have some things in common with indexed views but we won’t be covering that in this article. Filtered indexes are useful in many situations, but they are especially useful on sparse columns.
In our example you might create a filtered index on the sparse column WheelDiameter like this:
- CREATE NONCLUSTERED INDEX sparse_wheeldiameter
- ON Parts(wheeldiameter, partid)
- WHERE WheelDiameter is not null;
This index will only contain data for rows where the wheel diameter value is not null. The only products that have wheeldiameter attributes are, you guessed it, wheels. So, only a tiny subset of the rows in the table will be included in a tiny but efficient index. The performance of queries using this index will be very fast.
With conventional indexes that contain values for every row in the table, it is important to limit the number of indexes you put on a table, but sparse columns are in a different world.
In a large parts warehouse you can easily have a table with thousands of sparse columns. As a matter of course, you might create hundreds or thousands of indexes at very little cost in storage or update maintenance.
Limitations
As we mentioned above, there are a number of limitations imposed on the use of sparse columns and filtered indexes. The link above discusses that thoroughly but the limitations you are most likely to run up against are:
• A sparse column cannot be part of a clustered index (for obvious reasons).
• A sparse column must be nullable and it cannot be an identity or a row_guid_col (again for obvious reasons).
• Sparse columns cannot be compressed, therefore cannot be created on compressed tables.
• A filtered index cannot be clustered.
A Final Caution
The usefulness of sparse columns depends entirely on the ratio of not null to null values in the column. Remember that the storage for not null columns actually increases in a sparse column. Don’t try to use sparse columns where they are not intended to be used.