It used to be that Big Data was only a problem for Big Business. Big companies had big data, but they also had the mainframes, the money, and the staff to handle terabytes of data. It did not seem like much of a problem for the Fortune 500.
But the PC revolution has allowed smaller players to do bigger things. More and more often I find that the data in many medium-sized and small businesses is outgrowing their capacity to handle it. The problem is much more serious for companies that do not have overflowing bank accounts. The standard methods for dealing with big data tend to be beyond their budgets. This month, I thought I would put together some tips for managing big data on a small budget. These tips are certainly not best practices, but if you can’t afford best practices, these might be of help.
Tip 1: Partitioning on the Cheap
Partitioned Views
Normally, the first step in dealing with big tables is to partition them. However, table partitioning is only available in SQL Server Enterprise Edition, and we are discussing companies with small IT budgets. They are unlikely to be able to pay six or seven times the licensing cost of their Standard Edition servers, so I have eliminated it as an option in this particular discussion.
However, in older versions of SQL Server, before real partitioning became part of the feature set, we had partitioned views. They lack many important features of real table partitioning, but they are available in every edition and still handle big data as well as they ever did.
I won’t go into a lot of technical detail about them, but the gist is that you break your big table into several smaller tables, which can be placed on different disks or even on different SQL Server instances. You separate the data into these tables based on the value of a sequentially increasing column, usually an identity or a date. Then you create a partitioned view that includes all of these tables, and you query the view just as you would have queried the table. When you want to archive data, you alter the view to exclude a specific table.
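To make that concrete, here is a minimal sketch of a local partitioned view. The table, view, and column names (dbo.Orders2019, OrderDate, and so on) are invented for illustration; the essential ingredients are the CHECK constraints on the partitioning column, which let the optimizer skip member tables that cannot contain the rows a query asks for.

-- Two member tables, one per year; the CHECK constraint on OrderDate defines each range.
CREATE TABLE dbo.Orders2019
(
    OrderID   int      NOT NULL,
    OrderDate datetime NOT NULL
        CONSTRAINT CK_Orders2019_Date CHECK (OrderDate >= '20190101' AND OrderDate < '20200101'),
    Amount    money    NOT NULL,
    CONSTRAINT PK_Orders2019 PRIMARY KEY (OrderID, OrderDate)
);

CREATE TABLE dbo.Orders2020
(
    OrderID   int      NOT NULL,
    OrderDate datetime NOT NULL
        CONSTRAINT CK_Orders2020_Date CHECK (OrderDate >= '20200101' AND OrderDate < '20210101'),
    Amount    money    NOT NULL,
    CONSTRAINT PK_Orders2020 PRIMARY KEY (OrderID, OrderDate)
);
GO

-- The partitioned view is just a UNION ALL over the member tables.
CREATE VIEW dbo.OrdersView
AS
SELECT OrderID, OrderDate, Amount FROM dbo.Orders2019
UNION ALL
SELECT OrderID, OrderDate, Amount FROM dbo.Orders2020;
GO

-- Query the view as you would the original table.
SELECT SUM(Amount)
FROM dbo.OrdersView
WHERE OrderDate >= '20200301' AND OrderDate < '20200401';

-- To retire 2019, ALTER VIEW and leave out the 2019 SELECT; the member table itself is untouched.

Note that for the view to be updatable through INSERTs, the partitioning column must be part of each member table’s primary key and constrained with a CHECK, which is why OrderDate appears in both places above.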
There are reasons why partitioned views are not used very often anymore, but they might work for you. Here is a link to a more detailed description:
Tip 2: Forever is a Long, Long Time
The best thing you can do for a big data problem is to not have one. Be careful what you promise.
Early in a company’s history, the priority is to acquire customers. Few founders think very deeply about potential big data problems five or ten years down the road. Consequently, the company often finds itself contractually obligated to keep all of its customers’ data online forever. If the agreement also calls for 24/7 uptime, then you have a real problem on your hands, because managing big data on the cheap usually means a significant amount of offline activity.
For these companies, business success might bring with it a lot of data. In most cases, the earlier data has not been touched in years. Even though it never changes, it is still backed up every night. Performance suffers because the tables are swollen with unused data. Database consistency checks and reindexing jobs take more time, and more disk space is needed for them. The list of problems goes on and on.
A sane archival plan should be a feature of any new agreement, and it may be possible to re-negotiate data archival with your existing clients. Reasonable people are likely to agree to a reasonable archiving plan. That doesn’t solve the problem of how to archive, but it at least allows you to get started.
Archiving
If you are not using SQL Server Enterprise Edition, your biggest problem in archiving old data will probably be the time it takes to export the old data and then load it into an archive table.
Tip 3: Take Small Bites
Managing big data is most difficult when the situation has been allowed to become critical. Archiving or deleting huge amounts of old data can take days to complete. If you are in this position, I recommend that you begin by taking smaller bites. The import/export tools can work from the results of any query you specify rather than the entire table, so you can carve off an amount of data small enough to be archived during a single maintenance window. Once you catch up, continue to archive your old data periodically to avoid another critical situation.
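A quick sizing query, using a hypothetical dbo.Orders table and OrderDate column, can tell you how big a bite a given cut-off date represents before you commit to exporting it:

-- Adjust the cut-off until the batch fits comfortably in your maintenance window.
DECLARE @cutoff datetime = '20150101';

SELECT COUNT(*)       AS RowsToArchive,
       MIN(OrderDate) AS OldestRow
FROM dbo.Orders
WHERE OrderDate < @cutoff;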
Exporting Data
The tool of choice, at least for exporting the data, is BCP, a command-line utility that comes with SQL Server. It can export SQL data to many different kinds of targets. When exporting the data, you will probably want BCP to also create a format file to make importing the data into an archive table easier; a sketch of a typical export appears after the link below.
Again, we won’t be discussing the details of using BCP in this article, but here is a pointer to the details:
BCP
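As a rough illustration of what one "bite" looks like (the server, database, table, and file names below are placeholders, not a prescription), you might export the rows older than your cut-off with a query, then have BCP describe the table in a format file for the import step:

rem Export only the rows older than the cut-off, using a query instead of the whole table.
bcp "SELECT OrderID, OrderDate, Amount FROM MyDatabase.dbo.Orders WHERE OrderDate < '20150101'" queryout C:\Archive\Orders_pre2015.dat -S MyServer -T -n

rem Create a format file describing the table, for use when importing into the archive table.
bcp MyDatabase.dbo.Orders format nul -f C:\Archive\Orders.fmt -S MyServer -T -n

Native format (-n) is generally the fastest and safest choice when the data is only going back into another SQL Server table; use -c (character format) instead if anything other than SQL Server may ever need to read the archive file.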
Importing to the Archive Table
For importing the data into an archive table, you have the choice of using BCP or the T-SQL command BULK INSERT. There is not a lot of difference between the two methods; both have much the same parameters to control execution. If you are more comfortable with a T-SQL interface, use BULK INSERT; if you are OK with the command line, use BCP. Here is a link to the details of using BULK INSERT:
Bulk Insert
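Assuming the export and format file from the BCP sketch above, a minimal BULK INSERT into a hypothetical archive table might look like this (again, the names and paths are placeholders):

-- Load the exported file into the archive table; TABLOCK is one of the requirements
-- for minimal logging described in the next section.
BULK INSERT ArchiveDB.dbo.OrdersArchive
FROM 'C:\Archive\Orders_pre2015.dat'
WITH (
    FORMATFILE = 'C:\Archive\Orders.fmt',
    TABLOCK
);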
For completeness, I should mention that you can also use INSERT … SELECT with OPENROWSET(BULK…) to import data from a remote source. However, unless there is no other alternative, I would caution you not to use a remote source, for performance reasons. We won’t be discussing that option in this article.
Parameters
When you are importing a lot of data into an archive table, you don’t want the operation to be fully logged. That would take forever and swell your log file to gigantic proportions. Both BCP and BULK INSERT have a TABLOCK option to help ensure that the operation is minimally logged. However, there are a number of other conditions that must be met before the operation will be minimally logged; a sketch of a typical setup follows the list below.
Criteria for minimally logged inserts:
- The recovery model is simple or bulk-logged.
- The target table is not being replicated.
- The target table does not have any triggers.
- The target table has either 0 rows or no indexes.
- The TABLOCK hint is specified.
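As a sketch of how the first and last conditions are usually handled (database, table, and path names are placeholders), the load can be wrapped like this; if the database normally runs in full recovery, see the note about log backups at the end of this article:

-- Switch to bulk-logged recovery for the duration of the load
-- (unnecessary if the database already uses the simple recovery model).
ALTER DATABASE ArchiveDB SET RECOVERY BULK_LOGGED;

BULK INSERT ArchiveDB.dbo.OrdersArchive
FROM 'C:\Archive\Orders_pre2015.dat'
WITH (FORMATFILE = 'C:\Archive\Orders.fmt', TABLOCK);

-- Switch back and take a log backup so the log chain stays intact.
ALTER DATABASE ArchiveDB SET RECOVERY FULL;
BACKUP LOG ArchiveDB TO DISK = 'D:\Backups\ArchiveDB_AfterLoad.trn';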
Deleting Archived Data from Production Tables
The last step in archiving your data is to remove the archived rows from the production table. If you are archiving all of the data in the table, you can simply use the minimally logged TRUNCATE TABLE command. However, you are probably archiving only a portion of the rows in the table. Unfortunately, there is no BULK DELETE command in SQL Server; you have to remove the rows with the standard DELETE command, and that is a fully logged operation.
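One common workaround, sketched below with the same hypothetical table and cut-off date, is to delete in modest batches inside a loop so that each transaction stays small:

-- Delete archived rows in batches of 10,000 until none remain.
DECLARE @cutoff datetime = '20150101';

WHILE 1 = 1
BEGIN
    DELETE TOP (10000)
    FROM dbo.Orders
    WHERE OrderDate < @cutoff;

    IF @@ROWCOUNT = 0
        BREAK;
END;

Each batch is still fully logged, but because the transactions are small, the log can be reused or backed up as you go, and other users are not blocked behind one massive delete.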
A Word About Minimal Logging
Minimal logging does make the import go much faster and keeps the log file from growing out of control. However, the next time you back up the transaction log after importing a lot of data, the backup file will be huge. Make sure you have enough space on the backup drive to hold a file that might be orders of magnitude larger than the preceding log backup files. After that first backup, the log backup files will return to their normal size.