Should I Store Files in the Database?

SQL Server can store binary large objects such as files inside the database. Why might you want to store files inside the database? What problems might it cause and what alternative options should you be considering?

Background

I have experience managing several large SQL Server databases containing up to 160TB of BLOB data – using in-database BLOBs as well as FILESTREAM. If you think you might need to store hundreds of GB or even TB of BLOB data, think very carefully about where you place it – your SQL Server database might not be a good home for it. It can be made to work, and I’ll provide some tips in this article, but ultimately we ended up migrating our BLOBs out of SQL Server. I’ll explain why you might want to consider alternative options, and how to manage huge volumes of BLOB data if it does need to live in SQL.

Why store in the DB?

  • Full-text indexing.
    This is the most compelling reason to store your BLOBs in the database, in my opinion. There are alternatives worth considering, like Elasticsearch, which I actually think is better and more flexible than full-text search in SQL Server. Full-text search in SQL Server is seamless, though, and easy to implement.
  • Transactional consistency
    I’ve seen this argument used as a reason for storing BLOBs in the DB, but in most cases you can engineer what you require with the BLOBs stored externally. For example, you could commit the BLOB to an appropriate store first and then add the associated metadata to the DB – ensuring you never have metadata without an associated BLOB. A background process could clean up orphaned BLOB data, or a queue could be used to ensure reliable processing of both the BLOB and the metadata.
    Note: You need to use explicit transactions to get transactional consistency in the database. Large BLOBs can create long-running transactions, which result in blocking and poor concurrency. Doing the BLOB operation as the first part of the transaction is probably the best option if you need this – see the sketch after this list.
  • Easy backup/restore
    Backup/restore is probably a compelling reason NOT to store your files in the DB (discussed later), but there is a simplicity in having all your ‘content’ data included in a database backup file. Also, your DB and BLOBs will be transactionally consistent if you need to roll back to an earlier point in time – ensuring no orphaned BLOBs or metadata (provided you used explicit transactions).
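
To illustrate the explicit-transaction approach, here’s a minimal sketch using hypothetical dbo.Document and dbo.DocumentMeta tables (the names and columns are illustrative, not a real schema):

    -- Minimal sketch: hypothetical tables dbo.Document (BLOB) and dbo.DocumentMeta (metadata).
    CREATE PROCEDURE dbo.AddDocument
        @DocumentID UNIQUEIDENTIFIER,
        @FileName   NVARCHAR(260),
        @FileBytes  VARBINARY(MAX)
    AS
    BEGIN
        SET NOCOUNT ON;
        SET XACT_ABORT ON; -- roll everything back on any error

        BEGIN TRANSACTION;

            -- BLOB insert first: the slow part happens at the start of the transaction.
            INSERT INTO dbo.Document (DocumentID, Content)
            VALUES (@DocumentID, @FileBytes);

            -- Metadata second: if this fails, the BLOB insert rolls back too,
            -- so there is never metadata without a matching BLOB.
            INSERT INTO dbo.DocumentMeta (DocumentID, FileName, UploadedAtUtc)
            VALUES (@DocumentID, @FileName, SYSUTCDATETIME());

        COMMIT TRANSACTION;
    END;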

The Problems

  • DB size!
    Your databases can potentially grow to enormous sizes. Most of the negatives below are related to this. If you don’t expect to store very large volumes of BLOB data some of these issues might not apply. Evaluate the information in this article and make an informed decision for your use case.
  • Storage costs
    The database is probably the most expensive place to store your BLOB data. If you are hosting in the cloud you might find that there are options that are significantly cheaper and have better durability (e.g. S3, Azure blob).
  • Backups
    Your database backups are going to be large! Because they are large, they are also going to be SLOW and EXPENSIVE! BLOB data is also inefficient to back up from inside the database (discussed below).
  • Restores
    If backups are slow as a result of BLOB data, also think about restores. DR/HA also becomes critical as you move into VLDB territory.
  • DR/HA
    This becomes more necessary and also more expensive as a result of storing BLOB data in your DB. Your DR copies also need to include the blob data. Looking to scale out reads with AlwaysOn AGs? The extra storage cost for the BLOBs will add up.
  • DBCC checks
    Make a coffee, it’s going to take a long time! You might need to think of partitioning your DBCC checks so that they can complete out of hours. DBCC checks are very resource-intensive and blob data pushing your DBs into VLDB territory can create significant challenges here.
  • Risk of Corruption
    Corruption is usually the result of a SQL Server bug, driver, or hardware failure. With larger volumes of data, you could be more at risk.
  • Performance
    SQL Server licensing is expensive and there is some CPU overhead in serving files from your DB. Also, if you are not using FILESTREAM, the data pages associated with the BLOB data will consume space in the buffer pool – wasting valuable physical memory. The query after this list shows how to check buffer pool usage per table.
  • 2GB size limit for in-database BLOBs.
  • Removing files doesn’t automatically reclaim any space (for in-database BLOBs).
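
As mentioned in the performance point above, in-database BLOB pages compete for buffer pool memory. A query along these lines (standard DMVs, current database) gives a rough picture of buffer pool usage per table, so you can see how much of it your BLOB tables are consuming:

    -- Approximate buffer pool usage per object in the current database (8KB pages).
    SELECT  o.name AS TableName,
            COUNT(*) * 8 / 1024 AS BufferPoolMB
    FROM    sys.dm_os_buffer_descriptors AS bd
    JOIN    sys.allocation_units AS au
              ON bd.allocation_unit_id = au.allocation_unit_id
    JOIN    sys.partitions AS p
              ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)
              OR (au.type = 2       AND au.container_id = p.partition_id) -- LOB data
    JOIN    sys.objects AS o
              ON p.object_id = o.object_id
    WHERE   bd.database_id = DB_ID()
      AND   o.is_ms_shipped = 0
    GROUP BY o.name
    ORDER BY BufferPoolMB DESC;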

Inefficient Backup

Why are backups of BLOB data inefficient in the database? Upload a new file into your database – how many times is it backed up?

  • In the next transaction log backup.
  • In the next diff backup – and the next until a full backup is taken.
  • Every full backup from now until the end of time, or until the BLOB is deleted – regardless of whether the file has changed.

So how often should it have been backed up? Just Once. You might want multiple copies for redundancy but you don’t need to keep backing up files that don’t change. If you choose to store your files in the database you don’t have that option. Your backups and restores are going to be slow and inefficient. What if disaster strikes and you need to restore from backup? How long will that take and what pain will it cause the business?
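
To make that concrete, here’s a sketch of a typical backup schedule (MyDb and the file paths are placeholders). A BLOB uploaded today is written out again by every one of these backups:

    -- The new BLOB ends up in each of the following:
    BACKUP LOG MyDb TO DISK = N'X:\Backup\MyDb_log.trn';                          -- the next log backup
    BACKUP DATABASE MyDb TO DISK = N'X:\Backup\MyDb_diff.bak' WITH DIFFERENTIAL;  -- every diff until the next full
    BACKUP DATABASE MyDb TO DISK = N'X:\Backup\MyDb_full.bak';                    -- every full, until the BLOB is deleted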

What about FILESTREAM?

FILESTREAM is a hybrid approach giving you all the benefits of storing BLOBs in the DB with none of the downsides, RIGHT? Not quite…

FILESTREAM solves the problem of BLOBs consuming space in the buffer pool. It removes the 2GB limit imposed on traditional BLOBs, and it provides a more efficient way of streaming larger BLOBs. FILESTREAM data also doesn’t count towards the 10GB SQL Server Express limit – which might allow some people to stay on the free edition a while longer.

It solves some problems, but it shouldn’t be considered a best-of-both-worlds solution, as most of the issues mentioned above still apply to FILESTREAM. It doesn’t help with DB size, backup/restore, storage costs, DR/HA, DBCC, or the potential for corruption. It may help to some extent with performance – but you are still adding extra load to your DB server.

If you are going to store BLOBs larger than 1MB in SQL Server, FILESTREAM is probably the best option. For smaller BLOBs, it might be more efficient to store them in-database. The best option? Skip FILESTREAM and store the BLOB data completely outside the database.
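
For reference, if you do go the FILESTREAM route, the setup looks roughly like this (database name, paths and table are examples; the Windows-level FILESTREAM setting is enabled separately in SQL Server Configuration Manager):

    -- Instance-level setting (2 = T-SQL and Win32 streaming access).
    EXEC sys.sp_configure N'filestream access level', 2;
    RECONFIGURE;

    -- Add a FILESTREAM filegroup and container to the database.
    ALTER DATABASE MyDb ADD FILEGROUP FS_Documents CONTAINS FILESTREAM;
    ALTER DATABASE MyDb ADD FILE
        (NAME = N'FS_Documents', FILENAME = N'X:\FileStreamData\MyDb')
        TO FILEGROUP FS_Documents;

    -- FILESTREAM tables need a ROWGUIDCOL column with a unique constraint.
    CREATE TABLE dbo.Document
    (
        DocumentID UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWSEQUENTIALID(),
        FileName   NVARCHAR(260) NOT NULL,
        Content    VARBINARY(MAX) FILESTREAM NULL
    );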

If not the DB, Where?

If you are running in the cloud, you should look towards solutions like S3 or Azure Blob Storage. For example, S3 is cheaper and has better durability than EBS volumes. Also, with cloud blob storage you don’t have to worry about expanding drives or running out of space (just paying the cloud provider’s bill).

If you need full-text indexing, there are alternative solutions like Elasticsearch or Solr that are worth looking at. Those solutions require you to extract the raw text from your BLOB data yourself, where SQL Server can do this automatically for you with IFilters. Apache Tika is one option for text extraction; IFilters are maybe still an option too, but you would have to write the extraction code yourself using something like IFilterTextReader.

Note: Full-text search is quite a compelling reason to use SQL Server for BLOBs. I think there are better alternatives, but they all require more development effort. For smaller projects where you need full-text search, storing BLOBs in the DB is an option to consider. For larger projects that expect high volumes of data, I would look at the alternatives. It’s more effort, but it could prove cheaper in the long run with more efficient storage, backups etc.

I still need to store my BLOBs in the DB. Help!

As volumes of data increase, backups, restores and DBCC checks become increasingly difficult. The largest DB I managed grew to 160TB, which would require a sustained average backup rate of 970MB/sec to complete a FULL backup within a generous 48-hour window. So how do you cope as volumes of data increase? Throwing money at your hardware vendor might be one solution, but there is another option…

Partitioning is the key! Ideally your BLOB data is immutable (INSERT only). This is something you can engineer, but you might need to consider things like GDPR. I would recommend partitioning your BLOB data by date and using a separate filegroup per partition (see the sketch below). This allows you to mark older filegroups as readonly. You then create one-off backups of your readonly filegroups, and after that your regular backups only need to cover the writable filegroups. This can seriously cut down the size of your weekly backups.
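
Here’s a rough sketch of that layout – a hypothetical Documents table partitioned by upload date (by year), with one filegroup per partition. Names, boundary dates and file paths are illustrative:

    -- One filegroup (and data file) per year.
    ALTER DATABASE MyDb ADD FILEGROUP Documents2022;
    ALTER DATABASE MyDb ADD FILE
        (NAME = N'Documents2022', FILENAME = N'X:\Data\MyDb_Documents2022.ndf')
        TO FILEGROUP Documents2022;
    ALTER DATABASE MyDb ADD FILEGROUP Documents2023;
    ALTER DATABASE MyDb ADD FILE
        (NAME = N'Documents2023', FILENAME = N'X:\Data\MyDb_Documents2023.ndf')
        TO FILEGROUP Documents2023;
    ALTER DATABASE MyDb ADD FILEGROUP Documents2024;
    ALTER DATABASE MyDb ADD FILE
        (NAME = N'Documents2024', FILENAME = N'X:\Data\MyDb_Documents2024.ndf')
        TO FILEGROUP Documents2024;

    -- Partition by upload date, mapping each year to its own filegroup.
    CREATE PARTITION FUNCTION pfDocumentDate (DATETIME2)
        AS RANGE RIGHT FOR VALUES ('2023-01-01', '2024-01-01');

    CREATE PARTITION SCHEME psDocumentDate
        AS PARTITION pfDocumentDate TO (Documents2022, Documents2023, Documents2024);

    CREATE TABLE dbo.Document
    (
        DocumentID    UNIQUEIDENTIFIER NOT NULL,
        UploadedAtUtc DATETIME2        NOT NULL,
        Content       VARBINARY(MAX)   NOT NULL,
        CONSTRAINT PK_Document PRIMARY KEY (DocumentID, UploadedAtUtc)
    ) ON psDocumentDate (UploadedAtUtc);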

Note: Marking filegroups readonly requires exclusive access to your database. This means a very brief outage each time you mark a filegroup readonly, which isn’t ideal. Storing your BLOBs in a separate DB is an option to consider to limit the impact of this.
If you need to delete data in a readonly filegroup, you will first need to mark it writable (exclusive access required again). You then delete as normal and mark the filegroup readonly again. At that point a new FULL backup of the filegroup is required.
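
Continuing the sketch (same hypothetical names): marking an older filegroup readonly, taking its one-off backup, and then backing up only the writable filegroups going forward.

    -- Requires exclusive access to the database (the brief outage mentioned above).
    ALTER DATABASE MyDb MODIFY FILEGROUP Documents2022 READ_ONLY;

    -- One-off backup of the readonly filegroup.
    BACKUP DATABASE MyDb
        FILEGROUP = N'Documents2022'
        TO DISK = N'X:\Backup\MyDb_Documents2022.bak';

    -- Regular backups then only need to cover the writable filegroups.
    BACKUP DATABASE MyDb
        READ_WRITE_FILEGROUPS
        TO DISK = N'X:\Backup\MyDb_ReadWrite_full.bak';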

What if my table isn’t INSERT only?

If you expect the data to be mostly readonly, you could handle updates as insert operations and update any references to point to the new BLOB. You can keep a version history for your BLOBs this way, which might be useful for your application. Deletes could be handled as “soft” deletes, where you remove the pointers to the BLOB.
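
As a sketch, assuming a hypothetical dbo.DocumentVersion table that holds every BLOB version (insert only) and a dbo.DocumentMeta table that points at the current version:

    -- An "update" is really an insert of a new version plus a pointer change.
    CREATE PROCEDURE dbo.UpdateDocument
        @DocumentID UNIQUEIDENTIFIER,
        @FileBytes  VARBINARY(MAX)
    AS
    BEGIN
        SET XACT_ABORT ON;
        DECLARE @NewVersionID UNIQUEIDENTIFIER = NEWID();

        BEGIN TRANSACTION;
            -- New version row; old versions stay put, giving you version history.
            INSERT INTO dbo.DocumentVersion (VersionID, DocumentID, Content, CreatedAtUtc)
            VALUES (@NewVersionID, @DocumentID, @FileBytes, SYSUTCDATETIME());

            -- Repoint the metadata at the new version.
            UPDATE dbo.DocumentMeta
            SET    CurrentVersionID = @NewVersionID
            WHERE  DocumentID = @DocumentID;
        COMMIT TRANSACTION;
    END;
    GO

    -- A "delete" just clears the pointer; the BLOB rows themselves are untouched.
    CREATE PROCEDURE dbo.SoftDeleteDocument
        @DocumentID UNIQUEIDENTIFIER
    AS
    BEGIN
        UPDATE dbo.DocumentMeta
        SET    CurrentVersionID = NULL,
               DeletedAtUtc     = SYSUTCDATETIME()
        WHERE  DocumentID = @DocumentID;
    END;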

In some cases there might be a legal obligation to physically delete data, and a soft delete won’t cut it. Or maybe the volume of update and delete operations is so high that you can’t make the table insert-only without adding significantly to your data storage problems. Either way, this is another reason to consider moving the files out of the DB.

Filegroups can still be useful in this scenario. If you have filegroups containing older, less volatile data, you can take less frequent FULL backups of them. You will still need to take regular diff backups of these filegroups, but those diffs will be quite small.
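
For example (hypothetical filegroup name again), an infrequent FULL backup of the older filegroup combined with regular, small differential backups of it:

    -- Infrequent (e.g. monthly) full backup of the older, less volatile filegroup.
    BACKUP DATABASE MyDb
        FILEGROUP = N'Documents2023'
        TO DISK = N'X:\Backup\MyDb_Documents2023_full.bak';

    -- Regular differential backups of the same filegroup stay small
    -- because little of its data changes.
    BACKUP DATABASE MyDb
        FILEGROUP = N'Documents2023'
        TO DISK = N'X:\Backup\MyDb_Documents2023_diff.bak'
        WITH DIFFERENTIAL;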

What about DBCC?

You can run DBCC checks at the FILEGROUP level which allows you to partition your DBCC checks. You can prioritize running checks for your writable filegroups which are most at risk from corruption.
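
For example, using the hypothetical filegroup names from earlier – check the writable filegroup frequently and rotate through the readonly ones less often:

    USE MyDb;

    -- Check the writable filegroup frequently - it is most at risk from corruption.
    DBCC CHECKFILEGROUP (N'Documents2024') WITH NO_INFOMSGS;

    -- Older, readonly filegroups can be checked on a slower rotation.
    DBCC CHECKFILEGROUP (N'Documents2022') WITH NO_INFOMSGS;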

What about Restore?

Restores are still going to be slow. Also, if you are using FILEGROUP backups, your restores have become more complex. If you get this wrong, you might find yourself unable to restore your database. The individual filegroup backups represent your database at different points in time – you need to bring them to a transactionally consistent point in time to recover the database. You do this by applying your diff and transaction log backups.

With filegroup backups, you might have some additional options available to bring things online much faster. It’s possible to do an online piecemeal restore, where you recover the “active” part of your database and bring it online. You can then restore your older filegroups while users are accessing the database. This is an Enterprise Edition feature, though.
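
A rough outline of an online piecemeal restore using the same hypothetical names (the exact sequence depends on your backup strategy and how current each filegroup backup is):

    -- 1. Restore and recover the "active" part of the database first.
    RESTORE DATABASE MyDb
        FILEGROUP = N'PRIMARY', FILEGROUP = N'Documents2024'
        FROM DISK = N'X:\Backup\MyDb_ReadWrite_full.bak'
        WITH PARTIAL, NORECOVERY;

    RESTORE LOG MyDb FROM DISK = N'X:\Backup\MyDb_log.trn' WITH RECOVERY;
    -- The database is now online and the restored filegroups are usable.

    -- 2. Restore the older, readonly filegroups while users access the database.
    RESTORE DATABASE MyDb
        FILEGROUP = N'Documents2022'
        FROM DISK = N'X:\Backup\MyDb_Documents2022.bak'
        WITH RECOVERY;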

More Info

Some of the topics in this article, like partitioning and online piecemeal restores, I’ve only covered briefly. I’d recommend viewing the MCM (Microsoft Certified Master) video on online piecemeal restore. It’s no longer available on Microsoft’s website, but SQLskills has it here.


The big shrink of 2019:

DBA Dash captures the moment when BLOB data was removed from one of our databases back in 2019. The files were moved to cloud blob storage, and the associated table, files and filegroups were simply dropped from the database. No actual shrinking was required – which would have been a slow and painful process!

DBA Dash space tracking
