Should I Store Files in the Database?

SQL Server can store binary large objects (BLOBs) such as files inside the database. Why might you want to store files inside the database? What problems might it cause, and what alternative options should you be considering?

Background

I have experience managing several large SQL databases consisting of up to 160TB of blob data – using in-database BLOBs as well as FILESTREAM. If you think you might need to store hundreds of GBs or TBs of BLOB data, you should think very carefully about where you place that data – your SQL Server database might not be a good solution! It can be made to work, and I’ll provide some tips in this article, but ultimately we ended up migrating our BLOBs out of SQL. In this article, I’ll explain why you might want to consider alternative options and how to manage huge volumes of blob data in SQL.

Why store in the DB?

  • FullText indexing.
    This is the most compelling reason to store your blobs in the database in my opinion. There are alternative options to consider like Elasticsearch, which I actually think is better and more flexible than full-text search in SQL Server. Full-text search is seamless though and easy to implement.
  • Transactional consistency
    I’ve seen this argument used as a reason for storing blobs in the DB, but I think in most cases you can engineer what you require with the blobs stored externally. For example, you could commit the blob to an appropriate store first, then add the associated metadata to the DB – ensuring you don’t have any metadata without an associated BLOB. A process could be added to clean up orphaned BLOB data, or a queue could be used to ensure reliable processing of both the blob and the metadata.
    Note: You need to use explicit transactions to get transactional consistency in the database. Large BLOBs could create long-running transactions which result in blocking and poor concurrency. Doing the BLOB operation as the first part of the transaction is probably the best option if you need this (see the sketch after this list).
  • Easy backup/restore
    Backup/Restore is probably a compelling reason NOT to store your files in the DB (discussed later), but there is a simplicity in having all your ‘content’ data included in a database backup file. Also, your DB and blobs will be transactionally consistent if you need to roll back to an earlier time – ensuring no orphaned blobs or metadata (provided you used explicit transactions).
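A minimal sketch of the ‘BLOB first’ pattern in an explicit transaction, as mentioned above – the table, column and procedure names here are hypothetical:

CREATE PROCEDURE dbo.AddDocument
    @DocumentID UNIQUEIDENTIFIER,
    @FileName   NVARCHAR(260),
    @Content    VARBINARY(MAX)
AS
BEGIN
    SET NOCOUNT, XACT_ABORT ON; -- roll back everything if any statement fails

    BEGIN TRANSACTION;

    -- Do the (potentially slow) BLOB write as the first part of the transaction
    INSERT INTO dbo.Document (DocumentID, Content)
    VALUES (@DocumentID, @Content);

    -- Then add the metadata, so there is never metadata without an associated BLOB
    INSERT INTO dbo.DocumentMetadata (DocumentID, FileName, UploadedDate)
    VALUES (@DocumentID, @FileName, SYSUTCDATETIME());

    COMMIT TRANSACTION;
END;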

The Problems

  • DB size!
    Your databases can potentially grow to enormous sizes. Most of the negatives below are related to this. If you don’t expect to store very large volumes of BLOB data some of these issues might not apply. Evaluate the information in this article and make an informed decision for your use case.
  • Storage costs
    The database is probably the most expensive place to store your BLOB data. If you are hosting in the cloud you might find that there are options that are significantly cheaper and have better durability (e.g. S3, Azure blob).
  • Backups
    Your database backups are going to be large! And because they are large, they are going to be SLOW and EXPENSIVE! Backing up BLOB data in the database is also inefficient (discussed later).
  • Restores
    If backups are slow as a result of BLOB data, also think about restores. DR/HA also becomes critical as you move into VLDB territory.
  • DR/HA
    This becomes more necessary and also more expensive as a result of storing BLOB data in your DB. Your DR copies also need to include the blob data. Looking to scale out reads with AlwaysOn AGs? The extra storage cost for the BLOBs will add up.
  • DBCC checks
    Make a coffee, it’s going to take a long time! You might need to think of partitioning your DBCC checks so that they can complete out of hours. DBCC checks are very resource-intensive and blob data pushing your DBs into VLDB territory can create significant challenges here.
  • Risk of Corruption
    Corruption is usually the result of a SQL Server bug, driver, or hardware failure. With larger volumes of data, you could be more at risk.
  • Performance
    SQL Server licensing is expensive and there is some CPU overhead in serving files from your DB. Also, if you are not using FILESTREAM the data pages associated with the BLOB data will consume space in the buffer pool – wasting valuable physical memory.
  • 2GB size limit for in-database BLOBs.
  • Removing files doesn’t automatically reclaim any space (in-database BLOBs).

Inefficient Backup

Why are backups of BLOB data inefficient in the database? Upload a new file into your database – how many times is it backed up?

  • In the next transaction log backup.
  • In the next diff backup – and the next until a full backup is taken.
  • In every full backup from now until the end of time, or until the blob is deleted – regardless of whether the file has changed.

So how often should it have been backed up? Just Once. You might want multiple copies for redundancy but you don’t need to keep backing up files that don’t change. If you choose to store your files in the database you don’t have that option. Your backups and restores are going to be slow and inefficient. What if disaster strikes and you need to restore from backup? How long will that take and what pain will it cause the business?

What about FILESTREAM?

FILESTREAM is a hybrid approach giving you all the benefits of storing blobs in the DB with none of the downsides, RIGHT? Not quite…

FILESTREAM solves the problem of blobs consuming space in the buffer pool. It removes the 2GB limitation imposed on traditional BLOBs and it also provides a more efficient way of streaming larger BLOB data. FILESTREAM data doesn’t count towards the 10GB SQL Express limitation – which might allow some people to stay on the free edition of SQL Server a while longer.
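If you do want to use FILESTREAM, here is a minimal sketch of adding a FILESTREAM filegroup and a table that uses it – this assumes FILESTREAM is already enabled at the instance level, and the database, path and table names are hypothetical:

-- Add a FILESTREAM filegroup and container to the database
ALTER DATABASE MyDocs ADD FILEGROUP FS_Documents CONTAINS FILESTREAM;
ALTER DATABASE MyDocs ADD FILE
    (NAME = 'FS_Documents_Container', FILENAME = 'D:\FSData\MyDocs')
    TO FILEGROUP FS_Documents;
GO

-- A FILESTREAM table needs a ROWGUIDCOL column with a unique constraint
CREATE TABLE dbo.DocumentFS
(
    DocumentID UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    FileName   NVARCHAR(260) NOT NULL,
    Content    VARBINARY(MAX) FILESTREAM NULL
) FILESTREAM_ON FS_Documents;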

It solves some problems, but this shouldn’t be considered a best-of-both-worlds solution as most of the issues mentioned still apply to FILESTREAM. For instance, it doesn’t help with DB size, backup/restore, storage costs, DR/HA, DBCC, or the potential for corruption. It may help to some extent with performance – but you are still adding extra load to your DB server.

If you are going to store blobs larger than 1MB in SQL Server, FILESTREAM is probably the best option. For smaller blobs, it might be more efficient to store them in the database. The best option? Skip FILESTREAM and store the blob data completely outside the database.

If not the DB, Where?

If you are running in the cloud you should look towards solutions like Amazon S3 or Azure Blob Storage. For example, S3 is cheaper and has better durability than EBS volumes. Also, with cloud blob storage you don’t have to worry about expanding drives or running out of space (just about paying the cloud provider’s bill).

If you need full-text indexing, there are alternative solutions like Elasticsearch or Solr that are worth looking at. Those solutions will require you to extract the raw text from your BLOB data, whereas this could be done automatically for you in SQL Server with IFilters. Apache Tika might be an option to consider for the text extraction. IFilters are maybe still an option, but you would have to write the code to extract the text yourself using something like IFilterTextReader.

Note: Fulltext is quite a compelling reason to use SQL Server for blobs. I think there are better alternatives but they all require more effort to develop. For smaller projects where you need fulltext search, storing blobs in the DB is an option to consider. For larger projects that predict high volumes of data I would look at alternative solutions. It’s more effort but it could prove cheaper in the long run with more efficient storage, backups etc.

I still need to store my BLOBs in the DB. Help!

As volumes of data increase, backups, restores and DBCC checks become increasingly difficult. The largest DB I managed grew to 160TB, which would require a sustained average backup rate of 970MB/sec to complete a FULL backup within a generous 48hr window. So how do you cope as volumes of data increase? Throwing money at your hardware vendor might be one solution, but there is another option…

Partitioning is the key! Ideally your blob data is immutable (INSERT only). This is something you can engineer, but you might need to consider things like GDPR. I would recommend partitioning your blob data by date and using separate filegroups per partition. This allows you to mark older filegroups as readonly. You need to create one-off backups of your readonly filegroups, and then you can create regular backups of just the writable filegroups. This can seriously cut down on the size of your weekly backups.
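A sketch of the basic approach – the database, filegroup and object names are hypothetical, and the filegroups themselves would need to be created first (ALTER DATABASE … ADD FILEGROUP / ADD FILE):

-- Partition the BLOB data by date, mapping each partition to its own filegroup
CREATE PARTITION FUNCTION PF_DocumentDate (datetime2(0))
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2024-01-01', '2025-01-01');

CREATE PARTITION SCHEME PS_DocumentDate
AS PARTITION PF_DocumentDate TO (FG_2022, FG_2023, FG_2024, FG_2025);
-- The BLOB table is then created on the scheme,
-- e.g. CREATE TABLE dbo.Document (...) ON PS_DocumentDate (UploadedDate);

-- Once a filegroup only holds closed-off periods, mark it readonly
-- (requires exclusive access to the database)
ALTER DATABASE MyDocs MODIFY FILEGROUP FG_2022 READONLY;

-- Take a one-off backup of the readonly filegroup...
BACKUP DATABASE MyDocs FILEGROUP = 'FG_2022'
    TO DISK = 'E:\Backup\MyDocs_FG_2022.bak';

-- ...then regular backups only need to cover the writable filegroups
BACKUP DATABASE MyDocs READ_WRITE_FILEGROUPS
    TO DISK = 'E:\Backup\MyDocs_ReadWrite.bak';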

Note: Marking filegroups readonly requires exclusive access to your database. This means a very brief outage to mark filegroups readonly, which isn’t ideal. Storing your blobs in a separate DB might be an option to consider to limit the impact of this.
If you need to delete data in your readonly filegroup you will first need to mark it writable (exclusive access required again). You would then delete as normal and mark the filegroup readonly again. At this point a new FULL backup of the filegroup is required.
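A sketch of that cycle, again with hypothetical names and a hypothetical delete predicate:

-- Temporarily make the filegroup writable (exclusive access required again)
ALTER DATABASE MyDocs MODIFY FILEGROUP FG_2022 READWRITE;

-- Physically delete the data as normal
DELETE FROM dbo.Document
WHERE DocumentID = '00000000-0000-0000-0000-000000000000';

-- Mark the filegroup readonly again...
ALTER DATABASE MyDocs MODIFY FILEGROUP FG_2022 READONLY;

-- ...and take a new one-off FULL backup of the filegroup
BACKUP DATABASE MyDocs FILEGROUP = 'FG_2022'
    TO DISK = 'E:\Backup\MyDocs_FG_2022_v2.bak';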

What if my table isn’t INSERT only?

If you expect the data to be mostly readonly you could handle updates as an insert operation and update any references to point to the new blob. You can keep a version history for your blobs in this way which might be useful for your application. Deletes could also be done as a “soft” delete where you remove the pointers to the blob.
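As a sketch, an ‘update’ becomes an insert of a new version and a ‘delete’ becomes a flag – the Version, CurrentVersion and IsDeleted columns, and the procedure names, are hypothetical:

CREATE PROCEDURE dbo.UpdateDocument
    @DocumentID UNIQUEIDENTIFIER,
    @NewVersion INT,
    @Content    VARBINARY(MAX)
AS
BEGIN
    SET NOCOUNT, XACT_ABORT ON;
    BEGIN TRANSACTION;

    -- Insert a new version of the BLOB (the old version is kept as history)
    INSERT INTO dbo.Document (DocumentID, Version, Content)
    VALUES (@DocumentID, @NewVersion, @Content);

    -- Repoint the metadata at the new version
    UPDATE dbo.DocumentMetadata
    SET CurrentVersion = @NewVersion
    WHERE DocumentID = @DocumentID;

    COMMIT TRANSACTION;
END;
GO

-- A ‘soft’ delete removes the pointer rather than the BLOB itself
CREATE PROCEDURE dbo.SoftDeleteDocument
    @DocumentID UNIQUEIDENTIFIER
AS
    UPDATE dbo.DocumentMetadata
    SET IsDeleted = 1
    WHERE DocumentID = @DocumentID;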

In some cases there might be a legal obligation to physically delete data and a soft delete won’t cut it. Or maybe the quantity of update and delete operations is high so you can’t make the table insert only without adding significantly to your data storage problems. Ideally this is another reason to consider moving the files out of the DB.

Filegroups can still be useful in this scenario. If you have filegroups with older, less volatile data you can take less frequent FULL backups of these filegroups. You will still need to take regular diff backups of these filegroups but the diff backups will be quite small.
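For example, an older, less volatile filegroup could get an occasional FULL filegroup backup with regular differentials in between – names hypothetical:

-- Occasional FULL backup of the less volatile filegroup
BACKUP DATABASE MyDocs FILEGROUP = 'FG_2023'
    TO DISK = 'E:\Backup\MyDocs_FG_2023_Full.bak';

-- Regular differential backups of the same filegroup - these stay small
-- if little of the data has changed
BACKUP DATABASE MyDocs FILEGROUP = 'FG_2023'
    TO DISK = 'E:\Backup\MyDocs_FG_2023_Diff.bak'
    WITH DIFFERENTIAL;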

What about DBCC?

You can run DBCC checks at the FILEGROUP level which allows you to partition your DBCC checks. You can prioritize running checks for your writable filegroups which are most at risk from corruption.
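For example (run in the context of the database, with hypothetical filegroup names):

USE MyDocs;
GO
-- Check the writable filegroups frequently...
DBCC CHECKFILEGROUP ('PRIMARY') WITH NO_INFOMSGS;
DBCC CHECKFILEGROUP ('FG_2025') WITH NO_INFOMSGS;
-- ...and rotate through the readonly filegroups on a longer schedule
DBCC CHECKFILEGROUP ('FG_2022') WITH NO_INFOMSGS;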

What about Restore?

Restores are still going to be slow. Also, if you are using filegroup backups, your restores become more complex. If you get this wrong, you might be in a position where you are unable to restore your database. The individual filegroup backups represent your database at different points in time – you need to bring these to a transactionally consistent point in time to recover your database. You do this by applying your diff and transaction log backups.

With filegroup backups, you might have some additional options available to bring things online much faster. It’s possible to do an online piecemeal restore where you recover the “active” part of your database and bring it online. You can then restore your older filegroups while users are accessing the database. This is an enterprise edition feature though.
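A rough sketch of an online piecemeal restore – the filegroup and backup file names are hypothetical, and the exact sequence of diff/log restores depends on your backup strategy:

-- Restore the 'active' part of the database first
RESTORE DATABASE MyDocs
    FILEGROUP = 'PRIMARY', FILEGROUP = 'FG_2025'
    FROM DISK = 'E:\Backup\MyDocs_ReadWrite.bak'
    WITH PARTIAL, NORECOVERY;

RESTORE LOG MyDocs
    FROM DISK = 'E:\Backup\MyDocs_Log.trn'
    WITH RECOVERY;

-- The writable filegroups are now online and available to users.
-- The older readonly filegroups can be restored afterwards while the database is in use.
RESTORE DATABASE MyDocs
    FILEGROUP = 'FG_2022'
    FROM DISK = 'E:\Backup\MyDocs_FG_2022.bak'
    WITH RECOVERY;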

More Info

Some of the topics in this article, like partitioning and online piecemeal restores, I’ve only covered briefly. I’d recommend viewing the MCM (Microsoft Certified Master) videos on online piecemeal restore. They’re no longer available on Microsoft’s website, but SQLskills has them here.


The big shrink of 2019:

DBA Dash captures the moment when blob data was removed from one of our databases back in 2019. The files were moved to cloud blob storage and the associated table, files and filegroups were simply dropped from the database. No actual shrinking was required – which would have been a slow and painful process!

DBA Dash space tracking

DBA Dash Overhead

Introduction

All monitoring tools have some overhead associated with them, but generally the overhead is low compared to the value they deliver. How much load does DBA Dash add to a server?

To find out, I set up an extended events session to capture the queries run by the DBA Dash agent. I left the trace running for over 8hrs (8:23:33, ~504 mins). This was done on a real production server – one of our busiest servers, supporting a SaaS application – during peak load. The results are valid for this server running version 2.13.1 of DBA Dash.

TLDR: The overhead of running DBA Dash is very small and it delivers great value!

Results

For each collection, the trace recorded the number of samples, average and maximum duration (ms), average and maximum CPU (ms) and average reads. The collections captured during the test were:

  • Instance
  • RunningQueries – changed to capture every 30sec instead of the default 1min on this server. Session waits are collected.
  • Plan Collection – plan collection is off by default but is enabled on this server. It only runs if there are plans to collect.
  • Text Collection – collects text for running queries and only runs if there is text to be collected.
  • SlowQueries – disabled by default. The query has a 1-second WAITFOR so look at the CPU time.
  • PerformanceCounters
  • CPU
  • ObjectExecutionStats
  • Waits
  • AvailabilityReplicas
  • DatabasesHADR
  • AvailabilityGroups
  • MemoryUsage
  • IOStats
  • JobHistory
  • DBFiles
  • Backups
  • Custom Check – this is a custom check that is NOT part of DBA Dash and is unique to this server. This is YOUR code (or mine in this case).
  • Databases
  • ServerExtraProperties
  • DBConfig
  • LogRestores
  • LastGoodCheckDB
  • DBTuningOptions
  • Corruption
  • SysConfig
  • ServerProperties
  • TraceFlags
  • DatabaseMirroring
  • OSInfo
  • Alerts
  • Max date modified from sysjobs

The queries with 504 samples were run every 1min. RunningQueries was sampled 1008 times on this server as it was changed to collect every 30 seconds instead of every 1min.

Of these frequently run query captures, ironically it’s the SlowQueries capture query that is the slowest. The query has a 1-second delay built in while it starts a new extended event session before flushing the ring_buffer of the old one. It’s better to look at the average CPU time of 295ms for this query. It’s still the heaviest of the queries that are collected every 1min.

This is quite a busy production server and SlowQueries captures all queries taking longer than 1 second to run. This collection isn’t enabled by default and you can configure the threshold for collection as required.

The next slowest of the frequently executed queries is the RunningQueries capture, with a 92ms average execution time. I’ve previously compared this to sp_WhoIsActive. In a lab environment with 12 active queries the average duration was only 15ms, so results will vary depending on your workload. The Plan Collection and Text Collection also form part of the RunningQueries capture, with plan collection being optional.

Of the hourly collections, it’s the DBFiles collection that is the slowest. The overhead of this query will likely be different on your server depending on the databases and files on your SQL instance.

Other Caveats

  • The Slow Query capture doesn’t take into account the overhead of the associated extended event.
  • The repository database is on a different server. If you host both on the same server you will have additional overhead.
  • The DBA Dash agent is run on a different server. If you run the agent on your monitored SQL instance, this will also add additional overhead.
  • The results above are valid for this server with version 2.13.1 of DBA Dash.
  • See here for a list of what DBA Dash captures and when. The analysis above doesn’t include collections that are run daily as they didn’t fall into the 8hr window of this test.
  • You have full control over the collection schedule. Disable some collections or change the frequency as required!

How to test

I’ve posted the results above to give an indication of the overhead of running DBA Dash. If you want to monitor the activity on your own server, an extended events session capturing the agent’s queries will get you started.
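Something like the following sketch should do it – the client_app_name filter and the file target path are assumptions, so adjust them to match how the agent connects on your instance:

CREATE EVENT SESSION DBADashOverhead ON SERVER
ADD EVENT sqlserver.rpc_completed (
    ACTION (sqlserver.client_app_name, sqlserver.sql_text)
    WHERE ([sqlserver].[like_i_sql_unicode_string]([sqlserver].[client_app_name], N'%DBA Dash%'))
),
ADD EVENT sqlserver.sql_batch_completed (
    ACTION (sqlserver.client_app_name, sqlserver.sql_text)
    WHERE ([sqlserver].[like_i_sql_unicode_string]([sqlserver].[client_app_name], N'%DBA Dash%'))
)
ADD TARGET package0.event_file (SET filename = N'C:\Temp\DBADashOverhead.xel')
WITH (MAX_MEMORY = 4096KB, EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS);
GO
ALTER EVENT SESSION DBADashOverhead ON SERVER STATE = START;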

I saved the results to a table and ran a quick and dirty query to group the results.
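Something along these lines will do the grouping if you read the event file directly – a rough sketch, with the file path matching the session above and the grouping done on the first 100 characters of the query text:

;WITH ev AS (
    SELECT CAST(event_data AS XML) AS x
    FROM sys.fn_xe_file_target_read_file(N'C:\Temp\DBADashOverhead*.xel', NULL, NULL, NULL)
)
SELECT  LEFT(x.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)'), 100) AS query,
        COUNT(*) AS samples,
        -- duration and cpu_time are reported in microseconds, so divide by 1000 for ms
        AVG(x.value('(event/data[@name="duration"]/value)[1]', 'bigint') / 1000) AS avg_duration_ms,
        AVG(x.value('(event/data[@name="cpu_time"]/value)[1]', 'bigint') / 1000) AS avg_cpu_ms,
        AVG(x.value('(event/data[@name="logical_reads"]/value)[1]', 'bigint')) AS avg_reads,
        MAX(x.value('(event/data[@name="duration"]/value)[1]', 'bigint') / 1000) AS max_duration_ms,
        MAX(x.value('(event/data[@name="cpu_time"]/value)[1]', 'bigint') / 1000) AS max_cpu_ms
FROM ev
GROUP BY LEFT(x.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)'), 100);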