Removing large backup history with sp_delete_backuphistory

Problem

In SQL Server, the sp_delete_backuphistory stored procedure is the standard, supported method to purge old data from the backup and restore history tables in the msdb database:

  • backupfile
  • backupfilegroup
  • backupmediafamily
  • backupmediaset
  • backupset
  • restorefile
  • restorefilegroup
  • restorehistory

Recently, I encountered an issue running sp_delete_backuphistory on servers that hosted a large number of databases with frequent log backup and restore operations. The cleanup task hadn’t been scheduled, and the history tables had grown very large over several months. The msdb database was also hosted on a volume with limited IOPS.

If you attempt to run sp_delete_backuphistory under these conditions, you will likely encounter these issues:

  • The operation will take a long time
  • Log backup and restore operations are blocked during the process
  • Your msdb log file will grow large if you try to remove too much data at once
  • If you kill the process at any point, the whole thing will roll back.

The sp_delete_backuphistory stored procedure gets the job done in most cases but it’s not very well optimized if you have a large amount of history to clear.

Optimized Solution for Backup History Cleanup

To clear a large amount of history without resorting to schema modifications or unsupported methods, we can take an incremental approach to removing the data.

The script below will check how much retention we currently have. It will then call sp_delete_backuphistory in a loop, removing data one hour (configurable) at a time until we hit the target retention. We can also specify a maximum runtime for the process.

This approach won’t necessarily speed things up but it provides the following benefits:

  • Data is cleared as a series of smaller transactions rather than a single large transaction
  • Smaller transactions allow for log truncation and prevent the log file from growing.
  • We can interrupt the process without having to roll back all the progress that has been made (just the current batch).
  • Log backups/restores might still be blocked, but only for the duration it takes to complete a single loop – not for the entire time it takes to clear the history. This can still impact RPO/RTO though.
  • We can control how long the process runs for. We can keep it running outside business hours or allow for breaks to prevent log backup/restores from falling too far behind.
  • The script will print progress information.
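
Here is a minimal sketch of the approach. The retention target, batch size and maximum runtime values are illustrative – adjust them to suit your environment:

/* Incremental cleanup of msdb backup/restore history */
DECLARE @RetentionDays INT = 30;      /* history to keep */
DECLARE @BatchHours INT = 1;          /* history removed per loop iteration */
DECLARE @MaxRuntimeMins INT = 120;    /* stop looping after this long */
DECLARE @StartTime DATETIME = GETDATE();
DECLARE @TargetDate DATETIME = DATEADD(DAY, -@RetentionDays, GETDATE());
DECLARE @OldestDate DATETIME, @Msg VARCHAR(128);

/* Check how much retention we currently have */
SELECT @OldestDate = MIN(backup_start_date) FROM msdb.dbo.backupset;

WHILE @OldestDate < @TargetDate
  AND DATEDIFF(MINUTE, @StartTime, GETDATE()) < @MaxRuntimeMins
BEGIN
    /* Advance the cutoff one batch at a time, but never past the target */
    SET @OldestDate = DATEADD(HOUR, @BatchHours, @OldestDate);
    IF @OldestDate > @TargetDate SET @OldestDate = @TargetDate;

    EXEC msdb.dbo.sp_delete_backuphistory @oldest_date = @OldestDate;

    /* Print progress as we go */
    SET @Msg = 'Backup history cleared up to ' + CONVERT(VARCHAR(30), @OldestDate, 120);
    RAISERROR(@Msg, 0, 1) WITH NOWAIT;
END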

Links

These related links might be of interest:

https://www.brentozar.com/archive/2018/02/the-annals-of-hilariously-bad-code-part-1/
https://www.brentozar.com/archive/2018/02/annals-hilariously-bad-code-part-2/
https://dba.stackexchange.com/questions/233375/how-to-minimize-plan-ahead-the-effect-of-running-sp-delete-backuphistory-measur

T-SQL Tuesday #156 Production Code

Introduction

The invitation from Tom Zíka asks “Which quality makes code production grade?”.

Reliability

Code that runs in production should be reliable. It should produce consistent results, handle concurrency, and deal with errors gracefully.

Performance & Scalability

Production code needs to be able to scale. Scalability should be considered in the context of your own applications and their expected growth. Not everyone has to deal with the same scalability challenges as Google, Facebook or Twitter.

Performance is also important. It doesn’t matter how awesome your app is if it can’t return data fast enough for end users.

Fast and efficient code can save $$$ in running costs and keep end users happy.

Security

Production code needs to be secure, following industry best practices. A security incident can cause severe reputational damage and financial impact.

Maintainability

Production code should be maintainable, avoiding duplication and striving for simplicity.

Reviewed

Production code will often go through a code review by another developer before it’s accepted into the codebase.

Tested

Production code will go through a series of automated and human tests before it gets to production.

Accurate?

Production code doesn’t always have to produce perfectly accurate results. Accuracy is critical in some situations; in others, you might be able to trade some accuracy for speed, e.g. cached data or eventual consistency. The number of active users on a website doesn’t need to be 100% accurate and can be out of date by the time it’s displayed.

Documented

Production code should have comments and documentation.

Other

You might have policies for coding standards to follow. This might include things like naming conventions, code formatting and other best practices.

Does it need to be perfect?

Production code doesn’t have to be perfect. Perfect code doesn’t really exist. The code we write today will be viewed through a different lens tomorrow with the benefit of hindsight, experience and new improved ways of doing things.

The code we ship to production needs to be fit for purpose. Shipping code that works and delivers value in a timely manner is often more important than spending months trying to find the most optimal solution. There is a balance to be struck.

Architectural decisions made early in a project can have long-lasting consequences. Making good decisions early is much cheaper than trying to fix bad ones later.

You can’t always predict how users will use a new feature and how to optimize it. Growing pains are inevitable for successful applications. It’s important to monitor and be proactive in fixing problems.

Not production grade

So what are some examples of code that shouldn’t be in production?

  • Code that uses undocumented/unsupported features or behaviour
  • Untested code
  • Insecure code, e.g. building dynamic SQL in a way that is prone to SQL injection (see the example below).
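
To illustrate the last point, here is a contrived example using a hypothetical dbo.Customers table. The first statement concatenates user input straight into the SQL string; the second parameterizes it with sp_executesql:

/* Vulnerable: user input is concatenated straight into the statement */
DECLARE @Search NVARCHAR(100) = N''' OR 1=1 --';   /* malicious input */
DECLARE @sql NVARCHAR(MAX);
SET @sql = N'SELECT * FROM dbo.Customers WHERE Name = ''' + @Search + N'''';
EXEC (@sql);   /* the injected predicate returns every row */

/* Safer: pass the value as a parameter with sp_executesql */
SET @sql = N'SELECT * FROM dbo.Customers WHERE Name = @Name';
EXEC sp_executesql @sql, N'@Name NVARCHAR(100)', @Name = @Search;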

Summary

Production code can vary in quality for a variety of reasons. The experience level of the developer writing the code, time pressures, and level of fatigue all impact code quality.

It’s not unusual for code written as a quick PoC to find its way into production. Code can also start to get messy over time as you start to shoehorn new features into an existing codebase.

If your code is delivering value to the end-user or business you can feel happy about it. Over time we all accumulate technical debt and need to allocate time to reduce it.

DBA Dash – What’s new in 2.21.0? 🚀

Intro

The first public release of DBA Dash was in January 2022 and version 2.21.0 is the 24th release. 🎉 A lot has changed, with many new features added as well as bug fixes. I haven’t really blogged much about the changes, but I thought it might be good to capture what’s new for significant releases. So here is what’s new and cool in 2.21.0!

Note: You can see the full change log here.

Metrics tab

The metrics tab was previously a wall of numbers – the Max, Min, Avg, Total & Current value of all the collected performance counters. At the click of a button you could see a chart for any counter over time for the selected period. It was limited to showing 1 chart at a time.

In 2.17.0 it was possible to configure threshold values for your counters – improving on the wall of numbers by highlighting counters that need your attention.

Now in 2.21.0, when you click View to see a chart for a particular counter, it adds a new chart. You can view charts for multiple counters simultaneously. You can also add other charts that are not OS performance counters – CPU, Blocking, IO, Object Execution and Waits. This provides additional context for your performance counters. You can re-order the charts and remove them at the click of a button.

Once you’ve created a useful dashboard you can save and re-load it at the click of a button. You can save the view for yourself or share it with your team.

See it in action here:

DBA Dash – Metrics tab

Tip: You can easily configure which performance counters are collected. Also, you can collect any metric you like that can be queried with T-SQL. See here for details on how to customize the collection.

Performance Summary tab

The performance summary tab now has saved views, the same as the metrics tab. You could save the layout previously, but it wasn’t possible to save multiple versions or to share them with the rest of the team.

DBA Dash Running Queries

DBA Dash vs sp_WhoIsActive / sp_BlitzWho

Introduction

sp_WhoIsActive and sp_BlitzWho are popular stored procedures created by the database community. They both capture queries currently running on your server. If you are a seasoned DBA it’s likely you have come across one or both of these tools. sp_WhoIsActive in particular has been around for a very long time, deployed to many, many thousands of servers. It’s helped me solve countless performance issues. Shout out to Adam Machanic who created the tool – Thank you!

The Brent Ozar team created sp_BlitzWho due to licensing issues distributing sp_WhoIsActive. The licensing issues are no longer a concern as both tools are now open source and can be freely distributed. sp_BlitzWho still has a more permissive license but the difference isn’t likely to matter to most people. Both work great. Each has unique selling points. Include them on all your SQL Server deployments! The excellent dbatools can help with that.

DBA Dash isn’t a stored procedure for showing active queries – it’s a full monitoring tool with a Windows app front end. So this is a bit of a weird comparison. It does capture running queries though, which is where we can draw some comparisons.

Re-inventing the Wheel

While developing DBA Dash I did consider embedding sp_BlitzWho or sp_WhoIsActive rather than re-inventing the wheel. Licensing was a possible concern for sp_WhoIsActive at the time, but I had other reasons for creating my own solution. I wanted a solution designed specifically for regular collection. sp_BlitzWho and sp_WhoIsActive can both log to a table, but they were designed primarily for interactive use in SSMS.

Performance

The goal for DBA Dash was to limit the overhead of query collection. Also, I did not need or want the data to be formatted for human consumption on collection. The numbers below show a performance comparison. I’m using the defaults for each and averaging over 10 executions. 12 active queries were running for this test.

Tool                            Avg CPU (ms)   Avg Duration (ms)
sp_WhoIsActive                  159            166
sp_BlitzWho                     231            241
DBA Dash (with session waits)   6              15
Performance Tests

Note: I don’t worry about the overhead of running any of these tools interactively. Also, if you want to log the output to a table at regular intervals it’s not likely to hurt the performance of your server. Still, for regular monitoring, it’s good to keep overhead as low as possible.

If you want to test this for yourself, the capture query for DBA Dash is here. Confession time – this isn’t a fair fight. I’m not collecting query text or query plans. sp_BlitzWho collects query plans by default, so it’s at a disadvantage – the tools are not all doing the same thing. Also, the performance differences between these tools will vary based on several other factors.

So how does DBA Dash show query text?

  • Take a distinct list of sql_handles from the running query capture.
  • Collect the text associated with those handles (if required).
  • Cache the handles.
  • Ignore handles we’ve already captured on subsequent collections.
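
Roughly, the pattern looks something like the sketch below. This is not DBA Dash’s actual code – the #CapturedHandles table is a stand-in for the persisted cache it maintains:

/* Sketch: only fetch query text for handles we haven't seen before */
IF OBJECT_ID('tempdb..#CapturedHandles') IS NULL
    CREATE TABLE #CapturedHandles (sql_handle VARBINARY(64) PRIMARY KEY);

/* Distinct handles from the current running query capture that are new */
SELECT DISTINCT r.sql_handle
INTO #NewHandles
FROM sys.dm_exec_requests r
WHERE r.sql_handle IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM #CapturedHandles c WHERE c.sql_handle = r.sql_handle);

/* Collect the text for the new handles only */
SELECT nh.sql_handle, t.text
FROM #NewHandles nh
CROSS APPLY sys.dm_exec_sql_text(nh.sql_handle) t;

/* Remember the handles so they are skipped on subsequent collections */
INSERT INTO #CapturedHandles (sql_handle)
SELECT sql_handle FROM #NewHandles;

DROP TABLE #NewHandles;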

Query plan capture features a similar optimization. Also, DBA Dash doesn’t need to do anything with the formatting of the data for human consumption – at least not on collection.

Query text and query plans are stored in separate tables rather than duplicated on every row. DBA Dash can also take care of data retention – clearing out old data efficiently with some partition switching.

This isn’t just about efficient data collection. DBA Dash has a custom GUI and isn’t bound by the same constraints as sp_WhoIsActive and sp_BlitzWho. This allows us to do some interesting things.

Summary Data

At the top level, we see the summary of the last running queries snapshot for each instance. In this case, we can see some serious blocking going on with instance DASH2019.

From there we can drill down to the DASH2019 instance and see a list of recent snapshots. We can also go back in time to any custom date range and see a list of the snapshots collected.

Use the summary data above to determine which snapshots are most interesting. Then drill down to the individual snapshots themselves:

Blocking

In the screenshot above I’ve zoomed into the blocking section of the report. DBA Dash highlights the blocked sessions in the “Blocking Session ID” column. In some cases, there will be long blocking chains. This can make it difficult to work out which session is the root blocker by looking at the blocking session ID and traversing the blocking chain manually. With sp_WhoIsActive you can pass @find_block_leaders=1 to make this easier, which gives you a blocked session count.
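
For example, sorting by the blocked session count puts the lead blockers at the top:

EXEC dbo.sp_WhoIsActive
    @find_block_leaders = 1,
    @sort_order = '[blocked_session_count] DESC';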

With DBA Dash you can just look at the “Blocking Hierarchy” column. Take session 64. This is blocked by session 79 and we have a hierarchy of “90 \ 79”. Session 79 is blocking the current session 64. Session 90 is blocking session 79 and is our root blocker. The “Root Blocker” column also identifies which sessions are root blockers. You can filter to show just the root blockers at the click of a button.

The “Blocked Count” column shows how many queries are blocked directly by a session. You also have a “Blocked Count Recursive” column that shows you a total count of sessions blocked by the query (directly/indirectly). The counts are also clickable – allowing you to navigate the blocking chain interactively.

So what’s actually going on in this snapshot?

DBA Dash identifies session 90 as the root blocker. Session 90 isn’t blocked or waiting on anything. The CPU time and reads are very small. The status column is “sleeping” and we should pay close attention to this. A sleeping session is dormant – waiting for input from the app. This is a problem if the session has an open transaction – holding locks needed by other sessions. DBA Dash gives us the user, hostname and application name, along with the query text, which will assist in tracking this down.

The root cause of this issue is an update statement that ran from SSMS. It had a BEGIN TRAN but didn’t issue a COMMIT or ROLLBACK. This left the transaction open, holding locks that caused the blocking.

Sleeping sessions causing blocking are something to watch out for. They might indicate problems that need to be fixed in the app, e.g. not using the “using” pattern, or doing slow work in the middle of a transaction such as calling a web service.

Waits

DBA Dash shows what queries are currently waiting on. It also captures the wait resource and parses the wait resource to make it easier to decipher. Session level waits are also available in DBA Dash (and sp_BlitzWho). DBA Dash extends this by providing a clickable link that will show the session level waits in table format with more detail.

Text

DBA Dash gives you the statement text and the full query text associated with the SQL handle. Just click the links in the grid to see the text formatted in a new window.

Plans

DBA Dash can show you query plans at the click of a button. Configure this using the service configuration tool. The collection process is optimized the same as for query text. Also, you can reduce the cost of collection further with plan collection thresholds.

A query plan tells you exactly how SQL Server was attempting to process the query. It also gives you the parameters that were used to compile the query plan.

Batch Completed / RPC Completed

Having the batch text and statement is great. For a typical RPC call, we can see the current statement along with the text of the associated stored procedure (or batch). What’s missing are the parameter values passed in from the client application. Unfortunately, we can’t get this by querying the DMVs. DBA Dash has a trick up its sleeve though. We can enable the capture of slow queries which creates an extended event session to capture queries that take longer than 1 second to run (configurable).
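
Conceptually, the capture is similar to an extended events session along these lines (a sketch only – not DBA Dash’s actual session definition; the session name and one-second duration filter are illustrative, and duration is in microseconds):

CREATE EVENT SESSION SlowQueries ON SERVER
ADD EVENT sqlserver.rpc_completed (
    ACTION (sqlserver.session_id, sqlserver.client_app_name, sqlserver.username)
    WHERE duration >= 1000000 /* 1 second */
),
ADD EVENT sqlserver.sql_batch_completed (
    ACTION (sqlserver.session_id, sqlserver.client_app_name, sqlserver.username)
    WHERE duration >= 1000000
)
ADD TARGET package0.ring_buffer;

ALTER EVENT SESSION SlowQueries ON SERVER STATE = START;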

What’s cool is that you can click a session ID in running queries and see the associated RPC/batch completed event. This gives you the metrics associated with the completed execution and also the query text with the parameter values. Having the parameter values is useful when trying to reproduce and diagnose performance issues.

Note: The associated RPC/batch completed event is only available if the query has completed. It also needs to meet the threshold for collection. DBA Dash collects this data every 1min by default which adds additional delay. For the most recent snapshot, you might need to wait before the RPC/batch completed event is available.

Full Duplex

The link between running queries and RPC/batch completed is two-way. The Slow Queries tab shows you the RPC/batch completed events captured from the extended events session. From here you can click the session id and see all the running queries snapshots captured for that session while the query was running. Having access to the running queries snapshots can help answer questions about why a query was slow.
e.g. Was it blocked?
What was it waiting for?
What was the query plan used?
What statement was running?

Note: For queries with shorter execution times you might not have any data as running queries is captured every 1min by default.

Grouping and Export

You can group the running query capture in DBA Dash with the click of a button. For example, you might want to count the number of running queries by application, query hash, hostname….lots of options here. From there you can drill down to see the queries associated with your selected group value.

Exporting the data to Excel is also just a click away.

Not a GUI person?

The captured running query data is available in the DBA Dash repository database for you to query manually if required. I’d recommend using the dbo.RunningQueriesInfo view. The application calls dbo.RunningQueries_Get to return a specific snapshot (which uses the view).
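
For example, a quick ad-hoc look at the captured data can be as simple as the query below (filter and sort on whichever columns you need – I’m deliberately not assuming specific column names here):

/* Ad-hoc query against the DBA Dash repository database */
SELECT TOP (100) *
FROM dbo.RunningQueriesInfo;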

Wrap up

Running queries capture is just a small (but interesting) part of what DBA Dash can do. Pair DBA Dash with Query Store and you have a lot of bases covered.

sp_WhoIsActive and sp_BlitzWho still have a place in my toolbox. I use them often. They are great tools and DBA Dash doesn’t replace them. I’m running these tools interactively, while DBA Dash captures valuable performance data round the clock for both current and retrospective analysis.

There is a lot of value in running a quick command in SSMS and having the results available immediately – in the same tool. With DBA Dash you have to wait for the next scheduled collection, which is every 1min by default. Also, not everyone will want a full monitoring tool just to capture running queries. sp_WhoIsActive and sp_BlitzWho both have their unique selling points and DBA Dash doesn’t replicate all their features and functionality.

If you haven’t tried DBA Dash, it’s a free monitoring tool for SQL Server that I created. My employer Trimble has allowed me to share the tool open source and it’s available on GitHub here. sp_WhoIsActive and sp_BlitzWho are also open source. Give them a try today.

Get drive data for ALL drives in T-SQL

There are a few methods you can use to get drive information using T-SQL. The problem is that all of these methods have limitations: they provide free space but not drive capacity, or they only provide data for volumes that contain database files. Ideally, Microsoft would provide an easy way to get both drive capacity and free space for all drives. In this article I’ll demonstrate a way to do this using T-SQL.

Here are a few existing methods that you might consider before we continue:

If these work for your needs, great! If you need capacity and free space for ALL drives you will need a different solution. The best option might be to query that data outside of SQL Server using PowerShell or your programming/scripting language of choice. If you need to do this in T-SQL though, the script below provides a way.

The script uses xp_cmdshell to run a PowerShell command to get data for ALL the volumes on your system. PowerShell formats this data as XML, which we capture in the @output table variable – one row per line of XML. We then convert this back to a single string and store it as XML in the variable @x, before shredding the XML into a table format. This is a bit of a hack – sys.dm_os_volume_stats is a cleaner method but doesn’t give you data for all drives. As a bonus, you get more detailed data about those drives: file system, allocation unit size and more.
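
Here is a sketch of the approach. It assumes xp_cmdshell is enabled, and the Win32_Volume properties selected are just the ones I find most useful:

/* Capture volume data for ALL drives via xp_cmdshell + PowerShell */
DECLARE @output TABLE (ID INT IDENTITY(1,1), line NVARCHAR(4000));
DECLARE @cmd NVARCHAR(4000) =
      N'powershell.exe -NoProfile -Command "Get-CimInstance Win32_Volume | '
    + N'Select-Object Name, Capacity, FreeSpace, FileSystem, BlockSize | '
    + N'ConvertTo-Xml -As String -NoTypeInformation"';

INSERT INTO @output (line)
EXEC master.sys.xp_cmdshell @cmd;

/* Re-assemble the output lines into a single XML document.
   The XML declaration line is skipped to avoid an encoding mismatch. */
DECLARE @x XML;
SELECT @x = CAST((SELECT line AS [text()]
                  FROM @output
                  WHERE line IS NOT NULL AND line NOT LIKE '<?xml%'
                  ORDER BY ID
                  FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)') AS XML);

/* Shred the XML into rows - one per volume */
SELECT o.n.value('Property[@Name="Name"][1]', 'NVARCHAR(256)')                          AS VolumeName,
       TRY_CAST(o.n.value('Property[@Name="Capacity"][1]', 'NVARCHAR(40)') AS BIGINT)   AS CapacityBytes,
       TRY_CAST(o.n.value('Property[@Name="FreeSpace"][1]', 'NVARCHAR(40)') AS BIGINT)  AS FreeSpaceBytes,
       o.n.value('Property[@Name="FileSystem"][1]', 'NVARCHAR(128)')                    AS FileSystem,
       TRY_CAST(o.n.value('Property[@Name="BlockSize"][1]', 'NVARCHAR(40)') AS BIGINT)  AS AllocationUnitSize
FROM @x.nodes('/Objects/Object') AS o(n);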

A similar technique could be used to capture the output of other PowerShell commands in T-SQL. It might not always be a good idea, but I think it’s an interesting technique to demo.

How to handle SQL Agent Jobs with Always On Availability Groups?

Introduction

When you configure your SQL Servers for high availability, you want failover to be as seamless as possible. If you use availability groups, mirroring or log shipping as part of your high availability strategy, you will need to take care of anything that lives outside the context of your databases. That might include logins, SQL Server Agent jobs and ensuring consistent server-level configuration across your availability nodes. This article will discuss how to handle SQL Server Agent jobs.

Not so seamless failover of agent jobs

If you have any SQL Agent jobs that run in the context of your user databases, these will need special handling to ensure a seamless failover. Ideally, Microsoft would have added some improvements to the SQL Server Agent to help with this. Maybe given us agent jobs that can live within the context of a user database instead of msdb?

As it stands you will need to keep your agent jobs in sync between the primary and secondaries and figure out a mechanism to activate the jobs on the new primary following failover and disable them on the old primary.

A Solution

There are a few ways to handle this, but my approach is to run all the jobs in the context of the master database. I then check if the user database is accessible and execute the command using a three-part name. See the script below for an example. This will work for availability groups, mirroring and log shipping.
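
A minimal sketch of a job step using this pattern is shown below. The database and procedure names are placeholders:

/* Runs in the context of master on every replica. Work is only done
   when MyUserDb is online and writable here (i.e. this is the primary). */
IF DATABASEPROPERTYEX('MyUserDb', 'Status') = 'ONLINE'
   AND DATABASEPROPERTYEX('MyUserDb', 'Updateability') = 'READ_WRITE'
BEGIN
    EXEC MyUserDb.dbo.MyMaintenanceProc;   /* three-part name */
END
ELSE
BEGIN
    PRINT 'MyUserDb is not the writable primary on this instance - nothing to do.';
END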

With this approach you can create your jobs and have them active on your primary and secondaries. You don’t need to worry about enabling/disabling jobs on failover – the jobs are already running, but they only do any work if the database is primary. This makes the failover totally seamless. The downside to this approach is that the jobs will report success even if they didn’t do any work, which could cause confusion.

The next problem is keeping the jobs in sync between your servers. You can script out the jobs using SSMS easily enough, but this can get tedious if you have a large number of jobs. I’ve created a PowerShell script to help with this, which you can find below. You can use it to report whether your jobs are out of sync and also to perform a synchronization. See the comments at the top of the script for some usage examples.