It’s another T-SQL Tuesday! I feel like it’s been a while since I’ve joined the party. Thanks to Kerry Tyler (b|t) for hosting this month. (If you want to know more about T-SQL Tuesdays, you can find out more info here .)
Kerry’s challenge for us this month is this:
“Tell me (us all, obviously) about something recently that broke or went wrong, and what it took to fix it. “
When I think about this topic, I start trying to wrack my brain for that amazing disaster story. My best story involves SQL Server 6.5 so I’m pretty sure that’s not going to help anyone. But I also realized that I’m not on the administrative side of databases. I work with programmers and help create those problems that people see in the field. (Hey, wait a minute! Wait, no, that probably is accurate… *sigh*)
When I break something, we usually call it a “bug”. Doesn’t it sound better that way? Sometimes the bugs are small where we didn’t know that the client had made a change to the database before we released the code so the script failed when they ran the installer. Sometimes they seem small but later cause big problems, like enabling a process to send messages to a Service Broker queue by default but not enabling the job to process those messages by default. That meant when the job was enabled after 5 million or so messages were waiting in the queue, it unsurprisingly used a lot of resources that suddenly caused poor performance in the main database, also on that instance. Sometimes the problems are quite obvious – writing a query for a small database that didn’t come close to having the statistics that a very large client database would have so the performance ended up being incredibly slow.
For me, a lot of the problem solving comes from knowing how to troubleshoot to find what is actually going on. For example, your monitoring tool is reporting blocking. Does it happen on a regular basis? Is the blocking always on the same table or action or is on something different? Is something else running on the instance at the same time? Do the end users notice the problem and where in the system do they see the problem? Essentially, it’s asking a lot of questions to make sure I understand what the issue is, looking for the repeatable patterns, trying to replicate the issues, then going through the work to try to find the solution – whatever that may look like.
Once I identified the problem, I can then work on the solution. In many cases, the solutions are easy – add the missing index, make sure explicit transactions are properly committed, rewrite a query. Sometimes the solutions take longer to implement – work with the programmer to move SQL generated from a 3rd party data layer into a explicitly defined SQL statement or a stored procedure, rework the processing of Service Broker messages into different sized batches and allow changes to the job so it only processes data during off hours, rework catch-all queries into stored procedures that use dynamic SQL based on the parameters that are based in. Some of these solutions take minutes and others takes weeks or months. Sometimes I wanted to rework the whole approach but couldn’t. The more I think about it, rewriting the query was the solution in a vast majority of the issues I’ve seen.
I’ve definitely learned how to be a better troubleshooter over the year because of all the different things I’ve seen and have had to fix. For example, I got burned really badly with a poor performing CTE so you better believe I’m going to test the CTE against different syntax to see which works better. I’ve also gotten to learn the different behaviors of different tools or really figured out how to make them work to my advantage. If it’s a simple rewrite to test better performance, I may stick with Management Studio while a major rewrite testing all of the small changes that can affect performance at every step of the way may mean I’ll use Plan Explorer instead. Because I’ve had experience with lots of different small things, I can build on them as I come up with different solutions to the problems I face.
For those who were hoping for a great disaster story or epic fail and solution from me, I’m sorry to disappoint you. Trust me when I say I do make plenty of mistakes and have had a lot of things not work the way they should have. But I guess what I’m trying to say it’s all the small failures and the little mistakes rather than the big disasters along the way that I learned from to keep improving and getting better so I can be better equipped to fix the new bugs tomorrow that I’m creating today.