There is a lot of information about how SQL Server works: documentation, books, blogs, etc. But sometimes you can’t find the information you need. The Windows Debugger (WinDbg) and public debug symbols might be helpful in such cases. I usually use them to learn how the Query Processor works. However, this post is not about query processing.
In this post I’m going to share a real-world example from my recent practice where some knowledge of the internals helped me reveal the problem and make a quick fix. I hope you find this small investigation interesting.
The Problem
One of our production servers generated a dump file and failed over to the secondary from time to time, typically 3-5 times a day. While the DBA contacted support, I was asked to investigate.
The Investigation
SQL Server generates a dump file, like SQLDumpXXXX.mdmp, as well as a text description file, SQLDumpXXXX.txt. I took the latter for analysis. I examined the last 10 or so txt dump files and found that the error was always the same: an EXCEPTION_ACCESS_VIOLATION at the same memory address, sqllang+00000000015BC855, with the same call stack dump.
The stack dump is the curious part, because it shows the chain of methods that led to the error. It looks like this:
* Short Stack Dump
00007FFE5330C855 Module(sqllang+00000000015BC855)
00007FFE5330CD0C Module(sqllang+00000000015BCD0C)
00007FFE5450830F Module(sqlmin+00000000001F830F)
00007FFE544871D5 Module(sqlmin+00000000001771D5)
00007FFE54488212 Module(sqlmin+0000000000178212)
00007FFE5448816A Module(sqlmin+000000000017816A)
00007FFE5447FDAB Module(sqlmin+000000000016FDAB)
00007FFE5447C3A9 Module(sqlmin+000000000016C3A9)
00007FFE544687D9 Module(sqlmin+00000000001587D9)
00007FFE5435C70E Module(sqlmin+000000000004C70E)
00007FFE5324AB24 Module(sqllang+00000000014FAB24)
00007FFE5324CF57 Module(sqllang+00000000014FCF57)
…
The important part is that all the dumps contained the same stack with the same module-relative addresses.
To get an idea of what is going on, we may try to convert each relative address, Module(<sqlmodule>+hex), into a SQL Server method name. The method’s name will probably be descriptive enough to give us some information. This is where WinDbg with public symbols can help.
You may download WinDbg here, and then download the public symbols: you may follow the WinDbg documentation, or there is an example of how to do it from Paul Randal; adapt it to your folders and version.
You should not debug a production server, because breaking in with WinDbg suspends the whole process, so make sure you have a test server with the exact same SQL Server version as production (in my case it was 13.0.4474.0).
Once WinDbg is set up properly, you may attach to the process of the test SQL Server instance (if you have several, be sure to select the proper one by checking the executable path). Now you are ready to decode the addresses into method names.
To decode them, I preferred to set a breakpoint on each of the call stack addresses (not the only possible way, but a very illustrative one).
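With symbols loaded, WinDbg can resolve a module-relative address to the nearest public symbol. Below is a minimal sketch of such a session: the symbol cache path is an assumption (adjust to your folders), and the offset is the faulting one from the dumps.

```
$$ point WinDbg at the Microsoft public symbol server (cache path is an assumption)
0:000> .sympath srv*C:\symbols*https://msdl.microsoft.com/download/symbols
0:000> .reload /f

$$ ln (list nearest) resolves an address expression to the closest symbol
0:000> ln sqllang+0x15BC855

$$ alternatively, set a breakpoint there; bl lists it with the resolved name
0:000> bp sqllang+0x15BC855
0:000> bl
```

Either way, the address resolves to a method name, as in the list that follows.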
Pay attention to the methods marked 1, 2, 3, 4:
- 1) HkHostAllocator::~HkHostAllocator
- 2) GetHkTableSnapshotRowModCount
- 3) CTVFMissingIndexDetails::GetRow
- 4) CQScanXProducer::GetRow
Due to the nature of a stack, to understand the control flow we should read it bottom-up.
4) CQScanXProducer::GetRow is a method that gets a row for an exchange producer, a component of the SQL Server exchange iterator, which is used in parallel plans.
3) CTVFMissingIndexDetails::GetRow is a method that gets a row from the DMV sys.dm_db_missing_index_details; we may verify this using WinDbg.
1, 2) Hk is a common prefix for “Hekaton”, the code name for In-Memory OLTP. The method marked 2 gets a modification counter, and method 1 (where the exception occurs) is a destructor for some memory allocator.
In the dump file I also checked the CSession class field m_fBackground – it equals 1.
CSession @0x000000D0203BC3C0
—————————-
m_sessionId = 36 m_cRef = 55 m_rgcRefType[0] = 1
m_rgcRefType[1] = 0 m_rgcRefType[2] = 27 m_rgcRefType[3] = 26
m_rgcRefType[4] = 0 m_rgcRefType[5] = 1 m_pmo = 0x000000D0203BC040
m_pstackBhfPool = 0x0000000000000000 m_dwLoginFlags = 0x0000 m_fBackground = 1
…
To sum it all up:
The error is produced by some background task that runs a query complex enough to get a parallel plan, and at the moment of the crash it is asking for missing index details and row modification counters for in-memory tables.
That is a good point to start the investigation! =)
The Culprit
Do you know how many missing index recommendations there are in your system right now? And what the possible maximum is? To be honest, I had never thought about it. And I was surprised to learn that the maximum is 600 indexes. More to the point, there is an automatic purging task that checks a threshold of 500 missing indexes and, if there are more, starts purging them, removing 20%.
This task runs every 2 minutes, and to find the least useful missing index recommendations it ranks the existing ones using a fairly complex query. I found this query here: How true is it that Missing Indexes feature cannot gather stats for more than 500 missing index groups?
Here it is (original formatting):
select tttt.database_id, tttt.object_id, tttt.group_handle from ( select tt.database_id, tt.object_id, tt.group_handle, tt.index_handle, ntile(100) over (order by metrics) as nt from (Select mid.database_id, mid.object_id, mig.index_group_handle as group_handle, mig.index_handle as index_handle, migs_adv.index_advantage/(sqrt(migs_adv.ages)+1) as metrics from ( select (user_seeks + user_scans)* avg_total_user_cost * avg_user_impact as index_advantage, DATEDIFF(hour, case when (last_user_seek IS NULL and last_user_scan IS NULL) then convert (datetime, N'1/1/2000', 101) when (last_user_seek IS NULL and last_user_scan IS NOT NULL) then last_user_scan when (last_user_seek IS NOT NULL and last_user_scan IS NULL) then last_user_seek when (last_user_seek IS NOT NULL and last_user_scan IS NOT NULL) then CASE when (last_user_seek > last_user_scan) then last_user_seek else last_user_scan end end, getdate()) as ages, migs.* from sys.dm_db_missing_index_group_stats migs ) as migs_adv, sys.dm_db_missing_index_groups mig, sys.dm_db_missing_index_details mid where migs_adv.group_handle = mig.index_group_handle and mig.index_handle = mid.index_handle) as tt) as tttt where nt <= 20
I ran this query in the context of our database, which has In-Memory OLTP tables, and got a parallel plan. Then I checked whether the server has missing index recommendations on in-memory tables – it does, four of them. I also monitored the missing index group count: all the dumps were generated when there were between 500 and 600 index groups – the interval where the purging task acts.
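The monitoring itself needs nothing more than the documented DMVs. Here is a hedged T-SQL sketch of the two checks described above (run it in the context of the database that has the in-memory tables):

```sql
-- How many missing index groups exist server-wide?
-- In this case the dumps appeared while the count was between
-- 500 and 600, the window in which the purging task acts.
SELECT COUNT(*) AS missing_index_groups
FROM sys.dm_db_missing_index_groups;

-- Missing index recommendations on memory-optimized tables
-- in the current database.
SELECT mid.statement AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.tables AS t
    ON t.object_id = mid.object_id
WHERE mid.database_id = DB_ID()
  AND t.is_memory_optimized = 1;
```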
The Proof
I could not come up with a repro; there must be some specific conditions for that exception. For example, I found the following (fixed) SQL Server bug involving the purging task for missing indexes on regular tables:
Symptoms
Consider the following scenario:
- You delete a column from a table in Microsoft SQL Server 2008 R2.
- There is at least one missing index recommendation on this column.
- You try to run some queries against the missing index dynamic management views (DMVs), or there is a background purging task that queries the missing index DMVs.
In this scenario, an access violation may occur.
That was not the scenario in this case, but probably something similar, so what I did was create all the missing indexes.
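As a starting point for such a fix, candidate CREATE INDEX statements can be generated from the same DMVs. This is only a mechanical sketch – the index names are made up from the handle, so review each statement before running it – and note that on memory-optimized tables (SQL Server 2016+) an index is added with ALTER TABLE … ADD INDEX rather than CREATE INDEX:

```sql
-- Generate candidate CREATE INDEX statements from the missing
-- index DMVs for the current database. Review before executing;
-- the names (IX_missing_<handle>) are purely mechanical.
SELECT 'CREATE INDEX IX_missing_'
       + CONVERT(varchar(10), mid.index_handle)
       + ' ON ' + mid.statement
       + ' (' + ISNULL(mid.equality_columns, '')
       + CASE WHEN mid.equality_columns IS NOT NULL
                   AND mid.inequality_columns IS NOT NULL
              THEN ', ' ELSE '' END
       + ISNULL(mid.inequality_columns, '') + ')'
       + ISNULL(' INCLUDE (' + mid.included_columns + ')', '')
       AS create_index_statement
FROM sys.dm_db_missing_index_details AS mid
WHERE mid.database_id = DB_ID();
```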
After that, not a single dump for half a month (compared to 3-5 dumps a day)!
Later, when new queries generated new missing index recommendations, new dumps appeared (for 2 days). After creating those indexes – again, no dumps (up to now).
I think this is an empirical way of proving the case.
That is all for this investigation. I hope I have encouraged you not to be afraid of SQL Server internals and shown that this is not just academic knowledge – it is also fun. =)
P.S.
Answering probable questions: yes, the production server is stable now, and yes, we sent this case to support. =)
Hope you enjoyed it, thank you for reading.