Max worker thread exceeded after manual AG failover
We have two SQL Server 2016 SP2 CU6 Enterprise instances running on physical hardware (32 cores) with one AG that contains 5 databases. The secondary is not readable. Today we did a planned manual failover so we could do maintenance work on the other server. We did the failover at a moment when there was no heavy load.
The failover was executed through the SSMS GUI and gave no errors, but several minutes later we got a lot of phone calls from users who were unable to log in.
Our first troubleshooting attempt was to make a connection with SSMS, but this failed with a connection error (I don't remember the exact error message).
Next we looked at the errorlog file in Notepad, and it looked as if the SQL Server engine had hung: no new entries had been added to the log after the failover.
Due to the high urgency of the problem, we restarted the SQL Server service on the primary and the problem disappeared.
We could have tried making a DAC connection to see what the problem was, but the objective was to bring the server back online as soon as possible.
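For anyone hitting this in the future, a minimal sketch of how the DAC could have been used to see what was going on (the server name is a placeholder): connect with the dedicated admin connection, e.g. sqlcmd -S admin:MyServer -E (or sqlcmd -A for the local default instance), then check what the other requests are waiting on:
SELECT r.session_id, r.status, r.command, r.wait_type, r.wait_time, r.blocking_session_id
FROM sys.dm_exec_requests AS r
WHERE r.session_id <> @@SPID
ORDER BY r.wait_time DESC;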
After that everything was back online, we started analyzing the log files.
In the errorlog on the old primary we found several occurrences of error 35278. No long-running transactions were running at the time of the failover.
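For reference, the message text of error 35278 can be looked up in sys.messages (1033 is the English language id):
SELECT message_id, severity, text
FROM sys.messages
WHERE message_id = 35278 AND language_id = 1033;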
Next was the AlwaysOn_health event file; here the entries just stopped after the failover. Then we took a look at the *_*_SQLDIAG_*_*.xel file, and here we were lucky.
The first thing we noticed was:
<queryProcessing maxWorkers="960" workersCreated="1064"
workersIdle="23" tasksCompletedWithinInterval="19" pendingTasks="34"
oldestPendingTaskWaitingTime="1316776"
For some reason the maximum number of workers was exceeded, which could explain why no one could connect.
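As a side note, the configured worker limit and the workers actually in use can be compared with a query like this (a rough sketch, not something we ran at the time):
SELECT
(SELECT max_workers_count FROM sys.dm_os_sys_info) AS max_workers,
(SELECT COUNT(*) FROM sys.dm_os_workers) AS current_workers,
(SELECT SUM(work_queue_count) FROM sys.dm_os_schedulers WHERE status = 'VISIBLE ONLINE') AS queued_tasks;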
These were the pending tasks:
<pendingTasks>
<entryPoint name="Process Command" count="14" />
<entryPoint name="SNI New Connection" count="2" />
<entryPoint name="SNI Accept Done" count="3" />
<entryPoint moduleName="sqlmin.dll" imageBase="0x7ff827e80000" size="0x251e000" address="0x7ff8286622e0" count="6" />
<entryPoint moduleName="sqlmin.dll" imageBase="0x7ff827e80000" size="0x251e000" address="0x7ff8292b1190" count="2" />
<entryPoint moduleName="sqldk.dll" imageBase="0x7ff827120000" size="0x4c9000" address="0x7ff827125e60" count="7" />
</pendingTasks>
There were also several blocked-process entries in that file. All of them had a waitresource like DATABASE: 5:5, DATABASE: 5:15, DATABASE: 5:26, ...
All of those processes had lastbatchstarted and lastbatchcompleted set to 1900-01-01. The waittime was equal to the time that had elapsed since the failover.
In our monitoring we saw there was a total of 950 blocked processes.
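For reference, a similar count of blocked sessions can be pulled straight from the DMVs with something along these lines (a sketch, not our monitoring tool's exact query):
SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time, r.wait_resource, s.host_name, s.program_name
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s ON s.session_id = r.session_id
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;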
This is a sample of a blocked-process-report:
<blocked-process-report monitorLoop="0">
<blocked-process>
<process id="process13f08032ca8" taskpriority="0" logused="10000" waitresource="DATABASE: 5:3 " waittime="1334205" schedulerid="4" kpid="11752" status="suspended" spid="95" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="1900-01-01T00:00:00" lastbatchcompleted="1900-01-01T00:00:00" lastattention="1900-01-01T00:00:00" clientapp="our_app" hostname="host2021" hostpid="14196" loginname="user1" isolationlevel="read committed (2)" xactid="0" currentdb="1" currentdbname="master" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack />
<inputbuf>
</inputbuf>
</process>
</blocked-process>
<blocking-process>
<process status="suspended" waitresource="DATABASE: 5:3 " waittime="1335893" spid="70" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="1900-01-01T00:00:00" lastbatchcompleted="1900-01-01T00:00:00" lastattention="1900-01-01T00:00:00" clientapp="our_app" hostname="host1" hostpid="1324" loginname="user1" isolationlevel="read committed (2)" xactid="0" currentdb="1" currentdbname="master" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack />
<inputbuf>
</inputbuf>
</process>
</blocking-process>
</blocked-process-report>
<blocked-process-report monitorLoop="0">
<blocked-process>
<process id="process13f08033848" taskpriority="0" logused="10000" waitresource="DATABASE: 5:3 " waittime="1335893" schedulerid="4" kpid="12004" status="suspended" spid="70" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="1900-01-01T00:00:00" lastbatchcompleted="1900-01-01T00:00:00" lastattention="1900-01-01T00:00:00" clientapp="our_app" hostname="host1" hostpid="1324" loginname="user1" isolationlevel="read committed (2)" xactid="0" currentdb="1" currentdbname="master" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack />
<inputbuf>
</inputbuf>
</process>
</blocked-process>
<blocking-process>
<process status="background" waittime="1139955" spid="141" sbid="0" ecid="0" priority="0" trancount="0">
<executionStack />
<inputbuf>
</inputbuf>
</process>
</blocking-process>
</blocked-process-report>
One of the blocking processes was executed as user SA and had command UNKNOWN TOKEN. The last_wait_type of this session was PARALLEL_REDO_WORKER_WAIT_WORK and its last_request_start_time was equal to the time we did the failover.
I didn't expect to find this wait type on the primary. Is it correct to say that PARALLEL_REDO_WORKER_WAIT_WORK shouldn't be expected on a primary?
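For anyone who wants to check this on their own instance, the sessions currently sitting in that wait can be listed like this (a sketch; I would expect it to return nothing on a healthy primary):
SELECT r.session_id, r.command, r.status, r.wait_type, r.last_wait_type, r.wait_time
FROM sys.dm_exec_requests AS r
WHERE r.wait_type = 'PARALLEL_REDO_WORKER_WAIT_WORK'
OR r.last_wait_type = 'PARALLEL_REDO_WORKER_WAIT_WORK';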
Also, in our monitoring we found approximately 1000 sessions that had the status SLEEPING and a last_request_start_time of 01/01/1900. That doesn't seem normal at all.
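Those sessions can be pulled out of sys.dm_exec_sessions directly, for example (a sketch, grouped by host and application to see where they come from):
SELECT s.host_name, s.program_name, s.login_name, COUNT(*) AS sleeping_sessions
FROM sys.dm_exec_sessions AS s
WHERE s.status = 'sleeping' AND s.last_request_start_time = '1900-01-01'
GROUP BY s.host_name, s.program_name, s.login_name
ORDER BY COUNT(*) DESC;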
Can someone explain what those pending tasks are?
My guess at the root cause is the ~1000 sessions with status SLEEPING: they all used a worker. This caused the pending tasks, and due to the blocking process no workers could do any work, which caused the failover to get stuck somewhere. Can this be correct?
sql-server-2016 availability-groups troubleshooting
asked May 8 at 9:25 by Frederik Vanderhaegen (edited May 8 at 12:18)
2 Answers
Due to the high urgency of the problem, we restarted the SQL Server service on the primary and the problem disappeared.
Since no investigation or gathering of data was completed, it's going to be hard to say exactly what was going on at the time.
For some reason the maximum number of workers was exceeded, which could explain why no one could connect.
Yes, for some reason and we don't have any data around that reason :( Looks like the server has 32 processors, which should easily be enough to handle 5 databases in an availability group. There is more to this that we either don't know or you aren't explaining.
Is it correct to say that PARALLEL_REDO_WORKER_WAIT_WORK shouldn't be expected on a primary?
Since we don't know the database context for this, it's impossible to say. I will say that if there is only a single availability group and this instance was the primary, then no, it would not be expected. If, however, there are other availability groups we're not aware of and this instance is a secondary for one of them, then yes, it is expected. Again, we don't have all the information. This also assumes that there was no redo queue that needed to be drained on any of the databases - which, again, we don't know.
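As a rough sketch, whether any local database still had redo work outstanding could have been checked like this:
SELECT DB_NAME(drs.database_id) AS database_name,
drs.is_local,
drs.synchronization_state_desc,
drs.redo_queue_size, -- KB of log remaining to be redone
drs.redo_rate -- KB/sec redo throughput
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;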
Can someone explain what those pending tasks are?
The Process Command entries are tasks attempting to execute something on a connection. The SNI* entries are new connections being made.
There was also a blocking-process, after checking our monitoring, this process was executed as user SA and had command UNKNOWN TOKEN.
How many sessions was it blocking? What was it blocking? etc. Again, lack of information so can't say if it's something or not.
Also, in our monitoring we found approximately 1000 sessions that had the status SLEEPING and a last_request_start_time of 01/01/1900. That doesn't seem normal at all.
Where were they coming from? What did they last run? It might be normal if you have a secondary replica that is readable and someone wanted to run a bunch of reports or queries. Again, need more information.
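For what it's worth, the last statement a sleeping session ran can be pulled from its connection, e.g.:
SELECT s.session_id, s.host_name, s.program_name, t.text AS most_recent_sql
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_exec_connections AS c ON c.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(c.most_recent_sql_handle) AS t
WHERE s.status = 'sleeping';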
Can this be correct?
It could be. Obviously you ran out of worker threads, how and why that happened is unknown as there just isn't enough data.
In our monitoring we saw there was a total of 950 blocked processes.
There's your underlying problem. Looks like clientapp="our_app" was connected (it's all in master) and was doing some type of querying of the database. It seems the database wasn't yet done with redo (hence the wait type) and all of your sessions backed up on each other.
answered May 8 at 10:34 by Sean Gallardy (edited May 8 at 13:35)
I updated my question with extra info. – Frederik Vanderhaegen, May 8 at 12:07
@FrederikVanderhaegen added most probable explanation given the new data. – Sean Gallardy, May 8 at 13:36
We have faced the max worker thread issue a number of times in our environment, and the only reason has been blocking.
Since you have not recorded the blocking sessions, it is hard for us to find the root cause.
Furthermore, you can set up a SQL Agent blocking alert with a dispatch latency of more than 5 seconds and attach a job to capture blocking data.
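One way to capture that data (a sketch that uses the blocked process threshold and an Extended Events session rather than an Agent alert; the session name is made up):
EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
-- emit a blocked_process_report for anything blocked longer than 5 seconds
EXEC sp_configure 'blocked process threshold (s)', 5; RECONFIGURE;
CREATE EVENT SESSION capture_blocking ON SERVER
ADD EVENT sqlserver.blocked_process_report
ADD TARGET package0.event_file (SET filename = N'capture_blocking.xel');
ALTER EVENT SESSION capture_blocking ON SERVER STATE = START;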
We have a highly OLTP environment, and a single block on a main table can push the worker thread count to 1000 in less than a minute. We have increased max worker threads to 5000 to get enough time to investigate and kill the blocking sessions, because we don't want to restart the service.
We found triggers and long-running queries to be the culprits; they hold table locks on heavily used tables and increase the number of blocked sessions, and at a certain point SQL Server becomes unresponsive. In this situation you either take a DAC session and kill the head blocking session, or restart the SQL service.
Note that Microsoft recommends not changing max worker threads, for a number of reasons. First set up the blocking alert and fix the queries to avoid this type of issue.
answered May 8 at 12:03 by Vinod Narwal
I added some blocking information to my question, maybe this is helpful. – Frederik Vanderhaegen, May 8 at 12:19