Max worker thread exceeded after manual AG failover


We have two SQL Server 2016 SP2 CU6 Enterprise instances running on physical hardware (32 cores each) with one availability group (AG) that contains 5 databases. The secondary is not readable. Today we did a planned manual failover so we could do maintenance work on the other server. We performed the failover at a moment when there was no heavy load.



The failover was executed through the SSMS GUI and gave no errors. But several minutes later we got a lot of phone calls that users were unable to log in.
Our first troubleshooting attempt was to connect with SSMS, but this failed with an error saying we were unable to connect; I don't remember the exact error message.



Next we looked at the errorlog file in Notepad, and it looked as if the SQL Server engine had hung: no new entries had been added to the log after the failover.


Due to the high urgency of the problem, we restarted the SQL Server service on the primary and the problem disappeared.

We could have tried making a DAC connection to see what the problem was, but the objective was to bring the server back online as soon as possible.
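For reference, this is roughly what that check would have looked like from a DAC session (opened with sqlcmd -A, for example); a minimal sketch, not something we actually ran at the time:

-- From the DAC: list waiting requests and who is blocking them
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.wait_resource,
       r.command
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;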



Once everything was back online, we started analyzing the log files.

In the errorlog on the old primary we found several occurrences of error 35278. No long-running transactions were active at the time of the failover.


Next was the AlwaysOn_health event file; here the entries simply stopped after the failover.


Next we took a look at the *_*_SQLDIAG_*_*.xel file. Here we were lucky.



The first thing we noticed was:



<queryProcessing maxWorkers="960" workersCreated="1064"
workersIdle="23" tasksCompletedWithinInterval="19" pendingTasks="34"
oldestPendingTaskWaitingTime="1316776"


For some reason the maximum number of workers was exceeded, which could explain why no one could connect.
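As a side note, 960 is the computed default for a 64-bit instance with 32 logical CPUs (512 + (32 - 4) * 16 = 960), which matches the maxWorkers value above. Had we been able to connect, the configured ceiling and the scheduler workload could have been compared directly; a minimal sketch, the column selection is ours and not taken from the SQLDIAG file:

-- Configured maximum worker threads (0 = computed default)
SELECT max_workers_count
FROM sys.dm_os_sys_info;

-- Worker usage and queued tasks per scheduler
SELECT scheduler_id,
       current_workers_count,
       active_workers_count,
       work_queue_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';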



These were the pending tasks:



<pendingTasks>
<entryPoint name="Process Command" count="14" />
<entryPoint name="SNI New Connection" count="2" />
<entryPoint name="SNI Accept Done" count="3" />
<entryPoint moduleName="sqlmin.dll" imageBase="0x7ff827e80000" size="0x251e000" address="0x7ff8286622e0" count="6" />
<entryPoint moduleName="sqlmin.dll" imageBase="0x7ff827e80000" size="0x251e000" address="0x7ff8292b1190" count="2" />
<entryPoint moduleName="sqldk.dll" imageBase="0x7ff827120000" size="0x4c9000" address="0x7ff827125e60" count="7" />
</pendingTasks>


There were also several blocked processes in that file. All of them had a waitresource like DATABASE: 5:5, DATABASE: 5:15, DATABASE: 5:26, ...

All of those processes had lastbatchstarted and lastbatchcompleted set to 1900-01-01. The waittime was roughly equal to the time elapsed since the failover.
In our monitoring we saw a total of 950 blocked processes.
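That count came from our monitoring tool; roughly the same number can be taken live from the DMVs. A sketch, not the query the tool actually runs:

-- Count blocked requests per wait resource (e.g. DATABASE: 5:3)
SELECT r.wait_resource,
       COUNT(*) AS blocked_requests
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
GROUP BY r.wait_resource
ORDER BY COUNT(*) DESC;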



This is a sample blocked-process report:



<blocked-process-report monitorLoop="0">
<blocked-process>
<process id="process13f08032ca8" taskpriority="0" logused="10000" waitresource="DATABASE: 5:3 " waittime="1334205" schedulerid="4" kpid="11752" status="suspended" spid="95" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="1900-01-01T00:00:00" lastbatchcompleted="1900-01-01T00:00:00" lastattention="1900-01-01T00:00:00" clientapp="our_app" hostname="host2021" hostpid="14196" loginname="user1" isolationlevel="read committed (2)" xactid="0" currentdb="1" currentdbname="master" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack />
<inputbuf>
</inputbuf>
</process>
</blocked-process>
<blocking-process>
<process status="suspended" waitresource="DATABASE: 5:3 " waittime="1335893" spid="70" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="1900-01-01T00:00:00" lastbatchcompleted="1900-01-01T00:00:00" lastattention="1900-01-01T00:00:00" clientapp="our_app" hostname="host1" hostpid="1324" loginname="user1" isolationlevel="read committed (2)" xactid="0" currentdb="1" currentdbname="master" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack />
<inputbuf>
</inputbuf>
</process>
</blocking-process>
</blocked-process-report>
<blocked-process-report monitorLoop="0">
<blocked-process>
<process id="process13f08033848" taskpriority="0" logused="10000" waitresource="DATABASE: 5:3 " waittime="1335893" schedulerid="4" kpid="12004" status="suspended" spid="70" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="1900-01-01T00:00:00" lastbatchcompleted="1900-01-01T00:00:00" lastattention="1900-01-01T00:00:00" clientapp="our_app" hostname="host1" hostpid="1324" loginname="user1" isolationlevel="read committed (2)" xactid="0" currentdb="1" currentdbname="master" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack />
<inputbuf>
</inputbuf>
</process>
</blocked-process>
<blocking-process>
<process status="background" waittime="1139955" spid="141" sbid="0" ecid="0" priority="0" trancount="0">
<executionStack />
<inputbuf>
</inputbuf>
</process>
</blocking-process>
</blocked-process-report>





One of the blocking processes was executed as user SA and had the command UNKNOWN TOKEN. The last_wait_type of this session was PARALLEL_REDO_WORKER_WAIT_WORK, and its last_request_start_time was equal to the time we did the failover.

I didn't expect to find this wait type on the primary.

Is it correct to say that PARALLEL_REDO_WORKER_WAIT_WORK shouldn't be expected on a primary?



Also, in our monitoring we found approximately 1000 sessions that had the status SLEEPING and a last_request_start_time of 01/01/1900.
That doesn't seem normal at all.
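This is roughly how such sessions can be listed from the DMVs; a sketch, not the exact query our monitoring tool uses:

-- Sessions that are sleeping and have never reported a request start time
SELECT s.session_id,
       s.status,
       s.last_request_start_time,
       s.host_name,
       s.program_name,
       s.login_name
FROM sys.dm_exec_sessions AS s
WHERE s.status = 'sleeping'
  AND s.last_request_start_time = '19000101';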



Can someone explain what those pending tasks are?

My guess at the root cause is the ~1000 sessions with status SLEEPING: each of them held a worker.
That produced the pending tasks, and because of the blocking no worker could get any work done, so the failover got stuck somewhere along the way.
Can this be correct?










      sql-server-2016 availability-groups troubleshooting






asked May 8 at 9:25 by Frederik Vanderhaegen, edited May 8 at 12:18




















2 Answers





















Due to the high urgency of the problem, we restarted the SQL Server service on the primary and the problem disappeared.




          Since no investigation or gathering of data was completed, it's going to be hard to say exactly what was going on at the time.




For some reason the maximum number of workers was exceeded, which could explain why no one could connect.




          Yes, for some reason and we don't have any data around that reason :( Looks like the server has 32 processors, which should easily be enough to handle 5 databases in an availability group. There is more to this that we either don't know or you aren't explaining.




          Is it correct to say that PARALLEL_REDO_WORKER_WAIT_WORK shouldn't be expected on a primary?




Since we don't know the database context for this, it's impossible to say. I will say that if there is only a single availability group and this instance was the primary, then no, it would not be expected. If, however, there are other availability groups that we're not aware of and this is a secondary, then yes, it is expected. Again, we don't have all the information. This also assumes that there was no redo queue that needed to be drained on any of the databases, which again, we don't know or don't have that information.




          Can someone explain what those pending tasks are?




The Process Command entries are tasks attempting to execute something on a connection. The SNI* entries are new connections being made.




          There was also a blocking-process, after checking our monitoring, this process was executed as user SA and had command UNKNOWN TOKEN.




          How many sessions was it blocking? What was it blocking? etc. Again, lack of information so can't say if it's something or not.




Also, in our monitoring we found approximately 1000 sessions that had the status SLEEPING and a last_request_start_time of 01/01/1900.
That doesn't seem normal at all.




          Where were they coming from? What did they last run? It might be normal if you have a secondary replica that is readable and someone wanted to run a bunch of reports or queries. Again, need more information.
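One way to answer those two questions on a live system is to join the sleeping sessions to their most recent statement; a sketch under the assumption the sessions still exist when you run it:

-- Last statement submitted on each sleeping session
SELECT s.session_id,
       s.host_name,
       s.program_name,
       s.last_request_start_time,
       t.text AS last_sql_text
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_exec_connections AS c
    ON c.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(c.most_recent_sql_handle) AS t
WHERE s.status = 'sleeping';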




          Can this be correct?




It could be. You obviously ran out of worker threads; how and why that happened is unknown, as there just isn't enough data.




          In our monitoring we saw there was a total of 950 blocked processes.




          There's your underlying problem. Looks like clientapp="our_app" was connected (it's all in master) and was doing some type of querying of the database. It seems the database wasn't yet done doing redo (hence the wait type) and all of your sessions backed up on each other.
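If this happens again, the redo backlog can be checked directly on the new primary while the incident is in progress; a sketch using the HADR DMV, with an illustrative column selection:

-- Per-database redo backlog for the local replica
SELECT DB_NAME(drs.database_id)        AS database_name,
       drs.is_primary_replica,
       drs.synchronization_state_desc,
       drs.redo_queue_size,            -- KB of log not yet redone
       drs.redo_rate                   -- KB redone per second
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;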






answered May 8 at 10:34 by Sean Gallardy, edited May 8 at 13:35

























          • I updated my question with extra info.

            – Frederik Vanderhaegen
            May 8 at 12:07











          • @FrederikVanderhaegen added most probable explanation given the new data.

            – Sean Gallardy
            May 8 at 13:36
































We have faced the max worker thread issue a number of times in our environment, and the only reason has been blocking.
Since you have not recorded any blocking sessions, it is hard for us to pin down the root cause.
You can set up a SQL Agent blocking alert with a dispatch latency of >5 seconds and attach a job to capture blocking data.
We have a highly OLTP environment, and a single blocking chain on a main table can drive the worker thread count past 1000 in less than a minute. We increased max worker threads to 5000 to get enough time to investigate and kill the blocking sessions, because we don't want to restart the service.



We found triggers and long-running queries to be the culprits: they held table locks on heavily used tables and increased the number of blocked sessions, until at a certain point SQL Server became unresponsive. In that situation you either open a DAC session and kill the head blocking session, or restart the SQL Server service.



Note that Microsoft recommends not changing max worker threads, for a number of reasons. First set up the blocking alert and fix the offending queries to avoid this type of issue.
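A minimal sketch of the settings mentioned above; the 20-second threshold, the event session name, and the file target name are assumptions, and raising max worker threads should be weighed against Microsoft's advice to leave it at the default:

-- Emit blocked-process reports once a request has been blocked for 20 seconds
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'blocked process threshold (s)', 20;
RECONFIGURE;

-- Capture those reports in an Extended Events file target
CREATE EVENT SESSION blocked_process_capture ON SERVER
ADD EVENT sqlserver.blocked_process_report
ADD TARGET package0.event_file (SET filename = N'blocked_process_capture');
ALTER EVENT SESSION blocked_process_capture ON SERVER STATE = START;

-- Optional: raise the worker thread ceiling (0 = computed default)
EXEC sp_configure 'max worker threads', 5000;
RECONFIGURE;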






answered May 8 at 12:03 by Vinod Narwal























• I added some blocking information to my question, maybe this is helpful.

            – Frederik Vanderhaegen
            May 8 at 12:19










