Non-yielding Scheduler Error with dump caused by heavy queries on SQL 2016

Posted: edited May 18 at 18:02 - Source : stackoverflow

After bringing our databases to new servers with new hardware and new os (2012 server) and new SQL Server version (2016 sp1) we face the following issue:

During nightly reporting database update runs, the SQL Server service suddenly gets into trouble. Those report DB updates are performed using merge statements that collect, update and insert delta within multiple steps. The first step of the job started at 01:00 AM and caused around 500 Mio IO READS, but it succeeded. The second step started at 01:31 am but failed. It's all not highly efficient, I know and I also know there are still lots of indexes that have to be created to make those queries and the whole process more efficient. But this is not the point. Nobody cares if it would take some hours.

The point is that so far on the old SQL Server systems, those same procedures took long, but they never made the SQL Server stumble.

Now, we observe the following: The jobs start and run for a while, around 30 minutes and complete some job steps. During 01:35 and 01:55 AM the CPU load gets close to or to 100%. Between 01:55 AM and 02:11 AM the load varies btw 70% an 95%.

In the SQL logs we see those entries btw 01.35 and 01:53.:

**Unable to get thread context for spid 0 BEGIN STACK DUMP:

05/18/17 01:35:02 spid 5416

Non-yielding Scheduler

Stack Signature for the dump is 0x0000000000000266

External dump process return code 0x20000001.

External dump process returned no errors.

Process 62:0:7 (0x3ab4) Worker 0x00000066A0FD8160 appears to be non-yielding on Scheduler 0. Thread creation time: 13139533134164. Approx Thread CPU Used: kernel 0 ms, user 50156 ms. Process Utilization 97%. System Idle 1%. Interval: 73484 ms.

Process 62:0:6 (0x3b64) Worker 0x00000066A0FBC160 appears to be non-yielding on Scheduler 1. Thread creation time: 13139503419122. Approx Thread CPU Used: kernel 0 ms, user 57343 ms. Process Utilization 95%. System Idle 3%. Interval: 78640 ms.

Process 0:0:0 (0x2464) Worker 0x00000066A0E42160 appears to be non-yielding on Scheduler 2. Thread creation time: 13139456436014. Approx Thread CPU Used: kernel 15 ms, user 62515 ms. Process Utilization 94%. System Idle 4%. Interval: 71437 ms.

Process 62:0:5 (0x1688) Worker 0x00000066A0E46160 appears to be non-yielding on Scheduler 3. Thread creation time: 13139225220768. Approx Thread CPU Used: kernel 0 ms, user 53156 ms. Process Utilization 95%. System Idle 3%. Interval: 73437 ms.

SQL Trace ID 2 was started by login "mydomain\mySentryOneServiceUser".

I/O is frozen on database SDB. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.

A time-out occurred while waiting for buffer latch -- type 3, bp 000000751B746240, page 4:218376, stat 0x10f, database id: 2, allocation unit Id: 6488064, task 0x00000066A0FEE8C8 : 5, waittime 300 seconds, flags 0x1000000019, owning task 0x00000066A0DF5848. Not continuing to wait.

At 01:53 the netapp SnapManager tasks start to fail taking snapshots:

Error: 3041, Severity: 16, State: 1. BACKUP failed to complete the command BACKUP DATABASE DB1. Check the backup application log for detailed messages.

Between the same timeslots, Job Agent log shows this:
01:36 AM - 02:08 AM

[298] SQLServer Error: 65535, SQL Server Network Interfaces: Error Locating Server/Instance Specified [xFFFFFFFF]. [SQLSTATE 08001]

[165] ODBC Error: 0, Login timeout expired [SQLSTATE HYT00]

[298] SQLServer Error: 65535, A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online. [SQLSTATE 08001]

[382] Logon to server 'xxx\xxx' failed (ConnUpdateStartExecutionDate)

Most of them repeat multiple times per minute.

This is the CPU spec: INTEL XEON E5-3690 v3 30M 4x 2,6GHz 4 virtual CPUs Software: Microsoft SQL Server Enterprise (64-bit) 13.0.4001.0 Instance Settings: All settings are default, MAXDOP = 4, CTFP = 50, system memory=64 GB, SQL MAX_MEMORY = 58982

CONTENT OF DUMP.txt:

Current time is 01:35:02 05/18/17.                                                                               

BugCheck Dump                                                                                             

This file is generated by Microsoft SQL Server                                                                   
version 13.0.4001.0                                                                                              
upon detection of fatal unexpected error. Please return this file,                                               
the query or program that produced the bugcheck, the database and                                                
the error log, and any other pertinent information with a Service Request.                                       


Computer type is Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz.                                                      
Bios Version is INTEL  - 6040000                                                                                 
PhoenixBIOS 4.0 Release 6.0                                                                                      
4 X64 level 8664, 2 Mhz processor (s).                                                                           
Windows NT 6.2 Build 9200 CSD .                                                                                  

Memory                               
MemoryLoad = 94%                     
Total Physical = 65535 MB            
Available Physical = 3317 MB         
Total Page File = 73535 MB           
Available Page File = 10601 MB       
Total Virtual = 134217727 MB         
Available Virtual = 134041378 MB     
***Unable to get thread context for spid 0                                                                       
* *******************************************************************************                                
*                                                                                                                
* BEGIN STACK DUMP:                                                                                              
*   05/18/17 01:35:02 spid 5416                                                                                  
*                                                                                                                
* Non-yielding Scheduler                                                                                         
*                                                                                                                
* *******************************************************************************                                

What can I do to prevent the system from running into trouble and getting all those error log entries? Is there configuration to check?