SQL Server BDC Hints and Tips: The node’s Journal can be your best friend

If you are new to Kubernetes (K8s), like I am, you quickly learn that the servicing node's journal holds a wealth of information.

In the lab I was encountering evictions due to memory, disk space, and other resource limitations. My collection scripts were connecting to the pods and not the nodes. This approach lacks the valuable lineage information needed to fully troubleshoot issues.

For example, my master-0 (SQL Server) pod was getting evicted and restarted. I used SSH to connect to the master-0 pod.
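If SSH is not enabled in the pod, kubectl exec gets you a shell in much the same way (a minimal sketch; the container name mssql-server is my assumption about the BDC pod layout, so adjust it to whatever kubectl describe pod reports):

kubectl exec -it master-0 -n mssql-cluster -c mssql-server -- /bin/bash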

Note: the debug information and logs located on the pod are reset during the restart, which limits troubleshooting.

Using kubectl and running kubectl get events or kubectl describe pod master-0 -n mssql-cluster did not give me what I needed to understand what happened with the kubelet interactions related to the evictions.

I then executed kubectl get pods -n mssql-cluster -o wide. The output shows that the master-0 pod is serviced by node adlghl-0 on my cluster.

NAMESPACE       NAME       READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
mssql-cluster   master-0   3/3     Running   0          4h36m   10.244.0.209   adlghl-0   <none>           <none>
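As a shortcut, a jsonpath query returns just the node name (a quick sketch using the same namespace and pod name):

kubectl get pod master-0 -n mssql-cluster -o jsonpath='{.spec.nodeName}'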


I then executed kubectl get nodes -o wide and obtained the IP address of the adlghl-0 node. Using the IP address I established an SSH connection to the node and continued troubleshooting.
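The sequence looked roughly like this (a sketch; the user name and IP address are placeholders rather than values from my cluster):

kubectl get nodes -o wide        # the INTERNAL-IP / EXTERNAL-IP columns hold the node addresses
ssh <user>@<node-ip-address>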

I executed various flavors of the journalctl command, spanning the time ranges in which the pod restarted, and found events highlighting the eviction.
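The invocations were along these lines (a minimal sketch; the unit name, time range, and grep pattern are illustrative assumptions rather than my exact commands), and they surfaced entries like the ones below:

sudo journalctl -u kubelet --since "2019-12-05 02:30" --until "2019-12-05 03:30"                 # kubelet unit only
sudo journalctl --since "2019-12-05 02:30" --until "2019-12-05 03:30" | grep -i evict            # whole journal, filtered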

kubelet[1242]: I1205 02:54:35.371117    1242 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 85% which is over the high threshold (85%). Trying to free 5214368563 bytes down to the low threshold (80%).

kubelet[1242]: I1205 02:55:10.469831    1242 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage

kubelet[1242]: I1205 02:55:10.471522    1242 eviction_manager.go:362] eviction manager: pods ranked for eviction: sparkhead-0_mssql-cluster(7bb01e5f-2e33-4b72-b8a7-7737010f56b5), app1-vqc87_mssql-cluster(21a3a235-dda4-44b0-8943-f1b4aa1aebf3), logsdb-0_mssql-cluster(90657763-ffcb-4df3-94c5-7791f4f60c58), gateway-0_mssql-cluster(ea099f36-de15-4e2f-aa59-364e21db97cb), master-0_mssql-cluster(9cef954d-0f2b-4eda-9d14-067214eafd87), compute-0-…

 

In a more complicated scenario, all pods on a given node appeared to have been paused for ~5 hours. The workload was running along and then stalled for ~5 hours: no performance counters, no query cancellations honored, no new connections honored, current connections not making forward progress, and no new log entries for any service of any pod assigned to the node. It was as if someone had put the node into sleep mode for ~5 hours.

Using the time range and servicing node information, I reviewed the CPU metrics using Grafana on my cluster: https://mybdcclusterip:30777/grafana (You can use kubectl cluster-info or azdata list endpoints to acquire the cluster IP address.)
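If you are not sure which address or port to use, something along these lines lists the cluster endpoint and the services exposed in the BDC namespace; the Grafana (metrics UI) service and its node port should show up in the service list (a sketch; adjust the namespace to your deployment):

kubectl cluster-info
kubectl get service -n mssql-cluster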

The metrics collection showed the ~5 hour gap, and the more troubleshooting I did, the clearer it became that every pod on the same node encountered the same paused behavior.

Using kubectl get pods and kubectl get nodes, I determined the target node and studied the time range of events present in its journal. The node's journal is what told the real story and allowed me to understand the behavior.

The journal shows that the operating system on the node (hosting the pods in question) encountered an issue: a stalled scheduler at the OS level that prevented all threads in all pods assigned to the node from making forward progress.

Nov 01 23:47:46 adlghl-3 kernel: INFO: task khugepaged:47 blocked for more than 120 seconds.

Nov 01 23:47:46 adlghl-3 kernel:       Not tainted 4.4.0-165-generic #193-Ubuntu

Nov 01 23:47:46 adlghl-3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Messages like the one above repeated for ~5 hours until the OOM killer terminated a process. Then the system came back to life as if the resume button had been pushed.

Nov 02 04:32:20 adlghl-3 kernel: monkey: wrk/0 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=998
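To scan the node's journal for these kernel messages across the whole window, a filter along these lines works (a sketch; the date range and search pattern are illustrative):

sudo journalctl -k --since "2019-11-01 23:00" --until "2019-11-02 05:00" | grep -iE "blocked for more than|oom-killer"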


Recommendation

For a SQL Server Big Data Cluster (BDC), or any K8s cluster, deploy a solution that collects the journal and other relevant metrics from the servicing nodes.
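As a starting point, enabling persistent journald storage on each node keeps the journal across reboots so your collection scripts can harvest it later (a sketch of the standard systemd approach, not anything BDC-specific; the output file name is just an example):

sudo mkdir -p /var/log/journal                                   # with Storage=auto (the default) this enables persistent journal storage
sudo systemctl restart systemd-journald
sudo journalctl --since yesterday -o json > node-journal.json    # example export for a collection script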

  • Bob Dorr