SQL Server BDC Hints and Tips: Contained AGs, Ports, containedag_msdb, logs, …

This blog post focuses on connecting to the SQL Server BDC, some helpful log files and utility outputs. Understanding a few basics about a SQL Server BDC and the Contained Availability Group (containedag) makes managing and troubleshooting easier.

https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-high-availability?view=sql-server-ver15

Cluster IP Address and Port

To connect to a SQL Server BDC use the exposed IP address (FQDN for an AD enabled cluster) and the target’s, listening port. To locate the IP address and port, use one of the following commands:

kubectl cluster-info

Kubernetes master is running at https://10.193.5.209:6443

KubeDNS is running at https://10.193.5.209:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

azdata bcd endpoint list

{

“description”: “SQL Server Master Instance Front-End”,

“endpoint”: “mssql.aris.local,31433”,

“name”: “sql-server-master”,

“protocol”: “tds”

{

“description”: “SQL Server Master Readable Secondary Replicas”,

“endpoint”: “mssql-secondary.aris.local,31436”,

“name”: “sql-server-master-readonly”,

“protocol”: “tds”

}

kubectl get svc -n mssql-cluster

master-svc-external NodePort 10.99.130.239 <none> 1433:31433/TCP
master-secondary-svc-external NodePort 10.100.49.221 <none> 1433:31436/TCP

The connection port value depends on the cluster’s High Availability enablement and the desired target of primary or read-only secondary.

Port

Description

31433

When connecting to a SQL Server the default port is commonly 1433. On a SQL Server BDC 1433 is exposed externally using port 31433 (default.)

Example: sqlcmd -UMyUser -PMyPassword -SclusterIP,31433

Non-HA – Connects to the master SQL Server instance.

HA Enabled – Connects to the primary (listener for read and write operations.)

31436

For an HA enabled BDC port 31436 connects to the read-only, high availability replicas.

1433

Locally available from /var/opt/mssql-tools/bin/sqlcmd on the pod

Example: kubectl exec -it -n clusterNamespace master-0 /bin/bash

1533

Locally available from /var/opt/mssql-tools/bin/sqlcmd on a high availability enabled cluster.

Note: The 1533 port connects to the instance level instead of the high availability listener.

Hint: You may want expose 1533 externally so you can connect with SSMS or Azure Data Studio to execute instance specific queries. Here is an example temporarily exposing 1533 from the three master nodes.

kubectl expose pod master-0 -n mssql-cluster –port=1533 –type=NodePort –name=sql1533a

kubectl expose pod master-1 -n mssql-cluster –port=1533 –type=NodePort –name=sql1533b

kubectl expose pod master-2 -n mssql-cluster –port=1533 –type=NodePort –name=sql1533c

kubectl get svc -n mssql-cluster

sql1533a NodePort 10.111.117.80 <none> 1533:8115/TCP

sql1533b NodePort 10.111.105.5 <none> 1533:31402/TCP

sql1533c NodePort 10.101.183.78 <none> 1533:30367/TCP

Contained AG (containedag_master, containedag_msdb)

When a SQL Server BDC is enabled for high availability the contained availability group (containedag) is created and maintained. A highly available SQL Server BDC installs master-0, master-1 and master-2 replicas and joins them to the containedag. The containedag includes the databases containedag_master and containedag_msdb (which are commonly hidden from external view.)

When connected to the listener port, database create and drop commands are upgraded to automatically join/remove the database from the containedag and replicate the state to secondaries.

Try this: Connect to the listener port and issue a create database. Establish a second connection to the read-only port and query the database on a secondary replica. The database exists on all three replicas, automatically added to the containedag availability group.

When connected to the SQL Server BDC over the listener port (default=31433), references to master and msdb are automatically redirected to the containedag_master and containedag_msdb. For example, if you connect to the listener and create a SQLAgent, TSQL job the job definition and history are physically stored in the containedag_msdb and replicated to secondary replicas. In fact, I wrote a test creating and executing a job on the primary then connecting to the secondary (port=31436) and waiting for the job outcome to arrive in the job history table. From the user’s perspective it appears as if you are using msdb as you normally would for SQLAgent activities.

Try this: Create a job using the listener port and then issue the following query: select * from msdb..sysjobs

31433 – Listener connection: You will see the new job (you are connected to the listener and msdb references are redirected to containedag_msdb)
1533 – Instance level connection: You will NOT see the job (you are connected to the physical instance and using the physical database msdb)
1533 on a Secondary - Instance level connection: You will NOT see the new job. Issue a use containedag_msdb command and run the query and you will see the new job has been replicated to the secondary replica in containedag_msdb.
31436 – Read only connection: You will see the new job (you are connected to the read only listener and msdb references are redirected to containedag_msdb)

Note: The target of the connection determines if the job is replicated or the job remains instance specific.

Execute select db_name(database_id) from sys.dm_hadr_database_replica_states

Listener connection: You see msdb because of the automatic redirection taking place
Instance level connection: You see containedag_msdb because you are looking at the physical data

You are querying at the same database. The logical referencing of the contained availability group listener redirection makes containedag_msdb appear as msdb.

Who is Primary and Secondary Replicas?

Using the high availability DMVs (sys.availability_groups, sys.dm_hadr_database_replica_states, …) you can query to see which master-* is currently acting as the primary replica. In fact, login to the listener port and select @@SERVERNAME to quickly determine the server name of the primary replica.

You can also query Kubernetes to view the label assignments for the master-* pods. When a SQL Server High Availability failover takes place, the labels are updated to indicate the role of primary or secondary.

Example: kubectl edit pod master-0 -n <<clusterNamespace>

labels:

MSSQL_CLUSTER: mssql-cluster

app: master

controller-revision-hash: master-79fc98b57

mssql.microsoft.com/sql-instance: master

plane: data

role: master-pool

role.ag.mssql.microsoft.com/master-containedag: secondary

statefulset.kubernetes.io/pod-name: master-0

type: sqlservr

Port 31433 Connection Fails on High Availability SQL Server BDC

A connection to port 31433, on a highly available SQL Server BDC, may fail if the availability group is in an unhealthy state. When a master-* pod is elected as the primary replica it transitions the databases to primary state and listens on port 31433. If the containedag becomes unhealthy the listener won’t be established and connection attempts fail. Troubleshoot an unhealthy containedag as you would troubleshoot any SQL Server Availability Group to restore functionality.

The SQL Server DMVs and errorlog are great places to start when troubleshooting an unhealthy availability group.

Controller.log

The controller log is a great source of information to follow the availability group and database state transitions. You can use the following commands to access the controller.log file.

kubectl exec -it -n clusterNamespace control-xxxxx /bin/bash

cd /var/log/controller/##/##/##/controller.log

azdata bdc debug copy-logs –namespace clusterNamespace –target-folder c:temp

Search for the log entries containing the keyword “HadrRoleManagerStateMachine” to find entries like the following:

2020-01-30 18:04:35.1118 | INFO | [HadrRoleManagerStateMachine: containedag] Transitioning from ‘Healthy’ to ‘StopLeaseRenewal’ state.

2020-01-30 18:04:35.4225 | INFO | [HadrRoleManagerStateMachine: containedag] Transitioning from ‘StopLeaseRenewal’ to ‘Offline’ state.

2020-01-30 18:04:35.4988 | INFO | [HadrRoleManagerStateMachine: containedag] Transitioning from ‘Offline’ to ‘Resolving’ state.

2020-01-30 18:04:35.5689 | INFO | [HadrRoleManagerStateMachine: containedag] Transitioning from ‘Resolving’ to ‘NoPrimary’ state.

2020-01-30 18:04:38.0689 | INFO | [HadrRoleManagerStateMachine: containedag] Transitioning from ‘NoPrimary’ to ‘Resolving’ state.

2020-01-30 18:05:13.1124 | INFO | [HadrRoleManagerStateMachine: containedag] Transitioning from ‘Resolving’ to ‘OnlinePending’ state.

2020-01-30 18:05:13.1452 | INFO | HadrRoleManagerStateMachine:master:containedag:: OnlinePendingTransition: CurPrimary(master-2)

Other information such as a database drop requests are contained in the controller.log as well.

2020-01-30 18:05:34.4335 | INFO | DropManagedDatabase request received. StatefulSet: master, Database: StressTREBORClone

mssql-ha-supervisor

Each mssql-ha-supervisor container helps to manage high availability and the associated logs contain additional information.

kubectl describe pod master-0 -n clusterNamespace

Containers:

mssql-server:

mssql-ha-supervisor:

…

Sample entries from the mssql-ha-supervisor stdout/stderr log files as copied using the azdata copy-logs command.

[containedag-agent] 2020/02/01 16:37:24 lease expired while database disconnected

[supervisor] 2020/02/01 16:37:25 Database operation state for database GiantDB, requestId bea5a9c7-7e2b-43e7-9f3b-98c4dc342ec8

[supervisor] 2020/02/01 16:37:25 Database operation state for database StressTREBOR_830179E1_9451_4E1A_9FB2_75548674A41E, requestId 7b4281f7-820a-4cde-ad2d-33a8a136de0b

[containedag-agent] 2020/02/01 16:37:27 Sent unhealthy heartbeat

[configuration-agent] ERROR: 2020/02/01 16:37:40 Unable to query AG configurations from SQL Server: read tcp 100.64.2.185:57820->100.64.2.185:1533: i/o timeout

operator-#### pod

The operator pod listens to and reacts to Kubernetes events. The operator log is another location for tracking what transpired on a SQL Server BDC.

2020/02/08 15:32:00 Got pod event for master-1

2020/02/08 15:32:00 Got pod event for master-2

2020/02/08 15:32:00 Got pod event for master-0

Use the SQL Server errorlog, controller, mssql-ha-supervisor, operator and DMV information to view the state of your SQL Server BDC.

Mange with notebook: https://docs.microsoft.com/en-us/sql/big-data-cluster/manage-notebooks?view=sql-server-ver15

Stuart is a colleague of mine and is the ultimate champion for SQL Server BDC notebooks. I learn more about Notebooks everyday as I use them to help monitor and troubleshoot the 12, SQL Server BDCs in the longhaul lab. I enjoy the notebooks because they automate many of the data collection and management activities while saving the output. I also get to view the commands used so I learn how to troubleshoot and manage aspects of the SQL Server BDC I may not be familiar with.

I am currently working to convert the SQL Server BDC longhaul test runs to rely solely on notebook collection. This means our development and support staffs use the same notebooks to troubleshoot a SQL Server BDC we provide to you.

Manage with controller dashboard : https://docs.microsoft.com/en-us/sql/big-data-cluster/manage-with-controller-dashboard?view=sql-server-ver15

Another helpful facility is the controller dashboard. The controller dashboard provides you reports, links and drill-ins for your SQL Server BDC.

Bob Dorr

How It Works, SQL Server BDC Hints and Tips