Using SQL Server’s SNITrace to Troubleshoot Networking Issues

Using SQL Server’s SNITrace to Troubleshoot Networking Issues

 

 

In the process of tracking down a few TCP 10054 issues (highlighted here:​​ https://bobsql.com/sql-mysteries-why-is-my-sql-server-experiencing-lots-of-17830-tcp-10054-errors) I​​ also​​ used​​ the​​ SNITrace​​ (SNI Trace)​​ capabilities.

SNITrace, formally BIDTrace, is an​​ deeper​​ level of tracing for the​​ Microsoft SQL Server​​ TDS​​ consumers and​​ providers.​​ 

Terms and Brief History

Before I dive into​​ the details here is a brief history and related terminology.

BID

Built-In Diagnostics (BID)​​ is the former name​​ replaced with SNI. ​​ The original tracing was added as BIDTrace​​ and since then many of the functions and macros were renamed to SNITrace* as improvements​​ were made and features added.

SNI

Server Network Interface (SNI) is centralized code, shared among the SQL Server and SQL Server client providers, both Windows and Linux that can send, parse and act upon​​ TDS communications.

ODS

Open Data Services​​ (ODS)​​ was​​ a separate, public API​​ set​​ allowing a TDS gateway to be created. ​​ The API was deprecated several releases ago and​​ the behaviors consumed​​ internally by SQL Server.  ​​​​ ODS provides an evening model for processing TDS requests as parsed by the SNI layer.


ODBC Trace

To capture an ODBC trace use the driver manager, trace capabilities (https://docs.microsoft.com/en-us/sql/odbc/admin/setting-tracing-options?view=sql-server-ver15)​​ 

  • Launch odbcad32.exe

  • Select the tracing tab and then start tracing

  • Execute the reproduction

  • Stop tracing and view the log file

Execution:​​ sqlcmd -E -S.\sql17

ODBC trace records the ENTER and EXIT​​ activities for the ODBC APIs as well as​​ the​​ various parameters, values and messages.

 ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 740c-592cENTER​​ SQLDriverConnectW​​ 

  HDBC  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 0x0000019DE23D8CC0

  HWND  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 0x0000000000000000

  WCHAR *  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 0x00007FFF581D21C0 [  ​​ ​​ ​​ ​​​​ -3] "******\ 0"

  SWORD  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ -3​​ 

  WCHAR *  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 0x00007FFF581D21C0​​ 

  SWORD  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ -3​​ 

  SWORD *  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 0x0000000000000000

  UWORD  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 0 <SQL_DRIVER_NOPROMPT>

 

 ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 740c-592cEXIT ​​​​ SQLDriverConnectW ​​ with return code 1 (SQL_SUCCESS_WITH_INFO)

  HDBC  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 0x0000019DE23D8CC0

  HWND  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 0x0000000000000000

  WCHAR *  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 0x00007FFF581D21C0 [  ​​ ​​ ​​ ​​​​ -3] "******\ 0"

  SWORD  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ -3​​ 

  WCHAR *  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 0x00007FFF581D21C0 <Invalid buffer length!> [-3]

  SWORD  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ -3​​ 

  SWORD *  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 0x0000000000000000

  UWORD  ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ 0 <SQL_DRIVER_NOPROMPT>

 

  DIAG [01000] [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Changed database context to 'master'. (5701)​​ 

 DIAG [01000] [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Changed language setting to us_english. (5703)

 

Going Beyond ODBC Trace

For my specific issue I was attempting to debug why the SQLDriverConnectW​​ errored​​ with TCP 10054 because it was failing the pre-login handshake. ​​ For that I first needed to understand and capture the flow of the handshake activities​​ and this is where SNITrace is helpful.

10054​​ The​​ Direction​​ Matters

Understanding where the RST (TCP closure) occurs if helpful in understanding the problem and how to troubleshoot it.

Each stage in the login process uses asynchronous read and write timeouts. ​​ For example, when the SQL Server sends the initial Pre-Login Response the SQL Server posts a read request​​ using a​​ 5 second​​ timeout. ​​ If the first Ssl/Tls context packet takes longer than 5 seconds to arrive, from the client,​​ SQL Server closes the TCP connection (RST occurs from the SQL Server) and the next write attempt from the SQL Client fails with a 10054 error, the connection has been reset.  ​​​​ The Ssl/Tls reads use similar timeout mechanisms and the SQL client provider honors the ‘login timeout.’ ​​ For example, if the login packet is sent by the client but the SQL Server does not respond before the established login timeout (30 second default) the client closes the TCP connection (RST from the SQL Client provider) and the server records the 10054, connection​​ reset​​ error and terminates the login attempt.

Before diving into SNITrace​​ let’s breakdown the​​ network trace of a successful and failed login.

Good Login Sequence ​​ (SYN, ACK, … FIN)

10054 sequence from SQL Server | Start login at 48.8 and server terminates at 52.9 (SYN, ACK … RST)