Slow authentication to Edge SWG

Article ID: 422213


Products

ProxySG Software - SGOS

Issue/Introduction

Users access internet sites via Cloud SWG using the proxy forwarding access method.

Authentication is performed on the Edge SWG proxies through BCAAA servers (24 servers in total) using NTLM challenge/response.

Periodically, subsets of users report slowness accessing Web sites.

Users experiencing the issue were not hitting the same BCAAA server, but the impacted BCAAA servers were hitting the same domain controller (DC).

        nltest /dsgetdc:<domain_name> confirmed that all slow BCAAA servers were connected to the same DC.

When the issue occurs, restarting the BCAAA service does not resolve it; only a restart of the Windows host does.

When the issue occurs, it can be worked around by using nltest to point the affected host at another DC.
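The workaround above can be sketched with nltest (a sketch with placeholder names; substitute your own domain and DC, and run from an elevated prompt on the affected BCAAA host):

```shell
rem Show which DC the host is currently using
nltest /dsgetdc:example.com

rem Reset the Netlogon secure channel to a specific, known-good DC
rem (EXAMPLE\DC02 is a placeholder)
nltest /sc_reset:EXAMPLE\DC02

rem Verify the secure channel now points at the intended DC
nltest /sc_query:EXAMPLE
```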

The impact grows progressively worse as more Windows 11 devices are provisioned.

PCAPs and HAR files showed the slowness was caused by delays during the NTLM authentication exchange.

Environment

Edge SWG.

NTLM authentication.

BCAAA servers.

Cloud SWG.

Cause

BCAAA communication with AD domain controllers was not staying local to the region; requests ended up being served by DCs in other regions.

AD configuration issues triggered by the new Windows 11 rollout.

Resolution

Keep authentication traffic local by configuring Active Directory Sites and Services:

  • Add the new Windows 11 machine subnet range as a Subnet (with its CIDR range) in Active Directory Sites and Services (Start -> Administrative Tools -> Active Directory Sites and Services), and
  • assign the newly created subnet to the Active Directory Site containing the domain controllers closest to the BCAAA servers.
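The same change can be scripted with the ActiveDirectory PowerShell module on a domain controller or management host (a sketch; the CIDR range and site name below are placeholders for your environment):

```powershell
# Placeholder values: adjust the CIDR range and site name for your environment
New-ADReplicationSubnet -Name "10.72.48.0/20" -Site "EU-West-BCAAA"

# Confirm the subnet-to-site mapping
Get-ADReplicationSubnet -Filter 'Name -eq "10.72.48.0/20"' | Format-List Name,Site
```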

Additional Information

A policy trace confirmed the slowness was authentication related:

connection: service.name=Explicit HTTP client.address=10.133.0.162 (effective address=10.72.48.11) proxy.port=8080 source.port=53199 dest.port=8080 client.interface=0:0.1 routing-domain=default
  location-id=0 access_type=unknown
time: 2025-10-03 21:47:39 UTC
CONNECT tcp://pod.threatpulse.com:443/
  DNS lookup was unrestricted
User-Agent: curl/7.68.0
X-Forwarded-For: 10.72.48.11
user: name="EXAMPLE\user" realm=EXAMPLE
resolved group: name="Example\Group1"
resolved group: name="Example\Group2"
authentication start 4 elapsed 11981 ms <----!!!
authorization start 11985 elapsed 0 ms 

Focusing on the BCAAA servers and running them in debug mode while the incident occurred, delays of up to 8 seconds in responses from AD were visible in the bcaaa-realm-* log files.

Timestamps around the AcceptSecurityContext() calls show no delays at the beginning of the logs, but towards the end the call below took roughly seven seconds to return:

bcaaa-realm-9712-251004012438.log 114382 2025/10/04 00:27:22.980 [11040] AcceptSecCtxt: pCtx=18c67c8 tLen=784 tId=598dd677 sn=598dd677 ct=411
bcaaa-realm-9712-251004012438.log 116212 2025/10/04 00:27:30.053 [11040] AcceptSecCtxt returns 0x0 LastError 0

This would typically indicate an NTLM secure-channel (schannel) bottleneck at the domain controllers. Increasing the MaxConcurrentApi setting on the DCs to 50, to help handle the load, reduced the impact but did not solve it. (A related tuning on the proxy side is the number of threads BCAAA uses for authentication, configured in the bcaaa.ini file.)
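MaxConcurrentApi is a Netlogon registry value on each DC; a sketch of applying the value 50 used here (apply in a maintenance window, since it restarts the Netlogon service):

```shell
rem On each affected domain controller, raise the Netlogon concurrency limit
reg add HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters /v MaxConcurrentApi /t REG_DWORD /d 50 /f

rem Restart the Netlogon service so the new value takes effect
net stop netlogon && net start netlogon
```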

Using Windows AD tools to monitor traffic on the DCs, it became visible that some requests from the BCAAA servers were being served not by local DCs but by DCs in other regions. The assumption had been that all requests would remain local, but validation of the Windows AD setup confirmed that newly promoted DCs had not been configured following sites-and-services best practice. After the changes described above, all traffic remained local and all requests were answered within milliseconds instead of seconds.
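After the sites-and-services change, DC locality can be spot-checked from each BCAAA host (a sketch; example.com is a placeholder domain):

```shell
rem Confirm which AD site this host maps to
nltest /dsgetsite

rem Confirm the DC returned for the domain belongs to that same (local) site
nltest /dsgetdc:example.com
```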