Hi all. After running into problems with installing and running SCOM at one location, all related to SQL Authentication settings, it is time to write a blog post about it. I also have some information about what happens if you do not set this up right.
In the SCOM supported configurations documentation on the Microsoft website, there is a small link to the SQL design considerations page for SCOM. That page states that only Windows Authentication is supported, and not Mixed Authentication. I can imagine this does not get picked up by everybody.
This has actually been the case for many years. Having the SQL Authentication mode set to Mixed mode can lead to problems while installing SCOM for the first time, but it can also lead to other issues later on. In the old days, a SQL installation would default to Mixed mode, but since SQL 2016 I believe the default is Windows Authentication if you just keep clicking Next until it installs.
This post describes a case we ran into and the problems that occurred. These are not always easy to troubleshoot, because the errors do not point to the real issue.
I was helping a company install a SCOM management group from scratch, so no migration or upgrade scenario. This company deploys servers and SQL using automation and scripts. All great so far. My request clearly stated that the instance should use Windows Authentication.
When I first logged in and opened the properties of the instance, I could see it was indeed set to Windows Authentication. So, we went to the server which was going to be the management server and started the SCOM setup wizard. Fill out a lot of fields, service accounts and so on. Click Install!!! This is Windows Server 2019, with SQL 2019 on the latest CU, and SCOM 2022.
Problem 1 (yes, we got more coming)
We see in the progress bar that it created the OperationsManager database and started working on it and putting some data in it. At the end of that step, it sets the rights on the database, and this is the point where it crashed.
PopulateUserRoles: failed : Threw Exception.Type: System.Runtime.InteropServices.COMException, Exception Error Code: 0x80070539, Exception.Message: The security ID structure is invalid. (Exception from HRESULT: 0x80070539)
Upon inspecting the setup log, we found the above error. A few other messages were found later. For the above message there are a few known cases, and they point to SCOM upgrades between two versions where the SQL instance was set to Mixed Authentication mode! Of course, this was not an upgrade, and we did a quick check to confirm the fresh SQL server was set to Windows Authentication mode. Hmmmm, the other errors seemed to point to something in the handling of the passwords for the service accounts.
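As a side note, the HRESULT in that setup log line can be taken apart to see which underlying error it wraps. The snippet below is a minimal sketch in Python and is not part of the SCOM tooling; the function name decode_hresult is made up for this illustration. It splits the value into the standard HRESULT fields: failure bit, facility and code.

```python
FACILITY_WIN32 = 7  # facility value used when an HRESULT wraps a Win32 error


def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into its failure bit, facility and code fields."""
    return {
        "failure": bool((hresult >> 31) & 0x1),  # top bit set means failure
        "facility": (hresult >> 16) & 0x7FF,     # 11-bit facility field
        "code": hresult & 0xFFFF,                # low 16 bits: the wrapped error code
    }


parts = decode_hresult(0x80070539)
print(parts)
if parts["facility"] == FACILITY_WIN32:
    # 0x0539 is decimal 1337, the Win32 error ERROR_INVALID_SID
    print("Win32 error code:", parts["code"])
```

So the message is a plain Win32 "invalid SID" error wrapped in a COM exception, which is a hint that something in the rights step did not resolve to a valid Windows security identifier.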
In the meantime, the standard process for this company was to create a Microsoft case for these items. The research for this issue went back and forth over 2 months, and network traces and other dumps were created against SQL, SCOM, the AD domain controller and so on. Nothing seemed to stop the authentication or the traffic. We tried enforcing TLS 1.2, and then taking it out again. RC4 was one of the other potential troublemakers, but in the end, this was not it either. We used different accounts from different domains, admin users, service accounts etc. It looked like a connectivity and authentication issue.
Suddenly a SQL expert found the problem, by backtracking through what SQL does while creating a new database.
What happens while creating a new database is that a copy is made of the “model” database. Next, tables and other objects get created, and security gets adjusted to whatever is needed. This is what the SCOM installer does as well.
It turns out there was a local SQL account which was the owner of the model database! When model got copied to become the new SCOM database, the copy carried those rights. Next, SCOM tries to enforce its own rights on the SCOM database and runs into a problem, because it does not accept local SQL accounts. It cannot check local SQL accounts against Active Directory.
We quickly took away the rights of the local SQL account and continued the SCOM installer, and it worked.
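A check for this situation could look roughly like the sketch below. This is my own illustration, not something we ran during the case. The T-SQL uses the standard catalog views sys.databases and sys.server_principals; the Python helper sql_login_owners, the sample rows and the account name dba_tooling are hypothetical. Skipping sa is also an assumption, based on sa being the usual owner of the system databases.

```python
# T-SQL to run against the instance (for example from SSMS). It lists every
# database with the login that owns it and the type of that login.
OWNER_QUERY = """
SELECT d.name AS database_name, sp.name AS owner_name, sp.type_desc AS owner_type
FROM sys.databases AS d
LEFT JOIN sys.server_principals AS sp ON d.owner_sid = sp.sid;
"""


def sql_login_owners(rows, ignore=("sa",)):
    """Return (database, owner) pairs where the owner is a SQL login.

    rows: iterable of (database_name, owner_name, owner_type) tuples,
          i.e. the result set of OWNER_QUERY.
    ignore: owner names to skip (assumption: sa as default system owner).
    """
    return [
        (database, owner)
        for database, owner, owner_type in rows
        if owner_type == "SQL_LOGIN" and owner not in ignore
    ]


# Hypothetical result set, shaped like the situation described above.
SAMPLE_ROWS = [
    ("master", "sa", "SQL_LOGIN"),
    ("model", "dba_tooling", "SQL_LOGIN"),
    ("DBAWork", "CONTOSO\\svc_sql", "WINDOWS_LOGIN"),
]

for database, owner in sql_login_owners(SAMPLE_ROWS):
    print(f"{database} is owned by SQL login {owner}")
```

Any hit on model is worth fixing before starting the SCOM setup, because every new database starts as a copy of it.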
Now, what I think happened is that the installation script for the SQL server creates an extra database for the DBAs, and also a local SQL account with which they can run all kinds of jobs or reports. Maybe this gets enforced on Windows Authentication as well, or SQL got installed in Mixed Authentication first and was then changed to Windows Authentication (yes, you can do that after SQL is installed). The local SQL account was not the default “sa” account, so we did not notice it among the domain-based accounts and groups with rights. I have run into something similar before, where an instance was later converted to Mixed mode for the DBA tooling, and the sa account got used.
When logging in to SCOM the next day, we found SCOM not working. Lots of errors. Resource pools going down, including the All Management Servers resource pool. When inspecting the event logs, we also saw the SDK service crashing, restarting and crashing again. We know this Windows service by its name: “System Center Data Access Service”. Let’s see the two errors the crashing SDK service logged.
Application log error event ID 1000:
Faulting application name: Microsoft.Mom.Sdk.ServiceHost.exe, version: 10.22.10118.0, time stamp: 0x6206dd6c
Faulting module name: KERNELBASE.dll, version: 10.0.17763.2183, time stamp: 0x8e097f91
Exception code: 0xe0434352
Fault offset: 0x0000000000039329
Faulting process id: 0x114c
Faulting application start time: 0x01d8d74237c5360c
Faulting application path: C:\Program Files\Microsoft System Center\Operations Manager\Server\Microsoft.Mom.Sdk.ServiceHost.exe
Faulting module path: C:\WINDOWS\System32\KERNELBASE.dll
Application log event ID 1026:
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.ServiceModel.FaultException`1[[Microsoft.EnterpriseManagement.Common.UnknownAuthorizationStoreException, Microsoft.EnterpriseManagement.Core, Version=7.0.5000.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]]
at Microsoft.EnterpriseManagement.Mom.Sdk.Authorization.AzManHelper.Initialize(System.String, System.String, AzManHelperModes, System.String, System.String)
at Microsoft.EnterpriseManagement.SingletonLifetimeManager`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].GetComponent[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]()
We were looking at this because we wanted to install the second management server, and this of course would not work until the issues were fixed.
Look at that second message. I noticed UnknownAuthorizationStoreException in there, and the Sdk.Authorization.AzManHelper.Initialize call. Well, one of the things the SCOM management server (so the SDK service) does when it starts up is connect to the SCOM database and check the rights structure. What if… the rights for that local SQL account were back?
Yes. In SQL, I went to find the account and checked its permissions, as seen in the picture. And yes, it had rights on all databases again.
We took the rights off. And SCOM started working again.
Try to install the second management server…. Fail.
You are kidding me, right?
The rights on the account were back! Apparently, they have a system which restores the rights. Smart! But of course, it now works against us.
Disabled the account. Took the rights off.
Installation went fine. Finally.
Yeah….. The SCOM SDK did not crash again after a few minutes this time. But we saw some strange alerts and notifications.
Disabling the account was not enough. It still got its rights back. So that was something to clear again.
When installing the SQL instance to host SCOM (SCSM too), install it right.
Always use Windows Authentication.
Check whether another local account exists in SQL, and make sure it does not have rights on the model database or on the SCOM databases. You have seen how many problems this causes.
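The checks above could be turned into a small pre-install review, sketched here in Python. Treat it as an illustration under assumptions and not as a finished tool. SERVERPROPERTY('IsIntegratedSecurityOnly') and sys.sql_logins are standard SQL Server features; the helper preinstall_findings and the sample login name are hypothetical. Skipping sa and the built-in logins whose names start with ## is my own choice. Disabled logins are reported too, because in our case disabling the account was not enough.

```python
# T-SQL to run against the instance; the results are passed to the helper below.
AUTH_MODE_QUERY = "SELECT SERVERPROPERTY('IsIntegratedSecurityOnly');"  # 1 = Windows Authentication only
SQL_LOGINS_QUERY = "SELECT name, is_disabled FROM sys.sql_logins;"


def preinstall_findings(integrated_only, logins):
    """Return a list of findings to review before installing SCOM.

    integrated_only: value returned by AUTH_MODE_QUERY (1 or 0).
    logins: iterable of (name, is_disabled) rows from SQL_LOGINS_QUERY.
    """
    findings = []
    if int(integrated_only) != 1:
        findings.append("Instance is in Mixed mode; switch to Windows Authentication.")
    for name, is_disabled in logins:
        # Assumption: sa and the built-in ##...## logins are expected to exist.
        if name == "sa" or name.startswith("##"):
            continue
        state = "disabled" if is_disabled else "enabled"
        findings.append(
            f"SQL login '{name}' exists ({state}); "
            "check its rights on model and the SCOM databases."
        )
    return findings


# Hypothetical input resembling the case in this post.
for finding in preinstall_findings(1, [("sa", True), ("dba_tooling", False)]):
    print(finding)
```

An empty list does not prove the instance is clean, but any finding is a reason to talk to the DBA team before clicking Install.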
This was just a combination of circumstances: a local account in an instance set to Windows Authentication caused all the problems. But it is clear that during installation, and at times also later during operations, stuff can break if you are on Mixed authentication in SQL for the SCOM databases.
And sometimes a double-check needs one more check for unexpected cases.
Using the correct SQL Authentication mode is one of the things mentioned in some of my webinars. It is also included in our SCOM Admin training, and it is something we check for in our SCOM Health Check for existing SCOM management groups.