A few weeks ago I wrote a blog post about a slow SCOM update rollup. It took so long because agents are placed in Pending Management during the update, and in a big environment this can take a while. I will show you what that looks like below as well. Yesterday and today we have been looking at an issue where a similar thing happened, but the counter of agents in Pending Management stayed at 0.
This time it was not during an update rollup for SCOM, but during an upgrade from SCOM 2016 UR8 to SCOM 2019 on a few management servers. We had an issue with 3 out of 4 management servers, and it turned out the problem was the same each time.
During the upgrade you normally watch the SCOM upgrade wizard. In the Management Server update phase it replaces a number of DLLs, executables and so on, and next it starts placing agents in Pending Management. Look at the picture below: the wizard is open in a remote session at the front, and at the back is a SCOM console open at the Pending Management screen, where you can see the count going up to the number of agents linked to this management server.
This process can take 20 minutes or so, and the wizard states what it is doing, as you can see in the picture.
In our case the second server we were upgrading was at this stage and the counter in the SCOM console was at zero (0).
Upon investigation we found that the Data Access service (DAS) on the management server was having problems: it kept restarting and failing.
Oh, and the Management Configuration service had problems too, because it relies on the DAS service for database access.
So as long as those services are not working the setup wizard will stay this way or fail.
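A quick way to see whether those two services are up is to ask Windows for their state. The snippet below is only a rough sketch: it assumes the short service names OMSDK (Data Access service) and cshost (Management Configuration service), which are not mentioned above, so verify them on your own management server. It also assumes the English output format of the sc query command. A service that keeps restarting may show RUNNING at the moment you look, so check a few times.

```python
import re
import subprocess
from typing import Optional

# Assumed short service names, verify these on your own management server.
SERVICES = ["OMSDK", "cshost"]


def parse_sc_state(output: str) -> Optional[str]:
    """Pull the state word (RUNNING, STOPPED, ...) out of `sc query` output."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", output, re.MULTILINE)
    return match.group(1) if match else None


def service_state(name: str) -> Optional[str]:
    """Run `sc query <name>` (Windows only) and return the reported state."""
    result = subprocess.run(["sc", "query", name], capture_output=True, text=True)
    return parse_sc_state(result.stdout)
```

On the management server you could loop over SERVICES, call service_state(name) for each one, and repeat that every few seconds while the wizard is waiting.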
After a few hours of searching my colleague Kamil found out there was a difference between the good server and the bad server... Registry keys! Some registry values were missing or empty, and this was causing the problems for the DAS service.
HKLM\SOFTWARE\Microsoft\System Center Operations Manager\12\Setup
In this key there are 3 values: the install directory, the product, and the directory the setup files were in when the product was installed. All 3 values were missing.
HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Setup
This is a well-known key with a number of values pointing to databases and such. In our case the values for the data warehouse database name and the data warehouse database server name were there, but empty!
The solution was to copy the correct values from the working machine into the registry of the machine having problems. A few minutes later everything was working again.
So we upgraded the next server, and it was sitting at this stage again:
And again the counter of pending agents was 0 and the Data Access service had problems again. So we left the wizard screen open and quickly added the correct registry values again, and a few minutes later the pending agents counter started moving up. Funny detail: for the 1200 agents or so on this management server we saw the counter go up to 1000, drop back to 0 and start over, but after the first few hundred it jumped back up to over a thousand and finished a few moments later.
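To find out which values differ between a good and a bad server, you can dump both Setup keys on each machine and compare them. The snippet below is a minimal sketch of that idea, not a tested tool. The function names are made up for this example, the remote read assumes the Remote Registry service is reachable, and any value names used with it are whatever you find on your healthy server.

```python
from typing import Dict, List, Optional

# The two Setup keys mentioned above, relative to HKLM.
SETUP_KEYS = [
    r"SOFTWARE\Microsoft\System Center Operations Manager\12\Setup",
    r"SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Setup",
]


def read_setup_values(subkey: str, computer: Optional[str] = None) -> Dict[str, object]:
    """Read all values under HKLM\\<subkey>. Windows only; computer=None means local."""
    import winreg  # imported here so the comparison below also runs elsewhere

    access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY  # always use the 64-bit view
    with winreg.ConnectRegistry(computer, winreg.HKEY_LOCAL_MACHINE) as hive:
        with winreg.OpenKey(hive, subkey, 0, access) as key:
            value_count = winreg.QueryInfoKey(key)[1]
            values = {}
            for index in range(value_count):
                name, data, _type = winreg.EnumValue(key, index)
                values[name] = data
            return values


def is_empty(data: object) -> bool:
    """Treat None and blank strings as empty."""
    return data is None or str(data).strip() == ""


def find_gaps(good: Dict[str, object], bad: Dict[str, object]) -> Dict[str, List[str]]:
    """List values that the good server has but the bad server lacks or has empty."""
    missing = sorted(name for name in good if name not in bad)
    empty = sorted(
        name for name in good
        if name in bad and is_empty(bad[name]) and not is_empty(good[name])
    )
    return {"missing": missing, "empty": empty}
```

For each entry in SETUP_KEYS you would call read_setup_values once for the working server and once for the failing one, then pass both results to find_gaps. Anything it reports is a candidate to copy over by hand, after exporting the key as a backup first.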
It turns out the last server we were going to upgrade had had some minor issues after it was updated to SCOM 2016 UR8 at the time. We looked at the registry before starting the upgrade process and guess what... registry values were missing there as well, ever since the UR8 update. After fixing this, the upgrade to 2019 went fine. Just a bit of waiting due to the many agents.
We tried to look for issues similar to this online but could not find any. So if you run into an issue after upgrading a SCOM management server with an update rollup or a full version upgrade and the data access service has problems… have a look at the registry in the two locations specified above. See if that helps.
To be clear, this is the first time I have seen this happen. I hope it takes a rare combination of circumstances to cause it. I will be looking at the logs to see if anything turns up there.
Wishing you happy monitoring!
We have comments turned off due to all the sunglasses and pills adverts. However, if any useful comments reach us by email I can post them below.
By Andy Callagan:
RE the above. I've had over 20 installations and upgrades all exhibit the same thing. It happens when 2012 is upgraded to 2016, and again when upgrading to 2019. I believe what triggers this is that the DB was moved at some point. The primary MS script gets the info from the registry and sets things up correctly. A secondary MS upgrade runs a different script and gets the info from the Ops DB, and it is reading a value that we don't change in the MS guide to moving the DB. I once made a mistake when moving an Ops DB and missed a field. That 'old' value appeared in the registry after the upgrade.