The other day I was called in to help troubleshoot another case where management servers and agents had turned grey. The cause was one I had not seen before, so I thought I would mention it here.
What happened was that, for some reason, the contents of the “C:\Documents and Settings\” folder had been deleted on all machines. These were Windows 2003 machines. This happened right before the agents went grey, so the two were very likely linked. Some other programs (IIS, the SCCM agent) did not seem to like this either. At first sight, of course, this folder only contains a few user profiles, so what the heck, right? But there are also some other directories in there (some of them hidden) that actually do have a function.
In the end it turned out that the “C:\Documents and Settings\All Users\Application Data\Microsoft\Crypto” folder had been deleted, and that folder contains the machine keys. Aha…
So we started out focussing on the management servers and the SQL box in order to get those up and running first. What you will see on agents and management servers is event 7022, saying something to the effect that the Health Service has downloaded secure configuration but does not have the certificate or private key to decrypt it, so processing fails. Crypto folder deleted, decrypt errors: looks like we have a link here. The management servers had this too, so we first had to get those talking again (and make sure we were monitoring the SQL box as well, to see whether we had any other issues). What we want to see is an event 7023.
So the way to get this running was to:
- Stop the System Center Management service
- Clear the default value under the following registry key (making an export first might be a good idea!):
- Delete the contents of the following folder, to flush the cache and have it re-download configuration and management packs:
“C:\Program Files\System Center Operations Manager 2007\Health Service State\*.*”
- Start the System Center Management service
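The steps above could be sketched as a small script. This is only a sketch under assumptions, not the script we actually ran: it assumes the System Center Management service has the short name `HealthService`, that `net stop` and `net start` are available, and that the state folder is at the default path shown above. The registry step is left as a hypothetical `registry_step` callback, to be supplied for whichever key applies.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: short name of the System Center Management service.
SERVICE_NAME = "HealthService"
# Default state folder from the steps above; adjust if installed elsewhere.
STATE_DIR = Path(
    r"C:\Program Files\System Center Operations Manager 2007\Health Service State"
)


def clear_directory(folder):
    """Delete everything inside folder but keep the folder itself.

    Returns the number of top-level entries removed.
    """
    removed = 0
    for entry in Path(folder).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


def reset_health_service(state_dir=STATE_DIR, run=subprocess.check_call,
                         registry_step=None):
    """Stop the service, flush its cache, and start it again.

    run is the command runner (replaceable for a dry run).
    registry_step is a hypothetical hook for clearing the registry value;
    it is called while the service is stopped.
    """
    run(["net", "stop", SERVICE_NAME])
    try:
        if registry_step is not None:
            registry_step()
        removed = clear_directory(state_dir)
    finally:
        # Always try to bring the service back, even if a step failed.
        run(["net", "start", SERVICE_NAME])
    return removed
```

Calling `reset_health_service()` with no arguments would act on the local machine. Passing `run=print` and a scratch folder as `state_dir` gives a harmless dry run first.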
After a while we finally got the 7023 event back, and a minute later we saw the long list of 1201 events, which means it is downloading the management packs again. A few minutes later monitoring of those machines came back, along with the performance counters and health state.
Next we pulled a list of all agents from the OpsMgr Shell with get-agent and used that as input to run the steps above as a script on all other agents.
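Turning an exported agent list into per-machine targets might look roughly like this. It is a sketch under assumptions: the get-agent output has been saved as plain text with one computer name per line, the state folder is reachable through the `C$` admin share, and the service is controlled remotely with `sc \\host stop|start`. The helper names are made up for illustration, and the post does not depend on this being how the remote run was done.

```python
# Default state folder relative to the drive root, as in the steps above.
STATE_SUBPATH = (
    r"Program Files\System Center Operations Manager 2007\Health Service State"
)


def read_agent_names(text):
    """Parse one computer name per line, skipping blanks and duplicates."""
    names = []
    seen = set()
    for line in text.splitlines():
        name = line.strip()
        if not name or name.startswith("#"):
            continue
        key = name.lower()  # host names are case-insensitive
        if key not in seen:
            seen.add(key)
            names.append(name)
    return names


def remote_state_dir(host, drive="C"):
    """UNC path to the state folder via the admin share (an assumption)."""
    return "\\\\{}\\{}$\\{}".format(host, drive, STATE_SUBPATH)


def service_commands(host, service="HealthService"):
    """Build the sc.exe stop and start commands for a remote machine.

    Note that sc returns as soon as the request is accepted, so a real
    script would need to wait for the service to reach the stopped state.
    """
    target = "\\\\" + host
    return (["sc", target, "stop", service],
            ["sc", target, "start", service])
```

Each name from `read_agent_names` can then be fed to `service_commands` and `remote_state_dir` in a loop, with the folder cleanup happening between the stop and the start.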
An hour later everything was green/red again.
So if you ever delete that folder, or it becomes corrupted, this is a way to try to get it fixed.