SCOM troubleshoot cross platform agent discovery and installation – Part 1

Last week I was working with a customer who was doing a proof of concept for cross platform monitoring and some other SCOM functions. They were using an existing SCOM test environment and wanted to add a few machines for cross platform monitoring. However, these machines were not in the same network as the SCOM machines. Because of this setup we already suspected that we might be testing networking more than SCOM, and this turned out to be the case. We got some strange errors that we had not seen before. I will try to cover some of the things we found. The story is not told in strict chronological order. Also, as it turned out to be a long story, I will split it into three parts.
So from experience we know a few things are important when looking at cross platform monitoring:

  • DNS name resolving, both directions
  • Certificates (mostly in relationship to DNS)
  • Accounts with rights on the cross platform system; make sure they can log on and do the same things the discovery and monitoring wizards do
  • Use the latest SCOM cross plat CU update and management packs, and make sure you are using the right installers when manually deploying agents
  • Make sure SSH and SFTP work
  • Pre-requisite software on cross platform machines
  • Define RunAs Accounts and place them in the RunAs Profiles

So the first thing we had to get right was networking. In our case the SCOM test environment was separated from other networks by several firewall/routing devices, and the cross platform machines we were to get access to were located in networks at least a few hops away. So routing and firewall ports were important. We were promised a few Red Hat machines, an AIX box and an HP-UX machine, all located in different networks. This actually reflected reality for this company, as it is a service provider monitoring several customers without direct network contact or trusts.
First of all, of course, make sure your routing works the right way. Second, firewall ports need to be opened between the machines. In our case TCP 1270 and TCP 22 had to be opened. This was difficult, as opening port 22 was initially refused by the security team.
From the SCOM management server(s) we tried to do a telnet to both ports to check if they were accessible. In Windows 2008 you would first need to install the feature “Telnet Client” if you want to use telnet to troubleshoot connections. We will go into further connection testing later in this article.
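If no telnet client is at hand, the same reachability test can be scripted. The following is only a minimal sketch in Python 3, assuming a Python interpreter is available on the machine you test from; the function names and the host name in the usage note are made up for illustration. It does nothing more than attempt a TCP connection, which is roughly what the empty telnet screen tells you.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, time-outs and name resolution failures
        return False


def check_agent_ports(host, ports=(22, 1270), timeout=5.0):
    """Check the SSH port and the agent port; return {port: True/False}."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

Calling check_agent_ports("redhat01.example.com") (a placeholder host name) returns a dictionary with one True or False per port. A False only says the connection did not succeed; it does not tell you whether a firewall, routing or the service itself is to blame.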
Because it took some time to get port 22 open, we started on one of the Red Hat machines with a manual install of the cross platform agent for Red Hat (check that you are using the right one: OS version, agent version and architecture). At first the wrong version of the agent installer file was used, because the latest SCOM cross plat update (CU2) was not yet installed on the SCOM management servers. After installing that update on the SCOM servers we could pick up the newer version of the agent (258 in that case) and move it to the Red Hat machine. Always fun if there is no direct connectivity and port 22 is still closed.
For the purposes of the POC we requested two accounts to be set up on all cross plat machines:

  • Scoma -> privileged account
  • Scomv -> normal account

We also asked for the same password to be used for both accounts on all of the test machines. In normal circumstances this will probably not be the case.
On the SCOM side of things make sure you define these accounts as RunAs Accounts. Their type is “Basic Authentication”. For Distribution choose More Secure and enter the SCOM Management Servers that need to talk to the cross platform agents.
Next, in the RunAs Profiles you can find the Unix Action Account and the Unix Privileged Account, where you link the previously defined accounts to the target objects you want to use them for. In this case, as all targets use the same accounts, we could just leave the default of “All targeted Objects”.
So, we were ready to start with the manual installation on the Red Hat box. The system admin installed the agent again, this time with the latest version of the rpm, and checked that the new service was running. The next thing was to counter sign the certificate. As we did not have SSH opened on the firewalls yet, we opted to sign the certificate manually. This procedure is in the documentation and was also discussed before on this blog https://blog.topqore.com/2009/09/30/scom-agent-on-sun-solaris . Counter signing was easy; the certificate file was brought back to the Red Hat machine, where it replaced the existing self-signed certificate, and the agent was restarted. If the installation of the agent does not work, please re-check the prerequisites http://technet.microsoft.com/en-us/library/dd789030.aspx . Also check whether Linux Standard Base is installed; see the post from David Allen here http://wmug.co.uk/blogs/aquilaweb/archive/2009/09/02/more-opsmgr-x-plat-notes.aspx . We got an error relating to this on one of the machines as well. As it stated something about a directory or file not found with /lsb/ in the path, I remembered David’s post and we fixed that one.
So at this point we did a telnet to 1270 from the SCOM server to the Red Hat machine. This worked (we got an empty screen, so good enough as an answer in this case).
Name resolution is also an important point with cross platform monitoring. There are a few reasons that might be obvious, but one of the important ones is that the certificate and certificate signing (in combination with the discovery wizard) use the fully qualified domain name! In this case we had to manually point the machines towards each other in the hosts file (Windows) and the /etc/hosts file (Linux).
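To see quickly whether a name resolves in both directions, something like the following Python 3 sketch could be run on either side. It is an illustration only: the function name and the shape of the result are my own, and it uses whatever resolver the operating system uses (so hosts file entries are picked up as well). It resolves the fully qualified name to an address and then checks that the address resolves back to that same name.

```python
import socket


def check_name_resolution(fqdn, forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve fqdn to an IPv4 address and back, and compare the names.

    forward and reverse default to the system resolver; they are parameters
    only so the check can be tried out without real DNS.
    """
    result = {"fqdn": fqdn, "address": None,
              "reverse_name": None, "consistent": False}
    try:
        result["address"] = forward(fqdn)
        name, aliases, _addresses = reverse(result["address"])
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError
        return result
    result["reverse_name"] = name
    # Compare case-insensitively and ignore a trailing dot
    candidates = {n.lower().rstrip(".") for n in [name, *aliases]}
    result["consistent"] = fqdn.lower().rstrip(".") in candidates
    return result
```

If "consistent" comes back False while "address" is filled in, the reverse lookup returned a different name (often just the short host name), which is the kind of mismatch that tends to show up later as certificate trouble.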
So now we could run the Discovery wizard. While running this we tried several options (SSH discovery works only when SSH is enabled on the firewalls :)) and only got error messages back.
Right after this we got access to the SSH port to the Red Hat boxes.
So we ran the discovery wizard again.
This time we got a bit further and the wizard told us that it needed to Install and Discover the agent (and that it found it to be a Red Hat 4 machine, which was correct). This was a bit of a red flag, as the agent was already installed and the certificate was already counter signed as well! We will come back to this error in a minute.
But since we got this option we thought: just install the agent through the wizard and see what happens. We got errors again, and these led to Access Denied. So we checked again on the Red Hat box and sure enough, although the scoma account was privileged, it had trouble doing a “su - root”. To get this working the scoma account (privileged) had to be added to the wheel group in order to get admin rights. This enabled the account to use sudo. We also had to run “chmod +s /bin/su” to make sure the users in the wheel group can execute su (look at this page for further information on setting permissions on files http://www.comptechdoc.org/os/linux/usersguide/linux_ugfilesp.html ).
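The two things we changed here can be verified afterwards with a few lines of Python 3 on the Linux side. This is just a sketch under the assumption that a Python 3 interpreter is present on the box; the function names are made up. It checks whether the setuid bit is set on a file and whether a user belongs to a group, either as a listed member or through the primary group.

```python
import grp
import os
import pwd
import stat


def has_setuid(path):
    """Return True if the setuid bit is set on the file at path."""
    return bool(os.stat(path).st_mode & stat.S_ISUID)


def user_in_group(user, group):
    """Return True if user is a listed member of group or has it as primary group."""
    try:
        group_entry = grp.getgrnam(group)
    except KeyError:
        return False  # the group does not exist
    if user in group_entry.gr_mem:
        return True
    try:
        # Primary group membership is not listed in gr_mem, so check the gid
        return pwd.getpwnam(user).pw_gid == group_entry.gr_gid
    except KeyError:
        return False  # the user does not exist
```

In our situation the calls would have been has_setuid("/bin/su") and user_in_group("scoma", "wheel"). Note that this only reports the state of the file and the group; whether wheel membership really grants su or sudo rights depends on how the machine is configured.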
Run the discovery wizard again! It found a Red Hat 4 box again and wanted to install the agent. Why didn’t it find the already installed agent? Try to press on anyway! After pressing the deploy button we saw in the status field that it went from Deploying (sending the rpm file to the machine) to Installing (running the rpm file) and then Validating (running checks and moving on to the certificate checking part). Somewhere in the Validating phase, or just before getting to the next step, we got an error:
WinRM cannot process the request because the input XML contains an invalid attribute or element name.
Searching for what that meant unfortunately did not turn up anything usable. On the forums I could not find anything pointing in that direction either. This brought us back to our initial thought -> we are testing networking here and not agent deployment. We just had to find out what was going on.
On one of the SCOM management servers we ran Windows Update and checked for anything that could help. Interesting -> PowerShell 2 with WinRM 2 was available as an update. So we installed that one on one of the machines and guess what? The discovery wizard gave another error -> simply that the Microsoft.Unix.DiscoveryScript.Discovery.Task went wrong. It seems we are getting less information now. But as the step took some time, it could also have been a time-out, as this task has a time-out of 20 seconds.
Now to part 2 of this post.
Bob Cornelissen