SCOM troubleshoot cross platform agent discovery and installation – Part 2

This is a continuation of the previous part 1 of this blog post.
So on to the troubleshooting! First enable the logging as discussed in a previous post: https://blog.topqore.com/2010/02/22/scom-discovery-wizard-error-while-deploying-redhat-agent by creating the EnableOpsmgrModuleLogging file in C:WindowsTemp and downloading DebugView (a tool from Sysinternals). Also check out this page for tooling to use for troubleshooting: http://blogs.msdn.com/b/scxplat/archive/2010/06/10/troubleshooting-cross-platform-discovery-and-agent-installation-part-1.aspx ; Rob Hearn does a very good job of explaining things as well here! Unfortunately the debugview and the additional logging did not always give enough information and we had to use some additional tricks and the logic provided in the next section. But if you run into any issues please use these first as they always show you what it is doing in the background so you can follow the steps and errors.
So at this point it is necessary to use a diagram in order to understand the discovery process better. I will use the diagram posted by my friend Robert Hearn on his blog http://blogs.msdn.com/b/scxplat/archive/2010/06/10/troubleshooting-cross-platform-discovery-and-agent-installation-part-1.aspx. I got permission to use it on this blog for explanations.

So moving through the flow chart:

Get list of supported agents and agent packages -> check; we have ran the cross plat CU2 and checked the files were there; the latest versions of management packs for the operating systems we wanted to monitor were imported.
IP DNS resolving –> check; forced through hosts files on both sides
Connect to CIM provider -> Aha, so there was our first red flag. Remember it found that the agent needed to be installed? Means this step had a problem. We will test this using the winrm command. But lets continue through the diagram the way we were following the procedure until now.
Next it will try to push the getosversion.sh file to the other server and run it. If this is going wrong you would errors as mentioned in a previous post (for instance if sftp is not enabled in the ssh config) https://blog.topqore.com/2010/02/22/scom-discovery-wizard-error-while-deploying-redhat-agent . If this has ran successfully you would see the version of the discovered operating system mentioned in the discovery wizard and if it knows it has an appropriate agent installer for it, it will tell you it can Install and discover for you. We had this message.
The rpm file will be deployed and installed and validated. So at the end of this step we seemed to get an error. Hard to say if it is the end of this step or the beginning of the next step of course, but we expected it to be the end of this step. So what does it do to validate the installation? Perhaps it will run a query? If so this is an indication and we found it was possibly linked to the connect to CIM provider step before that went wrong as well.
After this step it will try to check for a certificate and if it is signed and correct names are used and if it is not -> try to sign it. Well we did not get to that stage yet. We ran into this one as well later. Hold on, we will get to that. We did get the error that the name on the certificate does not match the name of the machine. FQDN name is important here as we will see later.
Next it would try to establish the agent version installed on the remote machine. If it is OK the discovery wizard will accept the agent.

Alright, lets get back to the connect to CIM provider step. There is a way to use winrm to launch a query to the remote machine and check what the answer is.
So I went out and found some examples of queries to run. This one asks for the operatingsystem version and sends the answer back in XML format (remember the elevated command prompt in Windows 2008):

winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_OperatingSystem?__cimnamespace=root/scx -username:scoma -password:password -remote:https://rhel4test.domain.local:1270/wsman -auth:basic -skipCACheck -encoding:utf-8 -format:#pretty

So what this seems to do is connect to schemas.microsoft.com and get the definition of a query. It executes this against the red hat machine across a secured connection (https) on port 1270 and uses the username and password combination of the scoma account in this case. It also skips the certificate check (as we know we do not have a counter-signed and trusted certificate yet) and we force it into UTF8 encoding and pretty xml formatting (better for us humans to read).
At first we had some problems with this command and almost gave up. It was talking about the format of the winrm command and that it was not correct. In the end we decided to manually type the command in stead of copy/paste and that worked! Now we got the following error:


Unable to parse XML: Required white space was missing.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Error Message</TITLE>
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
<STYLE id=L_default_1>A {
        FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #005a80; FONT-FAMILY: tahoma
}
A:hover {
        FONT-WEIGHT: bold; FONT-SIZE: 10pt; COLOR: #0d3372; FONT-FAMILY: tahoma
}
TD {
        FONT-SIZE: 8pt; FONT-FAMILY: tahoma
}
TD.titleBorder {
        BORDER-RIGHT: #955319 1px solid; BORDER-TOP: #955319 1px solid; PADDING-LEFT: 8px; FONT-WEIGHT: bold; FONT-SIZE: 12pt; VERTICAL-ALIGN: middle; BORDER-LEFT: #955319 0px solid; COLOR: #955319; BORDER-BOTTOM: #955319 1px solid; FONT-FAMILY:
Error number:  -2144108173 0x80338173
The WinRM client cannot process the request because it received an HTML error packet.

So the start of the error is what we already got before in the validating step of the agent install! Unable to parse XML
First I did not see it, but when looking more closely this is just an HTML error page with formatting. So we were running a command that seems to connect to two locations (Microsoft and the remote agent). One of both was going wrong.
So we go to the Linux box and try to check what is talking to port 1270 on that box. In our case go into privileged mode and use “tcpdump port 1270” and see what happens.
Ran the command again -> nothing in the tcpdump!
So it is not talking to the Microsoft site? Strange, we can surf in Internet explorer when using proxy settings. Wait a second – the command prompt might not use these IE settings. Alright, lets pick up these settings with proxcfg. Hmmm, doesn’t know that command. Ahh right, Windows 2008 box -> “NetSH WinHTTP import Proxy ie” (for more info, see Rob’s blog post at http://wmug.co.uk/blogs/r0b/archive/2010/01/08/proxycfg-on-vista-and-win2008.aspx ).
Try again -> failed again.
Continue on to part 3 of this blog post.
Bob Cornelissen