Monitoring a distributed network with Minions makes your monitoring system more complex. This section helps you troubleshoot the OpenNMS components themselves. OpenNMS ships with the Apache Karaf OSGi runtime and has a few built-in commands that help with troubleshooting. To make things easier, I use the term core for the Horizon/Meridian server instance and Minion for the Minion.
The first thing you want to know is whether your core instance can communicate with a Minion and vice versa.
We start with the Minion. Replace
minion-host with the Minion’s IP address or FQDN.
Connect to the Minion’s Karaf Shell
ssh -p 8201 admin@minion-host
Run the health check command
Depending on your configuration, you should see the following output
Verifying the health of the container
Connecting to OpenNMS ReST API [ Success ]
Verifying installed bundles [ Success ]
Connecting to JMS Broker [ Success ]
Verifying Listener Single-Port-Flow-Listener (org.opennms.netmgt.telemetry.listeners.UdpListener) [ Success ]
=> Everything is awesome
The health check verifies that
- installed Karaf bundles and features can be started
- the Minion can connect to the configured message broker
- the Minion can connect to the REST endpoint of the Core server
- flow listener configurations work (you see this only when you have a flow listener configured)
If you have issues, check the
data/log/karaf.log file in your Minion directory or run the command
log:tail in a second Karaf shell.
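If you save the health check output to a file or capture it in a script, a small parser can point out which checks did not succeed. The following is a minimal sketch, not an OpenNMS tool: the function names are made up for illustration, and it assumes each check is printed on its own line in the form "description [ Status ]" as in the sample output above.

```python
import re

# Hypothetical helper, not part of OpenNMS. It assumes one check per line,
# formatted as "<description> [ <Status> ]".
LINE_PATTERN = re.compile(r"^(?P<name>.+?)\s*\[\s*(?P<status>[A-Za-z]+)\s*\]\s*$")


def parse_health_check(output):
    """Map each check description to the status reported for it."""
    results = {}
    for line in output.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            results[match.group("name")] = match.group("status")
    return results


def failed_checks(output):
    """Return the descriptions of all checks whose status is not 'Success'."""
    return [name for name, status in parse_health_check(output).items()
            if status != "Success"]


SAMPLE_OUTPUT = """Verifying the health of the container
Connecting to OpenNMS ReST API [ Success ]
Verifying installed bundles [ Success ]
Connecting to JMS Broker [ Success ]
=> Everything is awesome
"""

print(failed_checks(SAMPLE_OUTPUT))  # prints []
```

Lines without a bracketed status, such as the header and the closing summary, are ignored. Any status other than Success is treated as a problem, so the sketch does not need to know the exact wording of failure states.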
Remote Procedure Calls (RPC) are messages sent by the Core server instance to the Minion. RPC uses the message broker as a channel. If a node is associated with a location, the Core server instance sends RPC messages to the corresponding Minion in that remote location to execute the monitoring tests, e.g., run the ICMP poller on IP interface w.x.y.z.
You can use the
opennms:stress-rpc command on the Core server instance to test end to end whether the RPC path to a Minion works as expected.
As the picture above shows, this test treats the message broker as a black box.
ssh -p 8101 admin@core-host-ip
opennms:stress-rpc -c 5 -l minion-location
The output looks like this:
Executing 5 requests.
Waiting for responses.
Done!

6/29/21, 5:17:24 PM ============================================================

-- Counters --------------------------------------------------------------------
failures
  count = 0
successes
  count = 5

-- Histograms ------------------------------------------------------------------
response-times
  count = 5
  min = 24
  max = 47
  mean = 36.80
  stddev = 8.93
  median = 37.00
  75% <= 46.00
  95% <= 47.00
  98% <= 47.00
  99% <= 47.00
  99.9% <= 47.00

Total milliseconds elapsed: 52
Milliseconds spent generating requests: 3
Milliseconds spent waiting for responses: 49
The important values here are the
failures/successes count. You should see 5 successes and no failures.
failures
  count = 0
successes
  count = 5
Another important metric is
Milliseconds spent waiting for responses. It should be in the range of milliseconds, as in the example above (49 ms). If you get failures, the Core instance got no response from the Minion. By default, the Core instance waits 20 seconds for a response before it gives up.
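To check these values in a script instead of by eye, you could extract them from the captured command output. This is only a sketch under assumptions: the function names are hypothetical, and the patterns are written against the labels visible in the sample output above (failures, successes, and Milliseconds spent waiting for responses).

```python
import re


def parse_stress_rpc(output):
    """Extract the failure and success counters and the wait time in ms.

    Hypothetical helper: the patterns match the labels from the sample
    output and tolerate the counters being split across lines.
    """
    def first_int(pattern):
        match = re.search(pattern, output)
        return int(match.group(1)) if match else None

    return {
        "failures": first_int(r"failures\s+count\s*=\s*(\d+)"),
        "successes": first_int(r"successes\s+count\s*=\s*(\d+)"),
        "waiting_ms": first_int(r"Milliseconds spent waiting for responses:\s*(\d+)"),
    }


def rpc_path_ok(output, expected_requests):
    """True if every request succeeded and none failed."""
    stats = parse_stress_rpc(output)
    return stats["failures"] == 0 and stats["successes"] == expected_requests


SAMPLE_OUTPUT = """-- Counters --
failures
  count = 0
successes
  count = 5
Total milliseconds elapsed: 52
Milliseconds spent generating requests: 3
Milliseconds spent waiting for responses: 49
"""

print(parse_stress_rpc(SAMPLE_OUTPUT))
print(rpc_path_ok(SAMPLE_OUTPUT, expected_requests=5))
```

The value passed as expected_requests should match the number you gave with -c. If a counter is missing from the output, the helper returns None for it and rpc_path_ok reports False.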
What if you have multiple Minions in a location?
You can run the RPC test against a specific Minion when you additionally provide the system ID (aka Minion ID). If you haven’t set it manually, it is a generated UUID. In my example, I’ve set it to a human-readable unique ID.
The following command runs the RPC ping only against the Minion in
minion-location with ID minion-01:
opennms:stress-rpc -s minion-01 -c 5 -l minion-location
If you want to quickly check whether a Minion can ping a device in a remote network, you can do so without logging in to the Minion’s Karaf shell via SSH in the remote network. Instead, run the ping from the Core instance. Connect to the Karaf shell of the Core server:
ssh -p 8101 admin@core-host-ip
Run the ping command. The
-s minion-01 option is optional. If you have more than one Minion in a location, it lets you choose which one executes the ICMP ping for you:
opennms:ping -s minion-01 -l minion-location www.google.com
The Minion and message broker are treated as a black box in this test scenario. The ICMP ping is executed from the Minion to the FQDN/IP target, and the result is shipped over the message broker back to the core instance.
You can test and troubleshoot DNS configurations by executing arbitrary lookups remotely. In this example, we run a DNS lookup on a specific Minion to resolve the FQDN www.google.com. The Minion uses the DNS configuration of its underlying operating system for the lookup:
opennms:dns-lookup -s minion-02 -l minion-location www.google.com
www.google.com resolves to: 18.104.22.168
The same also works for reverse lookups:
opennms:dns-reverse-lookup -s minion-02 -l minion-location 192.168.178.40
192.168.178.40 resolves to: ip4.wlp2s0.scummbar.labmonkeys.tech.
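If you collect lookup results from several Minions, it can help to split each result into the query and the answer so you can compare them. A minimal sketch, assuming the result has the "query resolves to: answer" shape shown above; the function name is hypothetical.

```python
def parse_resolves_to(line):
    """Split '<query> resolves to: <answer>' into a (query, answer) tuple.

    Hypothetical helper: it relies only on the literal ' resolves to: '
    separator seen in the example output.
    """
    query, separator, answer = line.partition(" resolves to: ")
    if not separator:
        raise ValueError("not a lookup result: " + repr(line))
    return query.strip(), answer.strip()


result = parse_resolves_to(
    "192.168.178.40 resolves to: ip4.wlp2s0.scummbar.labmonkeys.tech."
)
print(result)
```

With the parsed tuples you can, for example, check that two Minions in the same location return the same answer for the same query.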
This section describes how to run SNMP commands from the Core server to a device behind a Minion in a remote network. It is equivalent to executing the
snmpwalk command from the Minion to your SNMP agent on your device.
Connect to the Minion’s Karaf Shell
ssh -p 8201 admin@minion-host-ip
Execute an SNMP walk against a device in the remote location
snmp:walk -l MyLocation IpAddressInMyLocation 22.214.171.124.4.1
This command helps you verify a) whether the SNMP community configuration for a given host in a remote location is correct, b) whether the Minion can reach the device in its remote location, and c) whether RPC calls can be executed from the Core server to the Minion.
You can run an ad-hoc test from the Karaf CLI for every monitor that comes with your OpenNMS core server. It shows you exactly the same result that Pollerd would get when it polls a service to test its availability. This example uses the IcmpMonitor to ping a device in a remote location through a Minion.
Run an ICMP monitor through a Minion
opennms:poll -l MyLocation -t Time-To-Live-in-ms org.opennms.netmgt.poller.monitors.IcmpMonitor myIpAddress
The Time To Live (TTL) relates only to messages in the ActiveMQ communication.
When a poll is triggered manually through the Karaf CLI, the message TTL in ActiveMQ should be at least the number of retries x timeout in ms, e.g., 3 x 2000 ms = 6000 ms.
If you do not set a TTL, the configured polling interval is used, which is 5 minutes (300000 ms) by default.
You can get a list of all available monitors with:
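The TTL rule of thumb above is simple arithmetic, so it is easy to compute before you type the command. The sketch below uses hypothetical helper names and only assembles the command string in the form shown earlier; it does not run anything, and 192.0.2.10 is a placeholder address.

```python
def minimum_ttl_ms(retries, timeout_ms):
    """Smallest TTL that covers all attempts: number of retries x timeout."""
    return retries * timeout_ms


def build_poll_command(location, monitor_class, host, retries, timeout_ms):
    """Assemble an opennms:poll command line with a TTL that covers all attempts.

    Hypothetical helper: it only formats the text of the command.
    """
    ttl = minimum_ttl_ms(retries, timeout_ms)
    return f"opennms:poll -l {location} -t {ttl} {monitor_class} {host}"


print(minimum_ttl_ms(3, 2000))  # prints 6000
print(build_poll_command(
    "MyLocation",
    "org.opennms.netmgt.poller.monitors.IcmpMonitor",
    "192.0.2.10",  # placeholder address, replace with your device
    retries=3,
    timeout_ms=2000,
))
```

Paste the printed command into the Karaf shell of the Core server. If your monitor uses a different retry count or timeout, adjust the two arguments so the TTL still covers all attempts.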
Find more useful Karaf commands in our Karaf CLI Cheat sheet.