New minion in unknown status; others fine

Problem:
In a deployment with 7 working minions, an 8th (Debian 10) was installed. The status of the new minion stays unknown. I’ve tried removing it from the core via the trash button in Admin > Manage Minions and restarting the minion, without luck. I also did a full remove and purge of the minion package before reinstalling it.

The minion appears to start up initially without an issue and sends heartbeats. After a brief time, a Blueprint Extender ERROR is logged (below). From the core I’m able to run a successful RPC test, and from the minion a health check shows “Everything is awesome”.

I’m a little stumped as to whether the Blueprint ERROR is fatal or contributing to this issue, and generally how to troubleshoot from here. Thanks!
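For reference, the checks mentioned above were run from the Karaf shells on each side, roughly as follows (default SSH ports and the admin account assumed; I’m reproducing the core-side RPC test command and its flags from memory, so please correct me if I’ve misremembered them):

# on the minion (Karaf shell)
ssh -p 8201 admin@localhost
opennms:health-check

# on the core (Karaf shell)
ssh -p 8101 admin@localhost
opennms:stress-rpc -l datacenter

The output of that core-side test is what’s pasted under “Other relevant data” below.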

Expected outcome:
Have the new minion establish an up and configurable session with the core.

OpenNMS version:

core: 27.2.0
minion: 27.2.0
karaf: 4.2.6

Other relevant data:

core:

-- Counters --------------------------------------------------------------------
failures
             count = 0
successes
             count = 5

-- Histograms ------------------------------------------------------------------
response-times
             count = 5
               min = 275
               max = 345
              mean = 303.45
            stddev = 33.37
            median = 277.00
              75% <= 343.00
              95% <= 345.00
              98% <= 345.00
              99% <= 345.00
            99.9% <= 345.00


Total miliseconds elapsed: 346
Miliseconds spent generating requests: 1
Miliseconds spent waiting for responses: 345

minion:

23:42:53.198 INFO [ActiveMQ Task-1] Successfully connected to tcp://[core]:61616

2021-11-09T23:47:49,167 | ERROR | Blueprint Extender: 1 | BlueprintContainerImpl           | 18 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle io.hawt.hawtio-karaf-terminal/2.0.0 due to unresolved dependencies [(objectClass=org.apache.felix.service.threadio.ThreadIO)]
java.util.concurrent.TimeoutException: null
	at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [18:org.apache.aries.blueprint.core:1.10.2]
	at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [18:org.apache.aries.blueprint.core:1.10.2]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]

2021-11-10T09:34:53,895 | INFO  | Timer-2          | HeartbeatProducer                | 316 - org.opennms.features.minion.heartbeat.producer - 27.2.0 | Sending heartbeat to Minion with id: minion at location: datacenter

Hi @ravinald, and welcome to the OpenNMS community :slight_smile: I’m a technical product manager with OpenNMS, so I’m not someone who troubleshoots Minion every day, but I have seen my share of “stuck” Minions over the years.

First, I’m pretty confident that the error log you posted isn’t the proximate cause of the issue you’re fighting. The hawtio bundles are not on the critical path for startup. This message could still be a valuable clue in the context of a full set of logs, but the messages on either side of it look totally normal.

Second, I see you’re using ActiveMQ, which is a good-enough message broker but can be a bit quirky. I’m curious whether the other seven Minions are at the same datacenter location as the new, stuck Minion. Knowing this answer might be helpful in eliminating possible explanations for the fault. If that seems a strange question to ask, I’ll say for now that the location name becomes part of the AMQ queue used for each Minion’s communication, and sometimes those queues can get jammed up, leading to symptoms consistent with what you’re reporting.
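Also, if it’s easier than checking each Minion’s config by hand, the core keeps a record of every Minion and its location in the monitoringsystems table of the PostgreSQL database. A quick query along these lines should list them all at a glance (database name, user, and the exact column names are from memory and can vary a little between versions, so treat this as a sketch):

psql -U opennms -d opennms -c "SELECT id, label, location, type FROM monitoringsystems ORDER BY location;"

Minion entries show up with type = 'Minion', while the core itself appears with type = 'OpenNMS'.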

Finally, I have to note that you’re running an obsolete version of Horizon: the 29.0.0 release just dropped, so you’re now two major versions behind, and we’ve fixed quite a few bugs in Horizon 28 and throughout the 28.x series. I don’t think the problem you’re facing is among them, and you probably shouldn’t drop everything right now to upgrade to 29, but it should be on your list.

Anyway, if you can indulge my curiosity about the location of the other seven Minions, it will help narrow the list of possible causes.

Hi, and thanks for the welcome. I didn’t think the error was in a critical path, but it was the only thing that seemed interesting at either end so I wanted to share it.

All the other minions are spread all over the place. There is one other that is in the same physical data center, but in a different environment. I’m not sure if that’s what you’re asking, or if it’s something else about its place in the infrastructure.

I’m painfully aware of the age of the running instance and it is on my list of things to address along with transitioning to Kafka.

So I started renaming things, and I eventually saw ‘foo’ and ‘minion-foo’ as valid nodes, but the status was still unknown. I reconfigured everything on the minion to be ‘foo’, stopped the service, and then deleted all traces of it from the core: the node, its data, and the entry in Manage Minions. After restarting the minion, the same thing happened: it was seen in an unknown state. As a last effort I renamed it to ‘test’, and strangely enough it has come up OK in the minion status, and I also have nodes ‘foo’ and ‘minion-foo’.

What is even stranger to me is that the node IDs are the following:

9199 - foo
9199 - test
9200 - minion-foo

Note that ‘test’ was named and started after ‘minion-foo’ and ‘foo’, but all three only showed up after I renamed it to ‘test’.

So I guess things are working, but it would be great to know whether or not something is systemically broken. In the meantime, is it trivial to rename it from ‘test’ to ‘foo’ by just editing the config, or is there a better way?

Thanks!

I’m happy to hear that it’s working now! It’s hard to say whether there’s systematic breakage without doing further troubleshooting.

Node labels are just labels; changing them does not fundamentally affect the system’s operation. It’s a trivial matter to change the node label of a Minion. You can do this by editing the Minions requisition using the web UI’s requisition editor (Cogs → Manage Provisioning Requisitions). Edit the node(s) in question, changing the “Node Label” to whatever you like. Do not change the “Minion Location” value, though. Save your work, synchronize the Minions requisition, and the change will be reflected. I’m not sure what you mean by “editing the config” to do this, but if it’s anything other than the steps I just outlined, then I wouldn’t recommend it.
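If you prefer to script the last step, the synchronize can also be triggered through the REST API after you’ve saved your edits in the UI. Something like this (hostname and credentials are placeholders for your environment; “Minions” is the default name of the requisition that holds Minion nodes):

curl -u admin -X PUT "http://core.example.org:8980/opennms/rest/requisitions/Minions/import"

curl will prompt for the password, and the import runs asynchronously, so give it a moment before checking the node.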

When I asked about “location”, I meant the monitoring location configured for the Minions. It’s the value for location in the file MINION_HOME/etc/org.opennms.minion.controller.cfg, and should also be shown alongside a map-pin in the Minion node’s node detail page in the web UI.
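For reference, that file is just a small properties file; on a typical ActiveMQ-based Minion it looks something like this (the values here are illustrative placeholders, not anything from your environment):

cat ${MINION_HOME}/etc/org.opennms.minion.controller.cfg
location = Datacenter-East
id = minion-01
http-url = http://core.example.org:8980/opennms
broker-url = failover:tcp://core.example.org:61616

The id is the Minion’s identity (what you see under Manage Minions), and location is the monitoring location I was asking about.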

This is the first minion at this configured location. Sorry about that; it can be difficult to keep track of what terms are used where.

By rename I was referring to the id in org.opennms.minion.controller.cfg. I was only able to get it to work after I named it ‘test’. Are there any tools/scripts that can be run against the core database to verify everything is good there? I ask because when I set id = test the minion came up, but so did the two others that I noted. Looking at the node detail page, JMX-Minion and Minion-Heartbeat are up 100%, but Minion-RPC is down. It is almost as if there is some artifact of this minion that won’t go away.

I had incorrectly set the location, so I deleted the minion and corrected the location, which appears to have triggered all of this.